
Probability, Mathematical Statistics, and Stochastic Processes

Kyle Siegrist
University of Alabama in Huntsville

Book: Probability, Mathematical Statistics, and Stochastic Processes
This text is disseminated via the Open Education Resource (OER) LibreTexts Project (https://LibreTexts.org) and like the hundreds of other texts available within this powerful platform, it is freely available for reading, printing and "consuming." Most, but not all, pages in the library have licenses that may allow individuals to make changes, save, and print this book. Carefully consult the applicable license(s) before pursuing such actions.
Instructors can adopt existing LibreTexts texts or Remix them to quickly build course-specific resources to meet the needs of their students. Unlike traditional textbooks, LibreTexts' web-based origins allow powerful integration of advanced features and new technologies to support learning.

The LibreTexts mission is to unite students, faculty and scholars in a cooperative effort to develop an easy-to-use online platform
for the construction, customization, and dissemination of OER content to reduce the burdens of unreasonable textbook costs to our
students and society. The LibreTexts project is a multi-institutional collaborative venture to develop the next generation of open-
access texts to improve postsecondary education at all levels of higher learning by developing an Open Access Resource
environment. The project currently consists of 14 independently operating and interconnected libraries that are constantly being
optimized by students, faculty, and outside experts to supplant conventional paper-based books. These free textbook alternatives are
organized within a central environment that is both vertically (from advanced to basic level) and horizontally (across different fields)
integrated.
The LibreTexts libraries are Powered by MindTouch® and are supported by the Department of Education Open Textbook Pilot
Project, the UC Davis Office of the Provost, the UC Davis Library, the California State University Affordable Learning Solutions
Program, and Merlot. This material is based upon work supported by the National Science Foundation under Grant Nos. 1246120,
1525057, and 1413739. Unless otherwise noted, LibreTexts content is licensed by CC BY-NC-SA 3.0.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not
necessarily reflect the views of the National Science Foundation nor the US Department of Education.
Have questions or comments? For information about adoptions or adaptations contact info@LibreTexts.org. More information on our
activities can be found via Facebook (https://facebook.com/Libretexts), Twitter (https://twitter.com/libretexts), or our blog
(http://Blog.Libretexts.org).

This text was compiled on 11/24/2022


Introduction

TABLE OF CONTENTS
Introduction
Licensing
Object Library
Credits
Sources and Resources

1: Foundations
1.1: Sets
1.2: Functions
1.3: Relations
1.4: Partial Orders
1.5: Equivalence Relations
1.6: Cardinality
1.7: Counting Measure
1.8: Combinatorial Structures
1.9: Topological Spaces
1.10: Metric Spaces
1.11: Measurable Spaces
1.12: Special Set Structures

2: Probability Spaces
2.1: Random Experiments
2.2: Events and Random Variables
2.3: Probability Measures
2.4: Conditional Probability
2.5: Independence
2.6: Convergence
2.7: Measure Spaces
2.8: Existence and Uniqueness
2.9: Probability Spaces Revisited
2.10: Stochastic Processes
2.11: Filtrations and Stopping Times

3: Distributions
3.1: Discrete Distributions
3.2: Continuous Distributions
3.3: Mixed Distributions
3.4: Joint Distributions
3.5: Conditional Distributions
3.6: Distribution and Quantile Functions
3.7: Transformations of Random Variables
3.8: Convergence in Distribution
3.9: General Distribution Functions

3.10: The Integral With Respect to a Measure
3.11: Properties of the Integral
3.12: General Measures
3.13: Absolute Continuity and Density Functions
3.14: Function Spaces

4: Expected Value
4.1: Definitions and Basic Properties
4.2: Additional Properties
4.3: Variance
4.4: Skewness and Kurtosis
4.5: Covariance and Correlation
4.6: Generating Functions
4.7: Conditional Expected Value
4.8: Expected Value and Covariance Matrices
4.9: Expected Value as an Integral
4.10: Conditional Expected Value Revisited
4.11: Vector Spaces of Random Variables
4.12: Uniformly Integrable Variables
4.13: Kernels and Operators

5: Special Distributions
5.1: Location-Scale Families
5.2: General Exponential Families
5.3: Stable Distributions
5.4: Infinitely Divisible Distributions
5.5: Power Series Distributions
5.6: The Normal Distribution
5.7: The Multivariate Normal Distribution
5.8: The Gamma Distribution
5.9: Chi-Square and Related Distributions
5.10: The Student t Distribution
5.11: The F Distribution
5.12: The Lognormal Distribution
5.13: The Folded Normal Distribution
5.14: The Rayleigh Distribution
5.15: The Maxwell Distribution
5.16: The Lévy Distribution
5.17: The Beta Distribution
5.18: The Beta Prime Distribution
5.19: The Arcsine Distribution
5.20: General Uniform Distributions
5.21: The Uniform Distribution on an Interval
5.22: Discrete Uniform Distributions
5.23: The Semicircle Distribution
5.24: The Triangle Distribution
5.25: The Irwin-Hall Distribution
5.26: The U-Power Distribution
5.27: The Sine Distribution
5.28: The Laplace Distribution
5.29: The Logistic Distribution

5.30: The Extreme Value Distribution
5.31: The Hyperbolic Secant Distribution
5.32: The Cauchy Distribution
5.33: The Exponential-Logarithmic Distribution
5.34: The Gompertz Distribution
5.35: The Log-Logistic Distribution
5.36: The Pareto Distribution
5.37: The Wald Distribution
5.38: The Weibull Distribution
5.39: Benford's Law
5.40: The Zeta Distribution
5.41: The Logarithmic Series Distribution

6: Random Samples
6.1: Introduction
6.2: The Sample Mean
6.3: The Law of Large Numbers
6.4: The Central Limit Theorem
6.5: The Sample Variance
6.6: Order Statistics
6.7: Sample Correlation and Regression
6.8: Special Properties of Normal Samples

7: Point Estimation
7.1: Estimators
7.2: The Method of Moments
7.3: Maximum Likelihood
7.4: Bayesian Estimation
7.5: Best Unbiased Estimators
7.6: Sufficient, Complete and Ancillary Statistics

8: Set Estimation
8.1: Introduction to Set Estimation
8.2: Estimation in the Normal Model
8.3: Estimation in the Bernoulli Model
8.4: Estimation in the Two-Sample Normal Model
8.5: Bayesian Set Estimation

9: Hypothesis Testing
9.1: Introduction to Hypothesis Testing
9.2: Tests in the Normal Model
9.3: Tests in the Bernoulli Model
9.4: Tests in the Two-Sample Normal Model
9.5: Likelihood Ratio Tests
9.6: Chi-Square Tests

10: Geometric Models


10.1: Buffon's Problems
10.2: Bertrand's Paradox
10.3: Random Triangles

11: Bernoulli Trials
11.1: Introduction to Bernoulli Trials
11.2: The Binomial Distribution
11.3: The Geometric Distribution
11.4: The Negative Binomial Distribution
11.5: The Multinomial Distribution
11.6: The Simple Random Walk
11.7: The Beta-Bernoulli Process

12: Finite Sampling Models


12.1: Introduction to Finite Sampling Models
12.2: The Hypergeometric Distribution
12.3: The Multivariate Hypergeometric Distribution
12.4: Order Statistics
12.5: The Matching Problem
12.6: The Birthday Problem
12.7: The Coupon Collector Problem
12.8: Pólya's Urn Process
12.9: The Secretary Problem

13: Games of Chance


13.1: Introduction to Games of Chance
13.2: Poker
13.3: Simple Dice Games
13.4: Craps
13.5: Roulette
13.6: The Monty Hall Problem
13.7: Lotteries
13.8: The Red and Black Game
13.9: Timid Play
13.10: Bold Play
13.11: Optimal Strategies

14: The Poisson Process


14.1: Introduction to the Poisson Process
14.2: The Exponential Distribution
14.3: The Gamma Distribution
14.4: The Poisson Distribution
14.5: Thinning and Superposition
14.6: Non-homogeneous Poisson Processes
14.7: Compound Poisson Processes
14.8: Poisson Processes on General Spaces

15: Renewal Processes


15.1: Introduction
15.2: Renewal Equations
15.3: Renewal Limit Theorems
15.4: Delayed Renewal Processes
15.5: Alternating Renewal Processes
15.6: Renewal Reward Processes

16: Markov Processes
16.1: Introduction to Markov Processes
16.2: Potentials and Generators for General Markov Processes
16.3: Introduction to Discrete-Time Chains
16.4: Transience and Recurrence for Discrete-Time Chains
16.5: Periodicity of Discrete-Time Chains
16.6: Stationary and Limiting Distributions of Discrete-Time Chains
16.7: Time Reversal in Discrete-Time Chains
16.8: The Ehrenfest Chains
16.9: The Bernoulli-Laplace Chain
16.10: Discrete-Time Reliability Chains
16.11: Discrete-Time Branching Chain
16.12: Discrete-Time Queuing Chains
16.13: Discrete-Time Birth-Death Chains
16.14: Random Walks on Graphs
16.15: Introduction to Continuous-Time Markov Chains
16.16: Transition Matrices and Generators of Continuous-Time Chains
16.17: Potential Matrices
16.18: Stationary and Limiting Distributions of Continuous-Time Chains
16.19: Time Reversal in Continuous-Time Chains
16.20: Chains Subordinate to the Poisson Process
16.21: Continuous-Time Birth-Death Chains
16.22: Continuous-Time Queuing Chains
16.23: Continuous-Time Branching Chains

17: Martingales
17.1: Introduction to Martingales
17.2: Properties and Constructions
17.3: Stopping Times
17.4: Inequalities
17.5: Convergence
17.6: Backwards Martingales

18: Brownian Motion


18.1: Standard Brownian Motion
18.2: Brownian Motion with Drift and Scaling
18.3: The Brownian Bridge
18.4: Geometric Brownian Motion

Index

Glossary

Detailed Licensing

Licensing
A detailed breakdown of this resource's licensing can be found in Back Matter/Detailed Licensing.

Table of Contents
http://www.randomservices.org/random/index.html
Random is a website devoted to probability, mathematical statistics, and stochastic processes, and is intended for teachers and
students of these subjects. The site consists of an integrated set of components that includes expository text, interactive web apps,
data sets, biographical sketches, and an object library. Please read the Introduction for more information about the content,
structure, mathematical prerequisites, technologies, and organization of the project.

Object Library

Credits

Sources and Resources

CHAPTER OVERVIEW
1: Foundations
In this chapter we review several mathematical topics that form the foundation of probability and mathematical statistics. These
include the algebra of sets and functions, general relations with special emphasis on equivalence relations and partial orders,
counting measure, and some basic combinatorial structures such as permutations and combinations. We also discuss some advanced
topics from topology and measure theory. You may wish to review the topics in this chapter as the need arises.
1.1: Sets
1.2: Functions
1.3: Relations
1.4: Partial Orders
1.5: Equivalence Relations
1.6: Cardinality
1.7: Counting Measure
1.8: Combinatorial Structures
1.9: Topological Spaces
1.10: Metric Spaces
1.11: Measurable Spaces
1.12: Special Set Structures

This page titled 1: Foundations is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

1.1: Sets

Set theory is the foundation of probability and statistics, as it is for almost every branch of mathematics.

Sets and subsets


In this text, sets and their elements are primitive, self-evident concepts, an approach that is sometimes referred to as naive set
theory.

A set is simply a collection of objects; the objects are referred to as elements of the set. The statement that x is an element of
set S is written x ∈ S , and the negation that x is not an element of S is written as x ∉ S . By definition, a set is completely
determined by its elements; thus sets A and B are equal if they have the same elements:
A = B if and only if x ∈ A ⟺ x ∈ B (1.1.1)

Our next definition is the subset relation, another very basic concept.

If A and B are sets then A is a subset of B if every element of A is also an element of B :


A ⊆ B if and only if x ∈ A ⟹ x ∈ B (1.1.2)

Concepts in set theory are often illustrated with small, schematic sketches known as Venn diagrams, named for John Venn. The
Venn diagram in the picture below illustrates the subset relation.

Figure 1.1.1: A ⊆ B
As noted earlier, membership is a primitive, undefined concept in naive set theory. However, the following construction, known as
Russell's paradox, after the mathematician and philosopher Bertrand Russell, shows that we cannot be too cavalier in the
construction of sets.

Let R be the set of all sets A such that A ∉ A . Then R ∈ R if and only if R ∉ R .
Proof

Usually, the sets under discussion in a particular context are all subsets of a well-defined, specified set S , often called a universal
set. The use of a universal set prevents the type of problem that arises in Russell's paradox. That is, if S is a given set and p(x) is a
predicate on S (that is, a valid mathematical statement that is either true or false for each x ∈ S ), then {x ∈ S : p(x)} is a valid
subset of S. Defining a set in this way is known as predicate form. The other basic way to define a set is simply by listing its
elements; this method is known as list form.
In contrast to a universal set, the empty set, denoted ∅, is the set with no elements.

∅ ⊆ A for every set A.


Proof

One step up from the empty set is a set with just one element. Such a set is called a singleton set. The subset relation is a partial
order on the collection of subsets of S .

Suppose that A , B and C are subsets of a set S . Then


1. A ⊆ A (the reflexive property).
2. If A ⊆ B and B ⊆ A then A = B (the anti-symmetric property).
3. If A ⊆ B and B ⊆ C then A ⊆ C (the transitive property).

Here are a couple of variations on the subset relation.

Suppose that A and B are sets.


1. If A ⊆ B and A ≠ B , then A is a strict subset of B and we sometimes write A ⊂ B .
2. If ∅ ⊂ A ⊂ B , then A is called a proper subset of B .

The collection of all subsets of a given set frequently plays an important role, particularly when the given set is the universal set.

If S is a set, then the set of all subsets of S is known as the power set of S and is denoted P(S) .
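For a small finite set, the power set can be enumerated directly on a computer. The following Python sketch is an illustration (not part of the original text); the set {1, 2, 3} is an arbitrary choice.

    from itertools import chain, combinations

    def power_set(S):
        # All subsets of S: choose r elements for r = 0, 1, ..., |S|
        s = list(S)
        return [set(c) for c in chain.from_iterable(
            combinations(s, r) for r in range(len(s) + 1))]

    print(power_set({1, 2, 3}))   # 8 subsets, from set() (that is, ∅) to {1, 2, 3}

Note that the empty set prints as set() in Python, and that a set with 3 elements has 8 subsets.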

Special Sets
The following special sets are used throughout this text. Defining them will also give us practice using list and predicate form.

Special Sets
1. R denotes the set of real numbers and is the universal set for the other subsets in this list.
2. N = {0, 1, 2, …} is the set of natural numbers.
3. N+ = {1, 2, 3, …} is the set of positive integers.
4. Z = {…, −2, −1, 0, 1, 2, …} is the set of integers.
5. Q = {m/n : m ∈ Z and n ∈ N+} is the set of rational numbers.
6. A = {x ∈ R : p(x) = 0 for some polynomial p with integer coefficients} is the set of algebraic numbers.

Note that N+ ⊂ N ⊂ Z ⊂ Q ⊂ A ⊂ R. We will also occasionally need the set of complex numbers C = {x + iy : x, y ∈ R} where i is the imaginary unit. The following special rational numbers turn out to be useful for various constructions.

For n ∈ N, a rational number of the form j/2^n where j ∈ Z is odd is a dyadic rational (or binary rational) of rank n.

1. For n ∈ N, the set of dyadic rationals of rank n or less is D_n = {j/2^n : j ∈ Z}.
2. The set of all dyadic rationals is D = {j/2^n : j ∈ Z and n ∈ N}.

Note that D_0 = Z and D_n ⊂ D_{n+1} for n ∈ N, and of course, D ⊂ Q. We use the usual notation for intervals of real numbers, but
again the definitions provide practice with predicate notation.

Suppose that a, b ∈ R with a < b .


1. [a, b] = {x ∈ R : a ≤ x ≤ b} . This interval is closed.
2. (a, b) = {x ∈ R : a < x < b} . This interval is open.
3. [a, b) = {x ∈ R : a ≤ x < b} . This interval is closed-open.
4. (a, b] = {x ∈ R : a < x ≤ b} . This interval is open-closed.

The terms open and closed are actually topological concepts.


You may recall that x ∈ R is rational if and only if the decimal expansion of x either terminates or forms a repeating block. The
binary rationals have simple binary expansions (that is, expansions in the base 2 number system).

A number x ∈ R is a binary rational of rank n ∈ N+ if and only if the binary expansion of x is finite, with 1 in position n (after the separator).


Proof

Set Operations
We are now ready to review the basic operations of set theory. For the following definitions, suppose that A and B are subsets of a
universal set, which we will denote by S .

The union of A and B is the set obtained by combining the elements of A and B .
A ∪ B = {x ∈ S : x ∈ A or x ∈ B} (1.1.3)

The intersection of A and B is the set of elements common to both A and B :
A ∩ B = {x ∈ S : x ∈ A and x ∈ B} (1.1.4)

If A ∩ B = ∅ then A and B are disjoint.

So A and B are disjoint if the two sets have no elements in common.

The set difference of B and A is the set of elements that are in B but not in A :
B ∖ A = {x ∈ S : x ∈ B and x ∉ A} (1.1.5)

Sometimes (particularly in older works and particularly when A ⊆ B), the notation B − A is used instead of B ∖ A. When A ⊆ B, B − A is known as proper set difference.

The complement of A is the set of elements that are not in A:
A^c = {x ∈ S : x ∉ A} (1.1.6)

Note that union, intersection, and difference are binary set operations, while complement is a unary set operation.
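The four operations are easy to experiment with for finite sets. Here is a minimal Python sketch (an illustration, not part of the original text); the universal set S and the sets A and B are arbitrary choices, and the complement is computed as a difference from S.

    S = set(range(10))          # universal set S = {0, 1, ..., 9}
    A = {0, 1, 2, 3, 4}
    B = {3, 4, 5, 6}

    print(A | B)                # union: {x ∈ S : x ∈ A or x ∈ B}
    print(A & B)                # intersection: {x ∈ S : x ∈ A and x ∈ B}
    print(B - A)                # set difference B ∖ A
    print(S - A)                # complement A^c, relative to S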

In the Venn diagram app, select each of the following and note the shaded area in the diagram.
1. A
2. B
3. A^c
4. B^c
5. A ∪ B
6. A ∩ B

Basic Rules
In the following theorems, A , B , and C are subsets of a universal set S . The proofs are straightforward, and just use the definitions
and basic logic. Try the proofs yourself before reading the ones in the text.

A ∩ B ⊆ A ⊆ A ∪ B.

The identity laws:


1. A ∪ ∅ = A
2. A ∩ S = A

So the empty set acts as an identity relative to the union operation, and the universal set acts as an identity relative to the
intersection operation.

The idempotent laws:


1. A ∪ A = A
2. A ∩ A = A

The complement laws:
1. A ∪ A^c = S
2. A ∩ A^c = ∅

The double complement law: (A^c)^c = A

The commutative laws:


1. A ∪ B = B ∪ A

2. A ∩ B = B ∩ A
Proof

The associative laws:


1. A ∪ (B ∪ C ) = (A ∪ B) ∪ C
2. A ∩ (B ∩ C ) = (A ∩ B) ∩ C
Proof

Thus, we can write A ∪ B ∪ C without ambiguity. Note that x is an element of this set if and only if x is an element of at least one
of the three given sets. Similarly, we can write A ∩ B ∩ C without ambiguity. Note that x is an element of this set if and only if x
is an element of all three of the given sets.

The distributive laws:


1. A ∩ (B ∪ C ) = (A ∩ B) ∪ (A ∩ C )
2. A ∪ (B ∩ C ) = (A ∪ B) ∩ (A ∪ C )
Proof

So intersection distributes over union, and union distributes over intersection. It's interesting to compare the distributive properties
of set theory with those of the real number system. If x, y, z ∈ R , then x(y + z) = (xy) + (xz) , so multiplication distributes over
addition, but it is not true that x + (yz) = (x + y)(x + z) , so addition does not distribute over multiplication. The following
results are particularly important in probability theory.

De Morgan's laws (named after Augustus De Morgan):
1. (A ∪ B)^c = A^c ∩ B^c
2. (A ∩ B)^c = A^c ∪ B^c
Proof
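De Morgan's laws can be checked numerically on any particular example; of course such a check illustrates the laws but does not prove them. A Python sketch with arbitrary illustrative sets:

    S = set(range(10))
    A, B = {1, 2, 3}, {3, 4, 5}
    comp = lambda X: S - X                     # complement relative to S

    assert comp(A | B) == comp(A) & comp(B)    # (A ∪ B)^c = A^c ∩ B^c
    assert comp(A & B) == comp(A) | comp(B)    # (A ∩ B)^c = A^c ∪ B^c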

The following result explores the connections between the subset relation and the set operations.

The following statements are equivalent:


1. A ⊆ B
2. B^c ⊆ A^c

3. A ∪ B = B
4. A ∩ B = A
5. A ∖ B = ∅
Proof

In addition to the special sets defined earlier, we also have the following:

More special sets


1. R ∖ Q is the set of irrational numbers
2. R ∖ A is the set of transcendental numbers

Since Q ⊂ A ⊂ R it follows that R ∖ A ⊂ R ∖ Q , that is, every transcendental number is also irrational.
Set difference can be expressed in terms of complement and intersection. All of the other set operations (complement, union, and
intersection) can be expressed in terms of difference.

Results for set difference:
1. B ∖ A = B ∩ A^c
2. A^c = S ∖ A
3. A ∩ B = A ∖ (A ∖ B)
4. A ∪ B = S ∖ {(S ∖ A) ∖ [(S ∖ A) ∖ (S ∖ B)]}

Proof

So in principle, we could do all of set theory using the one operation of set difference. But as (c) and (d) suggest, the results would
be hideous.

(A ∪ B) ∖ (A ∩ B) = (A ∖ B) ∪ (B ∖ A) .
Proof

The set in the previous result is called the symmetric difference of A and B, and is sometimes denoted A △ B. The elements of this set belong to one but not both of the given sets. Thus, the symmetric difference corresponds to exclusive or in the same way that union corresponds to inclusive or. That is, x ∈ A ∪ B if and only if x ∈ A or x ∈ B (or both); x ∈ A △ B if and only if x ∈ A or x ∈ B, but not both. On the other hand, the complement of the symmetric difference consists of the elements that belong to both or neither of the given sets:

(A △ B)^c = (A ∩ B) ∪ (A^c ∩ B^c) = (A ∪ B^c) ∩ (B ∪ A^c)

Proof
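In Python, ^ is the built-in symmetric difference operator on sets, so the identities above can be checked directly (illustrative sets, not from the text):

    S = set(range(8))
    A, B = {1, 2, 3, 4}, {3, 4, 5}

    assert A ^ B == (A | B) - (A & B) == (A - B) | (B - A)
    # complement of the symmetric difference: elements in both or neither
    assert S - (A ^ B) == (A & B) | ((S - A) & (S - B))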

There are 16 different (in general) sets that can be constructed from two given sets A and B.
Proof

Open the Venn diagram app. This app lists the 16 sets that can be constructed from given sets A and B using the set operations.
1. Select each of the four subsets in the proof of the last exercise: A ∩ B, A ∩ B^c, A^c ∩ B, and A^c ∩ B^c. Note that these are disjoint and their union is S.
2. Select each of the other 12 sets and show how each is a union of some of the sets in (a).

General Operations
The operations of union and intersection can easily be extended to a finite or even an infinite collection of sets.

Definitions
Suppose that A is a nonempty collection of subsets of a universal set S. In some cases, the subsets in A may be naturally indexed by a nonempty index set I, so that A = {A_i : i ∈ I}. (In a technical sense, any collection of subsets can be indexed.)

The union of the collection of sets A is the set obtained by combining the elements of the sets in A:

⋃A = {x ∈ S : x ∈ A for some A ∈ A} (1.1.16)

If A = {A_i : i ∈ I}, so that the collection of sets is indexed, then we use the more natural notation:

⋃_{i∈I} A_i = {x ∈ S : x ∈ A_i for some i ∈ I} (1.1.17)

The intersection of the collection of sets A is the set of elements common to all of the sets in A:

⋂A = {x ∈ S : x ∈ A for all A ∈ A} (1.1.18)

If A = {A_i : i ∈ I}, so that the collection of sets is indexed, then we use the more natural notation:

⋂_{i∈I} A_i = {x ∈ S : x ∈ A_i for all i ∈ I} (1.1.19)

Often the index set is an "integer interval" of N. In such cases, an even more natural notation is to use the upper and lower limits of the index set. For example, if the collection is {A_i : i ∈ N+} then we would write ⋃_{i=1}^∞ A_i for the union and ⋂_{i=1}^∞ A_i for the intersection. Similarly, if the collection is {A_i : i ∈ {1, 2, …, n}} for some n ∈ N+, we would write ⋃_{i=1}^n A_i for the union and ⋂_{i=1}^n A_i for the intersection.
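For a finite indexed collection, the general union and intersection can be computed in a single step. A minimal Python sketch (the collection below, with A_i = {i, i + 1, …, 9}, is an arbitrary choice, not from the text):

    A = {i: set(range(i, 10)) for i in range(1, 5)}   # A_1, A_2, A_3, A_4

    union = set().union(*A.values())                  # ⋃ A_i = {1, ..., 9}
    intersection = set.intersection(*A.values())      # ⋂ A_i = {4, ..., 9}
    print(union, intersection)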

A collection of sets A is pairwise disjoint if the intersection of any two sets in the collection is empty: A ∩ B = ∅ for every A, B ∈ A with A ≠ B.

A collection of sets A is said to partition a set B if the collection A is pairwise disjoint and ⋃A = B.

Partitions are intimately related to equivalence relations. As an example, for n ∈ N, the set

D_n = {[j/2^n, (j + 1)/2^n) : j ∈ Z} (1.1.20)

is a partition of R into intervals of equal length 1/2^n. Note that the endpoints are the dyadic rationals of rank n or less, and that D_{n+1} can be obtained from D_n by dividing each interval into two equal parts. This sequence of partitions is one of the reasons that the dyadic rationals are important.

Basic Rules
In the following problems, A = {A_i : i ∈ I} is a collection of subsets of a universal set S, indexed by a nonempty set I, and B is a subset of S.

The general distributive laws:
1. (⋃_{i∈I} A_i) ∩ B = ⋃_{i∈I} (A_i ∩ B)
2. (⋂_{i∈I} A_i) ∪ B = ⋂_{i∈I} (A_i ∪ B)
Restate the laws in the notation where the collection A is not indexed.
Proof

The general De Morgan's laws:
1. (⋃_{i∈I} A_i)^c = ⋂_{i∈I} A_i^c
2. (⋂_{i∈I} A_i)^c = ⋃_{i∈I} A_i^c
Restate the laws in the notation where the collection A is not indexed.
Proof

Suppose that the collection A partitions S . For any subset B , the collection {A ∩ B : A ∈ A } partitions B .
Proof

Figure 1.1.2: A partition of S induces a partition of B

Suppose that {A_i : i ∈ N+} is a collection of subsets of a universal set S.
1. ⋂_{n=1}^∞ ⋃_{k=n}^∞ A_k = {x ∈ S : x ∈ A_k for infinitely many k ∈ N+}
2. ⋃_{n=1}^∞ ⋂_{k=n}^∞ A_k = {x ∈ S : x ∈ A_k for all but finitely many k ∈ N+}

Proof

The sets in the previous result turn out to be important in the study of probability.

Product sets
Definitions
Product sets are sets of sequences. The defining property of a sequence, of course, is that order as well as membership is important.

Let us start with ordered pairs. In this case, the defining property is that (a, b) = (c, d) if and only if a = c and b = d .
Interestingly, the structure of an ordered pair can be defined just using set theory. The construction in the result below is due to
Kazimierz Kuratowski

Define (a, b) = {{a}, {a, b}}. This definition captures the defining property of an ordered pair.
Proof

Of course, it's important not to confuse the ordered pair (a, b) with the open interval (a, b), since the same notation is used for
both. Usually it's clear from context which type of object is referred to. For ordered triples, the defining property is
(a, b, c) = (d, e, f ) if and only if a = d , b = e , and c = f . Ordered triples can be defined in terms of ordered pairs, which via the

last result, uses only set theory.

Define (a, b, c) = (a, (b, c)) . This definition captures the defining property of an ordered triple.
Proof

All of this is just to show how complicated structures can be built from simpler ones, and ultimately from set theory. But enough of that! More generally, two ordered sequences of the same size (finite or infinite) are the same if and only if their corresponding coordinates agree. Thus for n ∈ N+, the definition for n-tuples is (x_1, x_2, …, x_n) = (y_1, y_2, …, y_n) if and only if x_i = y_i for all i ∈ {1, 2, …, n}. For infinite sequences, (x_1, x_2, …) = (y_1, y_2, …) if and only if x_i = y_i for all i ∈ N+.

Suppose now that we have a sequence of n sets, (S_1, S_2, …, S_n), where n ∈ N+. The Cartesian product of the sets is defined as follows:

S_1 × S_2 × ⋯ × S_n = {(x_1, x_2, …, x_n) : x_i ∈ S_i for i ∈ {1, 2, …, n}} (1.1.22)

Cartesian products are named for René Descartes. If S_i = S for each i, then the Cartesian product set can be written compactly as S^n, a Cartesian power. In particular, recall that R denotes the set of real numbers, so that R^n is n-dimensional Euclidean space, named after Euclid, of course. The elements of {0, 1}^n are called bit strings of length n. As the name suggests, we sometimes represent elements of this product set as strings rather than sequences (that is, we omit the parentheses and commas). Since the coordinates just take two values, there is no risk of confusion.

Suppose that we have an infinite sequence of sets (S_1, S_2, …). The Cartesian product of the sets is defined by

S_1 × S_2 × ⋯ = {(x_1, x_2, …) : x_i ∈ S_i for each i ∈ {1, 2, …}} (1.1.23)

When S_i = S for i ∈ N+, the Cartesian product set is sometimes written as a Cartesian power as S^∞ or as S^{N+}. An explanation for the last notation, as well as a much more general construction for products of sets, is given in the next section on functions. Also, notation similar to that of general union and intersection is often used for Cartesian product, with ∏ as the operator. So

∏_{i=1}^n S_i = S_1 × S_2 × ⋯ × S_n,  ∏_{i=1}^∞ S_i = S_1 × S_2 × ⋯ (1.1.24)
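Finite Cartesian products and powers are easy to enumerate with itertools.product. A Python sketch (illustrative, not part of the text; the dice sets anticipate the computational exercises below, and the last lines show how elements of {0, 1}^3 can be displayed as bit strings):

    from itertools import product

    S1 = {1, 2, 3, 4}               # outcomes of a 4-sided die
    S2 = {1, 2, 3, 4, 5, 6}         # outcomes of a 6-sided die
    pairs = set(product(S1, S2))    # S1 × S2, a set of ordered pairs
    print(len(pairs))               # 24

    bits = [''.join(map(str, t)) for t in product([0, 1], repeat=3)]
    print(bits)                     # ['000', '001', ..., '111']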

Rules for Product Sets


We will now see how the set operations relate to the Cartesian product operation. Suppose that S and T are sets and that A ⊆ S ,
B ⊆ S and C ⊆ T , D ⊆ T . The sets in the theorems below are subsets of S × T .

The most important rules that relate Cartesian product with union, intersection, and difference are the distributive rules:

Distributive rules for product sets


1. A × (C ∪ D) = (A × C ) ∪ (A × D)
2. (A ∪ B) × C = (A × C ) ∪ (B × C )
3. A × (C ∩ D) = (A × C ) ∩ (A × D)
4. (A ∩ B) × C = (A × C ) ∩ (B × C )
5. A × (C ∖ D) = (A × C ) ∖ (A × D)
6. (A ∖ B) × C = (A × C ) ∖ (B × C )
Proof

In general, the product of unions is larger than the corresponding union of products.

(A ∪ B) × (C ∪ D) = (A × C ) ∪ (A × D) ∪ (B × C ) ∪ (B × D)

Proof

So in particular it follows that (A × C ) ∪ (B × D) ⊆ (A ∪ B) × (C ∪ D) . On the other hand, the product of intersections is the
same as the corresponding intersection of products.

(A × C ) ∩ (B × D) = (A ∩ B) × (C ∩ D)

Proof

In general, the product of differences is smaller than the corresponding difference of products.

(A ∖ B) × (C ∖ D) = [(A × C ) ∖ (A × D)] ∖ [(B × C ) ∖ (B × D)]

Proof

So in particular it follows that (A ∖ B) × (C ∖ D) ⊆ (A × C) ∖ (B × D).

Projections and Cross Sections


In this discussion, suppose again that S and T are nonempty sets, and that C ⊆ S × T.

Cross Sections
1. The cross section of C in the first coordinate at x ∈ S is C_x = {y ∈ T : (x, y) ∈ C}.
2. The cross section of C in the second coordinate at y ∈ T is

C^y = {x ∈ S : (x, y) ∈ C} (1.1.25)

Note that C_x ⊆ T for x ∈ S and C^y ⊆ S for y ∈ T.

Projections
1. The projection of C onto T is C_T = {y ∈ T : (x, y) ∈ C for some x ∈ S}.
2. The projection of C onto S is C_S = {x ∈ S : (x, y) ∈ C for some y ∈ T}.

The projections are the unions of the appropriate cross sections.

Unions
1. C_T = ⋃_{x∈S} C_x
2. C_S = ⋃_{y∈T} C^y

Cross sections are preserved under the set operations. We state the result for cross sections at x ∈ S. By symmetry, of course, analogous results hold for cross sections at y ∈ T.

Suppose that C, D ⊆ S × T. Then for x ∈ S,
1. (C ∪ D)_x = C_x ∪ D_x
2. (C ∩ D)_x = C_x ∩ D_x
3. (C ∖ D)_x = C_x ∖ D_x

Proof

For projections, the results are a bit more complicated. We give the results for projections onto T ; naturally the results for
projections onto S are analogous.

Suppose again that C, D ⊆ S × T. Then
1. (C ∪ D)_T = C_T ∪ D_T
2. (C ∩ D)_T ⊆ C_T ∩ D_T
3. (C_T)^c ⊆ (C^c)_T

Proof

It's easy to see that equality does not hold in general in parts (b) and (c). In part (b) for example, suppose that A_1, A_2 ⊆ S are nonempty and disjoint and B ⊆ T is nonempty. Let C = A_1 × B and D = A_2 × B. Then C ∩ D = ∅ so (C ∩ D)_T = ∅. But C_T = D_T = B. In part (c) for example, suppose that A is a nonempty proper subset of S and B is a nonempty proper subset of T. Let C = A × B. Then C_T = B so (C_T)^c = B^c. On the other hand, C^c = (A × B^c) ∪ (A^c × B) ∪ (A^c × B^c), so (C^c)_T = T.
Cross sections and projections will be extended to very general product sets in the next section on Functions.

Computational Exercises
Subsets of R
The universal set is [0, ∞). Let A = [0, 5] and B = (3, 7) . Express each of the following in terms of intervals:
1. A ∩ B
2. A ∪ B
3. A ∖ B
4. B ∖ A
5. A^c

Answer

The universal set is N. Let A = {n ∈ N : n is even} and let B = {n ∈ N : n ≤ 9} . Give each of the following:
1. A ∩ B in list form
2. A ∪ B in predicate form
3. A ∖ B in list form
4. B ∖ A in list form
5. A^c in predicate form
6. B^c in list form

Answer

Coins and Dice


Let S = {1, 2, 3, 4} × {1, 2, 3, 4, 5, 6}. This is the set of outcomes when a 4-sided die and a 6-sided die are tossed. Further let
A = {(x, y) ∈ S : x = 2} and B = {(x, y) ∈ S : x + y = 7} . Give each of the following sets in list form:

1. A
2. B
3. A ∩ B
4. A ∪ B
5. A ∖ B
6. B ∖ A
Answer

Let S = {0, 1}^3. This is the set of outcomes when a coin is tossed 3 times (0 denotes tails and 1 denotes heads). Further let A = {(x_1, x_2, x_3) ∈ S : x_2 = 1} and B = {(x_1, x_2, x_3) ∈ S : x_1 + x_2 + x_3 = 2}. Give each of the following sets in list form, using bit-string notation:
1. S
2. A
3. B
4. A^c
5. B^c
6. A ∩ B
7. A ∪ B
8. A ∖ B
9. B ∖ A
Answer

Let S = {0, 1}^2. This is the set of outcomes when a coin is tossed twice (0 denotes tails and 1 denotes heads). Give P(S) in list form.
Answer

Cards
A standard card deck can be modeled by the Cartesian product set
D = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, j, q, k} × {♣, ♢, ♡, ♠} (1.1.26)

where the first coordinate encodes the denomination or kind (ace, 2–10, jack, queen, king) and where the second coordinate
encodes the suit (clubs, diamonds, hearts, spades). Sometimes we represent a card as a string rather than an ordered pair (for
example q♡ for the queen of hearts). For the problems in this subsection, the card deck D is the universal set.

Let H denote the set of hearts and F the set of face cards. Find each of the following:
1. H ∩ F
2. H ∖ F
3. F ∖ H
4. H △ F
Answer

A bridge hand is a subset of D with 13 cards. Often bridge hands are described by giving the cross sections by suit.

Suppose that N is a bridge hand, held by a player named North, defined by


N^♣ = {2, 5, q}, N^♢ = {1, 5, 8, q, k}, N^♡ = {8, 10, j, q}, N^♠ = {1} (1.1.27)

Find each of the following:


1. The nonempty cross sections of N by denomination.
2. The projection of N onto the set of suits.
3. The projection of N onto the set of denominations
Answer

By contrast, it is usually more useful to describe a poker hand by giving the cross sections by denomination. In the usual version of
draw poker, a hand is a subset of D with 5 cards.

Suppose that B is a poker hand, held by a player named Bill, with

B_1 = {♣, ♠}, B_8 = {♣, ♠}, B_q = {♡} (1.1.28)

Find each of the following:


1. The nonempty cross sections of B by suit.
2. The projection of B onto the set of suits.
3. The projection of B onto the set of denominations
Answer

The poker hand in the last exercise is known as a dead man's hand. Legend has it that Wild Bill Hickock held this hand at the time
of his murder in 1876.

General unions and intersections


For the problems in this subsection, the universal set is R.

Let A_n = [0, 1 − 1/n] for n ∈ N+. Find
1. ⋂_{n=1}^∞ A_n
2. ⋃_{n=1}^∞ A_n
3. ⋂_{n=1}^∞ A_n^c
4. ⋃_{n=1}^∞ A_n^c
Answer

Let A_n = (2 − 1/n, 5 + 1/n) for n ∈ N+. Find
1. ⋂_{n=1}^∞ A_n
2. ⋃_{n=1}^∞ A_n
3. ⋂_{n=1}^∞ A_n^c
4. ⋃_{n=1}^∞ A_n^c
Answer

Subsets of R^2

Let T be the closed triangular region in R^2 with vertices (0, 0), (1, 0), and (1, 1). Find each of the following:
1. The cross section T_x for x ∈ R
2. The cross section T^y for y ∈ R
3. The projection of T onto the horizontal axis
4. The projection of T onto the vertical axis
Answer

This page titled 1.1: Sets is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via
source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

1.2: Functions

Functions play a central role in probability and statistics, as they do in every other branch of mathematics. For the most part, the
proofs in this section are straightforward, so be sure to try them yourself before reading the ones in the text.

Definitions and Properties


Basic Definitions
We start with the formal, technical definition of a function. It's not very intuitive, but has the advantage that it only requires set
theory.

A function f from a set S into a set T is a subset of the product set S × T with the property that for each element x ∈ S , there
exists a unique element y ∈ T such that (x, y) ∈ f . If f is a function from S to T we write f : S → T . If (x, y) ∈ f we write
y = f (x).
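For finite sets, the formal definition can be realized literally on a computer. In this Python sketch (the sets S and T and the function f are arbitrary choices, not from the text), a function is a set of ordered pairs in which each x ∈ S appears exactly once as a first coordinate; converting it to a dictionary recovers the "rule" point of view described next.

    S = {1, 2, 3}
    T = {'a', 'b'}
    f = {(1, 'a'), (2, 'a'), (3, 'b')}   # each x in S appears exactly once

    rule = dict(f)                       # the "rule" view: rule[x] = f(x)
    print(rule[2])                       # 'a'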

Less formally, a function f from S into T is a “rule” (or “procedure” or “algorithm”) that assigns to each x ∈ S a unique element
f (x) ∈ T . The definition of a function as a set of ordered pairs, is due to Kazimierz Kuratowski. The term map or mapping is also

used in place of function, so we could say that f maps S into T .

Figure 1.2.1: A function f from S into T

The sets S and T in the definition are clearly important.

Suppose that f : S → T .
1. The set S is the domain of f .
2. The set T is the range space or co-domain of f .
3. The range of f is the set of function values. That is, range (f ) = {y ∈ T : y = f (x) for some x ∈ S} .

The domain and range are completely specified by a function. That's not true of the co-domain: if f is a function from S into T ,
and U is another set with T ⊆ U , then we can also think of f as a function from S into U . The following definitions are natural
and important.

Suppose again that f : S → T .


1. f maps S onto T if range (f ) = T . That is, for each y ∈ T there exists x ∈ S such that f (x) = y .
2. f is one-to-one if distinct elements in the domain are mapped to distinct elements in the range. That is, if u, v∈ S and
u ≠ v then f (u) ≠ f (v) .

Clearly a function always maps its domain onto its range. Note also that f is one-to-one if f (u) = f (v) implies u =v for
u, v ∈ S .

Inverse functions
A function that is one-to-one and onto can be "reversed" in a sense.

If f maps S one-to-one onto T, the inverse of f is the function f^{-1} from T onto S given by

f^{-1}(y) = x ⟺ f(x) = y; x ∈ S, y ∈ T (1.2.1)

If you like to think of a function as a set of ordered pairs, then f^{-1} = {(y, x) ∈ T × S : (x, y) ∈ f}. The fact that f is one-to-one and onto ensures that f^{-1} is a valid function from T onto S. Sets S and T are in one-to-one correspondence if there exists a one-to-one function from S onto T. One-to-one correspondence plays an essential role in the study of cardinality.

Restrictions
The domain of a function can be restricted to create a new function.

Suppose that f : S → T and that A ⊆ S. The function f_A : A → T defined by f_A(x) = f(x) for x ∈ A is the restriction of f to A.

As a set of ordered pairs, note that f_A = {(x, y) ∈ f : x ∈ A}.

Composition
Composition is perhaps the most important way to combine two functions to create another function.

Suppose that g : R → S and f : S → T . The composition of f with g is the function f ∘ g : R → T defined by


(f ∘ g) (x) = f (g(x)) , x ∈ R (1.2.2)

Composition is associative:

Suppose that h : R → S , g : S → T , and f : T → U . Then


f ∘ (g ∘ h) = (f ∘ g) ∘ h (1.2.3)

Proof

Thus we can write f ∘ g ∘ h without ambiguity. On the other hand, composition is not commutative. Indeed depending on the
domains and co-domains, f ∘ g might be defined when g ∘ f is not. Even when both are defined, they may have different domains
and co-domains, and so of course cannot be the same function. Even when both are defined and have the same domains and co-
domains, the two compositions will not be the same in general. Examples of all of these cases are given in the computational
exercises below.
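Composition is immediate to express in code, and a small example confirms that f ∘ g and g ∘ f generally differ even when both are defined. A Python sketch with arbitrary illustrative functions:

    def compose(f, g):
        # (f ∘ g)(x) = f(g(x))
        return lambda x: f(g(x))

    f = lambda x: x ** 2
    g = lambda x: x + 1
    print(compose(f, g)(3))    # f(g(3)) = (3 + 1)^2 = 16
    print(compose(g, f)(3))    # g(f(3)) = 3^2 + 1 = 10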

Suppose that g : R → S and f : S → T .


1. If f and g are one-to-one then f ∘ g is one-to-one.
2. If f and g are onto then f ∘ g is onto.
Proof

The identity function on a set S is the function I_S from S onto S defined by I_S(x) = x for x ∈ S.

S S (x) =x for x ∈ S

The identity function acts like an identity with respect to the operation of composition.

If f : S → T then
1. f ∘ I_S = f
2. I_T ∘ f = f

Proof

The inverse of a function is really the inverse with respect to composition.

Suppose that f is a one-to-one function from S onto T. Then
1. f^{-1} ∘ f = I_S
2. f ∘ f^{-1} = I_T

Proof

An element x ∈ S^n can be thought of as a function from {1, 2, …, n} into S. Similarly, an element x ∈ S^∞ can be thought of as a function from N+ into S. For such a sequence x, of course, we usually write x_i instead of x(i). More generally, if S and T are sets, then the set of all functions from S into T is denoted by T^S. In particular, as we noted in the last section, S^∞ is also (and more accurately) written as S^{N+}.

Suppose that g is a one-to-one function from R onto S and that f is a one-to-one function from S onto T. Then (f ∘ g)^{-1} = g^{-1} ∘ f^{-1}.

Proof

Inverse Images
Inverse images of a function play a fundamental role in probability, particularly in the context of random variables.

Suppose that f : S → T. If A ⊆ T, the inverse image of A under f is the subset of S given by

f^{-1}(A) = {x ∈ S : f(x) ∈ A} (1.2.4)

So f^{-1}(A) is the subset of S consisting of those elements that map into A.

Figure 1.2.2: The inverse image of A under f

Technically, the inverse images define a new function from P(T ) into P(S) . We use the same notation as for the inverse
function, which is defined when f is one-to-one and onto. These are very different functions, but usually no confusion results. The
following important theorem shows that inverse images preserve all set operations.

Suppose that f : S → T, and that A, B ⊆ T. Then
1. f^{-1}(A ∪ B) = f^{-1}(A) ∪ f^{-1}(B)
2. f^{-1}(A ∩ B) = f^{-1}(A) ∩ f^{-1}(B)
3. f^{-1}(A ∖ B) = f^{-1}(A) ∖ f^{-1}(B)
4. If A ⊆ B then f^{-1}(A) ⊆ f^{-1}(B)
5. If A and B are disjoint, so are f^{-1}(A) and f^{-1}(B)

Proof
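For a function on a finite set, inverse images are one-line computations, and the preservation properties above can be checked directly. A Python sketch (the function f(x) = x^2 on S is an arbitrary choice, deliberately not one-to-one):

    S = {-2, -1, 0, 1, 2}
    f = lambda x: x * x                   # f : S → T with T = {0, 1, 4}

    def preimage(A):
        # f^{-1}(A) = {x in S : f(x) in A}
        return {x for x in S if f(x) in A}

    A, B = {1, 4}, {0, 1}
    assert preimage(A | B) == preimage(A) | preimage(B)
    assert preimage(A & B) == preimage(A) & preimage(B)
    assert preimage(A - B) == preimage(A) - preimage(B)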

The result in part (a) holds for arbitrary unions, and the result in part (b) holds for arbitrary intersections. No new ideas are
involved; only the notation is more complicated.

Suppose that {A_i : i ∈ I} is a collection of subsets of T, where I is a nonempty index set. Then
1. f^{-1}(⋃_{i∈I} A_i) = ⋃_{i∈I} f^{-1}(A_i)
2. f^{-1}(⋂_{i∈I} A_i) = ⋂_{i∈I} f^{-1}(A_i)

Proof

Forward Images
Forward images of a function are a natural complement to inverse images.

Suppose again that f : S → T . If A ⊆ S , the forward image of A under f is the subset of T given by
f (A) = {f (x) : x ∈ A} (1.2.5)

So f (A) is the range of f restricted to A .

Figure 1.2.3: The forward image of A under f

Technically, the forward images define a new function from P(S) into P(T ) , but we use the same symbol f for this new function
as for the underlying function from S into T that we started with. Again, the two functions are very different, but usually no
confusion results.
It might seem that forward images are more natural than inverse images, but in fact, the inverse images are much more important
than the forward ones (at least in probability and measure theory). Fortunately, the inverse images are nicer as well—unlike the
inverse images, the forward images do not preserve all of the set operations.

Suppose that f : S → T , and that A, B ⊆S . Then


1. f (A ∪ B) = f (A) ∪ f (B) .
2. f (A ∩ B) ⊆ f (A) ∩ f (B) . Equality holds if f is one-to-one.
3. f (A) ∖ f (B) ⊆ f (A ∖ B) . Equality holds if f is one-to-one.
4. If A ⊆ B then f (A) ⊆ f (B) .
Proof

The result in part (a) holds for arbitrary unions, and the result in part (b) holds for arbitrary intersections. No new ideas are involved; only the notation is more complicated.

Suppose that {A_i : i ∈ I} is a collection of subsets of S, where I is a nonempty index set. Then
1. f(⋃_{i∈I} A_i) = ⋃_{i∈I} f(A_i).
2. f(⋂_{i∈I} A_i) ⊆ ⋂_{i∈I} f(A_i). Equality holds if f is one-to-one.
Proof

Suppose again that f : S → T . As noted earlier, the forward images of f define a function from P(S) into P(T ) and the inverse
images define a function from P(T ) into P(S) . One might hope that these functions are inverses of one another, but alas no.

Suppose that f : S → T.
1. A ⊆ f^{-1}[f(A)] for A ⊆ S. Equality holds if f is one-to-one.
2. f[f^{-1}(B)] ⊆ B for B ⊆ T. Equality holds if f is onto.
Proof
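Continuing the Python sketch above (same S, f, and preimage), forward images are just as easy to compute, and the example shows A as a proper subset of f^{-1}[f(A)] when f is not one-to-one:

    image = lambda A: {f(x) for x in A}   # f(A) = {f(x) : x in A}

    A = {1, 2}
    print(image(A))                       # {1, 4}
    print(preimage(image(A)))             # {-2, -1, 1, 2}, strictly larger than A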

Spaces of Real Functions


Real-valued functions on a given set S are of particular importance. The usual arithmetic operations on such functions are defined
pointwise.

Suppose that f, g : S → R and c ∈ R. Then f + g, f − g, fg, cf, f/g : S → R are defined as follows for all x ∈ S.


1. (f + g)(x) = f (x) + g(x)
2. (f − g)(x) = f (x) − g(x)
3. (f g)(x) = f (x)g(x)
4. (cf )(x) = cf (x)
5. (f /g)(x) = f (x)/g(x) assuming that g(x) ≠ 0 for x ∈ S .

Now let V denote the collection of all functions from the given set S into R. A fact that is very important in probability as well as
other branches of analysis is that V , with addition and scalar multiplication as defined above, is a vector space. The zero function 0
is defined, of course, by 0(x) = 0 for all x ∈ S .

(V , +, ⋅) is a vector space over R. That is, for all f , g, h ∈ V and a, b ∈ R

1. f + g = g + f , the commutative property of vector addition.


2. f + (g + h) = (f + g) + h , the associative property of vector addition.
3. a(f + g) = af + ag , scalar multiplication distributes over vector addition.
4. (a + b)f = af + bf , scalar multiplication distributes over scalar addition.
5. f + 0 = f , the existence of a zero vector.
6. f + (−f ) = 0 , the existence of additive inverses.
7. 1 ⋅ f = f , the unity property.
Proof

Various subspaces of V are important in probability as well. We will return to the discussion of vector spaces of functions in the
sections on partial orders and in the advanced sections on metric spaces and measure theory.

Indicator Functions
For our next discussion, suppose that S is the universal set, so that all other sets mentioned are subsets of S .

Suppose that A ⊆ S. The indicator function of A is the function 1_A : S → {0, 1} defined as follows:

1_A(x) = 1 if x ∈ A, and 1_A(x) = 0 if x ∉ A (1.2.6)

Thus, the indicator function of A simply indicates whether or not x ∈ A for each x ∈ S . Conversely, any function on S that just
takes the values 0 and 1 is an indicator function.

If f : S → {0, 1} then f is the indicator function of the set A = f^{-1}{1} = {x ∈ S : f(x) = 1}.

Thus, there is a one-to-one correspondence between P(S), the power set of S, and the collection of indicator functions {0, 1}^S.
The next result shows how the set algebra of subsets corresponds to the arithmetic algebra of the indicator functions.

Suppose that A, B ⊆ S. Then
1. 1_{A∩B} = 1_A 1_B = min{1_A, 1_B}
2. 1_{A∪B} = 1 − (1 − 1_A)(1 − 1_B) = max{1_A, 1_B}
3. 1_{A^c} = 1 − 1_A
4. 1_{A∖B} = 1_A (1 − 1_B)
5. A ⊆ B if and only if 1_A ≤ 1_B

Proof
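The correspondence between set algebra and indicator arithmetic can be verified pointwise on a finite universal set. A Python sketch with arbitrary illustrative sets:

    S = set(range(6))
    A, B = {0, 1, 2}, {2, 3}
    ind = lambda X: (lambda x: 1 if x in X else 0)    # x ↦ 1_X(x)

    oneA, oneB = ind(A), ind(B)
    for x in S:
        assert ind(A & B)(x) == oneA(x) * oneB(x) == min(oneA(x), oneB(x))
        assert ind(A | B)(x) == 1 - (1 - oneA(x)) * (1 - oneB(x))
        assert ind(S - A)(x) == 1 - oneA(x)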

The result in part (a) extends to arbitrary intersections and the result in part (b) extends to arbitrary unions.

Suppose that {A_i : i ∈ I} is a collection of subsets of S, where I is a nonempty index set. Then
1. 1_{⋂_{i∈I} A_i} = ∏_{i∈I} 1_{A_i} = min{1_{A_i} : i ∈ I}
2. 1_{⋃_{i∈I} A_i} = 1 − ∏_{i∈I} (1 − 1_{A_i}) = max{1_{A_i} : i ∈ I}

Proof

Multisets
A multiset is like an ordinary set, except that elements may be repeated. A multiset A (with elements from a universal set S) can be uniquely associated with its multiplicity function m_A : S → N, where m_A(x) is the number of times that element x is in A for each x ∈ S. So the multiplicity function of a multiset plays the same role that an indicator function does for an ordinary set. Multisets arise naturally when objects are sampled with replacement (but without regard to order) from a population. Various sampling models are explored in the section on Combinatorial Structures. We will not go into detail about the operations on multisets, but the definitions are straightforward generalizations of those for ordinary sets.

Suppose that A and B are multisets with elements from the universal set S. Then
1. A ⊆ B if and only if m_A ≤ m_B
2. m_{A∪B} = max{m_A, m_B}
3. m_{A∩B} = min{m_A, m_B}
4. m_{A+B} = m_A + m_B
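Python's collections.Counter is a convenient model of a multiset, with the stored counts playing the role of the multiplicity function m_A (the particular multisets here are arbitrary illustrations, not from the text):

    from collections import Counter

    A = Counter('aabc')        # m_A(a) = 2, m_A(b) = m_A(c) = 1
    B = Counter('abbd')

    print(A | B)               # union: multiplicity max{m_A, m_B}
    print(A & B)               # intersection: multiplicity min{m_A, m_B}
    print(A + B)               # sum: multiplicity m_A + m_B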

Product Spaces
Using functions, we can generalize the Cartesian products studied earlier. In this discussion, we suppose that S_i is a set for each i in a nonempty index set I.
in a nonempty index set I .

Define the product set

∏_{i∈I} S_i = {x : x is a function from I into ⋃_{i∈I} S_i such that x(i) ∈ S_i for each i ∈ I} (1.2.7)

Note that except for being nonempty, there are no assumptions on the cardinality of the index set I. Of course, if I = {1, 2, …, n} for some n ∈ N+, or if I = N+, then this construction reduces to S_1 × S_2 × ⋯ × S_n and to S_1 × S_2 × ⋯, respectively. Since we want to make the notation more closely resemble that of simple Cartesian products, we will write x_i instead of x(i) for the value of the function x at i ∈ I, and we sometimes refer to this value as the ith coordinate of x. Finally, note that if S_i = S for each i ∈ I, then ∏_{i∈I} S_i is simply the set of all functions from I into S, which we denoted by S^I above.
each i ∈ I , then ∏ S is simply the set of all functions from I into S , which we denoted by S above.
i∈I i
I

For j ∈ I define the projection p_j : ∏_{i∈I} S_i → S_j by p_j(x) = x_j for x ∈ ∏_{i∈I} S_i.

So p_j(x) is just the jth coordinate of x. The projections are of basic importance for product spaces. In particular, we have a better way of looking at projections of a subset of a product set.
way of looking at projections of a subset of a product set.

For A ⊆ ∏_{i∈I} S_i and j ∈ I, the forward image p_j(A) is the projection of A onto S_j.

Proof

So the properties of projection that we studied in the last section are just special cases of the properties of forward images.
Projections also allow us to get coordinate functions in a simple way.

Suppose that R is a set, and that f : R → ∏_{i∈I} S_i. If j ∈ I then p_j ∘ f : R → S_j is the jth coordinate function of f.
Proof

This will look more familiar for a simple Cartesian product. If f : R → S_1 × S_2 × ⋯ × S_n, then f = (f_1, f_2, …, f_n) where f_j : R → S_j is the jth coordinate function for j ∈ {1, 2, …, n}.

Cross sections of a subset of a product set can be expressed in terms of inverse images of a function. First we need some additional notation. Suppose that our index set I has at least two elements. For j ∈ I and u ∈ S_j, define j_u : ∏_{i∈I−{j}} S_i → ∏_{i∈I} S_i by j_u(x) = y where y_i = x_i for i ∈ I − {j} and y_j = u. In words, j_u takes a point x ∈ ∏_{i∈I−{j}} S_i and assigns u to coordinate j to produce the point y ∈ ∏_{i∈I} S_i.

In the setting above, if j ∈ I, u ∈ S_j and A ⊆ ∏_{i∈I} S_i then j_u^{-1}(A) is the cross section of A in the jth coordinate at u.
Proof

Let's look at this for the product of two sets S and T. For x ∈ S, the function 1_x : T → S × T is given by 1_x(y) = (x, y). Similarly, for y ∈ T, the function 2_y : S → S × T is given by 2_y(x) = (x, y). Suppose now that A ⊆ S × T. If x ∈ S, then 1_x^{-1}(A) = {y ∈ T : (x, y) ∈ A}, the very definition of the cross section of A in the first coordinate at x. Similarly, if y ∈ T, then 2_y^{-1}(A) = {x ∈ S : (x, y) ∈ A}, the very definition of the cross section of A in the second coordinate at y. This construction is not particularly important except to show that cross sections are inverse images. Thus the fact that cross sections preserve all of the set operations is a simple consequence of the fact that inverse images generally preserve set operations.

Operators
Sometimes functions have special interpretations in certain settings.

Suppose that S is a set.
1. A function f : S → S is sometimes called a unary operator on S .
2. A function g : S × S → S is sometimes called a binary operator on S .

As the names suggest, a unary operator f operates on an element x ∈ S to produce a new element f (x) ∈ S . Similarly, a binary
operator g operates on a pair of elements (x, y) ∈ S × S to produce a new element g(x, y) ∈ S . The arithmetic operators are
quintessential examples:

The following are operators on R:


1. minus(x) = −x is a unary operator.
2. sum(x, y) = x + y is a binary operator.
3. product(x, y) = x y is a binary operator.
4. difference(x, y) = x − y is a binary operator.

For a fixed universal set S , the set operators studied in the section on Sets provide other examples.

For a given set S , the following are operators on P(S) :


1. complement(A) = A^c is a unary operator.

2. union(A, B) = A ∪ B is a binary operator.


3. intersect(A, B) = A ∩ B is a binary operator.
4. difference(A, B) = A ∖ B is a binary operator.

As these examples illustrate, a binary operator is often written as x f y rather than f (x, y). Still, it is useful to know that operators
are simply functions of a special type.

Suppose that f is a unary operator on a set S , g is a binary operator on S , and that A ⊆ S .


1. A is closed under f if x ∈ A implies f (x) ∈ A.
2. A is closed under g if (x, y) ∈ A × A implies g(x, y) ∈ A .

Thus if A is closed under the unary operator f, then f restricted to A is a unary operator on A. Similarly, if A is closed under the binary operator g, then g restricted to A × A is a binary operator on A. Let's return to our most basic example.

For the arithmetic operators on R,


1. N is closed under sum and product, but not under minus and difference.
2. Z is closed under sum, product, minus, and difference.
3. Q is closed under sum, product, minus, and difference.
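Closure claims like these can be spot-checked computationally. The following sketch tests closure under the difference operator on a finite window of N; since the number systems are infinite, this is only an illustrative check on sample pairs, not a proof, and the window size is an arbitrary choice.

# Brute-force closure check on a finite window of N.
def closed_under(op, elements, membership):
    return all(membership(op(x, y)) for x in elements for y in elements)

naturals = range(0, 10)
is_natural = lambda z: isinstance(z, int) and z >= 0
is_integer = lambda z: isinstance(z, int)
difference = lambda x, y: x - y

print(closed_under(difference, naturals, is_natural))  # False: 0 - 1 = -1 is not in N
print(closed_under(difference, naturals, is_integer))  # True for these sample pairs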

Many properties that you are familiar with for special operators (such as the arithmetic and set operators) can now be formulated
generally.

Suppose that f and g are binary operators on a set S . In the following definitions, x, y , and z are arbitrary elements of S .
1. f is commutative if f(x, y) = f(y, x), that is, x f y = y f x.
2. f is associative if f(x, f(y, z)) = f(f(x, y), z), that is, x f (y f z) = (x f y) f z.
3. g distributes over f (on the left) if g(x, f(y, z)) = f(g(x, y), g(x, z)), that is, x g (y f z) = (x g y) f (x g z).
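On a small finite set, these identities can be verified exhaustively. The sketch below uses addition and multiplication mod 5 as example binary operators; any other finite operators could be substituted.

from itertools import product

# Exhaustive check of the defining identities on Z5, the integers mod 5.
Z5 = range(5)
f = lambda x, y: (x + y) % 5      # plays the role of f (addition mod 5)
g = lambda x, y: (x * y) % 5      # plays the role of g (multiplication mod 5)

commutative = all(f(x, y) == f(y, x) for x, y in product(Z5, repeat=2))
associative = all(f(x, f(y, z)) == f(f(x, y), z) for x, y, z in product(Z5, repeat=3))
distributive = all(g(x, f(y, z)) == f(g(x, y), g(x, z)) for x, y, z in product(Z5, repeat=3))
print(commutative, associative, distributive)  # True True True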

The Axiom of Choice


Suppose that 𝒮 is a collection of nonempty subsets of a set S. The axiom of choice states that there exists a function f : 𝒮 → S with the property that f(A) ∈ A for each A ∈ 𝒮. The function f is known as a choice function.

Stripped of most of the mathematical jargon, the idea is very simple. Since each set A ∈ 𝒮 is nonempty, we can select an element of A; we will call the element we select f(A), and thus our selections define a function. In fact, you may wonder why we need an

axiom at all. The problem is that we have not given a rule (or procedure or algorithm) for selecting the elements of the sets in the
collection. Indeed, we may not know enough about the sets in the collection to define a specific rule, so in such a case, the axiom of
choice simply guarantees the existence of a choice function. Some mathematicians, known as constructivists, do not accept the axiom of choice, and insist on well-defined rules for constructing functions.
A nice consequence of the axiom of choice is a type of duality between one-to-one functions and onto functions.

Suppose that f is a function from a set S onto a set T . There exists a one-to-one function g from T into S .
Proof.

Suppose that f is a one-to-one function from a set S into a set T . There exists a function g from T onto S .
Proof.

Computational Exercises
Some Elementary Functions
Each of the following rules defines a function from R into R.

f(x) = x^2
g(x) = sin(x)
h(x) = ⌊x⌋
u(x) = e^x / (1 + e^x)

Find the range of the function and determine if the function is one-to-one in each of the following cases:
1. f
2. g
3. h
4. u
Answer

Find the following inverse images:


1. f^{−1}[4, 9]
2. g^{−1}{0}
3. h^{−1}{2, 3, 4}

Answer

The function u is one-to-one. Find (that is, give the domain and rule for) the inverse function u^{−1}.

Answer

Give the rule and find the range for each of the following functions:
1. f ∘ g
2. g ∘ f
3. h ∘ g ∘ f
Answer

Note that f ∘ g and g ∘ f are well-defined functions from R into R, but f ∘ g ≠ g ∘ f .
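A quick numerical spot check of the last observation; this is only a sketch, and the sample point is an arbitrary choice.

import math

# f(g(x)) and g(f(x)) generally disagree, so composition is not commutative.
f = lambda x: x ** 2
g = lambda x: math.sin(x)

x = 2.0
print(f(g(x)), g(f(x)))  # sin(2)^2 and sin(4): different values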

Dice
Let S = {1, 2, 3, 4, 5, 6}^2. This is the set of possible outcomes when a pair of standard dice are thrown. Let f, g, u, and v be the functions from S into Z defined by the following rules:
f (x, y) = x + y

g(x, y) = y − x

u(x, y) = min{x, y}

v(x, y) = max{x, y}

In addition, let F and U be the functions defined by F = (f , g) and U = (u, v) .

Find the range of each of the following functions:


1. f
2. g
3. u
4. v
5. U
Answer

Give each of the following inverse images in list form:


1. f^{−1}{6}
2. u^{−1}{3}
3. v^{−1}{4}
4. U^{−1}{(3, 4)}

Answer

Find each of the following compositions:


1. f ∘ U
2. g ∘ U
3. u ∘ F
4. v ∘ F
5. F ∘ U
6. U ∘ F
Answer

Note that while f ∘ U is well-defined, U ∘ f is not. Note also that f ∘ U = f even though U is not the identity function on S.
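The identity f ∘ U = f can be verified exhaustively over the 36 outcomes; here is a minimal sketch, with function names mirroring those above.

from itertools import product

# Check that f(U(x, y)) = f(x, y) for every dice outcome: min + max = x + y.
S = list(product(range(1, 7), repeat=2))
f = lambda x, y: x + y
U = lambda x, y: (min(x, y), max(x, y))

print(all(f(*U(x, y)) == f(x, y) for x, y in S))  # True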

Bit Strings
Let n ∈ N+ and let S = {0, 1}^n and T = {0, 1, …, n}. Recall that the elements of S are bit strings of length n, and could represent the possible outcomes of n tosses of a coin (where 1 means heads and 0 means tails). Let f : S → T be the function defined by f(x_1, x_2, …, x_n) = ∑_{i=1}^n x_i. Note that f(x) is just the number of 1s in the bit string x. Let g : T → S be the function defined by g(k) = x_k where x_k denotes the bit string with k 1s followed by n − k 0s.

Find each of the following


1. f ∘ g
2. g ∘ f
Answer

In the previous exercise, note that f ∘ g and g ∘ f are both well-defined, but have different domains, and so of course are not the same. Note also that f ∘ g is the identity function on T, but f is not the inverse of g. Indeed f is not one-to-one, and so does not have an inverse. However, f restricted to {x_k : k ∈ T} (the range of g) is one-to-one and is the inverse of g.

Let n = 4. Give f^{−1}({k}) in list form for each k ∈ T.
Answer

Again let n = 4 . Let A = {1000, 1010} and B = {1000, 1100}. Give each of the following in list form:
1. f (A)
2. f (B)
3. f (A ∩ B)
4. f (A) ∩ f (B)

1.2.9 https://stats.libretexts.org/@go/page/10117
5. f^{−1}(f(A))

Answer

In the previous exercise, note that f(A ∩ B) ⊂ f(A) ∩ f(B) and A ⊂ f^{−1}(f(A)).

Indicator Functions
Suppose that A and B are subsets of a universal set S. Express, in terms of 1_A and 1_B, the indicator function of each of the 14 non-trivial sets that can be constructed from A and B. Use the Venn diagram app to help.
Answer

Suppose that A, B, and C are subsets of a universal set S. Give the indicator function of each of the following, in terms of 1_A, 1_B, and 1_C, in sum-product form:

1. D = {x ∈ S : x is an element of exactly one of the given sets}


2. E = {x ∈ S : x is an element of exactly two of the given sets}
Answer

Operators
Recall the standard arithmetic operators on R discussed above.

We all know that sum is commutative and associative, and that product distributes over sum.
1. Is difference commutative?
2. Is difference associative?
3. Does product distribute over difference?
4. Does sum distribute over product?
Answer

Multisets
Express the multiset A in list form that has the multiplicity function m : {a, b, c, d, e} → N given by m(a) = 2 , m(b) = 3 ,
m(c) = 1 , m(d) = 0 , m(e) = 4 .

Answer

Express the prime factors of 360 as a multiset in list form.


Answer

This page titled 1.2: Functions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

1.3: Relations

Relations play a fundamental role in probability theory, as in most other areas of mathematics.

Definitions and Constructions


Suppose that S and T are sets. A relation R from S to T is a subset of the product set S × T.
1. The domain of R is the set of first coordinates: domain(R) = {x ∈ S : (x, y) ∈ R for some y ∈ T } .
2. The range of R is the set of second coordinates: range(R) = {y ∈ T : (x, y) ∈ R for some x ∈ S} .
A relation from a set S to itself is a relation on S .

As the name suggests, a relation R from S into T is supposed to define a relationship between the elements of S and the elements of T, and so we usually use the more suggestive notation x R y when (x, y) ∈ R. Note that the domain of R is the projection of R onto S and the range of R is the projection of R onto T.

Basic Examples
Suppose that S is a set and recall that P(S) denotes the power set of S , the collection of all subsets of S . The membership relation
∈ from S to P(S) is perhaps the most important and basic relationship in mathematics. Indeed, for us, it's a primitive (undefined)

relationship—given x and A , we assume that we understand the meaning of the statement x ∈ A .


Another basic primitive relation is the equality relation = on a given set of objects S . That is, given two objects x and y , we
assume that we understand the meaning of the statement x = y .
Other basic relations that you have seen are
1. The subset relation ⊆ on P(S) .
2. The order relation ≤ on R
These two belong to a special class of relations known as partial orders that we will study in the next section. Note that a function f
from S into T is a special type of relation. To compare the two types of notation (relation and function), note that x f y means that
y = f (x).

Constructions
Since a relation is just a set of ordered pairs, the set operations can be used to build new relations from existing ones.

If Q and R are relations from S to T, then so are Q ∪ R, Q ∩ R, and Q ∖ R.


1. x(Q ∪ R)y if and only if x Q y or x R y.
2. x(Q ∩ R)y if and only if x Q y and x R y.
3. x(Q ∖ R)y if and only if x Q y but not x R y.
4. If Q ⊆ R then x Q y implies x R y.
If R is a relation from S to T and Q ⊆ R , then Q is a relation from S to T .

The restriction of a relation defines a new relation.

If R is a relation on S and A ⊆ S then R_A = R ∩ (A × A) is a relation on A, called the restriction of R to A.

The inverse of a relation also defines a new relation.

If R is a relation from S to T, the inverse of R is the relation from T to S defined by

y R^{−1} x if and only if x R y  (1.3.1)

Equivalently, R^{−1} = {(y, x) : (x, y) ∈ R}. Note that any function f from S into T has an inverse relation, but only when f is one-to-one is the inverse relation also a function (the inverse function). Composition is another natural way to create new relations

from existing ones.

Suppose that Q is a relation from S to T and that R is a relation from T to U . The composition Q ∘ R is the relation from S to
U defined as follows: for x ∈ S and z ∈ U , x(Q ∘ R)z if and only if there exists y ∈ T such that x Q y and y R z .

Note that the notation is inconsistent with the notation used for composition of functions, essentially because relations are read
from left to right, while functions are read from right to left. Hopefully, the inconsistency will not cause confusion, since we will
always use function notation for functions.

Basic Properties
The important classes of relations that we will study in the next couple of sections are characterized by certain basic properties.
Here are the definitions:

Suppose that R is a relation on S .


1. R is reflexive if x R x for all x ∈ S .
2. R is irreflexive if no x ∈ S satisfies x R x.
3. R is symmetric if x R y implies y R x for all x, y ∈ S .
4. R is anti-symmetric if x R y and y R x implies x = y for all x, y ∈ S .
5. R is transitive if x R y and y R z implies x R z for all x, y, z ∈ S .

The proofs of the following results are straightforward, so be sure to try them yourself before reading the ones in the text.

A relation R on S is reflexive if and only if the equality relation = on S is a subset of R .


Proof

A relation R on S is symmetric if and only if R^{−1} = R.
Proof

A relation R on S is transitive if and only if R ∘ R ⊆ R .


Proof

A relation R on S is antisymmetric if and only if R ∩ R^{−1} is a subset of the equality relation = on S.
Proof

Suppose that Q and R are relations on S . For each property below, if both Q and R have the property, then so does Q ∩ R .
1. reflexive
2. symmetric
3. transitive
Proof
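For a relation on a small finite set, all of these properties can be checked by brute force. A minimal sketch, where the relation is stored as a set of ordered pairs; the particular S and R below are hypothetical examples.

# Property checks for a relation R on a finite set S.
S = {1, 2, 3}
R = {(1, 1), (2, 2), (3, 3), (1, 2), (2, 1)}

reflexive = all((x, x) in R for x in S)
symmetric = all((y, x) in R for (x, y) in R)
transitive = all((x, w) in R for (x, y) in R for (z, w) in R if y == z)

print(reflexive, symmetric, transitive)  # True True True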

Suppose that R is a relation on a set S .


1. Give an explicit definition for the property R is not reflexive.
2. Give an explicit definition for the property R is not irreflexive.
3. Are any of the properties R is reflexive, R is not reflexive, R is irreflexive, R is not irreflexive equivalent?
Answer

Suppose that R is a relation on a set S .


1. Give an explicit definition for the property R is not symmetric.
2. Give an explicit definition for the property R is not antisymmetric.
3. Are any of the properties R is symmetric, R is not symmetric, R is antisymmetric, R is not antisymmetric equivalent?
Answer

Computational Exercises
Let R be the relation defined on R by xRy if and only if sin(x) = sin(y) . Determine if R has each of the following
properties:
1. reflexive
2. symmetric
3. transitive
4. irreflexive
5. antisymmetric
Answer

The relation R in the previous exercise is a member of an important class of relations, the equivalence relations, studied later in this chapter.

Let R be the relation defined on R by x R y if and only if x^2 + y^2 ≤ 1. Determine if R has each of the following properties:
1. reflexive
2. symmetric
3. transitive
4. irreflexive
5. antisymmetric
Answer

This page titled 1.3: Relations is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

1.4: Partial Orders

Partial orders are a special class of relations that play an important role in probability theory.

Basic Theory

 Definitions

A partial order on a set S is a relation ⪯ on S that is reflexive, anti-symmetric, and transitive. The pair (S, ⪯) is called a
partially ordered set. So for all x, y, z ∈ S :
1. x ⪯ x , the reflexive property
2. If x ⪯ y and y ⪯ x then x = y , the antisymmetric property
3. If x ⪯ y and y ⪯ z then x ⪯ z , the transitive property

As the name and notation suggest, a partial order is a type of ordering of the elements of S . Partial orders occur naturally in many
areas of mathematics, including probability. A partial order on a set naturally gives rise to several other relations on the set.

Suppose that ⪯ is a partial order on a set S . The relations ⪰, ≺, ≻, ⊥, and ∥ are defined as follows:
1. x ⪰ y if and only if y ⪯ x .
2. x ≺ y if and only if x ⪯ y and x ≠ y .
3. x ≻ y if and only if y ≺ x .
4. x ⊥ y if and only if x ⪯ y or y ⪯ x .
5. x ∥ y if and only if neither x ⪯ y nor y ⪯ x .

Note that ⪰ is the inverse of ⪯, and ≻ is the inverse of ≺. Note also that x ⪯ y if and only if either x ≺ y or x = y, so the relation ≺ completely determines the relation ⪯. The relation ≺ is sometimes called a strict or strong partial order to distinguish it from the ordinary (weak) partial order ⪯. Finally, note that x ⊥ y means that x and y are related in the partial order, while x ∥ y means that x and y are unrelated in the partial order. Thus, the relations ⊥ and ∥ are complements of each other, as sets of ordered pairs. A total or linear order is a partial order in which there are no unrelated elements.

A partial order ⪯ on S is a total order or linear order if for every x, y ∈ S , either x ⪯ y or y ⪯ x .

Suppose that ⪯_1 and ⪯_2 are partial orders on a set S. Then ⪯_1 is a sub-order of ⪯_2, or equivalently ⪯_2 is an extension of ⪯_1, if and only if x ⪯_1 y implies x ⪯_2 y for x, y ∈ S.

Thus if ⪯_1 is a sub-order of ⪯_2, then as sets of ordered pairs, ⪯_1 is a subset of ⪯_2. We need one more relation that arises naturally from a partial order.

Suppose that ⪯ is a partial order on a set S. For x, y ∈ S, y is said to cover x if x ≺ y but no element z ∈ S satisfies x ≺ z ≺ y.

If S is finite, the covering relation completely determines the partial order, by virtue of the transitive property.

Suppose that ⪯ is a partial order on a finite set S . The covering graph or Hasse graph of (S, ⪯) is the directed graph with
vertex set S and directed edge set E , where (x, y) ∈ E if and only if y covers x.

Thus, x ≺ y if and only if there is a directed path in the graph from x to y . Hasse graphs are named for the German mathematician
Helmut Hasse. The graphs are often drawn with the edges directed upward. In this way, the directions can be inferred without
having to actually draw arrows.

Basic Examples
Of course, the ordinary order ≤ is a total order on the set of real numbers R. The subset partial order is one of the most important
in probability theory:

Suppose that S is a set. The subset relation ⊆ is a partial order on P(S) , the power set of S .
Proof

Here is a partial order that arises naturally from arithmetic.

Let ∣ denote the division relation on the set of positive integers N+. That is, m ∣ n if and only if there exists k ∈ N+ such that n = km. Then

1. ∣ is a partial order on N+.
2. ∣ is a sub-order of the ordinary order ≤.


Proof
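For a finite set of positive integers, the division partial order and its covering relation (the edge set of the Hasse graph) can be computed directly. A sketch, using the divisors of 12 as an arbitrary example set:

# Covering pairs of the division partial order on a finite set.
S = [1, 2, 3, 4, 6, 12]
divides = lambda m, n: n % m == 0

covers = [(m, n) for m in S for n in S
          if m != n and divides(m, n)
          and not any(divides(m, z) and divides(z, n) and z not in (m, n) for z in S)]
print(covers)  # [(1, 2), (1, 3), (2, 4), (2, 6), (3, 6), (4, 12), (6, 12)]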

The set of functions from a set into a partial ordered set can itself be partially ordered in a natural way.

Suppose that S is a set and that (T, ⪯_T) is a partially ordered set, and let 𝒮 denote the set of functions f : S → T. The relation ⪯ on 𝒮 defined by f ⪯ g if and only if f(x) ⪯_T g(x) for all x ∈ S is a partial order on 𝒮.

Proof

Note that we don't need a partial order on the domain S .

Basic Properties
The proofs of the following basic properties are straightforward. Be sure to try them yourself before reading the ones in the text.

The inverse of a partial order is also a partial order.


Proof

If ⪯ is a partial order on S and A is a subset of S , then the restriction of ⪯ to A is a partial order on A .


Proof

The following theorem characterizes relations that correspond to strict order.

Let S be a set. A relation ⪯ on S is a partial order if and only if the corresponding strict relation ≺ is transitive and irreflexive.
Proof

Monotone Sets and Functions


Partial orders form a natural setting for increasing and decreasing sets and functions. Here are the definitions:

Suppose that ⪯ is a partial order on a set S and that A ⊆ S . In the following definitions, x, y are arbitrary elements of S .
1. A is increasing if x ∈ A and x ⪯ y imply y ∈ A .
2. A is decreasing if y ∈ A and x ⪯ y imply x ∈ A .

Suppose that S is a set with partial order ⪯_S, T is a set with partial order ⪯_T, and that f : S → T. In the following definitions, x, y are arbitrary elements of S.

1. f is increasing if and only if x ⪯_S y implies f(x) ⪯_T f(y).
2. f is decreasing if and only if x ⪯_S y implies f(x) ⪰_T f(y).
3. f is strictly increasing if and only if x ≺_S y implies f(x) ≺_T f(y).
4. f is strictly decreasing if and only if x ≺_S y implies f(x) ≻_T f(y).

Recall the definition of the indicator function 1_A associated with a subset A of a universal set S: for x ∈ S, 1_A(x) = 1 if x ∈ A and 1_A(x) = 0 if x ∉ A.

Suppose that ⪯ is a partial order on a set S and that A ⊆ S. Then

1. A is increasing if and only if 1_A is increasing.
2. A is decreasing if and only if 1_A is decreasing.

Proof

Isomorphism
Two partially ordered sets (S, ⪯_S) and (T, ⪯_T) are said to be isomorphic if there exists a one-to-one function f from S onto T such that x ⪯_S y if and only if f(x) ⪯_T f(y), for all x, y ∈ S. The function f is an isomorphism.

Generally, a mathematical space often consists of a set and various structures defined in terms of the set, such as relations,
operators, or a collection of subsets. Loosely speaking, two mathematical spaces of the same type are isomorphic if there exists a
one-to-one function from one of the sets onto the other that preserves the structures, and again, the function is called an
isomorphism. The basic idea is that isomorphic spaces are mathematically identical, except for superficial matters of appearance.
The word isomorphism is from the Greek and means equal shape.

Suppose that the partially ordered sets (S, ⪯_S) and (T, ⪯_T) are isomorphic, and that f : S → T is an isomorphism. Then f and f^{−1} are strictly increasing.

Proof

In a sense, the subset partial order is universal: every partially ordered set is isomorphic to (𝒮, ⊆) for some collection of sets 𝒮.

Suppose that ⪯ is a partial order on a set S. Then there exists 𝒮 ⊆ P(S) such that (S, ⪯) is isomorphic to (𝒮, ⊆).
Proof

Extremal Elements
Various types of extremal elements play important roles in partially ordered sets. Here are the definitions:

Suppose that ⪯ is a partial order on a set S and that A ⊆ S .


1. An element a ∈ A is the minimum element of A if and only if a ⪯ x for every x ∈ A .
2. An element a ∈ A is a minimal element of A if and only if no x ∈ A satisfies x ≺ a .
3. An element b ∈ A is the maximum element of A if and only if b ⪰ x for every x ∈ A .
4. An element b ∈ A is a maximal element of A if and only if no x ∈ A satisfies x ≻ b .

In general, a set can have several maximal and minimal elements (or none). On the other hand,

The minimum and maximum elements of A , if they exist, are unique. They are denoted min(A) and max(A), respectively.
Proof

Minimal, maximal, minimum, and maximum elements of a set must belong to that set. The following definitions relate to upper
and lower bounds of a set, which do not have to belong to the set.

Suppose again that ⪯ is a partial order on a set S and that A ⊆ S . Then


1. An element u ∈ S is a lower bound for A if and only if u ⪯ x for every x ∈ A .
2. An element v ∈ S is an upper bound for A if and only if v ⪰ x for every x ∈ A .
3. The greatest lower bound or infimum of A , if it exists, is the maximum of the set of lower bounds of A .
4. The least upper bound or supremum of A , if it exists, is the minimum of the set of upper bounds of A .

By (20), the greatest lower bound of A is unique, if it exists. It is denoted glb(A) or inf(A) . Similarly, the least upper bound of A
is unique, if it exists, and is denoted lub(A) or sup(A) . Note that every element of S is a lower bound and an upper bound for ∅,
since the conditions in the definition hold vacuously.
The symbols ∧ and ∨ are also used for infimum and supremum, respectively, so ⋀A = inf(A) and ⋁A = sup(A) if they exist. In particular, for x, y ∈ S, operator notation is more commonly used, so x ∧ y = inf{x, y} and x ∨ y = sup{x, y}. Partially

ordered sets for which these elements always exist are important, and have a special name.

Suppose that ⪯ is a partial order on a set S . Then (S, ⪯) is a lattice if x ∧ y and x ∨ y exist for every x, y ∈ S .

For the subset partial order, the inf and sup operators correspond to intersection and union, respectively:

Let S be a set and consider the subset partial order ⊆ on P(S), the power set of S. Let 𝒜 be a nonempty subset of P(S), that is, a nonempty collection of subsets of S. Then

1. inf(𝒜) = ⋂𝒜
2. sup(𝒜) = ⋃𝒜
Proof

In particular, A ∧ B = A ∩ B and A ∨ B = A ∪ B , so (P(S), ⊆) is a lattice.

Consider the division partial order ∣ on the set of positive integers N+ and let A be a nonempty subset of N+.

1. inf(A) is the greatest common divisor of A , usually denoted gcd(A) in this context.
2. If A is infinite then sup(A) does not exist. If A is finite then sup(A) is the least common multiple of A , usually denoted
lcm(A) in this context.
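A brief computational illustration of inf and sup in the division lattice; the set A below is an arbitrary example, and math.lcm requires Python 3.9 or later.

import math
from functools import reduce

# In the division order on the positive integers, inf is gcd and
# (for finite sets) sup is lcm.
A = [4, 6, 10]
print(reduce(math.gcd, A))  # 2  = inf(A) in the division order
print(reduce(math.lcm, A))  # 60 = sup(A) in the division order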

Suppose that S is a set and that f : S → S . An element z ∈ S is said to be a fixed point of f if f (z) = z .

The following result explores a basic fixed point theorem for a partially ordered set. The theorem is important in the study of
cardinality.

Suppose that ⪯ is a partial order on a set S with the property that sup(A) exists for every A ⊆ S . If f : S → S is increasing,
then f has a fixed point.
Proof.

If ⪯ is a total order on a set S with the property that every nonempty subset of S has a minimum element, then S is said to be well ordered by ⪯. One of the most important examples is N+, which is well ordered by the ordinary order ≤. On the other hand, the well ordering principle, which is equivalent to the axiom of choice, states that every nonempty set can be well ordered.

Orders on Product Spaces


Suppose that S and T are sets with partial orders ⪯_S and ⪯_T respectively. Define the relation ⪯ on S × T by (x, y) ⪯ (z, w) if and only if x ⪯_S z and y ⪯_T w.

1. The relation ⪯ is a partial order on S × T, called, appropriately enough, the product order.
2. Suppose that (S, ⪯_S) = (T, ⪯_T). If S has at least 2 elements, then ⪯ is not a total order on S^2.

Proof

Figure 1.4.1: The product order on R^2. The region shaded red is the set of points ⪰ (x, y). The region shaded blue is the set of points ⪯ (x, y). The region shaded white is the set of points that are not comparable with (x, y).
Product order extends in a straightforward way to the Cartesian product of a finite or an infinite sequence of partially ordered spaces. For example, suppose that S_i is a set with partial order ⪯_i for each i ∈ {1, 2, …, n}, where n ∈ N+. The product order ⪯ on the product set S_1 × S_2 × ⋯ × S_n is defined as follows: for x = (x_1, x_2, …, x_n) and y = (y_1, y_2, …, y_n) in the product set, x ⪯ y if and only if x_i ⪯_i y_i for each i ∈ {1, 2, …, n}. We can generalize this further to arbitrary product sets. Suppose that S_i is a set for each i in a nonempty (but otherwise arbitrary) index set I. Recall that

∏_{i∈I} S_i = {x : x is a function from I into ⋃_{i∈I} S_i such that x(i) ∈ S_i for each i ∈ I}  (1.4.2)

To make the notation look more like a simple Cartesian product, we will write x_i instead of x(i) for the value of a function x in the product set at i ∈ I.

Suppose that S_i is a set with partial order ⪯_i for each i in a nonempty index set I. Define the relation ⪯ on ∏_{i∈I} S_i by x ⪯ y if and only if x_i ⪯_i y_i for each i ∈ I. Then ⪯ is a partial order on the product set, known again as the product order.

Proof

Note again that no assumptions are made on the index set I , other than it be nonempty. In particular, no order is necessary on I .
The next result gives a very different type of order on a product space.

Suppose again that S and T are sets with partial orders ⪯_S and ⪯_T respectively. Define the relation ⪯ on S × T by (x, y) ⪯ (z, w) if and only if either x ≺_S z, or x = z and y ⪯_T w.

1. The relation ⪯ is a partial order on S × T, called the lexicographic order or dictionary order.
2. If ⪯_S and ⪯_T are total orders on S and T, respectively, then ⪯ is a total order on S × T.

Proof

Figure 1.4.2: The lexicographic order on R^2. The region shaded red is the set of points ⪰ (x, y). The region shaded blue is the set of points ⪯ (x, y).


As with the product order, the lexicographic order can be generalized to a collection of partially ordered spaces. However, we need
the index set to be totally ordered.

Suppose that S_i is a set with partial order ⪯_i for each i in a nonempty index set I. Suppose also that ≤ is a total order on I. Define the relation ⪯ on the product set ∏_{i∈I} S_i as follows: x ≺ y if and only if there exists j ∈ I such that x_i = y_i if i < j and x_j ≺_j y_j. Then

1. ⪯ is a partial order on ∏_{i∈I} S_i, known again as the lexicographic order.
2. If ⪯_i is a total order for each i ∈ I, and I is well ordered by ≤, then ⪯ is a total order on ∏_{i∈I} S_i.

Proof

The term lexicographic comes from the way that we order words alphabetically: We look at the first letter; if these are different, we know how to order the words. If the first letters are the same, we look at the second letter; if these are different, we know how to order the words. We continue in this way until we find letters that are different, and we can order the words. In fact, the lexicographic order is sometimes referred to as the first difference order. Note also that if S_i is a set and ⪯_i a total order on S_i for i ∈ I, then by the well ordering principle, there exists a well ordering ≤ of I, and hence there exists a lexicographic total order on the product space ∏_{i∈I} S_i. As a mathematical structure, the lexicographic order is not as obscure as you might think.

(R, ≤) is isomorphic to the lexicographic product of (Z, ≤) with ([0, 1), ≤), where ≤ is the ordinary order for real numbers.
Proof
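As a practical aside, Python compares tuples in exactly this first-difference fashion, so the lexicographic order on finite products of totally ordered sets can be explored directly; a small sketch:

# Python tuple comparison is lexicographic.
print((1, 5) < (2, 0))   # True: first coordinates differ, 1 < 2
print((1, 5) < (1, 7))   # True: first coordinates tie, 5 < 7
print(sorted([(2, 0), (1, 7), (1, 5)]))  # [(1, 5), (1, 7), (2, 0)]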

Limits of Sequences of Real Numbers
Suppose that (a_1, a_2, …) is a sequence of real numbers.

The sequence inf{a_n, a_{n+1}, …} is increasing in n ∈ N+.

Since the sequence of infimums in the last result is increasing, the limit exists in R ∪ {∞}, and is called the limit inferior of the original sequence:

lim inf_{n→∞} a_n = lim_{n→∞} inf{a_n, a_{n+1}, …}  (1.4.3)

The sequence sup{a_n, a_{n+1}, …} is decreasing in n ∈ N+.

Since the sequence of supremums in the last result is decreasing, the limit exists in R ∪ {−∞}, and is called the limit superior of the original sequence:

lim sup_{n→∞} a_n = lim_{n→∞} sup{a_n, a_{n+1}, …}  (1.4.4)

Note that lim inf_{n→∞} a_n ≤ lim sup_{n→∞} a_n and equality holds if and only if lim_{n→∞} a_n exists (and is the common value).
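A numerical sketch of these definitions, using the example sequence a_n = (−1)^n (1 + 1/n), whose limit inferior is −1 and limit superior is 1; the truncation points below are arbitrary choices.

# Tail infima and suprema for a finite initial segment of the sequence.
a = [(-1) ** n * (1 + 1 / n) for n in range(1, 200)]

tail_inf = [min(a[n:]) for n in range(20)]   # increasing in n, approaching -1
tail_sup = [max(a[n:]) for n in range(20)]   # decreasing in n, approaching 1
print(tail_inf[-1], tail_sup[-1])            # close to -1 and 1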

Vector Spaces of Functions


Suppose that S is a nonempty set, and recall that the set V of functions f : S → R is a vector space, under the usual pointwise
definition of addition and scalar multiplication. As noted in (9), V is also a partially ordered set, under the pointwise partial order:
f ⪯ g if and only if f (x) ≤ g(x) for all x ∈ S . Consistent with the definitions (19), f ∈ V is bounded if there exists C ∈ (0, ∞)

such that |f (x)| ≤ C for all x ∈ S . Now let U denote the set of bounded functions f : S → R , and for f ∈ U define

∥f ∥ = sup{|f (x)| : x ∈ S} (1.4.5)

U is a vector subspace of V and ∥ ⋅ ∥ is a norm on U .


Proof

Appropriately enough, ∥ ⋅ ∥ is called the supremum norm on U . Vector spaces of bounded, real-valued functions, with the
supremum norm are especially important in probability and random processes. We will return to this discussion again in the
advanced sections on metric spaces and measure theory.

Computational Exercises
Let S = {2, 3, 4, 6, 12}.
1. Sketch the Hasse graph corresponding to the ordinary order ≤ on S .
2. Sketch the Hasse graph corresponding to the division partial order ∣ on S .
Answer

Consider the ordinary order ≤ on the set of real numbers R , and let A = [a, b) where a < b . Find each of the following that
exist:
1. The set of minimal elements of A
2. The set of maximal elements of A
3. min(A)
4. max(A)
5. The set of lower bounds of A
6. The set of upper bounds of A
7. inf(A)
8. sup(A)
Answer

Again consider the division partial order ∣ on the set of positive integers N+ and let A = {2, 3, 4, 6, 12} . Find each of the
following that exist:
1. The set of minimal elements of A
2. The set of maximal elements of A
3. min(A)
4. max(A)
5. The set of lower bounds of A
6. The set of upper bounds of A
7. inf(A)
8. sup(A) .
Answer

Let S = {a, b, c} .
1. Give P(S) in list form.
2. Describe the Hasse graph of (P(S), ⊆)
Answer

Let S = {a, b, c, d} .
1. Give P(S) in list form.
2. Describe the Hasse graph of (P(S), ⊆)
Answer

Suppose that A and B are subsets of a universal set S. Let 𝒜 denote the collection of the 16 subsets of S that can be constructed from A and B using the set operations. Show that (𝒜, ⊆) is isomorphic to the partially ordered set in the previous exercise. Use the Venn diagram app to help.
Proof

This page titled 1.4: Partial Orders is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

1.4.7 https://stats.libretexts.org/@go/page/10119
1.5: Equivalence Relations

Basic Theory

 Definitions
A relation ≈ on a nonempty set S that is reflexive, symmetric, and transitive is an equivalence relation on S . Thus, for all
x, y, z ∈ S ,
1. x ≈ x , the reflexive property.
2. If x ≈ y then y ≈ x , the symmetric property.
3. If x ≈ y and y ≈ z then x ≈ z , the transitive property.

As the name and notation suggest, an equivalence relation is intended to define a type of equivalence among the elements of S .
Like partial orders, equivalence relations occur naturally in most areas of mathematics, including probability.

Suppose that ≈ is an equivalence relation on S . The equivalence class of an element x ∈ S is the set of all elements that are
equivalent to x, and is denoted

[x] = {y ∈ S : y ≈ x} (1.5.1)

Results
The most important result is that an equivalence relation on a set S defines a partition of S , by means of the equivalence classes.

Suppose that ≈ is an equivalence relation on a set S .


1. If x ≈ y then [x] = [y].
2. If x ≉ y then [x] ∩ [y] = ∅ .
3. The collection of (distinct) equivalence classes is a partition of S into nonempty sets.
Proof

Sometimes the set 𝒮 of equivalence classes is denoted S/≈. The idea is that the equivalence classes are new “objects” obtained
by “identifying” elements in S that are equivalent. Conversely, every partition of a set defines an equivalence relation on the set.

Suppose that 𝒮 is a collection of nonempty sets that partition a given set S. Define the relation ≈ on S by x ≈ y if and only if x ∈ A and y ∈ A for some A ∈ 𝒮.

1. ≈ is an equivalence relation.
2. 𝒮 is the set of equivalence classes.
Proof

Figure 1.5.1 : A partition of S . Any two points in the same partition set are equivalent.
Sometimes the equivalence relation ≈ associated with a given partition 𝒮 is denoted S/𝒮. The idea, of course, is that elements in the same set of the partition are equivalent.

The process of forming a partition from an equivalence relation, and the process of forming an equivalence relation from a
partition are inverses of each other.

1. If we start with an equivalence relation ≈ on S, form the associated partition, and then construct the equivalence relation associated with the partition, then we end up with the original equivalence relation. In modular notation, S/(S/≈) is the same as ≈.
2. If we start with a partition 𝒮 of S, form the associated equivalence relation, and then form the partition associated with the equivalence relation, then we end up with the original partition. In modular notation, S/(S/𝒮) is the same as 𝒮.

Suppose that S is a nonempty set. The most basic equivalence relation on S is the equality relation =. In this case [x] = {x} for
each x ∈ S . At the other extreme is the trivial relation ≈ defined by x ≈ y for all x, y ∈ S . In this case S is the only equivalence
class.
Every function f defines an equivalence relation on its domain, known as the equivalence relation associated with f . Moreover,
the equivalence classes have a simple description in terms of the inverse images of f .

Suppose that f : S → T . Define the relation ≈ on S by x ≈ y if and only if f (x) = f (y).


1. The relation ≈ is an equivalence relation on S.
2. The set of equivalence classes is 𝒮 = {f^{−1}{t} : t ∈ range(f)}.
3. The function F : 𝒮 → T defined by F([x]) = f(x) is well defined and is one-to-one.


Proof

Figure 1.5.2 : The equivalence relation on S associated with f : S → T

Suppose again that f : S → T .


1. If f is one-to-one then the equivalence relation associated with f is the equality relation, and hence [x] = {x} for each
x ∈ S.

2. If f is a constant function then the equivalence relation associated with f is the trivial relation, and hence S is the only
equivalence class.
Proof

Equivalence relations associated with functions are universal: every equivalence relation is of this form:

Suppose that ≈ is an equivalence relation on a set S . Define f : S → P(S) by f (x) = [x]. Then ≈ is the equivalence relation
associated with f .
Proof

The intersection of two equivalence relations is another equivalence relation.

Suppose that ≈ and ≅ are equivalence relations on a set S. Let ≡ denote the intersection of ≈ and ≅ (thought of as subsets of S × S). Equivalently, x ≡ y if and only if x ≈ y and x ≅ y.

1. ≡ is an equivalence relation on S.
2. [x]_≡ = [x]_≈ ∩ [x]_≅.

Suppose that we have a relation that is reflexive and transitive, but fails to be a partial order because it's not anti-symmetric. The
relation and its inverse naturally lead to an equivalence relation, and then in turn, the original relation defines a true partial order on
the equivalence classes. This is a common construction, and the details are given in the next theorem.

Suppose that ⪯ is a relation on a set S that is reflexive and transitive. Define the relation ≈ on S by x ≈ y if and only if x ⪯ y
and y ⪯ x .
1. ≈ is an equivalence relation on S .
2. If A and B are equivalence classes and x ⪯ y for some x ∈ A and y ∈ B , then u ⪯ v for all u ∈ A and v ∈ B .

1.5.2 https://stats.libretexts.org/@go/page/10120
3. Define the relation ⪯ on the collection of equivalence classes 𝒮 by A ⪯ B if and only if x ⪯ y for some (and hence all) x ∈ A and y ∈ B. Then ⪯ is a partial order on 𝒮.

Proof

A prime example of the construction in the previous theorem occurs when we have a function whose range space is partially
ordered. We can construct a partial order on the equivalence classes in the domain that are associated with the function.

Suppose that S and T are sets and that ⪯_T is a partial order on T. Suppose also that f : S → T. Define the relation ⪯_S on S by x ⪯_S y if and only if f(x) ⪯_T f(y).

1. ⪯_S is reflexive and transitive.
2. The equivalence relation on S constructed in (10) is the equivalence relation associated with f, as in (6).
3. ⪯_S can be extended to a partial order on the equivalence classes corresponding to f.

Proof

Examples and Applications


Simple functions
Give the equivalence classes explicitly for the functions from R into R defined below:
1. f(x) = x^2.
2. g(x) = ⌊x⌋.
3. h(x) = sin(x).
Answer

Calculus

Suppose that I is a fixed interval of R, and that S is the set of differentiable functions from I into R. Consider the equivalence relation associated with the derivative operator D on S, so that D(f) = f′. For f ∈ S, give a simple description of [f].

Answer

Congruence
Recall the division relation ∣ from N+ to Z: For d ∈ N+ and n ∈ Z, d ∣ n means that n = kd for some k ∈ Z. In words, d divides n, or equivalently, n is a multiple of d. In the previous section, we showed that ∣ is a partial order on N+.

Fix d ∈ N+.

1. Define the relation ≡_d on Z by m ≡_d n if and only if d ∣ (n − m). The relation ≡_d is known as congruence modulo d.
2. Let r_d : Z → {0, 1, …, d − 1} be defined so that r_d(n) is the remainder when n is divided by d.

Recall that by the Euclidean division theorem, named for Euclid of course, n ∈ Z can be written uniquely in the form n = kd + q where k ∈ Z and q ∈ {0, 1, …, d − 1}, and then r_d(n) = q.

Congruence modulo d.

1. ≡_d is the equivalence relation associated with the function r_d.
2. There are d distinct equivalence classes, given by [q]_d = {q + kd : k ∈ Z} for q ∈ {0, 1, …, d − 1}.

Proof
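A small sketch of the remainder function and the resulting classes, using d = 3 and a finite window of integers as arbitrary example choices; note that Python's % operator returns the Euclidean remainder in {0, …, d − 1} for a positive modulus, matching r_d.

# Equivalence classes of congruence modulo d on a finite window of Z.
d = 3
r = lambda n: n % d

window = range(-6, 7)
classes = {q: [n for n in window if r(n) == q] for q in range(d)}
print(classes)
# {0: [-6, -3, 0, 3, 6], 1: [-5, -2, 1, 4], 2: [-4, -1, 2, 5]}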

Explicitly give the equivalence classes for ≡_4, congruence mod 4.

Answer

Linear Algebra
Linear algebra provides several examples of important and interesting equivalence relations. To set the stage, let R^{m×n} denote the set of m × n matrices with real entries, for m, n ∈ N+.

Recall that the following are row operations on a matrix:

1. Multiply a row by a non-zero real number.
2. Interchange two rows.
3. Add a multiple of a row to another row.

Row operations are essential for inverting matrices and solving systems of linear equations.

Matrices A, B ∈ R^{m×n} are row equivalent if A can be transformed into B by a finite sequence of row operations. Row equivalence is an equivalence relation on R^{m×n}.

Proof.

Our next relation involves similarity, which is very important in the study of linear transformations, change of basis, and the theory
of eigenvalues and eigenvectors.

Matrices A, B ∈ R^{n×n} are similar if there exists an invertible P ∈ R^{n×n} such that P^{−1}AP = B. Similarity is an equivalence relation on R^{n×n}.

Proof

Next recall that for A ∈ R^{m×n}, the transpose of A is the matrix A^T ∈ R^{n×m} with the property that the (i, j) entry of A^T is the (j, i) entry of A, for i ∈ {1, 2, …, n} and j ∈ {1, 2, …, m}. Simply stated, A^T is the matrix whose rows are the columns of A. For the theorem that follows, we need to remember that (AB)^T = B^T A^T for A ∈ R^{m×n} and B ∈ R^{n×k}, and (A^T)^{−1} = (A^{−1})^T if A ∈ R^{n×n} is invertible.

Matrices A, B ∈ R^{n×n} are congruent if there exists an invertible P ∈ R^{n×n} such that B = P^T AP. Congruence is an equivalence relation on R^{n×n}.

Proof

Congruence is important in the study of orthogonal matrices and change of basis. Of course, the term congruence applied to
matrices should not be confused with the same term applied to integers.
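A numerical spot check that similarity is symmetric, sketched with NumPy; the matrices A and P below are arbitrary examples, and this illustrates the proof idea on one instance rather than giving a general verification.

import numpy as np

# If B = P^{-1} A P, then A = Q^{-1} B Q with Q = P^{-1}.
A = np.array([[2.0, 1.0], [0.0, 3.0]])
P = np.array([[1.0, 1.0], [0.0, 1.0]])   # invertible

B = np.linalg.inv(P) @ A @ P
Q = np.linalg.inv(P)
print(np.allclose(np.linalg.inv(Q) @ B @ Q, A))  # True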

Number Systems
Equivalence relations play an important role in the construction of complex mathematical structures from simpler ones. Often the
objects in the new structure are equivalence classes of objects constructed from the simpler structures, modulo an equivalence
relation that captures the essential properties of the new objects.
The construction of number systems is a prime example of this general idea. The next exercise explores the construction of rational
numbers from integers.

Define a relation ≈ on Z × N+ by (j, k) ≈ (m, n) if and only if jn = km.

1. ≈ is an equivalence relation.
2. Define m/n = [(m, n)], the equivalence class generated by (m, n), for m ∈ Z and n ∈ N+. This definition captures the essential properties of the rational numbers.


Proof
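A minimal sketch of the equivalence test, alongside Python's Fraction class, which performs the same identification by reducing pairs to lowest terms; the sample pairs are arbitrary.

from fractions import Fraction

# (j, k) ~ (m, n) iff j*n == k*m: the pairs represent the same rational.
equivalent = lambda j, k, m, n: j * n == k * m
print(equivalent(1, 2, 3, 6))            # True: 1/2 and 3/6 are in the same class
print(Fraction(1, 2) == Fraction(3, 6))  # True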

This page titled 1.5: Equivalence Relations is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

1.6: Cardinality

Basic Theory
Definitions
Suppose that 𝒮 is a non-empty collection of sets. We define a relation ≈ on 𝒮 by A ≈ B if and only if there exists a one-to-one function f from A onto B. The relation ≈ is an equivalence relation on 𝒮. That is, for all A, B, C ∈ 𝒮,
1. A ≈ A, the reflexive property
2. If A ≈ B then B ≈ A, the symmetric property
3. If A ≈ B and B ≈ C then A ≈ C, the transitive property
Proof

A one-to-one function f from A onto B is sometimes called a bijection. Thus if A ≈ B then A and B are in one-to-one
correspondence and are said to have the same cardinality. The equivalence classes under this equivalence relation capture the
notion of having the same number of elements.
Let N_0 = ∅, and for k ∈ N+, let N_k = {0, 1, …, k − 1}. As always, N = {0, 1, 2, …} is the set of all natural numbers.

Suppose that A is a set.


1. A is finite if A ≈ N_k for some k ∈ N, in which case k is the cardinality of A, and we write #(A) = k.

2. A is infinite if A is not finite.


3. A is countably infinite if A ≈ N .
4. A is countable if A is finite or countably infinite.
5. A is uncountable if A is not countable.

In part (a), think of N_k as a reference set with k elements; any other set with k elements must be equivalent to this one. We will study the cardinality of finite sets in the next two sections on Counting Measure and Combinatorial Structures. In this section, we will concentrate primarily on infinite sets. In part (d), a countable set is one that can be enumerated or counted by putting the elements into one-to-one correspondence with N_k for some k ∈ N or with all of N. An uncountable set is one that cannot be so counted. Countable sets play a special role in probability theory, as in many other branches of mathematics. A priori, it's not clear that there are uncountable sets, but we will soon see examples.

Preliminary Examples
If S is a set, recall that P(S) denotes the power set of S (the set of all subsets of S). If A and B are sets, then A^B is the set of all functions from B into A. In particular, {0, 1}^S denotes the set of functions from S into {0, 1}.

If S is a set then P(S) ≈ {0, 1}^S.

Proof

Next are some examples of countably infinite sets.

The following sets are countably infinite:


1. The set of even natural numbers E = {0, 2, 4, …}
2. The set of integers Z
Proof

At one level, it might seem that E has only half as many elements as N while Z has twice as many elements as N. As the previous result shows, that point of view is incorrect: N, E, and Z all have the same cardinality (and are countably infinite). The next example shows that there are indeed uncountable sets.

If A is a set with at least two elements then S = A^N, the set of all functions from N into A, is uncountable.

Proof

Subsets of Infinite Sets
Surely a set must be at least as large as any of its subsets, in terms of cardinality. On the other hand, by example (4), the set of
natural numbers N, the set of even natural numbers E and the set of integers Z all have exactly the same cardinality, even though
E ⊂ N ⊂ Z . In this subsection, we will explore some interesting and somewhat paradoxical results that relate to subsets of infinite

sets. Along the way, we will see that the countable infinity is the “smallest” of the infinities.

If S is an infinite set then S has a countably infinite subset.


Proof

A set S is infinite if and only if S is equivalent to a proper subset of S .


Proof

When S was infinite in the proof of the previous result, not only did we map S one-to-one onto a proper subset, we actually threw
away a countably infinite subset and still maintained equivalence. Similarly, we can add a countably infinite set to an infinite set S
without changing the cardinality.

If S is an infinite set and B is a countable set, then S ≈ S ∪ B .


Proof

In particular, if S is uncountable and B is countable then S ∪ B and S ∖ B have the same cardinality as S , and in particular are
uncountable. In terms of the dichotomies finite-infinite and countable-uncountable, a set is indeed at least as large as a subset. First
we need a preliminary result.

If S is countably infinite and A ⊆ S then A is countable.


Proof

Suppose that A ⊆ B .
1. If B is finite then A is finite.
2. If A is infinite then B is infinite.
3. If B is countable then A is countable.
4. If A is uncountable then B is uncountable.
Proof

Comparisons by one-to-one and onto functions


We will look deeper at the general question of when one set is “at least as big” as another, in the sense of cardinality. Not
surprisingly, this will eventually lead to a partial order on the cardinality equivalence classes.
First note that if there exists a function that maps a set A one-to-one into a set B , then in a sense, there is a copy of A contained in
B . Hence B should be at least as large as A .

Suppose that f : A → B is one-to-one.


1. If B is finite then A is finite.
2. If A is infinite then B is infinite.
3. If B is countable then A is countable.
4. If A is uncountable then B is uncountable.
Proof

On the other hand, if there exists a function that maps a set A onto a set B , then in a sense, there is a copy of B contained in A .
Hence A should be at least as large as B .

Suppose that f : A → B is onto.


1. If A is finite then B is finite.
2. If B is infinite then A is infinite.

1.6.2 https://stats.libretexts.org/@go/page/10121
3. If A is countable then B is countable.
4. If B is uncountable then A is uncountable.
Proof

The previous exercise also could be proved from the one before, since if there exists a function f mapping A onto B , then there
exists a function g mapping B one-to-one into A . This duality is proven in the discussion of the axiom of choice. A simple and
useful corollary of the previous two theorems is that if B is a given countably infinite set, then a set A is countable if and only if
there exists a one-to-one function f from A into B , if and only if there exists a function g from B onto A .

If A_i is a countable set for each i in a countable index set I, then ⋃_{i∈I} A_i is countable.
Proof

If A and B are countable then A × B is countable.


Proof

The last result could also be proven from the one before, by noting that

A × B = ⋃_{a∈A} {a} × B  (1.6.1)

Both proofs work because the set M is essentially a copy of N × N , embedded inside of N. The last theorem generalizes to the
statement that a finite product of countable sets is still countable. But, from (5), a product of infinitely many sets (with at least 2
elements each) will be uncountable.
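One explicit way to embed N × N in N is the Cantor pairing function; whether this is exactly the map used in the hidden proofs is not shown here, so treat it as an illustrative choice. A short sketch, with the range bound 50 an arbitrary truncation:

# The Cantor pairing function maps N x N one-to-one into N.
def pair(m, n):
    return (m + n) * (m + n + 1) // 2 + n

codes = {pair(m, n) for m in range(50) for n in range(50)}
print(len(codes))  # 2500: all 50 * 50 pairs receive distinct codes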

The set of rational numbers Q is countably infinite.


Proof

A real number is algebraic if it is the root of a polynomial function (of degree 1 or more) with integer coefficients. Rational
numbers are algebraic, as are rational roots of rational numbers (when defined). Moreover, the algebraic numbers are closed under
addition, multiplication, and division. A real number is transcendental if it's not algebraic. The numbers e and π are transcendental,
but we don't know very many other transcendental numbers by name. However, as we will see, most (in the sense of cardinality)
real numbers are transcendental.

The set of algebraic numbers A is countably infinite.


Proof

Now let's look at some uncountable sets.

The interval [0, 1) is uncountable.


Proof

The following sets have the same cardinality, and in particular all are uncountable:
1. R, the set of real numbers.
2. Any interval I of R, as long as the interval is not empty or a single point.
3. R ∖ Q , the set of irrational numbers.
4. R ∖ A , the set of transcendental numbers.
5. P(N), the power set of N.
Proof

The Cardinality Partial Order


Suppose that 𝒮 is a nonempty collection of sets. We define the relation ⪯ on 𝒮 by A ⪯ B if and only if there exists a one-to-one function f from A into B, if and only if there exists a function g from B onto A. In light of the previous subsection, A ⪯ B should capture the notion that B is at least as big as A, in the sense of cardinality.

The relation ⪯ is reflexive and transitive.

1.6.3 https://stats.libretexts.org/@go/page/10121
Proof

Thus, we can use the construction in the section on on Equivalence Relations to first define an equivalence relation on S , and then
extend ⪯ to a true partial order on the collection of equivalence classes. The only question that remains is whether the equivalence
relation we obtain in this way is the same as the one that we have been using in our study of cardinality. Rephrased, the question is
this: If there exists a one-to-one function from A into B and a one-to-one function from B into A , does there necessarily exist a
one-to-one function from A onto B ? Fortunately, the answer is yes; the result is known as the Schröder-Bernstein Theorem, named
for Ernst Schröder and Felix Bernstein.

If A ⪯ B and B ⪯ A then A ≈ B .
Proof

We will write A ≺ B if A ⪯ B but A ≉ B. That is, there exists a one-to-one function from A into B, but there does not exist a function from A onto B. Note that ≺ would have its usual meaning if applied to the equivalence classes. That is, [A] ≺ [B] if and only if [A] ⪯ [B] but [A] ≠ [B]. Intuitively, of course, A ≺ B means that B is strictly larger than A, in the sense of cardinality.

A ≺B in each of the following cases:


1. A and B are finite and #(A) < #(B) .
2. A is finite and B is countably infinite.
3. A is countably infinite and B is uncountable.

We close our discussion with the observation that for any set, there is always a larger set.

If S is a set then S ≺ P(S) .


Proof

The proof that a set cannot be mapped onto its power set is similar to the Russell paradox, named for Bertrand Russell.
The continuum hypothesis is the statement that there is no set whose cardinality is strictly between that of N and R. The continuum
hypothesis actually started out as the continuum conjecture, until it was shown to be consistent with the usual axioms of the real
number system (by Kurt Gödel in 1940), and independent of those axioms (by Paul Cohen in 1963).

Assuming the continuum hypothesis, if S is uncountable then there exists A ⊆ S such that A and A^c are uncountable.

Proof

There is a more complicated proof of the last result, without the continuum hypothesis and just using the axiom of choice.

This page titled 1.6: Cardinality is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

1.7: Counting Measure

Basic Theory
For our first discussion, we assume that the universal set S is finite. Recall the following definition from the section on cardinality.

For A ⊆ S , the cardinality of A is the number of elements in A , and is denoted #(A) . The function # on P(S) is called
counting measure.

Counting measure plays a fundamental role in discrete probability structures, and particularly those that involve sampling from a
finite set. The set S is typically very large, hence efficient counting methods are essential. The first combinatorial problem is
attributed to the Greek mathematician Xenocrates.
In many cases, a set of objects can be counted by establishing a one-to-one correspondence between the given set and some other
set. Naturally, the two sets have the same number of elements, but for various reasons, the second set may be easier to count.

The Addition Rule


The addition rule of combinatorics is simply the additivity axiom of counting measure.

If {A_1, A_2, …, A_n} is a collection of disjoint subsets of S then

#(⋃_{i=1}^n A_i) = ∑_{i=1}^n #(A_i)  (1.7.1)

Figure 1.7.1 : The addition rule


The following counting rules are simple consequences of the addition rule. Be sure to try the proofs yourself before reading the
ones in the text.
#(A^c) = #(S) − #(A). This is the complement rule.
Proof

Figure 1.7.2 : The complement rule

#(B ∖ A) = #(B) − #(A ∩ B) . This is the difference rule.


Proof

If A ⊆ B then #(B ∖ A) = #(B) − #(A) . This is the proper difference rule.


Proof

If A ⊆ B then #(A) ≤ #(B) .


Proof

Thus, # is an increasing function, relative to the subset partial order ⊆ on P(S) , and the ordinary order ≤ on N.

Inequalities
Our next discussion concerns two inequalities that are useful for obtaining bounds on the number of elements in a set. The first is Boole's inequality (named after George Boole), which gives an upper bound on the cardinality of a union.

If {A_1, A_2, …, A_n} is a finite collection of subsets of S then

#(⋃_{i=1}^n A_i) ≤ ∑_{i=1}^n #(A_i)  (1.7.2)

Proof

Intuitively, Boole's inequality holds because parts of the union have been counted more than once in the expression on the right.
The second inequality is Bonferroni's inequality (named after Carlo Bonferroni), which gives a lower bound on the cardinality of
an intersection.

If {A_1, A_2, …, A_n} is a finite collection of subsets of S then

#(⋂_{i=1}^n A_i) ≥ #(S) − ∑_{i=1}^n [#(S) − #(A_i)]  (1.7.4)

Proof

The Inclusion-Exclusion Formula


The inclusion-exclusion formula gives the cardinality of a union of sets in terms of the cardinality of the various intersections of the
sets. The formula is useful because intersections are often easier to count. We start with the special cases of two sets and three sets.
As usual, we assume that the sets are subsets of a finite universal set S .

If A and B are subsets of S then #(A ∪ B) = #(A) + #(B) − #(A ∩ B) .


Proof

Figure 1.7.3 : The inclusion-exclusion theorem for two sets

If A , B , C are subsets of S then


#(A ∪ B ∪ C ) = #(A) + #(B) + #(C ) − #(A ∩ B) − #(A ∩ C ) − #(B ∩ C ) + #(A ∩ B ∩ C ) .
Proof

Figure 1.7.4 : The inclusion-exclusion theorem for three sets


The inclusion-exclusion rule for two and three sets can be generalized to a union of n sets; the generalization is known as the
(general) inclusion-exclusion formula.

Suppose that {A_i : i ∈ I} is a collection of subsets of S where I is an index set with #(I) = n. Then

#(⋃_{i∈I} A_i) = ∑_{k=1}^n (−1)^{k−1} ∑_{J⊆I, #(J)=k} #(⋂_{j∈J} A_j)  (1.7.6)

Proof
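The formula is easy to spot-check by brute force on small examples; a sketch with three arbitrary example sets:

from itertools import combinations

# Verify inclusion-exclusion: |union| equals the alternating sum of
# intersection sizes over all nonempty subcollections.
sets = [{1, 2, 3, 4}, {3, 4, 5}, {1, 4, 5, 6}]
n = len(sets)

union_size = len(set().union(*sets))
ie = sum((-1) ** (k - 1) * sum(len(set.intersection(*J)) for J in combinations(sets, k))
         for k in range(1, n + 1))
print(union_size, ie)  # 6 6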

The general Bonferroni inequalities, named again for Carlo Bonferroni, state that if the sum on the right is truncated after k terms (k < n), then the truncated sum is an upper bound for the cardinality of the union if k is odd (so that the last term has a positive sign) and is a lower bound for the cardinality of the union if k is even (so that the last term has a negative sign).

The Multiplication Rule


The multiplication rule of combinatorics is based on the formulation of a procedure (or algorithm) that generates the objects to be
counted.

Suppose that a procedure consists of k steps, performed sequentially, and that for each j ∈ {1, 2, …, k}, step j can be performed in n_j ways, regardless of the choices made on the previous steps. Then the number of ways to perform the entire procedure is n_1 n_2 ⋯ n_k.

The key to a successful application of the multiplication rule to a counting problem is the clear formulation of an algorithm that
generates the objects being counted, so that each object is generated once and only once. That is, we must neither over count nor
under count. It's also important to notice that the set of choices available at step j may well depend on the previous steps; the
assumption is only that the number of choices available does not depend on the previous steps.
The first two results below give equivalent formulations of the multiplication principle.

Suppose that S is a set of sequences of length k, and that we denote a generic element of S by (x_1, x_2, …, x_k). Suppose that
for each j ∈ {1, 2, …, k}, x_j has n_j different values, regardless of the values of the previous coordinates. Then
#(S) = n_1 n_2 ⋯ n_k.

Proof

Suppose that T is an ordered tree with depth k and that each vertex at level i − 1 has n_i children for i ∈ {1, 2, …, k}. Then
the number of endpoints of the tree is n_1 n_2 ⋯ n_k.

Proof

Product Sets
If S_i is a set with n_i elements for i ∈ {1, 2, …, k} then

#(S_1 × S_2 × ⋯ × S_k) = n_1 n_2 ⋯ n_k  (1.7.10)

Proof

If S is a set with n elements, then S^k has n^k elements.

Proof

In (16), note that the elements of S^k can be thought of as ordered samples of size k that can be chosen with replacement from a
population of n objects. Elements of {0, 1}^n are sometimes called bit strings of length n. Thus, there are 2^n bit strings of length n.
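Both counts are easy to verify with the standard library. A minimal Python sketch (purely illustrative), using itertools.product:

from itertools import product

S1, S2, S3 = {'a', 'b'}, {0, 1, 2}, {True, False}
assert len(list(product(S1, S2, S3))) == len(S1) * len(S2) * len(S3)  # 2 * 3 * 2 = 12

bit_strings = list(product((0, 1), repeat=5))  # the set {0, 1}^5
assert len(bit_strings) == 2 ** 5              # 2^n bit strings of length n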

Functions
The number of functions from a set A of m elements into a set B of n elements is n^m.

Proof

Recall that the set of functions from a set A into a set B (regardless of whether the sets are finite or infinite) is denoted B^A. This
theorem is motivation for the notation. Note also that if S is a set with n elements, then the elements in the Cartesian power set S^k
can be thought of as functions from {1, 2, …, k} into S. So the counting formula for sequences can be thought of as a corollary of
the counting formula for functions.

Subsets
Suppose that S is a set with n elements, where n ∈ N. There are 2^n subsets of S.

Proof from the multiplication principle


Proof using indicator functions
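The proof from the multiplication principle is easy to animate in code: each subset corresponds to a bit string of length n, where bit i indicates whether the i-th element belongs to the subset. A minimal Python sketch (the function name subsets is ours):

def subsets(S):
    # One subset per bit string of length n, so 2^n subsets in all.
    elems = list(S)
    for bits in range(2 ** len(elems)):
        yield {elems[i] for i in range(len(elems)) if (bits >> i) & 1}

assert sum(1 for A in subsets({1, 2, 3, 4})) == 2 ** 4  # 16 subsets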

Suppose that {A_1, A_2, …, A_k} is a collection of k subsets of a set S, where k ∈ N_+. There are 2^(2^k) different (in general) sets
that can be constructed from the k given sets, using the operations of union, intersection, and complement. These sets form the
algebra generated by the given sets.
Proof

Open the Venn diagram app.


1. Select each of the 4 disjoint sets A ∩ B, A ∩ B^c, A^c ∩ B, A^c ∩ B^c.

2. Select each of the 12 other subsets of S . Note how each is a union of some of the sets in (a).

Suppose that S is a set with n elements and that A is a subset of S with k elements, where n, k ∈ N and k ≤ n. The number
of subsets of S that contain A is 2^(n−k).

Proof

Our last result in this discussion generalizes the basic subset result above.

Suppose that n, k ∈ N and that S is a set with n elements. The number of sequences of subsets (A_1, A_2, …, A_k) with
A_1 ⊆ A_2 ⊆ ⋯ ⊆ A_k ⊆ S is (k + 1)^n.

Proof

When k = 1 we get 2^n as the number of subsets of S, as before.
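For small n and k the count (k + 1)^n can be confirmed by brute force. A minimal Python sketch (purely illustrative; names ours):

from itertools import combinations, product

def all_subsets(S):
    S = list(S)
    return [set(c) for r in range(len(S) + 1) for c in combinations(S, r)]

S, k = {1, 2, 3}, 2
chains = [seq for seq in product(all_subsets(S), repeat=k)
          if all(seq[i] <= seq[i + 1] for i in range(k - 1))]  # <= is the subset test
assert len(chains) == (k + 1) ** len(S)  # 3^3 = 27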

Computational Exercises
Identification Numbers
A license number consists of two letters (uppercase) followed by five digits. How many different license numbers are there?
Answer

Suppose that a Personal Identification Number (PIN) is a four-symbol code word in which each entry is either a letter
(uppercase) or a digit. How many PINs are there?
Answer

Cards, Dice, and Coins


In the board game Clue, Mr. Boddy has been murdered. There are 6 suspects, 6 possible weapons, and 9 possible rooms for the
murder.
1. The game includes a card for each suspect, each weapon, and each room. How many cards are there?
2. The outcome of the game is a sequence consisting of a suspect, a weapon, and a room (for example, Colonel Mustard with
the knife in the billiard room). How many outcomes are there?
3. Once the three cards that constitute the outcome have been randomly chosen, the remaining cards are dealt to the players.
Suppose that you are dealt 5 cards. In trying to guess the outcome, what hand of cards would be best?
Answer

An experiment consists of rolling a standard die, drawing a card from a standard deck, and tossing a standard coin. How many
outcomes are there?
Answer

A standard die is rolled 5 times and the sequence of scores recorded. How many outcomes are there?
Answer

In the card game Set, each card has 4 properties: number (one, two, or three), shape (diamond, oval, or squiggle), color (red,
blue, or green), and shading (solid, open, or striped). The deck has one card of each (number, shape, color, shading)
configuration. A set in the game is defined as a set of three cards such that, for each property, the cards are either all the same or
all different.

1. How many cards are in a deck?
2. How many sets are there?
Answer

A coin is tossed 10 times and the sequence of scores recorded. How many sequences are there?
Answer

The die-coin experiment consists of rolling a die and then tossing a coin the number of times shown on the die. The sequence
of coin results is recorded.
1. How many outcomes are there?
2. How many outcomes are there with all heads?
3. How many outcomes are there with exactly one head?
Answer

Run the die-coin experiment 100 times and observe the outcomes.

Consider a deck of cards as a set D with 52 elements.


1. How many subsets of D are there?
2. How many functions are there from D into the set {1, 2, 3, 4}?
Answer

Birthdays
Consider a group of 10 persons.
1. If we record the birth month of each person, how many outcomes are there?
2. If we record the birthday of each person (ignoring leap day), how many outcomes are there?
Answer

Reliability
In the usual model of structural reliability, a system consists of components, each of which is either working or defective. The
system as a whole is also either working or defective, depending on the states of the components and how the components are
connected.

A string of lights has 20 bulbs, each of which may be good or defective. How many configurations are there?
Answer

If the components are connected in series, then the system as a whole is working if and only if each component is working. If the
components are connected in parallel, then the system as a whole is working if and only if at least one component is working.

A system consists of three subsystems with 6, 5, and 4 components, respectively. Find the number of component states for
which the system is working in each of the following cases:
1. The components in each subsystem are in parallel and the subsystems are in series.
2. The components in each subsystem are in series and the subsystems are in parallel.
Answer

Menus
Suppose that a sandwich at a restaurant consists of bread, meat, cheese, and various toppings. There are 4 choices for the bread,
3 choices for the meat, 5 choices for the cheese, and 10 different toppings (each of which may be chosen). How many
sandwiches are there?
Answer

At a wedding dinner, there are three choices for the entrée, four choices for the beverage, and two choices for the dessert.

1. How many different meals are there?
2. If there are 50 guests at the wedding and we record the meal requested for each guest, how many possible outcomes are
there?
Answer

Braille
Braille is a tactile writing system used by people who are visually impaired. The system is named for the French educator
Louis Braille and uses raised dots in a 3 × 2 grid to encode characters. How many meaningful Braille configurations are there?
Answer

Figure 1.7.5 : The Braille encoding of the number 2 and the letter b

Personality Typing

The Myers-Briggs personality typing is based on four dichotomies: A person is typed as either extroversion (E) or
introversion (I), either sensing (S) or intuition (N), either thinking (T) or feeling (F), and either judgement (J) or perception (P).
1. How many Myers-Briggs personality types are there? List them.
2. Suppose that we list the personality types of 10 persons. How many possible outcomes are there?
Answer

The Galton Board


The Galton Board, named after Francis Galton, is a triangular array of pegs. Galton, apparently too modest to name the device after
himself, called it a quincunx from the Latin word for five twelfths (go figure). The rows are numbered, from the top down, by
(0, 1, …). Row n has n + 1 pegs that are labeled, from left to right by (0, 1, … , n). Thus, a peg can be uniquely identified by an

ordered pair (n, k) where n is the row number and k is the peg number in that row.
A ball is dropped onto the top peg (0, 0) of the Galton board. In general, when the ball hits peg (n, k), it either bounces to the left
to peg (n + 1, k) or to the right to peg (n + 1, k + 1) . The sequence of pegs that the ball hits is a path in the Galton board.

There is a one-to-one correspondence between each pair of the following three collections:
1. Bit strings of length n
2. Paths in the Galton board from (0, 0) to any peg in row n .
3. Subsets of a set with n elements.

Thus, each of these collections has 2^n elements.

Open the Galton board app.


1. Move the ball from (0, 0) to (10, 6) along a path of your choice. Note the corresponding bit string and subset.
2. Generate the bit string 0111001010. Note the corresponding subset and path.
3. Generate the subset {2, 4, 5, 9, 10}. Note the corresponding bit string and path.
4. Generate all paths from (0, 0) to (4, 2). How many paths are there?
Answer

This page titled 1.7: Counting Measure is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

1.8: Combinatorial Structures

The purpose of this section is to study several combinatorial structures that are of basic importance in probability.

Permutations
Suppose that D is a set with n ∈ N elements. A permutation of length k ∈ {0, 1, …, n} from D is an ordered sequence of k
distinct elements of D; that is, a sequence of the form (x_1, x_2, …, x_k) where x_i ∈ D for each i and x_i ≠ x_j for i ≠ j.

Statistically, a permutation of length k from D corresponds to an ordered sample of size k chosen without replacement from the
population D.

The number of permutations of length k from an n element set is

n^(k) = n(n − 1) ⋯ (n − k + 1)  (1.8.1)
Proof

By convention, n^(0) = 1. Recall that, in general, a product over an empty index set is 1. Note that n^(k) has k factors, starting at n,
and with each subsequent factor one less than the previous factor. Some alternate notations for the number of permutations of size
k from a set of n objects are P(n, k), P_{n,k}, and _nP_k.

The number of permutations of length n from the n element set D (these are called simply permutations of D) is

n! = n^(n) = n(n − 1) ⋯ (1)  (1.8.2)

The function on N given by n ↦ n! is the factorial function. The general permutation formula in (2) can be written in terms of
factorials:

For n ∈ N and k ∈ {0, 1, …, n},

n^(k) = n! / (n − k)!  (1.8.3)

Although this formula is succinct, it's not always a good idea numerically. If n and n − k are large, n! and (n − k)! are enormous,
and division of the first by the second can lead to significant round-off errors.
Note that the basic permutation formula in (2) is defined for every real number n and nonnegative integer k . This extension is
sometimes referred to as the generalized permutation formula. Actually, we will sometimes need an even more general formula of
this type (particularly in the sections on Pólya's urn and the beta-Bernoulli process).

For a ∈ R, s ∈ R, and k ∈ N, define

a^(s,k) = a(a + s)(a + 2s) ⋯ [a + (k − 1)s]  (1.8.4)

1. a^(0,k) = a^k
2. a^(−1,k) = a^(k)
3. a^(1,k) = a(a + 1) ⋯ (a + k − 1)
4. 1^(1,k) = k!

The product a^(−1,k) = a^(k) (our ordinary permutation formula) is sometimes called the falling power of a of order k, while a^(1,k) is
called the rising power of a of order k, and is sometimes denoted a^[k]. Note that a^(0,k) is the ordinary kth power of a. In general,
note that a^(s,k) has k factors, starting at a and with each subsequent factor obtained by adding s to the previous factor.
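Since a^(s,k) is just a product of k factors, it is usually best computed as such, which also sidesteps the factorial round-off issue mentioned above. A minimal Python sketch (purely illustrative; names ours):

def gen_perm(a, s, k):
    # a^(s,k) = a (a + s) (a + 2s) ... [a + (k - 1)s], as in (1.8.4)
    out = 1
    for i in range(k):
        out *= a + i * s
    return out

def falling(n, k):
    # The ordinary permutation formula n^(k) = n^(-1,k), computed as a direct product.
    return gen_perm(n, -1, k)

assert falling(10, 3) == 10 * 9 * 8   # permutations of length 3 from a 10 element set
assert gen_perm(1, 1, 5) == 120       # 1^(1,k) = k!
assert gen_perm(2, 0, 4) == 2 ** 4    # a^(0,k) = a^k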

Combinations
Consider again a set D with n ∈ N elements. A combination of size k ∈ {0, 1, …, n} from D is an (unordered) subset of k
distinct elements of D. Thus, a combination of size k from D has the form {x_1, x_2, …, x_k}, where x_i ∈ D for each i and
x_i ≠ x_j for i ≠ j.

Statistically, a combination of size k from D corresponds to an unordered sample of size k chosen without replacement from the
population D. Note that for each combination of size k from D, there are k! distinct orderings of the elements of that combination.
Each of these is a permutation of length k from D. The number of combinations of size k from an n-element set is denoted by
\binom{n}{k}. Some alternate notations are C(n, k), C_{n,k}, and _nC_k.

The number of combinations of size k from an n element set is

\binom{n}{k} = n^(k) / k! = n! / [k! (n − k)!]  (1.8.5)

Proof

The number \binom{n}{k} is called a binomial coefficient. Note that the formula makes sense for any real number n and nonnegative integer
k since this is true of the generalized permutation formula n^(k). With this extension, \binom{n}{k} is called the generalized binomial
coefficient. Note that if n and k are positive integers and k > n then \binom{n}{k} = 0. By convention, we will also define \binom{n}{k} = 0 if
k < 0. This convention sometimes simplifies formulas.
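In Python (3.8 or later), math.comb gives the ordinary binomial coefficient, and the generalized version can be computed from the falling power, as in this minimal sketch (the function gen_binom is ours, for illustration):

from math import comb, factorial

def gen_binom(a, k):
    # Generalized binomial coefficient a^(k) / k! for real a; 0 if k < 0, by convention.
    if k < 0:
        return 0
    num = 1.0
    for i in range(k):
        num *= a - i  # falling power a^(k)
    return num / factorial(k)

assert comb(5, 2) == 10
assert gen_binom(-1, 3) == (-1) ** 3   # consistent with binom(-1, k) = (-1)^k below
assert gen_binom(0.5, 2) == -0.125     # (1/2)(1/2 - 1) / 2!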

Properties of Binomial Coefficients


For some of the identities below, there are two possible proofs. An algebraic proof, of course, should be based on (5). A
combinatorial proof is constructed by showing that the left and right sides of the identity are two different ways of counting the
same collection.
\binom{n}{n} = \binom{n}{0} = 1.

Algebraically, the last result is trivial. It also makes sense combinatorially: There is only one way to select a subset of D with n

elements (D itself), and there is only one way to select a subset of size 0 from D (the empty set ∅).

If n, k ∈ N with k ≤ n then

\binom{n}{k} = \binom{n}{n−k}  (1.8.6)

Combinatorial Proof

The next result is one of the most famous and most important. It's known as Pascal's rule and is named for Blaise Pascal.

If n, k ∈ N_+ with k ≤ n then

\binom{n}{k} = \binom{n−1}{k−1} + \binom{n−1}{k}  (1.8.7)

Combinatorial Proof

Recall that the Galton board is a triangular array of pegs: the rows are numbered n = 0, 1, … and the pegs in row n are numbered
k = 0, 1, … , n. If each peg in the Galton board is replaced by the corresponding binomial coefficient, the resulting table of

numbers is known as Pascal's triangle, named again for Pascal. By (8), each interior number in Pascal's triangle is the sum of the
two numbers directly above it.
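Pascal's rule also gives a simple way to generate the triangle row by row. A minimal Python sketch (purely illustrative):

def pascal_triangle(n_max):
    # Rows 0 through n_max; each interior entry is the sum of the two entries above it.
    rows = [[1]]
    for n in range(1, n_max + 1):
        prev = rows[-1]
        rows.append([1] + [prev[k - 1] + prev[k] for k in range(1, n)] + [1])
    return rows

for row in pascal_triangle(4):
    print(row)   # [1], [1, 1], [1, 2, 1], [1, 3, 3, 1], [1, 4, 6, 4, 1]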
The following result is the binomial theorem, and is the reason for the term binomial coefficient.

If a, b ∈ R and n ∈ N is a positive integer, then

(a + b)^n = ∑_{k=0}^{n} \binom{n}{k} a^k b^{n−k}  (1.8.8)

Combinatorial Proof

If j, k, n ∈ N_+ with j ≤ k ≤ n then

k^(j) \binom{n}{k} = n^(j) \binom{n−j}{k−j}  (1.8.9)

Combinatorial Proof

The following result is known as Vandermonde's identity, named for Alexandre-Théophile Vandermonde.

If m, n, k ∈ N with k ≤ m + n, then

∑_{j=0}^{k} \binom{m}{j} \binom{n}{k−j} = \binom{m+n}{k}  (1.8.10)

Combinatorial Proof

The next result is a general identity for the sum of binomial coefficients.

If m, n ∈ N with n ≤ m then

∑_{j=n}^{m} \binom{j}{n} = \binom{m+1}{n+1}  (1.8.11)

Combinatorial Proof

For an even more general version of the last result, see the section on Order Statistics in the chapter on Finite Sampling Models.
The following identity for the sum of the first m positive integers is a special case of the last result.

If m ∈ N then

∑_{j=1}^{m} j = \binom{m+1}{2} = (m + 1)m / 2  (1.8.12)

Proof

There is a one-to-one correspondence between each pair of the following collections. Hence the number of objects in each of
these collections is \binom{n}{k}.
1. Subsets of size k from a set of n elements.
2. Bit strings of length n with exactly k 1's.
3. Paths in the Galton board from (0, 0) to (n, k).
Proof

The following identity is known as the alternating sum identity for binomial coefficients. It turns out to be useful in the Irwin-Hall
probability distribution. We give the identity in two equivalent forms, one for falling powers and one for ordinary powers.

If n ∈ N_+, j ∈ {0, 1, …, n − 1} then

1. ∑_{k=0}^{n} \binom{n}{k} (−1)^k k^(j) = 0  (1.8.13)
2. ∑_{k=0}^{n} \binom{n}{k} (−1)^k k^j = 0  (1.8.14)

Proof

Our next identity deals with a generalized binomial coefficient.

If n, k ∈ N then

\binom{−n}{k} = (−1)^k \binom{n+k−1}{k}  (1.8.16)

Proof

In particular, note that \binom{−1}{k} = (−1)^k. Our last result in this discussion concerns the binomial operator and its inverse.

The binomial operator takes a sequence of real numbers a = (a_0, a_1, a_2, …) and returns the sequence of real numbers
b = (b_0, b_1, b_2, …) by means of the formula

b_n = ∑_{k=0}^{n} \binom{n}{k} a_k,  n ∈ N  (1.8.19)

The inverse binomial operator recovers the sequence a from the sequence b by means of the formula

a_n = ∑_{k=0}^{n} (−1)^{n−k} \binom{n}{k} b_k,  n ∈ N  (1.8.20)

Proof
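The inversion is easy to check numerically. A minimal Python sketch (names ours) that applies the operator and then its inverse:

from math import comb

def binomial_op(a):
    # b_n = sum over k of C(n, k) a_k, as in (1.8.19)
    return [sum(comb(n, k) * a[k] for k in range(n + 1)) for n in range(len(a))]

def binomial_inv(b):
    # a_n = sum over k of (-1)^(n-k) C(n, k) b_k, as in (1.8.20)
    return [sum((-1) ** (n - k) * comb(n, k) * b[k] for k in range(n + 1))
            for n in range(len(b))]

a = [3, 1, 4, 1, 5, 9]
assert binomial_inv(binomial_op(a)) == a  # the two operators really are inverses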

Samples
The experiment of drawing a sample from a population is basic and important. There are two essential attributes of samples:
whether or not order is important, and whether or not a sampled object is replaced in the population before the next draw. Suppose
now that the population D contains n objects and we are interested in drawing a sample of k objects. Let's review what we know
so far:
If order is important and sampled objects are replaced, then the samples are just elements of the product set D^k. Hence, the
number of samples is n^k.
If order is important and sampled objects are not replaced, then the samples are just permutations of size k chosen from D.
Hence the number of samples is n^(k).
If order is not important and sampled objects are not replaced, then the samples are just combinations of size k chosen from D.
Hence the number of samples is \binom{n}{k}.

Thus, we have one case left to consider.

Unordered Samples With Replacement


An unordered sample chosen with replacement from D is called a multiset. A multiset is like an ordinary set except that elements
may be repeated.

There is a one-to-one correspondence between each pair of the following collections:
1. Multisets of size k from a population D of n elements.
2. Bit strings of length n + k − 1 with exactly k 1's.
3. Nonnegative integer solutions (x_1, x_2, …, x_n) of the equation x_1 + x_2 + ⋯ + x_n = k.
Each of these collections has \binom{n+k−1}{k} members.
Proof
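The standard library enumerates multisets directly, so the count is easy to confirm. A minimal Python sketch (purely illustrative):

from itertools import combinations_with_replacement
from math import comb

n, k = 4, 3  # population size, sample size
multisets = list(combinations_with_replacement(range(n), k))
assert len(multisets) == comb(n + k - 1, k)  # C(6, 3) = 20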

Summary of Sampling Formulas


The following table summarizes the formulas for the number of samples of size k chosen from a population of n elements, based
on the criteria of order and replacement.
Sampling formulas

Number of samples        With order        Without order
With replacement         n^k               \binom{n+k−1}{k}
Without replacement      n^(k)             \binom{n}{k}
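All four formulas are one-liners in Python (3.8 or later, for math.comb and math.perm); a minimal sketch (the function name is ours):

from math import comb, perm

def sample_counts(n, k):
    # The four entries of the sampling table above.
    return {
        ("ordered", "with replacement"): n ** k,
        ("ordered", "without replacement"): perm(n, k),        # n^(k)
        ("unordered", "without replacement"): comb(n, k),
        ("unordered", "with replacement"): comb(n + k - 1, k),
    }

print(sample_counts(5, 3))  # 125, 60, 10, and 35, respectively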

Multinomial Coefficients
Partitions of a Set
Recall that the binomial coefficient \binom{n}{j} is the number of subsets of size j from a set S of n elements. Note also that when we select
a subset A of size j from S, we effectively partition S into two disjoint subsets of sizes j and n − j, namely A and A^c. A natural
generalization is to partition S into a union of k distinct, pairwise disjoint subsets (A_1, A_2, …, A_k) where #(A_i) = n_i for each
i ∈ {1, 2, …, k}. Of course we must have n_1 + n_2 + ⋯ + n_k = n.

The number of ways to partition a set of n elements into a sequence of k sets of sizes (n_1, n_2, …, n_k) is

\binom{n}{n_1} \binom{n − n_1}{n_2} ⋯ \binom{n − n_1 − ⋯ − n_{k−1}}{n_k} = n! / (n_1! n_2! ⋯ n_k!)  (1.8.22)

Proof

The number in (18) is called a multinomial coefficient and is denoted by

\binom{n}{n_1, n_2, ⋯, n_k} = n! / (n_1! n_2! ⋯ n_k!)  (1.8.23)

If n, k ∈ N with k ≤ n then

\binom{n}{k, n−k} = \binom{n}{k}  (1.8.24)

Combinatorial Proof
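Multinomial coefficients are easy to compute from the factorial formula. A minimal Python sketch (the function name is ours):

from math import factorial

def multinomial(n, ns):
    # n! / (n_1! n_2! ... n_k!), as in (1.8.23); requires sum(ns) == n
    assert sum(ns) == n
    out = factorial(n)
    for ni in ns:
        out //= factorial(ni)
    return out

assert multinomial(5, [2, 3]) == 10         # reduces to C(5, 2), as in (1.8.24)
assert multinomial(10, [3, 3, 4]) == 4200   # 10! / (3! 3! 4!)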

Sequences
Consider now the set T = {1, 2, …, k}^n. Elements of this set are sequences of length n in which each coordinate is one of k
values. Thus, these sequences generalize the bit strings of length n. Again, let (n_1, n_2, …, n_k) be a sequence of nonnegative
integers with ∑_{i=1}^{k} n_i = n.

There is a one-to-one correspondence between the following collections:
1. Partitions of S into pairwise disjoint subsets (A_1, A_2, …, A_k) where #(A_j) = n_j for each j ∈ {1, 2, …, k}.
2. Sequences in {1, 2, …, k}^n in which j occurs n_j times for each j ∈ {1, 2, …, k}.

Proof

It follows that the number of elements in both of these collections is

\binom{n}{n_1, n_2, ⋯, n_k} = n! / (n_1! n_2! ⋯ n_k!)  (1.8.25)

Permutations with Indistinguishable Objects


Suppose now that we have n objects of k different types, with n_i elements of type i for each i ∈ {1, 2, …, k}. Moreover,
objects of a given type are considered identical. There is a one-to-one correspondence between the following collections:
1. Sequences in {1, 2, …, k}^n in which j occurs n_j times for each j ∈ {1, 2, …, k}.
2. Distinguishable permutations of the n objects.
Proof

Once again, it follows that the number of elements in both collections is

\binom{n}{n_1, n_2, ⋯, n_k} = n! / (n_1! n_2! ⋯ n_k!)  (1.8.26)

The Multinomial Theorem


The following result is the multinomial theorem which is the reason for the name of the coefficients.

If x_1, x_2, …, x_k ∈ R and n ∈ N then

(x_1 + x_2 + ⋯ + x_k)^n = ∑ \binom{n}{n_1, n_2, ⋯, n_k} x_1^{n_1} x_2^{n_2} ⋯ x_k^{n_k}  (1.8.27)

The sum is over sequences of nonnegative integers (n_1, n_2, …, n_k) with n_1 + n_2 + ⋯ + n_k = n. There are \binom{n+k−1}{n} terms
in this sum.
Combinatorial Proof

Computational Exercises
Arrangements
In a race with 10 horses, the first, second, and third place finishers are noted. How many outcomes are there?
Answer

Eight persons, consisting of four male-female couples, are to be seated in a row of eight chairs. How many seating
arrangements are there in each of the following cases:
1. There are no other restrictions.
2. The men must sit together and the women must sit together.
3. The men must sit together.
4. Each couple must sit together.
Answer

Suppose that n people are to be seated at a round table. How many seating arrangements are there? The mathematical
significance of a round table is that there is no dedicated first chair.
Answer

Twelve books, consisting of 5 math books, 4 science books, and 3 history books are arranged on a bookshelf. Find the number
of arrangements in each of the following cases:
1. There are no restrictions.
2. The books of each type must be together.
3. The math books must be together.
Answer

Find the number of distinct arrangements of the letters in each of the following words:
1. statistics
2. probability
3. mississippi
4. tennessee
5. alabama
Answer

A child has 12 blocks; 5 are red, 4 are green, and 3 are blue. In how many ways can the blocks be arranged in a line if blocks of
a given color are considered identical?
Answer

Code Words
A license tag consists of 2 capital letters and 5 digits. Find the number of tags in each of the following cases:
1. There are no other restrictions
2. The letters and digits are all different.
Answer

Committees
A club has 20 members; 12 are women and 8 are men. A committee of 6 members is to be chosen. Find the number of different
committees in each of the following cases:
1. There are no other restrictions.
2. The committee must have 4 women and 2 men.
3. The committee must have at least 2 women and at least 2 men.
Answer

Suppose that a club with 20 members plans to form 3 distinct committees with 6, 5, and 4 members, respectively. In how many
ways can this be done?
Answer

Cards
A standard card deck can be modeled by the Cartesian product set

D = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, j, q, k} × {♣, ♢, ♡, ♠} (1.8.28)

where the first coordinate encodes the denomination or kind (ace, 2-10, jack, queen, king) and where the second coordinate encodes
the suit (clubs, diamonds, hearts, spades). Sometimes we represent a card as a string rather than an ordered pair (for example q♡).

A poker hand (in draw poker) consists of 5 cards dealt without replacement and without regard to order from a deck of 52
cards. Find the number of poker hands in each of the following cases:
1. There are no restrictions.
2. The hand is a full house (3 cards of one kind and 2 of another kind).
3. The hand has 4 of a kind.
4. The cards are all in the same suit (so the hand is a flush or a straight flush).
Answer

The game of poker is studied in detail in the chapter on Games of Chance.

A bridge hand consists of 13 cards dealt without replacement and without regard to order from a deck of 52 cards. Find the
number of bridge hands in each of the following cases:
1. There are no restrictions.
2. The hand has exactly 4 spades.
3. The hand has exactly 4 spades and 3 hearts.
4. The hand has exactly 4 spades, 3 hearts, and 2 diamonds.
Answer

A hand of cards that has no cards in a particular suit is said to be void in that suit. Use the inclusion-exclusion formula to find
each of the following:
1. The number of poker hands that are void in at least one suit.
2. The number of bridge hands that are void in at least one suit.
Answer

A bridge hand that has no honor cards (cards of denomination 10, jack, queen, king, or ace) is said to be a Yarborough, in
honor of the Second Earl of Yarborough. Find the number of Yarboroughs.
Answer

A bridge deal consists of dealing 13 cards (a bridge hand) to each of 4 distinct players (generically referred to as north, south,
east, and west) from a standard deck of 52 cards. Find the number of bridge deals.
Answer

This staggering number is about the same order of magnitude as the number of atoms in your body, and is one of the reasons that
bridge is a rich and interesting game.

Find the number of permutations of the cards in a standard deck.


Answer

This number is even more staggering. Indeed if you perform the experiment of dealing all 52 cards from a well-shuffled deck, you
may well generate a pattern of cards that has never been generated before, thereby ensuring your immortality. Actually, this
experiment shows that, in a sense, rare events can be very common. By the way, Persi Diaconis has shown that it takes about seven
standard riffle shuffles to thoroughly randomize a deck of cards.

Dice and Coins


Suppose that 5 distinct, standard dice are rolled and the sequence of scores recorded.
1. Find the number of sequences.
2. Find the number of sequences with the scores all different.
Answer

Suppose that 5 identical, standard dice are rolled. How many outcomes are there?
Answer

A coin is tossed 10 times and the outcome is recorded as a bit string (where 1 denotes heads and 0 tails).
1. Find the number of outcomes.
2. Find the number of outcomes with exactly 4 heads.
3. Find the number of outcomes with at least 8 heads.
Answer

Polynomial Coefficients

Find the coefficient of x^3 y^4 in (2x − 4y)^7.

Answer

Find the coefficient of x^5 in (2 + 3x)^8.

Answer

Find the coefficient of x^3 y^7 z^5 in (x + y + z)^15.
Answer

The Galton Board


In the Galton board game,
1. Move the ball from (0, 0) to (10, 6) along a path of your choice. Note the corresponding bit string and subset.
2. Generate the bit string 0011101001. Note the corresponding subset and path.
3. Generate the subset {1, 4, 5, 7, 8, 10}. Note the corresponding bit string and path.
4. Generate all paths from (0, 0) to (5, 3). How many paths are there?
Answer

Generate Pascal's triangle up to n = 10 .

Samples
A shipment contains 12 good and 8 defective items. A sample of 5 items is selected. Find the number of samples that contain
exactly 3 good items.
Answer

In the (n, k) lottery, k numbers are chosen without replacement from the set of integers from 1 to n (where n, k ∈ N+ and
k < n ). Order does not matter.

1. Find the number of outcomes in the general (n, k) lottery.


2. Explicitly compute the number of outcomes in the (44, 6) lottery (a common format).
Answer

For more on this topic, see the section on Lotteries in the chapter on Games of Chance.

Explicitly compute each formula in the sampling table above when n = 10 and k = 4 .
Answer

Greetings
Suppose there are n people who shake hands with each other. How many handshakes are there?
Answer

There are m men and n women. The men shake hands with each other; the women hug each other; and each man bows to each
woman.
1. How many handshakes are there?
2. How many hugs are there?
3. How many bows are there?
4. How many greetings are there?
Answer

Integer Solutions
Find the number of integer solutions of x_1 + x_2 + x_3 = 10 in each of the following cases:
1. x_i ≥ 0 for each i.
2. x_i > 0 for each i.
Answer

Generalized Coefficients

Compute each of the following:
1. (−5)^(3)
2. (1/2)^(4)
3. (−1/3)^(5)

Answer

Compute each of the following:
1. \binom{1/2}{3}
2. \binom{−5}{4}
3. \binom{−1/3}{5}

Answer

Birthdays
Suppose that n persons are selected and their birthdays noted. (Ignore leap years, so that a year has 365 days.)
1. Find the number of outcomes.
2. Find the number of outcomes with distinct birthdays.
Answer

Chess
Note that the squares of a chessboard are distinct, and in fact are often identified with the Cartesian product set

{a, b, c, d, e, f , g, h} × {1, 2, 3, 4, 5, 6, 7, 8} (1.8.29)

Find the number of ways of placing 8 rooks on a chessboard so that no rook can capture another in each of the following cases.
1. The rooks are distinguishable.
2. The rooks are indistinguishable.
Answer

Gifts
Suppose that 20 identical candies are distributed to 4 children. Find the number of distributions in each of the following cases:
1. There are no restrictions.
2. Each child must get at least one candy.
Answer

In the song The Twelve Days of Christmas, find the number of gifts given to the singer by her true love. (Note that the singer
starts afresh with gifts each day, so that for example, the true love gets a new partridge in a pear tree each of the 12 days.)
Answer

Teams
Suppose that 10 kids are divided into two teams of 5 each for a game of basketball. In how many ways can this be done in each
of the following cases:
1. The teams are distinguishable (for example, one team is labeled “Alabama” and the other team is labeled “Auburn”).
2. The teams are not distinguishable.
Answer

This page titled 1.8: Combinatorial Structures is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

1.9: Topological Spaces

Topology is one of the major branches of mathematics, along with other such branches as algebra (in the broad sense of algebraic
structures), and analysis. Topology deals with spatial concepts involving distance, closeness, separation, convergence, and
continuity. Needless to say, entire series of books have been written about the subject. Our goal in this section and the next is
simply to review the basic definitions and concepts of topology that we will need for our study of probability and stochastic
processes. You may want to refer to this section as needed.

Basic Theory
Definitions

A topological space consists of a nonempty set S and a collection S of subsets of S that satisfy the following properties:
1. S ∈ S and ∅ ∈ S
2. If 𝒜 ⊆ S then ⋃𝒜 ∈ S
3. If 𝒜 ⊆ S and 𝒜 is finite, then ⋂𝒜 ∈ S
If A ∈ S, then A is said to be open and A^c is said to be closed. The collection S of open sets is a topology on S.
c

So the union of an arbitrary number of open sets is still open, as is the intersection of a finite number of open sets. The universal set
S and the empty set ∅ are both open and closed. There may or may not exist other subsets of S with this property.
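On a finite set, the axioms can be checked directly by machine, which can help build intuition. Here is a minimal Python sketch (ours, purely illustrative); note that on a finite set, closure under pairwise unions and intersections is enough:

def is_topology(S, T):
    # T should be a collection of frozensets of the finite set S.
    T = set(T)
    if frozenset(S) not in T or frozenset() not in T:
        return False
    # On a finite set, closure under pairwise unions/intersections suffices.
    return all(A | B in T and A & B in T for A in T for B in T)

S = {1, 2, 3}
T = {frozenset(), frozenset({1}), frozenset({1, 2}), frozenset(S)}
assert is_topology(S, T)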

Suppose that S is a nonempty set, and that S and T are topologies on S . If S ⊆ T then T is finer than S , and S is
coarser than T . Coarser than defines a partial order on the collection of topologies on S . That is, if R, S , T are topologies
on S then
1. R is coarser than R , the reflexive property.
2. If R is coarser than S and S is coarser than R then R = S , the anti-symmetric property.
3. If R is coarser than S and S is coarser than T then R is coarser than T , the transitive property.

A topology can be characterized just as easily by means of closed sets as open sets.

Suppose that S is a nonempty set. A collection C of subsets of S is the collection of closed sets for a topology on S if and only if
1. S ∈ C and ∅ ∈ C
2. If 𝒜 ⊆ C then ⋂𝒜 ∈ C.
3. If 𝒜 ⊆ C and 𝒜 is finite then ⋃𝒜 ∈ C.
Proof

Suppose that (S, S ) is a topological space, and that x ∈ S . A set A ⊆ S is a neighborhood of x if there exists U ∈ S with
x ∈ U ⊆A .

So a neighborhood of a point x ∈ S is simply a set with an open subset that contains x. The idea is that points in a “small”
neighborhood of x are “close” to x in a sense. An open set can be defined in terms of the neighborhoods of the points in the set.

Suppose again that (S, S ) is a topological space. A set U ⊆S is open if and only if U is a neighborhood of every x ∈ U
Proof

Although the proof seems trivial, the neighborhood concept is how you should think of openness. A set U is open if every point in
U has a set of “nearby points” that are also in U .

Our next three definitions deal with topological sets that are naturally associated with a given subset.

Suppose again that (S, S ) is a topological space and that A ⊆ S . The closure of A is the set

cl(A) = ⋂{B ⊆ S : B is closed and A ⊆ B} (1.9.1)

This is the smallest closed set containing A :
1. cl(A) is closed.
2. A ⊆ cl(A) .
3. If B is closed and A ⊆ B then cl(A) ⊆ B
Proof

Of course, if A is closed then A = cl(A) . Complementary to the closure of a set is the interior of the set.

Suppose again that (S, S ) is a topological space and that A ⊆ S . The interior of A is the set

int(A) = ⋃{U ⊆ S : U is open and U ⊆ A} (1.9.2)

This set is the largest open subset of A :


1. int(A) is open.
2. int(A) ⊆ A .
3. If U is open and U ⊆A then U ⊆ int(A)

Proof

Of course, if A is open then A = int(A) . The boundary of a set is the set difference between the closure and the interior.

Suppose again that (S, S ) is a topological space. The boundary of A is ∂(A) = cl(A) ∖ int(A) . This set is closed.
Proof

A topology on a set induces a natural topology on any subset of the set.

Suppose that (S, S ) is a topological space and that R is a nonempty subset of S . Then R = {A ∩ R : A ∈ S } is a topology
on R , known as the relative topology induced by S .
Proof

In the context of the previous result, note that if R is itself open, then the relative topology is R = {A ∈ S : A ⊆ R} , the subsets
of R that are open in the original topology.

Separation Properties
Separation properties refer to the ability to separate points or sets with disjoint open sets. Our first definition deals with separating
two points.

Suppose that (S, S ) is a topological space and that x, y are distinct points in S . Then x and y can be separated if there exist
disjoint open sets U and V with x ∈ U and y ∈ V . If every pair of distinct points in S can be separated, then (S, S ) is called
a Hausdorff space.

Hausdorff spaces are named for the German mathematician Felix Hausdorff. There are weaker separation properties. For example,
there could be an open set U that contains x but not y , and an open set V that contains y but not x, but no disjoint open sets that
contain x and y . Clearly if every open set that contains one of the points also contains the other, then the points are
indistinguishable from a topological viewpoint. In a Hausdorff space, singletons are closed.

Suppose that (S, S ) is a Hausdorff space. Then {x} is closed for each x ∈ S .
Proof

Our next definition deals with separating a point from a closed set.

Suppose again that (S, S) is a topological space. A nonempty closed set A ⊆ S and a point x ∈ A^c can be separated if there
exist disjoint open sets U and V with A ⊆ U and x ∈ V. If every nonempty closed set A and point x ∈ A^c can be separated,
then the space (S, S) is regular.

Clearly if (S, S ) is a regular space and singleton sets are closed, then (S, S ) is a Hausdorff space.

Bases
Topologies, like other set structures, are often defined by first giving some basic sets that should belong to the collection, and then
extending the collection so that the defining axioms are satisfied. This idea is motivation for the following definition:

Suppose again that (S, S ) is a topological space. A collection B ⊆ S is a base for S if every set in S can be written as a
union of sets in B .

So, a base is a smaller collection of open sets with the property that every other open set can be written as a union of basic open
sets. But again, we often want to start with the basic open sets and extend this collection to a topology. The following theorem
gives the conditions under which this can be done.

Suppose that S is a nonempty set. A collection B of subsets of S is a base for a topology on S if and only if
1. S = ⋃ B
2. If A, B ∈ B and x ∈ A ∩ B , there exists C ∈ B with x ∈ C ⊆ A∩B

Proof

Here is a slightly weaker condition, but one that is often satisfied in practice.

Suppose that S is a nonempty set. A collection B of subsets of S that satisfies the following properties is a base for a topology
on S :
1. S = ⋃ B
2. If A, B ∈ B then A ∩ B ∈ B

Part (b) means that B is closed under finite intersections.

Compactness
Our next discussion considers another very important type of set. Some additional terminology will make the discussion easier.
Suppose that S is a set and A ⊆ S. A collection 𝒜 of subsets of S is said to cover A if A ⊆ ⋃𝒜. So the word cover simply means
a collection of sets whose union contains a given set. In a topological space, we can have an open cover (that is, a cover with
open sets), a closed cover (that is, a cover with closed sets), and so forth.

Suppose again that (S, S) is a topological space. A set C ⊆ S is compact if every open cover of C has a finite sub-cover.
That is, if 𝒜 is a collection of open sets with C ⊆ ⋃𝒜 then there exists a finite ℬ ⊆ 𝒜 with C ⊆ ⋃ℬ.

So intuitively, a compact set is compact in the ordinary sense of the word. No matter how “small” the open sets in the covering of
C are, there will always exist a finite number of the open sets that cover C.

Suppose again that (S, S) is a topological space and that C ⊆ S is compact. If B ⊆ C is closed, then B is also compact.
Proof

Compactness is also preserved under finite unions.

Suppose again that (S, S) is a topological space, and that C_i ⊆ S is compact for each i in a finite index set I. Then
C = ⋃_{i∈I} C_i is compact.

Proof

As we saw above, closed subsets of a compact set are themselves compact. In a Hausdorff space, a compact set is itself closed.

Suppose that (S, S ) is a Hausdorff space. If C ⊆S is compact then C is closed.


Proof

Also in a Hausdorff space, a point can be separated from a compact set that does not contain the point.

Suppose that (S, S ) is a Hausdorff space. If x ∈ S , C ⊆S is compact, and x ∉ C , then there exist disjoint open sets U and
V with x ∈ U and C ⊆ V

Proof

In a Hausdorff space, if a point has a neighborhood with a compact boundary, then there is a smaller, closed neighborhood.

Suppose again that (S, S ) is a Hausdorff space. If x ∈ S and A is a neighborhood of x with ∂(A) compact, then there exists a
closed neighborhood B of x with B ⊆ A .
Proof

Generally, local properties in a topological space refer to properties that hold on the neighborhoods of a point x ∈ S .

A topological space (S, S ) is locally compact if every point x ∈ S has a compact neighborhood.

This definition is important because many of the topological spaces that occur in applications (like probability) are not compact,
but are locally compact. Locally compact Hausdorff spaces have a number of nice properties. In particular, in a locally compact
Hausdorff space, there are arbitrarily “small” compact neighborhoods of a point.

Suppose that (S, S ) is a locally compact Hausdorff space. If x ∈ S and A is a neighborhood of x, then there exists a compact
neighborhood B of x with B ⊆ A .
Proof

Countability Axioms
Our next discussion concerns topologies that can be “countably constructed” in a certain sense. Such axioms limit the “size” of the
topology in a way, and are often satisfied by important topological spaces that occur in applications. We start with an important
preliminary definition.

Suppose that (S, S ) is a topological space. A set D ⊆ S is dense if U ∩ D is nonempty for every nonempty U ∈ S .

Equivalently, D is dense if every neighborhood of a point x ∈ S contains an element of D. So in this sense, one can find elements
of D “arbitrarily close” to a point x ∈ S . Of course, the entire space S is dense, but we are usually interested in topological spaces
that have dense sets of limited cardinality.

Suppose again that (S, S ) is a topological space. A set D ⊆ S is dense if and only if cl(D) = S .
Proof

Here is our first countability axiom:

A topological space (S, S ) is separable if there exists a countable dense subset.

So in a separable space, there is a countable set D with the property that there are points in D “arbitrarily close” to every x ∈ S .
Unfortunately, the term separable is similar to separating points that we discussed above in the definition of a Hausdorff space. But
clearly the concepts are very different. Here is another important countability axiom.

A topological space (S, S ) is second countable if it has a countable base.

So in a second countable space, there is a countable collection of open sets B with the property that every other open set is a union
of sets in B . Here is how the two properties are related:

If a topological space (S, S ) is second countable then it is separable.


Proof

As the terminology suggests, there are other axioms of countability (such as first countable), but the two we have discussed are the
most important.

Connected and Disconnected Spaces


This discussion deals with the situation in which a topological space falls into two or more separated pieces, in a sense.

A topological space (S, S ) is disconnected if there exist nonempty, disjoint, open sets U and V with S = U ∪ V . If (S, S ) is
not disconnected, then it is connected.

Since U = V^c, it follows that U and V are also closed. So the space is disconnected if and only if there exists a proper subset U
that is open and closed (sadly, such sets are sometimes called clopen). If S is disconnected, then S consists of two pieces U and V,
and the points in U are not “close” to the points in V, in a sense. To study S topologically, we could simply study U and V
separately, with their relative topologies.

Convergence
There is a natural definition for a convergent sequence in a topological space, but the concept is not as useful as one might expect.

Suppose again that (S, S) is a topological space. A sequence of points (x_n : n ∈ N_+) in S converges to x ∈ S if for every
neighborhood A of x there exists m ∈ N_+ such that x_n ∈ A for n > m. We write x_n → x as n → ∞.

So for every neighborhood of x, regardless of how “small”, all but finitely many of the terms of the sequence will be in the
neighborhood. One would naturally hope that limits, when they exist, are unique, but this will only be the case if points in the space
can be separated.

Suppose that (S, S) is a Hausdorff space. If (x_n : n ∈ N_+) is a sequence of points in S with x_n → x ∈ S as n → ∞ and
x_n → y ∈ S as n → ∞, then x = y.

Proof

On the other hand, if distinct points x, y ∈ S cannot be separated, then any sequence that converges to x will also converge to y .

Continuity
Continuity of functions is one of the most important concepts to come out of general topology. The idea, of course, is that if two
points are close together in the domain, then the functional values should be close together in the range. The abstract topological
definition, based on inverse images is very simple, but not very intuitive at first.

Suppose that (S, S) and (T, T) are topological spaces. A function f : S → T is continuous if f^{−1}(A) ∈ S for every
A ∈ T.

So a continuous function has the property that the inverse image of an open set (in the range space) is also open (in the domain
space). Continuity can equivalently be expressed in terms of closed subsets.

Suppose again that (S, S) and (T, T) are topological spaces. A function f : S → T is continuous if and only if f^{−1}(A) is a
closed subset of S for every closed subset A of T.
Proof

Continuity preserves limits.

Suppose again that (S, S) and (T, T) are topological spaces, and that f : S → T is continuous. If (x_n : n ∈ N_+) is a
sequence of points in S with x_n → x ∈ S as n → ∞, then f(x_n) → f(x) as n → ∞.

Proof

The converse of the last result is not true, so continuity of functions in a general topological space cannot be characterized in terms
of convergent sequences. There are objects like sequences but more general, known as nets, that do characterize continuity, but we
will not study these. Composition, the most important way to combine functions, preserves continuity.

Suppose that (S, S ), (T , T ), and (U , U ) are topological spaces. If f : S → T and g : T → U are continuous, then
g∘ f : S → U is continuous.
Proof

The next definition is very important. A recurring theme in mathematics is to recognize when two mathematical structures of a
certain type are fundamentally the same, even though they may appear to be different.

Suppose again that (S, S) and (T, T) are topological spaces. A one-to-one function f that maps S onto T with both f and
f^{−1} continuous is a homeomorphism from (S, S) to (T, T). When such a function exists, the topological spaces are said to
be homeomorphic.

Note that in this definition, f^{−1} refers to the inverse function, not the mapping of inverse images. If f is a homeomorphism, then A
is open in S if and only if f(A) is open in T. It follows that the topological spaces are essentially equivalent: any purely
topological property can be characterized in terms of open sets and therefore any such property is shared by the two spaces.
topological property can be characterized in terms of open sets and therefore any such property is shared by the two spaces.

Being homeomorphic is an equivalence relation on the collection of topological spaces. That is, for spaces ,
(S, S ) (T , T ) ,
and (U , U ) ,
1. (S, S ) is homeomorphic to (S, S ) (the reflexive property).
2. If (S, S ) is homeomorphic to (T , T ) then (T , T ) is homeomorphic to (S, S ) (the symmetric property).
3. If (S, S ) is homeomorphic to (T , T ) and (T , T ) is homeomorphic to (U , U ) then (S, S ) is homeomorphic to (U , U )
(the transitive property).
Proof

Continuity can also be defined locally, by restricting attention to the neighborhoods of a point.

Suppose again that (S, S) and (T, T) are topological spaces, and that x ∈ S. A function f : S → T is continuous at x if
f^{−1}(B) is a neighborhood of x in S whenever B is a neighborhood of f(x) in T. If A ⊆ S, then f is continuous on A if f is
continuous at each x ∈ A.

Suppose again that (S, S ) and (T , T ) are topological spaces, and that f : S → T . Then f is continuous if and only if f is
continuous at each x ∈ S .
Proof

Properties that are defined for a topological space can be applied to a subset of the space, with the relative topology. But one has to
be careful.

Suppose again that (S, S) and (T, T) are topological spaces and that f : S → T. Suppose also that A ⊆ S, and let A denote the
relative topology on A induced by S, and let f_A denote the restriction of f to A. If f is continuous on A then f_A is continuous
relative to the spaces (A, A) and (T, T). The converse is not generally true.
Proof

Product Spaces
Cartesian product sets are ubiquitous in mathematics, so a natural question is this: given topological spaces (S, S ) and (T , T ) ,
what is a natural topology for S × T ? The answer is very simple using the concept of a base above.

Suppose that (S, S ) and (T , T ) are topological spaces. The collection B = {A × B : A ∈ S , B ∈ T } is a base for a
topology on S × T , called the product topology associated with the given spaces.
Proof

So basically, we want the product of open sets to be open in the product space. The product topology is the smallest topology that
makes this happen. The definition above can be extended to very general product spaces, but to state the extension, let's recall how
general product sets are constructed. Suppose that S_i is a set for each i in a nonempty index set I. Then the product set ∏_{i∈I} S_i is
the set of all functions x : I → ⋃_{i∈I} S_i such that x(i) ∈ S_i for i ∈ I.

Suppose that (S_i, S_i) is a topological space for each i in a nonempty index set I. Then

B = {∏_{i∈I} A_i : A_i is open in S_i for all i ∈ I and A_i = S_i for all but finitely many i ∈ I}  (1.9.8)

is a base for a topology on ∏_{i∈I} S_i, known as the product topology associated with the given spaces.
Proof

Suppose again that S_i is a set for each i in a nonempty index set I. For j ∈ I, recall that the projection function p_j : ∏_{i∈I} S_i → S_j is
defined by p_j(x) = x(j).

Suppose again that (S_i, S_i) is a topological space for each i ∈ I, and give the product space ∏_{i∈I} S_i the product topology.
The projection function p_j is continuous for each j ∈ I.

Proof

As a special case of all this, suppose that (S, S) is a topological space, and that S_i = S for all i ∈ I. Then the product space
∏_{i∈I} S_i is the set of all functions from I to S, sometimes denoted S^I. In this case, the base for the product topology on S^I is

B = {∏_{i∈I} A_i : A_i is open for all i ∈ I and A_i = S for all but finitely many i ∈ I}  (1.9.9)

For j ∈ I, the projection function p_j just returns the value of a function x : I → S at j: p_j(x) = x(j). This projection function is
continuous. Note in particular that no topology is necessary on the domain I.

Examples and Special Cases


The Trivial Topology
Suppose that S is a nonempty set. Then {S, ∅} is a topology on S , known as the trivial topology.

With the trivial topology, no two distinct points can be separated. So the topology cannot distinguish between points, in a sense,
and all points in S are close to each other. Clearly, this topology is not very interesting, except as a place to start. Since there is only
one nonempty open set (S itself), the space is connected, and every subset of S is compact. A sequence in S converges to every
point in S .

Suppose that S has the trivial topology and that (T , T ) is another topological space.
1. Every function from T to S is continuous.
2. If (T , T ) is a Hausdorff space then the only continuous functions from S to T are constant functions.
Proof

The Discrete Topology


At the opposite extreme from the trivial topology, with the smallest collection of open sets, is the discrete topology, with the largest
collection of open sets.

Suppose that S is a nonempty set. The power set P(S) (consisting of all subsets of S ) is a topology, known as the discrete
topology.

So in the discrete topology, every set is both open and closed. All points are separated, and in a sense, widely so. No point is close
to another point. With the discrete topology, S is Hausdorff, disconnected, and the compact subsets are the finite subsets. A
sequence in S converges to x ∈ S , if and only if all but finitely many terms of the sequence are x.

Suppose that S has the discrete topology and that (T, T) is another topological space.
1. Every function from S to T is continuous.
2. If (T , T ) is connected, then the only continuous functions from T to S are constant functions.
Proof

Euclidean Spaces
The standard topologies used in the Euclidean spaces are the topologies built from the open sets that you are familiar with.

For the set of real numbers R, let B = {(a, b) : a, b ∈ R, a < b} , the collection of open intervals. Then B is a base for a
topology R on R, known as the Euclidean topology.
Proof

The space (R, R) satisfies many properties that are motivations for definitions in topology in the first place. The convergence of a
sequence in R, in the topological sense given above, is the same as the definition of convergence in calculus. The same statement
holds for the continuity of a function f from R to R.
Before listing other topological properties, we give a characterization of compact sets, known as the Heine-Borel theorem, named
for Eduard Heine and Émile Borel. Recall that A ⊆ R is bounded if A ⊆ [a, b] for some a, b ∈ R with a < b .

A subset C ⊆R is compact if and only if C is closed and bounded.

So in particular, closed, bounded intervals of the form [a, b] with a, b ∈ R and a < b are compact.

The space (R, R) has the following properties:


1. Hausdorff.
2. Connected.
3. Locally compact.
4. Second countable.
Proof

As noted in the proof, Q, the set of rationals, is countable and is dense in R. Another countable, dense subset is
D = {j/2^n : n ∈ N and j ∈ Z}, the set of dyadic rationals (or binary rationals). For the higher-dimensional Euclidean spaces, we
can use the product topology based on the topology of the real numbers.

For n ∈ {2, 3, …}, let (R^n, R_n) be the n-fold product space corresponding to the space (R, R). Then R_n is the Euclidean
topology on R^n.

A subset A ⊆ R^n is bounded if there exist a, b ∈ R with a < b such that A ⊆ [a, b]^n, so that A fits inside of an n-dimensional
“block”.

A subset C ⊆ R^n is compact if and only if C is closed and bounded.

The space (R^n, R_n) has the following properties:
1. Hausdorff.
2. Connected.
3. Locally compact.
4. Second countable.

This page titled 1.9: Topological Spaces is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

1.10: Metric Spaces

Basic Theory
Most of the important topological spaces that occur in applications (like probability) have an additional structure that gives a distance
between points in the space.

Definitions
A metric space consists of a nonempty set S and a function d : S × S → [0, ∞) that satisfies the following axioms: For
x, y, z ∈ S ,

1. d(x, y) = 0 if and only if x = y .


2. d(x, y) = d(y, x).
3. d(x, z) ≤ d(x, y) + d(y, z) .
The function d is known as a metric or a distance function.

So as the name suggests, d(x, y) is the distance between points x, y ∈ S . The axioms are intended to capture the essential properties
of distance from geometry. Part (a) is the positive property; the distance is strictly positive if and only if the points are distinct. Part
(b) is the symmetric property; the distance from x to y is the same as the distance from y to x. Part (c) is the triangle inequality;
going from x to z cannot be longer than going from x to z by way of a third point y .
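The axioms can be spot-checked numerically for any candidate metric. A minimal Python sketch using the Euclidean distance on R^2 (purely illustrative):

from math import hypot

def d(p, q):
    # Euclidean distance on R^2
    return hypot(p[0] - q[0], p[1] - q[1])

x, y, z = (0.0, 0.0), (3.0, 4.0), (6.0, 8.0)
assert d(x, y) > 0 and d(x, x) == 0     # positive property
assert d(x, y) == d(y, x)               # symmetric property
assert d(x, z) <= d(x, y) + d(y, z)     # triangle inequality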
Note that if (S, d) is a metric space, and A is a nonempty subset of S , then the set A with d restricted to A×A is also a metric
space (known as a subspace). The next definitions also come naturally from geometry:

Suppose that (S, d) is a metric space, and that x ∈ S and r ∈ (0, ∞).
1. B(x, r) = {y ∈ S : d(x, y) < r} is the open ball with center x and radius r.
2. C (x, r) = {y ∈ S : d(x, y) ≤ r} is the closed ball with center x and radius r.

A metric on a space induces a topology on the space in a natural way.

Suppose that (S, d) is a metric space. By definition, a set U ⊆ S is open if for every x ∈ U there exists r ∈ (0, ∞) such that
B(x, r) ⊆ U. The collection S_d of open subsets of S is a topology.

Proof

As the names suggests, an open ball is in fact open and a closed ball is in fact closed.

Suppose again that (S, d) is a metric space, and that x ∈ S and r ∈ (0, ∞). Then
1. B(x, r) is open.
2. C (x, r) is closed.
Proof

Recall that for a general topological space, a neighborhood of a point x ∈ S is a set A ⊆ S with the property that there exists an
open set U with x ∈ U ⊆ A . It follows that in a metric space, A ⊆ S is a neighborhood of x if and only if there exists r > 0 such
that B(x, r) ⊆ A . In words, a neighborhood of a point must contain an open ball about that point.
It's easy to construct new metrics from ones that we already have. Here's one such result.

Suppose that S is a nonempty set, and that d, e are metrics on S , and c ∈ (0, ∞). Then the following are also metrics on S :
1. cd
2. d + e
Proof

Since a metric space produces a topological space, all of the definitions for general topological spaces apply to metric spaces as well.
In particular, in a metric space, distinct points can always be separated.

A metric space (S, d) is a Hausdorff space.
Proof

Metrizable Spaces
Again, every metric space is a topological space, but not conversely. A non-Hausdorff space, for example, cannot correspond to a
metric space. We know there are such spaces; a set S with more than one point, and with the trivial topology S = {S, ∅} is non-
Hausdorff.

Suppose that (S, S) is a topological space. If there exists a metric d on S such that S = S_d, then (S, S) is said to be
metrizable.

It's easy to see that different metrics can induce the same topology. For example, if d is a metric and c ∈ (0, ∞), then the metrics d and cd induce the same topology.

Let S be a nonempty set. Metrics d and e on S are equivalent, and we write d ≡ e, if S_d = S_e. The relation ≡ is an
equivalence relation on the collection of metrics on S . That is, for metrics d, e, f on S ,
1. d ≡ d , the reflexive property.
2. If d ≡ e then e ≡ d , the symmetric property.
3. If d ≡ e and e ≡ f then d ≡ f , the transitive property.

There is a simple condition that characterizes when the topology of one metric is finer than the topology of another metric, and then
this in turn leads to a condition for equivalence of metrics.

Suppose again that S is a nonempty set and that d, e are metrics on S. Then S_e is finer than S_d if and only if every open ball relative to d contains an open ball relative to e.


Proof

It follows that metrics d and e on S are equivalent if and only if every open ball relative to one of the metrics contains an open ball
relative to the other metric.
So every metrizable topology on S corresponds to an equivalence class of metrics that produce that topology. Sometimes we want to
know that a topological space is metrizable, because of the nice properties that it will have, but we don't really need to use a specific
metric that generates the topology. At any rate, it's important to have conditions that are sufficient for a topological space to be
metrizable. The most famous such result is the Urysohn metrization theorem, named for the Russian mathematician Pavel Urysohn:

Suppose that (S, S ) is a regular, second-countable, Hausdorff space. Then (S, S ) is metrizable.
Review of the terms

Convergence
With a distance function, the convergence of a sequence can be characterized in a manner that is just like calculus. Recall that for a general topological space (S, S), if (x_n : n ∈ N_+) is a sequence of points in S and x ∈ S, then x_n → x as n → ∞ means that for every neighborhood U of x, there exists m ∈ N_+ such that x_n ∈ U for n > m.

Suppose that (S, d) is a metric space, and that (x_n : n ∈ N_+) is a sequence of points in S and x ∈ S. Then x_n → x as n → ∞ if and only if for every ϵ > 0 there exists m ∈ N_+ such that if n > m then d(x_n, x) < ϵ. Equivalently, x_n → x as n → ∞ if and only if d(x_n, x) → 0 as n → ∞ (in the usual calculus sense).

Proof

So, no matter how tiny ϵ>0 may be, all but finitely many terms of the sequence are within ϵ distance of x. As one might hope,
limits are unique.

Suppose again that (S, d) is a metric space. Suppose also that (x_n : n ∈ N_+) is a sequence of points in S and that x, y ∈ S. If x_n → x as n → ∞ and x_n → y as n → ∞ then x = y.

Proof

Convergence of a sequence is a topological property, and so is preserved under equivalence of metrics.

Suppose that d, e are equivalent metrics on S, and that (x_n : n ∈ N_+) is a sequence of points in S and x ∈ S. Then x_n → x as n → ∞ relative to d if and only if x_n → x as n → ∞ relative to e.

Closed subsets of a metric space have a simple characterization in terms of convergent sequences, and this characterization is more
intuitive than the abstract axioms in a general topological space.

Suppose again that (S, d) is a metric space. Then A ⊆ S is closed if and only if whenever a sequence of points in A converges,
the limit is also in A .
Proof

The following definition also shows up in standard calculus. The idea is to have a criterion for convergence of a sequence that does
not require knowing the limit a priori. But for metric spaces, this definition takes on added importance.

Suppose again that (S, d) is a metric space. A sequence of points (x_n : n ∈ N_+) in S is a Cauchy sequence if for every ϵ > 0 there exists k ∈ N_+ such that if m, n ∈ N_+ with m > k and n > k then d(x_m, x_n) < ϵ.

Cauchy sequences are named for the ubiquitous Augustin Cauchy. So for a Cauchy sequence, no matter how tiny ϵ > 0 may be, all
but finitely many terms of the sequence will be within ϵ distance of each other. A convergent sequence is always Cauchy.

Suppose again that (S, d) is a metric space. If a sequence of points (x_n : n ∈ N_+) in S converges, then the sequence is Cauchy.
Proof

Conversely, one might think that a Cauchy sequence should converge, but it's relatively trivial to create a situation where this is false.
Suppose that (S, d) is a metric space, and that there is a point x ∈ S that is the limit of a sequence of points in S that are all distinct
from x. Then the space T = S − {x} with the metric d restricted to T × T has a Cauchy sequence that does not converge.
Essentially, we have created a "convergence hole". So our next definition is very natural and very important.

Suppose again that (S, d) is a metric space and that A ⊆ S. Then A is complete if every Cauchy sequence in A converges to a
point in A .

Of course, completeness can be applied to the entire space S . Trivially, a complete set must be closed.

Suppose again that (S, d) is a metric space, and that A ⊆ S . If A is complete, then A is closed.
Proof

Completeness is such a crucial property that it is often imposed as an assumption on metric spaces that occur in applications. Even
though a Cauchy sequence may not converge, here is a partial result that will be useful later: if a Cauchy sequence has a convergent
subsequence, then the sequence itself converges.

Suppose again that (S, d) is a metric space, and that (x_n : n ∈ N_+) is a Cauchy sequence in S. If there exists a subsequence (x_{n_k} : k ∈ N_+) such that x_{n_k} → x ∈ S as k → ∞, then x_n → x as n → ∞.

Proof

Continuity
In metric spaces, continuity of functions also has simple characterizations in terms that are familiar from calculus. We start with local continuity. Recall that the general topological definition is that f : S → T is continuous at x ∈ S if f^{-1}(V) is a neighborhood of x in S for every open set V in T containing f(x).

Suppose that (S, d) and (T , e) are metric spaces, and that f : S → T . The continuity of f at x ∈ S is equivalent to each of the
following conditions:
1. If (x_n : n ∈ N_+) is a sequence in S with x_n → x as n → ∞ then f(x_n) → f(x) as n → ∞.
2. For every ϵ > 0, there exists δ > 0 such that if y ∈ S and d(x, y) < δ then e[f(y), f(x)] < ϵ.
Proof

More generally, recall that f continuous on A ⊆ S means that f is continuous at each x ∈ A , and that f continuous means that f is
continuous on S . So general continuity can be characterized in terms of sequential continuity and the ϵ-δ condition.

On a metric space, there are stronger versions of continuity.

Suppose again that (S, d) and (T, e) are metric spaces and that f : S → T. Then f is uniformly continuous if for every ϵ > 0 there exists δ > 0 such that if x, y ∈ S with d(x, y) < δ then e[f(x), f(y)] < ϵ.

In the ϵ-δ formulation of ordinary point-wise continuity above, δ depends on the point x in addition to ϵ. With uniform continuity,
there exists a δ depending only on ϵ that works uniformly in x ∈ S .

Suppose again that (S, d) and (T , e) are metric spaces, and that f : S → T . If f is uniformly continuous then f is continuous.

Here is an even stronger version of continuity.

Suppose again that (S, d) and (T, e) are metric spaces, and that f : S → T. Then f is Hölder continuous with exponent α ∈ (0, ∞) if there exists C ∈ (0, ∞) such that e[f(x), f(y)] ≤ C [d(x, y)]^α for all x, y ∈ S.

The definition is named for Otto Hölder. The exponent α is more important than the constant C, which generally does not have a
name. If α = 1 , f is said to be Lipschitz continuous, named for the German mathematician Rudolf Lipschitz.

Suppose again that (S, d) and (T, e) are metric spaces, and that f : S → T. If f is Hölder continuous with exponent α > 0 then
f is uniformly continuous.

The case where α = 1 and C <1 is particularly important.

Suppose again that (S, d) and (T , e) are metric spaces. A function f : S → T is a contraction if there exists C ∈ (0, 1) such
that
e[f (x), f (y)] ≤ C d(x, y), x, y ∈ S (1.10.6)

So contractions shrink distance. By the result above, a contraction is uniformly continuous. Part of the importance of contraction
maps is due to the famous Banach fixed-point theorem, named for Stefan Banach.

Suppose that (S, d) is a complete metric space and that f : S → S is a contraction. Then f has a unique fixed point. That is, there exists exactly one x∗ ∈ S with f(x∗) = x∗. Moreover, if x_0 ∈ S and we define x_n = f(x_{n−1}) recursively for n ∈ N_+, then x_n → x∗ as n → ∞.
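The theorem also suggests a practical algorithm: starting from any point, iterate f until successive terms agree to within a small tolerance. Here is a minimal sketch in Python (an illustration of ours, not part of the text); the example map cos on [0, 1] is a hypothetical choice, a contraction there with constant C = sin(1) < 1.

```python
import math

def fixed_point(f, x0, tol=1e-12, max_iter=1000):
    """Iterate x_n = f(x_{n-1}) until successive terms are within tol.

    By the Banach fixed-point theorem, this converges to the unique
    fixed point when f is a contraction on a complete metric space.
    """
    x = x0
    for _ in range(max_iter):
        y = f(x)
        if abs(y - x) < tol:  # successive terms close => near the fixed point
            return y
        x = y
    raise RuntimeError("no convergence within max_iter iterations")

# cos is a contraction on [0, 1] with constant C = sin(1) < 1
print(fixed_point(math.cos, 0.5))  # ≈ 0.7390851332151607
```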

Functions that preserve distance are particularly important. The term isometry means distance-preserving.

Suppose again that (S, d) and (T , e) are metric spaces, and that f : S → T . Then f is an isometry if e[f (x), f (y)] = d(x, y) for
every x, y ∈ S .

Suppose again that (S, d) and (T , e) are metric spaces, and that f : S → T . If f is an isometry, then f is one-to-one and
Lipschitz continuous.
Proof

In particular, an isometry f is uniformly continuous. If one metric space can be mapped isometrically onto another metric space, the
spaces are essentially the same.

Metric spaces (S, d) and (T , e) are isometric if there exists an isometry f that maps S onto T . Isometry is an equivalence
relation on metric spaces. That is, for metric spaces (S, d) , (T , e) , and (U , ρ),
1. (S, d) is isometric to (S, d) , the reflexive property.
2. If (S, d) is isometric to (T, e) then (T, e) is isometric to (S, d), the symmetric property.
3. If (S, d) is isometric to (T , e) and (T , e) is isometric to (U , ρ), then (S, d) is isometric to (U , ρ), the transitive property.
Proof

In particular, if metric spaces (S, d) and (T , e) are isometric, then as topological spaces, they are homeomorphic.

Compactness and Boundedness
In a metric space, various definitions related to a set being bounded are natural, and are related to the general concept of
compactness.

Suppose again that (S, d) is a metric space, and that A ⊆ S . Then A is bounded if there exists r ∈ (0, ∞) such that d(x, y) ≤ r
for all x, y ∈ A. The diameter of A is
diam(A) = inf{r > 0 : d(x, y) < r for all x, y ∈ A} (1.10.7)

Additional details

So A is bounded if and only if diam(A) < ∞ . Diameter is an increasing function relative to the subset partial order.

Suppose again that (S, d) is a metric space, and that A ⊆ B ⊆ S . Then diam(A) ≤ diam(B) .

Our next definition is stronger, but first let's review some terminology that we used for general topological spaces: If S is a set, A a subset of S, and 𝒜 a collection of subsets of S, then 𝒜 is said to cover A if A ⊆ ⋃𝒜. So with this terminology, we can talk about
open covers, closed covers, finite covers, disjoint covers, and so on.

Suppose again that (S, d) is a metric space, and that A ⊆ S . Then A is totally bounded if for every r > 0 there is a finite cover
of A with open balls of radius r.

Recall that for a general topological space, a set A is compact if every open cover of A has a finite subcover. So in a metric space, the term precompact is sometimes used instead of totally bounded: the set A is totally bounded if, for every r > 0, the cover of A by open balls of radius r has a finite subcover.

Suppose again that (S, d) is a metric space. If A ⊆ S is totally bounded then A is bounded.
Proof

Since a metric space is a Hausdorff space, a compact subset of a metric space is closed. Compactness also has a simple
characterization in terms of convergence of sequences.

Suppose again that (S, d) is a metric space. A subset C ⊆ S is compact if and only if every sequence of points in C has a
subsequence that converges to a point in C .
Proof

Hausdorff Measure and Dimension


Our last discussion is somewhat advanced, but is important for the study of certain random processes, particularly Brownian motion.
The idea is to measure the “size” of a set in a metric space in a topological way, and then use this measure to define a type of
“dimension”. We need a preliminary definition, using our convenient cover terminology. If (S, d) is a metric space, A ⊆ S , and
δ ∈ (0, ∞), then a countable δ-cover of A is a countable cover 𝒜 of A with the property that diam(B) < δ for each B ∈ 𝒜.

Suppose again that (S, d) is a metric space and that A ⊆ S. For δ ∈ (0, ∞) and k ∈ [0, ∞), define
H_δ^k(A) = inf{∑_{B∈𝒜} [diam(B)]^k : 𝒜 is a countable δ-cover of A} (1.10.9)
The k-dimensional Hausdorff measure of A is
H^k(A) = sup{H_δ^k(A) : δ > 0} = lim_{δ↓0} H_δ^k(A) (1.10.10)

Additional details

Note that the k -dimensional Hausdorff measure is defined for every k ∈ [0, ∞), not just nonnegative integers. Nonetheless, the
integer dimensions are interesting. The 0-dimensional measure of A is the number of points in A . In Euclidean space, which we
consider in (36), the measures of dimension 1, 2, and 3 are related to length, area, and volume, respectively.

Suppose again that (S, d) is a metric space and that A ⊆ S. The Hausdorff dimension of A is
dim_H(A) = inf{k ∈ [0, ∞) : H^k(A) = 0} (1.10.12)
Of special interest, as before, is the case when S = R^n for some n ∈ N_+ and d is the standard Euclidean distance, reviewed in (36).

As you might guess, the Hausdorff dimension of a point is 0, the Hausdorff dimension of a “simple curve” is 1, the Hausdorff
dimension of a “simple surface” is 2, and so on. But there are also sets with fractional Hausdorff dimension, and the stochastic
process Brownian motion provides some fascinating examples. The graph of standard Brownian motion has Hausdorff dimension
3/2 while the set of zeros has Hausdorff dimension 1/2.

Examples and Special Cases


Normed Vector Spaces
A norm on a vector space generates a metric on the space in a very simple, natural way.

Suppose that (S, +, ⋅) is a vector space, and that ∥ ⋅ ∥ is a norm on the space. Then d defined by d(x, y) = ∥y − x∥ for x, y ∈ S

is a metric on S .
Proof

On R^n, we have a variety of norms, and hence a variety of metrics.

For n ∈ N_+ and k ∈ [1, ∞), the function d_k given below is a metric on R^n:
d_k(x, y) = (∑_{i=1}^n |x_i − y_i|^k)^{1/k}, x = (x_1, x_2, …, x_n), y = (y_1, y_2, …, y_n) ∈ R^n (1.10.13)

Proof

Of course the metric d_2 is Euclidean distance, named for Euclid. This is the most important one: in a practical sense because it's the usual one that we use in the real world, and in a mathematical sense because the associated norm corresponds to the standard inner product on R^n given by
⟨x, y⟩ = ∑_{i=1}^n x_i y_i, x = (x_1, x_2, …, x_n), y = (y_1, y_2, …, y_n) ∈ R^n (1.10.15)

For n ∈ N_+, the function d_∞ defined below is a metric on R^n:
d_∞(x, y) = max{|x_i − y_i| : i ∈ {1, 2, …, n}}, x = (x_1, x_2, …, x_n), y = (y_1, y_2, …, y_n) ∈ R^n (1.10.16)

Proof

To justify the notation, recall that ∥x∥_k → ∥x∥_∞ as k → ∞ for x ∈ R^n, and hence d_k(x, y) → d_∞(x, y) as k → ∞ for x, y ∈ R^n.
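A quick numerical check of this limit, with a pair of points in R^3 chosen arbitrarily for illustration (the values are ours, not from the text):

```python
# Check that d_k(x, y) -> d_infty(x, y) as k grows, for points in R^3.
x = (1.0, -2.0, 4.0)
y = (0.0, 1.5, 1.0)

def d_k(x, y, k):
    # the k-norm metric: (sum |x_i - y_i|^k)^(1/k)
    return sum(abs(a - b) ** k for a, b in zip(x, y)) ** (1 / k)

def d_inf(x, y):
    # the maximum metric
    return max(abs(a - b) for a, b in zip(x, y))

for k in [1, 2, 4, 8, 16, 64]:
    print(k, d_k(x, y, k))
print("inf", d_inf(x, y))  # the d_k values decrease toward this maximum, 3.5
```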

Figure 1.10.1: From inside out, the boundaries of the unit balls centered at the origin in R^2 for the metrics d_k with k ∈ {3/4, 1, 2, 3, ∞}.
Suppose now that S is a nonempty set. Recall that the collection V of all functions f : S → R is a vector space under the usual
pointwise definition of addition and scalar multiplication. That is, if f , g ∈ V and c ∈ R , then f + g ∈ V and cf ∈ V are defined
by (f + g)(x) = f (x) + g(x) and (cf )(x) = cf (x) for x ∈ S . Recall further that the collection U of bounded functions f : S → R
is a vector subspace of V , and moreover, ∥ ⋅ ∥ defined by ∥f ∥ = sup{|f (x)| : x ∈ S} is a norm on U , known as the supremum
norm. It follows that U is a metric space with the metric d defined by

d(f , g) = ∥f − g∥ = sup{|f (x) − g(x)| : x ∈ S} (1.10.18)

Vector spaces of bounded, real-valued functions with the supremum norm are very important in the study of probability and stochastic processes. Note that the supremum norm on U generalizes the maximum norm on R^n, since we can think of a point in R^n as a function from {1, 2, …, n} into R. Later, as part of our discussion on integration with respect to a positive measure, we will see how to generalize the k-norms on R^n to spaces of functions.
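When S is finite the supremum is a maximum, so the distance is easy to compute directly. A small sketch, with hypothetical functions f and g on S = {1, …, 5}:

```python
# Supremum-norm distance between two bounded functions on a finite set S.
# With S finite, sup becomes max; the functions below are our own examples.
S = range(1, 6)
f = lambda x: x ** 2
g = lambda x: 3 * x

d = max(abs(f(x) - g(x)) for x in S)
print(d)  # max of |x^2 - 3x| over {1,...,5} is |25 - 15| = 10
```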

Products of Metric Spaces


If we have a finite number of metric spaces, then we can combine the individual metrics, together with a norm on the vector space R^n, to create a metric on the Cartesian product space.

Suppose n ∈ {2, 3, …}, and that (S_i, d_i) is a metric space for each i ∈ {1, 2, …, n}. Suppose also that ∥ ⋅ ∥ is a norm on R^n. Then the function d given as follows is a metric on S = S_1 × S_2 × ⋯ × S_n:
d(x, y) = ∥(d_1(x_1, y_1), d_2(x_2, y_2), …, d_n(x_n, y_n))∥, x = (x_1, x_2, …, x_n), y = (y_1, y_2, …, y_n) ∈ S (1.10.19)

Proof
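As an illustrative sketch (our own construction, not from the text), the following builds a product metric from two component metrics using the Euclidean norm on R^2, one allowable choice of norm in the theorem above. The component spaces are hypothetical: R with its usual metric, and a set of labels with the discrete metric.

```python
import math

def product_metric(d1, d2):
    """Combine metrics on S1 and S2 into a metric on S1 x S2,
    using the Euclidean norm on R^2."""
    def d(x, y):
        return math.hypot(d1(x[0], y[0]), d2(x[1], y[1]))
    return d

d1 = lambda a, b: abs(a - b)                  # usual metric on R
d2 = lambda a, b: 0.0 if a == b else 1.0      # discrete metric
d = product_metric(d1, d2)
print(d((0.0, 'u'), (3.0, 'v')))  # sqrt(3^2 + 1^2) ≈ 3.1623
```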

Graphs
Recall that a graph (in the combinatorial sense) consists of a countable set S of vertices and a set E ⊆ S × S of edges. In this discussion, we assume that the graph is undirected in the sense that (x, y) ∈ E if and only if (y, x) ∈ E, and has no loops so that (x, x) ∉ E for x ∈ S. Finally, recall that a path of length n ∈ N_+ from x ∈ S to y ∈ S is a sequence (x_0, x_1, …, x_n) ∈ S^{n+1} such that x_0 = x, x_n = y, and (x_{i−1}, x_i) ∈ E for i ∈ {1, 2, …, n}. The graph is connected if there exists a path of finite length between any two distinct vertices in S. Such a graph has a natural metric:

Suppose that G = (S, E) is a connected graph. Then d defined as follows is a metric on S: d(x, x) = 0 for x ∈ S, and d(x, y) is the length of the shortest path from x to y for distinct x, y ∈ S.


Proof
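For a finite connected graph, the graph metric can be computed by breadth-first search. The sketch below is our own illustration; the 5-cycle example is hypothetical.

```python
from collections import deque

def graph_distance(edges, x, y):
    """Length of the shortest path from x to y, via breadth-first search."""
    if x == y:
        return 0
    # build an undirected adjacency map from the edge set
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    dist = {x: 0}
    queue = deque([x])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                if v == y:
                    return dist[v]
                queue.append(v)
    raise ValueError("graph is not connected")

# the cycle on vertices 0..4: shortest route from 0 to 3 goes 0-4-3
cycle = [(i, (i + 1) % 5) for i in range(5)]
print(graph_distance(cycle, 0, 3))  # 2
```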

The Discrete Topology


Suppose that S is a nonempty set. Recall that the discrete topology on S is P(S) , the power set of S , so that every subset of S is
open (and closed). The discrete topology is metrizable, and there are lots of metrics that generate this topology.

Suppose again that S is a nonempty set. A metric d on S with the property that there exists c ∈ (0, ∞) such that d(x, y) ≥ c for
distinct x, y ∈ S generates the discrete topology.
Proof

So any metric that is bounded from below (for distinct points) generates the discrete topology. It's easy to see that there are such
metrics.

Suppose again that S is a nonempty set. The function d on S × S defined by d(x, x) = 0 for x ∈ S and d(x, y) = 1 for distinct
x, y ∈ S is a metric on S , known as the discrete metric. This metric generates the discrete topology.

Proof

In probability applications, the discrete topology is often appropriate when S is countable. Note also that the discrete metric is the
graph distance if S is made into the complete graph, so that (x, y) is an edge for every pair of distinct vertices x, y ∈ S .

This page titled 1.10: Metric Spaces is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

1.11: Measurable Spaces

In this section we discuss some topics from measure theory that are a bit more advanced than the topics in the early sections of this
chapter. However, measure-theoretic ideas are essential for a deep understanding of probability, since probability is itself a
measure. The most important of the definitions is the σ-algebra, a collection of subsets of a set with certain closure properties. Such
collections play a fundamental role, even for applied probability, in encoding the state of information about a random experiment.
On the other hand, we won't be overly pedantic about measure-theoretic details in this text. Unless we say otherwise, we assume
that all sets that appear are measurable (that is, members of the appropriate σ-algebras), and that all functions are measurable
(relative to the appropriate σ-algebras).
Although this section is somewhat abstract, many of the proofs are straightforward. Be sure to try the proofs yourself before
reading the ones in the text.

Algebras and σ-Algebras


Suppose that S is a set, playing the role of a universal set for a particular mathematical model. It is sometimes impossible to
include all subsets of S in our model, particularly when S is uncountable. In a sense, the more sets that we include, the harder it is
to have consistent theories. However, we almost always want the collection of admissible subsets to be closed under the basic set
operations. This leads to some important definitions.

Algebras of Sets
Suppose that S is a nonempty collection of subsets of S . Then S is an algebra (or field) if it is closed under complement and
union:
1. If A ∈ S then A^c ∈ S.
2. If A ∈ S and B ∈ S then A ∪ B ∈ S.

If S is an algebra of subsets of S then


1. S ∈ S
2. ∅ ∈ S
Proof

Suppose that S is an algebra of subsets of S and that A_i ∈ S for each i in a finite index set I.
1. ⋃_{i∈I} A_i ∈ S
2. ⋂_{i∈I} A_i ∈ S

Proof

Thus it follows that an algebra of sets is closed under a finite number of set operations. That is, if we start with a finite number of
sets in the algebra S , and build a new set with a finite number of set operations (union, intersection, complement), then the new
set is also in S . However in many mathematical theories, probability in particular, this is not sufficient; we often need the
collection of admissible subsets to be closed under a countable number of set operations.

σ -Algebras of Sets
Suppose that S is a nonempty collection of subsets of S . Then S is a σ-algebra (or σ-field) if the following axioms are
satisfied:
1. If A ∈ S then A^c ∈ S.
2. If A_i ∈ S for each i in a countable index set I, then ⋃_{i∈I} A_i ∈ S.

Clearly a σ-algebra of subsets is also an algebra of subsets, so the basic results for algebras above still hold. In particular, S ∈ S

and ∅ ∈ S .

If A_i ∈ S for each i in a countable index set I, then ⋂_{i∈I} A_i ∈ S.

Proof

Thus a σ-algebra of subsets of S is closed under countable unions and intersections. This is the reason for the symbol σ in the
name. As mentioned in the introductory paragraph, σ-algebras are of fundamental importance in mathematics generally and
probability theory specifically, and thus deserve a special definition:

If S is a set and S a σ-algebra of subsets of S , then the pair (S, S ) is called a measurable space.

The term measurable space will make more sense in the next chapter, when we discuss positive measures (and in particular,
probability measures) on such spaces.

Suppose that S is a set and that S is a finite algebra of subsets of S . Then S is also a σ-algebra.
Proof
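Closure under complement and union can be verified exhaustively when everything is finite. The following checker is our own utility, illustrated on a hypothetical four-set collection:

```python
def is_algebra(S_set, collection):
    """Check closure under complement and pairwise union for a finite collection."""
    col = {frozenset(A) for A in collection}
    U = frozenset(S_set)
    closed_complement = all(U - A in col for A in col)
    closed_union = all(A | B in col for A in col for B in col)
    return bool(col) and closed_complement and closed_union

S = {1, 2, 3, 4}
col = [set(), {1, 2}, {3, 4}, {1, 2, 3, 4}]
print(is_algebra(S, col))  # True: this collection has the form {∅, A, A^c, S}
```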

However, there are algebras that are not σ-algebras. Here is the classic example:

Suppose that S is an infinite set. The collection of finite and co-finite subsets of S defined below is an algebra of subsets of S ,
but not a σ-algebra:
F = {A ⊆ S : A is finite or A^c is finite} (1.11.1)

Proof

General Constructions
Recall that P(S) denotes the collection of all subsets of S , called the power set of S . Trivially, P(S) is the largest σ-algebra of
S . The power set is often the appropriate σ-algebra if S is countable, but as noted above, is sometimes too large to be useful if S is

uncountable. At the other extreme, the smallest σ-algebra of S is given in the following result:

The collection {∅, S} is a σ-algebra.


Proof

In many cases, we want to construct a σ-algebra that contains certain basic sets. The next two results show how to do this.

Suppose that S_i is a σ-algebra of subsets of S for each i in a nonempty index set I. Then S = ⋂_{i∈I} S_i is also a σ-algebra of subsets of S.
Proof

Note that no restrictions are placed on the index set I , other than it be nonempty, so in particular it may well be uncountable.

Suppose that S is a set and that B is a collection of subsets of S . The σ-algebra generated by B is

σ(B) = ⋂{S : S is a σ-algebra of subsets of S and B ⊆ S } (1.11.2)

If B is countable then σ(B) is said to be countably generated.

So the σ-algebra generated by B is the intersection of all σ-algebras that contain B , which by the previous result really is a σ-
algebra. Note that the collection of σ-algebras in the intersection is not empty, since P(S) is in the collection. Think of the sets in
B as basic sets that we want to be measurable, but do not form a σ-algebra.

The σ-algebra σ(B) is the smallest σ-algebra containing B.


1. B ⊆ σ(B)
2. If S is a σ-algebra of subsets of S and B ⊆ S then σ(B) ⊆ S .
Proof

Note that the conditions in the last theorem completely characterize σ(B). If S_1 and S_2 satisfy the conditions, then by (a), B ⊆ S_1 and B ⊆ S_2. But then by (b), S_1 ⊆ S_2 and S_2 ⊆ S_1.

If A is a subset of S then σ{A} = {∅, A, A^c, S}.
Proof

We can generalize the previous result. Recall that a collection of subsets A = {A_i : i ∈ I} is a partition of S if A_i ∩ A_j = ∅ for i, j ∈ I with i ≠ j, and ⋃_{i∈I} A_i = S.

Suppose that A = {A_i : i ∈ I} is a countable partition of S into nonempty subsets. Then σ(A) is the collection of all unions of sets in A. That is,
σ(A) = {⋃_{j∈J} A_j : J ⊆ I} (1.11.3)

Proof

A σ-algebra of this form is said to be generated by a countable partition. Note that since A_i ≠ ∅ for i ∈ I, the representation of a set in σ(A) as a union of sets in A is unique. That is, if J, K ⊆ I and J ≠ K then ⋃_{j∈J} A_j ≠ ⋃_{k∈K} A_k. In particular, if there are n nonempty sets in A, so that #(I) = n, then there are 2^n subsets of I and hence 2^n sets in σ(A).

Suppose now that A = {A_1, A_2, …, A_n} is a collection of n subsets of S (not necessarily disjoint). To describe the σ-algebra generated by A we need a bit more notation. For x = (x_1, x_2, …, x_n) ∈ {0, 1}^n (a bit string of length n), let B_x = ⋂_{i=1}^n A_i^{x_i} where A_i^1 = A_i and A_i^0 = A_i^c.

In the setting above,
1. B = {B_x : x ∈ {0, 1}^n} partitions S.
2. A_i = ⋃{B_x : x ∈ {0, 1}^n, x_i = 1} for i ∈ {1, 2, …, n}.
3. σ(A) = σ(B) = {⋃_{x∈J} B_x : J ⊆ {0, 1}^n}.

Proof

Recall that there are 2^n bit strings of length n. The sets in A are said to be in general position if the sets in B are distinct (and hence there are 2^n of them) and are nonempty. In this case, there are 2^{2^n} sets in σ(A).
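For small n, the sets B_x and the σ-algebra they generate can be enumerated by brute force. A minimal sketch in Python; the universe S and the sets A_1, A_2 are hypothetical choices that happen to be in general position:

```python
from itertools import chain, combinations, product

# A hypothetical universe and two subsets in general position
S = set(range(8))
A1 = {0, 1, 2, 3}
A2 = {2, 3, 4, 5}
sets = [A1, A2]
n = len(sets)

# Atoms B_x: for each bit string x, intersect A_i (bit 1) or its complement (bit 0)
atoms = []
for x in product([0, 1], repeat=n):
    B = set(S)
    for A, bit in zip(sets, x):
        B &= A if bit == 1 else S - A
    atoms.append(frozenset(B))

# General position: the atoms are distinct and nonempty
assert len(set(atoms)) == 2 ** n and all(atoms)

# sigma(A) consists of all unions of atoms: 2^(2^n) sets in all
sigma = {frozenset(chain.from_iterable(J)) for r in range(len(atoms) + 1)
         for J in combinations(atoms, r)}
print(len(sigma))  # 16 = 2^(2^2)
```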

Open the Venn diagram app. This app shows two subsets A and B of S in general position, and lists the 16 sets in σ{A, B}.
1. Select each of the 4 sets that partition S: A ∩ B, A ∩ B^c, A^c ∩ B, A^c ∩ B^c.

2. Select each of the other 12 sets in σ{A, B} and note how each is a union of some of the sets in (a).

Sketch a Venn diagram with sets A_1, A_2, A_3 in general position. Identify the set B_x for each x ∈ {0, 1}^3.

If a σ-algebra is generated by a collection of basic sets, then each set in the σ-algebra is generated by a countable number of the
basic sets.

Suppose that S is a set and B a nonempty collection of subsets of S . Then


σ(B) = {A ⊆ S : A ∈ σ(C ) for some countable C ⊆ B} (1.11.4)

Proof

A σ-algebra on a set naturally leads to a σ-algebra on a subset.

Suppose that (S, S ) is a measurable space, and that R ⊆ S . Let R = {A ∩ R : A ∈ S } . Then


1. R is a σ-algebra of subsets of R .
2. If R ∈ S then R = {B ∈ S : B ⊆ R} .
Proof

The σ-algebra R is the σ-algebra on R induced by S . The following construction is useful for counterexamples. Compare this
example with the one for finite and co-finite sets.

Let S be a nonempty set. The collection of countable and co-countable subsets of S is
C = {A ⊆ S : A is countable or A^c is countable} (1.11.5)

1. C is a σ-algebra
2. C = σ{{x} : x ∈ S}, the σ-algebra generated by the singleton sets.
Proof

Of course, if S is itself countable then C = P(S). On the other hand, if S is uncountable, then there exists A ⊆ S such that A and A^c are uncountable. Thus, A ∉ C, but A = ⋃_{x∈A} {x}, and of course {x} ∈ C. Thus, we have an example of a σ-algebra that is not closed under general unions.

Topology and Measure


One of the most important ways to generate a σ-algebra is by means of topology. Recall that a topological space consists of a set S
and a topology S , the collection of open subsets of S . Most spaces that occur in probability and stochastic processes are
topological spaces, so it's crucial that the topological and measure-theoretic structures are compatible.

Suppose that (S, S ) is a topological space. Then σ(S ) is the Borel σ-algebra on S , and (S, σ(S )) is a Borel measurable
space.

So the Borel σ-algebra on S, named for Émile Borel, is generated by the open subsets of S. Thus, a topological space (S, S)
naturally leads to a measurable space (S, σ(S )). Since a closed set is simply the complement of an open set, the Borel σ-algebra
contains the closed sets as well (and in fact is generated by the closed sets). Here are some other sets that are in the Borel σ-
algebra:

Suppose again that (S, S) is a topological space and that I is a countable index set.
1. If A_i is open for each i ∈ I then ⋂_{i∈I} A_i ∈ σ(S). Such sets are called G_δ sets.
2. If A_i is closed for each i ∈ I then ⋃_{i∈I} A_i ∈ σ(S). Such sets are called F_σ sets.
3. If (S, S) is Hausdorff then {x} ∈ σ(S) for every x ∈ S.


Proof

In terms of part (c), recall that a topological space is Hausdorff, named for Felix Hausdorff, if the topology can distinguish
individual points. Specifically, if x, y ∈ S are distinct then there exist disjoint open sets U , V with x ∈ U and y ∈ V . This is a
very basic property possessed by almost all topological spaces that occur in applications. A simple corollary of (c) is that if the
topological space (S, S ) is Hausdorff then A ∈ σ(S ) for every countable A ⊆ S .
Let's note the extreme cases. If S has the discrete topology P(S) , so that every set is open (and closed), then of course the Borel
σ-algebra is also P(S) . As noted above, this is often the appropriate σ-algebra if S is countable, but is often too large if S is

uncountable. If S has the trivial topology {S, ∅}, then the Borel σ-algebra is also {S, ∅}, and so is also trivial.
Recall that a base for a topological space (S, T ) is a collection B ⊆ T with the property that every set in T is a union of a
collection of sets in B . In short, every open set is a union of some of the basic open sets.

Suppose that (S, S ) is a topological space with a countable base B . Then σ(B) = σ(S ) .
Proof

The topological spaces that occur in probability and stochastic processes are usually assumed to have a countable base (along with
other nice properties such as the Hausdorff property and local compactness). The σ-algebra used for such a space is usually the
Borel σ-algebra, which by the previous result, is countably generated.

Measurable Functions
Recall that a set usually comes with a σ-algebra of admissible subsets. A natural requirement on a function is that the inverse image
of an admissible set in the range space be admissible in the domain space. Here is the formal definition.

Suppose that (S, S) and (T, T) are measurable spaces. A function f : S → T is measurable if f^{-1}(A) ∈ S for every A ∈ T.

If the σ-algebra in the range space is generated by a collection of basic sets, then to check the measurability of a function, we need
only consider inverse images of basic sets:

Suppose again that (S, S ) and (T , T ) are measurable spaces, and that T = σ(B) for a collection of subsets B of T . Then
f : S → T is measurable if and only if f^{-1}(B) ∈ S for every B ∈ B.

Proof

If you have reviewed the section on topology then you may have noticed a striking parallel between the definition of continuity for
functions on topological spaces and the definition of measurability for functions on measurable spaces: A function from one
topological space to another is continuous if the inverse image of an open set in the range space is open in the domain space. A
function from one measurable space to another is measurable if the inverse image of a measurable set in the range space is
measurable in the domain space. If we start with topological spaces, which we often do, and use the Borel σ-algebras to get
measurable spaces, then we get the following (hardly surprising) connection.

Suppose that (S, S ) and (T , T ) are topological spaces, and that we give S and T the Borel σ-algebras σ(S ) and σ(T )

respectively. If f : S → T is continuous, then f is measurable.


Proof

Measurability is preserved under composition, the most important method for combining functions.

Suppose that (R, R) , (S, S ), and (T , T ) are measurable spaces. If f : R → S is measurable and g : S → T is measurable,
then g ∘ f : R → T is measurable.
Proof

If T is given the smallest possible σ-algebra or if S is given the largest one, then any function from S into T is measurable.

Every function f : S → T is measurable in each of the following cases:


1. T = {∅, T } and S is an arbitrary σ-algebra of subsets of S
2. S = P(S) and T is an arbitrary σ-algebra of subsets of T .
Proof

When there are several σ-algebras for the same set, then we use the phrase with respect to so that we can be precise. If a function is
measurable with respect to a given σ-algebra on its domain, then it's measurable with respect to any larger σ-algebra on this space.
If the function is measurable with respect to a σ-algebra on the range space then it's measurable with respect to any smaller σ-
algebra on this space.

Suppose that S has σ-algebras R and S with R ⊆ S, and that T has σ-algebras T and U with T ⊆ U. If f : S → T is measurable with respect to R and U, then f is measurable with respect to S and T.
Proof

The following construction is particularly important in probability theory:

Suppose that S is a set and (T, T) is a measurable space. Suppose also that f : S → T and define σ(f) = {f^{-1}(A) : A ∈ T}. Then
1. σ(f ) is a σ-algebra on S .
2. σ(f ) is the smallest σ-algebra on S that makes f measurable.
Proof
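When S and T are finite, σ(f) can be computed by brute force as the collection of preimages. A short sketch with a hypothetical function f on a four-point set, where T carries the power set σ-algebra:

```python
from itertools import chain, combinations

# sigma(f) = { f^{-1}(A) : A in T } for a function on a finite set S.
S = [1, 2, 3, 4]
f = {1: 'a', 2: 'a', 3: 'b', 4: 'b'}   # f : S -> T with T = {'a', 'b'}
T_vals = set(f.values())

def powerset(xs):
    xs = list(xs)
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

sigma_f = {frozenset(x for x in S if f[x] in A) for A in map(set, powerset(T_vals))}
print(sorted(map(sorted, sigma_f)))
# [[], [1, 2], [1, 2, 3, 4], [3, 4]] -- the sigma-algebra generated by the
# partition {f = 'a'}, {f = 'b'}
```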

Appropriately enough, σ(f ) is called the σ-algebra generated by f . Often, S will have a given σ-algebra S and f will be
measurable with respect to S and T . In this case, σ(f ) ⊆ S . We can generalize to an arbitrary collection of functions on S .

Suppose S is a set and that (T_i, T_i) is a measurable space for each i in a nonempty index set I. Suppose also that f_i : S → T_i for each i ∈ I. The σ-algebra generated by this collection of functions is
σ{f_i : i ∈ I} = σ{σ(f_i) : i ∈ I} = σ{f_i^{-1}(A) : i ∈ I, A ∈ T_i} (1.11.6)
Again, this is the smallest σ-algebra on S that makes f_i measurable for each i ∈ I.

Product Sets
Product sets arise naturally in the form of the higher-dimensional Euclidean spaces R^n for n ∈ {2, 3, …}. In addition, product spaces are particularly important in probability, where they are used to describe the spaces associated with sequences of random variables. More general product spaces arise in the study of stochastic processes. We start with the product of two sets; the generalization to products of n sets and to general products is straightforward, although the notation gets more complicated.

Suppose that (S, S ) and (T , T ) are measurable spaces. The product σ-algebra on S × T is
S ⊗ T = σ{A × B : A ∈ S , B ∈ T } (1.11.7)

So the definition is natural: the product σ-algebra is generated by products of measurable sets. Our next goal is to consider the measurability of functions defined on, or mapping into, product spaces. Of basic importance are the projection functions. If S and T are sets, let p_1 : S × T → S and p_2 : S × T → T be defined by p_1(x, y) = x and p_2(x, y) = y for (x, y) ∈ S × T. Recall that p_1 is the projection onto the first coordinate and p_2 is the projection onto the second coordinate. The product σ-algebra is the smallest σ-algebra that makes the projections measurable:

Suppose again that (S, S) and (T, T) are measurable spaces. Then S ⊗ T = σ{p_1, p_2}.


Proof

Projection functions make it easy to study functions mapping into a product space.

Suppose that (R, R), (S, S) and (T, T) are measurable spaces, and that S × T is given the product σ-algebra S ⊗ T. Suppose also that f : R → S × T, so that f(x) = (f_1(x), f_2(x)) for x ∈ R, where f_1 : R → S and f_2 : R → T are the coordinate functions. Then f is measurable if and only if f_1 and f_2 are measurable.

Proof

Our next goal is to consider cross sections of sets in a product space and cross sections of functions defined on a product space. It
will help to introduce some new functions, which in a sense are complementary to the projection functions.

Suppose again that (S, S) and (T, T) are measurable spaces, and that S × T is given the product σ-algebra S ⊗ T.
1. For x ∈ S the function 1_x : T → S × T, defined by 1_x(y) = (x, y) for y ∈ T, is measurable.
2. For y ∈ T the function 2_y : S → S × T, defined by 2_y(x) = (x, y) for x ∈ S, is measurable.
Proof

Now our work is easy.

Suppose again that (S, S ) and (T , T ) are measurable spaces, and that C ∈ S ⊗T . Then
1. For x ∈ S , {y ∈ T : (x, y) ∈ C } ∈ T .
2. For y ∈ T , {x ∈ S : (x, y) ∈ C } ∈ S .
Proof

The set in (a) is the cross section of C in the first coordinate at x, and the set in (b) is the cross section of C in the second
coordinate at y . As a simple corollary to the theorem, note that if A ⊆ S , B ⊆ T and A × B ∈ S ⊗ T then A ∈ S and B ∈ T .
That is, the only measurable product sets are products of measurable sets. Here is the measurability result for cross-sectional
functions:

Suppose again that (S, S) and (T, T) are measurable spaces, and that S × T is given the product σ-algebra S ⊗ T.
Suppose also that (U , U ) is another measurable space, and that f : S × T → U is measurable. Then
1. The function y ↦ f (x, y) from T to U is measurable for each x ∈ S .
2. The function x ↦ f (x, y) from S to U is measurable for each y ∈ T .
Proof

The results for products of two spaces generalize in a completely straightforward way to a product of n spaces.

Suppose n ∈ N_+ and that (S_i, S_i) is a measurable space for each i ∈ {1, 2, …, n}. The product σ-algebra on the Cartesian product set S_1 × S_2 × ⋯ × S_n is
S_1 ⊗ S_2 ⊗ ⋯ ⊗ S_n = σ{A_1 × A_2 × ⋯ × A_n : A_i ∈ S_i for all i ∈ {1, 2, …, n}} (1.11.8)

So again, the product σ-algebra is generated by products of measurable sets. Results analogous to the theorems above hold. In the special case that (S_i, S_i) = (S, S) for i ∈ {1, 2, …, n}, the Cartesian product becomes S^n and the corresponding product σ-algebra is denoted S^n. The notation is natural, but potentially confusing. Note that S^n is not the Cartesian product of S n times, but rather the σ-algebra generated by sets of the form A_1 × A_2 × ⋯ × A_n where A_i ∈ S for i ∈ {1, 2, …, n}.

We can also extend these ideas to a general product. To recall the definition, suppose that S_i is a set for each i in a nonempty index set I. The product set ∏_{i∈I} S_i consists of all functions x : I → ⋃_{i∈I} S_i such that x(i) ∈ S_i for each i ∈ I. To make the notation look more like a simple Cartesian product, we often write x_i instead of x(i) for the value of a function in the product set at i ∈ I.

The next definition gives the appropriate σ-algebra for the product set.

Suppose that (S_i, S_i) is a measurable space for each i in a nonempty index set I. The product σ-algebra on the product set ∏_{i∈I} S_i is
σ{∏_{i∈I} A_i : A_i ∈ S_i for each i ∈ I and A_i = S_i for all but finitely many i ∈ I} (1.11.9)

If you have reviewed the section on topology, the definition should look familiar. If the spaces were topological spaces instead of measurable spaces, with S_i the topology of S_i for i ∈ I, then the set of products in the displayed expression above is a base for the product topology on ∏_{i∈I} S_i.

The definition can also be understood in terms of projections. Recall that the projection onto coordinate j ∈ I is the function p_j : ∏_{i∈I} S_i → S_j given by p_j(x) = x_j. The product σ-algebra is the smallest σ-algebra on the product set that makes all of the projections measurable.

Suppose again that (S_i, S_i) is a measurable space for each i in a nonempty index set I, and let S denote the product σ-algebra on the product set S = ∏_{i∈I} S_i. Then S = σ{p_i : i ∈ I}.

Proof

In the special case that (S, S) is a fixed measurable space and (S_i, S_i) = (S, S) for all i ∈ I, the product set ∏_{i∈I} S_i is just the collection of functions from I into S, often denoted S^I. The product σ-algebra is then denoted S^I, a notation that is natural, but again potentially confusing. Here is the main measurability result for a function mapping into a product space.
again potentially confusing. Here is the main measurability result for a function mapping into a product space.

Suppose that (R, R) is a measurable space, and that (S_i, S_i) is a measurable space for each i in a nonempty index set I. As before, let ∏_{i∈I} S_i have the product σ-algebra. Suppose now that f : R → ∏_{i∈I} S_i. For i ∈ I let f_i : R → S_i denote the ith coordinate function of f, so that f_i(x) = [f(x)]_i for x ∈ R. Then f is measurable if and only if f_i is measurable for each i ∈ I.

Proof

Just as with the product of two sets, cross-sectional sets and functions are measurable with respect to the product σ-algebra. Again, it's best to work with some special functions.

Suppose that (S_i, S_i) is a measurable space for each i in an index set I with at least two elements. For j ∈ I and u ∈ S_j, define the function j_u : ∏_{i∈I−{j}} S_i → ∏_{i∈I} S_i by j_u(x) = y where y_i = x_i for i ≠ j and y_j = u. Then j_u is measurable with respect to the product σ-algebras.


Proof

In words, for j ∈ I and u ∈ S_j, the function j_u takes a point in the product set ∏_{i∈I−{j}} S_i and assigns u to coordinate j to give a point in ∏_{i∈I} S_i. If A ⊆ ∏_{i∈I} S_i, then j_u^{-1}(A) is the cross section of A in coordinate j at u. So it follows immediately from the previous result that the cross sections of a measurable set are measurable. Cross sections of measurable functions are also measurable. Suppose that (T, T) is another measurable space, and that f : ∏_{i∈I} S_i → T is measurable. The cross section of f in coordinate j ∈ I at u ∈ S_j is simply f ∘ j_u : ∏_{i∈I−{j}} S_i → T, a composition of measurable functions.
u I−{j}

However, a non-measurable set can have measurable cross sections, even in a product of two spaces.

Suppose that S is an uncountable set with the σ-algebra C of countable and co-countable sets as in (21). Consider S × S with
the product σ-algebra C ⊗ C . Let D = {(x, x) : x ∈ S} , the diagonal of S × S . Then D has measurable cross sections, but
D is not measurable.

Proof

Special Cases
Most of the sets encountered in applied probability are either countable, or subsets of R^n for some n, or more generally, subsets of a product of a countable number of sets of these types. In the study of stochastic processes, various spaces of functions play an important role. In this subsection, we will explore the most important special cases.

Discrete Spaces
If S is countable and S = P(S) is the collection of all subsets of S , then (S, S ) is a discrete measurable space.

Thus if (S, S ) is discrete, all subsets of S are measurable and every function from S to another measurable space is measurable.
The power set is also the discrete topology on S , so S is a Borel σ-algebra as well. As a topological space, (S, S ) is complete,
locally compact, Hausdorff, and since S is countable, separable. Moreover, the discrete topology corresponds to the discrete metric
d , defined by d(x, x) = 0 for x ∈ S and d(x, y) = 1 for x, y ∈ S with x ≠ y .

Euclidean Spaces
Recall that for n ∈ N_+, the Euclidean topology on R^n is generated by the standard Euclidean metric d_n given by
d_n(x, y) = √(∑_{i=1}^n (x_i − y_i)^2), x = (x_1, x_2, …, x_n), y = (y_1, y_2, …, y_n) ∈ R^n (1.11.10)
With this topology, R^n is complete, connected, locally compact, Hausdorff, and separable.

For n ∈ N_+, the n-dimensional Euclidean measurable space is (R^n, R_n) where R_n is the Borel σ-algebra corresponding to the standard Euclidean topology on R^n.

The one-dimensional case is particularly important. In this case, the standard Euclidean metric d is given by d(x, y) = |x − y| for
x, y ∈ R. The Borel σ-algebra R can be generated by various collections of intervals.

Each of the following collections generates R.
1. B_1 = {I ⊆ R : I is an interval}
2. B_2 = {(a, b] : a, b ∈ R, a < b}
3. B_3 = {(−∞, b] : b ∈ R}

Proof

Since the Euclidean topology has a countable base, R is countably generated. In fact each collection of intervals above, but with endpoints restricted to Q, generates R. Moreover, R can also be constructed from σ-algebras that are generated by countable partitions. First recall that for n ∈ N, the set of dyadic rationals (or binary rationals) of rank n or less is D_n = {j/2^n : j ∈ Z}. Note that D_n is countable and D_n ⊆ D_{n+1} for n ∈ N. Moreover, the set D = ⋃_{n∈N} D_n of all dyadic rationals is dense in R. The dyadic rationals are often useful in various applications because D_n has the natural ordered enumeration j ↦ j/2^n for each n ∈ N. Now let
𝒟_n = {(j/2^n, (j+1)/2^n] : j ∈ Z}, n ∈ N (1.11.11)
Then 𝒟_n is a countable partition of R into nonempty intervals of equal length 1/2^n, so E_n = σ(𝒟_n) consists of unions of sets in 𝒟_n as described above. Every set in 𝒟_n is the union of two sets in 𝒟_{n+1}, so clearly E_n ⊆ E_{n+1} for n ∈ N. Finally, the Borel σ-algebra on R is R = σ(⋃_{n=0}^∞ 𝒟_n) = σ(⋃_{n=0}^∞ E_n). This construction turns out to be useful in a number of settings.

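The rank-n dyadic partition is easy to work with computationally: each point of R lies in exactly one interval of 𝒟_n, and refining n splits the containing interval in two. A small sketch (our own illustration), locating the intervals that contain 1/3:

```python
from fractions import Fraction

def dyadic_interval(x, n):
    """Return the unique interval (j/2^n, (j+1)/2^n] of the rank-n dyadic
    partition of R that contains x."""
    scaled = Fraction(x) * 2 ** n
    j = scaled.numerator // scaled.denominator  # floor of x * 2^n
    if scaled == j:       # x sits on a partition point: it belongs to the
        j -= 1            # interval ending at x, since intervals are (a, b]
    return Fraction(j, 2 ** n), Fraction(j + 1, 2 ** n)

# Each refinement splits the containing interval, consistent with E_n ⊆ E_{n+1}
for n in range(4):
    lo, hi = dyadic_interval(Fraction(1, 3), n)
    print(n, f"({lo}, {hi}]")
# 0 (0, 1]   1 (0, 1/2]   2 (1/4, 1/2]   3 (1/4, 3/8]
```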
For n ∈ {2, 3, …}, the Euclidean topology on R^n is the n-fold product topology formed from the Euclidean topology on R. So the Borel σ-algebra R_n is also the n-fold power σ-algebra formed from R. Finally, R_n can be generated by n-fold products of sets in any of the three collections in the previous theorem.

Space of Real Functions


Suppose that (S, S ) is a measurable space. From our general discussion of functions, recall that the usual arithmetic operations on
functions from S into R are defined pointwise.

If f : S → R and g : S → R are measurable and a ∈ R , then each of the following functions from S into R is also
measurable:
1. f + g
2. f − g
3. f g
4. af
Proof

Similarly, if f : S → R ∖ {0} is measurable, then so is 1/f. Recall that the set of functions from S into R is a vector space, under
the pointwise definitions of addition and scalar multiplication. But once again, we usually want to restrict our attention to
measurable functions. Thus, it's nice to know that the measurable functions from S into R also form a vector space. This follows
immediately from the closure properties (a) and (d) of the previous theorem. Of particular importance in probability and stochastic
processes is the vector space of bounded, measurable functions f : S → R , with the supremum norm
∥f ∥ = sup {|f (x)| : x ∈ S} (1.11.12)

The elementary functions that we encounter in calculus and other areas of applied mathematics are functions from subsets of R into
R . The elementary functions include algebraic functions (which in turn include the polynomial and rational functions), the usual

transcendental functions (exponential, logarithm, trigonometric), and the usual functions constructed from these by composition,
the arithmetic operations, and by piecing together. As we might hope, all of the elementary functions are measurable.

This page titled 1.11: Measurable Spaces is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

1.12: Special Set Structures

There are several other types of algebraic set structures that are weaker than σ-algebras. These are not particularly important in
themselves, but are important for constructing σ-algebras and the measures on these σ-algebras. You may want to skip this section
if you are not interested in questions of existence and uniqueness of positive measures.

Basic Theory
Definitions
Throughout this section, we assume that S is a set and S is a nonempty collection of subsets of S . Here are the main definitions
we will need.

S is a π-system if S is closed under finite intersections: if A, B ∈ S then A ∩ B ∈ S .

Closure under intersection is clearly a very simple property, but π-systems turn out to be useful enough to deserve a name.

S is a λ -system if it is closed under complements and countable disjoint unions.


1. If A ∈ S then A^c ∈ S.
2. If A_i ∈ S for i in a countable index set I and A_i ∩ A_j = ∅ for i ≠ j then ⋃_{i∈I} A_i ∈ S.

S is a semi-algebra if it is closed under intersection and if complements can be written as finite, disjoint unions:
1. If A, B ∈ S then A ∩ B ∈ S .
2. If A ∈ S then there exists a finite, disjoint collection {B_i : i ∈ I} ⊆ S such that A^c = ⋃_{i∈I} B_i.

For our final structure, recall that a sequence (A_1, A_2, …) of subsets of S is increasing if A_n ⊆ A_{n+1} for all n ∈ N_+. The sequence is decreasing if A_{n+1} ⊆ A_n for all n ∈ N_+. Of course, these are the standard meanings of increasing and decreasing relative to the ordinary order ≤ on N_+ and the subset partial order ⊆ on P(S).

S is a monotone class if it is closed under increasing unions and decreasing intersections:
1. If (A_1, A_2, …) is an increasing sequence of sets in S then ⋃_{n=1}^∞ A_n ∈ S.
2. If (A_1, A_2, …) is a decreasing sequence of sets in S then ⋂_{n=1}^∞ A_n ∈ S.
2. If (A 1, A2 , …) is a decreasing sequence of sets in S then ⋂ n=1
An ∈ S .

If (A_1, A_2, …) is an increasing sequence of sets then we sometimes write ⋃_{n=1}^∞ A_n = lim_{n→∞} A_n. Similarly, if (A_1, A_2, …) is a decreasing sequence of sets we sometimes write ⋂_{n=1}^∞ A_n = lim_{n→∞} A_n. The reason for this notation will become clear in the section on Convergence in the chapter on Probability Spaces. With this notation, a monotone class S is defined by the condition that if (A_1, A_2, …) is an increasing or decreasing sequence of sets in S then lim_{n→∞} A_n ∈ S.

Basic Theorems
Our most important set structure, the σ-algebra, has all of the properties in the definitions above.

If S is a σ-algebra then S is a π-system, a λ -system, a semi-algebra, and a monotone class.

If S is a λ -system then S ∈ S and ∅ ∈ S .


Proof

Any type of algebraic structure on subsets of S that is defined purely in terms of closure properties will be preserved under
intersection. That is, we will have results that are analogous to how σ-algebras are generated from more basic sets, with completely
straightforward and analogous proofs. In the following two theorems, the term system could mean π-system, λ-system, or monotone
class of subsets of S .

If S_i is a system for each i in an index set I and ⋂_{i∈I} S_i is nonempty, then ⋂_{i∈I} S_i is a system of the same type.

The condition that ⋂_{i∈I} S_i be nonempty is unnecessary for a λ-system, by the result above. Now suppose that B is a nonempty collection of subsets of S, thought of as basic sets of some sort. Then the system generated by B is the intersection of all systems that contain B.

The system S generated by B is the smallest system containing B , and is characterized by the following properties:
1. B ⊆ S .
2. If T is a system and B ⊆ T then S ⊆ T.

Note however, that the previous two results do not apply to semi-algebras, because the semi-algebra is not defined purely in terms of closure properties (the condition on A^c is not a closure property).

If S is a monotone class and an algebra, then S is a σ-algebra.


Proof

By definition, a semi-algebra is a π-system. More importantly, a semi-algebra can be used to construct an algebra.

Suppose that S is a semi-algebra of subsets of S. Then the collection of finite, disjoint unions of sets in S is an algebra.

Proof

We will say that our nonempty collection S is closed under proper set difference if A, B ∈ S and A ⊆ B implies B∖A ∈ S .
The following theorem gives the basic relationship between λ -systems and monotone classes.

Suppose that S is a nonempty collection of subsets of S .


1. If S is a λ -system then S is a monotone class and is closed under proper set difference.
2. If S is a monotone class, is closed under proper set difference, and contains S , then S is a λ -system.
Proof

The following theorem is known as the monotone class theorem, and is due to the mathematician Paul Halmos.

Suppose that A is an algebra, M is a monotone class, and A ⊆M . Then σ(A ) ⊆ M .


Proof

As noted in (5), a σ-algebra is both a π-system and a λ -system. The converse is also true, and is one of the main reasons for
studying these structures.

If S is a π-system and a λ -system then S is a σ-algebra.


Proof

The importance of π-systems and λ -systems stems in part from Dynkin's π-λ theorem given next. It's named for the mathematician
Eugene Dynkin.

Suppose that A is a π-system of subsets of S , B is a λ -system of subsets of S , and A ⊆B . Then σ(A ) ⊆ B .


Proof

Examples and Special Cases


Suppose that S is a set and A is a finite partition of S . Then S = {∅} ∪ A is a semi-algebra of subsets of S .
Proof

Euclidean Spaces
The following example is particularly important because it will be used to construct positive measures on R. Let
B = {(a, b] : a, b ∈ R, a < b} ∪ {(−∞, b] : b ∈ R} ∪ {(a, ∞) : a ∈ R} (1.12.3)

B is a semi-algebra of subsets of R.
Proof
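The complement condition in the semi-algebra definition can be seen concretely: the complement of each type of interval in B is a finite, disjoint union of intervals in B. A minimal sketch (our own encoding of the intervals, not from the text):

```python
# Intervals in B are encoded as pairs (a, b): (a, b] when b is finite,
# (-inf, b] when a = -inf, and (a, inf) when b = inf.
INF = float('inf')

def complement(interval):
    """Complement of an interval in B, as a disjoint list of intervals in B."""
    a, b = interval
    parts = []
    if a > -INF:
        parts.append((-INF, a))      # (-inf, a]
    if b < INF:
        parts.append((b, INF))       # (b, inf)
    return parts

print(complement((0, 1)))        # (0, 1]^c    = (-inf, 0] U (1, inf)
print(complement((-INF, 1)))     # (-inf, 1]^c = (1, inf)
print(complement((0, INF)))      # (0, inf)^c  = (-inf, 0]
```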

It follows from the theorem above that the collection A of finite disjoint unions of intervals in B is an algebra. Recall also that σ(B) = σ(A) is the Borel σ-algebra of R, named for Émile Borel. We can generalize all of this to R^n for n ∈ N_+.

The collection B_n = {∏_{i=1}^n A_i : A_i ∈ B for each i ∈ {1, 2, …, n}} is a semi-algebra of subsets of R^n.

Recall also that σ(B_n) is the σ-algebra of Borel sets of R^n.

Product Spaces
The examples in this discussion are important for constructing positive measures on product spaces.

Suppose that S is a semi-algebra of subsets of a set S and that T is a semi-algebra of subsets of a set T . Then
U = {A × B : A ∈ S , B ∈ T } (1.12.4)

is a semi-algebra of subsets of S × T .
Proof

This result extends in a completely straightforward way to a product of a finite number of sets.

Suppose that n ∈ N_+ and that S_i is a semi-algebra of subsets of a set S_i for i ∈ {1, 2, …, n}. Then
U = {∏_{i=1}^n A_i : A_i ∈ S_i for all i ∈ {1, 2, …, n}} (1.12.7)
is a semi-algebra of subsets of ∏_{i=1}^n S_i.

Note that the semi-algebra of products of intervals in R^n described above is a special case of this result. For the product of an infinite sequence of sets, the result is a bit more tricky.

Suppose that S_i is a semi-algebra of subsets of a set S_i for i ∈ N_+. Then
U = {∏_{i=1}^∞ A_i : A_i ∈ S_i for all i ∈ N_+ and A_i = S_i for all but finitely many i ∈ N_+} (1.12.8)
is a semi-algebra of subsets of ∏_{i=1}^∞ S_i.
Proof

Note that this result would not be true with U = {∏_{i=1}^∞ A_i : A_i ∈ S_i for all i ∈ N_+}. In general, the complement of a set in U cannot be written as a finite disjoint union of sets in U.


CHAPTER OVERVIEW
2: Probability Spaces
The basic topics in this chapter are fundamental to probability theory, and should be accessible to new students of probability. We
start with the paradigm of the random experiment and its mathematical model, the probability space. The main objects in this
model are sample spaces, events, random variables, and probability measures. We also study several concepts of fundamental
importance: conditional probability and independence.
The advanced topics can be skipped if you are a new student of probability, or can be studied later, as the need arises. These topics
include the convergence of random variables, the measure-theoretic foundations of probability theory, and the existence and
construction of probability measures and random processes.
2.1: Random Experiments
2.2: Events and Random Variables
2.3: Probability Measures
2.4: Conditional Probability
2.5: Independence
2.6: Convergence
2.7: Measure Spaces
2.8: Existence and Uniqueness
2.9: Probability Spaces Revisited
2.10: Stochastic Processes
2.11: Filtrations and Stopping Times


2.1: Random Experiments

Experiments
Probability theory is based on the paradigm of a random experiment; that is, an experiment whose outcome cannot be predicted
with certainty, before the experiment is run. In classical or frequency-based probability theory, we also assume that the experiment
can be repeated indefinitely under essentially the same conditions. The repetitions can be in time (as when we toss a single coin
over and over again) or in space (as when we toss a bunch of similar coins all at once). The repeatability assumption is important
because the classical theory is concerned with the long-term behavior as the experiment is replicated. By contrast, subjective or
belief-based probability theory is concerned with measures of belief about what will happen when we run the experiment. In this
view, repeatability is a less crucial assumption. In any event, a complete description of a random experiment requires a careful
definition of precisely what information about the experiment is being recorded, that is, a careful definition of what constitutes an
outcome.
The term parameter refers to a non-random quantity in a model that, once chosen, remains constant. Many probability models of
random experiments have one or more parameters that can be adjusted to fit the physical experiment being modeled.
The subjects of probability and statistics have an inverse relationship of sorts. In probability, we start with a completely specified
mathematical model of a random experiment. Our goal is to perform various computations that help us understand the random
experiment and predict what will happen when we run it. In statistics, by contrast, we start with an incompletely
specified mathematical model (one or more parameters may be unknown, for example). We run the experiment to collect data, and
then use the data to draw inferences about the unknown factors in the mathematical model.

Compound Experiments
Suppose that we have n experiments (E₁, E₂, …, Eₙ). We can form a new, compound experiment by performing the n
experiments in sequence, E₁ first, and then E₂ and so on, independently of one another. The term independent means, intuitively,
that the outcome of one experiment has no influence over any of the other experiments. We will make the term mathematically
precise later.
In particular, suppose that we have a basic experiment. A fixed number (or even an infinite number) of independent replications of
the basic experiment is a new, compound experiment. Many experiments turn out to be compound experiments and moreover, as
noted above, (classical) probability theory itself is based on the idea of replicating an experiment.
In particular, suppose that we have a simple experiment with two outcomes. Independent replications of this experiment are
referred to as Bernoulli trials, named for Jacob Bernoulli. This is one of the simplest, but most important models in probability.
More generally, suppose that we have a simple experiment with k possible outcomes. Independent replications of this experiment
are referred to as multinomial trials.
Sometimes an experiment occurs in well-defined stages, but in a dependent way, in the sense that the outcome of a given stage is
influenced by the outcomes of the previous stages.

Sampling Experiments
In most statistical studies, we start with a population of objects of interest. The objects may be people, memory chips, or acres of
corn, for example. Usually there are one or more numerical measurements of interest to us—the height and weight of a person, the
lifetime of a memory chip, the amount of rain, amount of fertilizer, and yield of an acre of corn.
Although our interest is in the entire population of objects, this set is usually too large and too amorphous to study. Instead, we
collect a random sample of objects from the population and record the measurements of interest for each object in the sample.
There are two basic types of sampling. If we sample with replacement, each item is replaced in the population before the next draw;
thus, a single object may occur several times in the sample. If we sample without replacement, objects are not replaced in the
population. The chapter on Finite Sampling Models explores a number of models based on sampling from a finite population.
Sampling with replacement can be thought of as a compound experiment, based on independent replications of the simple
experiment of drawing a single object from the population and recording the measurements of interest. Conversely, a compound
experiment that consists of n independent replications of a simple experiment can usually be thought of as a sampling experiment.

On the other hand, sampling without replacement is an experiment that consists of dependent stages, because the population
changes with each draw.

Examples and Applications


Probability theory is often illustrated using simple devices from games of chance: coins, dice, cards, spinners, urns with balls, and so
forth. Examples based on such devices are pedagogically valuable because of their simplicity and conceptual clarity. On the other
hand, it would be a terrible shame if you were to think that probability is only about gambling and games of chance. Rather, try to
see problems involving coins, dice, etc. as metaphors for more complex and realistic problems.

Coins and Dice


In terms of probability, the important fact about a coin is simply that when tossed it lands on one side or the other. Coins in Western
societies, dating to antiquity, usually have the head of a prominent person engraved on one side and something of lesser importance
on the other. In non-Western societies, coins often did not have a head on either side, but did have distinct engravings on the two
sides, one typically more important than the other. Nonetheless, heads and tails are the ubiquitous terms used in probability theory
to distinguish the front or obverse side of the coin from the back or reverse side of the coin.

Figure 2.1.1 : Obverse and reverse sides of a Roman coin, about 241 CE, from Wikipedia

Consider the coin experiment of tossing a coin n times and recording the score (1 for heads or 0 for tails) for each toss.
1. Identify a parameter of the experiment.
2. Interpret the experiment as a compound experiment.
3. Interpret the experiment as a sampling experiment.
4. Interpret the experiment as n Bernoulli trials.
Answer

In the simulation of the coin experiment, set n = 5 . Run the simulation 100 times and observe the outcomes.

Dice are randomizing devices that, like coins, date to antiquity and come in a variety of sizes and shapes. Typically, the faces of a
die have numbers or other symbols engraved on them. Again, the important fact is that when a die is thrown, a unique face is
chosen (usually the upward face, but sometimes the downward one). For more on dice, see the introductory section in the chapter
on Games of Chance.

Consider the dice experiment of throwing a k-sided die (with faces numbered 1 to k) n times and recording the scores for each
throw.
1. Identify the parameters of the experiment.
2. Interpret the experiment as a compound experiment.
3. Interpret the experiment as a sampling experiment.
4. Identify the experiment as n multinomial trials.
Answer

In reality, most dice are Platonic solids (named for Plato of course) with 4, 6, 8, 12, or 20 sides. The six-sided die is the standard
die.

Figure 2.1.2 : Blue Platonic dice

In the simulation of the dice experiment, set n = 5 . Run the simulation 100 times and observe the outcomes.

In the die-coin experiment, a standard die is thrown and then a coin is tossed the number of times shown on the die. The
sequence of coin scores is recorded (1 for heads and 0 for tails). Interpret the experiment as a compound experiment.
Answer

Note that this experiment can be obtained by randomizing the parameter n in the basic coin experiment in (1).

Run the simulation of the die-coin experiment 100 times and observe the outcomes.

In the coin-die experiment, a coin is tossed. If the coin lands heads, a red die is thrown and if the coin lands tails, a green die is
thrown. The coin score (1 for heads and 0 for tails) and the die score are recorded. Interpret the experiment as a compound
experiment.
Answer

Run the simulation of the coin-die experiment 100 times and observe the outcomes.

Cards
Playing cards, like coins and dice, date to antiquity. From the point of view of probability, the important fact is that a playing card
encodes a number of properties or attributes on the front of the card that are hidden on the back of the card. (Later in this chapter,
these properties will become random variables.) In particular, a standard card deck can be modeled by the Cartesian product set

D = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, j, q, k} × {♣, ♢, ♡, ♠} (2.1.1)

where the first coordinate encodes the denomination or kind (ace, 2–10, jack, queen, king) and where the second coordinate
encodes the suit (clubs, diamonds, hearts, spades). Sometimes we represent a card as a string rather than an ordered pair (for
example q♡ rather than (q, ♡) for the queen of hearts). Some other properties, derived from the two main ones, are color
(diamonds and hearts are red, clubs and spades are black), face (jacks, queens, and kings have faces, the other cards do not), and
suit order (from lowest to highest rank: (♣, ♢, ♡, ♠)).

Consider the card experiment that consists of dealing n cards from a standard deck (without replacement).
1. Identify a parameter of the experiment.
2. Interpret the experiment as a compound experiment.
3. Interpret the experiment as a sampling experiment.
Answer

In the simulation of the card experiment, set n = 5 . Run the simulation 100 times and observe the outcomes.

The special case n = 5 is the poker experiment and the special case n = 13 is the bridge experiment.

Open each of the following to see depictions of card playing in some famous paintings.
1. Cheat with the Ace of Clubs by Georges de La Tour
2. The Cardsharps by Michelangelo Caravaggio
3. The Card Players by Paul Cézanne
4. His Station and Four Aces by CM Coolidge
5. Waterloo by CM Coolidge

Urn Models
Urn models are often used in probability as simple metaphors for sampling from a finite population.

An urn contains m distinct balls, labeled from 1 to m. The experiment consists of selecting n balls from the urn, without
replacement, and recording the sequence of ball numbers.
1. Identify the parameters of the experiment.
2. Interpret the experiment as a compound experiment.
3. Interpret the experiment as a sampling experiment.

Answer

Consider the basic urn model of the previous exercise. Suppose that r of the m balls are red and the remaining m − r balls are
green. Identify an additional parameter of the model. This experiment is a metaphor for sampling from a general dichotomous
population.
Answer

In the simulation of the urn experiment, set m = 100 , r = 40 , and n = 25 . Run the experiment 100 times and observe the
results.

An urn initially contains m balls; r are red and m − r are green. A ball is selected from the urn and removed, and then
replaced with k balls of the same color. The process is repeated. This is known as Pólya's urn model, named after George
Pólya.
1. Identify the parameters of the experiment.
2. Interpret the case k = 0 as a sampling experiment.
3. Interpret the case k = 1 as a sampling experiment.
Answer

Open the image of the painting Allegory of Fortune by Dosso Dossi. Presumably the young man has chosen lottery tickets
from an urn.

Buffon's Coin Experiment

Buffon's coin experiment consists of tossing a coin with radius r ≤ 1/2 on a floor covered with square tiles of side length 1. The
coordinates of the center of the coin are recorded, relative to axes through the center of the square, parallel to the sides. The
experiment is named for the Comte de Buffon.
1. Identify a parameter of the experiment.
2. Interpret the experiment as a compound experiment.
3. Interpret the experiment as a sampling experiment.
Answer

In the simulation of Buffon's coin experiment, set r = 0.1. Run the experiment 100 times and observe the outcomes.

Reliability
In the usual model of structural reliability, a system consists of n components, each of which is either working or failed. The states
of the components are uncertain, and hence define a random experiment. The system as a whole is also either working or failed,
depending on the states of the components and how the components are connected. For example, a series system works if and only
if each component works, while a parallel system works if and only if at least one component works. More generally, a k out of n
system works if and only if at least k components work.

Consider the k out of n reliability model.


1. Identify two parameters.
2. What value of k gives a series system?
3. What value of k gives a parallel system?
Answer

The reliability model above is a static model. It can be extended to a dynamic model by assuming that each component is initially
working, but has a random time until failure. The system as a whole would also have a random time until failure that would depend
on the component failure times and the structure of the system.

Genetics
In ordinary sexual reproduction, the genetic material of a child is a random combination of the genetic material of the parents.
Thus, the birth of a child is a random experiment with respect to outcomes such as eye color, hair type, and many other physical

traits. We are often particularly interested in the random transmission of traits and the random transmission of genetic disorders.
For example, let's consider an overly simplified model of an inherited trait that has two possible states (phenotypes), say a pea
plant whose pods are either green or yellow. The term allele refers to alternate forms of a particular gene, so we are assuming that
there is a gene that determines pod color, with two alleles: g for green and y for yellow. A pea plant has two alleles for the trait
(one from each parent), so the possible genotypes are
gg, alleles for green pods from each parent.
gy, an allele for green pods from one parent and an allele for yellow pods from the other (we usually cannot observe which
parent contributed which allele).
yy, alleles for yellow pods from each parent.

The genotypes gg and yy are called homozygous because the two alleles are the same, while the genotype gy is called heterozygous
because the two alleles are different. Typically, one of the alleles of the inherited trait is dominant and the other recessive. Thus, for
example, if g is the dominant allele for pod color, then a plant with genotype gg or gy has green pods, while a plant with genotype
yy has yellow pods. Genes are passed from parent to child in a random manner, so each new plant is a random experiment with

respect to pod color.


Pod color in peas was actually one of the first examples of an inherited trait studied by Gregor Mendel, who is considered the father
of modern genetics. Mendel also studied the color of the flowers (yellow or purple), the length of the stems (short or long), and the
texture of the seeds (round or wrinkled).
For another example, the ABO blood type in humans is controlled by three alleles: a , b , and o . Thus, the possible genotypes are
aa, ab, ao, bb, bo, and oo. The alleles a and b are co-dominant and o is recessive. Thus there are four possible blood types

(phenotypes):
Type A: genotype aa or ao
Type B: genotype bb or bo
Type AB: genotype ab
Type O: genotype oo
Of course, blood may be typed in much more extensive ways than the simple ABO typing. The RH factor (positive or negative) is
the most well-known example.
For our third example, consider a sex-linked hereditary disorder in humans. This is a disorder due to a defect on the X chromosome
(one of the two chromosomes that determine gender). Suppose that h denotes the healthy allele and d the defective allele for the
gene linked to the disorder. Women have two X chromosomes, and typically d is recessive. Thus, a woman with genotype hh is
completely normal with respect to the condition; a woman with genotype hd does not have the disorder, but is a carrier, since she
can pass the defective allele to her children; and a woman with genotype dd has the disorder. A man has only one X chromosome
(his other sex chromosome, the Y chromosome, typically plays no role in the disorder). A man with genotype h is normal and a
man with genotype d has the disorder. Examples of sex-linked hereditary disorders are dichromatism, the most common form of
color-blindness, and hemophilia, a bleeding disorder. Again, genes are passed from parent to child in a random manner, so the birth
of a child is a random experiment in terms of the disorder.

Point Processes
There are a number of important processes that generate “random points in time”. Often the random points are referred to as
arrivals. Here are some specific examples:
times that a piece of radioactive material emits elementary particles
times that customers arrive at a store
times that requests arrive at a web server
failure times of a device
To formalize an experiment, we might record the number of arrivals during a specified interval of time or we might record the
times of successive arrivals.
There are other processes that produce “random points in space”. For example,
flaws in a piece of sheet metal
errors in a string of symbols (in a computer program, for example)

raisins in a cake
misprints on a page
stars in a region of space
Again, to formalize an experiment, we might record the number of points in a given region of space.

Statistical Experiments
In 1879, Albert Michelson constructed an experiment for measuring the speed of light with an interferometer. The velocity of
light data set contains the results of 100 repetitions of Michelson's experiment. Explore the data set and explain, in a general
way, the variability of the data.
Answer

In 1998, two students at the University of Alabama in Huntsville designed the following experiment: purchase a bag of M&Ms
(of a specified advertised size) and record the counts for red, green, blue, orange, and yellow candies, and the net weight (in
grams). Explore the M&M data set and explain, in a general way, the variability of the data.
Answer

In 1999, two researchers at Belmont University designed the following experiment: capture a cicada in the Middle Tennessee
area, and record the body weight (in grams), the wing length, wing width, and body length (in millimeters), the gender, and the
species type. The cicada data set contains the results of 104 repetitions of this experiment. Explore the cicada data and explain,
in a general way, the variability of the data.
Answer

On June 6, 1761, James Short made 53 measurements of the parallax of the sun, based on the transit of Venus. Explore the
Short data set and explain, in a general way, the variability of the data.
Answer

In 1954, two massive field trials were conducted in an attempt to determine the effectiveness of the new vaccine developed by
Jonas Salk for the prevention of polio. In both trials, a treatment group of children were given the vaccine while a control
group of children were not. The incidence of polio in each group was measured. Explore the polio field trial data set and
explain, in a general way, the underlying random experiment.
Answer

Each year from 1969 to 1972 a lottery was held in the US to determine who would be drafted for military service. Essentially,
the lottery was a ball and urn model and became famous because many believed that the process was not sufficiently random.
Explore the Vietnam draft lottery data set and speculate on how one might judge the degree of randomness.
Answer

Deterministic Versus Probabilistic Models


One could argue that some of the examples discussed above are inherently deterministic. In tossing a coin, for example, if we know
the initial conditions (involving position, velocity, rotation, etc.), the forces acting on the coin (gravity, air resistance, etc.), and the
makeup of the coin (shape, mass density, center of mass, etc.), then the laws of physics should allow us to predict precisely how the
coin will land. This is true in a technical, theoretical sense, but false in a very real sense. Coins, dice, and many more complicated
and important systems are chaotic in the sense that the outcomes of interest depend in a very sensitive way on the initial conditions
and other parameters. In such situations, it might well be impossible to ever know the initial conditions and forces accurately
enough to use deterministic methods.
In the coin experiment, for example, even if we strip away most of the real world complexity, we are still left with an essentially
random experiment. Joseph Keller in his article “The Probability of Heads” deterministically analyzed the toss of a coin under a
number of ideal assumptions:
1. The coin is a perfect circle and has negligible thickness
2. The center of gravity of the coin is the geometric center.
3. The coin is initially heads up and is given an initial upward velocity u and angular velocity ω.

4. In flight, the coin rotates about a horizontal axis along a diameter of the coin.
5. In flight, the coin is governed only by the force of gravity. All other possible forces (air resistance or wind, for example) are
neglected.
6. The coin does not bounce or roll after landing (as might be the case if it lands in sand or mud).
Of course, few of these ideal assumptions are valid for real coins tossed by humans. Let t = u/g where g is the acceleration of
gravity (in appropriate units). Note that t has units of time (in seconds) and hence is independent of how distance is
measured. The scaled parameter t actually represents the time required for the coin to reach its maximum height.
Keller showed that the regions of the parameter space (t, ω) where the coin lands either heads up or tails up are separated by the
curves

ω = (2n ± 1/2) π/(2t),  n ∈ N (2.1.2)

The parameter n is the total number of revolutions in the toss. A plot of some of these curves is given below. The largest region, in
the lower left corner, corresponds to the event that the coin does not complete even one rotation, and so of course lands heads up,
just as it started. The next region corresponds to one rotation, with the coin landing tails up. In general, the regions alternate
between heads and tails.

Figure 2.1.3 : Regions of heads and tails


The important point, of course, is that for even moderate values of t and ω, the curves are very close together, so that a small
change in the initial conditions can easily shift the outcome from heads up to tails up or conversely. As noted in Keller's article, the
probabilist and statistician Persi Diaconis determined experimentally that typical values of the initial conditions for a real coin toss
are t = 1/4 seconds and ω = 76π ≈ 238.6 radians per second. These values correspond to n = 19 revolutions in the toss. Of course,
this parameter point is far beyond the region shown in our graph, in a region where the curves are exquisitely close together.
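To make the separating curves concrete, here is a minimal Python sketch of Keller's deterministic model under the ideal assumptions above (the function name is illustrative). The coin is in flight for time 2t, so it rotates through the angle 2ωt; by (2.1.2) it lands heads up exactly when this angle is within π/2 of a whole number of revolutions.

    import math

    def keller_lands_heads(t, omega):
        # total rotation angle during a flight of duration 2t
        theta = 2 * omega * t
        # distance from theta to the nearest whole number of revolutions
        distance = abs((theta + math.pi) % (2 * math.pi) - math.pi)
        return distance < math.pi / 2

    # Diaconis's typical values: t = 1/4 second, omega = 76*pi radians per second,
    # giving exactly n = 19 revolutions, so the idealized coin lands heads up
    print(keller_lands_heads(0.25, 76 * math.pi))  # True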


2.2: Events and Random Variables

The purpose of this section is to study two basic types of objects that form part of the model of a random experiment. If you are a
new student of probability, just ignore the measure-theoretic terminology and skip the technical details.

Sample Spaces
The Set of Outcomes
Recall that in a random experiment, the outcome cannot be predicted with certainty, before the experiment is run. On the other
hand:

We assume that we can identify a fixed set S that includes all possible outcomes of a random experiment. This set plays the
role of the universal set when modeling the experiment.

For simple experiments, S may be precisely the set of possible outcomes. More often, for complex experiments, S is a
mathematically convenient set that includes the possible outcomes and perhaps other elements as well. For example, if the
experiment is to throw a standard die and record the score that occurs, we would let S = {1, 2, 3, 4, 5, 6}, the set of possible
outcomes. On the other hand, if the experiment is to capture a cicada and measure its body weight (in milligrams), we might
conveniently take S = [0, ∞) , even though most elements of this set are impossible (we hope!). The problem is that we may not
know exactly the outcomes that are possible. Can a light bulb burn without failure for one thousand hours? For one thousand days?
For one thousand years?
Often the outcome of a random experiment consists of one or more real measurements, and thus S consists of all possible
measurement sequences, a subset of Rⁿ for some n ∈ N₊. More generally, suppose that we have n experiments and that Sᵢ is the
set of outcomes for experiment i ∈ {1, 2, …, n}. Then the Cartesian product S₁ × S₂ × ⋯ × Sₙ is the natural set of outcomes
for the compound experiment that consists of performing the n experiments in sequence. In particular, if we have a basic
experiment with S as the set of outcomes, then Sⁿ is the natural set of outcomes for the compound experiment that consists of n
replications of the basic experiment. Similarly, if we have an infinite sequence of experiments and Sᵢ is the set of outcomes for
experiment i ∈ N₊, then S₁ × S₂ × ⋯ is the natural set of outcomes for the compound experiment that consists of
performing the given experiments in sequence. In particular, the set of outcomes for the compound experiment that consists of
indefinite replications of a basic experiment is S^∞ = S × S × ⋯. This is an essential special case, because (classical) probability
theory is based on the idea of replicating a given experiment.

Events

Consider again a random experiment with S as the set of outcomes. Certain subsets of S are referred to as events. Suppose that
A ⊆ S is a given event, and that the experiment is run, resulting in outcome s ∈ S .

1. If s ∈ A then we say that A occurs.


2. If s ∉ A then we say that A does not occur.

Intuitively, you should think of an event as a meaningful statement about the experiment: every such statement translates into an
event, namely the set of outcomes for which the statement is true. In particular, S itself is an event; by definition it always occurs.
At the other extreme, the empty set ∅ is also an event; by definition it never occurs.
For a note on terminology, recall that a mathematical space consists of a set together with other mathematical structures defined on
the set. An example you may be familiar with is a vector space, which consists of a set (the vectors) together with the operations of
addition and scalar multiplication. In probability theory, many authors use the term sample space for the set of outcomes of a
random experiment, but here is the more careful definition:

The sample space of an experiment is (S, 𝒮) where S is the set of outcomes and 𝒮 is the collection of events.
Details

The Algebra of Events
The standard algebra of sets leads to a grammar for discussing random experiments and allows us to construct new events from
given events. In the following results, suppose that S is the set of outcomes of a random experiment, and that A and B are events.

A ⊆B if and only if the occurrence of A implies the occurrence of B .


Proof

A∪B is the event that occurs if and only if A occurs or B occurs.


Proof

A∩B is the event that occurs if and only if A occurs and B occurs.
Proof

A and B are disjoint if and only if they are mutually exclusive; they cannot both occur on the same run of the experiment.
Proof

Aᶜ is the event that occurs if and only if A does not occur.
Proof

A∖B is the event that occurs if and only if A occurs and B does not occur.
Proof
(A ∩ Bᶜ) ∪ (B ∩ Aᶜ) is the event that occurs if and only if one but not both of the given events occurs.
Proof

Recall that the event in (10) is the symmetric difference of A and B , and is sometimes denoted AΔB . This event corresponds to
exclusive or, as opposed to the ordinary union A ∪ B which corresponds to inclusive or.

(A ∩ B) ∪ (Aᶜ ∩ Bᶜ) is the event that occurs if and only if both or neither of the given events occurs.
Proof

In the Venn diagram app, observe the diagram of each of the 16 events that can be constructed from A and B .

Suppose now that 𝒜 = {Aᵢ : i ∈ I} is a collection of events for the random experiment, where I is a countable index set.

⋃𝒜 = ⋃_{i∈I} Aᵢ is the event that occurs if and only if at least one event in the collection occurs.
Proof

⋂𝒜 = ⋂_{i∈I} Aᵢ is the event that occurs if and only if every event in the collection occurs.
Proof

𝒜 is a pairwise disjoint collection if and only if the events are mutually exclusive; at most one of the events could occur on a
given run of the experiment.
Proof

Suppose now that (A₁, A₂, …) is an infinite sequence of events.


⋂_{n=1}^∞ ⋃_{i=n}^∞ Aᵢ is the event that occurs if and only if infinitely many of the given events occur. This event is sometimes called
the limit superior of (A₁, A₂, …).

Proof


⋃_{n=1}^∞ ⋂_{i=n}^∞ Aᵢ is the event that occurs if and only if all but finitely many of the given events occur. This event is sometimes
called the limit inferior of (A₁, A₂, …).

Proof

Limit superiors and inferiors are discussed in more detail in the section on convergence.

Random Variables
Intuitively, a random variable is a measurement of interest in the context of the experiment. Simple examples include the number
of heads when a coin is tossed several times, the sum of the scores when a pair of dice are thrown, the lifetime of a device subject
to random stress, or the weight of a person chosen from a population. Many more examples are given in the exercises below.
Mathematically, a random variable is a function defined on the set of outcomes.

A function X from S into a set T is a random variable for the experiment with values in T .
Details

Probability has its own notation, very different from other branches of mathematics. As a case in point, random variables, even
though they are functions, are usually denoted by capital letters near the end of the alphabet. The use of a letter near the end of the
alphabet is intended to emphasize the idea that the object is a variable in the context of the experiment. The use of a capital letter is
intended to emphasize the fact that it is not an ordinary algebraic variable to which we can assign a specific value, but rather a
random variable whose value is indeterminate until we run the experiment. Specifically, when we run the experiment an outcome
s ∈ S occurs, and random variable X takes the value X(s) ∈ T .

Figure 2.2.1 : A random variable as a function defined on the set of outcomes.


If B ⊆ T, we use the notation {X ∈ B} for the inverse image {s ∈ S : X(s) ∈ B}, rather than X⁻¹(B). Again, the notation is
more natural since we think of X as a variable in the experiment. Think of {X ∈ B} as a statement about X, which then translates
into the event {s ∈ S : X(s) ∈ B}.

Figure 2.2.2 : The event {X ∈ B} corresponding to B ⊆ T


Again, every statement about a random variable X with values in T translates into an inverse image of the form {X ∈ B} for
some B ⊆ T. So, for example, if x ∈ T then {X = x} = {X ∈ {x}} = {s ∈ S : X(s) = x}. If X is a real-valued random
variable and a, b ∈ R with a < b then {a ≤ X ≤ b} = {X ∈ [a, b]} = {s ∈ S : a ≤ X(s) ≤ b}.

Suppose that X is a random variable taking values in T , and that A, B ⊆T . Then


1. {X ∈ A ∪ B} = {X ∈ A} ∪ {X ∈ B}
2. {X ∈ A ∩ B} = {X ∈ A} ∩ {X ∈ B}
3. {X ∈ A ∖ B} = {X ∈ A} ∖ {X ∈ B}
4. A ⊆ B ⟹ {X ∈ A} ⊆ {X ∈ B}
5. If A and B are disjoint, then so are {X ∈ A} and {X ∈ B} .
Proof

As with a general function, the result in part (a) holds for the union of a countable collection of subsets, and the result in part (b)
holds for the intersection of a countable collection of subsets. No new ideas are involved; only the notation is more complicated.
Often, a random variable takes values in a subset T of Rᵏ for some k ∈ N₊. We might express such a random variable as
X = (X₁, X₂, …, Xₖ) where Xᵢ is a real-valued random variable for each i ∈ {1, 2, …, k}. In this case, we usually refer to X
as a random vector, to emphasize its higher-dimensional character. A random variable can have an even more complicated
structure. For example, if the experiment is to select n objects from a population and record various real measurements for each
object, then the outcome of the experiment is a vector of vectors: X = (X₁, X₂, …, Xₙ) where Xᵢ is the vector of measurements
for the ith object. There are other possibilities; a random variable could be an infinite sequence, or could be set-valued. Specific
examples are given in the computational exercises below. However, the important point is simply that a random variable is a
function defined on the set of outcomes S.
The outcome of the experiment itself can be thought of as a random variable. Specifically, let T = S and let X denote the identity
function on S so that X(s) = s for s ∈ S . Then trivially X is a random variable, and the events that can be defined in terms of X
are simply the original events of the experiment. That is, if A is an event then {X ∈ A} = A . Conversely, every random variable
effectively defines a new random experiment.

In the general setting above, a random variable X defines a new random experiment with T as the new set of outcomes and
subsets of T as the new collection of events.
Details

In fact, often a random experiment is modeled by specifying the random variables of interest, in the language of the experiment.
Then, a mathematical definition of the random variables specifies the sample space. A function (or transformation) of a random
variable defines a new random variable.

Suppose that X is a random variable for the experiment with values in T and that g is a function from T into another set U .
Then Y = g(X) is a random variable with values in U .
Details

Note that, as functions, g(X) = g ∘ X , the composition of g with X. But again, thinking of X and Y as variables in the context of
the experiment, the notation Y = g(X) is much more natural.

Indicator Variables
For an event A , the indicator function of A is called the indicator variable of A .

The value of this random variable tells us whether or not A has occurred:

1_A = 1 if A occurs, and 1_A = 0 if A does not occur. (2.2.1)

That is, as a function on S,

1_A(s) = 1 if s ∈ A, and 1_A(s) = 0 if s ∉ A. (2.2.2)

If X is a random variable that takes values 0 and 1, then X is the indicator variable of the event {X = 1} .
Proof

Recall also that the set algebra of events translates into the arithmetic algebra of indicator variables.

Suppose that A and B are events.

1. 1_{A∩B} = 1_A 1_B = min{1_A, 1_B}
2. 1_{A∪B} = 1 − (1 − 1_A)(1 − 1_B) = max{1_A, 1_B}
3. 1_{B∖A} = 1_B (1 − 1_A)
4. 1_{Aᶜ} = 1 − 1_A
5. A ⊆ B if and only if 1_A ≤ 1_B

The result in part (a) extends to arbitrary intersections and the result in part (b) extends to arbitrary unions. If the event A has a
complicated description, sometimes we use 1(A) for the indicator variable rather than 1_A.
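These identities are easy to verify pointwise from the definitions. Here is a minimal Python sketch (the outcome set and events are just examples) that checks them on the outcomes of a die throw:

    S = {1, 2, 3, 4, 5, 6}                           # outcomes of a die throw
    A, B = {2, 4, 6}, {4, 5, 6}                      # example events
    ind = lambda E: (lambda s: 1 if s in E else 0)   # indicator variable of E

    for s in S:
        assert ind(A & B)(s) == ind(A)(s) * ind(B)(s)
        assert ind(A | B)(s) == 1 - (1 - ind(A)(s)) * (1 - ind(B)(s))
        assert ind(B - A)(s) == ind(B)(s) * (1 - ind(A)(s))
        assert ind(S - A)(s) == 1 - ind(A)(s)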

Examples and Applications


Recall that probability theory is often illustrated using simple devices from games of chance: coins, dice, cards, spinners, urns with
balls, and so forth. Examples based on such devices are pedagogically valuable because of their simplicity and conceptual clarity.
On the other hand, remember that probability is not only about gambling and games of chance. Rather, try to see problems
involving coins, dice, etc. as metaphors for more complex and realistic problems.

Coins and Dice
The basic coin experiment consists of tossing a coin n times and recording the sequence of scores (X₁, X₂, …, Xₙ) (where 1
denotes heads and 0 denotes tails). This experiment is a generic example of n Bernoulli trials, named for Jacob Bernoulli.

Consider the coin experiment with n = 4 , and Let Y denote the number of heads.
1. Give the set of outcomes S in list form.
2. Give the event {Y = k} in list form for each k ∈ {0, 1, 2, 3, 4}.
Answer

In the simulation of the coin experiment, set n =4 . Run the experiment 100 times and count the number of times that the
event {Y = 2} occurs.

Now consider the general coin experiment with the coin tossed n times, and let Y denote the number of heads.
1. Give the set of outcomes S in Cartesian product form, and give the cardinality of S .
2. Express Y as a function on S .
3. Find #{Y = k} (as a subset of S ) for k ∈ {0, 1, … , n}
Answer

The basic dice experiment consists of throwing n distinct k-sided dice (with faces numbered from 1 to k) and recording the
sequence of scores (X₁, X₂, …, Xₙ). This experiment is a generic example of n multinomial trials. The special case k = 6
corresponds to standard dice.

Consider the dice experiment with n = 2 standard dice. Let S denote the set of outcomes, A the event that the first die score is
1, and B the event that the sum of the scores is 7. Give each of the following events in the form indicated:
1. S in Cartesian product form
2. A in list form
3. B in list form
4. A ∪ B in list form
5. A ∩ B in list form
6. A ∩ B in predicate form
c c

Answer

In the simulation of the dice experiment, set n = 2 . Run the experiment 100 times and count the number of times each event in
the previous exercise occurs.

Consider the dice experiment with n = 2 standard dice, and let S denote the set of outcomes, Y the sum of the scores, U the
minimum score, and V the maximum score.
1. Express Y as a function on S and give the set of possible values in list form.
2. Express U as a function on S and give the set of possible values in list form.
3. Express V as a function on the S and give the set of possible values in list form.
4. Give the set of possible values of (U, V) in predicate form.
Answer

Consider again the dice experiment with n = 2 standard dice, and let S denote the set of outcomes, Y the sum of the scores, U
the minimum score, and V the maximum score. Give each of the following as subsets of S , in list form.
1. {X₁ < 3, X₂ > 4}

2. {Y = 7}
3. {U = 2}
4. {V = 4}
5. {U = V }
Answer

In the dice experiment, set n =2 . Run the experiment 100 times. Count the number of times each event in the previous
exercise occurred.

In the general dice experiment with n distinct k -sided dice, let Y denote the sum of the scores, U the minimum score, and V
the maximum score.
1. Give the set of outcomes S and find #(S) .
2. Express Y as a function on S , and give the set of possible values in list form.
3. Express U as a function on S , and give the set of possible values in list form.
4. Express V as a function on S , and give the set of possible values in list form.
5. Give the set of possible values of (U, V) in predicate form.
Answer

The set of outcomes of a random experiment depends of course on what information is recorded. The following exercise is an
illustration.

An experiment consists of throwing a pair of standard dice repeatedly until the sum of the two scores is either 5 or 7. Let A

denote the event that the sum is 5 rather than 7 on the final throw. Experiments of this type arise in the casino game craps.
1. Suppose that the pair of scores on each throw is recorded. Define the set of outcomes of the experiment and describe A as a
subset of this set.
2. Suppose that the pair of scores on the final throw is recorded. Define the set of outcomes of the experiment and describe A
as a subset of this set.
Answer

Suppose that 3 standard dice are rolled and the sequence of scores (X₁, X₂, X₃) is recorded. A person pays $1 to play. If some
of the dice come up 6, then the player receives her $1 back, plus $1 for each 6. Otherwise she loses her $1. Let W denote the
person's net winnings. This is the game of chuck-a-luck and is treated in more detail in the chapter on Games of Chance.
1. Give the set of outcomes S in Cartesian product form.
2. Express W as a function on S and give the set of possible values in list form.
Answer

Play the chuck-a-luck experiment a few times and see how you do.
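For reference, W is a simple function of the number of sixes in the three scores; here is a minimal Python sketch (the function name is illustrative):

    import random

    def chuck_a_luck_winnings(scores):
        # net winnings: lose the $1 stake with no sixes, otherwise win $1 per six
        sixes = scores.count(6)
        return sixes if sixes > 0 else -1

    scores = [random.randint(1, 6) for _ in range(3)]
    print(scores, chuck_a_luck_winnings(scores))   # possible values: -1, 1, 2, 3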

In the die-coin experiment, a standard die is rolled and then a coin is tossed the number of times shown on the die. The
sequence of coin scores X is recorded (0 for tails, 1 for heads). Let N denote the die score and Y the number of heads.
1. Give the set of outcomes S in terms of Cartesian powers and find #(S) .
2. Express N as a function on S and give the set of possible values in list form.
3. Express Y as a function on S and give the set of possible values in list form.
4. Give the event A that all tosses result in heads in list form.
Answer

Run the simulation of the die-coin experiment 10 times. For each run, give the values of the random variables X, N , and Y of
the previous exercise. Count the number of times the event A occurs.

In the coin-die experiment, we have a coin and two distinct dice, say one red and one green. First the coin is tossed, and then if
the result is heads the red die is thrown, while if the result is tails the green die is thrown. The coin score X and the score of the
chosen die Y are recorded. Suppose now that the red die is a standard 6-sided die, and the green die a 4-sided die.
1. Give the set of outcomes S in list form.
2. Express X as a function on S .
3. Express Y as a function on S .
4. Give the event {Y ≥ 3} as a subset of S in list form.
Answer

Run the coin-die experiment 100 times, with various types of dice.

Sampling Models
Recall that many random experiments can be thought of as sampling experiments. For the general finite sampling model, we start
with a population D with m (distinct) objects. We select a sample of n objects from the population. If the sampling is done in a
random way, then we have a random experiment with the sample as the basic outcome. Thus, the set of outcomes of the experiment
is literally the set of samples; this is the historical origin of the term sample space. There are four common types of sampling from
a finite population, based on the criteria of order and replacement. Recall the following facts from the section on combinatorial
structures:

Samples of size n chosen from a population with m elements.

1. If the sampling is with replacement and with regard to order, then the set of samples is the Cartesian power Dⁿ. The
number of samples is mⁿ.
2. If the sampling is without replacement and with regard to order, then the set of samples is the set of all permutations of size
n from D. The number of samples is m^(n) = m(m − 1) ⋯ [m − (n − 1)].
3. If the sampling is without replacement and without regard to order, then the set of samples is the set of all combinations (or
subsets) of size n from D. The number of samples is (m choose n).
4. If the sampling is with replacement and without regard to order, then the set of samples is the set of all multisets of size n
from D. The number of samples is (m + n − 1 choose n).
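As a quick numerical illustration, all four counts are available in the Python standard library; this minimal sketch uses the parameter values of the urn exercise below:

    from math import comb, perm

    m, n = 50, 10
    print(m ** n)              # ordered samples, with replacement
    print(perm(m, n))          # ordered samples, without replacement: m^(n)
    print(comb(m, n))          # unordered samples, without replacement
    print(comb(m + n - 1, n))  # unordered samples, with replacement (multisets)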

If we sample with replacement, the sample size n can be any positive integer. If we sample without replacement, the sample size
cannot exceed the population size, so we must have n ∈ {1, 2, … , m}.
The basic coin and dice experiments are examples of sampling with replacement. If we toss a coin n times and record the sequence
of scores (where as usual, 0 denotes tails and 1 denotes heads), then we generate an ordered sample of size n with replacement
from the population {0, 1}. If we throw n (distinct) standard dice and record the sequence of scores, then we generate an ordered
sample of size n with replacement from the population {1, 2, 3, 4, 5, 6}.
Suppose that the sampling is without replacement (the most common case). If we record the ordered sample
X = (X₁, X₂, …, Xₙ), then the unordered sample W = {X₁, X₂, …, Xₙ} is a random variable (that is, a function of X). On
the other hand, if we just record the unordered sample W in the first place, then we cannot recover the ordered sample. Note also
that the number of ordered samples of size n is simply n! times the number of unordered samples of size n. No such simple
relationship exists when the sampling is with replacement. This will turn out to be an important point when we study probability
models based on random samples, in the next section.

Consider a sample of size n = 3 chosen without replacement from the population {a, b, c, d, e}.
1. Give T , the set of unordered samples in list form.
2. Give in list form the set of all ordered samples that correspond to the unordered sample {b, c, e}.
3. Note that for every unordered sample, there are 6 ordered samples.
4. Give the cardinality of S , the set of ordered samples.
Answer

Traditionally in probability theory, an urn containing balls is often used as a metaphor for a finite population.

Suppose that an urn contains 50 (distinct) balls. A sample of 10 balls is selected from the urn. Find the number of samples in
each of the following cases:
1. Ordered samples with replacement
2. Ordered samples without replacement
3. Unordered samples without replacement
4. Unordered samples with replacement
Answer

Suppose again that we have a population D with m (distinct) objects, but suppose now that each object is one of two types—either
type 1 or type 0. Such populations are said to be dichotomous. Here are some specific examples:
The population consists of persons, each either male or female.
The population consists of voters, each either democrat or republican.
The population consists of devices, each either good or defective.
The population consists of balls, each either red or green.
Suppose that the population D has r type 1 objects and hence m − r type 0 objects. Of course, we must have r ∈ {0, 1, … , m}.
Now suppose that we select a sample of size n without replacement from the population. Note that this model has three parameters:
the population size m, the number of type 1 objects in the population r, and the sample size n .

Let Y denote the number of type 1 objects in the sample. Then


1. #{Y = k} = (n choose k) r^(k) (m − r)^(n−k) for each k ∈ {0, 1, …, n}, if the event is considered as a subset of S, the set of
ordered samples.
2. #{Y = k} = (r choose k) (m − r choose n − k) for each k ∈ {0, 1, …, n}, if the event is considered as a subset of T, the set of
unordered samples.
3. The expression in (a) is n! times the expression in (b).
Proof
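As a numerical check of part (c), here is a minimal Python sketch, using the parameter values of the component exercise below:

    from math import comb, perm, factorial

    m, r, n = 50, 10, 5
    for k in range(n + 1):
        ordered = comb(n, k) * perm(r, k) * perm(m - r, n - k)   # part (a)
        unordered = comb(r, k) * comb(m - r, n - k)              # part (b)
        assert ordered == factorial(n) * unordered               # part (c)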

A batch of 50 components consists of 40 good components and 10 defective components. A sample of 5 components is
selected, without replacement. Let Y denote the number of defectives in the sample.
1. Let S denote the set of ordered samples. Find #(S) .
2. Let T denote the set of unordered samples. Find #(T ).
3. As a subset of T , find #{Y = k} for each k ∈ {0, 1, 2, 3, 4, 5}.
Answer

Run the simulation of the ball and urn experiment 100 times for the parameter values in the last exercise: m = 50 , r = 10 ,
n = 5 . Note the values of the random variable Y .

Cards
Recall that a standard card deck can be modeled by the Cartesian product set

D = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, j, q, k} × {♣, ♢, ♡, ♠} (2.2.8)

where the first coordinate encodes the denomination or kind (ace, 2–10, jack, queen, king) and where the second coordinate
encodes the suit (clubs, diamonds, hearts, spades). Sometimes we represent a card as a string rather than an ordered pair (for
example q♡ rather than (q, ♡) for the queen of hearts).
Most card games involve sampling without replacement from the deck D, which plays the role of the population. Thus, the basic
card experiment consists of dealing n cards from a standard deck without replacement; in this special context, the sample of cards
is often referred to as a hand. Just as in the general sampling model, if we record the ordered hand X = (X₁, X₂, …, Xₙ), then
the unordered hand W = {X₁, X₂, …, Xₙ} is a random variable (that is, a function of X). On the other hand, if we just record
the unordered hand W in the first place, then we cannot recover the ordered hand. Finally, recall that n = 5 is the poker
experiment and n = 13 is the bridge experiment. The game of poker is treated in more detail in the chapter on Games of Chance.

Suppose that a single card is dealt from a standard deck. Let Q denote the event that the card is a queen and H the event that
the card is a heart. Give each of the following events in list form:
1. Q
2. H
3. Q ∪ H
4. Q ∩ H
5. Q ∖ H
Answer

In the card experiment, set n =1 . Run the experiment 100 times and count the number of times each event in the previous
exercise occurs.

Suppose that two cards are dealt from a standard deck and the sequence of cards recorded. Let S denote the set of outcomes,
and let Qᵢ denote the event that the ith card is a queen and Hᵢ the event that the ith card is a heart for i ∈ {1, 2}. Find the
number of outcomes in each of the following events:
number of outcomes in each of the following events:


1. S
2. H₁
3. H₂
4. H₁ ∩ H₂
5. Q₁ ∩ H₁
6. Q₁ ∩ H₂
7. H₁ ∪ H₂

Answer

Consider the general card experiment in which n cards are dealt from a standard deck, and the ordered hand X is recorded.
1. Give the cardinality of S, the set of values of the ordered hand X.
2. Give the cardinality of T , the set of values of the unordered hand W .
3. How many ordered hands correspond to a given unordered hand?
4. Explicitly compute the numbers in (a) and (b) when n = 5 (poker).
5. Explicitly compute the numbers in (a) and (b) when n = 13 (bridge).
Answer

Consider the bridge experiment of dealing 13 cards from a deck and recording the unordered hand. In the most common point
counting system, an ace is worth 4 points, a king 3 points, a queen 2 points, and a jack 1 point. The other cards are worth 0
points. Let S denote the set of outcomes of the experiment and V the point value of the hand.
1. Find the set of possible values of V .
2. Find the cardinality of the event {V = 0} as a subset of S .
Answer

In the card experiment, set n = 13 and run the experiment 100 times. For each run, compute the value of the random
variable V in the previous exercise.

Consider the poker experiment of dealing 5 cards from a deck. Find the cardinality of each of the events below, as a subset of
the set of unordered hands.
1. A : the event that the hand is a full house (3 cards of one kind and 2 of another kind).
2. B : the event that the hand has 4 of a kind (4 cards of one kind and 1 of another kind).
3. C : the event that all cards in the hand are in the same suit (the hand is a flush or a straight flush).
Answer

Run the poker experiment 1000 times. Note the number of times that the events A , B , and C in the previous exercise occurred.

Consider the bridge experiment of dealing 13 cards from a standard deck. Let S denote the set of unordered hands, Y the
number of hearts in the hand, and Z the number of queens in the hand.
1. Find the cardinality of the event {Y = y} as a subset of S for each y ∈ {0, 1, … , 13}.
2. Find the cardinality of the event {Z = z} as a subset of S for each z ∈ {0, 1, 2, 3, 4}.
Answer

Geometric Models
In the experiments that we have considered so far, the sample spaces have all been discrete (so that the set of outcomes is finite or
countably infinite). In this subsection, we consider Euclidean sample spaces where the set of outcomes S is continuous in a sense
that we will make clear later. The experiments we consider are sometimes referred to as geometric models because they involve
selecting a point at random from a Euclidean set.
We first consider Buffon's coin experiment, which consists of tossing a coin with radius r ≤ 1/2 randomly on a floor covered with
square tiles of side length 1. The coordinates (X, Y) of the center of the coin are recorded relative to axes through the center of the
square in which the coin lands. Buffon's experiments are studied in more detail in the chapter on Geometric Models and are named
for the Comte de Buffon.

Figure 2.2.3 : Buffon's coin experiment


In Buffon's coin experiment, let S denote the set of outcomes, A the event that the coin does not touch the sides of the square,
and let Z denote the distance form the center of the coin to the center of the square.
1. Describe S as a Cartesian product.
2. Describe A as a subset of S .
3. Describe A as a subset of S .
c

4. Express Z as a function on S .
5. Express the event {X < Y } as a subset of S .
6. Express the event {Z ≤ } as a subset of S .
1

Answer

Run Buffon's coin experiment 100 times with r = 0.2 . For each run, note whether event A occurs and compute the value of
random variable Z .
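Here is a minimal Python sketch of a single run of the experiment, assuming that the center of the coin is uniformly distributed over the square (the variable names are illustrative):

    import math, random

    r = 0.2
    x = random.uniform(-0.5, 0.5)   # center of the coin, relative to axes
    y = random.uniform(-0.5, 0.5)   # through the center of the square
    A = abs(x) <= 0.5 - r and abs(y) <= 0.5 - r   # coin does not touch the sides
    z = math.hypot(x, y)            # distance to the center of the square
    print((x, y), A, z)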

A point (X, Y) is chosen at random in the circular region of radius 1 in R² centered at the origin. Let S denote the set of
outcomes. Let A denote the event that the point is in the inscribed square region centered at the origin, with sides parallel to
the coordinate axes. Let B denote the event that the point is in the inscribed square with vertices (±1, 0), (0, ±1).
1. Describe S mathematically and sketch the set.
2. Describe A mathematically and sketch the set.
3. Describe B mathematically and sketch the set.
4. Sketch A ∪ B
5. Sketch A ∩ B
6. Sketch A ∩ Bᶜ

Answer

Reliability
In the simple model of structural reliability, a system is composed of n components, each of which is either working or failed. The
state of component i is an indicator random variable Xᵢ, where 1 means working and 0 means failure. Thus,
X = (X₁, X₂, …, Xₙ) is a vector of indicator random variables that specifies the states of all of the components, and therefore
the set of outcomes of the experiment is S = {0, 1}ⁿ. The system as a whole is also either working or failed, depending only on
the states of the components and how the components are connected together. Thus, the state of the system is also an indicator
random variable and is a function of X. The state of the system (working or failed) as a function of the states of the components is
the structure function.

A series system is working if and only if each component is working. The state of the system is
U = X_1 X_2 ⋯ X_n = min{X_1, X_2, … , X_n} (2.2.9)

A parallel system is working if and only if at least one component is working. The state of the system is
V = 1 − (1 − X_1)(1 − X_2) ⋯ (1 − X_n) = max{X_1, X_2, … , X_n} (2.2.10)

More generally, a k out of n system is working if and only if at least k of the n components are working. Note that a parallel system is a 1 out of n system and a series system is an n out of n system. A k out of 2k − 1 system is a majority rules system. The state of the k out of n system is U_{n,k} = 1(∑_{i=1}^n X_i ≥ k). The structure function can also be expressed as a polynomial in the variables.
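These structure functions translate directly into code. Here is a minimal sketch in Python (the function names are ours, not part of the model):

```python
# A sketch of the three basic structure functions. Component states are
# 0 (failed) or 1 (working); x is a tuple or list of component states.

def series(x):
    # Works if and only if every component works: the product, or minimum.
    return min(x)

def parallel(x):
    # Works if and only if at least one component works: the maximum.
    return max(x)

def k_out_of_n(x, k):
    # Works if and only if at least k components work.
    return 1 if sum(x) >= k else 0

x = (1, 0, 1)
print(series(x), parallel(x), k_out_of_n(x, 2))  # prints: 0 1 1
```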

Explicitly give the state of the k out of 3 system, as a polynomial function of the component states (X_1, X_2, X_3), for each k ∈ {1, 2, 3}.

Answer

In some cases, the system can be represented as a graph or network. The edges represent the components and the vertices the
connections between the components. The system functions if and only if there is a working path between two designated vertices,
which we will denote by a and b .

Find the state of the Wheatstone bridge network shown below, as a function of the component states. The network is named for
Charles Wheatstone.
Answer

Figure 2.2.4 : The Wheatstone bridge network

Not every function u : {0, 1}^n → {0, 1} makes sense as a structure function. Explain why the following properties might be desirable:
1. u(0, 0, … , 0) = 0 and u(1, 1, … , 1) = 1.
2. u is an increasing function, where {0, 1} is given the ordinary order and {0, 1}^n the corresponding product order.
3. For each i ∈ {1, 2, … , n}, there exist x and y in {0, 1}^n all of whose coordinates agree, except that x_i = 0 and y_i = 1, and u(x) = 0 while u(y) = 1.

Answer

The model just discussed is a static model. We can extend it to a dynamic model by assuming that component i is initially working, but has a random time to failure T_i, taking values in [0, ∞), for each i ∈ {1, 2, … , n}. Thus, the basic outcome of the experiment is the random vector of failure times (T_1, T_2, … , T_n), and so the set of outcomes is [0, ∞)^n.

Consider the dynamic reliability model for a system with structure function u (valid in the sense of the previous exercise).
1. The state of component i at time t ≥ 0 is X_i(t) = 1(T_i > t).
2. The state of the system at time t is X(t) = u[X_1(t), X_2(t), … , X_n(t)].
3. The time to failure of the system is T = min{t ≥ 0 : X(t) = 0} .
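Under these definitions, a k out of n system works while at least k components work, so it fails at the (n − k + 1)st smallest component failure time. A minimal sketch with hypothetical failure times:

```python
# A sketch of the dynamic model, with hypothetical failure times.

def component_states(times, t):
    # X_i(t) = 1(T_i > t) for each component i
    return [1 if ti > t else 0 for ti in times]

def k_out_of_n_lifetime(times, k):
    # The system fails at the (n - k + 1)st smallest failure time.
    n = len(times)
    return sorted(times)[n - k]

times = [2.3, 0.7, 1.5, 3.1]          # hypothetical values of (T_1, ..., T_4)
print(component_states(times, 1.0))   # [1, 0, 1, 1]
print(k_out_of_n_lifetime(times, 2))  # 2.3, when only one component remains
```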

Suppose that we have two devices and that we record (X, Y ), where X is the failure time of device 1 and Y is the failure time
of device 2. Both variables take values in the interval [0, ∞), where the units are in hundreds of hours. Sketch each of the
following events:
1. The set of outcomes S
2. {X < Y }
3. {X + Y > 2}

Answer

Genetics
Please refer to the discussion of genetics in the section on random experiments if you need to review some of the definitions in this
section.
Recall first that the ABO blood type in humans is determined by three alleles: a , b , and o . Furthermore, o is recessive and a and b
are co-dominant.

Suppose that a person is chosen at random and his genotype is recorded. Give each of the following in list form.
1. The set of outcomes S
2. The event that the person is type A
3. The event that the person is type B
4. The event that the person is type AB
5. The event that the person is type O
Answer

Suppose next that pod color in a certain type of pea plant is determined by a gene with two alleles: g for green and y for yellow, and that g is dominant.

Suppose that n (distinct) pea plants are collected and the sequence of pod color genotypes is recorded.
1. Give the set of outcomes S in Cartesian product form and find #(S) .
2. Let N denote the number of plants with green pods. Find #(N = k) (as a subset of S ) for each k ∈ {0, 1, … , n}.
Answer

Next consider a sex-linked hereditary disorder in humans (such as colorblindness or hemophilia). Let h denote the healthy allele
and d the defective allele for the gene linked to the disorder. Recall that d is recessive for women.

Suppose that n women are sampled and the sequence of genotypes is recorded.
1. Give the set of outcomes S in Cartesian product form and find #(S) .
2. Let N denote the number of women who are completely healthy (genotype hh ). Find #(N = k) (as a subset of S ) for
each k ∈ {0, 1, … , n}.
Answer

Radioactive Emissions
The emission of elementary particles from a sample of radioactive material occurs in a random way. Suppose that the time of emission of the ith particle is a random variable T_i taking values in (0, ∞). If we measure these arrival times, then the basic outcome vector is (T_1, T_2, …) and so the set of outcomes is S = {(t_1, t_2, …) : 0 < t_1 < t_2 < ⋯}.

Run the simulation of the gamma experiment in single-step mode for different values of the parameters. Observe the arrival
times.

Now let N_t denote the number of emissions in the interval (0, t]. Then
1. N_t = max{n ∈ N_+ : T_n ≤ t}.
2. N_t ≥ n if and only if T_n ≤ t.
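The equivalence between counts and arrival times is easy to see in a quick simulation. A minimal sketch (the exponential gaps are an assumption on our part, anticipating the Poisson model):

```python
import random

# Generate increasing arrival times and check the equivalence:
# N_t >= n if and only if T_n <= t.

random.seed(3)
T, total = [], 0.0
for _ in range(20):
    total += random.expovariate(1.0)  # exponential gap between emissions
    T.append(total)                   # T_1 < T_2 < ... < T_20

t, n = 5.0, 3
N_t = sum(1 for s in T if s <= t)     # number of emissions in (0, t]
print(N_t >= n, T[n - 1] <= t)        # the two sides always agree
```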

Run the simulation of the Poisson experiment in single-step mode for different parameter values. Observe the arrivals in the
specified time interval.

Statistical Experiments
In the basic cicada experiment, a cicada in the Middle Tennessee area is captured and the following measurements recorded:
body weight (in grams), wing length, wing width, and body length (in millimeters), species type, and gender. The cicada data
set gives the results of 104 repetitions of this experiment.

1. Define the set of outcomes S for the basic experiment.
2. Let F be the event that a cicada is female. Describe F as a subset of S . Determine whether F occurs for each cicada in the
data set.
3. Let V denote the ratio of wing length to wing width. Compute V for each cicada.
4. Give the set of outcomes for the compound experiment that consists of 104 repetitions of the basic experiment.
Answer

In the basic M&M experiment, a bag of M&Ms (of a specified size) is purchased and the following measurements recorded:
the number of red, green, blue, yellow, orange, and brown candies, and the net weight (in grams). The M&M data set gives the
results of 30 repetitions of this experiment.
1. Define the set of outcomes S for the basic experiment.
2. Let A be the event that a bag contains at least 57 candies. Describe A as a subset of S .
3. Determine whether A occurs for each bag in the data set.
4. Let N denote the total number of candies. Compute N for each bag in the data set.
5. Give the set of outcomes for the compound experiment that consists of 30 repetitions of the basic experiment.
Answer

2.3: Probability Measures

This section contains the final and most important ingredient in the basic model of a random experiment. If you are a new student of probability, you may want to skip the technical details.

Definitions and Interpretations


Suppose that we have a random experiment with sample space (S, S ), so that S is the set of outcomes of the experiment and S is
the collection of events. When we run the experiment, a given event A either occurs or does not occur, depending on whether the
outcome of the experiment is in A or not. Intuitively, the probability of an event is a measure of how likely the event is to occur
when we run the experiment. Mathematically, probability is a function on the collection of events that satisfies certain axioms.

Definition
A probability measure (or probability distribution) P on the sample space (S, S) is a real-valued function defined on the collection of events S that satisfies the following axioms:
1. P(A) ≥ 0 for every event A.
2. P(S) = 1.
3. If {A_i : i ∈ I} is a countable, pairwise disjoint collection of events then
P(⋃_{i∈I} A_i) = ∑_{i∈I} P(A_i) (2.3.1)

Details

Axiom (c) is known as countable additivity, and states that the probability of a union of a finite or countably infinite collection of
disjoint events is the sum of the corresponding probabilities. The axioms are known as the Kolmogorov axioms, in honor of Andrei
Kolmogorov who was the first to formalize probability theory in an axiomatic way. More informally, we say that P is a probability
measure (or distribution) on S , the collection of events S usually being understood.
Axioms (a) and (b) are really just a matter of convention; we choose to measure the probability of an event with a number between 0 and 1 (as opposed, say, to a number between −5 and 7). Axiom (c), however, is fundamental and inescapable. It is required for probability for precisely the same reason that it is required for other measures of the “size” of a set, such as cardinality for finite sets, length for subsets of R, area for subsets of R^2, and volume for subsets of R^3. In all these cases, the size of a set that is composed of countably many disjoint pieces is the sum of the sizes of the pieces.

Figure 2.3.1 :The union of 4 disjoint events


On the other hand, uncountable additivity (the extension of axiom (c) to an uncountable index set I ) is unreasonable for probability,
just as it is for other measures. For example, an interval of positive length in R is a union of uncountably many points, each of
which has length 0.
We now have defined the three essential ingredients for the model of a random experiment:

A probability space (S, S , P) consists of


1. A set of outcomes S
2. A collection of events S
3. A probability measure P on the sample space (S, S )
Details

The Law of Large Numbers
Intuitively, the probability of an event is supposed to measure the long-term relative frequency of the event—in fact, this concept
was taken as the definition of probability by Richard von Mises. Here are the relevant definitions:

Suppose that the experiment is repeated indefinitely, and that A is an event. For n ∈ N_+,
1. Let N_n(A) denote the number of times that A occurred. This is the frequency of A in the first n runs.
2. Let P_n(A) = N_n(A)/n. This is the relative frequency or empirical probability of A in the first n runs.

Note that repeating the original experiment indefinitely creates a new, compound experiment, and that N_n(A) and P_n(A) are random variables for the new experiment. In particular, the values of these variables are uncertain until the experiment is run n
times. The basic idea is that if we have chosen the correct probability measure for the experiment, then in some sense we expect
that the relative frequency of an event should converge to the probability of the event. That is,
P_n(A) → P(A) as n → ∞, A ∈ S (2.3.2)

regardless of the uncertainty of the relative frequencies on the left. The precise statement of this is the law of large numbers or law
of averages, one of the fundamental theorems in probability. To emphasize the point, note that in general there will be lots of
possible probability measures for an experiment, in the sense of the axioms. However, only the probability measure that models the
experiment correctly will satisfy the law of large numbers.
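The convergence of relative frequencies is easy to watch in simulation. A minimal sketch in Python (our own example, not from the text):

```python
import random

# The relative frequency P_n(A) drifts toward P(A). Here A is the event
# that a fair die shows at least 5, so P(A) = 1/3.

random.seed(1)
count = 0
for n in range(1, 10001):
    if random.randint(1, 6) >= 5:
        count += 1
    if n in (10, 100, 1000, 10000):
        print(n, count / n)   # empirical probability after n runs
```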

Given the data from n runs of the experiment, the empirical probability function P_n is a probability measure on S.

Proof

The Distribution of a Random Variable


Suppose now that X is a random variable for the experiment, taking values in a set T . Recall that mathematically, X is a function
from S into T , and {X ∈ B} denotes the event {s ∈ S : X(s) ∈ B} for B ⊆ T . Intuitively, X is a variable of interest for the
experiment, and every meaningful statement about X defines an event.

The function B ↦ P(X ∈ B) for B ⊆ T defines a probability measure on T .


Proof

Figure 2.3.2 : A set B ∈ T corresponds to the event {X ∈ B} ∈ S


The probability measure in (5) is called the probability distribution of X , so we have all of the ingredients for a new probability
space.

A random variable X with values in T defines a new probability space:


1. T is the set of outcomes.
2. Subsets of T are the events.
3. The probability distribution of X is the probability measure on T .

This probability space corresponds to the new random experiment in which the outcome is X. Moreover, recall that the outcome of
the experiment itself can be thought of as a random variable. Specifically, if we let T = S and let X be the identity function on S,
so that X(s) = s for s ∈ S . Then X is a random variable with values in S and P(X ∈ A) = P(A) for each event A . Thus, every
probability measure can be thought of as the distribution of a random variable.

Constructions

Measures
How can we construct probability measures? As noted briefly above, there are other measures of the “size” of sets; in many cases,
these can be converted into probability measures. First, a positive measure μ on the sample space (S, S ) is a real-valued function
defined on S that satisfies axioms (a) and (c) in (1), and then (S, S , μ) is a measure space. In general, μ(A) is allowed to be
infinite. However, if μ(S) is positive and finite (so that μ is a finite positive measure), then μ can easily be re-scaled into a
probability measure.

If μ is a positive measure on S with 0 < μ(S) < ∞ then P defined below is a probability measure.
P(A) = μ(A)/μ(S), A ∈ S (2.3.3)

Proof

In this context, μ(S) is called the normalizing constant. In the next two subsections, we consider some very important special
cases.

Discrete Distributions
In this discussion, we assume that the sample space (S, S ) is discrete. Recall that this means that the set of outcomes S is
countable and that S = P(S) is the collection of all subsets of S , so that every subset is an event. The standard measure on a
discrete space is counting measure #, so that #(A) is the number of elements in A for A ⊆ S . When S is finite, the probability
measure corresponding to counting measure as constructed above is particularly important in combinatorial and sampling
experiments.

Suppose that S is a finite, nonempty set. The discrete uniform distribution on S is given by
P(A) = #(A)/#(S), A ⊆ S (2.3.5)

The underlying model is referred to as the classical probability model, because historically the very first problems in probability
(involving coins and dice) fit this model.
In the general discrete case, if P is a probability measure on S, then since S is countable, it follows from countable additivity that P is completely determined by its values on the singleton events. Specifically, if we define f(x) = P({x}) for x ∈ S, then P(A) = ∑_{x∈A} f(x) for every A ⊆ S. By axiom (a), f(x) ≥ 0 for x ∈ S and by axiom (b), ∑_{x∈S} f(x) = 1. Conversely, we can give a general construction for defining a probability measure on a discrete space.

Suppose that g : S → [0, ∞). Then μ defined by μ(A) = ∑_{x∈A} g(x) for A ⊆ S is a positive measure on S. If 0 < μ(S) < ∞ then P defined as follows is a probability measure on S.
P(A) = μ(A)/μ(S) = ∑_{x∈A} g(x) / ∑_{x∈S} g(x), A ⊆ S (2.3.6)

Proof

In the context of our previous remarks, f(x) = g(x)/μ(S) = g(x)/∑_{y∈S} g(y) for x ∈ S. Distributions of this type are said to be discrete. Discrete distributions are studied in detail in the chapter on Distributions.
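The construction is simple to carry out. Here is a minimal sketch in Python, with hypothetical weights g(x) = 2^x:

```python
# A nonnegative weight function g on a finite set S, normalized by mu(S).

S = [0, 1, 2, 3, 4]
g = {x: 2 ** x for x in S}
mu_S = sum(g.values())                  # normalizing constant, here 31

def prob(A):
    return sum(g[x] for x in A) / mu_S  # P(A) = mu(A) / mu(S)

f = {x: g[x] / mu_S for x in S}         # f(x) = P({x})
print(prob({0, 1, 2}))                  # 7/31
print(sum(f.values()))                  # 1.0, as axiom (b) requires
```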

If S is finite and g is a constant function, then the probability measure P associated with g is the discrete uniform distribution
on S .
Proof

Continuous Distributions
The probability distributions that we will construct next are continuous distributions on R^n for n ∈ N_+ and require some calculus.

For n ∈ N_+, the standard measure λ_n on R^n is given by
λ_n(A) = ∫_A 1 dx, A ⊆ R^n (2.3.8)

In particular, λ_1(A) is the length of A ⊆ R, λ_2(A) is the area of A ⊆ R^2, and λ_3(A) is the volume of A ⊆ R^3.

Details

When n > 3, λ_n(A) is sometimes called the n-dimensional volume of A ⊆ R^n. The probability measure associated with λ_n on a set with positive, finite n-dimensional volume is particularly important.

Suppose that S ⊆ R^n with 0 < λ_n(S) < ∞. The continuous uniform distribution on S is defined by
P(A) = λ_n(A)/λ_n(S), A ⊆ S (2.3.9)

Note that the continuous uniform distribution is analogous to the discrete uniform distribution defined in (8), but with Lebesgue measure λ_n replacing counting measure #. We can generalize this construction to produce many other distributions.

Suppose again that S ⊆ R^n and that g : S → [0, ∞). Then μ defined by μ(A) = ∫_A g(x) dx for A ⊆ S is a positive measure on S. If 0 < μ(S) < ∞, then P defined as follows is a probability measure on S.
P(A) = μ(A)/μ(S) = ∫_A g(x) dx / ∫_S g(x) dx, A ∈ S (2.3.10)

Proof

Distributions of this type are said to be continuous. Continuous distributions are studied in detail in the chapter on Distributions.
Note that the continuous distribution above is analogous to the discrete distribution in (9), but with integrals replacing sums. The
general theory of integration allows us to unify these two special cases, and many others besides.
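The continuous construction can be checked numerically. A minimal sketch, with the hypothetical density kernel g(x) = x on S = [0, 2], where exactly μ(S) = 2 and P([0, 1]) = (1/2)/2 = 1/4:

```python
# Integrals approximated by midpoint Riemann sums.

def integrate(g, a, b, steps=100000):
    h = (b - a) / steps
    return sum(g(a + (i + 0.5) * h) for i in range(steps)) * h

g = lambda x: x
mu_S = integrate(g, 0, 2)          # the normalizing constant mu(S)
print(mu_S)                        # approximately 2
print(integrate(g, 0, 1) / mu_S)   # P([0, 1]), approximately 0.25
```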

Rules of Probability
Basic Rules
Suppose again that we have a random experiment modeled by a probability space (S, S , P), so that S is the set of outcomes, S
the collection of events, and P the probability measure. In the following theorems, A and B are events. The results follow easily
from the axioms of probability in (1), so be sure to try the proofs yourself before reading the ones in the text.
P(A^c) = 1 − P(A). This is known as the complement rule.
Proof

Figure 2.3.3 : The complement rule

P(∅) = 0 .
Proof

P(B ∖ A) = P(B) − P(A ∩ B) . This is known as the difference rule.


Proof

Figure 2.3.4 : The difference rule

If A ⊆ B then P(B ∖ A) = P(B) − P(A) .


Proof

Recall that if A ⊆ B we sometimes write B − A for the set difference, rather than B∖A . With this notation, the difference rule
has the nice form P(B − A) = P(B) − P(A) .

If A ⊆ B then P(A) ≤ P(B) .


Proof

Thus, P is an increasing function, relative to the subset partial order on the collection of events S , and the ordinary order on R. In
particular, it follows that P(A) ≤ 1 for any event A .

Figure 2.3.5 : The increasing property

Suppose that A ⊆ B .
1. If P(B) = 0 then P(A) = 0 .
2. If P(A) = 1 then P(B) = 1 .
Proof

The Boole and Bonferroni Inequalities


The next result is known as Boole's inequality, named after George Boole. The inequality gives a simple upper bound on the
probability of a union.

If {A_i : i ∈ I} is a countable collection of events then
P(⋃_{i∈I} A_i) ≤ ∑_{i∈I} P(A_i) (2.3.12)

Proof

Figure 2.3.6 : Boole's inequality


Intuitively, Boole's inequality holds because parts of the union have been measured more than once in the sum of the probabilities
on the right. Of course, the sum of the probabilities may be greater than 1, in which case Boole's inequality is not helpful. The
following result is a simple consequence of Boole's inequality.

If {A_i : i ∈ I} is a countable collection of events with P(A_i) = 0 for each i ∈ I, then
P(⋃_{i∈I} A_i) = 0 (2.3.13)

An event A with P(A) = 0 is said to be null. Thus, a countable union of null events is still a null event.
The next result is known as Bonferroni's inequality, named after Carlo Bonferroni. The inequality gives a simple lower bound for
the probability of an intersection.

If {A_i : i ∈ I} is a countable collection of events then
P(⋂_{i∈I} A_i) ≥ 1 − ∑_{i∈I} [1 − P(A_i)] (2.3.14)

Proof

Of course, the lower bound in Bonferroni's inequality may be less than or equal to 0, in which case it's not helpful. The following
result is a simple consequence of Bonferroni's inequality.

If {A_i : i ∈ I} is a countable collection of events with P(A_i) = 1 for each i ∈ I, then
P(⋂_{i∈I} A_i) = 1 (2.3.16)

An event A with P(A) = 1 is sometimes called almost sure or almost certain. Thus, a countable intersection of almost sure events
is still almost sure.

Suppose that A and B are events in an experiment.


1. If P(A) = 0 , then P(A ∪ B) = P(B) .
2. If P(A) = 1 , then P(A ∩ B) = P(B) .
Proof

The Partition Rule


Suppose that {A_i : i ∈ I} is a countable collection of events that partition S. Recall that this means that the events are disjoint and their union is S. For any event B,
P(B) = ∑_{i∈I} P(A_i ∩ B) (2.3.17)

Proof

Figure 2.3.7 : The partition rule


Naturally, this result is useful when the probabilities of the intersections are known. Partitions usually arise in connection with a
random variable. Suppose that X is a random variable taking values in a countable set T , and that B is an event. Then

P(B) = ∑_{x∈T} P(X = x, B) (2.3.18)

In this formula, note that the comma acts like the intersection symbol in the previous formula.

The Inclusion-Exclusion Rule


The inclusion-exclusion formulas provide a method for computing the probability of a union of events in terms of the probabilities
of the various intersections of the events. The formula is useful because often the probabilities of the intersections are easier to
compute. Interestingly, however, the same formula works for computing the probability of an intersection of events in terms of the
probabilities of the various unions of the events. This version is rarely stated, because it's simply not that useful. We start with two
events.

If A, B are events then P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Proof

Figure 2.3.8 : The probability of the union of two events


Here is the complementary result for the intersection in terms of unions:

If A, B are events then P(A ∩ B) = P(A) + P(B) − P(A ∪ B) .


Proof

Next we consider three events.

If A, B, C are events then


P(A ∪ B ∪ C ) = P(A) + P(B) + P(C ) − P(A ∩ B) − P(A ∩ C ) − P(B ∩ C ) + P(A ∩ B ∩ C ) .
Analytic Proof
Proof by accounting

Figure 2.3.9 : The probability of the union of three events


Here is the complementary result for the probability of an intersection in terms of the probabilities of the unions:

If A, B, C are events then


P(A ∩ B ∩ C ) = P(A) + P(B) + P(C ) − P(A ∪ B) − P(A ∪ C ) − P(B ∪ C ) + P(A ∪ B ∪ C ) .
Proof

The inclusion-exclusion formulas for two and three events can be generalized to n events. For the remainder of this discussion, suppose that {A_i : i ∈ I} is a collection of events where I is an index set with #(I) = n.

The general inclusion-exclusion formula for the probability of a union.
P(⋃_{i∈I} A_i) = ∑_{k=1}^n (−1)^{k−1} ∑_{J⊆I, #(J)=k} P(⋂_{j∈J} A_j) (2.3.21)

Proof by induction
Proof by accounting

Here is the complementary result for the probability of an intersection in terms of the probabilities of the various unions:

The general inclusion-exclusion formula for the probability of an intersection.
P(⋂_{i∈I} A_i) = ∑_{k=1}^n (−1)^{k−1} ∑_{J⊆I, #(J)=k} P(⋃_{j∈J} A_j) (2.3.24)

The general inclusion-exclusion formulas are not worth remembering in detail, but only in pattern. For the probability of a union, we start with the sum of the probabilities of the events, then subtract the probabilities of all of the pairwise intersections, then add the probabilities of the third-order intersections, and so forth, alternating signs, until we get to the probability of the intersection of all of the events.
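The pattern is easy to verify numerically. A minimal sketch in Python, with made-up events in a small sample space under the uniform distribution:

```python
from itertools import combinations

# Compare the inclusion-exclusion sum with a direct computation of the
# probability of the union.

S = set(range(12))
events = [set(range(0, 8)), set(range(4, 10)), {1, 5, 9, 11}]

def P(A):
    return len(A) / len(S)     # uniform probability on S

total = 0.0
for k in range(1, len(events) + 1):
    for J in combinations(events, k):
        total += (-1) ** (k - 1) * P(set.intersection(*J))

print(P(set().union(*events)), total)   # both are 11/12
```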

The general Bonferroni inequalities (for a union) state that if the sum on the right in the general inclusion-exclusion formula is truncated, then the truncated sum is an upper bound or a lower bound for the probability on the left, depending on whether the last term has a positive or negative sign. Here is the result stated explicitly:

Suppose that m ∈ {1, 2, … , n − 1}. Then
1. P(⋃_{i∈I} A_i) ≤ ∑_{k=1}^m (−1)^{k−1} ∑_{J⊆I, #(J)=k} P(⋂_{j∈J} A_j) if m is odd.
2. P(⋃_{i∈I} A_i) ≥ ∑_{k=1}^m (−1)^{k−1} ∑_{J⊆I, #(J)=k} P(⋂_{j∈J} A_j) if m is even.

Proof

More elegant proofs of the inclusion-exclusion formula and the Bonferroni inequalities can be constructed using expected value. Note that there is a probability term in the inclusion-exclusion formulas for every nonempty subset J of the index set I, with either a positive or negative sign, and hence there are 2^n − 1 such terms. These probabilities suffice to compute the probability of any event that can be constructed from the given events, not just the union or the intersection.

The probability of any event that can be constructed from {A_i : i ∈ I} can be computed from either of the following collections of 2^n − 1 probabilities:
1. P(⋂_{j∈J} A_j) where J is a nonempty subset of I.
2. P(⋃_{j∈J} A_j) where J is a nonempty subset of I.

Remark
If you go back and look at your proofs of the rules of probability above, you will see that they hold for any finite measure μ , not
just probability. The only change is that the number 1 is replaced by μ(S) . In particular, the inclusion-exclusion rule is as important
in combinatorics (the study of counting measure) as it is in probability.

Examples and Applications


Probability Rules

Suppose that A and B are events in an experiment with P(A) = 1/3, P(B) = 1/4, P(A ∩ B) = 1/10. Express each of the following events in the language of the experiment and find its probability:
1. A ∖ B
2. A ∪ B
3. A^c ∪ B^c
4. A^c ∩ B^c
5. A ∪ B^c

Answer

Suppose that A , B , and C are events in an experiment with P(A) = 0.3 , P(B) = 0.2 , P(C ) = 0.4 , P(A ∩ B) = 0.04 ,
P(A ∩ C ) = 0.1 , P(B ∩ C ) = 0.1 , P(A ∩ B ∩ C ) = 0.01 . Express each of the following events in set notation and find its
probability:
1. At least one of the three events occurs.
2. None of the three events occurs.
3. Exactly one of the three events occurs.
4. Exactly two of the three events occur.
Answer

Suppose that A and B are events in an experiment with P(A ∖ B) = 1/6, P(B ∖ A) = 1/4, and P(A ∩ B) = 1/12. Find the probability of each of the following events:
1. A
2. B
3. A ∪ B
4. A^c ∪ B^c
5. A^c ∩ B^c
Answer

Suppose that A and B are events in an experiment with P(A) = 2/5, P(A ∪ B) = 7/10, and P(A ∩ B) = 1/6. Find the probability of each of the following events:
1. B
2. A ∖ B
3. B ∖ A
4. A^c ∪ B^c
5. A^c ∩ B^c
Answer

Suppose that A, B, and C are events in an experiment with P(A) = 1/3, P(B) = 1/4, P(C) = 1/5.
1. Use Boole's inequality to find an upper bound for P(A ∪ B ∪ C).
2. Use Bonferroni's inequality to find a lower bound for P(A ∩ B ∩ C).
Answer

Open the simple probability experiment.


1. Note the 16 events that can be constructed from A and B using the set operations of union, intersection, and complement.
2. Given P(A) , P(B) , and P(A ∩ B) in the table, use the rules of probability to verify the probabilities of the other events.
3. Run the experiment 1000 times and compare the relative frequencies of the events with the probabilities of the events.

Suppose that A , B , and C are events in a random experiment with P(A) = 1/4 , P(B) = 1/3 , P(C ) = 1/6 ,
P(A ∩ B) = 1/18 , P(A ∩ C ) = 1/16 , P(B ∩ C ) = 1/12 , and P(A ∩ B ∩ C ) = 1/24 . Find the probabilities of the various
unions:
1. A ∪ B
2. A ∪ C
3. B ∪ C
4. A ∪ B ∪ C
Answer

Suppose that A ,
B , and C are events in a random experiment with P(A) = 1/4 , P(B) = 1/4 , P(C ) = 5/16,

P(A ∪ B) = 7/16 , P(A ∪ C ) = 23/48 , P(B ∪ C ) = 11/24 , and P(A ∪ B ∪ C ) = 7/12 . Find the probabilities of the
various intersections:
1. A ∩ B
2. A ∩ C
3. B ∩ C
4. A ∩ B ∩ C
Answer

Suppose that A, B, and C are events in a random experiment. Explicitly give all of the Bonferroni inequalities for P(A ∪ B ∪ C).

Proof

Coins
Consider the random experiment of tossing a coin n times and recording the sequence of scores X = (X_1, X_2, … , X_n) (where 1 denotes heads and 0 denotes tails). This experiment is a generic example of n Bernoulli trials, named for Jacob Bernoulli. Note that the set of outcomes is S = {0, 1}^n, the set of bit strings of length n. If the coin is fair, then presumably, by the very meaning of the word, we have no reason to prefer one point in S over another. Thus, as a modeling assumption, it seems reasonable to give S the uniform probability distribution in which all outcomes are equally likely.

Suppose that a fair coin is tossed 3 times and the sequence of coin scores is recorded. Let A be the event that the first coin is
heads and B the event that there are exactly 2 heads. Give each of the following events in list form, and then compute the
probability of the event:
1. A
2. B
3. A ∩ B
4. A ∪ B
5. A^c ∪ B^c
6. A^c ∩ B^c
7. A ∪ B^c

Answer

In the Coin experiment, select 3 coins. Run the experiment 1000 times, updating after every run, and compute the empirical
probability of each event in the previous exercise.

Suppose that a fair coin is tossed 4 times and the sequence of scores is recorded. Let Y denote the number of heads. Give the
event {Y = k} (as a subset of the sample space) in list form, for each k ∈ {0, 1, 2, 3, 4}, and then give the probability of the
event.
Answer

Suppose that a fair coin is tossed n times and the sequence of scores is recorded. Let Y denote the number of heads.
P(Y = k) = (n choose k) (1/2)^n, k ∈ {0, 1, … , n} (2.3.25)

Proof

The distribution of Y in the last exercise is a special case of the binomial distribution. The binomial distribution is studied in more
detail in the chapter on Bernoulli Trials.
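The formula above is quick to evaluate. A minimal sketch, with the hypothetical value n = 5:

```python
from math import comb

# The distribution of the number of heads Y in n tosses of a fair coin:
# P(Y = k) = C(n, k) * (1/2)^n.

def pmf(n, k):
    return comb(n, k) * 0.5 ** n

n = 5
probs = [pmf(n, k) for k in range(n + 1)]
print(probs)        # symmetric, peaked in the middle
print(sum(probs))   # 1.0, as a distribution must
```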

Dice
Consider the experiment of throwing n distinct, k-sided dice (with faces numbered from 1 to k) and recording the sequence of scores X = (X_1, X_2, … , X_n). We can record the outcome as a sequence because of the assumption that the dice are distinct; you can think of the dice as somehow labeled from 1 to n, or perhaps with different colors. The special case k = 6 corresponds to standard dice. In general, note that the set of outcomes is S = {1, 2, … , k}^n. If the dice are fair, then again, by the very meaning of the word, we have no reason to prefer one point in S over another, so as a modeling assumption it seems reasonable to give S the uniform probability distribution.

Suppose that two fair, standard dice are thrown and the sequence of scores recorded. Let A denote the event that the first die
score is less than 3 and B the event that the sum of the dice scores is 6. Give each of the following events in list form and then
find the probability of the event.
1. A
2. B
3. A ∩ B
4. A ∪ B
5. B ∖ A
Answer

In the dice experiment, set n =2 . Run the experiment 100 times and compute the empirical probability of each event in the
previous exercise.

Consider again the dice experiment with n = 2 fair dice. Let S denote the set of outcomes, Y the sum of the scores, U the
minimum score, and V the maximum score.
1. Express Y as a function on S and give the set of values.
2. Find P(Y = y) for each y in the set in part (a).
3. Express U as a function on S and give the set of values.
4. Find P(U = u) for each u in the set in part (c).
5. Express V as a function on S and give the set of values.
6. Find P(V = v) for each v in the set in part (e).
7. Find the set of values of (U , V ).
8. Find P(U = u, V = v) for each (u, v) in the set in part (g).
Answer

In the previous exercise, note that (U , V ) could serve as the outcome vector for the experiment of rolling two standard, fair dice if
we do not bother to distinguish the dice (so that we might as well record the smaller score first and then the larger score). Note that
this random vector does not have a uniform distribution. On the other hand, we might have chosen at the beginning to just record
the unordered set of scores and, as a modeling assumption, imposed the uniform distribution on the corresponding set of outcomes.
Both models cannot be right, so which model (if either) describes real dice in the real world? It turns out that for real (fair) dice, the
ordered sequence of scores is uniformly distributed, so real dice behave as distinct objects, whether you can tell them apart or not.
In the early history of probability, gamblers sometimes got the wrong answers for events involving dice because they mistakenly
applied the uniform distribution to the set of unordered scores. It's an important moral. If we are to impose the uniform distribution
on a sample space, we need to make sure that it's the right sample space.

A pair of fair, standard dice are thrown repeatedly until the sum of the scores is either 5 or 7. Let A denote the event that the
sum of the scores on the last throw is 5 rather than 7. Events of this type are important in the game of craps.
1. Suppose that we record the pair of scores on each throw. Give the set of outcomes S and express A as a subset of S .
2. Compute the probability of A in the setting of part (a).
3. Now suppose that we just record the pair of scores on the last throw. Give the set of outcomes T and express A as a subset
of T .
4. Compute the probability of A in the setting of part (c).
Answer

The previous problem shows the importance of defining the set of outcomes appropriately. Sometimes a clever choice of this set
(and appropriate modeling assumptions) can turn a difficult problem into an easy one.

Sampling Models
Recall that many random experiments can be thought of as sampling experiments. For the general finite sampling model, we start
with a population D with m (distinct) objects. We select a sample of n objects from the population, so that the sample space S is
the set of possible samples. If we select a sample at random then the outcome X (the random sample) is uniformly distributed on
S:

P(X ∈ A) = #(A)/#(S), A ⊆ S (2.3.26)

Recall from the section on Combinatorial Structures that there are four common types of sampling from a finite population, based on the criteria of order and replacement.
If the sampling is with replacement and with regard to order, then the set of samples is the Cartesian power D^n. The number of samples is m^n.
If the sampling is without replacement and with regard to order, then the set of samples is the set of all permutations of size n from D. The number of samples is m^{(n)} = m(m − 1) ⋯ (m − n + 1).
If the sampling is without replacement and without regard to order, then the set of samples is the set of all combinations (or subsets) of size n from D. The number of samples is (m choose n) = m^{(n)}/n!.
If the sampling is with replacement and without regard to order, then the set of samples is the set of all multisets of size n from D. The number of samples is (m + n − 1 choose n).
If we sample with replacement, the sample size n can be any positive integer. If we sample without replacement, the sample size cannot exceed the population size, so we must have n ∈ {1, 2, … , m}.
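The four counting formulas are easy to evaluate with the combinatorial functions in the Python standard library (Python 3.8+); the values m = 5 and n = 3 below are hypothetical:

```python
from math import comb, perm

m, n = 5, 3
print(m ** n)               # ordered, with replacement: 125
print(perm(m, n))           # ordered, without replacement: m^(n) = 60
print(comb(m, n))           # unordered, without replacement: 10
print(comb(m + n - 1, n))   # unordered, with replacement (multisets): 35
```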
The basic coin and dice experiments are examples of sampling with replacement. If we toss a fair coin n times and record the sequence of scores X (where as usual, 0 denotes tails and 1 denotes heads), then X is a random sample of size n chosen with order and with replacement from the population {0, 1}. Thus, X is uniformly distributed on {0, 1}^n. If we throw n (distinct) standard fair dice and record the sequence of scores, then we generate a random sample X of size n with order and with replacement from the population {1, 2, 3, 4, 5, 6}. Thus, X is uniformly distributed on {1, 2, 3, 4, 5, 6}^n. Of course, an analogous result holds for fair, k-sided dice.

Suppose that the sampling is without replacement (the most common case). If we record the ordered sample X = (X_1, X_2, … , X_n), then the unordered sample W = {X_1, X_2, … , X_n} is a random variable (that is, a function of X). On the other hand, if we just record the unordered sample W in the first place, then we cannot recover the ordered sample.

Suppose that X is a random sample of size n chosen with order and without replacement from D, so that X is uniformly
distributed on the space of permutations of size n from D. Then W , the unordered sample, is uniformly distributed on the
space of combinations of size n from D. Thus, W is also a random sample.
Proof

The result in the last exercise does not hold if the sampling is with replacement (recall the exercise above and the discussion that
follows). When sampling with replacement, there is no simple relationship between the number of ordered samples and the number
of unordered samples.

Sampling From a Dichotomous Population


Suppose again that we have a population D with m (distinct) objects, but suppose now that each object is one of two types—either
type 1 or type 0. Such populations are said to be dichotomous. Here are some specific examples:
The population consists of persons, each either male or female.
The population consists of voters, each either democrat or republican.
The population consists of devices, each either good or defective.
The population consists of balls, each either red or green.
Suppose that the population D has r type 1 objects and hence m − r type 0 objects. Of course, we must have r ∈ {0, 1, … , m}.
Now suppose that we select a sample of size n at random from the population. Note that this model has three parameters: the
population size m, the number of type 1 objects in the population r, and the sample size n . Let Y denote the number of type 1
objects in the sample.

If the sampling is without replacement then
P(Y = y) = (r choose y) (m − r choose n − y) / (m choose n), y ∈ {0, 1, … , n} (2.3.27)

Proof

In the previous exercise, random variable Y has the hypergeometric distribution with parameters m, r, and n. The hypergeometric distribution is studied in more detail in the chapter on Finite Sampling Models.

If the sampling is with replacement then
P(Y = y) = (n choose y) r^y (m − r)^{n−y} / m^n = (n choose y) (r/m)^y (1 − r/m)^{n−y}, y ∈ {0, 1, … , n} (2.3.28)

Proof

In the last exercise, random variable Y has the binomial distribution with parameters n and p = r/m. The binomial distribution is
studied in more detail in the chapter on Bernoulli Trials.
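Both probability functions are straightforward to compute. A minimal sketch, with hypothetical parameters; note that for a large population the two models give similar answers:

```python
from math import comb

# m objects, r of type 1, sample size n; Y is the number of type 1 objects.

def hypergeometric_pmf(m, r, n, y):
    # sampling without replacement
    return comb(r, y) * comb(m - r, n - y) / comb(m, n)

def binomial_pmf(m, r, n, y):
    # sampling with replacement, p = r / m
    p = r / m
    return comb(n, y) * p ** y * (1 - p) ** (n - y)

m, r, n = 50, 20, 5
print(sum(hypergeometric_pmf(m, r, n, y) for y in range(n + 1)))  # 1.0
print(hypergeometric_pmf(m, r, n, 2), binomial_pmf(m, r, n, 2))   # close, not equal
```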

Suppose that a group of voters consists of 40 democrats and 30 republicans. A sample of 10 voters is chosen at random. Find the probability that the sample contains at least 4 democrats and at least 4 republicans, in each of the following cases:
1. The sampling is without replacement.
2. The sampling is with replacement.
Answer

Look for other specialized sampling situations in the exercises below.

Urn Models
Drawing balls from an urn is a standard metaphor in probability for sampling from a finite population.

Consider an urn with 30 balls; 10 are red and 20 are green. A sample of 5 balls is chosen at random, without replacement. Let
Y denote the number of red balls in the sample. Explicitly compute P(Y = y) for each y ∈ {0, 1, 2, 3, 4, 5}.

Answer

In the simulation of the ball and urn experiment, select 30 balls with 10 red and 20 green, sample size 5, and sampling without
replacement. Run the experiment 1000 times and compare the empirical probabilities with the true probabilities that you
computed in the previous exercise.

Consider again an urn with 30 balls; 10 are red and 20 are green. A sample of 5 balls is chosen at random, with replacement.
Let Y denote the number of red balls in the sample. Explicitly compute P(Y = y) for each y ∈ {0, 1, 2, 3, 4, 5}.
Answer

In the simulation of the ball and urn experiment, select 30 balls with 10 red and 20 green, sample size 5, and sampling with
replacement. Run the experiment 1000 times and compare the empirical probabilities with the true probabilities that you
computed in the previous exercise.

An urn contains 15 balls: 6 are red, 5 are green, and 4 are blue. Three balls are chosen at random, without replacement.
1. Find the probability that the chosen balls are all the same color.
2. Find the probability that the chosen balls are all different colors.
Answer

Suppose again that an urn contains 15 balls: 6 are red, 5 are green, and 4 are blue. Three balls are chosen at random, with
replacement.
1. Find the probability that the chosen balls are all the same color.
2. Find the probability that the chosen balls are all different colors.
Answer

Cards
Recall that a standard card deck can be modeled by the product set
D = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, j, q, k} × {♣, ♢, ♡, ♠} (2.3.29)

where the first coordinate encodes the denomination or kind (ace, 2–10, jack, queen, king) and where the second coordinate
encodes the suit (clubs, diamonds, hearts, spades). Sometimes we represent a card as a string rather than an ordered pair (for
example q♡ for the queen of hearts).
Card games involve choosing a random sample without replacement from the deck D, which plays the role of the population. Thus,
the basic card experiment consists of dealing n cards from a standard deck without replacement; in this special context, the sample
of cards is often referred to as a hand. Just as in the general sampling model, if we record the ordered hand
X = (X , X , … , X ) , then the unordered hand W = { X , X , … , X } is a random variable (that is, a function of X). On the
1 2 n 1 2 n

other hand, if we just record the unordered hand W in the first place, then we cannot recover the ordered hand. Finally, recall that
n = 5 is the poker experiment and n = 13 is the bridge experiment. The game of poker is treated in more detail in the chapter on

Games of Chance. By the way, it takes about 7 standard riffle shuffles to randomize a deck of cards.

Suppose that 2 cards are dealt from a well-shuffled deck and the sequence of cards is recorded. For i ∈ {1, 2}, let H_i denote the event that card i is a heart. Find the probability of each of the following events.
1. H_1
2. H_1 ∩ H_2
3. H_2 ∖ H_1
4. H_2
5. H_1 ∖ H_2
6. H_1 ∪ H_2

Answer

Think about the results in the previous exercise, and suppose that we continue dealing cards. Note that in computing the probability of H_i, you could conceptually reduce the experiment to dealing a single card. Note also that the probabilities do not depend on the

order in which the cards are dealt. For example, the probability of an event involving the 1st, 2nd and 3rd cards is the same as the
probability of the corresponding event involving the 25th, 17th, and 40th cards. Technically, the cards are exchangeable. Here's
another way to think of this concept: Suppose that the cards are dealt onto a table in some pattern, but you are not allowed to view
the process. Then no experiment that you can devise will give you any information about the order in which the cards were dealt.

In the card experiment, set n = 2. Run the experiment 100 times and compute the empirical probability of each event in the previous exercise.

In the poker experiment, find the probability of each of the following events:
1. The hand is a full house (3 cards of one kind and 2 cards of another kind).
2. The hand has four of a kind (4 cards of one kind and 1 of another kind).
3. The cards are all in the same suit (thus, the hand is either a flush or a straight flush).
Answer

Run the poker experiment 10000 times, updating every 10 runs. Compute the empirical probability of each event in the
previous problem.

Find the probability that a bridge hand will contain no honor cards, that is, no cards of denomination 10, jack, queen, king, or ace. Such a hand is called a Yarborough, in honor of the second Earl of Yarborough.
Answer

Find the probability that a bridge hand will contain


1. Exactly 4 hearts.
2. Exactly 4 hearts and 3 spades.
3. Exactly 4 hearts, 3 spades, and 2 clubs.
Answer

A card hand that contains no cards in a particular suit is said to be void in that suit. Use the inclusion-exclusion rule to find the
probability of each of the following events:
1. A poker hand is void in at least one suit.
2. A bridge hand is void in at least one suit.
Answer

Birthdays
The following problem is known as the birthday problem, and is famous because the results are rather surprising at first.

Suppose that n persons are selected and their birthdays recorded (we will ignore leap years). Let A denote the event that the birthdays are distinct, so that A^c is the event that there is at least one duplication in the birthdays.
1. Define an appropriate sample space and probability measure. State the assumptions you are making.
2. Find P(A) and P(A^c) in terms of the parameter n.
3. Explicitly compute P(A) and P(A^c) for n ∈ {10, 20, 30, 40, 50}.
Answer

The small value of P(A) for relatively small sample sizes n is striking, but is due mathematically to the fact that 365^n grows much faster than 365^{(n)} as n increases. The birthday problem is treated in more generality in the chapter on Finite Sampling Models.
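The ratio in the discussion is quick to compute. A minimal sketch:

```python
from math import perm

# With n persons, P(distinct birthdays) = 365^(n) / 365^n.

for n in (10, 20, 30, 40, 50):
    p = perm(365, n) / 365 ** n
    print(n, round(p, 4), round(1 - p, 4))   # P(A), P(A^c)
```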

Suppose that 4 persons are selected and their birth months recorded. Assuming an appropriate uniform distribution, find the
probability that the months are distinct.
Answer

Continuous Uniform Distributions


Recall that in Buffon's coin experiment, a coin with radius r ≤ 1/2 is tossed “randomly” on a floor with square tiles of side length 1, and the coordinates (X, Y) of the center of the coin are recorded, relative to axes through the center of the square in which the coin lands (with the axes parallel to the sides of the square, of course). Let A denote the event that the coin does not touch the sides of the square.
1. Define the set of outcomes S mathematically and sketch S.
2. Argue that (X, Y) is uniformly distributed on S.
3. Express A in terms of the outcome variables (X, Y) and sketch A.
4. Find P(A).
5. Find P(A^c).

Answer

In Buffon's coin experiment, set r = 0.2. Run the experiment 100 times and compute the empirical probability of each event in
the previous exercise.
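If the simulation applet is not available, the experiment is easy to reproduce. A minimal Monte Carlo sketch (the description of A in the comment is our reading of the exercise above):

```python
import random

# The center (X, Y) is uniform on the square [-1/2, 1/2]^2; the coin
# misses the sides exactly when max(|X|, |Y|) <= 1/2 - r.

random.seed(7)
r, runs, hits = 0.2, 10000, 0
for _ in range(runs):
    x, y = random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5)
    if max(abs(x), abs(y)) <= 0.5 - r:
        hits += 1
print(hits / runs)   # empirical probability of A
```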

A point (X, Y) is chosen at random in the circular region S ⊂ R^2 of radius 1, centered at the origin. Let A denote the event that the point is in the inscribed square region centered at the origin, with sides parallel to the coordinate axes, and let B denote the event that the point is in the inscribed square with vertices (±1, 0), (0, ±1). Sketch each of the following events as a subset of S, and find the probability of the event.
1. A
2. B
3. A ∩ B^c
4. B ∩ A^c
5. A ∩ B
6. A ∪ B
Answer

Suppose a point (X, Y) is chosen at random in the circular region S ⊆ R^2 of radius 12, centered at the origin. Let R denote
the distance from the origin to the point. Sketch each of the following events as a subset of S , and compute the probability of
the event. Is R uniformly distributed on the interval [0, 12]?
1. {R ≤ 3}
2. {3 < R ≤ 6}
3. {6 < R ≤ 9}
4. {9 < R ≤ 12}
Answer

In the simple probability experiment, points are generated according to the uniform distribution on a rectangle. Move and
resize the events A and B and note how the probabilities of the various events change. Create each of the following
configurations. In each case, run the experiment 1000 times and compare the relative frequencies of the events to the
probabilities of the events.

1. A and B in general position
2. A and B disjoint
3. A ⊆ B
4. B ⊆ A

Genetics
Please refer to the discussion of genetics in the section on random experiments if you need to review some of the definitions in this
subsection.
Recall first that the ABO blood type in humans is determined by three alleles: a , b , and o . Furthermore, a and b are co-dominant
and o is recessive. Suppose that the probability distribution for the set of blood genotypes in a certain population is given in the
following table:

Genotype aa ab ao bb bo oo

Probability 0.050 0.038 0.310 0.007 0.116 0.479

A person is chosen at random from the population. Let A , B , AB, and O be the events that the person is type A , type B , type
AB , and type O respectively. Let H be the event that the person is homozygous and D the event that the person has an o

allele. Find the probability of the following events:


1. A
2. B
3. AB
4. O
5. H
6. D
7. H ∪ D
8. D^c

Answer

Suppose next that pod color in a certain type of pea plant is determined by a gene with two alleles: g for green and y for yellow, and that g is dominant.

Let G be the event that a child plant has green pods. Find P(G) in each of the following cases:
1. At least one parent is type gg .
2. Both parents are type yy.
3. Both parents are type gy .
4. One parent is type yy and the other is type gy .
Answer

Next consider a sex-linked hereditary disorder in humans (such as colorblindness or hemophilia). Let h denote the healthy allele
and d the defective allele for the gene linked to the disorder. Recall that d is recessive for women.

Let B be the event that a son has the disorder, C the event that a daughter is a healthy carrier, and D the event that a daughter
has the disease. Find P(B) , P(C ) and P(D) in each of the following cases:
1. The mother and father are normal.
2. The mother is a healthy carrier and the father is normal.
3. The mother is normal and the father has the disorder.
4. The mother is a healthy carrier and the father has the disorder.
5. The mother has the disorder and the father is normal.
6. The mother and father both have the disorder.
Answer

From this exercise, note that transmission of the disorder to a daughter can only occur if the mother is at least a carrier and the
father has the disorder. In ordinary large populations, this is an unusual intersection of events, and thus sex-linked hereditary
disorders are typically much less common in women than in men. In brief, women are protected by the extra X chromosome.

Radioactive Emissions
Suppose that T denotes the time between emissions (in milliseconds) for a certain type of radioactive material, and that T has the following probability distribution, defined for measurable A ⊆ [0, ∞) by
P(T ∈ A) = ∫_A e^{−t} dt (2.3.30)

1. Show that this really does define a probability distribution.


2. Find P(T > 3) .
3. Find P(2 < T < 4) .
Answer

Suppose that N denotes the number of emissions in a one millisecond interval for a certain type of radioactive material, and that N has the following probability distribution:
P(N ∈ A) = ∑_{n∈A} e^{−1}/n!, A ⊆ N (2.3.31)

1. Show that this really does define a probability distribution.


2. Find P(N ≥ 3) .
3. Find P(2 ≤ N ≤ 4) .
Answer

The probability distribution that governs the time between emissions is a special case of the exponential distribution, while the
probability distribution that governs the number of emissions is a special case of the Poisson distribution, named for Simeon
Poisson. The exponential distribution and the Poisson distribution are studied in more detail in the chapter on the Poisson process.
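Both models can be checked numerically. A minimal sketch, a sanity check rather than a proof, that the density e^{−t} integrates to 1 over [0, ∞) and the Poisson terms sum to 1 over N:

```python
from math import exp, factorial

# Midpoint Riemann sum of e^(-t) over [0, 20]; the tail beyond 20 is
# negligible, so the value is essentially 1.
steps, h = 200000, 0.0001
print(sum(exp(-(i + 0.5) * h) for i in range(steps)) * h)

# Partial sum of e^(-1)/n!; the tail beyond n = 30 is negligible.
print(sum(exp(-1) / factorial(n) for n in range(30)))
```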

Matching
Suppose that an absent-minded secretary prepares 4 letters and matching envelopes to send to 4 different persons, but then
stuffs the letters into the envelopes randomly. Find the probability of the event A that at least one letter is in the proper
envelope.
Solution

This exercise is an example of the matching problem, originally formulated and studied by Pierre Remond Montmort. A complete
analysis of the matching problem is given in the chapter on Finite Sampling Models.

In the simulation of the matching experiment select n = 4 . Run the experiment 1000 times and compute the relative frequency
of the event that at least one match occurs.

Data Analysis Exercises


For the M&M data set, let R denote the event that a bag has at least 10 red candies, T the event that a bag has at least 57
candies total, and W the event that a bag weighs at least 50 grams. Find the empirical probability the following events:
1. R
2. T
3. W
4. R ∩ T
5. T ∖ W
Answer

2.3.17 https://stats.libretexts.org/@go/page/10131
For the cicada data, let W denote the event that a cicada weighs at least 0.20 grams, F the event that a cicada is female, and T
the event that a cicada is type tredecula. Find the empirical probability of each of the following:
1. W
2. F
3. T
4. W ∩ F
5. F ∪ T ∪ W
Answer

2.4: Conditional Probability

The purpose of this section is to study how probabilities are updated in light of new information, clearly an absolutely essential
topic. If you are a new student of probability, you may want to skip the technical details.

Definitions and Interpretations


The Basic Definition
As usual, we start with a random experiment modeled by a probability space (S, S , P). Thus, S is the set of outcomes, S the
collection of events, and P the probability measure on the sample space (S, S ). Suppose now that we know that an event B has
occurred. In general, this information should clearly alter the probabilities that we assign to other events. In particular, if A is
another event then A occurs if and only if A and B occur; effectively, the sample space has been reduced to B . Thus, the
probability of A , given that we know B has occurred, should be proportional to P(A ∩ B) .

Figure 2.4.1 : Events A and B


However, conditional probability, given that B has occurred, should still be a probability measure, that is, it must satisfy the axioms
of probability. This forces the proportionality constant to be 1/P(B). Thus, we are led inexorably to the following definition:

Let A and B be events with P(B) > 0. The conditional probability of A given B is defined to be
P(A ∣ B) = P(A ∩ B)/P(B) (2.4.1)

The Law of Large Numbers


The definition above was based on the axiomatic definition of probability. Let's explore the idea of conditional probability from the less formal and more intuitive notion of relative frequency (the law of large numbers). Thus, suppose that we run the experiment repeatedly. For n ∈ N_+ and an event E, let N_n(E) denote the number of times E occurs (the frequency of E) in the first n runs. Note that N_n(E) is a random variable in the compound experiment that consists of replicating the original experiment. In particular, its value is unknown until we actually run the experiment n times.
If N_n(B) is large, the conditional probability that A has occurred, given that B has occurred, should be close to the conditional relative frequency of A given B, namely the relative frequency of A for the runs on which B occurred: N_n(A ∩ B)/N_n(B). But
note that
N_n(A ∩ B)/N_n(B) = [N_n(A ∩ B)/n] / [N_n(B)/n] (2.4.2)

The numerator and denominator of the main fraction on the right are the relative frequencies of A ∩ B and B, respectively. So by the law of large numbers again, N_n(A ∩ B)/n → P(A ∩ B) as n → ∞ and N_n(B)/n → P(B) as n → ∞. Hence

N_n(A ∩ B)/N_n(B) → P(A ∩ B)/P(B) as n → ∞ (2.4.3)

and we are led again to the definition above.
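The conditional relative frequency is easy to watch in simulation. A minimal sketch, with our own example rather than one from the text:

```python
import random

# Throw two fair dice; let B be the event that the sum is 8 and A the
# event that the first die shows 6. B contains 5 equally likely outcomes,
# so P(A | B) = 1/5.

random.seed(2)
n_B = n_AB = 0
for _ in range(100000):
    d1, d2 = random.randint(1, 6), random.randint(1, 6)
    if d1 + d2 == 8:
        n_B += 1
        if d1 == 6:
            n_AB += 1
print(n_AB / n_B)   # N_n(A ∩ B) / N_n(B), near 1/5
```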


In some cases, conditional probabilities can be computed directly, by effectively reducing the sample space to the given event. In
other cases, the formula in the mathematical definition is better. In some cases, conditional probabilities are known from modeling
assumptions, and then are used to compute other probabilities. We will see examples of all of these situations in the computational
exercises below.

It's very important that you not confuse P(A ∣ B) , the probability of A given B , with P(B ∣ A) , the probability of B given A .
Making that mistake is known as the fallacy of the transposed conditional. (How embarrassing!)

Conditional Distributions
Suppose that X is a random variable for the experiment with values in T . Mathematically, X is a function from S into T , and
{X ∈ A} denotes the event {s ∈ S : X(s) ∈ A} for A ⊆ T . Intuitively, X is a variable of interest in the experiment, and every

meaningful statement about X defines an event. Recall that the probability distribution of X is the probability measure on T given
by
A ↦ P(X ∈ A), A ⊆T (2.4.4)

This has a natural extension to a conditional distribution, given an event.

If B is an event with P(B) > 0 , then the conditional distribution of X given B is the probability measure on T given by
A ↦ P(X ∈ A ∣ B), A ⊆T (2.4.5)

Details

Basic Theory
Preliminary Results
Our first result is of fundamental importance, and indeed was a crucial part of the argument for the definition of conditional
probability.

Suppose again that B is an event with P(B) > 0 . Then A ↦ P(A ∣ B) is a probability measure on S .
Proof

It's hard to overstate the importance of the last result because this theorem means that any result that holds for probability measures
in general holds for conditional probability, as long as the conditioning event remains fixed. In particular the basic probability rules
in the section on Probability Measure have analogs for conditional probability. To give two examples,
P(A^c ∣ B) = 1 − P(A ∣ B)   (2.4.8)
P(A_1 ∪ A_2 ∣ B) = P(A_1 ∣ B) + P(A_2 ∣ B) − P(A_1 ∩ A_2 ∣ B)   (2.4.9)

By the same token, it follows that the conditional distribution of a random variable with values in T , given in above, really does
define a probability distribution on T . No further proof is necessary. Our next results are very simple.

Suppose that A and B are events with P(B) > 0 .


1. If B ⊆ A then P(A ∣ B) = 1 .
2. If A ⊆ B then P(A ∣ B) = P(A)/P(B) .
3. If A and B are disjoint then P(A ∣ B) = 0 .
Proof

Parts (a) and (c) certainly make sense. Suppose that we know that event B has occurred. If B ⊆ A then A becomes a certain event.
If A ∩ B = ∅ then A becomes an impossible event. A conditional probability can be computed relative to a probability measure
that is itself a conditional probability measure. The following result is a consistency condition.

Suppose that A , B , and C are events with P(B ∩ C ) > 0 . The probability of A given B , relative to P(⋅ ∣ C ) , is the same as
the probability of A given B and C (relative to P). That is,
P(A ∩ B ∣ C) / P(B ∣ C) = P(A ∣ B ∩ C)   (2.4.10)

Proof

Correlation
Our next discussion concerns an important concept that deals with how two events are related, in a probabilistic sense.

Suppose that A and B are events with P(A) > 0 and P(B) > 0 .
1. P(A ∣ B) > P(A) if and only if P(B ∣ A) > P(B) if and only if P(A ∩ B) > P(A)P(B) . In this case, A and B are
positively correlated.
2. P(A ∣ B) < P(A) if and only if P(B ∣ A) < P(B) if and only if P(A ∩ B) < P(A)P(B) . In this case, A and B are
negatively correlated.
3. P(A ∣ B) = P(A) if and only if P(B ∣ A) = P(B) if and only if P(A ∩ B) = P(A)P(B) . In this case, A and B are
uncorrelated or independent.
Proof

Intuitively, if A and B are positively correlated, then the occurrence of either event means that the other event is more likely. If A
and B are negatively correlated, then the occurrence of either event means that the other event is less likely. If A and B are
uncorrelated, then the occurrence of either event does not change the probability of the other event. Independence is a fundamental
concept that can be extended to more than two events and to random variables; these generalizations are studied in the next section
on Independence. A much more general version of correlation, for random variables, is explored in the section on Covariance and
Correlation in the chapter on Expected Value.
Suppose that A and B are events. Note from (4) that if A ⊆ B or B ⊆ A then A and B are positively correlated. If A and B are
disjoint then A and B are negatively correlated.

Suppose that A and B are events in a random experiment.


1. A^c and B^c have the same correlation (positive, negative, or zero) as A and B.
2. A and B^c have the opposite correlation as A and B (that is, positive-negative, negative-positive, or 0-0).

Proof

The Multiplication Rule


Sometimes conditional probabilities are known and can be used to find the probabilities of other events. Note first that if A and B
are events with positive probability, then by the very definition of conditional probability,

P(A ∩ B) = P(A)P(B ∣ A) = P(B)P(A ∣ B) (2.4.15)

The following generalization is known as the multiplication rule of probability. As usual, we assume that any event conditioned on
has positive probability.

Suppose that (A_1, A_2, …, A_n) is a sequence of events. Then
P(A_1 ∩ A_2 ∩ ⋯ ∩ A_n) = P(A_1) P(A_2 ∣ A_1) P(A_3 ∣ A_1 ∩ A_2) ⋯ P(A_n ∣ A_1 ∩ A_2 ∩ ⋯ ∩ A_{n−1})   (2.4.16)

Proof

The multiplication rule is particularly useful for experiments that consist of dependent stages, where Ai is an event in stage i.
Compare the multiplication rule of probability with the multiplication rule of combinatorics.
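
As a concrete sketch of dependent stages, the following Python lines (using the standard fractions module for exact arithmetic) apply the multiplication rule to compute the probability that 3 cards dealt from a standard deck are all hearts, a computation that also appears in a card exercise below.

from fractions import Fraction

# Multiplication rule over dependent stages:
# P(H1 ∩ H2 ∩ H3) = P(H1) P(H2 | H1) P(H3 | H1 ∩ H2)
p = Fraction(13, 52) * Fraction(12, 51) * Fraction(11, 50)
print(p, float(p))   # 11/850, about 0.0129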
As with any other result, the multiplication rule can be applied to a conditional probability measure. In the context above, if E is
another event, then
P(A_1 ∩ A_2 ∩ ⋯ ∩ A_n ∣ E) = P(A_1 ∣ E) P(A_2 ∣ A_1 ∩ E) P(A_3 ∣ A_1 ∩ A_2 ∩ E) ⋯ P(A_n ∣ A_1 ∩ A_2 ∩ ⋯ ∩ A_{n−1} ∩ E)   (2.4.17)

Conditioning and Bayes' Theorem


Suppose that A = {A_i : i ∈ I} is a countable collection of events that partition the sample space S, and that P(A_i) > 0 for each
i ∈ I.

Figure 2.4.2 : A partition of S induces a partition of B .

The following theorem is known as the law of total probability.

If B is an event then

P(B) = ∑_{i∈I} P(A_i) P(B ∣ A_i)   (2.4.18)

Proof

The following theorem is known as Bayes' Theorem, named after Thomas Bayes:

If B is an event then
P(A_j ∣ B) = P(A_j) P(B ∣ A_j) / ∑_{i∈I} P(A_i) P(B ∣ A_i),  j ∈ I   (2.4.20)

Proof

These two theorems are most useful, of course, when we know P(A_i) and P(B ∣ A_i) for each i ∈ I. When we compute the
probability P(B) by the law of total probability, we say that we are conditioning on the partition A. Note that we can think of
the sum as a weighted average of the conditional probabilities P(B ∣ A_i) over i ∈ I, where P(A_i), i ∈ I, are the weight factors. In
the context of Bayes' theorem, P(A_j) is the prior probability of A_j and P(A_j ∣ B) is the posterior probability of A_j for j ∈ I. We
will study more general versions of conditioning and Bayes' theorem in the section on Discrete Distributions in the chapter on
Distributions, and again in the section on Conditional Expected Value in the chapter on Expected Value.
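
Both theorems translate directly into a few lines of Python. In the sketch below the function names total_probability and bayes are just illustrative labels, and the prior and conditional probabilities are those of the memory-chip plant exercise that appears later in this section.

from fractions import Fraction

def total_probability(prior, likelihood):
    # P(B) = sum over i of P(A_i) P(B | A_i), for a partition {A_i}
    return sum(p * q for p, q in zip(prior, likelihood))

def bayes(prior, likelihood):
    # Posterior probabilities P(A_j | B), by Bayes' theorem
    pb = total_probability(prior, likelihood)
    return [p * q / pb for p, q in zip(prior, likelihood)]

prior = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]          # lines 1, 2, 3
likelihood = [Fraction(1, 25), Fraction(1, 20), Fraction(1, 100)]  # defect rates
print(total_probability(prior, likelihood))   # P(defective) = 37/1000
print(bayes(prior, likelihood))               # [20/37, 15/37, 2/37]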
Once again, the law of total probability and Bayes' theorem can be applied to a conditional probability measure. So, if E is another
event with P(A_i ∩ E) > 0 for i ∈ I then
P(B ∣ E) = ∑_{i∈I} P(A_i ∣ E) P(B ∣ A_i ∩ E)   (2.4.21)
P(A_j ∣ B ∩ E) = P(A_j ∣ E) P(B ∣ A_j ∩ E) / ∑_{i∈I} P(A_i ∣ E) P(B ∣ A_i ∩ E),  j ∈ I   (2.4.22)

Examples and Applications


Basic Rules

Suppose that A and B are events in an experiment with P(A) = 1/3, P(B) = 1/4, P(A ∩ B) = 1/10. Find each of the following:
1. P(A ∣ B)
2. P(B ∣ A)
3. P(A^c ∣ B)
4. P(B^c ∣ A)
5. P(A^c ∣ B^c)

Answer

Suppose that A, B, and C are events in a random experiment with P(A ∣ C) = 1/2, P(B ∣ C) = 1/3, and P(A ∩ B ∣ C) = 1/4.
Find each of the following:
1. P(B ∖ A ∣ C)
2. P(A ∪ B ∣ C)
3. P(A^c ∩ B^c ∣ C)
4. P(A^c ∪ B^c ∣ C)
5. P(A^c ∪ B ∣ C)
6. P(A ∣ B ∩ C)
Answer

Suppose that A and B are events in a random experiment with P(A) = 1/2, P(B) = 1/3, and P(A ∣ B) = 3/4.
1. Find P(A ∩ B)
2. Find P(A ∪ B)
3. Find P(B ∪ A^c)
4. Find P(B ∣ A)
5. Are A and B positively correlated, negatively correlated, or independent?
Answer

Open the conditional probability experiment.


1. Given P(A) , P(B) , and P(A ∩ B) , in the table, verify all of the other probabilities in the table.
2. Run the experiment 1000 times and compare the probabilities with the relative frequencies.

Simple Populations
In a certain population, 30% of the persons smoke cigarettes and 8% have COPD (Chronic Obstructive Pulmonary Disease).
Moreover, 12% of the persons who smoke have COPD.
1. What percentage of the population smoke and have COPD?
2. What percentage of the population with COPD also smoke?
3. Are smoking and COPD positively correlated, negatively correlated, or independent?
Answer

A company has 200 employees: 120 are women and 80 are men. Of the 120 female employees, 30 are classified as managers,
while 20 of the 80 male employees are managers. Suppose that an employee is chosen at random.
1. Find the probability that the employee is female.
2. Find the probability that the employee is a manager.
3. Find the conditional probability that the employee is a manager given that the employee is female.
4. Find the conditional probability that the employee is female given that the employee is a manager.
5. Are the events female and manager positively correlated, negatively correlated, or independent?
Answer

Dice and Coins


Consider the experiment that consists of rolling 2 standard, fair dice and recording the sequence of scores X = (X_1, X_2). Let
Y denote the sum of the scores. For each of the following pairs of events, find the probability of each event and the conditional
probability of each event given the other. Determine whether the events are positively correlated, negatively correlated, or
independent.
1. {X_1 = 3}, {Y = 5}
2. {X_1 = 3}, {Y = 7}
3. {X_1 = 2}, {Y = 5}
4. {X_1 = 3}, {X_1 = 2}

Answer

Note that positive correlation is not a transitive relation. From the previous exercise, for example, note that {X_1 = 3} and
{Y = 5} are positively correlated, {Y = 5} and {X_1 = 2} are positively correlated, but {X_1 = 3} and {X_1 = 2} are negatively
correlated (in fact, disjoint).

In the dice experiment, set n = 2. Run the experiment 1000 times. Compute the empirical conditional probabilities corresponding
to the conditional probabilities in the last exercise.

Consider again the experiment that consists of rolling 2 standard, fair dice and recording the sequence of scores
X = (X_1, X_2). Let Y denote the sum of the scores, U the minimum score, and V the maximum score.
1. Find P(U = u ∣ V = 4) for the appropriate values of u.
2. Find P(Y = y ∣ V = 4) for the appropriate values of y.
3. Find P(V = v ∣ Y = 8) for the appropriate values of v.
4. Find P(U = u ∣ Y = 8) for the appropriate values of u.
5. Find P[(X_1, X_2) = (x_1, x_2) ∣ Y = 8] for the appropriate values of (x_1, x_2).
Answer

In the die-coin experiment, a standard, fair die is rolled and then a fair coin is tossed the number of times showing on the die.
Let N denote the die score and H the event that all coin tosses result in heads.
1. Find P(H ).
2. Find P(N = n ∣ H ) for n ∈ {1, 2, 3, 4, 5, 6}.
3. Compare the results in (b) with P(N = n) for n ∈ {1, 2, 3, 4, 5, 6}. In each case, note whether the events H and {N = n}

are positively correlated, negatively correlated, or independent.


Answer
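
The last exercise is a direct application of conditioning and Bayes' theorem, and is easy to check with exact rational arithmetic. A minimal Python sketch:

from fractions import Fraction

# Die-coin experiment: N is a fair die score; H = all N tosses are heads.
# Law of total probability: P(H) = sum over n of (1/6) (1/2)^n.
p_h = sum(Fraction(1, 6) * Fraction(1, 2) ** n for n in range(1, 7))
print(p_h)   # 21/128

# Bayes' theorem: the posterior P(N = n | H) decreases in n, so H and
# {N = n} are positively correlated for small n, negatively for large n.
for n in range(1, 7):
    print(n, Fraction(1, 6) * Fraction(1, 2) ** n / p_h)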

Run the die-coin experiment 1000 times. Let H and N be as defined in the previous exercise.
1. Compute the empirical probability of H . Compare with the true probability in the previous exercise.
2. Compute the empirical probability of {N = n} given H , for n ∈ {1, 2, 3, 4, 5, 6}. Compare with the true probabilities in
the previous exercise.

Suppose that a bag contains 12 coins: 5 are fair, 4 are biased with probability of heads 1/3, and 3 are two-headed. A coin is
chosen at random from the bag and tossed.
1. Find the probability that the coin is heads.
2. Given that the coin is heads, find the conditional probability of each coin type.
Answer

Compare the die-coin experiment with the bag of coins experiment. In the die-coin experiment, we toss a coin with a fixed probability of
heads a random number of times. In the bag of coins experiment, we effectively toss a coin with a random probability of heads a
fixed number of times. The random experiment of tossing a coin with a fixed probability of heads p a fixed number of times n is
known as the binomial experiment with parameters n and p. This is a very basic and important experiment that is studied in more
detail in the section on the binomial distribution in the chapter on Bernoulli Trials. Thus, the die-coin and bag of coins experiments
can be thought of as modifications of the binomial experiment in which a parameter has been randomized. In general, interesting
new random experiments can often be constructed by randomizing one or more parameters in another random experiment.

In the coin-die experiment, a fair coin is tossed. If the coin lands tails, a fair die is rolled. If the coin lands heads, an ace-six flat
die is tossed (faces 1 and 6 have probability 1/4 each, while faces 2, 3, 4, and 5 have probability 1/8 each). Let H denote the
event that the coin lands heads, and let Y denote the score when the chosen die is tossed.
1. Find P(Y = y) for y ∈ {1, 2, 3, 4, 5, 6}.
2. Find P(H ∣ Y = y) for y ∈ {1, 2, 3, 4, 5, 6}.
3. Compare each probability in part (b) with P(H ). In each case, note whether the events H and {Y = y} are positively
correlated, negatively correlated, or independent.
Answer

Run the coin-die experiment 1000 times. Let H and Y be as defined in the previous exercise.
1. Compute the empirical probability of {Y = y} , for each y , and compare with the true probability in the previous exercise
2. Compute the empirical probability of H given {Y = y} for each y , and compare with the true probability in the previous
exercise.

Cards
Consider the card experiment that consists of dealing 2 cards from a standard deck and recording the sequence of cards dealt.
For i ∈ {1, 2}, let Q_i be the event that card i is a queen and H_i the event that card i is a heart. For each of the following pairs
of events, compute the probability of each event, and the conditional probability of each event given the other. Determine
whether the events are positively correlated, negatively correlated, or independent.

1. Q_1, H_1
2. Q_1, Q_2
3. Q_2, H_2
4. Q_1, H_2

Answer

In the card experiment, set n = 2 . Run the experiment 500 times. Compute the conditional relative frequencies corresponding
to the conditional probabilities in the last exercise.

Consider the card experiment that consists of dealing 3 cards from a standard deck and recording the sequence of cards dealt.
Find the probability of the following events:
1. All three cards are hearts.
2. The first two cards are hearts and the third is a spade.
3. The first and third cards are hearts and the second is a spade.
Proof

In the card experiment, set n = 3 and run the simulation 1000 times. Compute the empirical probability of each event in the
previous exercise and compare with the true probability.

Bivariate Uniform Distributions


Recall that Buffon's coin experiment consists of tossing a coin with radius r ≤ 1/2 randomly on a floor covered with square tiles of
side length 1. The coordinates (X, Y) of the center of the coin are recorded relative to axes through the center of the square,
parallel to the sides. Since the coin is tossed randomly, the basic modeling assumption is that (X, Y) is uniformly distributed
on the square [−1/2, 1/2]^2.

Figure 2.4.3 : Buffon's coin experiment

In Buffon's coin experiment,


1. Find P(Y > 0 ∣ X < Y )
2. Find the conditional distribution of (X, Y ) given that the coin does not touch the sides of the square.
Answer

Run Buffon's coin experiment 500 times. Compute the empirical probability that Y >0 given that X <Y and compare with
the probability in the last exercise.

In the conditional probability experiment, the random points are uniformly distributed on the rectangle S . Move and resize
events A and B and note how the probabilities change. For each of the following configurations, run the experiment 1000
times and compare the relative frequencies with the true probabilities.
1. A and B in general position
2. A and B disjoint
3. A ⊆ B
4. B ⊆ A

Reliability
A plant has 3 assembly lines that produce memory chips. Line 1 produces 50% of the chips and has a defective rate of 4%;
line 2 produces 30% of the chips and has a defective rate of 5%; line 3 produces 20% of the chips and has a defective rate
of 1%. A chip is chosen at random from the plant.
1. Find the probability that the chip is defective.
2. Given that the chip is defective, find the conditional probability for each line.
Answer

Suppose that a bit (0 or 1) is sent through a noisy communications channel. Because of the noise, the bit sent may be received
incorrectly as the complementary bit. Specifically, suppose that if 0 is sent, then the probability that 0 is received is 0.9 and the
probability that 1 is received is 0.1. If 1 is sent, then the probability that 1 is received is 0.8 and the probability that 0 is
received is 0.2. Finally, suppose that 1 is sent with probability 0.6 and 0 is sent with probability 0.4. Find the probability that
1. 1 was sent given that 1 was received
2. 0 was sent given that 0 was received
Answer

Suppose that T denotes the lifetime of a light bulb (in 1000 hour units), and that T has the following exponential distribution,
defined for measurable A ⊆ [0, ∞) :

P(T ∈ A) = ∫_A e^{−t} dt   (2.4.23)

1. Find P(T > 3)

2. Find P(T > 5 ∣ T > 2)

Answer

Suppose again that T denotes the lifetime of a light bulb (in 1000 hour units), but that T is uniformly distributed on the interval
[0, 10].

1. Find P(T > 3)

2. Find P(T > 5 ∣ T > 2)

Answer
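
The contrast between the two lifetime models is worth computing side by side. In the Python sketch below, the exponential model satisfies P(T > 5 ∣ T > 2) = P(T > 3) (the memoryless property), while the uniform model does not.

import math

# Exponential lifetime: P(T > t) = e^(-t)
print(math.exp(-3.0))                       # P(T > 3)
print(math.exp(-5.0) / math.exp(-2.0))      # P(T > 5 | T > 2), the same value

# Uniform lifetime on [0, 10]: P(T > t) = (10 - t)/10
print((10 - 3) / 10)                        # P(T > 3) = 0.7
print(((10 - 5) / 10) / ((10 - 2) / 10))    # P(T > 5 | T > 2) = 0.625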

Genetics
Please refer to the discussion of genetics in the section on random experiments if you need to review some of the definitions in this
section.
Recall first that the ABO blood type in humans is determined by three alleles: a , b , and o . Furthermore, a and b are co-dominant
and o is recessive. Suppose that the probability distribution for the set of blood genotypes in a certain population is given in the
following table:

Genotype aa ab ao bb bo oo

Probability 0.050 0.038 0.310 0.007 0.116 0.479

Suppose that a person is chosen at random from the population. Let A , B , AB, and O be the events that the person is type A ,
type B , type AB, and type O respectively. Let H be the event that the person is homozygous, and let D denote the event that
the person has an o allele. Find each of the following:
1. P(A) , P(B) , P(AB), P(O) , P(H ), P(D)
2. P (A ∩ H ) , P (A ∣ H ) , P (H ∣ A) . Are the events A and H positively correlated, negatively correlated, or independent?
3. P (B ∩ H ) , P (B ∣ H ) , P (H ∣ B) . Are the events B and H positively correlated, negatively correlated, or independent?
4. P (A ∩ D) , P (A ∣ D) , P (D ∣ A) . Are the events A and D positively correlated, negatively correlated, or independent?
5. P (B ∩ D) , P (B ∣ D) , P (D ∣ B) . Are the events B and D positively correlated, negatively correlated, or independent?
6. P (H ∩ D) , P (H ∣ D) , P (D ∣ H ) . Are the events H and D positively correlated, negatively correlated, or independent?

Answer

Suppose next that pod color in a certain type of pea plant is determined by a gene with two alleles: g for green and y for yellow, and
that g is dominant and y recessive.

Suppose that a green-pod plant and a yellow-pod plant are bred together. Suppose further that the green-pod plant has a 1/2
chance of carrying the recessive yellow-pod allele.


1. Find the probability that a child plant will have green pods.
2. Given that a child plant has green pods, find the updated probability that the green-pod parent has the recessive allele.
Answer

Suppose that two green-pod plants are bred together. Suppose further that with probability 1/3 neither plant has the recessive
allele, with probability 1/2 one plant has the recessive allele, and with probability 1/6 both plants have the recessive allele.

1. Find the probability that a child plant has green pods.


2. Given that a child plant has green pods, find the updated probability that both parents have the recessive gene.
Answer

Next consider a sex-linked hereditary disorder in humans (such as colorblindness or hemophilia). Let h denote the healthy allele
and d the defective allele for the gene linked to the disorder. Recall that h is dominant and d recessive for women.

Suppose that in a certain population, 50% are male and 50% are female. Moreover, suppose that 10% of males are color blind
but only 1% of females are color blind.
1. Find the percentage of color blind persons in the population.
2. Find the percentage of color blind persons that are male.
Answer

Since color blindness is a sex-linked hereditary disorder, note that it's reasonable in the previous exercise that the probability that a
female is color blind is the square of the probability that a male is color blind. If p is the probability of the defective allele on the X
chromosome, then p is also the probability that a male will be color blind. But since the defective allele is recessive, a woman
would need two copies of the defective allele to be color blind, and assuming independence, the probability of this event is p^2.

A man and a woman do not have a certain sex-linked hereditary disorder, but the woman has a 1/3 chance of being a carrier.
1. Find the probability that a son born to the couple will be normal.
2. Find the probability that a daughter born to the couple will be a carrier.
3. Given that a son born to the couple is normal, find the updated probability that the mother is a carrier.
Answer

Urn Models
Urn 1 contains 4 red and 6 green balls while urn 2 contains 7 red and 3 green balls. An urn is chosen at random and then a ball
is chosen at random from the selected urn.
1. Find the probability that the ball is green.
2. Given that the ball is green, find the conditional probability that urn 1 was selected.
Answer

Urn 1 contains 4 red and 6 green balls while urn 2 contains 6 red and 3 green balls. A ball is selected at random from urn 1 and
transferred to urn 2. Then a ball is selected at random from urn 2.
1. Find the probability that the ball from urn 2 is green.
2. Given that the ball from urn 2 is green, find the conditional probability that the ball from urn 1 was green.
Answer

An urn initially contains 6 red and 4 green balls. A ball is chosen at random from the urn and its color is recorded. It is then
replaced in the urn and 2 new balls of the same color are added to the urn. The process is repeated. Find the probability of each
of the following events:
1. Balls 1 and 2 are red and ball 3 is green.
2. Balls 1 and 3 are red and ball 2 is green.
3. Ball 1 is green and balls 2 and 3 are red.
4. Ball 2 is red.
5. Ball 1 is red given that ball 2 is red.
Answer

Think about the results in the previous exercise. Note in particular that the answers to parts (a), (b), and (c) are the same, and that
the probability that the second ball is red in part (d) is the same as the probability that the first ball is red. More generally, the
probabilities of events do not depend on the order of the draws. For example, the probability of an event involving the first, second,
and third draws is the same as the probability of the corresponding event involving the seventh, tenth and fifth draws. Technically,
the sequence of events (R_1, R_2, …) is exchangeable. The random process described in this exercise is a special case of Pólya's urn
scheme, named after George Pólya. We will study Pólya's urn in more detail in the chapter on Finite Sampling Models.
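
A simulation makes the exchangeability claim vivid. The following minimal Python sketch of Pólya's urn (our own version, with the parameters of the exercise above) estimates the probability that the second ball is red; it comes out close to P(first ball red) = 0.6.

import random

# Pólya's urn: start with 6 red and 4 green; after each draw, return the
# ball and add 2 more of the same color.
random.seed(1)
trials = 100_000
second_red = 0
for _ in range(trials):
    red, green = 6, 4
    # first draw
    if random.random() < red / (red + green):
        red += 2
    else:
        green += 2
    # second draw
    if random.random() < red / (red + green):
        second_red += 1
print(second_red / trials)   # close to 0.6, the probability for the first ball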

An urn initially contains 6 red and 4 green balls. A ball is chosen at random from the urn and its color is recorded. It is then
replaced in the urn and two new balls of the other color are added to the urn. The process is repeated. Find the probability of
each of the following events:
1. Balls 1 and 2 are red and ball 3 is green.
2. Balls 1 and 3 are red and ball 2 is green.
3. Ball 1 is green and balls 2 and 3 are red.
4. Ball 2 is red.
5. Ball 1 is red given that ball 2 is red.
Answer

Think about the results in the previous exercise, and compare with Pólya's urn. Note that the answers to parts (a), (b), and (c) are
not all the same, and that the probability that the second ball is red in part (d) is not the same as the probability that the first ball is
red. In short, the sequence of events (R , R , …) is not exchangeable.
1 2

Diagnostic Testing
Suppose that we have a random experiment with an event A of interest. When we run the experiment, of course, event A will
either occur or not occur. However, suppose that we are not able to observe the occurrence or non-occurrence of A directly. Instead
we have a diagnostic test designed to indicate the occurrence of event A; thus the test can be either positive for A or negative
for A . The test also has an element of randomness, and in particular can be in error. Here are some typical examples of the type of
situation we have in mind:
The event is that a person has a certain disease and the test is a blood test for the disease.
The event is that a woman is pregnant and the test is a home pregnancy test.
The event is that a person is lying and the test is a lie-detector test.
The event is that a device is defective and the test consists of a sensor reading.
The event is that a missile is in a certain region of airspace and the test consists of radar signals.
The event is that a person has committed a crime, and the test is a jury trial with evidence presented for and against the event.
Let T be the event that the test is positive for the occurrence of A. The conditional probability P(T ∣ A) is called the sensitivity of
the test. The complementary probability
P(T^c ∣ A) = 1 − P(T ∣ A)   (2.4.24)
is the false negative probability. The conditional probability P(T^c ∣ A^c) is called the specificity of the test. The complementary
probability
P(T ∣ A^c) = 1 − P(T^c ∣ A^c)   (2.4.25)

is the false positive probability. In many cases, the sensitivity and specificity of the test are known, as a result of the development of
the test. However, the user of the test is interested in the opposite conditional probabilities, namely P(A ∣ T), the probability of the
event of interest, given a positive test, and P(A^c ∣ T^c), the probability of the complementary event, given a negative test. Of
course, if we know P(A ∣ T) then we also have P(A^c ∣ T) = 1 − P(A ∣ T), the probability of the complementary event given a
positive test. Similarly, if we know P(A^c ∣ T^c) then we also have P(A ∣ T^c), the probability of the event given a negative test.
Computing the probabilities of interest is simply a special case of Bayes' theorem.

The probability that the event occurs, given a positive test is
P(A ∣ T) = P(A) P(T ∣ A) / [P(A) P(T ∣ A) + P(A^c) P(T ∣ A^c)]   (2.4.26)
The probability that the event does not occur, given a negative test is
P(A^c ∣ T^c) = P(A^c) P(T^c ∣ A^c) / [P(A) P(T^c ∣ A) + P(A^c) P(T^c ∣ A^c)]   (2.4.27)

There is often a trade-off between sensitivity and specificity. An attempt to make a test more sensitive may result in the test being
less specific, and an attempt to make a test more specific may result in the test being less sensitive. As an extreme example,
consider the worthless test that always returns positive, no matter what the evidence. Then T = S so the test has sensitivity 1, but
specificity 0. At the opposite extreme is the worthless test that always returns negative, no matter what the evidence. Then T = ∅
so the test has specificity 1 but sensitivity 0. In between these extremes are helpful tests that are actually based on evidence of some
sort.
Suppose that the sensitivity a = P(T ∣ A) ∈ (0, 1) and the specificity b = P(T^c ∣ A^c) ∈ (0, 1) are fixed. Let p = P(A) denote
the prior probability of the event A and P = P(A ∣ T) the posterior probability of A given a positive test.

P as a function of p is given by
P = a p / [(a + b − 1) p + (1 − b)],  p ∈ [0, 1]   (2.4.28)
1. P increases continuously from 0 to 1 as p increases from 0 to 1.
2. P is concave downward if a + b > 1. In this case A and T are positively correlated.
3. P is concave upward if a + b < 1. In this case A and T are negatively correlated.
4. P = p if a + b = 1. In this case, A and T are uncorrelated (independent).
Proof

Of course, part (b) is the typical case, where the test is useful. In fact, we would hope that the sensitivity and specificity are close to
1. In case (c), the test is worse than useless since it gives the wrong information about A . But this case could be turned into a useful
test by simply reversing the roles of positive and negative. In case (d), the test is worthless and gives no information about A . It's
interesting that the broad classification above depends only on the sum of the sensitivity and specificity.
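
Equation (2.4.28) is easy to experiment with numerically. A minimal Python sketch (the function name posterior_positive is just an illustrative label) that anticipates the exercise below:

def posterior_positive(p, a, b):
    # P(A | T) from prior p = P(A), sensitivity a = P(T | A),
    # and specificity b = P(T^c | A^c), per equation (2.4.28)
    return a * p / ((a + b - 1) * p + (1 - b))

# Sensitivity 0.99 and specificity 0.95, over a range of priors
for p in [0.001, 0.01, 0.2, 0.5, 0.7, 0.9]:
    print(p, round(posterior_positive(p, 0.99, 0.95), 4))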

Figure 2.4.4 : P = P(A ∣ T ) as a function of p = P(A) in the three cases

Suppose that a diagnostic test has sensitivity 0.99 and specificity 0.95. Find P(A ∣ T ) for each of the following values of
P(A) :

1. 0.001
2. 0.01
3. 0.2
4. 0.5
5. 0.7
6. 0.9
Answer

With sensitivity 0.99 and specificity 0.95, the test in the last exercise superficially looks good. However the small value of
P(A ∣ T ) for small values of P(A) is striking (but inevitable given the properties above). The moral, of course, is that P(A ∣ T )

depends critically on P(A) not just on the sensitivity and specificity of the test. Moreover, the correct comparison is P(A ∣ T ) with
P(A) , as in the exercise, not P(A ∣ T ) with P(T ∣ A) —Beware of the fallacy of the transposed conditional! In terms of the correct

comparison, the test does indeed work well; P(A ∣ T ) is significantly larger than P(A) in all cases.

A woman initially believes that there is an even chance that she is or is not pregnant. She takes a home pregnancy test with
sensitivity 0.95 and specificity 0.90 (which are reasonable values for a home pregnancy test). Find the updated probability that
the woman is pregnant in each of the following cases.
1. The test is positive.
2. The test is negative.
Answer

Suppose that 70% of defendants brought to trial for a certain type of crime are guilty. Moreover, historical data show that juries
convict guilty persons 80% of the time and convict innocent persons 10% of the time. Suppose that a person is tried for a crime
of this type. Find the updated probability that the person is guilty in each of the following cases:
1. The person is convicted.
2. The person is acquitted.
Answer

The “Check Engine” light on your car has turned on. Without the information from the light, you believe that there is a 10%
chance that your car has a serious engine problem. You learn that if the car has such a problem, the light will come on with
probability 0.99, but if the car does not have a serious problem, the light will still come on, under circumstances similar to
yours, with probability 0.3. Find the updated probability that you have an engine problem.
Answer

The standard test for HIV is the ELISA (Enzyme-Linked Immunosorbent Assay) test. It has sensitivity and specificity of 0.999.
Suppose that a person is selected at random from a population in which 1% are infected with HIV, and given the ELISA test.
Find the probability that the person has HIV in each of the following cases:
1. The test is positive.
2. The test is negative.
Answer

The ELISA test for HIV is a very good one. Let's look at another test, this one for prostate cancer, that's rather bad.

The PSA test for prostate cancer is based on a blood marker known as the Prostate Specific Antigen. An elevated level of PSA
is evidence for prostate cancer. To have a diagnostic test, in the sense that we are discussing here, we must decide on a definite
level of PSA, above which we declare the test to be positive. A positive test would typically lead to other more invasive tests
(such as biopsy) which, of course, carry risks and cost. The PSA test with cutoff 2.6 ng/ml has sensitivity 0.40 and specificity
0.81. The overall incidence of prostate cancer among males is 156 per 100000. Suppose that a man, with no particular risk
factors, has the PSA test. Find the probability that the man has prostate cancer in each of the following cases:
1. The test is positive.
2. The test is negative.
Answer

Diagnostic testing is closely related to a general statistical procedure known as hypothesis testing. A separate chapter on hypothesis
testing explores this procedure in detail.

Data Analysis Exercises

For the M&M data set, find the empirical probability that a bag has at least 10 reds, given that the weight of the bag is at least
48 grams.
Answer

Consider the Cicada data.


1. Find the empirical probability that a cicada weighs at least 0.25 grams given that the cicada is male.
2. Find the empirical probability that a cicada weighs at least 0.25 grams given that the cicada is the tredecula species.
Answer

This page titled 2.4: Conditional Probability is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

2.5: Independence

In this section, we will discuss independence, one of the fundamental concepts in probability theory. Independence is frequently
invoked as a modeling assumption, and moreover, (classical) probability itself is based on the idea of independent replications of
the experiment. As usual, if you are a new student of probability, you may want to skip the technical details.

Basic Theory
As usual, our starting point is a random experiment modeled by a probability space (S, S , P) so that S is the set of outcomes, S
the collection of events, and P the probability measure on the sample space (S, S ). We will define independence for two events,
then for collections of events, and then for collections of random variables. In each case, the basic idea is the same.

Independence of Two Events


Two events A and B are independent if
P(A ∩ B) = P(A)P(B) (2.5.1)

If both of the events have positive probability, then independence is equivalent to the statement that the conditional probability of
one event given the other is the same as the unconditional probability of the event:
P(A ∣ B) = P(A) ⟺ P(B ∣ A) = P(B) ⟺ P(A ∩ B) = P(A)P(B) (2.5.2)

This is how you should think of independence: knowledge that one event has occurred does not change the probability assigned to
the other event. Independence of two events was discussed in the last section in the context of correlation. In particular, for two
events, independent and uncorrelated mean the same thing.
The terms independent and disjoint sound vaguely similar but they are actually very different. First, note that disjointness is purely
a set-theory concept while independence is a probability (measure-theoretic) concept. Indeed, two events can be independent
relative to one probability measure and dependent relative to another. But most importantly, two disjoint events can never be
independent, except in the trivial case that one of the events is null.

Suppose that A and B are disjoint events, each with positive probability. Then A and B are dependent, and in fact are
negatively correlated.
Proof

If A and B are independent events then intuitively it seems clear that any event that can be constructed from A should be
independent of any event that can be constructed from B . This is the case, as the next result shows. Moreover, this basic idea is
essential for the generalization of independence that we will consider shortly.

If A and B are independent events, then each of the following pairs of events is independent:
1. A, B^c
2. B, A^c
3. A^c, B^c

Proof

An event that is “essentially deterministic”, that is, has probability 0 or 1, is independent of any other event, even itself.

Suppose that A and B are events.


1. If P(A) = 0 or P(A) = 1 , then A and B are independent.
2. A is independent of itself if and only if P(A) = 0 or P(A) = 1 .
Proof

General Independence of Events
To extend the definition of independence to more than two events, we might think that we could just require pairwise
independence, the independence of each pair of events. However, this is not sufficient for the strong type of independence that we
have in mind. For example, suppose that we have three events A , B , and C . Mutual independence of these events should not only
mean that each pair is independent, but also that an event that can be constructed from A and B (for example A ∪ B ) should be c

independent of C . Pairwise independence does not achieve this; an exercise below gives three events that are pairwise independent,
but the intersection of two of the events is related to the third event in the strongest possible sense.
Another possible generalization would be to simply require the probability of the intersection of the events to be the product of the
probabilities of the events. However, this condition does not even guarantee pairwise independence. An exercise below gives an
example. However, the definition of independence for two events does generalize in a natural way to an arbitrary collection of
events.

Suppose that A_i is an event for each i in an index set I. Then the collection A = {A_i : i ∈ I} is independent if for every
finite J ⊆ I,
P(⋂_{j∈J} A_j) = ∏_{j∈J} P(A_j)   (2.5.4)

Independence of a collection of events is much stronger than mere pairwise independence of the events in the collection. The basic
inheritance property in the following result follows immediately from the definition.

Suppose that A is a collection of events.


1. If A is independent, then B is independent for every B ⊆ A .
2. If B is independent for every finite B ⊆ A then A is independent.

For a finite collection of events, the number of conditions required for mutual independence grows exponentially with the number
of events.

There are 2^n − n − 1 non-trivial conditions in the definition of the independence of n events.
1. Explicitly give the 4 conditions that must be satisfied for events A , B , and C to be independent.
2. Explicitly give the 11 conditions that must be satisfied for events A , B , C , and D to be independent.
Answer

If the events A_1, A_2, …, A_n are independent, then it follows immediately from the definition that
P(⋂_{i=1}^n A_i) = ∏_{i=1}^n P(A_i)   (2.5.5)

This is known as the multiplication rule for independent events. Compare this with the general multiplication rule for conditional
probability.

The collection of essentially deterministic events D = {A ∈ S : P(A) = 0 or P(A) = 1} is independent.


Proof

The next result generalizes the theorem above on the complements of two independent events.

Suppose that A = {A_i : i ∈ I} and B = {B_i : i ∈ I} are two collections of events with the property that for each i ∈ I,
either B_i = A_i or B_i = A_i^c. Then A is independent if and only if B is independent.

Proof

The last theorem in turn leads to the type of strong independence that we want. The following exercise gives examples.

If A , B , C , and D are independent events, then

1. A ∪ B^c, C, D are independent.
2. A ∪ B^c, C^c ∪ D^c are independent.

Proof

The complete generalization of these results is a bit complicated, but roughly means that if we start with a collection of independent
events, and form new events from disjoint subcollections (using the set operations of union, intersection, and complement), then the
new events are independent. For a precise statement, see the section on measure spaces. The importance of the complement
theorem lies in the fact that any event that can be defined in terms of a finite collection of events {A_i : i ∈ I} can be written as a
disjoint union of events of the form ⋂_{i∈I} B_i where B_i = A_i or B_i = A_i^c for each i ∈ I.

Another consequence of the general complement theorem is a formula for the probability of the union of a collection of
independent events that is much nicer than the inclusion-exclusion formula.

If A_1, A_2, …, A_n are independent events, then
P(⋃_{i=1}^n A_i) = 1 − ∏_{i=1}^n [1 − P(A_i)]   (2.5.10)

Proof
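
This formula is a one-liner in Python (the function name is ours, for illustration); the probabilities below are those of the first exercise under Examples and Applications, where the union is the event that at least one of the three events occurs.

from math import prod

def prob_union_independent(probs):
    # P(A_1 ∪ ... ∪ A_n) = 1 - prod(1 - P(A_i)) for independent events
    return 1 - prod(1 - p for p in probs)

print(prob_union_independent([0.3, 0.4, 0.8]))   # 1 - 0.7*0.6*0.2 = 0.916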

Independence of Random Variables


Suppose now that X_i is a random variable for the experiment with values in a set T_i for each i in a nonempty index set I.
Mathematically, X_i is a function from S into T_i, and recall that {X_i ∈ B} denotes the event {s ∈ S : X_i(s) ∈ B} for B ⊆ T_i.
Intuitively, X_i is a variable of interest in the experiment, and every meaningful statement about X_i defines an event. Intuitively, the
random variables are independent if information about some of the variables tells us nothing about the other variables.
Mathematically, independence of a collection of random variables can be reduced to the independence of collections of events.

The collection of random variables X = {X_i : i ∈ I} is independent if the collection of events {{X_i ∈ B_i} : i ∈ I} is
independent for every choice of B_i ⊆ T_i for i ∈ I. Equivalently then, X is independent if for every finite J ⊆ I, and for
every choice of B_j ⊆ T_j for j ∈ J, we have
P(⋂_{j∈J} {X_j ∈ B_j}) = ∏_{j∈J} P(X_j ∈ B_j)   (2.5.12)

Details

Suppose that X is a collection of random variables.


1. If X is independent, then Y is independent for every Y ⊆ X
2. If Y is independent for every finite Y ⊆ X then X is independent.

It would seem almost obvious that if a collection of random variables is independent, and we transform each variable in a
deterministic way, then the new collection of random variables should still be independent.

Suppose now that g_i is a function from T_i into a set U_i for each i ∈ I. If {X_i : i ∈ I} is independent, then {g_i(X_i) : i ∈ I} is
also independent.
Proof

As with events, the (mutual) independence of random variables is a very strong property. If a collection of random variables is
independent, then any subcollection is also independent. New random variables formed from disjoint subcollections are
independent. For a simple example, suppose that X, Y , and Z are independent real-valued random variables. Then
1. sin(X), cos(Y), and e^Z are independent.
2. (X, Y) and Z are independent.
3. X^2 + Y^2 and arctan(Z) are independent.
4. X and Z are independent.
5. Y and Z are independent.

In particular, note that statement 2 in the list above is much stronger than the conjunction of statements 4 and 5. Contrapositively, if
X and Z are dependent, then (X, Y ) and Z are also dependent. Independence of random variables subsumes independence of

events.

A collection of events A is independent if and only if the corresponding collection of indicator variables {1_A : A ∈ A} is
independent.
Proof

Many of the concepts that we have been using informally can now be made precise. A compound experiment that consists of
“independent stages” is essentially just an experiment whose outcome is a sequence of independent random variables
X = (X_1, X_2, …) where X_i is the outcome of the ith stage.

In particular, suppose that we have a basic experiment with outcome variable X. By definition, the outcome of the experiment that
consists of “independent replications” of the basic experiment is a sequence of independent random variables X = (X_1, X_2, …)

each with the same probability distribution as X. This is fundamental to the very concept of probability, as expressed in the law of
large numbers. From a statistical point of view, suppose that we have a population of objects and a vector of measurements X of
interest for the objects in the sample. The sequence X above corresponds to sampling from the distribution of X; that is, X_i is the

vector of measurements for the ith object drawn from the sample. When we sample from a finite population, sampling with
replacement generates independent random variables while sampling without replacement generates dependent random variables.

Conditional Independence and Conditional Probability


As noted at the beginning of our discussion, independence of events or random variables depends on the underlying probability
measure. Thus, suppose that B is an event with positive probability. A collection of events or a collection of random variables is
conditionally independent given B if the collection is independent relative to the conditional probability measure A ↦ P(A ∣ B) .
For example, a collection of events {A_i : i ∈ I} is conditionally independent given B if for every finite J ⊆ I,
P(⋂_{j∈J} A_j ∣ B) = ∏_{j∈J} P(A_j ∣ B)   (2.5.13)

Note that the definitions and theorems of this section would still be true, but with all probabilities conditioned on B .
Conversely, conditional probability has a nice interpretation in terms of independent replications of the experiment. Thus, suppose
that we start with a basic experiment with S as the set of outcomes. We let X denote the outcome random variable, so that
mathematically X is simply the identity function on S . In particular, if A is an event then trivially, P(X ∈ A) = P(A) . Suppose
now that we replicate the experiment independently. This results in a new, compound experiment with a sequence of independent
random variables (X_1, X_2, …), each with the same distribution as X. That is, X_i is the outcome of the ith repetition of the

experiment.

Suppose now that A and B are events in the basic experiment with P(B) > 0 . In the compound experiment, the event that
“when B occurs for the first time, A also occurs” has probability
P(A ∩ B)/P(B) = P(A ∣ B)   (2.5.14)

Proof
Heuristic Argument

Suppose that A and B are disjoint events in a basic experiment with P(A) > 0 and P(B) > 0 . In the compound experiment
obtained by replicating the basic experiment, the event that “A occurs before B ” has probability
P(A)/[P(A) + P(B)]   (2.5.18)

Proof

Examples and Applications
Basic Rules

Suppose that A , B , and C are independent events in an experiment with P(A) = 0.3 , P(B) = 0.4 , and P(C ) = 0.8 . Express
each of the following events in set notation and find its probability:
1. All three events occur.
2. None of the three events occurs.
3. At least one of the three events occurs.
4. At least one of the three events does not occur.
5. Exactly one of the three events occurs.
6. Exactly two of the three events occurs.
Answer

Suppose that A, B, and C are independent events for an experiment with P(A) = 1/3, P(B) = 1/4, and P(C) = 1/5. Find the
probability of each of the following events:
1. (A ∩ B) ∪ C
2. A ∪ B ∪ C^c
3. (A^c ∩ B^c) ∪ C^c

Answer

Simple Populations
A small company has 100 employees; 40 are men and 60 are women. There are 6 male executives. How many female
executives should there be if gender and rank are independent? The underlying experiment is to choose an employee at
random.
Answer

Suppose that a farm has four orchards that produce peaches, and that peaches are classified by size as small, medium, and
large. The table below gives total number of peaches in a recent harvest by orchard and by size. Fill in the body of the table
with counts for the various intersections, so that orchard and size are independent variables. The underlying experiment is to
select a peach at random from the farm.

Frequency        Small   Medium   Large   Total
Orchard 1                                 400
Orchard 2                                 600
Orchard 3                                 300
Orchard 4                                 700
Total            400     1000     600     2000

Answer

Note from the last two exercises that you cannot “see” independence in a Venn diagram. Again, independence is a measure-
theoretic concept, not a set-theoretic concept.

Bernoulli Trials
A Bernoulli trials sequence is a sequence X = (X_1, X_2, …) of independent, identically distributed indicator variables. Random
variable X_i is the outcome of trial i, where in the usual terminology of reliability theory, 1 denotes success and 0 denotes failure.
The canonical example is the sequence of scores when a coin (not necessarily fair) is tossed repeatedly. Another basic example
arises whenever we start with a basic experiment and an event A of interest, and then repeat the experiment. In this setting, X_i is
the indicator variable for event A on the ith run of the experiment. The Bernoulli trials process is named for Jacob Bernoulli, and
has a single basic parameter p = P(X_i = 1). This random process is studied in detail in the chapter on Bernoulli trials.

For (x_1, x_2, …, x_n) ∈ {0, 1}^n,
P(X_1 = x_1, X_2 = x_2, …, X_n = x_n) = p^{x_1 + x_2 + ⋯ + x_n} (1 − p)^{n − (x_1 + x_2 + ⋯ + x_n)}   (2.5.19)

Proof

Note that the sequence of indicator random variables X is exchangeable. That is, if the sequence (x_1, x_2, …, x_n) in the previous

result is permuted, the probability does not change. On the other hand, there are exchangeable sequences of indicator random
variables that are dependent, as Pólya's urn model so dramatically illustrates.

Let Y denote the number of successes in the first n trials. Then
P(Y = y) = (n choose y) p^y (1 − p)^{n−y},  y ∈ {0, 1, …, n}   (2.5.20)

Proof

The distribution of Y is called the binomial distribution with parameters n and p. The binomial distribution is studied in more
detail in the chapter on Bernoulli Trials.
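
Equation (2.5.20) transcribes directly into Python using math.comb for the binomial coefficient; summing the pmf over y is a quick sanity check.

from math import comb

def binomial_pmf(n, p, y):
    # P(Y = y): y successes in n independent trials, per equation (2.5.20)
    return comb(n, y) * p**y * (1 - p)**(n - y)

for y in range(6):
    print(y, binomial_pmf(5, 0.5, y))                  # fair coin, 5 tosses
print(sum(binomial_pmf(5, 0.5, y) for y in range(6)))  # sums to 1.0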
More generally, a multinomial trials sequence is a sequence X = (X_1, X_2, …) of independent, identically distributed random

variables, each taking values in a finite set S . The canonical example is the sequence of scores when a k -sided die (not necessarily
fair) is thrown repeatedly. Multinomial trials are also studied in detail in the chapter on Bernoulli trials.

Cards

Consider the experiment that consists of dealing 2 cards at random from a standard deck and recording the sequence of cards
dealt. For i ∈ {1, 2}, let Q_i be the event that card i is a queen and H_i the event that card i is a heart. Compute the appropriate
probabilities to verify the following results. Reflect on these results.
1. Q_1 and H_1 are independent.
2. Q_2 and H_2 are independent.
3. Q_1 and Q_2 are negatively correlated.
4. H_1 and H_2 are negatively correlated.
5. Q_1 and H_2 are independent.
6. H_1 and Q_2 are independent.
Answer

In the card experiment, set n = 2 . Run the simulation 500 times. For each pair of events in the previous exercise, compute the
product of the empirical probabilities and the empirical probability of the intersection. Compare the results.

Dice
The following exercise gives three events that are pairwise independent, but not (mutually) independent.

Consider the dice experiment that consists of rolling 2 standard, fair dice and recording the sequence of scores. Let A denote
the event that first score is 3, B the event that the second score is 4, and C the event that the sum of the scores is 7. Then
1. A , B , C are pairwise independent.
2. A ∩ B implies (is a subset of) C and hence these events are dependent in the strongest possible sense.
Answer
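
Since the sample space has only 36 equally likely outcomes, the claims in the last exercise can be verified by brute-force enumeration. A short Python sketch with exact rational arithmetic:

from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))   # 36 equally likely pairs

def prob(event):
    return Fraction(sum(1 for x, y in outcomes if event(x, y)), 36)

A = lambda x, y: x == 3          # first score is 3
B = lambda x, y: y == 4          # second score is 4
C = lambda x, y: x + y == 7      # sum is 7

pairwise = all([
    prob(lambda x, y: A(x, y) and B(x, y)) == prob(A) * prob(B),
    prob(lambda x, y: A(x, y) and C(x, y)) == prob(A) * prob(C),
    prob(lambda x, y: B(x, y) and C(x, y)) == prob(B) * prob(C),
])
triple = prob(lambda x, y: A(x, y) and B(x, y) and C(x, y)) == prob(A) * prob(B) * prob(C)
print(pairwise, triple)   # True False: pairwise but not mutually independent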

In the dice experiment, set n = 2 . Run the experiment 500 times. For each pair of events in the previous exercise, compute the
product of the empirical probabilities and the empirical probability of the intersection. Compare the results.

The following exercise gives an example of three events with the property that the probability of the intersection is the product of
the probabilities, but the events are not pairwise independent.

Suppose that we throw a standard, fair die one time. Let A = {1, 2, 3, 4}, B = C = {4, 5, 6} . Then
1. P(A ∩ B ∩ C ) = P(A)P(B)P(C ) .

2. B and C are the same event, and hence are dependent in the strongest possible sense.
Answer

Suppose that a standard, fair die is thrown 4 times. Find the probability of the following events.
1. Six does not occur.
2. Six occurs at least once.
3. The sum of the first two scores is 5 and the sum of the last two scores is 7.
Answer

Suppose that a pair of standard, fair dice are thrown 8 times. Find the probability of each of the following events.
1. Double six does not occur.
2. Double six occurs at least once.
3. Double six does not occur on the first 4 throws but occurs at least once in the last 4 throws.
Answer

Consider the dice experiment that consists of rolling n, k-sided dice and recording the sequence of scores
X = (X_1, X_2, …, X_n). The following conditions are equivalent (and correspond to the assumption that the dice are fair):
1. X is uniformly distributed on {1, 2, …, k}^n.
2. X is a sequence of independent variables, and X_i is uniformly distributed on {1, 2, …, k} for each i.

Proof

A pair of standard, fair dice are thrown repeatedly. Find the probability of each of the following events.
1. A sum of 4 occurs before a sum of 7.
2. A sum of 5 occurs before a sum of 7.
3. A sum of 6 occurs before a sum of 7.
4. When a sum of 8 occurs the first time, it occurs “the hard way” as (4, 4).
Answer

Problems of the type in the last exercise are important in the game of craps. Craps is studied in more detail in the chapter on Games
of Chance.
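
The "A occurs before B" result from earlier in this section makes these computations routine, and a simulation confirms them. A Python sketch for part (a), where the exact value is P(A)/(P(A) + P(B)) = (3/36)/(3/36 + 6/36) = 1/3:

import random
from fractions import Fraction

# Exact value: P(A)/(P(A) + P(B)) with A = {sum 4}, B = {sum 7}
print(Fraction(3, 3 + 6))   # 1/3

# Simulation: throw a pair of dice until a sum of 4 or 7 appears
random.seed(1)
wins, trials = 0, 100_000
for _ in range(trials):
    while True:
        s = random.randint(1, 6) + random.randint(1, 6)
        if s == 4:
            wins += 1
            break
        if s == 7:
            break
print(wins / trials)   # close to 1/3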

Coins

A biased coin with probability of heads 1/3 is tossed 5 times. Let X denote the outcome of the tosses (encoded as a bit string)
and let Y denote the number of heads. Find each of the following:
1. P(X = x) for each x ∈ {0, 1}^5.

2. P(Y = y) for each y ∈ {0, 1, 2, 3, 4, 5}.


3. P(1 ≤ Y ≤ 3)
Answer

A box contains a fair coin and a two-headed coin. A coin is chosen at random from the box and tossed repeatedly. Let F
denote the event that the fair coin is chosen, and let H_i denote the event that the ith toss results in heads. Then
1. (H_1, H_2, …) are conditionally independent given F, with P(H_i ∣ F) = 1/2 for each i.
2. (H_1, H_2, …) are conditionally independent given F^c, with P(H_i ∣ F^c) = 1 for each i.
3. P(H_i) = 3/4 for each i.
4. P(H_1 ∩ H_2 ∩ ⋯ ∩ H_n) = 1/2^{n+1} + 1/2.
5. (H_1, H_2, …) are dependent.
6. P(F ∣ H_1 ∩ H_2 ∩ ⋯ ∩ H_n) = 1/(2^n + 1).
7. P(F ∣ H_1 ∩ H_2 ∩ ⋯ ∩ H_n) → 0 as n → ∞.

Proof
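
Items 4 and 6 are easy to verify with exact rational arithmetic. A minimal Python sketch, conditioning on which coin was chosen:

from fractions import Fraction

def posterior_fair(n):
    # P(F | H_1 ∩ ... ∩ H_n): fair coin given n heads in a row
    num = Fraction(1, 2) * Fraction(1, 2) ** n   # fair coin, n heads
    den = num + Fraction(1, 2)                   # two-headed coin always heads
    return num / den

for n in [1, 2, 5, 10]:
    print(n, posterior_fair(n), Fraction(1, 2**n + 1))   # the two columns agree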

Consider again the box in the previous exercise, but we change the experiment as follows: a coin is chosen at random from the
box and tossed and the result recorded. The coin is returned to the box and the process is repeated. As before, let H_i denote the
event that toss i results in heads. Then
1. (H_1, H_2, …) are independent.
2. P(H_i) = 3/4 for each i.
3. P(H_1 ∩ H_2 ∩ ⋯ ∩ H_n) = (3/4)^n.

Proof

Think carefully about the results in the previous two exercises, and the differences between the two models. Tossing a coin
produces independent random variables if the probability of heads is fixed (that is, non-random even if unknown). Tossing a coin
with a random probability of heads generally does not produce independent random variables; the result of a toss gives information
about the probability of heads which in turn gives information about subsequent tosses.

Uniform Distributions

Recall that Buffon's coin experiment consists of tossing a coin with radius r ≤ 1/2 randomly on a floor covered with square tiles
of side length 1. The coordinates (X, Y) of the center of the coin are recorded relative to axes through the center of the square
in which the coin lands. The following conditions are equivalent:
1. (X, Y) is uniformly distributed on [−1/2, 1/2]^2.
2. X and Y are independent and each is uniformly distributed on [−1/2, 1/2].
Figure 2.5.1: Buffon's coin experiment

Proof

Compare this result with the result above for fair dice.

In Buffon's coin experiment, set r = 0.3. Run the simulation 500 times. For the events {X > 0} and {Y < 0} , compute the
product of the empirical probabilities and the empirical probability of the intersection. Compare the results.

The arrival time X of the A train is uniformly distributed on the interval (0, 30), while the arrival time Y of the B train is
uniformly distributed on the interval (15, 30). (The arrival times are in minutes, after 8:00 AM). Moreover, the arrival times
are independent. Find the probability of each of the following events:
1. The A train arrives first.
2. Both trains arrive sometime after 20 minutes.
Answer

Reliability
Recall the simple model of structural reliability in which a system is composed of n components. Suppose in addition that the
components operate independently of each other. As before, let X_i denote the state of component i, where 1 means working and 0
means failure. Thus, our basic assumption is that the state vector X = (X_1, X_2, …, X_n) is a sequence of independent indicator
random variables. We assume that the state of the system (either working or failed) depends only on the states of the components.
Thus, the state of the system is an indicator random variable
Y = y(X_1, X_2, …, X_n)   (2.5.25)
where y : {0, 1}^n → {0, 1} is the structure function. Generally, the probability that a device is working is the reliability of the
device. Thus, we will denote the reliability of component i by p_i = P(X_i = 1) so that the vector of component reliabilities is
p = (p_1, p_2, …, p_n). By independence, the system reliability r is a function of the component reliabilities:
r(p_1, p_2, …, p_n) = P(Y = 1)   (2.5.26)

Appropriately enough, this function is known as the reliability function. Our challenge is usually to find the reliability function,
given the structure function. When the components all have the same probability p then of course the system reliability r is just a
function of p. In this case, the state vector X = (X , X , … , X ) forms a sequence of Bernoulli trials.
1 2 n

Comment on the independence assumption for real systems, such as your car or your computer.

Recall that a series system is working if and only if each component is working.
1. The state of the system is U = X_1 X_2 ⋯ X_n = min{X_1, X_2, …, X_n}.
2. The reliability is P(U = 1) = p_1 p_2 ⋯ p_n.

Recall that a parallel system is working if and only if at least one component is working.
1. The state of the system is V = 1 − (1 − X_1)(1 − X_2) ⋯ (1 − X_n) = max{X_1, X_2, …, X_n}.
2. The reliability is P(V = 1) = 1 − (1 − p_1)(1 − p_2) ⋯ (1 − p_n).

Recall that a k out of n system is working if and only if at least k of the n components are working. Thus, a parallel system is a 1
out of n system and a series system is an n out of n system. A k out of 2k − 1 system is a majority rules system. The reliability
function of a general k out of n system is a mess. However, if the component reliabilities are the same, the function has a
reasonably simple form.

For a k out of n system with common component reliability p, the system reliability is

r(p) = ∑_{i=k}^{n} \binom{n}{i} p^i (1 − p)^{n−i} (2.5.27)
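
The formula is easy to evaluate numerically. Below is an added Python sketch (the helper name reliability is assumed for illustration); the series and parallel systems are recovered as the n out of n and 1 out of n special cases.

# An added sketch of the k out of n reliability function with common
# component reliability p.
from math import comb

def reliability(k, n, p):
    # P(at least k of the n independent components work)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p = 0.9
print(reliability(3, 3, p))  # series system: p^3 = 0.729
print(reliability(1, 3, p))  # parallel system: 1 - (1 - p)^3 = 0.999
print(reliability(2, 3, p))  # 2 out of 3 (majority rules): 0.972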

Consider a system of 3 independent components with common reliability p = 0.8 . Find the reliability of each of the following:
1. The parallel system.
2. The 2 out of 3 system.
3. The series system.
Answer

Consider a system of 3 independent components with reliabilities p_1 = 0.8, p_2 = 0.8, p_3 = 0.7. Find the reliability of each of the following:
1. The parallel system.
2. The 2 out of 3 system.
3. The series system.
Answer

Consider an airplane with an odd number of engines, each with reliability p. Suppose that the airplane is a majority rules
system, so that the airplane needs a majority of working engines in order to fly.
1. Find the reliability of a 3 engine plane as a function of p.
2. Find the reliability of a 5 engine plane as a function of p.
3. For what values of p is a 5 engine plane preferable to a 3 engine plane?
Answer

The graph below is known as the Wheatstone bridge network and is named for Charles Wheatstone. The edges represent
components, and the system works if and only if there is a working path from vertex a to vertex b .

1. Find the structure function.
2. Find the reliability function.

Figure 2.5.2: The Wheatstone bridge network

Answer

A system consists of 3 components, connected in parallel. Because of environmental factors, the components do not operate
independently, so our usual assumption does not hold. However, we will assume that under low stress conditions, the
components are independent, each with reliability 0.9; under medium stress conditions, the components are independent with
reliability 0.8; and under high stress conditions, the components are independent, each with reliability 0.7. The probability of
low stress is 0.5, of medium stress is 0.3, and of high stress is 0.2.
1. Find the reliability of the system.
2. Given that the system works, find the conditional probability of each stress level.
Answer
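
An added sketch of the computation in this exercise: condition on the stress level to get the total probability that the system works, then invert with Bayes' theorem.

# An added sketch for the stress-level exercise: total probability and Bayes.
levels = {"low": (0.5, 0.9), "medium": (0.3, 0.8), "high": (0.2, 0.7)}

def parallel3(p):
    return 1 - (1 - p)**3  # reliability of 3 independent components in parallel

works = sum(w * parallel3(p) for w, p in levels.values())  # P(system works)
posterior = {s: w * parallel3(p) / works for s, (w, p) in levels.items()}
print(works)      # weighted average of the three conditional reliabilities
print(posterior)  # conditional probability of each stress level, given the system works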

Suppose that bits are transmitted across a noisy communications channel. Each bit that is sent, independently of the others, is received correctly with probability 0.9 and changed to the complementary bit with probability 0.1. Using redundancy to improve reliability, suppose that a given bit will be sent 3 times. We naturally want to compute the probability that we correctly identify the bit that was sent. Assume we have no prior knowledge of the bit, so we assign probability 1/2 each to the event that 000 was sent and the event that 111 was sent. Now find the conditional probability that 111 was sent given each of the 8 possible bit strings received.
Answer
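
An added computational sketch of this exercise: for each received string, Bayes' theorem with the uniform prior gives the posterior probability that 111 was sent.

# An added sketch for the noisy channel exercise: posterior probability that
# 111 was sent, given each of the 8 possible received strings.
from itertools import product

def likelihood(received, sent_bit, flip=0.1):
    # probability of the received string, given that sent_bit was sent 3 times
    prob = 1.0
    for b in received:
        prob *= (1 - flip) if b == sent_bit else flip
    return prob

for r in product((0, 1), repeat=3):
    num = 0.5 * likelihood(r, 1)        # P(111 sent) P(received | 111)
    den = num + 0.5 * likelihood(r, 0)  # total probability of the received string
    print("".join(map(str, r)), num / den)  # P(111 sent | received)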

Diagnostic Testing
Recall the discussion of diagnostic testing in the section on Conditional Probability. Thus, we have an event A for a random experiment whose occurrence or non-occurrence we cannot observe directly. Suppose now that we have n tests for the occurrence of A, labeled from 1 to n. We will let T_i denote the event that test i is positive for A. The tests are independent in the following sense:

If A occurs, then (T_1, T_2, …, T_n) are (conditionally) independent and test i has sensitivity a_i = P(T_i ∣ A).
If A does not occur, then (T_1, T_2, …, T_n) are (conditionally) independent and test i has specificity b_i = P(T_i^c ∣ A^c).

Note that unconditionally, it is not reasonable to assume that the tests are independent. For example, a positive result for a given test presumably is evidence that the condition A has occurred, which in turn is evidence that a subsequent test will be positive. In short, we expect that T_i and T_j should be positively correlated.

We can form a new, compound test by giving a decision rule in terms of the individual test results. In other words, the event T that the compound test is positive for A is a function of (T_1, T_2, …, T_n). The typical decision rules are very similar to the reliability structures discussed above. A special case of interest is when the n tests are independent applications of a given basic test. In this case, a_i = a and b_i = b for each i.

Consider the compound test that is positive for A if and only if each of the n tests is positive for A.

1. T = T_1 ∩ T_2 ∩ ⋯ ∩ T_n
2. The sensitivity is P(T ∣ A) = a_1 a_2 ⋯ a_n.
3. The specificity is P(T^c ∣ A^c) = 1 − (1 − b_1)(1 − b_2) ⋯ (1 − b_n).

Consider the compound test that is positive for A if and only if at least one of the n tests is positive for A.

1. T = T_1 ∪ T_2 ∪ ⋯ ∪ T_n
2. The sensitivity is P(T ∣ A) = 1 − (1 − a_1)(1 − a_2) ⋯ (1 − a_n).
3. The specificity is P(T^c ∣ A^c) = b_1 b_2 ⋯ b_n.

More generally, we could define the compound k out of n test that is positive for A if and only if at least k of the individual tests
are positive for A . The series test is the n out of n test, while the parallel test is the 1 out of n test. The k out of 2k − 1 test is the
majority rules test.
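
By this analogy with reliability structures, the sensitivity and specificity of a compound k out of n test built from independent, identical basic tests can be computed just like k out of n reliability. Here is an added sketch (helper names assumed):

# An added sketch: sensitivity and specificity of the compound k out of n test
# built from n independent applications of a basic test with sensitivity a and
# specificity b.
from math import comb

def at_least(k, n, p):
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def compound(k, n, a, b):
    sensitivity = at_least(k, n, a)          # at least k positives, given A
    specificity = 1 - at_least(k, n, 1 - b)  # fewer than k positives, given A^c
    return sensitivity, specificity

print(compound(2, 3, 0.95, 0.90))  # the majority rules test with 3 applications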

Suppose that a woman initially believes that there is an even chance that she is or is not pregnant. She buys three identical
pregnancy tests with sensitivity 0.95 and specificity 0.90. Tests 1 and 3 are positive and test 2 is negative.
1. Find the updated probability that the woman is pregnant.
2. Can we just say that tests 2 and 3 cancel each other out? Find the probability that the woman is pregnant given just one
positive test, and compare the answer with the answer to part (a).
Answer

Suppose that 3 independent, identical tests for an event A are applied, each with sensitivity a and specificity b . Find the
sensitivity and specificity of the following tests:
1. 1 out of 3 test
2. 2 out of 3 test
3. 3 out of 3 test
Answer

In a criminal trial, the defendant is convicted if and only if all 6 jurors vote guilty. Assume that if the defendant really is guilty,
the jurors vote guilty, independently, with probability 0.95, while if the defendant is really innocent, the jurors vote not guilty,
independently with probability 0.8. Suppose that 70% of defendants brought to trial are guilty.
1. Find the probability that the defendant is convicted.
2. Given that the defendant is convicted, find the probability that the defendant is guilty.
3. Comment on the assumption that the jurors act independently.
Answer

Genetics
Please refer to the discussion of genetics in the section on random experiments if you need to review some of the definitions in this
section.
Recall first that the ABO blood type in humans is determined by three alleles: a , b , and o . Furthermore, a and b are co-dominant
and o is recessive. Suppose that in a certain population, the proportion of a , b , and o alleles are p, q, and r respectively. Of course
we must have p > 0 , q > 0 , r > 0 and p + q + r = 1 .

Suppose that the blood genotype in a person is the result of independent alleles, chosen with probabilities p, q, and r as above.

1. The probability distribution of the genotypes is given in the following table:

Genotype     aa     ab     ao     bb     bo     oo
Probability  p^2    2pq    2pr    q^2    2qr    r^2

2. The probability distribution of the blood types is given in the following table:

Blood type   A           B           AB     O
Probability  p^2 + 2pr   q^2 + 2qr   2pq    r^2

Proof

The discussion above is related to the Hardy-Weinberg model of genetics. The model is named for the English mathematician Godfrey Hardy and the German physician Wilhelm Weinberg.
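
As an added sketch, the blood type distribution above is a one-line computation from assumed allele proportions p, q, r:

# An added sketch of the Hardy-Weinberg computation: blood type distribution
# from allele proportions p, q, r with p + q + r = 1 (values chosen arbitrarily).
def blood_types(p, q, r):
    return {"A": p**2 + 2*p*r, "B": q**2 + 2*q*r, "AB": 2*p*q, "O": r**2}

print(blood_types(0.3, 0.1, 0.6))  # {'A': 0.45, 'B': 0.13, 'AB': 0.06, 'O': 0.36}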

Suppose that the probability distribution for the set of blood types in a certain population is given in the following table:

Blood type   A       B       AB      O
Probability  0.360   0.123   0.038   0.479

Find p, q, and r.
Answer

Suppose next that pod color in a certain type of pea plant is determined by a gene with two alleles: g for green and y for yellow, and that g is dominant and y recessive.

Suppose that 2 green-pod plants are bred together. Suppose further that each plant, independently, has the recessive yellow-pod
allele with probability . 1

1. Find the probability that 3 offspring plants will have green pods.
2. Given that the 3 offspring plants have green pods, find the updated probability that both parents have the recessive allele.
Answer

Next consider a sex-linked hereditary disorder in humans (such as colorblindness or hemophilia). Let h denote the healthy allele
and d the defective allele for the gene linked to the disorder. Recall that h is dominant and d recessive for women.

Suppose that a healthy woman initially has a 1/2 chance of being a carrier. (This would be the case, for example, if her mother and father are healthy but she has a brother with the disorder, so that her mother must be a carrier.)
1. Find the probability that the first two sons of the women will be healthy.
2. Given that the first two sons are healthy, compute the updated probability that she is a carrier.
3. Given that the first two sons are healthy, compute the conditional probability that the third son will be healthy.
Answer

Laplace's Rule of Succession

Suppose that we have m + 1 coins, labeled 0, 1, …, m. Coin i lands heads with probability i/m for each i. The experiment is to choose a coin at random (so that each coin is equally likely to be chosen) and then toss the chosen coin repeatedly.

1. The probability that the first n tosses are all heads is p_{m,n} = (1/(m + 1)) ∑_{i=0}^{m} (i/m)^n.
2. p_{m,n} → 1/(n + 1) as m → ∞.
3. The conditional probability that toss n + 1 is heads given that the previous n tosses were all heads is p_{m,n+1}/p_{m,n}.
4. p_{m,n+1}/p_{m,n} → (n + 1)/(n + 2) as m → ∞.

Proof

Note that coin 0 is two-tailed, the probability of heads increases with i, and coin m is two-headed. The limiting conditional probability in part (d) is called Laplace's Rule of Succession, named after Pierre-Simon Laplace. This rule was used by Laplace and others as a general principle for estimating the conditional probability that an event will occur at time n + 1, given that the event has occurred n times in succession.
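
An added numerical sketch checking the limits in parts (b) and (d):

# Checks that p(m, n) -> 1/(n + 1) and p(m, n + 1)/p(m, n) -> (n + 1)/(n + 2)
# as m grows.
def p(m, n):
    return sum((i / m) ** n for i in range(m + 1)) / (m + 1)

m, n = 10_000, 10
print(p(m, n), 1 / (n + 1))                      # both near 0.0909
print(p(m, n + 1) / p(m, n), (n + 1) / (n + 2))  # both near 11/12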

Suppose that a missile has had 10 successful tests in a row. Compute Laplace's estimate that the 11th test will be successful.
Does this make sense?
Answer

This page titled 2.5: Independence is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

2.6: Convergence

This is the first of several sections in this chapter that are more advanced than the basic topics in the first five sections. In this
section we discuss several topics related to convergence of events and random variables, a subject of fundamental importance in
probability theory. In particular the results that we obtain will be important for:
Properties of distribution functions,
The weak law of large numbers,
The strong law of large numbers.
As usual, our starting point is a random experiment modeled by a probability space (Ω, F , P). So to review, Ω is the set of
outcomes, F the σ-algebra of events, and P the probability measure on the sample space (Ω, F ).

Basic Theory
Sequences of events
Our first discussion deals with sequences of events and various types of limits of such sequences. The limits are also events. We start with two simple definitions.

Suppose that (A_1, A_2, …) is a sequence of events.

1. The sequence is increasing if A_n ⊆ A_{n+1} for every n ∈ N_+.
2. The sequence is decreasing if A_{n+1} ⊆ A_n for every n ∈ N_+.

Note that these are the standard definitions of increasing and decreasing, relative to the ordinary total order ≤ on the index set N_+ and the subset partial order ⊆ on the collection of events. The terminology is also justified by the corresponding indicator variables.

Suppose that (A_1, A_2, …) is a sequence of events, and let I_n = 1_{A_n} denote the indicator variable of the event A_n for n ∈ N_+.

1. The sequence of events is increasing if and only if the sequence of indicator variables is increasing in the ordinary sense. That is, I_n ≤ I_{n+1} for each n ∈ N_+.
2. The sequence of events is decreasing if and only if the sequence of indicator variables is decreasing in the ordinary sense. That is, I_{n+1} ≤ I_n for each n ∈ N_+.

Proof

Figure 2.6.1: A sequence of increasing events and their union

Figure 2.6.2: A sequence of decreasing events and their intersection

If a sequence of events is either increasing or decreasing, we can define the limit of the sequence in a way that turns out to be quite
natural.

Suppose that (A_1, A_2, …) is a sequence of events.

1. If the sequence is increasing, we define lim_{n→∞} A_n = ⋃_{n=1}^∞ A_n.
2. If the sequence is decreasing, we define lim_{n→∞} A_n = ⋂_{n=1}^∞ A_n.

Once again, the terminology is clarified by the corresponding indicator variables.

Suppose again that (A_1, A_2, …) is a sequence of events, and let I_n = 1_{A_n} denote the indicator variable of A_n for n ∈ N_+.

1. If the sequence of events is increasing, then lim_{n→∞} I_n is the indicator variable of ⋃_{n=1}^∞ A_n.
2. If the sequence of events is decreasing, then lim_{n→∞} I_n is the indicator variable of ⋂_{n=1}^∞ A_n.

Proof

An arbitrary union of events can always be written as a union of increasing events, and an arbitrary intersection of events can
always be written as an intersection of decreasing events:

Suppose that (A_1, A_2, …) is a sequence of events. Then

1. ⋃_{i=1}^n A_i is increasing in n ∈ N_+ and ⋃_{i=1}^∞ A_i = lim_{n→∞} ⋃_{i=1}^n A_i.
2. ⋂_{i=1}^n A_i is decreasing in n ∈ N_+ and ⋂_{i=1}^∞ A_i = lim_{n→∞} ⋂_{i=1}^n A_i.
Proof

There is a more interesting and useful way to generate increasing and decreasing sequences from an arbitrary sequence of events,
using the tail segment of the sequence rather than the initial segment.

Suppose that (A_1, A_2, …) is a sequence of events. Then

1. ⋃_{i=n}^∞ A_i is decreasing in n ∈ N_+.
2. ⋂_{i=n}^∞ A_i is increasing in n ∈ N_+.

Proof

Since the new sequences defined in the previous results are decreasing and increasing, respectively, we can take their limits. These
are the limit superior and limit inferior, respectively, of the original sequence.

Suppose that (A_1, A_2, …) is a sequence of events. Define

1. lim sup_{n→∞} A_n = lim_{n→∞} ⋃_{i=n}^∞ A_i = ⋂_{n=1}^∞ ⋃_{i=n}^∞ A_i. This is the event that occurs if and only if A_n occurs for infinitely many values of n.
2. lim inf_{n→∞} A_n = lim_{n→∞} ⋂_{i=n}^∞ A_i = ⋃_{n=1}^∞ ⋂_{i=n}^∞ A_i. This is the event that occurs if and only if A_n occurs for all but finitely many values of n.


Proof

Once again, the terminology and notation are clarified by the corresponding indicator variables. You may need to review limit
inferior and limit superior for sequences of real numbers in the section on Partial Orders.

Suppose that (A_1, A_2, …) is a sequence of events, and let I_n = 1_{A_n} denote the indicator variable of A_n for n ∈ N_+. Then

1. lim sup_{n→∞} I_n is the indicator variable of lim sup_{n→∞} A_n.
2. lim inf_{n→∞} I_n is the indicator variable of lim inf_{n→∞} A_n.
Proof

Suppose that (A_1, A_2, …) is a sequence of events. Then lim inf_{n→∞} A_n ⊆ lim sup_{n→∞} A_n.
Proof

Suppose that (A_1, A_2, …) is a sequence of events. Then

1. (lim sup_{n→∞} A_n)^c = lim inf_{n→∞} A_n^c.
2. (lim inf_{n→∞} A_n)^c = lim sup_{n→∞} A_n^c.
Proof

The Continuity Theorems
Generally speaking, a function is continuous if it preserves limits. Thus, the following results are the continuity theorems of
probability. Part (a) is the continuity theorem for increasing events and part (b) the continuity theorem for decreasing events.

Suppose that (A_1, A_2, …) is a sequence of events.

1. If the sequence is increasing then lim_{n→∞} P(A_n) = P(lim_{n→∞} A_n) = P(⋃_{n=1}^∞ A_n).
2. If the sequence is decreasing then lim_{n→∞} P(A_n) = P(lim_{n→∞} A_n) = P(⋂_{n=1}^∞ A_n).

Proof

The continuity theorems can be applied to the increasing and decreasing sequences that we constructed earlier from an arbitrary
sequence of events.

Suppose that (A_1, A_2, …) is a sequence of events.

1. P(⋃_{i=1}^∞ A_i) = lim_{n→∞} P(⋃_{i=1}^n A_i)
2. P(⋂_{i=1}^∞ A_i) = lim_{n→∞} P(⋂_{i=1}^n A_i)

Proof

Suppose that (A_1, A_2, …) is a sequence of events. Then

1. P(lim sup_{n→∞} A_n) = lim_{n→∞} P(⋃_{i=n}^∞ A_i)
2. P(lim inf_{n→∞} A_n) = lim_{n→∞} P(⋂_{i=n}^∞ A_i)

Proof

The next result shows that the countable additivity axiom for a probability measure is equivalent to finite additivity and the
continuity property for increasing events.

Temporarily, suppose that P is only finitely additive, but satisfies the continuity property for increasing events. Then P is
countably additive.
Proof

There are a few mathematicians who reject the countable additivity axiom of probability measure in favor of the weaker finite
additivity axiom. Whatever the philosophical arguments may be, life is certainly much harder without the continuity theorems.

The Borel-Cantelli Lemmas


The Borel-Cantelli Lemmas, named after Émile Borel and Francesco Cantelli, are very important tools in probability theory. The first lemma gives a condition that is sufficient to conclude that infinitely many events occur with probability 0.

First Borel-Cantelli Lemma. Suppose that (A_1, A_2, …) is a sequence of events. If ∑_{n=1}^∞ P(A_n) < ∞ then P(lim sup_{n→∞} A_n) = 0.

Proof

The second lemma gives a condition that is sufficient to conclude that infinitely many independent events occur with probability 1.

Second Borel-Cantelli Lemma. Suppose that (A_1, A_2, …) is a sequence of independent events. If ∑_{n=1}^∞ P(A_n) = ∞ then P(lim sup_{n→∞} A_n) = 1.

Proof

For independent events, both Borel-Cantelli lemmas apply of course, and lead to a zero-one law.

If (A_1, A_2, …) is a sequence of independent events then lim sup_{n→∞} A_n has probability 0 or 1:

1. If ∑_{n=1}^∞ P(A_n) < ∞ then P(lim sup_{n→∞} A_n) = 0.
2. If ∑_{n=1}^∞ P(A_n) = ∞ then P(lim sup_{n→∞} A_n) = 1.

This result is actually a special case of a more general zero-one law, known as the Kolmogorov zero-one law, and named for Andrei Kolmogorov. This law is studied in the more advanced section on measure. Also, we can use the zero-one law to derive a calculus theorem that relates infinite series and infinite products. This derivation is an example of the probabilistic method: the use of probability to obtain results, seemingly unrelated to probability, in other areas of mathematics.

Suppose that p_i ∈ (0, 1) for each i ∈ N_+. Then

∏_{i=1}^∞ p_i > 0 if and only if ∑_{i=1}^∞ (1 − p_i) < ∞ (2.6.6)

Proof

Our next result is a simple application of the second Borel-Cantelli lemma to independent replications of a basic experiment.

Suppose that A is an event in a basic random experiment with P(A) > 0 . In the compound experiment that consists of
independent replications of the basic experiment, the event “A occurs infinitely often” has probability 1.
Proof

Convergence of Random Variables


Our next discussion concerns two ways that a sequence of random variables defined for our experiment can “converge”. These are
fundamentally important concepts, since some of the deepest results in probability theory are limit theorems involving random
variables. The most important special case is when the random variables are real valued, but the proofs are essentially the same for
variables with values in a metric space, so we will use the more general setting.
Thus, suppose that (S, d) is a metric space, and that S is the corresponding Borel σ-algebra (that is, the σ-algebra generated by
the topology), so that our measurable space is (S, S ). Here is the most important special case:

For n ∈ N_+, the n-dimensional Euclidean space is (R^n, d_n) where

d_n(x, y) = √(∑_{i=1}^n (y_i − x_i)^2), x = (x_1, x_2, …, x_n), y = (y_1, y_2, …, y_n) ∈ R^n (2.6.7)

Euclidean spaces are named for Euclid, of course. As noted above, the one-dimensional case where d(x, y) = |y − x| for x, y ∈ R is particularly important. Returning to the general metric space, recall that if (x_1, x_2, …) is a sequence in S and x ∈ S, then x_n → x as n → ∞ means that d(x_n, x) → 0 as n → ∞ (in the usual calculus sense). For the rest of our discussion, we assume that (X_1, X_2, …) is a sequence of random variables with values in S and X is a random variable with values in S, all defined on the probability space (Ω, F, P).

We say that X_n → X as n → ∞ with probability 1 if the event that X_n → X as n → ∞ has probability 1. That is,

P{ω ∈ Ω : X_n(ω) → X(ω) as n → ∞} = 1 (2.6.8)

Details

As good probabilists, we usually suppress references to the sample space and write the definition simply as P(X_n → X as n → ∞) = 1. The statement that an event has probability 1 is usually the strongest affirmative statement that we can make in probability theory. Thus, convergence with probability 1 is the strongest form of convergence. The phrases almost surely and almost everywhere are sometimes used instead of the phrase with probability 1.

Recall that metrics d and e on S are equivalent if they generate the same topology on S. Recall also that convergence of a sequence is a topological property. That is, if (x_1, x_2, …) is a sequence in S and x ∈ S, and if d, e are equivalent metrics on S, then x_n → x as n → ∞ relative to d if and only if x_n → x as n → ∞ relative to e. So for our random variables as defined above, it follows that X_n → X as n → ∞ with probability 1 relative to d if and only if X_n → X as n → ∞ with probability 1 relative to e.

The following statements are equivalent:

1. X_n → X as n → ∞ with probability 1.
2. P[d(X_n, X) > ϵ for infinitely many n ∈ N_+] = 0 for every rational ϵ > 0.
3. P[d(X_n, X) > ϵ for infinitely many n ∈ N_+] = 0 for every ϵ > 0.
4. P[d(X_k, X) > ϵ for some k ≥ n] → 0 as n → ∞ for every ϵ > 0.

Proof

Our next result gives a fundamental criterion for convergence with probability 1:

If ∑_{n=1}^∞ P[d(X_n, X) > ϵ] < ∞ for every ϵ > 0 then X_n → X as n → ∞ with probability 1.
Proof

Here is our next mode of convergence.

We say that X_n → X as n → ∞ in probability if

P[d(X_n, X) > ϵ] → 0 as n → ∞ for each ϵ > 0 (2.6.11)

The phrase in probability sounds superficially like the phrase with probability 1. However, as we will soon see, convergence in
probability is much weaker than convergence with probability 1. Indeed, convergence with probability 1 is often called strong
convergence, while convergence in probability is often called weak convergence.

If X_n → X as n → ∞ with probability 1 then X_n → X as n → ∞ in probability.


Proof

The converse fails with a passion. A simple counterexample is given below. However, there is a partial converse that is very useful.

If X_n → X as n → ∞ in probability, then there exists a subsequence (n_1, n_2, n_3, …) of N_+ such that X_{n_k} → X as k → ∞ with probability 1.
Proof

There are two other modes of convergence that we will discuss later:
Convergence in distribution.
Convergence in mean.

Examples and Applications


Coins
Suppose that we have an infinite sequence of coins labeled 1, 2, …. Moreover, coin n has probability of heads 1/n^a for each n ∈ N_+, where a > 0 is a parameter. We toss each coin in sequence one time. In terms of a, find the probability of the following events:

1. infinitely many heads occur
2. infinitely many tails occur
2. infinitely many tails occur
Answer

The following exercise gives a simple example of a sequence of random variables that converge in probability but not with
probability 1. Naturally, we are assuming the standard metric on R.

Suppose again that we have a sequence of coins labeled 1, 2, …, and that coin n lands heads up with probability 1/n for each n. We toss the coins in order to produce a sequence (X_1, X_2, …) of independent indicator random variables with

P(X_n = 1) = 1/n, P(X_n = 0) = 1 − 1/n; n ∈ N_+ (2.6.12)

1. P(X_n = 0 for infinitely many n) = 1, so that infinitely many tails occur with probability 1.
2. P(X_n = 1 for infinitely many n) = 1, so that infinitely many heads occur with probability 1.
3. P(X_n does not converge as n → ∞) = 1.
4. X_n → 0 as n → ∞ in probability.

Proof
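
An added simulation sketch of this example: in a long finite run, 1s keep occurring at late indices (reflecting part (b)), even though P(X_n = 1) → 0.

# X_n = 1 with probability 1/n, independently, for n = 1, ..., N.
import random

N = 100_000
ones = [n for n in range(1, N + 1) if random.random() < 1 / n]
print(len(ones))  # on average about ln(N) + 0.577, roughly 12 for N = 100,000
print(ones[-5:])  # typically includes indices deep into the sequence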

Discrete Spaces
Recall that a measurable space (S, S ) is discrete if S is countable and S is the collection of all subsets of S (the power set of S ).
Moreover, S is the Borel σ-algebra corresponding to the discrete metric d on S given by d(x, x) = 0 for x ∈ S and d(x, y) = 1
for distinct x, y ∈ S . How do convergence with probability 1 and convergence in probability work for the discrete metric?

Suppose that (S, S) is a discrete space. Suppose further that (X_1, X_2, …) is a sequence of random variables with values in S and X is a random variable with values in S, all defined on the probability space (Ω, F, P). Relative to the discrete metric d,

1. X_n → X as n → ∞ with probability 1 if and only if P(X_n = X for all but finitely many n ∈ N_+) = 1.
2. X_n → X as n → ∞ in probability if and only if P(X_n ≠ X) → 0 as n → ∞.

Proof

Of course, it's important to realize that a discrete space can be the Borel space for metrics other than the discrete metric.

This page titled 2.6: Convergence is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

2.7: Measure Spaces

In this section we discuss positive measure spaces (which include probability spaces) from a more advanced point of view. The
sections on Measure Theory and Special Set Structures in the chapter on Foundations are essential prerequisites. On the other hand,
if you are not interested in the measure-theoretic aspects of probability, you can safely skip this section.

Positive Measure
Definitions
Suppose that S is a set, playing the role of a universal set for a mathematical theory. As we have noted before, S usually comes
with a σ-algebra S of admissible subsets of S , so that (S, S ) is a measurable space. In particular, this is the case for the model of
a random experiment, where S is the set of outcomes and S the σ-algebra of events, so that the measurable space (S, S ) is the
sample space of the experiment. A probability measure is a special case of a more general object known as a positive measure.

A positive measure on (S, S ) is a function μ : S → [0, ∞] that satisfies the following axioms:
1. μ(∅) = 0
2. If {A_i : i ∈ I} is a countable, pairwise disjoint collection of sets in S then

μ(⋃_{i∈I} A_i) = ∑_{i∈I} μ(A_i) (2.7.1)

The triple (S, S , μ) is a measure space.

Axiom (b) is called countable additivity, and is the essential property. The measure of a set that consists of a countable union of
disjoint pieces is the sum of the measures of the pieces. Note also that since the terms in the sum are positive, there is no issue with
the order of the terms in the sum, although of course, ∞ is a possible value.

Figure 2.7.1 : A union of four disjoint sets


So perhaps the term measurable space for (S, S ) makes a little more sense now—a measurable space is one that can have a
positive measure defined on it.

Suppose that (S, S , μ) is a measure space.


1. If μ(S) < ∞ then (S, S , μ) is a finite measure space.
2. If μ(S) = 1 then (S, S , μ) is a probability space.

So probability measures are positive measures, but positive measures are important beyond the application to probability. The standard measures on the Euclidean spaces are all positive measures: the extension of length for measurable subsets of R, the extension of area for measurable subsets of R^2, the extension of volume for measurable subsets of R^3, and the higher dimensional analogues. We will actually construct these measures in the next section on Existence and Uniqueness. In addition, counting measure # is a positive measure on the subsets of a set S. Even more general measures that can take positive and negative values are explored in the chapter on Distributions.

Properties
The following results give some simple properties of a positive measure space (S, S , μ). The proofs are essentially identical to the
proofs of the corresponding properties of probability, except that the measure of a set may be infinite so we must be careful to
avoid the dreaded indeterminate form ∞ − ∞ .

If A, B ∈ S , then μ(B) = μ(A ∩ B) + μ(B ∖ A) .

Proof

If A, B ∈ S and A ⊆ B then
1. μ(B) = μ(A) + μ(B ∖ A)
2. μ(A) ≤ μ(B)
Proof

Thus μ is an increasing function, relative to the subset partial order ⊆ on S and the ordinary order ≤ on [0, ∞]. In particular, if μ is a finite measure, then μ(A) < ∞ for every A ∈ S. Note also that if A, B ∈ S and μ(B) < ∞ then μ(B ∖ A) = μ(B) − μ(A ∩ B). In the special case that A ⊆ B, this becomes μ(B ∖ A) = μ(B) − μ(A). In particular, these results hold for a finite measure and are just like the difference rules for probability. If μ is a finite measure, then μ(A^c) = μ(S) − μ(A). This is the analogue of the complement rule in probability, but with μ(S) replacing 1.

The following result is the analogue of Boole's inequality for probability. For a general positive measure, the result is referred to as
the subadditive property.

Suppose that A_i ∈ S for i in a countable index set I. Then

μ(⋃_{i∈I} A_i) ≤ ∑_{i∈I} μ(A_i) (2.7.2)

Proof

For a union of sets with finite measure, the inclusion-exclusion formula holds, and the proof is just like the one for probability.

Suppose that A_i ∈ S for each i ∈ I where #(I) = n, and that μ(A_i) < ∞ for i ∈ I. Then

μ(⋃_{i∈I} A_i) = ∑_{k=1}^n (−1)^{k−1} ∑_{J⊆I, #(J)=k} μ(⋂_{j∈J} A_j) (2.7.4)

Proof

The continuity theorem for increasing sets holds for a positive measure. The continuity theorem for decreasing events holds also, if
the sets have finite measure. Again, the proofs are similar to the ones for a probability measure, except for considerations of infinite
measure.

Suppose that (A_1, A_2, …) is a sequence of sets in S.

1. If the sequence is increasing then μ(⋃_{i=1}^∞ A_i) = lim_{n→∞} μ(A_n).
2. If the sequence is decreasing and μ(A_1) < ∞ then μ(⋂_{i=1}^∞ A_i) = lim_{n→∞} μ(A_n).
Proof
∞ ∞
Recall that if (A_1, A_2, …) is increasing, ⋃_{i=1}^∞ A_i is denoted lim_{n→∞} A_n, and if (A_1, A_2, …) is decreasing, ⋂_{i=1}^∞ A_i is denoted lim_{n→∞} A_n. In both cases, the continuity theorem has the form μ(lim_{n→∞} A_n) = lim_{n→∞} μ(A_n). The continuity theorem for decreasing events fails without the additional assumption of finite measure. A simple counterexample is given below.

The following corollary of the inclusion-exclusion law gives a condition for countable additivity that does not require that the sets be disjoint, but only that the intersections have measure 0. The result is used below in the theorem on completion.

Suppose that A_i ∈ S for each i in a countable index set I and that μ(A_i) < ∞ for i ∈ I and μ(A_i ∩ A_j) = 0 for distinct i, j ∈ I. Then

μ(⋃_{i∈I} A_i) = ∑_{i∈I} μ(A_i) (2.7.11)

Proof

More Definitions
If a positive measure is not finite, then the following definition gives the next best thing.

The measure space (S, S, μ) is σ-finite if there exists a countable collection {A_i : i ∈ I} ⊆ S with ⋃_{i∈I} A_i = S and μ(A_i) < ∞ for each i ∈ I.

So of course, if μ is a finite measure on (S, S) then μ is σ-finite, but not conversely in general. On the other hand, for i ∈ I, let S_i = {A ∈ S : A ⊆ A_i}. Then S_i is a σ-algebra of subsets of A_i and μ restricted to S_i is a finite measure. The point of this (and the reason for the definition) is that often nice properties of finite measures can be extended to σ-finite measures. In particular, σ-finite measure spaces play a crucial role in the construction of product measure spaces, and for the completion of a measure space considered below.

Suppose that (S, S , μ) is a σ-finite measure space.


1. There exists an increasing sequence satisfying the σ-finite definition
2. There exists a disjoint sequence satisfying the σ-finite definition.
Proof

Our next definition concerns sets where a measure is concentrated, in a certain sense.

Suppose that (S, S , μ) is a measure space. An atom of the space is a set A ∈ S with the following properties:
1. μ(A) > 0
2. If B ∈ S and B ⊆ A then either μ(B) = μ(A) or μ(B) = 0 .
A measure space that has no atoms is called non-atomic or diffuse.

In probability theory, we are often particularly interested in atoms that are singleton sets. Note that {x} ∈ S is an atom if and only
if μ({x}) > 0, since the only subsets of {x} are {x} itself and ∅.

Constructions
There are several simple ways to construct new positive measures from existing ones. As usual, we start with a measurable space
(S, S ).

Suppose that (R, R) is a measurable subspace of (S, S ). If μ is a positive measure on (S, S ) then μ restricted to R is a
positive measure on (R, R) . If μ is a finite measure on (S, S ) then μ is a finite measure on (R, R) .
Proof

However, if μ is σ-finite on (S, S ), it is not necessarily true that μ is σ-finite on (R, R) . A counterexample is given below. The
previous theorem would apply, in particular, when R = S so that R is a sub σ-algebra of S . Next, a positive multiple of a positive
measure gives another positive measure.

If μ is a positive measure on (S, S ) and c ∈ (0, ∞), then cμ is also a positive measure on (S, S ). If μ is finite (σ-finite) then
cμ is finite (σ-finite) respectively.

Proof

A nontrivial finite positive measure μ is practically just like a probability measure, and in fact can be re-scaled into a probability
measure P, as was done in the section on Probability Measures:

Suppose that μ is a positive measure on (S, S ) with 0 < μ(S) < ∞ . Then P defined by P(A) = μ(A)/μ(S) for A ∈ S is a
probability measure on (S, S ).
Proof

Sums of positive measures are also positive measures.

If μ_i is a positive measure on (S, S) for each i in a countable index set I then μ = ∑_{i∈I} μ_i is also a positive measure on (S, S).

1. If I is finite and μ_i is finite for each i ∈ I then μ is finite.
2. If I is finite and μ_i is σ-finite for each i ∈ I then μ is σ-finite.
Proof

In the context of the last result, if I is countably infinite and μ_i is finite for each i ∈ I, then μ is not necessarily σ-finite. A counterexample is given below. In this case, μ is said to be s-finite, but we've had enough definitions, so we won't pursue this one. From scaling and sum properties, note that a positive linear combination of positive measures is a positive measure. The next method is sometimes referred to as a change of variables.

Suppose that (S, S, μ) is a measure space. Suppose also that (T, T) is another measurable space and that f : S → T is measurable. Then ν defined as follows is a positive measure on (T, T):

ν(B) = μ[f^{−1}(B)], B ∈ T (2.7.17)

If μ is finite then ν is finite.


Proof

In the context of the last result, if μ is σ-finite on (S, S), it is not necessarily true that ν is σ-finite on (T, T), even if f is one-to-one. A counterexample is given below. The takeaway is that σ-finiteness of ν depends very much on the nature of the σ-algebra T. Our next result shows that it's easy to explicitly construct a positive measure on a countably generated σ-algebra, that is, a σ-algebra generated by a countable partition. Such σ-algebras are important for counterexamples and to gain insight, and also because many σ-algebras that occur in applications can be constructed from them.

Suppose that A = {A_i : i ∈ I} is a countable partition of S into nonempty sets, and that S = σ(A), the σ-algebra generated by the partition. For i ∈ I, define μ(A_i) ∈ [0, ∞] arbitrarily. For A = ⋃_{j∈J} A_j where J ⊆ I, define

μ(A) = ∑_{j∈J} μ(A_j) (2.7.19)

Then μ is a positive measure on (S, S).

1. The atoms of the measure are the sets of the form A = ⋃_{j∈J} A_j where J ⊆ I and where μ(A_j) > 0 for one and only one j ∈ J.
2. If μ(A_i) < ∞ for i ∈ I and I is finite then μ is finite.
3. If μ(A_i) < ∞ for i ∈ I and I is countably infinite then μ is σ-finite.
Proof
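
Here is a small added sketch of the construction, with the partition blocks indexed by integers and a set in σ(A) represented by its index set J (all names assumed):

# A measure on the sigma-algebra generated by a countable partition. Each block
# A_i gets an arbitrary weight; the measure of a union of blocks (represented
# by its index set J) is the sum of the weights.
weights = {0: 1.0, 1: 0.5, 2: 3.0, 3: float("inf")}  # mu(A_i), chosen arbitrarily

def mu(J):
    return sum(weights[j] for j in J)

print(mu({0, 1}))  # 1.5
print(mu({2, 3}))  # inf: the block A_3 alone is an atom of infinite measure,
                   # so by the result below this space is not sigma-finite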

One of the most general ways to construct new measures from old ones is via the theory of integration with respect to a positive
measure, which is explored in the chapter on Distributions. The construction of positive measures more or less “from scratch” is
considered in the next section on Existence and Uniqueness. We close this discussion with a simple result that is useful for
counterexamples.

Suppose that the measure space (S, S , μ) has an atom A ∈ S with μ(A) = ∞ . Then the space is not σ-finite.
Proof

Measure and Topology


Often the spaces that occur in probability and stochastic processes are topological spaces. Recall that a topological space (S, T) consists of a set S and a topology T on S (the collection of open sets). The topology as well as the measure theory plays an important role, so it's natural to want these two types of structures to be compatible. We have already seen the most important step in this direction: Recall that S = σ(T), the σ-algebra generated by the topology, is the Borel σ-algebra on S, named for Émile Borel. Since the complement of an open set is a closed set, S is also the σ-algebra generated by the collection of closed sets. Moreover, S contains countable intersections of open sets (called G_δ sets) and countable unions of closed sets (called F_σ sets).

Suppose that (S, T ) is a topological space and let S = σ(T ) be the Borel σ-algebra. A positive measure μ on (S, S ) is a
Borel measure and then (S, S , μ) is a Borel measure space.

The next definition concerns the subset on which a Borel measure is concentrated, in a certain sense.

Suppose that (S, S , μ) is a Borel measure space. The support of μ is

supp(μ) = {x ∈ S : μ(U ) > 0 for every open neighborhood U of x} (2.7.21)

The set supp(μ) is closed.


Proof

The term Borel measure has different definitions in the literature. Often the topological space is required to be locally compact,
Hausdorff, and with a countable base (LCCB). Then a Borel measure μ is required to have the additional condition that μ(C ) < ∞
if C ⊆ S is compact. In this text, we use the term Borel measures in this more restricted sense.

Suppose that (S, S, μ) is a Borel measure space corresponding to an LCCB topology. Then the space is σ-finite.
Proof

Here are a couple of other definitions that are important for Borel measures, again linking topology and measure in natural ways.

Suppose again that (S, S , μ) is a Borel measure space.


1. μ is inner regular if μ(A) = sup{μ(C ) : C is compact and C ⊆ A} for A ∈ S .
2. μ is outer regular if μ(A) = inf{μ(U ) : U is open and A ⊆ U } for A ∈ S .
3. μ is regular if it is both inner regular and outer regular.

The measure spaces that occur in probability and stochastic processes are usually regular Borel spaces associated with LCCB
topologies.

Null Sets and Equivalence


Sets of measure 0 in a measure space turn out to be very important precisely because we can often ignore the differences between mathematical objects on such sets. In this discussion, we assume that we have a fixed measure space (S, S, μ).

A set A ∈ S is null if μ(A) = 0 .

Consider a measurable “statement” with x ∈ S as a free variable. (Technically, such a statement is a predicate on S .) If the
statement is true for all x ∈ S except for x in a null set, we say that the statement holds almost everywhere on S . This terminology
is used often in measure theory and captures the importance of the definition.

Let D = {A ∈ S : μ(A) = 0 or μ(A^c) = 0}, the collection of null and co-null sets. Then D is a sub σ-algebra of S.
Proof

Of course μ restricted to D is not very interesting since μ(A) = 0 or μ(A) = μ(S) for every A ∈ D. Our next definition is a type of equivalence between sets in S. To make this precise, recall first that the symmetric difference between subsets A and B of S is A △ B = (A ∖ B) ∪ (B ∖ A). This is the set that consists of points in one of the two sets, but not both, and corresponds to exclusive or.

Sets A, B ∈ S are equivalent if μ(A △ B) = 0 , and we denote this by A ≡ B .

Thus A ≡ B if and only if μ(A △ B) = μ(A ∖ B) + μ(B ∖ A) = 0 if and only if μ(A ∖ B) = μ(B ∖ A) = 0 . In the predicate
terminology mentioned above, the statement

x ∈ A if and only if x ∈ B (2.7.22)

is true for almost every x ∈ S . As the name suggests, the relation ≡ really is an equivalence relation on S and hence S is
partitioned into disjoint classes of mutually equivalent sets. Two sets in the same equivalence class differ by a set of measure 0.

The relation ≡ is an equivalence relation on S . That is, for A, B, C ∈ S ,


1. A ≡ A (the reflexive property).
2. If A ≡ B then B ≡ A (the symmetric property).

3. If A ≡ B and B ≡ C then A ≡ C (the transitive property).
Proof

Equivalence is preserved under the standard set operations.

If A, B ∈ S and A ≡ B then A^c ≡ B^c.
Proof

Suppose that A_i, B_i ∈ S and that A_i ≡ B_i for i in a countable index set I. Then

1. ⋃_{i∈I} A_i ≡ ⋃_{i∈I} B_i
2. ⋂_{i∈I} A_i ≡ ⋂_{i∈I} B_i

Proof

Equivalent sets have the same measure.

If A, B ∈ S and A ≡ B then μ(A) = μ(B) .


Proof

The converse trivially fails, and a counterexample is given below. However, the collection of null sets and the collection of co-null
sets do form equivalence classes.

Suppose that A ∈ S.

1. If μ(A) = 0 then A ≡ B if and only if μ(B) = 0.
2. If μ(A^c) = 0 then A ≡ B if and only if μ(B^c) = 0.
c c

Proof

We can extend the notion of equivalence to measurable functions with a common range space. Thus suppose that (T, T) is another measurable space. If f, g : S → T are measurable, then (f, g) : S → T × T is measurable with respect to the usual product σ-algebra T ⊗ T. We assume that the diagonal set D = {(y, y) : y ∈ T} ∈ T ⊗ T, which is almost always true in applications.

Measurable functions f , g : S → T are equivalent if μ{x ∈ S : f (x) ≠ g(x)} = 0 . Again we write f ≡g .


Details

In the terminology discussed earlier, f ≡ g means that f (x) = g(x) almost everywhere on S . As with measurable sets, the relation
≡ really does define an equivalence relation on the collection of measurable functions from S to T . Thus, the collection of such

functions is partitioned into disjoint classes of mutually equivalent variables.

The relation ≡ is an equivalence relation on the collection of measurable functions from S to T . That is, for measurable
f , g, h : S → T ,
1. f ≡ f (the reflexive property).
2. If f ≡ g then g ≡ f (the symmetric property).
3. If f ≡ g and g ≡ h then f ≡ h (the transitive property).
Proof

Suppose again that f, g : S → T are measurable and that f ≡ g. Then f^{−1}(B) ≡ g^{−1}(B) for every B ∈ T.
Proof

Thus if f, g : S → T are measurable and f ≡ g, then by the previous result, ν_f = ν_g where ν_f, ν_g are the measures on (T, T) associated with f and g, as above. Again, the converse fails with a passion.

It often happens that a definition for functions subsumes the corresponding definition for sets, by considering the indicator functions of the sets. So it is with equivalence. In the following result, we can take T = {0, 1} with T the collection of all subsets.

Suppose that A, B ∈ S. Then A ≡ B if and only if 1_A ≡ 1_B.


Proof

Equivalence is preserved under composition. For the next result, suppose that (U , U ) is yet another measurable space.

Suppose that f , g : S → T are measurable and that h : T → U is measurable. If f ≡g then h ∘ f ≡h∘g .


Proof

Suppose again that (S, S, μ) is a measure space. Let V denote the collection of all measurable real-valued functions from S into R. (As usual, R is given the Borel σ-algebra.) From our previous discussion of measure theory, we know that with the usual definitions of addition and scalar multiplication, (V, +, ⋅) is a vector space. However, in measure theory, we often do not want to distinguish between functions that are equivalent, so it's nice to know that the vector space structure is preserved when we identify equivalent functions. Formally, let [f] denote the equivalence class generated by f ∈ V, and let W denote the collection of all such equivalence classes. In modular notation, W is V / ≡. We define addition and scalar multiplication on W by

[f] + [g] = [f + g], c[f] = [cf]; f, g ∈ V, c ∈ R (2.7.27)

(W , +, ⋅) is a vector space.
Proof

Often we don't bother to use the special notation for the equivalence class associated with a function. Rather, it's understood that
equivalent functions represent the same object. Spaces of functions in a measure space are studied further in the chapter on
Distributions.

Completion
Suppose that (S, S , μ) is a measure space and let N = {A ∈ S : μ(A) = 0} denote the collection of null sets of the space. If
A ∈ N and B ∈ S is a subset of A , then we know that μ(B) = 0 so B ∈ N also. However, in general there might be subsets of

A that are not in S . This leads naturally to the following definition.

The measure space (S, S , μ) is complete if A ∈ N and B ⊆ A imply B ∈ S (and hence B ∈ N ).

Our goal in this discussion is to show that if (S, S, μ) is a σ-finite measure space that is not complete, then it can be completed. That is, μ can be extended to a σ-algebra that includes all of the sets in S and all subsets of null sets. The first step is to extend the equivalence relation defined in our previous discussion to P(S).

For A, B ⊆ S , define A ≡ B if and only if there exists N ∈ N such that A △ B ⊆N . The relation ≡ is an equivalence
relation on P(S) : For A, B, C ⊆ S ,
1. A ≡ A (the reflexive property).
2. If A ≡ B then B ≡ A (the symmetric property).
3. If A ≡ B and B ≡ C then A ≡ C (the transitive property).
Proof

So the equivalence relation ≡ partitions P(S) into mutually disjoint equivalence classes. Two sets in an equivalence class differ
by a subset of a null set. In particular, A ≡ ∅ if and only if A ⊆ N for some N ∈ N . The extended relation ≡ is preserved under
the set operations, just as before. Our next step is to enlarge the σ-algebra S by adding any set that is equivalent to a set in S .

Let S_0 = {A ⊆ S : A ≡ B for some B ∈ S}. Then S_0 is a σ-algebra of subsets of S, and in fact is the σ-algebra generated by S ∪ {A ⊆ S : A ≡ ∅}.
Proof

Our last step is to extend μ to a positive measure on the enlarged σ-algebra S_0.

Suppose that A ∈ S_0 so that A ≡ B for some B ∈ S. Define μ_0(A) = μ(B). Then

1. μ_0 is well defined.
2. μ_0(A) = μ(A) for A ∈ S.
3. μ_0 is a positive measure on S_0.

The measure space (S, S_0, μ_0) is complete and is known as the completion of (S, S, μ).

Proof

Examples and Exercises


As always, be sure to try the computational exercises and proofs yourself before reading the answers and proofs in the text. Recall
that a discrete measure space consists of a countable set, with the σ-algebra of all subsets, and with counting measure #.

Counterexamples
The continuity theorem for decreasing events can fail if the events do not have finite measure.

Consider Z with counting measure # on the σ-algebra of all subsets. Let A_n = {z ∈ Z : z ≤ −n} for n ∈ N_+. The continuity theorem fails for (A_1, A_2, …).

Proof

Equal measure certainly does not imply equivalent sets.

Suppose that (S, S, μ) is a measure space with the property that there exist disjoint sets A, B ∈ S such that μ(A) = μ(B) > 0. Then A and B are not equivalent.
Proof

For a concrete example, we could take S = {0, 1} with counting measure # on σ-algebra of all subsets, and A = {0} , B = {1} .
The σ-finite property is not necessarily inherited by a sub-measure space. To set the stage for the counterexample, let R denote the
Borel σ-algebra of R, that is, the σ-algebra generated by the standard Euclidean topology. There exists a positive measure λ on
(R, R) that generalizes length. The measure λ , known as Lebesgue measure, is constructed in the section on Existence. Next let C

denote the σ-algebra of countable and co-countable sets:


C = {A ⊆ R : A is countable or A^c is countable} (2.7.31)

That C is a σ-algebra was shown in the section on measure theory in the chapter on foundations.

(R, C ) is a subspace of (R, R). Moreover, (R, R, λ) is σ-finite but (R, C , λ) is not.
Proof

A sum of finite measures may not be σ-finite.

Let S be a nonempty, finite set with the σ-algebra S of all subsets. Let μ_n = # be counting measure on (S, S) for n ∈ N_+. Then μ_n is a finite measure for each n ∈ N_+, but μ = ∑_{n∈N_+} μ_n is not σ-finite.

Proof

Basic Properties
In the following problems, μ is a positive measure on the measurable space (S, S ).

Suppose that μ(S) = 20 and that A, B ∈ S with μ(A) = 5, μ(B) = 6, μ(A ∩ B) = 2. Find the measure of each of the following sets:

1. A ∖ B
2. A ∪ B
3. A^c ∪ B^c
4. A^c ∩ B^c
5. A ∪ B^c
Answer

Suppose that μ(S) = ∞ and that A, B ∈ S with μ(A ∖ B) = 2, μ(B ∖ A) = 3, and μ(A ∩ B) = 4. Find the measure of each of the following sets:

1. A
2. B
3. A ∪ B
4. A^c ∩ B^c
5. A^c ∪ B^c

Answer

Suppose that μ(S) = 10 and that A, B ∈ S with μ(A) = 3, μ(A ∪ B) = 7, and μ(A ∩ B) = 2. Find the measure of each of the following sets:

1. B
2. A ∖ B
3. B ∖ A
4. A^c ∪ B^c
5. A^c ∩ B^c

Answer

Suppose that A, B, C ∈ S with μ(A) = 10, μ(B) = 12, μ(C) = 15, μ(A ∩ B) = 3, μ(A ∩ C) = 4, μ(B ∩ C) = 5, and μ(A ∩ B ∩ C) = 1. Find the measure of each of the various unions:
1. A ∪ B
2. A ∪ C
3. B ∪ C
4. A ∪ B ∪ C
Answer

This page titled 2.7: Measure Spaces is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

2.8: Existence and Uniqueness

Suppose that S is a set and S a σ-algebra of subsets of S , so that (S, S ) is a measurable space. In many cases, it is impossible to
define a positive measure μ on S explicitly, by giving a “formula” for computing μ(A) for each A ∈ S . Rather, we often know
how the measure μ should work on some class of sets B that generates S . We would then like to know that μ can be extended to a
positive measure on S , and that this extension is unique. The purpose of this section is to discuss the basic results on this topic. To
understand this section you will need to review the sections on Measure Theory and Special Set Structures in the chapter on
Foundations, and the section on Measure Spaces in this chapter. If you are not interested in questions of existence and uniqueness
of positive measures, you can safely skip this section.

Basic Theory
Positive Measures on Algebras
Suppose first that A is an algebra of subsets of S . Recall that this means that A is a collection of subsets that contains S and is
closed under complements and finite unions (and hence also finite intersections). Here is our first definition:

A positive measure on A is a function μ : A → [0, ∞] that satisfies the following properties:

1. μ(∅) = 0
2. If {A_i : i ∈ I} is a countable, disjoint collection of sets in A and if ⋃_{i∈I} A_i ∈ A then

μ(⋃_{i∈I} A_i) = ∑_{i∈I} μ(A_i) (2.8.1)

Clearly the definition of a positive measure on an algebra is very similar to the definition for a σ-algebra. If the collection of sets in (b) is finite, then ⋃_{i∈I} A_i must be in the algebra A. Thus, μ is finitely additive. If the collection is countably infinite, then there is no guarantee that the union is in A. If it is however, then μ must be additive over this collection. Given the similarity, it is not surprising that μ shares many of the basic properties of a positive measure on a σ-algebra, with proofs that are almost identical.

If A, B ∈ A , then μ(B) = μ(A ∩ B) + μ(B ∖ A) .


Proof

If A, B ∈ A and A ⊆ B then
1. μ(B) = μ(A) + μ(B ∖ A)
2. μ(A) ≤ μ(B)
Proof

Thus μ is increasing, relative to the subset partial order ⊆ on A and the ordinary order ≤ on [0, ∞]. Note also that if A, B ∈ A and μ(B) < ∞ then μ(B ∖ A) = μ(B) − μ(A ∩ B). In the special case that A ⊆ B, this becomes μ(B ∖ A) = μ(B) − μ(A). If μ(S) < ∞ then μ(A^c) = μ(S) − μ(A). These are the familiar difference and complement rules.

The following result is the subadditive property for a positive measure μ on an algebra A .

Suppose that {A_i : i ∈ I} is a countable collection of sets in A and that ⋃_{i∈I} A_i ∈ A. Then

μ(⋃_{i∈I} A_i) ≤ ∑_{i∈I} μ(A_i) (2.8.2)

Proof

For a finite union of sets with finite measure, the inclusion-exclusion formula holds, and the proof is just like the one for a
probability measure.

Suppose that {A_i : i ∈ I} is a finite collection of sets in A where #(I) = n ∈ N_+, and that μ(A_i) < ∞ for i ∈ I. Then

μ(⋃_{i∈I} A_i) = ∑_{k=1}^n (−1)^{k−1} ∑_{J⊆I, #(J)=k} μ(⋂_{j∈J} A_j) (2.8.4)

The continuity theorems hold for a positive measure μ on an algebra A , just as for a positive measure on a σ-algebra, assuming
that the appropriate union and intersection are in the algebra. The proofs are just as before.

Suppose that (A_1, A_2, …) is a sequence of sets in A.

1. If the sequence is increasing, so that A_n ⊆ A_{n+1} for each n ∈ N_+, and ⋃_{i=1}^∞ A_i ∈ A, then μ(⋃_{i=1}^∞ A_i) = lim_{n→∞} μ(A_n).
2. If the sequence is decreasing, so that A_{n+1} ⊆ A_n for each n ∈ N_+, and μ(A_1) < ∞ and ⋂_{i=1}^∞ A_i ∈ A, then μ(⋂_{i=1}^∞ A_i) = lim_{n→∞} μ(A_n).

Proof

Recall that if the sequence (A_1, A_2, …) is increasing, then we define lim_{n→∞} A_n = ⋃_{n=1}^∞ A_n, and if the sequence is decreasing then we define lim_{n→∞} A_n = ⋂_{n=1}^∞ A_n. Thus the conclusion of both parts of the continuity theorem is

μ(lim_{n→∞} A_n) = lim_{n→∞} μ(A_n) (2.8.8)
n→∞ n→∞

Finite additivity and continuity for increasing events imply countable additivity:

If μ : A → [0, ∞] satisfies the properties below then μ is a positive measure on A.

1. μ(∅) = 0
2. μ(⋃_{i∈I} A_i) = ∑_{i∈I} μ(A_i) if {A_i : i ∈ I} is a finite disjoint collection of sets in A
3. μ(⋃_{i=1}^∞ A_i) = lim_{n→∞} μ(A_n) if (A_1, A_2, …) is an increasing sequence of sets in A and ⋃_{i=1}^∞ A_i ∈ A.
Proof

Many of the basic theorems in measure theory require that the measure not be too far removed from being finite. This leads to the
following definition, which is just like the one for a positive measure on a σ-algebra.

A measure μ on an algebra A of subsets of S is σ-finite if there exists a sequence of sets (A_1, A_2, …) in A such that ⋃_{n=1}^∞ A_n = S and μ(A_n) < ∞ for each n ∈ N_+. The sequence is called a σ-finite sequence for μ.

Suppose that μ is a σ-finite measure on an algebra A of subsets of S .


1. There exists an increasing σ-finite sequence.
2. There exists a disjoint σ-finite sequence.
Proof

Extension and Uniqueness Theorems


The fundamental theorem on measures states that a positive, σ-finite measure μ on an algebra A can be uniquely extended to
σ(A ). The extension part is sometimes referred to as the Carathéodory extension theorem, and is named for the Greek

mathematician Constantin Carathéodory.

If μ is a positive, σ-finite measure on an algebra A, then μ can be extended to a positive measure on S = σ(A).
Proof

Our next goal is the basic uniqueness result, which serves as the complement to the basic extension result. But first we need another
variation of the term σ-finite.

Suppose that μ is a measure on a σ-algebra S of subsets of S and B ⊆ S. Then μ is σ-finite on B if there exists a countable collection {B_i : i ∈ I} ⊆ B such that μ(B_i) < ∞ for i ∈ I and ⋃_{i∈I} B_i = S.

The next result is the uniqueness theorem. The proof, like others that we have seen, uses Dynkin's π-λ theorem, named for Eugene
Dynkin.

Suppose that B is a π-system and that S = σ(B). If μ₁ and μ₂ are positive measures on S and are σ-finite on B, and if μ₁(A) = μ₂(A) for all A ∈ B, then μ₁(A) = μ₂(A) for all A ∈ S.

Proof

An algebra A of subsets of S is trivially a π-system. Hence, if μ₁ and μ₂ are positive measures on S = σ(A) and are σ-finite on A, and if μ₁(A) = μ₂(A) for A ∈ A, then μ₁(A) = μ₂(A) for A ∈ S. This completes the second part of the fundamental theorem.
Of course, the results of this subsection hold for probability measures. Formally, a probability measure P on an algebra A of
subsets of S is a positive measure on A with the additional requirement that P(S) = 1 . Probability measures are trivially σ-finite,
so a probability measure P on an algebra A can be uniquely extended to S = σ(A ) .
However, usually we start with a collection that is more primitive than an algebra. The next result combines the definition with the
main theorem associated with the definition. For a proof see the section on Special Set Structures in the chapter on Foundations.

Suppose that B is a nonempty collection of subsets of S and let

A = {⋃_{i∈I} B_i : {B_i : i ∈ I} is a finite, disjoint collection of sets in B} (2.8.16)

If the following conditions are satisfied, then B is a semi-algebra of subsets of S, and then A is the algebra generated by B.
1. If B₁, B₂ ∈ B then B₁ ∩ B₂ ∈ B.
2. If B ∈ B then Bᶜ ∈ A.

Suppose now that we know how a measure μ should work on a semi-algebra B that generates an algebra A and then a σ-algebra S = σ(A) = σ(B). That is, we know μ(B) ∈ [0, ∞] for each B ∈ B. Because of the additivity property, there is no question as to how we should extend μ to A. We must have

μ(A) = ∑_{i∈I} μ(B_i) (2.8.17)

if A = ⋃_{i∈I} B_i for some finite, disjoint collection {B_i : i ∈ I} of sets in B (so that A ∈ A). However, we cannot assign the values μ(B) for B ∈ B arbitrarily. The following extension theorem states that, subject just to some essential consistency conditions, the extension of μ from the semi-algebra B to the algebra A does in fact produce a measure on A. The consistency conditions are that μ be finitely additive and countably subadditive on B.

Suppose that B is a semi-algebra of subsets of S and that A is the algebra of subsets of S generated by B. A function μ : B → [0, ∞] can be uniquely extended to a measure on A if and only if μ satisfies the following properties:
1. If ∅ ∈ B then μ(∅) = 0.
2. If {B_i : i ∈ I} is a finite, disjoint collection of sets in B and B = ⋃_{i∈I} B_i ∈ B then μ(B) = ∑_{i∈I} μ(B_i).
3. If B ∈ B and B ⊆ ⋃_{i∈I} B_i where {B_i : i ∈ I} is a countable collection of sets in B then μ(B) ≤ ∑_{i∈I} μ(B_i).

If the measure μ on the algebra A is σ-finite, then the extension theorem and the uniqueness theorem apply, so μ can be extended
uniquely to a measure on the σ-algebra S = σ(A ) = σ(B) . This chain of extensions, starting with a semi-algebra B , is often
how measures are constructed.

Examples and Applications


Product Spaces
Suppose that (S, S ) and (T , T ) are measurable spaces. For the Cartesian product set S × T , recall that the product σ-algebra is
S ⊗ T = σ{A × B : A ∈ S , B ∈ T } (2.8.18)

the σ-algebra generated by the Cartesian products of measurable sets, sometimes referred to as measurable rectangles.

Suppose that (S, S, μ) and (T, T, ν) are σ-finite measure spaces. Then there exists a unique σ-finite measure μ ⊗ ν on (S × T, S ⊗ T) such that

(μ ⊗ ν)(A × B) = μ(A)ν(B); A ∈ S, B ∈ T (2.8.19)

The measure space (S × T, S ⊗ T, μ ⊗ ν) is the product measure space associated with (S, S, μ) and (T, T, ν).
Proof

Recall that for C ⊆ S × T, the cross section of C in the first coordinate at x ∈ S is C_x = {y ∈ T : (x, y) ∈ C}. Similarly, the cross section of C in the second coordinate at y ∈ T is Cʸ = {x ∈ S : (x, y) ∈ C}. We know that the cross sections of a measurable set are measurable. The following result shows that the measures of the cross sections of a measurable set form measurable functions.

Suppose again that (S, S, μ) and (T, T, ν) are σ-finite measure spaces. If C ∈ S ⊗ T then
1. x ↦ ν(C_x) is a measurable function from S to [0, ∞].
2. y ↦ μ(Cʸ) is a measurable function from T to [0, ∞].
Proof

In the next chapter, where we study integration with respect to a measure, we will see that for C ∈ S ⊗ T, the product measure (μ ⊗ ν)(C) can be computed by integrating ν(C_x) over x ∈ S with respect to μ, or by integrating μ(Cʸ) over y ∈ T with respect to ν. These results, generalizing the definition of the product measure, are special cases of Fubini's theorem, named for the Italian mathematician Guido Fubini.
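For intuition, here is a minimal Python sketch of these facts on finite spaces, where μ and ν are specified by point masses. The spaces, weights, and the set C are arbitrary illustrative choices; the product measure of C computed directly from point masses agrees with both cross-section computations.

```python
# Product measure and cross sections on finite spaces. The measures mu and nu
# are given by point masses; spaces, weights, and C are illustrative choices.
S = ['a', 'b']
T = [0, 1, 2]
mu = {'a': 1.0, 'b': 2.0}           # positive measure on S
nu = {0: 0.5, 1: 1.5, 2: 1.0}       # positive measure on T

C = {('a', 0), ('a', 2), ('b', 1)}  # a subset of S x T

# Directly from point masses: (mu x nu)({(x, y)}) = mu({x}) nu({y})
direct = sum(mu[x] * nu[y] for (x, y) in C)

# Integrate nu(C_x) over x in S with respect to mu ...
by_first = sum(mu[x] * sum(nu[y] for y in T if (x, y) in C) for x in S)
# ... or integrate mu(C^y) over y in T with respect to nu
by_second = sum(nu[y] * sum(mu[x] for x in S if (x, y) in C) for y in T)

print(direct, by_first, by_second)  # 4.5 4.5 4.5
```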
Except for more complicated notation, these results extend in a perfectly straightforward way to the product of a finite number of
σ-finite measure spaces.

Suppose that n ∈ N₊ and that (S_i, S_i, μ_i) is a σ-finite measure space for i ∈ {1, 2, …, n}. Let S = ∏_{i=1}^n S_i and let S denote the corresponding product σ-algebra. There exists a unique σ-finite measure μ on (S, S) satisfying

μ(∏_{i=1}^n A_i) = ∏_{i=1}^n μ_i(A_i), A_i ∈ S_i for i ∈ {1, 2, …, n} (2.8.20)

The measure space (S, S , μ) is the product measure space associated with the given measure spaces.

Lebesgue Measure
The next discussion concerns our most important and essential application. Recall that the Borel σ-algebra on R, named for Émile
Borel, is the σ-algebra R generated by the standard Euclidean topology on R. Equivalently, R = σ(I ) where I is the collection
of intervals of R (of all types—bounded and unbounded, with any type of closure, and including single points and the empty set).
Next recall how the length of an interval is defined. For a, b ∈ R with a ≤ b , each of the intervals (a, b), [a, b), (a, b], and [a, b]
has length b − a . For a ∈ R , each of the intervals (a, ∞), [a, ∞), (−∞, a) , (−∞, a] has length ∞, as does R itself. The standard
measure on R generalizes the length measurement for intervals.

There exists a unique measure λ on R such that λ(I ) = length(I ) for I ∈ I . The measure λ is Lebesgue measure on
(R, R).

Proof

The name is in honor of Henri Lebesgue, of course. Since λ is σ-finite, the σ-algebra of Borel sets R can be completed with
respect to λ .

The completion of the Borel σ-algebra R with respect to λ is the Lebesgue σ-algebra R∗.

Recall that completed means that if A ∈ R∗, λ(A) = 0 and B ⊆ A, then B ∈ R∗ (and then λ(B) = 0). The Lebesgue measure λ on R, with either the Borel σ-algebra R or its completion R∗, is the standard measure that is used for the real numbers. Other properties of the measure space (R, R, λ) are given below, in the discussion of Lebesgue measure on Rⁿ.

For n ∈ N₊, let R_n denote the Borel σ-algebra corresponding to the standard Euclidean topology on Rⁿ, so that (Rⁿ, R_n) is the n-dimensional Euclidean measurable space. The σ-algebra R_n is also the n-fold power of R, the Borel σ-algebra of R. That is, R_n = R ⊗ R ⊗ ⋯ ⊗ R (n times). It is also the σ-algebra generated by the products of intervals:

R_n = σ{I₁ × I₂ × ⋯ × I_n : I_j ∈ I for j ∈ {1, 2, …, n}} (2.8.21)

As above, let λ denote Lebesgue measure on (R, R).

For n ∈ N₊, the n-fold power of λ, denoted λ_n, is Lebesgue measure on (Rⁿ, R_n). In particular,

λ_n(A₁ × A₂ × ⋯ × A_n) = λ(A₁)λ(A₂) ⋯ λ(A_n); A₁, …, A_n ∈ R (2.8.22)

Specializing further, if I_j ∈ I is an interval for j ∈ {1, 2, …, n} then

λ_n(I₁ × I₂ × ⋯ × I_n) = length(I₁)length(I₂) ⋯ length(I_n) (2.8.23)

In particular, λ₂ extends the area measure on R² and λ₃ extends the volume measure on R³. In general, λ_n(A) is sometimes referred to as the n-dimensional volume of A ∈ R_n. As in the one-dimensional case, R_n can be completed with respect to λ_n, essentially adding all subsets of sets of measure 0 to R_n. The completed σ-algebra is the σ-algebra of Lebesgue measurable sets.

Since λ_n(U) > 0 if U ⊆ Rⁿ is open and nonempty, the support of λ_n is all of Rⁿ. In addition, Lebesgue measure has the regularity properties that are concerned with approximating the measure of a set, from below with the measure of a compact set, and from above with the measure of an open set.

The measure space (Rⁿ, R_n, λ_n) is regular. That is, for A ∈ R_n,

1. λ_n(A) = sup{λ_n(C) : C is compact and C ⊆ A} (inner regularity)
2. λ_n(A) = inf{λ_n(U) : U is open and A ⊆ U} (outer regularity)

The following theorem describes how the measure of a set is changed under certain basic transformations. These are essential properties of Lebesgue measure. To set up the notation, suppose that n ∈ N₊, A ⊆ Rⁿ, x ∈ Rⁿ, c ∈ (0, ∞), and that T is an n × n matrix. Define

A + x = {a + x : a ∈ A}, cA = {ca : a ∈ A}, T A = {T a : a ∈ A} (2.8.24)

Suppose that A ∈ R_n.

1. If x ∈ Rⁿ then λ_n(A + x) = λ_n(A) (translation invariance)
2. If c ∈ (0, ∞) then λ_n(cA) = cⁿ λ_n(A) (dilation property)
3. If T is an n × n matrix then λ_n(T A) = |det(T)| λ_n(A) (the scaling property)
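These properties can be checked numerically. The following Python sketch uses hit-or-miss Monte Carlo to estimate λ₂ of the unit disk and of its translate, dilation, and linear image; the particular shift x, scale c, and matrix T are arbitrary illustrative choices, and the estimates are only approximate.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hit-or-miss Monte Carlo estimate of lambda_2(A) for A inside the box [-L, L]^2.
def area_estimate(indicator, L=10.0, n=1_000_000):
    pts = rng.uniform(-L, L, size=(n, 2))
    return (2 * L) ** 2 * indicator(pts).mean()

# A is the unit disk; the shift x, scale c, and matrix T are arbitrary choices.
in_disk = lambda p: (p ** 2).sum(axis=1) <= 1.0
x = np.array([1.5, -2.0])
c = 2.0
T = np.array([[2.0, 1.0], [0.0, 3.0]])   # det(T) = 6

base = area_estimate(in_disk)                                      # ~ pi
shifted = area_estimate(lambda p: in_disk(p - x))                  # ~ pi
scaled = area_estimate(lambda p: in_disk(p / c))                   # ~ c^2 pi
mapped = area_estimate(lambda p: in_disk(p @ np.linalg.inv(T).T))  # ~ |det T| pi

# All four printed values should be close to pi ~ 3.14159
print(base, shifted, scaled / c**2, mapped / abs(np.linalg.det(T)))
```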

Lebesgue-Stieltjes Measures on R
The construction of Lebesgue measure on R can be generalized. Here is the definition that we will need.

A function F : R → R that satisfies the following properties is a distribution function on R:

1. F is increasing: if x ≤ y then F(x) ≤ F(y).
2. F is continuous from the right: lim_{t↓x} F(t) = F(x) for all x ∈ R.

Since F is increasing, the limit from the left at x ∈ R exists in R and is denoted F(x⁻) = lim_{t↑x} F(t). Similarly, F(∞) = lim_{x→∞} F(x) exists, as a real number or ∞, and F(−∞) = lim_{x→−∞} F(x) exists, as a real number or −∞.

If F is a distribution function on R, then there exists a unique measure μ on R that satisfies


μ(a, b] = F (b) − F (a), −∞ ≤ a ≤ b ≤ ∞ (2.8.25)

The measure μ is called the Lebesgue-Stieltjes measure associated with F , named for Henri Lebesgue and Thomas Joannes
Stieltjes. Distribution functions and the measures associated with them are studied in more detail in the chapter on Distributions.
When the function F takes values in [0, 1], the associated measure P is a probability measure, and the function F is the probability
distribution function of P. Probability distribution functions are also studied in much more detail (but with less technicality) in the
chapter on Distributions.
Note that the identity function x ↦ x for x ∈ R is a distribution function, and the measure associated with this function is ordinary
Lebesgue measure on R constructed in (15).
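As a small computational illustration, the sketch below evaluates the Lebesgue-Stieltjes measure of intervals (a, b] for one particular distribution function; the choice F(x) = 1 − e^{−x} for x ≥ 0 and F(x) = 0 for x < 0 is illustrative. Since this F takes values in [0, 1], the associated measure is in fact a probability measure (the standard exponential distribution).

```python
import math

# Lebesgue-Stieltjes measure of (a, b] is F(b) - F(a). The distribution
# function F below is an illustrative choice (standard exponential).
def F(x):
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

def mu(a, b):
    """Measure of the interval (a, b] under the Lebesgue-Stieltjes measure of F."""
    return F(b) - F(a)

print(mu(0.0, 1.0))    # 1 - e^{-1} ~ 0.632
print(mu(-5.0, 0.0))   # 0: F is flat on (-infinity, 0]
# Finite additivity over a partition of (0, 2]:
print(mu(0.0, 2.0), mu(0.0, 1.0) + mu(1.0, 2.0))  # equal
```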

This page titled 2.8: Existence and Uniqueness is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

2.9: Probability Spaces Revisited

In this section we discuss probability spaces from the more advanced point of view of measure theory. The previous two sections
on positive measures and existence and uniqueness are prerequisites. The discussion is divided into two parts: first those concepts
that are shared rather equally between probability theory and general measure theory, and second those concepts that are for the
most part unique to probability theory. In particular, it's a mistake to think of probability theory as a mere branch of measure theory.
Probability has its own notation, terminology, point of view, and applications that make it an incredibly rich subject on its own.

Basic Concepts
Our first discussion concerns topics that were discussed in the section on positive measures. So no proofs are necessary, but you
will notice that the notation, and in some cases the terminology, is very different.

Definitions
We can now give a precise definition of the probability space, the mathematical model of a random experiment.

A probability space (S, S, P) consists of three essential parts:


1. A set of outcomes S .
2. A σ-algebra of events S .
3. A probability measure P on the sample space (S, S ).

Often the special notation (Ω, F , P) is used for a probability space in the literature—the symbol Ω for the set of outcomes is
intended to remind us that these are all possible outcomes. However in this text, we don't insist on the special notation, and use
whatever notation seems most appropriate in a given context.
In probability, σ-algebras are not just important for theoretical and foundational purposes, but are important for practical purposes
as well. A σ-algebra can be used to specify partial information about an experiment—a concept of fundamental importance.
Specifically, suppose that A is a collection of events in the experiment, and that we know whether or not A occurred for each
A ∈ A . Then in fact, we can determine whether or not A occurred for each A ∈ σ(A ) , the σ-algebra generated by A .

Technically, a random variable for our experiment is a measurable function from the sample space into another measurable space.

Suppose that (S, S , P) is a probability space and that (T , T ) is another measurable space. A random variable X with values
in T is a measurable function from S into T .
1. The probability distribution of X is the mapping on T given by B ↦ P(X ∈ B) .
2. The collection of events {{X ∈ B} : B ∈ T } is a sub σ-algebra of S , and is the σ-algebra generated by X, denoted
σ(X).

Details

Figure 2.9.1 : The event {X ∈ B} associated with B ∈ T


If we observe the value of X, then we know whether or not each event in σ(X) has occurred. More generally, we can construct the
σ-algebra associated with any collection of random variables.

Suppose that (T_i, T_i) is a measurable space for each i in an index set I, and that X_i is a random variable taking values in T_i for each i ∈ I. The σ-algebra generated by {X_i : i ∈ I} is

σ{X_i : i ∈ I} = σ{{X_i ∈ B_i} : B_i ∈ T_i, i ∈ I} (2.9.1)

If we observe the value of X_i for each i ∈ I then we know whether or not each event in σ{X_i : i ∈ I} has occurred. This idea is
i i : i ∈ I} has occurred. This idea is
very important in the study of stochastic processes.

Null Events, Almost Sure Events, and Equivalence


Suppose that (S, S , P) is a probability space.

Define the following collections of events:


1. N = {A ∈ S : P(A) = 0} , the collection of null events
2. M = {A ∈ S : P(A) = 1} , the collection of almost sure events
3. D = N ∪ M = {A ∈ S : P(A) = 0 or P(A) = 1} , the collection of essentially deterministic events
The collection of essentially deterministic events D is a sub σ-algebra of S .

In the section on independence, we showed that D is also a collection of independent events.


Intuitively, equivalent events or random variables are those that are indistinguishable from a probabilistic point of view. Recall first
that the symmetric difference between events A and B is A △ B = (A ∖ B) ∪ (B ∖ A) ; it is the event that occurs if and only if
one of the events occurs, but not the other, and corresponds to exclusive or. Here is the definition for events:

Events A and B are equivalent if A △ B ∈ N , and we denote this by A ≡B . The relation ≡ is an equivalence relation on
S . That is, for A, B, C ∈ S ,

1. A ≡ A (the reflexive property).


2. If A ≡ B then B ≡ A (the symmetric property).
3. If A ≡ B and B ≡ C then A ≡ C (the transitive property).

Thus A ≡ B if and only if P(A △ B) = P(A ∖ B) + P(B ∖ A) = 0 if and only if P(A ∖ B) = P(B ∖ A) = 0 . The equivalence
relation ≡ partitions S into disjoint classes of mutually equivalent events. Equivalence is preserved under the set operations.

Suppose that A, B ∈ S. If A ≡ B then Aᶜ ≡ Bᶜ.

Suppose that A_i, B_i ∈ S for i in a countable index set I. If A_i ≡ B_i for i ∈ I then
1. ⋃_{i∈I} A_i ≡ ⋃_{i∈I} B_i
2. ⋂_{i∈I} A_i ≡ ⋂_{i∈I} B_i

Equivalent events have the same probability.

If A, B ∈ S and A ≡ B then P(A) = P(B) .

The converse trivially fails, and a counterexample is given below. However, the null and almost sure events do form equivalence
classes.

Suppose that A ∈ S .
1. If A ∈ N then A ≡ B if and only if B ∈ N .
2. If A ∈ M then A ≡ B if and only if B ∈ M .

We can extend the notion of equivalence to random variables taking values in the same space. Thus suppose that (T , T ) is another
measurable space. If X and Y are random variables with values in T , then (X, Y ) is a random variable with values in T × T ,
which is given the usual product σ-algebra T ⊗ T . We assume that the diagonal set D = {(x, x) : x ∈ T } ∈ T ⊗ T , which is
almost always true in applications.

Random variables X and Y taking values in T are equivalent if P(X = Y ) = 1 . Again we write X ≡ Y . The relation ≡ is an
equivalence relation on the collection of random variables that take values in T . That is, for random variables X, Y , and Z
with values in T ,
1. X ≡ X (the reflexive property).

2. If X ≡ Y then Y ≡ X (the symmetric property).
3. If X ≡ Y and Y ≡ Z then X ≡ Z (the transitive property).

So the collection of random variables with values in T is partitioned into disjoint classes of mutually equivalent variables.

Suppose that X and Y are random variables taking values in T and that X ≡ Y . Then
1. {X ∈ B} ≡ {Y ∈ B} for every B ∈ T .
2. X and Y have the same probability distribution on (T , T ).

Again, the converse to part (b) fails with a passion, and a counterexample is given below. It often happens that a definition for
random variables subsumes the corresponding definition for events, by considering the indicator variables of the events. So it is
with equivalence.

Suppose that A, B ∈ S. Then A ≡ B if and only if 1_A ≡ 1_B.

Equivalence is preserved under a deterministic transformation of the variables. For the next result, suppose that (U , U ) is yet
another measurable space, along with (T , T ).

Suppose X, Y are random variables with values in T and that g : T → U is measurable. If X ≡ Y then g(X) ≡ g(Y ) .

Suppose again that (S, S , P) is a probability space corresponding to a random experiment. Let V denote the collection of all real-
valued random variables for the experiment, that is, all measurable functions from S into R. With the usual definitions of addition
and scalar multiplication, (V , +, ⋅) is a vector space. However, in probability theory, we often do not want to distinguish between
random variables that are equivalent, so it's nice to know that the vector space structure is preserved when we identify equivalent
random variables. Formally, let [X] denote the equivalence class generated by a real-valued random variable X ∈ V, and let W denote the collection of all such equivalence classes. In modular notation, W is the set V/≡. We define addition and scalar multiplication on W by

[X] + [Y] = [X + Y], c[X] = [cX]; [X], [Y] ∈ W, c ∈ R (2.9.2)

(W , +, ⋅) is a vector space.

Often we don't bother to use the special notation for the equivalence class associated with a random variable. Rather, it's understood
that equivalent random variables represent the same object. Spaces of functions in a general measure space are studied in the
chapter on Distributions, and spaces of random variables are studied in more detail in the chapter on Expected Value.

Completion
Suppose again that (S, S , P) is a probability space, and that N denotes the collection of null events, as above. Suppose that
A ∈ N so that P(A) = 0 . If B ⊆ A and B ∈ S , then we know that P(B) = 0 so B ∈ N also. However, in general there might
be subsets of A that are not in S . This leads naturally to the following definition.

The probability space (S, S , P) is complete if A ∈ N and B ⊆ A imply B ∈ S (and hence B ∈ N ).

So the probability space is complete if every subset of an event with probability 0 is also an event (and hence also has probability
0). We know from our work on positive measures that every σ-finite measure space that is not complete can be completed. So in
particular a probability space that is not complete can be completed. To review the construction, recall that the equivalence relation
≡ that we used above on S is extended to P(S) (the power set of S ).

For A, B ⊆ S, define A ≡ B if and only if there exists N ∈ N such that A △ B ⊆ N. The relation ≡ is an equivalence relation on P(S).

Here is how the probability space is completed:

Let S₀ = {A ⊆ S : A ≡ B for some B ∈ S}. For A ∈ S₀, define P₀(A) = P(B) where B ∈ S and A ≡ B. Then
1. S₀ is a σ-algebra of subsets of S and S ⊆ S₀.
2. P₀ is a probability measure on (S, S₀).
3. (S, S₀, P₀) is complete, and is the completion of (S, S, P).

Product Spaces
Our next discussion concerns the construction of probability spaces that correspond to specified distributions. To set the stage,
suppose that (S, S , P) is a probability space. If we let X denote the identity function on S , so that X(x) = x for x ∈ S , then
{X ∈ A} = A for A ∈ S and hence P(X ∈ A) = P(A) . That is, P is the probability distribution of X. We have seen this before

—every probability measure can be thought of as the distribution of a random variable. The next result shows how to construct a
probability space that corresponds to a sequence of independent random variables with specified distributions.

Suppose n ∈ N₊ and that (S_i, S_i, P_i) is a probability space for i ∈ {1, 2, …, n}. The corresponding product measure space (S, S, P) is a probability space. If X_i : S → S_i is the ith coordinate function on S, so that X_i(x) = x_i for x = (x₁, x₂, …, x_n) ∈ S, then (X₁, X₂, …, X_n) is a sequence of independent random variables on (S, S, P), and X_i has distribution P_i on (S_i, S_i) for each i ∈ {1, 2, …, n}.
Proof

Intuitively, the given probability spaces correspond to n random experiments. The product space then is the probability space that corresponds to the experiments performed independently. When modeling a random experiment, if we say that we have a finite sequence of independent random variables with specified distributions, we can rest assured that there actually is a probability space that supports this statement.
We can extend the last result to an infinite sequence of probability spaces. Suppose that (S_i, S_i) is a measurable space for each i ∈ N₊. Recall that the product space ∏_{i=1}^∞ S_i consists of all sequences x = (x₁, x₂, …) such that x_i ∈ S_i for each i ∈ N₊. The corresponding product σ-algebra S is generated by the collection of cylinder sets. That is, S = σ(B) where

B = {∏_{i=1}^∞ A_i : A_i ∈ S_i for each i ∈ N₊ and A_i = S_i for all but finitely many i ∈ N₊} (2.9.6)

Suppose that (S_i, S_i, P_i) is a probability space for i ∈ N₊. Let (S, S) denote the product measurable space, so that S = σ(B) where B is the collection of cylinder sets. Then there exists a unique probability measure P on (S, S) that satisfies

P(∏_{i=1}^∞ A_i) = ∏_{i=1}^∞ P_i(A_i), ∏_{i=1}^∞ A_i ∈ B (2.9.7)

If X_i : S → S_i is the ith coordinate function on S for i ∈ N₊, so that X_i(x) = x_i for x = (x₁, x₂, …) ∈ S, then (X₁, X₂, …) is a sequence of independent random variables on (S, S, P), and X_i has distribution P_i on (S_i, S_i) for each i ∈ N₊.
Proof

Once again, if we model a random process by starting with an infinite sequence of independent random variables, we can be sure that there exists a probability space that supports this sequence. The particular probability space constructed in the last theorem is called the canonical probability space associated with the sequence of random variables. Note also that it was important that we had probability measures rather than just general positive measures in the construction, since the infinite product ∏_{i=1}^∞ P_i(A_i) is always well defined. The next section on Stochastic Processes continues the discussion of how to construct probability spaces that correspond to a collection of random variables with specified distributional properties.
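To make the construction concrete, the following Python sketch simulates the first three coordinate variables of such a canonical space; the marginal distributions P₁, P₂, P₃ are arbitrary illustrative choices, and independence of the coordinates is checked empirically on a product event.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Simulating the coordinate variables X_1, X_2, X_3 of the canonical product
# space: X_i has a specified distribution P_i and the X_i are independent.
# The marginal distributions below are illustrative choices.
samplers = [
    lambda size: rng.integers(0, 2, size=size),      # P_1: fair coin on {0, 1}
    lambda size: rng.exponential(2.0, size=size),    # P_2: exponential, mean 2
    lambda size: rng.normal(0.0, 1.0, size=size),    # P_3: standard normal
]

# Each row is one outcome omega = (x_1, x_2, x_3) in the product space;
# sampling each coordinate separately is exactly what independence permits.
n = 100_000
omega = np.column_stack([draw(n) for draw in samplers])

# Independence check on the product event {X_1 = 1} x {X_2 <= 2} x R:
lhs = np.mean((omega[:, 0] == 1) & (omega[:, 1] <= 2.0))
rhs = np.mean(omega[:, 0] == 1) * np.mean(omega[:, 1] <= 2.0)
print(lhs, rhs)  # approximately equal
```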

Probability Concepts
Our next discussion concerns topics that are unique to probability theory and do not have simple analogies in general measure
theory.

Independence
As usual, suppose that (S, S , P) is a probability space. We have already studied the independence of collections of events and the
independence of collections of random variables. A more complete and general treatment results if we define the independence of

collections of collections of events, and most importantly, the independence of collections of σ-algebras. This extension actually
occurred already, when we went from independence of a collection of events to independence of a collection of random variables,
but we did not note it at the time. In spite of the layers of set theory, the basic idea is the same.

Suppose that 𝒜_i is a collection of events for each i in an index set I. Then 𝒜 = {𝒜_i : i ∈ I} is independent if and only if for every choice of A_i ∈ 𝒜_i for i ∈ I, the collection of events {A_i : i ∈ I} is independent. That is, for every finite J ⊆ I,

P(⋂_{j∈J} A_j) = ∏_{j∈J} P(A_j) (2.9.8)

As noted above, independence of random variables, as we defined previously, is a special case of our new definition.

Suppose that (T_i, T_i) is a measurable space for each i in an index set I, and that X_i is a random variable taking values in a set T_i for each i ∈ I. The independence of {X_i : i ∈ I} is equivalent to the independence of {σ(X_i) : i ∈ I}.

Independence of events is also a special case of the new definition, and thus our new definition really does subsume our old one.

Suppose that A_i is an event for each i ∈ I. The independence of {A_i : i ∈ I} is equivalent to the independence of {𝒜_i : i ∈ I} where 𝒜_i = σ{A_i} = {S, ∅, A_i, A_iᶜ} for each i ∈ I.

For every collection of objects that we have considered (collections of events, collections of random variables, collections of
collections of events), the notion of independence has the basic inheritance property.

Suppose that A is a collection of collections of events.


1. If A is independent then B is independent for every B ⊆ A .
2. If B is independent for every finite B ⊆ A then A is independent.

Our most important collections are σ-algebras, and so we are most interested in the independence of a collection of σ-algebras. The
next result allows us to go from the independence of certain types of collections to the independence of the σ-algebras generated by
these collections. To understand the result, you will need to review the definitions and theorems concerning π-systems and λ -
systems. The proof uses Dynkin's π-λ theorem, named for Eugene Dynkin.

Suppose that 𝒜_i is a collection of events for each i in an index set I, and that 𝒜_i is a π-system for each i ∈ I. If {𝒜_i : i ∈ I} is independent, then {σ(𝒜_i) : i ∈ I} is independent.

Proof

The next result is a rigorous statement of the strong independence that is implied by the independence of a collection of events.

Suppose that 𝒜 is an independent collection of events, and that {ℬ_j : j ∈ J} is a partition of 𝒜. That is, ℬ_j ∩ ℬ_k = ∅ for j ≠ k and ⋃_{j∈J} ℬ_j = 𝒜. Then {σ(ℬ_j) : j ∈ J} is independent.

Proof

Let's bring the result down to earth. Suppose that A, B, C, D are independent events. In our elementary discussion of independence, you were asked to show, for example, that A ∪ Bᶜ and Cᶜ ∪ Dᶜ are independent. This is a consequence of the much stronger statement that the σ-algebras σ{A, B} and σ{C, D} are independent.

Exchangeability
As usual, suppose that (S, S, P) is a probability space corresponding to a random experiment. Roughly speaking, a sequence of
events or a sequence of random variables is exchangeable if the probability law that governs the sequence is unchanged when the
order of the events or variables is changed. Exchangeable variables arise naturally in sampling experiments and many other
settings, and are a natural generalization of a sequence of independent, identically distributed (IID) variables. Conversely, it turns
out that any exchangeable sequence of variables can be constructed from an IID sequence. First we give the definition for events:

Suppose that A = {A_i : i ∈ I} is a collection of events, where I is a nonempty index set. Then A is exchangeable if the probability of the intersection of a finite number of the events depends only on the number of events. That is, if J and K are finite subsets of I and #(J) = #(K) then

P(⋂_{j∈J} A_j) = P(⋂_{k∈K} A_k) (2.9.12)

Exchangeability has the same basic inheritance property that we have seen before.

Suppose that A is a collection of events.


1. If A is exchangeable then B is exchangeable for every B ⊆ A .
2. Conversely, if B is exchangeable for every finite B ⊆ A then A is exchangeable.

For a collection of exchangeable events, the inclusion exclusion law for the probability of a union is much simpler than the general
version.

Suppose that {A₁, A₂, …, A_n} is an exchangeable collection of events. For J ⊆ {1, 2, …, n} with #(J) = k, let p_k = P(⋂_{j∈J} A_j). Then

P(⋃_{i=1}^n A_i) = ∑_{k=1}^n (−1)^{k−1} \binom{n}{k} p_k (2.9.13)

Proof
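A classical illustration is the matching problem: for a uniformly random permutation of {1, 2, …, n}, the events A_i = {i is a fixed point} are exchangeable with p_k = (n − k)!/n!. The short Python sketch below evaluates the formula above for these events; the value of n is an arbitrary choice.

```python
from math import comb, factorial

# Matching problem: A_i is the event that i is a fixed point of a uniformly
# random permutation of {1, ..., n}. The events are exchangeable with
# p_k = P(A_{j_1} cap ... cap A_{j_k}) = (n - k)! / n!.
n = 10
p = {k: factorial(n - k) / factorial(n) for k in range(1, n + 1)}

# Formula (2.9.13): P(union of the A_i) = sum_k (-1)^{k-1} C(n, k) p_k
prob_match = sum((-1) ** (k - 1) * comb(n, k) * p[k] for k in range(1, n + 1))
print(prob_match)  # ~ 0.6321, close to 1 - 1/e already for moderate n
```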

The concept of exchangeablility can be extended to random variables in the natural way. Suppose that (T , T ) is a measurable
space.

Suppose that A is a collection of random variables, each taking values in T. The collection A is exchangeable if for any {X₁, X₂, …, X_n} ⊆ A, the distribution of the random vector (X₁, X₂, …, X_n) depends only on n.

Thus, the distribution of the random vector is unchanged if the coordinates are permuted. Once again, exchangeability has the same
basic inheritance property as a collection of independent variables.

Suppose that A is a collection of random variables, each taking values in T .


1. If A is exchangeable then B is exchangeable for every B ⊆ A .
2. Conversely, if B is exchangeable for every finite B ⊆ A then A is exchangeable.

Suppose that A is a collection of random variables, each taking values in T , and that A is exchangeable. Then trivially the
variables are identically distributed: if X, Y ∈ A and A ∈ T , then P(X ∈ A) = P(Y ∈ A) . Also, the definition of exchangeable
variables subsumes the definition for events:

Suppose that A is a collection of events, and let B = {1_A : A ∈ A} denote the corresponding collection of indicator random variables. Then A is exchangeable if and only if B is exchangeable.
variables. Then A is exchangeable if and only if B is exchangeable.

Tail Events and Variables


Suppose again that we have a random experiment modeled by a probability space (S, S , P).

Suppose that (X₁, X₂, …) is a sequence of random variables. The tail sigma algebra of the sequence is

T = ⋂_{n=1}^∞ σ{X_n, X_{n+1}, …} (2.9.15)

1. An event B ∈ T is a tail event for the sequence.


2. A random variable Y that is measurable with respect to T is a tail random variable for the sequence.

Informally, a tail event (random variable) is an event (random variable) that can be defined in terms of {X_n, X_{n+1}, …} for each n ∈ N₊. The tail sigma algebra for a sequence of events (A₁, A₂, …) is defined analogously (or simply let X_k = 1(A_k), the indicator variable of A_k, for each k). For the following results, you may need to review some of the definitions in the section on Convergence.

Suppose that (A₁, A₂, …) is a sequence of events.

1. If the sequence is increasing then lim_{n→∞} A_n = ⋃_{n=1}^∞ A_n is a tail event of the sequence.
2. If the sequence is decreasing then lim_{n→∞} A_n = ⋂_{n=1}^∞ A_n is a tail event of the sequence.
Proof

Suppose again that (A₁, A₂, …) is a sequence of events. Each of the following is a tail event of the sequence:

1. lim sup_{n→∞} A_n = ⋂_{n=1}^∞ ⋃_{i=n}^∞ A_i
2. lim inf_{n→∞} A_n = ⋃_{n=1}^∞ ⋂_{i=n}^∞ A_i

Proof

Suppose that X = (X₁, X₂, …) is a sequence of real-valued random variables.

1. {X_n converges as n → ∞} is a tail event for X.
2. lim inf_{n→∞} X_n is a tail random variable for X.
3. lim sup_{n→∞} X_n is a tail random variable for X.

Proof

The random variable in part (b) may take the value −∞, and the random variable in (c) may take the value ∞. From parts (b) and (c) together, note that if X_n → X_∞ as n → ∞ on the sample space S, then X_∞ is a tail random variable for X.

There are a number of zero-one laws in probability. These are theorems that give conditions under which an event will be
essentially deterministic; that is, have probability 0 or probability 1. Interestingly, it can sometimes be difficult to determine which
of these extremes is actually the case. The following result is the Kolmogorov zero-one law, named for Andrey Kolmogorov. It
states that an event in the tail σ-algebra of an independent sequence will have probability 0 or 1.

Suppose that X = (X₁, X₂, …) is an independent sequence of random variables.


1. If B is a tail event for X then P(B) = 0 or P(B) = 1 .
2. If Y is a real-valued tail random variable for X then Y is constant with probability 1.
Proof

From the Kolmogorov zero-one law and the result above, note that if (A₁, A₂, …) is a sequence of independent events, then lim sup_{n→∞} A_n must have probability 0 or 1. The Borel-Cantelli lemmas give conditions for which of these is correct:

Suppose that (A₁, A₂, …) is a sequence of independent events.

1. If ∑_{i=1}^∞ P(A_i) < ∞ then P(lim sup_{n→∞} A_n) = 0.
2. If ∑_{i=1}^∞ P(A_i) = ∞ then P(lim sup_{n→∞} A_n) = 1.
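The dichotomy is easy to see in simulation. The following Python sketch generates independent events A_n with P(A_n) = 1/n² (summable) and with P(A_n) = 1/n (not summable), and records the occurrences among the first N indices of a typical realization; N and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
N = 100_000
n = np.arange(1, N + 1)

# Independent events A_n with P(A_n) = p_n, simulated by uniform thresholds.
occurs_summable = rng.random(N) < 1.0 / n**2   # sum of p_n is finite
occurs_divergent = rng.random(N) < 1.0 / n     # sum of p_n is infinite

idx1 = occurs_summable.nonzero()[0] + 1        # indices n at which A_n occurs
idx2 = occurs_divergent.nonzero()[0] + 1

# Case 1: only finitely many A_n occur; the last occurrence is typically early.
print(len(idx1), idx1.max())
# Case 2: about log(N) of the A_n occur, and occurrences keep appearing.
print(len(idx2), idx2.max())
```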

Another proof of the Kolmogorov zero-one law will be given using the martingale convergence theorem.

Examples and Exercises


As always, be sure to try the computational exercises and proofs yourself before reading the answers and proofs in the text.

Counterexamples
Equal probability certainly does not imply equivalent events.

Consider the simple experiment of tossing a fair coin. The event that the coin lands heads and the event that the coin lands tails
have the same probability, but are not equivalent.
Proof

Similarly, equivalent distributions do not imply equivalent random variables.

Consider the experiment of rolling a standard, fair die. Let X denote the score and Y = 7 − X. Then X and Y have the same distribution but are not equivalent.
Proof

Consider the experiment of rolling two standard, fair dice and recording the sequence of scores (X, Y ) . Then X and Y are
independent and have the same distribution, but are not equivalent.
Proof

This page titled 2.9: Probability Spaces Revisited is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

2.10: Stochastic Processes

Introduction
This section requires measure theory, so you may need to review the advanced sections in the chapter on Foundations and in this chapter.
In particular, recall that a set E almost always comes with a σ-algebra E of admissible subsets, so that (E, E ) is a measurable space.
Usually in fact, E has a topology and E is the corresponding Borel σ-algebra, that is, the σ-algebra generated by the topology. If E is
countable, we almost always take E to be the collection of all subsets of E , and in this case (E, E ) is a discrete space. The other common
case is when E is an uncountable measurable subset of Rⁿ for some n ∈ N₊, in which case E is the collection of measurable subsets of E.

If (E₁, E₁), (E₂, E₂), …, (E_n, E_n) are measurable spaces for some n ∈ N₊, then the Cartesian product E₁ × E₂ × ⋯ × E_n is given the product σ-algebra E₁ ⊗ E₂ ⊗ ⋯ ⊗ E_n. As a special case, the Cartesian power Eⁿ is given the corresponding power σ-algebra Eⁿ.

With these preliminary remarks out of the way, suppose that (Ω, F , P) is a probability space, so that Ω is the set of outcomes, F the σ-
algebra of events, and P is the probability measure on the sample space (Ω, F ). Suppose also that (S, S ) and (T , T ) are measurable
spaces. Here is our main definition:

A random process or stochastic process on (Ω, F, P) with state space (S, S) and index set T is a collection of random variables X = {X_t : t ∈ T} such that X_t takes values in S for each t ∈ T.

Sometimes it's notationally convenient to write X(t) instead of X_t for t ∈ T. Often T = N or T = [0, ∞) and the elements of T are interpreted as points in time (discrete time in the first case and continuous time in the second). So then X_t ∈ S is the state of the random process at time t ∈ T, and the index space (T, T) becomes the time space.
Since X_t is itself a function from Ω into S, it follows that ultimately, a stochastic process is a function from Ω × T into S. Stated another way, t ↦ X_t is a random function on the probability space (Ω, F, P). To make this precise, recall that Sᵀ is the notation sometimes used for the collection of functions from T into S. Recall also that a natural σ-algebra used for Sᵀ is the one generated by sets of the form

{f ∈ Sᵀ : f(t) ∈ A_t for all t ∈ T}, where A_t ∈ S for every t ∈ T and A_t = S for all but finitely many t ∈ T (2.10.1)

This σ-algebra, denoted Sᵀ, generalizes the ordinary power σ-algebra Sⁿ mentioned in the opening paragraph and will be important in the discussion of existence below.

Suppose that X = {X_t : t ∈ T} is a stochastic process on the probability space (Ω, F, P) with state space (S, S) and index set T. Then the mapping that takes ω into the function t ↦ X_t(ω) is measurable with respect to (Ω, F) and (Sᵀ, Sᵀ).

Proof

For ω ∈ Ω, the function t ↦ X_t(ω) is known as a sample path of the process. So Sᵀ, the set of functions from T into S, can be thought of as a set of outcomes of the stochastic process X, a point we will return to in our discussion of existence below.

As noted in the proof of the last theorem, X_t is a measurable function from Ω into S for each t ∈ T, by the very meaning of the term random variable. But it does not follow in general that (ω, t) ↦ X_t(ω) is measurable as a function from Ω × T into S. In fact, the σ-algebra on T has played no role in our discussion so far. Informally, a statement about X_t for a fixed t ∈ T, or even a statement about X_t for countably many t ∈ T, defines an event. But it does not follow that a statement about X_t for uncountably many t ∈ T defines an event. We often want to make such statements, so the following definition is inevitable:

A stochastic process X = {X_t : t ∈ T} defined on the probability space (Ω, F, P) and with index space (T, T) and state space (S, S) is measurable if (ω, t) ↦ X_t(ω) is a measurable function from Ω × T into S.

Every stochastic process indexed by a countable set T is measurable, so the definition is only important when T is uncountable, and in
particular for T = [0, ∞) .

Equivalent Processes
Our next goal is to study different ways that two stochastic processes, with the same state and index spaces, can be “equivalent”. We will
assume that the diagonal D = {(x, x) : x ∈ S} ∈ S², an assumption that almost always holds in applications, and in particular for the

discrete and Euclidean spaces that are most important to us. Sufficient conditions are that S have a sub σ-algebra that is countably
generated and contains all of the singleton sets, properties that hold for the Borel σ-algebra when the topology on S is locally compact,
Hausdorff, and has a countable base.

First, we often feel that we understand a random process X = {X_t : t ∈ T} well if we know the finite dimensional distributions, that is, if we know the distribution of (X_{t₁}, X_{t₂}, …, X_{t_n}) for every choice of n ∈ N₊ and (t₁, t₂, …, t_n) ∈ Tⁿ. Thus, we can compute P[(X_{t₁}, X_{t₂}, …, X_{t_n}) ∈ A] for every n ∈ N₊, (t₁, t₂, …, t_n) ∈ Tⁿ, and A ∈ Sⁿ. Using various rules of probability, we can compute the probabilities of many events involving infinitely many values of the index parameter t as well. With this idea in mind, we have the following definition:

Random processes X = {X_t : t ∈ T} and Y = {Y_t : t ∈ T} with state space (S, S) and index set T are equivalent in distribution if they have the same finite dimensional distributions. This defines an equivalence relation on the collection of stochastic processes
if they have the same finite dimensional distributions. This defines an equivalence relation on the collection of stochastic processes
with this state space and index set. That is, if X, Y , and Z are such processes then
1. X is equivalent in distribution to X (the reflexive property)
2. If X is equivalent in distribution to Y then Y is equivalent in distribution to X (the symmetric property)
3. If X is equivalent in distribution to Y and Y is equivalent in distribution to Z then X is equivalent in distribution to Z (the
transitive property)

Note that since only the finite-dimensional distributions of the processes X and Y are involved in the definition, the processes need not be
defined on the same probability space. Thus, equivalence in distribution partitions the collection of all random processes with a given state
space and index set into mutually disjoint equivalence classes. But of course, we already know that two random variables can have the
same distribution but be very different as variables (functions on the sample space). Clearly, the same statement applies to random
processes.

Suppose that X = (X₁, X₂, …) is a sequence of Bernoulli trials with success parameter p = 1/2. Let Y_n = 1 − X_n for n ∈ N₊. Then Y = (Y₁, Y₂, …) is equivalent in distribution to X but

P(X_n ≠ Y_n for every n ∈ N₊) = 1 (2.10.2)

Proof
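A quick simulation makes the point vivid. The Python sketch below samples many paths of the first three trials, compares the empirical distributions of (X₁, X₂, X₃) and (Y₁, Y₂, Y₃), and checks that the paths of X and Y disagree at every index; the sample sizes and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# X is a fair Bernoulli trials sequence and Y_n = 1 - X_n: same finite
# dimensional distributions, but the paths differ at every index.
n_paths, length = 200_000, 3
X = rng.integers(0, 2, size=(n_paths, length))
Y = 1 - X

# Empirical distribution of (X_1, X_2, X_3) vs (Y_1, Y_2, Y_3): both are
# approximately uniform over the 8 binary strings.
codes_X = X @ np.array([4, 2, 1])
codes_Y = Y @ np.array([4, 2, 1])
print(np.bincount(codes_X, minlength=8) / n_paths)
print(np.bincount(codes_Y, minlength=8) / n_paths)

# But as random functions the two processes are never equal anywhere:
print(np.all(X != Y))  # True
```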

Motivated by this example, let's look at another, stronger way that random processes can be equivalent. First recall that random variables
X and Y on (Ω, F , P), with values in S , are equivalent if P(X = Y ) = 1 .

Suppose that X = {X_t : t ∈ T} and Y = {Y_t : t ∈ T} are stochastic processes defined on the same probability space (Ω, F, P) and both with state space (S, S) and index set T. Then Y is a version of X if Y_t is equivalent to X_t (so that P(X_t = Y_t) = 1) for every t ∈ T. This defines an equivalence relation on the collection of stochastic processes on the same probability space and with the same state space and index set. That is, if X, Y, and Z are such processes then
1. X is a version of X (the reflexive property)
2. If X is a version of Y then Y is a version of X (the symmetric property)
3. If X is a version of Y and Y is a version of Z then X is a version of Z (the transitive property)
Proof

So the version of relation partitions the collection of stochastic processes on a given probability space and with a given state space and
index set into mutually disjoint equivalence classes.

Suppose again that X = {X_t : t ∈ T} and Y = {Y_t : t ∈ T} are random processes on (Ω, F, P) with state space (S, S) and index set T. If Y is a version of X then Y and X are equivalent in distribution.


Proof

As noted in the proof, a countable intersection of events with probability 1 still has probability 1. Hence if T is countable and the random process X is a version of Y then

P(X_t = Y_t for all t ∈ T) = 1 (2.10.5)

so X and Y really are essentially the same random process. But when T is uncountable the result in the displayed equation may not be true, and X and Y may be very different as random functions on T. Here is a simple example:

Suppose that Ω = T = [0, ∞), F = T is the σ-algebra of Borel measurable subsets of [0, ∞), and P is any continuous probability measure on (Ω, F). Let S = {0, 1} (with all subsets measurable, of course). For t ∈ T and ω ∈ Ω, define X_t(ω) = 1_t(ω) (the indicator function of the point t) and Y_t(ω) = 0. Then X = {X_t : t ∈ T} is a version of Y = {Y_t : t ∈ T}, but P(X_t = Y_t for all t ∈ T) = 0.

Proof

Motivated by this example, we have our strongest form of equivalence:

Suppose that X = {X_t : t ∈ T} and Y = {Y_t : t ∈ T} are measurable random processes on the probability space (Ω, F, P) and with state space (S, S) and index space (T, T). Then X is indistinguishable from Y if P(X_t = Y_t for all t ∈ T) = 1. This defines

an equivalence relation on the collection of measurable stochastic processes defined on the same probability space and with the same
state and index spaces. That is, if X, Y , and Z are such processes then
1. X is indistinguishable from X (the reflexive property)
2. If X is indistinguishable from Y then Y is indistinguishable from X (the symmetric property)
3. If X is indistinguishable from Y and Y is indistinguishable from Z then X is indistinguishable from Z (the transitive property)
Details

So the indistinguishable from relation partitions the collection of measurable stochastic processes on a given probability space and with
given state space and index space into mutually disjoint equivalence classes. Trivially, if X is indistinguishable from Y , then X is a
version of Y . As noted above, when T is countable, the converse is also true, but not, as our previous example shows, when T is
uncountable. So to summarize, indistinguishable from implies version of implies equivalent in distribution, but none of the converse
implications hold in general.

The Kolmogorov Construction


In applications, a stochastic process is often modeled by giving various distributional properties that the process should satisfy. So the
basic existence problem is to construct a process that has these properties. More specifically, how can we construct random processes with
specified finite dimensional distributions? Let's start with the simplest case, one that we have seen several times before, and build up from
there. Our simplest case is to construct a single random variable with a specified distribution.

Suppose that (S, S, P) is a probability space. Then there exists a random variable X on a probability space (Ω, F, P) such that X takes values in S and has distribution P.


Proof

In spite of its triviality the last result contains the seeds of everything else we will do in this discussion. Next, let's see how to construct a
sequence of independent random variables with specified distributions.

Suppose that P_i is a probability measure on the measurable space (S_i, S_i) for i ∈ N₊. Then there exists an independent sequence of random variables (X₁, X₂, …) on a probability space (Ω, F, P) such that X_i takes values in S_i and has distribution P_i for i ∈ N₊.

Proof

If you looked at the proof of the last two results you might notice that the last result can be viewed as a special case of the one before, since X = (X₁, X₂, …) is simply the identity function on Ω = S. The important step is the existence of the product measure P on (Ω, F).

The full generalization of these results is known as the Kolmogorov existence theorem (named for Andrei Kolmogorov). We start with the state space (S, S) and the index set T. The theorem states that if we specify the finite dimensional distributions in a consistent way, then there exists a stochastic process defined on a suitable probability space that has the given finite dimensional distributions. The consistency condition is a bit clunky to state in full generality, but the basic idea is very easy to understand. Suppose that s and t are distinct elements in T and that we specify the distribution (probability measure) P_s of X_s, P_t of X_t, P_{s,t} of (X_s, X_t), and P_{t,s} of (X_t, X_s). Then clearly we must specify these so that

P_s(A) = P_{s,t}(A × S), P_t(B) = P_{s,t}(S × B) (2.10.13)

for all A, B ∈ S. Clearly we also must have P_{s,t}(C) = P_{t,s}(C′) for all measurable C ∈ S², where C′ = {(y, x) : (x, y) ∈ C}.
To state the consistency conditions in general, we need some notation. For n ∈ N₊, let T^(n) ⊂ Tⁿ denote the set of n-tuples of distinct elements of T, and let T∗ = ⋃_{n=1}^∞ T^(n) denote the set of all finite sequences of distinct elements of T. If n ∈ N₊, t = (t₁, t₂, …, t_n) ∈ T^(n) and π is a permutation of {1, 2, …, n}, let tπ denote the element of T^(n) with coordinates (tπ)_i = t_{π(i)}. That is, we permute the coordinates of t according to π. If C ∈ Sⁿ, let

πC = {(x₁, x₂, …, x_n) ∈ Sⁿ : (x_{π(1)}, x_{π(2)}, …, x_{π(n)}) ∈ C} ∈ Sⁿ (2.10.14)

Finally, if n > 1, let t⁻ denote the vector (t₁, t₂, …, t_{n−1}) ∈ T^(n−1).

Now suppose that P_t is a probability measure on (Sⁿ, Sⁿ) for each n ∈ N₊ and t ∈ T^(n). The idea, of course, is that we want the collection P = {P_t : t ∈ T∗} to be the finite dimensional distributions of a random process with index set T and state space (S, S). Here is the critical definition:

The collection of probability distributions P relative to T and (S, S) is consistent if

1. P_{tπ}(C) = P_t(πC) for every n ∈ N₊, t ∈ T^(n), permutation π of {1, 2, …, n}, and measurable C ⊆ Sⁿ.
2. P_{t⁻}(C) = P_t(C × S) for every n > 1, t ∈ T^(n), and measurable C ⊆ Sⁿ⁻¹.

With the proper definition of consistency, we can state the fundamental theorem.

Kolmogorov Existence Theorem. If P is a consistent collection of probability distributions relative to the index set T and the state space (S, S), then there exists a probability space (Ω, F, P) and a stochastic process X = {X_t : t ∈ T} on this probability space such that P is the collection of finite dimensional distributions of X.


Proof sketch

Note that except for the more complicated notation, the construction is very similar to the one for a sequence of independent variables. Again, X is essentially the identity function on Ω = Sᵀ. The important and more difficult part is the construction of the probability measure P on (Ω, F).

Applications
Our last discussion is a summary of the stochastic processes that are studied in this text. All are classics and are immensely important in
applications.

Random processes associated with Bernoulli trials include


1. the Bernoulli trials sequence itself
2. the sequence of binomial variables
3. the sequence of geometric variables
4. the sequence of negative binomial variables
5. the simple random walk
Construction

Random processes associated with the Poisson model include


1. the sequence of inter-arrival times
2. the sequence of arrival times
3. the counting process on [0, ∞), both in the homogeneous and non-homogeneous cases.
4. the compound Poisson process
5. the counting process on a general measure space
Constructions

Random processes associated with renewal theory include


1. the sequence of inter-arrival times
2. the sequence of arrival times
3. the counting process on [0, ∞)

Markov chains form a very important family of random processes as do Brownian motion and related processes. We will study these in
subsequent chapters.

This page titled 2.10: Stochastic Processes is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

2.11: Filtrations and Stopping Times

Introduction
Suppose that X = {X_t : t ∈ T} is a stochastic process with state space (S, S) defined on an underlying probability space (Ω, F, P). To review, Ω is the set of outcomes, F the σ-algebra of events, and P the probability measure on (Ω, F). Also S is the set of states, and S the σ-algebra of admissible subsets of S. Usually, S is a topological space and S the Borel σ-algebra generated by the open subsets of S. A standard set of assumptions is that the topology is locally compact, Hausdorff, and has a countable base, which we will abbreviate by LCCB. For the index set, we assume that either T = N or T = [0, ∞) and as usual in these cases, we interpret the elements of T as points of time. The set T is also given a topology, the discrete topology in the first case and the standard Euclidean topology in the second case, and then the Borel σ-algebra T. So in discrete time with T = N, T = P(T), the power set of T, so every subset of T is measurable, as is every function from T into another measurable space. Finally, X_t is a random variable and so by definition is measurable with respect to F and S for each t ∈ T. We interpret X_t as the state of some random system at time t ∈ T. Many important concepts involving X are based on how the future behavior of the process depends on the past behavior, relative to a given current time.
For t ∈ T, let F_t = σ{X_s : s ∈ T, s ≤ t}, the σ-algebra of events that can be defined in terms of the process up to time t. Roughly speaking, for a given A ∈ F_t, we can tell whether or not A has occurred if we are allowed to observe the process up to time t. The family of σ-algebras F = {F_t : t ∈ T} has two critical properties: the family is increasing in t ∈ T, relative to the subset partial order, and all of the σ-algebras are sub σ-algebras of F. That is, for s, t ∈ T with s ≤ t, we have F_s ⊆ F_t ⊆ F.

Filtrations
Basic Definitions
Sometimes we need σ-algebras that are a bit larger than the ones in the last paragraph. For example, there may be other random
variables that we get to observe, as time goes by, besides the variables in X. Sometimes, particularly in continuous time, there are
technical reasons for somewhat different σ-algebras. Finally, we may want to describe how our information grows, as a family of
σ-algebras, without reference to a random process. For the remainder of this section, we have a fixed measurable space (Ω, F )

which we again think of as a sample space, and the time space (T , T ) as described above.

A family of σ-algebras F = {F_t : t ∈ T} is a filtration on (Ω, F) if s, t ∈ T and s ≤ t imply F_s ⊆ F_t ⊆ F. The object (Ω, F, F) is a filtered sample space. If P is a probability measure on (Ω, F), then (Ω, F, F, P) is a filtered probability space.

So a filtration is simply an increasing family of sub-σ-algebras of F, indexed by T. We think of F_t as the σ-algebra of events up to time t ∈ T. The larger the σ-algebras in a filtration, the more events that are available, so the following relation on filtrations is natural.

Suppose that F = {F_t : t ∈ T} and G = {G_t : t ∈ T} are filtrations on (Ω, F). We say that F is coarser than G and G is finer than F, and we write F ⪯ G, if F_t ⊆ G_t for all t ∈ T. The relation ⪯ is a partial order on the collection of filtrations on (Ω, F). That is, if F = {F_t : t ∈ T}, G = {G_t : t ∈ T}, and H = {H_t : t ∈ T} are filtrations then

1. F ⪯ F, the reflexive property.
2. If F ⪯ G and G ⪯ F then F = G, the antisymmetric property.
3. If F ⪯ G and G ⪯ H then F ⪯ H, the transitive property.
Proof

So the coarsest filtration on (Ω, F) is the one where F_t = {Ω, ∅} for every t ∈ T, while the finest filtration is the one where F_t = F for every t ∈ T. In the first case, we gain no information as time evolves, and in the second case, we have complete information from the beginning of time. Usually neither of these is realistic.
information from the beginning of time. Usually neither of these is realistic.


It's also natural to consider the σ-algebra that encodes our information over all time.

For a filtration F = {F_t : t ∈ T} on (Ω, F), define F_∞ = σ(⋃{F_t : t ∈ T}). Then

1. F_∞ = σ(⋃{F_t : t ∈ T, t ≥ s}) for s ∈ T.
2. F_t ⊆ F_∞ for t ∈ T.
Proof

Of course, it may be the case that F_∞ = F, but not necessarily. Recall that the intersection of a collection of σ-algebras on (Ω, F) is another σ-algebra. We can use this to create new filtrations from a collection of given filtrations.

Suppose that F_i = {F_t^i : t ∈ T} is a filtration on (Ω, F) for each i in a nonempty index set I. Then F = {F_t : t ∈ T} where F_t = ⋂_{i∈I} F_t^i for t ∈ T is also a filtration on (Ω, F). This filtration is sometimes denoted F = ⋀_{i∈I} F_i, and is the finest filtration that is coarser than F_i for every i ∈ I.

Proof

Unions of σ-algebras are not in general σ-algebras, but we can construct a new filtration from a given collection of filtrations using
unions in a natural way.

Suppose again that F_i = {F_t^i : t ∈ T} is a filtration on (Ω, F) for each i in a nonempty index set I. Then F = {F_t : t ∈ T} where F_t = σ(⋃_{i∈I} F_t^i) for t ∈ T is also a filtration on (Ω, F). This filtration is sometimes denoted F = ⋁_{i∈I} F_i, and is the coarsest filtration that is finer than F_i for every i ∈ I.

Proof

Stochastic Processes
Note again that we can have a filtration without an underlying stochastic process in the background. However, we usually do have a stochastic process X = {X_t : t ∈ T}, and in this case the filtration F^0 = {F_t^0 : t ∈ T} where F_t^0 = σ{X_s : s ∈ T, s ≤ t} is the natural filtration associated with X. More generally, the following definition is appropriate.

A stochastic process X = {X_t : t ∈ T} on (Ω, F) is adapted to a filtration F = {F_t : t ∈ T} on (Ω, F) if X_t is measurable with respect to F_t for each t ∈ T.

Equivalently, X is adapted to F if F is finer than F^0, the natural filtration associated with X. That is, σ{X_s : s ∈ T, s ≤ t} ⊆ F_t for each t ∈ T. So clearly, if X is adapted to a filtration, then it is adapted to any finer filtration, and F^0 is the coarsest filtration to which X is adapted. The basic idea behind the definition is that if the filtration F encodes our information as time goes by, then the process X is observable. In discrete time, there is a related definition.

Suppose that T = N. A stochastic process X = {X_n : n ∈ N} is predictable by the filtration F = {F_n : n ∈ N} if X_{n+1} is measurable with respect to F_n for each n ∈ N.

Clearly if X is predictable by F then X is adapted to F. But predictable is better than adapted, in the sense that if F encodes our information as time goes by, then we can look one step into the future in terms of X: at time n we can determine X_{n+1}. The concept of predictability can be extended to continuous time, but the definition is much more complicated.
Note that ultimately, a stochastic process X = {X_t : t ∈ T} with sample space (Ω, F) and state space (S, S) can be viewed as a function from Ω × T into S, so X_t(ω) ∈ S is the state at time t ∈ T corresponding to the outcome ω ∈ Ω. By definition, ω ↦ X_t(ω) is measurable for each t ∈ T, but it is often necessary for the process to be jointly measurable in ω and t.

Suppose that X = {X_t : t ∈ T} is a stochastic process with sample space (Ω, F) and state space (S, S). Then X is measurable if X : Ω × T → S is measurable with respect to F ⊗ T and S.

When we have a filtration, as we usually do, there is a stronger condition that is natural. Let T_t = {s ∈ T : s ≤ t} for t ∈ T, and let T_t = {A ∩ T_t : A ∈ T} be the corresponding induced σ-algebra.

Suppose that X = {X_t : t ∈ T} is a stochastic process with sample space (Ω, F) and state space (S, S), and that F = {F_t : t ∈ T} is a filtration. Then X is progressively measurable relative to F if X : Ω × T_t → S is measurable with respect to F_t ⊗ T_t and S for each t ∈ T.

Clearly if X is progressively measurable with respect to a filtration, then it is progressively measurable with respect to any finer filtration. Of course, when T is discrete, any process X is measurable, and any process X adapted to F is progressively measurable, so these definitions are only of interest in the case of continuous time.

Suppose again that X = {X_t : t ∈ T} is a stochastic process with sample space (Ω, F) and state space (S, S), and that F = {F_t : t ∈ T} is a filtration. If X is progressively measurable relative to F then
1. X is measurable.
2. X is adapted to F.
Proof

When the state space is a topological space (which is usually the case), then as you might guess, there is a natural link between
continuity of the sample paths and progressive measurability.

Suppose that S has an LCCB topology and that S is the σ-algebra of Borel sets. Suppose also that X = {X_t : t ∈ [0, ∞)} is right continuous. Then X is progressively measurable relative to the natural filtration F^0.

So if X is right continuous, then X is progressively measurable with respect to any filtration to which X is adapted. Recall that in
the previous section, we studied different ways that two stochastic processes can be equivalent. The following example illustrates
some of the subtleties of processes in continuous time.

Suppose that Ω = T = [0, ∞), F = T is the σ-algebra of Borel measurable subsets of [0, ∞), and P is any continuous probability measure on (Ω, F). Let S = {0, 1} and S = P(S) = {∅, {0}, {1}, {0, 1}}. For t ∈ T and ω ∈ Ω, define X_t(ω) = 1(ω = t) and Y_t(ω) = 0. Then
1. X = {X_t : t ∈ T} is a version of Y = {Y_t : t ∈ T}.
2. X is not adapted to the natural filtration of Y.


Proof

Completion
Suppose now that P is a probability measure on (Ω, F ). Recall that F is complete with respect to P if A ∈ F , B ⊆ A , and
P (A) = 0 imply B ∈ F (and hence P (B) = 0 ). That is, if A is an event with probability 0 and B ⊆ A , then B is also an event

(and also has probability 0). For a filtration, the following definition is appropriate.

The filtration {F_t : t ∈ T} is complete with respect to a probability measure P on (Ω, F) if
1. F is complete with respect to P.
2. If A ∈ F and P(A) = 0 then A ∈ F_0.

Suppose P is a probability measure on (Ω, F) and that the filtration {F_t : t ∈ T} is complete with respect to P. If A ∈ F is a null event (P(A) = 0) or an almost certain event (P(A) = 1) then A ∈ F_t for every t ∈ T.

Proof

Recall that if P is a probability measure on (Ω, F), but F is not complete with respect to P, then F can always be completed. Here's a review of how this is done: Let
N = {A ⊆ Ω : there exists N ∈ F with P(N) = 0 and A ⊆ N}   (2.11.2)
So N is the collection of null sets. Then we let F^P = σ(F ∪ N) and extend P to F^P in the natural way: if A ∈ F^P and A differs from B ∈ F by a null set, then P(A) = P(B). Filtrations can also be completed.

Suppose that F = {F_t : t ∈ T} is a filtration on (Ω, F) and that P is a probability measure on (Ω, F). As above, let N denote the collection of null subsets of Ω, and for t ∈ T, let F_t^P = σ(F_t ∪ N). Then F^P = {F_t^P : t ∈ T} is a filtration on (Ω, F^P) that is finer than F and is complete relative to P.

Proof

Naturally, F^P is the completion of F with respect to P. Sometimes we need to consider all probability measures on (Ω, F).

Let P denote the collection of probability measures on (Ω, F), and suppose that F = {F_t : t ∈ T} is a filtration on (Ω, F). Let F^∗ = ⋂{F^P : P ∈ P} (a σ-algebra on Ω), and let F^∗ = ⋀{F^P : P ∈ P} (the infimum of the completed filtrations). Then F^∗ is a filtration on (Ω, F^∗), known as the universal completion of F.
Proof

The last definition must seem awfully obscure, but it does have a place. In the theory of Markov processes, we usually allow
arbitrary initial distributions, which in turn produces a large collection of distributions on the sample space.

Right Continuity
In continuous time, we sometimes need to refine a given filtration somewhat.

Suppose that F = { Ft : t ∈ [0, ∞)} is a filtration on (Ω, F ). For t ∈ [0, ∞), define Ft+ = ⋂{ Fs : s ∈ (t, ∞)} . Then
F+ = { Ft+ : t ∈ T } is also a filtration on (Ω, F ) and is finer than F.
Proof

Since the σ-algebras in a filtration are increasing, it follows that for t ∈ [0, ∞), F_{t+} = ⋂{F_s : s ∈ (t, t + ϵ)} for every ϵ ∈ (0, ∞). So if the filtration F encodes the information available as time goes by, then the filtration F_+ allows an “infinitesimal peek into the future” at each t ∈ [0, ∞). In light of the previous result, the next definition is natural.

A filtration F = {F t : t ∈ [0, ∞)} is right continuous if F + =F , so that F t+ = Ft for every t ∈ [0, ∞).

Right continuous filtrations have some nice properties, as we will see later. If the original filtration is not right continuous, the
slightly refined filtration is:

Suppose again that F = {F_t : t ∈ [0, ∞)} is a filtration. Then F_+ is a right continuous filtration.

Proof

For a stochastic process X = {X_t : t ∈ [0, ∞)} in continuous time, often the filtration F that is most useful is the right-continuous refinement of the natural filtration. That is, F = F^0_+, so that F_t = F^0_{t+} = ⋂{σ{X_s : s ∈ [0, u]} : u ∈ (t, ∞)} for t ∈ [0, ∞).

Stopping Times
Basic Properties
Suppose again that we have a fixed sample space (Ω, F). Random variables taking values in the time set T are important, but often, as we will see, it's necessary to allow such variables to take the value ∞ as well as finite times. So let T_∞ = T ∪ {∞}. We extend the order to T_∞ by the obvious rule that t < ∞ for every t ∈ T. We also extend the topology on T to T_∞ by the rule that for each s ∈ T, the set {t ∈ T_∞ : t > s} is an open neighborhood of ∞. That is, T_∞ is the one-point compactification of T. The reason for this is to preserve the meaning of time converging to infinity. That is, if (t_1, t_2, …) is a sequence in T_∞ then t_n → ∞ as n → ∞ if and only if, for every t ∈ T there exists m ∈ N_+ such that t_n > t for n > m. We then give T_∞ the Borel σ-algebra T_∞ as before. In discrete time, this is once again the discrete σ-algebra, so that all subsets are measurable. In both cases, we now have an enhanced time space (T_∞, T_∞). A random variable τ taking values in T_∞ is called a random time.

Suppose that F = {F_t : t ∈ T} is a filtration on (Ω, F). A random time τ is a stopping time relative to F if {τ ≤ t} ∈ F_t for each t ∈ T.

In a sense, a stopping time is a random time that does not require that we see into the future. That is, we can tell whether or not τ ≤ t from our information at time t. The term stopping time comes from gambling. Consider a gambler betting on games of chance. The gambler's decision to stop gambling at some point in time and accept his fortune must define a stopping time. That is, the gambler can base his decision to stop gambling on all of the information that he has at that point in time, but not on what will happen in the future. The terms Markov time and optional time are sometimes used instead of stopping time. If τ is a stopping time relative to a filtration, then it is also a stopping time relative to any finer filtration:

Suppose that F = {F_t : t ∈ T} and G = {G_t : t ∈ T} are filtrations on (Ω, F), and that G is finer than F. If a random time τ is a stopping time relative to F then τ is a stopping time relative to G.

Proof

So, the finer the filtration, the larger the collection of stopping times. In fact, every random time is a stopping time relative to the finest filtration F where F_t = F for every t ∈ T. But this filtration corresponds to having complete information from the beginning of time, which of course is usually not sensible. At the other extreme, for the coarsest filtration F where F_t = {Ω, ∅} for every t ∈ T, the only stopping times are constants, that is, random times of the form τ(ω) = t for every ω ∈ Ω, for some t ∈ T_∞.
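The gambling interpretation is easy to illustrate in discrete time. Here is a minimal Python sketch (the names and parameters are illustrative, not from the text): for a simple random walk, the first time the walk reaches a given level is a stopping time, since {τ ≤ n} is determined by the first n steps; by contrast, the time of the overall maximum is a random time that is generally not a stopping time, since it depends on the future of the path.

```python
import random

random.seed(7)
path = [0]
for _ in range(20):                      # a simple random walk of 20 steps
    path.append(path[-1] + random.choice([-1, 1]))

def first_hit(path, level):
    """tau = first time the walk reaches `level`; a stopping time, since
    the event {tau <= n} depends only on path[0], ..., path[n]."""
    for n, x in enumerate(path):
        if x == level:
            return n
    return float('inf')                  # tau = infinity if never reached

tau = first_hit(path, 3)
argmax = max(range(len(path)), key=lambda n: path[n])   # needs the whole path
print(tau, argmax)
```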

Suppose again that F = {F_t : t ∈ T} is a filtration on (Ω, F). A random time τ is a stopping time relative to F if and only if {τ > t} ∈ F_t for each t ∈ T.

Proof

Suppose again that F = {F_t : t ∈ T} is a filtration on (Ω, F), and that τ is a stopping time relative to F. Then
1. {τ < t} ∈ F_t for every t ∈ T.
2. {τ ≥ t} ∈ F_t for every t ∈ T.
3. {τ = t} ∈ F_t for every t ∈ T.
Proof

Note that when T = N, we actually showed that {τ < t} ∈ F_{t−1} and {τ ≥ t} ∈ F_{t−1}. The converse to part (a) (or equivalently (b)) is not true, but in continuous time there is a connection to the right-continuous refinement of the filtration.

Suppose that T = [0, ∞) and that F = {F_t : t ∈ [0, ∞)} is a filtration on (Ω, F). A random time τ is a stopping time relative to F_+ if and only if {τ < t} ∈ F_t for every t ∈ [0, ∞).

Proof

If F = {F_t : t ∈ [0, ∞)} is a filtration and τ is a random time that satisfies {τ < t} ∈ F_t for every t ∈ T, then some authors call τ a weak stopping time or say that τ is weakly optional for the filtration F. But to me, the increase in jargon is not worthwhile, and it's better to simply say that τ is a stopping time for the filtration F_+. This leads to the following corollary.
+

Suppose that T = [0, ∞) and that F = {F_t : t ∈ [0, ∞)} is a right-continuous filtration. A random time τ is a stopping time relative to F if and only if {τ < t} ∈ F_t for every t ∈ [0, ∞).

The converse to part (c) of the result above holds in discrete time.

Suppose that T = N and that F = {F_n : n ∈ N} is a filtration on (Ω, F). A random time τ is a stopping time for F if and only if {τ = n} ∈ F_n for every n ∈ N.

Proof

Basic Constructions
As noted above, a constant element of T_∞ is a stopping time, but not a very interesting one.

Suppose s ∈ T_∞ and that τ(ω) = s for all ω ∈ Ω. Then τ is a stopping time relative to any filtration on (Ω, F).
Proof

If the filtration {F_t : t ∈ T} is complete, then a random time that is almost certainly a constant is also a stopping time. The following theorems give some basic ways of constructing new stopping times from ones we already have.

Suppose that F = {F_t : t ∈ T} is a filtration on (Ω, F) and that τ_1 and τ_2 are stopping times relative to F. Then each of the following is also a stopping time relative to F:
1. τ_1 ∨ τ_2 = max{τ_1, τ_2}
2. τ_1 ∧ τ_2 = min{τ_1, τ_2}
3. τ_1 + τ_2

Proof

It follows that if (τ_1, τ_2, …, τ_n) is a finite sequence of stopping times relative to F, then each of the following is also a stopping time relative to F:
τ_1 ∨ τ_2 ∨ ⋯ ∨ τ_n
τ_1 ∧ τ_2 ∧ ⋯ ∧ τ_n
τ_1 + τ_2 + ⋯ + τ_n

We have to be careful when we try to extend these results to infinite sequences.

Suppose that F = {F_t : t ∈ T} is a filtration on (Ω, F), and that (τ_n : n ∈ N_+) is a sequence of stopping times relative to F. Then sup{τ_n : n ∈ N_+} is also a stopping time relative to F.
Proof

Suppose that F = {F_t : t ∈ T} is a filtration on (Ω, F), and that (τ_n : n ∈ N_+) is an increasing sequence of stopping times relative to F. Then lim_{n→∞} τ_n is a stopping time relative to F.

Proof

Suppose that T = [0, ∞) and that F = {F_t : t ∈ T} is a filtration on (Ω, F). If (τ_n : n ∈ N_+) is a sequence of stopping times relative to F, then each of the following is a stopping time relative to F_+:
1. inf{τ_n : n ∈ N_+}
2. lim inf_{n→∞} τ_n
3. lim sup_{n→∞} τ_n

Proof

As a simple corollary, we have the following results:

Suppose that T = [0, ∞) and that F = {F_t : t ∈ T} is a right-continuous filtration on (Ω, F). If (τ_n : n ∈ N_+) is a sequence of stopping times relative to F, then each of the following is also a stopping time relative to F:
1. inf{τ_n : n ∈ N_+}
2. lim inf_{n→∞} τ_n
3. lim sup_{n→∞} τ_n

The σ-Algebra of a Stopping Time


Consider again the general setting of a filtration F = {F_t : t ∈ T} on the sample space (Ω, F), and suppose that τ is a stopping time relative to F. We want to define the σ-algebra F_τ of events up to the random time τ, analogous to F_t, the σ-algebra of events up to a fixed time t ∈ T. Here is the appropriate definition:

Suppose that F = {F_t : t ∈ T} is a filtration on (Ω, F) and that τ is a stopping time relative to F. Define F_τ = {A ∈ F : A ∩ {τ ≤ t} ∈ F_t for all t ∈ T}. Then F_τ is a σ-algebra.

Proof

Thus, an event A is in F_τ if we can determine whether A and τ ≤ t both occurred given our information at time t. If τ is constant, then F_τ reduces to the corresponding member of the original filtration, which clearly should be the case, and is additional motivation for the definition.

Suppose again that F = {F_t : t ∈ T} is a filtration on (Ω, F). Fix s ∈ T and define τ(ω) = s for all ω ∈ Ω. Then F_τ = F_s.

Proof

Clearly, if we have the information available in F_τ, then we should know the value of τ itself. This is also true:

Suppose again that F = {F_t : t ∈ T} is a filtration on (Ω, F) and that τ is a stopping time relative to F. Then τ is measurable with respect to F_τ.

Proof

Here are other results that relate the σ-algebra of a stopping time to the original filtration.

Suppose again that F = {F_t : t ∈ T} is a filtration on (Ω, F) and that τ is a stopping time relative to F. If A ∈ F_τ then for t ∈ T,
1. A ∩ {τ < t} ∈ F_t
2. A ∩ {τ = t} ∈ F_t

Proof

The σ-algebra of a stopping time relative to a filtration is related to the σ-algebra of the stopping time relative to a finer filtration in
the natural way.

Suppose that F = {F_t : t ∈ T} and G = {G_t : t ∈ T} are filtrations on (Ω, F) and that G is finer than F. If τ is a stopping time relative to F then F_τ ⊆ G_τ.

Proof

When two stopping times are ordered, their σ-algebras are also ordered.

Suppose that F = {F_t : t ∈ T} is a filtration on (Ω, F) and that ρ and τ are stopping times for F with ρ ≤ τ. Then F_ρ ⊆ F_τ.

Proof

Suppose again that F = {F_t : t ∈ T} is a filtration on (Ω, F), and that ρ and τ are stopping times for F. Then each of the following events is in F_τ and in F_ρ.

1. {ρ < τ }
2. {ρ = τ }
3. {ρ > τ }
4. {ρ ≤ τ }
5. {ρ ≥ τ }
Proof

We can “stop” a filtration at a stopping time. In the next subsection, we will stop a stochastic process in the same way.

Suppose again that F = {F_t : t ∈ T} is a filtration on (Ω, F), and that τ is a stopping time for F. For t ∈ T define F_t^τ = F_{t∧τ}. Then F^τ = {F_t^τ : t ∈ T} is a filtration and is coarser than F.

Proof

Stochastic Processes
As usual, the most common setting is when we have a stochastic process X = {X : t ∈ T } defined on our sample space (Ω, F )
t

and with state space (S, S ). If τ is a random time, we are often interested in the state X at the random time. But there are two
τ

issues. First, τ may take the value infinity, in which case X is not defined. The usual solution is to introduce a new “death state” δ ,
τ

and define X = δ . The σ-algebra S on S is extended to S = S ∪ {δ} in the natural way, namely S = σ(S ∪ {δ}) .
∞ δ δ

Our other problem is that we naturally expect X_τ to be a random variable (that is, measurable), just as X_t is a random variable for a deterministic t ∈ T. Moreover, if X is adapted to a filtration F = {F_t : t ∈ T}, then we would naturally also expect X_τ to be measurable with respect to F_τ, just as X_t is measurable with respect to F_t for deterministic t ∈ T. But this is not obvious, and in fact is not true without additional assumptions. Note that X_τ is a random state at a random time, and so depends on an outcome ω ∈ Ω in two ways: X_{τ(ω)}(ω).

Suppose that X = {X_t : t ∈ T} is a stochastic process on the sample space (Ω, F) with state space (S, S), and that X is measurable. If τ is a finite random time, then X_τ is measurable. That is, X_τ is a random variable with values in S.

Proof

This result is one of the main reasons for the definition of a measurable process in the first place. Sometimes we literally want to
stop the random process at a random time τ . As you might guess, this is the origin of the term stopping time.

Suppose again that X = {X_t : t ∈ T} is a stochastic process on the sample space (Ω, F) with state space (S, S), and that X is measurable. If τ is a random time, then the process X^τ = {X_t^τ : t ∈ T} defined by X_t^τ = X_{t∧τ} for t ∈ T is the process X stopped at τ.
Proof
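In discrete time, the stopped process is simple to compute. Here is a minimal sketch (the path and stopping time below are just an example, not from the text):

```python
path = [0, 1, 0, 1, 2, 3, 2, 3, 4, 3]    # a sample path of a random walk
tau = 5                                   # say, the first time the path hits 3

def stopped(path, tau):
    """The stopped process: X^tau_n = X_{min(n, tau)}, so the path is
    followed up to time tau and frozen at X_tau thereafter."""
    return [path[min(n, tau)] for n in range(len(path))]

print(stopped(path, tau))
# [0, 1, 0, 1, 2, 3, 3, 3, 3, 3] -- constant at X_tau = 3 from time 5 on
```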

When the original process is progressively measurable, so is the stopped process.

Suppose again that X = {X_t : t ∈ T} is a stochastic process on the sample space (Ω, F) with state space (S, S), and that X is progressively measurable with respect to a filtration F = {F_t : t ∈ T}. If τ is a stopping time relative to F, then the stopped process X^τ = {X_t^τ : t ∈ T} is progressively measurable with respect to the stopped filtration F^τ.

Since F is finer than F^τ, it follows that X^τ is also progressively measurable with respect to F.

Suppose again that X = {X_t : t ∈ T} is a stochastic process on the sample space (Ω, F) with state space (S, S), and that X is progressively measurable with respect to a filtration F = {F_t : t ∈ T} on (Ω, F). If τ is a finite stopping time relative to F then X_τ is measurable with respect to F_τ.

For many random processes, the first time that the process enters or hits a set of states is particularly important. In the discussion that follows, let T_+ = {t ∈ T : t > 0}, the set of positive times.

Suppose that X = {X_t : t ∈ T} is a stochastic process on (Ω, F) with state space (S, S). For A ∈ S, define
1. ρ_A = inf{t ∈ T : X_t ∈ A}, the first entry time to A.
2. τ_A = inf{t ∈ T_+ : X_t ∈ A}, the first hitting time to A.

As usual, inf(∅) = ∞, so ρ_A = ∞ if X_t ∉ A for all t ∈ T (the process never enters A), and τ_A = ∞ if X_t ∉ A for all t ∈ T_+ (the process never hits A). In discrete time, it's easy to see that these are stopping times.

Suppose that {X_n : n ∈ N} is a stochastic process on (Ω, F) with state space (S, S). If A ∈ S then τ_A and ρ_A are stopping times relative to the natural filtration F^0.

Proof

So of course in discrete time, τ_A and ρ_A are stopping times relative to any filtration F to which X is adapted. You might think that τ_A and ρ_A should always be stopping times, since τ_A ≤ t if and only if X_s ∈ A for some s ∈ T_+ with s ≤ t, and ρ_A ≤ t if and only if X_s ∈ A for some s ∈ T with s ≤ t. It would seem that these events are known if one is allowed to observe the process up to time t. The problem is that when T = [0, ∞), these are uncountable unions, so we need to make additional assumptions on the stochastic process X or the filtration F, or both.

Suppose that S has an LCCB topology, and that S is the σ-algebra of Borel sets. Suppose also that X = {X_t : t ∈ [0, ∞)} is right continuous and has left limits. Then τ_A and ρ_A are stopping times relative to F^0_+ for every open A ∈ S.

Here is another result that requires less of the stochastic process X, but more of the filtration F.

Suppose that X = {X_t : t ∈ [0, ∞)} is a stochastic process on (Ω, F) that is progressively measurable relative to a complete, right-continuous filtration F = {F_t : t ∈ [0, ∞)}. If A ∈ S then ρ_A and τ_A are stopping times relative to F.

This page titled 2.11: Filtrations and Stopping Times is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

CHAPTER OVERVIEW
3: Distributions
Recall that a probability distribution is just another name for a probability measure. Most distributions are associated with random
variables, and in fact every distribution can be associated with a random variable. In this chapter we explore the basic types of
probability distributions (discrete, continuous, mixed), and the ways that distributions can be defined using density functions,
distribution functions, and quantile functions. We also study the relationship between the distribution of a random vector and the
distributions of its components, conditional distributions, and how the distribution of a random variable changes when the variable
is transformed.
In the advanced sections, we study convergence in distribution, one of the most important types of convergence. We also construct
the abstract integral with respect to a positive measure and study the basic properties of the integral. This leads in turn to general (signed) measures, absolute continuity and singularity, and the existence of density functions. Finally, we study various vector spaces of functions that are defined by integral properties.
3.1: Discrete Distributions
3.2: Continuous Distributions
3.3: Mixed Distributions
3.4: Joint Distributions
3.5: Conditional Distributions
3.6: Distribution and Quantile Functions
3.7: Transformations of Random Variables
3.8: Convergence in Distribution
3.9: General Distribution Functions
3.10: The Integral With Respect to a Measure
3.11: Properties of the Integral
3.12: General Measures
3.13: Absolute Continuity and Density Functions
3.14: Function Spaces

This page titled 3: Distributions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

3.1: Discrete Distributions

Basic Theory
Definitions and Basic Properties
As usual, our starting point is a random experiment modeled by a probability space (S, S , P). So to review, S is the set of
outcomes, S the collection of events, and P the probability measure on the sample space (S, S ). We use the terms probability
measure and probability distribution synonymously in this text. Also, since we use a general definition of random variable, every
probability measure can be thought of as the probability distribution of a random variable, so we can always take this point of view
if we like. Indeed, most probability measures naturally have random variables associated with them.

Recall that the sample space (S, S) is discrete if S is countable and S = P(S) is the collection of all subsets of S. In this case, P is a discrete distribution and (S, S, P) is a discrete probability space.

For the remainder of our discussion we assume that (S, S, P) is a discrete probability space. In the picture below, the blue dots are intended to represent points of positive probability.

Figure 3.1.1 : A discrete distribution


It's very simple to describe a discrete probability distribution with the function that assigns probabilities to the individual points in
S.

The function f on S defined by f(x) = P({x}) for x ∈ S is the probability density function of P, and satisfies the following properties:
1. f(x) ≥ 0, x ∈ S
2. ∑_{x∈S} f(x) = 1
3. ∑_{x∈A} f(x) = P(A) for A ⊆ S

Proof

Property (c) is particularly important since it shows that a discrete probability distribution is completely determined by its
probability density function. Conversely, any function that satisfies properties (a) and (b) can be used to construct a discrete
probability distribution on S via property (c).

A nonnegative function f on S that satisfies ∑_{x∈S} f(x) = 1 is a (discrete) probability density function on S, and then P defined as follows is a probability measure on S:
P(A) = ∑_{x∈A} f(x),  A ⊆ S   (3.1.1)

Proof

Technically, f is the density of P relative to counting measure # on S . The technicalities are discussed in detail in the advanced
section on absolute continuity and density functions.
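Here is a minimal numerical sketch of these facts (the particular density is an arbitrary choice, not from the text): a discrete pdf on a finite set can be stored as a dictionary, and property (c) computes the probability of any event by summation.

```python
# An example pdf on S = {1, 2, 3, 4}: f(x) proportional to x.
f = {x: x / 10 for x in [1, 2, 3, 4]}

assert all(p >= 0 for p in f.values())            # property (a)
assert abs(sum(f.values()) - 1) < 1e-12           # property (b)

def prob(A):
    """P(A) = sum of f(x) over x in A, property (c)."""
    return sum(f[x] for x in A if x in f)

print(round(prob({2, 4}), 6))   # 0.6
```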

Figure 3.1.2 : A discrete distribution is completely determined by its probability density function.
The set of outcomes S is often a countable subset of some larger set, such as R^n for some n ∈ N_+. But not always. We might want

to consider a random variable with values in a deck of cards, or a set of words, or some other discrete population of objects. Of
course, we can always map a countable set S one-to-one into a Euclidean set, but it might be contrived or unnatural to do so. In any
event, if S is a subset of a larger set, we can always extend a probability density function f , if we want, to the larger set by defining
f (x) = 0 for x ∉ S . Sometimes this extension simplifies formulas and notation. Put another way, the “set of values” is often a

convenience set that includes the points with positive probability, but perhaps other points as well.

Suppose that f is a probability density function on S . Then {x ∈ S : f (x) > 0} is the support set of the distribution.

Values of x that maximize the probability density function are important enough to deserve a name.

Suppose again that f is a probability density function on S . An element x ∈ S that maximizes f is a mode of the distribution.

When there is only one mode, it is sometimes used as a measure of the center of the distribution.
A discrete probability distribution defined by a probability density function f is equivalent to a discrete mass distribution, with
total mass 1. In this analogy, S is the (countable) set of point masses, and f (x) is the mass of the point at x ∈ S . Property (c) in (2)
above simply means that the mass of a set A can be found by adding the masses of the points in A .
But let's consider a probabilistic interpretation, rather than one from physics. We start with a basic random variable X for an
experiment, defined on a probability space (Ω, F , P). Suppose that X has a discrete distribution on S with probability density
function f . So in this setting, f (x) = P(X = x) for x ∈ S . We create a new, compound experiment by conducting independent
repetitions of the original experiment. So in the compound experiment, we have a sequence of independent random variables
(X_1, X_2, …) each with the same distribution as X; in statistical terms, we are sampling from the distribution of X. Define
f_n(x) = (1/n) #{i ∈ {1, 2, …, n} : X_i = x} = (1/n) ∑_{i=1}^n 1(X_i = x),  x ∈ S   (3.1.3)

Note that f_n(x) is the relative frequency of outcome x ∈ S in the first n runs. Note also that f_n(x) is a random variable for the compound experiment for each x ∈ S. By the law of large numbers, f_n(x) should converge to f(x), in some sense, as n → ∞. The function f_n is called the empirical probability density function, and it is in fact a (random) probability density function, since it satisfies properties (a) and (b) of (2). Empirical probability density functions are displayed in most of the simulation apps that deal with discrete variables.
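The convergence of the empirical density to the true density is easy to see in simulation; here is a minimal sketch (the distribution and sample size are arbitrary choices):

```python
import random
from collections import Counter

random.seed(1)
f = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}              # true pdf on S
sample = random.choices(list(f), weights=list(f.values()), k=10_000)

counts = Counter(sample)
f_n = {x: counts[x] / len(sample) for x in f}      # empirical pdf f_n
print({x: (f[x], round(f_n[x], 3)) for x in f})
# By the law of large numbers, f_n(x) should be close to f(x) for each x.
```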
It's easy to construct discrete probability density functions from other nonnegative functions defined on a countable set.

Suppose that g is a nonnegative function defined on S, and let
c = ∑_{x∈S} g(x)   (3.1.4)
If 0 < c < ∞, then the function f defined by f(x) = g(x)/c for x ∈ S is a discrete probability density function on S.
Proof

Note that since we are assuming that g is nonnegative, c = 0 if and only if g(x) = 0 for every x ∈ S . At the other extreme, c = ∞
could only occur if S is infinite (and the infinite series diverges). When 0 < c < ∞ (so that we can construct the probability
density function f ), c is sometimes called the normalizing constant. This result is useful for constructing probability density
functions with desired functional properties (domain, shape, symmetry, and so on).
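A minimal computational sketch of this construction (the function g below is just an illustrative choice):

```python
S = range(1, 6)
g = {x: 2.0 ** (-x) for x in S}        # a nonnegative function on S

c = sum(g.values())                     # normalizing constant, with 0 < c < oo
f = {x: g[x] / c for x in S}            # f = g / c is a probability density

print(round(c, 6), round(sum(f.values()), 6))   # the second value is 1.0
```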

Conditional Densities
Suppose again that X is a random variable on a probability space (Ω, F, P) and that X takes values in our discrete set S. The distribution of X (and hence the probability density function of X) is based on the underlying probability measure on the sample space (Ω, F). This measure could be a conditional probability measure, conditioned on a given event E ∈ F (with P(E) > 0). The probability density function in this case is
f(x ∣ E) = P(X = x ∣ E),  x ∈ S   (3.1.6)

Except for notation, no new concepts are involved. Therefore, all results that hold for discrete probability density functions in
general have analogies for conditional discrete probability density functions.

For fixed E ∈ F with P(E) > 0, the function x ↦ f(x ∣ E) is a discrete probability density function on S. That is,
1. f(x ∣ E) ≥ 0 for x ∈ S.
2. ∑_{x∈S} f(x ∣ E) = 1
3. ∑_{x∈A} f(x ∣ E) = P(X ∈ A ∣ E) for A ⊆ S

Proof

In particular, the event E could be an event defined in terms of the random variable X itself.

Suppose that B ⊆ S and P(X ∈ B) > 0. The conditional probability density function of X given X ∈ B is the function on B defined by
f(x ∣ X ∈ B) = f(x) / P(X ∈ B) = f(x) / ∑_{y∈B} f(y),  x ∈ B   (3.1.7)

Proof

Note that the denominator is simply the normalizing constant for f restricted to B. Of course, f(x ∣ B) = 0 for x ∈ B^c.

Conditioning and Bayes' Theorem


Suppose again that X is a random variable defined on a probability space (Ω, F , P) and that X has a discrete distribution on S ,
with probability density function f . We assume that f (x) > 0 for x ∈ S so that the distribution has support S . The versions of the
law of total probability and Bayes' theorem given in the following theorems follow immediately from the corresponding results in
the section on Conditional Probability. Only the notation is different.

Law of Total Probability. If E ∈ F is an event then
P(E) = ∑_{x∈S} f(x) P(E ∣ X = x)   (3.1.8)

Proof

This result is useful, naturally, when the distribution of X and the conditional probability of E given the values of X are known.
When we compute P(E) in this way, we say that we are conditioning on X. Note that P(E), as expressed by the formula, is a
weighted average of P(E ∣ X = x) , with weight factors f (x), over x ∈ S .

Bayes' Theorem. If E ∈ F is an event with P(E) > 0 then
f(x ∣ E) = f(x) P(E ∣ X = x) / ∑_{y∈S} f(y) P(E ∣ X = y),  x ∈ S   (3.1.10)

Proof

Bayes' theorem, named for Thomas Bayes, is a formula for the conditional probability density function of X given E . Again, it is
useful when the quantities on the right are known. In the context of Bayes' theorem, the (unconditional) distribution of X is
referred to as the prior distribution and the conditional distribution as the posterior distribution. Note that the denominator in
Bayes' formula is P(E) and is simply the normalizing constant for the function x ↦ f (x)P(E ∣ X = x) .
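Here is a minimal numerical sketch of the law of total probability and Bayes' theorem together (the model is invented for illustration): X is the score of a fair die, and given X = x, the event E occurs with probability x/6.

```python
S = range(1, 7)
f = {x: 1 / 6 for x in S}                 # prior pdf of X: fair die score
p_E_given_x = {x: x / 6 for x in S}       # P(E | X = x)

# Law of total probability: condition on X.
p_E = sum(f[x] * p_E_given_x[x] for x in S)

# Bayes' theorem: posterior pdf of X given E.
posterior = {x: f[x] * p_E_given_x[x] / p_E for x in S}
print(round(p_E, 4), {x: round(p, 4) for x, p in posterior.items()})
# The posterior favors large values of x, since they make E more likely.
```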

Examples and Special Cases
We start with some simple (albeit somewhat artificial) discrete distributions. After that, we study three special parametric models—
the discrete uniform distribution, hypergeometric distributions, and Bernoulli trials. These models are very important, so when
working the computational problems that follow, try to see if the problem fits one of these models. As always, be sure to try the
problems yourself before looking at the answers and proofs in the text.

Simple Discrete Distributions


Let g be the function defined by g(n) = n(10 − n) for n ∈ {1, 2, … , 9}.
1. Find the probability density function f that is proportional to g , as in the result on normalizing constants above.
2. Sketch the graph of f and find the mode of the distribution.
3. Find P(3 ≤ N ≤ 6) where N has probability density function f .
Answer

Let g be the function defined by g(n) = n^2 (10 − n) for n ∈ {1, 2, …, 10}.
1. Find the probability density function f that is proportional to g .
2. Sketch the graph of f and find the mode of the distribution.
3. Find P(3 ≤ N ≤ 6) where N has probability density function f .
Answer

Let g be the function defined by g(x, y) = x + y for (x, y) ∈ {1, 2, 3}^2.

1. Sketch the domain of g .


2. Find the probability density function f that is proportional to g .
3. Find the mode of the distribution.
4. Find P(X > Y ) where (X, Y ) has probability density function f .
Answer

Let g be the function defined by g(x, y) = xy for (x, y) ∈ {(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)}.
1. Sketch the domain of g .
2. Find the probability density function f that is proportional to g .
3. Find the mode of the distribution.
4. Find P [(X, Y ) ∈ {(1, 2), (1, 3), (2, 2), (2, 3)}]where (X, Y ) has probability density function f .
Answer

Consider the following game: An urn initially contains one red and one green ball. A ball is selected at random, and if the ball
is green, the game is over. If the ball is red, the ball is returned to the urn, another red ball is added, and the game continues. At
each stage, a ball is selected at random, and if the ball is green, the game is over. If the ball is red, the ball is returned to the
urn, another red ball is added, and the game continues. Let X denote the length of the game (that is, the number of selections
required to obtain a green ball). Find the probability density function of X.
Solution

Discrete Uniform Distributions


An element X is chosen at random from a finite set S . The distribution of X is the discrete uniform distribution on S .
1. X has probability density function f given by f (x) = 1/#(S) for x ∈ S .
2. P(X ∈ A) = #(A)/#(S) for A ⊆ S .
Proof

Many random variables that arise in sampling or combinatorial experiments are transformations of uniformly distributed variables. The next few exercises review the standard methods of sampling from a finite population. The parameters m and n are positive integers.

Suppose that n elements are chosen at random, with replacement, from a set D with m elements. Let X denote the ordered sequence of elements chosen. Then X is uniformly distributed on the Cartesian power set S = D^n, and has probability density function f given by
f(x) = 1/m^n,  x ∈ S   (3.1.13)

Proof

Suppose that n elements are chosen at random, without replacement, from a set D with m elements (so n ≤ m). Let X denote the ordered sequence of elements chosen. Then X is uniformly distributed on the set S of permutations of size n chosen from D, and has probability density function f given by
f(x) = 1/m^{(n)},  x ∈ S   (3.1.14)
where m^{(n)} = m(m − 1)⋯(m − n + 1) is the falling power.

Proof

Suppose that n elements are chosen at random, without replacement, from a set D with m elements (so n ≤ m). Let W denote the unordered set of elements chosen. Then W is uniformly distributed on the set T of combinations of size n chosen from D, and has probability density function f given by
f(w) = 1 / \binom{m}{n},  w ∈ T   (3.1.15)

Proof

Suppose that X is uniformly distributed on a finite set S and that B is a nonempty subset of S . Then the conditional
distribution of X given X ∈ B is uniform on B .
Proof
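The three sampling formulas are direct counting computations; here is a minimal sketch using the Python standard library (the parameters m and n are arbitrary choices):

```python
from math import comb, perm

m, n = 50, 5    # population size and sample size

p_with_replacement = 1 / m ** n     # ordered sample with replacement
p_ordered = 1 / perm(m, n)          # ordered, without replacement: 1 / m^(n)
p_unordered = 1 / comb(m, n)        # unordered, without replacement

print(p_with_replacement, p_ordered, p_unordered)
```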

Hypergeometric Models
Suppose that a dichotomous population consists of m objects of two different types: r of the objects are type 1 and m − r are type
0. Here are some typical examples:
The objects are persons, each either male or female.
The objects are voters, each either a democrat or a republican.
The objects are devices of some sort, each either good or defective.
The objects are fish in a lake, each either tagged or untagged.
The objects are balls in an urn, each either red or green.
A sample of n objects is chosen at random (without replacement) from the population. Recall that this means that the samples, either ordered or unordered, are equally likely. Note that this probability model has three parameters: the population size m, the number of type 1 objects r, and the sample size n. Each is a nonnegative integer with r ≤ m and n ≤ m. Now, suppose that we keep track of order, and let X_i denote the type of the ith object chosen, for i ∈ {1, 2, …, n}. Thus, X_i is an indicator variable (that is, a variable that just takes values 0 and 1).

X = (X_1, X_2, …, X_n) has probability density function f given by
f(x_1, x_2, …, x_n) = r^{(y)} (m − r)^{(n−y)} / m^{(n)},  (x_1, x_2, …, x_n) ∈ {0, 1}^n, where y = x_1 + x_2 + ⋯ + x_n   (3.1.17)

Proof

Note that the value of f(x_1, x_2, …, x_n) depends only on y = x_1 + x_2 + ⋯ + x_n, and hence is unchanged if (x_1, x_2, …, x_n) is permuted. This means that (X_1, X_2, …, X_n) is exchangeable. In particular, the distribution of X_i is the same as the distribution of X_1, so P(X_i = 1) = r/m. Thus, the variables are identically distributed. Also the distribution of (X_i, X_j) is the same as the distribution of (X_1, X_2), so P(X_i = 1, X_j = 1) = r(r − 1)/[m(m − 1)]. Thus, X_i and X_j are not independent, and in fact are negatively correlated.

Now let Y denote the number of type 1 objects in the sample. Note that Y = ∑_{i=1}^n X_i. Any counting variable can be written as a sum of indicator variables.

Y has probability density function g given by
g(y) = \binom{r}{y} \binom{m−r}{n−y} / \binom{m}{n},  y ∈ {0, 1, …, n}   (3.1.18)
1. g(y − 1) < g(y) if and only if y < t where t = (r + 1)(n + 1)/(m + 2).
2. If t is not a positive integer, there is a single mode at ⌊t⌋.
3. If t is a positive integer, then there are two modes, at t − 1 and t.
Proof

The distribution defined by the probability density function in the last result is the hypergeometric distribution with parameters m, r, and n. The term hypergeometric comes from a certain class of special functions, but is not particularly helpful in terms of remembering the model. Nonetheless, we are stuck with it. The set of values {0, 1, …, n} is a convenience set: it contains all of the values that have positive probability, but depending on the parameters, some of the values may have probability 0. Recall our convention for binomial coefficients: \binom{k}{j} = 0 if j > k. Note also that the hypergeometric distribution is unimodal: the probability density function increases and then decreases, with either a single mode or two adjacent modes.
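A minimal computational sketch of the hypergeometric density and the mode rule (the parameters are arbitrary choices):

```python
from math import comb, floor

def hypergeom_pdf(y, m, r, n):
    """g(y) = C(r, y) C(m - r, n - y) / C(m, n); math.comb returns 0 when
    the lower index exceeds the upper, matching the convention above."""
    return comb(r, y) * comb(m - r, n - y) / comb(m, n)

m, r, n = 50, 30, 5
g = [hypergeom_pdf(y, m, r, n) for y in range(n + 1)]
t = (r + 1) * (n + 1) / (m + 2)
print(round(sum(g), 6))                                  # the density sums to 1
print(max(range(n + 1), key=g.__getitem__), floor(t))    # mode = floor(t) = 3
```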
We can extend the hypergeometric model to a population of three types. Thus, suppose that our population consists of m objects; r
of the objects are type 1, s are type 2, and m − r − s are type 0. Here are some examples:
The objects are voters, each a democrat, a republican, or an independent.
The objects are cicadas, each one of three species: tredecula, tredecassini, or tredecim
The objects are peaches, each classified as small, medium, or large.
The objects are faculty members at a university, each an assistant professor, or an associate professor, or a full professor.
Once again, a sample of n objects is chosen at random (without replacement). The probability model now has four parameters: the
population size m, the type sizes r and s , and the sample size n . All are nonnegative integers with r + s ≤ m and n ≤ m .
Moreover, we now need two random variables to keep track of the counts for the three types in the sample. Let Y denote the
number of type 1 objects in the sample and Z the number of type 2 objects in the sample.

(Y, Z) has probability density function h given by
h(y, z) = \binom{r}{y} \binom{s}{z} \binom{m−r−s}{n−y−z} / \binom{m}{n},  (y, z) ∈ {0, 1, …, n}^2 with y + z ≤ n   (3.1.19)

Proof

The distribution defined by the density function in the last exercise is the bivariate hypergeometric distribution with parameters m, r, s, and n. Once again, the domain given is a convenience set; it includes the set of points with positive probability, but depending on the parameters, may include points with probability 0. Clearly, the same general pattern applies to populations with even more types. However, because of all of the parameters, the formulas are not worth remembering in detail; rather, just note the pattern, and remember the combinatorial meaning of the binomial coefficient. The hypergeometric model will be revisited later in this chapter, in the section on joint distributions and in the section on conditional distributions. The hypergeometric distribution and the multivariate hypergeometric distribution are studied in detail in the chapter on Finite Sampling Models. This chapter contains a variety of distributions that are based on discrete uniform distributions.

Bernoulli Trials
A Bernoulli trials sequence is a sequence (X , X , …) of independent, identically distributed indicator variables. Random
1 2

variable X is the outcome of trial i, where in the usual terminology of reliability, 1 denotes success while 0 denotes failure, The
i

process is named for Jacob Bernoulli. Let p = P(X = 1) ∈ [0, 1] denote the success parameter of the process. Note that the
i

indicator variables in the hypergeometric model satisfy one of the assumptions of Bernoulli trials (identical distributions) but not
the other (independence).

X = (X_1, X_2, …, X_n) has probability density function f given by
f(x_1, x_2, …, x_n) = p^y (1 − p)^{n−y},  (x_1, x_2, …, x_n) ∈ {0, 1}^n, where y = x_1 + x_2 + ⋯ + x_n   (3.1.20)

Proof

Now let Y denote the number of successes in the first n trials. Note that Y = ∑_{i=1}^n X_i, so we see again that a complicated random variable can be written as a sum of simpler ones. In particular, a counting variable can always be written as a sum of indicator variables.

Y has probability density function g given by
g(y) = \binom{n}{y} p^y (1 − p)^{n−y},  y ∈ {0, 1, …, n}   (3.1.21)
1. g(y − 1) < g(y) if and only if y < t, where t = (n + 1)p.
2. If t is not a positive integer, there is a single mode at ⌊t⌋.
3. If t is a positive integer, then there are two modes, at t − 1 and t.
Proof

The distribution defined by the probability density function in the last theorem is called the binomial distribution with parameters n
and p. The distribution is unimodal: the probability density function at first increases and then decreases, with either a single mode
or two adjacent modes. The binomial distribution is studied in detail in the chapter on Bernoulli Trials.
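A minimal sketch of the binomial density together with a check of the mode rule (the parameters are arbitrary choices):

```python
from math import comb, floor

def binom_pdf(y, n, p):
    """g(y) = C(n, y) p^y (1 - p)^(n - y)."""
    return comb(n, y) * p ** y * (1 - p) ** (n - y)

n, p = 5, 0.4
g = [binom_pdf(y, n, p) for y in range(n + 1)]
t = (n + 1) * p                        # here t = 2.4, not an integer
print(round(sum(g), 6))                # the density sums to 1
print(max(range(n + 1), key=g.__getitem__), floor(t))   # single mode at 2
```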

Suppose that p > 0 and let N denote the trial number of the first success. Then N has probability density function h given by
h(n) = (1 − p)^{n−1} p,  n ∈ N_+   (3.1.22)
The probability density function h is decreasing and the mode is n = 1.


Proof

The distribution defined by the probability density function in the last exercise is the geometric distribution on N_+ with parameter p. The geometric distribution is studied in detail in the chapter on Bernoulli Trials.

Sampling Problems
In the following exercises, be sure to check if the problem fits one of the general models above.

An urn contains 30 red and 20 green balls. A sample of 5 balls is selected at random, without replacement. Let Y denote the
number of red balls in the sample.
1. Compute the probability density function of Y explicitly and identify the distribution by name and parameter values.
2. Graph the probability density function and identify the mode(s).
3. Find P(Y > 3) .
Answer

In the ball and urn experiment, select sampling without replacement and set m = 50 , r = 30 , and n = 5 . Run the experiment
1000 times and note the agreement between the empirical density function of Y and the probability density function.

An urn contains 30 red and 20 green balls. A sample of 5 balls is selected at random, with replacement. Let Y denote the
number of red balls in the sample.
1. Compute the probability density function of Y explicitly and identify the distribution by name and parameter values.
2. Graph the probability density function and identify the mode(s).
3. Find P(Y > 3) .
Answer

In the ball and urn experiment, select sampling with replacement and set m = 50 , r = 30 , and n = 5 . Run the experiment
1000 times and note the agreement between the empirical density function of Y and the probability density function.

A group of voters consists of 50 democrats, 40 republicans, and 30 independents. A sample of 10 voters is chosen at random,
without replacement. Let X denote the number of democrats in the sample and Y the number of republicans in the sample.
1. Give the probability density function of X.
2. Give the probability density function of Y .
3. Give the probability density function of (X, Y ).
4. Find the probability that the sample has at least 4 democrats and at least 4 republicans.
Answer

The Math Club at Enormous State University (ESU) has 20 freshmen, 40 sophomores, 30 juniors, and 10 seniors. A committee of 8 club members is chosen at random, without replacement, to organize π-day activities. Let X denote the number of freshmen in the sample, Y the number of sophomores, and Z the number of juniors.
1. Give the probability density function of X.
2. Give the probability density function of Y .
3. Give the probability density function of Z .
4. Give the probability density function of (X, Y ).
5. Give the probability density function of (X, Y , Z) .
6. Find the probability that the committee has no seniors.
Answer

Coins and Dice


Suppose that a coin with probability of heads p is tossed repeatedly, and the sequence of heads and tails is recorded.
1. Identify the underlying probability model by name and parameter.
2. Let Y denote the number of heads in the first n tosses. Give the probability density function of Y and identify the
distribution by name and parameters.
3. Let N denote the number of tosses needed to get the first head. Give the probability density function of N and identify the
distribution by name and parameter.
Answer

Suppose that a coin with probability of heads p = 0.4 is tossed 5 times. Let Y denote the number of heads.
1. Compute the probability density function of Y explicitly.
2. Graph the probability density function and identify the mode.
3. Find P(Y > 3) .
Answer

In the binomial coin experiment, set n = 5 and p = 0.4 . Run the experiment 1000 times and compare the empirical density
function of Y with the probability density function.

Suppose that a coin with probability of heads p = 0.2 is tossed until heads occurs. Let N denote the number of tosses.
1. Find the probability density function of N .
2. Find P(N ≤ 5) .
Answer

In the negative binomial experiment, set k = 1 and p = 0.2 . Run the experiment 1000 times and compare the empirical
density function with the probability density function.

Suppose that two fair, standard dice are tossed and the sequence of scores (X_1, X_2) recorded. Let Y = X_1 + X_2 denote the sum of the scores, U = min{X_1, X_2} the minimum score, and V = max{X_1, X_2} the maximum score.

1. Find the probability density function of (X_1, X_2). Identify the distribution by name.


2. Find the probability density function of Y .
3. Find the probability density function of U .

4. Find the probability density function of V .
5. Find the probability density function of (U , V ).
Answer

Note that (U , V ) in the last exercise could serve as the outcome of the experiment that consists of throwing two standard dice if we
did not bother to record order. Note from the previous exercise that this random vector does not have a uniform distribution when
the dice are fair. The mistaken idea that this vector should have the uniform distribution was the cause of difficulties in the early
development of probability.

In the dice experiment, select n = 2 fair dice. Select the following random variables and note the shape and location of the
probability density function. Run the experiment 1000 times. For each of the following variables, compare the empirical
density function with the probability density function.
1. Y , the sum of the scores.
2. U , the minimum score.
3. V , the maximum score.

In the die-coin experiment, a fair, standard die is rolled and then a fair coin is tossed the number of times showing on the die.
Let N denote the die score and Y the number of heads.
1. Find the probability density function of N . Identify the distribution by name.
2. Find the probability density function of Y .
Answer

Run the die-coin experiment 1000 times. For the number of heads, compare the empirical density function with the probability
density function.

Suppose that a bag contains 12 coins: 5 are fair, 4 are biased with probability of heads 1/3; and 3 are two-headed. A coin is chosen at random from the bag and tossed 5 times. Let V denote the probability of heads of the selected coin and let Y denote the number of heads.
1. Find the probability density function of V .
2. Find the probability density function of Y .
Answer

Compare the die-coin experiment with the bag of coins experiment. In the first experiment, we toss a coin with a fixed probability of heads a random number of times. In the second experiment, we effectively toss a coin with a random probability of heads a fixed number of times. In both cases, we can think of starting with a binomial distribution and randomizing one of the parameters.

In the coin-die experiment, a fair coin is tossed. If the coin lands tails, a fair die is rolled. If the coin lands heads, an ace-six flat die is tossed (faces 1 and 6 have probability 1/4 each, while faces 2, 3, 4, 5 have probability 1/8 each). Find the probability density function of the die score Y.
Answer

Run the coin-die experiment 1000 times, with the settings in the previous exercise. Compare the empirical density function
with the probability density function.

Suppose that a standard die is thrown 10 times. Let Y denote the number of times an ace or a six occurred. Give the probability
density function of Y and identify the distribution by name and parameter values in each of the following cases:
1. The die is fair.
2. The die is an ace-six flat.
Answer

Suppose that a standard die is thrown until an ace or a six occurs. Let N denote the number of throws. Give the probability
density function of N and identify the distribution by name and parameter values in each of the following cases:

1. The die is fair.
2. The die is an ace-six flat.
Answer

Fred and Wilma take turns tossing a coin with probability of heads p ∈ (0, 1): Fred first, then Wilma, then Fred again, and so
forth. The first person to toss heads wins the game. Let N denote the number of tosses, and W the event that Wilma wins.
1. Give the probability density function of N and identify the distribution by name.
2. Compute P(W ) and sketch the graph of this probability as a function of p.
3. Find the conditional probability density function of N given W .
Answer

The alternating coin tossing game is studied in more detail in the section on The Geometric Distribution in the chapter on Bernoulli
trials.

Suppose that k players each have a coin with probability of heads p, where k ∈ {2, 3, …} and where p ∈ (0, 1).
1. Suppose that the players toss their coins at the same time. Find the probability that there is an odd man, that is, one player
with a different outcome than all the rest.
2. Suppose now that the players repeat the procedure in part (a) until there is an odd man. Find the probability density
function of N , the number of rounds played, and identify the distribution by name.
Answer

The odd man out game is treated in more detail in the section on the Geometric Distribution in the chapter on Bernoulli Trials.

Cards

Recall that a poker hand consists of 5 cards chosen at random and without replacement from a standard deck of 52 cards. Let
X denote the number of spades in the hand and Y the number of hearts in the hand. Give the probability density function of

each of the following random variables, and identify the distribution by name:
1. X
2. Y
3. (X, Y )
Answer

Recall that a bridge hand consists of 13 cards chosen at random and without replacement from a standard deck of 52 cards. An
honor card is a card of denomination ace, king, queen, jack or 10. Let N denote the number of honor cards in the hand.
1. Find the probability density function of N and identify the distribution by name.
2. Find the probability that the hand has no honor cards. A hand of this kind is known as a Yarborough, in honor of the Second Earl of Yarborough.
Answer

In the most common high card point system in bridge, an ace is worth 4 points, a king is worth 3 points, a queen is worth 2
points, and a jack is worth 1 point. Find the probability density function of V , the point value of a random bridge hand.

Reliability

Suppose that in a batch of 500 components, 20 are defective and the rest are good. A sample of 10 components is selected at
random and tested. Let X denote the number of defectives in the sample.
1. Find the probability density function of X and identify the distribution by name and parameter values.
2. Find the probability that the sample contains at least one defective component.
Answer

A plant has 3 assembly lines that produce a certain type of component. Line 1 produces 50% of the components and has a defective rate of 4%; line 2 produces 30% of the components and has a defective rate of 5%; line 3 produces 20% of the components and has a defective rate of 1%. A component is chosen at random from the plant and tested.
1. Find the probability that the component is defective.
2. Given that the component is defective, find the conditional probability density function of the line that produced the
component.
Answer

Recall that in the standard model of structural reliability, a system consists of n components, each of which, independently of the others, is either working or failed. Let X_i denote the state of component i, where 1 means working and 0 means failed. Thus, the state vector is X = (X_1, X_2, …, X_n). The system as a whole is also either working or failed, depending only on the states of the components. Thus, the state of the system is an indicator random variable U = u(X) that depends on the states of the components according to a structure function u : {0, 1}^n → {0, 1}. In a series system, the system works if and only if every component works. In a parallel system, the system works if and only if at least one component works. In a k out of n system, the system works if and only if at least k of the n components work.
The reliability of a device is the probability that it is working. Let p = P(X = 1) denote the reliability of component i, so that
i i

p = (p , p , … , p ) is the vector of component reliabilities. Because of the independence assumption, the system reliability
1 2 n

depends only on the component reliabilities, according to a reliability function r(p) = P(U = 1) . Note that when all component
reliabilities have the same value p, the states of the components form a sequence of n Bernoulli trials. In this case, the system
reliability is, of course, a function of the common component reliability p.

Suppose that the component reliabilities all have the same value p. Let X denote the state vector and Y denote the number of
working components.
1. Give the probability density function of X.
2. Give the probability density function of Y and identify the distribution by name and parameter.
3. Find the reliability of the k out of n system.
Answer

Suppose that we have 4 independent components, with common reliability p = 0.8 . Let Y denote the number of working
components.
1. Find the probability density function of Y explicitly.
2. Find the reliability of the parallel system.
3. Find the reliability of the 2 out of 4 system.
4. Find the reliability of the 3 out of 4 system.
5. Find the reliability of the series system.
Answer
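Since Y is binomial with parameters n = 4 and p = 0.8, all five parts reduce to binomial sums. A minimal sketch (ours):

    from math import comb

    def k_out_of_n(k, n, p):
        # P(Y >= k) for Y binomial(n, p): reliability of the k out of n system
        return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

    p = 0.8
    for k in (1, 2, 3, 4):  # k = 1 is the parallel system, k = 4 the series system
        print(k, k_out_of_n(k, 4, p))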

Suppose that we have 4 independent components, with reliabilities p_1 = 0.6, p_2 = 0.7, p_3 = 0.8, and p_4 = 0.9. Let Y denote
the number of working components.
1. Find the probability density function of Y .
2. Find the reliability of the parallel system.
3. Find the reliability of the 2 out of 4 system.
4. Find the reliability of the 3 out of 4 system.
5. Find the reliability of the series system.
Answer

The Poisson Distribution


Suppose that a > 0. Define f by

f(n) = e^{−a} a^n / n!,  n ∈ N  (3.1.24)

1. f is a probability density function.


2. f (n − 1) < f (n) if and only if n < a .

3. If a is not a positive integer, there is a single mode at ⌊a⌋.
4. If a is a positive integer, there are two modes at a − 1 and a .
Proof

The distribution defined by the probability density function in the previous exercise is the Poisson distribution with parameter a ,
named after Simeon Poisson. Note that like the other named distributions we studied above (hypergeometric and binomial), the
Poisson distribution is unimodal: the probability density function at first increases and then decreases, with either a single mode or
two adjacent modes. The Poisson distribution is studied in detail in the Chapter on Poisson Processes, and is used to model the
number of “random points” in a region of time or space, under certain ideal conditions. The parameter a is proportional to the size
of the region of time or space.

Suppose that the customers arrive at a service station according to the Poisson model, at an average rate of 4 per hour. Thus,
the number of customers N who arrive in a 2-hour period has the Poisson distribution with parameter 8.
1. Find the modes.
2. Find P(N ≥ 6) .
Answer
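A short numerical check (ours) of both parts, using the Poisson pmf with parameter a = 8:

    from math import exp, factorial

    def poisson_pmf(n, a):
        return exp(-a) * a**n / factorial(n)

    a = 8
    print(1 - sum(poisson_pmf(n, a) for n in range(6)))     # P(N >= 6)
    print(max(range(30), key=lambda n: poisson_pmf(n, a)))  # a mode (7 and 8 tie)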

In the Poisson experiment, set r = 4 and t = 2 . Run the simulation 1000 times and compare the empirical density function to
the probability density function.

Suppose that the number of flaws N in a piece of fabric of a certain size has the Poisson distribution with parameter 2.5.
1. Find the mode.
2. Find P(N > 4) .
Answer

Suppose that the number of raisins N in a piece of cake has the Poisson distribution with parameter 10.
1. Find the modes.
2. Find P(8 ≤ N ≤ 12) .
Answer

A Zeta Distribution
Let g be the function defined by g(n) = 1/n^2 for n ∈ N_+.

1. Find the probability density function f that is proportional to g .


2. Find the mode of the distribution.
3. Find P(N ≤ 5) where N has probability density function f .
Answer

The distribution defined in the previous exercise is a member of the zeta family of distributions. Zeta distributions are used to
model sizes or ranks of certain types of objects, and are studied in more detail in the chapter on Special Distributions.

Benford's Law
Let f be the function defined by f(d) = log(d + 1) − log(d) = log(1 + 1/d) for d ∈ {1, 2, …, 9}. (The logarithm function is
the base 10 common logarithm, not the base e natural logarithm.)
1. Show that f is a probability density function.
2. Compute the values of f explicitly, and sketch the graph.
3. Find P(X ≤ 3) where X has probability density function f .
Answer
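A two-line numerical check (ours): the Benford probabilities sum to 1, and P(X ≤ 3) = log(4) ≈ 0.602.

    from math import log10

    f = {d: log10(1 + 1 / d) for d in range(1, 10)}
    print(sum(f.values()))               # 1.0, so f is a pdf
    print(sum(f[d] for d in (1, 2, 3)))  # P(X <= 3) = log10(4)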

The distribution defined in the previous exercise is known as Benford's law, and is named for the American physicist and engineer
Frank Benford. This distribution governs the leading digit in many real sets of data. Benford's law is studied in more detail in the
chapter on Special Distributions.

Data Analysis Exercises

In the M&M data, let R denote the number of red candies and N the total number of candies. Compute and graph the
empirical probability density function of each of the following:
1. R
2. N
3. R given N > 57

Answer

In the Cicada data, let G denote gender, S species type, and W body weight (in grams). Compute the empirical probability
density function of each of the following:
1. G
2. S
3. (G, S)
4. G given W > 0.20 grams.
Answer


3.2: Continuous Distributions

In the previous section, we considered discrete distributions. In this section, we study a complementary type of distribution. As
usual, if you are a new student of probability, you may want to skip the technical details.

Basic Theory
Definitions and Basic Properties
As usual, our starting point is a random experiment modeled by a probability space (S, S , P). So to review, S is the set of
outcomes, S the collection of events, and P the probability measure on the sample space (S, S ). We use the terms probability
measure and probability distribution synonymously in this text. Also, since we use a general definition of random variable, every
probability measure can be thought of as the probability distribution of a random variable, so we can always take this point of view
if we like. Indeed, most probability measures naturally have random variables associated with them.

In this section, we assume that S ⊆ R^n for some n ∈ N_+.

Details

Here is our first fundamental definition.

The probability measure P is continuous if P({x}) = 0 for all x ∈ S .

The fact that each point is assigned probability 0 might seem impossible or paradoxical at first, but soon we will see very familiar
analogies.

If P is a continuous distribution then P(C) = 0 for every countable C ⊆ S.


Proof

Thus, continuous distributions are in complete contrast with discrete distributions, for which all of the probability mass is
concentrated on the points in a discrete set. For a continuous distribution, the probability mass is continuously spread over S in
some sense. In the picture below, the light blue shading is intended to suggest a continuous distribution of probability.

Figure 3.2.1 : A continuous probability distribution on S


Typically, S is a region of R^n defined by inequalities involving elementary functions, for example an interval in R, a circular
region in R^2, and a conical region in R^3. Suppose that P is a continuous probability measure on S. The fact that each point in S
has probability 0 is conceptually the same as the fact that an interval of R can have positive length even though it is composed of
points each of which has 0 length. Similarly, a region of R^2 can have positive area even though it is composed of points (or curves)
each of which has area 0. In the one-dimensional case, continuous distributions are used to model random variables that take values
in intervals of R, variables that can, in principle, be measured with any degree of accuracy. Such variables abound in applications
and include
each of which has area 0. In the one-dimensional case, continuous distributions are used to model random variables that take values
in intervals of R, variables that can, in principle, be measured with any degree of accuracy. Such variables abound in applications
and include
length, area, volume, and distance
time
mass and weight
charge, voltage, and current
resistance, capacitance, and inductance
velocity and acceleration
energy, force, and work
Usually a continuous distribution can be described by a certain type of function.

Suppose again that P is a continuous distribution on S . A function f : S → [0, ∞) is a probability density function for P if

P(A) = ∫_A f(x) dx,  A ∈ S  (3.2.2)

Details

So the probability distribution P is completely determined by the probability density function f . As a special case, note that
∫_S f(x) dx = P(S) = 1. Conversely, a nonnegative function on S with this property defines a probability measure.

A function f : S → [0, ∞) that satisfies ∫_S f(x) dx = 1 is a probability density function on S, and then P defined as follows is
a continuous probability measure on S:

P(A) = ∫_A f(x) dx,  A ∈ S  (3.2.3)

Proof

Figure 3.2.2 : A continuous distribution is completely determined by its probability density function
Note that we can always extend f to a probability density function on a subset of R^n that contains S, or to all of R^n, by defining
f(x) = 0 for x ∉ S. This extension sometimes simplifies notation. Put another way, we can be a bit sloppy about the “set of

values” of the random variable. So for example if a, b ∈ R with a < b and X has a continuous distribution on the interval [a, b],
then we could also say that X has a continuous distribution on (a, b) or [a, b), or (a, b].
The points x ∈ S that maximize the probability density function f are important, just as in the discrete case.

Suppose that P is a continuous distribution on S with probability density function f . An element x ∈ S that maximizes f is a
mode of the distribution.

If there is only one mode, it is sometimes used as a measure of the center of the distribution.
You have probably noticed that probability density functions for continuous distributions are analogous to probability density
functions for discrete distributions, with integrals replacing sums. However, there are essential differences. First, every discrete
distribution has a unique probability density function f given by f (x) = P({x}) for x ∈ S . For a continuous distribution, the
existence of a probability density function is not guaranteed. The advanced section on absolute continuity and density functions has
several examples of continuous distributions that do not have density functions, and gives conditions that are necessary and
sufficient for the existence of a probability density function. Even if a probability density function f exists, it is never unique. Note
that the values of f on a finite (or even countably infinite) set of points could be changed to other nonnegative values and the new
function would still be a probability density function for the same distribution. The critical fact is that only integrals of f are
important. Second, the values of the PDF f for a discrete distribution are probabilities, and in particular f (x) ≤ 1 for x ∈ S . For a
continuous distribution the values are not probabilities and in fact it's possible that f (x) > 1 for some or even all x ∈ S . Further, f
can be unbounded on S . In the typical calculus interpretation, f (x) really is probability density at x. That is, f (x) dx is
approximately the probability of a “small” region of size dx about x.

Constructing Probability Density Functions


Just as in the discrete case, a nonnegative function on S can often be scaled to produce a probability density function.

Suppose that g : S → [0, ∞) and let

c = ∫_S g(x) dx  (3.2.4)

If 0 < c < ∞ then f defined by f(x) = (1/c) g(x) for x ∈ S defines a probability density function for a continuous distribution
on S.
Proof

Note again that f is just a scaled version of g . So this result can be used to construct probability density functions with desired
properties (domain, shape, symmetry, and so on). The constant c is sometimes called the normalizing constant of g .

Conditional Densities
Suppose now that X is a random variable defined on a probability space (Ω, F , P) and that X has a continuous distribution on S .
A probability density function for X is based on the underlying probability measure on the sample space (Ω, F ). This measure
could be a conditional probability measure, conditioned on a given event E ∈ F with P(E) > 0 . Assuming that the conditional
probability density function exists, the usual notation is

f (x ∣ E), x ∈ S (3.2.6)

Note, however, that except for notation, no new concepts are involved. The defining property is

∫_A f(x ∣ E) dx = P(X ∈ A ∣ E),  A ∈ S  (3.2.7)

and all results that hold for probability density functions in general hold for conditional probability density functions. The event E
could be an event described in terms of the random variable X itself:

Suppose that X has a continuous distribution on S with probability density function f and that B ∈ S with P(X ∈ B) > 0 .
The conditional probability density function of X given X ∈ B is the function on B given by
f(x ∣ X ∈ B) = f(x) / P(X ∈ B),  x ∈ B  (3.2.8)

Proof

Of course, P(X ∈ B) = ∫_B f(x) dx and hence is the normalizing constant for the restriction of f to B, as in (8)

Examples and Applications


As always, try the problems yourself before looking at the answers.

The Exponential Distribution


Let f be the function defined by f(t) = r e^{−rt} for t ∈ [0, ∞), where r ∈ (0, ∞) is a parameter.
1. Show that f is a probability density function.
2. Draw a careful sketch of the graph of f , and state the important qualitative features.
Proof

The distribution defined by the probability density function in the previous exercise is called the exponential distribution with rate
parameter r. This distribution is frequently used to model random times, under certain assumptions. Specifically, in the Poisson
model of random points in time, the times between successive arrivals have independent exponential distributions, and the
parameter r is the average rate of arrivals. The exponential distribution is studied in detail in the chapter on Poisson Processes.

The lifetime T of a certain device (in 1000 hour units) has the exponential distribution with parameter r = 1/2. Find
1. P(T > 2)

2. P(T > 3 ∣ T > 1)

Answer
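A sketch of the computation (ours): with r = 1/2 the survival function is P(T > t) = e^{−t/2}, and part (b) illustrates the memoryless property of the exponential distribution.

    from math import exp

    r = 1 / 2
    surv = lambda t: exp(-r * t)
    print(surv(2))            # part (a): P(T > 2) = e^{-1}
    print(surv(3) / surv(1))  # part (b): P(T > 3 | T > 1), the same value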

In the gamma experiment, set n = 1 to get the exponential distribution. Vary the rate parameter r and note the shape of the
probability density function. For various values of r, run the simulation 1000 times and compare the empirical density
function with the probability density function.

A Random Angle
In Bertrand's problem, a certain random angle Θ has probability density function f given by f(θ) = sin θ for θ ∈ [0, π/2].
1. Show that f is a probability density function.
2. Draw a careful sketch of the graph of f, and state the important qualitative features.
3. Find P(Θ < π/4).

Answer

Bertrand's problem is named for Joseph Louis Bertrand and is studied in more detail in the chapter on Geometric Models.

In Bertrand's experiment, select the model with uniform distance. Run the simulation 1000 times and compute the empirical
probability of the event {Θ < π/4}. Compare with the true probability in the previous exercise.

Gamma Distributions

Let g_n be the function defined by g_n(t) = e^{−t} t^n / n! for t ∈ [0, ∞), where n ∈ N is a parameter.
1. Show that g_n is a probability density function for each n ∈ N.
2. Draw a careful sketch of the graph of g_n, and state the important qualitative features.

Proof

Interestingly, we showed in the last section on discrete distributions, that f_t(n) = g_n(t) is a probability density function on N for
each t ≥ 0 (it's the Poisson distribution with parameter t). The distribution defined by the probability density function g_n belongs
to the family of Erlang distributions, named for Agner Erlang; n + 1 is known as the shape parameter. The Erlang distribution is
studied in more detail in the chapter on the Poisson Process. In turn the Erlang distribution belongs to the more general family of
gamma distributions. The gamma distribution is studied in more detail in the chapter on Special Distributions.

In the gamma experiment, keep the default rate parameter r = 1 . Vary the shape parameter and note the shape and location of
the probability density function. For various values of the shape parameter, run the simulation 1000 times and compare the
empirical density function with the probability density function.

Suppose that the lifetime of a device T (in 1000 hour units) has the gamma distribution above with n =2 . Find each of the
following:
1. P(T > 3) .
2. P(T ≤ 2)
3. P(1 ≤ T ≤ 4)
Answer
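A sketch of the computation (ours): integrating g_2 by parts gives the survival function P(T > t) = e^{−t}(1 + t + t^2/2), from which all three parts follow.

    from math import exp

    surv = lambda t: exp(-t) * (1 + t + t**2 / 2)
    print(surv(3))            # P(T > 3)
    print(1 - surv(2))        # P(T <= 2)
    print(surv(1) - surv(4))  # P(1 <= T <= 4)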

Beta Distributions

Let f be the function defined by f (x) = 6x(1 − x) for x ∈ [0, 1].


1. Show that f is a probability density function.
2. Draw a careful sketch of the graph of f , and state the important qualitative features.
Answer

Let f be the function defined by f(x) = 12x^2 (1 − x) for x ∈ [0, 1].
1. Show that f is a probability density function.
2. Draw a careful sketch of the graph of f, and state the important qualitative features.
Answer

The distributions defined in the last two exercises are examples of beta distributions. These distributions are widely used to model
random proportions and probabilities, and physical quantities that take values in bounded intervals (which, after a change of units,
can be taken to be [0, 1]). Beta distributions are studied in detail in the chapter on Special Distributions.

In the special distribution simulator, select the beta distribution. For the following parameter values, note the shape of the
probability density function. Run the simulation 1000 times and compare the empirical density function with the probability
density function.
1. a = 2 , b = 2 . This gives the first beta distribution above.
2. a = 3 , b = 2 . This gives the second beta distribution above.

Suppose that P is a random proportion. Find P(1/4 ≤ P ≤ 3/4) in each of the following cases:
1. P has the first beta distribution above.
2. P has the second beta distribution above.
Answer
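A sketch of the computation (ours): antidifferentiating the two densities gives the cdfs F(x) = 3x^2 − 2x^3 and F(x) = 4x^3 − 3x^4, and both answers turn out to be 11/16.

    F1 = lambda x: 3 * x**2 - 2 * x**3  # cdf for f(x) = 6x(1 - x)
    F2 = lambda x: 4 * x**3 - 3 * x**4  # cdf for f(x) = 12x^2(1 - x)

    print(F1(3 / 4) - F1(1 / 4))  # 11/16
    print(F2(3 / 4) - F2(1 / 4))  # also 11/16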

Let f be the function defined by

f(x) = 1 / (π √(x(1 − x))),  x ∈ (0, 1)  (3.2.10)

1. Show that f is a probability density function.


2. Draw a careful sketch of the graph of f , and state the important qualitative features.
Answer

The distribution defined in the last exercise is also a member of the beta family of distributions. But it is also known as the
(standard) arcsine distribution, because of the arcsine function that arises in the proof that f is a probability density function. The
arcsine distribution has applications to a very important random process known as Brownian motion, named for the Scottish
botanist Robert Brown. Arcsine distributions are studied in more generality in the chapter on Special Distributions.

In the special distribution simulator, select the (continuous) arcsine distribution and keep the default parameter values. Run the
simulation 1000 times and compare the empirical density function with the probability density function.

Suppose that X_t represents the change in the price of a stock at time t, relative to the value at an initial reference time 0. We
treat t as a continuous variable measured in weeks. Let T = max{t ∈ [0, 1] : X_t = 0}, the last time during the first week that
the stock price was unchanged over its initial value. Under certain ideal conditions, T will have the arcsine distribution. Find
each of the following:
1. P(T < 1/4)
2. P(T ≥ 1/2)
3. P(T ≤ 3/4)

Answer
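A sketch of the computation (ours): integrating the arcsine density gives the cdf F(t) = (2/π) arcsin(√t) for t ∈ [0, 1].

    from math import asin, pi, sqrt

    F = lambda t: (2 / pi) * asin(sqrt(t))
    print(F(1 / 4))      # P(T < 1/4) = 1/3
    print(1 - F(1 / 2))  # P(T >= 1/2) = 1/2
    print(F(3 / 4))      # P(T <= 3/4) = 2/3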

Open the Brownian motion experiment and select the last zero variable. Run the experiment in single step mode a few times.
The random process that you observe models the price of the stock in the previous exercise. Now run the experiment 1000
times and compute the empirical probability of each event in the previous exercise.

The Pareto Distribution


Let g be the function defined by g(x) = 1/x^b for x ∈ [1, ∞), where b ∈ (0, ∞) is a parameter.
1. Draw a careful sketch of the graph of g, and state the important qualitative features.
2. Find the values of b for which there exists a probability density function f proportional to g. Identify the mode.
Answer

Note that the qualitative features of g are the same, regardless of the value of the parameter b > 0 , but only when b > 1 can g be
normalized into a probability density function. In this case, the distribution is known as the Pareto distribution, named for Vilfredo
Pareto. The parameter a = b − 1 , so that a > 0 , is known as the shape parameter. Thus, the Pareto distribution with shape
parameter a has probability density function

f(x) = a / x^{a+1},  x ∈ [1, ∞)  (3.2.13)

The Pareto distribution is widely used to model certain economic variables and is studied in detail in the chapter on Special
Distributions.

In the special distribution simulator, select the Pareto distribution. Leave the scale parameter fixed, but vary the shape
parameter, and note the shape of the probability density function. For various values of the shape parameter, run the simulation
1000 times and compare the empirical density function with the probability density function.

Suppose that the income X (in appropriate units) of a person randomly selected from a population has the Pareto distribution
with shape parameter a = 2 . Find each of the following:
1. P(X > 2)
2. P(X ≤ 4)
3. P(3 ≤ X ≤ 5)
Answer

The Cauchy Distribution

Let f be the function defined by

f(x) = 1 / (π(x^2 + 1)),  x ∈ R  (3.2.14)

1. Show that f is a probability density function.


2. Draw a careful sketch of the graph of f, and state the important qualitative features.
Answer

The distribution constructed in the previous exercise is known as the (standard) Cauchy distribution, named after Augustin Cauchy.
It might also be called the arctangent distribution, because of the appearance of the arctangent function in the proof that f is a
probability density function. In this regard, note the similarity to the arcsine distribution above. The Cauchy distribution is studied
in more generality in the chapter on Special Distributions. Note also that the Cauchy distribution is obtained by normalizing the
function x ↦ 1/(1 + x^2); the graph of this function is known as the witch of Agnesi, in honor of Maria Agnesi.

In the special distribution simulator, select the Cauchy distribution with the default parameter values. Run the simulation 1000
times and compare the empirical density function with the probability density function.

A light source is 1 meter away from position 0 on an infinite, straight wall. The angle Θ that the light beam makes with the
perpendicular to the wall is randomly chosen from the interval (−π/2, π/2). The position X = tan(Θ) of the light beam on the
wall has the standard Cauchy distribution. Find each of the following:
1. P(−1 < X < 1)
2. P(X ≥ 1/√3)
3. P(X ≤ √3)

Answer
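A sketch of the computation (ours), using the standard Cauchy cdf F(x) = 1/2 + arctan(x)/π:

    from math import atan, pi, sqrt

    F = lambda x: 1 / 2 + atan(x) / pi
    print(F(1) - F(-1))        # P(-1 < X < 1) = 1/2
    print(1 - F(1 / sqrt(3)))  # P(X >= 1/√3) = 1/3
    print(F(sqrt(3)))          # P(X <= √3) = 5/6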

The Cauchy experiment (with the default parameter values) is a simulation of the experiment in the last exercise.
1. Run the experiment a few times in single step mode.
2. Run the experiment 1000 times and compare the empirical density function with the probability density function.
3. Using the data from (b), compute the relative frequency of each event in the previous exercise, and compare with the true
probability.

The Standard Normal Distribution

Let ϕ be the function defined by ϕ(z) = (1/√(2π)) e^{−z^2/2} for z ∈ R.
1. Show that ϕ is a probability density function.
2. Draw a careful sketch of the graph of ϕ, and state the important qualitative features.
Proof

The distribution defined in the last exercise is the standard normal distribution, perhaps the most important distribution in
probability and statistics. Its importance stems largely from the central limit theorem, one of the fundamental theorems in
probability. In particular, normal distributions are widely used to model physical measurements that are subject to small, random
errors. The family of normal distributions is studied in more generality in the chapter on Special Distributions.

In the special distribution simulator, select the normal distribution and keep the default parameter values. Run the simulation
1000 times and compare the empirical density function and the probability density function.

The function z ↦ e^{−z^2/2} is a notorious example of an integrable function that does not have an antiderivative that can be expressed
in closed form in terms of other elementary functions. (That's why we had to resort to the polar coordinate trick to show that ϕ is a
probability density function.) So probabilities involving the normal distribution are usually computed using mathematical or
statistical software.
statistical software.

Suppose that the error Z in the length of a certain machined part (in millimeters) has the standard normal distribution. Use
mathematical software to approximate each of the following:
1. P(−1 ≤ Z ≤ 1)
2. P(Z > 2)
3. P(Z < −3)
Answer
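One convenient route in software (a sketch of ours) is the error function from Python's standard library, since Φ(z) = (1 + erf(z/√2))/2:

    from math import erf, sqrt

    Phi = lambda z: (1 + erf(z / sqrt(2))) / 2
    print(Phi(1) - Phi(-1))  # P(-1 <= Z <= 1) ≈ 0.6827
    print(1 - Phi(2))        # P(Z > 2) ≈ 0.0228
    print(Phi(-3))           # P(Z < -3) ≈ 0.0013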

The Extreme Value Distribution


Let f be the function defined by f(x) = e^{−x} e^{−e^{−x}} for x ∈ R.

1. Show that f is a probability density function.


2. Draw a careful sketch of the graph of f , and state the important qualitative features.
3. Find P(X > 0) , where X has probability density function f .
Answer

The distribution in the last exercise is the (standard) type 1 extreme value distribution, also known as the Gumbel distribution in
honor of Emil Gumbel. Extreme value distributions are studied in more generality in the chapter on Special Distributions.

In the special distribution simulator, select the extreme value distribution. Keep the default parameter values and note the shape
and location of the probability density function. Run the simulation 1000 times and compare the empirical density function
with the probability density function.

The Logistic Distribution

Let f be the function defined by

f(x) = e^x / (1 + e^x)^2,  x ∈ R  (3.2.19)

1. Show that f is a probability density function.


2. Draw a careful sketch of the graph of f, and state the important qualitative features.
3. Find P(X > 1) , where X has probability density function f .
Answer

The distribution in the last exercise is the (standard) logistic distribution. Logistic distributions are studied in more generality in the
chapter on Special Distributions.

In the special distribution simulator, select the logistic distribution. Keep the default parameter values and note the shape and
location of the probability density function. Run the simulation 1000 times and compare the empirical density function with the
probability density function.

Weibull Distributions

Let f be the function defined by f(t) = 2t e^{−t^2} for t ∈ [0, ∞).

1. Show that f is a probability density function.


2. Draw a careful sketch of the graph of f, and state the important qualitative features.
Answer

Let f be the function defined by f(t) = 3t^2 e^{−t^3} for t ≥ 0.

1. Show that f is a probability density function.


2. Draw a careful sketch of the graph of f, and state the important qualitative features.
Answer

The distributions in the last two exercises are examples of Weibull distributions, named for Waloddi Weibull. Weibull distributions
are studied in more generality in the chapter on Special Distributions. They are often used to model random failure times of devices
(in appropriately scaled units).

In the special distribution simulator, select the Weibull distribution. For each of the following values of the shape parameter k ,
note the shape and location of the probability density function. Run the simulation 1000 times and compare the empirical
density function with the probability density function.
1. k = 2 . This gives the first Weibull distribution above.
2. k = 3 . This gives the second Weibull distribution above.

Suppose that T is the failure time of a device (in 1000 hour units). Find P(T > 1/2) in each of the following cases:
1. T has the first Weibull distribution above.
2. T has the second Weibull distribution above.
Answer

Additional Examples
Let f be the function defined by f (x) = − ln x for x ∈ (0, 1].
1. Show that f is a probability density function.
2. Draw a careful sketch of the graph of f , and state the important qualitative features.
3. Find P(1/3 ≤ X ≤ 1/2) where X has the probability density function in (a).
Answer

Let f be the function defined by f(x) = 2e^{−x} (1 − e^{−x}) for x ∈ [0, ∞).
1. Show that f is a probability density function.
2. Draw a careful sketch of the graph of f , and give the important qualitative features.
3. Find P(X ≥ 1) where X has the probability density function in (a).
Answer

The following problems deal with two and three dimensional random vectors having continuous distributions. The idea of
normalizing a function to form a probability density function is important for some of the problems. The relationship between the
distribution of a vector and the distribution of its components will be discussed later, in the section on joint distributions.

Let f be the function defined by f (x, y) = x + y for 0 ≤ x ≤ 1 , 0 ≤ y ≤ 1 .


1. Show that f is a probability density function, and identify the mode.
2. Find P(Y ≥ X) where (X, Y ) has the probability density function in (a).

3. Find the conditional density of (X, Y) given {X < 1/2, Y < 1/2}.
Answer

Let g be the function defined by g(x, y) = x + y for 0 ≤ x ≤ y ≤ 1 .


1. Find the probability density function f that is proportional to g .
2. Find P(Y ≥ 2X) where (X, Y ) has the probability density function in (a).
Answer

Let g be the function defined by g(x, y) = x^2 y for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1.
1. Find the probability density function f that is proportional to g .
2. Find P(Y ≥ X) where (X, Y ) has the probability density function in (a).
Answer

Let g be the function defined by g(x, y) = x^2 y for 0 ≤ x ≤ y ≤ 1.
1. Find the probability density function f that is proportional to g .
2. Find P (Y ≥ 2X) where (X, Y ) has the probability density function in (a).
Answer

Let g be the function defined by g(x, y, z) = x + 2y + 3z for 0 ≤ x ≤ 1 , 0 ≤ y ≤ 1 , 0 ≤ z ≤ 1 .


1. Find the probability density function f that is proportional to g .
2. Find P(X ≤ Y ≤ Z) where (X, Y , Z) has the probability density function in (a).
Answer

Let g be the function defined by g(x, y) = e^{−x} e^{−y} for 0 ≤ x ≤ y < ∞.
1. Find the probability density function f that is proportional to g .
2. Find P(X + Y < 1) where (X, Y ) has the probability density function in (a).
Answer

Continuous Uniform Distributions


Our next discussion will focus on an important class of continuous distributions that are defined purely in terms of geometry. We
need a preliminary definition.

For n ∈ N_+, the standard measure λ_n on R^n is given by

λ_n(A) = ∫_A 1 dx,  A ⊆ R^n  (3.2.23)

In particular, λ_1(A) is the length of A ⊆ R, λ_2(A) is the area of A ⊆ R^2, and λ_3(A) is the volume of A ⊆ R^3.

Details

Note that if n > 1, the integral above is a multiple integral. Generally, λ_n(A) is referred to as the n-dimensional volume of
A ⊆ R^n.

Suppose that S ⊆ R^n for some n ∈ N_+ with 0 < λ_n(S) < ∞.
1. The function f defined by f(x) = 1/λ_n(S) for x ∈ S is a probability density function on S.
2. The probability measure associated with f is given by P(A) = λ_n(A)/λ_n(S) for A ⊆ S, and is known as the uniform
distribution on S.
Proof

Note that the probability assigned to a set A ⊆ R^n is proportional to the size of A, as measured by λ_n. Note also that in both the
discrete and continuous cases, the uniform distribution on a set S has constant probability density function on S. The uniform
distribution on a set S governs a point X chosen “at random” from S, and in the continuous case, such distributions play a
fundamental role in various Geometric Models. Uniform distributions are studied in more generality in the chapter on Special
Distributions.
The most important special case is the uniform distribution on an interval [a, b] where a, b ∈ R and a < b. In this case, the
probability density function is

f(x) = 1/(b − a),  a ≤ x ≤ b  (3.2.25)

This distribution models a point chosen “at random” from the interval. In particular, the uniform distribution on [0, 1] is known as
the standard uniform distribution, and is very important because of its simplicity and the fact that it can be transformed into a
variety of other probability distributions on R. Almost all computer languages have procedures for simulating independent,
standard uniform variables, which are called random numbers in this context.
Conditional distributions corresponding to a uniform distribution are also uniform.

Suppose that R ⊆ S ⊆ R^n for some n ∈ N_+, and that λ_n(R) > 0 and λ_n(S) < ∞. If P is the uniform distribution on S, then
the conditional distribution given R is uniform on R.
Proof

The last theorem has important implications for simulations. If we can simulate a random variable that is uniformly distributed on a
set, we can simulate a random variable that is uniformly distributed on a subset.

Suppose again that R ⊆ S ⊆ R^n for some n ∈ N_+, and that λ_n(R) > 0 and λ_n(S) < ∞. Suppose further that
X = (X_1, X_2, …) is a sequence of independent random variables, each uniformly distributed on S. Let
N = min{k ∈ N_+ : X_k ∈ R}. Then
1. N has the geometric distribution on N_+ with success parameter p = λ_n(R)/λ_n(S).
2. X_N is uniformly distributed on R.

Proof

Suppose in particular that S is a Cartesian product of n bounded intervals. It turns out to be quite easy to simulate a sequence of
independent random variables X = (X_1, X_2, …), each of which is uniformly distributed on S. Thus, the last theorem gives an
algorithm for simulating a random variable that is uniformly distributed on an irregularly shaped region R ⊆ S (assuming that we
have an algorithm for recognizing when a point x ∈ R^n falls in R). This method of simulation is known as the rejection method,
and as we will see in subsequent sections, is more important than might first appear.

Figure 3.2.3 : With a sequence of independent, uniformly distributed points in S , the first one to fall in R is uniformly distributed
on R .
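Here is a minimal sketch (ours) of the rejection method for the disk of radius 1: sample uniformly from the enclosing square S = [−1, 1]^2 and keep the first point that lands in R.

    import random

    def uniform_on_disk(rng=random):
        # Rejection method: each attempt is uniform on S = [-1, 1]^2;
        # the first point falling in R (the unit disk) is uniform on R.
        while True:
            x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
            if x * x + y * y <= 1:
                return x, y

    print([uniform_on_disk() for _ in range(3)])
    # The number of attempts is geometric with success parameter
    # λ_2(R)/λ_2(S) = π/4, so about 4/π ≈ 1.27 attempts on average.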

In the simple probability experiment, random points are uniformly distributed on the rectangular region S . Move and resize the
events A and B and note how the probabilities of the 16 events that can be constructed from A and B change. Run the
experiment 1000 times and note the agreement between the relative frequencies of the events and the probabilities of the
events.

Suppose that (X, Y) is uniformly distributed on the circular region of radius 5, centered at the origin. We can think of (X, Y)
as the position of a dart thrown “randomly” at a target. Let R = √(X^2 + Y^2), the distance from the center to (X, Y).
1. Give the probability density function of (X, Y).
2. Find P(n ≤ R ≤ n + 1) for n ∈ {0, 1, 2, 3, 4}.
Answer

Suppose that (X, Y, Z) is uniformly distributed on the cube S = [0, 1]^3. Find P(X < Y < Z) in two ways:
1. Using the probability density function.
2. Using a combinatorial argument.
Answer

The time T (in minutes) required to perform a certain job is uniformly distributed over the interval [15, 60].
1. Find the probability that the job requires more than 30 minutes
2. Given that the job is not finished after 30 minutes, find the probability that the job will require more than 15 additional
minutes.
Answer

Data Analysis Exercises


If D is a data set from a variable X with a continuous distribution, then an empirical density function can be computed by
partitioning the data range into subsets of small size, and then computing the probability density of points in each subset. Empirical
probability density functions are studied in more detail in the chapter on Random Samples.

For the cicada data, BW denotes body weight (in grams), BL body length (in millimeters), and G gender (0 for female and 1
for male). Construct an empirical density function for each of the following and display each as a bar graph:
1. BW
2. BL
3. BW given G = 0
Answer


3.3: Mixed Distributions

In the previous two sections, we studied discrete probability measures and continuous probability measures. In this section, we will
study probability measures that are combinations of the two types. Once again, if you are a new student of probability you may
want to skip the technical details.

Basic Theory
Definitions and Basic Properties
Our starting point is a random experiment modeled by a probability space (S, S , P). So to review, S is the set of outcomes, S the
collection of events, and P the probability measure on the sample space (S, S ). We use the terms probability measure and
probability distribution synonymously in this text. Also, since we use a general definition of random variable, every probability
measure can be thought of as the probability distribution of a random variable, so we can always take this point of view if we like.
Indeed, most probability measures naturally have random variables associated with them. Here is the main definition:

The probability measure P is of mixed type if S can be partitioned into events D and C with the following properties:
1. D is countable, 0 < P(D) < 1 and P({x}) > 0 for every x ∈ D.
2. C ⊆ R^n for some n ∈ N_+ and P({x}) = 0 for every x ∈ C.

Details
Recall that the term partition means that D and C are disjoint and S = D ∪ C. As always, the collection of events S is
required to be a σ-algebra. The set C is a measurable subset of R^n and then the elements of S have the form A ∪ B where
A ⊆ D and B is a measurable subset of C. Typically in applications, C is defined by a finite number of inequalities involving
elementary functions.

Often the discrete set D is a subset of R^n also, but that's not a requirement. Note that since D and C are complements,
0 < P(C) < 1 also. Thus, part of the distribution is concentrated at points in a discrete set D; the rest of the distribution is
continuously spread over C. In the picture below, the light blue shading is intended to represent a continuous distribution of
probability while the darker blue dots are intended to represent points of positive probability.

Figure 3.3.1 : A mixed distribution on S


The following result is essentially equivalent to the definition.

Suppose that P is a probability measure on S of mixed type as in (1).


1. The conditional probability measure A ↦ P(A ∣ D) = P(A)/P (D) for A ⊆ D is a discrete distribution on D
2. The conditional probability measure A ↦ P(A ∣ C ) = P(A)/P(C ) for A ⊆ C is a continuous distribution on C .
Proof

Note that

P(A) = P(D)P(A ∣ D) + P(C )P(A ∣ C ), A ∈ S (3.3.1)

Thus, the probability measure P really is a mixture of a discrete distribution and a continuous distribution. Mixtures are studied in
more generality in the section on conditional distributions. We can define a function on D that is a partial probability density
function for the discrete part of the distribution.

Suppose that P is a probability measure on S of mixed type as in (1). Let g be the function defined by g(x) = P({x}) for
x ∈ D . Then

1. g(x) ≥ 0 for x ∈ D
2. ∑_{x∈D} g(x) = P(D)
3. P(A) = ∑_{x∈A} g(x) for A ⊆ D

Proof

Clearly, the normalized function x ↦ g(x)/P(D) is the probability density function of the conditional distribution given D ,
discussed in (2). Often, the continuous part of the distribution is also described by a partial probability density function.

A partial probability density function for the continuous part of P is a nonnegative function h : C → [0, ∞) such that

P(A) = ∫_A h(x) dx,  A ∈ C  (3.3.2)

Details
Technically, h is required to be measurable, and is a density function with respect to Lebesgue measure λ_n on C, the standard
measure on R^n.

Clearly, the normalized function x ↦ h(x)/P(C ) is the probability density function of the conditional distribution given C
discussed in (2). As with purely continuous distributions, the existence of a probability density function for the continuous part of a
mixed distribution is not guaranteed. And when it does exist, a density function for the continuous part is not unique. Note that the
values of h could be changed to other nonnegative values on a countable subset of C , and the displayed equation above would still
hold, because only integrals of h are important. The probability measure P is completely determined by the partial probability
density functions.

Suppose that P has partial probability density functions g and h for the discrete and continuous parts, respectively. Then

P(A) = ∑_{x∈A∩D} g(x) + ∫_{A∩C} h(x) dx,  A ∈ S  (3.3.3)

Proof

Figure 3.3.2 : A mixed distribution is completely determined by its partial density functions.

Truncated Variables
Distributions of mixed type occur naturally when a random variable with a continuous distribution is truncated in a certain way.
For example, suppose that T is the random lifetime of a device, and has a continuous distribution with probability density function
f that is positive on [0, ∞). In a test of the device, we can't wait forever, so we might select a positive constant a and record the

random variable U , defined by truncating T at a , as follows:

U = T if T < a, and U = a if T ≥ a  (3.3.4)

U has a mixed distribution. In the notation above,
1. D = {a} and g(a) = ∫_a^∞ f(t) dt
2. C = [0, a) and h(t) = f(t) for t ∈ [0, a)

Suppose next that random variable X has a continuous distribution on R, with probability density function f that is positive on R.
Suppose also that a, b ∈ R with a < b . The variable is truncated on the interval [a, b] to create a new random variable Y as
follows:

3.3.2 https://stats.libretexts.org/@go/page/10143
⎧ a, X ≤a

Y = ⎨ X, a <X <b (3.3.5)



b, X ≥b

Y has a mixed distribution. In the notation above,


a ∞
1. D = {a, b} , g(a) = ∫ f (x) dx , g(b) = ∫
−∞ b
f (x) dx

2. C = (a, b) and h(x) = f (x) for x ∈ (a, b)

Another way that a “mixed” probability distribution can occur is when we have a pair of random variables (X, Y ) for our
experiment, one with a discrete distribution and the other with a continuous distribution. This setting is explored in the next section
on Joint Distributions.

Examples and Applications


Suppose that X has probability 1/2 uniformly distributed on the set {1, 2, …, 8} and has probability 1/2 uniformly distributed on
the interval [0, 10]. Find P(X > 6).
Answer
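A sketch of the computation (ours), conditioning on the discrete and continuous parts:

    # discrete part: the points 7, 8 out of 8; continuous part: length of (6, 10]
    p = 1 / 2 * (2 / 8) + 1 / 2 * (4 / 10)
    print(p)  # 0.325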

Suppose that (X, Y) has probability 1/3 uniformly distributed on {0, 1, 2}^2 and has probability 2/3 uniformly distributed on
[0, 2]^2. Find P(Y > X).

Answer

Suppose that the lifetime T of a device (in 1000 hour units) has the exponential distribution with probability density function
f(t) = e^{−t} for 0 ≤ t < ∞. A test of the device is terminated after 2000 hours; the truncated lifetime U is recorded. Find each
of the following:
1. P(U < 1)

2. P(U = 2)

Answer
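A sketch of the computation (ours): U = min(T, 2), so the continuous part lives on [0, 2) and there is an atom at 2.

    from math import exp

    print(1 - exp(-1))  # part (a): P(U < 1) = P(T < 1)
    print(exp(-2))      # part (b): P(U = 2) = P(T >= 2), the discrete atom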


3.4: Joint Distributions

The purpose of this section is to study how the distribution of a pair of random variables is related to the distributions of the
variables individually. If you are a new student of probability you may want to skip the technical details.

Basic Theory
Joint and Marginal Distributions
As usual, we start with a random experiment modeled by a probability space (Ω, F , P). So to review, Ω is the set of outcomes, F
is the collection of events, and P is the probability measure on the sample space (Ω, F ). Suppose now that X and Y are random
variables for the experiment, and that X takes values in S while Y takes values in T . We can think of (X, Y ) as a random variable
taking values in the product set S × T . The purpose of this section is to study how the distribution of (X, Y ) is related to the
distributions of X and Y individually.

Recall that
1. The distribution of (X, Y ) is the probability measure on S × T given by P [(X, Y ) ∈ C ] for C ⊆ S ×T .
2. The distribution of X is the probability measure on S given by P(X ∈ A) for A ⊆ S .
3. The distribution of Y is the probability measure on T given by P(Y ∈ B) for B ⊆ T .
In this context, the distribution of (X, Y ) is called the joint distribution, while the distributions of X and of Y are referred to
as marginal distributions.
Details

The first simple but very important point, is that the marginal distributions can be obtained from the joint distribution.

Note that
1. P(X ∈ A) = P [(X, Y ) ∈ A × T ] for A ⊆ S
2. P(Y ∈ B) = P [(X, Y ) ∈ S × B] for B ⊆ T

The converse does not hold in general. The joint distribution contains much more information than the marginal distributions
separately. However, the converse does hold if X and Y are independent, as we will show below.

Joint and Marginal Densities


Recall that probability distributions are often described in terms of probability density functions. Our goal is to study how the
probability density functions of X and Y individually are related to probability density function of (X, Y ). But first we need to
make sure that we understand our starting point.

We assume that (X, Y ) has density function f : S × T → (0, ∞) in the following sense:
1. If X and Y have discrete distributions on the countable sets S and T respectively, then f is defined by
f (x, y) = P(X = x, Y = y), (x, y) ∈ S × T (3.4.1)

2. If X and Y have continuous distributions on S ⊆ R^j and T ⊆ R^k respectively, then f is defined by the condition

P[(X, Y) ∈ C] = ∫_C f(x, y) d(x, y),  C ⊆ S × T  (3.4.2)

3. In the mixed case where X has a discrete distribution on the countable set S and Y has a continuous distribution on
T ⊆ R^k, then f is defined by the condition

P(X = x, Y ∈ B) = ∫_B f(x, y) dy,  x ∈ S, B ⊆ T  (3.4.3)

4. In the mixed case where X has a continuous distribution on S ⊆ R^j and Y has a discrete distribution on the countable set
T, then f is defined by the condition

P(X ∈ A, Y = y) = ∫_A f(x, y) dx,  A ⊆ S, y ∈ T  (3.4.4)

Details

First we will see how to obtain the probability density function of one variable when the other has a discrete distribution.

Suppose that (X, Y ) has probability density function f as described above.


1. If Y has a discrete distribution on the countable set T, then X has probability density function g given by
g(x) = ∑_{y∈T} f(x, y) for x ∈ S
2. If X has a discrete distribution on the countable set S, then Y has probability density function h given by
h(y) = ∑_{x∈S} f(x, y) for y ∈ T

Proof

Next we will see how to obtain the probability density function of one variable when the other has a continuous distribution.

Suppose again that (X, Y ) has probability density function f as described above.
1. If Y has a continuous distribution on T ⊆ R^k, then X has probability density function g given by
g(x) = ∫_T f(x, y) dy for x ∈ S
2. If X has a continuous distribution on S ⊆ R^j, then Y has probability density function h given by
h(y) = ∫_S f(x, y) dx for y ∈ T

Proof

In the context of the previous two theorems, f is called the joint probability density function of (X, Y ), while g and h are called
the marginal density functions of X and of Y , respectively. Some of the computational exercises below will make the term
marginal clear.

Independence
When the variables are independent, the marginal distributions determine the joint distribution.

If X and Y are independent, then the distribution of X and the distribution of Y determine the distribution of (X, Y ).
Proof

When the variables are independent, the joint density is the product of the marginal densities.

Suppose that X and Y are independent and have probability density function g and h respectively. Then (X, Y ) has
probability density function f given by
f (x, y) = g(x)h(y), (x, y) ∈ S × T (3.4.11)

Proof

The following result gives a converse to the last result. If the joint probability density factors into a function of x only and a
function of y only, then X and Y are independent, and we can almost identify the individual probability density functions just from
the factoring.

Factoring Theorem. Suppose that (X, Y ) has probability density function f of the form
f (x, y) = u(x)v(y), (x, y) ∈ S × T (3.4.15)

where u : S → [0, ∞) and v : T → [0, ∞). Then X and Y are independent, and there exists a positive constant c such that X
and Y have probability density functions g and h, respectively, given by

g(x) = c u(x),  x ∈ S  (3.4.16)
h(y) = (1/c) v(y),  y ∈ T  (3.4.17)

Proof

The last two results extend to more than two random variables, because X and Y themselves may be random vectors. Here is the
explicit statement:

Suppose that X_i is a random variable taking values in a set R_i with probability density function g_i for i ∈ {1, 2, …, n}, and
that the random variables are independent. Then the random vector X = (X_1, X_2, …, X_n) taking values in
S = R_1 × R_2 × ⋯ × R_n has probability density function f given by

f(x_1, x_2, …, x_n) = g_1(x_1) g_2(x_2) ⋯ g_n(x_n),  (x_1, x_2, …, x_n) ∈ S  (3.4.21)

The special case where the distributions are all the same is particularly important.

Suppose that X = (X_1, X_2, …, X_n) is a sequence of independent random variables, each taking values in a set R and with
common probability density function g. Then the probability density function f of X on S = R^n is given by

f(x_1, x_2, …, x_n) = g(x_1) g(x_2) ⋯ g(x_n),  (x_1, x_2, …, x_n) ∈ S  (3.4.22)

In probability jargon, X is a sequence of independent, identically distributed variables, a phrase that comes up so often that it is
often abbreviated as IID. In statistical jargon, X is a random sample of size n from the common distribution. As is evident from
the special terminology, this situation is very important in both branches of mathematics. In statistics, the joint probability density
function f plays an important role in procedures such as maximum likelihood and the identification of uniformly best estimators.
Recall that (mutual) independence of random variables is a very strong property. If a collection of random variables is independent,
then any subcollection is also independent. New random variables formed from disjoint subcollections are independent. For a
simple example, suppose that X, Y , and Z are independent real-valued random variables. Then
1. sin(X), cos(Y), and e^Z are independent.
2. (X, Y) and Z are independent.
3. X^2 + Y^2 and arctan(Z) are independent.
4. X and Z are independent.
5. Y and Z are independent.

5. Y and Z are independent.
In particular, note that statement 2 in the list above is much stronger than the conjunction of statements 4 and 5. Restated, if X and
Z are dependent, then (X, Y ) and Z are also dependent.

Examples and Applications


Dice
Recall that a standard die is an ordinary six-sided die, with faces numbered from 1 to 6. The answers in the next couple of
exercises give the joint distribution in the body of a table, with the marginal distributions literally in the margins. Such tables are the
reason for the term marginal distribution.

Suppose that two standard, fair dice are rolled and the sequence of scores (X_1, X_2) recorded. Our usual assumption is that the
variables X_1 and X_2 are independent. Let Y = X_1 + X_2 and Z = X_1 − X_2 denote the sum and difference of the scores,
respectively.

respectively.
1. Find the probability density function of (Y , Z).
2. Find the probability density function of Y .
3. Find the probability density function of Z .
4. Are Y and Z independent?
Answer
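A brute-force sketch (ours) that builds the joint pmf of (Y, Z) by enumerating the 36 equally likely outcomes, then sums over rows and columns for the marginals; comparing a joint value with the product of the marginals exhibits the dependence in part (d).

    from collections import Counter
    from itertools import product

    joint, pmf_Y, pmf_Z = Counter(), Counter(), Counter()
    for x1, x2 in product(range(1, 7), repeat=2):
        joint[(x1 + x2, x1 - x2)] += 1 / 36
    for (y, z), p in joint.items():
        pmf_Y[y] += p
        pmf_Z[z] += p

    # dependence: the joint pmf at (2, 2) is 0, but the product of marginals is not
    print(joint[(2, 2)], pmf_Y[2] * pmf_Z[2])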

Suppose that two standard, fair dice are rolled and the sequence of scores (X_1, X_2) recorded. Let U = min{X_1, X_2} and
V = max{X_1, X_2} denote the minimum and maximum scores, respectively.

1. Find the probability density function of (U , V ).


2. Find the probability density function of U .
3. Find the probability density function of V .
4. Are U and V independent?

Answer

The previous two exercises show clearly how little information is given with the marginal distributions compared to the joint
distribution. With the marginal PDFs alone, you could not even determine the support set of the joint distribution, let alone the
values of the joint PDF.

Simple Continuous Distributions


Suppose that (X, Y ) has probability density function f given by f (x, y) = x + y for 0 ≤ x ≤ 1 , 0 ≤ y ≤ 1 .
1. Find the probability density function of X.
2. Find the probability density function of Y .
3. Are X and Y independent?
Answer

Suppose that (X, Y ) has probability density function f given by f (x, y) = 2(x + y) for 0 ≤ x ≤ y ≤ 1 .
1. Find the probability density function of X.
2. Find the probability density function of Y .
3. Are X and Y independent?
Answer

Suppose that (X, Y) has probability density function f given by f(x, y) = 6x^2 y for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1.
1. Find the probability density function of X.
2. Find the probability density function of Y .
3. Are X and Y independent?
Answer

The last exercise is a good illustration of the factoring theorem. Without any work at all, we can tell that the PDF of X is
proportional to x ↦ x^2 on the interval [0, 1], the PDF of Y is proportional to y ↦ y on the interval [0, 1], and that X and Y are
independent. The only thing unclear is how the constant 6 factors.
independent. The only thing unclear is how the constant 6 factors.

Suppose that (X, Y) has probability density function f given by f(x, y) = 15x^2 y for 0 ≤ x ≤ y ≤ 1.
1. Find the probability density function of X.
2. Find the probability density function of Y .
3. Are X and Y independent?
Answer

Note that in the last exercise, the factoring theorem does not apply. Random variables X and Y each take values in [0, 1], but the
joint PDF factors only on part of [0, 1]^2.

Suppose that (X, Y, Z) has probability density function f given by f(x, y, z) = 2(x + y)z for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1,
0 ≤ z ≤ 1.

1. Find the probability density function of each pair of variables.


2. Find the probability density function of each variable.
3. Determine the dependency relationships between the variables.
Proof

Suppose that (X, Y) has probability density function f given by f(x, y) = 2e^{−x} e^{−y} for 0 ≤ x ≤ y < ∞.
1. Find the probability density function of X.
2. Find the probability density function of Y .
3. Are X and Y independent?
Answer

In the previous exercise, X has an exponential distribution with rate parameter 2. Recall that exponential distributions are widely
used to model random times, particularly in the context of the Poisson model.

Suppose that X and Y are independent, and that X has probability density function g given by g(x) = 6x(1 − x) for
0 ≤ x ≤ 1, and that Y has probability density function h given by h(y) = 12y^2 (1 − y) for 0 ≤ y ≤ 1.

1. Find the probability density function of (X, Y ).


2. Find P(X + Y ≤ 1) .
Answer

In the previous exercise, X and Y have beta distributions, which are widely used to model random probabilities and proportions.
Beta distributions are studied in more detail in the chapter on Special Distributions.

Suppose that Θ and Φ are independent random angles, with common probability density function g given by g(t) = sin(t) for 0 ≤ t ≤ π/2.

1. Find the probability density function of (Θ, Φ).


2. Find P(Θ ≤ Φ) .
Answer

The common distribution of Θ and Φ in the previous exercise governs a random angle in Bertrand's problem.

Suppose that X and Y are independent, and that X has probability density function g given by g(x) = 2/x^3 for 1 ≤ x < ∞, and that Y has probability density function h given by h(y) = 3/y^4 for 1 ≤ y < ∞.

1. Find the probability density function of (X, Y ).


2. Find P(X ≤ Y ) .
Answer

Both X and Y in the previous exercise have Pareto distributions, named for Vilfredo Pareto. Recall that Pareto distributions are
used to model certain economic variables and are studied in more detail in the chapter on Special Distributions.

Suppose that (X, Y) has probability density function g given by g(x, y) = 15x^2 y for 0 ≤ x ≤ y ≤ 1, that Z has probability density function h given by h(z) = 4z^3 for 0 ≤ z ≤ 1, and that (X, Y) and Z are independent.

1. Find the probability density function of (X, Y , Z) .


2. Find the probability density function of (X, Z).
3. Find the probability density function of (Y , Z).
4. Find P(Z ≤ XY ) .
Answer

Multivariate Uniform Distributions


Multivariate uniform distributions give a geometric interpretation of some of the concepts in this section.

Recall first that for n ∈ N_+, the standard measure λ_n on R^n is

λ_n(A) = ∫_A 1 dx,  A ⊆ R^n  (3.4.23)

In particular, λ_1(A) is the length of A ⊆ R, λ_2(A) is the area of A ⊆ R^2, and λ_3(A) is the volume of A ⊆ R^3.

Details

Suppose now that X takes values in R^j, Y takes values in R^k, and that (X, Y) is uniformly distributed on a set R ⊆ R^{j+k}. So 0 < λ_{j+k}(R) < ∞ and the joint probability density function f of (X, Y) is given by f(x, y) = 1/λ_{j+k}(R) for (x, y) ∈ R. Recall that uniform distributions always have constant density functions. Now let S and T be the projections of R onto R^j and R^k respectively, defined as follows:

S = {x ∈ R^j : (x, y) ∈ R for some y ∈ R^k},  T = {y ∈ R^k : (x, y) ∈ R for some x ∈ R^j}  (3.4.24)

Note that R ⊆ S × T. Next we denote the cross sections at x ∈ S and at y ∈ T, respectively, by

T_x = {t ∈ T : (x, t) ∈ R},  S_y = {s ∈ S : (s, y) ∈ R}  (3.4.25)

Figure 3.4.1 : The projections S and T , and the cross sections at x and y

X takes values in S and Y takes values in T . The probability density functions g and h of X and Y are proportional to the
cross-sectional measures:
1. g(x) = λ_k(T_x) / λ_{j+k}(R) for x ∈ S
2. h(y) = λ_j(S_y) / λ_{j+k}(R) for y ∈ T
Proof

In particular, note from previous theorem that X and Y are neither independent nor uniformly distributed in general. However,
these properties do hold if R is a Cartesian product set.

Suppose that R = S × T .
1. X is uniformly distributed on S .
2. Y is uniformly distributed on T .
3. X and Y are independent.
Proof
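As a concrete illustration of the cross-section formulas (a sketch of ours, not part of the text), consider (X, Y) uniformly distributed on the triangle R = {(x, y) : −6 ≤ y ≤ x ≤ 6}, as in part (b) of the exercise below. Here λ_2(R) = 72 and the cross section at x is T_x = [−6, x] with length x + 6, so g(x) = (x + 6)/72 for x ∈ [−6, 6]. The Python sketch below checks this against a simulation; the function names are ours.

import random

def sample():
    # rejection sampling from the bounding square [-6, 6]^2
    while True:
        x, y = random.uniform(-6, 6), random.uniform(-6, 6)
        if y <= x:
            return x, y

n = 100_000
xs = [sample()[0] for _ in range(n)]
# empirical P(X <= 0) versus the exact integral of g(x) = (x + 6)/72 on [-6, 0]
print(sum(1 for x in xs if x <= 0) / n)   # simulated, near 0.25
print(6**2 / (2 * 72))                    # exact: 36/144 = 0.25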

In each of the following cases, find the joint and marginal probability density functions, and determine if X and Y are independent.
1. (X, Y) is uniformly distributed on the square R = [−6, 6]^2.
2. (X, Y) is uniformly distributed on the triangle R = {(x, y) : −6 ≤ y ≤ x ≤ 6}.
3. (X, Y) is uniformly distributed on the circle R = {(x, y) : x^2 + y^2 ≤ 36}.

Answer

In the bivariate uniform experiment, run the simulation 1000 times for each of the following cases. Watch the points in the
scatter plot and the graphs of the marginal distributions. Interpret what you see in the context of the discussion above.
1. square
2. triangle
3. circle

Suppose that (X, Y, Z) is uniformly distributed on the cube [0, 1]^3.

1. Give the joint probability density function of (X, Y , Z) .


2. Find the probability density function of each pair of variables.
3. Find the probability density function of each variable
4. Determine the dependency relationships between the variables.
Answer

Suppose that (X, Y , Z) is uniformly distributed on {(x, y, z) : 0 ≤ x ≤ y ≤ z ≤ 1} .


1. Give the joint density function of (X, Y , Z) .

2. Find the probability density function of each pair of variables.
3. Find the probability density function of each variable
4. Determine the dependency relationships between the variables.
Answer

The Rejection Method


The following result shows how an arbitrary continuous distribution can be obtained from a uniform distribution. This result is
useful for simulating certain continuous distributions, as we will see. To set up the basic notation, suppose that g is a probability
density function for a continuous distribution on S ⊆ R^n. Let

R = {(x, y) : x ∈ S and 0 ≤ y ≤ g(x)} ⊆ R^{n+1}  (3.4.27)

If (X, Y ) is uniformly distributed on R , then X has probability density function g .


Proof

A picture in the case n = 1 is given below:

Figure 3.4.2 : If (X, Y ) is uniformly distributed on R , then X has density function g .


The next result gives the rejection method for simulating a random variable with the probability density function g .

Suppose now that R ⊆ T where T ⊆ R^{n+1} with λ_{n+1}(T) < ∞, and that ((X_1, Y_1), (X_2, Y_2), …) is a sequence of independent random variables with X_k ∈ R^n, Y_k ∈ R, and (X_k, Y_k) uniformly distributed on T for each k ∈ N_+. Let

N = min{k ∈ N_+ : (X_k, Y_k) ∈ R} = min{k ∈ N_+ : X_k ∈ S, 0 ≤ Y_k ≤ g(X_k)}  (3.4.29)

1. N has the geometric distribution on N_+ with success parameter p = 1/λ_{n+1}(T).
2. (X_N, Y_N) is uniformly distributed on R.
3. X_N has probability density function g.

Proof

Figure 3.4.3 : With a sequence of independent points, uniformly distributed on T , the x coordinate of the first point to land in R

has probability density function g .


The point of the theorem is that if we can simulate a sequence of independent variables that are uniformly distributed on T, then we can simulate a random variable with the given probability density function g. Suppose in particular that R is bounded as a subset of R^{n+1}, which would mean that the domain S is bounded as a subset of R^n and that the probability density function g is bounded on S. In this case, we can find T that is the Cartesian product of n + 1 bounded intervals with R ⊆ T. It turns out to be very easy to simulate a sequence of independent variables, each uniformly distributed on such a product set, so the rejection method always works in this case. As you might guess, the rejection method works best if the size of T, namely λ_{n+1}(T), is small, so that the success parameter p is large.
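As a concrete sketch of ours (not part of the text), here is the rejection method in Python for the density g(x) = 6x(1 − x) on S = [0, 1], which appeared earlier in this section. Since g is bounded by 3/2, we can take T = [0, 1] × [0, 3/2], so the success parameter is p = 2/3.

import random

def g(x):
    return 6 * x * (1 - x)

def rejection_sample():
    while True:
        # (X_k, Y_k) uniformly distributed on T = [0, 1] x [0, 3/2]
        xk = random.uniform(0.0, 1.0)
        yk = random.uniform(0.0, 1.5)
        if yk <= g(xk):       # first index k with (X_k, Y_k) in R
            return xk         # X_N has probability density function g

sample = [rejection_sample() for _ in range(10_000)]
print(sum(sample) / len(sample))   # should be near the mean, 1/2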

The rejection method app simulates a number of continuous distributions via the rejection method. For each of the following
distributions, vary the parameters and note the shape and location of the probability density function. Then run the experiment
1000 times and observe the results.
1. The beta distribution
2. The semicircle distribution
3. The triangle distribution
4. The U-power distribution

The Multivariate Hypergeometric Distribution


Suppose that a population consists of m objects, and that each object is one of four types. There are a type 1 objects, b type 2
objects, c type 3 objects and m − a − b − c type 0 objects. We sample n objects from the population at random, and without
replacement. The parameters m, a , b , c , and n are nonnegative integers with a + b + c ≤ m and n ≤ m . Denote the number of
type 1, 2, and 3 objects in the sample by X, Y , and Z , respectively. Hence, the number of type 0 objects in the sample is
n − X − Y − Z . In the problems below, the variables x, y , and z take values in N .

(X, Y, Z) has a (multivariate) hypergeometric distribution with probability density function f given by

f(x, y, z) = \binom{a}{x} \binom{b}{y} \binom{c}{z} \binom{m−a−b−c}{n−x−y−z} \Big/ \binom{m}{n},  x + y + z ≤ n  (3.4.30)

Proof

(X, Y) also has a (multivariate) hypergeometric distribution, with the probability density function g given by

g(x, y) = \binom{a}{x} \binom{b}{y} \binom{m−a−b}{n−x−y} \Big/ \binom{m}{n},  x + y ≤ n  (3.4.31)

Proof

X has an ordinary hypergeometric distribution, with probability density function h given by

h(x) = \binom{a}{x} \binom{m−a}{n−x} \Big/ \binom{m}{n},  x ≤ n  (3.4.32)

Proof

These results generalize in a straightforward way to a population with any number of types. In brief, if a random vector has a
hypergeometric distribution, then any sub-vector also has a hypergeometric distribution. In other words, all of the marginal
distributions of a hypergeometric distribution are themselves hypergeometric. Note however, that it's not a good idea to memorize
the formulas above explicitly. It's better to just note the patterns and recall the combinatorial meaning of the binomial coefficient.
The hypergeometric distribution and the multivariate hypergeometric distribution are studied in more detail in the chapter on Finite
Sampling Models.
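As a numerical illustration of the marginal pattern (a sketch of ours, not part of the text), the following Python snippet sums the trivariate PDF (3.4.30) over z and checks that the result matches the bivariate PDF (3.4.31); the parameter values are arbitrary.

from math import comb

m, a, b, c, n = 100, 20, 30, 40, 10

def f(x, y, z):   # trivariate hypergeometric PDF (3.4.30)
    return (comb(a, x) * comb(b, y) * comb(c, z)
            * comb(m - a - b - c, n - x - y - z)) / comb(m, n)

def g(x, y):      # bivariate hypergeometric PDF (3.4.31)
    return comb(a, x) * comb(b, y) * comb(m - a - b, n - x - y) / comb(m, n)

x, y = 2, 3
print(sum(f(x, y, z) for z in range(n - x - y + 1)))   # summed over z
print(g(x, y))                                         # same value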

Suppose that a population of voters consists of 50 democrats, 40 republicans, and 30 independents. A sample of 10 voters is
chosen at random from the population (without replacement, of course). Let X denote the number of democrats in the sample
and Y the number of republicans in the sample. Find the probability density function of each of the following:
1. (X, Y )
2. X
3. Y
Answer

Suppose that the Math Club at Enormous State University (ESU) has 50 freshmen, 40 sophomores, 30 juniors and 20 seniors.
A sample of 10 club members is chosen at random to serve on the π-day committee. Let X denote the number of freshmen on the
committee, Y the number of sophomores, and Z the number of juniors.
1. Find the probability density function of (X, Y , Z)

2. Find the probability density function of each pair of variables.
3. Find the probability density function of each individual variable.
Answer

Multinomial Trials
Suppose that we have a sequence of n independent trials, each with 4 possible outcomes. On each trial, outcome 1 occurs with
probability p, outcome 2 with probability q, outcome 3 with probability r, and outcome 0 occurs with probability 1 − p − q − r .
The parameters p, q, and r are nonnegative numbers with p + q + r ≤ 1, and n ∈ N_+. Denote the number of times that outcome 1, outcome 2, and outcome 3 occurred in the n trials by X, Y, and Z respectively. Of course, the number of times that outcome 0 occurs is n − X − Y − Z. In the problems below, the variables x, y, and z take values in N.

(X, Y, Z) has a multinomial distribution with probability density function f given by

f(x, y, z) = \binom{n}{x, y, z} p^x q^y r^z (1 − p − q − r)^{n−x−y−z},  x + y + z ≤ n  (3.4.33)

Proof

(X, Y) also has a multinomial distribution with the probability density function g given by

g(x, y) = \binom{n}{x, y} p^x q^y (1 − p − q)^{n−x−y},  x + y ≤ n  (3.4.34)

Proof

X has a binomial distribution, with the probability density function h given by

h(x) = \binom{n}{x} p^x (1 − p)^{n−x},  x ≤ n  (3.4.35)

Proof

These results generalize in a completely straightforward way to multinomial trials with any number of trial outcomes. In brief, if a
random vector has a multinomial distribution, then any sub-vector also has a multinomial distribution. In other terms, all of the
marginal distributions of a multinomial distribution are themselves multinomial. The binomial distribution and the multinomial
distribution are studied in more detail in the chapter on Bernoulli Trials.
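Here is an analogous numerical sketch of ours (not part of the text) for multinomial trials: summing the trivariate PDF (3.4.33) over y and z recovers the binomial PDF (3.4.35); the parameter values are arbitrary.

from math import comb, factorial

n, p, q, r = 12, 0.1, 0.2, 0.3

def f(x, y, z):   # trivariate multinomial PDF (3.4.33)
    coef = factorial(n) // (factorial(x) * factorial(y)
                            * factorial(z) * factorial(n - x - y - z))
    return coef * p**x * q**y * r**z * (1 - p - q - r)**(n - x - y - z)

def h(x):         # binomial PDF (3.4.35)
    return comb(n, x) * p**x * (1 - p)**(n - x)

x = 4
print(sum(f(x, y, z) for y in range(n - x + 1)
                     for z in range(n - x - y + 1)))   # summed over y and z
print(h(x))                                            # same value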

Suppose that a system consists of 10 components that operate independently. Each component is working with probability 1/2, idle with probability 1/3, or failed with probability 1/6. Let X denote the number of working components and Y the number of idle components. Give the probability density function of each of the following:
1. (X, Y )
2. X
3. Y
Answer

Suppose that in a crooked, four-sided die, face i occurs with probability i/10 for i ∈ {1, 2, 3, 4}. The die is thrown 12 times; let X denote the number of times that score 1 occurs, Y the number of times that score 2 occurs, and Z the number of times that score 3 occurs.
1. Find the probability density function of (X, Y , Z)
2. Find the probability density function of each pair of variables.
3. Find the probability density function of each individual variable.
Answer

Bivariate Normal Distributions


Suppose that (X, Y) has the probability density function f given below:

f(x, y) = (1 / 12π) exp[−(x^2/8 + y^2/18)],  (x, y) ∈ R^2  (3.4.36)

1. Find the probability density function of X.


2. Find the probability density function of Y .
3. Are X and Y independent?
Answer

Suppose that (X, Y) has probability density function f given below:

f(x, y) = (1 / √3 π) exp[−(2/3)(x^2 − xy + y^2)],  (x, y) ∈ R^2  (3.4.37)

1. Find the density function of X.


2. Find the density function of Y .
3. Are X and Y independent?
Answer

The joint distributions in the last two exercises are examples of bivariate normal distributions. Normal distributions are widely
used to model physical measurements subject to small, random errors. In both exercises, the marginal distributions of X and Y also
have normal distributions, and this turns out to be true in general. The multivariate normal distribution is studied in more detail in
the chapter on Special Distributions.

Exponential Distributions
Recall that the exponential distribution has probability density function

f(t) = r e^{−rt},  t ∈ [0, ∞)  (3.4.38)

where r ∈ (0, ∞) is the rate parameter. The exponential distribution is widely used to model random times, and is studied in more
detail in the chapter on the Poisson Process.

Suppose X and Y have exponential distributions with parameters a ∈ (0, ∞) and b ∈ (0, ∞), respectively, and are independent. Then P(X < Y) = a/(a + b).

Suppose X, Y , and Z have exponential distributions with parameters a ∈ (0, ∞) , b ∈ (0, ∞) , and c ∈ (0, ∞) , respectively,
and are independent. Then
1. P(X < Y < Z) = [a/(a + b + c)] · [b/(b + c)]
2. P(X < Y, X < Z) = a/(a + b + c)
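The first of these formulas is easy to check by simulation. The following Monte Carlo sketch of ours (not part of the text) estimates P(X < Y < Z) for rates a = 1, b = 2, c = 3 and compares the estimate with the exact value a/(a + b + c) · b/(b + c) = 1/15.

import random

a, b, c = 1.0, 2.0, 3.0
trials = 200_000
count = 0
for _ in range(trials):
    x = random.expovariate(a)
    y = random.expovariate(b)
    z = random.expovariate(c)
    if x < y < z:
        count += 1
print(count / trials)                    # empirical estimate
print(a / (a + b + c) * b / (b + c))     # exact: 1/15, about 0.0667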

If X, Y , and Z are the lifetimes of devices that act independently, then the results in the previous two exercises give probabilities
of various failure orders. Results of this type are also very important in the study of continuous-time Markov processes. We will
continue this discussion in the section on transformations of random variables.

Mixed Coordinates

Suppose X takes values in the finite set {1, 2, 3}, Y takes values in the interval [0, 3], and that (X, Y ) has probability density
function f given by
1

⎪ , x = 1, 0 ≤ y ≤ 1
⎪ 3
1
f (x, y) = ⎨ , x = 2, 0 ≤ y ≤ 2 (3.4.39)
6


⎪ 1
, x = 3, 0 ≤ y ≤ 3
9

1. Find the probability density function of X.


2. Find the probability density function of Y .
3. Are X and Y independent?
Answer

Suppose that P takes values in the interval [0, 1], X takes values in the finite set {0, 1, 2, 3}, and that (P, X) has probability density function f given by

f(p, x) = 6 \binom{3}{x} p^{x+1} (1 − p)^{4−x},  (p, x) ∈ [0, 1] × {0, 1, 2, 3}  (3.4.40)

1. Find the probability density function of P .


2. Find the probability density function of X.
3. Are P and X independent?
Answer

As we will see in the section on conditional distributions, the distribution in the last exercise models the following experiment: a
random probability P is selected, and then a coin with this probability of heads is tossed 3 times; X is the number of heads. Note
that P has a beta distribution.

Random Samples
Recall that the Bernoulli distribution with parameter p ∈ [0, 1] has probability density function g given by g(x) = p^x (1 − p)^{1−x} for x ∈ {0, 1}. Let X = (X_1, X_2, …, X_n) be a random sample of size n ∈ N_+ from the distribution.

Give the probability density function of X in simplified form.


Answer

The Bernoulli distribution is named for Jacob Bernoulli, and governs an indicator random variable. Hence if X is a random sample of size n from the distribution then X is a sequence of n Bernoulli trials. A separate chapter studies Bernoulli trials in more detail.
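Since the variables in a random sample are independent, the joint PDF is always the product of the marginal PDFs. Written out for the Bernoulli case (a worked restatement of the pattern behind the exercise above, with y denoting the number of successes):

f(x_1, \ldots, x_n) = \prod_{i=1}^n p^{x_i} (1 - p)^{1 - x_i} = p^y (1 - p)^{n - y}, \qquad (x_1, \ldots, x_n) \in \{0, 1\}^n, \; y = x_1 + \cdots + x_n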

Recall that the geometric distribution on N_+ with parameter p ∈ (0, 1) has probability density function g given by g(x) = p(1 − p)^{x−1} for x ∈ N_+. Let X = (X_1, X_2, …, X_n) be a random sample of size n ∈ N_+ from the distribution. Give the probability density function of X in simplified form.


Answer

The geometric distribution governs the trial number of the first success in a sequence of Bernoulli trials. Hence the variables in the
random sample can be interpreted as the number of trials between successive successes.
Recall that the Poisson distribution with parameter a ∈ (0, ∞) has probability density function g given by g(x) = e^{−a} a^x / x! for x ∈ N. Let X = (X_1, X_2, …, X_n) be a random sample of size n ∈ N_+ from the distribution. Give the probability density function of X in simplified form.


Answer

The Poisson distribution is named for Simeon Poisson, and governs the number of random points in a region of time or space under
appropriate circumstances. The parameter a is proportional to the size of the region. The Poisson distribution is studied in more
detail in the chapter on the Poisson process.

Recall again that the exponential distribution with rate parameter r ∈ (0, ∞) has probability density function g given by g(x) = r e^{−rx} for x ∈ (0, ∞). Let X = (X_1, X_2, …, X_n) be a random sample of size n ∈ N_+ from the distribution. Give the probability density function of X in simplified form.


Answer

The exponential distribution governs failure times and other types of arrival times under appropriate circumstances. The exponential distribution is studied in more detail in the chapter on the Poisson process. The variables in the random sample can be interpreted as the times between successive arrivals in the Poisson process.

Recall that the standard normal distribution has probability density function ϕ given by ϕ(z) = (1/√(2π)) e^{−z^2/2} for z ∈ R. Let Z = (Z_1, Z_2, …, Z_n) be a random sample of size n ∈ N_+ from the distribution. Give the probability density function of Z in simplified form.
Answer

The standard normal distribution governs physical quantities, properly scaled and centered, subject to small, random errors. The normal distribution is studied in more generality in the chapter on Special Distributions.

Data Analysis Exercises

For the cicada data, G denotes gender and S denotes species type.
1. Find the empirical density of (G, S) .
2. Find the empirical density of G.
3. Find the empirical density of S .
4. Do you believe that S and G are independent?
Answer

For the cicada data, let W denote body weight (in grams) and L body length (in millimeters).
1. Construct an empirical density for (W , L).
2. Find the corresponding empirical density for W .
3. Find the corresponding empirical density for L.
4. Do you believe that W and L are independent?
Answer

For the cicada data, let G denote gender and W body weight (in grams).
1. Construct an empirical density for (W, G).
2. Find the empirical density for G.
3. Find the empirical density for W .
4. Do you believe that G and W are independent?
Answer

This page titled 3.4: Joint Distributions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

3.5: Conditional Distributions

In this section, we study how a probability distribution changes when a given random variable has a known, specified value. So this is an essential topic that deals with how probability measures should be updated in light of new information. As usual, if you are a new student of probability, you may want to skip the technical details.

Basic Theory
Our starting point is a random experiment modeled by a probability space (Ω, F , P). So to review, Ω is the set of outcomes, F is
the collection of events, and P is the probability measure on the underlying sample space (Ω, F ).

Suppose that X is a random variable defined on the sample space (that is, defined for the experiment), taking values in a set S .
Details

The purpose of this section is to study the conditional probability measure given X = x for x ∈ S . That is, if E is an event, we
would like to define and study the probability of E given X = x , denoted P(E ∣ X = x) . If X has a discrete distribution, the
conditioning event has positive probability, so no new concepts are involved, and the simple definition of conditional probability
suffices. When X has a continuous distribution, however, the conditioning event has probability 0, so a fundamentally new
approach is needed.

The Discrete Case


Suppose first that X has a discrete distribution with probability density function g . Thus S is countable and we can assume that
g(x) = P(X = x) > 0 for x ∈ S .

If E is an event and x ∈ S then

P(E ∣ X = x) = P(E, X = x) / g(x)  (3.5.1)

Proof

The next result is a special case of the law of total probability, and will be the key to the definition when X has a continuous distribution.

If E is an event then

P(E, X ∈ A) = ∑_{x∈A} g(x) P(E ∣ X = x),  A ⊆ S  (3.5.3)

Conversely, this condition uniquely determines P(E ∣ X = x) .


Proof

The Continuous Case


Suppose now that X has a continuous distribution on S ⊆ R^n for some n ∈ N_+, with probability density function g. We assume that g(x) > 0 for x ∈ S. Unlike the discrete case, we cannot use simple conditional probability to define P(E ∣ X = x) for an
event E and x ∈ S because the conditioning event has probability 0. Nonetheless, the concept should make sense. If we actually
run the experiment, X will take on some value x (even though a priori, this event occurs with probability 0), and surely the
information that X = x should in general alter the probabilities that we assign to other events. A natural approach is to use the
results obtained in the discrete case as definitions in the continuous case.

If E is an event and x ∈ S , the conditional probability P(E ∣ X = x) is defined by the requirement that

P(E, X ∈ A) = ∫_A g(x) P(E ∣ X = x) dx,  A ⊆ S  (3.5.6)

Details

Conditioning and Bayes' Theorem
Suppose again that X is a random variable with values in S and probability density function g , as described above. Our discussions
above in the discrete and continuous cases lead to basic formulas for computing the probability of an event by conditioning on X.

The law of total probability. If E is an event, then P(E) can be computed as follows:
1. If X has a discrete distribution then

P(E) = ∑_{x∈S} g(x) P(E ∣ X = x)  (3.5.7)

2. If X has a continuous distribution then

P(E) = ∫_S g(x) P(E ∣ X = x) dx  (3.5.8)

Proof

Naturally, the law of total probability is useful when P(E ∣ X = x) and g(x) are known for x ∈ S. Our next result is Bayes' theorem, named after Thomas Bayes.

Bayes' Theorem. Suppose that E is an event with P(E) > 0 . The conditional probability density function x ↦ g(x ∣ E) of X
given E can be computed as follows:
1. If X has a discrete distribution then

g(x ∣ E) = g(x) P(E ∣ X = x) / ∑_{s∈S} g(s) P(E ∣ X = s),  x ∈ S  (3.5.9)

2. If X has a continuous distribution then

g(x ∣ E) = g(x) P(E ∣ X = x) / ∫_S g(s) P(E ∣ X = s) ds,  x ∈ S  (3.5.10)

Proof

In the context of Bayes' theorem, g is called the prior probability density function of X and x ↦ g(x ∣ E) is the posterior probability density function of X given E. Note also that the conditional probability density function of X given E is proportional to the function x ↦ g(x) P(E ∣ X = x); the sum or integral of this function that occurs in the denominator is simply the normalizing constant. As with the law of total probability, Bayes' theorem is useful when P(E ∣ X = x) and g(x) are known for x ∈ S.

Conditional Probability Density Functions


The definitions and results above apply, of course, if E is an event defined in terms of another random variable for our experiment.
Here is the setup:

Suppose that X and Y are random variables on the probability space, with values in sets S and T , respectively, so that (X, Y )
is a random variable with values in S × T . We assume that (X, Y ) has probability density function f , as discussed in the
section on Joint Distributions. Recall that X has probability density function g defined as follows:
1. If Y has a discrete distribution on the countable set T then

g(x) = ∑_{y∈T} f(x, y),  x ∈ S  (3.5.12)

2. If Y has a continuous distribution on T ⊆ R^k then

g(x) = ∫_T f(x, y) dy,  x ∈ S  (3.5.13)

Similarly, the probability density function h of Y can be obtained by summing f over x ∈ S if X has a discrete distribution or integrating f over S if X has a continuous distribution.

Suppose that x ∈ S and that g(x) > 0 . The function y ↦ h(y ∣ x) defined below is a probability density function on T :
h(y ∣ x) = f(x, y) / g(x),  y ∈ T  (3.5.14)

Proof

The distribution that corresponds to this probability density function is what you would expect:

For x ∈ S , the function y ↦ h(y ∣ x) is the conditional probability density function of Y given X = x . That is,
1. If Y has a discrete distribution then

P(Y ∈ B ∣ X = x) = ∑_{y∈B} h(y ∣ x),  B ⊆ T  (3.5.17)

2. If Y has a continuous distribution then

P(Y ∈ B ∣ X = x) = ∫_B h(y ∣ x) dy,  B ⊆ T  (3.5.18)

Proof

The following theorem gives Bayes' theorem for probability density functions. We use the notation established above.

Bayes' Theorem. For y ∈ T, the conditional probability density function x ↦ g(x ∣ y) of X given Y = y can be computed as follows:
1. If X has a discrete distribution then
g(x ∣ y) = g(x) h(y ∣ x) / ∑_{s∈S} g(s) h(y ∣ s),  x ∈ S  (3.5.24)

2. If X has a continuous distribution then

g(x ∣ y) = g(x) h(y ∣ x) / ∫_S g(s) h(y ∣ s) ds,  x ∈ S  (3.5.25)

Proof

In the context of Bayes' theorem, g is the prior probability density function of X and x ↦ g(x ∣ y) is the posterior probability
density function of X given Y = y for y ∈ T . Note that the posterior probability density function x ↦ g(x ∣ y) is proportional to
the function x ↦ g(x)h(y ∣ x). The sum or integral in the denominator is the normalizing constant.
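As a computational sketch of ours (not part of the text) of how these formulas are used, the following Python snippet computes a posterior density on a grid: the prior is g(p) = 6p(1 − p) on [0, 1] and the likelihood is binomial with 3 trials, anticipating the coin exercise later in this section. The normalizing integral is approximated by a Riemann sum; the names are ours.

from math import comb

tosses, heads = 3, 2

def g(p):                 # prior density on [0, 1]
    return 6 * p * (1 - p)

def likelihood(y, p):     # binomial PDF h(y | p)
    return comb(tosses, y) * p**y * (1 - p)**(tosses - y)

dp = 0.001
grid = [i * dp for i in range(1001)]
unnorm = [g(p) * likelihood(heads, p) for p in grid]
norm = sum(u * dp for u in unnorm)        # approximate normalizing constant
posterior = [u / norm for u in unnorm]    # proportional to p^3 (1 - p)^2: a beta density
print(posterior[500])                     # near 1.875, the beta(4, 3) density at p = 1/2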

Independence
Intuitively, X and Y should be independent if and only if the conditional distributions are the same as the corresponding
unconditional distributions.

The following conditions are equivalent:


1. X and Y are independent.
2. f (x, y) = g(x)h(y) for x ∈ S , y ∈ T
3. h(y ∣ x) = h(y) for x ∈ S , y ∈ T
4. g(x ∣ y) = g(x) for x ∈ S , y ∈ T
Proof

Examples and Applications


In the exercises that follow, look for special models and distributions that we have studied. A special distribution may be embedded
in a larger problem, as a conditional distribution, for example. In particular, a conditional distribution sometimes arises when a
parameter of a standard distribution is randomized.

A couple of special distributions will occur frequently in the exercises. First, recall that the discrete uniform distribution on a finite,
nonempty set S has probability density function f given by f (x) = 1/#(S) for x ∈ S . This distribution governs an element
selected at random from S .
Recall also that Bernoulli trials (named for Jacob Bernoulli) are independent trials, each with two possible outcomes generically
called success and failure. The probability of success p ∈ [0, 1] is the same for each trial, and is the basic parameter of the random
process. The number of successes in n ∈ N_+ Bernoulli trials has the binomial distribution with parameters n and p. This distribution has probability density function f given by f(x) = \binom{n}{x} p^x (1 − p)^{n−x} for x ∈ {0, 1, …, n}. The binomial distribution is studied in more detail in the chapter on Bernoulli trials.

Coins and Dice


Suppose that two standard, fair dice are rolled and the sequence of scores (X_1, X_2) is recorded. Let U = min{X_1, X_2} and V = max{X_1, X_2} denote the minimum and maximum scores, respectively.

1. Find the conditional probability density function of U given V =v for each v ∈ {1, 2, 3, 4, 5, 6}.
2. Find the conditional probability density function of V given U =u for each u ∈ {1, 2, 3, 4, 5, 6}.
Answer

In the die-coin experiment, a standard, fair die is rolled and then a fair coin is tossed the number of times showing on the die.
Let N denote the die score and Y the number of heads.
1. Find the joint probability density function of (N , Y ).
2. Find the probability density function of Y .
3. Find the conditional probability density function of N given Y =y for each y ∈ {0, 1, 2, 3, 4, 5, 6}.
Answer

In the die-coin experiment, select the fair die and coin.


1. Run the simulation 1000 times and compare the empirical density function of Y with the true probability density
function in the previous exercise
2. Run the simulation 1000 times and compute the empirical conditional density function of N given Y = 3 . Compare with
the conditional probability density functions in the previous exercise.

In the coin-die experiment, a fair coin is tossed. If the coin is tails, a standard, fair die is rolled. If the coin is heads, a standard,
ace-six flat die is rolled (faces 1 and 6 have probability 1/4 each and faces 2, 3, 4, 5 have probability 1/8 each). Let X denote the coin score (0 for tails and 1 for heads) and Y the die score.
1. Find the joint probability density function of (X, Y ).
2. Find the probability density function of Y .
3. Find the conditional probability density function of X given Y =y for each y ∈ {1, 2, 3, 4, 5, 6}.
Answer

In the coin-die experiment, select the settings of the previous exercise.


1. Run the simulation 1000 times and compare the empirical density function of Y with the true probability density function
in the previous exercise.
2. Run the simulation 100 times and compute the empirical conditional probability density function of X given Y = 2 .
Compare with the conditional probability density function in the previous exercise.

Suppose that a box contains 12 coins: 5 are fair, 4 are biased so that heads comes up with probability 1/3, and 3 are two-headed. A coin is chosen at random and tossed 2 times. Let P denote the probability of heads of the selected coin, and X the number of heads.
1. Find the joint probability density function of (P , X).
2. Find the probability density function of X.
3. Find the conditional probability density function of P given X = x for x ∈ {0, 1, 2}.

Answer

Compare the die-coin experiment with the box of coins experiment. In the first experiment, we toss a coin with a fixed probability
of heads a random number of times. In the second experiment, we effectively toss a coin with a random probability of heads a fixed
number of times.

Suppose that P has probability density function g(p) = 6p(1 − p) for p ∈ [0, 1]. Given P =p , a coin with probability of
heads p is tossed 3 times. Let X denote the number of heads.
1. Find the joint probability density function of (P , X).
2. Find the probability density of function of X.
3. Find the conditional probability density of P given X = x for x ∈ {0, 1, 2, 3}. Graph these on the same axes.
Answer

Compare the box of coins experiment with the last experiment. In the second experiment, we effectively choose a coin from a box
with a continuous infinity of coin types. The prior distribution of P and each of the posterior distributions of P in part (c) are
members of the family of beta distributions, one of the reasons for the importance of the beta family. Beta distributions are studied
in more detail in the chapter on Special Distributions.

In the simulation of the beta coin experiment, set a = b = 2 and n = 3 to get the experiment studied in the previous exercise.
For various “true values” of p, run the experiment in single step mode a few times and observe the posterior probability density
function on each run.

Simple Mixed Distributions


Recall that the exponential distribution with rate parameter r ∈ (0, ∞) has probability density function f given by f(t) = r e^{−rt} for t ∈ [0, ∞). The exponential distribution is often used to model random times, under certain assumptions. The exponential distribution is studied in more detail in the chapter on the Poisson Process. Recall also that for a, b ∈ R with a < b, the continuous uniform distribution on the interval [a, b] has probability density function f given by f(x) = 1/(b − a) for x ∈ [a, b]. This distribution governs a point selected at random from the interval.

Suppose that there are 5 light bulbs in a box, labeled 1 to 5. The lifetime of bulb n (in months) has the exponential distribution
with rate parameter n . A bulb is selected at random from the box and tested.
1. Find the probability that the selected bulb will last more than one month.
2. Given that the bulb lasts more than one month, find the conditional probability density function of the bulb number.
Answer

Suppose that X is uniformly distributed on {1, 2, 3}, and given X = x ∈ {1, 2, 3}, random variable Y is uniformly distributed
on the interval [0, x].
1. Find the joint probability density function of (X, Y ).
2. Find the probability density function of Y .
3. Find the conditional probability density function of X given Y =y for y ∈ [0, 3].
Answer

The Poisson Distribution


Recall that the Poisson distribution with parameter a ∈ (0, ∞) has probability density function g(n) = e^{−a} a^n / n! for n ∈ N. This

distribution is widely used to model the number of “random points” in a region of time or space; the parameter a is proportional to
the size of the region. The Poisson distribution is named for Simeon Poisson, and is studied in more detail in the chapter on the
Poisson Process.

Suppose that N is the number of elementary particles emitted by a sample of radioactive material in a specified period of time,
and has the Poisson distribution with parameter a . Each particle emitted, independently of the others, is detected by a counter
with probability p ∈ (0, 1) and missed with probability 1 − p . Let Y denote the number of particles detected by the counter.
1. For n ∈ N , argue that the conditional distribution of Y given N =n is binomial with parameters n and p.

2. Find the joint probability density function of (N , Y ).
3. Find the probability density function of Y .
4. For y ∈ N , find the conditional probability density function of N given Y =y .
Answer

The fact that Y also has a Poisson distribution is an interesting and characteristic property of the distribution. This property is
explored in more depth in the section on thinning the Poisson process.
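A direct numerical check of this fact (a sketch of ours, not part of the text): summing the joint PDF from part (b) over n should give the Poisson PDF with parameter ap. The series below is truncated at 60 terms, which is harmless since the terms decay very rapidly; the parameter values are arbitrary.

from math import comb, exp, factorial

a, p, y = 4.0, 0.3, 2

def poisson(mean, k):
    return exp(-mean) * mean**k / factorial(k)

total = sum(poisson(a, n) * comb(n, y) * p**y * (1 - p)**(n - y)
            for n in range(y, 60))
print(total)              # marginal PDF of Y at y, from the sum
print(poisson(a * p, y))  # Poisson(ap) PDF at y: same value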

Simple Continuous Distributions


Suppose that (X, Y) has probability density function f defined by f(x, y) = x + y for (x, y) ∈ (0, 1)^2.

1. Find the conditional probability density function of X given Y = y for y ∈ (0, 1)


2. Find the conditional probability density function of Y given X = x for x ∈ (0, 1)
3. Find P(1/4 ≤ Y ≤ 3/4 ∣ X = 1/3).
4. Are X and Y independent?


Answer

Suppose that (X, Y ) has probability density function f defined by f (x, y) = 2(x + y) for 0 < x < y < 1 .
1. Find the conditional probability density function of X given Y = y for y ∈ (0, 1).
2. Find the conditional probability density function of Y given X = x for x ∈ (0, 1).
3. Find P(Y ≥ 3/4 ∣ X = 1/2).
4. Are X and Y independent?


Answer

Suppose that (X, Y) has probability density function f defined by f(x, y) = 15x^2 y for 0 < x < y < 1.
1. Find the conditional probability density function of X given Y = y for y ∈ (0, 1).
2. Find the conditional probability density function of Y given X = x for x ∈ (0, 1).
3. Find P(X ≤ 1/4 ∣ Y = 1/2).
4. Are X and Y independent?


Answer

Suppose that (X, Y) has probability density function f defined by f(x, y) = 6x^2 y for 0 < x < 1 and 0 < y < 1.
1. Find the conditional probability density function of X given Y = y for y ∈ (0, 1).
2. Find the conditional probability density function of Y given X = x for x ∈ (0, 1).
3. Are X and Y independent?
Answer

Suppose that (X, Y) has probability density function f defined by f(x, y) = 2e^{−x} e^{−y} for 0 < x < y < ∞.
1. Find the conditional probability density function of X given Y = y for y ∈ (0, ∞).
2. Find the conditional probability density function of Y given X = x for x ∈ (0, ∞).
3. Are X and Y independent?
Answer

Suppose that X is uniformly distributed on the interval (0, 1), and that given X = x , Y is uniformly distributed on the interval
(0, x).

1. Find the joint probability density function of (X, Y ).


2. Find the probability density function of Y .
3. Find the conditional probability density function of X given Y =y for y ∈ (0, 1).
4. Are X and Y independent?
Answer

Suppose that X has probability density function g defined by g(x) = 3x^2 for x ∈ (0, 1). The conditional probability density function of Y given X = x is h(y ∣ x) = 3y^2 / x^3 for y ∈ (0, x).
1. Find the joint probability density function of (X, Y ).
2. Find the probability density function of Y .
3. Find the conditional probability density function of X given Y =y .
4. Are X and Y independent?
Answer

Multivariate Uniform Distributions


Multivariate uniform distributions give a geometric interpretation of some of the concepts in this section.

Recall that for n ∈ N_+, the standard measure λ_n on R^n is given by

λ_n(A) = ∫_A 1 dx,  A ⊆ R^n  (3.5.29)

In particular, λ_1(A) is the length of A ⊆ R, λ_2(A) is the area of A ⊆ R^2, and λ_3(A) is the volume of A ⊆ R^3.

Details

Suppose now that X takes values in R^j, Y takes values in R^k, and that (X, Y) is uniformly distributed on a set R ⊆ R^{j+k}. So 0 < λ_{j+k}(R) < ∞ and the joint probability density function f of (X, Y) is given by f(x, y) = 1/λ_{j+k}(R) for (x, y) ∈ R. Now let S and T be the projections of R onto R^j and R^k respectively, defined as follows:

S = {x ∈ R^j : (x, y) ∈ R for some y ∈ R^k},  T = {y ∈ R^k : (x, y) ∈ R for some x ∈ R^j}  (3.5.30)

Note that R ⊆ S × T. Next we denote the cross sections at x ∈ S and at y ∈ T, respectively, by

T_x = {t ∈ T : (x, t) ∈ R},  S_y = {s ∈ S : (s, y) ∈ R}  (3.5.31)

Figure 3.5.1 : The projections S and T , and the cross sections at x and y
In the last section on Joint Distributions, we saw that even though (X, Y ) is uniformly distributed, the marginal distributions of X
and Y are not uniform in general. However, as the next theorem shows, the conditional distributions are always uniform.

Suppose that (X, Y ) is uniformly distributed on R . Then


1. The conditional distribution of Y given X = x is uniform on T_x for each x ∈ S.
2. The conditional distribution of X given Y = y is uniform on S_y for each y ∈ T.

Proof

Find the conditional density of each variable given a value of the other, and determine if the variables are independent, in each
of the following cases:
1. (X, Y) is uniformly distributed on the square R = (−6, 6)^2.
2. (X, Y) is uniformly distributed on the triangle R = {(x, y) ∈ R^2 : −6 < y < x < 6}.
3. (X, Y) is uniformly distributed on the circle R = {(x, y) ∈ R^2 : x^2 + y^2 < 36}.

Answer

In the bivariate uniform experiment, run the simulation 1000 times in each of the following cases. Watch the points in the
scatter plot and the graphs of the marginal distributions.
1. square
2. triangle
3. circle

Suppose that (X, Y, Z) is uniformly distributed on R = {(x, y, z) ∈ R^3 : 0 < x < y < z < 1}.
1. Find the conditional density of each pair of variables given a value of the third variable.
2. Find the conditional density of each variable given values of the other two.
Answer

The Multivariate Hypergeometric Distribution


Recall the discussion of the (multivariate) hypergeometric distribution given in the last section on joint distributions. As in that
discussion, suppose that a population consists of m objects, and that each object is one of four types. There are a objects of type 1,
b objects of type 2, c objects of type 3, and m − a − b − c objects of type 0. We sample n objects from the population at
random, and without replacement. The parameters a , b , c , and n are nonnegative integers with a + b + c ≤ m and n ≤ m . Denote
the number of type 1, 2, and 3 objects in the sample by X, Y , and Z , respectively. Hence, the number of type 0 objects in the
sample is n − X − Y − Z . In the following exercises, x, y, z ∈ N .

Suppose that z ≤ c and n − m + c ≤ z ≤ n. Then the conditional distribution of (X, Y) given Z = z is hypergeometric, and has the probability density function defined by

g(x, y ∣ z) = \binom{a}{x} \binom{b}{y} \binom{m−a−b−c}{n−x−y−z} \Big/ \binom{m−c}{n−z},  x + y ≤ n − z  (3.5.34)

Proof

Suppose that y ≤ b, z ≤ c, and n − m + b + c ≤ y + z ≤ n. Then the conditional distribution of X given Y = y and Z = z is


hypergeometric, and has the probability density function defined by
g(x ∣ y, z) = \binom{a}{x} \binom{m−a−b−c}{n−x−y−z} \Big/ \binom{m−b−c}{n−y−z},  x ≤ n − y − z  (3.5.35)

Proof

These results generalize in a completely straightforward way to a population with any number of types. In brief, if a random vector
has a hypergeometric distribution, then the conditional distribution of some of the variables, given values of the other variables, is
also hypergeometric. Moreover, it is clearly not necessary to remember the hideous formulas in the previous two theorems. You
just need to recognize the problem as sampling without replacement from a multi-type population, and then identify the number of
objects of each type and the sample size. The hypergeometric distribution and the multivariate hypergeometric distribution are
studied in more detail in the chapter on Finite Sampling Models.

In a population of 150 voters, 60 are democrats, 50 are republicans, and 40 are independents. A sample of 15 voters is
selected at random, without replacement. Let X denote the number of democrats in the sample and Y the number of
republicans in the sample. Give the probability density function of each of the following:
1. (X, Y )
2. X
3. Y given X = 5
Answer

Recall that a bridge hand consists of 13 cards selected at random and without replacement from a standard deck of 52 cards.
Let X, Y , and Z denote the number of spades, hearts, and diamonds, respectively, in the hand. Find the probability density
function of each of the following:

1. (X, Y , Z)
2. (X, Y )
3. X
4. (X, Y ) given Z = 3
5. X given Y = 3 and Z = 2
Answer

Multinomial Trials
Recall the discussion of multinomial trials in the last section on joint distributions. As in that discussion, suppose that we have a
sequence of n independent trials, each with 4 possible outcomes. On each trial, outcome 1 occurs with probability p, outcome 2
with probability q, outcome 3 with probability r, and outcome 0 with probability 1 − p − q − r . The parameters p, q, r ∈ (0, 1),
with p + q + r < 1, and n ∈ N_+. Denote the number of times that outcome 1, outcome 2, and outcome 3 occurs in the n trials by X, Y, and Z respectively. Of course, the number of times that outcome 0 occurs is n − X − Y − Z. In the following exercises, x, y, z ∈ N.

For z ≤ n, the conditional distribution of (X, Y) given Z = z is also multinomial, and has the probability density function

g(x, y ∣ z) = \binom{n−z}{x, y} (\frac{p}{1−r})^x (\frac{q}{1−r})^y (1 − \frac{p}{1−r} − \frac{q}{1−r})^{n−x−y−z},  x + y ≤ n − z  (3.5.36)

Proof

For y + z ≤ n, the conditional distribution of X given Y = y and Z = z is binomial, with the probability density function

h(x ∣ y, z) = \binom{n−y−z}{x} (\frac{p}{1−q−r})^x (1 − \frac{p}{1−q−r})^{n−x−y−z},  x ≤ n − y − z  (3.5.37)

Proof

These results generalize in a completely straightforward way to multinomial trials with any number of trial outcomes. In brief, if a
random vector has a multinomial distribution, then the conditional distribution of some of the variables, given values of the other
variables, is also multinomial. Moreover, it is clearly not necessary to remember the specific formulas in the previous two
exercises. You just need to recognize a problem as one involving independent trials, and then identify the probability of each
outcome and the number of trials. The binomial distribution and the multinomial distribution are studied in more detail in the
chapter on Bernoulli Trials.

Suppose that peaches from an orchard are classified as small, medium, or large. Each peach, independently of the others, is small with probability 3/10, medium with probability 1/2, and large with probability 1/5. In a sample of 20 peaches from the orchard, let X denote the number of small peaches and Y the number of medium peaches. Give the probability density function of each of the following:
1. (X, Y )
2. X
3. Y given X = 5
Answer

For a certain crooked, 4-sided die, face 1 has probability 2/5, face 2 has probability 3/10, face 3 has probability 1/5, and face 4 has probability 1/10. Suppose that the die is thrown 50 times. Let X, Y, and Z denote the number of times that scores 1, 2, and 3 occur, respectively. Find the probability density function of each of the following:
1. (X, Y , Z)
2. (X, Y )
3. X
4. (X, Y ) given Z = 5
5. X given Y = 10 and Z = 5
Answer

Bivariate Normal Distributions
The joint distributions in the next two exercises are examples of bivariate normal distributions. The conditional distributions are
also normal, an important property of the bivariate normal distribution. In general, normal distributions are widely used to model
physical measurements subject to small, random errors. The bivariate normal distribution is studied in more detail in the chapter on
Special Distributions.

Suppose that (X, Y ) has the bivariate normal distribution with probability density function f defined by
f(x, y) = (1 / 12π) exp[−(x^2/8 + y^2/18)],  (x, y) ∈ R^2  (3.5.38)

1. Find the conditional probability density function of X given Y = y for y ∈ R .


2. Find the conditional probability density function of Y given X = x for x ∈ R.
3. Are X and Y independent?
Answer

Suppose that (X, Y ) has the bivariate normal distribution with probability density function f defined by
f(x, y) = (1 / √3 π) exp[−(2/3)(x^2 − xy + y^2)],  (x, y) ∈ R^2  (3.5.39)

1. Find the conditional probability density function of X given Y = y for y ∈ R .


2. Find the conditional probability density function of Y given X = x for x ∈ R.
3. Are X and Y independent?
Answer

Mixtures of Distributions
With our usual sets S and T, as above, suppose that P_x is a probability measure on T for each x ∈ S. Suppose also that g is a probability density function on S. We can obtain a new probability measure on T by averaging (or mixing) the given distributions according to g.

First suppose that g is the probability density function of a discrete distribution on the countable set S . Then the function P

defined below is a probability measure on T :

P(B) = ∑_{x∈S} g(x) P_x(B),  B ⊆ T  (3.5.40)

Proof

In the setting of the previous theorem, suppose that P_x has probability density function h_x for each x ∈ S. Then P has probability density function h given by

h(y) = ∑_{x∈S} g(x) h_x(y),  y ∈ T  (3.5.42)

Proof

Conversely, given a probability density function g on S and a probability density function h_x on T for each x ∈ S, the function h defined in the previous theorem is a probability density function on T.
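As a small sketch of ours (not part of the text), here is the discrete mixture formula (3.5.42) in Python for the setting of an exercise earlier in this section: X is uniform on S = {1, 2, 3} and, given X = x, Y is uniform on [0, x], so h_x(y) = 1/x on [0, x].

S = [1, 2, 3]
g = {x: 1 / 3 for x in S}          # mixing density on S

def h_x(y, x):                     # density of the uniform distribution on [0, x]
    return 1 / x if 0 <= y <= x else 0.0

def h(y):                          # mixture density: h(y) = sum over x of g(x) h_x(y)
    return sum(g[x] * h_x(y, x) for x in S)

print(h(0.5), h(1.5), h(2.5))      # 11/18, 5/18, 1/9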

Suppose now that g is the probability density function of a continuous distribution on S ⊆ R^j. Then the function P defined below is a probability measure on T:

P(B) = ∫_S g(x) P_x(B) dx,  B ⊆ T  (3.5.45)

Proof

In the setting of the previous theorem, suppose that P_x is a discrete (respectively continuous) distribution with probability density function h_x for each x ∈ S. Then P is also discrete (respectively continuous) with probability density function h given by

h(y) = ∫_S g(x) h_x(y) dx,  y ∈ T  (3.5.47)

Proof

In both cases, the distribution P is said to be a mixture of the set of distributions {P_x : x ∈ S}, with mixing density g.
One can have a mixture of distributions, without having random variables defined on a common probability space. However, mixtures are intimately related to conditional distributions. Returning to our usual setup, suppose that X and Y are random variables for an experiment, taking values in S and T respectively, and that X has probability density function g. The following result is simply a restatement of the law of total probability.

The distribution of Y is a mixture of the conditional distributions of Y given X = x , over x ∈ S , with mixing density g .
Proof

Finally we note that a mixed distribution (with discrete and continuous parts) really is a mixture, in the sense of this discussion.

Suppose that P is a mixed distribution on a set T . Then P is a mixture of a discrete distribution and a continuous distribution.
Proof

This page titled 3.5: Conditional Distributions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

3.6: Distribution and Quantile Functions

As usual, our starting point is a random experiment modeled by a probability space (Ω, F, P). So to review, Ω is the set of outcomes, F is the collection of events, and P is the probability measure on the sample space (Ω, F). In this section, we will study two types of functions that can be used to specify the distribution of a real-valued random variable.

Distribution Functions
Definition
Suppose that X is a random variable with values in R . The (cumulative) distribution function of X is the function
F : R → [0, 1] defined by
F (x) = P(X ≤ x), x ∈ R (3.6.1)

The distribution function is important because it makes sense for any type of random variable, regardless of whether the
distribution is discrete, continuous, or even mixed, and because it completely determines the distribution of X. In the picture below,
the light shading is intended to represent a continuous distribution of probability, while the darker dots represents points of positive
probability; F (x) is the total probability mass to the left of (and including) x.

Figure 3.6.1 : F (x) is the total probability to the left of (and including) x

Basic Properties
A few basic properties completely characterize distribution functions. Notationally, it will be helpful to abbreviate the limits of F from the left and right at x ∈ R, and at ∞ and −∞, as follows:

F(x^+) = lim_{t↓x} F(t),  F(x^−) = lim_{t↑x} F(t),  F(∞) = lim_{t→∞} F(t),  F(−∞) = lim_{t→−∞} F(t)  (3.6.2)

Suppose that F is the distribution function of a real-valued random variable X.


1. F is increasing: if x ≤ y then F(x) ≤ F(y).
2. F(x^+) = F(x) for x ∈ R. Thus, F is continuous from the right.
3. F(x^−) = P(X < x) for x ∈ R. Thus, F has limits from the left.
4. F(−∞) = 0.
5. F(∞) = 1.
Proof

Figure 3.6.2 : The graph of a distribution function


The following result shows how the distribution function can be used to compute the probability that X is in an interval. Recall that
a probability distribution on R is completely determined by the probabilities of intervals; thus, the distribution function determines

the distribution of X.

Suppose again that F is the distribution function of a real-valued random variable X. If a, b ∈ R with a < b then
1. P(X = a) = F(a) − F(a^−)
2. P(a < X ≤ b) = F(b) − F(a)
3. P(a < X < b) = F(b^−) − F(a)
4. P(a ≤ X ≤ b) = F(b) − F(a^−)
5. P(a ≤ X < b) = F(b^−) − F(a^−)

Proof

Conversely, if a function F : R → [0, 1] satisfies the basic properties, then the formulas above define a probability distribution on R, with F as the distribution function. For more on this point, read the section on Existence and Uniqueness.

If X has a continuous distribution, then the distribution function F is continuous.


Proof

Thus, the two meanings of continuous come together: continuous distribution and continuous function in the calculus sense. Next
recall that the distribution of a real-valued random variable X is symmetric about a point a ∈ R if the distribution of X − a is the
same as the distribution of a − X .

Suppose that X has a continuous distribution on R that is symmetric about a point a . Then the distribution function F satisfies
F (a − t) = 1 − F (a + t) for t ∈ R .
Proof

Relation to Density Functions


There are simple relationships between the distribution function and the probability density function. Recall that if X takes values in S ⊆ R and has probability density function f, we can extend f to all of R by the convention that f(x) = 0 for x ∈ S^c. As in Definition (1), it's customary to define the distribution function F on all of R, even if the random variable takes values in a subset.

Suppose that X has a discrete distribution on a countable subset S ⊆ R. Let f denote the probability density function and F the
distribution function.
1. F(x) = ∑_{t∈S, t≤x} f(t) for x ∈ R
2. f(x) = F(x) − F(x^−) for x ∈ S


Proof

Thus, F is a step function with jumps at the points in S ; the size of the jump at x is f (x).

Figure 3.6.3 : The distribution function of a discrete distribution


There is an analogous result for a continuous distribution with a probability density function.

Suppose that X has a continuous distribution on R with probability density function f and distribution function F.
1. F(x) = ∫_{−∞}^{x} f(t) dt for x ∈ R.
2. f(x) = F′(x) if f is continuous at x.
Proof
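As a numerical sketch of ours (not part of the text) of part 1, the snippet below builds F from f by integration for the exponential density f(t) = r e^{−rt} on [0, ∞), and uses it to compute an interval probability, which can be compared with the closed form e^{−ra} − e^{−rb}.

from math import exp
from scipy.integrate import quad

r = 2.0
f = lambda t: r * exp(-r * t)

def F(x):                            # F(x) = integral of f from 0 to x
    return quad(f, 0, x)[0] if x > 0 else 0.0

a, b = 0.5, 1.5
print(F(b) - F(a))                   # P(a < X <= b) from the distribution function
print(exp(-r * a) - exp(-r * b))     # exact value: the same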

Figure 3.6.4 : The distribution function of a continuous distribution


The last result is the basic probabilistic version of the fundamental theorem of calculus. For mixed distributions, we have a
combination of the results in the last two theorems.

Suppose that X has a mixed distribution, with discrete part on a countable subset D ⊆ R , and continuous part on R ∖ D . Let g
denote the partial probability density function of the discrete part and assume that the continuous part has partial probability
density function h . Let F denote the distribution function.
1. F(x) = ∑_{t∈D, t≤x} g(t) + ∫_{−∞}^{x} h(t) dt for x ∈ R
2. g(x) = F(x) − F(x^−) for x ∈ D
3. h(x) = F′(x) if x ∉ D and h is continuous at x
3. h(x) = F (x) if x ∉ D and h is continuous at x


Go back to the graph of a general distribution function. At a point of positive probability, the probability is the size of the jump. At
a smooth point of the graph, the continuous probability density is the slope.
Recall that the existence of a probability density function is not guaranteed for a continuous distribution, but of course the
distribution function always makes perfect sense. The advanced section on absolute continuity and density functions has an
example of a continuous distribution on the interval (0, 1) that has no probability density function. The distribution function is
continuous and strictly increases from 0 to 1 on the interval, but has derivative 0 at almost every point!
Naturally, the distribution function can be defined relative to any of the conditional distributions we have discussed. No new
concepts are involved, and all of the results above hold.

Reliability
Suppose again that X is a real-valued random variable with distribution function F . The function in the following definition clearly
gives the same information as F .

The function F^c defined by

F^c(x) = 1 − F(x) = P(X > x),  x ∈ R  (3.6.4)

is the right-tail distribution function of X. Give the mathematical properties of F^c analogous to the properties of F in (2).

Answer

So F might be called the left-tail distribution function. But why have two distribution functions that give essentially the same
information? The right-tail distribution function, and related functions, arise naturally in the context of reliability theory. For the
remainder of this subsection, suppose that T is a random variable with values in [0, ∞) and that T has a continuous distribution
with probability density function f. Here are the important definitions:

Suppose that T represents the lifetime of a device.


1. The right-tail distribution function F^c is the reliability function of T.
2. The function h defined by h(t) = f(t)/F^c(t) for t ≥ 0 is the failure rate function of T.

To interpret the reliability function, note that F^c(t) = P(T > t) is the probability that the device lasts at least t time units. To interpret the failure rate function, note that if dt is “small” then

P(t < T < t + dt ∣ T > t) = P(t < T < t + dt)/P(T > t) ≈ f(t) dt/F^c(t) = h(t) dt (3.6.5)

So h(t) dt is the approximate probability that the device will fail in the interval (t, t + dt) , given survival up to time t . Moreover,
like the distribution function and the reliability function, the failure rate function also completely determines the distribution of T .

The reliability function can be expressed in terms of the failure rate function by
F^c(t) = exp(−∫_0^t h(s) ds), t ≥ 0 (3.6.6)

Proof
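To make the relationship concrete, here is a minimal Python sketch (an illustration added here, not part of the original text, and assuming NumPy is available) that recovers the reliability function numerically from a given failure rate function. The rate h(t) = 2t used below is the Weibull failure rate with k = 2 that appears in an exercise later in this section.

import numpy as np

def reliability_from_failure_rate(h, t, n=10001):
    """Approximate F^c(t) = exp(-integral_0^t h(s) ds) with the trapezoid rule."""
    s = np.linspace(0.0, t, n)
    return np.exp(-np.trapz(h(s), s))

# failure rate h(t) = 2t; the exact reliability function is exp(-t^2)
print(reliability_from_failure_rate(lambda s: 2 * s, 1.0))  # about exp(-1) = 0.3679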

The failure rate function h satisfies the following properties:


1. h(t) ≥ 0 for t ≥ 0
2. ∫_0^∞ h(t) dt = ∞

Proof

Conversely, a function that satisfies these properties is the failure rate function for a continuous distribution on [0, ∞):

Suppose that h : [0, ∞) → [0, ∞) is piecewise continuous and ∫_0^∞ h(t) dt = ∞. Then the function F^c defined by

F^c(t) = exp(−∫_0^t h(s) ds), t ≥ 0 (3.6.8)

is a reliability function for a continuous distribution on [0, ∞).


Proof

Multivariate Distribution Functions


Suppose now that X and Y are real-valued random variables for an experiment (that is, defined on the same probability space), so that (X, Y) is a random vector taking values in a subset of R^2.

The distribution function of (X, Y) is the function F defined by

F(x, y) = P(X ≤ x, Y ≤ y), (x, y) ∈ R^2 (3.6.9)

Figure 3.6.5 : F (x, y) is the total probability below and to the left of (x, y).
In the graph above, the light shading is intended to suggest a continuous distribution of probability, while the darker dots represent
points of positive probability. Thus, F (x, y) is the total probability mass below and to the left (that is, southwest) of the point
(x, y). As in the single variable case, the distribution function of (X, Y ) completely determines the distribution of (X, Y ).

Suppose that a, b, c, d ∈ R with a < b and c < d . Then

P(a < X ≤ b, c < Y ≤ d) = F (b, d) − F (a, d) − F (b, c) + F (a, c) (3.6.10)

Proof

A probability distribution on R^2 is completely determined by its values on rectangles of the form (a, b] × (c, d], so just as in the single variable case, it follows that the distribution function of (X, Y) completely determines the distribution of (X, Y). See the
advanced section on existence and uniqueness of positive measures in the chapter on Probability Measures for more details.

In the setting of the previous result, give the appropriate formula on the right for all possible combinations of weak and strong
inequalities on the left.

The joint distribution function determines the individual (marginal) distribution functions.

Let F denote the distribution function of (X, Y ), and let G and H denote the distribution functions of X and Y , respectively.
Then
1. G(x) = F (x, ∞) for x ∈ R
2. H (y) = F (∞, y) for y ∈ R
Proof

On the other hand, we cannot recover the distribution function of (X, Y ) from the individual distribution functions, except when
the variables are independent.

Random variables X and Y are independent if and only if


F(x, y) = G(x)H(y), (x, y) ∈ R^2 (3.6.13)

Proof

All of the results of this subsection generalize in a straightforward way to n -dimensional random vectors. Only the notation is more
complicated.

The Empirical Distribution Function


Suppose now that X is a real-valued random variable for a basic random experiment and that we repeat the experiment n times independently. This generates (for the new compound experiment) a sequence of independent variables (X_1, X_2, …, X_n), each with the same distribution as X. In statistical terms, this sequence is a random sample of size n from the distribution of X. In statistical inference, the observed values (x_1, x_2, …, x_n) of the random sample form our data.

The empirical distribution function, based on the data (x_1, x_2, …, x_n), is defined by

F_n(x) = (1/n) #{i ∈ {1, 2, …, n} : x_i ≤ x} = (1/n) ∑_{i=1}^n 1(x_i ≤ x), x ∈ R (3.6.16)

Thus, F_n(x) gives the proportion of values in the data set that are less than or equal to x. The function F_n is a statistical estimator of F, based on the given data set. This concept is explored in more detail in the section on the sample mean in the chapter on
Random Samples. In addition, the empirical distribution function is related to the Brownian bridge stochastic process which is
studied in the chapter on Brownian motion.
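As a small illustration (not part of the original text), the defining formula translates directly into Python; the data values below are made up.

def empirical_cdf(data):
    """Return the empirical distribution function F_n of the data set."""
    n = len(data)
    return lambda x: sum(1 for xi in data if xi <= x) / n

F5 = empirical_cdf([2.1, 3.5, 3.5, 4.0, 6.2])
print(F5(3.5))   # 0.6, since three of the five values are <= 3.5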

Quantile Functions
Definitions
Suppose again that X is a real-valued random variable with distribution function F .

For p ∈ (0, 1), a value of x such that F(x^−) = P(X < x) ≤ p and F(x) = P(X ≤ x) ≥ p is called a quantile of order p for the distribution.

Roughly speaking, a quantile of order p is a value where the graph of the distribution function crosses (or jumps over) p. For
example, in the picture below, a is the unique quantile of order p and b is the unique quantile of order q. On the other hand, the

quantiles of order r form the interval [c, d], and moreover, d is a quantile for all orders in the interval [r, s]. Note also that if X has
a continuous distribution (so that F is continuous) and x is a quantile of order p ∈ (0, 1), then F (x) = p .

Figure 3.6.6 : Quantiles of various orders

Note that there is an inverse relation of sorts between the quantiles and the cumulative distribution values, but the relation is more
complicated than that of a function and its ordinary inverse function, because the distribution function is not one-to-one in general.
For many purposes, it is helpful to select a specific quantile for each order; to do this requires defining a generalized inverse of the
distribution function F .

The quantile function F^{−1} of X is defined by

F^{−1}(p) = min{x ∈ R : F(x) ≥ p}, p ∈ (0, 1) (3.6.17)

F^{−1} is well defined

Note that if F strictly increases from 0 to 1 on an interval S (so that the underlying distribution is continuous and is supported on S), then F^{−1} is the ordinary inverse of F. We do not usually define the quantile function at the endpoints 0 and 1. If we did, note that F^{−1}(0) would always be −∞.

Properties
The following exercise justifies the name: F^{−1}(p) is the minimum of the quantiles of order p.

Let p ∈ (0, 1).
1. F^{−1}(p) is a quantile of order p.
2. If x is a quantile of order p then F^{−1}(p) ≤ x.
Proof

Other basic properties of the quantile function are given in the following theorem.

F^{−1} satisfies the following properties:
1. F^{−1} is increasing on (0, 1).
2. F^{−1}[F(x)] ≤ x for any x ∈ R with F(x) < 1.
3. F[F^{−1}(p)] ≥ p for any p ∈ (0, 1).
4. F^{−1}(p^−) = F^{−1}(p) for p ∈ (0, 1). Thus F^{−1} is continuous from the left.
5. F^{−1}(p^+) = inf{x ∈ R : F(x) > p} for p ∈ (0, 1). Thus F^{−1} has limits from the right.

Proof

As always, the inverse of a function is obtained essentially by reversing the roles of independent and dependent variables. In the graphs below, note that jumps of F become flat portions of F^{−1} while flat portions of F become jumps of F^{−1}. For p ∈ (0, 1), the set of quantiles of order p is the closed, bounded interval [F^{−1}(p), F^{−1}(p^+)]. Thus, F^{−1}(p) is the smallest quantile of order p, as we noted earlier, while F^{−1}(p^+) is the largest quantile of order p.

Figure 3.6.7: Graph of the distribution function

Figure 3.6.8: Graph of the quantile function

The following basic property will be useful in simulating random variables, a topic explored in the section on transformations of
random variables.

For x ∈ R and p ∈ (0, 1), F^{−1}(p) ≤ x if and only if p ≤ F(x).
Proof
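As a concrete illustration (added here, not in the original text), the definition F^{−1}(p) = min{x ∈ R : F(x) ≥ p} can be computed directly for a discrete distribution by scanning the support in increasing order:

def quantile(support, pdf, p):
    """Smallest x in the sorted support with F(x) >= p."""
    cum = 0.0
    for x, f in zip(support, pdf):
        cum += f
        if cum >= p:
            return x
    return support[-1]   # guards against floating-point round-off near p = 1

# uniform distribution on {1, 2, 3, 4}: the quantile of order 1/2 is 2,
# since F(2) = 0.5 >= 0.5 while F(1) = 0.25 < 0.5
print(quantile([1, 2, 3, 4], [0.25] * 4, 0.5))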

Special Quantiles
Certain quantiles are important enough to deserve special names.

Suppose that X is a real-valued random variable.


1. A quantile of order 1/4 is a first quartile of the distribution.
2. A quantile of order 1/2 is a median or second quartile of the distribution.
3. A quantile of order 3/4 is a third quartile of the distribution.

When there is only one median, it is frequently used as a measure of the center of the distribution, since it divides the set of values
of X in half, by probability. More generally, the quartiles can be used to divide the set of values into fourths, by probability.

Assuming uniqueness, let q_1, q_2, and q_3 denote the first, second, and third quartiles of X, respectively, and let a = F^{−1}(0^+) and b = F^{−1}(1^−).

1. The interquartile range is defined to be q_3 − q_1.
2. The five parameters (a, q_1, q_2, q_3, b) are referred to as the five number summary of the distribution.

Note that the interval [q_1, q_3] roughly gives the middle half of the distribution, so the interquartile range, the length of the interval, is a natural measure of the dispersion of the distribution about the median. Note also that a and b are essentially the minimum and maximum values of X, respectively, although of course, it's possible that a = −∞ or b = ∞ (or both). Collectively, the five parameters give a great deal of information about the distribution in terms of the center, spread, and skewness. Graphically, the five numbers are often displayed as a boxplot or box and whisker plot, which consists of a line extending from the minimum value a to the maximum value b, with a rectangular box from q_1 to q_3, and “whiskers” at a, the median q_2, and b. Roughly speaking, the five numbers separate the set of values of X into 4 intervals of approximate probability 1/4 each.

Figure 3.6.9 : The probability density function and boxplot for a continuous distribution

Suppose that X has a continuous distribution that is symmetric about a point a ∈ R . If a + t is a quantile of order p ∈ (0, 1)

then a − t is a quantile of order 1 − p .


Proof

Examples and Applications


Distributions of Different Types

Let F be the function defined by

F(x) =
  0,     x < 1
  1/10,  1 ≤ x < 3/2
  3/10,  3/2 ≤ x < 2
  6/10,  2 ≤ x < 5/2
  9/10,  5/2 ≤ x < 3
  1,     x ≥ 3          (3.6.18)

1. Sketch the graph of F and show that F is the distribution function for a discrete distribution.
2. Find the corresponding probability density function f and sketch the graph.
3. Find P(2 ≤ X < 3) where X has this distribution.
4. Find the quantile function and sketch the graph.
5. Find the five number summary and sketch the boxplot.
Answer

Let F be the function defined by

F(x) =
  0,          x < 0
  x/(x + 1),  x ≥ 0          (3.6.19)

1. Sketch the graph of F and show that F is the distribution function for a continuous distribution.
2. Find the corresponding probability density function f and sketch the graph.
3. Find P(2 ≤ X < 3) where X has this distribution.
4. Find the quantile function and sketch the graph.
5. Find the five number summary and sketch the boxplot.
Answer
The expression p/(1 − p) that occurs in the quantile function in the last exercise is known as the odds ratio associated with p, particularly in the context of gambling.

Let F be the function defined by

F(x) =
  0,                        x < 0
  (1/4) x,                  0 ≤ x < 1
  1/3 + (1/4)(x − 1)^2,     1 ≤ x < 2
  2/3 + (1/4)(x − 2)^3,     2 ≤ x < 3
  1,                        x ≥ 3          (3.6.20)

1. Sketch the graph of F and show that F is the distribution function of a mixed distribution.
2. Find the partial probability density function of the discrete part and sketch the graph.
3. Find the partial probability density function of the continuous part and sketch the graph.
4. Find P(2 ≤ X < 3) where X has this distribution.
5. Find the quantile function and sketch the graph.
6. Find the five number summary and sketch the boxplot.
Answer

The Uniform Distribution

Suppose that X has probability density function f(x) = 1/(b − a) for x ∈ [a, b], where a, b ∈ R and a < b.
1. Find the distribution function and sketch the graph.
2. Find the quantile function and sketch the graph.
3. Compute the five-number summary.
4. Sketch the graph of the probability density function with the boxplot on the horizontal axis.
Answer

The distribution in the last exercise is the uniform distribution on the interval [a, b]. The left endpoint a is the location parameter
and the length of the interval w = b − a is the scale parameter. The uniform distribution models a point chosen “at random” from
the interval, and is studied in more detail in the chapter on Special Distributions.

In the special distribution calculator, select the continuous uniform distribution. Vary the location and scale parameters and
note the shape of the probability density function and the distribution function.

The Exponential Distribution


Suppose that T has probability density function f(t) = re^{−rt} for 0 ≤ t < ∞, where r > 0 is a parameter.
1. Find the distribution function and sketch the graph.
2. Find the reliability function and sketch the graph.
3. Find the failure rate function and sketch the graph.
4. Find the quantile function and sketch the graph.
5. Compute the five-number summary.
6. Sketch the graph of the probability density function with the boxplot on the horizontal axis.
Answer

The distribution in the last exercise is the exponential distribution with rate parameter r. Note that this distribution is characterized
by the fact that it has constant failure rate (and this is the reason for referring to r as the rate parameter). The reciprocal of the rate
parameter is the scale parameter. The exponential distribution is used to model failure times and other random times under certain
conditions, and is studied in detail in the chapter on The Poisson Process.

In the special distribution calculator, select the exponential distribution. Vary the scale parameter b and note the shape of the
probability density function and the distribution function.

The Pareto Distribution


Suppose that X has probability density function f(x) = a/x^{a+1} for 1 ≤ x < ∞, where a > 0 is a parameter.

1. Find the distribution function.


2. Find the reliability function.

3. Find the failure rate function.
4. Find the quantile function.
5. Compute the five-number summary.
6. In the case a = 2 , sketch the graph of the probability density function with the boxplot on the horizontal axis.
Answer

The distribution in the last exercise is the Pareto distribution with shape parameter a , named after Vilfredo Pareto. The Pareto
distribution is a heavy-tailed distribution that is sometimes used to model income and certain other economic variables. It is studied
in detail in the chapter on Special Distributions.

In the special distribution calculator, select the Pareto distribution. Keep the default value for the scale parameter, but vary the
shape parameter and note the shape of the density function and the distribution function.

The Cauchy Distribution


Suppose that X has probability density function f(x) = 1/(π(1 + x^2)) for x ∈ R.

1. Find the distribution function and sketch the graph.


2. Find the quantile function and sketch the graph.
3. Compute the five-number summary and the interquartile range.
4. Sketch the graph of the probability density function with the boxplot on the horizontal axis.
Answer

The distribution in the last exercise is the Cauchy distribution, named after Augustin Cauchy. The Cauchy distribution is studied in
more generality in the chapter on Special Distributions.

In the special distribution calculator, select the Cauchy distribution and keep the default parameter values. Note the shape of
the density function and the distribution function.

The Weibull Distribution


Let h(t) = kt^{k−1} for 0 < t < ∞, where k > 0 is a parameter.
1. Sketch the graph of h in the cases 0 < k < 1 , k = 1 , 1 < k < 2 , k = 2 , and k > 2 .
2. Show that h is a failure rate function.
3. Find the reliability function and sketch the graph.
4. Find the distribution function and sketch the graph.
5. Find the probability density function and sketch the graph.
6. Find the quantile function and sketch the graph.
7. Compute the five-number summary.
Answer

The distribution in the previous exercise is the Weibull distribution with shape parameter k, named after Waloddi Weibull. The
Weibull distribution is studied in detail in the chapter on Special Distributions. Since this family includes increasing, decreasing,
and constant failure rates, it is widely used to model the lifetimes of various types of devices.

In the special distribution calculator, select the Weibull distribution. Keep the default scale parameter, but vary the shape
parameter and note the shape of the density function and the distribution function.

Beta Distributions
Suppose that X has probability density function f(x) = 12x^2(1 − x) for 0 ≤ x ≤ 1.
1. Find the distribution function of X and sketch the graph.
2. Find P(1/4 ≤ X ≤ 1/2).
3. Compute the five number summary and the interquartile range. You will have to approximate the quantiles.
4. Sketch the graph of the density function with the boxplot on the horizontal axis.

Answer

Suppose that X has probability density function f(x) = 1/(π√(x(1 − x))) for 0 < x < 1.

1. Find the distribution function of X and sketch the graph.


2. Compute P(1/3 ≤ X ≤ 2/3).

3. Find the quantile function and sketch the graph.


4. Compute the five number summary and the interquartile range.
5. Sketch the graph of the probability density function with the boxplot on the horizontal axis.
Answer

The distributions in the last two exercises are examples of beta distributions. The particular beta distribution in the last exercise is
also known as the arcsine distribution; the distribution function explains the name. Beta distributions are used to model random
proportions and probabilities, and certain other types of random variables, and are studied in detail in the chapter on Special
Distributions.

In the special distribution calculator, select the beta distribution. For each of the following parameter values, note the location
and shape of the density function and the distribution function.
1. a = 3 , b = 2 . This gives the first beta distribution above.
2. a = b = 1/2. This gives the arcsine distribution above.

Logistic Distribution
Let F(x) = e^x/(1 + e^x) for x ∈ R.
1. Show that F is a distribution function for a continuous distribution, and sketch the graph.
2. Compute P(−1 ≤ X ≤ 1) where X is a random variable with distribution function F .
3. Find the quantile function and sketch the graph.
4. Compute the five-number summary and the interquartile range.
5. Find the probability density function and sketch the graph with the boxplot on the horizontal axis.
Answer

The distribution in the last exercise is a logistic distribution and the quantile function is known as the logit function. The logistic
distribution is studied in detail in the chapter on Special Distributions.

In the special distribution calculator, select the logistic distribution and keep the default parameter values. Note the shape of the
probability density function and the distribution function.

Extreme Value Distribution


Let F(x) = e^{−e^{−x}} for x ∈ R.
1. Show that F is a distribution function for a continuous distribution, and sketch the graph.
2. Compute P(−1 ≤ X ≤ 1) where X is a random variable with distribution function F .
3. Find the quantile function and sketch the graph.
4. Compute the five-number summary.
5. Find the probability density function and sketch the graph with the boxplot on the horizontal axis.
Answer

The distribution in the last exercise is the type 1 extreme value distribution, also known as the Gumbel distribution in honor of Emil
Gumbel. Extreme value distributions are studied in detail in the chapter on Special Distributions.

In the special distribution calculator, select the extreme value distribution and keep the default parameter values. Note the
shape and location of the probability density function and the distribution function.

The Standard Normal Distribution
Recall that the standard normal distribution has probability density function ϕ given by
ϕ(z) = (1/√(2π)) e^{−z^2/2}, z ∈ R (3.6.21)

This distribution models physical measurements of all sorts subject to small, random errors, and is one of the most important
distributions in probability. The normal distribution is studied in more detail in the chapter on Special Distributions. The
distribution function Φ, of course, can be expressed as
Φ(z) = ∫_{−∞}^z ϕ(x) dx, z ∈ R (3.6.22)

but Φ and the quantile function Φ^{−1} cannot be expressed, in closed form, in terms of elementary functions. Because of the importance of the normal distribution, Φ and Φ^{−1} are themselves considered special functions, like sin, ln, and many others. Approximate values of these functions can be computed using most mathematical and statistical software packages. Because the distribution is symmetric about 0, Φ(−z) = 1 − Φ(z) for z ∈ R, and equivalently, Φ^{−1}(1 − p) = −Φ^{−1}(p). In particular, the median is 0.
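For example, in Python (an illustration, not part of the original text, assuming the scipy package is installed), Φ and Φ^{−1} are available as norm.cdf and norm.ppf:

from scipy.stats import norm

print(norm.cdf(0.0))     # Phi(0) = 0.5: the median is 0
print(norm.ppf(0.75))    # third quartile, about 0.6745
print(norm.ppf(0.25))    # first quartile, about -0.6745, by the symmetry above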

Open the special distribution calculator and choose the normal distribution. Keep the default parameter values and select CDF
view. Note the shape and location of the distribution/quantile function. Compute each of the following:
1. The first and third quartiles
2. The quantiles of order 0.9 and 0.1
3. The quantiles of order 0.95 and 0.05

Miscellaneous Exercises
Suppose that X has probability density function f (x) = − ln x for 0 < x ≤ 1 .
1. Sketch the graph of f .
2. Find the distribution function F and sketch the graph.
3. Find P(1/3 ≤ X ≤ 1/2).
Answer

Suppose that a pair of fair dice are rolled and the sequence of scores (X_1, X_2) is recorded.
1. Find the distribution function of Y = X_1 + X_2, the sum of the scores.
2. Find the distribution function of V = max{X_1, X_2}, the maximum score.
3. Find the conditional distribution function of Y given V = 5.


Answer

Suppose that (X, Y ) has probability density function f (x, y) = x + y for 0 ≤ x ≤ 1 , 0 ≤ y ≤ 1 .


1. Find the distribution function of (X, Y).
2. Compute P(1/4 ≤ X ≤ 1/2, 1/3 ≤ Y ≤ 2/3).

3. Find the distribution function of X.


4. Find the distribution function of Y .
5. Find the conditional distribution function of X given Y = y for 0 < y < 1 .
6. Find the conditional distribution function of Y given X = x for 0 < x < 1 .
7. Are X and Y independent?
Answer

Statistical Exercises
For the M&M data, compute the empirical distribution function of the total number of candies.
Answer

For the cicada data, let BL denote body length and let G denote gender. Compute the empirical distribution function of the
following variables:
1. BL
2. BL given G = 1 (male)
3. BL given G = 0 (female).
4. Do you believe that BL and G are independent?

For statistical versions of some of the topics in this section, see the chapter on Random Samples, and in particular, the sections on
empirical distributions and order statistics.

This page titled 3.6: Distribution and Quantile Functions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

3.7: Transformations of Random Variables

This section studies how the distribution of a random variable changes when the variable is transformed in a deterministic way. If
you are a new student of probability, you should skip the technical details.

Basic Theory
The Problem
As usual, we start with a random experiment modeled by a probability space (Ω, F , P). So to review, Ω is the set of outcomes, F
is the collection of events, and P is the probability measure on the sample space (Ω, F ). Suppose now that we have a random
variable X for the experiment, taking values in a set S , and a function r from S into another set T . Then Y = r(X) is a new
random variable taking values in T . If the distribution of X is known, how do we find the distribution of Y ? This is a very basic
and important question, and in a superficial sense, the solution is easy. But first recall that for B ⊆ T, r^{−1}(B) = {x ∈ S : r(x) ∈ B} is the inverse image of B under r.

P(Y ∈ B) = P[X ∈ r^{−1}(B)] for B ⊆ T.
Proof

Figure 3.7.1 : A function r : S → T . How is a probability distribution on S transformed by r to a distribution on T ?


However, frequently the distribution of X is known either through its distribution function F or its probability density function f ,
and we would similarly like to find the distribution function or probability density function of Y . This is a difficult problem in
general, because as we will see, even simple transformations of variables with simple distributions can lead to variables with
complex distributions. We will solve the problem in various special cases.

Transformed Variables with Discrete Distributions


When the transformed variable Y has a discrete distribution, the probability density function of Y can be computed using basic
rules of probability.

Suppose that X has a discrete distribution on a countable set S , with probability density function f . Then Y has a discrete
distribution with probability density function g given by

g(y) = ∑ f (x), y ∈ T (3.7.1)


−1
x∈r {y}

Proof

Figure 3.7.2 : A transformation of a discrete probability distribution.

Suppose that X has a continuous distribution on a subset S ⊆ R^n with probability density function f, and that T is countable. Then Y has a discrete distribution with probability density function g given by

g(y) = ∫_{r^{−1}{y}} f(x) dx, y ∈ T (3.7.2)

Proof

Figure 3.7.3 : A continuous distribution on S transformed by a discrete function r : S → T
So the main problem is often computing the inverse images r^{−1}{y} for y ∈ T. The formulas above in the discrete and continuous
cases are not worth memorizing explicitly; it's usually better to just work each problem from scratch. The main step is to write the
event {Y = y} in terms of X, and then find the probability of this event using the probability density function of X.

Transformed Variables with Continuous Distributions


Suppose that X has a continuous distribution on a subset S ⊆ R^n and that Y = r(X) has a continuous distribution on a subset T ⊆ R^m. Suppose also that X has a known probability density function f. In many cases, the probability density function of Y can
. Suppose also that X has a known probability density function f . In many cases, the probability density function of Y can
be found by first finding the distribution function of Y (using basic rules of probability) and then computing the appropriate
derivatives of the distribution function. This general method is referred to, appropriately enough, as the distribution function
method.

Suppose that Y is real valued. The distribution function G of Y is given by

G(y) = ∫_{r^{−1}(−∞,y]} f(x) dx, y ∈ R (3.7.3)

Proof

As in the discrete case, the formula in (4) is not much help, and it's usually better to work each problem from scratch. The main step
is to write the event {Y ≤ y} in terms of X, and then find the probability of this event using the probability density function of X.

The Change of Variables Formula


When the transformation r is one-to-one and smooth, there is a formula for the probability density function of Y directly in terms
of the probability density function of X. This is known as the change of variables formula. Note that since r is one-to-one, it has an
inverse function r^{−1}.

We will explore the one-dimensional case first, where the concepts and formulas are simplest. Thus, suppose that random variable
X has a continuous distribution on an interval S ⊆ R , with distribution function F and probability density function f . Suppose

that Y = r(X) where r is a differentiable function from S onto an interval T . As usual, we will let G denote the distribution
function of Y and g the probability density function of Y .

Suppose that r is strictly increasing on S . For y ∈ T ,


1. G(y) = F[r^{−1}(y)]
2. g(y) = f[r^{−1}(y)] (d/dy) r^{−1}(y)

Proof

Suppose that r is strictly decreasing on S . For y ∈ T ,


1. G(y) = 1 − F[r^{−1}(y)]
2. g(y) = −f[r^{−1}(y)] (d/dy) r^{−1}(y)

Proof

The formulas for the probability density functions in the increasing case and the decreasing case can be combined:

If r is strictly increasing or strictly decreasing on S then the probability density function g of Y is given by

g(y) = f[r^{−1}(y)] |(d/dy) r^{−1}(y)| (3.7.5)

Letting x = r^{−1}(y), the change of variables formula can be written more compactly as

g(y) = f(x) |dx/dy| (3.7.6)

Although succinct and easy to remember, the formula is a bit less clear. It must be understood that x on the right should be written
in terms of y via the inverse function. The images below give a graphical interpretation of the formula in the two cases where r is
increasing and where r is decreasing.

Figure 3.7.4 : The change of variables theorems in the increasing and decreasing cases
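As a quick worked example (added here for illustration; not one of the text's exercises), take X with density f on (0, ∞) and Y = r(X) = X^2, so that r is strictly increasing with r^{−1}(y) = √y. The change of variables formula gives

g(y) = f(√y) (d/dy)√y = f(√y)/(2√y), y ∈ (0, ∞)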
The generalization of this result from R to R^n is basically a theorem in multivariate calculus. First we need some notation. Suppose that r is a one-to-one differentiable function from S ⊆ R^n onto T ⊆ R^n. The first derivative of the inverse function x = r^{−1}(y) is the n × n matrix of first partial derivatives:

(dx/dy)_{ij} = ∂x_i/∂y_j (3.7.7)

The Jacobian (named in honor of Karl Gustav Jacobi) of the inverse function is the determinant of the first derivative matrix

det(dx/dy) (3.7.8)

With this compact notation, the multivariate change of variables formula is easy to state.

Suppose that X is a random variable taking values in S ⊆ R^n, and that X has a continuous distribution with probability density function f. Suppose also Y = r(X) where r is a differentiable function from S onto T ⊆ R^n. Then the probability density function g of Y is given by

g(y) = f(x) |det(dx/dy)|, y ∈ T (3.7.9)

Proof

The Jacobian is the infinitesimal scale factor that describes how n -dimensional volume changes under the transformation.

Figure 3.7.5 : The multivariate change of variables theorem

Special Transformations
Linear Transformations
Linear transformations (or more technically affine transformations) are among the most common and important transformations.
Moreover, this type of transformation leads to simple applications of the change of variable theorems. Suppose first that X is a

random variable taking values in an interval S ⊆ R and that X has a continuous distribution on S with probability density function
f . Let Y = a + b X where a ∈ R and b ∈ R ∖ {0} . Note that Y takes values in T = {y = a + bx : x ∈ S} , which is also an

interval.

Y has probability density function g given by


g(y) = (1/|b|) f((y − a)/b), y ∈ T (3.7.12)

Proof

When b > 0 (which is often the case in applications), this transformation is known as a location-scale transformation; a is the
location parameter and b is the scale parameter. Scale transformations arise naturally when physical units are changed (from feet
to meters, for example). Location transformations arise naturally when the physical reference point is changed (measuring time
relative to 9:00 AM as opposed to 8:00 AM, for example). The change of temperature measurement from Fahrenheit to Celsius is a
location and scale transformation. Location-scale transformations are studied in more detail in the chapter on Special Distributions.
The multivariate version of this result has a simple and elegant form when the linear transformation is expressed in matrix-vector
form. Thus suppose that X is a random variable taking values in S ⊆ R^n and that X has a continuous distribution on S with probability density function f. Let Y = a + BX where a ∈ R^n and B is an invertible n × n matrix. Note that Y takes values in T = {a + Bx : x ∈ S} ⊆ R^n.

Y has probability density function g given by


g(y) = (1/|det(B)|) f[B^{−1}(y − a)], y ∈ T (3.7.13)

Proof

Sums and Convolution


Simple addition of random variables is perhaps the most important of all transformations. Suppose that X and Y are random
variables on a probability space, taking values in R ⊆ R and S ⊆ R, respectively, so that (X, Y) takes values in a subset of R × S. Our goal is to find the distribution of Z = X + Y. Note that Z takes values in T = {z ∈ R : z = x + y for some x ∈ R, y ∈ S}. For z ∈ T, let D_z = {x ∈ R : z − x ∈ S}.

Suppose that (X, Y) has probability density function f.

1. If (X, Y) has a discrete distribution then Z = X + Y has a discrete distribution with probability density function u given by

u(z) = ∑_{x∈D_z} f(x, z − x), z ∈ T (3.7.14)

2. If (X, Y) has a continuous distribution then Z = X + Y has a continuous distribution with probability density function u given by

u(z) = ∫_{D_z} f(x, z − x) dx, z ∈ T (3.7.15)

Proof

In the discrete case, R and S are countable, so T is also countable, as is D_z for each z ∈ T. In the continuous case, R and S are typically intervals, so T is also an interval, as is D_z for z ∈ T. In both cases, determining D_z is often the most difficult step. By far the most important special case occurs when X and Y are independent.

Suppose that X and Y are independent and have probability density functions g and h respectively.
1. If X and Y have discrete distributions then Z = X + Y has a discrete distribution with probability density function g ∗ h given by

(g ∗ h)(z) = ∑_{x∈D_z} g(x)h(z − x), z ∈ T (3.7.18)

2. If X and Y have continuous distributions then Z = X + Y has a continuous distribution with probability density function g ∗ h given by

(g ∗ h)(z) = ∫_{D_z} g(x)h(z − x) dx, z ∈ T (3.7.19)

In both cases, the probability density function g ∗ h is called the convolution of g and h .
Proof

As before, determining this set D_z is often the most challenging step in finding the probability density function of Z. However, there is one case where the computations simplify significantly.

Suppose again that X and Y are independent random variables with probability density functions g and h, respectively.
1. In the discrete case, suppose X and Y take values in N. Then Z has probability density function

(g ∗ h)(z) = ∑_{x=0}^z g(x)h(z − x), z ∈ N (3.7.20)

2. In the continuous case, suppose that X and Y take values in [0, ∞). Then Z has probability density function

(g ∗ h)(z) = ∫_0^z g(x)h(z − x) dx, z ∈ [0, ∞) (3.7.21)

Proof
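The discrete formula in part (a) is easy to implement; here is a minimal Python sketch (an illustration, not part of the original text) for PDFs given as lists of probabilities indexed by 0, 1, 2, …:

def convolve(g, h):
    """Convolution of two PDFs on {0, 1, ...}, given as lists of probabilities."""
    u = [0.0] * (len(g) + len(h) - 1)
    for z in range(len(u)):
        for x in range(len(g)):
            if 0 <= z - x < len(h):
                u[z] += g[x] * h[z - x]
    return u

g = [0.5, 0.5]            # fair coin: values 0 and 1
print(convolve(g, g))     # [0.25, 0.5, 0.25], the binomial(2, 1/2) PDF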

Convolution is a very important mathematical operation that occurs in areas of mathematics outside of probability, and so involving
functions that are not necessarily probability density functions. The following result gives some simple properties of convolution.

Convolution (either discrete or continuous) satisfies the following properties, where f , g , and h are probability density
functions of the same type.
1. f ∗ g = g ∗ f (the commutative property)
2. (f ∗ g) ∗ h = f ∗ (g ∗ h) (the associative property)
Proof

Thus, in part (b) we can write f ∗ g ∗ h without ambiguity. Of course, the constant 0 is the additive identity so X + 0 = 0 + X = X for every random variable X. Also, a constant is independent of every other random variable. It follows that the probability density function δ of 0 (given by δ(0) = 1) is the identity with respect to convolution (at least for discrete PDFs). That is, f ∗ δ = δ ∗ f = f. The next result is a simple corollary of the convolution theorem, but is important enough to be highlighted.
highligted.

Suppose that X = (X_1, X_2, …) is a sequence of independent and identically distributed real-valued random variables, with common probability density function f. Then Y_n = X_1 + X_2 + ⋯ + X_n has probability density function f^{∗n} = f ∗ f ∗ ⋯ ∗ f, the n-fold convolution power of f, for n ∈ N.

In statistical terms, X corresponds to sampling from the common distribution. By convention, Y_0 = 0, so naturally we take f^{∗0} = δ. When appropriately scaled and centered, the distribution of Y_n converges to the standard normal distribution as n → ∞. The precise statement of this result is the central limit theorem, one of the fundamental theorems of probability. The central limit theorem is studied in detail in the chapter on Random Samples. Clearly convolution power satisfies the law of exponents: f^{∗n} ∗ f^{∗m} = f^{∗(n+m)} for m, n ∈ N.
Convolution can be generalized to sums of independent variables that are not of the same type, but this generalization is usually
done in terms of distribution functions rather than probability density functions.

Products and Quotients


While not as important as sums, products and quotients of real-valued random variables also occur frequently. We will limit our
discussion to continuous distributions.

Suppose that (X, Y) has a continuous distribution on R^2 with probability density function f.
1. Random variable V = XY has probability density function

v ↦ ∫_{−∞}^∞ f(x, v/x) (1/|x|) dx (3.7.22)

2. Random variable W = Y/X has probability density function

w ↦ ∫_{−∞}^∞ f(x, wx) |x| dx (3.7.23)
Proof

If (X, Y) takes values in a subset D ⊆ R^2, then for a given v ∈ R, the integral in (a) is over {x ∈ R : (x, v/x) ∈ D}, and for a given w ∈ R, the integral in (b) is over {x ∈ R : (x, wx) ∈ D}. As usual, the most important special case of this result is when X
and Y are independent.

Suppose that X and Y are independent random variables with continuous distributions on R having probability density
functions g and h , respectively.
1. Random variable V = XY has probability density function

v ↦ ∫_{−∞}^∞ g(x)h(v/x) (1/|x|) dx (3.7.28)

2. Random variable W = Y/X has probability density function

w ↦ ∫_{−∞}^∞ g(x)h(wx) |x| dx (3.7.29)

Proof

If X takes values in S ⊆ R and Y takes values in T ⊆ R , then for a given v ∈ R , the integral in (a) is over {x ∈ S : v/x ∈ T } ,
and for a given w ∈ R, the integral in (b) is over {x ∈ S : wx ∈ T } . As with convolution, determining the domain of integration is
often the most challenging step.

Minimum and Maximum


Suppose that (X_1, X_2, …, X_n) is a sequence of independent real-valued random variables. The minimum and maximum transformations

U = min{X_1, X_2, …, X_n}, V = max{X_1, X_2, …, X_n} (3.7.30)

are very important in a number of applications. For example, recall that in the standard model of structural reliability, a system
consists of n components that operate independently. Suppose that X_i represents the lifetime of component i ∈ {1, 2, …, n}. Then U is the lifetime of the series system which operates if and only if each component is operating. Similarly, V is the lifetime of
the parallel system which operates if and only if at least one component is operating.
A particularly important special case occurs when the random variables are identically distributed, in addition to being
independent. In this case, the sequence of variables is a random sample of size n from the common distribution. The minimum and
maximum variables are the extreme examples of order statistics. Order statistics are studied in detail in the chapter on Random
Samples.

Suppose that (X_1, X_2, …, X_n) is a sequence of independent real-valued random variables and that X_i has distribution function F_i for i ∈ {1, 2, …, n}.
1. V = max{X_1, X_2, …, X_n} has distribution function H given by H(x) = F_1(x)F_2(x)⋯F_n(x) for x ∈ R.
2. U = min{X_1, X_2, …, X_n} has distribution function G given by G(x) = 1 − [1 − F_1(x)][1 − F_2(x)]⋯[1 − F_n(x)] for x ∈ R.
Proof

From part (a), note that the product of n distribution functions is another distribution function. From part (b), the product of n right-tail distribution functions is a right-tail distribution function. In the reliability setting, where the random variables are nonnegative, the last statement means that the product of n reliability functions is another reliability function. If X_i has a continuous distribution with probability density function f_i for each i ∈ {1, 2, …, n}, then U and V also have continuous distributions, and their probability density functions can be obtained by differentiating the distribution functions in parts (a) and (b) of the last theorem. The computations are straightforward using the product rule for derivatives, but the results are a bit of a mess.

The formulas in the last theorem are particularly nice when the random variables are identically distributed, in addition to being independent.

Suppose that (X_1, X_2, …, X_n) is a sequence of independent real-valued random variables, with common distribution function F.
1. V = max{X_1, X_2, …, X_n} has distribution function H given by H(x) = F^n(x) for x ∈ R.
2. U = min{X_1, X_2, …, X_n} has distribution function G given by G(x) = 1 − [1 − F(x)]^n for x ∈ R.

In particular, it follows that a positive integer power of a distribution function is a distribution function. More generally, it's easy to
see that every positive power of a distribution function is a distribution function. How could we construct a non-integer power of a
distribution function in a probabilistic way?

Suppose that (X_1, X_2, …, X_n) is a sequence of independent real-valued random variables, with a common continuous distribution that has probability density function f.
1. V = max{X_1, X_2, …, X_n} has probability density function h given by h(x) = nF^{n−1}(x)f(x) for x ∈ R.
2. U = min{X_1, X_2, …, X_n} has probability density function g given by g(x) = n[1 − F(x)]^{n−1}f(x) for x ∈ R.
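A quick simulation check in Python (an illustration, not part of the original text): for n standard uniform variables, the distribution function of the maximum is H(x) = F^n(x) = x^n on [0, 1].

import random

n, reps, x = 5, 100_000, 0.8
hits = sum(max(random.random() for _ in range(n)) <= x for _ in range(reps))
print(hits / reps)   # simulated P(V <= x); should be close to...
print(x ** n)        # ...the exact value 0.8^5 = 0.32768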

Coordinate Systems
For our next discussion, we will consider transformations that correspond to common distance-angle based coordinate systems—
polar coordinates in the plane, and cylindrical and spherical coordinates in 3-dimensional space. First, for (x, y) ∈ R^2, let (r, θ) denote the standard polar coordinates corresponding to the Cartesian coordinates (x, y), so that r ∈ [0, ∞) is the radial distance
and θ ∈ [0, 2π) is the polar angle.

Figure 3.7.6 : Polar coordinates. Stover, Christopher and Weisstein, Eric W. "Polar Coordinates." From MathWorld—A Wolfram
Web Resource. http://mathworld.wolfram.com/PolarCoordinates.html
It's best to give the inverse transformation: x = r cos θ , y = r sin θ . As we all know from calculus, the Jacobian of the
transformation is r. Hence the following result is an immediate consequence of our change of variables theorem:

Suppose that (X, Y) has a continuous distribution on R^2 with probability density function f, and that (R, Θ) are the polar
coordinates of (X, Y ). Then (R, Θ) has probability density function g given by

g(r, θ) = f (r cos θ, r sin θ)r, (r, θ) ∈ [0, ∞) × [0, 2π) (3.7.32)

Next, for (x, y, z) ∈ R^3, let (r, θ, z) denote the standard cylindrical coordinates, so that (r, θ) are the standard polar coordinates of (x, y) as above, and coordinate z is left unchanged. Given our previous result, the one for cylindrical coordinates should come as

no surprise.

Suppose that (X, Y, Z) has a continuous distribution on R^3 with probability density function f, and that (R, Θ, Z) are the
cylindrical coordinates of (X, Y , Z) . Then (R, Θ, Z) has probability density function g given by
g(r, θ, z) = f (r cos θ, r sin θ, z)r, (r, θ, z) ∈ [0, ∞) × [0, 2π) × R (3.7.33)

Finally, for (x, y, z) ∈ R^3, let (r, θ, ϕ) denote the standard spherical coordinates corresponding to the Cartesian coordinates
(x, y, z), so that r ∈ [0, ∞) is the radial distance, θ ∈ [0, 2π) is the azimuth angle, and ϕ ∈ [0, π] is the polar angle. (In spite of

our use of the word standard, different notations and conventions are used in different subjects.)

Figure 3.7.7 : Spherical coordinates, By Dmcq—Own work, CC BY-SA 3.0, Wikipedia


Once again, it's best to give the inverse transformation: x = r sin ϕ cos θ , y = r sin ϕ sin θ , z = r cos ϕ . As we remember from
calculus, the absolute value of the Jacobian is r^2 sin ϕ. Hence the following result is an immediate consequence of the change of variables theorem (8):

Suppose that (X, Y, Z) has a continuous distribution on R^3 with probability density function f, and that (R, Θ, Φ) are the spherical coordinates of (X, Y, Z). Then (R, Θ, Φ) has probability density function g given by

g(r, θ, ϕ) = f(r sin ϕ cos θ, r sin ϕ sin θ, r cos ϕ) r^2 sin ϕ, (r, θ, ϕ) ∈ [0, ∞) × [0, 2π) × [0, π] (3.7.34)

Sign and Absolute Value


Our next discussion concerns the sign and absolute value of a real-valued random variable.

Suppose that X has a continuous distribution on R with distribution function F and probability density function f .
1. |X| has distribution function G given by G(y) = F (y) − F (−y) for y ∈ [0, ∞).
2. |X| has probability density function g given by g(y) = f (y) + f (−y) for y ∈ [0, ∞).
Proof

Recall that the sign function on R (not to be confused, of course, with the sine function) is defined as follows:

sgn(x) =
  −1,  x < 0
  0,   x = 0
  1,   x > 0          (3.7.35)

Suppose again that X has a continuous distribution on R with distribution function F and probability density function f , and
suppose in addition that the distribution of X is symmetric about 0. Then
1. |X| has distribution function G given by G(y) = 2F(y) − 1 for y ∈ [0, ∞).
2. |X| has probability density function g given by g(y) = 2f (y) for y ∈ [0, ∞).
3. sgn(X) is uniformly distributed on {−1, 1}.
4. |X| and sgn(X) are independent.
Proof

Examples and Applications


This subsection contains computational exercises, many of which involve special parametric families of distributions. It is always
interesting when a random variable from one parametric family can be transformed into a variable from another family. It is also
interesting when a parametric family is closed or invariant under some transformation on the variables in the family. Often, such
properties are what make the parametric families special in the first place. Please note these properties when they occur.

Dice
Recall that a standard die is an ordinary 6-sided die, with faces labeled from 1 to 6 (usually in the form of dots). A fair die is one in which the faces are equally likely. An ace-six flat die is a standard die in which faces 1 and 6 occur with probability 1/4 each and the other faces with probability 1/8 each.

Suppose that two six-sided dice are rolled and the sequence of scores (X_1, X_2) is recorded. Find the probability density function of Y = X_1 + X_2, the sum of the scores, in each of the following cases:
1. Both dice are standard and fair.
2. Both dice are ace-six flat.
3. The first die is standard and fair, and the second is ace-six flat.
4. The dice are both fair, but the first die has faces labeled 1, 2, 2, 3, 3, 4 and the second die has faces labeled 1, 3, 4, 5, 6, 8.
Answer
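Answers of this type can be checked by brute-force enumeration of the 36 ordered face pairs. Here is a Python sketch (an illustration, not part of the original text) for case (b), the ace-six flat dice:

from itertools import product
from collections import defaultdict

flat = {1: 1/4, 2: 1/8, 3: 1/8, 4: 1/8, 5: 1/8, 6: 1/4}   # ace-six flat die

pdf = defaultdict(float)
for x1, x2 in product(flat, flat):       # all ordered pairs of faces
    pdf[x1 + x2] += flat[x1] * flat[x2]

for y in sorted(pdf):
    print(y, pdf[y])                     # P(Y = y) for y = 2, ..., 12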

In the dice experiment, select two dice and select the sum random variable. Run the simulation 1000 times and compare the
empirical density function to the probability density function for each of the following cases:
1. fair dice
2. ace-six flat dice

Suppose that n standard, fair dice are rolled. Find the probability density function of the following variables:
1. the minimum score
2. the maximum score.
Answer

In the dice experiment, select fair dice and select each of the following random variables. Vary n with the scroll bar and note
the shape of the density function. With n = 4 , run the simulation 1000 times and note the agreement between the empirical
density function and the probability density function.
1. minimum score
2. maximum score.

Uniform Distributions
Recall that for n ∈ N_+, the standard measure of the size of a set A ⊆ R^n is

λ_n(A) = ∫_A 1 dx (3.7.37)

In particular, λ_1(A) is the length of A for A ⊆ R, λ_2(A) is the area of A for A ⊆ R^2, and λ_3(A) is the volume of A for A ⊆ R^3.

See the technical details in (1) for more advanced information.


Now if S ⊆ R^n with 0 < λ_n(S) < ∞, recall that the uniform distribution on S is the continuous distribution with constant probability density function f defined by f(x) = 1/λ_n(S) for x ∈ S. Uniform distributions are studied in more detail in the chapter on Special Distributions.
chapter on Special Distributions.

Let Y = X^2. Find the probability density function of Y and sketch the graph in each of the following cases:
. Find the probability density function of Y and sketch the graph in each of the following cases:
1. X is uniformly distributed on the interval [0, 4].
2. X is uniformly distributed on the interval [−2, 2].
3. X is uniformly distributed on the interval [−1, 3].
Answer

Compare the distributions in the last exercise. In part (c), note that even a simple transformation of a simple distribution can
produce a complicated distribution. In this particular case, the complexity is caused by the fact that x ↦ x^2 is one-to-one on part of the domain {0} ∪ (1, 3] and two-to-one on the other part [−1, 1] ∖ {0}.
On the other hand, the uniform distribution is preserved under a linear transformation of the random variable.

Suppose that X has the continuous uniform distribution on S ⊆ R^n. Let Y = a + BX, where a ∈ R^n and B is an invertible n × n matrix. Then Y is uniformly distributed on T = {a + Bx : x ∈ S}.

Proof

For the following three exercises, recall that the standard uniform distribution is the uniform distribution on the interval [0, 1].

Suppose that X and Y are independent and that each has the standard uniform distribution. Let U = X +Y , V = X −Y ,
W = XY , Z = Y /X . Find the probability density function of each of the follow:

1. (U , V )
2. U
3. V
4. W
5. Z
Answer

Suppose that X, Y , and Z are independent, and that each has the standard uniform distribution. Find the probability density
function of (U , V , W ) = (X + Y , Y + Z, X + Z) .
Answer

Suppose that (X_1, X_2, …, X_n) is a sequence of independent random variables, each with the standard uniform distribution. Find the distribution function and probability density function of the following variables.
1. U = min{X_1, X_2, …, X_n}
2. V = max{X_1, X_2, …, X_n}

Answer

Both distributions in the last exercise are beta distributions. More generally, all of the order statistics from a random sample of
standard uniform variables have beta distributions, one of the reasons for the importance of this family of distributions. Beta
distributions are studied in more detail in the chapter on Special Distributions.

In the order statistic experiment, select the uniform distribution.


1. Set k = 1 (this gives the minimum U ). Vary n with the scroll bar and note the shape of the probability density function.
With n = 5 , run the simulation 1000 times and note the agreement between the empirical density function and the true
probability density function.
2. Vary n with the scroll bar, set k = n each time (this gives the maximum V ), and note the shape of the probability density
function. With n = 5 run the simulation 1000 times and compare the empirical density function and the probability density
function.

Let f denote the probability density function of the standard uniform distribution.
1. Compute f^{∗2}
2. Compute f^{∗3}
3. Graph f, f^{∗2}, and f^{∗3} on the same set of axes.

Answer

In the last exercise, you can see the behavior predicted by the central limit theorem beginning to emerge. Recall that if (X_1, X_2, X_3) is a sequence of independent random variables, each with the standard uniform distribution, then f, f^{∗2}, and f^{∗3} are the probability density functions of X_1, X_1 + X_2, and X_1 + X_2 + X_3, respectively. More generally, if (X_1, X_2, …, X_n) is a sequence of independent random variables, each with the standard uniform distribution, then the distribution of ∑_{i=1}^n X_i (which has probability density function f^{∗n}) is known as the Irwin-Hall distribution with parameter n. The Irwin-Hall distributions are studied in more detail in the chapter on Special Distributions.

Open the Special Distribution Simulator and select the Irwin-Hall distribution. Vary the parameter n from 1 to 3 and note the
shape of the probability density function. (These are the density functions in the previous exercise). For each value of n , run
the simulation 1000 times and compare the empirical density function and the probability density function.

Simulations
A remarkable fact is that the standard uniform distribution can be transformed into almost any other distribution on R. This is
particularly important for simulations, since many computer languages have an algorithm for generating random numbers, which
are simulations of independent variables, each with the standard uniform distribution. Conversely, any continuous distribution
supported on an interval of R can be transformed into the standard uniform distribution.
Suppose first that F is a distribution function for a distribution on R (which may be discrete, continuous, or mixed), and let F^{−1} denote the quantile function.

Suppose that U has the standard uniform distribution. Then X = F^{−1}(U) has distribution function F.
Proof

Assuming that we can compute F^{−1}, the previous exercise shows how we can simulate a distribution with distribution function F.

To rephrase the result, we can simulate a variable with distribution function F by simply computing a random quantile. Most of the
apps in this project use this method of simulation. The first image below shows the graph of the distribution function of a rather
complicated mixed distribution, represented in blue on the horizontal axis. In the second image, note how the uniform distribution
on [0, 1], represented by the thick red line, is transformed, via the quantile function, into the given distribution.

Figure 3.7.8 : The random quantile method of simulation
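Here is a minimal Python sketch of the random quantile method (an illustration, not part of the original text), for the exponential distribution with rate r. Since F(t) = 1 − e^{−rt}, the quantile function has the closed form F^{−1}(p) = −ln(1 − p)/r.

import math
import random

def sim_exponential(r):
    """Random quantile method: apply F^{-1} to a standard uniform variable."""
    u = random.random()
    return -math.log(1 - u) / r

print([round(sim_exponential(3), 4) for _ in range(5)])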


There is a partial converse to the previous result, for continuous distributions.

Suppose that X has a continuous distribution on an interval S ⊆ R. Then U = F(X) has the standard uniform distribution.
Proof

Show how to simulate the uniform distribution on the interval [a, b] with a random number. Using your calculator, simulate 5
values from the uniform distribution on the interval [2, 10].
Answer

Beta Distributions

Suppose that X has the probability density function f given by f(x) = 3x^2 for 0 ≤ x ≤ 1. Find the probability density function of each of the following:
1. U = X^2
2. V = √X
3. W = 1/X

Proof

Random variables X, U , and V in the previous exercise have beta distributions, the same family of distributions that we saw in the
exercise above for the minimum and maximum of independent standard uniform variables. In general, beta distributions are widely
used to model random proportions and probabilities, as well as physical quantities that take values in closed bounded intervals
(which after a change of units can be taken to be [0, 1]). On the other hand, W has a Pareto distribution, named for Vilfredo Pareto.
The family of beta distributions and the family of Pareto distributions are studied in more detail in the chapter on Special
Distributions.

Suppose that the radius R of a sphere has a beta distribution, with probability density function f given by f(r) = 12r^2(1 − r) for 0 ≤ r ≤ 1. Find the probability density function of each of the following:
1. The circumference C = 2πR
2. The surface area A = 4πR^2
3. The volume V = (4/3)πR^3

Answer

Suppose that the grades on a test are described by the random variable Y = 100X where X has the beta distribution with probability density function f given by f(x) = 12x(1 − x)^2 for 0 ≤ x ≤ 1. The grades are generally low, so the teacher decides to “curve” the grades using the transformation Z = 10√Y = 100√X. Find the probability density function of
1. Y
2. Z
Answer

Bernoulli Trials
Recall that a Bernoulli trials sequence is a sequence (X_1, X_2, …) of independent, identically distributed indicator random variables. In the usual terminology of reliability theory, X_i = 0 means failure on trial i, while X_i = 1 means success on trial i. The basic parameter of the process is the probability of success p = P(X_i = 1), so p ∈ [0, 1]. The random process is named for Jacob Bernoulli and is studied in detail in the chapter on Bernoulli trials.

For $i \in \mathbb{N}_+$, the probability density function $f$ of the trial variable $X_i$ is given by $f(x) = p^x (1 - p)^{1 - x}$ for $x \in \{0, 1\}$.
Proof

Now let $Y_n$ denote the number of successes in the first $n$ trials, so that $Y_n = \sum_{i=1}^n X_i$ for $n \in \mathbb{N}_+$.

$Y_n$ has the probability density function $f_n$ given by
$$f_n(y) = \binom{n}{y} p^y (1 - p)^{n - y}, \quad y \in \{0, 1, \ldots, n\} \tag{3.7.39}$$
Proof

The distribution of $Y_n$ is the binomial distribution with parameters $n$ and $p$. The binomial distribution is studied in more detail in the chapter on Bernoulli trials.

For $m, n \in \mathbb{N}_+$,
1. $f_n = f^{*n}$
2. $f_m * f_n = f_{m+n}$
Proof

From part (b) it follows that if $Y$ and $Z$ are independent variables, where $Y$ has the binomial distribution with parameters $n \in \mathbb{N}_+$ and $p \in [0, 1]$ and $Z$ has the binomial distribution with parameters $m \in \mathbb{N}_+$ and $p$, then $Y + Z$ has the binomial distribution with parameters $m + n$ and $p$.
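The convolution property can be checked numerically; the following sketch (assuming NumPy, with arbitrary illustration values $m = 4$, $n = 6$, $p = 0.3$) compares $f_m * f_n$ with $f_{m+n}$.

import numpy as np
from math import comb

# Numerical check (not a proof) of the convolution property f_m * f_n = f_{m+n}.
def binom_pdf(n, p):
    return np.array([comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)])

p, m, n = 0.3, 4, 6
lhs = np.convolve(binom_pdf(m, p), binom_pdf(n, p))  # density of the sum Y + Z
rhs = binom_pdf(m + n, p)
print(np.allclose(lhs, rhs))  # True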

Find the probability density function of the difference between the number of successes and the number of failures in $n \in \mathbb{N}_+$ Bernoulli trials with success parameter $p \in [0, 1]$.


Answer

The Poisson Distribution


Recall that the Poisson distribution with parameter $t \in (0, \infty)$ has probability density function $f_t$ given by
$$f_t(n) = e^{-t} \frac{t^n}{n!}, \quad n \in \mathbb{N} \tag{3.7.40}$$

This distribution is named for Simeon Poisson and is widely used to model the number of random points in a region of time or space; the parameter $t$ is proportional to the size of the region. The Poisson distribution is studied in detail in the chapter on The Poisson Process.

If $a, b \in (0, \infty)$ then $f_a * f_b = f_{a+b}$.


Proof

The last result means that if X and Y are independent variables, and X has the Poisson distribution with parameter a > 0 while Y
has the Poisson distribution with parameter b > 0 , then X + Y has the Poisson distribution with parameter a + b . In terms of the
Poisson model, X could represent the number of points in a region A and Y the number of points in a region B (of the appropriate
sizes so that the parameters are a and b respectively). The independence of X and Y corresponds to the regions A and B being
disjoint. Then X + Y is the number of points in A ∪ B .

The Exponential Distribution


Recall that the exponential distribution with rate parameter $r \in (0, \infty)$ has probability density function $f$ given by $f(t) = r e^{-r t}$ for $t \in [0, \infty)$. This distribution is often used to model random times such as failure times and lifetimes. In particular, the times between arrivals in the Poisson model of random points in time have independent, identically distributed exponential distributions. The exponential distribution is studied in more detail in the chapter on Poisson Processes.

Show how to simulate, with a random number, the exponential distribution with rate parameter r. Using your calculator,
simulate 5 values from the exponential distribution with parameter r = 3 .
Answer

For the next exercise, recall that the floor and ceiling functions on $\mathbb{R}$ are defined by
$$\lfloor x \rfloor = \max\{n \in \mathbb{Z} : n \le x\}, \quad \lceil x \rceil = \min\{n \in \mathbb{Z} : n \ge x\}, \quad x \in \mathbb{R} \tag{3.7.43}$$

Suppose that T has the exponential distribution with rate parameter r ∈ (0, ∞) . Find the probability density function of each
of the following random variables:
1. Y = ⌊T ⌋
2. Z = ⌈T ⌉
Answer

Note that the distributions in the previous exercise are geometric distributions on $\mathbb{N}$ and on $\mathbb{N}_+$, respectively. In many respects, the geometric distribution is a discrete version of the exponential distribution.

Suppose that $T$ has the exponential distribution with rate parameter $r \in (0, \infty)$. Find the probability density function of each of the following random variables:
1. $X = T^2$
2. $Y = e^T$
3. $Z = \ln T$
Answer

In the previous exercise, Y has a Pareto distribution while Z has an extreme value distribution. Both of these are studied in more
detail in the chapter on Special Distributions.

Suppose that $X$ and $Y$ are independent random variables, each having the exponential distribution with parameter 1. Let $Z = Y / X$.
1. Find the distribution function of $Z$.
2. Find the probability density function of $Z$.
Answer

Suppose that X has the exponential distribution with rate parameter a > 0 , Y has the exponential distribution with rate
parameter b > 0 , and that X and Y are independent. Find the probability density function of Z = X + Y in each of the

following cases.
1. a = b
2. a ≠ b
Answer

Suppose that $(T_1, T_2, \ldots, T_n)$ is a sequence of independent random variables, and that $T_i$ has the exponential distribution with rate parameter $r_i > 0$ for each $i \in \{1, 2, \ldots, n\}$.
1. Find the probability density function of $U = \min\{T_1, T_2, \ldots, T_n\}$.
2. Find the distribution function of $V = \max\{T_1, T_2, \ldots, T_n\}$.
3. Find the probability density function of $V$ in the special case that $r_i = r$ for each $i \in \{1, 2, \ldots, n\}$.

Answer

Note that the minimum $U$ in part (a) has the exponential distribution with parameter $r_1 + r_2 + \cdots + r_n$. In particular, suppose that a series system has independent components, each with an exponentially distributed lifetime. Then the lifetime of the system is also exponentially distributed, and the failure rate of the system is the sum of the component failure rates.
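A quick simulation sketch of this fact, assuming NumPy; the rates below are arbitrary illustration values.

import numpy as np

# The minimum of independent exponentials with rates r_1, ..., r_n is exponential
# with rate r_1 + ... + r_n; compare the sample mean of U with 1 / sum(rates).
rng = np.random.default_rng()
rates = np.array([0.5, 1.0, 2.5])
t = rng.exponential(scale=1.0 / rates, size=(100_000, 3))
u = t.min(axis=1)
print(u.mean(), 1.0 / rates.sum())   # both should be close to 0.25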

In the order statistic experiment, select the exponential distribution.


1. Set k = 1 (this gives the minimum U ). Vary n with the scroll bar and note the shape of the probability density function.
With n = 5 , run the simulation 1000 times and compare the empirical density function and the probability density function.
2. Vary n with the scroll bar and set k = n each time (this gives the maximum V ). Note the shape of the density function.
With n = 5 , run the simulation 1000 times and compare the empirical density function and the probability density function.

Suppose again that $(T_1, T_2, \ldots, T_n)$ is a sequence of independent random variables, and that $T_i$ has the exponential distribution with rate parameter $r_i > 0$ for each $i \in \{1, 2, \ldots, n\}$. Then
$$\mathbb{P}(T_i < T_j \text{ for all } j \ne i) = \frac{r_i}{\sum_{j=1}^n r_j} \tag{3.7.44}$$

Proof

The result in the previous exercise is very important in the theory of continuous-time Markov chains. If we have a bunch of independent alarm clocks, with exponentially distributed alarm times, then the probability that clock $i$ is the first one to sound is $r_i \big/ \sum_{j=1}^n r_j$.
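The alarm clock interpretation is easy to check by simulation; a sketch assuming NumPy, with arbitrary rates.

import numpy as np

# Estimate P(T_0 < T_j for all j != 0) and compare with r_0 / (r_0 + r_1 + r_2).
rng = np.random.default_rng()
rates = np.array([1.0, 2.0, 3.0])
t = rng.exponential(scale=1.0 / rates, size=(100_000, 3))
print(np.mean(t.argmin(axis=1) == 0), rates[0] / rates.sum())   # both close to 1/6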

The Gamma Distribution


Recall that the (standard) gamma distribution with shape parameter $n \in \mathbb{N}_+$ has probability density function $g_n$ given by
$$g_n(t) = e^{-t} \frac{t^{n-1}}{(n-1)!}, \quad 0 \le t < \infty \tag{3.7.45}$$

With a positive integer shape parameter, as we have here, it is also referred to as the Erlang distribution, named for Agner Erlang. This distribution is widely used to model random times under certain basic assumptions. In particular, the $n$th arrival time in the Poisson model of random points in time has the gamma distribution with parameter $n$. The Erlang distribution is studied in more detail in the chapter on the Poisson Process, and in greater generality, the gamma distribution is studied in the chapter on Special Distributions.
Let $g = g_1$, and note that this is the probability density function of the exponential distribution with parameter 1, which was the topic of our last discussion.

If $m, n \in \mathbb{N}_+$ then
1. $g_n = g^{*n}$
2. $g_m * g_n = g_{m+n}$

Proof

Part (b) means that if $X$ has the gamma distribution with shape parameter $m$ and $Y$ has the gamma distribution with shape parameter $n$, and if $X$ and $Y$ are independent, then $X + Y$ has the gamma distribution with shape parameter $m + n$. In the context of the Poisson model, part (a) means that the $n$th arrival time is the sum of the $n$ independent interarrival times, which have a common exponential distribution.

Suppose that $T$ has the gamma distribution with shape parameter $n \in \mathbb{N}_+$. Find the probability density function of $X = \ln T$.

Answer

The Pareto Distribution


Recall that the Pareto distribution with shape parameter $a \in (0, \infty)$ has probability density function $f$ given by
$$f(x) = \frac{a}{x^{a+1}}, \quad 1 \le x < \infty \tag{3.7.47}$$

Members of this family have already come up in several of the previous exercises. The Pareto distribution, named for Vilfredo
Pareto, is a heavy-tailed distribution often used for modeling income and other financial variables. The Pareto distribution is
studied in more detail in the chapter on Special Distributions.

Suppose that $X$ has the Pareto distribution with shape parameter $a$. Find the probability density function of each of the following random variables:
1. $U = X^2$
2. $V = 1 / X$
3. $Y = \ln X$
Answer

In the previous exercise, $U$ also has a Pareto distribution, but with shape parameter $a / 2$; $V$ has the beta distribution with parameters $a$ and $b = 1$; and $Y$ has the exponential distribution with rate parameter $a$.

Show how to simulate, with a random number, the Pareto distribution with shape parameter a . Using your calculator, simulate
5 values from the Pareto distribution with shape parameter a = 2 .
Answer
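One way to carry this out (a sketch assuming NumPy): the quantile function of the Pareto distribution with shape parameter $a$ is $F^{-1}(u) = (1 - u)^{-1/a}$.

import numpy as np

# Random quantile method for the Pareto distribution with shape parameter a = 2:
# F(x) = 1 - x^(-a) for x >= 1, so F^{-1}(u) = (1 - u)^(-1/a).
rng = np.random.default_rng()
a = 2.0
u = rng.random(5)
print((1 - u) ** (-1 / a))   # 5 simulated values; u ** (-1/a) also works since 1 - U is uniform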

The Normal Distribution


Recall that the standard normal distribution has probability density function $\phi$ given by
$$\phi(z) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} z^2}, \quad z \in \mathbb{R} \tag{3.7.48}$$

Suppose that $Z$ has the standard normal distribution, and that $\mu \in (-\infty, \infty)$ and $\sigma \in (0, \infty)$.
1. Find the probability density function $f$ of $X = \mu + \sigma Z$.
2. Sketch the graph of $f$, noting the important qualitative features.
Answer

Random variable X has the normal distribution with location parameter μ and scale parameter σ. The normal distribution is
perhaps the most important distribution in probability and mathematical statistics, primarily because of the central limit theorem,
one of the fundamental theorems. It is widely used to model physical measurements of all types that are subject to small, random
errors. The normal distribution is studied in detail in the chapter on Special Distributions.

Suppose that $Z$ has the standard normal distribution. Find the probability density function of $V = Z^2$ and sketch the graph.

Answer

Random variable V has the chi-square distribution with 1 degree of freedom. Chi-square distributions are studied in detail in the
chapter on Special Distributions.

Suppose that $X$ and $Y$ are independent random variables, each with the standard normal distribution, and let $(R, \Theta)$ denote the standard polar coordinates of $(X, Y)$. Find the probability density function of each of the following:

1. (R, Θ)
2. R
3. Θ
Answer

The distribution of R is the (standard) Rayleigh distribution, and is named for John William Strutt, Lord Rayleigh. The Rayleigh
distribution is studied in more detail in the chapter on Special Distributions.
The standard normal distribution does not have a simple, closed form quantile function, so the random quantile method of
simulation does not work well. However, the last exercise points the way to an alternative method of simulation.

Show how to simulate a pair of independent, standard normal variables with a pair of random numbers. Using your calculator,
simulate 6 values from the standard normal distribution.
Answer
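A sketch of the polar-coordinate (Box-Muller) method suggested by the previous exercise, assuming NumPy.

import numpy as np

# R = sqrt(-2 ln U) has the Rayleigh distribution and Theta = 2 pi V is uniform on [0, 2 pi),
# so (R cos Theta, R sin Theta) is a pair of independent standard normal variables.
rng = np.random.default_rng()
u, v = rng.random(3), rng.random(3)
r, theta = np.sqrt(-2 * np.log(u)), 2 * np.pi * v
print(np.column_stack((r * np.cos(theta), r * np.sin(theta))).ravel())   # 6 standard normal values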

The Cauchy Distribution


Suppose that X and Y are independent random variables, each with the standard normal distribution. Find the probability
density function of T = X/Y .
Answer

Random variable T has the (standard) Cauchy distribution, named after Augustin Cauchy. The Cauchy distribution is studied in
detail in the chapter on Special Distributions.

Suppose that a light source is 1 unit away from position 0 on an infinite straight wall. We shine the light at the wall at an angle $\Theta$ to the perpendicular, where $\Theta$ is uniformly distributed on $\left(-\frac{\pi}{2}, \frac{\pi}{2}\right)$. Find the probability density function of the position $X = \tan \Theta$ of the light beam on the wall.


Answer

Thus, $X$ also has the standard Cauchy distribution. Clearly we can simulate a value of the Cauchy distribution by $X = \tan\left(-\frac{\pi}{2} + \pi U\right)$ where $U$ is a random number. This is the random quantile method.
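In Python (a sketch assuming NumPy):

import numpy as np

# Random quantile method for the standard Cauchy distribution: X = tan(-pi/2 + pi U).
rng = np.random.default_rng()
u = rng.random(5)
print(np.tan(-np.pi / 2 + np.pi * u))   # 5 simulated standard Cauchy values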

Open the Cauchy experiment, which is a simulation of the light problem in the previous exercise. Keep the default parameter
values and run the experiment in single step mode a few times. Then run the experiment 1000 times and compare the empirical
density function and the probability density function.

This page titled 3.7: Transformations of Random Variables is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

3.8: Convergence in Distribution

This section is concerned with the convergence of probability distributions, a topic of basic importance in probability theory. Since we will be almost exclusively concerned with the convergence of sequences of various kinds, it's helpful to introduce the notation $\mathbb{N}_+^* = \mathbb{N}_+ \cup \{\infty\} = \{1, 2, \ldots\} \cup \{\infty\}$.

Distributions on (R, R)
Definition
We start with the most important and basic setting, the measurable space $(\mathbb{R}, \mathscr{R})$, where $\mathbb{R}$ is the set of real numbers of course, and $\mathscr{R}$ is the Borel $\sigma$-algebra of subsets of $\mathbb{R}$. Recall that if $P$ is a probability measure on $(\mathbb{R}, \mathscr{R})$, then the function $F: \mathbb{R} \to [0, 1]$ defined by $F(x) = P(-\infty, x]$ for $x \in \mathbb{R}$ is the (cumulative) distribution function of $P$. Recall also that $F$ completely determines $P$. Here is the definition for convergence of probability measures in this setting:

Suppose $P_n$ is a probability measure on $(\mathbb{R}, \mathscr{R})$ with distribution function $F_n$ for each $n \in \mathbb{N}_+^*$. Then $P_n$ converges (weakly) to $P_\infty$ as $n \to \infty$ if $F_n(x) \to F_\infty(x)$ as $n \to \infty$ for every $x \in \mathbb{R}$ where $F_\infty$ is continuous. We write $P_n \Rightarrow P_\infty$ as $n \to \infty$.

Recall that a distribution function $F$ is continuous at $x \in \mathbb{R}$ if and only if $P\{x\} = 0$, so that $x$ is not an atom of the distribution (a point of positive probability). We will see shortly why this condition on $F_\infty$ is appropriate. Of course, a probability measure on $(\mathbb{R}, \mathscr{R})$ is usually associated with a real-valued random variable for some random experiment that is modeled by a probability space $(\Omega, \mathscr{F}, \mathbb{P})$. So to review, $\Omega$ is the set of outcomes, $\mathscr{F}$ is the $\sigma$-algebra of events, and $\mathbb{P}$ is the probability measure on the sample space $(\Omega, \mathscr{F})$. If $X$ is a real-valued random variable defined on the probability space, then the distribution of $X$ is the probability measure $P$ on $(\mathbb{R}, \mathscr{R})$ defined by $P(A) = \mathbb{P}(X \in A)$ for $A \in \mathscr{R}$, and then of course, the distribution function of $X$ is the function $F$ defined by $F(x) = \mathbb{P}(X \le x)$ for $x \in \mathbb{R}$. Here is the convergence terminology used in this setting:

Suppose that $X_n$ is a real-valued random variable with distribution $P_n$ for each $n \in \mathbb{N}_+^*$. If $P_n \Rightarrow P_\infty$ as $n \to \infty$ then we say that $X_n$ converges in distribution to $X_\infty$ as $n \to \infty$. We write $X_n \to X_\infty$ as $n \to \infty$ in distribution.

So if $F_n$ is the distribution function of $X_n$ for $n \in \mathbb{N}_+^*$, then $X_n \to X_\infty$ as $n \to \infty$ in distribution if and only if $F_n(x) \to F_\infty(x)$ at every point $x \in \mathbb{R}$ where $F_\infty$ is continuous. On the one hand, the terminology and notation are helpful, since again most probability measures are associated with random variables (and every probability measure can be). On the other hand, the terminology and
notation can be a bit misleading since the random variables, as functions, do not converge in any sense, and indeed the random
variables need not be defined on the same probability spaces. It is only the distributions that converge. However, often the random
variables are defined on the same probability space (Ω, F , P), in which case we can compare convergence in distribution with the
other modes of convergence we have or will study:
Convergence with probability 1
Convergence in probability
Convergence in mean
We will show, in fact, that convergence in distribution is the weakest of all of these modes of convergence. However, strength of
convergence should not be confused with importance. Convergence in distribution is one of the most important modes of
convergence; the central limit theorem, one of the two fundamental theorems of probability, is a theorem about convergence in
distribution.

Preliminary Examples
The examples below show why the definition is given in terms of distribution functions, rather than probability density functions,
and why convergence is only required at the points of continuity of the limiting distribution function. Note that the distributions
considered are probability measures on (R, R), even though the support of the distribution may be a much smaller subset. For the
first example, note that if a deterministic sequence converges in the ordinary calculus sense, then naturally we want the sequence
(thought of as random variables) to converge in distribution. Expand the proof to understand the example fully.

Suppose that $x_n \in \mathbb{R}$ for $n \in \mathbb{N}_+^*$. Define random variable $X_n = x_n$ with probability 1 for each $n \in \mathbb{N}_+^*$. Then $x_n \to x_\infty$ as $n \to \infty$ if and only if $X_n \to X_\infty$ as $n \to \infty$ in distribution.

Proof

For the example below, recall that $\mathbb{Q}$ denotes the set of rational numbers. Once again, expand the proof to understand the example fully.
For $n \in \mathbb{N}_+$, let $P_n$ denote the discrete uniform distribution on $\left\{\frac{1}{n}, \frac{2}{n}, \ldots, \frac{n-1}{n}, 1\right\}$ and let $P_\infty$ denote the continuous uniform distribution on the interval $[0, 1]$. Then
1. $P_n \Rightarrow P_\infty$ as $n \to \infty$
2. $P_n(\mathbb{Q}) = 1$ for each $n \in \mathbb{N}_+$ but $P_\infty(\mathbb{Q}) = 0$.
Proof

The point of the example is that it's reasonable for the discrete uniform distribution on $\left\{\frac{1}{n}, \frac{2}{n}, \ldots, \frac{n-1}{n}, 1\right\}$ to converge to the continuous uniform distribution on $[0, 1]$, but once again, the probability density functions are evidently not the correct objects of study.

Probability Density Functions


As the previous example shows, it is quite possible to have a sequence of discrete distributions converge to a continuous
distribution (or the other way around). Recall that probability density functions have very different meanings in the discrete and
continuous cases: density with respect to counting measure in the first case, and density with respect to Lebesgue measure in the
second case. This is another indication that distribution functions, rather than density functions, are the correct objects of study.
However, if probability density functions of a fixed type converge then the distributions converge. Recall again that we are thinking of our probability distributions as measures on $(\mathbb{R}, \mathscr{R})$ even when supported on a smaller subset.

Convergence in distribution in terms of probability density functions.
1. Suppose that $f_n$ is a probability density function for a discrete distribution $P_n$ on a countable set $S \subseteq \mathbb{R}$ for each $n \in \mathbb{N}_+^*$. If $f_n(x) \to f_\infty(x)$ as $n \to \infty$ for each $x \in S$ then $P_n \Rightarrow P_\infty$ as $n \to \infty$.
2. Suppose that $f_n$ is a probability density function for a continuous distribution $P_n$ on $\mathbb{R}$ for each $n \in \mathbb{N}_+^*$. If $f_n(x) \to f_\infty(x)$ as $n \to \infty$ for all $x \in \mathbb{R}$ (except perhaps on a set with Lebesgue measure 0) then $P_n \Rightarrow P_\infty$ as $n \to \infty$.

Proof

Convergence in Probability
Naturally, we would like to compare convergence in distribution with other modes of convergence we have studied.

Suppose that $X_n$ is a real-valued random variable for each $n \in \mathbb{N}_+^*$, all defined on the same probability space. If $X_n \to X_\infty$ as $n \to \infty$ in probability then $X_n \to X_\infty$ as $n \to \infty$ in distribution.

Proof

Our next example shows that even when the variables are defined on the same probability space, a sequence can converge in
distribution, but not in any other way.

Let $X$ be an indicator variable with $\mathbb{P}(X = 0) = \mathbb{P}(X = 1) = \frac{1}{2}$, so that $X$ is the result of tossing a fair coin. Let $X_n = 1 - X$ for $n \in \mathbb{N}_+$. Then
1. $X_n \to X$ as $n \to \infty$ in distribution.
2. $\mathbb{P}(X_n \text{ does not converge to } X \text{ as } n \to \infty) = 1$.
3. $X_n$ does not converge to $X$ as $n \to \infty$ in probability.
4. $X_n$ does not converge to $X$ as $n \to \infty$ in mean.

Proof

The critical fact that makes this counterexample work is that 1 − X has the same distribution as X. Any random variable with this
property would work just as well, so if you prefer a counterexample with continuous distributions, let X have probability density
function f given by f (x) = 6x(1 − x) for 0 ≤ x ≤ 1 . The distribution of X is an example of a beta distribution.
The following summary gives the implications for the various modes of convergence; no other implications hold in general.

Suppose that $X_n$ is a real-valued random variable for each $n \in \mathbb{N}_+^*$, all defined on a common probability space.
1. If $X_n \to X_\infty$ as $n \to \infty$ with probability 1 then $X_n \to X_\infty$ as $n \to \infty$ in probability.
2. If $X_n \to X_\infty$ as $n \to \infty$ in mean then $X_n \to X_\infty$ as $n \to \infty$ in probability.
3. If $X_n \to X_\infty$ as $n \to \infty$ in probability then $X_n \to X_\infty$ as $n \to \infty$ in distribution.

It follows that convergence with probability 1, convergence in probability, and convergence in mean all imply convergence in
distribution, so the latter mode of convergence is indeed the weakest. However, our next theorem gives an important converse to
part (c) in (7), when the limiting variable is a constant. Of course, a constant can be viewed as a random variable defined on any
probability space.

Suppose that $X_n$ is a real-valued random variable for each $n \in \mathbb{N}_+$, defined on the same probability space, and that $c \in \mathbb{R}$. If $X_n \to c$ as $n \to \infty$ in distribution then $X_n \to c$ as $n \to \infty$ in probability.

Proof

The Skorohod Representation


As noted in the summary above, convergence in distribution does not imply convergence with probability 1, even when the random
variables are defined on the same probability space. However, the next theorem, known as the Skorohod representation theorem,
gives an important partial result in this direction.

Suppose that $P_n$ is a probability measure on $(\mathbb{R}, \mathscr{R})$ for each $n \in \mathbb{N}_+^*$ and that $P_n \Rightarrow P_\infty$ as $n \to \infty$. Then there exist real-valued random variables $X_n$ for $n \in \mathbb{N}_+^*$, defined on the same probability space, such that
1. $X_n$ has distribution $P_n$ for $n \in \mathbb{N}_+^*$.
2. $X_n \to X_\infty$ as $n \to \infty$ with probability 1.

Proof

The following theorem illustrates the value of the Skorohod representation and the usefulness of random variable notation for
convergence in distribution. The theorem is also quite intuitive, since a basic idea is that continuity should preserve convergence.

Suppose that $X_n$ is a real-valued random variable for each $n \in \mathbb{N}_+^*$ (not necessarily defined on the same probability space). Suppose also that $g: \mathbb{R} \to \mathbb{R}$ is measurable, and let $D_g$ denote the set of discontinuities of $g$, and $P_\infty$ the distribution of $X_\infty$. If $X_n \to X_\infty$ as $n \to \infty$ in distribution and $P_\infty(D_g) = 0$, then $g(X_n) \to g(X_\infty)$ as $n \to \infty$ in distribution.

Proof

As a simple corollary, if $X_n$ converges to $X_\infty$ as $n \to \infty$ in distribution, and if $a, b \in \mathbb{R}$, then $a + b X_n$ converges to $a + b X_\infty$ as $n \to \infty$ in distribution. But we can do a little better:

Suppose that $X_n$ is a real-valued random variable and that $a_n, b_n \in \mathbb{R}$ for each $n \in \mathbb{N}_+^*$. If $X_n \to X_\infty$ as $n \to \infty$ in distribution and if $a_n \to a_\infty$ and $b_n \to b_\infty$ as $n \to \infty$, then $a_n + b_n X_n \to a_\infty + b_\infty X_\infty$ as $n \to \infty$ in distribution.
Proof

The definition of convergence in distribution requires that the sequence of probability measures converge on sets of the form $(-\infty, x]$ for $x \in \mathbb{R}$ when the limiting distribution has probability 0 at $x$. It turns out that the probability measures will converge on lots of other sets as well, and this result points the way to extending convergence in distribution to more general spaces. To state the result, recall that if $A$ is a subset of a topological space, then the boundary of $A$ is $\partial A = \operatorname{cl}(A) \setminus \operatorname{int}(A)$ where $\operatorname{cl}(A)$ is the closure of $A$ (the smallest closed set that contains $A$) and $\operatorname{int}(A)$ is the interior of $A$ (the largest open set contained in $A$).

Suppose that $P_n$ is a probability measure on $(\mathbb{R}, \mathscr{R})$ for $n \in \mathbb{N}_+^*$. Then $P_n \Rightarrow P_\infty$ as $n \to \infty$ if and only if $P_n(A) \to P_\infty(A)$ as $n \to \infty$ for every $A \in \mathscr{R}$ with $P_\infty(\partial A) = 0$.
Proof

In the context of this result, suppose that $a, b \in \mathbb{R}$ with $a < b$. If $P_\infty\{a\} = P_\infty\{b\} = 0$, then as $n \to \infty$ we have $P_n(a, b) \to P_\infty(a, b)$, $P_n[a, b) \to P_\infty[a, b)$, $P_n(a, b] \to P_\infty(a, b]$, and $P_n[a, b] \to P_\infty[a, b]$. Of course, the limiting values are all the same.

Examples and Applications
Next we will explore several interesting examples of the convergence of distributions on (R, R). There are several important cases
where a special distribution converges to another special distribution as a parameter approaches a limiting value. Indeed, such
convergence results are part of the reason why such distributions are special in the first place.

The Hypergeometric Distribution


Recall that the hypergeometric distribution with parameters $m$, $r$, and $n$ is the distribution that governs the number of type 1 objects in a sample of size $n$, drawn without replacement from a population of $m$ objects with $r$ objects of type 1. It has discrete probability density function $f$ given by
$$f(k) = \frac{\binom{r}{k} \binom{m-r}{n-k}}{\binom{m}{n}}, \quad k \in \{0, 1, \ldots, n\} \tag{3.8.3}$$
The parameters $m$, $r$, and $n$ are positive integers with $n \le m$ and $r \le m$. The hypergeometric distribution is studied in more detail in the chapter on Finite Sampling Models.
Recall next that Bernoulli trials are independent trials, each with two possible outcomes, generically called success and failure. The probability of success $p \in [0, 1]$ is the same for each trial. The binomial distribution with parameters $n \in \mathbb{N}_+$ and $p$ is the distribution of the number of successes in $n$ Bernoulli trials. This distribution has probability density function $g$ given by
$$g(k) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad k \in \{0, 1, \ldots, n\} \tag{3.8.4}$$

The binomial distribution is studied in more detail in the chapter on Bernoulli Trials. Note that the binomial distribution with
parameters n and p = r/m is the distribution that governs the number of type 1 objects in a sample of size n , drawn with
replacement from a population of m objects with r objects of type 1. This fact is motivation for the following result:

Suppose that $r_m \in \{0, 1, \ldots, m\}$ for each $m \in \mathbb{N}_+$ and that $r_m / m \to p$ as $m \to \infty$. For fixed $n \in \mathbb{N}_+$, the hypergeometric distribution with parameters $m$, $r_m$, and $n$ converges to the binomial distribution with parameters $n$ and $p$ as $m \to \infty$.

Proof

From a practical point of view, the last result means that if the population size m is “large” compared to sample size n , then the
hypergeometric distribution with parameters m, r, and n (which corresponds to sampling without replacement) is well
approximated by the binomial distribution with parameters n and p = r/m (which corresponds to sampling with replacement).
This is often a useful result, not computationally, but rather because the binomial distribution has fewer parameters than the
hypergeometric distribution (and often in real problems, the parameters may only be known approximately). Specifically, in the
limiting binomial distribution, we do not need to know the population size m and the number of type 1 objects r individually, but
only in the ratio r/m.
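The quality of the approximation is easy to inspect numerically; a sketch using only the Python standard library, with the illustration values $m = 100$, $r = 30$, $n = 10$.

from math import comb

# Compare the hypergeometric pdf (m = 100, r = 30, n = 10) with the
# binomial pdf with n = 10 and p = r/m = 0.3.
m, r, n = 100, 30, 10
hyper = [comb(r, k) * comb(m - r, n - k) / comb(m, n) for k in range(n + 1)]
p = r / m
binom = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
print(max(abs(h - b) for h, b in zip(hyper, binom)))  # small maximum difference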

In the ball and urn experiment, set m = 100 and r = 30 . For each of the following values of n (the sample size), switch
between sampling without replacement (the hypergeometric distribution) and sampling with replacement (the binomial
distribution). Note the difference in the probability density functions. Run the simulation 1000 times for each sampling mode
and compare the relative frequency function to the probability density function.
1. 10
2. 20
3. 30
4. 40
5. 50

The Binomial Distribution


Recall again that the binomial distribution with parameters $n \in \mathbb{N}_+$ and $p \in [0, 1]$ is the distribution of the number of successes in $n$ Bernoulli trials, when $p$ is the probability of success on a trial. This distribution has probability density function $f$ given by
$$f(k) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad k \in \{0, 1, \ldots, n\} \tag{3.8.7}$$

Recall also that the Poisson distribution with parameter $r \in (0, \infty)$ has probability density function $g$ given by
$$g(k) = e^{-r} \frac{r^k}{k!}, \quad k \in \mathbb{N} \tag{3.8.8}$$

The distribution is named for Simeon Poisson and governs the number of “random points” in a region of time or space, under
certain ideal conditions. The parameter r is proportional to the size of the region of time or space. The Poisson distribution is
studied in more detail in the chapter on the Poisson Process.

Suppose that $p_n \in [0, 1]$ for $n \in \mathbb{N}_+$ and that $n p_n \to r \in (0, \infty)$ as $n \to \infty$. Then the binomial distribution with parameters $n$ and $p_n$ converges to the Poisson distribution with parameter $r$ as $n \to \infty$.

Proof

From a practical point of view, the convergence of the binomial distribution to the Poisson means that if the number of trials $n$ is “large” and the probability of success $p$ “small”, so that $n p^2$ is small, then the binomial distribution with parameters $n$ and $p$ is well approximated by the Poisson distribution with parameter $r = n p$. This is often a useful result, again not computationally, but rather because the Poisson distribution has fewer parameters than the binomial distribution (and often in real problems, the parameters may only be known approximately). Specifically, in the approximating Poisson distribution, we do not need to know the number of trials $n$ and the probability of success $p$ individually, but only in the product $n p$. As we will see in the next chapter, the condition that $n p^2$ be small means that the variance of the binomial distribution, namely $n p (1 - p) = n p - n p^2$, is approximately $r = n p$, which is the variance of the approximating Poisson distribution.
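A quick numerical comparison (Python standard library only), using $n = 100$ and $p = 0.05$, so that $r = n p = 5$.

from math import comb, exp, factorial

# Compare binomial(100, 0.05) probabilities with Poisson(5) probabilities.
n, p = 100, 0.05
r = n * p
for k in range(8):
    b = comb(n, k) * p**k * (1 - p)**(n - k)
    g = exp(-r) * r**k / factorial(k)
    print(k, round(b, 4), round(g, 4))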

In the binomial timeline experiment, set the parameter values as follows, and observe the graph of the probability density
function. (Note that np = 5 in each case.) Run the experiment 1000 times in each case and compare the relative frequency
function and the probability density function. Note also the successes represented as “random points” in discrete time.
1. n = 10 , p = 0.5
2. n = 20 , p = 0.25
3. n = 100 , p = 0.05

In the Poisson experiment, set r = 5 and t = 1 , to get the Poisson distribution with parameter 5. Note the shape of the
probability density function. Run the experiment 1000 times and compare the relative frequency function to the probability
density function. Note the similarity between this experiment and the one in the previous exercise.

The Geometric Distribution


Recall that the geometric distribution on $\mathbb{N}_+$ with success parameter $p \in (0, 1]$ has probability density function $f$ given by
$$f(k) = p (1 - p)^{k-1}, \quad k \in \mathbb{N}_+ \tag{3.8.10}$$
The geometric distribution governs the trial number of the first success in a sequence of Bernoulli trials.

Suppose that $U$ has the geometric distribution on $\mathbb{N}_+$ with success parameter $p \in (0, 1]$. For $n \in \mathbb{N}_+$, the conditional distribution of $U$ given $U \le n$ converges to the uniform distribution on $\{1, 2, \ldots, n\}$ as $p \downarrow 0$.
Proof

Next, recall that the exponential distribution with rate parameter $r \in (0, \infty)$ has distribution function $G$ given by
$$G(t) = 1 - e^{-r t}, \quad 0 \le t < \infty \tag{3.8.12}$$

The exponential distribution governs the time between “arrivals” in the Poisson model of random points in time.

Suppose that $U_n$ has the geometric distribution on $\mathbb{N}_+$ with success parameter $p_n \in (0, 1]$ for $n \in \mathbb{N}_+$, and that $n p_n \to r \in (0, \infty)$ as $n \to \infty$. The distribution of $U_n / n$ converges to the exponential distribution with parameter $r$ as $n \to \infty$.
Proof

Note that the limiting condition on $n$ and $p_n$ in the last result is precisely the same as the condition for the convergence of the binomial distribution to the Poisson distribution. For a deeper interpretation of both of these results, see the section on the Poisson distribution.

In the negative binomial experiment, set k = 1 to get the geometric distribution. Then decrease the value of p and note the
shape of the probability density function. With p = 0.5 run the experiment 1000 times and compare the relative frequency
function to the probability density function.

In the gamma experiment, set k = 1 to get the exponential distribution, and set r = 5 . Note the shape of the probability density
function. Run the experiment 1000 times and compare the empirical density function and the probability density function.
Compare this experiment with the one in the previous exercise, and note the similarity, up to a change in scale.

The Matching Distribution


For $n \in \mathbb{N}_+$, consider a random permutation $(X_1, X_2, \ldots, X_n)$ of the elements in the set $\{1, 2, \ldots, n\}$. We say that a match occurs at position $i$ if $X_i = i$.

$\mathbb{P}(X_i = i) = \frac{1}{n}$ for each $i \in \{1, 2, \ldots, n\}$.
Proof

So the matching events all have the same probability, which varies inversely with the number of trials.

$\mathbb{P}(X_i = i, X_j = j) = \frac{1}{n(n-1)}$ for $i, j \in \{1, 2, \ldots, n\}$ with $i \ne j$.

Proof

So the matching events are dependent, and in fact are positively correlated. In particular, the matching events do not form a sequence of Bernoulli trials. The matching problem is studied in detail in the chapter on Finite Sampling Models. In that section we show that the number of matches $N_n$ has probability density function $f_n$ given by
$$f_n(k) = \frac{1}{k!} \sum_{j=0}^{n-k} \frac{(-1)^j}{j!}, \quad k \in \{0, 1, \ldots, n\} \tag{3.8.14}$$

The distribution of $N_n$ converges to the Poisson distribution with parameter 1 as $n \to \infty$.

Proof

In the matching experiment, increase n and note the apparent convergence of the probability density function for the number of
matches. With selected values of n , run the experiment 1000 times and compare the relative frequency function and the
probability density function.
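The convergence can also be seen in a quick simulation sketch (assuming NumPy; the value $n = 10$ and the replication count are arbitrary).

import numpy as np
from math import exp, factorial

# Compare the empirical distribution of the number of fixed points of a random
# permutation of {0, ..., n-1} with the Poisson distribution with parameter 1.
rng = np.random.default_rng()
n, reps = 10, 100_000
matches = np.array([np.sum(rng.permutation(n) == np.arange(n)) for _ in range(reps)])
for k in range(5):
    print(k, np.mean(matches == k), exp(-1) / factorial(k))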

The Extreme Value Distribution


Suppose that $(X_1, X_2, \ldots)$ is a sequence of independent random variables, each with the standard exponential distribution (parameter 1). Thus, recall that the common distribution function $G$ is given by
$$G(x) = 1 - e^{-x}, \quad 0 \le x < \infty \tag{3.8.16}$$

As $n \to \infty$, the distribution of $Y_n = \max\{X_1, X_2, \ldots, X_n\} - \ln n$ converges to the distribution with distribution function $F$ given by
$$F(x) = e^{-e^{-x}}, \quad x \in \mathbb{R} \tag{3.8.17}$$

Proof

The limiting distribution in Exercise (27) is the standard extreme value distribution, also known as the standard Gumbel
distribution in honor of Emil Gumbel. Extreme value distributions are studied in detail in the chapter on Special Distributions.
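A simulation sketch of the convergence (assuming NumPy; the values of $n$ and the number of replications are arbitrary).

import numpy as np

# Y_n = max(X_1, ..., X_n) - ln(n) for standard exponential X_i should be approximately
# Gumbel: P(Y_n <= x) is close to F(x) = exp(-exp(-x)) for large n.
rng = np.random.default_rng()
n, reps = 200, 10_000
y = rng.exponential(size=(reps, n)).max(axis=1) - np.log(n)
x = 1.0
print(np.mean(y <= x), np.exp(-np.exp(-x)))   # empirical probability vs. F(1)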

The Pareto Distribution


Recall that the Pareto distribution with shape parameter $a \in (0, \infty)$ has distribution function $F$ given by
$$F(x) = 1 - \frac{1}{x^a}, \quad 1 \le x < \infty \tag{3.8.19}$$

The Pareto distribution, named for Vilfredo Pareto, is a heavy-tailed distribution sometimes used to model financial variables. It is
studied in more detail in the chapter on Special Distributions.

Suppose that $X_n$ has the Pareto distribution with parameter $n$ for each $n \in \mathbb{N}_+$. Then
1. $X_n \to 1$ as $n \to \infty$ in distribution (and hence also in probability).
2. The distribution of $Y_n = n X_n - n$ converges to the standard exponential distribution as $n \to \infty$.

Proof

Fundamental Theorems
The two fundamental theorems of basic probability theory, the law of large numbers and the central limit theorem, are studied in
detail in the chapter on Random Samples. For this reason we will simply state the results in this section. So suppose that
$(X_1, X_2, \ldots)$ is a sequence of independent, identically distributed, real-valued random variables (defined on the same probability space) with mean $\mu \in (-\infty, \infty)$ and standard deviation $\sigma \in (0, \infty)$. For $n \in \mathbb{N}_+$, let $Y_n = \sum_{i=1}^n X_i$ denote the sum of the first $n$ variables, $M_n = Y_n / n$ the average of the first $n$ variables, and $Z_n = (Y_n - n \mu) \big/ (\sqrt{n}\, \sigma)$ the standard score of $Y_n$.

The fundamental theorems of probability
1. $M_n \to \mu$ as $n \to \infty$ with probability 1 (and hence also in probability and in distribution). This is the law of large numbers.
2. The distribution of $Z_n$ converges to the standard normal distribution as $n \to \infty$. This is the central limit theorem.

In part (a), convergence with probability 1 is the strong law of large numbers while convergence in probability and in distribution
are the weak laws of large numbers.

General Spaces
Our next goal is to define convergence of probability distributions on more general measurable spaces. For this discussion, you
may need to refer to other sections in this chapter: the integral with respect to a positive measure, properties of the integral, and
density functions. In turn, these sections depend on measure theory developed in the chapters on Foundations and Probability
Measures.

Definition and Basic Properties


First we need to define the type of measurable spaces that we will use in this subsection.

We assume that $(S, d)$ is a complete, separable metric space and let $\mathscr{S}$ denote the Borel $\sigma$-algebra of subsets of $S$, that is, the $\sigma$-algebra generated by the topology. The standard spaces that we often use are special cases of the measurable space $(S, \mathscr{S})$:
1. Discrete: $S$ is countable and is given the discrete metric, so $\mathscr{S}$ is the collection of all subsets of $S$.
2. Euclidean: $\mathbb{R}^n$ is given the standard Euclidean metric, so $\mathscr{R}_n$ is the usual $\sigma$-algebra of Borel measurable subsets of $\mathbb{R}^n$.
Additional details

As suggested by our setup, the definition for convergence in distribution involves both measure theory and topology. The
motivation is the theorem above for the one-dimensional Euclidean space (R, R).

Convergence in distribution:
1. Suppose that $P_n$ is a probability measure on $(S, \mathscr{S})$ for each $n \in \mathbb{N}_+^*$. Then $P_n$ converges (weakly) to $P_\infty$ as $n \to \infty$ if $P_n(A) \to P_\infty(A)$ as $n \to \infty$ for every $A \in \mathscr{S}$ with $P_\infty(\partial A) = 0$. We write $P_n \Rightarrow P_\infty$ as $n \to \infty$.
2. Suppose that $X_n$ is a random variable with distribution $P_n$ on $(S, \mathscr{S})$ for each $n \in \mathbb{N}_+^*$. Then $X_n$ converges in distribution to $X_\infty$ as $n \to \infty$ if $P_n \Rightarrow P_\infty$ as $n \to \infty$. We write $X_n \to X_\infty$ as $n \to \infty$ in distribution.

Notes

Let's consider our two special cases. In the discrete case, as usual, the measure theory and topology are not really necessary.

Suppose that $P_n$ is a probability measure on a discrete space $(S, \mathscr{S})$ for each $n \in \mathbb{N}_+^*$. Then $P_n \Rightarrow P_\infty$ as $n \to \infty$ if and only if $P_n(A) \to P_\infty(A)$ as $n \to \infty$ for every $A \subseteq S$.

Proof

In the Euclidean case, it suffices to consider distribution functions, as in the one-dimensional case. If $P$ is a probability measure on $(\mathbb{R}^n, \mathscr{R}_n)$, recall that the distribution function $F$ of $P$ is given by
$$F(x_1, x_2, \ldots, x_n) = P((-\infty, x_1] \times (-\infty, x_2] \times \cdots \times (-\infty, x_n]), \quad (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n \tag{3.8.21}$$

Suppose that $P_k$ is a probability measure on $(\mathbb{R}^n, \mathscr{R}_n)$ with distribution function $F_k$ for each $k \in \mathbb{N}_+^*$. Then $P_k \Rightarrow P_\infty$ as $k \to \infty$ if and only if $F_k(x) \to F_\infty(x)$ as $k \to \infty$ for every $x \in \mathbb{R}^n$ where $F_\infty$ is continuous.

Convergence in Probability
As in the case of (R, R), convergence in probability implies convergence in distribution.

Suppose that $X_n$ is a random variable with values in $S$ for each $n \in \mathbb{N}_+^*$, all defined on the same probability space. If $X_n \to X_\infty$ as $n \to \infty$ in probability then $X_n \to X_\infty$ as $n \to \infty$ in distribution.

Notes

So as before, convergence with probability 1 implies convergence in probability which in turn implies convergence in distribution.

Skorohod's Representation Theorem


As you might guess, Skorohod's theorem for the one-dimensional Euclidean space (R, R) can be extended to the more general
spaces. However the proof is not nearly as straightforward, because we no longer have the quantile function for constructing
random variables on a common probability space.

Suppose that $P_n$ is a probability measure on $(S, \mathscr{S})$ for each $n \in \mathbb{N}_+^*$ and that $P_n \Rightarrow P_\infty$ as $n \to \infty$. Then there exists a random variable $X_n$ with values in $S$ for each $n \in \mathbb{N}_+^*$, defined on a common probability space, such that
1. $X_n$ has distribution $P_n$ for $n \in \mathbb{N}_+^*$.
2. $X_n \to X_\infty$ as $n \to \infty$ with probability 1.

One of the main consequences of Skorohod's representation, the preservation of convergence in distribution under continuous
functions, is still true and has essentially the same proof. For the general setup, suppose that (S, d, S ) and (T , e, T ) are spaces of
the type described above.

Suppose that $X_n$ is a random variable with values in $S$ for each $n \in \mathbb{N}_+^*$ (not necessarily defined on the same probability space). Suppose also that $g: S \to T$ is measurable, and let $D_g$ denote the set of discontinuities of $g$, and $P_\infty$ the distribution of $X_\infty$. If $X_n \to X_\infty$ as $n \to \infty$ in distribution and $P_\infty(D_g) = 0$, then $g(X_n) \to g(X_\infty)$ as $n \to \infty$ in distribution.

Proof

A simple consequence of the continuity theorem is that if a sequence of random vectors in $\mathbb{R}^n$ converges in distribution, then the sequence of each coordinate also converges in distribution. Let's just consider the two-dimensional case to keep the notation simple.

Suppose that $(X_n, Y_n)$ is a random variable with values in $\mathbb{R}^2$ for $n \in \mathbb{N}_+^*$ and that $(X_n, Y_n) \to (X_\infty, Y_\infty)$ as $n \to \infty$ in distribution. Then
1. $X_n \to X_\infty$ as $n \to \infty$ in distribution.
2. $Y_n \to Y_\infty$ as $n \to \infty$ in distribution.

Scheffé's Theorem
Our next discussion concerns an important result known as Scheffé's theorem, named after Henry Scheffé. To state our theorem, suppose that $(S, \mathscr{S}, \mu)$ is a measure space, so that $S$ is a set, $\mathscr{S}$ is a $\sigma$-algebra of subsets of $S$, and $\mu$ is a positive measure on $(S, \mathscr{S})$. Further, suppose that $P_n$ is a probability measure on $(S, \mathscr{S})$ that has density function $f_n$ with respect to $\mu$ for each $n \in \mathbb{N}_+$, and that $P$ is a probability measure on $(S, \mathscr{S})$ that has density function $f$ with respect to $\mu$.

If $f_n(x) \to f(x)$ as $n \to \infty$ for almost all $x \in S$ (with respect to $\mu$) then $P_n(A) \to P(A)$ as $n \to \infty$ uniformly in $A \in \mathscr{S}$.

Proof

Of course, the most important special cases of Scheffé's theorem are for discrete distributions and for continuous distributions on a subset of $\mathbb{R}^n$, as in the theorem above on density functions.

Expected Value
Generating functions are studied in the chapter on Expected Value. In part, the importance of generating functions stems from the fact that ordinary (pointwise) convergence of a sequence of generating functions corresponds to the convergence of the distributions in the sense of this section. Often it is easier to show convergence in distribution using generating functions than directly from the definition. In addition, convergence in distribution has elegant characterizations in terms of the convergence of the expected values of certain types of functions of the underlying random variables.

This page titled 3.8: Convergence in Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

3.9: General Distribution Functions

Our goal in this section is to define and study functions that play the same role for positive measures on R that (cumulative)
distribution functions do for probability measures on R. Of course probability measures on R are usually associated with real-
valued random variables. These general distribution functions are useful for constructing measures on R and will appear in our
study of integrals with respect to a measure in the next section, as well as non-homogeneous Poisson processes and general renewal
processes.

Basic Theory
Throughout this section, our basic measurable space is (R, R), where R is the σ-algebra of Borel measurable subsets of R, and as
usual, we will let λ denote Lebesgue measure on (R, R). As with cumulative distribution functions, it's convenient to have
compact notation for the limits of a function F : R → R from the left and right at x ∈ R, and at ∞ and −∞ (assuming of course
that these limits exist):
$$F(x^+) = \lim_{t \downarrow x} F(t), \quad F(x^-) = \lim_{t \uparrow x} F(t), \quad F(\infty) = \lim_{t \to \infty} F(t), \quad F(-\infty) = \lim_{t \to -\infty} F(t) \tag{3.9.1}$$

Distribution Functions and Their Measures


A function $F: \mathbb{R} \to \mathbb{R}$ that satisfies the following properties is a distribution function on $\mathbb{R}$:
1. $F$ is increasing: if $x \le y$ then $F(x) \le F(y)$.
2. $F$ is continuous from the right: $F(x^+) = F(x)$ for all $x \in \mathbb{R}$.

Since $F$ is increasing, $F(x^-)$ exists in $\mathbb{R}$. Similarly $F(\infty)$ exists, as a real number or $\infty$, and $F(-\infty)$ exists, as a real number or $-\infty$.

If $F$ is a distribution function on $\mathbb{R}$, then there exists a unique positive measure $\mu$ on $\mathscr{R}$ that satisfies
$$\mu(a, b] = F(b) - F(a), \quad a, b \in \mathbb{R}, \ a \le b \tag{3.9.2}$$

Proof

The measure μ is called the Lebesgue-Stieltjes measure associated with F , named for Henri Lebesgue and Thomas Joannes
Stieltjes. A very rich variety of measures on R can be constructed in this way. In particular, when the function F takes values in
[0, 1], the associated measure P is a probability measure. Another special case of interest is the distribution function defined by

F (x) = x for x ∈ R , in which case μ(a, b] is the length of the interval (a, b] and therefore μ = λ , Lebesgue measure on R . But

although the measure associated with a distribution function is unique, the distribution function itself is not. Note that if c ∈ R then
the distribution function defined by F (x) = x + c for x ∈ R also generates Lebesgue measure. This example captures the general
situation.

Suppose that F and G are distribution functions that generate the same measure μ on R . Then there exists c ∈ R such that
G = F +c .

Proof

Returning to the case of a probability measure P on R, the cumulative distribution function F that we studied in this chapter is the
unique distribution function satisfying F (−∞) = 0 . More generally, having constructed a measure from a distribution function,
let's now consider the complementary problem of finding a distribution function for a given measure. The proof of the last theorem
points the way.

Suppose that μ is a positive measure on (R, R) with the property that μ(A) < ∞ if A is bounded. Then there exists a
distribution function that generates μ .
Proof

In the proof of the last theorem, the use of 0 as a “reference point” is arbitrary, of course. Any other point in R would do as well,
and would produce a distribution function that differs from the one in the proof by a constant. If μ has the property that

μ(−∞, x] < ∞ for x ∈ R, then it's easy to see that F defined by F (x) = μ(−∞, x] for x ∈ R is a distribution function that
generates μ , and is the unique distribution function with F (−∞) = 0 . Of course, in the case of a probability measure, this is the
cumulative distribution function, as noted above.

Properties
General distribution functions enjoy many of the same properties as the cumulative distribution function (but not all because of the
lack of uniqueness). In particular, we can easily compute the measure of any interval from the distribution function.

Suppose that $F$ is a distribution function and $\mu$ is the positive measure on $(\mathbb{R}, \mathscr{R})$ associated with $F$. For $a, b \in \mathbb{R}$ with $a < b$,
1. $\mu[a, b] = F(b) - F(a^-)$
2. $\mu\{a\} = F(a) - F(a^-)$
3. $\mu(a, b) = F(b^-) - F(a)$
4. $\mu[a, b) = F(b^-) - F(a^-)$

Proof

Note that F is continuous at x ∈ R if and only if μ{x} = 0. In particular, μ is a continuous measure (recall that this means that
μ{x} = 0 for all x ∈ R ) if and only if F is continuous on R . On the other hand, F is discontinuous at x ∈ R if and only if

μ{x} > 0, so that μ has an atom at x. So μ is a discrete measure (recall that this means that μ has countable support) if and only if

F is a step function.
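Here is a small numerical sketch of the interval formulas above (plain Python); the distribution function below, with a single atom at 1, is a made-up illustration, and the left limit is approximated numerically.

# Mixed distribution function: F(x) = 0 for x < 0, F(x) = x/2 for 0 <= x < 1,
# and F(x) = 1 for x >= 1, so there is an atom of size 1/2 at x = 1.
def F(x):
    return 0.0 if x < 0 else (x / 2.0 if x < 1 else 1.0)

def F_left(x, eps=1e-9):
    return F(x - eps)   # numerical stand-in for the left limit F(x^-)

a, b = 0.5, 1.0
print("mu(a, b] =", F(b) - F(a))        # 0.75
print("mu[a, b] =", F(b) - F_left(a))   # 0.75 (F is continuous at a)
print("mu{b}    =", F(b) - F_left(b))   # approximately 0.5, the atom at 1
print("mu(a, b) =", F_left(b) - F(a))   # approximately 0.25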

Suppose again that $F$ is a distribution function and $\mu$ is the positive measure on $(\mathbb{R}, \mathscr{R})$ associated with $F$. If $a \in \mathbb{R}$ then
1. $\mu(a, \infty) = F(\infty) - F(a)$
2. $\mu[a, \infty) = F(\infty) - F(a^-)$
3. $\mu(-\infty, a] = F(a) - F(-\infty)$
4. $\mu(-\infty, a) = F(a^-) - F(-\infty)$
5. $\mu(\mathbb{R}) = F(\infty) - F(-\infty)$


Proof

Distribution Functions on [0, ∞)


Positive measures and distribution functions on $[0, \infty)$ are particularly important in renewal theory and Poisson processes, because they model random times. In the results below, $G$ denotes a distribution function on $[0, \infty)$, and we use the same symbol $G$ for the associated Lebesgue-Stieltjes measure.

The discrete case. Suppose that $G$ is discrete, so that there exists a countable set $C \subset [0, \infty)$ with $G(C^c) = 0$. Let $g(t) = G\{t\}$ for $t \in C$, so that $g$ is the density function of $G$ with respect to counting measure on $C$. If $u: [0, \infty) \to \mathbb{R}$ is locally bounded then
$$\int_0^t u(s) \, dG(s) = \sum_{s \in C \cap [0, t]} u(s) g(s) \tag{3.9.7}$$

Figure 3.9.1 : A discrete measure


In the discrete case, the distribution is often arithmetic. Recall that this means that the countable set $C$ is of the form $\{n d : n \in \mathbb{N}\}$ for some $d \in (0, \infty)$.

The continuous case. Suppose that $G$ is absolutely continuous with respect to Lebesgue measure on $[0, \infty)$, with density function $g: [0, \infty) \to [0, \infty)$. If $u: [0, \infty) \to \mathbb{R}$ is locally bounded then
$$\int_0^t u(s) \, dG(s) = \int_0^t u(s) g(s) \, ds \tag{3.9.8}$$

Figure 3.9.2 : A continuous measure

The mixed case. Suppose that there exists a countable set $C \subset [0, \infty)$ with $G(C) > 0$ and $G(C^c) > 0$, and that $G$ restricted to subsets of $C^c$ is absolutely continuous with respect to Lebesgue measure. Let $g(t) = G\{t\}$ for $t \in C$ and let $h$ be a density with respect to Lebesgue measure of $G$ restricted to subsets of $C^c$. If $u: [0, \infty) \to \mathbb{R}$ is locally bounded then
$$\int_0^t u(s) \, dG(s) = \sum_{s \in C \cap [0, t]} u(s) g(s) + \int_0^t u(s) h(s) \, ds \tag{3.9.9}$$

Figure 3.9.3 : A mixed measure


The three special cases do not exhaust the possibilities, but are by far the most common cases in applied problems.

This page titled 3.9: General Distribution Functions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

3.10: The Integral With Respect to a Measure

Probability density functions have very different interpretations for discrete distributions as opposed to continuous distributions.
For a discrete distribution, the probability of an event is computed by summing the density function over the outcomes in the event,
while for a continuous distribution, the probability is computed by integrating the density function over the outcomes. For a mixed distribution, we have partial discrete and continuous density functions, and the probability of an event is computed by summing and integrating. The various types of density functions can be unified under a general theory of integration, which is the subject of this
section. This theory has enormous importance in probability, far beyond just density functions. Expected value, which we consider
in the next chapter, can be interpreted as an integral with respect to a probability measure. Beyond probability, the general theory of
integration is of fundamental importance in many areas of mathematics.

Basic Theory
Definitions
Our starting point is a measure space $(S, \mathscr{S}, \mu)$. That is, $S$ is a set, $\mathscr{S}$ is a $\sigma$-algebra of subsets of $S$, and $\mu$ is a positive measure on $\mathscr{S}$. As usual, the most important special cases are
Euclidean space: $S = \mathbb{R}^n$ for some $n \in \mathbb{N}_+$, $\mathscr{S} = \mathscr{R}_n$, the $\sigma$-algebra of Lebesgue measurable subsets of $\mathbb{R}^n$, and $\mu = \lambda_n$, standard $n$-dimensional Lebesgue measure.
Discrete space: $S$ is a countable set, $\mathscr{S}$ is the collection of all subsets of $S$, and $\mu = \#$, counting measure.
Probability space: $S$ is the set of outcomes of a random experiment, $\mathscr{S}$ is the $\sigma$-algebra of events, and $\mu = \mathbb{P}$, a probability measure.
The following definition reflects the fact that in measure theory, sets of measure 0 are often considered unimportant.

Consider a statement with x ∈ S as a free variable. Technically such a statement is a predicate on S . Suppose that A ∈ S .
1. The statement holds on A if it is true for every x ∈ A .
2. The statement holds almost everywhere on A (with respect to μ ) if there exists B ∈ S with B ⊆ A such that the statement
holds on B and μ(A ∖ B) = 0 .

A typical statement that we have in mind is an equation or an inequality with x ∈ S as a free variable. Our goal is to define the
integral of certain measurable functions f : S → R , with respect to the measure μ . The integral may exist as a number in R (in
which case we say that f is integrable), or may exist as ∞ or −∞ , or may not exist at all. When it exists, the integral is denoted
variously by

$$\int_S f \, d\mu, \quad \int_S f(x) \, d\mu(x), \quad \int_S f(x) \mu(dx) \tag{3.10.1}$$

We will use the first two.


Since the set of extended real numbers $\mathbb{R}^* = \mathbb{R} \cup \{-\infty, \infty\}$ plays an important role in the theory, we need to recall the arithmetic of $\infty$ and $-\infty$. Here are the conventions that are appropriate for integration:

Arithmetic on $\mathbb{R}^*$
1. If $a \in (0, \infty]$ then $a \cdot \infty = \infty$ and $a \cdot (-\infty) = -\infty$
2. If $a \in [-\infty, 0)$ then $a \cdot \infty = -\infty$ and $a \cdot (-\infty) = \infty$
3. $0 \cdot \infty = 0$ and $0 \cdot (-\infty) = 0$
4. If $a \in \mathbb{R}$ then $a + \infty = \infty$ and $a + (-\infty) = -\infty$
5. $\infty + \infty = \infty$
6. $-\infty + (-\infty) = -\infty$

However, ∞ − ∞ is not defined (because it does not make consistent sense) and we must be careful never to produce this
indeterminate form. You might recall from calculus that 0 ⋅ ∞ is also an indeterminate form. However, for the theory of
integration, the convention that 0 ⋅ ∞ = 0 is convenient and consistent. In terms of order of course, −∞ < a < ∞ for a ∈ R .

We also need to extend topology and measure to $\mathbb{R}^*$. In terms of the first, $(a, \infty]$ is an open neighborhood of $\infty$ and $[-\infty, a)$ is an open neighborhood of $-\infty$ for every $a \in \mathbb{R}$. This ensures that if $x_n \in \mathbb{R}$ for $n \in \mathbb{N}_+$ then $x_n \to \infty$ or $x_n \to -\infty$ as $n \to \infty$ has its usual calculus meaning. Technically this topology results in the two-point compactification of $\mathbb{R}$. Now we can give $\mathbb{R}^*$ the Borel $\sigma$-algebra $\mathscr{R}^*$, that is, the $\sigma$-algebra generated by the topology. Basically, this simply means that if $A \in \mathscr{R}$ then $A \cup \{\infty\}$, $A \cup \{-\infty\}$, and $A \cup \{-\infty, \infty\}$ are all in $\mathscr{R}^*$.
Desired Properties
As motivation for the definition, every version of integration should satisfy some basic properties. First, the integral of the indicator
function of a measurable set should simply be the size of the set, as measured by μ . This gives our first definition:

If $A \in \mathscr{S}$ then $\int_S 1_A \, d\mu = \mu(A)$.

This definition hints at the intimate relationship between measure and integration. We will construct the integral from the measure
μ in this section, but this first property shows that if we started with the integral, we could recover the measure. This property also

shows why we need ∞ as a possible value of the integral, and coupled with some of the properties below, why −∞ is also needed.
Here is a simple corollary of our first definition.

$\int_S 0 \, d\mu = 0$

Proof

We give three more essential properties that we want. First are the linearity properties in two parts—part (a) is the additive
property and part (b) is the scaling property.

If $f, g: S \to \mathbb{R}$ are measurable functions whose integrals exist, and $c \in \mathbb{R}$, then
1. $\int_S (f + g) \, d\mu = \int_S f \, d\mu + \int_S g \, d\mu$ as long as the right side is not of the form $\infty - \infty$
2. $\int_S c f \, d\mu = c \int_S f \, d\mu$.
The additive property almost implies the scaling property

To be more explicit, we want the additivity property (a) to hold if at least one of the integrals on the right is finite, or if both are ∞
or if both are −∞ . What is ruled out are the two cases where one integral is ∞ and the other is −∞ , and this is what is meant by
the indeterminate form ∞ − ∞ . Our next essential properties are the order properties, again in two parts—part (a) is the positive
property and part (b) is the increasing property.

Suppose that $f, g: S \to \mathbb{R}$ are measurable.
1. If $f \ge 0$ on $S$ then $\int_S f \, d\mu \ge 0$.
2. If the integrals of $f$ and $g$ exist and $f \le g$ on $S$ then $\int_S f \, d\mu \le \int_S g \, d\mu$.

The positive property and the additive property imply the increasing property

Our last essential property is perhaps the least intuitive, but is a type of continuity property of integration, and is closely related to
the continuity property of positive measure. The official name is the monotone convergence theorem.

Suppose that $f_n: S \to [0, \infty)$ is measurable for $n \in \mathbb{N}_+$ and that $f_n$ is increasing in $n$. Then
$$\int_S \lim_{n \to \infty} f_n \, d\mu = \lim_{n \to \infty} \int_S f_n \, d\mu \tag{3.10.3}$$

Note that since $f_n$ is increasing in $n$, $\lim_{n \to \infty} f_n(x)$ exists in $\mathbb{R} \cup \{\infty\}$ for each $x \in S$ (and the limit defines a measurable function). This property shows that it is sometimes convenient to allow nonnegative functions to take the value $\infty$. Note also that by the increasing property, $\int_S f_n \, d\mu$ is increasing in $n$ and hence also has a limit in $\mathbb{R} \cup \{\infty\}$.

To see the connection with measure, suppose that $(A_1, A_2, \ldots)$ is an increasing sequence of sets in $\mathscr{S}$, and let $A = \bigcup_{i=1}^\infty A_i$. Note that $1_{A_n}$ is increasing in $n \in \mathbb{N}_+$ and $1_{A_n} \to 1_A$ as $n \to \infty$. For this reason, the union $A$ is sometimes called the limit of $A_n$ as $n \to \infty$. The continuity theorem of positive measure states that $\mu(A_n) \to \mu(A)$ as $n \to \infty$. Equivalently, $\int_S 1_{A_n} \, d\mu \to \int_S 1_A \, d\mu$ as $n \to \infty$, so the continuity theorem of positive measure is a special case of the monotone convergence theorem.

Armed with the properties that we want, the definition of the integral is fairly straightforward, and proceeds in stages. We give the
definition successively for
1. Nonnegative simple functions
2. Nonnegative measurable functions
3. Measurable real-valued functions
Of course, each definition should agree with the previous one on the functions that are in both collections.

Simple Functions
A simple function on S is simply a measurable, real-valued function with finite range. Simple functions are usually expressed as
linear combinations of indicator functions.

Representations of simple functions

1. Suppose that I is a finite index set, a_i ∈ R for each i ∈ I, and {A_i : i ∈ I} is a collection of sets in S that partition S. Then f = ∑_{i∈I} a_i 1_{A_i} is a simple function. Expressing a simple function in this form is a representation of f.
2. A simple function f has a unique representation as f = ∑_{j∈J} b_j 1_{B_j} where J is a finite index set, {b_j : j ∈ J} is a set of distinct real numbers, and {B_j : j ∈ J} is a collection of nonempty sets in S that partition S. This representation is known as the canonical representation.


Proof

You might wonder why we don't just always use the canonical representation for simple functions. The problem is that even if we
start with canonical representations, when we combine simple functions in various ways, the resulting representations may not be
canonical. The collection of simple functions is closed under the basic arithmetic operations, and in particular, forms a vector
space.

Suppose that f and g are simple functions with representations f = ∑_{i∈I} a_i 1_{A_i} and g = ∑_{j∈J} b_j 1_{B_j}, and that c ∈ R. Then

1. f + g is simple, with representation f + g = ∑_{(i,j)∈I×J} (a_i + b_j) 1_{A_i ∩ B_j}.
2. f g is simple, with representation f g = ∑_{(i,j)∈I×J} (a_i b_j) 1_{A_i ∩ B_j}.
3. cf is simple, with representation cf = ∑_{i∈I} c a_i 1_{A_i}.

Proof

As we alluded to earlier, note that even if the representations of f and g are canonical, the representations for f + g and fg may
not be. The next result treats composition, and will be important for the change of variables theorem in the next section.

Suppose that (T, T) is another measurable space, and that f : S → T is measurable. If g is a simple function on T with representation g = ∑_{i∈I} b_i 1_{B_i}, then g ∘ f is a simple function on S with representation g ∘ f = ∑_{i∈I} b_i 1_{f^{−1}(B_i)}.

Proof

Given the definition of the integral of an indicator function in (3) and that we want the linearity property (5) to hold, there is no
question as to how we should define the integral of a nonnegative simple function.

Suppose that f is a nonnegative simple function, with the representation f = ∑_{i∈I} a_i 1_{A_i} where a_i ≥ 0 for i ∈ I. We define

∫_S f dμ = ∑_{i∈I} a_i μ(A_i)   (3.10.4)

The definition is consistent
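To make the definition concrete, here is a small Python sketch (an illustrative computation, not part of the formal development). The finite set S, the particular representation of f, and the use of counting measure are all assumptions chosen for the example.

# Integral of a nonnegative simple function f = sum_i a_i 1_{A_i}:
# by definition, the integral is sum_i a_i mu(A_i).
def integral_simple(representation, mu):
    # representation: list of (a_i, A_i) pairs; mu: set function
    return sum(a * mu(A) for a, A in representation)

S = set(range(10))
evens = {x for x in S if x % 2 == 0}
odds = S - evens
f = [(2.0, evens), (5.0, odds)]   # f = 2 * 1_{evens} + 5 * 1_{odds}
mu = len                          # counting measure: mu(A) = #(A)
print(integral_simple(f, mu))     # 2*5 + 5*5 = 35.0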

Note that if f is a nonnegative simple function, then ∫_S f dμ exists in [0, ∞], so the positive property holds. We next show that the linearity properties are satisfied for nonnegative simple functions.

Suppose that f and g are nonnegative simple functions, and that c ∈ [0, ∞). Then

1. ∫_S (f + g) dμ = ∫_S f dμ + ∫_S g dμ
2. ∫_S cf dμ = c ∫_S f dμ

Proof

The increasing property holds for nonnegative simple functions.

Suppose that f and g are nonnegative simple functions and f ≤ g on S. Then ∫_S f dμ ≤ ∫_S g dμ.

Proof

Next we give a version of the continuity theorem in (7) for simple functions. It's not completely general, but will be needed for the
next subsection where we do prove the general version.

Suppose that f is a nonnegative simple function and that (A_1, A_2, …) is an increasing sequence of sets in S with A = ⋃_{n=1}^∞ A_n. Then

∫_S 1_{A_n} f dμ → ∫_S 1_A f dμ as n → ∞   (3.10.12)

Proof

Note that 1_{A_n} f is increasing in n ∈ N_+ and 1_{A_n} f → 1_A f as n → ∞, so this really is a special case of the monotone convergence theorem.

Nonnegative Functions
Next we will consider nonnegative measurable functions on S . First we note that a function of this type is the limit of nonnegative
simple functions.

Suppose that f : S → [0, ∞) is measurable. Then there exists an increasing sequence (f_1, f_2, …) of nonnegative simple functions with f_n → f on S as n → ∞.

Proof
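The construction used in the proof can be visualized numerically. Here is a Python sketch of the usual dyadic approximation f_n(x) = min(n, ⌊2^n f(x)⌋ / 2^n); each f_n is simple, since its range is finite, and f_n increases to f pointwise. The test function f(x) = x^2 and the test point are illustrative choices.

import math

def dyadic_approx(f, n):
    # the simple function f_n: truncate f at n, round down to a multiple of 2^(-n)
    return lambda x: min(n, math.floor(2**n * f(x)) / 2**n)

f = lambda x: x**2
x = 1.37                                 # f(x) = 1.8769
for n in [1, 2, 4, 8, 16]:
    print(n, dyadic_approx(f, n)(x))     # 1, 1.75, 1.875, 1.875, 1.87689...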

The last result points the way towards the definition of the integral of a measurable function f : S → [0, ∞) in terms of the integrals of simple functions. If g is a nonnegative simple function with g ≤ f, then by the order property, we need ∫_S g dμ ≤ ∫_S f dμ. On the other hand, there exists a sequence of nonnegative simple functions converging to f. Thus the continuity property suggests the following definition:

If f : S → [0, ∞) is measurable, define

∫_S f dμ = sup {∫_S g dμ : g is simple and 0 ≤ g ≤ f}   (3.10.15)

Note that ∫_S f dμ exists in [0, ∞] so the positive property holds. Note also that if f is simple, the new definition agrees with the old one. As always, we need to establish the essential properties. First, the increasing property holds.

If f, g : S → [0, ∞) are measurable and f ≤ g on S then ∫_S f dμ ≤ ∫_S g dμ.
Proof

We can now prove the continuity property known as the monotone convergence theorem in full generality.

Suppose that f_n : S → [0, ∞) is measurable for n ∈ N_+ and that f_n is increasing in n. Then

∫_S lim_{n→∞} f_n dμ = lim_{n→∞} ∫_S f_n dμ   (3.10.17)

Proof

If f : S → [0, ∞) is measurable, then by the theorem above, there exists an increasing sequence (f_1, f_2, …) of simple functions with f_n → f as n → ∞. By the monotone convergence theorem in (18), ∫_S f_n dμ → ∫_S f dμ as n → ∞. These two facts can be used to establish other properties of the integral of a nonnegative function based on our knowledge that the properties hold for simple functions. This type of argument is known as bootstrapping. We use bootstrapping to show that the linearity properties hold:

If f, g : S → [0, ∞) are measurable and c ∈ [0, ∞), then

1. ∫_S (f + g) dμ = ∫_S f dμ + ∫_S g dμ
2. ∫_S cf dμ = c ∫_S f dμ

Proof

General Functions
Our final step is to define the integral of a measurable function f : S → R. First, recall the positive and negative parts of x ∈ R:

x^+ = max{x, 0},  x^- = max{−x, 0}   (3.10.19)

Note that x^+ ≥ 0, x^- ≥ 0, x = x^+ − x^-, and |x| = x^+ + x^-. Given that we want the integral to have the linearity properties in (5), there is no question as to how we should define the integral of f in terms of the integrals of f^+ and f^-, which, being nonnegative, are covered by the previous subsection.

If f : S → R is measurable, we define

∫_S f dμ = ∫_S f^+ dμ − ∫_S f^- dμ   (3.10.20)

assuming that at least one of the integrals on the right is finite. If both are finite, then f is said to be integrable.

Assuming that either the integral of the positive part or the integral of the negative part is finite ensures that we do not get the dreaded indeterminate form ∞ − ∞.

Suppose that f : S → R is measurable. Then f is integrable if and only if ∫_S |f| dμ < ∞.
Proof

Note that if f is nonnegative, then our new definition agrees with our old one, since f^+ = f and f^- = 0. For simple functions the integral has the same basic form as for nonnegative simple functions:

Suppose that f is a simple function with the representation f = ∑_{i∈I} a_i 1_{A_i}. Then

∫_S f dμ = ∑_{i∈I} a_i μ(A_i)   (3.10.21)

assuming that the sum does not have both ∞ and −∞ terms.
Proof

Once again, we need to establish the essential properties. Our first result is an intermediate step towards linearity.

If f, g : S → [0, ∞) are measurable then ∫_S (f − g) dμ = ∫_S f dμ − ∫_S g dμ as long as at least one of the integrals on the right is finite.
Proof

We finally have the linearity properties in full generality.

If f, g : S → R are measurable functions whose integrals exist, and c ∈ R, then

1. ∫_S (f + g) dμ = ∫_S f dμ + ∫_S g dμ as long as the right side is not of the form ∞ − ∞.
2. ∫_S cf dμ = c ∫_S f dμ

Proof

In particular, note that if f and g are integrable, then so are f + g and cf for c ∈ R . Thus, the set of integrable functions on
(S, S , μ) forms a vector space, which is denoted L (S, S , μ). The L is in honor of Henri Lebesgue, who first developed the

theory. This vector space, and other related ones, will be studied in more detail in the section on function spaces.
We also have the increasing property in full generality.

If f, g : S → R are measurable functions whose integrals exist, and if f ≤ g on S then ∫_S f dμ ≤ ∫_S g dμ.

Proof

The Integral Over a Set

Now that we have defined the integral of a measurable function f over all of S, there is a natural extension to the integral of f over a measurable subset.

If f : S → R is measurable and A ∈ S, we define

∫_A f dμ = ∫_S 1_A f dμ   (3.10.31)

assuming that the integral on the right exists.

If f : S → R is a measurable function whose integral exists and A ∈ S , then the integral of f over A exists.
Proof

On the other hand, it's clearly possible for ∫_A f dμ to exist for some A ∈ S, but not ∫_S f dμ.
We could also simply think of ∫_A f dμ as the integral of a measurable function f : A → R over the measure space (A, S_A, μ_A), where S_A = {B ∈ S : B ⊆ A} = {C ∩ A : C ∈ S} is the σ-algebra of measurable subsets of A, and where μ_A is the restriction of μ to S_A. It follows that all of the essential properties hold for integrals over A: the linearity properties, the order properties, and the monotone convergence theorem. The following property is a simple consequence of the general additive property, and is known as the additive property for disjoint domains.

Suppose that f : S → R is a measurable function whose integral exists, and that A, B ∈ S are disjoint. Then

∫_{A∪B} f dμ = ∫_A f dμ + ∫_B f dμ   (3.10.32)

Proof

By induction, the additive property holds for a finite collection of disjoint domains. The extension to a countably infinite collection
of disjoint domains will be considered in the next section on properties of the integral.

Special Cases
Discrete Spaces
Recall again that the measure space (S, S, #) is discrete if S is countable, S is the collection of all subsets of S, and # is counting measure on S. Thus all functions f : S → R are measurable, and as we will see, integrals with respect to # are simply sums.

If f : S → R then

∫_S f d# = ∑_{x∈S} f(x)   (3.10.34)

as long as either the sum of the positive terms or the sum of the negative terms is finite.
Proof
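As a quick numerical sketch (illustrative only), take S = N_+ and f(n) = (−1)^{n+1}/n^2. Both the sum of the positive terms and the sum of the negative terms are finite, so f is integrable with respect to # and the integral is π^2/12.

# Integral with respect to counting measure on S = {1, 2, ...} as a sum.
f = lambda n: (-1) ** (n + 1) / n**2

N = 10**5   # truncation level for the numerical approximation
pos = sum(f(n) for n in range(1, N) if f(n) > 0)    # sum of positive terms
neg = sum(-f(n) for n in range(1, N) if f(n) < 0)   # sum of |negative terms|
print(pos, neg)      # both finite, so f is integrable with respect to #
print(pos - neg)     # approximately pi^2 / 12 = 0.82246...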

If the sum of the positive terms and the sum of the negative terms are both finite, then f is integrable with respect to #, but the usual term from calculus is that the series ∑_{x∈S} f(x) is absolutely convergent. The result will look more familiar in the special case S = N_+. Functions on S are simply sequences, so we can use the more familiar notation a_i rather than a(i) for a function a : S → R. Part (b) of the proof (with A_n = {1, 2, …, n}) is just the definition of an infinite series of nonnegative terms as the limit of the partial sums:

∑_{i=1}^∞ a_i = lim_{n→∞} ∑_{i=1}^n a_i   (3.10.36)

Part (c) of the proof is just the definition of a general infinite series

∑_{i=1}^∞ a_i = ∑_{i=1}^∞ a_i^+ − ∑_{i=1}^∞ a_i^-   (3.10.37)

as long as one of the series on the right is finite. Again, when both are finite, the series is absolutely convergent. In calculus we also consider conditionally convergent series. This means that ∑_{i=1}^∞ a_i^+ = ∞ and ∑_{i=1}^∞ a_i^- = ∞, but lim_{n→∞} ∑_{i=1}^n a_i exists in R. Such series have no place in general integration theory. Also, you may recall that such series are pathological in the sense that, given any number in R, there exists a rearrangement of the terms so that the rearranged series converges to the given number.

The Lebesgue and Riemann Integrals on R


Consider the one-dimensional Euclidean space (R, R, λ) where R is the usual σ-algebra of Lebesgue measurable sets and λ is Lebesgue measure. The theory developed above applies, of course, for the integral ∫_A f dλ of a measurable function f : R → R over a set A ∈ R. It's not surprising that in this special case, the theory of integration is referred to as Lebesgue integration in honor of our good friend Henri Lebesgue, who first developed the theory.

On the other hand, we already have a theory of integration on R, namely the Riemann integral of calculus, named for our other good friend Georg Riemann. For a suitable function f and domain A this integral is denoted ∫_A f(x) dx, as we all remember from calculus. How are the two integrals related? As we will see, the Lebesgue integral generalizes the Riemann integral.

To understand the connection we need to review the definition of the Riemann integral. Consider first the standard case where the domain of integration is a closed, bounded interval. Here are the preliminary definitions that we will need.

Suppose that f : [a, b] → R, where a, b ∈ R and a < b.

1. A partition A = {A_i : i ∈ I} of [a, b] is a finite collection of disjoint subintervals whose union is [a, b].
2. The norm of a partition A is ∥A∥ = max{λ(A_i) : i ∈ I}, the length of the largest subinterval of A.
3. A set of points B = {x_i : i ∈ I} where x_i ∈ A_i for each i ∈ I is said to be associated with the partition A.
4. The Riemann sum of f corresponding to a partition A and a set B associated with A is

R(f, A, B) = ∑_{i∈I} f(x_i) λ(A_i)   (3.10.38)

Note that the Riemann sum is simply the integral of the simple function g = ∑_{i∈I} f(x_i) 1_{A_i}. Since each A_i is an interval, g is a step function: it is constant on each of a finite collection of disjoint intervals. Also, since each A_i is an interval, λ(A_i) is simply the length of the subinterval A_i, so of course measure theory per se is not needed for Riemann integration. Now for the definition from calculus:

f is Riemann integrable on [a, b] if there exists r ∈ R with the property that for every ϵ > 0 there exists δ > 0 such that if A is a partition of [a, b] with ∥A∥ < δ then |r − R(f, A, B)| < ϵ for every set of points B associated with A. Then of course we define the integral by

∫_a^b f(x) dx = r   (3.10.39)
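Here is a numerical sketch of the definition in Python: for a Riemann integrable f, the Riemann sums converge as the norm of the partition shrinks, no matter which associated points are chosen (here they are chosen at random). The function f(x) = x^2 on [0, 1] is an illustrative choice; the limit is 1/3.

import random

def riemann_sum(f, a, b, n):
    # uniform partition of [a, b] into n subintervals, so the norm is (b - a)/n,
    # with a random associated point in each subinterval
    h = (b - a) / n
    return sum(f(a + i * h + random.random() * h) * h for i in range(n))

f = lambda x: x**2
for n in [10, 100, 1000, 10000]:
    print(n, riemann_sum(f, 0.0, 1.0, n))    # approaches 1/3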

Here is our main theorem of this subsection.

If f : [a, b] → R is Riemann integrable on [a, b] then f is Lebesgue integrable on [a, b] and

∫_{[a,b]} f dλ = ∫_a^b f(x) dx   (3.10.40)

On the other hand, there are lots of functions that are Lebesgue integrable but not Riemann integrable. In fact there are indicator
functions of this type, the simplest of functions from the point of view of Lebesgue integration.

Consider the function 1_Q where as usual, Q is the set of rational numbers in R. Then

1. ∫_R 1_Q dλ = 0.
2. 1_Q is not Riemann integrable on any interval [a, b] with a < b.

Proof

The following fundamental theorem completes the picture.

f : [a, b] → R is Riemann integrable on [a, b] if and only if f is bounded on [a, b] and f is continuous almost everywhere on
[a, b] .

Now that the Riemann integral is defined for a closed bounded interval, it can be extended to other domains.

Extensions of the Riemann integral.

1. If f is defined on [a, b) and Riemann integrable on [a, t] for a < t < b, we define ∫_a^b f(x) dx = lim_{t↑b} ∫_a^t f(x) dx if the limit exists in R^*.
2. If f is defined on (a, b] and Riemann integrable on [t, b] for a < t < b, we define ∫_a^b f(x) dx = lim_{t↓a} ∫_t^b f(x) dx if the limit exists in R^*.
3. If f is defined on (a, b), we select c ∈ (a, b) and define ∫_a^b f(x) dx = ∫_a^c f(x) dx + ∫_c^b f(x) dx if the integrals on the right exist in R^* by (a) and (b), and are not of the form ∞ − ∞.
4. If f is defined on [a, ∞) and Riemann integrable on [a, t] for a < t < ∞, we define ∫_a^∞ f(x) dx = lim_{t→∞} ∫_a^t f(x) dx if the limit exists in R^*.
5. If f is defined on (−∞, b] and Riemann integrable on [t, b] for −∞ < t < b, we define ∫_{−∞}^b f(x) dx = lim_{t→−∞} ∫_t^b f(x) dx if the limit exists in R^*.
6. If f is defined on R, we select c ∈ R and define ∫_{−∞}^∞ f(x) dx = ∫_{−∞}^c f(x) dx + ∫_c^∞ f(x) dx if both integrals on the right exist by (d) and (e), and are not of the form ∞ − ∞.
7. The integral is defined for a domain that is the union of a finite collection of disjoint intervals by the requirement that the integral be additive over disjoint domains.

As another indication of its superiority, note that none of these convolutions is necessary for the Lebesgue integral. Once and for all, we have defined ∫_A f(x) dx for a general measurable function f : R → R and a general domain A ∈ R.

The Lebesgue-Stieltjes Integral

Consider again the measurable space (R, R) where R is the usual σ-algebra of Lebesgue measurable subsets of R. Suppose that F : R → R is a general distribution function, so that by definition, F is increasing and continuous from the right. Recall that the Lebesgue-Stieltjes measure μ associated with F is the unique measure on R that satisfies

μ(a, b] = F(b) − F(a);  a, b ∈ R, a < b   (3.10.42)

Recall that F satisfies some, but not necessarily all, of the properties of a probability distribution function. The properties not necessarily satisfied are the normalizing properties

F(x) → 0 as x → −∞
F(x) → 1 as x → ∞

If F does satisfy these two additional properties, then μ is a probability measure and F is its probability distribution function. The integral with respect to the measure μ is, appropriately enough, referred to as the Lebesgue-Stieltjes integral with respect to F, and like the measure, is named for the ubiquitous Henri Lebesgue and for Thomas Stieltjes. In addition to our usual notation ∫_S f dμ, the Lebesgue-Stieltjes integral is also denoted ∫_S f dF and ∫_S f(x) dF(x).
S S S

Probability Spaces
Suppose that (S, S , P) is a probability space, so that S is the set of outcomes of a random experiment, S is the σ-algebra of
events, and P the probability measure on the sample space (S, S ). A measurable, real-valued function X on S is, of course, a real-
valued random variable. The integral with respect to P, if it exists, is the expected value of X and is denoted

E(X) = ∫_S X dP   (3.10.43)

This concept is of fundamental importance in probability theory and is studied in detail in a separate chapter on Expected Value,
mostly from an elementary point of view that does not involve abstract integration. However an advanced section treats expected
value as an integral over the underlying probability measure, as above.
Suppose next that (T, T, #) is a discrete space and that X is a random variable for the experiment, taking values in T. In this case X has a discrete distribution and the probability density function f of X is given by f(x) = P(X = x) for x ∈ T. More generally,

P(X ∈ A) = ∑_{x∈A} f(x) = ∫_A f d#,  A ⊆ T   (3.10.44)

On the other hand, suppose that X is a random variable with values in R^n, where as usual, (R^n, R_n, λ_n) is n-dimensional Euclidean space. If X has a continuous distribution, then f : R^n → [0, ∞) is a probability density function of X if

P(X ∈ A) = ∫_A f dλ_n,  A ∈ R_n   (3.10.45)

Technically, f is the density function of X with respect to counting measure # in the discrete case, and f is the density function of
X with respect to Lebesgue measure λ in the continuous case. In both cases, the probability of an event A is computed by
n

integrating the density function, with respect to the appropriate measure, over A . There are still differences, however. In the
discrete case, the existence of the density function with respect to counting measure is guaranteed, and indeed we have an explicit
formula for it. In the continuous case, the existence of a density function with respect to Lebesgue measure is not guaranteed, and
indeed there might not be one. More generally, suppose that we have a measure space (T , T , μ) and a random variable X with
values in T . A measurable function f : T → [0, ∞) is a probability density function of X (or more precisely, the distribution of X)
with respect to μ if

P(X ∈ A) = ∫_A f dμ,  A ∈ T   (3.10.46)

This fundamental question of the existence of a density function will be clarified in the section on absolute continuity and density
functions.
Suppose again that X is a real-valued random variable with distribution function F . Then, by definition, the distribution of X is the
Lebesgue-Stieltjes measure associated with F :
P(a < X ≤ b) = F (b) − F (a), a, b ∈ R, a < b (3.10.47)

regardless of whether the distribution is discrete, continuous, or mixed. Trivially, P(X ∈ A) = ∫_R 1_A dF for A ∈ R, and the expected value of X defined above can also be written as E(X) = ∫_R x dF(x). Again, all of this will be explained in much more detail in the next chapter on Expected Value.

Computational Exercises

Let g(x) = 1/(1 + x^2) for x ∈ R.

1. Find ∫_{−∞}^∞ g(x) dx.
2. Show that ∫_{−∞}^∞ x g(x) dx does not exist.

Answer

You may recall that the function g in the last exercise is important in the study of the Cauchy distribution, named for Augustin
Cauchy. You may also remember that the graph of g is known as the witch of Agnesi, named for Maria Agnesi.
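For readers who want to verify the answers symbolically, here is a sketch using the sympy library (an illustrative tool choice; any computer algebra system would do). Part 2 fails because the integral of the positive part of x g(x) is already ∞, and by symmetry so is the integral of the negative part, giving the forbidden form ∞ − ∞.

import sympy as sp

x = sp.symbols('x')
g = 1 / (1 + x**2)

print(sp.integrate(g, (x, -sp.oo, sp.oo)))   # pi, the answer to part 1

# part 2: the positive part of x*g(x) alone has infinite integral
print(sp.integrate(x * g, (x, 0, sp.oo)))    # oo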

Let g(x) = 1/x^b for x ∈ [1, ∞) where b > 0 is a parameter. Find ∫_1^∞ g(x) dx.

Answer

You may recall that the function g in the last exercise is important in the study of the Pareto distribution, named for Vilfredo
Pareto.
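Again a quick symbolic check with sympy (an illustrative choice): the integral is 1/(b − 1) when b > 1, and ∞ when 0 < b ≤ 1.

import sympy as sp

x = sp.symbols('x')
for b in [sp.Rational(1, 2), 1, 2, 3]:
    print(b, sp.integrate(x**(-b), (x, 1, sp.oo)))   # oo, oo, 1, 1/2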

Suppose that f(x) = 0 if x ∈ Q and f(x) = sin(x) if x ∈ R − Q.

1. Find ∫_{[0,π]} f(x) dλ(x)
2. Does ∫_0^π f(x) dx exist?
Answer

This page titled 3.10: The Integral With Respect to a Measure is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

3.11: Properties of the Integral

Basic Theory
Again our starting point is a measure space (S, S , μ) . That is, S is a set, S is a σ-algebra of subsets of S , and μ is a positive
measure on S .

Definition
In the last section we defined the integral of certain measurable functions f : S → R with respect to the measure μ. Recall that the integral, denoted ∫_S f dμ, may exist as a number in R (in which case f is integrable), or may exist as ∞ or −∞, or may fail to exist. Here is a review of how the definition is built up in stages:

Definition of the integral

1. If f is a nonnegative simple function, so that f = ∑_{i∈I} a_i 1_{A_i} where I is a finite index set, a_i ∈ [0, ∞) for i ∈ I, and {A_i : i ∈ I} is a measurable partition of S, then

∫_S f dμ = ∑_{i∈I} a_i μ(A_i)   (3.11.1)

2. If f : S → [0, ∞) is measurable, then

∫_S f dμ = sup {∫_S g dμ : g is simple and 0 ≤ g ≤ f}   (3.11.2)

3. If f : S → R is measurable, then

∫_S f dμ = ∫_S f^+ dμ − ∫_S f^- dμ   (3.11.3)

as long as the right side is not of the form ∞ − ∞, and where f^+ and f^- denote the positive and negative parts of f.

4. If f : S → R is measurable and A ∈ S, then the integral of f over A is defined by

∫_A f dμ = ∫_S 1_A f dμ   (3.11.4)

assuming that the integral on the right exists.

Consider a statement on the elements of S , for example an equation or an inequality with x ∈ S as a free variable. (Technically
such a statement is a predicate on S .) For A ∈ S , we say that the statement holds on A if it is true for every x ∈ A . We say that
the statement holds almost everywhere on A (with respect to μ ) if there exists B ∈ S with B ⊆ A such that the statement holds
on B and μ(A ∖ B) = 0 .

Basic Properties
A few properties of the integral that were essential to the motivation of the definition were given in the last section. In this section,
we extend some of those properties and we study a number of new ones. As a review, here is what we know so far.

Properties of the integral

1. If f, g : S → R are measurable functions whose integrals exist, then ∫_S (f + g) dμ = ∫_S f dμ + ∫_S g dμ as long as the right side is not of the form ∞ − ∞.
2. If f : S → R is a measurable function whose integral exists and c ∈ R, then ∫_S cf dμ = c ∫_S f dμ.
3. If f : S → R is measurable and f ≥ 0 on S then ∫_S f dμ ≥ 0.
4. If f, g : S → R are measurable functions whose integrals exist and f ≤ g on S then ∫_S f dμ ≤ ∫_S g dμ.
5. If f_n : S → [0, ∞) is measurable for n ∈ N_+ and f_n is increasing in n on S then ∫_S lim_{n→∞} f_n dμ = lim_{n→∞} ∫_S f_n dμ.
6. If f : S → R is measurable and the integral of f on A ∪ B exists, where A, B ∈ S are disjoint, then ∫_{A∪B} f dμ = ∫_A f dμ + ∫_B f dμ.

Parts (a) and (b) are the linearity properties; part (a) is the additivity property and part (b) is the scaling property. Parts (c) and (d)
are the order properties; part (c) is the positive property and part (d) is the increasing property. Part (e) is a continuity property
known as the monotone convergence theorem. Part (f) is the additive property for disjoint domains. Properties (a)–(e) hold with S
replaced by A ∈ S .

Equality and Order

Our first new results are extensions dealing with equality and order. The integral of a function over a null set is 0:

Suppose that f : S → R is measurable and A ∈ S with μ(A) = 0. Then ∫_A f dμ = 0.
Proof

Two functions that are indistinguishable from the point of view of μ must have the same integral.

Suppose that f : S → R is a measurable function whose integral exists. If g : S → R is measurable and g = f almost everywhere on S, then ∫_S g dμ = ∫_S f dμ.

Proof

Next we have a simple extension of the positive property.

Suppose that f : S → R is measurable and f ≥ 0 almost everywhere on S. Then

1. ∫_S f dμ ≥ 0
2. ∫_S f dμ = 0 if and only if f = 0 almost everywhere on S.
Proof

So, if f ≥ 0 almost everywhere on S then ∫_S f dμ > 0 if and only if μ{x ∈ S : f(x) > 0} > 0. The simple extension of the positive property in turn leads to a simple extension of the increasing property.

Suppose that f, g : S → R are measurable functions whose integrals exist, and that f ≤ g almost everywhere on S. Then

1. ∫_S f dμ ≤ ∫_S g dμ
2. Except in the case that both integrals are ∞ or both −∞, ∫_S f dμ = ∫_S g dμ if and only if f = g almost everywhere on S.
Proof

So if f ≤ g almost everywhere on S then, except in the two cases mentioned, ∫_S f dμ < ∫_S g dμ if and only if μ{x ∈ S : f(x) < g(x)} > 0. The exclusion when both integrals are ∞ or −∞ is important. A counterexample when this condition does not hold is given below. The next result is the absolute value inequality.

Suppose that f : S → R is a measurable function whose integral exists. Then

|∫_S f dμ| ≤ ∫_S |f| dμ   (3.11.9)

If f is integrable, then equality holds if and only if f ≥ 0 almost everywhere on S or f ≤ 0 almost everywhere on S.
Proof

Change of Variables
Suppose that (T, T) is another measurable space and that u : S → T is measurable. As we saw in our first study of positive measures, ν defined by

ν(B) = μ[u^{−1}(B)],  B ∈ T   (3.11.11)

is a positive measure on (T, T). The following result is known as the change of variables theorem.

If f : T → R is measurable then, assuming that the integrals exist,

∫_T f dν = ∫_S (f ∘ u) dμ   (3.11.12)

Proof

The change of variables theorem will look more familiar if we give the variables explicitly. Thus, suppose that we want to evaluate

∫_S f[u(x)] dμ(x)   (3.11.17)

where again, u : S → T and f : T → R. One way is to use the substitution v = u(x), find the new measure ν, and then evaluate

∫_T f(v) dν(v)   (3.11.18)
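Here is a minimal sketch of the change of variables theorem in a discrete setting, where both integrals are finite sums and can be checked directly. The set S, the map u, and the function f are illustrative choices; μ is counting measure.

# nu(B) = mu(u^{-1}(B)) is the new measure on T; we verify that the
# integral of f with respect to nu equals the integral of f o u with respect to mu.
S = range(12)
u = lambda x: x % 3          # u : S -> T = {0, 1, 2}
f = lambda t: t**2 + 1       # f : T -> R
T = {u(x) for x in S}

nu = {t: sum(1 for x in S if u(x) == t) for t in T}   # nu at the points of T
lhs = sum(f(t) * nu[t] for t in T)    # integral of f with respect to nu
rhs = sum(f(u(x)) for x in S)         # integral of f o u with respect to mu
print(lhs, rhs)                       # 32 32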

Convergence Properties
We start with a simple but important corollary of the monotone convergence theorem that extends the additivity property to a
countably infinite sum of nonnegative functions.

Suppose that f_n : S → [0, ∞) is measurable for n ∈ N_+. Then

∫_S ∑_{n=1}^∞ f_n dμ = ∑_{n=1}^∞ ∫_S f_n dμ   (3.11.19)

Proof

A theorem below gives a related result that relaxes the assumption that the functions f_n be nonnegative, but imposes a stricter integrability requirement. Our next result is the additivity of the integral over a countably infinite collection of disjoint domains.

Suppose that f : S → R is a measurable function whose integral exists, and that {A_n : n ∈ N_+} is a disjoint collection of sets in S. Let A = ⋃_{n=1}^∞ A_n. Then

∫_A f dμ = ∑_{n=1}^∞ ∫_{A_n} f dμ   (3.11.20)

Proof

Of course, the previous theorem applies if f is nonnegative or if f is integrable. Next we give a minor extension of the monotone
convergence theorem that relaxes the assumption that the functions be nonnegative.

Monotone Convergence Theorem. Suppose that f_n : S → R is a measurable function whose integral exists for each n ∈ N_+ and that f_n is increasing in n on S. If ∫_S f_1 dμ > −∞ then

∫_S lim_{n→∞} f_n dμ = lim_{n→∞} ∫_S f_n dμ   (3.11.24)

Proof

Here is the complementary result for decreasing functions.

Suppose that f_n : S → R is a measurable function whose integral exists for each n ∈ N_+ and that f_n is decreasing in n on S. If ∫_S f_1 dμ < ∞ then

∫_S lim_{n→∞} f_n dμ = lim_{n→∞} ∫_S f_n dμ   (3.11.25)

Proof

The additional assumptions on the integral of f_1 in the last two extensions of the monotone convergence theorem are necessary. An example is given below.


Our next result is also a consequence of the monotone convergence theorem, and is called Fatou's lemma in honor of Pierre Fatou. Its usefulness stems from the fact that no assumptions are placed on the integrand functions, except that they be nonnegative and measurable.

Fatou's Lemma. Suppose that f_n : S → [0, ∞) is measurable for n ∈ N_+. Then

∫_S lim inf_{n→∞} f_n dμ ≤ lim inf_{n→∞} ∫_S f_n dμ   (3.11.26)

Proof

Given the weakness of the hypotheses, it's hardly surprising that strict inequality can easily occur in Fatou's lemma. An example is given below.

Our next convergence result is one of the most important and is known as the dominated convergence theorem. It's sometimes also known as Lebesgue's dominated convergence theorem in honor of Henri Lebesgue, who first developed all of this stuff in the context of R^n. The dominated convergence theorem gives a basic condition under which we may interchange the limit and integration operators.

Dominated Convergence Theorem. Suppose that f_n : S → R is measurable for n ∈ N_+ and that lim_{n→∞} f_n exists on S. Suppose also that |f_n| ≤ g for n ∈ N_+ where g : S → [0, ∞) is integrable. Then

∫_S lim_{n→∞} f_n dμ = lim_{n→∞} ∫_S f_n dμ   (3.11.29)

Proof

As you might guess, the assumption that |f_n| is uniformly bounded in n by an integrable function is critical. A counterexample when this assumption is missing is given below. The dominated convergence theorem remains true if lim_{n→∞} f_n exists almost everywhere on S. The following corollary of the dominated convergence theorem gives a condition for the interchange of infinite sum and integral.

Suppose that f_i : S → R is measurable for i ∈ N_+ and that ∑_{i=1}^∞ |f_i| is integrable. Then

∫_S ∑_{i=1}^∞ f_i dμ = ∑_{i=1}^∞ ∫_S f_i dμ   (3.11.30)

Proof

The following corollary of the dominated convergence theorem is known as the bounded convergence theorem.

Bounded Convergence Theorem. Suppose that f_n : S → R is measurable for n ∈ N_+ and there exists A ∈ S such that μ(A) < ∞, lim_{n→∞} f_n exists on A, and |f_n| is bounded in n ∈ N_+ on A. Then

∫_A lim_{n→∞} f_n dμ = lim_{n→∞} ∫_A f_n dμ   (3.11.31)

Proof

Again, the bounded convergence theorem remains true if lim_{n→∞} f_n exists almost everywhere on A. For a finite measure space (and in particular for a probability space), the condition that μ(A) < ∞ automatically holds.

Product Spaces
Suppose now that (S, S, μ) and (T, T, ν) are σ-finite measure spaces. Please recall the basic facts about the product σ-algebra S ⊗ T of subsets of S × T, and the product measure μ ⊗ ν on S ⊗ T. The product measure space (S × T, S ⊗ T, μ ⊗ ν) is the standard one that we use for product spaces. If f : S × T → R is measurable, there are three integrals we might consider. First, of course, is the integral of f with respect to the product measure μ ⊗ ν

∫_{S×T} f(x, y) d(μ ⊗ ν)(x, y)   (3.11.32)

sometimes called a double integral in this context. But also we have the nested or iterated integrals where we integrate with respect to one variable at a time:

∫_S (∫_T f(x, y) dν(y)) dμ(x),  ∫_T (∫_S f(x, y) dμ(x)) dν(y)   (3.11.33)

How are these integrals related? Well, just as in calculus with ordinary Riemann integrals, under mild conditions the three integrals
are the same. The resulting important theorem is known as Fubini's Theorem in honor of the Italian mathematician Guido Fubini.

Fubini's Theorem. Suppose that f : S × T → R is measurable. If the double integral on the left exists, then

∫_{S×T} f(x, y) d(μ ⊗ ν)(x, y) = ∫_S ∫_T f(x, y) dν(y) dμ(x) = ∫_T ∫_S f(x, y) dμ(x) dν(y)   (3.11.34)

Proof

Of course, the double integral exists, and so Fubini's theorem applies, if either f is nonnegative or integrable with respect to μ ⊗ ν .
When f is nonnegative, the result is sometimes called Tonelli's theorem in honor of another Italian mathematician, Leonida Tonelli.
On the other hand, the iterated integrals may exist, and may be different, when the double integral does not exist. A
counterexample and a second counterexample are given below.
A special case of Fubini's theorem (and indeed part of the proof) is that we can compute the measure of a set in the product space
by integrating the cross-sectional measures.

If C ∈ S ⊗ T then

(μ ⊗ ν)(C) = ∫_S ν(C_x) dμ(x) = ∫_T μ(C^y) dν(y)   (3.11.39)

where C_x = {y ∈ T : (x, y) ∈ C} for x ∈ S, and C^y = {x ∈ S : (x, y) ∈ C} for y ∈ T.

In particular, if C, D ∈ S ⊗ T have the property that ν(C_x) = ν(D_x) for all x ∈ S, or μ(C^y) = μ(D^y) for all y ∈ T (that is, C and D have the same cross-sectional measures with respect to one of the variables), then (μ ⊗ ν)(C) = (μ ⊗ ν)(D). In R^2 with area, and in R^3 with volume (Lebesgue measure in both cases), this is known as Cavalieri's principle, named for Bonaventura Cavalieri, yet a third Italian mathematician. Clearly, Italian mathematicians cornered the market on theorems of this sort.
A simple corollary of Fubini's theorem is that the double integral of a product function over a product set is the product of the
integrals. This result has important applications to independent random variables.

Suppose that g : S → R and h : T → R are measurable, and are either nonnegative or integrable with respect to μ and ν, respectively. Then

∫_{S×T} g(x) h(y) d(μ ⊗ ν)(x, y) = (∫_S g(x) dμ(x)) (∫_T h(y) dν(y))   (3.11.40)

Recall that a discrete measure space consists of a countable set with the σ-algebra of all subsets and with counting measure. In such
a space, integrals are simply sums and so Fubini's theorem allows us to rearrange the order of summation in a double sum.

Suppose that I and J are countable and that a_{ij} ∈ R for i ∈ I and j ∈ J. If the sum of the positive terms or the sum of the negative terms is finite, then

∑_{(i,j)∈I×J} a_{ij} = ∑_{i∈I} ∑_{j∈J} a_{ij} = ∑_{j∈J} ∑_{i∈I} a_{ij}   (3.11.41)

Often I = J = N_+, and in this case, a_{ij} can be viewed as an infinite array, with i ∈ N_+ the row number and j ∈ N_+ the column number:

a_{11}  a_{12}  a_{13}  …
a_{21}  a_{22}  a_{23}  …
a_{31}  a_{32}  a_{33}  …
  ⋮       ⋮       ⋮

The significant point is that N_+ is totally ordered. While there is no implied order of summation in the double sum ∑_{(i,j)∈N_+^2} a_{ij}, the iterated sum ∑_{i=1}^∞ ∑_{j=1}^∞ a_{ij} is obtained by summing each row (over j) and then adding the row totals in order of i, while the iterated sum ∑_{j=1}^∞ ∑_{i=1}^∞ a_{ij} is obtained by summing each column (over i) and then adding the column totals in order of j.

Of course, only one of the product spaces might be discrete. Theorems (9) and (15), which give conditions for the interchange of sum and integral, can be viewed as applications of Fubini's theorem, where one of the measure spaces is (S, S, μ) and the other is N_+ with counting measure.

Examples and Applications


Probability Spaces
Suppose that (Ω, F , P) is a probability space, so that Ω is the set of outcomes of a random experiment, F is the σ-algebra of
events, and P is a probability measure on the sample space (Ω, F ). Suppose also that (S, S ) is another measurable space, and that
X is a random variable for the experiment, taking values in S . Of course, this simply means that X is a measurable function from

Ω to S. Recall that the probability distribution of X is the probability measure P_X on (S, S) defined by

P_X(A) = P(X ∈ A),  A ∈ S   (3.11.42)

Since {X ∈ A} is just probability notation for the inverse image of A under X, P_X is simply a special case of constructing a new positive measure from a given positive measure via a change of variables. Suppose now that r : S → R is measurable, so that r(X) is a real-valued random variable. The integral of r(X) (assuming that it exists) is known as the expected value of r(X) and is of fundamental importance. We will study expected values in detail in the next chapter. Here, we simply note different ways to write the integral. By the change of variables formula (8) we have

∫_Ω r[X(ω)] dP(ω) = ∫_S r(x) dP_X(x)   (3.11.43)

Now let F_Y denote the distribution function of Y = r(X). By another change of variables, Y has a probability distribution P_Y on R, which is also a Lebesgue-Stieltjes measure, named for Henri Lebesgue and Thomas Stieltjes. Recall that this probability measure is characterized by

P_Y(a, b] = P(a < Y ≤ b) = F_Y(b) − F_Y(a);  a, b ∈ R, a < b   (3.11.44)

With another application of our change of variables theorem, we can add to our chain of integrals:

∫_Ω r[X(ω)] dP(ω) = ∫_S r(x) dP_X(x) = ∫_R y dP_Y(y) = ∫_R y dF_Y(y)   (3.11.45)

Of course, the last two integrals are simply different notations for exactly the same thing. In the section on absolute continuity and
density functions, we will see other ways to write the integral.

Counterexamples
In the first three exercises below, (R, R, λ) is the standard one-dimensional Euclidean space, so R is the σ-algebra of Lebesgue measurable sets and λ is Lebesgue measure.

Let f = 1_{[1,∞)} and g = 1_{[0,∞)}. Show that

1. f ≤ g on R
2. λ{x ∈ R : f(x) < g(x)} = 1
3. ∫_R f dλ = ∫_R g dλ = ∞

This example shows that the strict increasing property can fail when the integrals are infinite.

Let f_n = 1_{[n,∞)} for n ∈ N_+. Show that

1. f_n is decreasing in n ∈ N_+ on R.
2. f_n → 0 as n → ∞ on R.
3. ∫_R f_n dλ = ∞ for each n ∈ N_+.

This example shows that the monotone convergence theorem can fail if the first integral is infinite. It also illustrates strict inequality in Fatou's lemma.

Let f_n = 1_{[n,n+1]} for n ∈ N_+. Show that

1. lim_{n→∞} f_n = 0 on R so ∫_R lim_{n→∞} f_n dλ = 0
2. ∫_R f_n dλ = 1 for n ∈ N_+ so lim_{n→∞} ∫_R f_n dλ = 1
3. sup{f_n : n ∈ N_+} = 1_{[1,∞)} on R

This example shows that the dominated convergence theorem can fail if |f_n| is not bounded by an integrable function. It also shows that strict inequality can hold in Fatou's lemma.
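A quick numerical look at this counterexample, sketched in Python: each f_n integrates to 1 (the length of the interval [n, n + 1]), yet at any fixed x the values f_n(x) are eventually 0.

def f(n, x):
    # f_n = indicator function of [n, n + 1]
    return 1.0 if n <= x <= n + 1 else 0.0

x = 7.3
print([f(n, x) for n in range(1, 12)])   # 1.0 only at n = 7, then 0.0 forever
# yet the integral of each f_n with respect to Lebesgue measure is the
# interval length, 1, so the limit of the integrals is 1, not 0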

Consider the product space [0, 1]^2 with the usual Lebesgue measurable subsets and Lebesgue measure. Let f : [0, 1]^2 → R be defined by

f(x, y) = (x^2 − y^2) / (x^2 + y^2)^2   (3.11.46)

Show that

1. ∫_{[0,1]^2} f(x, y) d(x, y) does not exist.
2. ∫_0^1 ∫_0^1 f(x, y) dx dy = −π/4
3. ∫_0^1 ∫_0^1 f(x, y) dy dx = π/4

This example shows that Fubini's theorem can fail if the double integral does not exist.
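The two iterated integrals can be checked symbolically; here is a sketch using sympy (an illustrative tool choice). Declaring the symbols positive keeps the inner antiderivatives, −x/(x^2 + y^2) and y/(x^2 + y^2), free of special cases at 0.

import sympy as sp

x, y = sp.symbols('x y', positive=True)
f = (x**2 - y**2) / (x**2 + y**2)**2

inner_x = sp.integrate(f, (x, 0, 1))       # -1/(1 + y**2)
print(sp.integrate(inner_x, (y, 0, 1)))    # -pi/4

inner_y = sp.integrate(f, (y, 0, 1))       # 1/(1 + x**2)
print(sp.integrate(inner_y, (x, 0, 1)))    # pi/4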

For i, j ∈ N_+ define a_{ij} as follows: a_{ii} = 1 and a_{i+1,i} = −1 for i ∈ N_+, and a_{ij} = 0 otherwise.

1. Give a_{ij} in array form with i ∈ N_+ as the row number and j ∈ N_+ as the column number
2. Show that ∑_{(i,j)∈N_+^2} a_{ij} does not exist
3. Show that ∑_{i=1}^∞ ∑_{j=1}^∞ a_{ij} = 1
4. Show that ∑_{j=1}^∞ ∑_{i=1}^∞ a_{ij} = 0

This example shows that the iterated sums can exist and be different when the double sum does not exist, a counterexample to the corollary to Fubini's theorem for sums when the hypotheses are not satisfied.
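A short Python sketch of parts 3 and 4, exploiting the fact that each row and each column of the array has at most two nonzero entries, so the full row sums and column sums can be computed exactly.

def a(i, j):
    # a_ii = 1, a_{i+1,i} = -1, and 0 otherwise
    if i == j:
        return 1
    if i == j + 1:
        return -1
    return 0

N = 100   # each individual row or column sum below is exact
row_sums = [sum(a(i, j) for j in (i - 1, i) if j >= 1) for i in range(1, N)]
col_sums = [sum(a(i, j) for i in (j, j + 1)) for j in range(1, N)]
print(sum(row_sums))   # 1: row 1 sums to 1, every later row to 0
print(sum(col_sums))   # 0: every full column sums to 0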

Computational Exercises

Compute ∫_D f(x, y) d(x, y) in each case below for the given D ⊆ R^2 and f : D → R.

1. f(x, y) = e^{−2x} e^{−3y}, D = [0, ∞) × [0, ∞)
2. f(x, y) = e^{−2x} e^{−3y}, D = {(x, y) ∈ R^2 : 0 ≤ x ≤ y < ∞}

Integrals of the type in the last exercise are useful in the study of exponential distributions.
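A symbolic check of both parts, sketched with sympy (an illustrative tool choice). For part 2, the wedge 0 ≤ x ≤ y < ∞ is handled by letting the inner variable y run from x to ∞.

import sympy as sp

x, y = sp.symbols('x y', positive=True)
f = sp.exp(-2 * x) * sp.exp(-3 * y)

# 1. product domain: by the product rule above, the integral is (1/2)(1/3)
print(sp.integrate(f, (x, 0, sp.oo), (y, 0, sp.oo)))   # 1/6

# 2. wedge 0 <= x <= y: integrate y from x to infinity first
print(sp.integrate(f, (y, x, sp.oo), (x, 0, sp.oo)))   # 1/15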

This page titled 3.11: Properties of the Integral is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

3.12: General Measures

Basic Theory
Our starting point in this section is a measurable space (S, S ). That is, S is a set and S is a σ-algebra of subsets of S . So far, we
have only considered positive measures on such spaces. Positive measures have applications, as we know, to length, area, volume,
mass, probability, counting, and similar concepts of the nonnegative “size” of a set. Moreover, we have defined the integral of a
measurable function f : S → R with respect to a positive measure, and we have studied properties of the integral.

Definition
But now we will consider measures that can take negative values as well as positive values. These measures have applications to
electric charge, monetary value, and other similar concepts of the “content” of a set that might be positive or negative. Also, this
generalization will help in our study of density functions in the next section. The definition is exactly the same as for a positive
measure, except that values in R^* = R ∪ {−∞, ∞} are allowed.

A measure on (S, S) is a function μ : S → R^* that satisfies the following properties:

1. μ(∅) = 0
2. If {A_i : i ∈ I} is a countable, disjoint collection of sets in S then μ(⋃_{i∈I} A_i) = ∑_{i∈I} μ(A_i)

As before, (b) is known as countable additivity and is the critical assumption: the measure of a set that consists of a countable number of disjoint pieces is the sum of the measures of the pieces. Implicit in the statement of this assumption is that the sum in (b) exists for every countable disjoint collection {A_i : i ∈ I}. That is, either the sum of the positive terms is finite or the sum of the negative terms is finite. In turn, this means that the order of the terms in the sum does not matter (a good thing, since there is no implied order). The term signed measure is used by many, but we will just use the simple term measure, and add appropriate adjectives for the special cases. Note that if μ(A) ≥ 0 for all A ∈ S, then μ is a positive measure, the kind we have already studied (and so the new definition really is a generalization). In this case, the sum in (b) always exists in [0, ∞]. If μ(A) ∈ R for all A ∈ S then μ is a finite measure. Note that in this case, the sum in (b) is absolutely convergent for every countable disjoint collection {A_i : i ∈ I}. If μ is a positive measure and μ(S) = 1 then μ is a probability measure, our favorite kind. Finally, as with positive measures, μ is σ-finite if there exists a countable collection {A_i : i ∈ I} of sets in S such that S = ⋃_{i∈I} A_i and μ(A_i) ∈ R for i ∈ I.

Basic Properties
We give a few simple properties of general measures; hopefully many of these will look familiar. Throughout, we assume that μ is
a measure on (S, S ). Our first result is that although μ can take the value ∞ or −∞ , it turns out that it cannot take both of these
values.

Either μ(A) > −∞ for all A ∈ S or μ(A) < ∞ for all A ∈ S .


Proof

We will say that two measures are of the same type if neither takes the value ∞ or if neither takes the value −∞ . Being of the same
type is trivially an equivalence relation on the collection of measures on (S, S ).
The difference rule holds, as long as the sets have finite measure:

Suppose that A, B ∈ S . If μ(B) ∈ R then μ(B ∖ A) = μ(B) − μ(A ∩ B) .


Proof

The following corollary is the difference rule for subsets, and will be needed below.

Suppose that A, B ∈ S and A ⊆ B . If μ(B) ∈ R then μ(A) ∈ R and μ(B ∖ A) = μ(B) − μ(A) .
Proof

As a consequence, suppose that A, B ∈ S and A ⊆ B . If μ(A) = ∞ , then by the infinity rule we cannot have μ(B) = −∞ and
by the difference rule we cannot have μ(B) ∈ R , so we must have μ(B) = ∞ . Similarly, if μ(A) = −∞ then μ(B) = −∞ . The
inclusion-exclusion rules hold for general measures, as long as the sets have finite measure.

Suppose that A_i ∈ S for each i ∈ I where #(I) = n, and that μ(A_i) ∈ R for i ∈ I. Then

μ(⋃_{i∈I} A_i) = ∑_{k=1}^n (−1)^{k−1} ∑_{J⊆I, #(J)=k} μ(⋂_{j∈J} A_j)   (3.12.1)

Proof

The continuity properties hold for general measures. Part (a) is the continuity property for increasing sets, and part (b) is the
continuity property for decreasing sets.

Suppose that A_n ∈ S for n ∈ N_+.

1. If A_n ⊆ A_{n+1} for n ∈ N_+ then lim_{n→∞} μ(A_n) = μ(⋃_{i=1}^∞ A_i).
2. If A_{n+1} ⊆ A_n for n ∈ N_+ and μ(A_1) ∈ R, then lim_{n→∞} μ(A_n) = μ(⋂_{i=1}^∞ A_i).

Proof

Recall that a positive measure is an increasing function, relative to the subset partial order on S and the ordinary order on [0, ∞],
and this property follows from the difference rule. But for general measures, the increasing property fails, and so do other
properties that flow from it, including the subadditive property (Boole's inequality in probability) and the Bonferroni inequalities.

Constructions
It's easy to construct general measures as differences of positive measures.

Suppose that μ and ν are positive measures on (S, S ) and that at least one of them is finite. Then δ = μ − ν is a measure.
Proof

The collection of measures on our space is closed under scalar multiplication.

If μ is a measure on (S, S ) and c ∈ R , then cμ is a measure on (S, S )


Proof

If μ is a finite measure, then so is cμ for c ∈ R . If μ is not finite then μ and cμ are of the same type if c > 0 and are of opposite
types if c < 0 . We can add two measures to get another measure, as long as they are of the same type. In particular, the collection
of finite measures is closed under addition as well as scalar multiplication, and hence forms a vector space.

If μ and ν are measures on (S, S ) of the same type then μ + ν is a measure on (S, S ).
Proof

Finally, it is easy to explicitly construct measures on a σ-algebra generated by a countable partition. Such σ-algebras are important
for counterexamples and to gain insight, and also because many σ-algebras that occur in applications can be constructed from
them.

Suppose that A = {A_i : i ∈ I} is a countable partition of S into nonempty sets, and that S = σ(A). For i ∈ I, define μ(A_i) ∈ R^* arbitrarily, subject only to the condition that the sum of the positive terms is finite, or the sum of the negative terms is finite. For A = ⋃_{j∈J} A_j where J ⊆ I, define

μ(A) = ∑_{j∈J} μ(A_j)   (3.12.7)

Then μ is a measure on (S, S ).


Proof
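The construction is easy to carry out concretely. Here is a Python sketch for a finite partition; the atom values are illustrative, and include ∞ to show the convention (the sum of the negative terms is finite, so the measure is well defined).

import math

# A measure on the sigma-algebra generated by a partition {A_1, ..., A_4},
# specified by its values on the atoms. A set in the sigma-algebra is
# represented by its set J of atom indices, and mu(A) = sum over j in J.
atom_value = {1: -1.0, 2: math.inf, 3: 1.0, 4: 0.0}

def mu(J):
    return sum(atom_value[j] for j in J)

print(mu({1, 3}))   # 0.0
print(mu({1, 2}))   # inf
print(mu(set()))    # 0, the empty sum, as the definition requires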

Positive, Negative, and Null Sets
To understand the structure of general measures, we need some basic definitions and properties. As before, we assume that μ is a
measure on (S, S ).

Definitions
1. A ∈ S is a positive set for μ if μ(B) ≥ 0 for every B ∈ S with B ⊆ A .
2. A ∈ S is a negative set for μ if μ(B) ≤ 0 for every B ∈ S with B ⊆ A .
3. A ∈ S is a null set for μ if μ(B) = 0 for every B ∈ S with B ⊆ A .

Note that positive and negative are used in the weak sense (just as we use the terms increasing and decreasing in this text). Of course, if μ is a positive measure, then every A ∈ S is positive for μ, and A ∈ S is negative for μ if and only if A is null for μ if and only if μ(A) = 0. For a general measure, A ∈ S is both positive and negative for μ if and only if A is null for μ. In particular, ∅ is null for μ. A set A ∈ S is a support set for μ if and only if A^c is a null set for μ. A support set is a set where the measure “lives” in a sense. Positive, negative, and null sets for μ have a basic inheritance property that is essentially equivalent to the definition.

Suppose A ∈ S .
1. If A is positive for μ then B is positive for μ for every B ∈ S with B ⊆ A .
2. If A is negative for μ then B is negative for μ for every B ∈ S with B ⊆ A .
3. If A is null for μ then B is null for μ for every B ∈ S with B ⊆ A .

The collections of positive sets, negative sets, and null sets for μ are closed under countable unions.

Suppose that {A_i : i ∈ I} is a countable collection of sets in S.

1. If A_i is positive for μ for i ∈ I then ⋃_{i∈I} A_i is positive for μ.
2. If A_i is negative for μ for i ∈ I then ⋃_{i∈I} A_i is negative for μ.
3. If A_i is null for μ for i ∈ I then ⋃_{i∈I} A_i is null for μ.

Proof

It's easy to see what happens to the positive, negative, and null sets when a measure is multiplied by a non-zero constant.

Suppose that μ is a measure on (S, S ), c ∈ R , and A ∈ S .


1. If c > 0 then A is positive (negative) for μ if and only if A is positive (negative) for cμ.
2. If c < 0 then A is positive (negative) for μ if and only if A is negative (positive) for cμ.
3. If c ≠ 0 then A is null for μ if and only if A is null for cμ

Positive, negative, and null sets are also preserved under countable sums, assuming that the sum of the measures makes sense.

Suppose that μ_i is a measure on (S, S) for each i in a countable index set I, and that μ = ∑_{i∈I} μ_i is a well-defined measure on (S, S). Let A ∈ S.

1. If A is positive for μ_i for every i ∈ I then A is positive for μ.
2. If A is negative for μ_i for every i ∈ I then A is negative for μ.
3. If A is null for μ_i for every i ∈ I then A is null for μ.

In particular, note that μ = ∑_{i∈I} μ_i is a well-defined measure if μ_i is a positive measure for each i ∈ I, or if I is finite and μ_i is a finite measure for each i ∈ I. It's easy to understand the positive, negative, and null sets for a σ-algebra generated by a countable partition.

Suppose that A = {A_i : i ∈ I} is a countable partition of S into nonempty sets, and that S = σ(A). Suppose that μ is a measure on (S, S). Define

I_+ = {i ∈ I : μ(A_i) > 0},  I_− = {i ∈ I : μ(A_i) < 0},  I_0 = {i ∈ I : μ(A_i) = 0}   (3.12.9)

Let A ∈ S, so that A = ⋃_{j∈J} A_j for some J ⊆ I (and this representation is unique). Then

1. A is positive for μ if and only if J ⊆ I_+ ∪ I_0.
2. A is negative for μ if and only if J ⊆ I_− ∪ I_0.
3. A is null for μ if and only if J ⊆ I_0.

The Hahn Decomposition


The fundamental results in this section and the next are two decomposition theorems that show precisely the relationship between
general measures and positive measures. First we show that if a set has finite, positive measure, then it has a positive subset with at
least that measure.

If A ∈ S and 0 ≤ μ(A) < ∞ then there exists P ∈ S with P ⊆A such that P is positive for μ and μ(P ) ≥ μ(A) .
Proof

The assumption that μ(A) < ∞ is critical; a counterexample is given below. Our first decomposition result is the Hahn
decomposition theorem, named for the Austrian mathematician Hans Hahn. It states that S can be partitioned into a positive set and
a negative set, and this decomposition is essentially unique.

Hahn Decomposition Theorem. There exists P ∈ S such that P is positive for μ and P^c is negative for μ. The pair (P, P^c) is a Hahn decomposition of S. If (Q, Q^c) is another Hahn decomposition, then P △ Q is null for μ.
Proof

It's easy to see the Hahn decomposition for a measure on a σ-algebra generated by a countable partition.

Suppose that A = {A_i : i ∈ I} is a countable partition of S into nonempty sets, and that S = σ(A). Suppose that μ is a measure on (S, S). Let I_+ = {i ∈ I : μ(A_i) > 0} and I_0 = {i ∈ I : μ(A_i) = 0}. Then (P, P^c) is a Hahn decomposition of μ if and only if the positive set P has the form P = ⋃_{j∈J} A_j where J = I_+ ∪ K and K ⊆ I_0.

The Jordan Decomposition

The Hahn decomposition leads to another decomposition theorem called the Jordan decomposition theorem, named for the French mathematician Camille Jordan. This one shows that every measure is the difference of positive measures. Once again we assume that μ is a measure on (S, S).

Jordan Decomposition Theorem. The measure μ can be written uniquely in the form μ = μ_+ − μ_− where μ_+ and μ_− are positive measures, at least one finite, and with the property that if (P, P^c) is any Hahn decomposition of S, then P^c is a null set of μ_+ and P is a null set of μ_−. The pair (μ_+, μ_−) is the Jordan decomposition of μ.

Proof

The Jordan decomposition leads to an important set of new definitions.

Suppose that μ has Jordan decomposition μ = μ_+ − μ_−.

1. The positive measure μ_+ is called the positive variation measure of μ.
2. The positive measure μ_− is called the negative variation measure of μ.
3. The positive measure |μ| = μ_+ + μ_− is called the total variation measure of μ.
4. ∥μ∥ = |μ|(S) is the total variation of μ.

Note that, in spite of the similarity in notation, μ_+(A) and μ_−(A) are not simply the positive and negative parts of the (extended) real number μ(A), nor is |μ|(A) the absolute value of μ(A). Also, be careful not to confuse the total variation of μ, a number in [0, ∞], with the total variation measure. The positive, negative, and total variation measures can be written directly in terms of μ.

For A ∈ S,

1. μ_+(A) = sup{μ(B) : B ∈ S, B ⊆ A}
2. μ_−(A) = −inf{μ(B) : B ∈ S, B ⊆ A}
3. |μ|(A) = sup{∑_{i∈I} |μ(A_i)| : {A_i : i ∈ I} is a finite, measurable partition of A}
4. ∥μ∥ = sup{∑_{i∈I} |μ(A_i)| : {A_i : i ∈ I} is a finite, measurable partition of S}
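For a measure on a σ-algebra generated by a finite partition, as in the construction earlier in this section, these quantities reduce to sums over atoms: the supremum in part 1, for example, is attained by keeping exactly the atoms of positive measure. The following Python sketch (with illustrative, finite atom values) computes μ_+, μ_−, |μ|, and ∥μ∥.

# Jordan decomposition and total variation for a measure on a sigma-algebra
# generated by a finite partition, given finite atom values.
atom_value = {1: 2.0, 2: -3.0, 3: 0.5, 4: -1.5}

def mu_plus(J):
    return sum(max(atom_value[j], 0.0) for j in J)    # keep positive atoms

def mu_minus(J):
    return sum(max(-atom_value[j], 0.0) for j in J)   # keep negative atoms

def total_var(J):
    return mu_plus(J) + mu_minus(J)                   # |mu| = mu_+ + mu_-

J = {1, 2, 3}
print(mu_plus(J), mu_minus(J), total_var(J))          # 2.5 3.0 5.5
print(total_var(atom_value.keys()))                   # ||mu|| = 7.0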

The total variation measure is related to sum and scalar multiples of measures in a natural way.

Suppose that μ and ν are measures of the same type and that c ∈ R . Then
1. |μ| = 0 if and only if μ = 0 (the zero measure).
2. |cμ| = |c| |μ|
3. |μ + ν | ≤ |μ| + |ν |
Proof

You may have noticed that the properties in the last result look a bit like norm properties. In fact, total variation really is a norm on
the vector space of finite measures on (S, S ):

Suppose that μ and ν are measures of the same type and that c ∈ R . Then
1. ∥μ∥ = 0 if and only if μ = 0 (the zero property)
2. ∥cμ∥ = |c| ∥μ∥ (the scaling property)
3. ∥μ + ν ∥ ≤ ∥μ∥ + ∥ν ∥ (the triangle inequality)
Proof

Every norm on a vector space leads to a corresponding measure of distance (a metric). Let M denote the collection of finite
measures on (S, S ). Then M , under the usual definition of addition and scalar multiplication of measures, is a vector space, and
as the last theorem shows, ∥ ⋅ ∥ is a norm on M . Here are the corresponding metric space properties:

Suppose that μ, ν, ρ ∈ M and c ∈ R . Then


1. ∥μ − ν ∥ = ∥ν − μ∥ , the symmetric property
2. ∥μ∥ = 0 if and only if μ = 0 , the zero property
3. ∥μ − ρ∥ ≤ ∥μ − ν ∥ + ∥ν − ρ∥ , the triangle inequality

Now that we have a metric, we have a corresponding criterion for convergence.

Suppose that μn ∈ M for n ∈ N+ and μ ∈ M . We say that μn → μ as n → ∞ in total variation if ∥ μn − μ∥ → 0 as


n → ∞.

Of course, M includes the probability measures on (S, S ), so we have a new notion of convergence to go along with the others
we have studied or will study. Here is a list:
convergence with probability 1
convergence in probability
convergence in distribution
convergence in k th mean
convergence in total variation

The Integral
Armed with the Jordan decomposition, the integral can be extended to general measures in a natural way.

Suppose that μ is a measure on (S, S) and that f : S → R is measurable. We define

∫_S f dμ = ∫_S f dμ_+ − ∫_S f dμ_−   (3.12.14)

assuming that the integrals on the right exist and that the right side is not of the form ∞ − ∞.

We will not pursue this extension, but as you might guess, the essential properties of the integral hold.

Complex Measures
Again, suppose that (S, S ) is a measurable space. The same axioms that work for general measures can be used to define complex
measures. Recall that C = {x + iy : x, y ∈ R} denotes the set of complex numbers, where i is the imaginary unit.

A complex measure on (S, S) is a function μ : S → C that satisfies the following properties:

1. μ(∅) = 0
2. If {A_i : i ∈ I} is a countable, disjoint collection of sets in S then μ(⋃_{i∈I} A_i) = ∑_{i∈I} μ(A_i)

Clearly a complex measure μ can be decomposed as μ = ν + iρ where ν and ρ are finite (real) measures on (S, S ). We will have
no use for complex measures in this text, but from the decomposition into finite measures, it's easy to see how to develop the
theory.

Computational Exercises
Counterexamples
The lemma needed for the Hahn decomposition theorem can fail without the assumption that μ(A) < ∞ .

Let S be a set with subsets A and B satisfying ∅ ⊂ B ⊂ A ⊂ S. Let S = σ{A, B} be the σ-algebra generated by {A, B}.
Define μ(B) = −1, μ(A ∖ B) = ∞, μ(Aᶜ) = 1.

1. Draw the Venn diagram of A , B , S .


2. List the sets in S .
3. Using additivity, give the value of μ on each set in S .
4. Show that A does not have a positive subset P ∈ S with μ(P ) ≥ μ(A) .

This page titled 3.12: General Measures is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

3.13: Absolute Continuity and Density Functions

Basic Theory
Our starting point is a measurable space (S, S). That is, S is a set and S is a σ-algebra of subsets of S. In the last section, we
discussed general measures on (S, S ) that can take positive and negative values. Special cases are positive measures, finite
measures, and our favorite kind, probability measures. In particular, we studied properties of general measures, ways to construct
them, special sets (positive, negative, and null), and the Hahn and Jordan decompositions.
In this section, we see how to construct a new measure from a given positive measure using a density function, and we answer the
fundamental question of when a measure has a density function relative to the given positive measure.

Relations on Measures
The answer to the question involves two important relations on the collection of measures on (S, S ) that are defined in terms of
null sets. Recall that A ∈ S is null for a measure μ on (S, S) if μ(B) = 0 for every B ∈ S with B ⊆ A. At the other extreme,
A ∈ S is a support set for μ if Aᶜ is a null set. Here are the basic definitions:

Suppose that μ and ν are measures on (S, S ).


1. ν is absolutely continuous with respect to μ if every null set of μ is also a null set of ν. We write ν ≪ μ.
2. μ and ν are mutually singular if there exists A ∈ S such that A is null for μ and Aᶜ is null for ν. We write μ ⊥ ν.

Thus ν ≪ μ if every support set of μ is a support set of ν. At the opposite end, μ ⊥ ν if μ and ν have disjoint support sets.

Suppose that μ , ν , and ρ are measures on (S, S ). Then


1. μ ≪ μ , the reflexive property.
2. If μ ≪ ν and ν ≪ ρ then μ ≪ ρ , the transitive property.

Recall that every relation that is reflexive and transitive leads to an equivalence relation, and then in turn, the original relation can
be extended to a partial order on the collection of equivalence classes. This general theorem on relations leads to the following two
results.

Measures μ and ν on (S, S ) are equivalent if μ ≪ ν and ν ≪ μ , and we write μ ≡ ν . The relation ≡ is an equivalence
relation on the collection of measures on (S, S ). That is, if μ , ν , and ρ are measures on (S, S ) then
1. μ ≡ μ , the reflexive property
2. If μ ≡ ν then ν ≡ μ , the symmetric property
3. If μ ≡ ν and ν ≡ ρ then μ ≡ ρ , the transitive property

Thus, μ and ν are equivalent if they have the same null sets and thus the same support sets. This equivalence relation is rather
weak: equivalent measures have the same support sets, but the values assigned to these sets can be very different. As usual, we will
write [μ] for the equivalence class of a measure μ on (S, S ), under the equivalence relation ≡.

If μ and ν are measures on (S, S ), we write [μ] ⪯ [ν ] if μ ≪ ν . The definition is consistent, and defines a partial order on the
collection of equivalence classes. That is, if μ , ν , and ρ are measures on (S, S ) then
1. [μ] ⪯ [μ] , the reflexive property.
2. If [μ] ⪯ [ν ] and [ν ] ⪯ [μ] then [μ] = [ν ] , the antisymmetric property.
3. If [μ] ⪯ [ν ] and [ν ] ⪯ [ρ] then [μ] ⪯ [ρ], the transitive property

The singularity relation is trivially symmetric and is almost anti-reflexive.

Suppose that μ and ν are measures on (S, S ). Then


1. If μ ⊥ ν then ν ⊥ μ , the symmetric property.
2. μ ⊥ μ if and only if μ = 0 , the zero measure.

Proof

Absolute continuity and singularity are preserved under multiplication by nonzero constants.

Suppose that μ and ν are measures on (S, S ) and that a, b ∈ R ∖ {0} . Then
1. ν ≪ μ if and only if aν ≪ bμ .
2. ν ⊥μ if and only if aν ⊥ bμ .
Proof

There is a corresponding result for sums of measures.

Suppose that μ is a measure on (S, S) and that νᵢ is a measure on (S, S) for each i in a countable index set I. Suppose also
that ν = ∑_{i∈I} νᵢ is a well-defined measure on (S, S).

1. If νᵢ ≪ μ for every i ∈ I then ν ≪ μ.
2. If νᵢ ⊥ μ for every i ∈ I then ν ⊥ μ.
Proof

As before, note that ν = ∑_{i∈I} νᵢ is well-defined if νᵢ is a positive measure for each i ∈ I or if I is finite and νᵢ is a finite measure
for each i ∈ I. We close this subsection with a couple of results that involve both the absolute continuity relation and the
singularity relation.

Suppose that μ , ν , and ρ are measures on (S, S ). If ν ≪ μ and μ ⊥ ρ then ν ⊥ρ .


Proof

Suppose that μ and ν are measures on (S, S ). If ν ≪ μ and ν ⊥μ then ν =0 .


Proof

Density Functions
We are now ready for our study of density functions. Throughout this subsection, we assume that μ is a positive, σ-finite measure
on our measurable space (S, S ). Recall that if f : S → R is measurable, then the integral of f with respect to μ may exist as a
number in R* = R ∪ {−∞, ∞} or may fail to exist.

Suppose that f : S → R is a measurable function whose integral with respect to μ exists. Then the function ν defined by

ν(A) = ∫_A f dμ, A ∈ S (3.13.1)

is a σ-finite measure on (S, S) that is absolutely continuous with respect to μ. The function f is a density function of ν
relative to μ.
Proof

The following three special cases are the most important:


1. If f is nonnegative (so that the integral exists in R ∪ {∞}) then ν is a positive measure since ν(A) ≥ 0 for A ∈ S.
2. If f is integrable (so that the integral exists in R), then ν is a finite measure since ν(A) ∈ R for A ∈ S.
3. If f is nonnegative and ∫_S f dμ = 1 then ν is a probability measure since ν(A) ≥ 0 for A ∈ S and ν(S) = 1.

In case 3, f is the probability density function of ν relative to μ , our favorite kind of density function. When they exist, density
functions are essentially unique.

Suppose that ν is a σ-finite measure on (S, S ) and that ν has density function f with respect to μ . Then g : S → R is a
density function of ν with respect to μ if and only if f = g almost everywhere on S with respect to μ .
Proof

The essential uniqueness of density functions can fail if the positive measure space (S, S , μ) is not σ-finite. A simple example is
given below. Our next result answers the question of when a measure has a density function with respect to μ , and is the
fundamental theorem of this section. The theorem is in two parts: Part (a) is the Lebesgue decomposition theorem, named for our

old friend Henri Lebesgue. Part (b) is the Radon-Nikodym theorem, named for Johann Radon and Otto Nikodym. We combine the
theorems because our proofs of the two results are inextricably linked.

Suppose that ν is a σ-finite measure on (S, S).

1. Lebesgue Decomposition Theorem. ν can be uniquely decomposed as ν = ν_c + ν_s where ν_c ≪ μ and ν_s ⊥ μ.
2. Radon-Nikodym Theorem. ν_c has a density function with respect to μ.

Proof

In particular, a measure ν on (S, S ) has a density function with respect to μ if and only if ν ≪ μ . The density function in this case
is also referred to as the Radon-Nikodym derivative of ν with respect to μ and is sometimes written in derivative notation as
dν /dμ. This notation, however, can be a bit misleading because we need to remember that a density function is unique only up to a

μ -null set. Also, the Radon-Nikodym theorem can fail if the positive measure space (S, S , μ) is not σ-finite. A couple of

examples are given below. Next we characterize the Hahn decomposition and the Jordan decomposition of ν in terms of the density
function.

Suppose that ν is a measure on (S, S) with ν ≪ μ, and that ν has density function f with respect to μ. Let
P = {x ∈ S : f(x) ≥ 0}, and let f⁺ and f⁻ denote the positive and negative parts of f.

1. A Hahn decomposition of ν is (P, Pᶜ).
2. The Jordan decomposition is ν = ν⁺ − ν⁻ where ν⁺(A) = ∫_A f⁺ dμ and ν⁻(A) = ∫_A f⁻ dμ, for A ∈ S.
Proof

The following result is a basic change of variables theorem for integrals.

Suppose that ν is a positive measure on (S, S) with ν ≪ μ and that ν has density function f with respect to μ. If g : S → R
is a measurable function whose integral with respect to ν exists, then

∫_S g dν = ∫_S g f dμ (3.13.6)

Proof

In differential notation, the change of variables theorem has the familiar form dν = f dμ , and this is really the justification for the
derivative notation f = dν /dμ in the first place. The following result gives the scalar multiple rule for density functions.

Suppose that ν is a measure on (S, S ) with ν ≪ μ and that ν has density function f with respect to μ . If c ∈ R , then cν has
density function cf with respect to μ .
Proof

Of course, we already knew that ν ≪ μ implies cν ≪ μ for c ∈ R , so the new information is the relation between the density
functions. In derivative notation, the scalar multiple rule has the familiar form
d(cν)/dμ = c dν/dμ (3.13.9)

The following result gives the sum rule for density functions. Recall that two measures are of the same type if neither takes the
value ∞ or if neither takes the value −∞ .

Suppose that ν and ρ are measures on (S, S ) of the same type with ν ≪ μ and ρ ≪ μ , and that ν and ρ have density
functions f and g with respect to μ , respectively. Then ν + ρ has density function f + g with respect to μ .
Proof

Of course, we already knew that ν ≪ μ and ρ ≪ μ imply ν + ρ ≪ μ , so the new information is the relation between the density
functions. In derivative notation, the sum rule has the familiar form
d(ν + ρ)/dμ = dν/dμ + dρ/dμ (3.13.11)

The following result is the chain rule for density functions.

Suppose that ν is a positive measure on (S, S ) with ν ≪ μ and that ν has density function f with respect to μ . Suppose ρ is a
measure on (S, S ) with ρ ≪ ν and that ρ has density function g with respect to ν . Then ρ has density function gf with
respect to μ .
Proof

Of course, we already knew that ν ≪ μ and ρ ≪ ν imply ρ ≪ μ, so once again the new information is the relation between the
density functions. In derivative notation, the chain rule has the familiar form

dρ/dμ = (dρ/dν)(dν/dμ) (3.13.12)

The following related result is the inverse rule for density functions.

Suppose that ν is a positive measure on (S, S ) with ν ≪ μ and μ ≪ ν (so that ν ≡μ ). If ν has density function f with
respect to μ then μ has density function 1/f with respect to ν .
Proof

In derivative notation, the inverse rule has the familiar form

dμ/dν = 1/(dν/dμ) (3.13.13)

Examples and Special Cases


Discrete Spaces
Recall that a discrete measure space (S, S, #) consists of a countable set S with the σ-algebra S = P(S) of all subsets of S,
and with counting measure #. Of course # is a positive measure and is trivially σ-finite since S is countable. Note also that ∅ is
the only set that is null for #. If ν is a measure on S, then by definition, ν(∅) = 0, so ν is absolutely continuous relative to #.
Thus, by the Radon-Nikodym theorem, ν can be written in the form

ν(A) = ∑_{x∈A} f(x), A ⊆ S (3.13.14)

for a unique f : S → R. Of course, this is obvious by a direct argument: if we define f(x) = ν{x} for x ∈ S, then the displayed
equation follows by the countable additivity of ν.
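As a concrete illustration, here is a minimal Python sketch (not part of the original text; names are hypothetical) of a measure defined by a density relative to counting measure:

def measure_from_density(f, A):
    # nu(A) = sum_{x in A} f(x): the integral of the density f over A
    # with respect to counting measure #
    return sum(f(x) for x in A)

# Example: f(x) = nu{x} = (1/2)^x on S = {1, 2, 3, ...}, so nu is the
# geometric probability distribution with parameter 1/2
f = lambda x: 0.5 ** x
print(measure_from_density(f, [1, 2]))        # nu{1, 2} = 0.75
print(measure_from_density(f, range(1, 60)))  # nu(S), approximately 1.0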

Spaces Generated by Countable Partitions


We can generalize the last discussion to spaces generated by countable partitions. Suppose that S is a set and that
A = {Aᵢ : i ∈ I} is a countable partition of S into nonempty sets. Let S = σ(A) and recall that every A ∈ S has a unique
representation of the form A = ⋃_{j∈J} Aⱼ where J ⊆ I. Suppose now that μ is a positive measure on S with 0 < μ(Aᵢ) < ∞ for
every i ∈ I. Then once again, the measure space (S, S, μ) is σ-finite and ∅ is the only null set. Hence if ν is a measure on (S, S)
then ν is absolutely continuous with respect to μ and hence has a unique density function f with respect to μ:

ν(A) = ∫_A f dμ, A ∈ S (3.13.15)

Once again, we can construct the density function explicitly.

In the setting above, define f : S → R by f(x) = ν(Aᵢ)/μ(Aᵢ) for x ∈ Aᵢ and i ∈ I. Then f is the density of ν with respect
to μ.
Proof

Often positive measure spaces that occur in applications can be decomposed into spaces generated by countable partitions. In the
section on Convergence in the chapter on Martingales, we show that more general density functions can be obtained as limits of
density functions of the type in the last theorem.
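Here is a small numerical sketch of the last theorem (the cell masses below are hypothetical, not from the text): the density is constant on each partition cell, and integrating it over a union of cells recovers ν.

mu = {"A1": 2.0, "A2": 0.5, "A3": 1.5}    # mu(A_i) for each cell, all in (0, infinity)
nu = {"A1": 1.0, "A2": 1.0, "A3": 3.0}    # nu(A_i) for each cell

# density of nu with respect to mu: f = nu(A_i)/mu(A_i) on the cell A_i
f = {i: nu[i] / mu[i] for i in mu}

def integral_over(cells):
    # integral of f over the union of the given cells, with respect to mu
    return sum(f[i] * mu[i] for i in cells)

print(integral_over(["A1", "A3"]))    # 4.0
print(nu["A1"] + nu["A3"])            # 4.0, as the theorem promises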

Probability Spaces
Suppose that (Ω, F, P) is a probability space and that X is a random variable taking values in a measurable space (S, S). Recall
that the distribution of X is the probability measure P_X on (S, S) given by

P_X(A) = P(X ∈ A), A ∈ S (3.13.17)

If μ is a positive, σ-finite measure on (S, S), then the theory of this section applies, of course. The Radon-Nikodym
theorem tells us precisely when (the distribution of) X has a probability density function with respect to μ: we need the distribution
to be absolutely continuous with respect to μ, so that if μ(A) = 0 then P_X(A) = P(X ∈ A) = 0 for A ∈ S.

Suppose that r : S → R is measurable, so that r(X) is a real-valued random variable. The integral of r(X) (assuming that it exists)
is of fundamental importance, and is known as the expected value of r(X). We will study expected values in detail in the next
chapter, but here we just note different ways to write the integral. By the change of variables theorem in the last section we have

∫_Ω r[X(ω)] dP(ω) = ∫_S r(x) dP_X(x) (3.13.18)

Assuming that P_X, the distribution of X, is absolutely continuous with respect to μ, with density function f, we can add to our
chain of integrals using Theorem (14):

∫_Ω r[X(ω)] dP(ω) = ∫_S r(x) dP_X(x) = ∫_S r(x) f(x) dμ(x) (3.13.19)

Specializing, suppose that (S, S, #) is a discrete measure space. Thus X has a discrete distribution and (as noted in the previous
subsection), the distribution of X is absolutely continuous with respect to #, with probability density function f given by
f(x) = P(X = x) for x ∈ S. In this case the integral simplifies:

∫_Ω r[X(ω)] dP(ω) = ∑_{x∈S} r(x) f(x) (3.13.20)

Recall next that for n ∈ N₊, the n-dimensional Euclidean measure space is (Rⁿ, R_n, λ_n) where R_n is the σ-algebra of Lebesgue
measurable sets and λ_n is Lebesgue measure. Suppose now that S ∈ R_n and that S is the σ-algebra of Lebesgue measurable
subsets of S, and that once again, X is a random variable with values in S. By definition, X has a continuous distribution if
P(X = x) = 0 for x ∈ S. But we now know that this is not enough to ensure that the distribution of X has a density function with
respect to λ_n. We need the distribution to be absolutely continuous, so that if λ_n(A) = 0 then P(X ∈ A) = 0 for A ∈ S. Of
course λ_n{x} = 0 for x ∈ S, so absolute continuity implies continuity, but not conversely. Continuity of the distribution is a
(much) weaker condition than absolute continuity of the distribution. If the distribution of X is continuous but not absolutely so,
then the distribution will not have a density function with respect to λ_n.

For example, suppose that λ_n(S) = 0. Then the distribution of X and λ_n are mutually singular since P(X ∈ S) = 1, and so X will
not have a density function with respect to λ_n. This will always be the case if S is countable, so that the distribution of X is
discrete. But it is also possible for X to have a continuous distribution on an uncountable set S ∈ R_n with λ_n(S) = 0. In such a
case, the continuous distribution of X is said to be degenerate. There are a couple of natural ways in which this can happen that are
illustrated in the following exercises.

Suppose that Θ is uniformly distributed on the interval [0, 2π). Let X = cos Θ, Y = sin Θ.

1. (X, Y) has a continuous distribution on the circle C = {(x, y) : x² + y² = 1}.
2. The distribution of (X, Y) and λ₂ are mutually singular.
3. Find P(Y > X).


Solution

The last example is artificial since (X, Y) has a one-dimensional distribution in a sense, in spite of taking values in R². And of
course Θ has a probability density function f with respect to λ₁ given by f(θ) = 1/2π for θ ∈ [0, 2π).

Suppose that X is uniformly distributed on the set {0, 1, 2}, Y is uniformly distributed on the interval [0, 2], and that X and Y
are independent.
1. (X, Y ) has a continuous distribution on the product set S = {0, 1, 2} × [0, 2].

2. The distribution of (X, Y) and λ₂ are mutually singular.

3. Find P(Y > X) .


Solution

The last exercise is artificial since X has a discrete distribution on {0, 1, 2} (with all subsets measurable and with counting
measure #), and Y a continuous distribution on the Euclidean space [0, 2] (with Lebesgue measurable subsets and with λ₁). Both
distributions are absolutely continuous; X has density function g given by g(x) = 1/3 for x ∈ {0, 1, 2} and Y has density function
h given by h(y) = 1/2 for y ∈ [0, 2]. So really, the proper measure space on S is the product measure space formed from these
two spaces. Relative to this product space (X, Y) has density f given by f(x, y) = 1/6 for (x, y) ∈ S.
It is also possible to have a continuous distribution on S ⊆ Rⁿ with λ_n(S) > 0, yet still with no probability density function, a
much more interesting situation. We will give a classical construction. Let (X₁, X₂, …) be a sequence of Bernoulli trials with
success parameter p ∈ (0, 1). We will indicate the dependence of the probability measure P on the parameter p with a subscript.
Thus, we have a sequence of independent indicator variables with

P_p(Xᵢ = 1) = p, P_p(Xᵢ = 0) = 1 − p (3.13.21)

We interpret Xᵢ as the ith binary digit (bit) of a random variable X taking values in (0, 1). That is, X = ∑_{i=1}^∞ Xᵢ/2ⁱ. Conversely,
recall that every number x ∈ (0, 1) can be written in binary form as x = ∑_{i=1}^∞ xᵢ/2ⁱ where xᵢ ∈ {0, 1} for each i ∈ N₊. This
representation is unique except when x is a binary rational of the form x = k/2ⁿ for n ∈ N₊ and k ∈ {1, 3, …, 2ⁿ − 1}. In this
case, there are two representations, one in which the bits are eventually 0 and one in which the bits are eventually 1. Note, however,
that the set of binary rationals is countable. Finally, note that the uniform distribution on (0, 1) is the same as Lebesgue measure on
(0, 1).

X has a continuous distribution on (0, 1) for every value of the parameter p ∈ (0, 1). Moreover,

1. If p, q ∈ (0, 1) and p ≠ q then the distribution of X with parameter p and the distribution of X with parameter q are
mutually singular.
2. If p = 1/2, X has the uniform distribution on (0, 1).
3. If p ≠ 1/2, then the distribution of X is singular with respect to Lebesgue measure on (0, 1), and hence has no probability
density function in the usual sense.


Proof

For an application of some of the ideas in this example, see Bold Play in the game of Red and Black.
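A small simulation may help make the construction concrete. The Python sketch below (illustrative code, not part of the original text) generates X from its first 50 bits; since E(X) = ∑ᵢ p/2ⁱ = p and P(X < 1/2) = P(X₁ = 0) = 1 − p, the output shows how the mass shifts away from uniform when p ≠ 1/2:

import random

def random_binary_point(p, bits=50):
    # X = sum_{i=1}^{bits} X_i / 2^i with independent Bernoulli(p) bits X_i
    x = 0.0
    for i in range(1, bits + 1):
        if random.random() < p:
            x += 2.0 ** -i
    return x

for p in (0.5, 0.9):
    sample = [random_binary_point(p) for _ in range(10000)]
    mean = sum(sample) / len(sample)
    below = sum(x < 0.5 for x in sample) / len(sample)
    print(f"p = {p}: sample mean is near {mean:.3f}, fraction below 1/2 is near {below:.3f}")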

Counterexamples
The essential uniqueness of density functions can fail if the underlying positive measure μ is not σ -finite. Here is a trivial
counterexample:

Suppose that S is a nonempty set and that S = {S, ∅} is the trivial σ-algebra. Define the positive measure μ on (S, S) by
μ(∅) = 0, μ(S) = ∞. Let ν_c denote the measure on (S, S) with constant density function c ∈ R with respect to μ.

1. (S, S, μ) is not σ-finite.
2. ν_c = μ for every c ∈ (0, ∞).

The Radon-Nikodym theorem can fail if the measure μ is not σ-finite, even if ν is finite. Here are a couple of standard
counterexamples:

Suppose that S is an uncountable set and S is the σ-algebra of countable and co-countable sets:

S = {A ⊆ S : A is countable or Aᶜ is countable} (3.13.23)

As usual, let # denote counting measure on S, and define ν on S by ν(A) = 0 if A is countable and ν(A) = 1 if Aᶜ is
countable. Then
1. (S, S , #) is not σ-finite.
2. ν is a finite, positive measure on (S, S ).
3. ν is absolutely continuous with respect to #.

4. ν does not have a density function with respect to #.
Proof

Let R denote the standard Borel σ-algebra on R . Let # and λ denote counting measure and Lebesgue measure on (R, R) ,
respectively. Then
1. (R, R, #) is not σ-finite.
2. λ is absolutely continuous with respect to #.
3. λ does not have a density function with respect to #.
Proof

This page titled 3.13: Absolute Continuity and Density Functions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated
by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history
is available upon request.

3.14: Function Spaces

Basic Theory
Our starting point is a positive measure space (S, S, μ). That is, S is a set, S is a σ-algebra of subsets of S, and μ is a positive
measure on (S, S ). As usual, the most important special cases are
Euclidean space: S is a Lebesgue measurable subset of Rⁿ for some n ∈ N₊, S is the σ-algebra of Lebesgue measurable
subsets of S, and μ = λ_n is n-dimensional Lebesgue measure.
Discrete space: S is a countable set, S = P(S) is the collection of all subsets of S , and μ = # is counting measure.
Probability space: S is the set of outcomes of a random experiment, S is the σ-algebra of events, and μ = P is a probability
measure.
In previous sections, we defined the integral of certain measurable functions f : S → R with respect to μ , and we studied
properties of the integral. In this section, we will study vector spaces of functions that are defined in terms of certain integrability
conditions. These function spaces are of fundamental importance in all areas of analysis, including probability. In particular, the
results of this section will reappear in the form of spaces of random variables in our study of expected value.

Definitions and Basic Properties


Consider a statement on the elements of S , for example an equation or an inequality with x ∈ S as a free variable. (Technically
such a statement is a predicate on S .) For A ∈ S , we say that the statement holds on A if it is true for every x ∈ A . We say that
the statement holds almost everywhere on A (with respect to μ ) if there exists B ∈ S with B ⊆ A such that the statement holds
on B and μ(A ∖ B) = 0 .

Measurable functions f , g : S → R are equivalent if f = g almost everywhere on S , in which case we write f ≡ g . The
relation ≡ is an equivalence relation on the collection of measurable functions from S to R. That is, if f , g, h : S → R are
measurable then
1. f ≡ f , the reflexive property.
2. If f ≡ g then g ≡ f , the symmetric property.
3. If f ≡ g and g ≡ h then f ≡ h , the transitive property.

Thus, equivalent functions are indistinguishable from the point of view of the measure μ . As with any equivalence relation, ≡
partitions the underlying set (in this case the collection of real-valued measurable functions on S ) into equivalence classes of
mutually equivalent elements. As we will see, we often view these equivalence classes as the basic objects of study. Our next task
is to define measures of the “size” of a function; these will become norms in our spaces.

Suppose that f : S → R is measurable. For p ∈ (0, ∞) we define

∥f∥_p = (∫_S |f|^p dμ)^{1/p} (3.14.1)

We also define ∥f∥_∞ = inf{b ∈ [0, ∞] : |f| ≤ b almost everywhere on S}.

Since |f|^p is a nonnegative, measurable function for p ∈ (0, ∞), ∫_S |f|^p dμ exists in [0, ∞], and hence so does ∥f∥_p. Clearly
∥f∥_∞ also exists in [0, ∞] and is known as the essential supremum of f. A number b ∈ [0, ∞] such that |f| ≤ b almost
everywhere on S is an essential bound of f and so, appropriately enough, the essential supremum of f is the infimum of the
essential bounds of f. Thus, we have defined ∥f∥_p for all p ∈ (0, ∞]. The definition for p = ∞ is special, but we will see that it's
the appropriate one.

For p ∈ (0, ∞], let ℒ_p denote the collection of measurable functions f : S → R such that ∥f∥_p < ∞.

So for p ∈ (0, ∞), f ∈ ℒ_p if and only if |f|^p is integrable. The symbol ℒ is in honor of Henri Lebesgue, who first developed the
theory. If we want to indicate the dependence on the underlying measure space, we write ℒ_p(S, S, μ). Of course, ℒ_1 is simply the
collection of functions that are integrable with respect to μ. Our goal is to study the spaces ℒ_p for p ∈ (0, ∞]. We start with some
simple properties.

Suppose that f : S → R is measurable. Then for p ∈ (0, ∞],

1. ∥f∥_p ≥ 0
2. ∥f∥_p = 0 if and only if f = 0 almost everywhere on S, so that f ≡ 0.

Proof

Suppose that f : S → R is measurable and c ∈ R. Then ∥cf∥_p = |c| ∥f∥_p for p ∈ (0, ∞].

Proof

In particular, if f ∈ ℒ_p and c ∈ R then cf ∈ ℒ_p.

Conjugate Indices and Hölder's inequality


Certain pairs of our function spaces turn out to be dual or complementary to one another in a sense. To understand this, we need the
following definition.

Indices p, q ∈ (1, ∞) are said to be conjugate if 1/p + 1/q = 1 . In addition, 1 and ∞ are conjugate indices.

For justification of the last case, note that if p ∈ (1, ∞), then the index conjugate to p is

q = 1/(1 − 1/p) (3.14.3)

and q ↑ ∞ as p ↓ 1 . Note that p = q = 2 are conjugate indices, and this is the only case where the indices are the same. Ultimately,
the importance of conjugate indices stems from the following inequality:

If x, y ∈ (0, ∞) and if p, q ∈ (1, ∞) are conjugate indices, then

xy ≤ x^p/p + y^q/q (3.14.4)

Moreover, equality occurs if and only if x^p = y^q.
Proof 1
Proof 2

Our next major result is Hölder's inequality, named for Otto Hölder, which clearly indicates the importance of conjugate indices.

Suppose that f, g : S → R are measurable and that p and q are conjugate indices. Then

∥fg∥_1 ≤ ∥f∥_p ∥g∥_q (3.14.9)

Proof

In particular, if f ∈ ℒ_p and g ∈ ℒ_q then fg ∈ ℒ_1. The most important special case of Hölder's inequality is when p = q = 2, in
which case we have the Cauchy-Schwarz inequality, named for Augustin Louis Cauchy and Karl Hermann Schwarz:

∥fg∥_1 ≤ ∥f∥_2 ∥g∥_2 (3.14.14)
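On a finite set with counting measure, ∥ ⋅ ∥_p is the familiar p-norm of a vector, and Hölder's inequality can be checked numerically. A quick Python sketch (illustrative, not part of the original text):

def p_norm(v, p):
    # ||v||_p = (sum |v_i|^p)^(1/p), the p-norm with respect to counting measure
    return sum(abs(t) ** p for t in v) ** (1.0 / p)

f = [1.0, -2.0, 3.0]
g = [0.5, 1.5, -1.0]
p = 3.0
q = p / (p - 1.0)                                # conjugate index: 1/p + 1/q = 1

lhs = sum(abs(a * b) for a, b in zip(f, g))      # ||fg||_1
rhs = p_norm(f, p) * p_norm(g, q)
print(lhs, "<=", rhs)                            # with p = q = 2 this is Cauchy-Schwarz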

Minkowski's Inequality
Our next major result is Minkowski's inequality, named for Hermann Minkowski. This inequality will help show that ℒ_p is a vector
space and that ∥ ⋅ ∥_p is a norm (up to equivalence) when p ≥ 1.

Suppose that f , g : S → R are measurable and that p ∈ [1, ∞]. Then


∥f + g∥p ≤ ∥f ∥p + ∥g∥p (3.14.15)

Proof

Vector Spaces
We can now discuss various vector spaces of functions. First, we know from our previous work with measure spaces, that the set V
of all measurable functions f : S → R is a vector space under our standard (pointwise) definitions of sum and scalar multiple. The
spaces we are studying in this section are subspaces:

ℒ_p is a subspace of V for every p ∈ [1, ∞].
Proof

However, we usually want to identify functions that are equal almost everywhere on S (with respect to μ ). Recalling the
equivalence relation ≡ defined above, here are the definitions:

Let [f ] denote the equivalence class of f ∈ V under the equivalence relation ≡, and let U = {[f ] : f ∈ V } . If f , g ∈ V and
c ∈ R we define

1. [f ] + [g] = [f + g]
2. c[f ] = [cf ]
Then U is a vector space.
Proof

Now we can define the Lebesgue vector spaces precisely.

For p ∈ [1, ∞], let L_p = {[f] : f ∈ ℒ_p}. For f ∈ V define ∥[f]∥_p = ∥f∥_p. Then L_p is a subspace of U and ∥ ⋅ ∥_p is a
norm on L_p. That is, for f, g ∈ L_p and c ∈ R,

1. ∥f∥_p ≥ 0 and ∥f∥_p = 0 if and only if f ≡ 0, the positive property
2. ∥cf∥_p = |c| ∥f∥_p, the scaling property
3. ∥f + g∥_p ≤ ∥f∥_p + ∥g∥_p, the triangle inequality
Proof

We have stated these results precisely, but on the other hand, we don't want to be overly pedantic. It's more natural and intuitive to
simply work with the space V and the subspaces ℒ_p for p ∈ [1, ∞], and just remember that functions that are equal almost
everywhere on S are regarded as the same vector. This will be our point of view for the rest of this section.

Every norm on a vector space naturally leads to a metric. That is, we measure the distance between vectors as the norm of their
difference. Stated in terms of the norm ∥ ⋅ ∥_p, here are the properties of the metric on ℒ_p.

For f, g, h ∈ ℒ_p,

1. ∥f − g∥_p ≥ 0 and ∥f − g∥_p = 0 if and only if f ≡ g, the positive property
2. ∥f − g∥_p = ∥g − f∥_p, the symmetric property
3. ∥f − h∥_p ≤ ∥f − g∥_p + ∥g − h∥_p, the triangle inequality

Once we have a metric, we naturally have a criterion for convergence.

Suppose that f_n ∈ ℒ_p for n ∈ N₊ and f ∈ ℒ_p. Then by definition, f_n → f as n → ∞ in ℒ_p if and only if ∥f_n − f∥_p → 0 as
n → ∞.

Limits are unique, up to equivalence. (That is, limits are unique in L_p.)

Suppose again that f_n ∈ ℒ_p for n ∈ N₊. Recall that this sequence is said to be a Cauchy sequence if for every ϵ > 0 there exists
N ∈ N₊ such that if n > N and m > N then ∥f_n − f_m∥_p < ϵ. Needless to say, the Cauchy criterion is named for our ubiquitous
friend Augustin Cauchy. A metric space in which every Cauchy sequence converges (to an element of the space) is said to be
complete. Intuitively, one expects a Cauchy sequence to converge, so a complete space is literally one that is not missing any
elements that should be there. A complete, normed vector space is called a Banach space, after the Polish mathematician Stefan
Banach. Banach spaces are of fundamental importance in analysis, in large part because of the following result:

L_p is a Banach space for every p ∈ [1, ∞].

The Space L_2

The norm ∥ ⋅ ∥_2 is special because it corresponds to an inner product.

For f, g ∈ ℒ_2, define

⟨f, g⟩ = ∫_S fg dμ (3.14.23)

Note that the integral is well-defined by the Cauchy-Schwarz inequality. As with all of our other definitions, this one is consistent
with the equivalence relation. That is, if f ≡ f₁ and g ≡ g₁ then fg ≡ f₁g₁ so ∫_S fg dμ = ∫_S f₁g₁ dμ and hence
⟨f, g⟩ = ⟨f₁, g₁⟩. Note also that ⟨f, f⟩ = ∥f∥_2^2 for f ∈ ℒ_2, so this definition generates the 2-norm.

ℒ_2 is an inner product space. That is, if f, g, h ∈ ℒ_2 and c ∈ R then

1. ⟨f, f⟩ ≥ 0 and ⟨f, f⟩ = 0 if and only if f ≡ 0, the positive property
2. ⟨f, g⟩ = ⟨g, f⟩, the symmetric property
3. ⟨cf, g⟩ = c⟨f, g⟩, the scaling property
4. ⟨f + g, h⟩ = ⟨f, h⟩ + ⟨g, h⟩, the additive property
Proof

From parts (c) and (d), the inner product is linear in the first argument, with the second argument fixed. By the symmetric property
(b), it follows that the inner product is also linear in the second argument with the first argument fixed. That is, the inner product is
bilinear. A complete inner product space is known as a Hilbert space, named for the German mathematician David Hilbert. Thus,
the following result follows immediately from the previous two.

L_2 is a Hilbert space.

All inner product spaces lead naturally to the concept of orthogonality; L_2 is no exception.

Functions f, g ∈ ℒ_2 are orthogonal if ⟨f, g⟩ = 0, in which case we write f ⊥ g. Equivalently f ⊥ g if

∫_S fg dμ = 0 (3.14.24)

Of course, all of the basic theorems of general inner product spaces hold in L_2. For example, the following result is the
Pythagorean theorem, named of course for Pythagoras.

If f, g ∈ ℒ_2 and f ⊥ g then ∥f + g∥_2^2 = ∥f∥_2^2 + ∥g∥_2^2.
Proof

Examples and Special Cases


Discrete Spaces
Recall again that the measure space (S, S, #) is discrete if S is countable, S = P(S) is the σ-algebra of all subsets of S, and of
course, # is counting measure. In this case, recall that integrals are sums. The exposition will look more familiar if we use the
notation of sequences rather than functions. Thus, let x : S → R, and denote the value of x at i ∈ S by x_i rather than x(i). For
p ∈ [1, ∞), the p-norm is

∥x∥_p = (∑_{i∈S} |x_i|^p)^{1/p} (3.14.26)

On the other hand, ∥x∥_∞ = sup{|x_i| : i ∈ S}. The only null set for # is ∅, so the equivalence relation ≡ is simply equality, and so
the spaces ℒ_p and L_p are the same. For p ∈ [1, ∞), x ∈ L_p if and only if

∑_{i∈S} |x_i|^p < ∞ (3.14.27)

When p ∈ N₊ (as is often the case), this condition means that ∑_{i∈S} x_i^p is absolutely convergent. On the other hand, x ∈ L_∞ if
and only if x is bounded. When S = N₊, the space L_p is often denoted l_p. The inner product on L_2 is

⟨x, y⟩ = ∑_{i∈S} x_i y_i, x, y ∈ L_2 (3.14.28)

When S = {1, 2, …, n}, L_2 is simply the vector space Rⁿ with the usual addition, scalar multiplication, inner product, and norm
that we study in elementary linear algebra. Orthogonal vectors are perpendicular in the usual sense.

Probability Spaces
Suppose that (S, S, P) is a probability space, so that S is the set of outcomes of a random experiment, S is the σ-algebra of
events, and P is a probability measure on the sample space (S, S). Of course, a measurable function X : S → R is simply a real-
valued random variable. For p ∈ [1, ∞), the integral ∫_S |X|^p dP is the expected value of |X|^p, and is denoted E(|X|^p). Thus in this
case, ℒ_p is the collection of real-valued random variables X with E(|X|^p) < ∞. We will study these spaces in more detail in the
chapter on expected value.

This page titled 3.14: Function Spaces is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

CHAPTER OVERVIEW
4: Expected Value
Expected value is one of the fundamental concepts in probability, in a sense more general than probability itself. The expected
value of a real-valued random variable gives a measure of the center of the distribution of the variable. More importantly, by taking
the expected value of various functions of a general random variable, we can measure many interesting features of its distribution,
including spread, skewness, kurtosis, and correlation. Generating functions are certain types of expected value that completely
determine the distribution of the variable. Conditional expected value, which incorporates known information in the computation,
is one of the fundamental concepts in probability.
In the advanced topics, we define expected value as an integral with respect to the underlying probability measure. We also revisit
conditional expected value from a measure-theoretic point of view. We study vector spaces of random variables with certain
expected values as the norms of the spaces, which in turn leads to modes of convergence for random variables.
4.1: Definitions and Basic Properties
4.2: Additional Properties
4.3: Variance
4.4: Skewness and Kurtosis
4.5: Covariance and Correlation
4.6: Generating Functions
4.7: Conditional Expected Value
4.8: Expected Value and Covariance Matrices
4.9: Expected Value as an Integral
4.10: Conditional Expected Value Revisited
4.11: Vector Spaces of Random Variables
4.12: Uniformly Integrable Variables
4.13: Kernels and Operators

This page titled 4: Expected Value is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

4.1: Definitions and Basic Properties

Expected value is one of the most important concepts in probability. The expected value of a real-valued random variable gives the
center of the distribution of the variable, in a special sense. Additionally, by computing expected values of various real
transformations of a general random variable, we can extract a number of interesting characteristics of the distribution of the
variable, including measures of spread, symmetry, and correlation. In a sense, expected value is a more general concept than
probability itself.

Basic Concepts
Definitions
As usual, we start with a random experiment modeled by a probability space (Ω, F , P). So to review, Ω is the set of outcomes, F
the collection of events and P the probability measure on the sample space (Ω, F ). In the following definitions, we assume that X
is a random variable for the experiment, taking values in S ⊆ R .

If X has a discrete distribution with probability density function f (so that S is countable), then the expected value of X is
defined as follows (assuming that the sum is well defined):

E(X) = ∑_{x∈S} x f(x) (4.1.1)

The sum defining the expected value makes sense if either the sum over the positive x ∈ S is finite or the sum over the negative
x ∈ S is finite (or both). This ensures that the entire sum exists (as an extended real number) and does not depend on the order

of the terms. So as we will see, it's possible for E(X) to be a real number or ∞ or −∞ or to simply not exist. Of course, if S is
finite the expected value always exists as a real number.

If X has a continuous distribution with probability density function f (and so S is typically an interval or a union of disjoint
intervals), then the expected value of X is defined as follows (assuming that the integral is well defined):

E(X) = ∫_S x f(x) dx (4.1.2)

The probability density functions in basic applied probability that describe continuous distributions are piecewise continuous. So
the integral above makes sense if the integral over positive x ∈ S is finite or the integral over negative x ∈ S is finite (or both).
This ensures that the entire integral exists (as an extended real number). So as in the discrete case, it's possible for E(X) to exist as
a real number or as ∞ or as −∞ or to not exist at all. As you might guess, the definition for a mixed distribution is a combination
of the definitions for the discrete and continuous cases.

Suppose that X has a mixed distribution, with partial discrete density g on D and partial continuous density h on C, where D and
C are disjoint, D is countable, C is typically an interval, and S = D ∪ C. The expected value of X is defined as follows (assuming
that the expression on the right is well defined):

E(X) = ∑_{x∈D} x g(x) + ∫_C x h(x) dx (4.1.3)

For the expected value above to make sense, the sum must be well defined, as in the discrete case, the integral must be well
defined, as in the continuous case, and we must avoid the dreaded indeterminate form ∞ − ∞ . In the next section on additional
properties, we will see that the various definitions given here can be unified into a single definition that works regardless of the
type of distribution of X. An even more general definition is given in the advanced section on expected value as an integral.

Interpretation
The expected value of X is also called the mean of the distribution of X and is frequently denoted μ . The mean is the center of the
probability distribution of X in a special sense. Indeed, if we think of the distribution as a mass distribution (with total mass 1),

then the mean is the center of mass as defined in physics. The two pictures below show discrete and continuous probability density
functions; in each case the mean μ is the center of mass, the balance point.

Figure 4.1.1: The mean μ as the center of mass of a discrete distribution.

Figure 4.1.2: The mean μ as the center of mass of a continuous distribution.

Recall the other measures of the center of a distribution that we have studied:
A mode is any x ∈ S that maximizes f .
A median is any x ∈ R that satisfies P(X < x) ≤ 1/2 and P(X ≤ x) ≥ 1/2.
To understand expected value in a probabilistic way, suppose that we create a new, compound experiment by repeating the basic
experiment over and over again. This gives a sequence of independent random variables (X₁, X₂, …), each with the same
distribution as X. In statistical terms, we are sampling from the distribution of X. The average value, or sample mean, after n runs
is

M_n = (1/n) ∑_{i=1}^n Xᵢ (4.1.4)

Note that M_n is a random variable in the compound experiment. The important fact is that the average value M_n converges to the
expected value E(X) as n → ∞. The precise statement of this is the law of large numbers, one of the fundamental theorems of
probability. You will see the law of large numbers at work in many of the simulation exercises given below.
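The law of large numbers is also easy to see in a quick simulation (an illustrative Python sketch, not part of the original text). For a fair die, E(X) = 3.5, and the sample mean M_n settles near this value as n grows:

import random

random.seed(1)
rolls = [random.randint(1, 6) for _ in range(10000)]
for n in (10, 100, 1000, 10000):
    m_n = sum(rolls[:n]) / n      # the sample mean after n runs
    print(n, m_n)                 # drifts toward E(X) = 3.5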

Extensions
If a ∈ R and n ∈ N, the moment of X about a of order n is defined to be

E[(X − a)ⁿ] (4.1.5)

(assuming of course that this expected value exists).

The moments about 0 are simply referred to as moments (or sometimes raw moments). The moments about μ are the central
moments. The second central moment is particularly important, and is studied in detail in the section on variance. In some cases, if
we know all of the moments of X, we can determine the entire distribution of X. This idea is explored in the section on generating
functions.
The expected value of a random variable X is based, of course, on the probability measure P for the experiment. This probability
measure could be a conditional probability measure, conditioned on a given event A ∈ F for the experiment (with P(A) > 0 ). The
usual notation is E(X ∣ A) , and this expected value is computed by the definitions given above, except that the conditional
probability density function x ↦ f (x ∣ A) replaces the ordinary probability density function f . It is very important to realize that,
except for notation, no new concepts are involved. All results that we obtain for expected value in general have analogues for these

conditional expected values. On the other hand, we will study a more general notion of conditional expected value in a later
section.

Basic Properties
The purpose of this subsection is to study some of the essential properties of expected value. Unless otherwise noted, we will
assume that the indicated expected values exist, and that the various sets and functions that we use are measurable. We start with
two simple but still essential results.

Simple Variables
First, recall that a constant c ∈ R can be thought of as a random variable (on any probability space) that takes only the value c with
probability 1. The corresponding distribution is sometimes called point mass at c .

If c is a constant random variable, then E(c) = c .


Proof

Next recall that an indicator variable is a random variable that takes only the values 0 and 1.

If X is an indicator variable then E(X) = P(X = 1) .


Proof

In particular, if 1_A is the indicator variable of an event A, then E(1_A) = P(A), so in a sense, expected value subsumes
probability. For a book that takes expected value, rather than probability, as the fundamental starting concept, see the book
Probability via Expectation, by Peter Whittle.

Change of Variables Theorem


The expected value of a real-valued random variable gives the center of the distribution of the variable. This idea is much more
powerful than might first appear. By finding expected values of various functions of a general random variable, we can measure
many interesting features of its distribution.
Thus, suppose that X is a random variable taking values in a general set S , and suppose that r is a function from S into R. Then
r(X) is a real-valued random variable, and so it makes sense to compute E [r(X)] (assuming as usual that this expected value

exists). However, to compute this expected value from the definition would require that we know the probability density function
of the transformed variable r(X) (a difficult problem, in general). Fortunately, there is a much better way, given by the change of
variables theorem for expected value. This theorem is sometimes referred to as the law of the unconscious statistician, presumably
because it is so basic and natural that it is often used without the realization that it is a theorem, and not a definition.

If X has a discrete distribution on a countable set S with probability density function f, then

E[r(X)] = ∑_{x∈S} r(x) f(x) (4.1.6)

Proof

Figure 4.1.3 : The change of variables theorem when X has a discrete distribution.
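Computationally, the theorem says that E[r(X)] needs only the pdf of X, not the pdf of r(X). A minimal Python sketch (illustrative, not from the text):

def expected_value_of(r, pdf):
    # E[r(X)] = sum over x of r(x) f(x), per the change of variables theorem
    return sum(r(x) * p for x, p in pdf.items())

fair_die = {x: 1 / 6 for x in range(1, 7)}
print(expected_value_of(lambda x: x ** 2, fair_die))   # E(X^2) = 91/6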
The next result is the change of variables theorem when X has a continuous distribution. We will prove the continuous version in
stages, first when r has discrete range below and then in the next section in full generality. Even though the complete proof is
delayed, however, we will use the change of variables theorem in the proofs of many of the other properties of expected value.

Suppose that X has a continuous distribution on S ⊆ Rⁿ with probability density function f, and that r : S → R. Then

E[r(X)] = ∫_S r(x) f(x) dx (4.1.7)

Proof when r has discrete range

Figure 4.1.4 : The change of variables theorem when X has a continuous distribution and r has countable range.
The results below gives basic properties of expected value. These properties are true in general, but we will restrict the proofs
primarily to the continuous case. The proofs for the discrete case are analogous, with sums replacing integrals. The change of
variables theorem is the main tool we will need. In these theorems X and Y are real-valued random variables for an experiment
(that is, defined on an underlying probability space) and c is a constant. As usual, we assume that the indicated expected values
exist. Be sure to try the proofs yourself before reading the ones in the text.

Linearity
Our first property is the additive property.

E(X + Y ) = E(X) + E(Y )

Proof

Our next property is the scaling property.

E(cX) = c E(X)

Proof

Here is the linearity of expected value in full generality. It's a simple corollary of the previous two results.

Suppose that (X₁, X₂, …, Xₙ) is a sequence of real-valued random variables defined on the underlying probability space and that
(a₁, a₂, …, aₙ) is a sequence of constants. Then

E(∑_{i=1}^n aᵢXᵢ) = ∑_{i=1}^n aᵢE(Xᵢ) (4.1.11)

Thus, expected value is a linear operation on the collection of real-valued random variables for the experiment. The linearity of
expected value is so basic that it is important to understand this property on an intuitive level. Indeed, it is implied by the
interpretation of expected value given in the law of large numbers.

Suppose that (X₁, X₂, …, Xₙ) is a sequence of real-valued random variables with common mean μ.

1. Let Y = ∑_{i=1}^n Xᵢ, the sum of the variables. Then E(Y) = nμ.
2. Let M = (1/n) ∑_{i=1}^n Xᵢ, the average of the variables. Then E(M) = μ.

Proof

If the random variables in the previous result are also independent and identically distributed, then in statistical terms, the sequence
is a random sample of size n from the common distribution, and M is the sample mean.
In several important cases, a random variable from a special distribution can be decomposed into a sum of simpler random
variables, and then part (a) of the last theorem can be used to compute the expected value.

Inequalities
The following exercises give some basic inequalities for expected value. The first, known as the positive property is the most
obvious, but is also the main tool for proving the others.

Suppose that P(X ≥ 0) = 1 . Then


1. E(X) ≥ 0
2. If P(X > 0) > 0 then E(X) > 0 .

Proof

Next is the increasing property, perhaps the most important property of expected value, after linearity.

Suppose that P(X ≤ Y ) = 1 . Then


1. E(X) ≤ E(Y )
2. If P(X < Y ) > 0 then E(X) < E(Y ) .
Proof

Absolute value inequalities:


1. |E(X)| ≤ E (|X|)
2. If P(X > 0) > 0 and P(X < 0) > 0 then |E(X)| < E (|X|).
Proof

Only in Lake Woebegone are all of the children above average:

If P [X ≠ E(X)] > 0 then


1. P [X > E(X)] > 0
2. P [X < E(X)] > 0
Proof

Thus, if X is not a constant (with probability 1), then X must take values greater than its mean with positive probability and values
less than its mean with positive probability.

Symmetry
Again, suppose that X is a random variable taking values in R. The distribution of X is symmetric about a ∈ R if the distribution
of a − X is the same as the distribution of X − a .

Suppose that the distribution of X is symmetric about a ∈ R . If E(X) exists, then E(X) = a .
Proof

The previous result applies if X has a continuous distribution on R with a probability density f that is symmetric about a ; that is,
f (a + x) = f (a − x) for x ∈ R .

Independence
If X and Y are independent real-valued random variables then E(XY ) = E(X)E(Y ) .
Proof

It follows from the last result that independent random variables are uncorrelated (a concept that we will study in a later section).
Moreover, this result is more powerful than might first appear. Suppose that X and Y are independent random variables taking
values in general spaces S and T respectively, and that u : S → R and v : T → R . Then u(X) and v(Y ) are independent, real-
valued random variables and hence
E [u(X)v(Y )] = E [u(X)] E [v(Y )] (4.1.14)

Examples and Applications


As always, be sure to try the proofs and computations yourself before reading the proof and answers in the text.

Uniform Distributions
Discrete uniform distributions are widely used in combinatorial probability, and model a point chosen at random from a finite set.

Suppose that X has the discrete uniform distribution on a finite set S ⊆ R .


1. E(X) is the arithmetic average of the numbers in S .
2. If the points in S are evenly spaced with endpoints a, b, then E(X) = (a + b)/2, the average of the endpoints.

Proof

The previous results are easy to see if we think of E(X) as the center of mass, since the discrete uniform distribution corresponds
to a finite set of points with equal mass.

Open the special distribution simulator, and select the discrete uniform distribution. This is the uniform distribution on n
points, starting at a , evenly spaced at distance h . Vary the parameters and note the location of the mean in relation to the
probability density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical
mean to the distribution mean.

Next, recall that the continuous uniform distribution on a bounded interval corresponds to selecting a point at random from the
interval. Continuous uniform distributions arise in geometric probability and a variety of other applied problems.

Suppose that X has the continuous uniform distribution on an interval [a, b], where a, b ∈ R and a < b.

1. E(X) = (a + b)/2, the midpoint of the interval.
2. E(Xⁿ) = (aⁿ + aⁿ⁻¹b + ⋯ + abⁿ⁻¹ + bⁿ)/(n + 1) for n ∈ N.
Proof

Part (a) is easy to see if we think of the mean as the center of mass, since the uniform distribution corresponds to a uniform
distribution of mass on the interval.

Open the special distribution simulator, and select the continuous uniform distribution. This is the uniform distribution the
interval [a, a + w] . Vary the parameters and note the location of the mean in relation to the probability density function. For
selected values of the parameters, run the simulation 1000 times and compare the empirical mean to the distribution mean.

Next, the average value of a function on an interval, as defined in calculus, has a nice interpretation in terms of the uniform
distribution.

Suppose that X is uniformly distributed on the interval [a, b], and that g is an integrable function from [a, b] into R. Then
E[g(X)] is the average value of g on [a, b]:

E[g(X)] = (1/(b − a)) ∫_a^b g(x) dx (4.1.19)

Proof
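Both sides of this identity are easy to approximate numerically, one by a Riemann sum and one by simulation. A short Python sketch (illustrative, not from the text), using g(x) = x³ on [1, 2], whose average value is 15/4:

import random

def average_value(g, a, b, steps=100000):
    # (1/(b - a)) * integral of g over [a, b], via a midpoint Riemann sum
    h = (b - a) / steps
    return sum(g(a + (k + 0.5) * h) for k in range(steps)) * h / (b - a)

g = lambda x: x ** 3
print(average_value(g, 1.0, 2.0))                              # near 3.75
sample = [g(random.uniform(1.0, 2.0)) for _ in range(100000)]
print(sum(sample) / len(sample))                               # near E[g(X)] = 3.75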

Find the average value of the following functions on the given intervals:
1. f (x) = x on [2, 4]
2. g(x) = x² on [0, 1]

3. h(x) = sin(x) on [0, π].


Answer

The next exercise illustrates the value of the change of variables theorem in computing expected values.

Suppose that X is uniformly distributed on [−1, 3].


1. Give the probability density function of X.
2. Find the probability density function of X².
3. Find E(X²) using the probability density function in (b).
4. Find E(X²) using the change of variables theorem.

Answer

The discrete uniform distribution and the continuous uniform distribution are studied in more detail in the chapter on Special
Distributions.

Dice
Recall that a standard die is a six-sided die. A fair die is one in which the faces are equally likely. An ace-six flat die is a standard
die in which faces 1 and 6 have probability 1/4 each, and faces 2, 3, 4, and 5 have probability 1/8 each.

Two standard, fair dice are thrown, and the scores (X₁, X₂) recorded. Find the expected value of each of the following
variables.

1. Y = X₁ + X₂, the sum of the scores.
2. M = (X₁ + X₂)/2, the average of the scores.
3. Z = X₁X₂, the product of the scores.
4. U = min{X₁, X₂}, the minimum score.
5. V = max{X₁, X₂}, the maximum score.

Answer
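One way to check answers like these is brute-force enumeration: each of the 36 outcomes (x₁, x₂) is equally likely. A Python sketch (illustrative, not the text's solution method):

from fractions import Fraction
from itertools import product

def dice_mean(r):
    # E[r(X1, X2)] over the 36 equally likely outcomes of two fair dice
    return sum(Fraction(1, 36) * r(x1, x2) for x1, x2 in product(range(1, 7), repeat=2))

print(dice_mean(lambda a, b: a + b))   # E(Y) = 7
print(dice_mean(min))                  # E(U) = 91/36
print(dice_mean(max))                  # E(V) = 161/36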

In the dice experiment, select two fair die. Note the shape of the probability density function and the location of the mean for
the sum, minimum, and maximum variables. Run the experiment 1000 times and compare the sample mean and the
distribution mean for each of these variables.

Two standard, ace-six flat dice are thrown, and the scores (X₁, X₂) recorded. Find the expected value of each of the following
variables.

1. Y = X₁ + X₂, the sum of the scores.
2. M = (X₁ + X₂)/2, the average of the scores.
3. Z = X₁X₂, the product of the scores.
4. U = min{X₁, X₂}, the minimum score.
5. V = max{X₁, X₂}, the maximum score.

Answer

In the dice experiment, select two ace-six flat die. Note the shape of the probability density function and the location of the
mean for the sum, minimum, and maximum variables. Run the experiment 1000 times and compare the sample mean and the
distribution mean for each of these variables.

Bernoulli Trials
Recall that a Bernoulli trials process is a sequence X = (X₁, X₂, …) of independent, identically distributed indicator random
variables. In the usual language of reliability, Xᵢ denotes the outcome of trial i, where 1 denotes success and 0 denotes failure. The
probability of success p = P(Xᵢ = 1) ∈ [0, 1] is the basic parameter of the process. The process is named for Jacob Bernoulli. A
separate chapter on the Bernoulli Trials explores this process in detail.


For n ∈ N₊, the number of successes in the first n trials is Y = ∑_{i=1}^n Xᵢ. Recall that this random variable has the binomial
distribution with parameters n and p, and has probability density function f given by

f(y) = (n choose y) p^y (1 − p)^{n−y}, y ∈ {0, 1, …, n} (4.1.20)

If Y has the binomial distribution with parameters n and p then E(Y) = np.


Proof from the definition
Proof using the additive property

Note the superiority of the second proof to the first. The result also makes intuitive sense: in n trials with success probability p, we
expect np successes.
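The identity is also easy to verify numerically from the definition (an illustrative Python check, not from the text):

from math import comb

n, p = 10, 0.3
mean = sum(y * comb(n, y) * p ** y * (1 - p) ** (n - y) for y in range(n + 1))
print(mean, n * p)   # both 3.0, since E(Y) = np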

In the binomial coin experiment, vary n and p and note the shape of the probability density function and the location of the
mean. For selected values of n and p, run the experiment 1000 times and compare the sample mean to the distribution mean.

Suppose that p ∈ (0, 1], and let N denote the trial number of the first success. This random variable has the geometric distribution
on N₊ with parameter p, and has probability density function g given by

g(n) = p (1 − p)^{n−1}, n ∈ N₊ (4.1.23)

If N has the geometric distribution on N₊ with parameter p ∈ (0, 1] then E(N) = 1/p.

Proof

Again, the result makes intuitive sense. Since p is the probability of success, we expect a success to occur after 1/p trials.
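As a quick numerical sanity check (a sketch, with an arbitrary p and truncation point), the partial sums of the defining series \sum_{n \ge 1} n p (1-p)^{n-1} converge to 1/p:

```python
p = 0.2
# Partial sum of sum_{n>=1} n * p * (1 - p)**(n - 1), the series defining E(N)
partial = sum(n * p * (1 - p) ** (n - 1) for n in range(1, 500))
print(partial, 1 / p)  # both approximately 5.0
```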

In the negative binomial experiment, select k = 1 to get the geometric distribution. Vary p and note the shape of the
probability density function and the location of the mean. For selected values of p, run the experiment 1000 times and compare
the sample mean to the distribution mean.

The Hypergeometric Distribution


Suppose that a population consists of m objects; r of the objects are type 1 and m − r are type 0. A sample of n objects is chosen at random, without replacement. The parameters m, r, n ∈ N with r ≤ m and n ≤ m. Let X_i denote the type of the ith object selected. Recall that X = (X_1, X_2, …, X_n) is a sequence of identically distributed (but not independent) indicator random variables with P(X_i = 1) = r/m for each i ∈ {1, 2, …, n}.

Let Y denote the number of type 1 objects in the sample, so that Y = \sum_{i=1}^n X_i. Recall that Y has the hypergeometric distribution, which has probability density function f given by

f(y) = \frac{\binom{r}{y} \binom{m-r}{n-y}}{\binom{m}{n}}, \quad y \in \{0, 1, \ldots, n\} \tag{4.1.25}

If Y has the hypergeometric distribution with parameters m, n, and r then E(Y) = n \frac{r}{m}.
Proof from the definition
Proof using the additive property

In the ball and urn experiment, vary n , r, and m and note the shape of the probability density function and the location of the
mean. For selected values of the parameters, run the experiment 1000 times and compare the sample mean to the distribution
mean.

Note that if we select the objects with replacement, then X would be a sequence of Bernoulli trials, and hence Y would have the binomial distribution with parameters n and p = r/m. Thus, the mean would still be E(Y) = n \frac{r}{m}.
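The agreement of the two means is easy to see in simulation. Below is a minimal Python sketch (the population and sample sizes are arbitrary choices) that samples both without and with replacement and compares the empirical means:

```python
import random

m, r, n = 20, 8, 5                     # population size, number of type-1 objects, sample size
population = [1] * r + [0] * (m - r)

def type1_count(with_replacement):
    if with_replacement:               # binomial model
        return sum(random.choice(population) for _ in range(n))
    return sum(random.sample(population, n))  # hypergeometric model

for flag in (False, True):
    counts = [type1_count(flag) for _ in range(10_000)]
    print(sum(counts) / len(counts))   # both close to n*r/m = 2.0
```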

The Poisson Distribution


Recall that the Poisson distribution has probability density function f given by

f(n) = e^{-a} \frac{a^n}{n!}, \quad n \in N \tag{4.1.29}

where a ∈ (0, ∞) is a parameter. The Poisson distribution is named after Simeon Poisson and is widely used to model the number
of “random points” in a region of time or space; the parameter a is proportional to the size of the region. The Poisson distribution is
studied in detail in the chapter on the Poisson Process.

If N has the Poisson distribution with parameter a then E(N ) = a . Thus, the parameter of the Poisson distribution is the mean
of the distribution.
Proof

In the Poisson experiment, the parameter is a = rt . Vary the parameter and note the shape of the probability density function
and the location of the mean. For various values of the parameter, run the experiment 1000 times and compare the sample
mean to the distribution mean.

The Exponential Distribution
Recall that the exponential distribution is a continuous distribution with probability density function f given by

f(t) = r e^{-rt}, \quad t \in [0, \infty) \tag{4.1.31}

where r ∈ (0, ∞) is the rate parameter. This distribution is widely used to model failure times and other “arrival times”; in
particular, the distribution governs the time between arrivals in the Poisson model. The exponential distribution is studied in detail
in the chapter on the Poisson Process.

Suppose that T has the exponential distribution with rate parameter r. Then E(T ) = 1/r .
Proof

Recall that the mode of T is 0 and the median of T is ln 2/r. Note how these measures of center are ordered: 0 < ln 2/r < 1/r.

In the gamma experiment, set n = 1 to get the exponential distribution. This app simulates the first arrival in a Poisson
process. Vary r with the scroll bar and note the position of the mean relative to the graph of the probability density function.
For selected values of r, run the experiment 1000 times and compare the sample mean to the distribution mean.

Suppose again that T has the exponential distribution with rate parameter r and suppose that t > 0 . Find E(T ∣ T > t) .
Answer

The Gamma Distribution


Recall that the gamma distribution is a continuous distribution with probability density function f given by

f(t) = r^n \frac{t^{n-1}}{(n-1)!} e^{-rt}, \quad t \in [0, \infty) \tag{4.1.33}

where n ∈ N_+ is the shape parameter and r ∈ (0, ∞) is the rate parameter. This distribution is widely used to model failure times and other “arrival times”, and in particular, models the nth arrival in the Poisson process. Thus it follows that if (X_1, X_2, …, X_n) is a sequence of independent random variables, each having the exponential distribution with rate parameter r, then T = \sum_{i=1}^n X_i has the gamma distribution with shape parameter n and rate parameter r. The gamma distribution is studied in more generality, with non-integer shape parameters, in the chapter on Special Distributions.

Suppose that T has the gamma distribution with shape parameter n and rate parameter r. Then E(T ) = n/r .
Proof from the definition
Proof using the additive property

Note again how much easier and more intuitive the second proof is than the first.

Open the gamma experiment, which simulates the arrival times in the Poisson process. Vary the parameters and note the
position of the mean relative to the graph of the probability density function. For selected parameter values, run the experiment
1000 times and compare the sample mean to the distribution mean.

Beta Distributions
The distributions in this subsection belong to the family of beta distributions, which are widely used to model random proportions
and probabilities. The beta distribution is studied in detail in the chapter on Special Distributions.

Suppose that X has probability density function f given by f(x) = 3x^2 for x ∈ [0, 1].

1. Find the mean of X.


2. Find the mode of X.
3. Find the median of X.
4. Sketch the graph of f and show the location of the mean, median, and mode on the x-axis.
Answer

In the special distribution simulator, select the beta distribution and set a = 3 and b = 1 to get the distribution in the last
exercise. Run the experiment 1000 times and compare the sample mean to the distribution mean.

Suppose that a sphere has a random radius R with probability density function f given by f(r) = 12r^2(1 − r) for r ∈ [0, 1]. Find the expected value of each of the following:

1. The circumference C = 2πR
2. The surface area A = 4πR^2
3. The volume V = \frac{4}{3}πR^3
Answer

Suppose that X has probability density function f given by f(x) = \frac{1}{\pi\sqrt{x(1-x)}} for x ∈ (0, 1).

1. Find the mean of X.


2. Find the median of X.
3. Note that f is unbounded, so X does not have a mode.
4. Sketch the graph of f and show the location of the mean and median on the x-axis.
Answer

The particular beta distribution in the last exercise is also known as the (standard) arcsine distribution. It governs the last time that
the Brownian motion process hits 0 during the time interval [0, 1]. The arcsine distribution is studied in more generality in the
chapter on Special Distributions.

Open the Brownian motion experiment and select the last zero. Run the simulation 1000 times and compare the sample mean
to the distribution mean.

Suppose that the grades on a test are described by the random variable Y = 100X where X has the beta distribution with probability density function f given by f(x) = 12x(1 − x)^2 for x ∈ [0, 1]. The grades are generally low, so the teacher decides to “curve” the grades using the transformation Z = 10\sqrt{Y} = 100\sqrt{X}. Find the expected value of each of the following variables:
1. X
2. Y
3. Z
Answer

The Pareto Distribution


Recall that the Pareto distribution is a continuous distribution with probability density function f given by

f(x) = \frac{a}{x^{a+1}}, \quad x \in [1, \infty) \tag{4.1.36}

where a ∈ (0, ∞) is a parameter. The Pareto distribution is named for Vilfredo Pareto. It is a heavy-tailed distribution that is
widely used to model certain financial variables. The Pareto distribution is studied in detail in the chapter on Special Distributions.

Suppose that X has the Pareto distribution with shape parameter a. Then

1. E(X) = ∞ if 0 < a ≤ 1
2. E(X) = \frac{a}{a-1} if a > 1

Proof

The previous exercise gives us our first example of a distribution whose mean is infinite.
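A concrete way to see the infinite mean: a Pareto variable can be simulated by the inverse transform X = U^{-1/a}, where U is uniform on (0, 1). In the sketch below (run lengths are arbitrary choices), the running mean settles near a/(a − 1) = 2 when a = 2, but keeps drifting upward when a = 1:

```python
import random

def pareto_sample(a):
    # Inverse transform: if U ~ Uniform(0, 1), then U**(-1/a) has the Pareto(a) distribution
    # (1 - random() lies in (0, 1], avoiding a zero base)
    return (1.0 - random.random()) ** (-1.0 / a)

for a in (1, 2):
    total = 0.0
    for k in range(1, 100_001):
        total += pareto_sample(a)
        if k in (1_000, 10_000, 100_000):
            print(a, k, total / k)  # running mean: drifts for a = 1, settles near 2 for a = 2
```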

In the special distribution simulator, select the Pareto distribution. Note the shape of the probability density function and the
location of the mean. For the following values of the shape parameter a , run the experiment 1000 times and note the behavior
of the empirical mean.

1. a = 1
2. a = 2
3. a = 3

The Cauchy Distribution


Recall that the (standard) Cauchy distribution has probability density function f given by

f(x) = \frac{1}{\pi (1 + x^2)}, \quad x \in R \tag{4.1.39}

This distribution is named for Augustin Cauchy. The Cauchy distribution is studied in detail in the chapter on Special
Distributions.

If X has the Cauchy distribution then E(X) does not exist.


Proof

Note that the graph of f is symmetric about 0 and is unimodal. Thus, the mode and median of X are both 0. By the symmetry result, if X had a mean, the mean would be 0 also, but alas the mean does not exist. Moreover, the non-existence of the mean is not just a pedantic technicality. If we think of the probability distribution as a mass distribution, then the moment to the right of a is \int_a^\infty (x - a) f(x) \, dx = \infty and the moment to the left of a is \int_{-\infty}^a (x - a) f(x) \, dx = -\infty for every a ∈ R. The center of mass simply does not exist. Probabilistically, the law of large numbers fails, as you can see in the following simulation exercise:

In the Cauchy experiment (with the default parameter values), a light source is 1 unit from position 0 on an infinite straight wall. The angle that the light beam makes with the perpendicular is uniformly distributed on the interval (−π/2, π/2), so that the position of the light beam on the wall has the Cauchy distribution. Run the simulation 1000 times and note the behavior of the empirical mean.
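The light-beam construction is easy to reproduce in code: if Θ is uniform on (−π/2, π/2), then X = tan Θ has the standard Cauchy distribution. In the sketch below (the run length is an arbitrary choice), the empirical mean never settles down:

```python
import math
import random

def cauchy_sample():
    # Angle uniform on (-pi/2, pi/2); the beam position on the wall is its tangent
    theta = random.uniform(-math.pi / 2, math.pi / 2)
    return math.tan(theta)

total = 0.0
for k in range(1, 100_001):
    total += cauchy_sample()
    if k % 20_000 == 0:
        print(k, total / k)  # no convergence: the law of large numbers fails here
```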

The Normal Distribution


Recall that the standard normal distribution is a continuous distribution with density function ϕ given by

\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} z^2}, \quad z \in R \tag{4.1.41}

Normal distributions are widely used to model physical measurements subject to small, random errors and are studied in detail in
the chapter on Special Distributions.

If Z has the standard normal distribution then E(Z) = 0.


Proof

The standard normal distribution is unimodal and symmetric about 0. Thus, the median, mean, and mode all agree. More generally, for μ ∈ (−∞, ∞) and σ ∈ (0, ∞), recall that X = μ + σZ has the normal distribution with location parameter μ and scale parameter σ. X has probability density function f given by

f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right], \quad x \in R \tag{4.1.43}

The location parameter is the mean of the distribution:

If X has the normal distribution with location parameter μ ∈ R and scale parameter σ ∈ (0, ∞), then E(X) = μ
Proof

In the special distribution simulator, select the normal distribution. Vary the parameters and note the location of the mean. For
selected parameter values, run the simulation 1000 times and compare the sample mean to the distribution mean.

Additional Exercises
Suppose that (X, Y) has probability density function f given by f(x, y) = x + y for (x, y) ∈ [0, 1] × [0, 1]. Find the following expected values:

1. E(X)
2. E(X^2 Y)
3. E(X^2 + Y^2)
4. E(XY ∣ Y > X)
Answer

Suppose that N has a discrete distribution with probability density function f given by f(n) = \frac{1}{50} n^2 (5 − n) for n ∈ {1, 2, 3, 4}. Find each of the following:

1. The median of N.
2. The mode of N.
3. E(N).
4. E(N^2).
5. E(1/N).
6. E(1/N^2).
Answer

Suppose that X and Y are real-valued random variables with E(X) = 5 and E(Y ) = −2 . Find E(3X + 4Y − 7) .
Answer

Suppose that X and Y are real-valued, independent random variables, and that E(X) = 5 and E(Y ) = −2 . Find
E [(3X − 4)(2Y + 7)] .
Answer

Suppose that there are 5 duck hunters, each a perfect shot. A flock of 10 ducks fly over, and each hunter selects one duck at
random and shoots. Find the expected number of ducks killed.
Solution
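A quick Monte Carlo check of this exercise (a sketch, not the text's solution method): each of the 5 hunters independently picks one of the 10 ducks uniformly at random, and the number of ducks killed is the number of distinct ducks chosen. By the additive property, each duck escapes all 5 hunters with probability (9/10)^5, so the expected count is 10(1 − (9/10)^5) ≈ 4.095.

```python
import random

def ducks_killed():
    # Each of 5 hunters independently picks one of 10 ducks; distinct picks are killed
    return len({random.randrange(10) for _ in range(5)})

trials = [ducks_killed() for _ in range(100_000)]
print(sum(trials) / len(trials))  # about 10 * (1 - (9/10)**5) ≈ 4.095
```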

For a more complete analysis of the duck hunter problem, see The Number of Distinct Sample Values in the chapter on Finite
Sampling Models.

Consider the following game: An urn initially contains one red and one green ball. At each stage a ball is selected at random; if the ball is green, the game is over, and if the ball is red, the ball is returned to the urn, another red ball is added, and the game continues. Let X denote the length of the game (that is, the number of selections required to obtain a green ball). Find E(X).
Solution

This page titled 4.1: Definitions and Basic Properties is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

4.2: Additional Properties

In this section, we study some properties of expected value that are a bit more specialized than the basic properties considered in
the previous section. Nonetheless, the new results are also very important. They include two fundamental inequalities as well as
special formulas for the expected value of a nonnegative variable. As usual, unless otherwise noted, we assume that the referenced
expected values exist.

Basic Theory
Markov's Inequality
Our first result is known as Markov's inequality (named after Andrei Markov). It gives an upper bound for the tail probability of a
nonnegative random variable in terms of the expected value of the variable.

If X is a nonnegative random variable, then

P(X \ge x) \le \frac{E(X)}{x}, \quad x > 0 \tag{4.2.1}

Proof
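As a quick illustration (a sketch; the exponential case and the evaluation points are arbitrary choices), we can compare the Markov bound E(X)/x with the exact tail probability for an exponential variable with rate 1, for which E(X) = 1 and P(X ≥ x) = e^{−x}:

```python
import math

# Exponential with rate 1: E(X) = 1 and P(X >= x) = exp(-x)
for x in (0.5, 1.0, 2.0, 5.0):
    exact = math.exp(-x)
    markov_bound = 1 / x   # E(X)/x; note the bound is useless when x <= 1
    print(x, exact, markov_bound)
```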

The upper bound in Markov's inequality may be rather crude. In fact, it's quite possible that E(X)/x ≥ 1 , in which case the bound
is worthless. However, the real value of Markov's inequality lies in the fact that it holds with no assumptions whatsoever on the
distribution of X (other than that X be nonnegative). Also, as an example below shows, the inequality is tight in the sense that
equality can hold for a given x. Here is a simple corollary of Markov's inequality.

If X is a real-valued random variable and k ∈ (0, ∞) then

P(|X| \ge x) \le \frac{E(|X|^k)}{x^k}, \quad x > 0 \tag{4.2.2}

Proof

In this corollary of Markov's inequality, we could try to find k > 0 so that E(|X|^k)/x^k is minimized, thus giving the tightest bound on P(|X| ≥ x).

Right Distribution Function


Our next few results give alternative ways to compute the expected value of a nonnegative random variable by means of the right-tail distribution function. This function is also known as the reliability function if the variable represents the lifetime of a device.

If X is a nonnegative random variable then

E(X) = \int_0^\infty P(X > x) \, dx \tag{4.2.4}

Proof

Here is a slightly more general result:

If X is a nonnegative random variable and k ∈ (0, ∞) then

E(X^k) = \int_0^\infty k x^{k-1} P(X > x) \, dx \tag{4.2.6}

Proof

The following result is similar to the theorem above, but is specialized to nonnegative integer valued variables:

Suppose that N has a discrete distribution, taking values in N. Then

E(N) = \sum_{n=0}^\infty P(N > n) = \sum_{n=1}^\infty P(N \ge n) \tag{4.2.8}

Proof
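As a numerical sanity check (a sketch, with an arbitrary p and truncation point): for the geometric distribution, P(N ≥ n) = (1 − p)^{n−1}, and the tail sum does recover the mean 1/p:

```python
p = 0.25
# For the geometric distribution on N_+, P(N >= n) = (1 - p)**(n - 1)
tail_sum = sum((1 - p) ** (n - 1) for n in range(1, 400))
print(tail_sum, 1 / p)  # both approximately 4.0
```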

A General Definition
The special expected value formula for nonnegative variables can be used as the basis of a general formulation of expected value
that would work for discrete, continuous, or even mixed distributions, and would not require the assumption of the existence of
probability density functions. First, the special formula is taken as the definition of E(X) if X is nonnegative.

If X is a nonnegative random variable, define

E(X) = \int_0^\infty P(X > x) \, dx \tag{4.2.10}

Next, for x ∈ R, recall that the positive and negative parts of x are x^+ = max{x, 0} and x^- = max{0, −x}.

For x ∈ R,

1. x^+ ≥ 0, x^- ≥ 0
2. x = x^+ − x^-
3. |x| = x^+ + x^-

Now, if X is a real-valued random variable, then X^+ and X^-, the positive and negative parts of X, are nonnegative random variables, so their expected values are defined as above. The definition of E(X) is then natural, anticipating of course the linearity property.

If X is a real-valued random variable, define E(X) = E(X^+) − E(X^-), assuming that at least one of the expected values on the right is finite.

The usual formulas for expected value in terms of the probability density function, for discrete, continuous, or mixed distributions,
would now be proven as theorems. We will not go further in this direction, however, since the most complete and general definition
of expected value is given in the advanced section on expected value as an integral.

The Change of Variables Theorem


Suppose that X takes values in S and has probability density function f . Suppose also that r : S → R , so that r(X) is a real-
valued random variable. The change of variables theorem gives a formula for computing E [r(X)] without having to first find the
probability density function of r(X). If S is countable, so that X has a discrete distribution, then

E[r(X)] = \sum_{x \in S} r(x) f(x) \tag{4.2.11}

If S ⊆ R^n and X has a continuous distribution on S then

E[r(X)] = \int_S r(x) f(x) \, dx \tag{4.2.12}

In both cases, of course, we assume that the expected values exist. In the previous section on basic properties, we proved the
change of variables theorem when X has a discrete distribution and when X has a continuous distribution but r has countable
range. Now we can finally finish our proof in the continuous case.

Suppose that X has a continuous distribution on S with probability density function f, and r: S → R. Then

E[r(X)] = \int_S r(x) f(x) \, dx \tag{4.2.13}

Proof

Jensen's Inequality
Our next sequence of exercises will establish an important inequality known as Jensen's inequality, named for Johan Jensen. First
we need a definition.

A real-valued function g defined on an interval S ⊆ R is said to be convex (or concave upward) on S if for each t ∈ S , there
exist numbers a and b (that may depend on t ), such that
1. a + bt = g(t)
2. a + bx ≤ g(x) for all x ∈ S
The graph of x ↦ a + bx is called a supporting line for g at t .

Thus, a convex function has at least one supporting line at each point in the domain.

Figure 4.2.1 : A convex function and several supporting lines


You may be more familiar with convexity in terms of the following theorem from calculus: If g has a continuous, non-negative second derivative on S, then g is convex on S (since the tangent line at t is a supporting line at t for each t ∈ S). The next result is the single variable version of Jensen's inequality.

If X takes values in an interval S and g: S → R is convex on S, then

E[g(X)] \ge g[E(X)] \tag{4.2.17}

Proof
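A one-line Monte Carlo illustration (a sketch; the exponential distribution and the convex function g(x) = x^2 are arbitrary choices): the sample average of g(X) should dominate g applied to the sample average.

```python
import random

samples = [random.expovariate(1.0) for _ in range(100_000)]  # exponential with rate 1, E(X) = 1
g = lambda x: x * x                                          # a convex function
lhs = sum(g(x) for x in samples) / len(samples)              # estimates E[g(X)] = 2
rhs = g(sum(samples) / len(samples))                         # approximately g(E(X)) = 1
print(lhs, rhs)  # Jensen's inequality: lhs >= rhs
```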

Jensen's inequality extends easily to higher dimensions. The 2-dimensional version is particularly important, because it will be
used to derive several special inequalities in the section on vector spaces of random variables. We need two definitions.

A set S ⊆ R^n is convex if for every pair of points in S, the line segment connecting those points also lies in S. That is, if x, y ∈ S and p ∈ [0, 1] then px + (1 − p)y ∈ S.

Figure 4.2.2: A convex subset of R^2

Suppose that S ⊆ R^n is convex. A function g: S → R on S is convex (or concave upward) if for each t ∈ S, there exist a ∈ R and b ∈ R^n (depending on t) such that

1. a + b ⋅ t = g(t)
2. a + b ⋅ x ≤ g(x) for all x ∈ S

The graph of x ↦ a + b ⋅ x is called a supporting hyperplane for g at t.

In R^2 a supporting hyperplane is an ordinary plane. From calculus, if g has continuous second derivatives on S and has a positive semi-definite second derivative matrix, then g is convex on S. Suppose now that X = (X_1, X_2, …, X_n) takes values in S ⊆ R^n, and let E(X) = (E(X_1), E(X_2), …, E(X_n)). The following result is the general version of Jensen's inequality.

If S is convex and g : S → R is convex on S then

E [g(X)] ≥ g [E(X)] (4.2.19)

Proof

We will study the expected value of random vectors and matrices in more detail in a later section. In both the one and n -
dimensional cases, a function g : S → R is concave (or concave downward) if the inequality in the definition is reversed. Jensen's
inequality also reverses.

Expected Value in Terms of the Quantile Function


If X has a continuous distribution with support on an interval of R, then there is a simple (but not well known) formula for the expected value of X as the integral of the quantile function of X. Here is the general result:

Suppose that X has a continuous distribution with support on an interval (a, b) ⊆ R. Let F denote the cumulative distribution function of X, so that F^{-1} is the quantile function of X. If g: (a, b) → R then (assuming that the expected value exists),

E[g(X)] = \int_0^1 g[F^{-1}(p)] \, dp \tag{4.2.21}

Proof
So in particular, E(X) = \int_0^1 F^{-1}(p) \, dp.
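This is easy to verify numerically. The sketch below (the rate, grid size, and the midpoint rule are arbitrary choices) integrates the exponential quantile function F^{-1}(p) = −ln(1 − p)/r over (0, 1) and recovers the mean 1/r:

```python
import math

r = 2.0
quantile = lambda p: -math.log(1 - p) / r   # F^{-1} for the exponential distribution
k = 100_000                                 # midpoint rule with k subintervals of (0, 1)
integral = sum(quantile((i + 0.5) / k) for i in range(k)) / k
print(integral, 1 / r)                      # both approximately 0.5
```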

Examples and Applications


Let a ∈ (0, ∞) and let P(X = a) = 1 , so that X is a constant random variable. Show that Markov's inequality is in fact
equality at x = a .
Solution

The Exponential Distribution


Recall that the exponential distribution is a continuous distribution with probability density function f given by

f(t) = r e^{-rt}, \quad t \in [0, \infty) \tag{4.2.23}

where r ∈ (0, ∞) is the rate parameter. This distribution is widely used to model failure times and other “arrival times”; in
particular, the distribution governs the time between arrivals in the Poisson model. The exponential distribution is studied in detail
in the chapter on the Poisson Process.

Suppose that X has exponential distribution with rate parameter r.


1. Find E(X) using the right distribution formula.
2. Find E(X) using the quantile function formula.
3. Compute both sides of Markov's inequality.
Answer

Open the gamma experiment. Keep the default value of the stopping parameter (n = 1 ), which gives the exponential
distribution. Vary the rate parameter r and note the shape of the probability density function and the location of the mean. For
various values of the rate parameter, run the experiment 1000 times and compare the sample mean with the distribution mean.

The Geometric Distribution


Recall that Bernoulli trials are independent trials each with two outcomes, which in the language of reliability, are called success and failure. The probability of success on each trial is p ∈ [0, 1]. A separate chapter on Bernoulli Trials explores this random process in more detail. It is named for Jacob Bernoulli. If p ∈ (0, 1), the trial number N of the first success has the geometric distribution on N_+ with success parameter p. The probability density function f of N is given by

f(n) = p (1 - p)^{n-1}, \quad n \in N_+ \tag{4.2.24}

Suppose that N has the geometric distribution on N_+ with parameter p ∈ (0, 1).

1. Find E(N) using the right distribution function formula.
2. Compute both sides of Markov's inequality.
3. Find E(N ∣ N is even).
Answer

Open the negative binomial experiment. Keep the default value of the stopping parameter (k = 1 ), which gives the geometric
distribution. Vary the success parameter p and note the shape of the probability density function and the location of the mean.
For various values of the success parameter, run the experiment 1000 times and compare the sample mean with the distribution
mean.

The Pareto Distribution


Recall that the Pareto distribution is a continuous distribution with probability density function f given by

f(x) = \frac{a}{x^{a+1}}, \quad x \in [1, \infty) \tag{4.2.25}

where a ∈ (0, ∞) is a parameter. The Pareto distribution is named for Vilfredo Pareto. It is a heavy-tailed distribution that is
widely used to model certain financial variables. The Pareto distribution is studied in detail in the chapter on Special Distributions.

Suppose that X has the Pareto distribution with parameter a > 1 .


1. Find E(X) using the right distribution function formula.
2. Find E(X) using the quantile function formula.
3. Find E(1/X).
4. Show that x ↦ 1/x is convex on (0, ∞).
5. Verify Jensen's inequality by comparing E(1/X) and 1/E(X).
Answer

Open the special distribution simulator and select the Pareto distribution. Keep the default value of the scale parameter. Vary
the shape parameter and note the shape of the probability density function and the location of the mean. For various values of
the shape parameter, run the experiment 1000 times and compare the sample mean with the distribution mean.

A Bivariate Distribution
Suppose that (X, Y) has probability density function f given by f(x, y) = 2(x + y) for 0 ≤ x ≤ y ≤ 1.

1. Show that the domain of f is a convex set.
2. Show that (x, y) ↦ x^2 + y^2 is convex on the domain of f.
3. Compute E(X^2 + Y^2).
4. Compute [E(X)]^2 + [E(Y)]^2.
5. Verify Jensen's inequality by comparing the results of parts 3 and 4.


Answer

The Arithmetic and Geometric Means


Suppose that {x_1, x_2, …, x_n} is a set of positive numbers. The arithmetic mean is at least as large as the geometric mean:

\left( \prod_{i=1}^n x_i \right)^{1/n} \le \frac{1}{n} \sum_{i=1}^n x_i \tag{4.2.26}

Proof
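A quick numerical illustration (a sketch; the sample is an arbitrary set of random positive numbers):

```python
import math
import random

xs = [random.uniform(0.1, 10) for _ in range(20)]   # arbitrary positive numbers
geometric = math.prod(xs) ** (1 / len(xs))
arithmetic = sum(xs) / len(xs)
print(geometric, arithmetic)   # geometric mean <= arithmetic mean, always
```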

This page titled 4.2: Additional Properties is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

4.3: Variance

Recall the expected value of a real-valued random variable is the mean of the variable, and is a measure of the center of the
distribution. Recall also that by taking the expected value of various transformations of the variable, we can measure other
interesting characteristics of the distribution. In this section, we will study expected values that measure the spread of the
distribution about the mean.

Basic Theory
Definitions and Interpretations
As usual, we start with a random experiment modeled by a probability space (Ω, F , P). So to review, Ω is the set of outcomes, F
the collection of events, and P the probability measure on the sample space (Ω, F ). Suppose that X is a random variable for the
experiment, taking values in S ⊆ R . Recall that E(X), the expected value (or mean) of X gives the center of the distribution of X.

The variance and standard deviation of X are defined by

1. var(X) = E([X − E(X)]^2)
2. sd(X) = \sqrt{var(X)}

Implicit in the definition is the assumption that the mean E(X) exists, as a real number. If this is not the case, then var(X) (and hence also sd(X)) are undefined. Even if E(X) does exist as a real number, it's possible that var(X) = ∞. For the remainder of our discussion of the basic theory, we will assume that expected values that are mentioned exist as real numbers.

The variance and standard deviation of X are both measures of the spread of the distribution about the mean. Variance (as we will see) has nicer mathematical properties, but its physical unit is the square of that of X. Standard deviation, on the other hand, is not as nice mathematically, but has the advantage that its physical unit is the same as that of X. When the random variable X is understood, the standard deviation is often denoted by σ, so that the variance is σ^2.

Recall that the second moment of X about a ∈ R is E[(X − a)^2]. Thus, the variance is the second moment of X about the mean μ = E(X), or equivalently, the second central moment of X. In general, the second moment of X about a ∈ R can also be thought of as the mean square error if the constant a is used as an estimate of X. In addition, second moments have a nice interpretation in physics. If we think of the distribution of X as a mass distribution in R, then the second moment of X about a ∈ R is the moment of inertia of the mass distribution about a. This is a measure of the resistance of the mass distribution to any change in its rotational motion about a. In particular, the variance of X is the moment of inertia of the mass distribution about the center of mass μ.

Figure 4.3.1 : The moment of inertia about a .


The mean square error (or equivalently the moment of inertia) about a is minimized when a = μ:

Let mse(a) = E[(X − a)^2] for a ∈ R. Then mse is minimized when a = μ, and the minimum value is σ^2.

Proof

Figure 4.3.2 : mse(a) is minimized when a = μ .

The relationship between measures of center and measures of spread is studied in more detail in the advanced section on vector
spaces of random variables.

Properties
The following exercises give some basic properties of variance, which in turn rely on basic properties of expected value. As usual, be sure to try the proofs yourself before reading the ones in the text. Our first results are computational formulas based on the change of variables formula for expected value.

Let μ = E(X).

1. If X has a discrete distribution with probability density function f, then var(X) = \sum_{x \in S} (x − μ)^2 f(x).
2. If X has a continuous distribution with probability density function f, then var(X) = \int_S (x − μ)^2 f(x) \, dx.

Proof

Our next result is a variance formula that is usually better than the definition for computational purposes.

var(X) = E(X^2) − [E(X)]^2.
Proof

Of course, by the change of variables formula, E(X^2) = \sum_{x \in S} x^2 f(x) if X has a discrete distribution, and E(X^2) = \int_S x^2 f(x) \, dx if X has a continuous distribution. In both cases, f is the probability density function of X.
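For a concrete instance, here is a short sketch computing the mean and variance of a fair die score directly from its probability density function, using var(X) = E(X^2) − [E(X)]^2:

```python
# Fair die: f(x) = 1/6 on {1, ..., 6}
support = range(1, 7)
f = {x: 1 / 6 for x in support}
mean = sum(x * f[x] for x in support)         # E(X) = 3.5
second = sum(x * x * f[x] for x in support)   # E(X^2) = 91/6
print(mean, second - mean ** 2)               # variance = 35/12 ≈ 2.9167
```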

Variance is always nonnegative, since it's the expected value of a nonnegative random variable. Moreover, any random variable that
really is random (not a constant) will have strictly positive variance.

The nonnegative property.


1. var(X) ≥ 0
2. var(X) = 0 if and only if P(X = c) = 1 for some constant c (and then of course, E(X) = c ).
Proof

Our next result shows how the variance and standard deviation are changed by a linear transformation of the random variable. In particular, note that variance, unlike general expected value, is not a linear operation. This is not really surprising since the variance is the expected value of a nonlinear function of the variable: x ↦ (x − μ)^2.

If a, b ∈ R then

1. var(a + bX) = b^2 var(X)
2. sd(a + bX) = |b| sd(X)


Proof

Recall that when b > 0 , the linear transformation x ↦ a + bx is called a location-scale transformation and often corresponds to a
change of location and change of scale in the physical units. For example, the change from inches to centimeters in a measurement
of length is a scale transformation, and the change from Fahrenheit to Celsius in a measurement of temperature is both a location
and scale transformation. The previous result shows that when a location-scale transformation is applied to a random variable, the
standard deviation does not depend on the location parameter, but is multiplied by the scale factor. There is a particularly important
location-scale transformation.

Suppose that X is a random variable with mean μ and variance σ^2. The random variable Z defined as follows is the standard score of X:

Z = \frac{X - \mu}{\sigma} \tag{4.3.2}

1. E(Z) = 0
2. var(Z) = 1
Proof

Since X and its mean and standard deviation all have the same physical units, the standard score Z is dimensionless. It measures
the directed distance from E(X) to X in terms of standard deviations.

Let Z denote the standard score of X, and suppose that Y = a + bX where a, b ∈ R and b ≠ 0 .
1. If b > 0 , the standard score of Y is Z .
2. If b < 0 , the standard score of Y is −Z .
Proof

As just noted, when b > 0 , the variable Y = a + bX is a location-scale transformation and often corresponds to a change of
physical units. Since the standard score is dimensionless, it's reasonable that the standard scores of X and Y are the same. Here is
another standardized measure of dispersion:

Suppose that X is a random variable with E(X) ≠ 0. The coefficient of variation is the ratio of the standard deviation to the mean:

cv(X) = \frac{sd(X)}{E(X)} \tag{4.3.4}

The coefficient of variation is also dimensionless, and is sometimes used to compare variability for random variables with different
means. We will learn how to compute the variance of the sum of two random variables in the section on covariance.

Chebyshev's Inequality
Chebyshev's inequality (named after Pafnuty Chebyshev) gives an upper bound on the probability that a random variable will be
more than a specified distance from its mean. This is often useful in applied problems where the distribution is unknown, but the
mean and variance are known (at least approximately). In the following two results, suppose that X is a real-valued random
variable with mean μ = E(X) ∈ R and standard deviation σ = sd(X) ∈ (0, ∞) .

Chebyshev's inequality 1.

P(|X - \mu| \ge t) \le \frac{\sigma^2}{t^2}, \quad t > 0 \tag{4.3.5}

Proof

Figure 4.3.3 : Chebyshev's inequality


Here's an alternate version, with the distance in terms of standard deviation.

Chebyshev's inequality 2.

P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}, \quad k > 0 \tag{4.3.6}

Proof

The usefulness of the Chebyshev inequality comes from the fact that it holds for any distribution (assuming only that the mean and
variance exist). The tradeoff is that for many specific distributions, the Chebyshev bound is rather crude. Note in particular that the
first inequality is useless when t ≤ σ , and the second inequality is useless when k ≤ 1 , since 1 is an upper bound for the
probability of any event. On the other hand, it's easy to construct a distribution for which Chebyshev's inequality is sharp for a
specified value of t ∈ (0, ∞) . Such a distribution is given in an exercise below.
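To see how crude the bound can be for a specific distribution, the sketch below (a standard normal benchmark; the values of k are arbitrary choices) compares the exact two-sided tail P(|Z| ≥ k) = erfc(k/√2) with the Chebyshev bound 1/k^2:

```python
import math

for k in (1, 2, 3):
    exact = math.erfc(k / math.sqrt(2))  # P(|Z| >= k) for a standard normal Z
    bound = 1 / k ** 2                   # Chebyshev bound, valid for any distribution
    print(k, exact, bound)
```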

Examples and Applications
As always, be sure to try the problems yourself before looking at the solutions and answers.

Indicator Variables
Suppose that X is an indicator variable with p = P(X = 1) , where p ∈ [0, 1]. Then
1. E(X) = p
2. var(X) = p(1 − p)
Proof

The graph of var(X) as a function of p is a parabola, opening downward, with roots at 0 and 1. Thus the minimum value of var(X) is 0, and occurs when p = 0 and p = 1 (when X is deterministic, of course). The maximum value is 1/4 and occurs when p = 1/2.

Figure 4.3.4 : The variance of an indicator variable as a function of p.

Uniform Distributions
Discrete uniform distributions are widely used in combinatorial probability, and model a point chosen at random from a finite set.
The mean and variance have simple forms for the discrete uniform distribution on a set of evenly spaced points (sometimes referred
to as a discrete interval):

Suppose that X has the discrete uniform distribution on {a, a + h, …, a + (n − 1)h} where a ∈ R, h ∈ (0, ∞), and n ∈ N_+. Let b = a + (n − 1)h, the right endpoint. Then

1. E(X) = \frac{1}{2}(a + b).
2. var(X) = \frac{1}{12}(b − a)(b − a + 2h).

Proof

Note that mean is simply the average of the endpoints, while the variance depends only on difference between the endpoints and
the step size.

Open the special distribution simulator, and select the discrete uniform distribution. Vary the parameters and note the location
and size of the mean ± standard deviation bar in relation to the probability density function. For selected values of the
parameters, run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean and
standard deviation.

Next, recall that the continuous uniform distribution on a bounded interval corresponds to selecting a point at random from the
interval. Continuous uniform distributions arise in geometric probability and a variety of other applied problems.

Suppose that X has the continuous uniform distribution on the interval [a, b] where a, b ∈ R with a < b. Then

1. E(X) = \frac{1}{2}(a + b)
2. var(X) = \frac{1}{12}(b − a)^2

Proof

Note that the mean is the midpoint of the interval and the variance depends only on the length of the interval. Compare this with the
results in the discrete case.

Open the special distribution simulator, and select the continuous uniform distribution. This is the uniform distribution on the
interval [a, a + w] . Vary the parameters and note the location and size of the mean ± standard deviation bar in relation to the
probability density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical
mean and standard deviation to the distribution mean and standard deviation.

Dice
Recall that a fair die is one in which the faces are equally likely. In addition to fair dice, there are various types of crooked dice.
Here are three:
An ace-six flat die is a six-sided die in which faces 1 and 6 have probability 1/4 each while faces 2, 3, 4, and 5 have probability 1/8 each.
A two-five flat die is a six-sided die in which faces 2 and 5 have probability 1/4 each while faces 1, 3, 4, and 6 have probability 1/8 each.
A three-four flat die is a six-sided die in which faces 3 and 4 have probability 1/4 each while faces 1, 2, 5, and 6 have probability 1/8 each.

A flat die, as the name suggests, is a die that is not a cube, but rather is shorter in one of the three directions. The particular probabilities that we use (1/4 and 1/8) are fictitious, but the essential property of a flat die is that the opposite faces on the shorter axis have slightly larger probabilities than the other four faces. Flat dice are sometimes used by gamblers to cheat. In the following problems, you will compute the mean and variance for each of the various types of dice. Be sure to compare the results.

A standard, fair die is thrown and the score X is recorded. Sketch the graph of the probability density function and compute
each of the following:
1. E(X)
2. var(X)
Answer

An ace-six flat die is thrown and the score X is recorded. Sketch the graph of the probability density function and compute
each of the following:
1. E(X)
2. var(X)
Answer

A two-five flat die is thrown and the score X is recorded. Sketch the graph of the probability density function and compute
each of the following:
1. E(X)
2. var(X)
Answer

A three-four flat die is thrown and the score X is recorded. Sketch the graph of the probability density function and compute
each of the following:
1. E(X)
2. var(X)
Answer

In the dice experiment, select one die. For each of the following cases, note the location and size of the mean ± standard
deviation bar in relation to the probability density function. Run the experiment 1000 times and compare the empirical mean
and standard deviation to the distribution mean and standard deviation.
1. Fair die
2. Ace-six flat die
3. Two-five flat die
4. Three-four flat die

The Poisson Distribution
Recall that the Poisson distribution is a discrete distribution on N with probability density function f given by

f(n) = e^{-a} \frac{a^n}{n!}, \quad n \in N \tag{4.3.10}

where a ∈ (0, ∞) is a parameter. The Poisson distribution is named after Simeon Poisson and is widely used to model the number
of “random points” in a region of time or space; the parameter a is proportional to the size of the region. The Poisson distribution is
studied in detail in the chapter on the Poisson Process.

Suppose that N has the Poisson distribution with parameter a . Then


1. E(N ) = a
2. var(N ) = a
Proof

Thus, the parameter of the Poisson distribution is both the mean and the variance of the distribution.

In the Poisson experiment, the parameter is a = rt . Vary the parameter and note the size and location of the mean ± standard
deviation bar in relation to the probability density function. For selected values of the parameter, run the experiment 1000
times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.

The Geometric Distribution


Recall that Bernoulli trials are independent trials each with two outcomes, which in the language of reliability, are called success and failure. The probability of success on each trial is p ∈ [0, 1]. A separate chapter on Bernoulli Trials explores this random process in more detail. It is named for Jacob Bernoulli. If p ∈ (0, 1], the trial number N of the first success has the geometric distribution on N_+ with success parameter p. The probability density function f of N is given by

f(n) = p (1 - p)^{n-1}, \quad n \in N_+ \tag{4.3.13}

Suppose that N has the geometric distribution on N_+ with success parameter p ∈ (0, 1]. Then

1. E(N) = \frac{1}{p}
2. var(N) = \frac{1-p}{p^2}

Proof

Note that the variance is 0 when p = 1, not surprising since N is deterministic in this case.

In the negative binomial experiment, set k = 1 to get the geometric distribution. Vary p with the scroll bar and note the size
and location of the mean ± standard deviation bar in relation to the probability density function. For selected values of p, run
the experiment 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard
deviation.

Suppose that N has the geometric distribution with parameter p = 3/4. Compute the true value and the Chebyshev bound for the probability that N is at least 2 standard deviations away from the mean.


Answer

The Exponential Distribution


Recall that the exponential distribution is a continuous distribution on [0, ∞) with probability density function f given by

f(t) = r e^{-rt}, \quad t \in [0, \infty) \tag{4.3.16}

where r ∈ (0, ∞) is the rate parameter. This distribution is widely used to model failure times and other “arrival times”. The
exponential distribution is studied in detail in the chapter on the Poisson Process.

Suppose that T has the exponential distribution with rate parameter r. Then

1. E(T) = \frac{1}{r}
2. var(T) = \frac{1}{r^2}

Proof

Thus, for the exponential distribution, the mean and standard deviation are the same.

In the gamma experiment, set k = 1 to get the exponential distribution. Vary r with the scroll bar and note the size and
location of the mean ± standard deviation bar in relation to the probability density function. For selected values of r, run the
experiment 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard
deviation.

Suppose that X has the exponential distribution with rate parameter r > 0 . Compute the true value and the Chebyshev bound
for the probability that X is at least k standard deviations away from the mean.
Answer

The Pareto Distribution


Recall that the Pareto distribution is a continuous distribution on [1, ∞) with probability density function f given by

f(x) = \frac{a}{x^{a+1}}, \quad x \in [1, \infty) \tag{4.3.19}

where a ∈ (0, ∞) is a parameter. The Pareto distribution is named for Vilfredo Pareto. It is a heavy-tailed distribution that is
widely used to model financial variables such as income. The Pareto distribution is studied in detail in the chapter on Special
Distributions.

Suppose that X has the Pareto distribution with shape parameter a. Then

1. E(X) = ∞ if 0 < a ≤ 1 and E(X) = \frac{a}{a-1} if 1 < a < ∞
2. var(X) is undefined if 0 < a ≤ 1, var(X) = ∞ if 1 < a ≤ 2, and var(X) = \frac{a}{(a-1)^2 (a-2)} if 2 < a < ∞

Proof

In the special distribution simulator, select the Pareto distribution. Vary a with the scroll bar and note the size and location of
the mean ± standard deviation bar. For each of the following values of a , run the experiment 1000 times and note the behavior
of the empirical mean and standard deviation.
1. a = 1
2. a = 2
3. a = 3

The Normal Distribution


Recall that the standard normal distribution is a continuous distribution on R with probability density function ϕ given by

\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} z^2}, \quad z \in R \tag{4.3.22}

Normal distributions are widely used to model physical measurements subject to small, random errors and are studied in detail in
the chapter on Special Distributions.

Suppose that Z has the standard normal distribution. Then


1. E(Z) = 0
2. var(Z) = 1
Proof

More generally, for μ ∈ R and σ ∈ (0, ∞), recall that the normal distribution with location parameter μ and scale parameter σ is a continuous distribution on R with probability density function f given by

f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right], \quad x \in R \tag{4.3.25}

Moreover, if Z has the standard normal distribution, then X = μ + σZ has the normal distribution with location parameter μ and
scale parameter σ. As the notation suggests, the location parameter is the mean of the distribution and the scale parameter is the
standard deviation.

Suppose that X has the normal distribution with location parameter μ and scale parameter σ. Then

1. E(X) = μ
2. var(X) = σ^2

Proof

So to summarize, if X has a normal distribution, then its standard score Z has the standard normal distribution.

In the special distribution simulator, select the normal distribution. Vary the parameters and note the shape and location of the
mean ± standard deviation bar in relation to the probability density function. For selected parameter values, run the experiment
1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.

Beta Distributions
The distributions in this subsection belong to the family of beta distributions, which are widely used to model random proportions
and probabilities. The beta distribution is studied in detail in the chapter on Special Distributions.

Suppose that X has a beta distribution with probability density function f. In each case below, graph f and compute the mean and variance.

1. f(x) = 6x(1 − x) for x ∈ [0, 1]
2. f(x) = 12x^2(1 − x) for x ∈ [0, 1]
3. f(x) = 12x(1 − x)^2 for x ∈ [0, 1]

Answer

In the special distribution simulator, select the beta distribution. The parameter values below give the distributions in the
previous exercise. In each case, note the location and size of the mean ± standard deviation bar. Run the experiment 1000
times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.
1. a = 2 , b = 2
2. a = 3 , b = 2
3. a = 2 , b = 3

Suppose that a sphere has a random radius R with probability density function f given by f(r) = 12r^2(1 − r) for r ∈ [0, 1]. Find the mean and standard deviation of each of the following:

1. The circumference C = 2πR
2. The surface area A = 4πR^2
3. The volume V = \frac{4}{3}πR^3

Answer

Suppose that X has probability density function f given by f(x) = \frac{1}{\pi\sqrt{x(1-x)}} for x ∈ (0, 1). Find

1. E(X)
2. var(X)
Answer

The particular beta distribution in the last exercise is also known as the (standard) arcsine distribution. It governs the last time that
the Brownian motion process hits 0 during the time interval [0, 1]. The arcsine distribution is studied in more generality in the
chapter on Special Distributions.

Open the Brownian motion experiment and select the last zero. Note the location and size of the mean ± standard deviation bar
in relation to the probability density function. Run the simulation 1000 times and compare the empirical mean and standard
deviation to the distribution mean and standard deviation.

Suppose that the grades on a test are described by the random variable Y = 100X where X has the beta distribution with probability density function f given by f(x) = 12x(1 − x)^2 for x ∈ [0, 1]. The grades are generally low, so the teacher decides to “curve” the grades using the transformation Z = 10\sqrt{Y} = 100\sqrt{X}. Find the mean and standard deviation of each
of the following variables:
1. X
2. Y
3. Z
Answer

Exercises on Basic Properties


Suppose that X is a real-valued random variable with E(X) = 5 and var(X) = 4. Find each of the following:

1. var(3X − 2)
2. E(X^2)
Answer

Suppose that X is a real-valued random variable with E(X) = 2 and E[X(X − 1)] = 8. Find each of the following:

1. E(X^2)

2. var(X)
Answer

The expected value E [X(X − 1)] is an example of a factorial moment.

Suppose that X_1 and X_2 are independent, real-valued random variables with E(X_i) = μ_i and var(X_i) = σ_i^2 for i ∈ {1, 2}. Then

1. E(X_1 X_2) = μ_1 μ_2
2. var(X_1 X_2) = σ_1^2 σ_2^2 + σ_1^2 μ_2^2 + σ_2^2 μ_1^2

Proof

Marilyn Vos Savant has an IQ of 228. Assuming that the distribution of IQ scores has mean 100 and standard deviation 15, find
Marilyn's standard score.
Answer

Fix t ∈ (0, ∞). Suppose that X is the discrete random variable with probability density function defined by P(X = t) = P(X = −t) = p, P(X = 0) = 1 − 2p, where p ∈ (0, \frac{1}{2}). Then equality holds in Chebyshev's inequality at t.

Proof

This page titled 4.3: Variance is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services)
via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

4.4: Skewness and Kurtosis

As usual, our starting point is a random experiment, modeled by a probability space (Ω, F , P ). So to review, Ω is the set of
outcomes, F the collection of events, and P the probability measure on the sample space (Ω, F ). Suppose that X is a real-valued
random variable for the experiment. Recall that the mean of X is a measure of the center of the distribution of X. Furthermore, the
variance of X is the second moment of X about the mean, and measures the spread of the distribution of X about the mean. The
third and fourth moments of X about the mean also measure interesting (but more subtle) features of the distribution. The third
moment measures skewness, the lack of symmetry, while the fourth moment measures kurtosis, roughly a measure of the fatness in
the tails. The actual numerical measures of these characteristics are standardized to eliminate the physical units, by dividing by an
appropriate power of the standard deviation. As usual, we assume that all expected values given below exist, and we will let μ = E(X) and σ^2 = var(X). We assume that σ > 0, so that the random variable is really random.

Basic Theory
Skewness

The skewness of X is the third moment of the standard score of X:

skew(X) = E\left[\left(\frac{X - \mu}{\sigma}\right)^3\right] \tag{4.4.1}

The distribution of X is said to be positively skewed, negatively skewed, or unskewed depending on whether skew(X) is positive, negative, or 0.

In the unimodal case, if the distribution is positively skewed then the probability density function has a long tail to the right, and if
the distribution is negatively skewed then the probability density function has a long tail to the left. A symmetric distribution is
unskewed.

Suppose that the distribution of X is symmetric about a . Then


1. E(X) = a
2. skew(X) = 0 .
Proof

The converse is not true—a non-symmetric distribution can have skewness 0. Examples are given in Exercises (30) and (31) below.

skew(X) can be expressed in terms of the first three moments of X.

skew(X) = \frac{E(X^3) - 3\mu E(X^2) + 2\mu^3}{\sigma^3} = \frac{E(X^3) - 3\mu\sigma^2 - \mu^3}{\sigma^3} \tag{4.4.2}

Proof

Since skewness is defined in terms of an odd power of the standard score, it's invariant under a linear transformation with positive
slope (a location-scale transformation of the distribution). On the other hand, if the slope is negative, skewness changes sign.

Suppose that a ∈ R and b ∈ R ∖ {0} . Then


1. skew(a + bX) = skew(X) if b > 0
2. skew(a + bX) = −skew(X) if b < 0
Proof

Recall that location-scale transformations often arise when physical units are changed, such as inches to centimeters, or degrees
Fahrenheit to degrees Celsius.

Kurtosis
The kurtosis of X is the fourth moment of the standard score:

kurt(X) = E\left[\left(\frac{X - \mu}{\sigma}\right)^4\right] \tag{4.4.4}

Kurtosis comes from the Greek word for bulging. Kurtosis is always positive, since we have assumed that σ > 0 (the random
variable really is random), and therefore P(X ≠ μ) > 0 . In the unimodal case, the probability density function of a distribution
with large kurtosis has fatter tails, compared with the probability density function of a distribution with smaller kurtosis.

kurt(X) can be expressed in terms of the first four moments of X.

kurt(X) = \frac{E(X^4) - 4\mu E(X^3) + 6\mu^2 E(X^2) - 3\mu^4}{\sigma^4} = \frac{E(X^4) - 4\mu E(X^3) + 6\mu^2\sigma^2 + 3\mu^4}{\sigma^4} \tag{4.4.5}

Proof

Since kurtosis is defined in terms of an even power of the standard score, it's invariant under linear transformations.

Suppose that a ∈ R and b ∈ R ∖ {0} . Then kurt(a + bX) = kurt(X) .


Proof

We will show below that the kurtosis of the standard normal distribution is 3. Using the standard normal distribution as a
benchmark, the excess kurtosis of a random variable X is defined to be kurt(X) − 3 . Some authors use the term kurtosis to mean
what we have defined as excess kurtosis.

Computational Exercises
As always, be sure to try the exercises yourself before expanding the solutions and answers in the text.

Indicator Variables
Recall that an indicator random variable is one that just takes the values 0 and 1. Indicator variables are the building blocks of
many counting random variables. The corresponding distribution is known as the Bernoulli distribution, named for Jacob
Bernoulli.

Suppose that X is an indicator variable with P(X = 1) = p where p ∈ (0, 1). Then

1. E(X) = p
2. var(X) = p(1 − p)
3. skew(X) = \frac{1 - 2p}{\sqrt{p(1-p)}}
4. kurt(X) = \frac{1 - 3p + 3p^2}{p(1-p)}

Proof

Open the binomial coin experiment and set n = 1 to get an indicator variable. Vary p and note the change in the shape of the
probability density function.

Dice
Recall that a fair die is one in which the faces are equally likely. In addition to fair dice, there are various types of crooked dice.
Here are three:
An ace-six flat die is a six-sided die in which faces 1 and 6 have probability 1/4 each while faces 2, 3, 4, and 5 have probability 1/8 each.
A two-five flat die is a six-sided die in which faces 2 and 5 have probability 1/4 each while faces 1, 3, 4, and 6 have probability 1/8 each.
A three-four flat die is a six-sided die in which faces 3 and 4 have probability 1/4 each while faces 1, 2, 5, and 6 have probability 1/8 each.

A flat die, as the name suggests, is a die that is not a cube, but rather is shorter in one of the three directions. The particular probabilities that we use (1/4 and 1/8) are fictitious, but the essential property of a flat die is that the opposite faces on the shorter axis have slightly larger probabilities than the other four faces. Flat dice are sometimes used by gamblers to cheat.

A standard, fair die is thrown and the score X is recorded. Compute each of the following:
1. E(X)
2. var(X)
3. skew(X)
4. kurt(X)
Answer

An ace-six flat die is thrown and the score X is recorded. Compute each of the following:
1. E(X)
2. var(X)
3. skew(X)
4. kurt(X)
Answer

A two-five flat die is thrown and the score X is recorded. Compute each of the following:
1. E(X)
2. var(X)
3. skew(X)
4. kurt(X)
Answer

A three-four flat die is thrown and the score X is recorded. Compute each of the following:
1. E(X)
2. var(X)
3. skew(X)
4. kurt(X)
Answer

All four die distributions above have the same mean 7/2 and are symmetric (and hence have skewness 0), but differ in variance and kurtosis.
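All four sets of moments can be computed in a few lines directly from the probability density functions. A sketch (the helper below treats a die pdf as a dictionary mapping face to probability):

```python
def moments(pdf):
    """Mean, variance, skewness, and kurtosis of a discrete distribution given as {value: prob}."""
    mean = sum(x * p for x, p in pdf.items())
    var = sum((x - mean) ** 2 * p for x, p in pdf.items())
    sd = var ** 0.5
    skew = sum(((x - mean) / sd) ** 3 * p for x, p in pdf.items())
    kurt = sum(((x - mean) / sd) ** 4 * p for x, p in pdf.items())
    return mean, var, skew, kurt

fair = {x: 1 / 6 for x in range(1, 7)}
ace_six = {1: 1/4, 2: 1/8, 3: 1/8, 4: 1/8, 5: 1/8, 6: 1/4}
print(moments(fair))     # (3.5, 2.9166..., 0.0, 1.7314...)
print(moments(ace_six))  # same mean and zero skewness, different variance and kurtosis
```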

Open the dice experiment and set n = 1 to get a single die. Select each of the following, and note the shape of the probability
density function in comparison with the computational results above. In each case, run the experiment 1000 times and compare
the empirical density function to the probability density function.
1. fair
2. ace-six flat
3. two-five flat
4. three-four flat

Uniform Distributions
Recall that the continuous uniform distribution on a bounded interval corresponds to selecting a point at random from the interval.
Continuous uniform distributions arise in geometric probability and a variety of other applied problems.

Suppose that X has the uniform distribution on the interval [a, b], where a, b ∈ R and a < b. Then

1. E(X) = \frac{1}{2}(a + b)
2. var(X) = \frac{1}{12}(b − a)^2
3. skew(X) = 0
4. kurt(X) = \frac{9}{5}

Proof

Open the special distribution simulator, and select the continuous uniform distribution. Vary the parameters and note the shape
of the probability density function in comparison with the moment results in the last exercise. For selected values of the
parameter, run the simulation 1000 times and compare the empirical density function to the probability density function.

The Exponential Distribution


Recall that the exponential distribution is a continuous distribution on [0, ∞) with probability density function f given by
f(t) = r e^{−rt}, t ∈ [0, ∞) (4.4.7)
where r ∈ (0, ∞) is the rate parameter. This distribution is widely used to model failure times and other “arrival times”. The exponential distribution is studied in detail in the chapter on the Poisson Process.

Suppose that X has the exponential distribution with rate parameter r > 0. Then
1. E(X) = 1/r
2. var(X) = 1/r²
3. skew(X) = 2
4. kurt(X) = 9
Proof

Note that the skewness and kurtosis do not depend on the rate parameter r. That's because 1/r is a scale parameter for the exponential distribution.

Open the gamma experiment and set n = 1 to get the exponential distribution. Vary the rate parameter and note the shape of
the probability density function in comparison to the moment results in the last exercise. For selected values of the parameter,
run the experiment 1000 times and compare the empirical density function to the true probability density function.

Pareto Distribution
Recall that the Pareto distribution is a continuous distribution on [1, ∞) with probability density function f given by
f(x) = a/x^{a+1}, x ∈ [1, ∞) (4.4.8)

where a ∈ (0, ∞) is a parameter. The Pareto distribution is named for Vilfredo Pareto. It is a heavy-tailed distribution that is
widely used to model financial variables such as income. The Pareto distribution is studied in detail in the chapter on Special
Distributions.

Suppose that X has the Pareto distribution with shape parameter a > 0. Then
1. E(X) = a/(a − 1) if a > 1
2. var(X) = a/[(a − 1)²(a − 2)] if a > 2
3. skew(X) = [2(1 + a)/(a − 3)]√(1 − 2/a) if a > 3
4. kurt(X) = 3(a − 2)(3a² + a + 2)/[a(a − 3)(a − 4)] if a > 4

Proof
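One way to spot-check these formulas is against a statistics library. The sketch below assumes SciPy is available; SciPy's pareto with shape b = a has the same density as above, but note that SciPy reports excess kurtosis (kurtosis minus 3), so 3 is subtracted from the formula for comparison.

import math
from scipy.stats import pareto

a = 5.0   # shape parameter with a > 4, so all four quantities are finite
mean, var, skew, ex_kurt = pareto(b=a).stats(moments='mvsk')

print(float(mean), a / (a - 1))
print(float(var), a / ((a - 1) ** 2 * (a - 2)))
print(float(skew), (2 * (1 + a) / (a - 3)) * math.sqrt(1 - 2 / a))
# SciPy reports excess kurtosis, i.e. kurt(X) - 3
print(float(ex_kurt),
      3 * (a - 2) * (3 * a ** 2 + a + 2) / (a * (a - 3) * (a - 4)) - 3)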

Open the special distribution simulator and select the Pareto distribution. Vary the shape parameter and note the shape of the
probability density function in comparison to the moment results in the last exercise. For selected values of the parameter, run
the experiment 1000 times and compare the empirical density function to the true probability density function.

The Normal Distribution
Recall that the standard normal distribution is a continuous distribution on R with probability density function ϕ given by
ϕ(z) = (1/√(2π)) e^{−z²/2}, z ∈ R (4.4.9)

Normal distributions are widely used to model physical measurements subject to small, random errors and are studied in detail in
the chapter on Special Distributions.

Suppose that Z has the standard normal distribution. Then


1. E(Z) = 0
2. var(Z) = 1
3. skew(Z) = 0
4. kurt(Z) = 3
Proof

More generally, for μ ∈ R and σ ∈ (0, ∞), recall that the normal distribution with mean μ and standard deviation σ is a
continuous distribution on R with probability density function f given by
f(x) = (1/(√(2π) σ)) exp[−(1/2)((x − μ)/σ)²], x ∈ R (4.4.10)

However, we also know that μ and σ are location and scale parameters, respectively. That is, if Z has the standard normal
distribution then X = μ + σZ has the normal distribution with mean μ and standard deviation σ.

If X has the normal distribution with mean μ ∈ R and standard deviation σ ∈ (0, ∞), then
1. skew(X) = 0
2. kurt(X) = 3
Proof

Open the special distribution simulator and select the normal distribution. Vary the parameters and note the shape of the
probability density function in comparison to the moment results in the last exercise. For selected values of the parameters, run
the experiment 1000 times and compare the empirical density function to the true probability density function.

The Beta Distribution


The distributions in this subsection belong to the family of beta distributions, which are continuous distributions on [0, 1] widely
used to model random proportions and probabilities. The beta distribution is studied in detail in the chapter on Special
Distributions.

Suppose that X has probability density function f given by f (x) = 6x(1 − x) for x ∈ [0, 1]. Find each of the following:
1. E(X)
2. var(X)
3. skew(X)
4. kurt(X)
Answer

Suppose that X has probability density function f given by f(x) = 12x²(1 − x) for x ∈ [0, 1]. Find each of the following:
1. E(X)
2. var(X)
3. skew(X)
4. kurt(X)
Answer

Suppose that X has probability density function f given by f(x) = 12x(1 − x)² for x ∈ [0, 1]. Find each of the following:

1. E(X)
2. var(X)
3. skew(X)
4. kurt(X)
Answer

Open the special distribution simulator and select the beta distribution. Select the parameter values below to get the
distributions in the last three exercises. In each case, note the shape of the probability density function in relation to the
calculated moment results. Run the simulation 1000 times and compare the empirical density function to the probability
density function.
1. a = 2 , b = 2
2. a = 3 , b = 2
3. a = 2 , b = 3

Suppose that X has probability density function f given by f(x) = 1/(π√(x(1 − x))) for x ∈ (0, 1). Find

1. E(X)
2. var(X)
3. skew(X)
4. kurt(X)
Answer

The particular beta distribution in the last exercise is also known as the (standard) arcsine distribution. It governs the last time that
the Brownian motion process hits 0 during the time interval [0, 1]. The arcsine distribution is studied in more generality in the
chapter on Special Distributions.

Open the Brownian motion experiment and select the last zero. Note the shape of the probability density function in relation to
the moment results in the last exercise. Run the simulation 1000 times and compare the empirical density function to the
probability density function.

Counterexamples
The following exercise gives a simple example of a discrete distribution that is not symmetric but has skewness 0.

Suppose that X is a discrete random variable with probability density function f given by f(−3) = 1/10, f(−1) = 1/2, f(2) = 2/5. Find each of the following and then show that the distribution of X is not symmetric.

1. E(X)
2. var(X)
3. skew(X)
4. kurt(X)
Answer
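A short Python sketch confirms the point of the exercise: the skewness vanishes even though the distribution is plainly not symmetric.

import math

values, probs = [-3, -1, 2], [1/10, 1/2, 2/5]
mean = sum(v * q for v, q in zip(values, probs))                # 0.0
var = sum((v - mean) ** 2 * q for v, q in zip(values, probs))   # 3.0
skew = sum(((v - mean) / math.sqrt(var)) ** 3 * q for v, q in zip(values, probs))
print(mean, var, skew)   # skewness is 0, yet P(X = 2) = 2/5 while P(X = -2) = 0,
                         # so the distribution is not symmetric about its mean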

The following exercise gives a more complicated continuous distribution that is not symmetric but has skewness 0. It is one of a
collection of distributions constructed by Erik Meijer.

Suppose that U, V, and I are independent random variables, and that U is normally distributed with mean μ = −2 and variance σ² = 1, V is normally distributed with mean ν = 1 and variance τ² = 2, and I is an indicator variable with P(I = 1) = p = 1/3. Let X = IU + (1 − I)V. Find each of the following and then show that the distribution of X is not symmetric.
1. E(X)
2. var(X)
3. skew(X)

4. kurt(X)
Solution
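Since the distribution of X is a two-component normal mixture, the claim is easy to check by simulation. A hedged sketch, assuming the parameter values above (including p = 1/3):

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
I = rng.random(n) < 1/3                      # indicator with P(I = 1) = 1/3
U = rng.normal(-2, 1, size=n)                # mean -2, standard deviation 1
V = rng.normal(1, np.sqrt(2), size=n)        # mean 1, variance 2
X = np.where(I, U, V)

Z = (X - X.mean()) / X.std()
print(X.mean(), (Z ** 3).mean())   # both approximately 0
# The density has two bumps of different shapes, so it is clearly
# not symmetric even though the skewness vanishes.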

This page titled 4.4: Skewness and Kurtosis is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

4.5: Covariance and Correlation

Recall that by taking the expected value of various transformations of a random variable, we can measure many interesting
characteristics of the distribution of the variable. In this section, we will study an expected value that measures a special type of
relationship between two real-valued variables. This relationship is very important both in probability and statistics.

Basic Theory
Definitions
As usual, our starting point is a random experiment modeled by a probability space (Ω, F , P). Unless otherwise noted, we assume
that all expected values mentioned in this section exist. Suppose now that X and Y are real-valued random variables for the
experiment (that is, defined on the probability space) with means E(X), E(Y ) and variances var(X), var(Y ), respectively.

The covariance of (X, Y ) is defined by


cov(X, Y ) = E ([X − E(X)] [Y − E(Y )]) (4.5.1)

and, assuming the variances are positive, the correlation of (X, Y ) is defined by
cor(X, Y) = cov(X, Y)/[sd(X) sd(Y)] (4.5.2)

1. If cov(X, Y ) > 0 then X and Y are positively correlated.


2. If cov(X, Y ) < 0 then X and Y are negatively correlated.
3. If cov(X, Y ) = 0 then X and Y are uncorrelated.

Correlation is a scaled version of covariance; note that the two parameters always have the same sign (positive, negative, or 0).
Note also that correlation is dimensionless, since the numerator and denominator have the same physical units, namely the product
of the units of X and Y .
As these terms suggest, covariance and correlation measure a certain kind of dependence between the variables. One of our goals is
a deeper understanding of this dependence. As a start, note that (E(X), E(Y )) is the center of the joint distribution of (X, Y ), and
the vertical and horizontal lines through this point separate R² into four quadrants. The function (x, y) ↦ [x − E(X)][y − E(Y)] is positive on the first and third quadrants and negative on the second and fourth.

Figure 4.5.1: A joint distribution with (E(X), E(Y)) as the center of mass

Properties of Covariance
The following theorems give some basic properties of covariance. The main tool that we will need is the fact that expected value is
a linear operation. Other important properties will be derived below, in the subsection on the best linear predictor. As usual, be sure
to try the proofs yourself before reading the ones in the text. Once again, we assume that the random variables are defined on the
common sample space, are real-valued, and that the indicated expected values exist (as real numbers).
Our first result is a formula that is better than the definition for computational purposes, but gives less insight.

cov(X, Y ) = E(XY ) − E(X)E(Y ) .
Proof
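The identity is easy to illustrate numerically: for sample averages, the defining formula and the computational formula agree exactly. A minimal Python sketch (the simulated variables are our own example):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=100_000)
Y = 2 * X + rng.normal(size=100_000)    # cov(X, Y) = 2 by construction

# Definition vs. computational formula (identical for sample averages)
lhs = np.mean((X - X.mean()) * (Y - Y.mean()))
rhs = np.mean(X * Y) - X.mean() * Y.mean()
print(lhs, rhs)   # equal, both near 2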

From (2), we see that X and Y are uncorrelated if and only if E(XY ) = E(X)E(Y ) , so here is a simple but important corollary:

If X and Y are independent, then they are uncorrelated.


Proof

However, the converse fails with a passion: Exercise (31) gives an example of two variables that are functionally related (the
strongest form of dependence), yet uncorrelated. The computational exercises give other examples of dependent yet uncorrelated
variables also. Note also that if one of the variables has mean 0, then the covariance is simply the expected product.
Trivially, covariance is a symmetric operation.

cov(X, Y ) = cov(Y , X) .

As the name suggests, covariance generalizes variance.

cov(X, X) = var(X) .
Proof

Covariance is a linear operation in the first argument, if the second argument is fixed.

If X, Y , Z are random variables, and c is a constant, then


1. cov(X + Y , Z) = cov(X, Z) + cov(Y , Z)
2. cov(cX, Y ) = c cov(X, Y )
Proof

By symmetry, covariance is also a linear operation in the second argument, with the first argument fixed. Thus, the covariance
operator is bi-linear. The general version of this property is given in the following theorem.

Suppose that (X1 , X2 , … , Xn ) and (Y1 , Y2 , … , Ym ) are sequences of random variables, and that (a1 , a2 , … , an ) and
(b1 , b2 , … , bm ) are constants. Then
cov(∑_{i=1}^n ai Xi, ∑_{j=1}^m bj Yj) = ∑_{i=1}^n ∑_{j=1}^m ai bj cov(Xi, Yj) (4.5.7)

The following result shows how covariance is changed under a linear transformation of one of the variables. This is simply a
special case of the basic properties, but is worth stating.

If a, b ∈ R then cov(a + bX, Y ) = b cov(X, Y ) .


Proof

Of course, by symmetry, the same property holds in the second argument. Putting the two together we have that if a, b, c, d ∈ R

then cov(a + bX, c + dY ) = bd cov(X, Y ) .

Properties of Correlation
Next we will establish some basic properties of correlation. Most of these follow easily from corresponding properties of
covariance above. We assume that var(X) > 0 and var(Y) > 0, so that the random variables really are random and hence the
correlation is well defined.

The correlation between X and Y is the covariance of the corresponding standard scores:
cor(X, Y) = cov((X − E(X))/sd(X), (Y − E(Y))/sd(Y)) = E([(X − E(X))/sd(X)] [(Y − E(Y))/sd(Y)]) (4.5.8)

Proof

This shows again that correlation is dimensionless, since of course, the standard scores are dimensionless. Also, correlation is
symmetric:

cor(X, Y ) = cor(Y , X) .

Under a linear transformation of one of the variables, the correlation is unchanged if the slope is positive and changes sign if the
slope is negative:

If a, b ∈ R and b ≠ 0 then
1. cor(a + bX, Y ) = cor(X, Y ) if b > 0
2. cor(a + bX, Y ) = −cor(X, Y ) if b < 0
Proof

This result reinforces the fact that correlation is a standardized measure of association, since multiplying the variable by a positive
constant is equivalent to a change of scale, and adding a constant to a variable is equivalent to a change of location. For example, in
the Challenger data, the underlying variables are temperature at the time of launch (in degrees Fahrenheit) and O-ring erosion (in
millimeters). The correlation between these two variables is of fundamental importance. If we decide to measure temperature in
degrees Celsius and O-ring erosion in inches, the correlation is unchanged. Of course, the same property holds in the second
argument, so if a, b, c, d ∈ R with b ≠ 0 and d ≠ 0 , then cor(a + bX, c + dY ) = cor(X, Y ) if bd > 0 and
cor(a + bX, c + dY ) = −cor(X, Y ) if bd < 0 .

The most important properties of covariance and correlation will emerge from our study of the best linear predictor below.

The Variance of a Sum


We will now show that the variance of a sum of variables is the sum of the pairwise covariances. This result is very useful since
many random variables with special distributions can be written as sums of simpler random variables (see in particular the binomial
distribution and hypergeometric distribution below).

If (X1, X2, …, Xn) is a sequence of real-valued random variables then
var(∑_{i=1}^n Xi) = ∑_{i=1}^n ∑_{j=1}^n cov(Xi, Xj) = ∑_{i=1}^n var(Xi) + 2 ∑_{(i,j): i<j} cov(Xi, Xj) (4.5.10)

Proof

Note that the variance of a sum can be larger, smaller, or equal to the sum of the variances, depending on the pure covariance terms.
As a special case of (12), when n = 2 , we have
var(X + Y ) = var(X) + var(Y ) + 2 cov(X, Y ) (4.5.12)

The following corollary is very important.

If (X1, X2, …, Xn) is a sequence of pairwise uncorrelated, real-valued random variables then
var(∑_{i=1}^n Xi) = ∑_{i=1}^n var(Xi) (4.5.13)

Proof

Note that the last result holds, in particular, if the random variables are independent. We close this discussion with a couple of
minor corollaries.

If X and Y are real-valued random variables then var(X + Y ) + var(X − Y ) = 2 [var(X) + var(Y )] .
Proof

If X and Y are real-valued random variables with var(X) = var(Y ) then X + Y and X − Y are uncorrelated.
Proof
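Both of these identities, and the variance-of-a-sum formula above, hold exactly for sample moments, so they are easy to illustrate on simulated data. A minimal Python sketch:

import numpy as np

rng = np.random.default_rng(2)
X = rng.exponential(size=200_000)
Y = 0.5 * X + rng.normal(size=200_000)

def var(a):
    # population-style sample variance (divide by n)
    return np.mean((a - a.mean()) ** 2)

cov = np.mean((X - X.mean()) * (Y - Y.mean()))
print(var(X + Y), var(X) + var(Y) + 2 * cov)            # equal
print(var(X + Y) + var(X - Y), 2 * (var(X) + var(Y)))   # equal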

Random Samples
In the following exercises, suppose that (X1, X2, …) is a sequence of independent, real-valued random variables with a common distribution that has mean μ and standard deviation σ > 0. In statistical terms, the variables form a random sample from the common distribution.

For n ∈ N+, let Yn = ∑_{i=1}^n Xi. Then
1. E(Yn) = nμ
2. var(Yn) = nσ²

Proof

For n ∈ N+, let Mn = Yn/n = (1/n) ∑_{i=1}^n Xi, so that Mn is the sample mean of (X1, X2, …, Xn). Then
1. E(Mn) = μ
2. var(Mn) = σ²/n
3. var(Mn) → 0 as n → ∞
4. P(|Mn − μ| > ϵ) → 0 as n → ∞ for every ϵ > 0.

Proof

Part (c) of (17) means that Mn → μ as n → ∞ in mean square. Part (d) means that Mn → μ as n → ∞ in probability. These are both versions of the weak law of large numbers, one of the fundamental theorems of probability.
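The weak law is easy to see by simulation. A minimal sketch, with an arbitrary choice of distribution (normal with μ = 2, σ = 3):

import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 2.0, 3.0
for n in [10, 100, 10_000, 1_000_000]:
    M = rng.normal(mu, sigma, size=n).mean()   # sample mean of n observations
    print(n, M)   # concentrates near mu = 2 as n grows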

The standard score of the sum Yn and the standard score of the sample mean Mn are the same:
Zn = (Yn − nμ)/(√n σ) = (Mn − μ)/(σ/√n) (4.5.16)
1. E(Zn) = 0
2. var(Zn) = 1

Proof

The central limit theorem, the other fundamental theorem of probability, states that the distribution of Zn converges to the standard normal distribution as n → ∞.

Events
If A and B are events in our random experiment then the covariance and correlation of A and B are defined to be the covariance
and correlation, respectively, of their indicator random variables.

If A and B are events, define cov(A, B) = cov(1_A, 1_B) and cor(A, B) = cor(1_A, 1_B). Equivalently,
1. cov(A, B) = P(A ∩ B) − P(A)P(B)
2. cor(A, B) = [P(A ∩ B) − P(A)P(B)]/√(P(A)[1 − P(A)] P(B)[1 − P(B)])
Proof

In particular, note that A and B are positively correlated, negatively correlated, or independent, respectively (as defined in the
section on conditional probability) if and only if the indicator variables of A and B are positively correlated, negatively correlated,
or uncorrelated, as defined in this section.

If A and B are events then
1. cov(A, B^c) = −cov(A, B)
2. cov(A^c, B^c) = cov(A, B)

Proof

If A and B are events with A ⊆ B then
1. cov(A, B) = P(A)[1 − P(B)]
2. cor(A, B) = √(P(A)[1 − P(B)]/(P(B)[1 − P(A)]))

Proof

In the language of the experiment, A ⊆ B means that A implies B. In such a case, the events are positively correlated, which is not surprising.

The Best Linear Predictor


What linear function of X (that is, a function of the form a + bX where a, b ∈ R ) is closest to Y in the sense of minimizing mean
square error? The question is fundamentally important in the case where random variable X (the predictor variable) is observable
and random variable Y (the response variable) is not. The linear function can be used to estimate Y from an observed value of X.
Moreover, the solution will have the added benefit of showing that covariance and correlation measure the linear relationship
between X and Y . To avoid trivial cases, let us assume that var(X) > 0 and var(Y ) > 0 , so that the random variables really are
random. The solution to our problem turns out to be the linear function of X with the same expected value as Y , and whose
covariance with X is the same as that of Y .

The random variable L(Y ∣ X) defined as follows is the only linear function of X satisfying properties (a) and (b).
L(Y ∣ X) = E(Y) + [cov(X, Y)/var(X)] [X − E(X)] (4.5.17)
1. E[L(Y ∣ X)] = E(Y)
2. cov[X, L(Y ∣ X)] = cov(X, Y)
Proof

Note that in the presence of part (a), part (b) is equivalent to E [XL(Y ∣ X)] = E(XY ) . Here is another minor variation, but one
that will be very useful: L(Y ∣ X) is the only linear function of X with the same mean as Y and with the property that
Y − L(Y ∣ X) is uncorrelated with every linear function of X.

L(Y ∣ X) is the only linear function of X that satisfies


1. E [L(Y ∣ X)] = E(Y )
2. cov [Y − L(Y ∣ X), U ] = 0 for every linear function U of X.
Proof

The variance of L(Y ∣ X) and its covariance with Y turn out to be the same.

Additional properties of L(Y ∣ X):
1. var[L(Y ∣ X)] = cov²(X, Y)/var(X)
2. cov[L(Y ∣ X), Y] = cov²(X, Y)/var(X)

Proof

We can now prove the fundamental result that L(Y ∣ X) is the linear function of X that is closest to Y in the mean square sense.
We give two proofs; the first is more straightforward, but the second is more interesting and elegant.

Suppose that U is a linear function of X. Then
1. E([Y − L(Y ∣ X)]²) ≤ E[(Y − U)²]
2. Equality occurs in (a) if and only if U = L(Y ∣ X) with probability 1.


Proof from calculus
Proof using properties

The mean square error when L(Y ∣ X) is used as a predictor of Y is
E([Y − L(Y ∣ X)]²) = var(Y)[1 − cor²(X, Y)] (4.5.30)

Proof
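On data, the sample analogue of L(Y ∣ X) is the ordinary least squares line, so the slope cov(X, Y)/var(X) can be checked against a least-squares fit. A hedged Python sketch with simulated data (the true line 3 − 2x is our own example):

import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, size=50_000)
Y = 3 - 2 * X + rng.normal(scale=0.5, size=50_000)

# Sample analogue of L(Y | X) = E(Y) + [cov(X, Y)/var(X)] (X - E(X))
cov = np.mean((X - X.mean()) * (Y - Y.mean()))
slope = cov / np.mean((X - X.mean()) ** 2)
intercept = Y.mean() - slope * X.mean()
print(slope, intercept)      # close to -2 and 3

# Ordinary least squares recovers the same line
print(np.polyfit(X, Y, 1))   # [slope, intercept]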

Our solution to the best linear predictor problem yields important properties of covariance and correlation.

Additional properties of covariance and correlation:
1. −1 ≤ cor(X, Y ) ≤ 1
2. −sd(X)sd(Y ) ≤ cov(X, Y ) ≤ sd(X)sd(Y )
3. cor(X, Y ) = 1 if and only if, with probability 1, Y is a linear function of X with positive slope.
4. cor(X, Y ) = −1 if and only if, with probability 1, Y is a linear function of X with negative slope.
Proof

The last two results clearly show that cov(X, Y ) and cor(X, Y ) measure the linear association between X and Y . The equivalent
inequalities (a) and (b) above are referred to as the correlation inequality. They are also versions of the Cauchy-Schwarz inequality,
named for Augustin Cauchy and Karl Schwarz.
Recall from our previous discussion of variance that the best constant predictor of Y , in the sense of minimizing mean square error,
is E(Y ) and the minimum value of the mean square error for this predictor is var(Y ). Thus, the difference between the variance of
Y and the mean square error above for L(Y ∣ X) is the reduction in the variance of Y when the linear term in X is added to the

predictor:
var(Y) − E([Y − L(Y ∣ X)]²) = var(Y) cor²(X, Y) (4.5.33)

Thus cor²(X, Y) is the proportion of reduction in var(Y) when X is included as a predictor variable. This quantity is called the (distribution) coefficient of determination. Now let
L(Y ∣ X = x) = E(Y) + [cov(X, Y)/var(X)] [x − E(X)], x ∈ R (4.5.34)

The function x ↦ L(Y ∣ X = x) is known as the distribution regression function for Y given X, and its graph is known as the
distribution regression line. Note that the regression line passes through (E(X), E(Y )), the center of the joint distribution.

Figure 4.5.2: The distribution regression line


However, the choice of predictor variable and response variable is crucial.

The regression line for Y given X and the regression line for X given Y are not the same line, except in the trivial case where
the variables are perfectly correlated. However, the coefficient of determination is the same, regardless of which variable is the
predictor and which is the response.
Proof

Suppose that A and B are events with 0 < P(A) < 1 and 0 < P(B) < 1 . Then
1. cor(A, B) = 1 if and only if P(A ∖ B) + P(B ∖ A) = 0. (That is, A and B are equivalent events.)
2. cor(A, B) = −1 if and only if P(A ∖ B^c) + P(B^c ∖ A) = 0. (That is, A and B^c are equivalent events.)

Proof

The concept of best linear predictor is more powerful than might first appear, because it can be applied to transformations of the
variables. Specifically, suppose that X and Y are random variables for our experiment, taking values in general spaces S and T ,
respectively. Suppose also that g and h are real-valued functions defined on S and T , respectively. We can find L [h(Y ) ∣ g(X)] ,
the linear function of g(X) that is closest to h(Y ) in the mean square sense. The results of this subsection apply, of course, with

g(X) replacing X and h(Y ) replacing Y . Of course, we must be able to compute the appropriate means, variances, and
covariances.
We close this subsection with two additional properties of the best linear predictor, the linearity properties.

Suppose that X, Y , and Z are random variables and that c is a constant. Then
1. L(Y + Z ∣ X) = L(Y ∣ X) + L(Z ∣ X)
2. L(cY ∣ X) = cL(Y ∣ X)
Proof from the definitions
Proof by characterizing properties

There are several extensions and generalizations of the ideas in the subsection:
The corresponding statistical problem of estimating a and b , when these distribution parameters are unknown, is considered in
the section on Sample Covariance and Correlation.
The problem of finding the function of X that is closest to Y in the mean square error sense (using all reasonable functions, not
just linear functions) is considered in the section on Conditional Expected Value.
The best linear prediction problem when the predictor and response variables are random vectors is considered in the section on
Expected Value and Covariance Matrices.
The use of characterizing properties will play a crucial role in these extensions.

Examples and Applications


Uniform Distributions
Suppose that X is uniformly distributed on the interval [−1, 1] and Y = X². Then X and Y are uncorrelated even though Y is a function of X (the strongest form of dependence).
Proof

Suppose that (X, Y) is uniformly distributed on the region S ⊆ R². Find cov(X, Y) and cor(X, Y) and determine whether the variables are independent in each of the following cases:
1. S = [a, b] × [c, d] where a < b and c < d, so S is a rectangle.
2. S = {(x, y) ∈ R²: −a ≤ y ≤ x ≤ a} where a > 0, so S is a triangle.
3. S = {(x, y) ∈ R²: x² + y² ≤ r²} where r > 0, so S is a circle.

Answer

In the bivariate uniform experiment, select each of the regions below in turn. For each region, run the simulation 2000 times
and note the value of the correlation and the shape of the cloud of points in the scatterplot. Compare with the results in the last
exercise.
1. Square
2. Triangle
3. Circle

Suppose that X is uniformly distributed on the interval (0, 1) and that given X = x ∈ (0, 1) , Y is uniformly distributed on the
interval (0, x). Find each of the following:
1. cov(X, Y )
2. cor(X, Y )
3. L(Y ∣ X)
4. L(X ∣ Y )
Answer

Dice
Recall that a standard die is a six-sided die. A fair die is one in which the faces are equally likely. An ace-six flat die is a standard die in which faces 1 and 6 have probability 1/4 each, and faces 2, 3, 4, and 5 have probability 1/8 each.

A pair of standard, fair dice are thrown and the scores (X1, X2) recorded. Let Y = X1 + X2 denote the sum of the scores, U = min{X1, X2} the minimum score, and V = max{X1, X2} the maximum score. Find the covariance and correlation of each of the following pairs of variables:
1. (X1, X2)
2. (X1, Y)
3. (X1, U)
4. (U, V)
5. (U, Y)
Answer

Suppose that n fair dice are thrown. Find the mean and variance of each of the following variables:
1. Yn, the sum of the scores.
2. Mn, the average of the scores.

Answer

In the dice experiment, select fair dice, and select the following random variables. In each case, increase the number of dice
and observe the size and location of the probability density function and the mean ± standard deviation bar. With n = 20 dice,
run the experiment 1000 times and compare the sample mean and standard deviation to the distribution mean and standard
deviation.
1. The sum of the scores.
2. The average of the scores.

Suppose that n ace-six flat dice are thrown. Find the mean and variance of each of the following variables:
1. Yn, the sum of the scores.
2. Mn, the average of the scores.

Answer

In the dice experiment, select ace-six flat dice, and select the following random variables. In each case, increase the number of
dice and observe the size and location of the probability density function and the mean ± standard deviation bar. With n = 20
dice, run the experiment 1000 times and compare the sample mean and standard deviation to the distribution mean and
standard deviation.
1. The sum of the scores.
2. The average of the scores.

A pair of fair dice are thrown and the scores (X1, X2) recorded. Let Y = X1 + X2 denote the sum of the scores, U = min{X1, X2} the minimum score, and V = max{X1, X2} the maximum score. Find each of the following:
1. L(Y ∣ X1)
2. L(U ∣ X1)
3. L(V ∣ X1)

Answer

Bernoulli Trials
Recall that a Bernoulli trials process is a sequence X = (X1, X2, …) of independent, identically distributed indicator random variables. In the usual language of reliability, Xi denotes the outcome of trial i, where 1 denotes success and 0 denotes failure. The probability of success p = P(Xi = 1) is the basic parameter of the process. The process is named for Jacob Bernoulli. A separate chapter on the Bernoulli Trials explores this process in detail.

For n ∈ N+, the number of successes in the first n trials is Yn = ∑_{i=1}^n Xi. Recall that this random variable has the binomial distribution with parameters n and p, which has probability density function fn given by
fn(y) = \binom{n}{y} p^y (1 − p)^{n−y}, y ∈ {0, 1, …, n} (4.5.45)

The mean and variance of Yn are
1. E(Yn) = np
2. var(Yn) = np(1 − p)

Proof

In the binomial coin experiment, select the number of heads. Vary n and p and note the shape of the probability density
function and the size and location of the mean ± standard deviation bar. For selected values of the parameters, run the
experiment 1000 times and compare the sample mean and standard deviation to the distribution mean and standard deviation.

For n ∈ N+, the proportion of successes in the first n trials is Mn = Yn/n. This random variable is sometimes used as a statistical estimator of the parameter p, when the parameter is unknown.

The mean and variance of Mn are
1. E(Mn) = p
2. var(Mn) = p(1 − p)/n

Proof

In the binomial coin experiment, select the proportion of heads. Vary n and p and note the shape of the probability density
function and the size and location of the mean ± standard deviation bar. For selected values of the parameters, run the
experiment 1000 times and compare the sample mean and standard deviation to the distribution mean and standard deviation.

As a special case of (17) note that Mn → p as n → ∞ in mean square and in probability.

The Hypergeometric Distribution


Suppose that a population consists of m objects; r of the objects are type 1 and m − r are type 0. A sample of n objects is chosen at random, without replacement. The parameters m, n ∈ N+ and r ∈ N with n ≤ m and r ≤ m. For i ∈ {1, 2, …, n}, let Xi denote the type of the ith object selected. Recall that (X1, X2, …, Xn) is a sequence of identically distributed (but not independent) indicator random variables.


Let Y denote the number of type 1 objects in the sample, so that Y = ∑_{i=1}^n Xi. Recall that this random variable has the hypergeometric distribution, which has probability density function f given by
f(y) = \binom{r}{y} \binom{m−r}{n−y} / \binom{m}{n}, y ∈ {0, 1, …, n} (4.5.46)

For distinct i, j ∈ {1, 2, …, n},
1. E(Xi) = r/m
2. var(Xi) = (r/m)(1 − r/m)
3. cov(Xi, Xj) = −(r/m)(1 − r/m)/(m − 1)
4. cor(Xi, Xj) = −1/(m − 1)

Proof

Note that the event of a type 1 object on draw i and the event of a type 1 object on draw j are negatively correlated, but the
correlation depends only on the population size and not on the number of type 1 objects. Note also that the correlation is perfect if
m = 2. Think about these results intuitively.

The mean and variance of Y are
1. E(Y) = n r/m
2. var(Y) = n (r/m)(1 − r/m)(m − n)/(m − 1)

Proof
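These values can be spot-checked against SciPy's hypergeometric distribution; note that SciPy's parameter names differ from the notation here (M = population size, n = number of type 1 objects, N = sample size). A minimal sketch, assuming SciPy is available:

from scipy.stats import hypergeom

m, r, n = 50, 20, 10    # population size, type 1 count, sample size (text notation)
mean, var = hypergeom(M=m, n=r, N=n).stats(moments='mv')

print(float(mean), n * r / m)
print(float(var), n * (r / m) * (1 - r / m) * (m - n) / (m - 1))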

Note that if the sampling were with replacement, Y would have a binomial distribution, and so in particular E(Y) = n r/m and var(Y) = n (r/m)(1 − r/m). The additional factor (m − n)/(m − 1) that occurs in the variance of the hypergeometric distribution is sometimes called the finite population correction factor. Note that for fixed m, (m − n)/(m − 1) is decreasing in n, and is 0 when n = m. Of course, we know that we must have var(Y) = 0 if n = m, since we would be sampling the entire population, and so deterministically, Y = r. On the other hand, for fixed n, (m − n)/(m − 1) → 1 as m → ∞. More generally, the hypergeometric distribution is well approximated by the binomial when the population size m is large compared to the sample size n. These ideas are discussed more fully in the section on the hypergeometric distribution in the chapter on Finite Sampling Models.

In the ball and urn experiment, select sampling without replacement. Vary m, r, and n and note the shape of the probability
density function and the size and location of the mean ± standard deviation bar. For selected values of the parameters, run the
experiment 1000 times and compare the sample mean and standard deviation to the distribution mean and standard deviation.

Exercises on Basic Properties

Suppose that X and Y are real-valued random variables with cov(X, Y ) = 3 . Find cov(2X − 5, 4Y + 2) .
Answer

Suppose X and Y are real-valued random variables with var(X) = 5 , var(Y ) = 9 , and cov(X, Y ) = −3 . Find
1. cor(X, Y )
2. var(2X + 3Y − 7)
3. cov(5X + 2Y − 3, 3X − 4Y + 2)
4. cor(5X + 2Y − 3, 3X − 4Y + 2)
Answer

Suppose that X and Y are independent, real-valued random variables with var(X) = 6 and var(Y ) = 8 . Find
var(3X − 4Y + 5) .
Answer

Suppose that A and B are events in an experiment with P(A) = 1/2, P(B) = 1/3, and P(A ∩ B) = 1/8. Find each of the following:
1. cov(A, B)
2. cor(A, B)
Answer

Suppose that X, Y, and Z are real-valued random variables for an experiment, and that L(Y ∣ X) = 2 − 3X and L(Z ∣ X) = 5 + 4X. Find L(6Y − 2Z ∣ X).

Answer

Suppose that X and Y are real-valued random variables for an experiment, and that E(X) = 3 , var(X) = 4 , and
L(Y ∣ X) = 5 − 2X . Find each of the following:
1. E(Y )
2. cov(X, Y )
Answer

Simple Continuous Distributions


Suppose that (X, Y ) has probability density function f given by f (x, y) = x + y for 0 ≤ x ≤ 1 , 0 ≤ y ≤ 1 . Find each of the
following
1. cov(X, Y )

2. cor(X, Y )
3. L(Y ∣ X)
4. L(X ∣ Y )
Answer

Suppose that (X, Y ) has probability density function f given by f (x, y) = 2(x + y) for 0 ≤x ≤y ≤1 . Find each of the
following:
1. cov(X, Y )
2. cor(X, Y )
3. L(Y ∣ X)
4. L(X ∣ Y )
Answer

Suppose again that (X, Y ) has probability density function f given by f (x, y) = 2(x + y) for 0 ≤ x ≤ y ≤ 1 .
1. Find cov(X², Y).
2. Find cor(X², Y).
3. Find L(Y ∣ X²).
4. Which predictor of Y is better, the one based on X or the one based on X²?

Answer

Suppose that (X, Y) has probability density function f given by f(x, y) = 6x²y for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1. Find each of the following:
1. cov(X, Y )
2. cor(X, Y )
3. L(Y ∣ X)
4. L(X ∣ Y )
Answer

Suppose that (X, Y) has probability density function f given by f(x, y) = 15x²y for 0 ≤ x ≤ y ≤ 1. Find each of the following:
1. cov(X, Y )
2. cor(X, Y )
3. L(Y ∣ X)
4. L(X ∣ Y )
Answer

Suppose again that (X, Y) has probability density function f given by f(x, y) = 15x²y for 0 ≤ x ≤ y ≤ 1.
1. Find cov(√X, Y).
2. Find cor(√X, Y).
3. Find L(Y ∣ √X).
4. Which of the predictors of Y is better, the one based on X or the one based on √X?
Answer

This page titled 4.5: Covariance and Correlation is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

4.6: Generating Functions

As usual, our starting point is a random experiment modeled by a probability space (Ω, F , P). A generating function of a real-
valued random variable is an expected value of a certain transformation of the random variable involving another (deterministic)
variable. Most generating functions share four important properties:
1. Under mild conditions, the generating function completely determines the distribution of the random variable.
2. The generating function of a sum of independent variables is the product of the generating functions.
3. The moments of the random variable can be obtained from the derivatives of the generating function.
4. Ordinary (pointwise) convergence of a sequence of generating functions corresponds to the special convergence of the
corresponding distributions.
Property 1 is perhaps the most important. Often a random variable is shown to have a certain distribution by showing that the
generating function has a certain form. The process of recovering the distribution from the generating function is known as
inversion. Property 2 is frequently used to determine the distribution of a sum of independent variables. By contrast, recall that the
probability density function of a sum of independent variables is the convolution of the individual density functions, a much more
complicated operation. Property 3 is useful because often computing moments from the generating function is easier than
computing the moments directly from the probability density function. The last property is known as the continuity theorem. Often
it is easier to show the convergence of the generating functions than to prove convergence of the distributions directly.
The numerical value of the generating function at a particular value of the free variable is of no interest, and so generating
functions can seem rather unintuitive at first. But the important point is that the generating function as a whole encodes all of the
information in the probability distribution in a very useful way. Generating functions are important and valuable tools in
probability, as they are in other areas of mathematics, from combinatorics to differential equations.
We will study the three generating functions in the list below, which correspond to increasing levels of generality. The first is the
most restrictive, but also by far the simplest, since the theory reduces to basic facts about power series that you will remember from
calculus. The third is the most general and the one for which the theory is most complete and elegant, but it also requires basic
knowledge of complex analysis. The one in the middle is perhaps the one most commonly used, and suffices for most distributions
in applied probability.
1. the probability generating function
2. the moment generating function
3. the characteristic function
We will also study the characteristic function for multivariate distributions, although analogous results hold for the other two types.
In the basic theory below, be sure to try the proofs yourself before reading the ones in the text.

Basic Theory
The Probability Generating Function
For our first generating function, assume that N is a random variable taking values in N.

The probability generating function P of N is defined by


P(t) = E(t^N) (4.6.1)

for all t ∈ R for which the expected value exists in R.

That is, P(t) is defined when E(|t|^N) < ∞. The probability generating function can be written nicely in terms of the probability
density function.

Suppose that N has probability density function f and probability generating function P . Then

P(t) = ∑_{n=0}^∞ f(n) t^n, t ∈ (−r, r) (4.6.2)

where r ∈ [1, ∞] is the radius of convergence of the series.
Proof

In the language of combinatorics, P is the ordinary generating function of f . Of course, if N just takes a finite set of values in N
then r = ∞ . Recall from calculus that a power series can be differentiated term by term, just like a polynomial. Each derivative
series has the same radius of convergence as the original series (but may behave differently at the endpoints of the interval of
convergence). We denote the derivative of order n by P^{(n)}. Recall also that if n ∈ N and k ∈ N with k ≤ n, then the number of permutations of size k chosen from a population of n objects is
n^{(k)} = n(n − 1) ⋯ (n − k + 1) (4.6.3)

The following theorem is the inversion result for probability generating functions: the generating function completely determines
the distribution.

Suppose again that N has probability density function f and probability generating function P . Then
f(k) = P^{(k)}(0)/k!, k ∈ N (4.6.4)

Proof
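The inversion formula can be carried out symbolically. A minimal sketch using sympy, applied to the binomial probability generating function derived later in this section (our own choice of example):

import sympy as sp

t, p = sp.symbols('t p')
n = 3
P = (1 - p + p * t) ** n   # PGF of the binomial distribution with parameters n, p

# f(k) = P^{(k)}(0) / k!
for k in range(n + 1):
    fk = sp.diff(P, t, k).subs(t, 0) / sp.factorial(k)
    print(k, sp.expand(fk))   # binomial(n, k) p^k (1 - p)^(n - k)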

Our next result is not particularly important, but has a certain curiosity.

P(N is even) = (1/2)[1 + P(−1)].
Proof

Recall that the factorial moment of N of order k ∈ N is E[N^{(k)}]. The factorial moments can be computed from the derivatives of the probability generating function. The factorial moments, in turn, determine the ordinary moments about 0 (sometimes referred to as raw moments).

Suppose that the radius of convergence r > 1. Then P^{(k)}(1) = E[N^{(k)}] for k ∈ N. In particular, N has finite moments of all orders.
Proof

Suppose again that r > 1. Then
1. E(N) = P′(1)
2. var(N) = P′′(1) + P′(1)[1 − P′(1)]

Proof

Suppose that N1 and N2 are independent random variables taking values in N, with probability generating functions P1 and P2 having radii of convergence r1 and r2, respectively. Then the probability generating function P of N1 + N2 is given by P(t) = P1(t)P2(t) for |t| < r1 ∧ r2.


1 2 1 2

Proof

The Moment Generating Function


Our next generating function is defined more generally, so in this discussion we assume that the random variables are real-valued.

The moment generating function of X is the function M defined by
M(t) = E(e^{tX}), t ∈ R (4.6.7)

Note that since e^{tX} ≥ 0 with probability 1, M(t) exists, as a real number or ∞, for any t ∈ R. But as we will see, our interest will be in the domain where M(t) < ∞.

Suppose that X has a continuous distribution on R with probability density function f . Then


M(t) = ∫_{−∞}^{∞} e^{tx} f(x) dx (4.6.8)

Proof

Thus, the moment generating function of X is closely related to the Laplace transform of the probability density function f . The
Laplace transform is named for Pierre Simon Laplace, and is widely used in many areas of applied mathematics, particularly
differential equations. The basic inversion theorem for moment generating functions (similar to the inversion theorem for Laplace
transforms) states that if M (t) < ∞ for t in an open interval about 0, then M completely determines the distribution of X. Thus,
if two distributions on R have moment generating functions that are equal (and finite) in an open interval about 0, then the
distributions are the same.

Suppose that X has moment generating function M that is finite in an open interval I about 0. Then X has moments of all
orders and
M(t) = ∑_{n=0}^∞ [E(X^n)/n!] t^n, t ∈ I (4.6.9)

Proof

So under the finite assumption in the last theorem, the moment generating function, like the probability generating function, is a
power series in t .

Suppose again that X has moment generating function M that is finite in an open interval about 0. Then M^{(n)}(0) = E(X^n) for n ∈ N.
Proof

Thus, the derivatives of the moment generating function at 0 determine the moments of the variable (hence the name). In the
language of combinatorics, the moment generating function is the exponential generating function of the sequence of moments.
Thus, a random variable that does not have finite moments of all orders cannot have a finite moment generating function. Even
when a random variable does have moments of all orders, the moment generating function may not exist. A counterexample is
constructed below.
For nonnegative random variables (which are very common in applications), the domain where the moment generating function is
finite is easy to understand.

Suppose that X takes values in [0, ∞) and has moment generating function M . If M (t) < ∞ for t ∈ R then M (s) < ∞ for
s ≤t.

Proof

So for a nonnegative random variable, either M (t) < ∞ for all t ∈ R or there exists r ∈ (0, ∞) such that M (t) < ∞ for t < r .
Of course, there are complementary results for non-positive random variables, but such variables are much less common. Next we
consider what happens to the moment generating function under some simple transformations of the random variables.

Suppose that X has moment generating function M and that a, b ∈ R. The moment generating function N of Y = a + bX is given by N(t) = e^{at} M(bt) for t ∈ R.

Proof

Recall that if a ∈ R and b ∈ (0, ∞) then the transformation a + bX is a location-scale transformation on the distribution of X,
with location parameter a and scale parameter b . Location-scale transformations frequently arise when units are changed, such as
length changed from inches to centimeters or temperature from degrees Fahrenheit to degrees Celsius.

Suppose that X1 and X2 are independent random variables with moment generating functions M1 and M2 respectively. The moment generating function M of Y = X1 + X2 is given by M(t) = M1(t)M2(t) for t ∈ R.

Proof

The probability generating function of a variable can easily be converted into the moment generating function of the variable.

Suppose that X is a random variable taking values in N with probability generating function G having radius of convergence r. The moment generating function M of X is given by M(t) = G(e^t) for t < ln(r).

Proof

The following theorem gives the Chernoff bounds, named for the mathematician Herman Chernoff. These are upper bounds on the
tail events of a random variable.

If X has moment generating function M then
1. P(X ≥ x) ≤ e^{−tx} M(t) for t > 0
2. P(X ≤ x) ≤ e^{−tx} M(t) for t < 0
Proof

Naturally, the best Chernoff bound (in either (a) or (b)) is obtained by finding t that minimizes e^{−tx} M(t).
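For a concrete illustration, the sketch below minimizes e^{−tx} M(t) over a grid for the standard normal distribution, whose moment generating function M(t) = e^{t²/2} is derived later in this section; the minimizing t is x, giving the familiar bound e^{−x²/2}. The grid search is our own illustrative choice.

import numpy as np

# For the standard normal, M(t) = exp(t**2 / 2), so the Chernoff bound is
# P(Z >= x) <= min over t > 0 of exp(-t*x + t**2/2)
x = 2.0
t = np.linspace(0.01, 10, 100_000)
bound = np.exp(-t * x + t ** 2 / 2)
print(t[np.argmin(bound)])   # minimizing t is approximately x = 2
print(bound.min())           # approximately exp(-x**2/2) = exp(-2) ≈ 0.135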

The Characteristic Function


Our last generating function is the nicest from a mathematical point of view. Once again, we assume that our random variables are
real-valued.

The characteristic function of X is the function χ defined by
χ(t) = E(e^{itX}) = E[cos(tX)] + iE[sin(tX)], t ∈ R (4.6.12)

Note that χ is a complex valued function, and so this subsection requires some basic knowledge of complex analysis. The function χ is defined for all t ∈ R because the random variable in the expected value is bounded in magnitude. Indeed, |e^{itX}| = 1 for all t ∈ R. Many of the properties of the characteristic function are more elegant than the corresponding properties of the probability or moment generating functions, because the characteristic function always exists.

If X has a continuous distribution on R with probability density function f and characteristic function χ then

χ(t) = ∫_{−∞}^{∞} e^{itx} f(x) dx, t ∈ R (4.6.13)

Proof

Thus, the characteristic function of X is closely related to the Fourier transform of the probability density function f . The Fourier
transform is named for Joseph Fourier, and is widely used in many areas of applied mathematics.
As with other generating functions, the characteristic function completely determines the distribution. That is, random variables X
and Y have the same distribution if and only if they have the same characteristic function. Indeed, the general inversion formula
given next is a formula for computing certain combinations of probabilities from the characteristic function.

Suppose again that X has characteristic function χ. If a, b ∈ R and a < b then


∫_{−n}^{n} [(e^{−iat} − e^{−ibt})/(2πit)] χ(t) dt → P(a < X < b) + (1/2)[P(X = a) + P(X = b)] as n → ∞ (4.6.14)

The probability combinations on the right side completely determine the distribution of X . A special inversion formula holds for
continuous distributions:

Suppose that X has a continuous distribution with probability density function f and characteristic function χ. At every point
x ∈ R where f is differentiable,


f(x) = (1/2π) ∫_{−∞}^{∞} e^{−itx} χ(t) dt (4.6.15)

This formula is essentially the inverse Fourier transform. As with the other generating functions, the characteristic function can be
used to find the moments of X. Moreover, this can be done even when only some of the moments exist.

4.6.4 https://stats.libretexts.org/@go/page/10161
Suppose again that X has characteristic function χ. If n ∈ N+ and E(|X^n|) < ∞, then
χ(t) = ∑_{k=0}^n [E(X^k)/k!] (it)^k + o(t^n) (4.6.16)
and therefore χ^{(n)}(0) = i^n E(X^n).
Details

Next we consider how the characteristic function is changed under some simple transformations of the variables.

Suppose that X has characteristic function χ and that a, b ∈ R . The characteristic function ψ of Y = a + bX is given by
ψ(t) = e^{iat} χ(bt) for t ∈ R.
Proof

Suppose that X1 and X2 are independent random variables with characteristic functions χ1 and χ2 respectively. The characteristic function χ of Y = X1 + X2 is given by χ(t) = χ1(t)χ2(t) for t ∈ R.

Proof

The characteristic function of a random variable can be obtained from the moment generating function, under the basic existence
condition that we saw earlier.

Suppose that X has moment generating function M that satisfies M (t) < ∞ for t in an open interval I about 0. Then the
characteristic function χ of X satisfies χ(t) = M (it) for t ∈ I .

The final important property of characteristic functions that we will discuss relates to convergence in distribution. Suppose that
(X1, X2, …) is a sequence of real-valued random variables with characteristic functions (χ1, χ2, …) respectively. Since we are only

concerned with distributions, the random variables need not be defined on the same probability space.

The Continuity Theorem


1. If the distribution of Xn converges to the distribution of a random variable X as n → ∞ and X has characteristic function χ, then χn(t) → χ(t) as n → ∞ for all t ∈ R.
2. Conversely, if χn(t) converges to a function χ(t) as n → ∞ for t in an open interval about 0, and if χ is continuous at 0, then χ is the characteristic function of a random variable X, and the distribution of Xn converges to the distribution of X as n → ∞.

There are analogous versions of the continuity theorem for probability generating functions and moment generating functions. The
continuity theorem can be used to prove the central limit theorem, one of the fundamental theorems of probability. Also, the
continuity theorem has a straightforward generalization to distributions on R^n.

The Joint Characteristic Function


All of the generating functions that we have discussed have multivariate extensions. However, we will discuss the extension only
for the characteristic function, the most important and versatile of the generating functions. There are analogous results for the
other generating functions. So in this discussion, we assume that (X, Y) is a random vector for our experiment, taking values in R².

The (joint) characteristic function χ of (X, Y ) is defined by


χ(s, t) = E[exp(isX + itY)], (s, t) ∈ R² (4.6.18)

Once again, the most important fact is that χ completely determines the distribution: two random vectors taking values in R² have

the same characteristic function if and only if they have the same distribution.
The joint moments can be obtained from the derivatives of the characteristic function.

Suppose that (X, Y) has characteristic function χ. If m, n ∈ N and E(|X^m Y^n|) < ∞ then
χ^{(m,n)}(0, 0) = i^{m+n} E(X^m Y^n) (4.6.19)

The marginal characteristic functions and the characteristic function of the sum can be easily obtained from the joint characteristic
function:

Suppose again that (X, Y) has characteristic function χ, and let χ1, χ2, and χ+ denote the characteristic functions of X, Y, and X + Y, respectively. For t ∈ R,
1. χ(t, 0) = χ1(t)
2. χ(0, t) = χ2(t)
3. χ(t, t) = χ+(t)

Proof

Suppose again that χ1, χ2, and χ are the characteristic functions of X, Y, and (X, Y) respectively. Then X and Y are independent if and only if χ(s, t) = χ1(s)χ2(t) for all (s, t) ∈ R².

Naturally, the results for bivariate characteristic functions have analogies in the general multivariate case. Only the notation is more
complicated.

Examples and Applications


As always, be sure to try the computational problems yourself before expanding the solutions and answers in the text.

Dice
Recall that an ace-six flat die is a six-sided die for which faces numbered 1 and 6 have probability 1/4 each, while faces numbered 2, 3, 4, and 5 have probability 1/8 each. Similarly, a 3-4 flat die is a six-sided die for which faces numbered 3 and 4 have probability 1/4 each, while faces numbered 1, 2, 5, and 6 have probability 1/8 each.

Suppose that an ace-six flat die and a 3-4 flat die are rolled. Use probability generating functions to find the probability density
function of the sum of the scores.
Solution
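Multiplying probability generating functions amounts to convolving the coefficient sequences, so the computation can be done mechanically. A minimal Python sketch using discrete convolution (not the closed-form solution in the text); the arrays are indexed by score, with index 0 unused.

import numpy as np

# PGF coefficients indexed by score 0..6 (index 0 has probability 0)
ace_six    = np.array([0, 1/4, 1/8, 1/8, 1/8, 1/8, 1/4])
three_four = np.array([0, 1/8, 1/8, 1/4, 1/4, 1/8, 1/8])

# Multiplying the two PGFs is the same as convolving their coefficients
pdf_sum = np.convolve(ace_six, three_four)
for k in range(2, 13):
    print(k, pdf_sum[k])     # probability density function of the sum
print(pdf_sum.sum())         # 1.0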

Two fair, 6-sided dice are rolled. One has faces numbered (0, 1, 2, 3, 4, 5) and the other has faces numbered
(0, 6, 12, 18, 24, 30). Use probability generating functions to find the probability density function of the sum of the scores, and
identify the distribution.
Solution

Suppose that random variable Y has probability generating function P given by


P(t) = ((2/5)t + (3/10)t² + (1/5)t³ + (1/10)t⁴)⁵, t ∈ R (4.6.22)

1. Interpret Y in terms of rolling dice.


2. Use the probability generating function to find the first two factorial moments of Y .
3. Use (b) to find the variance of Y .
Answer

Bernoulli Trials
Suppose X is an indicator random variable with p = P(X = 1) , where p ∈ [0, 1] is a parameter. Then X has probability
generating function P (t) = 1 − p + pt for t ∈ R .
Proof

Recall that a Bernoulli trials process is a sequence (X1, X2, …) of independent, identically distributed indicator random variables. In the usual language of reliability, Xi denotes the outcome of trial i, where 1 denotes success and 0 denotes failure. The probability of success p = P(Xi = 1) is the basic parameter of the process. The process is named for Jacob Bernoulli. A separate chapter on the Bernoulli Trials explores this process in more detail.


For n ∈ N+, the number of successes in the first n trials is Yn = ∑_{i=1}^n Xi. Recall that this random variable has the binomial distribution with parameters n and p, which has probability density function fn given by
fn(y) = \binom{n}{y} p^y (1 − p)^{n−y}, y ∈ {0, 1, …, n} (4.6.23)

Random variable Yn has probability generating function Pn given by Pn(t) = (1 − p + pt)^n for t ∈ R.
Proof

Random variable Yn has the following parameters:
1. E[Yn^{(k)}] = n^{(k)} p^k
2. E(Yn) = np
3. var(Yn) = np(1 − p)
4. P(Yn is even) = (1/2)[1 + (1 − 2p)^n]

Proof

Suppose that U has the binomial distribution with parameters m ∈ N+ and p ∈ [0, 1], V has the binomial distribution with parameters n ∈ N+ and q ∈ [0, 1], and that U and V are independent.
1. If p = q then U + V has the binomial distribution with parameters m + n and p.
2. If p ≠ q then U + V does not have a binomial distribution.
Proof

Suppose now that p ∈ (0, 1]. The trial number N of the first success in the sequence of Bernoulli trials has the geometric distribution on N+ with success parameter p. The probability density function h is given by
h(n) = p(1 − p)^{n−1}, n ∈ N+ (4.6.24)
h(n) = p(1 − p ) , n ∈ N+ (4.6.24)

The geometric distribution is studied in more detail in the chapter on Bernoulli trials.

Let Q denote the probability generating function of N. Then
1. Q(t) = pt/[1 − (1 − p)t] for −1/(1 − p) < t < 1/(1 − p)
2. E[N^{(k)}] = k! (1 − p)^{k−1}/p^k for k ∈ N
3. E(N) = 1/p
4. var(N) = (1 − p)/p²
5. P(N is even) = (1 − p)/(2 − p)

Proof

The probability that N is even comes up in the alternating coin tossing game with two players.

The Poisson Distribution


Recall that the Poisson distribution has probability density function f given by
f(n) = e^{−a} a^n/n!, n ∈ N (4.6.26)

where a ∈ (0, ∞) is a parameter. The Poisson distribution is named after Simeon Poisson and is widely used to model the number
of “random points” in a region of time or space; the parameter is proportional to the size of the region of time or space. The Poisson
distribution is studied in more detail in the chapter on the Poisson Process.

Suppose that N has the Poisson distribution with parameter a ∈ (0, ∞). Let P denote the probability generating function of N. Then
1. P(t) = e^{a(t−1)} for t ∈ R
2. E[N^{(k)}] = a^k
3. E(N) = a
4. var(N) = a
5. P(N is even) = (1/2)(1 + e^{−2a})

Proof

The Poisson family of distributions is closed with respect to sums of independent variables, a very important property.

Suppose that X, Y have Poisson distributions with parameters a, b ∈ (0, ∞) , respectively, and that X and Y are independent.
Then X + Y has the Poisson distribution with parameter a + b .
Proof

The right distribution function of the Poisson distribution does not have a simple, closed-form expression. The following exercise
gives an upper bound.

Suppose that N has the Poisson distribution with parameter a > 0 . Then
P(N ≥ n) ≤ e^{n−a} (a/n)^n, n > a (4.6.28)

Proof

The following theorem gives an important convergence result that is explored in more detail in the chapter on the Poisson process.

Suppose that pn ∈ (0, 1) for n ∈ N+ and that npn → a ∈ (0, ∞) as n → ∞. Then the binomial distribution with parameters n and pn converges to the Poisson distribution with parameter a as n → ∞.

Proof
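This convergence is easy to see numerically. A minimal sketch comparing the two probability density functions via SciPy, with a = 2 (our own choice):

from scipy.stats import binom, poisson

a = 2.0
for n in [10, 100, 10_000]:
    p = a / n
    # largest pointwise gap between the two probability density functions
    err = max(abs(binom.pmf(k, n, p) - poisson.pmf(k, a)) for k in range(25))
    print(n, err)   # shrinks toward 0 as n grows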

The Exponential Distribution


Recall that the exponential distribution is a continuous distribution on [0, ∞) with probability density function f given by
f(t) = r e^{−rt}, t ∈ (0, ∞) (4.6.31)

where r ∈ (0, ∞) is the rate parameter. This distribution is widely used to model failure times and other random times, and in
particular governs the time between arrivals in the Poisson model. The exponential distribution is studied in more detail in the
chapter on the Poisson Process.

Suppose that T has the exponential distribution with rate parameter r ∈ (0, ∞) and let M denote the moment generating
function of T. Then

1. M(s) = r/(r − s) for s ∈ (−∞, r).
2. E(T^n) = n!/r^n for n ∈ ℕ

Proof

Suppose that (T₁, T₂, …) is a sequence of independent random variables, each having the exponential distribution with rate
parameter r ∈ (0, ∞). For n ∈ ℕ₊, the moment generating function M_n of U_n = ∑_{i=1}^n T_i is given by

M_n(s) = [r/(r − s)]^n,  s ∈ (−∞, r)  (4.6.32)

Proof

Random variable U_n has the Erlang distribution with shape parameter n and rate parameter r, named for Agner Erlang. This
distribution governs the nth arrival time in the Poisson model. The Erlang distribution is a special case of the gamma distribution
and is studied in more detail in the chapter on the Poisson Process.

Uniform Distributions
Suppose that a, b ∈ R and a <b . Recall that the continuous uniform distribution on the interval [a, b] has probability density
function f given by
f(x) = 1/(b − a),  x ∈ [a, b]  (4.6.33)

The distribution corresponds to selecting a point at random from the interval. Continuous uniform distributions arise in geometric
probability and a variety of other applied problems.

Suppose that X is uniformly distributed on the interval [a, b] and let M denote the moment generating function of X. Then

1. M(t) = (e^{bt} − e^{at}) / [(b − a)t] if t ≠ 0, and M(0) = 1
2. E(X^n) = (b^{n+1} − a^{n+1}) / [(n + 1)(b − a)] for n ∈ ℕ

Proof

Suppose that (X, Y) is uniformly distributed on the triangle T = {(x, y) ∈ ℝ² : 0 ≤ x ≤ y ≤ 1}. Compute each of the
following:
1. The joint moment generating function of (X, Y ).
2. The moment generating function of X.
3. The moment generating function of Y .
4. The moment generating function of X + Y .
Answer

A Bivariate Distribution
Suppose that (X, Y) has probability density function f given by f(x, y) = x + y for (x, y) ∈ [0, 1]². Compute each of the
following:
1. The joint moment generating function of (X, Y).
2. The moment generating function of X.
3. The moment generating function of Y .
4. The moment generating function of X + Y .
Answer

The Normal Distribution


Recall that the standard normal distribution is a continuous distribution on ℝ with probability density function ϕ given by

ϕ(z) = (1/√(2π)) e^{−z²/2},  z ∈ ℝ  (4.6.34)

Normal distributions are widely used to model physical measurements subject to small, random errors and are studied in more
detail in the chapter on Special Distributions.

Suppose that Z has the standard normal distribution and let M denote the moment generating function of Z. Then

1. M(t) = e^{t²/2} for t ∈ ℝ
2. E(Z^n) = 1 · 3 ⋯ (n − 1) if n is even and E(Z^n) = 0 if n is odd.
Proof

More generally, for μ ∈ R and σ ∈ (0, ∞), recall that the normal distribution with mean μ and standard deviation σ is a
continuous distribution on R with probability density function f given by
f(x) = (1/(√(2π) σ)) exp[−½((x − μ)/σ)²],  x ∈ ℝ  (4.6.37)

Moreover, if Z has the standard normal distribution, then X = μ + σZ has the normal distribution with mean μ and standard
deviation σ. Thus, we can easily find the moment generating function of X:

Suppose that X has the normal distribution with mean μ and standard deviation σ. The moment generating function of X is
M(t) = exp(μt + ½σ²t²),  t ∈ ℝ  (4.6.38)

Proof

So the normal family of distributions is closed under location-scale transformations. The family is also closed with respect to sums
of independent variables:

If X and Y are independent, normally distributed random variables then X + Y has a normal distribution.
Proof

The Pareto Distribution


Recall that the Pareto distribution is a continuous distribution on [1, ∞) with probability density function f given by

f(x) = a / x^{a+1},  x ∈ [1, ∞)  (4.6.41)

where a ∈ (0, ∞) is the shape parameter. The Pareto distribution is named for Vilfredo Pareto. It is a heavy-tailed distribution that
is widely used to model financial variables such as income. The Pareto distribution is studied in more detail in the chapter on
Special Distributions.

Suppose that X has the Pareto distribution with shape parameter a, and let M denote the moment generating function of X.
Then

1. E(X^n) = a/(a − n) if n < a and E(X^n) = ∞ if n ≥ a
2. M(t) = ∞ for t > 0
Proof

On the other hand, like all distributions on R, the Pareto distribution has a characteristic function. However, the characteristic
function of the Pareto distribution does not have a simple, closed form.

The Cauchy Distribution


Recall that the (standard) Cauchy distribution is a continuous distribution on ℝ with probability density function f given by

f(x) = 1 / [π(1 + x²)],  x ∈ ℝ  (4.6.42)

and is named for Augustin Cauchy. The Cauchy distribution is studied in more generality in the chapter on Special Distributions.
The graph of f is known as the Witch of Agnesi, named for Maria Agnesi.

Suppose that X has the standard Cauchy distribution, and let M denote the moment generating function of X. Then
1. E(X) does not exist.
2. M (t) = ∞ for t ≠ 0 .
Proof

Once again, all distributions on R have characteristic functions, and the standard Cauchy distribution has a particularly simple one.

Let χ denote the characteristic function of X. Then χ(t) = e^{−|t|} for t ∈ ℝ.
Proof

Counterexample
For the Pareto distribution, only some of the moments are finite; so of course, the moment generating function cannot be finite in an
interval about 0. We will now give an example of a distribution for which all of the moments are finite, yet still the moment
generating function is not finite in any interval about 0. Furthermore, we will see two different distributions that have the same
moments of all orders.
Suppose that Z has the standard normal distribution and let X = e^Z. The distribution of X is known as the (standard) lognormal
distribution. The lognormal distribution is studied in more generality in the chapter on Special Distributions. This distribution has
finite moments of all orders, but infinite moment generating function.

X has probability density function f given by

f(x) = (1/(√(2π) x)) exp(−½ ln²(x)),  x > 0  (4.6.43)

1. E(X^n) = e^{n²/2} for n ∈ ℕ.
2. E(e^{tX}) = ∞ for t > 0.
Proof

Next we construct a different distribution with the same moments as X.

Let h be the function defined by h(x) = sin(2π ln x) for x > 0 and let g be the function defined by g(x) = f(x)[1 + h(x)]
for x > 0. Then

1. g is a probability density function.
2. If Y has probability density function g then E(Y^n) = e^{n²/2} for n ∈ ℕ
Proof
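A numerical check of this striking fact is instructive. In the Python sketch below (added for illustration, not part of the text), the substitution x = e^z turns the nth moments of f and g into integrals against the standard normal density ϕ; a simple trapezoid rule then shows that both equal e^{n²/2}, even though f ≠ g.

from math import exp, sin, pi, sqrt

def phi(z):
    return exp(-z * z / 2) / sqrt(2 * pi)

# n-th moment of f (perturbed=False) or g (perturbed=True) after substituting
# x = e^z; the sin term integrates to zero against e^{nz} phi(z) for integer n.
def moment(n, perturbed, lo=-15.0, hi=25.0, steps=200_000):
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        z = lo + i * h
        w = 0.5 if i in (0, steps) else 1.0          # trapezoid weights
        factor = 1 + sin(2 * pi * z) if perturbed else 1.0
        total += w * exp(n * z) * phi(z) * factor
    return total * h

for n in range(4):
    print(n, moment(n, False), moment(n, True), exp(n * n / 2))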

Figure 4.6.1 : The graphs of f and g , probability density functions for two distributions with the same moments of all orders.

This page titled 4.6: Generating Functions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

4.7: Conditional Expected Value

As usual, our starting point is a random experiment modeled by a probability space (Ω, F , P). So to review, Ω is the set of
outcomes, F the collection of events, and P the probability measure on the sample space (Ω, F ). Suppose next that X is a random
variable taking values in a set S and that Y is a random variable taking values in T ⊆ R . We assume that either Y has a discrete
distribution, so that T is countable, or that Y has a continuous distribution so that T is an interval (or perhaps a union of intervals).
In this section, we will study the conditional expected value of Y given X, a concept of fundamental importance in probability. As
we will see, the expected value of Y given X is the function of X that best approximates Y in the mean square sense. Note that X
is a general random variable, not necessarily real-valued, but as usual, we will assume that either X has a discrete distribution, so
that S is countable, or that X has a continuous distribution on S ⊆ ℝⁿ for some n ∈ ℕ₊. In the latter case, S is typically a region
defined by inequalities involving elementary functions. We will also assume that all expected values that are mentioned exist (as
real numbers).

Basic Theory
Definitions
Note that we can think of (X, Y) as a random variable that takes values in the Cartesian product set S × T. We need to recall some
basic facts from our work with joint distributions and conditional distributions.

We assume that (X, Y) has joint probability density function f and we let g denote the (marginal) probability density function
of X. Recall that if Y has a discrete distribution then

g(x) = ∑_{y∈T} f(x, y),  x ∈ S  (4.7.1)

and if Y has a continuous distribution then

g(x) = ∫_T f(x, y) dy,  x ∈ S  (4.7.2)

In either case, for x ∈ S, the conditional probability density function of Y given X = x is defined by

h(y ∣ x) = f(x, y) / g(x),  y ∈ T  (4.7.3)

We are now ready for the basic definitions:

For x ∈ S, the conditional expected value of Y given X = x ∈ S is simply the mean computed relative to the conditional
distribution. So if Y has a discrete distribution then

E(Y ∣ X = x) = ∑_{y∈T} y h(y ∣ x),  x ∈ S  (4.7.4)

and if Y has a continuous distribution then

E(Y ∣ X = x) = ∫_T y h(y ∣ x) dy,  x ∈ S  (4.7.5)

1. The function v : S → R defined by v(x) = E(Y ∣ X = x) for x ∈ S is the regression function of Y based on X.
2. The random variable v(X) is called the conditional expected value of Y given X and is denoted E(Y ∣ X) .

Intuitively, we treat X as known, and therefore not random, and we then average Y with respect to the probability distribution that
remains. The advanced section on conditional expected value gives a much more general definition that unifies the definitions
given here for the various distribution types.
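As a concrete illustration of the continuous case, here is a Python sketch (added for illustration, not part of the text) that uses the density f(x, y) = x + y on the unit square, which also appears in the examples later in this section. Numerically integrating y h(y ∣ x) recovers the closed form E(Y ∣ X = x) = (3x + 2)/(6x + 3), which follows from the definitions above.

# Approximate E(Y | X = x) for f(x, y) = x + y on [0, 1]^2 by a midpoint rule:
# g(x) is the integral of f(x, y) dy, and E(Y | X = x) is the integral of
# y f(x, y) dy divided by g(x).
def cond_mean(x, steps=100_000):
    h = 1.0 / steps
    g = sum((x + (i + 0.5) * h) * h for i in range(steps))
    num = sum((i + 0.5) * h * (x + (i + 0.5) * h) * h for i in range(steps))
    return num / g

for x in (0.0, 0.25, 0.5, 1.0):
    print(x, cond_mean(x), (3 * x + 2) / (6 * x + 3))   # the two columns agree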

Properties
The most important property of the random variable E(Y ∣ X) is given in the following theorem. In a sense, this result states that
E(Y ∣ X) behaves just like Y in terms of other functions of X, and is essentially the only function of X with this property.

The fundamental property


1. E [r(X)E(Y ∣ X)] = E [r(X)Y ] for every function r : S → R .
2. If u : S → R satisfies E[r(X)u(X)] = E[r(X)Y ] for every r : S → R then P [u(X) = E(Y ∣ X)] = 1 .
Proof

Two random variables that are equal with probability 1 are said to be equivalent. We often think of equivalent random variables as
being essentially the same object, so the fundamental property above essentially characterizes E(Y ∣ X) . That is, we can think of
E(Y ∣ X) as any random variable that is a function of X and satisfies this property. Moreover the fundamental property can be

used as a definition of conditional expected value, regardless of the type of the distribution of (X, Y ). If you are interested, read
the more advanced treatment of conditional expected value.
Suppose that X is also real-valued. Recall that the best linear predictor of Y based on X was characterized by property (a), but
with just two functions: r(x) = 1 and r(x) = x . Thus the characterization in the fundamental property is certainly reasonable,
since (as we show below) E(Y ∣ X) is the best predictor of Y among all functions of X, not just linear functions.
The basic property is also very useful for establishing other properties of conditional expected value. Our first consequence is the
fact that Y and E(Y ∣ X) have the same mean.

E [E(Y ∣ X)] = E(Y ) .


Proof

Aside from the theoretical interest, this theorem is often a good way to compute E(Y ) when we know the conditional distribution
of Y given X. We say that we are computing the expected value of Y by conditioning on X.
For many basic properties of ordinary expected value, there are analogous results for conditional expected value. We start with two
of the most important: every type of expected value must satisfy two critical properties: linearity and monotonicity. In the following
two theorems, the random variables Y and Z are real-valued, and as before, X is a general random variable.

Linear Properties
1. E(Y + Z ∣ X) = E(Y ∣ X) + E(Z ∣ X) .
2. E(c Y ∣ X) = c E(Y ∣ X)
Proof

Part (a) is the additive property and part (b) is the scaling property. The scaling property will be significantly generalized below in
(8).

Positive and Increasing Properties


1. If Y ≥ 0 then E(Y ∣ X) ≥ 0 .
2. If Y ≤ Z then E(Y ∣ X) ≤ E(Z ∣ X) .
3. |E(Y ∣ X)| ≤ E (|Y | ∣ X)
Proof

Our next few properties relate to the idea that E(Y ∣ X) is the expected value of Y given X . The first property is essentially a
restatement of the fundamental property.

If r : S → R , then Y − E(Y ∣ X) and r(X) are uncorrelated.


Proof

The next result states that any (deterministic) function of X acts like a constant in terms of the conditional expected value with
respect to X.

If s : S → R then

E [s(X) Y ∣ X] = s(X) E(Y ∣ X) (4.7.12)

Proof

The following rule generalizes theorem (8) and is sometimes referred to as the substitution rule for conditional expected value.

If s : S × T → R then

E [s(X, Y ) ∣ X = x] = E [s(x, Y ) ∣ X = x] (4.7.14)

In particular, it follows from (8) that E[s(X) ∣ X] = s(X) . At the opposite extreme, we have the next result: If X and Y are
independent, then knowledge of X gives no information about Y and so the conditional expected value with respect to X reduces
to the ordinary (unconditional) expected value of Y .

If X and Y are independent then


E(Y ∣ X) = E(Y ) (4.7.15)

Proof

Suppose now that Z is real-valued and that X and Y are random variables (all defined on the same probability space, of course).
The following theorem gives a consistency condition of sorts. Iterated conditional expected values reduce to a single conditional
expected value with respect to the minimum amount of information. For simplicity, we write E(Z ∣ X, Y ) rather than
E [Z ∣ (X, Y )] .

Consistency
1. E [E(Z ∣ X, Y ) ∣ X] = E(Z ∣ X)
2. E [E(Z ∣ X) ∣ X, Y ] = E(Z ∣ X)
Proof

Finally we show that E(Y ∣ X) has the same covariance with X as does Y , not surprising since again, E(Y ∣ X) behaves just like
Y in its relations with X.

cov [X, E(Y ∣ X)] = cov(X, Y ) .


Proof

Conditional Probability
The conditional probability of an event A, given random variable X (as above), can be defined as a special case of the conditional
expected value. As usual, let 1_A denote the indicator random variable of A.

If A is an event, define

P(A ∣ X) = E(1_A ∣ X)  (4.7.17)

Here is the fundamental property for conditional probability:

The fundamental property


1. E[r(X)P(A ∣ X)] = E[r(X)1_A] for every function r : S → ℝ.
2. If u : S → ℝ and u(X) satisfies E[r(X)u(X)] = E[r(X)1_A] for every function r : S → ℝ, then
P[u(X) = P(A ∣ X)] = 1.

For example, suppose that X has a discrete distribution on a countable set S with probability density function g . Then (a) becomes

∑_{x∈S} r(x) P(A ∣ X = x) g(x) = ∑_{x∈S} r(x) P(A, X = x)  (4.7.18)

But this is obvious since P(A ∣ X = x) = P(A, X = x)/P(X = x) and g(x) = P(X = x). Similarly, if X has a continuous
distribution on S ⊆ ℝⁿ then (a) states that

E[r(X)1_A] = ∫_S r(x) P(A ∣ X = x) g(x) dx  (4.7.19)

The properties above for conditional expected value, of course, have special cases for conditional probability.

P(A) = E [P(A ∣ X)] .


Proof

Again, the result in the previous exercise is often a good way to compute P(A) when we know the conditional probability of A
given X. We say that we are computing the probability of A by conditioning on X. This is a very compact and elegant version of
the conditioning result given first in the section on Conditional Probability in the chapter on Probability Spaces and later in the
section on Discrete Distributions in the Chapter on Distributions.
The following result gives the conditional version of the axioms of probability.

Axioms of probability
1. P(A ∣ X) ≥ 0 for every event A.
2. P(Ω ∣ X) = 1
3. If {A_i : i ∈ I} is a countable collection of disjoint events then P(⋃_{i∈I} A_i ∣ X) = ∑_{i∈I} P(A_i ∣ X).
Details

From the last result, it follows that other standard probability rules hold for conditional probability given X. These results include
the complement rule
the increasing property
Boole's inequality
Bonferroni's inequality
the inclusion-exclusion laws

The Best Predictor


The next result shows that, of all functions of X, E(Y ∣ X) is closest to Y , in the sense of mean square error. This is fundamentally
important in statistical problems where the predictor vector X can be observed but not the response variable Y . In this subsection
and the next, we assume that the real-valued random variables have finite variance.

If u : S → ℝ, then

1. E([E(Y ∣ X) − Y]²) ≤ E([u(X) − Y]²)
2. Equality holds in (a) if and only if u(X) = E(Y ∣ X) with probability 1.


Proof

Suppose now that X is real-valued. In the section on covariance and correlation, we found that the best linear predictor of Y given
X is

L(Y ∣ X) = E(Y) + [cov(X, Y)/var(X)] [X − E(X)]  (4.7.23)

On the other hand, E(Y ∣ X) is the best predictor of Y among all functions of X. It follows that if E(Y ∣ X) happens to be a linear
function of X then it must be the case that E(Y ∣ X) = L(Y ∣ X) . However, we will give a direct proof also:

If E(Y ∣ X) = a + bX for constants a and b then E(Y ∣ X) = L(Y ∣ X) ; that is,


1. b = cov(X, Y )/var(X)
2. a = E(Y ) − E(X)cov(X, Y )/var(X)
Proof

Conditional Variance
The conditional variance of Y given X is defined like the ordinary variance, but with all expected values conditioned on X.

The conditional variance of Y given X is defined as

var(Y ∣ X) = E([Y − E(Y ∣ X)]² ∣ X)  (4.7.24)

Thus, var(Y ∣ X) is a function of X, and in particular, is a random variable. Our first result is a computational formula that is
analogous to the one for standard variance—the variance is the mean of the square minus the square of the mean, but now with all
expected values conditioned on X:

var(Y ∣ X) = E(Y² ∣ X) − [E(Y ∣ X)]².
Proof

Our next result shows how to compute the ordinary variance of Y by conditioning on X.

var(Y ) = E [var(Y ∣ X)] + var [E(Y ∣ X)] .


Proof

Thus, the variance of Y is the expected conditional variance plus the variance of the conditional expected value. This result is often
a good way to compute var(Y ) when we know the conditional distribution of Y given X. With the help of (21) we can give a
formula for the mean square error when E(Y ∣ X) is used as a predictor of Y.

Mean square error


E([Y − E(Y ∣ X)]²) = var(Y) − var[E(Y ∣ X)]  (4.7.27)

Proof

Let us return to the study of predictors of the real-valued random variable Y , and compare the three predictors we have studied in
terms of mean square error.

Suppose that Y is a real-valued random variable.

1. The best constant predictor of Y is E(Y) with mean square error var(Y).
2. If X is another real-valued random variable, then the best linear predictor of Y given X is

   L(Y ∣ X) = E(Y) + [cov(X, Y)/var(X)] [X − E(X)]  (4.7.29)

   with mean square error var(Y)[1 − cor²(X, Y)].
3. If X is a general random variable, then the best overall predictor of Y given X is E(Y ∣ X) with mean square error
   var(Y) − var[E(Y ∣ X)].
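A simulation makes the comparison vivid. In the Python sketch below (a hypothetical example added here, not from the text), X is uniform on (−1, 1) and Y = X² plus independent noise, so that E(Y ∣ X) = X² is nonlinear while cov(X, Y) = 0; the best linear predictor then does no better than the best constant, but conditioning reduces the mean square error to the noise variance.

import random

random.seed(1)
n = 100_000
xs = [random.uniform(-1, 1) for _ in range(n)]
ys = [x * x + random.gauss(0, 0.1) for x in xs]   # E(Y | X) = X^2, noise sd 0.1

ey = sum(ys) / n
mse_const = sum((y - ey) ** 2 for y in ys) / n                 # about 0.099 = var(Y)
mse_cond = sum((y - x * x) ** 2 for x, y in zip(xs, ys)) / n   # about 0.010
print(mse_const, mse_cond)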

Conditional Covariance
Suppose that Y and Z are real-valued random variables, and that X is a general random variable, all defined on our underlying
probability space. Analogous to variance, the conditional covariance of Y and Z given X is defined like the ordinary covariance,
but with all expected values conditioned on X.

The conditional covariance of Y and Z given X is defined as

cov(Y, Z ∣ X) = E([Y − E(Y ∣ X)][Z − E(Z ∣ X)] ∣ X)  (4.7.30)

Thus, cov(Y , Z ∣ X) is a function of X, and in particular, is a random variable. Our first result is a computational formula that is
analogous to the one for standard covariance—the covariance is the mean of the product minus the product of the means, but now

with all expected values conditioned on X:

cov(Y , Z ∣ X) = E (Y Z ∣ X) − E(Y ∣ X)E(Z ∣ X) .


Proof

Our next result shows how to compute the ordinary covariance of Y and Z by conditioning on X.

cov(Y , Z) = E [cov(Y , Z ∣ X)] + cov [E(Y ∣ X), E(Z ∣ X)] .


Proof

Thus, the covariance of Y and Z is the expected conditional covariance plus the covariance of the conditional expected values.
This result is often a good way to compute cov(Y , Z) when we know the conditional distribution of (Y , Z) given X.

Examples and Applications


As always, be sure to try the proofs and computations yourself before reading the ones in the text.

Simple Continuous Distributions


Suppose that (X, Y) has probability density function f defined by f(x, y) = x + y for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1.
1. Find L(Y ∣ X).
2. Find E(Y ∣ X).
3. Graph L(Y ∣ X = x) and E(Y ∣ X = x) as functions of x, on the same axes.
4. Find var(Y).
5. Find var(Y)[1 − cor²(X, Y)].
6. Find var(Y) − var[E(Y ∣ X)].


Answer

Suppose that (X, Y) has probability density function f defined by f(x, y) = 2(x + y) for 0 ≤ x ≤ y ≤ 1.
1. Find L(Y ∣ X).
2. Find E(Y ∣ X).
3. Graph L(Y ∣ X = x) and E(Y ∣ X = x) as functions of x, on the same axes.
4. Find var(Y).
5. Find var(Y)[1 − cor²(X, Y)].
6. Find var(Y) − var[E(Y ∣ X)].


Answer

Suppose that (X, Y) has probability density function f defined by f(x, y) = 6x²y for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1.
1. Find L(Y ∣ X).
2. Find E(Y ∣ X).
3. Graph L(Y ∣ X = x) and E(Y ∣ X = x) as functions of x, on the same axes.
4. Find var(Y).
5. Find var(Y)[1 − cor²(X, Y)].
6. Find var(Y) − var[E(Y ∣ X)].


Answer

Suppose that (X, Y) has probability density function f defined by f(x, y) = 15x²y for 0 ≤ x ≤ y ≤ 1.
1. Find L(Y ∣ X).
2. Find E(Y ∣ X).
3. Graph L(Y ∣ X = x) and E(Y ∣ X = x) as functions of x, on the same axes.
4. Find var(Y).
5. Find var(Y)[1 − cor²(X, Y)].
6. Find var(Y) − var[E(Y ∣ X)].

Answer

Exercises on Basic Properties

Suppose that X, Y, and Z are real-valued random variables with E(Y ∣ X) = X³ and E(Z ∣ X) = 1/(1 + X²). Find
E(Y e^X − Z sin X ∣ X).
Answer

Uniform Distributions
As usual, continuous uniform distributions can give us some geometric insight.

Recall first that for n ∈ ℕ₊, the standard measure on ℝⁿ is

λ_n(A) = ∫_A 1 dx,  A ⊆ ℝⁿ  (4.7.36)

In particular, λ₁(A) is the length of A ⊆ ℝ, λ₂(A) is the area of A ⊆ ℝ², and λ₃(A) is the volume of A ⊆ ℝ³.

Details

With our usual setup, suppose that X takes values in S ⊆ ℝⁿ, Y takes values in T ⊆ ℝ, and that (X, Y) is uniformly distributed
on R ⊆ S × T ⊆ ℝ^{n+1}. So 0 < λ_{n+1}(R) < ∞, and the joint probability density function f of (X, Y) is given by
f(x, y) = 1/λ_{n+1}(R) for (x, y) ∈ R. Recall that uniform distributions, whether discrete or continuous, always have constant
densities. Finally, recall that the cross section of R at x ∈ S is T_x = {y ∈ T : (x, y) ∈ R}.

In the setting above, suppose that T_x is a bounded interval with midpoint m(x) and length l(x) for each x ∈ S. Then

1. E(Y ∣ X) = m(X)
2. var(Y ∣ X) = (1/12) l²(X)

Proof

So in particular, the regression curve x ↦ E(Y ∣ X = x) follows the midpoints of the cross-sectional intervals.

In each case below, suppose that (X, Y) is uniformly distributed on the given region. Find E(Y ∣ X) and var(Y ∣ X).

1. The rectangular region R = [a, b] × [c, d] where a < b and c < d.
2. The triangular region T = {(x, y) ∈ ℝ² : −a ≤ x ≤ y ≤ a} where a > 0.
3. The circular region C = {(x, y) ∈ ℝ² : x² + y² ≤ r²} where r > 0.
Answer

In the bivariate uniform experiment, select each of the following regions. In each case, run the simulation 2000 times and note
the relationship between the cloud of points and the graph of the regression function.
1. square
2. triangle
3. circle

Suppose that X is uniformly distributed on the interval (0, 1), and that given X, random variable Y is uniformly distributed on
(0, X). Find each of the following:

1. E(Y ∣ X)
2. E(Y )
3. var(Y ∣ X)
4. var(Y )
Answer
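The answers can also be confirmed by simulation. The following Python sketch is added for illustration (not part of the text); the theorems above give E(Y) = E(X/2) = 1/4 and var(Y) = E(X²/12) + var(X/2) = 1/36 + 1/48 = 7/144.

import random

random.seed(2)
n = 200_000
ys = []
for _ in range(n):
    x = random.random()                 # X uniform on (0, 1)
    ys.append(random.uniform(0, x))     # given X = x, Y uniform on (0, x)

mean = sum(ys) / n
var = sum((y - mean) ** 2 for y in ys) / n
print(mean, 1 / 4)      # both about 0.25
print(var, 7 / 144)     # both about 0.0486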

The Hypergeometric Distribution
Suppose that a population consists of m objects, and that each object is one of three types. There are a objects of type 1, b objects
of type 2, and m − a − b objects of type 0. The parameters a and b are positive integers with a + b < m . We sample n objects
from the population at random, and without replacement, where n ∈ {0, 1, … , m}. Denote the number of type 1 and 2 objects in
the sample by X and Y , so that the number of type 0 objects in the sample is n − X − Y . In the in the chapter on Distributions, we
showed that the joint, marginal, and conditional distributions of X and Y are all hypergeometric—only the parameters change.
Here is the relevant result for this section:

In the setting above,

1. E(Y ∣ X) = [b/(m − a)] (n − X)
2. var(Y ∣ X) = [b(m − a − b) / ((m − a)²(m − a − 1))] (n − X)(m − a − n + X)
3. E([Y − E(Y ∣ X)]²) = n(m − n)b(m − a − b) / [m(m − 1)(m − a)]

Proof

Note that E(Y ∣ X) is a linear function of X and hence E(Y ∣ X) = L(Y ∣ X) .
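The linearity of the regression function can be seen directly in simulation. The Python sketch below (added for illustration, with hypothetical parameters m = 50, a = 20, b = 15, n = 10 not taken from the text) samples without replacement, groups the outcomes by X = x, and compares the empirical conditional means with [b/(m − a)](n − x).

import random
from collections import defaultdict

random.seed(4)
m, a, b, n = 50, 20, 15, 10
population = [1] * a + [2] * b + [0] * (m - a - b)

totals = defaultdict(lambda: [0.0, 0])      # x -> [sum of y values, count]
for _ in range(200_000):
    sample = random.sample(population, n)   # sampling without replacement
    x, y = sample.count(1), sample.count(2)
    totals[x][0] += y
    totals[x][1] += 1

for x in sorted(totals):
    print(x, totals[x][0] / totals[x][1], b / (m - a) * (n - x))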

In a collection of 120 objects, 50 are classified as good, 40 as fair and 30 as poor. A sample of 20 objects is selected at random
and without replacement. Let X denote the number of good objects in the sample and Y the number of poor objects in the
sample. Find each of the following:
1. E(Y ∣ X)
2. var(Y ∣ X)
3. The predicted value of Y given X = 8
Answer

The Multinomial Trials Model


Suppose that we have a sequence of n independent trials, and that each trial results in one of three outcomes, denoted 0, 1, and 2.
On each trial, the probability of outcome 1 is p, the probability of outcome 2 is q, so that the probability of outcome 0 is 1 − p − q .
The parameters p, q ∈ (0, 1) with p + q < 1, and of course n ∈ ℕ₊. Let X denote the number of trials that resulted in outcome 1,
Y the number of trials that resulted in outcome 2, so that n − X − Y is the number of trials that resulted in outcome 0. In
the chapter on Distributions, we showed that the joint, marginal, and conditional distributions of X and Y are all multinomial—
only the parameters change. Here is the relevant result for this section:

In the setting above,

1. E(Y ∣ X) = [q/(1 − p)] (n − X)
2. var(Y ∣ X) = [q(1 − p − q)/(1 − p)²] (n − X)
3. E([Y − E(Y ∣ X)]²) = [q(1 − p − q)/(1 − p)] n
Proof

Note again that E(Y ∣ X) is a linear function of X and hence E(Y ∣ X) = L(Y ∣ X) .

Suppose that a fair, 12-sided die is thrown 50 times. Let X denote the number of throws that resulted in a number from 1 to 5,
and Y the number of throws that resulted in a number from 6 to 9. Find each of the following:
1. E(Y ∣ X)
2. var(Y ∣ X)
3. The predicted value of Y given X = 20
Answer

The Poisson Distribution


Recall that the Poisson distribution, named for Simeon Poisson, is widely used to model the number of “random points” in a region
of time or space, under certain ideal conditions. The Poisson distribution is studied in more detail in the chapter on the Poisson

Process. The Poisson distribution with parameter r ∈ (0, ∞) has probability density function f defined by

f(x) = e^{−r} r^x / x!,  x ∈ ℕ  (4.7.39)

The parameter r is the mean and variance of the distribution.

Suppose that X and Y are independent random variables, and that X has the Poisson distribution with parameter a ∈ (0, ∞)
and Y has the Poisson distribution with parameter b ∈ (0, ∞). Let N = X + Y. Then

1. E(X ∣ N) = [a/(a + b)] N
2. var(X ∣ N) = [ab/(a + b)²] N
3. E([X − E(X ∣ N)]²) = ab/(a + b)
Proof

Once again, E(X ∣ N ) is a linear function of N and so E(X ∣ N ) = L(X ∣ N ) . If we reverse the roles of the variables, the
conditional expected value is trivial from our basic properties:

E(N ∣ X) = E(X + Y ∣ X) = X + b (4.7.41)

Coins and Dice


A pair of fair dice are thrown, and the scores (X₁, X₂) recorded. Let Y = X₁ + X₂ denote the sum of the scores and
U = min{X₁, X₂} the minimum score. Find each of the following:

1. E(Y ∣ X₁)
2. E(U ∣ X₁)
3. E(Y ∣ U)
4. E(X₂ ∣ X₁)

Answer

A box contains 10 coins, labeled 0 to 9. The probability of heads for coin i is i/9. A coin is chosen at random from the box and
tossed. Find the probability of heads.
Answer

This problem is an example of Laplace's rule of succession, named for Pierre Simon Laplace.

Random Sums of Random Variables


Suppose that X = (X₁, X₂, …) is a sequence of independent and identically distributed real-valued random variables. We will
denote the common mean, variance, and moment generating function, respectively, by μ = E(X_i), σ² = var(X_i), and
G(t) = E(e^{tX_i}). Let

Y_n = ∑_{i=1}^n X_i,  n ∈ ℕ  (4.7.42)

so that (Y₀, Y₁, …) is the partial sum process associated with X. Suppose now that N is a random variable taking values in ℕ,
independent of X. Then

Y_N = ∑_{i=1}^N X_i  (4.7.43)

is a random sum of random variables; the terms in the sum are random, and the number of terms is random. This type of variable
occurs in many different contexts. For example, N might represent the number of customers who enter a store in a given period of
time, and X_i the amount spent by customer i, so that Y_N is the total revenue of the store during the period.

The conditional and ordinary expected value of Y_N are

1. E(Y_N ∣ N) = Nμ
2. E(Y_N) = E(N)μ

Proof

Wald's equation, named for Abraham Wald, is a generalization of the previous result to the case where N is not necessarily
independent of X, but rather is a stopping time for X. Roughly, this means that the event N = n depends only on (X₁, X₂, …, X_n).
Wald's equation is discussed in the chapter on Random Samples. An elegant proof of Wald's equation is given in the chapter on
Martingales. The advanced section on stopping times is in the chapter on Probability Measures.

The conditional and ordinary variance of Y_N are

1. var(Y_N ∣ N) = Nσ²
2. var(Y_N) = E(N)σ² + var(N)μ²

Proof

Let H denote the probability generating function of N. The conditional and ordinary moment generating function of Y_N are

1. E(e^{tY_N} ∣ N) = [G(t)]^N
2. E(e^{tY_N}) = H(G(t))

Proof

Thus the moment generating function of Y_N is H ∘ G, the composition of the probability generating function of N with the
common moment generating function of X, a simple and elegant result.
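A simulation sketch of the moment formulas (added for illustration, with hypothetical parameters not taken from the text): taking N Poisson with parameter 3, independent of exponential terms with rate 2, the theorems above give E(Y_N) = E(N)μ = 3 · (1/2) = 1.5 and var(Y_N) = E(N)σ² + var(N)μ² = 3/4 + 3/4 = 1.5.

import random

random.seed(3)

def poisson(a):
    # count unit-rate exponential interarrival times that fit in [0, a]
    count, t = 0, random.expovariate(1)
    while t < a:
        count += 1
        t += random.expovariate(1)
    return count

sums = []
for _ in range(100_000):
    n_terms = poisson(3)                       # N, independent of the terms
    sums.append(sum(random.expovariate(2) for _ in range(n_terms)))

m = sum(sums) / len(sums)
v = sum((s - m) ** 2 for s in sums) / len(sums)
print(m, v)   # both about 1.5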

In the die-coin experiment, a fair die is rolled and then a fair coin is tossed the number of times showing on the die. Let N
denote the die score and Y the number of heads. Find each of the following:
1. The conditional distribution of Y given N.
2. E(Y ∣ N)
3. var(Y ∣ N)
4. E(Y)
5. var(Y)
Answer

Run the die-coin experiment 1000 times and compare the empirical mean and standard deviation to the distribution mean and
standard deviation.

The number of customers entering a store in a given hour is a random variable with mean 20 and standard deviation 3. Each
customer, independently of the others, spends a random amount of money with mean $50 and standard deviation $5. Find the
mean and standard deviation of the amount of money spent during the hour.
Answer

A coin has a random probability of heads V and is tossed a random number of times N . Suppose that V is uniformly
distributed on [0, 1]; N has the Poisson distribution with parameter a > 0 ; and V and N are independent. Let Y denote the
number of heads. Compute the following:
1. E(Y ∣ N , V )
2. E(Y ∣ N )
3. E(Y ∣ V )
4. E(Y )
5. var(Y ∣ N , V )
6. var(Y )
Answer

Mixtures of Distributions
Suppose that X = (X₁, X₂, …) is a sequence of real-valued random variables. Denote the mean, variance, and moment
generating function of X_i by μ_i = E(X_i), σ_i² = var(X_i), and M_i(t) = E(e^{tX_i}), for i ∈ ℕ₊. Suppose also that N is a random
variable taking values in ℕ₊, independent of X. Denote the probability density function of N by p_n = P(N = n) for n ∈ ℕ₊.

The distribution of the random variable X_N is a mixture of the distributions of X = (X₁, X₂, …), with the distribution of N as
the mixing distribution.

The conditional and ordinary expected value of X_N are

1. E(X_N ∣ N) = μ_N
2. E(X_N) = ∑_{n=1}^∞ p_n μ_n

Proof

The conditional and ordinary variance of X_N are

1. var(X_N ∣ N) = σ_N²
2. var(X_N) = ∑_{n=1}^∞ p_n (σ_n² + μ_n²) − (∑_{n=1}^∞ p_n μ_n)².
Proof

The conditional and ordinary moment generating function of X_N are

1. E(e^{tX_N} ∣ N) = M_N(t)
2. E(e^{tX_N}) = ∑_{i=1}^∞ p_i M_i(t).
Proof

In the coin-die experiment, a biased coin is tossed with probability of heads 1/3. If the coin lands tails, a fair die is rolled; if the
coin lands heads, an ace-six flat die is rolled (faces 1 and 6 have probability 1/4 each, and faces 2, 3, 4, 5 have probability 1/8
each). Find the mean and standard deviation of the die score.
Answer

Run the coin-die experiment 1000 times and note the apparent convergence of the empirical mean and standard deviation to the
distribution mean and standard deviation.

This page titled 4.7: Conditional Expected Value is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

4.8: Expected Value and Covariance Matrices

The main purpose of this section is a discussion of expected value and covariance for random matrices and vectors. These topics
are somewhat specialized, but are particularly important in multivariate statistical models and for the multivariate normal
distribution. This section requires some prerequisite knowledge of linear algebra.
We assume that the various indices m, n, p, k that occur in this section are positive integers. Also we assume that expected values
of real-valued random variables that we reference exist as real numbers, although extensions to cases where expected values are ∞
or −∞ are straightforward, as long as we avoid the dreaded indeterminate form ∞ − ∞ .

Basic Theory
Linear Algebra
We will follow our usual convention of denoting random variables by upper case letters and nonrandom variables and constants by
lower case letters. In this section, that convention leads to notation that is a bit nonstandard, since the objects that we will be
dealing with are vectors and matrices. On the other hand, the notation we will use works well for illustrating the similarities
between results for random matrices and the corresponding results in the one-dimensional case. Also, we will try to be careful to
explicitly point out the underlying spaces where various objects live.
Let ℝ^{m×n} denote the space of all m × n matrices of real numbers. The (i, j) entry of a ∈ ℝ^{m×n} is denoted a_{ij} for
i ∈ {1, 2, …, m} and j ∈ {1, 2, …, n}. We will identify ℝⁿ with ℝ^{n×1}, so that an ordered n-tuple can also be thought of as an
n × 1 column vector. The transpose of a matrix a ∈ ℝ^{m×n} is denoted aᵀ—the n × m matrix whose (i, j) entry is the (j, i) entry
of a. Recall the definitions of matrix addition, scalar multiplication, and matrix multiplication. Recall also the standard inner
product (or dot product) of x, y ∈ ℝⁿ:

⟨x, y⟩ = x ⋅ y = xᵀy = ∑_{i=1}^n x_i y_i  (4.8.1)

The outer product of x and y is xyᵀ, the n × n matrix whose (i, j) entry is x_i y_j. Note that the inner product is the trace (sum of
the diagonal entries) of the outer product. Finally recall the standard norm on ℝⁿ, given by

∥x∥ = √⟨x, x⟩ = √(x₁² + x₂² + ⋯ + xₙ²)  (4.8.2)

Recall that inner product is bilinear, that is, linear (preserving addition and scalar multiplication) in each argument separately. As a
consequence, for x, y ∈ ℝⁿ,

∥x + y∥² = ∥x∥² + ∥y∥² + 2⟨x, y⟩  (4.8.3)

Expected Value of a Random Matrix


As usual, our starting point is a random experiment modeled by a probability space (Ω, F , P). So to review, Ω is the set of
outcomes, F the collection of events, and P the probability measure on the sample space (Ω, F ). It's natural to define the expected
value of a random matrix in a component-wise manner.

Suppose that X is an m × n matrix of real-valued random variables, whose (i, j) entry is denoted X_{ij}. Equivalently, X is a
random m × n matrix, that is, a random variable with values in ℝ^{m×n}. The expected value E(X) is defined to be the m × n
matrix whose (i, j) entry is E(X_{ij}), the expected value of X_{ij}.

Many of the basic properties of expected value of random variables have analogous results for expected value of random matrices,
with matrix operation replacing the ordinary ones. Our first two properties are the critically important linearity properties. The first
part is the additive property—the expected value of a sum is the sum of the expected values.

E(X + Y ) = E(X) + E(Y ) if X and Y are random m × n matrices.


Proof

The next part of the linearity properties is the scaling property—a nonrandom matrix factor can be pulled out of the expected value.

Suppose that X is a random n × p matrix.
1. E(aX) = aE(X) if a ∈ ℝ^{m×n}.
2. E(Xa) = E(X)a if a ∈ ℝ^{p×n}.
Proof

Recall that for independent, real-valued variables, the expected value of the product is the product of the expected values. Here is
the analogous result for random matrices.

E(XY ) = E(X)E(Y ) if X is a random m × n matrix, Y is a random n × p matrix, and X and Y are independent.
Proof

Actually the previous result holds if X and Y are simply uncorrelated in the sense that X_{ij} and Y_{jk} are uncorrelated for each
i ∈ {1, …, m}, j ∈ {1, 2, …, n} and k ∈ {1, 2, …, p}. We will study covariance of random vectors in the next subsection.

Covariance Matrices
Our next goal is to define and study the covariance of two random vectors.

Suppose that X is a random vector in ℝᵐ and Y is a random vector in ℝⁿ.

1. The covariance matrix of X and Y is the m × n matrix cov(X, Y) whose (i, j) entry is cov(X_i, Y_j), the ordinary
   covariance of X_i and Y_j.
2. Assuming that the coordinates of X and Y have positive variance, the correlation matrix of X and Y is the m × n matrix
   cor(X, Y) whose (i, j) entry is cor(X_i, Y_j), the ordinary correlation of X_i and Y_j

Many of the standard properties of covariance and correlation for real-valued random variables have extensions to random vectors.
For the following three results, X is a random vector in ℝᵐ and Y is a random vector in ℝⁿ.

cov(X, Y) = E([X − E(X)][Y − E(Y)]ᵀ)

Proof

Thus, the covariance of X and Y is the expected value of the outer product of X − E(X) and Y − E(Y ) . Our next result is the
computational formula for covariance: the expected value of the outer product of X and Y minus the outer product of the expected
values.

cov(X, Y) = E(XYᵀ) − E(X)[E(Y)]ᵀ.
Proof
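The computational formula is easy to verify by simulation. The numpy sketch below (hypothetical data, added for illustration and not part of the text) estimates cov(X, Y) as E(XYᵀ) − E(X)[E(Y)]ᵀ for a pair of random vectors in ℝ².

import numpy as np

rng = np.random.default_rng(0)
n = 200_000
X = rng.normal(size=(n, 2))                      # rows are samples of X in R^2
Y = np.stack([X[:, 0] + X[:, 1],                 # Y_1 = X_1 + X_2
              rng.normal(size=n)], axis=1)       # Y_2 independent of X

EXY = X.T @ Y / n                                # estimate of E(X Y^T)
cov = EXY - np.outer(X.mean(axis=0), Y.mean(axis=0))
print(cov)   # approximately [[1, 0], [1, 0]]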

The next result is the matrix version of the symmetry property.


cov(Y, X) = [cov(X, Y)]ᵀ.
Proof

In the following result, 0 denotes the m × n zero matrix.

cov(X, Y ) = 0 if and only if cov (Xi , Yj ) = 0 for each i and j , so that each coordinate of X is uncorrelated with each
coordinate of Y .
Proof

Naturally, when cov(X, Y ) = 0 , we say that the random vectors X and Y are uncorrelated. In particular, if the random vectors
are independent, then they are uncorrelated. The following results establish the bi-linear properties of covariance.

The additive properties.


1. cov(X + Y, Z) = cov(X, Z) + cov(Y, Z) if X and Y are random vectors in ℝᵐ and Z is a random vector in ℝⁿ.
2. cov(X, Y + Z) = cov(X, Y) + cov(X, Z) if X is a random vector in ℝᵐ, and Y and Z are random vectors in ℝⁿ.

Proof

The scaling properties
1. cov(aX, Y) = a cov(X, Y) if X is a random vector in ℝⁿ, Y is a random vector in ℝᵖ, and a ∈ ℝ^{m×n}.
2. cov(X, aY) = cov(X, Y)aᵀ if X is a random vector in ℝᵐ, Y is a random vector in ℝⁿ, and a ∈ ℝ^{k×n}.
Proof

Variance-Covariance Matrices
Suppose that X is a random vector in ℝⁿ. The covariance matrix of X with itself is called the variance-covariance matrix of
X:

vc(X) = cov(X, X) = E([X − E(X)][X − E(X)]ᵀ)  (4.8.6)

Recall that for an ordinary real-valued random variable X, var(X) = cov(X, X) . Thus the variance-covariance matrix of a
random vector in some sense plays the same role that variance does for a random variable.

vc(X) is a symmetric n × n matrix with (var(X₁), var(X₂), …, var(X_n)) on the diagonal.


Proof

The following result is the formula for the variance-covariance matrix of a sum, analogous to the formula for the variance of a sum
of real-valued variables.

vc(X + Y) = vc(X) + cov(X, Y) + cov(Y, X) + vc(Y) if X and Y are random vectors in ℝⁿ.

Proof

Recall that var(aX) = a² var(X) if X is a real-valued random variable and a ∈ ℝ. Here is the analogous result for the variance-
covariance matrix of a random vector.

vc(aX) = a vc(X) aᵀ if X is a random vector in ℝⁿ and a ∈ ℝ^{m×n}.
Proof

Recall that if X is a random variable, then var(X) ≥ 0 , and var(X) = 0 if and only if X is a constant (with probability 1). Here is
the analogous result for a random vector:

Suppose that X is a random vector in ℝⁿ.

1. vc(X) is either positive semi-definite or positive definite.
2. vc(X) is positive semi-definite but not positive definite if and only if there exists a ∈ ℝⁿ and c ∈ ℝ such that, with
   probability 1, aᵀX = ∑_{i=1}^n a_i X_i = c

Proof

Recall that since vc(X) is either positive semi-definite or positive definite, the eigenvalues and the determinant of vc(X) are
nonnegative. Moreover, if vc(X) is positive semi-definite but not positive definite, then one of the coordinates of X can be written
as a linear transformation of the other coordinates (and hence can usually be eliminated in the underlying model). By contrast, if
vc(X) is positive definite, then this cannot happen; vc(X) has positive eigenvalues and determinant and is invertible.

Best Linear Predictor


Suppose that X is a random vector in ℝᵐ and that Y is a random vector in ℝⁿ. We are interested in finding the function of X of
the form a + bX, where a ∈ ℝⁿ and b ∈ ℝ^{n×m}, that is closest to Y in the mean square sense. Functions of this form are
analogous to linear functions in the single variable case. However, unless a = 0, such functions are not linear transformations in
the sense of linear algebra, so the correct term is affine function of X. This problem is of fundamental importance in statistics when
random vector X, the predictor vector, is observable, but not random vector Y, the response vector. Our discussion here
generalizes the one-dimensional case, when X and Y are random variables. That problem was solved in the section on Covariance
and Correlation. We will assume that vc(X) is positive definite, so that vc(X) is invertible, and none of the coordinates of X can
be written as an affine function of the other coordinates. We write vc⁻¹(X) for the inverse instead of the clunkier [vc(X)]⁻¹.

As with the single variable case, the solution turns out to be the affine function that has the same expected value as Y , and whose
covariance with X is the same as that of Y .

Define L(Y ∣ X) = E(Y) + cov(Y, X) vc⁻¹(X) [X − E(X)]. Then L(Y ∣ X) is the only affine function of X in ℝⁿ
satisfying
1. E[L(Y ∣ X)] = E(Y)
2. cov[L(Y ∣ X), X] = cov(Y, X)
Proof

A simple corollary is that Y − L(Y ∣ X) is uncorrelated with any affine function of X:

If U is an affine function of X then


1. cov [Y − L(Y ∣ X), U ] = 0
2. E (⟨Y − L(Y ∣ X), U ⟩) = 0
Proof

The variance-covariance matrix of L(Y ∣ X) , and its covariance matrix with Y turn out to be the same, again analogous to the
single variable case.

Additional properties of L(Y ∣ X):

1. cov[Y, L(Y ∣ X)] = cov(Y, X) vc⁻¹(X) cov(X, Y)
2. vc[L(Y ∣ X)] = cov(Y, X) vc⁻¹(X) cov(X, Y)

Proof

Next is the fundamental result that L(Y ∣ X) is the affine function of X that is closest to Y in the mean square sense.

Suppose that U ∈ ℝⁿ is an affine function of X. Then
1. E(∥Y − L(Y ∣ X)∥²) ≤ E(∥Y − U∥²)
2. Equality holds in (a) if and only if U = L(Y ∣ X) with probability 1.


Proof

The variance-covariance matrix of the difference between Y and the best affine approximation is given in the next theorem.
vc[Y − L(Y ∣ X)] = vc(Y) − cov(Y, X) vc⁻¹(X) cov(X, Y)

Proof

The actual mean square error when we use L(Y ∣ X) to approximate Y, namely E(∥Y − L(Y ∣ X)∥²), is the trace (sum of the
diagonal entries) of the variance-covariance matrix above. The function of x given by

L(Y ∣ X = x) = E(Y) + cov(Y, X) vc⁻¹(X) [x − E(X)]  (4.8.16)

is known as the (distribution) linear regression function. If we observe x then L(Y ∣ X = x) is our best affine prediction of Y.
Multiple linear regression is more powerful than it may at first appear, because it can be applied to non-linear transformations of
the random vectors. That is, if g : ℝᵐ → ℝʲ and h : ℝⁿ → ℝᵏ then L[h(Y) ∣ g(X)] is the affine function of g(X) that is closest
to h(Y) in the mean square sense. Of course, we must be able to compute the appropriate means, variances, and covariances.
Moreover, non-linear regression with a single, real-valued predictor variable can be thought of as a special case of multiple linear
regression. Thus, suppose that X is the predictor variable, Y is the response variable, and that (g₁, g₂, …, g_n) is a sequence of
real-valued functions. We can apply the results of this section to find the linear function of (g₁(X), g₂(X), …, g_n(X)) that is
closest to Y in the mean square sense. We just replace X_i with g_i(X) for each i. Again, we must be able to compute the
appropriate means, variances, and covariances to do this.
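The regression formula translates directly into code. The numpy sketch below (hypothetical data, added for illustration and not part of the text) estimates E(X), E(Y), vc(X), and cov(Y, X) from a sample and forms L(Y ∣ X) = E(Y) + cov(Y, X) vc⁻¹(X) [X − E(X)]; it recovers the coefficients used to generate the data.

import numpy as np

rng = np.random.default_rng(1)
n = 100_000
X = rng.normal(size=(n, 2))
Y = X @ np.array([[2.0], [-1.0]]) + 3.0 + rng.normal(scale=0.5, size=(n, 1))

mX, mY = X.mean(axis=0), Y.mean(axis=0)
vcX = (X - mX).T @ (X - mX) / n           # sample vc(X), a 2 x 2 matrix
covYX = (Y - mY).T @ (X - mX) / n         # sample cov(Y, X), a 1 x 2 matrix
b = covYX @ np.linalg.inv(vcX)            # coefficient matrix of the affine predictor
a = mY - b @ mX                           # intercept
print(b, a)                               # b is about [[2, -1]], a about [3]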

Examples and Applications
Suppose that (X, Y ) has probability density function f defined by f (x, y) = x + y for 0 ≤x ≤1 , 0 ≤y ≤1 . Find each of
the following:
1. E(X, Y )
2. vc(X, Y )
Answer

Suppose that (X, Y ) has probability density function f defined by f (x, y) = 2(x + y) for 0 ≤x ≤y ≤1 . Find each of the
following:
1. E(X, Y )
2. vc(X, Y )
Answer

Suppose that (X, Y) has probability density function f defined by f(x, y) = 6x²y for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1. Find each of the
following:
1. E(X, Y )
2. vc(X, Y )
Answer

Suppose that (X, Y) has probability density function f defined by f(x, y) = 15x²y for 0 ≤ x ≤ y ≤ 1. Find each of the
following:
1. E(X, Y )
2. vc(X, Y )
3. L(Y ∣ X)
4. L[Y ∣ (X, X²)]

5. Sketch the regression curves on the same set of axes.


Answer

Suppose that (X, Y, Z) is uniformly distributed on the region {(x, y, z) ∈ ℝ³ : 0 ≤ x ≤ y ≤ z ≤ 1}. Find each of the
following:
1. E(X, Y , Z)
2. vc(X, Y , Z)
3. L [Z ∣ (X, Y )]
4. L [Y ∣ (X, Z)]
5. L [X ∣ (Y , Z)]
6. L [(Y , Z) ∣ X]
Answer

Suppose that X is uniformly distributed on (0, 1), and that given X , random variable Y is uniformly distributed on (0, X) .
Find each of the following:
1. E(X, Y )
2. vc(X, Y )
Answer

This page titled 4.8: Expected Value and Covariance Matrices is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

4.9: Expected Value as an Integral

In the introductory section, we defined expected value separately for discrete, continuous, and mixed distributions, using density
functions. In the section on additional properties, we showed how these definitions can be unified, by first defining expected value
for nonnegative random variables in terms of the right-tail distribution function. However, by far the best and most elegant
definition of expected value is as an integral with respect to the underlying probability measure. This definition and a review of the
properties of expected value are the goals of this section. No proofs are necessary (you will be happy to know), since all of the
results follow from the general theory of integration. However, to understand the exposition, you will need to review the advanced
sections on the integral with respect to a positive measure and the properties of the integral. If you are a new student of probability,
or are not interested in the measure-theoretic detail of the subject, you can safely skip this section.

Definitions
As usual, our starting point is a random experiment modeled by a probability space (Ω, F , P). So Ω is the set of outcomes, F is
the σ-algebra of events, and P is the probability measure on the sample space (Ω, F ).
Recall that a random variable X for the experiment is simply a measurable function from (Ω, F) into another measurable space
(S, S). When S ⊆ ℝⁿ, we assume that S is Lebesgue measurable, and we take S to be the σ-algebra of Lebesgue measurable
subsets of S. As noted above, here is the measure-theoretic definition:

If X is a real-valued random variable on the probability space, the expected value of X is defined as the integral of X with
respect to P, assuming that the integral exists:

E(X) = ∫ X dP (4.9.1)
Ω

Let's review how the integral is defined in stages, but now using the notation of probability theory.

Let S denote the support set of X, so that S is a measurable subset of ℝ.

1. If S is finite, then E(X) = ∑_{x∈S} x P(X = x).
2. If S ⊆ [0, ∞), then E(X) = sup{E(Y) : Y has finite range and 0 ≤ Y ≤ X}
3. For general S ⊆ ℝ, E(X) = E(X⁺) − E(X⁻) as long as the right side is not of the form ∞ − ∞, and where X⁺ and X⁻
   denote the positive and negative parts of X.
4. If A ∈ F, then E(X; A) = E(X 1_A), assuming that the expected value on the right exists.

Thus, as with integrals generally, an expected value can exist as a number in ℝ (in which case X is integrable), can exist as ∞ or
−∞, or can fail to exist. In reference to part (a), a random variable with a finite set of values in ℝ is a simple function in the
terminology of general integration. In reference to part (b), note that the expected value of a nonnegative random variable always
exists in [0, ∞]. In reference to part (c), E(X) exists if and only if either E(X⁺) < ∞ or E(X⁻) < ∞.

Our next goal is to restate the basic theorems and properties of integrals, but in the notation of probability. Unless otherwise noted,
all random variables are assumed to be real-valued.

Basic Properties
The Linear Properties
Perhaps the most important and basic properties are the linear properties. Part (a) is the additive property and part (b) is the scaling
property.

Suppose that X and Y are random variables whose expected values exist, and that c ∈ R . Then
1. E(X + Y ) = E(X) + E(Y ) as long as the right side is not of the form ∞ − ∞ .
2. E(cX) = cE(X)

Thus, part (a) holds if at least one of the expected values on the right is finite, or if both are ∞, or if both are −∞ . What is ruled
out are the two cases where one expected value is ∞ and the other is −∞ , and this is what is meant by the indeterminate form
∞−∞ .

Equality and Order


Our next set of properties deal with equality and order. First, the expected value of a random variable over a null set is 0.

If X is a random variable and A is an event with P(A) = 0, then E(X; A) = 0.

Random variables that are equivalent have the same expected value

If X is a random variable whose expected value exists, and Y is a random variable with P(X = Y ) = 1 , then E(X) = E(Y ) .

Our next result is the positive property of expected value.

Suppose that X is a random variable and P(X ≥ 0) = 1 . Then


1. E(X) ≥ 0
2. E(X) = 0 if and only if P(X = 0) = 1 .

So, if X is a nonnegative random variable then E(X) > 0 if and only if P(X > 0) > 0 . The next result is the increasing property
of expected value, perhaps the most important property after linearity.

Suppose that X, Y are random variables whose expected values exist, and that P(X ≤ Y ) = 1 . Then
1. E(X) ≤ E(Y )
2. Except in the case that both expected values are ∞ or both −∞ , E(X) = E(Y ) if and only if P(X = Y ) = 1 .

So if X ≤ Y with probability 1 then, except in the two cases mentioned, E(X) < E(Y ) if and only if P(X < Y ) > 0 . The next
result is the absolute value inequality.

Suppose that X is a random variable whose expected value exists. Then


1. |E(X)| ≤ E (|X|)
2. If E(X) is finite, then equality holds in (a) if and only if P(X ≥ 0) = 1 or P(X ≤ 0) = 1 .

Change of Variables and Density Functions


The Change of Variables Theorem
Suppose now that X is a general random variable on the probability space (Ω, F , P), taking values in a measurable space (S, S ).
Recall that the probability distribution of X is the probability measure P on (S, S ) given by P (A) = P(X ∈ A) for A ∈ S .
This is a special case of a new positive measure induced by a given positive measure and a measurable function. If g : S → R is
measurable, then g(X) is a real-valued random variable. The following result shows how to compute the expected value of g(X)
as an integral with respect to the distribution of X, and is known as the change of variables theorem.

If g : S → ℝ is measurable then, assuming that the expected value exists,

E[g(X)] = ∫_S g(x) dP(x)  (4.9.2)

So, using the original definition and the change of variables theorem, and giving the variables explicitly for emphasis, we have

E[g(X)] = ∫_Ω g[X(ω)] dP(ω) = ∫_S g(x) dP(x)  (4.9.3)

The Radon-Nikodym Theorem


Suppose now that μ is a positive measure on (S, S), and that the distribution of X is absolutely continuous with respect to μ. Recall
that this means that μ(A) = 0 implies P(A) = P(X ∈ A) = 0 for A ∈ S. By the Radon-Nikodym theorem, named for Johann
Radon and Otto Nikodym, X has a probability density function f with respect to μ. That is,

P(A) = P(X ∈ A) = ∫_A f dμ,  A ∈ S  (4.9.4)

In this case, we can write the expected value of g(X) as an integral with respect to the probability density function.

If g : S → ℝ is measurable then, assuming that the expected value exists,

E[g(X)] = ∫_S g f dμ  (4.9.5)

Again, giving the variables explicitly for emphasis, we have the following chain of integrals:

E[g(X)] = ∫_Ω g[X(ω)] dP(ω) = ∫_S g(x) dP(x) = ∫_S g(x)f(x) dμ(x)  (4.9.6)

There are two critically important special cases.

Discrete Distributions
Suppose first that (S, S , #) is a discrete measure space, so that S is countable, S = P(S) is the collection of all subsets of S ,
and # is counting measure on (S, S ). Thus, X has a discrete distribution on S , and this distribution is always absolutely
continuous with respect to #. Specifically, #(A) = 0 if and only if A = ∅ and of course P(X ∈ ∅) = 0 . The probability density
function f of X with respect to #, as we know, is simply f (x) = P(X = x) for x ∈ S . Moreover, integrals with respect to # are
sums, so

E[g(X)] = ∑_{x∈S} g(x) f(x)  (4.9.7)

assuming that the expected value exists. Existence in this case means that either the sum of the positive terms is finite or the sum of
the negative terms is finite, so that the sum makes sense (and in particular does not depend on the order in which the terms are
added). Specializing further, if X itself is real-valued and g is the identity function we have

E(X) = ∑_{x∈S} x f(x)  (4.9.8)

which was our original definition of expected value in the discrete case.

Continuous Distributions
For the second special case, suppose that (S, S , λ ) is a Euclidean measure space, so that S is a Lebesgue measurable subset of
n

R
n
for some n ∈ N , S is the σ-algebra of Lebesgue measurable subsets of S , and λ is Lebesgue measure on (S, S ). The
+ n

distribution of X is absolutely continuous with respect to λ if λ (A) = 0 implies P(X ∈ A) = 0 for A ∈ S . If this is the case,
n n

then a probability density function f of X has its usual meaning. Thus,

E [g(X)] = ∫ g(x)f (x) dλn (x) (4.9.9)


S

assuming that the expected value exists. When g is a typically nice function, this integral reduces to an ordinary n -dimensional
Riemann integral of calculus. Specializing further, if X is itself real-valued and g = 1 then

E(X) = ∫_S x f(x) dx   (4.9.10)

which was our original definition of expected value in the continuous case.
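
The continuous case can be illustrated the same way; in the sketch below the density f(x) = 2x on S = [0, 1] and the midpoint-rule approximation are assumptions chosen purely for the example (for this density E(X) = 2/3 and E(X^2) = 1/2).

    import numpy as np

    def f(x):
        return 2.0 * x          # hypothetical density on S = [0, 1]

    def g(x):
        return x ** 2           # a "nice" measurable function

    # midpoint-rule approximation of the Riemann integral over S
    edges = np.linspace(0.0, 1.0, 100_001)
    mid = 0.5 * (edges[:-1] + edges[1:])
    dx = np.diff(edges)

    print(np.sum(g(mid) * f(mid) * dx))   # ≈ E[g(X)] = 1/2
    print(np.sum(mid * f(mid) * dx))      # ≈ E(X) = 2/3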

Interchange Properties
In this subsection, we review properties that allow the interchange of expected value and other operations: limits of sequences,
infinite sums, and integrals. We assume again that the random variables are real-valued unless otherwise specified.

Limits
Our first set of convergence results deals with the interchange of expected value and limits. We start with the expected value
version of Fatou's lemma, named in honor of Pierre Fatou. Its usefulness stems from the fact that no assumptions are placed on the
random variables, except that they be nonnegative.

Suppose that X_n is a nonnegative random variable for n ∈ N_+. Then

E(lim inf_{n→∞} X_n) ≤ lim inf_{n→∞} E(X_n)   (4.9.11)

Our next set of results gives conditions for the interchange of expected value and limits.

Suppose that X_n is a random variable for each n ∈ N_+. Then

E(lim_{n→∞} X_n) = lim_{n→∞} E(X_n)   (4.9.12)

in each of the following cases:

a. X_n is nonnegative for each n ∈ N_+ and X_n is increasing in n.
b. E(X_n) exists for each n ∈ N_+, E(X_1) > −∞, and X_n is increasing in n.
c. E(X_n) exists for each n ∈ N_+, E(X_1) < ∞, and X_n is decreasing in n.
d. lim_{n→∞} X_n exists, and |X_n| ≤ Y for n ∈ N_+ where Y is a nonnegative random variable with E(Y) < ∞.
e. lim_{n→∞} X_n exists, and |X_n| ≤ c for n ∈ N_+ where c is a positive constant.

Statements about the random variables in the theorem above (nonnegative, increasing, existence of limit, etc.) need only hold with
probability 1. Part (a) is the monotone convergence theorem, one of the most important convergence results and in a sense, essential
to the definition of the integral in the first place. Parts (b) and (c) are slight generalizations of the monotone convergence theorem.
In parts (a), (b), and (c), note that lim_{n→∞} X_n exists (with probability 1), although the limit may be ∞ in parts (a) and (b) and
−∞ in part (c) (with positive probability). Part (d) is the dominated convergence theorem, another of the most important

convergence results. It's sometimes also known as Lebesgue's dominated convergence theorem in honor of Henri Lebesgue. Part (e)
is a corollary of the dominated convergence theorem, and is known as the bounded convergence theorem.

Infinite Series
Our next results involve the interchange of expected value and an infinite sum, so these results generalize the basic additivity
property of expected value.

Suppose that X_n is a random variable for n ∈ N_+. Then

E(∑_{n=1}^∞ X_n) = ∑_{n=1}^∞ E(X_n)   (4.9.13)

in each of the following cases:


a. X_n is nonnegative for each n ∈ N_+.
b. E(∑_{n=1}^∞ |X_n|) < ∞

Part (a) is a consequence of the monotone convergence theorem, and part (b) is a consequence of the dominated convergence theorem. In (b), note that ∑_{n=1}^∞ |X_n| < ∞ and hence ∑_{n=1}^∞ X_n is absolutely convergent with probability 1. Our next result is the additivity of the expected value over a countably infinite collection of disjoint events.

Suppose that X is a random variable whose expected value exists, and that {A_n : n ∈ N_+} is a disjoint collection of events. Let A = ⋃_{n=1}^∞ A_n. Then

E(X; A) = ∑_{n=1}^∞ E(X; A_n)   (4.9.14)

Of course, the previous theorem applies in particular if X is nonnegative.

Integrals
Suppose that (T, T, μ) is a σ-finite measure space, and that X_t is a real-valued random variable for each t ∈ T. Thus we can think of {X_t : t ∈ T} as a stochastic process indexed by T. We assume that (ω, t) ↦ X_t(ω) is measurable, as a function from the product space (Ω × T, F ⊗ T) into R. Our next result involves the interchange of expected value and integral, and is a consequence of Fubini's theorem, named for Guido Fubini.

Under the assumptions above,

E[∫_T X_t dμ(t)] = ∫_T E(X_t) dμ(t)   (4.9.15)

in each of the following cases:


a. X_t is nonnegative for each t ∈ T.
b. ∫_T E(|X_t|) dμ(t) < ∞

Fubini's theorem actually states that the two iterated integrals above equal the joint integral

∫_{Ω×T} X_t(ω) d(P ⊗ μ)(ω, t)   (4.9.16)

where of course, P ⊗ μ is the product measure on (Ω × T , F ⊗ T ) . However, our interest is usually in evaluating the iterated
integral above on the left in terms of the iterated integral on the right. Part (a) is the expected value version of Tonelli's theorem,
named for Leonida Tonelli.

Examples and Exercises


You may have worked some of the computational exercises before, but try to see them in a new light, in terms of the general theory
of integration.

The Cauchy Distribution


Recall that the Cauchy distribution, named for Augustin Cauchy, is a continuous distribution with probability density function f

given by
f(x) = 1 / [π(1 + x^2)],   x ∈ R   (4.9.17)

The Cauchy distribution is studied in more generality in the chapter on Special Distributions.

Suppose that X has the Cauchy distribution.


1. Show that E(X) does not exist.
2. Find E(X^2)

Answer

Open the Cauchy Experiment and keep the default parameters. Run the experiment 1000 times and note the behavior of the sample mean.
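
If the app is not available, the following Python simulation is a rough stand-in (the seed and sample size are arbitrary choices): the running sample mean of Cauchy samples fluctuates without settling down, reflecting the fact that E(X) does not exist.

    import numpy as np

    rng = np.random.default_rng(seed=3)
    x = rng.standard_cauchy(1000)
    running_mean = np.cumsum(x) / np.arange(1, x.size + 1)

    # snapshots of the running sample mean; they do not stabilize
    for n in (10, 100, 500, 1000):
        print(n, running_mean[n - 1])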

The Pareto Distribution


Recall that the Pareto distribution, named for Vilfredo Pareto, is a continuous distribution with probability density function f given
by
f(x) = a / x^{a+1},   x ∈ [1, ∞)   (4.9.18)

where a > 0 is the shape parameter. The Pareto distribution is studied in more generality in the chapter on Special Distributions.

Suppose that X has the Pareto distribution with shape parameter a. Find E(X) in the following cases:
1. 0 < a ≤ 1

2. a > 1
Answer

Open the special distribution simulator and select the Pareto distribution. Vary the shape parameter and note the shape of the
probability density function and the location of the mean. For various values of the parameter, run the experiment 1000 times
and compare the sample mean with the distribution mean.
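
A simple simulation can also stand in for the app here (the shape values and sample size below are arbitrary): if U is standard uniform then U^{-1/a} has the Pareto distribution with shape parameter a, and for a > 1 the sample mean should be close to the distribution mean a/(a − 1).

    import numpy as np

    rng = np.random.default_rng(seed=7)
    for a in (1.5, 2.0, 3.0):
        x = rng.random(1000) ** (-1.0 / a)    # Pareto(a) via the inverse CDF
        print(a, x.mean(), a / (a - 1.0))     # sample mean vs a/(a-1)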

Suppose that X has the Pareto distribution with shape parameter a. Find E(1/X^n) for n ∈ N_+.

Answer

Special Results for Nonnegative Variables


For a nonnegative variable, the moments can be obtained from integrals of the right-tail distribution function.

If X is a nonnegative random variable then

E(X^n) = ∫_0^∞ n x^{n−1} P(X > x) dx   (4.9.19)

Proof

When n = 1 we have E(X) = ∫_0^∞ P(X > x) dx. We saw this result before in the section on additional properties of expected value, but now we can understand the proof in terms of Fubini's theorem.
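
As a quick numerical sanity check of the case n = 1 (the exponential distribution used here is a hypothetical example, with P(X > x) = e^{−rx} and E(X) = 1/r):

    import numpy as np

    r = 2.0
    x = np.linspace(0.0, 40.0, 400_001)     # truncate the infinite integral
    tail = np.exp(-r * x)                   # right-tail function P(X > x)
    dx = np.diff(x)
    integral = np.sum(0.5 * (tail[:-1] + tail[1:]) * dx)   # trapezoid rule
    print(integral, 1.0 / r)                # both ≈ 0.5
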
For a random variable taking nonnegative integer values, the moments can be computed from sums involving the right-tail
distribution function.

Suppose that X has a discrete distribution, taking values in N. Then

E(X^n) = ∑_{k=1}^∞ [k^n − (k − 1)^n] P(X ≥ k)   (4.9.21)

Proof

When n = 1 we have E(X) = ∑_{k=1}^∞ P(X ≥ k). We saw this result before in the section on additional properties of expected value, but now we can understand the proof in terms of the interchange of sum and expected value.
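
Here is the analogous check for the case n = 1 (the geometric distribution on N used here, with P(X ≥ k) = (1 − p)^k and E(X) = (1 − p)/p, is a hypothetical example):

    p = 0.3
    q = 1.0 - p
    tail_sum = sum(q ** k for k in range(1, 10_000))   # Σ_{k≥1} P(X ≥ k)
    print(tail_sum, q / p)                             # both ≈ 2.3333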

This page titled 4.9: Expected Value as an Integral is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

4.10: Conditional Expected Value Revisited

Conditional expected value is much more important than one might at first think. In fact, conditional expected value is at the core
of modern probability theory because it provides the basic way of incorporating known information into a probability measure.

Basic Theory
Definition
As usual, our starting point is a random experiment modeled by a probability space (Ω, F, P), so that Ω is the set of outcomes, F
is the σ-algebra of events, and P is the probability measure on the sample space (Ω, F ). In our first elementary discussion, we
studied the conditional expected value of a real-valued random variable X given a general random variable Y. The more general
approach is to condition on a sub σ-algebra G of F . The sections on σ-algebras and measure theory are essential prerequisites for
this section.
Before we get to the definition, we need some preliminaries. First, all random variables mentioned are assumed to be real valued.
Next, the notion of equivalence plays a fundamental role in this section. Recall that random variables X_1 and X_2 are equivalent if P(X_1 = X_2) = 1. Equivalence really does define an equivalence relation on the collection of random variables defined on the sample space. Moreover, we often regard equivalent random variables as being essentially the same object. More precisely, from this point of view, the objects of our study are not individual random variables but rather equivalence classes of random variables under this equivalence relation. Finally, for A ∈ F, recall the notation for the expected value of X on the event A:

E(X; A) = E(X 1_A)   (4.10.1)

assuming of course that the expected value exists. For the remainder of this subsection, suppose that G is a sub σ-algebra of F .

Suppose that X is a random variable with E(|X|) < ∞ . The conditional expected value of X given G is the random variable
E(X ∣ G ) defined by the following properties:

1. E(X ∣ G) is measurable with respect to G.
2. If A ∈ G then E[E(X ∣ G); A] = E(X; A)

The basic idea is that E(X ∣ G ) is the expected value of X given the information in the σ-algebra G . Hopefully this idea will
become clearer during our study. The conditions above uniquely define E(X ∣ G ) up to equivalence. The proof of this fact is a
simple application of the Radon-Nikodym theorem, named for Johann Radon and Otto Nikodym.

Suppose again that X is a random variable with E(|X|) < ∞ .


1. There exists a random variable V satisfying the definition.
2. If V_1 and V_2 satisfy the definition, then P(V_1 = V_2) = 1 so that V_1 and V_2 are equivalent.

Proof

The following characterization might seem stronger but is in fact equivalent to the definition.

Suppose again that X is a random variable with E(|X|) < ∞ . Then E(X ∣ G ) is characterized by the following properties:
1. E(X ∣ G ) is measurable with respect to G
2. If U is measurable with respect to G and E(|U X|) < ∞ then E[U E(X ∣ G )] = E(U X) .
Proof

Properties
Our next discussion concerns some fundamental properties of conditional expected value. All equalities and inequalities are
understood to hold modulo equivalence, that is, with probability 1. Note also that many of the proofs work by showing that the
right hand side satisfies the properties in the definition for the conditional expected value on the left side. Once again we assume
that G is a sub σ-algebra of F.
Our first property is a simple consequence of the definition: X and E(X ∣ G ) have the same mean.

Suppose that X is a random variable with E(|X|) < ∞ . Then E[E(X ∣ G )] = E(X) .
Proof

The result above can often be used to compute E(X), by choosing the σ-algebra G in a clever way. We say that we are computing
E(X) by conditioning on G . Our next properties are fundamental: every version of expected value must satisfy the linearity

properties. The first part is the additive property and the second part is the scaling property.

Suppose that X and Y are random variables with E(|X|) < ∞ and E(|Y |) < ∞ , and that c ∈ R . Then
1. E(X + Y ∣ G ) = E(X ∣ G ) + E(Y ∣ G)

2. E(cX ∣ G ) = cE(X ∣ G )
Proof

The next set of properties are also fundamental to every notion of expected value. The first part is the positive property and the
second part is the increasing property.

Suppose again that X and Y are random variables with E(|X|) < ∞ and E(|Y |) < ∞ .
1. If X ≥ 0 then E(X ∣ G ) ≥ 0
2. If X ≤ Y then E(X ∣ G ) ≤ E(Y ∣ G)

Proof

The next few properties relate to the central idea that E(X ∣ G ) is the expected value of X given the information in the σ-algebra
G.

Suppose that X and V are random variables with E(|X|) < ∞ and E(|XV |) < ∞ and that V is measurable with respect to
G . Then E(V X ∣ G ) = V E(X ∣ G ) .

Proof

Compare this result with the scaling property. If V is measurable with respect to G then V is like a constant in terms of the
conditional expected value given G . On the other hand, note that this result implies the scaling property, since a constant can be
viewed as a random variable, and as such, is measurable with respect to any σ-algebra. As a corollary to this result, note that if X
itself is measurable with respect to G then E(X ∣ G ) = X . The following result gives the other extreme.

Suppose that X is a random variable with E(|X|) < ∞ . If X and G are independent then E(X ∣ G ) = E(X) .
Proof

Every random variable X is independent of the trivial σ-algebra {∅, Ω} so it follows that E(X ∣ {∅, Ω}) = E(X).
The next properties are consistency conditions, also known as the tower properties. When conditioning twice, with respect to
nested σ-algebras, the smaller one (representing the least amount of information) always prevails.

Suppose that X is a random variable with E(|X|) < ∞ and that H is a sub σ-algebra of G . Then
1. E[E(X ∣ H ) ∣ G ] = E(X ∣ H )
2. E[E(X ∣ G ) ∣ H ] = E(X ∣ H )
Proof

The next result gives Jensen's inequality for conditional expected value, named for Johan Jensen.

Suppose that X takes values in an interval S ⊆ R and that g : S → R is convex. If E(|X|) < ∞ and E(|g(X)|) < ∞ then
E[g(X) ∣ G ] ≥ g[E(X ∣ G )] (4.10.9)

Proof

Conditional Probability
For our next discussion, suppose as usual that G is a sub σ-algebra of F . The conditional probability of an event A given G can be
defined as a special case of conditional expected value. As usual, let 1_A denote the indicator random variable of A.

For A ∈ F we define
P(A ∣ G) = E(1_A ∣ G)   (4.10.11)

Thus, we have the following characterizations of conditional probability, which are special cases of the definition and the alternate
version:

If A ∈ F then P(A ∣ G ) is characterized (up to equivalence) by the following properties


1. P(A ∣ G ) is measurable with respect to G .
2. If B ∈ G then E[P(A ∣ G ); B] = P(A ∩ B)
Proof

If A ∈ F then P(A ∣ G ) is characterized (up to equivalence) by the following properties


1. P(A ∣ G ) is measurable with respect to G .
2. If U is measurable with respect to G and E(|U |) < ∞ then E[U P(A ∣ G )] = E(U ; A)

The properties above for conditional expected value, of course, have special cases for conditional probability. In particular, we can
compute the probability of an event by conditioning on a σ-algebra:

If A ∈ F then P(A) = E[P(A ∣ G )] .


Proof

Again, the last theorem is often a good way to compute P(A) when we know the conditional probability of A given G . This is a
very compact and elegant version of the law of total probability given first in the section on Conditional Probability in the chapter
on Probability Spaces and later in the section on Discrete Distributions in the Chapter on Distributions. The following theorem
gives the conditional version of the axioms of probability.

The following properties hold (as usual, modulo equivalence):


a. P(A ∣ G) ≥ 0 for every A ∈ F
b. P(Ω ∣ G) = 1
c. If {A_i : i ∈ I} is a countable disjoint subset of F then P(⋃_{i∈I} A_i ∣ G) = ∑_{i∈I} P(A_i ∣ G)

Proof

From the last result, it follows that other standard probability rules hold for conditional probability given G (as always, modulo
equivalence). These results include
the complement rule
the increasing property
Boole's inequality
Bonferroni's inequality
the inclusion-exclusion laws
However, it is not correct to state that A ↦ P(A ∣ G ) is a probability measure, because the conditional probabilities are only
defined up to equivalence, and so the mapping does not make sense. We would have to specify a particular version of P(A ∣ G) for each A ∈ F for the mapping to make sense. Even if we do this, the mapping may not define a probability measure. In part (c), the left and right sides are random variables and the equation is an event that has probability 1. However this event depends on the collection {A_i : i ∈ I}. In general, there will be uncountably many such collections in F, and the intersection of all of the
corresponding events may well have probability less than 1 (if it's measurable at all). It turns out that if the underlying probability
space (Ω, F , P) is sufficiently “nice” (and most probability spaces that arise in applications are nice), then there does in fact exist a
regular conditional probability. That is, for each A ∈ F , there exists a random variable P(A ∣ G ) satisfying the conditions in (12)
and such that with probability 1, A ↦ P(A ∣ G ) is a probability measure.
The following theorem gives a version of Bayes' theorem, named for the inimitable Thomas Bayes.

Suppose that A ∈ G and B ∈ F. Then

P(A ∣ B) = E[P(B ∣ G); A] / E[P(B ∣ G)]   (4.10.14)

Proof

Basic Examples
The purpose of this discussion is to tie the general notions of conditional expected value that we are studying here to the more
elementary concepts that you have seen before. Suppose that A is an event (that is, a member of F ) with P(A) > 0 . If B is
another event, then of course, the conditional probability of B given A is
P(B ∣ A) = P(A ∩ B) / P(A)   (4.10.15)

If X is a random variable then the conditional distribution of X given A is the probability measure on R given by

R ↦ P(X ∈ R ∣ A) = P({X ∈ R} ∩ A) / P(A)   for measurable R ⊆ R   (4.10.16)

If E(|X|) < ∞ then the conditional expected value of X given A , denoted E(X ∣ A) , is simply the mean of this conditional
distribution.
Suppose now that A = {A_i : i ∈ I} is a countable partition of the sample space Ω into events with positive probability. To review the jargon, A ⊆ F; the index set I is countable; A_i ∩ A_j = ∅ for distinct i, j ∈ I; ⋃_{i∈I} A_i = Ω; and P(A_i) > 0 for i ∈ I. Let G = σ(A), the σ-algebra generated by A. The elements of G are of the form ⋃_{j∈J} A_j for J ⊆ I. Moreover, the random variables that are measurable with respect to G are precisely the variables that are constant on A_i for each i ∈ I. The σ-algebra G is said to be countably generated.

If B ∈ F then P(B ∣ G) is the random variable whose value on A_i is P(B ∣ A_i) for each i ∈ I.
Proof

In this setting, the version of Bayes' theorem in (15) reduces to the usual elementary formulation: for i ∈ I, E[P(B ∣ G); A_i] = P(A_i) P(B ∣ A_i) and E[P(B ∣ G)] = ∑_{j∈I} P(A_j) P(B ∣ A_j). Hence

P(A_i ∣ B) = P(A_i) P(B ∣ A_i) / ∑_{j∈I} P(A_j) P(B ∣ A_j)   (4.10.18)

If X is a random variable with E(|X|) < ∞, then E(X ∣ G) is the random variable whose value on A_i is E(X ∣ A_i) for each i ∈ I.

Proof
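
For a concrete feel for this result, here is a Monte Carlo sketch in Python; the variables below (a discrete Y generating the partition, and X depending on Y) are assumptions invented for the example. On each cell A_i = {Y = i}, E(X ∣ G) is approximated by the average of X over that cell, and averaging the cell values against P(A_i) recovers E(X), as in the mean property above.

    import numpy as np

    rng = np.random.default_rng(seed=11)
    n = 100_000
    y = rng.integers(0, 3, size=n)               # Y defines the partition
    x = y + rng.normal(0.0, 1.0, size=n)         # X depends on Y plus noise

    # E(X | A_i): the value of E(X | G) on each cell of the partition
    cond_mean = {i: x[y == i].mean() for i in range(3)}
    print(cond_mean)                             # ≈ {0: 0.0, 1: 1.0, 2: 2.0}

    # E[E(X | G)] = E(X): weight the cell values by P(A_i)
    ex = sum(cond_mean[i] * np.mean(y == i) for i in range(3))
    print(ex, x.mean())                          # the two agree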

The previous examples would apply to G = σ(Y ) if Y is a discrete random variable taking values in a countable set T . In this
case, the partition is simply A = {{Y = y} : y ∈ T } . On the other hand, suppose that Y is a random variable taking values in a
general set T with σ-algebra T . The real-valued random variables that are measurable with respect to G = σ(Y ) are (up to
equivalence) the measurable, real-valued functions of Y .
Specializing further, suppose that X takes values in S ⊆ R, Y takes values in T ⊆ R^n (where S and T are Lebesgue measurable), and that (X, Y) has a joint continuous distribution with probability density function f. Then Y has probability density function h given by

h(y) = ∫_S f(x, y) dx,   y ∈ T   (4.10.20)

Assume that h(y) > 0 for y ∈ T. Then for y ∈ T, a conditional probability density function of X given Y = y is defined by

g(x ∣ y) = f(x, y) / h(y),   x ∈ S   (4.10.21)

This is precisely the setting of our elementary discussion of conditional expected value. If E(|X|) < ∞ then we usually write
E(X ∣ Y ) instead of the clunkier E[X ∣ σ(Y )].

In this setting above suppose that E(|X|) < ∞ . Then

E(X ∣ Y) = ∫_S x g(x ∣ Y) dx   (4.10.22)

Proof

Best Predictor
In our elementary treatment of conditional expected value, we showed that the conditional expected value of a real-valued random
variable X given a general random variable Y is the best predictor of X, in the least squares sense, among all real-valued functions
of Y . A more careful statement is that E(X ∣ Y ) is the best predictor of X among all real-valued random variables that are
measurable with respect to σ(Y ). Thus, it should come as not surprise that if G is a sub σ-algebra of F , then E(X ∣ G ) is the best
predictor of X, in the least squares sense, among all real-valued random variables that are measurable with respect to G ). We will
show that this is indeed the case in this subsection. The proofs are very similar to the ones given in the elementary section. For the
rest of this discussion, we assume that G is a sub σ-algebra of F and that all random variables mentioned are real valued.

Suppose that X and U are random variables with E(|X|) < ∞ and E(|XU |) < ∞ and that U is measurable with respect to
G . Then X − E(X ∣ G ) and U are uncorrelated.

Proof

The next result is the main one: E(X ∣ G ) is closer to X in the mean square sense than any other random variable that is
measurable with respect to G . Thus, if G represents the information that we have, then E(X ∣ G ) is the best we can do in
estimating X.

Suppose that X and U are random variables with E(X^2) < ∞ and E(U^2) < ∞ and that U is measurable with respect to G. Then

1. E([X − E(X ∣ G)]^2) ≤ E[(X − U)^2].
2. Equality holds if and only if P[U = E(X ∣ G)] = 1, so U and E(X ∣ G) are equivalent.
Proof

Conditional Variance
Once again, we assume that G is a sub σ-algebra of F and that all random variables mentioned are real valued, unless otherwise
noted. It's natural to define the conditional variance of a random variable given G in the same way as ordinary variance, but with all
expected values conditioned on G .

Suppose that X is a random variable with E(X^2) < ∞. The conditional variance of X given G is

var(X ∣ G) = E([X − E(X ∣ G)]^2 ∣ G)   (4.10.27)

Like all conditional expected values relative to G , var(X ∣ G ) is a random variable that is measurable with respect to G and is
unique up to equivalence. The first property is analogous to the computational formula for ordinary variance.

Suppose again that X is a random variable with E(X^2) < ∞. Then

var(X ∣ G) = E(X^2 ∣ G) − [E(X ∣ G)]^2   (4.10.28)

Proof

Next is a formula for the ordinary variance in terms of conditional variance and expected value.

Suppose again that X is a random variable with E(X^2) < ∞. Then

var(X) = E[var(X ∣ G)] + var[E(X ∣ G)]   (4.10.31)

Proof

So the variance of X is the expected conditional variance plus the variance of the conditional expected value. This result is often a
good way to compute var(X) when we know the conditional distribution of X given G . In turn, this property leads to a formula for
the mean square error when E(X ∣ G ) is thought of as a predictor of X.

Suppose again that X is a random variable with E(X^2) < ∞. Then

E([X − E(X ∣ G)]^2) = var(X) − var[E(X ∣ G)]   (4.10.32)

Proof
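
A short simulation makes the variance decomposition concrete; the setup below (G generated by a discrete Y, with X depending on Y) is a made-up example, and the conditional moments are estimated cell by cell.

    import numpy as np

    rng = np.random.default_rng(seed=5)
    n = 200_000
    y = rng.integers(0, 4, size=n)                  # generates G
    x = 2.0 * y + rng.normal(0.0, 1.5, size=n)      # E(X | Y = i) = 2i

    probs = np.array([np.mean(y == i) for i in range(4)])
    cond_mean = np.array([x[y == i].mean() for i in range(4)])
    cond_var = np.array([x[y == i].var() for i in range(4)])

    lhs = x.var()                                   # var(X)
    rhs = np.sum(probs * cond_var) \
        + np.sum(probs * (cond_mean - x.mean()) ** 2)
    print(lhs, rhs)                                 # approximately equal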

Let us return to the study of predictors of the real-valued random variable X, and compare them in terms of mean square error.

Suppose again that X is a random variable with E(X^2) < ∞.

a. The best constant predictor of X is E(X) with mean square error var(X).
b. If Y is another random variable with E(Y^2) < ∞, then the best predictor of X among linear functions of Y is L(X ∣ Y) = E(X) + [cov(X, Y) / var(Y)] [Y − E(Y)]   (4.10.34), with mean square error var(X)[1 − cor^2(X, Y)].
c. If Y is a (general) random variable, then the best predictor of X among all real-valued functions of Y with finite variance is E(X ∣ Y) with mean square error var(X) − var[E(X ∣ Y)].
d. If G is a sub σ-algebra of F, then the best predictor of X among random variables with finite variance that are measurable with respect to G is E(X ∣ G) with mean square error var(X) − var[E(X ∣ G)].

Of course, (a) is a special case of (d) with G = {∅, Ω} and (c) is a special case of (d) with G = σ(Y ) . Only (b), the linear case,
cannot be interpreted in terms of conditioning with respect to a σ-algebra.

Conditional Covariance
Suppose again that G is a sub σ-algebra of F . The conditional covariance of two random variables is defined like the ordinary
covariance, but with all expected values conditioned on G .

Suppose that X and Y are random variables with E(X^2) < ∞ and E(Y^2) < ∞. The conditional covariance of X and Y given G is defined as

cov(X, Y ∣ G) = E([X − E(X ∣ G)][Y − E(Y ∣ G)] ∣ G)   (4.10.35)

So cov(X, Y ∣ G ) is a random variable that is measurable with respect to G and is unique up to equivalence. As should be the case,
conditional covariance generalizes conditional variance.

Suppose that X is a random variable with E(X^2) < ∞. Then cov(X, X ∣ G) = var(X ∣ G).
Proof

Our next result is a computational formula that is analogous to the one for standard covariance—the covariance is the mean of the
product minus the product of the means, but now with all expected values conditioned on G :

Suppose again that X and Y are random variables with E(X^2) < ∞ and E(Y^2) < ∞. Then

cov(X, Y ∣ G) = E(XY ∣ G) − E(X ∣ G) E(Y ∣ G)   (4.10.36)

Proof

Our next result shows how to compute the ordinary covariance of X and Y by conditioning on G.

Suppose again that X and Y are random variables with E(X^2) < ∞ and E(Y^2) < ∞. Then

cov(X, Y) = E[cov(X, Y ∣ G)] + cov[E(X ∣ G), E(Y ∣ G)]   (4.10.39)

Proof

Thus, the covariance of X and Y is the expected conditional covariance plus the covariance of the conditional expected values.
This result is often a good way to compute cov(X, Y ) when we know the conditional distribution of (X, Y ) given G .

This page titled 4.10: Conditional Expected Value Revisited is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

4.11: Vector Spaces of Random Variables

Basic Theory
Many of the concepts in this chapter have elegant interpretations if we think of real-valued random variables as vectors in a vector
space. In particular, variance and higher moments are related to the concept of norm and distance, while covariance is related to
inner product. These connections can help unify and illuminate some of the ideas in the chapter from a different point of view. Of
course, real-valued random variables are simply measurable, real-valued functions defined on the sample space, so much of the
discussion in this section is a special case of our discussion of function spaces in the chapter on Distributions, but recast in the
notation of probability.
As usual, our starting point is a random experiment modeled by a probability space (Ω, F , P). Thus, Ω is the set of outcomes, F is
the σ-algebra of events, and P is the probability measure on the sample space (Ω, F ). Our basic vector space V consists of all
real-valued random variables defined on (Ω, F, P) (that is, defined for the experiment). Recall that random variables X_1 and X_2 are equivalent if P(X_1 = X_2) = 1, in which case we write X_1 ≡ X_2. We consider two such random variables as the same vector, so that technically, our vector space consists of equivalence classes under this equivalence relation. The addition operator corresponds to the usual addition of two real-valued random variables, and the operation of scalar multiplication corresponds to the usual multiplication of a real-valued random variable by a real (non-random) number. These operations are compatible with the equivalence relation in the sense that if X_1 ≡ X_2 and Y_1 ≡ Y_2 then X_1 + Y_1 ≡ X_2 + Y_2 and cX_1 ≡ cX_2 for c ∈ R. In short, the vector space V is well-defined.

Norm
Suppose that k ∈ [1, ∞). The k norm of X ∈ V is defined by

∥X∥_k = [E(|X|^k)]^{1/k}   (4.11.1)

Thus, ∥X∥_k is a measure of the size of X in a certain sense, and of course it's possible that ∥X∥_k = ∞. The following theorems establish the fundamental properties. The first is the positive property.

Suppose again that k ∈ [1, ∞). For X ∈ V,

1. ∥X∥_k ≥ 0
2. ∥X∥_k = 0 if and only if P(X = 0) = 1 (so that X ≡ 0).


Proof

The next result is the scaling property.

Suppose again that k ∈ [1, ∞). Then ∥cX∥_k = |c| ∥X∥_k for X ∈ V and c ∈ R.
Proof

The next result is Minkowski's inequality, named for Hermann Minkowski, and also known as the triangle inequality.

Suppose again that k ∈ [1, ∞). Then ∥X + Y∥_k ≤ ∥X∥_k + ∥Y∥_k for X, Y ∈ V.


Proof

It follows from the last three results that the set of random variables (again, modulo equivalence) with finite k norm forms a
subspace of our parent vector space V , and that the k norm really is a norm on this vector space.

For k ∈ [1, ∞), L_k denotes the vector space of X ∈ V with ∥X∥_k < ∞, and with norm ∥·∥_k.

In analysis, p is often used as the index rather than k as we have used here, but p seems too much like a probability, so we have
broken with tradition on this point. The L is in honor of Henri Lebesgue, who developed much of this theory. Sometimes, when we need to indicate the dependence on the underlying σ-algebra F, we write L_k(F). Our next result is Lyapunov's inequality, named for Aleksandr Lyapunov. This inequality shows that the k norm of a random variable is increasing in k.

Suppose that j, k ∈ [1, ∞) with j ≤ k. Then ∥X∥_j ≤ ∥X∥_k for X ∈ V.
Proof

Lyapunov's inequality shows that if 1 ≤ j ≤ k and ∥X∥_k < ∞ then ∥X∥_j < ∞. Thus, L_k is a subspace of L_j.

Metric
The k norm, like any norm on a vector space, can be used to define a metric, or distance function; we simply compute the norm of
the difference between two vectors.

For k ∈ [1, ∞), the k distance (or k metric) between X, Y ∈ V is defined by

d_k(X, Y) = ∥X − Y∥_k = [E(|X − Y|^k)]^{1/k}   (4.11.7)

The following properties are analogous to the norm properties above (and thus very little additional work is required for the proofs). These properties show that the k metric really is a metric on L_k (as always, modulo equivalence). The first is the positive property.

Suppose again that k ∈ [1, ∞) and X, Y ∈ V. Then

1. d_k(X, Y) ≥ 0
2. d_k(X, Y) = 0 if and only if P(X = Y) = 1 (so that X ≡ Y).


Proof

Next is the obvious symmetry property:

d_k(X, Y) = d_k(Y, X) for X, Y ∈ V.

Next is the distance version of the triangle inequality.

d_k(X, Z) ≤ d_k(X, Y) + d_k(Y, Z) for X, Y, Z ∈ V

Proof

The last three properties mean that d_k is indeed a metric on L_k for k ≥ 1. In particular, note that the standard deviation is simply the 2-distance from X to its mean μ = E(X):

sd(X) = d_2(X, μ) = ∥X − μ∥_2 = √(E[(X − μ)^2])   (4.11.9)

and the variance is the square of this. More generally, the kth moment of X about a is simply the kth power of the k-distance from
X to a . The 2-distance is especially important for reasons that will become clear below, in the discussion of inner product. This

distance is also called the root mean square distance.

Center and Spread Revisited


Measures of center and measures of spread are best thought of together, in the context of a measure of distance. For a real-valued
random variable X, we first try to find the constants t ∈ R that are closest to X, as measured by the given distance; any such t is a
measure of center relative to the distance. The minimum distance itself is the corresponding measure of spread.
Let us apply this procedure to the 2-distance.

For X ∈ L_2, define the root mean square error function by

d_2(X, t) = ∥X − t∥_2 = √(E[(X − t)^2]),   t ∈ R   (4.11.10)

For X ∈ L_2, d_2(X, t) is minimized when t = E(X) and the minimum value is sd(X).
Proof

We have seen this computation several times before. The best constant predictor of X is E(X), with mean square error var(X).
The physical interpretation of this result is that the moment of inertia of the mass distribution of X about t is minimized when
t = μ , the center of mass. Next, let us apply our procedure to the 1-distance.

For X ∈ L_1, define the mean absolute error function by

d_1(X, t) = ∥X − t∥_1 = E[|X − t|],   t ∈ R   (4.11.12)

We will show that d_1(X, t) is minimized when t is any median of X. (Recall that the set of medians of X forms a closed, bounded interval.) We start with a discrete case, because it's easier and has special interest.

Suppose that X ∈ L_1 has a discrete distribution with values in a finite set S ⊆ R. Then d_1(X, t) is minimized when t is any median of X.
Proof

The last result shows that mean absolute error has a couple of basic deficiencies as a measure of error:
The function may not be smooth (differentiable).
The function may not have a unique minimizing value of t .
Indeed, when X does not have a unique median, there is no compelling reason to choose one value in the median interval, as the
measure of center, over any other value in the interval.

Suppose now that X ∈ L_1 has a general distribution on R. Then d_1(X, t) is minimized when t is any median of X.
Proof
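
The two error functions are easy to explore numerically; in the following Python sketch the exponential sample is an arbitrary choice, used only to contrast the two minimizers (the mean for d_2, the median for d_1).

    import numpy as np

    rng = np.random.default_rng(seed=1)
    x = rng.exponential(scale=2.0, size=100_000)   # a skewed distribution

    ts = np.linspace(0.0, 5.0, 501)
    mse = np.array([np.mean((x - t) ** 2) for t in ts])   # d2(X, t)^2
    mae = np.array([np.mean(np.abs(x - t)) for t in ts])  # d1(X, t)

    print(ts[np.argmin(mse)], x.mean())       # minimizer ≈ mean ≈ 2
    print(ts[np.argmin(mae)], np.median(x))   # minimizer ≈ median ≈ 2 ln 2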

Convergence
Whenever we have a measure of distance, we automatically have a criterion for convergence.

Suppose that X_n ∈ L_k for n ∈ N_+ and that X ∈ L_k, where k ∈ [1, ∞). Then X_n → X as n → ∞ in kth mean if X_n → X as n → ∞ in the vector space L_k. That is,

d_k(X_n, X) = ∥X_n − X∥_k → 0 as n → ∞   (4.11.15)

or equivalently E(|X_n − X|^k) → 0 as n → ∞.

When k = 1, we simply say that X_n → X as n → ∞ in mean; when k = 2, we say that X_n → X as n → ∞ in mean square. These are the most important special cases.

Suppose that 1 ≤ j ≤ k. If X_n → X as n → ∞ in kth mean then X_n → X as n → ∞ in jth mean.


Proof

Convergence in kth mean implies that the k norms converge.

Suppose that X_n ∈ L_k for n ∈ N_+ and that X ∈ L_k, where k ∈ [1, ∞). If X_n → X as n → ∞ in kth mean then ∥X_n∥_k → ∥X∥_k as n → ∞. Equivalently, if E(|X_n − X|^k) → 0 as n → ∞ then E(|X_n|^k) → E(|X|^k) as n → ∞.

Proof

The converse is not true; a counterexample is given below. Our next result shows that convergence in mean is stronger than
convergence in probability.

Suppose that X_n ∈ L_1 for n ∈ N_+ and that X ∈ L_1. If X_n → X as n → ∞ in mean, then X_n → X as n → ∞ in probability.
Proof

The converse is not true. That is, convergence with probability 1 does not imply convergence in kth mean; a counterexample is given below. Also convergence in kth mean does not imply convergence with probability 1; a counterexample to this is given below. In summary, the implications in the various modes of convergence are shown below; no other implications hold in general.
Convergence with probability 1 implies convergence in probability.
Convergence in kth mean implies convergence in jth mean if j ≤ k.
Convergence in kth mean implies convergence in probability.
Convergence in probability implies convergence in distribution.
However, the next section on uniformly integrable variables gives a condition under which convergence in probability implies
convergence in mean.

Inner Product
The vector space L_2 of real-valued random variables on (Ω, F, P) (modulo equivalence of course) with finite second moment is special, because it's the only one in which the norm corresponds to an inner product.

The inner product of X, Y ∈ L_2 is defined by

⟨X, Y⟩ = E(XY)   (4.11.17)

The following results are analogous to the basic properties of covariance, and show that this definition really does give an inner
product on the vector space

For X, Y , Z ∈ L2 and a ∈ R ,
1. ⟨X, Y ⟩ = ⟨Y , X⟩ , the symmetric property.
2. ⟨X, X⟩ ≥ 0 and ⟨X, X⟩ = 0 if and only if P(X = 0) = 1 (so that X ≡ 0 ), the positive property.
3. ⟨aX, Y ⟩ = a⟨X, Y ⟩ , the scaling property.
4. ⟨X + Y , Z⟩ = ⟨X, Z⟩ + ⟨Y , Z⟩ , the additive property.
Proof

From parts (a), (c), and (d) it follows that inner product is bi-linear, that is, linear in each variable with the other fixed. Of course
bi-linearity holds for any inner product on a vector space. Covariance and correlation can easily be expressed in terms of this inner
product. The covariance of two random variables is the inner product of the corresponding centered variables. The correlation is the
inner product of the corresponding standard scores.

For X, Y ∈ L2 ,
1. cov(X, Y ) = ⟨X − E(X), Y − E(Y )⟩
2. cor(X, Y ) = ⟨[X − E(X)]/sd(X), [Y − E(Y )]/sd(Y )⟩

Proof

Thus, real-valued random variables X and Y are uncorrelated if and only if the centered variables X − E(X) and Y − E(Y ) are
perpendicular or orthogonal as elements of L_2.

For X ∈ L_2, ⟨X, X⟩ = ∥X∥_2^2 = E(X^2).

Thus, the norm associated with the inner product is the 2-norm studied above, and corresponds to the root mean square operation
on a random variable. This fact is a fundamental reason why the 2-norm plays such a special, honored role; of all the k -norms, only
the 2-norm corresponds to an inner product. In turn, this is one of the reasons that root mean square difference is of fundamental
importance in probability and statistics. Technically, the vector space L_2 is a Hilbert space, named for David Hilbert.

The next result is Hölder's inequality, named for Otto Hölder.

Suppose that j, k ∈ [1, ∞) and 1/j + 1/k = 1. For X ∈ L_j and Y ∈ L_k,

⟨|X|, |Y|⟩ ≤ ∥X∥_j ∥Y∥_k   (4.11.18)

Proof

In the context of the last theorem, j and k are called conjugate exponents. If we let j = k = 2 in Hölder's inequality, then we get
the Cauchy-Schwarz inequality, named for Augustin Cauchy and Karl Schwarz: for X, Y ∈ L_2,

E(|X| |Y|) ≤ √(E(X^2)) √(E(Y^2))   (4.11.21)

In turn, the Cauchy-Schwarz inequality is equivalent to the basic inequalities for covariance and correlations: For X, Y ∈ L2 ,
|cov(X, Y )| ≤ sd(X)sd(Y ), |cor(X, Y )| ≤ 1 (4.11.22)

If j, k ∈ [1, ∞) are conjugate exponents then

1. k = j/(j − 1).
2. k ↓ 1 as j ↑ ∞.
2. k ↓ 1 as j ↑ ∞ .

The following result is equivalent to the identity var(X + Y) + var(X − Y) = 2[var(X) + var(Y)] that we studied in the section on covariance and correlation. In the context of vector spaces, the result is known as the parallelogram rule:

If X, Y ∈ L_2 then

∥X + Y∥_2^2 + ∥X − Y∥_2^2 = 2∥X∥_2^2 + 2∥Y∥_2^2   (4.11.23)

Proof

The following result is equivalent to the statement that the variance of the sum of uncorrelated variables is the sum of the variances,
which again we proved in the section on covariance and correlation. In the context of vector spaces, the result is the famous
Pythagorean theorem, named for Pythagoras of course.

If (X_1, X_2, …, X_n) is a sequence of random variables in L_2 with ⟨X_i, X_j⟩ = 0 for i ≠ j then

∥∑_{i=1}^n X_i∥_2^2 = ∑_{i=1}^n ∥X_i∥_2^2   (4.11.26)

Proof

Projections
The best linear predictor studied in the section on covariance and correlation, and conditional expected value, both have nice interpretations in terms of projections onto subspaces of L_2. First let's review the concepts. Recall that U is a subspace of L_2 if U ⊆ L_2 and U is also a vector space (under the same operations of addition and scalar multiplication). To show that U ⊆ L_2 is a subspace, we just need to show the closure properties (the other axioms of a vector space are inherited).
If U, V ∈ U then U + V ∈ U.
If U ∈ U and c ∈ R then cU ∈ U.

Suppose now that U is a subspace of L_2 and that X ∈ L_2. Then the projection of X onto U (if it exists) is the vector V ∈ U with the property that X − V is perpendicular to U:

⟨X − V, U⟩ = 0,   U ∈ U   (4.11.29)

The projection has two critical properties: It is unique (if it exists) and it is the vector in U closest to X. If you look at the proofs of
these results, you will see that they are essentially the same as the ones used for the best predictors of X mentioned at the
beginning of this subsection. Moreover, the proofs use only vector space concepts—the fact that our vectors are random variables
on a probability space plays no special role.

The projection of X onto U (if it exists) is unique.


Proof

Suppose that V is the projection of X onto U. Then

a. ∥X − V∥_2^2 ≤ ∥X − U∥_2^2 for all U ∈ U.
b. Equality holds in (a) if and only if U ≡ V.

Proof

Now let's return to our study of best predictors of a random variable.

If X ∈ L_2 then the set W_X = {a + bX : a ∈ R, b ∈ R} is a subspace of L_2. In fact, it is the subspace generated by X and 1.
Proof

Recall that for X, Y ∈ L_2, the best linear predictor of Y based on X is

L(Y ∣ X) = E(Y) + [cov(X, Y) / var(X)] [X − E(X)]   (4.11.33)

Here is the meaning of the predictor in the context of our vector spaces.

If X, Y ∈ L_2 then L(Y ∣ X) is the projection of Y onto W_X.

Proof

The previous result is actually just the random variable version of the standard formula for the projection of a vector onto a space spanned by two other vectors. Note that 1 is a unit vector and that X_0 = X − E(X) = X − ⟨X, 1⟩1 is perpendicular to 1. Thus, L(Y ∣ X) is just the sum of the projections of Y onto 1 and X_0:

L(Y ∣ X) = ⟨Y, 1⟩1 + [⟨Y, X_0⟩ / ⟨X_0, X_0⟩] X_0   (4.11.34)
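
This projection formula is easy to check numerically; in the Python sketch below the data-generating model is an arbitrary assumption, and the inner products ⟨U, V⟩ = E(UV) are approximated by sample averages.

    import numpy as np

    rng = np.random.default_rng(seed=2)
    n = 100_000
    x = rng.normal(2.0, 1.0, size=n)
    y = 3.0 * x + rng.normal(0.0, 0.5, size=n)    # Y correlated with X

    x0 = x - x.mean()                             # X0, perpendicular to 1
    b = np.mean(y * x0) / np.mean(x0 * x0)        # <Y, X0> / <X0, X0>
    proj = y.mean() + b * x0                      # <Y, 1>1 + b X0

    # same slope via the covariance form of L(Y | X)
    print(b, np.cov(x, y, ddof=0)[0, 1] / x.var())
    # the residual Y - L(Y | X) is orthogonal to 1 and to X
    resid = y - proj
    print(np.mean(resid), np.mean(resid * x))     # both ≈ 0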

Suppose now that G is a sub σ-algebra of F. Of course if X : Ω → R is G-measurable then X is F-measurable, so L_2(G) is a subspace of L_2(F).

If X ∈ L_2(F) then E(X ∣ G) is the projection of X onto L_2(G).

Proof

But remember that E(X ∣ G) is defined more generally for X ∈ L_1(F). Our final result in this discussion concerns convergence.

Suppose that k ∈ [1, ∞) and that G is a sub σ-algebra of F.

1. If X ∈ L_k(F) then E(X ∣ G) ∈ L_k(G).
2. If X_n ∈ L_k(F) for n ∈ N_+, X ∈ L_k(F), and X_n → X as n → ∞ in L_k(F), then E(X_n ∣ G) → E(X ∣ G) as n → ∞ in L_k(G).

Proof

Examples and Applications


App Exercises

In the error function app, select the root mean square error function. Click on the x-axis to generate an empirical distribution,
and note the shape and location of the graph of the error function.

In the error function app, select the mean absolute error function. Click on the x-axis to generate an empirical distribution, and
note the shape and location of the graph of the error function.

Computational Exercises
Suppose that X is uniformly distributed on the interval [0, 1].
1. Find ∥X∥_k for k ∈ [1, ∞).
2. Graph ∥X∥_k as a function of k ∈ [1, ∞).
3. Find lim_{k→∞} ∥X∥_k.
Answer

Suppose that X has probability density function f(x) = a / x^{a+1} for 1 ≤ x < ∞, where a > 0 is a parameter. Thus, X has the Pareto distribution with shape parameter a.
1. Find ∥X∥_k for k ∈ [1, ∞).
2. Graph ∥X∥_k as a function of k ∈ (1, a).
3. Find lim_{k↑a} ∥X∥_k.

Answer

Suppose that (X, Y) has probability density function f(x, y) = x + y for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1. Verify Minkowski's inequality.
Answer

Let X be an indicator random variable with P(X = 1) = p , where 0 ≤ p ≤ 1 . Graph E (|X − t|) as a function of t ∈ R in
each of the cases below. In each case, find the minimum value of the function and the values of t where the minimum occurs.
1. p < 1/2
2. p = 1/2
3. p > 1/2

Answer

Suppose that X is uniformly distributed on the interval [0, 1]. Find d_1(X, t) = E(|X − t|) as a function of t and sketch the graph. Find the minimum value of the function and the value of t where the minimum occurs.

Suppose that X is uniformly distributed on the set [0, 1] ∪ [2, 3]. Find d_1(X, t) = E(|X − t|) as a function of t and sketch the

graph. Find the minimum value of the function and the values of t where the minimum occurs.

Suppose that (X, Y ) has probability density function f (x, y) = x + y for 0 ≤ x ≤ 1 , 0 ≤ y ≤ 1 . Verify Hölder's inequality in
the following cases:
1. j = k = 2
2. j = 3, k = 3/2

Answer

Counterexamples
The following exercise shows that convergence with probability 1 does not imply convergence in mean.

Suppose that (X_1, X_2, …) is a sequence of independent random variables with

P(X_n = n^3) = 1/n^2,   P(X_n = 0) = 1 − 1/n^2;   n ∈ N_+   (4.11.38)

1. X_n → 0 as n → ∞ with probability 1.
2. X_n → 0 as n → ∞ in probability.
3. E(X_n) → ∞ as n → ∞.

Proof

The following exercise shows that convergence in mean does not imply convergence with probability 1.

Suppose that (X_1, X_2, …) is a sequence of independent indicator random variables with

P(X_n = 1) = 1/n,   P(X_n = 0) = 1 − 1/n;   n ∈ N_+   (4.11.39)

1. P(X_n = 0 for infinitely many n) = 1.
2. P(X_n = 1 for infinitely many n) = 1.
3. P(X_n does not converge as n → ∞) = 1.
4. X_n → 0 as n → ∞ in kth mean for every k ≥ 1.
Proof

The following exercise shows that convergence of the kth means does not imply convergence in kth mean.

Suppose that U has the Bernoulli distribution with parameter 1/2, so that P(U = 1) = P(U = 0) = 1/2. Let X_n = U for n ∈ N_+ and let X = 1 − U. Let k ∈ [1, ∞). Then

1. E(X_n^k) = E(X^k) = 1/2 for n ∈ N_+, so E(X_n^k) → E(X^k) as n → ∞
2. E(|X_n − X|^k) = 1 for n ∈ N_+ so X_n does not converge to X as n → ∞ in L_k.

Proof

This page titled 4.11: Vector Spaces of Random Variables is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

4.12: Uniformly Integrable Variables

Two of the most important modes of convergence in probability theory are convergence with probability 1 and convergence in
mean. As we have noted several times, neither mode of convergence implies the other. However, if we impose an additional
condition on the sequence of variables, convergence with probability 1 will imply convergence in mean. The purpose of this brief,
but advanced section, is to explore the additional condition that is needed. This section is particularly important for the theory of
martingales.

Basic Theory
As usual, our starting point is a random experiment modeled by a probability space (Ω, F , P). So Ω is the set of outcomes, F is
the σ-algebra of events, and P is the probability measure on the sample space (Ω, F ). In this section, all random variables that are
mentioned are assumed to be real valued, unless otherwise noted. Next, recall from the section on vector spaces that for k ∈ [1, ∞), L_k is the vector space of random variables X with E(|X|^k) < ∞, endowed with the norm ∥X∥_k = [E(|X|^k)]^{1/k}. In particular, X ∈ L_1 simply means that E(|X|) < ∞ so that E(X) exists as a real number. From the section on expected value as an integral, recall the following notation, assuming of course that the expected value makes sense:

E(X; A) = E(X 1_A) = ∫_A X dP   (4.12.1)

Definition
The following result is motivation for the main definition in this section.

If X is a random variable then E(|X|) < ∞ if and only if E(|X|; |X| ≥ x) → 0 as x → ∞ .


Proof

Suppose now that X_i is a random variable for each i in a nonempty index set I (not necessarily countable). The critical definition for this section is to require the convergence in the previous theorem to hold uniformly for the collection of random variables X = {X_i : i ∈ I}.

The collection X = {X_i : i ∈ I} is uniformly integrable if for each ϵ > 0 there exists x > 0 such that for all i ∈ I,

E(|X_i|; |X_i| > x) < ϵ   (4.12.3)

Equivalently, E(|X_i|; |X_i| > x) → 0 as x → ∞ uniformly in i ∈ I.

Properties
Our next discussion centers on conditions that ensure that the collection of random variables X = { Xi : i ∈ I } is uniformly
integrable. Here is an equivalent characterization:

The collection X = {X_i : i ∈ I} is uniformly integrable if and only if the following conditions hold:
a. {E(|X_i|) : i ∈ I} is bounded.
b. For each ϵ > 0 there exists δ > 0 such that if A ∈ F and P(A) < δ then E(|X_i|; A) < ϵ for all i ∈ I.
Proof

Condition (a) means that X is bounded (in norm) as a subset of the vector space L_1. Trivially, a finite collection of integrable
random variables is uniformly integrable.

Suppose that I is finite and that E(|X_i|) < ∞ for each i ∈ I. Then X = {X_i : i ∈ I} is uniformly integrable.

A subset of a uniformly integrable set of variables is also uniformly integrable.

If {X_i : i ∈ I} is uniformly integrable and J is a nonempty subset of I, then {X_j : j ∈ J} is uniformly integrable.

If the random variables in the collection are dominated in absolute value by a random variable with finite mean, then the collection
is uniformly integrable.

Suppose that Y is a nonnegative random variable with E(Y) < ∞ and that |X_i| ≤ Y for each i ∈ I. Then X = {X_i : i ∈ I} is uniformly integrable.
Proof

The following result is more general, but essentially the same proof works.

Suppose that Y = {Y_j : j ∈ J} is uniformly integrable, and X = {X_i : i ∈ I} is a set of variables with the property that for each i ∈ I there exists j ∈ J such that |X_i| ≤ |Y_j|. Then X is uniformly integrable.

As a simple corollary, if the variables are bounded in absolute value then the collection is uniformly integrable.

If there exists c > 0 such that |X_i| ≤ c for all i ∈ I then X = {X_i : i ∈ I} is uniformly integrable.

Just having E(|X_i|) bounded in i ∈ I (condition (a) in the characterization above) is not sufficient for X = {X_i : i ∈ I} to be uniformly integrable; a counterexample is given below. However, if E(|X_i|^k) is bounded in i ∈ I for some k > 1, then X is uniformly integrable. This condition means that X is bounded (in norm) as a subset of the vector space L_k.

If {E(|X_i|^k) : i ∈ I} is bounded for some k > 1, then {X_i : i ∈ I} is uniformly integrable.

Proof

Uniformly integrability is closed under the operations of addition and scalar multiplication.

Suppose that X = {X_i : i ∈ I} and Y = {Y_i : i ∈ I} are uniformly integrable and that c ∈ R. Then each of the following collections is also uniformly integrable.
1. X + Y = {X_i + Y_i : i ∈ I}
2. cX = {cX_i : i ∈ I}
2. cX = {cX : i ∈ I } i

Proof

The following corollary is trivial, but will be needed in our discussion of convergence below.

Suppose that {X_i : i ∈ I} is uniformly integrable and that X is a random variable with E(|X|) < ∞. Then {X_i − X : i ∈ I} is uniformly integrable.
Proof

Convergence
We now come to the main results, and the reason for the definition of uniform integrability in the first place. To set up the notation, suppose that X_n is a random variable for n ∈ N_+ and that X is a random variable. We know that if X_n → X as n → ∞ in mean then X_n → X as n → ∞ in probability. The converse is also true if and only if the sequence is uniformly integrable. Here is the first half:

If X_n → X as n → ∞ in mean, then {X_n : n ∈ N_+} is uniformly integrable.


Proof

Here is the more important half, known as the uniform integrability theorem:

If {X_n : n ∈ N_+} is uniformly integrable and X_n → X as n → ∞ in probability, then X_n → X as n → ∞ in mean.


Proof

As a corollary, recall that if X_n → X as n → ∞ with probability 1, then X_n → X as n → ∞ in probability. Hence if X = {X_n : n ∈ N_+} is uniformly integrable then X_n → X as n → ∞ in mean.

Examples
Our first example shows that bounded L_1 norm is not sufficient for uniform integrability.

Suppose that U is uniformly distributed on the interval (0, 1) (so U has the standard uniform distribution). For n ∈ N_+, let X_n = n 1(U ≤ 1/n). Then

a. E(|X_n|) = 1 for all n ∈ N_+
b. E(|X_n|; |X_n| > x) = 1 for x > 0, n ∈ N_+ with n > x


Proof

By part (b), E(|X_n|; |X_n| > x) does not converge to 0 as x → ∞ uniformly in n ∈ N_+, so X = {X_n : n ∈ N_+} is not uniformly integrable.
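
The failure of uniformity is easy to tabulate directly, since X_n takes only the values n (with probability 1/n) and 0; the cutoffs below are arbitrary choices for a sketch.

    def truncated_mean(n, x):
        # E(|X_n|; |X_n| > x), computed exactly: X_n = n with probability 1/n
        return n * (1.0 / n) if n > x else 0.0

    for x in (10, 100, 1000):
        print(x, [truncated_mean(n, x) for n in (5, 50, 5000)])
    # for every cutoff x there are n with E(|X_n|; |X_n| > x) = 1,
    # so the convergence cannot be uniform in n
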
The next example gives an important application to conditional expected value. Recall that if X is a random variable with
E(|X|) < ∞ and G is a sub σ-algebra of F then E(X ∣ G ) is the expected value of X given the information in G , and is the G -

measurable random variable closest to X in a sense. Indeed if X ∈ L (F ) then E(X ∣ G ) is the projection of X onto L (G ). The
2 2

collection of all conditional expected values of X is uniformly integrable:

Suppose that X is a real-valued random variable with E(|X|) < ∞ . Then {E(X ∣ G ) : G is a sub σ-algebra of F } is
uniformly integrable.
Proof

Note that the collection of sub σ-algebras of F , and so also the collection of conditional expected values above, might well be
uncountable. The conditional expected values range from E(X), when G = {Ω, ∅} to X itself, when G = F .

This page titled 4.12: Uniformly Integrable Variables is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

4.13: Kernels and Operators

The goal of this section is to study a type of mathematical object that arises naturally in the context of conditional expected value
and parametric distributions, and is of fundamental importance in the study of stochastic processes, particularly Markov processes.
In a sense, the main object of study in this section is a generalization of a matrix, and the operations generalizations of matrix
operations. If you keep this in mind, this section may seem less abstract.

Basic Theory
Definitions

Recall that a measurable space (S, S ) consists of a set S and a σ-algebra S of subsets of S . If μ is a positive measure on
(S, S ), then (S, S , μ) is a measure space. The two most important special cases that we have studied frequently are

1. Discrete: S is countable, S = P(S) is the collection of all subsets of S , and μ = # is counting measure on (S, S ).
2. Euclidean: S is a measurable subset of R^n for some n ∈ N_+, S is the collection of subsets of S that are also measurable, and μ = λ_n is n-dimensional Lebesgue measure on (S, S).

More generally, S usually comes with a topology that is locally compact, Hausdorff, with a countable base (LCCB), and S is the
Borel σ-algebra, the σ-algebra generated by the topology (the collection of open subsets of S ). The measure μ is usually a Borel
measure, and so satisfies μ(C ) < ∞ if C ⊆ S is compact. A discrete measure space is of this type, corresponding to the discrete
topology. A Euclidean measure space is also of this type, corresponding to the Euclidean topology, if S is open or closed (which is
usually the case). In the discrete case, every function from S to another measurable space is measurable, and every function from S to another topological space is continuous, so the measure theory is not really necessary.
Recall also that the measure space (S, S, μ) is σ-finite if there exists a countable collection {A_i : i ∈ I} ⊆ S such that μ(A_i) < ∞ for i ∈ I and S = ⋃_{i∈I} A_i. If (S, S, μ) is a Borel measure space corresponding to an LCCB topology, then it is σ-finite.
If f : S → R is measurable, define ∥f ∥ = sup{|f (x)| : x ∈ S}. Of course we may well have ∥f ∥ = ∞. Let B(S) denote the
collection of bounded measurable functions f : S → R . Under the usual operations of pointwise addition and scalar multiplication,
B(S) is a vector space, and ∥ ⋅ ∥ is the natural norm on this space, known as the supremum norm. This vector space plays an

important role.
In this section, it is sometimes more natural to write integrals with respect to the positive measure μ with the differential before the
integrand, rather than after. However, rest assured that this is mere notation, the meaning of the integral is the same. So if
f : S → R is measurable then we may write the integral of f with respect to μ in operator notation as

μf = ∫_S μ(dx) f(x)   (4.13.1)

assuming, as usual, that the integral exists. This will be the case if f is nonnegative, although ∞ is a possible value. More generally, the integral exists in R ∪ {−∞, ∞} if μf_+ < ∞ or μf_− < ∞, where f_+ and f_− are the positive and negative parts of f. If both are finite, the integral exists in R (and f is integrable with respect to μ). If μ is a probability measure and we think of (S, S) as the sample space of a random experiment, then we can think of f as a real-valued random variable, in which case our new notation is not too far from our traditional expected value E(f). Our main definition comes next.

Suppose that (S, S ) and (T , T ) are measurable spaces. A kernel from (S, S ) to (T , T ) is a function K : S × T → [0, ∞]

such that
1. x ↦ K(x, A) is a measurable function from S into [0, ∞] for each A ∈ T .
2. A ↦ K(x, A) is a positive measure on T for each x ∈ S .
If (T , T ) = (S, S ) , then K is said to be a kernel on (S, S ).

There are several classes of kernels that deserve special names.

Suppose that K is a kernel from (S, S ) to (T , T ). Then

1. K is σ-finite if the measure K(x, ⋅) is σ-finite for every x ∈ S .
2. K is finite if K(x, T ) < ∞ for every x ∈ S .
3. K is bounded if K(x, T ) is bounded in x ∈ S .
4. K is a probability kernel if K(x, T ) = 1 for every x ∈ S .
Define ∥K∥ = sup{K(x, T ) : x ∈ S} , so that ∥K∥ < ∞ if K is a bounded kernel and ∥K∥ = 1 if K is a probability kernel.

So a probability kernel is bounded, a bounded kernel is finite, and a finite kernel is σ-finite. The terms stochastic kernel and
Markov kernel are also used for probability kernels, and for a probability kernel ∥K∥ = 1 of course. The terms are consistent with
terms used for measures: K is a finite kernel if and only if K(x, ⋅) is a finite measure for each x ∈ S , and K is a probability kernel
if and only if K(x, ⋅) is a probability measure for each x ∈ S . Note that ∥K∥ is simply the supremum norm of the function
x ↦ K(x, T ) .

A kernel defines two natural integral operators, by operating on the left with measures and on the right with functions. As usual, we are often a bit casual with the question of existence. Basically, in this section we assume that any integrals mentioned exist.

Suppose that K is a kernel from (S, S ) to (T , T ).


1. If μ is a positive measure on (S, S ), then μK defined as follows is a positive measure on (T , T ):

μK(A) = ∫_S μ(dx) K(x, A), A ∈ T (4.13.2)

2. If f : T → R is measurable, then Kf : S → R defined as follows is measurable (assuming that the integrals exist in R):

Kf(x) = ∫_T K(x, dy) f(y), x ∈ S (4.13.3)

Proof

Thus, a kernel transforms measures on (S, S ) into measures on (T , T ), and transforms certain measurable functions from T to R
into measurable functions from S to R. Again, part (b) assumes that f is integrable with respect to the measure K(x, ⋅) for every
x ∈ S . In particular, the last statement will hold in the following important special case:

Suppose that K is a kernel from (S, S) to (T, T) and that f ∈ B(T).

1. If K is finite then Kf is defined, and ∥Kf∥ ≤ ∥K∥∥f∥.
2. If K is bounded then Kf ∈ B(S).
Proof
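In the discrete case described in (1), these operators are ordinary matrix algebra, which can make the definitions concrete. Below is a minimal numerical sketch in Python with NumPy (the spaces and kernel values are illustrative assumptions, not from the text):

import numpy as np

# A kernel K from S = {0, 1} to T = {0, 1, 2}, stored as a |S| by |T| matrix
# with K[x, y] = K(x, {y}). Row sums K(x, T) need not be 1: this K is not stochastic.
K = np.array([[0.5, 1.0, 0.5],
              [2.0, 0.0, 1.0]])

mu = np.array([0.3, 0.7])       # a measure on S, as a row vector
f = np.array([1.0, -2.0, 3.0])  # a bounded function on T, as a column vector

muK = mu @ K                    # the measure mu K on T
Kf = K @ f                      # the function K f on S

K_norm = K.sum(axis=1).max()    # ||K|| = sup over x of K(x, T)
f_norm = np.abs(f).max()        # ||f||, the supremum norm
assert np.abs(Kf).max() <= K_norm * f_norm  # checks ||Kf|| <= ||K|| ||f||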

The identity kernel I on the measurable space (S, S ) is defined by I (x, A) = 1(x ∈ A) for x ∈ S and A ∈ S .

Thus, I (x, A) = 1 if x ∈ A and I (x, A) = 0 if x ∉ A . So x ↦ I (x, A) is the indicator function of A ∈ S , while A ↦ I (x, A)
is point mass at x ∈ S . Clearly the identity kernel is a probability kernel. If we need to indicate the dependence on the particular
space, we will add a subscript. The following result justifies the name.

Let I denote the identity kernel on (S, S ).


1. If μ is a positive measure on (S, S ) then μI =μ .
2. If f : S → R is measurable, then I f = f .

Constructions
We can create a new kernel from two given kernels, by the usual operations of addition and scalar multiplication.

Suppose that K and L are kernels from (S, S ) to (T , T ), and that c ∈ [0, ∞). Then cK and K + L defined below are also
kernels from (S, S ) to (T , T ).
1. (cK)(x, A) = cK(x, A) for x ∈ S and A ∈ T .
2. (K + L)(x, A) = K(x, A) + L(x, A) for x ∈ S and A ∈ T .

If K and L are σ-finite (finite) (bounded) then cK and K + L are σ-finite (finite) (bounded), respectively.
Proof

A simple corollary of the last result is that if a, b ∈ [0, ∞) then aK + bL is a kernel from (S, S) to (T, T). In particular, if K, L are probability kernels and p ∈ (0, 1) then pK + (1 − p)L is a probability kernel. A more interesting and important way to form a new kernel from two given kernels is via a “multiplication” operation.

Suppose that K is a kernel from (R, R) to (S, S) and that L is a kernel from (S, S) to (T, T). Then KL defined as follows is a kernel from (R, R) to (T, T):

KL(x, A) = ∫_S K(x, dy) L(y, A), x ∈ R, A ∈ T (4.13.5)

1. If K is finite and L is bounded then KL is finite.
2. If K and L are bounded then KL is bounded.
3. If K and L are stochastic then KL is stochastic.
Proof

Once again, the identity kernel lives up to its name:

Suppose that K is a kernel from (S, S) to (T, T). Then

1. I_S K = K
2. K I_T = K

The next several results show that the operations are associative whenever they make sense.

Suppose that K is a kernel from (S, S ) to (T , T ), μ is a positive measure on S , c ∈ [0, ∞), and f : T → R is measurable.
Then, assuming that the appropriate integrals exist,
1. c(μK) = (cμ)K
2. c(Kf ) = (cK)f
3. (μK)f = μ(Kf )
Proof

Suppose that K is a kernel from (R, R) to (S, S ) and L is a kernel from (S, S ) to (T , T ). Suppose also that μ is a positive
measure on (R, R) , f : T → R is measurable, and c ∈ [0, ∞). Then, assuming that the appropriate integrals exist,
1. (μK)L = μ(KL)
2. K(Lf ) = (KL)f
3. c(KL) = (cK)L
Proof

Suppose that K is a kernel from (R, R) to (S, S ), L is a kernel from (S, S ) to (T , T ), and M is a kernel from (T , T ) to
(U , U ) . Then (KL)M = K(LM ) .

Proof

The next several results show that the distributive property holds whenever the operations make sense.

Suppose that K and L are kernels from (R, R) to (S, S ) and that M and N are kernels from (S, S ) to (T , T ). Suppose
also that μ is a positive measure on (R, R) and that f : S → R is measurable. Then, assuming that the appropriate integrals
exist,
1. (K + L)M = KM + LM
2. K(M + N ) = KM + KN
3. μ(K + L) = μK + μL
4. (K + L)f = Kf + Lf

Suppose that K is a kernel from (S, S) to (T, T), and that μ and ν are positive measures on (S, S), and that f and g are measurable functions from T to R. Then, assuming that the appropriate integrals exist,
1. (μ + ν )K = μK + ν K
2. K(f + g) = Kf + Kg
3. μ(f + g) = μf + μg
4. (μ + ν )f = μf + ν f

In particular, note that if K is a kernel from (S, S ) to (T , T ), then the transformation μ ↦ μK defined for positive measures on
(S, S ), and the transformation f ↦ Kf defined for measurable functions f : T → R (for which Kf exists), are both linear

operators. If μ is a positive measure on (S, S ), then the integral operator f ↦ μf defined for measurable f : S → R (for which
μf exists) is also linear, but of course, we already knew that. Finally, note that the operator f ↦ Kf is positive: if f ≥ 0 then

Kf ≥ 0 . Here is the important summary of our results when the kernel is bounded.

If K is a bounded kernel from (S, S ) to (T , T ), then f ↦ Kf is a bounded, linear transformation from B(T ) to B(S) and
∥K∥ is the norm of the transformation.

The commutative property for the product of kernels fails with a passion. If K and L are kernels, then depending on the measurable
spaces, KL may be well defined, but not LK . Even if both products are defined, they may be kernels from or to different
measurable spaces. Even if both are defined from and to the same measurable spaces, it may well happen that KL ≠ LK . Some
examples are given below.
If K is a kernel on (S, S) and n ∈ N, we let K^n = KK ⋯ K, the n-fold power of K. By convention, K^0 = I, the identity kernel on S.
Fixed points of the operators associated with a kernel turn out to be very important.
Fixed points of the operators associated with a kernel turn out to be very important.

Suppose that K is a kernel on (S, S).

1. A positive measure μ on (S, S) such that μK = μ is said to be invariant for K.
2. A measurable function f : S → R such that Kf = f is said to be invariant for K.

So in the language of linear algebra (or functional analysis), an invariant measure is a left eigenvector of the kernel, while an invariant function is a right eigenvector of the kernel, both corresponding to the eigenvalue 1. By our results above, if μ and ν are invariant measures and c ∈ [0, ∞), then μ + ν and cμ are also invariant. Similarly, if f and g are invariant functions and c ∈ R, then f + g and cf are also invariant.
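For a probability kernel on a finite space, an invariant probability measure can be computed numerically as a normalized left eigenvector for the eigenvalue 1. A sketch in Python with NumPy (the stochastic matrix is an arbitrary illustration):

import numpy as np

# A probability kernel (stochastic matrix) on S = {0, 1, 2}: each row sums to 1.
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2],
              [0.5, 0.0, 0.5]])

# Left eigenvectors of P are eigenvectors of the transpose of P.
vals, vecs = np.linalg.eig(P.T)
v = vecs[:, np.argmin(np.abs(vals - 1))].real  # eigenvector for eigenvalue 1
mu = v / v.sum()                               # normalize to a probability measure

assert np.allclose(mu @ P, mu)                 # mu is invariant: mu P = mu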
Of course, we are particularly interested in probability kernels.

Suppose that P is a probability kernel from (R, R) to (S, S ) and that Q is a probability kernel from (S, S ) to (T , T ) .
Suppose also that μ is a probability measure on (R, R) . Then
1. P Q is a probability kernel from (R, R) to (T , T ).
2. μP is a probability measure on (S, S ).
Proof

As a corollary, it follows that if P is a probability kernel on (S, S), then so is P^n for n ∈ N.

The operators associated with a kernel are of fundamental importance, and we can easily recover the kernel from the operators. Suppose that K is a kernel from (S, S) to (T, T), and let x ∈ S and A ∈ T. Then trivially, K1_A(x) = K(x, A), where as usual 1_A is the indicator function of A. Trivially also, δ_x K(A) = K(x, A), where δ_x is point mass at x.

Kernel Functions
Usually our measurable spaces are in fact measure spaces, with natural measures associated with the spaces, as in the special cases
described in (1). When we start with measure spaces, kernels are usually constructed from density functions in much the same way
that positive measures are defined from density functions.

Suppose that (S, S, λ) and (T, T, μ) are measure spaces. As usual, S × T is given the product σ-algebra S ⊗ T. If k : S × T → [0, ∞) is measurable, then the function K defined as follows is a kernel from (S, S) to (T, T):

K(x, A) = ∫_A k(x, y) μ(dy), x ∈ S, A ∈ T (4.13.9)

Proof

Clearly the kernel K depends on the positive measure μ on (T , T ) as well as the function k , while the measure λ on (S, S ) plays
no role (and so is not even necessary). But again, our point of view is that the spaces have fixed, natural measures. Appropriately
enough, the function k is called a kernel density function (with respect to μ ), or simply a kernel function.

Suppose again that (S, S, λ) and (T, T, μ) are measure spaces. Suppose also that K is a kernel from (S, S) to (T, T) with kernel function k. If f : T → R is measurable then, assuming that the integrals exist,

Kf(x) = ∫_T k(x, y) f(y) μ(dy), x ∈ S (4.13.10)

Proof

A kernel function defines an operator on the left with functions on S in a completely analogous way to the operator on the right
above with functions on T .

Suppose again that (S, S, λ) and (T, T, μ) are measure spaces, and that K is a kernel from (S, S) to (T, T) with kernel function k. If f : S → R is measurable, then the function fK : T → R defined as follows is also measurable, assuming that the integrals exist:

fK(y) = ∫_S λ(dx) f(x) k(x, y), y ∈ T (4.13.12)

The operator defined above depends on the measure λ on (S, S) as well as the kernel function k, while the measure μ on (T, T) plays no role (and so is not even necessary). But again, our point of view is that the spaces have fixed, natural measures. Here is how our new operation on the left with functions relates to our old operation on the left with measures.

Suppose again that (S, S , λ) and (T , T , μ) are measure spaces, and that K is a kernel from (S, S ) to (T , T ) with kernel
function k . Suppose also that f : S → [0, ∞) is measurable, and let ρ denote the measure on (S, S ) that has density f with
respect to λ . Then f K is the density of the measure ρK with respect to μ .
Proof

As always, we are particularly interested in stochastic kernels. With a kernel function, we can have doubly stochastic kernels.

Suppose again that (S, S, λ) and (T, T, μ) are measure spaces and that k : S × T → [0, ∞) is measurable. Then k is a doubly stochastic kernel function if

1. ∫_T k(x, y) μ(dy) = 1 for x ∈ S
2. ∫_S λ(dx) k(x, y) = 1 for y ∈ T

Of course, condition (a) simply means that the kernel associated with k is a stochastic kernel according to our original definition.
The most common and important special case is when the two spaces are the same. Thus, if (S, S, λ) is a measure space and k : S × S → [0, ∞) is measurable, then we have an operator K that operates on the left and on the right with measurable functions f : S → R:

fK(y) = ∫_S λ(dx) f(x) k(x, y), y ∈ S
Kf(x) = ∫_S k(x, y) f(y) λ(dy), x ∈ S

If f is nonnegative and μ is the measure on (S, S) with density function f, then fK is the density function of the measure μK (both with respect to λ).

Suppose again that (S, S, λ) is a measure space and k : S × S → [0, ∞) is measurable. Then k is symmetric if k(x, y) = k(y, x) for all (x, y) ∈ S^2.

Of course, if k is a symmetric, stochastic kernel function on (S, S , λ) then k is doubly stochastic, but the converse is not true.
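As an illustration (not in the original text), take S = R with λ as Lebesgue measure and consider the kernel function

k(x, y) = (1/√(2π)) e^{−(y−x)^2/2}, (x, y) ∈ R^2

For fixed x, the function y ↦ k(x, y) is the normal probability density with mean x and variance 1, so ∫_R k(x, y) dy = 1; by the symmetric roles of x and y, ∫_R k(x, y) dx = 1 as well. Since k(x, y) depends only on (y − x)^2, k is also symmetric. Thus k is a symmetric, doubly stochastic kernel function; it reappears in the normal family exercise below.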

Suppose that (R, R, λ), (S, S, μ), and (T, T, ρ) are measure spaces. Suppose also that K is a kernel from (R, R) to (S, S) with kernel function k, and that L is a kernel from (S, S) to (T, T) with kernel function l. Then the kernel KL from (R, R) to (T, T) has density kl given by

kl(x, z) = ∫_S k(x, y) l(y, z) μ(dy), (x, z) ∈ R × T (4.13.13)

Proof

Examples and Special Cases


The Discrete Case
In this subsection, we assume that the measure spaces are discrete, as described in (1). Since the σ-algebra (all subsets) and the
measure (counting measure) are understood, we don't need to reference them. Recall that integrals with respect to counting measure
are sums. Suppose now that K is a kernel from the discrete space S to the discrete space T . For x ∈ S and y ∈ T , let
K(x, y) = K(x, {y}). Then more generally,

K(x, A) = ∑_{y∈A} K(x, y), x ∈ S, A ⊆ T (4.13.14)

The function (x, y) ↦ K(x, y) is simply the kernel function of the kernel K , as defined above, but in this case we usually don't
bother with using a different symbol for the function as opposed to the kernel. The function K can be thought of as a matrix, with
rows indexed by S and columns indexed by T (and so an infinite matrix if S or T is countably infinite). With this interpretation, all
of the operations defined above can be thought of as matrix operations. If f : T → R and f is thought of as a column vector
indexed by T , then Kf is simply the ordinary product of the matrix K and the vector f ; the product is a column vector indexed by
S:

Kf(x) = ∑_{y∈T} K(x, y) f(y), x ∈ S (4.13.15)

Similarly, if f : S → R and f is thought of as a row vector indexed by S, then fK is simply the ordinary product of the vector f and the matrix K; the product is a row vector indexed by T:

fK(y) = ∑_{x∈S} f(x) K(x, y), y ∈ T (4.13.16)

If L is another kernel from T to another discrete space U, then as functions, KL is simply the matrix product of K and L:

KL(x, z) = ∑_{y∈T} K(x, y) L(y, z), (x, z) ∈ S × U (4.13.17)

Let S = {1, 2, 3} and T = {1, 2, 3, 4}. Define the kernel K from S to T by K(x, y) = x + y for (x, y) ∈ S × T. Define the function f on S by f(x) = x! for x ∈ S, and define the function g on T by g(y) = y^2 for y ∈ T. Compute each of the following using matrix algebra:

1. fK
2. Kg
Answer
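The two products in this exercise are easy to check numerically. A sketch in Python with NumPy (an illustration, not part of the original answer):

import numpy as np
from math import factorial

S = [1, 2, 3]
T = [1, 2, 3, 4]

K = np.array([[x + y for y in T] for x in S])  # K(x, y) = x + y, a 3 by 4 matrix
f = np.array([factorial(x) for x in S])        # f(x) = x!, a row vector on S
g = np.array([y**2 for y in T])                # g(y) = y^2, a column vector on T

print(f @ K)  # f K, a row vector indexed by T: [32 41 50 59]
print(K @ g)  # K g, a column vector indexed by S: [130 160 190]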

Let R = {0, 1}, S = {a, b}, and T = {1, 2, 3}. Define the kernel K from R to S, the kernel L from S to S, and the kernel M from S to T in matrix form as follows:

K = [1 4; 2 3], L = [2 2; 1 5], M = [1 0 2; 0 3 1] (4.13.20)

Compute each of the following kernels, or explain why the operation does not make sense:
1. KL
2. LK
3. K 2

4. L2

5. KM
6. LM
Proof
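The compositions in this exercise can likewise be checked with matrix products (a sketch assuming NumPy). Note that LK and K^2 fail on kernel grounds: the second factor must be a kernel on the space where the first factor lands, and K is a kernel from R, not from S:

import numpy as np

K = np.array([[1, 4], [2, 3]])        # kernel from R to S
L = np.array([[2, 2], [1, 5]])        # kernel from S to S
M = np.array([[1, 0, 2], [0, 3, 1]])  # kernel from S to T

print(K @ L)  # KL, a kernel from R to S: [[ 6 22] [ 7 19]]
print(L @ L)  # L^2, a kernel from S to S: [[ 6 14] [ 7 27]]
print(K @ M)  # KM, a kernel from R to T: [[ 1 12  6] [ 2  9  7]]
print(L @ M)  # LM, a kernel from S to T: [[ 2  6  6] [ 1 15  7]]
# LK and K @ K are not kernel products: the column index of L (points of S)
# does not match the row index of K (points of R), even though the shapes agree.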

Conditional Probability
An important class of probability kernels arises from the distribution of one random variable, conditioned on the value of another
random variable. In this subsection, suppose that (Ω, F , P) is a probability space, and that (S, S ) and (T , T ) are measurable
spaces. Further, suppose that X and Y are random variables defined on the probability space, with X taking values in S and Y taking values in T. Informally, X and Y are random variables defined on the same underlying random experiment.

The function P defined as follows is a probability kernel from (S, S ) to (T , T ), known as the conditional probability kernel
of Y given X.
P (x, A) = P(Y ∈ A ∣ X = x), x ∈ S, A ∈ T (4.13.25)

Proof

The operators associated with this kernel have natural interpretations.

Let P be the conditional probability kernel of Y given X.


1. If f : T → R is measurable, then P f (x) = E[f (Y ) ∣ X = x] for x ∈ S (assuming as usual that the expected value exists).
2. If μ is the probability distribution of X then μP is the probability distribution of Y .
Proof

As in the general discussion above, the measurable spaces (S, S ) and (T , T ) are usually measure spaces with natural measures
attached. So the conditional probability distributions are often given via conditional probability density functions, which then play
the role of kernel functions. The next two exercises give examples.

Suppose that X and Y are random variables for an experiment, taking values in R. For x ∈ R, the conditional distribution of Y given X = x is normal with mean x and standard deviation 1. Use the notation and operations of this section for the following computations:

1. Give the kernel function for the conditional distribution of Y given X.
2. Find E(Y^2 ∣ X = x).
3. Suppose that X has the standard normal distribution. Find the probability density function of Y.
Answer

Suppose that X and Y are random variables for an experiment, with X taking values in {a, b, c} and Y taking values in {1, 2, 3, 4}. The kernel function of Y given X is as follows: P(a, y) = 1/4, P(b, y) = y/10, and P(c, y) = y^2/30, each for y ∈ {1, 2, 3, 4}.

1. Give the kernel P in matrix form and verify that it is a probability kernel.
2. Find f P where f (a) = f (b) = f (c) = 1/3 . The result is the density function of Y given that X is uniformly distributed.
3. Find P g where g(y) = y for y ∈ {1, 2, 3, 4}. The resulting function is E(Y ∣ X = x) for x ∈ {a, b, c}.
Answer
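Parts of this exercise can also be verified numerically. A sketch in Python with NumPy (rows ordered a, b, c and columns ordered 1, 2, 3, 4; an illustration, not the original answer):

import numpy as np

y = np.array([1, 2, 3, 4])
P = np.array([np.full(4, 1/4), y / 10, y**2 / 30])  # rows P(a, .), P(b, .), P(c, .)
assert np.allclose(P.sum(axis=1), 1)  # each row is a probability measure

f = np.array([1/3, 1/3, 1/3])
print(f @ P)  # f P: the density of Y when X is uniform on {a, b, c}
print(P @ y)  # P g: E(Y | X = x) for x = a, b, c, namely [2.5, 3.0, 10/3]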

Parametric Distributions
A parametric probability distribution also defines a probability kernel in a natural way, with the parameter playing the role of the
kernel variable, and the distribution playing the role of the measure. Such distributions are usually defined in terms of a parametric
density function which then defines a kernel function, again with the parameter playing the role of the first argument and the

variable the role of the second argument. If the parameter is thought of as a given value of another random variable, as in Bayesian
analysis, then there is considerable overlap with the previous subsection. In most cases, (and in particular in the examples below),
the spaces involved are either discrete or Euclidean, as described in (1).

Consider the parametric family of exponential distributions. Let f denote the identity function on (0, ∞).

1. Give the probability density function as a probability kernel function p on (0, ∞).
2. Find Pf.
3. Find fP.
4. Find p_2, the kernel function corresponding to the product kernel P^2.

Answer

Consider the parametric family of Poisson distributions. Let f be the identity function on N and let g be the identity function
on (0, ∞).
1. Give the probability density function p as a probability kernel function from (0, ∞) to N.
2. Show that P f = g .
3. Show that gP = f .
Answer

Clearly the Poisson distribution has some very special and elegant properties. The next family of distributions also has some very special properties. Compare this exercise with exercise (30).

Consider the family of normal distributions, parameterized by the mean and with variance 1.

1. Give the probability density function as a probability kernel function p on R.
2. Show that p is symmetric.
3. Let f be the identity function on R. Show that Pf = f and fP = f.
4. For n ∈ N, find p_n, the kernel function for the operator P^n.

Answer

For each of the following special distributions, express the probability density function as a probability kernel function. Be sure
to specify the parameter spaces.
1. The general normal distribution on R.
2. The beta distribution on (0, 1).
3. The negative binomial distribution on N.
Answer

This page titled 4.13: Kernels and Operators is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

CHAPTER OVERVIEW
5: Special Distributions
In this chapter, we study several general families of probability distributions and a number of special parametric families of
distributions. Unlike the other expository chapters in this text, the sections are not linearly ordered and so this chapter serves
primarily as a reference. You may want to study these topics as the need arises.
First, we need to discuss what makes a probability distribution special in the first place. In some cases, a distribution may be
important because it is connected with other special distributions in interesting ways (via transformations, limits, conditioning,
etc.). In some cases, a parametric family may be important because it can be used to model a wide variety of random phenomena.
This may be the case because of fundamental underlying principles, or simply because the family has a rich collection of
probability density functions with a small number of parameters (usually 3 or less). As a general philosophical principle, we try to
model a random process with as few parameters as possible; this is sometimes referred to as the principle of parsimony of
parameters. In turn, this is a special case of Ockham's razor, named in honor of William of Ockham, the principle that states that
one should use the simplest model that adequately describes a given phenomenon. Parsimony is important because often the
parameters are not known and must be estimated.
In many cases, a special parametric family of distributions will have one or more distinguished standard members, corresponding
to specified values of some of the parameters. Usually the standard distributions will be mathematically simplest, and often other
members of the family can be constructed from the standard distributions by simple transformations on the underlying standard
random variable.
An incredible variety of special distributions have been studied over the years, and new ones are constantly being added to the
literature. To truly deserve the adjective special, a distribution should have a certain level of mathematical elegance and economy,
and should arise in interesting and diverse applications.
5.1: Location-Scale Families
5.2: General Exponential Families
5.3: Stable Distributions
5.4: Infinitely Divisible Distributions
5.5: Power Series Distributions
5.6: The Normal Distribution
5.7: The Multivariate Normal Distribution
5.8: The Gamma Distribution
5.9: Chi-Square and Related Distribution
5.10: The Student t Distribution
5.11: The F Distribution
5.12: The Lognormal Distribution
5.13: The Folded Normal Distribution
5.14: The Rayleigh Distribution
5.15: The Maxwell Distribution
5.16: The Lévy Distribution
5.17: The Beta Distribution
5.18: The Beta Prime Distribution
5.19: The Arcsine Distribution
5.20: General Uniform Distributions
5.21: The Uniform Distribution on an Interval
5.22: Discrete Uniform Distributions
5.23: The Semicircle Distribution
5.24: The Triangle Distribution

5.25: The Irwin-Hall Distribution
5.26: The U-Power Distribution
5.27: The Sine Distribution
5.28: The Laplace Distribution
5.29: The Logistic Distribution
5.30: The Extreme Value Distribution
5.31: The Hyperbolic Secant Distribution
5.32: The Cauchy Distribution
5.33: The Exponential-Logarithmic Distribution
5.34: The Gompertz Distribution
5.35: The Log-Logistic Distribution
5.36: The Pareto Distribution
5.37: The Wald Distribution
5.38: The Weibull Distribution
5.39: Benford's Law
5.40: The Zeta Distribution
5.41: The Logarithmic Series Distribution

This page titled 5: Special Distributions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

5.1: Location-Scale Families

General Theory
As usual, our starting point is a random experiment modeled by a probability space (Ω, F, P), so that Ω is the set of outcomes, F the collection of events, and P the probability measure on the sample space (Ω, F). In this section, we assume that we have a fixed random variable Z defined on the probability space, taking values in R.

Definition
For a ∈ R and b ∈ (0, ∞), let X = a + b Z . The two-parameter family of distributions associated with X is called the
location-scale family associated with the given distribution of Z . Specifically, a is the location parameter and b the scale
parameter.

Thus a linear transformation, with positive slope, of the underlying random variable Z creates a location-scale family for the
underlying distribution. In the special case that b = 1 , the one-parameter family is called the location family associated with the
given distribution, and in the special case that a = 0 , the one-parameter family is called the scale family associated with the given
distribution. Scale transformations, as the name suggests, occur naturally when physical units are changed. For example, if a
random variable represents the length of an object, then a change of units from meters to inches corresponds to a scale
transformation. Location transformations often occur when the zero reference point is changed, in measuring distance or time, for
example. Location-scale transformations can also occur with a change of physical units. For example, if a random variable
represents the temperature of an object, then a change of units from Fahrenheit to Celsius corresponds to a location-scale
transformation.

Distribution Functions
Our goal is to relate various functions that determine the distribution of X = a + bZ to the corresponding functions for Z . First we
consider the (cumulative) distribution function.

If Z has distribution function G then X has distribution function F given by


F(x) = G((x − a)/b), x ∈ R (5.1.1)

Proof

Next we consider the probability density function. The results are a bit different for discrete distributions and continuous distributions, not surprising since the density function has different meanings in these two cases.

If Z has a discrete distribution with probability density function g then X also has a discrete distribution, with probability
density function f given by
f(x) = g((x − a)/b), x ∈ R (5.1.3)

Proof

If Z has a continuous distribution with probability density function g , then X also has a continuous distribution, with
probability density function f given by
f(x) = (1/b) g((x − a)/b), x ∈ R (5.1.5)

1. For the location family associated with g , the graph of f is obtained by shifting the graph of g , a units to the right if a > 0
and −a units to the left if a < 0 .
2. For the scale family associated with g , if b > 1 , the graph of f is obtained from the graph of g by stretching horizontally
and compressing vertically, by a factor of b . If 0 < b < 1 , the graph of f is obtained from the graph of g by compressing
horizontally and stretching vertically, by a factor of b .

Proof
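The relationship between f and g is easy to check numerically. A minimal sketch in Python with SciPy (the standard normal choice of Z and the parameter values are illustrative assumptions):

import numpy as np
from scipy.stats import norm

a, b = 2.0, 3.0  # location and scale parameters
x = np.linspace(-10.0, 14.0, 200)

# Density of X = a + bZ via the formula f(x) = (1/b) g((x - a)/b) ...
f_manual = norm.pdf((x - a) / b) / b
# ... and via SciPy's built-in location-scale machinery.
f_scipy = norm.pdf(x, loc=a, scale=b)

assert np.allclose(f_manual, f_scipy)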

If Z has a mode at z , then X has a mode at x = a + bz .


Proof

Next we relate the quantile functions of Z and X.

If G and F are the distribution functions of Z and X, respectively, then


1. F^{−1}(p) = a + b G^{−1}(p) for p ∈ (0, 1)

2. If z is a quantile of order p for Z then x = a + b z is a quantile of order p for X.


Proof

Suppose now that Z has a continuous distribution on [0, ∞), and that we think of Z as the failure time of a device (or the time of death of an organism). Let X = bZ where b ∈ (0, ∞), so that the distribution of X is the scale family associated with the distribution of Z. Then X also has a continuous distribution on [0, ∞) and can also be thought of as the failure time of a device (perhaps in different units).

Let G^c and F^c denote the reliability functions of Z and X respectively, and let r and R denote the failure rate functions of Z and X, respectively. Then

1. F^c(x) = G^c(x/b) for x ∈ [0, ∞)
2. R(x) = (1/b) r(x/b) for x ∈ [0, ∞)

Proof

Moments
The following theorem relates the mean, variance, and standard deviation of Z and X.

As before, suppose that X = a + bZ. Then

1. E(X) = a + b E(Z)
2. var(X) = b^2 var(Z)
3. sd(X) = b sd(Z)
Proof

Recall that the standard score of a random variable is obtained by subtracting the mean and dividing by the standard deviation. The
standard score is dimensionless (that is, has no physical units) and measures the distance from the mean to the random variable in
standard deviations. Since location-scale familes essentially correspond to a change of units, it's not surprising that the standard
score is unchanged by a location-scale transformation.

The standard scores of X and Z are the same:


(X − E(X))/sd(X) = (Z − E(Z))/sd(Z) (5.1.6)

Proof

Recall that the skewness and kurtosis of a random variable are the third and fourth moments, respectively, of the standard score.
Thus it follows from the previous result that skewness and kurtosis are unchanged by location-scale transformations:
skew(X) = skew(Z) , kurt(X) = kurt(Z) .

We can relate the moments of X (about 0) to those of Z by means of the binomial theorem: since X = a + bZ, expanding (a + bZ)^n gives

E(X^n) = ∑_{k=0}^{n} \binom{n}{k} b^k a^{n−k} E(Z^k), n ∈ N (5.1.8)

Of course, the moments of X about the location parameter a have a simple representation in terms of the moments of Z about 0:

E[(X − a)^n] = b^n E(Z^n), n ∈ N (5.1.9)

The following exercise relates the moment generating functions of Z and X.

If Z has moment generating function m then X has moment generating function M given by
M(t) = e^{at} m(bt) (5.1.10)

Proof

Type
As we noted earlier, two probability distributions that are related by a location-scale transformation can be thought of as governing
the same underlying random quantity, but in different physical units. This relationship is important enough to deserve a name.

Suppose that P and Q are probability distributions on R with distribution functions F and G, respectively. Then P and Q are
of the same type if there exist constants a ∈ R and b ∈ (0, ∞) such that
F(x) = G((x − a)/b), x ∈ R (5.1.12)

Being of the same type is an equivalence relation on the collection of probability distributions on R. That is, if P, Q, and R are probability distributions on R then
1. P is the same type as P (the reflexive property).
2. If P is the same type as Q then Q is the same type as P (the symmetric property).
3. If P is the same type as Q, and Q is the same type as R , then P is the same type as R (the transitive property).
Proof

So, the collection of probability distributions on R is partitioned into mutually exclusive equivalence classes, where the
distributions in each class are all of the same type.

Examples and Applications


Special Distributions
Many of the special parametric families of distributions studied in this chapter and elsewhere in this text are location and/or scale
families.

The arcsine distribution is a location-scale family.

The Cauchy distribution is a location-scale family.

The exponential distribution is a scale family.

The exponential-logarithmic distribution is a scale family for each value of the shape parameter.

The extreme value distribution is a location-scale family.

The gamma distribution is a scale family for each value of the shape parameter.

The Gompertz distribution is a scale family for each value of the shape parameter.

The half-normal distribution is a scale family.

The hyperbolic secant distribution is a location-scale family.

The Lévy distribution is a location-scale family.

The logistic distribution is a location-scale family.

The log-logistic distribution is a scale family for each value of the shape parameter.

The Maxwell distribution is a scale family.

The normal distribution is a location-scale family.

The Pareto distribution is a scale family for each value of the shape parameter.

The Rayleigh distribution is a scale family.

The semicircle distribution is a location-scale family.

The triangle distribution is a location-scale family for each value of the shape parameter.

The uniform distribution on an interval is a location-scale family.

The U-power distribution is a location-scale family for each value of the shape parameter.

The Weibull distribution is a scale family for each value of the shape parameter.

The Wald distribution is a scale family, although in the usual formulation, neither of the parameters is a scale parameter.

This page titled 5.1: Location-Scale Families is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

5.2: General Exponential Families

Basic Theory
Definition
We start with a probability space (Ω, F , P) as a model for a random experiment. So as usual, Ω is the set of outcomes, F the σ-
algebra of events, and P the probability measure on the sample space (Ω, F ). For the general formulation that we want in this
section, we need two additional spaces, a measure space (S, S , μ) (where the probability distributions will live) and a measurable
space (T , T ) (serving the role of a parameter space). Typically, these spaces fall into our two standard categories. Specifically, the
measure space is usually one of the following:
Discrete. S is countable, S is the collection of all subsets of S, and μ = # is counting measure.
Euclidean. S is a sufficiently nice Borel measurable subset of R^n for some n ∈ N_+, S is the σ-algebra of Borel measurable subsets of S, and μ = λ_n is n-dimensional Lebesgue measure.

Similarly, the parameter space (T, T) is usually either discrete, so that T is countable and T the collection of all subsets of T, or Euclidean, so that T is a sufficiently nice Borel measurable subset of R^m for some m ∈ N_+ and T is the σ-algebra of Borel measurable subsets of T.
Suppose now that X is a random variable defined on the probability space, taking values in S, and that the distribution of X depends on a parameter θ ∈ T. For θ ∈ T we assume that the distribution of X has probability density function f_θ with respect to μ.

For k ∈ N_+, the family of distributions of X is a k-parameter exponential family if

f_θ(x) = α(θ) g(x) exp(∑_{i=1}^{k} β_i(θ) h_i(x)); x ∈ S, θ ∈ T (5.2.1)

where α and (β_1, β_2, …, β_k) are measurable functions from T into R, and where g and (h_1, h_2, …, h_k) are measurable functions from S into R. Moreover, k is assumed to be the smallest such integer.

1. The parameters (β_1(θ), β_2(θ), …, β_k(θ)) are called the natural parameters of the distribution.
2. The random variables (h_1(X), h_2(X), …, h_k(X)) are called the natural statistics of the distribution.

Although the definition may look intimidating, exponential families are useful because many important theoretical results in statistics hold for exponential families, and because many special parametric families of distributions turn out to be exponential families. It's important to emphasize that the representation of f_θ(x) given in the definition must hold for all x ∈ S and θ ∈ T. If the representation only holds for a set of x ∈ S that depends on the particular θ ∈ T, then the family of distributions is not a general exponential family.
The next result shows that if we sample from the distribution of an exponential family, then the distribution of the random sample
is itself an exponential family with the same natural parameters.

Suppose that the distribution of random variable X is a k-parameter exponential family with natural parameters (β_1(θ), β_2(θ), …, β_k(θ)), and natural statistics (h_1(X), h_2(X), …, h_k(X)). Let X = (X_1, X_2, …, X_n) be a sequence of n independent random variables, each with the same distribution as X. Then X is a k-parameter exponential family with natural parameters (β_1(θ), β_2(θ), …, β_k(θ)), and natural statistics

u_j(X) = ∑_{i=1}^{n} h_j(X_i), j ∈ {1, 2, …, k} (5.2.2)

Proof

Examples and Special Cases

Special Distributions
Many of the special distributions studied in this chapter are general exponential families, at least with respect to some of their
parameters. On the other hand, most commonly, a parametric family fails to be a general exponential family because the support set
depends on the parameter. The following theorems give a number of examples. Proofs will be provided in the individual sections.

The Bernoulli distribution is a one-parameter exponential family in the success parameter p ∈ [0, 1].
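For illustration, here is a sketch of the computation (not part of the original statement): for p ∈ (0, 1), the Bernoulli probability density function can be written as

f_p(x) = p^x (1 − p)^{1−x} = (1 − p) exp(x ln[p/(1 − p)]), x ∈ {0, 1}

which has the form (5.2.1) with k = 1, α(p) = 1 − p, g(x) = 1, natural parameter β_1(p) = ln[p/(1 − p)], and natural statistic h_1(x) = x.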

The beta distribution is a two-parameter exponential family in the shape parameters a ∈ (0, ∞), b ∈ (0, ∞).

The beta prime distribution is a two-parameter exponential family in the shape parameters a ∈ (0, ∞) , b ∈ (0, ∞).

The binomial distribution is a one-parameter exponential family in the success parameter p ∈ [0, 1] for a fixed value of the trial parameter n ∈ N_+.

The chi-square distribution is a one-parameter exponential family in the degrees of freedom n ∈ (0, ∞) .

The exponential distribution is a one-parameter exponential family (appropriately enough), in the rate parameter r ∈ (0, ∞).

The gamma distribution is a two-parameter exponential family in the shape parameter k ∈ (0, ∞) and the scale parameter
b ∈ (0, ∞).

The geometric distribution is a one-parameter exponential family in the success probability p ∈ (0, 1).

The half normal distribution is a one-parameter exponential family in the scale parameter σ ∈ (0, ∞).

The Laplace distribution is a one-parameter exponential family in the scale parameter b ∈ (0, ∞) for a fixed value of the
location parameter a ∈ R .

The Lévy distribution is a one-parameter exponential family in the scale parameter b ∈ (0, ∞) for a fixed value of the location
parameter a ∈ R .

The logarithmic distribution is a one-parameter exponential family in the shape parameter p ∈ (0, 1).

The lognormal distribution is a two-parameter exponential family in the shape parameters μ ∈ R, σ ∈ (0, ∞).

The Maxwell distribution is a one-parameter exponential family in the scale parameter b ∈ (0, ∞).

The k-dimensional multinomial distribution is a k-parameter exponential family in the probability parameters (p_1, p_2, …, p_k) for a fixed value of the trial parameter n ∈ N_+.

The k-dimensional multivariate normal distribution is a (1/2)(k^2 + 3k)-parameter exponential family with respect to the mean vector μ and the variance-covariance matrix V.

The negative binomial distribution is a one-parameter exponential family in the success parameter p ∈ (0, 1) for a fixed value of the stopping parameter k ∈ N_+.

The normal distribution is a two-parameter exponential family in the mean μ ∈ R and the standard deviation σ ∈ (0, ∞).

The Pareto distribution is a one-parameter exponential family in the shape parameter for a fixed value of the scale parameter.

The Poisson distribution is a one-parameter exponential family.

The Rayleigh distribution is a one-parameter exponential family.

The U-power distribution is a one-parameter exponential family in the shape parameter, for fixed values of the location and
scale parameters.

The Weibull distribution is a one-parameter exponential family in the scale parameter for a fixed value of the shape parameter.

The zeta distribution is a one-parameter exponential family.

The Wald distribution is a two-parameter exponential family.

This page titled 5.2: General Exponential Families is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

5.3: Stable Distributions

This section discusses a theoretical topic that you may want to skip if you are a new student of probability.

Basic Theory
Stable distributions are an important general class of probability distributions on R that are defined in terms of location-scale
transformations. Stable distributions occur as limits (in distribution) of scaled and centered sums of independent, identically
distributed variables. Such limits generalize the central limit theorem, and so stable distributions generalize the normal distribution
in a sense. The pioneering work on stable distributions was done by Paul Lévy.

Definition
In this section, we consider real-valued random variables whose distributions are not degenerate (that is, not concentrated at a
single value). After all, a random variable with a degenerate distribution is not really random, and so is not of much interest.

Random variable X has a stable distribution if the following condition holds: If n ∈ N_+ and (X_1, X_2, …, X_n) is a sequence of independent variables, each with the same distribution as X, then X_1 + X_2 + ⋯ + X_n has the same distribution as a_n + b_n X for some a_n ∈ R and b_n ∈ (0, ∞). If a_n = 0 for n ∈ N_+ then the distribution of X is strictly stable.

1. The parameters a_n for n ∈ N_+ are the centering parameters.
2. The parameters b_n for n ∈ N_+ are the norming parameters.

Details

Recall that two distributions on R that are related by a location-scale transformation are said to be of the same type, and that being
of the same type defines an equivalence relation on the class of distributions on R. With this terminology, the definition of stability
has a more elegant expression: X has a stable distribution if the sum of a finite number of independent copies of X is of the same
type as X. As we will see, the norming parameters are more important than the centering parameters, and in fact, only certain
norming parameters can occur.

Basic Properties
We start with some very simple results that follow easily from the definition, before moving on to the deeper results.

Suppose that X has a stable distribution with mean μ and finite variance. Then the norming parameters are √n and the centering parameters are (n − √n)μ for n ∈ N_+.

Proof

It will turn out that the only stable distribution with finite variance is the normal distribution, but the result above is useful as an
intermediate step. Next, it seems fairly clear from the definition that the family of stable distributions is itself a location-scale
family.

Suppose that the distribution of X is stable, with centering parameters a_n ∈ R and norming parameters b_n ∈ (0, ∞) for n ∈ N_+. If c ∈ R and d ∈ (0, ∞), then the distribution of Y = c + dX is also stable, with centering parameters d a_n + (n − b_n)c and norming parameters b_n for n ∈ N_+.

Proof

An important point is that the norming parameters are unchanged under a location-scale transformation.

Suppose that the distribution of X is stable, with centering parameters a_n ∈ R and norming parameters b_n ∈ (0, ∞) for n ∈ N_+. Then the distribution of −X is stable, with centering parameters −a_n and norming parameters b_n for n ∈ N_+.

Proof

From the last two results, if X has a stable distribution, then so does c + dX , with the same norming parameters, for every
c, d ∈ R with d ≠ 0 . Stable distributions are also closed under convolution (corresponding to sums of independent variables) if the

norming parameters are the same.

Suppose that X and Y are independent variables. Assume also that X has a stable distribution with centering parameters a_n ∈ R and norming parameters b_n ∈ (0, ∞) for n ∈ N_+, and that Y has a stable distribution with centering parameters c_n ∈ R and the same norming parameters b_n for n ∈ N_+. Then Z = X + Y has a stable distribution with centering parameters a_n + c_n and norming parameters b_n for n ∈ N_+.

Proof

We can now give another characterization of stability that just involves two independent copies of X.

Random variable X has a stable distribution if and only if the following condition holds: If X_1, X_2 are independent variables, each with the same distribution as X, and d_1, d_2 ∈ (0, ∞), then d_1 X_1 + d_2 X_2 has the same distribution as a + bX for some a ∈ R and b ∈ (0, ∞).

Proof

As a corollary of a couple of the results above, we have the following:

Suppose that X and Y are independent with the same stable distribution. Then the distribution of X − Y is strictly stable, with
the same norming parameters.

Note that the distribution of X − Y is symmetric (about 0). The last result is useful because it allows us to get rid of the centering
parameters when proving facts about the norming parameters. Here is the most important of those facts:

Suppose that X has a stable distribution. Then the norming parameters have the form b_n = n^{1/α} for n ∈ N_+, for some α ∈ (0, 2]. The parameter α is known as the index or characteristic exponent of the distribution.

Proof

Every stable distribution is continuous.


Proof

The next result is a precise statement of the limit theorem alluded to in the introductory paragraph.
Suppose that (X_1, X_2, …) is a sequence of independent, identically distributed random variables, and let Y_n = ∑_{i=1}^{n} X_i for n ∈ N_+. If there exist constants a_n ∈ R and b_n ∈ (0, ∞) for n ∈ N_+ such that (Y_n − a_n)/b_n has a (non-degenerate) limiting distribution as n → ∞, then the limiting distribution is stable.

The following theorem completely characterizes stable distributions in terms of the characteristic function.

Suppose that X has a stable distribution. The characteristic function of X has the following form, for some α ∈ (0, 2], β ∈ [−1, 1], c ∈ R, and d ∈ (0, ∞):

χ(t) = E(e^{itX}) = exp(itc − d^α |t|^α [1 + iβ sgn(t) u_α(t)]), t ∈ R (5.3.15)

where sgn is the usual sign function, and where

u_α(t) = tan(πα/2) if α ≠ 1, u_α(t) = (2/π) ln|t| if α = 1 (5.3.16)

1. The parameter α is the index, as before.


2. The parameter β is the skewness parameter.
3. The parameter c is the location parameter.
4. The parameter d is the scale parameter.

Thus, the family of stable distributions is a 4-parameter family. The index parameter α and the skewness parameter β can be considered shape parameters. When the location parameter c = 0 and the scale parameter d = 1, we get the standard form of the stable distributions, with characteristic function

χ(t) = E(e^{itX}) = exp(−|t|^α [1 + iβ sgn(t) u_α(t)]), t ∈ R (5.3.17)

The characteristic function gives another proof that stable distributions are closed under convolution (corresponding to sums of
independent variables), if the index is fixed.

Suppose that X_1 and X_2 are independent random variables, and that X_k has the stable distribution with common index α ∈ (0, 2], skewness parameter β_k ∈ [−1, 1], location parameter c_k ∈ R, and scale parameter d_k ∈ (0, ∞), for k ∈ {1, 2}. Then X_1 + X_2 has the stable distribution with index α, location parameter c = c_1 + c_2, scale parameter d = (d_1^α + d_2^α)^{1/α}, and skewness parameter

β = (β_1 d_1^α + β_2 d_2^α) / (d_1^α + d_2^α) (5.3.18)

Proof

Special Cases
Three special parametric families of distributions studied in this chapter are stable. In the proofs in this subsection, we use the
definition of stability and various important properties of the distributions. These properties, in turn, are verified in the sections
devoted to the distributions. We also give proofs based on the characteristic function, which allows us to identify the skewness
parameter.

The normal distribution is stable with index α = 2 . There is no skewness parameter.


Proof

Of course, the normal distribution has finite variance, so once we know that it is stable, it follows from the finite variance property
above that the index must be 2. Moreover, the characteristic function shows that the normal distribution is the only stable
distribution with index 2, and hence the only stable distribution with finite variance.

Open the special distribution simulator and select the normal distribution. Vary the parameters and note the shape and location
of the probability density function. For various values of the parameters, run the simulation 1000 times and compare the
empirical density function to the probability density function.

The Cauchy distribution is stable with index α = 1 and skewness parameter β = 0 .


Proof
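Since the Cauchy distribution is strictly stable with index α = 1, the average of n independent standard Cauchy variables again has the standard Cauchy distribution, which is easy to check by simulation. A sketch in Python with NumPy and SciPy (sample sizes are arbitrary choices):

import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(seed=42)
n, reps = 10, 100_000

# Average of n independent standard Cauchy variables, repeated many times.
means = rng.standard_cauchy(size=(reps, n)).mean(axis=1)

# The averages should again be standard Cauchy, so the Kolmogorov-Smirnov
# test should not reject (large p-value).
print(kstest(means, 'cauchy'))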

Open the special distribution simulator and select the Cauchy distribution. Vary the parameters and note the shape and location
of the probability density function. For various values of the parameters, run the simulation 1000 times and compare the
empirical density function to the probability density function.

The Lévy distribution is stable with index α = 1/2 and skewness parameter β = 1.
Proof

Open the special distribution simulator and select the Lévy distribution. Vary the parameters and note the shape and location of
the probability density function. For various values of the parameters, run the simulation 1000 times and compare the empirical
density function to the probability density function.

The normal, Cauchy, and Lévy distributions are the only stable distributions for which the probability density function is known in
closed form.

This page titled 5.3: Stable Distributions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

5.4: Infinitely Divisible Distributions

This section discusses a theoretical topic that you may want to skip if you are a new student of probability.

Basic Theory
Infinitely divisible distributions form an important class of distributions on R that includes the stable distributions, the compound Poisson distributions, as well as several of the most important special parametric families of distributions. Basically, the distribution of a real-valued random variable is infinitely divisible if for each n ∈ N_+, the variable can be decomposed into the sum of n independent copies of another variable. Here is the precise definition.

The distribution of a real-valued random variable X is infinitely divisible if for every n ∈ N_+, there exists a sequence of independent, identically distributed variables (X_1, X_2, …, X_n) such that X_1 + X_2 + ⋯ + X_n has the same distribution as X.

If the distribution of X is stable then the distribution is infinitely divisible.


Proof

Suppose now that X = (X_1, X_2, …) is a sequence of independent, identically distributed random variables, and that N has a Poisson distribution and is independent of X. Recall that the distribution of ∑_{i=1}^{N} X_i is said to be compound Poisson. Like the stable distributions, the compound Poisson distributions form another important class of infinitely divisible distributions.

Suppose that Y is a random variable.


1. If Y is compound Poisson then Y is infinitely divisible.
2. If Y is infinitely divisible and takes values in N then Y is compound Poisson.
Proof
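A compound Poisson variable can be simulated directly from this representation. A minimal sketch in Python with NumPy (the Poisson rate and the distribution of the summands are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(seed=0)

def compound_poisson(lam, sampler, size):
    # Simulate sum_{i=1}^N X_i where N ~ Poisson(lam) and the X_i are
    # independent draws from sampler, independent of N.
    N = rng.poisson(lam, size=size)
    return np.array([sampler(n).sum() for n in N])

# Example: N ~ Poisson(3), summands X_i exponential with mean 1.
Y = compound_poisson(3.0, lambda n: rng.exponential(1.0, size=n), size=10_000)
print(Y.mean())  # should be near lam * E(X) = 3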

Special Cases
A number of special distributions are infinitely divisible. Proofs of the results stated below are given in the individual sections.

Stable Distributions
First, the normal distribution, the Cauchy distribution, and the Lévy distribution are stable, so they are infinitely divisible.
However, direct arguments give more information, because we can identify the distribution of the component variables.

The normal distribution is infinitely divisible. If X has the normal distribution with mean μ ∈ R and standard deviation σ ∈ (0, ∞), then for n ∈ N_+, X has the same distribution as X_1 + X_2 + ⋯ + X_n where (X_1, X_2, …, X_n) are independent, and X_i has the normal distribution with mean μ/n and standard deviation σ/√n for each i ∈ {1, 2, …, n}.

The Cauchy distribution is infinitely divisible. If X has the Cauchy distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞), then for n ∈ N_+, X has the same distribution as X_1 + X_2 + ⋯ + X_n where (X_1, X_2, …, X_n) are independent, and X_i has the Cauchy distribution with location parameter a/n and scale parameter b/n for each i ∈ {1, 2, …, n}.

Other Special Distributions


On the other hand, there are distributions that are infinitely divisible but not stable.

The gamma distribution is infinitely divisible. If X has the gamma distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞), then for n ∈ N_+, X has the same distribution as X_1 + X_2 + ⋯ + X_n where (X_1, X_2, …, X_n) are independent, and X_i has the gamma distribution with shape parameter k/n and scale parameter b for each i ∈ {1, 2, …, n}.
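This decomposition is easy to check by simulation. A sketch in Python with NumPy and SciPy (the parameter values are arbitrary illustrative choices):

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(seed=1)
k, b, n, reps = 2.5, 1.5, 4, 50_000

# X with the gamma distribution directly, versus the sum of n independent
# gamma variables with shape parameter k/n and the same scale parameter b.
X = rng.gamma(k, b, size=reps)
X_sum = rng.gamma(k / n, b, size=(reps, n)).sum(axis=1)

# The two samples should be statistically indistinguishable, so the
# two-sample Kolmogorov-Smirnov test should not reject.
print(ks_2samp(X, X_sum))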

The chi-square distribution is infinitely divisible. If X has the chi-square distribution with k ∈ (0, ∞) degrees of freedom, then for n ∈ N_+, X has the same distribution as X_1 + X_2 + ⋯ + X_n where (X_1, X_2, …, X_n) are independent, and X_i has the chi-square distribution with k/n degrees of freedom for each i ∈ {1, 2, …, n}.

The Poisson distribution is infinitely divisible. If X has the Poisson distribution with rate parameter λ ∈ (0, ∞), then for n ∈ N_+, X has the same distribution as X_1 + X_2 + ⋯ + X_n where (X_1, X_2, …, X_n) are independent, and X_i has the Poisson distribution with rate parameter λ/n for each i ∈ {1, 2, …, n}.

The general negative binomial distribution on N is infinitely divisible. If X has the negative binomial distribution on N with parameters k ∈ (0, ∞) and p ∈ (0, 1), then for n ∈ N_+, X has the same distribution as X_1 + X_2 + ⋯ + X_n where (X_1, X_2, …, X_n) are independent, and X_i has the negative binomial distribution on N with parameters k/n and p for each i ∈ {1, 2, …, n}.

Since the Poisson distribution and the negative binomial distributions are distributions on N, it follows from the characterization
above that these distributions must be compound Poisson. Of course it is completely trivial that the Poisson distribution is
compound Poisson, but it's far from obvious that the negative binomial distribution has this property. It turns out that the negative
binomial distribution can be obtained by compounding the logarithmic series distribution with the Poisson distribution.

The Wald distribution is infinitely divisible. If X has the Wald distribution with shape parameter λ ∈ (0, ∞) and mean μ ∈ (0, ∞), then for n ∈ N_+, X has the same distribution as X_1 + X_2 + ⋯ + X_n where (X_1, X_2, …, X_n) are independent, and X_i has the Wald distribution with shape parameter λ/n^2 and mean μ/n for each i ∈ {1, 2, …, n}.

This page titled 5.4: Infinitely Divisible Distributions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

5.5: Power Series Distributions

Power Series Distributions are discrete distributions on (a subset of) N constructed from power series. This class of distributions is
important because most of the special, discrete distributions are power series distributions.

Basic Theory
Power Series
Suppose that a = (a_0, a_1, a_2, …) is a sequence of nonnegative real numbers. We are interested in the power series with a as the sequence of coefficients. Recall first that the partial sum of order n ∈ N is

g_n(θ) = ∑_{k=0}^{n} a_k θ^k, θ ∈ R (5.5.1)

The power series g is then defined by g(θ) = lim_{n→∞} g_n(θ) for θ ∈ R for which the limit exists, and is denoted

g(θ) = ∑_{n=0}^{∞} a_n θ^n (5.5.2)

Note that the series converges when θ = 0, and g(0) = a₀. Beyond this trivial case, recall that there exists r ∈ [0, ∞] such that the

series converges absolutely for |θ| < r and diverges for |θ| > r . The number r is the radius of convergence. From now on, we
assume that r > 0 . If r < ∞ , the series may converge (absolutely) or may diverge to ∞ at the endpoint r. At −r, the series may
converge absolutely, may converge conditionally, or may diverge.

Distributions
From now on, we restrict θ to the interval [0, r); this interval is our parameter space. Some of the results below may hold when
r < ∞ and θ = r , but dealing with this case explicitly makes the exposition unnecessarily cumbersome.

Suppose that N is a random variable with values in N. Then N has the power series distribution associated with the function g (or equivalently with the sequence a) and with parameter θ ∈ [0, r) if N has probability density function f_θ given by

f_θ(n) = aₙ θⁿ / g(θ),  n ∈ N  (5.5.3)

Proof

Note that when θ = 0, the distribution is simply the point mass distribution at 0; that is, f₀(0) = 1.

The distribution function F_θ is given by

F_θ(n) = gₙ(θ) / g(θ),  n ∈ N  (5.5.4)

Proof

Of course, the probability density function f_θ is most useful when the power series g(θ) can be given in closed form, and similarly the distribution function F_θ is most useful when the power series and the partial sums can be given in closed form.
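As a quick illustration of these definitions, here is a short Python sketch (our own, with hypothetical helper names): it builds the probability density function f_θ(n) = aₙθⁿ/g(θ) and the distribution function F_θ(n) = gₙ(θ)/g(θ) from a coefficient sequence, using the case aₙ = 1 for all n, where g(θ) = 1/(1 − θ) and the result is the geometric distribution on N with success parameter p = 1 − θ.

    import numpy as np

    def power_series_pmf(a, theta, g_theta):
        # f_theta(n) = a_n * theta^n / g(theta) for n = 0, 1, ..., len(a) - 1
        n = np.arange(len(a))
        return a * theta**n / g_theta

    theta = 0.3
    a = np.ones(50)                    # coefficients a_n = 1
    g_theta = 1.0 / (1.0 - theta)      # closed form of the power series
    pmf = power_series_pmf(a, theta, g_theta)
    cdf = np.cumsum(pmf)               # partial sums g_n(theta) / g(theta)

    p = 1 - theta
    n = np.arange(3)
    print(pmf[:3])                     # power series pmf
    print(p * theta**n)                # geometric pmf p(1 - p)^n, same values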

Moments
The moments of N can be expressed in terms of the underlying power series function g, and the nicest expression is for the factorial moments. Recall that the permutation formula is t^(k) = t(t − 1) ⋯ (t − k + 1) for t ∈ R and k ∈ N, and the factorial moment of N of order k ∈ N is E(N^(k)).

For θ ∈ [0, r), the factorial moments of N are as follows, where g^(k) is the kth derivative of g:

E(N^(k)) = θᵏ g^(k)(θ) / g(θ),  k ∈ N  (5.5.5)

Proof

The mean and variance of N are

1. E(N) = θ g′(θ)/g(θ)
2. var(N) = θ² (g″(θ)/g(θ) − [g′(θ)/g(θ)]²) + θ g′(θ)/g(θ)

Proof
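For a numerical sanity check (illustrative, not from the text), take the Poisson case g(θ) = e^θ, for which both the mean and the variance should reduce to θ. The sketch below approximates g′ and g″ with central differences:

    import numpy as np

    def g(theta):
        return np.exp(theta)           # Poisson case

    theta, h = 1.7, 1e-5
    g1 = (g(theta + h) - g(theta - h)) / (2 * h)              # g'(theta)
    g2 = (g(theta + h) - 2 * g(theta) + g(theta - h)) / h**2  # g''(theta)

    mean = theta * g1 / g(theta)
    var = theta**2 * (g2 / g(theta) - (g1 / g(theta))**2) + mean
    print(mean, var)                   # both approximately 1.7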

The probability generating function of N also has a simple expression in terms of g .

For θ ∈ (0, r), the probability generating function P of N is given by

P(t) = E(t^N) = g(θt)/g(θ),  t < r/θ  (5.5.7)

Proof

Relations
Power series distributions are closed with respect to sums of independent variables.

Suppose that N₁ and N₂ are independent, and have power series distributions relative to the functions g₁ and g₂, respectively, each with parameter value θ < min{r₁, r₂}. Then N₁ + N₂ has the power series distribution relative to the function g₁g₂, with parameter value θ.


Proof

Here is a simple corollary.

Suppose that (N₁, N₂, …, Nₖ) is a sequence of independent variables, each with the same power series distribution, relative to the function g and with parameter value θ < r. Then N₁ + N₂ + ⋯ + Nₖ has the power series distribution relative to the function gᵏ and with parameter θ.

In the context of this result, recall that (N₁, N₂, …, Nₖ) is a random sample of size k from the common distribution.

Examples and Special Cases


Special Distributions

The Poisson distribution with rate parameter λ ∈ [0, ∞) is a power series distribution relative to the function g(λ) = e^λ for λ ∈ [0, ∞).

Proof

The geometric distribution on N with success parameter p ∈ (0, 1] is a power series distribution relative to the function
g(θ) = 1/(1 − θ) for θ ∈ [0, 1), where θ = 1 − p .

Proof

For fixed k ∈ (0, ∞), the negative binomial distribution on N with stopping parameter k and success parameter p ∈ (0, 1] is a power series distribution relative to the function g(θ) = 1/(1 − θ)ᵏ for θ ∈ [0, 1), where θ = 1 − p.

Proof

For fixed n ∈ N₊, the binomial distribution with trial parameter n and success parameter p ∈ [0, 1) is a power series distribution relative to the function g(θ) = (1 + θ)ⁿ for θ ∈ [0, ∞), where θ = p/(1 − p).

Proof

The logarithmic distribution with parameter p ∈ [0, 1) is a power series distribution relative to the function
g(p) = − ln(1 − p) for p ∈ [0, 1).

Proof

This page titled 5.5: Power Series Distributions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

5.6: The Normal Distribution

The normal distribution holds an honored role in probability and statistics, mostly because of the central limit theorem, one of the
fundamental theorems that forms a bridge between the two subjects. In addition, as we will see, the normal distribution has many
nice mathematical properties. The normal distribution is also called the Gaussian distribution, in honor of Carl Friedrich Gauss,
who was among the first to use the distribution.

The Standard Normal Distribution


Distribution Functions

The standard normal distribution is a continuous distribution on R with probability density function ϕ given by
ϕ(z) = (1/√(2π)) e^{−z²/2},  z ∈ R  (5.6.1)

Proof that ϕ is a probability density function

The standard normal probability density function has the famous “bell shape” that is known to just about everyone.

The standard normal density function ϕ satisfies the following properties:


1. ϕ is symmetric about z = 0 .
2. ϕ increases and then decreases, with mode z = 0 .
3. ϕ is concave upward and then downward and then upward again, with inflection points at z = ±1 .
4. ϕ(z) → 0 as z → ∞ and as z → −∞ .
Proof

In the Special Distribution Simulator, select the normal distribution and keep the default settings. Note the shape and location
of the standard normal density function. Run the simulation 1000 times, and compare the empirical density function to the
probability density function.

The standard normal distribution function Φ, given by

Φ(z) = ∫_{−∞}^{z} ϕ(t) dt = ∫_{−∞}^{z} (1/√(2π)) e^{−t²/2} dt  (5.6.4)

and its inverse, the quantile function Φ^{−1}, cannot be expressed in closed form in terms of elementary functions. However, approximate values of these functions can be obtained from the special distribution calculator, and from most mathematics and statistics software. Indeed these functions are so important that they are considered special functions of mathematics.

The standard normal distribution function Φ satisfies the following properties:

1. Φ(−z) = 1 − Φ(z) for z ∈ R
2. Φ^{−1}(p) = −Φ^{−1}(1 − p) for p ∈ (0, 1)
3. Φ(0) = 1/2, so the median is 0.
Proof

In the special distribution calculator, select the normal distribution and keep the default settings.
1. Note the shape of the density function and the distribution function.
2. Find the first and third quartiles.
3. Compute the interquartile range.

In the special distribution calculator, select the normal distribution and keep the default settings. Find the quantiles of the
following orders for the standard normal distribution:
1. p = 0.001, p = 0.999

2. p = 0.05, p = 0.95
3. p = 0.1 , p = 0.9
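If the special distribution calculator is not at hand, scipy's implementation of Φ^{−1} gives the same quantiles; a one-line sketch (assuming scipy is installed):

    from scipy.stats import norm

    for p in [0.001, 0.05, 0.1, 0.9, 0.95, 0.999]:
        print(p, norm.ppf(p))
    # Note the symmetry norm.ppf(p) == -norm.ppf(1 - p); for example,
    # norm.ppf(0.05) ≈ -1.6449 while norm.ppf(0.95) ≈ 1.6449.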

Moments
Suppose that random variable Z has the standard normal distribution.

The mean and variance of Z are


1. E(Z) = 0
2. var(Z) = 1
Proof

In the Special Distribution Simulator, select the normal distribution and keep the default settings. Note the shape and size of
the mean ± standard deviation bar. Run the simulation 1000 times, and compare the empirical mean and standard deviation to
the true mean and standard deviation.

More generally, we can compute all of the moments. The key is the following recursion formula.

For n ∈ N₊, E(Z^{n+1}) = n E(Z^{n−1})

Proof

The moments of the standard normal distribution are now easy to compute.

For n ∈ N,

1. E(Z^{2n}) = 1 · 3 ⋯ (2n − 1) = (2n)!/(n! 2ⁿ)
2. E(Z^{2n+1}) = 0

Proof
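The moment formula is easy to verify numerically (an illustrative check, not part of the text), by integrating z^{2n} ϕ(z) over R:

    import numpy as np
    from math import factorial
    from scipy import integrate
    from scipy.stats import norm

    for n in range(1, 5):
        moment, _ = integrate.quad(lambda z: z**(2 * n) * norm.pdf(z),
                                   -np.inf, np.inf)
        print(2 * n, moment, factorial(2 * n) / (factorial(n) * 2**n))
    # prints 1, 3, 15, 105 for the moments of order 2, 4, 6, 8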

Of course, the fact that the odd-order moments are 0 also follows from the symmetry of the distribution. The following theorem
gives the skewness and kurtosis of the standard normal distribution.

The skewness and kurtosis of Z are


1. skew(Z) = 0
2. kurt(Z) = 3
Proof

Because of the last result (and the use of the standard normal distribution literally as a standard), the excess kurtosis of a random variable is defined to be the ordinary kurtosis minus 3. Thus, the excess kurtosis of the normal distribution is 0.
Many other important properties of the normal distribution are most easily obtained using the moment generating function or the
characteristic function.

The moment generating function m and characteristic function χ of Z are given by

1. m(t) = e^{t²/2} for t ∈ R.
2. χ(t) = e^{−t²/2} for t ∈ R.

Proof

Thus, the standard normal distribution has the curious property that the characteristic function is a multiple of the probability density function:

χ = √(2π) ϕ  (5.6.11)

The moment generating function can be used to give another derivation of the moments of Z, since we know that E(Zⁿ) = m^(n)(0).

The General Normal Distribution
The general normal distribution is the location-scale family associated with the standard normal distribution.

Suppose that μ ∈ R and σ ∈ (0, ∞) and that Z has the standard normal distribution. Then X = μ + σZ has the normal
distribution with location parameter μ and scale parameter σ.

Distribution Functions
Suppose that X has the normal distribution with location parameter μ ∈ R and scale parameter σ ∈ (0, ∞). The basic properties of
the density function and distribution function of X follow from general results for location scale families.

The probability density function f of X is given by

f(x) = (1/σ) ϕ((x − μ)/σ) = (1/(σ√(2π))) exp[−(1/2)((x − μ)/σ)²],  x ∈ R  (5.6.12)

Proof

The probability density function f satisfies the following properties:


1. f is symmetric about x = μ .
2. f increases and then decreases with mode x = μ .
3. f is concave upward then downward then upward again, with inflection points at x = μ ± σ .
4. f (x) → 0 as x → ∞ and as x → −∞ .
Proof

In the special distribution simulator, select the normal distribution. Vary the parameters and note the shape and location of the
probability density function. With your choice of parameter settings, run the simulation 1000 times and compare the empirical
density function to the true probability density function.

Let F denote the distribution function of X, and as above, let Φ denote the standard normal distribution function.

The distribution function F and quantile function F^{−1} satisfy the following properties:

1. F(x) = Φ((x − μ)/σ) for x ∈ R.
2. F^{−1}(p) = μ + σΦ^{−1}(p) for p ∈ (0, 1).
3. F(μ) = 1/2, so the median occurs at x = μ.
Proof

In the special distribution calculator, select the normal distribution. Vary the parameters and note the shape of the density
function and the distribution function.

Moments
Suppose again that X has the normal distribution with location parameter μ ∈ R and scale parameter σ ∈ (0, ∞). As the notation
suggests, the location and scale parameters are also the mean and standard deviation, respectively.

The mean and variance of X are

1. E(X) = μ
2. var(X) = σ²

Proof

So the parameters of the normal distribution are usually referred to as the mean and standard deviation rather than location and
scale. The central moments of X can be computed easily from the moments of the standard normal distribution. The ordinary (raw)
moments of X can be computed from the central moments, but the formulas are a bit messy.

For n ∈ N,

1. E[(X − μ)^{2n}] = 1 · 3 ⋯ (2n − 1) σ^{2n} = (2n)! σ^{2n}/(n! 2ⁿ)
2. E[(X − μ)^{2n+1}] = 0

All of the odd central moments of X are 0, a fact that also follows from the symmetry of the probability density function.

In the special distribution simulator select the normal distribution. Vary the mean and standard deviation and note the size and
location of the mean/standard deviation bar. With your choice of parameter settings, run the simulation 1000 times and
compare the empirical mean and standard deviation to the true mean and standard deviation.

The following exercise gives the skewness and kurtosis.

The skewness and kurtosis of X are


1. skew(X) = 0
2. kurt(X) = 3
Proof

The moment generating function M and characteristic function χ of X are given by

1. M(t) = exp(μt + σ²t²/2) for t ∈ R.
2. χ(t) = exp(iμt − σ²t²/2) for t ∈ R.
Proof

Relations
The normal family of distributions satisfies two very important properties: invariance under linear transformations of the variable
and invariance with respect to sums of independent variables. The first property is essentially a restatement of the fact that the
normal distribution is a location-scale family.

Suppose that X is normally distributed with mean μ and variance σ². If a ∈ R and b ∈ R ∖ {0}, then a + bX is normally distributed with mean a + bμ and variance b²σ².

Proof

Recall that in general, if X is a random variable with mean μ and standard deviation σ > 0 , then Z = (X − μ)/σ is the standard
score of X. A corollary of the last result is that if X has a normal distribution then the standard score Z has a standard normal
distribution. Conversely, any normally distributed variable can be constructed from a standard normal variable.

Standard score.

1. If X has the normal distribution with mean μ and standard deviation σ, then Z = (X − μ)/σ has the standard normal distribution.
2. If Z has the standard normal distribution and if μ ∈ R and σ ∈ (0, ∞), then X = μ + σZ has the normal distribution with mean μ and standard deviation σ.

Suppose that X₁ and X₂ are independent random variables, and that Xᵢ is normally distributed with mean μᵢ and variance σᵢ² for i ∈ {1, 2}. Then X₁ + X₂ is normally distributed with

1. E(X₁ + X₂) = μ₁ + μ₂
2. var(X₁ + X₂) = σ₁² + σ₂²

Proof

This theorem generalizes to a sum of n independent, normal variables. The important part is that the sum is still normal; the
expressions for the mean and variance are standard results that hold for the sum of independent variables generally. As a
consequence of this result and the one for linear transformations, it follows that the normal distribution is stable.

The normal distribution is stable. Specifically, suppose that X has the normal distribution with mean μ ∈ R and variance σ² ∈ (0, ∞). If (X₁, X₂, …, Xₙ) are independent copies of X, then X₁ + X₂ + ⋯ + Xₙ has the same distribution as (n − √n)μ + √n X, namely normal with mean nμ and variance nσ².

Proof

All stable distributions are infinitely divisible, so the normal distribution belongs to this family as well. For completeness, here is
the explicit statement:

The normal distribution is infinitely divisible. Specifically, if X has the normal distribution with mean μ ∈ R and variance σ² ∈ (0, ∞), then for n ∈ N₊, X has the same distribution as X₁ + X₂ + ⋯ + Xₙ where (X₁, X₂, …, Xₙ) are independent, and each has the normal distribution with mean μ/n and variance σ²/n.

Finally, the normal distribution belongs to the family of general exponential distributions.

Suppose that X has the normal distribution with mean μ and variance σ². The distribution is a two-parameter exponential family with natural parameters (μ/σ², −1/(2σ²)), and natural statistics (X, X²).

Proof

A number of other special distributions studied in this chapter are constructed from normally distributed variables. These include
The lognormal distribution
The folded normal distribution, which includes the half normal distribution as a special case
The Rayleigh distribution
The Maxwell distribution
The Lévy distribution
Also, as mentioned at the beginning of this section, the importance of the normal distribution stems in large part from the central
limit theorem, one of the fundamental theorems of probability. By virtue of this theorem, the normal distribution is connected to
many other distributions, by means of limits and approximations, including the special distributions in the following list. Details
are given in the individual sections.
The binomial distribution
The negative binomial distribution
The Poisson distribution
The gamma distribution
The chi-square distribution
The Student t distribution
The Irwin-Hall distribution

Computational Exercises
Suppose that the volume of beer in a bottle of a certain brand is normally distributed with mean 0.5 liter and standard deviation
0.01 liter.
1. Find the probability that a bottle will contain at least 0.48 liter.
2. Find the volume that corresponds to the 95th percentile
Answer

A metal rod is designed to fit into a circular hole on a certain assembly. The radius of the rod is normally distributed with mean
1 cm and standard deviation 0.002 cm. The radius of the hole is normally distributed with mean 1.01 cm and standard deviation
0.003 cm. The machining processes that produce the rod and the hole are independent. Find the probability that the rod is too big for the hole.
Answer

The weight of a peach from a certain orchard is normally distributed with mean 8 ounces and standard deviation 1 ounce. Find
the probability that the combined weight of 5 peaches exceeds 45 ounces.
Answer
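For readers working through these exercises in software, here is a sketch of the three computations with scipy.stats.norm; each follows directly from the stated means and standard deviations.

    from math import sqrt
    from scipy.stats import norm

    # Beer: X ~ N(0.5, 0.01); P(X >= 0.48) and the 95th percentile
    print(norm.sf(0.48, loc=0.5, scale=0.01))    # ≈ 0.9772
    print(norm.ppf(0.95, loc=0.5, scale=0.01))   # ≈ 0.5164 liters

    # Rod and hole: the difference D = rod - hole is normal with mean
    # 1 - 1.01 = -0.01 and standard deviation sqrt(0.002^2 + 0.003^2);
    # the rod is too big when D > 0
    print(norm.sf(0, loc=-0.01, scale=sqrt(0.002**2 + 0.003**2)))  # ≈ 0.0028

    # Peaches: the sum of 5 independent N(8, 1) weights is N(40, sqrt(5))
    print(norm.sf(45, loc=40, scale=sqrt(5)))    # ≈ 0.0127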

A Further Generalization
In some settings, it's convenient to consider a constant as having a normal distribution (with mean being the constant and variance
0, of course). This convention simplifies the statements of theorems and definitions in these settings. Of course, the formulas for
the probability density function and the distribution function do not hold for a constant, but the other results involving the moment
generating function, linear transformations, and sums are still valid. Moreover, the result for linear transformations would hold for
all a and b .

This page titled 5.6: The Normal Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

5.7: The Multivariate Normal Distribution

The multivariate normal distribution is among the most important of multivariate distributions, particularly in statistical inference and the
study of Gaussian processes such as Brownian motion. The distribution arises naturally from linear transformations of independent
normal variables. In this section, we consider the bivariate normal distribution first, because explicit results can be given and because
graphical interpretations are possible. Then, with the aid of matrix notation, we discuss the general multivariate distribution.

The Bivariate Normal Distribution


The Standard Distribution
Recall that the probability density function ϕ of the standard normal distribution is given by
ϕ(z) = (1/√(2π)) e^{−z²/2},  z ∈ R  (5.7.1)

The corresponding distribution function is denoted Φ and is considered a special function in mathematics:

Φ(z) = ∫_{−∞}^{z} ϕ(x) dx = ∫_{−∞}^{z} (1/√(2π)) e^{−x²/2} dx,  z ∈ R  (5.7.2)

Finally, the moment generating function m is given by

m(t) = E(e^{tZ}) = exp[(1/2) var(tZ)] = e^{t²/2},  t ∈ R  (5.7.3)

Suppose that Z and W are independent random variables, each with the standard normal distribution. The distribution of (Z, W ) is
known as the standard bivariate normal distribution.

The basic properties of the standard bivariate normal distribution follow easily from independence and properties of the (univariate)
normal distribution. Recall first that the graph of a function f : R² → R is a surface. For c ∈ R, the set of points {(x, y) ∈ R² : f(x, y) = c} is the level curve of f at level c. The graph of f can be understood by means of the level curves.

The probability density function ϕ₂ of the standard bivariate normal distribution is given by

ϕ₂(z, w) = (1/(2π)) e^{−(z² + w²)/2},  (z, w) ∈ R²  (5.7.4)

1. The level curves of ϕ₂ are circles centered at the origin.
2. The mode of the distribution is (0, 0).
3. ϕ₂ is concave downward on {(z, w) ∈ R² : z² + w² < 1}

Proof

Clearly ϕ₂ has a number of symmetry properties as well: ϕ₂(z, w) is symmetric in z about 0 so that ϕ₂(−z, w) = ϕ₂(z, w); ϕ₂(z, w) is symmetric in w about 0 so that ϕ₂(z, −w) = ϕ₂(z, w); ϕ₂(z, w) is symmetric in (z, w) so that ϕ₂(z, w) = ϕ₂(w, z). In short, ϕ₂ has the classical “bell shape” that we associate with normal distributions.

Open the bivariate normal experiment, keep the default settings to get the standard bivariate normal distribution. Run the experiment
1000 times. Observe the cloud of points in the scatterplot, and compare the empirical density functions to the probability density
functions.

Suppose that (Z, W) has the standard bivariate normal distribution. The moment generating function m₂ of (Z, W) is given by

m₂(s, t) = E[exp(sZ + tW)] = exp[(1/2) var(sZ + tW)] = exp[(1/2)(s² + t²)],  (s, t) ∈ R²  (5.7.6)

Proof

The General Distribution


The general bivariate normal distribution can be constructed by means of an affine transformation on a standard bivariate normal vector. The distribution has 5 parameters. As we will see, two are location parameters, two are scale parameters, and one is a correlation parameter.

Suppose that (Z, W) has the standard bivariate normal distribution. Let μ, ν ∈ R; σ, τ ∈ (0, ∞); and ρ ∈ (−1, 1), and let X and Y be new random variables defined by

X = μ + σZ  (5.7.7)
Y = ν + τρZ + τ√(1 − ρ²) W  (5.7.8)

The joint distribution of (X, Y) is called the bivariate normal distribution with parameters (μ, ν, σ, τ, ρ).

We can use the change of variables formula to find the joint probability density function.

Suppose that (X, Y) has the bivariate normal distribution with the parameters (μ, ν, σ, τ, ρ) as specified above. The joint probability density function f of (X, Y) is given by

f(x, y) = (1/(2πστ√(1 − ρ²))) exp{−(1/(2(1 − ρ²))) [(x − μ)²/σ² − 2ρ(x − μ)(y − ν)/(στ) + (y − ν)²/τ²]},  (x, y) ∈ R²  (5.7.9)

1. The level curves of f are ellipses centered at (μ, ν).
2. The mode of the distribution is (μ, ν).
Proof

The following theorem gives fundamental properties of the bivariate normal distribution.

Suppose that (X, Y ) has the bivariate normal distribution with parameters (μ, ν , σ, τ , ρ) as specified above. Then
1. X is normally distributed with mean μ and standard deviation σ.
2. Y is normally distributed with mean ν and standard deviation τ .
3. cor(X, Y ) = ρ .
4. X and Y are independent if and only if ρ = 0 .
Proof

Thus, two random variables with a joint normal distribution are independent if and only if they are uncorrelated.

In the bivariate normal experiment, change the standard deviations of X and Y with the scroll bars. Watch the change in the shape of
the probability density functions. Now change the correlation with the scroll bar and note that the probability density functions do
not change. For various values of the parameters, run the experiment 1000 times. Observe the cloud of points in the scatterplot, and
compare the empirical density functions to the probability density functions.

In the case of perfect correlation (ρ = 1 or ρ = −1), the distribution of (X, Y) is also said to be bivariate normal, but degenerate. In this case, we know from our study of covariance and correlation that (X, Y) takes values on the regression line {(x, y) ∈ R² : y = ν + ρ(τ/σ)(x − μ)}, and hence does not have a probability density function (with respect to Lebesgue measure on R²). Degenerate normal distributions will be discussed in more detail below.

In the bivariate normal experiment, run the experiment 1000 times with the values of ρ given below and selected values of σ and τ .
Observe the cloud of points in the scatterplot and compare the empirical density functions to the probability density functions.
1. ρ ∈ {0, 0.3, 0.5, 0.7, 1}
2. ρ ∈ {−0.3, −0.5, −0.7, −1}

The conditional distributions are also normal.

Suppose that (X, Y) has the bivariate normal distribution with parameters (μ, ν, σ, τ, ρ) as specified above.

1. For x ∈ R, the conditional distribution of Y given X = x is normal with mean E(Y ∣ X = x) = ν + ρ(τ/σ)(x − μ) and variance var(Y ∣ X = x) = τ²(1 − ρ²).
2. For y ∈ R, the conditional distribution of X given Y = y is normal with mean E(X ∣ Y = y) = μ + ρ(σ/τ)(y − ν) and variance var(X ∣ Y = y) = σ²(1 − ρ²).
Proof from density functions


Proof from random variables

Note that the conditional variances do not depend on the value of the given variable.

In the bivariate normal experiment, set the standard deviation of X to 1.5, the standard deviation of Y to 0.5, and the correlation to
0.7.
1. Run the experiment 100 times.
2. For each run, compute E(Y ∣ X = x), the predicted value of Y given the value of X.
3. Over all 100 runs, compute the square root of the average of the squared errors between the predicted value of Y and the true
value of Y .

You may be perplexed by the lack of symmetry in how (X, Y) is defined in terms of (Z, W) in the original definition. Note however that the distribution is completely determined by the 5 parameters. If we define X′ = μ + σρZ + σ√(1 − ρ²) W and Y′ = ν + τZ, then (X′, Y′) has the same distribution as (X, Y), namely the bivariate normal distribution with parameters (μ, ν, σ, τ, ρ) (although, of course, (X′, Y′) and (X, Y) are different random vectors). There are other ways to define the same distribution as an affine transformation of (Z, W); the situation will be clarified in the next subsection.

Suppose that (X, Y) has the bivariate normal distribution with parameters (μ, ν, σ, τ, ρ). Then (X, Y) has moment generating function M given by

M(s, t) = E[exp(sX + tY)] = exp[E(sX + tY) + (1/2) var(sX + tY)] = exp[μs + νt + (1/2)(σ²s² + 2ρστst + τ²t²)],  (s, t) ∈ R²  (5.7.15)

Proof

We showed above that if (X, Y ) has a bivariate normal distribution then the marginal distributions of X and Y are also normal. The
converse is not true.

Suppose that (X, Y) has probability density function f given by

f(x, y) = (1/(2π)) e^{−(x² + y²)/2} [1 + xy e^{−(x² + y² − 2)/2}],  (x, y) ∈ R²  (5.7.18)

1. X and Y each have standard normal distributions.
2. (X, Y) does not have a bivariate normal distribution.
Proof

Transformations
Like its univariate counterpart, the family of bivariate normal distributions is preserved under two types of transformations on the
underlying random vector: affine transformations and sums of independent vectors. We start with a preliminary result on affine
transformations that should help clarify the original definition. Throughout this discussion, we assume that the parameter vector
(μ, ν , σ, τ , ρ) satisfies the usual conditions: μ, ν ∈ R , and σ, τ ∈ (0, ∞), and ρ ∈ (−1, 1).

Suppose that (Z, W) has the standard bivariate normal distribution. Let X = a₁ + b₁Z + c₁W and Y = a₂ + b₂Z + c₂W where the coefficients are in R and b₁c₂ − c₁b₂ ≠ 0. Then (X, Y) has a bivariate normal distribution with parameters given by

1. E(X) = a₁
2. E(Y) = a₂
3. var(X) = b₁² + c₁²
4. var(Y) = b₂² + c₂²
5. cov(X, Y) = b₁b₂ + c₁c₂

Proof

Now it is easy to show more generally that the bivariate normal distribution is closed with respect to affine transformations.

Suppose that (X, Y) has the bivariate normal distribution with parameters (μ, ν, σ, τ, ρ). Define U = a₁ + b₁X + c₁Y and V = a₂ + b₂X + c₂Y, where the coefficients are in R and b₁c₂ − c₁b₂ ≠ 0. Then (U, V) has a bivariate normal distribution with parameters as follows:

1. E(U) = a₁ + b₁μ + c₁ν
2. E(V) = a₂ + b₂μ + c₂ν
3. var(U) = b₁²σ² + c₁²τ² + 2b₁c₁ρστ
4. var(V) = b₂²σ² + c₂²τ² + 2b₂c₂ρστ
5. cov(U, V) = b₁b₂σ² + c₁c₂τ² + (b₁c₂ + b₂c₁)ρστ

Proof

The bivariate normal distribution is preserved with respect to sums of independent variables.

Suppose that (Xᵢ, Yᵢ) has the bivariate normal distribution with parameters (μᵢ, νᵢ, σᵢ, τᵢ, ρᵢ) for i ∈ {1, 2}, and that (X₁, Y₁) and (X₂, Y₂) are independent. Then (X₁ + X₂, Y₁ + Y₂) has the bivariate normal distribution with parameters given by

1. E(X₁ + X₂) = μ₁ + μ₂
2. E(Y₁ + Y₂) = ν₁ + ν₂
3. var(X₁ + X₂) = σ₁² + σ₂²
4. var(Y₁ + Y₂) = τ₁² + τ₂²
5. cov(X₁ + X₂, Y₁ + Y₂) = ρ₁σ₁τ₁ + ρ₂σ₂τ₂

Proof

The following result is important in the simulation of normal variables.

Suppose that (Z, W) has the standard bivariate normal distribution. Define the polar coordinates (R, Θ) of (Z, W) by the equations Z = R cos Θ, W = R sin Θ where R ≥ 0 and 0 ≤ Θ < 2π. Then

1. R has probability density function g given by g(r) = r e^{−r²/2} for r ∈ [0, ∞).
2. Θ is uniformly distributed on [0, 2π).
3. R and Θ are independent.
Proof

The distribution of R is known as the standard Rayleigh distribution, named for William Strutt, Lord Rayleigh. The Rayleigh distribution is studied in more detail in a separate section.

Since the quantile function Φ^{−1} of the normal distribution cannot be given in a simple, closed form, we cannot use the usual random quantile method of simulating a normal random variable. However, the quantile method works quite well to simulate a Rayleigh variable, and of course simulating uniform variables is trivial. Hence we have a way of simulating a standard bivariate normal vector with a pair of random numbers (which, you will recall, are independent random variables, each with the standard uniform distribution, that is, the uniform distribution on [0, 1)).
Suppose that U and V are independent random variables, each with the standard uniform distribution. Let R = √(−2 ln U) and Θ = 2πV. Define Z = R cos Θ and W = R sin Θ. Then (Z, W) has the standard bivariate normal distribution.

Proof

Of course, if we can simulate (Z, W) with a standard bivariate normal distribution, then we can simulate (X, Y) with the general bivariate normal distribution, with parameters (μ, ν, σ, τ, ρ), by definition (5), namely X = μ + σZ, Y = ν + τρZ + τ√(1 − ρ²) W.
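Here is a simulation sketch of the polar method just described, together with the transformation to a general bivariate normal vector (the parameter values are arbitrary illustrations):

    import numpy as np

    rng = np.random.default_rng(0)
    size = 100_000
    u = 1.0 - rng.random(size)         # uniform in (0, 1], avoids log(0)
    v = rng.random(size)

    r = np.sqrt(-2 * np.log(u))        # Rayleigh via the quantile method
    angle = 2 * np.pi * v              # uniform angle on [0, 2*pi)
    z, w = r * np.cos(angle), r * np.sin(angle)

    # general bivariate normal with parameters (mu, nu, sigma, tau, rho)
    mu, nu, sigma, tau, rho = 1.0, -2.0, 2.0, 0.5, 0.7
    x = mu + sigma * z
    y = nu + tau * rho * z + tau * np.sqrt(1 - rho**2) * w

    print(x.mean(), y.mean())          # ≈ 1.0, -2.0
    print(x.std(), y.std())            # ≈ 2.0, 0.5
    print(np.corrcoef(x, y)[0, 1])     # ≈ 0.7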

The General Multivariate Normal Distribution


The general multivariate normal distribution is a natural generalization of the bivariate normal distribution studied above. The exposition is very compact and elegant using expected value and covariance matrices, and would be horribly complex without these tools. Thus, this section requires some prerequisite knowledge of linear algebra. In particular, recall that Aᵀ denotes the transpose of a matrix A and that we identify a vector in Rⁿ with the corresponding n × 1 column vector.

The Standard Distribution


Suppose that Z = (Z₁, Z₂, …, Zₙ) is a vector of independent random variables, each with the standard normal distribution. Then Z is said to have the n-dimensional standard normal distribution.

1. E(Z) = 0 (the zero vector in Rⁿ).
2. vc(Z) = I (the n × n identity matrix).

Z has probability density function ϕₙ given by

ϕₙ(z) = ∏_{i=1}^{n} ϕ(zᵢ) = (1/(2π)^{n/2}) exp(−(1/2) z · z) = (1/(2π)^{n/2}) exp(−(1/2) ∑_{i=1}^{n} zᵢ²),  z = (z₁, z₂, …, zₙ) ∈ Rⁿ  (5.7.32)

where, as usual, ϕ is the standard normal PDF.


Proof

Z has moment generating function mₙ given by

mₙ(t) = E[exp(t · Z)] = exp[(1/2) var(t · Z)] = exp((1/2) t · t) = exp((1/2) ∑_{i=1}^{n} tᵢ²),  t = (t₁, t₂, …, tₙ) ∈ Rⁿ  (5.7.33)

Proof

The General Distribution


Suppose that Z has the n-dimensional standard normal distribution. Suppose also that μ ∈ Rⁿ and that A ∈ R^{n×n} is invertible. The random vector X = μ + AZ is said to have an n-dimensional normal distribution.

1. E(X) = μ.
2. vc(X) = AAᵀ.

Proof

In the context of this result, recall that the variance-covariance matrix vc(X) = AAᵀ is symmetric and positive definite (and hence also invertible). We will now see that the multivariate normal distribution is completely determined by the expected value vector μ and the variance-covariance matrix V, and hence these give the basic parameters of the distribution.

Suppose that X has an n-dimensional normal distribution with expected value vector μ and variance-covariance matrix V. The probability density function f of X is given by

f(x) = (1/((2π)^{n/2} √(det V))) exp[−(1/2)(x − μ) · V^{−1}(x − μ)],  x ∈ Rⁿ  (5.7.34)

Proof

Suppose again that X has an n-dimensional normal distribution with expected value vector μ and variance-covariance matrix V. The moment generating function M of X is given by

M(t) = E[exp(t · X)] = exp[E(t · X) + (1/2) var(t · X)] = exp(t · μ + (1/2) t · V t),  t ∈ Rⁿ  (5.7.39)

Proof

Of course, the moment generating function completely determines the distribution. Thus, if a random vector X in Rⁿ has a moment generating function of the form given above, for some μ ∈ Rⁿ and symmetric, positive definite V ∈ R^{n×n}, then X has the n-dimensional normal distribution with mean μ and variance-covariance matrix V.

Note again that in the representation X = μ + AZ, the distribution of X is uniquely determined by the expected value vector μ and the variance-covariance matrix V = AAᵀ, but not by μ and A. In general, for a given positive definite matrix V, there are many invertible matrices A such that V = AAᵀ (the matrix A is a bit like a square root of V). A theorem in matrix theory states that there is a unique lower triangular matrix L with this property. The representation X = μ + LZ is known as the canonical representation of X.

If X = (X, Y) has the bivariate normal distribution with parameters (μ, ν, σ, τ, ρ), then the lower triangular matrix L such that LLᵀ = vc(X) is

L = [ σ     0           ]
    [ τρ    τ√(1 − ρ²)  ]   (5.7.41)

Proof

Note that the matrix L above gives the canonical representation of (X, Y) in terms of the standard normal vector (Z, W) in the original definition, namely X = μ + σZ, Y = ν + τρZ + τ√(1 − ρ²) W.
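In software, the canonical representation is exactly what a Cholesky factorization provides. A minimal numpy sketch (the covariance matrix below is an arbitrary example): numpy.linalg.cholesky returns the lower triangular matrix L with LLᵀ = V, so μ + LZ simulates the distribution.

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([1.0, -2.0, 0.5])
    V = np.array([[4.0, 1.2, 0.6],
                  [1.2, 2.0, 0.3],
                  [0.6, 0.3, 1.0]])        # symmetric, positive definite

    L = np.linalg.cholesky(V)              # lower triangular, L @ L.T == V
    Z = rng.standard_normal((100_000, 3))  # standard normal sample, one per row
    X = mu + Z @ L.T                       # rows are samples of X = mu + L Z

    print(X.mean(axis=0))                  # ≈ mu
    print(np.cov(X, rowvar=False))         # ≈ V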

If the matrix A ∈ R^{n×n} in the definition is not invertible, then the variance-covariance matrix V = AAᵀ is symmetric, but only positive semi-definite. The random vector X = μ + AZ takes values in a lower dimensional affine subspace of Rⁿ that has measure 0 relative to n-dimensional Lebesgue measure λₙ. Thus, X does not have a probability density function relative to λₙ, and so the distribution is degenerate. However, the formula for the moment generating function still holds. Degenerate normal distributions are discussed in more detail below.

Transformations
The multivariate normal distribution is invariant under two basic types of transformations on the underlying random vectors: affine
transformations (with linearly independent rows), and concatenation of independent vectors. As simple corollaries of these two results,
the normal distribution is also invariant with respect to subsequences of the random vector, re-orderings of the terms in the random
vector, and sums of independent random vectors. The main tool that we will use is the moment generating function. We start with the
first main result on affine transformations.

Suppose that X has the n-dimensional normal distribution with mean vector μ and variance-covariance matrix V. Suppose also that a ∈ Rᵐ and that A ∈ R^{m×n} has linearly independent rows (thus, m ≤ n). Then Y = a + AX has an m-dimensional normal distribution, with

1. E(Y) = a + Aμ
2. vc(Y) = AVAᵀ

Proof

A clearly important special case is m = n, which generalizes the definition. Thus, if a ∈ Rⁿ and A ∈ R^{n×n} is invertible, then Y = a + AX has an n-dimensional normal distribution. Here are some other corollaries:

Suppose that X = (X₁, X₂, …, Xₙ) has an n-dimensional normal distribution. If {i₁, i₂, …, iₘ} is a set of distinct indices, then Y = (X_{i₁}, X_{i₂}, …, X_{iₘ}) has an m-dimensional normal distribution.
Proof

In the context of the previous result, if X has mean vector μ and variance-covariance matrix V, then Y has mean vector Aμ and variance-covariance matrix AVAᵀ, where A is the 0-1 matrix defined in the proof. As simple corollaries, note that if X = (X₁, X₂, …, Xₙ) has an n-dimensional normal distribution, then any permutation of the coordinates of X also has an n-dimensional normal distribution, and (X₁, X₂, …, Xₘ) has an m-dimensional normal distribution for any m ≤ n. Here is a slight extension of the last statement.

Suppose that X is a random vector in Rᵐ, Y is a random vector in Rⁿ, and that (X, Y) has an (m + n)-dimensional normal distribution. Then

1. X has an m-dimensional normal distribution.
2. Y has an n-dimensional normal distribution.
3. X and Y are independent if and only if cov(X, Y) = 0 (the m × n zero matrix).
Proof

Next is the converse to part (c) of the previous result: concatenating independent normally distributed vectors produces another normally
distributed vector.

Suppose that X has the m-dimensional normal distribution with mean vector μ and variance-covariance matrix U, Y has the n-dimensional normal distribution with mean vector ν and variance-covariance matrix V, and that X and Y are independent. Then Z = (X, Y) has the (m + n)-dimensional normal distribution with

1. E(X, Y) = (μ, ν)
2. vc(X, Y) = [vc(X), 0; 0ᵀ, vc(Y)] in block form, where 0 is the m × n zero matrix.

Proof

Just as in the univariate case, the normal family of distributions is closed with respect to sums of independent variables. The proof
follows easily from the previous result.

Suppose that X has the n-dimensional normal distribution with mean vector μ and variance-covariance matrix U, Y has the n-dimensional normal distribution with mean vector ν and variance-covariance matrix V, and that X and Y are independent. Then X + Y has the n-dimensional normal distribution with

1. E(X + Y) = μ + ν
2. vc(X + Y) = U + V
Proof

We close with a trivial corollary to the general result on affine transformation, but this corollary points the way to a further generalization
of the multivariate normal distribution that includes the degenerate distributions.

Suppose that X has an n-dimensional normal distribution with mean vector μ and variance-covariance matrix V, and that a ∈ Rⁿ with a ≠ 0. Then Y = a · X has a (univariate) normal distribution with

1. E(Y) = a · μ
2. var(Y) = a · Va
Proof

A Further Generalization
The last result can be used to give a simple, elegant definition of the multivariate normal distribution that includes the degenerate
distributions as well as the ones we have considered so far. First we will adopt our general definition of the univariate normal distribution
that includes constant random variables.

A random variable X that takes values in Rⁿ has an n-dimensional normal distribution if and only if a · X has a univariate normal distribution for every a ∈ Rⁿ.

Although an n-dimensional normal distribution may not have a probability density function with respect to n-dimensional Lebesgue measure λₙ, the form of the moment generating function is unchanged.

Suppose that X has mean vector μ and variance-covariance matrix V, and that X has an n-dimensional normal distribution. The moment generating function of X is given by

E[exp(t · X)] = exp[E(t · X) + (1/2) var(t · X)] = exp(t · μ + (1/2) t · V t),  t ∈ Rⁿ  (5.7.47)

Proof

Our new general definition really is a generalization.

Suppose that X has an n-dimensional normal distribution in the sense of the general definition, and that the distribution of X has a probability density function on Rⁿ with respect to Lebesgue measure λₙ. Then X has an n-dimensional normal distribution in the sense of our original definition.


Proof

This page titled 5.7: The Multivariate Normal Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

5.8: The Gamma Distribution

In this section we will study a family of distributions that has special importance in probability and statistics. In particular, the
arrival times in the Poisson process have gamma distributions, and the chi-square distribution in statistics is a special case of the
gamma distribution. Also, the gamma distribution is widely used to model physical quantities that take positive values.

The Gamma Function


Before we can study the gamma distribution, we need to introduce the gamma function, a special function whose values will play
the role of the normalizing constants.

Definition
The gamma function Γ is defined as follows:

Γ(k) = ∫_{0}^{∞} x^{k−1} e^{−x} dx,  k ∈ (0, ∞)  (5.8.1)

The function is well defined, that is, the integral converges for any k >0 . On the other hand, the integral diverges to ∞ for
k ≤ 0.

Proof

The gamma function was first introduced by Leonhard Euler.

Figure 5.8.1 : The graph of the gamma function on the interval (0, 5)

The (lower) incomplete gamma function is defined by

Γ(k, x) = ∫_{0}^{x} t^{k−1} e^{−t} dt,  k, x ∈ (0, ∞)  (5.8.6)

Properties
Here are a few of the essential properties of the gamma function. The first is the fundamental identity.

Γ(k + 1) = k Γ(k) for k ∈ (0, ∞) .


Proof

Applying this result repeatedly gives

Γ(k + n) = k(k + 1) ⋯ (k + n − 1)Γ(k), n ∈ N+ (5.8.8)

It's clear that the gamma function is a continuous extension of the factorial function.

Γ(k + 1) = k! for k ∈ N .
Proof

The values of the gamma function for non-integer arguments generally cannot be expressed in simple, closed forms. However,
there are exceptions.

Γ(1/2) = √π.
Proof

We can generalize the last result to odd multiples of 1/2.

For n ∈ N,

Γ((2n + 1)/2) = (1 · 3 ⋯ (2n − 1)/2ⁿ) √π = ((2n)!/(4ⁿ n!)) √π  (5.8.11)

Proof

Stirling's Approximation
One of the most famous asymptotic formulas for the gamma function is Stirling's formula, named for James Stirling. First we need
to recall a definition.

Suppose that f, g : D → (0, ∞) where D = (0, ∞) or D = N₊. Then f(x) ≈ g(x) as x → ∞ means that

f(x)/g(x) → 1 as x → ∞  (5.8.12)

Stirling's formula:

Γ(x + 1) ≈ (x/e)ˣ √(2πx) as x → ∞  (5.8.13)

As a special case, Stirling's result gives an asymptotic formula for the factorial function:

n! ≈ (n/e)ⁿ √(2πn) as n → ∞  (5.8.14)
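A quick numerical comparison (illustrative) of Stirling's approximation with the exact gamma function from scipy:

    from math import e, pi, sqrt
    from scipy.special import gamma

    def stirling(x):
        return (x / e)**x * sqrt(2 * pi * x)

    for x in [1, 5, 10, 50]:
        print(x, gamma(x + 1), stirling(x), gamma(x + 1) / stirling(x))
    # the ratio tends to 1; it is already about 1.0017 at x = 50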

The Standard Gamma Distribution


Distribution Functions
The standard gamma distribution with shape parameter k ∈ (0, ∞) is a continuous distribution on (0, ∞) with probability density function f given by

f(x) = (1/Γ(k)) x^{k−1} e^{−x},  x ∈ (0, ∞)  (5.8.15)

Clearly f is a valid probability density function, since f(x) > 0 for x > 0, and by definition, Γ(k) is the normalizing constant for the function x ↦ x^{k−1} e^{−x} on (0, ∞). The following theorem shows that the gamma density has a rich variety of shapes, and shows why k is called the shape parameter.

The gamma probability density function f with shape parameter k ∈ (0, ∞) satisfies the following properties:

1. If 0 < k < 1, f is decreasing with f(x) → ∞ as x ↓ 0.
2. If k = 1, f is decreasing with f(0) = 1.
3. If k > 1, f increases and then decreases, with mode at k − 1.
4. If 0 < k ≤ 1, f is concave upward.
5. If 1 < k ≤ 2, f is concave downward and then upward, with inflection point at k − 1 + √(k − 1).
6. If k > 2, f is concave upward, then downward, then upward again, with inflection points at k − 1 ± √(k − 1).
Proof

The special case k = 1 gives the standard exponential distribution. When k ≥ 1, the distribution is unimodal.

In the simulation of the special distribution simulator, select the gamma distribution. Vary the shape parameter and note the
shape of the density function. For various values of k , run the simulation 1000 times and compare the empirical density
function to the true probability density function.

The distribution function and the quantile function do not have simple, closed representations for most values of the shape
parameter. However, the distribution function has a trivial representation in terms of the incomplete and complete gamma
functions.

The distribution function F of the standard gamma distribution with shape parameter k ∈ (0, ∞) is given by

F(x) = Γ(k, x)/Γ(k),  x ∈ (0, ∞)  (5.8.16)

Approximate values of the distribution and quantile functions can be obtained from the special distribution calculator, and from most
mathematical and statistical software packages.

Using the special distribution calculator, find the median, the first and third quartiles, and the interquartile range in each of the
following cases:
1. k = 1
2. k = 2
3. k = 3
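Equivalently, the quantiles can be computed with scipy's gamma implementation (shape parameter a = k, unit scale for the standard distribution); a short sketch:

    from scipy.stats import gamma

    for k in [1, 2, 3]:
        q1, med, q3 = gamma.ppf([0.25, 0.5, 0.75], a=k)
        print(k, q1, med, q3, q3 - q1)   # quartiles and interquartile range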

Moments
Suppose that X has the standard gamma distribution with shape parameter k ∈ (0, ∞) . The mean and variance are both simply the
shape parameter.

The mean and variance of X are


1. E(X) = k
2. var(X) = k
Proof

In the simulation of the special distribution simulator, select the gamma distribution. Vary the shape parameter and note the
size and location of the mean ± standard deviation bar. For selected values of k , run the simulation 1000 times and compare
the empirical mean and standard deviation to the distribution mean and standard deviation.

More generally, the moments can be expressed easily in terms of the gamma function:

The moments of X are

1. E(Xᵃ) = Γ(a + k)/Γ(k) if a > −k
2. E(Xⁿ) = k^[n] = k(k + 1) ⋯ (k + n − 1) if n ∈ N
Proof

Note also that E(Xᵃ) = ∞ if a ≤ −k. We can now also compute the skewness and the kurtosis.

The skewness and kurtosis of X are

1. skew(X) = 2/√k
2. kurt(X) = 3 + 6/k
Proof

In particular, note that skew(X) → 0 and kurt(X) → 3 as k → ∞ . Note also that the excess kurtosis kurt(X) − 3 → 0 as
k → ∞.

In the simulation of the special distribution simulator, select the gamma distribution. Increase the shape parameter and note the
shape of the density function in light of the previous results on skewness and kurtosis. For various values of k , run the
simulation 1000 times and compare the empirical density function to the true probability density function.

The following theorem gives the moment generating function.

The moment generating function of X is given by

E(e^{tX}) = 1/(1 − t)ᵏ,  t < 1  (5.8.20)

Proof

The General Gamma Distribution


The gamma distribution is usually generalized by adding a scale parameter.

If Z has the standard gamma distribution with shape parameter k ∈ (0, ∞) and if b ∈ (0, ∞) , then X = bZ has the gamma
distribution with shape parameter k and scale parameter b .

The reciprocal of the scale parameter, r = 1/b, is known as the rate parameter, particularly in the context of the Poisson process.
The gamma distribution with parameters k = 1 and b is called the exponential distribution with scale parameter b (or rate
parameter r = 1/b ). More generally, when the shape parameter k is a positive integer, the gamma distribution is known as the
Erlang distribution, named for the Danish mathematician Agner Erlang. The exponential distribution governs the time between
arrivals in the Poisson model, while the Erlang distribution governs the actual arrival times.
Basic properties of the general gamma distribution follow easily from corresponding properties of the standard distribution and
basic results for scale transformations.

Distribution Functions
Suppose that X has the gamma distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞).

X has probability density function f given by

f(x) = (1/(Γ(k) bᵏ)) x^{k−1} e^{−x/b},  x ∈ (0, ∞)  (5.8.23)

Proof

Recall that the inclusion of a scale parameter does not change the shape of the density function, but simply scales the graph
horizontally and vertically. In particular, we have the same basic shapes as for the standard gamma density function.

The probability density function f of X satisfies the following properties:

1. If 0 < k < 1, f is decreasing with f(x) → ∞ as x ↓ 0.
2. If k = 1, f is decreasing with f(0) = 1/b.
3. If k > 1, f increases and then decreases, with mode at (k − 1)b.
4. If 0 < k ≤ 1, f is concave upward.
5. If 1 < k ≤ 2, f is concave downward and then upward, with inflection point at b(k − 1 + √(k − 1)).
6. If k > 2, f is concave upward, then downward, then upward again, with inflection points at b(k − 1 ± √(k − 1)).

In the simulation of the special distribution simulator, select the gamma distribution. Vary the shape and scale parameters and
note the shape and location of the probability density function. For various values of the parameters, run the simulation 1000
times and compare the empirical density function to the true probability density function.

Once again, the distribution function and the quantile function do not have simple, closed representations for most values of the
shape parameter. However, the distribution function has a simple representation in terms of the incomplete and complete gamma
functions.

The distribution function F of X is given by

F(x) = Γ(k, x/b)/Γ(k),  x ∈ (0, ∞)  (5.8.24)

Proof

Approximate values of the distribution and quantile functions can be obtained from the special distribution calculator, and from most mathematical and statistical software packages.

Open the special distribution calculator. Vary the shape and scale parameters and note the shape and location of the distribution
and quantile functions. For selected values of the parameters, find the median and the first and third quartiles.

Moments
Suppose again that X has the gamma distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞).

The mean and variance of X are

1. E(X) = bk
2. var(X) = b²k

Proof

In the special distribution simulator, select the gamma distribution. Vary the parameters and note the shape and location of the
mean ± standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the empirical
mean and standard deviation to the distribution mean and standard deviation.

The moments of X are

1. E(Xᵃ) = bᵃ Γ(a + k)/Γ(k) for a > −k
2. E(Xⁿ) = bⁿ k^[n] = bⁿ k(k + 1) ⋯ (k + n − 1) if n ∈ N
Proof

Note also that E(Xᵃ) = ∞ if a ≤ −k. Recall that skewness and kurtosis are defined in terms of the standard score, and hence are unchanged by the addition of a scale parameter.

The skewness and kurtosis of X are

1. skew(X) = 2/√k
2. kurt(X) = 3 + 6/k

The moment generating function of X is given by

E(e^{tX}) = 1/(1 − bt)ᵏ,  t < 1/b  (5.8.25)

Proof

Relations
Our first result is simply a restatement of the meaning of the term scale parameter.

Suppose that X has the gamma distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞) . If c ∈ (0, ∞),
then cX has the gamma distribution with shape parameter k and scale parameter bc.
Proof

More importantly, if the scale parameter is fixed, the gamma family is closed with respect to sums of independent variables.

Suppose that X₁ and X₂ are independent random variables, and that Xᵢ has the gamma distribution with shape parameter kᵢ ∈ (0, ∞) and scale parameter b ∈ (0, ∞) for i ∈ {1, 2}. Then X₁ + X₂ has the gamma distribution with shape parameter k₁ + k₂ and scale parameter b.
Proof

From the previous result, it follows that the gamma distribution is infinitely divisible:

Suppose that X has the gamma distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞). For n ∈ N₊, X has the same distribution as ∑_{i=1}^{n} Xᵢ, where (X₁, X₂, …, Xₙ) is a sequence of independent random variables, each with the gamma distribution with shape parameter k/n and scale parameter b.

From the sum result and the central limit theorem, it follows that if k is large, the gamma distribution with shape parameter k and scale parameter b can be approximated by the normal distribution with mean kb and variance kb². Here is the precise statement:

Suppose that Xₖ has the gamma distribution with shape parameter k ∈ (0, ∞) and fixed scale parameter b ∈ (0, ∞). Then the distribution of the standardized variable below converges to the standard normal distribution as k → ∞:

Zₖ = (Xₖ − kb)/(√k b)  (5.8.27)

In the special distribution simulator, select the gamma distribution. For various values of the scale parameter, increase the
shape parameter and note the increasingly “normal” shape of the density function. For selected values of the parameters, run
the experiment 1000 times and compare the empirical density function to the true probability density function.

The gamma distribution is a member of the general exponential family of distributions:

The gamma distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞) is a two-parameter exponential
family with natural parameters (k − 1, −1/b) , and natural statistics (ln X, X).
Proof

For n ∈ (0, ∞) , the gamma distribution with shape parameter n/2 and scale parameter 2 is known as the chi-square distribution
with n degrees of freedom. The chi-square distribution is important enough to deserve a separate section.

Computational Exercise
Suppose that the lifetime of a device (in hours) has the gamma distribution with shape parameter k =4 and scale parameter
b = 100 .

1. Find the probability that the device will last more than 300 hours.
2. Find the mean and standard deviation of the lifetime.
Answer

Suppose that Y has the gamma distribution with parameters k = 10 and b = 2 . For each of the following, compute the true
value using the special distribution calculator and then compute the normal approximation. Compare the results.
1. P(18 < Y < 25)
2. The 80th percentile of Y
Answer
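Sketches of both computations with scipy (the normal approximation uses mean kb and variance kb² as in the theorem above):

    from math import sqrt
    from scipy.stats import gamma, norm

    # Lifetime: X ~ gamma(k = 4, b = 100)
    X = gamma(a=4, scale=100)
    print(X.sf(300))                       # P(X > 300) ≈ 0.6472
    print(X.mean(), X.std())               # 400 hours, 200 hours

    # Y ~ gamma(k = 10, b = 2): exact values versus the normal approximation
    Y = gamma(a=10, scale=2)
    approx = norm(loc=20, scale=sqrt(40))  # mean kb = 20, variance k b^2 = 40
    print(Y.cdf(25) - Y.cdf(18), approx.cdf(25) - approx.cdf(18))
    print(Y.ppf(0.8), approx.ppf(0.8))     # 80th percentiles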

This page titled 5.8: The Gamma Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

5.9: Chi-Square and Related Distribution

In this section we will study a distribution, and some relatives, that have special importance in statistics. In particular, the chi-
square distribution will arise in the study of the sample variance when the underlying distribution is normal and in goodness of fit
tests.

The Chi-Square Distribution


Distribution Functions
For n ∈ (0, ∞), the gamma distribution with shape parameter n/2 and scale parameter 2 is called the chi-square distribution with n degrees of freedom. The probability density function f is given by

f(x) = (1/(2^{n/2} Γ(n/2))) x^{n/2−1} e^{−x/2},  x ∈ (0, ∞)  (5.9.1)

So the chi-square distribution is a continuous distribution on (0, ∞). For reasons that will be clear later, n is usually a positive integer, although technically this is not a mathematical requirement. When n is a positive integer, the gamma function in the normalizing constant can be given explicitly.

If n ∈ N₊ then

1. Γ(n/2) = (n/2 − 1)! if n is even.
2. Γ(n/2) = ((n − 1)!/(2^{n−1} (n/2 − 1/2)!)) √π if n is odd.

The chi-square distribution has a rich collection of shapes.

The chi-square probability density function with n ∈ (0, ∞) degrees of freedom satisfies the following properties:

1. If 0 < n < 2, f is decreasing with f(x) → ∞ as x ↓ 0.
2. If n = 2, f is decreasing with f(0) = 1/2.
3. If n > 2, f increases and then decreases with mode at n − 2.
4. If 0 < n ≤ 2, f is concave upward.
5. If 2 < n ≤ 4, f is concave downward and then upward, with inflection point at n − 2 + √(2n − 4).
6. If n > 4 then f is concave upward, then downward, and then upward again, with inflection points at n − 2 ± √(2n − 4).

In the special distribution simulator, select the chi-square distribution. Vary n with the scroll bar and note the shape of the
probability density function. For selected values of n , run the simulation 1000 times and compare the empirical density
function to the true probability density function.

The distribution function and the quantile function do not have simple, closed-form representations for most values of the
parameter. However, the distribution function can be given in terms of the complete and incomplete gamma functions.

Suppose that X has the chi-square distribution with n ∈ (0, ∞) degrees of freedom. The distribution function F of X is given
by
F(x) = \frac{\Gamma(n/2, x/2)}{\Gamma(n/2)}, \quad x \in (0, \infty) \tag{5.9.2}

Approximate values of the distribution and quantile functions can be obtained from the special distribution calculator, and from
most mathematical and statistical software packages.

In the special distribution calculator, select the chi-square distribution. Vary the parameter and note the shape of the probability
density, distribution, and quantile functions. In each of the following cases, find the median, the first and third quartiles, and
the interquartile range.

1. n = 1
2. n = 2
3. n = 5
4. n = 10
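Any statistical package can produce these quartiles; a minimal sketch in Python, assuming SciPy is available, follows.

from scipy.stats import chi2

# Median, quartiles, and interquartile range of the chi-square distribution
for n in (1, 2, 5, 10):
    q1, q2, q3 = chi2(n).ppf([0.25, 0.5, 0.75])
    print(f"n = {n:2d}: median = {q2:.4f}, quartiles = ({q1:.4f}, {q3:.4f}), IQR = {q3 - q1:.4f}")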

Moments
The mean, variance, moments, and moment generating function of the chi-square distribution can be obtained easily from general
results for the gamma distribution.

If X has the chi-square distribution with n ∈ (0, ∞) degrees of freedom then


1. E(X) = n
2. var(X) = 2n

In the simulation of the special distribution simulator, select the chi-square distribution. Vary n with the scroll bar and note the
size and location of the mean ± standard deviation bar. For selected values of n , run the simulation 1000 times and compare
the empirical moments to the distribution moments.

The skewness and kurtosis of the chi-square distribution are given next.

If X has the chi-square distribution with n ∈ (0, ∞) degrees of freedom, then


1. skew(X) = 2√(2/n)
2. kurt(X) = 3 + 12/n

Note that skew(X) → 0 and kurt(X) → 3 as n → ∞ . In particular, the excess kurtosis kurt(X) − 3 → 0 as n → ∞ .

In the simulation of the special distribution simulator, select the chi-square distribution. Increase n with the scroll bar and note
the shape of the probability density function in light of the previous results on skewness and kurtosis. For selected values of n ,
run the simulation 1000 times and compare the empirical density function to the true probability density function.

The next result gives the general moments of the chi-square distribution.

If X has the chi-square distribution with n ∈ (0, ∞) degrees of freedom, then for k > −n/2 ,
E(X^k) = 2^k \frac{\Gamma(n/2 + k)}{\Gamma(n/2)} \tag{5.9.3}

In particular, if k ∈ ℕ₊ then

E(X^k) = 2^k \frac{n}{2}\left(\frac{n}{2} + 1\right) \cdots \left(\frac{n}{2} + k - 1\right) \tag{5.9.4}

Note also that E(X^k) = ∞ if k ≤ −n/2.

If X has the chi-square distribution with n ∈ (0, ∞) degrees of freedom, then X has moment generating function

E(e^{tX}) = \frac{1}{(1 - 2t)^{n/2}}, \quad t < \frac{1}{2} \tag{5.9.5}
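Both the moment formula and the moment generating function can be checked numerically. The following is a minimal sketch in Python, assuming SciPy; it compares the closed forms above with direct numerical integration against the chi-square density.

import numpy as np
from math import gamma as Gamma
from scipy.stats import chi2
from scipy.integrate import quad

n = 5.0
X = chi2(n)

# E(X^k) = 2^k Gamma(n/2 + k) / Gamma(n/2), valid for k > -n/2
for k in (1, 2, 3, -1):
    closed = 2**k * Gamma(n/2 + k) / Gamma(n/2)
    numeric, _ = quad(lambda x: x**k * X.pdf(x), 0, np.inf)
    print(k, closed, numeric)

# MGF: E(e^{tX}) = (1 - 2t)^{-n/2} for t < 1/2
t = 0.2
numeric, _ = quad(lambda x: np.exp(t * x) * X.pdf(x), 0, np.inf)
print((1 - 2*t)**(-n/2), numeric)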

Relations
The chi-square distribution is connected to a number of other special distributions. Of course, the most important relationship is the
definition—the chi-square distribution with n degrees of freedom is a special case of the gamma distribution, corresponding to
shape parameter n/2 and scale parameter 2. On the other hand, any gamma distributed variable can be re-scaled into a variable
with a chi-square distribution.

If X has the gamma distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞) then Y = (2/b)X has the chi-square distribution with 2k degrees of freedom.
Proof

The chi-square distribution with 2 degrees of freedom is the exponential distribution with scale parameter 2.
Proof

If Z has the standard normal distribution then X = Z² has the chi-square distribution with 1 degree of freedom.

Proof

Recall that if we add independent gamma variables with a common scale parameter, the resulting random variable also has a
gamma distribution, with the common scale parameter and with shape parameter that is the sum of the shape parameters of the
terms. Specializing to the chi-square distribution, we have the following important result:

If X has the chi-square distribution with m ∈ (0, ∞) degrees of freedom, Y has the chi-square distribution with n ∈ (0, ∞)
degrees of freedom, and X and Y are independent, then X + Y has the chi-square distribution with m + n degrees of
freedom.

The last two results lead to the following theorem, which is fundamentally important in statistics.

Suppose that n ∈ ℕ₊ and that (Z₁, Z₂, …, Zₙ) is a sequence of independent standard normal variables. Then the sum of the squares

V = \sum_{i=1}^{n} Z_i^2 \tag{5.9.8}

has the chi-square distribution with n degrees of freedom.

This theorem is the reason that the chi-square distribution deserves a name of its own, and the reason that the degrees of freedom
parameter is usually a positive integer. Sums of squares of independent normal variables occur frequently in statistics.
From the central limit theorem, and previous results for the gamma distribution, it follows that if n is large, the chi-square
distribution with n degrees of freedom can be approximated by the normal distribution with mean n and variance 2n. Here is the
precise statement:

If Xₙ has the chi-square distribution with n ∈ (0, ∞) degrees of freedom, then the distribution of the standard score

Z_n = \frac{X_n - n}{\sqrt{2n}} \tag{5.9.9}

converges to the standard normal distribution as n → ∞.

In the simulation of the special distribution simulator, select the chi-square distribution. Start with n = 1 and increase n . Note
the shape of the probability density function in light of the previous theorem. For selected values of n , run the experiment 1000
times and compare the empirical density function to the true density function.
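The convergence is also easy to observe numerically; here is a minimal sketch in Python, assuming SciPy.

import numpy as np
from scipy.stats import chi2, norm

# Compare P(X_n <= n + sqrt(2n)) with Phi(1) for increasing n
for n in (5, 20, 100, 1000):
    x = n + np.sqrt(2 * n)        # one standard deviation above the mean
    print(n, chi2(n).cdf(x), norm.cdf(1.0))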

Like the gamma distribution, the chi-square distribution is infinitely divisible:

Suppose that X has the chi-square distribution with n ∈ (0, ∞) degrees of freedom. For k ∈ ℕ₊, X has the same distribution as \sum_{i=1}^{k} X_i, where (X₁, X₂, …, X_k) is a sequence of independent random variables, each with the chi-square distribution with n/k degrees of freedom.

Also like the gamma distribution, the chi-square distribution is a member of the general exponential family of distributions:

The chi-square distribution with n ∈ (0, ∞) degrees of freedom is a one-parameter exponential family with natural parameter n/2 − 1 and natural statistic ln X.

Proof

The Chi Distribution


The chi distribution, appropriately enough, is the distribution of the square root of a variable with the chi-square distribution.


Suppose that X has the chi-square distribution with n ∈ (0, ∞) degrees of freedom. Then U = √X has the chi distribution
with n degrees of freedom.

So like the chi-square distribution, the chi distribution is a continuous distribution on (0, ∞).

Distribution Functions
The distribution function G of the chi distribution with n ∈ (0, ∞) degrees of freedom is given by
G(u) = \frac{\Gamma(n/2, u^2/2)}{\Gamma(n/2)}, \quad u \in (0, \infty) \tag{5.9.11}

Proof

The probability density function g of the chi distribution with n ∈ (0, ∞) degrees of freedom is given by
g(u) = \frac{1}{2^{n/2-1} \Gamma(n/2)} u^{n-1} e^{-u^2/2}, \quad u \in (0, \infty) \tag{5.9.13}

Proof

The chi probability density function also has a variety of shapes.

The chi probability density function with n ∈ (0, ∞) degrees of freedom satisfies the following properties:
1. If 0 < n < 1, g is decreasing with g(u) → ∞ as u ↓ 0.
2. If n = 1, g is decreasing with g(0) = √(2/π).
3. If n > 1, g increases and then decreases with mode u = √(n − 1).
4. If 0 < n < 1, g is concave upward.
5. If 1 ≤ n ≤ 2, g is concave downward and then upward with inflection point at u = \sqrt{\frac{1}{2}\left[2n - 1 + \sqrt{8n - 7}\right]}.
6. If n > 2, g is concave upward, then downward, then upward again, with inflection points at u = \sqrt{\frac{1}{2}\left[2n - 1 \pm \sqrt{8n - 7}\right]}.

Moments
The raw moments of the chi distribution are easy to compute in terms of the gamma function.

Suppose that U has the chi distribution with n ∈ (0, ∞) degrees of freedom. Then
E(U^k) = 2^{k/2} \frac{\Gamma[(n + k)/2]}{\Gamma(n/2)}, \quad k \in (0, \infty) \tag{5.9.15}
Proof

Curiously, the second moment is simply the degrees of freedom parameter.

Suppose again that U has the chi distribution with n ∈ (0, ∞) degrees of freedom. Then
1. E(U) = 2^{1/2} \frac{\Gamma[(n+1)/2]}{\Gamma(n/2)}
2. E(U²) = n
3. var(U) = n - 2 \frac{\Gamma^2[(n+1)/2]}{\Gamma^2(n/2)}

Proof

Relations
The fundamental relationship of course is the one between the chi distribution and the chi-square distribution given in the
definition. In turn, this leads to a fundamental relationship between the chi distribution and the normal distribution.

Suppose that n ∈ N + and that (Z1 , Z2 , … , Zn ) is a sequence of independent variables, each with the standard normal
distribution. Then
U = \sqrt{Z_1^2 + Z_2^2 + \cdots + Z_n^2} \tag{5.9.19}

has the chi distribution with n degrees of freedom.

Note that the random variable U in the last result is the standard Euclidean norm of (Z₁, Z₂, …, Zₙ), thought of as a vector in ℝⁿ.

Note also that the chi distribution with 1 degree of freedom is the distribution of |Z|, the absolute value of a standard normal
variable, which is known as the standard half-normal distribution.
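The norm interpretation is easy to verify by simulation; here is a minimal sketch in Python, assuming NumPy (the seed and sample size are arbitrary).

import numpy as np
from math import gamma as Gamma

rng = np.random.default_rng(seed=1)
n = 3
Z = rng.standard_normal((100_000, n))
U = np.linalg.norm(Z, axis=1)     # chi distribution with n degrees of freedom

# The second moment should be exactly n; the mean follows the gamma-function formula
print(U.mean(), 2**0.5 * Gamma((n + 1) / 2) / Gamma(n / 2))
print((U**2).mean(), n)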

The Non-Central Chi-Square Distribution


Much of the importance of the chi-square distribution stems from the fact that it is the distribution that governs the sum of squares
of independent, standard normal variables. A natural generalization, and one that is important in statistical applications, is to
consider the distribution of a sum of squares of independent normal variables, each with variance 1 but with different means.

Suppose that n ∈ ℕ₊ and that (X₁, X₂, …, Xₙ) is a sequence of independent variables, where X_k has the normal distribution with mean μ_k ∈ R and variance 1 for k ∈ {1, 2, …, n}. The distribution of Y = \sum_{k=1}^{n} X_k^2 is the non-central chi-square distribution with n degrees of freedom and non-centrality parameter λ = \sum_{k=1}^{n} \mu_k^2.

Note that the degrees of freedom is a positive integer while the non-centrality parameter λ ∈ [0, ∞) , but we will soon generalize
the degrees of freedom.

Distribution Functions
Like the chi-square and chi distributions, the non-central chi-square distribution is a continuous distribution on (0, ∞). The
probability density function and distribution function do not have simple, closed expressions, but there is a fascinating connection
to the Poisson distribution. To set up the notation, let f_k and F_k denote the probability density and distribution functions of the chi-square distribution with k ∈ (0, ∞) degrees of freedom. Suppose that Y has the non-central chi-square distribution with n ∈ ℕ₊ degrees of freedom and non-centrality parameter λ ∈ [0, ∞). The following fundamental theorem gives the probability density function of Y as an infinite series, and shows that the distribution does in fact depend only on n and λ.

The probability density function g of Y is given by

g(y) = \sum_{k=0}^{\infty} e^{-\lambda/2} \frac{(\lambda/2)^k}{k!} f_{n+2k}(y), \quad y \in (0, \infty) \tag{5.9.20}

Proof
The function k ↦ e^{-λ/2} (λ/2)^k / k! on ℕ is the probability density function of the Poisson distribution with parameter λ/2. So it
follows that if N has the Poisson distribution with parameter λ/2 and the conditional distribution of Y given N is chi-square with
parameter n + 2N , then Y has the distribution discussed here—non-central chi-square with n degrees of freedom and non-
centrality parameter λ . Moreover, it's clear that g is a valid probability density function for any n ∈ (0, ∞) , so we can generalize
our definition a bit.

For n ∈ (0, ∞) and λ ∈ [0, ∞), the distribution with probability density function g above is the non-central chi-square
distribution with n degrees of freedom and non-centrality parameter λ .

The distribution function G is given by

G(y) = \sum_{k=0}^{\infty} e^{-\lambda/2} \frac{(\lambda/2)^k}{k!} F_{n+2k}(y), \quad y \in (0, \infty) \tag{5.9.28}

Proof

Moments
In this discussion, we assume again that Y has the non-central chi-square distribution with n ∈ (0, ∞) degrees of freedom and
non-centrality parameter λ ∈ [0, ∞).

The moment generating function M of Y is given by

M(t) = E(e^{tY}) = \frac{1}{(1 - 2t)^{n/2}} \exp\left(\frac{\lambda t}{1 - 2t}\right), \quad t \in (-\infty, 1/2) \tag{5.9.29}

Proof

The mean and variance of Y are


1. E(Y ) = n + λ
2. var(Y ) = 2(n + 2λ)
Proof

The skewness and kurtosis of Y are

1. skew(Y) = 2^{3/2} \frac{n + 3\lambda}{(n + 2\lambda)^{3/2}}
2. kurt(Y) = 3 + 12 \frac{n + 4\lambda}{(n + 2\lambda)^2}

Note that skew(Y) → 0 as n → ∞ or as λ → ∞. Note also that the excess kurtosis is kurt(Y) − 3 = 12 \frac{n + 4\lambda}{(n + 2\lambda)^2}. So kurt(Y) → 3 (the kurtosis of the normal distribution) as n → ∞ or as λ → ∞.

Relations
Trivially of course, the ordinary chi-square distribution is a special case of the non-central chi-square distribution, with non-centrality parameter 0. The most important relation is the original definition above. The non-central chi-square distribution with n ∈ ℕ₊ degrees of freedom and non-centrality parameter λ ∈ [0, ∞) is the distribution of the sum of the squares of n independent normal variables with variance 1 and whose means satisfy \sum_{k=1}^{n} \mu_k^2 = λ. The next most important relation is the one that arose in the probability density function and was so useful for computing moments. We state this one again for emphasis.

Suppose that N has the Poisson distribution with parameter λ/2, where λ ∈ (0, ∞) , and that the conditional distribution of Y
given N is chi-square with n + 2N degrees of freedom, where n ∈ (0, ∞) . Then the (unconditional) distribution of Y is non-
central chi-square with n degrees of freedom and non-centrality parameter λ.
Proof
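The two constructions of the non-central chi-square distribution (the sum of squares of shifted normal variables, and the Poisson mixture) can be compared by simulation. Here is a minimal sketch in Python, assuming NumPy; the parameter values and seed are arbitrary.

import numpy as np

rng = np.random.default_rng(seed=2)
n, lam, size = 4, 2.5, 200_000
mu = np.sqrt(lam / n) * np.ones(n)   # any means whose squares sum to lambda

# Construction 1: sum of squares of independent normals with variance 1
Y1 = ((rng.standard_normal((size, n)) + mu)**2).sum(axis=1)

# Construction 2: Poisson mixture of central chi-square distributions
N = rng.poisson(lam / 2, size)
Y2 = rng.chisquare(n + 2 * N)

# Both samples should match the stated mean n + lambda and variance 2(n + 2 lambda)
print(Y1.mean(), Y2.mean(), n + lam)
print(Y1.var(), Y2.var(), 2 * (n + 2 * lam))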

As the asymptotic results for the skewness and kurtosis suggest, there is also a central limit theorem.

Suppose that Y has the non-central chi-square distribution with n ∈ (0, ∞) degrees of freedom and non-centrality parameter
λ ∈ (0, ∞) . Then the distribution of the standard score

\frac{Y - (n + \lambda)}{\sqrt{2(n + 2\lambda)}} \tag{5.9.33}

converges to the standard normal distribution as n → ∞ or as λ → ∞ .

Computational Exercises
Suppose that a missile is fired at a target at the origin of a plane coordinate system, with units in meters. The missile lands at
(X, Y ) where X and Y are independent and each has the normal distribution with mean 0 and variance 100. The missile will

destroy the target if it lands within 20 meters of the target. Find the probability of this event.
Answer
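Since X/10 and Y/10 are standard normal, (X² + Y²)/100 has the chi-square distribution with 2 degrees of freedom, and the event in question is that this variable is at most 4. A minimal check in Python, assuming SciPy:

import numpy as np
from scipy.stats import chi2

# The missile destroys the target when X^2 + Y^2 <= 400, that is,
# when the chi-square(2) variable (X^2 + Y^2)/100 is at most 4.
print(chi2(2).cdf(4.0))      # ≈ 0.8647
print(1 - np.exp(-2))        # same value: the chi-square(2) CDF is 1 - e^{-x/2}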

Suppose that X has the chi-square distribution with n = 18 degrees of freedom. For each of the following, compute the true
value using the special distribution calculator and then compute the normal approximation. Compare the results.
1. P(15 < X < 20)
2. The 75th percentile of X.
Answer

5.10: The Student t Distribution

In this section we will study a distribution that has special importance in statistics. In particular, this distribution will arise in the
study of a standardized version of the sample mean when the underlying distribution is normal.

Basic Theory
Definition

Suppose that Z has the standard normal distribution, V has the chi-squared distribution with n ∈ (0, ∞) degrees of freedom,
and that Z and V are independent. Random variable
T = \frac{Z}{\sqrt{V/n}} \tag{5.10.1}

has the Student t distribution with n degrees of freedom.

The Student t distribution is well defined for any n > 0, but in practice, only positive integer values of n are of interest. This distribution was first studied by William Gosset, who published under the pseudonym Student.

Distribution Functions
Suppose that T has the t distribution with n ∈ (0, ∞) degrees of freedom. Then T has a continuous distribution on R with
probability density function f given by
f(t) = \frac{\Gamma[(n+1)/2]}{\sqrt{n\pi}\, \Gamma(n/2)} \left(1 + \frac{t^2}{n}\right)^{-(n+1)/2}, \quad t \in \mathbb{R} \tag{5.10.2}

Proof

The proof of this theorem provides a good way of thinking of the t distribution: the distribution arises when the variance of a mean
0 normal distribution is randomized in a certain way.

In the special distribution simulator, select the student t distribution. Vary n and note the shape of the probability density
function. For selected values of n , run the simulation 1000 times and compare the empirical density function to the true
probability density function.
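The definition also gives a direct way to simulate the distribution; here is a minimal sketch in Python, assuming NumPy and SciPy (the seed and sample size are arbitrary).

import numpy as np
from scipy.stats import t

rng = np.random.default_rng(seed=3)
n, size = 5, 200_000

# Simulate T = Z / sqrt(V/n) directly from the definition
Z = rng.standard_normal(size)
V = rng.chisquare(n, size)
T = Z / np.sqrt(V / n)

# Compare the empirical CDF at a few points with the Student t distribution
for x in (-2.0, 0.0, 1.0):
    print(x, (T <= x).mean(), t(n).cdf(x))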

The Student probability density function f with n ∈ (0, ∞) degrees of freedom has the following properties:
1. f is symmetric about t = 0.
2. f is increasing and then decreasing with mode t = 0.
3. f is concave upward, then downward, then upward again with inflection points at ±√(n/(n + 2)).
4. f(t) → 0 as t → ∞ and as t → −∞.

In particular, the distribution is unimodal with mode and median at t =0 . Note also that the inflection points converge to ±1 as
n → ∞.

The distribution function and the quantile function of the general t distribution do not have simple, closed-form representations.
Approximate values of these functions can be obtained from the special distribution calculator, and from most mathematical and
statistical software packages.

In the special distribution calculator, select the student distribution. Vary the parameter and note the shape of the probability
density, distribution, and quantile functions. In each of the following cases, find the first and third quartiles:
1. n = 2
2. n = 5
3. n = 10

4. n = 20

Moments
Suppose that T has a t distribution. The representation in the definition can be used to find the mean, variance and other moments
of T . The main point to remember in the proofs that follow is that since V has the chi-square distribution with n degrees of
freedom, E(V^k) = ∞ if k ≤ −n/2, while if k > −n/2,

E(V^k) = 2^k \frac{\Gamma(k + n/2)}{\Gamma(n/2)} \tag{5.10.6}

Suppose that T has the t distribution with n ∈ (0, ∞) degrees of freedom. Then
1. E(T ) is undefined if 0 < n ≤ 1
2. E(T ) = 0 if 1 < n < ∞
Proof

Suppose again that T has the t distribution with n ∈ (0, ∞) degrees of freedom. Then
1. var(T) is undefined if 0 < n ≤ 1
2. var(T) = ∞ if 1 < n ≤ 2
3. var(T) = n/(n − 2) if 2 < n < ∞
Proof

Note that var(T ) → 1 as n → ∞ .

In the simulation of the special distribution simulator, select the student t distribution. Vary n and note the location and shape
of the mean ± standard deviation bar. For selected values of n , run the simulation 1000 times and compare the empirical mean
and standard deviation to the distribution mean and standard deviation.

Next we give the general moments of the t distribution.

Suppose again that T has the t distribution with n ∈ (0, ∞) degrees of freedom and k ∈ ℕ. Then
1. E(T^k) is undefined if k is odd and k ≥ n
2. E(T^k) = ∞ if k is even and k ≥ n
3. E(T^k) = 0 if k is odd and k < n
4. If k is even and k < n then

E(T^k) = n^{k/2} \frac{1 \cdot 3 \cdots (k-1)\, \Gamma((n-k)/2)}{2^{k/2}\, \Gamma(n/2)} = n^{k/2} \frac{k!\, \Gamma((n-k)/2)}{2^k (k/2)!\, \Gamma(n/2)} \tag{5.10.7}
Proof

From the general moments, we can compute the skewness and kurtosis of T .

Suppose again that T has the t distribution with n ∈ (0, ∞) degrees of freedom. Then
1. skew(T) = 0 if n > 3
2. kurt(T) = 3 + 6/(n − 4) if n > 4
Proof

Note that kurt(T ) → 3 as n → ∞ and hence the excess kurtosis kurt(T ) − 3 → 0 as n → ∞ .

In the special distribution simulator, select the student t distribution. Vary n and note the shape of the probability density
function in light of the previous results on skewness and kurtosis. For selected values of n , run the simulation 1000 times and
compare the empirical density function to the true probability density function.

Since T does not have moments of all orders, there is no interval about 0 on which the moment generating function of T is finite.
The characteristic function exists, of course, but has no simple representation, except in terms of special functions.

Relations
The t distribution with 1 degree of freedom is known as the Cauchy distribution. The probability density function is
f(t) = \frac{1}{\pi(1 + t^2)}, \quad t \in \mathbb{R} \tag{5.10.11}

The Cauchy distribution is named after Augustin Cauchy and is studied in more detail in a separate section.
You probably noticed that, qualitatively at least, the t probability density function is very similar to the standard normal probability
density function. The similarity is quantitative as well:

Let f_n denote the t probability density function with n ∈ (0, ∞) degrees of freedom. Then for fixed t ∈ R,

f_n(t) \to \frac{1}{\sqrt{2\pi}} e^{-t^2/2} \text{ as } n \to \infty \tag{5.10.12}

Proof

Note that the function on the right is the probability density function of the standard normal distribution. We can also get
convergence of the t distribution to the standard normal distribution from the basic random variable representation in the
definition.

Suppose that Tₙ has the t distribution with n ∈ ℕ₊ degrees of freedom, so that we can represent Tₙ as

T_n = \frac{Z}{\sqrt{V_n/n}} \tag{5.10.15}

where Z has the standard normal distribution, Vₙ has the chi-square distribution with n degrees of freedom, and Z and Vₙ are independent. Then Tₙ → Z as n → ∞ with probability 1.
Proof

The t distribution has more probability in the tails, and consequently less probability near 0, compared to the standard normal
distribution.

The Non-Central t Distribution


One natural way to generalize the Student t distribution is to replace the standard normal variable Z in the definition above with a normal variable having an arbitrary mean (but still unit variance). The reason this particular generalization is important is that it
arises in hypothesis tests about the mean based on a random sample from the normal distribution, when the null hypothesis is false.
For details see the sections on tests in the normal model and tests in the bivariate normal model in the chapter on Hypothesis
Testing.

Suppose that Z has the standard normal distribution, μ ∈ R , V has the chi-squared distribution with n ∈ (0, ∞) degrees of
freedom, and that Z and V are independent. Random variable
T = \frac{Z + \mu}{\sqrt{V/n}} \tag{5.10.16}

has the non-central Student t distribution with n degrees of freedom and non-centrality parameter μ.

The standard functions that characterize a distribution—the probability density function, distribution function, and quantile
function—do not have simple representations for the non-central t distribution, but can only be expressed in terms of other special
functions. Similarly, the moments do not have simple, closed form expressions either. For the beginning student of statistics, the
most important fact is that the probability density function of the non-central t distribution is similar to (but not exactly the same as) that of the standard t distribution (with the same degrees of freedom), but shifted and scaled. The density function is shifted to the
right or left, depending on whether μ > 0 or μ < 0 .

Computational Exercises
Suppose that T has the t distribution with n = 10 degrees of freedom. For each of the following, compute the true value using
the special distribution calculator and then compute the normal approximation. Compare the results.
1. P(−0.8 < T < 1.2)
2. The 90th percentile of T .
Answer
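A minimal numerical check in Python, assuming SciPy:

from scipy.stats import t, norm

T = t(10)
print(T.cdf(1.2) - T.cdf(-0.8), norm.cdf(1.2) - norm.cdf(-0.8))  # P(-0.8 < T < 1.2)
print(T.ppf(0.9), norm.ppf(0.9))                                 # 90th percentile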

5.11: The F Distribution

In this section we will study a distribution that has special importance in statistics. In particular, this distribution arises from ratios of sums of squares when sampling from a normal distribution, and so is important in estimation and in hypothesis testing in the two-sample normal model.

Basic Theory
Definition
Suppose that U has the chi-square distribution with n ∈ (0, ∞) degrees of freedom, V has the chi-square distribution with
d ∈ (0, ∞) degrees of freedom, and that U and V are independent. The distribution of

X = \frac{U/n}{V/d} \tag{5.11.1}

is the F distribution with n degrees of freedom in the numerator and d degrees of freedom in the denominator.

The F distribution was first derived by George Snedecor, and is named in honor of Sir Ronald Fisher. In practice, the parameters n
and d are usually positive integers, but this is not a mathematical requirement.

Distribution Functions
Suppose that X has the F distribution with n ∈ (0, ∞) degrees of freedom in the numerator and d ∈ (0, ∞) degrees of
freedom in the denominator. Then X has a continuous distribution on (0, ∞) with probability density function f given by
f(x) = \frac{\Gamma(n/2 + d/2)}{\Gamma(n/2)\, \Gamma(d/2)} \frac{n}{d} \frac{[(n/d)x]^{n/2-1}}{[1 + (n/d)x]^{n/2+d/2}}, \quad x \in (0, \infty) \tag{5.11.2}

where Γ is the gamma function.


Proof

Recall that the beta function B can be written in terms of the gamma function by

B(a, b) = \frac{\Gamma(a)\, \Gamma(b)}{\Gamma(a + b)}, \quad a, b \in (0, \infty) \tag{5.11.8}

Hence the probability density function of the F distribution above can also be written as

f(x) = \frac{1}{B(n/2, d/2)} \frac{n}{d} \frac{[(n/d)x]^{n/2-1}}{[1 + (n/d)x]^{n/2+d/2}}, \quad x \in (0, \infty) \tag{5.11.9}

When n ≥ 2, the probability density function is defined at x = 0, so the support interval is [0, ∞) in this case.

In the special distribution simulator, select the F distribution. Vary the parameters with the scroll bars and note the shape of the
probability density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical
density function to the probability density function.
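The definition gives a direct way to simulate the F distribution; here is a minimal sketch in Python, assuming NumPy and SciPy (the parameter values are arbitrary).

import numpy as np
from scipy.stats import f

rng = np.random.default_rng(seed=4)
n, d, size = 5, 10, 200_000

# Simulate X = (U/n) / (V/d) from the definition
U = rng.chisquare(n, size)
V = rng.chisquare(d, size)
X = (U / n) / (V / d)

# The sample mean should be close to d/(d-2) = 1.25 for d = 10
print(X.mean(), d / (d - 2))
print((X <= 1.0).mean(), f(n, d).cdf(1.0))   # empirical versus exact CDF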

Both parameters influence the shape of the F probability density function, but some of the basic qualitative features depend only
on the numerator degrees of freedom. For the remainder of this discussion, let f denote the F probability density function with
n ∈ (0, ∞) degrees of freedom in the numerator and d ∈ (0, ∞) degrees of freedom in the denominator.

Probability density function f satisfies the following properties:
1. If 0 < n < 2, f is decreasing with f(x) → ∞ as x ↓ 0.
2. If n = 2, f is decreasing with mode at x = 0.
3. If n > 2, f increases and then decreases, with mode at x = \frac{(n-2)d}{n(d+2)}.

Proof

Qualitatively, the second order properties of f also depend only on n , with transitions at n = 2 and n = 4 .

For n > 2, define

x_1 = \frac{d}{n} \frac{(n-2)(d+4) - \sqrt{2(n-2)(d+4)(n+d)}}{(d+2)(d+4)} \tag{5.11.11}

x_2 = \frac{d}{n} \frac{(n-2)(d+4) + \sqrt{2(n-2)(d+4)(n+d)}}{(d+2)(d+4)} \tag{5.11.12}

The probability density function f satisfies the following properties:
1. If 0 < n ≤ 2, f is concave upward.
2. If 2 < n ≤ 4, f is concave downward and then upward, with inflection point at x₂.
3. If n > 4, f is concave upward, then downward, then upward again, with inflection points at x₁ and x₂.

Proof

The distribution function and the quantile function do not have simple, closed-form representations. Approximate values of these
functions can be obtained from the special distribution calculator and from most mathematical and statistical software packages.

In the special distribution calculator, select the F distribution. Vary the parameters and note the shape of the probability density
function and the distribution function. In each of the following cases, find the median, the first and third quartiles, and the
interquartile range.
1. n = 5 , d = 5
2. n = 5 , d = 10
3. n = 10 , d = 5
4. n = 10 , d = 10

The general probability density function of the F distribution is a bit complicated, but it simplifies in a couple of special cases.

Special cases.
1. If n = 2,

f(x) = \frac{1}{(1 + 2x/d)^{1+d/2}}, \quad x \in (0, \infty) \tag{5.11.14}

2. If n = d ∈ (0, ∞),

f(x) = \frac{\Gamma(n)}{\Gamma^2(n/2)} \frac{x^{n/2-1}}{(1 + x)^n}, \quad x \in (0, \infty) \tag{5.11.15}

3. If n = d = 2,

f(x) = \frac{1}{(1 + x)^2}, \quad x \in (0, \infty) \tag{5.11.16}

4. If n = d = 1,

f(x) = \frac{1}{\pi \sqrt{x}\, (1 + x)}, \quad x \in (0, \infty) \tag{5.11.17}

Moments
The random variable representation in the definition, along with the moments of the chi-square distribution can be used to find the
mean, variance, and other moments of the F distribution. For the remainder of this discussion, suppose that X has the F
distribution with n ∈ (0, ∞) degrees of freedom in the numerator and d ∈ (0, ∞) degrees of freedom in the denominator.

Mean
1. E(X) = ∞ if 0 < d ≤ 2
2. E(X) = d/(d − 2) if d > 2

Proof

Thus, the mean depends only on the degrees of freedom in the denominator.

Variance
1. var(X) is undefined if 0 < d ≤ 2
2. var(X) = ∞ if 2 < d ≤ 4
3. If d > 4 then

var(X) = 2 \left(\frac{d}{d-2}\right)^2 \frac{n + d - 2}{n(d - 4)} \tag{5.11.19}

Proof

In the simulation of the special distribution simulator, select the F distribution. Vary the parameters with the scroll bar and note the size and location of the mean ± standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.

General moments. For k > 0,
1. E(X^k) = ∞ if 0 < d ≤ 2k
2. If d > 2k then

E(X^k) = \left(\frac{d}{n}\right)^k \frac{\Gamma(n/2 + k)\, \Gamma(d/2 - k)}{\Gamma(n/2)\, \Gamma(d/2)} \tag{5.11.23}

Proof

If k ∈ ℕ, then using the fundamental identity of the gamma distribution and some algebra,

E(X^k) = \left(\frac{d}{n}\right)^k \frac{n(n+2) \cdots [n + 2(k-1)]}{(d-2)(d-4) \cdots (d-2k)} \tag{5.11.26}

From the general moment formula, we can compute the skewness and kurtosis of the F distribution.

Skewness and kurtosis

1. If d > 6,

skew(X) = \frac{(2n + d - 2) \sqrt{8(d-4)}}{(d-6) \sqrt{n(n+d-2)}} \tag{5.11.27}

2. If d > 8,

kurt(X) = 3 + 12 \frac{n(5d - 22)(n + d - 2) + (d-4)(d-2)^2}{n(d-6)(d-8)(n+d-2)} \tag{5.11.28}

Proof

Not surprisingly, the F distribution is positively skewed. Recall that the excess kurtosis is

kurt(X) - 3 = 12 \frac{n(5d - 22)(n + d - 2) + (d-4)(d-2)^2}{n(d-6)(d-8)(n+d-2)} \tag{5.11.29}

In the simulation of the special distribution simulator, select the F distribution. Vary the parameters with the scroll bar and
note the shape of the probability density function in light of the previous results on skewness and kurtosis. For selected values
of the parameters, run the simulation 1000 times and compare the empirical density function to the probability density
function.

Relations
The most important relationship is the one in the definition, between the F distribution and the chi-square distribution. In addition,
the F distribution is related to several other special distributions.

Suppose that X has the F distribution with n ∈ (0, ∞) degrees of freedom in the numerator and d ∈ (0, ∞) degrees of
freedom in the denominator. Then 1/X has the F distribution with d degrees of freedom in the numerator and n degrees of
freedom in the denominator.
Proof

Suppose that T has the t distribution with n ∈ (0, ∞) degrees of freedom. Then X = T² has the F distribution with 1 degree of freedom in the numerator and n degrees of freedom in the denominator.
Proof

Our next relationship is between the F distribution and the exponential distribution.

Suppose that X and Y are independent random variables, each with the exponential distribution with rate parameter r ∈ (0, ∞). Then Z = X/Y has the F distribution with 2 degrees of freedom in both the numerator and denominator.

Proof

A simple transformation can change a variable with the F distribution into a variable with the beta distribution, and conversely.

Connections between the F distribution and the beta distribution.


1. If X has the F distribution with n ∈ (0, ∞) degrees of freedom in the numerator and d ∈ (0, ∞) degrees of freedom in
the denominator, then
Y = \frac{(n/d)X}{1 + (n/d)X} \tag{5.11.37}

has the beta distribution with left parameter n/2 and right parameter d/2.
2. If Y has the beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞) then
X = \frac{bY}{a(1 - Y)} \tag{5.11.38}

has the F distribution with 2a degrees of freedom in the numerator and 2b degrees of freedom in the denominator.
Proof
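A quick numerical check of part (a) in Python, assuming SciPy (the parameter values are arbitrary):

from scipy.stats import f, beta

n, d = 6.0, 8.0
X = f(n, d)
B = beta(n / 2, d / 2)

# If X ~ F(n, d) then Y = (n/d)X / (1 + (n/d)X) ~ beta(n/2, d/2), so
# P(X <= x) should equal P(Y <= y(x)) for every x
for x in (0.5, 1.0, 2.0):
    y = (n / d) * x / (1 + (n / d) * x)
    print(X.cdf(x), B.cdf(y))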

The F distribution is closely related to the beta prime distribution by a simple scale transformation.

Connections with the beta prime distributions.

1. If X has the F distribution with n ∈ (0, ∞) degrees of freedom in the numerator and d ∈ (0, ∞) degrees of freedom in the denominator, then Y = (n/d)X has the beta prime distribution with parameters n/2 and d/2.
2. If Y has the beta prime distribution with parameters a ∈ (0, ∞) and b ∈ (0, ∞) then X = (b/a)Y has the F distribution with 2a degrees of freedom in the numerator and 2b degrees of freedom in the denominator.

Proof

The Non-Central F Distribution


The F distribution can be generalized in a natural way by replacing the ordinary chi-square variable in the numerator in the
definition above with a variable having a non-central chi-square distribution. This generalization is important in analysis of
variance.

Suppose that U has the non-central chi-square distribution with n ∈ (0, ∞) degrees of freedom and non-centrality parameter
λ ∈ [0, ∞), V has the chi-square distribution with d ∈ (0, ∞) degrees of freedom, and that U and V are independent. The

distribution of

5.11.4 https://stats.libretexts.org/@go/page/10351
X = \frac{U/n}{V/d} \tag{5.11.43}

is the non-central F distribution with n degrees of freedom in the numerator, d degrees of freedom in the denominator, and
non-centrality parameter λ .

One of the most interesting and important results for the non-central chi-square distribution is that it is a Poisson mixture of
ordinary chi-square distributions. This leads to a similar result for the non-central F distribution.

Suppose that N has the Poisson distribution with parameter λ/2, and that the conditional distribution of X given N is the F distribution with n + 2N degrees of freedom in the numerator and d degrees of freedom in the denominator, where λ ∈ [0, ∞) and n, d ∈ (0, ∞). Then X has the non-central F distribution with n degrees of freedom in the numerator, d

degrees of freedom in the denominator, and non-centrality parameter λ .


Proof

From the last result, we can express the probability density function and distribution function of the non-central F distribution as a
series in terms of ordinary F density and distribution functions. To set up the notation, for j, k ∈ (0, ∞) let f_{j,k} be the probability density function and F_{j,k} the distribution function of the F distribution with j degrees of freedom in the numerator and k degrees of freedom in the denominator. For the rest of this discussion, λ ∈ [0, ∞) and n, d ∈ (0, ∞) as usual.

The probability density function g of the non-central F distribution with n degrees of freedom in the numerator, d degrees of
freedom in the denominator, and non-centrality parameter λ is given by
g(x) = \sum_{k=0}^{\infty} e^{-\lambda/2} \frac{(\lambda/2)^k}{k!} f_{n+2k,d}(x), \quad x \in (0, \infty) \tag{5.11.44}

The distribution function G of the non-central F distribution with n degrees of freedom in the numerator, d degrees of
freedom in the denominator, and non-centrality parameter λ is given by
G(x) = \sum_{k=0}^{\infty} e^{-\lambda/2} \frac{(\lambda/2)^k}{k!} F_{n+2k,d}(x), \quad x \in (0, \infty) \tag{5.11.45}

5.12: The Lognormal Distribution

Basic Theory
Definition
Suppose that Y has the normal distribution with mean μ ∈ R and standard deviation σ ∈ (0, ∞). Then X = e^Y has the lognormal distribution with parameters μ and σ.
1. The parameter σ is the shape parameter of the distribution.
2. The parameter e^μ is the scale parameter of the distribution.

If Z has the standard normal distribution then W = e^Z has the standard lognormal distribution.

So equivalently, if X has a lognormal distribution then ln X has a normal distribution, hence the name. The lognormal distribution
is a continuous distribution on (0, ∞) and is used to model random quantities when the distribution is believed to be skewed, such
as certain income and lifetime variables. It's easy to write a general lognormal variable in terms of a standard lognormal variable.
Suppose that Z has the standard normal distribution and let W = e^Z, so that W has the standard lognormal distribution. If μ ∈ R and σ ∈ (0, ∞) then Y = μ + σZ has the normal distribution with mean μ and standard deviation σ, and hence X = e^Y has the lognormal distribution with parameters μ and σ. But

X = e^Y = e^{\mu + \sigma Z} = e^\mu (e^Z)^\sigma = e^\mu W^\sigma \tag{5.12.1}

Distribution Functions
Suppose that X has the lognormal distribution with parameters μ ∈ R and σ ∈ (0, ∞).

The probability density function f of X is given by

f(x) = \frac{1}{\sqrt{2\pi}\, \sigma x} \exp\left[-\frac{(\ln x - \mu)^2}{2\sigma^2}\right], \quad x \in (0, \infty) \tag{5.12.2}

1. f increases and then decreases with mode at x = exp(μ − σ²).
2. f is concave upward then downward then upward again, with inflection points at x = \exp\left(\mu - \frac{3}{2}\sigma^2 \pm \frac{1}{2}\sigma\sqrt{\sigma^2 + 4}\right)
3. f(x) → 0 as x ↓ 0 and as x → ∞.
Proof

In the special distribution simulator, select the lognormal distribution. Vary the parameters and note the shape and location of
the probability density function. For selected values of the parameters, run the simulation 1000 times and compare the
empirical density function to the true probability density function.

Let Φ denote the standard normal distribution function, so that Φ⁻¹ is the standard normal quantile function. Recall that values of Φ and Φ⁻¹ can be obtained from the special distribution calculator, as well as standard mathematical and statistical software packages, and in fact these functions are considered to be special functions in mathematics. The following two results show how to compute the lognormal distribution function and quantiles in terms of the standard normal distribution function and quantiles.

The distribution function F of X is given by

F(x) = \Phi\left(\frac{\ln x - \mu}{\sigma}\right), \quad x \in (0, \infty) \tag{5.12.5}

Proof

The quantile function of X is given by

F^{-1}(p) = \exp[\mu + \sigma \Phi^{-1}(p)], \quad p \in (0, 1) \tag{5.12.7}

Proof

In the special distribution calculator, select the lognormal distribution. Vary the parameters and note the shape and location of
the probability density function and the distribution function. With μ = 0 and σ = 1 , find the median and the first and third
quartiles.

Moments
The moments of the lognormal distribution can be computed from the moment generating function of the normal distribution. Once
again, we assume that X has the lognormal distribution with parameters μ ∈ R and σ ∈ (0, ∞).

For t ∈ R,

E(X^t) = \exp\left(\mu t + \frac{1}{2} \sigma^2 t^2\right) \tag{5.12.8}

Proof

In particular, the mean and variance of X are
1. E(X) = \exp\left(\mu + \frac{1}{2}\sigma^2\right)
2. var(X) = \exp[2(\mu + \sigma^2)] - \exp(2\mu + \sigma^2)

In the simulation of the special distribution simulator, select the lognormal distribution. Vary the parameters and note the shape
and location of the mean±standard deviation bar. For selected values of the parameters, run the simulation 1000 times and
compare the empirical moments to the true moments.
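The closed forms for the mean and variance are also easy to check by simulation; a minimal Python sketch (assuming NumPy, with arbitrary parameter values) follows.

import numpy as np

rng = np.random.default_rng(seed=5)
mu, sigma = 0.5, 0.75
X = np.exp(mu + sigma * rng.standard_normal(500_000))

# Compare the simulation with the closed forms for the mean and variance
print(X.mean(), np.exp(mu + sigma**2 / 2))
print(X.var(), np.exp(2 * (mu + sigma**2)) - np.exp(2 * mu + sigma**2))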

From the general formula for the moments, we can also compute the skewness and kurtosis of the lognormal distribution.

The skewness and kurtosis of X are

1. skew(X) = (e^{\sigma^2} + 2) \sqrt{e^{\sigma^2} - 1}
2. kurt(X) = e^{4\sigma^2} + 2e^{3\sigma^2} + 3e^{2\sigma^2} - 3
Proof

The fact that the skewness and kurtosis do not depend on μ is due to the fact that e^μ is a scale parameter. Recall that skewness and kurtosis are defined in terms of the standard score and so are independent of location and scale parameters. Naturally, the lognormal distribution is positively skewed. Finally, note that the excess kurtosis is

kurt(X) - 3 = e^{4\sigma^2} + 2e^{3\sigma^2} + 3e^{2\sigma^2} - 6 \tag{5.12.10}

Even though the lognormal distribution has finite moments of all orders, the moment generating function is infinite at any positive
number. This property is one of the reasons for the fame of the lognormal distribution.

E(e^{tX}) = ∞ for every t > 0.
Proof

Related Distributions
The most important relations are the ones between the lognormal and normal distributions in the definition: if X has a lognormal
distribution then ln X has a normal distribution; conversely if Y has a normal distribution then e^Y has a lognormal distribution.

The lognormal distribution is also a scale family.

Suppose that X has the lognormal distribution with parameters μ ∈ R and σ ∈ (0, ∞) and that c ∈ (0, ∞) . Then cX has the
lognormal distribution with parameters μ + ln c and σ.
Proof

The reciprocal of a lognormal variable is also lognormal.

If X has the lognormal distribution with parameters μ ∈ R and σ ∈ (0, ∞) then 1/X has the lognormal distribution with
parameters −μ and σ.
Proof

The lognormal distribution is closed under non-zero powers of the underlying variable. In particular, this generalizes the previous
result.

Suppose that X has the lognormal distribution with parameters μ ∈ R and σ ∈ (0, ∞) and that a ∈ R ∖ {0}. Then X^a has the lognormal distribution with parameters aμ and |a|σ.


Proof

Since the normal distribution is closed under sums of independent variables, it's not surprising that the lognormal distribution is
closed under products of independent variables.

Suppose that n ∈ ℕ₊ and that (X₁, X₂, …, Xₙ) is a sequence of independent variables, where X_i has the lognormal distribution with parameters μ_i ∈ R and σ_i ∈ (0, ∞) for i ∈ {1, 2, …, n}. Then \prod_{i=1}^{n} X_i has the lognormal distribution with parameters μ and σ, where μ = \sum_{i=1}^{n} \mu_i and σ² = \sum_{i=1}^{n} \sigma_i^2.

Proof

Finally, the lognormal distribution belongs to the family of general exponential distributions.

Suppose that X has the lognormal distribution with parameters μ ∈ R and σ ∈ (0, ∞). The distribution of X is a 2-parameter
exponential family with natural parameters and natural statistics, respectively, given by
1. (−1/(2σ²), μ/σ²)
2. (ln²(X), ln X)

Proof

Computational Exercises
Suppose that the income X of a randomly chosen person in a certain population (in $1000 units) has the lognormal distribution
with parameters μ = 2 and σ = 1 . Find P(X > 20) .
Answer

Suppose that the income X of a randomly chosen person in a certain population (in $1000 units) has the lognormal distribution
with parameters μ = 2 and σ = 1 . Find each of the following:
1. E(X)
2. var(X)
Answer
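Both exercises reduce to the formulas above; a minimal numerical check in Python, assuming NumPy and SciPy:

import numpy as np
from scipy.stats import norm

mu, sigma = 2.0, 1.0

# P(X > 20) = 1 - Phi((ln 20 - mu) / sigma)
print(norm.sf((np.log(20) - mu) / sigma))                       # ≈ 0.1597

# E(X) = exp(mu + sigma^2/2), var(X) = exp(2(mu + sigma^2)) - exp(2 mu + sigma^2)
print(np.exp(mu + sigma**2 / 2))                                # ≈ 12.18
print(np.exp(2 * (mu + sigma**2)) - np.exp(2 * mu + sigma**2))  # ≈ 255.02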

5.13: The Folded Normal Distribution

The General Folded Normal Distribution


Introduction
The folded normal distribution is the distribution of the absolute value of a random variable with a normal distribution. As has been
emphasized before, the normal distribution is perhaps the most important in probability and is used to model an incredible variety
of random phenomena. Since one may only be interested in the magnitude of a normally distributed variable, the folded normal
arises in a very natural way. The name stems from the fact that the probability measure of the normal distribution on (−∞, 0] is
“folded over” to [0, ∞). Here is the formal definition:

Suppose that Y has a normal distribution with mean μ ∈ R and standard deviation σ ∈ (0, ∞). Then X = |Y | has the folded
normal distribution with parameters μ and σ.

So in particular, the folded normal distribution is a continuous distribution on [0, ∞).

Distribution Functions
Suppose that Z has the standard normal distribution. Recall that Z has probability density function ϕ and distribution function Φ

given by
\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}, \quad z \in \mathbb{R} \tag{5.13.1}

\Phi(z) = \int_{-\infty}^{z} \phi(x)\, dx = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\, dx, \quad z \in \mathbb{R} \tag{5.13.2}

The standard normal distribution is so important that Φ is considered a special function and can be computed using most
mathematical and statistical software. If μ ∈ R and σ ∈ (0, ∞), then Y = μ + σZ has the normal distribution with mean μ and
standard deviation σ, and therefore X = |Y | = |μ + σZ| has the folded normal distribution with parameters μ and σ. For the
remainder of this discussion we assume that X has this folded normal distribution.

X has distribution function F given by

F(x) = \Phi\left(\frac{x - \mu}{\sigma}\right) - \Phi\left(\frac{-x - \mu}{\sigma}\right) = \Phi\left(\frac{x - \mu}{\sigma}\right) + \Phi\left(\frac{x + \mu}{\sigma}\right) - 1 \tag{5.13.3}

= \int_0^x \frac{1}{\sigma\sqrt{2\pi}} \left\{\exp\left[-\frac{1}{2}\left(\frac{y + \mu}{\sigma}\right)^2\right] + \exp\left[-\frac{1}{2}\left(\frac{y - \mu}{\sigma}\right)^2\right]\right\} dy, \quad x \in [0, \infty) \tag{5.13.4}

Proof

We cannot compute the quantile function F⁻¹ in closed form, but values of this function can be approximated.

Open the special distribution calculator and select the folded normal distribution, and set the view to CDF. Vary the parameters
and note the shape of the distribution function. For selected values of the parameters, compute the median and the first and
third quartiles.

X has probability density function f given by

f(x) = \frac{1}{\sigma}\left[\phi\left(\frac{x - \mu}{\sigma}\right) + \phi\left(\frac{x + \mu}{\sigma}\right)\right] \tag{5.13.7}

= \frac{1}{\sigma\sqrt{2\pi}} \left\{\exp\left[-\frac{1}{2}\left(\frac{x + \mu}{\sigma}\right)^2\right] + \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right]\right\}, \quad x \in [0, \infty) \tag{5.13.8}

Proof

Open the special distribution simulator and select the folded normal distribution. Vary the parameters μ and σ and note the shape of the probability density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical density function to the true probability density function.

Note that the folded normal distribution is unimodal for some values of the parameters and decreasing for other values. Note also
that μ is not a location parameter nor is σ a scale parameter; both influence the shape of the probability density function.

Moments
We cannot compute the mean of the folded normal distribution in closed form, but the mean can at least be given in terms of Φ. Once again, we assume that X has the folded normal distribution with parameters μ ∈ R and σ ∈ (0, ∞).

The first two moments of X are
1. E(X) = μ[1 − 2Φ(−μ/σ)] + σ√(2/π) exp(−μ²/(2σ²))
2. E(X²) = μ² + σ²
Proof

In particular, the variance of X is

var(X) = \mu^2 + \sigma^2 - \left\{\mu\left[1 - 2\Phi\left(-\frac{\mu}{\sigma}\right)\right] + \sigma\sqrt{\frac{2}{\pi}} \exp\left(-\frac{\mu^2}{2\sigma^2}\right)\right\}^2 \tag{5.13.12}

Open the special distribution simulator and select the folded normal distribution. Vary the parameters and note the size and
location of the mean±standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare
the empirical mean and standard deviation to the true mean and standard deviation.

Related Distributions
The most important relation is the one between the folded normal distribution and the normal distribution in the definition: If Y has
a normal distribution then X = |Y | has a folded normal distribution. The folded normal distribution is also related to itself through
a symmetry property that is perhaps not completely obvious from the initial definition:

For μ ∈ R and σ ∈ (0, ∞), the folded normal distribution with parameters −μ and σ is the same as the folded normal
distribution with parameters μ and σ.
Proof 1
Proof 2

The folded normal distribution is also closed under scale transformations.

Suppose that X has the folded normal distribution with parameters μ ∈ R and σ ∈ (0, ∞) and that b ∈ (0, ∞) . Then bX has
the folded normal distribution with parameters bμ and bσ.
Proof

The Half-Normal Distribution


When μ = 0 , results for the folded normal distribution are much simpler, and fortunately this special case is the most important
one. We are more likely to be interested in the magnitude of a normally distributed variable when the mean is 0, and moreover, this
distribution arises in the study of Brownian motion.

Suppose that Z has the standard normal distribution and that σ ∈ (0, ∞). Then X = σ|Z| has the half-normal distribution
with scale parameter σ. If σ = 1 so that X = |Z| , then X has the standard half-normal distribution.

Distribution Functions
For our next discussion, suppose that X has the half-normal distribution with parameter σ ∈ (0, ∞). Once again, Φ and Φ⁻¹ denote the distribution function and quantile function, respectively, of the standard normal distribution.

The distribution function F and quantile function F⁻¹ of X are

F(x) = 2\Phi\left(\frac{x}{\sigma}\right) - 1 = \int_0^x \frac{1}{\sigma}\sqrt{\frac{2}{\pi}} \exp\left(-\frac{y^2}{2\sigma^2}\right) dy, \quad x \in [0, \infty) \tag{5.13.13}

F^{-1}(p) = \sigma\, \Phi^{-1}\left(\frac{1 + p}{2}\right), \quad p \in [0, 1) \tag{5.13.14}

Proof

Open the special distribution calculator and select the folded normal distribution. Select CDF view and keep μ = 0 . Vary σ and
note the shape of the CDF. For various values of σ, compute the median and the first and third quartiles.
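The quantile formula is easy to use directly; here is a minimal sketch in Python, assuming SciPy (which also provides the half-normal distribution as halfnorm).

from scipy.stats import norm, halfnorm

sigma = 2.0
# Quantile function of the half-normal distribution: F^{-1}(p) = sigma * Phi^{-1}((1 + p)/2)
for p in (0.25, 0.5, 0.75):
    direct = sigma * norm.ppf((1 + p) / 2)
    packaged = halfnorm(scale=sigma).ppf(p)
    print(p, direct, packaged)   # the two computations agree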

The probability density function f of X is given by

f(x) = \frac{2}{\sigma} \phi\left(\frac{x}{\sigma}\right) = \frac{1}{\sigma}\sqrt{\frac{2}{\pi}} \exp\left(-\frac{x^2}{2\sigma^2}\right), \quad x \in [0, \infty) \tag{5.13.15}

1. f is decreasing with mode at x = 0.
2. f is concave downward and then upward, with inflection point at x = σ.
Proof

Open the special distribution simulator and select the folded normal distribution. Keep μ = 0 and vary σ, and note the shape of the probability density function. For selected values of σ, run the simulation 1000 times and compare the empirical density function to the true probability density function.

Moments
The moments of the half-normal distribution can be computed explicitly. Once again we assume that X has the half-normal
distribution with parameter σ ∈ (0, ∞).

For n ∈ ℕ,

E(X^{2n}) = \sigma^{2n} \frac{(2n)!}{n!\, 2^n} \tag{5.13.16}

E(X^{2n+1}) = \sigma^{2n+1} 2^n \sqrt{\frac{2}{\pi}}\, n! \tag{5.13.17}

Proof
In particular, we have E(X) = σ√(2/π) and var(X) = σ²(1 − 2/π).

Open the special distribution simulator and select the folded normal distribution. Keep μ = 0 and vary σ, and note the size and
location of the mean±standard deviation bar. For selected values of σ, run the simulation 1000 times and compare the mean
and standard deviation to the true mean and standard deviation.

Next are the skewness and kurtosis of the half-normal distribution.

Skewness and kurtosis

1. The skewness of X is

skew(X) = \frac{\sqrt{2/\pi}\, (4/\pi - 1)}{(1 - 2/\pi)^{3/2}} \approx 0.99527 \tag{5.13.19}

2. The kurtosis of X is

kurt(X) = \frac{3 - 4/\pi - 12/\pi^2}{(1 - 2/\pi)^2} \approx 3.8692 \tag{5.13.20}

Proof

Related Distributions
Once again, the most important relation is the one in the definition: If Y has a normal distribution with mean 0 then X = |Y | has a
half-normal distribution. Since the half normal distribution is a scale family, it is trivially closed under scale transformations.

Suppose that X has the half-normal distribution with parameter σ and that b ∈ (0, ∞) . Then bX has the half-normal
distribution with parameter bσ.
Proof

The standard half-normal distribution is also a special case of the chi distribution.

The standard half-normal distribution is the chi distribution with 1 degree of freedom.
Proof

5.14: The Rayleigh Distribution

The Rayleigh distribution, named for William Strutt, Lord Rayleigh, is the distribution of the magnitude of a two-dimensional
random vector whose coordinates are independent, identically distributed, mean 0 normal variables. The distribution has a number
of applications in settings where magnitudes of normal variables are important.

The Standard Rayleigh Distribution


Definition
Suppose that Z₁ and Z₂ are independent random variables with standard normal distributions. The magnitude R = √(Z₁² + Z₂²) of the vector (Z₁, Z₂) has the standard Rayleigh distribution.

So in this definition, (Z₁, Z₂) has the standard bivariate normal distribution.

Distribution Functions
We give five functions that completely characterize the standard Rayleigh distribution: the distribution function, the probability
density function, the quantile function, the reliability function, and the failure rate function. For the remainder of this discussion,
we assume that R has the standard Rayleigh distribution.

R has distribution function G given by G(x) = 1 − e^{−x²/2} for x ∈ [0, ∞).

Proof

R has probability density function g given by g(x) = x e^{−x²/2} for x ∈ [0, ∞).

1. g increases and then decreases with mode at x = 1.
2. g is concave downward and then upward with inflection point at x = √3.
Proof

Open the Special Distribution Simulator and select the Rayleigh distribution. Keep the default parameter value and note the shape of the probability density function. Run the simulation 1000 times and compare the empirical density function to the probability density function.

R has quantile function G⁻¹ given by G⁻¹(p) = √(−2 ln(1 − p)) for p ∈ [0, 1). In particular, the quartiles of R are
1. q₁ = √(4 ln 2 − 2 ln 3) ≈ 0.7585, the first quartile
2. q₂ = √(2 ln 2) ≈ 1.1774, the median
3. q₃ = √(4 ln 2) ≈ 1.6651, the third quartile
Proof

Open the Special Distribution Calculator and select the Rayleigh distribution. Keep the default parameter value. Note the shape
and location of the distribution function. Compute selected values of the distribution function and the quantile function.

R has reliability function Gᶜ given by Gᶜ(x) = e^{−x²/2} for x ∈ [0, ∞).

Proof

R has failure rate function h given by h(x) = x for x ∈ [0, ∞). In particular, R has increasing failure rate.
Proof

Moments
Once again we assume that R has the standard Rayleigh distribution. We can express the moment generating function of R in
terms of the standard normal distribution function Φ. Recall that Φ is so commonly used that it is a special function of
mathematics.

R has moment generating function m given by

m(t) = E(e^{tR}) = 1 + \sqrt{2\pi}\, t\, e^{t^2/2}\, \Phi(t), \quad t \in \mathbb{R} \tag{5.14.3}

Proof

The mean and variance of R are
1. E(R) = √(π/2) ≈ 1.2533
2. var(R) = 2 − π/2
Proof

Numerically, E(R) ≈ 1.2533 and sd(R) ≈ 0.6551.

Open the Special Distribution Simulator and select the Rayleigh distribution. Keep the default parameter value. Note the size and location of the mean ± standard deviation bar. Run the simulation 1000 times and compare the empirical mean and standard deviation to the true mean and standard deviation.

The general moments of R can be expressed in terms of the gamma function Γ.


E(Rⁿ) = 2^{n/2} Γ(1 + n/2) for n ∈ ℕ.
Proof

Of course, the formula for the general moments gives an alternate derivation of the mean and variance above, since Γ(3/2) = √π/2 and Γ(2) = 1. On the other hand, the moment generating function can also be used to derive the formula for the general moments.

The skewness and kurtosis of R are
1. skew(R) = 2√π (π − 3)/(4 − π)^{3/2} ≈ 0.6311
2. kurt(R) = (32 − 3π²)/(4 − π)² ≈ 3.2451

Proof

Related Distributions
The fundamental connection between the standard Rayleigh distribution and the standard normal distribution is given in the very
definition of the standard Rayleigh, as the distribution of the magnitude of a point with independent, standard normal coordinates.

Connections to the chi-square distribution.

1. If R has the standard Rayleigh distribution then R² has the chi-square distribution with 2 degrees of freedom.
2. If V has the chi-square distribution with 2 degrees of freedom then √V has the standard Rayleigh distribution.
Proof

Recall also that the chi-square distribution with 2 degrees of freedom is the same as the exponential distribution with scale
parameter 2.
Since the quantile function is in closed form, the standard Rayleigh distribution can be simulated by the random quantile method.

Connections between the standard Rayleigh distribution and the standard uniform distribution.
1. If U has the standard uniform distribution (a random number) then R = G⁻¹(U) = √(−2 ln(1 − U)) has the standard Rayleigh distribution.
2. If R has the standard Rayleigh distribution then U = G(R) = 1 − exp(−R²/2) has the standard uniform distribution.

In part (a), note that 1 − U has the same distribution as U (the standard uniform). Hence R = √(−2 ln U) also has the standard Rayleigh distribution.

Open the random quantile simulator and select the Rayleigh distribution with the default parameter value (standard). Run the
simulation 1000 times and compare the empirical density function to the true density function.
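The random quantile method is trivial to implement; a minimal Python sketch (assuming NumPy, with an arbitrary seed) follows.

import numpy as np

rng = np.random.default_rng(seed=6)
U = rng.uniform(size=200_000)
R = np.sqrt(-2 * np.log(1 - U))   # G^{-1}(U): standard Rayleigh sample

# Check against the closed-form mean sqrt(pi/2) and variance 2 - pi/2
print(R.mean(), np.sqrt(np.pi / 2))
print(R.var(), 2 - np.pi / 2)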

There is another connection with the uniform distribution that leads to the most common method of simulating a pair of
independent standard normal variables. We have seen this before, but it's worth repeating. The result is closely related to the
definition of the standard Rayleigh variable as the magnitude of a standard bivariate normal pair, but with the addition of the polar
coordinate angle.

Suppose that R has the standard Rayleigh distribution, Θ is uniformly distributed on [0, 2π), and that R and Θ are
independent. Let Z = R cos Θ, W = R sin Θ. Then (Z, W) has the standard bivariate normal distribution.
Proof
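This construction is the familiar Box–Muller algorithm. Below is a minimal NumPy sketch of the construction; the sample size and seed are arbitrary.

import numpy as np

rng = np.random.default_rng(seed=42)
n = 100_000

# R standard Rayleigh (via the random quantile method above) and
# Theta uniform on [0, 2*pi), with R and Theta independent.
r = np.sqrt(-2.0 * np.log1p(-rng.random(n)))
theta = rng.uniform(0.0, 2.0 * np.pi, n)

# Z = R cos(Theta) and W = R sin(Theta) should be independent standard normals.
z, w = r * np.cos(theta), r * np.sin(theta)
print(z.mean(), z.var())           # about 0 and 1
print(w.mean(), w.var())           # about 0 and 1
print(np.corrcoef(z, w)[0, 1])     # about 0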

The General Rayleigh Distribution


Definition
The standard Rayleigh distribution is generalized by adding a scale parameter.

If R has the standard Rayleigh distribution and b ∈ (0, ∞) then X = bR has the Rayleigh distribution with scale parameter b .

Equivalently, the Rayleigh distribution is the distribution of the magnitude of a two-dimensional vector whose components are independent, identically distributed, mean 0 normal variables.

If U1 and U2 are independent normal variables with mean 0 and standard deviation σ ∈ (0, ∞) then X = √(U1² + U2²) has the Rayleigh distribution with scale parameter σ.
Proof

Distribution Functions
In this section, we assume that X has the Rayleigh distribution with scale parameter b ∈ (0, ∞).
X has cumulative distribution function F given by F(x) = 1 − exp(−x²/2b²) for x ∈ [0, ∞).

Proof
X has probability density function f given by f(x) = (x/b²) exp(−x²/2b²) for x ∈ [0, ∞).

1. f increases and then decreases with mode at x = b.
2. f is concave downward and then upward with inflection point at x = √3 b.
Proof

Open the Special Distribution Simulator and select the Rayleigh distribution. Vary the scale parameter and note the shape and
location of the probability density function. For various values of the scale parameter, run the simulation 1000 times and
compare the empirical density function to the probability density function.

X has quantile function F⁻¹ given by F⁻¹(p) = b√(−2 ln(1 − p)) for p ∈ [0, 1). In particular, the quartiles of X are

1. q1 = b√(4 ln 2 − 2 ln 3), the first quartile
2. q2 = b√(2 ln 2), the median
3. q3 = b√(4 ln 2), the third quartile
Proof

Open the Special Distribution Calculator and select the Rayleigh distribution. Vary the scale parameter and note the location
and shape of the distribution function. For various values of the scale parameter, compute selected values of the distribution
function and the quantile function.

X has reliability function F^c given by F^c(x) = exp(−x²/2b²) for x ∈ [0, ∞).

Proof

X has failure rate function h given by h(x) = x/b² for x ∈ [0, ∞). In particular, X has increasing failure rate.

Proof

Moments
Again, we assume that X has the Rayleigh distribution with scale parameter b , and recall that Φ denotes the standard normal
distribution function.

X has moment generating function M given by

M(t) = E(e^{tX}) = 1 + √(2π) b t exp(b²t²/2) Φ(bt), t ∈ R (5.14.10)

Proof

The mean and variance of X are


1. E(X) = b√(π/2)
2. var(X) = b²(2 − π/2)

Proof

Open the Special Distribution Simulator and select the Rayleigh distribution. Vary the scale parameter and note the size and
location of the mean±standard deviation bar. For various values of the scale parameter, run the simulation 1000 times and
compare the empirical mean and standard deviation to the true mean and standard deviation.

Again, the general moments can be expressed in terms of the gamma function Γ.

E(X^n) = b^n 2^{n/2} Γ(1 + n/2) for n ∈ N.
Proof

The skewness and kurtosis of X are

1. skew(X) = 2√π (π − 3)/(4 − π)^{3/2} ≈ 0.6311
2. kurt(X) = (32 − 3π²)/(4 − π)² ≈ 3.2451

Proof

Related Distributions
The fundamental connection between the Rayleigh distribution and the normal distribution is the definition, and of course, is the
primary reason that the Rayleigh distribution is special in the first place. By construction, the Rayleigh distribution is a scale
family, and so is closed under scale transformations.

If X has the Rayleigh distribution with scale parameter b ∈ (0, ∞) and if c ∈ (0, ∞) then cX has the Rayleigh distribution
with scale parameter bc.

The Rayleigh distribution is a special case of the Weibull distribution.

The Rayleigh distribution with scale parameter b ∈ (0, ∞) is the Weibull distribution with shape parameter 2 and scale parameter √2 b.

The following result generalizes the connection between the standard Rayleigh and chi-square distributions.

If X has the Rayleigh distribution with scale parameter b ∈ (0, ∞) then X² has the exponential distribution with scale parameter 2b².

Proof

Since the quantile function is in closed form, the Rayleigh distribution can be simulated by the random quantile method.

Suppose that b ∈ (0, ∞).

1. If U has the standard uniform distribution (a random number) then X = F⁻¹(U) = b√(−2 ln(1 − U)) has the Rayleigh distribution with scale parameter b.
2. If X has the Rayleigh distribution with scale parameter b then U = F(X) = 1 − exp(−X²/2b²) has the standard uniform distribution.

In part (a), note that 1 − U has the same distribution as U (the standard uniform). Hence X = b√(−2 ln U) also has the Rayleigh distribution with scale parameter b.
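As a quick numerical check, the sketch below compares the random quantile method against NumPy's built-in Rayleigh generator, which uses the same scale convention; the scale value, sample size, and seed are arbitrary.

import numpy as np

rng = np.random.default_rng(seed=42)
b, n = 2.5, 100_000   # arbitrary scale parameter and sample size

# Random quantile method for the Rayleigh distribution with scale parameter b.
x = b * np.sqrt(-2.0 * np.log1p(-rng.random(n)))

# NumPy's built-in generator with the same scale convention.
y = rng.rayleigh(scale=b, size=n)

print(x.mean(), y.mean(), b * np.sqrt(np.pi / 2))               # means
print(np.median(x), np.median(y), b * np.sqrt(2 * np.log(2)))   # medians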

Open the random quantile simulator and select the Rayleigh distribution. For selected values of the scale parameter, run the
simulation 1000 times and compare the empirical density function to the true density function.

Finally, the Rayleigh distribution is a member of the general exponential family.

If X has the Rayleigh distribution with scale parameter b ∈ (0, ∞) then X has a one-parameter exponential distribution with natural parameter −1/b² and natural statistic X²/2.

Proof

This page titled 5.14: The Rayleigh Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

5.15: The Maxwell Distribution

The Maxwell distribution, named for James Clerk Maxwell, is the distribution of the magnitude of a three-dimensional random
vector whose coordinates are independent, identically distributed, mean 0 normal variables. The distribution has a number of
applications in settings where magnitudes of normal variables are important, particularly in physics. It is also called the Maxwell-
Boltzmann distribution in honor also of Ludwig Boltzmann. The Maxwell distribution is closely related to the Rayleigh
distribution, which governs the magnitude of a two-dimensional random vector whose coordinates are independent, identically
distributed, mean 0 normal variables.

The Standard Maxwell Distribution


Definition

Suppose that Z1, Z2, and Z3 are independent random variables with standard normal distributions. The magnitude R = √(Z1² + Z2² + Z3²) of the vector (Z1, Z2, Z3) has the standard Maxwell distribution.

So in the context of the definition, (Z1 , Z2 , Z3 ) has the standard trivariate normal distribution. The Maxwell distribution is a
continuous distribution on [0, ∞).

Distribution Functions
In this discussion, we assume that R has the standard Maxwell distribution. The distribution function of R can be expressed in
terms of the standard normal distribution function Φ. Recall that Φ occurs so frequently that it is considered a special function in
mathematics.

R has distribution function G given by

G(x) = 2Φ(x) − √(2/π) x e^{−x²/2} − 1, x ∈ [0, ∞) (5.15.1)

Proof

R has probability density function g given by

g(x) = √(2/π) x² e^{−x²/2}, x ∈ [0, ∞) (5.15.4)

1. g increases and then decreases with mode at x = √2.
2. g is concave upward, then downward, then upward again, with inflection points at x1 = √((5 − √17)/2) ≈ 0.6622 and x2 = √((5 + √17)/2) ≈ 2.1358.

Proof

Open the Special Distribution Simulator and select the Maxwell distribution. Keep the default parameter value and note the
shape of the probability density function. Run the simulation 1000 times and compare the empirical density function to the
probability density function.

The quantile function has no simple closed-form expression.

Open the Special Distribution Calculator and select the Maxwell distribution. Keep the default parameter value. Find
approximate values of the median and the first and third quartiles.

Moments
Suppose again that R has the standard Maxwell distribution. The moment generating function of R , like the distribution function,
can be expressed in terms of the standard normal distribution function Φ.

R has moment generating function m given by

m(t) = E(e^{tR}) = √(2/π) t + 2(1 + t²) e^{t²/2} Φ(t), t ∈ R (5.15.5)

Proof

The mean and variance of R can be found from the moment generating function, but direct computations are also easy.

The mean and variance of R are

1. E(R) = 2√(2/π)
2. var(R) = 3 − 8/π
Proof

Numerically, E(R) ≈ 1.5958 and sd(R) ≈ 0.6734.

Open the Special Distribution Simulator and select the Maxwell distribution. Keep the default parameter value. Note the size
and location of the mean±standard deviation bar. Run the simulation 1000 times and compare the empirical mean and standard
deviation to the true mean and standard deviation.

The general moments of R can be expressed in terms of the gamma function Γ

For n ∈ N,

E(R^n) = (2^{n/2+1}/√π) Γ((n + 3)/2) (5.15.13)

Proof

Of course, the formula for the general moments gives an alternate derivation for the mean and variance above since Γ(2) = 1 and Γ(5/2) = 3√π/4. On the other hand, the moment generating function can also be used to derive the formula for the general moments. Finally, we give the skewness and kurtosis of R.

The skewness and kurtosis of R are

1. skew(R) = 2√2 (16 − 5π)/(3π − 8)^{3/2} ≈ 0.4857
2. kurt(R) = (15π² + 16π − 192)/(3π − 8)² ≈ 3.1082

Proof

Related Distributions
The fundamental connection between the standard Maxwell distribution and the standard normal distribution is given in the very definition of the standard Maxwell, as the distribution of the magnitude of a vector in R³ with independent, standard normal coordinates.

Connections to the chi-square distribution.

1. If R has the standard Maxwell distribution then R² has the chi-square distribution with 3 degrees of freedom.
2. If V has the chi-square distribution with 3 degrees of freedom then √V has the standard Maxwell distribution.
Proof

Equivalently, the Maxwell distribution is simply the chi distribution with 3 degrees of freedom.
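A minimal NumPy sketch of these connections, simulating a standard Maxwell variable as the magnitude of three independent standard normal coordinates; the sample size and seed are arbitrary.

import numpy as np

rng = np.random.default_rng(seed=42)
n = 100_000

# Magnitude of a standard trivariate normal vector.
z = rng.standard_normal((n, 3))
r = np.linalg.norm(z, axis=1)

# Compare with the exact mean and variance derived above.
print(r.mean(), 2.0 * np.sqrt(2.0 / np.pi))   # about 1.5958
print(r.var(), 3.0 - 8.0 / np.pi)             # about 0.4535

# R^2 should be chi-square with 3 degrees of freedom (mean 3, variance 6).
print((r ** 2).mean(), (r ** 2).var())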

The General Maxwell Distribution


Definition
The standard Maxwell distribution is generalized by adding a scale parameter.

If R has the standard Maxwell distribution and b ∈ (0, ∞) then X = bR has the Maxwell distribution with scale parameter b .

Equivalently, the Maxwell distribution is the distribution of the magnitude of a three-dimensional vector whose components are independent, identically distributed, mean 0 normal variables.

If U1, U2 and U3 are independent normal variables with mean 0 and standard deviation σ ∈ (0, ∞) then X = √(U1² + U2² + U3²) has the Maxwell distribution with scale parameter σ.

Proof

Distribution Functions
In this section, we assume that X has the Maxwell distribution with scale parameter b ∈ (0, ∞) . We can give the distribution
function of X in terms of the standard normal distribution function Φ.

X has distribution function F given by

F(x) = 2Φ(x/b) − (1/b)√(2/π) x exp(−x²/2b²) − 1, x ∈ [0, ∞) (5.15.15)

Proof

X has probability density function f given by

f(x) = (1/b³)√(2/π) x² exp(−x²/2b²), x ∈ [0, ∞) (5.15.16)

1. f increases and then decreases with mode at x = b√2.
2. f is concave upward, then downward, then upward again, with inflection points at x = b√((5 ± √17)/2).

Proof

Open the Special Distribution Simulator and select the Maxwell distribution. Vary the scale parameter and note the shape and
location of the probability density function. For various values of the scale parameter, run the simulation 1000 times and
compare the empirical density function to the probability density function.

Again, the quantile function does not have a simple, closed-form expression.

Open the Special Distribution Calculator and select the Maxwell distribution. For various values of the scale parameter,
compute the median and the first and third quartiles.

Moments
Again, we assume that X has the Maxwell distribution with scale parameter b ∈ (0, ∞). As before, the moment generating
function of X can be written in terms of the standard normal distribution function Φ.

X has moment generating function M given by

M(t) = E(e^{tX}) = √(2/π) bt + 2(1 + b²t²) exp(b²t²/2) Φ(bt), t ∈ R (5.15.17)

Proof

The mean and variance of X are

1. E(X) = 2b√(2/π)
2. var(X) = b²(3 − 8/π)

Proof

Open the Special Distribution Simulator and select the Maxwell distribution. Vary the scale parameter and note the size and
location of the mean±standard deviation bar. For various values of the scale parameter, run the simulation 1000 times and compare
the empirical mean and standard deviation to the true mean and standard deviation.

As before, the general moments can be expressed in terms of the gamma function Γ.

For n ∈ N,

E(X^n) = b^n (2^{n/2+1}/√π) Γ((n + 3)/2) (5.15.18)

Proof

Finally, the skewness and kurtosis are unchanged.

The skewness and kurtosis of X are

1. skew(X) = 2√2 (16 − 5π)/(3π − 8)^{3/2} ≈ 0.4857
2. kurt(X) = (15π² + 16π − 192)/(3π − 8)² ≈ 3.1082

Proof

Related Distributions
The fundamental connection between the Maxwell distribution and the normal distribution is given in the definition, and of course,
is the primary reason that the Maxwell distribution is special in the first place.
By construction, the Maxwell distribution is a scale family, and so is closed under scale transformations.

If X has the Maxwell distribution with scale parameter b ∈ (0, ∞) and if c ∈ (0, ∞) then cX has the Maxwell distribution
with scale parameter bc.
Proof

The Maxwell distribution is a generalized exponential distribution.

If X has the Maxwell distribution with scale parameter b ∈ (0, ∞) then X is a one-parameter exponential family with natural parameter −1/b² and natural statistic X²/2.

Proof

This page titled 5.15: The Maxwell Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

5.16: The Lévy Distribution

The Lévy distribution, named for the French mathematician Paul Lévy, is important in the study of Brownian motion, and is one of
only three stable distributions whose probability density function can be expressed in a simple, closed form.

The Standard Lévy Distribution


Definition

If Z has the standard normal distribution then U = 1/Z² has the standard Lévy distribution.

So the standard Lévy distribution is a continuous distribution on (0, ∞).

Distribution Functions
We assume that U has the standard Lévy distribution. The distribution function of U has a simple expression in terms of the
standard normal distribution function Φ, not surprising given the definition.

U has distribution function G given by

G(u) = 2[1 − Φ(1/√u)], u ∈ (0, ∞) (5.16.1)

Proof

Similarly, the quantile function of U has a simple expression in terms of the standard normal quantile function Φ⁻¹.

U has quantile function G⁻¹ given by

G⁻¹(p) = 1/[Φ⁻¹(1 − p/2)]², p ∈ [0, 1) (5.16.3)

The quartiles of U are

1. q1 = [Φ⁻¹(7/8)]⁻² ≈ 0.7557, the first quartile.
2. q2 = [Φ⁻¹(3/4)]⁻² ≈ 2.1980, the median.
3. q3 = [Φ⁻¹(5/8)]⁻² ≈ 9.8516, the third quartile.
Proof
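Because the quantile function involves only Φ⁻¹, these quartiles are easy to compute numerically. A minimal sketch, assuming SciPy is available (norm.ppf is SciPy's standard normal quantile function):

from scipy.stats import norm

# G^(-1)(p) = 1 / [Phi^(-1)(1 - p/2)]^2 for the standard Levy distribution.
def levy_quantile(p: float) -> float:
    return 1.0 / norm.ppf(1.0 - p / 2.0) ** 2

print(levy_quantile(0.25))   # about 0.7557, the first quartile
print(levy_quantile(0.50))   # about 2.1980, the median
print(levy_quantile(0.75))   # about 9.8516, the third quartile
print(levy_quantile(0.99))   # about 6400, a hint of the very heavy tail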

Open the Special Distribution Calculator and select the Lévy distribution. Keep the default parameter values. Note the shape
and location of the distribution function. Compute a few values of the distribution function and the quantile function.

Finally, the probability density function of U has a simple closed expression.

U has probability density function g given by

g(u) = (1/√(2π)) (1/u^{3/2}) exp(−1/(2u)), u ∈ (0, ∞) (5.16.4)

1. g increases and then decreases with mode at u = 1/3.
2. g is concave upward, then downward, then upward again, with inflection points at u = 1/3 − √10/15 ≈ 0.1225 and u = 1/3 + √10/15 ≈ 0.5442.
Proof

Open the Special Distribution Simulator and select the Lévy distribution. Keep the default parameter values. Note the shape of
the probability density function. Run the simulation 1000 times and compare the empirical density function to the probability
density function.

Moments
We assume again that U has the standard Lévy distribution. After exploring the graphs of the probability density function and
distribution function above, you probably noticed that the Lévy distribution has a very heavy tail. The 99th percentile is about
6400, for example. The following result is not surprising.

E(U ) = ∞

Proof

Of course, the higher-order moments are infinite as well, and the variance, skewness, and kurtosis do not exist. The moment
generating function is infinite at every positive value, and so is of no use. On the other hand, the characteristic function of the
standard Lévy distribution is very useful. For the following result, recall that the sign function sgn is given by sgn(t) = 1 for
t > 0 , sgn(t) = −1 for t < 0 , and sgn(0) = 0 .

U has characteristic function χ₀ given by

χ₀(t) = E(e^{itU}) = exp(−|t|^{1/2} [1 − i sgn(t)]), t ∈ R (5.16.9)

Related Distributions
The most important relationship is the one in the definition: if Z has the standard normal distribution then U = 1/Z² has the standard Lévy distribution. The following result is basically the converse.

If U has the standard Lévy distribution, then V = 1/√U has the standard half-normal distribution.
Proof

The General Lévy Distribution


Like so many other “standard distributions”, the standard Lévy distribution is generalized by adding location and scale parameters.

Definition
Suppose that U has the standard Lévy distribution, and a ∈ R and b ∈ (0, ∞) . Then X = a + bU has the Lévy distribution
with location parameter a and scale parameter b .

Note that X has a continuous distribution on the interval (a, ∞).

Distribution Functions
Suppose that X has the Lévy distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞). As before, the
distribution function of X has a simple expression in terms of the standard normal distribution function Φ.

X has distribution function F given by

F(x) = 2[1 − Φ(√(b/(x − a)))], x ∈ (a, ∞) (5.16.10)

Proof

Similarly, the quantile function of X has a simple expression in terms of the standard normal quantile function Φ⁻¹.

X has quantile function F⁻¹ given by

F⁻¹(p) = a + b/[Φ⁻¹(1 − p/2)]², p ∈ [0, 1) (5.16.11)

The quartiles of X are

1. q1 = a + b[Φ⁻¹(7/8)]⁻², the first quartile.

2. q2 = a + b[Φ⁻¹(3/4)]⁻², the median.
3. q3 = a + b[Φ⁻¹(5/8)]⁻², the third quartile.
Proof

Open the Special Distribution Calculator and select the Lévy distribution. Vary the parameter values and note the shape of the
graph of the distribution function function. For various values of the parameters, compute a few values of the distribution
function and the quantile function.

Finally, the probability density function of X has a simple closed expression.

X has probability density function f given by

f(x) = √(b/2π) (1/(x − a)^{3/2}) exp[−b/(2(x − a))], x ∈ (a, ∞) (5.16.12)

1. f increases and then decreases with mode at x = a + b/3.
2. f is concave upward, then downward, then upward again, with inflection points at x = a + (1/3 ± √10/15)b.

Proof

Open the Special Distribution Simulator and select the Lévy distribution. Vary the parameters and note the shape and location of
the probability density function. For various parameter values, run the simulation 1000 times and compare the empirical
density function to the probability density function.

Moments
Assume again that X has the Lévy distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞). Of course, since the
standard Lévy distribution has infinite mean, so does the general Lévy distribution.

E(X) = ∞

Also as before, the variance, skewness, and kurtosis of X are undefined. On the other hand, the characteristic function of X is very
important.

X has characteristic function χ given by

χ(t) = E(e^{itX}) = exp(ita − b^{1/2} |t|^{1/2} [1 − i sgn(t)]), t ∈ R (5.16.13)

Proof

Related Distributions
Since the Lévy distribution is a location-scale family, it is trivially closed under location-scale transformations.

Suppose that X has the Lévy distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞), and that c ∈ R and
d ∈ (0, ∞) . Then Y = c + dX has the Lévy distribution with location parameter c + ad and scale parameter bd .

Proof

Of more interest is the fact that the Lévy distribution is closed under convolution (corresponding to sums of independent variables).

Suppose that X1 and X2 are independent, and that Xk has the Lévy distribution with location parameter ak ∈ R and scale parameter bk ∈ (0, ∞) for k ∈ {1, 2}. Then X1 + X2 has the Lévy distribution with location parameter a1 + a2 and scale parameter (b1^{1/2} + b2^{1/2})².

Proof

As a corollary, the Lévy distribution is a stable distribution with index α = 1/2:

Suppose that n ∈ N+ and that (X1, X2, …, Xn) is a sequence of independent random variables, each having the Lévy distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞). Then X1 + X2 + ⋯ + Xn has the Lévy distribution with location parameter na and scale parameter n²b.

Stability is one of the reasons for the importance of the Lévy distribution. From the characteristic function, it follows that the
skewness parameter is β = 1 .

This page titled 5.16: The Lévy Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

5.17: The Beta Distribution

In this section, we will study the beta distribution, the most important distribution that has bounded support. But before we can
study the beta distribution we must study the beta function.

The Beta Function


Definition

The beta function B is defined as follows:


B(a, b) = ∫_0^1 u^{a−1} (1 − u)^{b−1} du; a, b ∈ (0, ∞) (5.17.1)

Proof that B is well defined

The beta function was first introduced by Leonhard Euler.

Properties
The beta function satisfies the following properties:
1. B(a, b) = B(b, a) for a, b ∈ (0, ∞), so B is symmetric.
2. B(a, 1) = 1/a for a ∈ (0, ∞)
3. B(1, b) = 1/b for b ∈ (0, ∞)

Proof

The beta function has a simple expression in terms of the gamma function:

If a, b ∈ (0, ∞) then

B(a, b) = Γ(a)Γ(b)/Γ(a + b) (5.17.4)

Proof
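Since the identity reduces the beta function to the gamma function, it gives a direct way to compute B numerically. Here is a minimal Python sketch using only the standard library; working with lgamma (the log of the gamma function) avoids overflow for large arguments.

import math

def beta(a: float, b: float) -> float:
    # B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b), computed on the log scale.
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))

print(beta(2.0, 3.0))   # 1/12, consistent with the factorial formula below
print(beta(0.5, 0.5))   # pi, as shown later in this subsection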

Recall that the gamma function is a generalization of the factorial function. Here is the corresponding result for the beta function:

If j, k ∈ N+ then

B(j, k) = (j − 1)!(k − 1)!/(j + k − 1)! (5.17.8)

Proof

Let's generalize this result. First, recall from our study of combinatorial structures that for a ∈ R and j ∈ N, the ascending power of base a and order j is

a^[j] = a(a + 1) ⋯ [a + (j − 1)] (5.17.9)

If a, b ∈ (0, ∞), and j, k ∈ N, then

B(a + j, b + k)/B(a, b) = a^[j] b^[k] / (a + b)^[j+k] (5.17.10)

Proof

B(1/2, 1/2) = π.
Proof

Figure 5.17.1 : The graph of B(a, b) on the square 0 < a < 5 , 0 < b < 5

The Incomplete Beta Function


The integral that defines the beta function can be generalized by changing the interval of integration from (0, 1) to (0, x) where
x ∈ [0, 1].

The incomplete beta function is defined as follows


B(x; a, b) = ∫_0^x u^{a−1} (1 − u)^{b−1} du, x ∈ (0, 1); a, b ∈ (0, ∞) (5.17.11)

Of course, the ordinary (complete) beta function is B(a, b) = B(1; a, b) for a, b ∈ (0, ∞) .

The Standard Beta Distribution


Distribution Functions
The beta distributions are a family of continuous distributions on the interval (0, 1).

The (standard) beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞) has probability density
function f given by
f(x) = (1/B(a, b)) x^{a−1} (1 − x)^{b−1}, x ∈ (0, 1) (5.17.12)

Of course, the beta function is simply the normalizing constant, so it's clear that f is a valid probability density function. If a ≥ 1 ,
f is defined at 0, and if b ≥ 1 , f is defined at 1. In these cases, it's customary to extend the domain of f to these endpoints. The

beta distribution is useful for modeling random probabilities and proportions, particularly in the context of Bayesian analysis. The
distribution has just two parameters and yet a rich variety of shapes (so in particular, both parameters are shape parameters).
Qualitatively, the first order properties of f depend on whether each parameter is less than, equal to, or greater than 1.

For a, b ∈ (0, ∞) with a + b ≠ 2, define

x0 = (a − 1)/(a + b − 2) (5.17.13)

1. If 0 < a < 1 and 0 < b < 1, f decreases and then increases with minimum value at x0 and with f(x) → ∞ as x ↓ 0 and as x ↑ 1.
2. If a = 1 and b = 1 , f is constant.
3. If 0 < a < 1 and b ≥ 1 , f is decreasing with f (x) → ∞ as x ↓ 0.
4. If a ≥ 1 and 0 < b < 1 , f is increasing with f (x) → ∞ as x ↑ 1.
5. If a = 1 and b > 1 , f is decreasing with mode at x = 0 .
6. If a > 1 and b = 1 , f is increasing with mode at x = 1 .
7. If a > 1 and b > 1, f increases and then decreases with mode at x0.

Proof

From part (b), note that the special case a = 1 and b = 1 gives the continuous uniform distribution on the interval (0, 1) (the
standard uniform distribution). Note also that when a < 1 or b < 1 , the probability density function is unbounded, and hence the

distribution has no mode. On the other hand, if a ≥ 1, b ≥ 1, and one of the inequalities is strict, the distribution has a unique mode at x0. The second order properties are more complicated.

For a, b ∈ (0, ∞) with a + b ∉ {2, 3} and (a − 1)(b − 1)(a + b − 3) ≥ 0, define

x1 = [(a − 1)(a + b − 3) − √((a − 1)(b − 1)(a + b − 3))] / [(a + b − 3)(a + b − 2)] (5.17.15)
x2 = [(a − 1)(a + b − 3) + √((a − 1)(b − 1)(a + b − 3))] / [(a + b − 3)(a + b − 2)] (5.17.16)

For a < 1 and a + b = 2 or for b < 1 and a + b = 2, define x1 = x2 = 1 − a/2.


1. If a ≤ 1 and b ≤ 1, or if a ≤ 1 and b ≥ 2, or if a ≥ 2 and b ≤ 1, f is concave upward.
2. If a ≤ 1 and 1 < b < 2, f is concave upward and then downward with inflection point at x1.
3. If 1 < a < 2 and b ≤ 1, f is concave downward and then upward with inflection point at x2.
4. If 1 < a ≤ 2 and 1 < b ≤ 2, f is concave downward.
5. If 1 < a ≤ 2 and b > 2, f is concave downward and then upward with inflection point at x2.
6. If a > 2 and 1 < b ≤ 2, f is concave upward and then downward with inflection point at x1.
7. If a > 2 and b > 2, f is concave upward, then downward, then upward again, with inflection points at x1 and x2.

Proof

In the special distribution simulator, select the beta distribution. Vary the parameters and note the shape of the beta density
function. For selected values of the parameters, run the simulation 1000 times and compare the empirical density function to
the true density function.

The special case a = 1/2, b = 1/2 is the arcsine distribution, with probability density function given by

f(x) = 1/(π√(x(1 − x))), x ∈ (0, 1) (5.17.18)

This distribution is important in a number of applications, and so the arcsine distribution is studied in a separate section.
The beta distribution function F can be easily expressed in terms of the incomplete beta function. As usual a denotes the left
parameter and b the right parameter.

The beta distribution function F with parameters a, b ∈ (0, ∞) is given by

F(x) = B(x; a, b)/B(a, b), x ∈ (0, 1) (5.17.19)

The distribution function F is sometimes known as the regularized incomplete beta function. In some special cases, the distribution function F and its inverse, the quantile function F⁻¹, can be computed in closed form, without resorting to special functions.

If a ∈ (0, ∞) and b = 1 then

1. F(x) = x^a for x ∈ (0, 1)
2. F⁻¹(p) = p^{1/a} for p ∈ (0, 1)

If a = 1 and b ∈ (0, ∞) then

1. F(x) = 1 − (1 − x)^b for x ∈ (0, 1)
2. F⁻¹(p) = 1 − (1 − p)^{1/b} for p ∈ (0, 1)

If a = b = 1/2 (the arcsine distribution) then

1. F(x) = (2/π) arcsin(√x) for x ∈ (0, 1)
2. F⁻¹(p) = sin²(πp/2) for p ∈ (0, 1)

There is an interesting relationship between the distribution functions of the beta distribution and the binomial distribution, when the beta parameters are positive integers. To state the relationship we need to embellish our notation to indicate the dependence on the parameters. Thus, let F_{a,b} denote the beta distribution function with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞), and let G_{n,p} denote the binomial distribution function with trial parameter n ∈ N+ and success parameter p ∈ (0, 1).

If j, k ∈ N+ and x ∈ (0, 1) then

F_{j,k}(x) = G_{j+k−1,1−x}(k − 1) (5.17.20)

Proof
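A quick numerical check of this identity, assuming SciPy is available; the parameter values are arbitrary.

from scipy.stats import beta, binom

j, k, x = 3, 5, 0.4   # arbitrary positive integers and x in (0, 1)

lhs = beta.cdf(x, j, k)                    # F_{j,k}(x)
rhs = binom.cdf(k - 1, j + k - 1, 1 - x)   # G_{j+k-1, 1-x}(k - 1)
print(lhs, rhs)   # the two values agree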

In the special distribution calculator, select the beta distribution. Vary the parameters and note the shape of the density function
and the distribution function. In each of the following cases, find the median, the first and third quartiles, and the interquartile
range. Sketch the boxplot.
1. a = 1 , b = 1
2. a = 1 , b = 3
3. a = 3 , b = 1
4. a = 2 , b = 4
5. a = 4 , b = 2
6. a = 4 , b = 4

Moments
The moments of the beta distribution are easy to express in terms of the beta function. As before, suppose that X has the beta
distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞).

If k ∈ [0, ∞) then

E(X^k) = B(a + k, b)/B(a, b) (5.17.24)

In particular, if k ∈ N then

E(X^k) = a^[k] / (a + b)^[k] (5.17.25)

Proof

From the general formula for the moments, it's straightforward to compute the mean, variance, skewness, and kurtosis.

The mean and variance of X are

E(X) = a/(a + b) (5.17.27)

var(X) = ab/[(a + b)²(a + b + 1)] (5.17.28)

Proof

Note that the variance depends on the parameters a and b only through the product ab and the sum a + b .

Open the special distribution simulator and select the beta distribution. Vary the parameters and note the size and location of
the mean±standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the sample
mean and standard deviation to the distribution mean and standard deviation.

The skewness and kurtosis of X are

skew(X) = 2(b − a)√(a + b + 1) / [(a + b + 2)√(ab)] (5.17.29)

kurt(X) = 3(a + b + 1)[2(a + b)² + ab(a + b − 6)] / [ab(a + b + 2)(a + b + 3)] (5.17.30)

Proof

In particular, note that the distribution is positively skewed if a < b, unskewed if a = b (the distribution is symmetric about x = 1/2 in this case), and negatively skewed if a > b.

Open the special distribution simulator and select the beta distribution. Vary the parameters and note the shape of the
probability density function in light of the previous result on skewness. For various values of the parameters, run the
simulation 1000 times and compare the empirical density function to the true probability density function.

Related Distributions
The beta distribution is related to a number of other special distributions.

If X has the beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞) then Y = 1 −X has the beta
distribution with left parameter b and right parameter a .
Proof

The beta distribution with right parameter 1 has a reciprocal relationship with the Pareto distribution.

Suppose that a ∈ (0, ∞) .


1. If X has the beta distribution with left parameter a and right parameter 1 then Y = 1/X has the Pareto distribution with
shape parameter a .
2. If Y has the Pareto distribution with shape parameter a then X = 1/Y has the beta distribution with left parameter a and
right parameter 1.
Proof

The following result gives a connection between the beta distribution and the gamma distribution.

Suppose that X has the gamma distribution with shape parameter a ∈ (0, ∞) and rate parameter r ∈ (0, ∞), Y has the
gamma distribution with shape parameter b ∈ (0, ∞) and rate parameter r, and that X and Y are independent. Then
V = X/(X + Y ) has the beta distribution with left parameter a and right parameter b .

Proof

The following result gives a connection between the beta distribution and the F distribution. This connection is a minor variation
of the previous result.

If X has the F distribution with n ∈ (0, ∞) degrees of freedom in the numerator and d ∈ (0, ∞) degrees of freedom in the
denominator then
Y = (n/d)X / [1 + (n/d)X] (5.17.38)

has the beta distribution with left parameter a = n/2 and right parameter b = d/2 .
Proof

Our next result is that the beta distribution is a member of the general exponential family of distributions.

Suppose that X has the beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞). Then the distribution
is a two-parameter exponential family with natural parameters a − 1 and b − 1 , and natural statistics ln(X) and ln(1 − X) .
Proof

The beta distribution is also the distribution of the order statistics of a random sample from the standard uniform distribution.

5.17.5 https://stats.libretexts.org/@go/page/10357
Suppose n ∈ N+ and that (X1, X2, …, Xn) is a sequence of independent variables, each with the standard uniform distribution. For k ∈ {1, 2, …, n}, the k-th order statistic X(k) has the beta distribution with left parameter a = k and right parameter b = n − k + 1.
Proof
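A minimal NumPy sketch of this result: each row below is a sorted sample of n standard uniform variables, so column k − 1 holds the k-th order statistic, whose sample mean should be close to the beta mean a/(a + b) = k/(n + 1). The values of n, k, the sample size, and the seed are arbitrary.

import numpy as np

rng = np.random.default_rng(seed=42)
n, k = 5, 2   # arbitrary sample size and order statistic index

samples = np.sort(rng.random((100_000, n)), axis=1)
x_k = samples[:, k - 1]   # the k-th order statistic of each row

a, b = k, n - k + 1       # beta parameters for the k-th order statistic
print(x_k.mean(), a / (a + b))   # both about k/(n + 1) = 1/3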

One of the most important properties of the beta distribution, and one of the main reasons for its wide use in statistics, is that it
forms a conjugate family for the success probability in the binomial and negative binomial distributions.

Suppose that P is a random probability having the beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞). Suppose also that X is a random variable such that the conditional distribution of X given P = p ∈ (0, 1) is binomial with trial parameter n ∈ N+ and success parameter p. Then the conditional distribution of P given X = k is beta with left parameter a + k and right parameter b + n − k.


Proof
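The conjugate updating is a one-line computation. Here is a minimal sketch, assuming NumPy; the prior parameters, trial count, and true success probability are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(seed=42)

a, b, n = 2.0, 3.0, 10        # prior beta parameters and binomial trial count
p_true = 0.6                  # "unknown" success probability, used only to simulate data
x = rng.binomial(n, p_true)   # observed number of successes

# Posterior: beta with left parameter a + x and right parameter b + (n - x).
a_post, b_post = a + x, b + (n - x)
print(a_post, b_post, a_post / (a_post + b_post))   # posterior mean of P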

Suppose again that P is a random probability having the beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞). Suppose also that N is a random variable such that the conditional distribution of N given P = p ∈ (0, 1) is negative binomial with stopping parameter k ∈ N+ and success parameter p. Then the conditional distribution of P given N = n is beta with left parameter a + k and right parameter b + n − k.

Proof

In both cases, note that in the posterior distribution of P, the left parameter is increased by the number of successes and the right parameter by the number of failures. For more on this, see the section on Bayesian estimation in the chapter on point estimation.

The General Beta Distribution


The beta distribution can be easily generalized from the support interval (0, 1) to an arbitrary bounded interval using a linear
transformation. Thus, this generalization is simply the location-scale family associated with the standard beta distribution.

Suppose that Z has the standard beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞). For c ∈ R and d ∈ (0, ∞), the random variable X = c + dZ has the beta distribution with left parameter a, right parameter b, location parameter c and scale parameter d.

For the remainder of this discussion, suppose that X has the distribution in the definition above.

X has probability density function

f(x) = (1/(B(a, b) d^{a+b−1})) (x − c)^{a−1} (c + d − x)^{b−1}, x ∈ (c, c + d) (5.17.44)

Proof

Most of the results in the previous sections have simple extensions to the general beta distribution.

The mean and variance of X are

1. E(X) = c + d a/(a + b)
2. var(X) = d² ab/[(a + b)²(a + b + 1)]

Proof

Recall that skewness and kurtosis are defined in terms of standard scores, and hence are unchanged under location-scale transformations. Hence the skewness and kurtosis of X are just as for the standard beta distribution.

This page titled 5.17: The Beta Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

5.18: The Beta Prime Distribution

Basic Theory
The beta prime distribution is the distribution of the odds ratio associated with a random variable with the beta distribution. Since variables with
beta distributions are often used to model random probabilities and proportions, the corresponding odds ratios occur naturally as well.

Definition
Suppose that U has the beta distribution with shape parameters a, b ∈ (0, ∞) . Random variable X = U /(1 − U ) has the beta prime
distribution with shape parameters a and b .

The special case a = b = 1 is known as the standard beta prime distribution. Since U has a continuous distribution on the interval (0, 1) ,
random variable X has a continuous distribution on the interval (0, ∞).
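Since the definition is a simple transformation of a beta variable, the beta prime distribution is easy to simulate. Here is a minimal NumPy sketch (shape parameters, sample size, and seed arbitrary), checked against the mean and variance formulas derived later in this section.

import numpy as np

rng = np.random.default_rng(seed=42)
a, b = 3.0, 4.0   # arbitrary shape parameters with b > 2

u = rng.beta(a, b, size=100_000)
x = u / (1.0 - u)   # odds ratio: beta prime with parameters a and b

# For b > 1, E(X) = a/(b - 1); for b > 2, var(X) = a(a + b - 1)/[(b - 1)^2 (b - 2)].
print(x.mean(), a / (b - 1))
print(x.var(), a * (a + b - 1) / ((b - 1) ** 2 * (b - 2)))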

Distribution Functions
Suppose that X has the beta prime distribution with shape parameters a, b ∈ (0, ∞) , and as usual, let B denote the beta function.

X has probability density function f given by


f(x) = (1/B(a, b)) x^{a−1}/(1 + x)^{a+b}, x ∈ (0, ∞) (5.18.1)

Proof

If a ≥ 1, the probability density function is defined at x = 0, so in this case, it's customary to add this endpoint to the domain. In particular, for the standard beta prime distribution,

f(x) = 1/(1 + x)², x ∈ [0, ∞) (5.18.4)

Qualitatively, the first order properties of the probability density function f depend only on a , and in particular on whether a is less than, equal
to, or greater than 1.

The probability density function f satisfies the following properties:


1. If 0 < a < 1 , f is decreasing with f (x) → ∞ as x ↓ 0.
2. If a = 1 , f is decreasing with mode at x = 0 .
3. If a > 1 , f increases and then decreases with mode at x = (a − 1)/(b + 1) .
Proof

Qualitatively, the second order properties of f also depend only on a , with transitions at a = 1 and a = 2 .

For a > 1, define

x1 = [(a − 1)(b + 2) − √((a − 1)(b + 2)(a + b))] / [(b + 1)(b + 2)] (5.18.6)
x2 = [(a − 1)(b + 2) + √((a − 1)(b + 2)(a + b))] / [(b + 1)(b + 2)] (5.18.7)

The probability density function f satisfies the following properties:


1. If 0 < a ≤ 1, f is concave upward.
2. If 1 < a ≤ 2, f is concave downward and then upward, with inflection point at x2.
3. If a > 2, f is concave upward, then downward, then upward again, with inflection points at x1 and x2.

Proof

Open the Special Distribution Simulator and select the beta prime distribution. Vary the parameters and note the shape of the probability
density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical density function to the
probability density function.

Because of the definition of the beta prime variable, the distribution function of X has a simple expression in terms of the beta distribution
function with the same parameters, which in turn is the regularized incomplete beta function. So let G denote the distribution function of the
beta distribution with parameters a, b ∈ (0, ∞), and recall that
G(x) = B(x; a, b)/B(a, b), x ∈ (0, 1) (5.18.9)

X has distribution function F given by


F(x) = G(x/(x + 1)), x ∈ [0, ∞) (5.18.10)

Proof

Similarly, the quantile function of X has a simple expression in terms of the beta quantile function G⁻¹ with the same parameters.

X has quantile function F⁻¹ given by

F⁻¹(p) = G⁻¹(p)/[1 − G⁻¹(p)], p ∈ [0, 1) (5.18.12)

Proof

Open the Special Distribution Calculator and choose the beta prime distribution. Vary the parameters and note the shape of the distribution
function. For selected values of the parameters, find the median and the first and third quartiles.

For certain values of the parameters, the distribution and quantile functions have simple, closed form expressions.

If a ∈ (0, ∞) and b = 1 then

1. F(x) = (x/(x + 1))^a for x ∈ [0, ∞)
2. F⁻¹(p) = p^{1/a}/(1 − p^{1/a}) for p ∈ [0, 1)

Proof

If a = 1 and b ∈ (0, ∞) then

1. F(x) = 1 − (1/(x + 1))^b for x ∈ [0, ∞)
2. F⁻¹(p) = [1 − (1 − p)^{1/b}]/(1 − p)^{1/b} for p ∈ [0, 1)

Proof

If a = b = 1/2 then

1. F(x) = (2/π) arcsin(√(x/(x + 1))) for x ∈ [0, ∞)
2. F⁻¹(p) = sin²(πp/2)/[1 − sin²(πp/2)] for p ∈ [0, 1)

Proof

When a = b = 1/2, X is the odds ratio for a variable with the standard arcsine distribution.

Moments
As before, X denotes a random variable with the beta prime distribution, with parameters a, b ∈ (0, ∞) . The moments of X have a simple
expression in terms of the beta function.

If t ∈ (−a, b) then

E(X^t) = B(a + t, b − t)/B(a, b) (5.18.13)

If t ∈ (−∞, −a] ∪ [b, ∞) then E(X^t) = ∞.
Proof

Of course, we are usually most interested in the integer moments of X. Recall that for x ∈ R and n ∈ N, the rising power of x of order n is x^[n] = x(x + 1) ⋯ (x + n − 1).

Suppose that n ∈ N. If n < b then

E(X^n) = ∏_{k=1}^{n} (a + k − 1)/(b − k) (5.18.15)

If n ≥ b then E(X^n) = ∞.
Proof

As a corollary, we have the mean and variance.

If b > 1 then

E(X) = a/(b − 1) (5.18.17)

If b > 2 then

var(X) = a(a + b − 1)/[(b − 1)²(b − 2)] (5.18.18)

Proof

Open the Special Distribution Simulator and select the beta prime distribution. Vary the parameters and note the size and location of the
mean±standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the empirical mean and
standard deviation to the distribution mean and standard deviation.

Finally, the general moment result leads to the skewness and kurtosis of X.

If b > 3 then

skew(X) = [2(2a + b − 1)/(b − 3)] √((b − 2)/(a(a + b − 1))) (5.18.19)

Proof

In particular, the distribution is positively skewed for all a > 0 and b > 3.

If b > 4 then

kurt(X) = 3 + 6[a(5b − 11)(a + b − 1) + (b − 1)²(b − 2)] / [a(a + b − 1)(b − 3)(b − 4)] (5.18.20)

Proof

Related Distributions
The most important connection is the one between the beta prime distribution and the beta distribution given in the definition. We repeat this for
emphasis.

Suppose that a, b ∈ (0, ∞) .


1. If U has the beta distribution with parameters a and b , then X = U /(1 − U ) has the beta prime distribution with parameters a and b .
2. If X has the beta prime distribution with parameters a and b , then U = X/(X + 1) has the beta distribution with parameters a and b .

The beta prime family is closed under the reciprocal transformation.

If X has the beta prime distribution with parameters a, b ∈ (0, ∞) then 1/X has the beta prime distribution with parameters b and a .
Proof

The beta prime distribution is closely related to the F distribution by a simple scale transformation.

Connections with the F distributions.

1. If X has the beta prime distribution with parameters a, b ∈ (0, ∞) then Y = (b/a)X has the F distribution with 2a degrees of freedom in the numerator and 2b degrees of freedom in the denominator.
2. If Y has the F distribution with n ∈ (0, ∞) degrees of freedom in the numerator and d ∈ (0, ∞) degrees of freedom in the denominator, then X = (n/d)Y has the beta prime distribution with parameters n/2 and d/2.

Proof

The beta prime is the distribution of the ratio of independent variables with standard gamma distributions. (Recall that standard here means that
the scale parameter is 1.)

Suppose that Y and Z are independent and have standard gamma distributions with shape parameters a ∈ (0, ∞) and b ∈ (0, ∞) ,
respectively. Then X = Y /Z has the beta prime distribution with parameters a and b .
Proof

The standard beta prime distribution is the same as the standard log-logistic distribution.
Proof

Finally, the beta prime distribution is a member of the general exponential family of distributions.

Suppose that X has the beta prime distribution with parameters a, b ∈ (0, ∞). Then X has a two-parameter general exponential
distribution with natural parameters a − 1 and −(a + b) and natural statistics ln(X) and ln(1 + X) .
Proof

This page titled 5.18: The Beta Prime Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

5.19: The Arcsine Distribution

The arcsine distribution is important in the study of Brownian motion and prime numbers, among other applications.

The Standard Arcsine Distribution


Distribution Functions

The standard arcsine distribution is a continuous distribution on the interval (0, 1) with probability density function g given by
g(x) = 1/(π√(x(1 − x))), x ∈ (0, 1) (5.19.1)

Proof

The occurrence of the arcsine function in the proof that g is a probability density function explains the name.

The standard arcsine probability density function g satisfies the following properties:
1. g is symmetric about x = 1/2.
2. g decreases and then increases with minimum value at x = 1/2.
3. g is concave upward.
4. g(x) → ∞ as x ↓ 0 and as x ↑ 1.
Proof

In particular, the standard arcsine distribution is U-shaped and has no mode.

Open the Special Distribution Simulator and select the arcsine distribution. Keep the default parameter values and note the
shape of the probability density function. Run the simulation 1000 times and compare the empirical density function to the
probability density function.

The distribution function has a simple expression in terms of the arcsine function, again justifying the name of the distribution.

The standard arcsine distribution function G is given by G(x) = (2/π) arcsin(√x) for x ∈ [0, 1].
Proof

Not surprisingly, the quantile function has a simple expression in terms of the sine function.

The standard arcsine quantile function G⁻¹ is given by G⁻¹(p) = sin²(πp/2) for p ∈ [0, 1]. In particular, the quartiles are

1. q1 = sin²(π/8) = (1/4)(2 − √2) ≈ 0.1464, the first quartile
2. q2 = 1/2, the median
3. q3 = sin²(3π/8) = (1/4)(2 + √2) ≈ 0.8536, the third quartile
Proof

Open the Special Distribution Calculator and select the arcsine distribution. Keep the default parameter values and note the
shape of the distribution function. Compute selected values of the distribution function and the quantile function.

Moments
Suppose that random variable Z has the standard arcsine distribution. First we give the mean and variance.

The mean and variance of Z are

1. E(Z) = 1/2
2. var(Z) = 1/8

Proof

Open the Special Distribution Simulator and select the arcsine distribution. Keep the default parameter values. Run the
simulation 1000 times and compare the empirical mean and standard deviation to the true mean and standard deviation.

The general moments about 0 can be expressed as products.

For n ∈ N,

E(Z^n) = ∏_{j=0}^{n−1} (2j + 1)/(2j + 2) (5.19.8)

Proof

Of course, the moments can be used to give a formula for the moment generating function, but this formula is not particularly
helpful since it is not in closed form.

Z has moment generating function m given by

m(t) = E(e^{tZ}) = ∑_{n=0}^{∞} (∏_{j=0}^{n−1} (2j + 1)/(2j + 2)) t^n/n!, t ∈ R (5.19.10)

Finally we give the skewness and kurtosis.

The skewness and kurtosis of Z are

1. skew(Z) = 0
2. kurt(Z) = 3/2

Proof

Related Distributions
As noted earlier, the standard arcsine distribution is a special case of the beta distribution.

The standard arcsine distribution is the beta distribution with left parameter 1/2 and right parameter 1/2.
Proof

Since the quantile function is in closed form, the standard arcsine distribution can be simulated by the random quantile method.

Connections with the standard uniform distribution.

1. If U has the standard uniform distribution (a random number) then X = sin²(πU/2) has the standard arcsine distribution.
2. If X has the standard arcsine distribution then U = (2/π) arcsin(√X) has the standard uniform distribution.
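A minimal NumPy sketch of the random quantile method for the standard arcsine distribution; the sample size and seed are arbitrary.

import numpy as np

rng = np.random.default_rng(seed=42)
u = rng.random(100_000)

# Random quantile method: X = sin^2(pi U / 2) has the standard arcsine distribution.
x = np.sin(np.pi * u / 2.0) ** 2

# Compare with the exact mean 1/2 and variance 1/8 given above.
print(x.mean(), x.var())
print(np.median(x))   # about 1/2, by symmetry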

Open the random quantile simulator and select the arcsine distribution. Keep the default parameters. Run the experiment 1000
times and compare the empirical probability density function, mean, and standard deviation to their distributional counterparts.
Note how the random quantiles simulate the distribution.

The following exercise illustrates the connection between the Brownian motion process and the standard arcsine distribution.

Open the Brownian motion simulator. Keep the default time parameter and select the last zero random variable. Note that this
random variable has the standard arcsine distribution. Run the experiment 1000 times and compare the empirical probability
density function, mean, and standard deviation to their distributional counterparts. Note how the last zero simulates the
distribution.

The General Arcsine Distribution


The standard arcsine distribution is generalized by adding location and scale parameters.

Definition
If Z has the standard arcsine distribution, and if a ∈ R and w ∈ (0, ∞), then X = a + wZ has the arcsine distribution with location parameter a and scale parameter w.

So X has a continuous distribution on the interval (a, a + w).

Distribution Functions
Suppose that X has the arcsine distribution with location parameter a ∈ R and scale parameter w ∈ (0, ∞) .

X has probability density function f given by

f(x) = 1/(π√((x − a)(a + w − x))), x ∈ (a, a + w) (5.19.12)

1. f is symmetric about a + (1/2)w.
2. f decreases and then increases with minimum value at x = a + (1/2)w.
3. f is concave upward.
4. f(x) → ∞ as x ↓ a and as x ↑ a + w.
Proof

An alternate parameterization of the general arcsine distribution is by the endpoints of the support interval: the left endpoint
(location parameter) a and the right endpoint b = a + w .

Open the Special Distribution Simulator and select the arcsine distribution. Vary the location and scale parameters and note the
shape and location of the probability density function. For selected values of the parameters, run the simulation 1000 times and
compare the empirical density function to the probability density function.

Once again, the distribution function has a simple representation in terms of the arcsine function.

X has distribution function F given by

F(x) = (2/π) arcsin(√((x − a)/w)), x ∈ [a, a + w] (5.19.13)

Proof

As before, the quantile function has a simple representation in terms of the sine function.

X has quantile function F⁻¹ given by F⁻¹(p) = a + w sin²(πp/2) for p ∈ [0, 1]. In particular, the quartiles of X are

1. q1 = a + w sin²(π/8) = a + (1/4)(2 − √2)w, the first quartile
2. q2 = a + (1/2)w, the median
3. q3 = a + w sin²(3π/8) = a + (1/4)(2 + √2)w, the third quartile
Proof

Open the Special Distribution Calculator and select the arcsine distribution. Vary the parameters and note the shape and
location of the distribution function. For various values of the parameters, compute selected values of the distribution function
and the quantile function.

Moments
Again, we assume that X has the arcsine distribution with location parameter a ∈ R and scale parameter w ∈ (0, ∞) . First we give
the mean and variance.

The mean and variance of X are

1. E(X) = a + (1/2)w
2. var(X) = (1/8)w²

Proof

Open the Special Distribution Simulator and select the arcsine distribution. Vary the parameters and note the size and location
of the mean±standard deviation bar. For various values of the parameters, run the simulation 1000 times and compare the
empirical mean and standard deviation to the true mean and standard deviation.

The moments of X can be obtained from the moments of Z , but the results are messy, except when the location parameter is 0.

Suppose the location parameter a = 0. For n ∈ N,

E(X^n) = w^n ∏_{j=0}^{n−1} (2j + 1)/(2j + 2) (5.19.14)

Proof

The moment generating function can be expressed as a series with product coefficients, and so is not particularly helpful.

X has moment generating function M given by

M(t) = E(e^{tX}) = e^{at} ∑_{n=0}^{∞} (∏_{j=0}^{n−1} (2j + 1)/(2j + 2)) w^n t^n/n!, t ∈ R (5.19.15)

Proof

Finally, the skewness and kurtosis are unchanged.

The skewness and kurtosis of X are

1. skew(X) = 0
2. kurt(X) = 3/2

Proof

Related Distributions
By construction, the general arcsine distribution is a location-scale family, and so is closed under location-scale transformations.

If X has the arcsine distribution with location parameter a ∈ R and scale parameter w ∈ (0, ∞) and if c ∈ R and d ∈ (0, ∞) then c + dX has the arcsine distribution with location parameter c + ad and scale parameter dw.
Proof

Since the quantile function is in closed form, the arcsine distribution can be simulated by the random quantile method.

Suppose that a ∈ R and w ∈ (0, ∞).

1. If U has the standard uniform distribution (a random number) then X = a + w sin²(πU/2) has the arcsine distribution with location parameter a and scale parameter w.
2. If X has the arcsine distribution with location parameter a and scale parameter w then U = (2/π) arcsin(√((X − a)/w)) has the standard uniform distribution.

Open the random quantile simulator and select the arcsine distribution. Vary the parameters and note the location and shape of
the probability density function. For selected parameter values, run the experiment 1000 times and compare the empirical
probability density function, mean, and standard deviation to their distributional counterparts. Note how the random quantiles
simulate the distribution.

The following exercise illustrates the connection between the Brownian motion process and the arcsine distribution.

Open the Brownian motion simulator and select the last zero random variable. Vary the time parameter t and note that the last
zero has the arcsine distribution on the interval (0, t). Run the experiment 1000 times and compare the empirical probability

density function, mean, and standard deviation to their distributional counterparts. Note how the last zero simulates the
distribution.

This page titled 5.19: The Arcsine Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

5.20: General Uniform Distributions

This section explores uniform distributions in an abstract setting. If you are a new student of probability, or are not familiar with
measure theory, you may want to skip this section and read the sections on the uniform distribution on an interval and the discrete
uniform distributions.

Basic Theory
Definition
Suppose that (S, S , λ) is a measure space. That is, S is a set, S a σ-algebra of subsets of S , and λ a positive measure on S .
Suppose also that 0 < λ(S) < ∞ , so that λ is a finite, positive measure.

Random variable X with values in S has the uniform distribution on S (with respect to λ ) if
P(X ∈ A) = λ(A)/λ(S),  A ∈ S    (5.20.1)

Thus, the probability assigned to a set A ∈ S depends only on the size of A (as measured by λ ).

The most common special cases are as follows:


1. Discrete: The set S is finite and non-empty, S is the σ-algebra of all subsets of S , and λ = # (counting measure).
2. Euclidean: For n ∈ N₊, let R_n denote the σ-algebra of Borel measurable subsets of R^n and let λ_n denote Lebesgue measure on (R^n, R_n). In this setting, S ∈ R_n with 0 < λ_n(S) < ∞ , S = {A ∈ R_n : A ⊆ S} , and the measure is λ_n restricted to (S, S ).

In the Euclidean case, recall that λ₁ is length measure on R, λ₂ is area measure on R², λ₃ is volume measure on R³, and in
general λ_n is sometimes referred to as n-dimensional volume. Thus, S ∈ R_n is a set with positive, finite volume.

Properties
Suppose (S, S , λ) is a finite, positive measure space, as above, and that X is uniformly distributed on S .

The probability density function f of X (with respect to λ ) is

f(x) = 1/λ(S),  x ∈ S    (5.20.2)

Proof

Thus, the defining property of the uniform distribution on a set is constant density on that set. Another basic property is that
uniform distributions are preserved under conditioning.

Suppose that R ∈ S with λ(R) > 0 . The conditional distribution of X given X ∈ R is uniform on R .
Proof

In the setting of the previous result, suppose that X = (X₁, X₂, …) is a sequence of independent variables, each uniformly
distributed on S . Let N = min{n ∈ N₊ : X_n ∈ R} . Then N has the geometric distribution on N₊ with success parameter
p = P(X ∈ R) . More importantly, the distribution of X_N is the same as the conditional distribution of X given X ∈ R , and hence
is uniform on R . This is the basis of the rejection method of simulation. If we can simulate a uniform distribution on S , then we
can simulate a uniform distribution on R .
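As a concrete sketch of this idea in Python (numpy assumed), take S to be the square [−1, 1]² with area measure and R the inscribed unit disk; points generated uniformly on the square are accepted only when they fall in the disk.

    import numpy as np

    rng = np.random.default_rng()

    def uniform_on_disk():
        # Generate points uniform on the square S = [-1, 1]^2 and return
        # the first one that falls in the disk R = {(x, y) : x^2 + y^2 < 1}.
        # The number of trials N is geometric with p = λ(R)/λ(S) = π/4.
        while True:
            x, y = rng.uniform(-1.0, 1.0, 2)
            if x * x + y * y < 1.0:
                return x, y

    print(uniform_on_disk())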
If h is a real-valued function on S , then E[h(X)] is the average value of h on S , as measured by λ :

If h : S → R is integrable with respect to λ, then

E[h(X)] = (1/λ(S)) ∫_S h(x) dλ(x)    (5.20.5)

Proof

The entropy of the uniform distribution on S depends only on the size of S , as measured by λ :

The entropy of X is H (X) = ln[λ(S)] .


Proof

Product Spaces
Suppose now that (S, S , λ) and (T , T , μ) are finite, positive measure spaces, so that 0 < λ(S) < ∞ and 0 < μ(T ) < ∞ . Recall
the product space (S × T , S ⊗ T , λ ⊗ μ) . The product σ-algebra S ⊗ T is the σ-algebra of subsets of S × T generated by
product sets A × B where A ∈ S and B ∈ T . The product measure λ ⊗ μ is the unique positive measure on (S × T , S ⊗ T )
that satisfies (λ ⊗ μ)(A × B) = λ(A)μ(B) for A ∈ S and B ∈ T .

(X, Y ) is uniformly distributed on S × T if and only if X is uniformly distributed on S , Y is uniformly distributed on T , and
X and Y are independent.
Proof

This page titled 5.20: General Uniform Distributions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

5.21: The Uniform Distribution on an Interval

The continuous uniform distribution on an interval of R is one of the simplest of all probability distributions, but nonetheless very
important. In particular, continuous uniform distributions are the basic tools for simulating other probability distributions. The
uniform distribution corresponds to picking a point at random from the interval. The uniform distribution on an interval is a special
case of the general uniform distribution with respect to a measure, in this case Lebesgue measure (length measure) on R.

The Standard Uniform Distribution


Definition

The continuous uniform distribution on the interval [0, 1] is known as the standard uniform distribution. Thus if U has the
standard uniform distribution then
P(U ∈ A) = λ(A) (5.21.1)

for every (Borel measurable) subset A of [0, 1], where λ is Lebesgue (length) measure.

A simulation of a random variable with the standard uniform distribution is known in computer science as a random number. All
programming languages have functions for computing random numbers, as do calculators, spreadsheets, and mathematical and
statistical software packages.

Distribution Functions
Suppose that U has the standard uniform distribution. By definition, the probability density function is constant on [0, 1].

U has probability density function g given by g(u) = 1 for u ∈ [0, 1].

Since the density function is constant, the mode is not meaningful.

Open the Special Distribution Simulator and select the continuous uniform distribution. Keep the default parameter values.
Run the simulation 1000 times and compare the empirical density function to the probability density function.

The distribution function is simply the identity function on [0, 1].

U has distribution function G given by G(u) = u for u ∈ [0, 1].


Proof

The quantile function is the same as the distribution function.

U has quantile function G⁻¹ given by G⁻¹(p) = p for p ∈ [0, 1]. The quartiles are

1. q₁ = 1/4, the first quartile
2. q₂ = 1/2, the median
3. q₃ = 3/4, the third quartile
Proof

Open the Special Distribution Calculator and select the continuous uniform distribution. Keep the default parameter values.
Compute a few values of the distribution function and the quantile function.

Moments
Suppose again that U has the standard uniform distribution. The moments (about 0) are simple.

For n ∈ N ,

E(U^n) = 1/(n + 1)    (5.21.2)

Proof

The mean and variance follow easily from the general moment formula.

The mean and variance of U are

1. E(U) = 1/2
2. var(U) = 1/12

Open the Special Distribution Simulator and select the continuous uniform distribution. Keep the default parameter values.
Run the simulation 1000 times and compare the empirical mean and standard deviation to the true mean and standard
deviation.

Next are the skewness and kurtosis.

The skewness and kurtosis of U are

1. skew(U) = 0
2. kurt(U) = 9/5

Proof

Thus, the excess kurtosis is kurt(U) − 3 = −6/5.

Finally, we give the moment generating function.

The moment generating function m of U is given by m(0) = 1 and


m(t) = (e^t − 1)/t,  t ∈ R ∖ {0}    (5.21.5)

Proof

Related Distributions
The standard uniform distribution is connected to every other probability distribution on R by means of the quantile function of the
other distribution. When the quantile function has a simple closed form expression, this result forms the primary method of
simulating the other distribution with a random number.

Suppose that F is the distribution function for a probability distribution on R, and that F⁻¹ is the corresponding quantile
function. If U has the standard uniform distribution, then X = F⁻¹(U) has distribution function F .

Proof
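As an illustration, here is a minimal Python sketch of this result (numpy assumed), using the exponential distribution with rate 1, whose quantile function F⁻¹(p) = −ln(1 − p) has a simple closed form.

    import numpy as np

    rng = np.random.default_rng()

    def exp_quantile(p):
        # Quantile function of the exponential distribution with rate 1.
        return -np.log(1.0 - p)

    u = rng.uniform(0.0, 1.0, 100_000)  # random numbers
    x = exp_quantile(u)                 # X = F^{-1}(U) has distribution function F
    print(x.mean())                     # should be close to 1, the true mean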

Open the Random Quantile Experiment. For each distribution, run the simulation 1000 times and compare the empirical
density function to the probability density function of the selected distribution. Note how the random quantiles simulate the
distribution.

For a continuous distribution on an interval of R, the connection goes the other way.

Suppose that X has a continuous distribution on an interval I ⊆R , with distribution function F . Then U = F (X) has the
standard uniform distribution.
Proof

The standard uniform distribution is a special case of the beta distribution.

The beta distribution with left parameter a = 1 and right parameter b = 1 is the standard uniform distribution.
Proof

The standard uniform distribution is also the building block of the Irwin-Hall distributions.

The Uniform Distribution on a General Interval
Definition
The standard uniform distribution is generalized by adding location-scale parameters.

Suppose that U has the standard uniform distribution. For a ∈ R and w ∈ (0, ∞) , the random variable X = a + wU has the
uniform distribution with location parameter a and scale parameter w.

Distribution Functions
Suppose that X has the uniform distribution with location parameter a ∈ R and scale parameter w ∈ (0, ∞) .

X has probability density function f given by f (x) = 1/w for x ∈ [a, a + w] .


Proof

The last result shows that X really does have a uniform distribution, since the probability density function is constant on the
support interval. Moreover, we can clearly parameterize the distribution by the endpoints of this interval, namely a and b = a + w ,
rather than by the location and scale parameters a and w. In fact, the distribution is more commonly known as the uniform distribution
on the interval [a, b]. Nonetheless, it is useful to know that the distribution is the location-scale family associated with the standard
uniform distribution. In terms of the endpoint parameterization,

f(x) = 1/(b − a),  x ∈ [a, b]    (5.21.10)

Open the Special Distribution Simulator and select the uniform distribution. Vary the location and scale parameters and note
the graph of the probability density function. For selected values of the parameters, run the simulation 1000 times and compare
the empirical density function to the probability density function.

X has distribution function F given by

F(x) = (x − a)/w,  x ∈ [a, a + w]    (5.21.11)

Proof

In terms of the endpoint parameterization,

F(x) = (x − a)/(b − a),  x ∈ [a, b]    (5.21.12)

X has quantile function F⁻¹ given by F⁻¹(p) = a + pw = (1 − p)a + pb for p ∈ [0, 1]. The quartiles are

1. q₁ = a + (1/4)w = (3/4)a + (1/4)b , the first quartile
2. q₂ = a + (1/2)w = (1/2)a + (1/2)b , the median
3. q₃ = a + (3/4)w = (1/4)a + (3/4)b , the third quartile
Proof

Open the Special Distribution Calculator and select the uniform distribution. Vary the parameters and note the graph of the
distribution function. For selected values of the parameters, compute a few values of the distribution function and the quantile
function.

Moments
Again we assume that X has the uniform distribution on the interval [a, b] where a, b ∈ R and a < b . Thus the location parameter
is a and the scale parameter w = b − a .

The moments of X are


E(X^n) = (b^{n+1} − a^{n+1}) / ((n + 1)(b − a)),  n ∈ N    (5.21.13)

Proof

The mean and variance of X are

1. E(X) = (1/2)(a + b)
2. var(X) = (1/12)(b − a)²

Open the Special Distribution Simulator and select the uniform distribution. Vary the parameters and note the location and size
of the mean±standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the
empirical mean and standard deviation to the distribution mean and standard deviation.

The skewness and kurtosis of X are


1. skew(X) = 0
2. kurt(X) = 9/5

Proof

Once again, the excess kurtosis is kurt(X) − 3 = −6/5.

The moment generating function M of X is given by M (0) = 1 and


M(t) = (e^{bt} − e^{at}) / (t(b − a)),  t ∈ R ∖ {0}    (5.21.15)

Proof

If h is a real-valued function on [a, b], then E[h(X)] is the average value of h on [a, b], as defined in calculus:

If h : [a, b] → R is integrable, then


E[h(X)] = (1/(b − a)) ∫_a^b h(x) dx    (5.21.16)

Proof

The entropy of the uniform distribution on an interval depends only on the length of the interval.

The entropy of X is H (X) = ln(b − a) .


Proof

Related Distributions
Since the uniform distribution is a location-scale family, it is trivially closed under location-scale transformations.

If X has the uniform distribution with location parameter a and scale parameter w, and if c ∈ R and d ∈ (0, ∞) , then
Y = c + dX has the uniform distribution with location parameter c + da and scale parameter dw.
Proof

As we saw above, the standard uniform distribution is a basic tool in the random quantile method of simulation. Uniform
distributions on intervals are also basic in the rejection method of simulation. We sketch the method in the next paragraph; see the
section on general uniform distributions for more theory.
Suppose that h is a probability density function for a continuous distribution with values in a bounded interval (a, b) ⊆ R . Suppose
also that h is bounded, so that there exists c > 0 such that h(x) ≤ c for all x ∈ (a, b). Let X = (X₁, X₂, …) be a sequence of
independent variables, each uniformly distributed on (a, b), and let Y = (Y₁, Y₂, …) be a sequence of independent variables, each
uniformly distributed on (0, c). Finally, assume that X and Y are independent. Then ((X₁, Y₁), (X₂, Y₂), …) is a sequence of
independent variables, each uniformly distributed on (a, b) × (0, c) . Let N = min{n ∈ N₊ : 0 < Y_n < h(X_n)} . Then (X_N, Y_N)
is uniformly distributed on R = {(x, y) ∈ (a, b) × (0, c) : y < h(x)} (the region under the graph of h ), and therefore X_N has
probability density function h . In words, we generate uniform points in the rectangular region (a, b) × (0, c) until we get a point
under the graph of h . The x-coordinate of that point is our simulated value. The rejection method can be used to approximately
simulate random variables when the region under the density function is unbounded.
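A minimal Python sketch of the rejection method just described (numpy assumed); the bounded density h(x) = 2x on (0, 1), with c = 2, is an arbitrary choice for the demonstration.

    import numpy as np

    rng = np.random.default_rng()

    def rejection_sample(h, a, b, c, size):
        # Generate uniform points (X, Y) in (a, b) x (0, c) until a point
        # falls under the graph of h; keep the x-coordinates.
        out = []
        while len(out) < size:
            x = rng.uniform(a, b)
            y = rng.uniform(0.0, c)
            if y < h(x):
                out.append(x)
        return np.array(out)

    x = rejection_sample(lambda t: 2 * t, 0.0, 1.0, 2.0, 10_000)
    print(x.mean())  # the density 2x on (0, 1) has mean 2/3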

Open the rejection method simulator. For each distribution, select a set of parameter values. Run the experiment 2000 times
and observe how the rejection method works. Compare the empirical density function, mean, and standard deviation to their
distributional counterparts.

This page titled 5.21: The Uniform Distribution on an Interval is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

5.22: Discrete Uniform Distributions

Uniform Distributions on a Finite Set


Suppose that S is a nonempty, finite set. A random variable X taking values in S has the uniform distribution on S if
P(X ∈ A) = #(A)/#(S),  A ⊆ S    (5.22.1)

The discrete uniform distribution is a special case of the general uniform distribution with respect to a measure, in this case
counting measure. The distribution corresponds to picking an element of S at random. Most classical, combinatorial probability
models are based on underlying discrete uniform distributions. The chapter on Finite Sampling Models explores a number of such
models.

The probability density function f of X is given by

f(x) = 1/#(S),  x ∈ S    (5.22.2)

Proof

Like all uniform distributions, the discrete uniform distribution on a finite set is characterized by the property of constant density
on the set. Another property that all uniform distributions share is invariance under conditioning on a subset.

Suppose that R is a nonempty subset of S . Then the conditional distribution of X given X ∈ R is uniform on R .
Proof

If h : S → R then the expected value of h(X) is simply the arithmetic average of the values of h :
E[h(X)] = (1/#(S)) ∑_{x∈S} h(x)    (5.22.4)

Proof

The entropy of X depends only on the number of points in S .

The entropy of X is H (X) = ln[#(S)] .


Proof

Uniform Distributions on Finite Subsets of R


Without some additional structure, not much more can be said about discrete uniform distributions. Thus, suppose that n ∈ N₊ and
that S = {x₁, x₂, … , x_n} is a subset of R with n points. We will assume that the points are indexed in order, so that
x₁ < x₂ < ⋯ < x_n . Suppose that X has the uniform distribution on S .

The probability density function f of X is given by f(x) = 1/n for x ∈ S .

The distribution function F of X is given by

1. F(x) = 0 for x < x₁
2. F(x) = k/n for x_k ≤ x < x_{k+1} and k ∈ {1, 2, … n − 1}
3. F(x) = 1 for x ≥ x_n

Proof

The quantile function F⁻¹ of X is given by F⁻¹(p) = x_{⌈np⌉} for p ∈ (0, 1].
Proof

The moments of X are ordinary arithmetic averages.

For k ∈ N,

E(X^k) = (1/n) ∑_{i=1}^{n} x_i^k    (5.22.7)

In particular,

The mean and variance of X are

1. μ = (1/n) ∑_{i=1}^{n} x_i
2. σ² = (1/n) ∑_{i=1}^{n} (x_i − μ)²

Uniform Distributions on Discrete Intervals


We specialize further to the case where the finite subset of R is a discrete interval, that is, the points are uniformly spaced.

The Standard Distribution


Suppose that n ∈ N₊ and that Z has the discrete uniform distribution on S = {0, 1, … , n − 1} . The distribution of Z is the
standard discrete uniform distribution with n points.

Of course, the results in the previous subsection apply with x_i = i − 1 and i ∈ {1, 2, … , n}.

The probability density function g of Z is given by g(z) = 1/n for z ∈ S .

Open the Special Distribution Simulation and select the discrete uniform distribution. Vary the number of points, but keep the
default values for the other parameters. Note the graph of the probability density function. Run the simulation 1000 times and
compare the empirical density function to the probability density function.

The distribution function G of Z is given by G(z) = (1/n)(⌊z⌋ + 1) for z ∈ [0, n − 1] .
Proof

The quantile function G⁻¹ of Z is given by G⁻¹(p) = ⌈np⌉ − 1 for p ∈ (0, 1]. In particular

1. G⁻¹(1/4) = ⌈n/4⌉ − 1 is the first quartile.
2. G⁻¹(1/2) = ⌈n/2⌉ − 1 is the median.
3. G⁻¹(3/4) = ⌈3n/4⌉ − 1 is the third quartile.

Proof

Open the special distribution calculator and select the discrete uniform distribution. Vary the number of points, but keep the
default values for the other parameters. Note the graph of the distribution function. Compute a few values of the distribution
function and the quantile function.

For the standard uniform distribution, results for the moments can be given in closed form.

The mean and variance of Z are

1. E(Z) = (1/2)(n − 1)
2. var(Z) = (1/12)(n² − 1)

Proof

Open the Special Distribution Simulation and select the discrete uniform distribution. Vary the number of points, but keep the
default values for the other parameters. Note the size and location of the mean±standard deviation bar. Run the simulation 1000
times and compare the empirical mean and standard deviation to the true mean and standard deviation.

The skewness and kurtosis of Z are

1. skew(Z) = 0
2. kurt(Z) = (3/5) (3n² − 7)/(n² − 1)

Proof

Note that kurt(Z) → 9/5 as n → ∞ . The limiting value is the kurtosis of the uniform distribution on an interval.

Z has probability generating function P given by P(1) = 1 and

P(t) = (1/n) (1 − t^n)/(1 − t),  t ∈ R ∖ {1}    (5.22.12)

Proof

The General Distribution


We now generalize the standard discrete uniform distribution by adding location and scale parameters.

Suppose that Z has the standard discrete uniform distribution on n ∈ N₊ points, and that a ∈ R and h ∈ (0, ∞) . Then
X = a + hZ has the uniform distribution on n points with location parameter a and scale parameter h .

Note that X takes values in


S = {a, a + h, a + 2h, … , a + (n − 1)h} (5.22.14)

so that S has n elements, starting at a , with step size h , a discrete interval. In the further special case where a ∈ Z and h = 1 , we
have an integer interval. Note that the last point is b = a + (n − 1)h , so we can clearly also parameterize the distribution by the
endpoints a and b , and the step size h . With this parametrization, the number of points is n = 1 + (b − a)/h . For the remainder of
this discussion, we assume that X has the distribution in the definition. Our first result is that the distribution of X really is
uniform.

X has probability density function f given by f(x) = 1/n for x ∈ S .
Proof

Open the Special Distribution Simulation and select the discrete uniform distribution. Vary the parameters and note the graph
of the probability density function. For various values of the parameters, run the simulation 1000 times and compare the
empirical density function to the probability density function.

The distribution function F of X is given by

F(x) = (1/n)(⌊(x − a)/h⌋ + 1),  x ∈ [a, b]    (5.22.15)

Proof

The quantile function F⁻¹ of X is given by F⁻¹(p) = a + h(⌈np⌉ − 1) for p ∈ (0, 1]. In particular

1. F⁻¹(1/4) = a + h(⌈n/4⌉ − 1) is the first quartile.
2. F⁻¹(1/2) = a + h(⌈n/2⌉ − 1) is the median.
3. F⁻¹(3/4) = a + h(⌈3n/4⌉ − 1) is the third quartile.

Proof
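As a quick illustration, the quantile function gives a direct way to simulate the general discrete uniform distribution from a random number; a minimal Python sketch (numpy assumed, arbitrary parameter values):

    import numpy as np

    rng = np.random.default_rng()

    def discrete_uniform_sample(a, h, n, size):
        # Random quantile method: F^{-1}(U) = a + h * (ceil(n U) - 1).
        # The event U = 0 has probability 0 and is ignored.
        u = rng.uniform(0.0, 1.0, size)
        return a + h * (np.ceil(n * u) - 1)

    x = discrete_uniform_sample(a=1.0, h=0.5, n=10, size=10_000)
    print(x.mean(), 1.0 + 0.5 * (10 - 1) / 2)  # mean is a + h(n - 1)/2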

Open the special distribution calculator and select the discrete uniform distribution. Vary the parameters and note the graph of
the distribution function. Compute a few values of the distribution function and the quantile function.

The mean and variance of X are

1. E(X) = a + (1/2)(n − 1)h = (1/2)(a + b)
2. var(X) = (1/12)(n² − 1)h² = (1/12)(b − a)(b − a + 2h)

Proof

Note that the mean is the average of the endpoints (and so is the midpoint of the interval [a, b]) while the variance depends only on
the number of points and the step size.

Open the Special Distribution Simulator and select the discrete uniform distribution. Vary the parameters and note the shape
and location of the mean/standard deviation bar. For selected values of the parameters, run the simulation 1000 times and
compare the empirical mean and standard deviation to the true mean and standard deviation.

The skewness and kurtosis of X are

1. skew(X) = 0
2. kurt(X) = (3/5) (3n² − 7)/(n² − 1)

Proof

X has moment generating function M given by M(0) = 1 and

M(t) = e^{ta} (1/n) (1 − e^{nth})/(1 − e^{th}),  t ∈ R ∖ {0}    (5.22.16)

Proof

Related Distributions
Since the discrete uniform distribution on a discrete interval is a location-scale family, it is trivially closed under location-scale
transformations.

Suppose that X has the discrete uniform distribution on n ∈ N₊ points with location parameter a ∈ R and scale parameter
h ∈ (0, ∞) . If c ∈ R and w ∈ (0, ∞) then Y = c + wX has the discrete uniform distribution on n points with location
parameter c + wa and scale parameter wh.


Proof

In terms of the endpoint parameterization, X has left endpoint a , right endpoint a + (n − 1)h , and step size h while Y has left
endpoint c + wa , right endpoint (c + wa) + (n − 1)wh , and step size wh.
The uniform distribution on a discrete interval converges to the continuous uniform distribution on the interval with the same
endpoints, as the step size decreases to 0.

Suppose that X_n has the discrete uniform distribution with endpoints a and b , and step size (b − a)/n , for each n ∈ N₊. Then
the distribution of X_n converges to the continuous uniform distribution on [a, b] as n → ∞ .
Proof
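A brief numerical illustration of this convergence (plain Python): as the step size h = (b − a)/n shrinks, the variance (b − a)(b − a + 2h)/12 of the discrete distribution approaches the continuous value (b − a)²/12.

    # Variance of the discrete uniform distribution on [a, b] with step
    # h = (b - a)/n, compared with the continuous limit (b - a)^2 / 12.
    a, b = 0.0, 1.0
    for n in (2, 10, 100, 1000):
        h = (b - a) / n
        var_discrete = (b - a) * (b - a + 2 * h) / 12
        print(n, var_discrete, (b - a) ** 2 / 12)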

This page titled 5.22: Discrete Uniform Distributions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

5.23: The Semicircle Distribution

The Semicircle Distribution


The semicircle distribution plays a very important role in the study of random matrices. It is also known as the Wigner distribution
in honor of the physicist Eugene Wigner, who did pioneering work on random matrices.

The Standard Semicircle Distribution


Distribution Functions
The standard semicircle distribution is a continuous distribution on the interval [−1, 1] with probability density function g
given by

g(x) = (2/π) √(1 − x²),  x ∈ [−1, 1]    (5.23.1)

Proof
As noted in the proof, x ↦ √(1 − x²) for x ∈ [−1, 1] is the upper half of the circle of radius 1 centered at the origin, hence the
name.

The standard semicircle probability density function g satisfies the following properties:
1. g is symmetric about x = 0 .
2. g increases and then decreases with mode at x = 0 .
3. g is concave downward.
Proof
As noted earlier, except for the normalizing constant, the graph of g is the upper half of the circle of radius 1 centered at the
origin, and so these properties are obvious.

Open special distribution simulator and select the semicircle distribution. With the default parameter value, note the shape of
the probability density function. Run the simulation 1000 times and compare the empirical density function to the probability
density function.

The standard semicircle distribution function G is given by

G(x) = 1/2 + (1/π) x √(1 − x²) + (1/π) arcsin x,  x ∈ [−1, 1]    (5.23.2)

Proof

We cannot give the quantile function G⁻¹ in closed form, but values of this function can be approximated. Clearly by symmetry,
G⁻¹(1/2 − p) = −G⁻¹(1/2 + p) for 0 ≤ p ≤ 1/2 . In particular, the median is 0.

Open the special distribution simulator and select the semicircle distribution. With the default parameter value, note the shape
of the distribution function. Compute the first and third quartiles.

Moments
Suppose that X has the standard semicircle distribution. The moments of X about 0 can be computed explicitly. In particular, the
odd order moments are 0 by symmetry.

For n ∈ N , the moment of order 2n + 1 is E(X^{2n+1}) = 0 and the moment of order 2n is

E(X^{2n}) = (1/2)^{2n} (1/(n + 1)) \binom{2n}{n}    (5.23.3)

Proof
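As a numerical sanity check, a short Python sketch (numpy assumed) compares a Riemann sum of x^{2n} g(x) on [−1, 1] with the closed form above.

    import numpy as np
    from math import comb

    # Compare a Riemann sum of x^{2n} g(x) on [-1, 1] with the
    # closed form E(X^{2n}) = (1/2)^{2n} * binom(2n, n) / (n + 1).
    x = np.linspace(-1.0, 1.0, 200_001)
    g = (2 / np.pi) * np.sqrt(np.clip(1 - x * x, 0.0, None))
    dx = x[1] - x[0]
    for n in (1, 2, 3):
        numeric = np.sum(x ** (2 * n) * g) * dx
        exact = (1 / 2) ** (2 * n) * comb(2 * n, n) / (n + 1)
        print(n, numeric, exact)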

The numbers C_n = (1/(n + 1)) \binom{2n}{n} for n ∈ N are known as the Catalan numbers, and are named for the Belgian mathematician Eugene
Catalan. In particular, we can compute the mean, variance, skewness, and kurtosis.

The mean and variance of X are

1. E(X) = 0
2. var(X) = 1/4

Open the special distribution simulator and select the semicircle distribution. With the default parameter value, note the size
and location of the mean ± standard deviation bar. Run the simulation 1000 times and compare the empirical mean and
standard deviation to the true mean and standard deviation.

The skewness and kurtosis of X are


1. skew(X) = 0
2. kurt(X) = 2
Proof

It follows that the excess kurtosis is kurt(X) − 3 = −1 .

Related Distributions
The semicircle distribution has simple connections to the continuous uniform distribution.

If (X, Y) is uniformly distributed on the circular region in R² centered at the origin with radius 1, then X and Y each have the
standard semicircular distribution.


Proof

It's easy to simulate a random point that is uniformly distributed on circular region in the previous theorem, and this provides a way
of simulating a standard semicircle distribution. This is important since we can't use the random quantile method of simulation.

Suppose that U , V , and W are independent random variables, each with the standard uniform distribution (random numbers).
Let R = max{U , V } and Θ = 2πW , and then let X = R cos Θ , Y = R sin Θ . Then (X, Y ) is uniformly distributed on the
circular region of radius 1 centered at the origin, and hence X and Y each have the standard semicircle distribution.
Proof
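A minimal Python sketch of this simulation scheme (numpy assumed):

    import numpy as np

    rng = np.random.default_rng()

    def semicircle_sample(size):
        # R = max{U, V} has distribution function r^2 on [0, 1], the radial
        # distribution for a point uniform in the unit disk; Theta = 2*pi*W.
        u, v, w = rng.uniform(size=(3, size))
        r = np.maximum(u, v)
        return r * np.cos(2 * np.pi * w)  # standard semicircle distribution

    x = semicircle_sample(100_000)
    print(x.mean(), x.var())  # should be close to 0 and 1/4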

Of course, note that X and Y in the previous theorem are not independent. Another method of simulation is to use the rejection
method. This method works well since the semicircle distribution has a bounded support interval and a bounded probability density
function.

Open the rejection method app and select the semicircle distribution. Keep the default parameters to get the standard semicircle
distribution. Run the simulation 1000 times and note the points in the scatterplot. Compare the empirical density function,
mean, and standard deviation to their distributional counterparts.

The General Semicircle Distribution


Like so many standard distributions, the standard semicircle distribution is usually generalized by adding location and scale
parameters.

Definition
Suppose that Z has the standard semicircle distribution. For a ∈ R and r ∈ (0, ∞) , X = a + rZ has the semicircle
distribution with center (location parameter) a and radius (scale parameter) r.

Distribution Functions
Suppose that X has the semicircle distribution with center a ∈ R and radius r ∈ (0, ∞).

X has probability density function f given by

f(x) = (2/(π r²)) √(r² − (x − a)²),  x ∈ [a − r, a + r]    (5.23.9)

Proof
The graph of x ↦ √(r² − (x − a)²) for x ∈ [a − r, a + r] is the upper half of the circle of radius r centered at a . The area under
this semicircle is π r²/2, so as a check on our work, we see that f is a valid probability density function.

The probability density function f of X satisfies the following properties:


1. f is symmetric about x = a .
2. f increases and then decreases with mode at x = a .
3. f is concave downward.

Open special distribution simulator and select the semicircle distribution. Vary the center a and the radius r, and note the shape
of the probability density function. For selected values of a and r, run the simulation 1000 times and compare the empirical
density function to the probability density function.

The distribution function F of X is

F(x) = 1/2 + ((x − a)/(π r²)) √(r² − (x − a)²) + (1/π) arcsin((x − a)/r),  x ∈ [a − r, a + r]    (5.23.11)

Proof

As in the standard case, we cannot give the quantile function F⁻¹ in closed form, but values of this function can be approximated.
Recall that F⁻¹(p) = a + r G⁻¹(p) where G⁻¹ is the standard semicircle quantile function. In particular,
F⁻¹(1/2 − p) = 2a − F⁻¹(1/2 + p) for 0 ≤ p ≤ 1/2 . The median is a .

Open the special distribution simulator and select the semicircle distribution. Vary the center a and the radius r, and note the
shape of the distribution function. For selected values of a and r, compute the first and third quartiles.

Moments
Suppose again that X has the semicircle distribution with center a ∈ R and radius r ∈ (0, ∞), so by definition we can assume
X = a + rZ where Z has the standard semicircle distribution. The moments of X can be computed from the moments of Z .

Using the binomial theorem and the linearity of expected value we have

E(X^n) = ∑_{k=0}^{n} \binom{n}{k} r^k a^{n−k} E(Z^k),  n ∈ N    (5.23.13)

In particular,

The mean and variance of X are

1. E(X) = a
2. var(X) = r²/4

When the center is 0, the general moments have a simple form:

Suppose that a = 0 . For n ∈ N the moment of order 2n + 1 is E(X^{2n+1}) = 0 and the moment of order 2n is

E(X^{2n}) = (r/2)^{2n} (1/(n + 1)) \binom{2n}{n}    (5.23.14)

Proof

Open the special distribution simulator and select the semicircle distribution. Vary the center a and the radius r, and note the
size and location of the mean ± standard deviation bar. For selected values of a and r, run the simulation 1000 times and
compare the empirical mean and standard deviation to the distribution mean and standard deviation.

The skewness and kurtosis of X are


1. skew(X) = 0
2. kurt(X) = 2
Proof

Once again, the excess kurtosis is kurt(X) − 3 = −1 .

Related Distributions
Since the semicircle distribution is a location-scale family, it's invariant under location-scale transformations.

Suppose that X has the semicircle distribution with center a ∈ R and radius r ∈ (0, ∞). If b ∈ R and c ∈ (0, ∞) then b + cX
has the semicircle distribution with center b + ca and radius cr.
Proof

One member of the beta family of distributions is a semicircle distribution:

The beta distribution with left parameter 3/2 and right parameter 3/2 is the semicircle distribution with center 1/2 and radius
1/2.

Proof

Since we can simulate a variable Z with the standard semicircle distribution by the method above, we can simulate a variable with
the semicircle distribution with center a ∈ R and radius r ∈ (0, ∞) by our very definition: X = a + rZ . Once again, the rejection
method also works well since the support and probability density function of X are bounded.

Open the rejection method app and select the semicircle distribution. For selected values of a and r, run the simulation 1000
times and note the points in the scatterplot. Compare the empirical density function, mean and standard deviation to their
distributional counterparts.

This page titled 5.23: The Semicircle Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

5.24: The Triangle Distribution

Like the semicircle distribution, the triangle distribution is based on a simple geometric shape. The distribution arises naturally
when uniformly distributed random variables are transformed in various ways.

The Standard Triangle Distribution


Distribution Functions

The standard triangle distribution with vertex at p ∈ [0, 1] (equivalently, shape parameter p) is a continuous distribution on
[0, 1] with probability density function g described as follows:

1. If p = 0 then g(x) = 2(1 − x) for x ∈ [0, 1].
2. If p = 1 then g(x) = 2x for x ∈ [0, 1].
3. If p ∈ (0, 1) then

g(x) = 2x/p for x ∈ [0, p], and g(x) = 2(1 − x)/(1 − p) for x ∈ [p, 1]    (5.24.1)

The shape of the probability density function justifies the name triangle distribution.

The graph of g , together with the domain [0, 1], forms a triangle with vertices (0, 0), (1, 0), and (p, 2). The mode of the
distribution is x = p .

1. If p = 0 , g is decreasing.
2. If p = 1 , g is increasing.
3. If p ∈ (0, 1), g increases and then decreases.
Proof

Open special distribution simulator and select the triangle distribution. Vary p (but keep the default values for the other
parameters) and note the shape of the probability density function. For selected values of p, run the simulation 1000 times and
compare the empirical density function to the probability density function.

The distribution function G is given as follows:

1. If p = 0 , G(x) = 1 − (1 − x)² for x ∈ [0, 1].
2. If p = 1 , G(x) = x² for x ∈ [0, 1].
3. If p ∈ (0, 1),

G(x) = x²/p for x ∈ [0, p], and G(x) = 1 − (1 − x)²/(1 − p) for x ∈ [p, 1]    (5.24.2)

Proof

The quantile function G⁻¹ is given by

G⁻¹(u) = √(up) for u ∈ [0, p], and G⁻¹(u) = 1 − √((1 − u)(1 − p)) for u ∈ [p, 1]    (5.24.3)

1. The first quartile is √(p/4) if p ∈ [1/4, 1] and is 1 − √(3(1 − p)/4) if p ∈ [0, 1/4].
2. The median is √(p/2) if p ∈ [1/2, 1] and is 1 − √((1 − p)/2) if p ∈ [0, 1/2].
3. The third quartile is √(3p/4) if p ∈ [3/4, 1] and is 1 − √((1 − p)/4) if p ∈ [0, 3/4].

Open the special distribution calculator and select the triangle distribution. Vary p (but keep the default values for the other
parameters) and note the shape of the distribution function. For selected values of p, compute the first and third quartiles.

Moments
Suppose that X has the standard triangle distribution with vertex p ∈ [0, 1]. The moments are easy to compute.

Suppose that n ∈ N .

1. If p = 1 , E(X^n) = 2/(n + 2) .
2. If p ∈ [0, 1),

E(X^n) = (2/(n + 2)) p^{n+1} + (2/(n + 1)) (1 − p^{n+1})/(1 − p) − (2/(n + 2)) (1 − p^{n+2})/(1 − p)    (5.24.4)

Proof

From the general moment formula, we can compute the mean, variance, skewness, and kurtosis.

The mean and variance of X are

1. E(X) = (1/3)(1 + p)
2. var(X) = (1/18)[1 − p(1 − p)]

Proof

Note that E(X) increases from 1/3 to 2/3 as p increases from 0 to 1. The graph of var(X) as a function of p is a parabola opening
upward; the largest value, 1/18, occurs when p = 0 or p = 1, and the smallest value, 1/24, occurs when p = 1/2 .

Open the special distribution simulator and select the triangle distribution. Vary p (but keep the default values for the other
parameters) and note the size and location of the mean ± standard deviation bar. For selected values of p, run the simulation
1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.

The skewness of X is

skew(X) = √2 (1 − 2p)(1 + p)(2 − p) / (5[1 − p(1 − p)]^{3/2})    (5.24.5)

The kurtosis of X is kurt(X) = 12/5 .
Proof

Note that X is positively skewed for p < 1/2 , negatively skewed for p > 1/2 , and symmetric for p = 1/2 . More specifically, if we
indicate the dependence on the parameter p then skew_{1−p}(X) = −skew_p(X) . Note also that the kurtosis is independent of p, and
the excess kurtosis is kurt(X) − 3 = −3/5 .

Open the special distribution simulator and select the triangle distribution. Vary p (but keep the default values for the other
parameters) and note the degree of symmetry and the degree to which the distribution is peaked. For selected values of p, run
the simulation 1000 times and compare the empirical density function to the probability density function.

Related Distributions
If X has the standard triangle distribution with parameter p, then 1 − X has the standard triangle distribution with parameter
1 −p.

Proof

The standard triangle distribution has a number of connections with the standard uniform distribution. Recall that a simulation of a
random variable with a standard uniform distribution is a random number in computer science.

Suppose that U₁ and U₂ are independent random variables, each with the standard uniform distribution. Then

1. X = min{U₁, U₂} has the standard triangle distribution with p = 0 .
2. Y = max{U₁, U₂} has the standard triangle distribution with p = 1 .

Proof

Suppose again that U₁ and U₂ are independent random variables, each with the standard uniform distribution. Then

1. X = |U₂ − U₁| has the standard triangle distribution with p = 0 .
2. Y = (U₁ + U₂)/2 has the standard triangle distribution with p = 1/2 .
Proof

In the previous result, note that Y is the sample mean from a random sample of size 2 from the standard uniform distribution. Since
the quantile function has a simple closed-form expression, the standard triangle distribution can be simulated using the random
quantile method.

Suppose that U has the standard uniform distribution and p ∈ [0, 1] . Then the random variable below has the standard
triangle distribution with parameter p:

X = √(pU) if U ≤ p, and X = 1 − √((1 − p)(1 − U)) if p < U ≤ 1    (5.24.6)
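A minimal Python sketch of this random quantile scheme (numpy assumed; p = 0.3 is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng()

    def triangle_sample(p, size):
        # Random quantile method for the standard triangle distribution:
        # G^{-1}(u) = sqrt(p u) if u <= p, else 1 - sqrt((1 - p)(1 - u)).
        u = rng.uniform(0.0, 1.0, size)
        return np.where(u <= p,
                        np.sqrt(p * u),
                        1 - np.sqrt((1 - p) * (1 - u)))

    x = triangle_sample(p=0.3, size=10_000)
    print(x.mean(), (1 + 0.3) / 3)  # the mean is (1 + p)/3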

Open the random quantile experiment and select the triangle distribution. Vary p (but keep the default values for the other
parameters) and note the shape of the distribution function/quantile function. For selected values of p, run the experiment 1000
times and watch the random quantiles. Compare the empirical density function, mean, and standard deviation to their
distributional counterparts.

The standard triangle distribution can also be simulated using the rejection method, which also works well since the region R under
the probability density function g is bounded. Recall that this method is based on the following fact: if (X, Y ) is uniformly
distributed on the rectangular region S = {(x, y) : 0 ≤ x ≤ 1, 0 ≤ y ≤ 2} which contains R , then the conditional distribution of
(X, Y ) given (X, Y ) ∈ R is uniformly distributed on R , and hence X has probability density function g .

Open the rejection method experiment and select the triangle distribution. Vary p (but keep the default values for the other
parameters) and note the shape of the probability density function. For selected values of p, run the experiment 1000 times and
watch the scatterplot. Compare the empirical density function, mean, and standard deviation to their distributional counterparts.

For the extreme values of the shape parameter, the standard triangle distributions are also beta distributions.

Connections to the beta distribution:


1. The standard triangle distribution with shape parameter p = 0 is the beta distribution with left parameter a = 1 and right
parameter b = 2 .
2. The standard triangle distribution with shape parameter p = 1 is the beta distribution with left parameter a = 2 and right
parameter b = 1 .
Proof

Open the special distribution simulator and select the beta distribution. For parameter values given below, run the simulation
1000 times and compare the empirical density function, mean, and standard deviation to their distributional counterparts.
1. a = 1 , b = 2
2. a = 2 , b = 1

The General Triangle Distribution


Like so many standard distributions, the standard triangle distribution is usually generalized by adding location and scale
parameters.

Definition
Suppose that Z has the standard triangle distribution with vertex at p ∈ [0, 1]. For a ∈ R and w ∈ (0, ∞) , random variable
X = a + wZ has the triangle distribution with location parameter a , scale parameter w, and shape parameter p.

Distribution Functions
Suppose that X has the general triangle distribution given in the definition above.

X has probability density function f given as follows:

1. If p = 0 , f(x) = (2/w²)(a + w − x) for x ∈ [a, a + w] .
2. If p = 1 , f(x) = (2/w²)(x − a) for x ∈ [a, a + w] .
3. If p ∈ (0, 1),

f(x) = (2/(pw²))(x − a) for x ∈ [a, a + pw], and f(x) = (2/(w²(1 − p)))(a + w − x) for x ∈ [a + pw, a + w]    (5.24.7)

Proof

Once again, the shape of the probability density function justifies the name triangle distribution.

The graph of f , together with the domain [a, a + w] , forms a triangle with vertices (a, 0), (a + w, 0) , and (a + pw, 2/w). The
mode of the distribution is x = a + pw .
1. If p = 0 , f is decreasing.
2. If p = 1 , f is increasing.
3. If p ∈ (0, 1), f increases and then decreases.

Clearly the general triangle distribution could be parameterized by the left endpoint a , the right endpoint b = a+w and the
location of the vertex c = a + pw , but the location-scale-shape parameterization is better.

Open special distribution simulator and select the triangle distribution. Vary the parameters a , w, and p, and note the shape and
location of the probability density function. For selected values of the parameters, run the simulation 1000 times and compare
the empirical density function to the probability density function.

The distribution function F of X is given as follows:

1. If p = 0 , F(x) = 1 − (1/w²)(a + w − x)² for x ∈ [a, a + w]
2. If p = 1 , F(x) = (1/w²)(x − a)² for x ∈ [a, a + w]
3. If p ∈ (0, 1),

F(x) = (1/(pw²))(x − a)² for x ∈ [a, a + pw], and F(x) = 1 − (1/(w²(1 − p)))(a + w − x)² for x ∈ [a + pw, a + w]    (5.24.9)

Proof

X has quantile function F⁻¹ given by

F⁻¹(u) = a + w√(up) for 0 ≤ u ≤ p, and F⁻¹(u) = a + w[1 − √((1 − u)(1 − p))] for p ≤ u ≤ 1    (5.24.11)

1. The first quartile is a + w√(p/4) if p ∈ [1/4, 1] and is a + w(1 − √(3(1 − p)/4)) if p ∈ [0, 1/4].
2. The median is a + w√(p/2) if p ∈ [1/2, 1] and is a + w(1 − √((1 − p)/2)) if p ∈ [0, 1/2].
3. The third quartile is a + w√(3p/4) if p ∈ [3/4, 1] and is a + w(1 − √((1 − p)/4)) if p ∈ [0, 3/4].

Proof

Open the special distribution simulator and select the triangle distribution. Vary the parameters a , w, and p, and note the
shape and location of the distribution function. For selected values of parameters, compute the median and the first and third
quartiles.

Moments
Suppose again that X has the triangle distribution with location parameter a ∈ R , scale parameter w ∈ (0, ∞) and shape parameter
p ∈ [0, 1]. Then we can take X = a + wZ where Z has the standard triangle distribution with parameter p. Hence the moments of
X can be computed from the moments of Z . Using the binomial theorem and the linearity of expected value we have

E(X^n) = ∑_{k=0}^{n} \binom{n}{k} w^k a^{n−k} E(Z^k),  n ∈ N    (5.24.12)

The general results are rather messy.

The mean and variance of X are

1. E(X) = a + (w/3)(1 + p)
2. var(X) = (w²/18)[1 − p(1 − p)]

Proof

Open the special distribution simulator and select the triangle distribution. Vary the parameters a , w, and p, and note the size
and location of the mean ± standard deviation bar. For selected values of the parameters, run the simulation 1000 times and
compare the empirical mean and standard deviation to the distribution mean and standard deviation.

The skewness of X is

skew(X) = √2 (1 − 2p)(1 + p)(2 − p) / (5[1 − p(1 − p)]^{3/2})    (5.24.13)

The kurtosis of X is kurt(X) = 12/5 .
Proof

As before, the excess kurtosis is kurt(X) − 3 = −3/5 .

Related Distributions
Since the triangle distribution is a location-scale family, it's invariant under location-scale transformations. More generally, the
family is closed under linear transformations with nonzero slope.

Suppose that X has the triangle distribution with location parameter a ∈ R , scale parameter w ∈ (0, ∞) , and shape parameter
p ∈ [0, 1]. If b ∈ R and c ∈ (0, ∞) then

1. b + cX has the triangle distribution with location parameter b + ca , scale parameter cw, and shape parameter p.
2. b − cX has the triangle distribution with location parameter b − c(a + w) , scale parameter cw, and shape parameter 1 − p .
Proof

As with the standard distribution, there are several connections between the triangle distribution and the continuous uniform
distribution.

Suppose that V₁ and V₂ are independent and are uniformly distributed on the interval [a, a + w] , where a ∈ R and
w ∈ (0, ∞) . Then

1. min{V₁, V₂} has the triangle distribution with location parameter a , scale parameter w, and shape parameter p = 0 .
2. max{V₁, V₂} has the triangle distribution with location parameter a , scale parameter w, and shape parameter p = 1 .

Proof

Suppose again that V₁ and V₂ are independent and are uniformly distributed on the interval [a, a + w] , where a ∈ R and
w ∈ (0, ∞) . Then

1. |V₂ − V₁| has the triangle distribution with location parameter 0, scale parameter w, and shape parameter p = 0 .
2. V₁ + V₂ has the triangle distribution with location parameter 2a, scale parameter 2w, and shape parameter p = 1/2 .
3. V₂ − V₁ has the triangle distribution with location parameter −w, scale parameter 2w, and shape parameter p = 1/2 .

Proof

A special case of (b) leads to a connection between the triangle distribution and the Irwin-Hall distribution.

Suppose that U₁ and U₂ are independent random variables, each with the standard uniform distribution. Then U₁ + U₂ has the
triangle distribution with location parameter 0, scale parameter 2 and shape parameter 1/2 . But this is also the Irwin-Hall
distribution of order n = 2 .

Open the special distribution simulator and select the Irwin-Hall distribution. Set n = 2 and note the shape and location of the
probability density function. Run the simulation 1000 times and compare the empirical density function, mean, and standard
deviation to their distributional counterparts.

Since we can simulate a variable Z with the basic triangle distribution with parameter p ∈ [0, 1] by the random quantile method
above, we can simulate a variable with the triangle distribution that has location parameter a ∈ R , scale parameter w ∈ (0, ∞) ,
and shape parameter p by our very definition: X = a + wZ . Equivalently, we could compute a random quantile using the quantile
function of X.

Open the random quantile experiment and select the triangle distribution. Vary the location parameter a , the scale parameter w,
and the shape parameter p, and note the shape of the distribution function. For selected values of the parameters, run the
experiment 1000 times and watch the random quantiles. Compare the empirical density function, mean and standard deviation
to their distributional counterparts.

As with the standard distribution, the general triangle distribution has a bounded probability density function on a bounded interval,
and hence can be simulated easily via the rejection method.

Open the rejection method experiment and select the triangle distribution. Vary the parameters and note the shape of the
probability density function. For selected values of the parameters, run the experiment 1000 times and watch the scatterplot.
Compare the empirical density function, mean, and standard deviation to their distributional counterparts.

This page titled 5.24: The Triangle Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

5.25: The Irwin-Hall Distribution

The Irwin-Hall distribution, named for Joseph Irwin and Phillip Hall, is the distribution that governs the sum of independent
random variables, each with the standard uniform distribution. It is also known as the uniform sum distribution. Since the standard
uniform is one of the simplest and most basic distributions (and corresponds in computer science to a random number), the Irwin-
Hall is a natural family of distributions. It also serves as a nice example of the central limit theorem, conceptually easy to
understand.

Basic Theory
Definition

Suppose that U = (U₁, U₂, …) is a sequence of independent random variables, each with the uniform distribution on the
interval [0, 1] (the standard uniform distribution). For n ∈ N₊, let

X_n = ∑_{i=1}^{n} U_i    (5.25.1)

Then X_n has the Irwin-Hall distribution of order n .

So X_n has a continuous distribution on the interval [0, n] for n ∈ N₊.

Distribution Functions
Let f denote the probability density function of the standard uniform distribution, so that f(x) = 1 for 0 ≤ x ≤ 1 (and is 0
otherwise). It follows immediately that the probability density function f_n of X_n satisfies f_n = f^{∗n}, where of course f^{∗n} is the n-
fold convolution power of f . We can compute f₂ and f₃ by hand.

The probability density function f₂ of X₂ is given by

f₂(x) = x for 0 ≤ x ≤ 1, and f₂(x) = x − 2(x − 1) for 1 ≤ x ≤ 2    (5.25.2)

Proof

Note that the graph of f₂ on [0, 2] consists of two lines, pieced together in a continuous way at x = 1 . The form given above is not
the simplest, but makes the continuity clear, and will be helpful when we generalize.

In the special distribution simulator, select the Irwin-Hall distribution and set n = 2 . Note the shape of the probability density
function. Run the simulation 1000 times and compare the empirical density function to the probability density function.

The probability density function f₃ of X₃ is given by

f₃(x) = (1/2)x² for 0 ≤ x ≤ 1
f₃(x) = (1/2)x² − (3/2)(x − 1)² for 1 ≤ x ≤ 2
f₃(x) = (1/2)x² − (3/2)(x − 1)² + (3/2)(x − 2)² for 2 ≤ x ≤ 3    (5.25.3)

Note that the graph of f₃ on [0, 3] consists of three parabolas pieced together in a continuous way at x = 1 and x = 2 . The
expressions for f₃(x) for 1 ≤ x ≤ 2 and 2 ≤ x ≤ 3 can be expanded and simplified, but again the form given above makes the
continuity clear, and will be helpful when we generalize.

In the special distribution simulator, select the Irwin-Hall distribution and set n = 3 . Note the shape of the probability density
function. Run the simulation 1000 times and compare the empirical density function to the probability density function.

Naturally, we don't want to perform the convolutions one at a time; we would like a general formula. To state the formula
succinctly, we need to recall the floor function:

⌊x⌋ = max{n ∈ Z : n ≤ x}, x ∈ R (5.25.4)

so that ⌊x⌋ = j if j ∈ Z and j ≤ x < j + 1 .

For n ∈ N₊, the probability density function f_n of X_n is given by

f_n(x) = (1/(n − 1)!) ∑_{k=0}^{⌊x⌋} (−1)^k \binom{n}{k} (x − k)^{n−1},  x ∈ R    (5.25.5)

Proof
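The formula is straightforward to evaluate numerically; a short Python sketch (standard library only), checked against the hand computations above:

    from math import comb, factorial, floor

    def irwin_hall_pdf(x, n):
        # f_n(x) = (1/(n-1)!) * sum_{k=0}^{floor(x)} (-1)^k binom(n, k) (x - k)^{n-1}
        if x < 0 or x > n:
            return 0.0
        s = sum((-1) ** k * comb(n, k) * (x - k) ** (n - 1)
                for k in range(floor(x) + 1))
        return s / factorial(n - 1)

    print(irwin_hall_pdf(0.5, 2))  # f_2(x) = x on [0, 1], so 0.5
    print(irwin_hall_pdf(1.5, 3))  # f_3(1.5) = (1/2)(1.5)^2 - (3/2)(0.5)^2 = 0.75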

Note that for n ∈ N₊, the graph of f_n on [0, n] consists of n polynomials of degree n − 1 pieced together in a continuous way.
Such a construction is known as a polynomial spline. The points where the polynomials are connected are known as knots. So f_n is
a polynomial spline of degree n − 1 with knots at x ∈ {1, 2, … , n − 1}. There is another representation of f_n as a sum. To state
this one succinctly, we need to recall the sign function:

sgn(x) = −1 for x < 0, sgn(x) = 0 for x = 0, and sgn(x) = 1 for x > 0    (5.25.12)

For n ∈ N₊, the probability density function f_n of X_n is given by

f_n(x) = (1/(2(n − 1)!)) ∑_{k=0}^{n} (−1)^k \binom{n}{k} sgn(x − k)(x − k)^{n−1},  x ∈ R    (5.25.13)

Direct Proof
Proof by induction

Open the special distribution simulator and select the Irwin-Hall distribution. Start with n = 1 and increase n successively to
the maximum n = 10 . Note the shape of the probability density function. For various values of n , run the simulation 1000
times and compare the empirical density function to the probability density function.

For n ∈ {2, 3, …}, the Irwin-Hall distribution is symmetric and unimodal, with mode at n/2.

The distribution function F_n of X_n is given by

F_n(x) = (1/n!) ∑_{k=0}^{⌊x⌋} (−1)^k \binom{n}{k} (x − k)^n,  x ∈ [0, n]    (5.25.20)

Proof

So F_n is a polynomial spline of degree n with knots at {1, 2, … , n − 1} . The alternate form of the probability density function
leads to an alternate form of the distribution function.

The distribution function F_n of X_n is given by

F_n(x) = 1/2 + (1/(2 n!)) ∑_{k=0}^{n} (−1)^k \binom{n}{k} sgn(x − k)(x − k)^n,  x ∈ [0, n]    (5.25.21)

Proof

The quantile function F_n⁻¹ does not have a simple representation, but of course by symmetry, the median is n/2.

Open the special distribution calculator and select the Irwin-Hall distribution. Vary n from 1 to 10 and note the shape of the
distribution function. For each value of n compute the first and third quartiles.

Moments
The moments of the Irwin-Hall distribution are easy to obtain from the representation as a sum of independent standard uniform
variables. Once again, we assume that X_n has the Irwin-Hall distribution of order n ∈ N₊.

The mean and variance of X_n are

1. E(X_n) = n/2
2. var(X_n) = n/12

Proof

Open the special distribution simulator and select the Irwin-Hall distribution. Vary n and note the shape and location of the
mean ± standard deviation bar. For selected values of n run the simulation 1000 times and compare the empirical mean and
standard deviation to the distribution mean and standard deviation.

The skewness and kurtosis of X_n are

1. skew(X_n) = 0
2. kurt(X_n) = 3 − 6/(5n)

Proof

Note that kurt(X_n) → 3 , the kurtosis of the normal distribution, as n → ∞ . That is, the excess kurtosis kurt(X_n) − 3 → 0 as
n → ∞.

Open the special distribution simulator and select the Irwin-Hall distribution. Vary n and note the shape and of the probability
density function in light of the previous results on skewness and kurtosis. For selected values of n run the simulation 1000
times and compare the empirical density function, mean, and standard deviation to their distributional counterparts.

The moment generating function M_n of X_n is given by M_n(0) = 1 and

M_n(t) = ((e^t − 1)/t)^n,  t ∈ R ∖ {0}    (5.25.22)

Proof

Related Distributions
The most important connection is to the standard uniform distribution in the definition: The Irwin-Hall distribution of order
n ∈ N₊ is the distribution of the sum of n independent variables, each with the standard uniform distribution. The Irwin-Hall
distribution of order 2 is also a triangle distribution:

The Irwin-Hall distribution of order 2 is the triangle distribution with location parameter 0, scale parameter 2, and shape
parameter 1/2 .

Proof

The Irwin-Hall distribution is connected to the normal distribution via the central limit theorem.

Suppose that X_n has the Irwin-Hall distribution of order n for each n ∈ N₊. Then the distribution of

Z_n = (X_n − n/2)/√(n/12)    (5.25.23)

converges to the standard normal distribution as n → ∞ .


Proof
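A short Python demonstration of the normal approximation (numpy assumed):

    import numpy as np

    rng = np.random.default_rng()

    n = 10
    # X_n is the sum of n random numbers; standardize with mean n/2 and
    # standard deviation sqrt(n/12).
    x = rng.uniform(size=(100_000, n)).sum(axis=1)
    z = (x - n / 2) / np.sqrt(n / 12)
    # Compare P(Z_n <= 1) with the standard normal value, about 0.8413.
    print((z <= 1).mean())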

Thus, if n is large, X_n has approximately a normal distribution with mean n/2 and variance n/12.

Open the special distribution simulator and select the Irwin-Hall distribution. Start with n = 1 and increase n successively to
the maximum n = 10 . Note how the probability density function becomes more “normal” as n increases. For various values of
n , run the simulation 1000 times and compare the empirical density function to the probability density function.

The Irwin-Hall distribution of order n is trivial to simulate, as the sum of n random numbers. Since the probability density function
is bounded on a bounded support interval, the distribution can also be simulated via the rejection method. Computationally, this is a
dumb thing to do, of course, but it can still be a fun exercise.

Open the rejection method experiment and select the Irwin-Hall distribution. For various values of n , run the simulation 2000
times. Compare the empirical density function, mean, and standard deviation to their distributional counterparts.


5.26: The U-Power Distribution

The U-power distribution is a U-shaped family of distributions based on a simple family of power functions.

The Standard U-Power Distribution


Distribution Functions

The standard U-power distribution with shape parameter k ∈ N_+ is a continuous distribution on [−1, 1] with probability
density function g given by

g(x) = ((2k + 1)/2) x^{2k},  x ∈ [−1, 1]   (5.26.1)

Proof

The algebraic form of the probability density function explains the name of the distribution. The most common of the standard U-
power distributions is the U-quadratic distribution, which corresponds to k = 1 .

The standard U-power probability density function g satisfies the following properties:
1. g is symmetric about x = 0 .
2. g decreases and then increases with minimum value at x = 0 .
3. The modes are x = ±1 .
4. g is concave upward.
Proof

Open the Special Distribution Simulator and select the U-power distribution. Vary the shape parameter but keep the default
values for the other parameters. Note the graph of the probability density function. For selected values of the shape parameter,
run the simulation 1000 times and compare the empirical density function to the probability density function.

The distribution function G is given by

G(x) = (1/2)(1 + x^{2k+1}),  x ∈ [−1, 1]   (5.26.5)

Proof

The quantile function G^{−1} is given by G^{−1}(p) = (2p − 1)^{1/(2k+1)} for p ∈ [0, 1].

1. G^{−1}(1 − p) = −G^{−1}(p) for p ∈ [0, 1].
2. The first quartile is q_1 = −1/2^{1/(2k+1)}.
3. The median is 0.
4. The third quartile is q_3 = 1/2^{1/(2k+1)}.

Proof

Open the Special Distribution Calculator and select the U-power distribution. Vary the shape parameter but keep the default
values for the other parameters. Note the shape of the distribution function. For various values of the shape parameter, compute
a few quantiles.

Moments
Suppose that Z has the standard U-power distribution with parameter k ∈ N_+. The moments (about 0) are easy to compute.

Let n ∈ N. The moment of order 2n + 1 is E(Z^{2n+1}) = 0. The moment of order 2n is

E(Z^{2n}) = (2k + 1)/(2(n + k) + 1)   (5.26.6)

Proof

Since the mean is 0, the moments about 0 are also the central moments.

The mean and variance of Z are

1. E(Z) = 0
2. var(Z) = (2k + 1)/(2k + 3)

Proof

Note that var(Z) → 1 as k → ∞ .

Open the Special Distribution Simulator and select the U-power distribution. Vary the shape parameter but keep the default
values for the other parameters. Note the position and size of the mean ± standard deviation bar. For selected values of the
shape parameter, run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean
and standard deviation.

The skewness and kurtosis of Z are

1. skew(Z) = 0
2. kurt(Z) = (2k + 3)^2/[(2k + 5)(2k + 1)]

Proof
Note that kurt(Z) → 1 as k → ∞. The excess kurtosis is kurt(Z) − 3 = (2k + 3)^2/[(2k + 5)(2k + 1)] − 3, and so
kurt(Z) − 3 → −2 as k → ∞.

Related Distributions
The U-power probability density function g actually makes sense for k = 0 as well, and in this case the distribution reduces to the
uniform distribution on the interval [−1, 1]. But of course, this distribution is not U-shaped, except in a degenerate sense. There are
other connections to the uniform distribution. The first is a standard result since the U-power quantile function has a simple, closed
representation:

Suppose that k ∈ N_+.

1. If U has the standard uniform distribution then Z = (2U − 1)^{1/(2k+1)} has the standard U-power distribution with
parameter k.
2. If Z has the standard U-power distribution with parameter k then U = (1/2)(1 + Z^{2k+1}) has the standard uniform
distribution.

Part (a) of course leads to the random quantile method of simulation.
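As a concrete illustration, here is a minimal Python sketch of the random quantile method (NumPy assumed; the names are ours, not part of the text). Since 2U − 1 can be negative, the odd root must be taken sign-safely rather than as a fractional power of a negative number:

import numpy as np

rng = np.random.default_rng()
k, reps = 2, 100_000
w = 2 * rng.random(reps) - 1               # 2U - 1 is uniform on [-1, 1]
# Z = (2U - 1)^{1/(2k+1)}: take the real odd root via sign and absolute value
z = np.sign(w) * np.abs(w) ** (1 / (2 * k + 1))
print(z.mean())                            # compare with E(Z) = 0
print(z.var(), (2 * k + 1) / (2 * k + 3))  # compare with var(Z) = (2k+1)/(2k+3)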

Open the random quantile simulator and select the U-power distribution. Vary the shape parameter but keep the default values
for the other parameters. Note the shape of the distribution and density functions. For selected values of the parameter, run the
simulation 1000 times and note the random quantiles. Compare the empirical density function to the probability density
function.

The standard U-power distribution with shape parameter k ∈ N_+ converges to the discrete uniform distribution on {−1, 1} as
k → ∞.

Proof

The General U-Power Distribution


Like so many standard distributions, the standard U-power distribution is generalized by adding location and scale parameters.

Definition
Suppose that Z has the standard U-power distribution with shape parameter k ∈ N_+. If μ ∈ R and c ∈ (0, ∞) then
X = μ + cZ has the U-power distribution with shape parameter k, location parameter μ and scale parameter c.

Note that X has a continuous distribution on the interval [a, b] where a = μ − c and b = μ + c, so the distribution can also be
parameterized by the shape parameter k and the endpoints a and b. With this parametrization, the location parameter is
μ = (a + b)/2 and the scale parameter is c = (b − a)/2.

Distribution Functions
Suppose that X has the U-power distribution with shape parameter k ∈ N+ , location parameter μ ∈ R , and scale parameter
c ∈ (0, ∞).

X has probability density function f given by

f(x) = ((2k + 1)/(2c)) ((x − μ)/c)^{2k},  x ∈ [μ − c, μ + c]   (5.26.7)

1. f is symmetric about μ .
2. f decreases and then increases with minimum value at x = μ .
3. The modes are at x = μ ± c .
4. f is concave upward.
Proof

Open the Special Distribution Simulator and select the U-power distribution. Vary the parameters and note the shape and
location of the probability density function. For various values of the parameters, run the simulation 1000 times and compare
the empirical density function to the probability density function.

X has distribution function F given by

F(x) = (1/2)[1 + ((x − μ)/c)^{2k+1}],  x ∈ [μ − c, μ + c]   (5.26.8)

Proof

X has quantile function F^{−1} given by F^{−1}(p) = μ + c(2p − 1)^{1/(2k+1)} for p ∈ [0, 1].

1. F^{−1}(1 − p) = 2μ − F^{−1}(p) for p ∈ [0, 1].
2. The first quartile is q_1 = μ − c/2^{1/(2k+1)}.
3. The median is μ.
4. The third quartile is q_3 = μ + c/2^{1/(2k+1)}.

Proof

Open the Special Distribution Calculator and select the U-power distribution. Vary the parameters and note the graph of the
distribution function. For various values of the parameters, compute selected values of the distribution function and the
quantile function.

Moments
Suppose again that X has the U-power distribution with shape parameter k ∈ N_+, location parameter μ ∈ R, and scale parameter
c ∈ (0, ∞).

The mean and variance of X are

1. E(X) = μ
2. var(X) = c^2 (2k + 1)/(2k + 3)

Proof

Note that var(X) → c^2 as k → ∞.

Open the Special Distribution Simulator and select the U-power distribution. Vary the parameters and note the size and
location of the mean ± standard deviation bar. For various values of the parameters, run the simulation 1000 times and
compare the empirical mean and standard deviation to the distribution mean and standard deviation.

The moments about 0 are messy, but the central moments are simple.

Let n ∈ N. The central moment of order 2n + 1 is E[(X − μ)^{2n+1}] = 0. The central moment of order 2n is

E[(X − μ)^{2n}] = c^{2n} (2k + 1)/(2(n + k) + 1)   (5.26.9)

Proof

The skewness and kurtosis of X are

1. skew(X) = 0
2. kurt(X) = (2k + 3)^2/[(2k + 5)(2k + 1)]

Proof
Again, kurt(X) → 1 as k → ∞ and the excess kurtosis is kurt(X) − 3 = (2k + 3)^2/[(2k + 5)(2k + 1)] − 3.

Related Distributions
Since the U-power distribution with a given shape parameter is a location-scale family, it is trivially closed under location-scale
transformations.

Suppose that X has the U-power distribution with shape parameter k ∈ N_+, location parameter μ ∈ R, and scale parameter
c ∈ (0, ∞). If α ∈ R and β ∈ (0, ∞), then Y = α + βX has the U-power distribution with shape parameter k, location
parameter α + βμ, and scale parameter βc.


Proof

As before, since the U-power distribution function and the U-power quantile function have simple forms, we have the usual
connections with the standard uniform distribution.

Suppose that k ∈ N_+, μ ∈ R and c ∈ (0, ∞).

1. If U has the standard uniform distribution then X = μ + c(2U − 1)^{1/(2k+1)} has the U-power distribution with shape
parameter k, location parameter μ, and scale parameter c.
2. If X has the U-power distribution with shape parameter k, location parameter μ, and scale parameter c, then
U = (1/2)[1 + ((X − μ)/c)^{2k+1}] has the standard uniform distribution.
Again, part (a) of course leads to the random quantile method of simulation.

Open the random quantile simulator and select the U-power distribution. Vary the parameters and note the shape of the
distribution and density functions. For selected values of the parameters, run the simulation 1000 times and note the random
quantiles. Compare the empirical density function to the probability density function.

The U-power distribution with given location and scale parameters converges to the discrete uniform distribution at the endpoints
as the shape parameter increases.

The U-power distribution with shape parameter k ∈ N_+, location parameter μ ∈ R, and scale parameter c ∈ (0, ∞) converges
to the discrete uniform distribution on {μ − c, μ + c} as k → ∞.


Proof

The U-power distribution is a general exponential family in the shape parameter, if the location and scale parameters are fixed.

Suppose that X has the U-power distribution with unspecified shape parameter k ∈ N_+, but with specified location parameter
μ ∈ R and scale parameter c ∈ (0, ∞). Then X has a one-parameter exponential distribution with natural parameter 2k and
natural statistic ln(|X − μ|/c).

Proof

Since the U-power distribution has a bounded probability density function on a bounded support interval, it can also be simulated
via the rejection method.
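For instance, here is a hedged Python sketch of rejection sampling for the standard U-power distribution (NumPy assumed; not from the original text). The density is bounded by its modal value g(±1) = (2k + 1)/2 on [−1, 1], so proposals are drawn uniformly from the enclosing rectangle and kept when they fall under the graph of the density:

import numpy as np

rng = np.random.default_rng()
k, reps = 2, 100_000
g = lambda x: ((2 * k + 1) / 2) * x ** (2 * k)  # standard U-power density
x = rng.uniform(-1.0, 1.0, reps)                # horizontal proposal points
y = rng.uniform(0.0, (2 * k + 1) / 2, reps)     # uniform heights in the bounding box
z = x[y <= g(x)]                                # accept points under the density
print(z.mean(), z.var())                        # compare with 0 and (2k+1)/(2k+3)

Since the bounding box has area 2k + 1 while the density has total area 1, the acceptance probability is 1/(2k + 1), so the method becomes increasingly inefficient as k grows.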

Open the rejection method experiment and select the U-power distribution. Vary the parameters and note the shape of the
probability density function. For selected values of the parameters, run the experiment 1000 times and watch the scatterplot.
Compare the empirical density function, mean, and standard deviation to their distributional counterparts.


5.27: The Sine Distribution

The sine distribution is a simple probability distribution based on a portion of the sine curve. It is also known as Gilbert's sine
distribution, named for the American geologist Grove Karl (GK) Gilbert who used the distribution in 1892 to study craters on the
moon.

The Standard Sine Distribution


Distribution Functions
The standard sine distribution is a continuous distribution on [0, 1] with probability density function g given by

g(z) = (π/2) sin(πz),  z ∈ [0, 1]   (5.27.1)

1. g is symmetric about z = 1/2.
2. g increases and then decreases with mode at z = 1/2.
3. g is concave downward.
Proof

Open the Special Distribution Simulator and select the sine distribution. Run the simulation 1000 times and compare the
empirical density function to the probability density function.

The distribution function G is given by G(z) = (1/2)[1 − cos(πz)] for z ∈ [0, 1].
Proof

The quantile function G^{−1} is given by G^{−1}(p) = (1/π) arccos(1 − 2p) for p ∈ [0, 1].

1. The first quartile is q_1 = 1/3.
2. The median is 1/2.
3. The third quartile is q_3 = 2/3.
Proof

Open the Special Distribution Calculator and select the sine distribution. Compute a few quantiles.

Moments
Suppose that Z has the standard sine distribution. The moment generating function can be given in closed form.

The moment generating function m of Z is given by

m(t) = E(e^{tZ}) = π^2 (1 + e^t)/[2(t^2 + π^2)],  t ∈ R   (5.27.5)

Proof

The moments of all orders exist, but a general formula is complicated and involves special functions. However, the mean and
variance are easy to compute.

The mean and variance of Z are

1. E(Z) = 1/2
2. var(Z) = 1/4 − 2/π^2

Proof

Numerically, sd(Z) ≈ 0.2176 .

Open the Special Distribution Simulator and select the sine distribution. Note the position and size of the mean ± standard
deviation bar. Run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean
and standard deviation.

The skewness and kurtosis of Z are

1. skew(Z) = 0
2. kurt(Z) = (384 − 48π^2 + π^4)/(π^2 − 8)^2

Proof

Numerically, kurt(Z) ≈ 2.1938 .

Related Distributions
Since the distribution function and the quantile function have closed form representations, the standard sine distribution has the
usual connection to the standard uniform distribution.

1. If U has the standard uniform distribution then Z = G^{−1}(U) = (1/π) arccos(1 − 2U) has the standard sine distribution.
2. If Z has the standard sine distribution then U = G(Z) = (1/2)[1 − cos(πZ)] has the standard uniform distribution.

Part (a) of course leads to the random quantile method of simulation.
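A minimal Python sketch of this random quantile method (NumPy assumed; the names are illustrative):

import numpy as np

rng = np.random.default_rng()
u = rng.random(100_000)                          # standard uniform
z = np.arccos(1 - 2 * u) / np.pi                 # Z = G^{-1}(U)
print(z.mean(), 0.5)                             # compare with E(Z) = 1/2
print(z.std(), np.sqrt(1 / 4 - 2 / np.pi ** 2))  # compare with sd(Z) ≈ 0.2176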

Open the random quantile simulator and select the sine distribution. Note the shape of the distribution and density functions.
Run the simulation 1000 times and note the random quantiles. Compare the empirical density function to the probability
density function.

Since the probability density function is continuous and is defined on a closed, bounded interval, the standard sine distribution can
also be simulated using the rejection method.

Open the rejection method app and select the sine distribution. Run the simulation 1000 times and compare the empirical
density function to the probability density function.

The General Sine Distribution


As with so many other “standard distributions”, the standard sine distribution is generalized by adding location and scale
parameters.

Suppose that Z has the standard sine distribution. For a ∈ R and b ∈ (0, ∞), random variable X = a + bZ has the sine
distribution with location parameter a and scale parameter b.

Distribution Functions
Analogies of the results above for the standard sine distribution follow easily from basic properties of the location-scale
transformation. Suppose that X has the sine distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞). So X has
a continuous distribution on the interval [a, a + b] .

The probability density function f of X is given by

f(x) = (π/(2b)) sin(π(x − a)/b),  x ∈ [a, a + b]   (5.27.10)

1. f is symmetric about x = a + b/2 .


2. f increases and then decreases, with mode x = a + b/2 .
3. f is concave downward.
Proof

Pure scale transformations (a = 0 and b > 0) are particularly common, since X often represents a random angle. The scale
transformation with b = π gives the angle in radians. In this case the probability density function is f(x) = (1/2) sin(x) for
x ∈ [0, π]. Since the radian is the standard angle unit, this distribution could also be considered the “standard one”. The scale
transformation with b = 90 gives the angle in degrees. In this case, the probability density function is f(x) = (π/180) sin(πx/90)
for x ∈ [0, 90]. This was Gilbert's original formulation.

In the special distribution simulator, select the sine distribution. Vary the parameters and note the shape and location of the
probability density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical
density function to the probability density function.

The distribution function F of X is given by

F(x) = (1/2)[1 − cos(π(x − a)/b)],  x ∈ [a, a + b]   (5.27.12)

Proof

The quantile function F^{−1} of X is given by

F^{−1}(p) = a + (b/π) arccos(1 − 2p),  p ∈ (0, 1)   (5.27.14)

1. The first quartile is a + b/3.
2. The median is a + b/2.
3. The third quartile is a + 2b/3.
Proof

In the special distribution calculator, select the sine distribution. Vary the parameters and note the shape and location of the
probability density function and the distribution function. For selected values of the parameters, find the quantiles of order 0.1
and 0.9.

Moments
Suppose again that X has the sine distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞).

The moment generating function M of X is given by

M(t) = π^2 (e^{at} + e^{(a+b)t})/[2(b^2 t^2 + π^2)],  t ∈ R   (5.27.15)

Proof

The mean and variance of X are

1. E(X) = a + b/2
2. var(X) = b^2 (1/4 − 2/π^2)

Proof

In the special distribution simulator, select the sine distribution. Vary the parameters and note the shape and location of the
mean ± standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the empirical
mean and standard deviation to the distribution mean and standard deviation.

The skewness and kurtosis of X are

1. skew(X) = 0
2. kurt(X) = (384 − 48π^2 + π^4)/(π^2 − 8)^2

Proof

Related Distributions
The general sine distribution is a location-scale family, so it is trivially closed under location-scale transformations.

Suppose that X has the sine distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞), and that c ∈ R and
d ∈ (0, ∞) . Then Y = c + dX has the sine distribution with location parameter c + ad and scale parameter bd .

Proof


5.28: The Laplace Distribution

The Laplace distribution, named for Pierre Simon Laplace, arises naturally as the distribution of the difference of two independent,
identically distributed exponential variables. For this reason, it is also called the double exponential distribution.

The Standard Laplace Distribution


Distribution Functions

The standard Laplace distribution is a continuous distribution on R with probability density function g given by

g(u) = (1/2) e^{−|u|},  u ∈ R   (5.28.1)

Proof

The probability density function g satisfies the following properties:


1. g is symmetric about 0.
2. g increases on (−∞, 0] and decreases on [0, ∞), with mode u = 0 .
3. g is concave upward on (−∞, 0] and on [0, ∞) with a cusp at u = 0
Proof

Open the Special Distribution Simulator and select the Laplace distribution. Keep the default parameter value and note the
shape of the probability density function. Run the simulation 1000 times and compare the empirical density function and the
probability density function.

The standard Laplace distribution function G is given by

G(u) = (1/2) e^u,  u ∈ (−∞, 0]
G(u) = 1 − (1/2) e^{−u},  u ∈ [0, ∞)   (5.28.3)

Proof

The quantile function G^{−1} is given by

G^{−1}(p) = ln(2p),  p ∈ [0, 1/2]
G^{−1}(p) = −ln[2(1 − p)],  p ∈ [1/2, 1]   (5.28.4)

1. G^{−1}(1 − p) = −G^{−1}(p) for p ∈ (0, 1)
2. The first quartile is q_1 = −ln 2 ≈ −0.6931.
3. The median is q_2 = 0.
4. The third quartile is q_3 = ln 2 ≈ 0.6931.
Proof

Open the Special Distribution Calculator and select the Laplace distribution. Keep the default parameter value. Compute
selected values of the distribution function and the quantile function.

Moments
Suppose that U has the standard Laplace distribution.

U has moment generating function m given by

m(t) = E(e^{tU}) = 1/(1 − t^2),  t ∈ (−1, 1)   (5.28.5)

Proof

The moments of U are

1. E(U^n) = 0 if n ∈ N is odd.
2. E(U^n) = n! if n ∈ N is even.
Proof

The mean and variance of U are


1. E(U ) = 0
2. var(U ) = 2

Open the Special Distribution Simulator and select the Laplace distribution. Keep the default parameter value. Run the
simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.

The skewness and kurtosis of U are


1. skew(U ) = 0
2. kurt(U ) = 6
Proof

It follows that the excess kurtosis is kurt(U ) − 3 = 3 .

Related Distributions
Of course, the standard Laplace distribution has simple connections to the standard exponential distribution.

If U has the standard Laplace distribution then V = |U | has the standard exponential distribution.
Proof

If V and W are independent and each has the standard exponential distribution, then U = V −W has the standard Laplace
distribution.
Proof using PDFs
Proof using MGFs
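As a numerical illustration of the last result, here is a small Python sketch (NumPy assumed; the names are ours) that simulates U = V − W and checks it against the standard Laplace moments and the fact, noted above, that |U| is standard exponential:

import numpy as np

rng = np.random.default_rng()
reps = 100_000
v = rng.exponential(1.0, reps)             # standard exponential
w = rng.exponential(1.0, reps)             # independent standard exponential
u = v - w                                  # standard Laplace
print(u.mean(), u.var())                   # compare with E(U) = 0 and var(U) = 2
print(np.mean(np.abs(u) > 1), np.exp(-1))  # P(|U| > 1) = e^{-1} since |U| is exponential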

If V has the standard exponential distribution, I has the standard Bernoulli distribution, and V and I are independent, then
U = (2I − 1)V has the standard Laplace distribution.
Proof

The standard Laplace distribution has a curious connection to the standard normal distribution.

Suppose that (Z_1, Z_2, Z_3, Z_4) is a random sample of size 4 from the standard normal distribution. Then
U = Z_1 Z_2 + Z_3 Z_4 has the standard Laplace distribution.


Proof

The standard Laplace distribution has the usual connections to the standard uniform distribution by means of the distribution
function and the quantile function computed above.

Connections to the standard uniform distribution.


1. If V has the standard uniform distribution then U = ln(2V) 1(V < 1/2) − ln[2(1 − V)] 1(V ≥ 1/2) has the standard
Laplace distribution.
2. If U has the standard Laplace distribution then V = (1/2) e^U 1(U < 0) + [1 − (1/2) e^{−U}] 1(U ≥ 0) has the standard
uniform distribution.

From part (a), the standard Laplace distribution can be simulated with the usual random quantile method.
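A minimal Python sketch of part (a) (NumPy assumed; names are illustrative), using the two branches of the quantile function:

import numpy as np

rng = np.random.default_rng()
v = rng.random(100_000)  # standard uniform
# two-branch quantile function of the standard Laplace distribution
u = np.where(v < 0.5, np.log(2 * v), -np.log(2 * (1 - v)))
print(np.quantile(u, 0.25), -np.log(2))  # first quartile, compare with -ln 2
print(np.quantile(u, 0.75), np.log(2))   # third quartile, compare with ln 2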

Open the random quantile experiment and select the Laplace distribution. Keep the default parameter values and note the shape
of the probability density and distribution functions. Run the simulation 1000 times and compare the empirical density

function, mean, and standard deviation to their distributional counterparts.

The General Laplace Distribution


The standard Laplace distribution is generalized by adding location and scale parameters.

Suppose that U has the standard Laplace distribution. If a ∈ R and b ∈ (0, ∞), then X = a + bU has the Laplace distribution
with location parameter a and scale parameter b .

Distribution Functions
Suppose that X has the Laplace distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞).

X has probability density function f given by

f(x) = (1/(2b)) exp(−|x − a|/b),  x ∈ R   (5.28.16)

1. f is symmetric about a.
2. f increases on (−∞, a] and decreases on [a, ∞) with mode x = a.
3. f is concave upward on (−∞, a] and on [a, ∞) with a cusp at x = a.
Proof

Open the Special Distribution Simulator and select the Laplace distribution. Vary the parameters and note the shape and
location of the probability density function. For various values of the parameters, run the simulation 1000 times and compare
the empirical density function to the probability density function.

X has distribution function F given by

F(x) = (1/2) exp((x − a)/b),  x ∈ (−∞, a]
F(x) = 1 − (1/2) exp(−(x − a)/b),  x ∈ [a, ∞)   (5.28.17)

Proof

X has quantile function F^{−1} given by

F^{−1}(p) = a + b ln(2p),  0 ≤ p ≤ 1/2
F^{−1}(p) = a − b ln[2(1 − p)],  1/2 ≤ p < 1   (5.28.18)

1. F^{−1}(1 − p) = 2a − F^{−1}(p) for p ∈ (0, 1)
2. The first quartile is q_1 = a − b ln 2.
3. The median is q_2 = a.
4. The third quartile is q_3 = a + b ln 2.

Proof

Open the Special Distribution Calculator and select the Laplace distribution. For various values of the scale parameter,
compute selected values of the distribution function and the quantile function.

Moments
Again, we assume that X has the Laplace distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞), so that by
definition, X = a + bU where U has the standard Laplace distribution.

X has moment generating function M given by

M(t) = E(e^{tX}) = e^{at}/(1 − b^2 t^2),  t ∈ (−1/b, 1/b)   (5.28.19)

Proof

The moments of X about the location parameter have a simple form.

The moments of X about a are

1. E[(X − a)^n] = 0 if n ∈ N is odd.
2. E[(X − a)^n] = b^n n! if n ∈ N is even.
Proof

The mean and variance of X are

1. E(X) = a
2. var(X) = 2b^2

Proof

Open the Special Distribution Simulator and select the Laplace distribution. Vary the parameters and note the size and location
of the mean ± standard deviation bar. For various values of the scale parameter, run the simulation 1000 times and compare
the empirical mean and standard deviation to the distribution mean and standard deviation.

The skewness and kurtosis of X are


1. skew(X) = 0
2. kurt(X) = 6
Proof

As before, the excess kurtosis is kurt(X) − 3 = 3 .

Related Distributions
By construction, the Laplace distribution is a location-scale family, and so is closed under location-scale transformations.

Suppose that X has the Laplace distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞), and that c ∈ R
and d ∈ (0, ∞). Then Y = c + dX has the Laplace distribution with location parameter c + ad and scale parameter bd.
Proof

Once again, the Laplace distribution has the usual connections to the standard uniform distribution by means of the distribution
function and the quantile function computed above. The latter leads to the usual random quantile method of simulation.

Suppose that a ∈ R and b ∈ (0, ∞).


1. If V has the standard uniform distribution then

X = [a + b ln(2V)] 1(V < 1/2) + (a − b ln[2(1 − V)]) 1(V ≥ 1/2)   (5.28.20)

has the Laplace distribution with location parameter a and scale parameter b.
2. If X has the Laplace distribution with location parameter a and scale parameter b, then

V = (1/2) exp((X − a)/b) 1(X < a) + [1 − (1/2) exp(−(X − a)/b)] 1(X ≥ a)   (5.28.21)

has the standard uniform distribution.

Open the random quantile experiment and select the Laplace distribution. Vary the parameter values and note the shape of the
probability density and distribution functions. For selected values of the parameters, run the simulation 1000 times and
compare the empirical density function, mean, and standard deviation to their distributional counterparts.

The Laplace distribution is also a member of the general exponential family of distributions.

Suppose that X has the Laplace distribution with known location parameter a ∈ R and unspecified scale parameter
b ∈ (0, ∞). Then X has a general exponential distribution in the scale parameter b, with natural parameter −1/b and natural
statistic |X − a|.
Proof


5.29: The Logistic Distribution

The logistic distribution is used for various growth models, and is used in a certain type of regression, known appropriately as
logistic regression.

The Standard Logistic Distribution


Distribution Functions

The standard logistic distribution is a continuous distribution on R with distribution function G given by

G(z) = e^z/(1 + e^z),  z ∈ R   (5.29.1)

Proof

The probability density function g of the standard logistic distribution is given by

g(z) = e^z/(1 + e^z)^2,  z ∈ R   (5.29.3)

1. g is symmetric about z = 0.
2. g increases and then decreases with mode z = 0.
3. g is concave upward, then downward, then upward again with inflection points at z = ±ln(2 + √3) ≈ ±1.317.
Proof

In the special distribution simulator, select the logistic distribution. Keep the default parameter values and note the shape and
location of the probability density function. Run the simulation 1000 times and compare the empirical density function to the
probability density function.

The quantile function G^{−1} of the standard logistic distribution is given by

G^{−1}(p) = ln(p/(1 − p)),  p ∈ (0, 1)   (5.29.7)

1. The first quartile is −ln 3 ≈ −1.0986.
2. The median is 0.
3. The third quartile is ln 3 ≈ 1.0986.
Proof

Recall that p : 1 − p are the odds in favor of an event with probability p. Thus, the logistic distribution has the interesting property
that the quantiles are the logarithms of the corresponding odds ratios. Indeed, this function of p is sometimes called the logit
function. The fact that the median is 0 also follows from symmetry, of course.

In the special distribution calculator, select the logistic distribution. Keep the default parameter values and note the shape and
location of the probability density function and the distribution function. Find the quantiles of order 0.1 and 0.9.

Moments
Suppose that Z has the standard logistic distribution. The moment generating function of Z has a simple representation in terms of
the beta function B , and hence also in terms of the gamma function Γ

The moment generating function m of Z is given by

m(t) = B(1 + t, 1 − t) = Γ(1 + t) Γ(1 − t), t ∈ (−1, 1) (5.29.8)

Proof

Since the moment generating function is finite on an open interval containing 0, random variable Z has moments of all orders. By
symmetry, the odd order moments are 0. The even order moments can be represented in terms of Bernoulli numbers, named of
course for Jacob Bernoulli. Let β_n denote the Bernoulli number of order n ∈ N.

Let n ∈ N.

1. If n is odd then E(Z^n) = 0.
2. If n is even then E(Z^n) = (2^n − 2) π^n |β_n|.

Proof

In particular, we have the mean and variance.

The mean and variance of Z are

1. E(Z) = 0
2. var(Z) = π^2/3

Proof

In the special distribution simulator, select the logistic distribution. Keep the default parameter values and note the shape and
location of the mean ± standard deviation bar. Run the simulation 1000 times and compare the empirical mean and standard
deviation to the distribution mean and standard deviation.

The skewness and kurtosis of Z are

1. skew(Z) = 0
2. kurt(Z) = 21/5

Proof

It follows that the excess kurtosis of Z is kurt(Z) − 3 = 6/5.

Related Distributions
The standard logistic distribution has the usual connections with the standard uniform distribution by means of the distribution
function and quantile function given above. Recall that the standard uniform distribution is the continuous uniform distribution on
the interval (0, 1).

Connections with the standard uniform distribution.


1. If Z has the standard logistic distribution then

U = G(Z) = e^Z/(1 + e^Z)   (5.29.13)

has the standard uniform distribution.
2. If U has the standard uniform distribution then

Z = G^{−1}(U) = ln(U/(1 − U)) = ln(U) − ln(1 − U)   (5.29.14)

has the standard logistic distribution.

Since the quantile function has a simple closed form, we can use the usual random quantile method to simulate the standard logistic
distribution.
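A minimal Python sketch of this random quantile method via the logit transform (NumPy assumed; the names are ours):

import numpy as np

rng = np.random.default_rng()
u = rng.random(100_000)         # standard uniform
z = np.log(u / (1 - u))         # logit transform: Z = G^{-1}(U)
print(z.mean())                 # compare with E(Z) = 0
print(z.var(), np.pi ** 2 / 3)  # compare with var(Z) = π^2/3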

Open the random quantile experiment and select the logistic distribution. Keep the default parameter values and note the shape
of the probability density and distribution functions. Run the simulation 1000 times and compare the empirical density
function, mean, and standard deviation to their distributional counterparts.

The standard logistic distribution also has several simple connections with the standard exponential distribution (the exponential
distribution with rate parameter 1).

Connections with the standard exponential distribution:


1. If Z has the standard logistic distribution, then Y = ln(e + 1) has the standard exponential distribution.
X

2. If Y has the standard exponential distribution then Z = ln(e − 1) has the standard logistic distribution.
Y

Proof

Suppose that X and Y are independent random variables, each with the standard exponential distribution. Then Z = ln(X/Y )
has the standard logistic distribution.
Proof
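A quick numerical check of this representation, as a hedged Python sketch (NumPy assumed; the parameters are illustrative):

import numpy as np

rng = np.random.default_rng()
reps = 100_000
x = rng.exponential(1.0, reps)  # standard exponential
y = rng.exponential(1.0, reps)  # independent standard exponential
z = np.log(x / y)               # should be standard logistic
# compare the empirical CDF with G(t) = e^t / (1 + e^t) at a few points
for t in (-1.0, 0.0, 1.0):
    print(np.mean(z <= t), np.exp(t) / (1 + np.exp(t)))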

There are also simple connections between the standard logistic distribution and the Pareto distribution.

Connections with the Pareto distribution:


1. If Z has the standard logistic distribution, then Y = e^Z + 1 has the Pareto distribution with shape parameter 1.

2. If Y has the Pareto distribution with shape parameter 1, then Z = ln(Y − 1) has the standard logistic distribution.
Proof

Finally, there are simple connections to the extreme value distribution.

If X and Y are independent and each has the standard Gumbel distribution, then Z = Y − X has the standard logistic
distribution.
Proof

The General Logistic Distribution


The general logistic distribution is the location-scale family associated with the standard logistic distribution.

Suppose that Z has the standard logistic distribution. For a ∈ R and b ∈ (0, ∞), random variable X = a + bZ has the logistic
distribution with location parameter a and scale parameter b .

Distribution Functions
Analogies of the results above for the general logistic distribution follow easily from basic properties of the location-scale
transformation. Suppose that X has the logistic distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞).

The probability density function f of X is given by

f(x) = exp((x − a)/b) / (b [1 + exp((x − a)/b)]^2),  x ∈ R   (5.29.23)

1. f is symmetric about x = a .
2. f increases and then decreases, with mode x = a .

3. f is concave upward, then downward, then upward again, with inflection points at x = a ± ln(2 + √3)b .
Proof

In the special distribution simulator, select the logistic distribution. Vary the parameters and note the shape and location of the
probability density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical
density function to the probability density function.

The distribution function F of X is given by

F(x) = exp((x − a)/b) / [1 + exp((x − a)/b)],  x ∈ R   (5.29.25)

Proof

The quantile function F^{−1} of X is given by

F^{−1}(p) = a + b ln(p/(1 − p)),  p ∈ (0, 1)   (5.29.27)

1. The first quartile is a − b ln 3.
2. The median is a.
3. The third quartile is a + b ln 3.
Proof

In the special distribution calculator, select the logistic distribution. Vary the parameters and note the shape and location of the
probability density function and the distribution function. For selected values of the parameters, find the quantiles of order 0.1
and 0.9.

Moments
Suppose again that X has the logistic distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞) . Recall that B

denotes the beta function and Γ the gamma function.

The moment generating function M of X is given by

M(t) = e^{at} B(1 + bt, 1 − bt) = e^{at} Γ(1 + bt) Γ(1 − bt),  t ∈ (−1/b, 1/b)   (5.29.28)

Proof

The mean and variance of X are

1. E(X) = a
2. var(X) = b^2 π^2/3

Proof

In the special distribution simulator, select the logistic distribution. Vary the parameters and note the shape and location of the
mean ± standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the empirical
mean and standard deviation to the distribution mean and standard deviation.

The skewness and kurtosis of X are

1. skew(X) = 0
2. kurt(X) = 21/5

Proof

Once again, it follows that the excess kurtosis of X is kurt(X) − 3 = 6/5. The central moments of X can be given in terms of the
Bernoulli numbers. As before, let β_n denote the Bernoulli number of order n ∈ N.
n

Let n ∈ N.

1. If n is odd then E[(X − a)^n] = 0.
2. If n is even then E[(X − a)^n] = (2^n − 2) π^n b^n |β_n|.
b | βn |

Proof

Related Distributions
The general logistic distribution is a location-scale family, so it is trivially closed under location-scale transformations.

Suppose that X has the logistic distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞), and that c ∈ R and
d ∈ (0, ∞) . Then Y = c + dX has the logistic distribution with location parameter c + ad and scale parameter bd .

Proof


5.30: The Extreme Value Distribution

Extreme value distributions arise as limiting distributions for maximums or minimums (extreme values) of a sample of
independent, identically distributed random variables, as the sample size increases. Thus, these distributions are important in
probability and mathematical statistics.

The Standard Distribution for Maximums


Distribution Functions
The standard extreme value distribution (for maximums) is a continuous distribution on R with distribution function G given
by

G(v) = exp(−e^{−v}),  v ∈ R   (5.30.1)

Proof

The distribution is also known as the standard Gumbel distribution in honor of Emil Gumbel. As we will show below, it arises as
the limit of the maximum of n independent random variables, each with the standard exponential distribution (when this maximum
is appropriately centered). This fact is the main reason that the distribution is special, and is the reason for the name. For the
remainder of this discussion, suppose that random variable V has the standard Gumbel distribution.

The probability density function g of V is given by

g(v) = e^{−v} exp(−e^{−v}) = exp[−(e^{−v} + v)],  v ∈ R   (5.30.2)

1. g increases and then decreases with mode v = 0.
2. g is concave upward, then downward, then upward again, with inflection points at v = ln[(3 ± √5)/2] ≈ ±0.9624.
Proof

In the special distribution simulator, select the extreme value distribution. Keep the default parameter values and note the shape
and location of the probability density function. In particular, note the lack of symmetry. Run the simulation 1000 times and
compare the empirical density function to the probability density function.

The quantile function G^{−1} of V is given by

G^{−1}(p) = −ln[−ln(p)],  p ∈ (0, 1)   (5.30.3)

1. The first quartile is −ln(ln 4) ≈ −0.3266.
2. The median is −ln(ln 2) ≈ 0.3665.
3. The third quartile is −ln(ln 4 − ln 3) ≈ 1.2459.
Proof

In the special distribution calculator, select the extreme value distribution. Keep the default parameter values and note the
shape and location of the probability density and distribution functions. Compute the quantiles of order 0.1, 0.3, 0.6, and 0.9

Moments
Suppose again that V has the standard Gumbel distribution. The moment generating function of V has a simple expression in terms
of the gamma function Γ.

The moment generating function m of V is given by

m(t) = E(e^{tV}) = Γ(1 − t),  t ∈ (−∞, 1)   (5.30.4)

Proof

Next we give the mean and variance. First, recall that the Euler constant, named for Leonhard Euler, is defined by

γ = −Γ′(1) = −∫_0^∞ e^{−x} ln x dx ≈ 0.5772156649   (5.30.6)

The mean and variance of V are

1. E(V) = γ
2. var(V) = π^2/6

Proof

In the special distribution simulator, select the extreme value distribution and keep the default parameter values. Note the shape
and location of the mean ± standard deviation bar. Run the simulation 1000 times and compare the empirical mean and
standard deviation to the distribution mean and standard deviation.

Next we give the skewness and kurtosis of V. The skewness involves a value of the Riemann zeta function ζ, named of course for
Georg Riemann. Recall that ζ is defined by

ζ(n) = ∑_{k=1}^∞ 1/k^n,  n > 1   (5.30.8)

The skewness and kurtosis of V are

1. skew(V) = 12√6 ζ(3)/π^3 ≈ 1.13955
2. kurt(V) = 27/5

The particular value of the zeta function, ζ(3), is known as Apéry's constant. From (b), it follows that the excess kurtosis is
kurt(V) − 3 = 12/5.

Related Distributions
The standard Gumbel distribution has the usual connections to the standard uniform distribution by means of the distribution
function and quantile function given above. Recall that the standard uniform distribution is the continuous uniform distribution on
the interval (0, 1).

The standard Gumbel and standard uniform distributions are related as follows:

1. If U has the standard uniform distribution then V = G^{−1}(U) = −ln(−ln U) has the standard Gumbel distribution.
2. If V has the standard Gumbel distribution then U = G(V) = exp(−e^{−V}) has the standard uniform distribution.

So we can simulate the standard Gumbel distribution using the usual random quantile method.
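A minimal Python sketch of this random quantile method (NumPy assumed; the names are ours):

import numpy as np

rng = np.random.default_rng()
u = rng.random(100_000)         # standard uniform
v = -np.log(-np.log(u))         # V = G^{-1}(U), standard Gumbel
print(v.mean(), 0.5772156649)   # compare with E(V) = γ
print(v.var(), np.pi ** 2 / 6)  # compare with var(V) = π^2/6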

Open the random quantile experiment and select the extreme value distribution. Keep the default parameter values and note
again the shape and location of the probability density and distribution functions. Run the simulation 1000 times and compare
the empirical density function, mean, and standard deviation to their distributional counterparts.

The standard Gumbel distribution also has simple connections with the standard exponential distribution (the exponential
distribution with rate parameter 1).

The standard Gumbel and standard exponential distributions are related as follows:

1. If X has the standard exponential distribution then V = −ln X has the standard Gumbel distribution.
2. If V has the standard Gumbel distribution then X = e^{−V} has the standard exponential distribution.

Proof

As noted in the introduction, the following theorem provides the motivation for the name extreme value distribution.

Suppose that (X_1, X_2, …) is a sequence of independent random variables, each with the standard exponential distribution.
The distribution of Y_n = max{X_1, X_2, …, X_n} − ln n converges to the standard Gumbel distribution as n → ∞.
Proof
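The convergence can also be seen numerically. Here is a hedged Python sketch (NumPy assumed; n and the evaluation points are illustrative) comparing the empirical distribution of the centered maximum with the Gumbel distribution function:

import numpy as np

rng = np.random.default_rng()
n, reps = 1000, 20_000
x = rng.exponential(1.0, (reps, n))  # reps samples, each of n standard exponentials
y = x.max(axis=1) - np.log(n)        # centered maximum Y_n
# compare the empirical CDF with G(v) = exp(-e^{-v}) at a few points
for t in (-1.0, 0.0, 1.0):
    print(np.mean(y <= t), np.exp(-np.exp(-t)))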

The General Extreme Value Distribution
As with many other distributions we have studied, the standard extreme value distribution can be generalized by applying a linear
transformation to the standard variable. First, if V has the standard Gumbel distribution (the standard extreme value distribution for
maximums), then −V has the standard extreme value distribution for minimums. Here is the general definition.

Suppose that V has the standard Gumbel distribution, and that a, b ∈ R with b ≠ 0 . Then X = a + bV has the extreme value
distribution with location parameter a and scale parameter |b|.
1. If b > 0 , then the distribution corresponds to maximums.
2. If b < 0 , then the distribution corresponds to minimums.

So the family of distributions with a ∈ R and b ∈ (0, ∞) is a location-scale family associated with the standard distribution for
maximums, and the family of distributions with a ∈ R and b ∈ (−∞, 0) is the location-scale family associated with the standard
distribution for minimums. The distributions are also referred to more simply as Gumbel distributions rather than extreme value
distributions. The web apps in this project use only the extreme value distributions for maximums. As you will see below, the
differences in the distribution for maximums and the distribution for minimums are minor. For the remainder of this discussion,
suppose that X has the form given in the definition.

Distribution Functions
Let F denote the distribution function of X.

1. If b > 0 then

F(x) = exp[−exp(−(x − a)/b)],  x ∈ R   (5.30.12)

2. If b < 0 then

F(x) = 1 − exp[−exp(−(x − a)/b)],  x ∈ R   (5.30.13)

Proof

Let f denote the probability density function of X. Then

f(x) = (1/|b|) exp(−(x − a)/b) exp[−exp(−(x − a)/b)],  x ∈ R   (5.30.14)

Proof

Open the special distribution simulator and select the extreme value distribution. Vary the parameters and note the shape and
location of the probability density function. For selected values of the parameters, run the simulation 1000 times and compare
the empirical density function to the probability density function.

The quantile function F^{−1} of X is given as follows:

1. If b > 0 then F^{−1}(p) = a − b ln(−ln p) for p ∈ (0, 1).
2. If b < 0 then F^{−1}(p) = a − b ln[−ln(1 − p)] for p ∈ (0, 1).
Proof

Open the special distribution calculator and select the extreme value distribution. Vary the parameters and note the shape and
location of the probability density and distribution functions. For selected values of the parameters, compute a few values of
the quantile function and the distribution function.

Moments
Suppose again that X = a + bV where V has the standard Gumbel distribution, and that a, b ∈ R with b ≠ 0 .

The moment generating function M of X is given by M(t) = e^{at} Γ(1 − bt).

1. With domain t ∈ (−∞, 1/b) if b > 0
2. With domain t ∈ (1/b, ∞) if b < 0
Proof

The mean and variance of X are

1. E(X) = a + bγ
2. var(X) = b^2 π^2/6

Proof

Open the special distribution simulator and select the extreme value distribution. Vary the parameters and note the size and
location of the mean ± standard deviation bar. For selected values of the parameters, run the simulation 1000 times and
compare the empirical mean and standard deviation to the distribution mean and standard deviation.

The skewness of X is

1. skew(X) = 12√6 ζ(3)/π^3 ≈ 1.13955 if b > 0.
2. skew(X) = −12√6 ζ(3)/π^3 ≈ −1.13955 if b < 0.

Proof

The kurtosis of X is kurt(X) = 27/5.

Proof

Once again, the excess kurtosis is kurt(X) − 3 = 12/5.

Related Distributions
Since the general extreme value distributions are location-scale families, they are trivially closed under linear transformations of
the underlying variables (with nonzero slope).

Suppose that X has the extreme value distribution with parameters a, b with b ≠0 and that c, d ∈ R with d ≠0 . Then
Y = c + dX has the extreme value distribution with parameters ad + c and bd .
Proof

Note that if d > 0 then X and Y have the same association (max, max) or (min, min). If d < 0 then X and Y have opposite
associations (max, min) or (min, max).
As with the standard Gumbel distribution, the general Gumbel distribution has the usual connections with the standard uniform
distribution by means of the distribution and quantile functions. Since the quantile function has a simple closed form, the latter
connection leads to the usual random quantile method of simulation. We state the result for maximums.

Suppose that a, b ∈ R with b ≠ 0. Let F denote the distribution function and let F^{−1} denote the quantile function above.

1. If U has the standard uniform distribution then X = F^{−1}(U) has the extreme value distribution with parameters a and b.
2. If X has the extreme value distribution with parameters a and b then U = F(X) has the standard uniform distribution.

Open the random quantile experiment and select the extreme value distribution. Vary the parameters and note again the shape
and location of the probability density and distribution functions. For selected values of the parameters, run the simulation
1000 times and compare the empirical density function, mean, and standard deviation to their distributional counterparts.

The extreme value distribution for maximums has a simple connection to the Weibull distribution, and this generalizes the
connection between the standard Gumbel and exponential distributions above. There is a similar result for the extreme value
distribution for minimums.

The extreme value and Weibull distributions are related as follows:

1. If X has the extreme value distribution with parameters a ∈ R and b ∈ (0, ∞), then Y = e^{−X} has the Weibull distribution
with shape parameter 1/b and scale parameter e^{−a}.
2. If Y has the Weibull distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞) then X = −ln Y has
the extreme value distribution with parameters −ln b and 1/k.

Proof


5.31: The Hyperbolic Secant Distribution

The hyperbolic secant distribution is a location-scale family with a number of interesting parallels to the normal distribution. As
the name suggests, the hyperbolic secant function plays an important role in the distribution, so we should first review some
definitions.

The hyperbolic trig functions sinh, cosh, tanh, and sech are defined as follows, for x ∈ R:

sinh x = (e^x − e^{−x})/2   (5.31.1)
cosh x = (e^x + e^{−x})/2   (5.31.2)
tanh x = sinh x/cosh x = (e^x − e^{−x})/(e^x + e^{−x})   (5.31.3)
sech x = 1/cosh x = 2/(e^x + e^{−x})   (5.31.4)

The Standard Hyperbolic Secant Distribution


Distribution Functions
The standard hyperbolic secant distribution is a continuous distribution on R with probability density function g given by

g(z) = (1/2) sech(πz/2),  z ∈ R   (5.31.5)

1. g is symmetric about 0.
2. g increases and then decreases with mode z = 0.
3. g is concave upward then downward then upward again, with inflection points at z = ±(2/π) ln(√2 + 1) ≈ ±0.561.
Proof

So g has the classic unimodal shape. Recall that the inflection points in the standard normal probability density function are ±1.
Compared to the standard normal distribution, the hyperbolic secant distribution is more peaked at the mode 0 but has fatter tails.

Open the special distribution simulator and select the hyperbolic secant distribtion. Keep the default parameter settings and
note the shape and location of the probability density function. Run the simulation 1000 times and compare the empirical
density function to the probability density function.

The distribution function G of the standard hyperbolic secant distribution is given by

G(z) = (2/π) arctan[exp(πz/2)],  z ∈ R   (5.31.8)

Proof

The quantile function G^{−1} of the standard hyperbolic secant distribution is given by

G^{−1}(p) = (2/π) ln[tan(πp/2)],  p ∈ (0, 1)   (5.31.9)

1. The first quartile is G^{−1}(1/4) = −(2/π) ln(1 + √2) ≈ −0.561.
2. The median is G^{−1}(1/2) = 0.
3. The third quartile is G^{−1}(3/4) = (2/π) ln(1 + √2) ≈ 0.561.
Proof

Of course, the fact that the median is 0 also follows from the symmetry of the distribution, as does the relationship between the first
and third quartiles. In general, G^{−1}(1 − p) = −G^{−1}(p) for p ∈ (0, 1). Note that the first and third quartiles coincide with the
inflection points, whereas in the normal distribution, the inflection points are at ±1 and coincide with the standard deviation.

Open the special distribution calculator and select the hyperbolic secant distribution. Keep the default values of the parameters
and note the shape of the distribution and probability density functions. Compute a few values of the distribution and quantile
functions.

Moments
Suppose that Z has the standard hyperbolic secant distribution. The moments of Z are easiest to compute from the generating
functions.

The characteristic function χ of Z is the hyperbolic secant function:


χ(t) = sech(t), t ∈ R (5.31.10)

Proof

Note that the probability density function can be obtained from the characteristic function by a scale transformation:
g(z) = (1/2) χ(πz/2) for z ∈ R. This is another curious similarity to the normal distribution: the probability density function ϕ and
characteristic function χ of the standard normal distribution are related by ϕ(z) = (1/√(2π)) χ(z).

The moment generating function m of Z is the secant function:

m(t) = E(e^{tZ}) = sec(t),  t ∈ (−π/2, π/2)   (5.31.12)

Proof

It follows that Z has moments of all orders, and then by symmetry, that the odd order moments are all 0.

The mean and variance of Z are


1. E(Z) = 0
2. var(Z) = 1
Proof

Thus, the standard hyperbolic secant distribution has mean 0 and variance 1, just like the standard normal distribution.

Open the special distribution simulator and select the hyperbolic secant distribution. Keep the default parameters and note the
size and location of the mean ± standard deviation bar. Run the simulation 1000 times and compare the empirical mean and
standard deviation to the distribution mean and standard deviation.

The skewness and kurtosis of Z are


1. skew(Z) = 0
2. kurt(Z) = 5
Proof

Recall that the kurtosis of the standard normal distribution is 3, so the excess kurtosis of the standard hyperbolic secant distribution
is kurt(Z) − 3 = 2 . This distribution is more sharply peaked at the mean 0 and has fatter tails, compared with the normal.

Related Distributions
The standard hyperbolic secant distribution has the usual connections with the standard uniform distribution by means of the
distribution function and the quantile function computed above.

The standard hyperbolic secant distribution is related to the standard uniform distribution as follows:

1. If Z has the standard hyperbolic secant distribution then

U = G(Z) = (2/π) arctan[exp(πZ/2)]   (5.31.14)

has the standard uniform distribution.
2. If U has the standard uniform distribution then

Z = G^{−1}(U) = (2/π) ln[tan(πU/2)]   (5.31.15)

has the standard hyperbolic secant distribution.

Since the quantile function has a simple closed form, the standard hyperbolic secant distribution can be easily simulated by means
of the random quantile method.
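A minimal Python sketch (NumPy assumed; the names are ours):

import numpy as np

rng = np.random.default_rng()
u = rng.random(100_000)                          # standard uniform
z = (2 / np.pi) * np.log(np.tan(np.pi * u / 2))  # Z = G^{-1}(U)
print(z.mean(), z.var())                         # compare with E(Z) = 0 and var(Z) = 1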

Open the random quantile experiment and select the hyperbolic secant distribution. Keep the default parameter values and note
again the shape of the probability density and distribution functions. Run the experiment 1000 times and compare the empirical
density function, mean, and standard deviation to their distributional counterparts.

The General Hyperbolic Secant Distribution


The standard hyperbolic secant distribution is generalized by adding location and scale parameters.

Suppose that Z has the standard hyperbolic secant distribution and that μ ∈ R and σ ∈ (0, ∞) . Then X = μ + σZ has the
hyperbolic secant distribution with location parameter μ and scale parameter σ.

Distribution Functions
Suppose that X has the hyperbolic secant distribution with location parameter μ ∈ R and scale parameter σ ∈ (0, ∞).

The probability density function f of X is given by

f(x) = (1/(2σ)) sech[(π/2)((x − μ)/σ)],  x ∈ R   (5.31.16)

1. f is symmetric about μ.
2. f increases and then decreases with mode x = μ.
3. f is concave upward then downward then upward again, with inflection points at x = μ ± (2/π) ln(√2 + 1)σ ≈ μ ± 0.561σ.
Proof

Open the special distribution simulator and select the hyperbolic secant distribution. Vary the parameters and note the shape
and location of the probability density function. For selected values of the parameters, run the simulation 1000 times and
compare the empirical density function to the probability density function.

The distribution function F of X is given by

F(x) = (2/π) arctan{exp[(π/2)((x − μ)/σ)]},  x ∈ R   (5.31.17)

Proof

The quantile function F^{−1} of X is given by

F^{−1}(p) = μ + σ(2/π) ln[tan(πp/2)],  p ∈ (0, 1)   (5.31.18)

1. The first quartile is F^{−1}(1/4) = μ − (2/π) ln(1 + √2)σ ≈ μ − 0.561σ.
2. The median is F^{−1}(1/2) = μ.
3. The third quartile is F^{−1}(3/4) = μ + (2/π) ln(1 + √2)σ ≈ μ + 0.561σ.

Proof

Open the special distribution calculator and select the hyperbolic secant distribution. Vary the parameters and note the shape of
the distribution and density functions. For various values of the parameters, compute a few values of the distribution and

quantile functions.

Moments
Suppose again that X has the hyperbolic secant distribution with location parameter μ ∈ R and scale parameter σ ∈ (0, ∞).

The moment generating function M of X is given by

M(t) = e^{μt} sec(σt),  t ∈ (−π/(2σ), π/(2σ))   (5.31.19)

Proof

Just as in the normal distribution, the location and scale parameters are the mean and standard deviation, respectively.

The mean and variance of X are

1. E(X) = μ
2. var(X) = σ^2

Proof

Open the special distribution simulator and select the hyperbolic secant distribution. Vary the parameters and note the size and
location of the mean ± standard deviation bar. For selected values of the parameters, run the simulation 1000 times and
compare the empirical mean and standard deviation to the distribution mean and standard deviation.

The skewness and kurtosis of X are


1. skew(X) = 0
2. kurt(X) = 5
Proof

Once again, the excess kurtosis is kurt(X) − 3 = 2

Related Distributions
Since the hyperbolic secant distribution is a location-scale family, it is trivially closed under location-scale transformations.

Suppose that X has the hyperbolic secant distribution with location parameter μ ∈ R and scale parameter σ ∈ (0, ∞), and that
a ∈ R and b ∈ (0, ∞). Then Y = a + bX has the hyperbolic secant distribution with location parameter a + bμ and scale

parameter bσ.
Proof

The hyperbolic secant distribution has the usual connections with the standard uniform distribution by means of the distribution
function and the quantile function computed above.

Suppose that μ ∈ R and σ ∈ (0, ∞).


1. If X has the hyperbolic secant distribution with location parameter μ and scale parameter σ then

U = F(X) = (2/π) arctan{exp[(π/2)((X − μ)/σ)]} (5.31.20)

has the standard uniform distribution.
2. If U has the standard uniform distribution then

X = F^{-1}(U) = μ + (2σ/π) ln[tan((π/2) U)] (5.31.21)

has the hyperbolic secant distribution with location parameter μ and scale parameter σ.

Since the quantile function has a simple closed form, the hyperbolic secant distribution can be easily simulated by means of the
random quantile method.

Open the random quantile experiment and select the hyperbolic secant distribution. Vary the parameters and note again the
shape of the probability density and distribution functions. Run the experiment 1000 times and compare the empirical density
function, mean, and standard deviation to their distributional counterparts.

This page titled 5.31: The Hyperbolic Secant Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

5.32: The Cauchy Distribution

The Cauchy distribution, named of course for the ubiquitous Augustin Cauchy, is interesting for a couple of reasons. First, it is a
simple family of distributions for which the expected value (and other moments) do not exist. Second, the family is closed under
the formation of sums of independent variables, and hence is an infinitely divisible family of distributions.

The Standard Cauchy Distribution


Distribution Functions
The standard Cauchy distribution is a continuous distribution on R with probability density function g given by

g(x) = 1/[π(1 + x²)], x ∈ R (5.32.1)

1. g is symmetric about x = 0.
2. g increases and then decreases, with mode x = 0.
3. g is concave upward, then downward, and then upward again, with inflection points at x = ±1/√3.
4. g(x) → 0 as x → ∞ and as x → −∞.
Proof

Thus, the graph of g has a simple, symmetric, unimodal shape that is qualitatively (but certainly not quantitatively) like the
standard normal probability density function. The probability density function g is obtained by normalizing the function

x ↦ 1/(1 + x²), x ∈ R (5.32.3)

The graph of this function is known as the witch of Agnesi, named for the Italian mathematician Maria Agnesi.

Open the special distribution simulator and select the Cauchy distribution. Keep the default parameter values to get the
standard Cauchy distribution and note the shape and location of the probability density function. Run the simulation 1000
times and compare the empirical density function to the probability density function.

The standard Cauchy distribution function G is given by G(x) = 1/2 + (1/π) arctan(x) for x ∈ R.
Proof

The standard Cauchy quantile function G^{-1} is given by G^{-1}(p) = tan[π(p − 1/2)] for p ∈ (0, 1). In particular,
1. The first quartile is G^{-1}(1/4) = −1
2. The median is G^{-1}(1/2) = 0
3. The third quartile is G^{-1}(3/4) = 1
Proof

Of course, the fact that the median is 0 also follows from the symmetry of the distribution, as does the fact that G^{-1}(1 − p) = −G^{-1}(p) for p ∈ (0, 1).

Open the special distribution calculator and select the Cauchy distribution. Keep the default parameter values and note the
shape of the distribution and probability density functions. Compute a few quantiles.

Moments
Suppose that random variable X has the standard Cauchy distribution. As we noted in the introduction, part of the fame of this
distribution comes from the fact that the expected value does not exist.

E(X) does not exist.


Proof

By symmetry, if the expected value did exist, it would have to be 0, just like the median and the mode, but alas the mean does not
exist. Moreover, this is not just an artifact of how mathematicians define improper integrals, but has real consequences. Recall that
if we think of the probability distribution as a mass distribution, then the mean is center of mass, the balance point, the point where
the moment (in the sense of physics) to the right is balanced by the moment to the left. But as the proof of the last result shows, the
moments to the right and to the left at any point a ∈ R are infinite. In this sense, 0 is no more important than any other a ∈ R .
Finally, if you are not convinced by the argument from physics, the next exercise may convince you that the law of large numbers
fails as well.

Open the special distribution simulator and select the Cauchy distribution. Keep the default parameter values, which give the
standard Cauchy distribution. Run the simulation 1000 times and note the behavior of the sample mean.
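
The failure of the law of large numbers is also easy to see numerically. A short Python sketch (NumPy; an illustration under our own naming, not the app above) prints running sample means, which wander rather than settle down:

    import numpy as np

    rng = np.random.default_rng(seed=17)
    x = rng.standard_cauchy(100_000)                    # standard Cauchy sample
    running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
    for n in (10, 100, 1_000, 10_000, 100_000):
        print(n, running_mean[n - 1])
    # The running means do not converge; each seed gives wildly different values.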

Earlier we noted some superficial similarities between the standard Cauchy distribution and the standard normal distribution
(unimodal, symmetric about 0). But clearly there are huge quantitative differences. The Cauchy distribution is a heavy tailed
distribution because the probability density function g(x) decreases at a polynomial rate as x → ∞ and x → −∞ , as opposed to
an exponential rate. This is yet another way to understand why the expected value does not exist.
In terms of the higher moments, E(X^n) does not exist if n is odd, and is ∞ if n is even. It follows that the moment generating function m(t) = E(e^{tX}) cannot be finite in an interval about 0. In fact, m(t) = ∞ for every t ≠ 0, so this generating function is of no use to us. But every distribution on R has a characteristic function, and for the Cauchy distribution, this generating function will be quite useful.

X has characteristic function χ₀ given by χ₀(t) = exp(−|t|) for t ∈ R.
Proof
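
Although the moment generating function is useless here, the characteristic function is easy to check empirically, since the empirical average of e^{itX} is bounded. A sketch (Python/NumPy; illustration only):

    import numpy as np

    rng = np.random.default_rng(seed=3)
    x = rng.standard_cauchy(200_000)
    for t in (0.5, 1.0, 2.0):
        emp = np.mean(np.exp(1j * t * x))       # estimate of E[exp(i t X)]
        print(t, abs(emp), np.exp(-abs(t)))     # compare with exp(-|t|)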

Related Distributions
The standard Cauchy distribution is a member of the Student t family of distributions.

The standard Cauchy distribution is the Student t distribution with one degree of freedom.
Proof

The standard Cauchy distribution also arises naturally as the ratio of independent standard normal variables.

Suppose that Z and W are independent random variables, each with the standard normal distribution. Then X = Z/W has the
standard Cauchy distribution. Equivalently, the standard Cauchy distribution is the Student t distribution with 1 degree of
freedom.
Proof

If X has the standard Cauchy distribution, then so does Y = 1/X.

Proof

The standard Cauchy distribution has the usual connections to the standard uniform distribution via the distribution function and
the quantile function computed above.

The standard Cauchy distribution and the standard uniform distribution are related as follows:
1. If U has the standard uniform distribution then X = G^{-1}(U) = tan[π(U − 1/2)] has the standard Cauchy distribution.
2. If X has the standard Cauchy distribution then U = G(X) = 1/2 + (1/π) arctan(X) has the standard uniform distribution.
Proof

Since the quantile function has a simple, closed form, it's easy to simulate the standard Cauchy distribution using the random
quantile method.

Open the random quantile experiment and select the Cauchy distribution. Keep the default parameter values and note again the
shape and location of the distribution and probability density functions. For selected values of the parameters, run the
simulation 1000 times and compare the empirical density function to the probability density function. Note the behavior of the
empirical mean and standard deviation.

For the Cauchy distribution, the random quantile method has a nice physical interpretation. Suppose that a light source is 1 unit away from position 0 of an infinite, straight wall. We shine the light at the wall at an angle Θ (to the perpendicular) that is uniformly distributed on the interval (−π/2, π/2). Then the position X = tan Θ of the light beam on the wall has the standard Cauchy distribution. Note that this follows since Θ has the same distribution as π(U − 1/2) where U has the standard uniform distribution.
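
A sketch of this experiment in Python (NumPy; the variable names are ours):

    import numpy as np

    rng = np.random.default_rng()
    theta = np.pi * (rng.random(10_000) - 0.5)   # uniform on (-pi/2, pi/2)
    x = np.tan(theta)                            # position of the beam on the wall
    print(np.percentile(x, [25, 50, 75]))        # should be near -1, 0, 1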

Open the Cauchy experiment and keep the default parameter values.
1. Run the experiment in single-step mode a few times, to make sure that you understand the experiment.
2. For selected values of the parameters, run the simulation 1000 times and compare the empirical density function to the
probability density function. Note the behavior of the empirical mean and standard deviation.

The General Cauchy Distribution


Like so many other “standard” distributions, the Cauchy distribution is generalized by adding location and scale parameters. Most
of the results in this subsection follow immediately from results for the standard Cauchy distribution above and general results for
location scale families.

Suppose that Z has the standard Cauchy distribution and that a ∈ R and b ∈ (0, ∞) . Then X = a + bZ has the Cauchy
distribution with location parameter a and scale parameter b .

Distribution Functions
Suppose that X has the Cauchy distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞).

X has probability density function f given by

f(x) = b/{π[b² + (x − a)²]}, x ∈ R (5.32.16)

1. f is symmetric about x = a.
2. f increases and then decreases, with mode x = a.
3. f is concave upward, then downward, then upward again, with inflection points at x = a ± b/√3.
4. f(x) → 0 as x → ∞ and as x → −∞.
Proof

Open the special distribution simulator and select the Cauchy distribution. Vary the parameters and note the location and shape
of the probability density function. For selected values of the parameters, run the simulation 1000 times and compare the
empirical density function to the probability density function.

X has distribution function F given by

F(x) = 1/2 + (1/π) arctan((x − a)/b), x ∈ R (5.32.18)

Proof

X has quantile function F^{-1} given by

F^{-1}(p) = a + b tan[π(p − 1/2)], p ∈ (0, 1) (5.32.20)

In particular,
1. The first quartile is F^{-1}(1/4) = a − b.
2. The median is F^{-1}(1/2) = a.
3. The third quartile is F^{-1}(3/4) = a + b.
Proof

Open the special distribution calculator and select the Cauchy distribution. Vary the parameters and note the shape and location
of the distribution and probability density functions. Compute a few values of the distribution and quantile functions.

Moments
Suppose again that X has the Cauchy distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞). Since the mean
and other moments of the standard Cauchy distribution do not exist, they don't exist for the general Cauchy distribution either.

Open the special distribution simulator and select the Cauchy distribution. For selected values of the parameters, run the
simulation 1000 times and note the behavior of the sample mean.

But of course the characteristic function of the Cauchy distribution exists and is easy to obtain from the characteristic function of
the standard distribution.

X has characteristic function χ given by χ(t) = exp(iat − b|t|) for t ∈ R.


Proof

Related Distributions
Like all location-scale families, the general Cauchy distribution is closed under location-scale transformations.

Suppose that X has the Cauchy distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞), and that c ∈ R
and d ∈ (0, ∞) . Then Y = c + dX has the Cauchy distribution with location parameter c + da and scale parameter bd .
Proof

Much more interesting is the fact that the Cauchy family is closed under sums of independent variables. In fact, this is the main
reason that the generalization to a location-scale family is justified.

Suppose that Xᵢ has the Cauchy distribution with location parameter aᵢ ∈ R and scale parameter bᵢ ∈ (0, ∞) for i ∈ {1, 2}, and that X₁ and X₂ are independent. Then Y = X₁ + X₂ has the Cauchy distribution with location parameter a₁ + a₂ and scale parameter b₁ + b₂.

Proof

As a corollary, the Cauchy distribution is stable, with index α = 1 :

If X = (X₁, X₂, …, Xₙ) is a sequence of independent variables, each with the Cauchy distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞), then X₁ + X₂ + ⋯ + Xₙ has the Cauchy distribution with location parameter na and scale parameter nb.

Another corollary is the strange property that the sample mean of a random sample from a Cauchy distribution has that same
Cauchy distribution. No wonder the expected value does not exist!

Suppose that X = (X₁, X₂, …, Xₙ) is a sequence of independent random variables, each with the Cauchy distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞). (That is, X is a random sample of size n from the Cauchy distribution.) Then the sample mean M = (X₁ + X₂ + ⋯ + Xₙ)/n also has the Cauchy distribution with location parameter a and scale parameter b.
Proof
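
A numerical illustration of this result, assuming SciPy (not part of the text's apps): the sample means of standard Cauchy samples should again be standard Cauchy.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed=5)
    # 2000 sample means, each from a standard Cauchy sample of size 50
    means = rng.standard_cauchy((2000, 50)).mean(axis=1)
    # Kolmogorov-Smirnov test against the standard Cauchy distribution
    print(stats.kstest(means, stats.cauchy.cdf))  # large p-value expected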

The next result shows explicitly that the Cauchy distribution is infinitely divisible. But of course, infinite divisibility is also a
consequence of stability.

Suppose that a ∈ R and b ∈ (0, ∞). For every n ∈ N₊, the Cauchy distribution with location parameter a and scale parameter b is the distribution of the sum of n independent variables, each of which has the Cauchy distribution with location parameter a/n and scale parameter b/n.

Our next result is a very slight generalization of the reciprocal result above for the standard Cauchy distribution.

Suppose that X has the Cauchy distribution with location parameter 0 and scale parameter b ∈ (0, ∞). Then Y = 1/X has the
Cauchy distribution with location parameter 0 and scale parameter 1/b.
Proof

As with its standard cousin, the general Cauchy distribution has simple connections with the standard uniform distribution via the
distribution function and quantile function computed above, and in particular, can be simulated via the random quantile method.

Suppose that a ∈ R and b ∈ (0, ∞).


1. If U has the standard uniform distribution, then X = F^{-1}(U) = a + b tan[π(U − 1/2)] has the Cauchy distribution with location parameter a and scale parameter b.
2. If X has the Cauchy distribution with location parameter a and scale parameter b, then U = F(X) = 1/2 + (1/π) arctan((X − a)/b) has the standard uniform distribution.

Open the random quantile experiment and select the Cauchy distribution. Vary the parameters and note again the shape and
location of the distribution and probability density functions. For selected values of the parameters, run the simulation 1000
times and compare the empirical density function to the probability density function. Note the behavior of the empirical mean
and standard deviation.

As before, the random quantile method has a nice physical interpretation. Suppose that a light source is b units away from position a of an infinite, straight wall. We shine the light at the wall at an angle Θ (to the perpendicular) that is uniformly distributed on the interval (−π/2, π/2). Then the position X = a + b tan Θ of the light beam on the wall has the Cauchy distribution with location parameter a and scale parameter b.

Open the Cauchy experiment. For selected values of the parameters, run the simulation 1000 times and compare the empirical
density function to the probability density function. Note the behavior of the empirical mean and standard deviation.

This page titled 5.32: The Cauchy Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

5.33: The Exponential-Logarithmic Distribution

The exponential-logarithmic distribution arises when the rate parameter of the exponential distribution is randomized by the
logarithmic distribution. The exponential-logarithmic distribution has applications in reliability theory in the context of devices or
organisms that improve with age, due to hardening or immunity.

The Standard Exponential-Logarithmic Distribution


Distribution Functions
The standard exponential-logarithmic distribution with shape parameter p ∈ (0, 1) is a continuous distribution on [0, ∞) with probability density function g given by

g(x) = −(1 − p)e^{−x} / {ln(p)[1 − (1 − p)e^{−x}]}, x ∈ [0, ∞) (5.33.1)

1. g is decreasing on [0, ∞) with mode x = 0.
2. g is concave upward on [0, ∞).
Proof

Open the special distribution simulator and select the exponential-logarithmic distribution. Vary the shape parameter and note
the shape of the probability density function. For selected values of the shape parameter, run the simulation 1000 times and
compare the empirical density function to the probability density function.

The distribution function G is given by

G(x) = 1 − ln[1 − (1 − p)e^{−x}] / ln(p), x ∈ [0, ∞) (5.33.5)

Proof

The quantile function G^{-1} is given by

G^{-1}(u) = ln[(1 − p)/(1 − p^{1−u})] = ln(1 − p) − ln(1 − p^{1−u}), u ∈ [0, 1) (5.33.6)

1. The first quartile is q₁ = ln(1 − p) − ln(1 − p^{3/4}).
2. The median is q₂ = ln(1 − p) − ln(1 − p^{1/2}) = ln(1 + √p).
3. The third quartile is q₃ = ln(1 − p) − ln(1 − p^{1/4}).

Proof

Open the special distribution calculator and select the exponential-logarithmic distribution. Vary the shape parameter and note the shape of the distribution and probability density functions. For selected values of the shape parameter, compute a few values of the distribution function and the quantile function.

The reliability function G^c is given by

G^c(x) = ln[1 − (1 − p)e^{−x}] / ln(p), x ∈ [0, ∞) (5.33.7)

Proof

The standard exponential-logarithmic distribution has decreasing failure rate.

The failure rate function r is given by

r(x) = −(1 − p)e^{−x} / {[1 − (1 − p)e^{−x}] ln[1 − (1 − p)e^{−x}]}, x ∈ (0, ∞) (5.33.8)

1. r is decreasing on [0, ∞).
2. r is concave upward on [0, ∞).
Proof

The Polylogarithm
The moments of the standard exponential-logarithmic distribution cannot be expressed in terms of the usual elementary functions,
but can be expressed in terms of a special function known as the polylogarithm.

The polylogarithm of order s ∈ R is defined by

Li_s(x) = ∑_{k=1}^∞ x^k/k^s, x ∈ (−1, 1) (5.33.9)

The polylogarithm is a power series in x with radius of convergence 1 for each s ∈ R.


Proof

In this section, we are only interested in nonnegative integer orders, but the polylogarithm will show up again, for non-integer
orders, in the study of the zeta distribution.

The polylogarithm functions of orders 0, 1, 2, and 3.

1. The polylogarithm of order 0 is

Li₀(x) = ∑_{k=1}^∞ x^k = x/(1 − x), x ∈ (−1, 1) (5.33.11)

2. The polylogarithm of order 1 is

Li₁(x) = ∑_{k=1}^∞ x^k/k = −ln(1 − x), x ∈ (−1, 1) (5.33.12)

3. The polylogarithm of order 2 is known as the dilogarithm.
4. The polylogarithm of order 3 is known as the trilogarithm.

Thus, the polylogarithm of order 0 is a simple geometric series, and the polylogarithm of order 1 is the standard power series for the natural logarithm. Note that the probability density function of X can be written in terms of the polylogarithms of orders 0 and 1:

g(x) = −Li₀[(1 − p)e^{−x}] / ln(p) = Li₀[(1 − p)e^{−x}] / Li₁(1 − p), x ∈ [0, ∞) (5.33.13)

The most important property of the polylogarithm is given in the following theorem:

The polylogarithm satisfies the following recursive integral formula:

Li_{s+1}(x) = ∫₀^x [Li_s(t)/t] dt; s ∈ R, x ∈ (−1, 1) (5.33.14)

Equivalently, x Li′_{s+1}(x) = Li_s(x) for x ∈ (−1, 1) and s ∈ R.
Proof

When s > 1, the polylogarithm series converges at x = 1 also, and

Li_s(1) = ζ(s) = ∑_{k=1}^∞ 1/k^s (5.33.16)

where ζ is the Riemann zeta function, named for Georg Riemann. The polylogarithm can be extended to complex orders and defined for complex z with |z| < 1, but the simpler version suffices for our work here.

Moments
We assume that X has the standard exponential-logarithmic distribution with shape parameter p ∈ (0, 1).

The moments of X (about 0) are

E(X^n) = −n! Li_{n+1}(1 − p) / ln(p) = n! Li_{n+1}(1 − p) / Li₁(1 − p), n ∈ N (5.33.17)

1. E(X^n) → 0 as p ↓ 0
2. E(X^n) → n! as p ↑ 1

Proof

We will get some additional insight into the asymptotics below when we consider the limiting distribution as p ↓ 0 and p ↑ 1 . The
mean and variance of the standard exponential logarithmic distribution follow easily from the general moment formula.

The mean and variance of X are

1. E(X) = −Li₂(1 − p)/ln(p)
2. var(X) = −2 Li₃(1 − p)/ln(p) − [Li₂(1 − p)/ln(p)]²

From the asymptotics of the general moments, note that E(X) → 0 and var(X) → 0 as p ↓ 0, and E(X) → 1 and var(X) → 1 as p ↑ 1.
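
Since only Li₂ and Li₃ are needed, the mean and variance are easy to compute numerically from the defining series. A sketch in Python (NumPy; the function names are ours):

    import numpy as np

    def polylog(s, x, terms=10_000):
        # Li_s(x) = sum_{k=1}^infinity x^k / k^s, valid for |x| < 1
        k = np.arange(1, terms + 1)
        return np.sum(x ** k / k ** s)

    def exp_log_mean_var(p):
        # E(X) = -Li_2(1-p)/ln(p); var(X) = -2 Li_3(1-p)/ln(p) - [Li_2(1-p)/ln(p)]^2
        li2, li3 = polylog(2, 1 - p), polylog(3, 1 - p)
        mean = -li2 / np.log(p)
        return mean, -2 * li3 / np.log(p) - mean ** 2

    print(exp_log_mean_var(0.5))   # both tend to 0 as p -> 0 and to 1 as p -> 1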

Open the special distribution simulator and select the exponential-logarithmic distribution. Vary the shape parameter and note
the size and location of the mean ± standard deviation bar. For selected values of the shape parameter, run the simulation 1000
times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.

Related Distributions
The standard exponential-logarithmic distribution has the usual connections to the standard uniform distribution by means of the
distribution function and the quantile function computed above.

Suppose that p ∈ (0, 1).


1. If U has the standard uniform distribution then

X = ln[(1 − p)/(1 − p^U)] = ln(1 − p) − ln(1 − p^U) (5.33.22)

has the standard exponential-logarithmic distribution with shape parameter p.
2. If X has the standard exponential-logarithmic distribution with shape parameter p then

U = ln[1 − (1 − p)e^{−X}] / ln(p) (5.33.23)

has the standard uniform distribution.


Proof

Since the quantile function of the standard exponential-logarithmic distribution has a simple closed form, the distribution can be simulated using the random quantile method.
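
A sketch of the random quantile method for this distribution (Python/NumPy; the function name is ours):

    import numpy as np

    def rand_exp_log(n, p, rng=None):
        # Quantile function: G^{-1}(u) = ln(1 - p) - ln(1 - p^(1 - u))
        rng = np.random.default_rng() if rng is None else rng
        u = rng.random(n)
        return np.log(1 - p) - np.log(1 - p ** (1 - u))

    x = rand_exp_log(1000, p=0.5)
    print(x.mean(), x.var())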

Open the random quantile experiment and select the exponential-logarithmic distribution. Vary the shape parameter and note
the shape of the distribution and probability density functions. For selected values of the parameter, run the simulation 1000
times and compare the empirical density function to the probability density function.

As the name suggests, the standard exponential-logarithmic distribution arises from the exponential distribution and the logarithmic distribution via a certain type of randomization.

Suppose that T = (T₁, T₂, …) is a sequence of independent random variables, each with the standard exponential distribution. Suppose also that N has the logarithmic distribution with parameter 1 − p ∈ (0, 1) and is independent of T. Then X = min{T₁, T₂, …, T_N} has the standard exponential-logarithmic distribution with shape parameter p.

Proof
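
This randomization is easy to try numerically; SciPy's scipy.stats.logser is the logarithmic (log-series) distribution. A sketch (illustration only) simulates X as the minimum of N standard exponentials and checks uniformity via the transform in the theorem above:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed=11)
    p = 0.3
    # N has the logarithmic (log-series) distribution with parameter 1 - p
    n = stats.logser(1 - p).rvs(size=5000, random_state=rng)
    # X is the minimum of N independent standard exponential variables
    x = np.array([rng.exponential(size=k).min() for k in n])
    # By the uniform connection above, this transform of X should be uniform
    u = np.log1p(-(1 - p) * np.exp(-x)) / np.log(p)
    print(stats.kstest(u, "uniform"))   # large p-value expected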

Also of interest, of course, are the limiting distributions of the standard exponential-logarithmic distribution as p → 0 and as
p → 1.

The standard exponential-logarithmic distribution with shape parameter p ∈ (0, 1) converges to


1. Point mass at 0 as p → 0 .
2. The standard exponential distribution as p → 1 .
Proof

The General Exponential-Logarithmic Distribution


The standard exponential-logarithmic distribution is generalized, like so many distributions on [0, ∞), by adding a scale parameter.

Suppose that Z has the standard exponential-logarithmic distribution with shape parameter p ∈ (0, 1). If b ∈ (0, ∞) , then
X = bZ has the exponential-logarithmic distribution with shape parameter p and scale parameter b .

Using the same terminology as the exponential distribution, 1/b is called the rate parameter.

Distribution Functions
Suppose that X has the exponential-logarithmic distribution with shape parameter p ∈ (0, 1) and scale parameter b ∈ (0, ∞).

X has probability density function f given by

f(x) = −(1 − p)e^{−x/b} / {b ln(p)[1 − (1 − p)e^{−x/b}]}, x ∈ [0, ∞) (5.33.27)

1. f is decreasing on [0, ∞) with mode x = 0.
2. f is concave upward on [0, ∞).
Proof

Open the special distribution simulator and select the exponential-logarithmic distribution. Vary the shape and scale parameters
and note the shape and location of the probability density function. For selected values of the parameters, run the simulation
1000 times and compare the empirical density function to the probability density function.

X has distribution function F given by

F(x) = 1 − ln[1 − (1 − p)e^{−x/b}] / ln(p), x ∈ [0, ∞) (5.33.28)

Proof

X has quantile function F^{-1} given by

F^{-1}(u) = b ln[(1 − p)/(1 − p^{1−u})] = b[ln(1 − p) − ln(1 − p^{1−u})], u ∈ [0, 1) (5.33.29)

1. The first quartile is q₁ = b[ln(1 − p) − ln(1 − p^{3/4})].
2. The median is q₂ = b[ln(1 − p) − ln(1 − p^{1/2})] = b ln(1 + √p).
3. The third quartile is q₃ = b[ln(1 − p) − ln(1 − p^{1/4})].
Proof

Open the special distribution calculator and select the exponential-logarithmic distribution. Vary the shape and scale parameters and note the shape and location of the probability density and distribution functions. For selected values of the parameters, compute a few values of the distribution function and the quantile function.

X has reliability function F^c given by

F^c(x) = ln[1 − (1 − p)e^{−x/b}] / ln(p), x ∈ [0, ∞) (5.33.30)

Proof

The exponential-logarithmic distribution has decreasing failure rate.

The failure rate function R of X is given by

R(x) = −(1 − p)e^{−x/b} / {b[1 − (1 − p)e^{−x/b}] ln[1 − (1 − p)e^{−x/b}]}, x ∈ [0, ∞) (5.33.31)

1. R is decreasing on [0, ∞).
2. R is concave upward on [0, ∞).
Proof

Moments
Suppose again that X has the exponential-logarithmic distribution with shape parameter p ∈ (0, 1) and scale parameter b ∈ (0, ∞).
The moments of X can be computed easily from the representation X = bZ, where Z has the standard exponential-logarithmic distribution.

The moments of X (about 0) are

E(X^n) = −b^n n! Li_{n+1}(1 − p) / ln(p), n ∈ N (5.33.32)

1. E(X^n) → 0 as p ↓ 0
2. E(X^n) → b^n n! as p ↑ 1
Proof

The mean and variance of X are

1. E(X) = −b Li₂(1 − p)/ln(p)
2. var(X) = b²(−2 Li₃(1 − p)/ln(p) − [Li₂(1 − p)/ln(p)]²)

From the general moment results, note that E(X) → 0 and var(X) → 0 as p ↓ 0, while E(X) → b and var(X) → b² as p ↑ 1.

Open the special distribution simulator and select the exponential-logarithmic distribution. Vary the shape and scale parameters
and note the size and location of the mean ± standard deviation bar. For selected values of the parameters, run the simulation
1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.

Related Distributions
Since the exponential-logarithmic distribution is a scale family for each value of the shape parameter, it is trivially closed under
scale transformations.

Suppose that X has the exponential-logarithmic distribution with shape parameter p ∈ (0, 1) and scale parameter b ∈ (0, ∞).
If c ∈ (0, ∞), then Y = cX has the exponential-logarithmic distribution with shape parameter p and scale parameter bc.
Proof

Once again, the exponential-logarithmic distribution has the usual connections to the standard uniform distribution by means of the
distribution function and quantile function computed above.

Suppose that p ∈ (0, 1) and b ∈ (0, ∞).

1. If U has the standard uniform distribution then

X = b ln[(1 − p)/(1 − p^U)] = b[ln(1 − p) − ln(1 − p^U)] (5.33.33)

has the exponential-logarithmic distribution with shape parameter p and scale parameter b.
2. If X has the exponential-logarithmic distribution with shape parameter p and scale parameter b, then

U = ln[1 − (1 − p)e^{−X/b}] / ln(p) (5.33.34)

has the standard uniform distribution.


Proof

Again, since the quantile function of the exponential-logarithmic distribution has a simple closed form, the distribution can be
simulated using the random quantile method.

Open the random quantile experiment and select the exponential-logarithmic distribution. Vary the shape and scale parameters
and note the shape and location of the distribution and probability density functions. For selected values of the parameters, run
the simulation 1000 times and compare the empirical density function to the probability density function.

Suppose that T = (T₁, T₂, …) is a sequence of independent random variables, each with the exponential distribution with scale parameter b ∈ (0, ∞). Suppose also that N has the logarithmic distribution with parameter 1 − p ∈ (0, 1) and is independent of T. Then X = min{T₁, T₂, …, T_N} has the exponential-logarithmic distribution with shape parameter p and scale parameter b.
Proof

The limiting distributions as p ↓ 0 and as p ↑ 1 also follow easily from the corresponding results for the standard case.

For fixed b ∈ (0, ∞), the exponential-logarithmic distribution with shape parameter p ∈ (0, 1) and scale parameter b

converges to
1. Point mass at 0 as p ↓ 0 .
2. The exponential distribution with scale parameter b as p ↑ 1 .
Proof

This page titled 5.33: The Exponential-Logarithmic Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated
by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history
is available upon request.

5.34: The Gompertz Distribution

The Gompertz distribution, named for Benjamin Gompertz, is a continuous probability distribution on [0, ∞) that has exponentially increasing failure rate. Unfortunately, the death rate of adult humans increases exponentially, so the Gompertz distribution is widely used in actuarial science.

The Basic Gompertz Distribution


Distribution Functions
We will start by giving the reliability function, since most applications of the Gompertz distribution deal with mortality.

The basic Gompertz distribution with shape parameter a ∈ (0, ∞) is a continuous distribution on [0, ∞) with reliability function G^c given by

G^c(x) = exp[−a(e^x − 1)], x ∈ [0, ∞) (5.34.1)

The special case a = 1 gives the standard Gompertz distribution.


Proof

The distribution function G is given by

G(x) = 1 − exp[−a(e^x − 1)], x ∈ [0, ∞) (5.34.2)

Proof

The quantile function G^{-1} is given by

G^{-1}(p) = ln[1 − (1/a) ln(1 − p)], p ∈ [0, 1) (5.34.3)

1. The first quartile is q₁ = ln[1 + (ln 4 − ln 3)/a].
2. The median is q₂ = ln(1 + ln 2/a).
3. The third quartile is q₃ = ln(1 + ln 4/a).

Proof

For the standard Gompertz distribution (a = 1), the first quartile is q₁ = ln[1 + (ln 4 − ln 3)] ≈ 0.2529, the median is q₂ = ln(1 + ln 2) ≈ 0.5266, and the third quartile is q₃ = ln(1 + ln 4) ≈ 0.8697.

Open the special distribution calculator and select the Gompertz distribution. Vary the shape parameter and note the shape of the distribution function. For selected values of the shape parameter, compute a few values of the distribution function and the quantile function.

The probability density function g is given by

g(x) = a e^x exp[−a(e^x − 1)], x ∈ [0, ∞) (5.34.4)

1. If a < 1 then g is increasing and then decreasing with mode x = −ln(a).
2. If a ≥ 1 then g is decreasing with mode x = 0.
3. If a < (3 − √5)/2 ≈ 0.382 then g is concave up, then down, then up again, with inflection points at x = ln[(3 ± √5)/(2a)].
4. If (3 − √5)/2 ≤ a < (3 + √5)/2 ≈ 2.618 then g is concave down and then up, with inflection point at x = ln[(3 + √5)/(2a)].
5. If a ≥ (3 + √5)/2 then g is concave up.

Proof

So for the standard Gompertz distribution (a = 1), the inflection point is x = ln[(3 + √5)/2] ≈ 0.9624.

Open the special distribution simulator and select the Gompertz distribution. Vary the shape parameter and note the shape of
the probability density function. For selected values of the shape parameter, run the simulation 1000 times and compare the
empirical density function to the probability density function.

Finally, as promised, the Gompertz distribution has exponentially increasing failure rate.

The failure rate function r is given by r(x) = a e^x for x ∈ [0, ∞).

Proof

Moments
The moments of the basic Gompertz distribution cannot be given in simple closed form, but the mean and moment generating
function can at least be expressed in terms of a special function known as the exponential integral. There are many variations on
the exponential integral, but for our purposes, the following version is best:

The exponential integral with parameter a ∈ (0, ∞) is the function E_a : R → (0, ∞) defined by

E_a(t) = ∫₁^∞ u^t e^{−au} du, t ∈ R (5.34.7)

For the remainder of this discussion, we assume that X has the basic Gompertz distribution with shape parameter a ∈ (0, ∞) .

X has moment generating function m given by

m(t) = E(e^{tX}) = a e^a E_a(t), t ∈ R (5.34.8)

Proof

It follows that X has moments of all orders. Here is the mean:

X has mean E(X) = e^a E_a(−1).
Proof

If X has the standard Gompertz distribution, E(X) ≈ 0.5963.
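
This version of the exponential integral is closely related to the classical one: E_a(−1) = ∫₁^∞ u^{−1} e^{−au} du, which is scipy.special.exp1(a). A sketch checking the mean numerically (an illustration, assuming SciPy):

    import numpy as np
    from scipy.special import exp1

    def gompertz_mean(a):
        # E(X) = e^a E_a(-1), and E_a(-1) = exp1(a) since
        # E_a(-1) = integral from 1 to infinity of e^(-a u) / u du
        return np.exp(a) * exp1(a)

    print(gompertz_mean(1.0))            # about 0.5963 for the standard case

    # Check against the random quantile method with a = 1
    rng = np.random.default_rng(seed=2)
    u = rng.random(100_000)
    x = np.log(1 - np.log(1 - u))        # G^{-1}(u) with a = 1
    print(x.mean())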

Open the special distribution simulator and select the Gompertz distribution. Vary the shape parameter and note the size and
location of the mean ± standard deviation bar. For selected values of the parameter, run the simulation 1000 times and
compare the empirical mean and standard deviation to the distribution mean and standard deviation.

Related Distributions
The basic Gompertz distribution has the usual connections to the standard uniform distribution by means of the distribution
function and quantile function computed above.

Suppose that a ∈ (0, ∞).

1. If U has the standard uniform distribution then X = ln[1 − (1/a) ln(U)] has the basic Gompertz distribution with shape parameter a.
2. If X has the basic Gompertz distribution with shape parameter a then U = exp[−a(e^X − 1)] has the standard uniform distribution.
Proof

Since the quantile function of the basic Gompertz distribution has a simple closed form, the distribution can be simulated using the
random quantile method.

Open the random quantile experiment and select the Gompertz distribution. Vary the shape parameter and note the shape of the
distribution and probability density functions. For selected values of the parameter, run the simulation 1000 times and compare
the empirical density function, mean, and standard deviation to their distributional counterparts.

The basic Gompertz distribution also has simple connections to the exponential distribution.

Suppose that a ∈ (0, ∞).

1. If X has the basic Gompertz distribution with shape parameter a, then Y = e^X − 1 has the exponential distribution with rate parameter a.
2. If Y has the exponential distribution with rate parameter a, then X = ln(Y + 1) has the basic Gompertz distribution with shape parameter a.
Proof

In particular, if Y has the standard exponential distribution (rate parameter 1), then X = ln(Y + 1) has the standard Gompertz distribution (shape parameter 1). Since the exponential distribution is a scale family (the scale parameter is the reciprocal of the rate parameter), we can construct an arbitrary basic Gompertz variable from a standard exponential variable. Specifically, if Y has the standard exponential distribution and a ∈ (0, ∞), then

X = ln[(1/a)Y + 1] (5.34.14)

has the basic Gompertz distribution with shape parameter a.
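
A sketch of this construction (Python/NumPy; an illustration under our own names):

    import numpy as np

    rng = np.random.default_rng()
    a = 0.5
    y = rng.exponential(size=10_000)     # standard exponential sample
    x = np.log(y / a + 1)                # basic Gompertz with shape parameter a
    # Check via the uniform connection: exp[-a(e^X - 1)] should be uniform
    u = np.exp(-a * np.expm1(x))
    print(u.mean(), u.std())             # near 1/2 and 1/sqrt(12) ≈ 0.289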


The extreme value distribution (Gumbel distribution) is also related to the Gompertz distribution.

If X has the standard extreme value distribution for minimums, then the conditional distribution of X given X ≥0 is the
standard Gompertz distribution.
Proof

The General Gompertz Distribution


The basic Gompertz distribution is generalized, like so many distributions on [0, ∞), by adding a scale parameter. Recall that scale
transformations often correspond to a change of units (minutes to hours, for example) and thus are fundamental.

If Z has the basic Gompertz distribution with shape parameter a ∈ (0, ∞) and b ∈ (0, ∞) then X = bZ has the Gompertz
distribution with shape parameter a and scale parameter b .

Distribution Functions
Suppose that X has the Gompertz distribution with shape parameter a ∈ (0, ∞) and scale parameter b ∈ (0, ∞).

X has reliability function F^c given by

F^c(x) = P(X > x) = exp[−a(e^{x/b} − 1)], x ∈ [0, ∞) (5.34.16)
Proof

X has distribution function F given by

F(x) = P(X ≤ x) = 1 − exp[−a(e^{x/b} − 1)], x ∈ [0, ∞) (5.34.17)

Proof

X has quantile function F^{-1} given by

F^{-1}(p) = b ln[1 − (1/a) ln(1 − p)], p ∈ [0, 1) (5.34.18)

1. The first quartile is q₁ = b ln[1 + (ln 4 − ln 3)/a].
2. The median is q₂ = b ln(1 + ln 2/a).
3. The third quartile is q₃ = b ln(1 + ln 4/a).

Proof

Open the special distribution calculator and select the Gompertz distribution. Vary the shape and scale parameters and note the shape and location of the distribution function. For selected values of the parameters, compute a few values of the distribution function and the quantile function.

X has probability density function f given by

f(x) = (a/b) e^{x/b} exp[−a(e^{x/b} − 1)], x ∈ [0, ∞) (5.34.19)

1. If a < 1 then f is increasing and then decreasing with mode x = −b ln(a).
2. If a ≥ 1 then f is decreasing with mode x = 0.
3. If a < (3 − √5)/2 ≈ 0.382 then f is concave up, then down, then up again, with inflection points at x = b ln[(3 ± √5)/(2a)].
4. If (3 − √5)/2 ≤ a < (3 + √5)/2 ≈ 2.618 then f is concave down and then up, with inflection point at x = b ln[(3 + √5)/(2a)].
5. If a ≥ (3 + √5)/2 then f is concave up.
Proof

Open the special distribution simulator and select the Gompertz distribution. Vary the shape and scale parameters and note the
shape and location of the probability density function. For selected values of the parameters, run the simulation 1000 times and
compare the empirical density function to the probability density function.

Once again, X has exponentially increasing failure rate.

X has failure rate function R given by

R(x) = (a/b) e^{x/b}, x ∈ [0, ∞) (5.34.20)

Proof

Moments
As with the basic distribution, the moment generating function and mean of the general Gompertz distribution can be expressed in
terms of the exponential integral. Suppose again that X has the Gompertz distribution with shape parameter a ∈ (0, ∞) and scale
parameter b ∈ (0, ∞).

X has moment generating function M given by

M(t) = E(e^{tX}) = a e^a E_a(bt), t ∈ R (5.34.21)

Proof

X has mean E(X) = b e^a E_a(−1).
Proof

Open the special distribution simulator and select the Gompertz distribution. Vary the shape and scale parameters and note the
size and location of the mean ± standard deviation bar. For selected values of the parameters, run the simulation 1000 times
and compare the empirical mean and standard deviation to the distribution mean and standard deviation.

Related Distributions
Since the Gompertz distribution is a scale family for each value of the shape parameter, it is trivially closed under scale
transformations.

Suppose that X has the Gompertz distribution with shape parameter a ∈ (0, ∞) and scale parameter b ∈ (0, ∞). If c ∈ (0, ∞)
then Y = cX has the Gompertz distribution with shape parameter a and scale parameter bc.
Proof

As with the basic distribution, the Gompertz distribution has the usual connections with the standard uniform distribution by means
of the distribution function and quantile function computed above.

Suppose that a, b ∈ (0, ∞).

1. If U has the standard uniform distribution then X = b ln[1 − (1/a) ln(U)] has the Gompertz distribution with shape parameter a and scale parameter b.
2. If X has the Gompertz distribution with shape parameter a and scale parameter b, then U = exp[−a(e^{X/b} − 1)] has the standard uniform distribution.


Proof

Again, since the quantile function of the Gompertz distribution has a simple closed form, the distribution can be simulated using
the random quantile method.

Open the random quantile experiment and select the Gompertz distribution. Vary the shape and scale parameters and note the
shape and location of the distribution and probability density functions. For selected values of the parameters, run the
simulation 1000 times and note the agreement between the empirical density function and the probability density function.

The following result is a slight generalization of the connection above between the basic Gompertz distribution and the extreme
value distribution.

If X has the extreme value distribution for minimums with scale parameter b > 0 , then the conditional distribution of X given
X ≥ 0 is the Gompertz distribution with shape parameter 1 and scale parameter b .

Proof

Finally, we give a slight generalization of the connection above between the Gompertz distribution and the exponential distribution.

Suppose that a, b ∈ (0, ∞).

1. If X has the Gompertz distribution with shape parameter a and scale parameter b, then Y = e^{X/b} − 1 has the exponential distribution with rate parameter a.
2. If Y has the exponential distribution with rate parameter a, then X = b ln(Y + 1) has the Gompertz distribution with shape parameter a and scale parameter b.
Proof

As a corollary, we can construct a general Gompertz variable from a standard exponential variable. Specifically, if Y has the standard exponential distribution and if a, b ∈ (0, ∞) then

X = b ln[(1/a)Y + 1] (5.34.22)

has the Gompertz distribution with shape parameter a and scale parameter b.

This page titled 5.34: The Gompertz Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

5.35: The Log-Logistic Distribution

As the name suggests, the log-logistic distribution is the distribution of a variable whose logarithm has the logistic distribution. The
log-logistic distribution is often used to model random lifetimes, and hence has applications in reliability.

The Basic Log-Logistic Distribution


Distribution Functions

The basic log-logistic distribution with shape parameter k ∈ (0, ∞) is a continuous distribution on [0, ∞) with distribution function G given by

G(z) = z^k/(1 + z^k), z ∈ [0, ∞) (5.35.1)

In the special case that k = 1, the distribution is the standard log-logistic distribution.
Proof

The probability density function g is given by

g(z) = k z^{k−1}/(1 + z^k)², z ∈ (0, ∞) (5.35.3)

1. If 0 < k < 1, g is decreasing with g(z) → ∞ as z ↓ 0.
2. If k = 1, g is decreasing with mode z = 0.
3. If k > 1, g increases and then decreases with mode z = [(k − 1)/(k + 1)]^{1/k}.
4. If k ≤ 1, g is concave upward.
5. If 1 < k ≤ 2, g is concave downward and then upward, with inflection point at

z = {[2(k − 1)² + 2k√(3(k² − 1))] / [(k + 1)(k + 2)]}^{1/k} (5.35.4)

6. If k > 2, g is concave upward then downward then upward again, with inflection points at

z = {[2(k − 1)² ± 2k√(3(k² − 1))] / [(k + 1)(k + 2)]}^{1/k} (5.35.5)

Proof

So g has a rich variety of shapes, and is unimodal if k > 1 . When k ≥ 1 , g is defined at 0 as well.

Open the special distribution simulator and select the log-logistic distribution. Vary the shape parameter and note the shape of
the probability density function. For selected values of the shape parameter, run the simulation 1000 times and compare the
empirical density function to the probability density function.

The quantile function G^{-1} is given by

G^{-1}(p) = [p/(1 − p)]^{1/k}, p ∈ [0, 1) (5.35.8)

1. The first quartile is q₁ = (1/3)^{1/k}.
2. The median is q₂ = 1.
3. The third quartile is q₃ = 3^{1/k}.

Proof

Recall that p/(1 − p) is the odds ratio associated with probability p ∈ (0, 1). Thus, the quantile function of the basic log-logistic distribution with shape parameter k is the kth root of the odds ratio function. In particular, the quantile function of the standard log-logistic distribution is the odds ratio function itself. Also of interest is that the median is 1 for every value of the shape parameter.

Open the special distribution calculator and select the log-logistic distribution. Vary the shape parameter and note the shape of the distribution and probability density functions. For selected values of the shape parameter, compute a few values of the distribution function and the quantile function.

The reliability function G^c is given by

G^c(z) = 1/(1 + z^k), z ∈ [0, ∞) (5.35.9)

Proof

The basic log-logistic distribution has either decreasing failure rate, or mixed decreasing-increasing failure rate, depending on the
shape parameter.

The failure rate function r is given by

r(z) = k z^{k−1}/(1 + z^k), z ∈ (0, ∞) (5.35.10)

1. If 0 < k ≤ 1, r is decreasing.
2. If k > 1, r decreases and then increases with minimum at z = (k − 1)^{1/k}.
Proof

If k ≥ 1 , r is defined at 0 also.

Moments
Suppose that Z has the basic log-logistic distribution with shape parameter k ∈ (0, ∞). The moments (about 0) of Z have an interesting expression in terms of the beta function B and in terms of the sine function. The simplest representation is in terms of a new special function constructed from the sine function.

The (normalized) cardinal sine function sinc is defined by

sinc(x) = sin(πx)/(πx), x ∈ R (5.35.12)

where it is understood that sinc(0) = 1 (the limiting value).

Figure 5.35.1: The graph of the sinc function on the interval [−10, 10]

If n ≥ k then E(Z^n) = ∞. If 0 ≤ n < k then

E(Z^n) = B(1 − n/k, 1 + n/k) = 1/sinc(n/k) (5.35.13)

Proof

In particular, we can give the mean and variance.

If k > 1 then

E(Z) = 1/sinc(1/k) (5.35.16)

If k > 2 then

var(Z) = 1/sinc(2/k) − 1/sinc²(1/k) (5.35.17)
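
Conveniently, NumPy's np.sinc is exactly this normalized cardinal sine, so the mean and variance are one-liners. A sketch comparing them with a simulated sample (illustration only; the function name is ours):

    import numpy as np

    def loglogistic_mean_var(k):
        # Valid for k > 2; np.sinc(x) = sin(pi x)/(pi x)
        mean = 1 / np.sinc(1 / k)
        return mean, 1 / np.sinc(2 / k) - mean ** 2

    print(loglogistic_mean_var(3))

    # Simulate via the quantile function Z = [U/(1 - U)]^(1/k)
    rng = np.random.default_rng(seed=8)
    u = rng.random(200_000)
    z = (u / (1 - u)) ** (1 / 3)
    print(z.mean(), z.var())             # convergence is slow: heavy tails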

Open the special distribution simulator and select the log-logistic distribution. Vary the shape parameter and note the size and
location of the mean ± standard deviation bar. For selected values of the shape parameter, run the simulation 1000 times and
compare the empirical mean and standard deviation to the distribution mean and standard deviation.

Related Distributions
The basic log-logistic distribution is preserved under power transformations.

If Z has the basic log-logistic distribution with shape parameter k ∈ (0, ∞) and if n ∈ (0, ∞), then W = Z^n has the basic log-logistic distribution with shape parameter k/n.

Proof

In particular, it follows that if V has the standard log-logistic distribution and k ∈ (0, ∞), then Z = V^{1/k} has the basic log-logistic distribution with shape parameter k.
The log-logistic distribution has the usual connections with the standard uniform distribution by means of the distribution function
and the quantile function given above.

Suppose that k ∈ (0, ∞).

1. If U has the standard uniform distribution then Z = G^{-1}(U) = [U/(1 − U)]^{1/k} has the basic log-logistic distribution with shape parameter k.
2. If Z has the basic log-logistic distribution with shape parameter k then U = G(Z) = Z^k/(1 + Z^k) has the standard uniform distribution.

Since the quantile function of the basic log-logistic distribution has a simple closed form, the distribution can be simulated using
the random quantile method.

Open the random quantile experiment and select the log-logistic distribution. Vary the shape parameter and note the shape of the distribution and probability density functions. For selected values of the parameter, run the simulation 1000 times and compare the empirical density function, mean, and standard deviation to their distributional counterparts.

Of course, as mentioned in the introduction, the log-logistic distribution is related to the logistic distribution.

Suppose that k, b ∈ (0, ∞).

1. If Z has the basic log-logistic distribution with shape parameter k then Y = ln Z has the logistic distribution with location parameter 0 and scale parameter 1/k.
2. If Y has the logistic distribution with location parameter 0 and scale parameter b then Z = e^Y has the basic log-logistic distribution with shape parameter 1/b.


Proof
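
A quick numerical check of this connection, assuming SciPy (illustration only):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed=21)
    k = 2.0
    u = rng.random(5000)
    z = (u / (1 - u)) ** (1 / k)         # basic log-logistic, shape parameter k
    y = np.log(z)                        # should be logistic with scale 1/k
    print(stats.kstest(y, stats.logistic(scale=1 / k).cdf))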

As a special case (and as noted in the proof), if Z has the standard log-logistic distribution, then Y = ln Z has the standard logistic distribution, and if Y has the standard logistic distribution, then Z = e^Y has the standard log-logistic distribution.

The standard log-logistic distribution is the same as the standard beta prime distribution.
Proof

Of course, limiting distributions with respect to parameters are always interesting.

The basic log-logistic distribution with shape parameter k ∈ (0, ∞) converges to point mass at 1 as k → ∞ .
Proof from the definition
Random variable proof

The General Log-Logistic Distribution


The basic log-logistic distribution is generalized, like so many distributions on [0, ∞), by adding a scale parameter. Recall that a
scale transformation often corresponds to a change of units (gallons into liters, for example), and so such transformations are of
basic importance.

If Z has the basic log-logistic distribution with shape parameter k ∈ (0, ∞) and if b ∈ (0, ∞) then X = bZ has the log-
logistic distribution with shape parameter k and scale parameter b .

Distribution Functions
Suppose that X has the log-logistic distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞).

X has distribution function F given by

F(x) = x^k/(b^k + x^k), x ∈ [0, ∞) (5.35.21)

Proof

X has probability density function f given by

f(x) = b^k k x^{k−1}/(b^k + x^k)², x ∈ (0, ∞) (5.35.22)

When k ≥ 1, f is defined at 0 also. f satisfies the following properties:

1. If 0 < k < 1, f is decreasing with f(x) → ∞ as x ↓ 0.
2. If k = 1, f is decreasing with mode x = 0.
3. If k > 1, f increases and then decreases with mode x = b[(k − 1)/(k + 1)]^{1/k}.
4. If k ≤ 1, f is concave upward.
5. If 1 < k ≤ 2, f is concave downward and then upward, with inflection point at

x = b{[2(k − 1)² + 2k√(3(k² − 1))] / [(k + 1)(k + 2)]}^{1/k} (5.35.23)

6. If k > 2, f is concave upward then downward then upward again, with inflection points at

x = b{[2(k − 1)² ± 2k√(3(k² − 1))] / [(k + 1)(k + 2)]}^{1/k} (5.35.24)
Proof

Open the special distribution simulator and select the log-logistic distribution. Vary the shape and scale parameters and note the
shape of the probability density function. For selected values of the parameters, run the simulation 1000 times and compare the
empirical density function to the probability density function.

X has quantile function F^{-1} given by

F^{-1}(p) = b[p/(1 − p)]^{1/k}, p ∈ [0, 1) (5.35.25)

1. The first quartile is q₁ = b(1/3)^{1/k}.
2. The median is q₂ = b.
3. The third quartile is q₃ = b 3^{1/k}.
Proof

Open the special distribution calculator and select the log-logistic distribution. Vary the shape and scale parameters and note the shape of the distribution and probability density functions. For selected values of the parameters, compute a few values of the distribution function and the quantile function.

X has reliability function F^c given by

F^c(x) = b^k/(b^k + x^k), x ∈ [0, ∞) (5.35.26)

Proof

The log-logistic distribution has either decreasing failure rate, or mixed decreasing-increasing failure rate, depending on the shape
parameter.

X has failure rate function R given by

R(x) = k x^{k−1}/(b^k + x^k), x ∈ (0, ∞) (5.35.27)

1. If 0 < k ≤ 1, R is decreasing.
2. If k > 1, R decreases and then increases with minimum at x = b(k − 1)^{1/k}.
Proof

Moments
Suppose again that X has the log-logistic distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞). The
moments of X can be computed easily from the representation X = bZ where Z has the basic log-logistic distribution with shape
parameter k . Again, the expressions are simplest in terms of the beta function B and in terms of the normalized cardinal sine
function sinc.

If n ≥ k then E(X^n) = ∞. If 0 ≤ n < k then

E(X^n) = b^n B(1 − n/k, 1 + n/k) = b^n/sinc(n/k) (5.35.28)

If k > 1 then

E(X) = b/sinc(1/k) (5.35.29)

If k > 2 then

var(X) = b²[1/sinc(2/k) − 1/sinc²(1/k)] (5.35.30)

Open the special distribution simulator and select the log-logistic distribution. Vary the shape and scale parameters and note the size and location of the mean ± standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.

Related Distributions
Since the log-logistic distribution is a scale family for each value of the shape parameter, it is trivially closed under scale
transformations.

If X has the log-logistic distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞), and if c ∈ (0, ∞), then
Y = cX has the log-logistic distribution with shape parameter k and scale parameter bc.

Proof

The log-logistic distribution is preserved under power transformations.

If X has the log-logistic distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞), and if n ∈ (0, ∞), then Y = X^n has the log-logistic distribution with shape parameter k/n and scale parameter b^n.
n n

Proof

In particular, if V has the standard log-logistic distribution, then X = bV 1/k


has the log-logistic distribution with shape parameter
k and scale parameter b .

As before, the log-logistic distribution has the usual connections with the standard uniform distribution by means of the distribution
function and the quantile function computed above.

Suppose that k, b ∈ (0, ∞).

1. If U has the standard uniform distribution then X = F^{-1}(U) = b[U/(1 − U)]^{1/k} has the log-logistic distribution with shape parameter k and scale parameter b.
2. If X has the log-logistic distribution with shape parameter k and scale parameter b, then U = F(X) = X^k/(b^k + X^k) has the standard uniform distribution.

Again, since the quantile function of the log-logistic distribution has a simple closed form, the distribution can be simulated using
the random quantile method.

Open the random quantile experiment and select the log-logistic distribution. Vary the shape and scale parameters and note the
shape and location of the distribution and probability density functions. For selected values of the parameters, run the
simulation 1000 times and compare the empirical density function, mean and standard deviation to their distributional
counterparts.

Again, the logarithm of a log-logistic variable has the logistic distribution.

Suppose that k, b, c ∈ (0, ∞) and a ∈ R.

1. If X has the log-logistic distribution with shape parameter k and scale parameter b then Y = ln X has the logistic distribution with location parameter ln b and scale parameter 1/k.
2. If Y has the logistic distribution with location parameter a and scale parameter c then X = e^Y has the log-logistic distribution with shape parameter 1/c and scale parameter e^a.

Proof

Once again, the limiting distribution is also of interest.

For fixed b ∈ (0, ∞), the log-logistic distribution with shape parameter k ∈ (0, ∞) and scale parameter b converges to point
mass at b as k → ∞ .
Proof

This page titled 5.35: The Log-Logistic Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

5.36: The Pareto Distribution

The Pareto distribution is a skewed, heavy-tailed distribution that is sometimes used to model the distribution of incomes and other
financial variables.

The Basic Pareto Distribution


Distribution Functions

The basic Pareto distribution with shape parameter a ∈ (0, ∞) is a continuous distribution on [1, ∞) with distribution
function G given by
G(z) = 1 − 1/z^a,  z ∈ [1, ∞)  (5.36.1)

The special case a = 1 gives the standard Pareto distribution.


Proof

The Pareto distribution is named for the economist Vilfredo Pareto.

The probability density function g is given by


g(z) = a/z^{a+1},  z ∈ [1, ∞)  (5.36.2)

1. g is decreasing with mode z = 1


2. g is concave upward.
Proof

The reason that the Pareto distribution is heavy-tailed is that g decreases at a power rate rather than an exponential rate.

Open the special distribution simulator and select the Pareto distribution. Vary the shape parameter and note the shape of the
probability density function. For selected values of the parameter, run the simulation 1000 times and compare the empirical
density function to the probability density function.

The quantile function G^{−1} is given by

G^{−1}(p) = 1/(1 − p)^{1/a},  p ∈ [0, 1)  (5.36.3)

1. The first quartile is q_1 = (4/3)^{1/a}.
2. The median is q_2 = 2^{1/a}.
3. The third quartile is q_3 = 4^{1/a}.
Proof

Open the special distribution calculator and select the Pareto distribution. Vary the shape parameter and note the shape of the
probability density and distribution functions. For selected values of the parameters, compute a few values of the distribution
and quantile functions.

Moments
Suppose that random variable Z has the basic Pareto distribution with shape parameter a ∈ (0, ∞) . Because the distribution is
heavy-tailed, the mean, variance, and other moments of Z are finite only if the shape parameter a is sufficiently large.

The moments of Z (about 0) are

1. E(Z^n) = a/(a − n) if 0 < n < a
2. E(Z^n) = ∞ if n ≥ a

Proof

It follows that the moment generating function of Z cannot be finite on any interval about 0.

In particular, the mean and variance of Z are

1. E(Z) = a/(a − 1) if a > 1
2. var(Z) = a/[(a − 1)^2 (a − 2)] if a > 2

Proof

In the special distribution simulator, select the Pareto distribution. Vary the parameters and note the shape and location of the
mean ± standard deviation bar. For each of the following parameter values, run the simulation 1000 times and note the
behavior of the empirical moments:
1. a = 1
2. a = 2
3. a = 3

The skewness and kurtosis of Z are as follows:

1. If a > 3,

skew(Z) = [2(1 + a)/(a − 3)] √(1 − 2/a)  (5.36.5)

2. If a > 4,

kurt(Z) = 3(a − 2)(3a^2 + a + 2)/[a(a − 3)(a − 4)]  (5.36.6)

Proof

So the distribution is positively skewed and skew(Z) → 2 as a → ∞ while skew(Z) → ∞ as a ↓ 3. Similarly, kurt(Z) → 9 as
a → ∞ and kurt(Z) → ∞ as a ↓ 4. Recall that the excess kurtosis of Z is

kurt(Z) − 3 = 3(a − 2)(3a^2 + a + 2)/[a(a − 3)(a − 4)] − 3 = 6(a^3 + a^2 − 6a − 2)/[a(a − 3)(a − 4)]  (5.36.7)

Related Distributions
The basic Pareto distribution is invariant under positive powers of the underlying variable.

Suppose that Z has the basic Pareto distribution with shape parameter a ∈ (0, ∞) and that n ∈ (0, ∞). Then W = Z^n has the
basic Pareto distribution with shape parameter a/n.
Proof

In particular, if Z has the standard Pareto distribution and a ∈ (0, ∞), then Z^{1/a} has the basic Pareto distribution with shape
parameter a. Thus, all basic Pareto variables can be constructed from the standard one.
The basic Pareto distribution has a reciprocal relationship with the beta distribution.

Suppose that a ∈ (0, ∞) .


1. If Z has the basic Pareto distribution with shape parameter a then V = 1/Z has the beta distribution with left parameter a
and right parameter 1.
2. If V has the beta distribution with left parameter a and right parameter 1, then Z = 1/V has the basic Pareto distribution
with shape parameter a .
Proof

The basic Pareto distribution has the usual connections with the standard uniform distribution by means of the distribution function
and quantile function computed above.

Suppose that a ∈ (0, ∞).

1. If U has the standard uniform distribution then Z = 1/U^{1/a} has the basic Pareto distribution with shape parameter a.
2. If Z has the basic Pareto distribution with shape parameter a then U = 1/Z^a has the standard uniform distribution.

Proof

Since the quantile function has a simple closed form, the basic Pareto distribution can be simulated using the random quantile
method.
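
For example, the following Python sketch (the names are ours, for illustration) simulates the basic Pareto distribution by the random quantile method and checks the empirical mean against E(Z) = a/(a − 1):

import random

def rand_basic_pareto(a):
    # Z = 1/U^{1/a} where U is standard uniform, from part 1 above
    return 1.0 / random.random() ** (1.0 / a)

a = 3.0
sample = [rand_basic_pareto(a) for _ in range(100000)]
print(sum(sample) / len(sample))  # should be near a/(a - 1) = 1.5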

Open the random quantile experiment and select the Pareto distribution. Vary the shape parameter and note the shape of the
distribution and probability density functions. For selected values of the parameter, run the experiment 1000 times and
compare the empirical density function, mean, and standard deviation to their distributional counterparts.

The basic Pareto distribution also has simple connections to the exponential distribution.

Suppose that a ∈ (0, ∞) .


1. If Z has the basic Pareto distribution with shape parameter a , then T = ln Z has the exponential distribution with rate
parameter a .
2. If T has the exponential distribution with rate parameter a, then Z = e^T has the basic Pareto distribution with shape
parameter a.
Proof

The General Pareto Distribution


As with many other distributions that govern positive variables, the Pareto distribution is often generalized by adding a scale
parameter. Recall that a scale transformation often corresponds to a change of units (dollars into Euros, for example) and thus such
transformations are of basic importance.

Suppose that Z has the basic Pareto distribution with shape parameter a ∈ (0, ∞) and that b ∈ (0, ∞) . Random variable
X = bZ has the Pareto distribution with shape parameter a and scale parameter b .

Note that X has a continuous distribution on the interval [b, ∞).

Distribution Functions
Suppose again that X has the Pareto distribution with shape parameter a ∈ (0, ∞) and scale parameter b ∈ (0, ∞).

X has distribution function F given by


F(x) = 1 − (b/x)^a,  x ∈ [b, ∞)  (5.36.13)

Proof

X has probability density function f given by


f(x) = a b^a/x^{a+1},  x ∈ [b, ∞)  (5.36.14)

Proof

Open the special distribution simulator and select the Pareto distribution. Vary the parameters and note the shape and location
of the probability density function. For selected values of the parameters, run the simulation 1000 times and compare the
empirical density function to the probability density function.

X has quantile function F^{−1} given by

F^{−1}(p) = b/(1 − p)^{1/a},  p ∈ [0, 1)  (5.36.15)

1. The first quartile is q_1 = b(4/3)^{1/a}.
2. The median is q_2 = b 2^{1/a}.
3. The third quartile is q_3 = b 4^{1/a}.
Proof

Open the special distribution calculator and select the Pareto distribution. Vary the parameters and note the shape and location
of the probability density and distribution functions. For selected values of the parameters, compute a few values of the
distribution and quantile functions.

Moments
Suppose again that X has the Pareto distribution with shape parameter a ∈ (0, ∞) and scale parameter b ∈ (0, ∞).

The moments of X are given by

1. E(X^n) = b^n a/(a − n) if 0 < n < a
2. E(X^n) = ∞ if n ≥ a
Proof

The mean and variance of X are

1. E(X) = b a/(a − 1) if a > 1
2. var(X) = b^2 a/[(a − 1)^2 (a − 2)] if a > 2

Open the special distribution simulator and select the Pareto distribution. Vary the parameters and note the shape and location
of the mean ± standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the
empirical mean and standard deviation to the distribution mean and standard deviation.

The skewness and kurtosis of X are as follows:

1. If a > 3,

skew(X) = [2(1 + a)/(a − 3)] √(1 − 2/a)  (5.36.16)

2. If a > 4,

kurt(X) = 3(a − 2)(3a^2 + a + 2)/[a(a − 3)(a − 4)]  (5.36.17)

Proof

Related Distributions
Since the Pareto distribution is a scale family for fixed values of the shape parameter, it is trivially closed under scale
transformations.

Suppose that X has the Pareto distribution with shape parameter a ∈ (0, ∞) and scale parameter b ∈ (0, ∞) . If c ∈ (0, ∞)

then Y = cX has the Pareto distribution with shape parameter a and scale parameter bc.
Proof

The Pareto distribution is closed under positive powers of the underlying variable.

Suppose that X has the Pareto distribution with shape parameter a ∈ (0, ∞) and scale parameter b ∈ (0, ∞). If n ∈ (0, ∞)
then Y = X^n has the Pareto distribution with shape parameter a/n and scale parameter b^n.

Proof

All Pareto variables can be constructed from the standard one. If Z has the standard Pareto distribution and a, b ∈ (0, ∞) then
X = bZ^{1/a} has the Pareto distribution with shape parameter a and scale parameter b.

As before, the Pareto distribution has the usual connections with the standard uniform distribution by means of the distribution
function and quantile function given above.

Suppose that a, b ∈ (0, ∞).

1. If U has the standard uniform distribution then X = b/U^{1/a} has the Pareto distribution with shape parameter a and scale
parameter b.
2. If X has the Pareto distribution with shape parameter a and scale parameter b, then U = (b/X)^a has the standard uniform
distribution.
Proof

Again, since the quantile function has a simple closed form, the Pareto distribution can be simulated using the random
quantile method.

Open the random quantile experiment and select the Pareto distribution. Vary the parameters and note the shape of the
distribution and probability density functions. For selected values of the parameters, run the experiment 1000 times and
compare the empirical density function, mean, and standard deviation to their distributional counterparts.

The Pareto distribution is closed with respect to conditioning on a right-tail event.

Suppose that X has the Pareto distribution with shape parameter a ∈ (0, ∞) and scale parameter b ∈ (0, ∞). For c ∈ [b, ∞) ,
the conditional distribution of X given X ≥ c is Pareto with shape parameter a and scale parameter c .
Proof

Finally, the Pareto distribution is a general exponential distribution with respect to the shape parameter, for a fixed value of the
scale parameter.

Suppose that X has the Pareto distribution with shape parameter a ∈ (0, ∞) and scale parameter b ∈ (0, ∞). For fixed b , the
distribution of X is a general exponential distribution with natural parameter −(a + 1) and natural statistic ln X .
Proof

Computational Exercises
Suppose that the income of a certain population has the Pareto distribution with shape parameter 3 and scale parameter 1000.
Find each of the following:
1. The proportion of the population with incomes between 2000 and 4000.
2. The median income.
3. The first and third quartiles and the interquartile range.
4. The mean income.
5. The standard deviation of income.
6. The 90th percentile.
Answer
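
For readers who want to check their work numerically, here is a short Python sketch (ours, not part of the text) that evaluates each quantity directly from the distribution and quantile functions above:

from math import sqrt

a, b = 3.0, 1000.0                        # shape and scale parameters

F = lambda x: 1.0 - (b / x) ** a          # distribution function
Q = lambda p: b / (1.0 - p) ** (1.0 / a)  # quantile function

print(F(4000) - F(2000))                  # 1. proportion with income in [2000, 4000]
print(Q(0.5))                             # 2. median income
q1, q3 = Q(0.25), Q(0.75)
print(q1, q3, q3 - q1)                    # 3. quartiles and interquartile range
print(b * a / (a - 1))                    # 4. mean income
print(b * sqrt(a / ((a - 1) ** 2 * (a - 2))))  # 5. standard deviation of income
print(Q(0.9))                             # 6. the 90th percentile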

This page titled 5.36: The Pareto Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

5.37: The Wald Distribution

The Wald distribution, named for Abraham Wald, is important in the study of Brownian motion. Specifically, the distribution
governs the first time that a Brownian motion with positive drift hits a fixed, positive value. In Brownian motion, the distribution of
the random position at a fixed time has a normal (Gaussian) distribution, and thus the Wald distribution, which governs the random
time at a fixed position, is sometimes called the inverse Gaussian distribution.

The Basic Wald Distribution


Distribution Functions
As usual, let Φ denote the standard normal distribution function.

The basic Wald distribution with shape parameter λ ∈ (0, ∞) is a continuous distribution on (0, ∞) with distribution function
G given by


G(u) = Φ[√(λ/u) (u − 1)] + e^{2λ} Φ[−√(λ/u) (u + 1)],  u ∈ (0, ∞)  (5.37.1)

The special case λ = 1 gives the standard Wald distribution.


Proof

The probability density function g is given by


g(u) = √(λ/(2πu^3)) exp[−(λ/(2u)) (u − 1)^2],  u ∈ (0, ∞)  (5.37.5)

1. g increases and then decreases with mode u_0 = √(1 + (3/(2λ))^2) − 3/(2λ)
2. g is concave upward then downward then upward again.


Proof

So g has the classic unimodal shape, but the inflection points are very complicated functions of λ. For the mode, note that u_0 ↓ 0 as
λ ↓ 0 and u_0 ↑ 1 as λ ↑ ∞. The probability density function of the standard Wald distribution is

g(u) = √(1/(2πu^3)) exp[−(1/(2u)) (u − 1)^2],  u ∈ (0, ∞)  (5.37.8)

Open the special distribution simulator and select the Wald distribution. Vary the shape parameter and note the shape of the
probability density function. For various values of the parameter, run the simulation 1000 times and compare the empirical
density function to the probability density function.

The quantile function of the standard Wald distribution does not have a simple closed form, so the median and other quantiles must
be approximated.

Open the special distribution calculator and select the Wald distribution. Vary the shape parameter and note the shape of the
distribution and probability density functions. For selected values of the parameter, compute approximate values of the first
quartile, the median, and the third quartile.

Moments
Suppose that random variable U has the standard Wald distribution with shape parameter λ ∈ (0, ∞) .

U has moment generating function m given by


m(t) = E(e^{tU}) = exp[λ(1 − √(1 − 2t/λ))],  t < λ/2  (5.37.9)

Proof

Since the moment generating function is finite in an interval containing 0, the basic Wald distribution has moments of all orders.

The mean and variance of U are

1. E(U) = 1
2. var(U) = 1/λ

Proof

So interestingly, the mean is 1 for all values of the shape parameter, while var(U ) → ∞ as λ ↓ 0 and var(U ) → 0 as λ → ∞ .

Open the special distribution simulator and select the Wald distribution. Vary the shape parameter and note the size and
location of the mean ± standard deviation bar. For various values of the parameters, run the simulation 1000 times and
compare the empirical mean and standard deviation to the distribution mean and standard deviation.

The skewness and kurtosis of U are



1. skew(U ) = 3/ √λ

2. kurt(U ) = 3 + 15/λ

Proof

It follows that the excess kurtosis is kurt(U ) − 3 = 15/λ . Note that skew(U ) → ∞ as λ → 0 and skew(U ) → 0 as λ → ∞ .
Similarly, kurt(U ) → ∞ as λ → 0 and kurt(U ) → 3 as λ → ∞ .

The General Wald Distribution


The basic Wald distribution is generalized into a scale family. Scale parameters often correspond to a change of units, and so are of
basic importance.

Suppose that λ, μ ∈ (0, ∞) and that U has the basic Wald distribution with shape parameter λ/μ . Then X = μU has the
Wald distribution with shape parameter λ and mean μ .

Justification for the name of the parameter μ as the mean is given below. Note that the generalization is consistent—when μ =1

we have the basic Wald distribution with shape parameter λ .
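
As noted above, the Wald quantile function has no simple closed form, so the random quantile method used for other distributions in this chapter is not convenient here. A standard alternative, not described in this text, is the transformation sampler of Michael, Schucany, and Haas; the Python sketch below (ours, under that assumption) uses it, in the parameterization by mean μ and shape λ:

import math, random

def rand_wald(mu, lam):
    # Michael-Schucany-Haas sampler for the Wald (inverse Gaussian) distribution
    y = random.gauss(0.0, 1.0) ** 2      # chi-square(1) variate
    x = mu + mu * mu * y / (2 * lam) \
        - (mu / (2 * lam)) * math.sqrt(4 * mu * lam * y + (mu * y) ** 2)
    # accept the smaller root with probability mu/(mu + x), else take mu^2/x
    return x if random.random() <= mu / (mu + x) else mu * mu / x

sample = [rand_wald(2.0, 3.0) for _ in range(100000)]
print(sum(sample) / len(sample))         # should be near the mean mu = 2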

Distribution Functions
Suppose that X has the Wald distribution with shape parameter λ ∈ (0, ∞) and mean μ ∈ (0, ∞) . Again, we let Φ denote the
standard normal distribution function.

X has distribution function F given by

F(x) = Φ[√(λ/x) (x/μ − 1)] + exp(2λ/μ) Φ[−√(λ/x) (x/μ + 1)],  x ∈ (0, ∞)  (5.37.22)

Proof

X has probability density function f given by


f(x) = √(λ/(2πx^3)) exp[−λ(x − μ)^2/(2μ^2 x)],  x ∈ (0, ∞)  (5.37.24)

1. f increases and then decreases with mode x_0 = μ[√(1 + (3μ/(2λ))^2) − 3μ/(2λ)]
2. f is concave upward then downward then upward again.


Proof

Once again, the graph of f has the classic unimodal shape, but the inflection points are complicated functions of the parameters.

Open the special distribution simulator and select the Wald distribution. Vary the parameters and note the shape of the
probability density function. For various values of the parameters, run the simulation 1000 times and compare the empirical

density function to the probability density function.

Again, the quantile function cannot be expressed in a simple closed form, so the median and other quantiles must be approximated.

Open the special distribution calculator and select the Wald distribution. Vary the parameters and note the shape of the
distribution and density functions. For selected values of the parameters, compute approximate values of the first quartile, the
median, and the third quartile.

Moments
Suppose again that X has the Wald distribution with shape parameter λ ∈ (0, ∞) and mean μ ∈ (0, ∞) . By definition, we can
take X = μU where U has the basic Wald distribution with shape parameter λ/μ.

X has moment generating function M given by


M(t) = exp[(λ/μ)(1 − √(1 − 2μ^2 t/λ))],  t < λ/(2μ^2)  (5.37.26)

Proof

As promised, the parameter μ is the mean of Wald distribution.

The mean and variance of X are

1. E(X) = μ
2. var(X) = μ^3/λ

Proof

Open the special distribution simulator and select the Wald distribution. Vary the parameters and note the size and location of
the mean ± standard deviation bar. For various values of the parameters, run the simulation 1000 times and compare the
empirical mean and standard deviation to the distribution mean and standard deviation.

The skewness and kurtosis of X are

1. skew(X) = 3√(μ/λ)
2. kurt(X) = 3 + 15μ/λ

Proof

Related Distributions
As noted earlier, the Wald distribution is a scale family, although neither of the parameters is a scale parameter.

Suppose that X has the Wald distribution with shape parameters λ ∈ (0, ∞) and mean μ ∈ (0, ∞) and that c ∈ (0, ∞). Then
Y = cX has the Wald distribution with shape parameter cλ and mean cμ.

Proof

For the next result, it's helpful to re-parameterize the Wald distribution with the mean μ and the ratio r = λ/μ^2 ∈ (0, ∞). This
parametrization is clearly equivalent, since we can recover the shape parameter from the mean and ratio as λ = rμ^2. Note also that
r = E(X)/var(X), the ratio of the mean to the variance. Finally, note that the moment generating function above becomes

M(t) = exp[rμ(1 − √(1 − 2t/r))],  t < r/2  (5.37.27)

and of course, this function characterizes the Wald distribution with this parametrization. Our next result is that the Wald
distribution is closed under convolution (corresponding to sums of independent variables) when the ratio is fixed.

Suppose that X_1 has the Wald distribution with mean μ_1 ∈ (0, ∞) and ratio r ∈ (0, ∞), that X_2 has the Wald distribution with
mean μ_2 ∈ (0, ∞) and ratio r, and that X_1 and X_2 are independent. Then Y = X_1 + X_2 has the Wald distribution with mean
μ_1 + μ_2 and ratio r.

Proof

In the previous result, note that the shape parameter of X_1 is rμ_1^2, the shape parameter of X_2 is rμ_2^2, and the shape parameter of Y
is λ = r(μ_1 + μ_2)^2. Also, of course, the result generalizes to a sum of any finite number of independent Wald variables, as long as
the ratio is fixed. The next couple of results are simple corollaries.

Suppose that (X_1, X_2, …, X_n) is a sequence of independent variables, each with the Wald distribution with shape parameter
λ ∈ (0, ∞) and mean μ ∈ (0, ∞). Then

1. Y_n = ∑_{i=1}^n X_i has the Wald distribution with shape parameter n^2 λ and mean nμ.
2. M_n = (1/n) ∑_{i=1}^n X_i has the Wald distribution with shape parameter nλ and mean μ.

Proof

In the context of the previous theorem, (X_1, X_2, …, X_n) is a random sample of size n from the Wald distribution, and
M_n = (1/n) ∑_{i=1}^n X_i is the sample mean. The Wald distribution is infinitely divisible:

Suppose that X has the Wald distribution with shape parameter λ ∈ (0, ∞) and mean μ ∈ (0, ∞). For every n ∈ N_+, X has
the same distribution as ∑_{i=1}^n X_i where (X_1, X_2, …, X_n) are independent, and each has the Wald distribution with shape
parameter λ/n^2 and mean μ/n.

The Lévy distribution is related to the Wald distribution, not surprising since the Lévy distribution governs the first time that a
standard Brownian motion hits a fixed positive value.

For fixed λ ∈ (0, ∞) , the Wald distribution with shape parameter λ and mean μ ∈ (0, ∞) converges to the Lévy distribution
with location parameter 0 and scale parameter λ as μ → ∞ .
Proof

The other limiting distribution, this time with the mean fixed, is less interesting.

For fixed μ ∈ (0, ∞), the Wald distribution with shape parameter λ ∈ (0, ∞) and mean μ converges to point mass at μ as
λ → ∞.
Proof

Finally, the Wald distribution is a member of the general exponential family of distributions.

The Wald distribution is a general exponential distribution with natural parameters −λ/(2μ^2) and −λ/2, and natural statistics
X and 1/X.

Proof

This page titled 5.37: The Wald Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

5.38: The Weibull Distribution

In this section, we will study a two-parameter family of distributions that has special importance in reliability.

The Basic Weibull Distribution


Distribution Functions

The basic Weibull distribution with shape parameter k ∈ (0, ∞) is a continuous distribution on [0, ∞) with distribution
function G given by
G(t) = 1 − exp(−t^k),  t ∈ [0, ∞)  (5.38.1)

The special case k = 1 gives the standard Weibull distribution.


Proof

The Weibull distribution is named for Waloddi Weibull. Weibull was not the first person to use the distribution, but was the first to
study it extensively and recognize its wide use in applications. The standard Weibull distribution is the same as the standard
exponential distribution. But as we will see, every Weibull random variable can be obtained from a standard Weibull variable by a
simple deterministic transformation, so the terminology is justified.

The probability density function g is given by


g(t) = k t^{k−1} exp(−t^k),  t ∈ (0, ∞)  (5.38.2)

1. If 0 < k < 1, g is decreasing and concave upward with g(t) → ∞ as t ↓ 0.
2. If k = 1, g is decreasing and concave upward with mode t = 0.
3. If k > 1, g increases and then decreases, with mode t = [(k − 1)/k]^{1/k}.
4. If 1 < k ≤ 2, g is concave downward and then upward, with inflection point at t = [(3(k − 1) + √((5k − 1)(k − 1)))/(2k)]^{1/k}.
5. If k > 2, g is concave upward, then downward, then upward again, with inflection points at t = [(3(k − 1) ± √((5k − 1)(k − 1)))/(2k)]^{1/k}.

Proof

So the Weibull density function has a rich variety of shapes, depending on the shape parameter, and has the classic unimodal shape
when k > 1 . If k ≥ 1 , g is defined at 0 also.

In the special distribution simulator, select the Weibull distribution. Vary the shape parameter and note the shape of the
probability density function. For selected values of the shape parameter, run the simulation 1000 times and compare the
empirical density function to the probability density function.

The quantile function G^{−1} is given by

G^{−1}(p) = [−ln(1 − p)]^{1/k},  p ∈ [0, 1)  (5.38.5)

1. The first quartile is q_1 = (ln 4 − ln 3)^{1/k}.
2. The median is q_2 = (ln 2)^{1/k}.
3. The third quartile is q_3 = (ln 4)^{1/k}.
Proof

Open the special distribution calculator and select the Weibull distribution. Vary the shape parameter and note the shape of the
distribution and probability density functions. For selected values of the parameter, compute the median and the first and third
quartiles.

The reliability function G^c is given by

G^c(t) = exp(−t^k),  t ∈ [0, ∞)  (5.38.6)

Proof

The failure rate function r is given by


r(t) = k t^{k−1},  t ∈ (0, ∞)  (5.38.7)

1. If 0 < k < 1 , r is decreasing with r(t) → ∞ as t ↓ 0 and r(t) → 0 as t → ∞ .


2. If k = 1 , r is constant 1.
3. If k > 1 , r is increasing with r(0) = 0 and r(t) → ∞ as t → ∞ .
Proof

Thus, the Weibull distribution can be used to model devices with decreasing failure rate, constant failure rate, or increasing failure
rate. This versatility is one reason for the wide use of the Weibull distribution in reliability. If k ≥ 1 , r is defined at 0 also.

Moments
Suppose that Z has the basic Weibull distribution with shape parameter k ∈ (0, ∞). The moments of Z, and hence the mean and
variance of Z, can be expressed in terms of the gamma function Γ.

E(Z^n) = Γ(1 + n/k) for n ≥ 0.
Proof

So the Weibull distribution has moments of all orders. The moment generating function, however, does not have a simple, closed
expression in terms of the usual elementary functions.

In particular, the mean and variance of Z are

1. E(Z) = Γ(1 + 1/k)
2. var(Z) = Γ(1 + 2/k) − Γ^2(1 + 1/k)

Note that E(Z) → 1 and var(Z) → 0 as k → ∞ . We will learn more about the limiting distribution below.
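
Since the moments reduce to gamma function values, they are easy to evaluate numerically; here is a small Python sketch (ours, not part of the text) that does so and illustrates the limiting behavior just mentioned:

from math import gamma

def basic_weibull_mean_var(k):
    # E(Z) = Gamma(1 + 1/k), var(Z) = Gamma(1 + 2/k) - Gamma^2(1 + 1/k)
    m1 = gamma(1.0 + 1.0 / k)
    m2 = gamma(1.0 + 2.0 / k)
    return m1, m2 - m1 * m1

for k in (0.5, 1.0, 2.0, 5.0, 50.0):
    print(k, basic_weibull_mean_var(k))  # mean -> 1 and variance -> 0 as k grows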

In the special distribution simulator, select the Weibull distribution. Vary the shape parameter and note the size and location of
the mean ± standard deviation bar. For selected values of the shape parameter, run the simulation 1000 times and compare the
empirical mean and standard deviation to the distribution mean and standard deviation.

The skewness and kurtosis also follow easily from the general moment result above, although the formulas are not particularly
helpful.

Skewness and kurtosis


1. The skewness of Z is

skew(Z) = [Γ(1 + 3/k) − 3Γ(1 + 1/k)Γ(1 + 2/k) + 2Γ^3(1 + 1/k)] / [Γ(1 + 2/k) − Γ^2(1 + 1/k)]^{3/2}  (5.38.10)

2. The kurtosis of Z is

kurt(Z) = [Γ(1 + 4/k) − 4Γ(1 + 1/k)Γ(1 + 3/k) + 6Γ^2(1 + 1/k)Γ(1 + 2/k) − 3Γ^4(1 + 1/k)] / [Γ(1 + 2/k) − Γ^2(1 + 1/k)]^2  (5.38.11)

Proof

Related Distributions
As noted above, the standard Weibull distribution (shape parameter 1) is the same as the standard exponential distribution. More
generally, any basic Weibull variable can be constructed from a standard exponential variable.

Suppose that k ∈ (0, ∞).

1. If U has the standard exponential distribution then Z = U^{1/k} has the basic Weibull distribution with shape parameter k.
2. If Z has the basic Weibull distribution with shape parameter k then U = Z^k has the standard exponential distribution.
Proof
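
For example, part 1 gives a direct way to simulate the basic Weibull distribution from a standard exponential variable; here is a Python sketch (ours, not part of the text):

import random

def rand_basic_weibull(k):
    # Z = U^{1/k} where U is standard exponential, from part 1 above
    return random.expovariate(1.0) ** (1.0 / k)

sample = sorted(rand_basic_weibull(2.0) for _ in range(10001))
print(sample[len(sample) // 2])  # should be near the median (ln 2)^{1/2}, about 0.833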

The basic Weibull distribution has the usual connections with the standard uniform distribution by means of the distribution
function and the quantile function given above.

Suppose that k ∈ (0, ∞).

1. If U has the standard uniform distribution then Z = (−ln U)^{1/k} has the basic Weibull distribution with shape parameter k.
2. If Z has the basic Weibull distribution with shape parameter k then U = exp(−Z^k) has the standard uniform distribution.

Proof

Since the quantile function has a simple, closed form, the basic Weibull distribution can be simulated using the random quantile
method.

Open the random quantile experiment and select the Weibull distribution. Vary the shape parameter and note again the shape of
the distribution and density functions. For selected values of the parameter, run the simulation 1000 times and compare the
empirical density, mean, and standard deviation to their distributional counterparts.

The limiting distribution with respect to the shape parameter is concentrated at a single point.

The basic Weibull distribution with shape parameter k ∈ (0, ∞) converges to point mass at 1 as k → ∞ .
Proof

The General Weibull Distribution


Like most special continuous distributions on [0, ∞), the basic Weibull distribution is generalized by the inclusion of a scale
parameter. A scale transformation often corresponds in applications to a change of units, and for the Weibull distribution this
usually means a change in time units.

Suppose that Z has the basic Weibull distribution with shape parameter k ∈ (0, ∞) . For b ∈ (0, ∞), random variable X = bZ
has the Weibull distribution with shape parameter k and scale parameter b .

Generalizations of the results given above follow easily from basic properties of the scale transformation.

Distribution Functions
Suppose that X has the Weibull distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞).

X has distribution function F given by

F(t) = 1 − exp[−(t/b)^k],  t ∈ [0, ∞)  (5.38.12)

Proof

X has probability density function f given by


f(t) = (k/b^k) t^{k−1} exp[−(t/b)^k],  t ∈ (0, ∞)  (5.38.13)

1. If 0 < k < 1, f is decreasing and concave upward with f(t) → ∞ as t ↓ 0.
2. If k = 1, f is decreasing and concave upward with mode t = 0.
3. If k > 1, f increases and then decreases, with mode t = b[(k − 1)/k]^{1/k}.
4. If 1 < k ≤ 2, f is concave downward and then upward, with inflection point at t = b[(3(k − 1) + √((5k − 1)(k − 1)))/(2k)]^{1/k}.
5. If k > 2, f is concave upward, then downward, then upward again, with inflection points at t = b[(3(k − 1) ± √((5k − 1)(k − 1)))/(2k)]^{1/k}.

Proof

Open the special distribution simulator and select the Weibull distribution. Vary the parameters and note the shape of the
probability density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical
density function to the probability density function.

X has quantile function F^{−1} given by

F^{−1}(p) = b[−ln(1 − p)]^{1/k},  p ∈ [0, 1)  (5.38.14)

1. The first quartile is q_1 = b(ln 4 − ln 3)^{1/k}.
2. The median is q_2 = b(ln 2)^{1/k}.
3. The third quartile is q_3 = b(ln 4)^{1/k}.

Proof

Open the special distribution calculator and select the Weibull distribution. Vary the parameters and note the shape of the
distribution and probability density functions. For selected values of the parameters, compute the median and the first and third
quartiles.

X has reliability function F^c given by

F^c(t) = exp[−(t/b)^k],  t ∈ [0, ∞)  (5.38.15)

Proof

As before, the Weibull distribution has decreasing, constant, or increasing failure rates, depending only on the shape parameter.

X has failure rate function R given by

R(t) = k t^{k−1}/b^k,  t ∈ (0, ∞)  (5.38.16)

1. If 0 < k < 1, R is decreasing with R(t) → ∞ as t ↓ 0 and R(t) → 0 as t → ∞.
2. If k = 1, R is constant 1/b.
3. If k > 1, R is increasing with R(0) = 0 and R(t) → ∞ as t → ∞.

Moments
Suppose again that X has the Weibull distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞). Recall that by
definition, we can take X = bZ where Z has the basic Weibull distribution with shape parameter k .

E(X^n) = b^n Γ(1 + n/k) for n ≥ 0.
Proof

In particular, the mean and variance of X are

1. E(X) = b Γ(1 + 1/k)
2. var(X) = b^2 [Γ(1 + 2/k) − Γ^2(1 + 1/k)]

Note that E(X) → b and var(X) → 0 as k → ∞ .

Open the special distribution simulator and select the Weibull distribution. Vary the parameters and note the size and location
of the mean ± standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the

empirical mean and standard deviation to the distribution mean and standard deviation.

Skewness and kurtosis


1. The skewness of X is

skew(X) = [Γ(1 + 3/k) − 3Γ(1 + 1/k)Γ(1 + 2/k) + 2Γ^3(1 + 1/k)] / [Γ(1 + 2/k) − Γ^2(1 + 1/k)]^{3/2}  (5.38.17)

2. The kurtosis of X is

kurt(X) = [Γ(1 + 4/k) − 4Γ(1 + 1/k)Γ(1 + 3/k) + 6Γ^2(1 + 1/k)Γ(1 + 2/k) − 3Γ^4(1 + 1/k)] / [Γ(1 + 2/k) − Γ^2(1 + 1/k)]^2  (5.38.18)

Proof

Related Distributions
Since the Weibull distribution is a scale family for each value of the shape parameter, it is trivially closed under scale
transformations.

Suppose that X has the Weibull distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞) . If c ∈ (0, ∞)

then Y = cX has the Weibull distribution with shape parameter k and scale parameter bc.
Proof

The exponential distribution is a special case of the Weibull distribution, the case corresponding to constant failure rate.

The Weibull distribution with shape parameter 1 and scale parameter b ∈ (0, ∞) is the exponential distribution with scale
parameter b .
Proof

More generally, any Weibull distributed variable can be constructed from the standard variable. The following result is a simple
generalization of the connection between the basic Weibull distribution and the exponential distribution.

Suppose that k, b ∈ (0, ∞) .


1. If X has the standard exponential distribution (parameter 1), then Y = b X^{1/k} has the Weibull distribution with shape
parameter k and scale parameter b.
2. If Y has the Weibull distribution with shape parameter k and scale parameter b, then X = (Y/b)^k has the standard
exponential distribution.
Proof

The Rayleigh distribution, named for William Strutt, Lord Rayleigh, is also a special case of the Weibull distribution.

The Rayleigh distribution with scale parameter b ∈ (0, ∞) is the Weibull distribution with shape parameter 2 and scale
parameter b√2.
Proof

Recall that the minimum of independent, exponentially distributed variables also has an exponential distribution (and the rate
parameter of the minimum is the sum of the rate parameters of the variables). The Weibull distribution has a similar, but more
restricted property.

Suppose that (X_1, X_2, …, X_n) is an independent sequence of variables, each having the Weibull distribution with shape
parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞). Then U = min{X_1, X_2, …, X_n} has the Weibull distribution with
shape parameter k and scale parameter b/n^{1/k}.

Proof

As before, the Weibull distribution has the usual connections with the standard uniform distribution by means of the distribution
function and the quantile function given above.

Suppose that k, b ∈ (0, ∞).

1. If U has the standard uniform distribution then X = b(−ln U)^{1/k} has the Weibull distribution with shape parameter k and
scale parameter b.
2. If X has the Weibull distribution with shape parameter k and scale parameter b then U = exp[−(X/b)^k] has the standard uniform
distribution.
Proof

Again, since the quantile function has a simple, closed form, the Weibull distribution can be simulated using the random quantile
method.

Open the random quantile experiment and select the Weibull distribution. Vary the parameters and note again the shape of the
distribution and density functions. For selected values of the parameters, run the simulation 1000 times and compare the
empirical density, mean, and standard deviation to their distributional counterparts.

The limiting distribution with respect to the shape parameter is concentrated at a single point.

The Weibull distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞) converges to point mass at b as
k → ∞.

Proof

Finally, the Weibull distribution is a member of the family of general exponential distributions if the shape parameter is fixed.

Suppose that X has the Weibull distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞). For fixed k , X

has a general exponential distribution with respect to b , with natural parameter k − 1 and natural statistics ln X .
Proof

Computational Exercises
The lifetime T of a device (in hours) has the Weibull distribution with shape parameter k = 1.2 and scale parameter b = 1000.
1. Find the probability that the device will last at least 1500 hours.
2. Approximate the mean and standard deviation of T .
3. Compute the failure rate function.
Answer
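
The following Python sketch (ours, not part of the text) evaluates the answers directly from the reliability function, the gamma-function moments, and the failure rate function given above:

from math import exp, gamma, sqrt

k, b = 1.2, 1000.0

print(exp(-(1500.0 / b) ** k))           # 1. P(T >= 1500), the reliability at 1500 hours

mean = b * gamma(1 + 1 / k)              # 2. mean and standard deviation of T
var = b ** 2 * (gamma(1 + 2 / k) - gamma(1 + 1 / k) ** 2)
print(mean, sqrt(var))

R = lambda t: k * t ** (k - 1) / b ** k  # 3. failure rate R(t) = k t^{k-1} / b^k
print(R(1500.0))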

This page titled 5.38: The Weibull Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

5.39: Benford's Law

Benford's law refers to probability distributions that seem to govern the significant digits in real data sets. The law is named for the
American physicist and engineer Frank Benford, although the “law” was actually discovered earlier by the astronomer and mathematician
Simon Newcomb.
To understand Benford's law, we need some preliminaries. Recall that a positive real number x can be written uniquely in the form
x = y · 10^n (sometimes called scientific notation) where y ∈ [1/10, 1) is the mantissa and n ∈ Z is the exponent (both of these terms are
base 10, of course). Note that


log x = log y + n (5.39.1)

where the logarithm function is the base 10 common logarithm instead of the usual base e natural logarithm. In the old days BC (before
calculators), one would compute the logarithm of a number by looking up the logarithm of the mantissa in a table of logarithms, and then
adding the exponent. Of course, these remarks apply to any base b > 1 , not just base 10. Just replace 10 with b and the common logarithm
with the base b logarithm.

Distribution of the Mantissa


Distribution Functions
Suppose now that X is a number selected at random from a certain data set of positive numbers. Based on empirical evidence from a
number of different types of data, Newcomb, and later Benford, noticed that the mantissa Y of X seemed to have distribution function
F (y) = 1 + log y for y ∈ [1/10, 1). We will generalize this to an arbitrary base b > 1 .

The Benford mantissa distribution with base b ∈ (1, ∞) is a continuous distribution on [1/b, 1) with distribution function F given by

F(y) = 1 + log_b y,  y ∈ [1/b, 1)  (5.39.2)

The special case b = 10 gives the standard Benford mantissa distribution.


Proof

The probability density function f is given by


f(y) = 1/(y ln b),  y ∈ [1/b, 1)  (5.39.3)

1. f is decreasing with mode y = 1/b.
2. f is concave upward.
Proof

Open the Special Distribution Simulator and select the Benford mantissa distribution. Vary the base b and note the shape of the
probability density function. For various values of b , run the simulation 1000 times and compare the empirical density function to the
probability density function.

The quantile function F^{−1} is given by

F^{−1}(p) = 1/b^{1−p},  p ∈ [0, 1]  (5.39.4)

1. The first quartile is F^{−1}(1/4) = 1/b^{3/4}.
2. The median is F^{−1}(1/2) = 1/√b.
3. The third quartile is F^{−1}(3/4) = 1/b^{1/4}.

Proof

Numerical values of the quartiles for the standard (base 10) distribution are given in an exercise below.

Open the special distribution calculator and select the Benford mantissa distribution. Vary the base and note the shape and location of
the distribution and probability density functions. For selected values of the base, compute the median and the first and third quartiles.

Moments
Assume that Y has the Benford mantissa distribution with base b ∈ (1, ∞).

The moments of Y are

E(Y^n) = (b^n − 1)/(n b^n ln b),  n ∈ (0, ∞)  (5.39.5)

Proof

Note that for fixed n > 0, E(Y^n) → 1 as b ↓ 1 and E(Y^n) → 0 as b → ∞. We will learn more about the limiting distribution below. The
mean and variance follow easily from the general moment result.

Mean and variance

1. The mean of Y is

E(Y) = (b − 1)/(b ln b)  (5.39.7)

2. The variance of Y is

var(Y) = [(b − 1)/(b^2 ln b)] [(b + 1)/2 − (b − 1)/ln b]  (5.39.8)

Numerical values of the mean and variance for the standard (base 10) distribution are given in an exercise below.

In the Special Distribution Simulator, select the Benford mantissa distribution. Vary the base b and note the size and location of the
mean ± standard deviation bar. For selected values of b , run the simulation 1000 times and compare the empirical mean and standard
deviation to the distribution mean and standard deviation.

Related Distributions
The Benford mantissa distribution has the usual connections to the standard uniform distribution by means of the distribution function and
quantile function given above.

Suppose that b ∈ (1, ∞).

1. If U has the standard uniform distribution then Y = b^{−U} has the Benford mantissa distribution with base b.
2. If Y has the Benford mantissa distribution with base b then U = −log_b Y has the standard uniform distribution.

Proof

Since the quantile function has a simple closed form, the Benford mantissa distribution can be simulated using the random quantile method.
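
In Python, the random quantile method for the mantissa takes one line; the sketch below (ours, for illustration) also compares the empirical mean with E(Y) = (b − 1)/(b ln b):

import math, random

def rand_benford_mantissa(b):
    # Y = b^{-U} where U is standard uniform, from part 1 above
    return b ** (-random.random())

b = 10
sample = [rand_benford_mantissa(b) for _ in range(100000)]
print(sum(sample) / len(sample))     # empirical mean
print((b - 1) / (b * math.log(b)))   # E(Y) = (b - 1)/(b ln b), about 0.391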

Open the random quantile experiment and select the Benford mantissa distribution. Vary the base b and note again the shape and
location of the distribution and probability density functions. For selected values of b , run the simulation 1000 times and compare the
empirical density function, mean, and standard deviation to their distributional counterparts.

Also of interest, of course, are the limiting distributions of Y with respect to the base b .

The Benford mantissa distribution with base b ∈ (1, ∞) converges to


1. Point mass at 1 as b ↓ 1 .
2. Point mass at 0 as b ↑ ∞ .
Proof

Since the probability density function is bounded on a bounded support interval, the Benford mantissa distribution can also be simulated via
the rejection method.

Open the rejection method experiment and select the Benford mantissa distribution. Vary the base b and note again the shape and
location of the probability density functions. For selected values of b , run the simulation 1000 times and compare the empirical density
function, mean, and standard deviation to their distributional counterparts.

Distributions of the Digits
Assume now that the base is a positive integer b ∈ {2, 3, …}, which of course is the case in standard number systems. Suppose that the
sequence of digits of our mantissa Y (in base b) is (N_1, N_2, …), so that

Y = ∑_{k=1}^∞ N_k/b^k  (5.39.9)

Thus, our leading digit N_1 takes values in {1, 2, …, b − 1}, while each of the other significant digits takes values in {0, 1, …, b − 1}.
Note that (N_1, N_2, …) is a stochastic process, so at least we would like to know the finite dimensional distributions. That is, we would like
to know the joint probability density function of the first k digits for every k ∈ N_+. But let's start, appropriately enough, with the first digit
law. The leading digit is the most important one, and fortunately also the easiest to analyze mathematically.

First Digit Law


N_1 has probability density function g_1 given by g_1(n) = log_b(1 + 1/n) = log_b(n + 1) − log_b(n) for n ∈ {1, 2, …, b − 1}. The
density function g_1 is decreasing and hence the mode is n = 1.

Proof

Note that when b = 2, N_1 = 1 deterministically, which of course has to be the case: the first significant digit of a number in base 2 must be
1. Numerical values of g_1 for the standard (base 10) distribution are given in an exercise below.

In the Special Distribution Simulator, select the Benford first digit distribution. Vary the base b with the input control and note the shape
of the probability density function. For various values of b , run the simulation 1000 times and compare the empirical density function to
the probability density function.

N_1 has distribution function G_1 given by G_1(x) = log_b(⌊x⌋ + 1) for x ∈ [1, b − 1].
Proof

N_1 has quantile function G_1^{−1} given by G_1^{−1}(p) = ⌈b^p − 1⌉ for p ∈ (0, 1].

1. The first quartile is ⌈b^{1/4} − 1⌉.
2. The median is ⌈b^{1/2} − 1⌉.
3. The third quartile is ⌈b^{3/4} − 1⌉.

Proof

Numerical values of the quantiles for the standard (base 10) distribution are given in an exercise below.

Open the special distribution calculator and choose the Benford first digit distribution. Vary the base and note the shape and location of
the distribution and probability density functions. For selected values of the base, compute the median and the first and third quartiles.

For the most part the moments of N_1 do not have simple expressions. However, we do have the following result for the mean.

E(N_1) = (b − 1) − log_b[(b − 1)!].
Proof

Numerical values of the mean and variance for the standard (base 10) distribution are given in an exercise below.

Open the Special Distribution Simulator and select the Benford first digit distribution. Vary the base b with the input control and note
the size and location of the mean ± standard deviation bar. For various values of b, run the simulation 1000 times and compare the
empirical mean and standard deviation to the distribution mean and standard deviation.

Since the quantile function has a simple, closed form, the Benford first digit distribution can be simulated via the random quantile method.

Open the random quantile experiment and select the Benford first digit distribution. Vary the base b and note again the shape and
location of the probability density function. For selected values of the base, run the experiment 1000 times and compare the empirical
density function, mean, and standard deviation to their distributional counterparts.

Higher Digits
Now, to compute the joint probability density function of the first k significant digits, some additional notation will help.

If n_1 ∈ {1, 2, …, b − 1} and n_j ∈ {0, 1, …, b − 1} for j ∈ {2, 3, …, k}, let

[n_1 n_2 ⋯ n_k]_b = ∑_{j=1}^k n_j b^{k−j}  (5.39.13)

Of course, this is just the base b version of what we do in our standard base 10 system: we represent integers as strings of digits between 0
and 9 (except that the first digit cannot be 0). Here is a base 5 example:
[324]_5 = 3 · 5^2 + 2 · 5^1 + 4 · 5^0 = 89  (5.39.14)

The joint probability density function h_k of (N_1, N_2, …, N_k) is given by

h_k(n_1, n_2, …, n_k) = log_b(1 + 1/[n_1 n_2 ⋯ n_k]_b),  n_1 ∈ {1, 2, …, b − 1}, (n_2, …, n_k) ∈ {0, 1, …, b − 1}^{k−1}  (5.39.15)

Proof

The probability density function of (N_1, N_2) in the standard (base 10) case is given in an exercise below. Of course, the probability density
function of a given digit can be obtained by summing the joint probability density over the unwanted digits in the usual way. However,
except for the first digit, these functions do not reduce to simple expressions.

The probability density function g_2 of N_2 is given by

g_2(n) = ∑_{k=1}^{b−1} log_b(1 + 1/[k n]_b) = ∑_{k=1}^{b−1} log_b(1 + 1/(kb + n)),  n ∈ {0, 1, …, b − 1}  (5.39.18)

The probability density function of N_2 in the standard (base 10) case is given in an exercise below.
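
As an illustration (a Python sketch of ours, not part of the text), the first and second digit laws in the standard base 10 case can be tabulated directly from the formulas for g_1 and g_2:

from math import log10

# first digit law: g1(n) = log10(1 + 1/n), n = 1, ..., 9
g1 = {n: log10(1 + 1 / n) for n in range(1, 10)}

# second digit law: g2(n) = sum over k = 1..9 of log10(1 + 1/(10k + n))
g2 = {n: sum(log10(1 + 1 / (10 * k + n)) for k in range(1, 10)) for n in range(10)}

print(g1[1], g1[9])   # about 0.3010 and 0.0458
print(g2[0], g2[9])   # much flatter: about 0.1197 and 0.0850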

Theoretical Explanation
Aside from the empirical evidence noted by Newcomb and Benford (and many others since), why does Benford's law work? For a
theoretical explanation, see the article A Statistical Derivation of the Significant Digit Law by Ted Hill.

Computational Exercises
In the following exercises, suppose that Y has the standard Benford mantissa distribution (the base 10 decimal case), and that (N_1, N_2, …)
are the digits of Y.

Find each of the following for the mantissa Y


1. The density function f .
2. The mean and variance
3. The quartiles
Answer

For N_1, find each of the following numerically

1. The probability density function


2. The mean and variance
3. The quartiles
Answer

Explicitly compute the values of the joint probability density function of (N_1, N_2).


Answer

For N_2, find each of the following numerically

1. The probability density function


2. E(N_2)
3. var(N_2)

Answer

Comparing the result for N_1 and the result for N_2, note that the distribution of N_2 is flatter than the distribution of N_1. In general, it
turns out that the distribution of N_k converges to the uniform distribution on {0, 1, …, b − 1} as k → ∞. Interestingly, the digits are
dependent.

N_1 and N_2 are dependent.

Proof

Find each of the following.

1. P(N_1 = 5, N_2 = 3, N_3 = 1)
2. P(N_1 = 3, N_2 = 1, N_3 = 5)
3. P(N_1 = 1, N_2 = 3, N_3 = 5)

This page titled 5.39: Benford's Law is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services)
via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

5.40: The Zeta Distribution

The zeta distribution is used to model the size or ranks of certain types of objects randomly chosen from certain types of
populations. Typical examples include the frequency of occurrence of a word randomly chosen from a text, or the population rank
of a city randomly chosen from a country. The zeta distribution is also known as the Zipf distribution, in honor of the American
linguist George Zipf.

Basic Theory
The Zeta Function

The Riemann zeta function ζ, named after Bernhard Riemann, is defined as follows:

ζ(a) = ∑_{n=1}^∞ 1/n^a,  a ∈ (1, ∞)  (5.40.1)

You might recall from calculus that the series in the zeta function converges for a > 1 and diverges for a ≤ 1 .

Figure 5.40.1 : Graph of ζ on the interval (1, 10]

The zeta function satifies the following properties:


1. ζ is decreasing.
2. ζ is concave upward.
3. ζ(a) ↓ 1 as a ↑ ∞
4. ζ(a) ↑ ∞ as a ↓ 1

The zeta function is transcendental, and most of its values must be approximated. However, ζ(a) can be given explicitly for even
integer values of a; in particular, ζ(2) = π^2/6 and ζ(4) = π^4/90.

The Probability Density Function


The zeta distribution with shape parameter a ∈ (1, ∞) is a discrete distribution on N_+ with probability density function f
given by

f(n) = 1/(ζ(a) n^a),  n ∈ N_+  (5.40.2)

1. f is decreasing with mode n = 1 .


2. When smoothed, f is concave upward.
Proof

Open the special distribution simulator and select the zeta distribution. Vary the shape parameter and note the shape of the
probability density function. For selected values of the parameter, run the simulation 1000 times and compare the empirical
density function to the probability density function.

The distribution function and quantile function do not have simple closed forms, except in terms of other special functions.

Open the special distribution calculator and select the zeta distribution. Vary the parameter and note the shape of the
distribution and probability density functions. For selected values of the parameter, compute the median and the first and third
quartiles.

Moments
Suppose that N has the zeta distribution with shape parameter a ∈ (1, ∞). The moments of N can be expressed easily in terms of
the zeta function.

If k ≥ a − 1, E(N^k) = ∞. If k < a − 1,

E(N^k) = ζ(a − k)/ζ(a)  (5.40.3)

Proof

The mean and variance of N are as follows:

1. If a > 2,

E(N) = ζ(a − 1)/ζ(a)  (5.40.5)

2. If a > 3,

var(N) = ζ(a − 2)/ζ(a) − [ζ(a − 1)/ζ(a)]^2  (5.40.6)
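
Numerically, the mean and variance can be approximated with partial sums of the zeta series; the Python sketch below (ours, using a crude truncation rather than a library zeta function) illustrates this for a = 4:

def zeta(a, terms=10 ** 6):
    # crude partial-sum approximation of the Riemann zeta function, a > 1
    return sum(n ** -a for n in range(1, terms + 1))

a = 4.0
mean = zeta(a - 1) / zeta(a)             # requires a > 2
var = zeta(a - 2) / zeta(a) - mean ** 2  # requires a > 3
print(mean, var)                         # about 1.1106 and 0.2863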

Open the special distribution simulator and select the zeta distribution. Vary the parameter and note the shape and location of
the mean ± standard deviation bar. For selected values of the parameter, run the simulation 1000 times and compare the
empirical mean and standard deviation to the distribution mean and standard deviation.

The skewness and kurtosis of N are as follows:

1. If a > 4,

skew(N) = [ζ(a − 3)ζ^2(a) − 3ζ(a − 1)ζ(a − 2)ζ(a) + 2ζ^3(a − 1)] / [ζ(a − 2)ζ(a) − ζ^2(a − 1)]^{3/2}  (5.40.7)

2. If a > 5,

kurt(N) = [ζ(a − 4)ζ^3(a) − 4ζ(a − 1)ζ(a − 3)ζ^2(a) + 6ζ^2(a − 1)ζ(a − 2)ζ(a) − 3ζ^4(a − 1)] / [ζ(a − 2)ζ(a) − ζ^2(a − 1)]^2  (5.40.8)

Proof

The probability generating function of N can be expressed in terms of the polylogarithm function Li_s that was introduced in the
section on the exponential-logarithmic distribution. Recall that the polylogarithm of order s ∈ R is defined by

Li_s(x) = ∑_{k=1}^∞ x^k/k^s,  x ∈ (−1, 1)  (5.40.9)

N has probability generating function P given by

P(t) = E(t^N) = Li_a(t)/ζ(a),  t ∈ (−1, 1)  (5.40.10)

Proof

Related Distributions
In an algebraic sense, the zeta distribution is a discrete version of the Pareto distribution. Recall that if a > 1, the Pareto
distribution with shape parameter a − 1 is a continuous distribution on [1, ∞) with probability density function

f(x) = (a − 1)/x^a,  x ∈ [1, ∞)  (5.40.12)

Naturally, the limits of the zeta distribution with respect to the shape parameter a are of interest.

The zeta distribution with shape parameter a ∈ (1, ∞) converges to point mass at 1 as a → ∞ .
Proof

Finally, the zeta distribution is a member of the family of general exponential distributions.

Suppose that N has the zeta distribution with parameter a . Then the distribution is a one-parameter exponential family with
natural parameter a and natural statistic − ln N .
Proof

Computational Exercises
Let N denote the frequency of occurrence of a word chosen at random from a certain text, and suppose that N has the zeta
distribution with parameter a = 2. Find P(N > 4).
Answer

Suppose that N has the zeta distribution with parameter a = 6 . Approximate each of the following:
1. E(N )
2. var(N )
3. skew(N )
4. kurt(N )
Answer

This page titled 5.40: The Zeta Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

5.41: The Logarithmic Series Distribution

The logarithmic series distribution, as the name suggests, is based on the standard power series expansion of the natural logarithm
function. It is also sometimes known more simply as the logarithmic distribution.

Basic Theory
Distribution Functions

The logarithmic series distribution with shape parameter p ∈ (0, 1) is a discrete distribution on N+ with probability density
function f given by
f(n) = [1/(−ln(1 − p))] p^n/n,  n ∈ N_+  (5.41.1)

1. f is decreasing with mode n = 1 .


2. When smoothed, f is concave upward.
Proof

Open the Special Distribution Simulator and select the logarithmic series distribution. Vary the parameter and note the shape of
the probability density function. For selected values of the parameter, run the simulation 1000 times and compare the empirical
density function to the probability density function.

The distribution function and the quantile function do not have simple, closed forms in terms of the standard elementary functions.
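
Even so, the distribution is easy to simulate by inverse transform, walking the cumulative sums of the probability density function; here is a Python sketch (ours, not part of the text):

import math, random

def rand_log_series(p):
    # inverse transform: accumulate f(n) until the running sum exceeds U,
    # using the recursion f(n + 1) = f(n) * p * n / (n + 1)
    u = random.random()
    n = 1
    term = -p / math.log(1.0 - p)   # f(1)
    cum = term
    while u > cum:
        term *= p * n / (n + 1)
        n += 1
        cum += term
    return n

sample = [rand_log_series(0.5) for _ in range(100000)]
print(sum(sample) / len(sample))    # the mean when p = 1/2 is 1/ln 2, about 1.4427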

Open the special distribution calculator and select the logarithmic series distribution. Vary the parameter and note the shape of
the distribution and probability density functions. For selected values of the parameters, compute the median and the first and
third quartiles.

Moments
Suppose again that random variable N has the logarithmic series distribution with shape parameter p ∈ (0, 1). Recall that the
permutation formula is n^{(k)} = n(n − 1) ⋯ (n − k + 1) for n ∈ R and k ∈ N. The factorial moments of N are E(N^{(k)}) for
k ∈ N.

The factorial moments of N are given by

E(N^{(k)}) = [(k − 1)!/(−ln(1 − p))] [p/(1 − p)]^k,  k ∈ N_+  (5.41.5)

Proof

The mean and variance of N are

1. E(N) = [1/(−ln(1 − p))] p/(1 − p)  (5.41.9)
2. var(N) = [1/(−ln(1 − p))] [p/(1 − p)^2] [1 − p/(−ln(1 − p))]  (5.41.10)

Proof

Open the special distribution simulator and select the logarithmic series distribution. Vary the parameter and note the shape of
the mean ± standard deviation bar. For selected values of the parameter, run the simulation 1000 times and compare the
empirical mean and standard deviation to the distribution mean and standard deviation.

The probability generating function P of N is given by

P(t) = E(t^N) = ln(1 − pt)/ln(1 − p),  |t| < 1/p  (5.41.12)

Proof

The factorial moments above can also be obtained from the probability generating function, since P^{(k)}(1) = E(N^{(k)}) for
k ∈ N_+.

Related Distributions
Naturally, the limits of the logarithmic series distribution with respect to the parameter p are of interest.

The logarithmic series distribution with shape parameter p ∈ (0, 1) converges to point mass at 1 as p ↓ 0 .
Proof

The logarithmic series distribution is a power series distribution associated with the function g(p) = − ln(1 − p) for
p ∈ [0, 1).

Proof

The moment results above actually follow from general results for power series distributions. The compound Poisson distribution
based on the logarithmic series distribution gives a negative binomial distribution.

Suppose that X = (X_1, X_2, …) is a sequence of independent random variables, each with the logarithmic series distribution
with parameter p ∈ (0, 1). Suppose also that N is independent of X and has the Poisson distribution with rate parameter
r ∈ (0, ∞). Then Y = ∑_{i=1}^N X_i has the negative binomial distribution on N with parameters 1 − p and −r/ln(1 − p).

Proof
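As an informal check of this result, here is a minimal Python simulation sketch. The parameter values are arbitrary choices; since scipy's nbinom is parameterized by a stopping parameter and a success probability, the parameters above become k = −r/ln(1 − p) and 1 − p:

```python
import math
import numpy as np
from scipy.stats import logser, nbinom, poisson

rng = np.random.default_rng(0)
p, r = 0.4, 3.0  # arbitrary shape and rate parameters
reps = 20_000    # number of simulated values of Y

n_values = poisson.rvs(r, size=reps, random_state=rng)
y_values = np.array([logser.rvs(p, size=n, random_state=rng).sum() if n > 0 else 0
                     for n in n_values])

k = -r / math.log(1 - p)  # negative binomial stopping parameter
print(y_values.mean(), nbinom.mean(k, 1 - p))  # the values should be close
```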

This page titled 5.41: The Logarithmic Series Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

CHAPTER OVERVIEW
6: Random Samples
This chapter explores random samples, the fundamental empirical objects of statistics. A random sample is a sequence of independent, identically distributed random variables, and the statistics computed from a random sample, such as the sample mean and the sample variance, are the basic tools of statistical inference. In this chapter we study these statistics and their distributions, along with two of the fundamental theorems of probability: the law of large numbers and the central limit theorem. We also study order statistics, sample correlation and regression, and the special properties of samples from the normal distribution.
6.1: Introduction
6.2: The Sample Mean
6.3: The Law of Large Numbers
6.4: The Central Limit Theorem
6.5: The Sample Variance
6.6: Order Statistics
6.7: Sample Correlation and Regression
6.8: Special Properties of Normal Samples

This page titled 6: Random Samples is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

6.1: Introduction

The Basic Statistical Model


In the basic statistical model, we have a population of objects of interest. The objects could be persons, families, computer chips, or acres of corn. In addition, we have various measurements or variables defined on the objects. We select a sample from the population and record the variables of interest for each object in the sample. Here are a few examples based on the data sets in this project:
In the M&M data, the objects are bags of M&Ms of a specified size. The variables recorded for a bag of M&Ms are net weight
and the counts for red, green, blue, orange, yellow, and brown candies.
In the cicada data, the objects are cicadas from the middle Tennessee area. The variables recorded for a cicada are body weight,
wing length, wing width, body length, gender, and species.
In Fisher's iris data, the objects are irises. The variables recorded for an iris are petal length, petal width, sepal length, sepal
width, and type.
In the Polio data set, the objects are children. Although many variables were probably recorded for a child, the two crucial
variables, both binary, were whether or not the child was vaccinated, and whether or not the child contracted Polio within a
certain time period.
In the Challenger data sets, the objects are Space Shuttle launches. The variables recorded are temperature at the time of launch
and various measures of O-ring erosion of the solid rocket boosters.
In Michelson's data set, the objects are beams of light and the variable recorded is speed.
In Pearson's data set, the objects are father-son pairs. The variables are the height of the father and the height of the son.
In Snow's data set, the objects are persons who died of cholera. The variables record the address of the person.
In one of the SAT data sets, the objects are states and the variables are participation rate, average SAT Math score and average
SAT Verbal score.
Thus, the observed outcome of a statistical experiment (the data) has the form x = (x_1, x_2, …, x_n) where x_i is the vector of measurements for the ith object chosen from the population. The set S of possible values of x (before the experiment is conducted)
is called the sample space. It is literally the space of samples. Thus, although the outcome of a statistical experiment can have quite
a complicated structure (a vector of vectors), the hallmark of mathematical abstraction is the ability to gray out the features that are
not relevant at any particular time, to treat a complex structure as a single object. This we do with the outcome x of the experiment.
The techniques of statistics have been enormously successful; these techniques are widely used in just about every subject that
deals with quantification—the natural sciences, the social sciences, law, and medicine. On the other hand, statistics has a legalistic
quality and a great deal of terminology and jargon that can make the subject a bit intimidating at first. In the rest of this section, we
begin discussing some of this terminology.

The Empirical Distribution


Suppose again that the data have the form x = (x_1, x_2, …, x_n) where x_i is the vector of measurements for the ith object chosen. The empirical distribution associated with x is the probability distribution that places probability 1/n at each x_i. Thus, if the values are distinct, the empirical distribution is the discrete uniform distribution on {x_1, x_2, …, x_n}. More generally, if x occurs k times in the data, then the empirical distribution assigns probability k/n to x. Thus, every finite data set defines a probability distribution.
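For instance, here is a minimal Python sketch that computes the empirical distribution of a small made-up data set, assigning probability k/n to a value that occurs k times:

```python
from collections import Counter

data = [2, 3, 3, 5, 2, 3, 7]  # hypothetical data set
n = len(data)

# Empirical distribution: probability k/n for a value that occurs k times
empirical = {x: k / n for x, k in Counter(data).items()}
print(empirical)  # probabilities 2/7, 3/7, 1/7, 1/7
```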

Statistics
Technically, a statistic w = w(x) is an observable function of the outcome x of the experiment. That is, a statistic is a computable
function defined on the sample space S . The term observable means that the function should not contain any unknown quantities,
because we need to be able to compute the value w of the statistic from the observed data x. As with the data x, a statistic w may
have a complicated structure; typically, w is vector valued. Indeed, the outcome x of the experiment is itself a statistic; all other
statistics are derived from x.
Statistics u and v are equivalent if there exists a one-to-one function r from the range of u onto the range of v such that v = r(u) .
Equivalent statistics give equivalent information about x.

Statistics u and v are equivalent if and only if the following condition holds: for any x ∈ S and y ∈ S , u(x) = u(y) if and
only if v(x) = v(y) .

Equivalence really is an equivalence relation on the collection of statistics for a given statistical experiment. That is, if u, v ,
and w are arbitrary statistics then
1. u is equivalent to u (the reflexive property).
2. If u is equivalent to v then v is equivalent to u (the symmetric property).
3. If u is equivalent to v and v is equivalent to w then u is equivalent to w (the transitive property).

Descriptive and Inferential Statistics


There are two broad branches of statistics. The term descriptive statistics refers to methods for summarizing and displaying the
observed data x. As the name suggests, the methods of descriptive statistics usually involve computing various statistics (in the
technical sense) that give useful information about the data: measures of center and spread, measures of association, and so forth.
In the context of descriptive statistics, the term parameter refers to a characteristic of the entire population.
The deeper and more useful branch of statistics is known as inferential statistics. Our point of view in this branch is that the
statistical experiment (before it is conducted) is a random experiment with a probability measure P on an underlying sample space.
Thus, the outcome x of the experiment is an observed value of a random variable X defined on this probability space, with the
distribution of X not completely known to us. Our goal is to draw inferences about the distribution of X from the observed value
x. Thus, in a sense, inferential statistics is the dual of probability. In probability, we try to predict the value of X assuming complete knowledge of the distribution. In statistics, by contrast, we observe the value x of the random variable X and try to
infer information about the underlying distribution of X. In inferential statistics, a statistic (a function of X) is itself a random
variable with a distribution of its own. On the other hand, the term parameter refers to a characteristic of the distribution of X.
Often the inferential problem is to use various statistics to estimate or test hypotheses about a parameter. Another way to think of
inferential statistics is that we are trying to infer from the empirical distribution associated with the observed data x to the true
distribution associated with X.
There are two basic types of random experiments in the general area of inferential statistics. A designed experiment, as the name
suggests, is carefully designed to study a particular inferential question. The experimenter has considerable control over how the
objects are selected, what variables are to be recorded for these objects, and the values of certain of the variables. In an
observational study, by contrast, the researcher has little control over these factors. Often the researcher is simply given the data set
and asked to make sense out of it. For example, the Polio field trials were designed experiments to study the effectiveness of the
Salk vaccine. The researchers had considerable control over how the children were selected, and how the children were assigned to
the treatment and control groups. By contrast, the Challenger data sets used to explore the relationship between temperature and O-
ring erosion are observational studies. Of course, just because an experiment is designed does not mean that it is well designed.

Difficulties
A number of difficulties can arise when trying to explore an inferential question. Often, problems arise because of confounding
variables, which are variables that (as the name suggests) interfere with our understanding of the inferential question. In the first
Polio field trial design, for example, age and parental consent are two confounding variables that interfere with the determination
of the effectiveness of the vaccine. The entire point of the Berkeley admissions data, to give another example, is to illustrate how a
confounding variable (department) can create a spurious correlation between two other variables (gender and admissions status).
When we correct for the interference caused by a confounding variable, we say that we have controlled for the variable.
Problems also frequently arise because of measurement errors. Some variables are inherently difficult to measure, and systematic
bias in the measurements can interfere with our understanding of the inferential question. The first Polio field trial design again
provides a good example. Knowledge of the vaccination status of the children led to systematic bias by doctors attempting to
diagnose polio in these children. Measurement errors are sometimes caused by hidden confounding variables.
Confounding variables and measurement errors abound in political polling, where the inferential question is who will win an
election. How do confounding variables such as race, income, age, and gender (to name just a few) influence how a person will
vote? How do we know that a person will vote for whom she says she will, or if she will vote at all (measurement errors)? The
Literary Digest poll in the 1936 presidential election and the professional polls in the 1948 presidential election illustrate these
problems.

Confounding variables, measurement errors and other causes often lead to selection bias, which means that the sample does not
represent the population with respect to the inferential question at hand. Often randomization is used to overcome the effects of
confounding variables and measurement errors.

Random Samples
The most common and important special case of the inferential statistical model occurs when the observation variable

$$ X = (X_1, X_2, \ldots, X_n) \tag{6.1.1} $$

is a sequence of independent and identically distributed random variables. Again, in the standard sampling model, X_i is itself a vector of measurements for the ith object in the sample, and thus, we think of (X_1, X_2, …, X_n) as independent copies of an underlying measurement vector X. In this case, (X_1, X_2, …, X_n) is said to be a random sample of size n from the distribution of X.

Variables
The mathematical operations that make sense for a variable in a statistical experiment depend on the type and level of measurement of the variable.

Type
Recall that a real variable x is continuous if the possible values form an interval of real numbers. For example, the weight variable
in the M&M data set, and the length and width variables in Fisher's iris data are continuous. In contrast, a discrete variable is one
whose set of possible values forms a discrete set. For example, the counting variables in the M&M data set, the type variable in
Fisher's iris data, and the denomination and suit variables in the card experiment are discrete. Continuous variables represent
quantities that can, in theory, be measured to any degree of accuracy. In practice, of course, measuring devices have limited
accuracy so data collected from a continuous variable are necessarily discrete. That is, there is only a finite (but perhaps very large)
set of possible values that can actually be measured. So, the distinction between a discrete and continuous variable is based on what
is theoretically possible, not what is actually measured. Some additional examples may help:
A person's age is usually given in years. However, one can imagine age being given in months, or weeks, or even (if the time of
birth is known to a sufficient accuracy) in seconds. Age, whether of devices or persons, is usually considered to be a continuous
variable.
The price of an item is usually given (in the US) in dollars and cents, and of course, the smallest monetary object in circulation
is the penny ($0.01). However, taxes are sometimes given in mills ($0.001), and one can imagine smaller divisions of a dollar,
even if there are no coins to represent these divisions. Measures of wealth are usually thought of as continuous variables.
On the other hand, the number of persons in a car at the time of an accident is a fundamentally discrete variable.

Levels of Measurement
A real variable x is also distinguished by its level of measurement.
Qualitative variables simply encode types or names, and thus few mathematical operations make sense, even if numbers are used
for the encoding. Such variables have the nominal level of measurement. For example, the type variable in Fisher's iris data is
qualitative. Gender, a common variable in many studies of persons and animals, is also qualitative. Qualitative variables are almost
always discrete; it's hard to imagine a continuous infinity of names.
A variable for which only order is meaningful is said to have the ordinal level of measurement; differences are not meaningful
even if numbers are used for the encoding. For example, in many card games, the suits are ranked, so the suit variable has the
ordinal level of measurement. For another example, consider the standard 5-point scale (terrible, bad, average, good, excellent)
used to rank teachers, movies, restaurants etc.
A quantitative variable for which difference, but not ratios are meaningful is said to have the interval level of measurement.
Equivalently, a variable at this level has a relative, rather than absolute, zero value. Typical examples are temperature (in Fahrenheit
or Celsius) or time (clock or calendar).
Finally, a quantitative variable for which ratios are meaningful is said to have the ratio level of measurement. A variable at this
level has an absolute zero value. The count and weight variables in the M&M data set, and the length and width variables in
Fisher's iris data are examples.

Subsamples
In the basic statistical model, subsamples corresponding to some of the variables can be constructed by filtering with respect to
other variables. This is particularly common when the filtering variables are qualitative. Consider the cicada data for example. We
might be interested in the quantitative variables body weight, body length, wing width, and wing length by species, that is,
separately for species 0, 1, and 2. Or, we might be interested in these quantitative variables by gender, that is separately for males
and females.

Exercises
Study Michelson's experiment to measure the velocity of light.
1. Is this a designed experiment or an observational study?
2. Classify the velocity of light variable in terms of type and level of measurement.
3. Discuss possible confounding variables and problems with measurement errors.
Answer

Study Cavendish's experiment to measure the density of the earth.


1. Is this a designed experiment or an observational study?
2. Classify the density of earth variable in terms of type and level of measurement.
3. Discuss possible confounding variables and problems with measurement errors.
Answer

Study Short's experiment to measure the parallax of the sun.


1. Is this a designed experiment or an observational study?
2. Classify the parallax of the sun variable in terms of type and level of measurement.
3. Discuss possible confounding variables and problems with measurement errors.
Answer

In the M&M data, classify each variable in terms of type and level of measurement.
Answer

In the Cicada data, classify each variable in terms of type and level of measurement.
Answer

In Fisher's iris data, classify each variable in terms of type and level of measurement.
Answer

Study the Challenger experiment to explore the relationship between temperature and O-ring erosion.
1. Is this a designed experiment or an observational study?
2. Classify each variable in terms of type and level of measurement.
3. Discuss possible confounding variables and problems with measurement errors.
Answer

In the Vietnam draft data, classify each variable in terms of type and level of measurement.
Answer

In the two SAT data sets, classify each variable in terms of type and level of measurement.
Answer

Study the Literary Digest experiment to predict the outcome of the 1936 presidential election.
1. Is this a designed experiment or an observational study?
2. Classify each variable in terms of type and level of measurement.
3. Discuss possible confounding variables and problems with measurement errors.

Answer

Study the 1948 polls to predict the outcome of the presidential election between Truman and Dewey. Are these designed experiments or observational studies?
Answer

Study Pearson's experiment to explore the relationship between heights of fathers and heights of sons.
1. Is this a designed experiment or an observational study?
2. Classify each variable in terms of type and level of measurement.
3. Discuss possible confounding variables.
Answer

Study the Polio field trials.


1. Are these designed experiments or observational studies?
2. Identify the essential variables and classify each in terms of type and level of measurement.
3. Discuss possible confounding variables and problems with measurement errors.
Answer

Identify the parameters in each of the following:


1. Buffon's Coin Experiment
2. Buffon's Needle Experiment
3. the Bernoulli trials model
4. the Poisson model
Answer

Note the parameters for each of the following families of special distributions:
1. the normal distribution
2. the gamma distribution
3. the beta distribution
4. the Pareto distribution
5. the Weibull distribution
Answer

During World War II, the Allies recorded the serial numbers of captured German tanks. Classify the underlying serial number
variable by type and level of measurement.
Answer

For a discussion of how the serial numbers were used to estimate the total number of tanks, see the section on Order Statistics in
the chapter on Finite Sampling Models.

This page titled 6.1: Introduction is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

6.2: The Sample Mean

Basic Theory
Recall the basic model of statistics: we have a population of objects of interest, and we have various measurements (variables) that
we make on the objects. We select objects from the population and record the variables for the objects in the sample; these become
our data. Our first discussion is from a purely descriptive point of view. That is, we do not assume that the data are generated by an
underlying probability distribution. However, recall that the data themselves define a probability distribution.
Definition and Basic Properties

Suppose that x = (x_1, x_2, …, x_n) is a sample of size n from a real-valued variable. The sample mean is simply the arithmetic average of the sample values:

$$ m = \frac{1}{n} \sum_{i=1}^n x_i \tag{6.2.1} $$

If we want to emphasize the dependence of the mean on the data, we write m(x) instead of just m. Note that m has the same physical units as the underlying variable. For example, if we have a sample of weights of cicadas, in grams, then m is in grams also. The sample mean is frequently used as a measure of center of the data. Indeed, if each x_i is the location of a point mass, then m is the center of mass as defined in physics. In fact, a simple graphical display of the data is the dotplot: on a number line, a dot is placed at x_i for each i. If values are repeated, the dots are stacked vertically. The sample mean m is the balance point of the dotplot. The image below shows a dotplot with the mean as the balance point.

Figure 6.2.1 : A dotplot


The standard notation for the sample mean corresponding to the data x is x̄. We break with tradition and do not use the bar notation
in this text, because it's clunky and because it's inconsistent with the notation for other statistics such as the sample variance,
sample standard deviation, and sample covariance. However, you should be aware of the standard notation, since you will
undoubtedly see it in other sources.
The following exercises establish a few simple properties of the sample mean. Suppose that x = (x_1, x_2, …, x_n) and y = (y_1, y_2, …, y_n) are samples of size n from real-valued population variables and that c is a constant. In vector notation, recall that x + y = (x_1 + y_1, x_2 + y_2, …, x_n + y_n) and cx = (cx_1, cx_2, …, cx_n).

Computing the sample mean is a linear operation.


1. m(x + y) = m(x) + m(y)
2. m(c x) = c m(x)
Proof

The sample mean preserves order.

1. If x_i ≥ 0 for each i then m(x) ≥ 0.
2. If x_i ≥ 0 for each i and x_j > 0 for some j then m(x) > 0.
3. If x_i ≤ y_i for each i then m(x) ≤ m(y).
4. If x_i ≤ y_i for each i and x_j < y_j for some j then m(x) < m(y).

Proof

Trivially, the mean of a constant sample is simply the constant.

If c = (c, c, … , c) is a constant sample then m(c) = c .

Proof

As a special case of these results, suppose that x = (x_1, x_2, …, x_n) is a sample of size n corresponding to a real variable x, and that a and b are constants. Then the sample corresponding to the variable y = a + bx, in our vector notation, is a + bx. The sample means are related in precisely the same way, that is, m(a + bx) = a + b m(x). Linear transformations of this type, when b > 0, arise frequently when physical units are changed. In this case, the transformation is often called a location-scale transformation; a is the location parameter and b is the scale parameter. For example, if x is the length of an object in inches, then y = 2.54x is the length of the object in centimeters. If x is the temperature of an object in degrees Fahrenheit, then y = (5/9)(x − 32) is the temperature of the object in degrees Celsius.
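Here is a minimal Python sketch verifying the location-scale property m(a + bx) = a + b m(x) for the Fahrenheit-to-Celsius transformation just mentioned, with made-up temperatures:

```python
import numpy as np

x = np.array([68.0, 72.5, 75.0, 80.2])  # hypothetical temperatures in degrees Fahrenheit
a, b = -160 / 9, 5 / 9                  # y = (5/9)(x - 32) = a + b x

y = a + b * x
print(y.mean(), a + b * x.mean())       # equal, up to floating point rounding
```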


Sample means are ubiquitous in statistics. In the next few paragraphs we will consider a number of special statistics that are based
on sample means.
The Empirical Distribution
Suppose now that x = (x_1, x_2, …, x_n) is a sample of size n from a general variable taking values in a set S. For A ⊆ S, the frequency of A corresponding to x is the number of data values that are in A:

$$ n(A) = \#\{i \in \{1, 2, \ldots, n\} : x_i \in A\} = \sum_{i=1}^n \mathbf{1}(x_i \in A) \tag{6.2.5} $$

The relative frequency of A corresponding to x is the proportion of data values that are in A:

$$ p(A) = \frac{n(A)}{n} = \frac{1}{n} \sum_{i=1}^n \mathbf{1}(x_i \in A) \tag{6.2.6} $$

Note that for fixed A, p(A) is itself a sample mean, corresponding to the data {1(x_i ∈ A) : i ∈ {1, 2, …, n}}. This fact bears repeating: every sample proportion is a sample mean, corresponding to an indicator variable. In the picture below, the red dots represent the data, so p(A) = 4/15.

Figure 6.2.2 : The empirical probability of A


p is a probability measure on S.

1. p(A) ≥ 0 for every A ⊆ S
2. p(S) = 1
3. If {A_j : j ∈ J} is a countable collection of pairwise disjoint subsets of S then p(⋃_{j∈J} A_j) = ∑_{j∈J} p(A_j)

Proof
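Concretely, the empirical probability of a set is just a sample proportion. A minimal Python sketch, with made-up data and A = [2, 3.5]:

```python
data = [1.2, 3.5, 2.8, 0.4, 3.1, 2.2]  # hypothetical sample
in_A = lambda t: 2.0 <= t <= 3.5       # indicator of the set A = [2, 3.5]

n_A = sum(1 for t in data if in_A(t))  # frequency n(A)
p_A = n_A / len(data)                  # relative frequency p(A)
print(n_A, p_A)                        # 4 and 2/3
```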

This probability measure is known as the empirical probability distribution associated with the data set x. It is a discrete distribution that places probability 1/n at each point x_i. In fact this observation supplies a simpler proof of the previous theorem. Thus, if the data values are distinct, the empirical distribution is the discrete uniform distribution on {x_1, x_2, …, x_n}. More generally, if x ∈ S occurs k times in the data then the empirical distribution assigns probability k/n to x.

If the underlying variable is real-valued, then clearly the sample mean is simply the mean of the empirical distribution. It follows
that the sample mean satisfies all properties of expected value, not just the linear properties and increasing properties given above.
These properties are just the most important ones, and so were repeated for emphasis.
Empirical Density
Suppose now that the population variable x takes values in a set S ⊆ R^d for some d ∈ N+. Recall that the standard measure on R^d is given by

$$ \lambda_d(A) = \int_A 1 \, dx, \quad A \subseteq \mathbb{R}^d \tag{6.2.9} $$

In particular, λ_1(A) is the length of A, for A ⊆ R; λ_2(A) is the area of A, for A ⊆ R²; and λ_3(A) is the volume of A, for A ⊆ R³. Suppose that x is a continuous variable in the sense that λ_d(S) > 0. Typically, S is an interval if d = 1 and a Cartesian product of intervals if d > 1. Now for A ⊆ S with λ_d(A) > 0, the empirical density of A corresponding to x is

$$ D(A) = \frac{p(A)}{\lambda_d(A)} = \frac{1}{n \lambda_d(A)} \sum_{i=1}^n \mathbf{1}(x_i \in A) \tag{6.2.10} $$

Thus, the empirical density of A is the proportion of data values in A, divided by the size of A. In the picture below (corresponding to d = 2), if A has area 5, say, then D(A) = 4/75.

Figure 6.2.3 : The empirical density of A


The Empirical Distribution Function
Suppose again that x = (x_1, x_2, …, x_n) is a sample of size n from a real-valued variable. For x ∈ R, let F(x) denote the relative frequency (empirical probability) of (−∞, x] corresponding to the data set x. Thus, for each x ∈ R, F(x) is the sample mean of the data {1(x_i ≤ x) : i ∈ {1, 2, …, n}}:

$$ F(x) = p\left((-\infty, x]\right) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}(x_i \leq x) \tag{6.2.11} $$

F is a distribution function.
1. F increases from 0 to 1.
2. F is a step function with jumps at the distinct sample values {x_1, x_2, …, x_n}.
Proof

Appropriately enough, F is called the empirical distribution function associated with x and is simply the distribution function of
the empirical distribution corresponding to x. If we know the sample size n and the empirical distribution function F , we can
recover the data, except for the order of the observations. The distinct values of the data are the places where F jumps, and the
number of data values at such a point is the size of the jump, times the sample size n .
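A minimal Python sketch of the empirical distribution function, using made-up data and a sorted copy of the sample:

```python
import numpy as np

data = np.array([2.0, 5.0, 3.0, 5.0, 1.0])  # hypothetical sample
data_sorted = np.sort(data)
n = len(data)

def F(x):
    """Empirical distribution function: the proportion of sample values <= x."""
    return np.searchsorted(data_sorted, x, side="right") / n

print(F(0.5), F(3.0), F(5.0))  # 0.0, 0.6, 1.0
```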
The Empirical Discrete Density Function
Suppose now that x = (x_1, x_2, …, x_n) is a sample of size n from a discrete variable that takes values in a countable set S. For x ∈ S, let f(x) be the relative frequency (empirical probability) of x corresponding to the data set x. Thus, for each x ∈ S, f(x) is the sample mean of the data {1(x_i = x) : i ∈ {1, 2, …, n}}:

$$ f(x) = p(\{x\}) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}(x_i = x) \tag{6.2.12} $$

In the picture below, the dots are the possible values of the underlying variable. The red dots represent the data, and the numbers
indicate repeated values. The blue dots are possible values of the the variable that did not happen to occur in the data. So, the
sample size is 12, and for the value x that occurs 3 times, we have f (x) = 3/12.

Figure 6.2.4 : The discrete probability density function

f is a discrete probability density function:

1. f(x) ≥ 0 for x ∈ S
2. ∑_{x∈S} f(x) = 1

Proof

Appropriately enough, f is called the empirical probability density function or the relative frequency function associated with x, and is simply the probability density function of the empirical distribution corresponding to x. If we know the empirical PDF f and the sample size n, then we can recover the data set, except for the order of the observations.

If the underlying population variable is real-valued, then the sample mean is the expected value computed relative to the empirical density function. That is,

$$ \frac{1}{n} \sum_{i=1}^n x_i = \sum_{x \in S} x f(x) \tag{6.2.14} $$

Proof
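The identity is easy to check numerically. A minimal Python sketch with made-up discrete data:

```python
from collections import Counter

data = [1, 2, 2, 4, 4, 4]  # hypothetical discrete data
n = len(data)

lhs = sum(data) / n                               # the sample mean
f = {x: k / n for x, k in Counter(data).items()}  # the empirical PDF
rhs = sum(x * fx for x, fx in f.items())          # mean of the empirical distribution

print(lhs, rhs)  # both equal 17/6
```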

As we noted earlier, if the population variable is real-valued then the sample mean is the mean of the empirical distribution.
The Empirical Continuous Density Function

Suppose now that x = (x_1, x_2, …, x_n) is a sample of size n from a continuous variable that takes values in a set S ⊆ R^d. Let A = {A_j : j ∈ J} be a partition of S into a countable number of subsets, each of positive, finite measure. Recall that the word partition means that the subsets are pairwise disjoint and their union is S. Let f be the function on S defined by the rule that f(x) is the empirical density of A_j, corresponding to the data set x, for each x ∈ A_j. Thus, f is constant on each of the partition sets:

$$ f(x) = D(A_j) = \frac{p(A_j)}{\lambda_d(A_j)} = \frac{1}{n \lambda_d(A_j)} \sum_{i=1}^n \mathbf{1}(x_i \in A_j), \quad x \in A_j \tag{6.2.16} $$

f is a continuous probability density function.

1. f(x) ≥ 0 for x ∈ S
2. ∫_S f(x) dx = 1

Proof

The function f is called the empirical probability density function associated with the data x and the partition A. For the probability distribution defined by f, the empirical probability p(A_j) is uniformly distributed over A_j for each j ∈ J. In the picture below, the red dots represent the data and the black lines define a partition of S into 9 rectangles. For the partition set A in the upper right, the empirical distribution would distribute probability 3/15 = 1/5 uniformly over A. If the area of A is, say, 4, then f(x) = 1/20 for x ∈ A.

Figure 6.2.5 : Empirical probability density function


Unlike the discrete case, we cannot recover the data from the empirical PDF. If we know the sample size, then of course we can determine the number of data points in A_j for each j, but not the precise location of these points in A_j. For this reason, the mean of the empirical PDF is not in general the same as the sample mean when the underlying variable is real-valued.
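In practice, an empirical PDF over a partition of intervals is what a density histogram routine computes. A minimal Python sketch using numpy.histogram with density=True (illustrative simulated data; the partition of [0, 12] into 6 equal intervals is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=1000)  # hypothetical continuous data

# Bin heights are the empirical densities D(A_j) over the partition intervals
heights, edges = np.histogram(data, bins=6, range=(0, 12), density=True)

# Heights times interval lengths sum to 1, so the heights define a density
print(np.sum(heights * np.diff(edges)))  # 1.0
```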
Histograms
Our next discussion is closely related to the previous one. Suppose again that x = (x_1, x_2, …, x_n) is a sample of size n from a variable that takes values in a set S and that A = (A_1, A_2, …, A_k) is a partition of S into k subsets. The sets in the partition are sometimes known as classes. The underlying variable may be discrete or continuous.


The mapping that assigns frequencies to classes is known as a frequency distribution for the data set and the given partition.

The mapping that assigns relative frequencies to classes is known as a relative frequency distribution for the data set and the
given partition.
In the case of a continuous variable, the mapping that assigns densities to classes is known as a density distribution for the data
set and the given partition.
In dimensions 1 or 2, the bar graph of any of these distributions is known as a histogram. The histogram of a frequency distribution and the histogram of the corresponding relative frequency distribution look the same, except for a change of scale on the vertical axis. If the classes all have the same size, the histogram of the corresponding density distribution also looks the same, again except for a change of scale on the vertical axis. If the underlying variable is real-valued, the classes are usually intervals (discrete or continuous) and the midpoints of these intervals are sometimes referred to as class marks.

Figure 6.2.6 : A density histogram


The whole purpose of constructing a partition and graphing one of these empirical distributions corresponding to the partition is to
summarize and display the data in a meaningful way. Thus, there are some general guidelines in choosing the classes:
1. The number of classes should be moderate.
2. If possible, the classes should have the same size.
For highly skewed distributions, classes of different sizes are appropriate, to avoid numerous classes with very small frequencies.
For a continuous variable with classes of different sizes, it is essential to use a density histogram, rather than a frequency or relative
frequency histogram, otherwise the graphic is visually misleading, and in fact mathematically wrong.
It is important to realize that frequency data is inevitable for a continuous variable. For example, suppose that our variable
represents the weight of a bag of M&Ms (in grams) and that our measuring device (a scale) is accurate to 0.01 grams. If we
measure the weight of a bag as 50.32, then we are really saying that the weight is in the interval [50.315, 50.324)(or perhaps some
other interval, depending on how the measuring device works). Similarly, when two bags have the same measured weight, the
apparent equality of the weights is really just an artifact of the imprecision of the measuring device; actually the two bags almost
certainly do not have the exact same weight. Thus, two bags with the same measured weight really give us a frequency count of 2
for a certain interval.
Again, there is a trade-off between the number of classes and the size of the classes; these determine the resolution of the empirical
distribution corresponding to the partition. At one extreme, when the class size is smaller than the accuracy of the recorded data,
each class contains a single datum or no datum. In this case, there is no loss of information and we can recover the original data set
from the frequency distribution (except for the order in which the data values were obtained). On the other hand, it can be hard to
discern the shape of the data when we have many classes with small frequency. At the other extreme is a frequency distribution
with one class that contains all of the possible values of the data set. In this case, all information is lost, except the number of the
values in the data set. Between these two extreme cases, an empirical distribution gives us partial information, but not complete
information. These intermediate cases can organize the data in a useful way.
Ogives
Suppose now that the underlying variable is real-valued and that the set of possible values is partitioned into intervals (A_1, A_2, …, A_k), with the endpoints of the intervals ordered from smallest to largest. Let n_j denote the frequency of class A_j, so that p_j = n_j/n is the relative frequency of class A_j. Let t_j denote the class mark (midpoint) of class A_j. The cumulative frequency of class A_j is N_j = ∑_{i=1}^j n_i and the cumulative relative frequency of class A_j is P_j = ∑_{i=1}^j p_i = N_j/n. Note that the cumulative frequencies increase from n_1 to n and the cumulative relative frequencies increase from p_1 to 1.

The mapping that assigns cumulative frequencies to classes is known as a cumulative frequency distribution for the data set and the given partition. The polygonal graph that connects the points (t_j, N_j) for j ∈ {1, 2, …, k} is the cumulative frequency ogive.

The mapping that assigns cumulative relative frequencies to classes is known as a cumulative relative frequency distribution for the data set and the given partition. The polygonal graph that connects the points (t_j, P_j) for j ∈ {1, 2, …, k} is the cumulative relative frequency ogive.

Note that the cumulative relative frequency ogive is simply the graph of the distribution function corresponding to the probability distribution that places probability p_j at t_j for each j.

Approximating the Mean


In the setting of the last subsection, suppose that we do not have the actual data x, but just the frequency distribution. An approximate value of the sample mean is

$$ \frac{1}{n} \sum_{j=1}^k n_j t_j = \sum_{j=1}^k p_j t_j \tag{6.2.18} $$

This approximation is based on the hope that the mean of the data values in each class is close to the midpoint of that class. In fact, the expression on the right is the expected value of the distribution that places probability p_j on class mark t_j for each j.
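A minimal Python sketch computing class marks, cumulative relative frequencies (the points of the relative frequency ogive), and the approximate mean for illustrative grouped data:

```python
import numpy as np

edges = np.array([0, 2, 6, 10, 20])  # hypothetical class boundaries
freq = np.array([6, 16, 18, 10])     # class frequencies n_j

n = freq.sum()
t = (edges[:-1] + edges[1:]) / 2     # class marks t_j
p = freq / n                         # relative frequencies p_j
P = np.cumsum(p)                     # cumulative relative frequencies P_j

approx_mean = np.sum(p * t)          # approximate sample mean
print(list(zip(t, P)), approx_mean)
```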

Exercises
Basic Properties

Suppose that x is the temperature (in degrees Fahrenheit) for a certain type of electronic component after 10 hours of operation.

1. Classify x by type and level of measurement.
2. A sample of 30 components has mean 113°. Find the sample mean if the temperature is converted to degrees Celsius. The transformation is y = (5/9)(x − 32).

Answer

Suppose that x is the length (in inches) of a machined part in a manufacturing process.
1. Classify x by type and level of measurement.
2. A sample of 50 parts has mean 10.0. Find the sample mean if length is measured in centimeters. The transformation is
y = 2.54x.

Answer

Suppose that x is the number of brothers and y the number of sisters for a person in a certain population. Thus, z = x + y is the number of siblings.
1. Classify the variables by type and level of measurement.
2. For a sample of 100 persons, m(x) = 0.8 and m(y) = 1.2 . Find m(z).
Answer

Professor Moriarity has a class of 25 students in her section of Stat 101 at Enormous State University (ESU). The mean grade
on the first midterm exam was 64 (out of a possible 100 points). Professor Moriarity thinks the grades are a bit low and is
considering various transformations for increasing the grades. In each case below give the mean of the transformed grades, or
state that there is not enough information.
1. Add 10 points to each grade, so the transformation is y = x + 10 .
2. Multiply each grade by 1.2, so the transformation is z = 1.2x
3. Use the transformation w = 10√x. Note that this is a non-linear transformation that curves the grades greatly at the low end and very little at the high end. For example, a grade of 100 is still 100, but a grade of 36 is transformed to 60.
One of the students did not study at all, and received a 10 on the midterm. Professor Moriarity considers this score to be an
outlier.

4. What would the mean be if this score is omitted?
Answer
Computational Exercises
All statistical software packages will compute means and proportions, draw dotplots and histograms, and in general perform the
numerical and graphical procedures discussed in this section. For real statistical experiments, particularly those with large data sets,
the use of statistical software is essential. On the other hand, there is some value in performing the computations by hand, with
small, artificial data sets, in order to master the concepts and definitions. In this subsection, do the computations and draw the
graphs with minimal technological aids.

Suppose that x is the number of math courses completed by an ESU student. A sample of 10 ESU students gives the data x = (3, 1, 2, 0, 2, 4, 3, 2, 1, 2).

1. Classify x by type and level of measurement.


2. Sketch the dotplot.
3. Compute the sample mean m from the definition and indicate its location on the dotplot.
4. Find the empirical density function f and sketch the graph.
5. Compute the sample mean m using f .
6. Find the empirical distribution function F and sketch the graph.
Answer

Suppose that a sample of size 12 from a discrete variable x has empirical density function given by f (−2) = 1/12 ,
f (−1) = 1/4, f (0) = 1/3, f (1) = 1/6, f (2) = 1/6.

1. Sketch the graph of f .


2. Compute the sample mean m using f .
3. Find the empirical distribution function F
4. Give the sample values, ordered from smallest to largest.
Answer

The following table gives a frequency distribution for the commuting distance to the math/stat building (in miles) for a sample of ESU students.

Class      Freq   Rel Freq   Density   Cum Freq   Cum Rel Freq   Midpoint
(0, 2]     6
(2, 6]     16
(6, 10]    18
(10, 20]   10
Total      50

1. Complete the table


2. Sketch the density histogram
3. Sketch the cumulative relative frequency ogive.
4. Compute an approximation to the mean
Answer
App Exercises

In the interactive histogram, click on the x-axis at various points to generate a data set with at least 20 values. Vary the number
of classes and switch between the frequency histogram and the relative frequency histogram. Note how the shape of the
histogram changes as you perform these operations. Note in particular how the histogram loses resolution as you decrease the
number of classes.

In the interactive histogram, click on the axis to generate a distribution of the given type with at least 30 points. Now vary the
number of classes and note how the shape of the distribution changes.
1. A uniform distribution
2. A symmetric unimodal distribution
3. A unimodal distribution that is skewed right.
4. A unimodal distribution that is skewed left.
5. A symmetric bimodal distribution
6. A u-shaped distribution.

Data Analysis Exercises


Statistical software should be used for the problems in this subsection.

Consider the petal length and species variables in Fisher's iris data.
1. Classify the variables by type and level of measurement.
2. Compute the sample mean and plot a density histogram for petal length.
3. Compute the sample mean and plot a density histogram for petal length by species.
Answers

Consider the erosion variable in the Challenger data set.


1. Classify the variable by type and level of measurement.
2. Compute the mean
3. Plot a density histogram with the classes [0, 5), [5, 40), [40, 50), [50, 60).
Answer

Consider Michelson's velocity of light data.


1. Classify the variable by type and level of measurement.
2. Plot a density histogram.
3. Compute the sample mean.
4. Find the sample mean if the variable is converted to km/hr. The transformation is y = x + 299 000
Answer

Consider Short's parallax of the sun data.


1. Classify the variable by type and level of measurement.
2. Plot a density histogram.
3. Compute the sample mean.
4. Find the sample mean if the variable is converted to degrees. There are 3600 seconds in a degree.
5. Find the sample mean if the variable is converted to radians. There are π/180 radians in a degree.
Answer

Consider Cavendish's density of the earth data.


1. Classify the variable by type and level of measurement.
2. Compute the sample mean.
3. Plot a density histogram.
Answer

Consider the M&M data.


1. Classify the variables by type and level of measurement.
2. Compute the sample mean for each color count variable.
3. Compute the sample mean for the total number of candies, using the results from (b).
4. Plot a relative frequency histogram for the total number of candies.

5. Compute the sample mean and plot a density histogram for the net weight.
Answer

Consider the body weight, species, and gender variables in the Cicada data.
1. Classify the variables by type and level of measurement.
2. Compute the relative frequency function for species and plot the graph.
3. Compute the relative frequeny function for gender and plot the graph.
4. Compute the sample mean and plot a density histogram for body weight.
5. Compute the sample mean and plot a density histogrm for body weight by species.
6. Compute the sample mean and plot a density histogram for body weight by gender.
Answer

Consider Pearson's height data.


1. Classify the variables by type and level of measurement.
2. Compute the sample mean and plot a density histogram for the height of the father.
3. Compute the sample mean and plot a density histogram for the height of the son.
Answer

This page titled 6.2: The Sample Mean is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

6.3: The Law of Large Numbers

Basic Theory
This section continues the discussion of the sample mean from the last section, but we now consider the more interesting setting where the variables are random. Specifically, suppose that we have a basic random experiment with an underlying probability measure P, and that X is a random variable for the experiment. Suppose now that we perform n independent replications of the basic experiment. This defines a new, compound experiment with a sequence of independent random variables X = (X_1, X_2, …, X_n), each with the same distribution as X. Recall that in statistical terms, X is a random sample of size n from the distribution of X.

All of the relevant statistics discussed in the previous section are defined for X, but of course now these statistics are random variables with distributions of their own. For the most part, we use the notation established previously, except for the usual convention of denoting random variables with capital letters. Of course, the deterministic properties and relations established previously apply as well. When we actually run the experiment and observe the values x = (x_1, x_2, …, x_n) of the random variables, then we are precisely in the setting of the previous section.


Suppose now that the basic variable X is real valued, and let μ = E(X) denote the expected value of X and σ² = var(X) the variance of X (assumed finite). The sample mean is

$$ M = \frac{1}{n} \sum_{i=1}^n X_i \tag{6.3.1} $$

Often the distribution mean μ is unknown and the sample mean M is used as an estimator of this unknown parameter.
Moments

The mean and variance of M are

1. E(M) = μ
2. var(M) = σ²/n

Proof

Part (a) means that the sample mean M is an unbiased estimator of the distribution mean μ . Therefore, the variance of M is the
mean square error, when M is used as an estimator of μ . Note that the variance of M is an increasing function of the distribution
variance and a decreasing function of the sample size. Both of these make intuitive sense if we think of the sample mean M as an
estimator of the distribution mean μ . The fact that the mean square error (variance in this case) decreases to 0 as the sample size n
increases to ∞ means that the sample mean M is a consistent estimator of the distribution mean μ .
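The scaling var(M) = σ²/n is easy to see in simulation. A minimal Python sketch, sampling from a normal distribution with known variance (the parameter values and replication counts are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 4.0  # variance of the underlying distribution

for n in [10, 100, 1000]:
    # 10,000 replications of the sample mean of n observations
    means = rng.normal(loc=1.0, scale=np.sqrt(sigma2), size=(10_000, n)).mean(axis=1)
    print(n, means.var(), sigma2 / n)  # empirical variance vs sigma^2 / n
```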
Recall that X_i − M is the deviation of X_i from M, that is, the directed distance from M to X_i. The following theorem states that the sample mean is uncorrelated with each deviation, a result that will be crucial for showing the independence of the sample mean and the sample variance when the sampling distribution is normal.

M and X_i − M are uncorrelated.


Proof

The Weak and Strong Laws of Large Numbers


The law of large numbers states that the sample mean converges to the distribution mean as the sample size increases, and is one of
the fundamental theorems of probability. There are different versions of the law, depending on the mode of convergence.
Suppose again that X is a real-valued random variable for our basic experiment, with mean μ and standard deviation σ (assumed finite). We repeat the basic experiment indefinitely to create a new, compound experiment with an infinite sequence of independent random variables (X_1, X_2, …), each with the same distribution as X. In statistical terms, we are sampling from the distribution of X. In probabilistic terms, we have an independent, identically distributed (IID) sequence. For each n, let M_n denote the sample mean of the first n sample variables:

$$ M_n = \frac{1}{n} \sum_{i=1}^n X_i \tag{6.3.5} $$

From the result above on variance, note that var(M_n) = E[(M_n − μ)²] → 0 as n → ∞. This means that M_n → μ as n → ∞ in mean square. As stated in the next theorem, M_n → μ as n → ∞ in probability as well.

P(|M_n − μ| > ε) → 0 as n → ∞ for every ε > 0.


Proof
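For reference, one standard argument is a sketch via Chebyshev's inequality, using the variance formula above:

$$ P(|M_n - \mu| > \epsilon) \leq \frac{\operatorname{var}(M_n)}{\epsilon^2} = \frac{\sigma^2}{n \epsilon^2} \to 0 \text{ as } n \to \infty $$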

Recall that in general, convergence in mean square implies convergence in probability. The convergence of the sample mean to the distribution mean in mean square and in probability are known as weak laws of large numbers.

Finally, the strong law of large numbers states that the sample mean M_n converges to the distribution mean μ with probability 1. As the name suggests, this is a much stronger result than the weak laws. We will need some additional notation for the proof. First let Y_n = ∑_{i=1}^n X_i, so that M_n = Y_n/n. Next, recall the definitions of the positive and negative parts of a real number x: x⁺ = max{x, 0}, x⁻ = max{−x, 0}. Note that x⁺ ≥ 0, x⁻ ≥ 0, x = x⁺ − x⁻, and |x| = x⁺ + x⁻.

M_n → μ as n → ∞ with probability 1.
Proof

The proof of the strong law of large numbers given above requires that the variance of the sampling distribution be finite (note that this is critical in the first step). However, there are better proofs that only require that E(|X|) < ∞. An elegant proof showing that M_n → μ as n → ∞ with probability 1 and in mean, using backwards martingales, is given in the chapter on martingales. In the next few paragraphs, we apply the law of large numbers to some of the special statistics studied in the previous section.
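Before moving on, the strong law can be visualized by tracking the running sample means of a single long simulated sequence. A minimal Python sketch, using fair die scores (so μ = 3.5):

```python
import numpy as np

rng = np.random.default_rng(3)

x = rng.integers(1, 7, size=100_000)  # simulated fair die scores
running_means = np.cumsum(x) / np.arange(1, x.size + 1)

for n in [10, 1_000, 100_000]:
    print(n, running_means[n - 1])  # the running mean drifts toward 3.5
```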
Empirical Probability

Suppose that X is the outcome random variable for a basic experiment, with sample space S and probability measure P. Now suppose that we repeat the basic experiment indefinitely to form a sequence of independent random variables (X_1, X_2, …), each with the same distribution as X. That is, we sample from the distribution of X. For A ⊆ S, let P_n(A) denote the empirical probability of A corresponding to the sample (X_1, X_2, …, X_n):

$$ P_n(A) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}(X_i \in A) \tag{6.3.13} $$

Now of course, P_n(A) is a random variable for each event A. In fact, the sum ∑_{i=1}^n 1(X_i ∈ A) has the binomial distribution with parameters n and P(A).

For each event A ,


1. E [P (A)] = P(A)
n

2. var [P (A)] = P(A) [1 − P(A)]


n
1

3. P (A) → P(A) as n → ∞ with probability 1.


n

Proof

This special case of the law of large numbers is central to the very concept of probability: the relative frequency of an event
converges to the probability of the event as the experiment is repeated.
The Empirical Distribution Function
Suppose now that X is a real-valued random variable for a basic experiment. Recall that the distribution function of X is the function F given by

$$ F(x) = P(X \leq x), \quad x \in \mathbb{R} \tag{6.3.14} $$

Now suppose that we repeat the basic experiment indefinitely to form a sequence of independent random variables (X_1, X_2, …), each with the same distribution as X. That is, we sample from the distribution of X. Let F_n denote the empirical distribution function corresponding to the sample (X_1, X_2, …, X_n):

$$ F_n(x) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}(X_i \leq x), \quad x \in \mathbb{R} \tag{6.3.15} $$

Now, of course, F_n(x) is a random variable for each x ∈ R. In fact, the sum ∑_{i=1}^n 1(X_i ≤ x) has the binomial distribution with parameters n and F(x).

For each x ∈ R,

1. E[F_n(x)] = F(x)
2. var[F_n(x)] = (1/n) F(x)[1 − F(x)]
3. F_n(x) → F(x) as n → ∞ with probability 1.
Proof
Empirical Density for a Discrete Variable
Suppose now that X is a random variable for a basic experiment with a discrete distribution on a countable set S. Recall that the probability density function of X is the function f given by

$$ f(x) = P(X = x), \quad x \in S \tag{6.3.16} $$

Now suppose that we repeat the basic experiment to form a sequence of independent random variables (X_1, X_2, …), each with the same distribution as X. That is, we sample from the distribution of X. Let f_n denote the empirical probability density function corresponding to the sample (X_1, X_2, …, X_n):

$$ f_n(x) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}(X_i = x), \quad x \in S \tag{6.3.17} $$

Now, of course, f_n(x) is a random variable for each x ∈ S. In fact, the sum ∑_{i=1}^n 1(X_i = x) has the binomial distribution with parameters n and f(x).

For each x ∈ S,

1. E[f_n(x)] = f(x)
2. var[f_n(x)] = (1/n) f(x)[1 − f(x)]
3. f_n(x) → f(x) as n → ∞ with probability 1.

Proof

Recall that a countable intersection of events with probability 1 still has probability 1. Thus, in the context of the previous theorem, we actually have

$$ P\left[f_n(x) \to f(x) \text{ as } n \to \infty \text{ for every } x \in S\right] = 1 \tag{6.3.18} $$

Empirical Density for a Continuous Variable

Suppose now that X is a random variable for a basic experiment, with a continuous distribution on S ⊆ R^d, and that X has probability density function f. Technically, f is the probability density function with respect to the standard (Lebesgue) measure λ_d. Thus, by definition,

$$ P(X \in A) = \int_A f(x) \, dx, \quad A \subseteq S \tag{6.3.19} $$

Again we repeat the basic experiment to generate a sequence of independent random variables (X_1, X_2, …), each with the same distribution as X. That is, we sample from the distribution of X. Suppose now that A = {A_j : j ∈ J} is a partition of S into a countable number of subsets, each with positive, finite size. Let f_n denote the empirical probability density function corresponding to the sample (X_1, X_2, …, X_n) and the partition A:

$$ f_n(x) = \frac{P_n(A_j)}{\lambda_d(A_j)} = \frac{1}{n \lambda_d(A_j)} \sum_{i=1}^n \mathbf{1}(X_i \in A_j); \quad j \in J, \; x \in A_j \tag{6.3.20} $$

Of course now, f_n(x) is a random variable for each x ∈ S. If the partition is sufficiently fine (so that λ_d(A_j) is small for each j), and if the sample size n is sufficiently large, then by the law of large numbers,

$$ f_n(x) \approx f(x), \quad x \in S \tag{6.3.21} $$

Exercises
Simulation Exercises

In the dice experiment, recall that the dice scores form a random sample from the specified die distribution. Select the average
random variable, which is the sample mean of the sample of dice scores. For each die distribution, start with 1 die and increase
the sample size n . Note how the distribution of the sample mean begins to resemble a point mass distribution. Note also that
the mean of the sample mean stays the same, but the standard deviation of the sample mean decreases. For selected values of n
and selected die distributions, run the simulation 1000 times and compare the relative frequency function of the sample mean
to the true probability density function, and compare the empirical moments of the sample mean to the true moments.

Several apps in this project are simulations of random experiments with events of interest. When you run the experiment, you are
performing independent replications of the experiment. In most cases, the app displays the relative frequency of the event and its
complement, both graphically in blue, and numerically in a table. When you run the experiment, the relative frequencies are shown
graphically in red and also numerically.

In the simulation of Buffon's coin experiment, the event of interest is that the coin crosses a crack. For various values of the
parameter (the radius of the coin), run the experiment 1000 times and compare the relative frequency of the event to the true
probability.

In the simulation of Bertrand's experiment, the event of interest is that a “random chord” on a circle will be longer than the
length of a side of the inscribed equilateral triangle. For each of the various models, run the experiment 1000 times and
compare the relative frequency of the event to the true probability.

Many of the apps in this project are simulations of experiments which result in discrete variables. When you run the simulation,
you are performing independent replications of the experiment. In most cases, the app displays the true probability density function
numerically in a table and visually as a blue bar graph. When you run the simulation, the relative frequency function is also shown
numerically in the table and visually as a red bar graph.

In the simulation of the binomial coin experiment, select the number of heads. For selected values of the parameters, run the
simulation 1000 times and compare the sample mean to the distribution mean, and compare the empirical density function to
the probability density function.

In the simulation of the matching experiment, the random variable is the number of matches. For selected values of the
parameter, run the simulation 1000 times and compare the sample mean and the distribution mean, and compare the empirical
density function to the probability density function.

In the poker experiment, the random variable is the type of hand. Run the simulation 1000 times and compare the empirical
density function to the true probability density function.

Many of the apps in this project are simulations of experiments which result in variables with continuous distributions. When you
run the simulation, you are performing independent replications of the experiment. In most cases, the app displays the true
probability density function visually as a blue graph. When you run the simulation, an empirical density function, based on a
partition, is also shown visually as a red bar graph.

In the simulation of the gamma experiment, the random variable represents a random arrival time. For selected values of the
parameters, run the experiment 1000 times and compare the sample mean to the distribution mean, and compare the empirical
density function to the probability density function.

In the special distribution simulator, select the normal distribution. For various values of the parameters (the mean and standard
deviation), run the experiment 1000 times and compare the sample mean to the distribution mean, and compare the empirical
density function to the probability density function.

Probability Exercises

Suppose that $X$ has probability density function $f(x) = 12 x^2 (1 - x)$ for $0 \le x \le 1$. The distribution of $X$ is a member of the beta family. Compute each of the following:
1. $\mathbb{E}(X)$
2. $\operatorname{var}(X)$
3. $\mathbb{P}\left(X \le \frac{1}{2}\right)$

Answer

Suppose now that $(X_1, X_2, \ldots, X_9)$ is a random sample of size 9 from the distribution in the previous problem. Find the expected value and variance of each of the following random variables:
1. The sample mean $M$
2. The empirical probability $P\left(\left[0, \frac{1}{2}\right]\right)$

Answer

Suppose that $X$ has probability density function $f(x) = \frac{3}{x^4}$ for $1 \le x < \infty$. The distribution of $X$ is a member of the Pareto family. Compute each of the following:
1. $\mathbb{E}(X)$
2. $\operatorname{var}(X)$
3. $\mathbb{P}(2 \le X \le 3)$
Answer

Suppose now that $(X_1, X_2, \ldots, X_{16})$ is a random sample of size 16 from the distribution in the previous problem. Find the expected value and variance of each of the following random variables:
1. The sample mean $M$
2. The empirical probability $P([2, 3])$

Answer

Recall that for an ace-six flat die, faces 1 and 6 have probability $\frac{1}{4}$ each, while faces 2, 3, 4, and 5 have probability $\frac{1}{8}$ each. Let $X$ denote the score when an ace-six flat die is thrown. Compute each of the following:
1. The probability density function $f(x)$ for $x \in \{1, 2, 3, 4, 5, 6\}$
2. The distribution function $F(x)$ for $x \in \{1, 2, 3, 4, 5, 6\}$
3. $\mathbb{E}(X)$
4. $\operatorname{var}(X)$
Answer

Suppose now that an ace-six flat die is thrown $n$ times. Find the expected value and variance of each of the following random variables:
1. The empirical probability density function $f_n(x)$ for $x \in \{1, 2, 3, 4, 5, 6\}$
2. The empirical distribution function $F_n(x)$ for $x \in \{1, 2, 3, 4, 5, 6\}$
3. The average score $M$


Answer


6.4: The Central Limit Theorem

The central limit theorem and the law of large numbers are the two fundamental theorems of probability. Roughly, the central limit
theorem states that the distribution of the sum (or average) of a large number of independent, identically distributed variables will
be approximately normal, regardless of the underlying distribution. The importance of the central limit theorem is hard to overstate;
indeed it is the reason that many statistical procedures work.

Partial Sum Processes


Definitions
Suppose that $\boldsymbol{X} = (X_1, X_2, \ldots)$ is a sequence of independent, identically distributed, real-valued random variables with common probability density function $f$, mean $\mu$, and variance $\sigma^2$. We assume that $0 < \sigma < \infty$, so that in particular, the random variables really are random and not constants. Let
$$Y_n = \sum_{i=1}^n X_i, \quad n \in \mathbb{N} \tag{6.4.1}$$

Note that by convention, $Y_0 = 0$, since the sum is over an empty index set. The random process $\boldsymbol{Y} = (Y_0, Y_1, Y_2, \ldots)$ is called the partial sum process associated with $\boldsymbol{X}$. Special types of partial sum processes have been studied in many places in this text; in particular see
the binomial distribution in the setting of Bernoulli trials
the negative binomial distribution in the setting of Bernoulli trials
the gamma distribution in the Poisson process
the arrival times in a general renewal process
Recall that in statistical terms, the sequence $\boldsymbol{X}$ corresponds to sampling from the underlying distribution. In particular, $(X_1, X_2, \ldots, X_n)$ is a random sample of size $n$ from the distribution, and the corresponding sample mean is
$$M_n = \frac{Y_n}{n} = \frac{1}{n} \sum_{i=1}^n X_i \tag{6.4.2}$$
By the law of large numbers, $M_n \to \mu$ as $n \to \infty$ with probability 1.

Stationary, Independent Increments


The partial sum process corresponding to a sequence of independent, identically distributed variables has two important properties,
and these properties essentially characterize such processes.

If $m \le n$ then $Y_n - Y_m$ has the same distribution as $Y_{n-m}$. Thus the process $\boldsymbol{Y}$ has stationary increments.
Proof

Note however that $Y_n - Y_m$ and $Y_{n-m}$ are very different random variables; the theorem simply states that they have the same distribution.

If $n_1 \le n_2 \le n_3 \le \cdots$ then $\left(Y_{n_1}, Y_{n_2} - Y_{n_1}, Y_{n_3} - Y_{n_2}, \ldots\right)$ is a sequence of independent random variables. Thus the process $\boldsymbol{Y}$ has independent increments.

Proof

Conversely, suppose that $\boldsymbol{V} = (V_0, V_1, V_2, \ldots)$ is a random process with stationary, independent increments. Define $U_i = V_i - V_{i-1}$ for $i \in \mathbb{N}_+$. Then $\boldsymbol{U} = (U_1, U_2, \ldots)$ is a sequence of independent, identically distributed variables, and $\boldsymbol{V}$ is the partial sum process associated with $\boldsymbol{U}$.

Thus, partial sum processes are the only discrete-time random processes that have stationary, independent increments. An
interesting, and much harder problem, is to characterize the continuous-time processes that have stationary independent increments.
The Poisson counting process has stationary independent increments, as does the Brownian motion process.

Moments
If $n \in \mathbb{N}$ then
1. $\mathbb{E}(Y_n) = n \mu$
2. $\operatorname{var}(Y_n) = n \sigma^2$

Proof

If $n \in \mathbb{N}_+$ and $m \in \mathbb{N}$ with $m \le n$ then
1. $\operatorname{cov}(Y_m, Y_n) = m \sigma^2$
2. $\operatorname{cor}(Y_m, Y_n) = \sqrt{m/n}$
3. $\mathbb{E}(Y_m Y_n) = m \sigma^2 + m n \mu^2$

Proof

If $X$ has moment generating function $G$ then $Y_n$ has moment generating function $G^n$.

Proof

Distributions
Suppose that $X$ has either a discrete distribution or a continuous distribution with probability density function $f$. Then the probability density function of $Y_n$ is $f^{*n} = f * f * \cdots * f$, the convolution power of $f$ of order $n$.

Proof

More generally, we can use the stationary and independence properties to find the joint distributions of the partial sum process:

If $n_1 < n_2 < \cdots < n_k$ then $\left(Y_{n_1}, Y_{n_2}, \ldots, Y_{n_k}\right)$ has joint probability density function
$$f_{n_1, n_2, \ldots, n_k}(y_1, y_2, \ldots, y_k) = f^{*n_1}(y_1) \, f^{*(n_2 - n_1)}(y_2 - y_1) \cdots f^{*(n_k - n_{k-1})}(y_k - y_{k-1}), \quad (y_1, y_2, \ldots, y_k) \in \mathbb{R}^k \tag{6.4.5}$$
Proof

The Central Limit Theorem


First, let's make the central limit theorem more precise. From Theorem 4, we cannot expect $Y_n$ itself to have a limiting distribution. Note that $\operatorname{var}(Y_n) \to \infty$ as $n \to \infty$ since $\sigma > 0$, and $\mathbb{E}(Y_n) \to \infty$ as $n \to \infty$ if $\mu > 0$ while $\mathbb{E}(Y_n) \to -\infty$ as $n \to \infty$ if $\mu < 0$. Similarly, we know that $M_n \to \mu$ as $n \to \infty$ with probability 1, so the limiting distribution of the sample mean is degenerate. Thus, to obtain a limiting distribution of $Y_n$ or $M_n$ that is not degenerate, we need to consider, not these variables themselves, but rather the common standard score. Thus, let
$$Z_n = \frac{Y_n - n \mu}{\sqrt{n} \, \sigma} = \frac{M_n - \mu}{\sigma / \sqrt{n}} \tag{6.4.6}$$

$Z_n$ has mean 0 and variance 1:
1. $\mathbb{E}(Z_n) = 0$
2. $\operatorname{var}(Z_n) = 1$

Proof

The precise statement of the central limit theorem is that the distribution of the standard score $Z_n$ converges to the standard normal distribution as $n \to \infty$. Recall that the standard normal distribution has probability density function
$$\phi(z) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} z^2}, \quad z \in \mathbb{R} \tag{6.4.7}$$

and is studied in more detail in the chapter on special distributions. A special case of the central limit theorem, for Bernoulli trials, dates to Abraham De Moivre. The term central limit theorem was coined by George Pólya in 1920. By definition of convergence in distribution, the central limit theorem states that $F_n(z) \to \Phi(z)$ as $n \to \infty$ for each $z \in \mathbb{R}$, where $F_n$ is the distribution function of $Z_n$ and $\Phi$ is the standard normal distribution function:
$$\Phi(z) = \int_{-\infty}^z \phi(x) \, dx = \int_{-\infty}^z \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} x^2} \, dx, \quad z \in \mathbb{R} \tag{6.4.8}$$

An equivalent statement of the central limit theorem involves convergence of the corresponding characteristic functions. This is the version that we will give and prove, but first we need a generalization of a famous limit from calculus.

Suppose that $(a_1, a_2, \ldots)$ is a sequence of real numbers and that $a_n \to a \in \mathbb{R}$ as $n \to \infty$. Then
$$\left(1 + \frac{a_n}{n}\right)^n \to e^a \text{ as } n \to \infty \tag{6.4.9}$$

Now let $\chi$ denote the characteristic function of the standard score of the sample variable $X$, and let $\chi_n$ denote the characteristic function of the standard score $Z_n$:
$$\chi(t) = \mathbb{E}\left[\exp\left(i t \frac{X - \mu}{\sigma}\right)\right], \quad \chi_n(t) = \mathbb{E}\left[\exp(i t Z_n)\right]; \quad t \in \mathbb{R} \tag{6.4.10}$$
Recall that $t \mapsto e^{-\frac{1}{2} t^2}$ is the characteristic function of the standard normal distribution. We can now give a proof.

The central limit theorem. The distribution of $Z_n$ converges to the standard normal distribution as $n \to \infty$. That is, $\chi_n(t) \to e^{-\frac{1}{2} t^2}$ as $n \to \infty$ for each $t \in \mathbb{R}$.
Proof

Normal Approximations
The central limit theorem implies that if the sample size $n$ is “large” then the distribution of the partial sum $Y_n$ is approximately normal with mean $n \mu$ and variance $n \sigma^2$. Equivalently, the sample mean $M_n$ is approximately normal with mean $\mu$ and variance $\sigma^2 / n$. The central limit theorem is of fundamental importance, because it means that we can approximate the distribution of certain statistics, even if we know very little about the underlying sampling distribution.
Of course, the term “large” is relative. Roughly, the more “abnormal” the basic distribution, the larger n must be for normal
approximations to work well. The rule of thumb is that a sample size n of at least 30 will usually suffice if the basic distribution is
not too weird; although for many distributions smaller n will do.
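As a quick empirical check (not part of the original text), the following Python sketch simulates standardized partial sums from an exponential distribution, a distinctly non-normal distribution, and compares the empirical distribution function of $Z_n$ to $\Phi$. The sample sizes and replication count are arbitrary choices for the example.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=2)
mu = sigma = 1.0  # the exponential distribution with rate 1 has mean 1 and standard deviation 1

for n in (2, 10, 30):
    # 20,000 replications of the standardized sum Z_n = (Y_n - n*mu) / (sqrt(n)*sigma)
    y = rng.exponential(scale=1.0, size=(20_000, n)).sum(axis=1)
    z = (y - n * mu) / (np.sqrt(n) * sigma)
    # Maximum gap between the empirical CDF of Z_n and Phi on a grid
    grid = np.linspace(-3, 3, 61)
    ecdf = (z[:, None] <= grid).mean(axis=0)
    print(n, np.max(np.abs(ecdf - norm.cdf(grid))).round(3))  # the gap shrinks as n grows
```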

Let Y denote the sum of the variables in a random sample of size 30 from the uniform distribution on [0, 1]. Find normal
approximations to each of the following:
1. P(13 < Y < 18)
2. The 90th percentile of Y
Answer

Random variable $Y$ in the previous exercise has the Irwin-Hall distribution of order 30. The Irwin-Hall distributions are studied in more detail in the chapter on Special Distributions and are named for Joseph Irwin and Philip Hall.

In the special distribution simulator, select the Irwin-Hall distribution. Vary $n$ from 1 to 10 and note the shape of the probability density function. With $n = 10$, run the experiment 1000 times and compare the empirical density function to the true probability density function.

Let $M$ denote the sample mean of a random sample of size 50 from the distribution with probability density function $f(x) = \frac{3}{x^4}$ for $1 \le x < \infty$. This is a Pareto distribution, named for Vilfredo Pareto. Find normal approximations to each of the following:
1. $\mathbb{P}(M > 1.6)$
2. The 60th percentile of $M$
Answer
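A sketch of the computation in Python (an illustration, not the book's own solution): the mean and variance of this Pareto distribution follow from $\mu = \int_1^\infty x f(x)\,dx = \frac{3}{2}$ and $\mathbb{E}(X^2) = \int_1^\infty x^2 f(x)\,dx = 3$, and the central limit theorem then gives the approximate distribution of $M$.

```python
from math import sqrt
from scipy.stats import norm

# For f(x) = 3/x^4 on [1, inf): mu = 3/2 and sigma^2 = 3 - (3/2)^2 = 3/4
mu, var = 3/2, 3/4
n = 50
sd_M = sqrt(var / n)  # M is approximately normal with mean mu and sd sigma/sqrt(n)

print("P(M > 1.6) ≈", 1 - norm.cdf(1.6, loc=mu, scale=sd_M))
print("60th percentile of M ≈", norm.ppf(0.60, loc=mu, scale=sd_M))
```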

The Continuity Correction
A slight technical problem arises when the sampling distribution is discrete. In this case, the partial sum also has a discrete distribution, and hence we are approximating a discrete distribution with a continuous one. Suppose that $X$ takes integer values (the most common case) and hence so does the partial sum $Y_n$. For any $k \in \mathbb{Z}$ and $h \in [0, 1)$, note that the event $\{k - h \le Y_n \le k + h\}$ is equivalent to the event $\{Y_n = k\}$. Different values of $h$ lead to different normal approximations, even though the events are equivalent. The smallest approximation would be 0 when $h = 0$, and the approximations increase as $h$ increases. It is customary to split the difference by using $h = \frac{1}{2}$ for the normal approximation. This is sometimes called the half-unit continuity correction or the histogram correction. The continuity correction is extended to other events in the natural way, using the additivity of probability.

Suppose that $j, k \in \mathbb{Z}$ with $j \le k$.
1. For the event $\{j \le Y_n \le k\} = \{j - 1 < Y_n < k + 1\}$, use $\left\{j - \frac{1}{2} \le Y_n \le k + \frac{1}{2}\right\}$ in the normal approximation.
2. For the event $\{j \le Y_n\} = \{j - 1 < Y_n\}$, use $\left\{j - \frac{1}{2} \le Y_n\right\}$ in the normal approximation.
3. For the event $\{Y_n \le k\} = \{Y_n < k + 1\}$, use $\left\{Y_n \le k + \frac{1}{2}\right\}$ in the normal approximation.

Let Y denote the sum of the scores of 20 fair dice. Compute the normal approximation to P(60 ≤ Y ≤ 75) .
Answer
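Here is one way to carry out this computation in Python (a sketch, not the text's official answer): a single fair die has mean $\frac{7}{2}$ and variance $\frac{35}{12}$, so $Y$ has mean 70 and variance $20 \cdot \frac{35}{12}$, and the continuity correction replaces the event $\{60 \le Y \le 75\}$ with $\{59.5 \le Y \le 75.5\}$.

```python
from math import sqrt
from scipy.stats import norm

mu, var = 7/2, 35/12            # mean and variance of a single fair die score
n = 20
m, sd = n * mu, sqrt(n * var)   # moments of the sum Y

# Normal approximation with the half-unit continuity correction
approx = norm.cdf(75.5, loc=m, scale=sd) - norm.cdf(59.5, loc=m, scale=sd)
print("P(60 <= Y <= 75) ≈", round(approx, 4))
```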

In the dice experiment, set the die distribution to fair, select the sum random variable Y , and set n = 20 . Run the simulation
1000 times and find each of the following. Compare with the result in the previous exercise:
1. P(60 ≤ Y ≤ 75)
2. The relative frequency of the event {60 ≤ Y ≤ 75} (from the simulation)

Normal Approximation to the Gamma Distribution


Recall that the gamma distribution with shape parameter $k \in (0, \infty)$ and scale parameter $b \in (0, \infty)$ is a continuous distribution on $(0, \infty)$ with probability density function $f$ given by
$$f(x) = \frac{1}{\Gamma(k) \, b^k} x^{k-1} e^{-x/b}, \quad x \in (0, \infty) \tag{6.4.14}$$
The mean is $k b$ and the variance is $k b^2$. The gamma distribution is widely used to model random times (particularly in the context of the Poisson model) and other positive random variables. The general gamma distribution is studied in more detail in the chapter on Special Distributions. In the context of the Poisson model (where $k \in \mathbb{N}_+$), the gamma distribution is also known as the Erlang distribution, named for Agner Erlang; it is studied in more detail in the chapter on the Poisson Process. Suppose now that $Y_k$ has the gamma (Erlang) distribution with shape parameter $k \in \mathbb{N}_+$ and scale parameter $b > 0$. Then
$$Y_k = \sum_{i=1}^k X_i \tag{6.4.15}$$

where $(X_1, X_2, \ldots)$ is a sequence of independent variables, each having the exponential distribution with scale parameter $b$. (The exponential distribution is a special case of the gamma distribution, with shape parameter 1.) It follows that if $k$ is large, the gamma distribution can be approximated by the normal distribution with mean $k b$ and variance $k b^2$. The same statement actually holds when $k$ is not an integer. Here is the precise statement:

Suppose that $Y_k$ has the gamma distribution with scale parameter $b \in (0, \infty)$ and shape parameter $k \in (0, \infty)$. Then the distribution of the standardized variable $Z_k$ below converges to the standard normal distribution as $k \to \infty$:
$$Z_k = \frac{Y_k - k b}{\sqrt{k} \, b} \tag{6.4.16}$$

In the special distribution simulator, select the gamma distribution. Vary $k$ and $b$ and note the shape of the probability density function. With $k = 10$ and various values of $b$, run the experiment 1000 times and compare the empirical density function to the true probability density function.

Suppose that $Y$ has the gamma distribution with shape parameter $k = 10$ and scale parameter $b = 2$. Find normal approximations to each of the following:
1. P(18 ≤ Y ≤ 23)
2. The 80th percentile of Y
Answer
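A short Python check (an illustration under the stated parameters): compare the exact gamma probability with the normal approximation, which has mean $k b = 20$ and variance $k b^2 = 40$.

```python
from math import sqrt
from scipy.stats import gamma, norm

k, b = 10, 2
m, sd = k * b, sqrt(k) * b  # mean kb and standard deviation sqrt(k)*b

exact = gamma.cdf(23, a=k, scale=b) - gamma.cdf(18, a=k, scale=b)
approx = norm.cdf(23, loc=m, scale=sd) - norm.cdf(18, loc=m, scale=sd)
print("exact:", round(exact, 4), " normal approximation:", round(approx, 4))

# 80th percentile: exact versus normal approximation
print("exact:", round(gamma.ppf(0.80, a=k, scale=b), 3),
      " approx:", round(norm.ppf(0.80, loc=m, scale=sd), 3))
```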

Normal Approximation to the Chi-Square Distribution


Recall that the chi-square distribution with $n \in (0, \infty)$ degrees of freedom is a special case of the gamma distribution, with shape parameter $k = n/2$ and scale parameter $b = 2$. Thus, the chi-square distribution with $n$ degrees of freedom has probability density function
$$f(x) = \frac{1}{\Gamma(n/2) \, 2^{n/2}} x^{n/2 - 1} e^{-x/2}, \quad 0 < x < \infty \tag{6.4.17}$$

When $n$ is a positive integer, the chi-square distribution governs the sum of the squares of $n$ independent, standard normal variables. For this reason, it is one of the most important distributions in statistics. The chi-square distribution is studied in more detail in the chapter on Special Distributions. From the previous discussion, it follows that if $n$ is large, the chi-square distribution can be approximated by the normal distribution with mean $n$ and variance $2n$. Here is the precise statement:

Suppose that $Y_n$ has the chi-square distribution with $n \in (0, \infty)$ degrees of freedom. Then the distribution of the standardized variable $Z_n$ below converges to the standard normal distribution as $n \to \infty$:
$$Z_n = \frac{Y_n - n}{\sqrt{2n}} \tag{6.4.18}$$

In the special distribution simulator, select the chi-square distribution. Vary $n$ and note the shape of the probability density function. With $n = 20$, run the experiment 1000 times and compare the empirical density function to the probability density function.

Suppose that Y has the chi-square distribution with n = 20 degrees of freedom. Find normal approximations to each of the
following:
1. P(18 < Y < 25)
2. The 75th percentile of Y
Answer

Normal Approximation to the Binomial Distribution


Recall that a Bernoulli trials sequence, named for Jacob Bernoulli, is a sequence $(X_1, X_2, \ldots)$ of independent, identically distributed indicator variables with $\mathbb{P}(X_i = 1) = p$ for each $i$, where $p \in (0, 1)$ is the parameter. In the usual language of reliability, $X_i$ is the outcome of trial $i$, where 1 means success and 0 means failure. The common mean is $p$ and the common variance is $p(1 - p)$.
Let $Y_n = \sum_{i=1}^n X_i$, so that $Y_n$ is the number of successes in the first $n$ trials. Recall that $Y_n$ has the binomial distribution with parameters $n$ and $p$, and has probability density function
$$f(k) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad k \in \{0, 1, \ldots, n\} \tag{6.4.19}$$

The binomial distribution is studied in more detail in the chapter on Bernoulli trials.
It follows from the central limit theorem that if $n$ is large, the binomial distribution with parameters $n$ and $p$ can be approximated by the normal distribution with mean $n p$ and variance $n p (1 - p)$. The rule of thumb is that $n$ should be large enough for $n p \ge 5$ and $n (1 - p) \ge 5$. (The first condition is the important one when $p < \frac{1}{2}$ and the second condition is the important one when $p > \frac{1}{2}$.) Here is the precise statement:

Suppose that $Y_n$ has the binomial distribution with trial parameter $n \in \mathbb{N}_+$ and success parameter $p \in (0, 1)$. Then the distribution of the standardized variable $Z_n$ given below converges to the standard normal distribution as $n \to \infty$:
$$Z_n = \frac{Y_n - n p}{\sqrt{n p (1 - p)}} \tag{6.4.20}$$

In the binomial timeline experiment, vary n and p and note the shape of the probability density function. With n = 50 and
p = 0.3 , run the simulation 1000 times and compute the following:

1. P(12 ≤ Y ≤ 16)
2. The relative frequency of the event {12 ≤ Y ≤ 16} (from the simulation)
Answer

Suppose that Y has the binomial distribution with parameters n = 50 and p = 0.3 . Compute the normal approximation to
P(12 ≤ Y ≤ 16) (don't forget the continuity correction) and compare with the results of the previous exercise.
Answer
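The computation can be sketched in Python as follows (an illustration, not the book's answer key); with $n = 50$ and $p = 0.3$, the approximating normal distribution has mean $np = 15$ and variance $np(1 - p) = 10.5$.

```python
from math import sqrt
from scipy.stats import binom, norm

n, p = 50, 0.3
m, sd = n * p, sqrt(n * p * (1 - p))

exact = binom.cdf(16, n, p) - binom.cdf(11, n, p)  # P(12 <= Y <= 16)
# Continuity correction: {12 <= Y <= 16} becomes {11.5 <= Y <= 16.5}
approx = norm.cdf(16.5, loc=m, scale=sd) - norm.cdf(11.5, loc=m, scale=sd)
print("exact:", round(exact, 4), " normal approximation:", round(approx, 4))
```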

Normal Approximation to the Poisson Distribution


Recall that the Poisson distribution, named for Simeon Poisson, is a discrete distribution on $\mathbb{N}$ with probability density function $f$ given by
$$f(x) = e^{-\theta} \frac{\theta^x}{x!}, \quad x \in \mathbb{N} \tag{6.4.21}$$

where θ > 0 is a parameter. The parameter is both the mean and the variance of the distribution. The Poisson distribution is widely
used to model the number of “random points” in a region of time or space, and is studied in more detail in the chapter on the
Poisson Process. In this context, the parameter is proportional to the size of the region.
Suppose now that $Y_n$ has the Poisson distribution with parameter $n \in \mathbb{N}_+$. Then
$$Y_n = \sum_{i=1}^n X_i \tag{6.4.22}$$

where $(X_1, X_2, \ldots, X_n)$ is a sequence of independent variables, each with the Poisson distribution with parameter 1. It follows from the central limit theorem that if $n$ is large, the Poisson distribution with parameter $n$ can be approximated by the normal distribution with mean $n$ and variance $n$. The same statement holds when the parameter is not an integer. Here is the precise statement:

Suppose that $Y_\theta$ has the Poisson distribution with parameter $\theta \in (0, \infty)$. Then the distribution of the standardized variable $Z_\theta$ below converges to the standard normal distribution as $\theta \to \infty$:
$$Z_\theta = \frac{Y_\theta - \theta}{\sqrt{\theta}} \tag{6.4.23}$$

Suppose that Y has the Poisson distribution with mean 20.


1. Compute the true value of P(16 ≤ Y ≤ 23) .
2. Compute the normal approximation to P(16 ≤ Y ≤ 23) .
Answer

In the Poisson experiment, vary the time and rate parameters t and r (the parameter of the Poisson distribution in the
experiment is the product rt ). Note the shape of the probability density function. With r = 5 and t = 4 , run the experiment
1000 times and compare the empirical density function to the true probability density function.

Normal Approximation to the Negative Binomial Distribution


The general version of the negative binomial distribution is a discrete distribution on $\mathbb{N}$, with shape parameter $k \in (0, \infty)$ and success parameter $p \in (0, 1)$. The probability density function $f$ is given by
$$f(n) = \binom{n + k - 1}{n} p^k (1 - p)^n, \quad n \in \mathbb{N} \tag{6.4.24}$$

The mean is $k(1 - p)/p$ and the variance is $k(1 - p)/p^2$. The negative binomial distribution is studied in more detail in the chapter on Bernoulli trials. If $k \in \mathbb{N}_+$, the distribution governs the number of failures $Y_k$ before success number $k$ in a sequence of Bernoulli trials with success parameter $p$. Thus in this case,
$$Y_k = \sum_{i=1}^k X_i \tag{6.4.25}$$
where $(X_1, X_2, \ldots, X_k)$ is a sequence of independent variables, each having the geometric distribution on $\mathbb{N}$ with parameter $p$. (The geometric distribution is a special case of the negative binomial distribution, with parameters 1 and $p$.) In the context of the Bernoulli trials, $X_1$ is the number of failures before the first success, and for $i \in \{2, 3, \ldots\}$, $X_i$ is the number of failures between success number $i - 1$ and success number $i$. It follows that if $k$ is large, the negative binomial distribution can be approximated by the normal distribution. The same statement holds if $k$ is not an integer. Here is the precise statement:

Suppose that $Y_k$ has the negative binomial distribution with shape parameter $k \in (0, \infty)$ and success parameter $p \in (0, 1)$. Then the distribution of the standardized variable $Z_k$ below converges to the standard normal distribution as $k \to \infty$:
$$Z_k = \frac{p Y_k - k(1 - p)}{\sqrt{k (1 - p)}} \tag{6.4.26}$$

Another version of the negative binomial distribution is the distribution of the trial number $V_k$ of success number $k \in \mathbb{N}_+$. So $V_k = k + Y_k$, and $V_k$ has mean $k/p$ and variance $k(1 - p)/p^2$. The normal approximation applies to the distribution of $V_k$ as well, if $k$ is large, and since the distributions are related by a location transformation, the standard scores are the same. That is,
$$\frac{p V_k - k}{\sqrt{k (1 - p)}} = \frac{p Y_k - k(1 - p)}{\sqrt{k (1 - p)}} \tag{6.4.27}$$

In the negative binomial experiment, vary k and p and note the shape of the probability density function. With k = 5 and
p = 0.4 , run the experiment 1000 times and compare the empirical density function to the true probability density function.

Suppose that $Y$ has the negative binomial distribution with shape parameter $k = 10$ and success parameter $p = 0.4$. Find normal approximations to each of the following:
1. P(20 < Y < 30)
2. The 80th percentile of Y
Answer

Partial Sums with a Random Number of Terms


Our last topic is a bit more esoteric, but still fits with the general setting of this section. Recall that $\boldsymbol{X} = (X_1, X_2, \ldots)$ is a sequence of independent, identically distributed real-valued random variables with common mean $\mu$ and variance $\sigma^2$. Suppose now that $N$ is a random variable (on the same probability space) taking values in $\mathbb{N}$, also with finite mean and variance. Then
$$Y_N = \sum_{i=1}^N X_i \tag{6.4.28}$$
is a random sum of the independent, identically distributed variables. That is, the terms are random of course, but so also is the number of terms $N$. We are primarily interested in the moments of $Y_N$.

Independent Number of Terms


Suppose first that $N$, the number of terms, is independent of $\boldsymbol{X}$, the sequence of terms. Computing the moments of $Y_N$ is a good exercise in conditional expectation.

The conditional expected value of $Y_N$ given $N$, and the expected value of $Y_N$, are
1. $\mathbb{E}(Y_N \mid N) = N \mu$
2. $\mathbb{E}(Y_N) = \mathbb{E}(N) \mu$

The conditional variance of $Y_N$ given $N$, and the variance of $Y_N$, are
1. $\operatorname{var}(Y_N \mid N) = N \sigma^2$
2. $\operatorname{var}(Y_N) = \mathbb{E}(N) \sigma^2 + \operatorname{var}(N) \mu^2$

Let $H$ denote the probability generating function of $N$. Show that the moment generating function of $Y_N$ is $H \circ G$:
1. $\mathbb{E}\left(e^{t Y_N} \mid N\right) = [G(t)]^N$
2. $\mathbb{E}\left(e^{t Y_N}\right) = H(G(t))$

Wald's Equation
The result in Exercise 29 (b) generalizes to the case where the random number of terms $N$ is a stopping time for the sequence $\boldsymbol{X}$. This means that the event $\{N = n\}$ depends only on (technically, is measurable with respect to) $(X_1, X_2, \ldots, X_n)$ for each $n \in \mathbb{N}$. The generalization is known as Wald's equation, and is named for Abraham Wald. Stopping times are studied in much more technical detail in the section on Filtrations and Stopping Times.

If $N$ is a stopping time for $\boldsymbol{X}$ then $\mathbb{E}(Y_N) = \mathbb{E}(N) \mu$.


Proof

An elegant proof of Wald's equation is given in the chapter on Martingales.

Suppose that the number of customers arriving at a store during a given day has the Poisson distribution with parameter 50.
Each customer, independently of the others (and independently of the number of customers), spends an amount of money that
is uniformly distributed on the interval [0, 20]. Find the mean and standard deviation of the amount of money that the store
takes in during a day.
Answer
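The moment formulas above reduce this exercise to arithmetic. The Python sketch below (illustrative, not the book's solution) computes the mean and standard deviation from $\mathbb{E}(Y_N) = \mathbb{E}(N)\mu$ and $\operatorname{var}(Y_N) = \mathbb{E}(N)\sigma^2 + \operatorname{var}(N)\mu^2$, and checks them by simulation.

```python
import numpy as np

# N ~ Poisson(50), so E(N) = var(N) = 50; X ~ uniform on [0, 20],
# so mu = 10 and sigma^2 = 20^2 / 12
EN = varN = 50
mu, var = 10.0, 400.0 / 12.0

mean_Y = EN * mu
var_Y = EN * var + varN * mu**2
print("formula:", mean_Y, np.sqrt(var_Y).round(2))

# Simulation check
rng = np.random.default_rng(seed=3)
totals = [rng.uniform(0, 20, size=rng.poisson(50)).sum() for _ in range(20_000)]
print("simulation:", np.mean(totals).round(1), np.std(totals).round(1))
```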

When a certain critical component in a system fails, it is immediately replaced by a new, statistically identical component. The components are independent, and the lifetime of each (in hours) is exponentially distributed with scale parameter $b$. During the life of the system, the number of critical components used has a geometric distribution on $\mathbb{N}_+$ with parameter $p$. For the total life of the critical components,


1. Find the mean.
2. Find the standard deviation.
3. Find the moment generating function.
4. Identify the distribution by name.
Answer


6.5: The Sample Variance

Descriptive Theory
Recall the basic model of statistics: we have a population of objects of interest, and we have various measurements (variables) that
we make on these objects. We select objects from the population and record the variables for the objects in the sample; these
become our data. Once again, our first discussion is from a descriptive point of view. That is, we do not assume that the data are
generated by an underlying probability distribution. Remember however, that the data themselves form a probability distribution.
Variance and Standard Deviation
Suppose that $\boldsymbol{x} = (x_1, x_2, \ldots, x_n)$ is a sample of size $n$ from a real-valued variable $x$. Recall that the sample mean is
$$m = \frac{1}{n} \sum_{i=1}^n x_i \tag{6.5.1}$$
and is the most important measure of the center of the data set. The sample variance is defined to be
$$s^2 = \frac{1}{n - 1} \sum_{i=1}^n (x_i - m)^2 \tag{6.5.2}$$

If we need to indicate the dependence on the data vector $\boldsymbol{x}$, we write $s^2(\boldsymbol{x})$. The difference $x_i - m$ is the deviation of $x_i$ from the mean $m$ of the data set. Thus, the variance is the mean square deviation and is a measure of the spread of the data set with respect to the mean. The reason for dividing by $n - 1$ rather than $n$ is best understood in terms of the inferential point of view that we discuss in the next section; this definition makes the sample variance an unbiased estimator of the distribution variance. However, the reason for the averaging can also be understood in terms of a related concept.


$\sum_{i=1}^n (x_i - m) = 0$.
Proof

Thus, if we know n − 1 of the deviations, we can compute the last one. This means that there are only n − 1 freely varying
deviations, that is to say, n − 1 degrees of freedom in the set of deviations. In the definition of sample variance, we average the
squared deviations, not by dividing by the number of terms, but rather by dividing by the number of degrees of freedom in those
terms. However, this argument notwithstanding, it would be reasonable, from a purely descriptive point of view, to divide by n in
the definition of the sample variance. Moreover, when n is sufficiently large, it hardly matters whether we divide by n or by n − 1 .
In any event, the square root $s$ of the sample variance $s^2$ is the sample standard deviation. It is the root mean square deviation and is also a measure of the spread of the data with respect to the mean. Both measures of spread are important. Variance has nicer
is also a measure of the spread of the data with respect to the mean. Both measures of spread are important. Variance has nicer
mathematical properties, but its physical unit is the square of the unit of x. For example, if the underlying variable x is the height
of a person in inches, the variance is in square inches. On the other hand, the standard deviation has the same physical unit as the
original variable, but its mathematical properties are not as nice.
Recall that the data set $\boldsymbol{x}$ naturally gives rise to a probability distribution, namely the empirical distribution that places probability $\frac{1}{n}$ at $x_i$ for each $i$. Thus, if the data are distinct, this is the uniform distribution on $\{x_1, x_2, \ldots, x_n\}$. The sample mean $m$ is simply the expected value of the empirical distribution. Similarly, if we were to divide by $n$ rather than $n - 1$, the sample variance would be the variance of the empirical distribution. Most of the properties and results in this section follow from much more general properties and results for the variance of a probability distribution (although for the most part, we give independent proofs).
Measures of Center and Spread
Measures of center and measures of spread are best thought of together, in the context of an error function. The error function
measures how well a single number a represents the entire data set x. The values of a (if they exist) that minimize the error
functions are our measures of center; the minimum value of the error function is the corresponding measure of spread. Of course,
we hope for a single value of a that minimizes the error function, so that we have a unique measure of center.
Let's apply this procedure to the mean square error function defined by
$$\operatorname{mse}(a) = \frac{1}{n - 1} \sum_{i=1}^n (x_i - a)^2, \quad a \in \mathbb{R} \tag{6.5.3}$$

Minimizing mse is a standard problem in calculus.

The graph of mse is a parabola opening upward.


1. mse is minimized when a = m , the sample mean.
2. The minimum value of $\operatorname{mse}$ is $s^2$, the sample variance.

Proof

Trivially, if we defined the mean square error function by dividing by $n$ rather than $n - 1$, then the minimum value would still occur at $m$, the sample mean, but the minimum value would be the alternate version of the sample variance in which we divide by $n$. On the other hand, if we were to use the root mean square deviation function $\operatorname{rmse}(a) = \sqrt{\operatorname{mse}(a)}$, then because the square root function is strictly increasing on $[0, \infty)$, the minimum value would again occur at $m$, the sample mean, but the minimum value would be $s$, the sample standard deviation. The important point is that with all of these error functions, the unique measure of center is the sample mean, and the corresponding measures of spread are the various ones that we are studying.
Next, let's apply our procedure to the mean absolute error function defined by
$$\operatorname{mae}(a) = \frac{1}{n - 1} \sum_{i=1}^n |x_i - a|, \quad a \in \mathbb{R} \tag{6.5.5}$$

The mean absolute error function satisfies the following properties:


1. mae is a continuous function.
2. The graph of mae consists of lines.
3. The slope of the line at a depends on where a is in the data set x.
Proof

Mathematically, mae has some problems as an error function. First, the function will not be smooth (differentiable) at points where
two lines of different slopes meet. More importantly, the values that minimize mae may occupy an entire interval, thus leaving us
without a unique measure of center. The error function exercises below will show you that these pathologies can really happen. It
turns out that mae is minimized at any point in the median interval of the data set x. The proof of this result follows from a much
more general result for probability distributions. Thus, the medians are the natural measures of center associated with mae as a
measure of error, in the same way that the sample mean is the measure of center associated with the mse as a measure of error.
Properties
In this section, we establish some essential properties of the sample variance and standard deviation. First, the following alternate
formula for the sample variance is better for computational purposes, and for certain theoretical purposes as well.

The sample variance can be computed as
$$s^2 = \frac{1}{n - 1} \sum_{i=1}^n x_i^2 - \frac{n}{n - 1} m^2 \tag{6.5.6}$$

Proof
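As a quick numerical illustration (not part of the original text), the following Python snippet evaluates the definition (6.5.2) and the computational formula (6.5.6) on an arbitrary small data set and confirms that they agree; it also shows the equivalent call in NumPy with `ddof=1`.

```python
import numpy as np

x = np.array([3.0, 1.0, 2.0, 0.0, 2.0, 4.0])  # arbitrary illustrative data
n, m = len(x), x.mean()

s2_def = ((x - m) ** 2).sum() / (n - 1)                    # definition (6.5.2)
s2_comp = (x ** 2).sum() / (n - 1) - n / (n - 1) * m ** 2  # formula (6.5.6)
s2_np = x.var(ddof=1)                                      # NumPy, dividing by n - 1

print(s2_def, s2_comp, s2_np)  # all three agree
```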

If we let $\boldsymbol{x}^2 = (x_1^2, x_2^2, \ldots, x_n^2)$ denote the sample from the variable $x^2$, then the computational formula in the last exercise can be written succinctly as
$$s^2(\boldsymbol{x}) = \frac{n}{n - 1}\left[m(\boldsymbol{x}^2) - m^2(\boldsymbol{x})\right] \tag{6.5.9}$$

The following theorem gives another computational formula for the sample variance, directly in terms of the variables and thus
without the computation of an intermediate statistic.

The sample variance can be computed as
$$s^2 = \frac{1}{2 n (n - 1)} \sum_{i=1}^n \sum_{j=1}^n (x_i - x_j)^2 \tag{6.5.10}$$

Proof

The sample variance is nonnegative:
1. $s^2 \ge 0$
2. $s^2 = 0$ if and only if $x_i = x_j$ for each $i, j \in \{1, 2, \ldots, n\}$.
Proof

Thus, $s^2 = 0$ if and only if the data set is constant (and then, of course, the mean is the common value).

If $c$ is a constant then
1. $s^2(c \boldsymbol{x}) = c^2 s^2(\boldsymbol{x})$
2. $s(c \boldsymbol{x}) = |c| \, s(\boldsymbol{x})$

Proof

If $\boldsymbol{c}$ is a sample of size $n$ from a constant $c$ then
1. $s^2(\boldsymbol{x} + \boldsymbol{c}) = s^2(\boldsymbol{x})$
2. $s(\boldsymbol{x} + \boldsymbol{c}) = s(\boldsymbol{x})$
Proof

As a special case of these results, suppose that $\boldsymbol{x} = (x_1, x_2, \ldots, x_n)$ is a sample of size $n$ corresponding to a real variable $x$, and that $a$ and $b$ are constants. The sample corresponding to the variable $y = a + b x$, in our vector notation, is $a + b \boldsymbol{x}$. Then $m(a + b \boldsymbol{x}) = a + b \, m(\boldsymbol{x})$ and $s(a + b \boldsymbol{x}) = |b| \, s(\boldsymbol{x})$. Linear transformations of this type, when $b > 0$, arise frequently when physical units are changed. In this case, the transformation is often called a location-scale transformation; $a$ is the location parameter and $b$ is the scale parameter. For example, if $x$ is the length of an object in inches, then $y = 2.54 x$ is the length of the object in centimeters. If $x$ is the temperature of an object in degrees Fahrenheit, then $y = \frac{5}{9}(x - 32)$ is the temperature of the object in degrees Celsius.


Now, for $i \in \{1, 2, \ldots, n\}$, let $z_i = (x_i - m)/s$. The number $z_i$ is the standard score associated with $x_i$. Note that since $x_i$, $m$, and $s$ have the same physical units, the standard score $z_i$ is dimensionless (that is, has no physical units); it measures the directed distance from the mean $m$ to the data value $x_i$ in standard deviations.

The sample of standard scores $\boldsymbol{z} = (z_1, z_2, \ldots, z_n)$ has mean 0 and variance 1. That is,
1. $m(\boldsymbol{z}) = 0$
2. $s^2(\boldsymbol{z}) = 1$

Proof
Approximating the Variance
Suppose that instead of the actual data $\boldsymbol{x}$, we have a frequency distribution corresponding to a partition with classes (intervals) $(A_1, A_2, \ldots, A_k)$, class marks (midpoints of the intervals) $(t_1, t_2, \ldots, t_k)$, and frequencies $(n_1, n_2, \ldots, n_k)$. Recall that the relative frequency of class $A_j$ is $p_j = n_j / n$. In this case, approximate values of the sample mean and variance are, respectively,
$$m = \frac{1}{n} \sum_{j=1}^k n_j t_j = \sum_{j=1}^k p_j t_j \tag{6.5.18}$$
$$s^2 = \frac{1}{n - 1} \sum_{j=1}^k n_j (t_j - m)^2 = \frac{n}{n - 1} \sum_{j=1}^k p_j (t_j - m)^2 \tag{6.5.19}$$
These approximations are based on the hope that the data values in each class are well represented by the class mark. In fact, these are the standard definitions of sample mean and variance for the data set in which $t_j$ occurs $n_j$ times for each $j$.
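A small Python sketch of these approximations (illustrative; the class marks and frequencies are made up for the example):

```python
import numpy as np

# Hypothetical frequency distribution: class marks and frequencies
t = np.array([2.5, 7.5, 12.5])   # class marks t_j (interval midpoints)
freq = np.array([4, 10, 6])      # frequencies n_j
n = freq.sum()

m = (freq * t).sum() / n                     # (6.5.18)
s2 = (freq * (t - m) ** 2).sum() / (n - 1)   # (6.5.19)
print("approximate mean:", round(m, 3), " approximate sd:", round(np.sqrt(s2), 3))
```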

Inferential Statistics
We continue our discussion of the sample variance, but now we assume that the variables are random. Thus, suppose that we have a
basic random experiment, and that X is a real-valued random variable for the experiment with mean μ and standard deviation σ.
We will need some higher order moments as well. Let $\sigma_3 = \mathbb{E}\left[(X - \mu)^3\right]$ and $\sigma_4 = \mathbb{E}\left[(X - \mu)^4\right]$ denote the 3rd and 4th moments about the mean. Recall that $\sigma_3 / \sigma^3 = \operatorname{skew}(X)$, the skewness of $X$, and $\sigma_4 / \sigma^4 = \operatorname{kurt}(X)$, the kurtosis of $X$. We assume that $\sigma_4 < \infty$.

We repeat the basic experiment $n$ times to form a new, compound experiment, with a sequence of independent random variables $\boldsymbol{X} = (X_1, X_2, \ldots, X_n)$, each with the same distribution as $X$. In statistical terms, $\boldsymbol{X}$ is a random sample of size $n$ from the distribution of $X$. All of the statistics above make sense for $\boldsymbol{X}$, of course, but now these statistics are random variables. We will use the same notation, except for the usual convention of denoting random variables by capital letters. Finally, note that the deterministic properties and relations established above still hold.
In addition to being a measure of the center of the data $\boldsymbol{X}$, the sample mean
$$M = \frac{1}{n} \sum_{i=1}^n X_i \tag{6.5.20}$$
is a natural estimator of the distribution mean $\mu$. In this section, we will derive statistics that are natural estimators of the distribution variance $\sigma^2$. The statistics that we will derive are different, depending on whether $\mu$ is known or unknown; for this reason, $\mu$ is referred to as a nuisance parameter for the problem of estimating $\sigma^2$.

A Special Sample Variance


First we will assume that $\mu$ is known. Although this is almost always an artificial assumption, it is a nice place to start because the analysis is relatively easy and will give us insight for the standard case. A natural estimator of $\sigma^2$ is the following statistic, which we will refer to as the special sample variance:
$$W^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2 \tag{6.5.21}$$

$W^2$ is the sample mean for a random sample of size $n$ from the distribution of $(X - \mu)^2$, and satisfies the following properties:
1. $\mathbb{E}\left(W^2\right) = \sigma^2$
2. $\operatorname{var}\left(W^2\right) = \frac{1}{n}\left(\sigma_4 - \sigma^4\right)$
3. $W^2 \to \sigma^2$ as $n \to \infty$ with probability 1
4. The distribution of $\sqrt{n}\left(W^2 - \sigma^2\right) / \sqrt{\sigma_4 - \sigma^4}$ converges to the standard normal distribution as $n \to \infty$.

Proof

In particular, part (a) means that $W^2$ is an unbiased estimator of $\sigma^2$. From part (b), note that $\operatorname{var}(W^2) \to 0$ as $n \to \infty$; this means that $W^2$ is a consistent estimator of $\sigma^2$. The square root of the special sample variance is a special version of the sample standard deviation, denoted $W$.

$\mathbb{E}(W) \le \sigma$. Thus, $W$ is a negatively biased estimator that tends to underestimate $\sigma$.


Proof

Next we compute the covariance and correlation between the sample mean and the special sample variance.

The covariance and correlation of $M$ and $W^2$ are
1. $\operatorname{cov}\left(M, W^2\right) = \sigma_3 / n$
2. $\operatorname{cor}\left(M, W^2\right) = \sigma_3 \big/ \sqrt{\sigma^2 \left(\sigma_4 - \sigma^4\right)}$

Proof

Note that the correlation does not depend on the sample size, and that the sample mean and the special sample variance are uncorrelated if $\sigma_3 = 0$ (equivalently $\operatorname{skew}(X) = 0$).

The Standard Sample Variance
Consider now the more realistic case in which $\mu$ is unknown. In this case, a natural approach is to average, in some sense, the squared deviations $(X_i - M)^2$ over $i \in \{1, 2, \ldots, n\}$. It might seem that we should average by dividing by $n$. However, another approach is to divide by whatever constant would give us an unbiased estimator of $\sigma^2$. This constant turns out to be $n - 1$, leading to the standard sample variance:
$$S^2 = \frac{1}{n - 1} \sum_{i=1}^n (X_i - M)^2 \tag{6.5.24}$$

$\mathbb{E}\left(S^2\right) = \sigma^2$.
Proof

Of course, the square root of the sample variance is the sample standard deviation, denoted S .

$\mathbb{E}(S) \le \sigma$. Thus, $S$ is a negatively biased estimator that tends to underestimate $\sigma$.


Proof

$S^2 \to \sigma^2$ as $n \to \infty$ with probability 1.
Proof

Since $S^2$ is an unbiased estimator of $\sigma^2$, the variance of $S^2$ is the mean square error, a measure of the quality of the estimator.
$$\operatorname{var}\left(S^2\right) = \frac{1}{n}\left(\sigma_4 - \frac{n - 3}{n - 1} \sigma^4\right)$$

Proof

Note that $\operatorname{var}(S^2) \to 0$ as $n \to \infty$, and hence $S^2$ is a consistent estimator of $\sigma^2$. On the other hand, it's not surprising that the variance of the standard sample variance (where we assume that $\mu$ is unknown) is greater than the variance of the special sample variance (in which we assume $\mu$ is known).

$\operatorname{var}\left(S^2\right) > \operatorname{var}\left(W^2\right)$.
Proof
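The comparison between $W^2$ and $S^2$ is easy to see in simulation. The Python sketch below (illustrative, with an exponential sampling distribution chosen arbitrarily) estimates the variance of each estimator across many samples; in a typical run the variance of $S^2$ comes out slightly larger, consistent with the theorem.

```python
import numpy as np

rng = np.random.default_rng(seed=4)
mu = 1.0                      # exponential with rate 1: mean 1, variance 1
n, reps = 10, 200_000

x = rng.exponential(scale=1.0, size=(reps, n))
W2 = ((x - mu) ** 2).mean(axis=1)   # special sample variance (mu known)
S2 = x.var(axis=1, ddof=1)          # standard sample variance (mu unknown)

print("mean of W^2 and S^2:", W2.mean().round(3), S2.mean().round(3))  # both near 1
print("estimated var(W^2):", W2.var().round(3))
print("estimated var(S^2):", S2.var().round(3))   # slightly larger, per the theorem
```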

Next we compute the covariance between the sample mean and the sample variance.

The covariance and correlation between the sample mean and the sample variance are
1. $\operatorname{cov}\left(M, S^2\right) = \sigma_3 / n$
2. $\operatorname{cor}\left(M, S^2\right) = \dfrac{\sigma_3}{\sigma \sqrt{\sigma_4 - \sigma^4 (n - 3)/(n - 1)}}$

Proof

In particular, note that $\operatorname{cov}(M, S^2) = \operatorname{cov}(M, W^2)$. Again, the sample mean and variance are uncorrelated if $\sigma_3 = 0$, so that $\operatorname{skew}(X) = 0$. Our last result gives the covariance and correlation between the special sample variance and the standard one. Curiously, the covariance is the same as the variance of the special sample variance.

The covariance and correlation between $W^2$ and $S^2$ are
1. $\operatorname{cov}\left(W^2, S^2\right) = \left(\sigma_4 - \sigma^4\right)/n$
2. $\operatorname{cor}\left(W^2, S^2\right) = \sqrt{\dfrac{\sigma_4 - \sigma^4}{\sigma_4 - \sigma^4 (n - 3)/(n - 1)}}$

Proof

Note that $\operatorname{cor}\left(W^2, S^2\right) \to 1$ as $n \to \infty$, not surprising since, with probability 1, $S^2 \to \sigma^2$ and $W^2 \to \sigma^2$ as $n \to \infty$.
A particularly important special case occurs when the sampling distribution is normal. This case is explored in the section on
Special Properties of Normal Samples.

Exercises
Basic Properties

Suppose that x is the temperature (in degrees Fahrenheit) for a certain type of electronic component after 10 hours of
operation. A sample of 30 components has mean 113° and standard deviation 18°.
1. Classify x by type and level of measurement.
2. Find the sample mean and standard deviation if the temperature is converted to degrees Celsius. The transformation is $y = \frac{5}{9}(x - 32)$.

Answer

Suppose that x is the length (in inches) of a machined part in a manufacturing process. A sample of 50 parts has mean 10.0 and
standard deviation 2.0.
1. Classify x by type and level of measurement.
2. Find the sample mean if length is measured in centimeters. The transformation is y = 2.54x.
Answer

Professor Moriarity has a class of 25 students in her section of Stat 101 at Enormous State University (ESU). The mean grade
on the first midterm exam was 64 (out of a possible 100 points) and the standard deviation was 16. Professor Moriarity thinks
the grades are a bit low and is considering various transformations for increasing the grades. In each case below give the mean
and standard deviation of the transformed grades, or state that there is not enough information.
1. Add 10 points to each grade, so the transformation is y = x + 10 .
2. Multiply each grade by 1.2, so the transformation is z = 1.2x
3. Use the transformation $w = 10\sqrt{x}$. Note that this is a non-linear transformation that curves the grades greatly at the low end and very little at the high end. For example, a grade of 100 is still 100, but a grade of 36 is transformed to 60.
One of the students did not study at all, and received a 10 on the midterm. Professor Moriarity considers this score to be an
outlier.
4. Find the mean and standard deviation if this score is omitted.
Answer
Computational Exercises
All statistical software packages will compute means, variances and standard deviations, draw dotplots and histograms, and in
general perform the numerical and graphical procedures discussed in this section. For real statistical experiments, particularly those
with large data sets, the use of statistical software is essential. On the other hand, there is some value in performing the
computations by hand, with small, artificial data sets, in order to master the concepts and definitions. In this subsection, do the
computations and draw the graphs with minimal technological aids.

Suppose that x is the number of math courses completed by an ESU student. A sample of 10 ESU students gives the data
x = (3, 1, 2, 0, 2, 4, 3, 2, 1, 2).
1. Classify x by type and level of measurement.
2. Sketch the dotplot.
3. Construct a table with rows corresponding to cases and columns corresponding to $i$, $x_i$, $x_i - m$, and $(x_i - m)^2$. Add rows at the bottom in the $i$ column for totals and means.
Answer

Suppose that a sample of size 12 from a discrete variable x has empirical density function given by f (−2) = 1/12 ,
f (−1) = 1/4, f (0) = 1/3, f (1) = 1/6, f (2) = 1/6.

1. Sketch the graph of f .


2. Compute the sample mean and variance.
3. Give the sample values, ordered from smallest to largest.
Answer

The following table gives a frequency distribution for the commuting distance to the math/stat building (in miles) for a sample
of ESU students.

| Class | Freq | Rel Freq | Density | Cum Freq | Cum Rel Freq | Midpoint |
|---|---|---|---|---|---|---|
| $(0, 2]$ | 6 | | | | | |
| $(2, 6]$ | 16 | | | | | |
| $(6, 10]$ | 18 | | | | | |
| $(10, 20]$ | 10 | | | | | |
| Total | | | | | | |

1. Complete the table.
2. Sketch the density histogram.
3. Sketch the cumulative relative frequency ogive.
4. Compute an approximation to the mean and standard deviation.
Answer

Error Function Exercises

In the error function app, select root mean square error. As you add points, note the shape of the graph of the error function, the
value that minimizes the function, and the minimum value of the function.

In the error function app, select mean absolute error. As you add points, note the shape of the graph of the error function, the
values that minimizes the function, and the minimum value of the function.

Suppose that our data vector is (2, 1, 5, 7). Explicitly give mae as a piecewise function and sketch its graph. Note that
1. All values of a ∈ [2, 5] minimize mae.
2. mae is not differentiable at a ∈ {1, 2, 5, 7}.

Suppose that our data vector is (3, 5, 1). Explicitly give mae as a piecewise function and sketch its graph. Note that
1. mae is minimized at a = 3 .
2. mae is not differentiable at a ∈ {1, 3, 5}.

Simulation Exercises
Many of the apps in this project are simulations of experiments with a basic random variable of interest. When you run the
simulation, you are performing independent replications of the experiment. In most cases, the app displays the standard deviation
of the distribution, both numerically in a table and graphically as the radius of the blue, horizontal bar in the graph box. When you
run the simulation, the sample standard deviation is also displayed numerically in the table and graphically as the radius of the red
horizontal bar in the graph box.

In the binomial coin experiment, the random variable is the number of heads. For various values of the parameters n (the
number of coins) and p (the probability of heads), run the simulation 1000 times and compare the sample standard deviation to
the distribution standard deviation.

In the simulation of the matching experiment, the random variable is the number of matches. For selected values of n (the
number of balls), run the simulation 1000 times and compare the sample standard deviation to the distribution standard
deviation.

Run the simulation of the gamma experiment 1000 times for various values of the rate parameter r and the shape parameter k .
Compare the sample standard deviation to the distribution standard deviation.

Probability Exercises

Suppose that $X$ has probability density function $f(x) = 12 x^2 (1 - x)$ for $0 \le x \le 1$. The distribution of $X$ is a member of the beta family. Compute each of the following:
1. $\mu = \mathbb{E}(X)$
2. $\sigma^2 = \operatorname{var}(X)$
3. $d_3 = \mathbb{E}\left[(X - \mu)^3\right]$
4. $d_4 = \mathbb{E}\left[(X - \mu)^4\right]$

Answer

Suppose now that $(X_1, X_2, \ldots, X_{10})$ is a random sample of size 10 from the beta distribution in the previous problem. Find each of the following:
1. $\mathbb{E}(M)$
2. $\operatorname{var}(M)$
3. $\mathbb{E}\left(W^2\right)$
4. $\operatorname{var}\left(W^2\right)$
5. $\mathbb{E}\left(S^2\right)$
6. $\operatorname{var}\left(S^2\right)$
7. $\operatorname{cov}\left(M, W^2\right)$
8. $\operatorname{cov}\left(M, S^2\right)$
9. $\operatorname{cov}\left(W^2, S^2\right)$

Answer

Suppose that $X$ has probability density function $f(x) = \lambda e^{-\lambda x}$ for $0 \le x < \infty$, where $\lambda > 0$ is a parameter. Thus $X$ has the exponential distribution with rate parameter $\lambda$. Compute each of the following:
1. $\mu = \mathbb{E}(X)$
2. $\sigma^2 = \operatorname{var}(X)$
3. $d_3 = \mathbb{E}\left[(X - \mu)^3\right]$
4. $d_4 = \mathbb{E}\left[(X - \mu)^4\right]$

Answer

Suppose now that $(X_1, X_2, \ldots, X_5)$ is a random sample of size 5 from the exponential distribution in the previous problem. Find each of the following:
1. $\mathbb{E}(M)$
2. $\operatorname{var}(M)$
3. $\mathbb{E}\left(W^2\right)$
4. $\operatorname{var}\left(W^2\right)$
5. $\mathbb{E}\left(S^2\right)$
6. $\operatorname{var}\left(S^2\right)$
7. $\operatorname{cov}\left(M, W^2\right)$
8. $\operatorname{cov}\left(M, S^2\right)$
9. $\operatorname{cov}\left(W^2, S^2\right)$

Answer

Recall that for an ace-six flat die, faces 1 and 6 have probability $\frac{1}{4}$ each, while faces 2, 3, 4, and 5 have probability $\frac{1}{8}$ each. Let $X$ denote the score when an ace-six flat die is thrown. Compute each of the following:
1. $\mu = \mathbb{E}(X)$
2. $\sigma^2 = \operatorname{var}(X)$
3. $d_3 = \mathbb{E}\left[(X - \mu)^3\right]$
4. $d_4 = \mathbb{E}\left[(X - \mu)^4\right]$

Answer

Suppose now that an ace-six flat die is tossed 8 times. Find each of the following:
1. $\mathbb{E}(M)$
2. $\operatorname{var}(M)$
3. $\mathbb{E}\left(W^2\right)$
4. $\operatorname{var}\left(W^2\right)$
5. $\mathbb{E}\left(S^2\right)$
6. $\operatorname{var}\left(S^2\right)$
7. $\operatorname{cov}\left(M, W^2\right)$
8. $\operatorname{cov}\left(M, S^2\right)$
9. $\operatorname{cov}\left(W^2, S^2\right)$

Answer
Data Analysis Exercises
Statistical software should be used for the problems in this subsection.

Consider the petal length and species variables in Fisher's iris data.
1. Classify the variables by type and level of measurement.
2. Compute the sample mean and standard deviation, and plot a density histogram for petal length.
3. Compute the sample mean and standard deviation, and plot a density histogram for petal length by species.
Answers

Consider the erosion variable in the Challenger data set.


1. Classify the variable by type and level of measurement.
2. Compute the mean and standard deviation
3. Plot a density histogram with the classes [0, 5), [5, 40), [40, 50), [50, 60).
Answer

Consider Michelson's velocity of light data.


1. Classify the variable by type and level of measurement.
2. Plot a density histogram.
3. Compute the sample mean and standard deviation.
4. Find the sample mean and standard deviation if the variable is converted to km/hr. The transformation is y = x + 299 000
Answer

Consider Short's parallax of the sun data.


1. Classify the variable by type and level of measurement.
2. Plot a density histogram.
3. Compute the sample mean and standard deviation.
4. Find the sample mean and standard deviation if the variable is converted to degrees. There are 3600 seconds in a degree.
5. Find the sample mean and standard deviation if the variable is converted to radians. There are π/180 radians in a degree.
Answer

Consider Cavendish's density of the earth data.


1. Classify the variable by type and level of measurement.
2. Compute the sample mean and standard deviation.
3. Plot a density histogram.
Answer

Consider the M&M data.


1. Classify the variables by type and level of measurement.

2. Compute the sample mean and standard deviation for each color count variable.
3. Compute the sample mean and standard deviation for the total number of candies.
4. Plot a relative frequency histogram for the total number of candies.
5. Compute the sample mean and standard deviation, and plot a density histogram for the net weight.
Answer

Consider the body weight, species, and gender variables in the Cicada data.
1. Classify the variables by type and level of measurement.
2. Compute the relative frequency function for species and plot the graph.
3. Compute the relative frequency function for gender and plot the graph.
4. Compute the sample mean and standard deviation, and plot a density histogram for body weight.
5. Compute the sample mean and standard deviation, and plot a density histogram for body weight by species.
6. Compute the sample mean and standard deviation, and plot a density histogram for body weight by gender.
Answer

Consider Pearson's height data.


1. Classify the variables by type and level of measurement.
2. Compute the sample mean and standard deviation, and plot a density histogram for the height of the father.
3. Compute the sample mean and standard deviation, and plot a density histogram for the height of the son.
Answer


6.6: Order Statistics

Descriptive Theory
Recall again the basic model of statistics: we have a population of objects of interest, and we have various measurements (variables) that we
make on these objects. We select objects from the population and record the variables for the objects in the sample; these become our data.
Our first discussion is from a purely descriptive point of view. That is, we do not assume that the data are generated by an underlying
probability distribution. But as always, remember that the data themselves define a probability distribution, namely the empirical distribution.
Order Statistics
Suppose that $x$ is a real-valued variable for a population and that $\boldsymbol{x} = (x_1, x_2, \ldots, x_n)$ are the observed values of a sample of size $n$ corresponding to this variable. The order statistic of rank $k$ is the $k$th smallest value in the data set, and is usually denoted $x_{(k)}$. To emphasize the dependence on the sample size, another common notation is $x_{n:k}$. Thus,
$$x_{(1)} \le x_{(2)} \le \cdots \le x_{(n-1)} \le x_{(n)} \tag{6.6.1}$$
Naturally, the underlying variable $x$ should be at least at the ordinal level of measurement. The order statistics have the same physical units as $x$. One of the first steps in exploratory data analysis is to order the data, so order statistics occur naturally. In particular, note that the extreme order statistics are
$$x_{(1)} = \min\{x_1, x_2, \ldots, x_n\}, \quad x_{(n)} = \max\{x_1, x_2, \ldots, x_n\} \tag{6.6.2}$$

The sample range is $r = x_{(n)} - x_{(1)}$ and the sample midrange is $\frac{r}{2} = \frac{1}{2}\left[x_{(n)} - x_{(1)}\right]$. These statistics have the same physical units as $x$ and are measures of the dispersion of the data set.
The Sample Median
If $n$ is odd, the sample median is the middle of the ordered observations, namely $x_{(k)}$ where $k = \frac{n+1}{2}$. If $n$ is even, there is not a single middle observation, but rather two middle observations. Thus, the median interval is $\left[x_{(k)}, x_{(k+1)}\right]$ where $k = \frac{n}{2}$. In this case, the sample median is defined to be the midpoint of the median interval, namely $\frac{1}{2}\left[x_{(k)} + x_{(k+1)}\right]$ where $k = \frac{n}{2}$. In a sense, this definition is a bit arbitrary because there is no compelling reason to prefer one point in the median interval over another. For more on this issue, see the discussion of error functions in the section on Sample Variance. In any event, the sample median is a natural statistic that gives a measure of the center of the data set.
Sample Quantiles
We can generalize the sample median discussed above to other sample quantiles. Thus, suppose that $p \in [0, 1]$. Our goal is to find the value that is the fraction $p$ of the way through the (ordered) data set. We define the rank of the value that we are looking for as $(n - 1)p + 1$. Note that the rank is a linear function of $p$, and that the rank is 1 when $p = 0$ and $n$ when $p = 1$. But of course, the rank will not be an integer in general, so we let $k = \lfloor (n - 1)p + 1 \rfloor$, the integer part of the desired rank, and we let $t = [(n - 1)p + 1] - k$, the fractional part of the desired rank. Thus, $(n - 1)p + 1 = k + t$ where $k \in \{1, 2, \ldots, n\}$ and $t \in [0, 1)$. So, using linear interpolation, we define the sample quantile of order $p$ to be
$$x_{[p]} = x_{(k)} + t\left[x_{(k+1)} - x_{(k)}\right] = (1 - t) x_{(k)} + t \, x_{(k+1)} \tag{6.6.3}$$
Sample quantiles have the same physical units as the underlying variable $x$. The algorithm really does generalize the results for sample medians.
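The linear interpolation rule (6.6.3) is easy to implement directly. The Python sketch below is an illustration, not part of the original text; the function name `sample_quantile` and the test data are our own, and for reference the rule agrees with the default behavior of `numpy.quantile`.

```python
import numpy as np

def sample_quantile(data, p):
    """Sample quantile of order p, via rank (n-1)p + 1 and linear interpolation."""
    x = np.sort(np.asarray(data, dtype=float))  # order statistics x_(1) <= ... <= x_(n)
    n = len(x)
    rank = (n - 1) * p + 1
    k = int(np.floor(rank))          # integer part of the rank
    t = rank - k                     # fractional part of the rank
    if k >= n:                       # p = 1: return the maximum
        return x[-1]
    return (1 - t) * x[k - 1] + t * x[k]   # x is 0-indexed, so x[k-1] is x_(k)

data = [2, 1, 5, 7, 4]
print(sample_quantile(data, 0.5), np.quantile(data, 0.5))    # both give the median 4.0
print(sample_quantile(data, 0.25), np.quantile(data, 0.25))  # both give 2.0
```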

The sample quantile of order $p = \frac{1}{2}$ is the median as defined earlier, in both cases where $n$ is odd and where $n$ is even.
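To make the interpolation algorithm concrete, here is a minimal Python sketch (ours, not part of the original text; the function name `sample_quantile` is invented for illustration). For what it's worth, this rank convention agrees with the default linear-interpolation method of `numpy.quantile`.

```python
import math

def sample_quantile(data, p):
    """Sample quantile of order p, using the rank (n - 1)p + 1 and linear interpolation."""
    xs = sorted(data)            # the order statistics x_(1) <= ... <= x_(n)
    n = len(xs)
    rank = (n - 1) * p + 1       # desired rank, a real number between 1 and n
    k = math.floor(rank)         # integer part k in {1, 2, ..., n}
    t = rank - k                 # fractional part t in [0, 1)
    if k == n:                   # p = 1 gives the maximum
        return xs[-1]
    return (1 - t) * xs[k - 1] + t * xs[k]   # interpolate between x_(k) and x_(k+1)

x = [3, 1, 2, 0, 2, 4, 3, 2, 1, 2]
print(sample_quantile(x, 0.5), sample_quantile(x, 0.25))   # 2.0 and 1.25
```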

The sample quantile of order $\frac{1}{4}$ is known as the first quartile and is frequently denoted $q_1$. The sample quantile of order $\frac{3}{4}$ is known as the third quartile and is frequently denoted $q_3$. The sample median, which is the quartile of order $\frac{1}{2}$, is sometimes denoted $q_2$. The interquartile range is defined to be $\text{iqr} = q_3 - q_1$. Note that $\text{iqr}$ is a statistic that measures the spread of the distribution about the median, but of course this number gives less information than the interval $[q_1, q_3]$.

The statistic $q_1 - \frac{3}{2}\,\text{iqr}$ is called the lower fence and the statistic $q_3 + \frac{3}{2}\,\text{iqr}$ is called the upper fence. Sometimes lower limit and upper limit are used instead of lower fence and upper fence. Values in the data set that are below the lower fence or above the upper fence are potential outliers, that is, values that don't seem to fit the overall pattern of the data. An outlier can be due to a measurement error, or may be a valid but rather extreme value. In any event, outliers usually deserve additional study.
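The fences and the five-number summary are easy to assemble from the quantile algorithm. The sketch below (ours, with made-up test scores) flags the values outside the fences as potential outliers.

```python
import math

def quantile(xs, p):
    # sample quantile of order p (xs must be sorted): rank (n - 1)p + 1, linear interpolation
    n = len(xs)
    rank = (n - 1) * p + 1
    k = math.floor(rank)
    t = rank - k
    return xs[-1] if k == n else (1 - t) * xs[k - 1] + t * xs[k]

data = sorted([47, 50, 53, 54, 56, 60, 60, 62, 63, 65, 65, 67, 68, 70, 95])
q1, q2, q3 = (quantile(data, p) for p in (0.25, 0.50, 0.75))
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr      # the lower and upper fences
five = (data[0], q1, q2, q3, data[-1])             # five-number summary
outliers = [v for v in data if v < lower or v > upper]
print(five, iqr, outliers)                          # the value 95 is flagged
```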
The five statistics $\left(x_{(1)}, q_1, q_2, q_3, x_{(n)}\right)$ are often referred to as the five-number summary. Together, these statistics give a great deal of information about the data set in terms of the center, spread, and skewness. The five numbers roughly separate the data set into four intervals

each of which contains approximately 25% of the data. Graphically, the five numbers, and the outliers, are often displayed as a boxplot, sometimes called a box and whisker plot. A boxplot consists of an axis that extends across the range of the data. A line is drawn from the smallest value that is not an outlier (of course, this may be the minimum $x_{(1)}$) to the largest value that is not an outlier (of course, this may be the maximum $x_{(n)}$). Vertical marks (“whiskers”) are drawn at the ends of this line. A rectangular box extends from the first quartile $q_1$ to the third quartile $q_3$, with an additional whisker at the median $q_2$. Finally, the outliers are denoted as points (beyond the extreme whiskers). All statistical packages will compute the quartiles and most will draw boxplots. The picture below shows a boxplot with 3 outliers.

Figure 6.6.1 : Boxplot


Alternate Definitions
The algorithm given above is not the only reasonable way to define sample quantiles, and indeed there are lots of alternatives. One natural
method would be to first compute the empirical distribution function
$$F(x) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}(x_i \le x), \quad x \in \mathbb{R} \tag{6.6.4}$$

Recall that $F$ has the mathematical properties of a distribution function, and in fact $F$ is the distribution function of the empirical distribution of the data. Recall that this is the distribution that places probability $\frac{1}{n}$ at each data value $x_i$ (so this is the discrete uniform distribution on $\{x_1, x_2, \ldots, x_n\}$ if the data values are distinct). Thus, $F(x) = \frac{k}{n}$ for $x \in \left[x_{(k)}, x_{(k+1)}\right)$. Then, we could define the quantile function to be the inverse of the distribution function, as we usually do for probability distributions:


$$F^{-1}(p) = \min\{x \in \mathbb{R} : F(x) \ge p\}, \quad p \in (0, 1) \tag{6.6.5}$$

It's easy to see that with this definition, the quantile of order $p \in (0, 1)$ is simply $x_{(k)}$ where $k = \lceil np \rceil$.
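In code, this alternate definition is even simpler than the interpolation method. The sketch below (ours, with the hypothetical name `ecdf_quantile`) returns the order statistic of rank $\lceil np \rceil$.

```python
import math

def ecdf_quantile(data, p):
    # quantile of order p in (0, 1) as the inverse of the empirical CDF:
    # the smallest x with F(x) >= p, i.e. the order statistic of rank ceil(np)
    xs = sorted(data)
    k = math.ceil(len(xs) * p)
    return xs[k - 1]            # x_(k), converting to 0-based indexing

x = [3, 1, 2, 0, 2, 4, 3, 2, 1, 2]
print(ecdf_quantile(x, 0.5))   # 2, since F(2) = 7/10 is the first value >= 1/2
```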
Another method is to compute the rank of the quantile of order $p \in (0, 1)$ as $(n + 1)p$, rather than $(n - 1)p + 1$, and then use linear interpolation just as we have done. To understand the reasoning behind this method, suppose that the underlying variable $x$ takes values in an interval $(a, b)$. Then the $n$ points in the data set $\mathbf{x}$ separate this interval into $n + 1$ subintervals, so it's reasonable to think of $x_{(k)}$ as the quantile of order $\frac{k}{n+1}$. This method also reduces to the standard calculation for the median when $p = \frac{1}{2}$. However, the method will fail if $p$ is so small that $(n + 1)p < 1$ or so large that $(n + 1)p > n$.


The primary definition that we give above is the one that is most commonly used in statistical software and spreadsheets. Moreover, when the
sample size n is large, it doesn't matter very much which of these competing quantile definitions is used. All will give similar results.
Transformations
Suppose again that $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ is a sample of size $n$ from a population variable $x$, but now suppose also that $y = a + bx$ is a new variable, where $a \in \mathbb{R}$ and $b \in (0, \infty)$. Recall that transformations of this type are location-scale transformations and often correspond to changes in units. For example, if $x$ is the length of an object in inches, then $y = 2.54x$ is the length of the object in centimeters. If $x$ is the temperature of an object in degrees Fahrenheit, then $y = \frac{5}{9}(x - 32)$ is the temperature of the object in degrees Celsius. Let $\mathbf{y} = a + b\mathbf{x}$ denote the sample from the variable $y$.

Order statistics and quantiles are preserved under location-scale transformations:

1. $y_{(i)} = a + b\,x_{(i)}$ for $i \in \{1, 2, \ldots, n\}$
2. $y_{[p]} = a + b\,x_{[p]}$ for $p \in [0, 1]$
Proof

Like standard deviation (our most important measure of spread), range and interquartile range are not affected by the location parameter, but
are scaled by the scale parameter.

The range and interquartile range of y are


1. r(y) = b r(x)
2. iqr(y) = b iqr(x)
Proof

More generally, suppose $y = g(x)$ where $g$ is a strictly increasing real-valued function on the set of possible values of $x$. Let $\mathbf{y} = (g(x_1), g(x_2), \ldots, g(x_n))$ denote the sample corresponding to the variable $y$. Then (as in the proof of Theorem 2), the order statistics are preserved, so $y_{(i)} = g\left(x_{(i)}\right)$. However, if $g$ is nonlinear, the quantiles are not preserved (because the quantiles involve linear interpolation). That is, $y_{[p]}$ and $g\left(x_{[p]}\right)$ are not usually the same. When $g$ is convex or concave we can at least give an inequality for the sample quantiles.

Suppose that $y = g(x)$ where $g$ is strictly increasing. Then

1. $y_{(i)} = g\left(x_{(i)}\right)$ for $i \in \{1, 2, \ldots, n\}$
2. If $g$ is convex then $y_{[p]} \ge g\left(x_{[p]}\right)$ for $p \in [0, 1]$
3. If $g$ is concave then $y_{[p]} \le g\left(x_{[p]}\right)$ for $p \in [0, 1]$
Proof
Stem and Leaf Plots
A stem and leaf plot is a graphical display of the order statistics $\left(x_{(1)}, x_{(2)}, \ldots, x_{(n)}\right)$. It has the benefit of showing the data in a graphical way, like a histogram, and at the same time, preserving the ordered data. First we assume that the data have a fixed number format: a fixed number of digits, then perhaps a decimal point and another fixed number of digits. A stem and leaf plot is constructed by using an initial part of this string as the stem, and the remaining parts as the leaves. There are lots of variations in how to do this, so rather than give an exhaustive, complicated definition, we will just look at a couple of examples in the exercises below.

Probability Theory
We continue our discussion of order statistics, except that now we assume that the variables are random variables. Specifically, suppose that we have a basic random experiment, and that $X$ is a real-valued random variable for the experiment with distribution function $F$. We perform $n$ independent replications of the basic experiment to generate a random sample $\mathbf{X} = (X_1, X_2, \ldots, X_n)$ of size $n$ from the distribution of $X$. Recall that this is a sequence of independent random variables, each with the distribution of $X$. All of the statistics defined in the previous section make sense, but now of course they are random variables. We use the notation established previously, except that we follow our usual convention of denoting random variables with capital letters. Thus, for $k \in \{1, 2, \ldots, n\}$, $X_{(k)}$ is the $k$th order statistic, that is, the $k$th smallest of $(X_1, X_2, \ldots, X_n)$. Our interest now is in the distribution of the order statistics and statistics derived from them.

Distribution of the kth order statistic


Finding the distribution function of an order statistic is a nice application of Bernoulli trials and the binomial distribution.

The distribution function $F_k$ of $X_{(k)}$ is given by

$$F_k(x) = \sum_{j=k}^n \binom{n}{j} [F(x)]^j [1 - F(x)]^{n-j}, \quad x \in \mathbb{R} \tag{6.6.8}$$

Proof
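As a numerical sanity check (our sketch, not part of the text), the binomial-sum formula can be compared with a straightforward simulation of the $k$th order statistic, here for a standard uniform sample.

```python
import random
from math import comb

def F_k(x, n, k, F):
    # P(X_(k) <= x): at least k of the n sample values fall at or below x,
    # i.e. at least k successes in n Bernoulli trials with success probability F(x)
    p = F(x)
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

n, k, x0 = 5, 3, 0.4
exact = F_k(x0, n, k, lambda x: min(max(x, 0.0), 1.0))   # standard uniform CDF

random.seed(1)
reps = 100_000
hits = sum(sorted(random.random() for _ in range(n))[k - 1] <= x0
           for _ in range(reps))
print(exact, hits / reps)   # about 0.317 in both cases
```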

As always, the extreme order statistics are particularly interesting.

The distribution functions $F_1$ of $X_{(1)}$ and $F_n$ of $X_{(n)}$ are given by

1. $F_1(x) = 1 - [1 - F(x)]^n$ for $x \in \mathbb{R}$
2. $F_n(x) = [F(x)]^n$ for $x \in \mathbb{R}$

The quantile functions $F_1^{-1}$ and $F_n^{-1}$ of $X_{(1)}$ and $X_{(n)}$ are given by

1. $F_1^{-1}(p) = F^{-1}\left[1 - (1 - p)^{1/n}\right]$ for $p \in (0, 1)$
2. $F_n^{-1}(p) = F^{-1}\left(p^{1/n}\right)$ for $p \in (0, 1)$
Proof
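These quantile functions give a direct way to simulate each extreme by the inverse transform method. The sketch below (ours) does this for the standard uniform case, where $F^{-1}$ is the identity; note that it simulates the two marginal distributions separately and says nothing about the joint distribution of the pair.

```python
import random

def sim_extremes(n):
    # inverse transform: feed independent uniforms through the quantile
    # functions of X_(1) and X_(n) for a standard uniform sample of size n
    p, q = random.random(), random.random()
    return 1 - (1 - p) ** (1 / n), q ** (1 / n)

random.seed(2)
n, reps = 10, 100_000
sims = [sim_extremes(n) for _ in range(reps)]
print(sum(s[0] for s in sims) / reps)   # approx E(X_(1)) = 1/(n+1)
print(sum(s[1] for s in sims) / reps)   # approx E(X_(n)) = n/(n+1)
```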

When the underlying distribution is continuous, we can give a simple formula for the probability density function of an order statistic.

Suppose now that $X$ has a continuous distribution with probability density function $f$. Then $X_{(k)}$ has a continuous distribution with probability density function $f_k$ given by

$$f_k(x) = \frac{n!}{(k-1)!\,(n-k)!} [F(x)]^{k-1} [1 - F(x)]^{n-k} f(x), \quad x \in \mathbb{R} \tag{6.6.11}$$

Proof
Heuristic Proof

Here are the special cases for the extreme order statistics.

The probability density functions $f_1$ of $X_{(1)}$ and $f_n$ of $X_{(n)}$ are given by

1. $f_1(x) = n[1 - F(x)]^{n-1} f(x)$ for $x \in \mathbb{R}$
2. $f_n(x) = n[F(x)]^{n-1} f(x)$ for $x \in \mathbb{R}$

Joint Distributions
We assume again that X has a continuous distribution with distribution function F and probability density function f .

Suppose that $j, k \in \{1, 2, \ldots, n\}$ with $j < k$. The joint probability density function $f_{j,k}$ of $\left(X_{(j)}, X_{(k)}\right)$ is given by

$$f_{j,k}(x, y) = \frac{n!}{(j-1)!\,(k-j-1)!\,(n-k)!} [F(x)]^{j-1} [F(y) - F(x)]^{k-j-1} [1 - F(y)]^{n-k} f(x) f(y); \quad x, y \in \mathbb{R},\ x < y \tag{6.6.17}$$

Heuristic Proof

From the joint distribution of two order statistics we can, in principle, find the distribution of various other statistics: the sample range $R$; sample quantiles $X_{[p]}$ for $p \in [0, 1]$, and in particular the sample quartiles $Q_1, Q_2, Q_3$; and the interquartile range $\mathrm{IQR}$. The joint distribution of the extreme order statistics $\left(X_{(1)}, X_{(n)}\right)$ is a particularly important case.

The joint probability density function $f_{1,n}$ of $\left(X_{(1)}, X_{(n)}\right)$ is given by

$$f_{1,n}(x, y) = n(n-1) [F(y) - F(x)]^{n-2} f(x) f(y); \quad x, y \in \mathbb{R},\ x < y \tag{6.6.20}$$

Proof

Arguments similar to the one above can be used to obtain the joint probability density function of any number of the order statistics. Of
course, we are particularly interested in the joint probability density function of all of the order statistics. It turns out that this density function
has a remarkably simple form.

$\left(X_{(1)}, X_{(2)}, \ldots, X_{(n)}\right)$ has joint probability density function $g$ given by

$$g(x_1, x_2, \ldots, x_n) = n!\, f(x_1) f(x_2) \cdots f(x_n), \quad x_1 < x_2 < \cdots < x_n \tag{6.6.21}$$

Proof
Heuristic Proof
Probability Plots

A probability plot, also called a quantile-quantile plot or a Q-Q plot for short, is an informal, graphical test to determine if observed data come from a specified distribution. Thus, suppose that we observe real-valued data $(x_1, x_2, \ldots, x_n)$ from a random sample of size $n$. We are interested in the question of whether the data could reasonably have come from a continuous distribution with distribution function $F$. First, we order the data from smallest to largest; this gives us the sequence of observed values of the order statistics: $\left(x_{(1)}, x_{(2)}, \ldots, x_{(n)}\right)$. Note that we can view $x_{(i)}$ as the sample quantile of order $\frac{i}{n+1}$. Of course, by definition, the distribution quantile of order $\frac{i}{n+1}$ is $y_i = F^{-1}\left(\frac{i}{n+1}\right)$. If the data really do come from the distribution, then we would expect the points $\left(\left(x_{(1)}, y_1\right), \left(x_{(2)}, y_2\right), \ldots, \left(x_{(n)}, y_n\right)\right)$ to be close to the diagonal line $y = x$; conversely, strong deviation from this line is evidence that the distribution did not produce the data. The plot of these points is referred to as a probability plot.
Usually however, we are not trying to see if the data come from a particular distribution, but rather from a parametric family of distributions (such as the normal, uniform, or exponential families). We are usually forced into this situation because we don't know the parameters; indeed the next step, after the probability plot, may be to estimate the parameters. Fortunately, the probability plot method has a simple extension for any location-scale family of distributions. Thus, suppose that $G$ is a given distribution function. Recall that the location-scale family associated with $G$ has distribution function $F(x) = G\left(\frac{x - a}{b}\right)$ for $x \in \mathbb{R}$, where $a \in \mathbb{R}$ is the location parameter and $b \in (0, \infty)$ is the scale parameter. Recall also that for $p \in (0, 1)$, if $z_p = G^{-1}(p)$ denotes the quantile of order $p$ for $G$ and $y_p = F^{-1}(p)$ the quantile of order $p$ for $F$, then $y_p = a + b z_p$. It follows that if the probability plot constructed with distribution function $F$ is nearly linear (and in particular, if it is close to the diagonal line), then the probability plot constructed with distribution function $G$ will be nearly linear. Thus, we can use the distribution function $G$ without having to know the location and scale parameters.
In the exercises below, you will explore probability plots for the normal, exponential, and uniform distributions. We will study a formal,
quantitative procedure, known as the chi-square goodness of fit test in the chapter on Hypothesis Testing.
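The construction of the plotted points is straightforward. The sketch below (ours, with simulated data) uses `statistics.NormalDist` from the Python standard library for the standard normal quantile function. As discussed above, rough linearity of the pairs (not necessarily closeness to the diagonal) is the criterion when the test distribution is standard normal but the data come from some other normal distribution.

```python
import random
from statistics import NormalDist

random.seed(3)
data = [random.gauss(5, 2) for _ in range(20)]   # data to be tested for normality
xs = sorted(data)                                # observed order statistics
n = len(xs)
# standard normal quantiles of orders i/(n+1), i = 1, ..., n
ys = [NormalDist().inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
for x, y in zip(xs, ys):
    print(f"{x:8.3f} {y:8.3f}")   # roughly linear pairs indicate normality
```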

Exercises and Applications


Basic Properties

Suppose that x is the temperature (in degrees Fahrenheit) for a certain type of electronic component after 10 hours of operation. A sample
of 30 components has five number summary (84, 102, 113, 120, 135) .
1. Classify x by type and level of measurement.
2. Find the range and interquartile range.

3. Find the five number summary, range, and interquartile range if the temperature is converted to degrees Celsius. The transformation is $y = \frac{5}{9}(x - 32)$.

Answer

Suppose that x is the length (in inches) of a machined part in a manufacturing process. A sample of 50 parts has five number summary
(9.6, 9.8, 10.0, 10.1, 10.3).
1. Classify x by type and level of measurement.
2. Find the range and interquartile range.
3. Find the five number summary, range, and interquartile range if length is measured in centimeters. The transformation is $y = 2.54x$.
Answer

Professor Moriarity has a class of 25 students in her section of Stat 101 at Enormous State University (ESU). For the first midterm exam,
the five number summary was (16, 52, 64, 72, 81) (out of a possible 100 points). Professor Moriarity thinks the grades are a bit low and is
considering various transformations for increasing the grades.
1. Find the range and interquartile range.
2. Suppose she adds 10 points to each grade. Find the five number summary, range, and interquartile range for the transformed grades.
3. Suppose she multiplies each grade by 1.2. Find the five number summary, range, and interquartile range for the transformed grades.
4. Suppose she uses the transformation $w = 10\sqrt{x}$, which curves the grades greatly at the low end and very little at the high end. Give whatever information you can about the five number summary of the transformed grades.
5. Determine whether the low score of 16 is an outlier.
Answer
Computational Exercises
All statistical software packages will compute order statistics and quantiles, draw stem-and-leaf plots and boxplots, and in general perform
the numerical and graphical procedures discussed in this section. For real statistical experiments, particularly those with large data sets, the
use of statistical software is essential. On the other hand, there is some value in performing the computations by hand, with small, artificial
data sets, in order to master the concepts and definitions. In this subsection, do the computations and draw the graphs with minimal
technological aids.

Suppose that $x$ is the number of math courses completed by an ESU student. A sample of 10 ESU students gives the data $\mathbf{x} = (3, 1, 2, 0, 2, 4, 3, 2, 1, 2)$.

1. Classify x by type and level of measurement.


2. Give the order statistics
3. Compute the five number summary and draw the boxplot.
4. Compute the range and the interquartile range.
Answer

Suppose that a sample of size 12 from a discrete variable x has empirical density function given by f (−2) = 1/12 , f (−1) = 1/4,
f (0) = 1/3, f (1) = 1/6, f (2) = 1/6.

1. Give the order statistics.


2. Compute the five number summary and draw the boxplot.
3. Compute the range and the interquartile range.
Answer

The stem and leaf plot below gives the grades for a 100-point test in a probability course with 38 students. The first digit is the stem and
the second digit is the leaf. Thus, the low score was 47 and the high score was 98. The scores in the 6 row are 60, 60, 62, 63, 65, 65, 67,
68.

4 7

5 0346

6 00235578

7 0112346678899

8 0367889

9 1368

Compute the five number summary and draw the boxplot.

Answer
App Exercises

In the histogram app, construct a distribution with at least 30 values of each of the types indicated below. Note the five number summary.
1. A uniform distribution.
2. A symmetric, unimodal distribution.
3. A unimodal distribution that is skewed right.
4. A unimodal distribution that is skewed left.
5. A symmetric bimodal distribution.
6. A u-shaped distribution.

In the error function app, start with a distribution and add additional points as follows. Note the effect on the five number summary:

1. Add a point below $x_{(1)}$.
2. Add a point between $x_{(1)}$ and $q_1$.
3. Add a point between $q_1$ and $q_2$.
4. Add a point between $q_2$ and $q_3$.
5. Add a point between $q_3$ and $x_{(n)}$.
6. Add a point above $x_{(n)}$.

In the last problem, you may have noticed that when you add an additional point to the distribution, one or more of the five statistics does not
change. In general, quantiles can be relatively insensitive to changes in the data.
The Uniform Distribution

Recall that the standard uniform distribution is the uniform distribution on the interval [0, 1].

Suppose that $\mathbf{X}$ is a random sample of size $n$ from the standard uniform distribution. For $k \in \{1, 2, \ldots, n\}$, $X_{(k)}$ has the beta distribution, with left parameter $k$ and right parameter $n - k + 1$. The probability density function $f_k$ is given by

$$f_k(x) = \frac{n!}{(k-1)!\,(n-k)!} x^{k-1} (1 - x)^{n-k}, \quad 0 \le x \le 1 \tag{6.6.22}$$

Proof
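A quick empirical check of this result (our sketch, not from the text): the simulated mean of $X_{(k)}$ should be close to $k/(n+1)$, the mean of the beta distribution with parameters $k$ and $n - k + 1$.

```python
import random

random.seed(4)
n, k, reps = 5, 2, 100_000
# k-th order statistic of a standard uniform sample, repeated many times
sims = [sorted(random.random() for _ in range(n))[k - 1] for _ in range(reps)]
print(sum(sims) / reps, k / (n + 1))   # simulated mean vs beta mean k/(n+1) = 1/3
```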

In the order statistic experiment, select the standard uniform distribution and $n = 5$. Vary $k$ from 1 to 5 and note the shape of the probability density function of $X_{(k)}$. For each value of $k$, run the simulation 1000 times and compare the empirical density function to the true probability density function.

It's easy to extend the results for the standard uniform distribution to the general uniform distribution on an interval.

Suppose that $\mathbf{X}$ is a random sample of size $n$ from the uniform distribution on the interval $[a, a + h]$ where $a \in \mathbb{R}$ and $h \in (0, \infty)$. For $k \in \{1, 2, \ldots, n\}$, $X_{(k)}$ has the beta distribution with left parameter $k$, right parameter $n - k + 1$, location parameter $a$, and scale parameter $h$. In particular,

1. $E\left(X_{(k)}\right) = a + h \frac{k}{n+1}$
2. $\operatorname{var}\left(X_{(k)}\right) = h^2 \frac{k(n-k+1)}{(n+1)^2 (n+2)}$

Proof

We return to the standard uniform distribution and consider the range of the random sample.

Suppose that $\mathbf{X}$ is a random sample of size $n$ from the standard uniform distribution. The sample range $R$ has the beta distribution with left parameter $n - 1$ and right parameter 2. The probability density function $g$ is given by

$$g(r) = n(n-1)\, r^{n-2} (1 - r), \quad 0 \le r \le 1 \tag{6.6.23}$$

Proof

Once again, it's easy to extend this result to a general uniform distribution.

Suppose that $\mathbf{X} = (X_1, X_2, \ldots, X_n)$ is a random sample of size $n$ from the uniform distribution on $[a, a + h]$ where $a \in \mathbb{R}$ and $h \in (0, \infty)$. The sample range $R = X_{(n)} - X_{(1)}$ has the beta distribution with left parameter $n - 1$, right parameter 2, and scale parameter $h$. In particular,

1. $E(R) = h \frac{n-1}{n+1}$
2. $\operatorname{var}(R) = h^2 \frac{2(n-1)}{(n+1)^2 (n+2)}$

Proof

The joint distribution of the order statistics for a sample from the uniform distribution is easy to get.

Suppose that $(X_1, X_2, \ldots, X_n)$ is a random sample of size $n$ from the uniform distribution on the interval $[a, a + h]$, where $a \in \mathbb{R}$ and $h \in (0, \infty)$. Then $\left(X_{(1)}, X_{(2)}, \ldots, X_{(n)}\right)$ is uniformly distributed on $\left\{\mathbf{x} \in [a, a + h]^n : a \le x_1 \le x_2 \le \cdots \le x_n < a + h\right\}$.

Proof
The Exponential Distribution
Recall that the exponential distribution with rate parameter $\lambda > 0$ has probability density function

$$f(x) = \lambda e^{-\lambda x}, \quad 0 \le x < \infty \tag{6.6.25}$$

The exponential distribution is widely used to model failure times and other random times under certain ideal conditions. In particular, the
exponential distribution governs the times between arrivals in the Poisson process.

Suppose that $\mathbf{X}$ is a random sample of size $n$ from the exponential distribution with rate parameter $\lambda$. The probability density function of the $k$th order statistic $X_{(k)}$ is

$$f_k(x) = \frac{n!}{(k-1)!\,(n-k)!} \lambda \left(1 - e^{-\lambda x}\right)^{k-1} e^{-\lambda(n-k+1)x}, \quad 0 \le x < \infty \tag{6.6.26}$$

In particular, the minimum of the variables, $X_{(1)}$, also has an exponential distribution, but with rate parameter $n\lambda$.
Proof
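Here is a small simulation (ours, not from the text) of the special case noted above: the minimum of $n$ independent exponentials with rate $\lambda$ is again exponential with rate $n\lambda$, and hence has mean $1/(n\lambda)$.

```python
import random

random.seed(5)
lam, n, reps = 2.0, 5, 100_000
# minimum of n independent exponential(lam) variables, repeated many times
mins = [min(random.expovariate(lam) for _ in range(n)) for _ in range(reps)]
print(sum(mins) / reps, 1 / (n * lam))   # both close to 0.1
```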

In the order statistic experiment, select the standard exponential distribution and $n = 5$. Vary $k$ from 1 to 5 and note the shape of the probability density function of $X_{(k)}$. For each value of $k$, run the simulation 1000 times and compare the empirical density function to the true probability density function.

Suppose again that $\mathbf{X}$ is a random sample of size $n$ from the exponential distribution with rate parameter $\lambda$. The sample range $R$ has the same distribution as the maximum of a random sample of size $n - 1$ from the exponential distribution. The probability density function is

$$h(t) = (n-1)\,\lambda \left(1 - e^{-\lambda t}\right)^{n-2} e^{-\lambda t}, \quad 0 \le t < \infty \tag{6.6.27}$$

Proof

Suppose again that $\mathbf{X}$ is a random sample of size $n$ from the exponential distribution with rate parameter $\lambda$. The joint probability density function of the order statistics $\left(X_{(1)}, X_{(2)}, \ldots, X_{(n)}\right)$ is

$$g(x_1, x_2, \ldots, x_n) = n!\, \lambda^n e^{-\lambda(x_1 + x_2 + \cdots + x_n)}, \quad 0 \le x_1 \le x_2 \le \cdots \le x_n < \infty \tag{6.6.30}$$

Proof
Dice

Four fair dice are rolled. Find the probability density function of each of the order statistics.
Answer

In the dice experiment, select the order statistic and die distribution given in parts (a)–(d) below. Increase the number of dice from 1 to
20, noting the shape of the probability density function at each stage. Now with n = 4 , run the simulation 1000 times, and note the
apparent convergence of the relative frequency function to the probability density function.
1. Maximum score with fair dice.
2. Minimum score with fair dice.
3. Maximum score with ace-six flat dice.
4. Minimum score with ace-six flat dice.

Four fair dice are rolled. Find the joint probability density function of the four order statistics.

Answer

Four fair dice are rolled. Find the probability density function of the sample range.
Answer
Probability Plot Simulations

In the probability plot experiment, set the sampling distribution to the normal distribution with mean 5 and standard deviation 2. Set the
sample size to n = 20 . For each of the following test distributions, run the experiment 50 times and note the geometry of the probability
plot:
1. Standard normal
2. Uniform on the interval [0, 1]
3. Exponential with parameter 1

In the probability plot experiment, set the sampling distribution to the uniform distribution on [4, 10]. Set the sample size to n = 20 . For
each of the following test distributions, run the experiment 50 times and note the geometry of the probability plot:
1. Standard normal
2. Uniform on the interval [0, 1]
3. Exponential with parameter 1

In the probability plot experiment, set the sampling distribution to the exponential distribution with parameter 3. Set the sample size to
n = 20 . For each of the following test distributions, run the experiment 50 times and note the geometry of the probability plot:

1. Standard normal
2. Uniform on the interval [0, 1]
3. Exponential with parameter 1

Data Analysis Exercises

Statistical software should be used for the problems in this subsection.

Consider the petal length and species variables in Fisher's iris data.
1. Classify the variables by type and level of measurement.
2. Compute the five number summary and draw the boxplot for petal length.
3. Compute the five number summary and draw the boxplot for petal length by species.
4. Draw the normal probability plot for petal length.
Answers

Consider the erosion variable in the Challenger data set.


1. Classify the variable by type and level of measurement.
2. Compute the five number summary and draw the boxplot.
3. Identify any outliers.
Answer

A stem and leaf plot of Michelson's velocity of light data is given below. In this example, the last digit (which is always 0) has been left
out, for convenience. Also, note that there are two sets of leaves for each stem, one corresponding to leaves from 0 to 4 (so actually from
00 to 40) and the other corresponding to leaves from 5 to 9 (so actually from 50 to 90). Thus, the minimum value is 620 and the numbers
in the second 7 row are 750, 760, 760, and so forth.

6 2

6 5

7 222444

7 566666788999

8 000001111111111223344444444

9 0011233444

9 55566667888

10 000

10 7

1. Classify the variable by type and level of measurement.
2. Compute the five number summary and draw the boxplot.
3. Compute the five number summary for the velocity in km/sec. The transformation is $y = x + 299\,000$.
4. Draw the normal probability plot.
Answer

Consider Short's parallax of the sun data.


1. Classify the variable by type and level of measurement.
2. Compute the five number summary and draw the boxplot.
3. Compute the five number summary and draw the boxplot if the variable is converted to degrees. There are 3600 seconds in a degree.
4. Compute the five number summary and draw the boxplot if the variable is converted to radians. There are π/180 radians in a degree.
5. Draw the normal probability plot.
Answer

Consider Cavendish's density of the earth data.


1. Classify the variable by type and level of measurement.
2. Compute the five number summary and draw the boxplot.
3. Draw the normal probability plot.
Answer

Consider the M&M data.


1. Classify the variables by type and level of measurement.
2. Compute the five number summary and draw the boxplot for each color count.
3. Construct a stem and leaf plot for the total number of candies.
4. Compute the five number summary and draw the boxplot for the total number of candies.
5. Compute the five number summary and draw the boxplot for net weight.
Answer

Consider the body weight, species, and gender variables in the Cicada data.
1. Classify the variables by type and level of measurement.
2. Compute the five number summary and draw the boxplot for body weight.
3. Compute the five number summary and draw the boxplot for body weight by species.
4. Compute the five number summary and draw the boxplot for body weight by gender.
Answer

Consider Pearson's height data.


1. Classify the variables by type and level of measurement.
2. Compute the five number summary and sketch the boxplot for the height of the father.
3. Compute the five number summary and sketch the boxplot for the height of the son.
Answer

This page titled 6.6: Order Statistics is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via
source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

6.7: Sample Correlation and Regression

Descriptive Theory
Recall the basic model of statistics: we have a population of objects of interest, and we have various measurements (variables) that
we make on these objects. We select objects from the population and record the variables for the objects in the sample; these
become our data. Our first discussion is from a purely descriptive point of view. That is, we do not assume that the data are
generated by an underlying probability distribution. But as always, remember that the data themselves define a probability
distribution, namely the empirical distribution that assigns equal probability to each data point.
Suppose that $x$ and $y$ are real-valued variables for a population, and that $((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n))$ is an observed sample of size $n$ from $(x, y)$. We will let $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ denote the sample from $x$ and $\mathbf{y} = (y_1, y_2, \ldots, y_n)$ the sample from $y$. In this section, we are interested in statistics that are measures of association between the $x$ and $y$ data, and in finding the line (or other curve) that best fits the data.
Recall that the sample means are

$$m(x) = \frac{1}{n} \sum_{i=1}^n x_i, \quad m(y) = \frac{1}{n} \sum_{i=1}^n y_i \tag{6.7.1}$$

and the sample variances are

$$s^2(x) = \frac{1}{n-1} \sum_{i=1}^n [x_i - m(x)]^2, \quad s^2(y) = \frac{1}{n-1} \sum_{i=1}^n [y_i - m(y)]^2 \tag{6.7.2}$$

Scatterplots
Often, the first step in exploratory data analysis is to draw a graph of the points; this is called a scatterplot and can give a visual sense of the statistical relationship between the variables.

Figure 6.7.1 : A scatterplot


In particular, we are interested in whether the cloud of points seems to show a linear trend or whether some nonlinear curve might
fit the cloud of points. We are interested in the extent to which one variable x can be used to predict the other variable y .
Definitions
Our next goal is to define statistics that measure the association between the x and y data.

The sample covariance is defined to be

$$s(x, y) = \frac{1}{n-1} \sum_{i=1}^n [x_i - m(x)][y_i - m(y)] \tag{6.7.3}$$

Assuming that the data vectors are not constant, so that the standard deviations are positive, the sample correlation is defined to be

$$r(x, y) = \frac{s(x, y)}{s(x)\, s(y)} \tag{6.7.4}$$

Note that the sample covariance is an average of the product of the deviations of the $x$ and $y$ data from their means. Thus, the physical unit of the sample covariance is the product of the units of $x$ and $y$. Correlation is a standardized version of covariance. In particular, correlation is dimensionless (has no physical units), since the covariance in the numerator and the product of the standard deviations in the denominator have the same units (the product of the units of $x$ and $y$). Note also that covariance and correlation have the same sign: positive, negative, or zero. In the first case, the data $x$ and $y$ are said to be positively correlated; in the second case $x$ and $y$ are said to be negatively correlated; and in the third case $x$ and $y$ are said to be uncorrelated.
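For concreteness, here is a minimal sketch (ours, with arbitrary illustrative data) of the two definitions; it agrees with `numpy`, whose `cov` function also divides by $n - 1$ by default.

```python
import numpy as np

x = np.array([1.0, 3.0, 6.0, 2.0, 8.0])
y = np.array([1.0, 3.0, 4.0, 1.0, 5.0])
n = len(x)

# sample covariance: average product deviation, dividing by n - 1
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
# sample correlation: standardized covariance, dimensionless
r_xy = s_xy / (x.std(ddof=1) * y.std(ddof=1))

print(s_xy, np.cov(x, y)[0, 1])        # the two values match
print(r_xy, np.corrcoef(x, y)[0, 1])   # so do these
```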
To see that the sample covariance is a measure of association, recall first that the point $(m(x), m(y))$ is a measure of the center of the bivariate data. Indeed, if each point is the location of a unit mass, then $(m(x), m(y))$ is the center of mass as defined in physics. Horizontal and vertical lines through this center point divide the plane into four quadrants. The product deviation $[x_i - m(x)][y_i - m(y)]$ is positive in the first and third quadrants and negative in the second and fourth quadrants. After we study linear regression below, we will have a much deeper sense of what covariance measures.

Figure 6.7.2 : Scatterplot with means


You may be perplexed that we average the product deviations by dividing by n − 1 rather than n . The best explanation is that in
the probability model discussed below, the sample covariance is an unbiased estimator of the distribution covariance. However, the
mode of averaging can also be understood in terms of degrees of freedom, as was done for sample variance. Initially, we have 2n
degrees of freedom in the bivariate data. We lose two by computing the sample means m(x) and m(y). Of the remaining 2n − 2
degrees of freedom, we lose n − 1 by computing the product deviations. Thus, we are left with n − 1 degrees of freedom total. As
is typical in statistics, we average not by dividing by the number of terms in the sum but rather by the number of degrees of
freedom in those terms. However, from a purely descriptive point of view, it would also be reasonable to divide by n .
Recall that there is a natural probability distribution associated with the data, namely the empirical distribution that gives probability $\frac{1}{n}$ to each data point $(x_i, y_i)$. (Thus, if these points are distinct, this is the discrete uniform distribution on the data.) The sample means are simply the expected values of this bivariate distribution, and except for a constant multiple (dividing by $n - 1$ rather than $n$), the sample variances are simply the variances of this bivariate distribution. Similarly, except for a constant multiple (again dividing by $n - 1$ rather than $n$), the sample covariance is the covariance of the bivariate distribution and the sample correlation is the correlation of the bivariate distribution. All of the following results in our discussion of descriptive statistics are actually special cases of more general results for probability distributions.
Properties of Covariance
The next few exercises establish some essential properties of sample covariance. As usual, bold symbols denote samples of a fixed size $n$ from the corresponding population variables (that is, vectors of length $n$), while symbols in regular type denote real numbers. Our first result is a formula for sample covariance that is sometimes better than the definition for computational purposes. To state the result succinctly, let $xy = (x_1 y_1, x_2 y_2, \ldots, x_n y_n)$ denote the sample from the product variable $xy$.

The sample covariance can be computed as follows:

$$s(x, y) = \frac{1}{n-1} \sum_{i=1}^n x_i y_i - \frac{n}{n-1} m(x) m(y) = \frac{n}{n-1}\left[m(xy) - m(x) m(y)\right] \tag{6.7.5}$$

Proof

The following theorem gives another formula for the sample covariance, one that does not require the computation of intermediate
statistics.

The sample covariance can be computed as follows:

$$s(x, y) = \frac{1}{2n(n-1)} \sum_{i=1}^n \sum_{j=1}^n (x_i - x_j)(y_i - y_j) \tag{6.7.10}$$

Proof

As the name suggests, sample covariance generalizes sample variance.


$s(x, x) = s^2(x)$.

In light of the previous theorem, we can now see that the first computational formula and the second computational formula above
generalize the computational formulas for sample variance. Clearly, sample covariance is symmetric.

s(x, y) = s(y, x) .

Sample covariance is linear in the first argument with the second argument fixed.

If x, y , and z are data vectors from population variables x, y , and z , respectively, and if c is a constant, then
1. s(x + y, z) = s(x, z) + s(y, z)
2. s(cx, y) = cs(x, y)
Proof

By symmetry, sample covariance is also linear in the second argument with the first argument fixed, and hence is bilinear. The
general version of the bilinear property is given in the following theorem:

Suppose that $\mathbf{x}_i$ is a data vector from a population variable $x_i$ for $i \in \{1, 2, \ldots, k\}$ and that $\mathbf{y}_j$ is a data vector from a population variable $y_j$ for $j \in \{1, 2, \ldots, l\}$. Suppose also that $a_1, a_2, \ldots, a_k$ and $b_1, b_2, \ldots, b_l$ are constants. Then

$$s\left(\sum_{i=1}^k a_i \mathbf{x}_i, \sum_{j=1}^l b_j \mathbf{y}_j\right) = \sum_{i=1}^k \sum_{j=1}^l a_i b_j\, s(\mathbf{x}_i, \mathbf{y}_j) \tag{6.7.21}$$

A special case of the bilinear property provides a nice way to compute the sample variance of a sum.
$s^2(x + y) = s^2(x) + 2 s(x, y) + s^2(y)$.
Proof

The generalization of this result to sums of three or more vectors is completely straightforward: namely, the sample variance of a
sum is the sum of all of the pairwise sample covariances. Note that the sample variance of a sum can be greater than, less than, or
equal to the sum of the sample variances, depending on the sign and magnitude of the pure covariance term. In particular, if the
vectors are pairwise uncorrelated, then the variance of the sum is the sum of the variances.

If c is a constant data set then s(x, c) = 0 .


Proof

Combining the result in the last exercise with the bilinear property, we see that covariance is unchanged if constants are added to
the data sets. That is, if c and d are constant vectors then s(x + c, y + d) = s(x, y) .
Properties of Correlation
A few simple properties of correlation are given next. Most of these follow easily from the corresponding properties of covariance. First, recall that the standard scores of $x_i$ and $y_i$ are, respectively,

$$u_i = \frac{x_i - m(x)}{s(x)}, \quad v_i = \frac{y_i - m(y)}{s(y)} \tag{6.7.24}$$

The standard scores from a data set are dimensionless quantities that have mean 0 and variance 1.

The correlation between x and y is the covariance of their standard scores u and v. That is, r(x, y) = s(u, v).
Proof

Correlation is symmetric.

r(x, y) = r(y, x) .

Unlike covariance, correlation is unaffected by multiplying one of the data sets by a positive constant (recall that this can always be
thought of as a change of scale in the underlying variable). On the other hand, multiplying a data set by a negative constant changes
the sign of the correlation.

If c ≠ 0 is a constant then
1. r(cx, y) = r(x, y) if c > 0
2. r(cx, y) = −r(x, y) if c < 0
Proof

Like covariance, correlation is unaffected by adding constants to the data sets. Adding a constant to a data set often corresponds to
a change of location.

If c and d are constant vectors then r(x + c, y + d) = r(x, y) .


Proof

The last couple of properties reinforce the fact that correlation is a standardized measure of association that is not affected by
changing the units of measurement. In the first Challenger data set, for example, the variables of interest are temperature at time of
launch (in degrees Fahrenheit) and O-ring erosion (in millimeters). The correlation between these variables is of critical
importance. If we were to measure temperature in degrees Celsius and O-ring erosion in inches, the correlation between the two
variables would be unchanged.
The most important properties of correlation arise from studying the line that best fits the data, our next topic.
Linear Regression
We are interested in finding the line $y = a + bx$ that best fits the sample points $((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n))$. This is a basic and important problem in many areas of mathematics, not just statistics. We think of $x$ as the predictor variable and $y$ as the response variable. Thus, the term best means that we want to find the line (that is, find the coefficients $a$ and $b$) that minimizes the average of the squared errors between the actual $y$ values in our data and the predicted $y$ values:

$$\operatorname{mse}(a, b) = \frac{1}{n-1} \sum_{i=1}^n [y_i - (a + b x_i)]^2 \tag{6.7.29}$$

Note that the minimizing value of $(a, b)$ would be the same if the function were simply the sum of the squared errors, or if we averaged by dividing by $n$ rather than $n - 1$, or if we used the square root of any of these functions. Of course the actual minimum value of the function would be different if we changed the function, but again, not the point $(a, b)$ where the minimum occurs. Our particular choice of mse as the error function is best for statistical purposes. Finding $(a, b)$ that minimize mse is a standard problem in calculus.

The graph of mse is a paraboloid opening upward. The function mse is minimized when

$$b(x, y) = \frac{s(x, y)}{s^2(x)} \tag{6.7.30}$$

$$a(x, y) = m(y) - b(x, y)\, m(x) = m(y) - \frac{s(x, y)}{s^2(x)} m(x) \tag{6.7.31}$$

Proof
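A sketch (ours, with arbitrary illustrative data) that computes the optimal slope and intercept from the sample moments and checks them against `numpy.polyfit`:

```python
import numpy as np

x = np.array([1.0, 3.0, 6.0, 2.0, 8.0, 2.0, 4.0, 6.0, 4.0, 4.0])
y = np.array([1.0, 3.0, 4.0, 1.0, 5.0, 2.0, 3.0, 4.0, 3.0, 4.0])

s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
b = s_xy / x.var(ddof=1)           # slope  b(x, y) = s(x, y) / s^2(x)
a = y.mean() - b * x.mean()        # intercept  a(x, y) = m(y) - b m(x)

slope, intercept = np.polyfit(x, y, 1)   # least squares fit of degree 1
print((a, b), (intercept, slope))        # the two pairs agree
```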

Of course, the optimal values of $a$ and $b$ are statistics, that is, functions of the data. Thus the sample regression line is

$$y = m(y) + \frac{s(x, y)}{s^2(x)} [x - m(x)] \tag{6.7.35}$$

Figure 6.7.3 : Scatterplot with regression line


Note that the regression line passes through the point (m(x), m(y)), the center of the sample of points.

Figure 6.7.4 : The regression line passes through the center

The minimum mean square error is

$$\operatorname{mse}[a(x, y), b(x, y)] = s^2(y)\left[1 - r^2(x, y)\right] \tag{6.7.36}$$

Proof

Sample correlation and covariance satisfy the following properties.


1. −1 ≤ r(x, y) ≤ 1
2. −s(x)s(y) ≤ s(x, y) ≤ s(x)s(y)
3. r(x, y) = −1 if and only if the sample points lie on a line with negative slope.
4. r(x, y) = 1 if and only if the sample points lie on a line with positive slope.
Proof

Thus, we now see in a deeper way that the sample covariance and correlation measure the degree of linearity of the sample points. Recall from our discussion of measures of center and spread that the constant $a$ that minimizes

$$\operatorname{mse}(a) = \frac{1}{n-1} \sum_{i=1}^n (y_i - a)^2 \tag{6.7.37}$$

is the sample mean $m(y)$, and the minimum value of the mean square error is the sample variance $s^2(y)$. Thus, the difference between this value of the mean square error and the one above, namely $s^2(y)\, r^2(x, y)$, is the reduction in the variability of the $y$ data when the linear term in $x$ is added to the predictor. The fractional reduction is $r^2(x, y)$, and hence this statistic is called the (sample) coefficient of determination. Note that if the data vectors $\mathbf{x}$ and $\mathbf{y}$ are uncorrelated, then $x$ has no value as a predictor of $y$; the regression line in this case is the horizontal line $y = m(y)$ and the mean square error is $s^2(y)$.

The choice of predictor and response variables is important.

The sample regression line with predictor variable x and response variable y is not the same as the sample regression line with
predictor variable y and response variable x, except in the extreme case r(x, y) = ±1 where the sample points all lie on a line.

Residuals
The difference between the actual $y$ value of a data point and the value predicted by the regression line is called the residual of that data point. Thus, the residual corresponding to $(x_i, y_i)$ is $d_i = y_i - \hat{y}_i$ where $\hat{y}_i$ is the value of the regression line at $x_i$:

$$\hat{y}_i = m(y) + \frac{s(x, y)}{s^2(x)} [x_i - m(x)] \tag{6.7.38}$$

Note that the predicted value $\hat{y}_i$ and the residual $d_i$ are statistics, that is, functions of the data $(\mathbf{x}, \mathbf{y})$, but we are suppressing this in the notation for simplicity.


The residuals sum to 0: $\sum_{i=1}^n d_i = 0$.
Proof
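A quick numerical illustration (ours, with arbitrary data): computing the residuals from a least squares fit and confirming that they sum to zero up to floating-point rounding.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.7, 5.2, 5.8])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
y_hat = y.mean() + b * (x - x.mean())   # predicted values on the regression line
d = y - y_hat                           # residuals
print(d, d.sum())                       # the sum is 0 up to rounding error
```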

Various plots of the residuals can help one understand the relationship between the x and y data. Some of the more common are
given in the following definition:

Residual plots

1. A plot of $(i, d_i)$ for $i \in \{1, 2, \ldots, n\}$, that is, a plot of indices versus residuals.
2. A plot of $(x_i, d_i)$ for $i \in \{1, 2, \ldots, n\}$, that is, a plot of $x$ values versus residuals.
3. A plot of $(d_i, y_i)$ for $i \in \{1, 2, \ldots, n\}$, that is, a plot of residuals versus actual $y$ values.
4. A plot of $(d_i, \hat{y}_i)$ for $i \in \{1, 2, \ldots, n\}$, that is, a plot of residuals versus predicted $y$ values.
5. A histogram of the residuals $(d_1, d_2, \ldots, d_n)$.

Sums of Squares
For our next discussion, we will reinterpret the minimum mean square error formula above. Here are the new definitions:

Sums of squares

1. $\operatorname{sst}(y) = \sum_{i=1}^n [y_i - m(y)]^2$ is the total sum of squares.
2. $\operatorname{ssr}(x, y) = \sum_{i=1}^n [\hat{y}_i - m(y)]^2$ is the regression sum of squares.
3. $\operatorname{sse}(x, y) = \sum_{i=1}^n (y_i - \hat{y}_i)^2$ is the error sum of squares.

Note that $\operatorname{sst}(y)$ is simply $n - 1$ times the variance $s^2(y)$ and is the total of the squares of the deviations of the $y$ values from the mean of the $y$ values. Similarly, $\operatorname{sse}(x, y)$ is simply $n - 1$ times the minimum mean square error given above. Of course, $\operatorname{sst}(y)$ has $n - 1$ degrees of freedom, while $\operatorname{sse}(x, y)$ has $n - 2$ degrees of freedom and $\operatorname{ssr}(x, y)$ a single degree of freedom. The total sum of squares is the sum of the regression sum of squares and the error sum of squares:

The sums of squares are related as follows:

1. $\operatorname{ssr}(x, y) = r^2(x, y)\operatorname{sst}(y)$
2. $\operatorname{sst}(y) = \operatorname{ssr}(x, y) + \operatorname{sse}(x, y)$


Proof
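Both identities are easy to verify numerically; the sketch below (ours) reuses the small data set from the residual example above.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.7, 5.2, 5.8])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
y_hat = y.mean() + b * (x - x.mean())

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
sse = np.sum((y - y_hat) ** 2)         # error sum of squares

print(sst, ssr + sse)                            # sst = ssr + sse
print(ssr / sst, np.corrcoef(x, y)[0, 1] ** 2)   # both equal r^2(x, y)
```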

Note that $r^2(x, y) = \operatorname{ssr}(x, y) / \operatorname{sst}(y)$, so once again, $r^2(x, y)$ is the coefficient of determination: the proportion of the variability in the $y$ data explained by the $x$ data. We can average sse by dividing by its degrees of freedom and then take the square root to obtain a standard error:

The standard error of estimate is

$$\operatorname{se}(x, y) = \sqrt{\frac{\operatorname{sse}(x, y)}{n - 2}} \tag{6.7.41}$$

This really is a standard error in the same sense as a standard deviation. It's an average of the errors of sorts, but in the root mean
square sense.
Finally, it's important to note that linear regression is a much more powerful idea than might first appear, and in fact the term linear can be a bit misleading. By applying various transformations to $y$ or $x$ or both, we can fit a variety of two-parameter curves to the given data $((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n))$. Some of the most common transformations are explored in the exercises below.

Probability Theory
We continue our discussion of sample covariance, correlation, and regression, but now from the more interesting point of view that the variables are random. Specifically, suppose that we have a basic random experiment, and that $X$ and $Y$ are real-valued random variables for the experiment. Equivalently, $(X, Y)$ is a random vector taking values in $\mathbb{R}^2$. Let $\mu = E(X)$ and $\nu = E(Y)$ denote the distribution means, $\sigma^2 = \operatorname{var}(X)$ and $\tau^2 = \operatorname{var}(Y)$ the distribution variances, and let $\delta = \operatorname{cov}(X, Y)$ denote the distribution covariance, so that the distribution correlation is

$$\rho = \operatorname{cor}(X, Y) = \frac{\operatorname{cov}(X, Y)}{\operatorname{sd}(X)\operatorname{sd}(Y)} = \frac{\delta}{\sigma \tau} \tag{6.7.42}$$

We will also need some higher order moments. Let $\sigma_4 = E\left[(X - \mu)^4\right]$, $\tau_4 = E\left[(Y - \nu)^4\right]$, and $\delta_2 = E\left[(X - \mu)^2 (Y - \nu)^2\right]$. Naturally, we assume that all of these moments are finite.
Now suppose that we run the basic experiment $n$ times. This creates a compound experiment with a sequence of independent random vectors $((X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n))$, each with the same distribution as $(X, Y)$. In statistical terms, this is a random sample of size $n$ from the distribution of $(X, Y)$. The statistics discussed in the previous section are well defined, but now they are all random variables. We use the notation established previously, except that we use our usual convention of denoting random variables with capital letters. Of course, the deterministic properties and relations established above still hold. Note that $\mathbf{X} = (X_1, X_2, \ldots, X_n)$ is a random sample of size $n$ from the distribution of $X$ and $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_n)$ is a random sample of size $n$ from the distribution of $Y$. The main purpose of this subsection is to study the relationship between various statistics from $\mathbf{X}$ and $\mathbf{Y}$, and to study statistics that are natural estimators of the distribution covariance and correlation.

The Sample Means

Recall that the sample means are

$$M(X) = \frac{1}{n} \sum_{i=1}^n X_i, \quad M(Y) = \frac{1}{n} \sum_{i=1}^n Y_i \tag{6.7.43}$$

From the sections on the law of large numbers and the central limit theorem, we know a great deal about the distributions of M (X)
and M (Y ) individually. But we need to know more about the joint distribution.

The covariance and correlation between $M(X)$ and $M(Y)$ are

1. $\operatorname{cov}[M(X), M(Y)] = \delta / n$
2. $\operatorname{cor}[M(X), M(Y)] = \rho$
Proof

Note that the correlation between the sample means is the same as the correlation of the underlying sampling distribution. In
particular, the correlation does not depend on the sample size n .
The Sample Variances
Recall that special versions of the sample variances, in the unlikely event that the distribution means are known, are

$$W^2(X) = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2, \quad W^2(Y) = \frac{1}{n} \sum_{i=1}^n (Y_i - \nu)^2 \tag{6.7.46}$$

Once again, we have studied these statistics individually, so our emphasis now is on the joint distribution.

The covariance and correlation between $W^2(X)$ and $W^2(Y)$ are

1. $\operatorname{cov}\left[W^2(X), W^2(Y)\right] = \left(\delta_2 - \sigma^2 \tau^2\right) / n$
2. $\operatorname{cor}\left[W^2(X), W^2(Y)\right] = \left(\delta_2 - \sigma^2 \tau^2\right) \big/ \sqrt{(\sigma_4 - \sigma^4)(\tau_4 - \tau^4)}$

Proof

Note that the correlation does not depend on the sample size $n$. Next, recall that the standard versions of the sample variances are

$$S^2(X) = \frac{1}{n-1} \sum_{i=1}^n [X_i - M(X)]^2, \quad S^2(Y) = \frac{1}{n-1} \sum_{i=1}^n [Y_i - M(Y)]^2 \tag{6.7.49}$$

The covariance and correlation of the sample variances are

1. $\operatorname{cov}\left[S^2(X), S^2(Y)\right] = \left(\delta_2 - \sigma^2 \tau^2\right) / n + 2\delta^2 / [n(n-1)]$
2. $\operatorname{cor}\left[S^2(X), S^2(Y)\right] = \left[(n-1)\left(\delta_2 - \sigma^2 \tau^2\right) + 2\delta^2\right] \big/ \sqrt{\left[(n-1)\sigma_4 - (n-3)\sigma^4\right]\left[(n-1)\tau_4 - (n-3)\tau^4\right]}$

Proof

Asymptotically, the correlation between the sample variances is the same as the correlation between the special sample variances given above:

$$\operatorname{cor}\left[S^2(X), S^2(Y)\right] \to \frac{\delta_2 - \sigma^2 \tau^2}{\sqrt{(\sigma_4 - \sigma^4)(\tau_4 - \tau^4)}} \text{ as } n \to \infty \tag{6.7.52}$$

Sample Covariance
Suppose first that the distribution means $\mu$ and $\nu$ are known. As noted earlier, this is almost always an unrealistic assumption, but is still a good place to start because the analysis is very simple and the results we obtain will be useful below. A natural estimator of the distribution covariance $\delta = \operatorname{cov}(X, Y)$ in this case is the special sample covariance

$$W(X, Y) = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)(Y_i - \nu) \tag{6.7.53}$$

Note that the special sample covariance generalizes the special sample variance: $W(X, X) = W^2(X)$.

$W(X, Y)$ is the sample mean for a random sample of size $n$ from the distribution of $(X - \mu)(Y - \nu)$ and satisfies the following properties:

1. $E[W(X, Y)] = \delta$
2. $\operatorname{var}[W(X, Y)] = \frac{1}{n}\left(\delta_2 - \delta^2\right)$
3. $W(X, Y) \to \delta$ as $n \to \infty$ with probability 1


Proof

As an estimator of δ , part (a) means that W (X, Y ) is unbiased and part (b) means that W (X, Y ) is consistent.
Consider now the more realistic assumption that the distribution means $\mu$ and $\nu$ are unknown. A natural approach in this case is to average $[X_i - M(X)][Y_i - M(Y)]$ over $i \in \{1, 2, \ldots, n\}$. But rather than dividing by $n$ in our average, we should divide by whatever constant gives an unbiased estimator of $\delta$. As shown in the next theorem, this constant turns out to be $n - 1$, leading to the standard sample covariance:

$$S(X, Y) = \frac{1}{n-1} \sum_{i=1}^n [X_i - M(X)][Y_i - M(Y)] \tag{6.7.55}$$

E[S(X, Y )] = δ .
Proof
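A Monte Carlo illustration of the unbiasedness (our sketch, with a synthetic bivariate model in which $\operatorname{cov}(X, Y) = 1$): averaging $S(X, Y)$ over many samples lands near $\delta$, while dividing by $n$ instead of $n - 1$ is systematically too small by the factor $(n-1)/n$.

```python
import random

random.seed(6)

def sample_cov(pairs, divisor):
    n = len(pairs)
    mx = sum(p for p, _ in pairs) / n
    my = sum(q for _, q in pairs) / n
    return sum((p - mx) * (q - my) for p, q in pairs) / divisor

def draw(n):
    # (X, Y) with Y = X + independent noise, X standard normal: cov(X, Y) = var(X) = 1
    return [(x, x + random.gauss(0, 1)) for x in (random.gauss(0, 1) for _ in range(n))]

n, reps = 5, 20_000
unbiased = sum(sample_cov(draw(n), n - 1) for _ in range(reps)) / reps
biased = sum(sample_cov(draw(n), n) for _ in range(reps)) / reps
print(unbiased, biased)   # near 1.0, and near (n-1)/n = 0.8, respectively
```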

S(X, Y ) → δ as n → ∞ with probability 1.

Proof

Of course, the sample correlation is

$$R(X, Y) = \frac{S(X, Y)}{S(X)\, S(Y)} \tag{6.7.59}$$

Since the sample correlation R(X, Y ) is a nonlinear function of the sample covariance and sample standard deviations, it will not
in general be an unbiased estimator of the distribution correlation ρ. In most cases, it would be difficult to even compute the mean
and variance of R(X, Y ). Nonetheless, we can show convergence of the sample correlation to the distribution correlation.

R(X, Y ) → ρ as n → ∞ with probability 1.


Proof

Our next theorem gives a formula for the variance of the sample covariance, not to be confused with the covariance of the sample variances given above!

The variance of the sample covariance is

$$\operatorname{var}[S(X, Y)] = \frac{1}{n}\left(\delta_2 + \frac{1}{n-1} \sigma^2 \tau^2 - \frac{n-2}{n-1} \delta^2\right) \tag{6.7.60}$$

Proof

It's not surprising that the variance of the standard sample covariance (where we don't know the distribution means) is greater than
the variance of the special sample covariance (where we do know the distribution means).

var[S(X, Y )] > var[W (X, Y )] .


Proof

var[S(X, Y )] → 0 as n → ∞ . Thus, the sample covariance is a consistent estimator of the distribution covariance.

Regression
In our first discussion above, we studied regression from a deterministic, descriptive point of view. The results obtained applied
only to the sample. Statistically more interesting and deeper questions arise when the data come from a random experiment, and we
try to draw inferences about the underlying distribution from the sample regression. There are two models that commonly arise.
One is where the response variable is random, but the predictor variable is deterministic. The other is the model we consider here,
where the predictor variable and the response variable are both random, so that the data form a random sample from a bivariate
distribution.
Thus, suppose again that we have a basic random vector $(X, Y)$ for an experiment. Recall that in the section on (distribution) correlation and regression, we showed that the best linear predictor of $Y$ given $X$, in the sense of minimizing mean square error, is the random variable

$$L(Y \mid X) = E(Y) + \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)} [X - E(X)] = \nu + \frac{\delta}{\sigma^2}(X - \mu) \tag{6.7.64}$$

so that the distribution regression line is given by

$$y = L(Y \mid X = x) = \nu + \frac{\delta}{\sigma^2}(x - \mu) \tag{6.7.65}$$

Moreover, the (minimum) value of the mean square error is $E\left\{[Y - L(Y \mid X)]^2\right\} = \operatorname{var}(Y)\left[1 - \operatorname{cor}^2(X, Y)\right] = \tau^2\left(1 - \rho^2\right)$.

Figure 6.7.5 : The distribution regression line
Of course, in real applications, we are unlikely to know the distribution parameters $\mu$, $\nu$, $\sigma^2$, and $\delta$. If we want to estimate the distribution regression line, a natural approach would be to consider a random sample $((X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n))$ from the distribution of $(X, Y)$ and compute the sample regression line. Of course, the results are exactly the same as in the discussion above, except that all of the relevant quantities are random variables. The sample regression line is

$$y = M(Y) + \frac{S(X, Y)}{S^2(X)} [x - M(X)] \tag{6.7.66}$$
S (X)

The mean square error is $S^2(Y)\left[1 - R^2(X, Y)\right]$ and the coefficient of determination is $R^2(X, Y)$.

Figure 6.7.6 : The distribution and sample regression lines


The fact that the sample regression line and mean square error are completely analogous to the distribution regression line and
mean square error is mathematically elegant and reassuring. Again, the coefficients of the sample regression line can be viewed as
estimators of the respective coefficients in the distribution regression line.

The coefficients of the sample regression line converge to the coefficients of the distribution regression line with probability 1.

1. $\frac{S(X, Y)}{S^2(X)} \to \frac{\delta}{\sigma^2}$ as $n \to \infty$
2. $M(Y) - \frac{S(X, Y)}{S^2(X)} M(X) \to \nu - \frac{\delta}{\sigma^2} \mu$ as $n \to \infty$

Proof

Of course, if the linear relationship between X and Y is not strong, as measured by the sample correlation, then transformation
applied to one or both variables may help. Again, some typical transformations are explored in the exercises below.

Exercises
Basic Properties

Suppose that $x$ and $y$ are population variables, and $\mathbf{x}$ and $\mathbf{y}$ samples of size $n$ from $x$ and $y$ respectively. Suppose also that $m(x) = 3$, $m(y) = -1$, $s^2(x) = 4$, $s^2(y) = 9$, and $s(x, y) = 5$. Find each of the following:

1. $r(x, y)$
2. $m(2x + 3y)$
3. $s^2(2x + 3y)$
4. $s(2x + 3y - 1, 4x + 2y - 3)$

Suppose that $x$ is the temperature (in degrees Fahrenheit) and $y$ the resistance (in ohms) for a certain type of electronic component after 10 hours of operation. For a sample of 30 components, $m(x) = 113$, $s(x) = 18$, $m(y) = 100$, $s(y) = 10$, $r(x, y) = 0.6$.

1. Classify $x$ and $y$ by type and level of measurement.
2. Find the sample covariance.
3. Find the equation of the regression line.

Suppose now that temperature is converted to degrees Celsius (the transformation is $\frac{5}{9}(x - 32)$).

4. Find the sample means.
5. Find the sample standard deviations.
6. Find the sample covariance and correlation.
7. Find the equation of the regression line.
Answer

Suppose that $x$ is the length and $y$ the width (in centimeters) of a leaf in a certain type of plant. For a sample of 50 leaves, $m(x) = 10$, $s(x) = 2$, $m(y) = 4$, $s(y) = 1$, and $r(x, y) = 0.8$.

1. Classify $x$ and $y$ by type and level of measurement.
2. Find the sample covariance.
3. Find the equation of the regression line with $x$ as the predictor variable and $y$ as the response variable.

Suppose now that $x$ and $y$ are converted to inches (0.3937 inches per centimeter).

4. Find the sample means.
5. Find the sample standard deviations.
6. Find the sample covariance and correlation.
7. Find the equation of the regression line.
Answer
Scatterplot Exercises

Click in the interactive scatterplot, in various places, and watch how the means, standard deviations, correlation, and regression
line change.

Click in the interactive scatterplot to define 20 points and try to come as close as possible to each of the following sample
correlations:
1. 0
2. 0.5
3. −0.5
4. 0.7
5. −0.7
6. 0.9
7. −0.9.

Click in the interactive scatterplot to define 20 points. Try to generate a scatterplot in which the regression line has
1. slope 1, intercept 1
2. slope 3, intercept 0
3. slope −2, intercept 1

Simulation Exercises

Run the bivariate uniform experiment 2000 times in each of the following cases. Compare the sample means to the distribution
means, the sample standard deviations to the distribution standard deviations, the sample correlation to the distribution
correlation, and the sample regression line to the distribution regression line.
1. The uniform distribution on the square
2. The uniform distribution on the triangle.
3. The uniform distribution on the circle.

Run the bivariate normal experiment 2000 times for various values of the distribution standard deviations and the distribution
correlation. Compare the sample means to the distribution means, the sample standard deviations to the distribution standard
deviations, the sample correlation to the distribution correlation, and the sample regression line to the distribution regression
line.

Transformations

Consider the function y = a + bx².

1. Sketch the graph for some representative values of a and b.
2. Note that y is a linear function of x², with intercept a and slope b.
3. Hence, to fit this curve to sample data, simply apply the standard regression procedure to the data from the variables x² and y.

Consider the function y = 1/(a + bx).

1. Sketch the graph for some representative values of a and b.
2. Note that 1/y is a linear function of x, with intercept a and slope b.
3. Hence, to fit this curve to sample data, simply apply the standard regression procedure to the data from the variables x and 1/y.

Consider the function y = x/(a + bx).

1. Sketch the graph for some representative values of a and b.
2. Note that 1/y is a linear function of 1/x, with intercept b and slope a.
3. Hence, to fit this curve to sample data, simply apply the standard regression procedure to the data from the variables 1/x and 1/y.
4. Note again that the names of the intercept and slope are reversed from the standard formulas.

Consider the function y = a e^{bx}.

1. Sketch the graph for some representative values of a and b.
2. Note that ln(y) is a linear function of x, with intercept ln(a) and slope b.
3. Hence, to fit this curve to sample data, simply apply the standard regression procedure to the data from the variables x and ln(y).
4. After solving for the intercept ln(a), recover the statistic a = e^{ln(a)}.

Consider the function y = a x^b.

1. Sketch the graph for some representative values of a and b.
2. Note that ln(y) is a linear function of ln(x), with intercept ln(a) and slope b.
3. Hence, to fit this curve to sample data, simply apply the standard regression procedure to the data from the variables ln(x) and ln(y).
4. After solving for the intercept ln(a), recover the statistic a = e^{ln(a)}.
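To make the transformation technique concrete, here is a minimal Python sketch (numpy assumed; the data arrays are hypothetical) that fits the curve y = a e^{bx} by applying the standard regression formulas to x and ln(y):

    import numpy as np

    # Hypothetical data that roughly follow y = a * exp(b x)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.7, 4.8, 9.1, 16.2, 30.5, 54.0])

    ln_y = np.log(y)   # ln(y) should be approximately linear in x

    # Standard least squares: slope = s(x, ln y) / s^2(x), intercept from the means
    b = np.sum((x - x.mean()) * (ln_y - ln_y.mean())) / np.sum((x - x.mean()) ** 2)
    ln_a = ln_y.mean() - b * x.mean()
    a = np.exp(ln_a)   # recover a = e^{ln(a)}

    print(f"fitted curve: y = {a:.3f} exp({b:.3f} x)")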

Computational Exercises
All statistical software packages will perform regression analysis. In addition to the regression line, most packages will typically report the coefficient of determination r²(x, y), the sums of squares sst(y), ssr(x, y), and sse(x, y), and the standard error of estimate se(x, y). Most packages will also draw the scatterplot, with the regression line superimposed, and will draw the various graphs of residuals discussed above. Many packages also provide easy ways to transform the data. Thus, there is very little reason to perform the computations by hand, except with a small data set, to master the definitions and formulas. In the following problem, do the computations and draw the graphs with minimal technological aids.

Suppose that x is the number of math courses completed and y the number of science courses completed for a student at Enormous State University (ESU). A sample of 10 ESU students gives the following data:
((1, 1), (3, 3), (6, 4), (2, 1), (8, 5), (2, 2), (4, 3), (6, 4), (4, 3), (4, 4))

1. Classify x and y by type and level of measurement.
2. Sketch the scatterplot.

Construct a table with rows corresponding to cases and columns corresponding to i, xᵢ, yᵢ, xᵢ − m(x), yᵢ − m(y), [xᵢ − m(x)]², [yᵢ − m(y)]², [xᵢ − m(x)][yᵢ − m(y)], ŷᵢ, ŷᵢ − m(y), [ŷᵢ − m(y)]², yᵢ − ŷᵢ, and (yᵢ − ŷᵢ)². Add rows at the bottom for totals and means. Use full precision in the arithmetic.

3. Complete the first 8 columns.
4. Find the sample correlation and the coefficient of determination.
5. Find the sample regression equation.
6. Complete the table.
7. Verify the identities for the sums of squares.
Answer

The following two exercises should help you review some of the probability topics in this section.

Suppose that (X, Y) has a continuous distribution with probability density function f(x, y) = 15x²y for 0 ≤ x ≤ y ≤ 1. Find each of the following:

1. μ = E(X) and ν = E(Y)
2. σ² = var(X) and τ² = var(Y)
3. σ₃ = E[(X − μ)³] and τ₃ = E[(Y − ν)³]
4. σ₄ = E[(X − μ)⁴] and τ₄ = E[(Y − ν)⁴]
5. δ = cov(X, Y), ρ = cor(X, Y), and δ₂ = E[(X − μ)²(Y − ν)²]
6. L(Y ∣ X) and L(X ∣ Y)


Answer

Suppose now that ((X₁, Y₁), (X₂, Y₂), …, (X₉, Y₉)) is a random sample of size 9 from the distribution in the previous exercise. Find each of the following:

1. E[M(X)] and var[M(X)]
2. E[M(Y)] and var[M(Y)]
3. cov[M(X), M(Y)] and cor[M(X), M(Y)]
4. E[W²(X)] and var[W²(X)]
5. E[W²(Y)] and var[W²(Y)]
6. E[S²(X)] and var[S²(X)]
7. E[S²(Y)] and var[S²(Y)]
8. E[W(X, Y)] and var[W(X, Y)]
9. E[S(X, Y)] and var[S(X, Y)]
Answer

Data Analysis Exercises


Use statistical software for the following problems.

Consider the height variables in Pearson's height data.
1. Classify the variables by type and level of measurement.
2. Compute the correlation coefficient and the coefficient of determination.
3. Compute the least squares regression line, with the height of the father as the predictor variable and the height of the son as
the response variable.
4. Draw the scatterplot and the regression line together.
5. Predict the height of a son whose father is 68 inches tall.
6. Compute the regression line if the heights are converted to centimeters (there are 2.54 centimeters per inch).
Answer

Consider the petal length, petal width, and species variables in Fisher's iris data.
1. Classify the variables by type and level of measurement.
2. Compute the correlation between petal length and petal width.
3. Compute the correlation between petal length and petal width by species.
Answer

Consider the number of candies and net weight variables in the M&M data.
1. Classify the variables by type and level of measurement.
2. Compute the correlation coefficient and the coefficient of determination.
3. Compute the least squares regression line with number of candies as the predictor variable and net weight as the response variable.
4. Draw the scatterplot and the regression line together.
5. Predict the net weight of a bag of M&Ms with 56 candies.
6. Naively, one might expect a much stronger correlation between the number of candies and the net weight in a bag of M&Ms. What is another source of variability in net weight?
Answer

Consider the response rate and total SAT score variables in the SAT by state data set.
1. Classify the variables by type and level of measurement.
2. Compute the correlation coefficient and the coefficient of determination.
3. Compute the least squares regression line with response rate as the predictor variable and SAT score as the response
variable.
4. Draw the scatterplot and regression line together.
5. Give a possible explanation for the negative correlation.
Answer

Consider the verbal and math SAT scores (for all students) in the SAT by year data set.
1. Classify the variables by type and level of measurement.
2. Compute the correlation coefficient and the coefficient of determination.
3. Compute the least squares regression line.
4. Draw the scatterplot and regression line together.
Answer

Consider the temperature and erosion variables in the first data set in the Challenger data.
1. Classify the variables by type and level of measurement.
2. Compute the correlation coefficient and the coefficient of determination.
3. Compute the least squares regression line.
4. Draw the scatter plot and the regression line together.
5. Predict the O-ring erosion with a temperature of 31° F.
6. Is the prediction in the previous part meaningful? Explain.
7. Find the regression line if temperature is converted to degrees Celsius. Recall that the conversion is (5/9)(x − 32).
Answer

This page titled 6.7: Sample Correlation and Regression is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

6.8: Special Properties of Normal Samples

Random samples from normal distributions are the most important special cases of the topics in this chapter. As we will see, many of the results simplify significantly when the underlying sampling distribution is normal. In addition, we will derive the distributions of a number of random variables constructed from normal samples that are of fundamental importance in inferential statistics.

The One Sample Model


Suppose that X = (X₁, X₂, …, Xₙ) is a random sample from the normal distribution with mean μ ∈ R and standard deviation σ ∈ (0, ∞). Recall that the term random sample means that X is a sequence of independent, identically distributed random variables. Recall also that the normal distribution has probability density function

f(x) = (1/(√(2π) σ)) exp[−(1/2)((x − μ)/σ)²],  x ∈ R  (6.8.1)

In the notation that we have used elsewhere in this chapter, σ₃ = E[(X − μ)³] = 0 (equivalently, the skewness of the normal distribution is 0) and σ₄ = E[(X − μ)⁴] = 3σ⁴ (equivalently, the kurtosis of the normal distribution is 3). Since the sample (and in particular the sample size n) is fixed in this subsection, it will be suppressed in the notation.

The Sample Mean


First recall that the sample mean is

M = (1/n) ∑_{i=1}^n Xᵢ  (6.8.2)

M is normally distributed with mean and variance given by

1. E(M) = μ
2. var(M) = σ²/n

Proof

Of course, by the central limit theorem, the distribution of M is approximately normal if n is large, even if the underlying sampling distribution is not normal. The standard score of M is given as follows:

Z = (M − μ)/(σ/√n)  (6.8.3)

Z has the standard normal distribution.

The standard score Z associated with the sample mean M plays a critical role in constructing interval estimates and hypothesis
tests for the distribution mean μ when the distribution standard deviation σ is known. The random variable Z will also appear in
several derivations in this section.
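These facts are easy to verify empirically. Here is a minimal simulation sketch in Python (numpy assumed; the values of μ, σ, and n are arbitrary illustrations):

    import numpy as np

    rng = np.random.default_rng(seed=1)
    mu, sigma, n = 10.0, 3.0, 25

    # Many independent samples of size n; one sample mean per row
    M = rng.normal(mu, sigma, size=(100_000, n)).mean(axis=1)

    print(M.mean(), M.var())            # ≈ mu and sigma^2 / n
    Z = (M - mu) / (sigma / np.sqrt(n))
    print(Z.mean(), Z.std())            # ≈ 0 and 1, consistent with Z standard normal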

The Sample Variance


The main goal of this subsection is to show that certain multiples of the two versions of the sample variance that we have studied have chi-square distributions. Recall that the chi-square distribution with k ∈ N₊ degrees of freedom has probability density function

f(x) = (1/(Γ(k/2) 2^{k/2})) x^{k/2−1} e^{−x/2},  0 < x < ∞  (6.8.4)

and has mean k and variance 2k. The moment generating function is

G(t) = 1/(1 − 2t)^{k/2},  −∞ < t < 1/2  (6.8.5)

The most important result to remember is that the chi-square distribution with k degrees of freedom governs ∑_{i=1}^k Zᵢ², where (Z₁, Z₂, …, Zₖ) is a sequence of independent, standard normal random variables.

Recall that if μ is known, a natural estimator of the variance σ² is the statistic

W² = (1/n) ∑_{i=1}^n (Xᵢ − μ)²  (6.8.6)

Although the assumption that μ is known is almost always artificial, W² is very easy to analyze and it will be used in some of the derivations below. Our first result is the distribution of a simple multiple of W². Let

U = (n/σ²) W²  (6.8.7)

U has the chi-square distribution with n degrees of freedom.


Proof

The variable U associated with the statistic W² plays a critical role in constructing interval estimates and hypothesis tests for the distribution standard deviation σ when the distribution mean μ is known (although again, this assumption is usually not realistic).

The mean and variance of W² are

1. E(W²) = σ²
2. var(W²) = 2σ⁴/n

Proof

As an estimator of σ², part (a) means that W² is unbiased and part (b) means that W² is consistent. Of course, these moment results are special cases of the general results obtained in the section on Sample Variance. In that section, we also showed that M and W² are uncorrelated if the underlying sampling distribution has skewness 0 (σ₃ = 0), as is the case here.

Recall now that the standard version of the sample variance is the statistic

S² = (1/(n − 1)) ∑_{i=1}^n (Xᵢ − M)²  (6.8.9)

The sample variance S² is the usual estimator of σ² when μ is unknown (which is usually the case). We showed earlier that in general, the sample mean M and the sample variance S² are uncorrelated if the underlying sampling distribution has skewness 0 (σ₃ = 0). It turns out that if the sampling distribution is normal, these variables are in fact independent, a very important and useful property, and at first blush, a very surprising result since S² appears to depend explicitly on M.

The sample mean M and the sample variance S² are independent.

Proof

We can now determine the distribution of a simple multiple of the sample variance S². Let

V = ((n − 1)/σ²) S²  (6.8.11)

V has the chi-square distribution with n − 1 degrees of freedom.


Proof

The variable V associated with the statistic S² plays a critical role in constructing interval estimates and hypothesis tests for the distribution standard deviation σ when the distribution mean μ is unknown (almost always the case).

The mean and variance of S² are

1. E(S²) = σ²
2. var(S²) = 2σ⁴/(n − 1)

Proof

As before, these moment results are special cases of the general results obtained in the section on Sample Variance. Again, as an estimator of σ², part (a) means that S² is unbiased, and part (b) means that S² is consistent. Note also that var(S²) is larger than var(W²) (not surprising), by a factor of n/(n − 1).
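Both the chi-square result for V and the independence of M and S² can be checked by simulation. A minimal Python sketch (numpy assumed; parameter values arbitrary):

    import numpy as np

    rng = np.random.default_rng(seed=2)
    mu, sigma, n = 5.0, 2.0, 10

    samples = rng.normal(mu, sigma, size=(100_000, n))
    S2 = samples.var(axis=1, ddof=1)       # sample variance S^2
    V = (n - 1) * S2 / sigma**2            # should be chi-square with n - 1 degrees of freedom

    print(V.mean(), V.var())               # ≈ n - 1 and 2(n - 1), the chi-square moments
    M = samples.mean(axis=1)
    print(np.corrcoef(M, S2)[0, 1])        # ≈ 0, consistent with the independence of M and S^2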

In the special distribution simulator, select the chi-square distribution. Vary the degree of freedom parameter and note the shape and location of the probability density function and the mean±standard deviation bar. For selected values of the parameter, run the experiment 1000 times and compare the empirical density function and moments to the true probability density function and moments.

The covariance and correlation between the special sample variance and the standard sample variance are

1. cov(W², S²) = 2σ⁴/n
2. cor(W², S²) = √((n − 1)/n)

Proof

Note that the correlation does not depend on the parameters μ and σ, and converges to 1 as n → ∞.

The T Variable

Recall that the Student t distribution with k ∈ N₊ degrees of freedom has probability density function

f(t) = Cₖ (1 + t²/k)^{−(k+1)/2},  t ∈ R  (6.8.15)

where Cₖ is the appropriate normalizing constant. The distribution has mean 0 if k > 1 and variance k/(k − 2) if k > 2. In this subsection, the main point to remember is that the t distribution with k degrees of freedom is the distribution of

Z / √(V/k)  (6.8.16)

where Z has the standard normal distribution; V has the chi-square distribution with k degrees of freedom; and Z and V are independent. Our goal is to derive the distribution of

T = (M − μ)/(S/√n)  (6.8.17)

Note that T is similar to the standard score Z associated with M , but with the sample standard deviation S replacing the
distribution standard deviation σ. The variable T plays a critical role in constructing interval estimates and hypothesis tests for the
distribution mean μ when the distribution standard deviation σ is unknown.

As usual, let Z denote the standard score associated with the sample mean M, and let V denote the chi-square variable associated with the sample variance S². Then

T = Z / √(V/(n − 1))  (6.8.18)

and hence T has the student t distribution with n − 1 degrees of freedom.


Proof
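As a small illustration of how T is used, here is a minimal Python sketch (numpy and scipy assumed; the data and the tested value of μ are hypothetical) that computes T and compares it with a critical value from the t distribution with n − 1 degrees of freedom:

    import numpy as np
    from scipy import stats

    x = np.array([9.8, 10.4, 9.6, 10.1, 10.7, 9.9, 10.2, 10.5])  # hypothetical sample
    mu0 = 10.0                                                   # conjectured value of mu

    n = len(x)
    T = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))

    # Two-sided 95% critical value from the t distribution with n - 1 degrees of freedom
    t_crit = stats.t.ppf(0.975, df=n - 1)
    print(T, t_crit, abs(T) <= t_crit)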

In the special distribution simulator, select the t distribution. Vary the degree of freedom parameter and note the shape and
location of the probability density function and the mean±standard deviation bar. For selected values of the parameters, run the
experiment 1000 times and compare the empirical density function and moments to the distribution density function and
moments.

The Two Sample Model


Suppose that X = (X₁, X₂, …, Xₘ) is a random sample of size m from the normal distribution with mean μ ∈ R and standard deviation σ ∈ (0, ∞), and that Y = (Y₁, Y₂, …, Yₙ) is a random sample of size n from the normal distribution with mean ν ∈ R and standard deviation τ ∈ (0, ∞). Finally, suppose that X and Y are independent. Of course, all of the results above in the one sample model apply to X and Y separately, but now we are interested in statistics that are helpful in inferential procedures that compare the two normal distributions. We will use the basic notation established above, but we will indicate the dependence on the sample.
The two-sample model (or more generally the multi-sample model) occurs naturally when a basic variable in the statistical experiment is filtered according to one or more other variables (often nominal variables). For example, in the cicada data, the weights of the male cicadas and the weights of the female cicadas may fit observations from the two-sample normal model. The basic variable weight is filtered by the variable gender. If weight is filtered by gender and species, we might have observations from the 6-sample normal model.

The Difference in the Sample Means


We know from our work above that M (X) and M (Y ) have normal distributions. Moreover, these sample means are independent
because the underlying samples X and Y are independent. Hence, it follows from a basic property of the normal distribution that
any linear combination of M (X) and M (Y ) will be normally distributed as well. For inferential procedures that compare the
distribution means μ and ν , the linear combination that is most important is the difference.

M(X) − M(Y) has a normal distribution with mean and variance given by

1. E[M(X) − M(Y)] = μ − ν
2. var[M(X) − M(Y)] = σ²/m + τ²/n

Hence the standard score

Z = ([M(X) − M(Y)] − (μ − ν)) / √(σ²/m + τ²/n)  (6.8.19)

has the standard normal distribution. This standard score plays a fundamental role in constructing interval estimates and hypothesis tests for the difference μ − ν when the distribution standard deviations σ and τ are known.

Ratios of Sample Variances


Next we will show that the ratios of certain multiples of the sample variances (both versions) of X and Y have F distributions. Recall that the F distribution with j ∈ N₊ degrees of freedom in the numerator and k ∈ N₊ degrees of freedom in the denominator is the distribution of

(U/j) / (V/k)  (6.8.20)

where U has the chi-square distribution with j degrees of freedom; V has the chi-square distribution with k degrees of freedom; and U and V are independent. The F distribution is named in honor of Ronald Fisher and has probability density function

f(x) = C_{j,k} x^{(j−2)/2} / [1 + (j/k)x]^{(j+k)/2},  0 < x < ∞  (6.8.21)

where C_{j,k} is the appropriate normalizing constant. The mean is k/(k − 2) if k > 2, and the variance is 2(k/(k − 2))² (j + k − 2)/(j(k − 4)) if k > 4.

The random variable given below has the F distribution with m degrees of freedom in the numerator and n degrees of freedom in the denominator:

(W²(X)/σ²) / (W²(Y)/τ²)  (6.8.22)

Proof

The random variable given below has the F distribution with m − 1 degrees of freedom in the numerator and n − 1 degrees of freedom in the denominator:

(S²(X)/σ²) / (S²(Y)/τ²)  (6.8.23)

Proof

These variables are useful for constructing interval estimates and hypothesis tests of the ratio of the standard deviations σ/τ. The choice of the F variable depends on whether the means μ and ν are known or unknown. Usually, of course, the means are unknown and so the second statistic above is used.

In the special distribution simulator, select the F distribution. Vary the degrees of freedom parameters and note the shape and
location of the probability density function and the mean±standard deviation bar. For selected values of the parameters, run the
experiment 1000 times and compare the empirical density function and moments to the true distribution density function and
moments.

The T Variable

Our final construction in the two sample normal model will result in a variable that has the student t distribution. This variable plays a fundamental role in constructing interval estimates and hypothesis tests for the difference μ − ν when the distribution standard deviations σ and τ are unknown. The construction requires the additional assumption that the distribution standard deviations are the same: σ = τ. This assumption is reasonable if there is an inherent variability in the measurement variables that does not change even when different treatments are applied to the objects in the population.

Note first that the standard score associated with the difference in the sample means becomes

Z = ([M(Y) − M(X)] − (ν − μ)) / (σ √(1/m + 1/n))  (6.8.24)

To construct our desired variable, we first need an estimate of σ². A natural approach is to consider a weighted average of the sample variances S²(X) and S²(Y), with the degrees of freedom as the weight factors (this is called the pooled estimate of σ²). Thus, let

S²(X, Y) = ((m − 1)S²(X) + (n − 1)S²(Y)) / (m + n − 2)  (6.8.25)

The random variable V given below has the chi-square distribution with m + n − 2 degrees of freedom:

V = ((m − 1)S²(X) + (n − 1)S²(Y)) / σ²  (6.8.26)

Proof

The variables M(Y) − M(X) and S²(X, Y) are independent.

Proof

The random variable T given below has the student t distribution with m + n − 2 degrees of freedom.

T = ([M(Y) − M(X)] − (ν − μ)) / (S(X, Y) √(1/m + 1/n))  (6.8.27)

Proof
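A minimal Python sketch (numpy assumed; the two samples are hypothetical) showing the pooled variance and the two-sample T statistic for the case ν − μ = 0:

    import numpy as np

    # Hypothetical samples from two normal populations with a common standard deviation
    x = np.array([10.1, 9.7, 10.5, 10.2, 9.9, 10.4])
    y = np.array([10.8, 11.0, 10.6, 11.2, 10.9])
    m, n = len(x), len(y)

    # Pooled estimate of the common variance, weighted by degrees of freedom
    S2_pooled = ((m - 1) * x.var(ddof=1) + (n - 1) * y.var(ddof=1)) / (m + n - 2)

    # T has the t distribution with m + n - 2 degrees of freedom when nu = mu
    T = (y.mean() - x.mean()) / np.sqrt(S2_pooled * (1 / m + 1 / n))
    print(T)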

The Bivariate Sample Model


Suppose now that ((X₁, Y₁), (X₂, Y₂), …, (Xₙ, Yₙ)) is a random sample of size n from the bivariate normal distribution with means μ ∈ R and ν ∈ R, standard deviations σ ∈ (0, ∞) and τ ∈ (0, ∞), and correlation ρ ∈ (−1, 1). Of course, X = (X₁, X₂, …, Xₙ) is a random sample of size n from the normal distribution with mean μ and standard deviation σ, and Y = (Y₁, Y₂, …, Yₙ) is a random sample of size n from the normal distribution with mean ν and standard deviation τ, so the results above in the one sample model apply to X and Y individually. Thus our interest in this section is in the relation between various X and Y statistics and properties of the sample covariance.

The bivariate (or more generally multivariate) model occurs naturally when considering two (or more) variables in the statistical experiment. For example, the heights of the fathers and the heights of the sons in Pearson's height data may well fit observations from the bivariate normal model.

6.8.5 https://stats.libretexts.org/@go/page/10185
In the notation that we have used previously, recall that σ₃ = E[(X − μ)³] = 0, σ₄ = E[(X − μ)⁴] = 3σ⁴, τ₃ = E[(Y − ν)³] = 0, τ₄ = E[(Y − ν)⁴] = 3τ⁴, δ = cov(X, Y) = στρ, and δ₂ = E[(X − μ)²(Y − ν)²] = σ²τ²(1 + 2ρ²).

The data vector ((X₁, Y₁), (X₂, Y₂), …, (Xₙ, Yₙ)) has a multivariate normal distribution.

1. The mean vector has a block form, with each block being (μ, ν).
2. The variance-covariance matrix has a block-diagonal form, with each block being

   [ σ²   στρ ]
   [ στρ  τ²  ]

Proof

Sample Means

(M(X), M(Y)) has a bivariate normal distribution. The covariance and correlation are

1. cov[M(X), M(Y)] = στρ/n
2. cor[M(X), M(Y)] = ρ

Proof

Of course, we know the individual means and variances of M (X) and M (Y ) from the one-sample model above. Hence we know
the complete distribution of (M (X), M (Y )).

Sample Variances

The covariance and correlation between the special sample variances are

1. cov[W²(X), W²(Y)] = 2σ²τ²ρ²/n
2. cor[W²(X), W²(Y)] = ρ²

Proof

The covariance and correlation between the standard sample variances are

1. cov[S²(X), S²(Y)] = 2σ²τ²ρ²/(n − 1)
2. cor[S²(X), S²(Y)] = ρ²
Proof

Sample Covariance

If μ and ν are known (again usually an artificial assumption), a natural estimator of the distribution covariance δ is the special version of the sample covariance

W(X, Y) = (1/n) ∑_{i=1}^n (Xᵢ − μ)(Yᵢ − ν)  (6.8.28)

The mean and variance of W(X, Y) are

1. E[W(X, Y)] = στρ
2. var[W(X, Y)] = σ²τ²(1 + ρ²)/n

Proof

If μ and ν are unknown (again usually the case), then a natural estimator of the distribution covariance δ is the standard sample covariance

S(X, Y) = (1/(n − 1)) ∑_{i=1}^n [Xᵢ − M(X)][Yᵢ − M(Y)]  (6.8.29)

The mean and variance of the sample covariance are

1. E[S(X, Y)] = στρ
2. var[S(X, Y)] = σ²τ²(1 + ρ²)/(n − 1)

Proof

Computational Exercises

We use the basic notation established above for samples X and Y, and for the statistics M, W², S², T, and so forth.

Suppose that the net weights (in grams) of 25 bags of M&Ms form a random sample X from the normal distribution with mean 50 and standard deviation 4. Find each of the following:

1. The mean and standard deviation of M.
2. The mean and standard deviation of W².
3. The mean and standard deviation of S².
4. The mean and standard deviation of T.
5. P(M > 49, S² < 20).
6. P(−1 < T < 1).


Answer

Suppose that the SAT math scores from 16 Alabama students form a random sample X from the normal distribution with mean 550 and standard deviation 20, while the SAT math scores from 25 Georgia students form a random sample Y from the normal distribution with mean 540 and standard deviation 15. The two samples are independent. Find each of the following:

1. The mean and standard deviation of M(X).
2. The mean and standard deviation of M(Y).
3. The mean and standard deviation of M(X) − M(Y).
4. P[M(X) > M(Y)].
5. The mean and standard deviation of S²(X).
6. The mean and standard deviation of S²(Y).
7. The mean and standard deviation of S²(X)/S²(Y).
8. P[S(X) > S(Y)].


Answer

This page titled 6.8: Special Properties of Normal Samples is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

CHAPTER OVERVIEW
7: Point Estimation
Point estimation refers to the process of estimating a parameter from a probability distribution, based on observed data from the
distribution. It is one of the core topics in mathematical statistics. In this chapter, we will explore the most common methods of
point estimation: the method of moments, the method of maximum likelihood, and Bayes' estimators. We also study important
properties of estimators, including sufficiency and completeness, and the basic question of whether an estimator is the best possible
one.
7.1: Estimators
7.2: The Method of Moments
7.3: Maximum Likelihood
7.4: Bayesian Estimation
7.5: Best Unbiased Estimators
7.6: Sufficient, Complete and Ancillary Statistics

This page titled 7: Point Estimation is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

7.1: Estimators

The Basic Statistical Model


As usual, our starting point is a random experiment with an underlying sample space and a probability measure P. In the basic
statistical model, we have an observable random variable X taking values in a set S . Recall that in general, this variable can have
quite a complicated structure. For example, if the experiment is to sample n objects from a population and record various
measurements of interest, then the data vector has the form
X = (X1 , X2 , … , Xn ) (7.1.1)

where Xᵢ is the vector of measurements for the ith object. The most important special case is when (X₁, X₂, …, Xₙ) are independent and identically distributed (IID). In this case X is a random sample of size n from the distribution of an underlying measurement variable X.

Statistics
Recall also that a statistic is an observable function of the outcome variable of the random experiment: U = u(X) where u is a
known function from S into another set T . Thus, a statistic is simply a random variable derived from the observation variable X,
with the assumption that U is also observable. As the notation indicates, U is typically also vector-valued. Note that the original
data vector X is itself a statistic, but usually we are interested in statistics derived from X. A statistic U may be computed to
answer an inferential question. In this context, if the dimension of U (as a vector) is smaller than the dimension of X (as is usually
the case), then we have achieved data reduction. Ideally, we would like to achieve significant data reduction with no loss of
information about the inferential question at hand.

Parameters
In the technical sense, a parameter θ is a function of the distribution of X, taking values in a parameter space T. Typically, the distribution of X will have k ∈ N₊ real parameters of interest, so that θ has the form θ = (θ₁, θ₂, …, θₖ) and thus T ⊆ R^k. In many cases, one or more of the parameters are unknown, and must be estimated from the data variable X. This is one of the most important and basic of all statistical problems, and is the subject of this chapter. If U is a statistic, then the distribution of U will depend on the parameters of X, and thus so will distributional constructs such as means, variances, covariances, probability density functions and so forth. We usually suppress this dependence notationally to keep our mathematical expressions from becoming too unwieldy, but it's very important to realize that the underlying dependence is present. Remember that the critical idea is that by observing a value u of a statistic U we (hopefully) gain information about the unknown parameters.

Estimators
Suppose now that we have an unknown real parameter θ taking values in a parameter space T ⊆ R . A real-valued statistic
U = u(X) that is used to estimate θ is called, appropriately enough, an estimator of θ . Thus, the estimator is a random variable

and hence has a distribution, a mean, a variance, and so on (all of which, as noted above, will generally depend on θ ). When we
actually run the experiment and observe the data x, the observed value u = u(x) (a single number) is the estimate of the parameter
θ . The following definitions are basic.

Suppose that U is a statistic used as an estimator of a parameter θ with values in T ⊆ R. For θ ∈ T,

1. U − θ is the error.
2. bias(U) = E(U − θ) = E(U) − θ is the bias of U.
3. mse(U) = E[(U − θ)²] is the mean square error of U.

Thus the error is the difference between the estimator and the parameter being estimated, so of course the error is a random
variable. The bias of U is simply the expected error, and the mean square error (the name says it all) is the expected square of the
error. Note that bias and mean square error are functions of θ ∈ T . The following definitions are a natural complement to the
definition of bias.

Suppose again that U is a statistic used as an estimator of a parameter θ with values in T ⊆R .

1. U is unbiased if bias(U ) = 0 , or equivalently E(U ) = θ , for all θ ∈ T .
2. U is negatively biased if bias(U ) ≤ 0 , or equivalently E(U ) ≤ θ , for all θ ∈ T .
3. U is positively biased if bias(U ) ≥ 0 , or equivalently E(U ) ≥ θ , for all θ ∈ T .

Thus, for an unbiased estimator, the expected value of the estimator is the parameter being estimated, clearly a desirable property.
On the other hand, a positively biased estimator overestimates the parameter, on average, while a negatively biased estimator
underestimates the parameter on average. Our definitions of negative and positive bias are weak in the sense that the weak
inequalities ≤ and ≥ are used. There are corresponding strong definitions, of course, using the strong inequalities < and >. Note,
however, that none of these definitions may apply. For example, it might be the case that bias(U ) < 0 for some θ ∈ T ,
bias(U ) = 0 for other θ ∈ T , and bias(U ) > 0 for yet other θ ∈ T .

mse(U) = var(U) + bias²(U)

Proof

In particular, if the estimator is unbiased, then the mean square error of U is simply the variance of U .
Ideally, we would like to have unbiased estimators with small mean square error. However, this is not always possible, and the
result in (3) shows the delicate relationship between bias and mean square error. In the next section we will see an example with
two estimators of a parameter that are multiples of each other; one is unbiased, but the other has smaller mean square error.
However, if we have two unbiased estimators of θ , we naturally prefer the one with the smaller variance (mean square error).
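The decomposition of mean square error into variance plus squared bias is easy to see numerically. Here is a minimal Python sketch (numpy assumed), comparing the unbiased sample variance with its biased multiple (n − 1)/n times the sample variance, an example taken up in the next section; the distribution and parameter values are arbitrary illustrations:

    import numpy as np

    rng = np.random.default_rng(seed=3)
    sigma2, n = 4.0, 10

    # Many samples of size n from a normal distribution with variance sigma2
    x = rng.normal(0.0, np.sqrt(sigma2), size=(200_000, n))
    S2 = x.var(axis=1, ddof=1)    # unbiased sample variance
    T2 = x.var(axis=1, ddof=0)    # biased version, (n - 1)/n times S2

    for name, U in [("S^2", S2), ("T^2", T2)]:
        bias = U.mean() - sigma2
        mse = np.mean((U - sigma2) ** 2)
        print(name, bias, mse, U.var() + bias**2)   # mse ≈ var(U) + bias^2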

Suppose that U and V are unbiased estimators of a parameter θ with values in T ⊆ R.

1. U is more efficient than V if var(U) ≤ var(V).
2. The relative efficiency of U with respect to V is

eff(U, V) = var(V)/var(U)  (7.1.3)

Asymptotic Properties
Suppose again that we have a real parameter θ with possible values in a parameter space T. Often in a statistical experiment, we observe an infinite sequence of random variables over time, X = (X₁, X₂, …), so that at time n we have observed Xₙ = (X₁, X₂, …, Xₙ). In this setting we often have a general formula that defines an estimator of θ for each sample size n. Technically, this gives a sequence of real-valued estimators of θ: U = (U₁, U₂, …) where Uₙ is a real-valued function of Xₙ for each n ∈ N₊. In this case, we can discuss the asymptotic properties of the estimators as n → ∞. Most of the definitions are natural generalizations of the ones above.

The sequence of estimators U = (U₁, U₂, …) is asymptotically unbiased if bias(Uₙ) → 0 as n → ∞ for every θ ∈ T, or equivalently, E(Uₙ) → θ as n → ∞ for every θ ∈ T.

Suppose that U = (U₁, U₂, …) and V = (V₁, V₂, …) are two sequences of estimators that are asymptotically unbiased. The asymptotic relative efficiency of U to V is

lim_{n→∞} eff(Uₙ, Vₙ) = lim_{n→∞} var(Vₙ)/var(Uₙ)  (7.1.4)

assuming that the limit exists.

Naturally, we expect our estimators to improve, as the sample size n increases, and in some sense to converge to the parameter as n → ∞. This general idea is known as consistency. Once again, for the remainder of this discussion, we assume that U = (U₁, U₂, …) is a sequence of estimators for a real-valued parameter θ, with values in the parameter space T.

Consistency

1. U is consistent if Uₙ → θ as n → ∞ in probability for each θ ∈ T. That is, P(|Uₙ − θ| > ε) → 0 as n → ∞ for every ε > 0 and θ ∈ T.
2. U is mean-square consistent if mse(Uₙ) = E[(Uₙ − θ)²] → 0 as n → ∞ for every θ ∈ T.

Here is the connection between the two definitions:

If U is mean-square consistent then U is consistent.


Proof

That mean-square consistency implies simple consistency is simply a statistical version of the theorem that states that mean-square
convergence implies convergence in probability. Here is another nice consequence of mean-square consistency.

If U is mean-square consistent then U is asymptotically unbiased.


Proof

In the next several subsections, we will review several basic estimation problems that were studied in the chapter on Random
Samples.

Estimation in the Single Variable Model


Suppose that X is a basic real-valued random variable for an experiment, with mean μ ∈ R and variance σ² ∈ (0, ∞). We sample from the distribution of X to produce a sequence X = (X₁, X₂, …) of independent variables, each with the distribution of X. For each n ∈ N₊, Xₙ = (X₁, X₂, …, Xₙ) is a random sample of size n from the distribution of X.
+ n 1 2 n

Estimating the Mean


This subsection is a review of some results obtained in the section on the Law of Large Numbers in the chapter on Random Samples. Recall that a natural estimator of the distribution mean μ is the sample mean, defined by

Mₙ = (1/n) ∑_{i=1}^n Xᵢ,  n ∈ N₊  (7.1.7)

Properties of M = (M₁, M₂, …) as a sequence of estimators of μ.

1. E(Mₙ) = μ so Mₙ is unbiased for n ∈ N₊.
2. var(Mₙ) = σ²/n for n ∈ N₊ so M is consistent.

The consistency of M is simply the weak law of large numbers. Moreover, there are a number of important special cases of the results in (10). See the section on Sample Mean for the details.

Special cases of the sample mean

1. Suppose that X = 1_A, the indicator variable for an event A that has probability P(A). Then the sample mean for a random sample of size n ∈ N₊ from the distribution of X is the relative frequency or empirical probability of A, denoted Pₙ(A). Hence Pₙ(A) is an unbiased estimator of P(A) for n ∈ N₊ and (Pₙ(A) : n ∈ N₊) is consistent.
2. Suppose that F denotes the distribution function of a real-valued random variable Y. Then for fixed y ∈ R, the empirical distribution function Fₙ(y) is simply the sample mean for a random sample of size n ∈ N₊ from the distribution of the indicator variable X = 1(Y ≤ y). Hence Fₙ(y) is an unbiased estimator of F(y) for n ∈ N₊ and (Fₙ(y) : n ∈ N₊) is consistent.
3. Suppose that U is a random variable with a discrete distribution on a countable set S and f denotes the probability density function of U. Then for fixed u ∈ S, the empirical probability density function fₙ(u) is simply the sample mean for a random sample of size n ∈ N₊ from the distribution of the indicator variable X = 1(U = u). Hence fₙ(u) is an unbiased estimator of f(u) for n ∈ N₊ and (fₙ(u) : n ∈ N₊) is consistent.
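Part (a), for example, can be illustrated with a simulated die. A minimal Python sketch (numpy assumed) showing the empirical probability Pₙ(A) of the event A = {roll is 6} converging to P(A) = 1/6:

    import numpy as np

    rng = np.random.default_rng(seed=4)

    for n in [100, 10_000, 1_000_000]:
        rolls = rng.integers(1, 7, size=n)   # fair die: values 1 through 6
        P_n = np.mean(rolls == 6)            # sample mean of the indicator 1(roll = 6)
        print(n, P_n)                        # approaches 1/6 as n grows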

Estimating the Variance


This subsection is a review of some results obtained in the section on the Sample Variance in the chapter on Random Samples. We also assume that the fourth central moment σ₄ = E[(X − μ)⁴] is finite. Recall that σ₄/σ⁴ is the kurtosis of X. Recall first that if μ is known (almost always an artificial assumption), then a natural estimator of σ² is a special version of the sample variance, defined by

Wₙ² = (1/n) ∑_{i=1}^n (Xᵢ − μ)²,  n ∈ N₊  (7.1.8)

Properties of W² = (W₁², W₂², …) as a sequence of estimators of σ².

1. E(Wₙ²) = σ² so Wₙ² is unbiased for n ∈ N₊.
2. var(Wₙ²) = (1/n)(σ₄ − σ⁴) for n ∈ N₊ so W² is consistent.

Proof

If μ is unknown (the more reasonable assumption), then a natural estimator of the distribution variance is the standard version of the sample variance, defined by

Sₙ² = (1/(n − 1)) ∑_{i=1}^n (Xᵢ − Mₙ)²,  n ∈ {2, 3, …}  (7.1.9)

Properties of S² = (S₂², S₃², …) as a sequence of estimators of σ².

1. E(Sₙ²) = σ² so Sₙ² is unbiased for n ∈ {2, 3, …}.
2. var(Sₙ²) = (1/n)(σ₄ − ((n − 3)/(n − 1))σ⁴) for n ∈ {2, 3, …} so S² is consistent.

Naturally, we would like to compare the sequences W² and S² as estimators of σ². But again remember that W² only makes sense if μ is known.

Comparison of W² and S²

1. var(Wₙ²) < var(Sₙ²) for n ∈ {2, 3, …}.
2. The asymptotic relative efficiency of W² to S² is 1.

So by (a), Wₙ² is better than Sₙ² for n ∈ {2, 3, …}, assuming that μ is known so that we can actually use Wₙ². This is perhaps not surprising, but by (b), Sₙ² works just about as well as Wₙ² for a large sample size n. Of course, the sample standard deviation Sₙ is a natural estimator of the distribution standard deviation σ. Unfortunately, this estimator is biased. Here is a more general result:

Suppose that θ is a parameter with possible values in T ⊆ (0, ∞) (with at least two points) and that U is a statistic with values in T. If U² is an unbiased estimator of θ² then U is a negatively biased estimator of θ.

Proof

Thus, we should not be too obsessed with the unbiased property. For most sampling distributions, there will be no statistic U with the property that U is an unbiased estimator of σ and U² is an unbiased estimator of σ².

Estimation in the Bivariate Model


In this subsection we review some of the results obtained in the section on Correlation and Regression in the chapter on Random Samples.
Suppose that X and Y are real-valued random variables for an experiment, so that (X, Y) has a bivariate distribution in R². Let μ = E(X) and σ² = var(X) denote the mean and variance of X, and let ν = E(Y) and τ² = var(Y) denote the mean and variance of Y. For the bivariate parameters, let δ = cov(X, Y) denote the distribution covariance and ρ = cor(X, Y) the distribution correlation. We need one higher-order moment as well: let δ₂ = E[(X − μ)²(Y − ν)²], and as usual, we assume that all of the parameters exist. So the general parameter spaces are μ, ν ∈ R, σ², τ² ∈ (0, ∞), δ ∈ R, and ρ ∈ [−1, 1]. Suppose now that we sample from the distribution of (X, Y) to generate a sequence of independent variables ((X₁, Y₁), (X₂, Y₂), …), each with the distribution of (X, Y). As usual, we will let Xₙ = (X₁, X₂, …, Xₙ) and Yₙ = (Y₁, Y₂, …, Yₙ); these are random samples of size n from the distributions of X and Y, respectively.

Since we now have two underlying variables, we need to enhance our notation somewhat. It will help to define the deterministic versions of our statistics. So if x = (x₁, x₂, …) and y = (y₁, y₂, …) are sequences of real numbers and n ∈ N₊, we define the mean and special covariance functions by
mₙ(x) = (1/n) ∑_{i=1}^n xᵢ

wₙ(x, y) = (1/n) ∑_{i=1}^n (xᵢ − μ)(yᵢ − ν)

If n ∈ {2, 3, …} we define the variance and standard covariance functions by

sₙ²(x) = (1/(n − 1)) ∑_{i=1}^n [xᵢ − mₙ(x)]²

sₙ(x, y) = (1/(n − 1)) ∑_{i=1}^n [xᵢ − mₙ(x)][yᵢ − mₙ(y)]

It should be clear from context whether we are using the one argument or two argument version of sₙ. On this point, note that sₙ(x, x) = sₙ²(x).

Estimating the Covariance


If μ and ν are known (almost always an artificial assumption), then a natural estimator of the distribution covariance δ is a special version of the sample covariance, defined by

Wₙ = wₙ(X, Y) = (1/n) ∑_{i=1}^n (Xᵢ − μ)(Yᵢ − ν),  n ∈ N₊  (7.1.11)

Properties of W = (W₁, W₂, …) as a sequence of estimators of δ.

1. E(Wₙ) = δ so Wₙ is unbiased for n ∈ N₊.
2. var(Wₙ) = (1/n)(δ₂ − δ²) for n ∈ N₊ so W is consistent.

Proof

If μ and ν are unknown (usually the more reasonable assumption), then a natural estimator of the distribution covariance δ is the standard version of the sample covariance, defined by

Sₙ = sₙ(X, Y) = (1/(n − 1)) ∑_{i=1}^n [Xᵢ − mₙ(X)][Yᵢ − mₙ(Y)],  n ∈ {2, 3, …}  (7.1.12)

Properties of S = (S₂, S₃, …) as a sequence of estimators of δ.

1. E(Sₙ) = δ so Sₙ is unbiased for n ∈ {2, 3, …}.
2. var(Sₙ) = (1/n)(δ₂ + (1/(n − 1))σ²τ² − ((n − 2)/(n − 1))δ²) for n ∈ {2, 3, …} so S is consistent.

Once again, since we have two competing sequences of estimators of δ, we would like to compare them.

Comparison of W and S as estimators of δ:

1. var(Wₙ) < var(Sₙ) for n ∈ {2, 3, …}.
2. The asymptotic relative efficiency of W to S is 1.

Thus, Wₙ is better than Sₙ for n ∈ {2, 3, …}, assuming that μ and ν are known so that we can actually use Wₙ. But for large n, Sₙ works just about as well as Wₙ.

Estimating the Correlation


A natural estimator of the distribution correlation ρ is the sample correlation

Rₙ = sₙ(X, Y)/(sₙ(X) sₙ(Y)),  n ∈ {2, 3, …}  (7.1.13)

Note that this statistic is a nonlinear function of the sample covariance and the two sample standard deviations. For most distributions of (X, Y), we have no hope of computing the bias or mean square error of this estimator. If we could compute the expected value, we would probably find that the estimator is biased. On the other hand, even though we cannot compute the mean square error, a simple application of the law of large numbers shows that Rₙ → ρ as n → ∞ with probability 1. Thus, R = (R₂, R₃, …) is at least consistent.

Estimating the regression coefficients


Recall that the distribution regression line, with X as the predictor variable and Y as the response variable, is y = a + bx where

a = E(Y) − (cov(X, Y)/var(X)) E(X),  b = cov(X, Y)/var(X)  (7.1.14)

On the other hand, the sample regression line, based on the sample of size n ∈ {2, 3, …}, is y = Aₙ + Bₙx where

Aₙ = mₙ(Y) − (sₙ(X, Y)/sₙ²(X)) mₙ(X),  Bₙ = sₙ(X, Y)/sₙ²(X)  (7.1.15)

Of course, the statistics Aₙ and Bₙ are natural estimators of the parameters a and b, respectively, and in a sense are derived from our previous estimators of the distribution mean, variance, and covariance. Once again, for most distributions of (X, Y), it would be difficult to compute the bias and mean square errors of these estimators. But applications of the law of large numbers show that with probability 1, Aₙ → a and Bₙ → b as n → ∞, so at least A = (A₂, A₃, …) and B = (B₂, B₃, …) are consistent.
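A minimal Python sketch (numpy assumed; the bivariate data are simulated purely for illustration) computing Rₙ, Aₙ, and Bₙ directly from the formulas above:

    import numpy as np

    rng = np.random.default_rng(seed=5)
    n = 1000

    # Simulated bivariate data with an underlying linear relationship plus noise
    x = rng.normal(0.0, 2.0, size=n)
    y = 1.5 + 0.8 * x + rng.normal(0.0, 1.0, size=n)

    s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)   # s_n(X, Y)
    B = s_xy / x.var(ddof=1)                                   # slope estimator B_n
    A = y.mean() - B * x.mean()                                # intercept estimator A_n
    R = s_xy / (x.std(ddof=1) * y.std(ddof=1))                 # sample correlation R_n

    print(R, A, B)   # A ≈ 1.5 and B ≈ 0.8 for this simulated data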

Exercises and Special Cases


The Poisson Distribution
Let's consider a simple example that illustrates some of the ideas above. Recall that the Poisson distribution with parameter λ ∈ (0, ∞) has probability density function g given by

g(x) = e^{−λ} λ^x / x!,  x ∈ N  (7.1.16)

The Poisson distribution is often used to model the number of random “points” in a region of time or space, and is studied in more
detail in the chapter on the Poisson process. The parameter λ is proportional to the size of the region of time or space; the
proportionality constant is the average rate of the random points. The distribution is named for Simeon Poisson.

Suppose that X has the Poisson distribution with parameter λ. Then

1. μ = E(X) = λ
2. σ² = var(X) = λ
3. σ₄ = E[(X − λ)⁴] = 3λ² + λ

Proof

Suppose now that we sample from the distribution of X to produce a sequence of independent random variables X = (X₁, X₂, …), each having the Poisson distribution with unknown parameter λ ∈ (0, ∞). Again, Xₙ = (X₁, X₂, …, Xₙ) is a random sample of size n from the distribution for each n ∈ N₊. From the previous exercise, λ is both the mean and the variance of the distribution, so we could use either the sample mean Mₙ or the sample variance Sₙ² as an estimator of λ. Both are unbiased, so which is better? Naturally, we use mean square error as our criterion.

Comparison of M to S² as estimators of λ.

1. var(Mₙ) = λ/n for n ∈ N₊.
2. var(Sₙ²) = (λ/n)(1 + 2λ n/(n − 1)) for n ∈ {2, 3, …}.
3. var(Mₙ) < var(Sₙ²) so Mₙ is better for n ∈ {2, 3, …}.
4. The asymptotic relative efficiency of M to S² is 1 + 2λ.

So our conclusion is that the sample mean Mₙ is a better estimator of the parameter λ than the sample variance Sₙ² for n ∈ {2, 3, …}, and the difference in quality increases with λ.
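A minimal simulation sketch (Python with numpy assumed; the value of λ is an arbitrary illustration) comparing the two estimators:

    import numpy as np

    rng = np.random.default_rng(seed=6)
    lam, n = 3.0, 20

    x = rng.poisson(lam, size=(200_000, n))
    M = x.mean(axis=1)           # sample mean, unbiased for lambda
    S2 = x.var(axis=1, ddof=1)   # sample variance, also unbiased for lambda

    print(M.mean(), S2.mean())   # both ≈ lambda
    print(M.var(), S2.var())     # M has the smaller variance
    print(S2.var() / M.var())    # ≈ 1 + 2*lambda for large n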

Run the Poisson experiment 100 times for several values of the parameter. In each case, compute the estimators M and S². Which estimator seems to work better?

The emission of elementary particles from a sample of radioactive material in a time interval is often assumed to follow the
Poisson distribution. Thus, suppose that the alpha emissions data set is a sample from a Poisson distribution. Estimate the rate
parameter λ .
1. using the sample mean
2. using the sample variance
Answer

Simulation Exercises

In the sample mean experiment, set the sampling distribution to gamma. Increase the sample size with the scroll bar and note
graphically and numerically the unbiased and consistent properties. Run the experiment 1000 times and compare the sample
mean to the distribution mean.

Run the normal estimation experiment 1000 times for several values of the parameters.
1. Compare the empirical bias and mean square error of M with the theoretical values.
2. Compare the empirical bias and mean square error of S² and of W² to their theoretical values. Which estimator seems to work better?

In the matching experiment, the random variable is the number of matches. Run the simulation 1000 times and compare
1. the sample mean to the distribution mean.
2. the empirical density function to the probability density function.

Run the exponential experiment 1000 times and compare the sample standard deviation to the distribution standard deviation.

Data Analysis Exercises


For Michelson's velocity of light data, compute the sample mean and sample variance.
Answer

For Cavendish's density of the earth data, compute the sample mean and sample variance.
Answer

For Short's parallax of the sun data, compute the sample mean and sample variance.
Answer

Consider the Cicada data.


1. Compute the sample mean and sample variance of the body length variable.
2. Compute the sample mean and sample variance of the body weight variable.
3. Compute the sample covariance and sample correlation between the body length and body weight variables.
Answer

Consider the M&M data.


1. Compute the sample mean and sample variance of the net weight variable.
2. Compute the sample mean and sample variance of the total number of candies.
3. Compute the sample covariance and sample correlation between the number of candies and the net weight.
Answer

Consider the Pearson data.


1. Compute the sample mean and sample variance of the height of the father.
2. Compute the sample mean and sample variance of the height of the son.
3. Compute the sample covariance and sample correlation between the height of the father and height of the son.

Answer

The estimators of the mean, variance, and covariance that we have considered in this section have been natural in a sense.
However, for other parameters, it is not clear how to even find a reasonable estimator in the first place. In the next several sections,
we will consider the problem of constructing estimators. Then we return to the study of the mathematical properties of estimators,
and consider the question of when we can know that an estimator is the best possible, given the data.

This page titled 7.1: Estimators is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

7.2: The Method of Moments

Basic Theory
The Method
Suppose that we have a basic random experiment with an observable, real-valued random variable X. The distribution of X has k unknown real-valued parameters, or equivalently, a parameter vector θ = (θ₁, θ₂, …, θₖ) taking values in a parameter space, a subset of R^k. As usual, we repeat the experiment n times to generate a random sample of size n from the distribution of X:

X = (X₁, X₂, …, Xₙ)  (7.2.1)

Thus, X is a sequence of independent random variables, each with the distribution of X. The method of moments is a technique for constructing estimators of the parameters that is based on matching the sample moments with the corresponding distribution moments. First, let

μ^{(j)}(θ) = E(X^j),  j ∈ N₊  (7.2.2)
so that μ^{(j)}(θ) is the jth moment of X about 0. Note that we are emphasizing the dependence of these moments on the vector of parameters θ. Note also that μ^{(1)}(θ) is just the mean of X, which we usually denote simply by μ. Next, let

M^{(j)}(X) = (1/n) ∑_{i=1}^n Xᵢ^j,  j ∈ N₊  (7.2.3)

so that M^{(j)}(X) is the jth sample moment about 0. Equivalently, M^{(j)}(X) is the sample mean for the random sample (X₁^j, X₂^j, …, Xₙ^j) from the distribution of X^j. Note that we are emphasizing the dependence of the sample moments on the sample X. Note also that M^{(1)}(X) is just the ordinary sample mean, which we usually just denote by M (or by Mₙ if we wish to emphasize the dependence on the sample size). From our previous work, we know that M^{(j)}(X) is an unbiased and consistent estimator of μ^{(j)}(θ) for each j. Here's how the method works:
To construct the method of moments estimators (W₁, W₂, …, Wₖ) for the parameters (θ₁, θ₂, …, θₖ) respectively, we consider the equations

μ^{(j)}(W₁, W₂, …, Wₖ) = M^{(j)}(X₁, X₂, …, Xₙ)  (7.2.4)

consecutively for j ∈ N₊ until we are able to solve for (W₁, W₂, …, Wₖ) in terms of (M^{(1)}, M^{(2)}, …).

The equations for j ∈ {1, 2, …, k} give k equations in k unknowns, so there is hope (but no guarantee) that the equations can be solved for (W₁, W₂, …, Wₖ) in terms of (M^{(1)}, M^{(2)}, …, M^{(k)}). In fact, sometimes we need equations with j > k. Exercise 28 below gives a simple example. The method of moments can be extended to parameters associated with bivariate or more general multivariate distributions, by matching sample product moments with the corresponding distribution product moments. The method of moments also sometimes makes sense when the sample variables (X₁, X₂, …, Xₙ) are not independent, but at least are identically distributed. The hypergeometric model below is an example of this.


Of course, the method of moments estimators depend on the sample size n ∈ N₊. We have suppressed this so far, to keep the notation simple. But in the applications below, we put the notation back in because we want to discuss asymptotic behavior.
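As a concrete illustration of the recipe, consider the normal distribution, whose first two moments are μ^{(1)} = μ and μ^{(2)} = σ² + μ². Solving the two moment equations gives the estimators derived later in this section. A minimal Python sketch (numpy assumed; the simulated parameter values are arbitrary):

    import numpy as np

    rng = np.random.default_rng(seed=7)
    x = rng.normal(10.0, 3.0, size=500)   # simulated data with mu = 10, sigma = 3

    M1 = np.mean(x)         # first sample moment M^(1)
    M2 = np.mean(x ** 2)    # second sample moment M^(2)

    # Moment equations: mu = M1 and sigma^2 + mu^2 = M2
    mu_hat = M1
    sigma2_hat = M2 - M1**2   # equals the biased sample variance T^2

    print(mu_hat, sigma2_hat)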

Estimates for the Mean and Variance


Estimating the mean and variance of a distribution are the simplest applications of the method of moments. Throughout this subsection, we assume that we have a basic real-valued random variable X with μ = E(X) ∈ R and σ² = var(X) ∈ (0, ∞). Occasionally we will also need σ₄ = E[(X − μ)⁴], the fourth central moment. We sample from the distribution of X to produce a sequence X = (X₁, X₂, …) of independent variables, each with the distribution of X. For each n ∈ N₊, Xₙ = (X₁, X₂, …, Xₙ) is a random sample of size n from the distribution of X. We start by estimating the mean, which is essentially trivial by this method.
essentially trivial by this method.

Suppose that the mean μ is unknown. The method of moments estimator of μ based on Xₙ is the sample mean

Mₙ = (1/n) ∑_{i=1}^n Xᵢ  (7.2.5)

1. E(Mₙ) = μ so Mₙ is unbiased for n ∈ N₊.
2. var(Mₙ) = σ²/n for n ∈ N₊ so M = (M₁, M₂, …) is consistent.
Proof

Estimating the variance of the distribution, on the other hand, depends on whether the distribution mean μ is known or unknown. First we will consider the more realistic case when the mean is also unknown. Recall that for n ∈ {2, 3, …}, the sample variance based on Xₙ is

Sₙ² = (1/(n − 1)) ∑_{i=1}^n (Xᵢ − Mₙ)²  (7.2.6)

Recall also that E(Sₙ²) = σ² so Sₙ² is unbiased for n ∈ {2, 3, …}, and that var(Sₙ²) = (1/n)(σ₄ − ((n − 3)/(n − 1))σ⁴) so S² = (S₂², S₃², …) is consistent.

Suppose that the mean μ and the variance σ² are both unknown. For n ∈ N₊, the method of moments estimator of σ² based on Xₙ is

Tₙ² = (1/n) ∑_{i=1}^n (Xᵢ − Mₙ)²  (7.2.7)

1. bias(Tₙ²) = −σ²/n for n ∈ N₊ so T² = (T₁², T₂², …) is asymptotically unbiased.
2. mse(Tₙ²) = (1/n³)[(n − 1)²σ₄ − (n² − 5n + 3)σ⁴] for n ∈ N₊ so T² is consistent.
Proof

Hence Tₙ² is negatively biased and on average underestimates σ². Because of this result, Tₙ² is referred to as the biased sample variance to distinguish it from the ordinary (unbiased) sample variance Sₙ².

Next let's consider the usually unrealistic (but mathematically interesting) case where the mean is known, but not the variance.

Suppose that the mean μ is known and the variance σ² unknown. For n ∈ N₊, the method of moments estimator of σ² based on Xₙ is

Wₙ² = (1/n) ∑_{i=1}^n (Xᵢ − μ)²  (7.2.8)

1. E(Wₙ²) = σ² so Wₙ² is unbiased for n ∈ N₊.
2. var(Wₙ²) = (1/n)(σ₄ − σ⁴) for n ∈ N₊ so W² = (W₁², W₂², …) is consistent.
Proof

We compared the sequence of estimators S² with the sequence of estimators W² in the introductory section on Estimators. Recall that var(Wₙ²) < var(Sₙ²) for n ∈ {2, 3, …} but var(Sₙ²)/var(Wₙ²) → 1 as n → ∞. There is no simple, general relationship between mse(Tₙ²) and mse(Sₙ²) or between mse(Tₙ²) and mse(Wₙ²), but the asymptotic relationship is simple.

mse(Tₙ²)/mse(Wₙ²) → 1 and mse(Tₙ²)/mse(Sₙ²) → 1 as n → ∞

Proof

It also follows that if both μ and σ^2 are unknown, then the method of moments estimator of the standard deviation σ is T = √(T^2).
In the unlikely event that μ is known, but σ^2 unknown, then the method of moments estimator of σ is W = √(W^2).
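To make the comparison concrete, here is a minimal Monte Carlo sketch (a supplement, not part of the text's simulation apps) that estimates the bias and mean square error of S_n^2, T_n^2, and W_n^2. The normal sampling distribution and the values of μ, σ^2, n, and the number of replications are hypothetical choices; only numpy is assumed.

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma2, n, reps = 5.0, 4.0, 10, 100_000   # hypothetical parameters
    est = {"S^2": np.empty(reps), "T^2": np.empty(reps), "W^2": np.empty(reps)}
    for j in range(reps):
        x = rng.normal(mu, np.sqrt(sigma2), size=n)
        m = x.mean()
        est["S^2"][j] = np.sum((x - m) ** 2) / (n - 1)   # unbiased sample variance
        est["T^2"][j] = np.sum((x - m) ** 2) / n         # biased sample variance
        est["W^2"][j] = np.sum((x - mu) ** 2) / n        # uses the known mean
    for name, e in est.items():
        print(name, "bias:", e.mean() - sigma2, "mse:", ((e - sigma2) ** 2).mean())

The empirical results should match the formulas above: T^2 is slightly negatively biased, and the three mean square errors converge together as n grows.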

Estimating Two Parameters


There are several important special distributions with two parameters; some of these are included in the computational exercises
below. With two parameters, we can derive the method of moments estimators by matching the distribution mean and variance with
the sample mean and variance, rather than matching the distribution mean and second moment with the sample mean and second
moment. This alternative approach sometimes leads to easier equations. To set up the notation, suppose that a distribution on R has
parameters a and b. We sample from the distribution to produce a sequence of independent variables X = (X_1, X_2, …), each with
the common distribution. For n ∈ N_+, X_n = (X_1, X_2, …, X_n) is a random sample of size n from the distribution. Let M_n,
M_n^{(2)}, and T_n^2 denote the sample mean, second-order sample mean, and biased sample variance corresponding to X_n, and let
μ(a, b), μ^{(2)}(a, b), and σ^2(a, b) denote the mean, second-order mean, and variance of the distribution.

If the method of moments estimators U_n and V_n of a and b, respectively, can be found by solving the first two equations

μ(U_n, V_n) = M_n,  μ^{(2)}(U_n, V_n) = M_n^{(2)}    (7.2.9)

then U_n and V_n can also be found by solving the equations

μ(U_n, V_n) = M_n,  σ^2(U_n, V_n) = T_n^2    (7.2.10)

Proof

Because of this result, the biased sample variance T_n^2 will appear in many of the estimation problems for special distributions that
we consider below.

Special Distributions
The Normal Distribution
The normal distribution with mean μ ∈ R and variance σ^2 ∈ (0, ∞) is a continuous distribution on R with probability density
function g given by

g(x) = (1/(√(2π) σ)) exp[−(1/2)((x − μ)/σ)^2],  x ∈ R    (7.2.11)

This is one of the most important distributions in probability and statistics, primarily because of the central limit theorem. The
normal distribution is studied in more detail in the chapter on Special Distributions.
Suppose now that X = (X_1, X_2, …, X_n) is a random sample of size n from the normal distribution with mean μ and variance
σ^2. From our general work above, we know that if μ is unknown then the sample mean M is the method of moments estimator of
μ, and if in addition, σ^2 is unknown then the method of moments estimator of σ^2 is T^2. On the other hand, in the unlikely event
that μ is known then W^2 is the method of moments estimator of σ^2. Our goal is to see how the comparisons above simplify for the
normal distribution.

Mean square errors of S_n^2 and T_n^2.

1. mse(T_n^2) = ((2n − 1)/n^2) σ^4
2. mse(S_n^2) = (2/(n − 1)) σ^4
3. mse(T_n^2) < mse(S_n^2) for n ∈ {2, 3, …}
Proof

Thus, S_n^2 and T_n^2 are multiples of one another; S_n^2 is unbiased, but when the sampling distribution is normal, T_n^2 has smaller mean
square error. Surprisingly, T_n^2 has smaller mean square error even than W_n^2.

Mean square errors of T_n^2 and W_n^2.

1. mse(W_n^2) = (2/n) σ^4
2. mse(T_n^2) < mse(W_n^2) for n ∈ {2, 3, …}
Proof

Run the normal estimation experiment 1000 times for several values of the sample size n and the parameters μ and σ.
Compare the empirical bias and mean square error of S^2 and of T^2 to their theoretical values. Which estimator is better in
terms of bias? Which estimator is better in terms of mean square error?

Next we consider estimators of the standard deviation σ. As noted in the general discussion above, T = √(T^2) is the method of
moments estimator when μ is unknown, while W = √(W^2) is the method of moments estimator in the unlikely event that μ is
known. Another natural estimator, of course, is S = √(S^2), the usual sample standard deviation. The following sequence, defined in
terms of the gamma function, turns out to be important in the analysis of all three estimators.

Consider the sequence

a_n = √(2/n) Γ[(n + 1)/2] / Γ(n/2),  n ∈ N_+    (7.2.12)

Then 0 < a_n < 1 for n ∈ N_+ and a_n ↑ 1 as n ↑ ∞.

First, assume that μ is known so that W_n is the method of moments estimator of σ.

For n ∈ N_+,

1. E(W) = a_n σ
2. bias(W) = (a_n − 1)σ
3. var(W) = (1 − a_n^2) σ^2
4. mse(W) = 2(1 − a_n)σ^2

Proof

Thus W is negatively biased as an estimator of σ but asymptotically unbiased and consistent. Of course we know that in general
(regardless of the underlying distribution), W^2 is an unbiased estimator of σ^2 and so W is negatively biased as an estimator of σ.
In the normal case, since a_n involves no unknown parameters, the statistic W/a_n is an unbiased estimator of σ. Next we consider
the usual sample standard deviation S.

For n ∈ {2, 3, …},

1. E(S) = a_{n−1} σ
2. bias(S) = (a_{n−1} − 1)σ
3. var(S) = (1 − a_{n−1}^2) σ^2
4. mse(S) = 2(1 − a_{n−1})σ^2

Proof

As with W, the statistic S is negatively biased as an estimator of σ but asymptotically unbiased, and also consistent. Since a_{n−1}
involves no unknown parameters, the statistic S/a_{n−1} is an unbiased estimator of σ. Note also that, in terms of bias and mean
square error, S with sample size n behaves like W with sample size n − 1. Finally we consider T, the method of moments
estimator of σ when μ is unknown.

For n ∈ {2, 3, …},

1. E(T) = √((n − 1)/n) a_{n−1} σ
2. bias(T) = (√((n − 1)/n) a_{n−1} − 1) σ
3. var(T) = ((n − 1)/n)(1 − a_{n−1}^2) σ^2
4. mse(T) = (2 − 1/n − 2√((n − 1)/n) a_{n−1}) σ^2

Proof
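The sequence a_n, and with it the exact bias and mean square error of W, S, and T, is easy to evaluate numerically. The following sketch (assuming scipy is available) computes a_n on the log scale via gammaln to avoid overflow in the gamma function; σ = 1 and the sample sizes are hypothetical choices.

    import numpy as np
    from scipy.special import gammaln

    def a(n):
        # a_n = sqrt(2/n) * Gamma((n + 1)/2) / Gamma(n/2), computed on the log scale
        return np.sqrt(2.0 / n) * np.exp(gammaln((n + 1) / 2) - gammaln(n / 2))

    sigma = 1.0
    for n in [2, 5, 10, 100]:
        mse_W = 2 * (1 - a(n)) * sigma ** 2
        mse_S = 2 * (1 - a(n - 1)) * sigma ** 2
        mse_T = (2 - 1 / n - 2 * np.sqrt((n - 1) / n) * a(n - 1)) * sigma ** 2
        print(n, round(a(n), 4), mse_W, mse_S, mse_T)

As n increases, a_n increases to 1 and all three mean square errors decrease to 0, with W (which uses the known mean) the best of the three.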

The Bernoulli Distribution


Recall that an indicator variable is a random variable X that takes only the values 0 and 1. The distribution of X is known as the
Bernoulli distribution, named for Jacob Bernoulli, and has probability density function g given by

g(x) = p^x (1 − p)^{1−x},  x ∈ {0, 1}    (7.2.14)

where p ∈ (0, 1) is the success parameter. The mean of the distribution is p and the variance is p(1 − p).

Suppose now that X = (X_1, X_2, …, X_n) is a random sample of size n from the Bernoulli distribution with unknown success
parameter p. Since the mean of the distribution is p, it follows from our general work above that the method of moments estimator
of p is M, the sample mean. In this case, the sample X is a sequence of Bernoulli trials, and M has a scaled version of the
binomial distribution with parameters n and p:

P(M = k/n) = C(n, k) p^k (1 − p)^{n−k},  k ∈ {0, 1, …, n}    (7.2.15)

Note that since X^k = X for every k ∈ N_+, it follows that μ^{(k)} = p and M^{(k)} = M for every k ∈ N_+. So any of the method of
moments equations would lead to the sample mean M as the estimator of p. Although very simple, this is an important application,
since Bernoulli trials are found embedded in all sorts of estimation problems, such as empirical probability density functions and
empirical distribution functions.

The Geometric Distribution


The geometric distribution on N_+ with success parameter p ∈ (0, 1) has probability density function g given by

g(x) = p(1 − p)^{x−1},  x ∈ N_+    (7.2.16)

The geometric distribution on N_+ governs the number of trials needed to get the first success in a sequence of Bernoulli trials with
success parameter p. The mean of the distribution is μ = 1/p.

Suppose that X = (X_1, X_2, …, X_n) is a random sample of size n from the geometric distribution on N_+ with unknown
success parameter p. The method of moments estimator of p is

U = 1/M    (7.2.17)
Proof

The geometric distribution on N with success parameter p ∈ (0, 1) has probability density function

g(x) = p(1 − p)^x,  x ∈ N    (7.2.18)

This version of the geometric distribution governs the number of failures before the first success in a sequence of Bernoulli trials.
The mean of the distribution is μ = (1 − p)/p.

Suppose that X = (X_1, X_2, …, X_n) is a random sample of size n from the geometric distribution on N with unknown
parameter p. The method of moments estimator of p is

U = 1/(M + 1)    (7.2.19)

Proof

The Negative Binomial Distribution


More generally, the negative binomial distribution on N with shape parameter k ∈ (0, ∞) and success parameter p ∈ (0, 1) has
probability density function

g(x) = C(x + k − 1, k − 1) p^k (1 − p)^x,  x ∈ N    (7.2.20)

If k is a positive integer, then this distribution governs the number of failures before the kth success in a sequence of Bernoulli
trials with success parameter p. However, the distribution makes sense for general k ∈ (0, ∞). The negative binomial distribution
is studied in more detail in the chapter on Bernoulli Trials. The mean of the distribution is k(1 − p)/p and the variance is
k(1 − p)/p^2. Suppose now that X = (X_1, X_2, …, X_n) is a random sample of size n from the negative binomial distribution on
N with shape parameter k and success parameter p.

If k and p are unknown, then the corresponding method of moments estimators U and V are

U = M^2/(T^2 − M),  V = M/T^2    (7.2.21)
Proof
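Here is a small simulation sketch of these estimators, with hypothetical values of k, p, and n; it assumes numpy, whose negative_binomial generator counts failures before the kth success, matching the distribution above.

    import numpy as np

    rng = np.random.default_rng(1)
    k, p, n = 5, 0.4, 10_000          # hypothetical true parameters and sample size
    x = rng.negative_binomial(k, p, size=n)
    M = x.mean()
    T2 = x.var()                      # numpy's var uses the 1/n (biased) convention
    U = M ** 2 / (T2 - M)             # method of moments estimate of k
    V = M / T2                        # method of moments estimate of p
    print(U, V)

With these values the true mean is 7.5 and the true variance is 18.75, so U and V should land near 5 and 0.4.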

As usual, the results are nicer when one of the parameters is known.

Suppose that k is known but p is unknown. The method of moments estimator V_k of p is

V_k = k/(M + k)    (7.2.23)

Proof

Suppose that k is unknown but p is known. The method of moments estimator of k is

U_p = M p/(1 − p)    (7.2.25)

1. E(U_p) = k so U_p is unbiased.
2. var(U_p) = k/(n(1 − p)) so U_p is consistent.

Proof

The Poisson Distribution


The Poisson distribution with parameter r ∈ (0, ∞) is a discrete distribution on N with probability density function g given by

g(x) = e^{−r} r^x / x!,  x ∈ N    (7.2.26)

The mean and variance are both r. The distribution is named for Simeon Poisson and is widely used to model the number of
“random points” in a region of time or space. The parameter r is proportional to the size of the region, with the proportionality
constant playing the role of the average rate at which the points are distributed in time or space. The Poisson distribution is studied
in more detail in the chapter on the Poisson Process.
Suppose now that X = (X_1, X_2, …, X_n) is a random sample of size n from the Poisson distribution with parameter r. Since r is
the mean, it follows from our general work above that the method of moments estimator of r is the sample mean M.

The Gamma Distribution


The gamma distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞) is a continuous distribution on (0, ∞)
with probability density function g given by

g(x) = (1/(Γ(k) b^k)) x^{k−1} e^{−x/b},  x ∈ (0, ∞)    (7.2.27)

The gamma probability density function has a variety of shapes, and so this distribution is used to model various types of positive
random variables. The gamma distribution is studied in more detail in the chapter on Special Distributions. The mean is μ = kb
and the variance is σ^2 = kb^2.

Suppose now that X = (X_1, X_2, …, X_n) is a random sample from the gamma distribution with shape parameter k and scale
parameter b.

Suppose that k and b are both unknown, and let U and V be the corresponding method of moments estimators. Then

U = M^2/T^2,  V = T^2/M    (7.2.28)

Proof

The method of moments estimators of k and b given in the previous exercise are complicated, nonlinear functions of the sample
mean M and the sample variance T^2. Thus, computing the bias and mean square errors of these estimators is a difficult problem
that we will not attempt. However, we can judge the quality of the estimators empirically, through simulations such as the sketch below.
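A simulation sketch along these lines (hypothetical k, b, and n; numpy assumed):

    import numpy as np

    rng = np.random.default_rng(2)
    k, b, n = 3.0, 2.0, 10_000        # hypothetical shape, scale, and sample size
    x = rng.gamma(k, b, size=n)
    M, T2 = x.mean(), x.var()         # sample mean and biased sample variance
    U, V = M ** 2 / T2, T2 / M        # method of moments estimates of k and b
    print(U, V)

Repeating this many times and averaging (U − k)^2 and (V − b)^2 gives empirical mean square errors.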
When one of the parameters is known, the method of moments estimator of the other parameter is much simpler.

Suppose that k is unknown, but b is known. The method of moments estimator of k is

U_b = M/b    (7.2.29)

1. E(U_b) = k so U_b is unbiased.
2. var(U_b) = k/n so U_b is consistent.

Proof

Suppose that b is unknown, but k is known. The method of moments estimator of b is

V_k = M/k    (7.2.30)

1. E(V_k) = b so V_k is unbiased.
2. var(V_k) = b^2/(kn) so V_k is consistent.

Proof

Run the gamma estimation experiment 1000 times for several different values of the sample size n and the parameters k and b.
Note the empirical bias and mean square error of the estimators U, V, U_b, and V_k. One would think that the estimators when
one of the parameters is known should work better than the corresponding estimators when both parameters are unknown; but
investigate this question empirically.

The Beta Distribution


The beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞) is a continuous distribution on (0, 1) with
probability density function g given by

g(x) = (1/B(a, b)) x^{a−1} (1 − x)^{b−1},  0 < x < 1    (7.2.31)

The beta probability density function has a variety of shapes, and so this distribution is widely used to model various types of
random variables that take values in bounded intervals. The beta distribution is studied in more detail in the chapter on Special
Distributions. The first two moments are μ = a/(a + b) and μ^{(2)} = a(a + 1)/((a + b)(a + b + 1)).

Suppose now that X = (X_1, X_2, …, X_n) is a random sample of size n from the beta distribution with left parameter a and right
parameter b.

Suppose that a and b are both unknown, and let U and V be the corresponding method of moments estimators. Then

U = M(M − M^{(2)})/(M^{(2)} − M^2),  V = (1 − M)(M − M^{(2)})/(M^{(2)} − M^2)    (7.2.32)

Proof

The method of moments estimators of a and b given in the previous exercise are complicated nonlinear functions of the sample
moments M and M^{(2)}. Thus, we will not attempt to determine the bias and mean square errors analytically, but you will have an
opportunity to explore them empirically through a simulation, such as the sketch below.
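A simulation sketch for these estimators (hypothetical a, b, and n; numpy assumed):

    import numpy as np

    rng = np.random.default_rng(3)
    a, b, n = 2.0, 5.0, 10_000              # hypothetical true parameters
    x = rng.beta(a, b, size=n)
    M, M2 = x.mean(), (x ** 2).mean()       # sample mean and second-order sample mean
    U = M * (M - M2) / (M2 - M ** 2)        # estimates a
    V = (1 - M) * (M - M2) / (M2 - M ** 2)  # estimates b
    print(U, V)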

Suppose that a is unknown, but b is known. Let U_b be the method of moments estimator of a. Then

U_b = b M/(1 − M)    (7.2.34)

Proof

Suppose that b is unknown, but a is known. Let V_a be the method of moments estimator of b. Then

V_a = a(1 − M)/M    (7.2.35)

Proof

Run the beta estimation experiment 1000 times for several different values of the sample size n and the parameters a and b.
Note the empirical bias and mean square error of the estimators U, V, U_b, and V_a. One would think that the estimators when
one of the parameters is known should work better than the corresponding estimators when both parameters are unknown; but
investigate this question empirically.

The following problem gives a distribution with just one parameter but the second moment equation from the method of moments
is needed to derive an estimator.

Suppose that X = (X_1, X_2, …, X_n) is a random sample from the symmetric beta distribution, in which the left and right
parameters are equal to an unknown value c ∈ (0, ∞). The method of moments estimator of c is

U = (2M^{(2)} − 1)/(1 − 4M^{(2)})    (7.2.36)

Proof

The Pareto Distribution


The Pareto distribution with shape parameter a ∈ (0, ∞) and scale parameter b ∈ (0, ∞) is a continuous distribution on (b, ∞)
with probability density function g given by

g(x) = a b^a / x^{a+1},  b ≤ x < ∞    (7.2.38)

The Pareto distribution is named for Vilfredo Pareto and is a highly skewed and heavy-tailed distribution. It is often used to model
income and certain other types of positive random variables. The Pareto distribution is studied in more detail in the chapter on
Special Distributions. If a > 2, the first two moments of the Pareto distribution are μ = ab/(a − 1) and μ^{(2)} = ab^2/(a − 2).

Suppose now that X = (X_1, X_2, …, X_n) is a random sample of size n from the Pareto distribution with shape parameter a > 2
and scale parameter b > 0.

Suppose that a and b are both unknown, and let U and V be the corresponding method of moments estimators. Then

U = 1 + √(M^{(2)}/(M^{(2)} − M^2))    (7.2.39)
V = (M^{(2)}/M)(1 − √((M^{(2)} − M^2)/M^{(2)}))    (7.2.40)

Proof

As with our previous examples, the method of moments estimators are complicated nonlinear functions of M and M^{(2)}, so
computing the bias and mean square error of the estimators is difficult. Instead, we can investigate the bias and mean square error
empirically, through a simulation such as the following sketch.
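A simulation sketch (hypothetical a > 2, b, and n; numpy assumed, with Pareto variables generated by inverse-CDF sampling, x = b u^{−1/a} for u uniform on (0, 1)):

    import numpy as np

    rng = np.random.default_rng(4)
    a, b, n = 3.0, 2.0, 10_000                        # hypothetical true parameters
    x = b * rng.uniform(size=n) ** (-1.0 / a)         # inverse-CDF sampling
    M, M2 = x.mean(), (x ** 2).mean()
    U = 1 + np.sqrt(M2 / (M2 - M ** 2))               # estimates a
    V = (M2 / M) * (1 - np.sqrt((M2 - M ** 2) / M2))  # estimates b
    print(U, V)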

Run the Pareto estimation experiment 1000 times for several different values of the sample size n and the parameters a and b .
Note the empirical bias and mean square error of the estimators U and V .

When one of the parameters is known, the method of moments estimator for the other parameter is simpler.

Suppose that a is unknown, but b is known. Let U_b be the method of moments estimator of a. Then

U_b = M/(M − b)    (7.2.43)

Proof

Suppose that b is unknown, but a is known. Let V_a be the method of moments estimator of b. Then

V_a = M(a − 1)/a    (7.2.44)

1. E(V_a) = b so V_a is unbiased.
2. var(V_a) = b^2/(na(a − 2)) so V_a is consistent.

Proof

The Uniform Distribution


The (continuous) uniform distribution with location parameter a ∈ R and scale parameter h ∈ (0, ∞) has probability density
function g given by

g(x) = 1/h,  x ∈ [a, a + h]    (7.2.45)

The distribution models a point chosen “at random” from the interval [a, a + h]. The mean of the distribution is μ = a + h/2 and
the variance is σ^2 = h^2/12. The uniform distribution is studied in more detail in the chapter on Special Distributions. Suppose now
that X = (X_1, X_2, …, X_n) is a random sample of size n from the uniform distribution.

Suppose that a and h are both unknown, and let U and V denote the corresponding method of moments estimators. Then

U = M − √3 T,  V = 2√3 T    (7.2.46)

Proof

As usual, we get nicer results when one of the parameters is known.

Suppose that a is known and h is unknown, and let V_a denote the method of moments estimator of h. Then

V_a = 2(M − a)    (7.2.47)

1. E(V_a) = h so V_a is unbiased.
2. var(V_a) = h^2/(3n) so V_a is consistent.

Proof

Suppose that h is known and a is unknown, and let U_h denote the method of moments estimator of a. Then

U_h = M − h/2    (7.2.48)

1. E(U_h) = a so U_h is unbiased.
2. var(U_h) = h^2/(12n) so U_h is consistent.

Proof

The Hypergeometric Model


Our basic assumption in the method of moments is that the sequence of observed random variables X = (X_1, X_2, …, X_n) is a
random sample from a distribution. However, the method makes sense, at least in some cases, when the variables are identically
distributed but dependent. In the hypergeometric model, we have a population of N objects with r of the objects type 1 and the
remaining N − r objects type 0. The parameter N, the population size, is a positive integer. The parameter r, the type 1 size, is a
nonnegative integer with r ≤ N. These are the basic parameters, and typically one or both is unknown. Here are some typical
examples:
1. The objects are devices, classified as good or defective.
2. The objects are persons, classified as female or male.
3. The objects are voters, classified as for or against a particular candidate.
4. The objects are wildlife of a particular type, either tagged or untagged.
We sample n objects from the population at random, without replacement. Let X_i be the type of the ith object selected, so that our
sequence of observed variables is X = (X_1, X_2, …, X_n). The variables are identically distributed indicator variables, with
P(X_i = 1) = r/N for each i ∈ {1, 2, …, n}, but are dependent since the sampling is without replacement. The number of type 1
objects in the sample is Y = ∑_{i=1}^n X_i. This statistic has the hypergeometric distribution with parameters N, r, and n, and has
probability density function given by

P(Y = y) = C(r, y) C(N − r, n − y) / C(N, n) = C(n, y) r^{(y)} (N − r)^{(n−y)} / N^{(n)},  y ∈ {max{0, n − (N − r)}, …, min{n, r}}    (7.2.49)

The hypergeometric model is studied in more detail in the chapter on Finite Sampling Models.

As above, let X = (X_1, X_2, …, X_n) be the observed variables in the hypergeometric model with parameters N and r. Then
1. The method of moments estimator of p = r/N is M = Y/n, the sample mean.
2. The method of moments estimator of r with N known is U = N M = N Y/n.
3. The method of moments estimator of N with r known is V = r/M = rn/Y if Y > 0.
Proof

In the voter example (3) above, typically N and r are both unknown, but we would only be interested in estimating the ratio
p = r/N. In the reliability example (1), we might typically know N and would be interested in estimating r. In the wildlife
example (4), we would typically know r and would be interested in estimating N. This example is known as the capture-recapture
model.
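A quick sketch of the capture-recapture estimate (hypothetical N, r, and n; numpy assumed):

    import numpy as np

    rng = np.random.default_rng(5)
    N, r, n = 1000, 120, 100                # hypothetical population, tagged count, sample size
    Y = rng.hypergeometric(r, N - r, n)     # tagged objects observed in the sample
    if Y > 0:
        print(r * n / Y)                    # method of moments estimate of N

With these values E(Y) = 12, so the estimate should typically fall near the true N = 1000.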
Clearly there is a close relationship between the hypergeometric model and the Bernoulli trials model above. In fact, if the
sampling is with replacement, the Bernoulli trials model would apply rather than the hypergeometric model. In addition, if the
population size N is large compared to the sample size n , the hypergeometric model is well approximated by the Bernoulli trials
model.

This page titled 7.2: The Method of Moments is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

7.3: Maximum Likelihood

Basic Theory
The Method
Suppose again that we have an observable random variable X for an experiment, taking values in a set S. Suppose also that the
distribution of X depends on an unknown parameter θ, taking values in a parameter space Θ. Of course, our data variable X will
almost always be vector valued. The parameter θ may also be vector valued. We will denote the probability density function of X
on S by f_θ for θ ∈ Θ. The distribution of X could be discrete or continuous.

The likelihood function is the function obtained by reversing the roles of x and θ in the probability density function; that is, we
view θ as the variable and x as the given information (which is precisely the point of view in estimation).

The likelihood function at x ∈ S is the function L_x: Θ → [0, ∞) given by

L_x(θ) = f_θ(x),  θ ∈ Θ    (7.3.1)

In the method of maximum likelihood, we try to find the value of the parameter that maximizes the likelihood function for each
value of the data vector.

Suppose that the maximum value of L_x occurs at u(x) ∈ Θ for each x ∈ S. Then the statistic u(X) is a maximum likelihood
estimator of θ.

The method of maximum likelihood is intuitively appealing—we try to find the value of the parameter that would have most likely
produced the data we in fact observed.
Since the natural logarithm function is strictly increasing on (0, ∞), the maximum value of the likelihood function, if it exists, will
occur at the same points as the maximum value of the logarithm of the likelihood function.

The log-likelihood function at x ∈ S is the function ln L_x given by

ln L_x(θ) = ln f_θ(x),  θ ∈ Θ    (7.3.2)

If the maximum value of ln L_x occurs at u(x) ∈ Θ for each x ∈ S, then the statistic u(X) is a maximum likelihood
estimator of θ.

The log-likelihood function is often easier to work with than the likelihood function (typically because the probability density
function f_θ(x) has a product structure).

Vector of Parameters
An important special case is when θ = (θ_1, θ_2, …, θ_k) is a vector of k real parameters, so that Θ ⊆ R^k. In this case, the maximum
likelihood problem is to maximize a function of several variables. If Θ is a continuous set, the methods of calculus can be used. If
the maximum value of L_x occurs at a point θ in the interior of Θ, then L_x has a local maximum at θ. Therefore, assuming that the
likelihood function is differentiable, we can find this point by solving

∂/∂θ_i L_x(θ) = 0,  i ∈ {1, 2, …, k}    (7.3.3)

or equivalently

∂/∂θ_i ln L_x(θ) = 0,  i ∈ {1, 2, …, k}    (7.3.4)

On the other hand, the maximum value may occur at a boundary point of Θ, or may not exist at all.

Random Samples
The most important special case is when the data variables form a random sample from a distribution.

Suppose that X = (X_1, X_2, …, X_n) is a random sample of size n from the distribution of a random variable X taking values
in R, with probability density function g_θ for θ ∈ Θ. Then X takes values in S = R^n, and the likelihood and log-likelihood
functions for x = (x_1, x_2, …, x_n) ∈ S are

L_x(θ) = ∏_{i=1}^n g_θ(x_i),  θ ∈ Θ
ln L_x(θ) = ∑_{i=1}^n ln g_θ(x_i),  θ ∈ Θ
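When the likelihood equations have no closed-form solution, the log-likelihood can be maximized numerically. Here is a minimal sketch (a supplement, not from the text) that assumes a gamma sample with known scale 1 and unknown shape k, and uses scipy's bounded scalar minimizer on the negative log-likelihood; the sample, seed, and bounds are hypothetical choices.

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.special import gammaln

    rng = np.random.default_rng(6)
    x = rng.gamma(3.0, 1.0, size=200)     # hypothetical sample; target shape is 3

    def neg_log_lik(k):
        # minus ln L_x(k): -sum of ln g_k(x_i) for the gamma(k, 1) density
        return -np.sum((k - 1) * np.log(x) - x - gammaln(k))

    res = minimize_scalar(neg_log_lik, bounds=(0.01, 20), method="bounded")
    print(res.x)                          # numerical maximum likelihood estimate of k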

Extending the Method and the Invariance Property


Returning to the general setting, suppose now that h is a one-to-one function from the parameter space Θ onto a set Λ. We can
view λ = h(θ) as a new parameter taking values in the space Λ, and it is easy to re-parameterize the probability density function
with the new parameter. Thus, let f̂_λ(x) = f_{h^{−1}(λ)}(x) for x ∈ S and λ ∈ Λ. The corresponding likelihood function for x ∈ S is

L̂_x(λ) = L_x[h^{−1}(λ)],  λ ∈ Λ    (7.3.5)

Clearly if u(x) ∈ Θ maximizes L_x for x ∈ S, then h[u(x)] ∈ Λ maximizes L̂_x for x ∈ S. It follows that if U is a maximum
likelihood estimator for θ, then V = h(U) is a maximum likelihood estimator for λ = h(θ).
If the function h is not one-to-one, the maximum likelihood function for the new parameter λ = h(θ) is not well defined, because
we cannot parameterize the probability density function in terms of λ . However, there is a natural generalization of the method.

Suppose that h: Θ → Λ, and let λ = h(θ) denote the new parameter. Define the likelihood function for λ at x ∈ S by

L̂_x(λ) = max{L_x(θ) : θ ∈ h^{−1}{λ}},  λ ∈ Λ    (7.3.6)

If v(x) ∈ Λ maximizes L̂_x for each x ∈ S, then V = v(X) is a maximum likelihood estimator of λ.

This definition extends the maximum likelihood method to cases where the probability density function is not completely
parameterized by the parameter of interest. The following theorem is known as the invariance property: if we can solve the
maximum likelihood problem for θ then we can solve the maximum likelihood problem for λ = h(θ) .

In the setting of the previous theorem, if U is a maximum likelihood estimator of θ , then V = h(U ) is a maximum likelihood
estimator of λ .
Proof

Examples and Special Cases


In the following subsections, we will study maximum likelihood estimation for a number of special parametric families of
distributions. Recall that if X = (X_1, X_2, …, X_n) is a random sample from a distribution with mean μ and variance σ^2, then the
method of moments estimators of μ and σ^2 are, respectively,

M = (1/n) ∑_{i=1}^n X_i    (7.3.7)
T^2 = (1/n) ∑_{i=1}^n (X_i − M)^2    (7.3.8)

Of course, M is the sample mean, and T^2 is the biased version of the sample variance. These statistics will also sometimes occur
as maximum likelihood estimators. Another statistic that will occur in some of the examples below is

M_2 = (1/n) ∑_{i=1}^n X_i^2    (7.3.9)
the second-order sample mean. As always, be sure to try the derivations yourself before looking at the solutions.

The Bernoulli Distribution


Suppose that X = (X_1, X_2, …, X_n) is a random sample of size n from the Bernoulli distribution with success parameter
p ∈ [0, 1]. Recall that the Bernoulli probability density function is

g(x) = p^x (1 − p)^{1−x},  x ∈ {0, 1}    (7.3.10)

Thus, X is a sequence of independent indicator variables with P(X_i = 1) = p for each i. In the usual language of reliability, X_i is
the outcome of trial i, where 1 means success and 0 means failure. Let Y = ∑_{i=1}^n X_i denote the number of successes, so that the
proportion of successes (the sample mean) is M = Y/n. Recall that Y has the binomial distribution with parameters n and p.

The sample mean M is the maximum likelihood estimator of p on the parameter space (0, 1).
Proof

Recall that M is also the method of moments estimator of p. It's always nice when two different estimation procedures yield the
same result. Next let's look at the same problem, but with a much restricted parameter space.

Suppose now that p takes values in {1/2, 1}. Then the maximum likelihood estimator of p is the statistic

U = 1 if Y = n;  U = 1/2 if Y < n    (7.3.14)

1. E(U) = 1 if p = 1, and E(U) = 1/2 + (1/2)^{n+1} if p = 1/2
2. U is positively biased, but is asymptotically unbiased.
3. mse(U) = 0 if p = 1, and mse(U) = (1/2)^{n+2} if p = 1/2
4. U is consistent.
Proof

Note that the Bernoulli distribution in the last exercise would model a coin that is either fair or two-headed. The last two exercises
show that the maximum likelihood estimator of a parameter, like the solution to any maximization problem, depends critically on
the domain.

U is uniformly better than M on the parameter space {1/2, 1}.

Proof

Suppose that X = (X_1, X_2, …, X_n) is a random sample of size n from the Bernoulli distribution with unknown success
parameter p ∈ (0, 1). Find the maximum likelihood estimator of p(1 − p), which is the variance of the sampling distribution.
Answer

The Geometric Distribution


Recall that the geometric distribution on N_+ with success parameter p ∈ (0, 1) has probability density function

g(x) = p(1 − p)^{x−1},  x ∈ N_+    (7.3.17)

The geometric distribution governs the trial number of the first success in a sequence of Bernoulli trials.

Suppose that X = (X_1, X_2, …, X_n) is a random sample from the geometric distribution with unknown parameter p ∈ (0, 1).
The maximum likelihood estimator of p is U = 1/M.


Proof

Recall that U is also the method of moments estimator of p. It's always reassuring when two different estimation procedures
produce the same estimator.

The Negative Binomial Distribution
More generally, the negative binomial distribution on N with shape parameter k ∈ (0, ∞) and success parameter p ∈ (0, 1) has
probability density function

g(x) = C(x + k − 1, k − 1) p^k (1 − p)^x,  x ∈ N    (7.3.20)

If k is a positive integer, then this distribution governs the number of failures before the kth success in a sequence of Bernoulli
trials with success parameter p. However, the distribution makes sense for general k ∈ (0, ∞). The negative binomial distribution
is studied in more detail in the chapter on Bernoulli Trials.

Suppose that X = (X_1, X_2, …, X_n) is a random sample of size n from the negative binomial distribution on N with known
shape parameter k and unknown success parameter p ∈ (0, 1). The maximum likelihood estimator of p is

U = k/(k + M)    (7.3.21)

Proof

Once again, this is the same as the method of moments estimator of p with k known.

The Poisson Distribution


Recall that the Poisson distribution with parameter r > 0 has probability density function

g(x) = e^{−r} r^x / x!,  x ∈ N    (7.3.24)
x!

The Poisson distribution is named for Simeon Poisson and is widely used to model the number of random “points” in a region of
time or space. The parameter r is proportional to the size of the region. The Poisson distribution is studied in more detail in the
chapter on the Poisson process.

Suppose that X = (X_1, X_2, …, X_n) is a random sample from the Poisson distribution with unknown parameter r ∈ (0, ∞).
The maximum likelihood estimator of r is the sample mean M.


Proof

Recall that for the Poisson distribution, the parameter r is both the mean and the variance. Thus M is also the method of moments
estimator of r. We showed in the introductory section that M has smaller mean square error than S^2, although both are unbiased.

Suppose that X = (X_1, X_2, …, X_n) is a random sample from the Poisson distribution with parameter r ∈ (0, ∞), and let
p = P(X = 0) = e^{−r}. Find the maximum likelihood estimator of p in two ways:
1. Directly, by finding the likelihood function corresponding to the parameter p.
2. By using the result of the last exercise and the invariance property.
Answer
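A quick numerical check of the invariance property (hypothetical r and n; numpy assumed): the maximum likelihood estimate of p = e^{−r} is e^{−M}, which should be close to the empirical proportion of zeros in the sample.

    import numpy as np

    rng = np.random.default_rng(7)
    r, n = 2.5, 500                        # hypothetical rate and sample size
    x = rng.poisson(r, size=n)
    M = x.mean()                           # maximum likelihood estimate of r
    print(np.exp(-M), np.mean(x == 0))     # MLE of e^{-r} vs. empirical P(X = 0)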

The Normal Distribution


Recall that the normal distribution with mean μ and variance σ^2 has probability density function

g(x) = (1/(√(2π) σ)) exp[−(1/2)((x − μ)/σ)^2],  x ∈ R    (7.3.26)

The normal distribution is often used to model physical quantities subject to small, random errors, and is studied in more detail in
the chapter on Special Distributions

Suppose that X = (X_1, X_2, …, X_n) is a random sample from the normal distribution with unknown mean μ ∈ R and
variance σ^2 ∈ (0, ∞). The maximum likelihood estimators of μ and σ^2 are M and T^2, respectively.

Proof

Of course, M and T^2 are also the method of moments estimators of μ and σ^2, respectively.
Run the Normal estimation experiment 1000 times for several values of the sample size n, the mean μ, and the variance σ^2.
For the parameter σ^2, compare the maximum likelihood estimator T^2 with the standard sample variance S^2. Which estimator
seems to work better in terms of mean square error?

Suppose again that X = (X_1, X_2, …, X_n) is a random sample from the normal distribution with unknown mean μ ∈ R and
unknown variance σ^2 ∈ (0, ∞). Find the maximum likelihood estimator of μ^2 + σ^2, which is the second moment about 0 for
the sampling distribution.


Answer

The Gamma Distribution


Recall that the gamma distribution with shape parameter k > 0 and scale parameter b > 0 has probability density function

g(x) = (1/(Γ(k) b^k)) x^{k−1} e^{−x/b},  0 < x < ∞    (7.3.30)

The gamma distribution is often used to model random times and certain other types of positive random variables, and is studied in
more detail in the chapter on Special Distributions

Suppose that X = (X_1, X_2, …, X_n) is a random sample from the gamma distribution with known shape parameter k and
unknown scale parameter b ∈ (0, ∞). The maximum likelihood estimator of b is V_k = M/k.

Proof

Recall that V_k is also the method of moments estimator of b when k is known. But when k is unknown, the method of moments
estimator of b is V = T^2/M.

Run the gamma estimation experiment 1000 times for several values of the sample size n, shape parameter k, and scale
parameter b. In each case, compare the method of moments estimator V of b when k is unknown with the method of moments
and maximum likelihood estimator V_k of b when k is known. Which estimator seems to work better in terms of mean square
error?

The Beta Distribution


Recall that the beta distribution with left parameter a ∈ (0, ∞) and right parameter b = 1 has probability density function

g(x) = a x^{a−1},  x ∈ (0, 1)    (7.3.34)

The beta distribution is often used to model random proportions and other random variables that take values in bounded intervals. It
is studied in more detail in the chapter on Special Distributions.

Suppose that X = (X_1, X_2, …, X_n) is a random sample from the beta distribution with unknown left parameter a ∈ (0, ∞)
and right parameter b = 1. The maximum likelihood estimator of a is

W = −n / ∑_{i=1}^n ln X_i = −n / ln(X_1 X_2 ⋯ X_n)    (7.3.35)

Proof

Recall that when b = 1, the method of moments estimator of a is U_1 = M/(1 − M), but when b ∈ (0, ∞) is also unknown, the
method of moments estimator of a is U = M(M − M_2)/(M_2 − M^2). When b = 1, which estimator is better, the method of
moments estimator or the maximum likelihood estimator?

In the beta estimation experiment, set b = 1. Run the experiment 1000 times for several values of the sample size n and the
parameter a. In each case, compare the estimators U_1, U, and W. Which estimator seems to work better in terms of mean
square error?

Finally, note that 1/W is the sample mean for a random sample of size n from the distribution of − ln X . This distribution is the
exponential distribution with rate a .
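A simulation sketch comparing the two estimators when b = 1 (hypothetical a, n, and replication count; numpy assumed):

    import numpy as np

    rng = np.random.default_rng(8)
    a, n, reps = 2.0, 25, 20_000          # hypothetical parameter and sample size
    W = np.empty(reps); U1 = np.empty(reps)
    for j in range(reps):
        x = rng.beta(a, 1.0, size=n)
        W[j] = -n / np.log(x).sum()       # maximum likelihood estimate
        M = x.mean()
        U1[j] = M / (1 - M)               # method of moments estimate with b = 1
    for name, e in [("W", W), ("U1", U1)]:
        print(name, "mse:", ((e - a) ** 2).mean())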

The Pareto Distribution
Recall that the Pareto distribution with shape parameter a > 0 and scale parameter b > 0 has probability density function
a
ab
g(x) = , b ≤x <∞ (7.3.37)
a+1
x

The Pareto distribution, named for Vilfredo Pareto, is a heavy-tailed distribution often used to model income and certain other
types of random variables. It is studied in more detail in the chapter on Special Distributions.

Suppose that X = (X_1, X_2, …, X_n) is a random sample from the Pareto distribution with unknown shape parameter
a ∈ (0, ∞) and scale parameter b ∈ (0, ∞). The maximum likelihood estimator of b is X_{(1)} = min{X_1, X_2, …, X_n}, the
first order statistic. The maximum likelihood estimator of a is

U = n / (∑_{i=1}^n ln X_i − n ln X_{(1)}) = n / ∑_{i=1}^n (ln X_i − ln X_{(1)})    (7.3.38)

Proof
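A sketch of the maximum likelihood computation (hypothetical a, b, and n; numpy assumed, with the Pareto sample generated by inverse-CDF sampling as in the earlier section):

    import numpy as np

    rng = np.random.default_rng(9)
    a, b, n = 3.0, 2.0, 500                        # hypothetical true parameters
    x = b * rng.uniform(size=n) ** (-1.0 / a)
    b_hat = x.min()                                # MLE of b: first order statistic
    a_hat = n / np.sum(np.log(x) - np.log(b_hat))  # MLE of a from the formula above
    print(a_hat, b_hat)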

Recall that if a > 2, the method of moments estimators of a and b are

1 + √(M_2/(M_2 − M^2)),  (M_2/M)(1 − √((M_2 − M^2)/M_2))    (7.3.41)

Open the Pareto estimation experiment. Run the experiment 1000 times for several values of the sample size n and the
parameters a and b. Compare the method of moments and maximum likelihood estimators. Which estimators seem to work
better in terms of bias and mean square error?

Often the scale parameter in the Pareto distribution is known.

Suppose that X = (X_1, X_2, …, X_n) is a random sample from the Pareto distribution with unknown shape parameter
a ∈ (0, ∞) and known scale parameter b ∈ (0, ∞). The maximum likelihood estimator of a is

U = n / (∑_{i=1}^n ln X_i − n ln b) = n / ∑_{i=1}^n (ln X_i − ln b)    (7.3.42)

Proof

Uniform Distributions
In this section we will study estimation problems related to the uniform distribution that are a good source of insight and
counterexamples. In a sense, our first estimation problem is the continuous analogue of an estimation problem studied in the
section on Order Statistics in the chapter on Finite Sampling Models. Suppose that X = (X_1, X_2, …, X_n) is a random sample from
the uniform distribution on the interval [0, h], where h ∈ (0, ∞) is an unknown parameter. Thus, the sampling distribution has
probability density function

g(x) = 1/h,  x ∈ [0, h]    (7.3.45)

First let's review results from the last section.

The method of moments estimator of h is U = 2M. The estimator U satisfies the following properties:
1. U is unbiased.
2. var(U) = h^2/(3n) so U is consistent.

Now let's find the maximum likelihood estimator

The maximum likelihood estimator of h is X_{(n)} = max{X_1, X_2, …, X_n}, the nth order statistic. The estimator X_{(n)}
satisfies the following properties:

1. E(X_{(n)}) = (n/(n + 1)) h
2. bias(X_{(n)}) = −h/(n + 1) so that X_{(n)} is negatively biased but asymptotically unbiased.
3. var(X_{(n)}) = (n/((n + 2)(n + 1)^2)) h^2
4. mse(X_{(n)}) = (2/((n + 1)(n + 2))) h^2 so that X_{(n)} is consistent.

Proof

Since the expected value of X_{(n)} is a known multiple of the parameter h, we can easily construct an unbiased estimator.

Let V = ((n + 1)/n) X_{(n)}. The estimator V satisfies the following properties:
1. V is unbiased.
2. var(V) = (1/(n(n + 2))) h^2 so that V is consistent.
3. The asymptotic relative efficiency of V to U is infinite.
Proof

The last part shows that the unbiased version V of the maximum likelihood estimator is a much better estimator than the method of
moments estimator U. In fact, an estimator such as V, whose mean square error decreases on the order of 1/n^2, is called super
efficient. Now, having found a really good estimator, let's see if we can find a really bad one. A natural candidate is an estimator
based on X_{(1)} = min{X_1, X_2, …, X_n}, the first order statistic. The next result will make the computations very easy.

The sample X = (X_1, X_2, …, X_n) satisfies the following properties:

1. h − X_i is uniformly distributed on [0, h] for each i.
2. (h − X_1, h − X_2, …, h − X_n) is also a random sample from the uniform distribution on [0, h].
3. X_{(1)} has the same distribution as h − X_{(n)}.

Proof

Now we can construct our really bad estimator.

Let W = (n + 1)X_{(1)}. Then

1. W is an unbiased estimator of h.
2. var(W) = (n/(n + 2)) h^2, so W is not even consistent.

Proof
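A simulation sketch comparing U, V, and W (hypothetical h, n, and replication count; numpy assumed):

    import numpy as np

    rng = np.random.default_rng(10)
    h, n, reps = 1.0, 10, 100_000
    U = np.empty(reps); V = np.empty(reps); W = np.empty(reps)
    for j in range(reps):
        x = rng.uniform(0, h, size=n)
        U[j] = 2 * x.mean()                 # method of moments
        V[j] = (n + 1) / n * x.max()        # unbiased version of the MLE
        W[j] = (n + 1) * x.min()            # unbiased but not consistent
    for name, e in [("U", U), ("V", V), ("W", W)]:
        print(name, "mse:", ((e - h) ** 2).mean())

The empirical mean square errors should be close to h^2/(3n), h^2/(n(n + 2)), and (n/(n + 2))h^2, respectively.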

Run the uniform estimation experiment 1000 times for several values of the sample size n and the parameter a . In each case,
compare the empirical bias and mean square error of the estimators with their theoretical values. Rank the estimators in terms
of empirical mean square error.

Our next series of exercises will show that the maximum likelihood estimator is not necessarily unique. Suppose that
X = (X_1, X_2, …, X_n) is a random sample from the uniform distribution on the interval [a, a + 1], where a ∈ R is an unknown
parameter. Thus, the sampling distribution has probability density function

g(x) = 1,  a ≤ x ≤ a + 1    (7.3.47)

As usual, let's first review the method of moments estimator.

The method of moments estimator of a is U = M − 1/2. The estimator U satisfies the following properties:
1. U is unbiased.
2. var(U) = 1/(12n) so U is consistent.

However, as promised, there is not a unique maximum likelihood estimator.

Any statistic V ∈ [X_{(n)} − 1, X_{(1)}] is a maximum likelihood estimator of a.


Proof

For completeness, let's consider the full estimation problem. Suppose that X = (X_1, X_2, …, X_n) is a random sample of size n
from the uniform distribution on [a, a + h] where a ∈ R and h ∈ (0, ∞) are both unknown. Here's the result from the last section:

Let U and V denote the method of moments estimators of a and h, respectively. Then

U = M − √3 T,  V = 2√3 T    (7.3.48)

where M = (1/n) ∑_{i=1}^n X_i is the sample mean, and T^2 = (1/n) ∑_{i=1}^n (X_i − M)^2 is the biased version of the sample variance.

It should come as no surprise at this point that the maximum likelihood estimators are functions of the largest and smallest order
statistics.

The maximum likelihood estimators of a and h are U = X_{(1)} and V = X_{(n)} − X_{(1)}, respectively.
1. E(U) = a + h/(n + 1) so U is positively biased and asymptotically unbiased.
2. E(V) = h(n − 1)/(n + 1) so V is negatively biased and asymptotically unbiased.
3. var(U) = (n/((n + 1)^2(n + 2))) h^2 so U is consistent.
4. var(V) = (2(n − 1)/((n + 1)^2(n + 2))) h^2 so V is consistent.

Proof

The Hypergeometric Model


In all of our previous examples, the sequence of observed random variables X = (X_1, X_2, …, X_n) is a random sample from a
distribution. However, maximum likelihood is a very general method that does not require the observation variables to be
independent or identically distributed. In the hypergeometric model, we have a population of N objects with r of the objects type 1
and the remaining N − r objects type 0. The population size N is a positive integer. The type 1 size r is a nonnegative integer
with r ≤ N. These are the basic parameters, and typically one or both is unknown. Here are some typical examples:
1. The objects are devices, classified as good or defective.
2. The objects are persons, classified as female or male.
3. The objects are voters, classified as for or against a particular candidate.
4. The objects are wildlife of a particular type, either tagged or untagged.
We sample n objects from the population at random, without replacement. Let X_i be the type of the ith object selected, so that our
sequence of observed variables is X = (X_1, X_2, …, X_n). The variables are identically distributed indicator variables, with
P(X_i = 1) = r/N for each i ∈ {1, 2, …, n}, but are dependent since the sampling is without replacement. The number of type 1
objects in the sample is Y = ∑_{i=1}^n X_i. This statistic has the hypergeometric distribution with parameters N, r, and n, and has
probability density function given by

P(Y = y) = C(r, y) C(N − r, n − y) / C(N, n) = C(n, y) r^{(y)} (N − r)^{(n−y)} / N^{(n)},  y ∈ {max{0, n − (N − r)}, …, min{n, r}}    (7.3.49)

Recall the falling power notation: x^{(k)} = x(x − 1) ⋯ (x − k + 1) for x ∈ R and k ∈ N. The hypergeometric model is studied in
more detail in the chapter on Finite Sampling Models.
more detail in the chapter on Finite Sampling Models.

As above, let X = (X_1, X_2, …, X_n) be the observed variables in the hypergeometric model with parameters N and r. Then
1. The maximum likelihood estimator of r with N known is U = ⌊N M ⌋ = ⌊N Y /n⌋ .
2. The maximum likelihood estimator of N with r known is V = ⌊r/M ⌋ = ⌊rn/Y ⌋ if Y >0 .
Proof

In the reliability example (1), we might typically know N and would be interested in estimating r. In the wildlife example (4), we
would typically know r and would be interested in estimating N . This example is known as the capture-recapture model.
Clearly there is a close relationship between the hypergeometric model and the Bernoulli trials model above. In fact, if the
sampling is with replacement, the Bernoulli trials model with p = r/N would apply rather than the hypergeometric model. In
addition, if the population size N is large compared to the sample size n , the hypergeometric model is well approximated by the
Bernoulli trials model, again with p = r/N .

This page titled 7.3: Maximum Likelihood is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

7.4: Bayesian Estimation

Basic Theory
The General Method
Suppose again that we have an observable random variable X for an experiment, taking values in a set S. Suppose also that the
distribution of X depends on a parameter θ taking values in a parameter space T. Of course, our data variable X is almost always
vector-valued, so that typically S ⊆ R^n for some n ∈ N_+. Depending on the nature of the sample space S, the distribution of X
may be discrete or continuous. The parameter θ may also be vector-valued, so that typically T ⊆ R^k for some k ∈ N_+.

In Bayesian analysis, named for the famous Thomas Bayes, we model the deterministic, but unknown parameter θ with a random
variable Θ that has a specified distribution on the parameter space T . Depending on the nature of the parameter space, this
distribution may also be either discrete or continuous. It is called the prior distribution of Θ and is intended to reflect our
knowledge of the parameter θ , before we gather data. After observing X = x ∈ S , we then use Bayes' theorem, to compute the
conditional distribution of Θ given X = x . This distribution is called the posterior distribution of Θ, and is an updated
distribution, given the information in the data. Here is the mathematical description, stated in terms of probability density
functions.

Suppose that the prior distribution of Θ on T has probability density function h, and that given Θ = θ ∈ T, the conditional
probability density function of X on S is f(⋅ ∣ θ). Then the probability density function of the posterior distribution of Θ
given X = x ∈ S is

h(θ ∣ x) = h(θ) f(x ∣ θ) / f(x),  θ ∈ T    (7.4.1)

where the function in the denominator is defined as follows, in the discrete and continuous cases, respectively:

f(x) = ∑_{θ∈T} h(θ) f(x ∣ θ),  x ∈ S
f(x) = ∫_T h(θ) f(x ∣ θ) dθ,  x ∈ S

Proof
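When no conjugate form is available, the posterior density can be approximated on a grid directly from this formula. Here is a minimal sketch (a supplement, not from the text) for a Bernoulli success probability with a uniform prior; the grid, data, and prior are hypothetical choices, and only numpy is assumed.

    import numpy as np

    theta = np.linspace(0.001, 0.999, 999)       # grid over the parameter space T
    d = theta[1] - theta[0]
    prior = np.ones_like(theta)                  # uniform (non-informative) prior h
    y, n = 7, 10                                 # hypothetical data: 7 successes in 10 trials
    like = theta ** y * (1 - theta) ** (n - y)   # f(x | theta) up to a constant
    post = prior * like
    post /= post.sum() * d                       # normalize; the divisor plays the role of f(x)
    print((theta * post).sum() * d)              # approximate posterior mean E(Θ | X = x)

With a uniform prior the exact posterior is beta, with mean (y + 1)/(n + 2) = 8/12 ≈ 0.667, so the grid answer should be close to that.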

For x ∈ S, note that f(x) is simply the normalizing constant for the function θ ↦ h(θ)f(x ∣ θ). It may not be necessary to
explicitly compute f(x), if one can recognize the functional form of θ ↦ h(θ)f(x ∣ θ) as that of a known distribution. This will
indeed be the case in several of the examples explored below.
If the parameter space T has finite measure c (counting measure in the discrete case or Lebesgue measure in the continuous case),
then one possible prior distribution is the uniform distribution on T, with probability density function h(θ) = 1/c for θ ∈ T. This
distribution reflects no prior knowledge about the parameter, and so is called the non-informative prior distribution.

Random Samples
Of course, an important and essential special case occurs when X = (X_1, X_2, …, X_n) is a random sample of size n from the
distribution of a basic variable X. Specifically, suppose that X takes values in a set R and has probability density function g(⋅ ∣ θ)
for a given θ ∈ T. In this case, S = R^n and the probability density function f(⋅ ∣ θ) of X given θ is

f(x_1, x_2, …, x_n ∣ θ) = g(x_1 ∣ θ) g(x_2 ∣ θ) ⋯ g(x_n ∣ θ),  (x_1, x_2, …, x_n) ∈ S    (7.4.3)

Real Parameters
Suppose that θ is a real-valued parameter, so that T ⊆R . Here is our main definition.

The conditional expected value E(Θ ∣ X) is the Bayesian estimator of θ.

1. If Θ has a discrete distribution on T then

E(Θ ∣ X = x) = ∑_{θ∈T} θ h(θ ∣ x),  x ∈ S    (7.4.4)

2. If Θ has a continuous distribution on T then

E(Θ ∣ X = x) = ∫_T θ h(θ ∣ x) dθ,  x ∈ S    (7.4.5)

Recall that E(Θ ∣ X) is a function of X and, among all functions of X, is closest to Θ in the mean square sense. Of course, once
we collect the data and observe X = x , the Bayesian estimate of θ is E(Θ ∣ X = x) . As always, the term estimator refers to a
random variable, before the data are collected, and the term estimate refers to an observed value of the random variable after the
data are collected. The definitions of bias and mean square error are as before, but now conditioned on Θ = θ ∈ T .

Suppose that U is the Bayes estimator of θ.

1. The bias of U is bias(U ∣ θ) = E(U − θ ∣ Θ = θ) for θ ∈ T.
2. The mean square error of U is mse(U ∣ θ) = E[(U − θ)^2 ∣ Θ = θ] for θ ∈ T.

As before, bias(U ∣ θ) = E(U ∣ θ) − θ and mse(U ∣ θ) = var(U ∣ θ) + bias^2(U ∣ θ). Suppose now that we observe the random
variables (X_1, X_2, X_3, …) sequentially, and we compute the Bayes estimator U_n of θ based on (X_1, X_2, …, X_n) for each
n ∈ N_+. Again, the most common case is when we are sampling from a distribution, so that the sequence is independent and
identically distributed (given θ). We have the natural asymptotic properties that we have seen before.

Let U = (U_n : n ∈ N_+) be the sequence of Bayes estimators of θ as above.

1. U is asymptotically unbiased if bias(U_n ∣ θ) → 0 as n → ∞ for each θ ∈ T.
2. U is mean-square consistent if mse(U_n ∣ θ) → 0 as n → ∞ for each θ ∈ T.

Often we cannot construct unbiased Bayesian estimators, but we do hope that our estimators are at least asymptotically unbiased
and consistent. It turns out that the sequence of Bayesian estimators U is a martingale. The theory of martingales provides some
powerful tools for studying these estimators.
From the Bayesian perspective, the posterior distribution of Θ given the data X = x is of primary importance. Point estimates of θ
derived from this distribution are of secondary importance. In particular, the mean square error function
u ↦ E[(Θ − u)^2 ∣ X = x], minimized as we have noted at E(Θ ∣ X = x), is not the only loss function that can be used.
(Although it's the only one that we consider.) Another possible loss function, among many, is the mean absolute error function
u ↦ E(|Θ − u| ∣ X = x), which we know is minimized at the median(s) of the posterior distribution.

Conjugate Families
Often, the prior distribution of Θ is itself a member of a parametric family, with the parameters specified to reflect our prior
knowledge of θ . In many important special cases, the parametric family can be chosen so that the posterior distribution of Θ given
X = x belongs to the same family for each x ∈ S . In such a case, the family of distributions of Θ is said to be conjugate to the

family of distributions of X. Conjugate families are nice from a computational point of view, since we can often compute the
posterior distribution through a simple formula involving the parameters of the family, without having to use Bayes' theorem
directly. Similarly, in the case that the parameter is real valued, we can often compute the Bayesian estimator through a simple
formula involving the parameters of the conjugate family.

Special Distributions
The Bernoulli Distribution
Suppose that X = (X_1, X_2, …) is a sequence of independent variables, each having the Bernoulli distribution with unknown
success parameter p ∈ (0, 1). In short, X is a sequence of Bernoulli trials, given p. In the usual language of reliability, X_i = 1
means success on trial i and X_i = 0 means failure on trial i. Recall that given p, the Bernoulli distribution has probability density
function

g(x ∣ p) = p^x (1 − p)^{1−x},  x ∈ {0, 1}    (7.4.6)

Note that the number of successes in the first n trials is Y_n = ∑_{i=1}^n X_i. Given p, random variable Y_n has the binomial distribution
with parameters n and p.


Suppose now that we model p with a random variable P that has a prior beta distribution with left parameter a ∈ (0, ∞) and right
parameter b ∈ (0, ∞), where a and b are chosen to reflect our initial information about p. So P has probability density function

h(p) = (1/B(a, b)) p^{a−1} (1 − p)^{b−1},  p ∈ (0, 1)    (7.4.7)

and has mean a/(a + b). For example, if we know nothing about p, we might let a = b = 1, so that the prior distribution is
uniform on the parameter space (0, 1) (the non-informative prior). On the other hand, if we believe that p is about 2/3, we might let
a = 4 and b = 2, so that the prior distribution is unimodal, with mean 2/3. As a random process, the sequence X with p randomized
by P is known as the beta-Bernoulli process, and is very interesting on its own, outside of the context of Bayesian estimation.

For n ∈ N_+, the posterior distribution of P given X_n = (X_1, X_2, …, X_n) is beta with left parameter a + Y_n and right
parameter b + (n − Y_n).

Proof

Thus, the beta distribution is conjugate to the Bernoulli distribution. Note also that the posterior distribution depends on the data
vector X_n only through the number of successes Y_n. This is true because Y_n is a sufficient statistic for p. In particular, note that the
left beta parameter is increased by the number of successes Y_n and the right beta parameter is increased by the number of failures
n − Y_n.

The Bayesian estimator of p given X_n is

U_n = (a + Y_n)/(a + b + n)    (7.4.10)

Proof

In the beta coin experiment, set n = 20 and p = 0.3 , and set a = 4 and b = 2 . Run the simulation 100 times and note the
estimate of p and the shape and location of the posterior probability density function of p on each run.

Next let's compute the bias and mean-square error functions.

For n ∈ N_+,

bias(U_n ∣ p) = [a(1 − p) − bp]/(a + b + n),  p ∈ (0, 1)    (7.4.11)

The sequence U = (U_n : n ∈ N_+) is asymptotically unbiased.


Proof

Note also that we cannot choose a and b to make U_n unbiased, since such a choice would involve the true value of p, which we do
not know.

In the beta coin experiment, vary the parameters and note the change in the bias. Now set n = 20 and p = 0.8 , and set a = 2
and b = 6 . Run the simulation 1000 times. Note the estimate of p and the shape and location of the posterior probability
density function of p on each update. Compare the empirical bias to the true bias.

For n ∈ N_+,

mse(U_n ∣ p) = (p[n − 2a(a + b)] + p^2[(a + b)^2 − n] + a^2)/(a + b + n)^2,  p ∈ (0, 1)    (7.4.13)

The sequence (U_n : n ∈ N_+) is mean-square consistent.


Proof

In the beta coin experiment, vary the parameters and note the change in the mean square error. Now set n = 10 and p = 0.7 ,
and set a = b = 1 . Run the simulation 1000 times. Note the estimate of p and the shape and location of the posterior
probability density function of p on each update. Compare the empirical mean square error to the true mean square error.

Interestingly, we can choose a and b so that U_n has mean square error that is independent of the unknown parameter p:

Let n ∈ N_+ and let a = b = √n/2. Then

mse(U_n ∣ p) = n/(4(n + √n)^2),  p ∈ (0, 1)    (7.4.16)

In the beta coin experiment, set n = 36 and a = b = 3 . Vary p and note that the mean square error does not change. Now set
p = 0.8 and run the simulation 1000 times. Note the estimate of p and the shape and location of the posterior probability

density function on each update. Compare the empirical bias and mean square error to the true values.

Recall that the method of moments estimator and the maximum likelihood estimator of p (on the interval (0, 1)) is the sample mean
(the proportion of successes):

M_n = Y_n/n = (1/n) ∑_{i=1}^n X_i    (7.4.17)

This estimator has mean square error mse(M_n ∣ p) = p(1 − p)/n. To see the connection between the estimators, note from (6) that

U_n = ((a + b)/(a + b + n)) (a/(a + b)) + (n/(a + b + n)) M_n    (7.4.18)

So U_n is a weighted average of a/(a + b) (the mean of the prior distribution) and M_n (the maximum likelihood estimator).
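The posterior updating is simple enough to sketch in a few lines of code (hypothetical prior parameters and true p; numpy assumed):

    import numpy as np

    rng = np.random.default_rng(11)
    p_true, a, b = 0.3, 4.0, 2.0              # hypothetical truth and beta prior
    x = rng.binomial(1, p_true, size=100)     # a sequence of Bernoulli trials
    Y = 0
    for n, xi in enumerate(x, start=1):
        Y += xi
        U = (a + Y) / (a + b + n)             # Bayesian estimate U_n after n trials
    print(U)   # posterior mean; the posterior is beta(a + Y, b + (n - Y))

As n grows, the prior's weight (a + b)/(a + b + n) vanishes and U_n tracks the sample mean.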

Another Bernoulli Distribution


Bayesian estimation, like other forms of parametric estimation, depends critically on the parameter space. Suppose again that
(X_1, X_2, …) is a sequence of Bernoulli trials, given the unknown success parameter p, but suppose now that the parameter space
is {1/2, 1}. This setup corresponds to the tossing of a coin that is either fair or two-headed, but we don't know which. We model p
with a random variable P that has the prior probability density function h given by h(1) = a, h(1/2) = 1 − a, where a ∈ (0, 1) is
chosen to reflect our prior knowledge of the probability that the coin is two-headed. If we are completely ignorant, we might let
a = 1/2 (the non-informative prior). If we think the coin is more likely to be two-headed, we might let a = 2/3. Again let
Y_n = ∑_{i=1}^n X_i for n ∈ N_+.

The posterior distribution of P given Xn = (X1, X2, …, Xn) is
1. h(1 ∣ Xn) = 2^n a / [2^n a + (1 − a)] if Yn = n, and h(1 ∣ Xn) = 0 if Yn < n
2. h(1/2 ∣ Xn) = (1 − a) / [2^n a + (1 − a)] if Yn = n, and h(1/2 ∣ Xn) = 1 if Yn < n

Proof

Now let
pn = [2^(n+1) a + (1 − a)] / [2^(n+1) a + 2(1 − a)]    (7.4.23)

The Bayes' estimator of p given Xn is the statistic Un defined by
1. Un = pn if Yn = n
2. Un = 1/2 if Yn < n

Proof

If we observe Yn < n then Un gives the correct answer 1/2. This certainly makes sense since we know that we do not have the two-headed coin. On the other hand, if we observe Yn = n then we are not certain which coin we have, and the Bayesian estimate pn is
not even in the parameter space! But note that pn → 1 as n → ∞ exponentially fast. Next let's compute the bias and mean-square
error for a given p ∈ {1/2, 1}.
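A small Python sketch (added here for illustration; the function names and the choice a = 1/2 are ours) shows how quickly the posterior probability of the two-headed coin, and the estimate pn, converge to 1 when every toss is a head:

def posterior_two_headed(n, a):
    # Posterior probability h(1 | Xn) that the coin is two-headed, given Yn = n heads
    return 2**n * a / (2**n * a + (1 - a))

def p_n(n, a):
    # Bayes estimate of p when Yn = n
    return (2**(n + 1) * a + (1 - a)) / (2**(n + 1) * a + 2 * (1 - a))

a = 1/2                                  # the non-informative prior
for n in (1, 5, 10, 20):
    print(n, round(posterior_two_headed(n, a), 6), round(p_n(n, a), 6))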

For n ∈ N+,
1. bias(Un ∣ 1) = pn − 1
2. bias(Un ∣ 1/2) = (1/2)^n (pn − 1/2)

The sequence of estimators (U n : n ∈ N+ ) is asymptotically unbiased.


Proof

If p = 1, the estimator Un is negatively biased; we noted this earlier. If p = 1/2, then Un is positively biased for sufficiently large n
(depending on a).

For n ∈ N+,
1. mse(Un ∣ 1) = (pn − 1)²
2. mse(Un ∣ 1/2) = (1/2)^n (pn − 1/2)²

The sequence of estimators U = (Un : n ∈ N+ ) is mean-square consistent.


Proof

The Geometric Distribution


Suppose that X = (X1, X2, …) is a sequence of independent random variables, each having the geometric distribution on N+
with unknown success parameter p ∈ (0, 1). Recall that these variables can be interpreted as the number of trials between
successive successes in a sequence of Bernoulli trials. Given p, the geometric distribution has probability density function
g(x ∣ p) = p (1 − p)^(x−1),  x ∈ N+    (7.4.29)
Once again for n ∈ N+, let Yn = ∑_{i=1}^n Xi. In this setting, Yn is the trial number of the nth success, and given p, has the negative
binomial distribution with parameters n and p.


Suppose now that we model p with a random variable P having a prior beta distribution with left parameter a ∈ (0, ∞) and right
parameter b ∈ (0, ∞). As usual, a and b are chosen to reflect our prior knowledge of p.

The posterior distribution of P given Xn = (X1, X2, …, Xn) is beta with left parameter a + n and right parameter
b + (Yn − n).

Proof

Thus, the beta distribution is conjugate to the geometric distribution. Moreover, note that in the posterior beta distribution, the left
parameter is increased by the number of successes n while the right parameter is increased by the number of failures Yn − n, just as
in the Bernoulli model. In particular, the posterior left parameter is deterministic and depends on the data only through the sample
size n.

The Bayesian estimator of p based on Xn is
Vn = (a + n) / (a + b + Yn)    (7.4.32)

Proof

Recall that the method of moments estimator of p, and the maximum likelihood estimator of p on the interval (0, 1), are both
Wn = 1/Mn = n/Yn. To see the connection between the estimators, note from (19) that
1/Vn = [a/(a + n)] · [(a + b)/a] + [n/(a + n)] · (1/Wn)    (7.4.33)
So 1/Vn (the reciprocal of the Bayesian estimator) is a weighted average of (a + b)/a (the reciprocal of the mean of the prior
distribution) and 1/Wn (the reciprocal of the maximum likelihood estimator).
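The geometric case follows the same computational pattern; a minimal Python sketch (an added illustration with arbitrary parameter choices, not from the text):

import numpy as np

rng = np.random.default_rng(seed=2)
p_true, a, b, n = 0.4, 1.0, 1.0, 30     # illustrative choices

x = rng.geometric(p_true, size=n)       # geometric sample on {1, 2, ...}
y = x.sum()                             # Yn, the trial number of the nth success

# Posterior is beta with left parameter a + n and right parameter b + (Yn - n)
v_n = (a + n) / (a + b + y)             # Bayesian estimator Vn
w_n = n / y                             # maximum likelihood estimator Wn
print(f"Vn = {v_n:.4f}, Wn = {w_n:.4f}")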

The Poisson Distribution
Suppose that X = (X1, X2, …) is a sequence of random variables, each having the Poisson distribution with unknown parameter
λ ∈ (0, ∞). Recall that the Poisson distribution is often used to model the number of “random points” in a region of time or space
and is studied in more detail in the chapter on the Poisson Process. The distribution is named for the inimitable Simeon Poisson and,
given λ, has probability density function
g(x ∣ λ) = e^(−λ) λ^x / x!,  x ∈ N    (7.4.34)
Once again, for n ∈ N+, let Yn = ∑_{i=1}^n Xi. Given λ, the random variable Yn also has a Poisson distribution, but with parameter nλ.
n

Suppose now that we model λ with a random variable Λ having a prior gamma distribution with shape parameter k ∈ (0, ∞) and
rate parameter r ∈ (0, ∞). As usual, k and r are chosen to reflect our prior knowledge of λ. Thus the prior probability density
function of Λ is
h(λ) = [r^k / Γ(k)] λ^(k−1) e^(−rλ),  λ ∈ (0, ∞)    (7.4.35)
and the mean is k/r. The scale parameter of the gamma distribution is b = 1/r, but the formulas will work out nicer if we use the
rate parameter.

The posterior distribution of Λ given Xn = (X1 , X2 , … , Xn ) is gamma with shape parameter k + Yn and rate parameter
r+n.

Proof

It follows that the gamma distribution is conjugate to the Poisson distribution. Note that the posterior rate parameter is
deterministic and depends on the data only through the sample size n .

The Bayesian estimator of λ based on Xn = (X1, X2, …, Xn) is
Vn = (k + Yn) / (r + n)    (7.4.39)

Proof

Since Vn is a linear function of Yn, and we know the distribution of Yn given λ ∈ (0, ∞), we can compute the bias and mean-square error functions.

For n ∈ N+,
bias(Vn ∣ λ) = (k − rλ) / (r + n),  λ ∈ (0, ∞)    (7.4.40)
The sequence of estimators V = (Vn : n ∈ N+) is asymptotically unbiased.


Proof

Note that, as before, we cannot choose k and r to make Vn unbiased without knowledge of λ.

For n ∈ N+,
mse(Vn ∣ λ) = [nλ + (k − rλ)²] / (r + n)²,  λ ∈ (0, ∞)    (7.4.42)
The sequence of estimators V = (Vn : n ∈ N+) is mean-square consistent.


Proof

Recall that the method of moments estimator of λ and the maximum likelihood estimator of λ on the interval (0, ∞) are both
Mn = Yn/n, the sample mean. This estimator is unbiased and has mean square error λ/n. To see the connection between the
estimators, note from (21) that
Vn = [r/(r + n)] · (k/r) + [n/(r + n)] Mn    (7.4.44)
So Vn is a weighted average of k/r (the mean of the prior distribution) and Mn (the maximum likelihood estimator).
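The gamma-Poisson update is equally simple to compute; a minimal Python sketch (an added illustration; the values of λ, k, r, and n are arbitrary choices):

import numpy as np

rng = np.random.default_rng(seed=3)
lam_true, k, r, n = 2.5, 2.0, 1.0, 50   # illustrative choices

y = rng.poisson(lam_true, size=n).sum() # Yn, the sum of the Poisson scores

# Posterior is gamma with shape k + Yn and rate r + n; Vn is the posterior mean
v_n = (k + y) / (r + n)
m_n = y / n                             # maximum likelihood estimator Mn
print(f"Vn = {v_n:.4f}, Mn = {m_n:.4f}")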

The Normal Distribution


Suppose that X = (X1, X2, …) is a sequence of independent random variables, each having the normal distribution with
unknown mean μ ∈ R but known variance σ² ∈ (0, ∞). Of course, the normal distribution plays an especially important role in
statistics, in part because of the central limit theorem. The normal distribution is widely used to model physical quantities subject to
numerous small, random errors. In many statistical applications, the variance of the normal distribution is more stable than the
mean, so the assumption that the variance is known is not entirely artificial. Recall that the normal probability density function
(given μ) is
g(x ∣ μ) = [1 / (√(2π) σ)] exp[−(1/2) ((x − μ)/σ)²],  x ∈ R    (7.4.45)
Again, for n ∈ N+ let Yn = ∑_{i=1}^n Xi. Recall that Yn also has a normal distribution (given μ) but with mean nμ and variance nσ².

Suppose now that μ is modeled by a random variable Ψ that has a prior normal distribution with mean a ∈ R and variance
b² ∈ (0, ∞). As usual, a and b are chosen to reflect our prior knowledge of μ. An interesting special case is when we take b = σ,
so that the variance of the prior distribution of Ψ is the same as the variance of the underlying sampling distribution.

For n ∈ N+, the posterior distribution of Ψ given Xn = (X1, X2, …, Xn) is normal with mean and variance given by
E(Ψ ∣ Xn) = (Yn b² + a σ²) / (n b² + σ²)    (7.4.46)
var(Ψ ∣ Xn) = σ² b² / (n b² + σ²)    (7.4.47)

Proof

Therefore, the normal distribution is conjugate to the normal distribution with unknown mean and known variance. Note that the
posterior variance is deterministic, and depends on the data only through the sample size n. In the special case that b = σ, the
posterior distribution of Ψ given Xn is normal with mean (Yn + a)/(n + 1) and variance σ²/(n + 1).

The Bayesian estimator of μ is
Un = (Yn b² + a σ²) / (n b² + σ²)    (7.4.55)

Proof

Note that Un = (Yn + a)/(n + 1) in the special case that b = σ.

For n ∈ N+,
bias(Un ∣ μ) = σ²(a − μ) / (σ² + n b²),  μ ∈ R    (7.4.56)
The sequence of estimators U = (Un : n ∈ N+) is asymptotically unbiased.


Proof

When b = σ, bias(Un ∣ μ) = (a − μ)/(n + 1).

For n ∈ N+,
mse(Un ∣ μ) = [n σ² b⁴ + σ⁴ (a − μ)²] / (σ² + n b²)²,  μ ∈ R    (7.4.58)
The sequence of estimators U = (Un : n ∈ N+) is mean-square consistent.

Proof

When b = σ, mse(Un ∣ μ) = [n σ² + (a − μ)²] / (n + 1)². Recall that the method of moments estimator of μ and the maximum
likelihood estimator of μ on R are both Mn = Yn/n, the sample mean. This estimator is unbiased and has mean square error
var(Mn) = σ²/n. To see the connection between the estimators, note from (25) that
Un = [σ² / (n b² + σ²)] a + [n b² / (n b² + σ²)] Mn    (7.4.60)
So Un is a weighted average of a (the mean of the prior distribution) and Mn (the maximum likelihood estimator).
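A minimal Python sketch of the normal-normal update (an added illustration; the values of μ, σ, a, b, and n are arbitrary, with prior normal(a, b²)):

import numpy as np

rng = np.random.default_rng(seed=4)
mu_true, sigma, a, b, n = 1.0, 2.0, 0.0, 2.0, 25   # illustrative choices

y = rng.normal(mu_true, sigma, size=n).sum()       # Yn

post_mean = (y * b**2 + a * sigma**2) / (n * b**2 + sigma**2)   # Un
post_var = sigma**2 * b**2 / (n * b**2 + sigma**2)
print(f"Un = {post_mean:.4f}, posterior sd = {post_var**0.5:.4f}")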
n n

The Beta Distribution


Suppose that X = (X1, X2, …) is a sequence of independent random variables, each having the beta distribution with unknown
left shape parameter a ∈ (0, ∞) and right shape parameter b = 1. The beta distribution is widely used to model random
proportions and probabilities and other variables that take values in bounded intervals (scaled to take values in (0, 1)). Recall that
the probability density function (given a) is
g(x ∣ a) = a x^(a−1),  x ∈ (0, 1)    (7.4.61)

Suppose now that a is modeled by a random variable A that has a prior gamma distribution with shape parameter k ∈ (0, ∞) and
rate parameter r ∈ (0, ∞). As usual, k and r are chosen to reflect our prior knowledge of a. Thus the prior probability density
function of A is
h(a) = [r^k / Γ(k)] a^(k−1) e^(−ra),  a ∈ (0, ∞)    (7.4.62)
The mean of the prior distribution is k/r.

The posterior distribution of A given Xn = (X1, X2, …, Xn) is gamma, with shape parameter k + n and rate parameter
r − ln(X1 X2 ⋯ Xn).

Proof

Thus, the gamma distribution is conjugate to the beta distribution with unknown left parameter and right parameter 1. Note that the
posterior shape parameter is deterministic and depends on the data only through the sample size n .

The Bayesian estimator of a based on Xn is
Un = (k + n) / [r − ln(X1 X2 ⋯ Xn)]    (7.4.65)

Proof

Given the complicated structure, the bias and mean square error of Un given a ∈ (0, ∞) would be difficult to compute explicitly.
Recall that the maximum likelihood estimator of a is Wn = −n / ln(X1 X2 ⋯ Xn). To see the connection between the estimators,
note from (29) that
1/Un = [k/(k + n)] · (r/k) + [n/(k + n)] · (1/Wn)    (7.4.66)
So 1/Un (the reciprocal of the Bayesian estimator) is a weighted average of r/k (the reciprocal of the mean of the prior
distribution) and 1/Wn (the reciprocal of the maximum likelihood estimator).
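A minimal Python sketch of this gamma update for the beta(a, 1) model (an added illustration; parameter values are arbitrary):

import numpy as np

rng = np.random.default_rng(seed=5)
a_true, k, r, n = 3.0, 2.0, 1.0, 40     # illustrative choices

x = rng.beta(a_true, 1.0, size=n)       # sample from the beta(a, 1) distribution
log_prod = np.log(x).sum()              # ln(X1 X2 ... Xn), negative since each Xi < 1

u_n = (k + n) / (r - log_prod)          # Bayesian estimator Un
w_n = -n / log_prod                     # maximum likelihood estimator Wn
print(f"Un = {u_n:.4f}, Wn = {w_n:.4f}")

The Pareto model below works the same way, with the sign of ln(X1 X2 ⋯ Xn) reversed in the posterior rate.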

The Pareto Distribution


Suppose that X = (X1, X2, …) is a sequence of independent random variables, each having the Pareto distribution with unknown
shape parameter a ∈ (0, ∞) and scale parameter b = 1. The Pareto distribution is used to model certain financial variables and
other variables with heavy-tailed distributions, and is named for Vilfredo Pareto. Recall that the probability density function (given
a) is
g(x ∣ a) = a / x^(a+1),  x ∈ [1, ∞)    (7.4.67)

Suppose now that a is modeled by a random variable A that has a prior gamma distribution with shape parameter k ∈ (0, ∞) and
rate parameter r ∈ (0, ∞). As usual, k and r are chosen to reflect our prior knowledge of a. Thus the prior probability density
function of A is
h(a) = [r^k / Γ(k)] a^(k−1) e^(−ra),  a ∈ (0, ∞)    (7.4.68)

For n ∈ N+, the posterior distribution of A given Xn = (X1, X2, …, Xn) is gamma, with shape parameter k + n and rate
parameter r + ln(X1 X2 ⋯ Xn).

Proof

Thus, the gamma distribution is conjugate to the Pareto distribution with unknown shape parameter. Note that the posterior shape
parameter is deterministic and depends on the data only through the sample size n.

The Bayesian estimator of a based on Xn is
Un = (k + n) / [r + ln(X1 X2 ⋯ Xn)]    (7.4.71)

Proof

Given the complicated structure, the bias and mean square error of Un given a ∈ (0, ∞) would be difficult to compute explicitly.
Recall that the maximum likelihood estimator of a is Wn = n / ln(X1 X2 ⋯ Xn). To see the connection between the estimators,
note from (31) that
1/Un = [k/(k + n)] · (r/k) + [n/(k + n)] · (1/Wn)    (7.4.72)
So 1/Un (the reciprocal of the Bayesian estimator) is a weighted average of r/k (the reciprocal of the mean of the prior
distribution) and 1/Wn (the reciprocal of the maximum likelihood estimator).
n

This page titled 7.4: Bayesian Estimation is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

7.5: Best Unbiased Estimators

Basic Theory
Consider again the basic statistical model, in which we have a random experiment that results in an observable random variable X
taking values in a set S . Once again, the experiment is typically to sample n objects from a population and record one or more
measurements for each item. In this case, the observable random variable has the form
X = (X1 , X2 , … , Xn ) (7.5.1)

where Xi is the vector of measurements for the ith item.
Suppose that θ is a real parameter of the distribution of X, taking values in a parameter space Θ. Let fθ denote the probability
density function of X for θ ∈ Θ. Note that the expected value, variance, and covariance operators also depend on θ, although we
will sometimes suppress this to keep the notation from becoming too unwieldy.

Definitions
Suppose now that λ = λ(θ) is a parameter of interest that is derived from θ . (Of course, λ might be θ itself, but more generally
might be a function of θ .) In this section we will consider the general problem of finding the best estimator of λ among a given
class of unbiased estimators. Recall that if U is an unbiased estimator of λ, then varθ(U) is the mean square error. Mean square
error is our measure of the quality of unbiased estimators, so the following definitions are natural.

Suppose that U and V are unbiased estimators of λ.
1. If varθ(U) ≤ varθ(V) for all θ ∈ Θ then U is a uniformly better estimator than V.
2. If U is uniformly better than every other unbiased estimator of λ, then U is a Uniformly Minimum Variance Unbiased
Estimator (UMVUE) of λ.

Given unbiased estimators U and V of λ , it may be the case that U has smaller variance for some values of θ while V has smaller
variance for other values of θ , so that neither estimator is uniformly better than the other. Of course, a minimum variance unbiased
estimator is the best we can hope for.

The Cramér-Rao Lower Bound


We will show that under mild conditions, there is a lower bound on the variance of any unbiased estimator of the parameter λ.
Thus, if we can find an estimator that achieves this lower bound for all θ, then the estimator must be an UMVUE of λ. The
derivative of the log likelihood function, sometimes called the score, will play a critical role in our analysis. A lesser, but still
important role, is played by the negative of the second derivative of the log-likelihood function. Life will be much easier if we give
these functions names.

For x ∈ S and θ ∈ Θ, define
L1(x, θ) = (d/dθ) ln(fθ(x))    (7.5.2)
L2(x, θ) = −(d/dθ) L1(x, θ) = −(d²/dθ²) ln(fθ(x))    (7.5.3)

In the rest of this subsection, we consider statistics h(X) where h : S → R (and so in particular, h does not depend on θ). We
need a fundamental assumption:
We will consider only statistics h(X) with Eθ(h²(X)) < ∞ for θ ∈ Θ. We also assume that
(d/dθ) Eθ(h(X)) = Eθ(h(X) L1(X, θ))    (7.5.4)
This is equivalent to the assumption that the derivative operator d/dθ can be interchanged with the expected value operator Eθ.

Proof

Generally speaking, the fundamental assumption will be satisfied if fθ(x) is differentiable as a function of θ, with a derivative that
is jointly continuous in x and θ, and if the support set {x ∈ S : fθ(x) > 0} does not depend on θ.

Eθ (L1 (X, θ)) = 0 for θ ∈ Θ .


Proof

If h(X) is a statistic then
covθ(h(X), L1(X, θ)) = (d/dθ) Eθ(h(X))    (7.5.8)

Proof

varθ(L1(X, θ)) = Eθ(L1²(X, θ))

Proof

The following theorem gives the general Cramér-Rao lower bound on the variance of a statistic. The lower bound is named for
Harald Cramér and C. R. Rao:

If h(X) is a statistic then
varθ(h(X)) ≥ [(d/dθ) Eθ(h(X))]² / Eθ(L1²(X, θ))    (7.5.9)

Proof

We can now give the first version of the Cramér-Rao lower bound for unbiased estimators of a parameter.

Suppose now that λ(θ) is a parameter of interest and h(X) is an unbiased estimator of λ. Then
varθ(h(X)) ≥ (dλ/dθ)² / Eθ(L1²(X, θ))    (7.5.11)

Proof

An estimator of λ that achieves the Cramér-Rao lower bound must be a uniformly minimum variance unbiased estimator
(UMVUE) of λ .

Equality holds in the previous theorem, and hence h(X) is an UMVUE, if and only if there exists a function u(θ) such that
(with probability 1)

h(X) = λ(θ) + u(θ)L1 (X, θ) (7.5.12)

Proof

The quantity Eθ(L1²(X, θ)) that occurs in the denominator of the lower bounds in the previous two theorems is called the Fisher
information number of X, named after Sir Ronald Fisher. The following theorem gives an alternate version of the Fisher
information number that is usually computationally better.

If the appropriate derivatives exist and if the appropriate interchanges are permissible then
Eθ(L1²(X, θ)) = Eθ(L2(X, θ))    (7.5.13)

The following theorem gives the second version of the Cramér-Rao lower bound for unbiased estimators of a parameter.

If λ(θ) is a parameter of interest and h(X) is an unbiased estimator of λ then
varθ(h(X)) ≥ (dλ/dθ)² / Eθ(L2(X, θ))    (7.5.14)

Proof

Random Samples
Suppose now that X = (X1, X2, …, Xn) is a random sample of size n from the distribution of a random variable X having
probability density function gθ and taking values in a set R. Thus S = R^n. We will use lower-case letters for the derivative of the
log likelihood function of X and the negative of the second derivative of the log likelihood function of X.

For x ∈ R and θ ∈ Θ define
l(x, θ) = (d/dθ) ln(gθ(x))    (7.5.15)
l2(x, θ) = −(d²/dθ²) ln(gθ(x))    (7.5.16)

L1 can be written in terms of l and L2 can be written in terms of l2:
1. Eθ(L1²(X, θ)) = n Eθ(l²(X, θ))
2. Eθ(L2(X, θ)) = n Eθ(l2(X, θ))

The following theorem gives the second version of the general Cramér-Rao lower bound on the variance of a statistic, specialized
for random samples.
If h(X) is a statistic then
varθ(h(X)) ≥ [(d/dθ) Eθ(h(X))]² / [n Eθ(l²(X, θ))]    (7.5.17)

The following theorem gives the third version of the Cramér-Rao lower bound for unbiased estimators of a parameter, specialized
for random samples.
Suppose now that λ(θ) is a parameter of interest and h(X) is an unbiased estimator of λ. Then
varθ(h(X)) ≥ (dλ/dθ)² / [n Eθ(l²(X, θ))]    (7.5.18)

Note that the Cramér-Rao lower bound varies inversely with the sample size n. The following gives the fourth version of
the Cramér-Rao lower bound for unbiased estimators of a parameter, again specialized for random samples.
If the appropriate derivatives exist and the appropriate interchanges are permissible then
varθ(h(X)) ≥ (dλ/dθ)² / [n Eθ(l2(X, θ))]    (7.5.19)

To summarize, we have four versions of the Cramér-Rao lower bound for the variance of an unbiased estimate of λ: version 1 and
version 2 in the general case, and version 1 and version 2 in the special case that X is a random sample from the distribution of X.
If an unbiased estimator of λ achieves the lower bound, then the estimator is an UMVUE.

Examples and Special Cases


We will apply the results above to several parametric families of distributions. First we need to recall some standard notation.
Suppose that X = (X1, X2, …, Xn) is a random sample of size n from the distribution of a real-valued random variable X with
mean μ and variance σ². The sample mean is
M = (1/n) ∑_{i=1}^n Xi    (7.5.20)

Recall that E(M) = μ and var(M) = σ²/n. The special version of the sample variance, when μ is known, and the standard version
of the sample variance are, respectively,
W² = (1/n) ∑_{i=1}^n (Xi − μ)²    (7.5.21)
S² = [1/(n − 1)] ∑_{i=1}^n (Xi − M)²    (7.5.22)

The Bernoulli Distribution


Suppose that X = (X1, X2, …, Xn) is a random sample of size n from the Bernoulli distribution with unknown success
parameter p ∈ (0, 1). In the usual language of reliability, Xi = 1 means success on trial i and Xi = 0 means failure on trial i; the
distribution is named for Jacob Bernoulli. Recall that the Bernoulli distribution has probability density function
gp(x) = p^x (1 − p)^(1−x),  x ∈ {0, 1}    (7.5.23)

The basic assumption is satisfied. Moreover, recall that the mean of the Bernoulli distribution is p, while the variance is p(1 − p) .

p(1 − p)/n is the Cramér-Rao lower bound for the variance of unbiased estimators of p.

The sample mean M (which is the proportion of successes) attains the lower bound in the previous exercise and hence is an
UMVUE of p.
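A quick numerical check of this fact (an added illustration; the values of p, n, and the replication count are arbitrary) in Python:

import numpy as np

rng = np.random.default_rng(seed=6)
p, n, reps = 0.3, 20, 100_000           # illustrative choices

m = rng.binomial(n, p, size=reps) / n   # sample means from many replications
print(f"empirical var(M) = {m.var():.6f}")
print(f"Cramer-Rao bound = {p * (1 - p) / n:.6f}")

The two printed values should agree closely, since M attains the bound exactly.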

The Poisson Distribution


Suppose that X = (X1, X2, …, Xn) is a random sample of size n from the Poisson distribution with parameter θ ∈ (0, ∞).
Recall that this distribution is often used to model the number of “random points” in a region of time or space and is studied in
more detail in the chapter on the Poisson Process. The Poisson distribution is named for Simeon Poisson and has probability
density function
gθ(x) = e^(−θ) θ^x / x!,  x ∈ N    (7.5.24)

The basic assumption is satisfied. Recall also that the mean and variance of the distribution are both θ .

θ/n is the Cramér-Rao lower bound for the variance of unbiased estimators of θ .

The sample mean M attains the lower bound in the previous exercise and hence is an UMVUE of θ .

The Normal Distribution


Suppose that X = (X1, X2, …, Xn) is a random sample of size n from the normal distribution with mean μ ∈ R and variance
σ² ∈ (0, ∞). Recall that the normal distribution plays an especially important role in statistics, in part because of the central limit
theorem. The normal distribution is widely used to model physical quantities subject to numerous small, random errors, and has
probability density function
g_{μ,σ²}(x) = [1 / (√(2π) σ)] exp[−(1/2) ((x − μ)/σ)²],  x ∈ R    (7.5.25)
The basic assumption is satisfied with respect to both of these parameters. Recall also that the fourth central moment is
E((X − μ)⁴) = 3σ⁴.

σ²/n is the Cramér-Rao lower bound for the variance of unbiased estimators of μ.

The sample mean M attains the lower bound in the previous exercise and hence is an UMVUE of μ .

2σ⁴/n is the Cramér-Rao lower bound for the variance of unbiased estimators of σ².

The sample variance S² has variance 2σ⁴/(n − 1) and hence does not attain the lower bound in the previous exercise.

If μ is known, then the special sample variance W² attains the lower bound above and hence is an UMVUE of σ².
If μ is unknown, no unbiased estimator of σ² attains the Cramér-Rao lower bound above.

Proof

The Gamma Distribution


Suppose that X = (X1, X2, …, Xn) is a random sample of size n from the gamma distribution with known shape parameter
k > 0 and unknown scale parameter b > 0. The gamma distribution is often used to model random times and certain other types of
positive random variables, and is studied in more detail in the chapter on Special Distributions. The probability density function is
gb(x) = [1 / (Γ(k) b^k)] x^(k−1) e^(−x/b),  x ∈ (0, ∞)    (7.5.26)
The basic assumption is satisfied with respect to b. Moreover, the mean and variance of the gamma distribution are kb and kb²,
respectively.
b²/nk is the Cramér-Rao lower bound for the variance of unbiased estimators of b.
M/k attains the lower bound in the previous exercise and hence is an UMVUE of b.

The Beta Distribution


Suppose that X = (X1, X2, …, Xn) is a random sample of size n from the beta distribution with left parameter a > 0 and right
parameter b = 1. Beta distributions are widely used to model random proportions and other random variables that take values in
bounded intervals, and are studied in more detail in the chapter on Special Distributions. In our specialized case, the probability
density function of the sampling distribution is
ga(x) = a x^(a−1),  x ∈ (0, 1)    (7.5.27)

The basic assumption is satisfied with respect to a .

The mean and variance of the distribution are
1. μ = a/(a + 1)
2. σ² = a/[(a + 1)²(a + 2)]
The Cramér-Rao lower bound for the variance of unbiased estimators of μ is a²/[n(a + 1)⁴].

The sample mean M does not achieve the Cramér-Rao lower bound in the previous exercise, and hence is not an UMVUE of
μ.

The Uniform Distribution


Suppose that X = (X1, X2, …, Xn) is a random sample of size n from the uniform distribution on [0, a] where a > 0 is the
unknown parameter. Thus, the probability density function of the sampling distribution is
ga(x) = 1/a,  x ∈ [0, a]    (7.5.28)
The basic assumption is not satisfied.
The Cramér-Rao lower bound for the variance of unbiased estimators of a is a²/n. Of course, the Cramér-Rao Theorem does not
apply, by the previous exercise.

Recall that V = [(n + 1)/n] max{X1, X2, …, Xn} is unbiased and has variance a²/[n(n + 2)]. This variance is smaller than the Cramér-Rao bound in the previous exercise.

The reason that the basic assumption is not satisfied is that the support set {x ∈ R : g a (x) > 0} depends on the parameter a .

Best Linear Unbiased Estimators


We now consider a somewhat specialized problem, but one that fits the general theme of this section. Suppose that
X = (X1, X2, …, Xn) is a sequence of observable real-valued random variables that are uncorrelated and have the same
unknown mean μ ∈ R, but possibly different standard deviations. Let σ = (σ1, σ2, …, σn) where σi = sd(Xi) for
i ∈ {1, 2, …, n}.
We will consider estimators of μ that are linear functions of the outcome variables. Specifically, we will consider estimators of the
following form, where the vector of coefficients c = (c1, c2, …, cn) is to be determined:

Y = ∑_{i=1}^n ci Xi    (7.5.29)
Y is unbiased if and only if ∑_{i=1}^n ci = 1.

The variance of Y is
var(Y) = ∑_{i=1}^n ci² σi²    (7.5.30)
The variance is minimized, subject to the unbiased constraint, when
cj = (1/σj²) / (∑_{i=1}^n 1/σi²),  j ∈ {1, 2, …, n}    (7.5.31)

Proof

This exercise shows how to construct the Best Linear Unbiased Estimator (BLUE) of μ , assuming that the vector of standard
deviations σ is known.
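A minimal Python sketch of the construction (an added illustration; the standard deviations are hypothetical values chosen for the example):

import numpy as np

sigma = np.array([1.0, 2.0, 0.5, 1.5])  # hypothetical standard deviations, n = 4

# Optimal weights are proportional to 1/sigma_j^2, normalized to sum to 1
c = (1 / sigma**2) / np.sum(1 / sigma**2)
print("weights:", np.round(c, 4))

# The minimized variance sum(c_i^2 sigma_i^2) equals 1 / sum(1/sigma_i^2)
print(np.sum(c**2 * sigma**2), 1 / np.sum(1 / sigma**2))

Note that the least variable measurement (smallest σi) receives the largest weight, as intuition suggests.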
Suppose now that σi = σ for i ∈ {1, 2, …, n} so that the outcome variables have the same standard deviation. In particular, this
would be the case if the outcome variables form a random sample of size n from a distribution with mean μ and standard deviation
σ.

In this case the variance is minimized when c i = 1/n for each i and hence Y =M , the sample mean.

This exercise shows that the sample mean M is the best linear unbiased estimator of μ when the standard deviations are the same,
and that moreover, we do not need to know the value of the standard deviation.

This page titled 7.5: Best Unbiased Estimators is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

7.6: Sufficient, Complete and Ancillary Statistics

Basic Theory
The Basic Statistical Model
Consider again the basic statistical model, in which we have a random experiment with an observable random variable X taking
values in a set S . Once again, the experiment is typically to sample n objects from a population and record one or more
measurements for each item. In this case, the outcome variable has the form

X = (X1 , X2 , … , Xn ) (7.6.1)

where Xi is the vector of measurements for the ith item. In general, we suppose that the distribution of X depends on a parameter
θ taking values in a parameter space T. The parameter θ may also be vector-valued. We will sometimes use subscripts in
probability density functions, expected values, etc. to denote the dependence on θ.


As usual, the most important special case is when X is a sequence of independent, identically distributed random variables. In this
case X is a random sample from the common distribution.

Sufficient Statistics
Let U = u(X) be a statistic taking values in a set R . Intuitively, U is sufficient for θ if U contains all of the information about θ
that is available in the entire data variable X. Here is the formal definition:

A statistic U is sufficient for θ if the conditional distribution of X given U does not depend on θ ∈ T .

Sufficiency is related to the concept of data reduction. Suppose that X takes values in R^n. If we can find a sufficient statistic U
that takes values in R^j, then we can reduce the original data vector X (whose dimension n is usually large) to the vector of
statistics U (whose dimension j is usually much smaller) with no loss of information about the parameter θ.
The following result gives a condition for sufficiency that is equivalent to this definition.

Let U = u(X) be a statistic taking values in R, and let fθ and hθ denote the probability density functions of X and U
respectively. Then U is sufficient for θ if and only if the function on S given below does not depend on θ ∈ T:
x ↦ fθ(x) / hθ[u(x)]    (7.6.2)

Proof

The definition precisely captures the intuitive notion of sufficiency given above, but can be difficult to apply. We must know in
advance a candidate statistic U , and then we must be able to compute the conditional distribution of X given U . The Fisher-
Neyman factorization theorem given next often allows the identification of a sufficient statistic from the form of the probability
density function of X. It is named for Ronald Fisher and Jerzy Neyman.

Fisher-Neyman Factorization Theorem. Let fθ denote the probability density function of X and suppose that U = u(X) is
a statistic taking values in R. Then U is sufficient for θ if and only if there exist G : R × T → [0, ∞) and r : S → [0, ∞)
such that
fθ(x) = G[u(x), θ] r(x);  x ∈ S, θ ∈ T    (7.6.3)

Proof

Note that r depends only on the data x but not on the parameter θ. Less technically, u(X) is sufficient for θ if the probability
density function fθ(x) depends on the data vector x and the parameter θ only through u(x).
θ

If U and V are equivalent statistics and U is sufficient for θ then V is sufficient for θ .

Minimal Sufficient Statistics
The entire data variable X is trivially sufficient for θ . However, as noted above, there usually exists a statistic U that is sufficient
for θ and has smaller dimension, so that we can achieve real data reduction. Naturally, we would like to find the statistic U that has
the smallest dimension possible. In many cases, this smallest dimension j will be the same as the dimension k of the parameter
vector θ . However, as we will see, this is not necessarily the case; j can be smaller or larger than k . An example based on the
uniform distribution is given in (38).

Suppose that a statistic U is sufficient for θ . Then U is minimally sufficient if U is a function of any other statistic V that is
sufficient for θ .

Once again, the definition precisely captures the notion of minimal sufficiency, but is hard to apply. The following result gives an
equivalent condition.

Let fθ denote the probability density function of X corresponding to the parameter value θ ∈ T and suppose that U = u(X)
is a statistic taking values in R. Then U is minimally sufficient for θ if the following condition holds: for x ∈ S and y ∈ S,
fθ(x)/fθ(y) is independent of θ if and only if u(x) = u(y)    (7.6.4)

Proof

If U and V are equivalent statistics and U is minimally sufficient for θ then V is minimally sufficient for θ .

Properties of Sufficient Statistics


Sufficiency is related to several of the methods of constructing estimators that we have studied.

Suppose that U is sufficient for θ and that there exists a maximum likelihood estimator of θ . Then there exists a maximum
likelihood estimator V that is a function of U .
Proof

In particular, suppose that V is the unique maximum likelihood estimator of θ and that V is sufficient for θ . If U is sufficient for θ
then V is a function of U by the previous theorem. Hence it follows that V is minimally sufficient for θ . Our next result applies to
Bayesian analysis.

Suppose that the statistic U = u(X) is sufficient for the parameter θ and that θ is modeled by a random variable Θ with values
in T . Then the posterior distribution of Θ given X = x ∈ S is a function of u(x).
Proof

Continuing with the setting of Bayesian analysis, suppose that θ is a real-valued parameter. If we use the usual mean-square loss
function, then the Bayesian estimator is V = E(Θ ∣ X) . By the previous result, V is a function of the sufficient statistics U . That
is, E(Θ ∣ X) = E(Θ ∣ U ) .
The next result is the Rao-Blackwell theorem, named for CR Rao and David Blackwell. The theorem shows how a sufficient
statistic can be used to improve an unbiased estimator.

Rao-Blackwell Theorem. Suppose that U is sufficient for θ and that V is an unbiased estimator of a real parameter λ = λ(θ).
Then Eθ(V ∣ U) is also an unbiased estimator of λ and is uniformly better than V.

Proof

Complete Statistics
Suppose that U = u(X) is a statistic taking values in a set R. Then U is a complete statistic for θ if for any function r : R → R,
Eθ[r(U)] = 0 for all θ ∈ T  ⟹  Pθ[r(U) = 0] = 1 for all θ ∈ T    (7.6.9)
To understand this rather strange looking condition, suppose that r(U ) is a statistic constructed from U that is being used as an
estimator of 0 (thought of as a function of θ ). The completeness condition means that the only such unbiased estimator is the
statistic that is 0 with probability 1.

If U and V are equivalent statistics and U is complete for θ then V is complete for θ .

The next result shows the importance of statistics that are both complete and sufficient; it is known as the Lehmann-Scheffé
theorem, named for Erich Lehmann and Henry Scheffé.

Lehmann-Scheffé Theorem. Suppose that U is sufficient and complete for θ and that V = r(U ) is an unbiased estimator of a
real parameter λ = λ(θ) . Then V is a uniformly minimum variance unbiased estimator (UMVUE) of λ .
Proof

Ancillary Statistics
Suppose that V = v(X) is a statistic taking values in a set R . If the distribution of V does not depend on θ , then V is called an
ancillary statistic for θ .

Thus, the notion of an ancillary statistic is complementary to the notion of a sufficient statistic. A sufficient statistic contains all
available information about the parameter; an ancillary statistic contains no information about the parameter. The following result,
known as Basu's Theorem and named for Debabrata Basu, makes this point more precisely.

Basu's Theorem. Suppose that U is complete and sufficient for a parameter θ and that V is an ancillary statistic for θ . Then U
and V are independent.
Proof

If U and V are equivalent statistics and U is ancillary for θ then V is ancillary for θ .

Applications and Special Distributions


In this subsection, we will explore sufficient, complete, and ancillary statistics for a number of special distributions. As always, be
sure to try the problems yourself before looking at the solutions.

The Bernoulli Distribution


Recall that the Bernoulli distribution with parameter p ∈ (0, 1) is a discrete distribution on {0, 1} with probability density function
g defined by
g(x) = p^x (1 − p)^(1−x),  x ∈ {0, 1}    (7.6.10)

Suppose that X = (X1, X2, …, Xn) is a random sample of size n from the Bernoulli distribution with parameter p. Equivalently,
X is a sequence of Bernoulli trials, so that in the usual language of reliability, Xi = 1 if trial i is a success, and Xi = 0 if trial i is
a failure. The Bernoulli distribution is named for Jacob Bernoulli and is studied in more detail in the chapter on Bernoulli Trials.
Let Y = ∑_{i=1}^n Xi denote the number of successes. Recall that Y has the binomial distribution with parameters n and p, and has
probability density function h defined by
h(y) = C(n, y) p^y (1 − p)^(n−y),  y ∈ {0, 1, …, n}    (7.6.11)
where C(n, y) denotes the binomial coefficient.

Y is sufficient for p. Specifically, for y ∈ {0, 1, …, n}, the conditional distribution of X given Y = y is uniform on the set of
points
Dy = {(x1, x2, …, xn) ∈ {0, 1}^n : x1 + x2 + ⋯ + xn = y}    (7.6.12)

Proof

This result is intuitively appealing: in a sequence of Bernoulli trials, all of the information about the probability of success p is
contained in the number of successes Y. The particular order of the successes and failures provides no additional information. Of
course, the sufficiency of Y follows more easily from the factorization theorem (3), but the conditional distribution provides
additional insight.

Y is complete for p on the parameter space (0, 1).


Proof

The proof of the last result actually shows that if the parameter space is any subset of (0, 1) containing an interval of positive
length, then Y is complete for p. But the notion of completeness depends very much on the parameter space. The following result
considers the case where p has a finite set of values.

Suppose that the parameter space T ⊂ (0, 1) is a finite set with k ∈ N + elements. If the sample size n is at least k , then Y is
not complete for p.
Proof

The sample mean M = Y /n (the sample proportion of successes) is clearly equivalent to Y (the number of successes), and hence
is also sufficient for p and is complete for p ∈ (0, 1). Recall that the sample mean M is the method of moments estimator of p, and
is the maximum likelihood estimator of p on the parameter space (0, 1).
In Bayesian analysis, the usual approach is to model p with a random variable P that has a prior beta distribution with left
parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞). Then the posterior distribution of P given X is beta with left parameter
a + Y and right parameter b + (n − Y ) . The posterior distribution depends on the data only through the sufficient statistic Y , as

guaranteed by theorem (9).

The sample variance S² is an UMVUE of the distribution variance p(1 − p) for p ∈ (0, 1), and can be written as
S² = [Y/(n − 1)] (1 − Y/n)    (7.6.17)

Proof

The Poisson Distribution


Recall that the Poisson distribution with parameter θ ∈ (0, ∞) is a discrete distribution on N with probability density function g
defined by
g(x) = e^(−θ) θ^x / x!,  x ∈ N    (7.6.19)

The Poisson distribution is named for Simeon Poisson and is used to model the number of “random points” in region of time or
space, under certain ideal conditions. The parameter θ is proportional to the size of the region, and is both the mean and the
variance of the distribution. The Poisson distribution is studied in more detail in the chapter on Poisson process.
Suppose now that X = (X1, X2, …, Xn) is a random sample of size n from the Poisson distribution with parameter θ. Recall
that the sum of the scores Y = ∑_{i=1}^n Xi also has the Poisson distribution, but with parameter nθ.

The statistic Y is sufficient for θ . Specifically, for y ∈ N , the conditional distribution of X given Y =y is the multinomial
distribution with y trials, n trial values, and uniform trial probabilities.
Proof

As before, it's easier to use the factorization theorem to prove the sufficiency of Y , but the conditional distribution gives some
additional insight.

Y is complete for θ ∈ (0, ∞) .


Proof

As with our discussion of Bernoulli trials, the sample mean M = Y /n is clearly equivalent to Y and hence is also sufficient for θ
and complete for θ ∈ (0, ∞) . Recall that M is the method of moments estimator of θ and is the maximum likelihood estimator on
the parameter space (0, ∞).

An UMVUE of the parameter P(X = 0) = e^(−θ) for θ ∈ (0, ∞) is
U = [(n − 1)/n]^Y    (7.6.23)

Proof
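A quick simulation check of the unbiasedness of U (an added illustration; θ, n, and the replication count are arbitrary choices) in Python:

import numpy as np

rng = np.random.default_rng(seed=7)
theta, n, reps = 1.5, 10, 200_000       # illustrative choices

y = rng.poisson(n * theta, size=reps)   # Y has the Poisson distribution with parameter n*theta
u = ((n - 1) / n) ** y                  # U = ((n-1)/n)^Y
print(f"mean of U:   {u.mean():.5f}")
print(f"exp(-theta): {np.exp(-theta):.5f}")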

The Normal Distribution


Recall that the normal distribution with mean μ ∈ R and variance σ² ∈ (0, ∞) is a continuous distribution on R with probability
density function g defined by
g(x) = [1 / (√(2π) σ)] exp[−(1/2) ((x − μ)/σ)²],  x ∈ R    (7.6.26)

The normal distribution is often used to model physical quantities subject to small, random errors, and is studied in more detail in
the chapter on Special Distributions. Because of the central limit theorem, the normal distribution is perhaps the most important
distribution in statistics.

Suppose that X = (X1, X2, …, Xn) is a random sample from the normal distribution with mean μ and variance σ². Then
each of the following pairs of statistics is minimally sufficient for (μ, σ²):
1. (Y, V) where Y = ∑_{i=1}^n Xi and V = ∑_{i=1}^n Xi².
2. (M, S²) where M = (1/n) ∑_{i=1}^n Xi is the sample mean and S² = [1/(n − 1)] ∑_{i=1}^n (Xi − M)² is the sample variance.
3. (M, T²) where T² = (1/n) ∑_{i=1}^n (Xi − M)² is the biased sample variance.

Proof

Recall that M and T² are the method of moments estimators of μ and σ², respectively, and are also the maximum likelihood
estimators on the parameter space R × (0, ∞).

Run the normal estimation experiment 1000 times with various values of the parameters. Compare the estimates of the
parameters in terms of bias and mean square error.

Sometimes the variance σ² of the normal distribution is known, but not the mean μ. It's rarely the case that μ is known but not σ².
Nonetheless we can give sufficient statistics in both cases.

Suppose again that X = (X1, X2, …, Xn) is a random sample from the normal distribution with mean μ ∈ R and variance
σ² ∈ (0, ∞).
1. If σ² is known then Y = ∑_{i=1}^n Xi is minimally sufficient for μ.
2. If μ is known then U = ∑_{i=1}^n (Xi − μ)² is sufficient for σ².

Proof

Of course by equivalence, in part (a) the sample mean M = Y/n is minimally sufficient for μ, and in part (b) the special sample
variance W² = U/n is minimally sufficient for σ². Moreover, in part (a), M is complete for μ on the parameter space R and the
sample variance S² is ancillary for μ. (Recall that (n − 1)S²/σ² has the chi-square distribution with n − 1 degrees of freedom.) It
follows from Basu's theorem (15) that the sample mean M and the sample variance S² are independent. We proved this by more
direct means in the section on special properties of normal samples, but the formulation in terms of sufficient and ancillary
statistics gives additional insight.
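The independence of M and S² is easy to probe numerically; a minimal Python sketch (an added illustration with arbitrary parameter choices):

import numpy as np

rng = np.random.default_rng(seed=8)
mu, sigma, n, reps = 0.0, 1.0, 10, 100_000   # illustrative choices

x = rng.normal(mu, sigma, size=(reps, n))
m = x.mean(axis=1)                       # sample means M
s2 = x.var(axis=1, ddof=1)               # sample variances S^2

# Independence implies zero correlation; the empirical value should be near 0
print("corr(M, S^2) =", np.corrcoef(m, s2)[0, 1])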

The Gamma Distribution


Recall that the gamma distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞) is a continuous distribution
on (0, ∞) with probability density function g given by
g(x) = [1 / (Γ(k) b^k)] x^(k−1) e^(−x/b),  x ∈ (0, ∞)    (7.6.29)

The gamma distribution is often used to model random times and certain other types of positive random variables, and is studied in
more detail in the chapter on Special Distributions.

Suppose that X = (X1, X2, …, Xn) is a random sample from the gamma distribution with shape parameter k and scale
parameter b. Each of the following pairs of statistics is minimally sufficient for (k, b):
1. (Y, V) where Y = ∑_{i=1}^n Xi is the sum of the scores and V = ∏_{i=1}^n Xi is the product of the scores.
2. (M, U) where M = Y/n is the sample (arithmetic) mean of X and U = V^(1/n) is the sample geometric mean of X.

Proof

Recall that the method of moments estimators of k and b are M²/T² and T²/M, respectively, where M = (1/n) ∑_{i=1}^n Xi is the
sample mean and T² = (1/n) ∑_{i=1}^n (Xi − M)² is the biased sample variance. If the shape parameter k is known, M/k is both the
method of moments estimator of b and the maximum likelihood estimator on the parameter space (0, ∞). Note that T² is not a
function of the sufficient statistics (Y, V), and hence estimators based on T² suffer from a loss of information.

Run the gamma estimation experiment 1000 times with various values of the parameters and the sample size n . Compare the
estimates of the parameters in terms of bias and mean square error.

The proof of the last theorem actually shows that Y is sufficient for b if k is known, and that V is sufficient for k if b is known.

Suppose again that X = (X1, X2, …, Xn) is a random sample of size n from the gamma distribution with shape parameter
k ∈ (0, ∞) and scale parameter b ∈ (0, ∞). Then Y = ∑_{i=1}^n Xi is complete for b.

Proof

Suppose again that X = (X1, X2, …, Xn) is a random sample from the gamma distribution on (0, ∞) with shape parameter
k ∈ (0, ∞) and scale parameter b ∈ (0, ∞). Let M = (1/n) ∑_{i=1}^n Xi denote the sample mean and U = (X1 X2 ⋯ Xn)^(1/n) the
sample geometric mean, as before. Then
1. M/U is ancillary for b.
2. M and M/U are independent.
Proof

The Beta Distribution


Recall that the beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞) is a continuous distribution on
(0, 1) with probability density function g given by
g(x) = [1 / B(a, b)] x^(a−1) (1 − x)^(b−1),  x ∈ (0, 1)    (7.6.33)
where B is the beta function. The beta distribution is often used to model random proportions and other random variables that take
values in bounded intervals. It is studied in more detail in the chapter on Special Distributions.

Suppose that X = (X1, X2, …, Xn) is a random sample from the beta distribution with left parameter a and right parameter
b. Then (P, Q) is minimally sufficient for (a, b) where P = ∏_{i=1}^n Xi and Q = ∏_{i=1}^n (1 − Xi).

Proof

The proof also shows that P is sufficient for a if b is known, and that Q is sufficient for b if a is known. Recall that the method of
moments estimators of a and b are
U = M(M − M^(2)) / (M^(2) − M²),  V = (1 − M)(M − M^(2)) / (M^(2) − M²)    (7.6.35)
respectively, where M = (1/n) ∑_{i=1}^n Xi is the sample mean and M^(2) = (1/n) ∑_{i=1}^n Xi² is the second order sample mean. If b is
known, the method of moments estimator of a is Ub = bM/(1 − M), while if a is known, the method of moments estimator of b
is Va = a(1 − M)/M. None of these estimators is a function of the sufficient statistics (P, Q) and so all suffer from a loss of
information. On the other hand, if b = 1, the maximum likelihood estimator of a on the interval (0, ∞) is W = −n / ∑_{i=1}^n ln Xi,
which is a function of P (as it must be).

Run the beta estimation experiment 1000 times with various values of the parameters. Compare the estimates of the
parameters.

The Pareto Distribution
Recall that the Pareto distribution with shape parameter a ∈ (0, ∞) and scale parameter b ∈ (0, ∞) is a continuous distribution on
[b, ∞) with probability density function g given by
g(x) = a b^a / x^(a+1),  b ≤ x < ∞    (7.6.36)
The Pareto distribution, named for Vilfredo Pareto, is a heavy-tailed distribution often used to model income and certain other
types of random variables. It is studied in more detail in the chapter on Special Distributions.

Suppose that X = (X1, X2, …, Xn) is a random sample from the Pareto distribution with shape parameter a and scale
parameter b. Then (P, X(1)) is minimally sufficient for (a, b) where P = ∏_{i=1}^n Xi is the product of the sample variables and
where X(1) = min{X1, X2, …, Xn} is the first order statistic.

Proof

The proof also shows that P is sufficient for a if b is known (which is often the case), and that X(1) is sufficient for b if a is known
(much less likely). Recall that the method of moments estimators of a and b are
U = 1 + √[M^(2) / (M^(2) − M²)],  V = (M^(2)/M) (1 − √[(M^(2) − M²) / M^(2)])    (7.6.39)
respectively, where as before M = (1/n) ∑_{i=1}^n Xi is the sample mean and M^(2) = (1/n) ∑_{i=1}^n Xi² is the second order sample mean. These
estimators are not functions of the sufficient statistics and hence suffer from loss of information. On the other hand, the maximum
likelihood estimators of a and b on the interval (0, ∞) are
W = n / (∑_{i=1}^n ln Xi − n ln X(1)),  X(1)    (7.6.40)
respectively. These are functions of the sufficient statistics, as they must be.

Run the Pareto estimation experiment 1000 times with various values of the parameters a and b and the sample size n .
Compare the method of moments estimates of the parameters with the maximum likelihood estimates in terms of the empirical
bias and mean square error.

The Uniform Distribution


Recall that the continuous uniform distribution on the interval [a, a + h], where a ∈ R is the location parameter and h ∈ (0, ∞) is
the scale parameter, has probability density function g given by
g(x) = 1/h,  x ∈ [a, a + h]    (7.6.41)

Continuous uniform distributions are widely used in applications to model a number chosen “at random” from an interval.
Continuous uniform distributions are studied in more detail in the chapter on Special Distributions. Let's first consider the case
where both parameters are unknown.

Suppose that X = (X1, X2, …, Xn) is a random sample from the uniform distribution on the interval [a, a + h]. Then
(X(1), X(n)) is minimally sufficient for (a, h), where X(1) = min{X1, X2, …, Xn} is the first order statistic and
X(n) = max{X1, X2, …, Xn} is the last order statistic.

Proof

If the location parameter a is known, then the largest order statistic is sufficient for the scale parameter h . But if the scale
parameter h is known, we still need both order statistics for the location parameter a . So in this case, we have a single real-valued
parameter, but the minimally sufficient statistic is a pair of real-valued random variables.

Suppose again that X = (X1, X2, …, Xn) is a random sample from the uniform distribution on the interval [a, a + h].
1. If a ∈ R is known, then X(n) is sufficient for h.
2. If h ∈ (0, ∞) is known, then (X(1), X(n)) is minimally sufficient for a.
Proof

Run the uniform estimation experiment 1000 times with various values of the parameter. Compare the estimates of the
parameter.

Recall that if both parameters are unknown, the method of moments estimators of a and h are U = M − √3 T and V = 2√3 T,
respectively, where M = (1/n) ∑_{i=1}^n Xi is the sample mean and T² = (1/n) ∑_{i=1}^n (Xi − M)² is the biased sample variance. If a is
known, the method of moments estimator of h is Va = 2(M − a), while if h is known, the method of moments estimator of a is
Uh = M − h/2. None of these estimators are functions of the minimally sufficient statistics, and hence result in loss of
information.

The Hypergeometric Model


So far, in all of our examples, the basic variables have formed a random sample from a distribution. In this subsection, our basic
variables will be dependent.
Recall that in the hypergeometric model, we have a population of N objects, and that r of the objects are type 1 and the remaining
N − r are type 0. The population size N is a positive integer and the type 1 size r is a nonnegative integer with r ≤ N. Typically
one or both parameters are unknown. We select a random sample of n objects, without replacement, from the population, and let Xi
be the type of the ith object chosen. So our basic sequence of random variables is X = (X1, X2, …, Xn). The variables are
identically distributed indicator variables with P(Xi = 1) = r/N for i ∈ {1, 2, …, n}, but are dependent. Of course, the sample
size n is a positive integer with n ≤ N.
size n is a positive integer with n ≤ N .


n
The variable Y = ∑ X is the number of type 1 objects in the sample. This variable has the hypergeometric distribution with
i=1 i

parameters N , r, and n , and has probability density function h given by


r N −r
( )( ) (y) (n−y)
y n−y n r (N − r)
h(y) = =( ) , y ∈ {max{0, N − n + r}, … , min{n, r}} (7.6.44)
N (n)
( ) y N
n

(Recall the falling power notation x = x(x − 1) ⋯ (x − k + 1) ). The hypergeometric distribution is studied in more detail in
(k)

the chapter on Finite Sampling Models.

Y is sufficient for (N, r). Specifically, for y ∈ {max{0, n − N + r}, …, min{n, r}}, the conditional distribution of X
given Y = y is uniform on the set of points
Dy = {(x1, x2, …, xn) ∈ {0, 1}^n : x1 + x2 + ⋯ + xn = y}    (7.6.45)

Proof

There are clearly strong similarities between the hypergeometric model and the Bernoulli trials model above. Indeed if the
sampling were with replacement, the Bernoulli trials model with p = r/N would apply rather than the hypergeometric model. It's
also interesting to note that we have a single real-valued statistic that is sufficient for two real-valued parameters.
Once again, the sample mean M = Y/n is equivalent to Y and hence is also sufficient for (N, r). Recall that the method of
moments estimator of r with N known is NM, and the method of moments estimator of N with r known is r/M. The estimator of
N is the one that is used in the capture-recapture experiment.

Exponential Families
Suppose now that our data vector X takes values in a set S, and that the distribution of X depends on a parameter vector θ taking
values in a parameter space Θ. The distribution of X is a k-parameter exponential family if S does not depend on θ and if the
probability density function of X can be written as
fθ(x) = α(θ) r(x) exp(∑_{i=1}^k βi(θ) ui(x));  x ∈ S, θ ∈ Θ    (7.6.48)
where α and (β1, β2, …, βk) are real-valued functions on Θ, and where r and (u1, u2, …, uk) are real-valued functions on S.
Moreover, k is assumed to be the smallest such integer. The parameter vector β = (β1(θ), β2(θ), …, βk(θ)) is sometimes called
the natural parameter of the distribution, and the random vector U = (u1(X), u2(X), …, uk(X)) is sometimes called the
natural statistic of the distribution. Although the definition may look intimidating, exponential families are useful because they
have many nice mathematical properties, and because many special parametric families are exponential families. In particular, the
sampling distributions from the Bernoulli, Poisson, gamma, normal, beta, and Pareto models considered above are exponential families.
Exponential families of distributions are studied in more detail in the chapter on special distributions.
Exponential families of distributions are studied in more detail in the chapter on special distributions.

U is minimally sufficient for θ.


Proof

It turns out that U is complete for θ as well, although the proof is more difficult.
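As a quick illustration (a standard computation, added here for concreteness), consider a random sample X = (X1, X2, …, Xn) from the Bernoulli distribution with parameter p ∈ (0, 1). The joint probability density function can be written as
fp(x) = p^y (1 − p)^(n−y) = (1 − p)^n exp(y ln[p/(1 − p)]),  where y = x1 + x2 + ⋯ + xn
so this is a one-parameter exponential family with α(p) = (1 − p)^n, r(x) = 1, natural parameter β(p) = ln[p/(1 − p)], and natural statistic u(x) = y. The natural statistic is just the number of successes Y, consistent with the sufficiency and completeness of Y established above.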

This page titled 7.6: Sufficient, Complete and Ancillary Statistics is shared under a CC BY 2.0 license and was authored, remixed, and/or curated
by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history
is available upon request.

CHAPTER OVERVIEW
8: Set Estimation
Set estimation refers to the process of constructing a subset of the parameter space, based on observed data from a probability
distribution. The subset will contain the true value of the parameter with a specified confidence level. In this chapter, we explore
the basic method of set estimation using pivot variables. We study set estimation in some of the most important models: the single
variable normal model, the two-variable normal model, and the Bernoulli model.
8.1: Introduction to Set Estimation
8.2: Estimation in the Normal Model
8.3: Estimation in the Bernoulli Model
8.4: Estimation in the Two-Sample Normal Model
8.5: Bayesian Set Estimation

This page titled 8: Set Estimation is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

8.1: Introduction to Set Estimation

Basic Theory
The Basic Statistical Model
As usual, our starting point is a random experiment with an underlying sample space and a probability measure P. In the basic
statistical model, we have an observable random variable X taking values in a set S . In general, X can have quite a complicated
structure. For example, if the experiment is to sample n objects from a population and record various measurements of interest,
then

$$X = (X_1, X_2, \ldots, X_n) \quad (8.1.1)$$
where $X_i$ is the vector of measurements for the $i$th object. The most important special case occurs when $(X_1, X_2, \ldots, X_n)$ are independent and identically distributed. In this case, we have a random sample of size $n$ from the common distribution.
Suppose also that the distribution of $X$ depends on a parameter $\theta$ taking values in a parameter space $\Theta$. The parameter may also be vector-valued, in which case $\Theta \subseteq \mathbb{R}^k$ for some $k \in \mathbb{N}_+$ and the parameter vector has the form $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$.

Confidence Sets
A confidence set is a subset $C(X)$ of the parameter space $\Theta$ that depends only on the data variable $X$, and no unknown parameters. The confidence level is the smallest probability that $\theta \in C(X)$:
$$\min\{P[\theta \in C(X)] : \theta \in \Theta\} \quad (8.1.2)$$

Thus, in a sense, a confidence set is a set-valued statistic. A confidence set is an estimator of θ in the sense that we hope that
θ ∈ C (X) with high probability, so that the confidence level is high. Note that since the distribution of X depends on θ , there is a

dependence on θ in the probability measure P in the definition of confidence level. However, we usually suppress this, just to keep
the notation simple. Usually, we try to construct a confidence set for θ with a prescribed confidence level 1 − α where 0 < α < 1 .
Typical confidence levels are 0.9, 0.95, and 0.99. Sometimes the best we can do is to construct a confidence set whose confidence
level is at least 1 − α ; this is called a conservative 1 − α confidence set for θ .

Figure 8.1.1 : A set estimate that successfully captured the parameter


Suppose that $C(X)$ is a $1 - \alpha$ level confidence set for a parameter $\theta$. Note that when we run the experiment and observe the data $x$,
the computed confidence set is C (x). The true value of θ is either in this set, or is not, and we will usually never know. However,
by the law of large numbers, if we were to repeat the confidence experiment over and over, the proportion of sets that contain θ
would converge to P[θ ∈ C (X)] = 1 − α . This is the precise meaning of the term confidence. In the usual terminology of
statistics, the random set C (X) is the estimator; the deterministic set C (x) based on an observed value x is the estimate.
Next, note that the quality of a confidence set, as an estimator of θ , is based on two factors: the confidence level and the precision
as measured by the “size” of the set. A good estimator has small size (and hence gives a precise estimate of θ ) and large
confidence. However, for a given X, there is usually a tradeoff between confidence level and precision—increasing the confidence
level comes only at the expense of increasing the size of the set, and decreasing the size of the set comes only at the expense of
decreasing the confidence level. How we measure the “size” of the confidence set depends on the dimension of the parameter space
and the nature of the confidence set. Moreover, the size of the set is usually random, although in some special cases it may be
deterministic.
Considering the extreme cases may give us some insight. First, suppose that C (X) = Θ . This set estimator has maximum
confidence 1, but no precision and hence it is worthless (we already knew that θ ∈ Θ ). At the other extreme, suppose that C (X) is
a singleton set. This set estimator has the best possible precision, but typically for continuous distributions, would have confidence
0. In between these extremes, hopefully, are set estimators that have high confidence and high precision.

Suppose that Ci (X) is a 1 − αi level confidence set for θ for i ∈ {1, 2, … , k}. If α = α1 + α2 + ⋯ + αk < 1 then
C1 (X) ∩ C2 (X) ∩ ⋯ ∩ Ck (X) is a conservative 1 − α level confidence set for θ .
Proof

Real-Valued Parameters
In many cases, we are interested in estimating a real-valued parameter λ = λ(θ) taking values in an interval parameter space
(a, b) , where a, b ∈ R with a < b . Of course, it's possible that a = −∞ or b = ∞ . In this context our confidence set frequently

has the form

C (X) = {θ ∈ Θ : L(X) < λ(θ) < U (X)} (8.1.3)

where L(X) and U (X) are real-valued statistics. In this case (L(X), U (X)) is called a confidence interval for λ . If L(X) and
U (X) are both random, then the confidence interval is often said to be two-sided. In the special case that U (X) = b , L(X) is

called a confidence lower bound for λ . In the special case that L(X) = a , U (X) is called a confidence upper bound for λ .

Suppose that L(X) is a 1 − α level confidence lower bound for λ and that U (X) is a 1 − β level confidence upper bound for
λ . If α + β < 1 then (L(X), U (X)) is a conservative 1 − (α + β) level confidence interval for λ .

Proof

Pivot Variables
You might think that it should be very difficult to construct confidence sets for a parameter θ . However, in many important special
cases, confidence sets can be constructed easily from certain random variables known as pivot variables.

Suppose that V is a function from S × Θ into a set T . The random variable V (X, θ) is a pivot variable for θ if its distribution
does not depend on θ . Specifically, P[V (X, θ) ∈ B] is constant in θ ∈ Θ for each B ⊆ T .

The basic idea is that we try to combine X and θ algebraically in such a way that we factor out the dependence on θ in the
distribution of the resulting random variable $V(X, \theta)$. If we know the distribution of the pivot variable, then for a given $\alpha$, we can try to find $B \subseteq T$ (that does not depend on $\theta$) such that $P_\theta[V(X, \theta) \in B] = 1 - \alpha$. It then follows that a $1 - \alpha$ confidence set for the parameter is given by $C(X) = \{\theta \in \Theta : V(X, \theta) \in B\}$.

Figure 8.1.2 : A confidence set constructed from a pivot variable


Suppose now that our pivot variable V (X, θ) is real-valued, which for simplicity, we will assume has a continuous distribution.
For p ∈ (0, 1), let v(p) denote the quantile of order p for the pivot variable V (X, θ) . By the very meaning of pivot variable, v(p)
does not depend on θ .

For any p ∈ (0, 1), a 1 − α level confidence set for θ is

{θ ∈ Θ : v(α − pα) < V (X, θ) < v(1 − pα)} (8.1.4)

Proof

The confidence set above corresponds to $(1 - p)\alpha$ in the left tail and $p\alpha$ in the right tail, in terms of the distribution of the pivot variable $V(X, \theta)$. The special case $p = \frac{1}{2}$ is the equal-tailed case, the most common case.

Figure 8.1.3: Distribution of the pivot variable showing $(1 - p)\alpha$ in the left tail and $p\alpha$ in the right tail.

The confidence set (5) is decreasing in α and hence increasing in 1 − α (in the sense of the subset relation) for fixed p.

For the confidence set (5), we would naturally like to choose $p$ that minimizes the size of the set in some sense. However this is often a difficult problem. The equal-tailed interval, corresponding to $p = \frac{1}{2}$, is the most commonly used case, and is sometimes (but not always) an optimal choice. Pivot variables are far from unique; the challenge is to find a pivot variable whose distribution is known and which gives tight bounds on the parameter (high precision).

Suppose that V (X, θ) is a pivot variable for θ . If g is a function defined on the range of V and g involves no unknown
parameters, then U = g[V (X, θ)] is also a pivot variable for θ .

Examples and Special Cases


Location-Scale Families
In the case of location-scale families of distributions, we can easily find pivot variables. Suppose that Z is a real-valued random
variable with a continuous distribution that has probability density function g , and no unknown parameters. Let X = μ + σZ
where μ ∈ R and σ ∈ (0, ∞) are parameters. Recall that the probability density function of X is given by
$$f_{\mu,\sigma}(x) = \frac{1}{\sigma}\, g\!\left(\frac{x - \mu}{\sigma}\right), \quad x \in \mathbb{R} \quad (8.1.5)$$

and the corresponding family of distributions is called the location-scale family associated with the distribution of Z ; μ is the
location parameter and σ is the scale parameter. Generally, we are assuming that these parameters are unknown.
Now suppose that $X = (X_1, X_2, \ldots, X_n)$ is a random sample of size $n$ from the distribution of $X$; this is our observable outcome
vector. For each i, let
$$Z_i = \frac{X_i - \mu}{\sigma} \quad (8.1.6)$$

The random vector $Z = (Z_1, Z_2, \ldots, Z_n)$ is a random sample of size $n$ from the distribution of $Z$.

In particular, note that Z is a pivot variable for (μ, σ), since Z is a function of X, μ , and σ, but the distribution of Z does not
depend on μ or σ. Hence, any function of Z will also be a pivot variable for (μ, σ), (if the function does not involve the
parameters). Of course, some of these pivot variables will be much more useful than others in estimating μ and σ. In the following
exercises, we will explore two common and important pivot variables.

Let M (X) and M (Z) denote the sample means of X and Z , respectively. Then M (Z) is a pivot variable for (μ, σ) since
$$M(Z) = \frac{M(X) - \mu}{\sigma} \quad (8.1.7)$$

Let $m$ denote the quantile function of the pivot variable $M(Z)$. For any $p \in (0, 1)$, a $1 - \alpha$ confidence set for $(\mu, \sigma)$ is
$$Z_{\alpha,p}(X) = \{(\mu, \sigma) : M(X) - m(1 - p\alpha)\sigma < \mu < M(X) - m(\alpha - p\alpha)\sigma\} \quad (8.1.8)$$

The confidence set constructed above is a “cone” in the (μ, σ) parameter space, with vertex at (M (X), 0) and boundary lines
of slopes −1/m(1 − pα) and −1/m(α − pα), as shown in the graph below. (Note, however, that both slopes might be
negative or both positive.)

Figure 8.1.4 : The confidence set for (μ, σ) constructed from M
The fact that the confidence set is unbounded is clearly not good, but is perhaps not surprising; we are estimating two real
parameters with a single real-valued pivot variable. However, if σ is known, the confidence set defines a confidence interval for μ .
Geometrically, the confidence interval simply corresponds to the horizontal cross section at σ.

$1 - \alpha$ confidence sets for $(\mu, \sigma)$ are
1. $Z_{\alpha,1}(X) = \{(\mu, \sigma) : M(X) - m(1 - \alpha)\sigma < \mu < \infty\}$
2. $Z_{\alpha,0}(X) = \{(\mu, \sigma) : -\infty < \mu < M(X) - m(\alpha)\sigma\}$

Proof

If σ is known, then (a) gives a 1 − α confidence lower bound for μ and (b) gives a 1 − α confidence upper bound for μ .

Let $S(X)$ and $S(Z)$ denote the sample standard deviations of $X$ and $Z$, respectively. Then $S(Z)$ is a pivot variable for $(\mu, \sigma)$ and a pivot variable for $\sigma$ since
$$S(Z) = \frac{S(X)}{\sigma} \quad (8.1.9)$$

Let $s$ denote the quantile function of $S(Z)$. For any $\alpha \in (0, 1)$ and $p \in (0, 1)$, a $1 - \alpha$ confidence set for $(\mu, \sigma)$ is
$$V_{\alpha,p}(X) = \left\{(\mu, \sigma) : \frac{S(X)}{s(1 - p\alpha)} < \sigma < \frac{S(X)}{s(\alpha - p\alpha)}\right\} \quad (8.1.10)$$

Note that the confidence set gives no information about μ since the random variable above is a pivot variable for σ alone. The
confidence set can also be viewed as a bounded confidence interval for σ.

Figure 8.1.5 : The confidence set for (μ, σ) constructed from S

$1 - \alpha$ confidence sets for $(\mu, \sigma)$ are
1. $V_{\alpha,1}(X) = \{(\mu, \sigma) : S(X)/s(1 - \alpha) < \sigma < \infty\}$
2. $V_{\alpha,0}(X) = \{(\mu, \sigma) : 0 < \sigma < S(X)/s(\alpha)\}$

Proof

The set in part (a) gives a 1 − α confidence lower bound for σ and the set in part (b) gives a 1 − α confidence upper bound for σ.
We can intersect the confidence sets corresponding to the two pivot variables to produce conservative, bounded confidence sets.

If α, β, p, q ∈ (0, 1) with α + β < 1 then Z α,p ∩ Vβ,q is a conservative 1 − (α + β) confidence set for (μ, σ).
Proof

Figure 8.1.6 : The bounded confidence set for (μ, σ) constructed from (M , S)
The most important location-scale family is the family of normal distributions. The problem of estimation in the normal model is
considered in the next section. In the remainder of this section, we will explore another important scale family.

The Exponential Distribution


Recall that the exponential distribution with scale parameter $\sigma \in (0, \infty)$ has probability density function
$$f(x) = \frac{1}{\sigma} e^{-x/\sigma}, \quad x \in [0, \infty)$$
It is the scale family associated with the standard exponential distribution, which has probability density function $g(x) = e^{-x}$, $x \in [0, \infty)$. The exponential distribution is widely used to model random times (such as lifetimes and “arrival” times), particularly in the context of the Poisson model. Now suppose that $X = (X_1, X_2, \ldots, X_n)$ is a random sample of size $n$ from the exponential distribution with unknown scale parameter $\sigma$. Let
$$Y = \sum_{i=1}^n X_i \quad (8.1.11)$$

The random variable $\frac{2}{\sigma} Y$ has the chi-square distribution with $2n$ degrees of freedom, and hence is a pivot variable for $\sigma$.

Note that this pivot variable is a multiple of the variable $M$ constructed above for general location-scale families (with $\mu = 0$). For $p \in (0, 1)$ and $k \in (0, \infty)$, let $\chi^2_k(p)$ denote the quantile of order $p$ for the chi-square distribution with $k$ degrees of freedom. For selected values of $k$ and $p$, $\chi^2_k(p)$ can be obtained from the special distribution calculator or from most statistical software packages.

Recall that
1. $\chi^2_k(p) \to 0$ as $p \downarrow 0$
2. $\chi^2_k(p) \to \infty$ as $p \uparrow 1$

For any $\alpha \in (0, 1)$ and any $p \in (0, 1)$, a $1 - \alpha$ confidence interval for $\sigma$ is
$$\left(\frac{2Y}{\chi^2_{2n}(1 - p\alpha)},\ \frac{2Y}{\chi^2_{2n}(\alpha - p\alpha)}\right) \quad (8.1.12)$$

Note that
1. $2Y/\chi^2_{2n}(1 - \alpha)$ is a $1 - \alpha$ confidence lower bound for $\sigma$.
2. $2Y/\chi^2_{2n}(\alpha)$ is a $1 - \alpha$ confidence upper bound for $\sigma$.

Of the two-sided confidence intervals constructed above, we would naturally prefer the one with the smallest length, because this interval gives the most information about the parameter $\sigma$. However, minimizing the length as a function of $p$ is computationally difficult. The two-sided confidence interval that is typically used is the equal-tailed interval obtained by letting $p = \frac{1}{2}$:
$$\left(\frac{2Y}{\chi^2_{2n}(1 - \alpha/2)},\ \frac{2Y}{\chi^2_{2n}(\alpha/2)}\right) \quad (8.1.13)$$

The lifetime of a certain type of component (in hours) has an exponential distribution with unknown scale parameter σ. Ten
devices are operated until failure; the lifetimes are 592, 861, 1470, 2412, 335, 3485, 736, 758, 530, 1961.
1. Construct the 95% two-sided confidence interval for σ.
2. Construct the 95% confidence lower bound for σ.

3. Construct the 95% confidence upper bound for σ.
Answer
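A minimal computational sketch for this exercise (not part of the original text; it assumes Python with the scipy library, and the variable names are ours):

```python
# Sketch: confidence intervals for the exponential scale parameter sigma,
# using the pivot variable 2Y/sigma ~ chi-square with 2n degrees of freedom.
from scipy.stats import chi2

data = [592, 861, 1470, 2412, 335, 3485, 736, 758, 530, 1961]
n, y = len(data), sum(data)   # n = 10, Y = 13140
alpha = 0.05

# 1. Equal-tailed two-sided interval (8.1.13)
print(2 * y / chi2.ppf(1 - alpha / 2, df=2 * n), 2 * y / chi2.ppf(alpha / 2, df=2 * n))
# 2. Lower bound: 2Y / chi2(1 - alpha);  3. Upper bound: 2Y / chi2(alpha)
print(2 * y / chi2.ppf(1 - alpha, df=2 * n), 2 * y / chi2.ppf(alpha, df=2 * n))
```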

This page titled 8.1: Introduction to Set Estimation is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

8.2: Estimation in the Normal Model

Basic Theory
The Normal Model
The normal distribution is perhaps the most important distribution in the study of mathematical statistics, in part because of the
central limit theorem. As a consequence of this theorem, a measured quantity that is subject to numerous small, random errors will
have, at least approximately, a normal distribution. Such variables are ubiquitous in statistical experiments, in subjects varying
from the physical and biological sciences to the social sciences.
So in this section, we assume that $X = (X_1, X_2, \ldots, X_n)$ is a random sample from the normal distribution with mean $\mu$ and standard deviation $\sigma$. Our goal is to construct confidence intervals for $\mu$ and $\sigma$ individually, and then more generally, confidence sets for $(\mu, \sigma)$. These are among the most important special cases of set estimation. A parallel section on Tests in the Normal Model is in the chapter on Hypothesis Testing. First we need to review some basic facts that will be critical for our analysis.

Recall that the sample mean $M$ and sample variance $S^2$ are
$$M = \frac{1}{n} \sum_{i=1}^n X_i, \quad S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - M)^2 \quad (8.2.1)$$

From our study of point estimation, recall that $M$ is an unbiased and consistent estimator of $\mu$ while $S^2$ is an unbiased and consistent estimator of $\sigma^2$. From these basic statistics we can construct the pivot variables that will be used to construct our interval estimates. The following results were established in the section on Special Properties of the Normal Distribution.

Define
$$Z = \frac{M - \mu}{\sigma / \sqrt{n}}, \quad T = \frac{M - \mu}{S / \sqrt{n}}, \quad V = \frac{n - 1}{\sigma^2} S^2 \quad (8.2.2)$$
1. $Z$ has the standard normal distribution.
2. $T$ has the student $t$ distribution with $n - 1$ degrees of freedom.
3. $V$ has the chi-square distribution with $n - 1$ degrees of freedom.
4. $Z$ and $V$ are independent.

It follows that each of these random variables is a pivot variable for $(\mu, \sigma)$ since the distributions do not depend on the parameters, but the variables themselves functionally depend on one or both parameters. Pivot variables $Z$ and $T$ will be used to construct interval estimates of $\mu$ while $V$ will be used to construct interval estimates of $\sigma^2$. To construct our estimates, we will need quantiles of these standard distributions. The quantiles can be computed using the special distribution calculator or from most mathematical and statistical software packages. Here is the notation we will use:

Let $p \in (0, 1)$ and $k \in \mathbb{N}_+$.
1. $z(p)$ denotes the quantile of order $p$ for the standard normal distribution.
2. $t_k(p)$ denotes the quantile of order $p$ for the student $t$ distribution with $k$ degrees of freedom.
3. $\chi^2_k(p)$ denotes the quantile of order $p$ for the chi-square distribution with $k$ degrees of freedom.
Since the standard normal and student $t$ distributions are symmetric about 0, it follows that $z(1 - p) = -z(p)$ and $t_k(1 - p) = -t_k(p)$ for $p \in (0, 1)$ and $k \in \mathbb{N}_+$. On the other hand, the chi-square distribution is not symmetric.

Confidence Intervals for μ with σ Known


For our first discussion, we assume that the distribution mean μ is unknown but the standard deviation σ is known. This is not
always an artificial assumption. There are often situations where σ is stable over time, and hence is at least approximately known,
while μ changes because of different “treatments”. Examples are given in the computational exercises below. The pivot variable Z
leads to confidence intervals for μ .

For $\alpha \in (0, 1)$,
1. $\left[M - z(1 - \alpha/2) \frac{\sigma}{\sqrt{n}},\ M + z(1 - \alpha/2) \frac{\sigma}{\sqrt{n}}\right]$ is a $1 - \alpha$ confidence interval for $\mu$.
2. $M - z(1 - \alpha) \frac{\sigma}{\sqrt{n}}$ is a $1 - \alpha$ confidence lower bound for $\mu$.
3. $M + z(1 - \alpha) \frac{\sigma}{\sqrt{n}}$ is a $1 - \alpha$ confidence upper bound for $\mu$.

Proof
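As a computational aside (a minimal sketch, not from the text; it assumes Python with scipy, and the function name is ours), the two-sided interval in part (a) can be computed directly:

```python
# Sketch: two-sided 1 - alpha confidence interval for mu with sigma known (Z pivot).
from math import sqrt
from statistics import mean
from scipy.stats import norm

def z_interval(sample, sigma, alpha=0.05):
    n = len(sample)
    m = mean(sample)                                # sample mean M
    d = norm.ppf(1 - alpha / 2) * sigma / sqrt(n)   # margin of error
    return m - d, m + d
```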

These are the standard interval estimates for $\mu$ when $\sigma$ is known. The two-sided confidence interval in (a) is symmetric about the sample mean $M$, and as the proof shows, corresponds to equal probability $\alpha/2$ in each tail of the distribution of the pivot variable $Z$. But of course, this is not the only two-sided $1 - \alpha$ confidence interval; we can divide the probability $\alpha$ any way we want between the left and right tails of the distribution of $Z$.

For every $\alpha, p \in (0, 1)$, a $1 - \alpha$ confidence interval for $\mu$ is
$$\left[M - z(1 - p\alpha) \frac{\sigma}{\sqrt{n}},\ M - z(\alpha - p\alpha) \frac{\sigma}{\sqrt{n}}\right] \quad (8.2.3)$$
1. $p = \frac{1}{2}$ gives the symmetric, equal-tail confidence interval.
2. $p \to 0$ gives the interval with the confidence upper bound.
3. $p \to 1$ gives the interval with the confidence lower bound.
Proof

In terms of the distribution of the pivot variable Z , as the proof shows, the two-sided confidence interval above corresponds to pα
in the right tail and (1 − p)α in the left tail. Next, let's study the length of this confidence interval.

For $\alpha, p \in (0, 1)$, the (deterministic) length of the two-sided $1 - \alpha$ confidence interval above is
$$L = [z(1 - p\alpha) - z(\alpha - p\alpha)] \frac{\sigma}{\sqrt{n}} \quad (8.2.5)$$
1. $L$ is a decreasing function of $\alpha$, and $L \downarrow 0$ as $\alpha \uparrow 1$ and $L \uparrow \infty$ as $\alpha \downarrow 0$.
2. $L$ is a decreasing function of $n$, and $L \downarrow 0$ as $n \uparrow \infty$.
3. $L$ is an increasing function of $\sigma$, and $L \downarrow 0$ as $\sigma \downarrow 0$ and $L \uparrow \infty$ as $\sigma \uparrow \infty$.
4. As a function of $p$, $L$ decreases and then increases, with minimum at the point of symmetry $p = \frac{1}{2}$.

The last result shows again that there is a tradeoff between the confidence level and the length of the confidence interval. If $n$ and $p$ are fixed, we can decrease $L$, and hence tighten our estimate, only at the expense of decreasing our confidence in the estimate. Conversely, we can increase our confidence in the estimate only at the expense of increasing the length of the interval. In terms of $p$, the best of the two-sided $1 - \alpha$ confidence intervals (and the one that is almost always used) is the symmetric, equal-tail interval with $p = \frac{1}{2}$.

Use the mean estimation experiment to explore the procedure. Select the normal distribution and select normal pivot. Use
various parameter values, confidence levels, sample sizes, and interval types. For each configuration, run the experiment 1000
times. As the simulation runs, note that the confidence interval successfully captures the mean if and only if the value of the
pivot variable is between the quantiles. Note the size and location of the confidence intervals and compare the proportion of
successful intervals to the theoretical confidence level.

For the standard confidence intervals, let $d$ denote the distance between the sample mean $M$ and an endpoint. That is,
$$d = z_\alpha \frac{\sigma}{\sqrt{n}} \quad (8.2.6)$$
where $z_\alpha = z(1 - \alpha/2)$ for the two-sided interval and $z_\alpha = z(1 - \alpha)$ for the upper or lower confidence interval. The number $d$ is the margin of error of the estimate.

Note that d is deterministic, and the length of the standard two-sided interval is L = 2d . In many cases, the first step in the design
of the experiment is to determine the sample size needed to estimate μ with a given margin of error and a given confidence level.

The sample size needed to estimate $\mu$ with confidence $1 - \alpha$ and margin of error $d$ is
$$n = \left\lceil \frac{z_\alpha^2 \sigma^2}{d^2} \right\rceil \quad (8.2.7)$$
Proof

Note that $n$ varies directly with $z_\alpha^2$ and with $\sigma^2$ and inversely with $d^2$. This last fact implies a law of diminishing return in reducing the margin of error. For example, if we want to reduce a given margin of error by a factor of $\frac{1}{2}$, we must increase the sample size by a factor of 4.
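For example (a sketch under the same assumptions as the earlier snippet), the sample size formula can be implemented directly; the ceiling rounds up to a whole number of observations:

```python
# Sketch: sample size needed to estimate mu with margin of error d (two-sided case).
from math import ceil
from scipy.stats import norm

def sample_size(sigma, d, alpha=0.05):
    z = norm.ppf(1 - alpha / 2)        # z(1 - alpha/2)
    return ceil((z * sigma / d) ** 2)

# e.g. sigma = 0.5, d = 0.1 at 95% confidence: sample_size(0.5, 0.1) returns 97
```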

Confidence Intervals for μ with σ Unknown


For our next discussion, we assume that the distribution mean μ and standard deviation σ are unknown, the usual situation. In this
case, we can use the T pivot variable, rather than the Z pivot variable, to construct confidence intervals for μ .

For $\alpha \in (0, 1)$,
1. $\left[M - t_{n-1}(1 - \alpha/2) \frac{S}{\sqrt{n}},\ M + t_{n-1}(1 - \alpha/2) \frac{S}{\sqrt{n}}\right]$ is a $1 - \alpha$ confidence interval for $\mu$.
2. $M - t_{n-1}(1 - \alpha) \frac{S}{\sqrt{n}}$ is a $1 - \alpha$ confidence lower bound for $\mu$.
3. $M + t_{n-1}(1 - \alpha) \frac{S}{\sqrt{n}}$ is a $1 - \alpha$ confidence upper bound for $\mu$.

Proof
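A computational sketch (same assumptions as the earlier snippets; note that the standard library's `stdev` uses the $n - 1$ denominator, matching $S$):

```python
# Sketch: two-sided 1 - alpha t interval for mu with sigma unknown.
from math import sqrt
from statistics import mean, stdev
from scipy.stats import t

def t_interval(sample, alpha=0.05):
    n = len(sample)
    m, s = mean(sample), stdev(sample)                # M and S
    d = t.ppf(1 - alpha / 2, df=n - 1) * s / sqrt(n)  # margin of error
    return m - d, m + d
```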

These are the standard interval estimates of $\mu$ with $\sigma$ unknown. The two-sided confidence interval in (a) is symmetric about the sample mean $M$ and corresponds to equal probability $\alpha/2$ in each tail of the distribution of the pivot variable $T$. As before, this is not the only confidence interval; we can divide $\alpha$ between the left and right tails any way that we want.

For every $\alpha, p \in (0, 1)$, a $1 - \alpha$ confidence interval for $\mu$ is
$$\left[M - t_{n-1}(1 - p\alpha) \frac{S}{\sqrt{n}},\ M - t_{n-1}(\alpha - p\alpha) \frac{S}{\sqrt{n}}\right] \quad (8.2.8)$$
1. $p = \frac{1}{2}$ gives the symmetric, equal-tail confidence interval.
2. $p \to 0$ gives the interval with the confidence upper bound.
3. $p \to 1$ gives the interval with the confidence lower bound.
Proof

The two-sided confidence interval above corresponds to $p\alpha$ in the right tail and $(1 - p)\alpha$ in the left tail of the distribution of the pivot variable $T$. Next, let's study the length of this confidence interval.

For $\alpha, p \in (0, 1)$, the (random) length of the two-sided $1 - \alpha$ confidence interval above is
$$L = \frac{t_{n-1}(1 - p\alpha) - t_{n-1}(\alpha - p\alpha)}{\sqrt{n}}\, S \quad (8.2.10)$$
1. $L$ is a decreasing function of $\alpha$, and $L \downarrow 0$ as $\alpha \uparrow 1$ and $L \uparrow \infty$ as $\alpha \downarrow 0$.
2. As a function of $p$, $L$ decreases and then increases, with minimum at the point of symmetry $p = \frac{1}{2}$.
3. $E(L) = [t_{n-1}(1 - p\alpha) - t_{n-1}(\alpha - p\alpha)] \frac{\sqrt{2}\,\sigma\,\Gamma(n/2)}{\sqrt{n(n-1)}\,\Gamma[(n-1)/2]} \quad (8.2.11)$
4. $\mathrm{var}(L) = \frac{1}{n}\,[t_{n-1}(1 - p\alpha) - t_{n-1}(\alpha - p\alpha)]^2\, \sigma^2 \left[1 - \frac{2\,\Gamma^2(n/2)}{(n-1)\,\Gamma^2[(n-1)/2]}\right] \quad (8.2.12)$

Proof

8.2.3 https://stats.libretexts.org/@go/page/10201
Parts (a) and (b) follow from properties of the student quantile function $t_{n-1}$. Parts (c) and (d) follow from the fact that $\frac{\sqrt{n-1}}{\sigma} S$ has a chi distribution with $n - 1$ degrees of freedom.

Once again, there is a tradeoff between the confidence level and the length of the confidence interval. If $n$ and $p$ are fixed, we can decrease $L$, and hence tighten our estimate, only at the expense of decreasing our confidence in the estimate. Conversely, we can increase our confidence in the estimate only at the expense of increasing the length of the interval. In terms of $p$, the best of the two-sided $1 - \alpha$ confidence intervals (and the one that is almost always used) is the symmetric, equal-tail interval with $p = \frac{1}{2}$. Finally, note that it does not really make sense to consider $L$ as a function of $S$, since $S$ is a statistic rather than an algebraic variable. Similarly, it does not make sense to consider $L$ as a function of $n$, since changing $n$ means new data and hence a new value of $S$.

Use the mean estimation experiment to explore the procedure. Select the normal distribution and the T pivot. Use various
parameter values, confidence levels, sample sizes, and interval types. For each configuration, run the experiment 1000 times.
As the simulation runs, note that the confidence interval successfully captures the mean if and only if the value of the pivot
variable is between the quantiles. Note the size and location of the confidence intervals and compare the proportion of
successful intervals to the theoretical confidence level.

Confidence Intervals for $\sigma^2$
Next we will construct confidence intervals for $\sigma^2$ using the pivot variable $V$ given above.

For $\alpha \in (0, 1)$,
1. $\left[\frac{n-1}{\chi^2_{n-1}(1 - \alpha/2)} S^2,\ \frac{n-1}{\chi^2_{n-1}(\alpha/2)} S^2\right]$ is a $1 - \alpha$ confidence interval for $\sigma^2$.
2. $\frac{n-1}{\chi^2_{n-1}(1 - \alpha)} S^2$ is a $1 - \alpha$ confidence lower bound for $\sigma^2$.
3. $\frac{n-1}{\chi^2_{n-1}(\alpha)} S^2$ is a $1 - \alpha$ confidence upper bound for $\sigma^2$.

Proof
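A computational sketch (same assumptions as the earlier snippets; `variance` uses the $n - 1$ denominator, matching $S^2$):

```python
# Sketch: equal-tailed 1 - alpha confidence interval for sigma^2 (chi-square pivot V).
from statistics import variance
from scipy.stats import chi2

def variance_interval(sample, alpha=0.05):
    n = len(sample)
    s2 = variance(sample)                              # sample variance S^2
    lo = (n - 1) * s2 / chi2.ppf(1 - alpha / 2, df=n - 1)
    hi = (n - 1) * s2 / chi2.ppf(alpha / 2, df=n - 1)
    return lo, hi   # take square roots for an interval for sigma
```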

These are the standard interval estimates for $\sigma^2$. The two-sided interval in (a) is the equal-tail interval, corresponding to probability $\alpha/2$ in each tail of the distribution of the pivot variable $V$. Note however that this interval is not symmetric about the sample variance $S^2$. Once again, we can partition the probability $\alpha$ between the left and right tails of the distribution of $V$ any way that we like.

For every $\alpha, p \in (0, 1)$, a $1 - \alpha$ confidence interval for $\sigma^2$ is
$$\left[\frac{n-1}{\chi^2_{n-1}(1 - p\alpha)} S^2,\ \frac{n-1}{\chi^2_{n-1}(\alpha - p\alpha)} S^2\right] \quad (8.2.13)$$
1. $p = \frac{1}{2}$ gives the equal-tail $1 - \alpha$ confidence interval.
2. $p \to 0$ gives the interval with the $1 - \alpha$ upper bound.
3. $p \to 1$ gives the interval with the $1 - \alpha$ lower bound.

In terms of the distribution of the pivot variable $V$, the confidence interval above corresponds to $p\alpha$ in the right tail and $(1 - p)\alpha$ in the left tail. Once again, let's look at the length of the general two-sided confidence interval. The length is random, but is a multiple of the sample variance $S^2$. Hence we can compute the expected value and variance of the length.

For $\alpha, p \in (0, 1)$, the (random) length of the two-sided confidence interval in the last theorem is
$$L = \left[\frac{1}{\chi^2_{n-1}(\alpha - p\alpha)} - \frac{1}{\chi^2_{n-1}(1 - p\alpha)}\right] (n - 1) S^2 \quad (8.2.14)$$
1. $E(L) = \left[\frac{1}{\chi^2_{n-1}(\alpha - p\alpha)} - \frac{1}{\chi^2_{n-1}(1 - p\alpha)}\right] (n - 1)\, \sigma^2$
2. $\mathrm{var}(L) = 2\left[\frac{1}{\chi^2_{n-1}(\alpha - p\alpha)} - \frac{1}{\chi^2_{n-1}(1 - p\alpha)}\right]^2 (n - 1)\, \sigma^4$

To construct an optimal two-sided confidence interval, it would be natural to find $p$ that minimizes the expected length. This is a complicated problem, but it turns out that for large $n$, the equal-tail interval with $p = \frac{1}{2}$ is close to optimal. Of course, taking square roots of the endpoints of any of the confidence intervals for $\sigma^2$ gives $1 - \alpha$ confidence intervals for the distribution standard deviation $\sigma$.

Use the variance estimation experiment to explore the procedure. Select the normal distribution. Use various parameter values, confidence levels, sample sizes, and interval types. For each configuration, run the experiment 1000 times. As the simulation runs, note that the confidence interval successfully captures the standard deviation if and only if the value of the pivot variable is between the quantiles. Note the size and location of the confidence intervals and compare the proportion of successful intervals to the theoretical confidence level.

Confidence Sets for (μ, σ)


In the discussion above, we constructed confidence intervals for μ and for σ separately (again, usually both parameters are
unknown). In our next discussion, we will consider confidence sets for the parameter point (μ, σ). These sets will be subsets of the
underlying parameter space R × (0, ∞) .

Confidence Sets Constructed from the Pivot Variables


Each of the pivot variables $Z$, $T$, and $V$ can be used to construct confidence sets for $(\mu, \sigma)$. In isolation, each will produce an unbounded confidence set, not surprising since we are using a single pivot variable to estimate two parameters. We consider the normal pivot variable $Z$ first.

For any $\alpha, p \in (0, 1)$, a $1 - \alpha$ level confidence set for $(\mu, \sigma)$ is
$$Z_{\alpha,p} = \left\{(\mu, \sigma) : M - z(1 - p\alpha) \frac{\sigma}{\sqrt{n}} < \mu < M - z(\alpha - p\alpha) \frac{\sigma}{\sqrt{n}}\right\} \quad (8.2.15)$$
The confidence set is a “cone” in the $(\mu, \sigma)$ parameter space, with vertex at $(M, 0)$ and boundary lines of slopes $-\sqrt{n}/z(1 - p\alpha)$ and $-\sqrt{n}/z(\alpha - p\alpha)$.

Proof

The confidence cone is shown in the graph below. (Note, however, that both slopes might be negative or both positive.)

Figure 8.2.1 : The confidence set based on the normal pivot variable
The pivot variable T leads to the following result:

For every $\alpha, p \in (0, 1)$, a $1 - \alpha$ level confidence set for $(\mu, \sigma)$ is
$$T_{\alpha,p} = \left\{(\mu, \sigma) : M - t_{n-1}(1 - p\alpha) \frac{S}{\sqrt{n}} < \mu < M - t_{n-1}(\alpha - p\alpha) \frac{S}{\sqrt{n}}\right\} \quad (8.2.17)$$

Proof

Figure 8.2.2 : The confidence set based on the T pivot variable
By design, this confidence set gives no information about σ. Finally, the pivot variable V leads to the following result:

For every $\alpha, p \in (0, 1)$, a $1 - \alpha$ level confidence set for $(\mu, \sigma)$ is
$$V_{\alpha,p} = \left\{(\mu, \sigma) : \frac{(n-1)S^2}{\chi^2_{n-1}(1 - p\alpha)} < \sigma^2 < \frac{(n-1)S^2}{\chi^2_{n-1}(\alpha - p\alpha)}\right\} \quad (8.2.18)$$

Proof

Figure 8.2.3 : The confidence set based on the pivot variable V


By design, this confidence set gives no information about μ .

Intersections
We can now form intersections of some of the confidence sets constructed above to obtain bounded confidence sets for $(\mu, \sigma)$. We will use the fact that the sample mean $M$ and the sample variance $S^2$ are independent, one of the most important special properties of a normal sample. We will also need the result from the Introduction on the intersection of confidence intervals. In the following theorems, suppose that $\alpha, \beta, p, q \in (0, 1)$ with $\alpha + \beta < 1$.

The set $T_{\alpha,p} \cap V_{\beta,q}$ is a conservative $1 - (\alpha + \beta)$ confidence set for $(\mu, \sigma)$.

Figure 8.2.4: The confidence set $T_{\alpha,p} \cap V_{\beta,q}$

The set $Z_{\alpha,p} \cap V_{\beta,q}$ is a $(1 - \alpha)(1 - \beta)$ confidence set for $(\mu, \sigma)$.

Figure 8.2.5: The confidence set $Z_{\alpha,p} \cap V_{\beta,q}$

It is interesting to note that the confidence set $T_{\alpha,p} \cap V_{\beta,q}$ is a product set as a subset of the parameter space, but is not a product set as a subset of the sample space. By contrast, the confidence set $Z_{\alpha,p} \cap V_{\beta,q}$ is not a product set as a subset of the parameter space, but is a product set as a subset of the sample space.

Exercises
Robustness
The main assumption that we made was that the underlying sampling distribution is normal. Of course, in real statistical problems,
we are unlikely to know much about the sampling distribution, let alone whether or not it is normal. When a statistical procedure
works reasonably well, even when the underlying assumptions are violated, the procedure is said to be robust. In this subsection,
we will explore the robustness of the estimation procedures for μ and σ.
Suppose in fact that the underlying distribution is not normal. When the sample size n is relatively large, the distribution of the
sample mean will still be approximately normal by the central limit theorem. Thus, our interval estimates of μ may still be
approximately valid.

Use the simulation of the mean estimation experiment to explore the procedure. Select the gamma distribution and select
student pivot. Use various parameter values, confidence levels, sample sizes, and interval types. For each configuration, run the
experiment 1000 times. Note the size and location of the confidence intervals and compare the proportion of successful
intervals to the theoretical confidence level.

In the mean estimation experiment, repeat the previous exercise with the uniform distribution.

How large n needs to be for the interval estimation procedures of μ to work well depends, of course, on the underlying
distribution; the more this distribution deviates from normality, the larger n must be. Fortunately, convergence to normality in the
central limit theorem is rapid and hence, as you observed in the exercises, we can get away with relatively small sample sizes (30
or more) in most cases.
In general, the interval estimation procedures for σ are not robust; there is no analog of the central limit theorem to save us from
deviations from normality.

In variance estimation experiment, select the gamma distribution. Use various parameter values, confidence levels, sample
sizes, and interval types. For each configuration, run the experiment 1000 times. Note the size and location of the confidence
intervals and compare the proportion of successful intervals to the theoretical confidence level.

In variance estimation experiment, select the uniform distribution. Use various parameter values, confidence levels, sample
sizes, and interval types. For each configuration, run the experiment 1000 times. Note the size and location of the confidence
intervals and compare the proportion of successful intervals to the theoretical confidence level.

Computational Exercises
In the following exercises, use the equal-tailed construction for two-sided confidence intervals, unless otherwise instructed.

The length of a certain machined part is supposed to be 10 centimeters but due to imperfections in the manufacturing process, the actual length is normally distributed with mean $\mu$ and variance $\sigma^2$. The variance is due to inherent factors in the process, which remain fairly stable over time. From historical data, it is known that $\sigma = 0.3$. On the other hand, $\mu$ may be set by adjusting various parameters in the process and hence may change to an unknown value fairly frequently. A sample of 100 parts has mean 10.2.
1. Construct the 95% confidence interval for μ .
2. Construct the 95% confidence upper bound for μ .
3. Construct the 95% confidence lower bound for μ .
Answer

Suppose that the weight of a bag of potato chips (in grams) is a normally distributed random variable with mean μ and
standard deviation σ, both unknown. A sample of 75 bags has mean 250 and standard deviation 10.

1. Construct the 90% confidence interval for μ .
2. Construct the 90% confidence interval for σ.
3. Construct a conservative 90% confidence rectangle for (μ, σ).
Answer

At a telemarketing firm, the length of a telephone solicitation (in seconds) is a normally distributed random variable with mean
μ and standard deviation σ, both unknown. A sample of 50 calls has mean length 300 and standard deviation 60.

1. Construct the 95% confidence upper bound for μ .


2. Construct the 95% confidence lower bound for σ.
Answer

At a certain farm the weight of a peach (in ounces) at harvest time is a normally distributed random variable with standard deviation 0.5. How many peaches must be sampled to estimate the mean weight with a margin of error ±2 and with 95% confidence?
Answer

The hourly salary for a certain type of construction work is a normally distributed random variable with standard deviation
$1.25 and unknown mean μ . How many workers must be sampled to construct a 95% confidence lower bound for μ with
margin of error $0.25?
Answer

Data Analysis Exercises


In Michelson's data, assume that the measured speed of light has a normal distribution with mean μ and standard deviation σ,
both unknown.
1. Construct the 95% confidence interval for μ . Is the “true” value of the speed of light in this interval?
2. Construct the 95% confidence interval for σ.
3. Explore, in an informal graphical way, the assumption that the underlying distribution is normal.
Answer

In Cavendish's data, assume that the measured density of the earth has a normal distribution with mean μ and standard
deviation σ, both unknown.
1. Construct the 95% confidence interval for μ . Is the “true” value of the density of the earth in this interval?
2. Construct the 95% confidence interval for σ.
3. Explore, in an informal graphical way, the assumption that the underlying distribution is normal.
Answer

In Short's data, assume that the measured parallax of the sun has a normal distribution with mean μ and standard deviation σ,
both unknown.
1. Construct the 95% confidence interval for μ . Is the “true” value of the parallax of the sun in this interval?
2. Construct the 95% confidence interval for σ.
3. Explore, in an informal graphical way, the assumption that the underlying distribution is normal.
Answer

Suppose that the length of an iris petal of a given type (Setosa, Verginica, or Versicolor) is normally distributed. Use Fisher's
iris data to construct 90% two-sided confidence intervals for each of the following parameters.
1. The mean length of a Setosa iris petal.
2. The mean length of a Verginica iris petal.
3. The mean length of a Versicolor iris petal.
Answer

This page titled 8.2: Estimation in the Normal Model is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

8.3: Estimation in the Bernoulli Model

Introduction
Recall that an indicator variable is a random variable that just takes the values 0 and 1. In applications, an indicator variable
indicates which of two complementary events in a random experiment has occurred. Typical examples include
A manufactured item subject to unavoidable random factors is either defective or acceptable.
A voter selected from a population either supports a particular candidate or does not.
A person selected from a population either does or does not have a particular medical condition.
A student in a class either passes or fails a standardized test.
A sample of radioactive material either does or does not emit an alpha particle in a specified ten-second period.
Recall also that the distribution of an indicator variable is known as the Bernoulli distribution, named for Jacob Bernoulli, and has
probability density function given by P(X = 1) = p , P(X = 0) = 1 − p , where p ∈ (0, 1) is the basic parameter. In the context
of the examples above,
p is the probability that the manufactured item is defective.
p is the proportion of voters in the population who favor the candidate.

p is the proportion of persons in the population that have the medical condition.

p is the probability that a student in the class will pass the exam.

p is the probability that the material will emit an alpha particle in the specified period.

Recall that the mean and variance of the Bernoulli distribution are E(X) = p and var(X) = p(1 − p) . Often in statistical
applications, p is unknown and must be estimated from sample data. In this section, we will see how to construct interval estimates
for the parameter from sample data. A parallel section on Tests in the Bernoulli Model is in the chapter on Hypothesis Testing.

The One-Sample Model


Preliminaries
Suppose that $X = (X_1, X_2, \ldots, X_n)$ is a random sample from the Bernoulli distribution with unknown parameter $p \in [0, 1]$. That is, $X$ is a sequence of Bernoulli trials. From the examples in the introduction above, note that often the underlying experiment is to
sample at random from a dichotomous population. When the sampling is with replacement, X really is a sequence of Bernoulli
trials. When the sampling is without replacement, the variables are dependent, but the Bernoulli model is still approximately valid
if the population size is large compared to the sample size n . For more on these points, see the discussion of sampling with and
without replacement in the chapter on Finite Sampling Models.
Note that the sample mean of our data vector $X$, namely
$$M = \frac{1}{n} \sum_{i=1}^n X_i \quad (8.3.1)$$
is the sample proportion of objects of the type of interest. By the central limit theorem, the standard score
$$Z = \frac{M - p}{\sqrt{p(1 - p)/n}} \quad (8.3.2)$$

has approximately a standard normal distribution and hence is (approximately) a pivot variable for $p$. For a given sample size $n$, the distribution of $Z$ is closest to normal when $p$ is near $\frac{1}{2}$ and farthest from normal when $p$ is near 0 or 1 (extreme). Because the pivot variable is (approximately) normally distributed, the construction of confidence intervals for $p$ in this model is similar to the construction of confidence intervals for the distribution mean $\mu$ in the normal model. But of course all of the confidence intervals so constructed are approximate.
As usual, for r ∈ (0, 1), let z(r) denote the quantile of order r for the standard normal distribution. Values of z(r) can be obtained
from the special distribution calculator, or from most statistical software packages.

Basic Confidence Intervals
For $\alpha \in (0, 1)$, the following are approximate $1 - \alpha$ confidence sets for $p$:
1. $\{p \in [0, 1] : M - z(1 - \alpha/2)\sqrt{p(1 - p)/n} \le p \le M + z(1 - \alpha/2)\sqrt{p(1 - p)/n}\}$
2. $\{p \in [0, 1] : p \le M + z(1 - \alpha)\sqrt{p(1 - p)/n}\}$
3. $\{p \in [0, 1] : M - z(1 - \alpha)\sqrt{p(1 - p)/n} \le p\}$
Proof

These confidence sets are actually intervals, known as the Wilson intervals, in honor of Edwin Wilson.

The confidence sets for $p$ in (1) are intervals. Let
$$U(z) = \frac{n}{n + z^2}\left(M + \frac{z^2}{2n} + z\sqrt{\frac{M(1 - M)}{n} + \frac{z^2}{4n^2}}\right) \quad (8.3.3)$$
Then the following have approximate confidence level $1 - \alpha$ for $p$:
1. The two-sided interval $[U[-z(1 - \alpha/2)],\ U[z(1 - \alpha/2)]]$.
2. The upper bound $U[z(1 - \alpha)]$.
3. The lower bound $U[-z(1 - \alpha)]$.
Proof
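A computational sketch of the Wilson interval (not part of the original text; it assumes Python with scipy, and the function name is ours):

```python
# Sketch: equal-tailed Wilson 1 - alpha confidence interval for p, via U(z) above.
from math import sqrt
from scipy.stats import norm

def wilson_interval(m, n, alpha=0.05):
    """m is the sample proportion M, n the sample size."""
    def u(z):
        return n / (n + z ** 2) * (m + z ** 2 / (2 * n)
                                   + z * sqrt(m * (1 - m) / n + z ** 2 / (4 * n ** 2)))
    z = norm.ppf(1 - alpha / 2)
    return u(-z), u(z)
```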

As usual, the equal-tailed confidence interval in (a) is not the only two-sided 1 − α confidence interval for p. We can divide the α
probability between the left and right tails of the standard normal distribution in any way that we please.

For α, r ∈ (0, 1), an approximate two-sided 1 − α confidence interval for p is [U [z(α − rα)], U [z(1 − rα)]] where U is the
function in (2).
Proof

In practice, the equal-tailed $1 - \alpha$ confidence interval in part (a) of (2), obtained by setting $r = \frac{1}{2}$, is the one that is always used. As $r \uparrow 1$, the right endpoint converges to the $1 - \alpha$ confidence upper bound in part (b), and as $r \downarrow 0$ the left endpoint converges to the $1 - \alpha$ confidence lower bound in part (c).

Simplified Confidence Intervals


Simplified approximate 1 − α confidence intervals for p can be obtained by replacing the distribution mean p by the sample mean
M in the extreme parts of the inequalities in (1).

For $\alpha \in (0, 1)$, the following have approximate confidence level $1 - \alpha$ for $p$:
1. The two-sided interval with endpoints $M \pm z(1 - \alpha/2)\sqrt{M(1 - M)/n}$.
2. The upper bound $M + z(1 - \alpha)\sqrt{M(1 - M)/n}$.
3. The lower bound $M - z(1 - \alpha)\sqrt{M(1 - M)/n}$.
Proof

These confidence intervals are known as Wald intervals, in honor of Abraham Wald. Note that the Wald interval can also be obtained from the Wilson intervals in (2) by assuming that $n$ is large compared to $z$, so that $n/(n + z^2) \approx 1$, $z^2/2n \approx 0$, and $z^2/4n^2 \approx 0$. Note that the two-sided interval in (a) is symmetric about the sample proportion $M$, but that the length of the interval, as well as the center, is random. This is the two-sided interval that is normally used.
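For comparison, a sketch of the Wald interval (same assumptions as the Wilson snippet above):

```python
# Sketch: equal-tailed Wald 1 - alpha confidence interval for p.
from math import sqrt
from scipy.stats import norm

def wald_interval(m, n, alpha=0.05):
    d = norm.ppf(1 - alpha / 2) * sqrt(m * (1 - m) / n)
    return m - d, m + d

# e.g. for the voter poll exercise below, wald_interval(427 / 1000, 1000)
# returns approximately (0.396, 0.458)
```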

Use the simulation of the proportion estimation experiment to explore the procedure. Use various values of p and various
confidence levels, sample sizes, and interval types. For each configuration, run the experiment 1000 times and compare the
proportion of successful intervals to the theoretical confidence level.

As always, the equal-tailed interval in (4) is not the only two-sided, 1 − α confidence interval.

For $\alpha, r \in (0, 1)$, an approximate two-sided $1 - \alpha$ confidence interval for $p$ is
$$\left[M - z(1 - r\alpha)\sqrt{\frac{M(1 - M)}{n}},\ M - z(\alpha - r\alpha)\sqrt{\frac{M(1 - M)}{n}}\right] \quad (8.3.5)$$
The interval with smallest length is the equal-tail interval with $r = \frac{1}{2}$.

Conservative Confidence Intervals


Note that the function $p \mapsto p(1 - p)$ on the interval $[0, 1]$ is maximized when $p = \frac{1}{2}$ and thus the maximum value is $\frac{1}{4}$. We can obtain conservative confidence intervals for $p$ from the basic confidence intervals by using this fact.

For $\alpha \in (0, 1)$, the following have approximate confidence level at least $1 - \alpha$ for $p$:
1. The two-sided interval with endpoints $M \pm z(1 - \alpha/2) \frac{1}{2\sqrt{n}}$.
2. The upper bound $M + z(1 - \alpha) \frac{1}{2\sqrt{n}}$.
3. The lower bound $M - z(1 - \alpha) \frac{1}{2\sqrt{n}}$.

Proof

Note that the confidence interval in (a) is symmetric about the sample proportion M and that the length of the interval is
deterministic. Of course, the conservative confidence intervals will be larger than the approximate simplified confidence intervals
in (4). The conservative estimate can be used to design the experiment. Recall that the margin of error is the distance between the
sample proportion M and an endpoint of the confidence interval.

A conservative estimate of the sample size $n$ needed to estimate $p$ with confidence $1 - \alpha$ and margin of error $d$ is
$$n = \left\lceil \frac{z_\alpha^2}{4 d^2} \right\rceil \quad (8.3.6)$$
where $z_\alpha = z(1 - \alpha/2)$ for the two-sided interval and $z_\alpha = z(1 - \alpha)$ for the confidence upper or lower bound.
Proof
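A computational sketch (same assumptions as the earlier snippets):

```python
# Sketch: conservative sample size for estimating p with margin of error d.
from math import ceil
from scipy.stats import norm

def bernoulli_sample_size(d, alpha=0.05, two_sided=True):
    z = norm.ppf(1 - alpha / 2) if two_sided else norm.ppf(1 - alpha)
    return ceil(z ** 2 / (4 * d ** 2))

# e.g. a two-sided margin of error of 0.03 at 95% confidence:
# bernoulli_sample_size(0.03) returns 1068
```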

As always, the equal-tailed interval in (7) is not the only two-sided, conservative, 1 − α confidence interval.

For $\alpha, r \in (0, 1)$, an approximate two-sided, conservative $1 - \alpha$ confidence interval for $p$ is
$$\left[M - z(1 - r\alpha) \frac{1}{2\sqrt{n}},\ M - z(\alpha - r\alpha) \frac{1}{2\sqrt{n}}\right] \quad (8.3.7)$$
The interval with smallest length is the equal-tail interval with $r = \frac{1}{2}$.

The Two-Sample Model


Preliminaries
Often we have two underlying Bernoulli distributions, with parameters $p_1, p_2 \in [0, 1]$, and we would like to estimate the difference $p_1 - p_2$. This problem could arise in the following typical examples:

In a quality control setting, suppose that $p_1$ is the proportion of defective items produced under one set of manufacturing conditions while $p_2$ is the proportion of defectives under a different set of conditions.
In an election, suppose that $p_1$ is the proportion of voters who favor a particular candidate at one point in the campaign, while $p_2$ is the proportion of voters who favor the candidate at a later point (perhaps after a scandal has erupted).
Suppose that $p_1$ is the proportion of students who pass a certain standardized test with the usual test preparation methods while $p_2$ is the proportion of students who pass the test with a new set of preparation methods.
Suppose that $p_1$ is the proportion of unvaccinated persons in a certain population who contract a certain disease, while $p_2$ is the proportion of vaccinated persons who contract the disease.


Note that several of these examples can be thought of as treatment-control problems. Of course, we could construct interval estimates $I_1$ for $p_1$ and $I_2$ for $p_2$ separately, as in the subsections above. But as we noted in the Introduction, if these two intervals have confidence level $1 - \alpha$, then the product set $I_1 \times I_2$ has confidence level $(1 - \alpha)^2$ for $(p_1, p_2)$. So if $p_1 - p_2$ is our parameter of interest, we will use a different approach.

Simplified Confidence Intervals


Suppose now that $X = (X_1, X_2, \ldots, X_{n_1})$ is a random sample of size $n_1$ from the Bernoulli distribution with parameter $p_1$, and $Y = (Y_1, Y_2, \ldots, Y_{n_2})$ is a random sample of size $n_2$ from the Bernoulli distribution with parameter $p_2$. We assume that the samples $X$ and $Y$ are independent. Let
$$M_1 = \frac{1}{n_1} \sum_{i=1}^{n_1} X_i, \quad M_2 = \frac{1}{n_2} \sum_{i=1}^{n_2} Y_i \quad (8.3.8)$$
denote the sample means (sample proportions) for the samples $X$ and $Y$. A natural point estimate for $p_1 - p_2$, and the building block for our interval estimate, is $M_1 - M_2$. As noted in the one-sample model, if $n_i$ is large, $M_i$ has an approximate normal distribution with mean $p_i$ and variance $p_i(1 - p_i)/n_i$ for $i \in \{1, 2\}$. Since the samples are independent, so are the sample means. Hence $M_1 - M_2$ has an approximate normal distribution with mean $p_1 - p_2$ and variance $p_1(1 - p_1)/n_1 + p_2(1 - p_2)/n_2$. We now have all the tools we need for a simplified, approximate confidence interval for $p_1 - p_2$.

For $\alpha \in (0, 1)$, the following have approximate confidence level $1 - \alpha$ for $p_1 - p_2$:
1. The two-sided interval with endpoints $(M_1 - M_2) \pm z(1 - \alpha/2)\sqrt{M_1(1 - M_1)/n_1 + M_2(1 - M_2)/n_2}$.
2. The lower bound $(M_1 - M_2) - z(1 - \alpha)\sqrt{M_1(1 - M_1)/n_1 + M_2(1 - M_2)/n_2}$.
3. The upper bound $(M_1 - M_2) + z(1 - \alpha)\sqrt{M_1(1 - M_1)/n_1 + M_2(1 - M_2)/n_2}$.

Proof
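A computational sketch (same assumptions as the earlier snippets):

```python
# Sketch: two-sided 1 - alpha interval for p1 - p2 from two independent samples.
from math import sqrt
from scipy.stats import norm

def two_sample_proportion_interval(k1, n1, k2, n2, alpha=0.05):
    """k1 successes out of n1 trials; k2 successes out of n2 trials."""
    m1, m2 = k1 / n1, k2 / n2
    se = sqrt(m1 * (1 - m1) / n1 + m2 * (1 - m2) / n2)
    z = norm.ppf(1 - alpha / 2)
    return (m1 - m2) - z * se, (m1 - m2) + z * se
```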

As always, the equal-tailed interval in (a) is not the only approximate two-sided 1 − α confidence interval.

For $\alpha, r \in (0, 1)$, an approximate $1 - \alpha$ confidence set for $p_1 - p_2$ is
$$\left[(M_1 - M_2) - z(1 - r\alpha)\sqrt{M_1(1 - M_1)/n_1 + M_2(1 - M_2)/n_2},\ (M_1 - M_2) - z(\alpha - r\alpha)\sqrt{M_1(1 - M_1)/n_1 + M_2(1 - M_2)/n_2}\right] \quad (8.3.11)$$

Proof

Conservative Confidence Intervals


Once again, $p \mapsto p(1 - p)$ is maximized when $p = \frac{1}{2}$ with maximum value $\frac{1}{4}$. We can use this to construct approximate conservative confidence intervals for $p_1 - p_2$.
For $\alpha \in (0, 1)$, the following have approximate confidence level at least $1 - \alpha$ for $p_1 - p_2$:
1. The two-sided interval with endpoints $(M_1 - M_2) \pm \frac{1}{2} z(1 - \alpha/2)\sqrt{1/n_1 + 1/n_2}$.
2. The lower bound $(M_1 - M_2) - \frac{1}{2} z(1 - \alpha)\sqrt{1/n_1 + 1/n_2}$.
3. The upper bound $(M_1 - M_2) + \frac{1}{2} z(1 - \alpha)\sqrt{1/n_1 + 1/n_2}$.

Proof

Computational Exercises
In a poll of 1000 registered voters in a certain district, 427 prefer candidate X. Construct the 95% two-sided confidence interval
for the proportion of all registered voters in the district that prefer X.
Answer

A coin is tossed 500 times and results in 302 heads. Construct the 95% confidence lower bound for the probability of heads.
Do you believe that the coin is fair?
Answer

A sample of 400 memory chips from a production line is tested, and 30 are defective. Construct the conservative 90% two-
sided confidence interval for the proportion of defective chips.

Answer

A drug company wants to estimate the proportion of persons who will experience an adverse reaction to a certain new drug.
The company wants a two-sided interval with margin of error 0.03 with 95% confidence. How large should the sample be?
Answer

An advertising agency wants to construct a 99% confidence lower bound for the proportion of dentists who recommend a
certain brand of toothpaste. The margin of error is to be 0.02. How large should the sample be?
Answer

The Buffon trial data set gives the results of 104 repetitions of Buffon's needle experiment. Theoretically, the data should
correspond to Bernoulli trials with p = 2/π, but because real students dropped the needle, the true value of p is unknown.
Construct a 95% confidence interval for p. Do you believe that p is the theoretical value?
Answer

A manufacturing facility has two production lines for a certain item. In a sample of 150 items from line 1, 12 are defective. From a sample of 130 items from line 2, 10 are defective. Construct the two-sided 95% confidence interval for $p_1 - p_2$, where $p_i$ is the proportion of defective items from line $i$, for $i \in \{1, 2\}$.

Answer

The vaccine for influenza is tailored each year to match the predicted dominant strain of influenza. Suppose that of 500 unvaccinated persons, 45 contracted the flu in a certain time period. Of 300 vaccinated persons, 20 contracted the flu in the same time period. Construct the two-sided 99% confidence interval for $p_1 - p_2$, where $p_1$ is the incidence of flu in the unvaccinated population and $p_2$ the incidence of flu in the vaccinated population.

This page titled 8.3: Estimation in the Bernoulli Model is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

8.4: Estimation in the Two-Sample Normal Model

As we have noted before, the normal distribution is perhaps the most important distribution in the study of mathematical statistics, in part
because of the central limit theorem. As a consequence of this theorem, measured quantities that are subject to numerous small, random
errors will have, at least approximately, normal distributions. Such variables are ubiquitous in statistical experiments, in subjects varying
from the physical and biological sciences to the social sciences.
In this section, we will study estimation problems in the two-sample normal model and in the bivariate normal model. This section parallels
the section on Tests in the Two-Sample Normal Model in the Chapter on Hypothesis Testing.

The Two-Sample Normal Model


Preliminaries
Suppose that $X = (X_1, X_2, \ldots, X_m)$ is a random sample of size $m$ from the normal distribution with mean $\mu$ and standard deviation $\sigma$, and that $Y = (Y_1, Y_2, \ldots, Y_n)$ is a random sample of size $n$ from the normal distribution with mean $\nu$ and standard deviation $\tau$. Moreover, suppose that the samples $X$ and $Y$ are independent. Usually, the parameters are unknown, so the parameter space for our vector of parameters $(\mu, \nu, \sigma, \tau)$ is $\mathbb{R}^2 \times (0, \infty)^2$.

This type of situation arises frequently when the random variables represent a measurement of interest for the objects of the population, and
the samples correspond to two different treatments. For example, we might be interested in the blood pressure of a certain population of
patients. The X vector records the blood pressures of a control sample, while the Y vector records the blood pressures of the sample
receiving a new drug. Similarly, we might be interested in the yield of an acre of corn. The X vector records the yields of a sample receiving
one type of fertilizer, while the Y vector records the yields of a sample receiving a different type of fertilizer.
Usually our interest is in a comparison of the parameters (either the means or standard deviations) for the two sampling distributions. In this section we will construct confidence intervals for the difference of the distribution means $\nu - \mu$ and for the ratio of the distribution variances $\tau^2/\sigma^2$. As with previous estimation problems, the construction depends on finding appropriate pivot variables.
2 2

For a generic sample $U = (U_1, U_2, \ldots, U_k)$ from a distribution with mean $a$, we will use our standard notation for the sample mean and for the sample variance:
$$M(U) = \frac{1}{k} \sum_{i=1}^k U_i \quad (8.4.1)$$
$$S^2(U) = \frac{1}{k-1} \sum_{i=1}^k [U_i - M(U)]^2 \quad (8.4.2)$$

We will need to also recall the special properties of these statistics when the sampling distribution is normal. The special pivot distributions
that will play a fundamental role in this section are the standard normal, the student t , and the Fisher F distributions. To construct our
interval estimates we will need the quantiles of these distributions. The quantiles can be computed using the special distribution calculator or
from most mathematical and statistical software packages. Here is the notation we will use:

Let p ∈ (0, 1) and let j, k ∈ N+.

1. z(p) denotes the quantile of order p for the standard normal distribution.
2. t_k(p) denotes the quantile of order p for the student t distribution with k degrees of freedom.
3. f_{j,k}(p) denotes the quantile of order p for the F distribution with j degrees of freedom in the numerator and k degrees of freedom in the denominator.

Recall that by symmetry, z(p) = −z(1 − p) and t_k(p) = −t_k(1 − p) for p ∈ (0, 1) and k ∈ N+. On the other hand, there is no simple relationship between the left and right tail probabilities of the F distribution.
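As an aside, these quantiles are easy to obtain in software. Here is a minimal sketch in Python using scipy.stats (the library choice is our assumption; any package with quantile functions would do):

```python
from scipy import stats

p = 0.975  # quantile order

z_p = stats.norm.ppf(p)              # z(p), standard normal quantile
t_p = stats.t.ppf(p, df=19)          # t_19(p), student t with 19 degrees of freedom
f_p = stats.f.ppf(p, dfn=9, dfd=14)  # f_{9,14}(p), F with 9 and 14 degrees of freedom

print(z_p, t_p, f_p)  # approximately 1.960, 2.093, and 3.2
```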

Confidence Intervals for the Difference of the Means with Known Variances
First we will construct confidence intervals for ν − μ under the assumption that the distribution variances σ² and τ² are known. This is not always an artificial assumption. As in the one-sample normal model, the variances are sometimes stable, and hence at least approximately known, while the means change under different treatments. First recall the following basic facts:

The difference of the sample means M(Y) − M(X) has the normal distribution with mean ν − μ and variance σ²/m + τ²/n. Hence the standard score of the difference of the sample means

$$ Z = \frac{[M(Y) - M(X)] - (\nu - \mu)}{\sqrt{\sigma^2 / m + \tau^2 / n}} \tag{8.4.3} $$

has the standard normal distribution. Thus, this variable is a pivot variable for ν − μ when σ, τ are known.

The basic confidence interval and upper and lower bound are now easy to construct.

For α ∈ (0, 1),

1. [M(Y) − M(X) − z(1 − α/2)√(σ²/m + τ²/n), M(Y) − M(X) + z(1 − α/2)√(σ²/m + τ²/n)] is a 1 − α confidence interval for ν − μ.
2. M(Y) − M(X) − z(1 − α)√(σ²/m + τ²/n) is a 1 − α confidence lower bound for ν − μ.
3. M(Y) − M(X) + z(1 − α)√(σ²/m + τ²/n) is a 1 − α confidence upper bound for ν − μ.
Proof

The two-sided interval in part (a) is the symmetric interval corresponding to α/2 in both tails of the standard normal distribution. As usual, we can construct more general two-sided intervals by partitioning α between the left and right tails in any way that we please.
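For illustration, here is a minimal sketch in Python of the symmetric interval in part (a); the function name and the scipy dependency are our assumptions, not part of the text:

```python
import math
from scipy import stats

def two_sample_z_interval(m_x, m_y, sigma, tau, m, n, alpha=0.05):
    """Symmetric 1 - alpha confidence interval for nu - mu,
    when the standard deviations sigma and tau are known."""
    z = stats.norm.ppf(1 - alpha / 2)          # z(1 - alpha/2)
    se = math.sqrt(sigma**2 / m + tau**2 / n)  # sqrt(sigma^2/m + tau^2/n)
    center = m_y - m_x                         # M(Y) - M(X)
    return center - z * se, center + z * se
```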

For every α, p ∈ (0, 1), a 1 − α confidence interval for ν − μ is

$$ \left[ M(Y) - M(X) - z(1 - p\alpha) \sqrt{\frac{\sigma^2}{m} + \frac{\tau^2}{n}},\; M(Y) - M(X) - z(\alpha - p\alpha) \sqrt{\frac{\sigma^2}{m} + \frac{\tau^2}{n}} \right] \tag{8.4.4} $$

1. p = 1/2 gives the symmetric two-sided interval.
2. p → 1 gives the interval with the confidence lower bound.
3. p → 0 gives the interval with the confidence upper bound.
Proof

The following theorem gives some basic properties of the length of this interval.

The (deterministic) length of the general two-sided confidence interval is

$$ L = [z(1 - p\alpha) - z(\alpha - p\alpha)] \sqrt{\frac{\sigma^2}{m} + \frac{\tau^2}{n}} \tag{8.4.6} $$

1. L is a decreasing function of m and a decreasing function of n.
2. L is an increasing function of σ and an increasing function of τ.
3. L is a decreasing function of α and hence an increasing function of the confidence level.
4. As a function of p, L decreases and then increases, with minimum value at p = 1/2.

Part (a) means that we can make the estimate more precise by increasing either or both sample sizes. Part (b) means that the estimate
becomes less precise as the variance in either distribution increases. Part (c) we have seen before. All other things being equal, we can
increase the confidence level only at the expense of making the estimate less precise. Part (d) means that the symmetric, equal-tail
confidence interval is the best of the two-sided intervals.

Confidence Intervals for the Difference of the Means with Unknown Variances
Our next method is a construction of confidence intervals for the difference of the means ν − μ without needing to know the standard
deviations σ and τ . However, there is a cost; we will assume that the standard deviations are the same, σ = τ , but the common value is
unknown. This assumption is reasonable if there is an inherent variability in the measurement variables that does not change even when
different treatments are applied to the objects in the population. We need to recall some basic facts from our study of special properties of
normal samples.

The pooled estimate of the common variance σ² is

$$ S^2(X, Y) = \frac{(m - 1) S^2(X) + (n - 1) S^2(Y)}{m + n - 2} \tag{8.4.7} $$

The random variable

$$ T = \frac{[M(Y) - M(X)] - (\nu - \mu)}{S(X, Y) \sqrt{1/m + 1/n}} \tag{8.4.8} $$

has the student t distribution with m + n − 2 degrees of freedom.

Note that S²(X, Y) is a weighted average of the sample variances, with the degrees of freedom as the weight factors. Note also that T is a pivot variable for ν − μ, and so we can construct confidence intervals for ν − μ in the usual way.

For α ∈ (0, 1),

1. [M(Y) − M(X) − t_{m+n−2}(1 − α/2) S(X, Y)√(1/m + 1/n), M(Y) − M(X) + t_{m+n−2}(1 − α/2) S(X, Y)√(1/m + 1/n)] is a 1 − α confidence interval for ν − μ.
2. M(Y) − M(X) − t_{m+n−2}(1 − α) S(X, Y)√(1/m + 1/n) is a 1 − α confidence lower bound for ν − μ.
3. M(Y) − M(X) + t_{m+n−2}(1 − α) S(X, Y)√(1/m + 1/n) is a 1 − α confidence upper bound for ν − μ.

Proof

The two-sided interval in part (a) is the symmetric interval corresponding to α/2 in both tails of the student t distribution. As usual, we can construct more general two-sided intervals by partitioning α between the left and right tails in any way that we please.
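For illustration, a minimal sketch in Python of the symmetric interval in part (a), using the pooled variance (8.4.7); the function name is ours and scipy is assumed:

```python
import math
from scipy import stats

def two_sample_t_interval(m_x, m_y, s2_x, s2_y, m, n, alpha=0.05):
    """Symmetric 1 - alpha confidence interval for nu - mu,
    assuming a common, unknown standard deviation."""
    df = m + n - 2
    s_pooled = math.sqrt(((m - 1) * s2_x + (n - 1) * s2_y) / df)  # S(X, Y)
    t = stats.t.ppf(1 - alpha / 2, df)                            # t_{m+n-2}(1 - alpha/2)
    center = m_y - m_x
    half = t * s_pooled * math.sqrt(1 / m + 1 / n)
    return center - half, center + half
```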

For every α, p ∈ (0, 1), a 1 − α confidence interval for ν − μ is

$$ \left[ M(Y) - M(X) - t_{m+n-2}(1 - p\alpha) S(X, Y) \sqrt{\frac{1}{m} + \frac{1}{n}},\; M(Y) - M(X) - t_{m+n-2}(\alpha - p\alpha) S(X, Y) \sqrt{\frac{1}{m} + \frac{1}{n}} \right] \tag{8.4.9} $$

1. p = 1/2 gives the symmetric two-sided interval.
2. p → 1 gives the interval with the confidence lower bound.
3. p → 0 gives the interval with the confidence upper bound.
Proof

The next result considers the length of the general two-sided interval.

The (random) length of the two-sided interval above is

$$ L = [t_{m+n-2}(1 - p\alpha) - t_{m+n-2}(\alpha - p\alpha)] S(X, Y) \sqrt{\frac{1}{m} + \frac{1}{n}} \tag{8.4.11} $$

1. L is a decreasing function of α and hence an increasing function of the confidence level.
2. As a function of p, L decreases and then increases, with minimum value at p = 1/2.

As in the case of known variances, part (a) means that, all other things being equal, we can increase the confidence level only at the expense of making the estimate less precise. Part (b) means that the symmetric, equal-tail confidence interval is the best of the two-sided intervals.

Confidence Intervals for the Ratio of the Variances


Our next construction will produce interval estimates for the ratio of the variances τ²/σ² (or, by taking square roots, for the ratio of the standard deviations τ/σ). Once again, we need to recall some basic facts from our study of special properties of random samples from the normal distribution.

The ratio

$$ U = \frac{S^2(X)\, \tau^2}{S^2(Y)\, \sigma^2} \tag{8.4.12} $$

has the F distribution with m − 1 degrees of freedom in the numerator and n − 1 degrees of freedom in the denominator, and hence this variable is a pivot variable for τ²/σ².

The pivot variable U can be used to construct confidence intervals for τ²/σ² in the usual way.

For α ∈ (0, 1),

1. [f_{m−1,n−1}(α/2) S²(Y)/S²(X), f_{m−1,n−1}(1 − α/2) S²(Y)/S²(X)] is a 1 − α confidence interval for τ²/σ².
2. f_{m−1,n−1}(α) S²(Y)/S²(X) is a 1 − α confidence lower bound for τ²/σ².
3. f_{m−1,n−1}(1 − α) S²(Y)/S²(X) is a 1 − α confidence upper bound for τ²/σ².

Note that since P[U ≥ f_{m−1,n−1}(α)] = 1 − α and τ²/σ² = U S²(Y)/S²(X), the lower bound uses the quantile of order α and the upper bound the quantile of order 1 − α.

Proof

The two-sided confidence interval in part (a) is the equal-tail confidence interval, and is the one commonly used. But as usual, we can
partition α between the left and right tails of the distribution of the pivot variable in any way that we please.
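Here is a minimal sketch of the equal-tail interval in Python (function name ours, scipy assumed):

```python
from scipy import stats

def variance_ratio_interval(s2_x, s2_y, m, n, alpha=0.05):
    """Equal-tail 1 - alpha confidence interval for tau^2 / sigma^2."""
    ratio = s2_y / s2_x  # S^2(Y) / S^2(X)
    lower = stats.f.ppf(alpha / 2, m - 1, n - 1) * ratio
    upper = stats.f.ppf(1 - alpha / 2, m - 1, n - 1) * ratio
    return lower, upper
```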

For every α, p ∈ (0, 1), a 1 − α confidence set for τ²/σ² is

$$ \left[ f_{m-1,n-1}(\alpha - p\alpha) \frac{S^2(Y)}{S^2(X)},\; f_{m-1,n-1}(1 - p\alpha) \frac{S^2(Y)}{S^2(X)} \right] \tag{8.4.13} $$

1. p = 1/2 gives the equal-tail, two-sided interval.
2. p → 1 gives the interval with the confidence upper bound.
3. p → 0 gives the interval with the confidence lower bound.
Proof

The length of the general confidence interval is considered next.

The (random) length of the general two-sided confidence interval above is

$$ L = [f_{m-1,n-1}(1 - p\alpha) - f_{m-1,n-1}(\alpha - p\alpha)] \frac{S^2(Y)}{S^2(X)} \tag{8.4.15} $$

Assuming that m > 5 and n > 1,

1. L is a decreasing function of α and hence an increasing function of the confidence level.
2. E(L) = [f_{m−1,n−1}(1 − pα) − f_{m−1,n−1}(α − pα)] (τ²/σ²) (m − 1)/(m − 3)
3. var(L) = 2[f_{m−1,n−1}(1 − pα) − f_{m−1,n−1}(α − pα)]² (τ⁴/σ⁴) [(m − 1)/(m − 3)]² (m + n − 4)/[(n − 1)(m − 5)]
Proof

Optimally, we might want to choose p so that E(L) is minimized. However, this is difficult computationally, and fortunately the equal-tail interval with p = 1/2 is not too far from optimal when the sample sizes m and n are large.

Estimation in the Bivariate Normal Model


In this subsection, we consider a model that is superficially similar to the two-sample normal model, but is actually much simpler. Suppose that

$$ ((X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)) \tag{8.4.16} $$

is a random sample of size n from the bivariate normal distribution of a random vector (X, Y), with E(X) = μ, E(Y) = ν, var(X) = σ², var(Y) = τ², and cov(X, Y) = δ.

Thus, instead of a pair of samples, we have a sample of pairs. This type of model frequently arises in before-and-after experiments, in which a measurement of interest is recorded for a sample of n objects from the population, both before and after a treatment. For example, we could record the blood pressure of a sample of n patients, before and after the administration of a certain drug. The critical point is that in this model, X_i and Y_i are measurements made on the same underlying object in the sample. As with the two-sample normal model, the interest is usually in estimating the difference of the means.


We will use our usual notation for the sample means and variances of X = (X_1, X_2, …, X_n) and Y = (Y_1, Y_2, …, Y_n). Recall also that the sample covariance of (X, Y) is

$$ S(X, Y) = \frac{1}{n - 1} \sum_{i=1}^n [X_i - M(X)][Y_i - M(Y)] \tag{8.4.17} $$

(not to be confused with the pooled estimate of the standard deviation in the two-sample model).

The vector of differences Y − X = (Y_1 − X_1, Y_2 − X_2, …, Y_n − X_n) is a random sample of size n from the distribution of Y − X, which is normal with

1. E(Y − X) = ν − μ
2. var(Y − X) = σ² + τ² − 2δ

The sample mean and variance of the sample of differences are given by

1. M(Y − X) = M(Y) − M(X)
2. S²(Y − X) = S²(X) + S²(Y) − 2S(X, Y)

Thus, the sample of differences Y − X fits the normal model for a single variable. The section on Estimation in the Normal Model can be used to obtain confidence sets and intervals for the parameters (ν − μ, σ² + τ² − 2δ).
2 2
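To make the reduction to the one-variable normal model concrete, here is a minimal sketch in Python of the standard t interval applied to the differences (our own helper, with numpy and scipy assumed):

```python
import numpy as np
from scipy import stats

def paired_t_interval(x, y, alpha=0.05):
    """Symmetric 1 - alpha confidence interval for nu - mu,
    based on the sample of differences y - x."""
    d = np.asarray(y) - np.asarray(x)
    n = len(d)
    t = stats.t.ppf(1 - alpha / 2, n - 1)
    half = t * d.std(ddof=1) / np.sqrt(n)  # S(Y - X) / sqrt(n), scaled by the t quantile
    return d.mean() - half, d.mean() + half
```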

In the setting of this subsection, suppose that X = (X_1, X_2, …, X_n) and Y = (Y_1, Y_2, …, Y_n) are independent. Mathematically this fits both models—the two-sample normal model and the bivariate normal model. Which procedure would work better for estimating the difference of means ν − μ?

1. If the standard deviations σ and τ are known.
2. If the standard deviations σ and τ are unknown.
Answer

Although the setting in the last problem fits both models mathematically, only one model would make sense in a real problem. Again, the critical point is whether (X_i, Y_i) makes sense as a pair of random variables (measurements) corresponding to a given object in the sample.

Computational Exercises
A new drug is being developed to reduce a certain blood chemical. A sample of 36 patients are given a placebo while a sample of 49
patients are given the drug. Let X denote the measurement for a patient given the placebo and Y the measurement for a patient given
the drug (in mg). The statistics are m(x) = 87 , s(x) = 4 , m(y) = 63 , s(y) = 6 .
1. Compute the 90% confidence interval for τ /σ.
2. Assuming that σ = τ , compute the 90% confidence interval for ν − μ .
3. Based on (a), is the assumption that σ = τ reasonable?
4. Based on (b), is the drug effective?
Answer

A company claims that an herbal supplement improves intelligence. A sample of 25 persons are given a standard IQ test before and after
taking the supplement. Let X denote the IQ of a subject before taking the supplement and Y the IQ of the subject after the supplement.
The before and after statistics are m(x) = 105, s(x) = 13 , m(y) = 110 , s(y) = 17 , s(x, y) = 190 . Do you believe the company's
claim?
Answer

In Fisher's iris data, let X denote the petal length of a Versicolor iris and Y the petal length of a Virginica iris.
1. Compute the 90% confidence interval for τ /σ.
2. Assuming that σ = τ , compute the 90% confidence interval for ν − μ .
3. Based on (a), is the assumption that σ = τ reasonable?
Answer

A plant has two machines that produce a circular rod whose diameter (in cm) is critical. Let X denote the diameter of a rod from the first machine and Y the diameter of a rod from the second machine. A sample of 100 rods from the first machine has mean 10.3 and standard deviation 1.2. A sample of 100 rods from the second machine has mean 9.8 and standard deviation 1.6.
1. Compute the 90% confidence interval for τ /σ.
2. Assuming that σ = τ , compute the 90% confidence interval for ν − μ .
3. Based on (a), is the assumption that σ = τ reasonable?
Answer

This page titled 8.4: Estimation in the Two-Sample Normal Model is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon
request.

8.5: Bayesian Set Estimation

Basic Theory
As usual, our starting point is a random experiment with an underlying sample space and a probability measure P. In the basic statistical model, we have an observable random variable X taking values in a set S. In general, X can have quite a complicated structure. For example, if the experiment is to sample n objects from a population and record various measurements of interest, then

$$ X = (X_1, X_2, \ldots, X_n) \tag{8.5.1} $$

where X_i is the vector of measurements for the ith object.

Suppose also that the distribution of X depends on a parameter θ taking values in a parameter space Θ. The parameter may also be vector valued, in which case Θ ⊆ R^k for some k ∈ N+ and the parameter has the form θ = (θ_1, θ_2, …, θ_k).

The Bayesian Formulation


Recall that in Bayesian analysis, the unknown parameter θ is treated as a random variable. Specifically, suppose that the conditional
probability density function of the data vector X given θ ∈ Θ is denoted f (x ∣ θ) for x ∈ S . Moreover, the parameter θ is given a
prior distribution with probability density function h on Θ. (The prior distribution is often subjective, and is chosen to reflect our
knowledge, if any, of the parameter.) The joint probability density function of the data vector and the parameter is
(x, θ) ↦ h(θ)f (x ∣ θ); (x, θ) ∈ S × Θ (8.5.2)

Next, the (unconditional) probability density function of X is the function f given by

$$ f(x) = \sum_{\theta \in \Theta} h(\theta) f(x \mid \theta), \quad x \in S \tag{8.5.3} $$

if the parameter has a discrete distribution, or by

$$ f(x) = \int_\Theta h(\theta) f(x \mid \theta) \, d\theta, \quad x \in S \tag{8.5.4} $$

if the parameter has a continuous distribution. Finally, by Bayes' theorem, the posterior probability density function of θ given x ∈ S is

$$ h(\theta \mid x) = \frac{h(\theta) f(x \mid \theta)}{f(x)}, \quad \theta \in \Theta \tag{8.5.5} $$

In some cases, we can recognize the posterior distribution from the functional form of θ ↦ h(θ)f (x ∣ θ) without having to actually
compute the normalizing constant f (x), and thus reducing the computational burden significantly. In particular, this is often the case
when we have a conjugate parametric family of distributions of θ . Recall that this means that when the prior distribution of θ belongs
to the family, so does the posterior distribution given x ∈ S .

Confidence Sets
Now let C (X) be a confidence set (that is, a subset of the parameter space that depends on the data variable X but no unknown
parameters). One possible definition of a 1 − α level Bayesian confidence set requires that

P [θ ∈ C (x) ∣ X = x] = 1 − α (8.5.6)

In this definition, only θ is random and thus the probability above is computed using the posterior probability density function
θ ↦ h(θ ∣ x) . Another possible definition requires that

P [θ ∈ C (X)] = 1 − α (8.5.7)

In this definition, X and θ are both random, and so the probability above would be computed using the joint probability density
function (x, θ) ↦ h(θ)f (x ∣ θ) . Whatever the philosophical arguments may be, the first definition is certainly the easier one from a
computational viewpoint, and hence is the one most commonly used.
Let us compare the classical and Bayesian approaches. In the classical approach, the parameter θ is deterministic, but unknown. Before
the data are collected, the confidence set C (X) (which is random by virtue of X) will contain the parameter with probability 1 − α .
After the data are collected, the computed confidence set C (x) either contains θ or does not, and we will usually never know which. By

contrast in a Bayesian confidence set, the random parameter θ falls in the computed, deterministic confidence set C (x) with probability
1 −α.

Real Parameters
Suppose that θ is real valued, so that Θ ⊆ R. For r ∈ (0, 1), we can compute the 1 − α level Bayesian confidence interval as [U_{(1−r)α}(x), U_{1−rα}(x)], where U_p(x) is the quantile of order p for the posterior distribution of θ given X = x. As in past sections, r is the fraction of α in the right tail of the posterior distribution and 1 − r is the fraction of α in the left tail of the posterior distribution. As usual, r = 1/2 gives the symmetric, two-sided confidence interval; letting r → 0 gives the confidence lower bound; and letting r → 1 gives the confidence upper bound.

Random Samples
In terms of our data vector X, the most important special case arises when we have a basic variable X with values in a set R, and given θ, X = (X_1, X_2, …, X_n) is a random sample of size n from X. That is, given θ, X is a sequence of independent, identically distributed variables, each with the same distribution as X given θ. Thus S = R^n and if X has conditional probability density function g(x ∣ θ), then

$$ f(x \mid \theta) = g(x_1 \mid \theta) g(x_2 \mid \theta) \cdots g(x_n \mid \theta), \quad x = (x_1, x_2, \ldots, x_n) \in S \tag{8.5.8} $$

Applications
The Bernoulli Distribution
Suppose that X = (X_1, X_2, …, X_n) is a random sample of size n from the Bernoulli distribution with unknown success parameter p ∈ (0, 1). In the usual language of reliability, X_i = 1 means success on trial i and X_i = 0 means failure on trial i. The distribution is named for Jacob Bernoulli. Recall that the Bernoulli distribution has probability density function (given p)

$$ g(x \mid p) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\} \tag{8.5.9} $$

Note that the number of successes in the n trials is Y = ∑_{i=1}^n X_i. Given p, the random variable Y has the binomial distribution with parameters n and p.

In our previous discussion of Bayesian estimation, we showed that the beta distribution is conjugate for p. Specifically, if the prior distribution of p is beta with left parameter a > 0 and right parameter b > 0, then the posterior distribution of p given X is beta with left parameter a + Y and right parameter b + (n − Y); the left parameter is increased by the number of successes and the right parameter by the number of failures. It follows that a 1 − α level Bayesian confidence interval for p is [U_{α/2}(y), U_{1−α/2}(y)] where U_r(y) is the quantile of order r for the posterior beta distribution. In the special case a = b = 1, the prior distribution is uniform on (0, 1) and reflects a lack of previous knowledge about p.
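For concreteness, a minimal sketch in Python of the equal-tail interval from the posterior beta distribution (the function name is ours, scipy assumed):

```python
from scipy import stats

def bernoulli_posterior_interval(y, n, a=1.0, b=1.0, alpha=0.05):
    """Equal-tail 1 - alpha Bayesian confidence interval for p,
    with a beta(a, b) prior and y successes in n Bernoulli trials."""
    posterior = stats.beta(a + y, b + (n - y))
    return posterior.ppf(alpha / 2), posterior.ppf(1 - alpha / 2)

# e.g. 30 heads in 50 tosses with the uniform prior a = b = 1
print(bernoulli_posterior_interval(30, 50))  # roughly (0.46, 0.73)
```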

Suppose that we have a coin with an unknown probability p of heads, and that we give p the uniform prior, reflecting our lack of
knowledge about p. We then toss the coin 50 times, observing 30 heads.
1. Find the posterior distribution of p given the data.
2. Construct the 95% Bayesian confidence interval.
3. Construct the classical Wald confidence interval at the 95% level.
Answer

The Poisson Distribution


Suppose that X = (X_1, X_2, …, X_n) is a random sample of size n from the Poisson distribution with parameter λ ∈ (0, ∞). Recall that the Poisson distribution is often used to model the number of “random points” in a region of time or space, and is studied in more detail in the chapter on the Poisson Process. The distribution is named for the inimitable Simeon Poisson and, given λ, has probability density function

$$ g(x \mid \lambda) = e^{-\lambda} \frac{\lambda^x}{x!}, \quad x \in \mathbb{N} \tag{8.5.10} $$

As usual, we will denote the sum of the sample values by Y = ∑_{i=1}^n X_i. Given λ, the random variable Y also has a Poisson distribution, but with parameter nλ.
In our previous discussion of Bayesian estimation, we showed that the gamma distribution is conjugate for λ. Specifically, if the prior distribution of λ is gamma with shape parameter k > 0 and rate parameter r > 0 (so that the scale parameter is 1/r), then the posterior distribution of λ given X is gamma with shape parameter k + Y and rate parameter r + n. It follows that a 1 − α level Bayesian confidence interval for λ is [U_{α/2}(y), U_{1−α/2}(y)] where U_p(y) is the quantile of order p for the posterior gamma distribution.

Consider the alpha emissions data, which we believe come from a Poisson distribution with unknown parameter λ. Suppose that a priori, we believe that λ is about 5, so we give λ a prior gamma distribution with shape parameter 5 and rate parameter 1. (Thus the mean is 5 and the standard deviation is √5 ≈ 2.236.)
1. Find the posterior distribution of λ given the data.
2. Construct the 95% Bayesian confidence interval.
3. Construct the classical t confidence interval at the 95% level.
Answer

The Normal Distribution


Suppose that X = (X_1, X_2, …, X_n) is a random sample of size n from the normal distribution with unknown mean μ ∈ R and known variance σ² ∈ (0, ∞). Of course, the normal distribution plays an especially important role in statistics, in part because of the central limit theorem. The normal distribution is widely used to model physical quantities subject to numerous small, random errors. Recall that the normal probability density function (given the parameters) is

$$ g(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\, \sigma} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right], \quad x \in \mathbb{R} \tag{8.5.11} $$

We denote the sum of the sample values by Y = ∑_{i=1}^n X_i. Recall that Y also has a normal distribution (given μ and σ), but with mean nμ and variance nσ².

In our previous discussion of Bayesian estimation, we showed that the normal distribution is conjugate for μ (with σ known). Specifically, if the prior distribution of μ is normal with mean a ∈ R and standard deviation b ∈ (0, ∞), then the posterior distribution of μ given X is also normal, with

$$ E(\mu \mid X) = \frac{Y b^2 + a \sigma^2}{\sigma^2 + n b^2}, \quad \operatorname{var}(\mu \mid X) = \frac{\sigma^2 b^2}{\sigma^2 + n b^2} \tag{8.5.12} $$

It follows that a 1 − α level Bayesian confidence interval for μ is [U_{α/2}(y), U_{1−α/2}(y)] where U_p(y) is the quantile of order p for the posterior normal distribution. An interesting special case is when b = σ, so that the standard deviation of the prior distribution of μ is the same as the standard deviation of the sampling distribution. In this case, the posterior mean is (Y + a)/(n + 1) and the posterior variance is σ²/(n + 1).
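A minimal sketch in Python of the interval based on (8.5.12) (function name ours, scipy assumed):

```python
import math
from scipy import stats

def normal_posterior_interval(y, n, sigma, a, b, alpha=0.05):
    """Equal-tail 1 - alpha Bayesian confidence interval for mu,
    with a normal(a, b) prior, known sigma, and sample sum y from n observations."""
    mean = (y * b**2 + a * sigma**2) / (sigma**2 + n * b**2)
    sd = math.sqrt(sigma**2 * b**2 / (sigma**2 + n * b**2))
    posterior = stats.norm(mean, sd)
    return posterior.ppf(alpha / 2), posterior.ppf(1 - alpha / 2)
```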

The length of a certain machined part is supposed to be 10 centimeters, but due to imperfections in the manufacturing process, the actual length is normally distributed with mean μ and variance σ². The variance is due to inherent factors in the process, which remain fairly stable over time. From historical data, it is known that σ = 0.3. On the other hand, μ may be set by adjusting various parameters in the process and hence may change to an unknown value fairly frequently. Thus, suppose that we give μ a prior normal distribution with mean 10 and standard deviation 0.03. A sample of 100 parts has mean 10.2.
1. Find the posterior distribution of μ given the data.
2. Construct the 95% Bayesian confidence interval.
3. Construct the classical z confidence interval at the 95% level.
Answer

The Beta Distribution


Suppose that X = (X_1, X_2, …, X_n) is a random sample of size n from the beta distribution with unknown left shape parameter a ∈ (0, ∞) and right shape parameter b = 1. The beta distribution is widely used to model random proportions and probabilities and other variables that take values in bounded intervals. Recall that the probability density function (given a) is

$$ g(x \mid a) = a x^{a - 1}, \quad x \in (0, 1) \tag{8.5.13} $$

We denote the product of the sample values by W = X_1 X_2 ⋯ X_n.


In our previous discussion of Bayesian estimation, we showed that the gamma distribution is conjugate for a. Specifically, if the prior distribution of a is gamma with shape parameter k > 0 and rate parameter r > 0, then the posterior distribution of a given X is also gamma, with shape parameter k + n and rate parameter r − ln(W). It follows that a 1 − α level Bayesian confidence interval for a is [U_{α/2}(w), U_{1−α/2}(w)] where U_p(w) is the quantile of order p for the posterior gamma distribution. In the special case that k = 1, the prior distribution of a is exponential with rate parameter r.
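A minimal sketch in Python (our own helper, scipy assumed); note that ln(W) is negative here, so the posterior rate r − ln(W) is positive:

```python
import math
from scipy import stats

def beta_shape_posterior_interval(data, k, r, alpha=0.05):
    """Equal-tail 1 - alpha Bayesian confidence interval for the left
    shape parameter a, with a gamma prior (shape k, rate r) and beta(a, 1) data."""
    n = len(data)
    log_w = sum(math.log(x) for x in data)  # ln(W), the log of the sample product
    posterior = stats.gamma(a=k + n, scale=1.0 / (r - log_w))
    return posterior.ppf(alpha / 2), posterior.ppf(1 - alpha / 2)
```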

Suppose that the resistance of an electrical component (in Ohms) has the beta distribution with unknown left parameter a and right
parameter b = 1 . We believe that a may be about 10, so we give a the prior gamma distribution with shape parameter 10 and rate
parameter 1. We sample 20 components and observe the data
0.98, 0.93, 0.99, 0.89, 0.79, 0.99, 0.92, 0.97, 0.88, 0.97, 0.86, 0.84, 0.96, 0.97, 0.92, 0.90, 0.98, 0.96, 0.96, 1.00 (8.5.14)

1. Find the posterior distribution of a .


2. Construct the 95% Bayesian confidence interval for a .
Answer

The Pareto Distribution


Suppose that X = (X_1, X_2, …, X_n) is a random sample of size n from the Pareto distribution with shape parameter a ∈ (0, ∞) and scale parameter b = 1. The Pareto distribution is used to model certain financial variables and other variables with heavy-tailed distributions, and is named for Vilfredo Pareto. Recall that the probability density function (given a) is

$$ g(x \mid a) = \frac{a}{x^{a + 1}}, \quad x \in [1, \infty) \tag{8.5.15} $$

We denote the product of the sample values by W = X_1 X_2 ⋯ X_n.


In our previous discussion of Bayesian estimation, we showed that the gamma distribution is conjugate for a. Specifically, if the prior distribution of a is gamma with shape parameter k > 0 and rate parameter r > 0, then the posterior distribution of a given X is also gamma, with shape parameter k + n and rate parameter r + ln(W). It follows that a 1 − α level Bayesian confidence interval for a is [U_{α/2}(w), U_{1−α/2}(w)] where U_p(w) is the quantile of order p for the posterior gamma distribution. In the special case that k = 1, the prior distribution of a is exponential with rate parameter r.

Suppose that a financial variable has the Pareto distribution with unknown shape parameter a and scale parameter b = 1 . We
believe that a may be about 4, so we give a the prior gamma distribution with shape parameter 4 and rate parameter 1. A random
sample of size 20 from the variable gives the data
1.09, 1.13, 2.00, 1.43, 1.26, 1.00, 1.36, 1.03, 1.46, 1.18, 2.16, 1.16, 1.22, 1.06, 1.28, 1.23, 1.11, 1.03, 1.04, 1.05 (8.5.16)

1. Find the posterior distribution of a .


2. Construct the 95% Bayesian confidence interval for a .
Answer

This page titled 8.5: Bayesian Set Estimation is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon
request.

CHAPTER OVERVIEW
9: Hypothesis Testing
Hypothesis testing refers to the process of choosing between competing hypotheses about a probability distribution, based on
observed data from the distribution. It is a core topic in mathematical statistics, and indeed is a fundamental part of the language of
statistics. In this chapter, we study the basics of hypothesis testing, and explore hypothesis tests in some of the most important
parametric models: the normal model and the Bernoulli model.
9.1: Introduction to Hypothesis Testing
9.2: Tests in the Normal Model
9.3: Tests in the Bernoulli Model
9.4: Tests in the Two-Sample Normal Model
9.5: Likelihood Ratio Tests
9.6: Chi-Square Tests

This page titled 9: Hypothesis Testing is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

9.1: Introduction to Hypothesis Testing

Basic Theory
Preliminaries
As usual, our starting point is a random experiment with an underlying sample space and a probability measure P. In the basic
statistical model, we have an observable random variable X taking values in a set S . In general, X can have quite a complicated
structure. For example, if the experiment is to sample n objects from a population and record various measurements of interest,
then

$$ X = (X_1, X_2, \ldots, X_n) \tag{9.1.1} $$

where X_i is the vector of measurements for the ith object. The most important special case occurs when (X_1, X_2, …, X_n) are independent and identically distributed. In this case, we have a random sample of size n from the common distribution.
The purpose of this section is to define and discuss the basic concepts of statistical hypothesis testing. Collectively, these concepts
are sometimes referred to as the Neyman-Pearson framework, in honor of Jerzy Neyman and Egon Pearson, who first formalized
them.

Hypotheses
A statistical hypothesis is a statement about the distribution of X. Equivalently, a statistical hypothesis specifies a set of
possible distributions of X: the set of distributions for which the statement is true. A hypothesis that specifies a single
distribution for X is called simple; a hypothesis that specifies more than one distribution for X is called composite.

In hypothesis testing, the goal is to see if there is sufficient statistical evidence to reject a presumed null hypothesis in favor of a conjectured alternative hypothesis. The null hypothesis is usually denoted H_0 while the alternative hypothesis is usually denoted H_1.

An hypothesis test is a statistical decision; the conclusion will either be to reject the null hypothesis in favor of the alternative, or to fail to reject the null hypothesis. The decision that we make must, of course, be based on the observed value x of the data vector X. Thus, we will find an appropriate subset R of the sample space S and reject H_0 if and only if x ∈ R. The set R is known as the rejection region or the critical region. Note the asymmetry between the null and alternative hypotheses. This asymmetry is due to the fact that we assume the null hypothesis, in a sense, and then see if there is sufficient evidence in x to overturn this assumption in favor of the alternative.
An hypothesis test is a statistical analogy to proof by contradiction, in a sense. Suppose for a moment that H_1 is a statement in a mathematical theory and that H_0 is its negation. One way that we can prove H_1 is to assume H_0 and work our way logically to a contradiction. In an hypothesis test, we don't “prove” anything of course, but there are similarities. We assume H_0 and then see if the data x are sufficiently at odds with that assumption that we feel justified in rejecting H_0 in favor of H_1.

Often, the critical region is defined in terms of a statistic w(X), known as a test statistic, where w is a function from S into another set T. We find an appropriate rejection region R_T ⊆ T and reject H_0 when the observed value w(x) ∈ R_T. Thus, the rejection region in S is then R = w^{−1}(R_T) = {x ∈ S : w(x) ∈ R_T}. As usual, the use of a statistic often allows significant data reduction when the dimension of the test statistic is much smaller than the dimension of the data vector.

Errors
The ultimate decision may be correct or may be in error. There are two types of errors, depending on which of the hypotheses is
actually true.

Types of errors:

1. A type 1 error is rejecting the null hypothesis H_0 when H_0 is true.
2. A type 2 error is failing to reject the null hypothesis H_0 when the alternative hypothesis H_1 is true.
Similarly, there are two ways to make a correct decision: we could reject H_0 when H_1 is true, or we could fail to reject H_0 when H_0 is true. The possibilities are summarized in the following table:

Hypothesis Test
State \ Decision | Fail to reject H_0 | Reject H_0
H_0 True         | Correct            | Type 1 error
H_1 True         | Type 2 error       | Correct

Of course, when we observe X = x and make our decision, either we will have made the correct decision or we will have committed an error, and usually we will never know which of these events has occurred. Prior to gathering the data, however, we can consider the probabilities of the various errors.

If H_0 is true (that is, the distribution of X is specified by H_0), then P(X ∈ R) is the probability of a type 1 error for this distribution. If H_0 is composite, then H_0 specifies a variety of different distributions for X and thus there is a set of type 1 error probabilities.

The maximum probability of a type 1 error, over the set of distributions specified by H_0, is the significance level of the test or the size of the critical region.

The significance level is often denoted by α. Usually, the rejection region is constructed so that the significance level is a prescribed, small value (typically 0.1, 0.05, 0.01).

If H_1 is true (that is, the distribution of X is specified by H_1), then P(X ∉ R) is the probability of a type 2 error for this distribution. Again, if H_1 is composite then H_1 specifies a variety of different distributions for X, and thus there will be a set of type 2 error probabilities. Generally, there is a tradeoff between the type 1 and type 2 error probabilities. If we reduce the probability of a type 1 error, by making the rejection region R smaller, we necessarily increase the probability of a type 2 error because the complementary region S ∖ R is larger.

The extreme cases can give us some insight. First consider the decision rule in which we never reject H_0, regardless of the evidence x. This corresponds to the rejection region R = ∅. A type 1 error is impossible, so the significance level is 0. On the other hand, the probability of a type 2 error is 1 for any distribution defined by H_1. At the other extreme, consider the decision rule in which we always reject H_0, regardless of the evidence x. This corresponds to the rejection region R = S. A type 2 error is impossible, but now the probability of a type 1 error is 1 for any distribution defined by H_0. In between these two worthless tests are meaningful tests that take the evidence x into account.

Power
If H_1 is true, so that the distribution of X is specified by H_1, then P(X ∈ R), the probability of rejecting H_0, is the power of the test for that distribution.

Thus the power of the test for a distribution specified by H_1 is the probability of making the correct decision.

Suppose that we have two tests, corresponding to rejection regions R_1 and R_2, respectively, each having significance level α. The test with region R_1 is uniformly more powerful than the test with region R_2 if

$$ P(X \in R_1) \ge P(X \in R_2) \text{ for every distribution of } X \text{ specified by } H_1 \tag{9.1.2} $$

Naturally, in this case, we would prefer the first test. Often, however, two tests will not be uniformly ordered; one test will be more powerful for some distributions specified by H_1 while the other test will be more powerful for other distributions specified by H_1.

If a test has significance level α and is uniformly more powerful than any other test with significance level α, then the test is said to be a uniformly most powerful test at level α.

Clearly a uniformly most powerful test is the best we can do.

P-value

In most cases, we have a general procedure that allows us to construct a test (that is, a rejection region R_α) for any given significance level α ∈ (0, 1). Typically, R_α decreases (in the subset sense) as α decreases.

The P-value of the observed value x of X, denoted P(x), is defined to be the smallest α for which x ∈ R_α; that is, the smallest significance level for which H_0 is rejected, given X = x.

Knowing P(x) allows us to test H_0 at any significance level for the given data x: If P(x) ≤ α then we would reject H_0 at significance level α; if P(x) > α then we fail to reject H_0 at significance level α. Note that P(X) is a statistic. Informally, P(x) can often be thought of as the probability of an outcome “as or more extreme” than the observed value x, where extreme is interpreted relative to the null hypothesis H_0.

Analogy with Justice Systems


There is a helpful analogy between statistical hypothesis testing and the criminal justice system in the US and various other
countries. Consider a person charged with a crime. The presumed null hypothesis is that the person is innocent of the crime; the
conjectured alternative hypothesis is that the person is guilty of the crime. The test of the hypotheses is a trial with evidence
presented by both sides playing the role of the data. After considering the evidence, the jury delivers the decision as either not
guilty or guilty. Note that innocent is not a possible verdict of the jury, because it is not the point of the trial to prove the person
innocent. Rather, the point of the trial is to see whether there is sufficient evidence to overturn the null hypothesis that the person is
innocent in favor of the alternative hypothesis of that the person is guilty. A type 1 error is convicting a person who is innocent; a
type 2 error is acquitting a person who is guilty. Generally, a type 1 error is considered the more serious of the two possible errors,
so in an attempt to hold the chance of a type 1 error to a very low level, the standard for conviction in serious criminal cases is
beyond a reasonable doubt.

Tests of an Unknown Parameter


Hypothesis testing is a very general concept, but an important special class occurs when the distribution of the data variable X depends on a parameter θ taking values in a parameter space Θ. The parameter may be vector-valued, so that θ = (θ_1, θ_2, …, θ_k) and Θ ⊆ R^k for some k ∈ N+. The hypotheses generally take the form

$$ H_0: \theta \in \Theta_0 \text{ versus } H_1: \theta \notin \Theta_0 \tag{9.1.3} $$

where Θ_0 is a prescribed subset of the parameter space Θ. In this setting, the probabilities of making an error or a correct decision depend on the true value of θ. If R is the rejection region, then the power function Q is given by

$$ Q(\theta) = P_\theta(X \in R), \quad \theta \in \Theta \tag{9.1.4} $$

The power function gives a lot of information about the test.

The power function satisfies the following properties:

1. Q(θ) is the probability of a type 1 error when θ ∈ Θ_0.
2. max{Q(θ) : θ ∈ Θ_0} is the significance level of the test.
3. 1 − Q(θ) is the probability of a type 2 error when θ ∉ Θ_0.
4. Q(θ) is the power of the test when θ ∉ Θ_0.
If we have two tests, we can compare them by means of their power functions.

Suppose that we have two tests, corresponding to rejection regions R_1 and R_2, respectively, each having significance level α. The test with rejection region R_1 is uniformly more powerful than the test with rejection region R_2 if Q_1(θ) ≥ Q_2(θ) for all θ ∉ Θ_0.
Most hypothesis tests of an unknown real parameter θ fall into three special cases:

Suppose that θ is a real parameter and θ_0 ∈ Θ a specified value. The tests below are respectively the two-sided test, the left-tailed test, and the right-tailed test.

1. H_0: θ = θ_0 versus H_1: θ ≠ θ_0
2. H_0: θ ≥ θ_0 versus H_1: θ < θ_0
3. H_0: θ ≤ θ_0 versus H_1: θ > θ_0

Thus the tests are named after the conjectured alternative. Of course, there may be other unknown parameters besides θ (known as nuisance parameters).

Equivalence Between Hypothesis Tests and Confidence Sets


There is an equivalence between hypothesis tests and confidence sets for a parameter θ.

Suppose that C(x) is a 1 − α level confidence set for θ. The following test has significance level α for the hypothesis H_0: θ = θ_0 versus H_1: θ ≠ θ_0: Reject H_0 if and only if θ_0 ∉ C(x).

Proof

Equivalently, we fail to reject H_0 at significance level α if and only if θ_0 is in the corresponding 1 − α level confidence set. In particular, this equivalence applies to interval estimates of a real parameter θ and the common tests for θ given above.

In each case below, the confidence interval has confidence level 1 − α and the test has significance level α.

1. Suppose that [L(X), U(X)] is a two-sided confidence interval for θ. Reject H_0: θ = θ_0 versus H_1: θ ≠ θ_0 if and only if θ_0 < L(X) or θ_0 > U(X).
2. Suppose that L(X) is a confidence lower bound for θ. Reject H_0: θ ≤ θ_0 versus H_1: θ > θ_0 if and only if θ_0 < L(X).
3. Suppose that U(X) is a confidence upper bound for θ. Reject H_0: θ ≥ θ_0 versus H_1: θ < θ_0 if and only if θ_0 > U(X).

Pivot Variables and Test Statistics


Recall that confidence sets of an unknown parameter θ are often constructed through a pivot variable, that is, a random variable W(X, θ) that depends on the data vector X and the parameter θ, but whose distribution does not depend on θ and is known. In this case, a natural test statistic for the basic tests given above is W(X, θ_0).

This page titled 9.1: Introduction to Hypothesis Testing is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

9.2: Tests in the Normal Model

Basic Theory
The Normal Model
The normal distribution is perhaps the most important distribution in the study of mathematical statistics, in part because of the
central limit theorem. As a consequence of this theorem, a measured quantity that is subject to numerous small, random errors will
have, at least approximately, a normal distribution. Such variables are ubiquitous in statistical experiments, in subjects varying
from the physical and biological sciences to the social sciences.
So in this section, we assume that X = (X_1, X_2, …, X_n) is a random sample from the normal distribution with mean μ and standard deviation σ. Our goal in this section is to construct hypothesis tests for μ and σ; these are among the most important special cases of hypothesis testing. This section parallels the section on Estimation in the Normal Model in the chapter on Set Estimation, and in particular, the duality between interval estimation and hypothesis testing will play an important role. But first we need to review some basic facts that will be critical for our analysis.

Recall that the sample mean M and sample variance S² are

$$ M = \frac{1}{n} \sum_{i=1}^n X_i, \quad S^2 = \frac{1}{n - 1} \sum_{i=1}^n (X_i - M)^2 \tag{9.2.1} $$

From our study of point estimation, recall that M is an unbiased and consistent estimator of μ while S² is an unbiased and consistent estimator of σ². From these basic statistics we can construct the test statistics that will be used to construct our hypothesis tests. The following results were established in the section on Special Properties of the Normal Distribution.
hypothesis tests. The following results were established in the section on Special Properties of the Normal Distribution.

Define

$$ Z = \frac{M - \mu}{\sigma / \sqrt{n}}, \quad T = \frac{M - \mu}{S / \sqrt{n}}, \quad V = \frac{n - 1}{\sigma^2} S^2 \tag{9.2.2} $$

1. Z has the standard normal distribution.
2. T has the student t distribution with n − 1 degrees of freedom.
3. V has the chi-square distribution with n − 1 degrees of freedom.
4. Z and V are independent.

It follows that each of these random variables is a pivot variable for (μ, σ) since the distributions do not depend on the parameters,
but the variables themselves functionally depend on one or both parameters. The pivot variables will lead to natural test statistics
that can then be used to perform the hypothesis tests of the parameters. To construct our tests, we will need quantiles of these
standard distributions. The quantiles can be computed using the special distribution calculator or from most mathematical and
statistical software packages. Here is the notation we will use:

Let p ∈ (0, 1) and k ∈ N+.

1. z(p) denotes the quantile of order p for the standard normal distribution.
2. t_k(p) denotes the quantile of order p for the student t distribution with k degrees of freedom.
3. χ²_k(p) denotes the quantile of order p for the chi-square distribution with k degrees of freedom.

Since the standard normal and student t distributions are symmetric about 0, it follows that z(1 − p) = −z(p) and t_k(1 − p) = −t_k(p) for p ∈ (0, 1) and k ∈ N+. On the other hand, the chi-square distribution is not symmetric.

Tests for the Mean with Known Standard Deviation


For our first discussion, we assume that the distribution mean μ is unknown but the standard deviation σ is known. This is not
always an artificial assumption. There are often situations where σ is stable over time, and hence is at least approximately known,
while μ changes because of different “treatments”. Examples are given in the computational exercises below.

For a conjectured μ_0 ∈ R, define the test statistic

$$ Z = \frac{M - \mu_0}{\sigma / \sqrt{n}} \tag{9.2.3} $$

1. If μ = μ_0 then Z has the standard normal distribution.
2. If μ ≠ μ_0 then Z has the normal distribution with mean (μ − μ_0)/(σ/√n) and variance 1.

So in case (b), (μ − μ_0)/(σ/√n) can be viewed as a non-centrality parameter. The graph of the probability density function of Z is like that of the standard normal probability density function, but shifted to the right or left by the non-centrality parameter, depending on whether μ > μ_0 or μ < μ_0.

For α ∈ (0, 1), each of the following tests has significance level α:

1. Reject H_0: μ = μ_0 versus H_1: μ ≠ μ_0 if and only if Z < −z(1 − α/2) or Z > z(1 − α/2), if and only if M < μ_0 − z(1 − α/2) σ/√n or M > μ_0 + z(1 − α/2) σ/√n.
2. Reject H_0: μ ≤ μ_0 versus H_1: μ > μ_0 if and only if Z > z(1 − α), if and only if M > μ_0 + z(1 − α) σ/√n.
3. Reject H_0: μ ≥ μ_0 versus H_1: μ < μ_0 if and only if Z < −z(1 − α), if and only if M < μ_0 − z(1 − α) σ/√n.

Proof

Part (a) is the standard two-sided test, while (b) is the right-tailed test and (c) is the left-tailed test. Note that in each case, the
hypothesis test is the dual of the corresponding interval estimate constructed in the section on Estimation in the Normal Model.
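For illustration, here is a minimal sketch in Python of the two-sided test in part (a), returning the P-value discussed below (the function name is ours, scipy assumed):

```python
import math
from scipy import stats

def z_test_two_sided(m, mu0, sigma, n, alpha=0.05):
    """Two-sided test of H0: mu = mu0 with known sigma, based on the
    sample mean m of n observations; returns Z, the P-value, and the decision."""
    z = (m - mu0) / (sigma / math.sqrt(n))
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))  # 2[1 - Phi(|Z|)]
    return z, p_value, p_value <= alpha         # reject H0 when P <= alpha
```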

For each of the tests above, we fail to reject H_0 at significance level α if and only if μ_0 is in the corresponding 1 − α confidence interval, that is

1. M − z(1 − α/2) σ/√n ≤ μ_0 ≤ M + z(1 − α/2) σ/√n
2. μ_0 ≤ M + z(1 − α) σ/√n
3. μ_0 ≥ M − z(1 − α) σ/√n

Proof

The two-sided test in (a) corresponds to α/2 in each tail of the distribution of the test statistic Z, under H_0. This test is said to be unbiased. But of course we can construct other biased tests by partitioning the significance level α between the left and right tails in a non-symmetric way.

For every α, p ∈ (0, 1), the following test has significance level α: Reject H_0: μ = μ_0 versus H_1: μ ≠ μ_0 if and only if Z < z(α − pα) or Z ≥ z(1 − pα).

1. p = 1/2 gives the symmetric, unbiased test.
2. p ↓ 0 gives the left-tailed test.
3. p ↑ 1 gives the right-tailed test.
Proof

The P-values of these tests can be computed in terms of the standard normal distribution function Φ.

The P -values of the standard tests above are respectively


1. 2 [1 − Φ (|Z|)]
2. 1 − Φ(Z)
3. Φ(Z)

Recall that the power function of a test of a parameter is the probability of rejecting the null hypothesis, as a function of the true
value of the parameter. Our next series of results will explore the power functions of the tests above.

The power function of the general two-sided test above is given by

$$ Q(\mu) = \Phi\left( z(\alpha - p\alpha) - \frac{\sqrt{n}}{\sigma}(\mu - \mu_0) \right) + \Phi\left( \frac{\sqrt{n}}{\sigma}(\mu - \mu_0) - z(1 - p\alpha) \right), \quad \mu \in \mathbb{R} \tag{9.2.4} $$

1. Q is decreasing on (−∞, m_0) and increasing on (m_0, ∞) where m_0 = μ_0 + [z(α − pα) + z(1 − pα)] σ/(2√n).
2. Q(μ_0) = α.
3. Q(μ) → 1 as μ ↑ ∞ and Q(μ) → 1 as μ ↓ −∞.
4. If p = 1/2 then Q is symmetric about μ_0 (and m_0 = μ_0).
5. As p increases, Q(μ) increases if μ > μ_0 and decreases if μ < μ_0.

So by varying p, we can make the test more powerful for some values of μ , but only at the expense of making the test less powerful
for other values of μ .

The power function of the right-tailed test above is given by

$$ Q(\mu) = \Phi\left( z(\alpha) + \frac{\sqrt{n}}{\sigma}(\mu - \mu_0) \right), \quad \mu \in \mathbb{R} \tag{9.2.5} $$

1. Q is increasing on R.
2. Q(μ_0) = α.
3. Q(μ) → 1 as μ ↑ ∞ and Q(μ) → 0 as μ ↓ −∞.

The power function of the left-tailed test above is given by

$$ Q(\mu) = \Phi\left( z(\alpha) - \frac{\sqrt{n}}{\sigma}(\mu - \mu_0) \right), \quad \mu \in \mathbb{R} \tag{9.2.6} $$

1. Q is decreasing on R.
2. Q(μ_0) = α.
3. Q(μ) → 0 as μ ↑ ∞ and Q(μ) → 1 as μ ↓ −∞.

For any of the three tests above, increasing the sample size n or decreasing the standard deviation σ results in a uniformly more powerful test.

In the mean test experiment, select the normal test statistic and select the normal sampling distribution with standard deviation σ = 2, significance level α = 0.1, sample size n = 20, and μ_0 = 0. Run the experiment 1000 times for several values of the true distribution mean μ. For each value of μ, note the relative frequency of the event that the null hypothesis is rejected. Sketch the empirical power function.

In the mean estimate experiment, select the normal pivot variable and select the normal distribution with μ = 0 and standard deviation σ = 2, confidence level 1 − α = 0.90, and sample size n = 10. For each of the three types of confidence intervals, run the experiment 20 times. State the corresponding hypotheses and significance level, and for each run, give the set of μ_0 for which the null hypothesis would be rejected.

In many cases, the first step is to design the experiment so that the significance level is α and so that the test has a given power β for a given alternative μ_1.

For either of the one-sided tests above, the sample size n needed for a test with significance level α and power β for the alternative μ_1 is

$$ n = \left( \frac{\sigma [z(\beta) - z(\alpha)]}{\mu_1 - \mu_0} \right)^2 \tag{9.2.7} $$
Proof

For the unbiased, two-sided test, the sample size n needed for a test with significance level α and power β for the alternative μ_1 is approximately

$$ n = \left( \frac{\sigma [z(\beta) - z(\alpha/2)]}{\mu_1 - \mu_0} \right)^2 \tag{9.2.8} $$

Proof
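A minimal sketch in Python of these sample size formulas (our own helper, scipy assumed):

```python
import math
from scipy import stats

def z_test_sample_size(mu0, mu1, sigma, alpha=0.05, beta=0.9, two_sided=False):
    """Approximate sample size for significance level alpha and
    power beta against the alternative mu1."""
    a = alpha / 2 if two_sided else alpha
    n = (sigma * (stats.norm.ppf(beta) - stats.norm.ppf(a)) / (mu1 - mu0))**2
    return math.ceil(n)  # round up to an integer sample size
```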

Tests of the Mean with Unknown Standard Deviation


For our next discussion, we construct tests of μ without requiring the assumption that σ is known. And in applications of course, σ
is usually unknown.

For a conjectured μ_0 ∈ R, define the test statistic

$$ T = \frac{M - \mu_0}{S / \sqrt{n}} \tag{9.2.9} $$

1. If μ = μ_0, the statistic T has the student t distribution with n − 1 degrees of freedom.
2. If μ ≠ μ_0 then T has a non-central t distribution with n − 1 degrees of freedom and non-centrality parameter (μ − μ_0)/(σ/√n).

In case (b), the graph of the probability density function of T is much (but not exactly) the same as that of the ordinary t distribution with n − 1 degrees of freedom, but shifted to the right or left by the non-centrality parameter, depending on whether μ > μ_0 or μ < μ_0.

For α ∈ (0, 1), each of the following tests has significance level α:

1. Reject H_0: μ = μ_0 versus H_1: μ ≠ μ_0 if and only if T < −t_{n−1}(1 − α/2) or T > t_{n−1}(1 − α/2), if and only if M < μ_0 − t_{n−1}(1 − α/2) S/√n or M > μ_0 + t_{n−1}(1 − α/2) S/√n.
2. Reject H_0: μ ≤ μ_0 versus H_1: μ > μ_0 if and only if T > t_{n−1}(1 − α), if and only if M > μ_0 + t_{n−1}(1 − α) S/√n.
3. Reject H_0: μ ≥ μ_0 versus H_1: μ < μ_0 if and only if T < −t_{n−1}(1 − α), if and only if M < μ_0 − t_{n−1}(1 − α) S/√n.

Proof

Part (a) is the standard two-sided test, while (b) is the right-tailed test and (c) is the left-tailed test. Note that in each case, the
hypothesis test is the dual of the corresponding interval estimate constructed in the section on Estimation in the Normal Model.
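For illustration, a minimal sketch in Python of the two-sided t test from raw data (our own helper; the library routine scipy.stats.ttest_1samp performs essentially the same computation):

```python
import numpy as np
from scipy import stats

def t_test_two_sided(x, mu0, alpha=0.05):
    """Two-sided test of H0: mu = mu0 with unknown sigma;
    returns T, the P-value, and the decision."""
    x = np.asarray(x)
    n = len(x)
    t = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
    p_value = 2 * (1 - stats.t.cdf(abs(t), n - 1))  # 2[1 - Phi_{n-1}(|T|)]
    return t, p_value, p_value <= alpha
```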

For each of the tests above, we fail to reject H_0 at significance level α if and only if μ_0 is in the corresponding 1 − α confidence interval.

1. M − t_{n−1}(1 − α/2) S/√n ≤ μ_0 ≤ M + t_{n−1}(1 − α/2) S/√n
2. μ_0 ≤ M + t_{n−1}(1 − α) S/√n
3. μ_0 ≥ M − t_{n−1}(1 − α) S/√n

Proof

The two-sided test in (a) corresponds to α/2 in each tail of the distribution of the test statistic T, under H_0. This test is said to be unbiased. But of course we can construct other biased tests by partitioning the significance level α between the left and right tails in a non-symmetric way.

For every α, p ∈ (0, 1), the following test has significance level α: Reject H_0: μ = μ_0 versus H_1: μ ≠ μ_0 if and only if T < t_{n−1}(α − pα) or T ≥ t_{n−1}(1 − pα), if and only if M < μ_0 + t_{n−1}(α − pα) S/√n or M > μ_0 + t_{n−1}(1 − pα) S/√n.

1. p = 1/2 gives the symmetric, unbiased test.
2. p ↓ 0 gives the left-tailed test.
3. p ↑ 1 gives the right-tailed test.
Proof

The P-values of these tests can be computed in terms of the distribution function Φ_{n−1} of the t distribution with n − 1 degrees of freedom.

The P-values of the standard tests above are respectively

1. 2[1 − Φ_{n−1}(|T|)]
2. 1 − Φ_{n−1}(T)
3. Φ_{n−1}(T)

In the mean test experiment, select the student test statistic and select the normal sampling distribution with standard deviation σ = 2, significance level α = 0.1, sample size n = 20, and μ_0 = 1. Run the experiment 1000 times for several values of the true distribution mean μ. For each value of μ, note the relative frequency of the event that the null hypothesis is rejected. Sketch the empirical power function.

In the mean estimate experiment, select the student pivot variable and select the normal sampling distribution with mean 0 and standard deviation 2. Select confidence level 0.90 and sample size 10. For each of the three types of intervals, run the experiment 20 times. State the corresponding hypotheses and significance level, and for each run, give the set of μ_0 for which the null hypothesis would be rejected.

The power function for the t tests above can be computed explicitly in terms of the non-central t distribution function. Qualitatively, the graphs of the power functions are similar to the case when σ is known, given above for the two-sided, left-tailed, and right-tailed cases.

If an upper bound σ_0 on the standard deviation σ is known, then conservative estimates of the sample size needed for a given confidence level and a given margin of error can be obtained using the methods for the normal pivot variable, in the two-sided and one-sided cases.

Tests of the Standard Deviation


For our next discussion, we will construct hypothesis tests for the distribution standard deviation σ. So our assumption is that σ is
unknown, and of course almost always, μ would be unknown as well.

For a conjectured value σ_0 ∈ (0, ∞), define the test statistic

$$ V = \frac{n - 1}{\sigma_0^2} S^2 \tag{9.2.10} $$

1. If σ = σ_0, then V has the chi-square distribution with n − 1 degrees of freedom.
2. If σ ≠ σ_0 then V has the gamma distribution with shape parameter (n − 1)/2 and scale parameter 2σ²/σ_0².

Recall that the ordinary chi-square distribution with n − 1 degrees of freedom is the gamma distribution with shape parameter (n − 1)/2 and scale parameter 2. So in case (b), the ordinary chi-square distribution is scaled by σ²/σ_0². In particular, the scale factor is greater than 1 if σ > σ_0 and less than 1 if σ < σ_0.

For every α ∈ (0, 1), the following test has significance level α:

1. Reject H_0: σ = σ_0 versus H_1: σ ≠ σ_0 if and only if V < χ²_{n−1}(α/2) or V > χ²_{n−1}(1 − α/2), if and only if S² < χ²_{n−1}(α/2) σ_0²/(n − 1) or S² > χ²_{n−1}(1 − α/2) σ_0²/(n − 1).
2. Reject H_0: σ ≥ σ_0 versus H_1: σ < σ_0 if and only if V < χ²_{n−1}(α), if and only if S² < χ²_{n−1}(α) σ_0²/(n − 1).
3. Reject H_0: σ ≤ σ_0 versus H_1: σ > σ_0 if and only if V > χ²_{n−1}(1 − α), if and only if S² > χ²_{n−1}(1 − α) σ_0²/(n − 1).

Proof

Part (a) is the unbiased, two-sided test that corresponds to α/2 in each tail of the chi-square distribution of the test statistic V, under H_0. Part (b) is the left-tailed test and part (c) is the right-tailed test. Once again, we have a duality between the hypothesis tests and the interval estimates constructed in the section on Estimation in the Normal Model.
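A minimal sketch in Python of the two-sided test in part (a) (our own helper, scipy assumed):

```python
from scipy import stats

def variance_test_two_sided(s2, sigma0, n, alpha=0.05):
    """Two-sided test of H0: sigma = sigma0, based on the
    sample variance s2 of n observations."""
    v = (n - 1) * s2 / sigma0**2  # chi-square test statistic
    reject = (v < stats.chi2.ppf(alpha / 2, n - 1)
              or v > stats.chi2.ppf(1 - alpha / 2, n - 1))
    return v, reject
```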

For each of the tests above, we fail to reject H0 at significance level α if and only if σ0² is in the corresponding 1 − α confidence interval. That is

1. (n − 1) S² / χ²_{n−1}(1 − α/2) ≤ σ0² ≤ (n − 1) S² / χ²_{n−1}(α/2)
2. σ0² ≤ (n − 1) S² / χ²_{n−1}(α)
3. σ0² ≥ (n − 1) S² / χ²_{n−1}(1 − α)

Proof
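As a computational illustration (an added sketch rather than part of the text), the two-sided variance test can be carried out in Python with scipy; the data and parameter values below are hypothetical:

# Added sketch: the chi-square test of H0: sigma = sigma0 (scipy).
import numpy as np
from scipy.stats import chi2

x = np.array([2.3, 1.9, 2.8, 2.1, 2.6, 2.4, 1.7, 2.2, 2.9, 2.0])
sigma0, alpha = 0.5, 0.05
n = len(x)
V = (n - 1) * x.var(ddof=1) / sigma0**2   # test statistic (9.2.10)
lo = chi2(n - 1).ppf(alpha / 2)           # chi^2_{n-1}(alpha/2)
hi = chi2(n - 1).ppf(1 - alpha / 2)       # chi^2_{n-1}(1 - alpha/2)
print(V, (V < lo) or (V > hi))            # True means reject H0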

As before, we can construct more general two-sided tests by partitioning the significance level α between the left and right tails of
the chi-square distribution in an arbitrary way.

For every α, p ∈ (0, 1), the following test has significance level α: Reject H0: σ = σ0 versus H1: σ ≠ σ0 if and only if V ≤ χ²_{n−1}(α − pα) or V ≥ χ²_{n−1}(1 − pα), if and only if S² ≤ (σ0²/(n − 1)) χ²_{n−1}(α − pα) or S² ≥ (σ0²/(n − 1)) χ²_{n−1}(1 − pα).

1. p = 1/2 gives the equal-tail test.
2. p ↓ 0 gives the left-tail test.
3. p ↑ 1 gives the right-tail test.
Proof

Recall again that the power function of a test of a parameter is the probability of rejecting the null hypothesis, as a function of the true value of the parameter. The power functions of the tests for σ can be expressed in terms of the distribution function G_{n−1} of the chi-square distribution with n − 1 degrees of freedom.

The power function of the general two-sided test above is given by the following formula, and satisfies the given properties:

Q(σ) = 1 − G_{n−1}((σ0²/σ²) χ²_{n−1}(1 − pα)) + G_{n−1}((σ0²/σ²) χ²_{n−1}(α − pα))   (9.2.11)

1. Q is decreasing on (0, σ0) and increasing on (σ0, ∞).
2. Q(σ0) = α.
3. Q(σ) → 1 as σ ↑ ∞ and Q(σ) → 1 as σ ↓ 0.

The power function of the left-tailed test above is given by the following formula, and satisfies the given properties:

Q(σ) = G_{n−1}((σ0²/σ²) χ²_{n−1}(α))   (9.2.12)

1. Q is decreasing on (0, ∞).
2. Q(σ0) = α.
3. Q(σ) → 0 as σ ↑ ∞ and Q(σ) → 1 as σ ↓ 0.

The power function for the right-tailed test above is given by the following formula, and satisfies the given properties:

Q(σ) = 1 − G_{n−1}((σ0²/σ²) χ²_{n−1}(1 − α))   (9.2.13)

1. Q is increasing on (0, ∞).
2. Q(σ0) = α.
3. Q(σ) → 1 as σ ↑ ∞ and Q(σ) → 0 as σ ↓ 0.
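These power functions are easy to evaluate numerically. The following added sketch (Python with scipy, not part of the original text) implements the one-sided formulas above:

# Added sketch: power functions of the one-sided variance tests; G is the
# chi-square distribution with n - 1 degrees of freedom.
from scipy.stats import chi2

def power_left(sigma, sigma0, n, alpha):
    G = chi2(n - 1)
    return G.cdf((sigma0**2 / sigma**2) * G.ppf(alpha))          # (9.2.12)

def power_right(sigma, sigma0, n, alpha):
    G = chi2(n - 1)
    return 1 - G.cdf((sigma0**2 / sigma**2) * G.ppf(1 - alpha))  # (9.2.13)

# Q(sigma0) = alpha in both cases, as stated above:
print(power_left(1.0, 1.0, 10, 0.1), power_right(1.0, 1.0, 10, 0.1))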

In the variance test experiment, select the normal distribution with mean 0, and select significance level 0.1, sample size 10,
and test standard deviation 1.0. For various values of the true standard deviation, run the simulation 1000 times. Record the
relative frequency of rejecting the null hypothesis and plot the empirical power curve.
1. Two-sided test
2. Left-tailed test
3. Right-tailed test

In the variance estimate experiment, select the normal distribution with mean 0 and standard deviation 2, and select confidence
level 0.90 and sample size 10. Run the experiment 20 times. State the corresponding hypotheses and significance level, and for
each run, give the set of test standard deviations for which the null hypothesis would be rejected.
1. Two-sided confidence interval
2. Confidence lower bound
3. Confidence upper bound

Exercises
Robustness
The primary assumption that we made is that the underlying sampling distribution is normal. Of course, in real statistical problems,
we are unlikely to know much about the sampling distribution, let alone whether or not it is normal. Suppose in fact that the
underlying distribution is not normal. When the sample size n is relatively large, the distribution of the sample mean will still be
approximately normal by the central limit theorem, and thus our tests of the mean μ should still be approximately valid. On the
other hand, tests of the variance σ² are less robust to deviations from the assumption of normality. The following exercises explore these ideas.

In the mean test experiment, select the gamma distribution with shape parameter 1 and scale parameter 1. For the three different tests and for various significance levels, sample sizes, and values of μ0, run the experiment 1000 times. For each configuration, note the relative frequency of rejecting H0. When H0 is true, compare the relative frequency with the significance level.

In the mean test experiment, select the uniform distribution on [0, 4]. For the three different tests and for various significance levels, sample sizes, and values of μ0, run the experiment 1000 times. For each configuration, note the relative frequency of rejecting H0. When H0 is true, compare the relative frequency with the significance level.

How large n needs to be for the testing procedure to work well depends, of course, on the underlying distribution; the more this
distribution deviates from normality, the larger n must be. Fortunately, convergence to normality in the central limit theorem is
rapid and hence, as you observed in the exercises, we can get away with relatively small sample sizes (30 or more) in most cases.

In the variance test experiment, select the gamma distribution with shape parameter 1 and scale parameter 1. For the three different tests and for various significance levels, sample sizes, and values of σ0, run the experiment 1000 times. For each configuration, note the relative frequency of rejecting H0. When H0 is true, compare the relative frequency with the significance level.

In the variance test experiment, select the uniform distribution on [0, 4]. For the three different tests and for various significance levels, sample sizes, and values of σ0, run the experiment 1000 times. For each configuration, note the relative frequency of rejecting H0. When H0 is true, compare the relative frequency with the significance level.

Computational Exercises
The length of a certain machined part is supposed to be 10 centimeters. In fact, due to imperfections in the manufacturing
process, the actual length is a random variable. The standard deviation is due to inherent factors in the process, which remain
fairly stable over time. From historical data, the standard deviation is known with a high degree of accuracy to be 0.3. The
mean, on the other hand, may be set by adjusting various parameters in the process and hence may change to an unknown
value fairly frequently. We are interested in testing H0: μ = 10 versus H1: μ ≠ 10.

1. Suppose that a sample of 100 parts has mean 10.1. Perform the test at the 0.1 level of significance.
2. Compute the P -value for the data in (a).
3. Compute the power of the test in (a) at μ = 10.05.
4. Compute the approximate sample size needed for significance level 0.1 and power 0.8 when μ = 10.05.
Answer

A bag of potato chips of a certain brand has an advertised weight of 250 grams. Actually, the weight (in grams) is a random
variable. Suppose that a sample of 75 bags has mean 248 and standard deviation 5. At the 0.05 significance level, perform the
following tests:
1. H0: μ ≥ 250 versus H1: μ < 250
2. H0: σ ≥ 7 versus H1: σ < 7

Answer

At a telemarketing firm, the length of a telephone solicitation (in seconds) is a random variable. A sample of 50 calls has mean
310 and standard deviation 25. At the 0.1 level of significance, can we conclude that
1. μ > 300 ?
2. σ > 20?
Answer

At a certain farm the weight of a peach (in ounces) at harvest time is a random variable. A sample of 100 peaches has mean 8.2
and standard deviation 1.0. At the 0.01 level of significance, can we conclude that
1. μ > 8 ?
2. σ < 1.5?
Answer

The hourly wage for a certain type of construction work is a random variable with standard deviation 1.25. For a sample of 25 workers, the mean wage was $6.75. At the 0.01 level of significance, can we conclude that μ < 7.00?
Answer

Data Analysis Exercises


Using Michelson's data, test to see if the velocity of light is greater than 730 (+299000) km/sec, at the 0.005 significance level.
Answer

Using Cavendish's data, test to see if the density of the earth is less than 5.5 times the density of water, at the 0.05 significance level.
Answer

Using Short's data, test to see if the parallax of the sun differs from 9 seconds of a degree, at the 0.1 significance level.
Answer

Using Fisher's iris data, perform the following tests, at the 0.1 level:
1. The mean petal length of Setosa irises differs from 15 mm.
2. The mean petal length of Virginica irises is greater than 52 mm.
3. The mean petal length of Versicolor irises is less than 44 mm.
Answer

This page titled 9.2: Tests in the Normal Model is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

9.3: Tests in the Bernoulli Model

Basic Tests
Preliminaries
Suppose that X = (X1, X2, …, Xn) is a random sample from the Bernoulli distribution with unknown parameter p ∈ (0, 1).

Thus, these are independent random variables taking the values 1 and 0 with probabilities p and 1 − p respectively. In the usual
language of reliability, 1 denotes success and 0 denotes failure, but of course these are generic terms. Often this model arises in one
of the following contexts:
1. There is an event of interest in a basic experiment, with unknown probability p. We replicate the experiment n times and define Xi = 1 if and only if the event occurred on run i.
2. We have a population of objects of several different types; p is the unknown proportion of objects of a particular type of interest. We select n objects at random from the population and let Xi = 1 if and only if object i is of the type of interest.

When the sampling is with replacement, these variables really do form a random sample from the Bernoulli distribution. When
the sampling is without replacement, the variables are dependent, but the Bernoulli model may still be approximately valid if
the population size is very large compared to the sample size n . For more on these points, see the discussion of sampling with
and without replacement in the chapter on Finite Sampling Models.
In this section, we will construct hypothesis tests for the parameter p. The parameter space for p is the interval (0, 1), and all
hypotheses define subsets of this space. This section parallels the section on Estimation in the Bernoulli Model in the Chapter on
Interval Estimation.

The Binomial Test


Recall that the number of successes Y = ∑_{i=1}^n Xi has the binomial distribution with parameters n and p, and has probability density function given by

P(Y = y) = (n choose y) p^y (1 − p)^{n−y},  y ∈ {0, 1, …, n}   (9.3.1)

Recall also that the mean is E(Y) = np and the variance is var(Y) = np(1 − p). Moreover Y is sufficient for p and hence is a natural candidate to be a test statistic for hypothesis tests about p. For α ∈ (0, 1), let b_{n,p}(α) denote the quantile of order α for the binomial distribution with parameters n and p. Since the binomial distribution is discrete, only certain (exact) quantiles are possible. For the remainder of this discussion, p0 ∈ (0, 1) is a conjectured value of p.

For every α ∈ (0, 1), the following tests have approximate significance level α:

1. Reject H0: p = p0 versus H1: p ≠ p0 if and only if Y ≤ b_{n,p0}(α/2) or Y ≥ b_{n,p0}(1 − α/2).
2. Reject H0: p ≥ p0 versus H1: p < p0 if and only if Y ≤ b_{n,p0}(α).
3. Reject H0: p ≤ p0 versus H1: p > p0 if and only if Y ≥ b_{n,p0}(1 − α).
Proof

The test in (a) is the standard, symmetric, two-sided test, corresponding to probability α/2 (approximately) in both tails of the binomial distribution under H0. The test in (b) is the left-tailed test and the test in (c) is the right-tailed test. As usual, we can generalize the two-sided test by partitioning α between the left and right tails of the binomial distribution in an arbitrary manner.
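As an added illustration (not part of the original text), the binomial quantiles and the symmetric two-sided test can be computed in Python with scipy; the observed Y and parameter values are hypothetical:

# Added sketch: the two-sided binomial test via binomial quantiles (scipy).
# Because the distribution is discrete, the attained level is only
# approximately alpha.
from scipy.stats import binom

n, p0, alpha = 20, 0.5, 0.1
Y = 15                                  # hypothetical number of successes
lo = binom(n, p0).ppf(alpha / 2)        # b_{n,p0}(alpha/2)
hi = binom(n, p0).ppf(1 - alpha / 2)    # b_{n,p0}(1 - alpha/2)
print(lo, hi, (Y <= lo) or (Y >= hi))   # True means reject H0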

For any α, r ∈ (0, 1), the following test has (approximate) significance level α: Reject H0: p = p0 versus H1: p ≠ p0 if and only if Y ≤ b_{n,p0}(α − rα) or Y ≥ b_{n,p0}(1 − rα).

1. r = 1/2 gives the standard symmetric two-sided test.
2. r ↓ 0 gives the left-tailed test.
3. r ↑ 1 gives the right-tailed test.
Proof

An Approximate Normal Test
When n is large, the distribution of Y is approximately normal, by the central limit theorem, so we can construct an approximate
normal test.

Suppose that the sample size n is large. For a conjectured p0 ∈ (0, 1), define the test statistic

Z = (Y − n p0) / √(n p0 (1 − p0))   (9.3.2)

1. If p = p0, then Z has approximately a standard normal distribution.
2. If p ≠ p0, then Z has approximately a normal distribution with mean √n (p − p0)/√(p0(1 − p0)) and variance p(1 − p)/[p0(1 − p0)].
Proof

As usual, for α ∈ (0, 1), let z(α) denote the quantile of order α for the standard normal distribution. For selected values of α ,
z(α) can be obtained from the special distribution calculator, or from most statistical software packages. Recall also by symmetry

that z(1 − α) = −z(α) .

For every α ∈ (0, 1), the following tests have approximate significance level α:

1. Reject H0: p = p0 versus H1: p ≠ p0 if and only if Z < −z(1 − α/2) or Z > z(1 − α/2).
2. Reject H0: p ≥ p0 versus H1: p < p0 if and only if Z < −z(1 − α).
3. Reject H0: p ≤ p0 versus H1: p > p0 if and only if Z > z(1 − α).
Proof

The test in (a) is the symmetric, two-sided test that corresponds to α/2 in both tails of the distribution of Z, under H0. The test in (b) is the left-tailed test and the test in (c) is the right-tailed test. As usual, we can construct a more general two-sided test by partitioning α between the left and right tails of the standard normal distribution in an arbitrary manner.
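Here is an added computational sketch of the approximate normal test in Python with scipy (the observed count and parameters are hypothetical):

# Added sketch: the approximate normal test of H0: p = p0 (scipy).
import numpy as np
from scipy.stats import norm

n, p0, alpha = 500, 0.5, 0.05
Y = 302                                         # hypothetical number of successes
Z = (Y - n * p0) / np.sqrt(n * p0 * (1 - p0))   # test statistic (9.3.2)
z = norm.ppf(1 - alpha / 2)                     # z(1 - alpha/2)
p_value = 2 * (1 - norm.cdf(abs(Z)))            # two-sided P-value
print(Z, (Z < -z) or (Z > z), p_value)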

For every α, r ∈ (0, 1), the following test has approximate significance level α: Reject H0: p = p0 versus H1: p ≠ p0 if and only if Z < z(α − rα) or Z > z(1 − rα).

1. r = 1/2 gives the standard, symmetric two-sided test.
2. r ↓ 0 gives the left-tailed test.
3. r ↑ 1 gives the right-tailed test.
Proof

Simulation Exercises
In the proportion test experiment, set H0: p = p0, and select sample size 10, significance level 0.1, and p0 = 0.5. For each p ∈ {0.1, 0.2, …, 0.9}, run the experiment 1000 times and then note the relative frequency of rejecting the null hypothesis. Graph the empirical power function.

In the proportion test experiment, repeat the previous exercise with sample size 20.

In the proportion test experiment, set H0: p ≤ p0, and select sample size 15, significance level 0.05, and p0 = 0.3. For each p ∈ {0.1, 0.2, …, 0.9}, run the experiment 1000 times and note the relative frequency of rejecting the null hypothesis. Graph the empirical power function.

In the proportion test experiment, repeat the previous exercise with sample size 30.

In the proportion test experiment, set H0: p ≥ p0, and select sample size 20, significance level 0.01, and p0 = 0.6. For each p ∈ {0.1, 0.2, …, 0.9}, run the experiment 1000 times and then note the relative frequency of rejecting the null hypothesis. Graph the empirical power function.

In the proportion test experiment, repeat the previous exercise with sample size 50.

Computational Exercises
In a poll of 1000 registered voters in a certain district, 427 prefer candidate X. At the 0.1 level, is the evidence sufficient to conclude that more than 40% of the registered voters prefer X?
Answer

A coin is tossed 500 times and results in 302 heads. At the 0.05 level, test to see if the coin is unfair.
Answer

A sample of 400 memory chips from a production line are tested, and 32 are defective. At the 0.05 level, test to see if the
proportion of defective chips is less than 0.1.
Answer

A new drug is administered to 50 patients and the drug is effective in 42 cases. At the 0.1 level, test to see if the success rate for the new drug is greater than 0.8.
Answer

Using the M&M data, test the following alternative hypotheses at the 0.1 significance level:
1. The proportion of red M&Ms differs from . 1

2. The proportion of green M&Ms is less than . 1

3. The proportion of yellow M&M is greater than 1

6
.
Answer

The Sign Test


Derivation
Suppose now that we have a basic random experiment with a real-valued random variable U of interest. We assume that U has a continuous distribution with support on an interval S ⊆ R. Let m denote the quantile of a specified order p0 ∈ (0, 1) for the distribution of U. Thus, by definition,

p0 = P(U ≤ m)   (9.3.4)

In general of course, m is unknown, even though p0 is specified, because we don't know the distribution of U. Suppose that we want to construct hypothesis tests for m. For a given test value m0, let

p = P(U ≤ m0)   (9.3.5)

Note that p is unknown even though m0 is specified, because again, we don't know the distribution of U.

Relations
1. m = m0 if and only if p = p0.
2. m < m0 if and only if p > p0.
3. m > m0 if and only if p < p0.

Proof

As usual, we repeat the basic experiment n times to generate a random sample U = (U1, U2, …, Un) of size n from the distribution of U. Let Xi = 1(Ui ≤ m0) be the indicator variable of the event {Ui ≤ m0} for i ∈ {1, 2, …, n}.

Note that X = (X1, X2, …, Xn) is a statistic (an observable function of the data vector U) and is a random sample of size n from the Bernoulli distribution with parameter p.

From the last two results it follows that tests of the unknown quantile m can be converted to tests of the Bernoulli parameter p, and thus the tests developed above apply. This procedure is known as the sign test, because essentially, only the sign of Ui − m0 is recorded for each i. This procedure is also an example of a nonparametric test, because no assumptions about the distribution of U are made (except for continuity). In particular, we do not need to assume that the distribution of U belongs to a particular parametric family.
The most important special case of the sign test is the case where p0 = 1/2; this is the sign test of the median. If the distribution of U is known to be symmetric, the median and the mean agree. In this case, sign tests of the median are also tests of the mean.
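As an added sketch (not part of the original text), the sign test of the median reduces to a binomial computation; in Python with scipy, with hypothetical data:

# Added sketch: the two-sided sign test of the median (p0 = 1/2). Ties with
# m0 have probability 0 under the continuity assumption.
import numpy as np
from scipy.stats import binom

u = np.array([4.9, 5.6, 4.2, 6.1, 5.8, 4.5, 5.2, 6.3, 4.8, 5.9])
m0 = 5.0                          # conjectured median
n = len(u)
Y = int(np.sum(u <= m0))          # number of sample values at or below m0
B = binom(n, 0.5)
p_value = min(1.0, 2 * min(B.cdf(Y), B.sf(Y - 1)))   # two-sided P-value
print(Y, p_value)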

Simulation Exercises
In the sign test experiment, set the sampling distribution to normal with mean 0 and standard deviation 2. Set the sample size to 10 and the significance level to 0.1. For each of the 9 values of m0, run the simulation 1000 times.
1. When m0 = m, give the empirical estimate of the significance level of the test and compare with 0.1.
2. In the other cases, give the empirical estimate of the power of the test.

In the sign test experiment, set the sampling distribution to uniform on the interval [0, 5]. Set the sample size to 20 and the significance level to 0.05. For each of the 9 values of m0, run the simulation 1000 times.
1. When m0 = m, give the empirical estimate of the significance level of the test and compare with 0.05.
2. In the other cases, give the empirical estimate of the power of the test.

In the sign test experiment, set the sampling distribution to gamma with shape parameter 2 and scale parameter 1. Set the sample size to 30 and the significance level to 0.025. For each of the 9 values of m0, run the simulation 1000 times.
1. When m0 = m, give the empirical estimate of the significance level of the test and compare with 0.025.
2. In the other cases, give the empirical estimate of the power of the test.

Computational Exercises
Using the M&M data, test to see if the median weight exceeds 47.9 grams, at the 0.1 level.
Answer

Using Fisher's iris data, perform the following tests, at the 0.1 level:
1. The median petal length of Setosa irises differs from 15 mm.
2. The median petal length of Virginica irises is less than 52 mm.
3. The median petal length of Versicolor irises is less than 42 mm.

This page titled 9.3: Tests in the Bernoulli Model is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

9.4: Tests in the Two-Sample Normal Model

In this section, we will study hypothesis tests in the two-sample normal model and in the bivariate normal model. This section
parallels the section on Estimation in the Two Sample Normal Model in the chapter on Interval Estimation.

The Two-Sample Normal Model


Suppose that X = (X1, X2, …, Xm) is a random sample of size m from the normal distribution with mean μ and standard deviation σ, and that Y = (Y1, Y2, …, Yn) is a random sample of size n from the normal distribution with mean ν and standard deviation τ. Moreover, suppose that the samples X and Y are independent.


This type of situation arises frequently when the random variables represent a measurement of interest for the objects of the
population, and the samples correspond to two different treatments. For example, we might be interested in the blood pressure of a
certain population of patients. The X vector records the blood pressures of a control sample, while the Y vector records the blood
pressures of the sample receiving a new drug. Similarly, we might be interested in the yield of an acre of corn. The X vector
records the yields of a sample receiving one type of fertilizer, while the Y vector records the yields of a sample receiving a
different type of fertilizer.
Usually our interest is in a comparison of the parameters (either the mean or variance) for the two sampling distributions. In this section we will construct tests for the difference of the means and the ratio of the variances. As with previous estimation problems we have studied, the procedures vary depending on what parameters are known or unknown. Also as before, key elements in the construction of the tests are the sample means and sample variances and the special properties of these statistics when the sampling distribution is normal.
We will use the following notation for the sample mean and sample variance of a generic sample U = (U1, U2, …, Uk):

M(U) = (1/k) ∑_{i=1}^k Ui,   S²(U) = (1/(k − 1)) ∑_{i=1}^k [Ui − M(U)]²   (9.4.1)

Tests of the Difference in the Means with Known Standard Deviations


Our first discussion concerns tests for the difference in the means ν − μ under the assumption that the standard deviations σ and τ
are known. This is often, but not always, an unrealistic assumption. In some statistical problems, the variances are stable, and are at
least approximately known, while the means may be different because of different treatments. Also this is a good place to start
because the analysis is fairly easy.

For a conjectured difference of the means δ ∈ R, define the test statistic

Z = ([M(Y) − M(X)] − δ) / √(σ²/m + τ²/n)   (9.4.2)

1. If ν − μ = δ then Z has the standard normal distribution.
2. If ν − μ ≠ δ then Z has the normal distribution with mean [(ν − μ) − δ]/√(σ²/m + τ²/n) and variance 1.

Proof

Of course (b) actually subsumes (a), but we separate them because the two cases play an important role in the hypothesis tests. In part (b), the non-zero mean can be viewed as a non-centrality parameter.
As usual, for p ∈ (0, 1), let z(p) denote the quantile of order p for the standard normal distribution. For selected values of p, z(p)
can be obtained from the special distribution calculator or from most statistical software packages. Recall also by symmetry that
z(1 − p) = −z(p) .

For every α ∈ (0, 1), the following tests have significance level α:

1. Reject H0: ν − μ = δ versus H1: ν − μ ≠ δ if and only if Z < −z(1 − α/2) or Z > z(1 − α/2), if and only if M(Y) − M(X) > δ + z(1 − α/2)√(σ²/m + τ²/n) or M(Y) − M(X) < δ − z(1 − α/2)√(σ²/m + τ²/n).
2. Reject H0: ν − μ ≥ δ versus H1: ν − μ < δ if and only if Z < −z(1 − α), if and only if M(Y) − M(X) < δ − z(1 − α)√(σ²/m + τ²/n).
3. Reject H0: ν − μ ≤ δ versus H1: ν − μ > δ if and only if Z > z(1 − α), if and only if M(Y) − M(X) > δ + z(1 − α)√(σ²/m + τ²/n).
Proof

For each of the tests above, we fail to reject H0 at significance level α if and only if δ is in the corresponding 1 − α level confidence interval.

1. [M(Y) − M(X)] − z(1 − α/2)√(σ²/m + τ²/n) ≤ δ ≤ [M(Y) − M(X)] + z(1 − α/2)√(σ²/m + τ²/n)
2. δ ≤ [M(Y) − M(X)] + z(1 − α)√(σ²/m + τ²/n)
3. δ ≥ [M(Y) − M(X)] − z(1 − α)√(σ²/m + τ²/n)

Proof
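For illustration, here is an added computational sketch of the two-sided version of this test in Python with scipy (the summary statistics below are hypothetical):

# Added sketch: the two-sample z test for nu - mu with known sigma and tau.
import numpy as np
from scipy.stats import norm

m, n = 36, 49                  # hypothetical sample sizes
Mx, My = 87.0, 63.0            # hypothetical sample means M(X), M(Y)
sigma, tau = 4.0, 6.0          # standard deviations, assumed known
delta, alpha = 0.0, 0.1        # conjectured difference and significance level
Z = ((My - Mx) - delta) / np.sqrt(sigma**2 / m + tau**2 / n)   # statistic (9.4.2)
z = norm.ppf(1 - alpha / 2)
print(Z, (Z < -z) or (Z > z))  # True means reject H0: nu - mu = delta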

Tests of the Difference of the Means with Unknown Standard Deviations


Next we will construct tests for the difference in the means ν − μ under the more realistic assumption that the standard deviations σ and τ are unknown. In this case, it is more difficult to find a suitable test statistic, but we can do the analysis in the special case that the standard deviations are the same. Thus, we will assume that σ = τ, and the common value σ is unknown. This assumption is reasonable if there is an inherent variability in the measurement variables that does not change even when different treatments are applied to the objects in the population. Recall that the pooled estimate of the common variance σ² is the weighted average of the sample variances, with the degrees of freedom as the weight factors:

S²(X, Y) = [(m − 1) S²(X) + (n − 1) S²(Y)] / (m + n − 2)   (9.4.3)

The statistic S²(X, Y) is an unbiased and consistent estimator of the common variance σ².

For a conjectured δ ∈ R, define the test statistic

T = ([M(Y) − M(X)] − δ) / (S(X, Y) √(1/m + 1/n))   (9.4.4)

1. If ν − μ = δ then T has the t distribution with m + n − 2 degrees of freedom.
2. If ν − μ ≠ δ then T has a non-central t distribution with m + n − 2 degrees of freedom and non-centrality parameter

[(ν − μ) − δ] / (σ √(1/m + 1/n))   (9.4.5)

Proof

As usual, for k > 0 and p ∈ (0, 1), let t_k(p) denote the quantile of order p for the t distribution with k degrees of freedom. For selected values of k and p, values of t_k(p) can be computed from the special distribution calculator, or from most statistical software packages. Recall also that, by symmetry, t_k(1 − p) = −t_k(p).

The following tests have significance level α:

1. Reject H0: ν − μ = δ versus H1: ν − μ ≠ δ if and only if T < −t_{m+n−2}(1 − α/2) or T > t_{m+n−2}(1 − α/2), if and only if M(Y) − M(X) > δ + t_{m+n−2}(1 − α/2) S(X, Y) √(1/m + 1/n) or M(Y) − M(X) < δ − t_{m+n−2}(1 − α/2) S(X, Y) √(1/m + 1/n).
2. Reject H0: ν − μ ≥ δ versus H1: ν − μ < δ if and only if T ≤ −t_{m+n−2}(1 − α), if and only if M(Y) − M(X) < δ − t_{m+n−2}(1 − α) S(X, Y) √(1/m + 1/n).
3. Reject H0: ν − μ ≤ δ versus H1: ν − μ > δ if and only if T ≥ t_{m+n−2}(1 − α), if and only if M(Y) − M(X) > δ + t_{m+n−2}(1 − α) S(X, Y) √(1/m + 1/n).

Proof

For each of the tests above, we fail to reject H0 at significance level α if and only if δ is in the corresponding 1 − α level confidence interval.

1. [M(Y) − M(X)] − t_{m+n−2}(1 − α/2) S(X, Y) √(1/m + 1/n) ≤ δ ≤ [M(Y) − M(X)] + t_{m+n−2}(1 − α/2) S(X, Y) √(1/m + 1/n)
2. δ ≤ [M(Y) − M(X)] + t_{m+n−2}(1 − α) S(X, Y) √(1/m + 1/n)
3. δ ≥ [M(Y) − M(X)] − t_{m+n−2}(1 − α) S(X, Y) √(1/m + 1/n)

Proof
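As an added sketch (not part of the original text), the pooled two-sample t test can be computed directly in Python with scipy; the data are hypothetical:

# Added sketch: the pooled two-sample t test for nu - mu (scipy).
import numpy as np
from scipy.stats import t as t_dist

x = np.array([10.1, 9.8, 10.4, 10.0, 9.7, 10.2])   # hypothetical first sample
y = np.array([10.6, 10.9, 10.3, 11.0, 10.7])       # hypothetical second sample
delta, alpha = 0.0, 0.1
m, n = len(x), len(y)
S2 = ((m - 1) * x.var(ddof=1) + (n - 1) * y.var(ddof=1)) / (m + n - 2)  # (9.4.3)
T = ((y.mean() - x.mean()) - delta) / np.sqrt(S2 * (1 / m + 1 / n))     # (9.4.4)
crit = t_dist(m + n - 2).ppf(1 - alpha / 2)
print(T, abs(T) > crit)
# For delta = 0, scipy.stats.ttest_ind(y, x) should reproduce this T.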

Tests of the Ratio of the Variances


Next we will construct tests for the ratio of the distribution variances τ²/σ². So the basic assumption is that the variances, and of course the means μ and ν, are unknown.

For a conjectured ρ ∈ (0, ∞), define the test statistic

F = ρ S²(X) / S²(Y)   (9.4.7)

1. If τ²/σ² = ρ then F has the F distribution with m − 1 degrees of freedom in the numerator and n − 1 degrees of freedom in the denominator.
2. If τ²/σ² ≠ ρ then F has a scaled F distribution with m − 1 degrees of freedom in the numerator, n − 1 degrees of freedom in the denominator, and scale factor ρ σ²/τ².

Proof

The following tests have significance level α:

1. Reject H0: τ²/σ² = ρ versus H1: τ²/σ² ≠ ρ if and only if F > f_{m−1,n−1}(1 − α/2) or F < f_{m−1,n−1}(α/2).
2. Reject H0: τ²/σ² ≤ ρ versus H1: τ²/σ² > ρ if and only if F < f_{m−1,n−1}(α).
3. Reject H0: τ²/σ² ≥ ρ versus H1: τ²/σ² < ρ if and only if F > f_{m−1,n−1}(1 − α).
Proof

For each of the tests above, we fail to reject H0 at significance level α if and only if ρ is in the corresponding 1 − α level confidence interval.

1. (S²(Y)/S²(X)) f_{m−1,n−1}(α/2) ≤ ρ ≤ (S²(Y)/S²(X)) f_{m−1,n−1}(1 − α/2)
2. ρ ≥ (S²(Y)/S²(X)) f_{m−1,n−1}(α)
3. ρ ≤ (S²(Y)/S²(X)) f_{m−1,n−1}(1 − α)

Proof
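Here is an added computational sketch of the two-sided ratio test in Python with scipy (the samples are hypothetical):

# Added sketch: the F test of H0: tau^2/sigma^2 = rho (scipy).
import numpy as np
from scipy.stats import f as f_dist

x = np.array([2.1, 2.5, 1.9, 2.3, 2.8, 2.2, 2.4])  # hypothetical first sample
y = np.array([3.1, 2.0, 3.6, 2.4, 4.0, 2.7])       # hypothetical second sample
rho, alpha = 1.0, 0.1            # rho = 1 tests equality of the variances
m, n = len(x), len(y)
F = rho * x.var(ddof=1) / y.var(ddof=1)            # test statistic (9.4.7)
lo = f_dist(m - 1, n - 1).ppf(alpha / 2)
hi = f_dist(m - 1, n - 1).ppf(1 - alpha / 2)
print(F, (F < lo) or (F > hi))   # True means reject H0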

Tests in the Bivariate Normal Model


In this subsection, we consider a model that is superficially similar to the two-sample normal model, but is actually much simpler.
Suppose that
((X1 , Y1 ), (X2 , Y2 ), … , (Xn , Yn )) (9.4.9)

is a random sample of size n from the bivariate normal distribution of (X, Y) with E(X) = μ, E(Y) = ν, var(X) = σ², var(Y) = τ², and cov(X, Y) = δ.

Thus, instead of a pair of samples, we have a sample of pairs. The fundamental difference is that in this model, variables X and Y
are measured on the same objects in a sample drawn from the population, while in the previous model, variables X and Y are
measured on two distinct samples drawn from the population. The bivariate model arises, for example, in before and after
experiments, in which a measurement of interest is recorded for a sample of n objects from the population, both before and after a
treatment. For example, we could record the blood pressure of a sample of n patients, before and after the administration of a
certain drug.
We will use our usual notation for the sample means and variances of X = (X 1, X2 , … , Xn ) and Y = (Y1 , Y2 , … , Yn ) . Recall
also that the sample covariance of (X, Y ) is
S(X, Y) = (1/(n − 1)) ∑_{i=1}^n [Xi − M(X)][Yi − M(Y)]   (9.4.10)

(not to be confused with the pooled estimate of the standard deviation in the two-sample model above).

The sequence of differences Y − X = (Y1 − X1, Y2 − X2, …, Yn − Xn) is a random sample of size n from the distribution of Y − X. The sampling distribution is normal with

1. E(Y − X) = ν − μ
2. var(Y − X) = σ² + τ² − 2δ

The sample mean and variance of the sample of differences are

1. M(Y − X) = M(Y) − M(X)
2. S²(Y − X) = S²(X) + S²(Y) − 2 S(X, Y)
The sample of differences Y − X fits the normal model for a single variable. The section on Tests in the Normal Model could be used to perform tests for the distribution mean ν − μ and the distribution variance σ² + τ² − 2δ.
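As an added sketch (not part of the original text), here is this reduction in Python with scipy, using hypothetical paired data:

# Added sketch: reducing the bivariate model to the one-sample normal model.
import numpy as np
from scipy.stats import t as t_dist

x = np.array([105.0, 118.0, 98.0, 110.0, 125.0, 102.0])  # hypothetical "before"
y = np.array([112.0, 121.0, 99.0, 116.0, 124.0, 109.0])  # hypothetical "after"
d = y - x                       # a random sample from the distribution of Y - X
n, alpha = len(d), 0.1
T = d.mean() / (d.std(ddof=1) / np.sqrt(n))   # two-sided test of H0: nu - mu = 0
crit = t_dist(n - 1).ppf(1 - alpha / 2)
print(T, abs(T) > crit)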

Computational Exercises
A new drug is being developed to reduce a certain blood chemical. A sample of 36 patients are given a placebo while a sample of 49 patients are given the drug. The statistics (in mg) are m1 = 87, s1 = 4, m2 = 63, s2 = 6. Test the following at the 10% significance level:

1. H0: σ1 = σ2 versus H1: σ1 ≠ σ2.
2. H0: μ1 ≤ μ2 versus H1: μ1 > μ2 (assuming that σ1 = σ2).
3. Based on (b), is the drug effective?
Answer

A company claims that an herbal supplement improves intelligence. A sample of 25 persons are given a standard IQ test before and after taking the supplement. The before and after statistics are m1 = 105, s1 = 13, m2 = 110, s2 = 17, s_{1,2} = 190. At the 10% significance level, do you believe the company's claim?


Answer

In Fisher's iris data, consider the petal length variable for the samples of Versicolor and Virginica irises. Test the following at the 10% significance level:
1. H0: σ1 = σ2 versus H1: σ1 ≠ σ2.
2. H0: μ1 ≤ μ2 versus H1: μ1 > μ2 (assuming that σ1 = σ2).
Answer

A plant has two machines that produce a circular rod whose diameter (in cm) is critical. A sample of 100 rods from the first machine has mean 10.3 and standard deviation 1.2. A sample of 100 rods from the second machine has mean 9.8 and standard deviation 1.6. Test the following hypotheses at the 10% level.
1. H0: σ1 = σ2 versus H1: σ1 ≠ σ2.
2. H0: μ1 = μ2 versus H1: μ1 ≠ μ2 (assuming that σ1 = σ2).
Answer

This page titled 9.4: Tests in the Two-Sample Normal Model is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

9.5: Likelihood Ratio Tests

Basic Theory
As usual, our starting point is a random experiment with an underlying sample space, and a probability measure P. In the basic
statistical model, we have an observable random variable X taking values in a set S . In general, X can have quite a complicated
structure. For example, if the experiment is to sample n objects from a population and record various measurements of interest,
then
X = (X1 , X2 , … , Xn ) (9.5.1)

where Xi is the vector of measurements for the ith object. The most important special case occurs when (X1, X2, …, Xn) are independent and identically distributed. In this case, we have a random sample of size n from the common distribution.
In the previous sections, we developed tests for parameters based on natural test statistics. However, in other cases, the tests may
not be parametric, or there may not be an obvious statistic to start with. Thus, we need a more general method for constructing test
statistics. Moreover, we do not yet know if the tests constructed so far are the best, in the sense of maximizing the power for the set
of alternatives. In this and the next section, we investigate both of these ideas. Likelihood functions, similar to those used in
maximum likelihood estimation, will play a key role.

Tests of Simple Hypotheses


Suppose that X has one of two possible distributions. Our simple hypotheses are

H0: X has probability density function f0.
H1: X has probability density function f1.

We will use subscripts on the probability measure P to indicate the two hypotheses, and we assume that f0 and f1 are positive on S. The test that we will construct is based on the following simple idea: if we observe X = x, then the condition f1(x) > f0(x) is evidence in favor of the alternative; the opposite inequality is evidence against the alternative.

The likelihood ratio function L : S → (0, ∞) is defined by

L(x) = f0(x) / f1(x),  x ∈ S   (9.5.2)

The statistic L(X) is the likelihood ratio statistic.

Restating our earlier observation, note that small values of L are evidence in favor of H1. Thus it seems reasonable that the likelihood ratio statistic may be a good test statistic, and that we should consider tests in which we reject H0 if and only if L ≤ l, where l is a constant to be determined:

The significance level of the test is α = P0(L ≤ l).

As usual, we can try to construct a test by choosing l so that α is a prescribed value. If X has a discrete distribution, this will only be possible when α is a value of the distribution function of L(X).

An important special case of this model occurs when the distribution of X depends on a parameter θ that has two possible values. Thus, the parameter space is {θ0, θ1}, and f0 denotes the probability density function of X when θ = θ0 and f1 denotes the probability density function of X when θ = θ1. In this case, the hypotheses are equivalent to H0: θ = θ0 versus H1: θ = θ1.

As noted earlier, another important special case is when X = (X1, X2, …, Xn) is a random sample of size n from the distribution of an underlying random variable X taking values in a set R. In this case, S = Rⁿ and the probability density function f of X has the form

f (x1 , x2 , … , xn ) = g(x1 )g(x2 ) ⋯ g(xn ), (x1 , x2 , … , xn ) ∈ S (9.5.3)

where g is the probability density function of X. So the hypotheses simplify to


H0: X has probability density function g0.
H1: X has probability density function g1.

and the likelihood ratio statistic is

L(X1, X2, …, Xn) = ∏_{i=1}^n g0(Xi) / g1(Xi)   (9.5.4)

In this special case, it turns out that under H1, the likelihood ratio statistic, as a function of the sample size n, is a martingale.

The Neyman-Pearson Lemma


The following theorem is the Neyman-Pearson Lemma, named for Jerzy Neyman and Egon Pearson. It shows that the test given
above is most powerful. Let
R = {x ∈ S : L(x) ≤ l} (9.5.5)

and recall that the size of a rejection region is the significance of the test with that rejection region.

Consider the tests with rejection regions R given above and arbitrary A ⊆ S. If the size of R is at least as large as the size of A, then the test with rejection region R is at least as powerful as the test with rejection region A. That is, if P0(X ∈ R) ≥ P0(X ∈ A) then P1(X ∈ R) ≥ P1(X ∈ A).

Proof

The Neyman-Pearson lemma is more useful than might be first apparent. In many important cases, the same most powerful test
works for a range of alternatives, and thus is a uniformly most powerful test for this range. Several special cases are discussed
below.

Generalized Likelihood Ratio


The likelihood ratio statistic can be generalized to composite hypotheses. Suppose again that the probability density function fθ of the data variable X depends on a parameter θ, taking values in a parameter space Θ. Consider the hypotheses θ ∈ Θ0 versus θ ∉ Θ0, where Θ0 ⊆ Θ.

Define

L(x) = sup{fθ(x) : θ ∈ Θ0} / sup{fθ(x) : θ ∈ Θ}   (9.5.9)

The function L is the likelihood ratio function and L(X) is the likelihood ratio statistic.

By the same reasoning as before, small values of L(x) are evidence in favor of the alternative hypothesis.

Examples and Special Cases


Tests for the Exponential Model
Suppose that X = (X1, X2, …, Xn) is a random sample of size n ∈ N+ from the exponential distribution with scale parameter b ∈ (0, ∞). The sample variables might represent the lifetimes from a sample of devices of a certain type. We are interested in testing the simple hypotheses H0: b = b0 versus H1: b = b1, where b0, b1 ∈ (0, ∞) are distinct specified values.

Recall that the sum of the variables is a sufficient statistic for b:

Y = ∑_{i=1}^n Xi   (9.5.10)

Recall also that Y has the gamma distribution with shape parameter n and scale parameter b. For α ∈ (0, 1), we will denote the quantile of order α for this distribution by γ_{n,b}(α).

The likelihood ratio statistic is

L = (b1/b0)^n exp[(1/b1 − 1/b0) Y]   (9.5.11)
Proof

The following tests are most powerful at the α level:

1. Suppose that b1 > b0. Reject H0: b = b0 versus H1: b = b1 if and only if Y ≥ γ_{n,b0}(1 − α).
2. Suppose that b1 < b0. Reject H0: b = b0 versus H1: b = b1 if and only if Y ≤ γ_{n,b0}(α).
Proof

Note that these tests do not depend on the value of b1. This fact, together with the monotonicity of the power function, can be used to show that the tests are uniformly most powerful for the usual one-sided tests.

Suppose that b0 ∈ (0, ∞).

1. The decision rule in part (a) above is uniformly most powerful for the test H0: b ≤ b0 versus H1: b > b0.
2. The decision rule in part (b) above is uniformly most powerful for the test H0: b ≥ b0 versus H1: b < b0.
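For illustration, here is an added sketch of the test in part (a) in Python with scipy (the lifetime data and parameters are hypothetical):

# Added sketch: the uniformly most powerful test for the exponential scale
# parameter, via gamma quantiles (scipy).
import numpy as np
from scipy.stats import gamma

x = np.array([1.2, 0.4, 2.7, 0.9, 1.8, 0.6, 3.1, 1.1])  # hypothetical lifetimes
b0, alpha = 1.0, 0.05
n = len(x)
Y = x.sum()                                  # sufficient statistic (9.5.10)
crit = gamma(a=n, scale=b0).ppf(1 - alpha)   # gamma_{n,b0}(1 - alpha)
print(Y, Y >= crit)                          # True: reject H0 in favor of b > b0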

Tests for the Bernoulli Model


Suppose that X = (X1, X2, …, Xn) is a random sample of size n ∈ N+ from the Bernoulli distribution with success parameter p. The sample could represent the results of tossing a coin n times, where p is the probability of heads. We wish to test the simple hypotheses H0: p = p0 versus H1: p = p1, where p0, p1 ∈ (0, 1) are distinct specified values. In the coin tossing model, we know that the probability of heads is either p0 or p1, but we don't know which.

Recall that the number of successes is a sufficient statistic for p:

Y = ∑_{i=1}^n Xi   (9.5.14)

Recall also that Y has the binomial distribution with parameters n and p. For α ∈ (0, 1), we will denote the quantile of order α for this distribution by b_{n,p}(α); although since the distribution is discrete, only certain values of α are possible.

The likelihood ratio statistic is

L = ((1 − p0)/(1 − p1))^n [p0(1 − p1) / (p1(1 − p0))]^Y   (9.5.15)

Proof

The following tests are most powerful at the α level:

1. Suppose that p1 > p0. Reject H0: p = p0 versus H1: p = p1 if and only if Y ≥ b_{n,p0}(1 − α).
2. Suppose that p1 < p0. Reject H0: p = p0 versus H1: p = p1 if and only if Y ≤ b_{n,p0}(α).

Proof

Note that these tests do not depend on the value of p1. This fact, together with the monotonicity of the power function, can be used to show that the tests are uniformly most powerful for the usual one-sided tests.

Suppose that p0 ∈ (0, 1).

1. The decision rule in part (a) above is uniformly most powerful for the test H0: p ≤ p0 versus H1: p > p0.
2. The decision rule in part (b) above is uniformly most powerful for the test H0: p ≥ p0 versus H1: p < p0.

Tests in the Normal Model


The one-sided tests that we derived in the normal model, for μ with σ known, for μ with σ unknown, and for σ with μ unknown
are all uniformly most powerful. On the other hand, none of the two-sided tests are uniformly most powerful.

A Nonparametric Example
Suppose that X = (X1, X2, …, Xn) is a random sample of size n ∈ N+, either from the Poisson distribution with parameter 1 or from the geometric distribution on N with parameter p = 1/2. Note that both distributions have mean 1 (although the Poisson distribution has variance 1 while the geometric distribution has variance 2). So, we wish to test the hypotheses

H0: X has probability density function g0(x) = e^{−1} / x! for x ∈ N.
H1: X has probability density function g1(x) = (1/2)^{x+1} for x ∈ N.

The likelihood ratio statistic is

L = 2^n e^{−n} 2^Y / U,  where Y = ∑_{i=1}^n Xi and U = ∏_{i=1}^n Xi!   (9.5.18)

Proof

The most powerful tests have the following form, where d is a constant: reject H0 if and only if ln(2) Y − ln(U) ≤ d.
Proof
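As an added sketch (not part of the original text), the statistic ln(2) Y − ln(U) can be computed in Python, using the log-gamma function to evaluate the factorials; the counts below are hypothetical:

# Added sketch: the test statistic for the Poisson-versus-geometric example.
import numpy as np
from scipy.special import gammaln

x = np.array([0, 2, 1, 1, 3, 0, 1, 2])   # hypothetical counts
Y = x.sum()
log_U = gammaln(x + 1).sum()             # ln(U) = sum of ln(x_i!)
stat = np.log(2) * Y - log_U             # reject H0 when this is <= d
print(Y, stat)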

This page titled 9.5: Likelihood Ratio Tests is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

9.6: Chi-Square Tests

In this section, we will study a number of important hypothesis tests that fall under the general term chi-square tests. These are named, as you might guess, because in each case the test statistic has (in the limit) a chi-square distribution. Although there are several different tests in this general category, they all share some common themes:
In each test, there are one or more underlying multinomial samples. Of course, the multinomial model includes the Bernoulli model as a special case.
Each test works by comparing the observed frequencies of the various outcomes with expected frequencies under the null
hypothesis.
If the model is incompletely specified, some of the expected frequencies must be estimated; this reduces the degrees of freedom
in the limiting chi-square distribution.
We will start with the simplest case, where the derivation is the most straightforward; in fact this test is equivalent to a test we have
already studied. We then move to successively more complicated models.

The One-Sample Bernoulli Model


Suppose that X = (X1, X2, …, Xn) is a random sample from the Bernoulli distribution with unknown success parameter p ∈ (0, 1). Thus, these are independent random variables taking the values 1 and 0 with probabilities p and 1 − p respectively. We want to test H0: p = p0 versus H1: p ≠ p0, where p0 ∈ (0, 1) is specified. Of course, we have already studied such tests in the Bernoulli model. But keep in mind that our methods in this section will generalize to a variety of new models that we have not yet studied.
n n
Let O1 = ∑_{j=1}^n Xj and O0 = n − O1 = ∑_{j=1}^n (1 − Xj). These statistics give the number of times (frequency) that outcomes 1 and 0 occur, respectively. Moreover, we know that each has a binomial distribution; O1 has parameters n and p, while O0 has parameters n and 1 − p. In particular, E(O1) = np, E(O0) = n(1 − p), and var(O1) = var(O0) = np(1 − p). Moreover, recall that O1 is sufficient for p. Thus, any good test statistic should be a function of O1. Next, recall that when n is large, the distribution of O1 is approximately normal, by the central limit theorem. Let

Z = (O1 − n p0) / √(n p0 (1 − p0))   (9.6.1)

Note that Z is the standard score of O1 under H0. Hence if n is large, Z has approximately the standard normal distribution under H0, and therefore V = Z² has approximately the chi-square distribution with 1 degree of freedom under H0. As usual, let χ²_k denote the quantile function of the chi-square distribution with k degrees of freedom.

An approximate test of H0 versus H1 at the α level of significance is to reject H0 if and only if V > χ²_1(1 − α).

The test above is equivalent to the unbiased test with test statistic Z (the approximate normal test) derived in the section on
Tests in the Bernoulli model.

For purposes of generalization, the critical result in the next exercise is a special representation of V. Let e0 = n(1 − p0) and e1 = n p0. Note that these are the expected frequencies of the outcomes 0 and 1, respectively, under H0.

V can be written in terms of the observed and expected frequencies as follows:

V = (O0 − e0)²/e0 + (O1 − e1)²/e1   (9.6.2)

This representation shows that our test statistic V measures the discrepancy between the expected frequencies, under H0, and the observed frequencies. Of course, large values of V are evidence in favor of H1. Finally, note that although there are two terms in the expansion of V in Exercise 3, there is only one degree of freedom since O0 + O1 = n. The observed and expected frequencies could be stored in a 1 × 2 table.
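As an added computational sketch (not part of the original text), the chi-square form of the test is immediate in Python with scipy; the observed frequency is hypothetical:

# Added sketch: the chi-square version of the one-sample Bernoulli test.
from scipy.stats import chi2, chisquare

n, p0, alpha = 100, 0.5, 0.1
O1 = 58                                    # hypothetical observed frequency of 1
O0 = n - O1
e0, e1 = n * (1 - p0), n * p0              # expected frequencies under H0
V = (O0 - e0)**2 / e0 + (O1 - e1)**2 / e1  # test statistic (9.6.2)
print(V, V > chi2(1).ppf(1 - alpha))       # True means reject H0
# chisquare([O0, O1], f_exp=[e0, e1]) returns the same statistic and its P-value.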

The Multi-Sample Bernoulli Model
Suppose now that we have samples from several (possibly) different, independent Bernoulli trials processes. Specifically, suppose that X_i = (X_{i,1}, X_{i,2}, …, X_{i,n_i}) is a random sample of size n_i from the Bernoulli distribution with unknown success parameter p_i ∈ (0, 1) for each i ∈ {1, 2, …, m}. Moreover, the samples (X_1, X_2, …, X_m) are independent. We want to test hypotheses about the unknown parameter vector p = (p_1, p_2, …, p_m). There are two common cases that we consider below, but first let's set up the essential notation that we will need for both cases. For i ∈ {1, 2, …, m} and j ∈ {0, 1}, let O_{i,j} denote the number of times that outcome j occurs in sample X_i. The observed frequency O_{i,j} has a binomial distribution; O_{i,1} has parameters n_i and p_i while O_{i,0} has parameters n_i and 1 − p_i.

The Completely Specified Case


Consider a specified parameter vector p0 = (p_{0,1}, p_{0,2}, …, p_{0,m}) ∈ (0, 1)^m. We want to test the null hypothesis H0: p = p0, versus H1: p ≠ p0. Since the null hypothesis specifies the value of p_i for each i, this is called the completely specified case. Now let e_{i,0} = n_i (1 − p_{0,i}) and let e_{i,1} = n_i p_{0,i}. These are the expected frequencies of the outcomes 0 and 1, respectively, from sample X_i under H0.

If n_i is large for each i, then under H0 the following test statistic has approximately the chi-square distribution with m degrees of freedom:

V = ∑_{i=1}^m ∑_{j=0}^1 (O_{i,j} − e_{i,j})² / e_{i,j}   (9.6.3)

Proof

As a rule of thumb, “large” means that we need e_{i,j} ≥ 5 for each i ∈ {1, 2, …, m} and j ∈ {0, 1}. But of course, the larger these expected frequencies the better.

Under the large sample assumption, an approximate test of H0 versus H1 at the α level of significance is to reject H0 if and only if V > χ²_m(1 − α).

Once again, note that the test statistic V measures the discrepancy between the expected and observed frequencies, over all outcomes and all samples. There are 2m terms in the expansion of V in Exercise 4, but only m degrees of freedom, since O_{i,0} + O_{i,1} = n_i for each i ∈ {1, 2, …, m}. The observed and expected frequencies could be stored in an m × 2 table.

The Equal Probability Case


Suppose now that we want to test the null hypothesis H0: p_1 = p_2 = ⋯ = p_m that all of the success probabilities are the same, versus the complementary alternative hypothesis H1 that the probabilities are not all the same. Note, in contrast to the previous model, that the null hypothesis does not specify the value of the common success probability p. But note also that under the null hypothesis, the m samples can be combined to form one large sample of Bernoulli trials with success probability p. Thus, a natural approach is to estimate p and then define the test statistic that measures the discrepancy between the expected and observed frequencies, just as before. The challenge will be to find the distribution of the test statistic.
m
Let n = ∑_{i=1}^m n_i denote the total sample size when the samples are combined. Then the overall sample mean, which in this context is the overall sample proportion of successes, is

P = (1/n) ∑_{i=1}^m ∑_{j=1}^{n_i} X_{i,j} = (1/n) ∑_{i=1}^m O_{i,1}   (9.6.4)

The sample proportion P is the best estimate of p, in just about any sense of the word. Next, let E_{i,0} = n_i (1 − P) and E_{i,1} = n_i P. These are the estimated expected frequencies of 0 and 1, respectively, from sample X_i under H0. Of course these estimated frequencies are now statistics (and hence random) rather than parameters. Just as before, we define our test statistic

V = ∑_{i=1}^m ∑_{j=0}^1 (O_{i,j} − E_{i,j})² / E_{i,j}   (9.6.5)

It turns out that under H0 , the distribution of V converges to the chi-square distribution with m −1 degrees of freedom as
n → ∞.

An approximate test of H0 versus H1 at the α level of significance is to reject H0 if and only if V > χ²_{m−1}(1 − α).

Intuitively, we lost a degree of freedom over the completely specified case because we had to estimate the unknown common
success probability p. Again, the observed and expected frequencies could be stored in an m × 2 table.

The One-Sample Multinomial Model


Our next model generalizes the one-sample Bernoulli model in a different direction. Suppose that X = (X1, X2, …, Xn) is a sequence of multinomial trials. Thus, these are independent, identically distributed random variables, each taking values in a set S with k elements. If we want, we can assume that S = {0, 1, …, k − 1}; the one-sample Bernoulli model then corresponds to k = 2. Let f denote the common probability density function of the sample variables on S, so that f(j) = P(Xi = j) for i ∈ {1, 2, …, n} and j ∈ S. The values of f are assumed unknown, but of course we must have ∑_{j∈S} f(j) = 1, so there are really only k − 1 unknown parameters. For a given probability density function f0 on S we want to test H0: f = f0 versus H1: f ≠ f0.

By this time, our general approach should be clear. We let Oj denote the number of times that outcome j ∈ S occurs in sample X:

Oj = ∑_{i=1}^n 1(Xi = j)   (9.6.6)

Note that Oj has the binomial distribution with parameters n and f(j). Thus, ej = n f0(j) is the expected number of times that outcome j occurs, under H0. Our test statistic, of course, is

V = ∑_{j∈S} (Oj − ej)² / ej   (9.6.7)

It turns out that under H0, the distribution of V converges to the chi-square distribution with k − 1 degrees of freedom as n → ∞. Note that there are k terms in the expansion of V, but only k − 1 degrees of freedom since ∑_{j∈S} Oj = n.

An approximate test of H0 versus H1 at the α level of significance is to reject H0 if and only if V > χ²_{k−1}(1 − α).

Again, as a rule of thumb, we need ej ≥ 5 for each j ∈ S, but the larger the expected frequencies the better.
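For illustration, here is an added sketch of this goodness-of-fit test in Python with scipy (the observed frequencies and the conjectured f0 are hypothetical):

# Added sketch: the one-sample multinomial chi-square test.
import numpy as np
from scipy.stats import chisquare

O = np.array([22, 18, 25, 15, 20])         # observed frequencies O_j; n = 100
f0 = np.array([0.2, 0.2, 0.2, 0.2, 0.2])   # conjectured pdf f0 on S, with k = 5
e = O.sum() * f0                           # expected frequencies e_j = n f0(j)
V, p_value = chisquare(O, f_exp=e)         # V is approximately chi-square(k - 1) under H0
print(V, p_value)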

The Multi-Sample Multinomial Model


As you might guess, our final generalization is to the multi-sample multinomial model. Specifically, suppose that X_i = (X_{i,1}, X_{i,2}, …, X_{i,n_i}) is a random sample of size n_i from a distribution on a set S with k elements, for each i ∈ {1, 2, …, m}. Moreover, we assume that the samples (X_1, X_2, …, X_m) are independent. Again there is no loss in generality if we take S = {0, 1, …, k − 1}. Then k = 2 reduces to the multi-sample Bernoulli model, and m = 1 corresponds to the one-sample multinomial model.

Let f_i denote the common probability density function of the variables in sample X_i, so that f_i(j) = P(X_{i,l} = j) for i ∈ {1, 2, …, m}, l ∈ {1, 2, …, n_i}, and j ∈ S. These are generally unknown, so that our vector of parameters is the vector of probability density functions f = (f_1, f_2, …, f_m). Of course, ∑_{j∈S} f_i(j) = 1 for i ∈ {1, 2, …, m}, so there are actually m(k − 1) unknown parameters. We are interested in testing hypotheses about f. As in the multi-sample Bernoulli model, there are two common cases that we consider below, but first let's set up the essential notation that we will need for both cases. For i ∈ {1, 2, …, m} and j ∈ S, let O_{i,j} denote the number of times that outcome j occurs in sample X_i. The observed frequency O_{i,j} has a binomial distribution with parameters n_i and f_i(j).

The Completely Specified Case


Consider a given vector of probability density functions on S, denoted f0 = (f_{0,1}, f_{0,2}, …, f_{0,m}). We want to test the null hypothesis H0: f = f0, versus H1: f ≠ f0. Since the null hypothesis specifies the value of f_{0,i}(j) for each i and j, this is called the completely specified case. Let e_{i,j} = n_i f_{0,i}(j). This is the expected frequency of outcome j in sample X_i under H0.

If n_i is large for each i, then under H0, the test statistic V below has approximately the chi-square distribution with m(k − 1) degrees of freedom:

V = ∑_{i=1}^m ∑_{j∈S} (O_{i,j} − e_{i,j})² / e_{i,j}   (9.6.8)

Proof

As usual, our rule of thumb is that we need e_{i,j} ≥ 5 for each i ∈ {1, 2, …, m} and j ∈ S. But of course, the larger these expected frequencies the better.

Under the large sample assumption, an approximate test of H0 versus H1 at the α level of significance is to reject H0 if and only if V > χ²_{m(k−1)}(1 − α).

As always, the test statistic V measures the discrepancy between the expected and observed frequencies, over all outcomes and all samples. There are mk terms in the expansion of V in Exercise 8, but we lose m degrees of freedom, since ∑_{j∈S} O_{i,j} = n_i for each i ∈ {1, 2, …, m}.

The Equal PDF Case


Suppose now that we want to test the null hypothesis H0: f_1 = f_2 = ⋯ = f_m that all of the probability density functions are the same, versus the complementary alternative hypothesis H1 that the probability density functions are not all the same. Note, in contrast to the previous model, that the null hypothesis does not specify the value of the common probability density function f. But note also that under the null hypothesis, the m samples can be combined to form one large sample of multinomial trials with probability density function f. Thus, a natural approach is to estimate the values of f and then define the test statistic that measures the discrepancy between the expected and observed frequencies, just as before.
Let n = ∑_{i=1}^m n_i denote the total sample size when the samples are combined. Under H0, our best estimate of f(j) is

P_j = (1/n) ∑_{i=1}^m O_{i,j}   (9.6.9)

Hence our estimate of the expected frequency of outcome j in sample X_i under H0 is E_{i,j} = n_i P_j. Again, this estimated frequency is now a statistic (and hence random) rather than a parameter. Just as before, we define our test statistic

V = ∑_{i=1}^m ∑_{j∈S} (O_{i,j} − E_{i,j})² / E_{i,j}   (9.6.10)

As you no doubt expect by now, it turns out that under H0, the distribution of V converges to a chi-square distribution as n → ∞. But let's see if we can determine the degrees of freedom heuristically.

The limiting distribution of V has (k − 1)(m − 1) degrees of freedom.


Proof

An approximate test of H0 versus H1 at the α level of significance is to reject H0 if and only if V > χ²_{(k−1)(m−1)}(1 − α).

A Goodness of Fit Test


A goodness of fit test is an hypothesis test that an unknown sampling distribution is a particular, specified distribution or belongs to
a parametric family of distributions. Such tests are clearly fundamental and important. The one-sample multinomial model leads to
a quite general goodness of fit test.
To set the stage, suppose that we have an observable random variable X for an experiment, taking values in a general set S .
Random variable X might have a continuous or discrete distribution, and might be single-variable or multi-variable. We want to
test the null hypothesis that X has a given, completely specified distribution, or that the distribution of X belongs to a particular
parametric family.
Our first step, in either case, is to sample from the distribution of X to obtain a sequence of independent, identically distributed variables X = (X_1, X_2, … , X_n). Next, we select k ∈ N_+ and partition S into k (disjoint) subsets. We will denote the partition by {A_j : j ∈ J} where #(J) = k. Next, we define the sequence of random variables Y = (Y_1, Y_2, … , Y_n) by Y_i = j if and only if X_i ∈ A_j for i ∈ {1, 2, … , n} and j ∈ J.

Y is a multinomial trials sequence with parameters n and f, where f(j) = P(X ∈ A_j) for j ∈ J.

The Completely Specified Case


Let H_0 denote the statement that X has a given, completely specified distribution. Let f_0 denote the probability density function on J defined by f_0(j) = P(X ∈ A_j ∣ H_0) for j ∈ J. To test hypothesis H_0, we can formally test H_0: f = f_0 versus H_1: f ≠ f_0, which of course, is precisely the problem we solved in the one-sample multinomial model.
Generally, we would partition the space S into as many subsets as possible, subject to the restriction that the expected frequencies
all be at least 5.

The Partially Specified Case


Often we don't really want to test whether X has a completely specified distribution (such as the normal distribution with mean 5
and variance 9), but rather whether the distribution of X belongs to a specified parametric family (such as the normal). A natural
course of action in this case would be to estimate the unknown parameters and then proceed just as above. As we have seen before,
the expected frequencies would be statistics E_j because they would be based on the estimated parameters. As a rule of thumb, we lose a degree of freedom in the chi-square statistic V for each parameter that we estimate, although the precise mathematics can be complicated.

A Test of Independence
Suppose that we have observable random variables X and Y for an experiment, where X takes values in a set S with k elements,
and Y takes values in a set T with m elements. Let f denote the joint probability density function of (X, Y ), so that
f (i, j) = P(X = i, Y = j) for i ∈ S and j ∈ T . Recall that the marginal probability density functions of X and Y are the

functions g and h respectively, where

g(i) = ∑_{j ∈ T} f(i, j),  i ∈ S  (9.6.11)

h(j) = ∑_{i ∈ S} f(i, j),  j ∈ T  (9.6.12)

Usually, of course, f , g , and h are unknown. In this section, we are interested in testing whether X and Y are independent, a basic
and important test. Formally then, we want to test the null hypothesis

H_0: f(i, j) = g(i) h(j),  (i, j) ∈ S × T  (9.6.13)

versus the complementary alternative H_1.

Our first step, of course, is to draw a random sample (X, Y) = ((X_1, Y_1), (X_2, Y_2), … , (X_n, Y_n)) from the distribution of (X, Y). Since the state spaces are finite, this sample forms a sequence of multinomial trials. Thus, with our usual notation, let O_{i,j} denote the number of times that (i, j) occurs in the sample, for each (i, j) ∈ S × T. This statistic has the binomial distribution with trial parameter n and success parameter f(i, j). Under H_0, the success parameter is g(i) h(j). However, since we don't know the success parameters, we must estimate them in order to compute the expected frequencies. Our best estimate of f(i, j) is the sample proportion (1/n) O_{i,j}. Thus, our best estimates of g(i) and h(j) are (1/n) N_i and (1/n) M_j, respectively, where N_i is the number of times that i occurs in sample X and M_j is the number of times that j occurs in sample Y:

N_i = ∑_{j ∈ T} O_{i,j}  (9.6.14)

M_j = ∑_{i ∈ S} O_{i,j}  (9.6.15)

Thus, our estimate of the expected frequency of (i, j) under H_0 is

E_{i,j} = n (1/n) N_i (1/n) M_j = (1/n) N_i M_j  (9.6.16)

Of course, we define our test statistic by

V = ∑_{i ∈ S} ∑_{j ∈ T} (O_{i,j} − E_{i,j})² / E_{i,j}  (9.6.17)

As you now expect, the distribution of V converges to a chi-square distribution as n → ∞ . But let's see if we can determine the
appropriate degrees of freedom on heuristic grounds.

The limiting distribution of V has (k − 1) (m − 1) degrees of freedom.


Proof

An approximate test of H_0 versus H_1 at the α level of significance is to reject H_0 if and only if V > χ²_{(k−1)(m−1)}(1 − α).

The observed frequencies are often recorded in a k × m table, known as a contingency table, so that O_{i,j} is the number in row i and column j. In this setting, note that N_i is the sum of the frequencies in the ith row and M_j is the sum of the frequencies in the jth column. Also, for historical reasons, the random variables X and Y are sometimes called factors and the possible values of the variables categories.
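As a quick illustration, here is a minimal Python sketch of the independence test for a hypothetical contingency table (the counts are made up, not from the text; scipy is assumed):

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical k x m contingency table of observed frequencies O_{i,j}.
O = np.array([[20, 30, 25],
              [30, 20, 25]])
n = O.sum()                          # total sample size
N = O.sum(axis=1, keepdims=True)     # row sums N_i
M = O.sum(axis=0, keepdims=True)     # column sums M_j
E = N * M / n                        # expected frequencies (9.6.16)
V = ((O - E) ** 2 / E).sum()         # test statistic (9.6.17)
k, m = O.shape
print(V, chi2.sf(V, (k - 1) * (m - 1)))
# scipy.stats.chi2_contingency(O) packages the same computation.
```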

Computational and Simulation Exercises


Computational Exercises
In each of the following exercises, specify the number of degrees of freedom of the chi-square statistic, give the value of the
statistic and compute the P -value of the test.

A coin is tossed 100 times, resulting in 55 heads. Test the null hypothesis that the coin is fair.
Answer

Suppose that we have 3 coins. The coins are tossed, yielding the data in the following table:

Heads Tails

Coin 1 29 21

Coin 2 23 17

Coin 3 42 18

1. Test the null hypothesis that all 3 coins are fair.


2. Test the null hypothesis that coin 1 has probability of heads 3/5; coin 2 is fair; and coin 3 has probability of heads 2/3.
3. Test the null hypothesis that the 3 coins have the same probability of heads.
Answer

A die is thrown 240 times, yielding the data in the following table:

Score 1 2 3 4 5 6

Frequency 57 39 28 28 36 52

1. Test the null hypothesis that the die is fair.


2. Test the null hypothesis that the die is an ace-six flat die (faces 1 and 6 have probability 1/4 each while faces 2, 3, 4, and 5 have probability 1/8 each).
Answer

Two dice are thrown, yielding the data in the following table:

Score 1 2 3 4 5 6

Die 1 22 17 22 13 22 24

Die 2 44 24 19 19 18 36

1. Test the null hypothesis that die 1 is fair and die 2 is an ace-six flat.
2. Test the null hypothesis that the two dice have the same probability distribution.

Answer

A university classifies faculty by rank as instructors, assistant professors, associate professors, and full professors. The data,
by faculty rank and gender, are given in the following contingency table. Test to see if faculty rank and gender are independent.

Faculty Instructor Assistant Professor Associate Professor Full Professor

Male 62 238 185 115

Female 118 122 123 37

Answer

Data Analysis Exercises


The Buffon trial data set gives the results of 104 repetitions of Buffon's needle experiment. The number of crack crossings is 56. In theory, this data set should correspond to 104 Bernoulli trials with success probability p = 2/π. Test to see if this is reasonable.
Answer

Test to see if the alpha emissions data come from a Poisson distribution.
Answer

Test to see if Michelson's velocity of light data come from a normal distribution.
Answer

Simulation Exercises
In the simulation exercises below, you will be able to explore the goodness of fit test empirically.

In the dice goodness of fit experiment, set the sampling distribution to fair, the sample size to 50, and the significance level to
0.1. Set the test distribution as indicated below and in each case, run the simulation 1000 times. In case (a), give the empirical
estimate of the significance level of the test and compare with 0.1. In the other cases, give the empirical estimate of the power
of the test. Rank the distributions in (b)-(d) in increasing order of apparent power. Do your results seem reasonable?
1. fair
2. ace-six flats
3. the symmetric, unimodal distribution
4. the distribution skewed right

In the dice goodness of fit experiment, set the sampling distribution to ace-six flats, the sample size to 50, and the significance
level to 0.1. Set the test distribution as indicated below and in each case, run the simulation 1000 times. In case (a), give the
empirical estimate of the significance level of the test and compare with 0.1. In the other cases, give the empirical estimate of
the power of the test. Rank the distributions in (b)-(d) in increasing order of apparent power. Do your results seem reasonable?
1. fair
2. ace-six flats
3. the symmetric, unimodal distribution
4. the distribution skewed right

In the dice goodness of fit experiment, set the sampling distribution to the symmetric, unimodal distribution, the sample size to
50, and the significance level to 0.1. Set the test distribution as indicated below and in each case, run the simulation 1000
times. In case (a), give the empirical estimate of the significance level of the test and compare with 0.1. In the other cases, give
the empirical estimate of the power of the test. Rank the distributions in (b)-(d) in increasing order of apparent power. Do your
results seem reasonable?
1. the symmetric, unimodal distribution
2. fair
3. ace-six flats

4. the distribution skewed right

In the dice goodness of fit experiment, set the sampling distribution to the distribution skewed right, the sample size to 50, and
the significance level to 0.1. Set the test distribution as indicated below and in each case, run the simulation 1000 times. In case
(a), give the empirical estimate of the significance level of the test and compare with 0.1. In the other cases, give the empirical
estimate of the power of the test. Rank the distributions in (b)-(d) in increasing order of apparent power. Do your results seem
reasonable?
1. the distribution skewed right
2. fair
3. ace-six flats
4. the symmetric, unimodal distribution

Suppose that D_1 and D_2 are different distributions. Is the power of the test with sampling distribution D_1 and test distribution D_2 the same as the power of the test with sampling distribution D_2 and test distribution D_1? Make a conjecture based on your results in the previous three exercises.

In the dice goodness of fit experiment, set the sampling and test distributions to fair and the significance level to 0.05. Run the
experiment 1000 times for each of the following sample sizes. In each case, give the empirical estimate of the significance
level and compare with 0.05.
1. n = 10
2. n = 20
3. n = 40
4. n = 100

In the dice goodness of fit experiment, set the sampling distribution to fair, the test distributions to ace-six flats, and the
significance level to 0.05. Run the experiment 1000 times for each of the following sample sizes. In each case, give the
empirical estimate of the power of the test. Do the powers seem to be converging?
1. n = 10
2. n = 20
3. n = 40
4. n = 100

This page titled 9.6: Chi-Square Tests is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

CHAPTER OVERVIEW
10: Geometric Models
In this chapter, we explore several problems in geometric probability. These problems are interesting, conceptually clear, and the
analysis is relatively simple. Thus, they are good problems for the student of probability. In addition, Buffon's problems and
Bertrand's problem are historically famous, and contributed significantly to the early development of probability theory.
10.1: Buffon's Problems
10.2: Bertrand's Paradox
10.3: Random Triangles

This page titled 10: Geometric Models is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

10.1: Buffon's Problems

Buffon's experiments are very old and famous random experiments, named after comte de Buffon. These experiments are
considered to be among the first problems in geometric probability.

Buffon's Coin Experiment


Buffon's coin experiment consists of dropping a coin randomly on a floor covered with identically shaped tiles. The event of
interest is that the coin crosses a crack between tiles. We will model Buffon's coin problem with square tiles of side length 1—
assuming the side length is 1 is equivalent to taking the side length as the unit of measurement.

Assumptions
First, let us define the experiment mathematically. As usual, we will idealize the physical objects by assuming that the coin is a
perfect circle with radius r and that the cracks between tiles are line segments. A natural way to describe the outcome of the
experiment is to record the center of the coin relative to the center of the tile where the coin happens to fall. More precisely, we will construct coordinate axes so that the tile where the coin falls occupies the square S = [−1/2, 1/2]².
Now when the coin is tossed, we will denote the center of the coin by (X, Y) ∈ S so that S is our sample space and X and Y are our basic random variables. Finally, we will assume that r < 1/2 so that it is at least possible for the coin to fall inside the square without touching a crack.

Figure 10.1.1 : Buffon's floor


Next, we need to define an appropriate probability measure that describes our basic random vector (X, Y ). If the coin falls
“randomly” on the floor, then it is natural to assume that (X, Y ) is uniformly distributed on S . By definition, this means that
P[(X, Y) ∈ A] = area(A) / area(S),  A ⊆ S  (10.1.1)

Run Buffon's coin experiment with the default settings. Watch how the points seem to fill the sample space S in a uniform
manner.

The Probability of a Crack Crossing


Our interest is in the probability of the event C that the coin crosses a crack.

The probability of a crack crossing is P(C) = 1 − (1 − 2r)².

Proof

Figure 10.1.2 : P(C ) as a function of r

In Buffon's coin experiment, change the radius with the scroll bar and watch how the events C and C^c change. Run the experiment with various values of r and compare the physical experiment with the points in the scatterplot. Compare the relative frequency of C to the probability of C.

The convergence of the relative frequency of an event (as the experiment is repeated) to the probability of the event is a special
case of the law of large numbers.
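The experiment is also easy to simulate directly. Here is a minimal Python sketch, assuming the square-tile model above (the radius and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng()

def buffon_coin(r, n=100_000):
    """Estimate P(C) by dropping n coins of radius r < 1/2 on the unit tile."""
    # (X, Y) is uniformly distributed on the square S = [-1/2, 1/2]^2.
    x = rng.uniform(-0.5, 0.5, n)
    y = rng.uniform(-0.5, 0.5, n)
    # The coin crosses a crack iff its center is within r of some edge.
    crossing = (np.abs(x) > 0.5 - r) | (np.abs(y) > 0.5 - r)
    return crossing.mean()

r = 0.2
print(buffon_coin(r), 1 - (1 - 2 * r) ** 2)   # empirical vs exact probability
```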

Solve Buffon's coin problem with rectangular tiles that have height h and width w.
Answer

Solve Buffon's coin problem with equilateral triangular tiles that have side length 1.

Recall that random numbers are simulations of independent random variables, each with the standard uniform distribution, that is, the continuous uniform distribution on the interval (0, 1).

Show how to simulate the center of the coin (X, Y ) in Buffon's coin experiment using random numbers.
Answer

Buffon's Needle Problem


Buffon's needle experiment consists of dropping a needle on a hardwood floor. The main event of interest is that the needle crosses
a crack between floorboards. Strangely enough, the probability of this event leads to a statistical estimate of the number π!

Assumptions
Our first step is to define the experiment mathematically. Again we idealize the physical objects by assuming that the floorboards
are uniform and that each has width 1. We will also assume that the needle has length L < 1 so that the needle cannot cross more
than one crack. Finally, we assume that the cracks between the floorboards and the needle are line segments.
When the needle is dropped, we want to record its orientation relative to the floorboard cracks. One way to do this is to record the
angle X that the top half of the needle makes with the line through the center of the needle, parallel to the floorboards, and the
distance Y from the center of the needle to the bottom crack. These will be the basic random variables of our experiment, and thus
the sample space of the experiment is

S = [0, π) × [0, 1) = {(x, y) : 0 ≤ x < π, 0 ≤ y < 1} (10.1.3)

Figure 10.1.3 : Buffon's needle problem


Again, our main modeling assumption is that the needle is tossed “randomly” on the floor. Thus, a reasonable mathematical
assumption might be that the basic random vector (X, Y ) is uniformly distributed over the sample space. By definition, this means
that
P[(X, Y) ∈ A] = area(A) / area(S),  A ⊆ S  (10.1.4)

Run Buffon's needle experiment with the default settings and watch the outcomes being plotted in the sample space. Note how
the points in the scatterplot seem to fill the sample space S in a uniform way.

The Probability of a Crack Crossing


Our main interest is in the event C that the needle crosses a crack between the floorboards.

The event C can be written in terms of the basic angle and distance variables as follows:

C = {Y < (L/2) sin(X)} ∪ {Y > 1 − (L/2) sin(X)}  (10.1.5)

The curves y = (L/2) sin(x) and y = 1 − (L/2) sin(x) on the interval 0 ≤ x < π are shown in blue in the scatterplot of Buffon's needle experiment, and hence event C is the union of the regions below the lower curve and above the upper curve. Thus, the needle crosses a crack precisely when a point falls in this region.

The probability of a crack crossing is P(C ) = 2L/π.


Proof

Figure 10.1.4 : P(C ) as a function of L

In the Buffon's needle experiment, vary the needle length L with the scroll bar and watch how the event C changes. Run the
experiment with various values of L and compare the physical experiment with the points in the scatterplot. Compare the
relative frequency of C to the probability of C .

The convergence of the relative frequency of an event (as the experiment is repeated) to the probability of the event is a special
case of the law of large numbers.

Find the probabilities of the following events in Buffon's needle experiment. In each case, sketch the event as a subset of the
sample space.
1. {0 < X < π/2, 0 < Y < 1/3}

2. {1/4 < Y < 2/3}


3. {X < Y }
4. {X + Y < 2}
Answer

The Estimate of π
Suppose that we run Buffon's needle experiment a large number of times. By the law of large numbers, the proportion of crack crossings should be about the same as the probability of a crack crossing. More precisely, we will denote the number of crack crossings in the first n runs by N_n. Note that N_n is a random variable for the compound experiment that consists of n replications of the basic needle experiment. Thus, if n is large, we should have N_n / n ≈ 2L/π and hence

π ≈ 2Ln / N_n  (10.1.6)

This is Buffon's famous estimate of π. In the simulation of Buffon's needle experiment, this estimate is computed on each run and
shown numerically in the second table and visually in a graph.
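A direct simulation of the estimate is sketched below in Python (the needle length and number of runs are arbitrary choices; note that the simulation itself uses π to generate the uniform angle, a circularity discussed in the notes at the end of this section):

```python
import numpy as np

rng = np.random.default_rng()

def estimate_pi(L, n=1_000_000):
    """Buffon's estimate 2Ln/N_n from n needle drops, with needle length L < 1."""
    x = rng.uniform(0, np.pi, n)          # angle X, uniform on [0, pi)
    y = rng.uniform(0, 1, n)              # distance Y, uniform on [0, 1)
    cross = (y < (L / 2) * np.sin(x)) | (y > 1 - (L / 2) * np.sin(x))
    return 2 * L * n / cross.sum()        # equation (10.1.6)

print(estimate_pi(0.7))
```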

Run the Buffon's needle experiment with needle lengths L ∈ {0.3, 0.5, 0.7, 1} . In each case, watch the estimate of π as the
simulation runs.

Let us analyze the estimation problem more carefully. On each run j we have an indicator variable I_j, where I_j = 1 if the needle crosses a crack on run j and I_j = 0 if the needle does not cross a crack on run j. These indicator variables are independent and identically distributed, since we are assuming independent replications of the experiment. Thus, the sequence forms a Bernoulli trials process.

The number of crack crossings in the first n runs of the experiment is

N_n = ∑_{j=1}^n I_j  (10.1.7)

which has the binomial distribution with parameters n and 2L/π.

The mean and variance of N_n are
1. E(N_n) = n (2L/π)
2. var(N_n) = n (2L/π)(1 − 2L/π)

With probability 1, N_n / (2Ln) → 1/π as n → ∞ and 2Ln / N_n → π as n → ∞.
Proof
Thus, we have two basic estimators: N_n / (2Ln) as an estimator of 1/π and 2Ln / N_n as an estimator of π. The estimator of 1/π has several important statistical properties. First, it is unbiased since the expected value of the estimator is the parameter being estimated:

The estimator of 1/π is unbiased:

E(N_n / (2Ln)) = 1/π  (10.1.8)

Proof

Since this estimator is unbiased, the variance gives the mean square error:

var(N_n / (2Ln)) = E[(N_n / (2Ln) − 1/π)²]  (10.1.9)

The mean square error of the estimator of 1/π is

var(N_n / (2Ln)) = (π − 2L) / (2Lnπ²)  (10.1.10)

The variance is a decreasing function of the needle length L.

Thus, the estimator of 1/π improves as the needle length increases. On the other hand, the estimator of π is biased; it tends to overestimate π:

The estimator of π is positively biased:

E(2Ln / N_n) ≥ π  (10.1.11)

Proof

The estimator of π also tends to improve as the needle length increases. This is not easy to see mathematically. However, you can
see it empirically.

In the Buffon's needle experiment, run the simulation 5000 times each with L = 0.3 , L = 0.5 , L = 0.7 , and L = 0.9 . Note
how well the estimator seems to work in each case.

Finally, we should note that as a practical matter, Buffon's needle experiment is not a very efficient method of approximating π. According to Richard Durrett, to estimate π to four decimal places with L = 1/2 would require about 100 million tosses!

Run the Buffon's needle experiment until the estimates of π seem to be consistently correct to two decimal places. Note the
number of runs required. Try this for needle lengths L = 0.3 , L = 0.5 , L = 0.7 , and L = 0.9 and compare the results.

Show how to simulate the angle X and distance Y in Buffon's needle experiment using random numbers.
Answer

Notes
Buffon's needle problem is essentially solved by Monte-Carlo integration. In general, Monte-Carlo methods use statistical
sampling to approximate the solutions of problems that are difficult to solve analytically. The modern theory of Monte-Carlo
methods began with Stanislaw Ulam, who used the methods on problems associated with the development of the hydrogen bomb.
The original needle problem has been extended in many ways, starting with Pierre-Simon Laplace, who considered a floor with rectangular tiles. Indeed, variations on the problem are active research problems even today.
Neil Weiss has pointed out that our computer simulation of Buffon's needle experiment is circular, in the sense that the program assumes knowledge of π (you can see this from the simulation result above).

Try to write a computer algorithm for Buffon's needle problem, without assuming the value of π or any other transcendental
numbers.

This page titled 10.1: Buffon's Problems is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

10.2: Bertrand's Paradox

Preliminaries
Statement of the Problem
Bertrand's problem is to find the probability that a “random chord” on a circle will be longer than the length of a side of the
inscribed equilateral triangle. The problem is named after the French mathematician Joseph Louis Bertrand, who studied the
problem in 1889.
It turns out, as we will see, that there are (at least) three answers to Bertrand's problem, depending on how one interprets the phrase
“random chord”. The lack of a unique answer was considered a paradox at the time, because it was assumed (naively, in hindsight)
that there should be a single natural answer.

Run Bertrand's experiment 100 times for each of the following models. Do not be concerned with the exact meaning of the
models, but see if you can detect a difference in the behavior of the outcomes
1. Uniform distance
2. Uniform angle
3. Uniform endpoint

Mathematical Formulation
To formulate the problem mathematically, let us take (0, 0) as the center of the circle and take the radius of the circle to be 1. These
assumptions entail no loss of generality because they amount to measuring distances relative to the center of the circle, and taking
the radius of the circle as the unit of length. Now consider a chord on the circle. By rotating the circle, we can assume that one
point of the chord is (1, 0) and the other point is (X, Y) where Y > 0 and X² + Y² = 1.

With these assumptions, the chord is completely specified by giving any one of the following variables
1. The (perpendicular) distance D from the center of the circle to the midpoint of the chord. Note that 0 ≤ D ≤ 1 .
2. The angle A between the x-axis and the line from the center of the circle to the midpoint of the chord. Note that
0 ≤ A ≤ π/2 .

3. The horizontal coordinate X. Note that −1 ≤ X ≤ 1 .

Figure 10.2.1 : A chord in the circle

The variables are related as follows:
1. D = cos(A)
2. X = 2D² − 1
3. Y = 2D√(1 − D²)

The inverse relations are given below. Note again that there are one-to-one correspondences between X, A, and D.
1. A = arccos(D)
2. D = √((x + 1)/2)
3. D = √(1/2 ± (1/2)√(1 − y²))
If the chord is generated in a probabilistic way, D, A , X, and Y become random variables. In light of the previous results,
specifying the distribution of any of the variables D, A , or X completely determines the distribution of all four variables.

The angle A is also the angle between the chord and the tangent line to the circle at (1, 0).

Now consider the equilateral triangle inscribed in the circle so that one of the vertices is (1, 0). Consider the chord defined by the
upper side of the triangle.

For this chord, the angle, distance, and coordinate variables are given as follows:
1. a = π/3
2. d = 1/2
3. x = −1/2

4. y = √3/2

Figure 10.2.2 : The inscribed equilateral triangle


Now suppose that a chord is chosen in probabilistic way.

The length of the chord is greater than the length of a side of the inscribed equilateral triangle if and only if the following
equivalent conditions occur:
1. 0 < D < 1/2
2. π/3 < A < π/2
3. −1 < X < −1/2

Models
When an object is generated “at random”, a sequence of “natural” variables that determines the object should be given an
appropriate uniform distribution. The coordinates of the coin center are such a sequence in Buffon's coin experiment; the angle and
distance variables are such a sequence in Buffon's needle experiment. The crux of Bertrand's paradox is the fact that the distance
D, the angle A , and the coordinate X each seems to be a natural variable that determine the chord, but different models are

obtained, depending on which is given the uniform distribution.

The Model with Uniform Distance


Suppose that D is uniformly distributed on the interval [0, 1].

The solution of Bertrand's problem is

P(D < 1/2) = 1/2  (10.2.1)

In Bertrand's experiment, select the uniform distance model. Run the experiment 1000 times and compare the relative
frequency function of the chord event to the true probability.

The angle A has probability density function

g(a) = sin(a),  0 < a < π/2  (10.2.2)

Proof

The coordinate X has probability density function

h(x) = 1 / √(8(x + 1)),  −1 < x < 1  (10.2.3)

Proof

Note that A and X do not have uniform distributions. Recall that a random number is a simulation of a variable with the standard
uniform distribution, that is the continuous uniform distribution on the interval [0, 1).

Show how to simulate D, A , X, and Y using a random number.


Answer

The Model with Uniform Angle


Suppose that A is uniformly distributed on the interval (0, π/2).

The solution of Bertrand's problem is

P(A > π/3) = 1/3  (10.2.4)

In Bertrand's experiment, select the uniform angle model. Run the experiment 1000 times and compare the relative frequency
function of the chord event to the true probability.

The distance D has probability density function

f(d) = 2 / (π√(1 − d²)),  0 < d < 1  (10.2.5)

Proof

The coordinate X has probability density function

h(x) = 1 / (π√(1 − x²)),  −1 < x < 1  (10.2.6)

Proof

Note that D and X do not have uniform distributions.

Show how to simulate D, A , X, and Y using a random number.


Answer

The Model with Uniform Endpoint


Suppose that X is uniformly distributed on the interval (−1, 1).

The solution of Bertrand's problem is

P(−1 < X < −1/2) = 1/4  (10.2.7)

In Bertrand's experiment, select the uniform endpoint model. Run the experiment 1000 times and compare the relative
frequency function of the chord event to the true probability.

The distance D has probability density function

f (d) = 2d, 0 <d <1 (10.2.8)

Proof

The angle A has probability density function
π
g(a) = 2 sin(a) cos(a), 0 <a < (10.2.9)
2

Proof

Note that D and A do not have uniform distributions; in fact, D has a beta distribution with left parameter 2 and right parameter 1.
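The three models are easy to compare by simulation. The following minimal Python sketch generates each of D, A, and X uniformly in turn and estimates the probability that the chord is longer than the triangle side (the run count is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng()
n = 100_000
u = rng.uniform(0, 1, n)                      # random numbers

# Uniform distance: D uniform on [0, 1]; the chord is long iff D < 1/2.
print((u < 0.5).mean())                       # approximately 1/2

# Uniform angle: A uniform on (0, pi/2); the chord is long iff A > pi/3.
print(((np.pi / 2) * u > np.pi / 3).mean())   # approximately 1/3

# Uniform endpoint: X uniform on (-1, 1); the chord is long iff X < -1/2.
print((2 * u - 1 < -0.5).mean())              # approximately 1/4
```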

Physical Experiments
Suppose that a random chord is generated by tossing a coin of radius 1 on a table ruled with parallel lines that are distance 2
apart. Which of the models (if any) would apply to this physical experiment?
Answer

Suppose that a needle is attached to the edge of disk of radius 1. A random chord is generated by spinning the needle. Which of
the models (if any) would apply to this physical experiment?
Answer

Suppose that a thin trough is constructed on the edge of a disk of radius 1. Rolling a ball in the trough generates a random point
on the circle, so a random chord is generated by rolling the ball twice. Which of the models (if any) would apply to this
physical experiment?
Answer

This page titled 10.2: Bertrand's Paradox is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

10.3: Random Triangles

Preliminaries
Statement of the Problem
Suppose that a stick is randomly broken in two places. What is the probability that the three pieces form a triangle?

Without looking below, make a guess.

Run the triangle experiment 50 times. Do not be concerned with all of the information displayed in the app, but just note
whether the pieces form a triangle. Would you like to revise your guess?

Mathematical Formulation
As usual, the first step is to model the random experiment mathematically. We will take the length of the stick as our unit of length,
so that we can identify the stick with the interval [0, 1]. To break the stick into three pieces, we just need to select two points in the
interval. Thus, let X denote the first point chosen and Y the second point chosen. Note that X and Y are random variables and
hence the sample space of our experiment is S = [0, 1]². Now, to model the statement that the points are chosen at random, let us assume, as in the previous sections, that X and Y are independent and each is uniformly distributed on [0, 1].

The random point (X, Y) is uniformly distributed on S = [0, 1]².

Hence

P[(X, Y) ∈ A] = area(A) / area(S)  (10.3.1)

Triangles
The Probability of a Triangle
The three pieces form a triangle if and only if the triangle inequalities hold: the sum of the lengths of any two pieces must be
greater than the length of the third piece.

The event that the pieces form a triangle is T = T_1 ∪ T_2 where
1. T_1 = {(x, y) ∈ S : y > 1/2, x < 1/2, y − x < 1/2}
2. T_2 = {(x, y) ∈ S : x > 1/2, y < 1/2, x − y < 1/2}

A sketch of the event T is given below. Curiously, T is composed of triangles!

Figure 10.3.1 : The event T = T1 ∪ T2 that the pieces form a triangle

The probability that the pieces form a triangle is P(T) = 1/4.

How close did you come with your initial guess? The relatively low value of P(T) is a bit surprising.

Run the triangle experiment 1000 times and compare the empirical probability of T to the true probability.
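The simulation is straightforward in code as well. Here is a minimal Python sketch, assuming the uniform model above (the number of runs is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng()
n = 1_000_000
x = rng.uniform(0, 1, n)                  # first break point X
y = rng.uniform(0, 1, n)                  # second break point Y

lo = np.minimum(x, y)
hi = np.maximum(x, y)
p1, p2, p3 = lo, hi - lo, 1 - hi          # the three piece lengths
# The triangle inequalities hold iff every piece is shorter than 1/2.
triangle = (p1 < 0.5) & (p2 < 0.5) & (p3 < 0.5)
print(triangle.mean())                    # approximately P(T) = 1/4
```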

Triangles of Different Types


Now let us compute the probability that the pieces form a triangle of a given type. Recall that in an acute triangle all three angles
are less than 90°, while an obtuse triangle has one angle (and only one) that is greater than 90°. A right triangle, of course, has one
90° angle.

Suppose that a triangle has side lengths a, b, and c, where c is the largest of these. The triangle is
1. acute if and only if c² < a² + b².
2. obtuse if and only if c² > a² + b².
3. right if and only if c² = a² + b².
Part (c), of course, is the famous Pythagorean theorem, named for the ancient Greek mathematician Pythagoras.

The right triangle equations for the stick pieces are
1. (y − x)² = x² + (1 − y)² in T_1
2. (1 − x)² = x² + (y − x)² in T_1
3. x² = (y − x)² + (1 − y)² in T_1
4. (x − y)² = y² + (1 − x)² in T_2
5. (1 − x)² = y² + (x − y)² in T_2
6. y² = (x − y)² + (1 − x)² in T_2

Figure 10.3.2 : The events that the pieces form acute and obtuse triangles

Let R denote the event that the pieces form a right triangle. Then P(R) = 0 .

The event that the pieces form an acute triangle is A = A_1 ∪ A_2 where
1. A_1 is the region inside curves (a), (b), and (c) of the right triangle equations.
2. A_2 is the region inside curves (d), (e), and (f) of the right triangle equations.

The event that the pieces form an obtuse triangle is B = B_1 ∪ B_2 ∪ B_3 ∪ B_4 ∪ B_5 ∪ B_6 where
1. B_1, B_2, and B_3 are the regions inside T_1 and outside of curves (a), (b), and (c) of the right triangle equations, respectively.
2. B_4, B_5, and B_6 are the regions inside T_2 and outside of curves (d), (e), and (f) of the right triangle equations, respectively.

The probability that the pieces form an obtuse triangle is

P(B) = 9/4 − 3 ln(2) ≈ 0.1706  (10.3.2)

Proof

The probability that the pieces form an acute triangle is


P(A) = 3 ln(2) − 2 ≈ 0.07944 (10.3.5)

Proof

Run the triangle experiment 1000 times and compare the empirical probabilities to the true probabilities.

This page titled 10.3: Random Triangles is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

CHAPTER OVERVIEW
11: Bernoulli Trials
The Bernoulli trials process is one of the simplest, yet most important, of all random processes. It is an essential topic in any course
in probability or mathematical statistics. The process consists of independent trials with two outcomes and with constant
probabilities from trial to trial. Thus it is the mathematical abstraction of coin tossing. The process leads to several important
probability distributions: the binomial, geometric, and negative binomial.
11.1: Introduction to Bernoulli Trials
11.2: The Binomial Distribution
11.3: The Geometric Distribution
11.4: The Negative Binomial Distribution
11.5: The Multinomial Distribution
11.6: The Simple Random Walk
11.7: The Beta-Bernoulli Process

This page titled 11: Bernoulli Trials is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

11.1: Introduction to Bernoulli Trials

Basic Theory
Definition
The Bernoulli trials process, named after Jacob Bernoulli, is one of the simplest yet most important random processes in
probability. Essentially, the process is the mathematical abstraction of coin tossing, but because of its wide applicability, it is
usually stated in terms of a sequence of generic “trials”.

A sequence of Bernoulli trials satisfies the following assumptions:


1. Each trial has two possible outcomes, in the language of reliability called success and failure.
2. The trials are independent. Intuitively, the outcome of one trial has no influence over the outcome of another trial.
3. On each trial, the probability of success is p and the probability of failure is 1 − p where p ∈ [0, 1] is the success parameter
of the process.

Random Variables
Mathematically, we can describe the Bernoulli trials process with a sequence of indicator random variables:
X = (X1 , X2 , …) (11.1.1)

An indicator variable is a random variable that takes only the values 1 and 0, which in this setting denote success and failure,
respectively. Indicator variable X simply records the outcome of trial i. Thus, the indicator variables are independent and have the
i

same probability density function:


P(Xi = 1) = p, P(Xi = 0) = 1 − p (11.1.2)

The distribution defined by this probability density function is known as the Bernoulli distribution. In statistical terms, the Bernoulli trials process corresponds to sampling from the Bernoulli distribution. In particular, the first n trials (X_1, X_2, … , X_n) form a random sample of size n from the Bernoulli distribution. Note again that the Bernoulli trials process is characterized by a single parameter p.

The joint probability density function of (X_1, X_2, … , X_n) is given by

f_n(x_1, x_2, … , x_n) = p^{x_1 + x_2 + ⋯ + x_n} (1 − p)^{n − (x_1 + x_2 + ⋯ + x_n)},  (x_1, x_2, … , x_n) ∈ {0, 1}^n  (11.1.3)

Proof

Note that the exponent of p in the probability density function is the number of successes in the n trials, while the exponent of
1 − p is the number of failures.

If X = (X_1, X_2, …) is a Bernoulli trials process with parameter p, then 1 − X = (1 − X_1, 1 − X_2, …) is a Bernoulli trials sequence with parameter 1 − p.

Suppose that U = (U_1, U_2, …) is a sequence of independent random variables, each with the uniform distribution on the interval [0, 1]. For p ∈ [0, 1] and i ∈ N_+, let X_i(p) = 1(U_i ≤ p). Then X(p) = (X_1(p), X_2(p), …) is a Bernoulli trials process with parameter p.

Note that in the previous result, the Bernoulli trials processes for all possible values of the parameter p are defined on a common probability space. This type of construction is sometimes referred to as coupling. This result also shows how to simulate a Bernoulli trials process with random numbers. All of the other random processes studied in this chapter are functions of the Bernoulli trials sequence, and hence can be simulated as well.
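Here is a minimal Python sketch of the coupling construction (the function name is illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng()

def bernoulli_trials(p, u):
    """X_i(p) = 1(U_i <= p): Bernoulli trials driven by the random numbers u."""
    return (u <= p).astype(int)

# The same random numbers U_1, ..., U_n drive the process for every p,
# so the trials for a smaller p are pointwise dominated by those for a larger p.
u = rng.uniform(0, 1, 10)
print(bernoulli_trials(0.3, u))
print(bernoulli_trials(0.7, u))
```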

Moments
Let X be an indicator variable with P(X = 1) = p, where p ∈ [0, 1]. Thus, X is the result of a generic Bernoulli trial and has the Bernoulli distribution with parameter p. The following results give the mean, variance and some of the higher moments. A helpful fact is that if we take a positive power of an indicator variable, nothing happens; that is, X^n = X for n > 0.

The mean and variance of X are


1. E(X) = p
2. var(X) = p(1 − p)
Proof

Note that the graph of var(X), as a function of p ∈ [0, 1], is a parabola opening downward. In particular, the largest value is when p = 1/2, and the smallest value is 0 when p = 0 or p = 1. Of course, in the last two cases, X is deterministic, taking the single value 0 when p = 0 and the single value 1 when p = 1.

Figure 11.1.1 : var(X) as a function of p

Suppose that p ∈ (0, 1). The skewness and kurtosis of X are
1. skew(X) = (1 − 2p) / √(p(1 − p))
2. kurt(X) = −3 + 1 / (p(1 − p))

The probability generating function of X is P(t) = E(t^X) = (1 − p) + pt for t ∈ R.

Examples and Applications


Coins
As we noted earlier, the most obvious example of Bernoulli trials is coin tossing, where success means heads and failure means
tails. The parameter p is the probability of heads (so in general, the coin is biased).

In the basic coin experiment, set n = 100. For each p ∈ {0.1, 0.3, 0.5, 0.7, 0.9}, run the experiment and observe the outcomes.

Generic Examples
In a sense, the most general example of Bernoulli trials occurs when an experiment is replicated. Specifically, suppose that we have
a basic random experiment and an event of interest A . Suppose now that we create a compound experiment that consists of
independent replications of the basic experiment. Define success on trial i to mean that event A occurred on the ith run, and define
failure on trial i to mean that event A did not occur on the ith run. This clearly defines a Bernoulli trials process with parameter
p = P(A) .

Bernoulli trials are also formed when we sample from a dichotomous population. Specifically, suppose that we have a population
of two types of objects, which we will refer to as type 0 and type 1. For example, the objects could be persons, classified as male or
female, or the objects could be components, classified as good or defective. We select n objects at random from the population; by
definition, this means that each object in the population at the time of the draw is equally likely to be chosen. If the sampling is
with replacement, then each object drawn is replaced before the next draw. In this case, successive draws are independent, so the
types of the objects in the sample form a sequence of Bernoulli trials, in which the parameter p is the proportion of type 1 objects in
the population. If the sampling is without replacement, then the successive draws are dependent, so the types of the objects in the
sample do not form a sequence of Bernoulli trials. However, if the population size is large compared to the sample size, the
dependence caused by not replacing the objects may be negligible, so that for all practical purposes, the types of the objects in the
sample can be treated as a sequence of Bernoulli trials. Additional discussion of sampling from a dichotomous population is in the chapter Finite Sampling Models.

Suppose that a student takes a multiple choice test. The test has 10 questions, each of which has 4 possible answers (only one
correct). If the student blindly guesses the answer to each question, do the questions form a sequence of Bernoulli trials? If so,
identify the trial outcomes and the parameter p.
Answer

Candidate A is running for office in a certain district. Twenty persons are selected at random from the population of registered
voters and asked if they prefer candidate A . Do the responses form a sequence of Bernoulli trials? If so identify the trial
outcomes and the meaning of the parameter p.
Answer

An American roulette wheel has 38 slots; 18 are red, 18 are black, and 2 are green. A gambler plays roulette 15 times, betting
on red each time. Do the outcomes form a sequence of Bernoulli trials? If so, identify the trial outcomes and the parameter p.
Answer

Roulette is discussed in more detail in the chapter on Games of Chance.

Two tennis players play a set of 6 games. Do the games form a sequence of Bernoulli trials? If so, identify the trial outcomes
and the meaning of the parameter p.
Answer

Reliability
Recall that in the standard model of structural reliability, a system is composed of n components that operate independently of each
other. Let X denote the state of component i, where 1 means working and 0 means failure. If the components are all of the same
i

type, then our basic assumption is that the state vector

X = (X1 , X2 , … , Xn ) (11.1.4)

is a sequence of Bernoulli trials. The state of the system (again where 1 means working and 0 means failed) depends only on the states of the components, and thus is a random variable

Y = s(X_1, X_2, … , X_n)  (11.1.5)

where s : {0, 1}^n → {0, 1} is the structure function. Generally, the probability that a device is working is the reliability of the device, so the parameter p of the Bernoulli trials sequence is the common reliability of the components. By independence, the system reliability r is a function of the component reliability:

r(p) = P_p(Y = 1),  p ∈ [0, 1]  (11.1.6)

where we are emphasizing the dependence of the probability measure P on the parameter p. Appropriately enough, this function is
known as the reliability function. Our challenge is usually to find the reliability function, given the structure function.

A series system is working if and only if each component is working.


1. The state of the system is Y = X_1 X_2 ⋯ X_n = min{X_1, X_2, … , X_n}.
2. The reliability function is r(p) = p^n for p ∈ [0, 1].

A parallel system is working if and only if at least one component is working.

1. The state of the system is Y = 1 − (1 − X_1)(1 − X_2) ⋯ (1 − X_n) = max{X_1, X_2, … , X_n}.
2. The reliability function is r(p) = 1 − (1 − p)^n for p ∈ [0, 1] (see the sketch below).
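For small n, the reliability function of any structure can be computed by summing over all 2^n component state vectors. The sketch below (a hypothetical helper, not from the text) checks the series and parallel formulas numerically:

```python
import itertools

def reliability(structure, n, p):
    """r(p) = P_p(Y = 1), by enumerating all 2^n component state vectors."""
    r = 0.0
    for x in itertools.product([0, 1], repeat=n):
        prob = 1.0
        for xi in x:
            prob *= p if xi == 1 else 1 - p
        r += prob * structure(x)          # the structure function returns 0 or 1
    return r

p = 0.9
print(reliability(min, 3, p), p ** 3)                # series: r(p) = p^n
print(reliability(max, 3, p), 1 - (1 - p) ** 3)      # parallel: r(p) = 1 - (1-p)^n
```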

Recall that in some cases, the system can be represented as a graph or network. The edges represent the components and the
vertices the connections between the components. The system functions if and only if there is a working path between two
designated vertices, which we will denote by a and b .

Find the reliability of the Wheatstone bridge network shown below (named for Charles Wheatstone).

Figure 11.1.2 : The Wheatstone bridge network
Answer

The Pooled Blood Test


Suppose that each person in a population, independently of all others, has a certain disease with probability p ∈ (0, 1). Thus, with
respect to the disease, the persons in the population form a sequence of Bernoulli trials. The disease can be identified by a blood
test, but of course the test has a cost.
For a group of k persons, we will compare two strategies. The first is to test the k persons individually, so that of course, k tests are
required. The second strategy is to pool the blood samples of the k persons and test the pooled sample first. We assume that the test
is negative if and only if all k persons are free of the disease; in this case just one test is required. On the other hand, the test is
positive if and only if at least one person has the disease, in which case we then have to test the persons individually; in this case
k + 1 tests are required. Thus, let Y denote the number of tests required for the pooled strategy.

The number of tests Y has the following properties:
1. P(Y = 1) = (1 − p)^k, P(Y = k + 1) = 1 − (1 − p)^k
2. E(Y) = 1 + k[1 − (1 − p)^k]
3. var(Y) = k²(1 − p)^k [1 − (1 − p)^k]

In terms of expected value, the pooled strategy is better than the basic strategy if and only if

p < 1 − (1/k)^{1/k}  (11.1.7)

The graph of the critical value p_k = 1 − (1/k)^{1/k} as a function of k ∈ [2, 20] is shown in the graph below:

Figure 11.1.3: The graph of p_k as a function of k

The critical value p_k satisfies the following properties:
1. The maximum value of p_k occurs at k = 3 and p_3 ≈ 0.307.
2. p_k → 0 as k → ∞.

It follows that if p ≥ 0.307, pooling never makes sense, regardless of the size of the group k . At the other extreme, if p is very
small, so that the disease is quite rare, pooling is better unless the group size k is very large.
Now suppose that we have n persons. If k ∣ n then we can partition the population into n/k groups of k each, and apply the pooled
strategy to each group. Note that k = 1 corresponds to individual testing, and k = n corresponds to the pooled strategy on the
entire population. Let Y_i denote the number of tests required for group i.

The random variables (Y_1, Y_2, … , Y_{n/k}) are independent and each has the distribution given above.

The total number of tests required for this partitioning scheme is Z_{n,k} = Y_1 + Y_2 + ⋯ + Y_{n/k}.

The expected total number of tests is

E(Z_{n,k}) = n if k = 1, and E(Z_{n,k}) = n[(1 + 1/k) − (1 − p)^k] if k > 1  (11.1.8)

The variance of the total number of tests is

var(Z_{n,k}) = 0 if k = 1, and var(Z_{n,k}) = n k (1 − p)^k [1 − (1 − p)^k] if k > 1  (11.1.9)

Thus, in terms of expected value, the optimal strategy is to group the population into n/k groups of size k , where k minimizes the
expected value function above. It is difficult to get a closed-form expression for the optimal value of k , but this value can be
determined numerically for specific n and p.
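The numerical search is simple; here is a minimal Python sketch (the helper names are illustrative):

```python
def expected_tests(n, k, p):
    """E(Z_{n,k}) from (11.1.8); assumes k divides n."""
    if k == 1:
        return float(n)
    return n * ((1 + 1 / k) - (1 - p) ** k)

def optimal_k(n, p):
    """Find the divisor k of n that minimizes the expected number of tests."""
    divisors = [k for k in range(1, n + 1) if n % k == 0]
    return min(divisors, key=lambda k: expected_tests(n, k, p))

n, p = 100, 0.01
k = optimal_k(n, p)
print(k, expected_tests(n, k, p))
```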

For the following values of n and p, find the optimal pooling size k and the expected number of tests. (Restrict your attention
to values of k that divide n .)
1. n = 100 , p = 0.01
2. n = 1000, p = 0.05
3. n = 1000, p = 0.001
Answer

If k does not divide n , then we could divide the population of n persons into ⌊n/k⌋ groups of k each and one “remainder” group
with n mod k members. This clearly complicates the analysis, but does not introduce any new ideas, so we will leave this
extension to the interested reader.

This page titled 11.1: Introduction to Bernoulli Trials is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

11.2: The Binomial Distribution

Basic Theory
Definitions
Our random experiment is to perform a sequence of Bernoulli trials X = (X_1, X_2, …). Recall that X is a sequence of independent, identically distributed indicator random variables, and in the usual language of reliability, 1 denotes success and 0 denotes failure. The common probability of success, p = P(X_i = 1), is the basic parameter of the process. In statistical terms, the first n trials (X_1, X_2, … , X_n) form a random sample of size n from the Bernoulli distribution.

In this section we will study the random variable that gives the number of successes in the first n trials and the random variable that
gives the proportion of successes in the first n trials. The underlying distribution, the binomial distribution, is one of the most
important in probability theory, and so deserves to be studied in considerable detail. As you will see, some of the results in this
section have two or more proofs. In almost all cases, note that the proof from Bernoulli trials is the simplest and most elegant.

For n ∈ N, the number of successes in the first n trials is the random variable

Y_n = ∑_{i=1}^n X_i,  n ∈ N  (11.2.1)

The distribution of Y_n is the binomial distribution with trial parameter n and success parameter p.

Note that Y = (Y_0, Y_1, …) is the partial sum process associated with the Bernoulli trials sequence X. In particular, Y_0 = 0, so point mass at 0 is considered a degenerate form of the binomial distribution.

Distribution Functions
The probability density function f_n of Y_n is given by

f_n(y) = C(n, y) p^y (1 − p)^{n−y},  y ∈ {0, 1, … , n}  (11.2.2)

where C(n, y) denotes the binomial coefficient.

Proof
Check that f_n is a valid PDF

In the binomial coin experiment, vary n and p with the scrollbars, and note the shape and location of the probability density
function. For selected values of the parameters, run the simulation 1000 times and compare the relative frequency function to
the probability density function.

The binomial distribution is unimodal: For k ∈ {1, 2, … , n},
1. f_n(k) > f_n(k − 1) if and only if k < (n + 1)p.
2. f_n(k) = f_n(k − 1) if and only if k = (n + 1)p is an integer between 1 and n.
Proof

Thus, the density function at first increases and then decreases, reaching its maximum value at ⌊(n + 1)p⌋ . This integer is a mode
of the distribution. In the case that m = (n + 1)p is an integer between 1 and n , there are two consecutive modes, at m − 1 and
m.

Now let F_n denote the distribution function of Y_n, so that

F_n(y) = P(Y_n ≤ y) = ∑_{k=0}^y f_n(k) = ∑_{k=0}^y C(n, k) p^k (1 − p)^{n−k},  y ∈ {0, 1, … , n}  (11.2.6)

The distribution function F_n and the quantile function F_n^{−1} do not have simple closed forms, but values of these functions can be computed from mathematical and statistical software.
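For example, in Python the values are available from scipy (a sketch; the parameter values are arbitrary):

```python
from scipy.stats import binom

n, p = 10, 0.4
Y = binom(n, p)                  # the binomial distribution with parameters n and p
print(Y.pmf(4))                  # f_n(4), equation (11.2.2)
print(Y.cdf(4))                  # F_n(4), equation (11.2.6)
print(Y.ppf([0.25, 0.5, 0.75]))  # quantile function: quartiles and median
```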

Open the special distribution calculator and select the binomial distribution and set the view to CDF. Vary n and p and note the
shape and location of the distribution/quantile function. For various values of the parameters, compute the median and the first
and third quartiles.

The binomial distribution function also has a nice relationship to the beta distribution function.

The distribution function F_n can be written in the form

F_n(k) = [n! / ((n − k − 1)! k!)] ∫_0^{1−p} x^{n−k−1} (1 − x)^k dx,  k ∈ {0, 1, … , n − 1}  (11.2.7)

Proof

The expression on the right in the previous theorem is the beta distribution function, with left parameter n − k and right parameter
k + 1 , evaluated at 1 − p .

Moments
The mean, variance and other moments of the binomial distribution can be computed in several different ways. Again let Y_n = ∑_{i=1}^n X_i, where X = (X_1, X_2, …) is a sequence of Bernoulli trials with success parameter p.

The mean and variance of Y_n are
1. E(Y_n) = np
2. var(Y_n) = np(1 − p)

Proof from Bernoulli trials


Direct proof of (a)

The expected value of Y_n also makes intuitive sense, since p should be approximately the proportion of successes in a large number of trials. We will discuss the point further in the subsection below on the proportion of successes. Note that the graph of var(Y_n) as a function of p ∈ [0, 1] is a parabola opening downward. In particular, the maximum value of the variance is n/4 when p = 1/2, and the minimum value is 0 when p = 0 or p = 1. Of course, in the last two cases, Y_n is deterministic, taking just the value 0 if p = 0 and just the value n when p = 1.

In the binomial coin experiment, vary n and p with the scrollbars and note the location and size of the mean±standard
deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the sample mean and standard
deviation to the distribution mean and standard deviation.

The probability generating function of Y_n is P_n(t) = E(t^{Y_n}) = [(1 − p) + pt]^n for t ∈ R.
Proof from Bernoulli trials
Direct Proof

Recall that for x ∈ R and k ∈ N, the falling power of x of order k is x^{(k)} = x(x − 1) ⋯ (x − k + 1). If X is a random variable and k ∈ N, then E[X^{(k)}] is the factorial moment of X of order k. The probability generating function provides an easy way to compute the factorial moments of the binomial distribution.


E[Y_n^{(k)}] = n^{(k)} p^k for k ∈ N.

Proof

Our next result gives a recursion equation and initial conditions for the moments of the binomial distribution.

Recursion relation and initial conditions
1. E(Y_n^k) = np E[(Y_{n−1} + 1)^{k−1}] for n, k ∈ N_+
2. E(Y_n^0) = 1 for n ∈ N
3. E(Y_0^k) = 0 for k ∈ N_+

Proof

The ordinary (raw) moments of Y_n can be computed from the factorial moments and from the recursion relation. Here are the first four, which will be needed below.

The first four moments of Y_n are
1. E(Y_n) = np
2. E(Y_n²) = n^{(2)} p² + np
3. E(Y_n³) = n^{(3)} p³ + 3n^{(2)} p² + np
4. E(Y_n⁴) = n^{(4)} p⁴ + 6n^{(3)} p³ + 7n^{(2)} p² + np
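These formulas are easy to check numerically; a minimal Python sketch (with an illustrative falling-power helper) compares them with moments computed directly by scipy:

```python
from scipy.stats import binom

def falling_power(x, k):
    """x^{(k)} = x(x - 1) ... (x - k + 1)"""
    out = 1
    for j in range(k):
        out *= x - j
    return out

n, p = 12, 0.3
moments = [
    n * p,
    falling_power(n, 2) * p**2 + n * p,
    falling_power(n, 3) * p**3 + 3 * falling_power(n, 2) * p**2 + n * p,
    falling_power(n, 4) * p**4 + 6 * falling_power(n, 3) * p**3
        + 7 * falling_power(n, 2) * p**2 + n * p,
]
for k, m in enumerate(moments, start=1):
    print(m, binom.moment(k, n, p))    # the two columns should agree
```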

Our final moment results give the skewness and kurtosis of the binomial distribution.

For p ∈ (0, 1), the skewness of Y_n is

skew(Y_n) = (1 − 2p) / √(np(1 − p))  (11.2.18)

1. skew(Y_n) > 0 if p < 1/2, skew(Y_n) < 0 if p > 1/2, and skew(Y_n) = 0 if p = 1/2
2. For fixed n, skew(Y_n) → ∞ as p ↓ 0 and skew(Y_n) → −∞ as p ↑ 1
3. For fixed p, skew(Y_n) → 0 as n → ∞

Proof

Open the binomial timeline experiment. For each of the following values of n , vary p from 0 to 1 and note the shape of the
probability density function in light of the previous results on skewness.
1. n = 10
2. n = 20
3. n = 100

For p ∈ (0, 1), the kurtosis of Y_n is

kurt(Y_n) = 3 − 6/n + 1/(np(1 − p))  (11.2.19)

1. For fixed n, kurt(Y_n) decreases and then increases as a function of p, with minimum value 3 − 2/n at the point of symmetry p = 1/2
2. For fixed n, kurt(Y_n) → ∞ as p ↓ 0 and as p ↑ 1
3. For fixed p, kurt(Y_n) → 3 as n → ∞
Proof

Note that the excess kurtosis is kurt(Y_n) − 3 = 1/(np(1 − p)) − 6/n → 0 as n → ∞. This is related to the convergence of the binomial distribution to the normal, which we will discuss below.

The Partial Sum Process

Several important properties of the random process Y = (Y_0, Y_1, Y_2, …) stem from the fact that it is a partial sum process corresponding to the sequence X = (X_1, X_2, …) of independent, identically distributed indicator variables.

Y has stationary, independent increments:

1. If m and n are positive integers with m ≤ n then Y_n − Y_m has the same distribution as Y_{n−m}, namely binomial with parameters n − m and p.
2. If n_1 ≤ n_2 ≤ n_3 ≤ ⋯ then (Y_{n_1}, Y_{n_2} − Y_{n_1}, Y_{n_3} − Y_{n_2}, …) is a sequence of independent variables.

Proof

The following result gives the finite dimensional distributions of Y.

The joint probability density functions of the sequence Y are given as follows:

P(Y_{n_1} = y_1, Y_{n_2} = y_2, …, Y_{n_k} = y_k) = \binom{n_1}{y_1} \binom{n_2 − n_1}{y_2 − y_1} ⋯ \binom{n_k − n_{k−1}}{y_k − y_{k−1}} p^{y_k} (1 − p)^{n_k − y_k}   (11.2.20)

where n_1, n_2, …, n_k ∈ N_+ with n_1 < n_2 < ⋯ < n_k, and where y_1, y_2, …, y_k ∈ N with 0 ≤ y_j − y_{j−1} ≤ n_j − n_{j−1} for each j ∈ {1, 2, …, k}.
Proof

Transformations that Preserve the Binomial Distribution


There are two simple but important transformations that preserve the binomial distribution.

If U is a random variable having the binomial distribution with parameters n and p, then n − U has the binomial distribution
with parameters n and 1 − p .
Proof from Bernoulli trials
Proof from density functions

The sum of two independent binomial variables with the same success parameter also has a binomial distribution.

Suppose that U and V are independent random variables, and that U has the binomial distribution with parameters m and p,
and V has the binomial distribution with parameters n and p. Then U + V has the binomial distribution with parameters
m + n and p .

Proof from Bernoulli trials


Proof from convolution powers
Proof from generating functions

Sampling and the Hypergeometric Distribution


Suppose that we have a dichotomous population, that is, a population of two types of objects. Specifically, suppose that we have m objects, and that r of the objects are type 1 and the remaining m − r objects are type 0. Thus m ∈ N_+ and r ∈ {0, 1, …, m}. We select n objects at random from the population, so that all samples of size n are equally likely. If the sampling is with replacement, the sample size n can be any positive integer. If the sampling is without replacement, then we must have n ∈ {1, 2, …, m}.

In either case, let X_i denote the type of the i'th object selected for i ∈ {1, 2, …, n}, so that Y = ∑_{i=1}^n X_i is the number of type 1 objects in the sample. As noted in the Introduction, if the sampling is with replacement, (X_1, X_2, …, X_n) is a sequence of Bernoulli trials, and hence Y has the binomial distribution with parameters n and p = r/m. If the sampling is without replacement, then Y has the hypergeometric distribution with parameters m, r, and n. The hypergeometric distribution is studied in detail in the chapter on Finite Sampling Models. For reference, the probability density function of Y is given by

P(Y = y) = \binom{r}{y} \binom{m−r}{n−y} / \binom{m}{n} = \binom{n}{y} r^{(y)} (m − r)^{(n−y)} / m^{(n)},   y ∈ {0, 1, …, n}   (11.2.22)

and the mean and variance of Y are

E(Y) = n r/m,   var(Y) = n (r/m)(1 − r/m)(m − n)/(m − 1)   (11.2.23)

If the population size m is large compared to the sample size n , then the dependence between the indicator variables is slight, and
so the hypergeometric distribution should be close to the binomial distribution. The following theorem makes this precise.

Suppose that r_m ∈ {0, 1, …, m} for each m ∈ N_+ and that r_m/m → p ∈ [0, 1] as m → ∞. Then for fixed n ∈ N_+, the hypergeometric distribution with parameters m, r_m, and n converges to the binomial distribution with parameters n and p as m → ∞.

Proof

Under the conditions in the previous theorem, the mean and variance of the hypergeometric distribution converge to the mean and variance of the limiting binomial distribution:

1. n r_m/m → np as m → ∞
2. n (r_m/m)(1 − r_m/m)(m − n)/(m − 1) → np(1 − p) as m → ∞
Proof

In the ball and urn experiment, vary the parameters and switch between sampling without replacement and sampling with
replacement. Note the difference between the graphs of the hypergeometric probability density function and the binomial
probability density function. In particular, note the similarity when m is large and n small. For selected values of the
parameters, and for both sampling modes, run the experiment 1000 times.

From a practical point of view, the convergence of the hypergeometric distribution to the binomial means that if the population size m is “large” compared to the sample size, then the hypergeometric distribution with parameters m, r and n is well approximated by the binomial distribution with parameters n and p = r/m. This is often a useful result, because the binomial distribution has fewer parameters than the hypergeometric distribution (and often in real problems, the parameters may only be known approximately). Specifically, in the approximating binomial distribution, we do not need to know the population size m and the number of type 1 objects r individually, but only their ratio r/m. Generally, the approximation works well if m is large compared to n, so that (m − n)/(m − 1) is close to 1. This ensures that the variance of the hypergeometric distribution is close to the variance of the approximating binomial distribution.
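As a quick numerical illustration (our own sketch; it uses the scipy.stats library, which the text itself does not), compare the two probability density functions when the population size is large relative to the sample size:

from scipy.stats import hypergeom, binom

m, r, n = 500, 200, 10               # population size, type 1 count, sample size
p = r / m                            # success parameter of the approximating binomial
for y in range(n + 1):
    without = hypergeom.pmf(y, m, r, n)   # sampling without replacement
    with_repl = binom.pmf(y, n, p)        # sampling with replacement
    print(y, round(without, 4), round(with_repl, 4))

With these parameters the two columns agree to about two decimal places, as the theorem suggests.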
Now let's return to our usual sequence of Bernoulli trials X = (X_1, X_2, …) with success parameter p, and to the binomial variables Y_n = ∑_{i=1}^n X_i for n ∈ N_+. Our next result shows that given k successes in the first n trials, the set of trials on which the successes occur is simply a random sample of size k chosen without replacement from {1, 2, …, n}.
n
Suppose that n ∈ N_+ and k ∈ {0, 1, …, n}. Then for (x_1, x_2, …, x_n) ∈ {0, 1}^n with ∑_{i=1}^n x_i = k,

P[(X_1, X_2, …, X_n) = (x_1, x_2, …, x_n) ∣ Y_n = k] = 1 / \binom{n}{k}   (11.2.27)

Proof

Note in particular that the conditional distribution above does not depend on p. In statistical terms, this means that relative to (X_1, X_2, …, X_n), the random variable Y_n is a sufficient statistic for p. Roughly, Y_n contains all of the information about p that is available in the entire sample (X_1, X_2, …, X_n). Sufficiency is discussed in more detail in the chapter on Point Estimation. Next, if m ≤ n then the conditional distribution of Y_m given Y_n = k is hypergeometric, with population size n, type 1 size m, and sample size k.

Suppose that p ∈ (0, 1) and that m, n, k ∈ N_+ with m ≤ n and k ≤ n. Then

P(Y_m = j ∣ Y_n = k) = \binom{m}{j} \binom{n − m}{k − j} / \binom{n}{k},   j ∈ {0, 1, …, k}   (11.2.29)

Proof from the previous result


Direct proof

Once again, note that the conditional distribution is independent of the success parameter p.

The Poisson Approximation


The Poisson process on [0, ∞), named for Simeon Poisson, is a model for random points in continuous time. There are many deep
and interesting connections between the Bernoulli trials process (which can be thought of as a model for random points in discrete
time) and the Poisson process. These connections are explored in detail in the chapter on the Poisson process. In this section we just
give the most famous and important result—the convergence of the binomial distribution to the Poisson distribution.
For reference, the Poisson distribution with rate parameter r ∈ (0, ∞) has probability density function

g(n) = e^{−r} r^n / n!,   n ∈ N   (11.2.32)

The parameter r is both the mean and the variance of the distribution. In addition, the probability generating function is t ↦ e^{r(t−1)}.

Suppose that p_n ∈ (0, 1) for n ∈ N_+ and that n p_n → r ∈ (0, ∞) as n → ∞. Then the binomial distribution with parameters n and p_n converges to the Poisson distribution with parameter r as n → ∞.

Proof from density functions
Proof from generating functions

Under the same conditions as the previous theorem, the mean and variance of the binomial distribution converge to the mean
and variance of the limiting Poisson distribution, respectively.
1. n p_n → r as n → ∞
2. n p_n (1 − p_n) → r as n → ∞
Proof

Compare the Poisson experiment and the binomial timeline experiment.


1. Open the Poisson experiment and set r = 1 and t = 5 . Run the experiment a few times and note the general behavior of the
random points in time. Note also the shape and location of the probability density function and the mean±standard
deviation bar.
2. Now open the binomial timeline experiment and set n = 100 and p = 0.05. Run the experiment a few times and note the
general behavior of the random points in time. Note also the shape and location of the probability density function and the
mean±standard deviation bar.

From a practical point of view, the convergence of the binomial distribution to the Poisson means that if the number of trials n is “large” and the probability of success p “small”, so that np^2 is small, then the binomial distribution with parameters n and p is well approximated by the Poisson distribution with parameter r = np. This is often a useful result, because the Poisson distribution has fewer parameters than the binomial distribution (and often in real problems, the parameters may only be known approximately). Specifically, in the approximating Poisson distribution, we do not need to know the number of trials n and the probability of success p individually, but only their product np. The condition that np^2 be small means that the variance of the binomial distribution, namely np(1 − p) = np − np^2, is approximately r = np, the variance of the approximating Poisson distribution.
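A minimal numerical check (our sketch, again assuming scipy.stats) with n = 100 and p = 0.05, so that np^2 = 0.25 is small:

from scipy.stats import binom, poisson

n, p = 100, 0.05
r = n * p                            # rate parameter of the approximating Poisson
for k in range(9):
    # binomial pmf next to its Poisson approximation
    print(k, round(binom.pmf(k, n, p), 4), round(poisson.pmf(k, r), 4))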

The Normal Approximation


Open the binomial timeline experiment. For selected values of p ∈ (0, 1), start with n = 1 and successively increase n by 1.
For each value of n, note the shape of the probability density function of the number of successes and the proportion of successes. With n = 100, run the experiment 1000 times and compare the empirical density function to the probability density function for the number of successes and the proportion of successes.

The characteristic bell shape that you should observe in the previous exercise is an example of the central limit theorem, because
the binomial variable can be written as a sum of n independent, identically distributed random variables (the indicator variables).

The standard score Z_n of Y_n is the same as the standard score of M_n:

Z_n = (Y_n − np) / √(np(1 − p)) = (M_n − p) / √(p(1 − p)/n)   (11.2.35)

The distribution of Z_n converges to the standard normal distribution as n → ∞.

This version of the central limit theorem is known as the DeMoivre-Laplace theorem, and is named after Abraham DeMoivre and Simeon Laplace. From a practical point of view, this result means that, for large n, the distribution of Y_n is approximately normal, with mean np and standard deviation √(np(1 − p)), and the distribution of M_n is approximately normal, with mean p and standard deviation √(p(1 − p)/n). Just how large n needs to be for the normal approximation to work well depends on the value of p. The rule of thumb is that we need np ≥ 5 and n(1 − p) ≥ 5 (the first condition is the significant one when p ≤ 1/2 and the second condition is the significant one when p ≥ 1/2). Finally, when using the normal approximation, we should remember to use the continuity correction, since the binomial is a discrete distribution.
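Here is a minimal Python sketch (ours, assuming scipy.stats) of the normal approximation with the continuity correction; the parameters happen to match the voter-preference exercise later in this section, where the sample count is binomial with n = 50 and p = 0.4 and we want P(Z < 19) = P(Z ≤ 18).

from math import sqrt
from scipy.stats import binom, norm

n, p = 50, 0.4
mu, sigma = n * p, sqrt(n * p * (1 - p))    # np = 20, sd about 3.46
exact = binom.cdf(18, n, p)                 # P(Y_n <= 18), exactly
approx = norm.cdf((18 + 0.5 - mu) / sigma)  # continuity correction: 18 -> 18.5
print(round(exact, 4), round(approx, 4))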

General Families
For a fixed number of trials n , the binomial distribution is a member of two general families of distributions. First, it is a general
exponential distribution.

Suppose that Y has the binomial distribution with parameters n and p, where n ∈ N_+ is fixed and p ∈ (0, 1). The distribution of Y is a one-parameter exponential family with natural parameter ln(p/(1 − p)) and natural statistic Y.

Proof

Note that the natural parameter is the logarithm of the odds ratio corresponding to p. This function is sometimes called the logit function. The binomial distribution is also a power series distribution.

Suppose again that Y has the binomial distribution with parameters n and p, where n ∈ N_+ is fixed and p ∈ (0, 1). The distribution of Y is a power series distribution in the parameter θ = p/(1 − p), corresponding to the function θ ↦ (1 + θ)^n.

Proof

The Proportion of Successes


Suppose again that X = (X_1, X_2, …) is a sequence of Bernoulli trials with success parameter p, and that as usual, Y_n = ∑_{i=1}^n X_i is the number of successes in the first n trials for n ∈ N_+. The proportion of successes in the first n trials is the random variable

M_n = Y_n / n = (1/n) ∑_{i=1}^n X_i   (11.2.38)

In statistical terms, M_n is the sample mean of the random sample (X_1, X_2, …, X_n). The proportion of successes M_n is typically used to estimate the probability of success p when this probability is unknown.

It is easy to express the probability density function of the proportion of successes M_n in terms of the probability density function of the number of successes Y_n. First, note that M_n takes the values k/n where k ∈ {0, 1, …, n}.

The probability density function of M_n is given by

P(M_n = k/n) = \binom{n}{k} p^k (1 − p)^{n−k},   k ∈ {0, 1, …, n}   (11.2.39)

Proof

In the binomial coin experiment, select the proportion of heads. Vary n and p with the scroll bars and note the shape of the
probability density function. For selected values of the parameters, run the experiment 1000 times and compare the relative
frequency function to the probability density function.

The mean and variance of the proportion of successes M_n are easy to compute from the mean and variance of the number of successes Y_n.

The mean and variance of M_n are

1. E(M_n) = p
2. var(M_n) = p(1 − p)/n
Proof

In the binomial coin experiment, select the proportion of heads. Vary n and p and note the size and location of the mean
±standard deviation bar. For selected values of the parameters, run the experiment 1000 times and compare the empirical

moments to the distribution moments.

Recall that skewness and kurtosis are standardized measures. Since M_n and Y_n have the same standard score, the skewness and kurtosis of M_n are the same as the skewness and kurtosis of Y_n given above.

In statistical terms, part (a) of the moment result above means that M_n is an unbiased estimator of p. From part (b) note that var(M_n) ≤ 1/(4n) for any p ∈ [0, 1]. In particular, var(M_n) → 0 as n → ∞ and the convergence is uniform in p ∈ [0, 1]. Thus, the estimate improves as n increases; in statistical terms, this is known as consistency.

For every ϵ > 0 , P (|M n − p| ≥ ϵ) → 0 as n → ∞ and the convergence is uniform in p ∈ [0, 1].

Proof

The last result is a special case of the weak law of large numbers and means that M_n → p as n → ∞ in probability. The strong law of large numbers states that the convergence actually holds with probability 1.

The proportion of successes M_n has a number of nice properties as an estimator of the probability of success p. As already noted, it is unbiased and consistent. In addition, since Y_n is a sufficient statistic for p, based on the sample (X_1, X_2, …, X_n), it follows that M_n is sufficient for p as well. Since E(X_i) = p, M_n is trivially the method of moments estimator of p. Assuming that the parameter space for p is [0, 1], it is also the maximum likelihood estimator of p.

The likelihood function for p ∈ [0, 1], based on the observed sample (x_1, x_2, …, x_n) ∈ {0, 1}^n, is L_y(p) = p^y (1 − p)^{n−y}, where y = ∑_{i=1}^n x_i. The likelihood is maximized at m = y/n.

Proof

See Estimation in the Bernoulli Model in the chapter on Set Estimation for a different approach to the problem of estimating p.

Examples and Applications


Simple Exercises
A student takes a multiple choice test with 20 questions, each with 5 choices (only one of which is correct). Suppose that the
student blindly guesses. Let X denote the number of questions that the student answers correctly. Find each of the following:
1. The probability density function of X.
2. The mean of X.
3. The variance of X.
4. The probability that the student answers at least 12 questions correctly (the score that she needs to pass).
Answer

A certain type of missile has failure probability 0.02. Let Y denote the number of failures in 50 tests. Find each of the
following:
1. The probability density function of Y .
2. The mean of Y .
3. The variance of Y .
4. The probability of at least 47 successful tests.
Answer

Suppose that in a certain district, 40% of the registered voters prefer candidate A . A random sample of 50 registered voters is
selected. Let Z denote the number in the sample who prefer A . Find each of the following:
1. The probability density function of Z .
2. The mean of Z .
3. The variance of Z .
4. The probability that Z is less than 19.
5. The normal approximation to the probability in (d).
Answer

Coins and Dice


Recall that a standard die is a six-sided die. A fair die is one in which the faces are equally likely. An ace-six flat die is a standard die in which faces 1 and 6 have probability 1/4 each, and faces 2, 3, 4, and 5 have probability 1/8 each.
A standard, fair die is tossed 10 times. Let N denote the number of aces. Find each of the following:
1. The probability density function of N .
2. The mean of N .
3. The variance of N .
Answer

A coin is tossed 100 times and results in 30 heads. Find the probability density function of the number of heads in the first 20
tosses.
Answer

An ace-six flat die is rolled 1000 times. Let Z denote the number of times that a score of 1 or 2 occurred. Find each of the
following:
1. The probability density function of Z .
2. The mean of Z .
3. The variance of Z .
4. The probability that Z is at least 400.
5. The normal approximation of the probability in (d)
Answer

In the binomial coin experiment, select the proportion of heads. Set n = 10 and p = 0.4. Run the experiment 100 times. Over all 100 runs, compute the square root of the average of the squares of the errors when M is used to estimate p. This number is a measure of the quality of the estimate.

In the binomial coin experiment, select the number of heads Y, and set p = 0.5 and n = 15. Run the experiment 1000 times and compute the following:
1. P(5 ≤ Y ≤ 10)
2. The relative frequency of the event {5 ≤ Y ≤ 10}
3. The normal approximation to P(5 ≤ Y ≤ 10)
Answer

In the binomial coin experiment, select the proportion of heads M and set n = 30 , p = 0.6 . Run the experiment 1000 times
and compute each of the following:
1. P(0.5 ≤ M ≤ 0.7)
2. The relative frequency of the event {0.5 ≤ M ≤ 0.7}
3. The normal approximation to P(0.5 ≤ M ≤ 0.7)
Answer

Famous Problems
In 1693, Samuel Pepys asked Isaac Newton whether it is more likely to get at least one 6 in 6 rolls of a die, or at least two 6's in 12 rolls, or at least three 6's in 18 rolls. This problem is known as Pepys' problem; naturally, Pepys had fair dice in mind.

Solve Pepys' problem using the binomial distribution.


Answer

So the first event, at least one 6 in 6 rolls, is the most likely, and in fact the three events, in the order given, decrease in probability.

With fair dice run the simulation of the dice experiment 500 times and compute the relative frequency of the event of interest.
Compare the results with the theoretical probabilities above.
1. At least one 6 with n = 6 .
2. At least two 6's with n = 12 .
3. At least three 6's with n = 18 .

It appears that Pepys had some sort of misguided linearity in mind, given the comparisons that he wanted. Let's solve the general
problem.

For k ∈ N_+, let Y_k denote the number of 6's in 6k rolls of a fair die.

1. Identify the distribution of Y_k.
2. Find the mean and variance of Y_k.
3. Give an exact formula for P(Y_k ≥ k), the probability of at least k 6's in 6k rolls of a fair die.
4. Give the normal approximation of the probability in (c).
5. Find the limit of the probability in (d) as k → ∞.
Proof

So on average, the number of 6's in 6k rolls of a fair die is k, and that fact might have influenced Pepys' thinking. The next problem is known as DeMere's problem, named after the Chevalier De Mere.

Which is more likely: at least one 6 with 4 throws of a fair die, or at least one double 6 in 24 throws of two fair dice?
Answer

Data Analysis Exercises


In the cicada data, compute the proportion of males in the entire sample, and the proportion of males of each species in the
sample.
Answer

In the M&M data, pool the bags to create a large sample of M&Ms. Now compute the sample proportion of red M&Ms.
Answer

The Galton Board


The Galton board is a triangular array of pegs. The rows are numbered by the natural numbers N = {0, 1, …} from top
downward. Row n has n + 1 pegs numbered from left to right by the integers {0, 1, … , n}. Thus a peg can be uniquely identified
by the ordered pair (n, k) where n is the row number and k is the peg number in that row. The Galton board is named after Francis
Galton.
Now suppose that a ball is dropped from above the top peg (0, 0). Each time the ball hits a peg, it bounces to the right with
probability p and to the left with probability 1 − p , independently from bounce to bounce.

The number of the peg that the ball hits in row n has the binomial distribution with parameters n and p.

In the Galton board experiment, select random variable Y (the number of moves right). Vary the parameters n and p and note
the shape and location of the probability density function and the mean±standard deviation bar. For selected values of the
parameters, click single step several times and watch the ball fall through the pegs. Then run the experiment 1000 times and
watch the path of the ball. Compare the relative frequency function and empirical moments to the probability density function
and distribution moments, respectively.

Structural Reliability
Recall the discussion of structural reliability given in the last section on Bernoulli trials. In particular, we have a system of n similar components that function independently, each with reliability p. Suppose now that the system as a whole functions properly if and only if at least k of the n components are good. Such a system is called, appropriately enough, a k out of n system. Note that the series and parallel systems considered in the previous section are n out of n and 1 out of n systems, respectively.

Consider the k out of n system.

1. The state of the system is 1(Y_n ≥ k), where Y_n is the number of working components.
2. The reliability function is r_{n,k}(p) = ∑_{i=k}^n \binom{n}{i} p^i (1 − p)^{n−i}.
In the binomial coin experiment, set n = 10 and p = 0.9 and run the simulation 1000 times. Compute the empirical reliability
and compare with the true reliability in each of the following cases:
1. 10 out of 10 (series) system.
2. 1 out of 10 (parallel) system.
3. 4 out of 10 system.
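The true reliabilities in the previous exercise are a one-line computation; here is a minimal Python sketch (our illustration, with our own function name) that tabulates them for p = 0.9.

from math import comb

def reliability(n, k, p):
    # r_{n,k}(p): probability that at least k of the n independent components work
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

for k in (10, 1, 4):                 # series, parallel, and 4-out-of-10 systems
    print(k, "out of 10:", round(reliability(10, k, 0.9), 4))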

Consider a system with n = 4 components. Sketch the graphs of r_{4,1}, r_{4,2}, r_{4,3}, and r_{4,4} on the same set of axes.

An n out of 2n − 1 system is a majority rules system.

1. Compute the reliability of a 2 out of 3 system.
2. Compute the reliability of a 3 out of 5 system.
3. For what values of p is a 3 out of 5 system more reliable than a 2 out of 3 system?
4. Sketch the graphs of r_{3,2} and r_{5,3} on the same set of axes.
5. Show that r_{2n−1,n}(1/2) = 1/2.
Answer

In the binomial coin experiment, compute the empirical reliability, based on 100 runs, in each of the following cases. Compare
your results to the true probabilities.
1. A 2 out of 3 system with p = 0.3
2. A 3 out of 5 system with p = 0.3
3. A 2 out of 3 system with p = 0.8
4. A 3 out of 5 system with p = 0.8

Reliable Communications
Consider the transmission of bits (0s and 1s) through a noisy channel. Specifically, suppose that when bit i ∈ {0, 1} is transmitted, bit i is received with probability p_i ∈ (0, 1) and the complementary bit 1 − i is received with probability 1 − p_i. Given the bits transmitted, bits are received correctly or incorrectly independently of one another. Suppose now, to increase reliability, that a given bit I is repeated n ∈ N_+ times in the transmission. A priori, we believe that P(I = 1) = α ∈ (0, 1) and P(I = 0) = 1 − α. Let X denote the number of 1s received when bit I is transmitted n times.

Find each of the following:

1. The conditional distribution of X given I = i ∈ {0, 1}
2. The probability density function of X
3. E(X)
4. var(X)
Answer

Simplify the results in the last exercise in the symmetric case where p_1 = p_0 =: p (so that the bits are equally reliable) and with α = 1/2 (so that we have no prior information).
Answer

Our interest, of course, is predicting the bit transmitted given the bits received.

Find the posterior probability that I =1 given X = k ∈ {0, 1, … , n} .


Answer
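The posterior probability is a direct application of Bayes' theorem, and the binomial coefficients in the two conditional densities cancel. Here is a minimal Python sketch (ours; the function name and parameter values are illustrative, not from the text):

def posterior_one(k, n, p1, p0, alpha):
    # P(I = 1 | X = k); X given I = 1 is binomial(n, p1),
    # X given I = 0 is binomial(n, 1 - p0); the C(n, k) factors cancel
    like1 = p1**k * (1 - p1)**(n - k)
    like0 = (1 - p0)**k * p0**(n - k)
    return alpha * like1 / (alpha * like1 + (1 - alpha) * like0)

print(posterior_one(7, 11, 0.9, 0.9, 0.5))   # 7 ones among 11 received bits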

Presumably, our decision rule would be to conclude that 1 was transmitted if the posterior probability in the previous exercise is greater than 1/2, and to conclude that 0 was transmitted if this probability is less than 1/2. If the probability equals 1/2, we have no basis to prefer one bit over the other.

Give the decision rule in the symmetric case where p_1 = p_0 =: p, so that the bits are equally reliable. Assume that p > 1/2, so that we at least have a better than even chance of receiving the bit transmitted.
Answer

Not surprisingly, in the symmetric case with no prior information, so that α = 1/2, we conclude that bit i was transmitted if a majority of bits received are i.

Bernstein Polynomials
The Weierstrass Approximation Theorem, named after Karl Weierstrass, states that any real-valued function that is continuous on a
closed, bounded interval can be uniformly approximated on that interval, to any degree of accuracy, with a polynomial. The
theorem is important, since polynomials are simple and basic functions, and a bit surprising, since continuous functions can be
quite strange.
In 1911, Sergei Bernstein gave an explicit construction of polynomials that uniformly approximate a given continuous function, using Bernoulli trials. Bernstein's result is a beautiful example of the probabilistic method, the use of probability theory to obtain results in other areas of mathematics that are seemingly unrelated to probability.
Suppose that f is a real-valued function that is continuous on the interval [0, 1]. The Bernstein polynomial of degree n for f is defined by

b_n(p) = E_p[f(M_n)],   p ∈ [0, 1]   (11.2.46)

where M_n is the proportion of successes in the first n Bernoulli trials with success parameter p, as defined earlier. Note that we are emphasizing the dependence on p in the expected value operator. The next exercise gives a more explicit representation, and shows that the Bernstein polynomial is, in fact, a polynomial.

The Bernstein polynomial of degree n can be written as follows:

b_n(p) = ∑_{k=0}^n f(k/n) \binom{n}{k} p^k (1 − p)^{n−k},   p ∈ [0, 1]   (11.2.47)

Proof

The Bernstein polynomials satisfy the following properties:

1. b_n(0) = f(0) and b_n(1) = f(1)
2. b_1(p) = f(0) + [f(1) − f(0)] p for p ∈ [0, 1]
3. b_2(p) = f(0) + 2[f(1/2) − f(0)] p + [f(1) − 2f(1/2) + f(0)] p^2 for p ∈ [0, 1]

From part (a), the graph of b_n passes through the endpoints (0, f(0)) and (1, f(1)). From part (b), the graph of b_1 is a line connecting the endpoints. From (c), the graph of b_2 is a parabola passing through the endpoints and the point (1/2, (1/4) f(0) + (1/2) f(1/2) + (1/4) f(1)).

The next result gives Bernstein's theorem explicitly.

b_n → f as n → ∞ uniformly on [0, 1].


Proof

Compute the Bernstein polynomials of orders 1, 2, and 3 for the function f defined by f (x) = cos(πx) for x ∈ [0, 1]. Graph f
and the three polynomials on the same set of axes.
Answer
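If a computer algebra system is not at hand, the Bernstein polynomials can also be evaluated numerically. Here is a minimal Python sketch (ours) that evaluates b_n for f(x) = cos(πx) and estimates the maximum error over a grid, illustrating the uniform convergence in Bernstein's theorem:

from math import comb, cos, pi

def bernstein(f, n, p):
    # b_n(p) = sum over k of f(k/n) C(n, k) p^k (1 - p)^(n - k)
    return sum(f(k / n) * comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1))

f = lambda x: cos(pi * x)
for n in (1, 2, 3, 10, 30):
    # maximum error of b_n over a grid of 201 points in [0, 1]
    err = max(abs(bernstein(f, n, i / 200) - f(i / 200)) for i in range(201))
    print(n, round(err, 4))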

Use a computer algebra system to compute the Bernstein polynomials of orders 10, 20, and 30 for the function f defined
below. Use the CAS to graph the function and the three polynomials on the same axes.
f(x) = 0 if x = 0, and f(x) = x sin(π/x) if x ∈ (0, 1]   (11.2.49)

This page titled 11.2: The Binomial Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

11.3: The Geometric Distribution

Basic Theory
Definitions
Suppose again that our random experiment is to perform a sequence of Bernoulli trials X = (X_1, X_2, …) with success parameter p ∈ (0, 1]. In this section we will study the random variable N that gives the trial number of the first success and the random variable M that gives the number of failures before the first success.

Let N = min{n ∈ N_+ : X_n = 1}, the trial number of the first success, and let M = N − 1, the number of failures before the first success. The distribution of N is the geometric distribution on N_+ and the distribution of M is the geometric distribution on N. In both cases, p is the success parameter of the distribution.

Since N and M differ by a constant, the properties of their distributions are very similar. Nonetheless, there are applications where
it more natural to use one rather than the other, and in the literature, the term geometric distribution can refer to either. In this
section, we will concentrate on the distribution of N , pausing occasionally to summarize the corresponding results for M .

The Probability Density Function


N has probability density function f given by f(n) = p(1 − p)^{n−1} for n ∈ N_+.

Proof
Check that f is a valid PDF

A priori, we might have thought it possible to have N = ∞ with positive probability; that is, we might have thought that we could
run Bernoulli trials forever without ever seeing a success. However, we now know this cannot happen when the success parameter
p is positive.

The probability density function of M is given by P(M = n) = p(1 − p)^n for n ∈ N.

In the negative binomial experiment, set k = 1 to get the geometric distribution on N_+. Vary p with the scroll bar and note the

shape and location of the probability density function. For selected values of p, run the simulation 1000 times and compare the
relative frequency function to the probability density function.

Note that the probability density functions of N and M are decreasing, and hence have modes at 1 and 0, respectively. The
geometric form of the probability density functions also explains the term geometric distribution.

Distribution Functions and the Memoryless Property


Suppose that T is a random variable taking values in N_+. Recall that the ordinary distribution function of T is the function n ↦ P(T ≤ n). In this section, the complementary function n ↦ P(T > n) will play a fundamental role. We will refer to this function as the right distribution function of T. Of course both functions completely determine the distribution of T. Suppose again that N has the geometric distribution on N_+ with success parameter p ∈ (0, 1].

N has right distribution function G given by G(n) = (1 − p)^n for n ∈ N.
Proof from Bernoulli trials
Direct proof

From the last result, it follows that the ordinary (left) distribution function of N is given by

F(n) = 1 − (1 − p)^n,   n ∈ N   (11.3.3)

We will now explore another characterization known as the memoryless property.

For m ∈ N , the conditional distribution of N − m given N >m is the same as the distribution of N . That is,
P(N > n + m ∣ N > m) = P(N > n); m, n ∈ N (11.3.4)

Proof

Thus, if the first success has not occurred by trial number m, then the remaining number of trials needed to achieve the first
success has the same distribution as the trial number of the first success in a fresh sequence of Bernoulli trials. In short, Bernoulli
trials have no memory. This fact has implications for a gambler betting on Bernoulli trials (such as in the casino games roulette or
craps). No betting strategy based on observations of past outcomes of the trials can possibly help the gambler.

Conversely, if T is a random variable taking values in N+ that satisfies the memoryless property, then T has a geometric
distribution.
Proof

Moments
Suppose again that N is the trial number of the first success in a sequence of Bernoulli trials, so that N has the geometric distribution on N_+ with parameter p ∈ (0, 1]. The mean and variance of N can be computed in several different ways.

E(N) = 1/p

Proof from the density function


Proof from the right distribution function
Proof from Bernoulli trials

This result makes intuitive sense. In a sequence of Bernoulli trials with success parameter p we would “expect” to wait 1/p trials
for the first success.
var(N) = (1 − p)/p^2

Direct proof
Proof from Bernoulli trials

Note that var(N ) = 0 if p = 1 , hardly surprising since N is deterministic (taking just the value 1) in this case. At the other
extreme, var(N ) ↑ ∞ as p ↓ 0 .

In the negative binomial experiment, set k = 1 to get the geometric distribution. Vary p with the scroll bar and note the
location and size of the mean±standard deviation bar. For selected values of p, run the simulation 1000 times and compare the
sample mean and standard deviation to the distribution mean and standard deviation.

The probability generating function P of N is given by

P(t) = E(t^N) = pt / (1 − (1 − p)t),   |t| < 1/(1 − p)   (11.3.16)

Proof

Recall again that for x ∈ R and k ∈ N, the falling power of x of order k is x^{(k)} = x(x − 1) ⋯ (x − k + 1). If X is a random variable, then E[X^{(k)}] is the factorial moment of X of order k.

The factorial moments of N are given by

E[N^{(k)}] = k! (1 − p)^{k−1} / p^k,   k ∈ N_+   (11.3.18)

Proof from geometric series


Proof from the generating function

Suppose that p ∈ (0, 1). The skewness and kurtosis of N are

1. skew(N) = (2 − p) / √(1 − p)
2. kurt(N) = 9 + p^2 / (1 − p)

Proof

11.3.2 https://stats.libretexts.org/@go/page/10235
Note that the geometric distribution is always positively skewed. Moreover, skew(N ) → ∞ and kurt(N ) → ∞ as p ↑ 1 .

Suppose now that M = N − 1, so that M (the number of failures before the first success) has the geometric distribution on N. Then

1. E(M) = (1 − p)/p
2. var(M) = (1 − p)/p^2
3. skew(M) = (2 − p)/√(1 − p)
4. kurt(M) = 9 + p^2/(1 − p)
5. E(t^M) = p / (1 − (1 − p)t) for |t| < 1/(1 − p)

Of course, the fact that the variance, skewness, and kurtosis are unchanged follows easily, since N and M differ by a constant.

The Quantile Function


Let F denote the distribution function of N, so that F(n) = 1 − (1 − p)^n for n ∈ N. Recall that F^{−1}(r) = min{n ∈ N_+ : F(n) ≥ r} for r ∈ (0, 1) is the quantile function of N.

The quantile function of N is

F^{−1}(r) = ⌈ln(1 − r) / ln(1 − p)⌉,   r ∈ (0, 1)   (11.3.21)

Of course, the quantile function, like the probability density function and the distribution function, completely determines the
distribution of N . Moreover, we can compute the median and quartiles to get measures of center and spread.

The first quartile, the median (or second quartile), and the third quartile are

1. F^{−1}(1/4) = ⌈ln(3/4)/ln(1 − p)⌉ ≈ ⌈−0.2877/ln(1 − p)⌉
2. F^{−1}(1/2) = ⌈ln(1/2)/ln(1 − p)⌉ ≈ ⌈−0.6931/ln(1 − p)⌉
3. F^{−1}(3/4) = ⌈ln(1/4)/ln(1 − p)⌉ ≈ ⌈−1.3863/ln(1 − p)⌉
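The quantile function is simple enough to compute by machine as well as by hand; a minimal Python sketch (ours):

from math import ceil, log

def geometric_quantile(r, p):
    # F^{-1}(r) = ceil(ln(1 - r) / ln(1 - p)) for the geometric on N_+
    return ceil(log(1 - r) / log(1 - p))

p = 0.2
print([geometric_quantile(r, p) for r in (1/4, 1/2, 3/4)])   # the three quartiles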

Open the special distribution calculator, and select the geometric distribution and CDF view. Vary p and note the shape and
location of the CDF/quantile function. For various values of p, compute the median and the first and third quartiles.

The Constant Rate Property


Suppose that T is a random variable taking values in N_+, which we interpret as the first time that some event of interest occurs. The function h given by

h(n) = P(T = n ∣ T ≥ n) = P(T = n) / P(T ≥ n),   n ∈ N_+   (11.3.22)

is the rate function of T.

If T is interpreted as the (discrete) lifetime of a device, then h is a discrete version of the failure rate function studied in reliability
theory. However, in our usual formulation of Bernoulli trials, the event of interest is success rather than failure (or death), so we
will simply use the term rate function to avoid confusion. The constant rate property characterizes the geometric distribution. As
usual, let N denote the trial number of the first success in a sequence of Bernoulli trials with success parameter p ∈ (0, 1), so that
N has the geometric distribution on N_+ with parameter p.

N has constant rate p.


Proof

Conversely, if T has constant rate p ∈ (0, 1) then T has the geometric distribution on N_+ with success parameter p.

Proof

Relation to the Uniform Distribution


Suppose again that X = (X_1, X_2, …) is a sequence of Bernoulli trials with success parameter p ∈ (0, 1). For n ∈ N_+, recall that Y_n = ∑_{i=1}^n X_i, the number of successes in the first n trials, has the binomial distribution with parameters n and p. As before, N denotes the trial number of the first success.

Suppose that n ∈ N_+. The conditional distribution of N given Y_n = 1 is uniform on {1, 2, …, n}.
Proof from sampling
Direct proof

Note that the conditional distribution does not depend on the success parameter p. If we know that there is exactly one success in
the first n trials, then the trial number of that success is equally likely to be any of the n possibilities.
Another connection between the geometric distribution and the uniform distribution is given below in the alternating coin tossing
game: the conditional distribution of N given N ≤ n converges to the uniform distribution on {1, 2, … , n} as p ↓ 0 .

Relation to the Exponential Distribution


The Poisson process on [0, ∞), named for Simeon Poisson, is a model for random points in continuous time. There are many deep
and interesting connections between the Bernoulli trials process (which can be thought of as a model for random points in discrete
time) and the Poisson process. These connections are explored in detail in the chapter on the Poisson process. In this section we just
give the most famous and important result—the convergence of the geometric distribution to the exponential distribution. The
geometric distribution, as we know, governs the time of the first “random point” in the Bernoulli trials process, while the
exponential distribution governs the time of the first random point in the Poisson process.
For reference, the exponential distribution with rate parameter r ∈ (0, ∞) has distribution function F(x) = 1 − e^{−rx} for x ∈ [0, ∞). The mean of the exponential distribution is 1/r and the variance is 1/r^2. In addition, the moment generating function is s ↦ r/(r − s) for s < r.

For n ∈ N_+, suppose that U_n has the geometric distribution on N_+ with success parameter p_n ∈ (0, 1), where n p_n → r > 0 as n → ∞. Then the distribution of U_n/n converges to the exponential distribution with parameter r as n → ∞.

Proof

Note that the condition n p_n → r as n → ∞ is the same condition required for the convergence of the binomial distribution to the Poisson that we studied in the last section.

Special Families
The geometric distribution on N is an infinitely divisible distribution and is a compound Poisson distribution. For the details, visit
these individual sections and see the next section on the negative binomial distribution.

Examples and Applications


Simple Exercises
A standard, fair die is thrown until an ace occurs. Let N denote the number of throws. Find each of the following:
1. The probability density function of N .
2. The mean of N .
3. The variance of N .
4. The probability that the die will have to be thrown at least 5 times.
5. The quantile function of N .
6. The median and the first and third quartiles.
Answer

A type of missile has failure probability 0.02. Let N denote the number of launches before the first failure. Find each of the
following:

1. The probability density function of N .
2. The mean of N .
3. The variance of N .
4. The probability of 20 consecutive successful launches.
5. The quantile function of N .
6. The median and the first and third quartiles.
Answer

A student takes a multiple choice test with 10 questions, each with 5 choices (only one correct). The student blindly guesses
and gets one question correct. Find the probability that the correct question was one of the first 4.
Answer

Recall that an American roulette wheel has 38 slots: 18 are red, 18 are black, and 2 are green. Suppose that you observe red or
green on 10 consecutive spins. Give the conditional distribution of the number of additional spins needed for black to occur.
Answer

The game of roulette is studied in more detail in the chapter on Games of Chance.

In the negative binomial experiment, set k = 1 to get the geometric distribution and set p = 0.3 . Run the experiment 1000
times. Compute the appropriate relative frequencies and empirically investigate the memoryless property
P(V > 5 ∣ V > 2) = P(V > 3) (11.3.26)

The Petersburg Problem


We will now explore a gambling situation, known as the Petersburg problem, which leads to some famous and surprising results.
Suppose that we are betting on a sequence of Bernoulli trials with success parameter p ∈ (0, 1). We can bet any amount of money
on a trial at even stakes: if the trial results in success, we receive that amount, and if the trial results in failure, we must pay that
amount. We will use the following strategy, known as a martingale strategy:
1. We bet c units on the first trial.
2. Whenever we lose a trial, we double the bet for the next trial.
3. We stop as soon as we win a trial.
Let N denote the number of trials played, so that N has the geometric distribution with parameter p, and let W denote our net
winnings when we stop.

W = c

Proof

Thus, W is not random and W is independent of p! Since c is an arbitrary constant, it would appear that we have an ideal strategy. However, let us study the amount of money Z needed to play the strategy.

Z = c(2^N − 1)

The expected amount of money needed for the martingale strategy is

E(Z) = c/(2p − 1) if p > 1/2, and E(Z) = ∞ if p ≤ 1/2   (11.3.28)

Thus, the strategy is fatally flawed when the trials are unfavorable and even when they are fair, since we need infinite expected
capital to make the strategy work in these cases.

Compute E(Z) explicitly if c = 100 and p = 0.55.


Answer
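The formula for E(Z) is easy to evaluate, and for p > 3/4 it is also easy to confirm by simulation, since Z = c(2^N − 1) then has finite variance (E(4^N) < ∞ exactly when 4(1 − p) < 1). A minimal Python sketch (ours; the seed and run count are arbitrary):

import random

def expected_capital(c, p):
    # E(Z) = c / (2p - 1) for p > 1/2, infinite otherwise
    return c / (2 * p - 1) if p > 0.5 else float("inf")

def simulated_capital(c, p, runs=200_000, seed=1):
    # sample mean of Z = c(2^N - 1), with N geometric on N_+ with parameter p
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        n = 1
        while rng.random() >= p:
            n += 1
        total += c * (2**n - 1)
    return total / runs

print(expected_capital(100, 0.55))                        # the exercise above
print(expected_capital(1, 0.8), simulated_capital(1, 0.8))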

In the negative binomial experiment, set k = 1 . For each of the following values of p, run the experiment 100 times. For each
run compute Z (with c = 1 ). Find the average value of Z over the 100 runs:
1. p = 0.2
2. p = 0.5
3. p = 0.8

For more information about gambling strategies, see the section on Red and Black. Martingales are studied in detail in a separate
chapter.

The Alternating Coin-Tossing Game


A coin has probability of heads p ∈ (0, 1]. There are n players who take turns tossing the coin in round-robin style: player 1 first, then player 2, continuing until player n, then player 1 again, and so forth. The first player to toss heads wins the game.

Let N denote the number of the first toss that results in heads. Of course, N has the geometric distribution on N_+ with parameter p. Additionally, let W denote the winner of the game; W takes values in the set {1, 2, …, n}. We are interested in the probability distribution of W.

For i ∈ {1, 2, …, n}, W = i if and only if N = i + kn for some k ∈ N. That is, using modular arithmetic,

W = [(N − 1) mod n] + 1   (11.3.29)

The winning player W has probability density function

P(W = i) = p(1 − p)^{i−1} / (1 − (1 − p)^n),   i ∈ {1, 2, …, n}   (11.3.30)

Proof

P(W = i) = (1 − p)^{i−1} P(W = 1) for i ∈ {1, 2, …, n}.
Proof

Note that P(W = i) is a decreasing function of i ∈ {1, 2, … , n} . Not surprisingly, the lower the toss order the better for the
player.

Explicitly compute the probability density function of W when the coin is fair (p = 1/2 ).
Answer

Note from the result above that W itself has a truncated geometric distribution.

The distribution of W is the same as the conditional distribution of N given N ≤n :

P(W = i) = P(N = i ∣ N ≤ n), i ∈ {1, 2, … , n} (11.3.31)

The following problems explore some limiting distributions related to the alternating coin-tossing game.

For fixed p ∈ (0, 1], the distribution of W converges to the geometric distribution with parameter p as n ↑ ∞ .

For fixed n , the distribution of W converges to the uniform distribution on {1, 2, … , n} as p ↓ 0 .

Players at the end of the tossing order should hope for a coin biased towards tails.
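Both limit results are easy to see numerically. The following Python sketch (ours) tabulates the density of W for n = 3 players: with a fair coin the probabilities are far from uniform, while with a coin heavily biased towards tails they are nearly uniform.

def win_probability(i, n, p):
    # P(W = i) = p (1 - p)^(i - 1) / (1 - (1 - p)^n)
    return p * (1 - p)**(i - 1) / (1 - (1 - p)**n)

n = 3
print([round(win_probability(i, n, 0.5), 4) for i in range(1, n + 1)])    # fair coin
print([round(win_probability(i, n, 0.01), 4) for i in range(1, n + 1)])   # nearly uniform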

Odd Man Out


In the game of odd man out, we start with a specified number of players, each with a coin that has the same probability of heads.
The players toss their coins at the same time. If there is an odd man, that is a player with an outcome different than all of the other
players, then the odd player is eliminated; otherwise no player is eliminated. In any event, the remaining players continue the game
in the same manner. A slight technical problem arises with just two players, since different outcomes would make both players
“odd”. So in this case, we might (arbitrarily) make the player with tails the odd man.

Suppose there are k ∈ {2, 3, …} players and p ∈ [0, 1]. In a single round, the probability of an odd man is

r_k(p) = 2p(1 − p) if k = 2, and r_k(p) = kp(1 − p)^{k−1} + kp^{k−1}(1 − p) if k ∈ {3, 4, …}   (11.3.32)

Proof

The graph of r_k is more interesting than you might think.

Figure 11.3.1: The graphs of r_k for k ∈ {3, 4, 5, 6}

For k ∈ {2, 3, …}, r_k has the following properties:

1. r_k(0) = r_k(1) = 0
2. r_k is symmetric about p = 1/2
3. For fixed p ∈ [0, 1], r_k(p) → 0 as k → ∞.

Proof

For k ∈ {2, 3, 4}, r_k has the following properties:

1. r_k increases and then decreases, with maximum at p = 1/2.
2. r_k is concave downward.

Proof

For k ∈ {5, 6, …}, r_k has the following properties:

1. The maximum occurs at two points of the form p_k and 1 − p_k, where p_k ∈ (0, 1/2) and p_k → 0 as k → ∞.
2. The maximum value r_k(p_k) → 1/e ≈ 0.3679 as k → ∞.
3. The graph has a local minimum at p = 1/2.

Proof sketch
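A minimal Python sketch (ours) that evaluates r_k on a grid makes these properties visible: for k = 3 the maximum is at p = 1/2, while for larger k the maximum drifts towards 0 (or, symmetrically, towards 1) and its value approaches 1/e ≈ 0.3679.

def odd_man_prob(k, p):
    # r_k(p): probability of an odd man in one round with k players
    if k == 2:
        return 2 * p * (1 - p)
    return k * p * (1 - p)**(k - 1) + k * p**(k - 1) * (1 - p)

for k in (3, 5, 10, 50):
    # locate the maximum of r_k over a grid of p values
    value, argmax = max((odd_man_prob(k, i / 1000), i / 1000) for i in range(1, 1000))
    print(k, round(argmax, 3), round(value, 4))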

Suppose p ∈ (0, 1), and let N_k denote the number of rounds until an odd man is eliminated, starting with k players. Then N_k has the geometric distribution on N_+ with parameter r_k(p). The mean and variance are

1. μ_k(p) = 1/r_k(p)
2. σ_k^2(p) = [1 − r_k(p)] / r_k^2(p)

As we might expect, μ_k(p) → ∞ and σ_k^2(p) → ∞ as k → ∞ for fixed p ∈ (0, 1). On the other hand, from the result above, μ_k(p_k) → e and σ_k^2(p_k) → e^2 − e as k → ∞.

Suppose we start with k ∈ {2, 3, …} players and p ∈ (0, 1). The number of rounds until a single player remains is M_k = ∑_{j=2}^k N_j, where (N_2, N_3, …, N_k) are independent and N_j has the geometric distribution on N_+ with parameter r_j(p). The mean and variance are

1. E(M_k) = ∑_{j=2}^k 1/r_j(p)
2. var(M_k) = ∑_{j=2}^k [1 − r_j(p)] / r_j^2(p)

Proof

Starting with k players and probability of heads p ∈ (0, 1), the total number of coin tosses is T_k = ∑_{j=2}^k j N_j. The mean and variance are
variance are

1. E(T_k) = ∑_{j=2}^k j/r_j(p)
2. var(T_k) = ∑_{j=2}^k j^2 [1 − r_j(p)] / r_j^2(p)

Proof

Number of Trials Before a Pattern


Consider again a sequence of Bernoulli trials X = (X_1, X_2, …) with success parameter p ∈ (0, 1). Recall that the number of trials M before the first success (outcome 1) occurs has the geometric distribution on N with parameter p. A natural generalization is the random variable that gives the number of trials before a specific finite sequence of outcomes occurs for the first time. (Such a sequence is sometimes referred to as a word from the alphabet {0, 1}, or simply a bit string.) In general, finding the distribution of this variable is a difficult problem, with the difficulty depending very much on the nature of the word. The problem of finding just the expected number of trials before a word occurs can be solved using powerful tools from the theory of renewal processes and from the theory of martingales.

To set up the notation, let x denote a finite bit string and let M_x denote the number of trials before x occurs for the first time. Finally, let q = 1 − p. Note that M_x takes values in N. In the following exercises, we will consider x = 10, a success followed by a failure. As always, try to derive the results yourself before looking at the proofs.

The probability density function f_10 of M_10 is given as follows:

1. If p ≠ 1/2 then

f_10(n) = pq (p^{n+1} − q^{n+1}) / (p − q),   n ∈ N   (11.3.33)

2. If p = 1/2 then f_10(n) = (n + 1)(1/2)^{n+2} for n ∈ N.
Proof

It's interesting to note that f_10 is symmetric in p and q, that is, symmetric about p = 1/2. It follows that the distribution function, probability generating function, expected value, and variance, which we consider below, are all also symmetric about p = 1/2. It's also interesting to note that f_10(0) = f_10(1) = pq, and this is the largest value. So regardless of p ∈ (0, 1), the distribution is bimodal with modes 0 and 1.
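The density f_10 is easy to verify by simulation. The following Python sketch (ours; the seed and run count are arbitrary) generates copies of M_10, the number of trials before the pattern 10 first appears, and compares the empirical frequencies with the formula:

import random

def pdf_10(n, p):
    # f_10(n), including the p = 1/2 case
    q = 1 - p
    if p == 0.5:
        return (n + 1) * 0.5**(n + 2)
    return p * q * (p**(n + 1) - q**(n + 1)) / (p - q)

def sample_m10(p, rng):
    trials, prev = 0, None
    while True:
        trials += 1
        bit = 1 if rng.random() < p else 0
        if prev == 1 and bit == 0:
            return trials - 2          # trials before the two that form "10"
        prev = bit

rng, runs, p = random.Random(3), 100_000, 0.3
counts = {}
for _ in range(runs):
    m = sample_m10(p, rng)
    counts[m] = counts.get(m, 0) + 1
for n in range(5):
    print(n, counts.get(n, 0) / runs, round(pdf_10(n, p), 4))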

The distribution function F_10 of M_10 is given as follows:

1. If p ≠ 1/2 then

F_10(n) = 1 − (p^{n+3} − q^{n+3}) / (p − q),   n ∈ N   (11.3.35)

2. If p = 1/2 then F_10(n) = 1 − (n + 3)(1/2)^{n+2} for n ∈ N.
Proof

The probability generating function P_10 of M_10 is given as follows:

1. If p ≠ 1/2 then

P_10(t) = (pq/(p − q)) (p/(1 − tp) − q/(1 − tq)),   |t| < min{1/p, 1/q}   (11.3.36)

2. If p = 1/2 then P_10(t) = 1/(t − 2)^2 for |t| < 2.
Proof

The mean of M_10 is given as follows:

1. If p ≠ 1/2 then

E(M_10) = (p^4 − q^4) / (pq(p − q))   (11.3.38)

2. If p = 1/2 then E(M_10) = 2.
Proof

The graph of E(M_10) as a function of p ∈ (0, 1) is given below. It's not surprising that E(M_10) → ∞ as p ↓ 0 and as p ↑ 1, and that the minimum value occurs when p = 1/2.

Figure 11.3.2: E(M_10) as a function of p

The variance of M_10 is given as follows:

1. If p ≠ 1/2 then

var(M_10) = (2/(p^2 q^2)) (p^6 − q^6)/(p − q) + (1/(pq)) (p^4 − q^4)/(p − q) − (1/(p^2 q^2)) [(p^4 − q^4)/(p − q)]^2   (11.3.39)

2. If p = 1/2 then var(M_10) = 4.
Proof

This page titled 11.3: The Geometric Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

11.4: The Negative Binomial Distribution

Basic Theory
Suppose again that our random experiment is to perform a sequence of Bernoulli trials X = (X_1, X_2, …) with success parameter p ∈ (0, 1]. Recall that the number of successes in the first n trials

Y_n = ∑_{i=1}^n X_i   (11.4.1)

has the binomial distribution with parameters n and p. In this section we will study the random variable that gives the trial number of the k th success:

V_k = min{n ∈ N_+ : Y_n = k}   (11.4.2)

Note that V_1 is the number of trials needed to get the first success, which we now know has the geometric distribution on N_+ with parameter p.

The Probability Density Function


The probability distribution of V_k is given by

P(V_k = n) = \binom{n − 1}{k − 1} p^k (1 − p)^{n−k},   n ∈ {k, k + 1, k + 2, …}   (11.4.3)

Proof

The distribution defined by the density function in (1) is known as the negative binomial distribution; it has two parameters, the
stopping parameter k and the success probability p.

In the negative binomial experiment, vary k and p with the scroll bars and note the shape of the density function. For selected
values of k and p, run the experiment 1000 times and compare the relative frequency function to the probability density
function.

The binomial and negative binomial sequences are inverse to each other in a certain sense.

For n ∈ N_+ and k ∈ {0, 1, …, n},

1. Y_n ≥ k ⟺ V_k ≤ n, and hence P(Y_n ≥ k) = P(V_k ≤ n)
2. k P(Y_n = k) = n P(V_k = n)

Proof

In particular, it follows from part (a) that any event that can be expressed in terms of the negative binomial variables can also be
expressed in terms of the binomial variables.
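Both the density function and the inverse relation in part (a) are easy to check numerically; a minimal Python sketch (ours):

from math import comb

def neg_binom_pdf(n, k, p):
    # P(V_k = n) = C(n - 1, k - 1) p^k (1 - p)^(n - k)
    return comb(n - 1, k - 1) * p**k * (1 - p)**(n - k)

def binom_pdf(j, n, p):
    return comb(n, j) * p**j * (1 - p)**(n - j)

n, k, p = 12, 4, 0.3
lhs = sum(binom_pdf(j, n, p) for j in range(k, n + 1))       # P(Y_n >= k)
rhs = sum(neg_binom_pdf(m, k, p) for m in range(k, n + 1))   # P(V_k <= n)
print(round(lhs, 6), round(rhs, 6))                          # the two agree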

The negative binomial distribution is unimodal. Let t = 1 + (k − 1)/p. Then

1. P(V_k = n) > P(V_k = n − 1) if and only if n < t.
2. The probability density function at first increases and then decreases, reaching its maximum value at ⌊t⌋.
3. There is a single mode at ⌊t⌋ if t is not an integer, and two consecutive modes at t − 1 and t if t is an integer.

Times Between Successes


Next we will define the random variables that give the number of trials between successive successes. Let U_1 = V_1 and U_k = V_k − V_{k−1} for k ∈ {2, 3, …}.

U = (U1 , U2 , …) is a sequence of independent random variables, each having the geometric distribution on N+ with
parameter p. Moreover,

V_k = ∑_{i=1}^k U_i   (11.4.5)

In statistical terms, U corresponds to sampling from the geometric distribution with parameter p, so that for each k, (U_1, U_2, …, U_k) is a random sample of size k from this distribution. The sample mean corresponding to this sample is V_k/k; this random variable gives the average number of trials between the first k successes. In probability terms, the sequence of negative binomial variables V is the partial sum process corresponding to the sequence U. Partial sum processes are studied in more generality in the chapter on Random Samples.

The random process V = (V_1, V_2, …) has stationary, independent increments:

1. If j < k then V_k − V_j has the same distribution as V_{k−j}, namely negative binomial with parameters k − j and p.
2. If k_1 < k_2 < k_3 < ⋯ then (V_{k_1}, V_{k_2} − V_{k_1}, V_{k_3} − V_{k_2}, …) is a sequence of independent random variables.

Actually, any partial sum process corresponding to an independent, identically distributed sequence will have stationary,
independent increments.

Basic Properties
The mean, variance and probability generating function of V_k can be computed in several ways. The method using the representation as a sum of independent, identically distributed geometrically distributed variables is the easiest.

V_k has probability generating function P given by

P(t) = [pt / (1 − (1 − p)t)]^k,   |t| < 1/(1 − p)   (11.4.6)
Proof

The mean and variance of V_k are

1. E(V_k) = k/p
2. var(V_k) = k(1 − p)/p^2

Proof

In the negative binomial experiment, vary k and p with the scroll bars and note the location and size of the mean/standard
deviation bar. For selected values of the parameters, run the experiment 1000 times and compare the sample mean and standard
deviation to the distribution mean and standard deviation.

Suppose that V and W are independent random variables for an experiment, and that V has the negative binomial distribution
with parameters j and p, and W has the negative binomial distribution with parameters k and p. Then V + W has the negative
binomial distribution with parameters k + j and p.
Proof

Normal Approximation
In the negative binomial experiment, start with various values of p and k = 1 . Successively increase k by 1, noting the shape
of the probability density function each time.

Even though you are limited to k = 5 in the app, you can still see the characteristic bell shape. This is a consequence of the central
limit theorem because the negative binomial variable can be written as a sum of k independent, identically distributed (geometric)
random variables.

The standard score of V_k is

Z_k = (p V_k − k) / √(k(1 − p))   (11.4.7)

The distribution of Z_k converges to the standard normal distribution as k → ∞.

From a practical point of view, this result means that if k is “large”, the distribution of V_k is approximately normal with mean k/p and variance k(1 − p)/p^2. Just how large k needs to be for the approximation to work well depends on p. Also, when using the normal approximation, we should remember to use the continuity correction, since the negative binomial is a discrete distribution.

Relation to Order Statistics


Suppose that n ∈ N_+ and k ∈ {1, 2, …, n}, and let L = {(n_1, n_2, …, n_k) ∈ {1, 2, …, n}^k : n_1 < n_2 < ⋯ < n_k}. Then

P(V_1 = n_1, V_2 = n_2, …, V_k = n_k ∣ Y_n = k) = 1 / \binom{n}{k},   (n_1, n_2, …, n_k) ∈ L   (11.4.8)

Proof

Thus, given exactly k successes in the first n trials, the vector of success trial numbers is uniformly distributed on the set of
possibilities L, regardless of the value of the success parameter p. Equivalently, the vector of success trial numbers is distributed as
the vector of order statistics corresponding to a sample of size k chosen at random and without replacement from {1, 2, … , n}.

Suppose that n ∈ N_+, k ∈ {1, 2, …, n}, and j ∈ {1, 2, …, k}. Then

P(V_j = m \mid Y_n = k) = \frac{\binom{m-1}{j-1} \binom{n-m}{k-j}}{\binom{n}{k}}, \quad m ∈ \{j, j + 1, \ldots, n - k + j\} \qquad (11.4.10)

Proof

Thus, given exactly k successes in the first n trials, the trial number of the j th success has the same distribution as the j th order
statistic when a sample of size k is selected at random and without replacement from the population {1, 2, … , n}. Again, this
result does not depend on the value of the success parameter p. The following theorem gives the mean and variance of the
conditional distribution.

Suppose again that n ∈ N_+, k ∈ {1, 2, …, n}, and j ∈ {1, 2, …, k}. Then

1. E(V_j \mid Y_n = k) = j \frac{n + 1}{k + 1}
2. var(V_j \mid Y_n = k) = j(k - j + 1) \frac{(n + 1)(n - k)}{(k + 1)^2 (k + 2)}

Proof
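A brief simulation (illustrative values and seed; not part of the text) can check the conditional mean against the order-statistic interpretation: sample k values without replacement from {1, …, n} and average the j-th smallest:

import numpy as np

rng = np.random.default_rng(7)
n, k, j, reps = 20, 5, 2, 50_000   # illustrative values

# First k entries of random permutations: samples of size k without replacement from {1,...,n}
perms = np.argsort(rng.random((reps, n)), axis=1)[:, :k] + 1
order_stats = np.sort(perms, axis=1)

print(order_stats[:, j - 1].mean())   # simulated E(V_j | Y_n = k)
print(j * (n + 1) / (k + 1))          # formula: 2 * 21 / 6 = 7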

Examples and Applications


Coins, Dice and Other Gadgets
A standard, fair die is thrown until 3 aces occur. Let V denote the number of throws. Find each of the following:
1. The probability density function of V .
2. The mean of V .
3. The variance of V .
4. The probability that at least 20 throws will be needed.
Answer

A coin is tossed repeatedly. The 10th head occurs on the 25th toss. Find each of the following:
1. The probability density function of the trial number of the 5th head.
2. The mean of the distribution in (a).
3. The variance of the distribution in (a).
Answer

A certain type of missile has failure probability 0.02. Let N denote the launch number of the fourth failure. Find each of the
following:

11.4.3 https://stats.libretexts.org/@go/page/10236
1. The probability density function of N .
2. The mean of N .
3. The variance of N .
4. The probability that there will be at least 4 failures in the first 200 launches.
Answer

In the negative binomial experiment, set p = 0.5 and k = 5 . Run the experiment 1000 times. Compute and compare each of
the following:
1. P(8 ≤ V_5 ≤ 15)
2. The relative frequency of the event {8 ≤ V_5 ≤ 15} in the simulation
3. The normal approximation to P(8 ≤ V_5 ≤ 15)

Answer

A coin is tossed until the 50th head occurs.


1. Assuming that the coin is fair, find the normal approximation of the probability that the coin is tossed at least 125 times.
2. Suppose that you perform this experiment, and 125 tosses are required. Do you believe that the coin is fair?
Answer

The Banach Match Problem


Suppose that an absent-minded professor (is there any other kind?) has m matches in his right pocket and m matches in his left
pocket. When he needs a match to light his pipe, he is equally likely to choose a match from either pocket. We want to compute the
probability density function of the random variable W that gives the number of matches remaining when the professor first
discovers that one of the pockets is empty. This is known as the Banach match problem, named for the mathematician Stefan
Banach, who evidently behaved in the manner described.
We can recast the problem in terms of the negative binomial distribution. Clearly the match choices form a sequence of Bernoulli
trials with parameter p = 1/2. Specifically, we can consider a match from the right pocket as a win for player R, and a match from
the left pocket as a win for player L. In a hypothetical infinite sequence of trials, let U denote the number of trials necessary for R
to win m + 1 trials, and V the number of trials necessary for L to win m + 1 trials. Note that U and V each have the negative
binomial distribution with parameters m + 1 and p.

For k ∈ {0, 1, … , m},


1. L has m − k wins at the moment when R wins m + 1 games if and only if U = 2m − k + 1 .
2. {U = 2m − k + 1} is equivalent to the event that the professor first discovers that the right pocket is empty and that the
left pocket has k matches
3. P(U = 2m − k + 1) = \binom{2m-k}{m} \left(\frac{1}{2}\right)^{2m-k+1}

For k ∈ {0, 1, … , m},


1. R has m − k wins at the moment when L wins m + 1 games if and only if V = 2m − k + 1.
2. {V = 2m − k + 1} is equivalent to the event that the professor first discovers that the left pocket is empty and that the
right pocket has k matches
3. P(V = 2m − k + 1) = \binom{2m-k}{m} \left(\frac{1}{2}\right)^{2m-k+1}

W has probability density function

P(W = k) = \binom{2m - k}{m} \left(\frac{1}{2}\right)^{2m - k}, \quad k ∈ \{0, 1, \ldots, m\} \qquad (11.4.16)

Proof

We can also solve the non-symmetric Banach match problem, using the same methods as above. Thus, suppose that the professor
reaches for a match in his right pocket with probability p and in his left pocket with probability 1 − p , where 0 < p < 1 . The

essential change in the analysis is that U has the negative binomial distribution with parameters m +1 and p, while V has the
negative binomial distribution with parameters m + 1 and 1 − p .

For the Banach match problem with parameter p, W has probability density function

P(W = k) = \binom{2m - k}{m} \left[ p^{m+1} (1 - p)^{m-k} + (1 - p)^{m+1} p^{m-k} \right], \quad k ∈ \{0, 1, \ldots, m\} \qquad (11.4.17)
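Here is a minimal sketch of the general density (11.4.17) in Python (the function name and the choice m = 50 are illustrative); setting p = 1/2 recovers the symmetric density (11.4.16):

from math import comb

def banach_pmf(k, m, p=0.5):
    # P(W = k): matches remaining when a pocket is first found empty (11.4.17)
    return comb(2 * m - k, m) * (p**(m + 1) * (1 - p)**(m - k)
                                 + (1 - p)**(m + 1) * p**(m - k))

m = 50
pmf = [banach_pmf(k, m) for k in range(m + 1)]
print(sum(pmf))          # sanity check: the probabilities sum to 1
print(banach_pmf(0, m))  # probability that the other pocket is also empty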

The Problem of Points


Suppose that two teams (or individuals) A and B play a sequence of Bernoulli trials (which we will also call points), where
p ∈ (0, 1) is the probability that player A wins a point. For nonnegative integers n and m, let A_{n,m}(p) denote the probability
that A wins n points before B wins m points. Computing A_{n,m}(p) is an historically famous problem, known as the problem of
points, that was solved by Pierre de Fermat and by Blaise Pascal.

Comment on the validity of the Bernoulli trial assumptions (independence of trials and constant probability of success) for
games of sport that have a skill component as well as a random component.

There is an easy solution to the problem of points using the binomial distribution; this was essentially Pascal's solution. There is
also an easy solution to the problem of points using the negative binomial distribution. In a sense, this has to be the case, given the
equivalence between the binomial and negative binomial processes in (3). First, let us pretend that the trials go on forever,
regardless of the outcomes. Let Y_{n+m-1} denote the number of wins by player A in the first n + m − 1 points, and let V_n
denote the number of trials needed for A to win n points. By definition, Y_{n+m-1} has the binomial distribution with parameters
n + m − 1 and p, and V_n has the negative binomial distribution with parameters n and p.

Player A wins n points before B wins m points if and only if Y_{n+m-1} ≥ n, if and only if V_n ≤ n + m − 1. Hence

A_{n,m}(p) = \sum_{k=n}^{n+m-1} \binom{n+m-1}{k} p^k (1 - p)^{n+m-1-k} = \sum_{j=n}^{n+m-1} \binom{j-1}{n-1} p^n (1 - p)^{j-n} \qquad (11.4.18)
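The two sums in (11.4.18) are easy to check numerically; the following sketch (illustrative function names) computes A_{n,m}(p) both ways:

from math import comb

def win_prob_binomial(n, m, p):
    # A_{n,m}(p) via the binomial form of (11.4.18), essentially Pascal's solution
    N = n + m - 1
    return sum(comb(N, k) * p**k * (1 - p)**(N - k) for k in range(n, N + 1))

def win_prob_negbinomial(n, m, p):
    # A_{n,m}(p) via the negative binomial form of (11.4.18)
    return sum(comb(j - 1, n - 1) * p**n * (1 - p)**(j - n) for j in range(n, n + m))

print(win_prob_binomial(3, 3, 0.6))     # about 0.68256
print(win_prob_negbinomial(3, 3, 0.6))  # same value from the other formula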

A_{n,m}(p) satisfies the following properties:

1. A_{n,m}(p) increases from 0 to 1 as p increases from 0 to 1 for fixed n and m.
2. A_{n,m}(p) decreases as n increases for fixed m and p.
3. A_{n,m}(p) increases as m increases for fixed n and p.

1 − A_{n,m}(p) = A_{m,n}(1 − p) for any m, n ∈ N_+ and p ∈ (0, 1).


Proof

In the problem of points experiments, vary the parameters n , m, and p, and note how the probability changes. For selected
values of the parameters, run the simulation 1000 times and note the apparent convergence of the relative frequency to the
probability.

The win probability function for player A satisfies the following recurrence relation and boundary conditions (this was
essentially Fermat's solution):
1. A_{n,m}(p) = p A_{n-1,m}(p) + (1 − p) A_{n,m-1}(p), n, m ∈ N_+
2. A_{n,0}(p) = 0, A_{0,m}(p) = 1

Proof

Next let N_{n,m} denote the number of trials needed until either A wins n points or B wins m points, whichever occurs first (the
length of the problem of points experiment). The following result gives the distribution of N_{n,m}.

For k ∈ {min{m, n}, …, n + m − 1},

P(N_{n,m} = k) = \binom{k-1}{n-1} p^n (1 - p)^{k-n} + \binom{k-1}{m-1} (1 - p)^m p^{k-m} \qquad (11.4.19)

Proof

Series of Games
The special case of the problem of points experiment with m = n is important, because it corresponds to A and B playing a best of
2n − 1 game series. That is, the first player to win n games wins the series. Such series, especially when n ∈ {2, 3, 4}, are

frequently used in championship tournaments.

Let A_n(p) denote the probability that player A wins the series. Then

A_n(p) = \sum_{k=n}^{2n-1} \binom{2n-1}{k} p^k (1 - p)^{2n-1-k} = \sum_{j=n}^{2n-1} \binom{j-1}{n-1} p^n (1 - p)^{j-n} \qquad (11.4.20)

Proof

Suppose that p = 0.6 . Explicitly find the probability that team A wins in each of the following cases:
1. A best of 5 game series.
2. A best of 7 game series.
Answer

In the problem of points experiments, vary the parameters n , m, and p (keeping n = m ), and note how the probability
changes. Now simulate a best of 5 series by selecting n = m = 3 , p = 0.6 . Run the experiment 1000 times and compare the
relative frequency to the true probability.

A_n(1 − p) = 1 − A_n(p) for any n ∈ N_+ and p ∈ [0, 1]. Therefore

1. The graph of A_n is symmetric with respect to p = 1/2.
2. A_n(1/2) = 1/2.

Proof

In the problem of points experiments, vary the parameters n , m, and p (keeping n = m ), and note how the probability
changes. Now simulate a best of 7 series by selecting n = m = 4, p = 0.45. Run the experiment 1000 times and compare the
relative frequency to the true probability.

If n > m then

1. A_n(p) < A_m(p) if 0 < p < 1/2
2. A_n(p) > A_m(p) if 1/2 < p < 1

Proof

Let N_n denote the number of trials in the series. Then N_n has probability density function

P(N_n = k) = \binom{k-1}{n-1} \left[ p^n (1 - p)^{k-n} + (1 - p)^n p^{k-n} \right], \quad k ∈ \{n, n + 1, \ldots, 2n − 1\} \qquad (11.4.21)
n−1

Proof
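The following sketch (illustrative; anticipating the exercise below) tabulates the density (11.4.21) along with the mean and standard deviation of the length of a best of 7 series:

from math import comb, sqrt

def series_length_pmf(k, n, p):
    # P(N_n = k) for a best of 2n - 1 series (11.4.21)
    return comb(k - 1, n - 1) * (p**n * (1 - p)**(k - n) + (1 - p)**n * p**(k - n))

n = 4  # best of 7
for p in (0.5, 0.7, 0.9):
    pmf = {k: series_length_pmf(k, n, p) for k in range(n, 2 * n)}
    mean = sum(k * q for k, q in pmf.items())
    sd = sqrt(sum(k**2 * q for k, q in pmf.items()) - mean**2)
    print(p, {k: round(q, 4) for k, q in pmf.items()}, round(mean, 3), round(sd, 3))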

Explicitly compute the probability density function, expected value, and standard deviation for the number of games in a best
of 7 series with the following values of p:
1. 0.5
2. 0.7
3. 0.9
Answer

Division of Stakes
The problem of points originated from a question posed by Chevalier de Mere, who was interested in the fair division of stakes
when a game is interrupted. Specifically, suppose that players A and B each put up c monetary units, and then play Bernoulli trials
until one of them wins a specified number of trials. The winner then takes the entire 2c fortune.

If the game is interrupted when A needs to win n more trials and B needs to win m more trials, then the fortune should be
divided between A and B , respectively, as follows:
1. 2c A_{n,m}(p) for A
2. 2c [1 − A_{n,m}(p)] = 2c A_{m,n}(1 − p) for B.

Suppose that players A and B bet $50 each. The players toss a fair coin until one of them has 10 wins; the winner takes the
entire fortune. Suppose that the game is interrupted by the gambling police when A has 5 wins and B has 3 wins. How should
the stakes be divided?
Answer
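A quick computation for this situation (a self-contained sketch using the binomial form of (11.4.18); variable names illustrative): A needs n = 5 more wins, B needs m = 7, and the total stake is 2c = $100.

from math import comb

n, m, p, total = 5, 7, 0.5, 100
a_prob = sum(comb(n + m - 1, k) * p**k * (1 - p)**(n + m - 1 - k)
             for k in range(n, n + m))
print(round(total * a_prob, 2), round(total * (1 - a_prob), 2))  # A's and B's shares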

Alternate and General Versions


Let's return to the formulation at the beginning of this section. Thus, suppose that we have a sequence of Bernoulli trials X with
success parameter p ∈ (0, 1], and for k ∈ N_+, we let V_k denote the trial number of the k th success. Thus, V_k has the negative
binomial distribution with parameters k and p as we studied above. The random variable W_k = V_k − k is the number of failures
before the k th success. Let N_1 = W_1, the number of failures before the first success, and let N_k = W_k − W_{k-1}, the number
of failures between the (k − 1)st success and the k th success, for k ∈ {2, 3, …}.

N = (N_1, N_2, …) is a sequence of independent random variables, each having the geometric distribution on N with
parameter p. Moreover,

W_k = \sum_{i=1}^{k} N_i \qquad (11.4.22)

Thus, W = (W_1, W_2, …) is the partial sum process associated with N. In particular, W has stationary, independent increments.

Probability Density Functions


The probability density function of W_k is given by

P(W_k = n) = \binom{n+k-1}{k-1} p^k (1 - p)^n = \binom{n+k-1}{n} p^k (1 - p)^n, \quad n ∈ N \qquad (11.4.23)

Proof

The distribution of W_k is also referred to as the negative binomial distribution with parameters k and p. Thus, the term negative
binomial distribution can refer either to the distribution of the trial number of the k th success or the distribution of the number of
failures before the k th success, depending on the author and the context. The two random variables differ by a constant, so it's not
a particularly important issue as long as we know which version is intended. In this text, we will refer to the alternate version as
the negative binomial distribution on N, to distinguish it from the original version, which has support set {k, k + 1, …}.
More interestingly, however, the probability density function in the last result makes sense for any k ∈ (0, ∞), not just integers.
To see this, first recall the definition of the general binomial coefficient: if a ∈ R and n ∈ N, we define

\binom{a}{n} = \frac{a^{(n)}}{n!} = \frac{a(a - 1) \cdots (a - n + 1)}{n!} \qquad (11.4.24)

The function f given below defines a probability density function for every p ∈ (0, 1) and k ∈ (0, ∞):

f(n) = \binom{n+k-1}{n} p^k (1 - p)^n, \quad n ∈ N \qquad (11.4.25)

Proof
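Since the general binomial coefficient is just a finite product, the density (11.4.25) is straightforward to evaluate for non-integer k. Here is a minimal sketch (function names illustrative):

from math import prod

def gen_binom(a, n):
    # General binomial coefficient (11.4.24): a(a-1)...(a-n+1)/n!
    return prod(a - i for i in range(n)) / prod(range(1, n + 1))

def nb_on_N_pmf(n, k, p):
    # Negative binomial density on N (11.4.25); k may be any positive real
    return gen_binom(n + k - 1, n) * p**k * (1 - p)**n

k, p = 2.5, 0.4   # illustrative non-integer shape parameter
print(sum(nb_on_N_pmf(n, k, p) for n in range(200)))  # should be very nearly 1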

Once again, the distribution defined by the probability density function in the last theorem is the negative binomial distribution on
N , with parameters k and p . The special case when k is a positive integer is sometimes referred to as the Pascal distribution, in

honor of Blaise Pascal.


The distribution is unimodal. Let t = (k − 1) \frac{1 - p}{p}.

1. f(n − 1) < f(n) if and only if n < t.
2. The distribution has a single mode at ⌊t⌋ if t is not an integer.
3. The distribution has two consecutive modes at t − 1 and t if t is a positive integer.

Basic Properties
Suppose that W has the negative binomial distribution on N with parameters k ∈ (0, ∞) and p ∈ (0, 1). To establish basic
properties, we can no longer use the decomposition of W as a sum of independent geometric variables. Instead, the best approach
is to derive the probability generating function and then use the generating function to obtain other basic properties.

W has probability generating function P given by

P(t) = E(t^W) = \left( \frac{p}{1 - (1 - p)t} \right)^k, \quad |t| < \frac{1}{1 - p} \qquad (11.4.27)

Proof

The moments of W can be obtained from the derivatives of the probability generating function.

W has the following moments:

1. E(W) = k \frac{1 - p}{p}
2. var(W) = k \frac{1 - p}{p^2}
3. skew(W) = \frac{2 - p}{\sqrt{k(1 - p)}}
4. kurt(W) = \frac{3(k + 2)(1 - p) + p^2}{k(1 - p)}

Proof

The negative binomial distribution on N is preserved under sums of independent variables.

Suppose that V has the negative binomial distribution on N with parameters a ∈ (0, ∞) and p ∈ (0, 1), that W has the
negative binomial distribution on N with parameters b ∈ (0, ∞) and p ∈ (0, 1), and that V and W are independent. Then
V + W has the negative binomial distribution on N with parameters a + b and p.

Proof

In the last result, note that the success parameter p must be the same for both variables.

Normal Approximation
Because of the decomposition of W when the parameter k is a positive integer, it's not surprising that a central limit theorem holds
for the general negative binomial distribution.

Suppose that W has the negative binomial distribution with parameters k ∈ (0, ∞) and p ∈ (0, 1). The standard score of W is

Z = \frac{p W - k(1 - p)}{\sqrt{k(1 - p)}} \qquad (11.4.29)

The distribution of Z converges to the standard normal distribution as k → ∞ .

Thus, if k is large (and not necessarily an integer), then the distribution of W is approximately normal with mean k(1 − p)/p and
variance k(1 − p)/p^2.

Special Families
The negative binomial distribution on N belongs to several special families of distributions. First, it follows from the result above
on sums that we can decompose a negative binomial variable on N into the sum of an arbitrary number of independent, identically
distributed variables. This special property is known as infinite divisibility, and is studied in more detail in the chapter on Special
Distributions.

The negative binomial distribution on N is infinitely divisible.


Proof

A Poisson-distributed random sum of independent, identically distributed random variables is said to have a compound Poisson
distribution; these distributions are studied in more detail in the chapter on the Poisson Process. A theorem of William Feller states
that an infinitely divisible distribution on N must be compound Poisson. Hence it follows from the previous result that the negative
binomial distribution on N belongs to this family. Here is the explicit result:

Let p ∈ (0, 1) and k ∈ (0, ∞). Suppose that X = (X_1, X_2, …) is a sequence of independent variables, each having the
logarithmic series distribution with shape parameter 1 − p. Suppose also that N is independent of X and has the Poisson
distribution with parameter −k ln(p). Then W = \sum_{i=1}^{N} X_i has the negative binomial distribution on N with
parameters k and p.

Proof
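A simulation sketch of this construction (assuming scipy is available for its logarithmic series sampler; seed and parameter values illustrative) compares the compound Poisson sum with a direct negative binomial sample:

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
k, p, reps = 2.0, 0.4, 20_000   # illustrative parameters

# N ~ Poisson(-k ln p); X_i iid logarithmic series with shape 1 - p
n = rng.poisson(-k * np.log(p), size=reps)
w_compound = np.array([stats.logser.rvs(1 - p, size=m, random_state=rng).sum() if m else 0
                       for m in n])

# scipy's nbinom counts failures before the k-th success, i.e., W itself
w_direct = stats.nbinom.rvs(k, p, size=reps, random_state=rng)

print(w_compound.mean(), w_direct.mean())  # both near k(1 - p)/p = 3.0
print(w_compound.var(), w_direct.var())    # both near k(1 - p)/p^2 = 7.5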

As a special case (k = 1 ), it follows that the geometric distribution on N is infinitely divisible and compound Poisson.
Next, the negative binomial distribution on N belongs to the general exponential family. This family is important in inferential
statistics and is studied in more detail in the chapter on Special Distributions.

Suppose that W has the negative binomial distribution on N with parameters k ∈ (0, ∞) and p ∈ (0, 1). For fixed k , W has a
one-parameter exponential distribution with natural statistic W and natural parameter ln(1 − p) .
Proof

Finally, the negative binomial distribution on N is a power series distribution. Many special discrete distributions belong to this
family, which is studied in more detail in the chapter on Special Distributions.

For fixed k ∈ (0, ∞), the negative binomial distribution on N with parameters k and p ∈ (0, 1) is a power series distribution
corresponding to the function g(θ) = 1/(1 − θ)^k for θ ∈ (0, 1), where θ = 1 − p.

Proof

Computational Exercises

Suppose that W has the negative binomial distribution with parameters k = 15/2 and p = 3/4. Compute each of the following:
1. P(W = 3)
2. E(W )
3. var(W )
Answer

Suppose that W has the negative binomial distribution with parameters k = 1/3 and p = 1/4. Compute each of the following:
1. P(W ≤ 2)
2. E(W )
3. var(W )
Answer

Suppose that W has the negative binomial distribution with parameters k = 10π and p = 1/3. Compute each of the following:
1. E(W )
2. var(W )
3. The normal approximation to P(50 ≤ W ≤ 70)

Answer

This page titled 11.4: The Negative Binomial Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

11.5: The Multinomial Distribution

Basic Theory
Multinomial trials
A multinomial trials process is a sequence of independent, identically distributed random variables X = (X_1, X_2, …) each
taking k possible values. Thus, the multinomial trials process is a simple generalization of the Bernoulli trials process (which
corresponds to k = 2). For simplicity, we will denote the set of outcomes by {1, 2, …, k}, and we will denote the common
probability density function of the trial variables by

p_i = P(X_j = i), \quad i ∈ \{1, 2, \ldots, k\} \qquad (11.5.1)

Of course p_i > 0 for each i and \sum_{i=1}^{k} p_i = 1. In statistical terms, the sequence X is formed by sampling from the
distribution.
As with our discussion of the binomial distribution, we are interested in the random variables that count the number of times each
outcome occurred. Thus, let

Y_i = \#\{j ∈ \{1, 2, \ldots, n\} : X_j = i\} = \sum_{j=1}^{n} \mathbf{1}(X_j = i), \quad i ∈ \{1, 2, \ldots, k\} \qquad (11.5.2)

Of course, these random variables also depend on the parameter n (the number of trials), but this parameter is fixed in our
discussion so we suppress it to keep the notation simple. Note that \sum_{i=1}^{k} Y_i = n, so if we know the values of k − 1 of
the counting variables, we can find the value of the remaining variable.


Basic arguments using independence and combinatorics can be used to derive the joint, marginal, and conditional densities of the
counting variables. In particular, recall the definition of the multinomial coefficient: for nonnegative integers (j_1, j_2, \ldots, j_k)
with \sum_{i=1}^{k} j_i = n,

\binom{n}{j_1, j_2, \ldots, j_k} = \frac{n!}{j_1! \, j_2! \cdots j_k!} \qquad (11.5.3)

Joint Distribution

For nonnegative integers (j_1, j_2, \ldots, j_k) with \sum_{i=1}^{k} j_i = n,

P(Y_1 = j_1, Y_2 = j_2, \ldots, Y_k = j_k) = \binom{n}{j_1, j_2, \ldots, j_k} p_1^{j_1} p_2^{j_2} \cdots p_k^{j_k} \qquad (11.5.4)

Proof
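The joint density (11.5.4) is simple to compute directly. Here is a minimal sketch (function name illustrative), cross-checked against scipy's multinomial distribution, assuming scipy is available:

from math import factorial, prod

def multinomial_pmf(counts, probs):
    # Joint density (11.5.4) of the counting variables
    n = sum(counts)
    coef = factorial(n) // prod(factorial(j) for j in counts)
    return coef * prod(p**j for j, p in zip(counts, probs))

print(multinomial_pmf((2, 3, 5), (0.2, 0.3, 0.5)))

from scipy.stats import multinomial
print(multinomial.pmf((2, 3, 5), n=10, p=(0.2, 0.3, 0.5)))  # same value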

The distribution of Y = (Y_1, Y_2, \ldots, Y_k) is called the multinomial distribution with parameters n and
p = (p_1, p_2, \ldots, p_k). We also say that (Y_1, Y_2, \ldots, Y_{k-1}) has this distribution (recall that the values of k − 1 of the
counting variables determine the value of the remaining variable). Usually, it is clear from context which meaning of the term
multinomial distribution is intended. Again, the ordinary binomial distribution corresponds to k = 2.

Marginal Distributions
For each i ∈ {1, 2, …, k}, Y_i has the binomial distribution with parameters n and p_i:

P(Y_i = j) = \binom{n}{j} p_i^j (1 - p_i)^{n-j}, \quad j ∈ \{0, 1, \ldots, n\} \qquad (11.5.5)

Proof

Grouping
The multinomial distribution is preserved when the counting variables are combined. Specifically, suppose that
(A_1, A_2, \ldots, A_m) is a partition of the index set {1, 2, …, k} into nonempty subsets. For j ∈ {1, 2, …, m} let

Z_j = \sum_{i \in A_j} Y_i, \qquad q_j = \sum_{i \in A_j} p_i \qquad (11.5.6)

Z = (Z_1, Z_2, \ldots, Z_m) has the multinomial distribution with parameters n and q = (q_1, q_2, \ldots, q_m).


Proof

Conditional Distribution
The multinomial distribution is also preserved when some of the counting variables are observed. Specifically, suppose that (A, B)
is a partition of the index set {1, 2, …, k} into nonempty subsets. Suppose that (j_i : i ∈ B) is a sequence of nonnegative integers,
indexed by B, such that j = \sum_{i \in B} j_i ≤ n. Let p = \sum_{i \in A} p_i.

The conditional distribution of (Y_i : i ∈ A) given (Y_i = j_i : i ∈ B) is multinomial with parameters n − j and (p_i / p : i ∈ A).


Proof

Combinations of the basic results involving grouping and conditioning can be used to compute any marginal or conditional
distributions.

Moments
We will compute the mean and variance of each counting variable, and the covariance and correlation of each pair of variables.

For i ∈ {1, 2, …, k}, the mean and variance of Y_i are

1. E(Y_i) = n p_i
2. var(Y_i) = n p_i (1 − p_i)

Proof

For distinct i, j ∈ {1, 2, …, k},

1. cov(Y_i, Y_j) = −n p_i p_j
2. cor(Y_i, Y_j) = −\sqrt{p_i p_j / [(1 − p_i)(1 − p_j)]}

Proof

From the last result, note that the number of times outcome i occurs and the number of times outcome j occurs are negatively
correlated, but the correlation does not depend on n .

If k = 2, then the number of times outcome 1 occurs and the number of times outcome 2 occurs are perfectly (negatively) correlated.
Proof

Examples and Applications


In the dice experiment, select the number of aces. For each die distribution, start with a single die and add dice one at a time,
noting the shape of the probability density function and the size and location of the mean/standard deviation bar. When you get
to 10 dice, run the simulation 1000 times and compare the relative frequency function to the probability density function, and
the empirical moments to the distribution moments.

Suppose that we throw 10 standard, fair dice. Find the probability of each of the following events:
1. Scores 1 and 6 occur once each and the other scores occur twice each.
2. Scores 2 and 4 occur 3 times each.
3. There are 4 even scores and 6 odd scores.
4. Scores 1 and 3 occur twice each given that score 2 occurs once and score 5 three times.
Answer

Suppose that we roll 4 ace-six flat dice (faces 1 and 6 have probability 1/4 each; faces 2, 3, 4, and 5 have probability 1/8 each).
Find the joint probability density function of the number of times each score occurs.
Answer

In the dice experiment, select 4 ace-six flats. Run the experiment 500 times and compute the joint relative frequency function
of the number times each score occurs. Compare the relative frequency function to the true probability density function.

Suppose that we roll 20 ace-six flat dice. Find the covariance and correlation of the number of 1's and the number of 2's.
Answer
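Here is a simulation sketch for this exercise (illustrative seed and replication count), using numpy's multinomial sampler to check the covariance and correlation formulas:

import numpy as np

rng = np.random.default_rng(11)
n, reps = 20, 100_000
probs = [1/4, 1/8, 1/8, 1/8, 1/8, 1/4]   # ace-six flat die

counts = rng.multinomial(n, probs, size=reps)
y1, y2 = counts[:, 0], counts[:, 1]
print(np.cov(y1, y2)[0, 1], -n * probs[0] * probs[1])  # both near -0.625
print(np.corrcoef(y1, y2)[0, 1])                       # near -1/sqrt(21) ≈ -0.218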

In the dice experiment, select 20 ace-six flat dice. Run the experiment 500 times, updating after each run. Compute the
empirical covariance and correlation of the number of 1's and the number of 2's. Compare the results with the theoretical
results computed previously.

This page titled 11.5: The Multinomial Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

11.6: The Simple Random Walk

The simple random walk process is a minor modification of the Bernoulli trials process. Nonetheless, the process has a number of
very interesting properties, and so deserves a section of its own. In some respects, it's a discrete time analogue of the Brownian
motion process.

The Basic Process


Suppose that U = (U_1, U_2, …) is a sequence of independent random variables, each taking values 1 and −1 with probabilities
p ∈ [0, 1] and 1 − p respectively. Let X = (X_0, X_1, X_2, …) be the partial sum process associated with U, so that

X_n = \sum_{i=1}^{n} U_i, \quad n ∈ N \qquad (11.6.1)

The sequence X is the simple random walk with parameter p.

We imagine a person or a particle on an axis, so that at each discrete time step, the walker moves either one unit to the right (with
probability p) or one unit to the left (with probability 1 − p ), independently from step to step. The walker could accomplish this by
tossing a coin with probability of heads p at each step, to determine whether to move right or move left. Other types of random
walks, and additional properties of this random walk, are studied in the chapter on Markov Chains.

The mean and variance of a step U are

1. E(U) = 2p − 1
2. var(U) = 4p(1 − p)

Let I_j = (U_j + 1)/2 for j ∈ N_+. Then I = (I_1, I_2, …) is a Bernoulli trials sequence with success parameter p.
Proof

In terms of the random walker, I_j is the indicator variable for the event that the j th step is to the right.
j

Let R_n = \sum_{j=1}^{n} I_j for n ∈ N, so that R = (R_0, R_1, …) is the partial sum process associated with I. Then

1. X_n = 2 R_n − n for n ∈ N.
2. R_n has the binomial distribution with trial parameter n and success parameter p.

In terms of the walker, R_n is the number of steps to the right in the first n steps.

X_n has probability density function

P(X_n = k) = \binom{n}{(n+k)/2} p^{(n+k)/2} (1 - p)^{(n-k)/2}, \quad k ∈ \{-n, -n + 2, \ldots, n - 2, n\} \qquad (11.6.2)

Proof
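A short simulation (illustrative values and seed) can check the density (11.6.2):

import numpy as np
from math import comb

rng = np.random.default_rng(5)
n, p, reps = 20, 0.6, 100_000

steps = rng.choice([1, -1], size=(reps, n), p=[p, 1 - p])
x_n = steps.sum(axis=1)

k = 4
print((x_n == k).mean())   # simulated P(X_20 = 4)
print(comb(n, (n + k) // 2) * p**((n + k) / 2) * (1 - p)**((n - k) / 2))  # formula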

The mean and variance of X_n are

1. E(X_n) = n(2p − 1)
2. var(X_n) = 4np(1 − p)

The Simple Symmetric Random Walk


Suppose now that p = 1/2. In this case, X = (X_0, X_1, …) is called the simple symmetric random walk. The symmetric random
walk can be analyzed using some special and clever combinatorial arguments. But first we give the basic results above for this
special case.

For each n ∈ N_+, the random vector U_n = (U_1, U_2, \ldots, U_n) is uniformly distributed on S = {−1, 1}^n, and therefore

P(U_n ∈ A) = \frac{\#(A)}{2^n}, \quad A ⊆ S \qquad (11.6.3)

X_n has probability density function

P(X_n = k) = \binom{n}{(n+k)/2} \left(\frac{1}{2}\right)^n, \quad k ∈ \{-n, -n + 2, \ldots, n - 2, n\} \qquad (11.6.4)

The mean and variance of X_n are

1. E(X_n) = 0
2. var(X_n) = n

In the random walk simulation, select the final position. Vary the number of steps and note the shape and location of the
probability density function and the mean±standard deviation bar. For selected values of the parameter, run the simulation
1000 times and compare the empirical density function and moments to the true probability density function and moments.

In the random walk simulation, select the final position and set the number of steps to 50. Run the simulation 1000 times and
compute and compare the following:
1. P(−6 ≤ X_{50} ≤ 10)
2. The relative frequency of the event {−6 ≤ X_{50} ≤ 10}
3. The normal approximation to P(−6 ≤ X_{50} ≤ 10)

Answer

The Maximum Position


Consider again the simple, symmetric random walk. Let Y_n = max{X_0, X_1, …, X_n}, the maximum position during the first
n steps. Note that Y_n takes values in the set {0, 1, …, n}. The distribution of Y_n can be derived from a simple and wonderful
idea known as the reflection principle.

For n ∈ N and y ∈ {0, 1, …, n},

P(Y_n = y) =
\begin{cases}
P(X_n = y) = \binom{n}{(y+n)/2} \left(\frac{1}{2}\right)^n, & y, n \text{ have the same parity (both even or both odd)} \\
P(X_n = y + 1) = \binom{n}{(y+n+1)/2} \left(\frac{1}{2}\right)^n, & y, n \text{ have opposite parity (one even and one odd)}
\end{cases}
\qquad (11.6.5)

Proof
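Here is a simulation sketch (illustrative values and seed) of the reflection-principle formula (11.6.5) for the maximum:

import numpy as np
from math import comb

rng = np.random.default_rng(8)
n, reps = 10, 200_000

steps = rng.choice([1, -1], size=(reps, n))
maxima = np.maximum(np.cumsum(steps, axis=1).max(axis=1), 0)  # include X_0 = 0

def max_pmf(y, n):
    # P(Y_n = y) via the reflection principle (11.6.5)
    m = y if (y + n) % 2 == 0 else y + 1
    return comb(n, (m + n) // 2) / 2**n

print([round((maxima == y).mean(), 4) for y in range(4)])
print([round(max_pmf(y, n), 4) for y in range(4)])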

In the random walk simulation, select the maximum value variable. Vary the number of steps and note the shape and location of
the probability density function and the mean/standard deviation bar. Now set the number of steps to 30 and run the simulation
1000 times. Compare the relative frequency function and empirical moments to the true probability density function and
moments.

For every n, the probability density function of Y_n is decreasing.

The last result is a bit surprising; in particular, the single most likely value for the maximum (and hence the mode of the
distribution) is 0.

Explicitly compute the probability density function, mean, and standard deviation of Y_5.

Answer

A fair coin is tossed 10 times. Find the probability that the difference between the number of heads and the number of tails is
never greater than 4.
Answer

The Last Visit to 0
Consider again the simple, symmetric random walk. Our next topic is the last visit to 0 during the first 2n steps:

Z_{2n} = \max\{j ∈ \{0, 2, \ldots, 2n\} : X_j = 0\}, \quad n ∈ N \qquad (11.6.8)

Note that since visits to 0 can only occur at even times, Z_{2n} takes values in the set {0, 2, …, 2n}. This random variable has a
strange and interesting distribution known as the discrete arcsine distribution. Along the way to our derivation, we will discover
some other interesting results as well.

The probability density function of Z_{2n} is

P(Z_{2n} = 2k) = \binom{2k}{k} \binom{2n - 2k}{n - k} \left(\frac{1}{2}\right)^{2n}, \quad k ∈ \{0, 1, \ldots, n\} \qquad (11.6.9)

Proof
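The discrete arcsine density (11.6.9) is easy to tabulate; the sketch below (n = 5 is illustrative) shows the u-shape:

from math import comb

def arcsine_pmf(k, n):
    # P(Z_{2n} = 2k), the discrete arcsine distribution (11.6.9)
    return comb(2 * k, k) * comb(2 * n - 2 * k, n - k) / 4**n

n = 5
pmf = [arcsine_pmf(k, n) for k in range(n + 1)]
print([round(q, 4) for q in pmf])   # u-shaped: largest at k = 0 and k = n
print(sum(pmf))                     # sanity check: 1.0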

In the random walk simulation, choose the last visit to 0 and then vary the number of steps with the scroll bar. Note the shape
and location of the probability density function and the mean/standard deviation bar. For various values of the parameter, run
the simulation 1000 times and compare the empirical density function and moments to the true probability density function and
moments.

The probability density function of Z_{2n} is symmetric about n and is u-shaped:

1. P(Z_{2n} = 2k) = P(Z_{2n} = 2n − 2k)
2. P(Z_{2n} = 2j) > P(Z_{2n} = 2k) if and only if j < k and 2k ≤ n

In particular, 0 and 2n are the most likely values and hence are the modes of the distribution. The discrete arcsine distribution is
quite surprising. Since we are tossing a fair coin to determine the steps of the walker, you might easily think that the random walk
should be positive half of the time and negative half of the time, and that it should return to 0 frequently. But in fact, the arcsine
law implies that with probability 1/2, there will be no return to 0 during the second half of the walk, from time n + 1 to 2n,
regardless of n, and it is not uncommon for the walk to stay positive (or negative) during the entire time from 1 to 2n.

Explicitly compute the probability density function, mean, and variance of Z_{10}.

Answer

The Ballot Problem and the First Return to Zero


The Ballot Problem
Suppose that in an election, candidate A receives a votes and candidate B receives b votes where a > b . Assuming a random
ordering of the votes, what is the probability that A is always ahead of B in the vote count? This is an historically famous problem
known as the Ballot Problem, that was solved by Joseph Louis Bertrand in 1887. The ballot problem is intimately related to simple
random walks.

Comment on the validity of the assumption that the voters are randomly ordered for a real election.

The ballot problem can be solved by using a simple conditional probability argument to obtain a recurrence relation. Let f (a, b)

denote the probability that A is always ahead of B in the vote count.

f satisfies the initial condition f(1, 0) = 1 and the following recurrence relation:

f(a, b) = \frac{a}{a+b} f(a - 1, b) + \frac{b}{a+b} f(a, b - 1) \qquad (11.6.17)

Proof

The probability that A is always ahead in the vote count is

f(a, b) = \frac{a - b}{a + b} \qquad (11.6.18)

Proof
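The recurrence (11.6.17) and the closed form (11.6.18) can be cross-checked with a short memoized computation (a sketch; small arguments keep the recursion shallow):

from functools import lru_cache

@lru_cache(maxsize=None)
def ballot(a, b):
    # Probability that A is always ahead, via the recurrence (11.6.17)
    if b == 0:
        return 1.0   # all votes are for A, so A is always ahead
    if a <= b:
        return 0.0   # A does not even finish ahead, much less stay ahead
    return (a * ballot(a - 1, b) + b * ballot(a, b - 1)) / (a + b)

print(ballot(30, 20), (30 - 20) / (30 + 20))   # recurrence vs closed form
print((7543 - 4352) / (7543 + 4352))           # the mayoral election exercise below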

In the ballot experiment, vary the parameters a and b and note the change in the ballot probability. For selected values of the
parameters, run the experiment 1000 times and compare the relative frequency to the true probability.

In an election for mayor of a small town, Mr. Smith received 4352 votes while Ms. Jones received 7543 votes. Compute the
probability that Jones was always ahead of Smith in the vote count.
Answer

Relation to Random Walks


Consider again the simple random walk X with parameter p.

Given X_n = k,

1. There are (n + k)/2 steps to the right and (n − k)/2 steps to the left.
2. All possible orderings of the steps to the right and the steps to the left are equally likely.

For k > 0,

P(X_1 > 0, X_2 > 0, \ldots, X_{n-1} > 0 \mid X_n = k) = \frac{k}{n} \qquad (11.6.19)

Proof

In the ballot experiment, vary the parameters a and b and note the change in the ballot probability. For selected values of the
parameters, run the experiment 1000 times and compare the relative frequency to the true probability.

An American roulette wheel has 38 slots; 18 are red, 18 are black, and 2 are green. Fred bet $1 on red, at even stakes, 50 times,
winning 22 times and losing 28 times. Find the probability that Fred's net fortune was always negative.
Answer

Roulette is studied in more detail in the chapter on Games of Chance.

The Distribution of the First Zero


Consider again the simple random walk with parameter p, as in the last subsection. Let T denote the time of the first return to 0:

T = \min\{n ∈ N_+ : X_n = 0\} \qquad (11.6.20)

Note that returns to 0 can only occur at even times; it may also be possible that the random walk never returns to 0. Thus, T takes
values in the set {2, 4, …} ∪ {∞}.

The probability density function of T is given by

P(T = 2n) = \binom{2n}{n} \frac{1}{2n - 1} p^n (1 - p)^n, \quad n ∈ N_+ \qquad (11.6.21)

Proof
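The density (11.6.21) gives the probabilities for the next exercise directly; here is a minimal sketch (p = 1/2 for a fair coin):

from math import comb

def first_return_pmf(n, p):
    # P(T = 2n): first return to 0 at time 2n (11.6.21)
    return comb(2 * n, n) * p**n * (1 - p)**n / (2 * n - 1)

for n in range(1, 6):
    print(2 * n, first_return_pmf(n, 0.5))   # times 2, 4, 6, 8, 10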

Fred and Wilma are tossing a fair coin; Fred gets a point for each head and Wilma gets a point for each tail. Find the probability
that their scores are equal for the first time after n tosses, for each n ∈ {2, 4, 6, 8, 10}.
Answer

This page titled 11.6: The Simple Random Walk is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon
request.

11.7: The Beta-Bernoulli Process

An interesting thing to do in almost any parametric probability model is to “randomize” one or more of the parameters. Done in a
clever way, this often leads to interesting new models and unexpected connections between models. In this section we will
randomize the success parameter in the Bernoulli trials process. This leads to interesting and surprising connections with Pólya's
urn process.

Basic Theory
Definitions
First, recall that the beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞) is a continuous distribution
on the interval (0, 1) with probability density function g given by

g(p) = \frac{1}{B(a, b)} p^{a-1} (1 - p)^{b-1}, \quad p ∈ (0, 1) \qquad (11.7.1)

where B is the beta function. So B(a, b) is simply the normalizing constant for the function p ↦ p^{a-1} (1 - p)^{b-1} on the
interval (0, 1). Here is our main definition:

Suppose that P has the beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞). Next suppose that
X = (X_1, X_2, …) is a sequence of indicator random variables with the property that given P = p ∈ (0, 1), X is a
conditionally independent sequence with

P(X_i = 1 \mid P = p) = p, \quad i ∈ N_+ \qquad (11.7.2)

Then X is the beta-Bernoulli process with parameters a and b.

In short, given P = p, the sequence X is a Bernoulli trials sequence with success parameter p. In the usual language of reliability,
X_i is the outcome of trial i, where 1 denotes success and 0 denotes failure. For a specific application, suppose that we select a
random probability of heads according to the beta distribution with parameters a and b, and then toss a coin with this probability
of heads repeatedly.

Outcome Variables
What's our first step? Well, of course we need to compute the finite dimensional distributions of X. Recall that for r ∈ R and
j ∈ N, r^{[j]} denotes the ascending power r(r + 1) \cdots [r + (j − 1)]. By convention, a product over an empty index set is 1,
so r^{[0]} = 1.

Suppose that n ∈ N_+ and (x_1, x_2, \ldots, x_n) ∈ {0, 1}^n. Let k = x_1 + x_2 + \cdots + x_n. Then

P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \frac{a^{[k]} b^{[n-k]}}{(a + b)^{[n]}} \qquad (11.7.3)

Proof
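Here is a simulation sketch of the finite-dimensional formula (11.7.3), using ascending powers (seed, parameters, and names illustrative):

import numpy as np

def ascending(r, j):
    # Ascending power r^{[j]} = r(r + 1)...(r + j - 1)
    out = 1.0
    for i in range(j):
        out *= r + i
    return out

def beta_bernoulli_prob(x, a, b):
    # P(X_1 = x_1, ..., X_n = x_n) from (11.7.3); depends on x only through its sum
    n, k = len(x), sum(x)
    return ascending(a, k) * ascending(b, n - k) / ascending(a + b, n)

rng = np.random.default_rng(1)
a, b, reps = 2.0, 3.0, 500_000
p = rng.beta(a, b, size=reps)
x = (rng.random((reps, 3)) < p[:, None]).astype(int)

target = np.array([1, 0, 1])
print((x == target).all(axis=1).mean())   # simulated probability
print(beta_bernoulli_prob(target, a, b))  # 6 * 3 / 210 ≈ 0.0857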

From this result, it follows that Pólya's urn process with parameters a, b, c ∈ N_+ is equivalent to the beta-Bernoulli process with
parameters a/c and b/c, quite an interesting result. Note that since the joint distribution above depends only on
x_1 + x_2 + \cdots + x_n, the sequence X is exchangeable. Finally, it's interesting to note that the beta-Bernoulli process with
parameters a and b could simply be defined as the sequence with the finite-dimensional distributions above, without reference to
the beta distribution! It turns out that every exchangeable sequence of indicator random variables can be obtained by randomizing
the success parameter in a sequence of Bernoulli trials. This is de Finetti's theorem, named for Bruno de Finetti, which is studied
in the section on backwards martingales.

For each i ∈ N_+,

1. E(X_i) = \frac{a}{a+b}
2. var(X_i) = \frac{ab}{(a+b)^2}

Proof

Thus X is a sequence of identically distributed variables, quite surprising at first but of course inevitable for any exchangeable
sequence. Compare the joint distribution with the marginal distributions. Clearly the variables are dependent, so let's compute the
covariance and correlation of a pair of outcome variables.

Suppose that i, j ∈ N_+ are distinct. Then

1. cov(X_i, X_j) = \frac{ab}{(a+b)^2 (a+b+1)}
2. cor(X_i, X_j) = \frac{1}{a+b+1}

Proof

Thus, the variables are positively correlated. It turns out that in any infinite sequence of exchangeable variables, the variables
must be nonnegatively correlated. Here is another result that explores how the variables are related.

Suppose that n ∈ N_+ and (x_1, x_2, \ldots, x_n) ∈ {0, 1}^n. Let k = \sum_{i=1}^{n} x_i. Then

P(X_{n+1} = 1 \mid X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \frac{a + k}{a + b + n} \qquad (11.7.7)

Proof

The beta-Bernoulli model starts with the conditional distribution of X given P . Let's find the conditional distribution in the other
direction.

Suppose that n ∈ N_+ and (x_1, x_2, \ldots, x_n) ∈ {0, 1}^n. Let k = \sum_{i=1}^{n} x_i. Then the conditional distribution of
P given (X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) is beta with left parameter a + k and right parameter b + (n − k). Hence

E(P \mid X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \frac{a + k}{a + b + n} \qquad (11.7.8)

Proof

Thus, the left parameter increases by the number of successes while the right parameter increases by the number of failures. In
the language of Bayesian statistics, the original distribution of P is the prior distribution, and the conditional distribution of P
given the data (x_1, x_2, \ldots, x_n) is the posterior distribution. The fact that the posterior distribution is beta whenever the prior
distribution is beta means that the beta distribution is conjugate to the Bernoulli distribution. The conditional expected value in
the last theorem is the Bayesian estimate of p when p is modeled by the random variable P. These concepts are studied in more
generality in the section on Bayes Estimators in the chapter on Point Estimation. It's also interesting to note that the expected
values in the last two theorems are the same: If n ∈ N_+, (x_1, x_2, \ldots, x_n) ∈ {0, 1}^n and k = \sum_{i=1}^{n} x_i, then

E(X_{n+1} \mid X_1 = x_1, \ldots, X_n = x_n) = E(P \mid X_1 = x_1, \ldots, X_n = x_n) = \frac{a + k}{a + b + n} \qquad (11.7.12)
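Here is a minimal sketch of the conjugate update (names, seed, and parameter values illustrative): the left parameter absorbs the successes, the right parameter the failures, and the posterior mean is (a + k)/(a + b + n).

import numpy as np

def posterior_params(a, b, data):
    # Conjugate beta update: successes raise the left parameter, failures the right
    k = int(sum(data))
    return a + k, b + (len(data) - k)

rng = np.random.default_rng(2024)
a, b, true_p = 2.0, 2.0, 0.7
data = (rng.random(50) < true_p).astype(int)

a_post, b_post = posterior_params(a, b, data)
print(a_post / (a_post + b_post))   # Bayes estimate of p, typically near 0.7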

Run the simulation of the beta coin experiment for various values of the parameter. Note how the posterior probability density
function changes from the prior probability density function, given the number of heads.

The Number of Successes


It's already clear that the number of successes in a given number of trials plays an important role, so let's study these variables.
For n ∈ N_+, let

Y_n = \sum_{i=1}^{n} X_i \qquad (11.7.13)

denote the number of successes in the first n trials. Of course, Y = (Y_0, Y_1, …) is the partial sum process associated with
X = (X_1, X_2, …).

Y_n has probability density function given by

P(Y_n = k) = \binom{n}{k} \frac{a^{[k]} b^{[n-k]}}{(a + b)^{[n]}}, \quad k ∈ \{0, 1, \ldots, n\} \qquad (11.7.14)

Proof

The distribution of Y_n is known as the beta-binomial distribution with parameters n, a, and b.
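The beta-binomial density (11.7.14) can be checked against scipy's implementation (assuming scipy ≥ 1.4, which provides scipy.stats.betabinom; values illustrative):

from math import comb
from scipy.stats import betabinom

def ascending(r, j):
    # Ascending power r^{[j]} = r(r + 1)...(r + j - 1)
    out = 1.0
    for i in range(j):
        out *= r + i
    return out

n, a, b, k = 10, 2.0, 3.0, 4
manual = comb(n, k) * ascending(a, k) * ascending(b, n - k) / ascending(a + b, n)
print(manual, betabinom.pmf(k, n, a, b))  # the two values should agree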

In the simulation of the beta-binomial experiment, vary the parameters and note how the shape of the probability density
function of Y_n (discrete) parallels the shape of the probability density function of P (continuous). For various values of the
parameters, run the simulation 1000 times and compare the empirical density function to the probability density function.

The case where the parameters are both 1 is interesting.

If a = b = 1, so that P is uniformly distributed on (0, 1), then Y_n is uniformly distributed on {0, 1, …, n}.

Proof

Next, let's compute the mean and variance of Y_n.

The mean and variance of Y_n are

1. E(Y_n) = n \frac{a}{a+b}
2. var(Y_n) = n \frac{ab}{(a+b)^2} \left[ 1 + (n - 1) \frac{1}{a+b+1} \right]

Proof

In the simulation of the beta-binomial experiment, vary the parameters and note the location and size of the mean-standard
deviation bar. For various values of the parameters, run the simulation 1000 times and compare the empirical moments to the
true moments.

We can restate the conditional distributions in the last subsection more elegantly in terms of Y_n.

Let n ∈ N.

1. The conditional distribution of X_{n+1} given Y_n is

P(X_{n+1} = 1 \mid Y_n) = E(X_{n+1} \mid Y_n) = \frac{a + Y_n}{a + b + n} \qquad (11.7.18)

2. The conditional distribution of P given Y_n is beta with left parameter a + Y_n and right parameter b + (n − Y_n). In
particular,

E(P \mid Y_n) = \frac{a + Y_n}{a + b + n} \qquad (11.7.19)

Proof

Once again, the conditional expected value E(P \mid Y_n) is the Bayesian estimator of p. In particular, if a = b = 1, so that P has
the uniform distribution on (0, 1), then P(X_{n+1} = 1 \mid Y_n = n) = (n + 1)/(n + 2). This is Laplace's rule of succession,
another interesting connection. The rule is named for Pierre Simon Laplace, and is studied from a different point of view in the
section on Independence.

The Proportion of Successes


For n ∈ N_+, let

M_n = \frac{Y_n}{n} = \frac{1}{n} \sum_{i=1}^{n} X_i \qquad (11.7.24)

so that M_n is the sample mean of (X_1, X_2, \ldots, X_n), or equivalently the proportion of successes in the first n trials.
Properties of M_n follow easily from the corresponding properties of Y_n. In particular, P(M_n = k/n) = P(Y_n = k) for
k ∈ {0, 1, …, n} as given above, so let's move on to the mean and variance.

For n ∈ N_+, the mean and variance of M_n are

1. E(M_n) = \frac{a}{a+b}
2. var(M_n) = \frac{1}{n} \frac{ab}{(a+b)^2} + \frac{n-1}{n} \frac{ab}{(a+b)^2 (a+b+1)}

Proof

So E(M_n) is constant in n ∈ N_+ while var(M_n) → ab/[(a + b)^2 (a + b + 1)] as n → ∞. These results suggest that perhaps
M_n has a limit, in some sense, as n → ∞. For an ordinary sequence of Bernoulli trials with success parameter p ∈ (0, 1), we
know from the law of large numbers that M_n → p as n → ∞ with probability 1 and in mean (and hence also in distribution).
What happens here when the success probability P has been randomized with the beta distribution? The answer is what we might
hope.

M_n → P as n → ∞ with probability 1 and in mean square, and hence also in distribution.


Proof
Proof of convergence in distribution

Recall again that the Bayesian estimator of p based on (X_1, X_2, \ldots, X_n) is

E(P \mid Y_n) = \frac{a + Y_n}{a + b + n} = \frac{a/n + M_n}{a/n + b/n + 1} \qquad (11.7.33)

It follows from the last theorem that E(P \mid Y_n) → P with probability 1, in mean square, and in distribution. The stochastic
process Z = \{Z_n = (a + Y_n)/(a + b + n) : n ∈ N\} that we have seen several times now is of fundamental importance, and
turns out to be a martingale. The theory of martingales provides powerful tools for studying convergence in the beta-Bernoulli
process.

The Trial Number of a Success


For k ∈ N_+, let V_k denote the trial number of the k th success. As we have seen before in similar circumstances, the process
V = (V_1, V_2, …) can be defined in terms of the process Y:

V_k = \min\{n ∈ N_+ : Y_n = k\}, \quad k ∈ N_+ \qquad (11.7.34)

Note that V_k takes values in {k, k + 1, …}. The random processes V = (V_1, V_2, …) and Y = (Y_0, Y_1, …) are inverses of
each other in a sense.

For k ∈ N_+ and n ∈ N_+ with k ≤ n,

1. V_k ≤ n if and only if Y_n ≥ k
2. V_k = n if and only if Y_{n-1} = k − 1 and X_n = 1

The probability density function of V_k is given by

P(V_k = n) = \binom{n-1}{k-1} \frac{a^{[k]} b^{[n-k]}}{(a + b)^{[n]}}, \quad n ∈ \{k, k + 1, \ldots\} \qquad (11.7.35)

Proof 1
Proof 2

The distribution of V_k is known as the beta-negative binomial distribution with parameters k, a, and b.

If a = b = 1, so that P is uniformly distributed on (0, 1), then

P(V_k = n) = \frac{k}{n(n + 1)}, \quad n ∈ \{k, k + 1, k + 2, \ldots\} \qquad (11.7.36)

Proof

In the simulation of the beta-negative binomial experiment, vary the parameters and note the shape of the probability density
function. For various values of the parameters, run the simulation 1000 times and compare the empirical density function to the
probability density function.

The mean and variance of V_k are

1. E(V_k) = k \frac{a+b-1}{a-1} if a > 1
2. var(V_k) = k \frac{a+b-1}{(a-1)(a-2)} [b + k(a + b − 2)] − k^2 \left( \frac{a+b-1}{a-1} \right)^2 if a > 2

Proof

In the simulation of the beta-negative binomial experiment, vary the parameters and note the location and size of the
mean±standard deviation bar. For various values of the parameters, run the simulation 1000 times and compare the empirical
moments to the true moments.

This page titled 11.7: The Beta-Bernoulli Process is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

CHAPTER OVERVIEW
12: Finite Sampling Models
This chapter explores a number of models and problems based on sampling from a finite population. Sampling without replacement
from a population of objects of various types leads to the hypergeometric and multivariate hypergeometric models. Sampling with
replacement from a finite population leads naturally to the birthday and coupon-collector problems. Sampling without replacement
from an ordered population leads naturally to the matching problem and to the study of order statistics.
12.1: Introduction to Finite Sampling Models
12.2: The Hypergeometric Distribution
12.3: The Multivariate Hypergeometric Distribution
12.4: Order Statistics
12.5: The Matching Problem
12.6: The Birthday Problem
12.7: The Coupon Collector Problem
12.8: Pólya's Urn Process
12.9: The Secretary Problem

This page titled 12: Finite Sampling Models is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

12.1: Introduction to Finite Sampling Models

Basic Theory
Sampling Models
Suppose that we have a population D of m objects. The population could be a deck of cards, a set of people, an urn full of balls,
or any number of other collections. In many cases, we simply label the objects from 1 to m, so that D = {1, 2, …, m}. In other
cases (such as the card experiment), it may be more natural to label the objects with vectors. In any case, D is usually a finite
subset of R^k for some k ∈ N_+.

Our basic experiment consists of selecting n objects from the population D at random and recording the sequence of objects
chosen. Thus, the outcome is X = (X_1, X_2, \ldots, X_n) where X_i ∈ D is the ith object chosen. If the sampling is with
replacement, the sample size n can be any positive integer. In this case, the sample space S is

S = D^n = \{(x_1, x_2, \ldots, x_n) : x_i ∈ D \text{ for each } i\} \qquad (12.1.1)

If the sampling is without replacement, the sample size n can be no larger than the population size m. In this case, the sample
space S consists of all permutations of size n chosen from D:

S = D_n = \{(x_1, x_2, \ldots, x_n) : x_i ∈ D \text{ for each } i \text{ and } x_i ≠ x_j \text{ for all } i ≠ j\} \qquad (12.1.2)

From the multiplication principle of combinatorics,

1. #(D^n) = m^n
2. #(D_n) = m^{(n)} = m(m − 1) \cdots (m − n + 1)

With either type of sampling, we assume that the samples are equally likely and thus that the outcome variable X is uniformly
distributed on the appropriate sample space S; this is the meaning of the phrase random sample:

P(X ∈ A) = \frac{\#(A)}{\#(S)}, \quad A ⊆ S \qquad (12.1.3)

The Exchangeable Property


Suppose again that we select n objects at random from the population D, either with or without replacement, and record the
ordered sample X = (X_1, X_2, \ldots, X_n).

Any permutation of X has the same distribution as X itself, namely the uniform distribution on the appropriate sample space S:

1. D^n if the sampling is with replacement.
2. D_n if the sampling is without replacement.

A sequence of random variables with this property is said to be exchangeable. Although this property is very simple to understand,
both intuitively and mathematically, it is nonetheless very important. We will use the exchangeable property often in this chapter.

More generally, any sequence of k of the n outcome variables is uniformly distributed on the appropriate sample space:

1. D^k if the sampling is with replacement.
2. D_k if the sampling is without replacement.

In particular, for either sampling method, X_i is uniformly distributed on D for each i ∈ {1, 2, …, n}.

If the sampling is with replacement then X = (X_1, X_2, \ldots, X_n) is a sequence of independent random variables.

Thus, when the sampling is with replacement, the sample variables form a random sample from the uniform distribution, in
statistical terminology.

If the sampling is without replacement, then the conditional distribution of a sequence of k of the outcome variables, given the
values of a sequence of j other outcome variables, is the uniform distribution on the set of permutations of size k chosen from
the population when the j known values are removed (of course, j + k ≤ n).

In particular, X_i and X_j are dependent for any distinct i and j when the sampling is without replacement.

The Unordered Sample


In many cases when the sampling is without replacement, the order in which the objects are chosen is not important; all that
matters is the (unordered) set of objects:

W = \{X_1, X_2, \ldots, X_n\} \qquad (12.1.4)

The random set W takes values in the set of combinations of size n chosen from D:

T = \{\{x_1, x_2, \ldots, x_n\} : x_i ∈ D \text{ for each } i \text{ and } x_i ≠ x_j \text{ for all } i ≠ j\} \qquad (12.1.5)

Recall that \#(T) = \binom{m}{n}.

W is uniformly distributed over T:

P(W ∈ B) = \frac{\#(B)}{\#(T)} = \frac{\#(B)}{\binom{m}{n}}, \quad B ⊆ T \qquad (12.1.6)

Proof

Suppose now that the sampling is with replacement, and we again denote the unordered outcome by W. In this case, W takes
values in the collection of multisets of size n from D. (A multiset is like an ordinary set, except that repeated elements are
allowed.)

T = \{\{x_1, x_2, \ldots, x_n\} : x_i ∈ D \text{ for each } i\} \qquad (12.1.7)

Recall that \#(T) = \binom{m+n-1}{n}.

W is not uniformly distributed on T .

Summary of Sampling Formulas


The following table summarizes the formulas for the number of samples of size n chosen from a population of m elements, based
on the criteria of order and replacement.
Sampling Formulas

Number of samples        With order         Without order
With replacement         m^n                \binom{m+n-1}{n}
Without replacement      m^{(n)}            \binom{m}{n}
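The four formulas are available directly from Python's math module; here is a minimal sketch (the dictionary keys are illustrative labels):

from math import comb, perm

def sample_counts(m, n):
    # Number of samples of size n from a population of m elements
    return {
        ("ordered", "with replacement"): m**n,
        ("ordered", "without replacement"): perm(m, n),       # m^(n), falling power
        ("unordered", "with replacement"): comb(m + n - 1, n),
        ("unordered", "without replacement"): comb(m, n),
    }

print(sample_counts(4, 2))   # 16, 12, 10, 6: matches the exercise below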

Examples and Applications


Suppose that a sample of size 2 is chosen from the population {1, 2, 3, 4}. Explicitly list all samples in the following cases:
1. Ordered samples, with replacement.
2. Ordered samples, without replacement.
3. Unordered samples, with replacement.
4. Unordered samples, without replacement.
Answer

Multi-type Populations
A dichotomous population consists of two types of objects.

Suppose that a batch of 100 components includes 10 that are defective. A random sample of 5 components is selected without
replacement. Compute the probability that the sample contains at least one defective component.
Answer

An urn contains 50 balls, 30 red and 20 green. A sample of 15 balls is chosen at random. Find the probability that the sample
contains 10 red balls in each of the following cases:
1. The sampling is without replacement
2. The sampling is with replacement
Answer

In the ball and urn experiment select 50 balls with 30 red balls, and sample size 15. Run the experiment 100 times. Compute
the relative frequency of the event that the sample has 10 red balls in each of the following cases, and compare with the
respective probability in the previous exercise:
1. The sampling is without replacement
2. The sampling is with replacement

Suppose that a club has 100 members, 40 men and 60 women. A committee of 10 members is selected at random (and without
replacement, of course).
1. Find the probability that both genders are represented on the committee.
2. If you observed the experiment and in fact the committee members are all of the same gender, would you believe that the
sampling was random?
Answer

Suppose that a small pond contains 500 fish, 50 of them tagged. A fisherman catches 10 fish. Find the probability that the catch
contains at least 2 tagged fish.
Answer

The basic distribution that arises from sampling without replacement from a dichotomous population is studied in the section on the
hypergeometric distribution. More generally, a multi-type population consists of objects of k different types.

Suppose that a legislative body consists of 60 republicans, 40 democrats, and 20 independents. A committee of 10 members is
chosen at random. Find the probability that at least one party is not represented on the committee.
Answer

The basic distribution that arises from sampling without replacement from a multi-type population is studied in the section on the
multivariate hypergeometric distribution.

Cards
Recall that a standard card deck can be modeled by the product set
D = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, j, q, k} × {♣, ♢, ♡, ♠} (12.1.8)

where the first coordinate encodes the denomination or kind (ace, 2-10, jack, queen, king) and where the second coordinate encodes
the suit (clubs, diamonds, hearts, spades). The general card experiment consists of drawing n cards at random and without
replacement from the deck D. Thus, the ith card is X_i = (Y_i, Z_i) where Y_i is the denomination and Z_i is the suit. The special case

n = 5 is the poker experiment and the special case n = 13 is the bridge experiment. Note that with respect to the denominations or

with respect to the suits, a deck of cards is a multi-type population as discussed above.

In the card experiment with n = 5 cards (poker), there are


1. 311,875,200 ordered hands
2. 2,598,960 unordered hands

In the card experiment with n = 13 cards (bridge), there are

1. 3,954,242,643,911,239,680,000 ordered hands
2. 635,013,559,600 unordered hands

In the card experiment, set n = 5 . Run the simulation 5 times and on each run, list all of the (ordered) sequences of cards that
would give the same unordered hand as the one you observed.

In the card experiment,


1. Y_i is uniformly distributed on {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, j, q, k} for each i.
2. Z_j is uniformly distributed on {♣, ♢, ♡, ♠} for each j.
j

In the card experiment, Y and Z are independent for any i and j .


i j

In the card experiment, (Y 1, Y2 ) and (Z


1, Z2 ) are dependent.

Suppose that a sequence of 5 cards is dealt. Find each of the following:


1. The probability that the third card is a spade.
2. The probability that the second and fourth cards are queens.
3. The conditional probability that the second card is a heart given that the fifth card is a heart.
4. The probability that the third card is a queen and the fourth card is a heart.
Answer

Run the card experiment 500 times. Compute the relative frequency corresponding to each probability in the previous exercise.

Find the probability that a bridge hand will contain no honor cards, that is, no cards of denomination 10, jack, queen, king, or
ace. Such a hand is called a Yarborough, in honor of the second Earl of Yarborough.
Answer

Dice
Rolling n fair, six-sided dice is equivalent to choosing a random sample of size n with replacement from the population
{1, 2, 3, 4, 5, 6}. Generally, selecting a random sample of size n with replacement from D = {1, 2, …, m} is equivalent to rolling
n fair, m-sided dice.

In the game of poker dice, 5 standard, fair dice are thrown. Find each of the following:
1. The probability that all dice show the same score.
2. The probability that the scores are distinct.
3. The probability that 1 occurs twice and 6 occurs 3 times.
Answer

Run the poker dice experiment 500 times. Compute the relative frequency of each event in the previous exercise and compare
with the corresponding probability.

The game of poker dice is treated in more detail in the chapter on Games of Chance.

Birthdays
Suppose that we select n persons at random and record their birthdays. If we assume that birthdays are uniformly distributed
throughout the year, and if we ignore leap years, then this experiment is equivalent to selecting a sample of size n with replacement
from D = {1, 2, … , 365}. Similarly, we could record birth months or birth weeks.

Suppose that a probability class has 30 students. Find each of the following:
1. The probability that the birthdays are distinct.
2. The probability that there is at least one duplicate birthday.

Answer

In the birthday experiment, set m = 365 and n = 30 . Run the experiment 1000 times and compare the relative frequency of
each event in the previous exercise to the corresponding probability.

The birthday problem is treated in more detail later in this chapter.

Balls into Cells


Suppose that we distribute n distinct balls into m distinct cells at random. This experiment also fits the basic model, where D is
the population of cells and X_i is the cell containing the ith ball. Sampling with replacement means that a cell may contain more
than one ball; sampling without replacement means that a cell may contain at most one ball.

Suppose that 5 balls are distributed into 10 cells (with no restrictions). Find each of the following:
1. The probability that the balls are all in different cells.
2. The probability that the balls are all in the same cell.
Answer

Coupons
Suppose that when we purchase a certain product (bubble gum, or cereal for example), we receive a coupon (a baseball card or
small toy, for example), which is equally likely to be any one of m types. We can think of this experiment as sampling with
replacement from the population of coupon types; X_i is the coupon that we receive on the ith purchase.

Suppose that a kid's meal at a fast food restaurant comes with a toy. The toy is equally likely to be any of 5 types. Suppose that
a mom buys a kid's meal for each of her 3 kids. Find each of the following:
1. The probability that the toys are all the same.
2. The probability that the toys are all different.
Answer

The coupon collector problem is studied in more detail later in this chapter.

The Key Problem


Suppose that a person has n keys, only one of which opens a certain door. The person tries the keys at random. We will let N
denote the trial number when the person finds the correct key.

Suppose that unsuccessful keys are discarded (the rational thing to do, of course). Then N has the uniform distribution on
{1, 2, … , n}.

1. P(N = i) = 1/n for i ∈ {1, 2, …, n}.
2. E(N) = (n + 1)/2.
3. var(N) = (n² − 1)/12.

Suppose that unsuccessful keys are not discarded (perhaps the person has had a bit too much to drink). Then N has a
geometric distribution on N_+.
1. P(N = i) = (1/n)((n − 1)/n)^{i−1} for i ∈ N_+.
2. E(N) = n.
3. var(N) = n(n − 1).

Simulating a Random Sample


It's very easy to simulate a random sample of size n , with replacement from D = {1, 2, … , m} . Recall that the ceiling function
⌈x⌉ gives the smallest integer that is at least as large as x.

Let U = (U_1, U_2, …, U_n) be a sequence of random numbers. Recall that these are independent random variables, each
uniformly distributed on the interval [0, 1] (the standard uniform distribution). Then X_i = ⌈m U_i⌉ for i ∈ {1, 2, …, n}
simulates a random sample, with replacement, from D.
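This construction translates directly into code. Here is a minimal sketch in Python (the function name is ours); random.random() plays the role of the standard uniform variable U_i.

import math
import random

def sample_with_replacement(m, n):
    """Simulate n draws, with replacement, from D = {1, 2, ..., m},
    via X_i = ceil(m * U_i) with U_i standard uniform."""
    # random.random() returns U in [0, 1); max(1, ...) guards the
    # vanishingly rare boundary draw U = 0, where the ceiling would be 0
    return [max(1, math.ceil(m * random.random())) for _ in range(n)]

print(sample_with_replacement(6, 5))   # for example, a roll of 5 fair dice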

It's a bit harder to simulate a random sample of size n , without replacement, since we need to remove each sample value before the
next draw.

The following algorithm generates a random sample of size n, without replacement, from D.
1. For i = 1 to m, let b_i = i.
2. For i = 1 to n,
a. let j = m − i + 1
b. let U_i be a random number
c. let J = ⌈j U_i⌉
d. let X_i = b_J
e. let k = b_j
f. let b_j = b_J
g. let b_J = k
3. Return X = (X_1, X_2, …, X_n)
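As a sketch of how the algorithm might look in code (Python, using 0-based list indexing in place of the 1-based b_i above; the function name is ours):

import math
import random

def sample_without_replacement(m, n):
    """Simulate n draws, without replacement, from D = {1, 2, ..., m},
    following the swapping algorithm above."""
    b = list(range(1, m + 1))   # step 1: b_i = i
    x = []
    for i in range(1, n + 1):
        j = m - i + 1                                # values not yet drawn
        J = max(1, math.ceil(j * random.random()))   # random index in {1, ..., j}
        x.append(b[J - 1])                           # X_i = b_J
        b[J - 1], b[j - 1] = b[j - 1], b[J - 1]      # swap b_J and b_j
    return x

print(sample_without_replacement(52, 5))   # e.g. a poker hand, by card number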

This page titled 12.1: Introduction to Finite Sampling Models is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

12.2: The Hypergeometric Distribution

Basic Theory
Dichotomous Populations
Suppose that we have a dichotomous population D. That is, a population that consists of two types of objects, which we will refer
to as type 1 and type 0. For example, we could have
balls in an urn that are either red or green
a batch of components that are either good or defective
a population of people who are either male or female
a population of animals that are either tagged or untagged
voters who are either democrats or republicans
Let R denote the subset of D consisting of the type 1 objects, and suppose that #(D) = m and #(R) = r . As in the basic
sampling model, we sample n objects at random from D. In this section, our only concern is with the types of the objects, so let X_i
denote the type of the ith object chosen (1 or 0). The random vector of types is
X = (X_1, X_2, \ldots, X_n) \qquad (12.2.1)

Our main interest is the random variable Y that gives the number of type 1 objects in the sample. Note that Y is a counting
variable, and thus like all counting variables, can be written as a sum of indicator variables, in this case the type variables:
Y = \sum_{i=1}^n X_i \qquad (12.2.2)

We will assume initially that the sampling is without replacement, which is usually the realistic setting with dichotomous
populations.

The Probability Density Function


Recall that since the sampling is without replacement, the unordered sample is uniformly distributed over the set of all
combinations of size n chosen from D. This observation leads to a simple combinatorial derivation of the probability density
function of Y .

The probability density function of Y is given by
P(Y = y) = \binom{r}{y} \binom{m-r}{n-y} \Big/ \binom{m}{n}, \quad y \in \{\max\{0, n - (m - r)\}, \ldots, \min\{n, r\}\} \qquad (12.2.3)

Proof

The distribution defined by this probability density function is known as the hypergeometric distribution with parameters m, r,
and n.

Another form of the probability density function of Y is
P(Y = y) = \binom{n}{y} \frac{r^{(y)} (m-r)^{(n-y)}}{m^{(n)}}, \quad y \in \{\max\{0, n - (m - r)\}, \ldots, \min\{n, r\}\} \qquad (12.2.4)

Combinatorial Proof
Algebraic Proof

Recall our convention that j^{(i)} = \binom{j}{i} = 0 for i > j. With this convention, the two formulas for the probability density function are
correct for y ∈ {0, 1, …, n}. We usually use this simpler set as the set of values for the hypergeometric distribution.
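For numerical work, the first form of the density is easy to code directly. Here is a minimal sketch in Python (the function name is ours); conveniently, math.comb returns 0 whenever the bottom index exceeds the top, matching the convention just stated.

from math import comb

def hypergeom_pmf(y, m, r, n):
    """P(Y = y) for the hypergeometric distribution with parameters m, r, n,
    via formula (12.2.3)."""
    return comb(r, y) * comb(m - r, n - y) / comb(m, n)

# Sanity check: the density sums to 1 over the simpler set {0, 1, ..., n}
assert abs(sum(hypergeom_pmf(y, 100, 10, 5) for y in range(6)) - 1) < 1e-12
print(hypergeom_pmf(1, 100, 10, 5))   # P(exactly one type 1 object)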
The hypergeometric distribution is unimodal. Let v = (r + 1)(n + 1)/(m + 2). Then
1. P(Y = y) > P(Y = y − 1) if and only if y < v.
2. The mode occurs at ⌊v⌋ if v is not an integer, and at v and v − 1 if v is an integer greater than 0.

In the ball and urn experiment, select sampling without replacement. Vary the parameters and note the shape of the probability
density function. For selected values of the parameters, run the experiment 1000 times and compare the relative frequency
function to the probability density function.

You may wonder about the rather exotic name hypergeometric distribution, which seems to have nothing to do with sampling from
a dichotomous population. The name comes from a power series, which was studied by Leonhard Euler, Carl Friedrich Gauss,
Bernhard Riemann, and others.

A (generalized) hypergeometric series is a power series
\sum_{k=0}^\infty a_k x^k \qquad (12.2.5)
where k \mapsto a_{k+1}/a_k is a rational function (that is, a ratio of polynomials).

Many of the basic power series studied in calculus are hypergeometric series, including the ordinary geometric series and the
exponential series.

The probability generating function of the hypergeometric distribution is a hypergeometric series.


Proof

In addition, the hypergeometric distribution function can be expressed in terms of a hypergeometric series. These representations
are not particularly helpful, so basically we're stuck with the non-descriptive term for historical reasons.

Moments
Next we will derive the mean and variance of Y . The exchangeable property of the indicator variables, and properties of covariance
and correlation will play a key role.

E(X_i) = r/m for each i.
Proof

From the representation of Y as the sum of indicator variables, the expected value of Y is trivial to compute. But just for fun, we
give the derivation from the probability density function as well.

E(Y) = n r/m.
Proof
Proof from the definition

Next we turn to the variance of the hypergeometric distribution. For that, we will need not only the variances of the indicator
variables, but their covariances as well.

var(X_i) = (r/m)(1 − r/m) for each i.
Proof

For distinct i, j,
1. cov(X_i, X_j) = −(r/m)(1 − r/m) · 1/(m − 1)
2. cor(X_i, X_j) = −1/(m − 1)

Proof

Note that the event of a type 1 object on draw i and the event of a type 1 object on draw j are negatively correlated, but the
correlation depends only on the population size and not on the number of type 1 objects. Note also that the correlation is perfect if
m = 2 , which must be the case.

var(Y) = n (r/m)(1 − r/m)(m − n)/(m − 1).
Proof

Note that var(Y ) = 0 if r = 0 or r = m or n = m , which must be true since Y is deterministic in each of these cases.

In the ball and urn experiment, select sampling without replacement. Vary the parameters and note the size and location of the
mean ± standard deviation bar. For selected values of the parameters, run the experiment 1000 times and compare the
empirical mean and standard deviation to the true mean and standard deviation.

Sampling with Replacement


Suppose now that the sampling is with replacement, even though this is usually not realistic in applications.

(X_1, X_2, …, X_n) is a sequence of n Bernoulli trials with success parameter r/m.

The following results now follow immediately from the general theory of Bernoulli trials, although modifications of the arguments
above could also be used.

Y has the binomial distribution with parameters n and r/m:
P(Y = y) = \binom{n}{y} \left(\frac{r}{m}\right)^y \left(1 - \frac{r}{m}\right)^{n-y}, \quad y \in \{0, 1, \ldots, n\} \qquad (12.2.10)

The mean and variance of Y are
1. E(Y) = n r/m
2. var(Y) = n (r/m)(1 − r/m)

Note that for any values of the parameters, the mean of Y is the same, whether the sampling is with or without replacement. On the
other hand, the variance of Y is smaller, by a factor of (m − n)/(m − 1), when the sampling is without replacement than with replacement. It
certainly makes sense that the variance of Y should be smaller when sampling without replacement, since each selection reduces
the variability in the population that remains. The factor (m − n)/(m − 1) is sometimes called the finite population correction factor.

In the ball and urn experiment, vary the parameters and switch between sampling without replacement and sampling with
replacement. Note the difference between the graphs of the hypergeometric probability density function and the binomial
probability density function. Note also the difference between the mean ± standard deviation bars. For selected values of the
parameters and for the two different sampling modes, run the simulation 1000 times.

Convergence of the Hypergeometric Distribution to the Binomial


Suppose that the population size m is very large compared to the sample size n . In this case, it seems reasonable that sampling
without replacement is not too much different than sampling with replacement, and hence the hypergeometric distribution should be
well approximated by the binomial. The following exercise makes this observation precise. Practically, it is a valuable result, since
the binomial distribution has fewer parameters. More specifically, we do not need to know the population size m and the number of
type 1 objects r individually, but only in the ratio r/m.

Suppose that r_m ∈ {0, 1, …, m} for each m ∈ N_+ and that r_m/m → p ∈ [0, 1] as m → ∞. Then for fixed n, the
hypergeometric probability density function with parameters m, r_m, and n converges to the binomial probability density
function with parameters n and p as m → ∞.


Proof

The type of convergence in the previous exercise is known as convergence in distribution.
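The convergence can be checked numerically. The sketch below (Python, with hypothetical parameter choices of our own) compares the hypergeometric density, with r/m held at 0.1, to the binomial density as m grows:

from math import comb

def hypergeom_pmf(y, m, r, n):
    return comb(r, y) * comb(m - r, n - y) / comb(m, n)

def binom_pmf(y, n, p):
    return comb(n, y) * p**y * (1 - p)**(n - y)

n, p = 5, 0.1
for m in (50, 500, 5000):
    r = int(p * m)   # keep the ratio r/m fixed at p
    max_diff = max(abs(hypergeom_pmf(y, m, r, n) - binom_pmf(y, n, p))
                   for y in range(n + 1))
    print(m, max_diff)   # the maximum difference shrinks as m grows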

In the ball and urn experiment, vary the parameters and switch between sampling without replacement and sampling with
replacement. Note the difference between the graphs of the hypergeometric probability density function and the binomial
probability density function. In particular, note the similarity when m is large and n small. For selected values of the
parameters, and for both sampling modes, run the experiment 1000 times.

In the setting of the convergence result above, note that the mean and variance of the hypergeometric distribution converge to
the mean and variance of the binomial distribution as m → ∞ .

Inferences in the Hypergeometric Model


In many real problems, the parameters r or m (or both) may be unknown. In this case we are interested in drawing inferences about
the unknown parameters based on our observation of Y , the number of type 1 objects in the sample. We will assume initially that
the sampling is without replacement, the realistic setting in most applications.

Estimation of r with m Known


Suppose that the size of the population m is known but that the number of type 1 objects r is unknown. This type of problem could
arise, for example, if we had a batch of m manufactured items containing an unknown number r of defective items. It would be too
costly to test all m items (perhaps even destructive), so we might instead select n items at random and test those.
A simple estimator of r can be derived by hoping that the sample proportion of type 1 objects is close to the population proportion
of type 1 objects. That is,
\frac{Y}{n} \approx \frac{r}{m} \implies r \approx \frac{m}{n} Y \qquad (12.2.11)
Thus, our estimator of r is (m/n)Y. This method of deriving an estimator is known as the method of moments.
E((m/n)Y) = r.

Proof

The result in the previous exercise means that (m/n)Y is an unbiased estimator of r. Hence the variance is a measure of the quality of
the estimator, in the mean square sense.

var((m/n)Y) = (m − r)(r/n)(m − n)/(m − 1).
Proof

For fixed m and r, var((m/n)Y) ↓ 0 as n ↑ m.

Thus, the estimator improves as the sample size increases; this property is known as consistency.

In the ball and urn experiment, select sampling without replacement. For selected values of the parameters, run the experiment
100 times and note the estimate of r on each run.
1. Compute the average error and the average squared error over the 100 runs.
2. Compare the average squared error with the variance (the mean square error) given above.

Often we just want to estimate the ratio r/m (particularly if we don't know m either). In this case, the natural estimator is the
sample proportion Y/n.

The estimator Y/n of r/m has the following properties:
1. E(Y/n) = r/m, so the estimator is unbiased.
2. var(Y/n) = (1/n)(r/m)(1 − r/m)(m − n)/(m − 1)
3. var(Y/n) ↓ 0 as n ↑ m, so the estimator is consistent.

Estimation of m with r Known


Suppose now that the number of type 1 objects r is known, but the population size m is unknown. As an example of this type of
problem, suppose that we have a lake containing m fish where m is unknown. We capture r of the fish, tag them, and return them
to the lake. Next we capture n of the fish and observe Y , the number of tagged fish in the sample. We wish to estimate m from this
data. In this context, the estimation problem is sometimes called the capture-recapture problem.

Do you think that the main assumption of the sampling model, namely equally likely samples, would be satisfied for a real
capture-recapture problem? Explain.

Once again, we can use the method of moments to derive a simple estimate of m, by hoping that the sample proportion of type 1
objects is close to the population proportion of type 1 objects. That is,
\frac{Y}{n} \approx \frac{r}{m} \implies m \approx \frac{nr}{Y} \qquad (12.2.12)
Thus, our estimator of m is nr/Y if Y > 0 and is ∞ if Y = 0.
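As a quick illustration, here is a minimal Python sketch of this estimator (the function name and the numbers are hypothetical, not data from the exercises in this section):

def capture_recapture_estimate(r, n, y):
    """Method-of-moments estimate of the population size m:
    r tagged objects, n sampled, y of the sample tagged."""
    return float('inf') if y == 0 else n * r / y

# Hypothetical: 50 tagged fish, 20 caught later, 4 of the catch tagged
print(capture_recapture_estimate(50, 20, 4))   # 250.0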

In the ball and urn experiment, select sampling without replacement. For selected values of the parameters, run the experiment
100 times.
1. On each run, compare the true value of m with the estimated value.
2. Compute the average error and the average squared error over the 100 runs.

If y > 0 then nr/y maximizes P(Y = y) as a function of m for fixed r and n. This means that nr/Y is a maximum likelihood
estimator of m.

E(nr/Y) ≥ m.
Proof

Thus, the estimator is positively biased and tends to over-estimate m. Indeed, if n ≤ m − r, so that P(Y = 0) > 0, then
E(nr/Y) = ∞. For another approach to estimating the population size m, see the section on Order Statistics.

Sampling with Replacement


Suppose now that the sampling is with replacement, even though this is unrealistic in most applications. In this case, Y has the
binomial distribution with parameters n and r/m. The estimators of r with m known, of r/m, and of m with r known make sense, just as
before, but have slightly different properties.

The estimator (m/n)Y of r with m known satisfies
1. E((m/n)Y) = r
2. var((m/n)Y) = r(m − r)/n

The estimator Y/n of r/m satisfies
1. E(Y/n) = r/m
2. var(Y/n) = (1/n)(r/m)(1 − r/m)

Thus, the estimators are still unbiased and consistent, but have larger mean square error than before. Thus, sampling without
replacement works better, for any values of the parameters, than sampling with replacement.

In the ball and urn experiment, select sampling with replacement. For selected values of the parameters, run the experiment
100 times.
1. On each run, compare the true value of r with the estimated value.
2. Compute the average error and the average squared error over the 100 runs.

Examples and Applications


A batch of 100 computer chips contains 10 defective chips. Five chips are chosen at random, without replacement. Find each
of the following:
1. The probability density function of the number of defective chips in the sample.
2. The mean and variance of the number of defective chips in the sample

3. The probability that the sample contains at least one defective chip.
Answer

A club contains 50 members; 20 are men and 30 are women. A committee of 10 members is chosen at random. Find each of
the following:
1. The probability density function of the number of women on the committee.
2. The mean and variance of the number of women on the committee.
3. The mean and variance of the number of men on the committee.
4. The probability that the committee members are all the same gender.
Answer

A small pond contains 1000 fish; 100 are tagged. Suppose that 20 fish are caught. Find each of the following:
1. The probability density function of the number of tagged fish in the sample.
2. The mean and variance of the number of tagged fish in the sample.
3. The probability that the sample contains at least 2 tagged fish.
4. The binomial approximation to the probability in (c).
Answer

Forty percent of the registered voters in a certain district prefer candidate A . Suppose that 10 voters are chosen at random. Find
each of the following:
1. The probability density function of the number of voters in the sample who prefer A .
2. The mean and variance of the number of voters in the sample who prefer A .
3. The probability that at least 5 voters in the sample prefer A .
Answer

Suppose that 10 memory chips are sampled at random and without replacement from a batch of 100 chips. The chips are tested
and 2 are defective. Estimate the number of defective chips in the entire batch.
Answer

A voting district has 5000 registered voters. Suppose that 100 voters are selected at random and polled, and that 40 prefer
candidate A . Estimate the number of voters in the district who prefer candidate A .
Answer

From a certain lake, 200 fish are caught, tagged and returned to the lake. Then 100 fish are caught and it turns out that 10 are
tagged. Estimate the population of fish in the lake.
Answer

Cards
Recall that the general card experiment is to select n cards at random and without replacement from a standard deck of 52 cards.
The special case n = 5 is the poker experiment and the special case n = 13 is the bridge experiment.

In a poker hand, find the probability density function, mean, and variance of the following random variables:
1. The number of spades
2. The number of aces
Answer

In a bridge hand, find each of the following:


1. The probability density function, mean, and variance of the number of hearts
2. The probability density function, mean, and variance of the number of honor cards (ace, king, queen, jack, or 10).
3. The probability that the hand has no honor cards. A hand of this kind is known as a Yarborough, in honor of the Second Earl of
Yarborough.

Answer

The Randomized Urn


An interesting thing to do in almost any parametric probability model is to randomize one or more of the parameters. Done in the
right way, this often leads to an interesting new parametric model, since the distribution of the randomized parameter will often
itself belong to a parametric family. This is also the natural setting to apply Bayes' theorem.
In this section, we will randomize the number of type 1 objects in the basic hypergeometric model. Specifically, we assume that we
have m objects in the population, as before. However, instead of a fixed number r of type 1 objects, we assume that each of the m
objects in the population, independently of the others, is type 1 with probability p and type 0 with probability 1 − p. We have
eliminated one parameter, r, in favor of a new parameter p with values in the interval [0, 1]. Let U_i denote the type of the ith object
in the population, so that U = (U_1, U_2, …, U_m) is a sequence of Bernoulli trials with success parameter p. Let V = \sum_{i=1}^m U_i
denote the number of type 1 objects in the population, so that V has the binomial distribution with parameters m and p.

As before, we sample n objects from the population. Again we let X_i denote the type of the ith object sampled, and we let
Y = \sum_{i=1}^n X_i denote the number of type 1 objects in the sample. We will consider sampling with and without replacement. In the
first case, the sample size can be any positive integer, but in the second case, the sample size cannot exceed the population size.

The key technique in the analysis of the randomized urn is to condition on V. If we know that V = r, then the model reduces to the
model studied above: a population of size m with r type 1 objects, and a sample of size n.

With either type of sampling, P(X_i = 1) = p.

Proof

Thus, in either model, X is a sequence of identically distributed indicator variables. Ah, but what about dependence?

Suppose that the sampling is without replacement. Let (x_1, x_2, …, x_n) ∈ {0, 1}^n and let y = \sum_{i=1}^n x_i. Then
P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = p^y (1 - p)^{n-y} \qquad (12.2.13)

Proof

From the joint distribution in the previous exercise, we see that X is a sequence of Bernoulli trials with success parameter p, and
hence Y has the binomial distribution with parameters n and p. We could also argue that X is a Bernoulli trials sequence directly,
by noting that {X_1, X_2, …, X_n} is a randomly chosen subset of {U_1, U_2, …, U_m}.

Suppose now that the sampling is with replacement. Again, let (x_1, x_2, …, x_n) ∈ {0, 1}^n and let y = \sum_{i=1}^n x_i. Then
P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = E\left[\frac{V^y (m - V)^{n-y}}{m^n}\right] \qquad (12.2.15)

Proof

A closed form expression for the joint distribution of X, in terms of the parameters m, n , and p is not easy, but it is at least clear
that the joint distribution will not be the same as the one when the sampling is without replacement. In particular, X is a dependent
sequence. Note however that X is an exchangeable sequence, since the joint distribution is invariant under a permutation of the
coordinates (this is a simple consequence of the fact that the joint distribution depends only on the sum y ).

The probability density function of Y is given by
P(Y = y) = \binom{n}{y} E\left[\frac{V^y (m - V)^{n-y}}{m^n}\right], \quad y \in \{0, 1, \ldots, n\} \qquad (12.2.17)

Suppose that i and j are distinct indices. The covariance and correlation of (X_i, X_j) are
1. cov(X_i, X_j) = p(1 − p)/m
2. cor(X_i, X_j) = 1/m

Proof

The mean and variance of Y are
1. E(Y) = np
2. var(Y) = np(1 − p)(m + n − 1)/m

Proof

Let's conclude with an interesting observation: For the randomized urn, X is a sequence of independent variables when the
sampling is without replacement but a sequence of dependent variables when the sampling is with replacement—just the opposite
of the situation for the deterministic urn with a fixed number of type 1 objects.

This page titled 12.2: The Hypergeometric Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

12.3: The Multivariate Hypergeometric Distribution

Basic Theory
The Multitype Model
As in the basic sampling model, we start with a finite population D consisting of m objects. In this section, we suppose in addition that each
object is one of k types; that is, we have a multitype population. For example, we could have an urn with balls of several different colors, or a
population of voters who are either democrat, republican, or independent. Let D_i denote the subset of all type i objects and let m_i = #(D_i) for
i ∈ {1, 2, …, k}. Thus D = \bigcup_{i=1}^k D_i and m = \sum_{i=1}^k m_i. The dichotomous model considered earlier is clearly a special case, with k = 2.

As in the basic sampling model, we sample n objects at random from D. Thus the outcome of the experiment is X = (X_1, X_2, …, X_n) where
X_i ∈ D is the ith object chosen. Now let Y_i denote the number of type i objects in the sample, for i ∈ {1, 2, …, k}. Note that \sum_{i=1}^k Y_i = n, so
if we know the values of k − 1 of the counting variables, we can find the value of the remaining counting variable. As with any counting
variable, we can express Y_i as a sum of indicator variables:

For i ∈ {1, 2, …, k},
Y_i = \sum_{j=1}^n \mathbf{1}(X_j \in D_i) \qquad (12.3.1)

We assume initially that the sampling is without replacement, since this is the realistic case in most applications.

The Joint Distribution


Basic combinatorial arguments can be used to derive the probability density function of the random vector of counting variables. Recall that
since the sampling is without replacement, the unordered sample is uniformly distributed over the combinations of size n chosen from D.

The probability density function of (Y_1, Y_2, …, Y_k) is given by
P(Y_1 = y_1, Y_2 = y_2, \ldots, Y_k = y_k) = \binom{m_1}{y_1} \binom{m_2}{y_2} \cdots \binom{m_k}{y_k} \Big/ \binom{m}{n}, \quad (y_1, y_2, \ldots, y_k) \in \mathbb{N}^k \text{ with } \sum_{i=1}^k y_i = n \qquad (12.3.2)
Proof

The distribution of (Y_1, Y_2, …, Y_k) is called the multivariate hypergeometric distribution with parameters m, (m_1, m_2, …, m_k), and n. We
also say that (Y_1, Y_2, …, Y_{k−1}) has this distribution (recall again that the values of any k − 1 of the variables determine the value of the
remaining variable). Usually it is clear from context which meaning is intended. The ordinary hypergeometric distribution corresponds to k = 2.

An alternate form of the probability density function of (Y_1, Y_2, …, Y_k) is
P(Y_1 = y_1, Y_2 = y_2, \ldots, Y_k = y_k) = \binom{n}{y_1, y_2, \ldots, y_k} \frac{m_1^{(y_1)} m_2^{(y_2)} \cdots m_k^{(y_k)}}{m^{(n)}}, \quad (y_1, y_2, \ldots, y_k) \in \mathbb{N}^k \text{ with } \sum_{i=1}^k y_i = n \qquad (12.3.3)

Combinatorial Proof
Algebraic Proof
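The joint density is straightforward to compute directly. Here is a minimal sketch in Python (the function name is ours), again relying on math.comb returning 0 for impossible counts:

from math import comb

def mv_hypergeom_pmf(y, m_vec):
    """P(Y_1 = y_1, ..., Y_k = y_k) via formula (12.3.2); the type sizes
    are m_vec = (m_1, ..., m_k) and the sample size is sum(y)."""
    m, n = sum(m_vec), sum(y)
    num = 1
    for yi, mi in zip(y, m_vec):
        num *= comb(mi, yi)
    return num / comb(m, n)

# For example: 40 republicans, 35 democrats, 25 independents, sample of 10
print(mv_hypergeom_pmf((4, 4, 2), (40, 35, 25)))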

The Marginal Distributions


For i ∈ {1, 2, …, k}, Y_i has the hypergeometric distribution with parameters m, m_i, and n:
P(Y_i = y) = \binom{m_i}{y} \binom{m - m_i}{n - y} \Big/ \binom{m}{n}, \quad y \in \{0, 1, \ldots, n\} \qquad (12.3.4)

Proof

Grouping
The multivariate hypergeometric distribution is preserved when the counting variables are combined. Specifically, suppose that
(A_1, A_2, …, A_l) is a partition of the index set {1, 2, …, k} into nonempty, disjoint subsets. Let W_j = \sum_{i \in A_j} Y_i and r_j = \sum_{i \in A_j} m_i for
j ∈ {1, 2, …, l}.

(W_1, W_2, …, W_l) has the multivariate hypergeometric distribution with parameters m, (r_1, r_2, …, r_l), and n.


Proof

Note that the marginal distribution of Y_i given above is a special case of grouping. We have two types: type i and not type i. More generally, the
marginal distribution of any subsequence of (Y_1, Y_2, …, Y_k) is hypergeometric, with the appropriate parameters.

Conditioning
The multivariate hypergeometric distribution is also preserved when some of the counting variables are observed. Specifically, suppose that
(A, B) is a partition of the index set {1, 2, …, k} into nonempty, disjoint subsets. Suppose that we observe Y_j = y_j for j ∈ B. Let
z = n − \sum_{j \in B} y_j and r = \sum_{i \in A} m_i.

The conditional distribution of (Y_i : i ∈ A) given (Y_j = y_j : j ∈ B) is multivariate hypergeometric with parameters r, (m_i : i ∈ A), and z.


Proof

Combinations of the grouping result and the conditioning result can be used to compute any marginal or conditional distributions of the counting
variables.

Moments
We will compute the mean, variance, covariance, and correlation of the counting variables. Results from the hypergeometric distribution and the
representation in terms of indicator variables are the main tools.

For i ∈ {1, 2, …, k},
1. E(Y_i) = n m_i/m
2. var(Y_i) = n (m_i/m)((m − m_i)/m)(m − n)/(m − 1)

Proof

Now let I_{ti} = \mathbf{1}(X_t \in D_i), the indicator variable of the event that the tth object selected is type i, for t ∈ {1, 2, …, n} and i ∈ {1, 2, …, k}.

Suppose that r and s are distinct elements of {1, 2, …, n}, and i and j are distinct elements of {1, 2, …, k}. Then
cov(I_{ri}, I_{rj}) = -\frac{m_i}{m} \frac{m_j}{m} \qquad (12.3.5)
cov(I_{ri}, I_{sj}) = \frac{1}{m-1} \frac{m_i}{m} \frac{m_j}{m} \qquad (12.3.6)

Proof

Suppose again that r and s are distinct elements of {1, 2, …, n}, and i and j are distinct elements of {1, 2, …, k}. Then
cor(I_{ri}, I_{rj}) = -\sqrt{\frac{m_i}{m - m_i} \frac{m_j}{m - m_j}} \qquad (12.3.7)
cor(I_{ri}, I_{sj}) = \frac{1}{m-1} \sqrt{\frac{m_i}{m - m_i} \frac{m_j}{m - m_j}} \qquad (12.3.8)

Proof

In particular, I_{ri} and I_{rj} are negatively correlated while I_{ri} and I_{sj} are positively correlated.

For distinct i, j ∈ {1, 2, …, k},
cov(Y_i, Y_j) = -n \frac{m_i}{m} \frac{m_j}{m} \frac{m-n}{m-1} \qquad (12.3.9)
cor(Y_i, Y_j) = -\sqrt{\frac{m_i}{m - m_i} \frac{m_j}{m - m_j}} \qquad (12.3.10)

Sampling with Replacement


Suppose now that the sampling is with replacement, even though this is usually not realistic in applications.

The types of the objects in the sample form a sequence of n multinomial trials with parameters (m_1/m, m_2/m, …, m_k/m).

The following results now follow immediately from the general theory of multinomial trials, although modifications of the arguments above
could also be used.

(Y_1, Y_2, …, Y_k) has the multinomial distribution with parameters n and (m_1/m, m_2/m, …, m_k/m):
P(Y_1 = y_1, Y_2 = y_2, \ldots, Y_k = y_k) = \binom{n}{y_1, y_2, \ldots, y_k} \frac{m_1^{y_1} m_2^{y_2} \cdots m_k^{y_k}}{m^n}, \quad (y_1, y_2, \ldots, y_k) \in \mathbb{N}^k \text{ with } \sum_{i=1}^k y_i = n \qquad (12.3.11)

For distinct i, j ∈ {1, 2, …, k},
1. E(Y_i) = n m_i/m
2. var(Y_i) = n (m_i/m)((m − m_i)/m)
3. cov(Y_i, Y_j) = −n (m_i/m)(m_j/m)
4. cor(Y_i, Y_j) = -\sqrt{\frac{m_i}{m - m_i} \frac{m_j}{m - m_j}}

Comparing with our previous results, note that the means and correlations are the same, whether sampling with or without replacement. The
variances and covariances are smaller when sampling without replacement, by a factor of the finite population correction factor
(m − n)/(m − 1).

Convergence to the Multinomial Distribution


Suppose that the population size m is very large compared to the sample size n. In this case, it seems reasonable that sampling without
replacement is not too much different than sampling with replacement, and hence the multivariate hypergeometric distribution should be well
approximated by the multinomial. The following exercise makes this observation precise. Practically, it is a valuable result, since in many cases
we do not know the population size exactly. For the approximate multinomial distribution, we do not need to know m and m_i individually, but
only in the ratio m_i/m.

Suppose that m_i depends on m and that m_i/m → p_i as m → ∞ for i ∈ {1, 2, …, k}. For fixed n, the multivariate hypergeometric
probability density function with parameters m, (m_1, m_2, …, m_k), and n converges to the multinomial probability density function with
parameters n and (p_1, p_2, …, p_k).

Proof

Examples and Applications


A population of 100 voters consists of 40 republicans, 35 democrats and 25 independents. A random sample of 10 voters is chosen. Find
each of the following:
1. The joint density function of the number of republicans, number of democrats, and number of independents in the sample
2. The mean of each variable in (a).
3. The variance of each variable in (a).
4. The covariance of each pair of variables in (a).
5. The probability that the sample contains at least 4 republicans, at least 3 democrats, and at least 2 independents.
Answer

Cards
Recall that the general card experiment is to select n cards at random and without replacement from a standard deck of 52 cards. The special
case n = 5 is the poker experiment and the special case n = 13 is the bridge experiment.

In a bridge hand, find the probability density function of


1. The number of spades, number of hearts, and number of diamonds.
2. The number of spades and number of hearts.
3. The number of spades.
4. The number of red cards and the number of black cards.
Answer

In a bridge hand, find each of the following:


1. The mean and variance of the number of spades.
2. The covariance and correlation between the number of spades and the number of hearts.
3. The mean and variance of the number of red cards.
Answer

In a bridge hand, find each of the following:


1. The conditional probability density function of the number of spades and the number of hearts, given that the hand has 4 diamonds.
2. The conditional probability density function of the number of spades given that the hand has 3 hearts and 2 diamonds.

Answer

In the card experiment, a hand that does not contain any cards of a particular suit is said to be void in that suit.

Use the inclusion-exclusion rule to show that the probability that a poker hand is void in at least one suit is
\frac{1913496}{2598960} \approx 0.736 \qquad (12.3.12)

In the card experiment, set n = 5 . Run the simulation 1000 times and compute the relative frequency of the event that the hand is void in at
least one suit. Compare the relative frequency with the true probability given in the previous exercise.

Use the inclusion-exclusion rule to show that the probability that a bridge hand is void in at least one suit is
\frac{32427298180}{635013559600} \approx 0.051 \qquad (12.3.13)
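Both probabilities are easy to verify numerically. Here is a minimal Python sketch of the inclusion-exclusion computation over the four suits (the function name is ours):

from math import comb

def prob_void_in_some_suit(n):
    """P(a hand of n cards is void in at least one suit), by
    inclusion-exclusion: choose j suits to avoid, deal from the rest."""
    return sum((-1)**(j + 1) * comb(4, j) * comb(52 - 13 * j, n)
               for j in range(1, 4)) / comb(52, n)

print(prob_void_in_some_suit(5))    # ≈ 0.736, as in (12.3.12)
print(prob_void_in_some_suit(13))   # ≈ 0.051, as in (12.3.13)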

This page titled 12.3: The Multivariate Hypergeometric Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon
request.

12.4: Order Statistics

Basic Theory
Definitions
Suppose that the objects in our population are numbered from 1 to m, so that D = {1, 2, …, m}. For example, the population
might consist of manufactured items, and the labels might correspond to serial numbers. As in the basic sampling model we select
n objects at random, without replacement, from D. Thus the outcome is X = (X_1, X_2, …, X_n) where X_i ∈ D is the ith object
chosen. Recall that X is uniformly distributed over the set of permutations of size n chosen from D. Recall also that
W = {X_1, X_2, …, X_n} is the unordered sample, which is uniformly distributed on the set of combinations of size n chosen from
D.

For i ∈ {1, 2, …, n} let X_{(i)} denote the ith smallest element of {X_1, X_2, …, X_n}. The random variable X_{(i)} is known as the order
statistic of order i for the sample X. In particular, the extreme order statistics are
X_{(1)} = \min\{X_1, X_2, \ldots, X_n\} \qquad (12.4.1)
X_{(n)} = \max\{X_1, X_2, \ldots, X_n\} \qquad (12.4.2)

Random variable X_{(i)} takes values in {i, i + 1, …, m − n + i} for i ∈ {1, 2, …, n}.

We will denote the vector of order statistics by Y = (X_{(1)}, X_{(2)}, …, X_{(n)}). Note that Y takes values in
L = \{(y_1, y_2, \ldots, y_n) \in D^n : y_1 < y_2 < \cdots < y_n\} \qquad (12.4.3)

Run the order statistic experiment. Note that you can vary the population size m and the sample size n . The order statistics are
recorded on each update.

Distributions
L has \binom{m}{n} elements and Y is uniformly distributed on L.
Proof

The probability density function of X_{(i)} is
P[X_{(i)} = x] = \binom{x-1}{i-1} \binom{m-x}{n-i} \Big/ \binom{m}{n}, \quad x \in \{i, i+1, \ldots, m-n+i\} \qquad (12.4.4)

Proof

In the order statistic experiment, vary the parameters and note the shape and location of the probability density function. For
selected values of the parameters, run the experiment 1000 times and compare the relative frequency function to the probability
density function.

Moments
The probability density function of X_{(i)} above can be used to obtain an interesting identity involving the binomial coefficients.
This identity, in turn, can be used to find the mean and variance of X_{(i)}.

For i, n, m ∈ N_+ with i ≤ n ≤ m,
\sum_{k=i}^{m-n+i} \binom{k-1}{i-1} \binom{m-k}{n-i} = \binom{m}{n} \qquad (12.4.5)

Proof

The expected value of X_{(i)} is
E[X_{(i)}] = i \frac{m+1}{n+1} \qquad (12.4.6)

Proof

The variance of X_{(i)} is
var[X_{(i)}] = i(n-i+1) \frac{(m+1)(m-n)}{(n+1)^2 (n+2)} \qquad (12.4.7)

Proof

In the order statistic experiment, vary the parameters and note the size and location of the mean ± standard deviation bar. For
selected values of the parameters, run the experiment 1000 times and compare the sample mean and standard deviation to the
distribution mean and standard deviation.

Estimators of m Based on Order Statistics


Suppose that the population size m is unknown. In this subsection we consider estimators of m constructed from the various order
statistics.

For i ∈ {1, 2, …, n}, the following statistic is an unbiased estimator of m:
U_i = \frac{n+1}{i} X_{(i)} - 1 \qquad (12.4.8)

Proof

Since U_i is unbiased, its variance is the mean square error, a measure of the quality of the estimator.

The variance of U_i is
var(U_i) = \frac{(m+1)(m-n)(n-i+1)}{i(n+2)} \qquad (12.4.9)

Proof

For fixed m and n, var(U_i) decreases as i increases. Thus, the estimators improve as i increases; in particular, U_n is the best
and U_1 the worst.

The relative efficiency of U_j with respect to U_i is
\frac{\mathrm{var}(U_i)}{\mathrm{var}(U_j)} = \frac{j(n-i+1)}{i(n-j+1)} \qquad (12.4.10)

Note that the relative efficiency depends only on the orders i and j and the sample size n, but not on the population size m (the
unknown parameter). In particular, the relative efficiency of U_n with respect to U_1 is n². For fixed i and j, the asymptotic relative
efficiency of U_j to U_i is j/i. Usually, we hope that an estimator improves (in the sense of mean square error) as the sample size n
increases (the more information we have, the better our estimate should be). This general idea is known as consistency.

var(U_n) decreases to 0 as n increases from 1 to m, and so U_n is consistent:
var(U_n) = \frac{(m+1)(m-n)}{n(n+2)} \qquad (12.4.11)

For fixed i, var(U_i) at first increases and then decreases to 0 as n increases from i to m. Thus, U_i is inconsistent.

Figure 12.4.1: var(U_1) as a function of n for m = 100

An Estimator of m Based on the Sample Mean


In this subsection, we will derive another estimator of the parameter m based on the average of the sample variables,
M = \frac{1}{n} \sum_{i=1}^n X_i (the sample mean), and compare this estimator with the estimator based on the maximum of the variables (the
largest order statistic).

E(M) = (m + 1)/2.
Proof

It follows that V = 2M − 1 is an unbiased estimator of m. Moreover, it seems, superficially at least, that V uses more information
from the sample (since it involves all of the sample variables) than U_n. Could it be better? To find out, we need to compute the
variance of the estimator (which, since it is unbiased, is the mean square error). This computation is a bit complicated since the
sample variables are dependent. We will compute the variance of the sum as the sum of all of the pairwise covariances.

For distinct i, j ∈ {1, 2, …, n}, cov(X_i, X_j) = −(m + 1)/12.
Proof
For i ∈ {1, 2, …, n}, var(X_i) = (m² − 1)/12.
Proof
var(M) = (m + 1)(m − n)/(12n).
Proof
var(V) = (m + 1)(m − n)/(3n).
Proof

The variance of V is decreasing with n , so V is also consistent. Let's compute the relative efficiency of the estimator based on the
maximum to the estimator based on the mean.

var(V)/var(U_n) = (n + 2)/3.

Thus, once again, the estimator based on the maximum is better. In addition to the mathematical analysis, all of the estimators
except U_n can sometimes be manifestly worthless, by giving estimates that are smaller than some of the sample values.

Sampling with Replacement


If the sampling is with replacement, then the sample X = (X_1, X_2, …, X_n) is a sequence of independent and identically
distributed random variables. The order statistics from such samples are studied in the chapter on Random Samples.

Examples and Applications
Suppose that in a lottery, tickets numbered from 1 to 25 are placed in a bowl. Five tickets are chosen at random and without
replacement.
1. Find the probability density function of X_{(3)}.
2. Find E[X_{(3)}].
3. Find var[X_{(3)}].

Answer

The German Tank Problem


The estimator U_n was used by the Allies during World War II to estimate the number of German tanks m that had been produced.

German tanks had serial numbers, and captured German tanks and records formed the sample data. The statistical estimates turned
out to be much more accurate than intelligence estimates. Some of the data are given in the table below.
German Tank Data. Source: Wikipedia
Date Statistical Estimate Intelligence Estimate German Records

June 1940 169 1000 122

June 1941 244 1550 271

August 1942 327 1550 342

One of the morals, evidently, is not to put serial numbers on your weapons!
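A short Python sketch of the estimators discussed above, applied to hypothetical serial numbers (not the data from the exercises below; the function name is ours):

def tank_estimates(serials):
    """Estimators of m from observed serial numbers: the order-statistic
    estimators U_i = ((n + 1)/i) X_(i) - 1 and the mean-based V = 2 M - 1."""
    x = sorted(serials)
    n = len(x)
    u = [(n + 1) / i * x[i - 1] - 1 for i in range(1, n + 1)]
    v = 2 * sum(x) / n - 1
    return u, v

u, v = tank_estimates([38, 112, 74, 7, 91])   # hypothetical captures
print(u[-1], v)   # U_n = 133.4 and V = 127.8 for these data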

Suppose that in a certain war, 5 enemy tanks have been captured. The serial numbers are 51, 3, 27, 82, 65. Compute the
estimate of m, the total number of tanks, using all of the estimators discussed above.
Answer

In the order statistic experiment, and set m = 100 and n = 10 . Run the experiment 50 times. For each run, compute the
estimate of m based on each order statistic. For each estimator, compute the square root of the average of the squares of the
errors over the 50 runs. Based on these empirical error estimates, rank the estimators of m in terms of quality.

Suppose that in a certain war, 10 enemy tanks have been captured. The serial numbers are 304, 125, 417, 226, 192, 340, 468,
499, 87, 352. Compute the estimate of m, the total number of tanks, using the estimator based on the maximum and the
estimator based on the mean.
Answer

This page titled 12.4: Order Statistics is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

12.5: The Matching Problem

Definitions and Notation


The Matching Experiment
The matching experiment is a random experiment that can be formulated in a number of colorful ways:
Suppose that n male-female couples are at a party and that the males and females are randomly paired for a dance. A match
occurs if a couple happens to be paired together.
An absent-minded secretary prepares n letters and envelopes to send to n different people, but then randomly stuffs the letters
into the envelopes. A match occurs if a letter is inserted in the proper envelope.
n people with hats have had a bit too much to drink at a party. As they leave the party, each person randomly grabs a hat. A

match occurs if a person gets his or her own hat.


These experiments are clearly equivalent from a mathematical point of view, and correspond to selecting a random permutation
X = (X_1, X_2, …, X_n) of the population D_n = {1, 2, …, n}. Here are the interpretations for the examples above:
Number the couples from 1 to n. Then X_i is the number of the woman paired with the ith man.
Number the letters and corresponding envelopes from 1 to n. Then X_i is the number of the envelope containing the ith letter.
Number the people and their corresponding hats from 1 to n. Then X_i is the number of the hat chosen by the ith person.

Our modeling assumption, of course, is that X is uniformly distributed on the sample space of permutations of D_n. The number of
objects n is the basic parameter of the experiment. We will also consider the case of sampling with replacement from the
population D_n, because the analysis is much easier but still provides insight. In this case, X is a sequence of independent random
variables, each uniformly distributed over D_n.

Matches
We will say that a match occurs at position j if X_j = j. Thus, the number of matches is the random variable N_n defined
mathematically by
N_n = \sum_{j=1}^n I_j \qquad (12.5.1)
where I_j = \mathbf{1}(X_j = j) is the indicator variable for the event of a match at position j. Our problem is to compute the probability
distribution of the number of matches. This is an old and famous problem in probability that was first considered by Pierre-Rémond
Montmort; it is sometimes referred to as Montmort's matching problem in his honor.

Sampling With Replacement


First let's solve the matching problem in the easy case, when the sampling is with replacement. Of course, this is not the way that
the matching game is usually played, but the analysis will give us some insight.

(I_1, I_2, …, I_n) is a sequence of n Bernoulli trials, with success probability 1/n.
Proof

The number of matches N_n has the binomial distribution with trial parameter n and success parameter 1/n:
P(N_n = k) = \binom{n}{k} \left(\frac{1}{n}\right)^k \left(1 - \frac{1}{n}\right)^{n-k}, \quad k \in \{0, 1, \ldots, n\} \qquad (12.5.2)

Proof

The mean and variance of the number of matches are
1. E(N_n) = 1
2. var(N_n) = (n − 1)/n

Proof

The distribution of the number of matches converges to the Poisson distribution with parameter 1 as n → ∞:
P(N_n = k) \to \frac{e^{-1}}{k!} \text{ as } n \to \infty \text{ for } k \in \mathbb{N} \qquad (12.5.3)

Proof

Sampling Without Replacement


Now let's consider the case of real interest, when the sampling is without replacement, so that X is a random permutation of the
elements of D_n = {1, 2, …, n}.

Counting Permutations with Matches


To find the probability density function of N_n, we need to count the number of permutations of D_n with a specified number of
matches. This will turn out to be easy once we have counted the number of permutations with no matches; these are called
derangements of D_n. We will denote the number of permutations of D_n with exactly k matches by b_n(k) = #{N_n = k} for
k ∈ {0, 1, …, n}. In particular, b_n(0) is the number of derangements of D_n.

The number of derangements is
b_n(0) = n! \sum_{j=0}^n \frac{(-1)^j}{j!} \qquad (12.5.5)

Proof

The number of permutations with exactly k matches is
b_n(k) = \frac{n!}{k!} \sum_{j=0}^{n-k} \frac{(-1)^j}{j!}, \quad k \in \{0, 1, \ldots, n\} \qquad (12.5.7)

Proof

The Probability Density Function


The probability density function of the number of matches is
n−k j
1 (−1)
P(Nn = k) = ∑ , k ∈ {0, 1, … , n} (12.5.8)
k! j!
j=0

Proof

In the matching experiment, vary the parameter n and note the shape and location of the probability density function. For
selected values of n , run the simulation 1000 times and compare the empirical density function to the true probability density
function.
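The density formula is simple to compute, and the rapid convergence noted below can be seen numerically. A minimal Python sketch (the function name is ours):

from math import exp, factorial

def match_pmf(k, n):
    """P(N_n = k), via formula (12.5.8)."""
    return sum((-1)**j / factorial(j) for j in range(n - k + 1)) / factorial(k)

# Compare with the limiting Poisson(1) probabilities, already close at n = 10
for k in range(4):
    print(k, match_pmf(k, 10), exp(-1) / factorial(k))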

P(N_n = n − 1) = 0.
Proof

The distribution of the number of matches converges to the Poisson distribution with parameter 1 as n → ∞:
P(N_n = k) \to \frac{e^{-1}}{k!} \text{ as } n \to \infty, \; k \in \mathbb{N} \qquad (12.5.9)

Proof

The convergence is remarkably rapid.

In the matching experiment, increase n and note how the probability density function stabilizes rapidly. For selected values of
n , run the simulation 1000 times and compare the relative frequency function to the probability density function.

Moments
The mean and variance of the number of matches could be computed directly from the distribution. However, it is much better to
use the representation in terms of indicator variables. The exchangeable property is an important tool in this section.

E(I_j) = 1/n for j ∈ {1, 2, …, n}.
Proof

E(N_n) = 1 for each n.


Proof

Thus, the expected number of matches is 1, regardless of n , just as when the sampling is with replacement.

var(I_j) = (n − 1)/n² for j ∈ {1, 2, …, n}.
Proof

A match in one position would seem to make it more likely that there would be a match in another position. Thus, we might guess
that the indicator variables are positively correlated.

For distinct j, k ∈ {1, 2, …, n},
1. cov(I_j, I_k) = \frac{1}{n^2 (n-1)}
2. cor(I_j, I_k) = \frac{1}{(n-1)^2}

Proof

Note that when n = 2 , the event that there is a match in position 1 is perfectly correlated with the event that there is a match in
position 2. This makes sense, since there will either be 0 matches or 2 matches.

var(N_n) = 1 for every n ∈ {2, 3, …}.


Proof

In the matching experiment, vary the parameter n and note the shape and location of the mean ± standard deviation bar. For
selected values of the parameter, run the simulation 1000 times and compare the sample mean and standard deviation to the
distribution mean and standard deviation.

For distinct j, k ∈ {1, 2, …, n}, cov(I_j, I_k) → 0 as n → ∞.

Thus, the event that a match occurs in position j is nearly independent of the event that a match occurs in position k if n is large.
For large n, the indicator variables behave nearly like n Bernoulli trials with success probability 1/n, which, of course, is what
happens when the sampling is with replacement.

A Recursion Relation
In this subsection, we will give an alternate derivation of the distribution of the number of matches, in a sense by embedding the
experiment with parameter n into the experiment with parameter n + 1 .

The probability density function of the number of matches satisfies the following recursion relation and initial condition:
1. P(N_n = k) = (k + 1) P(N_{n+1} = k + 1) for k ∈ {0, 1, …, n}.
2. P(N_1 = 1) = 1.
Proof

This result can be used to obtain the probability density function of N_n recursively for any n.

The Probability Generating Function


Next recall that the probability generating function of N_n is given by
G_n(t) = E(t^{N_n}) = \sum_{j=0}^n P(N_n = j) t^j, \quad t \in \mathbb{R} \qquad (12.5.13)

The family of probability generating functions satisfies the following differential equations and ancillary conditions:
1. G'_{n+1}(t) = G_n(t) for t ∈ \mathbb{R} and n ∈ N_+
2. G_n(1) = 1 for n ∈ N_+

Note also that G_1(t) = t for t ∈ \mathbb{R}. Thus, the system of differential equations can be used to compute G_n for any n ∈ N_+.

In particular, for t ∈ \mathbb{R},
1. G_2(t) = \frac{1}{2} + \frac{1}{2} t^2
2. G_3(t) = \frac{1}{3} + \frac{1}{2} t + \frac{1}{6} t^3
3. G_4(t) = \frac{3}{8} + \frac{1}{3} t + \frac{1}{4} t^2 + \frac{1}{24} t^4

For k, n ∈ N_+ with k < n,
G_n^{(k)}(t) = G_{n-k}(t), \quad t \in \mathbb{R} \qquad (12.5.14)

Proof

For n ∈ N_+,
P(N_n = k) = \frac{1}{k!} P(N_{n-k} = 0), \quad k \in \{0, 1, \ldots, n-1\} \qquad (12.5.15)

Proof

Examples and Applications


A secretary randomly stuffs 5 letters into 5 envelopes. Find each of the following:
1. The number of outcomes with exactly k matches, for each k ∈ {0, 1, 2, 3, 4, 5}.
2. The probability density function of the number of matches.
3. The covariance and correlation of a match in one envelope and a match in another envelope.
Answer

Ten married couples are randomly paired for a dance. Find each of the following:
1. The probability density function of the number of matches.
2. The mean and variance of the number of matches.
3. The probability of at least 3 matches.
Answer

In the matching experiment, set n = 10 . Run the experiment 1000 times and compare the following for the number of matches:
1. The true probabilities
2. The relative frequencies from the simulation
3. The limiting Poisson probabilities
Answer

This page titled 12.5: The Matching Problem is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

12.6: The Birthday Problem

Introduction
The Sampling Model
As in the basic sampling model, suppose that we select n numbers at random, with replacement, from the population
D = {1, 2, …, m}. Thus, our outcome vector is X = (X_1, X_2, …, X_n) where X_i is the ith number chosen. Recall that our basic
modeling assumption is that X is uniformly distributed on the sample space S = D^n = {1, 2, …, m}^n.

In this section, we are interested in the number of population values missing from the sample, and the number of (distinct)
population values in the sample. Computations of probabilities related to these random variables are generally referred to as
birthday problems. Often, we will interpret the sampling experiment as a distribution of n balls into m cells; X_i is the cell number
of ball i. In this interpretation, our interest is in the number of empty cells and the number of occupied cells.

For i ∈ D, let Y_i denote the number of times that i occurs in the sample:
Y_i = \#\{j \in \{1, 2, \ldots, n\} : X_j = i\} = \sum_{j=1}^n \mathbf{1}(X_j = i) \qquad (12.6.1)

Y = (Y_1, Y_2, …, Y_m) has the multinomial distribution with parameters n and (1/m, 1/m, …, 1/m):
P(Y_1 = y_1, Y_2 = y_2, \ldots, Y_m = y_m) = \binom{n}{y_1, y_2, \ldots, y_m} \frac{1}{m^n}, \quad (y_1, y_2, \ldots, y_m) \in \mathbb{N}^m \text{ with } \sum_{i=1}^m y_i = n \qquad (12.6.2)

Proof

We will now define the main random variables of interest.

The number of population values missing in the sample is
U = \#\{j \in \{1, 2, \ldots, m\} : Y_j = 0\} = \sum_{j=1}^m \mathbf{1}(Y_j = 0) \qquad (12.6.3)
and the number of (distinct) population values that occur in the sample is
V = \#\{j \in \{1, 2, \ldots, m\} : Y_j > 0\} = \sum_{j=1}^m \mathbf{1}(Y_j > 0) \qquad (12.6.4)
Also, U takes values in {max{m − n, 0}, …, m − 1} and V takes values in {1, 2, …, min{m, n}}.

Clearly we must have U + V = m so once we have the probability distribution and moments of one variable, we can easily find
them for the other variable. However, we will first solve the simplest version of the birthday problem.
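As a concrete illustration of these variables, the following Python sketch (names ours) simulates one run of the experiment and computes U and V:

import random

def missing_and_occupied(m, n):
    """Simulate n draws with replacement from {1, ..., m} and return
    (U, V): the number of missing and of distinct population values."""
    sample = [random.randint(1, m) for _ in range(n)]
    v = len(set(sample))     # occupied cells
    return m - v, v          # U + V = m always holds

print(missing_and_occupied(10, 5))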

The Simple Birthday Problem


The event that there is at least one duplication when a sample of size \(n\) is chosen from a population of size \(m\) is
\[ B_{m,n} = \{V < n\} = \{U > m - n\} \tag{12.6.5} \]

The (simple) birthday problem is to compute the probability of this event. For example, suppose that we choose n people at
random and note their birthdays. If we ignore leap years and assume that birthdays are uniformly distributed throughout the year,
then our sampling model applies with m = 365. In this setting, the birthday problem is to compute the probability that at least two
people have the same birthday (this special case is the origin of the name).
The solution of the birthday problem is an easy exercise in combinatorial probability.

The probability of the birthday event is
\[ P(B_{m,n}) = 1 - \frac{m^{(n)}}{m^n}, \quad n \le m \tag{12.6.6} \]
and \(P(B_{m,n}) = 1\) for \(n > m\).


Proof

The fact that the probability is 1 for n > m is sometimes referred to as the pigeonhole principle: if more than m pigeons are placed
into m holes then at least one hole has 2 or more pigeons. The following result gives a recurrence relation for the probability of
distinct sample values and thus gives another way to compute the birthday probability.

Let \(p_{m,n}\) denote the probability of the complementary birthday event \(B_{m,n}^c\), that the sample variables are distinct, with population size \(m\) and sample size \(n\). Then \(p_{m,n}\) satisfies the following recursion relation and initial condition:
1. \(p_{m,n+1} = \frac{m-n}{m} p_{m,n}\)
2. \(p_{m,1} = 1\)
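The recursion translates directly into code. The following Python sketch (illustrative only; the function names are ours) iterates the recursion to compute \(p_{m,n}\) and hence \(P(B_{m,n})\); since the product picks up a zero factor once the sample size exceeds the population size, the pigeonhole case is handled automatically.

```python
def prob_all_distinct(m, n):
    # p_{m,n} via the recursion p_{m,n+1} = ((m - n)/m) p_{m,n}, p_{m,1} = 1
    p = 1.0
    for i in range(1, n):
        p *= (m - i) / m   # zero factor once i reaches m, so n > m gives p = 0
    return p

def birthday_prob(m, n):
    # P(B_{m,n}) = 1 - p_{m,n}
    return 1 - prob_all_distinct(m, n)

print(round(birthday_prob(365, 23), 3))  # 0.507
print(round(birthday_prob(365, 60), 3))  # 0.994
```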

Examples
Let \(m = 365\) (the standard birthday problem).
1. \(P(B_{365,10}) = 0.117\)
2. \(P(B_{365,20}) = 0.411\)
3. \(P(B_{365,30}) = 0.706\)
4. \(P(B_{365,40}) = 0.891\)
5. \(P(B_{365,50}) = 0.970\)
6. \(P(B_{365,60}) = 0.994\)

Figure 12.6.1: \(P(B_{365,n})\) as a function of \(n\), smoothed for the sake of appearance

In the birthday experiment, set \(m = 365\) and select the indicator variable \(I\). For \(n \in \{10, 20, 30, 40, 50, 60\}\) run the experiment 1000 times each and compare the relative frequencies with the true probabilities.

In spite of its easy solution, the birthday problem is famous because, numerically, the probabilities can be a bit surprising. Note that with just 60 people, the event is almost certain! With just 23 people, the probability of the birthday event is about \(\frac{1}{2}\); specifically \(P(B_{365,23}) = 0.507\). Mathematically, the rapid increase in the birthday probability, as \(n\) increases, is due to the fact that \(m^n\) grows much faster than \(m^{(n)}\).

Four fair, standard dice are rolled. Find the probability that the scores are distinct.
Answer

In the birthday experiment, set m = 6 and select the indicator variable I . Vary n with the scrollbar and note graphically how
the probabilities change. Now with n = 4 , run the experiment 1000 times and compare the relative frequency of the event to
the corresponding probability.

Five persons are chosen at random.

1. Find the probability that at least 2 have the same birth month.
2. Criticize the sampling model in this setting.
Answer

In the birthday experiment, set m = 12 and select the indicator variable I . Vary n with the scrollbar and note graphically how
the probabilities change. Now with n = 5 , run the experiment 1000 times and compare the relative frequency of the event to
the corresponding probability.

A fast-food restaurant gives away one of 10 different toys with the purchase of a kid's meal. A family with 5 children buys 5
kid's meals. Find the probability that the 5 toys are different.
Answer

In the birthday experiment, set m = 10 and select the indicator variable I . Vary n with the scrollbar and note graphically how
the probabilities change. Now with \(n = 5\), run the experiment 1000 times and compare the relative frequency of the event to
the corresponding probability.

Let \(m = 52\). Find the smallest value of \(n\) such that the probability of a duplication is at least \(\frac{1}{2}\).
Answer

The General Birthday Problem


We now return to the more general problem of finding the distribution of the number of distinct sample values and the distribution
of the number of excluded sample values.

The Probability Density Function


The number of samples with exactly \(j\) values excluded is
\[ \#\{U = j\} = \binom{m}{j} \sum_{k=0}^{m-j} (-1)^k \binom{m-j}{k} (m - j - k)^n, \quad j \in \{\max\{m - n, 0\}, \ldots, m - 1\} \tag{12.6.7} \]

Proof

The distributions of the number of excluded values and the number of distinct values are now easy.

The probability density function of \(U\) is given by
\[ P(U = j) = \binom{m}{j} \sum_{k=0}^{m-j} (-1)^k \binom{m-j}{k} \left(1 - \frac{j+k}{m}\right)^n, \quad j \in \{\max\{m - n, 0\}, \ldots, m - 1\} \tag{12.6.11} \]

Proof

The probability density function of the number of distinct values \(V\) is given by
\[ P(V = j) = \binom{m}{j} \sum_{k=0}^{j} (-1)^k \binom{j}{k} \left(\frac{j-k}{m}\right)^n, \quad j \in \{1, 2, \ldots, \min\{m, n\}\} \tag{12.6.12} \]

Proof
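For moderate parameter values the inclusion-exclusion sum is easy to evaluate numerically, as in the Python sketch below (an illustration, with function names of our choosing). For very large \(m\) and \(n\) the alternating sum suffers from floating-point cancellation, so exact rational arithmetic would be preferable there.

```python
from math import comb

def pdf_distinct(m, n, j):
    # P(V = j) = C(m, j) * sum_{k=0}^{j} (-1)^k C(j, k) ((j - k)/m)^n
    return comb(m, j) * sum(
        (-1) ** k * comb(j, k) * ((j - k) / m) ** n for k in range(j + 1)
    )

# ten rolls of a fair die: m = 6, n = 10
print(round(pdf_distinct(6, 10, 6), 4))  # about 0.2718: all six scores occur
print(round(sum(pdf_distinct(6, 10, j) for j in range(1, 7)), 10))  # 1.0
```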

In the birthday experiment, select the number of distinct sample values. Vary the parameters and note the shape and location of the probability density function. For selected values of the parameters, run the simulation 1000 times and compare the relative frequency function to the probability density function.

The distribution of the number of excluded values can also be obtained by a recursion argument.

Let \(f_{m,n}\) denote the probability density function of the number of excluded values \(U\), when the population size is \(m\) and the sample size is \(n\). Then
1. \(f_{m,1}(m-1) = 1\)
2. \(f_{m,n+1}(j) = \frac{m-j}{m} f_{m,n}(j) + \frac{j+1}{m} f_{m,n}(j+1)\)

Moments
Now we will find the means and variances. The number of excluded values and the number of distinct values are counting variables
and hence can be written as sums of indicator variables. As we have seen in many other models, this representation is frequently
the best for computing moments.
For \(j \in \{1, 2, \ldots, m\}\), let \(I_j = \mathbf{1}(Y_j = 0)\), the indicator variable of the event that \(j\) is not in the sample. Note that the number of population values missing in the sample can be written as the sum of the indicator variables:
\[ U = \sum_{j=1}^m I_j \tag{12.6.13} \]

For distinct \(i, j \in \{1, 2, \ldots, m\}\),
1. \(E(I_j) = \left(1 - \frac{1}{m}\right)^n\)
2. \(\operatorname{var}(I_j) = \left(1 - \frac{1}{m}\right)^n - \left(1 - \frac{1}{m}\right)^{2n}\)
3. \(\operatorname{cov}(I_i, I_j) = \left(1 - \frac{2}{m}\right)^n - \left(1 - \frac{1}{m}\right)^{2n}\)

Proof

The expected number of excluded values and the expected number of distinct values are
1. \(E(U) = m \left(1 - \frac{1}{m}\right)^n\)
2. \(E(V) = m \left[1 - \left(1 - \frac{1}{m}\right)^n\right]\)

Proof

The variance of the number of excluded values and the variance of the number of distinct values are
\[ \operatorname{var}(U) = \operatorname{var}(V) = m(m-1) \left(1 - \frac{2}{m}\right)^n + m \left(1 - \frac{1}{m}\right)^n - m^2 \left(1 - \frac{1}{m}\right)^{2n} \tag{12.6.14} \]

Proof
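The moment formulas are simple to evaluate directly. Here is a small illustrative Python sketch (the function name is ours):

```python
def birthday_moments(m, n):
    # moments of U and V from the indicator-variable representation
    q1 = (1 - 1 / m) ** n   # P(a given population value is missing)
    q2 = (1 - 2 / m) ** n   # P(two given values are both missing)
    EU = m * q1
    EV = m - EU
    var = m * (m - 1) * q2 + m * q1 - m ** 2 * q1 ** 2  # var(U) = var(V)
    return EU, EV, var

EU, EV, var = birthday_moments(365, 30)
print(round(EV, 2), round(var, 3))  # about 28.84 distinct birthdays on average
```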

In the birthday experiment, select the number of distinct sample values. Vary the parameters and note the size and location of
the mean ± standard-deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the
sample mean and variance to the distribution mean and variance.

Examples and Applications


Suppose that 30 persons are chosen at random. Find each of the following:
1. The probability density function of the number of distinct birthdays.
2. The mean of the number of distinct birthdays.
3. The variance of the number of distinct birthdays.
4. The probability that there are at least 28 different birthdays represented.
Answer

In the birthday experiment, set m = 365 and n = 30 . Run the experiment 1000 times with an update frequency of 10 and
compute the relative frequency of the event in part (d) of the last exercise.

Suppose that 10 fair dice are rolled. Find each of the following:
1. The probability density function of the number of distinct scores.
2. The mean of the number of distinct scores.
3. The variance of the number of distinct scores.

4. The probability that there will be 4 or fewer distinct scores.
Answer

In the birthday experiment, set m = 6 and n = 10 . Run the experiment 1000 times and compute the relative frequency of the
event in part (d) of the last exercise.

A fast food restaurant gives away one of 10 different toys with the purchase of each kid's meal. A family buys 15 kid's meals.
Find each of the following:
1. The probability density function of the number of toys that are missing.
2. The mean of the number of toys that are missing.
3. The variance of the number of toys that are missing.
4. The probability that at least 3 toys are missing.
Answer

In the birthday experiment, set m = 10 and n = 15 . Run the experiment 1000 times and compute the relative frequency of the
event in part (d).

The lying students problem. Suppose that 3 students, who ride together, miss a mathematics exam. They decide to lie to the
instructor by saying that the car had a flat tire. The instructor separates the students and asks each of them which tire was flat.
The students, who did not anticipate this, select their answers independently and at random. Find each of the following:
1. The probability density function of the number of distinct answers.
2. The probability that the students get away with their deception.
3. The mean of the number of distinct answers.
4. The standard deviation of the number of distinct answers.
Answer

The duck hunter problem. Suppose that there are 5 duck hunters, each a perfect shot. A flock of 10 ducks fly over, and each
hunter selects one duck at random and shoots. Find each of the following:
1. The probability density function of the number of ducks that are killed.
2. The mean of the number of ducks that are killed.
3. The standard deviation of the number of ducks that are killed.
Answer

This page titled 12.6: The Birthday Problem is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

12.7: The Coupon Collector Problem

Basic Theory
Definitions
In this section, our random experiment is to sample repeatedly, with replacement, from the population \(D = \{1, 2, \ldots, m\}\). This generates a sequence of independent random variables \(X = (X_1, X_2, \ldots)\), each uniformly distributed on \(D\).

We will often interpret the sampling in terms of a coupon collector: each time the collector buys a certain product (bubble gum or Cracker Jack, for example) she receives a coupon (a baseball card or a toy, for example) which is equally likely to be any one of \(m\) types. Thus, in this setting, \(X_i \in D\) is the coupon type received on the \(i\)th purchase.

Let \(V_n\) denote the number of distinct values in the first \(n\) selections, for \(n \in \mathbb{N}_+\). This is the random variable studied in the last section on the Birthday Problem. Our interest in this section is the sample size needed to get a specified number of distinct sample values.

For \(k \in \{1, 2, \ldots, m\}\), let
\[ W_k = \min\{n \in \mathbb{N}_+ : V_n = k\} \tag{12.7.1} \]
the sample size needed to get \(k\) distinct sample values.

In terms of the coupon collector, this random variable gives the number of products required to get \(k\) distinct coupon types. Note that the set of possible values of \(W_k\) is \(\{k, k+1, \ldots\}\). We will be particularly interested in \(W_m\), the sample size needed to get the entire population. In terms of the coupon collector, this is the number of products required to get the entire set of coupons.

In the coupon collector experiment, run the experiment in single-step mode a few times for selected values of the parameters.

The Probability Density Function


Now let's find the distribution of \(W_k\). The results of the previous section will be very helpful.

For \(k \in \{1, 2, \ldots, m\}\), the probability density function of \(W_k\) is given by
\[ P(W_k = n) = \binom{m-1}{k-1} \sum_{j=0}^{k-1} (-1)^j \binom{k-1}{j} \left(\frac{k-j-1}{m}\right)^{n-1}, \quad n \in \{k, k+1, \ldots\} \tag{12.7.2} \]

Proof

In the coupon collector experiment, vary the parameters and note the shape of and position of the probability density function.
For selected values of the parameters, run the experiment 1000 times and compare the relative frequency function to the
probability density function.

An alternate approach to the probability density function of \(W_k\) is via a recursion formula.

For fixed \(m\), let \(g_k\) denote the probability density function of \(W_k\). Then
1. \(g_k(n+1) = \frac{k-1}{m} g_k(n) + \frac{m-k+1}{m} g_{k-1}(n)\)
2. \(g_1(1) = 1\)

Decomposition as a Sum
We will now show that W can be decomposed as a sum of k independent, geometrically distributed random variables. This will
k

provide some additional insight into the nature of the distribution and will make the computation of the mean and variance easy.

For i ∈ {1, 2, … , m}, let Z denote the number of additional samples needed to go from i − 1 distinct values to i distinct
i

values. Then Z = (Z , Z , … , Z ) is a sequence of independent random variables, and Z has the geometric distribution on
1 2 m i

m−i+1
N + with parameter p = . Moreover,
i m

12.7.1 https://stats.libretexts.org/@go/page/10250
k

Wk = ∑ Zi , k ∈ {1, 2, … , m} (12.7.4)

i=1

This result shows clearly that each time a new coupon is obtained, it becomes harder to get the next new coupon.

In the coupon collector experiment, run the experiment in single-step mode a few times for selected values of the parameters.
In particular, try this with m large and k near m.

Moments
The decomposition as a sum of independent variables provides an easy way to compute the mean and other moments of W . k

The mean and variance of the sample size needed to get \(k\) distinct values are
1. \(E(W_k) = \sum_{i=1}^k \frac{m}{m-i+1}\)
2. \(\operatorname{var}(W_k) = \sum_{i=1}^k \frac{(i-1)m}{(m-i+1)^2}\)

Proof
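The sums are easy to compute numerically. The following illustrative Python sketch evaluates the mean and variance, with the classic case \(k = m = 6\) (throwing a die until all six scores appear) as an example:

```python
def coupon_moments(m, k):
    # W_k = Z_1 + ... + Z_k with Z_i geometric, p_i = (m - i + 1)/m
    mean = sum(m / (m - i + 1) for i in range(1, k + 1))
    var = sum((i - 1) * m / (m - i + 1) ** 2 for i in range(1, k + 1))
    return mean, var

mean, var = coupon_moments(6, 6)
print(round(mean, 2), round(var, 2))  # 14.7 throws on average, variance about 38.99
```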

In the coupon collector experiment, vary the parameters and note the shape and location of the mean ± standard deviation bar.
For selected values of the parameters, run the experiment 1000 times and compare the sample mean and standard deviation to
the distribution mean and standard deviation.

The probability generating function of \(W_k\) is given by
\[ E\left(t^{W_k}\right) = \prod_{i=1}^k \frac{(m-i+1)\,t}{m - (i-1)t}, \quad |t| < \frac{m}{k-1} \tag{12.7.5} \]

Proof

Examples and Applications


Suppose that people are sampled at random until 40 distinct birthdays are obtained. Find each of the following:
1. The probability density function of the sample size.
2. The mean of the sample size.
3. The variance of the sample size.
4. The probability generating function of the sample size.
Answer

Suppose that a standard, fair die is thrown until all 6 scores have occurred. Find each of the following:
1. The probability density function of the number of throws.
2. The mean of the number of throws.
3. The variance of the number of throws.
4. The probability that at least 10 throws are required.
Answer

A box of a certain brand of cereal comes with a special toy. There are 10 different toys in all. A collector buys boxes of cereal
until she has all 10 toys. Find each of the following:
1. The probability density function of the number of boxes purchased.
2. The mean of the number of boxes purchased.
3. The variance of the number of boxes purchased.
4. The probability that no more than 15 boxes were purchased.
Answer

This page titled 12.7: The Coupon Collector Problem is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

12.8: Pólya's Urn Process

Basic Theory
The Model
Pólya's urn scheme is a dichotomous sampling model that generalizes the hypergeometric model (sampling without replacement) and the Bernoulli model (sampling with replacement). Pólya's urn process leads to a famous example of a sequence of random variables that is exchangeable, but not independent, and has deep connections with the beta-Bernoulli process.

Suppose that we have an urn (what else!) that initially contains \(a\) red and \(b\) green balls, where \(a\) and \(b\) are positive integers. At each discrete time (trial), we select a ball from the urn and then return the ball to the urn along with \(c\) new balls of the same color. Ordinarily, the parameter \(c\) is a nonnegative integer. However, the model actually makes sense if \(c\) is a negative integer, if we interpret this to mean that we remove the balls rather than add them, and assuming that there are enough balls of the proper color in the urn to perform this action. In any case, the random process is known as Pólya's urn process, named for George Pólya.

In terms of the colors of the selected balls, Pólya's urn scheme generalizes the standard models of sampling with and without
replacement.
1. c = 0 corresponds to sampling with replacement.
2. c = −1 corresponds to sampling without replacement.

For the most part, we will assume that c is nonnegative so that the process can be continued indefinitely. Occasionally we consider
the case c = −1 so that we can interpret the results in terms of sampling without replacement.

The Outcome Variables


Let \(X_i\) denote the color of the ball selected at time \(i\), where 0 denotes green and 1 denotes red. Mathematically, our basic random process is the sequence of indicator variables \(X = (X_1, X_2, \ldots)\), known as the Pólya process. As with any random process, our first goal is to compute the finite dimensional distributions of \(X\). That is, we want to compute the joint distribution of \((X_1, X_2, \ldots, X_n)\) for each \(n \in \mathbb{N}_+\). Some additional notation will really help. Recall the generalized permutation formula in our study of combinatorial structures: for \(r, s \in \mathbb{R}\) and \(j \in \mathbb{N}\), we defined
\[ r^{(s,j)} = r(r+s)(r+2s) \cdots [r + (j-1)s] \tag{12.8.1} \]
Note that the expression has \(j\) factors, starting with \(r\), and with each factor obtained by adding \(s\) to the previous factor. As usual, we adopt the convention that a product over an empty index set is 1. Hence \(r^{(s,0)} = 1\) for every \(r\) and \(s\).

Recall that
1. \(r^{(0,j)} = r^j\), an ordinary power
2. \(r^{(-1,j)} = r^{(j)} = r(r-1) \cdots (r-j+1)\), a descending power
3. \(r^{(1,j)} = r^{[j]} = r(r+1) \cdots (r+j-1)\), an ascending power
4. \(r^{(r,j)} = j! \, r^j\)
5. \(1^{(1,j)} = j!\)

The following simple result will turn out to be quite useful.

Suppose that \(r, s \in (0, \infty)\) and \(j \in \mathbb{N}\). Then
\[ \frac{r^{(s,j)}}{s^j} = \left(\frac{r}{s}\right)^{[j]} \tag{12.8.2} \]

Proof

The finite dimensional distributions are easy to compute using the multiplication rule of conditional probability. If we know the
contents of the urn at any given time, then the probability of an outcome at the next time is all but trivial.

Let \(n \in \mathbb{N}_+\), \((x_1, x_2, \ldots, x_n) \in \{0, 1\}^n\) and let \(k = x_1 + x_2 + \cdots + x_n\). Then
\[ P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \frac{a^{(c,k)}\, b^{(c,n-k)}}{(a+b)^{(c,n)}} \tag{12.8.3} \]

Proof

The joint probability in the previous exercise depends on \((x_1, x_2, \ldots, x_n)\) only through the number of red balls \(k = \sum_{i=1}^n x_i\) in the sample. Thus, the joint distribution is invariant under a permutation of \((x_1, x_2, \ldots, x_n)\), and hence \(X\) is an exchangeable sequence of random variables. This means that for each \(n\), all permutations of \((X_1, X_2, \ldots, X_n)\) have the same distribution. Of course the joint distribution reduces to the formulas we have obtained earlier in the special cases of sampling with replacement (\(c = 0\)) or sampling without replacement (\(c = -1\)), although in the latter case we must have \(n \le a + b\). When \(c > 0\), the Pólya process is a special case of the beta-Bernoulli process, studied in the chapter on Bernoulli trials.

The Pólya process \(X = (X_1, X_2, \ldots)\) with parameters \(a, b, c \in \mathbb{N}_+\) is the beta-Bernoulli process with parameters \(a/c\) and \(b/c\). That is, for \(n \in \mathbb{N}_+\), \((x_1, x_2, \ldots, x_n) \in \{0, 1\}^n\), and with \(k = x_1 + x_2 + \cdots + x_n\),
\[ P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \frac{(a/c)^{[k]} (b/c)^{[n-k]}}{(a/c + b/c)^{[n]}} \tag{12.8.5} \]

Proof

Recall that the beta-Bernoulli process is obtained, in the usual formulation, by randomizing the success parameter in a Bernoulli trials sequence, giving the success parameter a beta distribution. So specifically, suppose \(a, b, c \in \mathbb{N}_+\) and that random variable \(P\) has the beta distribution with parameters \(a/c\) and \(b/c\). Suppose also that given \(P = p \in (0, 1)\), the random process \(X = (X_1, X_2, \ldots)\) is a sequence of Bernoulli trials with success parameter \(p\). Then \(X\) is the Pólya process with parameters \(a\), \(b\), \(c\). This is a fascinating connection between two processes that at first seem to have little in common. In fact however, every exchangeable sequence of indicator random variables can be obtained by randomizing the success parameter in a sequence of Bernoulli trials. This is de Finetti's theorem, named for Bruno de Finetti, which is studied in the section on backwards martingales. When \(c \in \mathbb{N}_+\), all of the results in this section are special cases of the corresponding results for the beta-Bernoulli process, but it's still interesting to interpret the results in terms of the urn model.

For each \(i \in \mathbb{N}_+\),
1. \(E(X_i) = \frac{a}{a+b}\)
2. \(\operatorname{var}(X_i) = \frac{a}{a+b} \cdot \frac{b}{a+b}\)

Proof

Thus X is a sequence of identically distributed variables, quite surprising at first but of course inevitable for any exchangeable
sequence. Compare the joint and marginal distributions. Note that X is an independent sequence if and only if c = 0 , when we
have simple sampling with replacement. Pólya's urn is one of the most famous examples of a random process in which the outcome
variables are exchangeable, but dependent (in general).
Next, let's compute the covariance and correlation of a pair of outcome variables.

Suppose that \(i, j \in \mathbb{N}_+\) are distinct. Then
1. \(\operatorname{cov}(X_i, X_j) = \frac{abc}{(a+b)^2 (a+b+c)}\)
2. \(\operatorname{cor}(X_i, X_j) = \frac{c}{a+b+c}\)

Proof

Thus, the variables are positively correlated if c > 0 , negatively correlated if c < 0 , and uncorrelated (in fact, independent), if
c = 0 . These results certainly make sense when we recall the dynamics of Pólya's urn. It turns out that in any infinite sequence of

exchangeable variables, the variables must be nonnegatively correlated. Here is another result that explores how the variables are
related.
Suppose that \(n \in \mathbb{N}_+\) and \((x_1, x_2, \ldots, x_n) \in \{0, 1\}^n\). Let \(k = \sum_{i=1}^n x_i\). Then
\[ P(X_{n+1} = 1 \mid X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \frac{a + kc}{a + b + nc} \tag{12.8.7} \]

Proof
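This conditional probability is all that is needed to simulate the process, since the history enters only through the number of red balls drawn so far. A minimal simulation sketch in Python (illustrative; the function name is ours):

```python
import random

def polya_sample(a, b, c, n):
    # simulate n draws; after j draws with k reds so far, the next draw
    # is red with probability (a + k*c)/(a + b + j*c)
    k, draws = 0, []
    for j in range(n):
        x = 1 if random.random() < (a + k * c) / (a + b + j * c) else 0
        draws.append(x)
        k += x
    return draws

random.seed(1)
mean_reds = sum(sum(polya_sample(2, 3, 1, 10)) for _ in range(10000)) / 10000
print(round(mean_reds, 2))  # close to n * a/(a + b) = 10 * 2/5 = 4.0
```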

Pólya's urn is described by a sequence of indicator variables. We can study the same derived random processes that we studied with
Bernoulli trials: the number of red balls in the first \(n\) trials, the trial number of the \(k\)th red ball, and so forth.

The Number of Red Balls


For \(n \in \mathbb{N}\), the number of red balls selected in the first \(n\) trials is
\[ Y_n = \sum_{i=1}^n X_i \tag{12.8.8} \]
so that \(Y = (Y_0, Y_1, \ldots)\) is the partial sum process associated with \(X = (X_1, X_2, \ldots)\).

Note that
1. The number of green balls selected in the first \(n\) trials is \(n - Y_n\).
2. The number of red balls in the urn after the first \(n\) trials is \(a + c Y_n\).
3. The number of green balls in the urn after the first \(n\) trials is \(b + c(n - Y_n)\).
4. The number of balls in the urn after the first \(n\) trials is \(a + b + cn\).

The basic analysis of Y follows easily from our work with X.

The probability density function of \(Y_n\) is given by
\[ P(Y_n = k) = \binom{n}{k} \frac{a^{(c,k)}\, b^{(c,n-k)}}{(a+b)^{(c,n)}}, \quad k \in \{0, 1, \ldots, n\} \tag{12.8.9} \]

Proof
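Since the density is built from generalized permutations, it is straightforward to compute. The Python sketch below is an illustration (function names are ours); the \(a = b = c\) check anticipates the uniform distribution result given below.

```python
from math import comb

def gen_perm(r, s, j):
    # generalized permutation r^{(s,j)} = r (r + s) (r + 2s) ... (r + (j-1)s)
    prod = 1.0
    for i in range(j):
        prod *= r + i * s
    return prod

def polya_pdf(n, a, b, c, k):
    # P(Y_n = k) = C(n, k) a^{(c,k)} b^{(c,n-k)} / (a + b)^{(c,n)}
    return comb(n, k) * gen_perm(a, c, k) * gen_perm(b, c, n - k) / gen_perm(a + b, c, n)

print([round(polya_pdf(4, 1, 1, 1, k), 4) for k in range(5)])
# [0.2, 0.2, 0.2, 0.2, 0.2]: uniform when a = b = c
```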

The distribution defined by this probability density function is known, appropriately enough, as the Pólya distribution with parameters \(n\), \(a\), \(b\), and \(c\). Of course, the distribution reduces to the binomial distribution with parameters \(n\) and \(a/(a+b)\) in the case of sampling with replacement (\(c = 0\)) and to the hypergeometric distribution with parameters \(n\), \(a\), and \(b\) in the case of sampling without replacement (\(c = -1\)), although again in this case we need \(n \le a + b\). When \(c > 0\), the Pólya distribution is a special case of the beta-binomial distribution.

If \(a, b, c \in \mathbb{N}_+\) then the Pólya distribution with parameters \(a\), \(b\), \(c\) is the beta-binomial distribution with parameters \(a/c\) and \(b/c\). That is,
\[ P(Y_n = k) = \binom{n}{k} \frac{(a/c)^{[k]} (b/c)^{[n-k]}}{(a/c + b/c)^{[n]}}, \quad k \in \{0, 1, \ldots, n\} \tag{12.8.10} \]

Proof

The case where all three parameters are equal is particularly interesting.

If \(a = b = c\) then \(Y_n\) is uniformly distributed on \(\{0, 1, \ldots, n\}\).

Proof

In general, the Pólya family of distributions has a diverse collection of shapes.

Start the simulation of the Pólya Urn Experiment. Vary the parameters and note the shape of the probability density function. In
particular, note when the function is skewed, when the function is symmetric, when the function is unimodal, when the
function is monotone, and when the function is U-shaped. For various values of the parameters, run the simulation 1000 times
and compare the empirical density function to the probability density function.

The Pólya probability density function is
1. unimodal if \(a > b > c\) and \(n > \frac{a-c}{b-c}\)
2. unimodal if \(b > a > c\) and \(n > \frac{b-c}{a-c}\)
3. U-shaped if \(c > a > b\) and \(n > \frac{c-b}{c-a}\)
4. U-shaped if \(c > b > a\) and \(n > \frac{c-a}{c-b}\)
5. increasing if \(b < c < a\)
6. decreasing if \(a < c < b\)
Proof

Next, let's find the mean and variance. Curiously, the mean does not depend on the parameter c .

The mean and variance of the number of red balls selected are
1. \(E(Y_n) = n \frac{a}{a+b}\)
2. \(\operatorname{var}(Y_n) = n \frac{ab}{(a+b)^2} \left[1 + (n-1) \frac{c}{a+b+c}\right]\)

Proof

Start the simulation of the Pólya Urn Experiment. Vary the parameters and note the shape and location of the mean ± standard
deviation bar. For various values of the parameters, run the simulation 1000 times and compare the empirical mean and
standard deviation to the distribution mean and standard deviation.

Explicitly compute the probability density function, mean, and variance of \(Y_5\) when \(a = 6\), \(b = 4\), and for the values of \(c \in \{-1, 0, 1, 2, 3, 10\}\). Sketch the graph of the density function in each case.

Fix \(a\), \(b\), and \(n\), and let \(c \to \infty\). Then
1. \(P(Y_n = 0) \to \frac{b}{a+b}\)
2. \(P(Y_n = n) \to \frac{a}{a+b}\)
3. \(P(Y_n \in \{1, 2, \ldots, n-1\}) \to 0\)

Proof

Thus, the limiting distribution of \(Y_n\) as \(c \to \infty\) is concentrated on 0 and \(n\). The limiting probabilities are just the initial proportion of green and red balls, respectively. Interpret this result in terms of the dynamics of Pólya's urn scheme.

Our next result gives the conditional distribution of \(X_{n+1}\) given \(Y_n\).

Suppose that \(n \in \mathbb{N}\) and \(k \in \{0, 1, \ldots, n\}\). Then
\[ P(X_{n+1} = 1 \mid Y_n = k) = \frac{a + ck}{a + b + cn} \tag{12.8.13} \]

Proof

In particular, if \(a = b = c\) then \(P(X_{n+1} = 1 \mid Y_n = n) = \frac{n+1}{n+2}\). This is Laplace's rule of succession, another interesting connection. The rule is named for Pierre Simon Laplace, and is studied from a different point of view in the section on Independence.

The Proportion of Red Balls


Suppose that \(c \in \mathbb{N}\), so that the process continues indefinitely. For \(n \in \mathbb{N}_+\), the proportion of red balls selected in the first \(n\) trials is
\[ M_n = \frac{Y_n}{n} = \frac{1}{n} \sum_{i=1}^n X_i \tag{12.8.15} \]
This is an interesting variable, since a little reflection suggests that it may have a limit as \(n \to \infty\). Indeed, if \(c = 0\), then \(M_n\) is just the sample mean corresponding to \(n\) Bernoulli trials. Thus, by the law of large numbers, \(M_n\) converges to the success parameter \(\frac{a}{a+b}\) as \(n \to \infty\) with probability 1. On the other hand, the proportion of red balls in the urn after \(n\) trials is

\[ Z_n = \frac{a + c Y_n}{a + b + cn} \tag{12.8.16} \]

When \(c = 0\), of course, \(Z_n = \frac{a}{a+b}\) so that in this case, \(Z_n\) and \(M_n\) have the same limiting behavior. Note that
\[ Z_n = \frac{a}{a + b + cn} + \frac{cn}{a + b + cn} M_n \tag{12.8.17} \]

Since the constant term converges to 0 as \(n \to \infty\) and the coefficient of \(M_n\) converges to 1 as \(n \to \infty\), it follows that the limits of \(M_n\) and \(Z_n\) as \(n \to \infty\) will be the same, if the limit exists, for any mode of convergence: with probability 1, in mean, or in distribution. Here is the general result when \(c > 0\).

Suppose that \(a, b, c \in \mathbb{N}_+\). There exists a random variable \(P\) having the beta distribution with parameters \(a/c\) and \(b/c\) such that \(M_n \to P\) and \(Z_n \to P\) as \(n \to \infty\) with probability 1 and in mean square, and hence also in distribution.

Proof

It turns out that the random process \(Z = \{Z_n = (a + c Y_n)/(a + b + cn) : n \in \mathbb{N}\}\) is a martingale. The theory of martingales provides powerful tools for studying convergence in Pólya's urn process. As an interesting special case, note that if \(a = b = c\) then the limiting distribution is the uniform distribution on \((0, 1)\).

The Trial Number of the kth Red Ball


Suppose again that \(c \in \mathbb{N}\), so that the process continues indefinitely. For \(k \in \mathbb{N}_+\) let \(V_k\) denote the trial number of the \(k\)th red ball selected. Thus
\[ V_k = \min\{n \in \mathbb{N}_+ : Y_n = k\} \tag{12.8.18} \]
Note that \(V_k\) takes values in \(\{k, k+1, \ldots\}\). The random processes \(V = (V_1, V_2, \ldots)\) and \(Y = (Y_1, Y_2, \ldots)\) are inverses of each other in a sense.

For \(k, n \in \mathbb{N}_+\) with \(k \le n\),
1. \(V_k \le n\) if and only if \(Y_n \ge k\)
2. \(V_k = n\) if and only if \(Y_{n-1} = k - 1\) and \(X_n = 1\)

The probability density function of \(V_k\) is given by
\[ P(V_k = n) = \binom{n-1}{k-1} \frac{a^{(c,k)}\, b^{(c,n-k)}}{(a+b)^{(c,n)}}, \quad n \in \{k, k+1, \ldots\} \tag{12.8.19} \]

Proof

Of course this probability density function reduces to the negative binomial density function with trial parameter \(k\) and success parameter \(p = \frac{a}{a+b}\) when \(c = 0\) (sampling with replacement). When \(c > 0\), the distribution is a special case of the beta-negative binomial distribution.

If \(a, b, c \in \mathbb{N}_+\) then \(V_k\) has the beta-negative binomial distribution with parameters \(k\), \(a/c\), and \(b/c\). That is,
\[ P(V_k = n) = \binom{n-1}{k-1} \frac{(a/c)^{[k]} (b/c)^{[n-k]}}{(a/c + b/c)^{[n]}}, \quad n \in \{k, k+1, \ldots\} \tag{12.8.20} \]

Proof

If \(a = b = c\) then
\[ P(V_k = n) = \frac{k}{n(n+1)}, \quad n \in \{k, k+1, k+2, \ldots\} \tag{12.8.21} \]

Proof

Fix \(a\), \(b\), and \(k\), and let \(c \to \infty\). Then
1. \(P(V_k = k) \to \frac{a}{a+b}\)
2. \(P(V_k \in \{k+1, k+2, \ldots\}) \to 0\)

Thus, the limiting distribution of \(V_k\) is concentrated on \(k\) and \(\infty\). The limiting probabilities at these two points are just the initial proportion of red and green balls, respectively. Interpret this result in terms of the dynamics of Pólya's urn scheme.

This page titled 12.8: Pólya's Urn Process is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

12.9: The Secretary Problem

In this section we will study a nice problem known variously as the secretary problem or the marriage problem. It is simple to state
and not difficult to solve, but the solution is interesting and a bit surprising. Also, the problem serves as a nice introduction to the
general area of statistical decision making.

Statement of the Problem


As always, we must start with a clear statement of the problem.

We have n candidates (perhaps applicants for a job or possible marriage partners). The assumptions are
1. The candidates are totally ordered from best to worst with no ties.
2. The candidates arrive sequentially in random order.
3. We can only determine the relative ranks of the candidates as they arrive. We cannot observe the absolute ranks.
4. Our goal is to choose the very best candidate; no one less will do.
5. Once a candidate is rejected, she is gone forever and cannot be recalled.
6. The number of candidates n is known.

The assumptions, of course, are not entirely reasonable in real applications. The last assumption, for example, that n is known, is
more appropriate for the secretary interpretation than for the marriage interpretation.
What is an optimal strategy? What is the probability of success with this strategy? What happens to the strategy and the probability
of success as n increases? In particular, when n is large, is there any reasonable hope of finding the best candidate?

Strategies
Play the secretary game several times with n = 10 candidates. See if you can find a good strategy just by trial and error.

After playing the secretary game a few times, it should be clear that the only reasonable type of strategy is to let a certain number
k − 1 of the candidates go by, and then select the first candidate we see who is better than all of the previous candidates (if she

exists). If she does not exist (that is, if no candidate better than all previous candidates appears), we will agree to accept the last
candidate, even though this means failure. The parameter k must be between 1 and n ; if k = 1 , we select the first candidate; if
k = n , we select the last candidate; for any other value of k , the selected candidate is random, distributed on {k, k + 1, … , n} .

We will refer to this “let k − 1 go by” strategy as strategy k .


Thus, we need to compute the probability of success \(p_n(k)\) using strategy \(k\) with \(n\) candidates. Then we can maximize the probability over \(k\) to find the optimal strategy, and then take the limit over \(n\) to study the asymptotic behavior.

Analysis
First, let's do some basic computations.

For the case \(n = 3\), list the 6 permutations of \(\{1, 2, 3\}\) and verify the probabilities in the table below. Note that \(k = 2\) is optimal.

k          1      2      3
p_3(k)    2/6    3/6    2/6

Answer

In the secretary experiment, set the number of candidates to n =3 . Run the experiment 1000 times with each strategy
k ∈ {1, 2, 3}

For the case \(n = 4\), list the 24 permutations of \(\{1, 2, 3, 4\}\) and verify the probabilities in the table below. Note that \(k = 2\) is optimal; the numerators give the total number of successes for each strategy.

k          1       2       3       4
p_4(k)    6/24   11/24   10/24    6/24

Answer

In the secretary experiment, set the number of candidates to n =4 . Run the experiment 1000 times with each strategy
k ∈ {1, 2, 3, 4}

For the case \(n = 5\), list the 120 permutations of \(\{1, 2, 3, 4, 5\}\) and verify the probabilities in the table below. Note that \(k = 3\) is optimal.

k          1        2        3        4        5
p_5(k)   24/120   50/120   52/120   42/120   24/120

In the secretary experiment, set the number of candidates to n =5 . Run the experiment 1000 times with each strategy
k ∈ {1, 2, 3, 4, 5}

Well, clearly we don't want to keep doing this. Let's see if we can find a general analysis. With \(n\) candidates, let \(X_n\) denote the number (arrival order) of the best candidate, and let \(S_{n,k}\) denote the event of success for strategy \(k\) (we select the best candidate).

\(X_n\) is uniformly distributed on \(\{1, 2, \ldots, n\}\).
Xn is uniformly distributed on {1, 2, … , n}.


Proof

Next we will compute the conditional probability of success given the arrival order of the best candidate.

For \(n \in \mathbb{N}_+\) and \(k \in \{2, 3, \ldots, n\}\),
\[ P(S_{n,k} \mid X_n = j) = \begin{cases} 0, & j \in \{1, 2, \ldots, k-1\} \\ \frac{k-1}{j-1}, & j \in \{k, k+1, \ldots, n\} \end{cases} \tag{12.9.1} \]

Proof

The two cases are illustrated below. The large dot indicates the best candidate. Red dots indicate candidates that are rejected out of
hand, while blue dots indicate candidates that are considered.

Figure 12.9.1: The case when \(X_n = j < k\)

Figure 12.9.2: The case when \(X_n = j \ge k\)

Now we can compute the probability of success with strategy k .

For \(n \in \mathbb{N}_+\),
\[ p_n(k) = P(S_{n,k}) = \begin{cases} \frac{1}{n}, & k = 1 \\ \frac{k-1}{n} \sum_{j=k}^n \frac{1}{j-1}, & k \in \{2, 3, \ldots, n\} \end{cases} \tag{12.9.2} \]

Proof
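The formula makes a brute-force search for the optimal strategy easy. The illustrative Python sketch below (function names are ours) evaluates \(p_n(k)\) and maximizes over \(k\); it reproduces the optimal values for \(n = 100\) quoted below.

```python
def success_prob(n, k):
    # p_n(k): probability that strategy k selects the very best of n candidates
    if k == 1:
        return 1 / n
    return ((k - 1) / n) * sum(1 / (j - 1) for j in range(k, n + 1))

n = 100
best_k = max(range(1, n + 1), key=lambda k: success_prob(n, k))
print(best_k, round(success_prob(n, best_k), 5))  # 38 0.37104
```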

Values of the function \(p_n\) can be computed by hand for small \(n\) and by a computer algebra system for moderate \(n\). The graph of \(p_{100}\) is shown below. Note the concave downward shape of the graph and the optimal value of \(k\), which turns out to be 38. The optimal probability is about 0.37104.
optimal probability is about 0.37104.

12.9.2 https://stats.libretexts.org/@go/page/10252
Figure 12.9.3: The graph of \(p_{100}\)

The optimal strategy \(k_n\) that maximizes \(k \mapsto p_n(k)\), the ratio \(k_n/n\), and the optimal probability \(p_n(k_n)\) of finding the best candidate, as functions of \(n \in \{3, 4, \ldots, 20\}\), are given in the following table:

Candidates n    Optimal strategy k_n    Ratio k_n/n    Optimal probability p_n(k_n)

3 2 0.6667 0.5000

4 2 0.5000 0.4583

5 3 0.6000 0.4333

6 3 0.5000 0.4278

7 3 0.4286 0.4143

8 4 0.5000 0.4098

9 4 0.4444 0.4060

10 4 0.4000 0.3987

11 5 0.4545 0.3984

12 5 0.4167 0.3955

13 6 0.4615 0.3923

14 6 0.4286 0.3917

15 6 0.4000 0.3894

16 7 0.4375 0.3881

17 7 0.4118 0.3873

18 7 0.3889 0.3854

19 8 0.4211 0.3850

20 8 0.4000 0.3842

Apparently, as we might expect, the optimal strategy \(k_n\) increases and the optimal probability \(p_n(k_n)\) decreases as \(n \to \infty\). On the other hand, it's encouraging, and a bit surprising, that the optimal probability does not appear to be decreasing to 0. It's perhaps least clear what's going on with the ratio. Graphical displays of some of the information in the table may help:

Figure 12.9.4: The optimal probability \(p_n(k_n)\)

Figure 12.9.5: The optimal ratio \(k_n/n\)

Could it be that the ratio \(k_n/n\) and the probability \(p_n(k_n)\) are both converging, and moreover, are converging to the same number? First let's try to establish rigorously some of the trends observed in the table.

The success probability \(p_n\) satisfies
\[ p_n(k-1) < p_n(k) \text{ if and only if } \sum_{j=k}^n \frac{1}{j-1} > 1 \tag{12.9.4} \]
It follows that for each \(n \in \mathbb{N}_+\), the function \(p_n\) at first increases and then decreases. The maximum value of \(p_n\) occurs at the largest \(k\) with \(\sum_{j=k}^n \frac{1}{j-1} > 1\). This is the optimal strategy with \(n\) candidates, which we have denoted by \(k_n\).

As \(n\) increases, \(k_n\) increases and the optimal probability \(p_n(k_n)\) decreases.

Asymptotic Analysis
We are naturally interested in the asymptotic behavior of the function \(p_n\), and the optimal strategy, as \(n \to \infty\). The key is recognizing \(p_n\) as a Riemann sum for a simple integral. (Riemann sums, of course, are named for Georg Riemann.)

If \(k(n)\) depends on \(n\) and \(k(n)/n \to x \in (0, 1)\) as \(n \to \infty\) then \(p_n[k(n)] \to -x \ln x\) as \(n \to \infty\).


Proof

The optimal strategy \(k_n\) that maximizes \(k \mapsto p_n(k)\), the ratio \(k_n/n\), and the optimal probability \(p_n(k_n)\) of finding the best candidate, as functions of \(n \in \{10, 20, \ldots, 100\}\), are given in the following table:

Candidates n    Optimal strategy k_n    Ratio k_n/n    Optimal probability p_n(k_n)

10 4 0.4000 0.3987

20 8 0.4000 0.3842

30 12 0.4000 0.3786


40 16 0.4000 0.3757

50 19 0.3800 0.3743

60 23 0.3833 0.3732

70 27 0.3857 0.3724

80 30 0.3750 0.3719

90 34 0.3778 0.3714

100 38 0.3800 0.3710

The graph below shows the true probabilities \(p_n(k)\) and the limiting values \(-\frac{k}{n} \ln\left(\frac{k}{n}\right)\) as a function of \(k\) with \(n = 100\).

Figure 12.9.6: True and approximate probabilities of success as a function of \(k\) with \(n = 100\)
For the optimal strategy \(k_n\), there exists \(x_0 \in (0, 1)\) such that \(k_n/n \to x_0\) as \(n \to \infty\). Thus, \(x_0 \in (0, 1)\) is the limiting proportion of the candidates that we reject out of hand. Moreover, \(x_0\) maximizes \(x \mapsto -x \ln x\) on \((0, 1)\).

The maximum value of \(-x \ln x\) occurs at \(x_0 = 1/e\) and the maximum value is also \(1/e\).
Proof

Figure 12.9.7: The graph of \(-x \ln x\) on the interval \((0, 1)\)


Thus, the magic number 1/e ≈ 0.3679 occurs twice in the problem. For large n :
Our approximate optimal strategy is to reject out of hand the first 37% of the candidates and then select the first candidate (if
she appears) that is better than all of the previous candidates.
Our probability of finding the best candidate is about 0.37.
The article “Who Solved the Secretary Problem?” by Tom Ferguson (1989) has an interesting historical discussion of the problem,
including speculation that Johannes Kepler may have used the optimal strategy to choose his second wife. The article also discusses
many interesting generalizations of the problem. A different version of the secretary problem, in which the candidates are assigned
a score in [0, 1], rather than a relative rank, is discussed in the section on Stopping Times in the chapter on Martingales

This page titled 12.9: The Secretary Problem is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

CHAPTER OVERVIEW
13: Games of Chance
Games of chance hold an honored place in probability theory, because of their conceptual clarity and because of their fundamental
influence on the early development of the subject. In this chapter, we explore some of the most common and basic games of
chance. Roulette, craps, and Keno are casino games. The Monty Hall problem is based on a TV game show, and has become
famous because of the controversy that it generated. Lotteries are now basic ways that governments and other institutions raise
money. In the last four sections on the game of red and black, we study various types of gambling strategies, a study which leads to
some deep and fascinating mathematics.
13.1: Introduction to Games of Chance
13.2: Poker
13.3: Simple Dice Games
13.4: Craps
13.5: Roulette
13.6: The Monty Hall Problem
13.7: Lotteries
13.8: The Red and Black Game
13.9: Timid Play
13.10: Bold Play
13.11: Optimal Strategies

This page titled 13: Games of Chance is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

13.1: Introduction to Games of Chance

Gambling and Probability


Games of chance are among the oldest of human inventions. The use of a certain type of animal heel bone (called the astragalus or
colloquially the knucklebone) as a crude die dates to about 3600 BCE. The modern six-sided die dates to 2000 BCE, and the term
bones is used as a slang expression for dice to this day (as in roll the bones). It is because of these ancient origins, by the way, that
we use the die as the fundamental symbol in this project.

Figure 13.1.1 : An artificial knucklebone made of steatite, from the Arjan Verweij Dice Website
Gambling is intimately interwoven with the development of probability as a mathematical theory. Most of the early development of
probability, in particular, was stimulated by special gambling problems, such as
DeMere's problem
Pepy's problem
the problem of points
the Petersburg problem
Some of the very first books on probability theory were written to analyze games of chance, for example Liber de Ludo Aleae (The
Book on Games of Chance), by Girolamo Cardano, and Essay d' Analyse sur les Jeux de Hazard (Analytical Essay on Games of
Chance), by Pierre-Remond Montmort. Gambling problems continue to be a source of interesting and deep problems in probability
to this day (see the discussion of Red and Black for an example).

Figure 13.1.2 : Allegory of Fortune by Dosso Dossi (c. 1591), Getty Museum. For more depictions of gambling in paintings, see the
ancillary material on art.
Of course, it is important to keep in mind that breakthroughs in probability, even when they are originally motivated by gambling
problems, are often profoundly important in the natural sciences, the social sciences, law, and medicine. Also, games of chance
provide some of the conceptually clearest and cleanest examples of random experiments, and thus their analysis can be very helpful
to students of probability.
However, nothing in this chapter should be construed as encouraging you, gentle reader, to gamble. On the contrary, our analysis
will show that, in the long run, only the gambling houses prosper. The gambler, inevitably, is a sad victim of the law of large
numbers.

In this chapter we will study some interesting games of chance. Poker, poker dice, craps, and roulette are popular parlor and casino
games. The Monty Hall problem, on the other hand, is interesting because of the controversy that it generated. The lottery is a basic
way that many states and nations use to raise money (a voluntary tax, of sorts).

Terminology
Let us discuss some of the basic terminology that will be used in several sections of this chapter. Suppose that A is an event in a
random experiment. The mathematical odds concerning A refer to the probability of A .

If \(a\) and \(b\) are positive numbers, then by definition, the following are equivalent:
1. the odds in favor of \(A\) are \(a : b\).
2. \(P(A) = \frac{a}{a+b}\).
3. the odds against \(A\) are \(b : a\).
4. \(P(A^c) = \frac{b}{a+b}\).

In many cases, a and b can be given as positive integers with no common factors.

Similarly, suppose that \(p \in [0, 1]\). The following are equivalent:
1. \(P(A) = p\).
2. The odds in favor of \(A\) are \(p : 1 - p\).
3. \(P(A^c) = 1 - p\).
4. The odds against \(A\) are \(1 - p : p\).

On the other hand, the house odds of an event refer to the payout when a bet is made on the event.

A bet on event \(A\) pays \(n : m\) means that if a gambler bets \(m\) units on \(A\) then
1. If \(A\) occurs, the gambler receives the \(m\) units back and an additional \(n\) units (for a net profit of \(n\)).
2. If \(A\) does not occur, the gambler loses the bet of \(m\) units (for a net profit of \(-m\)).

Equivalently, the gambler puts up \(m\) units (betting on \(A\)), the house puts up \(n\) units (betting on \(A^c\)), and the winner takes the pot. Of course, it is usually not necessary for the gambler to bet exactly \(m\); a smaller or larger bet is scaled appropriately. Thus, if the gambler bets \(k\) units and wins, his payout is \(k \frac{n}{m}\).

Naturally, our main interest is in the net winnings if we make a bet on an event. The following result gives the probability density
function, mean, and variance for a unit bet. The expected value is particularly interesting, because by the law of large numbers, it
gives the long term gain or loss, per unit bet.

Suppose that the odds in favor of event \(A\) are \(a : b\) and that a bet on event \(A\) pays \(n : m\). Let \(W\) denote the winnings from a unit bet on \(A\). Then
1. \(P(W = -1) = \frac{b}{a+b}\), \(P\left(W = \frac{n}{m}\right) = \frac{a}{a+b}\)
2. \(E(W) = \frac{an - bm}{m(a+b)}\)
3. \(\operatorname{var}(W) = \frac{ab(n+m)^2}{m^2(a+b)^2}\)

In particular, the expected value of the bet is zero if and only if an = bm , positive if and only if an > bm , and negative if and
only if an < bm . The first case means that the bet is fair, and occurs when the payoff is the same as the odds against the event.
The second means that the bet is favorable to the gambler, and occurs when the payoff is greater that the odds against the event.
The third case means that the bet is unfair to the gambler, and occurs when the payoff is less than the odds against the event.
Unfortunately, all casino games fall into the third category.
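As an illustration (a sketch with names of our choosing, not a prescribed implementation), the following Python code computes the mean and variance of the winnings from a unit bet in exact rational arithmetic. The sample parameters describe a single-number bet in American roulette: 38 equally likely slots, so the odds in favor are \(1 : 37\), while the house pays \(35 : 1\).

```python
from fractions import Fraction

def bet_moments(a, b, n, m):
    # odds in favor of A are a : b; a bet on A pays n : m; W = winnings per unit bet
    p = Fraction(a, a + b)               # P(A)
    win = Fraction(n, m)                 # payout per unit when A occurs
    EW = win * p - (1 - p)               # equals (an - bm)/(m(a + b))
    varW = (win + 1) ** 2 * p * (1 - p)  # equals ab(n + m)^2 / (m^2 (a + b)^2)
    return EW, varW

EW, varW = bet_moments(1, 37, 35, 1)
print(EW, float(EW))  # -1/19, about -0.0526: the gambler loses about 5.3% per unit bet
```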

More About Dice

13.1.2 https://stats.libretexts.org/@go/page/10255
Shapes of Dice
The standard die, of course, is a cube with six sides. A bit more generally, most real dice are in the shape of Platonic solids, named
for Plato naturally. The faces of a Platonic solid are congruent regular polygons. Moreover, the same number of faces meet at each
vertex so all of the edges and angles are congruent as well.

The five Platonic solids are


1. The tetrahedron, with 4 sides.
2. The hexahedron (cube), with 6 sides
3. The octahedron, with 8 sides
4. The dodecahedron, with 12 sides
5. The icosahedron, with 20 sides

Figure 13.1.3 : Blue Platonic Dice from Wikipedia


Note that the 4-sided die is the only Platonic die in which the outcome is the face that is down rather than up (or perhaps it's better
to think of the vertex that is up as the outcome).

Fair and Crooked Dice


Recall that a fair die is one in which the faces are equally likely. In addition to fair dice, there are various types of crooked dice. For
the standard six-sided die, there are three crooked types that we use frequently in this project. To understand the geometry, recall
that with the standard six-sided die, opposite faces sum to 7.

Flat Dice
1. An ace-six flat die is a six-sided die in which faces 1 and 6 have probability \(\frac{1}{4}\) each while faces 2, 3, 4, and 5 have probability \(\frac{1}{8}\) each.
2. A two-five flat die is a six-sided die in which faces 2 and 5 have probability \(\frac{1}{4}\) each while faces 1, 3, 4, and 6 have probability \(\frac{1}{8}\) each.
3. A three-four flat die is a six-sided die in which faces 3 and 4 have probability \(\frac{1}{4}\) each while faces 1, 2, 5, and 6 have probability \(\frac{1}{8}\) each.
A flat die, as the name suggests, is a die that is not a cube, but rather is shorter in one of the three directions. The particular
probabilities that we use (1/4 and 1/8) are fictitious, but the essential property of a flat die is that the opposite faces on the shorter
axis have slightly larger probabilities (because they have slightly larger areas) than the other four faces. Flat dice are sometimes
used by gamblers to cheat.

In the Dice Experiment, select one die. Run the experiment 1000 times in each of the following cases and observe the
outcomes.
1. fair die
2. ace-six flat die
3. two-five flat die
4. three-four flat die

Simulation
It's very easy to simulate a fair die with a random number. Recall that the ceiling function ⌈x⌉ gives the smallest integer that is at
least as large as x.

Suppose that U is uniformly distributed on the interval (0, 1], so that U has the standard uniform distribution (a random
number). Then X = ⌈6 U ⌉ is uniformly distributed on the set {1, 2, 3, 4, 5, 6} and so simulates a fair six-sided die. More
generally, X = ⌈n U ⌉ is uniformly distributed on {1, 2, … , n} and so simulates a fair n -sided die.

We can also use a real fair die to simulate other types of fair dice. Recall that if X is uniformly distributed on {1, 2, … , n} and
k ∈ {1, 2, … , n − 1} , then the conditional distribution of X given that X ∈ {1, 2, … , k} is uniformly distributed on

{1, 2, … , k}. Thus, suppose that we have a real, fair, n -sided die. If we ignore outcomes greater than k then we simulate a fair k -

sided die. For example, suppose that we have a carefully constructed icosahedron that is a fair 20-sided die. We can simulate a fair
13-sided die by simply rolling the die and stopping as soon as we have a score between 1 and 13.
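Both constructions are easy to code, as in the illustrative Python sketch below (function names are ours). One caveat: random.random() returns a value in \([0, 1)\) rather than \((0, 1]\), so the sketch guards the probability-zero case \(U = 0\).

```python
import random
from math import ceil

def fair_die(n):
    # X = ceil(n * U) is uniform on {1, 2, ..., n}
    return ceil(n * random.random()) or 1  # "or 1" guards the measure-zero U = 0 case

def die_from_die(n, k):
    # simulate a fair k-sided die from a fair n-sided die (k < n)
    # by ignoring outcomes greater than k
    while True:
        x = fair_die(n)
        if x <= k:
            return x

random.seed(2)
rolls = [die_from_die(20, 13) for _ in range(10000)]
print(min(rolls), max(rolls))  # 1 13
```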
To see how to simulate a card hand, see the Introduction to Finite Sampling Models. A general method of simulating random
variables is based on the quantile function.

This page titled 13.1: Introduction to Games of Chance is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

13.2: Poker

Basic Theory
The Poker Hand
A deck of cards naturally has the structure of a product set and thus can be modeled mathematically by

D = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, j, q, k} × {♣, ♢, ♡, ♠} (13.2.1)

where the first coordinate represents the denomination or kind (ace, two through 10, jack, queen, king) and where the second
coordinate represents the suit (clubs, diamond, hearts, spades). Sometimes we represent a card as a string rather than an ordered
pair (for example q ♡).
There are many different poker games, but we will be interested in standard draw poker, which consists of dealing 5 cards at random from the deck \(D\). The order of the cards does not matter in draw poker, so we will record the outcome of our random experiment as the random set (hand) \(X = \{X_1, X_2, X_3, X_4, X_5\}\) where \(X_i = (Y_i, Z_i) \in D\) for each \(i\) and \(X_i \ne X_j\) for \(i \ne j\). Thus, the sample space consists of all possible poker hands:
\[ S = \{\{x_1, x_2, x_3, x_4, x_5\} : x_i \in D \text{ for each } i \text{ and } x_i \ne x_j \text{ for all } i \ne j\} \tag{13.2.2} \]

Our basic modeling assumption (and the meaning of the term at random) is that \(X\) is uniformly distributed over the set of possible poker hands \(S\):
\[ P(X \in A) = \frac{\#(A)}{\#(S)} \tag{13.2.3} \]

In statistical terms, a poker hand is a random sample of size 5 drawn without replacement and without regard to order from the
population D. For more on this topic, see the chapter on Finite Sampling Models.

The Value of the Hand


There are nine different types of poker hands in terms of value. We will use the numbers 0 to 8 to denote the value of the hand,
where 0 is the type of least value (actually no value) and 8 the type of most value.

The hand value V of a poker hand is a random variable taking values 0 through 8, and is defined as follows:
0. No Value. The hand is of none of the other types.
1. One Pair. The hand has 2 cards of one kind, and one card each of three other kinds.
2. Two Pair. The hand has 2 cards of one kind, 2 cards of another kind, and one card of a third kind.
3. Three of a Kind. The hand has 3 cards of one kind and one card of each of two other kinds.
4. Straight. The kinds of cards in the hand form a consecutive sequence but the cards are not all in the same suit. An ace can
be considered the smallest denomination or the largest denomination.
5. Flush. The cards are all in the same suit, but the kinds of the cards do not form a consecutive sequence.
6. Full House. The hand has 3 cards of one kind and 2 cards of another kind.
7. Four of a Kind. The hand has 4 cards of one kind, and 1 card of another kind.
8. Straight Flush. The cards are all in the same suit and the kinds form a consecutive sequence.

Run the poker experiment 10 times in single-step mode. For each outcome, note that the value of the random variable
corresponds to the type of hand, as given above.

For some comic relief before we get to the analysis, look at two of the paintings of Dogs Playing Poker by CM Coolidge.
1. His Station and Four Aces
2. Waterloo

The Probability Density Function
Computing the probability density function of V is a good exercise in combinatorial probability. In the following exercises, we
need the two fundamental rules of combinatorics to count the number of poker hands of a given type: the multiplication rule and
the addition rule. We also need some basic combinatorial structures, particularly combinations.

The number of different poker hands is
\[ \#(S) = \binom{52}{5} = 2\,598\,960 \tag{13.2.4} \]

P(V = 1) = 1 098 240/2 598 960 ≈ 0.422569 .


Proof

P(V = 2) = 123 552/2 598 960 ≈ 0.047539 .


Proof

P(V = 3) = 54 912/2 598 960 ≈ 0.021129 .


Proof

P(V = 8) = 40/2 598 960 ≈ 0.000015 .


Proof

P(V = 4) = 10 200/2 598 960 ≈ 0.003925 .


Proof

P(V = 5) = 5108/2 598 960 ≈ 0.001965 .


Proof

P(V = 6) = 3744/2 598 960 ≈ 0.001441 .


Proof

P(V = 7) = 624/2 598 960 ≈ 0.000240 .


Proof

P(V = 0) = 1 302 540/2 598 960 ≈ 0.501177 .


Proof
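The counts behind these probabilities follow from the multiplication and addition rules used in the proofs. The Python sketch below (an illustration, not part of the text) reproduces every count and probability in the list above.

```python
from math import comb

total = comb(52, 5)                                # 2 598 960 hands
counts = {
    8: 10 * 4,                                     # straight flush
    7: 13 * 12 * 4,                                # four of a kind
    6: 13 * comb(4, 3) * 12 * comb(4, 2),          # full house
    5: 4 * comb(13, 5) - 40,                       # flush, minus straight flushes
    4: 10 * 4 ** 5 - 40,                           # straight, minus straight flushes
    3: 13 * comb(4, 3) * comb(12, 2) * 4 ** 2,     # three of a kind
    2: comb(13, 2) * comb(4, 2) ** 2 * 11 * 4,     # two pair
    1: 13 * comb(4, 2) * comb(12, 3) * 4 ** 3,     # one pair
}
counts[0] = total - sum(counts.values())           # no value
for v in sorted(counts):
    print(v, counts[v], round(counts[v] / total, 6))
```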

Note that the probability density function of V is decreasing; the more valuable the type of hand, the less likely the type of hand is
to occur. Note also that no value and one pair account for more than 92% of all poker hands.

In the poker experiment, note the shape of the density graph. Note that some of the probabilities are so small that they are
essentially invisible in the graph. Now run the poker hand 1000 times and compare the relative frequency function to the
density function.

In the poker experiment, set the stop criterion to the value of V given below. Note the number of poker hands required.
1. V =3

2. V =4

3. V =5

4. V =6

5. V =7

6. V =8

Find the probability of getting a hand that is three of a kind or better.

Answer

In the movie The Parent Trap (1998), both twins get straight flushes on the same poker deal. Find the probability of this event.
Answer

Classify V in terms of level of measurement: nominal, ordinal, interval, or ratio. Is the expected value of V meaningful?
Answer

A hand with a pair of aces and a pair of eights (and a fifth card of a different type) is called a dead man's hand. The name is in
honor of Wild Bill Hickok, who held such a hand at the time of his murder in 1876. Find the probability of getting a dead man's
hand.
Answer

Drawing Cards
In draw poker, each player is dealt a poker hand and there is an initial round of betting. Typically, each player then gets to discard
up to 3 cards and is dealt that number of cards from the remaining deck. This leads to myriad problems in conditional probability,
as partial information becomes available. A complete analysis is far beyond the scope of this section, but we will consider a couple of simple examples.

Suppose that Fred's hand is {4 ♡, 5 ♡, 7 ♠, q ♣, 1 ♢}. Fred discards the q ♣ and 1 ♢ and draws two new cards, hoping to
complete the straight. Note that Fred must get a 6 and either a 3 or an 8. Since he is missing a middle denomination (6), Fred is
drawing to an inside straight. Find the probability that Fred is successful.
Answer

Suppose that Wilma's hand is {4 ♡, 5 ♡, 6 ♠, q ♣, 1 ♢}. Wilma discards q ♣ and 1 ♢ and draws two new cards, hoping to
complete the straight. Note that Wilma must get a 2 and a 3, or a 7 and an 8, or a 3 and a 7. Find the probability that Wilma is
successful. Clearly, Wilma has a better chance than Fred.
Answer

This page titled 13.2: Poker is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services)
via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

13.3: Simple Dice Games

In this section, we will analyze several simple games played with dice—poker dice, chuck-a-luck, and high-low. The casino game
craps is more complicated and is studied in the next section.

Figure 13.3.1 : The Dice Players by Georges de La Tour (c. 1651). For more on the influence of probability in painting, see the
ancillary material on art.

Poker Dice
Definition
The game of poker dice is a bit like standard poker, but played with dice instead of cards. In poker dice, 5 fair dice are rolled. We
will record the outcome of our random experiment as the (ordered) sequence of scores:
X = (X1 , X2 , X3 , X4 , X5 ) (13.3.1)

Thus, the sample space is S = {1, 2, 3, 4, 5, 6}^5. Since the dice are fair, our basic modeling assumption is that X is a sequence of 5 independent random variables, each uniformly distributed on {1, 2, 3, 4, 5, 6}.

Equivalently, X is uniformly distributed on S:

P(X ∈ A) = #(A)/#(S), A ⊆ S (13.3.2)

In statistical terms, a poker dice hand is a random sample of size 5 drawn with replacement and with regard to order from the
population D = {1, 2, 3, 4, 5, 6}. For more on this topic, see the chapter on Finite Sampling Models. In particular, in this chapter
you will learn that the result of Exercise 1 would not be true if we recorded the outcome of the poker dice experiment as an
unordered set instead of an ordered sequence.

The Value of the Hand


The value V of the poker dice hand is a random variable with support set {0, 1, 2, 3, 4, 5, 6}. The values are defined as follows:
0. None alike. Five distinct scores occur, once each.
1. One Pair. Four distinct scores occur; one score occurs twice and the other three scores occur once each.
2. Two Pair. Three distinct scores occur; two scores occur twice each and the other score occurs once.
3. Three of a Kind. Three distinct scores occur; one score occurs three times and the other two scores occur once each.
4. Full House. Two distinct scores occur; one score occurs three times and the other score occurs twice.
5. Four of a Kind. Two distinct scores occur; one score occurs four times and the other score occurs once.
6. Five of a Kind. One score occurs five times.

Run the poker dice experiment 10 times in single-step mode. For each outcome, note that the value of the random variable
corresponds to the type of hand, as given above.

The Probability Density Function


Computing the probability density function of V is a good exercise in combinatorial probability. In the following exercises, we will
need the two fundamental rules of combinatorics to count the number of dice sequences of a given type: the multiplication rule and
the addition rule. We will also need some basic combinatorial structures, particularly combinations and permutations (with types of
objects that are identical).

The number of different poker dice hands is #(S) = 6^5 = 7776.

P(V = 0) = 720/7776 ≈ 0.09259.
Proof

P(V = 1) = 3600/7776 ≈ 0.46296.
Proof

P(V = 2) = 1800/7776 ≈ 0.23148.
Proof

P(V = 3) = 1200/7776 ≈ 0.15432.
Proof

P(V = 4) = 300/7776 ≈ 0.03858.
Proof

P(V = 5) = 150/7776 ≈ 0.01929.
Proof

P(V = 6) = 6/7776 ≈ 0.00077.
Proof
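Since the sample space has only 6^5 = 7776 points, the entire density of V can be verified by brute-force enumeration. A short Python sketch (hypothetical, for checking the results above):

from itertools import product
from collections import Counter

def value(hand):
    """Code the hand 0-6 by its pattern of repeated scores, as defined above."""
    counts = tuple(sorted(Counter(hand).values(), reverse=True))
    return {(1, 1, 1, 1, 1): 0, (2, 1, 1, 1): 1, (2, 2, 1): 2,
            (3, 1, 1): 3, (3, 2): 4, (4, 1): 5, (5,): 6}[counts]

# enumerate all 6^5 = 7776 equally likely ordered hands
freq = Counter(value(hand) for hand in product(range(1, 7), repeat=5))
for v in sorted(freq):
    print(v, freq[v], round(freq[v] / 7776, 5))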

Run the poker dice experiment 1000 times and compare the relative frequency function to the density function.

Find the probability of rolling a hand that has 3 of a kind or better.


Answer

In the poker dice experiment, set the stop criterion to the value of V given below. Note the number of hands required.
1. V = 3
2. V = 4
3. V = 5
4. V = 6

Chuck-a-Luck
Chuck-a-luck is a popular carnival game, played with three dice. According to Richard Epstein, the original name was Sweat Cloth,
and in British pubs, the game is known as Crown and Anchor (because the six sides of the dice are inscribed clubs, diamonds,
hearts, spades, crown and anchor). The dice are over-sized and are kept in an hourglass-shaped cage known as the bird cage. The
dice are rolled by spinning the bird cage.
Chuck-a-luck is very simple. The gambler selects an integer from 1 to 6, and then the three dice are rolled. If exactly k dice show the gambler's number, the payoff is k : 1. As with poker dice, our basic mathematical assumption is that the dice are fair, and therefore the outcome vector X = (X_1, X_2, X_3) is uniformly distributed on the sample space S = {1, 2, 3, 4, 5, 6}^3.

Let Y denote the number of dice that show the gambler's number. Then Y has the binomial distribution with parameters n = 3 and p = 1/6:

P(Y = k) = \binom{3}{k} (1/6)^k (5/6)^{3−k}, k ∈ {0, 1, 2, 3} (13.3.3)

Let W denote the net winnings for a unit bet. Then


1. W = −1 if Y = 0
2. W =Y if Y > 0

The probability density function of W is given by
1. P(W = −1) = 125/216
2. P(W = 1) = 75/216
3. P(W = 2) = 15/216
4. P(W = 3) = 1/216

Run the chuck-a-luck experiment 1000 times and compare the empirical density function of W to the true probability density
function.

The expected value and variance of W are
1. E(W) = −17/216 ≈ −0.0787
2. var(W) = 57815/46656 ≈ 1.2392
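These moments can be confirmed with exact rational arithmetic. A minimal Python sketch (not part of the original text) that rebuilds the distribution of W from the binomial density of Y:

from fractions import Fraction
from math import comb

# binomial density of Y, the number of dice showing the chosen number
pY = {k: comb(3, k) * Fraction(1, 6)**k * Fraction(5, 6)**(3 - k) for k in range(4)}

# net winnings: W = -1 if Y = 0, otherwise W = Y
pW = {-1: pY[0], 1: pY[1], 2: pY[2], 3: pY[3]}
mean = sum(w * p for w, p in pW.items())
var = sum(w * w * p for w, p in pW.items()) - mean**2
print(mean, float(mean))  # -17/216, about -0.0787
print(var, float(var))    # 57815/46656, about 1.2392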

Run the chuck-a-luck experiment 1000 times and compare the empirical mean and standard deviation of W to the true mean
and standard deviation. Suppose you had bet $1 on each of the 1000 games. What would your net winnings be?

High-Low
In the game of high-low, a pair of fair dice are rolled. The outcome is
1. high if the sum is 8, 9, 10, 11, or 12
2. low if the sum is 2, 3, 4, 5, or 6
3. seven if the sum is 7
A player can bet on any of the three outcomes. The payoff for a bet of high or for a bet of low is 1 : 1. The payoff for a bet of seven is 4 : 1.

Let Z denote the outcome of a game of high-low. Find the probability density function of Z .
Answer

Let W denote the net winnings for a unit bet. Find the expected value and variance of W for each of the three bets:
1. high
2. low
3. seven
Answer
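Both exercises can be checked with a short computation; all three bets turn out to have the same expected value, but quite different variances. A sketch (assuming the payoffs stated above):

from fractions import Fraction

# density of the sum of two fair dice
p_sum = {s: Fraction(6 - abs(s - 7), 36) for s in range(2, 13)}
p_low = sum(p_sum[s] for s in range(2, 7))
p_seven = p_sum[7]
p_high = sum(p_sum[s] for s in range(8, 13))

# unit bets: high and low pay 1:1, seven pays 4:1
for name, p_win, payoff in [("high", p_high, 1), ("low", p_low, 1), ("seven", p_seven, 4)]:
    mean = payoff * p_win - (1 - p_win)
    var = payoff**2 * p_win + (1 - p_win) - mean**2
    print(name, mean, float(var))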

This page titled 13.3: Simple Dice Games is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

13.4: Craps

The Basic Game


Craps is a popular casino game because of its complexity and the rich variety of bets that can be made.

Figure 13.4.1 : A typical craps table


According to Richard Epstein, craps is descended from an earlier game known as Hazard, which dates to the Middle Ages. The formal rules for Hazard were established by Montmort early in the 1700s. The origin of the name craps is shrouded in doubt, but it may have come from the English crabs or from the French crapaud (for toad).
From a mathematical point of view, craps is interesting because it is an example of a random experiment that takes place in stages;
the evolution of the game depends critically on the outcome of the first roll. In particular, the number of rolls is a random variable.

Definitions
The rules for craps are as follows:

The player (known as the shooter) rolls a pair of fair dice


1. If the sum is 7 or 11 on the first throw, the shooter wins; this event is called a natural.
2. If the sum is 2, 3, or 12 on the first throw, the shooter loses; this event is called craps.
3. If the sum is 4, 5, 6, 8, 9, or 10 on the first throw, this number becomes the shooter's point. The shooter continues rolling
the dice until either she rolls the point again (in which case she wins) or rolls a 7 (in which case she loses).

As long as the shooter wins, or loses by rolling craps, she retains the dice and continues. Once she loses by failing to make her point, the dice are passed to the next shooter.
Let us consider the game of craps mathematically. Our basic assumption, of course, is that the dice are fair and that the outcomes of the various rolls are independent. Let N denote the (random) number of rolls in the game and let (X_i, Y_i) denote the outcome of the ith roll for i ∈ {1, 2, …, N}. Finally, let Z_i = X_i + Y_i, the sum of the scores on the ith roll, and let V denote the event that the shooter wins.

In the craps experiment, press single step a few times and observe the outcomes. Make sure that you understand the rules of the
game.

The Probability of Winning


We will compute the probability that the shooter wins in stages, based on the outcome of the first roll.

The sum of the scores Z on a given roll has the probability density function in the following table:

z         2     3     4     5     6     7     8     9     10    11    12
P(Z = z)  1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

The probability that the player makes her point can be computed using a simple conditioning argument. For example, suppose that
the player throws 4 initially, so that 4 is the point. The player continues until she either throws 4 again or throws 7. Thus, the final
roll will be an element of the following set:
S4 = {(1, 3), (2, 2), (3, 1), (1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)} (13.4.1)

Since the dice are fair, these outcomes are equally likely, so the probability that the player makes her 4 point is 3/9. A similar argument can be used for the other points. Here are the results:

The probabilities of making the point z are given in the following table:

z               4    5     6     8     9     10
P(V ∣ Z_1 = z)  3/9  4/10  5/11  5/11  4/10  3/9

The probability that the shooter wins is P(V) = 244/495 ≈ 0.49293.

Proof
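The conditioning argument can be packaged in a few lines: condition on the first roll, and for a point z use the fact that the game ends on a roll that is either z or 7. A sketch with exact arithmetic (not part of the original text):

from fractions import Fraction

# density of the sum of two fair dice
p = {z: Fraction(6 - abs(z - 7), 36) for z in range(2, 13)}

win = p[7] + p[11]  # a natural on the first roll
for point in (4, 5, 6, 8, 9, 10):
    # given a point, the game ends on a roll that is the point or a 7
    win += p[point] * p[point] / (p[point] + p[7])
print(win, float(win))  # 244/495, about 0.49293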

Note that craps is nearly a fair game. For the sake of completeness, the following result gives the probability of winning, given a
“point” on the first roll.
P(V ∣ Z_1 ∈ {4, 5, 6, 8, 9, 10}) = 67/165 ≈ 0.406

Proof

Bets
There is a bewildering variety of bets that can be made in craps. In the exercises in this subsection, we will discuss some typical
bets and compute the probability density function, mean, and standard deviation of each. (Most of these bets are illustrated in the
picture of the craps table above.) Note, however, that some of the details of the bets, and in particular the payout odds, vary from one casino to another. Of course the expected value of any bet is inevitably negative (for the gambler), and thus the gambler is
doomed to lose money in the long run. Nonetheless, as we will see, some bets are better than others.

Pass and Don't Pass


A pass bet is a bet that the shooter will win and pays 1 : 1 .

Let W denote the winnings from a unit pass bet. Then
1. P(W = −1) = 251/495, P(W = 1) = 244/495
2. E(W) = −7/495 ≈ −0.0141
3. sd(W) ≈ 0.9999

In the craps experiment, select the pass bet. Run the simulation 1000 times and compare the empirical density function and
moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?

A don't pass bet is a bet that the shooter will lose, except that 12 on the first throw is excluded (that is, the shooter loses, of course, but the don't pass bettor neither wins nor loses). This is the meaning of the phrase don't pass bar double 6 on the craps table. The don't pass bet also pays 1 : 1.

Let W denote the winnings for a unit don't pass bet. Then
1. P(W = −1) = 244/495, P(W = 0) = 1/36, P(W = 1) = 949/1980
2. E(W) = −27/1980 ≈ −0.01363
3. sd(W) ≈ 0.9859

Thus, the don't pass bet is slightly better for the gambler than the pass bet.

In the craps experiment, select the don't pass bet. Run the simulation 1000 times and compare the empirical density function
and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?

The come bet and the don't come bet are analogous to the pass and don't pass bets, respectively, except that they are made after the
point has been established.

Field
A field bet is a bet on the outcome of the next throw. It pays 1 : 1 if 3, 4, 9, 10, or 11 is thrown, 2 : 1 if 2 or 12 is thrown, and loses
otherwise.

Let W denote the winnings for a unit field bet. Then
1. P(W = −1) = 5/9, P(W = 1) = 7/18, P(W = 2) = 1/18
2. E(W) = −1/18 ≈ −0.0556
3. sd(W) ≈ 1.0787

In the craps experiment, select the field bet. Run the simulation 1000 times and compare the empirical density function and
moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?

Seven and Eleven


A 7 bet is a bet on the outcome of the next throw. It pays 4 : 1 if a 7 is thrown. Similarly, an 11 bet is a bet on the outcome of the
next throw, and pays 15 : 1 if an 11 is thrown. In spite of the romance of the number 7, the next exercise shows that the 7 bet is one
of the worst bets you can make.

Let W denote the winnings for a unit 7 bet. Then
1. P(W = −1) = 5/6, P(W = 4) = 1/6
2. E(W) = −1/6 ≈ −0.1667
3. sd(W) ≈ 1.8634

In the craps experiment, select the 7 bet. Run the simulation 1000 times and compare the empirical density function and
moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?

Let W denote the winnings for a unit 11 bet. Then
1. P(W = −1) = 17/18, P(W = 15) = 1/18
2. E(W) = −1/9 ≈ −0.1111
3. sd(W) ≈ 3.6650

In the craps experiment, select the 11 bet. Run the simulation 1000 times and compare the empirical density function and
moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?

Craps
All craps bets are bets on the next throw. The basic craps bet pays 7 : 1 if 2, 3, or 12 is thrown. The craps 2 bet pays 30 : 1 if a 2 is
thrown. Similarly, the craps 12 bet pays 30 : 1 if a 12 is thrown. Finally, the craps 3 bet pays 15 : 1 if a 3 is thrown.

Let W denote the winnings for a unit craps bet. Then
1. P(W = −1) = 8/9, P(W = 7) = 1/9
2. E(W) = −1/9 ≈ −0.1111
3. sd(W) ≈ 2.5142

In the craps experiment, select the craps bet. Run the simulation 1000 times and compare the empirical density function and
moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?

Let W denote the winnings for a unit craps 2 bet or a unit craps 12 bet. Then
1. P(W = −1) = 35/36, P(W = 30) = 1/36
2. E(W) = −5/36 ≈ −0.1389
3. sd(W) ≈ 5.0944

In the craps experiment, select the craps 2 bet. Run the simulation 1000 times and compare the empirical density function and
moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?

In the craps experiment, select the craps 12 bet. Run the simulation 1000 times and compare the empirical density function and
moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?

Let W denote the winnings for a unit craps 3 bet. Then
1. P(W = −1) = 17/18, P(W = 15) = 1/18
2. E(W) = −1/9 ≈ −0.1111
3. sd(W) ≈ 3.6650

In the craps experiment, select the craps 3 bet. Run the simulation 1000 times and compare the empirical density function and
moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?

Thus, of the craps bets, the basic craps bet and the craps 3 bet are best for the gambler, and the craps 2 and craps 12 are the worst.

Big Six and Big Eight


The big 6 bet is a bet that 6 is thrown before 7. Similarly, the big 8 bet is a bet that 8 is thrown before 7. Both pay even money
1 : 1.

Let W denote the winnings for a unit big 6 bet or a unit big 8 bet. Then
1. P(W = −1) = 6/11, P(W = 1) = 5/11
2. E(W) = −1/11 ≈ −0.0909
3. sd(W) ≈ 0.9959

In the craps experiment, select the big 6 bet. Run the simulation 1000 times and compare the empirical density function and
moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?

In the craps experiment, select the big 8 bet. Run the simulation 1000 times and compare the empirical density function and
moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?

Hardway Bets
A hardway bet can be made on any of the numbers 4, 6, 8, or 10. It is a bet that the chosen number n will be thrown “the hardway”
as (n/2, n/2), before 7 is thrown and before the chosen number is thrown in any other combination. Hardway bets on 4 and 10 pay
7 : 1 , while hardway bets on 6 and 8 pay 9 : 1 .

Let W denote the winnings for a unit hardway 4 or hardway 10 bet. Then
1. P(W = −1) = 8/9, P(W = 7) = 1/9
2. E(W) = −1/9 ≈ −0.1111
3. sd(W) ≈ 2.5142

In the craps experiment, select the hardway 4 bet. Run the simulation 1000 times and compare the empirical density function
and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?

In the craps experiment, select the hardway 10 bet. Run the simulation 1000 times and compare the empirical density function
and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?

Let W denote the winnings for a unit hardway 6 or hardway 8 bet. Then
1. P(W = −1) = 10/11, P(W = 9) = 1/11
2. E(W) = −1/11 ≈ −0.0909
3. sd(W) ≈ 2.8748

In the craps experiment, select the hardway 6 bet. Run the simulation 1000 times and compare the empirical density and
moments of W to the true density and moments. Suppose that you bet $1 on each of the 1000 games. What would your net
winnings be?

In the craps experiment, select the hardway 8 bet. Run the simulation 1000 times and compare the empirical density function
and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?

Thus, the hardway 6 and 8 bets are better than the hardway 4 and 10 bets for the gambler, in terms of expected value.

The Distribution of the Number of Rolls


Next let us compute the distribution and moments of the number of rolls N in a game of craps. This random variable is of no special interest to the casino or the players, but provides a good mathematical exercise. By definition, if the shooter wins or loses on the first roll, N = 1. Otherwise, the shooter continues until she either makes her point or rolls 7. In this latter case, we can use the geometric distribution on N_+, which governs the trial number of the first success in a sequence of Bernoulli trials. The distribution of N is a mixture of distributions.

The probability density function of N is

P(N = n) = \begin{cases} 12/36, & n = 1 \\ (1/24)(3/4)^{n−2} + (5/81)(13/18)^{n−2} + (55/648)(25/36)^{n−2}, & n ∈ {2, 3, …} \end{cases} (13.4.5)

Proof

The first few values of the probability density function of N are given in the following table:

n          1        2        3        4        5
P(N = n)   0.33333  0.18827  0.13477  0.09657  0.06926

Find the probability that a game of craps will last at least 8 rolls.
Answer

The mean and variance of the number of rolls are
1. E(N) = 557/165 ≈ 3.3758
2. var(N) = 245 672/27 225 ≈ 9.02376

Proof
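The formula can be sanity-checked numerically, and the series for the mean truncated with negligible error, as in the following sketch (the cutoff 500 is arbitrary; the geometric tails decay quickly):

def pdf_N(n):
    """P(N = n) for the number of rolls, from the mixture formula above."""
    if n == 1:
        return 12 / 36
    return ((1 / 24) * (3 / 4)**(n - 2)
            + (5 / 81) * (13 / 18)**(n - 2)
            + (55 / 648) * (25 / 36)**(n - 2))

# truncating at 500 rolls loses a negligible geometric tail
probs = [pdf_N(n) for n in range(1, 500)]
print(sum(probs))                                  # about 1
print(sum(n * p for n, p in enumerate(probs, 1)))  # about 557/165 = 3.3758
print(sum(pdf_N(n) for n in range(8, 500)))        # P(N >= 8), for the exercise above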

This page titled 13.4: Craps is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services)
via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

13.5: Roulette

The Roulette Wheel


According to Richard Epstein, roulette is the oldest casino game still in operation. Its invention has been variously attributed to Blaise Pascal, the Italian mathematician Don Pasquale, and several others. In any event, the roulette wheel was first introduced into Paris in 1765. Here are the characteristics of the wheel:

The (American) roulette wheel has 38 slots numbered 00, 0, and 1–36.
1. Slots 0, 00 are green;
2. Slots 1, 3, 5, 7, 9, 12, 14, 16, 18, 19, 21, 23, 25, 27, 30, 32, 34, 36 are red;
3. Slots 2, 4, 6, 8, 10, 11, 13, 15, 17, 20, 22, 24, 26, 28, 29, 31, 33, 35 are black.

Except for 0 and 00, the slots on the wheel alternate between red and black. The strange order of the numbers on the wheel is
intended so that high and low numbers, as well as odd and even numbers, tend to alternate.

Figure 13.5.1 : A typical roulette wheel and table


The roulette experiment is very simple. The wheel is spun and then a small ball is rolled in a groove, in the opposite direction as the motion of the wheel. Eventually the ball falls into one of the slots. Naturally, we assume mathematically that the wheel is fair, so that the random variable X that gives the slot number of the ball is uniformly distributed over the sample space S = {00, 0, 1, …, 36}. Thus, P(X = x) = 1/38 for each x ∈ S.

Bets
As with craps, roulette is a popular casino game because of the rich variety of bets that can be made. The picture above shows the
roulette table and indicates some of the bets we will study. All bets turn out to have the same expected value (negative, of course).
However, the variances differ depending on the bet.

Although all bets in roulette have the same expected value, the standard deviations vary inversely with the number of numbers
selected. What are the implications of this for the gambler?

Straight Bets
A straight bet is a bet on a single number, and pays 35 : 1.

Let W denote the winnings on a unit straight bet. Then
1. P(W = −1) = 37/38, P(W = 35) = 1/38
2. E(W) = −1/19 ≈ −0.0526
3. sd(W) ≈ 5.7626

In the roulette experiment, select the single number bet. Run the simulation 1000 times and compare the empirical density
function and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000
games. What would your net winnings be?

Two Number Bets
A 2-number bet (or a split bet) is a bet on two adjacent numbers in the roulette table. The bet pays 17 : 1.

Let W denote the winnings on a unit split bet. Then
1. P(W = −1) = 18/19, P(W = 17) = 1/19
2. E(W) = −1/19 ≈ −0.0526
3. sd(W) ≈ 4.0193

In the roulette experiment, select the 2 number bet. Run the simulation 1000 times and compare the empirical density function
and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?

Three Number Bets


A 3-number bet (or row bet) is a bet on the three numbers in a vertical row on the roulette table. The bet pays 11 : 1.

Let W denote the winnings on a unit row bet. Then
1. P(W = −1) = 35/38, P(W = 11) = 3/38
2. E(W) = −1/19 ≈ −0.0526
3. sd(W) ≈ 3.2359

In the roulette experiment, select the 3-number bet. Run the simulation 1000 times and compare the empirical density function
and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?

Four Number Bets


A 4-number bet or a square bet is a bet on the four numbers that form a square on the roulette table. The bet pays 8 : 1 .

Let W denote the winnings on a unit 4-number bet. Then
1. P(W = −1) = 17/19, P(W = 8) = 2/19
2. E(W) = −1/19 ≈ −0.0526
3. sd(W) ≈ 2.7620

In the roulette experiment, select the 4-number bet. Run the simulation 1000 times and compare the empirical density function
and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?

Six Number Bets


A 6-number bet or 2-row bet is a bet on the 6 numbers in two adjacent rows of the roulette table. The bet pays 5 : 1 .

Let W denote the winnings on a unit 6-number bet. Then
1. P(W = −1) = 16/19, P(W = 5) = 3/19
2. E(W) = −1/19 ≈ −0.0526
3. sd(W) ≈ 2.1879

In the roulette experiment, select the 6-number bet. Run the simulation 1000 times and compare the empirical density function and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games. What would your net winnings be?

Twelve Number Bets
A 12-number bet is a bet on 12 numbers. In particular, a column bet is bet on any one of the three columns of 12 numbers running
horizontally along the table. Other 12-number bets are the first 12 (1-12), the middle 12 (13-24), and the last 12 (25-36). A 12-
number bet pays 2 : 1 .

Let W denote the winnings on a unit 12-number bet. Then
1. P(W = −1) = 13/19, P(W = 2) = 6/19
2. E(W) = −1/19 ≈ −0.0526
3. sd(W) ≈ 1.3945

In the roulette experiment, select the 12-number bet. Run the simulation 1000 times and compare the empirical density
function and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000
games. What would your net winnings be?

Eighteen Number Bets


An 18-number bet is a bet on 18 numbers. In particular, a color bet is a bet either on red or on black. A parity bet is a bet on the odd numbers from 1 to 36 or the even numbers from 1 to 36. The low bet is a bet on the numbers 1-18, and the high bet is the bet on the numbers 19-36. An 18-number bet pays 1 : 1.

Let W denote the winnings on a unit 18-number bet. Then
1. P(W = −1) = 10/19, P(W = 1) = 9/19
2. E(W) = −1/19 ≈ −0.0526
3. sd(W) ≈ 0.9986

In the roulette experiment, select the 18-number bet. Run the simulation 1000 times and compare the empirical density
function and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000
games. What would your net winnings be?
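All of the results above fit a single pattern: a bet covering n of the 38 slots with payoff a : 1 satisfies (a + 1)n = 36, so E(W) = (a + 1)n/38 − 1 = −1/19 in every case, while the standard deviation shrinks as the number of covered slots grows. A small sketch that reproduces the table of means and standard deviations (the bet labels are informal, not casino terminology):

from math import sqrt

# (slots covered, payoff) for each bet on the 38-slot American wheel
bets = {"straight": (1, 35), "split": (2, 17), "row": (3, 11), "square": (4, 8),
        "six-number": (6, 5), "twelve-number": (12, 2), "eighteen-number": (18, 1)}

for name, (slots, payoff) in bets.items():
    p = slots / 38
    mean = payoff * p - (1 - p)  # equals -1/19, about -0.0526, in every case
    sd = sqrt(payoff**2 * p + (1 - p) - mean**2)
    print(f"{name:16s} E(W) = {mean:+.4f}  sd(W) = {sd:.4f}")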

This page titled 13.5: Roulette is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

13.6: The Monty Hall Problem

Preliminaries
Statement of the Problem
The Monty Hall problem involves a classical game show situation and is named after Monty Hall, the long-time host of the TV game show Let's Make a
Deal. There are three doors labeled 1, 2, and 3. A car is behind one of the doors, while goats are behind the other two:

Figure 13.6.1 : The car and the two goats

The rules are as follows:


1. The player selects a door.
2. The host selects a different door and opens it.
3. The host gives the player the option of switching from her original choice to the remaining closed door.
4. The door finally selected by the player is opened and she either wins or loses.

The Monty Hall problem became the subject of intense controversy because of several articles by Marilyn Vos Savant in the Ask Marilyn column of Parade
magazine, a popular Sunday newspaper supplement. The controversy began when a reader posed the problem in the following way:

Suppose you're on a game show, and you're given a choice of three doors. Behind one door is a car; behind the
others, goats. You pick a door—say No. 1—and the host, who knows what's behind the doors, opens another
door—say No. 3—which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your
advantage to switch your choice?
Marilyn's response was that the contestant should switch doors, claiming that there is a 1/3 chance that the car is behind door 1, while there is a 2/3 chance that the car is behind door 2. In two follow-up columns, Marilyn printed a number of responses, some from academics, most of whom claimed in angry or sarcastic tones that she was wrong and that there are equal chances that the car is behind doors 1 or 2. Marilyn stood by her original answer and offered additional, but non-mathematical, arguments.

Think about the problem. Do you agree with Marilyn or with her critics, or do you think that neither solution is correct?

In the Monty Hall game, set the host strategy to standard (the meaning of this strategy will be explained below). Play the Monty Hall game 50 times with each of the following strategies. Do you want to reconsider your answer to the question above?
1. Always switch
2. Never switch

In the Monty Hall game, set the host strategy to blind (the meaning of this strategy will be explained below). Play the Monty Hall game 50 times with
each of the following strategies. Do you want to reconsider your answer to the question above?
1. Always switch
2. Never switch

Modeling Assumptions
When we begin to think carefully about the Monty Hall problem, we realize that the statement of the problem by Marilyn's reader is so vague that a
meaningful discussion is not possible without clarifying assumptions about the strategies of the host and player. Indeed, we will see that misunderstandings
about these strategies are the cause of the controversy.
Let us try to formulate the problem mathematically. In general, the actions of the host and player can vary from game to game, but if we are to have a random
experiment in the classical sense, we must assume that the same probability distributions govern the host and player on each game and that the games are
independent.
There are four basic random variables for a game:
1. U : the number of the door containing the car.
2. X: the number of the first door selected by the player.
3. V : the number of the door opened by the host.
4. Y : the number of the second door selected by the player.

Each of these random variables has the possible values 1, 2, and 3. However, because of the rules of the game, the door opened by the host cannot be either of
the doors selected by the player, so V ≠ X and V ≠ Y . In general, we will allow the possibility V = U , that the host opens the door with the car behind it.
Whether this is a reasonable action of the host is a big part of the controversy about this problem.
The Monty Hall experiment will be completely defined mathematically once the joint distribution of the basic variables is specified. This joint distribution in
turn depends on the strategies of the host and player, which we will consider next.

Strategies
Host Strategies
In the Monty Hall experiment, note that the host determines the probability density function of the door containing the car, namely P(U = i) for i ∈ {1, 2, 3}. The obvious choice for the host is to randomly assign the car to one of the three doors. This leads to the uniform distribution, and unless otherwise noted, we will always assume that U has this distribution. Thus, P(U = i) = 1/3 for i ∈ {1, 2, 3}.
The host also determines the conditional density function of the door he opens, given knowledge of the door containing the car and the first door selected by
the player, namely P(V = k ∣ U = i, X = j) for i, j, k ∈ {1, 2, 3}. Recall that since the host cannot open the door chosen by the player, this probability must
be 0 for k = j .
Thus, the distribution of U and the conditional distribution of V given U and X constitute the host strategy.

The Standard Strategy


In most real game shows, the host would always open a door with a goat behind it. If the player's first choice is incorrect, then the host has no choice; he
cannot open the door with the car or the player's choice and must therefore open the only remaining door. On the other hand, if the player's first choice is
correct, then the host can open either of the remaining doors, since goats are behind both. Thus, he might naturally pick one of these doors at random.

This strategy leads to the following conditional distribution for V given U and X:

P(V = k ∣ U = i, X = j) = \begin{cases} 1, & i ≠ j, i ≠ k, k ≠ j \\ 1/2, & i = j, k ≠ i \\ 0, & k = i \text{ or } k = j \end{cases} (13.6.1)

This distribution, along with the uniform distribution for U, will be referred to as the standard strategy for the host.

In the Monty Hall game, set the host strategy to standard. Play the game 50 times with each of the following player strategies. Which works better?
1. Always switch
2. Never switch

The Blind Strategy


Another possible second-stage strategy is for the host to always open a door chosen at random from the two possibilities. Thus, the host might well open the
door containing the car.

This strategy leads to the following conditional distribution for V given U and X:

P(V = k ∣ U = i, X = j) = \begin{cases} 1/2, & k ≠ j \\ 0, & k = j \end{cases} (13.6.2)

This distribution, together with the uniform distribution for U , will be referred to as the blind strategy for the host. The blind strategy seems a bit odd.
However, the confusion between the two strategies is the source of the controversy concerning this problem.

In the Monty Hall game, set the host strategy to blind. Play the game 50 times with each of the following player strategies. Which works better?
1. Always switch
2. Never switch

Player Strategies
The player, on the other hand, determines the probability density function of her first choice, namely P(X = j) for j ∈ {1, 2, 3}. The obvious first choice for the player is to randomly choose a door, since the player has no knowledge at this point. This leads to the uniform distribution, so P(X = j) = 1/3 for j ∈ {1, 2, 3}.

The player also determines the conditional density function of her second choice, given knowledge of her first choice and the door opened by the host, namely P(Y = l ∣ X = j, V = k) for j, k, l ∈ {1, 2, 3} with j ≠ k. Recall that since the player cannot choose the door opened by the host, this probability must be 0 for l = k. The distribution of X and the conditional distribution of Y given X and V constitute the player strategy.

Suppose that the player switches with probability p ∈ [0, 1]. This leads to the following conditional distribution:

P(Y = l ∣ X = j, V = k) = \begin{cases} p, & j ≠ k, l ≠ j, l ≠ k \\ 1 − p, & j ≠ k, l = j \\ 0, & l = k \end{cases} (13.6.3)

In particular, if p = 1 , the player always switches, while if p = 0 , the player never switches.

Mathematical Analysis
We are almost ready to analyze the Monty Hall problem mathematically. But first we must make some independence assumptions to incorporate the lack of
knowledge that the host and player have about each other's actions. First, the player has no knowledge of the door containing the car, so we assume that U
and X are independent. Also, the only information about the car door that the player has when she makes her second choice is the information (if any)
revealed by her first choice and the host's subsequent selection. Mathematically, this means that Y is conditionally independent of U given X and V .

Distributions
The host and player strategies form the basic data for the Monty Hall problem. Because of the independence assumptions, the joint distribution of the basic
random variables is completely determined by these strategies.

The joint probability density function of (U , X, V , Y ) is given by

P(U = i, X = j, V = k, Y = l) = P(U = i)P(X = j)P(V = k ∣ U = i, X = j)P(Y = l ∣ X = j, V = k), i, j, k, l ∈ {1, 2, 3} (13.6.4)

Proof

The probability of any event defined in terms of the Monty Hall problem can be computed by summing the joint density over the appropriate values of
(i, j, k, l).

With either of the basic host strategies, V is uniformly distributed on {1, 2, 3}.

Suppose that the player switches with probability p. With either of the basic host strategies, Y is uniformly distributed on {1, 2, 3}.

In the Monty Hall experiment, set the host strategy to standard. For each of the following values of p, run the simulation 1000 times. Based on relative
frequency, which strategy works best?
1. p = 0 (never switch)
2. p = 0.3
3. p = 0.5
4. p = 0.7
5. p = 1 (always switch)

In the Monty Hall experiment, set the host strategy to blind. For each of the following values of p, run the experiment 1000 times. Based on relative
frequency, which strategy works best?
1. p = 0 (never switch)
2. p = 0.3
3. p = 0.5
4. p = 0.7
5. p = 1 (always switch)

The Probability of Winning


The event that the player wins a game is {Y = U} . We will compute the probability of this event with the basic host and player strategies.

Suppose that the host follows the standard strategy and that the player switches with probability p. Then the probability that the player wins is

P(Y = U) = (1 + p)/3 (13.6.5)

In particular, if the player always switches (p = 1), the probability that she wins is 2/3, and if the player never switches (p = 0), the probability that she wins is 1/3.
In the Monty Hall experiment, set the host strategy to standard. For each of the following values of p, run the simulation 1000 times. In each case,
compare the relative frequency of winning to the probability of winning.
1. p = 0 (never switch)
2. p = 0.3
3. p = 0.5
4. p = 0.7
5. p = 1 (always switch)
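If the applet is not available, the two host strategies are easy to compare by simulation. A minimal sketch (the play function and its parameters are hypothetical, not part of the applet):

import random

def play(switch, blind):
    """One game; returns True if the player's final door hides the car."""
    car, first = random.randrange(3), random.randrange(3)
    others = [d for d in range(3) if d != first]
    if blind:
        opened = random.choice(others)  # the blind host may reveal the car
    else:
        opened = random.choice([d for d in others if d != car])
    final = next(d for d in range(3) if d not in (first, opened)) if switch else first
    return final == car

for blind in (False, True):
    for switch in (False, True):
        wins = sum(play(switch, blind) for _ in range(100_000))
        print(f"blind={blind} switch={switch}: {wins / 100_000:.3f}")

Under the standard host, switching wins about 2/3 of the time; under the blind host, both strategies win about 1/3, in agreement with the results of this section.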

Suppose that the host follows the blind strategy. Then for any player strategy, the probability that the player wins is

P(Y = U) = 1/3 (13.6.6)

In the Monty Hall experiment, set the host strategy to blind. For each of the following values of p, run the experiment 1000 times. In each case, compare
the relative frequency of winning to the probability of winning.
1. p = 0 (never switch)
2. p = 0.3
3. p = 0.5
4. p = 0.7
5. p = 1 (always switch)

For a complete solution of the Monty Hall problem, we want to compute the conditional probability that the player wins, given that the host opens a door with a goat behind it:

P(Y = U ∣ V ≠ U) = P(Y = U)/P(V ≠ U) (13.6.7)

With the basic host and player strategies, the numerator, the probability of winning, has been computed. Thus we need to consider the denominator, the probability that the host opens a door with a goat. If the host uses the standard strategy, then the conditional probability of winning is the same as the unconditional probability of winning, regardless of the player strategy. In particular, we have the following result:

If the host follows the standard strategy and the player switches with probability p, then

P(Y = U ∣ V ≠ U) = (1 + p)/3 (13.6.8)

Proof

Once again, the probability increases from 1/3 when p = 0, so that the player never switches, to 2/3 when p = 1, so that the player always switches.

If the host follows the blind strategy, then for any player strategy, P(V ≠ U) = 2/3 and therefore P(Y = U ∣ V ≠ U) = 1/2.

In the Monty Hall experiment, set the host strategy to blind. For each of the following values of p, run the experiment 500 times. In each case, compute the conditional relative frequency of winning, given that the host shows a goat, and compare with the theoretical answer above.
1. p = 0 (never switch)
2. p = 0.3
3. p = 0.5
4. p = 0.7
5. p = 1 (always switch)

The confusion between the conditional probability of winning for these two strategies has been the source of much controversy in the Monty Hall problem.
Marilyn was probably thinking of the standard host strategy, while some of her critics were thinking of the blind strategy. This problem points out the
importance of careful modeling, of the careful statement of assumptions. Marilyn is correct if the host follows the standard strategy; the critics are correct if
the host follows the blind strategy; any number of other answers could be correct if the host follows other strategies.
The mathematical formulation we have used is fairly complete. However, if we just want to solve Marilyn's problem, there is a much simpler analysis (which you may have discovered yourself). Suppose that the host follows the standard strategy, and thus always opens a door with a goat. If the player's first door is incorrect (contains a goat), then the host has no choice and must open the other door with a goat. Then, if the player switches, she wins. On the other hand, if the player's first door is correct and she switches, then of course she loses. Thus, we see that if the player always switches, then she wins if and only if her first choice is incorrect, an event that obviously has probability 2/3. If the player never switches, then she wins if and only if her first choice is correct, an event with probability 1/3.

This page titled 13.6: The Monty Hall Problem is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source
content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

13.7: Lotteries

You realize the odds of winning [the lottery] are the same as being mauled by a polar bear and a regular bear in the same day.
—E*TRADE baby, January 2010.
Lotteries are among the simplest and most widely played of all games of chance, and unfortunately for the gambler, among the
worst in terms of expected value. Lotteries come in such an incredible number of variations that it is impractical to analyze all of
them. So, in this section, we will study some of the more common lottery formats.

Figure 13.7.1 : A lottery ticket issued by the Continental Congress in 1776 to raise money for the American Revolutionary War.
Source: Wikipedia

The Basic Lottery


Basic Format
The basic lottery is a random experiment in which the gambling house (in many cases a government agency) selects n numbers at
random, without replacement, from the integers from 1 to N . The integer parameters N and n vary from one lottery to another,
and of course, n cannot be larger than N . The order in which the numbers are chosen usually does not matter, and thus in this case,
the sample space S of the experiment consists of all subsets (combinations) of size n chosen from the population {1, 2, … , N }.
S = {x ⊆ {1, 2, … , N } : #(x) = n} (13.7.1)

Recall that

#(S) = \binom{N}{n} = N!/[n!(N − n)!] (13.7.2)

Naturally, we assume that all such combinations are equally likely, and thus, the chosen combination X, the basic random variable
of the experiment, is uniformly distributed on S .
P(X = x) = 1/\binom{N}{n}, x ∈ S (13.7.3)

The player of the lottery pays a fee and gets to select m numbers, without replacement, from the integers from 1 to N . Again, order
does not matter, so the player essentially chooses a combination y of size m from the population {1, 2, … , N }. In many cases
m = n , so that the player gets to choose the same number of numbers as the house. In general then, there are three parameters in

the basic (N , n, m) lottery.


The player's goal, of course, is to maximize the number of matches (often called catches by gamblers) between her combination y
and the random combination X chosen by the house. Essentially, the player is trying to guess the outcome of the random
experiment before it is run. Thus, let U = #(X ∩ y) denote the number of catches.

The number of catches U in the (N, n, m) lottery has probability density function given by

P(U = k) = \binom{m}{k} \binom{N − m}{n − k} / \binom{N}{n}, k ∈ {0, 1, …, m} (13.7.4)

The distribution of U is the hypergeometric distribution with parameters N , n , and m, and is studied in detail in the chapter on
Finite Sampling Models. In particular, from this section, it follows that the mean and variance of the number of catches U are
E(U) = n m/N (13.7.5)

var(U) = n (m/N)(1 − m/N)(N − n)/(N − 1) (13.7.6)

Note that P(U = k) = 0 if k > n or k < n + m − N. However, in most lotteries, m ≤ n and N is much larger than n + m. In these common cases, the density function is positive for the values of k given above.
We will refer to the special case where m = n as the (N , n) lottery; this is the case in most state lotteries. In this case, the
probability density function of the number of catches U is
P(U = k) = \binom{n}{k} \binom{N − n}{n − k} / \binom{N}{n}, k ∈ {0, 1, …, n} (13.7.7)

The mean and variance of the number of catches U in this special case are

E(U) = n^2/N (13.7.8)

var(U) = n^2 (N − n)^2 / [N^2 (N − 1)] (13.7.9)
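The exercises that follow ask for these quantities in specific cases; they can be checked with a short computation based on the density above. A sketch (the helper name is hypothetical; math.comb returns 0 when a binomial coefficient vanishes, which handles the boundary terms):

from math import comb

def catches_pdf(N, n, m):
    """Hypergeometric density of the number of catches U in the (N, n, m) lottery."""
    return {k: comb(m, k) * comb(N - m, n - k) / comb(N, n)
            for k in range(min(n, m) + 1)}

pdf = catches_pdf(47, 5, 5)  # the (47, 5) lottery
mean = sum(k * p for k, p in pdf.items())
sd = (sum(k * k * p for k, p in pdf.items()) - mean**2) ** 0.5
print(pdf)
print(mean, sd)  # E(U) = 25/47, about 0.5319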

Explicitly give the probability density function, mean, and standard deviation of the number of catches in the (47, 5) lottery.
Answer

Explicitly give the probability density function, mean, and standard deviation of the number of catches in the (49, 5) lottery.
Answer

Explicitly give the probability density function, mean, and standard deviation of the number of catches in the (47, 7) lottery.
Answer

The analysis above was based on the assumption that the player's combination y is selected deterministically. Would it matter if the
player chose the combination in a random way? Thus, suppose that the player's selected combination Y is a random variable taking
values in S . (For example, in many lotteries, players can buy tickets with combinations randomly selected by a computer; this is
typically known as Quick Pick). Clearly, X and Y must be independent, since the player (and her randomizing device) can have no
knowledge of the winning combination X. As you might guess, such randomization makes no difference.

Let U denote the number of catches in the (N , n, m) lottery when the player's combination Y is a random variable,
independent of the winning combination X. Then U has the same distribution as in the deterministic case above.
Proof

There are many websites that publish data on the frequency of occurrence of numbers in various state lotteries. Some gamblers
evidently feel that some numbers are luckier than others.

Given the assumptions and analysis above, do you believe that some numbers are luckier than others? Does it make any
mathematical sense to study historical data for a lottery?

The prize money in most state lotteries depends on the sales of the lottery tickets. Typically, about 50% of the sales money is
returned as prize money; the rest goes for administrative costs and profit for the state. The total prize money is divided among the
winning tickets, and the prize for a given ticket depends on the number of catches U . For all of these reasons, it is impossible to
give a simple mathematical analysis of the expected value of playing a given state lottery. Note however, that since the state keeps a
fixed percentage of the sales, there is essentially no risk for the state.
From a pure gambling point of view, state lotteries are bad games. In most casino games, by comparison, 90% or more of the
money that comes in is returned to the players as prize money. Of course, state lotteries should be viewed as a form of voluntary

taxation, not simply as games. The profits from lotteries are typically used for education, health care, and other essential services.
A discussion of the value and costs of lotteries from a political and social point of view (as opposed to a mathematical one) is
beyond the scope of this project.

Bonus Numbers
Many state lotteries now augment the basic (N, n) format with a bonus number. The bonus number T is selected from a specified set of integers, in addition to the combination X, selected as before. The player likewise picks a bonus number s, in addition to a combination y. The player's prize then depends on the number of catches U between X and y, as before, and in addition on whether the player's bonus number s matches the random bonus number T chosen by the house. We will let I denote the indicator variable of this latter event. Thus, our interest now is in the joint distribution of (I, U).
In one common format, the bonus number T is selected at random from the set of integers {1, 2, …, M}, independently of the combination X of size n chosen from {1, 2, …, N}. Usually M < N. Note that with this format, the game is essentially two independent lotteries, one in the (N, n) format and the other in the (M, 1) format.

Explicitly compute the joint probability density function of (I , U ) for the (47, 5) lottery with independent bonus number from
1 to 27. This format is used in the California lottery, among others.
Answer

Explicitly compute the joint probability density function of (I , U ) for the (49, 5) lottery with independent bonus number from
1 to 42. This format is used in the Powerball lottery, among others.
Answer

In another format, the bonus number T is chosen from 1 to N , and is distinct from the numbers in the combination X. To model
this game, we assume that T is uniformly distributed on {1, 2, … , N }, and given T = t , X is uniformly distributed on the set of
combinations of size n chosen from {1, 2, … , N } ∖ {t}. For this format, the joint probability density function is harder to
compute.

The probability density function of (I, U) is given by

P(I = 1, U = k) = \binom{n}{k} \binom{N − 1 − n}{n − k} / [N \binom{N − 1}{n}], k ∈ {0, 1, …, n} (13.7.11)

P(I = 0, U = k) = (N − n + 1) \binom{n}{k} \binom{N − 1 − n}{n − k} / [N \binom{N − 1}{n}] + n \binom{n − 1}{k} \binom{N − n}{n − k} / [N \binom{N − 1}{n}], k ∈ {0, 1, …, n} (13.7.12)

Proof

Explicitly compute the joint probability density function of (I , U ) for the (47, 7) lottery with bonus number chosen as
described above. This format is used in the Super 7 Canada lottery, among others.

Keno
Keno is a lottery game played in casinos. For a fixed N (usually 80) and n (usually 20), the player can play a range of basic
(N , n, m) games, as described in the first subsection. Typically, m ranges from 1 to 15, and the payoff depends on m and the

number of catches U . In this section, you will compute the density function, mean, and standard deviation of the random payoff,
based on a unit bet, for a typical keno game with N = 80 , n = 20 , and m ∈ {1, 2, … , 15}. The payoff tables are based on the
keno game at the Tropicana casino in Atlantic City, New Jersey.
Recall that the probability density function of the number of catches U above is given by

P(U = k) = \binom{m}{k} \binom{80 − m}{20 − k} / \binom{80}{20}, k ∈ {0, 1, …, m} (13.7.13)

The payoff table for m =1 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 1

Catches 0 1

Payoff 0 3

Answer
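Each of the exercises below is the same computation with a different payoff vector, so a small helper (hypothetical, not part of the text) can check all of them:

from math import comb

def keno_stats(payoffs):
    """Mean and sd of the payoff for a pick-m keno bet (N = 80, n = 20).

    payoffs[k] is the payoff for k catches, so m = len(payoffs) - 1.
    """
    m = len(payoffs) - 1
    pdf = [comb(m, k) * comb(80 - m, 20 - k) / comb(80, 20) for k in range(m + 1)]
    mean = sum(a * p for a, p in zip(payoffs, pdf))
    sd = (sum(a * a * p for a, p in zip(payoffs, pdf)) - mean**2) ** 0.5
    return mean, sd

print(keno_stats([0, 3]))         # pick 1: mean 0.75
print(keno_stats([0, 0, 12]))     # pick 2
print(keno_stats([0, 0, 1, 43]))  # pick 3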

The payoff table for m =2 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 2

Catches 0 1 2

Payoff 0 0 12

Answer

The payoff table for m =3 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 3

Catches 0 1 2 3

Payoff 0 0 1 43

Answer

The payoff table for m =4 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 4

Catches 0 1 2 3 4

Payoff 0 0 1 3 130

Answer

The payoff table for m =5 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 5

Catches 0 1 2 3 4 5

Payoff 0 0 0 1 10 800

Answer

The payoff table for m =6 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 6

Catches 0 1 2 3 4 5 6

Payoff 0 0 0 1 4 95 1500

Answer

The payoff table for m =7 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 7

Catches 0 1 2 3 4 5 6 7

Payoff 0 0 0 0 1 25 350 8000

Answer

The payoff table for m =8 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 8

Catches 0 1 2 3 4 5 6 7 8

Payoff 0 0 0 0 0 9 90 1500 25,000

Answer

The payoff table for m =9 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 9

Catches 0 1 2 3 4 5 6 7 8 9

Payoff 0 0 0 0 0 4 50 280 4000 50,000

Answer

The payoff table for m = 10 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 10

Catches 0 1 2 3 4 5 6 7 8 9 10

Payoff 0 0 0 0 0 1 22 150 1000 5000 100,000

Answer

The payoff table for m = 11 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 11

Catches 0 1 2 3 4 5 6 7 8 9 10 11

Payoff 0 0 0 0 0 0 8 80 400 2500 25,000 100,000

Answer

The payoff table for m = 12 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 12

Catches 0 1 2 3 4 5 6 7 8 9 10 11 12

Payoff 0 0 0 0 0 0 5 32 200 1000 5000 25,000 100,000

Answer

The payoff table for m = 13 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 13

Catches  0  1  2  3  4  5  6  7   8   9    10    11      12      13
Payoff   1  0  0  0  0  0  1  20  80  600  3500  10,000  50,000  100,000

Answer

The payoff table for m = 14 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 14

Catches  0  1  2  3  4  5  6  7  8   9    10    11    12      13      14
Payoff   1  0  0  0  0  0  1  9  42  310  1100  8000  25,000  50,000  100,000

Answer

The payoff table for m = 15 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 15

Catches  0  1  2  3  4  5  6  7   8   9    10   11    12      13      14       15
Payoff   1  0  0  0  0  0  0  10  25  100  300  2800  25,000  50,000  100,000  100,000

Answer

In the exercises above, you should have noticed that the expected payoff on a unit bet varies from about 0.71 to 0.75, so the
expected profit (for the gambler) varies from about −0.25 to −0.29. This is quite bad for the gambler playing a casino game, but as
always, the lure of a very high payoff on a small bet for an extremely rare event overrides the expected value analysis for most
players.

With m = 15 , show that the top 4 prizes (25,000, 50,000, 100,000, 100,000) contribute only about 0.017 (less than 2 cents) to
the total expected value of about 0.714.

On the other hand, the standard deviation of the payoff varies quite a bit, from about 1 to about 55.

Although the game is highly unfavorable for each m, with expected value that is nearly constant, which do you think is better
for the gambler—a format with high standard deviation or one with low standard deviation?

This page titled 13.7: Lotteries is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

13.8: The Red and Black Game

In this section and the following three sections, we will study gambling strategies for one of the simplest gambling models. Yet in spite of the simplicity of the model, the mathematical analysis leads to some beautiful and sometimes surprising results that have importance and application well beyond gambling. Our exposition is based primarily on the classic book Inequalities for Stochastic Processes (How to Gamble if You Must) by Lester E. Dubins and Leonard J. Savage (1965).

Basic Theory
Assumptions
Here is the basic situation: The gambler starts with an initial sum of money. She bets on independent, probabilistically identical
games, each with two outcomes—win or lose. If she wins a game, she receives the amount of the bet on that game; if she loses a
game, she must pay the amount of the bet. Thus, the gambler plays at even stakes. This particular situation (IID games and even
stakes) is known as red and black, and is named for the color bets in the casino game roulette. Other examples are the pass and
don't pass bets in craps.
Let us try to formulate the gambling experiment mathematically. First, let I_n denote the outcome of the nth game for n ∈ N_+, where 1 denotes a win and 0 denotes a loss. These are independent indicator random variables with the same distribution:

P(I_j = 1) = p, P(I_j = 0) = q = 1 − p (13.8.1)

where p ∈ [0, 1] is the probability of winning an individual game. Thus, I = (I_1, I_2, …) is a sequence of Bernoulli trials.
If p = 0, then the gambler always loses and if p = 1 then the gambler always wins. These trivial cases are not interesting, so we will usually assume that 0 < p < 1. In real gambling houses, of course, p < 1/2 (that is, the games are unfair to the player), so we will be particularly interested in this case.

Random Processes
The gambler's fortune over time is the basic random process of interest: let X_0 denote the gambler's initial fortune and X_i the gambler's fortune after i games. The gambler's strategy consists of the decisions of how much to bet on the various games and when to quit. Let Y_i denote the amount of the ith bet, and let N denote the number of games played by the gambler. If we want to, we can always assume that the games go on forever, but with the assumption that the gambler bets 0 on all games after N. With this understanding, the game outcome, fortune, and bet processes are defined for all times i ∈ N_+.

The fortune process is related to the wager process as follows:

X_j = X_{j−1} + (2 I_j − 1) Y_j,  j ∈ N_+   (13.8.2)
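This recursion translates directly into a simulation. Here is a minimal sketch (my own, not from the text); it plays until the fortune hits 0 or the target a, the stopping rule adopted later in this section, and the betting function encodes the strategy.

```python
import random

def red_and_black(x0, a, p, bet, rng=random.Random(1)):
    """Simulate the fortune process X_j = X_{j-1} + (2 I_j - 1) Y_j until the
    fortune reaches 0 (ruin) or the target a. Returns (final fortune, games)."""
    x, n = x0, 0
    while 0 < x < a:
        y = bet(x)                        # Y_j: wager based on the current fortune
        i = 1 if rng.random() < p else 0  # I_j: indicator of winning game j
        x += (2 * i - 1) * y
        n += 1
    return x, n

# Example: timid play (unit bets) at an unfair game
trials = 10_000
wins = sum(red_and_black(8, 16, 0.45, lambda x: 1)[0] == 16 for _ in range(trials))
print(f"estimated win probability ≈ {wins / trials:.3f}")  # near 0.167 (exact value in Section 13.9)
```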

Strategies
The gambler's strategy can be very complicated. For example, the random variable Y_n, the gambler's bet on game n, or the event N = n − 1, her decision to stop after n − 1 games, could be based on the entire past history of the game, up to time n. Technically, this history forms a σ-algebra:

H_n = σ{X_0, Y_1, I_1, Y_2, I_2, …, Y_{n−1}, I_{n−1}}   (13.8.3)

Moreover, strategies could have additional sources of randomness. For example, a gambler playing roulette could partly base her bets on the roll of a lucky die that she keeps in her pocket. However, the gambler cannot see into the future (unfortunately from her point of view), so we can at least assume that Y_n and {N = n − 1} are independent of the future game outcomes (I_n, I_{n+1}, …).

At least in terms of expected value, any gambling strategy is futile if the games are unfair.

E(X_i) = E(X_{i−1}) + (2p − 1) E(Y_i) for i ∈ N_+

Proof

Suppose that the gambler has a positive probability of making a real bet on game i, so that E(Y_i) > 0. Then

1. E(X_i) < E(X_{i−1}) if p < 1/2
2. E(X_i) > E(X_{i−1}) if p > 1/2
3. E(X_i) = E(X_{i−1}) if p = 1/2

Proof

Thus on any game in which the gambler makes a positive bet, her expected fortune strictly decreases if the games are unfair, remains the same if the games are fair, and strictly increases if the games are favorable.
As we noted earlier, a general strategy can depend on the past history and can be randomized. However, since the underlying
Bernoulli games are independent, one might guess that these complicated strategies are no better than simple strategies in which the
amount of the bet and the decision to stop are based only on the gambler's current fortune. These simple strategies do indeed play a
fundamental role and are referred to as stationary, deterministic strategies. Such a strategy can be described by a betting function S
from the space of fortunes to the space of allowable bets, so that S(x) is the amount that the gambler bets when her current fortune
is x.

The Stopping Rule


From now on, we will assume that the gambler's stopping rule is a very simple and standard one: she will bet on the games until she either loses her entire fortune and is ruined, or reaches a fixed target fortune a:

N = min{n ∈ N : X_n = 0 or X_n = a}   (13.8.4)

Thus, any strategy (betting function) S must satisfy S(x) ≤ min{x, a − x} for 0 ≤ x ≤ a: the gambler cannot bet what she does not have, and will not bet more than is necessary to reach the target a.
If we want to, we can think of the difference between the target fortune and the initial fortune as the entire fortune of the house. With this interpretation, the player and the house play symmetric roles, but with complementary win probabilities: play continues until either the player is ruined or the house is ruined. Our main interest is in the final fortune X_N of the gambler. Note that this random variable takes just two values: 0 and a.

The mean and variance of the final fortune are given by

1. E(X_N) = a P(X_N = a)
2. var(X_N) = a^2 P(X_N = a) [1 − P(X_N = a)]

Presumably, the gambler would like to maximize the probability of reaching the target fortune. Is it better to bet small amounts or
large amounts, or does it not matter? How does the optimal strategy, if there is one, depend on the initial fortune, the target fortune,
and the game win probability?
We are also interested in E(N ), the expected number of games played. Perhaps a secondary goal of the gambler is to maximize the
expected number of games that she gets to play. Are the two goals compatible or incompatible? That is, can the gambler maximize
both her probability of reaching the target and the expected number of games played, or does maximizing one quantity necessarily
mean minimizing the other?
In the next two sections, we will analyze and compare two strategies that are in a sense opposites:
Timid Play: On each game, until she stops, the gambler makes a small constant bet, say $1.
Bold Play: On each game, until she stops, the gambler bets either her entire fortune or the amount needed to reach the target
fortune, whichever is smaller.
In the final section of the chapter, we will return to the question of optimal strategies.

This page titled 13.8: The Red and Black Game is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

13.9: Timid Play

Basic Theory
Recall that with the strategy of timid play in red and black, the gambler makes a small constant bet, say $1, on each game until she stops. Thus, on each game, the gambler's fortune either increases by 1 or decreases by 1, until the fortune reaches either 0 or the target a (which we assume is a positive integer). Thus, the fortune process (X_0, X_1, …) is a random walk on the fortune space {0, 1, …, a} with 0 and a as absorbing barriers.

As usual, we are interested in the probability of winning and the expected number of games. The key idea in the analysis is that
after each game, the fortune process simply starts over again, but with a different initial value. This is an example of the Markov
property, named for Andrei Markov. A separate chapter on Markov Chains explores these random processes in more detail. In
particular, this chapter has sections on Birth-Death Chains and Random Walks on Graphs, particular classes of Markov chains that
generalize the random processes that we are studying here.

Figure 13.9.1 : The transition graph for timid play

The Probability of Winning


Our analysis based on the Markov property suggests that we treat the initial fortune as a variable. Thus, we will denote the probability that the gambler reaches the target a, starting with an initial fortune x, by

f(x) = P(X_N = a | X_0 = x),  x ∈ {0, 1, …, a}   (13.9.1)

The function f satisfies the following difference equation and boundary conditions:

1. f(x) = q f(x − 1) + p f(x + 1) for x ∈ {1, 2, …, a − 1}
2. f(0) = 0, f(a) = 1
Proof

The difference equation is linear (in the unknown function f), homogeneous (because each term involves the unknown function f), and second order (because 2 is the difference between the largest and smallest fortunes in the equation). Recall that linear homogeneous difference equations can be solved by finding the roots of the characteristic equation.

The characteristic equation of the difference equation is p r^2 − r + q = 0, and the roots are r = 1 and r = q/p.

If p ≠ 1/2, then the roots are distinct. In this case, the probability that the gambler reaches her target is

f(x) = [(q/p)^x − 1] / [(q/p)^a − 1],  x ∈ {0, 1, …, a}   (13.9.2)

If p = 1/2, the characteristic equation has a single root 1 of multiplicity 2. In this case, the probability that the gambler reaches her target is simply the ratio of the initial fortune to the target fortune:

f(x) = x/a,  x ∈ {0, 1, …, a}   (13.9.3)

Thus, we have the distribution of the final fortune X_N in either case:

P(X_N = 0 | X_0 = x) = 1 − f(x),  P(X_N = a | X_0 = x) = f(x);  x ∈ {0, 1, …, a}   (13.9.4)
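These formulas are simple to implement; here is a sketch of my own (the function name and interface are illustrative, not from the text):

```python
def timid_win(x, a, p):
    """Probability that timid play (unit bets) reaches the target a from
    fortune x, with game win probability p; equations (13.9.2) and (13.9.3)."""
    if p == 0.5:
        return x / a
    rho = (1 - p) / p  # q/p, the nontrivial root of the characteristic equation
    return (rho ** x - 1) / (rho ** a - 1)

print(timid_win(8, 16, 0.45))   # ≈ 0.167, matching the simulation in Section 13.8
print(timid_win(50, 100, 0.5))  # 0.5: with fair games, f(x) = x/a
```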

In the red and black experiment, choose Timid Play. Vary the initial fortune, target fortune, and game win probability and note
how the probability of winning the game changes. For various values of the parameters, run the experiment 1000 times and
compare the relative frequency of winning a game to the probability of winning a game.

As a function of x, for fixed p and a,

1. f is increasing from 0 to 1.
2. f is concave upward if p < 1/2 and concave downward if p > 1/2. Of course, f is linear if p = 1/2.

f is continuous as a function of p, for fixed x and a.

Proof

For fixed x and a, f(x) increases from 0 to 1 as p increases from 0 to 1.

Figure 13.9.2 : The graph of f for p = 0.4 , p = 0.5 , and p = 0.6

The Expected Number of Trials


Now let us consider the expected number of games needed with timid play, when the initial fortune is x:
g(x) = E(N ∣ X0 = x), x ∈ {0, 1, … , a} (13.9.5)

The function g satisfies the following difference equation and boundary conditions:
1. g(x) = qg(x − 1) + pg(x + 1) + 1 for x ∈ {1, 2, … , a − 1}
2. g(0) = 0 , g(a) = 0
Proof

The difference equation in the last exercise is linear, second order, but non-homogeneous (because of the constant term 1 on the
right side). The corresponding homogeneous equation is the equation satisfied by the win probability function f . Thus, only a little
additional work is needed to solve the non-homogeneous equation.

If p ≠ 1/2, then

g(x) = x/(q − p) − [a/(q − p)] f(x),  x ∈ {0, 1, …, a}   (13.9.6)

where f is the win probability function above.

If p = 1/2, then

g(x) = x(a − x),  x ∈ {0, 1, …, a}   (13.9.7)
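A direct implementation of the expected duration (again a sketch of mine, with the win probability formula repeated so the snippet stands alone):

```python
def timid_games(x, a, p):
    """Expected number of games for timid play from fortune x with target a;
    equations (13.9.6) and (13.9.7)."""
    if p == 0.5:
        return x * (a - x)
    q = 1 - p
    rho = q / p
    f = (rho ** x - 1) / (rho ** a - 1)  # win probability, equation (13.9.2)
    return x / (q - p) - (a / (q - p)) * f

print(timid_games(1, 100, 0.5))   # 99, the example discussed below
print(timid_games(50, 100, 0.5))  # 2500
```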

Consider g as a function of the initial fortune x, for fixed values of the game win probability p and the target fortune a .
1. g at first increases and then decreases.
2. g is concave downward.

When p = 1/2, the maximum value of g is a^2/4 and occurs when x = a/2. When p ≠ 1/2, the value of x where the maximum occurs is rather complicated.

Figure 13.9.3 : The graph of g for p = 0.4 , p = 0.5 , and p = 0.6

g is continuous as a function of p, for fixed x and a.

Proof

For many parameter settings, the expected number of games is surprisingly large. For example, suppose that p = 1/2 and the target fortune is 100. If the gambler's initial fortune is 1, then the expected number of games is 99, even though half of the time, the gambler will be ruined on the first game. If the initial fortune is 50, the expected number of games is 2500.

In the red and black experiment, select Timid Play. Vary the initial fortune, the target fortune, and the game win probability and notice how the expected number of games changes. For various values of the parameters, run the experiment 1000 times and compare the sample mean number of games to the expected value.

Increasing the Bet


What happens if the gambler makes constant bets, but with an amount higher than 1? The answer to this question may give insight
into what will happen with bold play.

In the red and black game, set the target fortune to 16, the initial fortune to 8, and the win probability to 0.45. Play 10 games
with each of the following strategies. Which seems to work best?
1. Bet 1 on each game (timid play).
2. Bet 2 on each game.
3. Bet 4 on each game.
4. Bet 8 on each game (bold play).

We will need to embellish our notation to indicate the dependence on the target fortune. Let

f(x, a) = P(X_N = a | X_0 = x),  x ∈ {0, 1, …, a}, a ∈ N_+   (13.9.8)

Now fix p and suppose that the target fortune is 2a and the initial fortune is 2x. If the gambler plays timidly (betting $1 each time), then of course, her probability of reaching the target is f(2x, 2a). On the other hand:

Suppose that the gambler bets $2 on each game. The fortune process (X_i/2 : i ∈ N) corresponds to timid play with initial fortune x and target fortune a, and therefore the probability that the gambler reaches the target is f(x, a).

Thus, we need to compare the probabilities f (2x, 2a) and f (x, a).

The win probability functions are related as follows:

f(2x, 2a) = f(x, a) · [(q/p)^x + 1] / [(q/p)^a + 1],  x ∈ {0, 1, …, a}   (13.9.9)

In particular,

1. f(2x, 2a) < f(x, a) if p < 1/2
2. f(2x, 2a) = f(x, a) if p = 1/2
3. f(2x, 2a) > f(x, a) if p > 1/2
Thus, it appears that increasing the bets is a good idea if the games are unfair, a bad idea if the games are favorable, and makes no
difference if the games are fair.
What about the expected number of games played? It seems almost obvious that if the bets are increased, the expected number of
games played should decrease, but a direct analysis using the expected value function above is harder than one might hope (try it!).
We will use a different method, one that actually gives better results. Specifically, we will have the $1 and $2 gamblers bet on the
same underlying sequence of games, so that the two fortune processes are defined on the same sample space. Then we can compare
the actual random variables (the number of games played), which in turn leads to a comparison of their expected values. Recall that
this general method is referred to as coupling.

Let X_n denote the fortune after n games for the gambler making $1 bets (simple timid play). Then 2X_n − X_0 is the fortune after n games for the gambler making $2 bets (with the same initial fortune, betting on the same sequence of games). Assume again that the initial fortune is 2x and the target fortune 2a, where 0 < x < a. Let N_1 denote the number of games played by the $1 gambler, and N_2 the number of games played by the $2 gambler. Then

1. If the $1 gambler falls to fortune x, the $2 gambler is ruined (fortune 0).
2. If the $1 gambler hits fortune x + a, the $2 gambler reaches the target 2a.
3. The $1 gambler must hit x before hitting 0 and must hit x + a before hitting 2a.
4. N_2 < N_1 given X_0 = 2x.
5. E(N_2 | X_0 = 2x) < E(N_1 | X_0 = 2x)

Of course, the expected values agree (and are both 0) if x = 0 or x = a. This result shows that N_2 is stochastically smaller than N_1 when the gamblers are not playing the same sequence of games (so that the random variables are not defined on the same sample space).

Generalize the analysis in this subsection to compare timid play with the strategy of betting $k on each game (let the initial fortune be kx and the target fortune ka).

It appears that with unfair games, the larger the bets the better, at least in terms of the probability of reaching the target. Thus, we
are naturally led to consider bold play.

This page titled 13.9: Timid Play is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

13.10: Bold Play

Basic Theory
Preliminaries
Recall that with the strategy of bold play in red and black, the gambler on each game bets either her entire fortune or the amount
needed to reach the target fortune, whichever is smaller. As usual, we are interested in the probability that the player reaches the
target and the expected number of trials. The first interesting fact is that only the ratio of the initial fortune to the target fortune
matters, quite in contrast to timid play.

Suppose that the gambler plays boldly with initial fortune x and target fortune a. As usual, let X = (X_0, X_1, …) denote the fortune process for the gambler. For any c > 0, the random process cX = (cX_0, cX_1, …) is the fortune process for bold play with initial fortune cx and target fortune ca.

Because of this result, it is convenient to use the target fortune as the monetary unit and to allow irrational, as well as rational,
initial fortunes. Thus, the fortune space is [0, 1]. Sometimes in our analysis we will ignore the states 0 or 1; clearly there is no harm
in this because in these states, the game is over.

Recall that the betting function S is the function that gives the amount bet as a function of the current fortune. For bold play, the betting function is

S(x) = min{x, 1 − x} = { x, 0 ≤ x ≤ 1/2;  1 − x, 1/2 ≤ x ≤ 1 }   (13.10.1)

Figure 13.10.1: The betting function for bold play

The Probability of Winning


We will denote the probability that the bold gambler reaches the target a = 1, starting from the initial fortune x ∈ [0, 1], by F(x). By the scaling property, the probability that the bold gambler reaches some other target value a > 0, starting from x ∈ [0, a], is F(x/a).

The function F satisfies the following functional equation and boundary conditions:

1. F(x) = { p F(2x), 0 ≤ x ≤ 1/2;  p + q F(2x − 1), 1/2 ≤ x ≤ 1 }
2. F(0) = 0, F(1) = 1

From the previous result, and a little thought, it should be clear that an important role is played by the following function:

Let d be the function defined on [0, 1) by

d(x) = 2x − ⌊2x⌋ = { 2x, 0 ≤ x < 1/2;  2x − 1, 1/2 ≤ x < 1 }   (13.10.2)

The function d is called the doubling function, mod 1, since d(x) gives the fractional part of 2x.

Note that until the last bet that ends the game (with the player ruined or victorious), the successive fortunes of the player follow
iterates of the map d . Thus, bold play is intimately connected with the dynamical system associated with d .

Figure 13.10.2: The doubling map, modulo 1

Binary Expansions
One of the keys to our analysis is to represent the initial fortune in binary form.

The binary expansion of x ∈ [0, 1) is

x = ∑_{i=1}^∞ x_i / 2^i   (13.10.3)

where x_i ∈ {0, 1} for each i ∈ N_+. This representation is unique except when x is a binary rational (sometimes also called a dyadic rational), that is, a number of the form k/2^n where n ∈ N_+ and k ∈ {1, 3, …, 2^n − 1}; the positive integer n is called the rank of x. Binary rationals are discussed in more detail in the chapter on Foundations.

For a binary rational x of rank n, we will use the standard terminating representation where x_n = 1 and x_i = 0 for i > n. Rank can be extended to all numbers in [0, 1) by defining the rank of 0 to be 0 (0 is also considered a binary rational) and by defining the rank of a binary irrational to be ∞. We will denote the rank of x by r(x).
Applied to the binary sequences, the doubling function d is the shift operator:

For x ∈ [0, 1), [d(x)]_i = x_{i+1}.

Bold play in red and black can be elegantly described by comparing the bits of the initial fortune with the game bits.

Suppose that the gambler starts with initial fortune x ∈ (0, 1). The gambler eventually reaches the target 1 if and only if there exists a positive integer k such that I_j = 1 − x_j for j ∈ {1, 2, …, k − 1} and I_k = x_k. That is, the gambler wins if and only if, when the game bit agrees with the corresponding fortune bit for the first time, that bit is 1.

The random variable whose bits are the complements of the game bits will play an important role in our analysis. Thus, let

W = ∑_{j=1}^∞ (1 − I_j) / 2^j   (13.10.4)

Note that W is a well defined random variable taking values in [0, 1].

Suppose that the gambler starts with initial fortune x ∈ (0, 1). Then the gambler reaches the target 1 if and only if W <x .

Proof

W has a continuous distribution. That is, P(W = x) = 0 for any x ∈ [0, 1].

From the previous two results, it follows that F is simply the distribution function of W . In particular, F is an increasing function,
and since W has a continuous distribution, F is a continuous function.

The success function F is the unique continuous solution of the functional equation above.
Proof

If we introduce a bit more notation, we can give a nice expression for F(x), and later for the expected number of games G(x). Let p_0 = p and p_1 = q = 1 − p.

The win probability function F can be expressed as follows:

F(x) = ∑_{n=1}^∞ p_{x_1} ⋯ p_{x_{n−1}} p x_n   (13.10.5)

Note that p ⋅ x_n in the last expression is correct; it's not a misprint of p_{x_n}. Thus, only terms with x_n = 1 are included in the sum.

F is strictly increasing on [0, 1]. This means that the distribution of W has support [0, 1]; that is, there are no subintervals of
[0, 1] that have positive length, but 0 probability.

In particular,

1. F(1/8) = p^3
2. F(2/8) = p^2
3. F(3/8) = p^2 + p^2 q
4. F(4/8) = p
5. F(5/8) = p + p^2 q
6. F(6/8) = p + pq
7. F(7/8) = p + pq + pq^2
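The series (13.10.5) is easy to evaluate from the binary digits of x. Here is a sketch of my own that reproduces the values above; exact rational arithmetic keeps the bit extraction clean.

```python
from fractions import Fraction

def bold_win(x, p, nmax=64):
    """Win probability F(x) for bold play, from the series (13.10.5).
    Terminates at the rank of x for binary rationals; otherwise uses nmax bits."""
    q = 1 - p
    total, prefix = 0.0, 1.0       # prefix = p_{x_1} ... p_{x_{n-1}}
    x = Fraction(x)
    for _ in range(nmax):
        x *= 2
        bit = 1 if x >= 1 else 0   # next binary digit x_n
        x -= bit
        if bit:                    # only terms with x_n = 1 contribute
            total += prefix * p
        prefix *= p if bit == 0 else q   # p_0 = p, p_1 = q
        if x == 0:
            break
    return total

p, q = 0.4, 0.6
print(bold_win(Fraction(3, 8), p))  # 0.256
print(p ** 2 + p ** 2 * q)          # p^2 + p^2 q = 0.256, as in the list above
```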

If p = 1/2 then F(x) = x for x ∈ [0, 1].

Proof

Thus, for p = 1/2 (fair trials), the probability that the bold gambler reaches the target fortune a starting from the initial fortune x is x/a, just as it is for the timid gambler. Note also that the random variable W has the uniform distribution on [0, 1]. When p ≠ 1/2, the distribution of W is quite strange. To state the result succinctly, we will indicate the dependence of the probability measure P on the parameter p ∈ (0, 1). First we define

C_p = {x ∈ [0, 1] : (1/n) ∑_{i=1}^n (1 − x_i) → p as n → ∞}   (13.10.6)

Thus, C_p is the set of x ∈ [0, 1] for which the relative frequency of 0's in the binary expansion is p.

For distinct p, t ∈ (0, 1),

1. P_p(W ∈ C_p) = 1
2. P_p(W ∈ C_t) = 0

Proof

When p ≠ 1/2, W does not have a probability density function (with respect to Lebesgue measure on [0, 1]), even though W has a continuous distribution.

Proof

When p ≠ 1/2, F has derivative 0 at almost every point in [0, 1], even though it is strictly increasing.

Figure 13.10.3: The graphs of F when p = 0.5 , p = 0.4 , and p = 0.2

In the red and black experiment, select Bold Play. Vary the initial fortune, target fortune, and game win probability with the
scroll bars and note how the probability of winning the game changes. In particular, note that this probability depends only on
x/a. Now for various values of the parameters, run the experiment 1000 times and compare the relative frequency function to

the probability density function.

The Expected Number of Trials


Let G(x) = E(N | X_0 = x) for x ∈ [0, 1], the expected number of trials starting at x. For any other target fortune a > 0, the expected number of trials starting at x ∈ [0, a] is just G(x/a).

G satisfies the following functional equation and boundary conditions:

1. G(x) = { 1 + p G(2x), 0 < x ≤ 1/2;  1 + q G(2x − 1), 1/2 ≤ x < 1 }
2. G(0) = 0, G(1) = 0
Proof

Note, interestingly, that the functional equation is not satisfied at x =0 or x =1 . As before, we can give an alternate analysis
using the binary representation of an initial fortune x ∈ (0, 1).

Suppose that the initial fortune of the gambler is x ∈ (0, 1). Then N = min{k ∈ N_+ : I_k = x_k or k = r(x)}.
Proof

We can give an explicit formula for the expected number of trials G(x) in terms of the binary representation of x. Recall our special notation: p_0 = p, p_1 = q = 1 − p.

Suppose that x ∈ (0, 1). Then

G(x) = ∑_{n=0}^{r(x)−1} p_{x_1} ⋯ p_{x_n}   (13.10.7)

Note that the n = 0 term is 1, since the product is empty. The sum has a finite number of terms if x is a binary rational, and the
sum has an infinite number of terms if x is a binary irrational.

In particular,

1. G(1/8) = 1 + p + p^2
2. G(2/8) = 1 + p
3. G(3/8) = 1 + p + pq
4. G(4/8) = 1
5. G(5/8) = 1 + q + pq
6. G(6/8) = 1 + q
7. G(7/8) = 1 + q + q^2

If p = 1/2 then

G(x) = { 2 − 1/2^{r(x)−1}, x a binary rational;  2, x a binary irrational }   (13.10.8)
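The series (13.10.7) can be evaluated the same way as F (another sketch of mine):

```python
from fractions import Fraction

def bold_games(x, p, nmax=64):
    """Expected number of games G(x) for bold play, from the series (13.10.7).
    For a binary rational the sum terminates at the rank r(x); otherwise
    nmax terms give a truncation of the infinite series."""
    q = 1 - p
    total, term = 0.0, 1.0  # term = p_{x_1} ... p_{x_n}; the n = 0 term is 1
    x = Fraction(x)
    while x > 0 and nmax > 0:
        total += term
        x *= 2
        bit = 1 if x >= 1 else 0   # next binary digit
        x -= bit
        term *= p if bit == 0 else q   # p_0 = p, p_1 = q
        nmax -= 1
    return total

print(bold_games(Fraction(5, 8), 0.5))  # 1.75 = 2 - 1/2^2, matching (13.10.8)
```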

Figure 13.10.4: The expected number of games in bold play with fair games

In the red and black experiment, select Bold Play. Vary x, a , and p with the scroll bars and note how the expected number of
trials changes. In particular, note that the mean depends only on the ratio x/a. For selected values of the parameters, run the
experiment 1000 times and compare the sample mean to the distribution mean.

For fixed x, G is continuous as a function of p.

However, as a function of the initial fortune x, for fixed p, the function G is very irregular.

G is discontinuous at the binary rationals in [0, 1] and continuous at the binary irrationals.

This page titled 13.10: Bold Play is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

13.11: Optimal Strategies

Basic Theory
Definitions
Recall that the stopping rule for red and black is to continue playing until the gambler is ruined or her fortune reaches the target fortune a. Thus, the gambler's strategy is to decide how much to bet on each game before she must stop. Suppose that we have a class of strategies that correspond to certain valid fortunes and bets; A will denote the set of fortunes and B_x will denote the set of valid bets for x ∈ A. For example, sometimes (as with timid play) we might want to restrict the fortunes to the set of integers {0, 1, …, a}; other times (as with bold play) we might want to use the interval [0, 1] as the fortune space. As for the bets, recall that the gambler cannot bet what she does not have and will not bet more than she needs in order to reach the target. Thus, a betting function S must satisfy

S(x) ≤ min{x, a − x},  x ∈ A   (13.11.1)

Moreover, we always restrict our strategies to those for which the stopping time N is finite.

The success function of a strategy is the probability that the gambler reaches the target a with that strategy, as a function of the
initial fortune x. A strategy with success function V is optimal if for any other strategy with success function U , we have
U (x) ≤ V (x) for x ∈ A .

If there exists an optimal strategy, then the optimal success function is unique.

However, there may not exist an optimal strategy or there may be several optimal strategies. Moreover, the optimality question
depends on the value of the game win probability p, in addition to the structure of fortunes and bets.

A Condition for Optimality


Our main theorem gives a condition for optimality:

A strategy S with success function V is optimal if


p V(x + y) + q V(x − y) ≤ V(x);  x ∈ A, y ∈ B_x   (13.11.2)

Proof

Favorable Trials with a Minimum Bet


Suppose now that p ≥ 1/2, so that the trials are favorable (or at least not unfair) to the gambler. Next, suppose that all bets must be multiples of a basic unit, which we might as well assume is $1. Of course, real gambling houses have this restriction. Thus the set of valid fortunes is A = {0, 1, …, a} and the set of valid bets for x ∈ A is B_x = {0, 1, …, min{x, a − x}}. Our main result for this case is

Timid play is an optimal strategy.


Proof

In the red and black game set the target fortune to 16, the initial fortune to 8, and the game win probability to 0.45. Define the
strategy of your choice and play 100 games. Compare your relative frequency of wins with the probability of winning with
timid play.

Favorable Trials without a Minimum Bet


We will now assume that the house allows arbitrarily small bets and that p > 1/2, so that the trials are strictly favorable. In this case it is natural to take the target as the monetary unit, so that the set of fortunes is A = [0, 1] and the set of bets for x ∈ A is B_x = [0, min{x, 1 − x}]. Our main result for this case is given below. The results for timid play will play an important role in the analysis, so we will let f(j, a) denote the probability of reaching an integer target a, starting at the integer j ∈ [0, a], with unit bets.

The optimal success function is V(x) = 1 for x ∈ (0, 1]: with strictly favorable trials and arbitrarily small bets, the gambler can reach the target with probability arbitrarily close to 1.
Proof

Unfair Trials
We will now assume that p ≤ 1/2, so that the trials are unfair, or at least not favorable. As before, we will take the target fortune as the basic monetary unit and allow any valid fraction of this unit as a bet. Thus, the set of fortunes is A = [0, 1], and the set of bets for x ∈ A is B_x = [0, min{x, 1 − x}]. Our main result for this case is

Bold play is optimal.


Proof

In the red and black game, set the target fortune to 16, the initial fortune to 8, and the game win probability to 0.45. Define the
strategy of your choice and play 100 games. Compare your relative frequency of wins with the probability of winning with
bold play.

Other Optimal Strategies in the Sub-Fair Case


Consider again the sub-fair case where p ≤ 1/2, so that the trials are not favorable to the gambler. We will show that bold play is not the only optimal strategy; amazingly, there are infinitely many optimal strategies. Recall first that the bold strategy has betting function

S_1(x) = min{x, 1 − x} = { x, 0 ≤ x ≤ 1/2;  1 − x, 1/2 ≤ x ≤ 1 }   (13.11.8)

Figure 13.11.1: The betting function for bold play

Consider the following strategy, which we will refer to as the second order bold strategy:

1. With fortune x ∈ (0, 1/2), play boldly with the object of reaching 1/2 before falling to 0.
2. With fortune x ∈ (1/2, 1), play boldly with the object of reaching 1 without falling below 1/2.
3. With fortune 1/2, play boldly and bet 1/2.

The second order bold strategy has betting function S_2 given by

S_2(x) = { x,        0 ≤ x < 1/4;
           1/2 − x,  1/4 ≤ x < 1/2;
           1/2,      x = 1/2;
           x − 1/2,  1/2 < x ≤ 3/4;
           1 − x,    3/4 < x ≤ 1 }   (13.11.9)

Figure 13.11.2: The betting function for the second order bold strategy

The second order bold strategy is optimal.


Proof

Once we understand how this construction is done, it's straightforward to define the third order bold strategy and show that it's
optimal as well.

Figure 13.11.3: The betting function for the third order bold strategy

Explicitly give the third order betting function and show that the strategy is optimal.

More generally, we can define the n th order bold strategy and show that it is optimal as well.

The sequence of bold strategies can be defined recursively from the basic bold strategy S_1 as follows:

S_{n+1}(x) = { (1/2) S_n(2x),      0 ≤ x < 1/2;
               1/2,                x = 1/2;
               (1/2) S_n(2x − 1),  1/2 < x ≤ 1 }   (13.11.12)

S_n is optimal for each n.
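The recursion (13.11.12) is a one-liner to implement; this sketch (mine) evaluates S_n and can be used to reproduce the graphs of the betting functions.

```python
def bold_bet(x, n=1):
    """Betting function S_n of the n-th order bold strategy, via (13.11.12)."""
    if n == 1:
        return min(x, 1 - x)  # S_1: basic bold play
    if x == 0.5:
        return 0.5
    if x < 0.5:
        return 0.5 * bold_bet(2 * x, n - 1)
    return 0.5 * bold_bet(2 * x - 1, n - 1)

print(bold_bet(0.3, 1))  # 0.3: S_1 bets the whole fortune below 1/2
print(bold_bet(0.3, 2))  # 0.2: S_2 bets 1/2 - x on [1/4, 1/2)
```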

Even more generally, we can define an optimal strategy T in the following way: for each x ∈ [0, 1], select n_x ∈ N_+ and let T(x) = S_{n_x}(x). The graph below shows a few of the graphs of the bold strategies. For an optimal strategy T, we just need to select, for each x, a bet on one of the graphs.

Figure 13.11.4: The family of bold strategies

Martingales
Let's return to the unscaled formulation of red and black, where the target fortune is a ∈ (0, ∞) and the initial fortune is x ∈ (0, a). In the subfair case, when p ≤ 1/2, no strategy can do better than the optimal strategies, so the win probability is bounded by x/a (with equality when p = 1/2). Another elegant proof of this is given in the section on inequalities in the chapter on martingales.

This page titled 13.11: Optimal Strategies is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

CHAPTER OVERVIEW
14: The Poisson Process
The Poisson process is one of the most important random processes in probability theory. It is widely used to model random
“points” in time and space, such as the times of radioactive emissions, the arrival times of customers at a service center, and the
positions of flaws in a piece of material. Several important probability distributions arise naturally from the Poisson process—the
Poisson distribution, the exponential distribution, and the gamma distribution. The process has a beautiful mathematical structure,
and is used as a foundation for building a number of other, more complicated random processes.
14.1: Introduction to the Poisson Process
14.2: The Exponential Distribution
14.3: The Gamma Distribution
14.4: The Poisson Distribution
14.5: Thinning and Superposition
14.6: Non-homogeneous Poisson Processes
14.7: Compound Poisson Processes
14.8: Poisson Processes on General Spaces

This page titled 14: The Poisson Process is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

14.1: Introduction to the Poisson Process

The Poisson Model


We will consider a process in which points occur randomly in time. The phrase points in time is generic and could represent, for
example:
The times when a sample of radioactive material emits particles
The times when customers arrive at a service station
The times when file requests arrive at a server computer
The times when accidents occur at a particular intersection
The times when a device fails and is replaced by a new device
It turns out that under some basic assumptions that deal with independence and uniformity in time, a single, one-parameter
probability model governs all such random processes. This is an amazing result, and because of it, the Poisson process (named after
Simeon Poisson) is one of the most important in probability theory.

Run the Poisson experiment with the default settings in single step mode. Note the random points in time.

Random Variables
There are three collections of random variables that can be used to describe the process. First, let X_1 denote the time of the first arrival, and X_i the time between the (i − 1)st and ith arrivals for i ∈ {2, 3, …}. Thus, X = (X_1, X_2, …) is the sequence of inter-arrival times. Next, let T_n denote the time of the nth arrival for n ∈ N_+. It will be convenient to define T_0 = 0, although we do not consider this as an arrival. Thus T = (T_0, T_1, …) is the sequence of arrival times. Clearly T is the partial sum process associated with X, and so in particular each sequence determines the other:

T_n = ∑_{i=1}^n X_i,  n ∈ N   (14.1.1)

X_n = T_n − T_{n−1},  n ∈ N_+   (14.1.2)

Next, let N_t denote the number of arrivals in (0, t] for t ∈ [0, ∞). The random process N = (N_t : t ≥ 0) is the counting process. The arrival time process T and the counting process N are inverses of one another in a sense, and in particular each process determines the other:

T_n = min{t ≥ 0 : N_t = n},  n ∈ N   (14.1.3)

N_t = max{n ∈ N : T_n ≤ t},  t ∈ [0, ∞)   (14.1.4)

Note also that N_t ≥ n if and only if T_n ≤ t for n ∈ N and t ∈ [0, ∞), since each of these events means that there are at least n arrivals in the interval (0, t].
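These relationships are easy to check in simulation. The sketch below (my own; it anticipates the exponential inter-arrival times established in the next two sections) generates arrival times and verifies the inversion between T and N.

```python
import numpy as np

rng = np.random.default_rng(0)
r = 2.0                                      # average of r arrivals per unit time

X = rng.exponential(scale=1 / r, size=1000)  # inter-arrival times X_1, X_2, ...
T = np.cumsum(X)                             # arrival times T_n = X_1 + ... + X_n

def N(t):
    """Counting process: the number of arrivals in (0, t]."""
    return np.searchsorted(T, t, side="right")

t, n = 5.0, 10
print(N(t))                      # roughly r * t on average
print(N(t) >= n, T[n - 1] <= t)  # the events {N_t >= n} and {T_n <= t} agree
```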


Sometimes it will be helpful to extend the notation of the counting process. For A ⊆ [0, ∞) (measurable, of course), let N(A) denote the number of arrivals in A:

N(A) = #{n ∈ N_+ : T_n ∈ A} = ∑_{n=1}^∞ 1(T_n ∈ A)   (14.1.5)

Thus, A ↦ N(A) is the counting measure associated with the random points (T_1, T_2, …), so in particular it is a random measure. For our original counting process, note that N_t = N(0, t] for t ≥ 0. Thus, t ↦ N_t is a (random) distribution function, and A ↦ N(A) is the (random) measure associated with this distribution function.

The Basic Assumption


The assumption that we will make can be described intuitively (but imprecisely) as follows: if we fix a time t, whether constant or one of the arrival times, then the process after time t is independent of the process before time t and behaves probabilistically just like the original process. Thus, the random process has a strong renewal property. Making the strong renewal assumption precise will enable us to completely specify the probabilistic behavior of the process, up to a single, positive parameter.

Think about the strong renewal assumption for each of the specific applications given above.

Run the Poisson experiment with the default settings in single step mode. See if you can detect the strong renewal assumption.

As a first step, note that part of the renewal assumption, namely that the process “restarts” at each arrival time, independently of the
past, implies the following result:

The sequence of inter-arrival times X is an independent, identically distributed sequence


Proof

A model of random points in time in which the inter-arrival times are independent and identically distributed (so that the process
“restarts” at each arrival time) is known as a renewal process. A separate chapter explores Renewal Processes in detail. Thus, the
Poisson process is a renewal process, but a very special one, because we also require that the renewal assumption hold at fixed
times.

Analogy with Bernoulli Trials


In some sense, the Poisson process is a continuous time version of the Bernoulli trials process. To see this, suppose that we have a Bernoulli trials process with success parameter p ∈ (0, 1), and that we think of each success as a random point in discrete time. Then this process, like the Poisson process (and in fact any renewal process), is completely determined by the sequence of inter-arrival times X = (X_1, X_2, …) (in this case, the number of trials between successive successes), the sequence of arrival times T = (T_0, T_1, …) (in this case, the trial numbers of the successes), and the counting process (N_t : t ∈ N) (in this case, the number of successes in the first t trials). Also like the Poisson process, the Bernoulli trials process has the strong renewal property: at each fixed time and at each arrival time, the process “starts over” independently of the past. But of course, time is discrete in the Bernoulli trials model and continuous in the Poisson model. The Bernoulli trials process can be characterized in terms of each of the three sets of random variables.

Each of the following statements characterizes the Bernoulli trials process with success parameter p ∈ (0, 1):

1. The inter-arrival time sequence X is a sequence of independent variables, and each has the geometric distribution on N_+ with success parameter p.
2. The arrival time sequence T has stationary, independent increments, and for n ∈ N_+, T_n has the negative binomial distribution with stopping parameter n and success parameter p.
3. The counting process N has stationary, independent increments, and for t ∈ N, N_t has the binomial distribution with trial parameter t and success parameter p.

Run the binomial experiment with n = 50 and p = 0.1 . Note the random points in discrete time.

Run the Poisson experiment with t = 5 and r = 1 . Note the random points in continuous time and compare with the behavior
in the previous exercise.

As we develop the theory of the Poisson process we will frequently refer back to the analogy with Bernoulli trials. In particular, we
will show that if we run the Bernoulli trials at a faster and faster rate but with a smaller and smaller success probability, in just the
right way, the Bernoulli trials process converges to the Poisson process.

This page titled 14.1: Introduction to the Poisson Process is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

14.2: The Exponential Distribution

Basic Theory
The Memoryless Property
Recall that in the basic model of the Poisson process, we have “points” that occur randomly in time. The sequence of inter-arrival times is X = (X_1, X_2, …). The strong renewal assumption states that at each arrival time and at each fixed time, the process must probabilistically restart, independent of the past. The first part of that assumption implies that X is a sequence of independent, identically distributed variables. The second part of the assumption implies that if the first arrival has not occurred by time s, then the time remaining until the arrival occurs must have the same distribution as the first arrival time itself. This is known as the memoryless property and can be stated in terms of a general random variable as follows:

Suppose that X takes values in [0, ∞). Then X has the memoryless property if the conditional distribution of X − s given X > s is the same as the distribution of X for every s ∈ [0, ∞). Equivalently,

P(X > t + s | X > s) = P(X > t),  s, t ∈ [0, ∞)   (14.2.1)

The memoryless property determines the distribution of X up to a positive parameter, as we will see now.

Distribution functions
Suppose that X takes values in [0, ∞) and satisfies the memoryless property.

X has a continuous distribution and there exists r ∈ (0, ∞) such that the distribution function F of X is

F(t) = 1 − e^{−rt},  t ∈ [0, ∞)   (14.2.2)

Proof

The probability density function of X is

f(t) = r e^{−rt},  t ∈ [0, ∞)   (14.2.7)

1. f is decreasing on [0, ∞).
2. f is concave upward on [0, ∞).
3. f(t) → 0 as t → ∞.
Proof

A random variable with the distribution function above, or equivalently the probability density function in the last theorem, is said to have the exponential distribution with rate parameter r. The reciprocal 1/r is known as the scale parameter (as will be justified below). Note that the mode of the distribution is 0, regardless of the parameter r, so the mode is not very helpful as a measure of center.
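A quick numerical illustration of the memoryless property (a sketch of mine, using scipy, whose exponential distribution is parameterized by the scale 1/r):

```python
from scipy.stats import expon

r, s, t = 0.5, 2.0, 3.0
X = expon(scale=1 / r)

# P(X > t + s | X > s) = P(X > s + t) / P(X > s) should equal P(X > t)
print(X.sf(s + t) / X.sf(s))  # 0.2231...
print(X.sf(t))                # 0.2231... = e^{-r t}
```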

In the gamma experiment, set n = 1 so that the simulated random variable has an exponential distribution. Vary r with the
scroll bar and watch how the shape of the probability density function changes. For selected values of r, run the experiment
1000 times and compare the empirical density function to the probability density function.

The quantile function of X is

F^{−1}(p) = −ln(1 − p) / r,  p ∈ [0, 1)   (14.2.8)

1. The median of X is (1/r) ln(2) ≈ 0.6931/r
2. The first quartile of X is (1/r)[ln(4) − ln(3)] ≈ 0.2877/r
3. The third quartile of X is (1/r) ln(4) ≈ 1.3863/r
4. The interquartile range is (1/r) ln(3) ≈ 1.0986/r

Proof

In the special distribution calculator, select the exponential distribution. Vary the scale parameter (which is 1/r) and note the
shape of the distribution/quantile function. For selected values of the parameter, compute a few values of the distribution
function and the quantile function.

Returning to the Poisson model, we have our first formal definition:

A process of random points in time is a Poisson process with rate r ∈ (0, ∞) if and only if the inter-arrival times are independent, and each has the exponential distribution with rate r.

Constant Failure Rate


Suppose now that X has a continuous distribution on [0, ∞) and is interpreted as the lifetime of a device. If F denotes the distribution function of X, then F^c = 1 − F is the reliability function of X. If f denotes the probability density function of X, then the failure rate function h is given by

h(t) = f(t) / F^c(t),  t ∈ [0, ∞)   (14.2.9)

If X has the exponential distribution with rate r > 0, then from the results above, the reliability function is F^c(t) = e^{−rt} and the probability density function is f(t) = r e^{−rt}, so trivially X has constant failure rate r. The converse is also true.

If X has constant failure rate r > 0 then X has the exponential distribution with parameter r.
Proof

The memoryless and constant failure rate properties are the most famous characterizations of the exponential distribution, but are
by no means the only ones. Indeed, entire books have been written on characterizations of this distribution.

Moments
Suppose again that X has the exponential distribution with rate parameter r > 0. Naturally, we want to know the mean, variance, and various other moments of X.

If n ∈ N then E(X^n) = n!/r^n.

Proof

More generally, E(X^a) = Γ(a + 1)/r^a for every a ∈ [0, ∞), where Γ is the gamma function.

In particular,

1. E(X) = 1/r
2. var(X) = 1/r^2
3. skew(X) = 2
4. kurt(X) = 9

In the context of the Poisson process, the parameter r is known as the rate of the process. On average, there are 1/r time units
between arrivals, so the arrivals come at an average rate of r per unit time. The Poisson process is completely determined by the
sequence of inter-arrival times, and hence is completely determined by the rate r.
Note also that the mean and standard deviation are equal for an exponential distribution, and that the median is always smaller than
the mean. Recall also that skewness and kurtosis are standardized measures, and so do not depend on the parameter r (which is the
reciprocal of the scale parameter).
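These moment formulas are easy to check by simulation (my own sketch; note that kurt here is the ordinary kurtosis, so scipy's excess-kurtosis function is called with fisher=False):

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
r = 2.0
x = rng.exponential(scale=1 / r, size=1_000_000)

print(x.mean(), 1 / r)            # mean 1/r
print(x.var(), 1 / r ** 2)        # variance 1/r^2
print(skew(x))                    # about 2
print(kurtosis(x, fisher=False))  # about 9 (ordinary, not excess, kurtosis)
```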

The moment generating function of X is

M(s) = E(e^{sX}) = r/(r − s),  s ∈ (−∞, r)   (14.2.12)

Proof

In the gamma experiment, set n = 1 so that the simulated random variable has an exponential distribution. Vary r with the
scroll bar and watch how the mean±standard deviation bar changes. For various values of r, run the experiment 1000 times
and compare the empirical mean and standard deviation to the distribution mean and standard deviation, respectively.

Additional Properties
The exponential distribution has a number of interesting and important mathematical properties. First, and not surprisingly, it's a
member of the general exponential family.

Suppose that X has the exponential distribution with rate parameter r ∈ (0, ∞). Then X has a one parameter general
exponential distribution, with natural parameter −r and natural statistic X.
Proof

The Scaling Property


As suggested earlier, the exponential distribution is a scale family, and 1/r is the scale parameter.

Suppose that X has the exponential distribution with rate parameter r >0 and that c >0 . Then cX has the exponential
distribution with rate parameter r/c.
Proof

Recall that multiplying a random variable by a positive constant frequently corresponds to a change of units (minutes into hours for
a lifetime variable, for example). Thus, the exponential distribution is preserved under such changes of units. In the context of the
Poisson process, this has to be the case, since the memoryless property, which led to the exponential distribution in the first place,
clearly does not depend on the time units.
In fact, the exponential distribution with rate parameter 1 is referred to as the standard exponential distribution. From the previous result, if Z has the standard exponential distribution and r > 0, then X = Z/r has the exponential distribution with rate parameter r. Conversely, if X has the exponential distribution with rate r > 0 then Z = rX has the standard exponential distribution.

Similarly, the Poisson process with rate parameter 1 is referred to as the standard Poisson process. If Z_i is the ith inter-arrival time for the standard Poisson process for i ∈ N_+, then letting X_i = Z_i/r for i ∈ N_+ gives the inter-arrival times for the Poisson process with rate r. Conversely, if X_i is the ith inter-arrival time of the Poisson process with rate r > 0 for i ∈ N_+, then Z_i = rX_i for i ∈ N_+ gives the inter-arrival times for the standard Poisson process.

Relation to the Geometric Distribution


In many respects, the geometric distribution is a discrete version of the exponential distribution. In particular, recall that the geometric distribution on N_+ is the only distribution on N_+ with the memoryless and constant rate properties. So it is not surprising that the two distributions are also connected through various transformations and limits.

Suppose that X has the exponential distribution with rate parameter r > 0. Then

1. ⌊X⌋ has the geometric distribution on N with success parameter 1 − e^{−r}.
2. ⌈X⌉ has the geometric distribution on N_+ with success parameter 1 − e^{−r}.
Proof

The following connection between the two distributions is interesting by itself, but will also be very important in the section on
splitting Poisson processes. In words, a random, geometrically distributed sum of independent, identically distributed exponential
variables is itself exponential.

Suppose that X = (X_1, X_2, …) is a sequence of independent variables, each with the exponential distribution with rate r. Suppose that U has the geometric distribution on N_+ with success parameter p and is independent of X. Then Y = ∑_{i=1}^U X_i has the exponential distribution with rate rp.


Proof

The next result explores the connection between the Bernoulli trials process and the Poisson process that was begun in the
Introduction.

For n ∈ N_+, suppose that U_n has the geometric distribution on N_+ with success parameter p_n, where n p_n → r > 0 as n → ∞. Then the distribution of U_n/n converges to the exponential distribution with parameter r as n → ∞.

Proof

To understand this result more clearly, suppose that we have a sequence of Bernoulli trials processes. In process n, we run the trials at a rate of n per unit time, with probability of success p_n. Thus, the actual time of the first success in process n is U_n/n. The last result shows that if n p_n → r > 0 as n → ∞, then the sequence of Bernoulli trials processes converges to the Poisson process with rate parameter r as n → ∞. We will return to this point in subsequent sections.

rate parameter r as n → ∞ . We will return to this point in subsequent sections.

Orderings and Order Statistics


Suppose that X and Y have exponential distributions with parameters a and b, respectively, and are independent. Then

P(X < Y) = a/(a + b)   (14.2.16)

Proof

The following theorem gives an important random version of the memoryless property.

Suppose that X and Y are independent variables taking values in [0, ∞) and that Y has the exponential distribution with rate
parameter r > 0 . Then X and Y − X are conditionally independent given X < Y , and the conditional distribution of Y − X
is also exponential with parameter r.
Proof

For our next discussion, suppose that X = (X_1, X_2, …, X_n) is a sequence of independent random variables, and that X_i has the exponential distribution with rate parameter r_i > 0 for each i ∈ {1, 2, …, n}.

Let U = min{X_1, X_2, …, X_n}. Then U has the exponential distribution with parameter ∑_{i=1}^n r_i.
Proof

In the context of reliability, if a series system has independent components, each with an exponentially distributed lifetime, then the
lifetime of the system is also exponentially distributed, and the failure rate of the system is the sum of the component failure rates.
In the context of random processes, if we have n independent Poisson process, then the new process obtained by combining the
random points in time is also Poisson, and the rate of the new process is the sum of the rates of the individual processes (we will
return to this point latter).

Let V = max{X_1, X_2, …, X_n}. Then V has distribution function F given by

F(t) = ∏_{i=1}^n (1 − e^{−r_i t}),  t ∈ [0, ∞)   (14.2.22)

Proof

Consider the special case where r_i = r ∈ (0, ∞) for each i ∈ {1, 2, …, n}. In statistical terms, X is a random sample of size n from the exponential distribution with parameter r. From the last couple of theorems, the minimum U has the exponential distribution with rate nr, while the maximum V has distribution function F(t) = (1 − e^{−rt})^n for t ∈ [0, ∞). Recall that U and V are the first and last order statistics, respectively.

In the order statistic experiment, select the exponential distribution.


1. Set k = 1 (this gives the minimum U ). Vary n with the scroll bar and note the shape of the probability density function.
For selected values of n , run the simulation 1000 times and compare the empirical density function to the true probability
density function.
2. Vary n with the scroll bar, set k = n each time (this gives the maximum V ), and note the shape of the probability density
function. For selected values of n , run the simulation 1000 times and compare the empirical density function to the true
probability density function.

Curiously, the distribution of the maximum of independent, identically distributed exponential variables is also the distribution of
the sum of independent exponential variables, with rates that grow linearly with the index.

Suppose that r_i = i r for each i ∈ {1, 2, …, n}, where r ∈ (0, ∞). Then Y = ∑_{i=1}^n X_i has distribution function F given by

F(t) = (1 − e^{−rt})^n,  t ∈ [0, ∞)   (14.2.23)

Proof

This result has an application to the Yule process, named for George Yule. The Yule process, which has some parallels with the
Poisson process, is studied in the chapter on Markov processes. We can now generalize the order probability above:

For i ∈ {1, 2, …, n},

P(X_i < X_j for all j ≠ i) = r_i / ∑_{j=1}^n r_j   (14.2.26)

Proof

Suppose that for each i, X_i is the time until an event of interest occurs (the arrival of a customer, the failure of a device, etc.) and that these times are independent and exponentially distributed. Then the first time U at which one of the events occurs is also exponentially distributed, and the probability that the first event to occur is event i is proportional to the rate r_i.
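A small simulation of this race of exponential clocks (my own sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
rates = np.array([1.0, 2.0, 3.0])
n = 100_000

# Each row is one race: independent exponential times with the given rates
times = rng.exponential(scale=1 / rates, size=(n, len(rates)))
winners = times.argmin(axis=1)

print(np.bincount(winners) / n)  # empirical frequencies that clock i rings first
print(rates / rates.sum())       # r_i / sum(r_j): [0.1667, 0.3333, 0.5]
```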

The probability of a total ordering is

P(X_1 < X_2 < ⋯ < X_n) = ∏_{i=1}^n [r_i / ∑_{j=i}^n r_j]   (14.2.27)

Proof

Of course, the probabilities of other orderings can be computed by permuting the parameters appropriately in the formula on the
right.
The result on minimums and the order probability result above are very important in the theory of continuous-time Markov chains. But for that application and others, it's convenient to extend the exponential distribution to two degenerate cases: point mass at 0 and point mass at ∞ (so the first is the distribution of a random variable that takes the value 0 with probability 1, and the second the distribution of a random variable that takes the value ∞ with probability 1). In terms of the rate parameter r and the distribution function F, point mass at 0 corresponds to r = ∞, so that F(t) = 1 for 0 < t < ∞. Point mass at ∞ corresponds to r = 0, so that F(t) = 0 for 0 < t < ∞. The memoryless property, as expressed in terms of the reliability function F^c, still holds for these degenerate cases on (0, ∞):

F^c(s) F^c(t) = F^c(s + t),  s, t ∈ (0, ∞)   (14.2.30)

We also need to extend some of the results above for a finite number of variables to a countably infinite number of variables. So for the remainder of this discussion, suppose that {X_i : i ∈ I} is a countable collection of independent random variables, and that X_i has the exponential distribution with parameter r_i ∈ (0, ∞) for each i ∈ I.

Let U = inf{X_i : i ∈ I}. Then U has the exponential distribution with parameter ∑_{i∈I} r_i.

Proof

For i ∈ I,

P(X_i < X_j for all j ∈ I − {i}) = r_i / ∑_{j∈I} r_j   (14.2.32)

Proof

We need one last result in this setting: a condition that ensures that the sum of an infinite collection of exponential variables is finite with probability one.

Let Y = ∑_{i∈I} X_i and μ = ∑_{i∈I} 1/r_i. Then E(Y) = μ, and P(Y < ∞) = 1 if and only if μ < ∞.

Proof

Computational Exercises
Show directly that the exponential probability density function is a valid probability density function.
Solution

Suppose that the length of a telephone call (in minutes) is exponentially distributed with rate parameter r = 0.2. Find each of
the following:
1. The probability that the call lasts between 2 and 7 minutes.
2. The median, the first and third quartiles, and the interquartile range of the call length.
Answer

Suppose that the lifetime of a certain electronic component (in hours) is exponentially distributed with rate parameter
r = 0.001. Find each of the following:

1. The probability that the component lasts at least 2000 hours.


2. The median, the first and third quartiles, and the interquartile range of the lifetime.
Answer

Suppose that the time between requests to a web server (in seconds) is exponentially distributed with rate parameter r =2 .
Find each of the following:
1. The mean and standard deviation of the time between requests.
2. The probability that the time between requests is less that 0.5 seconds.
3. The median, the first and third quartiles, and the interquartile range of the time between requests.
Answer

Suppose that the lifetime X of a fuse (in 100 hour units) is exponentially distributed with P(X > 10) = 0.8 . Find each of the
following:
1. The rate parameter.
2. The mean and standard deviation.
3. The median, the first and third quartiles, and the interquartile range of the lifetime.
Answer

The position X of the first defect on a digital tape (in cm) has the exponential distribution with mean 100. Find each of the
following:
1. The rate parameter.
2. The probability that X < 200 given X > 150 .
3. The standard deviation.
4. The median, the first and third quartiles, and the interquartile range of the position.
Answer

Suppose that X, Y , Z are independent, exponentially distributed random variables with respective parameters
a, b, c ∈ (0, ∞) . Find the probability of each of the 6 orderings of the variables.
Proof

This page titled 14.2: The Exponential Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

14.3: The Gamma Distribution

Basic Theory
We now know that the sequence of inter-arrival times X = (X_1, X_2, …) in the Poisson process is a sequence of independent
random variables, each having the exponential distribution with rate parameter r, for some r > 0. No other distribution gives the
strong renewal assumption that we want: the property that the process probabilistically restarts, independently of the past, at each
arrival time and at each fixed time.
The n th arrival time is simply the sum of the first n inter-arrival times:

T_n = ∑_{i=1}^{n} X_i,  n ∈ N   (14.3.1)

Thus, the sequence of arrival times T = (T_0, T_1, …) is the partial sum process associated with the sequence of inter-arrival times
X = (X_1, X_2, …).

Distribution Functions
Recall that the common probability density function of the inter-arrival times is
f(t) = r e^{−rt},  0 ≤ t < ∞   (14.3.2)

Our first goal is to describe the distribution of the n th arrival time T_n.

For n ∈ N_+, T_n has a continuous distribution with probability density function f_n given by

f_n(t) = r^n [t^{n−1} / (n − 1)!] e^{−rt},  0 ≤ t < ∞   (14.3.3)

1. f_n increases and then decreases, with mode at (n − 1)/r.
2. f_1 is concave upward. f_2 is concave downward and then upward, with inflection point at t = 2/r. For n ≥ 3, f_n is
concave upward, then downward, then upward again, with inflection points at t = [(n − 1) ± √(n − 1)]/r.
Proof
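
The representation of T_n as a sum of independent exponential variables is easy to check numerically. The sketch below simulates T_n and compares empirical quantities against the gamma distribution; the parameter values are illustrative assumptions, not values from the text:

    import numpy as np
    from scipy.stats import gamma

    rng = np.random.default_rng(7)
    r, n = 1.5, 4                            # illustrative rate and shape (assumptions)

    # T_n = X_1 + ... + X_n with X_i independent exponential(r)
    T = rng.exponential(scale=1.0 / r, size=(100_000, n)).sum(axis=1)

    print(T.mean(), n / r)                   # mean of the gamma distribution
    print((T <= 3.0).mean(), gamma(a=n, scale=1.0 / r).cdf(3.0))  # P(T_n <= 3)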

The distribution with this probability density function is known as the gamma distribution with shape parameter n and rate
parameter r. It is also known as the Erlang distribution, named for the Danish mathematician Agner Erlang. Again, 1/r is the scale
parameter, and that term will be justified below. The term shape parameter for n clearly makes sense in light of parts (a) and (b) of
the last result. The term rate parameter for r is inherited from the inter-arrival times, and more generally from the underlying
Poisson process itself: the random points are arriving at an average rate of r per unit time. A more general version of the gamma
distribution, allowing non-integer shape parameters, is studied in the chapter on Special Distributions. Note that since the arrival
times are continuous, the probability of an arrival at any given instant of time is 0.

In the gamma experiment, vary r and n with the scroll bars and watch how the shape of the probability density function
changes. For various values of the parameters, run the experiment 1000 times and compare the empirical density function to
the true probability density function.

The distribution function and the quantile function of the gamma distribution do not have simple, closed-form expressions.
However, it's easy to write the distribution function as a sum.

For n ∈ N_+, T_n has distribution function F_n given by

F_n(t) = 1 − ∑_{k=0}^{n−1} e^{−rt} (rt)^k / k!,  t ∈ [0, ∞)   (14.3.5)

Proof
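
The sum formula is straightforward to implement directly. A minimal sketch, with illustrative parameter values (assumptions), compares it against a library gamma distribution function:

    import math
    from scipy.stats import gamma

    def erlang_cdf(t, n, r):
        """F_n(t) = 1 - sum_{k=0}^{n-1} e^{-rt} (rt)^k / k!, equation (14.3.5)."""
        return 1.0 - sum(math.exp(-r * t) * (r * t) ** k / math.factorial(k)
                         for k in range(n))

    n, r, t = 5, 2.0, 1.7                    # illustrative values (assumptions)
    print(erlang_cdf(t, n, r))
    print(gamma(a=n, scale=1.0 / r).cdf(t))  # should agree to floating-point accuracy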

Open the special distribution calculator, select the gamma distribution, and select CDF view. Vary the parameters and note the
shape of the distribution and quantile functions. For selected values of the parameters, compute the quartiles.

Moments
The mean, variance, and moment generating function of Tn can be found easily from the representation as a sum of independent
exponential variables.

The mean and variance of T_n are:

1. E(T_n) = n/r
2. var(T_n) = n/r^2

Proof

For k ∈ N, the moment of order k of T_n is

E(T_n^k) = [(k + n − 1)! / (n − 1)!] (1 / r^k)   (14.3.7)

Proof

More generally, the moment of order k > 0 (not necessarily an integer) is

E(T_n^k) = [Γ(k + n) / Γ(n)] (1 / r^k)   (14.3.9)

where Γ is the gamma function.

In the gamma experiment, vary r and n with the scroll bars and watch how the size and location of the mean±standard
deviation bar changes. For various values of r and n , run the experiment 1000 times and compare the empirical moments to
the true moments.

Our next result gives the skewness and kurtosis of the gamma distribution.

The skewness and kurtosis of T_n are

1. skew(T_n) = 2/√n
2. kurt(T_n) = 3 + 6/n
Proof

In particular, note that the gamma distribution is positively skewed, but skew(T_n) → 0 as n → ∞. Recall also that the excess
kurtosis is kurt(T_n) − 3 = 6/n → 0 as n → ∞. This result is related to the convergence of the gamma distribution to the normal,
discussed below. Finally, note that the skewness and kurtosis do not depend on the rate parameter r. This is because, as we show
below, 1/r is a scale parameter.

The moment generating function of T_n is

M_n(s) = E(e^{s T_n}) = [r / (r − s)]^n,  −∞ < s < r   (14.3.10)

Proof

The moment generating function can also be used to derive the moments of the gamma distribution given above; recall that
M_n^{(k)}(0) = E(T_n^k).

Estimating the Rate


In many practical situations, the rate r of the process is unknown and must be estimated based on data from the process. We start
with a natural estimate of the scale parameter 1/r. Note that

M_n = T_n / n = (1/n) ∑_{i=1}^{n} X_i   (14.3.11)

is the sample mean of the first n inter-arrival times (X_1, X_2, …, X_n). In statistical terms, this sequence is a random sample of size
n from the exponential distribution with rate r.

M_n satisfies the following properties:

1. E(M_n) = 1/r
2. var(M_n) = 1/(n r^2)
3. M_n → 1/r as n → ∞ with probability 1
Proof

In statistical terms, part (a) means that M_n is an unbiased estimator of 1/r and hence the variance in part (b) is the mean square
error. Part (b) means that M_n is a consistent estimator of 1/r since var(M_n) ↓ 0 as n → ∞. Part (c) is a stronger form of
consistency. In general, the sample mean of a random sample from a distribution is an unbiased and consistent estimator of the
distribution mean. On the other hand, a natural estimator of r itself is 1/M_n = n/T_n. However, this estimator is positively biased.

E(n/T_n) ≥ r.
Proof
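
A small simulation makes the positive bias visible. In fact, since 1/T_n has mean r/(n − 1) for n ≥ 2 (a standard gamma computation), E(n/T_n) = nr/(n − 1). The sketch below, with illustrative parameter values (assumptions), checks this:

    import numpy as np

    rng = np.random.default_rng(3)
    r, n = 2.0, 5                            # illustrative values (assumptions)

    # T_n as the sum of n exponential(r) inter-arrival times, many replications
    T = rng.exponential(scale=1.0 / r, size=(200_000, n)).sum(axis=1)

    print((n / T).mean())                    # empirical E(n / T_n)
    print(n * r / (n - 1))                   # = 2.5, strictly larger than r = 2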

Properties and Connections


Scaling
As noted above, the gamma distribution is a scale family.

Suppose that T has the gamma distribution with rate parameter r ∈ (0, ∞) and shape parameter n ∈ N_+. If c ∈ (0, ∞) then
cT has the gamma distribution with rate parameter r/c and shape parameter n.

Proof

The scaling property also follows from the fact that the gamma distribution governs the arrival times in the Poisson process. A time
change in a Poisson process clearly does not change the strong renewal property, and hence results in a new Poisson process.

General Exponential Family


The gamma distribution is also a member of the general exponential family of distributions.

Suppose that T has the gamma distribution with shape parameter n ∈ N_+ and rate parameter r ∈ (0, ∞). Then T has a
two-parameter general exponential distribution with natural parameters n − 1 and −r, and natural statistics ln(T) and T.
Proof

Increments
A number of important properties flow from the fact that the sequence of arrival times T = (T_0, T_1, …) is the partial sum process
associated with the sequence of independent, identically distributed inter-arrival times X = (X_1, X_2, …).

The arrival time sequence T has stationary, independent increments:

1. If m < n then T_n − T_m has the same distribution as T_{n−m}, namely the gamma distribution with shape parameter n − m
and rate parameter r.
2. If n_1 < n_2 < n_3 < ⋯ then (T_{n_1}, T_{n_2} − T_{n_1}, T_{n_3} − T_{n_2}, …) is an independent sequence.
Proof

Of course, the stationary and independent increments properties are related to the fundamental “renewal” assumption that we
started with. If we fix n ∈ N_+, then (T_n − T_n, T_{n+1} − T_n, T_{n+2} − T_n, …) is independent of (T_1, T_2, …, T_n) and has the same
distribution as (T_0, T_1, T_2, …). That is, if we “restart the clock” at time T_n, then the process in the future looks just like the
original process (in a probabilistic sense) and is independent of the past. Thus, we have our second characterization of the Poisson
process.

A process of random points in time is a Poisson process with rate r ∈ (0, ∞) if and only if the arrival time sequence T has
stationary, independent increments, and for n ∈ N_+, T_n has the gamma distribution with shape parameter n and rate parameter
r.

Sums
The gamma distribution is closed with respect to sums of independent variables, as long as the rate parameter is fixed.

Suppose that V has the gamma distribution with shape parameter m ∈ N_+ and rate parameter r > 0, W has the gamma
distribution with shape parameter n ∈ N_+ and rate parameter r, and that V and W are independent. Then V + W has the
gamma distribution with shape parameter m + n and rate parameter r.


Proof

Normal Approximation
In the gamma experiment, vary r and n with the scroll bars and watch how the shape of the probability density function
changes. Now set n = 10 and for various values of r run the experiment 1000 times and compare the empirical density
function to the true probability density function.

Even though you are restricted to relatively small values of n in the app, note that the probability density function of the n th arrival
time becomes more bell shaped as n increases (for r fixed). This is yet another application of the central limit theorem, since T_n is
the sum of n independent, identically distributed random variables (the inter-arrival times).

The distribution of the random variable Z_n below converges to the standard normal distribution as n → ∞:

Z_n = (r T_n − n) / √n   (14.3.14)

Proof

Connection to Bernoulli Trials


We return to the analogy between the Bernoulli trials process and the Poisson process that started in the Introduction and continued
in the last section on the Exponential Distribution. If we think of the successes in a sequence of Bernoulli trials as random points in
discrete time, then the process has the same strong renewal property as the Poisson process, but restricted to discrete time. That is,
at each fixed time and at each arrival time, the process “starts over”, independently of the past. In Bernoulli trials, the time of the
n th arrival has the negative binomial distribution with parameters n and p (the success probability), while in the Poisson process,

as we now know, the time of the n th arrival has the gamma distribution with parameters n and r (the rate). Because of this strong
analogy, we expect a relationship between these two processes. In fact, we have the same type of limit as with the geometric and
exponential distributions.

Fix n ∈ N_+ and suppose that for each m ∈ N_+, T_{m,n} has the negative binomial distribution with parameters n and
p_m ∈ (0, 1), where m p_m → r ∈ (0, ∞) as m → ∞. Then the distribution of T_{m,n}/m converges to the gamma distribution
with parameters n and r as m → ∞.


Proof

Computational Exercises
Suppose that customers arrive at a service station according to the Poisson model, at a rate of r =3 per hour. Relative to a
given starting time, find the probability that the second customer arrives sometime after 1 hour.
Answer

Defects in a type of wire follow the Poisson model, with rate 1 per 100 meters. Find the probability that the 5th defect is located
between 450 and 550 meters.
Answer

Suppose that requests to a web server follow the Poisson model with rate r = 5 . Relative to a given starting time, compute the
mean and standard deviation of the time of the 10th request.
Answer

Suppose that Y has a gamma distribution with mean 40 and standard deviation 20. Find the shape parameter n and the rate
parameter r.
Answer

Suppose that accidents at an intersection occur according to the Poisson model, at a rate of 8 per year. Compute the normal
approximation to the event that the 10th accident (relative to a given starting time) occurs within 2 years.
Answer

In the gamma experiment, set n = 5 and r = 2. Run the experiment 1000 times and compute the following:
1. P(1.5 ≤ T_5 ≤ 3)
2. The relative frequency of the event {1.5 ≤ T_5 ≤ 3}
3. The normal approximation to P(1.5 ≤ T_5 ≤ 3)

Answer

Suppose that requests to a web server follow the Poisson model. Starting at 12:00 noon on a certain day, the requests are
logged. The 100th request comes at 12:15. Estimate the rate of the process.
Answer

This page titled 14.3: The Gamma Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

14.4: The Poisson Distribution

Basic Theory
Recall that in the Poisson model, X = (X_1, X_2, …) denotes the sequence of inter-arrival times, and T = (T_0, T_1, T_2, …) denotes
the sequence of arrival times. Thus T is the partial sum process associated with X:

T_n = ∑_{i=1}^{n} X_i,  n ∈ N   (14.4.1)

Based on the strong renewal assumption, that the process restarts at each fixed time and each arrival time, independently of the
past, we now know that X is a sequence of independent random variables, each with the exponential distribution with rate
parameter r, for some r ∈ (0, ∞). We also know that T has stationary, independent increments, and that for n ∈ N_+, T_n has the
gamma distribution with shape parameter n and rate parameter r. Both of these statements characterize the Poisson process with rate
r.

Recall that for t ≥ 0, N_t denotes the number of arrivals in the interval (0, t], so that N_t = max{n ∈ N : T_n ≤ t}. We refer to
N = (N_t : t ≥ 0) as the counting process. In this section we will show that N_t has a Poisson distribution, named for Simeon
Poisson, one of the most important distributions in probability theory. Our exposition will alternate between properties of the
distribution and properties of the counting process. The two are intimately intertwined. It's not too much of an exaggeration to say
that wherever there is a Poisson distribution, there is a Poisson process lurking in the background.

Probability density function.


Recall that the probability density function of the n th arrival time T_n is

f_n(t) = r^n [t^{n−1} / (n − 1)!] e^{−rt},  0 ≤ t < ∞   (14.4.2)

We can find the distribution of N_t because of the inverse relation between N and T. In particular, recall that

N_t ≥ n ⟺ T_n ≤ t,  t ∈ (0, ∞), n ∈ N_+   (14.4.3)

since both events mean that there are at least n arrivals in (0, t].

For t ∈ [0, ∞), the probability density function of N_t is given by

P(N_t = n) = e^{−rt} (rt)^n / n!,  n ∈ N   (14.4.4)

Proof
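
The inverse relation also gives a direct way to check this theorem by simulation: generate exponential inter-arrival times, count how many arrival times fall in (0, t], and compare with the Poisson probability density function. The parameter values below are illustrative assumptions:

    import numpy as np
    from scipy.stats import poisson

    rng = np.random.default_rng(11)
    r, t, reps = 2.0, 3.0, 100_000           # illustrative values (assumptions)

    # 40 inter-arrival times per run is far more than enough when rt = 6
    X = rng.exponential(scale=1.0 / r, size=(reps, 40))
    N_t = (X.cumsum(axis=1) <= t).sum(axis=1)    # number of arrival times in (0, t]

    print(N_t.mean(), r * t)                     # E(N_t) = rt
    print((N_t == 5).mean(), poisson(r * t).pmf(5))  # empirical vs theoretical P(N_t = 5)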

Note that the distribution of N_t depends on the parameters r and t only through the product rt. The distribution is called the
Poisson distribution with parameter rt.

In the Poisson experiment, vary r and t with the scroll bars and note the shape of the probability density function. For various
values of r and t , run the experiment 1000 times and compare the relative frequency function to the probability density
function.

In general, a random variable N taking values in N is said to have the Poisson distribution with parameter a ∈ (0, ∞) if it has
the probability density function

g(n) = e^{−a} a^n / n!,  n ∈ N   (14.4.6)

1. g(n − 1) < g(n) if and only if n < a.
2. If a ∉ N_+, there is a single mode at ⌊a⌋.
3. If a ∈ N_+, there are consecutive modes at a − 1 and a.

Proof

The Poisson distribution does not have simple closed-form distribution or quantile functions. Trivially, we can write the distribution
function as a sum of the probability density function.

The Poisson distribution with parameter a has distribution function G given by

G(n) = ∑_{k=0}^{n} e^{−a} a^k / k!,  n ∈ N   (14.4.7)

Open the special distribution calculator, select the Poisson distribution, and select CDF view. Vary the parameter and note the
shape of the distribution and quantile functions. For various values of the parameter, compute the quartiles.

Sometimes it's convenient to allow the parameter a to be 0. This degenerate Poisson distribution is simply point mass at 0. That is,
with the usual conventions regarding nonnegative integer powers of 0, the probability density function g above reduces to g(0) = 1
and g(n) = 0 for n ∈ N_+.

Moments
Suppose that N has the Poisson distribution with parameter a > 0 . Naturally we want to know the mean, variance, skewness and
kurtosis, and the probability generating function of N . The easiest moments to compute are the factorial moments. For this result,
recall the falling power notation for the number of permutations of size k chosen from a population of size n:

n^{(k)} = n(n − 1) ⋯ [n − (k − 1)]   (14.4.8)

The factorial moment of N of order k ∈ N is E[N^{(k)}] = a^k.
Proof

The mean and variance of N are the parameter a .


1. E(N ) = a
2. var(N ) = a
Proof

Open the special distribution simulator and select the Poisson distribution. Vary the parameter and note the location and size of
the mean±standard deviation bar. For selected values of the parameter, run the simulation 1000 times and compare the
empirical mean and standard deviation to the distribution mean and standard deviation.

The skewness and kurtosis of N are

1. skew(N) = 1/√a
2. kurt(N) = 3 + 1/a
Proof

Note that the Poisson distribution is positively skewed, but skew(N) → 0 as a → ∞. Recall also that the excess kurtosis is
kurt(N) − 3 = 1/a → 0 as a → ∞. This limit is related to the convergence of the Poisson distribution to the normal, discussed
below.

Open the special distribution simulator and select the Poisson distribution. Vary the parameter and note the shape of the
probability density function in the context of the results on skewness and kurtosis above.

The probability generating function P of N is given by

P(s) = E(s^N) = e^{a(s−1)},  s ∈ R   (14.4.10)

Proof

Returning to the Poisson counting process N = (N_t : t ≥ 0) with rate parameter r, it follows that E(N_t) = rt and var(N_t) = rt
for t ≥ 0. Once again, we see that r can be interpreted as the average arrival rate. In an interval of length t, we expect about rt
arrivals.

In the Poisson experiment, vary r and t with the scroll bars and note the location and size of the mean±standard deviation bar.
For various values of r and t , run the experiment 1000 times and compare the sample mean and standard deviation to the
distribution mean and standard deviation, respectively.

Estimating the Rate


Suppose again that we have a Poisson process with rate r ∈ (0, ∞). In many practical situations, the rate r is unknown and must
be estimated based on observing data. For fixed t > 0, a natural estimator of the rate r is N_t/t.

The mean and variance of N_t/t are

1. E(N_t/t) = r
2. var(N_t/t) = r/t

Proof

Part (a) means that the estimator is unbiased. Since this is the case, the variance in part (b) gives the mean square error. Since
var(N_t/t) decreases to 0 as t → ∞, the estimator is consistent.
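
A minimal sketch of this estimation scheme; the true rate, observation window, and replication count are illustrative assumptions, not values from the text:

    import numpy as np

    rng = np.random.default_rng(19)
    r, t, reps = 4.0, 50.0, 10_000           # illustrative values (assumptions)

    N_t = rng.poisson(r * t, size=reps)      # observed counts over (0, t]
    est = N_t / t                            # the estimator N_t / t

    print(est.mean())                        # near r = 4 (unbiased)
    print(est.var(), r / t)                  # mean square error near r/t = 0.08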

Additional Properties and Connections


Increments and Characterizations
Let's explore the basic renewal assumption of the Poisson model in terms of the counting process N = (N_t : t ≥ 0). Recall that
N_t is the number of arrivals in the interval (0, t], so it follows that if s, t ∈ [0, ∞) with s < t, then N_t − N_s is the number of
arrivals in the interval (s, t]. Of course, the arrival times have continuous distributions, so the probability that an arrival occurs at a
specific point t is 0. Thus, it does not really matter if we write the interval above as (s, t], (s, t), [s, t) or [s, t].

The process N has stationary, independent increments:

1. If s, t ∈ [0, ∞) with s < t then N_t − N_s has the same distribution as N_{t−s}, namely Poisson with parameter r(t − s).
2. If t_1, t_2, t_3, … ∈ [0, ∞) with t_1 < t_2 < t_3 < ⋯ then (N_{t_1}, N_{t_2} − N_{t_1}, N_{t_3} − N_{t_2}, …) is an independent sequence.

Statements about the increments of the counting process can be expressed more elegantly in terms of our more general counting
process. Recall that for A ⊆ [0, ∞) (measurable of course), N(A) denotes the number of random points in A:

N(A) = #{n ∈ N_+ : T_n ∈ A}   (14.4.12)

and so in particular, N_t = N(0, t]. Thus, note that t ↦ N_t is a (random) distribution function and A ↦ N(A) is the (random)
measure associated with this distribution function. Recall also that λ denotes the standard length (Lebesgue) measure on [0, ∞).
Here is our third characterization of the Poisson process.

A process of random points in time is a Poisson process with rate r ∈ (0, ∞) if and only if the following properties hold:

1. If A ⊆ [0, ∞) is measurable then N(A) has the Poisson distribution with parameter rλ(A).
2. If {A_i : i ∈ I} is a countable, disjoint collection of measurable sets in [0, ∞) then {N(A_i) : i ∈ I} is a set of independent
variables.

From a modeling point of view, the assumptions of stationary, independent increments are ones that might be reasonably made. But
the assumption that the increments have Poisson distributions does not seem as clear. Our next characterization of the process is
more primitive in a sense, because it just imposes some limiting assumptions (in addition to stationary, independent increments).

A process of random points in time is a Poisson process with rate r ∈ (0, ∞) if and only if the following properties hold:

1. If A, B ⊆ [0, ∞) are measurable and λ(A) = λ(B), then N(A) and N(B) have the same distribution.
2. If {A_i : i ∈ I} is a countable, disjoint collection of measurable sets in [0, ∞) then {N(A_i) : i ∈ I} is a set of independent
variables.
3. If A_n ⊆ [0, ∞) is measurable and λ(A_n) > 0 for n ∈ N_+, and if λ(A_n) → 0 as n → ∞, then

P[N(A_n) = 1] / λ(A_n) → r as n → ∞   (14.4.13)

P[N(A_n) > 1] / λ(A_n) → 0 as n → ∞   (14.4.14)
λ(An )

Proof

Of course, part (a) is the stationary assumption and part (b) the independence assumption. The first limit in (c) is sometimes called
the rate property and the second limit the sparseness property. In a “small” time interval of length dt , the probability of a single
random point is approximately r dt, and the probability of two or more random points is negligible.

Sums
Suppose that N and M are independent random variables, and that N has the Poisson distribution with parameter a ∈ (0, ∞)
and M has the Poisson distribution with parameter b ∈ (0, ∞). Then N + M has the Poisson distribution with parameter
a+b .

Proof from the Poisson process


Proof from probability generating functions
Proof from convolution

From the last theorem, it follows that the Poisson distribution is infinitely divisible. That is, a Poisson distributed variable can be
written as the sum of an arbitrary number of independent, identically distributed (in fact also Poisson) variables.

Suppose that N has the Poisson distribution with parameter a ∈ (0, ∞). Then for n ∈ N_+, N has the same distribution as
∑_{i=1}^{n} N_i where (N_1, N_2, …, N_n) are independent, and each has the Poisson distribution with parameter a/n.

Normal Approximation
Because of the representation as a sum of independent, identically distributed variables, it's not surprising that the Poisson
distribution can be approximated by the normal.

Suppose that N_t has the Poisson distribution with parameter t > 0. Then the distribution of the variable below converges to
the standard normal distribution as t → ∞:

Z_t = (N_t − t) / √t   (14.4.18)

Proof

Thus, if N has the Poisson distribution with parameter a, and a is “large”, then the distribution of N is approximately normal with
mean a and standard deviation √a. When using the normal approximation, we should remember to use the continuity correction,
since the Poisson is a discrete distribution.
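
A minimal sketch of the approximation, including the continuity correction; the parameter and the cutoff are illustrative assumptions:

    from scipy.stats import norm, poisson

    a = 100                                   # illustrative "large" parameter (assumption)

    exact = poisson(a).cdf(110)               # P(N <= 110)
    approx = norm(loc=a, scale=a ** 0.5).cdf(110.5)   # continuity correction: 110 -> 110.5
    print(exact, approx)                      # the two should be close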

In the Poisson experiment, set r = t = 1 . Increase t and note how the graph of the probability density function becomes more
bell-shaped.

General Exponential
The Poisson distribution is a member of the general exponential family of distributions. This fact is important in various statistical
procedures.

Suppose that N has the Poisson distribution with parameter a ∈ (0, ∞) . This distribution is a one-parameter exponential
family with natural parameter ln(a) and natural statistic N .
Proof

The Uniform Distribution


The Poisson process has some basic connections to the uniform distribution. Consider again the Poisson process with rate r > 0.
As usual, T = (T_0, T_1, …) denotes the arrival time sequence and N = (N_t : t ≥ 0) the counting process.

For t > 0, the conditional distribution of T_1 given N_t = 1 is uniform on the interval (0, t].
Proof

More generally, for t > 0 and n ∈ N_+, the conditional distribution of (T_1, T_2, …, T_n) given N_t = n is the same as the
distribution of the order statistics of a random sample of size n from the uniform distribution on the interval (0, t].
Heuristic proof
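
The conditional uniformity is easy to see in simulation. The sketch below, with illustrative parameter values (assumptions), keeps only the runs with exactly one arrival in (0, t] and checks that the arrival time behaves like a uniform variable on (0, t]:

    import numpy as np

    rng = np.random.default_rng(5)
    r, t, reps = 1.0, 2.0, 200_000            # illustrative values (assumptions)

    X = rng.exponential(scale=1.0 / r, size=(reps, 30))
    T = X.cumsum(axis=1)                      # arrival times
    N_t = (T <= t).sum(axis=1)                # counts in (0, t]

    first = T[N_t == 1, 0]                    # T_1 on the runs where N_t = 1
    print(first.mean(), t / 2)                # uniform(0, t] has mean t/2
    print(first.var(), t ** 2 / 12)           # and variance t^2 / 12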

Note that the conditional distribution in the last result is independent of the rate r. This means that, in a sense, the Poisson model
gives the most “random” distribution of points in time.

The Binomial Distribution


The Poisson distribution has important connections to the binomial distribution. First we consider a conditional distribution based
on the number of arrivals of a Poisson process in a given interval, as we did in the last subsection.

Suppose that (N_t : t ∈ [0, ∞)) is a Poisson counting process with rate r ∈ (0, ∞). If s, t ∈ (0, ∞) with s < t, and n ∈ N_+,
then the conditional distribution of N_s given N_t = n is binomial with trial parameter n and success parameter p = s/t.

Proof

Note again that the conditional distribution in the last result does not depend on the rate r. Given N_t = n, each of the n arrivals,
independently of the others, falls into the interval (0, s] with probability s/t and into the interval (s, t] with probability
1 − s/t = (t − s)/t. Here is essentially the same result, outside of the context of the Poisson process.

Suppose that M has the Poisson distribution with parameter a ∈ (0, ∞) , N has the Poisson distribution with parameter
b ∈ (0, ∞), and that M and N are independent. Then the conditional distribution of M given M + N = n is binomial with

parameters n and p = a/(a + b) .


Proof

More importantly, the Poisson distribution is the limit of the binomial distribution in a certain sense. As we will see, this
convergence result is related to the analogy between the Bernoulli trials process and the Poisson process that we discussed in the
Introduction, the section on the inter-arrival times, and the section on the arrival times.

Suppose that p_n ∈ (0, 1) for n ∈ N_+ and that n p_n → a ∈ (0, ∞) as n → ∞. Then the binomial distribution with parameters
n and p_n converges to the Poisson distribution with parameter a as n → ∞. That is, for fixed k ∈ N,

\binom{n}{k} p_n^k (1 − p_n)^{n−k} → e^{−a} a^k / k!  as n → ∞   (14.4.27)

Direct proof
Proof from generating functions
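
A quick numerical comparison of the two probability density functions; the values of n and p are illustrative assumptions chosen so that n is large and p is small:

    from scipy.stats import binom, poisson

    n, p = 1000, 0.003                        # illustrative values (assumptions)
    a = n * p

    for k in range(6):                        # compare the two pmfs for small k
        print(k, binom(n, p).pmf(k), poisson(a).pmf(k))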

The mean and variance of the binomial distribution converge to the mean and variance of the limiting Poisson distribution,
respectively:
1. n p_n → a as n → ∞
2. n p_n (1 − p_n) → a as n → ∞

Of course the convergence of the means is precisely our basic assumption, and is further evidence that this is the essential
assumption. But for a deeper look, let's return to the analogy between the Bernoulli trials process and the Poisson process. Recall
that both have the strong renewal property that at each fixed time, and at each arrival time, the process stochastically starts over,
independently of the past. The difference, of course, is that time is discrete in the Bernoulli trials process and continuous in the
Poisson process. The convergence result is a special case of the more general fact that if we run Bernoulli trials at a faster and
faster rate but with a smaller and smaller success probability, in just the right way, the Bernoulli trials process converges to the
Poisson process. Specifically, suppose that we have a sequence of Bernoulli trials processes. In process n we perform the trials at a
rate of n per unit time, with success probability p_n. Our basic assumption is that n p_n → r as n → ∞ where r > 0. Now let Y_{n,t}
denote the number of successes in the time interval (0, t] for Bernoulli trials process n, and let N_t denote the number of arrivals in
this interval for the Poisson process with rate r. Then Y_{n,t} has the binomial distribution with parameters ⌊nt⌋ and p_n, and of course
N_t has the Poisson distribution with parameter rt.

For t > 0, the distribution of Y_{n,t} converges to the distribution of N_t as n → ∞.

Proof

Compare the Poisson experiment and the binomial timeline experiment.


1. Open the Poisson experiment and set r = 1 and t = 5 . Run the experiment a few times and note the general behavior of the
random points in time. Note also the shape and location of the probability density function and the mean±standard
deviation bar.
2. Now open the binomial timeline experiment and set n = 100 and p = 0.05. Run the experiment a few times and note the
general behavior of the random points in time. Note also the shape and location of the probability density function and the
mean±standard deviation bar.

From a practical point of view, the convergence of the binomial distribution to the Poisson means that if the number of trials n is
“large” and the probability of success p “small”, so that np² is small, then the binomial distribution with parameters n and p is well
approximated by the Poisson distribution with parameter r = np. This is often a useful result, because the Poisson distribution has
fewer parameters than the binomial distribution (and often in real problems, the parameters may only be known approximately).
Specifically, in the approximating Poisson distribution, we do not need to know the number of trials n and the probability of
success p individually, but only in the product np. The condition that np² be small means that the variance of the binomial
distribution, namely np(1 − p) = np − np², is approximately r = np, the variance of the approximating Poisson distribution.

Recall that the binomial distribution can also be approximated by the normal distribution, by virtue of the central limit theorem.
The normal approximation works well when np and n(1 − p) are large; the rule of thumb is that both should be at least 5. The
Poisson approximation works well, as we have already noted, when n is large and np² small.

Computational Exercises
Suppose that requests to a web server follow the Poisson model with rate r = 5 per minute. Find the probability that there will
be at least 8 requests in a 2 minute period.
Answer

Defects in a certain type of wire follow the Poisson model with rate 1.5 per meter. Find the probability that there will be no
more than 4 defects in a 2 meter piece of the wire.
Answer

Suppose that customers arrive at a service station according to the Poisson model, at a rate of r =4 . Find the mean and
standard deviation of the number of customers in an 8 hour period.
Answer

In the Poisson experiment, set r = 5 and t = 4. Run the experiment 1000 times and compute the following:
1. P(15 ≤ N_4 ≤ 22)
2. The relative frequency of the event {15 ≤ N_4 ≤ 22}
3. The normal approximation to P(15 ≤ N_4 ≤ 22)

Answer

Suppose that requests to a web server follow the Poisson model with rate r = 5 per minute. Compute the normal
approximation to the probability that there will be at least 280 requests in a 1 hour period.
Answer

Suppose that requests to a web server follow the Poisson model, and that 1 request comes in a five minute period. Find the
probability that the request came during the first 3 minutes of the period.
Answer

Suppose that requests to a web server follow the Poisson model, and that 10 requests come during a 5 minute period. Find the
probability that at least 4 requests came during the first 3 minutes of the period.
Answer

In the Poisson experiment, set r = 3 and t = 5. Run the experiment 100 times.
1. For each run, compute the estimate of r based on N_t.
2. Over the 100 runs, compute the average of the squares of the errors.
3. Compare the result in (b) with var(N_t/t).

Suppose that requests to a web server follow the Poisson model with unknown rate r per minute. In a one hour period, the
server receives 342 requests. Estimate r.
Answer

In the binomial experiment, set n = 30 and p = 0.1, and run the simulation 1000 times. Compute and compare each of the
following:
1. P(Y_{30} ≤ 4)
2. The relative frequency of the event {Y_{30} ≤ 4}
3. The Poisson approximation to P(Y_{30} ≤ 4)

Answer

Suppose that we have 100 memory chips, each of which is defective with probability 0.05, independently of the others.
Approximate the probability that there are at least 3 defectives in the batch.
Answer

In the binomial timeline experiment, set n = 40 and p = 0.1 and run the simulation 1000 times. Compute and compare each of
the following:
1. P(Y_{40} > 5)
2. The relative frequency of the event {Y_{40} > 5}
3. The Poisson approximation to P(Y_{40} > 5)
4. The normal approximation to P(Y_{40} > 5)

Answer

In the binomial timeline experiment, set n = 100 and p = 0.1 and run the simulation 1000 times. Compute and compare each
of the following:
1. P(8 < Y_{100} < 15)
2. The relative frequency of the event {8 < Y_{100} < 15}
3. The Poisson approximation to P(8 < Y_{100} < 15)
4. The normal approximation to P(8 < Y_{100} < 15)

Answer

A text file contains 1000 words. Assume that each word, independently of the others, is misspelled with probability p.
1. If p = 0.015, approximate the probability that the file contains at least 20 misspelled words.
2. If p = 0.001, approximate the probability that the file contains at least 3 misspelled words.
Answer

This page titled 14.4: The Poisson Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

14.5: Thinning and Superposition

Thinning
Thinning or splitting a Poisson process refers to classifying each random point, independently, into one of a finite number of
different types. The random points of a given type also form Poisson processes, and these processes are independent. Our
exposition will concentrate on the case of just two types, but this case has all of the essential ideas.

The Two-Type Process


We start with a Poisson process with rate r > 0. Recall that this statement really means three interrelated stochastic processes: the
sequence of inter-arrival times X = (X_1, X_2, …), the sequence of arrival times T = (T_0, T_1, …), and the counting process
N = (N_t : t ≥ 0). Suppose now that each arrival, independently of the others, is one of two types: type 1 with probability p and
type 0 with probability 1 − p, where p ∈ (0, 1) is a parameter. Here are some common examples:
The arrivals are radioactive emissions and each emitted particle is either detected (type 1) or missed (type 0) by a counter.
The arrivals are customers at a service station and each customer is classified as either male (type 1) or female (type 0).
We want to consider the type 1 and type 0 random points separately. For this reason, the new random process is usually referred to
as thinning or splitting the original Poisson process. In some applications, the type 1 points are accepted while the type 0 points are
rejected. The main result of this section is that the type 1 and type 0 points form separate Poisson processes, with rates rp and
r(1 − p) respectively, and are independent. We will explore this important result from several points of view.

Bernoulli Trials
In the previous sections, we have explored the analogy between the Bernoulli trials process and the Poisson process. Both have the
strong renewal property that at each fixed time and at each arrival time, the process stochastically “restarts”, independently of the
past. The difference, of course, is that time is discrete in the Bernoulli trials process and continuous in the Poisson process. In this
section, we have both processes simultaneously, and given our previous explorations, it's perhaps not surprising that this leads to
some interesting mathematics.
Thus, in addition to the processes X, T, and N, we have a sequence of Bernoulli trials I = (I_1, I_2, …) with success parameter p.
Indicator variable I_j specifies the type of the j th arrival. Moreover, because of our assumptions, I is independent of X, T, and N.
j

Recall that V_k, the trial number of the k th success, has the negative binomial distribution with parameters k and p for k ∈ N_+. We
take V_0 = 0 by convention. Also, U_k, the number of trials needed to go from the (k − 1) st success to the k th success, has the
geometric distribution with success parameter p for k ∈ N_+. Moreover, U = (U_1, U_2, …) is independent and V = (V_0, V_1, …) is
the partial sum process associated with U:

V_k = ∑_{i=1}^{k} U_i,  k ∈ N   (14.5.1)

U_k = V_k − V_{k−1},  k ∈ N_+   (14.5.2)

As noted above, the Bernoulli trials process can be thought of as random points in discrete time, namely the trial numbers of the
successes. With this understanding, U is the sequence of inter-arrival times and V is the sequence of arrival times.

The Inter-arrival Times


Now consider just the type 1 points in our Poisson process. The time between the arrivals of the (k − 1) st and k th type 1 points is

Y_k = ∑_{i=V_{k−1}+1}^{V_k} X_i,  k ∈ N_+   (14.5.3)

Note that Y_k has U_k terms. The next result shows that the type 1 points form a Poisson process with rate pr.
k k

Y = (Y_1, Y_2, …) is a sequence of independent variables, and each has the exponential distribution with rate parameter pr.
Proof

Similarly, if Z = (Z_1, Z_2, …) is the sequence of inter-arrival times for the type 0 points, then Z is a sequence of independent
variables, and each has the exponential distribution with rate (1 − p)r. Moreover, Y and Z are independent.

Counting Processes
For t ≥ 0, let M_t denote the number of type 1 arrivals in (0, t] and W_t the number of type 0 arrivals in (0, t]. Thus,
M = (M_t : t ≥ 0) and W = (W_t : t ≥ 0) are the counting processes for the type 1 arrivals and for the type 0 arrivals. The next
result follows from the previous subsection, but a direct proof is interesting.

For t ≥ 0, M_t has the Poisson distribution with parameter prt, W_t has the Poisson distribution with parameter (1 − p)rt, and
M_t and W_t are independent.

Proof
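
A minimal simulation sketch of the two-type process, with illustrative parameter values (assumptions): generate the total count, classify each arrival independently, and check the marginal distributions and the (near-zero) covariance:

    import numpy as np

    rng = np.random.default_rng(9)
    r, p, t, reps = 2.0, 0.7, 3.0, 200_000    # illustrative values (assumptions)

    N = rng.poisson(r * t, size=reps)         # total arrivals in (0, t]
    M = rng.binomial(N, p)                    # type 1 arrivals: each is type 1 w.p. p
    W = N - M                                 # type 0 arrivals

    print(M.mean(), M.var(), p * r * t)       # Poisson(prt): mean = variance = prt
    print(W.mean(), W.var(), (1 - p) * r * t) # Poisson((1-p)rt)
    print(np.cov(M, W)[0, 1])                 # near 0, consistent with independence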

In the two-type Poisson experiment vary r, p, and t with the scroll bars and note the shape of the probability density functions.
For various values of the parameters, run the experiment 1000 times and compare the relative frequency functions to the
probability density functions.

Estimating the Number of Arrivals


Suppose that the type 1 arrivals are observable, but not the type 0 arrivals. This setting is natural, for example, if the arrivals are
radioactive emissions, and the type 1 arrivals are emissions that are detected by a counter, while the type 0 arrivals are emissions
that are missed. Suppose that for a given t > 0, we would like to estimate the total number of arrivals N_t after observing the number
of type 1 arrivals M_t.

The conditional distribution of N_t given M_t = k is the same as the distribution of k + W_t.

P(N_t = n ∣ M_t = k) = e^{−(1−p)rt} [(1 − p)rt]^{n−k} / (n − k)!,  n ∈ {k, k + 1, …}   (14.5.7)

Proof

E(N_t ∣ M_t = k) = k + (1 − p)rt.
Proof

Thus, if the overall rate r of the process and the probability p that an arrival is type 1 are known, then it follows from the general
theory of conditional expectation that the best estimator of N_t based on M_t, in the least squares sense, is

E(N_t ∣ M_t) = M_t + (1 − p)rt   (14.5.10)

The mean square error is E([N_t − E(N_t ∣ M_t)]^2) = (1 − p)rt.

Proof

The Multi-Type Process


As you might guess, the results above generalize from 2 types to k types for general k ∈ N_+. Once again, we start with a Poisson
process with rate r > 0. Suppose that each arrival, independently of the others, is type i with probability p_i for
i ∈ {0, 1, …, k − 1}. Of course we must have p_i ≥ 0 for each i and ∑_{i=0}^{k−1} p_i = 1. Then for each i, the type i points form a
Poisson process with rate p_i r, and these processes are independent.

Superposition
Complementary to splitting or thinning a Poisson process is superposition: if we combine the random points in time from
independent Poisson processes, then we have a new Poisson processes. The rate of the new process is the sum of the rates of the
processes that were combined. Once again, our exposition will concentrate on the superposition of two processes. This case
contains all of the essential ideas.

Two Processes
Suppose that we have two independent Poisson processes. We will denote the sequence of inter-arrival times, the sequence of
arrival times, and the counting variables for process i ∈ {1, 2} by X^i = (X^i_1, X^i_2, …), T^i = (T^i_1, T^i_2, …), and
N^i = (N^i_t : t ∈ [0, ∞)), and we assume that process i has rate r_i ∈ (0, ∞). The new process that we want to consider is
obtained by simply combining the random points. That is, the new random points are {T^1_n : n ∈ N_+} ∪ {T^2_n : n ∈ N_+}, but of
course then ordered in time. We will denote the sequence of inter-arrival times, the sequence of arrival times, and the counting
variables for the new process by X = (X_1, X_2, …), T = (T_1, T_2, …), and N = (N_t : t ∈ [0, ∞)). Clearly if A is an interval in
[0, ∞) then

N(A) = N^1(A) + N^2(A)   (14.5.11)

the number of combined points in A is simply the sum of the number of points in A for processes 1 and 2. It's also worth noting that

X_1 = min{X^1_1, X^2_1}   (14.5.12)

the first arrival time for the combined process is the smaller of the first arrival times for processes 1 and 2. The other inter-arrival times,
and hence also the arrival times, for the combined process are harder to state.

The combined process is a Poisson process with rate r_1 + r_2.


Proof
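
A minimal numerical check of the counting distribution for the combined process, with illustrative rates and horizon (assumptions):

    import numpy as np

    rng = np.random.default_rng(13)
    r1, r2, t, reps = 1.0, 2.5, 4.0, 100_000  # illustrative values (assumptions)

    N1 = rng.poisson(r1 * t, size=reps)       # counts for process 1
    N2 = rng.poisson(r2 * t, size=reps)       # counts for the independent process 2
    N = N1 + N2                               # counts for the combined process

    print(N.mean(), N.var(), (r1 + r2) * t)   # Poisson((r1 + r2) t): mean = variance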

Computational Exercises
In the two-type Poisson experiment, set r = 2 , t = 3 , and p = 0.7 . Run the experiment 1000 times, Compute the appropriate
relative frequency functions and investigate empirically the independence of the number of type 1 points and the number of
type 0 points.

Suppose that customers arrive at a service station according to the Poisson model, with rate r = 20 per hour. Moreover, each
customer, independently, is female with probability 0.6 and male with probability 0.4. Find the probability that in a 2 hour
period, there will be at least 20 women and at least 15 men.
Answer

In the two-type Poisson experiment, set r = 3, t = 4, and p = 0.8. Run the experiment 100 times.
1. Compute the estimate of N_t based on M_t for each run.
2. Over the 100 runs, compute the average of the sum of the squares of the errors.
3. Compare the result in (b) with the result in Exercise 8.

Suppose that a piece of radioactive material emits particles according to the Poisson model at a rate of r = 100 per second.
Moreover, assume that a counter detects each emitted particle, independently, with probability 0.9. Suppose that the number of
detected particles in a 5 second period is 465.
1. Estimate the number of particles emitted.
2. Compute the mean square error of the estimate.
Answer

This page titled 14.5: Thinning and Superposition is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

14.6: Non-homogeneous Poisson Processes

Basic Theory
A non-homogeneous Poisson process is similar to an ordinary Poisson process, except that the average rate of arrivals is allowed to
vary with time. Many applications that generate random points in time are modeled more faithfully with such non-homogeneous
processes. The mathematical cost of this generalization, however, is that we lose the property of stationary increments.
Non-homogeneous Poisson processes are best described in measure-theoretic terms. Thus, you may need to review the sections on
measure theory in the chapters on Foundations, Probability Measures, and Distributions. Our basic measure space in this section is
[0, ∞) with the σ-algebra of Borel measurable subsets (named for Émile Borel). As usual, λ denotes Lebesgue measure on this

space, named for Henri Lebesgue. Recall that the Borel σ-algebra is the one generated by the intervals, and λ is the generalization
of length on intervals.

Definition and Basic Properties


Of all of our various characterizations of the ordinary Poisson process, in terms of the inter-arrival times, the arrival times, and the
counting process, the characterization involving the counting process leads to the most natural generalization to non-homogeneous
processes. Thus, consider a process that generates random points in time, and as usual, let N_t denote the number of random points
in the interval (0, t] for t ≥ 0, so that N = {N_t : t ≥ 0} is the counting process. More generally, N(A) denotes the number of
random points in a measurable A ⊆ [0, ∞), so N is our random counting measure. As before, t ↦ N_t is a (random) distribution
function and A ↦ N(A) is the (random) measure associated with this distribution function.
Suppose now that r : [0, ∞) → [0, ∞) is measurable, and define m : [0, ∞) → [0, ∞) by

m(t) = ∫_{(0,t]} r(s) dλ(s)   (14.6.1)

From properties of the integral, m is increasing and right-continuous on [0, ∞) and hence is a distribution function. The positive
measure on [0, ∞) associated with m (which we will also denote by m) is defined on a measurable A ⊆ [0, ∞) by

m(A) = ∫_A r(s) dλ(s)   (14.6.2)

Thus, m(t) = m(0, t] , and for s, t ∈ [0, ∞) with s < t , m(s, t] = m(t) − m(s) . Finally, note that the measure m is absolutely
continuous with respect to λ , and r is the density function. Note the parallels between the random distribution function and
measure N and the deterministic distribution function and measure m. With the setup involving r and m complete, we are ready
for our first definition.

A process that produces random points in time is a non-homogeneous Poisson process with rate function r if the counting
process N satisfies the following properties:
1. If {A_i : i ∈ I} is a countable, disjoint collection of measurable subsets of [0, ∞) then {N(A_i) : i ∈ I} is a collection of
independent random variables.
2. If A ⊆ [0, ∞) is measurable then N(A) has the Poisson distribution with parameter m(A).

Property (a) is our usual property of independent increments, while property (b) is a natural generalization of the property of
Poisson distributed increments. Clearly, if r is a positive constant, then m(t) = rt for t ∈ [0, ∞) and as a measure, m is
proportional to Lebesgue measure λ . In this case, the non-homogeneous process reduces to an ordinary, homogeneous Poisson
process with rate r. However, if r is not constant, then m is not linear, and as a measure, is not proportional to Lebesgue measure.
In this case, the process does not have stationary increments with respect to λ , but does of course, have stationary increments with
respect to m. That is, if A, B are measurable subsets of [0, ∞) and λ(A) = λ(B) then N (A) and N (B) will not in general have
the same distribution, but of course they will have the same distribution if m(A) = m(B) .
In particular, recall that the parameter of the Poisson distribution is both the mean and the variance, so
E[N(A)] = var[N(A)] = m(A) for measurable A ⊆ [0, ∞), and in particular, E(N_t) = var(N_t) = m(t) for t ∈ [0, ∞). The
function m is usually called the mean function. Since m′(t) = r(t) (if r is continuous at t), it makes sense to refer to r as the rate
function. Locally, at t, the arrivals are occurring at an average rate of r(t) per unit time.
As before, from a modeling point of view, the property of independent increments can reasonably be evaluated. But we need
something more primitive to replace the property of Poisson increments. Here is the main theorem.

A process that produces random points in time is a non-homogeneous Poisson process with rate function r if and only if the
counting process N satisfies the following properties:
1. If {A_i : i ∈ I} is a countable, disjoint collection of measurable subsets of [0, ∞) then {N(A_i) : i ∈ I} is a set of
independent variables.
2. For t ∈ [0, ∞),

P[N(t, t + h] = 1] / h → r(t) as h ↓ 0   (14.6.3)

P[N(t, t + h] > 1] / h → 0 as h ↓ 0   (14.6.4)

So if h is “small” the probability of a single arrival in [t, t + h) is approximately r(t)h , while the probability of more than 1
arrival in this interval is negligible.

Arrival Times and Time Change


Suppose that we have a non-homogeneous Poisson process with rate function r, as defined above. As usual, let T_n denote the time
of the n th arrival for n ∈ N_+. As with the ordinary Poisson process, we have an inverse relation between the counting process
N = {N_t : t ∈ [0, ∞)} and the arrival time sequence T = {T_n : n ∈ N_+}, namely T_n = min{t ∈ [0, ∞) : N_t = n},
N_t = #{n ∈ N_+ : T_n ≤ t}, and {T_n ≤ t} = {N_t ≥ n}, since both events mean at least n random points in (0, t]. The last
relationship allows us to get the distribution of T_n.

For n ∈ N_+, T_n has probability density function f_n given by

f_n(t) = [m^{n−1}(t) / (n − 1)!] r(t) e^{−m(t)},  t ∈ [0, ∞)   (14.6.5)
Proof

In particular, T_1 has probability density function f_1 given by

f_1(t) = r(t) e^{−m(t)},  t ∈ [0, ∞)   (14.6.8)

Recall that in reliability terms, r is the failure rate function, and that the reliability function is the right distribution function:

F^c_1(t) = P(T_1 > t) = e^{−m(t)},  t ∈ [0, ∞)   (14.6.9)

In general, the functional form of f_n is clearly similar to the probability density function of the gamma distribution, and indeed, T_n
can be transformed into a random variable with a gamma distribution. This amounts to a time change which will give us additional
insight into the non-homogeneous Poisson process.

Let U_n = m(T_n) for n ∈ N_+. Then U_n has the gamma distribution with shape parameter n and rate parameter 1.

Proof

Thus, the time change u = m(t) transforms the non-homogeneous Poisson process into a standard (rate 1) Poisson process. Here is
an equivalent way to look at the time change result.

For u ∈ [0, ∞), let M_u = N_t where t = m^{−1}(u). Then {M_u : u ∈ [0, ∞)} is the counting process for a standard, rate 1
Poisson process.
Proof

Equivalently, we can transform a standard (rate 1) Poisson process into a non-homogeneous Poisson process with a time change.

Suppose that M = {M_u : u ∈ [0, ∞)} is the counting process for a standard Poisson process, and let N_t = M_{m(t)} for
t ∈ [0, ∞). Then {N_t : t ∈ [0, ∞)} is the counting process for a non-homogeneous Poisson process with mean function m
(and rate function r).


Proof
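
The last two results give a practical simulation recipe: generate standard (rate 1) arrival times and push them through m^{−1}. A minimal sketch, assuming the illustrative rate function r(t) = 2t, so that m(t) = t² and m^{−1}(u) = √u; none of these choices come from the text:

    import numpy as np

    rng = np.random.default_rng(17)

    def arrivals(horizon):
        """Arrival times in [0, horizon] via the time change t = m^{-1}(u)."""
        u, times = 0.0, []
        while True:
            u += rng.exponential(1.0)         # standard rate 1 arrival times U_n
            if u > horizon ** 2:              # stop once past m(horizon) = horizon^2
                return times
            times.append(np.sqrt(u))          # T_n = m^{-1}(U_n) = sqrt(U_n)

    counts = [len(arrivals(3.0)) for _ in range(20_000)]
    print(np.mean(counts), 3.0 ** 2)          # E(N_3) = m(3) = 9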

This page titled 14.6: Non-homogeneous Poisson Processes is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

14.7: Compound Poisson Processes

In a compound Poisson process, each arrival in an ordinary Poisson process comes with an associated real-valued random variable
that represents the value of the arrival in a sense. These variables are independent and identically distributed, and are independent
of the underlying Poisson process. Our interest centers on the sum of the random variables for all the arrivals up to a fixed time t ,
which thus is a Poisson-distributed random sum of random variables. Distributions of this type are said to be compound Poisson
distributions, and are important in their own right, particularly since some surprising parametric distributions turn out to be
compound Poisson.

Basic Theory
Definition
Suppose we have a Poisson process with rate r ∈ (0, ∞). As usual, we will denote the sequence of inter-arrival times by
X = (X_1, X_2, …), the sequence of arrival times by T = (T_0, T_1, T_2, …), and the counting process by N = {N_t : t ∈ [0, ∞)}.
To review some of the most important facts briefly, recall that X is a sequence of independent random variables, each having the
exponential distribution on [0, ∞) with rate r. The sequence T is the partial sum sequence associated with X, and has stationary
independent increments. For n ∈ N_+, the n th arrival time T_n has the gamma distribution with parameters n and r. The process N
is the inverse of T, in a certain sense, and also has stationary independent increments. For t ∈ (0, ∞), the number of arrivals N_t in
(0, t] has the Poisson distribution with parameter rt.

Suppose now that each arrival has an associated real-valued random variable that represents the value of the arrival in a certain
sense. Here are some typical examples:
The arrivals are customers at a store. Each customer spends a random amount of money.
The arrivals are visits to a website. Each visitor spends a random amount of time at the site.
The arrivals are failure times of a complex system. Each failure requires a random repair time.
The arrivals are earthquakes at a particular location. Each earthquake has a random severity, a measure of the energy released.
For n ∈ N_+, let U_n denote the value of the n th arrival. We assume that U = (U_1, U_2, …) is a sequence of independent,
identically distributed, real-valued random variables, and that U is independent of the underlying Poisson process. The common
distribution may be discrete or continuous, but in either case, we let f denote the common probability density function. We will let
μ = E(U_n) denote the common mean, σ^2 = var(U_n) the common variance, and G the common moment generating function, so
that G(s) = E[exp(sU_n)] for s in some interval I about 0. Here is our main definition:

The compound Poisson process associated with the given Poisson process N and the sequence U is the stochastic process
V = {V_t : t ∈ [0, ∞)} where

V_t = ∑_{n=1}^{N_t} U_n   (14.7.1)

Thus, V_t is the total value for all of the arrivals in (0, t]. For the examples above:
V_t is the total income to the store up to time t.
V_t is the total time spent at the site by the customers who arrived up to time t.
V_t is the total repair time for the failures up to time t.
V_t is the total energy released up to time t.
Recall that a sum over an empty index set is 0, so V_0 = 0.

Properties
Note that for fixed t, V_t is a random sum of independent, identically distributed random variables, a topic that we have studied
before. In this sense, we have a special case, since the number of terms N_t has the Poisson distribution with parameter rt. But we
also have a new wrinkle, since the process is indexed by the continuous time parameter t, and so we can study its properties as a
stochastic process. Our first result is a pair of properties shared by the underlying Poisson process.

V has stationary, independent increments:
1. If s, t ∈ [0, ∞) with s < t, then V_t − V_s has the same distribution as V_{t−s}.
2. If (t_1, t_2, …, t_n) is a sequence of points in [0, ∞) with t_1 < t_2 < ⋯ < t_n then (V_{t_1}, V_{t_2} − V_{t_1}, …, V_{t_n} − V_{t_{n−1}}) is a
sequence of independent variables.
) is a
sequence of independent variables.
Proof

Next we consider various moments of the compound process.

For t ∈ [0, ∞), the mean and variance of V_t are
1. E(V_t) = μrt
2. var(V_t) = (μ² + σ²)rt

Proof
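These formulas are easy to check numerically. In the sketch below, the values are exponential with mean μ = 0.5 (so σ² = 0.25), with r = 2 and t = 3; both the mean and variance of V_t should then be close to 3. The parameter values are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
r, t, reps = 2.0, 3.0, 200_000
mu = 0.5                                    # exponential values: mean μ = 0.5, variance σ² = μ²

N = rng.poisson(r * t, reps)                # one copy of N_t per replication
u = rng.exponential(mu, N.sum())            # all of the U_n, concatenated across replications
V = np.bincount(np.repeat(np.arange(reps), N), weights=u, minlength=reps)

print(V.mean(), mu * r * t)                 # E(V_t) = μrt = 3
print(V.var(), 2 * mu**2 * r * t)           # var(V_t) = (μ² + σ²)rt = 3 here, since σ² = μ²
```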

For t ∈ [0, ∞), the moment generating function of V_t is given by

E[exp(sV_t)] = exp(rt[G(s) − 1]),  s ∈ I   (14.7.4)

Proof

By exactly the same argument, the same relationship holds for characteristic functions and, in the case that the variables in U take values in N, for probability generating functions. That is, if the variables in U have generating function G, then the generating function H of V_t is given by

H(s) = exp(rt[G(s) − 1])   (14.7.6)

for s in the domain of G, where the generating function can be any of the three types we have discussed: probability, moment, or characteristic.
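When the values are integer-valued, the distribution of V_t can also be computed directly from the random-sum representation, by truncating the mixture P(V_t = k) = ∑_n P(N_t = n) f^{∗n}(k). The sketch below uses a hypothetical value density f on {1, 2, 3} and an assumed λ = rt = 4; the truncation point is chosen so that the neglected Poisson tail is negligible.

```python
import numpy as np
from scipy.stats import poisson

lam = 4.0                                   # λ = rt (assumed value)
f = np.array([0.0, 0.5, 0.3, 0.2])          # hypothetical density of U on {1, 2, 3}
nmax = 80                                   # truncation point for the Poisson mixture

g = np.zeros(nmax * (len(f) - 1) + 1)       # will hold P(V_t = k)
conv = np.array([1.0])                      # f^{*0}: point mass at 0
for n in range(nmax + 1):
    g[:len(conv)] += poisson.pmf(n, lam) * conv
    conv = np.convolve(conv, f)             # next convolution power of f

mu = (np.arange(len(f)) * f).sum()          # mean of U
print(g.sum())                              # ≈ 1
print((np.arange(len(g)) * g).sum(), lam * mu)   # mean of V_t ≈ λμ
```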

Examples and Special Cases


The Discrete Case
First we note that thinning a Poisson process can be thought of as a special case of a compound Poisson process. Thus, suppose that U = (U_1, U_2, …) is a Bernoulli trials sequence with success parameter p ∈ (0, 1), and as above, that U is independent of the Poisson process N. In the usual language of thinning, the arrivals are of two types (1 and 0), and U_i is the type of the ith arrival. Thus the compound process V constructed above is the thinned process, so that V_t is the number of type 1 points up to time t. We know that V is also a Poisson process, with rate rp.


The results above for thinning generalize to the case where the values of the arrivals have a discrete distribution. Thus, suppose U_i takes values in a countable set S ⊆ R, and as before, let f denote the common probability density function, so that f(u) = P(U_i = u) for u ∈ S and i ∈ N_+. For u ∈ S, let N_t^u denote the number of arrivals up to time t that have the value u, and let N^u = {N_t^u : t ∈ [0, ∞)} denote the corresponding stochastic process. Armed with this setup, here is the result:

The compound Poisson process V associated with N and U can be written in the form

V_t = ∑_{u∈S} u N_t^u,  t ∈ [0, ∞)   (14.7.7)

The processes {N^u : u ∈ S} are independent Poisson processes, and N^u has rate rf(u) for u ∈ S.

Proof

Compound Poisson Distributions


A compound Poisson random variable can be defined outside of the context of a Poisson process. Here is the formal definition:

Suppose that U = (U_1, U_2, …) is a sequence of independent, identically distributed random variables, and that N is independent of U and has the Poisson distribution with parameter λ ∈ (0, ∞). Then V = ∑_{i=1}^{N} U_i has a compound Poisson distribution.

But in fact, compound Poisson variables usually do arise in the context of an underlying Poisson process. In any event, the results
on the mean and variance above and the generating function above hold with rt replaced by λ . Compound Poisson distributions are
infinitely divisible. A famous theorem of William Feller gives a partial converse: an infinitely divisible distribution on N must be
compound Poisson.
The negative binomial distribution on N is infinitely divisible, and hence must be compound Poisson. Here is the construction:

Let k ∈ (0, ∞) and p ∈ (0, 1). Suppose that U = (U_1, U_2, …) is a sequence of independent variables, each having the logarithmic series distribution with shape parameter 1 − p. Suppose also that N is independent of U and has the Poisson distribution with parameter −k ln(p). Then V = ∑_{i=1}^{N} U_i has the negative binomial distribution on N with parameters k and p.

Proof

As a special case (k = 1 ), it follows that the geometric distribution on N is also compound Poisson.
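The construction is easy to test by simulation, since numpy can sample from the logarithmic series distribution, and scipy's nbinom uses the same convention as here (a distribution on N with probability generating function (p/(1 − (1 − p)s))^k). The values k = 2.5 and p = 0.4 below are arbitrary.

```python
import numpy as np
from scipy.stats import nbinom

rng = np.random.default_rng(2)
k, p, reps = 2.5, 0.4, 200_000              # assumed parameter values
lam = -k * np.log(p)                        # Poisson parameter −k ln(p)

N = rng.poisson(lam, reps)
u = rng.logseries(1 - p, N.sum())           # logarithmic series values with shape parameter 1 − p
V = np.bincount(np.repeat(np.arange(reps), N), weights=u, minlength=reps).astype(int)

for j in range(6):                          # empirical pmf vs. negative binomial pmf
    print(j, round((V == j).mean(), 4), round(nbinom.pmf(j, k, p), 4))
```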


14.8: Poisson Processes on General Spaces

Basic Theory
The Process
So far, we have studied the Poisson process as a model for random points in time. However, there is also a Poisson model for
random points in space. Some specific examples of such “random points” are
Defects in a sheet of material.
Raisins in a cake.
Stars in the sky.
The Poisson process for random points in space can be defined in a very general setting. All that is really needed is a measure space (S, 𝒮, μ). Thus, S is a set (the underlying space for our random points), 𝒮 is a σ-algebra of subsets of S (as always, the allowable sets), and μ is a positive measure on (S, 𝒮) (a measure of the size of sets). The most important special case is when S is a (Lebesgue) measurable subset of R^d for some d ∈ N_+, 𝒮 is the σ-algebra of measurable subsets of S, and μ = λ_d is d-dimensional Lebesgue measure. Specializing further, recall the lower dimensional spaces:
1. When d = 1, S ⊆ R and λ_1 is length measure.
2. When d = 2, S ⊆ R^2 and λ_2 is area measure.
3. When d = 3, S ⊆ R^3 and λ_3 is volume measure.

Of course, the characterization of the Poisson process on [0, ∞) in terms of the interarrival times and the characterization in terms of the arrival times do not generalize, because they depend critically on the order relation on [0, ∞). However, the characterization in terms of the counting process generalizes perfectly to our new setting. Thus, consider a process that produces random points in S, and as usual, let N(A) denote the number of random points in A ∈ 𝒮. Thus N is a random counting measure on (S, 𝒮).

The random measure N is a Poisson process or a Poisson random measure on S with density parameter r > 0 if the following axioms are satisfied:
1. If A ∈ 𝒮 then N(A) has the Poisson distribution with parameter rμ(A).
2. If {A_i : i ∈ I} is a countable, disjoint collection of sets in 𝒮 then {N(A_i) : i ∈ I} is a set of independent random variables.

To draw parallels with the Poisson process on [0, ∞), note that axiom (1) is the generalization of stationary, Poisson-distributed increments, and axiom (2) is the generalization of independent increments. By convention, if μ(A) = 0 then N(A) = 0 with probability 1, and if μ(A) = ∞ then N(A) = ∞ with probability 1. (These distributions are considered degenerate members of the Poisson family.) On the other hand, note that if 0 < μ(A) < ∞ then N(A) has support N.
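The two axioms also give a direct recipe for simulating the process on a set A with 0 < μ(A) < ∞: generate N(A) from the Poisson distribution with parameter rμ(A), and then, anticipating the results on the distribution of the random points below, place that many independent, uniformly distributed points in A. Here is a minimal Python sketch for a rectangle in the plane; the density and dimensions are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
r, w, h = 2.0, 4.0, 3.0                     # density parameter and rectangle dimensions (assumed)

n = rng.poisson(r * w * h)                  # N(A) ~ Poisson(r μ(A)), with μ(A) = wh the area
points = rng.uniform([0, 0], [w, h], (n, 2))    # given N(A) = n, the points are iid uniform in A

left = (points[:, 0] < w / 2).sum()         # count in the left half: Poisson(r w h / 2),
print(n, left)                              # independent of the count in the right half
```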

In the two-dimensional Poisson process, vary the width w and the rate r. Note the location and shape of the probability density
function of N . For selected values of the parameters, run the simulation 1000 times and compare the empirical density
function to the true probability density function.

For A ∈ 𝒮,
1. E[N(A)] = rμ(A)
2. var[N(A)] = rμ(A)
Proof

In particular, r can be interpreted as the expected density of the random points (that is, the expected number of points in a region of
unit size), justifying the name of the parameter.

In the two-dimensional Poisson process, vary the width w and the density parameter r. Note the size and location of the mean
±standard deviation bar of N . For various values of the parameters, run the simulation 1000 times and compare the empirical

mean and standard deviation to the true mean and standard deviation.

The Distribution of the Random Points
As before, the Poisson model defines the most random way to distribute points in space, in a certain sense. Assume that we have a
Poisson process N on (S, 𝒮, μ) with density parameter r ∈ (0, ∞).

Given that A ∈ 𝒮 contains exactly one random point, the position X of the point is uniformly distributed on A.
Proof

More generally, if A contains n points, then the positions of the points are independent and each is uniformly distributed in A .

Suppose that A, B ∈ 𝒮 and B ⊆ A. For n ∈ N_+, the conditional distribution of N(B) given N(A) = n is the binomial distribution with trial parameter n and success parameter p = μ(B)/μ(A).
Proof

Thus, given N (A) = n , each of the n random points falls into B , independently, with probability p = μ(B)/μ(A) , regardless of
the density parameter r.

More generally, suppose that A ∈ 𝒮 and that A is partitioned into k subsets (B_1, B_2, …, B_k) in 𝒮. Then the conditional distribution of (N(B_1), N(B_2), …, N(B_k)) given N(A) = n is the multinomial distribution with parameters n and (p_1, p_2, …, p_k), where p_i = μ(B_i)/μ(A) for i ∈ {1, 2, …, k}.
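Both conditioning results can be checked numerically. In the sketch below, N(B) and N(A ∖ B) are generated as independent Poisson counts (using the axioms), and among the replications with N(A) = n, the counts in B are compared with the binomial distribution with parameters n and p = μ(B)/μ(A). The measures and the density r below are arbitrary assumptions; note that r drops out of the conditional distribution.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(4)
r, mu_B, mu_rest, n = 3.0, 1.0, 2.0, 6      # assumed density, measures, and conditioning value
reps = 400_000

NB = rng.poisson(r * mu_B, reps)            # N(B)
NR = rng.poisson(r * mu_rest, reps)         # N(A \ B), independent of N(B)
cond = NB[NB + NR == n]                     # N(B) given N(A) = n

p = mu_B / (mu_B + mu_rest)                 # success parameter μ(B)/μ(A) = 1/3
for j in range(n + 1):
    print(j, round((cond == j).mean(), 4), round(binom.pmf(j, n, p), 4))
```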

Thinning and Combining


Suppose that N is a Poisson random process on (S, 𝒮, μ) with density parameter r ∈ [0, ∞). Thinning (or splitting) this process works just like thinning the Poisson process on [0, ∞). Specifically, suppose that each random point, independently of the others, is either type 1 with probability p or type 0 with probability 1 − p, where p ∈ (0, 1) is a new parameter. Let N_1 and N_0 denote the random counting measures associated with the type 1 and type 0 points, respectively. That is, N_i(A) is the number of type i random points in A, for A ∈ 𝒮 and i ∈ {0, 1}.

N_1 and N_0 are independent Poisson processes on (S, 𝒮, μ) with density parameters pr and (1 − p)r, respectively.

Proof

This result extends naturally to k ∈ N_+ types. As in the standard case, combining independent Poisson processes produces a new Poisson process, and the density parameters add.

Suppose that N_0 and N_1 are independent Poisson processes on (S, 𝒮, μ), with density parameters r_0 and r_1, respectively. Then the process obtained by combining the random points is also a Poisson process on (S, 𝒮, μ) with density parameter r_0 + r_1.

Proof

Applications and Special Cases


Non-homogeneous Poisson Processes
A non-homogeneous Poisson process on [0, ∞) can be thought of simply as a Poisson process on [0, ∞) with respect to a measure that is not the standard Lebesgue measure λ_1 on [0, ∞). Thus suppose that r : [0, ∞) → (0, ∞) is piecewise continuous with ∫_0^∞ r(t) dt = ∞, and let

m(t) = ∫_0^t r(s) ds,  t ∈ [0, ∞)   (14.8.8)

Consider the non-homogeneous Poisson process with rate function r (and hence mean function m). Recall that the Lebesgue-
Stieltjes measure on [0, ∞) associated with m (which we also denote by m) is defined by the condition

m(a, b] = m(b) − m(a), a, b ∈ [0, ∞), a < b (14.8.9)

Equivalently, m is the measure that is absolutely continuous with respect to λ1 , with density function r. That is, if A is a
measurable subset of [0, ∞) then

m(A) = ∫_A r(t) dt   (14.8.10)

The non-homogeneous Poisson process on [0, ∞) with rate function r is the Poisson process on [0, ∞) with respect to the
measure m.
Proof
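A standard way to simulate a non-homogeneous Poisson process on a bounded interval is by thinning, using the thinning result above in reverse: generate a homogeneous process whose rate r_max dominates r, and keep each point s independently with probability r(s)/r_max. The sketch below uses the hypothetical rate function r(s) = 2 + sin(s), for which m(T) = 2T + 1 − cos(T).

```python
import numpy as np

rng = np.random.default_rng(5)
T, r_max, reps = 10.0, 3.0, 20_000
rate = lambda s: 2.0 + np.sin(s)            # hypothetical rate function, bounded by r_max

N = rng.poisson(r_max * T, reps)            # homogeneous candidate counts, one per replication
times = rng.uniform(0, T, N.sum())          # candidate arrival times
owner = np.repeat(np.arange(reps), N)
keep = rng.uniform(size=N.sum()) < rate(times) / r_max   # keep with probability r(s)/r_max

counts = np.bincount(owner[keep], minlength=reps)
print(counts.mean(), 2 * T + 1 - np.cos(T))  # empirical mean vs. m(T) ≈ 21.84
```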

Nearest Points in R^d

In this subsection, we consider a rather specialized topic, but one that is fun and interesting. Consider the Poisson process on (R^d, ℛ_d, λ_d) with density parameter r > 0, where as usual, ℛ_d is the σ-algebra of Lebesgue measurable subsets of R^d, and λ_d is d-dimensional Lebesgue measure. We use the usual Euclidean norm on R^d:

∥x∥_d = (x_1² + x_2² + ⋯ + x_d²)^{1/2},  x = (x_1, x_2, …, x_d) ∈ R^d   (14.8.11)

For t > 0, let B_t = {x ∈ R^d : ∥x∥_d ≤ t} denote the ball of radius t centered at the origin. Recall that λ_d(B_t) = c_d t^d where

c_d = π^{d/2} / Γ(d/2 + 1)   (14.8.12)

is the measure of the unit ball in R^d, and where Γ is the gamma function. Of course, c_1 = 2, c_2 = π, c_3 = 4π/3.
For t ≥ 0, let M_t = N(B_t), the number of random points in the ball B_t, or equivalently, the number of random points within distance t of the origin. From our formula for the measure of B_t above, it follows that M_t has the Poisson distribution with parameter r c_d t^d.

Now let Z_0 = 0 and for n ∈ N_+ let Z_n denote the distance of the nth closest random point to the origin. Note that Z_n is analogous to the nth arrival time for the Poisson process on [0, ∞). Clearly the processes M = (M_t : t ≥ 0) and Z = (Z_0, Z_1, …) are inverses of each other in the sense that Z_n ≤ t if and only if M_t ≥ n. Both of these events mean that there are at least n random points within distance t of the origin.

Distributions
1. c_d Z_n^d has the gamma distribution with shape parameter n and rate parameter r.
2. Z_n has probability density function g_n given by

g_n(z) = [d (c_d r)^n z^{nd−1} / (n − 1)!] exp(−r c_d z^d),  0 ≤ z < ∞   (14.8.13)

Proof
The differences c_d Z_n^d − c_d Z_{n−1}^d for n ∈ N_+ are independent, and each has the exponential distribution with rate parameter r.
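These distributional facts can be checked by simulating the process in a disk so large that at least a few points are essentially always present. The sketch below, with assumed density r = 1 in the plane, compares the sample means of c_2 Z_n² with the gamma means n/r.

```python
import numpy as np

rng = np.random.default_rng(6)
r, R, reps, nmax = 1.0, 6.0, 20_000, 3      # density, disk radius, replications (assumed)
c2 = np.pi                                  # measure of the unit ball in R^2

acc = np.zeros(nmax)
for _ in range(reps):
    n = rng.poisson(r * c2 * R**2)          # mean count ≈ 113, so n ≥ nmax essentially always
    dist = R * np.sqrt(rng.uniform(0, 1, n))    # distances of uniform points from the origin
    acc += c2 * np.sort(dist)[:nmax]**2     # c_2 Z_n^2 for n = 1, ..., nmax

print(acc / reps)                           # ≈ [1/r, 2/r, 3/r], the gamma(n, r) means
```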

Computational Exercises
Suppose that defects in a sheet of material follow the Poisson model with an average of 1 defect per 2 square meters. Consider
a 5 square meter sheet of material.
1. Find the probability that there will be at least 3 defects.
2. Find the mean and standard deviation of the number of defects.
Answer
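As a quick check with scipy: the number of defects has the Poisson distribution with parameter rμ(A) = (1/2)·5 = 5/2, so both parts reduce to routine Poisson computations.

```python
from scipy.stats import poisson

lam = 5 / 2                                 # r μ(A): 1 defect per 2 m² over a 5 m² sheet
print(1 - poisson.cdf(2, lam))              # P(at least 3 defects) ≈ 0.4562
print(lam, lam ** 0.5)                      # mean 2.5 and standard deviation ≈ 1.5811
```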

Suppose that raisins in a cake follow the Poisson model with an average of 2 raisins per cubic inch. Consider a slab of cake that
measures 3 by 4 by 1 inches.
1. Find the probability that there will be no more than 20 raisins.
2. Find the mean and standard deviation of the number of raisins.
Answer

Suppose that the occurrence of trees in a forest of a certain type that exceed a certain critical size follows the Poisson model. In
a one-half square mile region of the forest there are 40 trees that exceed the specified size.
1. Estimate the density parameter.
2. Using the estimated density parameter, find the probability of finding at least 100 trees that exceed the specified size in a square mile region of the forest.
Answer

Suppose that defects in a type of material follow the Poisson model. It is known that a square sheet with side length 2 meters contains one defect. Find the probability that the defect is in a circular region of the material with radius 1 meter.

Answer

Suppose that raisins in a cake follow the Poisson model. A 6 cubic inch piece of the cake with 20 raisins is divided into 3 equal
parts. Find the probability that each piece has at least 6 raisins.
Answer

Suppose that defects in a sheet of material follow the Poisson model, with an average of 5 defects per square meter. Each
defect, independently of the others is mild with probability 0.5, moderate with probability 0.3, or severe with probability 0.2.
Consider a circular piece of the material with radius 1 meter.
1. Give the mean and standard deviation of the number of defects of each type in the piece.
2. Find the probability that there will be at least 2 defects of each type in the piece.
Answer


CHAPTER OVERVIEW
15: Renewal Processes
A renewal process is an idealized stochastic model for “events” that occur randomly in time (generically called renewals or
arrivals). The basic mathematical assumption is that the times between the successive arrivals are independent and identically
distributed. Renewal processes have a very rich and interesting mathematical structure and can be used as a foundation for building
more realistic models. Moreover, renewal processes are often found embedded in other stochastic processes, most notably Markov
chains.
15.1: Introduction
15.2: Renewal Equations
15.3: Renewal Limit Theorems
15.4: Delayed Renewal Processes
15.5: Alternating Renewal Processes
15.6: Renewal Reward Processes


15.1: Introduction

A renewal process is an idealized stochastic model for “events” that occur randomly in time. These temporal events are generically
referred to as renewals or arrivals. Here are some typical interpretations and applications.
The arrivals are “customers” arriving at a “service station”. Again, the terms are generic. A customer might be a person and the
service station a store, but also a customer might be a file request and the service station a web server.
A device is placed in service and eventually fails. It is replaced by a device of the same type and the process is repeated. We do
not count the replacement time in our analysis; equivalently we can assume that the replacement is immediate. The times of the
replacements are the renewals.
The arrivals are times of some natural event, such as a lightning strike, a tornado, or an earthquake, at a particular geographical
point.
The arrivals are emissions of elementary particles from a radioactive source.

Basic Processes
The basic model actually gives rise to several interrelated random processes: the sequence of interarrival times, the sequence of
arrival times, and the counting process. The term renewal process can refer to any (or all) of these. There are also several natural
“age” processes that arise. In this section we will define and study the basic properties of each of these processes in turn.

Interarrival Times
Let X denote the time of the first arrival, and X the time between the (i − 1) st and ith arrivals for i ∈ {2, 3, …). Our basic
1 i

assumption is that the sequence of interarrival times X = (X , X , …) is an independent, identically distributed sequence of
1 2

random variables. In statistical terms, X corresponds to sampling from the distribution of a generic interarrival time X. We assume
that X takes values in [0, ∞) and P(X > 0) > 0 , so that the interarrival times are nonnegative, but not identically 0. Let
μ = E(X) denote the common mean of the interarrival times. We allow that possibility that μ = ∞ . On the other hand,

μ >0 .
Proof

If μ < ∞, we will let σ² = var(X) denote the common variance of the interarrival times. Let F denote the common distribution function of the interarrival times, so that

F(x) = P(X ≤ x),  x ∈ [0, ∞)   (15.1.1)

The distribution function F turns out to be of fundamental importance in the study of renewal processes. We will let f denote the
probability density function of the interarrival times if the distribution is discrete or if the distribution is continuous and has a
probability density function (that is, if the distribution is absolutely continuous with respect to Lebesgue measure on [0, ∞)). In the
discrete case, the following definition turns out to be important:

If X takes values in the set {nd : n ∈ N} for some d ∈ (0, ∞) , then X (or its distribution) is said to be arithmetic (the terms
lattice and periodic are also used). The largest such d is the span of X.

The reason the definition is important is because the limiting behavior of renewal processes turns out to be more complicated when
the interarrival distribution is arithmetic.

The Arrival Times


Let

T_n = ∑_{i=1}^{n} X_i,  n ∈ N   (15.1.2)

We follow our usual convention that the sum over an empty index set is 0; thus T_0 = 0. On the other hand, T_n is the time of the nth arrival for n ∈ N_+. The sequence T = (T_0, T_1, …) is called the arrival time process, although note that T_0 is not considered an arrival. A renewal process is so named because the process starts over, independently of the past, at each arrival time.

Figure 15.1.1 : The interarrival times and arrival times
The sequence T is the partial sum process associated with the independent, identically distributed sequence of interarrival times
X. Partial sum processes associated with independent, identically distributed sequences have been studied in several places in this

project. In the remainder of this subsection, we will collect some of the more important facts about such processes. First, we can
recover the interarrival times from the arrival times:
X_i = T_i − T_{i−1},  i ∈ N_+   (15.1.3)

Next, let F_n denote the distribution function of T_n, so that

F_n(t) = P(T_n ≤ t),  t ∈ [0, ∞)   (15.1.4)

Recall that if X has probability density function f (in either the discrete or continuous case), then T_n has probability density function f_n = f^{∗n} = f ∗ f ∗ ⋯ ∗ f, the n-fold convolution power of f.

The sequence of arrival times T has stationary, independent increments:
1. If m ≤ n then T_n − T_m has the same distribution as T_{n−m} and thus has distribution function F_{n−m}.
2. If n_1 ≤ n_2 ≤ n_3 ≤ ⋯ then (T_{n_1}, T_{n_2} − T_{n_1}, T_{n_3} − T_{n_2}, …) is a sequence of independent random variables.
Proof

If n, m ∈ N then
1. E(T_n) = nμ
2. var(T_n) = nσ²
3. cov(T_m, T_n) = min{m, n} σ²
Proof

Recall the law of large numbers: T_n/n → μ as n → ∞
1. With probability 1 (the strong law).
2. In probability (the weak law).

Note that T_n ≤ T_{n+1} for n ∈ N since the interarrival times are nonnegative. Also P(T_n = T_{n−1}) = P(X_n = 0) = F(0). This can be positive, so with positive probability, more than one arrival can occur at the same time. On the other hand, the arrival times are unbounded:

T_n → ∞ as n → ∞ with probability 1.
Proof

The Counting Process


For t ≥ 0, let N_t denote the number of arrivals in the interval [0, t]:

N_t = ∑_{n=1}^{∞} 1(T_n ≤ t),  t ∈ [0, ∞)   (15.1.6)

We will refer to the random process N = (N_t : t ≥ 0) as the counting process. Recall again that T_0 = 0 is not considered an arrival, but it's possible to have T_n = 0 for n ∈ N_+, so there may be one or more arrivals at time 0.

N_t = max{n ∈ N : T_n ≤ t} for t ≥ 0.

If s, t ∈ [0, ∞) and s ≤ t then N_t − N_s is the number of arrivals in (s, t].

Note that as a function of t, N_t is a (random) step function with jumps at the distinct values of (T_1, T_2, …); the size of the jump at an arrival time is the number of arrivals at that time. In particular, N is an increasing function of t.

Figure 15.1.2 : The counting process
More generally, we can define the (random) counting measure corresponding to the sequence of random points (T_1, T_2, …) in [0, ∞). Thus, if A is a (measurable) subset of [0, ∞), we will let N(A) denote the number of the random points in A:

N(A) = ∑_{n=1}^{∞} 1(T_n ∈ A)   (15.1.7)

In particular, note that with our new notation, N_t = N[0, t] for t ≥ 0 and N(s, t] = N_t − N_s for s ≤ t. Thus, the random counting measure is completely determined by the counting process. The counting process is the “cumulative measure function” for the counting measure, analogous to the cumulative distribution function of a probability measure.
the counting measure, analogous the cumulative distribution function of a probability measure.

For t ≥ 0 and n ∈ N,
1. T_n ≤ t if and only if N_t ≥ n
2. N_t = n if and only if T_n ≤ t < T_{n+1}

Proof

Of course, the complements of the events in part (1) are also equivalent, so T_n > t if and only if N_t < n. On the other hand, neither of the events N_t ≤ n and T_n ≥ t implies the other. For example, we could easily have N_t = n and T_n < t < T_{n+1}. Taking complements, neither of the events N_t > n and T_n < t implies the other. The last result also shows that the arrival time process T and the counting process N are inverses of each other in a sense.

The following events have probability 1:
1. N_t < ∞ for all t ∈ [0, ∞)
2. N_t → ∞ as t → ∞
Proof

All of the results so far in this subsection show that the arrival time process T and the counting process N are inverses of one
another in a sense. The important equivalences above can be used to obtain the probability distribution of the counting variables in
terms of the interarrival distribution function F .

For t ≥ 0 and n ∈ N,
1. P(N_t ≥ n) = F_n(t)
2. P(N_t = n) = F_n(t) − F_{n+1}(t)

The next result is little more than a restatement of the result above relating the counting process and the arrival time process.
However, you may need to review the section on filtrations and stopping times to understand the result.

For t ∈ [0, ∞), N_t + 1 is a stopping time for the sequence of interarrival times X.
Proof

The Renewal Function


The function M that gives the expected number of arrivals up to time t is known as the renewal function:
M(t) = E(N_t),  t ∈ [0, ∞)   (15.1.8)

The renewal function turns out to be of fundamental importance in the study of renewal processes. Indeed, the renewal function
essentially characterizes the renewal process. It will take awhile to fully understand this, but the following theorem is a first step:

The renewal function is given in terms of the interarrival distribution function by

M(t) = ∑_{n=1}^{∞} F_n(t),  0 ≤ t < ∞   (15.1.9)

Proof

Note that we have not yet shown that M(t) < ∞ for t ≥ 0, and note also that this does not follow from the previous theorem. However, we will establish this finiteness condition in the subsection on moment generating functions below. If M is differentiable, the derivative m = M′ is known as the renewal density, so that m(t) gives the expected rate of arrivals per unit time at t ∈ [0, ∞).
time at t ∈ [0, ∞).


More generally, if A is a (measurable) subset of [0, ∞), let M (A) = E[N (A)] , the expected number of arrivals in A .

M is a positive measure on [0, ∞). This measure is known as the renewal measure.
Proof

The renewal measure is also given by


M(A) = ∑_{n=1}^{∞} P(T_n ∈ A),  A ⊂ [0, ∞)   (15.1.12)

Proof

If s, t ∈ [0, ∞) with s ≤ t then M(t) − M(s) = M(s, t], the expected number of arrivals in (s, t].

The last theorem implies that the renewal function actually determines the entire renewal measure. The renewal function is the
“cumulative measure function”, analogous to the cumulative distribution function of a probability measure. Thus, every renewal
process naturally leads to two measures on [0, ∞), the random counting measure corresponding to the arrival times, and the
measure associated with the expected number of arrivals.

The Age Processes


For t ∈ [0, ∞), T_{N_t} ≤ t < T_{N_t+1}. That is, t is in the random renewal interval [T_{N_t}, T_{N_t+1}).

Consider the reliability setting in which whenever a device fails, it is immediately replaced by a new device of the same type. Then the sequence of interarrival times X is the sequence of lifetimes, while T_n is the time that the nth device is placed in service. There are several other natural random processes that can be defined.

The random variable

C_t = t − T_{N_t},  t ∈ [0, ∞)   (15.1.13)

is called the current life at time t. This variable takes values in the interval [0, t] and is the age of the device that is in service at time t. The random process C = (C_t : t ≥ 0) is the current life process.

The random variable

R_t = T_{N_t+1} − t,  t ∈ [0, ∞)   (15.1.14)

is called the remaining life at time t. This variable takes values in the interval (0, ∞) and is the time remaining until the device that is in service at time t fails. The random process R = (R_t : t ≥ 0) is the remaining life process.

Figure 15.1.3 : The current and remaining life at time t

The random variable

L_t = C_t + R_t = T_{N_t+1} − T_{N_t} = X_{N_t+1},  t ∈ [0, ∞)   (15.1.15)

is called the total life at time t. This variable takes values in [0, ∞) and gives the total life of the device that is in service at time t. The random process L = (L_t : t ≥ 0) is the total life process.

Tail events of the current and remaining life can be written in terms of each other and in terms of the counting variables.

Suppose that t ∈ [0, ∞), x ∈ [0, t], and y ∈ [0, ∞). Then
1. {R_t > y} = {N_{t+y} − N_t = 0}
2. {C_t ≥ x} = {R_{t−x} > x} = {N_t − N_{t−x} = 0}
3. {C_t ≥ x, R_t > y} = {R_{t−x} > x + y} = {N_{t+y} − N_{t−x} = 0}

Proof

Figure 15.1.4 : The events of interest for the current and remaining life
Of course, the various equivalent events in the last result must have the same probability. In particular, it follows that if we know the distribution of R_t for all t then we also know the distribution of C_t for all t, and in fact we know the joint distribution of (R_t, C_t) for all t and hence also the distribution of L_t for all t.

For fixed t ∈ (0, ∞) the total life at t (the lifetime of the device in service at time t) is stochastically larger than a generic lifetime. This result, a bit surprising at first, is known as the inspection paradox. Let X denote a generic interarrival time.

P(L_t > x) ≥ P(X > x) for x ≥ 0.


Proof
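The paradox is easy to see numerically. In the sketch below the interarrival times are uniform on [0, 1], so P(X > x) = 1 − x; for t well away from 0, P(L_t > x) is markedly larger. The values t = 5 and x = 0.7 are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
t, x, reps = 5.0, 0.7, 100_000              # assumed values

X = rng.uniform(0, 1, (reps, 30))           # 30 interarrivals comfortably cover time t = 5
T = X.cumsum(axis=1)
N = (T <= t).sum(axis=1)                    # N_t, so t falls in [T_N, T_{N+1})
L = X[np.arange(reps), np.minimum(N, 29)]   # total life L_t = X_{N_t + 1} (the clip is a safety net)

print((L > x).mean(), 1 - x)                # P(L_t > x) ≈ 0.51 here, versus P(X > x) = 0.3
```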

Basic Comparison
The basic comparison in the following result is often useful, particularly for obtaining various bounds. The idea is very simple: if
the interarrival times are shortened, the arrivals occur more frequently.

Suppose now that we have two interarrival sequences, X = (X_1, X_2, …) and Y = (Y_1, Y_2, …), defined on the same probability space, with Y_i ≤ X_i (with probability 1) for each i. Then for n ∈ N and t ∈ [0, ∞),
1. T_{Y,n} ≤ T_{X,n}
2. N_{Y,t} ≥ N_{X,t}
3. M_Y(t) ≥ M_X(t)

Examples and Special Cases


Bernoulli Trials
Suppose that X = (X_1, X_2, …) is a sequence of Bernoulli trials with success parameter p ∈ (0, 1). Recall that X is a sequence of independent, identically distributed indicator variables with P(X_i = 1) = p.

Recall the random processes derived from X:
1. Y = (Y_0, Y_1, …) where Y_n is the number of successes in the first n trials. The sequence Y is the partial sum process associated with X. The variable Y_n has the binomial distribution with parameters n and p.
2. U = (U_1, U_2, …) where U_n is the number of trials needed to go from success number n − 1 to success number n. These are independent variables, each having the geometric distribution on N_+ with parameter p.
3. V = (V_0, V_1, …) where V_n is the trial number of success n. The sequence V is the partial sum process associated with U. The variable V_n has the negative binomial distribution with parameters n and p.

It is natural to view the successes as arrivals in a discrete-time renewal process.

Consider the renewal process with interarrival sequence U. Then
1. The basic assumptions are satisfied and the mean interarrival time is μ = 1/p.
2. V is the sequence of arrival times.
3. Y is the counting process (restricted to N).
4. The renewal function is M(n) = np for n ∈ N.
It follows that the renewal measure is proportional to counting measure on N_+.

Run the binomial timeline experiment 1000 times for various values of the parameters n and p. Compare the empirical
distribution of the counting variable to the true distribution.

Run the negative binomial experiment 1000 times for various values of the parameters k and p. Compare the empirical distribution of the arrival time to the true distribution.

Consider again the renewal process with interarrival sequence U. For n ∈ N,
1. The current life and remaining life at time n are independent.
2. The remaining life at time n has the same distribution as an interarrival time U, namely the geometric distribution on N_+ with parameter p.
3. The current life at time n has a truncated geometric distribution with parameters n and p:

P(C_n = k) = p(1 − p)^k for k ∈ {0, 1, …, n − 1}, and P(C_n = n) = (1 − p)^n   (15.1.20)

Proof

This renewal process starts over, independently of the past, not only at the arrival times, but at fixed times n ∈ N as well. The
Bernoulli trials process (with the successes as arrivals) is the only discrete-time renewal process with this property, which is a
consequence of the memoryless property of the geometric interarrival distribution.
We can also use the indicator variables as the interarrival times. This may seem strange at first, but actually turns out to be useful.

Consider the renewal process with interarrival sequence X.
1. The basic assumptions are satisfied and the mean interarrival time is μ = p.
2. Y is the sequence of arrival times.
3. The number of arrivals at time 0 is U_1 − 1 and the number of arrivals at time i ∈ N_+ is U_{i+1}.
4. The number of arrivals in the interval [0, n] is V_{n+1} − 1 for n ∈ N. This gives the counting process.
5. The renewal function is M(n) = (n + 1)/p − 1 for n ∈ N.

The age processes are not very interesting for this renewal process.

For n ∈ N (with probability 1),
1. C_n = 0
2. R_n = 1

The Moment Generating Function of the Counting Variables


As an application of the last renewal process, we can show that the moment generating function of the counting variable N_t in an arbitrary renewal process is finite in an interval about 0 for every t ∈ [0, ∞). This implies that N_t has finite moments of all orders and in particular that M(t) < ∞ for every t ∈ [0, ∞).


Suppose that X = (X_1, X_2, …) is the interarrival sequence for a renewal process. By the basic assumptions, there exists a > 0 such that p = P(X ≥ a) > 0. We now consider the renewal process with interarrival sequence X_a = (X_{a,1}, X_{a,2}, …), where X_{a,i} = a 1(X_i ≥ a) for i ∈ N_+. The renewal process with interarrival sequence X_a is just like the renewal process with Bernoulli interarrivals, except that the arrival times occur at the points in the sequence (0, a, 2a, …), instead of (0, 1, 2, …).

For each t ∈ [0, ∞), N_t has finite moment generating function in an interval about 0, and hence N_t has moments of all orders.
Proof

The Poisson Process


The Poisson process, named after Simeon Poisson, is the most important of all renewal processes. The Poisson process is so
important that it is treated in a separate chapter in this project. Please review the essential properties of this process:

Properties of the Poisson process with rate r ∈ (0, ∞).
1. The interarrival times have an exponential distribution with rate parameter r. Thus, the basic assumptions above are satisfied and the mean interarrival time is μ = 1/r.
2. The exponential distribution is the only distribution with the memoryless property on [0, ∞).
3. The time of the nth arrival T_n has the gamma distribution with shape parameter n and rate parameter r.
4. The counting process N = (N_t : t ≥ 0) has stationary, independent increments and N_t has the Poisson distribution with parameter rt for t ∈ [0, ∞).
5. In particular, the renewal function is M(t) = rt for t ∈ [0, ∞). Hence, the renewal measure is a multiple of the standard length measure (Lebesgue measure) on [0, ∞).

Consider again the Poisson process with rate parameter r. For t ∈ [0, ∞),
1. The current life and remaining life at time t are independent.
2. The remaining life at time t has the same distribution as an interarrival time X, namely the exponential distribution with rate parameter r.
3. The current life at time t has a truncated exponential distribution with parameters t and r:

P(C_t ≥ s) = e^{−rs} for 0 ≤ s ≤ t, and P(C_t ≥ s) = 0 for s > t   (15.1.22)

Proof

The Poisson process starts over, independently of the past, not only at the arrival times, but at fixed times t ∈ [0, ∞) as well. The
Poisson process is the only renewal process with this property, which is a consequence of the memoryless property of the
exponential interarrival distribution.

Run the Poisson experiment 1000 times for various values of the parameters t and r. Compare the empirical distribution of the
counting variable to the true distribution.

Run the gamma experiment 1000 times for various values of the parameters n and r. Compare the empirical distribution of the
arrival time to the true distribution.

Simulation Exercises
Open the renewal experiment and set t = 10 . For each of the following interarrival distributions, run the simulation 1000 times
and note the shape and location of the empirical distribution of the counting variable. Note also the mean of the interarrival
distribution in each case.
1. The continuous uniform distribution on the interval [0, 1] (the standard uniform distribution).
2. The discrete uniform distribution starting at a = 0, with step size h = 0.1, and with n = 10 points.
3. The gamma distribution with shape parameter k = 2 and scale parameter b = 1 .
4. The beta distribution with left shape parameter a = 3 and right shape parameter b = 2 .
5. The exponential-logarithmic distribution with shape parameter p = 0.1 and scale parameter b = 1 .
6. The Gompertz distribution with shape parameter a = 1 and scale parameter b = 1.
7. The Wald distribution with mean μ = 1 and shape parameter λ = 1 .
8. The Weibull distribution with shape parameter k = 2 and scale parameter b = 1 .


15.2: Renewal Equations

Many quantities of interest in the study of renewal processes can be described by a special type of integral equation known as a
renewal equation. Renewal equations almost always arise by conditioning on the time of the first arrival and by using the defining
property of a renewal process—the fact that the process restarts at each arrival time, independently of the past. However, before we
can study renewal equations, we need to develop some additional concepts and tools involving measures, convolutions, and
transforms. Some of the results in the advanced sections on measure theory, general distribution functions, the integral with respect
to a measure, properties of the integral, and density functions are needed for this section. You may need to review some of these
topics as necessary. As usual, we assume that all functions and sets that are mentioned are measurable with respect to the
appropriate σ-algebras. In particular, [0, ∞), which is our basic temporal space, is given the usual Borel σ-algebra generated by the intervals.

Measures, Integrals, and Transforms


Distribution Functions and Positive Measures
Recall that a distribution function on [0, ∞) is a function G : [0, ∞) → [0, ∞) that is increasing and continuous from the right.
The distribution function G defines a positive measure on [0, ∞), which we will also denote by G, by means of the formula
G[0, t] = G(t) for t ∈ [0, ∞).

Figure 15.2.1 : G(t) is the cumulative measure at t


Hopefully, our notation will not cause confusion and it will be clear from context whether G refers to the positive measure (a set function) or the distribution function (a point function). More generally, if a, b ∈ [0, ∞) and a ≤ b then G(a, b] = G(b) − G(a). Note that the positive measure associated with a distribution function is locally finite in the sense that G(A) < ∞ if A ⊂ [0, ∞) is bounded. Of course, if A is unbounded, G(A) may well be infinite. The basic structure of a distribution function and its associated positive measure occurred several times in our preliminary discussion of renewal processes:

Distributions associated with a renewal process.


1. The distribution function F of the interarrival times defines a probability measure on [0, ∞).
2. The counting process N defines a (random) counting measure on [0, ∞).
3. The renewal function M defines a (deterministic) positive measure on [0, ∞).

Suppose again that G is a distribution function on [0, ∞). Recall that the integral associated with the positive measure G is also
called the Lebesgue-Stieltjes integral associated with the distribution function G (named for Henri Lebesgue and Thomas Stieltjes).
If f : [0, ∞) → R and A ⊆ [0, ∞) (measurable of course), the integral of f over A (if it exists) is denoted

∫_A f(t) dG(t)   (15.2.1)

We use the more conventional ∫_0^t f(x) dG(x) for the integral over [0, t] and ∫_0^∞ f(x) dG(x) for the integral over [0, ∞). On the other hand, ∫_s^t f(x) dG(x) means the integral over (s, t] for s < t, and ∫_s^∞ f(x) dG(x) means the integral over (s, ∞). Thus, the additivity of the integral over disjoint domains holds, as it must. For example, for t ∈ [0, ∞),

∫_0^∞ f(x) dG(x) = ∫_0^t f(x) dG(x) + ∫_t^∞ f(x) dG(x)   (15.2.2)

This notation would be ambiguous without the clarification, but is consistent with how the measure works: G[0, t] = G(t) for t ≥ 0, G(s, t] = G(t) − G(s) for 0 ≤ s < t, etc. Of course, if G is continuous as a function, so that G is also continuous as a measure, then none of this matters—the integral over an interval is the same whether or not endpoints are included. The following definition is a natural complement to the locally finite property of the positive measures that we are considering.

A function f : [0, ∞) → R is locally bounded if it is measurable and is bounded on [0, t] for each t ∈ [0, ∞).

The locally bounded functions form a natural class for which our integrals of interest exist.

Suppose that G is a distribution function on [0, ∞) and f : [0, ∞) → R is locally bounded. Then g : [0, ∞) → R defined by g(t) = ∫_0^t f(s) dG(s) is also locally bounded.

Proof

Note that if f and g are locally bounded, then so are f + g and f g. If f is increasing on [0, ∞) then f is locally bounded, so in
particular, a distribution function on [0, ∞) is locally bounded. If f is continuous on [0, ∞) then f is locally bounded. Similarly, if
G and H are distribution functions on [0, ∞) and if c ∈ (0, ∞), then G + H and cG are also distribution functions on [0, ∞).

Convolution, which we consider next, is another way to construct new distributions on [0, ∞) from ones that we already have.

Convolution
The term convolution means different things in different settings. Let's start with the definition we know, the convolution of
probability density functions, on our space of interest [0, ∞).

Suppose that X and Y are independent random variables with values in [0, ∞) and with probability density functions f and g, respectively. Then X + Y has probability density function f ∗ g given as follows, in the discrete and continuous cases, respectively:

(f ∗ g)(t) = ∑_{s∈[0,t]} f(t − s) g(s)   (15.2.4)

(f ∗ g)(t) = ∫_0^t f(t − s) g(s) ds   (15.2.5)

In the discrete case, it's understood that t is a possible value of X + Y , and the sum is over the countable collection of s ∈ [0, t]
with s a value of X and t − s a value of Y . Often in this case, the random variables take values in N, in which case the sum is
simply over the set {0, 1, … , t} for t ∈ N . The discrete and continuous cases could be unified by defining convolution with
respect to a general positive measure on [0, ∞). Moreover, the definition clearly makes sense for functions that are not necessarily
probability density functions.

Suppose that f, g : [0, ∞) → R are locally bounded and that H is a distribution function on [0, ∞). The convolution of f and g with respect to H is the function on [0, ∞) defined by

t ↦ ∫_0^t f(t − s) g(s) dH(s)   (15.2.6)

If f and g are probability density functions for discrete distributions on a countable set C ⊆ [0, ∞) and if H is counting measure on C, we get discrete convolution, as above. If f and g are probability density functions for continuous distributions on [0, ∞) and if H is Lebesgue measure, we get continuous convolution, as above. Note however, that if g is nonnegative then G(t) = ∫_0^t g(s) dH(s) for t ∈ [0, ∞) defines another distribution function on [0, ∞), and the convolution integral above is simply ∫_0^t f(t − s) dG(s). This motivates our next version of convolution, the one that we will use in the remainder of this section.

Suppose that f : [0, ∞) → R is locally bounded and that G is a distribution function on [0, ∞). The convolution of the function f with the distribution G is the function f ∗ G defined by

(f ∗ G)(t) = ∫_0^t f(t − s) dG(s),  t ∈ [0, ∞)   (15.2.7)

Note that if F and G are distribution functions on [0, ∞), the convolution F ∗ G makes sense, with F simply as a function and G
as a distribution function. The result is another distribution function. Moreover in this case, the operation is commutative.

If F and G are distribution functions on [0, ∞) then F ∗ G is also a distribution function on [0, ∞), and F ∗ G = G ∗ F.

Proof

If F and G are probability distribution functions corresponding to independent random variables X and Y with values in [0, ∞), then F ∗ G is the probability distribution function of X + Y. Suppose now that f : [0, ∞) → R is locally bounded and that G and H are distribution functions on [0, ∞). From the previous result, both (f ∗ G) ∗ H and f ∗ (G ∗ H) make sense. Fortunately, they are the same, so that convolution is associative.

Suppose that f : [0, ∞) → R is locally bounded and that G and H are distribution functions on [0, ∞). Then

(f ∗ G) ∗ H = f ∗ (G ∗ H ) (15.2.9)

Proof

Finally, convolution is a linear operation. That is, convolution preserves sums and scalar multiples, whenever these make sense.

Suppose that f , g : [0, ∞) → R are locally bounded, H is a distribution function on [0, ∞), and c ∈ R . Then
1. (f + g) ∗ H = (f ∗ H ) + (g ∗ H )
2. (cf ) ∗ H = c(f ∗ H )
Proof

Suppose that f : [0, ∞) → R is locally bounded, G and H are distribution functions on [0, ∞), and that c ∈ (0, ∞). Then
1. f ∗ (G + H ) = (f ∗ G) + (f ∗ H )
2. f ∗ (cG) = c(f ∗ G)
Proof

Laplace Transforms
Like convolution, the term Laplace transform (named for Pierre Simon Laplace of course) can mean slightly different things in
different settings. We start with the usual definition that you may have seen in your study of differential equations or other subjects:

The Laplace transform of a function f : [0, ∞) → R is the function ϕ defined as follows, for all s ∈ (0, ∞) for which the
integral exists in R:

ϕ(s) = ∫_0^∞ e^{−st} f(t) dt   (15.2.11)

Suppose that f is nonnegative, so that the integral defining the transform exists in [0, ∞] for every s ∈ (0, ∞). If ϕ(s_0) < ∞ for some s_0 ∈ (0, ∞) then ϕ(s) < ∞ for s ≥ s_0. The transform of a general function f exists (in R) if and only if the transform of |f| is finite at s. It follows that if f has a Laplace transform, then the transform ϕ is defined on an interval of the form (a, ∞) for some a ∈ (0, ∞). The actual domain is of very little importance; the main point is that the Laplace transform, if it exists, will be defined for all sufficiently large s. Basically, a nonnegative function will fail to have a Laplace transform if it grows at a “hyper-exponential rate” as t → ∞.
We could generalize the Laplace transform by replacing the Riemann or Lebesgue integral with the integral over a positive measure
on [0, ∞).

Suppose that G is a distribution on [0, ∞). The Laplace transform of f : [0, ∞) → R with respect to G is the function given below, defined for all s ∈ (0, ∞) for which the integral exists in R:

s ↦ ∫_0^∞ e^{−st} f(t) dG(t)   (15.2.12)

However, as before, if f is nonnegative, then H(t) = ∫_0^t f(x) dG(x) for t ∈ [0, ∞) defines another distribution function, and the previous integral is simply ∫_0^∞ e^{−st} dH(t). This motivates the definition for the Laplace transform of a distribution.

The Laplace transform of a distribution F on [0, ∞) is the function Φ defined as follows, for all s ∈ (0, ∞) for which the
integral is finite:

Φ(s) = ∫_0^∞ e^{−st} dF(t)   (15.2.13)

Once again if F has a Laplace transform, then the transform will be defined for all sufficiently large s ∈ (0, ∞). We will try to be explicit in explaining which of the Laplace transform definitions is being used. For a generic function, the first definition applies, and we will use a lower case Greek letter. If the function is a distribution function, either definition makes sense, but it is usually the latter that is appropriate, in which case we use an upper case Greek letter. Fortunately, there is a simple relationship between the two.

Suppose that F is a distribution function on [0, ∞). Let Φ denote the Laplace transform of the distribution F and ϕ the
Laplace transform of the function F . Then Φ(s) = sϕ(s) .
Proof

For a probability distribution, there is also a simple relationship between the Laplace transform and the moment generating
function.

Suppose that X is a random variable with values in [0, ∞) and with probability distribution function F. The Laplace transform Φ and the moment generating function Γ of the distribution F are given as follows, and so Φ(s) = Γ(−s) for all s ∈ (0, ∞):

Φ(s) = E(e^{−sX}) = ∫_0^∞ e^{−st} dF(t)   (15.2.16)

Γ(s) = E(e^{sX}) = ∫_0^∞ e^{st} dF(t)   (15.2.17)

In particular, a probability distribution F on [0, ∞) always has a Laplace transform Φ , defined on (0, ∞) . Note also that if
F (0) < 1 (so that X is not deterministically 0), then Φ(s) < 1 for s ∈ (0, ∞) .

Laplace transforms are important for general distributions on [0, ∞) for the same reasons that moment generating functions are
important for probability distributions: the transform of a distribution uniquely determines the distribution, and the transform of a
convolution is the product of the corresponding transforms (and products are much nicer mathematically than convolutions). The
following theorems give the essential properties of Laplace transforms. We assume that the transforms exist, of course, and it
should be understood that equations involving transforms hold for sufficiently large s ∈ (0, ∞) .

Suppose that F and G are distributions on [0, ∞) with Laplace transforms Φ and Γ, respectively. If Φ(s) = Γ(s) for s sufficiently large, then F = G.

In the case of general functions on [0, ∞), the conclusion is that f =g except perhaps on a subset of [0, ∞) of measure 0. The
Laplace transform is a linear operation.

Suppose that f, g : [0, ∞) → R have Laplace transforms ϕ and γ, respectively, and that c ∈ R. Then
1. f + g has Laplace transform ϕ + γ
2. cf has Laplace transform cϕ
Proof

The same properties holds for distributions on [0, ∞) with c ∈ (0, ∞). Integral transforms have a smoothing effect. Laplace
transforms are differentiable, and we can interchange the derivative and integral operators.

Suppose that f : [0, ∞) → R has Laplace transform ϕ. Then ϕ has derivatives of all orders and

ϕ^{(n)}(s) = ∫_0^∞ (−1)^n t^n e^{−st} f(t) dt   (15.2.18)

Restated, (−1)^n ϕ^{(n)} is the Laplace transform of the function t ↦ t^n f(t). Again, one of the most important properties is that the Laplace transform turns convolution into products.

Suppose that f : [0, ∞) → R is locally bounded with Laplace transform ϕ , and that G is a distribution function on [0, ∞) with
Laplace transform Γ. Then f ∗ G has Laplace transform ϕ ⋅ Γ .
Proof

If F and G are distributions on [0, ∞), then so is F ∗ G . The result above applies, of course, with F and F ∗ G thought of as
functions and G as a distribution, but multiplying through by s and using the theorem above, it's clear that the result is also true
with all three as distributions.

Renewal Equations and Their Solutions


Armed with our new analytic machinery, we can return to the study of renewal processes. Thus, suppose that we have a renewal process with interarrival sequence X = (X_1, X_2, …), arrival time sequence T = (T_0, T_1, …), and counting process N = {N_t : t ∈ [0, ∞)}. As usual, let F denote the common distribution function of the interarrival times, and let M denote the renewal function, so that M(t) = E(N_t) for t ∈ [0, ∞). Of course, the probability distribution function F defines a probability measure on [0, ∞), but as noted earlier, M is also a distribution function and so defines a positive measure on [0, ∞). Recall that F^c = 1 − F is the right distribution function (or reliability function) of an interarrival time.

The distributions of the arrival times are the convolution powers of F. That is, F_n = F^{∗n} = F ∗ F ∗ ⋯ ∗ F.
Proof

The next definition is the central one for this section.

Suppose that a : [0, ∞) → R is locally bounded. An integral equation of the form

u = a + u ∗ F   (15.2.22)

for an unknown function u : [0, ∞) → R is called a renewal equation for u.

Often u(t) = E(U_t) where {U_t : t ∈ [0, ∞)} is a random process of interest associated with the renewal process. The renewal equation comes from conditioning on the time of the first arrival T_1 = X_1, and then using the defining property of the renewal process—the fact that the process starts over, independently of the past, at the arrival time. Our next important result illustrates this.

Renewal equations for M and F :


1. M = F + M ∗ F
2. F = M − F ∗ M
Proof

Thus, the renewal function itself satisfies a renewal equation. Of course, we already have a “formula” for M, namely M = ∑_{n=1}^{∞} F_n. However, sometimes M can be computed more easily from the renewal equation directly. The next result is the transform version of the previous result:

The distributions F and M have Laplace transforms Φ and Γ, respectively, that are related as follows:

Γ = Φ/(1 − Φ),  Φ = Γ/(Γ + 1)   (15.2.25)

Proof from the renewal equation


Proof from convolution

In particular, the renewal distribution M always has a Laplace transform. The following theorem gives the fundamental results on
the solution of the renewal equation.

Suppose that a : [0, ∞) → R is locally bounded. Then the unique locally bounded solution to the renewal equation
u = a + u ∗ F is u = a + a ∗ M.
Direct proof
Proof from Laplace transforms
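The renewal equation also leads to a simple numerical scheme: on a grid with step h, the convolution (u ∗ F)(t_i) = ∫_0^{t_i} u(t_i − s) dF(s) can be approximated by a Riemann sum, and since the right side involves only earlier grid values, the equation can be solved by forward recursion. The sketch below solves M = F + M ∗ F for interarrival times uniform on [0, 1] and checks the values M(1) = e − 1 and M(2) = (e² − 1) − e from the example below.

```python
import numpy as np

h, T = 0.001, 2.0                           # grid step and horizon (assumed)
n = int(round(T / h))
t = np.arange(n + 1) * h
F = np.clip(t, 0, 1)                        # uniform(0, 1) distribution function
f = ((t > 0) & (t <= 1)).astype(float)      # its density, evaluated on the grid

M = np.zeros(n + 1)                         # M(0) = F(0) = 0
for i in range(1, n + 1):
    j = np.arange(1, i + 1)                 # right-endpoint sum for ∫ M(t_i − s) f(s) ds
    M[i] = F[i] + h * np.sum(f[j] * M[i - j])

print(M[int(1 / h)], np.e - 1)              # ≈ 1.7183
print(M[n], (np.e**2 - 1) - np.e)           # ≈ 3.6708
```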

Returning to the renewal equations for M and F above, we now see that the renewal function M completely determines the
renewal process: from M we can obtain F , and everything is ultimately constructed from the interarrival times. Of course, this is
also clear from the Laplace transform result above which gives simple algebraic equations for each transform in terms of the other.

The Distribution of the Age Variables
Let's recall the definition of the age variables. A deterministic time t ∈ [0, ∞) falls in the random renewal interval [T_{N_t}, T_{N_t+1}). The current life (or age) at time t is C_t = t − T_{N_t}, the remaining life at time t is R_t = T_{N_t+1} − t, and the total life at time t is L_t = T_{N_t+1} − T_{N_t}. In the usual reliability setting, C_t is the age of the device that is in service at time t, while R_t is the time until that device fails, and L_t is the total lifetime of the device.

For t, y ∈ [0, ∞), let

r_y(t) = P(R_t > y) = P(N(t, t + y] = 0)   (15.2.28)

and let F_y^c(t) = F^c(t + y). Note that y ↦ r_y(t) is the right distribution function of R_t. We will derive and then solve a renewal equation for r_y by conditioning on the time of the first arrival. We can then find integral equations that describe the distribution of the current age and the joint distribution of the current and remaining ages.
the current age and the joint distribution of the current and remaining ages.

For y ∈ [0, ∞), r_y satisfies the renewal equation r_y = F_y^c + r_y ∗ F and hence for t ∈ [0, ∞),

P(R_t > y) = F^c(t + y) + ∫_0^t F^c(t + y − s) dM(s),  y ≥ 0   (15.2.29)

Proof

We can now describe the distribution of the current age.

For t ∈ [0, ∞),

P(C_t ≥ x) = F^c(t) + ∫_0^{t−x} F^c(t − s) dM(s),  x ∈ [0, t]   (15.2.32)

Proof

Finally we get the joint distribution of the current and remaining ages.

For t ∈ [0, ∞),

P(C_t ≥ x, R_t > y) = F^c(t + y) + ∫_0^{t−x} F^c(t + y − s) dM(s),  x ∈ [0, t], y ∈ [0, ∞)   (15.2.33)

Proof

Examples and Special Cases


Uniformly Distributed Interarrivals
Consider the renewal process with interarrival times uniformly distributed on [0, 1]. Thus the distribution function of an interarrival time is F(x) = x for 0 ≤ x ≤ 1. The renewal function M can be computed from the general renewal equation for M by successively solving differential equations. The following exercise gives the first two cases.

On the interval [0, 2], show that M is given as follows:
1. M(t) = e^t − 1 for 0 ≤ t ≤ 1
2. M(t) = (e^t − 1) − (t − 1)e^{t−1} for 1 ≤ t ≤ 2
Solution

Figure 15.2.2 : The graph of M on the interval [0, 2]

Show that the Laplace transform Φ of the interarrival distribution F and the Laplace transform Γ of the renewal distribution M are given by

Φ(s) = (1 − e^{−s})/s,  Γ(s) = (1 − e^{−s})/(s − 1 + e^{−s});  s ∈ (0, ∞)   (15.2.34)

Solution

Open the renewal experiment and select the uniform interarrival distribution on the interval [0, 1]. For each of the following
values of the time parameter, run the experiment 1000 times and note the shape and location of the empirical distribution of the
counting variable.
1. t = 5
2. t = 10
3. t = 15
4. t = 20
5. t = 25
6. t = 30

The Poisson Process


Recall that the Poisson process has interarrival times that are exponentially distributed with rate parameter r > 0. Thus, the interarrival distribution function F is given by F(x) = 1 − e^{−rx} for x ∈ [0, ∞). The following exercises give alternate proofs of fundamental results obtained in the Introduction.

Show that the renewal function M is given by M (t) = rt for t ∈ [0, ∞)


1. Using the renewal equation
2. Using Laplace transforms
Solution

Show that the current and remaining life at time t ≥ 0 satisfy the following properties:

1. C_t and R_t are independent.
2. R_t has the same distribution as an interarrival time, namely the exponential distribution with rate parameter r.
3. C_t has a truncated exponential distribution with parameters t and r:

P(C_t ≥ x) = e^{−rx} for 0 ≤ x ≤ t, and P(C_t ≥ x) = 0 for x > t    (15.2.40)

Solution
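The independence claim can also be explored by simulation. The following Python sketch (all parameters are illustrative assumptions) estimates the joint tail P(C_t ≥ x, R_t > y) and compares it with the product e^{−rx} e^{−ry} implied by the three parts above when x ≤ t.

```python
import numpy as np

rng = np.random.default_rng(7)
r, t, x, y, reps = 1.0, 10.0, 0.5, 0.8, 50_000
hits = 0
for _ in range(reps):
    s, last = 0.0, 0.0
    while s <= t:
        last = s                      # greatest arrival time <= t (with T_0 = 0)
        s += rng.exponential(1.0 / r)
    c, rem = t - last, s - t          # current and remaining life at time t
    hits += (c >= x) and (rem > y)
print(hits / reps)                      # empirical joint probability
print(np.exp(-r * x) * np.exp(-r * y))  # claimed product of marginal tails
```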

Bernoulli Trials
Consider the renewal process for which the interarrival times have the geometric distribution with parameter p. Recall that the probability density function is

f(n) = (1 − p)^{n−1} p,  n ∈ N_+    (15.2.42)

The arrivals are the successes in a sequence of Bernoulli trials. The number of successes Y_n in the first n trials is the counting variable for n ∈ N. The renewal equations in this section can be used to give alternate proofs of some of the fundamental results in the Introduction.

Show that the renewal function is M (n) = np for n ∈ N


1. Using the renewal equation
2. Using Laplace transforms
Proof

Show that the current and remaining life at time n ∈ N satisfy the following properties:

1. C_n and R_n are independent.
2. R_n has the same distribution as an interarrival time, namely the geometric distribution with parameter p.
3. C_n has a truncated geometric distribution with parameters n and p:

P(C_n = j) = p(1 − p)^j for j ∈ {0, 1, …, n − 1}, and P(C_n = n) = (1 − p)^n    (15.2.50)

Solution

A Gamma Interarrival Distribution


Consider the renewal process whose interarrival distribution F is gamma with shape parameter 2 and rate parameter r ∈ (0, ∞). Thus

F(t) = 1 − (1 + rt)e^{−rt},  t ∈ [0, ∞)    (15.2.52)

Recall also that F is the distribution of the sum of two independent random variables, each having the exponential distribution with
rate parameter r.

Show that the renewal distribution function M is given by

M(t) = −1/4 + (1/2)rt + (1/4)e^{−2rt},  t ∈ [0, ∞)    (15.2.53)

Solution

Note that M(t) ≈ −1/4 + (1/2)rt as t → ∞.

Figure 15.2.3 : The graph of M on the interval [0, 5] when r = 1

Open the renewal experiment and select the gamma interarrival distribution with shape parameter k = 2 and scale parameter b = 1 (so the rate parameter r = 1/b is also 1). For each of the following values of the time parameter, run the experiment 1000 times and note the shape and location of the empirical distribution of the counting variable.
1. t = 5
2. t = 10
3. t = 15
4. t = 20
5. t = 25
6. t = 30

This page titled 15.2: Renewal Equations is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

15.3: Renewal Limit Theorems

We start with a renewal process as constructed in the introduction. Thus, X = (X_1, X_2, …) is the sequence of interarrival times. These are independent, identically distributed, nonnegative variables with common distribution function F (satisfying F(0) < 1) and common mean μ. When μ = ∞, we let 1/μ = 0. When μ < ∞, we let σ denote the common standard deviation. Recall also that F^c = 1 − F is the right distribution function (or reliability function). Then, T = (T_0, T_1, …) is the arrival time sequence, where T_0 = 0 and

T_n = ∑_{i=1}^n X_i    (15.3.1)

is the time of the nth arrival for n ∈ N_+. Finally, N = {N_t : t ∈ [0, ∞)} is the counting process, where for t ∈ [0, ∞),

N_t = ∑_{n=1}^∞ 1(T_n ≤ t)    (15.3.2)

is the number of arrivals in [0, t]. The renewal function M is defined by M(t) = E(N_t) for t ∈ [0, ∞).
We noted earlier that the arrival time process and the counting process are inverses, in a sense. The arrival time process is the
partial sum process for a sequence of independent, identically distributed variables. Thus, it seems reasonable that the fundamental
limit theorems for partial sum processes (the law of large numbers and the central limit theorem) should have analogs for
the counting process. That is indeed the case, and the purpose of this section is to explore the limiting behavior of renewal
processes. The main results that we will study, known appropriately enough as renewal theorems, are important for other stochastic
processes, particularly Markov chains.

Basic Theory
The Law of Large Numbers
Our first result is a strong law of large numbers for the renewal counting process, which comes as you might guess, from the law of
large numbers for the sequence of arrival times.

If μ < ∞ then N_t/t → 1/μ as t → ∞ with probability 1.


Proof

Thus, 1/μ is the limiting average rate of arrivals per unit time.

Open the renewal experiment and set t = 50 . For a variety of interarrival distributions, run the simulation 1000 times and note
how the empirical distribution is concentrated near t/μ.

The Central Limit Theorem


Our next goal is to show that the counting variable N_t is asymptotically normal.

Suppose that μ and σ are finite, and let

Z_t = (N_t − t/μ) / (σ √(t/μ³)),  t > 0    (15.3.5)

The distribution of Z_t converges to the standard normal distribution as t → ∞.

Proof
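A simulation sketch in Python (the gamma interarrival distribution and all parameters are illustrative assumptions) shows the standardization at work: the simulated values of Z_t should have mean near 0 and standard deviation near 1.

```python
import numpy as np

rng = np.random.default_rng(2)
t, mu, sigma, reps = 200.0, 2.0, np.sqrt(2.0), 10_000   # gamma(2, 1): mu = 2, sigma = sqrt(2)
z = []
for _ in range(reps):
    s, n = 0.0, 0
    while True:
        s += rng.gamma(2.0, 1.0)
        if s > t:
            break
        n += 1                                          # N_t, the number of arrivals in [0, t]
    z.append((n - t / mu) / (sigma * np.sqrt(t / mu ** 3)))
z = np.array(z)
print(z.mean(), z.std())   # should be near 0 and 1
```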

Open the renewal experiment and set t = 50. For a variety of interarrival distributions, run the simulation 1000 times and note the “normal” shape of the empirical distribution. Compare the empirical mean and standard deviation to t/μ and σ√(t/μ³), respectively.

The Elementary Renewal Theorem
The elementary renewal theorem states that the basic limit in the law of large numbers above holds in mean, as well as with
probability 1. That is, the limiting mean average rate of arrivals is 1/μ. The elementary renewal theorem is of fundamental
importance in the study of the limiting behavior of Markov chains, but the proof is not as easy as one might hope. In particular,
recall that convergence with probability 1 does not imply convergence in mean, so the elementary renewal theorem does not follow
from the law of large numbers.

M (t)/t → 1/μ as t → ∞ .
Proof

Open the renewal experiment and set t = 50. For a variety of interarrival distributions, run the experiment 1000 times and once again compare the empirical mean and standard deviation to t/μ and σ√(t/μ³), respectively.

The Renewal Theorem


The renewal theorem states that the expected number of renewals in an interval is asymptotically proportional to the length of the
interval; the proportionality constant is 1/μ. The precise statement is different, depending on whether the renewal process is
arithmetic or not. Recall that for an arithmetic renewal process, the interarrival times take values in a set of the form {nd : n ∈ N}
for some d ∈ (0, ∞) , and the largest such d is the span of the distribution.

For h > 0, M(t, t + h] → h/μ as t → ∞ in each of the following cases:

1. The renewal process is non-arithmetic
2. The renewal process is arithmetic with span d, and h is a multiple of d

The renewal theorem is also known as Blackwell's theorem in honor of David Blackwell. The final limit theorem we will study is
the most useful, but before we can state the theorem, we need to define and study the class of functions to which it applies.

Direct Riemann Integration


Recall that in the ordinary theory of Riemann integration, the integral of a function on the interval [0, t] exists if the upper and
lower Riemann sums converge to a common number as the partition is refined. Then, the integral of the function on [0, ∞) is
defined to be the limit of the integral on [0, t], as t → ∞ . For our new definition, a function is said to be directly Riemann
integrable if the lower and upper Riemann sums on the entire unbounded interval [0, ∞) converge to a common number as the
partition is refined, a more restrictive definition than the usual one.

Suppose that g : [0, ∞) → [0, ∞). For h ∈ (0, ∞) and k ∈ N, let m_k(g, h) = inf{g(t) : t ∈ [kh, (k + 1)h)} and M_k(g, h) = sup{g(t) : t ∈ [kh, (k + 1)h)}. The lower and upper Riemann sums of g on [0, ∞) corresponding to h are

L_g(h) = h ∑_{k=0}^∞ m_k(g, h),  U_g(h) = h ∑_{k=0}^∞ M_k(g, h)    (15.3.12)

The sums exist in [0, ∞] and satisfy the following properties:

1. L_g(h) ≤ U_g(h) for h > 0
2. L_g(h) increases as h decreases
3. U_g(h) decreases as h decreases

It follows that lim_{h↓0} L_g(h) and lim_{h↓0} U_g(h) exist in [0, ∞] and lim_{h↓0} L_g(h) ≤ lim_{h↓0} U_g(h). Naturally, the case where the limits are finite and agree is what we're after.

A function g : [0, ∞) → [0, ∞) is directly Riemann integrable if U_g(h) < ∞ for every h > 0 and

lim_{h↓0} L_g(h) = lim_{h↓0} U_g(h)    (15.3.13)

The common value is ∫_0^∞ g(t) dt.
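For a concrete illustration, the following Python sketch (the test function g(t) = e^{−t} and the truncation point are illustrative assumptions) computes the lower and upper sums for decreasing h; both approach ∫_0^∞ e^{−t} dt = 1.

```python
import numpy as np

# Since g(t) = exp(-t) is continuous and decreasing, on [kh, (k+1)h) the
# infimum is g((k+1)h) and the supremum is g(kh). The tail beyond the cutoff
# is negligible for this g.
def riemann_sums(h, cutoff=40.0):
    k = np.arange(0, int(cutoff / h))
    lower = h * np.sum(np.exp(-(k + 1) * h))   # L_g(h)
    upper = h * np.sum(np.exp(-k * h))         # U_g(h)
    return lower, upper

for h in [1.0, 0.1, 0.01, 0.001]:
    print(h, riemann_sums(h))   # both sums approach 1 as h decreases
```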

Ordinary Riemann integrability on [0, ∞) allows functions that are unbounded and oscillate wildly as t → ∞ , and these are the
types of functions that we want to exclude for the renewal theorems. The following result connects ordinary Riemann integrability
with direct Riemann integrability.

If g : [0, ∞) → [0, ∞) is integrable (in the ordinary Riemann sense) on [0, t] for every t ∈ [0, ∞) and if U_g(h) < ∞ for some h ∈ (0, ∞) then g is directly Riemann integrable.

Here is a simple and useful class of functions that are directly Riemann integrable.

Suppose that g : [0, ∞) → [0, ∞) is decreasing with ∫_0^∞ g(t) dt < ∞. Then g is directly Riemann integrable.

The Key Renewal Theorems


The key renewal theorem is an integral version of the renewal theorem, and is the most useful of the various limit theorems.

Suppose that the renewal process is non-arithmetic and that g : [0, ∞) → [0, ∞) is directly Riemann integrable. Then

(g ∗ M)(t) = ∫_0^t g(t − s) dM(s) → (1/μ) ∫_0^∞ g(x) dx as t → ∞    (15.3.14)

Connections
Our next goal is to see how the various renewal theorems relate.

The renewal theorem implies the elementary renewal theorem:


Proof

Conversely, the elementary renewal theorem almost implies the renewal theorem.
Proof

The key renewal theorem implies the renewal theorem


Proof

Conversely, the renewal theorem implies the key renewal theorem.

The Age Processes


The key renewal theorem can be used to find the limiting distributions of the current and remaining age. Recall that for t ∈ [0, ∞) the current life at time t is C_t = t − T_{N_t} and the remaining life at time t is R_t = T_{N_t+1} − t.

If the renewal process is non-arithmetic, then

P(R_t > x) → (1/μ) ∫_x^∞ F^c(y) dy as t → ∞,  x ∈ [0, ∞)    (15.3.17)

Proof

If the renewal process is non-arithmetic, then

P(C_t ≥ x) → (1/μ) ∫_x^∞ F^c(y) dy as t → ∞,  x ∈ [0, ∞)    (15.3.19)

Proof

The current and remaining life have the same limiting distribution. In particular,

lim_{t→∞} P(C_t ≤ x) = lim_{t→∞} P(R_t ≤ x) = (1/μ) ∫_0^x F^c(y) dy,  x ∈ [0, ∞)    (15.3.21)

Proof
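The limiting distribution can be checked by simulation. The Python sketch below (gamma(2, 1) interarrivals and the particular t and x are illustrative assumptions) compares the empirical distribution of the current life at a large time with the closed form of the limit.

```python
import numpy as np

# gamma(2, 1) interarrivals: mu = 2 and F^c(y) = (1 + y) e^{-y}, so the limit is
# (1/mu) * integral_0^x F^c(y) dy = (1/2) * (2 - (2 + x) e^{-x}).
rng = np.random.default_rng(3)
t, reps, x = 50.0, 20_000, 1.5
count = 0
for _ in range(reps):
    s = last = 0.0
    while s <= t:
        last = s
        s += rng.gamma(2.0, 1.0)
    count += (t - last) <= x          # current life C_t = t - T_{N_t}
print(count / reps, 0.5 * (2.0 - (2.0 + x) * np.exp(-x)))
```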

The fact that the current and remaining age processes have the same limiting distribution may seem surprising at first, but there is a simple intuitive explanation. After a long period of time, the renewal process looks just about the same backward in time as forward in time. But reversing the direction of time reverses the roles of current and remaining age.

Examples and Special Cases


The Poisson Process
Recall that the Poisson process, the most important of all renewal processes, has interarrival times that are exponentially distributed with rate parameter r > 0. Thus, the interarrival distribution function is F(x) = 1 − e^{−rx} for x ≥ 0 and the mean interarrival time is μ = 1/r.

Verify each of the following directly:


1. The law of large numbers for the counting process.
2. The central limit theorem for the counting process.
3. The elementary renewal theorem.
4. The renewal theorem.

Bernoulli Trials
Suppose that X = (X_1, X_2, …) is a sequence of Bernoulli trials with success parameter p ∈ (0, 1). Recall that X is a sequence of independent, identically distributed indicator variables with p = P(X = 1). We have studied a number of random processes derived from X:

Random processes associated with Bernoulli trials.

1. Y = (Y_0, Y_1, …) where Y_n is the number of successes in the first n trials. The sequence Y is the partial sum process associated with X. The variable Y_n has the binomial distribution with parameters n and p.
2. U = (U_1, U_2, …) where U_n is the number of trials needed to go from success number n − 1 to success number n. These are independent variables, each having the geometric distribution on N_+ with parameter p.
3. V = (V_0, V_1, …) where V_n is the trial number of success n. The sequence V is the partial sum process associated with U. The variable V_n has the negative binomial distribution with parameters n and p.

Consider the renewal process with interarrival sequence U . Thus, μ = 1/p is the mean interarrival time, and Y is the counting
process. Verify each of the following directly:
1. The law of large numbers for the counting process.
2. The central limit theorem for the counting process.
3. The elementary renewal theorem.

Consider the renewal process with interarrival sequence X. Thus, the mean interarrival time is μ = p and the number of arrivals in the interval [0, n] is V_{n+1} − 1 for n ∈ N. Verify each of the following directly:

1. The law of large numbers for the counting process.


2. The central limit theorem for the counting process.
3. The elementary renewal theorem.

This page titled 15.3: Renewal Limit Theorems is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

15.4: Delayed Renewal Processes

Basic Theory
Preliminaries
A delayed renewal process is just like an ordinary renewal process, except that the first arrival time is allowed to have a different
distribution than the other interarrival times. Delayed renewal processes arise naturally in applications and are also found
embedded in other random processes. For example, in a Markov chain (which we study in the next chapter), visits to a fixed state,
starting in that state form the random times of an ordinary renewal process. But visits to a fixed state, starting in another state
form a delayed renewal process.

Suppose that X = (X_1, X_2, …) is a sequence of independent variables taking values in [0, ∞), with (X_2, X_3, …) identically distributed. Suppose also that P(X_i > 0) > 0 for i ∈ N_+. The stochastic process with X as the sequence of interarrival times is a delayed renewal process.

As before, the actual arrival times are the partial sums of X. Thus let

T_n = ∑_{i=1}^n X_i    (15.4.1)

so that T_0 = 0 and T_n is the time of the nth arrival for n ∈ {1, 2, …}. Also as before, N_t is the number of arrivals in [0, t] (not counting T_0):

N_t = ∑_{n=1}^∞ 1(T_n ≤ t) = max{n ∈ N : T_n ≤ t}    (15.4.2)

If we restart the clock at time T_1 = X_1, we have an ordinary renewal process with interarrival sequence (X_2, X_3, …). We use some of the standard notation developed in the Introduction for this renewal process. In particular, F denotes the common distribution function and μ the common mean of X_i for i ∈ {2, 3, …}. Similarly F_n = F^{∗n} denotes the distribution function of the sum of n independent variables with distribution function F, and M denotes the renewal function:

M(t) = ∑_{n=1}^∞ F_n(t),  t ∈ [0, ∞)    (15.4.3)

On the other hand, we will let G denote the distribution function of X_1 (the special interarrival time, different from the rest), and we will let G_n denote the distribution function of T_n for n ∈ N_+. As usual, F^c = 1 − F and G^c = 1 − G are the corresponding right-tail distribution functions.

G_n = G ∗ F_{n−1} = F_{n−1} ∗ G for n ∈ N_+.

Proof

Finally, we will let U denote the renewal function for the delayed renewal process. Thus, U(t) = E(N_t) is the expected number of arrivals in [0, t] for t ∈ [0, ∞).

The delayed renewal function satisfies

U(t) = ∑_{n=1}^∞ G_n(t),  t ∈ [0, ∞)    (15.4.4)

Proof

The delayed renewal function U satisfies the equation U = G + M ∗ G; that is,

U(t) = G(t) + ∫_0^t M(t − s) dG(s),  t ∈ [0, ∞)    (15.4.6)

Proof

The delayed renewal function U satisfies the renewal equation U = G + U ∗ F; that is,

U(t) = G(t) + ∫_0^t U(t − s) dF(s),  t ∈ [0, ∞)    (15.4.8)

Proof

Asymptotic Behavior
In a delayed renewal process only the first arrival time is changed. Thus, it's not surprising that the asymptotic behavior of a
delayed renewal process is the same as the asymptotic behavior of the corresponding regular renewal process. Our first result is the
strong law of large numbers for the delayed renewal process.

N_t/t → 1/μ as t → ∞ with probability 1.


Proof

Our next result is the elementary renewal theorem for the delayed renewal process.

U (t)/t → 1/μ as t → ∞ .

Next we have the renewal theorem for the delayed renewal process, also known as Blackwell's theorem, named for David Blackwell.

For h > 0 , U (t, t + h] = U (t + h) − U (t) → h/μ as t → ∞ in each of the following cases:


1. F is non-arithmetic
2. F is arithmetic with span d ∈ (0, ∞) , and h is a multiple of d .

Finally we have the key renewal theorem for the delayed renewal process.

Suppose that the renewal process is non-arithmetic and that g : [0, ∞) → [0, ∞) is directly Riemann integrable. Then

(g ∗ U)(t) = ∫_0^t g(t − s) dU(s) → (1/μ) ∫_0^∞ g(x) dx as t → ∞    (15.4.11)

Stationary Point Processes


Recall that a point process is a stochastic process that models a discrete set of random points in a measure space (S, S, λ). Often, of course, S ⊆ R^n for some n ∈ N_+ and λ is the corresponding n-dimensional Lebesgue measure. The special cases S = N with counting measure and S = [0, ∞) with length measure are of particular interest, in part because renewal and delayed renewal processes give rise to point processes in these spaces.

For a general point process on S, we use our standard notation and denote the number of random points in A ∈ S by N(A). There are a couple of natural properties that a point process may have. In particular, the process is said to be stationary if λ(A) = λ(B) implies that N(A) and N(B) have the same distribution for A, B ∈ S. In [0, ∞) the term stationary increments is often used, because the stationarity property means that for s, t ∈ [0, ∞), the distribution of N(s, s + t] = N_{s+t} − N_s depends only on t.

Consider now a regular renewal process. We showed earlier that the asymptotic distributions of the current life and remaining life
are the same. Intuitively, after a very long period of time, the renewal process looks pretty much the same forward in time or
backward in time. This suggests that if we make the renewal process into a delayed renewal process by giving the first arrival time
this asymptotic distribution, then the resulting point process will be stationary. This is indeed the case. Consider the setting and
notation of the preliminary subsection above.

For the delayed renewal process, the point process N is stationary if and only if the initial arrival time has distribution function

G(t) = (1/μ) ∫_0^t F^c(s) ds,  t ∈ [0, ∞)    (15.4.12)

in which case the renewal function is U (t) = t/μ for t ∈ [0, ∞).

Proof
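The stationarity result can be illustrated numerically. In the Python sketch below (an illustrative assumption, not from the text), the interarrival distribution is gamma(2, 1) with μ = 2, and the stationary initial density F^c(t)/μ = (1 + t)e^{−t}/2 happens to be an equal mixture of exponential(1) and gamma(2, 1), which makes it easy to sample; the simulated renewal function should then match U(t) = t/μ.

```python
import numpy as np

rng = np.random.default_rng(9)
t, reps = 7.0, 40_000
counts = []
for _ in range(reps):
    # Stationary delay: equal mixture of exponential(1) and gamma(2, 1)
    s = rng.exponential(1.0) if rng.random() < 0.5 else rng.gamma(2.0, 1.0)
    n = 0
    while s <= t:
        n += 1
        s += rng.gamma(2.0, 1.0)      # the ordinary interarrival times
    counts.append(n)
print(np.mean(counts), t / 2.0)       # empirical U(t) versus t / mu
```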

Examples and Applications


Patterns in Multinomial Trials
Suppose that L = (L_1, L_2, …) is a sequence of independent, identically distributed random variables taking values in a finite set S, so that L is a sequence of multinomial trials. Let f denote the common probability density function so that for a generic trial variable L, we have f(a) = P(L = a) for a ∈ S. We assume that all outcomes in S are actually possible, so f(a) > 0 for a ∈ S. In this section, we interpret S as an alphabet, and we write the sequence of variables in concatenation form, L = L_1 L_2 ⋯ rather than standard sequence form. Thus the sequence is an infinite string of letters from our alphabet S. We are interested in the repeated occurrence of a particular finite substring of letters (that is, a “word” or “pattern”) in the infinite sequence.
So, fix a word a (again, a finite string of elements of S), and consider the successive random trial numbers (T_1, T_2, …) where the word a is completed in L. Since the sequence L is independent and identically distributed, it seems reasonable that these variables are the arrival times of a renewal process. However there is a slight complication. An example may help.

Suppose that L is a sequence of Bernoulli trials (so S = {0, 1}). Suppose that the outcome of L is

101100101010001101000110 ⋯    (15.4.19)

1. For the word a = 001, note that T_1 = 7, T_2 = 15, T_3 = 22.
2. For the word b = 010, note that T_1 = 8, T_2 = 10, T_3 = 12, T_4 = 19.

In this example, you probably noted an important difference between the two words. For b, a suffix of the word (a proper substring at the end) is also a prefix of the word (a proper substring at the beginning). Word a does not have this property. So, once we “arrive” at b, there are ways to get to b again (taking advantage of the suffix-prefix) that do not exist starting from the beginning of the trials. On the other hand, once we arrive at a, arriving at a again is just like with a new sequence of trials. Thus we are led to the following definition.

Suppose that a is a finite word from the alphabet S . If no proper suffix of a is also a prefix, then a is simple. Otherwise, a is
compound.

Returning to the general setting, let T_0 = 0 and then let X_n = T_n − T_{n−1} for n ∈ N_+. For k ∈ N, let N_k = ∑_{n=1}^∞ 1(T_n ≤ k). For occurrences of the word a, X = (X_1, X_2, …) is the sequence of interarrival times, T = (T_0, T_1, …) is the sequence of arrival times, and N = {N_k : k ∈ N} is the counting process. If a is simple, these form an ordinary renewal process. If a is compound, they form a delayed renewal process, since X_1 will have a different distribution than (X_2, X_3, …). Since the structure of a delayed renewal process subsumes that of an ordinary renewal process, we will work with the notation above for the delayed process. In particular, let U denote the renewal function. Everything in this paragraph depends on the word a of course, but we have suppressed this in the notation.
Suppose a = a_1 a_2 ⋯ a_k, where a_i ∈ S for each i ∈ {1, 2, …, k}, so that a is a word of length k. Note that X_1 takes values in {k, k + 1, …}. If a is simple, this applies to the other interarrival times as well. If a is compound, the situation is more complicated: X_2, X_3, … will have some minimum value j < k, but the possible values are positive integers, of course, and include {k + 1, k + 2, …}. In any case, the renewal process is arithmetic with span 1. Expanding the definition of the probability density function f, let

f(a) = ∏_{i=1}^k f(a_i)    (15.4.20)

so that f(a) is the probability of forming a with k consecutive trials. Let μ(a) denote the common mean of X_n for n ∈ {2, 3, …}, so μ(a) is the mean number of trials between occurrences of a. Let ν(a) = E(X_1), so that ν(a) is the mean number of trials until a occurs for the first time. Our first result is an elegant connection between μ(a) and f(a), which has a wonderfully simple proof from renewal theory.

If a is a word in S then

μ(a) = 1/f(a)    (15.4.21)

Proof

Our next goal is to compute ν (a) in the case that a is a compound word.

Suppose that a is a compound word, and that b is the largest word that is a proper suffix and prefix of a. Then

ν(a) = ν(b) + μ(a) = ν(b) + 1/f(a)    (15.4.22)

Proof

By repeated use of the last result, we can compute the expected number of trials needed to form any compound word.
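The recursion translates directly into code. Here is a short Python sketch (the function names are ours, and it is specialized to Bernoulli trials for simplicity) that computes ν(a) by peeling off the largest proper suffix-prefix at each step.

```python
def first_occurrence_mean(word, p):
    """Expected number of trials until `word` (a string of '0'/'1') first
    occurs, for Bernoulli trials with P(1) = p, via nu(a) = nu(b) + 1/f(a)."""
    def density(w):
        prob = 1.0
        for ch in w:
            prob *= p if ch == "1" else 1.0 - p
        return prob

    def longest_suffix_prefix(w):
        # largest proper suffix of w that is also a prefix ('' if w is simple)
        for size in range(len(w) - 1, 0, -1):
            if w.endswith(w[:size]):
                return w[:size]
        return ""

    if not word:
        return 0.0
    return first_occurrence_mean(longest_suffix_prefix(word), p) + 1.0 / density(word)

# With p = 1/2: the simple word 001 gives 8, the compound word 010 gives 2 + 8 = 10.
print(first_occurrence_mean("001", 0.5), first_occurrence_mean("010", 0.5))
```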

Consider Bernoulli trials with success probability p ∈ (0, 1), and let q = 1 − p . For each of the following strings, find the
expected number of trials between occurrences and the expected number of trials to the first occurrence.
1. a = 001
2. b = 010
3. c = 1011011
4. d = 11 ⋯ 1 (k times)
Answer

Recall that an ace-six flat die is a six-sided die for which faces 1 and 6 have probability 1/4 each while faces 2, 3, 4, and 5 have probability 1/8 each. Ace-six flat dice are sometimes used by gamblers to cheat.

Suppose that an ace-six flat die is thrown repeatedly. Find the expected number of throws until the pattern 6165616 first
occurs.
Solution

Suppose that a monkey types randomly on a keyboard that has the 26 lower-case letter keys and the space key (so 27 keys).
Find the expected number of keystrokes until the monkey produces each of the following phrases:
1. it was the best of times
2. to be or not to be
Proof

This page titled 15.4: Delayed Renewal Processes is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

15.5: Alternating Renewal Processes

Basic Theory
Preliminaries
An alternating renewal process models a system that, over time, alternates between two states, which we denote by 1 and 0 (so the
system starts in state 1). Generically, we can imagine a device that, over time, alternates between on and off states. Specializing
further, suppose that a device operates until it fails, and then is replaced with an identical device, which in turn operates until failure
and is replaced, and so forth. In this setting, the times that the device is functioning correspond to the on state, while the
replacement times correspond to the off state. (The device might actually be repaired rather than replaced, as long as the repair
returns the device to pristine, new condition.) The basic assumption is that the pairs of random times successively spent in the two
states form an independent, identically distributed sequence. Clearly the model of a system alternating between two states is basic
and important, but moreover, such alternating processes are often found embedded in other stochastic processes.
Let's set up the mathematical notation. Let U = (U_1, U_2, …) denote the successive lengths of time that the system is in state 1, and let V = (V_1, V_2, …) the successive lengths of time that the system is in state 0. So to be clear, the system starts in state 1 and remains in that state for a period of time U_1, then goes to state 0 and stays in this state for a period of time V_1, then back to state 1 for a period of time U_2, and so forth. Our basic assumption is that W = ((U_1, V_1), (U_2, V_2), …) is an independent, identically distributed sequence. It follows that U and V each are independent, identically distributed sequences, but U_n and V_n might well be dependent. In fact, V_n might be a function of U_n for n ∈ N_+. Let μ = E(U) denote the mean of a generic time period U in state 1 and let ν = E(V) denote the mean of a generic time period V in state 0. Let G denote the distribution function of a time period U in state 1, and as usual, let G^c = 1 − G denote the right distribution function (or reliability function) of U.
Clearly it's natural to consider returns to state 1 as the arrivals in a renewal process. Thus, let X_n = U_n + V_n for n ∈ N_+ and consider the renewal process with interarrival times X = (X_1, X_2, …). Clearly this makes sense, since X is an independent, identically distributed sequence of nonnegative variables. For the most part, we will use our usual notation for a renewal process, so the common distribution function of X_n = U_n + V_n is denoted by F, the arrival time process is T = (T_0, T_1, …), the counting process is {N_t : t ∈ [0, ∞)}, and the renewal function is M. But note that the mean interarrival time is now μ + ν.

The renewal process associated with W = ((U_1, V_1), (U_2, V_2), …) as constructed above is known as an alternating renewal process.

The State Process


Our interest is in the state I_t of the system at time t ∈ [0, ∞), so I = {I_t : t ∈ [0, ∞)} is a stochastic process with state space {0, 1}. Clearly the stochastic processes W and I are equivalent in the sense that we can recover one from the other. Let p(t) = P(I_t = 1), the probability that the device is on at time t ∈ [0, ∞). Our first main result is a renewal equation for the function p.

The function p satisfies the renewal equation p = G^c + p ∗ F and hence p = G^c + G^c ∗ M.
Proof

We can now apply the key renewal theorem to get the asymptotic behavior of p.

If the renewal process is non-arithmetic, then

p(t) → μ/(μ + ν) as t → ∞    (15.5.3)

Proof
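A quick Python simulation (the on and off distributions are illustrative assumptions) agrees with the limit: with exponential on-periods of mean 2 and uniform [0, 1] off-periods of mean 1/2, the limiting probability of being on is 2/2.5 = 0.8.

```python
import numpy as np

rng = np.random.default_rng(11)
t, reps, on_count = 100.0, 20_000, 0
for _ in range(reps):
    s, state = 0.0, 1                 # the system starts in state 1 (on)
    while True:
        length = rng.exponential(2.0) if state == 1 else rng.uniform(0.0, 1.0)
        if s + length > t:
            break                     # time t falls inside the current period
        s += length
        state = 1 - state
    on_count += state == 1
print(on_count / reps, 2.0 / 2.5)     # empirical p(t) versus mu / (mu + nu)
```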

Thus, the limiting probability that the system is on is simply the ratio of the mean of an on period to the mean of an on-off period. It follows, of course, that

P(I_t = 0) = 1 − p(t) → ν/(μ + ν) as t → ∞    (15.5.5)

so in particular, the fact that the system starts in the on state makes no difference in the limit. We will return to the asymptotic
behavior of the alternating renewal process in the next section on renewal reward processes.

Applications and Special Cases


With a clever definition of on and off, many stochastic processes can be turned into alternating renewal processes, leading in turn to
interesting limits, via the basic limit theorem above.

Age Processes
The last remark applies in particular to the age processes of a standard renewal process. So, suppose that we have a renewal process with interarrival sequence X, arrival sequence T, and counting process N. As usual, let μ denote the mean and F the probability distribution function of an interarrival time, and let F^c = 1 − F denote the right distribution function (or reliability function). For t ∈ [0, ∞), recall that the current life, remaining life and total life at time t are

C_t = t − T_{N_t},  R_t = T_{N_t+1} − t,  L_t = C_t + R_t = T_{N_t+1} − T_{N_t} = X_{N_t+1}    (15.5.6)

respectively. In the usual terminology of reliability, C_t is the age of the device in service at time t, R_t is the time remaining until this device fails, and L_t is the total life of the device. We will use the limit theorem above to derive the limiting distributions of these age processes. The limiting distributions were obtained earlier, in the section on renewal limit theorems, by a direct application of the key renewal theorem. So the results are not new, but the method of proof is interesting.

If the renewal process is non-arithmetic then

lim_{t→∞} P(C_t ≤ x) = lim_{t→∞} P(R_t ≤ x) = (1/μ) ∫_0^x F^c(y) dy,  x ∈ [0, ∞)    (15.5.7)

Proof

As we have noted before, the fact that the limiting distributions are the same is not surprising after a little thought. After a long
time, the renewal process looks the same forward and backward in time, and reversing the arrow of time reverses the roles of
current and remaining time.

If the renewal process is non-arithmetic then

lim_{t→∞} P(L_t ≤ x) = (1/μ) ∫_0^x y dF(y),  x ∈ [0, ∞)    (15.5.12)

Proof

This page titled 15.5: Alternating Renewal Processes is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

15.6: Renewal Reward Processes

Basic Theory
Preliminaries
In a renewal reward process, each interarrival time is associated with a random variable that is generically thought of as the reward
associated with that interarrival time. Our interest is in the process that gives the total reward up to time t . So let's set up the usual
notation. Suppose that X = (X_1, X_2, …) are the interarrival times of a renewal process, so that X is a sequence of independent, identically distributed, nonnegative variables with common distribution function F and mean μ. As usual, we assume that F(0) < 1 so that the interarrival times are not deterministically 0, and in this section we also assume that μ < ∞. Let

T_n = ∑_{i=1}^n X_i,  n ∈ N    (15.6.1)

so that T_n is the time of the nth arrival for n ∈ N_+ and T = (T_0, T_1, …) is the arrival time sequence. Finally, let

N_t = ∑_{n=1}^∞ 1(T_n ≤ t),  t ∈ [0, ∞)    (15.6.2)

so that N_t is the number of arrivals in [0, t] and N = {N_t : t ∈ [0, ∞)} is the counting process. As usual, let M(t) = E(N_t) for t ∈ [0, ∞) so that M is the renewal function.

Suppose now that Y = (Y_1, Y_2, …) is a sequence of real-valued random variables, where Y_n is thought of as the reward associated with the interarrival time X_n. However, the term reward should be interpreted generically since Y_n might actually be a cost or some other value associated with the interarrival time, and in any event, may take negative as well as positive values. Our basic assumption is that the interarrival time and reward pairs Z = ((X_1, Y_1), (X_2, Y_2), …) form an independent and identically distributed sequence. Recall that this implies that X is an IID sequence, as required by the definition of the renewal process, and that Y is also an IID sequence. But X_n and Y_n might well be dependent, and in fact Y_n might be a function of X_n for n ∈ N_+. Let ν = E(Y) denote the mean of a generic reward Y, which we assume exists in R.

The stochastic process R = {R_t : t ∈ [0, ∞)} defined by

R_t = ∑_{i=1}^{N_t} Y_i,  t ∈ [0, ∞)    (15.6.3)

is the reward renewal process associated with Z. The function r given by r(t) = E(R_t) for t ∈ [0, ∞) is the reward function.

As promised, R_t is the total reward up to time t ∈ [0, ∞). Here are some typical examples:

The arrivals are customers at a store. Each customer spends a random amount of money.
The arrivals are visits to a website. Each visitor spends a random amount of time at the site.
The arrivals are failure times of a complex system. Each failure requires a random repair time.
The arrivals are earthquakes at a particular location. Each earthquake has a random severity, a measure of the energy released.
So R_t is a random sum of random variables for each t ∈ [0, ∞). In the special case that Y and X are independent, the distribution of R_t is known as a compound distribution, based on the distribution of N_t and the distribution of a generic reward Y. Specializing further, if the renewal process is Poisson and is independent of Y, the process R is a compound Poisson process.

Note that a renewal reward process generalizes an ordinary renewal process. Specifically, if Y_n = 1 for each n ∈ N_+, then R_t = N_t for t ∈ [0, ∞), so that the reward process simply reduces to the counting process, and then r reduces to the renewal function M.

The Renewal Reward Theorem


For t ∈ (0, ∞), the average reward on the interval [0, t] is R_t/t, and the expected average reward on that interval is r(t)/t. The fundamental theorem on renewal reward processes gives the asymptotic behavior of these averages.

The renewal reward theorem

1. R_t/t → ν/μ as t → ∞ with probability 1.
2. r(t)/t → ν/μ as t → ∞
Proof
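As a sanity check, here is a Python sketch (the distributions and the reward function are illustrative assumptions): interarrival times are exponential with mean μ = 2 and the reward is Y_n = X_n², so ν = E(X²) = 8 for this choice and the theorem gives R_t/t → ν/μ = 4.

```python
import numpy as np

rng = np.random.default_rng(5)
t = 10_000.0
s, reward = 0.0, 0.0
while True:
    x = rng.exponential(2.0)          # interarrival time, mean mu = 2
    if s + x > t:
        break
    s += x
    reward += x ** 2                  # reward Y_n = X_n ** 2, a function of X_n
print(reward / t)                     # should be near nu / mu = 8 / 2 = 4
```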

Part (a) generalizes the law of large numbers and part (b) generalizes the elementary renewal theorem, for a basic renewal process. Once again, if Y_n = 1 for each n, then (a) becomes N_t/t → 1/μ as t → ∞ and (b) becomes M(t)/t → 1/μ as t → ∞. It's not surprising then that these two theorems play a fundamental role in the proof of the renewal reward theorem.

General Reward Processes


The renewal reward process R = {R_t : t ∈ [0, ∞)} above is constant, taking the value ∑_{i=1}^n Y_i, on the renewal interval [T_n, T_{n+1}) for each n ∈ N. Effectively, the rewards are received discretely: Y_1 at time T_1, an additional Y_2 at time T_2, and so forth.

It's possible to modify the construction so the rewards accrue continuously in time or in a mixed discrete/continuous manner. Here
is a simple set of conditions for a general reward process.

Suppose again that Z = ((X1 , Y1 ), (X2 , Y2 ), …) is the sequence of interarrival times and rewards. A stochastic process
V = { Vt : t ∈ [0, ∞)} (on our underlying probability space) is a reward process associated with Z if the following conditions
hold:
1. V_{T_n} = ∑_{i=1}^n Y_i for n ∈ N
2. V_t is between V_{T_n} and V_{T_{n+1}} for t ∈ (T_n, T_{n+1}) and n ∈ N

In the continuous case, with nonnegative rewards (the most important case), the reward process will typically have the following
form:

Suppose that the rewards are nonnegative and that U = {U_t : t ∈ [0, ∞)} is a nonnegative stochastic process (on our underlying probability space) with

1. t ↦ U_t piecewise continuous
2. ∫_{T_n}^{T_{n+1}} U_t dt = Y_{n+1} for n ∈ N

Let V_t = ∫_0^t U_s ds for t ∈ [0, ∞). Then V = {V_t : t ∈ [0, ∞)} is a reward process associated with Z.
Proof

Thus in this special case, the rewards are being accrued continuously and U_t is the rate at which the reward is being accrued at time t. So U plays the role of a reward density process. For a general reward process, the basic renewal reward theorem still holds.
time t . So U plays the role of a reward density process. For a general reward process, the basic renewal reward theorem still holds.

Suppose that V = {V_t : t ∈ [0, ∞)} is a reward process associated with Z = ((X_1, Y_1), (X_2, Y_2), …), and let v(t) = E(V_t) for t ∈ [0, ∞) be the corresponding reward function.

1. V_t/t → ν/μ as t → ∞ with probability 1.
2. v(t)/t → ν/μ as t → ∞.
2. v(t)/t → ν /μ as t → ∞ .
Proof

Here is the corollary for a continuous reward process.

Suppose that the rewards are positive, and consider the continuous reward process with density process U = {U_t : t ∈ [0, ∞)} as above. Let u(t) = E(U_t) for t ∈ [0, ∞). Then

1. (1/t) ∫_0^t U_s ds → ν/μ as t → ∞ with probability 1
2. (1/t) ∫_0^t u(s) ds → ν/μ as t → ∞

Special Cases and Applications


With a clever choice of the “rewards”, many interesting renewal processes can be turned into renewal reward processes, leading in
turn to interesting limits via the renewal reward theorem.

Alternating Renewal Processes
Recall that in an alternating renewal process, a system alternates between on and off states (starting in the on state). If we let U = (U_1, U_2, …) be the lengths of the successive time periods in which the system is on, and V = (V_1, V_2, …) the lengths of the successive time periods in which the system is off, then the basic assumptions are that ((U_1, V_1), (U_2, V_2), …) is an independent, identically distributed sequence, and that the variables X_n = U_n + V_n for n ∈ N_+ form the interarrival times of a standard renewal process. Let μ = E(U) denote the mean of a time period that the device is on, and ν = E(V) the mean of a time period that the device is off. Recall that I_t denotes the state (1 or 0) of the system at time t ∈ [0, ∞), so that I = {I_t : t ∈ [0, ∞)} is the state process. The state probability function p is given by p(t) = P(I_t = 1) for t ∈ [0, ∞).

Limits for the alternating renewal process.

1. (1/t) ∫_0^t I_s ds → μ/(μ + ν) as t → ∞ with probability 1
2. (1/t) ∫_0^t p(s) ds → μ/(μ + ν) as t → ∞
Proof

Thus, the asymptotic average time that the device is on, and the asymptotic mean average time that the device is on, are both
simply the ratio of the mean of an on period to the mean of an on-off period. In our previous study of alternating renewal processes,
the fundamental result was that in the non-arithmetic case, p(t) → μ/(μ + ν ) as t → ∞ . This result implies part (b) in the
theorem above.

Age Processes
Renewal reward processes can be used to derive some asymptotic results for the age processes of a standard renewal process. So, suppose that we have a renewal process with interarrival sequence X, arrival sequence T, and counting process N. As usual, let μ = E(X) denote the mean of an interarrival time, but now we will also need ν = E(X²), the second moment. We assume that both moments are finite.
both moments are finite.


For t ∈ [0, ∞), recall that the current life, remaining life and total life at time t are

A_t = t − T_{N_t},  B_t = T_{N_t+1} − t,  L_t = A_t + B_t = T_{N_t+1} − T_{N_t} = X_{N_t+1}    (15.6.15)

respectively. In the usual terminology of reliability, A_t is the age of the device in service at time t, B_t is the time remaining until this device fails, and L_t is the total life of the device. (To avoid notational clashes, we are using different notation than in past sections.) Let a(t) = E(A_t), b(t) = E(B_t), and l(t) = E(L_t) for t ∈ [0, ∞), the corresponding mean functions. To derive our asymptotic results, we simply use the current life and the remaining life as reward densities (or rates) in a renewal reward process.

Limits for the current life process.

1. (1/t) ∫_0^t A_s ds → ν/(2μ) as t → ∞ with probability 1
2. (1/t) ∫_0^t a(s) ds → ν/(2μ) as t → ∞
Proof

Limits for the remaining life process.

1. (1/t) ∫_0^t B_s ds → ν/(2μ) as t → ∞ with probability 1
2. (1/t) ∫_0^t b(s) ds → ν/(2μ) as t → ∞
Proof

With a little thought, it's not surprising that the limits for the current life and remaining life processes are the same. After a long
period of time, a renewal process looks stochastically the same forward or backward in time. Changing the “arrow of time”
reverses the role of the current and remaining life. Asymptotic results for the total life process now follow trivially from the results
for the current and remaining life processes.

Limits for the total life process

1. (1/t) ∫_0^t L_s ds → ν/μ as t → ∞ with probability 1
2. (1/t) ∫_0^t l(s) ds → ν/μ as t → ∞

Replacement Models
Consider again a standard renewal process as defined in the Introduction, with interarrival sequence X = (X_1, X_2, …), arrival sequence T = (T_0, T_1, …), and counting process N = {N_t : t ∈ [0, ∞)}. One of the most basic applications is to reliability, where a device operates for a random lifetime, fails, and then is replaced by a new device, and the process continues. In this model, X_n is the lifetime and T_n the failure time of the nth device in service, for n ∈ N_+, while N_t is the number of failures in [0, t] for t ∈ [0, ∞). As usual, F denotes the distribution function of a generic lifetime X, and F^c = 1 − F the corresponding right distribution function (reliability function). Sometimes, the device is actually a system with a number of critical components; the failure of any of the critical components causes the system to fail.
Replacement models are variations on the basic model in which the device is replaced (or the critical components replaced) at times other than failure. Often the cost a of a planned replacement is less than the cost b of an emergency replacement (at failure), so replacement models can make economic sense. We will consider the most common model.
In the age replacement model, the device is replaced either when it fails or when it reaches a specified age s ∈ (0, ∞). This model gives rise to a new renewal process with interarrival sequence U = (U_1, U_2, …) where U_n = min{X_n, s} for n ∈ N_+. If a, b ∈ (0, ∞) are the costs of planned and unplanned replacements, respectively, then the cost associated with the renewal period U_n is

Y_n = a 1(U_n = s) + b 1(U_n < s) = a 1(X_n ≥ s) + b 1(X_n < s)    (15.6.18)

Clearly ((U_1, Y_1), (U_2, Y_2), …) satisfies the assumptions of a renewal reward process given above. The model makes mathematical sense for any a, b ∈ (0, ∞) but if a ≥ b, so that the planned cost of replacement is at least as large as the unplanned cost of replacement, then Y_n ≥ b for n ∈ N_+, so the model makes no financial sense. Thus we assume that a < b.

In the age replacement model, with planned replacement at age s ∈ (0, ∞),

1. The expected cost of a renewal period is E(Y) = a F^c(s) + b F(s).
2. The expected length of a renewal period is E(U) = ∫_0^s F^c(x) dx.

The limiting expected cost per unit time is

C(s) = [a F^c(s) + b F(s)] / ∫_0^s F^c(x) dx    (15.6.19)

Proof

So naturally, given the costs a and b, and the lifetime distribution function F, the goal is to find the value of s that minimizes C(s); this value of s is the optimal replacement time. Of course, the optimal time may not exist.
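The optimization is easy to carry out numerically. The following Python sketch (which assumes SciPy is available, and uses a gamma(2, 1) lifetime with costs a = 1, b = 5 as illustrative parameters, echoing the exercise below) evaluates C(s) and searches for the optimal replacement age.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

a, b = 1.0, 5.0
Fc = lambda x: (1.0 + x) * np.exp(-x)     # right distribution function of gamma(2, 1)

def C(s):
    denom, _ = quad(Fc, 0.0, s)           # expected length of a renewal period
    return (a * Fc(s) + b * (1.0 - Fc(s))) / denom

res = minimize_scalar(C, bounds=(0.01, 10.0), method="bounded")
print(res.x, res.fun)                     # optimal age s and minimal cost rate C(s)
```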

Properties of C

1. C(s) → ∞ as s ↓ 0
2. C(s) → b/μ as s ↑ ∞
Proof

As s → ∞ , the age replacement model becomes the standard (unplanned) model with limiting expected average cost b/μ.

Suppose that the lifetime of the device (in appropriate units) has the standard exponential distribution. Find C (s) and solve the
optimal age replacement problem.
Answer

The last result is hardly surprising. A device with an exponentially distributed lifetime does not age—if it has not failed, it's just as
good as new. More generally, age replacement does not make sense for any device with decreasing failure rate. Such devices
improve with age.

Suppose that the lifetime of the device (in appropriate units) has the gamma distribution with shape parameter 2 and scale
parameter 1. Suppose that the costs (in appropriate units) are a = 1 and b = 5 .
1. Find C (s) .

2. Sketch the graph of C (s) .
3. Solve numerically the optimal age replacement problem.
Answer

Suppose again that the lifetime of the device (in appropriate units) has the gamma distribution with shape parameter 2 and
scale parameter 1. But suppose now that the costs (in appropriate units) are a = 1 and b = 2 .
1. Find C (s) .
2. Sketch the graph of C (s) .
3. Solve the optimal age replacement problem.
Answer

In the last case, the difference between the cost of an emergency replacement and a planned replacement is not great enough for age
replacement to make sense.

Suppose that the lifetime of the device (in appropriately scaled units) is uniformly distributed on the interval [0, 1]. Find C (s)
and solve the optimal replacement problem. Give the results explicitly for the following costs:
1. a = 4 , b = 6
2. a = 2 , b = 5
3. a = 1 , b = 10
Proof

Thinning
We start with a standard renewal process with interarrival sequence X = (X_1, X_2, …), arrival sequence T = (T_0, T_1, …) and counting process N = {N_t : t ∈ [0, ∞)}. As usual, let μ = E(X) denote the mean of an interarrival time. For n ∈ N_+, suppose now that arrival n is either accepted or rejected, and define random variable Y_n to be 1 in the first case and 0 in the second. Let Z_n = (X_n, Y_n) denote the interarrival time and rejection variable pair for n ∈ N_+, and assume that Z = (Z_1, Z_2, …) is an independent, identically distributed sequence.


Note that we have the structure of a renewal reward process, and so in particular, Y = (Y , Y , …) is a sequence of Bernoulli
1 2

trials. Let p denote the parameter of this sequence, so that p is the probability of accepting an arrival. The procedure of accepting or
rejecting points in a point process is known as thinning the point process. We studied thinning of the Poisson process. In the
notation of this section, note that the reward process R = {R : t ∈ [0, ∞)} is the thinned counting process. That is,
t

Nt

Rt = ∑ Yi (15.6.25)

i=1

is the number of accepted points in [0, t] for t ∈ [0, ∞). So then r(t) = E(R_t) is the expected number of accepted points in [0, t].
The renewal reward theorem gives the asymptotic behavior.

Limits for the thinned process.

1. R_t/t → p/μ as t → ∞
2. r(t)/t → p/μ as t → ∞
Proof
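A one-line check by simulation (all parameters are illustrative assumptions): with uniform [0, 1] interarrivals (μ = 1/2) and independent acceptance probability p = 0.3, the limit p/μ is 0.6.

```python
import numpy as np

rng = np.random.default_rng(13)
t, s, accepted = 5_000.0, 0.0, 0
while True:
    s += rng.uniform(0.0, 1.0)        # next arrival time
    if s > t:
        break
    accepted += rng.random() < 0.3    # accept the arrival with probability p
print(accepted / t)                   # should be near p / mu = 0.3 / 0.5 = 0.6
```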

This page titled 15.6: Renewal Reward Processes is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

CHAPTER OVERVIEW
16: Markov Processes
A Markov process is a random process in which the future is independent of the past, given the present. Thus, Markov processes
are the natural stochastic analogs of the deterministic processes described by differential and difference equations. They form one
of the most important classes of random processes.
16.1: Introduction to Markov Processes
16.2: Potentials and Generators for General Markov Processes
16.3: Introduction to Discrete-Time Chains
16.4: Transience and Recurrence for Discrete-Time Chains
16.5: Periodicity of Discrete-Time Chains
16.6: Stationary and Limiting Distributions of Discrete-Time Chains
16.7: Time Reversal in Discrete-Time Chains
16.8: The Ehrenfest Chains
16.9: The Bernoulli-Laplace Chain
16.10: Discrete-Time Reliability Chains
16.11: Discrete-Time Branching Chain
16.12: Discrete-Time Queuing Chains
16.13: Discrete-Time Birth-Death Chains
16.14: Random Walks on Graphs
16.15: Introduction to Continuous-Time Markov Chains
16.16: Transition Matrices and Generators of Continuous-Time Chains
16.17: Potential Matrices
16.18: Stationary and Limting Distributions of Continuous-Time Chains
16.19: Time Reversal in Continuous-Time Chains
16.20: Chains Subordinate to the Poisson Process
16.21: Continuous-Time Birth-Death Chains
16.22: Continuous-Time Queuing Chains
16.23: Continuous-Time Branching Chains

This page titled 16: Markov Processes is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

16.1: Introduction to Markov Processes

A Markov process is a random process indexed by time, and with the property that the future is independent of the past, given the
present. Markov processes, named for Andrei Markov, are among the most important of all random processes. In a sense, they are
the stochastic analogs of differential equations and recurrence relations, which are of course, among the most important
deterministic processes.
The complexity of the theory of Markov processes depends greatly on whether the time space T is N (discrete time) or [0, ∞)
(continuous time) and whether the state space is discrete (countable, with all subsets measurable) or a more general topological
space. When T = [0, ∞) or when the state space is a general space, continuity assumptions usually need to be imposed in order to
rule out various types of weird behavior that would otherwise complicate the theory.
When the state space is discrete, Markov processes are known as Markov chains. The general theory of Markov chains is
mathematically rich and relatively simple.
When T = N and the state space is discrete, Markov processes are known as discrete-time Markov chains. The theory of such
processes is mathematically elegant and complete, and is understandable with minimal reliance on measure theory. Indeed, the
main tools are basic probability and linear algebra. Discrete-time Markov chains are studied in this chapter, along with a
number of special models.
When T = [0, ∞) and the state space is discrete, Markov processes are known as continuous-time Markov chains. If we avoid
a few technical difficulties (created, as always, by the continuous time space), the theory of these processes is also reasonably
simple and mathematically very nice. The Markov property implies that the process, sampled at the random times when the
state changes, forms an embedded discrete-time Markov chain, so we can apply the theory that we will have already learned.
The Markov property also implies that the holding time in a state has the memoryless property and thus must have an
exponential distribution, a distribution that we know well. In terms of what you may have already studied, the Poisson process
is a simple example of a continuous-time Markov chain.
For a general state space, the theory is more complicated and technical, as noted above. However, we can distinguish a couple of
classes of Markov processes, depending again on whether the time space is discrete or continuous.
When T = N and S = R , a simple example of a Markov process is the partial sum process associated with a sequence of
independent, identically distributed real-valued random variables. Such sequences are studied in the chapter on random samples
(but not as Markov processes), and revisited below.
In the case that T = [0, ∞) and S = R or more generally S = R^k, the most important Markov processes are the diffusion processes. Generally, such processes can be constructed via stochastic differential equations from Brownian motion, which thus serves as the quintessential example of a Markov process in continuous time and space.
The goal of this section is to give a broad sketch of the general theory of Markov processes. Some of the statements are not
completely rigorous and some of the proofs are omitted or are sketches, because we want to emphasize the main ideas without
getting bogged down in technicalities. If you are a new student of probability you may want to just browse this section, to get the
basic ideas and notation, but skipping over the proofs and technical details. Then jump ahead to the study of discrete-time Markov
chains. On the other hand, to understand this section in more depth, you will need to review topics in the chapter on foundations
and in the chapter on stochastic processes.

Basic Theory
Preliminaries
As usual, our starting point is a probability space (Ω, F , P), so that Ω is the set of outcomes, F the σ-algebra of events, and P the
probability measure on (Ω, F ). The time set T is either N (discrete time) or [0, ∞) (continuous time). In the first case, T is given
the discrete topology and in the second case T is given the usual Euclidean topology. In both cases, T is given the Borel σ-algebra
T , the σ-algebra generated by the open sets. In the discrete case when T = N , this is simply the power set of T so that every

subset of T is measurable; every function from T to another measurable space is measurable; and every function from T to another
topological space is continuous. The time space (T , T ) has a natural measure; counting measure # in the discrete case, and
Lebesgue in the continuous case.

The set of states S also has a σ-algebra S of admissible subsets, so that (S, S) is the state space. Usually S has a topology and S is the Borel σ-algebra generated by the open sets. A typical set of assumptions is that the topology on S is LCCB: locally compact, Hausdorff, and with a countable base. These particular assumptions are general enough to capture all of the most important processes that occur in applications and yet are restrictive enough for a nice mathematical theory. Usually, there is a natural positive measure λ on the state space (S, S). When S has an LCCB topology and S is the Borel σ-algebra, the measure λ will usually be a Borel measure satisfying λ(C) < ∞ if C ⊆ S is compact. The term discrete state space means that S is countable with S = P(S), the collection of all subsets of S. Thus every subset of S is measurable, as is every function from S to another measurable space. This is the Borel σ-algebra for the discrete topology on S, so that every function from S to another topological space is continuous. The compact sets are simply the finite sets, and the reference measure is #, counting measure. If S = R^k for some k ∈ N_+ (another common case), then we usually give S the Euclidean topology (which is LCCB) so that S is the usual Borel σ-algebra. The compact sets are the closed, bounded sets, and the reference measure λ is k-dimensional Lebesgue measure.

Clearly, the topological and measure structures on T are not really necessary when T = N , and similarly these structures on S are
not necessary when S is countable. But the main point is that the assumptions unify the discrete and the common continuous cases.
Also, it should be noted that much more general state spaces (and more general time spaces) are possible, but most of the important
Markov processes that occur in applications fit the setting we have described here.
Various spaces of real-valued functions on S play an important role. Let B denote the collection of bounded, measurable functions
f : S → R . With the usual (pointwise) addition and scalar multiplication, B is a vector space. We give B the supremum norm,

defined by ∥f ∥ = sup{|f (x)| : x ∈ S}.


Suppose now that X = {X_t : t ∈ T} is a stochastic process on (Ω, F, P) with state space S and time space T. Thus, X_t is a random variable taking values in S for each t ∈ T, and we think of X_t ∈ S as the state of a system at time t ∈ T. We also assume that we have a collection F = {F_t : t ∈ T} of σ-algebras with the properties that X_t is measurable with respect to F_t for t ∈ T, and that F_s ⊆ F_t ⊆ F for s, t ∈ T with s ≤ t. Intuitively, F_t is the collection of events up to time t ∈ T. Technically, the assumptions mean that F is a filtration and that the process X is adapted to F. The most basic (and coarsest) filtration is the natural filtration F^0 = {F_t^0 : t ∈ T} where F_t^0 = σ{X_s : s ∈ T, s ≤ t}, the σ-algebra generated by the process up to time t ∈ T. In continuous time, however, it is often necessary to use slightly finer σ-algebras in order to have a nice mathematical theory. In particular, we often need to assume that the filtration F is right continuous in the sense that F_{t+} = F_t for t ∈ T where F_{t+} = ⋂{F_s : s ∈ T, s > t}. We can accomplish this by taking F_t = F_{t+}^0 so that F_t = F_{t+} for t ∈ T, and in this case, F is referred to as the right continuous refinement of the natural filtration. We also sometimes need to assume that F_0 is complete with respect to P in the sense that if A ∈ F with P(A) = 0 and B ⊆ A then B ∈ F_0. That is, F_0 contains all of the null events (and hence also all of the almost certain events), and therefore so does F_t for all t ∈ T.

Definitions
The random process X is a Markov process if
P(Xs+t ∈ A ∣ Fs ) = P(Xs+t ∈ A ∣ Xs ) (16.1.1)

for all s, t ∈ T and A ∈ S .

The defining condition, known appropriately enough as the Markov property, states that the conditional distribution of Xs+t
given Fs is the same as the conditional distribution of Xs+t given just Xs. Think of s as the present time, so that s + t is a time in
the future. If we know the present state Xs, then any additional knowledge of events in the past is irrelevant in terms of predicting
the future state Xs+t. Technically, the conditional probabilities in the definition are random variables, and the equality must be
interpreted as holding with probability 1. As you may recall, conditional expected value is a more general and useful concept than
conditional probability, so the following theorem may come as no surprise.

The random process X is a Markov process if and only if


E[f (Xs+t ) ∣ Fs ] = E[f (Xs+t ) ∣ Xs ] (16.1.2)

for every s, t ∈ T and every f ∈ B .


Proof sketch

Technically, we should say that X is a Markov process relative to the filtration F. If X satisfies the Markov property relative to a
filtration, then it satisfies the Markov property relative to any coarser filtration.

Suppose that the stochastic process X = {Xt : t ∈ T} is adapted to the filtration F = {Ft : t ∈ T} and that
G = {Gt : t ∈ T} is a filtration that is finer than F. If X is a Markov process relative to G then X is a Markov process
relative to F.
Proof

In particular, if X is a Markov process, then X satisfies the Markov property relative to the natural filtration F⁰. The theory of
Markov processes is simplified considerably if we add an additional assumption.

A Markov process X is time homogeneous if


P(Xs+t ∈ A ∣ Xs = x) = P(Xt ∈ A ∣ X0 = x) (16.1.4)

for every s, t ∈ T , x ∈ S and A ∈ S .

So if X is homogeneous (we usually don't bother with the time adjective), then the process {Xs+t : t ∈ T} given Xs = x is
equivalent (in distribution) to the process {Xt : t ∈ T} given X0 = x. For this reason, the initial distribution is often unspecified
in the study of Markov processes—if the process is in state x ∈ S at a particular time s ∈ T, then it doesn't really matter how the
process got to state x; the process essentially “starts over”, independently of the past. The term stationary is sometimes used
instead of homogeneous.
From now on, we will usually assume that our Markov processes are homogeneous. This is not as big of a loss of generality as you
might think. A non-homogeneous process can be turned into a homogeneous process by enlarging the state space, as shown below.
For a homogeneous Markov process, if s, t ∈ T , x ∈ S , and f ∈ B , then
E[f (Xs+t ) ∣ Xs = x] = E[f (Xt ) ∣ X0 = x] (16.1.5)

Feller Processes
In continuous time, or with general state spaces, Markov processes can be very strange without additional continuity assumptions.
Suppose (as is usually the case) that S has an LCCB topology and that S is the Borel σ-algebra. Let C denote the collection of
bounded, continuous functions f : S → R. Let C0 denote the collection of continuous functions f : S → R that vanish at ∞. The
last phrase means that for every ϵ > 0, there exists a compact set C ⊆ S such that |f(x)| < ϵ if x ∉ C. With the usual (pointwise)
operations of addition and scalar multiplication, C0 is a vector subspace of C, which in turn is a vector subspace of B. Just as with
B, the supremum norm is used for C and C0.

A Markov process X = {Xt : t ∈ T} is a Feller process if the following conditions are satisfied.
1. Continuity in space: For t ∈ T and y ∈ S, the distribution of Xt given X0 = x converges to the distribution of Xt given X0 = y as x → y.
2. Continuity in time: Given X0 = x for x ∈ S, Xt converges in probability to x as t ↓ 0.

Additional details

Feller processes are named for William Feller. Note that if S is discrete, (a) is automatically satisfied and if T is discrete, (b) is
automatically satisfied. In particular, every discrete-time Markov chain is a Feller Markov process. There are certainly more
general Markov processes, but most of the important processes that occur in applications are Feller processes, and a number of nice
properties flow from the assumptions. Here is the first:

If X = {Xt : t ∈ T} is a Feller process, then there is a version of X such that t ↦ Xt(ω) is continuous from the right and
has left limits for every ω ∈ Ω.

Again, this result is only interesting in continuous time T = [0, ∞). Recall that for ω ∈ Ω, the function t ↦ Xt(ω) is a sample
path of the process. So we will often assume that a Feller Markov process has sample paths that are right continuous and have left
limits, since we know there is a version with these properties.

Stopping Times and the Strong Markov Property
For our next discussion, you may need to review again the section on filtrations and stopping times. To give a quick review, suppose
again that we start with our probability space (Ω, F, P) and the filtration F = {Ft : t ∈ T} (so that we have a filtered probability
space).

Since time (past, present, future) plays such a fundamental role in Markov processes, it should come as no surprise that random
times are important. We often need to allow random times to take the value ∞, so we need to enlarge the set of times to
T∞ = T ∪ {∞}. The topology on T is extended to T∞ by the rule that for s ∈ T, the set {t ∈ T∞ : t > s} is an open
neighborhood of ∞. This is the one-point compactification of T and is used so that the notion of time converging to infinity is
preserved. The Borel σ-algebra T∞ is used on T∞, which again is just the power set in the discrete case.

If X = {Xt : t ∈ T} is a stochastic process on the sample space (Ω, F), and if τ is a random time, then naturally we want to
consider the state Xτ at the random time. There are two problems. First, if τ takes the value ∞, Xτ is not defined. The usual
solution is to add a new “death state” δ to the set of states S, and then to give Sδ = S ∪ {δ} the σ-algebra
Sδ = S ∪ {A ∪ {δ} : A ∈ S}. A function f ∈ B is extended to Sδ by the rule f(δ) = 0. The second problem is that Xτ may
not be a valid random variable (that is, measurable) unless we assume that the stochastic process X is measurable. Recall that this
means that X : Ω × T → S is measurable relative to F ⊗ T and S. (This is always true in discrete time.)
Recall next that a random time τ is a stopping time (also called a Markov time or an optional time) relative to F if {τ ≤ t} ∈ Ft
for each t ∈ T. Intuitively, we can tell whether or not τ ≤ t from the information available to us at time t. In a sense, a stopping
time is a random time that does not require that we see into the future. Of course, the concept depends critically on the filtration.
Recall that if a random time τ is a stopping time for a filtration F = {Ft : t ∈ T} then it is also a stopping time for a finer
filtration G = {Gt : t ∈ T}, so that Ft ⊆ Gt for t ∈ T. Thus, the finer the filtration, the larger the collection of stopping times. In
fact, if the filtration is the trivial one where Ft = F for all t ∈ T (so that all information is available to us from the beginning of
time), then any random time is a stopping time. But of course, this trivial filtration is usually not sensible.
Next, recall that if τ is a stopping time for the filtration F, then the σ-algebra Fτ associated with τ is given by

Fτ = {A ∈ F : A ∩ {τ ≤ t} ∈ Ft for all t ∈ T} (16.1.6)

Intuitively, Fτ is the collection of events up to the random time τ, analogous to Ft, which is the collection of events up to the
deterministic time t ∈ T. If X = {Xt : t ∈ T} is a stochastic process adapted to F and if τ is a stopping time relative to F, then
we would hope that Xτ is measurable with respect to Fτ just as Xt is measurable with respect to Ft for deterministic t ∈ T.
However, this will generally not be the case unless X is progressively measurable relative to F, which means that
X : Ω × Tt → S is measurable with respect to Ft ⊗ Tt and S, where Tt = {s ∈ T : s ≤ t} and Tt is the corresponding Borel σ-
algebra. This is always true in discrete time, of course, and more generally if S has an LCCB topology with S the Borel σ-algebra,
and X is right continuous. If X is progressively measurable with respect to F then X is measurable and X is adapted to F.
The strong Markov property for our stochastic process X = { Xt : t ∈ T } states that the future is independent of the past, given
the present, when the present time is a stopping time.

The random process X is a strong Markov process if

E[f (Xτ+t ) ∣ Fτ ] = E[f (Xτ+t ) ∣ Xτ ] (16.1.7)

for every t ∈ T , stopping time τ , and f ∈ B .

As with the regular Markov property, the strong Markov property depends on the underlying filtration F. If the property holds with
respect to a given filtration, then it holds with respect to a coarser filtration.

Suppose that the stochastic process X = {Xt : t ∈ T} is progressively measurable relative to the filtration F = {Ft : t ∈ T}
and that the filtration G = {Gt : t ∈ T} is finer than F. If X is a strong Markov process relative to G then X is a strong
Markov process relative to F.

Proof

So if X is a strong Markov process, then X satisfies the strong Markov property relative to its natural filtration. Again there is a
tradeoff: finer filtrations allow more stopping times (generally a good thing), but make the strong Markov property harder to satisfy
and may not be reasonable (not so good). So we usually don't want filtrations that are too much finer than the natural one.

With the strong Markov and homogeneous properties, the process {Xτ+t : t ∈ T} given Xτ = x is equivalent in distribution to the
process {Xt : t ∈ T} given X0 = x. Clearly, the strong Markov property implies the ordinary Markov property, since a fixed time
t ∈ T is trivially also a stopping time. The converse is true in discrete time.

Suppose that X = {Xn : n ∈ N} is a (homogeneous) Markov process in discrete time. Then X is a strong Markov process.

As always in continuous time, the situation is more complicated and depends on the continuity of the process X and the filtration
F . Here is the standard result for Feller processes.

If X = {Xt : t ∈ [0, ∞)} is a Feller Markov process, then X is a strong Markov process relative to F⁰+, the right
continuous refinement of the natural filtration.

Transition Kernels of Markov Processes


For our next discussion, you may need to review the section on kernels and operators in the chapter on expected value. Suppose
again that X = {Xt : t ∈ T} is a (homogeneous) Markov process with state space S and time space T, as described above. The
kernels in the following definition are of fundamental importance in the study of X.

For t ∈ T, let

Pt(x, A) = P(Xt ∈ A ∣ X0 = x), x ∈ S, A ∈ S (16.1.9)

Then Pt is a probability kernel on (S, S), known as the transition kernel of X for time t.

Proof

That is, Pt(x, ⋅) is the conditional distribution of Xt given X0 = x for t ∈ T and x ∈ S. By the time homogeneous property,
Pt(x, ⋅) is also the conditional distribution of Xs+t given Xs = x for s ∈ T:

Pt(x, A) = P(Xs+t ∈ A ∣ Xs = x), s, t ∈ T, x ∈ S, A ∈ S (16.1.10)

Note that P0 = I, the identity kernel on (S, S) defined by I(x, A) = 1(x ∈ A) for x ∈ S and A ∈ S, so that I(x, A) = 1 if
x ∈ A and I(x, A) = 0 if x ∉ A. Recall also that usually there is a natural reference measure λ on (S, S). In this case, the
transition kernel Pt will often have a transition density pt with respect to λ for t ∈ T. That is,

Pt(x, A) = P(Xt ∈ A ∣ X0 = x) = ∫_A pt(x, y) λ(dy), x ∈ S, A ∈ S (16.1.11)

The next theorem gives the Chapman-Kolmogorov equation, named for Sydney Chapman and Andrei Kolmogorov, the
fundamental relationship between the probability kernels, and the reason for the name transition kernel.

Suppose again that X = {Xt : t ∈ T} is a Markov process on S with transition kernels P = {Pt : t ∈ T}. If s, t ∈ T, then
Ps Pt = Ps+t. That is,

Ps+t(x, A) = ∫_S Ps(x, dy) Pt(y, A), x ∈ S, A ∈ S (16.1.12)

Proof
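As a concrete illustration of the Chapman-Kolmogorov equation, here is a minimal numerical sketch (not from the text): on a finite state space in discrete time, a kernel is a stochastic matrix and the kernel product is ordinary matrix multiplication. The matrix P below is an arbitrary example.

```python
import numpy as np

# A sketch, assuming a finite state space and discrete time: the transition
# kernel P_t is the matrix power P^t, and Chapman-Kolmogorov says
# P_s P_t = P_{s+t} as a matrix product. The matrix P is an arbitrary example.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

def kernel(t):
    """Transition matrix for t time steps: P_t = P^t."""
    return np.linalg.matrix_power(P, t)

s, t = 3, 5
assert np.allclose(kernel(s) @ kernel(t), kernel(s + t))
```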

In the language of functional analysis, P is a semigroup. Recall that the commutative property generally does not hold for the
product operation on kernels. However, the property does hold for the transition kernels of a homogeneous Markov process. That is,
Ps Pt = Pt Ps = Ps+t for s, t ∈ T. As a simple corollary, if S has a reference measure, the same basic relationship holds for the
transition densities.

Suppose that λ is the reference measure on (S, S) and that X = {Xt : t ∈ T} is a Markov process on S with transition
densities {pt : t ∈ T}. If s, t ∈ T then ps pt = ps+t. That is,

ps+t(x, z) = ∫_S ps(x, y) pt(y, z) λ(dy), x, z ∈ S (16.1.16)

Proof

If T = N (discrete time), then the transition kernels of X are just the powers of the one-step transition kernel. That is, if we let
P = P1, then Pn = P^n for n ∈ N.

Recall that a kernel defines two operations: operating on the left with positive measures on (S, S) and operating on the right with
measurable, real-valued functions. For the transition kernels of a Markov process, both of these operators have natural
interpretations.

Suppose that s, t ∈ T. If μs is the distribution of Xs then Xs+t has distribution μs+t = μs Pt. That is,

μs+t(A) = ∫_S μs(dx) Pt(x, A), A ∈ S (16.1.17)

Proof

So if P denotes the collection of probability measures on (S, S), then the left operator Pt maps P back into P. In particular, if
X0 has distribution μ0 (the initial distribution) then Xt has distribution μt = μ0 Pt for every t ∈ T.

A positive measure μ on (S, S) is invariant for X if μPt = μ for every t ∈ T.
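As a quick numerical illustration (a sketch with an arbitrary chain, not from the text): in discrete time it suffices to check μP = μ, as the next paragraph explains, and on a finite state space an invariant probability measure is a left eigenvector of the one-step matrix with eigenvalue 1.

```python
import numpy as np

# A sketch: find a probability measure mu with mu P = mu for a small chain.
# The matrix P is an arbitrary example, not from the text.
P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])

eigvals, eigvecs = np.linalg.eig(P.T)   # left eigenvectors of P
i = np.argmin(np.abs(eigvals - 1))      # eigenvalue closest to 1
mu = np.real(eigvecs[:, i])
mu /= mu.sum()                          # normalize to a probability measure

assert np.allclose(mu @ P, mu)          # mu P = mu
print(mu)                               # [0.25 0.5 0.25]
```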

Hence if μ is a probability measure that is invariant for X, and X0 has distribution μ, then Xt has distribution μ for every t ∈ T,
so that the process X is identically distributed. In discrete time, note that if μ is a positive measure and μP = μ, then μP^n = μ for
every n ∈ N, so μ is invariant for X. The operator on the right is given next.

Suppose that f : S → R. If t ∈ T then (assuming that the expected value exists),

Pt f(x) = ∫_S Pt(x, dy) f(y) = E[f(Xt) ∣ X0 = x], x ∈ S (16.1.19)

Proof

In particular, the right operator Pt is defined on B, the vector space of bounded, measurable functions f : S → R, and in fact is a
linear operator on B. That is, if f, g ∈ B and c ∈ R, then Pt(f + g) = Pt f + Pt g and Pt(cf) = c Pt f. Moreover, Pt is a
contraction operator on B, since ∥Pt f∥ ≤ ∥f∥ for f ∈ B. It then follows that Pt is a continuous operator on B for t ∈ T.

For the right operator, there is a concept that is complementary to the invariance of a positive measure for the left operator.

A measurable function f : S → R is harmonic for X if Pt f = f for all t ∈ T.

Again, in discrete time, if Pf = f then P^n f = f for all n ∈ N, so f is harmonic for X.
Combining two results above, if X0 has distribution μ0 and f : S → R is measurable, then (again assuming that the expected value
exists), μ0 Pt f = E[f(Xt)] for t ∈ T. That is,

E[f(Xt)] = ∫_S μ0(dx) ∫_S Pt(x, dy) f(y) (16.1.21)

The result above shows how to obtain the distribution of Xt from the distribution of X0 and the transition kernel Pt for t ∈ T. But
we can do more. Recall that one basic way to describe a stochastic process is to give its finite dimensional distributions, that is, the
distribution of (Xt1, Xt2, …, Xtn) for every n ∈ N+ and every (t1, t2, …, tn) ∈ T^n. For a Markov process, the initial
distribution and the transition kernels determine the finite dimensional distributions. It's easiest to state the distributions in
differential form.

Suppose X = {Xt : t ∈ T} is a Markov process with transition operators P = {Pt : t ∈ T}, and that (t1, …, tn) ∈ T^n with
0 < t1 < ⋯ < tn. If X0 has distribution μ0, then in differential form, the distribution of (X0, Xt1, …, Xtn) is

μ0(dx0) Pt1(x0, dx1) Pt2−t1(x1, dx2) ⋯ Ptn−tn−1(xn−1, dxn) (16.1.22)

Proof

This result is very important for constructing Markov processes. If we know how to define the transition kernels Pt for t ∈ T
(based on modeling considerations, for example), and if we know the initial distribution μ0, then the last result gives a consistent
set of finite dimensional distributions. From the Kolmogorov construction theorem, we know that there exists a stochastic process

that has these finite dimensional distributions. In continuous time, however, two serious problems remain. First, it's not clear how
we would construct the transition kernels so that the crucial Chapman-Kolmogorov equations above are satisfied. Second, we
usually want our Markov process to have certain properties (such as continuity properties of the sample paths) that go beyond the
finite dimensional distributions. The first problem will be addressed in the next section, and fortunately, the second problem can be
resolved for a Feller process.
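To make the differential form concrete in the simplest setting, here is a sketch (an arbitrary two-state chain, not from the text) that assembles the joint distribution of (X0, X1, X2) in discrete time from the initial distribution and the one-step matrix.

```python
import numpy as np

# Sketch: finite dimensional distributions from the initial distribution and
# the transition matrix, in discrete time with a finite state space.
# The chain (P, mu0) is an arbitrary example.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
mu0 = np.array([0.5, 0.5])

# Joint pmf of (X_0, X_1, X_2): joint[x0, x1, x2] = mu0(x0) P(x0, x1) P(x1, x2)
joint = mu0[:, None, None] * P[:, :, None] * P[None, :, :]
assert np.isclose(joint.sum(), 1.0)

# Consistency check: the marginal of X_2 agrees with mu0 P^2
assert np.allclose(joint.sum(axis=(0, 1)), mu0 @ P @ P)
```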

Suppose that X = {Xt : t ∈ T} is a Markov process on an LCCB state space (S, S) with transition operators
P = {Pt : t ∈ [0, ∞)}. Then X is a Feller process if and only if the following conditions hold:
1. Continuity in space: If f ∈ C0 and t ∈ [0, ∞) then Pt f ∈ C0.
2. Continuity in time: If f ∈ C0 and x ∈ S then Pt f(x) → f(x) as t ↓ 0.

A semigroup of probability kernels P = {Pt : t ∈ T} that satisfies the properties in this theorem is called a Feller semigroup. So
the theorem states that the Markov process X is Feller if and only if the transition semigroup P is Feller. As before,
(a) is automatically satisfied if S is discrete, and (b) is automatically satisfied if T is discrete. Condition (a) means that Pt is an
operator on the vector space C0, in addition to being an operator on the larger space B. Condition (b) actually implies a stronger
form of continuity in time.

Suppose that P = {Pt : t ∈ T} is a Feller semigroup of transition operators. Then t ↦ Pt f is continuous (with respect to the
supremum norm) for f ∈ C0.

Additional details

So combining this with the remark above, note that if P is a Feller semigroup of transition operators, then f ↦ Pt f is continuous
on C0 for fixed t ∈ T, and t ↦ Pt f is continuous on T for fixed f ∈ C0. Again, the importance of this is that we often start with
the collection of probability kernels P and want to know that there exists a nice Markov process X that has these transition
operators.

Sampling in Time
If we sample a Markov process at an increasing sequence of points in time, we get another Markov process in discrete time. But the
discrete time process may not be homogeneous even if the original process is homogeneous.

Suppose that X = {Xt : t ∈ T} is a Markov process with state space (S, S) and that (t0, t1, t2, …) is a sequence in T with
0 = t0 < t1 < t2 < ⋯. Let Yn = Xtn for n ∈ N. Then Y = {Yn : n ∈ N} is a Markov process in discrete time.
n

Proof

If we sample a homogeneous Markov process at multiples of a fixed, positive time, we get a homogenous Markov process in
discrete time.

Suppose that X = {Xt : t ∈ T} is a homogeneous Markov process with state space (S, S) and transition kernels
P = {Pt : t ∈ T}. Fix r ∈ T with r > 0 and define Yn = Xnr for n ∈ N. Then Y = {Yn : n ∈ N} is a homogeneous
Markov process in discrete time, with one-step transition kernel Q given by

Q(x, A) = Pr(x, A); x ∈ S, A ∈ S (16.1.28)

In some cases, sampling a strong Markov process at an increasing sequence of stopping times yields another Markov process in
discrete time. The point of this is that discrete-time Markov processes are often found naturally embedded in continuous-time
Markov processes.

Enlarging the State Space


Our first result in this discussion is that a non-homogeneous Markov process can be turned into a homogeneous Markov process, but
only at the expense of enlarging the state space.

Suppose that X = {Xt : t ∈ T} is a non-homogeneous Markov process with state space (S, S). Suppose also that τ is a
random variable taking values in T, independent of X. Let τt = τ + t and let Yt = (Xτt, τt) for t ∈ T. Then
Y = {Yt : t ∈ T} is a homogeneous Markov process with state space (S × T, S ⊗ T). For t ∈ T, the transition kernel Pt is
given by

Pt[(x, r), A × B] = P(Xr+t ∈ A ∣ Xr = x) 1(r + t ∈ B), (x, r) ∈ S × T, A × B ∈ S ⊗ T (16.1.29)

Proof

The trick of enlarging the state space is a common one in the study of stochastic processes. Sometimes a process that has a weaker
form of “forgetting the past” can be made into a Markov process by enlarging the state space appropriately. Here is an example in
discrete time.

Suppose that X = {Xn : n ∈ N} is a random process with state space (S, S) in which the future depends stochastically on
the last two states. That is, for n ∈ N,

P(Xn+2 ∈ A ∣ Fn+1) = P(Xn+2 ∈ A ∣ Xn, Xn+1), A ∈ S (16.1.31)

where {Fn : n ∈ N} is the natural filtration associated with the process X. Suppose also that the process is time
homogeneous in the sense that

P(Xn+2 ∈ A ∣ Xn = x, Xn+1 = y) = Q(x, y, A) (16.1.32)

independently of n ∈ N. Let Yn = (Xn, Xn+1) for n ∈ N. Then Y = {Yn : n ∈ N} is a homogeneous Markov process with
state space (S × S, S ⊗ S). The one-step transition kernel P is given by

P[(x, y), A × B] = I(y, A) Q(x, y, B); x, y ∈ S, A, B ∈ S (16.1.33)

Proof

The last result generalizes in a completely straightforward way to the case where the future of a random process in discrete time
depends stochastically on the last k states, for some fixed k ∈ N.
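Here is a small simulation sketch of the enlargement trick (the kernel Q below is an invented example, not from the text): a process whose next value depends on the last two values becomes an ordinary Markov chain on pairs Yn = (Xn, Xn+1).

```python
import random

# Sketch of the state-space enlargement: the next value depends on the last
# two values (an order-2 chain), so Y_n = (X_n, X_{n+1}) is an ordinary
# Markov chain on pairs. The kernel Q below is an arbitrary example.
def q_next(x, y):
    """Sample X_{n+2} given (X_n, X_{n+1}) = (x, y); states are 0 and 1."""
    p_one = 0.8 if x == y else 0.3   # illustrative kernel Q(x, y, .)
    return 1 if random.random() < p_one else 0

def step(pair):
    """One step of the pair chain Y: (x, y) -> (y, z) with z ~ Q(x, y, .)."""
    x, y = pair
    return (y, q_next(x, y))

y = (0, 1)
path = [y]
for _ in range(10):
    y = step(y)
    path.append(y)
print(path)
```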

Examples and Applications


Recurrence Relations and Differential Equations
As noted in the introduction, Markov processes can be viewed as stochastic counterparts of deterministic recurrence relations
(discrete time) and differential equations (continuous time). Our goal in this discussion is to explore these connections.

Suppose that X = {Xn : n ∈ N} is a stochastic process with state space (S, S) and that X satisfies the recurrence relation

Xn+1 = g(Xn), n ∈ N (16.1.34)

where g : S → S is measurable. Then X is a homogeneous Markov process with one-step transition operator P given by
Pf = f ∘ g for a measurable function f : S → R.
Proof

In the deterministic world, as in the stochastic world, the situation is more complicated in continuous time. Nonetheless, the same
basic analogy applies.

Suppose that X = {Xt : t ∈ [0, ∞)} with state space (R, R) satisfies the first-order differential equation

d/dt Xt = g(Xt) (16.1.35)

where g : R → R is Lipschitz continuous. Then X is a Feller Markov process.


Proof

In differential form, the process can be described by dXt = g(Xt) dt. This essentially deterministic process can be extended to a
very important class of Markov processes by the addition of a stochastic term related to Brownian motion. Such stochastic
differential equations are the main tools for constructing Markov processes known as diffusion processes.
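As a loose illustration (not from the text), the following Euler-Maruyama sketch discretizes dXt = g(Xt) dt + σ dBt for an arbitrary drift g and constant σ; with σ = 0 it reduces to Euler's method for the deterministic equation above.

```python
import numpy as np

# Sketch (not from the text): Euler-Maruyama discretization of
# dX_t = g(X_t) dt + sigma dB_t. With sigma = 0 this is Euler's method
# for the ODE dX_t/dt = g(X_t). Drift and parameters are arbitrary choices.
rng = np.random.default_rng(0)

def simulate(x0, g, sigma, dt=0.01, n_steps=1000):
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        dB = rng.normal(scale=np.sqrt(dt))            # Brownian increment
        x[k + 1] = x[k] + g(x[k]) * dt + sigma * dB
    return x

path = simulate(x0=1.0, g=lambda x: -x, sigma=0.2)    # Ornstein-Uhlenbeck-type drift
print(path[-1])
```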

Processes with Stationary, Independent Increments


For our next discussion, we consider a general class of stochastic processes that are Markov processes. Suppose that
X = {Xt : t ∈ T} is a random process with S ⊆ R as the set of states. The state space can be discrete (countable) or
“continuous”. Typically, S is either N or Z in the discrete case, and is either [0, ∞) or R in the continuous case. In any case, S is
given the usual σ-algebra S of Borel subsets of S (which is the power set in the discrete case). Also, the state space (S, S) has a
natural reference measure λ, namely counting measure in the discrete case and Lebesgue measure in the continuous case.
Let F = {Ft : t ∈ T} denote the natural filtration, so that Ft = σ{Xs : s ∈ T, s ≤ t} for t ∈ T.

The process X has
1. Independent increments if Xs+t − Xs is independent of Fs for all s, t ∈ T.
2. Stationary increments if the distribution of Xs+t − Xs is the same as the distribution of Xt − X0 for all s, t ∈ T.

A difference of the form Xs+t − Xs for s, t ∈ T is an increment of the process, hence the names. Sometimes the definition of
stationary increments is that Xs+t − Xs have the same distribution as Xt. But this forces X0 = 0 with probability 1, and as usual
with Markov processes, it's best to keep the initial distribution unspecified. If X has stationary increments in the sense of our
definition, then the process Y = {Yt = Xt − X0 : t ∈ T} has stationary increments in the more restricted sense. For the
remainder of this discussion, assume that X = {Xt : t ∈ T} has stationary, independent increments, and let Qt denote the
distribution of Xt − X0 for t ∈ T.

Qs ∗ Qt = Qs+t for s, t ∈ T .
Proof

So the collection of distributions Q = {Qt : t ∈ T} forms a semigroup, with convolution as the operator. Note that Q0 is simply
point mass at 0.

The process X is a homogeneous Markov process. For t ∈ T, the transition operator Pt is given by

Pt f(x) = ∫_S f(x + y) Qt(dy), f ∈ B (16.1.36)

Proof

Clearly the semigroup property of P = {Pt : t ∈ T} (with the usual operator product) is equivalent to the semigroup property of
Q = {Qt : t ∈ T} (with convolution as the product).

Suppose that for positive t ∈ T, the distribution Qt has probability density function gt with respect to the reference measure
λ. Then the transition density is

pt(x, y) = gt(y − x), x, y ∈ S (16.1.39)

Of course, from the result above, it follows that gs ∗ gt = gs+t for s, t ∈ T , where here ∗ refers to the convolution operation on
probability density functions.

If Qt → Q0 as t ↓ 0 then X is a Feller Markov process.

Thus, by the general theory sketched above, X is a strong Markov process, and there exists a version of X that is right continuous
and has left limits. Such a process is known as a Lévy process, in honor of Paul Lévy.
For a real-valued stochastic process X = {Xt : t ∈ T}, let m and v denote the mean and variance functions, so that

m(t) = E(Xt), v(t) = var(Xt); t ∈ T (16.1.40)

assuming of course that these exist. The mean and variance functions for a Lévy process are particularly simple.

Suppose again that X has stationary, independent increments.
1. If μ0 = E(X0) ∈ R and μ1 = E(X1) ∈ R then m(t) = μ0 + (μ1 − μ0)t for t ∈ T.
2. If in addition, σ0² = var(X0) ∈ (0, ∞) and σ1² = var(X1) ∈ (0, ∞) then v(t) = σ0² + (σ1² − σ0²)t for t ∈ T.
Proof

It's easy to describe processes with stationary independent increments in discrete time.

A process X = {Xn : n ∈ N} has independent increments if and only if there exists a sequence of independent, real-valued
random variables (U0, U1, …) such that

Xn = ∑_{i=0}^n Ui (16.1.43)

In addition, X has stationary increments if and only if (U1, U2, …) are identically distributed.


Proof

Thus suppose that U = (U0, U1, …) is a sequence of independent, real-valued random variables, with (U1, U2, …) identically
distributed with common distribution Q. Then from our main result above, the partial sum process X = {Xn : n ∈ N} associated
with U is a homogeneous Markov process with one-step transition kernel P given by

P(x, A) = Q(A − x), x ∈ S, A ∈ S (16.1.44)

More generally, for n ∈ N, the n-step transition kernel is P^n(x, A) = Q^{∗n}(A − x) for x ∈ S and A ∈ S. This Markov process
is known as a random walk (although unfortunately, the term random walk is used in a number of other contexts as well). The idea
is that at time n, the walker moves a (directed) distance Un on the real line, and these steps are independent and identically
distributed. If Q has probability density function g with respect to the reference measure λ, then the one-step transition density is

p(x, y) = g(y − x), x, y ∈ S (16.1.45)

Consider the random walk on R with steps that have the standard normal distribution. Give each of the following explicitly:
1. The one-step transition density.
2. The n-step transition density for n ∈ N+.

Proof
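A quick numerical check of this exercise (a sketch, not the book's solution): since a sum of n independent standard normal steps is normal with mean 0 and variance n, the n-step density should be pn(x, y) = gn(y − x), with gn the normal density with variance n.

```python
import numpy as np
from scipy.stats import norm

# Sketch: for the Gaussian random walk, the n-step transition density should
# be the N(0, n) density in y - x, since a sum of n iid N(0,1) steps is N(0, n).
rng = np.random.default_rng(1)
n, x = 5, 2.0
samples = x + rng.normal(size=(100_000, n)).sum(axis=1)   # X_n given X_0 = x

# Compare an empirical probability with the claimed density over an interval
a, b = 1.0, 3.0
empirical = np.mean((samples > a) & (samples < b))
theoretical = norm.cdf(b, loc=x, scale=np.sqrt(n)) - norm.cdf(a, loc=x, scale=np.sqrt(n))
print(empirical, theoretical)   # should agree to about two decimal places
```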

In continuous time, there are two processes that are particularly important, one with the discrete state space N and one with the
continuous state space R.

For t ∈ [0, ∞), let gt denote the probability density function of the Poisson distribution with parameter t, and let
pt(x, y) = gt(y − x) for x, y ∈ N. Then {pt : t ∈ [0, ∞)} is the collection of transition densities for a Feller semigroup on N.
Proof

So a Lévy process N = {Nt : t ∈ [0, ∞)} with these transition densities would be a Markov process with stationary, independent
increments and with sample paths that are right continuous and have left limits. We do know of such a process, namely the Poisson
process with rate 1.

Open the Poisson experiment and set the rate parameter to 1 and the time parameter to 10. Run the experiment several times in
single-step mode and note the behavior of the process.

For t ∈ (0, ∞), let gt denote the probability density function of the normal distribution with mean 0 and variance t, and let
pt(x, y) = gt(y − x) for x, y ∈ R. Then {pt : t ∈ [0, ∞)} is the collection of transition densities of a Feller semigroup on R.

Proof

So a Lévy process X = {Xt : t ∈ [0, ∞)} on R with these transition densities would be a Markov process with stationary,
independent increments, and whose sample paths are continuous from the right and have left limits. In fact, there exists such a
process with continuous sample paths. This process is Brownian motion, a process important enough to have its own chapter.

Run the simulation of standard Brownian motion and note the behavior of the process.

This page titled 16.1: Introduction to Markov Processes is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

16.2: Potentials and Generators for General Markov Processes

Our goal in this section is to continue the broad sketch of the general theory of Markov processes. As with the last section, some of
the statements are not completely precise and rigorous, because we want to focus on the main ideas without being overly burdened
by technicalities. If you are a new student of probability, or are primarily interested in applications, you may want to skip ahead to
the study of discrete-time Markov chains.

Preliminaries
Basic Definitions
As usual, our starting point is a probability space (Ω, F , P), so that Ω is the set of outcomes, F the σ-algebra of events, and P the
probability measure on the sample space (Ω, F ). The set of times T is either N, discrete time with the discrete topology, or [0, ∞),
continuous time with the usual Euclidean topology. The time set T is given the Borel σ-algebra T , which is just the power set if
T = N , and then the time space (T , T ) is given the usual measure, counting measure in the discrete case and Lebesgue measure in

the continuous case. The set of states S has an LCCB topology (locally compact, Hausdorff, with a countable base), and is also
given the Borel σ-algebra S . Recall that to say that the state space is discrete means that S is countable with the discrete topology,
so that S is the power set of S . The topological assumptions mean that the state space (S, S ) is nice enough for a rich
mathematical theory and general enough to encompass the most important applications. There is often a natural Borel measure λ
on (S, S), counting measure # if S is discrete, and, for example, Lebesgue measure if S = R^k for some k ∈ N+.

Recall also that there are several spaces of functions on S that are important. Let B denote the set of bounded, measurable
functions f : S → R. Let C denote the set of bounded, continuous functions f : S → R, and let C0 denote the set of continuous
functions f : S → R that vanish at ∞ in the sense that for every ϵ > 0, there exists a compact set K ⊆ S such that |f(x)| < ϵ for
x ∉ K. These are all vector spaces under the usual (pointwise) addition and scalar multiplication, and C0 ⊆ C ⊆ B. The
supremum norm, defined by ∥f∥ = sup{|f(x)| : x ∈ S} for f ∈ B, is the norm that is used on these spaces.
Suppose now that X = {Xt : t ∈ T} is a time-homogeneous Markov process with state space (S, S) defined on the probability
space (Ω, F, P). As before, we also assume that we have a filtration F = {Ft : t ∈ T}, that is, an increasing family of sub σ-
algebras of F, indexed by the time space, with the property that Xt is measurable with respect to Ft for t ∈ T. Intuitively, Ft is
the collection of events up to time t ∈ T.


As usual, we let Pt denote the transition probability kernel for an increase in time of size t ∈ T. Thus

Pt(x, A) = P(Xt ∈ A ∣ X0 = x), x ∈ S, A ∈ S (16.2.1)

Recall that for t ∈ T, the transition kernel Pt defines two operators, on the left with measures and on the right with functions. So,
if μ is a measure on (S, S) then μPt is the measure on (S, S) given by

μPt(A) = ∫_S μ(dx) Pt(x, A), A ∈ S (16.2.2)

If μ is the distribution of X0 then μPt is the distribution of Xt for t ∈ T. If f ∈ B then Pt f ∈ B is defined by

Pt f(x) = ∫_S Pt(x, dy) f(y) = E[f(Xt) ∣ X0 = x] (16.2.3)

Recall that the collection of transition operators P = {Pt : t ∈ T} is a semigroup because Ps Pt = Ps+t for s, t ∈ T. Just about
everything in this section is defined in terms of the semigroup P, which is one of the main analytic tools in the study of Markov
processes.

Feller Markov Processes


We make the same assumptions as in the Introduction. Here is a brief review:

We assume that the Markov process X = {Xt : t ∈ T} satisfies the following properties (and hence is a Feller Markov
process):
1. For t ∈ T and y ∈ S, the distribution of Xt given X0 = x converges to the distribution of Xt given X0 = y as x → y.
2. Given X0 = x ∈ S, Xt converges in probability to x as t ↓ 0.

Part (a) is an assumption on continuity in space, while part (b) is an assumption on continuity in time. If S is discrete then (a)
automatically holds, and if T is discrete then (b) automatically holds. As we will see, the Feller assumptions are sufficient for a
very nice mathematical theory, and yet are general enough to encompass the most important continuous-time Markov processes.

The process X = {Xt : t ∈ T} has the following properties:
1. There is a version of X such that t ↦ Xt is continuous from the right and has left limits.
2. X is a strong Markov process relative to F⁰+, the right-continuous refinement of the natural filtration.

The Feller assumptions on the Markov process have equivalent formulations in terms of the transition semigroup.

The transition semigroup P = {Pt : t ∈ T} has the following properties:
1. If f ∈ C0 and t ∈ T then Pt f ∈ C0.
2. If f ∈ C0 and x ∈ S then Pt f(x) → f(x) as t ↓ 0.

As before, part (a) is a condition on continuity in space, while part (b) is a condition on continuity in time. Once again, (a) is trivial
if S is discrete, and (b) is trivial if T is discrete. The first condition means that Pt is a linear operator on C0 (as well as being a
linear operator on B). The second condition leads to a stronger continuity result.

For f ∈ C0, the mapping t ↦ Pt f is continuous on T. That is, for t ∈ T,

∥Ps f − Pt f∥ = sup{|Ps f(x) − Pt f(x)| : x ∈ S} → 0 as s → t (16.2.4)

Our interest in this section is primarily the continuous time case. However, we start with the discrete time case since the concepts
are clearer and simpler, and we can avoid some of the technicalities that inevitably occur in continuous time.

Discrete Time
Suppose that T = N, so that time is discrete. Recall that the transition kernels are just powers of the one-step kernel. That is, we let
P = P1 and then Pn = P^n for n ∈ N.

Potential Operators
For α ∈ (0, 1], the α-potential kernel Rα of X is defined as follows:

Rα(x, A) = ∑_{n=0}^∞ α^n P^n(x, A), x ∈ S, A ∈ S (16.2.5)

1. The special case R = R1 is simply the potential kernel of X.
2. For x ∈ S and A ∈ S, R(x, A) is the expected number of visits of X to A, starting at x.


Proof

Note that it's quite possible that R(x, A) = ∞ for some x ∈ S and A ∈ S. In fact, knowing when this is the case is of
considerable importance in the study of Markov processes. As with all kernels, the potential kernel Rα defines two operators,
operating on the right on functions, and operating on the left on positive measures. For the right potential operator, if f : S → R is
measurable then

Rα f(x) = ∑_{n=0}^∞ α^n P^n f(x) = ∑_{n=0}^∞ α^n ∫_S P^n(x, dy) f(y) = ∑_{n=0}^∞ α^n E[f(Xn) ∣ X0 = x], x ∈ S (16.2.7)

assuming as usual that the expected values and the infinite series make sense. This will be the case, in particular, if f is
nonnegative or if α ∈ (0, 1) and f ∈ B.

If α ∈ (0, 1), then Rα(x, S) = 1/(1 − α) for all x ∈ S.
Proof

It follows that for α ∈ (0, 1), the right operator Rα is a bounded, linear operator on B with ∥Rα∥ = 1/(1 − α). It also follows that
(1 − α)Rα is a probability kernel. There is a nice interpretation of this kernel.

If α ∈ (0, 1) then (1 − α)Rα(x, ⋅) is the conditional distribution of XN given X0 = x ∈ S, where N is independent of X
and has the geometric distribution on N with parameter 1 − α.
Proof
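The geometric-time interpretation is easy to check by simulation on a finite chain; the sketch below (arbitrary chain, not from the text) compares (1 − α)Rα(x, ⋅), computed from the defining series, with the empirical distribution of XN.

```python
import numpy as np

# Sketch (arbitrary example chain): verify that (1 - alpha) R_alpha (x, .)
# is the distribution of X_N, where N ~ geometric on {0, 1, ...} with
# success parameter 1 - alpha, independent of X.
rng = np.random.default_rng(2)
P = np.array([[0.2, 0.8],
              [0.5, 0.5]])
alpha = 0.7

# (1 - alpha) * sum_n alpha^n P^n, truncated far enough for convergence
R = sum(alpha**n * np.linalg.matrix_power(P, n) for n in range(200))
kernel = (1 - alpha) * R

# Monte Carlo: run the chain for a geometric number of steps, starting at 0
counts = np.zeros(2)
for _ in range(20_000):
    n = rng.geometric(1 - alpha) - 1      # numpy's geometric lives on {1, 2, ...}
    x = 0
    for _ in range(n):
        x = rng.choice(2, p=P[x])
    counts[x] += 1
print(kernel[0], counts / counts.sum())   # should agree closely
```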

So (1 − α)Rα is a transition probability kernel, just as Pn is a transition probability kernel, but corresponding to the random time
N (with α ∈ (0, 1) as a parameter), rather than the deterministic time n ∈ N. An interpretation of the potential kernel Rα for
α ∈ (0, 1) can also be given in economic terms. Suppose that A ∈ S and that we receive one monetary unit each time the process
X visits A. Then as above, R(x, A) is the expected total amount of money we receive, starting at x ∈ S. However, typically
money that we will receive at times distant in the future has less value to us now than money that we will receive soon. Specifically,
suppose that a monetary unit received at time n ∈ N has a present value of α^n, where α ∈ (0, 1) is an inflation factor (sometimes
also called a discount factor). Then Rα(x, A) gives the expected, total, discounted amount we will receive, starting at x ∈ S. A bit
more generally, if f ∈ B is a reward function, so that f(x) is the reward (or cost, depending on the sign) that we receive when we
visit state x ∈ S, then for α ∈ (0, 1), Rα f(x) is the expected, total, discounted reward, starting at x ∈ S.

For the left potential operator, if μ is a positive measure on S then

μRα(A) = ∑_{n=0}^∞ α^n μP^n(A) = ∑_{n=0}^∞ α^n ∫_S μ(dx) P^n(x, A), A ∈ S (16.2.12)

In particular, if μ is a probability measure and X0 has distribution μ, then μP^n is the distribution of Xn for n ∈ N, so from the
last result, (1 − α)μRα is the distribution of XN where again, N is independent of X and has the geometric distribution on N with
parameter 1 − α. The family of potential kernels gives the same information as the family of transition kernels.

The potential kernels R = {Rα : α ∈ (0, 1)} completely determine the transition kernels P = {Pn : n ∈ N}.
Proof

Of course, it's really only necessary to determine P, the one-step transition kernel, since the other transition kernels are powers of
P. In any event, it follows that the kernels R = {Rα : α ∈ (0, 1)}, along with the initial distribution, completely determine the
finite dimensional distributions of the Markov process X. The potential kernels commute with each other and with the transition
kernels.

Suppose that α, β ∈ (0, 1] and k ∈ N. Then (as kernels)
1. P^k Rα = Rα P^k = ∑_{n=0}^∞ α^n P^{n+k}
2. Rα Rβ = Rβ Rα = ∑_{m=0}^∞ ∑_{n=0}^∞ α^m β^n P^{m+n}
Proof

The same identities hold for the right operators on the entire space B, with the additional restrictions that α < 1 and β < 1. The
fundamental equation that relates the potential kernels is given next.

If α, β ∈ (0, 1] with α ≤ β then (as kernels),

βRβ = αRα + (β − α)Rα Rβ (16.2.17)

Proof

The same identity holds for the right operators on the entire space B, with the additional restriction that β < 1.

If α ∈ (0, 1], then (as kernels), I + αRα P = I + αP Rα = Rα.


Proof

The same identity holds for the right operators on the entire space B, with the additional restriction that α < 1. This leads to the
following important result:

If α ∈ (0, 1), then as operators on the space B,
1. Rα = (I − αP)^{−1}
2. P = (1/α)(I − Rα^{−1})

Proof

This result shows again that the potential operator Rα determines the transition operator P.
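On a finite state space, the theorem is plain linear algebra; here is a sketch (arbitrary matrix, not from the text) checking that the series ∑ α^n P^n equals (I − αP)^{−1}, and that P is recovered from Rα.

```python
import numpy as np

# Sketch: for a finite chain, R_alpha = sum_n alpha^n P^n = (I - alpha P)^{-1},
# and P can be recovered as P = (1/alpha)(I - R_alpha^{-1}).
P = np.array([[0.2, 0.8],
              [0.5, 0.5]])
alpha = 0.6
I = np.eye(2)

R_series = sum(alpha**n * np.linalg.matrix_power(P, n) for n in range(200))
R_inverse = np.linalg.inv(I - alpha * P)
assert np.allclose(R_series, R_inverse)

P_recovered = (I - np.linalg.inv(R_inverse)) / alpha
assert np.allclose(P_recovered, P)
```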

Examples and Applications


Our first example considers the binomial process as a Markov process.

Let I = {In : n ∈ N+} be a sequence of Bernoulli trials with success parameter p ∈ (0, 1). Define the Markov process
X = {Xn : n ∈ N} by Xn = X0 + ∑_{k=1}^n Ik where X0 takes values in N and is independent of I.

1. For n ∈ N, show that the transition probability matrix P^n of X is given by

P^n(x, y) = C(n, y − x) p^{y−x} (1 − p)^{n−y+x}, x ∈ N, y ∈ {x, x + 1, …, x + n} (16.2.22)

where C(n, k) denotes the binomial coefficient.

2. For α ∈ (0, 1], show that the potential matrix Rα of X is given by

Rα(x, y) = (1/(1 − α + αp)) (αp/(1 − α + αp))^{y−x}, x ∈ N, y ∈ {x, x + 1, …} (16.2.23)

3. For α ∈ (0, 1) and x ∈ N, identify the probability distribution defined by (1 − α)Rα(x, ⋅).
4. For x, y ∈ N with x ≤ y, interpret R(x, y), the expected time in y starting in x, in the context of the process X.
Solutions
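As a sketch (not the book's solution), part (b) can be checked numerically by truncating the defining series for Rα(x, y) and comparing with the closed form.

```python
from math import comb

# Sketch: numerical check of the potential matrix formula for the binomial
# process, by truncating R_alpha(x, y) = sum_n alpha^n P^n(x, y).
p, alpha = 0.3, 0.8
x, y = 2, 5
k = y - x

series = sum(alpha**n * comb(n, k) * p**k * (1 - p)**(n - k)
             for n in range(k, 500))
closed_form = (1 / (1 - alpha + alpha * p)) * (alpha * p / (1 - alpha + alpha * p))**k
print(series, closed_form)   # should agree to many decimal places
```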

Continuous Time
With the discrete-time setting as motivation, we now turn to the more important continuous-time case where T = [0, ∞).

Potential Kernels
For α ∈ [0, ∞), the α-potential kernel Uα of X is defined as follows:

Uα(x, A) = ∫_0^∞ e^{−αt} Pt(x, A) dt, x ∈ S, A ∈ S (16.2.25)

1. The special case U = U0 is simply the potential kernel of X.
2. For x ∈ S and A ∈ S, U(x, A) is the expected amount of time that X spends in A, starting at x.
3. The family of kernels U = {Uα : α ∈ (0, ∞)} is known as the resolvent of X.
Proof

As with discrete time, it's quite possible that U(x, A) = ∞ for some x ∈ S and A ∈ S, and knowing when this is the case is of
considerable interest. As with all kernels, the potential kernel Uα defines two operators, operating on the right on functions, and
operating on the left on positive measures. If f : S → R is measurable then, giving the right potential operator in its many forms,

Uα f(x) = ∫_S Uα(x, dy) f(y) = ∫_0^∞ e^{−αt} Pt f(x) dt = ∫_0^∞ e^{−αt} ∫_S Pt(x, dy) f(y) dt = ∫_0^∞ e^{−αt} E[f(Xt) ∣ X0 = x] dt, x ∈ S

assuming that the various integrals make sense. This will be the case in particular if f is nonnegative, or if f ∈ B and α > 0.

If α > 0, then Uα(x, S) = 1/α for all x ∈ S.
Proof

It follows that for α ∈ (0, ∞), the right potential operator Uα is a bounded, linear operator on B with ∥Uα∥ = 1/α. It also follows
that αUα is a probability kernel. This kernel has a nice interpretation.

If α > 0 then αUα(x, ⋅) is the conditional distribution of Xτ, where τ is independent of X and has the exponential distribution
on [0, ∞) with parameter α.


Proof

So αUα is a transition probability kernel, just as Pt is a transition probability kernel, but corresponding to the random time τ (with
α ∈ (0, ∞) as a parameter), rather than the deterministic time t ∈ [0, ∞). As in the discrete case, the potential kernel can also be
interpreted in economic terms. Suppose that A ∈ S and that we receive money at a rate of one unit per unit time whenever the
process X is in A. Then U(x, A) is the expected total amount of money that we receive, starting in state x ∈ S. But again, money
that we receive later is of less value to us now than money that we will receive sooner. Specifically, suppose that one monetary unit
at time t ∈ [0, ∞) has a present value of e^{−αt}, where α ∈ (0, ∞) is the inflation factor or discount factor. Then Uα(x, A) is the
total, expected, discounted amount that we receive, starting in x ∈ S. A bit more generally, suppose that f ∈ B and that f(x) is
the reward (or cost, depending on the sign) per unit time that we receive when the process is in state x ∈ S. Then Uα f(x) is the
expected, total, discounted reward, starting in state x ∈ S.


For the left potential operator, if μ is a positive measure on S then

μUα(A) = ∫_S μ(dx) Uα(x, A) = ∫_0^∞ e^{−αt} μPt(A) dt = ∫_0^∞ e^{−αt} [∫_S μ(dx) Pt(x, A)] dt = ∫_0^∞ e^{−αt} [∫_S μ(dx) P(Xt ∈ A ∣ X0 = x)] dt, A ∈ S

In particular, suppose that α > 0, that μ is a probability measure, and that X0 has distribution μ. Then μPt is the distribution of Xt
for t ∈ [0, ∞), and hence from the last result, αμUα is the distribution of Xτ, where again, τ is independent of X and has the
exponential distribution on [0, ∞) with parameter α. The family of potential kernels gives the same information as the family of
transition kernels.

The resolvent U = {Uα : α ∈ (0, ∞)} completely determines the family of transition kernels P = {Pt : t ∈ (0, ∞)}.
Proof

It follows that the resolvent U = {Uα : α ∈ (0, ∞)}, along with the initial distribution, completely determines the finite dimensional
distributions of the Markov process X. This is much more important here in the continuous-time case than in the discrete-time
case, since the transition kernels Pt cannot be generated from a single transition kernel. The potential kernels commute with each
other and with the transition kernels.

Suppose that α, β, t ∈ [0, ∞). Then (as kernels),
1. Pt Uα = Uα Pt = ∫_0^∞ e^{−αs} Ps+t ds
2. Uα Uβ = ∫_0^∞ ∫_0^∞ e^{−αs} e^{−βt} Ps+t ds dt

Proof

The same identities hold for the right operators on the entire space B under the additional restriction that α > 0 and β > 0. The
fundamental equation that relates the potential kernels, known as the resolvent equation, is given in the next theorem:

If α, β ∈ [0, ∞) with α ≤ β then (as kernels) Uα = Uβ + (β − α)Uα Uβ.


Proof

The same identity holds for the right potential operators on the entire space B, under the additional restriction that α > 0. For
α ∈ (0, ∞), Uα is also an operator on the space C0.

If α ∈ (0, ∞) and f ∈ C0 then Uα f ∈ C0.


Proof

If f ∈ C0 then αUα f → f as α → ∞.
Proof

Infinitesimal Generator
In continuous time, it's not at all clear how we could construct a Markov process with desired properties, say to model a real system
of some sort. Stated mathematically, the existential problem is how to construct the family of transition kernels {Pt : t ∈ [0, ∞)}
so that the semigroup property Ps Pt = Ps+t is satisfied for all s, t ∈ [0, ∞). The answer, as for similar problems in the
deterministic world, comes essentially from calculus, from a type of derivative.

The infinitesimal generator of the Markov process X is the operator G : D → C0 defined by

Gf = lim_{t↓0} (Pt f − f)/t (16.2.39)

on the domain D ⊆ C0 for which the limit exists.

As usual, the limit is with respect to the supremum norm on C0, so f ∈ D and Gf = g means that f, g ∈ C0 and

∥(Pt f − f)/t − g∥ = sup{|(Pt f(x) − f(x))/t − g(x)| : x ∈ S} → 0 as t ↓ 0 (16.2.40)

So in particular,

Gf(x) = lim_{t↓0} (Pt f(x) − f(x))/t = lim_{t↓0} (E[f(Xt) ∣ X0 = x] − f(x))/t, x ∈ S (16.2.41)
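As a numerical sketch (not from the text), one can approximate Gf(x) by (Pt f(x) − f(x))/t for small t. The example below assumes standard Brownian motion, whose generator is Gf = f″/2 (Brownian motion is treated in its own chapter), and computes Pt f(x) = E[f(x + √t Z)] by Gauss-Hermite quadrature.

```python
import numpy as np

# Sketch, assuming standard Brownian motion: its generator is G f = f''/2.
# Approximate G f(x) by (P_t f(x) - f(x)) / t for small t, where
# P_t f(x) = E[f(x + sqrt(t) Z)] with Z standard normal.
f = lambda x: np.exp(-x**2)          # a smooth function vanishing at infinity

# Probabilists' Gauss-Hermite nodes/weights: E[g(Z)] = sum w_i g(z_i) / sqrt(2 pi)
nodes, weights = np.polynomial.hermite_e.hermegauss(60)

def Pt_f(x, t):
    return np.sum(weights * f(x + np.sqrt(t) * nodes)) / np.sqrt(2 * np.pi)

x, t = 0.5, 1e-5
approx = (Pt_f(x, t) - f(x)) / t
exact = 0.5 * (4 * x**2 - 2) * np.exp(-x**2)   # f''(x)/2 for f(x) = exp(-x^2)
print(approx, exact)
```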

The domain D is a subspace of C0 and the generator G is a linear operator on D:
1. If f ∈ D and c ∈ R then cf ∈ D and G(cf) = cGf.
2. If f, g ∈ D then f + g ∈ D and G(f + g) = Gf + Gg.
Proof

Note that G is the (right) derivative at 0 of the function t ↦ Pt f. Because of the semigroup property, this differentiability property
at 0 implies differentiability at arbitrary t ∈ [0, ∞). Moreover, the infinitesimal operator and the transition operators commute:

If f ∈ D and t ∈ [0, ∞), then Pt f ∈ D and the following derivative rules hold with respect to the supremum norm.
1. Pt′ f = Pt Gf, the Kolmogorov forward equation
2. Pt′ f = G Pt f, the Kolmogorov backward equation

Proof

The last result gives a possible solution to the dilemma that motivated this discussion in the first place. If we want to construct a
Markov process with desired properties, to model a real system for example, we can start by constructing an appropriate
generator G and then solve the initial value problem

Pt′ = G Pt, P0 = I (16.2.47)

to obtain the transition operators P = {Pt : t ∈ [0, ∞)}. The next theorem gives the relationship between the potential operators
and the infinitesimal operator, which in some ways is better. This relationship is analogous to the relationship between the potential
operators and the one-step operator given above in discrete time.
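For a finite state space, the initial value problem has the familiar matrix solution Pt = e^{tG}. The sketch below (an illustration under stated assumptions, not from the text) uses the Poisson-type generator from the example at the end of this section, truncated to finitely many states.

```python
import numpy as np
from scipy.linalg import expm
from scipy.stats import poisson

# Sketch: on a finite state space the forward equation P_t' = G P_t with
# P_0 = I is solved by P_t = exp(t G). Here G is the generator of a rate-beta
# Poisson-type chain, truncated to states {0, ..., 9} for illustration.
beta, n = 1.0, 10
G = -beta * np.eye(n) + beta * np.eye(n, k=1)

t = 0.5
Pt = expm(t * G)

# Entries away from the truncation edge match the Poisson probabilities
print(Pt[0, 3], poisson.pmf(3, beta * t))   # should agree closely
```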

Suppose α ∈ (0, ∞).
1. If f ∈ D then Gf ∈ C0 and f + Uα Gf = αUα f.
2. If f ∈ C0 then Uα f ∈ D and f + G Uα f = αUα f.
Proof

For α > 0, the operators Uα and G have an inverse relationship.

Suppose again that α ∈ (0, ∞).
1. Uα = (αI − G)^{−1} : C0 → D
2. G = αI − Uα^{−1} : D → C0

Proof

So, from the generator G we can determine the potential operators U = {Uα : α ∈ (0, ∞)}, which in turn determine the transition
operators P = {Pt : t ∈ (0, ∞)}. In continuous time, transition operators P = {Pt : t ∈ [0, ∞)} can be obtained from the single
infinitesimal operator G, in a way that is reminiscent of the fact that in discrete time, the transition operators P = {P^n : n ∈ N}
can be obtained from the single one-step operator P.

Examples and Applications


Our first example is essentially deterministic.

Consider the Markov process X = {Xt : t ∈ [0, ∞)} on R satisfying the ordinary differential equation

d/dt Xt = g(Xt), t ∈ [0, ∞) (16.2.52)

where g : R → R is Lipschitz continuous. The infinitesimal operator G is given by Gf(x) = f′(x) g(x) for x ∈ R, on the
domain D of functions f : R → R where f ∈ C0 and f′ ∈ C0.

Proof

Our next example considers the Poisson process as a Markov process. Compare this with the binomial process above.

Let N = {Nt : t ∈ [0, ∞)} denote the Poisson process on N with rate β ∈ (0, ∞). Define the Markov process
X = {Xt : t ∈ [0, ∞)} by Xt = X0 + Nt where X0 takes values in N and is independent of N.

1. For t ∈ [0, ∞), show that the probability transition matrix Pt of X is given by

Pt(x, y) = e^{−βt} (βt)^{y−x} / (y − x)!, x, y ∈ N, y ≥ x (16.2.54)

2. For α ∈ [0, ∞), show that the potential matrix Uα of X is given by

Uα(x, y) = (1/(α + β)) (β/(α + β))^{y−x}, x, y ∈ N, y ≥ x (16.2.55)

3. For α > 0 and x ∈ N, identify the probability distribution defined by αUα(x, ⋅).
4. Show that the infinitesimal matrix G of X is given by G(x, x) = −β, G(x, x + 1) = β for x ∈ N.
Solutions
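A numerical sketch for part (b) of this exercise (not the book's solution): integrate e^{−αt} Pt(x, y) over t and compare with the closed form.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import poisson

# Sketch: check U_alpha(x, y) = int_0^inf e^(-alpha t) P_t(x, y) dt against
# the closed form (1/(alpha + beta)) (beta/(alpha + beta))^(y - x).
alpha, beta = 0.5, 2.0
x, y = 1, 4
k = y - x

integrand = lambda t: np.exp(-alpha * t) * poisson.pmf(k, beta * t)
numeric, _ = quad(integrand, 0, np.inf)
closed = (1 / (alpha + beta)) * (beta / (alpha + beta))**k
print(numeric, closed)   # should agree closely
```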

This page titled 16.2: Potentials and Generators for General Markov Processes is shared under a CC BY 2.0 license and was authored, remixed,
and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a
detailed edit history is available upon request.

16.3: Introduction to Discrete-Time Chains

In this and the next several sections, we consider a Markov process with the discrete time space N and with a discrete (countable)
state space. Recall that a Markov process with a discrete state space is called a Markov chain, so we are studying discrete-time
Markov chains.

Review
We will review the basic definitions and concepts in the general introduction. With both time and space discrete, many of these
definitions and concepts simplify considerably. As usual, our starting point is a probability space (Ω, F, P), so Ω is the sample
space, F the σ-algebra of events, and P the probability measure on (Ω, F). Let X = (X0, X1, X2, …) be a stochastic process
defined on the probability space, with time space N and with countable state space S. In the context of the general introduction, S
is given the power set P(S) as the σ-algebra, so all subsets of S are measurable, as are all functions from S into another
measurable space. Counting measure # is the natural measure on S, so integrals over S are simply sums. The same comments
apply to the time space N: all subsets of N are measurable and counting measure # is the natural measure on N.
The vector space B consisting of bounded functions f : S → R will play an important role. The norm that we use is the supremum
norm defined by
∥f ∥ = sup{|f (x)| : x ∈ S}, f ∈ B (16.3.1)

For n ∈ N, let Fn = σ{X0, X1, …, Xn}, the σ-algebra generated by the process up to time n. Thus F = {F0, F1, F2, …} is
the natural filtration associated with X. We also let Gn = σ{Xn, Xn+1, …}, the σ-algebra generated by the process from time n
on. So if n ∈ N represents the present time, then Fn contains the events in the past and Gn the events in the future.

Definitions
We start with the basic definition of the Markov property: the past and future are conditionally independent, given the present.

X = (X0, X1, X2, …) is a Markov chain if P(A ∩ B ∣ Xn) = P(A ∣ Xn) P(B ∣ Xn) for every n ∈ N, A ∈ Fn and
B ∈ Gn.

There are a number of equivalent formulations of the Markov property for a discrete-time Markov chain. We give a few of these.

X = (X0, X1, X2, …) is a Markov chain if either of the following equivalent conditions is satisfied:
1. P(Xn+1 = x ∣ Fn) = P(Xn+1 = x ∣ Xn) for every n ∈ N and x ∈ S.
2. E[f(Xn+1) ∣ Fn] = E[f(Xn+1) ∣ Xn] for every n ∈ N and f ∈ B.

Part (a) states that for n ∈ N, the conditional probability density function of Xn+1 given Fn is the same as the conditional
probability density function of Xn+1 given Xn. Part (b) states, in terms of expected value, that the conditional distribution of
Xn+1 given Fn is the same as the conditional distribution of Xn+1 given Xn. Both parts are the Markov property looking just one
time step in the future. But with discrete time, this is equivalent to the Markov property at general future times.

X = (X0, X1, X2, …) is a Markov chain if either of the following equivalent conditions is satisfied:
1. P(Xn+k = x ∣ Fn) = P(Xn+k = x ∣ Xn) for every n, k ∈ N and x ∈ S.
2. E[f(Xn+k) ∣ Fn] = E[f(Xn+k) ∣ Xn] for every n, k ∈ N and f ∈ B.

Part (a) states that for n, k ∈ N, the conditional probability density function of Xn+k given Fn is the same as the conditional
probability density function of Xn+k given Xn. Part (b) states, in terms of expected value, that the conditional distribution of
Xn+k given Fn is the same as the conditional distribution of Xn+k given Xn. In discrete time and space, the Markov property can
also be stated without explicit reference to σ-algebras. If you are not familiar with measure theory, you can take this as the starting
definition.

X = (X0, X1, X2, …) is a Markov chain if for every n ∈ N and every sequence of states (x0, x1, …, xn−1, x, y),

P(Xn+1 = y ∣ X0 = x0, X1 = x1, …, Xn−1 = xn−1, Xn = x) = P(Xn+1 = y ∣ Xn = x) (16.3.2)

The theory of discrete-time Markov chains is simplified considerably if we add an additional assumption.

A Markov chain X = (X0, X1, X2, …) is time homogeneous if

P(Xn+k = y ∣ Xk = x) = P(Xn = y ∣ X0 = x) (16.3.3)

for every k, n ∈ N and every x, y ∈ S.

That is, the conditional distribution of Xn+k given Xk = x depends only on n. So if X is homogeneous (we usually don't bother with the time adjective), then the chain {Xk+n : n ∈ N} given Xk = x is equivalent (in distribution) to the chain {Xn : n ∈ N} given X0 = x. For this reason, the initial distribution is often unspecified in the study of Markov chains: if the chain is in state x ∈ S at a particular time k ∈ N, then it doesn't really matter how the chain got to state x; the process essentially “starts over”, independently of the past. The term stationary is sometimes used instead of homogeneous.
From now on, we will usually assume that our Markov chains are homogeneous. This is not as big a loss of generality as you might think. A non-homogeneous Markov chain can be turned into a homogeneous Markov process by enlarging the state space, as shown in the introduction to general Markov processes, but at the cost of creating an uncountable state space. For a homogeneous Markov chain, if k, n ∈ N, x ∈ S, and f ∈ B, then

E[f (Xk+n ) ∣ Xk = x] = E[f (Xn ) ∣ X0 = x] (16.3.4)

Stopping Times and the Strong Markov Property


Consider again a stochastic process X = (X0, X1, X2, …) with countable state space S, and with the natural filtration F = (F0, F1, F2, …) as given above. Recall that a random variable τ taking values in N ∪ {∞} is a stopping time or a Markov time for X if {τ = n} ∈ Fn for each n ∈ N. Intuitively, we can tell whether or not τ = n by observing the chain up to time n. In a sense, a stopping time is a random time that does not require that we see into the future. The following result gives the quintessential examples of stopping times.

Suppose again that X = {Xn : n ∈ N} is a discrete-time Markov chain with state space S as defined above. For A ⊆ S, the following random times are stopping times:
1. ρA = inf{n ∈ N : Xn ∈ A}, the entrance time to A.
2. τA = inf{n ∈ N+ : Xn ∈ A}, the hitting time to A.
Proof

An example of a random time that is generally not a stopping time is the last time that the process is in A:

ζA = max{n ∈ N+ : Xn ∈ A} (16.3.5)

We cannot tell if ζA = n without looking into the future: {ζA = n} = {Xn ∈ A, Xn+1 ∉ A, Xn+2 ∉ A, …} for n ∈ N.
If τ is a stopping time for X, the σ-algebra associated with τ is

Fτ = {A ∈ F : A ∩ {τ = n} ∈ Fn for all n ∈ N} (16.3.6)

Intuitively, Fτ contains the events that can be described by the process up to the random time τ, in the same way that Fn contains the events that can be described by the process up to the deterministic time n ∈ N. For more information see the section on filtrations and stopping times.
The strong Markov property states that the future is independent of the past, given the present, when the present time is a stopping
time. For a discrete-time Markov chain, the ordinary Markov property implies the strong Markov property.

If X = (X0, X1, X2, …) is a discrete-time Markov chain then X has the strong Markov property. That is, if τ is a finite stopping time for X then
1. P(Xτ+k = x ∣ Fτ) = P(Xτ+k = x ∣ Xτ) for every k ∈ N and x ∈ S.
2. E[f(Xτ+k) ∣ Fτ] = E[f(Xτ+k) ∣ Xτ] for every k ∈ N and f ∈ B.

Part (a) states that the conditional probability density function of Xτ+k given Fτ is the same as the conditional probability density function of Xτ+k given just Xτ. Part (b) also states, in terms of expected value, that the conditional distribution of Xτ+k given Fτ is the same as the conditional distribution of Xτ+k given just Xτ. Assuming homogeneity as usual, the Markov chain {Xτ+n : n ∈ N} given Xτ = x is equivalent in distribution to the chain {Xn : n ∈ N} given X0 = x.

Transition Matrices
Suppose again that X = (X0, X1, X2, …) is a homogeneous, discrete-time Markov chain with state space S. With a discrete state space, the transition kernels studied in the general introduction become transition matrices, with rows and columns indexed by S (and so perhaps of infinite size). The kernel operations become familiar matrix operations. The results in this section are special cases of the general results, but we sometimes give independent proofs for completeness, and because the proofs are simpler. You may want to review the section on kernels in the chapter on expected value.

For n ∈ N let

Pn(x, y) = P(Xn = y ∣ X0 = x), (x, y) ∈ S × S (16.3.7)

The matrix Pn is the n-step transition probability matrix for X.

Thus, y ↦ Pn(x, y) is the probability density function of Xn given X0 = x. In particular, Pn is a probability matrix (or stochastic matrix) since Pn(x, y) ≥ 0 for (x, y) ∈ S² and ∑_{y∈S} Pn(x, y) = 1 for x ∈ S. As with any nonnegative matrix on S, Pn defines a kernel on S for n ∈ N:

Pn(x, A) = ∑_{y∈A} Pn(x, y) = P(Xn ∈ A ∣ X0 = x), x ∈ S, A ⊆ S (16.3.8)

So A ↦ Pn(x, A) is the probability distribution of Xn given X0 = x. The next result is the Chapman-Kolmogorov equation, named for Sydney Chapman and Andrei Kolmogorov. It gives the basic relationship between the transition matrices.

If m, n ∈ N then Pm Pn = Pm+n.

Proof

It follows immediately that the transition matrices are just the matrix powers of the one-step transition matrix. That is, letting P = P1, we have Pn = P^n for all n ∈ N. Note that P0 = P^0 = I, the identity matrix on S given by I(x, y) = 1 if x = y and 0 otherwise. The right operator corresponding to P^n yields an expected value.

Suppose that n ∈ N and that f : S → R. Then, assuming that the expected value exists,

P^n f(x) = ∑_{y∈S} P^n(x, y) f(y) = E[f(Xn) ∣ X0 = x], x ∈ S (16.3.12)

Proof

The existence of the expected value is only an issue if S is infinite. In particular, the result holds if f is nonnegative or if f ∈ B (which in turn is always the case if S is finite). In fact, P^n is a linear contraction operator on the space B for n ∈ N. That is, if f ∈ B then P^n f ∈ B and ∥P^n f∥ ≤ ∥f∥. The left operator corresponding to P^n is defined similarly. For f : S → R,

f P^n(y) = ∑_{x∈S} f(x) P^n(x, y), y ∈ S (16.3.14)

assuming again that the sum makes sense (as before, only an issue when S is infinite). The left operator is often restricted to nonnegative functions, and we often think of such a function as the density function (with respect to #) of a positive measure on S. In this sense, the left operator maps a density function to another density function.
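For a chain with a finite state space, the transition matrices, the Chapman-Kolmogorov equation, and the left and right operators are all ordinary matrix computations. Here is a minimal sketch in Python with NumPy; the three-state matrix and the functions are illustrative values, not from the text.

import numpy as np

# An illustrative probability matrix on a three-state space.
P = np.array([[0.1, 0.6, 0.3],
              [0.5, 0.2, 0.3],
              [0.4, 0.4, 0.2]])

# Chapman-Kolmogorov: the n-step matrix is the matrix power P^n.
P5 = np.linalg.matrix_power(P, 5)
assert np.allclose(P5, np.linalg.matrix_power(P, 2) @ np.linalg.matrix_power(P, 3))

# Right operator: (P^n f)(x) = E[f(X_n) | X_0 = x] for bounded f.
f = np.array([1.0, -2.0, 0.5])
print(P5 @ f)

# Left operator: if f0 is a probability density function on S, then
# f0 P^n is the density of X_n; it again sums to 1.
f0 = np.array([0.2, 0.5, 0.3])
print(f0 @ P5, (f0 @ P5).sum())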

A function f : S → R is invariant for P (or for the chain X) if f P = f.

Clearly if f is invariant, so that f P = f, then f P^n = f for all n ∈ N. If f is a probability density function, then so is f P.

If X0 has probability density function f, then Xn has probability density function f P^n for n ∈ N.

Proof

In particular, if X0 has probability density function f, and f is invariant for X, then Xn has probability density function f for all n ∈ N, so the sequence of variables X = (X0, X1, X2, …) is identically distributed. Combining two results above, suppose that X0 has probability density function f and that g : S → R. Assuming the expected value exists, E[g(Xn)] = f P^n g. Explicitly,

E[g(Xn)] = ∑_{x∈S} ∑_{y∈S} f(x) P^n(x, y) g(y) (16.3.16)

It also follows from the last theorem that the distribution of X0 (the initial distribution) and the one-step transition matrix determine the distribution of Xn for each n ∈ N. Actually, these basic quantities determine the finite dimensional distributions of the process, a stronger result.

Suppose that X0 has probability density function f0. For any sequence of states (x0, x1, …, xn) ∈ S^{n+1},

P(X0 = x0, X1 = x1, …, Xn = xn) = f0(x0)P(x0, x1)P(x1, x2) ⋯ P(xn−1, xn) (16.3.17)

Proof

Computations of this sort are the reason for the term chain in the name Markov chain. From this result, it follows that given a probability matrix P on S and a probability density function f on S, we can construct a Markov chain X = (X0, X1, X2, …) such that X0 has probability density function f and the chain has one-step transition matrix P. In applied problems, we often know the one-step transition matrix P from modeling considerations, and again, the initial distribution is often unspecified.
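The construction also gives a direct simulation algorithm: draw X0 from f, then repeatedly draw the next state from the row of P indexed by the current state. A minimal sketch in Python, with an illustrative matrix and initial density (not from the text):

import numpy as np

rng = np.random.default_rng(seed=42)

def simulate_chain(f0, P, n_steps, rng):
    # Draw X_0 from the density f0, then draw each successive state
    # from the row of P indexed by the current state.
    states = np.arange(len(f0))
    x = rng.choice(states, p=f0)
    path = [int(x)]
    for _ in range(n_steps):
        x = rng.choice(states, p=P[x])
        path.append(int(x))
    return path

P = np.array([[0.1, 0.6, 0.3],
              [0.5, 0.2, 0.3],
              [0.4, 0.4, 0.2]])
f0 = np.array([1/3, 1/3, 1/3])
print(simulate_chain(f0, P, 10, rng))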
There is a natural graph (in the combinatorial sense) associated with a homogeneous, discrete-time Markov chain.

Suppose again that X = (X0, X1, X2, …) is a Markov chain with state space S and transition probability matrix P. The state graph of X is the directed graph with vertex set S and edge set E = {(x, y) ∈ S² : P(x, y) > 0}.

That is, there is a directed edge from x to y if and only if state x leads to state y in one step. Note that the graph may well have
loops, since a state can certainly lead back to itself in one step. More generally, we have the following result:

Suppose again that X = (X0, X1, X2, …) is a Markov chain with state space S and transition probability matrix P. For x, y ∈ S and n ∈ N+, there is a directed path of length n in the state graph from x to y if and only if P^n(x, y) > 0.

Proof
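For a finite chain, both the edge set of the state graph and the directed-path criterion are easy to compute. A brief sketch in Python; the matrix is an illustrative example, not from the text.

import numpy as np

P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.0, 1.0]])

# Edges of the state graph: the pairs (x, y) with P(x, y) > 0.
edges = [(int(x), int(y)) for x, y in zip(*np.nonzero(P > 0))]
print(edges)

# By the result above, there is a directed path of length n from x to y
# if and only if P^n(x, y) > 0.
def paths_of_length(P, n):
    return np.linalg.matrix_power(P, n) > 0

print(paths_of_length(P, 3))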

Potential Matrices
For α ∈ (0, 1], the α-potential matrix Rα of X is

Rα(x, y) = ∑_{n=0}^∞ α^n P^n(x, y), (x, y) ∈ S² (16.3.19)

1. R = R1 is simply the potential matrix of X.
2. R(x, y) is the expected number of visits by X to y ∈ S, starting at x ∈ S.
Proof

Note that it's quite possible that R(x, y) = ∞ for some (x, y) ∈ S². In fact, knowing when this is the case is of considerable importance in recurrence and transience, which we study in the next section. As with any nonnegative matrix, the α-potential matrix defines a kernel and defines left and right operators. For the kernel,

Rα(x, A) = ∑_{y∈A} Rα(x, y) = ∑_{n=0}^∞ α^n P^n(x, A), x ∈ S, A ⊆ S (16.3.21)

In particular, R(x, A) is the expected number of visits by the chain to A starting in x:

R(x, A) = ∑_{y∈A} R(x, y) = ∑_{n=0}^∞ P^n(x, A) = E[∑_{n=0}^∞ 1(Xn ∈ A) ∣ X0 = x], x ∈ S, A ⊆ S (16.3.22)

If α ∈ (0, 1), then Rα(x, S) = 1/(1 − α) for all x ∈ S.
Proof

Hence Rα is a bounded matrix for α ∈ (0, 1), and (1 − α)Rα is a probability matrix. There is a simple interpretation of this matrix.

If α ∈ (0, 1) then (1 − α)Rα(x, y) = P(XN = y ∣ X0 = x) for (x, y) ∈ S², where N is independent of X and has the geometric distribution on N with parameter 1 − α.
Proof

So (1 − α)Rα can be thought of as a transition matrix just as P^n is a transition matrix, but corresponding to the random time N (with α as a parameter) rather than the deterministic time n. An interpretation of the potential matrix Rα for α ∈ (0, 1) can also be given in economic terms. Suppose that we receive one monetary unit each time the chain visits a fixed state y ∈ S. Then R(x, y) is the expected total reward, starting in state x ∈ S. However, money that we will receive at times distant in the future typically has less value to us now than money that we will receive soon. Specifically, suppose that a monetary unit at time n ∈ N has a present value of α^n, so that α is an inflation factor (sometimes also called a discount factor). Then Rα(x, y) gives the expected total discounted reward, starting at x ∈ S.

The potential kernels R = {Rα : α ∈ (0, 1)} completely determine the transition kernels P = {Pn : n ∈ N}.
Proof

Of course, it's really only necessary to determine P, the one step transition kernel, since the other transition kernels are powers of P. In any event, it follows that the matrices R = {Rα : α ∈ (0, 1)}, along with the initial distribution, completely determine the finite dimensional distributions of the Markov chain X. The potential matrices commute with each other and with the transition matrices.

If α, β ∈ (0, 1] and k ∈ N, then
1. P^k Rα = Rα P^k = ∑_{n=0}^∞ α^n P^{n+k}
2. Rα Rβ = Rβ Rα = ∑_{m=0}^∞ ∑_{n=0}^∞ α^m β^n P^{m+n}

Proof

The fundamental equation that relates the potential matrices is given next.

If α, β ∈ (0, 1] with α ≥ β then

α Rα = β Rβ + (α − β)Rα Rβ (16.3.31)

Proof

If α ∈ (0, 1] then I + αRα P = I + αP Rα = Rα.
Proof

This leads to an important result: when α ∈ (0, 1), there is an inverse relationship between P and Rα.

If α ∈ (0, 1), then
1. Rα = (I − αP)^{−1}
2. P = (1/α)(I − Rα^{−1})
Proof

This result shows again that the potential matrix Rα determines the transition operator P.
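For a finite state space this inverse relationship can be checked with numerical linear algebra. A minimal sketch in Python; the matrix and the value of α are illustrative choices, not prescribed by the text.

import numpy as np

P = np.array([[0.0, 0.5, 0.5],
              [0.25, 0.0, 0.75],
              [1.0, 0.0, 0.0]])
alpha = 0.9
I = np.eye(3)

# R_alpha = (I - alpha P)^{-1}; the series sum_n alpha^n P^n converges
# since alpha < 1.
R = np.linalg.inv(I - alpha * P)

# Row sums are 1/(1 - alpha), so (1 - alpha) R_alpha is a probability matrix.
assert np.allclose(R.sum(axis=1), 1 / (1 - alpha))

# Recover P from R_alpha: P = (1/alpha)(I - R_alpha^{-1}).
assert np.allclose((I - np.linalg.inv(R)) / alpha, P)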

Sampling in Time
If we sample a Markov chain at multiples of a fixed time k , we get another (homogeneous) chain.

Suppose that X = (X0, X1, X2, …) is a Markov chain with state space S and transition probability matrix P. For fixed k ∈ N+, the sequence (X0, Xk, X2k, …) is a Markov chain on S with transition probability matrix P^k.

If we sample a Markov chain at a general increasing sequence of time points 0 < n1 < n2 < ⋯ in N, then the resulting stochastic process Y = (Y0, Y1, Y2, …), where Yk = X_{nk} for k ∈ N, is still a Markov chain, but is not time homogeneous in general.

Recall that if A is a nonempty subset of S, then PA is the matrix P restricted to A × A. So PA is a sub-stochastic matrix, since the row sums may be less than 1. Recall also that PA^n means (PA)^n, not (P^n)A; in general these matrices are different.

If A is a nonempty subset of S then for n ∈ N,

PA^n(x, y) = P(X1 ∈ A, X2 ∈ A, …, Xn−1 ∈ A, Xn = y ∣ X0 = x), (x, y) ∈ A × A (16.3.36)

That is, PA^n(x, y) is the probability of going from state x to y in n steps, remaining in A all the while. In terms of the state graph of X, it is the sum of products of probabilities along paths of length n from x to y that stay inside A.

Examples and Applications


Computational Exercises
Let X = (X0, X1, …) be the Markov chain on S = {a, b, c} with transition matrix

P = ⎡ 0    1/2  1/2 ⎤
    ⎢ 1/4  0    3/4 ⎥   (16.3.37)
    ⎣ 1    0    0   ⎦

For the Markov chain X,
1. Draw the state graph.
2. Find P(X1 = a, X2 = b, X3 = c ∣ X0 = a).
3. Find P².
4. Suppose that g : S → R is given by g(a) = 1, g(b) = 2, g(c) = 3. Find E[g(X2) ∣ X0 = x] for x ∈ S.
5. Suppose that X0 has the uniform distribution on S. Find the probability density function of X2.

Answer

Let A = {a, b}. Find each of the following:
1. PA
2. PA²
3. (P²)A
Proof

Find the invariant probability density function of X


Answer

Compute the α-potential matrix Rα for α ∈ (0, 1).

Answer
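The answers to the exercises above can be checked numerically. The sketch below, in Python, uses the transition matrix of the chain on S = {a, b, c} given above, with the states ordered a, b, c, and α = 1/2 as an illustrative value for the potential matrix.

import numpy as np

P = np.array([[0.0, 0.5, 0.5],
              [0.25, 0.0, 0.75],
              [1.0, 0.0, 0.0]])

# P^2, and the density of X_2 when X_0 is uniform on S.
P2 = np.linalg.matrix_power(P, 2)
print(P2, np.full(3, 1/3) @ P2)

# E[g(X_2) | X_0 = x] for g(a), g(b), g(c) = 1, 2, 3.
g = np.array([1.0, 2.0, 3.0])
print(P2 @ g)

# Invariant density: the left eigenvector of P for eigenvalue 1, normalized.
w, V = np.linalg.eig(P.T)
f = np.real(V[:, np.argmin(np.abs(w - 1))])
print(f / f.sum())

# The alpha-potential matrix for alpha = 1/2.
print(np.linalg.inv(np.eye(3) - 0.5 * P))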

The Two-State Chain


Perhaps the simplest, non-trivial Markov chain has two states, say S = {0, 1}, and the transition probability matrix given below, where p ∈ (0, 1) and q ∈ (0, 1) are parameters.

P = ⎡ 1 − p    p   ⎤   (16.3.41)
    ⎣   q    1 − q ⎦

For n ∈ N,

P^n = 1/(p + q) ⎡ q + p(1 − p − q)^n    p − p(1 − p − q)^n ⎤   (16.3.42)
                ⎣ q − q(1 − p − q)^n    p + q(1 − p − q)^n ⎦

Proof

As n → ∞,

P^n → 1/(p + q) ⎡ q  p ⎤   (16.3.44)
                ⎣ q  p ⎦

Proof
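The closed form for P^n and its limit are easy to verify numerically. A brief sketch in Python with illustrative parameter values:

import numpy as np

p, q = 0.3, 0.6
P = np.array([[1 - p, p],
              [q, 1 - q]])

def Pn_closed_form(n):
    # The formula above, with r = 1 - p - q.
    r = (1 - p - q) ** n
    return np.array([[q + p * r, p - p * r],
                     [q - q * r, p + q * r]]) / (p + q)

for n in (1, 5, 50):
    assert np.allclose(Pn_closed_form(n), np.linalg.matrix_power(P, n))

# For large n, the rows approach (q, p)/(p + q).
print(np.linalg.matrix_power(P, 200))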

Open the simulation of the two-state, discrete-time Markov chain. For various values of p and q, and different initial states, run the simulation 1000 times. Compare the relative frequency distribution to the limiting distribution, and in particular, note the rate of convergence. Be sure to try the case p = q = 0.01.

The only invariant probability density function for the chain is

f = [ q/(p + q)   p/(p + q) ]   (16.3.45)

Proof

For α ∈ (0, 1), the α-potential matrix is

Rα = 1/((p + q)(1 − α)) ⎡ q  p ⎤ + 1/((p + q)[1 − α(1 − p − q)]) ⎡  p  −p ⎤   (16.3.46)
                        ⎣ q  p ⎦                                 ⎣ −q   q ⎦

Proof

In spite of its simplicity, the two state chain illustrates some of the basic limiting behavior and the connection with invariant
distributions that we will study in general in a later section.

Independent Variables and Random Walks


Suppose that X = (X0, X1, X2, …) is a sequence of independent random variables taking values in a countable set S, and that (X1, X2, …) are identically distributed with (discrete) probability density function f.

X is a Markov chain on S with transition probability matrix P given by P(x, y) = f(y) for (x, y) ∈ S × S. Also, f is invariant for P.
Proof

As a Markov chain, the process X is not very interesting, although of course it is very interesting in other ways. Suppose now that S = Z, the set of integers, and consider the partial sum process (or random walk) Y associated with X:

Yn = ∑_{i=0}^n Xi, n ∈ N (16.3.49)

Y is a Markov chain on Z with transition probability matrix Q given by Q(x, y) = f(y − x) for (x, y) ∈ Z × Z.
Proof

Thus the probability density function f governs the distribution of the step sizes of the random walker on Z.

Consider the special case of the random walk on Z with f(1) = p and f(−1) = 1 − p, where p ∈ (0, 1).
1. Give the transition matrix Q explicitly.
2. Give Q^n explicitly for n ∈ N.
Answer

This special case is the simple random walk on Z. When p = 1/2 we have the simple, symmetric random walk. The simple random walk on Z is studied in more detail in the section on random walks on graphs. The simple symmetric random walk is studied in more detail in the chapter on Bernoulli Trials.
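Since the steps are independent, a path of the simple random walk is easy to simulate. A minimal sketch in Python (the seed and the parameters are arbitrary choices):

import numpy as np

rng = np.random.default_rng(seed=7)

def simple_random_walk(p, n_steps, rng):
    # Steps are +1 with probability p and -1 with probability 1 - p;
    # the walk starts at 0 and the positions are the partial sums.
    steps = rng.choice([1, -1], size=n_steps, p=[p, 1 - p])
    return np.concatenate(([0], np.cumsum(steps)))

print(simple_random_walk(0.5, 20, rng))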

Doubly Stochastic Matrices


A matrix P on S is doubly stochastic if it is nonnegative and if the row and column sums are 1:

∑_{u∈S} P(x, u) = 1, ∑_{u∈S} P(u, y) = 1, (x, y) ∈ S × S (16.3.53)

Suppose that X is a Markov chain on a finite state space S with doubly stochastic transition matrix P . Then the uniform
distribution on S is invariant.
Proof

If P and Q are doubly stochastic matrices on S , then so is P Q.


Proof

It follows that if P is doubly stochastic then so is P^n for n ∈ N.

Suppose that X = (X0, X1, …) is the Markov chain with state space S = {−1, 0, 1} and with transition matrix

P = ⎡ 0    1/2  1/2 ⎤
    ⎢ 1/2  0    1/2 ⎥   (16.3.56)
    ⎣ 1/2  1/2  0   ⎦

1. Draw the state graph.
2. Show that P is doubly stochastic.
3. Find P².
4. Show that the uniform distribution on S is the only invariant distribution for X.
5. Suppose that X0 has the uniform distribution on S. For n ∈ N, find E(Xn) and var(Xn).
6. Find the α-potential matrix Rα for α ∈ (0, 1).
Proof

Recall that a matrix M indexed by a countable set S is symmetric if M (x, y) = M (y, x) for all x, y ∈ S .

If P is a symmetric, stochastic matrix then P is doubly stochastic.


Proof

The converse is not true. The doubly stochastic matrix in the exercise above is not symmetric. But since a symmetric, stochastic
matrix on a finite state space is doubly stochastic, the uniform distribution is invariant.

Suppose that X = (X0, X1, …) is the Markov chain with state space S = {−1, 0, 1} and with transition matrix

P = ⎡ 1  0    0   ⎤
    ⎢ 0  1/4  3/4 ⎥   (16.3.60)
    ⎣ 0  3/4  1/4 ⎦

1. Draw the state graph.
2. Show that P is symmetric.
3. Find P².
4. Find all invariant probability density functions for X.
5. Find the α-potential matrix Rα for α ∈ (0, 1).

Proof

Special Models
The Markov chains in the following exercises model interesting processes that are studied in separate sections.

Read the introduction to the Ehrenfest chains.

Read the introduction to the Bernoulli-Laplace chain.

Read the introduction to the reliability chains.

Read the introduction to the branching chain.

Read the introduction to the queuing chains.

Read the introduction to random walks on graphs.

Read the introduction to birth-death chains.

This page titled 16.3: Introduction to Discrete-Time Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

16.4: Transience and Recurrence for Discrete-Time Chains

The study of discrete-time Markov chains, particularly the limiting behavior, depends critically on the random times between visits
to a given state. The nature of these random times leads to a fundamental dichotomy of the states.

Basic Theory
As usual, our starting point is a probability space (Ω, F, P), so that Ω is the sample space, F the σ-algebra of events, and P the probability measure on (Ω, F). Suppose now that X = (X0, X1, X2, …) is a (homogeneous) discrete-time Markov chain with (countable) state space S and transition probability matrix P. So by definition,

P(x, y) = P(Xn+1 = y ∣ Xn = x) (16.4.1)

for x, y ∈ S and n ∈ N. Let Fn = σ{X0, X1, …, Xn}, the σ-algebra of events defined by the chain up to time n ∈ N, so that F = (F0, F1, …) is the natural filtration associated with X.

Hitting Times and Probabilities


Let A be a nonempty subset of S. Recall that the hitting time to A is the random variable that gives the first positive time that the chain is in A:

τA = min{n ∈ N+ : Xn ∈ A} (16.4.2)

Since the chain may never enter A, the random variable τA takes values in N+ ∪ {∞} (recall our convention that the minimum of the empty set is ∞). Recall also that τA is a stopping time for X. That is, {τA = n} ∈ Fn for n ∈ N+. Intuitively, this means that we can tell if τA = n by observing the chain up to time n. This is clearly the case, since explicitly

{τA = n} = {X1 ∉ A, …, Xn−1 ∉ A, Xn ∈ A} (16.4.3)

When A = {x} for x ∈ S, we will simplify the notation to τx. This random variable gives the first positive time that the chain is in state x. When the chain enters a set of states A for the first time, the chain must visit some state in A for the first time, so it's clear that

τA = min{τx : x ∈ A}, A ⊆ S (16.4.4)

Next we define two functions on S that are related to the hitting times.

For x ∈ S, A ⊆ S (nonempty), and n ∈ N+ define
1. Hn(x, A) = P(τA = n ∣ X0 = x)
2. H(x, A) = P(τA < ∞ ∣ X0 = x)

So H(x, A) = ∑_{n=1}^∞ Hn(x, A).

Note that n ↦ Hn(x, A) is the probability density function of τA, given X0 = x, except that the density function may be defective in the sense that the sum H(x, A) may be less than 1, in which case of course, 1 − H(x, A) = P(τA = ∞ ∣ X0 = x). Again, when A = {y}, we will simplify the notation to Hn(x, y) and H(x, y), respectively. In particular, H(x, x) is the probability, starting at x, that the chain eventually returns to x. If x ≠ y, H(x, y) is the probability, starting at x, that the chain eventually reaches y. Just knowing when H(x, y) is 0, positive, and 1 will turn out to be of considerable importance in the overall structure and limiting behavior of the chain. As a function on S², we will refer to H as the hitting matrix of X. Note however, that unlike the transition matrix P, we do not have the structure of a kernel. That is, A ↦ H(x, A) is not a measure, so in particular, it is generally not true that H(x, A) = ∑_{y∈A} H(x, y). The same remarks apply to Hn for n ∈ N+. However, there are interesting relationships between the transition matrix and the hitting matrix.

H(x, y) > 0 if and only if P^n(x, y) > 0 for some n ∈ N+.
Proof

The following result gives a basic relationship between the sequence of hitting probabilities and the sequence of transition probabilities.

Suppose that (x, y) ∈ S². Then

P^n(x, y) = ∑_{k=1}^n Hk(x, y) P^{n−k}(y, y), n ∈ N+ (16.4.6)

Proof

Suppose that x ∈ S and A ⊆ S. Then
1. Hn+1(x, A) = ∑_{y∉A} P(x, y) Hn(y, A) for n ∈ N+
2. H(x, A) = P(x, A) + ∑_{y∉A} P(x, y) H(y, A)

Proof
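For a finite state space, part (a) of the last result gives a direct numerical scheme for the hitting probability H(x, A): compute H1, H2, … from the recursion and accumulate the sum. A minimal sketch in Python; the three-state chain at the end is an illustrative example, not from the text.

import numpy as np

def hitting_probability(P, A, n_max=10_000, tol=1e-15):
    # A is a boolean mask over the states. H_1(x, A) = P(x, A), and
    # H_{n+1}(x, A) = sum over y not in A of P(x, y) H_n(y, A).
    h_n = P[:, A].sum(axis=1)
    H = h_n.copy()
    for _ in range(n_max):
        h_n = P[:, ~A] @ h_n[~A]
        H += h_n
        if h_n.max() < tol:
            break
    return H

# Illustrative chain in which state 2 is absorbing.
P = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.0, 0.0, 1.0]])
A = np.array([False, False, True])
print(hitting_probability(P, A))   # H(x, {2}) for each state x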

The following definition is fundamental for the study of Markov chains.

Let x ∈ S .
1. State x is recurrent if H (x, x) = 1.
2. State x is transient if H (x, x) < 1.

Thus, starting in a recurrent state, the chain will, with probability 1, eventually return to the state. As we will see, the chain will
return to the state infinitely often with probability 1, and the times of the visits will form the arrival times of a renewal process.
This will turn out to be the critical observation in the study of the limiting behavior of the chain. By contrast, if the chain starts in a
transient state, then there is a positive probability that the chain will never return to the state.

Counting Variables and Potentials


Again, suppose that A is a nonempty set of states. A natural complement to the hitting time to A is the counting variable that gives the number of visits to A (at positive times). Thus, let

NA = ∑_{n=1}^∞ 1(Xn ∈ A) (16.4.12)

Note that NA takes values in N ∪ {∞}. We will mostly be interested in the special case A = {x} for x ∈ S, and in this case, we will simplify the notation to Nx.

Let G(x, A) = E(NA ∣ X0 = x) for x ∈ S and A ⊆ S. Then G is a kernel on S and

G(x, A) = ∑_{n=1}^∞ P^n(x, A) (16.4.13)

Proof

Thus G(x, A) is the expected number of visits to A at positive times. As usual, when A = {y} for y ∈ S we simplify the notation to G(x, y), and then more generally we have G(x, A) = ∑_{y∈A} G(x, y) for A ⊆ S. So, as a matrix on S, G = ∑_{n=1}^∞ P^n. The matrix G is closely related to the potential matrix R of X, given by R = ∑_{n=0}^∞ P^n. So R = I + G, and R(x, y) gives the expected number of visits to y ∈ S at all times (not just positive times), starting at x ∈ S. The matrix G is more useful for our purposes in this section.
The distribution of Ny has a simple representation in terms of the hitting probabilities. Note that because of the Markov property and the time homogeneous property, whenever the chain reaches state y, the future behavior is independent of the past and is stochastically the same as the chain starting in state y at time 0. This is the critical observation in the proof of the following theorem.

If x, y ∈ S then
1. P(Ny = 0 ∣ X0 = x) = 1 − H(x, y)
2. P(Ny = n ∣ X0 = x) = H(x, y)[H(y, y)]^{n−1}[1 − H(y, y)] for n ∈ N+

Proof

Figure 16.4.1: Visits to state y starting in state x
The essence of the proof is illustrated in the graphic above. The thick lines are intended as reminders that these are not one step transitions, but rather represent all paths between the given vertices. Note that in the special case that x = y we have

P(Nx = n ∣ X0 = x) = [H(x, x)]^n [1 − H(x, x)], n ∈ N (16.4.15)

In all cases, the counting variable Ny has essentially a geometric distribution, but the distribution may well be defective, with some of the probability mass at ∞. The behavior is quite different depending on whether y is transient or recurrent.

If x, y ∈ S and y is transient then
1. P(Ny < ∞ ∣ X0 = x) = 1
2. G(x, y) = H(x, y)/[1 − H(y, y)]
3. H(x, y) = G(x, y)/[1 + G(y, y)]
Proof

If x, y ∈ S and y is recurrent then
1. P(Ny = 0 ∣ X0 = x) = 1 − H(x, y) and P(Ny = ∞ ∣ X0 = x) = H(x, y)
2. G(x, y) = 0 if H(x, y) = 0 and G(x, y) = ∞ if H(x, y) > 0
3. P(Ny = ∞ ∣ X0 = y) = 1 and G(y, y) = ∞

Proof

Note that there is an invertible relationship between the matrix H and the matrix G; if we know one we can compute the other. In
particular, we can characterize the transience or recurrence of a state in terms of G. Here is our summary so far:

Let x ∈ S .
1. State x is transient if and only if H (x, x) < 1 if and only if G(x, x) < ∞.
2. State x is recurrent if and only if H (x, x) = 1 if and only if G(x, x) = ∞.

Of course, the classification also holds for the potential matrix R = I +G . That is, state x ∈ S is transient if and only if
R(x, x) < ∞ and state x is recurrent if and only if R(x, x) = ∞.

Relations
The hitting probabilities suggest an important relation on the state space S .

For (x, y) ∈ S², we say that x leads to y and we write x → y if either x = y or H(x, y) > 0.

It follows immediately from the result above that x → y if and only if P^n(x, y) > 0 for some n ∈ N. In terms of the state graph of the chain, x → y if and only if x = y or there is a directed path from x to y. Note that the leads to relation is reflexive by definition: x → x for every x ∈ S. The relation has another important property as well.

The leads to relation is transitive: For x, y, z ∈ S , if x → y and y → z then x → z .


Proof

The leads to relation naturally suggests a couple of other definitions that are important.

Suppose that A ⊆ S is nonempty.


1. A is closed if x ∈ A and x → y implies y ∈ A .
2. A is irreducible if A is closed and has no proper closed subsets.

Suppose that A ⊆ S is closed. Then
1. PA, the restriction of P to A × A, is a transition probability matrix on A.
2. X restricted to A is a Markov chain with transition probability matrix PA.
3. (PA)^n = (P^n)A for n ∈ N.

Proof

Of course, the entire state space S is closed by definition. If it is also irreducible, we say the Markov chain X itself is irreducible. Recall that for a nonempty subset A of S and for n ∈ N, the notation PA^n refers to (PA)^n and not (P^n)A. In general, these are not the same, and in fact for x, y ∈ A,

PA^n(x, y) = P(X1 ∈ A, …, Xn−1 ∈ A, Xn = y ∣ X0 = x) (16.4.17)

the probability of going from x to y in n steps, remaining in A all the while. But if A is closed, then as noted in part (c), this is just P^n(x, y).
n

Suppose that A is a nonempty subset of S . Then cl(A) = {y ∈ S : x → y for some x ∈ A} is the smallest closed set
containing A , and is called the closure of A . That is,
1. cl(A) is closed.
2. A ⊆ cl(A) .
3. If B is closed and A ⊆ B then cl(A) ⊆ B
Proof

Recall that for a fixed positive integer k, P^k is also a transition probability matrix, and in fact governs the k-step Markov chain (X0, Xk, X2k, …). It follows that we could consider the leads to relation for this chain, and all of the results above would still hold (relative, of course, to the k-step chain). Occasionally we will need to consider this relation, which we will denote by →_k, particularly in our study of periodicity.

Suppose that j, k ∈ N+. If x →_k y and j ∣ k then x →_j y.

Proof

By combining the leads to relation → with its inverse, the comes from relation ←, we can obtain another very useful relation.

For (x, y) ∈ S², we say that x to and from y and we write x ↔ y if x → y and y → x.

By definition, this relation is symmetric: if x ↔ y then y ↔ x. From our work above, it is also reflexive and transitive. Thus, the to and from relation is an equivalence relation. Like all equivalence relations, it partitions the space into mutually disjoint equivalence classes. We will denote the equivalence class of a state x ∈ S by

[x] = {y ∈ S : x ↔ y} (16.4.18)

Thus, for any two states x, y ∈ S, either [x] = [y] or [x] ∩ [y] = ∅, and moreover, ⋃_{x∈S} [x] = S.

Figure 16.4.2 : The equivalence relation partitions S into mutually disjoint equivalence classes

Two negative results:


1. A closed set is not necessarily an equivalence class.
2. An equivalence class is not necessarily closed.
Example

On the other hand, we have the following result:

If A ⊆ S is irreducible, then A is an equivalence class.

Proof

The to and from equivalence relation is very important because many interesting state properties turn out in fact to be class
properties, shared by all states in a given equivalence class. In particular, the recurrence and transience properties are class
properties.

Transient and Recurrent Classes


Our next result is of fundamental importance: a recurrent state can only lead to other recurrent states.

If x is a recurrent state and x → y then y is recurrent and H (x, y) = H (y, x) = 1 .


Proof

From the last theorem, note that if x is recurrent, then all states in [x] are also recurrent. Thus, for each equivalence class, either all
states are transient or all states are recurrent. We can therefore refer to transient or recurrent classes as well as states.

If A is a recurrent equivalence class then A is irreducible.


Proof

If A is finite and closed then A has a recurrent state.


Proof

If A is finite and irreducible then A is a recurrent equivalence class.


Proof

Thus, the Markov chain X will have a collection (possibly empty) of recurrent equivalence classes {Aj : j ∈ J} where J is a countable index set. Each Aj is irreducible. Let B denote the set of all transient states. The set B may be empty or may consist of a number of equivalence classes, but the class structure of B is usually not important to us. If the chain starts in Aj for some j ∈ J then the chain remains in Aj forever, visiting each state infinitely often with probability 1. If the chain starts in B, then the chain may stay in B forever (but only if B is infinite) or may enter one of the recurrent classes Aj, never to escape. However, in either case, the chain will visit a given transient state only finitely many times with probability 1. This basic structure is known as the canonical decomposition of the chain, and is shown in graphical form below. The edges from B are in gray to indicate that these transitions may not exist.

Figure 16.4.3 : The canonical decomposition of the state space

Staying Probabilities and a Classification Test


Suppose that A is a proper subset of S. Then
1. PA^n(x, A) = P(X1 ∈ A, X2 ∈ A, …, Xn ∈ A ∣ X0 = x) for x ∈ A
2. lim_{n→∞} PA^n(x, A) = P(X1 ∈ A, X2 ∈ A, … ∣ X0 = x) for x ∈ A
Proof

Let gA denote the function defined by part (b), so that

gA(x) = P(X1 ∈ A, X2 ∈ A, … ∣ X0 = x), x ∈ A (16.4.20)

The staying probability function gA is an interesting complement to the hitting matrix studied above. The following result characterizes this function and provides a method that can be used to compute it, at least in some cases.

For A ⊂ S, gA is the largest function on A that takes values in [0, 1] and satisfies gA = PA gA. Moreover, either gA = 0A or sup{gA(x) : x ∈ A} = 1.
Proof

Note that the characterization in the last result includes a zero-one law of sorts: either the probability that the chain stays in A
forever is 0 for every initial state x ∈ A , or we can find states in A for which the probability is arbitrarily close to 1. The next two
results explore the relationship between the staying function and recurrence.

Suppose that X is an irreducible, recurrent chain with state space S. Then gA = 0A for every proper subset A of S.
Proof

Suppose that X is an irreducible Markov chain with state space S and transition probability matrix P. If there exists a state x such that gA = 0A where A = S ∖ {x}, then X is recurrent.
Proof

More generally, suppose that X is a Markov chain with state space S and transition probability matrix P. The last two theorems can be used to test whether an irreducible equivalence class C is recurrent or transient. We fix a state x ∈ C and set A = C ∖ {x}. We then try to solve the equation gA = PA gA on A. If the only solution taking values in [0, 1] is 0, then the class C is recurrent by the previous result. If there are nontrivial solutions, then C is transient. Often we try to choose x to make the computations easy.

Computing Hitting Probabilities and Potentials


We now know quite a bit about Markov chains, and we can often classify the states and compute quantities of interest. However, we do not yet know how to compute:
G(x, y) when x and y are transient
H(x, y) when x is transient and y is transient or recurrent
These problems are related, because of the general inverse relationship between the matrix H and the matrix G noted in our discussion above. As usual, suppose that X is a Markov chain with state space S, and let B denote the set of transient states. The next result shows how to compute GB, the matrix G restricted to the transient states. Recall that the values of this matrix are finite.

GB satisfies the equation GB = PB + PB GB and is the smallest nonnegative solution. If B is finite then GB = (IB − PB)^{−1} PB.

Proof

Now that we can compute GB, we can also compute HB using the result above. All that remains is for us to compute the hitting probability H(x, y) when x is transient and y is recurrent. The first thing to notice is that the hitting probability is a class property.

Suppose that x is transient and that A is a recurrent class. Then H(x, y) = H(x, A) for y ∈ A.

That is, starting in the transient state x ∈ S, the hitting probability to y is constant for y ∈ A, and is just the hitting probability to the class A. As before, let B denote the set of transient states and suppose that A is a recurrent equivalence class. Let hA denote the function on B that gives the hitting probability to class A, and let pA denote the function on B that gives the probability of entering A on the first step:

hA(x) = H(x, A), pA(x) = P(x, A), x ∈ B (16.4.21)

hA = pA + GB pA.
Proof

This result is adequate if we have already computed GB (using the result above, for example). However, we might just want to compute hA directly.

hA satisfies the equation hA = pA + PB hA and is the smallest nonnegative solution. If B is finite, hA = (IB − PB)^{−1} pA.
Proof
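For a finite set B of transient states, both formulas are linear-algebra computations. The sketch below, in Python, computes GB and hA for an illustrative four-state chain (not from the text) in which B = {0, 1} is the set of transient states and states 2 and 3 are absorbing, so A = {2} is a recurrent class.

import numpy as np

P = np.array([[0.0, 0.5, 0.25, 0.25],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
B = [0, 1]
A = [2]

PB = P[np.ix_(B, B)]                  # P restricted to B x B
IB = np.eye(len(B))

# G_B = (I_B - P_B)^{-1} P_B
GB = np.linalg.inv(IB - PB) @ PB

# h_A = (I_B - P_B)^{-1} p_A, where p_A(x) = P(x, A) for x in B
pA = P[np.ix_(B, A)].sum(axis=1)
hA = np.linalg.solve(IB - PB, pA)
print(GB)
print(hA)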

Examples and Applications
Finite Chains
Consider a Markov chain with state space S = {a, b, c, d} and transition matrix P given below:

P = ⎡ 1/3  2/3  0    0   ⎤
    ⎢ 1    0    0    0   ⎥   (16.4.22)
    ⎢ 0    0    1    0   ⎥
    ⎣ 1/4  1/4  1/4  1/4 ⎦

1. Draw the state graph.


2. Find the equivalence classes and classify each as transient or recurrent.
3. Compute the matrix G.
4. Compute the matrix H .
Answer

Consider a Markov chain with state space S = {1, 2, 3, 4, 5, 6} and transition matrix P given below:

P = ⎡ 0    0    1/2  0    1/2  0   ⎤
    ⎢ 0    0    0    0    0    1   ⎥
    ⎢ 1/4  0    1/2  0    1/4  0   ⎥   (16.4.23)
    ⎢ 0    0    0    1    0    0   ⎥
    ⎢ 1/3  0    0    0    2/3  0   ⎥
    ⎣ 1/4  1/4  1/4  1/4  0    0   ⎦

1. Sketch the state graph.


2. Find the equivalence classes and classify each as recurrent or transient.
3. Compute the matrix G.
4. Compute the matrix H .
Answer

Consider a Markov chain with state space S = {1, 2, 3, 4, 5, 6} and transition matrix P given below:

P = ⎡ 1/2  1/2  0    0    0    0   ⎤
    ⎢ 1/4  3/4  0    0    0    0   ⎥
    ⎢ 0    1/4  1/2  1/4  0    0   ⎥   (16.4.24)
    ⎢ 1/4  1/4  1/4  1/4  0    0   ⎥
    ⎢ 0    0    0    0    1/2  1/2 ⎥
    ⎣ 0    0    0    0    1/2  1/2 ⎦
1. Sketch the state graph.


2. Find the equivalence classes and classify each as recurrent or transient.
3. Compute the matrix G.
4. Compute the matrix H .
Answer

Special Models

Read again the definitions of the Ehrenfest chains and the Bernoulli-Laplace chains. Note that since these chains are
irreducible and have finite state spaces, they are recurrent.

Read the discussion on recurrence in the section on the reliability chains.

Read the discussion on random walks on Z^k in the section on random walks on graphs.

Read the discussion on extinction and explosion in the section on the branching chain.

Read the discussion on recurrence and transience in the section on queuing chains.

Read the discussion on recurrence and transience in the section on birth-death chains.

This page titled 16.4: Transience and Recurrence for Discrete-Time Chains is shared under a CC BY 2.0 license and was authored, remixed,
and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a
detailed edit history is available upon request.

16.5: Periodicity of Discrete-Time Chains

A state in a discrete-time Markov chain is periodic if the chain can return to the state only at multiples of some integer larger than 1. Periodic behavior complicates the study of the limiting behavior of the chain. As we will see in this section, we can eliminate the periodic behavior by considering the d-step chain, where d ∈ N+ is the period, but only at the expense of introducing additional equivalence classes. Thus, in a sense, we can trade one form of complexity for another.

Basic Theory
Definitions and Basic Results
As usual, our starting point is a (time homogeneous) discrete-time Markov chain X = (X0 , X1 , X2 , …) with (countable) state
space S and transition probability matrix P .

The period of state x ∈ S is

d(x) = gcd{n ∈ N+ : P^n(x, x) > 0} (16.5.1)

State x is aperiodic if d(x) = 1 and periodic if d(x) > 1.

Thus, starting in x, the chain can return to x only at multiples of the period d , and d is the largest such integer. Perhaps the most
important result is that period, like recurrence and transience, is a class property, shared by all states in an equivalence class under
the to and from relation.

If x ↔ y then d(x) = d(y) .


Proof

Thus, the definitions of period, periodic, and aperiodic apply to equivalence classes as well as individual states. When the chain is
irreducible, we can apply these terms to the entire chain.

Suppose that x ∈ S . If P (x, x) > 0 then x (and hence the equivalence class of x) is aperiodic.
Proof

The converse is not true, of course. A simple counterexample is given below.
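For a finite chain, the period of a state can be computed directly from the definition, as the greatest common divisor of the return lengths n with P^n(x, x) > 0, up to some cutoff. A minimal sketch in Python; the cutoff n_max is a heuristic choice, and the matrix is the three-state chain from the first exercise below, which is aperiodic even though P(x, x) = 0 for all x.

import math
import numpy as np

def period(P, x, n_max=100):
    # gcd of the n <= n_max with P^n(x, x) > 0 (math.gcd(0, n) == n).
    d = 0
    Pn = np.eye(len(P))
    for n in range(1, n_max + 1):
        Pn = Pn @ P
        if Pn[x, x] > 0:
            d = math.gcd(d, n)
    return d

P = np.array([[0.0, 1/3, 2/3],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
print([period(P, x) for x in range(3)])   # [1, 1, 1]: returns of lengths 2 and 3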

The Cyclic Classes


Suppose now that X = (X0, X1, X2, …) is irreducible and is periodic with period d. There is no real loss in generality in assuming that the chain is irreducible, for if this were not the case, we could simply restrict our attention to one of the irreducible equivalence classes. Our exposition will be easier and cleaner if we recall the congruence equivalence relation modulo d on Z, which in turn is based on the division partial order. For n, m ∈ Z, n ≡_d m if and only if d ∣ (n − m), equivalently n − m is an integer multiple of d, equivalently m and n have the same remainder after division by d. The basic fact that we will need is that ≡_d is preserved under sums and differences. That is, if m, n, p, q ∈ Z and if m ≡_d n and p ≡_d q, then m + p ≡_d n + q and m − p ≡_d n − q.

Now, we fix a reference state u ∈ S, and for k ∈ {0, 1, …, d − 1}, define

Ak = {x ∈ S : P^{nd+k}(u, x) > 0 for some n ∈ N} (16.5.2)

That is, x ∈ Ak if and only if there exists m ∈ N with m ≡_d k and P^m(u, x) > 0.

Suppose that x, y ∈ S.
1. If x ∈ Aj and y ∈ Ak for j, k ∈ {0, 1, …, d − 1} then P^n(x, y) > 0 for some n ≡_d k − j.
2. Conversely, if P^n(x, y) > 0 for some n ∈ N, then there exist j, k ∈ {0, 1, …, d − 1} such that x ∈ Aj, y ∈ Ak and n ≡_d k − j.
3. The sets (A0, A1, …, Ad−1) partition S.

Proof

(A0, A1, …, Ad−1) are the equivalence classes for the d-step to and from relation ↔_d that governs the d-step chain (X0, Xd, X2d, …) that has transition matrix P^d.

The sets (A0, A1, …, Ad−1) are known as the cyclic classes. The basic structure of the chain is shown in the state diagram below:

Figure 16.5.1 : The cyclic classes of a chain with period d

Examples and Special Cases


Finite Chains
Consider the Markov chain with state space S = {a, b, c} and transition matrix P given below:

P = ⎡ 0  1/3  2/3 ⎤
    ⎢ 0  0    1   ⎥   (16.5.3)
    ⎣ 1  0    0   ⎦

1. Sketch the state graph and show that the chain is irreducible.
2. Show that the chain is aperiodic.
3. Note that P (x, x) = 0 for all x ∈ S .
Answer

Consider the Markov chain with state space S = {1, 2, 3, 4, 5, 6, 7} and transition matrix P given below:

P = ⎡ 0    0    1/2  1/4  1/4  0    0   ⎤
    ⎢ 0    0    1/3  2/3  0    0    0   ⎥
    ⎢ 0    0    0    0    0    1/3  2/3 ⎥
    ⎢ 0    0    0    0    0    1/2  1/2 ⎥   (16.5.4)
    ⎢ 0    0    0    0    0    3/4  1/4 ⎥
    ⎢ 1/2  1/2  0    0    0    0    0   ⎥
    ⎣ 1/4  3/4  0    0    0    0    0   ⎦

1. Sketch the state graph and show that the chain is irreducible.
2. Find the period d.
3. Find P^d.
4. Identify the cyclic classes.


Answer

Special Models
Review the definition of the basic Ehrenfest chain. Show that this chain has period 2, and find the cyclic classes.

Review the definition of the modified Ehrenfest chain. Show that this chain is aperiodic.

Review the definition of the simple random walk on Z^k. Show that the chain is periodic with period 2, and find the cyclic classes.

This page titled 16.5: Periodicity of Discrete-Time Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

16.6: Stationary and Limiting Distributions of Discrete-Time Chains

In this section, we study some of the deepest and most interesting parts of the theory of discrete-time Markov chains, involving two
different but complementary ideas: stationary distributions and limiting distributions. The theory of renewal processes plays a
critical role.

Basic Theory
As usual, our starting point is a (time homogeneous) discrete-time Markov chain X = (X , X , X , …) with (countable) state
0 1 2

space S and transition probability matrix P . In the background, of course, is a probability space (Ω, F , P) so that Ω is the sample
space, F the σ-algebra of events, and P the probability measure on (Ω, F ). For n ∈ N , let F = σ{X , X , … , X }, the σ- n 0 1 n

algebra of events determined by the chain up to time n , so that F = {F , F , …} is the natural filtration associated with X.
0 1

The Embedded Renewal Process


Let y ∈ S and n ∈ N+. We will denote the number of visits to y during the first n positive time units by

Ny,n = ∑_{i=1}^n 1(Xi = y) (16.6.1)

Note that Ny,n → Ny as n → ∞, where

Ny = ∑_{i=1}^∞ 1(Xi = y) (16.6.2)

is the total number of visits to y at positive times, one of the important random variables that we studied in the section on transience and recurrence. For n ∈ N+, we denote the time of the nth visit to y by

τy,n = min{k ∈ N+ : Ny,k = n} (16.6.3)

where as usual, we define min(∅) = ∞. Note that τy,1 is the time of the first visit to y, which we denoted simply by τy in the section on transience and recurrence. The times of the visits to y are stopping times for X. That is, {τy,n = k} ∈ Fk for n ∈ N+ and k ∈ N. Recall also the definition of the hitting probability to state y starting in state x:

H(x, y) = P(τy < ∞ ∣ X0 = x), (x, y) ∈ S² (16.6.4)

Suppose that x, y ∈ S, and that y is recurrent and X0 = x.
1. If x = y, then the successive visits to y form a renewal process.
2. If x ≠ y but x → y, then the successive visits to y form a delayed renewal process.
Proof

As noted in the proof, (τy,1, τy,2, …) is the sequence of arrival times and (Ny,1, Ny,2, …) is the associated sequence of counting variables for the embedded renewal process associated with the recurrent state y. The corresponding renewal function, given X0 = x, is the function n ↦ Gn(x, y) where

Gn(x, y) = E(Ny,n ∣ X0 = x) = ∑_{k=1}^n P^k(x, y), n ∈ N (16.6.5)

Thus Gn(x, y) is the expected number of visits to y in the first n positive time units, starting in state x. Note that Gn(x, y) → G(x, y) as n → ∞ where G is the potential matrix that we studied previously. This matrix gives the expected total number of visits to state y ∈ S, at positive times, starting in state x ∈ S:

G(x, y) = E(Ny ∣ X0 = x) = ∑_{k=1}^∞ P^k(x, y) (16.6.6)

Limiting Behavior
The limit theorems of renewal theory can now be used to explore the limiting behavior of the Markov chain. Let μ(y) = E(τy ∣ X0 = y) denote the mean return time to state y, starting in y. In the following results, it may be the case that μ(y) = ∞, in which case we interpret 1/μ(y) as 0.

If x, y ∈ S and y is recurrent then

P(Ny,n/n → 1/μ(y) as n → ∞ ∣ X0 = x) = H(x, y) (16.6.7)

Proof

Note that Ny,n/n = (1/n) ∑_{k=1}^n 1(Xk = y) is the average number of visits to y in the first n positive time units.

If x, y ∈ S and y is recurrent then

Gn(x, y)/n = (1/n) ∑_{k=1}^n P^k(x, y) → H(x, y)/μ(y) as n → ∞ (16.6.8)

Proof

Note that Gn(x, y)/n = (1/n) ∑_{k=1}^n P^k(x, y) is the expected average number of visits to y during the first n positive time units, starting at x.

If x, y ∈ S and y is recurrent and aperiodic then

P^n(x, y) → H(x, y)/μ(y) as n → ∞ (16.6.9)

Proof

Note that H (y, y) = 1 by the very definition of a recurrent state. Thus, when x = y , the law of large numbers above gives
convergence with probability 1, and the first and second renewal theory limits above are simply 1/μ(y). By contrast, we already
know the corresponding limiting behavior when y is transient.

If x, y ∈ S and y is transient then
1. P(Ny,n/n → 0 as n → ∞ ∣ X0 = x) = 1
2. Gn(x, y)/n = (1/n) ∑_{k=1}^n P^k(x, y) → 0 as n → ∞
3. P^n(x, y) → 0 as n → ∞

Proof

On the other hand, if y is transient then P(τy = ∞ ∣ X0 = y) > 0 by the very definition of transience. Thus μ(y) = ∞, and so the results in parts (b) and (c) agree with the corresponding results above for a recurrent state. Here is a summary.

For x, y ∈ S,

Gn(x, y)/n = (1/n) ∑_{k=1}^n P^k(x, y) → H(x, y)/μ(y) as n → ∞ (16.6.11)

If y is transient or if y is recurrent and aperiodic,

P^n(x, y) → H(x, y)/μ(y) as n → ∞ (16.6.12)

Positive and Null Recurrence


Clearly there is a fundamental dichotomy in terms of the limiting behavior of the chain, depending on whether the mean return time
to a given state is finite or infinite. Thus the following definition is natural.

Let x ∈ S .

1. State x is positive recurrent if μ(x) < ∞ .
2. If x is recurrent but μ(x) = ∞ then state x is null recurrent.

Implicit in the definition is the following simple result:

If x ∈ S is positive recurrent, then x is recurrent.


Proof

On the other hand, it is possible to have P(τx < ∞ ∣ X0 = x) = 1, so that x is recurrent, and also E(τx ∣ X0 = x) = ∞, so that x is null recurrent. Simply put, a random variable can be finite with probability 1, but can have infinite expected value. A classic example is the Pareto distribution with shape parameter a ∈ (0, 1).
Like recurrence/transience, and period, the null/positive recurrence property is a class property.

If x is positive recurrent and x → y then y is positive recurrent.


Proof

Thus, the terms positive recurrent and null recurrent can be applied to equivalence classes (under the to and from equivalence
relation), as well as individual states. When the chain is irreducible, the terms can be applied to the chain as a whole.
Recall that a nonempty set of states A is closed if x ∈ A and x → y implies y ∈ A . Here are some simple results for a finite,
closed set of states.

If A ⊆ S is finite and closed, then A contains a positive recurrent state.


Proof

If A ⊆ S is finite and closed, then A contains no null recurrent states.


Proof

If A ⊆ S is finite and irreducible, then A is a positive recurrent equivalence class.


Proof

In particular, a Markov chain with a finite state space cannot have null recurrent states; every state must be transient or positive
recurrent.

Limiting Behavior, Revisited


Returning to the limiting behavior, suppose that the chain X is irreducible, so that either all states are transient, all states are null recurrent, or all states are positive recurrent. From the basic limit theorem above, if the chain is transient or if the chain is recurrent and aperiodic, then

P^n(x, y) → 1/μ(y) as n → ∞ for every x ∈ S (16.6.16)

Note in particular that the limit is independent of the initial state x. Of course in the transient case and in the null recurrent and aperiodic case, the limit is 0. Only in the positive recurrent, aperiodic case is the limit positive, which motivates our next definition.

A Markov chain X that is irreducible, positive recurrent, and aperiodic, is said to be ergodic.

In the ergodic case, as we will see, Xn has a limiting distribution as n → ∞ that is independent of the initial distribution.
The behavior when the chain is periodic with period d ∈ {2, 3, …} is a bit more complicated, but we can understand this behavior by considering the d-step chain Xd = (X0, Xd, X2d, …) that has transition matrix P^d. Essentially, this allows us to trade periodicity (one form of complexity) for reducibility (another form of complexity). Specifically, recall that the d-step chain is aperiodic but has d equivalence classes (A0, A1, …, Ad−1); and these are the cyclic classes of the original chain X.

Figure 16.6.1 : The cyclic classes of a chain with period d

The mean return time to state x for the d-step chain Xd is μd(x) = μ(x)/d.
Proof

Let i, j, k ∈ {0, 1, …, d − 1}.
1. P^{nd+k}(x, y) → d/μ(y) as n → ∞ if x ∈ Ai and y ∈ Aj and j = (i + k) mod d.
2. P^{nd+k}(x, y) → 0 as n → ∞ in all other cases.
Proof

If y ∈ S is null recurrent or transient then regardless of the period of y, P^n(x, y) → 0 as n → ∞ for every x ∈ S.

Invariant Distributions
Our next goal is to see how the limiting behavior is related to invariant distributions. Suppose that f is a probability density function on the state space S. Recall that f is invariant for P (and for the chain X) if f P = f. It follows immediately that f P^n = f for every n ∈ N. Thus, if X0 has probability density function f then so does Xn for each n ∈ N, and hence X is a sequence of identically distributed random variables. A bit more generally, suppose that g : S → [0, ∞) is invariant for P, and let C = ∑_{x∈S} g(x). If 0 < C < ∞ then f defined by f(x) = g(x)/C for x ∈ S is an invariant probability density function.

Suppose that g : S → [0, ∞) is invariant for P and satisfies ∑_{x∈S} g(x) < ∞. Then

g(y) = (1/μ(y)) ∑_{x∈S} g(x) H(x, y), y ∈ S (16.6.17)

Proof

Note that if y is transient or null recurrent, then g(y) = 0. Thus, an invariant function with finite sum, and in particular an invariant probability density function, must be concentrated on the positive recurrent states.
Suppose now that the chain X is irreducible. If X is transient or null recurrent, then from the previous result, the only nonnegative functions that are invariant for P are functions that satisfy ∑_{x∈S} g(x) = ∞ and the function that is identically 0: g = 0. In particular, the chain does not have an invariant distribution. On the other hand, if the chain is positive recurrent, then H(x, y) = 1 for all x, y ∈ S. Thus, from the previous result, the only possible invariant probability density function is the function f given by f(x) = 1/μ(x) for x ∈ S. Any other nonnegative function g that is invariant for P and has finite sum is a multiple of f (and indeed the multiple is the sum of the values). Our next goal is to show that f really is an invariant probability density function.

If X is an irreducible, positive recurrent chain then the function f given by f (x) = 1/μ(x) for x ∈ S is an invariant
probability density function for X.
Proof

In summary, an irreducible, positive recurrent Markov chain X has a unique invariant probability density function f given by f(x) = 1/μ(x) for x ∈ S. We also now have a test for positive recurrence. An irreducible Markov chain X is positive recurrent if and only if there exists a positive function g on S that is invariant for P and satisfies ∑_{x∈S} g(x) < ∞ (and then, of course, normalizing g would give f).
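For a finite, irreducible chain, the test reduces to solving the linear system f P = f with ∑x f(x) = 1, and the mean return times then follow from μ(x) = 1/f(x). A minimal sketch in Python; the three-state matrix is illustrative.

import numpy as np

def invariant_pdf(P):
    # Replace one equation of the singular system (P^T - I) f = 0 with
    # the normalization sum(f) = 1; assumes P is irreducible and finite.
    m = len(P)
    A = P.T - np.eye(m)
    A[-1, :] = 1.0
    b = np.zeros(m)
    b[-1] = 1.0
    return np.linalg.solve(A, b)

P = np.array([[0.0, 0.5, 0.5],
              [0.25, 0.0, 0.75],
              [1.0, 0.0, 0.0]])
f = invariant_pdf(P)
print(f, 1 / f)   # invariant density and the mean return times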

Consider now a general Markov chain X on S. If X has no positive recurrent states, then as noted earlier, there are no invariant distributions. Thus, suppose that X has a collection of positive recurrent equivalence classes (Ai : i ∈ I) where I is a nonempty, countable index set. The chain restricted to Ai is irreducible and positive recurrent for each i ∈ I, and hence has a unique invariant probability density function fi on Ai given by

fi(x) = 1/μ(x), x ∈ Ai (16.6.21)

We extend fi to S by defining fi(x) = 0 for x ∉ Ai, so that fi is a probability density function on S. All invariant probability density functions for X are mixtures of these functions:
density functions for X are mixtures of these functions:

f is an invariant probability density function for X if and only if f has the form

f(x) = ∑_{i∈I} pi fi(x), x ∈ S (16.6.22)

where (pi : i ∈ I) is a probability density function on the index set I. That is, f(x) = pi fi(x) for i ∈ I and x ∈ Ai, and f(x) = 0 otherwise.
Proof

Invariant Measures
Suppose that X is irreducible. In this section we are interested in general functions g : S → [0, ∞) that are invariant for X, so that gP = g. A function g : S → [0, ∞) defines a positive measure ν on S by the simple rule

ν(A) = ∑_{x∈A} g(x),  A ⊆ S   (16.6.27)

so in this sense, we are interested in invariant positive measures for X that may not be probability measures. Technically, g is the density function of ν with respect to counting measure # on S.
From our work above, we know the situation if X is positive recurrent. In this case, there exists a unique invariant probability density function f that is positive on S, and every other nonnegative invariant function g is a nonnegative multiple of f. In particular, either g = 0, the zero function on S, or g is positive on S and satisfies ∑_{x∈S} g(x) < ∞.

We can generalize to chains that are simply recurrent, either null or positive. We will show that there exists a positive invariant function that is unique, up to multiplication by positive constants. To set up the notation, recall that τx = min{k ∈ N+ : Xk = x} is the first positive time that the chain is in state x ∈ S. In particular, if the chain starts in x then τx is the time of the first return to x. For x ∈ S we define the function γx by

γx(y) = E(∑_{n=0}^{τx−1} 1(Xn = y) ∣ X0 = x),  y ∈ S   (16.6.28)

so that γx(y) is the expected number of visits to y before the first return to x, starting in x. Here is the existence result.

Suppose that X is recurrent. For x ∈ S,

1. γx(x) = 1
2. γx is invariant for X
3. γx(y) ∈ (0, ∞) for y ∈ S.

Proof

Next is the uniqueness result.

Suppose again that X is recurrent and that g : S → [0, ∞) is invariant for X. For fixed x ∈ S ,

g(y) = g(x)γx (y), y ∈ S (16.6.33)

Proof

Thus, suppose that X is null recurrent. Then there exists an invariant function g that is positive on S and satisfies ∑_{x∈S} g(x) = ∞. Every other nonnegative invariant function is a nonnegative multiple of g. In particular, either g = 0, the zero function on S, or g is positive on S and satisfies ∑_{x∈S} g(x) = ∞. The section on reliability chains gives an example of the invariant function for a null recurrent chain.
The situation is complicated when X is transient. In this case, there may or may not exist nonnegative invariant functions that are not identically 0. When they do exist, they may not be unique (up to multiplication by nonnegative constants). But we still know that there are no invariant probability density functions, so if g is a nonnegative function that is invariant for X then either g = 0 or ∑_{x∈S} g(x) = ∞. The section on random walks on graphs provides lots of examples of transient chains with nontrivial invariant functions. In particular, the non-symmetric random walk on Z has a two-dimensional space of invariant functions.

Examples and Applications


Finite Chains
Consider again the general two-state chain on S = {0, 1} with transition probability matrix given below, where p ∈ (0, 1) and
q ∈ (0, 1) are parameters.

P = ⎡ 1−p   p  ⎤
    ⎣  q   1−q ⎦   (16.6.39)

1. Find the invariant distribution.
2. Find the mean return time to each state.
3. Find lim_{n→∞} P^n without having to go to the trouble of diagonalizing P, as we did in the introduction to discrete-time chains.
Answer
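As a numerical aside, the answers can be confirmed directly (a sketch in Python with NumPy; the parameter values are arbitrary). The invariant pdf is f = (q/(p+q), p/(p+q)), as derived in the section on time reversal below, and both rows of P^n converge to f:

import numpy as np

p, q = 0.3, 0.8                          # arbitrary values in (0, 1)
P = np.array([[1 - p, p],
              [q, 1 - q]])
f = np.array([q, p]) / (p + q)           # invariant pdf
print(np.allclose(f @ P, f))             # True: f P = f
print(1 / f)                             # mean return times (p+q)/q and (p+q)/p
print(np.linalg.matrix_power(P, 100))    # both rows are approximately f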

Consider a Markov chain with state space S = {a, b, c, d} and transition matrix P given below:
P = ⎡ 1/3  2/3   0    0  ⎤
    ⎢  1    0    0    0  ⎥
    ⎢  0    0    1    0  ⎥
    ⎣ 1/4  1/4  1/4  1/4 ⎦   (16.6.40)

1. Draw the state diagram.
2. Determine the equivalence classes and classify each as transient or positive recurrent.
3. Find all invariant probability density functions.
4. Find the mean return time to each state.
5. Find lim_{n→∞} P^n.

Answer

Consider a Markov chain with state space S = {1, 2, 3, 4, 5, 6} and transition matrix P given below:
P = ⎡  0    0   1/2   0   1/2   0  ⎤
    ⎢  0    0    0    0    0    1  ⎥
    ⎢ 1/4   0   1/2  1/4   0    0  ⎥
    ⎢  0    0    0    1    0    0  ⎥
    ⎢  0   1/3  2/3   0    0    0  ⎥
    ⎣ 1/4  1/4   0    0   1/4  1/4 ⎦   (16.6.41)

1. Sketch the state graph.
2. Find the equivalence classes and classify each as transient or positive recurrent.
3. Find all invariant probability density functions.
4. Find the mean return time to each state.
5. Find lim_{n→∞} P^n.

Answer

Consider a Markov chain with state space S = {1, 2, 3, 4, 5, 6} and transition matrix P given below:
P = ⎡ 1/2  1/2   0    0    0    0  ⎤
    ⎢ 1/4  3/4   0    0    0    0  ⎥
    ⎢ 1/4   0   1/2  1/4   0    0  ⎥
    ⎢ 1/4   0   1/4  1/4   0   1/4 ⎥
    ⎢  0    0    0    0   1/2  1/2 ⎥
    ⎣  0    0    0    0   1/2  1/2 ⎦   (16.6.42)

1. Sketch the state graph.
2. Find the equivalence classes and classify each as transient or positive recurrent.
3. Find all invariant probability density functions.
4. Find the mean return time to each state.
5. Find lim_{n→∞} P^n.

Answer

Consider the Markov chain with state space S = {1, 2, 3, 4, 5, 6, 7} and transition matrix P given below:
P = ⎡  0    0   1/2  1/4  1/4   0    0  ⎤
    ⎢  0    0   1/3  2/3   0    0    0  ⎥
    ⎢  0    0    0    0    0   1/3  2/3 ⎥
    ⎢  0    0    0    0    0   1/2  1/2 ⎥
    ⎢  0    0    0    0    0   3/4  1/4 ⎥
    ⎢ 1/2  1/2   0    0    0    0    0  ⎥
    ⎣ 1/4  3/4   0    0    0    0    0  ⎦   (16.6.43)

1. Sketch the state digraph, and show that the chain is irreducible with period 3.
2. Identify the cyclic classes.
3. Find the invariant probability density function.
4. Find the mean return time to each state.
5. Find lim_{n→∞} P^{3n}.
6. Find lim_{n→∞} P^{3n+1}.
7. Find lim_{n→∞} P^{3n+2}.
Answer
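As a numerical aside, the three cyclic limits can be inspected by raising P to large powers (a sketch in Python with NumPy, using the matrix as reconstructed above; powers in the same residue class mod 3 agree to many decimal places):

import numpy as np

P = np.array([[0, 0, 1/2, 1/4, 1/4, 0, 0],
              [0, 0, 1/3, 2/3, 0, 0, 0],
              [0, 0, 0, 0, 0, 1/3, 2/3],
              [0, 0, 0, 0, 0, 1/2, 1/2],
              [0, 0, 0, 0, 0, 3/4, 1/4],
              [1/2, 1/2, 0, 0, 0, 0, 0],
              [1/4, 3/4, 0, 0, 0, 0, 0]])
for r in range(3):                       # limits of P^(3n), P^(3n+1), P^(3n+2)
    print(np.linalg.matrix_power(P, 300 + r).round(4))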

Special Models
Read the discussion of invariant distributions and limiting distributions in the Ehrenfest chains.

Read the discussion of invariant distributions and limiting distributions in the Bernoulli-Laplace chain.

Read the discussion of positive recurrence and invariant distributions for the reliability chains.

Read the discussion of positive recurrence and limiting distributions for the birth-death chain.

Read the discussion of positive recurrence for the queuing chains.

Read the discussion of positive recurrence and limiting distributions for the random walks on graphs.

This page titled 16.6: Stationary and Limiting Distributions of Discrete-Time Chains is shared under a CC BY 2.0 license and was authored,
remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts
platform; a detailed edit history is available upon request.

16.7: Time Reversal in Discrete-Time Chains

The Markov property, stated in the form that the past and future are independent given the present, essentially treats the past and
future symmetrically. However, there is a lack of symmetry in the fact that in the usual formulation, we have an initial time 0, but
not a terminal time. If we introduce a terminal time, then we can run the process backwards in time. In this section, we are
interested in the following questions:
Is the new process still Markov?
If so, how does the new transition probability matrix relate to the original one?
Under what conditions are the forward and backward processes stochastically the same?
Consideration of these questions leads to reversed chains, an important and interesting part of the theory of Markov chains.

Basic Theory
Reversed Chains
Our starting point is a (homogeneous) discrete-time Markov chain X = (X0, X1, X2, …) with (countable) state space S and transition probability matrix P. Let m be a positive integer, which we will think of as the terminal time or finite time horizon. We won't bother to indicate the dependence on m notationally, since ultimately the terminal time will not matter. Define X̂n = X_{m−n} for n ∈ {0, 1, …, m}. Thus, the process forward in time is X = (X0, X1, …, Xm) while the process backwards in time is

X̂ = (X̂0, X̂1, …, X̂m) = (Xm, X_{m−1}, …, X0)   (16.7.1)

For n ∈ {0, 1, …, m}, let

F̂n = σ{X̂0, X̂1, …, X̂n} = σ{X_{m−n}, X_{m−n+1}, …, Xm}   (16.7.2)

denote the σ-algebra of the events of the process X̂ up to time n. So of course, an event for X̂ up to time n is an event for X from time m − n forward. Our first result is that the reversed process is still a Markov chain, but not time homogeneous in general.

The process X̂ = (X̂0, X̂1, …, X̂m) is a Markov chain, but is not time homogeneous in general. The one-step transition matrix at time n ∈ {0, 1, …, m − 1} is given by

P(X̂_{n+1} = y ∣ X̂n = x) = [P(X_{m−n−1} = y) / P(X_{m−n} = x)] P(y, x),  (x, y) ∈ S²   (16.7.3)

Proof

However, the backwards chain will be time homogeneous if X0 has an invariant distribution.

Suppose that X is irreducible and positive recurrent, with (unique) invariant probability density function f. If X0 has the invariant probability distribution, then X̂ is a time-homogeneous Markov chain with transition matrix P̂ given by

P̂(x, y) = [f(y)/f(x)] P(y, x),  (x, y) ∈ S²   (16.7.6)

Proof

Recall that a discrete-time Markov chain is ergodic if it is irreducible, positive recurrent, and aperiodic. For an ergodic chain, the
previous result holds in the limit of the terminal time.

Suppose that X is ergodic, with (unique) invariant probability density function f. Regardless of the distribution of X0,

P(X̂_{n+1} = y ∣ X̂n = x) → [f(y)/f(x)] P(y, x) as m → ∞   (16.7.7)

Proof

These three results are motivation for the definition that follows. We can generalize by defining the reversal of an irreducible Markov chain, as long as there is a positive, invariant function. Recall that a positive invariant function defines a positive measure on S, but of course not in general a probability distribution.

Suppose that X is an irreducible Markov chain with transition matrix P, and that g : S → (0, ∞) is invariant for X. The reversal of X with respect to g is the Markov chain X̂ = (X̂0, X̂1, …) with transition probability matrix P̂ defined by

P̂(x, y) = [g(y)/g(x)] P(y, x),  (x, y) ∈ S²   (16.7.8)

Proof
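As a computational aside, the defining formula is a one-line matrix computation on a finite state space. A minimal sketch (Python with NumPy; the helper name is ours, and the example is an arbitrary three-state birth-death chain, which happens to be reversible):

import numpy as np

def reversal(P, g):
    # Transition matrix of the reversed chain:
    # P_hat(x, y) = g(y) P(y, x) / g(x)
    g = np.asarray(g, dtype=float)
    return (P.T * g[np.newaxis, :]) / g[:, np.newaxis]

P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
g = np.array([0.25, 0.5, 0.25])            # invariant: g P = g
P_hat = reversal(P, g)
print(np.allclose(P_hat.sum(axis=1), 1))   # rows sum to 1, as they must
print(np.allclose(P_hat, P))               # this chain is in fact reversible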

Recall that if g is a positive invariant function for X then so is cg for every positive constant c . Note that g and cg generate the
same reversed chain. So let's consider the cases:

Suppose that X is an irreducible Markov chain on S .


1. If X is recurrent, then X always has a positive invariant function that is unique up to multiplication by positive constants.
Hence the reversal of a recurrent chain X always exists and is unique, and so we can refer to the reversal of X without
reference to the invariant function.
2. Even better, if X is positive recurrent, then there exists a unique invariant probability density function, and the reversal of
X can be interpreted as the time reversal (with respect to a terminal time) when X has the invariant distribution, as in the

motivating exercises above.


3. If X is transient, then there may or may not exist a positive invariant function, and if one does exist, it may not be unique
(up to multiplication by positive constants). So a transient chain may have no reversals or more than one.

Nonetheless, the general definition is natural, because most of the important properties of the reversed chain follow from the balance equation between the transition matrices P and P̂, and the invariant function g:

g(x)P̂(x, y) = g(y)P(y, x),  (x, y) ∈ S²   (16.7.10)

We will see this balance equation repeated with other objects related to the Markov chains.

Suppose that X is an irreducible Markov chain with invariant function g : S → (0, ∞), and that X̂ is the reversal of X with respect to g. For x, y ∈ S,

1. P̂(x, x) = P(x, x)
2. P̂(x, y) > 0 if and only if P(y, x) > 0
Proof

From part (b) it follows that the state graphs of X and X̂ are reverses of each other. That is, to go from the state graph of one chain to the state graph of the other, simply reverse the direction of each edge. Here is a more complicated (but equivalent) version of the balance equation for chains of states:

Suppose again that X is an irreducible Markov chain with invariant function g : S → (0, ∞), and that X̂ is the reversal of X with respect to g. For every n ∈ N+ and every sequence of states (x1, x2, …, xn, x_{n+1}) ∈ S^{n+1},

g(x1)P(x1, x2)P(x2, x3) ⋯ P(xn, x_{n+1}) = g(x_{n+1})P̂(x_{n+1}, xn) ⋯ P̂(x3, x2)P̂(x2, x1)   (16.7.11)

Proof

The balance equation holds for the powers of the transition matrix:

Suppose again that X is an irreducible Markov chain with invariant function g : S → (0, ∞), and that X̂ is the reversal of X with respect to g. For every (x, y) ∈ S² and n ∈ N,

g(x)P̂^n(x, y) = g(y)P^n(y, x)   (16.7.14)

Proof

We can now generalize the simple result above.

Suppose again that X is an irreducible Markov chain with invariant function g : S → (0, ∞), and that X̂ is the reversal of X with respect to g. For n ∈ N and (x, y) ∈ S²,

1. P̂^n(x, x) = P^n(x, x)
2. P̂^n(x, y) > 0 if and only if P^n(y, x) > 0

In terms of the state graphs, part (b) has an obvious meaning: If there exists a path of length n from y to x in the original state
graph, then there exists a path of length n from x to y in the reversed state graph. The time reversal definition is symmetric with
respect to the two Markov chains.

Suppose again that X is an irreducible Markov chain with invariant function g : S → (0, ∞), and that X̂ is the reversal of X with respect to g. Then

1. g is also invariant for X̂.
2. X̂ is also irreducible.
3. X is the reversal of X̂ with respect to g.
Proof

The balance equation also holds for the potential matrices.

Suppose that X and X̂ are time reversals with respect to the invariant function g : S → (0, ∞). For α ∈ (0, 1], the α-potential matrices are related by

g(x)R̂α(x, y) = g(y)Rα(y, x),  (x, y) ∈ S²   (16.7.16)

Proof

Markov chains that are time reversals share many important properties:

Suppose that X and X̂ are time reversals. Then

1. X and X̂ are of the same type (transient, null recurrent, or positive recurrent).
2. X and X̂ have the same period.
3. X and X̂ have the same mean return time μ(x) for every x ∈ S.
Proof

The main point of the next result is that we don't need to know a-priori that g is invariant for X, if we can guess g and P̂.

Suppose again that X is irreducible with transition probability matrix P. If there exists a function g : S → (0, ∞) and a transition probability matrix P̂ such that g(x)P̂(x, y) = g(y)P(y, x) for all (x, y) ∈ S², then

1. g is invariant for X.
2. P̂ is the transition matrix of the reversal of X with respect to g.
Proof

As a corollary, if there exists a probability density function f on S and a transition probability matrix P̂ such that f(x)P̂(x, y) = f(y)P(y, x) for all (x, y) ∈ S², then in addition to the conclusions above, we know that the chains X and X̂ are positive recurrent.

Reversible Chains
Clearly, an interesting special case occurs when the transition matrix of the reversed chain turns out to be the same as the original
transition matrix. A chain of this type could be used to model a physical process that is stochastically the same, forward or
backward in time.

Suppose again that X = (X0, X1, X2, …) is an irreducible Markov chain with transition matrix P and invariant function g : S → (0, ∞). If the reversal of X with respect to g also has transition matrix P, then X is said to be reversible with respect to g. That is, X is reversible with respect to g if and only if

g(x)P(x, y) = g(y)P(y, x),  (x, y) ∈ S²   (16.7.19)

Clearly if X is reversible with respect to the invariant function g : S → (0, ∞) then X is reversible with respect to the invariant
function cg for every c ∈ (0, ∞). So again, let's review the cases.

Suppose that X is an irreducible Markov chain on S .


1. If X is recurrent, there exists a positive invariant function that is unique up to multiplication by positive constants. So X is
either reversible or not, and we don't have to reference the invariant function g .
2. If X is positive recurrent then there exists a unique invariant probability density function f : S → (0, 1) , and again, either
X is reversible or not. If X is reversible, then P is the transition matrix of X forward or backward in time, when the chain

has the invariant distribution.


3. If X is transient, there may or may not exist positive invariant functions. If there are two or more positive invariant functions that are not multiples of one another, X might be reversible with respect to one function but not the others.

The non-symmetric simple random walk on Z falls into the last case. Using the last result in the previous subsection, we can tell
whether X is reversible with respect to g without knowing a-priori that g is invariant.

Suppose again that X is irreducible with transition matrix P. If there exists a function g : S → (0, ∞) such that g(x)P(x, y) = g(y)P(y, x) for all (x, y) ∈ S², then

1. g is invariant for X.
2. X is reversible with respect to g.

If we have reason to believe that a Markov chain is reversible (based on modeling considerations, for example), then the condition
in the previous theorem can be used to find the invariant functions. This procedure is often easier than using the definition of
invariance directly. The next two results are minor generalizations:

Suppose again that X is irreducible and that g : S → (0, ∞). Then g is invariant and X is reversible with respect to g if and only if for every n ∈ N+ and every sequence of states (x1, x2, …, xn, x_{n+1}) ∈ S^{n+1},

g(x1)P(x1, x2)P(x2, x3) ⋯ P(xn, x_{n+1}) = g(x_{n+1})P(x_{n+1}, xn) ⋯ P(x3, x2)P(x2, x1)   (16.7.20)

Suppose again that X is irreducible and that g : S → (0, ∞). Then g is invariant and X is reversible with respect to g if and only if for every (x, y) ∈ S² and n ∈ N+,

g(x)P^n(x, y) = g(y)P^n(y, x)   (16.7.21)

Here is the condition for reversibility in terms of the potential matrices.

Suppose again that X is irreducible and that g : S → (0, ∞). Then g is invariant and X is reversible with respect to g if and only if

g(x)Rα(x, y) = g(y)Rα(y, x),  α ∈ (0, 1], (x, y) ∈ S²   (16.7.22)

In the positive recurrent case (the most important case), the following theorem gives a condition for reversibility that does not directly reference the invariant distribution. The condition is known as the Kolmogorov cycle condition, and is named for Andrei Kolmogorov.

Suppose that X is irreducible and positive recurrent. Then X is reversible if and only if for every sequence of states (x1, x2, …, xn),

P(x1, x2)P(x2, x3) ⋯ P(x_{n−1}, xn)P(xn, x1) = P(x1, xn)P(xn, x_{n−1}) ⋯ P(x3, x2)P(x2, x1)   (16.7.23)

Proof

Note that the Kolmogorov cycle condition states that the probability of visiting states (x2, x3, …, xn, x1) in sequence, starting in state x1, is the same as the probability of visiting states (xn, x_{n−1}, …, x2, x1) in sequence, starting in state x1. The cycle condition is also known as the balance equation for cycles.
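As a computational aside, the cycle condition can be checked by brute force on a small finite chain. A sketch (Python with NumPy; the helper name is ours, and checking cycles only up to a fixed length is a partial test, since the condition must hold for cycles of every length):

import numpy as np
from itertools import permutations

def cycle_condition(P, max_len=5):
    # Test Kolmogorov's cycle condition on all cycles of length
    # 3, ..., max_len (cycles of length 2 balance trivially).
    n = P.shape[0]
    for k in range(3, max_len + 1):
        for cyc in permutations(range(n), k):
            forward = np.prod([P[cyc[i], cyc[(i + 1) % k]] for i in range(k)])
            backward = np.prod([P[cyc[(i + 1) % k], cyc[i]] for i in range(k)])
            if not np.isclose(forward, backward):
                return False
    return True

P = np.array([[0.0, 0.5, 0.5],   # symmetric, hence reversible
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
print(cycle_condition(P))        # True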

Figure 16.7.1 : The Kolmogorov cycle condition

Examples and Applications


Finite Chains
Recall the general two-state chain X on S = {0, 1} with the transition probability matrix

P = ⎡ 1−p   p  ⎤
    ⎣  q   1−q ⎦   (16.7.25)

where p, q ∈ (0, 1) are parameters. The chain X is reversible and the invariant probability density function is f = (q/(p+q), p/(p+q)).

Proof

Suppose that X is a Markov chain on a finite state space S with symmetric transition probability matrix P. Thus P(x, y) = P(y, x) for all (x, y) ∈ S². The chain X is reversible and the uniform distribution on S is invariant.

Proof

Consider the Markov chain X on S = {a, b, c} with transition probability matrix P given below:
P = ⎡ 1/4  1/4  1/2 ⎤
    ⎢ 1/3  1/3  1/3 ⎥
    ⎣ 1/2  1/2   0  ⎦   (16.7.27)

1. Draw the state graph of X and note that the chain is irreducible.
2. Find the invariant probability density function f.
3. Find the mean return time to each state.
4. Find the transition probability matrix P̂ of the time-reversed chain X̂.
5. Draw the state graph of X̂.

Answer
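As a numerical aside, parts (b)–(d) can be verified with a short sketch (Python with NumPy). Solving fP = f by hand gives f = (6/17, 6/17, 5/17), and P̂ then comes from the balance formula:

import numpy as np

P = np.array([[1/4, 1/4, 1/2],
              [1/3, 1/3, 1/3],
              [1/2, 1/2, 0]])
f = np.array([6, 6, 5]) / 17            # invariant pdf
print(np.allclose(f @ P, f))            # True
print(1 / f)                            # mean return times 17/6, 17/6, 17/5
P_hat = (P.T * f) / f[:, np.newaxis]    # P_hat(x, y) = f(y) P(y, x) / f(x)
print(P_hat.round(4))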

Special Models
Read the discussion of reversibility for the Ehrenfest chains.

Read the discussion of reversibility for the Bernoulli-Laplace chain.

Read the discussion of reversibility for the random walks on graphs.

Read the discussion of time reversal for the reliability chains.

Read the discussion of reversibility for the birth-death chains.

This page titled 16.7: Time Reversal in Discrete-Time Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

16.8: The Ehrenfest Chains

Basic Theory
The Ehrenfest chains, named for Paul Ehrenfest, are simple, discrete models for the exchange of gas molecules between two containers. However, they can be formulated as simple ball and urn models; the balls correspond to the molecules and the urns to the two containers. Thus, suppose that we have two urns, labeled 0 and 1, that contain a total of m balls. The state of the system at time n ∈ N is the number of balls in urn 1, which we will denote by Xn. Our stochastic process is X = (X0, X1, X2, …) with state space S = {0, 1, …, m}. Of course, the number of balls in urn 0 at time n is m − Xn.

The Models
In the basic Ehrenfest model, at each discrete time unit, independently of the past, a ball is selected at random and moved to the other urn.

Figure 16.8.1 : The Ehrenfest model

X is a discrete-time Markov chain on S with transition probability matrix P given by

P(x, x−1) = x/m,  P(x, x+1) = (m−x)/m;  x ∈ S   (16.8.1)

Proof

In the Ehrenfest experiment, select the basic model. For selected values of m and selected values of the initial state, run the chain for 1000 time
steps and note the limiting behavior of the proportion of time spent in each state.

Suppose now that we modify the basic Ehrenfest model as follows: at each discrete time, independently of the past, we select a ball at random and an urn at random. We then put the chosen ball in the chosen urn.

X is a discrete-time Markov chain on S with the transition probability matrix Q given by

Q(x, x−1) = x/(2m),  Q(x, x) = 1/2,  Q(x, x+1) = (m−x)/(2m);  x ∈ S   (16.8.3)

Proof

Note that Q(x, y) = (1/2)P(x, y) for y ∈ {x − 1, x + 1}.

In the Ehrenfest experiment, select the modified model. For selected values of m and selected values of the initial state, run the chain for 1000 time
steps and note the limiting behavior of the proportion of time spent in each state.

Classification

The basic and modified Ehrenfest chains are irreducible and positive recurrent.
Proof

The basic Ehrenfest chain is periodic with period 2. The cyclic classes are the set of even states and the set of odd states. The two-step transition matrix is

P²(x, x−2) = x(x−1)/m²,  P²(x, x) = [x(m−x+1) + (m−x)(x+1)]/m²,  P²(x, x+2) = (m−x)(m−x−1)/m²;  x ∈ S   (16.8.5)

Proof

The modified Ehrenfest chain is aperiodic.


Proof

Invariant and Limiting Distributions


For the basic and modified Ehrenfest chains, the invariant distribution is the binomial distribution with trial parameter m and success parameter 1/2. So the invariant probability density function f is given by

f(x) = (m choose x)(1/2)^m,  x ∈ S   (16.8.6)

Proof

Thus, the invariant distribution corresponds to placing each ball randomly and independently either in urn 0 or in urn 1.

The mean return time to state x ∈ S for the basic or modified Ehrenfest chain is μ(x) = 2^m / (m choose x).
Proof

For the basic Ehrenfest chain, the limiting behavior of the chain is as follows:

1. P^{2n}(x, y) → (m choose y)(1/2)^{m−1} as n → ∞ if x, y ∈ S have the same parity (both even or both odd). The limit is 0 otherwise.
2. P^{2n+1}(x, y) → (m choose y)(1/2)^{m−1} as n → ∞ if x, y ∈ S have opposite parity (one even and one odd). The limit is 0 otherwise.
Proof
For the modified Ehrenfest chain, Q^n(x, y) → (m choose y)(1/2)^m as n → ∞ for x, y ∈ S.
Proof

In the Ehrenfest experiment, the limiting binomial distribution is shown graphically and numerically. For each model and for selected values of m
and selected values of the initial state, run the chain for 1000 time steps and note the limiting behavior of the proportion of time spent in each state.
How do the choices of m, the initial state, and the model seem to affect the rate of convergence to the limiting distribution?

Reversibility
The basic and modified Ehrenfest chains are reversible.
Proof

Run the simulation of the Ehrenfest experiment 10,000 time steps for each model, for selected values of m, and with initial state 0. Note that at first,
you can see the “arrow of time”. After a long period, however, the direction of time is no longer evident.

Computational Exercises
Consider the basic Ehrenfest chain with m = 5 balls, and suppose that X0 has the uniform distribution on S.

1. Compute the probability density function, mean and variance of X1.
2. Compute the probability density function, mean and variance of X2.
3. Compute the probability density function, mean and variance of X3.
4. Sketch the initial probability density function and the probability density functions in parts (a), (b), and (c) on a common set of axes.
Answer
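As a computational aside, the pdfs in parts (a)–(c) can be obtained by repeated vector-matrix multiplication (a sketch in Python with NumPy; the variable names are ours):

import numpy as np

m = 5
S = np.arange(m + 1)
P = np.zeros((m + 1, m + 1))
for x in S:
    if x > 0:
        P[x, x - 1] = x / m              # P(x, x-1) = x/m
    if x < m:
        P[x, x + 1] = (m - x) / m        # P(x, x+1) = (m-x)/m

f = np.full(m + 1, 1 / (m + 1))          # X0 uniform on S
for n in range(1, 4):                    # pdfs of X1, X2, X3
    f = f @ P
    mean = (S * f).sum()
    var = ((S - mean) ** 2 * f).sum()
    print(n, f.round(4), mean, var)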

Consider the modified Ehrenfest chain with m = 5 balls, and suppose that the chain starts in state 2 (with probability 1).

1. Compute the probability density function, mean and standard deviation of X1.
2. Compute the probability density function, mean and standard deviation of X2.
3. Compute the probability density function, mean and standard deviation of X3.
4. Sketch the initial probability density function and the probability density functions in parts (a), (b), and (c) on a common set of axes.
Answer

This page titled 16.8: The Ehrenfest Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via
source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

16.9: The Bernoulli-Laplace Chain

Basic Theory
Introduction
The Bernoulli-Laplace chain, named for Jacob Bernoulli and Pierre Simon Laplace, is a simple discrete model for the diffusion of two incompressible gases between two containers. Like the Ehrenfest chain, it can also be formulated as a simple ball and urn model. Thus, suppose that we have two urns, labeled 0 and 1. Urn 0 contains j balls and urn 1 contains k balls, where j, k ∈ N+. Of the j + k balls, r are red and the remaining j + k − r are green. Thus r ∈ N+ and 0 < r < j + k. At each discrete time, independently of the past, a ball is selected at random from each urn and then the two balls are switched. The balls of different colors correspond to molecules of different types, and the urns are the containers. The incompressible property is reflected in the fact that the number of balls in each urn remains constant over time.

Figure 16.9.1 : The Bernoulli-Laplace model

Let Xn denote the number of red balls in urn 1 at time n ∈ N. Then

1. k − Xn is the number of green balls in urn 1 at time n.
2. r − Xn is the number of red balls in urn 0 at time n.
3. j − r + Xn is the number of green balls in urn 0 at time n.

X = (X0, X1, X2, …) is a discrete-time Markov chain with state space S = {max{0, r − j}, …, min{k, r}} and with transition matrix P given by

P(x, x−1) = (j−r+x)x/(jk),  P(x, x) = [(r−x)x + (j−r+x)(k−x)]/(jk),  P(x, x+1) = (r−x)(k−x)/(jk);  x ∈ S   (16.9.1)

Proof

This is a fairly complicated model, simply because of the number of parameters. Interesting special cases occur when some of the parameters are
the same.

Consider the special case j = k, so that each urn has the same number of balls. The state space is S = {max{0, r − k}, …, min{k, r}} and the transition probability matrix is

P(x, x−1) = (k−r+x)x/k²,  P(x, x) = [(r−x)x + (k−r+x)(k−x)]/k²,  P(x, x+1) = (r−x)(k−x)/k²;  x ∈ S   (16.9.2)

Consider the special case r = j, so that the number of red balls is the same as the number of balls in urn 0. The state space is S = {0, …, min{j, k}} and the transition probability matrix is

P(x, x−1) = x²/(jk),  P(x, x) = x(j+k−2x)/(jk),  P(x, x+1) = (j−x)(k−x)/(jk);  x ∈ S   (16.9.3)

Consider the special case r = k, so that the number of red balls is the same as the number of balls in urn 1. The state space is S = {max{0, k − j}, …, k} and the transition probability matrix is

P(x, x−1) = (j−k+x)x/(jk),  P(x, x) = (k−x)(j−k+2x)/(jk),  P(x, x+1) = (k−x)²/(jk);  x ∈ S   (16.9.4)

Consider the special case j = k = r, so that each urn has the same number of balls, and this is also the number of red balls. The state space is S = {0, 1, …, k} and the transition probability matrix is

P(x, x−1) = x²/k²,  P(x, x) = 2x(k−x)/k²,  P(x, x+1) = (k−x)²/k²;  x ∈ S   (16.9.5)

Run the simulation of the Bernoulli-Laplace experiment for 10000 steps and for various values of the parameters. Note the limiting behavior
of the proportion of time spent in each state.

Invariant and Limiting Distributions


The Bernoulli-Laplace chain is irreducible.
Proof

Except in the trivial case j = k = r = 1, the Bernoulli-Laplace chain is aperiodic.


Proof

The invariant distribution is the hypergeometric distribution with population parameter j + k, sample parameter k, and type parameter r. The probability density function is

f(x) = (r choose x)((j+k−r) choose (k−x)) / ((j+k) choose k),  x ∈ S   (16.9.6)

Proof

Thus, the invariant distribution corresponds to selecting a sample of k balls at random and without replacement from the j + k balls and placing them in urn 1. The mean and variance of the invariant distribution are

μ = k r/(j+k),  σ² = k [r/(j+k)] [(j+k−r)/(j+k)] [j/(j+k−1)]   (16.9.7)

The mean return time to each state x ∈ S is

μ(x) = ((j+k) choose k) / [(r choose x)((j+k−r) choose (k−x))]   (16.9.8)

Proof

P^n(x, y) → f(y) = (r choose y)((j+k−r) choose (k−y)) / ((j+k) choose k) as n → ∞ for (x, y) ∈ S².

Proof

In the simulation of the Bernoulli-Laplace experiment, vary the parameters and note the shape and location of the limiting hypergeometric
distribution. For selected values of the parameters, run the simulation for 10000 steps and and note the limiting behavior of the proportion of
time spent in each state.

Reversibility
The Bernoulli-Laplace chain is reversible.
Proof

Run the simulation of the Bernoulli-Laplace experiment 10,000 time steps for selected values of the parameters, and with initial state 0.
Note that at first, you can see the “arrow of time”. After a long period, however, the direction of time is no longer evident.

Computational Exercises
Consider the Bernoulli-Laplace chain with j = 10, k = 5, and r = 4. Suppose that X0 has the uniform distribution on S. Explicitly give each of the following:

1. The state space S
2. The transition matrix P.
3. The probability density function, mean and variance of X1.
4. The probability density function, mean and variance of X2.
5. The probability density function, mean and variance of X3.

Answer
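As a computational aside, the transition matrix and the invariant pdf can be built directly from the formulas above (a sketch in Python with NumPy and SciPy; scipy.stats.hypergeom takes the population size, the number of red balls, and the sample size, in that order):

import numpy as np
from scipy.stats import hypergeom

j, k, r = 10, 5, 4
lo, hi = max(0, r - j), min(k, r)
S = np.arange(lo, hi + 1)
n = len(S)
P = np.zeros((n, n))
for i, x in enumerate(S):
    P[i, i] = ((r - x) * x + (j - r + x) * (k - x)) / (j * k)
    if x > lo:
        P[i, i - 1] = (j - r + x) * x / (j * k)
    if x < hi:
        P[i, i + 1] = (r - x) * (k - x) / (j * k)

f = hypergeom(j + k, r, k).pmf(S)        # invariant pdf from (16.9.6)
print(np.allclose(f @ P, f))             # True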

Consider the Bernoulli-Laplace chain with j = k = 10 and r = 6 . Give each of the following explicitly:
1. The state space S
2. The transition matrix P

3. The invariant probability density function.
Answer

This page titled 16.9: The Bernoulli-Laplace Chain is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

16.10: Discrete-Time Reliability Chains

The Success-Runs Chain


Suppose that we have a sequence of trials, each of which results in either success or failure. Our basic assumption is that if there have been x ∈ N consecutive successes, then the probability of success on the next trial is p(x), independently of the past, where p : N → (0, 1). Whenever there is a failure, we start over, independently, with a new sequence of trials. Appropriately enough, p is called the success function. Let Xn denote the length of the run of successes after n trials.
n

X = (X0 , X1 , X2 , …) is a discrete-time Markov chain with state space N and transition probability matrix P given by

P (x, x + 1) = p(x), P (x, 0) = 1 − p(x); x ∈ N (16.10.1)

The Markov chain X is called the success-runs chain.

Figure 16.10.1: State graph of the success-runs chain


Now let T denote the trial number of the first failure, starting with a fresh sequence of trials. Note that in the context of the success-runs chain X, T = τ0, the first return time to state 0, starting in 0. Note that T takes values in N+ ∪ {∞}, since presumably, it is possible that no failure occurs. Let r(n) = P(T > n) for n ∈ N, the probability of at least n consecutive successes, starting with a fresh set of trials. Let f(n) = P(T = n + 1) for n ∈ N, the probability of exactly n consecutive successes, starting with a fresh set of trials.

The functions p, r, and f are related as follows:

1. p(x) = r(x+1)/r(x) for x ∈ N
2. r(n) = ∏_{x=0}^{n−1} p(x) for n ∈ N
3. f(n) = [1 − p(n)] ∏_{x=0}^{n−1} p(x) for n ∈ N
4. r(n) = 1 − ∑_{x=0}^{n−1} f(x) for n ∈ N
5. f(n) = r(n) − r(n+1) for n ∈ N

Thus, the functions p, r, and f give equivalent information. If we know one of the functions, we can construct the other two, and
hence any of the functions can be used to define the success-runs chain. The function r is the reliability function associated with T .

The function r is characterized by the following properties:


1. r is positive.
2. r(0) = 1
3. r is strictly decreasing.

The function f is characterized by the following properties:

1. f is positive.
2. ∑_{x=0}^∞ f(x) ≤ 1

Essentially, f is the probability density function of T − 1 , except that it may be defective in the sense that the sum of its values
may be less than 1. The leftover probability, of course, is the probability that T = ∞ . This is the critical consideration in the
classification of the success-runs chain, which we will consider shortly.

Verify that each of the following functions has the appropriate properties, and then find the other two functions:
1. p is a constant in (0, 1).
2. r(n) = 1/(n + 1) for n ∈ N .

3. r(n) = (n + 1)/(2 n + 1) for n ∈ N .
4. p(x) = 1/(x + 2) for x ∈ N.
Answer
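As a computational aside, the relations above are easy to implement. A sketch (Python with NumPy; the function names are ours), using part (b) of the exercise, where r(n) = 1/(n+1) and hence p(x) = (x+1)/(x+2):

import numpy as np

def r_from_p(p, N):
    # r(n) = product of p(x) for x = 0, ..., n-1
    r = np.ones(N + 1)
    for n in range(1, N + 1):
        r[n] = r[n - 1] * p(n - 1)
    return r

p = lambda x: (x + 1) / (x + 2)    # success function for example (b)
r = r_from_p(p, 10)
f = r[:-1] - r[1:]                 # f(n) = r(n) - r(n+1)
print(r.round(4))                  # matches 1/(n+1)
print(f.round(4))                  # matches 1/((n+1)(n+2))
print(r[-1])                       # r(n) -> alpha = 0, so the chain is recurrent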

In part (a), note that the trials are Bernoulli trials. We have an app for this case.

The success-runs app is a simulation of the success-runs chain based on Bernoulli trials. Run the simulation 1000 times for
various values of p and various initial states, and note the general behavior of the chain.

The success-runs chain is irreducible and aperiodic.


Proof

Recall that T has the same distribution as τ0, the first return time to 0 starting at state 0. Thus, the classification of the chain as recurrent or transient depends on α = P(T = ∞). Specifically, the success-runs chain is transient if α > 0 and recurrent if α = 0. Thus, we see that the chain is recurrent if and only if a failure is sure to occur. We can compute the parameter α in terms of each of the three functions that define the chain.

In terms of p, r, and f,

α = ∏_{x=0}^∞ p(x) = lim_{n→∞} r(n) = 1 − ∑_{x=0}^∞ f(x)   (16.10.2)

Compute α and determine whether the success-runs chain X is transient or recurrent for each of the examples above.
Answer

Run the simulation of the success-runs chain 1000 times for various values of p, starting in state 0. Note the return times to
state 0.

Let μ = E(T ) , the expected trial number of the first failure, starting with a fresh sequence of trials.

μ is related to α, f, and r as follows:

1. If α > 0 then μ = ∞
2. If α = 0 then μ = 1 + ∑_{n=0}^∞ n f(n)
3. μ = ∑_{n=0}^∞ r(n)

Proof

The success-runs chain X is positive recurrent if and only if μ < ∞ .


Proof

If X is recurrent, then r is invariant for X. In the positive recurrent case, when μ < ∞, the invariant distribution has probability density function g given by

g(x) = r(x)/μ,  x ∈ N   (16.10.3)

Proof

When X is recurrent, we know from the general theory that every other nonnegative left invariant function is a nonnegative multiple of r.

Determine whether the success-runs chain X is transient, null recurrent, or positive recurrent for each of the examples above.
If the chain is positive recurrent, find the invariant probability density function.
Answer

From (a), the success-runs chain corresponding to Bernoulli trials with success probability p ∈ (0, 1) has the geometric distribution
on N, with parameter 1 − p , as the invariant distribution.

Run the simulation of the success-runs chain 1000 times for various values of p and various initial states. Compare the
empirical distribution to the invariant distribution.

The Remaining Life Chain


Consider a device whose (discrete) time to failure U takes values in N, with probability density function f. We assume that f(n) > 0 for n ∈ N. When the device fails, it is immediately (and independently) replaced by an identical device. For n ∈ N, let Yn denote the time to failure of the device that is in service at time n.

Y = (Y0 , Y1 , Y2 , …) is a discrete-time Markov chain with state space N and transition probability matrix Q given by

Q(0, x) = f (x), Q(x + 1, x) = 1; x ∈ N (16.10.6)

The Markov chain Y is called the remaining life chain with lifetime probability density function f , and has the state graph below.

Figure 16.10.2: State graph of the remaining life chain


We have an app for the remaining life chain whose lifetime distribution is the geometric distribution on N , with parameter
1 − p ∈ (0, 1) .

Run the simulation of the remaining-life chain 1000 times for various values of p and various initial states. Note the general
behavior of the chain.

If U denotes the lifetime of a device, as before, note that T = 1 +U is the return time to 0 for the chain Y , starting at 0.

Y is irreducible, aperiodic, and recurrent.


Proof

Now let r(n) = P(U ≥ n) = P(T > n) for n ∈ N and let μ = E(T) = 1 + E(U). Note that r(n) = ∑_{x=n}^∞ f(x) and μ = 1 + ∑_{x=0}^∞ x f(x) = ∑_{n=0}^∞ r(n).

The remaining-life chain Y is positive recurrent if and only if μ < ∞, in which case the invariant distribution has probability density function g given by

g(x) = r(x)/μ,  x ∈ N   (16.10.7)

Proof

Suppose that Y is the remaining life chain whose lifetime distribution is the geometric distribution on N with parameter
1 − p ∈ (0, 1) . Then this distribution is also the invariant distribution.
Proof

Run the simulation of the remaining-life chain 1000 times for various values of p and various initial states. Compare the empirical distribution to the invariant distribution.

Time Reversal
You probably have already noticed similarities, in notation and results, between the success-runs chain and the remaining-life
chain. There are deeper connections.

Suppose that f is a probability density function on N with f (n) > 0 for n ∈ N . Let X be the success-runs chain associated
with f and Y the remaining life chain associated with f . Then X and Y are time reversals of each other.

Proof

In the context of reliability, it is also easy to see that the chains are time reversals of each other. Consider again a device whose random lifetime takes values in N, with the device immediately replaced by an identical device upon failure. For n ∈ N, we can think of Xn as the age of the device in service at time n and Yn as the time remaining until failure for that device.

Run the simulation of the success-runs chain 1000 times for various values of p, starting in state 0. This is the time reversal of the simulation in the next exercise.

Run the simulation of the remaining-life chain 1000 times for various values of p, starting in state 0. This is the time reversal
of the simulation in the previous exercise.

This page titled 16.10: Discrete-Time Reliability Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

16.11: Discrete-Time Branching Chain

Basic Theory
Introduction
Generically, suppose that we have a system of particles that can generate or split into other particles of the same type. Here are
some typical examples:
The particles are biological organisms that reproduce.
The particles are neutrons in a chain reaction.
The particles are electrons in an electron multiplier.
We assume that each particle, at the end of its life, is replaced by a random number of new particles that we will refer to as children
of the original particle. Our basic assumption is that the particles act independently, each with the same offspring distribution on N.
Let f denote the common probability density function of the number of offspring of a particle. We will also let f^{*n} = f ∗ f ∗ ⋯ ∗ f denote the convolution power of degree n of f; this is the probability density function of the total number of children of n particles.
We will consider the evolution of the system in real time in our study of continuous-time branching chains. In this section, we will
study the evolution of the system in generational time. Specifically, the particles that we start with are in generation 0, and
recursively, the children of a particle in generation n are in generation n + 1 .

Figure 16.11.1: Generations 0, 1, 2, and 3 of a branching chain.


Let Xn denote the number of particles in generation n ∈ N. One way to construct the process mathematically is to start with an array of independent random variables (U_{n,i} : n ∈ N, i ∈ N+), each with probability density function f. We interpret U_{n,i} as the number of children of the ith particle in generation n (if this particle exists). Note that we have more random variables than we need, but this causes no harm, and we know that we can construct a probability space that supports such an array of random variables. We can now define our state variables recursively by

X_{n+1} = ∑_{i=1}^{Xn} U_{n,i}   (16.11.1)

X = (X0, X1, X2, …) is a discrete-time Markov chain on N with transition probability matrix P given by

P(x, y) = f^{*x}(y),  (x, y) ∈ N²   (16.11.2)

The chain X is the branching chain with offspring distribution defined by f.


Proof

The branching chain is also known as the Galton-Watson process in honor of Francis Galton and Henry William Watson who studied such processes in the context of the survival of (aristocratic) family names. Note that the descendants of each initial particle form a branching chain, and these chains are independent. Thus, the branching chain starting with x particles is equivalent to x independent copies of the branching chain starting with 1 particle. This feature turns out to be very important in the analysis of the chain. Note also that 0 is an absorbing state that corresponds to extinction. On the other hand, the population may grow to infinity, sometimes called explosion. Computing the probability of extinction is one of the fundamental problems in branching chains; we will essentially solve this problem in the next subsection.

Extinction and Explosion
The behavior of the branching chain in expected value is easy to analyze. Let m denote the mean of the offspring distribution, so that

m = ∑_{x=0}^∞ x f(x)   (16.11.4)

Note that m ∈ [0, ∞]. The parameter m will turn out to be of fundamental importance.

Expected value properties

1. E(X_{n+1}) = m E(Xn) for n ∈ N
2. E(Xn) = m^n E(X0) for n ∈ N
3. E(Xn) → 0 as n → ∞ if m < 1.
4. E(Xn) = E(X0) for each n ∈ N if m = 1.
5. E(Xn) → ∞ as n → ∞ if m > 1 and E(X0) > 0.
Proof

Part (c) is extinction in the mean; part (d) is stability in the mean; and part (e) is explosion in the mean.

Recall that state 0 is absorbing (there are no particles), and hence {Xn = 0 for some n ∈ N} = {τ0 < ∞} is the extinction event (where as usual, τ0 is the time of the first return to 0). We are primarily concerned with the probability of extinction, as a function of the initial state. First, however, we will make some simple observations and eliminate some trivial cases.

Suppose that f(1) = 1, so that each particle is replaced by a single new particle. Then

1. Every state is absorbing.
2. The equivalence classes are the singleton sets.
3. With probability 1, Xn = X0 for every n ∈ N.

Proof

Suppose that f(0) > 0 so that with positive probability, a particle will die without offspring. Then

1. Every state leads to 0.
2. Every positive state is transient.
3. With probability 1 either Xn = 0 for some n ∈ N (extinction) or Xn → ∞ as n → ∞ (explosion).
Proof

Suppose that f(0) = 0 and f(1) < 1, so that every particle is replaced by at least one particle, and with positive probability, more than one. Then

1. Every positive state is transient.
2. P(Xn → ∞ as n → ∞ ∣ X0 = x) = 1 for every x ∈ N+, so that explosion is certain, starting with at least one particle.
Proof

Suppose that f(0) > 0 and f(0) + f(1) = 1, so that with positive probability, a particle will die without offspring, and with probability 1, a particle is not replaced by more than one particle. Then

1. Every state leads to 0.
2. Every positive state is transient.
3. With probability 1, Xn = 0 for some n ∈ N, so extinction is certain.
Proof

Thus, the interesting case is when f(0) > 0 and f(0) + f(1) < 1, so that with positive probability, a particle will die without offspring, and also with positive probability, the particle will be replaced by more than one new particle. We will assume these conditions for the remainder of our discussion. By the state classification above all states lead to 0 (extinction). We will denote the probability of extinction, starting with one particle, by
q = P(τ0 < ∞ ∣ X0 = 1) = P(Xn = 0 for some n ∈ N ∣ X0 = 1) (16.11.6)

The set of positive states N+ is a transient equivalence class, and the probability of extinction starting with x ∈ N+ particles is

q^x = P(τ0 < ∞ ∣ X0 = x) = P(Xn = 0 for some n ∈ N ∣ X0 = x)   (16.11.7)

Proof

The parameter q satisfies the equation

q = ∑_{x=0}^∞ f(x) q^x   (16.11.8)

Proof

Thus the extinction probability q starting with 1 particle is a fixed point of the probability generating function Φ of the offspring distribution:

Φ(t) = ∑_{x=0}^∞ f(x) t^x,  t ∈ [0, 1]   (16.11.11)

Moreover, from the general discussion of hitting probabilities in the section on recurrence and transience, q is the smallest such
number in the interval (0, 1]. If the probability generating function Φ can be computed in closed form, then q can sometimes be
computed by solving the equation Φ(t) = t .

Φ satisfies the following properties:

1. Φ(0) = f(0).
2. Φ(1) = 1.
3. Φ′(t) > 0 for t ∈ (0, 1) so Φ is increasing on (0, 1).
4. Φ″(t) > 0 for t ∈ (0, 1) so Φ is concave upward on (0, 1).
5. m = lim_{t↑1} Φ′(t).

Proof

Our main result is next, and relates the extinction probability q and the mean of the offspring distribution m.

The extinction probability q and the mean of the offspring distribution m are related as follows:
1. If m ≤ 1 then q = 1 , so extinction is certain.
2. If m > 1 then 0 < q < 1 , so there is a positive probability of extinction and a positive probability of explosion.
Proof

Figure 16.11.2: The case of certain extinction.

Figure 16.11.3: The case of possible extinction and possible explosion.
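As a computational aside, since q is the smallest fixed point of Φ in (0, 1], it can be found by iterating t ← Φ(t) from t = 0; the iterates increase to q. A sketch (Python with NumPy; the helper name is ours) for an offspring pdf given as an array:

import numpy as np

def extinction_probability(f, iters=10_000):
    # Iterate t <- Phi(t) = sum_x f(x) t^x from t = 0; the iterates
    # increase to the smallest fixed point q in (0, 1].
    f = np.asarray(f, dtype=float)
    powers = np.arange(len(f))
    t = 0.0
    for _ in range(iters):
        t = np.sum(f * t ** powers)
    return t

# f(0) = 1 - p, f(2) = p with p = 2/3, so m = 4/3 > 1 and q = (1-p)/p = 1/2
p = 2 / 3
print(extinction_probability([1 - p, 0.0, p]))   # approximately 0.5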

Computational Exercises
Consider the branching chain with offspring probability density function f given by f (0) = 1 − p , f (2) = p , where p ∈ (0, 1)
is a parameter. Thus, each particle either dies or splits into two new particles. Find each of the following.
1. The transition matrix P .
2. The mean m of the offspring distribution.
3. The generating function Φ of the offspring distribution.
4. The extinction probability q.
Answer

Consider the branching chain whose offspring distribution is the geometric distribution on N with parameter 1 − p, where p ∈ (0, 1). Thus f(n) = (1 − p)p^n for n ∈ N. Find each of the following:

1. The transition matrix P .


2. The mean m of the offspring distribution.
3. The generating function Φ of the offspring distribution.
4. The extinction probability q.
Answer

Curiously, the extinction probability is the same as for the previous problem.

Consider the branching chain whose offspring distribution is the Poisson distribution with parameter m ∈ (0, ∞). Thus f(n) = e^{−m} m^n / n! for n ∈ N. Find each of the following:

1. The transition matrix P .


2. The mean m of the offspring distribution.
3. The generating function Φ of the offspring distribution.
4. The approximate extinction probability q when m = 2 and when m = 3 .
Answer

This page titled 16.11: Discrete-Time Branching Chain is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

16.12: Discrete-Time Queuing Chains

Basic Theory
Introduction
In a queuing model, customers arrive at a station for service. As always, the terms are generic; here are some typical examples:
The customers are persons and the service station is a store.
The customers are file requests and the service station is a web server.
The customers are packages and the service station is a processing facility.

Figure 16.12.1: Ten customers and a server

Queuing models can be quite complex, depending on such factors as the probability distribution that governs the arrival of
customers, the probability distribution that governs the service of customers, the number of servers, and the behavior of the
customers when all servers are busy. Indeed, queuing theory has its own lexicon to indicate some of these factors. In this section,
we will study one of the simplest, discrete-time queuing models. However, as we will see, this discrete-time chain is embedded in a
much more realistic continuous-time queuing process knows as the M/G/1 queue. In a general sense, the main interest in any
queuing model is the number of customers in the system as a function of time, and in particular, whether the servers can adequately
handle the flow of customers.

Our main assumptions are as follows:


1. If the queue is empty at a given time, then a random number of new customers arrive at the next time.
2. If the queue is nonempty at a given time, then one customer is served and a random number of new customers arrive at the
next time.
3. The number of customers who arrive at each time period form an independent, identically distributed sequence.

Thus, let Xn denote the number of customers in the system at time n ∈ N, and let Un denote the number of new customers who arrive at time n ∈ N+. Then U = (U1, U2, …) is a sequence of independent random variables, with common probability density function f on N, and

X_{n+1} = { U_{n+1},            Xn = 0
          { (Xn − 1) + U_{n+1}, Xn > 0 ,    n ∈ N   (16.12.1)

X = (X0 , X1 , X2 , …) is a discrete-time Markov chain with state space N and transition probability matrix P given by
P (0, y) = f (y), y ∈ N (16.12.2)

P (x, y) = f (y − x + 1), x ∈ N+ , y ∈ {x − 1, x, x + 1, …} (16.12.3)

The chain X is the queuing chain with arrival distribution defined by f .


Proof

Recurrence and Transience


From now on we will assume that f(0) > 0 and f(0) + f(1) < 1. Thus, at each time unit, it's possible that no new customers arrive or that at least 2 new customers arrive. Also, we let m denote the mean of the arrival distribution, so that

m = ∑_{x=0}^∞ x f(x)   (16.12.4)

Thus m is the average number of new customers who arrive during a time period.

The chain X is irreducible and aperiodic.

Proof

Our goal in this section is to compute the probability that the chain reaches 0, as a function of the initial state (so that the server is able to serve all of the customers). As we will see, there are some curious and unexpected parallels between this problem and the problem of computing the extinction probability in the branching chain. As a corollary, we will also be able to classify the queuing chain as transient or recurrent. Our basic parameter of interest is q = H(1, 0) = P(τ0 < ∞ ∣ X0 = 1), where as usual, H is the hitting probability matrix and τ0 = min{n ∈ N+ : Xn = 0} is the first positive time that the chain is in state 0 (possibly infinite). Thus, q is the probability that the queue eventually empties, starting with a single customer.

The parameter q satisfies the following properties:

1. q = H(x, x−1) for every x ∈ N+.
2. q^x = H(x, 0) for every x ∈ N+.

Proof

The parameter q satisfies the equation:

q = ∑_{x=0}^∞ f(x) q^x   (16.12.5)

Proof

Note that this is exactly the same equation that we considered for the branching chain, namely Φ(q) = q , where Φ is the
probability generating function of the distribution that governs the number of new customers that arrive during each period.

Figure 16.12.2: The graph of ϕ in the recurrent case

Figure 16.12.3: The graph of ϕ in the transient case

q is the smallest solution in (0, 1] of the equation Φ(t) = t. Moreover

1. If m ≤ 1 then q = 1 and the chain is recurrent.
2. If m > 1 then 0 < q < 1 and the chain is transient.
Proof

Positive Recurrence
Our next goal is to find conditions for the queuing chain to be positive recurrent. Recall that m is the mean of the probability density function f; that is, the expected number of new customers who arrive during a time period. As before, let τ0 denote the first positive time that the chain is in state 0. We assume that the chain is recurrent, so m ≤ 1 and P(τ0 < ∞) = 1.

Let Ψ denote the probability generating function of τ0, starting in state 1. Then

1. Ψ is also the probability generating function of τ0 starting in state 0.
2. Ψ^x is the probability generating function of τ0 starting in state x ∈ N+.

Proof

Ψ(t) = tΦ[Ψ(t)] for t ∈ [−1, 1].


Proof

The derivative of Ψ is

Ψ′(t) = Φ[Ψ(t)] / (1 − tΦ′[Ψ(t)]),  t ∈ (−1, 1)   (16.12.11)

Proof

As usual, let μ0 = E(τ0 ∣ X0 = 0), the mean return time to state 0 starting in state 0. Then

1. μ0 = 1/(1 − m) if m < 1 and therefore the chain is positive recurrent.
2. μ0 = ∞ if m = 1 and therefore the chain is null recurrent.
Proof

So to summarize, the queuing chain is positive recurrent if m < 1 , null recurrent if m = 1 , and transient if m > 1 . Since m is the
expected number of new customers who arrive during a service period, the results are certainly reasonable.
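As a computational aside, both q and μ0 are determined by the arrival distribution alone. A sketch (Python with NumPy; the helper name is ours, reusing the fixed-point iteration from the branching section):

import numpy as np

def queue_parameters(f, iters=10_000):
    # For an arrival pdf f on {0, ..., len(f)-1}: the mean m, the
    # probability q that the queue empties (smallest fixed point of
    # Phi in (0, 1]), and mu_0 = 1/(1 - m) when m < 1.
    f = np.asarray(f, dtype=float)
    x = np.arange(len(f))
    m = np.sum(x * f)
    t = 0.0
    for _ in range(iters):               # iterate t <- Phi(t) from 0
        t = np.sum(f * t ** x)
    mu0 = 1 / (1 - m) if m < 1 else np.inf
    return m, t, mu0

# f(0) = 3/4, f(2) = 1/4: m = 1/2 < 1, so q = 1 and mu_0 = 2
print(queue_parameters([0.75, 0.0, 0.25]))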

Computational Exercises
Consider the queuing chain with arrival probability density function f given by f (0) = 1 − p , f (2) = p , where p ∈ (0, 1) is a
parameter. Thus, at each time period, either no new customers arrive or two arrive.
1. Find the transition matrix P .
2. Find the mean m of the arrival distribution.
3. Find the generating function Φ of the arrival distribution.
4. Find the probability q that the queue eventually empties, starting with one customer.
5. Classify the chain as transient, null recurrent, or positive recurrent.
6. In the positive recurrent case, find μ0, the mean return time to 0.

Answer

Consider the queuing chain whose arrival distribution is the geometric distribution on N with parameter 1 − p, where p ∈ (0, 1). Thus f(n) = (1 − p)p^n for n ∈ N.

1. Find the transition matrix P .


2. Find the mean m of the arrival distribution.
3. Find the generating function Φ of the arrival distribution.
4. Find the probability q that the queue eventually empties, starting with one customer.
5. Classify the chain as transient, null recurrent, or positive recurrent.
6. In the positive recurrent case, find μ0, the mean return time to 0.

Answer

Curiously, the parameter q and the classification of the chain are the same in the last two models.

Consider the queuing chain whose arrival distribution is the Poisson distribution with parameter m ∈ (0, ∞). Thus f(n) = e^{−m} m^n / n! for n ∈ N. Find each of the following:

1. The transition matrix P
2. The mean m of the arrival distribution.
3. The generating function Φ of the arrival distribution.
4. The approximate value of q when m = 2 and when m = 3 .
5. Classify the chain as transient, null recurrent, or positive recurrent.
6. In the positive recurrent case, find μ_0, the mean return time to 0.

Answer

This page titled 16.12: Discrete-Time Queuing Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

16.13: Discrete-Time Birth-Death Chains

Basic Theory
Introduction
Suppose that S is an interval of integers (that is, a set of consecutive integers), either finite or infinite. A (discrete-time) birth-death chain on S is a discrete-time Markov chain X = (X_0, X_1, X_2, …) on S with transition probability matrix P of the form

P (x, x − 1) = q(x), P (x, x) = r(x), P (x, x + 1) = p(x); x ∈ S (16.13.1)

where p, q, and r are nonnegative functions on S with p(x) + q(x) + r(x) = 1 for x ∈ S .

If the interval S has a minimum value a ∈ Z then of course we must have q(a) = 0 . If r(a) = 1 , the boundary point a is
absorbing and if p(a) = 1 , then a is reflecting. Similarly, if the interval S has a maximum value b ∈ Z then of course we must
have p(b) = 0 . If r(b) = 1 , the boundary point b is absorbing and if p(b) = 1 , then b is reflecting. Several other special models
that we have studied are birth-death chains; these are explored below.
In this section, as you will see, we often have sums of products. Recall that a sum over an empty index set is 0, while a product
over an empty index set is 1.

Recurrence and Transience


If S is finite, classification of the states of a birth-death chain as recurrent or transient is simple, and depends only on the state graph. In particular, if the chain is irreducible, then the chain is positive recurrent. So we will study the classification of birth-death chains when S = N. We assume that p(x) > 0 for all x ∈ N and that q(x) > 0 for all x ∈ N_+ (but of course we must have q(0) = 0). Thus, the chain is irreducible.

Under these assumptions, the birth-death chain on N is


1. Aperiodic if r(x) > 0 for some x ∈ N.
2. Periodic with period 2 if r(x) = 0 for all x ∈ N.
Proof

We will use the test for recurrence derived earlier with A = N_+, the set of positive states. That is, we will compute the probability that the chain never hits 0, starting in a positive state.

The chain X is recurrent if and only if

\[ \sum_{x=0}^{\infty} \frac{q(1) \cdots q(x)}{p(1) \cdots p(x)} = \infty \] (16.13.2)

Proof

Note that r, the function that assigns to each state x ∈ N the probability of an immediate return to x, plays no direct role in whether the chain is transient or recurrent. Indeed all that matters are the ratios q(x)/p(x) for x ∈ N_+.

Positive Recurrence and Invariant Distributions


Suppose again that we have a birth-death chain X on N, with p(x) > 0 for all x ∈ N and q(x) > 0 for all x ∈ N_+. Thus the chain is irreducible.

The function g : N → (0, ∞) defined by

\[ g(x) = \frac{p(0) \cdots p(x-1)}{q(1) \cdots q(x)}, \quad x \in \mathbb{N} \] (16.13.9)

is invariant for X, and is the only invariant function, up to multiplication by constants. Hence X is positive recurrent if and only if B = ∑_{x=0}^∞ g(x) < ∞, in which case the (unique) invariant probability density function f is given by f(x) = g(x)/B for x ∈ N.
Proof

Here is a summary of the classification:

For the birth-death chain X, define

\[ A = \sum_{x=0}^{\infty} \frac{q(1) \cdots q(x)}{p(1) \cdots p(x)}, \quad B = \sum_{x=0}^{\infty} \frac{p(0) \cdots p(x-1)}{q(1) \cdots q(x)} \] (16.13.13)

1. X is transient if A < ∞
2. X is null recurrent if A = ∞ and B = ∞ .
3. X is positive recurrent if B < ∞ .
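This classification is easy to explore numerically. The sketch below (illustrative, not part of the text) approximates A and B by partial sums; the cutoff n_terms and the threshold big are arbitrary heuristics for deciding divergence, and the constant p and q in the demo are assumed values.

```python
def classify_birth_death(p, q, n_terms=1_000, big=1e12):
    """Approximate A and B by partial sums and classify the chain.
    p, q: birth and death probability functions on N.
    A partial sum exceeding `big` is treated as divergent (a heuristic)."""
    A = B = 0.0
    term_A = 1.0  # q(1)...q(x) / (p(1)...p(x)); the x = 0 term is 1
    term_B = 1.0  # p(0)...p(x-1) / (q(1)...q(x)); the x = 0 term is 1
    for x in range(n_terms):
        A += term_A
        B += term_B
        term_A *= q(x + 1) / p(x + 1)
        term_B *= p(x) / q(x + 1)
    if A < big:
        return "transient"
    return "positive recurrent" if B < big else "null recurrent"

# Demo: constant p(x) = 0.3, q(x) = 0.5, so q > p
print(classify_birth_death(lambda x: 0.3, lambda x: 0.5))  # positive recurrent
```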

Note again that r, the function that assigns to each state x ∈ N the probability of an immediate return to x, plays no direct role in
whether the chain is transient, null recurrent, or positive recurrent. Also, we know that an irreducible, recurrent chain has a positive
invariant function that is unique up to multiplication by positive constants, but the birth-death chain gives an example where this is
also true in the transient case.
Suppose now that n ∈ N_+ and that X = (X_0, X_1, X_2, …) is a birth-death chain on the integer interval N_n = {0, 1, …, n}. We assume that p(x) > 0 for x ∈ {0, 1, …, n − 1} while q(x) > 0 for x ∈ {1, 2, …, n}. Of course, we must have q(0) = p(n) = 0. With these assumptions, X is irreducible, and since the state space is finite, positive recurrent. So all that remains is to find the invariant distribution. The result is essentially the same as when the state space is N.

The invariant probability density function f_n is given by

\[ f_n(x) = \frac{1}{B_n} \frac{p(0) \cdots p(x-1)}{q(1) \cdots q(x)} \text{ for } x \in N_n, \text{ where } B_n = \sum_{x=0}^{n} \frac{p(0) \cdots p(x-1)}{q(1) \cdots q(x)} \] (16.13.14)

Proof

Note that B_n → B as n → ∞, and if B < ∞, f_n(x) → f(x) as n → ∞ for x ∈ N. We will see this type of behavior again. Results for the birth-death chain on N_n often converge to the corresponding results for the birth-death chain on N as n → ∞.

Absorption
Often when the state space S = N , the state of a birth-death chain represents a population of individuals of some sort (and so the
terms birth and death have their usual meanings). In this case state 0 is absorbing and means that the population is extinct.
Specifically, suppose that X = (X_0, X_1, X_2, …) is a birth-death chain on N with r(0) = 1 and with p(x), q(x) > 0 for x ∈ N_+. Thus, state 0 is absorbing and all positive states lead to each other and to 0. Let N = min{n ∈ N : X_n = 0} denote the time until absorption, where as usual, min ∅ = ∞.

One of the following events will occur:


1. Population extinction: N < ∞ or equivalently, X_m = 0 for some m ∈ N and hence X_n = 0 for all n ≥ m.
2. Population explosion: N = ∞ or equivalently X_n → ∞ as n → ∞.
Proof

Naturally we would like to find the probability of these complementary events, and happily we have already done so in our study of recurrence above. Let

\[ u(x) = P(N = \infty \mid X_0 = x) = P(X_n \to \infty \text{ as } n \to \infty \mid X_0 = x), \quad x \in \mathbb{N} \] (16.13.16)

so the absorption probability is

\[ v(x) = 1 - u(x) = P(N < \infty \mid X_0 = x) = P(X_n = 0 \text{ for some } n \in \mathbb{N} \mid X_0 = x), \quad x \in \mathbb{N} \] (16.13.17)

For the birth-death chain X,

\[ u(x) = \frac{1}{A} \sum_{i=0}^{x-1} \frac{q(1) \cdots q(i)}{p(1) \cdots p(i)} \text{ for } x \in \mathbb{N}_+, \text{ where } A = \sum_{i=0}^{\infty} \frac{q(1) \cdots q(i)}{p(1) \cdots p(i)} \] (16.13.18)

Proof

So if A = ∞ then u(x) = 0 for all x ∈ S. If A < ∞ then u(x) > 0 for all x ∈ N_+ and u(x) → 1 as x → ∞. For the absorption probability, v(x) = 1 for all x ∈ N if A = ∞, and so absorption is certain. If A < ∞ then

\[ v(x) = \frac{1}{A} \sum_{i=x}^{\infty} \frac{q(1) \cdots q(i)}{p(1) \cdots p(i)}, \quad x \in \mathbb{N} \] (16.13.19)

Next we consider the mean time to absorption, so let m(x) = E(N ∣ X_0 = x) for x ∈ N_+.

The mean absorption function is given by

\[ m(x) = \sum_{j=1}^{x} \sum_{k=j-1}^{\infty} \frac{p(j) \cdots p(k)}{q(j) \cdots q(k+1)}, \quad x \in \mathbb{N} \] (16.13.20)

Probabilistic Proof
Analytic Proof
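As a quick numerical check (an illustration, not part of the text), u can be approximated by truncating the series for A. The constant birth and death probabilities in the demo are assumed values chosen so that A < ∞, and the result is compared with the closed form u(x) = 1 − (q/p)^x that the constant case yields (see the random walk results below).

```python
def escape_probability(p, q, x, n_terms=10_000):
    """Approximate u(x) = (1/A) * sum_{i=0}^{x-1} r_i, where
    r_i = q(1)...q(i) / (p(1)...p(i)), truncating the series for A."""
    r, partial, A = 1.0, 0.0, 0.0
    for i in range(n_terms):
        if i < x:
            partial += r
        A += r
        r *= q(i + 1) / p(i + 1)
    return partial / A

# Constant case p = 0.6, q = 0.4: theory gives u(x) = 1 - (q/p)^x
for x in (1, 2, 5):
    approx = escape_probability(lambda i: 0.6, lambda i: 0.4, x)
    print(x, round(approx, 6), round(1 - (0.4 / 0.6) ** x, 6))
```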

Next we will consider a birth-death chain on a finite integer interval with both endpoints absorbing. Our interest is in the probability of absorption in one endpoint rather than the other, and in the mean time to absorption. Thus suppose that n ∈ N_+ and that X = (X_0, X_1, X_2, …) is a birth-death chain on N_n = {0, 1, …, n} with r(0) = r(n) = 1 and with p(x) > 0 and q(x) > 0 for x ∈ {1, 2, …, n − 1}. So the endpoints 0 and n are absorbing, and all other states lead to each other and to the endpoints. Let N = min{k ∈ N : X_k ∈ {0, n}}, the time until absorption, and for x ∈ N_n let v_n(x) = P(X_N = 0 ∣ X_0 = x) and m_n(x) = E(N ∣ X_0 = x). The definitions make sense since N is finite with probability 1.

The absorption probability function for state 0 is given by

\[ v_n(x) = \frac{1}{A_n} \sum_{i=x}^{n-1} \frac{q(1) \cdots q(i)}{p(1) \cdots p(i)} \text{ for } x \in N_n, \text{ where } A_n = \sum_{i=0}^{n-1} \frac{q(1) \cdots q(i)}{p(1) \cdots p(i)} \] (16.13.28)

Proof

Note that A_n → A as n → ∞, where A is the constant above for the absorption probability at 0 with the infinite state space N. If A < ∞ then v_n(x) → v(x) as n → ∞ for x ∈ N.

The mean absorption time is given by

\[ m_n(x) = m_n(1) \sum_{y=0}^{x-1} \frac{q(1) \cdots q(y)}{p(1) \cdots p(y)} - \sum_{y=0}^{x-1} \sum_{z=1}^{y} \frac{q(z+1) \cdots q(y)}{p(z) \cdots p(y)}, \quad x \in N_n \] (16.13.31)

where, with A_n as in the previous theorem,

\[ m_n(1) = \frac{1}{A_n} \sum_{y=1}^{n-1} \sum_{z=1}^{y} \frac{q(z+1) \cdots q(y)}{p(z) \cdots p(y)} \] (16.13.32)

Proof

Time Reversal
Our next discussion is on the time reversal of a birth-death chain. Essentially, every recurrent birth-death chain is reversible.

Suppose that X = (X_0, X_1, X_2, …) is an irreducible, recurrent birth-death chain on an integer interval S. Then X is reversible.
Proof

If S is finite and the chain X is irreducible, then of course X is recurrent (in fact positive recurrent), so by the previous result, X
is reversible. In the case S = N , we can use the invariant function above to show directly that the chain is reversible.

Suppose that X = (X_0, X_1, X_2, …) is a birth-death chain on N with p(x) > 0 for x ∈ N and q(x) > 0 for x ∈ N_+. Then X is reversible.
Proof

Thus, in the positive recurrent case, when the variables are given the invariant distribution, the transition matrix P describes the
chain forward in time and backwards in time.

Examples and Special Cases


As always, be sure to try the problems yourself before looking at the solutions.

Constant Birth and Death Probabilities


Our first examples consider birth-death chains on N with constant birth and death probabilities, except at the boundary points. Such
chains are often referred to as random walks, although that term is used in a variety of different settings. The results are special
cases of the general results above, but sometimes direct proofs are illuminating.

Suppose that X = (X_0, X_1, X_2, …) is the birth-death chain on N with constant birth probability p ∈ (0, ∞) on N and constant death probability q ∈ (0, ∞) on N_+, with p + q ≤ 1. Then

1. X is transient if q < p.
2. X is null recurrent if q = p.
3. X is positive recurrent if q > p, and the invariant distribution is the geometric distribution on N with parameter p/q:

\[ f(x) = \left(1 - \frac{p}{q}\right) \left(\frac{p}{q}\right)^x, \quad x \in \mathbb{N} \] (16.13.37)

Next we consider the random walk on N with 0 absorbing. As in the discussion of absorption above, v(x) denotes the absorption
probability and m(x) the mean time to absorption, starting in state x ∈ N.

Suppose that X = (X_0, X_1, …) is the birth-death chain on N with constant birth probability p ∈ (0, ∞) on N_+ and constant death probability q ∈ (0, ∞) on N_+, with p + q ≤ 1. Assume also that r(0) = 1, so that 0 is absorbing.

1. If q ≥ p then v(x) = 1 for all x ∈ N. If q < p then v(x) = (q/p)^x for x ∈ N.
2. If q ≤ p then m(x) = ∞ for all x ∈ N_+. If q > p then m(x) = x/(q − p) for x ∈ N.

Proof

This chain is essentially the gambler's ruin chain. Consider a gambler who bets on a sequence of independent games, where p and q are the probabilities of winning and losing, respectively. The gambler receives one monetary unit when she wins a game and must pay one unit when she loses a game. So X_n is the gambler's fortune after playing n games.
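The ruin probability can also be checked by simulation. The following Monte Carlo sketch is illustrative only (not part of the text): since the walk may drift to +∞ when q < p, each run is capped at a large number of steps, which makes the estimate a slightly biased, finite-horizon one. The parameter values are arbitrary.

```python
import random

def estimate_ruin(p, q, x0, runs=5_000, max_steps=2_000):
    """Monte Carlo estimate of the probability that the walk hits 0,
    starting at x0; r = 1 - p - q is the probability of staying put."""
    ruined = 0
    for _ in range(runs):
        x = x0
        for _ in range(max_steps):
            if x == 0:
                ruined += 1
                break
            u = random.random()
            if u < p:
                x += 1
            elif u < p + q:
                x -= 1
    return ruined / runs

p, q, x0 = 0.6, 0.4, 3
print(estimate_ruin(p, q, x0), (q / p) ** x0)  # both near 0.296
```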

Next we consider random walks on a finite interval.

Suppose that X = (X_0, X_1, …) is the birth-death chain on N_n = {0, 1, …, n} with constant birth probability p ∈ (0, ∞) on {0, 1, …, n − 1} and constant death probability q ∈ (0, ∞) on {1, 2, …, n}, with p + q ≤ 1. Then X is positive recurrent and the invariant probability density function f_n is given as follows:

1. If p ≠ q then

\[ f_n(x) = \frac{(p/q)^x (1 - p/q)}{1 - (p/q)^{n+1}}, \quad x \in N_n \] (16.13.39)

2. If p = q then f_n(x) = 1/(n + 1) for x ∈ N_n.

Note that if p < q then the invariant distribution is a truncated geometric distribution, and f_n(x) → f(x) for x ∈ N, where f is the invariant probability density function of the birth-death chain on N considered above. If p = q, the invariant distribution is uniform on N_n, certainly a reasonable result. Next we consider the chain with both endpoints absorbing. As before, v_n is the function that gives the probability of absorption in state 0, while m_n is the function that gives the mean time to absorption.

Suppose that X = (X_0, X_1, …) is the birth-death chain on N_n = {0, 1, …, n} with constant birth probability p ∈ (0, 1) and death probability q ∈ (0, ∞) on {1, 2, …, n − 1}, where p + q ≤ 1. Assume also that r(0) = r(n) = 1, so that 0 and n are absorbing.

1. If p ≠ q then

\[ v_n(x) = \frac{(q/p)^x - (q/p)^n}{1 - (q/p)^n}, \quad x \in N_n \] (16.13.40)

2. If p = q then v_n(x) = 1 − x/n for x ∈ N_n.

Note that if q < p then v_n(x) → v(x) as n → ∞ for x ∈ N.

Suppose again that X = (X_0, X_1, …) is the birth-death chain on N_n = {0, 1, …, n} with constant birth probability p ∈ (0, 1) and death probability q ∈ (0, ∞) on {1, 2, …, n − 1}, where p + q ≤ 1. Assume also that r(0) = r(n) = 1, so that 0 and n are absorbing.

1. If p ≠ q then

\[ m_n(x) = \frac{n}{p - q} \, \frac{1 - (q/p)^x}{1 - (q/p)^n} + \frac{x}{q - p}, \quad x \in N_n \] (16.13.41)

2. If p = q then

\[ m_n(x) = \frac{1}{2p}\, x(n - x), \quad x \in N_n \] (16.13.42)
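These closed forms can be double-checked numerically (a sketch, not part of the text) by solving the first-step linear systems directly: v_n is harmonic on the interior with v_n(0) = 1 and v_n(n) = 0, while m_n satisfies m_n(x) = 1 + (P m_n)(x) on the interior with m_n(0) = m_n(n) = 0. The parameter values in the demo are arbitrary.

```python
import numpy as np

def finite_walk_absorption(n, p, q):
    """Solve for v_n and m_n on {1, ..., n-1}, with 0 and n absorbing."""
    r = 1.0 - p - q
    size = n - 1
    A = np.zeros((size, size))
    bv = np.zeros(size)          # right side for v_n
    bm = np.ones(size)           # right side for m_n
    for i, x in enumerate(range(1, n)):
        A[i, i] = 1.0 - r        # (1 - r) f(x) - p f(x+1) - q f(x-1) = b(x)
        if x + 1 < n:
            A[i, i + 1] = -p
        if x - 1 > 0:
            A[i, i - 1] = -q
        if x == 1:
            bv[i] += q           # boundary value v_n(0) = 1 contributes q
    return np.linalg.solve(A, bv), np.linalg.solve(A, bm)

n, p, q = 10, 0.3, 0.5
v, m = finite_walk_absorption(n, p, q)
x = np.arange(1, n)
rho = q / p
print(np.allclose(v, (rho**x - rho**n) / (1 - rho**n)))                         # True
print(np.allclose(m, n / (p - q) * (1 - rho**x) / (1 - rho**n) + x / (q - p)))  # True
```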

Special Birth-Death Chains


Some of the random processes that we have studied previously are birth-death Markov chains.

Describe each of the following as a birth-death chain.


1. The Ehrenfest chain.
2. The modified Ehrenfest chain.
3. The Bernoulli-Laplace chain
4. The simple random walk on Z.
Answer

Other Examples
Consider the birth-death process on N with p(x) = 1/(x + 1), q(x) = 1 − p(x), and r(x) = 0 for x ∈ S.

1. Find the invariant function g.
2. Classify the chain.
Answer

This page titled 16.13: Discrete-Time Birth-Death Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

16.14: Random Walks on Graphs

Basic Theory
Introduction
Suppose that G = (S, E) is a graph with vertex set S and edge set E ⊆ S^2. We assume that the graph is undirected (perhaps a better term would be bi-directed) in the sense that (x, y) ∈ E if and only if (y, x) ∈ E. The vertex set S is countable, but may be
infinite. Let N (x) = {y ∈ S : (x, y) ∈ E} denote the set of neighbors of a vertex x ∈ S , and let d(x) = #[N (x)] denote the
degree of x. We assume that N (x) ≠ ∅ for x ∈ S , so G has no isolated points.
Suppose now that there is a conductance c(x, y) > 0 associated with each edge (x, y) ∈ E . The conductance is symmetric in the
sense that c(x, y) = c(y, x) for (x, y) ∈ E . We extend c to a function on all of S × S by defining c(x, y) = 0 for (x, y) ∉ E . Let

\[ C(x) = \sum_{y \in S} c(x, y), \quad x \in S \] (16.14.1)

so that C(x) is the total conductance of the edges coming from x. Our main assumption is that C(x) < ∞ for x ∈ S. As the terminology suggests, we imagine a fluid of some sort flowing through the edges of the graph, so that the conductance of an edge measures the capacity of the edge in some sense. One of the best interpretations is that the graph is an electrical network and the edges are resistors. In this interpretation, the conductance of a resistor is the reciprocal of the resistance.
In some applications, specifically the resistor network just mentioned, it's appropriate to impose the additional assumption that G
has no loops, so that (x, x) ∉ E for each x ∈ S . However, that assumption is not mathematically necessary for the Markov chains
that we will consider in this section.

The discrete-time Markov chain X = (X_0, X_1, X_2, …) with state space S and transition probability matrix P given by

\[ P(x, y) = \frac{c(x, y)}{C(x)}, \quad (x, y) \in S^2 \] (16.14.2)

is called a random walk on the graph G.


Justification

This chain governs a particle moving along the vertices of G. If the particle is at vertex x ∈ S at a given time, then the particle will
be at a neighbor of x at the next time; the neighbor is chosen randomly, in proportion to the conductance. In the setting of an
electrical network, it is natural to interpret the particle as an electron. Note that multiplying the conductance function c by a
positive constant has no effect on the associated random walk.
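A small computational sketch may make the definition concrete (it is not part of the text; the three-vertex graph and its conductance values are arbitrary choices). It builds the transition matrix from a symmetric conductance function and computes the invariant density f(x) = C(x)/K, anticipating the positive recurrence result later in this section.

```python
from collections import defaultdict

# Conductances on an undirected graph, one entry per edge (demo values)
edges = {("a", "b"): 1.0, ("b", "c"): 2.0, ("a", "c"): 3.0}

cond = defaultdict(float)                 # symmetric extension of c
for (x, y), value in edges.items():
    cond[(x, y)] = cond[(y, x)] = value

C = defaultdict(float)                    # C(x) = sum_y c(x, y)
for (x, y), value in list(cond.items()):
    C[x] += value

def P(x, y):
    """Transition probability of the random walk on the graph."""
    return cond[(x, y)] / C[x]

K = sum(C.values())                       # total conductance
f = {x: C[x] / K for x in C}              # invariant density
print({x: round(fx, 4) for x, fx in sorted(f.items())})
print(sum(f[x] * P(x, "b") for x in f), f["b"])   # f P = f, numerically
```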

Suppose that d(x) < ∞ for each x ∈ S and that c is constant on the edges. Then

1. C(x) = c d(x) for every x ∈ S.
2. The transition matrix P is given by P(x, y) = 1/d(x) for x ∈ S and y ∈ N(x), and P(x, y) = 0 otherwise.

The discrete-time Markov chain X is the symmetric random walk on G.


Proof

Thus, for the symmetric random walk, if the state is x ∈ S at a given time, then the next state is equally likely to be any of the
neighbors of x. The assumption that each vertex has finite degree means that the graph G is locally finite.

Let X be a random walk on a graph G.


1. If G is connected then X is irreducible.
2. If G is not connected then the equivalence classes of X are the components of G (the maximal connected subsets of S ).
Proof

So as usual, we will usually assume that G is connected, for otherwise we could simply restrict our attention to a component of G.
In the case that G has no loops (again, an important special case because of applications), it's easy to characterize the periodicity of

the chain. For the theorem that follows, recall that G is bipartite if the vertex set S can be partitioned into nonempty, disjoint sets
A and B (the parts) such that every edge in E has one endpoint in A and one endpoint in B .

Suppose that X is a random walk on a connected graph G with no loops. Then X is either aperiodic or has period 2.
Moreover, X has period 2 if and only if G is bipartite, in which case the parts are the cyclic classes of X.
Proof

Positive Recurrence and Invariant Distributions


Suppose again that X is a random walk on a graph G, and assume that G is connected so that X is irreducible.

The function C is invariant for P. The random walk X is positive recurrent if and only if

\[ K = \sum_{x \in S} C(x) = \sum_{(x, y) \in S^2} c(x, y) < \infty \] (16.14.4)

in which case the invariant probability density function f is given by f(x) = C(x)/K for x ∈ S.
Proof

Note that K is the total conductance over all edges in G. In particular, of course, if S is finite then X is positive recurrent, with f
as the invariant probability density function. For the symmetric random walk, this is the only way that positive recurrence can
occur:

The symmetric random walk on G is positive recurrent if and only if the set of vertices S is finite, in which case the invariant probability density function f is given by

\[ f(x) = \frac{d(x)}{2m}, \quad x \in S \] (16.14.6)

where d is the degree function and where m is the number of undirected edges.
Proof

On the other hand, when S is infinite, the classification of X as recurrent or transient is complicated. We will consider an interesting special case below, the symmetric random walk on Z^k.

Reversibility
Essentially, all reversible Markov chains can be interpreted as random walks on graphs. This fact is one of the reasons for studying
such walks.

If X is a random walk on a connected graph G, then X is reversible with respect to C .


Proof

Of course, if X is recurrent, then C is the only positive invariant function, up to multiplication by positive constants, and so X is
simply reversible.

Conversely, suppose that X is an irreducible Markov chain on S with transition matrix P and positive invariant function g. If X is reversible with respect to g then X is the random walk on the state graph with conductance function c given by c(x, y) = g(x)P(x, y) for (x, y) ∈ S^2.

Proof

Again, in the important special case that X is recurrent, there exists a positive invariant function g that is unique up to
multiplication by positive constants. In this case the theorem states that an irreducible, recurrent, reversible chain is a random walk
on the state graph.

Examples and Applications

The Wheatstone Bridge Graph
The graph below is called the Wheatstone bridge in honor of Charles Wheatstone.

Figure 16.14.1: The Wheatstone bridge network, with conductance values in red
In this subsection, let X be the random walk on the Wheatstone bridge above, with the given conductance values.

For the random walk X,


1. Explicitly give the transition probability matrix P .
2. Given X_0 = a, find the probability density function of X_2.
Answer

For the random walk X,


1. Show that X is aperiodic.
2. Find the invariant probability density function.
3. Find the mean return time to each state.
4. Find lim_{n→∞} P^n.

Answer

The Cube Graph


The graph below is the 3-dimensional cube graph. The vertices are bit strings of length 3, and two vertices are connected by an
edge if and only if the bit strings differ by a single bit.

Figure 16.14.2: The cube graph with conductance values in red


In this subsection, let X denote the random walk on the cube graph above, with the given conductance values.

For the random walk X,


1. Explicitly give the transition probability matrix P .
2. Suppose that the initial distribution is the uniform distribution on {000, 001, 101, 100}. Find the probability density function of X_2.

Answer

For the random walk X,


1. Show that the chain has period 2 and find the cyclic classes.
2. Find the invariant probability density function.
3. Find the mean return time to each state.
4. Find lim_{n→∞} P^{2n}.
5. Find lim_{n→∞} P^{2n+1}.

Answer

Special Models
Recall that the basic Ehrenfest chain with m ∈ N_+ balls is reversible. Interpreting the chain as a random walk on a graph,
sketch the graph and find a conductance function.
Answer

Recall that the modified Ehrenfest chain with m ∈ N_+ balls is reversible. Interpreting the chain as a random walk on a graph,
sketch the graph and find a conductance function.
Answer

Recall that the Bernoulli-Laplace chain with j ∈ N_+ balls in urn 0, k ∈ N_+ balls in urn 1, and with r ∈ {0, …, j + k} of the balls red, is reversible. Interpreting the chain as a random walk on a graph, sketch the graph and find a conductance function.
Simplify the conductance function in the special case that j = k = r .
Answer

Random Walks on Z
Random walks on integer lattices are particularly interesting because of their classification as transient or recurrent. We consider
the one-dimensional case in this subsection, and the higher dimensional case in the next subsection.

Let X = (X_0, X_1, X_2, …) be the discrete-time Markov chain with state space Z and transition probability matrix P given by
P (x, x + 1) = p, P (x, x − 1) = 1 − p, x ∈ Z (16.14.9)

where p ∈ (0, 1). The chain X is called the simple random walk on Z with parameter p.

The term simple is used because the transition probabilities starting in state x ∈ Z do not depend on x. Thus the chain is spatially as well as temporally homogeneous. In the special case p = 1/2, the chain X is the simple symmetric random walk on Z. Basic properties of the simple random walk on Z, and in particular the simple symmetric random walk, were studied in the chapter on Bernoulli Trials. Of course, the state graph G of X has vertex set Z, and the neighbors of x ∈ Z are x + 1 and x − 1. It's not immediately clear that X is a random walk on G associated with a conductance function, which, after all, is the topic of this section. But that fact and more follow from the next result.

Let g be the function on Z defined by

\[ g(x) = \left(\frac{p}{1 - p}\right)^x, \quad x \in \mathbb{Z} \] (16.14.10)

Then

1. g(x)P(x, y) = g(y)P(y, x) for all (x, y) ∈ Z^2
2. g is invariant for X
3. X is reversible with respect to g
4. X is the random walk on Z with conductance function c given by c(x, x + 1) = p^{x+1}/(1 − p)^x for x ∈ Z.
Proof

In particular, the simple symmetric random walk is the symmetric random walk on G.

The chain X is irreducible and periodic with period 2. Moreover

\[ P^{2n}(0, 0) = \binom{2n}{n} p^n (1 - p)^n, \quad n \in \mathbb{N} \] (16.14.11)

Proof

Classification of the simple random walk on Z.

1. If p ≠ 1/2 then X is transient.
2. If p = 1/2 then X is null recurrent.

Proof
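By the standard criterion, recurrence comes down to whether ∑_n P^{2n}(0, 0) diverges. A quick numerical look (illustrative only, not part of the text) shows the partial sums growing without bound when p = 1/2, since the terms behave like 1/√(πn), and converging when p ≠ 1/2 (to 1/|2p − 1|, by the generating function of the central binomial coefficients).

```python
def partial_sum(p, N):
    """Partial sum of sum_n binom(2n, n) (p(1 - p))^n, computed with the
    term ratio binom(2n+2, n+1)/binom(2n, n) = (2n+1)(2n+2)/(n+1)^2."""
    total, term = 0.0, 1.0        # the n = 0 term is 1
    for n in range(N):
        total += term
        term *= (2 * n + 1) * (2 * n + 2) / ((n + 1) ** 2) * p * (1 - p)
    return total

for N in (100, 1_000, 10_000, 100_000):
    print(N, round(partial_sum(0.5, N), 1), round(partial_sum(0.6, N), 4))
# The p = 0.5 column keeps growing; the p = 0.6 column settles at 5.0
```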

So for the one-dimensional lattice Z, the random walk X is transient in the non-symmetric case, and null recurrent in the symmetric case. Let's return to the invariant functions of X.

Consider again the random walk X on Z with parameter p ∈ (0, 1). The constant function 1 on Z and the function g given by

\[ g(x) = \left(\frac{p}{1 - p}\right)^x, \quad x \in \mathbb{Z} \] (16.14.13)

are invariant for X. All other invariant functions are linear combinations of these two functions.
Proof

Note that when p = 1/2, the constant function 1 is the only positive invariant function, up to multiplication by positive constants. But we know this has to be the case since the chain is recurrent when p = 1/2. Moreover, the chain is reversible. In the non-symmetric case, when p ≠ 1/2, we have an example of a transient chain which nonetheless has non-trivial invariant functions, in fact a two-dimensional space of such functions. Also, X is reversible with respect to g, as shown above, but the reversal of X with respect to 1 is the chain with transition matrix Q given by Q(x, y) = P(y, x) for (x, y) ∈ Z^2. This chain is just the simple random walk on Z with parameter 1 − p. So the non-symmetric simple random walk is an example of a transient chain that is reversible with respect to one invariant measure but not with respect to another invariant measure.

Random Walks on Z^k

More generally, we now consider Z^k, where k ∈ N_+. For i ∈ {1, 2, …, k}, let u_i ∈ Z^k denote the unit vector with 1 in position i and 0 elsewhere. The k-dimensional integer lattice G has vertex set Z^k, and the neighbors of x ∈ Z^k are x ± u_i for i ∈ {1, 2, …, k}. So in particular, each vertex has 2k neighbors.

Let X = (X_1, X_2, …) be the Markov chain on Z^k with transition probability matrix P given by

\[ P(x, x + u_i) = p_i, \; P(x, x - u_i) = q_i; \quad x \in \mathbb{Z}^k, \; i \in \{1, 2, \ldots, k\} \] (16.14.15)

where p_i > 0, q_i > 0 for i ∈ {1, 2, …, k} and ∑_{i=1}^k (p_i + q_i) = 1. The chain X is the simple random walk on Z^k with parameters p = (p_1, p_2, …, p_k) and q = (q_1, q_2, …, q_k).

Again, the term simple means that the transition probabilities starting at x ∈ Z^k do not depend on x, so that the chain is spatially homogeneous as well as temporally homogeneous. In the special case that p_i = q_i = 1/(2k) for i ∈ {1, 2, …, k}, X is the simple symmetric random walk on Z^k. The following theorem is the natural generalization of the result above for the one-dimensional case.

Define the function g : Z^k → (0, ∞) by

\[ g(x_1, x_2, \ldots, x_k) = \prod_{i=1}^{k} \left(\frac{p_i}{q_i}\right)^{x_i}, \quad (x_1, x_2, \ldots, x_k) \in \mathbb{Z}^k \] (16.14.16)

Then

1. g(x)P(x, y) = g(y)P(y, x) for all x, y ∈ Z^k
2. g is invariant for X.
3. X is reversible with respect to g.
4. X is the random walk on G with conductance function c given by c(x, y) = g(x)P(x, y) for x, y ∈ Z^k.
Proof

It terms of recurrence and transience, it would certainly seem that the larger the dimension k , the less likely the chain is to be
recurrent. That's generally true:

Classification of the simple random walk on Z^k.

1. For k ∈ {1, 2}, X is null recurrent in the symmetric case and transient for all other values of the parameters.

2. For k ∈ {3, 4, …}, X is transient for all values of the parameters.
Proof sketch

So for the simple, symmetric random walk on the integer lattice Z^k, we have the following interesting dimensional phase shift: the chain is null recurrent in dimensions 1 and 2 and transient in dimensions 3 or more.
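This phase shift shows up clearly in simulation. The Monte Carlo sketch below (illustrative only, not part of the text) estimates the probability that the symmetric walk returns to the origin within a fixed horizon; the horizon makes these biased, finite-time estimates, so the k = 1, 2 values creep toward 1 only slowly as the horizon grows, while the k = 3 value stabilizes near Pólya's constant ≈ 0.34.

```python
import random

def return_within(k, steps, runs=2_000):
    """Fraction of runs in which the simple symmetric random walk on Z^k
    returns to the origin within `steps` steps."""
    hits = 0
    for _ in range(runs):
        x = [0] * k
        for _ in range(steps):
            i = random.randrange(k)          # pick a coordinate uniformly
            x[i] += random.choice((-1, 1))   # step +/- 1, so p_i = q_i = 1/(2k)
            if not any(x):
                hits += 1
                break
    return hits / runs

for k in (1, 2, 3):
    print(k, return_within(k, steps=5_000))
```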


Let's return to the positive invariant functions for X . Again, the results generalize those for the one-dimensional case.

For J ⊆ {1, 2, …, k}, define g_J on Z^k by

\[ g_J(x_1, x_2, \ldots, x_k) = \prod_{j \in J} \left(\frac{p_j}{q_j}\right)^{x_j}, \quad (x_1, x_2, \ldots, x_k) \in \mathbb{Z}^k \] (16.14.18)

Let X_J denote the simple random walk on Z^k with transition matrix P_J, corresponding to the parameter vectors p_J and q_J, where p_{J,j} = p_j and q_{J,j} = q_j for j ∈ J, while p_{J,j} = q_j and q_{J,j} = p_j for j ∉ J. Then

1. g_J(x)P(x, y) = g_J(y)P_J(y, x) for all x, y ∈ Z^k
2. g_J is invariant for X.
3. X_J is the reversal of X with respect to g_J.

Proof

Note that when J = ∅, g_J = 1, and when J = {1, 2, …, k}, g_J = g, the invariant function introduced above. So in the completely non-symmetric case where p_i ≠ q_i for every i ∈ {1, 2, …, k}, the random walk X has 2^k positive invariant functions that are linearly independent, and X is reversible with respect to one of them.

This page titled 16.14: Random Walks on Graphs is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

16.15: Introduction to Continuous-Time Markov Chains

This section begins our study of Markov processes in continuous time and with discrete state spaces. Recall that a Markov process
with a discrete state space is called a Markov chain, so we are studying continuous-time Markov chains. It will be helpful if you
review the section on general Markov processes, at least briefly, to become familiar with the basic notation and concepts. Also,
discrete-time chains play a fundamental role, so you will need to review this topic as well.
We will study continuous-time Markov chains from different points of view. Our point of view in this section, involving holding
times and the embedded discrete-time chain, is the most intuitive from a probabilistic point of view, and so is the best place to start.
In the next section, we study the transition probability matrices in continuous time. This point of view is somewhat less intuitive,
but is closest to how other types of Markov processes are treated. Finally, in the third introductory section we study the Markov
chain from the view point of potential matrices. This is the least intuitive approach, but analytically one of the best. Naturally, the
interconnections between the various approaches are particularly important.

Preliminaries
As usual, we start with a probability space (Ω, F , P), so that Ω is the set of outcomes, F the σ-algebra of events, and P the
probability measure on the sample space (Ω, F ). The time space is ([0, ∞), T ) where as usual, T is the Borel σ-algebra on
[0, ∞) corresponding to the standard Euclidean topology. The state space is (S, S ) where S is countable and S is the power set

of S . So every subset of S is measurable, as is every function from S to another measurable space. Recall that S is also the Borel
σ algebra corresponding to the discrete topology on S . With this topology, every function from S to another topological space is

continuous. Counting measure # is the natural measure on (S, S ), so in the context of the general introduction, integrals over S
are simply sums. Also, kernels on S can be thought of as matrices, with rows and sums indexed by S . The left and right kernel
operations are generalizations of matrix multiplication.
Suppose now that X = {X_t : t ∈ [0, ∞)} is a stochastic process with state space (S, S). For t ∈ [0, ∞), let F_t^0 = σ{X_s : s ∈ [0, t]}, so that F_t^0 is the σ-algebra of events defined by the process up to time t. The collection of σ-algebras F^0 = {F_t^0 : t ∈ [0, ∞)} is the natural filtration associated with X. For technical reasons, it's often necessary to have a filtration F = {F_t : t ∈ [0, ∞)} that is slightly finer than the natural one, so that F_t^0 ⊆ F_t for t ∈ [0, ∞) (or in equivalent jargon, X is adapted to F). See the general introduction for more details on the common ways that the natural filtration is refined. We will also let G_t = σ{X_s : s ∈ [t, ∞)}, the σ-algebra of events defined by the process from time t onward. If t is thought of as the present time, then F_t is the collection of events in the past and G_t is the collection of events in the future.

It's often necessary to impose assumptions on the continuity of the process X in time. Recall that X is right continuous if t ↦ X_t(ω) is right continuous on [0, ∞) for every ω ∈ Ω, and similarly X has left limits if t ↦ X_t(ω) has left limits on (0, ∞) for every ω ∈ Ω. Since S has the discrete topology, note that if X is right continuous, then for every t ∈ [0, ∞) and ω ∈ Ω, there exists ϵ (depending on t and ω) such that X_{t+s}(ω) = X_t(ω) for s ∈ [0, ϵ). Similarly, if X has left limits, then for every t ∈ (0, ∞) and ω ∈ Ω there exists δ (depending on t and ω) such that X_{t−s}(ω) is constant for s ∈ (0, δ).

The Markov Property


There are a number of equivalent ways to state the Markov property. At the most basic level, the property states that the past and
future are conditionally independent, given the present.

The process X = {X_t : t ∈ [0, ∞)} is a Markov chain on S if for every t ∈ [0, ∞), A ∈ F_t, and B ∈ G_t,

\[ P(A \cap B \mid X_t) = P(A \mid X_t) P(B \mid X_t) \] (16.15.1)

Another version is that the conditional distribution of a state in the future, given the past, is the same as the conditional distribution
just given the present state.

The process X = {X_t : t ∈ [0, ∞)} is a Markov chain on S if for every s, t ∈ [0, ∞) and x ∈ S,

\[ P(X_{s+t} = x \mid F_s) = P(X_{s+t} = x \mid X_s) \] (16.15.2)

Technically, in the last two definitions, we should say that X is a Markov process relative to the filtration F. But recall that if X
satisfies the Markov property relative to a filtration, then it satisfies the Markov property relative to any coarser filtration, and in
particular, relative to the natural filtration. For the natural filtration, the Markov property can also be stated without explicit
reference to σ-algebras, although at the cost of additional clutter:

The process X = {X_t : t ∈ [0, ∞)} is a Markov chain on S if and only if for every n ∈ N_+, time sequence (t_1, t_2, …, t_n) ∈ [0, ∞)^n with t_1 < t_2 < ⋯ < t_n, and state sequence (x_1, x_2, …, x_n) ∈ S^n,

\[ P(X_{t_n} = x_n \mid X_{t_1} = x_1, X_{t_2} = x_2, \ldots, X_{t_{n-1}} = x_{n-1}) = P(X_{t_n} = x_n \mid X_{t_{n-1}} = x_{n-1}) \] (16.15.3)

As usual, we also assume that our Markov chain X is time homogeneous, so that P(X_{s+t} = y ∣ X_s = x) = P(X_t = y ∣ X_0 = x) for s, t ∈ [0, ∞) and x, y ∈ S. So, for a homogeneous Markov chain on S, the process {X_{s+t} : t ∈ [0, ∞)} given X_s = x is independent of F_s and equivalent to the process {X_t : t ∈ [0, ∞)} given X_0 = x, for every s ∈ [0, ∞) and x ∈ S. That is, if the chain is in state x ∈ S at a particular time s ∈ [0, ∞), it does not matter how the chain got to x; the chain essentially starts over in state x.

The Strong Markov Property


Random times play an important role in the study of continuous-time Markov chains. It's often necessary to allow random times to take the value ∞, so formally, a random time τ is a random variable on the underlying sample space (Ω, F) taking values in [0, ∞]. Recall also that a random time τ is a stopping time (also called a Markov time or an optional time) if {τ ≤ t} ∈ F_t for every t ∈ [0, ∞). If τ is a stopping time, the σ-algebra associated with τ is

\[ F_\tau = \{A \in F : A \cap \{\tau \le t\} \in F_t \text{ for all } t \in [0, \infty)\} \] (16.15.4)

So F_τ is the collection of events up to the random time τ in the same way that F_t is the collection of events up to the deterministic time t ∈ [0, ∞). We usually want the Markov property to extend from deterministic times to stopping times.

The process X = {X_t : t ∈ [0, ∞)} is a strong Markov chain on S if for every stopping time τ, t ∈ [0, ∞), and x ∈ S,

\[ P(X_{\tau + t} = x \mid F_\tau) = P(X_{\tau + t} = x \mid X_\tau) \] (16.15.5)

So, for a homogeneous strong Markov chain on S, the process {X_{τ+t} : t ∈ [0, ∞)} given X_τ = x is independent of F_τ and equivalent to the process {X_t : t ∈ [0, ∞)} given X_0 = x, for every stopping time τ and x ∈ S. That is, if the chain is in state x ∈ S at a stopping time τ, then the chain essentially starts over at x, independently of the past.

Holding Times and the Jump Chain


For our first point of view, we will study when and how our Markov chain X changes state. The discussion depends heavily on
properties of the exponential distribution, so we need a quick review.

The Exponential Distribution


A random variable τ has the exponential distribution with rate parameter r ∈ (0, ∞) if τ has a continuous distribution on [0, ∞) with probability density function f given by f(t) = r e^{−rt} for t ∈ [0, ∞). Equivalently, the right distribution function F^c is given by

\[ F^c(t) = P(\tau > t) = e^{-rt}, \quad t \in [0, \infty) \] (16.15.6)

The mean of the distribution is 1/r and the variance is 1/r^2. The exponential distribution has an amazing number of characterizations. One of the most important is the memoryless property, which states that a random variable τ with values in [0, ∞) has an exponential distribution if and only if the conditional distribution of τ − s given τ > s is the same as the distribution of τ itself, for every s ∈ [0, ∞). It's easy to see that the memoryless property is equivalent to the law of exponents for the right distribution function F^c, namely F^c(s + t) = F^c(s)F^c(t) for s, t ∈ [0, ∞). Since F^c is right continuous, the only solutions are exponential functions.
For our study of continuous-time Markov chains, it's helpful to extend the exponential distribution to two degenerate cases: τ = 0 with probability 1, and τ = ∞ with probability 1. In terms of the parameter, the first case corresponds to r = ∞ so that F^c(t) = P(τ > t) = 0 for every t ∈ [0, ∞), and the second case corresponds to r = 0 so that F^c(t) = P(τ > t) = 1 for every t ∈ [0, ∞). Note that in both cases, the function F^c satisfies the law of exponents, and so corresponds to a memoryless distribution in a general sense. In all cases, the mean of the exponential distribution with parameter r ∈ [0, ∞] is 1/r, where we interpret 1/0 = ∞ and 1/∞ = 0.

Holding Times
The Markov property implies the memoryless property for the random time when a Markov process first leaves its initial state. It
follows that this random time must have an exponential distribution.

Suppose that X = {X_t : t ∈ [0, ∞)} is a Markov chain on S, and let τ = inf{t ∈ [0, ∞) : X_t ≠ X_0}. For x ∈ S, the conditional distribution of τ given X_0 = x is exponential with parameter λ(x) ∈ [0, ∞].

Proof

So, associated with the Markov chain X on S is a function λ : S → [0, ∞] that gives the exponential parameters for the holding
times in the states. Considering the ordinary exponential distribution, and the two degenerate versions, we are led to the following
classification of states:

Suppose again that X = {X_t : t ∈ [0, ∞)} is a Markov chain on S with exponential parameter function λ. Let x ∈ S.

1. If λ(x) = 0 then P(τ = ∞ ∣ X_0 = x) = 1, and x is said to be an absorbing state.
2. If λ(x) ∈ (0, ∞) then P(0 < τ < ∞ ∣ X_0 = x) = 1, and x is said to be a stable state.
3. If λ(x) = ∞ then P(τ = 0 ∣ X_0 = x) = 1, and x is said to be an instantaneous state.

As you can imagine, an instantaneous state corresponds to weird behavior, since the chain starting in the state leaves the state at
times arbitrarily close to 0. While mathematically possible, instantaneous states make no sense in most applications, and so are to
be avoided. Also, the proof of the last result has some technical holes. We did not really show that τ is a valid random time, let
alone a stopping time. Fortunately, one of our standard assumptions resolves these problems.

Suppose again that X = {X_t : t ∈ [0, ∞)} is a Markov chain on S. If the process X and the filtration F are right continuous, then

1. τ is a stopping time.
2. X has no instantaneous states.
3. P(X_τ ≠ x ∣ X_0 = x) = 1 if x ∈ S is stable.
4. X is a strong Markov process.


Proof

There is actually a converse to part (b) that states that if X has no instantaneous states, then there is a version of X that is right
continuous. From now on, we will assume that our Markov chains are right continuous with probability 1, and hence have no
instantaneous states. On the other hand, absorbing states are perfectly reasonable and often do occur in applications. Finally, if the
chain enters a stable state, it will stay there for a (proper) exponentially distributed time, and then leave.

The Jump Chain


Without instantaneous states, we can now construct a sequence of stopping times. Basically, we let τ_n denote the nth time that the chain changes state for n ∈ N_+, unless the chain has previously been caught in an absorbing state. Here is the formal construction:

Suppose again that X = {X_t : t ∈ [0, ∞)} is a Markov chain on S. Let τ_0 = 0 and τ_1 = inf{t ∈ [0, ∞) : X_t ≠ X_0}. Recursively, suppose that τ_n is defined for n ∈ N_+. If τ_n = ∞ let τ_{n+1} = ∞. Otherwise, let

\[ \tau_{n+1} = \inf\{t \in [\tau_n, \infty) : X_t \ne X_{\tau_n}\} \] (16.15.9)

Let M = sup{n ∈ N : τ_n < ∞}.

In the definition of M, of course, sup(N) = ∞, so M is the number of changes of state. If M < ∞, the chain was sucked into an absorbing state at time τ_M. Since we have ruled out instantaneous states, the sequence of random times is strictly increasing up until the (random) term M. That is, with probability 1, if n ∈ N and τ_n < ∞ then τ_n < τ_{n+1}. Of course by construction, if τ_n = ∞ then τ_{n+1} = ∞. The increments τ_{n+1} − τ_n for n ∈ N with n < M are the times spent in the states visited by X. The process at the random times when the state changes forms an embedded discrete-time Markov chain.

Suppose again that X = {X_t : t ∈ [0, ∞)} is a Markov chain on S. Let {τ_n : n ∈ N} denote the stopping times and M the random index, as defined above. For n ∈ N, let Y_n = X_{τ_n} if n ≤ M and Y_n = X_{τ_M} if n > M. Then Y = {Y_n : n ∈ N} is a (homogeneous) discrete-time Markov chain on S, known as the jump chain of X.


Proof

As noted in the proof, the one-step transition probability matrix Q for the jump chain Y is given for (x, y) ∈ S^2 by

\[ Q(x, y) = \begin{cases} P(X_\tau = y \mid X_0 = x), & x \text{ stable} \\ I(x, y), & x \text{ absorbing} \end{cases} \] (16.15.13)

where I is the identity matrix on S. Of course Q satisfies the usual properties of a probability matrix on S, namely Q(x, y) ≥ 0 for (x, y) ∈ S^2 and ∑_{y∈S} Q(x, y) = 1 for x ∈ S. But Q satisfies another interesting property as well. Since the state actually changes at time τ starting in a stable state, we must have Q(x, x) = 0 if x is stable and Q(x, x) = 1 if x is absorbing.
Given the initial state, the holding time and the next state are independent.

If x, y ∈ S and t ∈ [0, ∞) then P(Y_1 = y, τ_1 > t ∣ Y_0 = x) = Q(x, y) e^{−λ(x)t}.

Proof

The following theorem is a generalization. The changes in state and the holding times are independent, given the initial state.

Suppose that n ∈ N_+ and that (x_0, x_1, …, x_n) is a sequence of stable states and (t_1, t_2, …, t_n) is a sequence in [0, ∞). Then

\[ P(Y_1 = x_1, \tau_1 > t_1, Y_2 = x_2, \tau_2 - \tau_1 > t_2, \ldots, Y_n = x_n, \tau_n - \tau_{n-1} > t_n \mid Y_0 = x_0) = Q(x_0, x_1) e^{-\lambda(x_0) t_1} Q(x_1, x_2) e^{-\lambda(x_1) t_2} \cdots Q(x_{n-1}, x_n) e^{-\lambda(x_{n-1}) t_n} \]

Proof

Regularity
We now know quite a bit about the structure of a continuous-time Markov chain X = {X_t : t ∈ [0, ∞)} (without instantaneous states). Once the chain enters a given state x ∈ S, the holding time in state x has an exponential distribution with parameter λ(x) ∈ [0, ∞), after which the next state y ∈ S is chosen, independently of the holding time, with probability Q(x, y). However, we don't know everything about the chain. For the sequence {τ_n : n ∈ N} defined above, let τ_∞ = lim_{n→∞} τ_n, which exists in (0, ∞] of course, since the sequence is increasing. Even though the holding time in a state is positive with probability 1, it's possible that τ_∞ < ∞ with positive probability, in which case we know nothing about X_t for t ≥ τ_∞. The event {τ_∞ < ∞} is known as explosion, since it means that X makes infinitely many transitions before the finite time τ_∞. While not as pathological as the existence of instantaneous states, explosion is still to be avoided in most applications.

A Markov chain X = {X_t : t ∈ [0, ∞)} on S is regular if each of the following events has probability 1:

1. X is right continuous.
2. τ_n → ∞ as n → ∞.

There is a simple condition on the exponential parameters and the embedded chain that is equivalent to condition (b).

Suppose that X = {X_t : t ∈ [0, ∞)} is a right-continuous Markov chain on S with exponential parameter function λ and embedded chain Y = (Y_0, Y_1, …). Then τ_n → ∞ as n → ∞ with probability 1 if and only if ∑_{n=0}^∞ 1/λ(Y_n) = ∞ with probability 1.
Proof

If λ is bounded, then X is regular.

Suppose that X = {X t : t ∈ [0, ∞)} is a Markov chain on S with exponential parameter function λ . If λ is bounded, then X
is regular.
Proof

Here is another sufficient condition that is useful when the state space is infinite.

Suppose that X = {X_t : t ∈ [0, ∞)} is a Markov chain on S with exponential parameter function λ : S → [0, ∞). Let S_+ = {x ∈ S : λ(x) > 0}. Then X is regular if

\[ \sum_{x \in S_+} \frac{1}{\lambda(x)} = \infty \] (16.15.19)

Proof

As a corollary, note that if S is finite then λ is bounded, so a continuous-time Markov chain on a finite state space is regular. So to
review, if the exponential parameter function λ is finite, the chain X has no instantaneous states. Even better, if λ is bounded or if
the conditions in the last theorem are satisfied, then X is regular. A continuous-time Markov chain with bounded exponential
parameter function λ is called uniform, for reasons that will become clear in the next section on transition matrices. As we will see in a later section, a uniform continuous-time Markov chain can be constructed from a discrete-time chain and an independent Poisson process. For the next result, recall that to say that X has left limits with probability 1 means that the random function t ↦ X_t has limits from the left on (0, ∞) with probability 1.
limits from the left on (0, ∞) with probability 1.

If X = {X_t : t ∈ [0, ∞)} is regular then X has left limits with probability 1.


Proof

Thus, our standard assumption will be that X = {X_t : t ∈ [0, ∞)} is a regular Markov chain on S. For such a chain, the behavior of X is completely determined by the exponential parameter function λ that governs the holding times, and the transition
of X is completely determined by the exponential parameter function λ that governs the holding times, and the transition
probability matrix Q of the jump chain Y . Conversely, when modeling real stochastic systems, we often start with λ and Q. It's
then relatively straightforward to construct the continuous-time Markov chain that has these parameters. For simplicity, we will
assume that there are no absorbing states. The inclusion of absorbing states is not difficult, but mucks up the otherwise elegant
exposition.

Suppose that λ : S → (0, ∞) is bounded and that Q is a probability matrix on S with the property that Q(x, x) = 0 for every x ∈ S. The regular, continuous-time Markov chain X = {X_t : t ∈ [0, ∞)} with exponential parameter function λ and jump transition matrix Q can be constructed as follows:

1. First construct the jump chain Y = (Y_0, Y_1, …) having transition matrix Q.
2. Next, given Y = (x_0, x_1, …), the transition times (τ_1, τ_2, …) are constructed so that the holding times (τ_1, τ_2 − τ_1, …) are independent and exponentially distributed with parameters (λ(x_0), λ(x_1), …).
3. Again given Y = (x_0, x_1, …), define X_t = x_0 for 0 ≤ t < τ_1, and for n ∈ N_+, define X_t = x_n for τ_n ≤ t < τ_{n+1}.

Additional details
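The construction translates directly into a simulation algorithm. The sketch below is illustrative (not from the text): sample the jump chain from Q, and attach an independent exponential holding time, with the current state's rate, to each visit. The demo chain uses the exponential parameters and jump matrix of the computational exercise later in this section.

```python
import random

def simulate_ctmc(lam, Q, x0, t_max):
    """Simulate a regular continuous-time Markov chain up to time t_max.
    lam: dict of exponential rates; Q: dict of dicts of jump probabilities,
    with Q[x][x] absent (no jumps to self). Returns (jump time, state) pairs."""
    t, x, path = 0.0, x0, [(0.0, x0)]
    while True:
        t += random.expovariate(lam[x])      # holding time in state x
        if t >= t_max:
            return path
        targets = list(Q[x])
        x = random.choices(targets, weights=[Q[x][y] for y in targets])[0]
        path.append((t, x))

lam = {0: 4.0, 1: 1.0, 2: 3.0}
Q = {0: {1: 1/2, 2: 1/2}, 1: {0: 1.0}, 2: {0: 1/3, 1: 2/3}}
print(simulate_ctmc(lam, Q, x0=0, t_max=2.0))
```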

Often, particularly when S is finite, the essential structure of a standard, continuous-time Markov chain can be succinctly
summarized with a graph.

Suppose again that X = {X_t : t ∈ [0, ∞)} is a regular Markov chain on S, with exponential parameter function λ and embedded transition matrix Q. The state graph of X is the graph with vertex set S and directed edge set E = {(x, y) ∈ S^2 : Q(x, y) > 0}. The graph is labeled as follows:

1. Each vertex x ∈ S is labeled with the exponential parameter λ(x).
2. Each edge (x, y) ∈ E is labeled with the transition probability Q(x, y).

So except for the labels on the vertices, the state graph of X is the same as the state graph of the discrete-time jump chain Y . That
is, there is a directed edge from state x to state y if and only if the chain, when in x, can move to y after the random holding time in
x. Note that the only loops in the state graph correspond to absorbing states, and for such a state there are no outward edges.

Let's return again to the construction above of a continuous-time Markov chain from the jump transition matrix Q and the
exponential parameter function λ . Again for simplicity, assume there are no absorbing states. We assume that Q(x, x) = 0 for all
x ∈ S , so that the state really does change at the transition times. However, if we drop this assumption, the construction still

produces a continuous-time Markov chain, but with an altered jump transition matrix and exponential parameter function.

Suppose that Q is a transition matrix on S × S with Q(x, x) < 1 for x ∈ S, and that λ : S → (0, ∞) is bounded. The stochastic process X = {X_t : t ∈ [0, ∞)} constructed above from Q and λ is a regular, continuous-time Markov chain with exponential parameter function λ̃ and jump transition matrix Q̃ given by

\[ \tilde{\lambda}(x) = \lambda(x)[1 - Q(x, x)], \quad x \in S \]

\[ \tilde{Q}(x, y) = \frac{Q(x, y)}{1 - Q(x, x)}, \quad (x, y) \in S^2, \; x \ne y \]

Proof 1
Proof 2

This construction will be important in our study of chains subordinate to the Poisson process.

Transition Times
The structure of a regular Markov chain on S , as described above, can be explained purely in terms of a family of independent,
exponentially distributed random variables. The main tools are some additional special properties of the exponential distribution,
that we need to restate in the setting of our Markov chain. Our interest is in how the process evolves among the stable states until it
enters an absorbing state (if it does). Once in an absorbing state, the chain stays there forever, so the behavior from that point on is
trivial.

Suppose that X = {X_t : t ∈ [0, ∞)} is a regular Markov chain on S, with exponential parameter function λ and transition probability matrix Q. Define μ(x, y) = λ(x)Q(x, y) for (x, y) ∈ S^2. Then

1. λ(x) = ∑_{y∈S} μ(x, y) for x ∈ S.
2. Q(x, y) = μ(x, y)/λ(x) if (x, y) ∈ S^2 and x is stable.

The main point is that the new parameters μ(x, y) for (x, y) ∈ S^2 determine the exponential parameters λ(x) for x ∈ S, and the transition probabilities Q(x, y) when x ∈ S is stable and y ∈ S. Of course we know that if λ(x) = 0, so that x is absorbing, then Q(x, x) = 1. So in fact, the new parameters, as specified by the function μ, completely determine the old parameters, as specified by the functions λ and Q. But so what?

Consider the functions μ, λ, and Q as given in the previous result. Suppose that T_{x,y} has the exponential distribution with parameter μ(x, y) for each (x, y) ∈ S^2, and that {T_{x,y} : (x, y) ∈ S^2} is a set of independent random variables. Then

1. T_x = inf{T_{x,y} : y ∈ S} has the exponential distribution with parameter λ(x) for x ∈ S.
2. P(T_x = T_{x,y}) = Q(x, y) for (x, y) ∈ S^2.

Proof

So here's how we can think of a regular, continuous-time Markov chain on S: There is a timer associated with each (x, y) ∈ S^2, set to the random time T_{x,y}. All of the timers function independently. When the chain enters state x ∈ S, the timers on (x, y) for y ∈ S are started simultaneously. As soon as the first alarm goes off for a particular (x, y), the chain immediately moves to state y, and the process repeats. Of course, if μ(x, y) = 0 then T_{x,y} = ∞ with probability 1, so only the timers with λ(x) > 0 and Q(x, y) > 0 matter (these correspond to the non-loop edges in the state graph). In particular, if x is absorbing, then the timers on (x, y) are set to infinity for each y, and no alarm ever sounds.
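This race-of-timers description is easy to verify by simulation. The sketch below (illustrative, not part of the text) fixes a state x with two assumed outgoing rates μ(x, b) = 2 and μ(x, c) = 3: the winning time is exponential with rate λ(x) = 5, and each timer wins with probability μ(x, y)/λ(x) = Q(x, y).

```python
import random
from statistics import mean

mu = {"b": 2.0, "c": 3.0}            # assumed rates mu(x, y) out of state x
lam = sum(mu.values())               # lambda(x) = 5.0

winner_counts = {y: 0 for y in mu}
winning_times = []
for _ in range(100_000):
    times = {y: random.expovariate(rate) for y, rate in mu.items()}
    winner = min(times, key=times.get)
    winner_counts[winner] += 1
    winning_times.append(times[winner])

print(mean(winning_times), 1 / lam)                        # both near 0.2
print({y: n / 100_000 for y, n in winner_counts.items()})  # near {b: 0.4, c: 0.6}
```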
The new collection of exponential parameters can be used to give an alternate version of the state graph. Again, the vertex set is S and the edge set is E = {(x, y) ∈ S^2 : Q(x, y) > 0}. But now each edge (x, y) is labeled with the exponential rate parameter μ(x, y). The exponential rate parameters are closely related to the generator matrix, a matrix of fundamental importance that we will study in the next section.

Examples and Exercises


The Two-State Chain
The two-state chain is the simplest non-trivial, continuous-time Markov chain, yet it illustrates many of the important properties of general continuous-time chains. So consider the Markov chain X = {X_t : t ∈ [0, ∞)} on the set of states S = {0, 1}, with transition rate a ∈ [0, ∞) from 0 to 1 and transition rate b ∈ [0, ∞) from 1 to 0.

The transition matrix Q for the embedded chain is given below. Draw the state graph in each case.

1. \( Q = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \) if a > 0 and b > 0, so that both states are stable.
2. \( Q = \begin{bmatrix} 1 & 0 \\ 1 & 0 \end{bmatrix} \) if a = 0 and b > 0, so that state 0 is absorbing and state 1 is stable.
3. \( Q = \begin{bmatrix} 0 & 1 \\ 0 & 1 \end{bmatrix} \) if a > 0 and b = 0, so that state 0 is stable and state 1 is absorbing.
4. \( Q = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \) if a = 0 and b = 0, so that both states are absorbing.

We will return to the two-state chain in subsequent sections.

Computational Exercises
Consider the Markov chain X = {X_t : t ∈ [0, ∞)} on S = {0, 1, 2} with exponential parameter function λ = (4, 1, 3) and embedded transition matrix

\[ Q = \begin{bmatrix} 0 & 1/2 & 1/2 \\ 1 & 0 & 0 \\ 1/3 & 2/3 & 0 \end{bmatrix} \] (16.15.27)

1. Draw the state graph and classify the states.


2. Find the matrix of transition rates.
3. Classify the jump chain in terms of recurrence and period.
4. Find the invariant distribution of the jump chain.
Answer
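Parts of this exercise can be computed directly. The sketch below is illustrative and is not the text's hidden answer: it forms the matrix of transition rates μ(x, y) = λ(x)Q(x, y) from the previous subsection and finds the invariant distribution of the jump chain as a normalized left eigenvector of Q for eigenvalue 1.

```python
import numpy as np

lam = np.array([4.0, 1.0, 3.0])
Q = np.array([[0.0, 1/2, 1/2],
              [1.0, 0.0, 0.0],
              [1/3, 2/3, 0.0]])

M = lam[:, None] * Q               # transition rates mu(x, y) = lam(x) Q(x, y)
print(M)

# Invariant distribution of the jump chain: left eigenvector for eigenvalue 1
w, V = np.linalg.eig(Q.T)
v = np.real(V[:, np.argmin(np.abs(w - 1))])
print(v / v.sum())
```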

Special Models
Read the introduction to chains subordinate to the Poisson process.

Read the introduction to birth-death chains.

Read the introduction to continuous-time queuing chains.

Read the introduction to continuous-time branching chains.

This page titled 16.15: Introduction to Continuous-Time Markov Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or
curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed
edit history is available upon request.

16.16: Transition Matrices and Generators of Continuous-Time Chains



Preliminaries
This is the second of the three introductory sections on continuous-time Markov chains. Thus, suppose that X = {X_t : t ∈ [0, ∞)} is a continuous-time Markov chain defined on an underlying probability space (Ω, F, P) and with state

space (S, S ). By the very meaning of Markov chain, the set of states S is countable and the σ-algebra S is the collection of all
subsets of S . So every subset of S is measurable, as is every function from S to another measurable space. Recall that S is also
the Borel σ algebra corresponding to the discrete topology on S . With this topology, every function from S to another topological
space is continuous. Counting measure # is the natural measure on (S, S ), so in the context of the general introduction, integrals
over S are simply sums. Also, kernels on S can be thought of as matrices, with rows and sums indexed by S . The left and right
kernel operations are generalizations of matrix multiplication.
A space of functions on S plays an important role. Let B denote the collection of bounded functions f : S → R . With the usual
pointwise definitions of addition and scalar multiplication, B is a vector space. The supremum norm on B is given by
∥f ∥ = sup{|f (x)| : x ∈ S}, f ∈ B (16.16.1)

Of course, if S is finite, B is the set of all real-valued functions on S , and ∥f ∥ = max{|f (x)| : x ∈ S} for f ∈ B .
In the last section, we studied X in terms of when and how the state changes. To review briefly, let τ = inf{t ∈ (0, ∞) : X_t ≠ X_0}. Assuming that X is right continuous, the Markov property of X implies the memoryless property of τ, and hence the distribution of τ given X_0 = x is exponential with parameter λ(x) ∈ [0, ∞) for each x ∈ S. The assumption of right continuity rules out the pathological possibility that λ(x) = ∞, which would mean that x is an instantaneous state so that P(τ = 0 ∣ X_0 = x) = 1. On the other hand, if λ(x) ∈ (0, ∞) then x is a stable state, so that τ has a proper exponential distribution given X_0 = x with P(0 < τ < ∞ ∣ X_0 = x) = 1. Finally, if λ(x) = 0 then x is an absorbing state, so that P(τ = ∞ ∣ X_0 = x) = 1. Next we define a sequence of stopping times: first τ_0 = 0 and τ_1 = τ. Recursively, if τ_n < ∞ then τ_{n+1} = inf{t > τ_n : X_t ≠ X_{τ_n}}, while if τ_n = ∞ then τ_{n+1} = ∞. With M = sup{n ∈ N : τ_n < ∞} we define Y_n = X_{τ_n} if n ∈ N with n ≤ M and Y_n = Y_M if n ∈ N with n > M. The sequence Y = (Y_0, Y_1, …) is a discrete-time Markov chain on S with one-step transition matrix Q given by Q(x, y) = P(X_τ = y ∣ X_0 = x) if x, y ∈ S with x stable, and Q(x, x) = 1 if x ∈ S is absorbing. Assuming that X is regular, which means that τ_n → ∞ as n → ∞ with probability 1 (ruling out the explosion event of infinitely many transitions in finite time), the structure of X is completely determined by the sequence of stopping times τ = (τ_0, τ_1, …) and the discrete-time jump chain Y = (Y_0, Y_1, …). Analytically, the distribution of X is determined by the exponential parameter function λ and the one-step transition matrix Q of the jump chain.
In this section, we will study the Markov chain X in terms of the transition matrices in continuous time and a fundamentally important matrix known as the generator. Naturally, the connections between the two points of view are particularly interesting.

The Transition Semigroup


Definition and Basic Properties
The first part of our discussion is very similar to the treatment for general Markov processes, except for simplifications caused by the discrete state space. We assume that X = {X_t : t ∈ [0, ∞)} is a Markov chain on S.

The transition probability matrix P_t of X corresponding to t ∈ [0, ∞) is

P_t(x, y) = P(X_t = y ∣ X_0 = x),   (x, y) ∈ S^2   (16.16.2)

In particular, P_0 = I, the identity matrix on S.


Proof

Note that since we are assuming that the Markov chain is homogeneous,

P_t(x, y) = P(X_{s+t} = y ∣ X_s = x),   (x, y) ∈ S^2   (16.16.3)

for every s, t ∈ [0, ∞). The Chapman-Kolmogorov equation given next is essentially yet another restatement of the Markov property. The equation is named for Andrei Kolmogorov and Sydney Chapman.

Suppose that P = {P_t : t ∈ [0, ∞)} is the collection of transition matrices for the chain X. Then P_s P_t = P_{s+t} for s, t ∈ [0, ∞). Explicitly,

P_{s+t}(x, z) = ∑_{y∈S} P_s(x, y) P_t(y, z),   x, z ∈ S   (16.16.4)

Proof

Restated in semigroup jargon, the collection P = {P_t : t ∈ [0, ∞)} is a semigroup of probability matrices. The semigroup of transition matrices P, along with the initial distribution, determines the finite-dimensional distributions of X.

Suppose that X_0 has probability density function f. If (t_1, t_2, …, t_n) ∈ [0, ∞)^n is a time sequence with 0 < t_1 < ⋯ < t_n and (x_0, x_1, …, x_n) ∈ S^{n+1} is a state sequence, then

P(X_0 = x_0, X_{t_1} = x_1, …, X_{t_n} = x_n) = f(x_0) P_{t_1}(x_0, x_1) P_{t_2 − t_1}(x_1, x_2) ⋯ P_{t_n − t_{n−1}}(x_{n−1}, x_n)   (16.16.8)

Proof

As with any matrix on S , the transition matrices define left and right operations on functions which are generalizations of matrix
multiplication. For a transition matrix, both have natural interpretations.

Suppose that f : S → R, and that either f is nonnegative or f ∈ B. Then for t ∈ [0, ∞),

P_t f(x) = ∑_{y∈S} P_t(x, y) f(y) = E[f(X_t) ∣ X_0 = x],   x ∈ S   (16.16.12)

The mapping f ↦ P_t f is a bounded, linear operator on B and ∥P_t∥ = 1.


Proof

If f is nonnegative and S is infinite, then it's possible that P_t f(x) = ∞. In general, the left operation of a positive kernel acts on positive measures on the state space. In the setting here, if μ is a positive (Borel) measure on (S, S), then the function f : S → [0, ∞) given by f(x) = μ{x} for x ∈ S is the density function of μ with respect to counting measure # on (S, S). This simply means that μ(A) = ∑_{x∈A} f(x) for A ⊆ S. Conversely, given f : S → [0, ∞), the set function μ(A) = ∑_{x∈A} f(x) for A ⊆ S defines a positive measure on (S, S) with f as its density function. So for the left operation of P_t, it's natural to consider only nonnegative functions.

If f : S → [0, ∞) then

f P_t(y) = ∑_{x∈S} f(x) P_t(x, y),   y ∈ S   (16.16.13)

If X_0 has probability density function f then X_t has probability density function f P_t.

Proof

More generally, if f is the density function of a positive measure μ on (S, S) then f P_t is the density function of the measure μP_t, defined by

μP_t(A) = ∑_{x∈S} μ{x} P_t(x, A) = ∑_{x∈S} f(x) P_t(x, A),   A ⊆ S   (16.16.15)

A function f : S → [0, ∞) is invariant for the Markov chain X (or for the transition semigroup P) if f P_t = f for every t ∈ [0, ∞).

It follows that if X_0 has an invariant probability density function f, then X_t has probability density function f for every t ∈ [0, ∞), so X is identically distributed. Invariant and limiting distributions are fundamentally important for continuous-time Markov chains.

Standard Semigroups
Suppose again that X = {X_t : t ∈ [0, ∞)} is a Markov chain on S with transition semigroup P = {P_t : t ∈ [0, ∞)}. Once again, continuity assumptions need to be imposed on X in order to rule out strange behavior that would otherwise greatly complicate the theory. In terms of the transition semigroup P, here is the basic assumption:

The transition semigroup P is standard if P_t(x, x) → 1 as t ↓ 0 for each x ∈ S.

Since P_0(x, x) = 1 for x ∈ S, the standard assumption is clearly a continuity assumption. It actually implies much stronger smoothness properties that we will build up by stages.

If the transition semigroup P = {P_t : t ∈ [0, ∞)} is standard, then the function t ↦ P_t(x, y) is right continuous for each (x, y) ∈ S^2.

Proof

Our next result connects one of the basic assumptions in the section on transition times and the embedded chain with the standard
assumption here.

If the Markov chain X has no instantaneous states then the transition semigroup P is standard.
Proof

Recall that the non-existence of instantaneous states is essentially equivalent to the right continuity of X. So we have the nice result that if X is right continuous, then so is P. For the remainder of our discussion, we assume that X = {X_t : t ∈ [0, ∞)} is a regular Markov chain on S with transition semigroup P = {P_t : t ∈ [0, ∞)}, exponential parameter function λ, and one-step transition matrix Q for the jump chain. Our next result is the fundamental integral equation relating P, λ, and Q.

For t ∈ [0, ∞),

P_t(x, y) = I(x, y) e^{−λ(x)t} + ∫_0^t λ(x) e^{−λ(x)s} Q P_{t−s}(x, y) ds,   (x, y) ∈ S^2   (16.16.18)

Proof

We can now improve on the continuity result that we got earlier. First recall the leads to relation for the jump chain Y: for (x, y) ∈ S^2, x leads to y if Q^n(x, y) > 0 for some n ∈ N. So by definition, x leads to x for each x ∈ S, and for (x, y) ∈ S^2 with x ≠ y, x leads to y if and only if the discrete-time chain starting in x eventually reaches y with positive probability.

For (x, y) ∈ S^2,
1. t ↦ P_t(x, y) is continuous.
2. If x leads to y then P_t(x, y) > 0 for every t ∈ (0, ∞).
3. If x does not lead to y then P_t(x, y) = 0 for every t ∈ (0, ∞).

Proof

Parts (b) and (c) are known as the Lévy dichotomy, named for Paul Lévy. It's possible to prove the Lévy dichotomy just from the
semigroup property of P , but this proof is considerably more complicated. In light of the dichotomy, the leads to relation clearly
makes sense for the continuous-time chain X as well as the discrete-time embedded chain Y .

The Generator Matrix


Definition and Basic Properties
In this discussion, we assume again that X = {X_t : t ∈ [0, ∞)} is a regular Markov chain on S with transition semigroup P = {P_t : t ∈ [0, ∞)}, exponential parameter function λ, and one-step transition matrix Q for the embedded jump chain. The fundamental integral equation above now implies that the transition probability matrix P_t is differentiable in t. The derivative at 0 is particularly important.

The matrix function t ↦ P_t has a (right) derivative at 0:

(P_t − I)/t → G as t ↓ 0   (16.16.25)

where the infinitesimal generator matrix G is given by G(x, y) = −λ(x)I(x, y) + λ(x)Q(x, y) for (x, y) ∈ S^2.

Proof

Note that λ(x)Q(x, x) = 0 for every x ∈ S, since λ(x) = 0 if x is absorbing, while Q(x, x) = 0 if x is stable. So G(x, x) = −λ(x) for x ∈ S, and G(x, y) = λ(x)Q(x, y) for (x, y) ∈ S^2 with y ≠ x. Thus, the generator matrix G determines the exponential parameter function λ and the jump transition matrix Q, and thus determines the distribution of the Markov chain X.

Given the generator matrix G of X,


1. λ(x) = −G(x, x) for x ∈ S
2. Q(x, y) = −G(x, y)/G(x, x) if x ∈ S is stable and y ∈ S − {x}

The infinitesimal generator has a nice interpretation in terms of our discussion in the last section. Recall that when the chain first
enters a stable state x, we set independent, exponentially distributed “timers” on (x, y), for each y ∈ S − {x} . Note that G(x, y) is
the exponential parameter for the timer on (x, y). As soon as an alarm sounds for a particular (x, y), the chain moves to state y and
the process continues.
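Since G carries the same information as the pair (λ, Q), the conversion in both directions is a few lines of linear algebra. The following sketch (Python with numpy; the function names are illustrative, not standard) encodes the two formulas above.

import numpy as np

def generator(lam, Q):
    # G(x, y) = -lam(x) I(x, y) + lam(x) Q(x, y), i.e. G = diag(lam) (Q - I)
    return lam[:, None] * (Q - np.eye(len(lam)))

def lam_and_jump(G):
    # lam(x) = -G(x, x); Q(x, y) = -G(x, y)/G(x, x) for stable x; Q(x, x) = 1 if x absorbing
    lam = -np.diag(G)
    Q = np.eye(len(lam))
    for x in range(len(lam)):
        if lam[x] > 0:
            Q[x] = G[x] / lam[x]
            Q[x, x] = 0.0
    return lam, Q

lam = np.array([4.0, 1.0, 3.0])
Q = np.array([[0, 1/2, 1/2], [1, 0, 0], [1/3, 2/3, 0]])
G = generator(lam, Q)
lam2, Q2 = lam_and_jump(G)
print(np.allclose(lam, lam2), np.allclose(Q, Q2))   # True True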

The generator matrix G satisfies the following properties for every x ∈ S :


1. G(x, x) ≤ 0
2. ∑_{y∈S} G(x, y) = 0

The matrix function t ↦ P_t is differentiable on [0, ∞), and satisfies the Kolmogorov backward equation P_t′ = G P_t. Explicitly,

P_t′(x, y) = −λ(x) P_t(x, y) + ∑_{z∈S} λ(x) Q(x, z) P_t(z, y),   (x, y) ∈ S^2   (16.16.27)

Proof

The backward equation is named for Andrei Kolmogorov. In continuous time, the transition semigroup P = {P_t : t ∈ [0, ∞)} can be obtained from the single generator matrix G, in a way that is reminiscent of the fact that in discrete time, the transition semigroup P = {P^n : n ∈ N} can be obtained from the single one-step matrix P. From a modeling point of view, we often start with the generator matrix G and then solve the backward equation, subject to the initial condition P_0 = I, to obtain the semigroup of transition matrices P.
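For a finite state space, this recipe can be carried out numerically. Here is a minimal sketch (Python with scipy assumed available; the chain and the rates are arbitrary illustrations) that solves the backward equation P_t′ = G P_t with P_0 = I by an ODE solver and checks the result against the matrix exponential that appears later in this section.

import numpy as np
from scipy.integrate import solve_ivp
from scipy.linalg import expm

a, b = 2.0, 3.0                      # illustrative transition rates for the two-state chain
G = np.array([[-a, a], [b, -b]])
t1 = 1.5

# backward equation P' = G P, P_0 = I, with the matrix flattened for the solver
rhs = lambda t, p: (G @ p.reshape(2, 2)).ravel()
sol = solve_ivp(rhs, (0.0, t1), np.eye(2).ravel(), rtol=1e-10, atol=1e-12)
P_t1 = sol.y[:, -1].reshape(2, 2)

print(np.allclose(P_t1, expm(t1 * G), atol=1e-6))   # True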


As with any matrix on S , the generator matrix G defines left and right operations on functions that are analogous to ordinary
matrix multiplication. The right operation is defined for functions in B .

If f ∈ B then Gf is given by

Gf(x) = −λ(x) f(x) + ∑_{y∈S} λ(x) Q(x, y) f(y),   x ∈ S   (16.16.29)

Proof

But note that Gf is not in B unless λ ∈ B . Without this additional assumption, G is a linear operator from the vector space B of
bounded functions from S to R into the vector space of all functions from S to R. We will return to this point in our next
discussion.
Uniform Transition Semigroups
We can obtain stronger results for the generator matrix if we impose stronger continuity assumptions on P .

The transition semigroup P = {P_t : t ∈ [0, ∞)} is uniform if P_t(x, x) → 1 as t ↓ 0 uniformly in x ∈ S.

If P is uniform, then the operator function t ↦ P_t is continuous on the vector space B.

Proof

As usual, we want to look at this new assumption from different points of view.

The following are equivalent:


1. The transition semigroup P is uniform.
2. The exponential parameter function λ is bounded.
3. The generator matrix G defines a bounded linear operator on B.
Proof

So when the equivalent conditions are satisfied, the Markov chain X = {X_t : t ∈ [0, ∞)} is also said to be uniform. As we will see in a later section, a uniform, continuous-time Markov chain can be constructed from a discrete-time Markov chain and an independent Poisson process. For a uniform transition semigroup, we have a companion to the backward equation.

Suppose that P is a uniform transition semigroup. Then t ↦ P_t satisfies the Kolmogorov forward equation P_t′ = P_t G. Explicitly,

P_t′(x, y) = −λ(y) P_t(x, y) + ∑_{z∈S} P_t(x, z) λ(z) Q(z, y),   (x, y) ∈ S^2   (16.16.33)

The backward equation holds with more generality than the forward equation, since we only need the transition semigroup P to be standard rather than uniform. It would seem that we need stronger conditions on λ for the forward equation to hold, for otherwise it's not even obvious that ∑_{z∈S} P_t(x, z) λ(z) Q(z, y) is finite for (x, y) ∈ S^2. On the other hand, the forward equation is sometimes easier to solve than the backward equation, and the assumption that λ is bounded is met in many applications (and of course holds automatically if S is finite).
As a simple corollary, the transition matrices and the generator matrix commute for a uniform semigroup: P_t G = G P_t for t ∈ [0, ∞). The forward and backward equations formally look like the differential equations for the exponential function. This actually holds with the operator exponential.

Suppose again that P = {P_t : t ∈ [0, ∞)} is a uniform transition semigroup with generator G. Then

P_t = e^{tG} = ∑_{n=0}^∞ (t^n / n!) G^n,   t ∈ [0, ∞)   (16.16.34)

Proof

We can characterize the generators of uniform transition semigroups. We just need the minimal conditions that the diagonal entries
are nonpositive and the row sums are 0.

Suppose that G is a matrix on S with ∥G∥ < ∞. Then G is the generator of a uniform transition semigroup P = {P_t : t ∈ [0, ∞)} if and only if for every x ∈ S,
1. G(x, x) ≤ 0
2. ∑_{y∈S} G(x, y) = 0

Proof

Examples and Exercises


The Two-State Chain
Let X = {X_t : t ∈ [0, ∞)} be the Markov chain on the set of states S = {0, 1}, with transition rate a ∈ [0, ∞) from 0 to 1 and transition rate b ∈ [0, ∞) from 1 to 0. This two-state Markov chain was studied in the previous section. To avoid the trivial case with both states absorbing, we will assume that a + b > 0.

The generator matrix is

G = [ −a   a ]
    [  b  −b ]   (16.16.38)

Show that for t ∈ [0, ∞),

P_t = 1/(a+b) [ b  a ]  −  e^{−(a+b)t}/(a+b) [ −a   a ]   (16.16.39)
              [ b  a ]                        [  b  −b ]

1. By solving the Kolmogorov backward equation.
2. By solving the Kolmogorov forward equation.
3. By computing P_t = e^{tG}.

You probably noticed that the forward equation is easier to solve because there is less coupling of terms than in the backward
equation.
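A quick numerical sanity check of the closed form is also possible, under arbitrary illustrative values of a, b, and t (a sketch in Python; scipy's expm plays the role of part (c)):

import numpy as np
from scipy.linalg import expm

a, b, t = 2.0, 3.0, 0.7
G = np.array([[-a, a], [b, -b]])

Pt = (np.array([[b, a], [b, a]])
      - np.exp(-(a + b) * t) * np.array([[-a, a], [b, -b]])) / (a + b)

print(np.allclose(Pt, expm(t * G)))   # True: the closed form matches e^{tG}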

Define the probability density function f on S by f(0) = b/(a+b), f(1) = a/(a+b). Show that

1. P_t → (1/(a+b)) [[b, a], [b, a]] as t → ∞, the matrix with f in both rows.
2. f P_t = f for all t ∈ [0, ∞), so that f is invariant for P.
3. f G = 0.

Computational Exercises

Consider the Markov chain X = {X_t : t ∈ [0, ∞)} on S = {0, 1, 2} with exponential parameter function λ = (4, 1, 3) and embedded transition matrix

Q = [ 0    1/2  1/2 ]
    [ 1    0    0   ]
    [ 1/3  2/3  0   ]   (16.16.40)

1. Draw the state graph and classify the states.
2. Find the generator matrix G.
3. Find the transition matrix P_t for t ∈ [0, ∞).
4. Find lim_{t→∞} P_t.

Answer
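A numerical companion to parts (b)–(d) of this exercise (a sketch in Python with scipy, not a substitute for the exact answer): the generator is G = −λI + λQ, the transition matrix is P_t = e^{tG} since S is finite, and the rows of P_t settle down to a common limiting distribution as t grows.

import numpy as np
from scipy.linalg import expm

lam = np.array([4.0, 1.0, 3.0])
Q = np.array([[0, 1/2, 1/2], [1, 0, 0], [1/3, 2/3, 0]])
G = lam[:, None] * (Q - np.eye(3))     # part (b): the generator matrix

for t in [0.1, 1.0, 10.0]:
    print("P_t at t =", t, "\n", expm(t * G))   # parts (c), (d): rows converge as t grows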
Special Models

Read the discussion of generator and transition matrices for chains subordinate to the Poisson process.

Read the discussion of the infinitesimal generator for continuous-time birth-death chains.

Read the discussion of the infinitesimal generator for continuous-time queuing chains.

Read the discussion of the infinitesimal generator for continuous-time branching chains.

This page titled 16.16: Transition Matrices and Generators of Continuous-Time Chains is shared under a CC BY 2.0 license and was authored,
remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts
platform; a detailed edit history is available upon request.

16.17: Potential Matrices

Preliminaries
This is the third of the introductory sections on continuous-time Markov chains. So our starting point is a time-homogeneous Markov chain X = {X_t : t ∈ [0, ∞)} defined on an underlying probability space (Ω, F, P) and with discrete state space (S, S). Thus S is countable and S is the power set of S, so every subset of S is measurable, as is every function from S into another measurable space. In addition, S is given the discrete topology, so that S can also be thought of as the Borel σ-algebra. Every function from S to another topological space is continuous. Counting measure # is the natural measure on (S, S), so in the context of the general introduction, integrals over S are simply sums. Also, kernels on S can be thought of as matrices, with rows and columns indexed by S, so the left and right kernel operations are generalizations of matrix multiplication. As before, let B denote the collection of bounded functions f : S → R. With the usual pointwise definitions of addition and scalar multiplication, B is a vector space. The supremum norm on B is given by

∥f∥ = sup{|f(x)| : x ∈ S},   f ∈ B   (16.17.1)

Of course, if S is finite, B is the set of all real-valued functions on S, and ∥f∥ = max{|f(x)| : x ∈ S} for f ∈ B. The time space is ([0, ∞), T) where as usual, T is the Borel σ-algebra on [0, ∞) corresponding to the standard Euclidean topology. Lebesgue measure is the natural measure on ([0, ∞), T).
In our first point of view, we studied X in terms of when and how the state changes. To review briefly, let τ = inf{t ∈ (0, ∞) : X_t ≠ X_0}. Assuming that X is right continuous, the Markov property of X implies the memoryless property of τ, and hence the distribution of τ given X_0 = x is exponential with parameter λ(x) ∈ [0, ∞) for each x ∈ S. The assumption of right continuity rules out the pathological possibility that λ(x) = ∞, which would mean that x is an instantaneous state so that P(τ = 0 ∣ X_0 = x) = 1. On the other hand, if λ(x) ∈ (0, ∞) then x is a stable state, so that τ has a proper exponential distribution given X_0 = x with P(0 < τ < ∞ ∣ X_0 = x) = 1. Finally, if λ(x) = 0 then x is an absorbing state, so that P(τ = ∞ ∣ X_0 = x) = 1. Next we define a sequence of stopping times: first τ_0 = 0 and τ_1 = τ. Recursively, if τ_n < ∞ then τ_{n+1} = inf{t > τ_n : X_t ≠ X_{τ_n}}, while if τ_n = ∞ then τ_{n+1} = ∞. With M = sup{n ∈ N : τ_n < ∞} we define Y_n = X_{τ_n} if n ∈ N with n ≤ M and Y_n = Y_M if n ∈ N with n > M. The sequence Y = (Y_0, Y_1, …) is a discrete-time Markov chain on S with one-step transition matrix Q given by Q(x, y) = P(X_τ = y ∣ X_0 = x) if x, y ∈ S with x stable, and Q(x, x) = 1 if x ∈ S is absorbing. Assuming that X is regular, which means that τ_n → ∞ as n → ∞ with probability 1 (ruling out the explosion event of infinitely many transitions in finite time), the structure of X is completely determined by the sequence of stopping times τ = (τ_0, τ_1, …) and the embedded discrete-time jump chain Y = (Y_0, Y_1, …). Analytically, the distribution of X is determined by the exponential parameter function λ and the one-step transition matrix Q of the jump chain.
In our second point of view, we studied X in terms of the collection of transition matrices P = {P_t : t ∈ [0, ∞)}, where for t ∈ [0, ∞),

P_t(x, y) = P(X_t = y ∣ X_0 = x),   (x, y) ∈ S^2   (16.17.2)

The Markov and time-homogeneous properties imply the Chapman-Kolmogorov equations P_s P_t = P_{s+t} for s, t ∈ [0, ∞), so that P is a semigroup of transition matrices. The semigroup P, along with the initial distribution of X_0, completely determines the distribution of X. For a regular Markov chain X, the fundamental integral equation connecting the two points of view is

P_t(x, y) = I(x, y) e^{−λ(x)t} + ∫_0^t λ(x) e^{−λ(x)s} Q P_{t−s}(x, y) ds,   (x, y) ∈ S^2   (16.17.3)

which is obtained by conditioning on τ and X_τ. It then follows that the matrix function t ↦ P_t is differentiable, with the derivative satisfying the Kolmogorov backward equation P_t′ = G P_t, where the generator matrix G is given by

G(x, y) = −λ(x)I(x, y) + λ(x)Q(x, y),   (x, y) ∈ S^2   (16.17.4)

If the exponential parameter function λ is bounded, then the transition semigroup P is uniform, which leads to stronger results. The generator G is a bounded operator on B, and the backward equation, as well as a companion forward equation P_t′ = P_t G, holds as operators on B (so with respect to the supremum norm rather than just pointwise). Finally, we can represent the transition matrix as an exponential: P_t = e^{tG} for t ∈ [0, ∞).

In this section, we study the Markov chain X in terms of a family of matrices known as potential matrices. This is the least
intuitive of the three points of view, but analytically one of the best approaches. Essentially, the potential matrices are transforms of
the transition matrices.

Basic Theory
We assume again that X = {X_t : t ∈ [0, ∞)} is a regular Markov chain on S with transition semigroup P = {P_t : t ∈ [0, ∞)}. Our first discussion closely parallels the general theory, except for simplifications caused by the discrete state space.

Definitions and Properties


For α ∈ [0, ∞), the α-potential matrix U_α of X is defined as follows:

U_α(x, y) = ∫_0^∞ e^{−αt} P_t(x, y) dt,   (x, y) ∈ S^2   (16.17.5)

1. The special case U = U_0 is simply the potential matrix of X.
2. For (x, y) ∈ S^2, U(x, y) is the expected amount of time that X spends in y, starting at x.
3. The family of matrices U = {U_α : α ∈ (0, ∞)} is known as the resolvent of X.

Proof

It's quite possible that U(x, y) = ∞ for some (x, y) ∈ S^2, and knowing when this is the case is of considerable interest. If f : S → R and α ≥ 0, then giving the right operation in its many forms,

U_α f(x) = ∑_{y∈S} U_α(x, y) f(y) = ∫_0^∞ e^{−αt} P_t f(x) dt = ∫_0^∞ e^{−αt} [∑_{y∈S} P_t(x, y) f(y)] dt = ∫_0^∞ e^{−αt} E[f(X_t) ∣ X_0 = x] dt,   x ∈ S

assuming, as always, that the sums and integrals make sense. This will be the case in particular if f is nonnegative (although ∞ is a possible value), or as we will now see, if f ∈ B and α > 0.

If α > 0, then U_α(x, S) = 1/α for all x ∈ S.
Proof

It follows that for α ∈ (0, ∞), the right potential operator U_α is a bounded, linear operator on B with ∥U_α∥ = 1/α. It also follows that αU_α is a probability matrix. This matrix has a nice interpretation.

If α > 0 then αU_α(x, ⋅) is the conditional probability density function of X_T given X_0 = x, where T is independent of X and has the exponential distribution on [0, ∞) with parameter α.


Proof

So αU_α is a transition probability matrix, just as P_t is a transition probability matrix, but corresponding to the random time T (with α ∈ (0, ∞) as a parameter), rather than the deterministic time t ∈ [0, ∞). The potential matrix can also be interpreted in economic terms. Suppose that we receive money at a rate of one unit per unit time whenever the process X is in a particular state y ∈ S. Then U(x, y) is the expected total amount of money that we receive, starting in state x ∈ S. But money that we receive later is of less value to us now than money that we will receive sooner. Specifically, suppose that one monetary unit at time t ∈ [0, ∞) has a present value of e^{−αt}, where α ∈ (0, ∞) is the inflation factor or discount factor. Then U_α(x, y) is the total, expected, discounted amount that we receive, starting in x ∈ S. A bit more generally, suppose that f ∈ B and that f(y) is the reward (or cost, depending on the sign) per unit time that we receive when the process is in state y ∈ S. Then U_α f(x) is the expected, total, discounted reward, starting in state x ∈ S.

αU_α → I as α → ∞.
Proof

If f : S → [0, ∞), then giving the left potential operation in its various forms,

f U_α(y) = ∑_{x∈S} f(x) U_α(x, y) = ∫_0^∞ e^{−αt} f P_t(y) dt = ∫_0^∞ e^{−αt} [∑_{x∈S} f(x) P_t(x, y)] dt = ∫_0^∞ e^{−αt} [∑_{x∈S} f(x) P(X_t = y ∣ X_0 = x)] dt,   y ∈ S

In particular, suppose that α > 0 and that f is the probability density function of X_0. Then f P_t is the probability density function of X_t for t ∈ [0, ∞), and hence from the last result, α f U_α is the probability density function of X_T, where again, T is independent of X and has the exponential distribution on [0, ∞) with parameter α. The family of potential kernels gives the same information as the family of transition kernels.

The resolvent U = {U_α : α ∈ (0, ∞)} completely determines the family of transition kernels P = {P_t : t ∈ (0, ∞)}.
Proof

Although not as intuitive from a probability viewpoint, the potential matrices are in some ways nicer than the transition matrices because of additional smoothness. In particular, the resolvent {U_α : α ∈ [0, ∞)}, along with the initial distribution, completely determines the finite dimensional distributions of the Markov chain X. The potential matrices commute with the transition matrices and with each other.

Suppose that α, β, t ∈ [0, ∞). Then
1. P_t U_α = U_α P_t = ∫_0^∞ e^{−αs} P_{s+t} ds
2. U_α U_β = U_β U_α = ∫_0^∞ ∫_0^∞ e^{−αs} e^{−βt} P_{s+t} ds dt

Proof

The equations above are matrix equations, and so hold pointwise. The same identities hold for the right operators on the space B
under the additional restriction that α > 0 and β > 0 . The fundamental equation that relates the potential kernels, known as the
resolvent equation, is given in the next theorem:

If α, β ∈ [0, ∞) with α ≤ β then U_α = U_β + (β − α) U_α U_β.


Proof

The equation above is a matrix equation, and so holds pointwise. The same identity holds for the right potential operators on the
space B , under the additional restriction that α > 0 .

Connections with the Generator


Once again, assume that X = {X_t : t ∈ [0, ∞)} is a regular Markov chain on S with transition semigroup P = {P_t : t ∈ [0, ∞)}, infinitesimal generator G, resolvent U = {U_α : α ∈ (0, ∞)}, exponential parameter function λ, and one-step transition matrix Q for the jump chain. There are fundamental connections between the potential U_α and the generator matrix G, and hence between U_α and the function λ and the matrix Q.

If α ∈ (0, ∞) then I + G U_α = α U_α. In terms of λ and Q,

U_α(x, y) = [1/(α + λ(x))] I(x, y) + [λ(x)/(α + λ(x))] Q U_α(x, y),   (x, y) ∈ S^2   (16.17.16)

Proof 1
Proof 2
Proof 3

As before, we can get stronger results if we assume that λ is bounded, or equivalently, the transition semigroup P is uniform.

Suppose that λ is bounded and α ∈ (0, ∞). Then as operators on B (and hence also as matrices),
1. I + G U_α = α U_α
2. I + U_α G = α U_α

Proof

As matrices, the equation in (a) holds with more generality than the equation in (b), much as the Kolmogorov backward equation holds with more generality than the forward equation. Note that

U_α G(x, y) = ∑_{z∈S} U_α(x, z) G(z, y) = −λ(y) U_α(x, y) + ∑_{z∈S} U_α(x, z) λ(z) Q(z, y),   (x, y) ∈ S^2   (16.17.26)

If λ is unbounded, it's not clear that the second sum is finite.

Suppose that λ is bounded and α ∈ (0, ∞). Then as operators on B (and hence also as matrices),
1. U_α = (αI − G)^{−1}
2. G = αI − U_α^{−1}

Proof

So the potential operator U_α and the generator G have a simple, elegant inverse relationship. Of course, these results hold in particular if S is finite, so that all of the various matrices really are matrices in the elementary sense.
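In the finite case the whole resolvent theory can be checked numerically in a few lines. A sketch (Python with numpy; the two-state generator and the values of α and β are arbitrary illustrations) verifying that U_α = (αI − G)^{−1} satisfies the resolvent equation, that αU_α is a probability matrix, and that αU_α → I as α → ∞:

import numpy as np

a, b = 2.0, 3.0
G = np.array([[-a, a], [b, -b]])
U = lambda al: np.linalg.inv(al * np.eye(2) - G)   # U_alpha = (alpha I - G)^{-1}

al, be = 0.5, 2.0
print(np.allclose(U(al), U(be) + (be - al) * U(al) @ U(be)))  # resolvent equation
print((al * U(al)).sum(axis=1))                               # rows sum to 1
print(1e6 * U(1e6))                                           # close to the identity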

Examples and Exercises


The Two-State Chain
Let X = {X_t : t ∈ [0, ∞)} be the Markov chain on the set of states S = {0, 1}, with transition rate a ∈ [0, ∞) from 0 to 1 and transition rate b ∈ [0, ∞) from 1 to 0. To avoid the trivial case with both states absorbing, we will assume that a + b > 0. The first two results below are a review from the previous two sections.

The generator matrix G is

G = [ −a   a ]
    [  b  −b ]   (16.17.27)

The transition matrix at time t ∈ [0, ∞) is

P_t = 1/(a+b) [ b  a ]  −  e^{−(a+b)t}/(a+b) [ −a   a ],   t ∈ [0, ∞)   (16.17.28)
              [ b  a ]                        [  b  −b ]

Now we can find the potential matrix in two ways.

For α ∈ (0, ∞), show that the potential matrix U_α is

U_α = 1/(α(a+b)) [ b  a ]  −  1/((α + a + b)(a + b)) [ −a   a ]   (16.17.29)
                 [ b  a ]                             [  b  −b ]

1. From the definition.
2. From the relation U_α = (αI − G)^{−1}.

Computational Exercises
Consider the Markov chain X = {X_t : t ∈ [0, ∞)} on S = {0, 1, 2} with exponential parameter function λ = (4, 1, 3) and jump transition matrix

Q = [ 0    1/2  1/2 ]
    [ 1    0    0   ]
    [ 1/3  2/3  0   ]   (16.17.30)

1. Draw the state graph and classify the states.
2. Find the generator matrix G.
3. Find the potential matrix U_α for α ∈ (0, ∞).

Answer
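For part (c), a computer algebra system can produce U_α symbolically from the inverse relationship U_α = (αI − G)^{−1}. A sketch using Python's sympy (assumed available; this simply automates the matrix inversion):

import sympy as sp

al = sp.symbols('alpha', positive=True)
lam = sp.diag(4, 1, 3)
Q = sp.Matrix([[0, sp.Rational(1, 2), sp.Rational(1, 2)],
               [1, 0, 0],
               [sp.Rational(1, 3), sp.Rational(2, 3), 0]])
G = lam * (Q - sp.eye(3))                        # generator, for part (b)
U = sp.simplify((al * sp.eye(3) - G).inv())      # potential matrix U_alpha, part (c)
sp.pprint(U)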

Special Models
Read the discussion of potential matrices for chains subordinate to the Poisson process.

This page titled 16.17: Potential Matrices is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

16.18: Stationary and Limiting Distributions of Continuous-Time Chains

In this section, we study the limiting behavior of continuous-time Markov chains by focusing on two interrelated ideas: invariant
(or stationary) distributions and limiting distributions. In some ways, the limiting behavior of continuous-time chains is simpler
than the limiting behavior of discrete-time chains, in part because the complications caused by periodicity in the discrete-time case
do not occur in the continuous-time case. Nonetheless as we will see, the limiting behavior of a continuous-time chain is closely
related to the limiting behavior of the embedded, discrete-time jump chain.

Review
Once again, our starting point is a time-homogeneous, continuous-time Markov chain X = {X_t : t ∈ [0, ∞)} defined on an underlying probability space (Ω, F, P) and with discrete state space (S, S). By definition, this means that S is countable with the discrete topology, so that S is the σ-algebra of all subsets of S.
Let's review what we have so far. We assume that the Markov chain X is regular. Among other things, this means that the basic structure of X is determined by the transition times τ = (τ_0, τ_1, τ_2, …) and the jump chain Y = (Y_0, Y_1, Y_2, …). First, τ_0 = 0 and τ_1 = τ = inf{t > 0 : X_t ≠ X_0}. The time-homogeneous and Markov properties imply that the distribution of τ given X_0 = x is exponential with parameter λ(x) ∈ [0, ∞). Part of regularity is that X is right continuous, so that there are no instantaneous states where λ(x) = ∞, which would mean P(τ = 0 ∣ X_0 = x) = 1. On the other hand, λ(x) ∈ (0, ∞) means that x is a stable state, so that τ has a proper exponential distribution given X_0 = x, with P(0 < τ < ∞ ∣ X_0 = x) = 1. Finally, λ(x) = 0 means that x is an absorbing state, so that P(τ = ∞ ∣ X_0 = x) = 1. The remaining transition times are defined recursively: τ_{n+1} = inf{t > τ_n : X_t ≠ X_{τ_n}} if τ_n < ∞ and τ_{n+1} = ∞ if τ_n = ∞. Another component of regularity is that with probability 1, τ_n → ∞ as n → ∞, ruling out the explosion event of infinitely many jumps in finite time. The jump chain Y is formed by sampling X at the transition times (until the chain is sucked into an absorbing state, if that happens). That is, with M = sup{n : τ_n < ∞} and for n ∈ N, we define Y_n = X_{τ_n} if n ≤ M and Y_n = X_{τ_M} if n > M. Then Y is a discrete-time Markov chain with one-step transition matrix Q given by Q(x, y) = P(X_τ = y ∣ X_0 = x) if (x, y) ∈ S^2 with x stable and Q(x, x) = 1 if x ∈ S is absorbing.

The transition matrix P_t at time t ∈ [0, ∞) is given by P_t(x, y) = P(X_t = y ∣ X_0 = x) for (x, y) ∈ S^2. The time-homogeneous and Markov properties imply that the collection of transition matrices P = {P_t : t ∈ [0, ∞)} satisfies the Chapman-Kolmogorov equations P_s P_t = P_{s+t} for s, t ∈ [0, ∞), and hence is a semigroup of transition matrices. The transition semigroup P and the initial distribution of X_0 determine all of the finite-dimensional distributions of X. Since there are no instantaneous states, P is standard, which means that P_t → I as t ↓ 0 (as matrices, and so pointwise). The fundamental relationship between P on the one hand, and λ and Q on the other, is

P_t(x, y) = I(x, y) e^{−λ(x)t} + ∫_0^t λ(x) e^{−λ(x)s} Q P_{t−s}(x, y) ds,   (x, y) ∈ S^2   (16.18.1)

From this, it follows that the matrix function t ↦ P_t is differentiable (again, pointwise) and satisfies the Kolmogorov backward equation (d/dt) P_t = G P_t, where the infinitesimal generator matrix G is given by G(x, y) = −λ(x)I(x, y) + λ(x)Q(x, y) for (x, y) ∈ S^2. If we impose the stronger assumption that P is uniform, which means that P_t → I as t ↓ 0 as operators on B (so with respect to the supremum norm), then the backward equation as well as the companion Kolmogorov forward equation (d/dt) P_t = P_t G hold as operators on B. In addition, we have the matrix exponential representation P_t = e^{tG} for t ∈ [0, ∞). The uniform assumption is equivalent to the exponential parameter function being bounded.



Finally, for α ∈ [0, ∞), the α potential matrix U of X is U = ∫ e P dt . The resolvent U = {U : α ∈ (0, ∞)} is the
α α 0
−αt
t α

Laplace transform of P and hence gives the same information as P . From this point of view, the time-homogeneous and Markov
properties lead to the resolvent equation U = U + (β − α)U U for α, β ∈ (0, ∞) with α ≤ β . For α ∈ (0, ∞) , the α
α β α β

potential matrix is related to the generator by the fundamental equation α U = I + GU . If P is uniform, then this equation, as α α

well as the companion α U = I + U G hold as operators on B , which leads to U = (αI − G) .


α α α
−1

Basic Theory

Relations and Classification
We start our discussion with relations among states and classifications of states. These are the same ones that we studied for discrete-time chains in our study of recurrence and transience, applied here to the jump chain Y. But as we will see, the relations and classifications make sense for the continuous-time chain X as well. The discussion is complicated slightly when there are absorbing states. Only when X is in an absorbing state can we not interpret the values of Y as the values of X at the transition times (because of course, there are no transitions when X is in an absorbing state). But x ∈ S is absorbing for the continuous-time chain X if and only if x is absorbing for the jump chain Y, so this trivial exception is easily handled.
For y ∈ S let ρ_y = inf{n ∈ N_+ : Y_n = y}, the (discrete) hitting time to y for the jump chain Y, where as usual, inf(∅) = ∞. That is, ρ_y is the first positive (discrete) time that Y is in state y. The analogous random time for the continuous-time chain X is τ_{ρ_y}, where naturally we take τ_∞ = ∞. This is the first time that X is in state y, not counting the possible initial period in y. Specifically, suppose X_0 = x. If x ≠ y then τ_{ρ_y} = inf{t > 0 : X_t = y}. If x = y then τ_{ρ_y} = inf{t > τ_1 : X_t = y}.

Define the hitting matrix H by

H(x, y) = P(ρ_y < ∞ ∣ Y_0 = x),   (x, y) ∈ S^2   (16.18.2)

Then H(x, y) = P(τ_{ρ_y} < ∞ ∣ X_0 = x) except when x is absorbing and y = x.

So for the continuous-time chain, if x ∈ S is stable then H(x, x) is the probability that, starting in x, the chain X returns to x after its initial period in x. If x, y ∈ S are distinct, then H(x, y) is simply the probability that X, starting in x, eventually reaches y. It follows that the basic relation among states makes sense for the continuous-time chain X as well as its jump chain Y.

Define the relation → on S by x → y if x = y or H(x, y) > 0, for (x, y) ∈ S^2.

The leads to relation → is reflexive by definition: x → x for every x ∈ S. From our previous study of discrete-time chains, we know it's also transitive: if x → y and y → z then x → z for x, y, z ∈ S. We also know that x → y if and only if there is a directed path in the state graph from x to y, if and only if Q^n(x, y) > 0 for some n ∈ N. For the continuous-time transition matrices, we have a stronger result that in turn makes a stronger case that the leads to relation is fundamental for X.

Suppose (x, y) ∈ S^2.
1. If x → y then P_t(x, y) > 0 for all t ∈ (0, ∞).
2. If x ↛ y then P_t(x, y) = 0 for all t ∈ (0, ∞).
Proof

This result is known as the Lévy dichotomy, and is named for Paul Lévy. Let's recall a couple of other definitions:

Suppose that A is a nonempty subset of S .


1. A is closed if x ∈ A and x → y imply y ∈ A .
2. A is irreducible if A is closed and has no proper closed subset.

If S is irreducible, we also say that the chain X itself is irreducible.

If A is a nonempty subset of S , then cl(A) = {y ∈ S : x → y for some x ∈ A} is the smallest closed set containing A , and
is called the closure of A .

Suppose that A ⊆ S is closed. Then
1. P_t^A, the restriction of P_t to A × A, is a transition probability matrix on A for every t ∈ [0, ∞).
2. X restricted to A is a continuous-time Markov chain with transition semigroup P^A = {P_t^A : t ∈ [0, ∞)}.

Proof

Define the relation ↔ on S by x ↔ y if x → y and y → x, for (x, y) ∈ S^2.
The to and from relation ↔ defines an equivalence relation on S and hence partitions S into mutually disjoint equivalence classes.
Recall from our study of discrete-time chains that a closed set is not necessarily an equivalence class, nor is an equivalence class
necessarily closed. However, an irreducible set is an equivalence class, but an equivalence class may not be irreducible. The
importance of the relation ↔ stems from the fact that many important properties of Markov chains (in discrete or continuous time)
turn out to be class properties, shared by all states in an equivalence class. The following definition is fundamental, and once again,
makes sense for either the continuous-time chain X or its jump chain Y .

Let x ∈ S.
1. State x is transient if H(x, x) < 1.
2. State x is recurrent if H(x, x) = 1.

Recall from our study of discrete-time chains that if x is recurrent and x → y then y is recurrent and y → x . Thus, recurrence and
transience are class properties, shared by all states in an equivalence class.

Time Spent in a State


For x ∈ S, let N_x denote the number of visits to state x by the jump chain Y, and let T_x denote the total time spent in state x by the continuous-time chain X. Thus

N_x = ∑_{n=0}^∞ 1(Y_n = x),   T_x = ∫_0^∞ 1(X_t = x) dt   (16.18.3)

The expected values R(x, y) = E(N_y ∣ Y_0 = x) and U(x, y) = E(T_y ∣ X_0 = x) for (x, y) ∈ S^2 define the potential matrices of Y and X, respectively. From our previous study of discrete-time chains, we know the distribution and mean of N_y given Y_0 = x in terms of the hitting matrix H. The next two results give a review:

Suppose that x, y ∈ S are distinct. Then
1. P(N_y = n ∣ Y_0 = y) = H^{n−1}(y, y)[1 − H(y, y)] for n ∈ N_+
2. P(N_y = 0 ∣ Y_0 = x) = 1 − H(x, y) and P(N_y = n ∣ Y_0 = x) = H(x, y) H^{n−1}(y, y)[1 − H(y, y)] for n ∈ N_+

Let's take cases. First suppose that y is recurrent. In part (a), P(N_y = n ∣ Y_0 = y) = 0 for all n ∈ N_+, and consequently P(N_y = ∞ ∣ Y_0 = y) = 1. In part (b), P(N_y = n ∣ Y_0 = x) = 0 for n ∈ N_+, and consequently P(N_y = 0 ∣ Y_0 = x) = 1 − H(x, y) while P(N_y = ∞ ∣ Y_0 = x) = H(x, y). Suppose next that y is transient. Part (a) specifies a proper geometric distribution on N_+, while in part (b), probability 1 − H(x, y) is assigned to 0 and the remaining probability H(x, y) is geometrically distributed over N_+ as in (a). In both cases, N_y is finite with probability 1. Next we consider the expected value, that is, the (discrete) potential. To state the results succinctly we will use the convention that a/0 = ∞ if a > 0 and 0/0 = 0.

Suppose again that x, y ∈ S are distinct. Then


1. R(y, y) = 1/[1 − H (y, y)]
2. R(x, y) = H (x, y)/[1 − H (y, y)]

Let's take cases again. If y ∈ S is recurrent then R(y, y) = ∞ , and for x ∈ S with x ≠ y , either R(x, y) = ∞ if x → y or
R(x, y) = 0 if x ↛ y . If y ∈ S is transient, R(y, y) is finite, as is R(x, y) for every x ∈ S with x ≠ y . Moreover, there is an

inverse relationship of sorts between the potential and the hitting probabilities.
Naturally, our next goal is to find analogous results for the continuous-time chain X. For the distribution of T_y it's best to use the right distribution function.

Suppose that x, y ∈ S are distinct. Then for t ∈ [0, ∞),
1. P(T_y > t ∣ X_0 = y) = exp{−λ(y)[1 − H(y, y)]t}
2. P(T_y > t ∣ X_0 = x) = H(x, y) exp{−λ(y)[1 − H(y, y)]t}

Proof

Let's take cases as before. Suppose first that y is recurrent. In part (a), P(T_y > t ∣ X_0 = y) = 1 for every t ∈ [0, ∞) and hence P(T_y = ∞ ∣ X_0 = y) = 1. In part (b), P(T_y > t ∣ X_0 = x) = H(x, y) for every t ∈ [0, ∞) and consequently P(T_y = 0 ∣ X_0 = x) = 1 − H(x, y) while P(T_y = ∞ ∣ X_0 = x) = H(x, y). Suppose next that y is transient. From part (a), the distribution of T_y given X_0 = y is exponential with parameter λ(y)[1 − H(y, y)]. In part (b), the distribution assigns probability 1 − H(x, y) to 0 while the remaining probability H(x, y) is exponentially distributed over (0, ∞) as in (a). Taking expected values, we get a very nice relationship between the potential matrix U of the continuous-time chain X and the potential matrix R of the discrete-time jump chain Y:

For every (x, y) ∈ S^2,

U(x, y) = R(x, y)/λ(y)   (16.18.6)

Proof

In particular, y ∈ S is transient if and only if R(x, y) < ∞ for every x ∈ S , if and only if U (x, y) < ∞ for every x ∈ S . On the
other hand, y is recurrent if and only if R(x, y) = U (x, y) = ∞ if x → y and R(x, y) = U (x, y) = 0 if x ↛ y .

Null and Positive Recurrence


Unlike transience and recurrence, the definitions of null and positive recurrence of a state x ∈ S are different for the continuous-time chain X and its jump chain Y. This is because these definitions depend on the expected hitting time to x, starting in x, and not just the finiteness of this hitting time. For x ∈ S, let ν(x) = E(ρ_x ∣ Y_0 = x), the expected (discrete) return time to x starting in x. Recall that x is positive recurrent for Y if ν(x) < ∞, and x is null recurrent if x is recurrent but not positive recurrent, so that H(x, x) = 1 but ν(x) = ∞. The definitions are similar for X, but using the continuous hitting time τ_{ρ_x}.
For x ∈ S, let μ(x) = 0 if x is absorbing and μ(x) = E(τ_{ρ_x} ∣ X_0 = x) if x is stable. So if x is stable, μ(x) is the expected return time to x starting in x (after the initial period in x).
1. State x is positive recurrent for X if μ(x) < ∞.
2. State x is null recurrent for X if x is recurrent but not positive recurrent, so that H(x, x) = 1 but μ(x) = ∞.

A state x ∈ S can be positive recurrent for X but null recurrent for its jump chain Y or can be null recurrent for X but positive
recurrent for Y . But like transience and recurrence, positive and null recurrence are class properties, shared by all states in an
equivalence class under the to and from equivalence relation ↔.

Invariant Functions
Our next discussion concerns functions that are invariant for the transition matrix Q of the jump chain Y and functions that are invariant for the transition semigroup P = {P_t : t ∈ [0, ∞)} of the continuous-time chain X. For both discrete-time and continuous-time chains, there is a close relationship between invariant functions and the limiting behavior in time.
First let's recall the definitions. A function f : S → [0, ∞) is invariant for Q (or for the chain Y) if f Q = f. It then follows that f Q^n = f for every n ∈ N. In continuous time we must assume invariance at each time. That is, a function f : S → [0, ∞) is invariant for P (or for the chain X) if f P_t = f for all t ∈ [0, ∞). Our interest is in nonnegative functions, because we can think of such a function as the density function, with respect to counting measure, of a positive measure on S. We are particularly interested in the special case that f is a probability density function, so that ∑_{x∈S} f(x) = 1. If Y_0 has a probability density function f that is invariant for Q, then Y_n has probability density function f for all n ∈ N and hence Y is stationary. Similarly, if X_0 has a probability density function f that is invariant for P then X_t has probability density function f for every t ∈ [0, ∞) and once again, the chain X is stationary.


Our first result shows that there is a one-to-one correspondence between invariant functions for Q and zero functions for the
generator G.

Suppose f : S → [0, ∞) . Then f G = 0 if and only if (λf )Q = λf , so that λf is invariant for Q.


Proof

If our chain X has no absorbing states, then f : S → [0, ∞) is invariant for Q if and only if (f /λ)G = 0 .

Suppose that f : S → [0, ∞) . Then f is invariant for P if and only if f G = 0 .
Proof 1
Proof 2

So putting the two main results together, we see that f is invariant for the continuous-time chain X if and only if λf is invariant for the jump chain Y. Our next result shows how functions that are invariant for X are related to the resolvent U = {U_α : α ∈ (0, ∞)}. To appreciate the result, recall that for α ∈ (0, ∞) the matrix αU_α is a probability matrix, and in fact αU_α(x, ⋅) is the conditional probability density function of X_T, given X_0 = x, where T is independent of X and has the exponential distribution with parameter α. So αU_α is a transition matrix just as P_t is a transition matrix, but corresponding to the exponentially distributed random time T with parameter α ∈ (0, ∞) rather than the deterministic time t ∈ [0, ∞).

Suppose that f : S → [0, ∞). If f G = 0 then f(αU_α) = f for α ∈ (0, ∞). Conversely, if f(αU_α) = f for α ∈ (0, ∞) then f G = 0.

Proof

So extending our summary, f : S → [0, ∞) is invariant for the transition semigroup P = {P_t : t ∈ [0, ∞)} if and only if λf is invariant for the jump transition matrices {Q^n : n ∈ N} if and only if f G = 0 if and only if f is invariant for the collection of probability matrices {αU_α : α ∈ (0, ∞)}. From our knowledge of the theory for discrete-time chains, we now have the following fundamental result:

Suppose that X is irreducible and recurrent.


1. There exists g : S → (0, ∞) that is invariant for X.
2. If f is invariant for X, then f = cg for some constant c ∈ [0, ∞).
Proof

Invariant functions have a nice interpretation in terms of occupation times, an interpretation that parallels the discrete case. The
potential gives the expected total time in a state, starting in another state, but here we need to consider the expected time in a state
during a cycle that starts and ends in another state.

For x ∈ S, define the function γ_x by

γ_x(y) = E(∫_0^{τ_{ρ_x}} 1(X_s = y) ds ∣ X_0 = x),   y ∈ S   (16.18.13)

so that γ_x(y) is the expected occupation time in state y before the first return to x, starting in x.

Suppose again that X is irreducible and recurrent. For x ∈ S,
1. γ_x : S → (0, ∞)
2. γ_x is invariant for X
3. γ_x(x) = 1/λ(x)
4. μ(x) = ∑_{y∈S} γ_x(y)

Proof

So now we have some additional insight into positive and null recurrence for the continuous-time chain X and the associated jump chain Y. Suppose again that the chains are irreducible and recurrent. There exists g : S → (0, ∞) that is invariant for Y, and then g/λ is invariant for X. The invariant functions are unique up to multiplication by positive constants. The jump chain Y is positive recurrent if and only if ∑_{x∈S} g(x) < ∞, while the continuous-time chain X is positive recurrent if and only if ∑_{x∈S} g(x)/λ(x) < ∞. Note that if λ is bounded (which is equivalent to the transition semigroup P being uniform), then X is positive recurrent if and only if Y is positive recurrent.

Suppose again that X is irreducible and recurrent.


1. If X is null recurrent then X does not have an invariant probability density function.
2. If X is positive recurrent then X has a unique, positive invariant probability density function.
Proof

Limiting Behavior
Our next discussion focuses on the limiting behavior of the transition semigroup P = { Pt : t ∈ [0, ∞)} . Our first result is a
simple corollary of the result above for potentials.

If y ∈ S is transient, then P_t(x, y) → 0 as t → ∞ for every x ∈ S.


Proof

So we should turn our attention to the recurrent states. The set of recurrent states partitions into equivalence classes under ↔, and each of these classes is irreducible. Hence we can assume without loss of generality that our continuous-time chain X = {X_t : t ∈ [0, ∞)} is irreducible and recurrent. To avoid trivialities, we will also assume that S has at least two states. Thus, there are no absorbing states and so λ(x) > 0 for x ∈ S. Here is the main result.

Suppose that X = {X_t : t ∈ [0, ∞)} is irreducible and recurrent. Then f(y) = lim_{t→∞} P_t(x, y) exists for each y ∈ S, independently of x ∈ S. The function f is invariant for X and

f(y) = γ_x(y)/μ(x),   y ∈ S   (16.18.17)

1. If X is null recurrent then f(y) = 0 for all y ∈ S.
2. If X is positive recurrent then f(y) > 0 for all y ∈ S and ∑_{y∈S} f(y) = 1.
Proof sketch

The limiting function f can be computed in a number of ways. First we find a function g : S → (0, ∞) that is invariant for X. We can do this by solving any of the following:
g P_t = g for t ∈ (0, ∞)
g G = 0
g(αU_α) = g for α ∈ (0, ∞)
h Q = h, and then g = h/λ
The function g is unique up to multiplication by positive constants. If ∑_{x∈S} g(x) < ∞, then we are in the positive recurrent case and so f is simply g normalized:

f(y) = g(y) / ∑_{x∈S} g(x),   y ∈ S   (16.18.19)

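In the finite case, each of these routes reduces to a small linear-algebra computation. The sketch below (Python with scipy; the two-state chain is used purely as an illustration) finds g from g G = 0 as a null vector of the transpose of G, and again via h Q = h followed by g = h/λ:

import numpy as np
from scipy.linalg import null_space

a, b = 2.0, 3.0
lam = np.array([a, b])
Q = np.array([[0.0, 1.0], [1.0, 0.0]])
G = lam[:, None] * (Q - np.eye(2))

g1 = null_space(G.T)[:, 0]            # route: g G = 0
print(g1 / g1.sum())                  # f = (b, a)/(a + b)

vals, vecs = np.linalg.eig(Q.T)       # route: h Q = h, then g = h / lam
h = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
g2 = h / lam
print(g2 / g2.sum())                  # same distribution after normalizing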
The following result is known as the ergodic theorem for continuous-time Markov chains. It can also be thought of as a strong law
of large numbers for continuous-time Markov chains.

Suppose that X = {X_t : t ∈ [0, ∞)} is irreducible and positive recurrent, with (unique) invariant probability density function f. If h : S → R then

(1/t) ∫_0^t h(X_s) ds → ∑_{x∈S} f(x) h(x) as t → ∞   (16.18.20)

with probability 1, assuming that the sum on the right converges absolutely.
Notes

Note that no assumptions are made about X_0, so the limit is independent of the initial state. By now, this should come as no surprise. After a long period of time, the Markov chain X "forgets" about the initial state. Note also that ∑_{x∈S} f(x)h(x) is the expected value of h, thought of as a random variable on S with the probability measure defined by f. On the other hand, (1/t) ∫_0^t h(X_s) ds is the average of the time function s ↦ h(X_s) on the interval [0, t]. So the ergodic theorem states that the limiting time average on the left is the same as the spatial average on the right.
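The ergodic theorem is easy to see empirically: simulate the chain by alternately drawing an exponential holding time and a jump-chain transition, and compare the time average with the spatial average. A minimal simulation sketch (Python; the two-state chain and the function h are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
a, b = 2.0, 3.0
lam = np.array([a, b])
Q = np.array([[0.0, 1.0], [1.0, 0.0]])
h = np.array([0.0, 1.0])              # h(x) = x on S = {0, 1}

x, total, integral = 0, 0.0, 0.0
while total < 10_000.0:
    dwell = rng.exponential(1.0 / lam[x])   # exponential holding time in state x
    integral += h[x] * dwell
    total += dwell
    x = rng.choice(2, p=Q[x])               # transition of the jump chain

print(integral / total)                     # time average, about 0.4
print(a / (a + b))                          # spatial average: sum f(x) h(x) = f(1)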

Applications and Exercises

The Two-State Chain
The continuous-time, two-state chain has been studied in the last several sections. The following result puts the pieces together and
completes the picture.

Consider the continuous-time Markov chain X = {X_t : t ∈ [0, ∞)} on S = {0, 1} with transition rate a ∈ (0, ∞) from 0 to 1 and transition rate b ∈ (0, ∞) from 1 to 0. Give each of the following:
1. The transition matrix Q^n for Y at n ∈ N.
2. The infinitesimal generator G.
3. The transition matrix P_t for X at t ∈ [0, ∞).
4. The invariant probability density function for Y.
5. The invariant probability density function for X.
6. The limiting behavior of Q^n as n → ∞.
7. The limiting behavior of P_t as t → ∞.

Answer

Computational Exercises
The following continuous-time chain has also been studied in the previous three sections.

Consider the Markov chain X = {X_t : t ∈ [0, ∞)} on S = {0, 1, 2} with exponential parameter function λ = (4, 1, 3) and jump transition matrix

Q = [ 0    1/2  1/2 ]
    [ 1    0    0   ]
    [ 1/3  2/3  0   ]   (16.18.21)

1. Recall the generator matrix G.
2. Find the invariant probability density function f_d for Y by solving f_d Q = f_d.
3. Find the invariant probability density function f_c for X by solving f_c G = 0.
4. Verify that λ f_c is a multiple of f_d.
5. Describe the limiting behavior of Q^n as n → ∞.
6. Describe the limiting behavior of P_t as t → ∞.
7. Verify the result in (f) by recalling the transition matrix P_t for X at t ∈ [0, ∞).

Answer
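Parts (b)–(d) can also be verified numerically. A short sketch (Python with numpy; the eigenvector extraction is the same trick used for discrete-time chains):

import numpy as np

lam = np.array([4.0, 1.0, 3.0])
Q = np.array([[0, 1/2, 1/2], [1, 0, 0], [1/3, 2/3, 0]])
G = lam[:, None] * (Q - np.eye(3))

def left_unit(A, target):
    # normalized left eigenvector of A for the eigenvalue nearest target
    vals, vecs = np.linalg.eig(A.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - target))])
    return v / v.sum()

f_d = left_unit(Q, 1.0)     # invariant pdf of the jump chain: f_d Q = f_d
f_c = left_unit(G, 0.0)     # invariant pdf of X: f_c G = 0
print(f_d, f_c)
print((lam * f_c) / (lam * f_c).sum())   # equals f_d, so lam f_c is a multiple of f_d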

Special Models
Read the discussion of stationary and limiting distributions for chains subordinate to the Poisson process.

Read the discussion of stationary and limiting distributions for continuous-time birth-death chains.

Read the discussion of classification and limiting distributions for continuous-time queuing chains.

This page titled 16.18: Stationary and Limiting Distributions of Continuous-Time Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

16.19: Time Reversal in Continuous-Time Chains

Earlier, we studied time reversal of discrete-time Markov chains. In continuous time, the issues are basically the same. First, the Markov property, stated in the form that the past and future are independent given the present, essentially treats the past and future symmetrically. However, there is a lack of symmetry in the fact that in the usual formulation, we have an initial time 0, but not a terminal time. If we introduce a terminal time, then we can run the process backwards in time. In this section, we are interested in the following questions:
Is the new process still Markov?
If so, how are the various parameters of the reversed Markov chain related to those of the original chain?
Under what conditions are the forward and backward Markov chains stochastically the same?
Consideration of these questions leads to reversed chains, an important and interesting part of the theory of continuous-time Markov
chains. As always, we are also interested in the relationship between properties of a continuous-time chain and the corresponding
properties of its discrete-time jump chain. In this section we will see that there are simple and elegant connections between the time
reversal of a continuous-time chain and the time-reversal of the jump chain.

Basic Theory
Reversed Chains
Our starting point is a (homogeneous) continuous-time Markov chain X = {X_t : t ∈ [0, ∞)} with (countable) state space S. We will assume that X is irreducible, so that every state in S leads to every other state, and to avoid trivialities, we will assume that there are at least two states. The irreducibility assumption involves no serious loss of generality, since otherwise we could simply restrict our attention to an irreducible equivalence class of states. With our usual notation, we will let P = {P_t : t ∈ [0, ∞)} denote the semigroup of transition matrices of X and G the infinitesimal generator. Let λ(x) denote the exponential parameter for the holding time in state x ∈ S and Q the transition matrix for the discrete-time jump chain Y = (Y_0, Y_1, …). Finally, let U = {U_α : α ∈ [0, ∞)} denote the collection of potential matrices of X. We will assume that the chain X is regular, which gives us the following properties:


P_t(x, x) → 1 as t ↓ 0 for x ∈ S.
There are no instantaneous states, so λ(x) < ∞ for x ∈ S.
The transition times (τ_1, τ_2, …) satisfy τ_n → ∞ as n → ∞.
We may assume that the chain X is right continuous and has left limits.
The assumption of regularity rules out various types of weird behavior that, while mathematically possible, are usually not appropriate in applications. If X is uniform, a stronger assumption than regularity, we have the following additional properties:
P_t(x, x) → 1 as t ↓ 0 uniformly in x ∈ S.
λ is bounded.
P_t = e^{tG} for t ∈ [0, ∞).
U_α = (αI − G)^{−1} for α ∈ (0, ∞).

Now let h ∈ (0, ∞). We will think of h as the terminal time or time horizon, so the chains in our first discussion will be defined on the time interval [0, h]. Notationally, we won't bother to indicate the dependence on h, since ultimately the time horizon won't matter. Define X̂_t = X_{h−t} for t ∈ [0, h]. Thus, the process forward in time is X = {X_t : t ∈ [0, h]} while the process backwards in time is

X̂ = {X̂_t : t ∈ [0, h]} = {X_{h−t} : t ∈ [0, h]}   (16.19.1)

Similarly, let

F̂_t = σ{X̂_s : s ∈ [0, t]} = σ{X_{h−s} : s ∈ [0, t]} = σ{X_r : r ∈ [h − t, h]},   t ∈ [0, h]   (16.19.2)

So F̂_t is the σ-algebra of events of the process X̂ up to time t, which of course is also the σ-algebra of events of X from time h − t forward. Our first result is that the chain reversed in time is still Markov.

The process X̂ = {X̂_t : t ∈ [0, h]} is a Markov chain, but is not time homogeneous in general. For s, t ∈ [0, h] with s < t, the transition matrix from s to t is

P̂_{s,t}(x, y) = [P(X_{h−t} = y) / P(X_{h−s} = x)] P_{t−s}(y, x),   (x, y) ∈ S^2   (16.19.3)

Proof

However, the backwards chain will be time homogeneous if X_0 has the invariant distribution.

Suppose that X is positive recurrent, with (unique) invariant probability density function f. If X_0 has the invariant distribution, then X̂ is a time-homogeneous Markov chain. The transition matrix at time t ∈ [0, ∞) (for every terminal time h ≥ t) is given by
P̂_t(x, y) = (f(y) / f(x)) P_t(y, x), (x, y) ∈ S²  (16.19.6)

Proof

The previous result holds in the limit of the terminal time, regardless of the initial distribution.

Suppose again that X is positive recurrent, with (unique) invariant probability density function f. Regardless of the distribution of X_0,
P(X̂_{s+t} = y ∣ X̂_s = x) → (f(y) / f(x)) P_t(y, x) as h → ∞  (16.19.7)

Proof

These three results are motivation for the definition that follows. We can generalize by defining the reversal of an irreducible Markov
chain, as long as there is a positive, invariant function. Recall that a positive invariant function defines a positive measure on S , but
of course not in general a probability measure.

Suppose that g : S → (0, ∞) is invariant for X. The reversal of X with respect to g is the Markov chain X̂ = {X̂_t : t ∈ [0, ∞)} with transition semigroup P̂ defined by
P̂_t(x, y) = (g(y) / g(x)) P_t(y, x), (x, y) ∈ S², t ∈ [0, ∞)  (16.19.8)

Justification

Recall that if g is a positive invariant function for X then so is cg for every constant c ∈ (0, ∞) . Note that g and cg generate the
same reversed chain. So let's consider the cases:

Suppose again that X is a Markov chain satisfying the assumptions above.


1. If X is recurrent, then X always has a positive invariant function g , unique up to multiplication by positive constants. Hence
the reversal of a recurrent chain X always exists and is unique, and so we can refer to the reversal of X without reference to
the invariant function.
2. Even better, if X is positive recurrent, then there exists a unique invariant probability density function, and the reversal of X
can be interpreted as the time reversal (relative to a time horizon) when X has the invariant distribution, as in the motivating
result above.
3. If X is transient, then there may or may not exist a positive invariant function, and if one does exist, it may not be unique (up
to multiplication by positive constants). So a transient chain may have no reversals or more than one.

Nonetheless, the general definition is natural, because most of the important properties of the reversed chain follow from the basic
balance equation relating the transition semigroups P and P^, and the invariant function g :
g(x)P̂_t(x, y) = g(y)P_t(y, x), (x, y) ∈ S², t ∈ [0, ∞)  (16.19.10)

We will see the balance equation repeated for other objects associated with the Markov chains.

Suppose again that g : S → (0, ∞) is invariant for X, and that X̂ is the time reversal of X with respect to g. Then
1. g is also invariant for X̂.
2. X is the time reversal of X̂ with respect to g.
Proof

In the balance equation for the transition semigroups, it's not really necessary to know a priori that the function g is invariant, if we know the two transition semigroups.

Suppose that g : S → (0, ∞). Then g is invariant and the Markov chains X and X̂ are time reversals with respect to g if and only if
g(x)P̂_t(x, y) = g(y)P_t(y, x), (x, y) ∈ S², t ∈ [0, ∞)  (16.19.12)

Proof

Here is a slightly more complicated (but equivalent) version of the balance equation for the transition probabilities.

Suppose again that g : S → (0, ∞). Then g is invariant and the chains X and X̂ are time reversals with respect to g if and only if
g(x_1)P̂_{t_1}(x_1, x_2)P̂_{t_2}(x_2, x_3) ⋯ P̂_{t_n}(x_n, x_{n+1}) = g(x_{n+1})P_{t_n}(x_{n+1}, x_n)P_{t_{n−1}}(x_n, x_{n−1}) ⋯ P_{t_1}(x_2, x_1)  (16.19.14)
for all n ∈ N_+, (t_1, t_2, …, t_n) ∈ [0, ∞)^n, and (x_1, x_2, …, x_{n+1}) ∈ S^{n+1}.
Proof

The balance equation holds for the potential matrices.

Suppose again that g : S → (0, ∞). Then g is invariant and the chains X and X̂ are time reversals with respect to g if and only if the potential matrices satisfy
g(x)Û_α(x, y) = g(y)U_α(y, x), (x, y) ∈ S², α ∈ [0, ∞)  (16.19.17)

Proof

As a corollary, continuous-time chains that are time reversals are of the same type.

If X and X̂ are time reversals, then X and X̂ are of the same type: transient, null recurrent, or positive recurrent.
Proof

The balance equation extends to the infinitesimal generator matrices.

Suppose again that g : S → (0, ∞). Then g is invariant and the Markov chains X and X̂ are time reversals if and only if the infinitesimal generators satisfy
g(x)Ĝ(x, y) = g(y)G(y, x), (x, y) ∈ S²  (16.19.19)

Proof
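Since the generator is a single matrix, the generator version of the balance equation is the most convenient computationally. Here is a minimal numeric sketch, with a hypothetical 3-state generator that is not from the text: we solve gG = 0 for a positive invariant function and then form Ĝ(x, y) = g(y)G(y, x)/g(x).

```python
import numpy as np

# A minimal numeric sketch with a hypothetical 3-state generator matrix G:
# find a positive invariant function g by solving g G = 0, then build the
# reversed generator from the balance equation.
G = np.array([[-3.0,  1.0,  2.0],
              [ 2.0, -4.0,  2.0],
              [ 1.0,  1.0, -2.0]])

# Solve g G = 0 with the normalization sum(g) = 1 (any positive multiple works).
A = np.vstack([G.T, np.ones(3)])
g, *_ = np.linalg.lstsq(A, np.array([0.0, 0.0, 0.0, 1.0]), rcond=None)

# Balance equation: G_hat(x, y) = g(y) G(y, x) / g(x)
G_hat = (G.T * g) / g[:, None]

print(np.allclose(G_hat.sum(axis=1), 0.0))  # G_hat is a generator matrix
print(np.allclose(g @ G_hat, 0.0))          # g is invariant for the reversal too
```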

This leads to further results and connections:

Suppose again that g : S → (0, ∞). Then g is invariant and X and X̂ are time reversals with respect to g if and only if
1. X̂ and X have the same exponential parameter function λ.
2. The jump chains Y and Ŷ are (discrete) time reversals with respect to λg.
Proof

In our original discussion of time reversal in the positive recurrent case, we could have argued that the previous results must be true.
If we run the positive recurrent chain X = {X_t : t ∈ [0, h]} backwards in time to obtain the time reversed chain X̂ = {X̂_t : t ∈ [0, h]}, then the exponential parameters for X̂ must be the same as those for X, and the jump chain Ŷ for X̂ must be the time reversal of the jump chain Y for X.

Reversible Chains
Clearly an interesting special case is when the time reversal of a continuous-time Markov chain is stochastically the same as the original chain. Once again, we assume that we have a regular Markov chain X = {X_t : t ∈ [0, ∞)} that is irreducible on the state space S, with transition semigroup P = {P_t : t ∈ [0, ∞)}. As before, U = {U_α : α ∈ [0, ∞)} denotes the collection of potential matrices, and G the infinitesimal generator. Finally, λ denotes the exponential parameter function, Y = {Y_n : n ∈ N} the jump chain, and Q the transition matrix of Y. Here is the definition of reversibility:

Suppose that g : S → (0, ∞) is invariant for X. Then X is reversible with respect to g if the time reversed chain X̂ = {X̂_t : t ∈ [0, ∞)} also has transition semigroup P. That is,
g(x)P_t(x, y) = g(y)P_t(y, x), (x, y) ∈ S², t ∈ [0, ∞)  (16.19.21)

Clearly if X is reversible with respect to g then X is reversible with respect to cg for every c ∈ (0, ∞). So here is another review of
the cases:

Suppose that X is a Markov chain satisfying the assumptions above.


1. If X is recurrent, then there exists an invariant function g : S → (0, ∞) that is unique up to multiplication by positive
constants. So X is either reversible or not, and we do not have to reference the invariant function.
2. Even better, if X is positive recurrent, then there exists a unique invariant probability density function f . Again, X is either
reversible or not, but if it is, then with the invariant distribution, the chain X is stochastically the same, forward in time or
backward in time.
3. If X is transient, then a positive invariant function may or may not exist. If such a function does exist, it may not be unique,
up to multiplication by positive constants. So in the transient case, X may be reversible with respect to one invariant function
but not with respect to others.

The following results are corollaries of the results above for time reversals. First, we don't need to know a priori that the function g is
invariant.

Suppose that g : S → (0, ∞). Then g is invariant and X is reversible with respect to g if and only if
g(x)P_t(x, y) = g(y)P_t(y, x), (x, y) ∈ S², t ∈ [0, ∞)  (16.19.22)

Suppose again that g : S → (0, ∞). Then g is invariant and X is reversible with respect to g if and only if
g(x_1)P_{t_1}(x_1, x_2)P_{t_2}(x_2, x_3) ⋯ P_{t_n}(x_n, x_{n+1}) = g(x_{n+1})P_{t_n}(x_{n+1}, x_n)P_{t_{n−1}}(x_n, x_{n−1}) ⋯ P_{t_1}(x_2, x_1)  (16.19.23)
for all n ∈ N_+, (t_1, t_2, …, t_n) ∈ [0, ∞)^n, and (x_1, x_2, …, x_{n+1}) ∈ S^{n+1}.

Suppose again that g : S → (0, ∞). Then g is invariant and X is reversible with respect to g if and only if
g(x)U_α(x, y) = g(y)U_α(y, x), (x, y) ∈ S², α ∈ [0, ∞)  (16.19.24)

Suppose again that g : S → (0, ∞). Then g is invariant and X is reversible with respect to g if and only if
g(x)G(x, y) = g(y)G(y, x), (x, y) ∈ S²  (16.19.25)

Suppose again that g : S → (0, ∞) . Then g is invariant and X is reversible if and only if the jump chain Y is reversible with
respect to λg.

Recall that X is recurrent if and only if the jump chain Y is recurrent. In this case, the invariant functions for X and Y exist and are
unique up to positive constants. So in this case, the previous theorem states that X is reversible if and only if Y is reversible. In the
positive recurrent case (the most important case), the following theorem gives a condition for reversibility that does not directly
reference the invariant distribution. The condition is known as the Kolmogorov cycle condition, and is named for Andrei Kolmogorov.

Suppose that X is positive recurrent. Then X is reversible if and only if for every sequence of distinct states (x_1, x_2, …, x_n),
G(x_1, x_2)G(x_2, x_3) ⋯ G(x_{n−1}, x_n)G(x_n, x_1) = G(x_1, x_n)G(x_n, x_{n−1}) ⋯ G(x_3, x_2)G(x_2, x_1)  (16.19.26)
Proof

Note that the Kolmogorov cycle condition states that the transition rate of visiting states (x_2, x_3, …, x_n, x_1) in sequence, starting in state x_1, is the same as the transition rate of visiting states (x_n, x_{n−1}, …, x_2, x_1) in sequence, starting in state x_1. The cycle condition is also known as the balance equation for cycles.

Figure 16.19.1: The Kolmogorov cycle condition
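As a quick illustration, here is a numeric sketch (not from the text) that checks the cycle condition for a generator that is reversible by construction: we set G(x, y) = s(x, y)/g(x) for x ≠ y, where s is a symmetric matrix of hypothetical positive rates, so the balance equation, and hence the cycle condition, must hold.

```python
import numpy as np
from itertools import permutations

# A numeric sketch: build a generator on S = {0, 1, 2, 3} that is reversible
# by construction, G(x, y) = s(x, y) / g(x) for x != y with s symmetric, and
# verify the Kolmogorov cycle condition for all cycles of length 3 and 4.
rng = np.random.default_rng(7)
g = rng.uniform(0.5, 2.0, size=4)        # hypothetical positive invariant function
s = rng.uniform(0.5, 2.0, size=(4, 4))
s = (s + s.T) / 2                        # symmetric "conductances"
G = s / g[:, None]
np.fill_diagonal(G, 0.0)
np.fill_diagonal(G, -G.sum(axis=1))      # rows of a generator sum to zero

def cycle_rate(M, cycle):
    # product of the transition rates around the cycle, returning to the start
    path = list(cycle) + [cycle[0]]
    return np.prod([M[path[i], path[i + 1]] for i in range(len(cycle))])

ok = all(np.isclose(cycle_rate(G, c), cycle_rate(G, tuple(reversed(c))))
         for n in (3, 4) for c in permutations(range(4), n))
print(ok)  # True
```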

Applications and Exercises


The Two-State Chain
The continuous-time, two-state chain has been studied in our previous sections on continuous-time chains, so naturally we are
interested in time reversal.

Consider the continuous-time Markov chain X = {X_t : t ∈ [0, ∞)} on S = {0, 1} with transition rate a ∈ (0, ∞) from 0 to 1 and transition rate b ∈ (0, ∞) from 1 to 0. Show that X is reversible:
1. Using the transition semigroup P = {P_t : t ∈ [0, ∞)}.
2. Using the resolvent U = {U_α : α ∈ (0, ∞)}.
3. Using the generator matrix G.
Solutions
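Here is a minimal numeric sketch of part 3, with arbitrary values of a and b: the function g = (b, a) is invariant, and the generator satisfies the balance equation.

```python
import numpy as np

# A minimal numeric sketch of part 3 (the values of a and b are arbitrary):
# g = (b, a) is invariant and the generator satisfies the balance equation.
a, b = 2.0, 3.0
G = np.array([[-a,  a],
              [ b, -b]])
g = np.array([b, a])

print(np.allclose(g @ G, 0.0))                     # g is invariant
print(np.isclose(g[0] * G[0, 1], g[1] * G[1, 0]))  # g(0)G(0,1) = g(1)G(1,0)
```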

Computational Exercises
The Markov chain in the following exercise has also been studied in previous sections.

Consider the Markov chain X = {X_t : t ∈ [0, ∞)} on S = {0, 1, 2} with exponential parameter function λ = (4, 1, 3) and jump transition matrix
Q =
⎡ 0    1/2  1/2 ⎤
⎢ 1    0    0   ⎥
⎣ 1/3  2/3  0   ⎦   (16.19.29)

Give each of the following for the time reversed chain X̂:
1. The state graph.
2. The semigroup of transition matrices P̂ = {P̂_t : t ∈ [0, ∞)}.
3. The resolvent of potential matrices Û = {Û_α : α ∈ (0, ∞)}.
4. The generator matrix Ĝ.
5. The transition matrix of the jump chain Ŷ.
Solutions
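For a numeric check of parts 2, 4, and 5, here is a small sketch: it builds G = diag(λ)(Q − I) from the data above, computes the invariant density f, and forms the reversed generator; the reversed semigroup can then be obtained as P̂_t = e^{tĜ}.

```python
import numpy as np
from scipy.linalg import expm

# A numeric sketch for parts 2, 4, and 5 of the exercise above.
lam = np.array([4.0, 1.0, 3.0])
Q = np.array([[0, 1/2, 1/2],
              [1, 0, 0],
              [1/3, 2/3, 0]])
G = np.diag(lam) @ (Q - np.eye(3))

A = np.vstack([G.T, np.ones(3)])                   # solve f G = 0, sum(f) = 1
f, *_ = np.linalg.lstsq(A, np.array([0.0, 0.0, 0.0, 1.0]), rcond=None)

G_hat = (G.T * f) / f[:, None]                     # generator of the reversal
Q_hat = np.eye(3) + G_hat / lam[:, None]           # jump chain of the reversal

print(np.round(f, 4), np.round(G_hat, 4), np.round(Q_hat, 4), sep="\n")
print(np.allclose(expm(1.0 * G_hat).sum(axis=1), 1.0))  # P_hat_1 is stochastic
```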

Special Models
Read the discussion of time reversal for chains subordinate to the Poisson process.

Read the discussion of time reversal for continuous-time birth-death chains.

This page titled 16.19: Time Reversal in Continuous-Time Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

16.20: Chains Subordinate to the Poisson Process

Basic Theory
Introduction
Recall that the standard Poisson process with rate parameter r ∈ (0, ∞) involves three interrelated stochastic processes. First, the sequence of interarrival times T = (T_1, T_2, …) is independent, and each variable has the exponential distribution with parameter r. Next, the sequence of arrival times τ = (τ_0, τ_1, …) is the partial sum sequence associated with the interarrival sequence T:
τ_n = ∑_{i=1}^{n} T_i, n ∈ N  (16.20.1)
For n ∈ N_+, the arrival time τ_n has the gamma distribution with parameters n and r. Finally, the Poisson counting process N = {N_t : t ∈ [0, ∞)} is defined by
N_t = max{n ∈ N : τ_n ≤ t}, t ∈ [0, ∞)  (16.20.2)
so that N_t is the number of arrivals in (0, t] for t ∈ [0, ∞). The counting variable N_t has the Poisson distribution with parameter rt for t ∈ [0, ∞). The counting process N and the arrival time process τ are inverses in the sense that τ_n ≤ t if and only if N_t ≥ n for t ∈ [0, ∞) and n ∈ N_+. The Poisson counting process can be viewed as a continuous-time Markov chain.

Suppose that X_0 takes values in N and is independent of N. Define X_t = X_0 + N_t for t ∈ [0, ∞). Then X = {X_t : t ∈ [0, ∞)} is a continuous-time Markov chain on N with exponential parameter function given by λ(x) = r for x ∈ N and jump transition matrix Q given by Q(x, x + 1) = 1 for x ∈ N.

Proof

Note that the Poisson process, viewed as a Markov chain, is a pure birth chain. Clearly we can generalize this continuous-time Markov chain in a simple way by allowing a general embedded jump chain.

Suppose that X = {X_t : t ∈ [0, ∞)} is a Markov chain with (countable) state space S, with constant exponential parameter λ(x) = r ∈ (0, ∞) for x ∈ S, and jump transition matrix Q. Then X is said to be subordinate to the Poisson process with rate parameter r.
1. The transition times (τ_1, τ_2, …) are the arrival times of the Poisson process with rate r.
2. The inter-transition times (τ_1, τ_2 − τ_1, …) are the inter-arrival times of the Poisson process with rate r (independent, and each with the exponential distribution with rate r).
3. N = {N_t : t ∈ [0, ∞)} is the Poisson counting process, where N_t is the number of transitions in (0, t] for t ∈ [0, ∞).
4. The Poisson process and the jump chain Y = (Y_0, Y_1, …) are independent, and X_t = Y_{N_t} for t ∈ [0, ∞).
Proof

Since all states are stable, note that we must have Q(x, x) = 0 for x ∈ S. Note also that for x, y ∈ S with x ≠ y, the exponential rate parameter for the transition from x to y is μ(x, y) = rQ(x, y). Conversely, suppose that μ : S² → [0, ∞) satisfies μ(x, x) = 0 and ∑_{y∈S} μ(x, y) = r for every x ∈ S. Then the Markov chain with transition rates given by μ is subordinate to the Poisson process with rate r. It's easy to construct a Markov chain subordinate to the Poisson process.

Suppose that N = {N_t : t ∈ [0, ∞)} is a Poisson counting process with rate r ∈ (0, ∞) and that Y = {Y_n : n ∈ N} is a discrete-time Markov chain on S, independent of N, whose transition matrix Q satisfies Q(x, x) = 0 for every x ∈ S. Let X_t = Y_{N_t} for t ∈ [0, ∞). Then X = {X_t : t ∈ [0, ∞)} is a continuous-time Markov chain subordinate to the Poisson process.
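Here is a minimal simulation sketch of this construction: the jump chain Y is run at the arrival times of an independent rate-r Poisson process. The 3-state transition matrix Q (with zero diagonal) is hypothetical.

```python
import numpy as np

# A minimal simulation sketch of a chain subordinate to the Poisson process.
rng = np.random.default_rng(42)
r = 2.0
Q = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.3, 0.7, 0.0]])

def sample_path(x0, horizon):
    """Return the (time, state) pairs of X on [0, horizon]."""
    t, x = 0.0, x0
    path = [(0.0, x0)]
    while True:
        t += rng.exponential(1 / r)     # Poisson inter-arrival time
        if t > horizon:
            return path
        x = rng.choice(3, p=Q[x])       # jump according to Q
        path.append((round(t, 3), x))

print(sample_path(0, horizon=5.0))
```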

Generator and Transition Matrices


Next let's find the generator matrix and the transition semigroup. Suppose again that X = {X_t : t ∈ [0, ∞)} is a continuous-time Markov chain on S subordinate to the Poisson process with rate r ∈ (0, ∞) and with jump transition matrix Q. As usual, let P = {P_t : t ∈ [0, ∞)} denote the transition semigroup and G the infinitesimal generator.
The generator matrix G of X is G = r(Q − I). Hence for t ∈ [0, ∞),
1. The Kolmogorov backward equation is (d/dt) P_t = r(Q − I) P_t.
2. The Kolmogorov forward equation is (d/dt) P_t = r P_t (Q − I).

Proof

There are several ways to find the transition semigroup P = { Pt : t ∈ [0, ∞)} . The best way is a probabilistic argument using the
underlying Poisson process.

For t ∈ [0, ∞), the transition matrix P_t is given by
P_t = ∑_{n=0}^{∞} e^{−rt} ((rt)^n / n!) Q^n  (16.20.3)

Proof from the underlying Poisson process


Proof using the generator matrix
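As a numeric sketch of (16.20.3), with a hypothetical jump transition matrix Q, we can truncate the series and compare the result with the matrix exponential e^{tG}, where G = r(Q − I).

```python
import numpy as np
from scipy.linalg import expm

# A numeric sketch of (16.20.3): truncate the series and compare with e^{tG}.
r, t = 2.0, 1.5
Q = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.3, 0.7, 0.0]])

P_t = np.zeros_like(Q)
term = np.exp(-r * t) * np.eye(3)            # n = 0 term of the series
for n in range(60):
    P_t += term
    term = term @ Q * (r * t) / (n + 1)      # next Poisson weight and power of Q

print(np.allclose(P_t, expm(t * r * (Q - np.eye(3)))))  # True
```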

Potential Matrices
Next let's find the potential matrices. As with the transition matrices, we can do this in (at least) two different ways.

Suppose again that X = {X_t : t ∈ [0, ∞)} is a continuous-time Markov chain on S subordinate to the Poisson process with rate r ∈ (0, ∞) and with jump transition matrix Q. For α ∈ (0, ∞), the potential matrix U_α of X is
U_α = ∑_{n=0}^{∞} (1/(α + r)) (r/(α + r))^n Q^n  (16.20.5)

Proof from the definition


Proof using the generator

Recall that for p ∈ (0, 1), the p-potential matrix of the jump chain Y is R_p = ∑_{n=0}^{∞} p^n Q^n. Hence we have the following nice relationship between the potential matrix of X and the potential matrix of Y:
U_α = (1/(α + r)) R_{r/(α+r)}  (16.20.9)

Next recall that αU_α(x, ⋅) is the probability density function of X_T given X_0 = x, where T has the exponential distribution with parameter α and is independent of X. On the other hand, αU_α(x, ⋅) = (1 − p)R_p(x, ⋅) where p = r/(α + r). We know from our study of discrete potentials that (1 − p)R_p(x, ⋅) is the probability density function of Y_M where M has the geometric distribution on N with parameter 1 − p and is independent of Y. But also X_T = Y_{N_T}. So it follows that if T has the exponential distribution with parameter α, and N = {N_t : t ∈ [0, ∞)} is a Poisson process with rate r, independent of T, then N_T has the geometric distribution on N with parameter α/(α + r). Of course, we could easily verify this directly, but it's still fun to see such connections.
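Here is a quick Monte Carlo sketch of this last fact, with arbitrary parameters: sample T, then N_T, and compare the empirical distribution with the geometric probability density function.

```python
import numpy as np

# A Monte Carlo sketch: if T ~ exponential(alpha), independent of a rate-r
# Poisson process N, then N_T is geometric on N with parameter alpha/(alpha+r).
rng = np.random.default_rng(3)
alpha, r, m = 1.5, 2.0, 200_000

T = rng.exponential(1 / alpha, size=m)
N_T = rng.poisson(r * T)                 # given T, N_T is Poisson with mean r T

p = alpha / (alpha + r)
k = np.arange(6)
print(np.round([np.mean(N_T == j) for j in k], 4))  # empirical density
print(np.round(p * (1 - p) ** k, 4))                # geometric density
```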

Limiting Behavior and Stationary Distributions


Once again, suppose that X = {X_t : t ∈ [0, ∞)} is a continuous-time Markov chain on S subordinate to the Poisson process with rate r ∈ (0, ∞) and with jump transition matrix Q. Let Y = {Y_n : n ∈ N} denote the jump process. The limiting behavior and stationary distributions of X are closely related to those of Y.

Suppose that X (and hence Y) are irreducible and recurrent.
1. g : S → (0, ∞) is invariant for X if and only if g is invariant for Y.
2. f is an invariant probability density function for X if and only if f is an invariant probability density function for Y.
3. X is null recurrent if and only if Y is null recurrent, and in this case, lim_{n→∞} Q^n(x, y) = lim_{t→∞} P_t(x, y) = 0 for (x, y) ∈ S².
4. X is positive recurrent if and only if Y is positive recurrent. If Y is aperiodic, then lim_{n→∞} Q^n(x, y) = lim_{t→∞} P_t(x, y) = f(y) for (x, y) ∈ S², where f is the invariant probability density function.

Proof

Time Reversal
Once again, suppose that X = {X_t : t ∈ [0, ∞)} is a continuous-time Markov chain on S subordinate to the Poisson process with rate r ∈ (0, ∞) and with jump transition matrix Q. Let Y = {Y_n : n ∈ N} denote the jump process. We assume that X (and hence Y) are irreducible. The time reversal of X is closely related to that of Y.

Suppose that g : S → (0, ∞) is invariant for X. The time reversal X̂ with respect to g is also subordinate to the Poisson process with rate r. The jump chain Ŷ of X̂ is the (discrete) time reversal of Y with respect to g.
Proof

In particular, X is reversible with respect to g if and only if Y is reversible with respect to g . As noted earlier, X and Y are of the
same type: both transient or both null recurrent or both positive recurrent. In the recurrent case, there exists a positive invariant
function that is unique up to multiplication by constants. In this case, the reversal of X is unique, and is the chain subordinate to
the Poisson process with rate r whose jump chain is the reversal of Y .

Uniform Chains
In the construction above for a Markov chain X = {X_t : t ∈ [0, ∞)} that is subordinate to the Poisson process with rate r and jump transition kernel Q, we assumed of course that Q(x, x) = 0 for every x ∈ S. So there are no absorbing states, and the sequence (τ_1, τ_2, …) of arrival times of the Poisson process are the jump times of the chain X. However, in our introduction to continuous-time chains, we saw that the general construction of a chain starting with the function λ and the transition matrix Q works without this assumption on Q, although the exponential parameters and transition probabilities change. The same idea works here.

Suppose that N = {N_t : t ∈ [0, ∞)} is a Poisson counting process with rate r ∈ (0, ∞) and that Y = {Y_n : n ∈ N} is a discrete-time Markov chain on S whose transition matrix Q satisfies Q(x, x) < 1 for x ∈ S. Assume also that N and Y are independent. Define X_t = Y_{N_t} for t ∈ [0, ∞). Then X = {X_t : t ∈ [0, ∞)} is a continuous-time Markov chain with exponential parameter function λ(x) = r[1 − Q(x, x)] for x ∈ S and jump transition matrix Q̃ given by
Q̃(x, y) = Q(x, y) / (1 − Q(x, x)), (x, y) ∈ S², x ≠ y  (16.20.10)
Proof

The Markov chain constructed above is no longer a chain subordinate to the Poisson process by our definition above, since the
exponential parameter function is not constant, and the transition times of X are no longer the arrival times of the Poisson process.
Nonetheless, many of the basic results above still apply.

Let X = {X t : t ∈ [0, ∞)} be the Markov chain constructed in the previous theorem. Then
1. For t ∈ [0, ∞), the transition matrix P is given by
t

∞ n
(rt)
−rt n
Pt = ∑ e Q (16.20.11)
n!
n=0

2. For α ∈ (0, ∞) , the α potential matrix is given by


∞ n
1 r
n
Uα = ∑( ) Q (16.20.12)
α +r α +r
n=0

3. The generator matrix is G = r(Q − I )


4. g : S → (0, ∞) is invariant for X if and only if g is invariant for Y .
Proof

It's a remarkable fact that every continuous-time Markov chain with bounded exponential parameters can be constructed as in the last theorem, a process known as uniformization. The name comes from the fact that in the construction, the exponential parameters become constant, but at the expense of allowing the embedded discrete-time chain to jump from a state back to that state. To review the definition, suppose that X = {X_t : t ∈ [0, ∞)} is a continuous-time Markov chain on S with transition semigroup P = {P_t : t ∈ [0, ∞)}, exponential parameter function λ, and jump transition matrix Q. Then P is uniform if P_t(x, x) → 1 as t ↓ 0 uniformly in x, or equivalently if λ is bounded.

Suppose that λ : S → (0, ∞) is bounded and that Q is a transition matrix on S with Q(x, x) = 0 for every x ∈ S. Let r ∈ (0, ∞) be an upper bound on λ and N = {N_t : t ∈ [0, ∞)} a Poisson counting process with rate r. Define the transition matrix Q̂ on S by
Q̂(x, x) = 1 − λ(x)/r, x ∈ S
Q̂(x, y) = (λ(x)/r) Q(x, y), (x, y) ∈ S², x ≠ y
and let Y = {Y_n : n ∈ N} be a discrete-time Markov chain with transition matrix Q̂, independent of N. Define X_t = Y_{N_t} for t ∈ [0, ∞). Then X = {X_t : t ∈ [0, ∞)} is a continuous-time Markov chain with exponential parameter function λ and jump transition matrix Q.
Proof
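Here is a minimal simulation sketch of the uniformization construction, reusing λ = (4, 1, 3) and Q from the exercise in the previous section: we build Q̂ and run a Q̂-chain at the arrival times of a rate-r Poisson process.

```python
import numpy as np

# A minimal simulation sketch of uniformization.
rng = np.random.default_rng(0)
lam = np.array([4.0, 1.0, 3.0])
Q = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [1/3, 2/3, 0.0]])
r = lam.max()                              # any upper bound on lambda works

Q_hat = (lam / r)[:, None] * Q
Q_hat[np.diag_indices(3)] = 1 - lam / r    # fictitious jumps back to the same state

def simulate(x0, horizon):
    t, x = 0.0, x0
    path = [(0.0, x0)]
    while True:
        t += rng.exponential(1 / r)
        if t > horizon:
            return path
        x = rng.choice(3, p=Q_hat[x])
        path.append((round(t, 3), x))

print(simulate(0, 2.0))
```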

Note in particular that if the state space S is finite then of course λ is bounded so the previous theorem applies. The theorem is
useful for simulating a continuous-time Markov chain, since the Poisson process and discrete-time chains are simple to simulate. In
addition, we have nice representations for the transition matrices, potential matrices, and the generator matrix.

Suppose that X = {X_t : t ∈ [0, ∞)} is a continuous-time Markov chain on S with bounded exponential parameter function λ : S → (0, ∞) and jump transition matrix Q. Define r and Q̂ as in the last theorem. Then
1. For t ∈ [0, ∞), the transition matrix P_t is given by
P_t = ∑_{n=0}^{∞} e^{−rt} ((rt)^n / n!) Q̂^n  (16.20.14)
2. For α ∈ (0, ∞), the α potential matrix is given by
U_α = ∑_{n=0}^{∞} (1/(α + r)) (r/(α + r))^n Q̂^n  (16.20.15)
3. The generator matrix is G = r(Q̂ − I).
4. g : S → (0, ∞) is invariant for X if and only if g is invariant for Q̂.
Proof

Examples
The Two-State Chain
The following exercise applies the uniformization method to the two-state chain.

Consider the continuous-time Markov chain X = {X_t : t ∈ [0, ∞)} on S = {0, 1} with exponential parameter function λ = (a, b), where a, b ∈ (0, ∞). Thus, states 0 and 1 are stable and the jump chain has transition matrix
Q =
⎡ 0  1 ⎤
⎣ 1  0 ⎦   (16.20.16)
Let r = a + b, an upper bound on λ. Show that
1. Q̂ = (1/(a + b)) ⎡ b  a ⎤
                   ⎣ b  a ⎦
2. G = ⎡ −a   a ⎤
       ⎣  b  −b ⎦
3. P_t = Q̂ − (1/(a + b)) e^{−(a+b)t} G for t ∈ [0, ∞)
4. U_α = (1/α) Q̂ − (1/((α + a + b)(a + b))) G for α ∈ (0, ∞)

Proof
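Here is a numeric check of parts 3 and 4 for one arbitrary choice of a, b, t, and α, using the facts that P_t = e^{tG} and U_α = (αI − G)^{−1} for uniform chains.

```python
import numpy as np
from scipy.linalg import expm

# A numeric check of parts 3 and 4 of the two-state uniformization exercise.
a, b, t, alpha = 2.0, 3.0, 0.7, 1.1
G = np.array([[-a,  a],
              [ b, -b]])
Q_hat = np.array([[b, a],
                  [b, a]]) / (a + b)

P_t = Q_hat - np.exp(-(a + b) * t) * G / (a + b)
print(np.allclose(P_t, expm(t * G)))                               # part 3

U_alpha = Q_hat / alpha - G / ((alpha + a + b) * (a + b))
print(np.allclose(U_alpha, np.linalg.inv(alpha * np.eye(2) - G)))  # part 4
```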

Although we have obtained all of these results for the two-state chain before, the derivation based on uniformization is the easiest.

This page titled 16.20: Chains Subordinate to the Poisson Process is shared under a CC BY 2.0 license and was authored, remixed, and/or curated
by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history
is available upon request.

16.21: Continuous-Time Birth-Death Chains

Basic Theory
Introduction
Continuous-time birth-death chains form a simple class of Markov chains on a subset of Z with the property that the only possible transitions are to increase the state by 1 (a birth) or decrease the state by 1 (a death). It's easiest to define a birth-death chain in terms of its exponential transition rates, part of the basic structure of continuous-time Markov chains.

Suppose that S is an integer interval (that is, a set of consecutive integers), either finite or infinite. The birth-death chain with birth rate function α : S → [0, ∞) and death rate function β : S → [0, ∞) is the Markov chain X = {X_t : t ∈ [0, ∞)} on S with transition rate α(x) from x to x + 1 and transition rate β(x) from x to x − 1, for x ∈ S.

If S has a minimum element m, then of course we must have β(m) = 0 . If α(m) = 0 also, then the boundary point m is
absorbing. Similarly, if S has a maximum element n then we must have α(n) = 0 . If β(n) = 0 also then the boundary point n is
absorbing. If x ∈ S is not a boundary point, then typically we have α(x) + β(x) > 0 , so that x is stable. If β(x) = 0 for all x ∈ S ,
then X is a pure birth process, and similarly if α(x) = 0 for all x ∈ S then X is a pure death process. From the transition rates,
it's easy to compute the parameters of the exponential holding times in a state and the transition matrix of the embedded, discrete-
time jump chain.

Consider again the birth-death chain X on S with birth rate function α and death rate function β. As usual, let λ denote the exponential parameter function and Q the transition matrix for the jump chain.
1. λ(x) = α(x) + β(x) for x ∈ S.
2. If x ∈ S is stable, so that α(x) + β(x) > 0, then
Q(x, x + 1) = α(x)/(α(x) + β(x)), Q(x, x − 1) = β(x)/(α(x) + β(x))  (16.21.1)

Note that the jump chain Y = (Y_0, Y_1, …) is a discrete-time birth-death chain. The probability functions p, q, and r of Y are given as follows: if x ∈ S is stable then
p(x) = Q(x, x + 1) = α(x)/(α(x) + β(x))
q(x) = Q(x, x − 1) = β(x)/(α(x) + β(x))
r(x) = Q(x, x) = 0

If x is absorbing then of course p(x) = q(x) = 0 and r(x) = 1. Except for the initial state, the jump chain Y is deterministic for a pure birth process, with Q(x, x) = 1 if x is absorbing and Q(x, x + 1) = 1 if x is stable. Similarly, except for the initial state, Y is deterministic for a pure death process, with Q(x, x) = 1 if x is absorbing and Q(x, x − 1) = 1 if x is stable. Note that the Poisson process with rate parameter r ∈ (0, ∞), viewed as a continuous-time Markov chain, is a pure birth process on N with birth rate function α(x) = r for each x ∈ N. More generally, a birth-death chain with λ(x) = α(x) + β(x) = r for all x ∈ S is also subordinate to the Poisson process with rate r.
Note that λ is bounded if and only if α and β are bounded (always the case if S is finite), and in this case the birth-death chain X = {X_t : t ∈ [0, ∞)} is uniform. If λ is unbounded, then X may not even be regular, as an example below shows. Recall that a sufficient condition for X to be regular when S is infinite is
∑_{x∈S_+} 1/λ(x) = ∑_{x∈S_+} 1/(α(x) + β(x)) = ∞  (16.21.2)
where S_+ = {x ∈ S : λ(x) = α(x) + β(x) > 0} is the set of stable states. Except for the aforementioned example, we will restrict our study to regular birth-death chains.

Infinitesimal Generator and Transition Matrices
Suppose again that X = {X_t : t ∈ [0, ∞)} is a continuous-time birth-death chain on an integer interval S ⊆ Z with birth rate function α and death rate function β. As usual, we will let P_t denote the transition matrix at time t ∈ [0, ∞) and G the infinitesimal generator. As always, the infinitesimal generator gives the same information as the exponential parameter function and the jump transition matrix, but in a more compact and useful form.

The generator matrix G is given by
G(x, x) = −[α(x) + β(x)], G(x, x + 1) = α(x), G(x, x − 1) = β(x), x ∈ S  (16.21.3)

Proof

The Kolmogorov backward and forward equations are
1. (d/dt) P_t(x, y) = −[α(x) + β(x)] P_t(x, y) + α(x) P_t(x + 1, y) + β(x) P_t(x − 1, y) for (x, y) ∈ S².
2. (d/dt) P_t(x, y) = −[α(y) + β(y)] P_t(x, y) + α(y − 1) P_t(x, y − 1) + β(y + 1) P_t(x, y + 1) for (x, y) ∈ S².

Proof

Limiting Behavior and Stationary Distributions


For our discussion of limiting behavior, we will consider first the important special case of a continuous-time birth-death chain
X = { X : t ∈ [0, ∞)} on S = N and with α(x) > 0 for all x ∈ N and β(x) > 0 for all x ∈ N . For the jump chain
t +

Y = { Y : n ∈ N} , recall that
n

α(x) β(x)
p(x) = Q(x, x + 1) = , q(x) = Q(x, x − 1) = , x ∈ N (16.21.4)
α(x) + β(x) α(x) + β(x)

The jump chain Y is a discrete-time birth-death chain, and our notation here is consistent with the notation that we used in that
section. Note that X and Y are irreducible. We first consider transience and recurrence.

The chains X and Y are recurrent if and only if
∑_{x=0}^{∞} (β(1) ⋯ β(x))/(α(1) ⋯ α(x)) = ∞  (16.21.5)

Proof

Next we consider positive recurrence and invariant distributions. It's nice to look at this from different points of view.

The function g : N → (0, ∞) defined by
g(x) = (α(0) ⋯ α(x − 1))/(β(1) ⋯ β(x)), x ∈ N  (16.21.8)
is invariant for X, and is the only invariant function, up to multiplication by constants. Hence X is positive recurrent if and only if B = ∑_{x=0}^{∞} g(x) < ∞, in which case the (unique) invariant probability density function f is given by f(x) = (1/B) g(x) for x ∈ N. Moreover, P_t(x, y) → f(y) as t → ∞ for every x, y ∈ N.

Proof using the jump chain


Proof from the balance equation

Here is a summary of the classification:

For the continuous-time birth-death chain X, let
A = ∑_{x=0}^{∞} (β(1) ⋯ β(x))/(α(1) ⋯ α(x)), B = ∑_{x=0}^{∞} (α(0) ⋯ α(x − 1))/(β(1) ⋯ β(x))  (16.21.13)
1. X is transient if A < ∞.
2. X is null recurrent if A = ∞ and B = ∞.
3. X is positive recurrent if B < ∞.
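The classification can be explored numerically by truncating the sums A and B. Here is a small sketch with hypothetical constant rate functions; the convergence test is a crude heuristic, not a proof.

```python
import numpy as np

# A small numeric sketch of the classification: approximate A and B by
# truncating at a large level.
def classify(alpha, beta, levels=200):
    terms_A = np.cumprod([beta(x) / alpha(x) for x in range(1, levels)])
    terms_B = np.cumprod([alpha(x - 1) / beta(x) for x in range(1, levels)])
    if terms_A[-1] < 1e-12:                      # A converges (heuristic)
        return "transient (A < infinity)"
    if terms_B[-1] < 1e-12:                      # B converges (heuristic)
        return "positive recurrent, B ~= %.4f" % (1 + terms_B.sum())
    return "null recurrent (A and B both diverge, numerically)"

print(classify(lambda x: 1.0, lambda x: 2.0))   # deaths dominate
print(classify(lambda x: 2.0, lambda x: 1.0))   # births dominate
print(classify(lambda x: 1.0, lambda x: 1.0))   # balanced rates
```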

Suppose now that n ∈ N_+ and that X = {X_t : t ∈ [0, ∞)} is a continuous-time birth-death chain on the integer interval N_n = {0, 1, …, n}. We assume that α(x) > 0 for x ∈ {0, 1, …, n − 1} while β(x) > 0 for x ∈ {1, 2, …, n}. Of course, we must have β(0) = α(n) = 0. With these assumptions, X is irreducible, and since the state space is finite, positive recurrent. So all that remains is to find the invariant distribution. The result is essentially the same as when the state space is N.

The invariant probability density function f_n is given by
f_n(x) = (1/B_n) (α(0) ⋯ α(x − 1))/(β(1) ⋯ β(x)) for x ∈ N_n, where B_n = ∑_{x=0}^{n} (α(0) ⋯ α(x − 1))/(β(1) ⋯ β(x))  (16.21.14)

Proof

Note that B_n → B as n → ∞, and if B < ∞ then f_n(x) → f(x) as n → ∞ for x ∈ N. We will see this type of behavior again: results for the birth-death chain on N_n often converge to the corresponding results for the birth-death chain on N as n → ∞.

Absorption
Often when the state space is S = N, the state of a birth-death chain represents a population of individuals of some sort (and so the terms birth and death have their usual meanings). In this case state 0 is absorbing and means that the population is extinct. Specifically, suppose that X = {X_t : t ∈ [0, ∞)} is a regular birth-death chain on N with α(0) = β(0) = 0 and with α(x), β(x) > 0 for x ∈ N_+. Thus, state 0 is absorbing and all positive states lead to each other and to 0. Let T = min{t ∈ [0, ∞) : X_t = 0} denote the time until absorption, where as usual, min ∅ = ∞. Many of the results concerning extinction of the continuous-time birth-death chain follow easily from corresponding results for the discrete-time birth-death jump chain.

One of the following events will occur:
1. Population extinction: T < ∞, or equivalently, X_s = 0 for some s ∈ [0, ∞) and hence X_t = 0 for all t ∈ [s, ∞).
2. Population explosion: T = ∞, or equivalently, X_t → ∞ as t → ∞.

Proof

Naturally we would like to find the probability of these complementary events, and happily we have already done so in our study of discrete-time birth-death chains. The absorption probability function v is defined by
v(x) = P(T < ∞ ∣ X_0 = x) = P(X_t = 0 for some t ∈ [0, ∞) ∣ X_0 = x), x ∈ N  (16.21.16)
As before, let
A = ∑_{i=0}^{∞} (β(1) ⋯ β(i))/(α(1) ⋯ α(i))  (16.21.17)

1. If A = ∞ then v(x) = 1 for all x ∈ N.
2. If A < ∞ then
v(x) = (1/A) ∑_{i=x}^{∞} (β(1) ⋯ β(i))/(α(1) ⋯ α(i)), x ∈ N  (16.21.18)

Proof

The mean time to extinction is considered next, so let m(x) = E(T ∣ X_0 = x) for x ∈ N. Unlike the probability of extinction, computing the mean time to extinction cannot be easily reduced to the corresponding discrete-time computation. However, the method of computation does extend.

The mean absorption function is given by
m(x) = ∑_{j=1}^{x} ∑_{k=j−1}^{∞} (α(j) ⋯ α(k))/(β(j) ⋯ β(k + 1)), x ∈ N  (16.21.19)

Probabilistic Proof

In particular, note that
m(1) = ∑_{k=0}^{∞} (α(1) ⋯ α(k))/(β(1) ⋯ β(k + 1))  (16.21.23)
If m(1) = ∞ then m(x) = ∞ for all x ∈ N_+, and if m(1) < ∞ then m(x) < ∞ for all x ∈ N_+.
Next we will consider a birth-death chain on a finite integer interval with both endpoints absorbing. Our interest is in the probability of absorption in one endpoint rather than the other, and in the mean time to absorption. Thus suppose that n ∈ N_+ and that X = {X_t : t ∈ [0, ∞)} is a continuous-time birth-death chain on N_n = {0, 1, …, n} with α(0) = β(0) = 0, α(n) = β(n) = 0, and α(x) > 0, β(x) > 0 for x ∈ {1, 2, …, n − 1}. So the endpoints 0 and n are absorbing, and all other states lead to each other and to the endpoints. Let T = inf{t ∈ [0, ∞) : X_t ∈ {0, n}}, the time until absorption, and for x ∈ N_n let v_n(x) = P(X_T = 0 ∣ X_0 = x) and m_n(x) = E(T ∣ X_0 = x). The definitions make sense since T is finite with probability 1.
T 0 n 0

The absorption probability function for state 0 is given by
v_n(x) = (1/A_n) ∑_{i=x}^{n−1} (β(1) ⋯ β(i))/(α(1) ⋯ α(i)) for x ∈ N_n, where A_n = ∑_{i=0}^{n−1} (β(1) ⋯ β(i))/(α(1) ⋯ α(i))  (16.21.24)

Proof

Note that A_n → A as n → ∞, where A is the constant above for the absorption probability at 0 with the infinite state space N. If A < ∞ then v_n(x) → v(x) as n → ∞ for x ∈ N.

Time Reversal
Essentially, every irreducible continuous-time birth-death chain is reversible.

Suppose that X = {X_t : t ∈ [0, ∞)} is a positive recurrent birth-death chain on an integer interval S ⊆ Z with birth rate function α : S → [0, ∞) and death rate function β : S → [0, ∞). Assume that α(x) > 0, except at the maximum value of S, if there is one, and similarly that β(x) > 0, except at the minimum value of S, if there is one. Then X is reversible.
Proof

In the important special case of a birth-death chain on N, we can verify the balance equations directly.

Suppose that X = {X_t : t ∈ [0, ∞)} is a continuous-time birth-death chain on S = N with birth rate α(x) > 0 for all x ∈ N and death rate β(x) > 0 for all x ∈ N_+. Then X is reversible.

Proof

In the positive recurrent case, it follows that the birth-death chain is stochastically the same, forward or backward in time, if the
chain has the invariant distribution.

Examples and Special Cases


Regular and Irregular Chains
Our first exercise gives two pure birth chains, each with an unbounded exponential parameter function. One is regular and one is
irregular.

Consider the pure birth process X = {X_t : t ∈ [0, ∞)} on N_+ with birth rate function α.
1. If α(x) = x² for x ∈ N_+, then X is not regular.
2. If α(x) = x for x ∈ N_+, then X is regular.
Proof

Constant Birth and Death Rates


Our next examples consider birth-death chains with constant birth and death rates, except perhaps at the endpoints. Note that such
chains will be regular since the exponential parameter function λ is bounded.

Suppose that X = {X_t : t ∈ [0, ∞)} is the birth-death chain on N with constant birth rate α ∈ (0, ∞) on N and constant death rate β ∈ (0, ∞) on N_+.
1. X is transient if β < α.
2. X is null recurrent if β = α.
3. X is positive recurrent if β > α. The invariant distribution is the geometric distribution on N with parameter α/β:
f(x) = (1 − α/β)(α/β)^x, x ∈ N  (16.21.28)

Proof

Next we consider the chain with 0 absorbing. As in the general discussion above, let v denote the function that gives the probability
of absorption and m the function that gives the mean time to absorption.

Suppose that X = {X_t : t ∈ [0, ∞)} is the birth-death chain on N with constant birth rate α ∈ (0, ∞) on N_+, constant death rate β ∈ (0, ∞) on N_+, and with 0 absorbing. Then
1. If β ≥ α then v(x) = 1 for x ∈ N. If β < α then v(x) = (β/α)^x for x ∈ N.
2. If α ≥ β then m(x) = ∞. If α < β then m(x) = x/(β − α) for x ∈ N.
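As a quick numeric sketch (with an arbitrary choice of α < β), the general series for m(1) in (16.21.23) can be checked against the closed form 1/(β − α):

```python
import numpy as np

# Check that the series for m(1) sums to 1 / (beta - alpha) for constant rates.
alpha, beta = 1.0, 2.5
terms = np.cumprod([alpha / beta] * 200) / beta   # alpha^k / beta^(k+1), k >= 1
m1 = 1 / beta + terms.sum()                       # the k = 0 term is 1 / beta
print(np.isclose(m1, 1 / (beta - alpha)))         # True
```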

Next let's look at chains on a finite state space. Let n ∈ N_+ and define N_n = {0, 1, …, n}.

Suppose that X = {X_t : t ∈ [0, ∞)} is a continuous-time birth-death chain on N_n with constant birth rate α ∈ (0, ∞) on {0, 1, …, n − 1} and constant death rate β ∈ (0, ∞) on {1, 2, …, n}. The invariant probability density function f_n is given as follows:
1. If α ≠ β then
f_n(x) = (α/β)^x (1 − α/β) / (1 − (α/β)^{n+1}), x ∈ N_n  (16.21.30)
2. If α = β then f_n(x) = 1/(n + 1) for x ∈ N_n.

Note that when α = β, the invariant distribution is uniform on N_n. Our final exercise considers the absorption probability at 0 when both endpoints are absorbing. Let v_n denote the function that gives the probability of absorption into 0, rather than n.

Suppose that X = {X_t : t ∈ [0, ∞)} is the birth-death chain on N_n with constant birth rate α and constant death rate β on {1, 2, …, n − 1}, and with 0 and n absorbing.
1. If α ≠ β then
v_n(x) = ((β/α)^x − (β/α)^n) / (1 − (β/α)^n), x ∈ N_n  (16.21.31)
2. If α = β then v_n(x) = (n − x)/n for x ∈ N_n.

Linear Birth and Death Rates


For our next discussion, consider individuals that act identically and independently. Each individual splits into two at exponential
rate a ∈ (0, ∞) and dies at exponential rate b ∈ (0, ∞).

Let X_t denote the population at time t ∈ [0, ∞). Then X = {X_t : t ∈ [0, ∞)} is a regular, continuous-time birth-death chain with birth and death rate functions given by α(x) = ax and β(x) = bx for x ∈ N.
Proof

Note that 0 is absorbing since the population is extinct, so as usual, our interest is in the probability of absorption and the mean
time to absorption as functions of the initial state. The probability of absorption is the same as for the chain with constant birth and
death rates discussed above.

The absorption probability function v is given as follows:
1. v(x) = 1 for all x ∈ N if b ≥ a.
2. v(x) = (b/a)^x for x ∈ N if b < a.

Proof

The mean time to absorption is more interesting.

The mean time to absorption function m is given as follows:
1. If a ≥ b then m(x) = ∞ for x ∈ N_+.
2. If a < b then
m(x) = ∑_{j=1}^{x} (b^{j−1}/a^j) ∫_0^{a/b} u^{j−1}/(1 − u) du, x ∈ N  (16.21.34)

Proof

For small values of x ∈ N, the integrals in the case a < b can be done by elementary methods. For example,
m(1) = −(1/a) ln(1 − a/b)
m(2) = m(1) − 1/a − (b/a²) ln(1 − a/b)
m(3) = m(2) − 1/(2a) − b/a² − (b²/a³) ln(1 − a/b)

However, a general formula requires the introduction of a special function that is not much more helpful than the integrals
themselves. The Markov chain X is actually an example of a branching chain. We will revisit this chain in that section.
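Here is a numeric sanity check (with an arbitrary choice of a < b) that the integral series for m(2) matches the elementary closed form above:

```python
import numpy as np
from scipy.integrate import quad

# Compare the integral series for m(2) with the closed form above.
a, b = 1.0, 3.0
m_int = sum((b ** (j - 1) / a ** j) *
            quad(lambda u, j=j: u ** (j - 1) / (1 - u), 0, a / b)[0]
            for j in (1, 2))
m_closed = -np.log(1 - a / b) / a - 1 / a - (b / a ** 2) * np.log(1 - a / b)
print(np.isclose(m_int, m_closed))  # True
```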

Linear Birth and Death with Immigration


We continue our previous discussion but generalizing a bit. Suppose again that we have individuals that act identically and
independently. An individual splits into two at exponential rate a ∈ [0, ∞) and dies at exponential rate b ∈ [0, ∞). Additionally,
new individuals enter the population at exponential rate c ∈ [0, ∞). This is the immigration effect, and when c = 0 we have the
birth-death chain in the previous discussion.

Let X_t denote the population at time t ∈ [0, ∞). Then X = {X_t : t ∈ [0, ∞)} is a regular, continuous-time birth-death chain with birth and death rate functions given by α(x) = ax + c and β(x) = bx for x ∈ N.
Proof

The infinitesimal matrix G is given as follows, for x ∈ N:


1. G(x, x) = −[(a + b)x + c]
2. G(x, x + 1) = ax + c
3. G(x, x − 1) = bx

The backward and forward equations are given as follows, for (x, y) ∈ N² and t ∈ (0, ∞):
1. (d/dt) P_t(x, y) = −[(a + b)x + c] P_t(x, y) + (ax + c) P_t(x + 1, y) + bx P_t(x − 1, y)
2. (d/dt) P_t(x, y) = −[(a + b)y + c] P_t(x, y) + [a(y − 1) + c] P_t(x, y − 1) + b(y + 1) P_t(x, y + 1)

We can use the forward equation to find the expected population size. Let M_t(x) = E(X_t ∣ X_0 = x) for t ∈ [0, ∞) and x ∈ N.

For t ∈ [0, ∞) and x ∈ N, the mean population size M_t(x) is given as follows:
1. If a = b then M_t(x) = ct + x.
2. If a ≠ b then
M_t(x) = (c/(a − b))[e^{(a−b)t} − 1] + x e^{(a−b)t}  (16.21.40)

Proof
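Here is a Monte Carlo sketch of the mean formula, with arbitrary parameters: simulate the chain directly from its transition rates and compare the sample mean at time t with M_t(x).

```python
import numpy as np

# Monte Carlo check of M_t(x) for linear birth-death with immigration (a != b).
rng = np.random.default_rng(1)
a, b, c, x0, t_end = 0.8, 1.0, 0.5, 3, 2.0

def population_at(t_end):
    t, x = 0.0, x0
    while True:
        rate = (a + b) * x + c              # total transition rate in state x
        t += rng.exponential(1 / rate)
        if t > t_end:
            return x
        x += 1 if rng.random() < (a * x + c) / rate else -1

sample = np.mean([population_at(t_end) for _ in range(20_000)])
exact = (c / (a - b)) * (np.exp((a - b) * t_end) - 1) + x0 * np.exp((a - b) * t_end)
print(round(sample, 3), round(exact, 3))    # should agree to Monte Carlo error
```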

Note that if b > a, so that the individual death rate exceeds the birth rate, then M_t(x) → c/(b − a) as t → ∞ for x ∈ N. If a ≥ b, so that the birth rate equals or exceeds the death rate, then M_t(x) → ∞ as t → ∞ for x ∈ N_+.

Next we will consider the special case with no births, but only death and immigration. In this case, the invariant distribution is easy
to compute, and is one of our favorites.

Suppose that a = 0 and that b, c > 0. Then X is positive recurrent. The invariant distribution is Poisson with parameter c/b:
f(x) = e^{−c/b} (c/b)^x / x!, x ∈ N  (16.21.42)

Proof

The Logistics Chain


Consider a population that fluctuates between a minimum value m ∈ N_+ and a maximum value n ∈ N_+, where of course m < n. Given the population size, the individuals act independently and identically. Specifically, if the population is x ∈ {m, m + 1, …, n} then an individual splits in two at exponential rate a(n − x) and dies at exponential rate b(x − m), where a, b ∈ (0, ∞). Thus, an individual's birth rate decreases linearly with the population size, from a(n − m) to 0, while the death rate increases linearly with the population size, from 0 to b(n − m). These assumptions lead to the following definition.

The continuous-time birth-death chain X = {X_t : t ∈ [0, ∞)} on S = {m, m + 1, …, n} with birth rate function α and death rate function β given by
α(x) = ax(n − x), β(x) = bx(x − m), x ∈ S  (16.21.45)
is the logistics chain on S with parameters a and b.


Justification

Note that the logistics chain is a stochastic counterpart of the logistics differential equation, which typically has the form
dx/dt = c(x − m)(n − x)  (16.21.46)
where m, n, c ∈ (0, ∞) and m < n. Starting in x(0) ∈ (m, n), the solution remains in (m, n) for all t ∈ [0, ∞). Of course, the logistics differential equation models a system that is continuous in time and space, whereas the logistics Markov chain models a system that is continuous in time and discrete in space.

For the logistics chain,
1. The exponential parameter function λ is given by
λ(x) = ax(n − x) + bx(x − m), x ∈ S  (16.21.47)
2. The transition matrix Q of the jump chain is given by
Q(x, x − 1) = b(x − m)/[a(n − x) + b(x − m)], Q(x, x + 1) = a(n − x)/[a(n − x) + b(x − m)], x ∈ S  (16.21.48)
In particular, m and n are reflecting boundary points, and so the chain is irreducible.

The generator matrix G for the logistics chain is given as follows, for x ∈ S :
1. G(x, x) = −x[a(n − x) + b(x − m)]
2. G(x, x − 1) = bx(x − m)
3. G(x, x + 1) = ax(n − x)

Since S is finite, X is positive recurrent. The invariant distribution is given next.

Define g : S → (0, ∞) by
g(x) = (1/x) (n − m choose x − m) (a/b)^{x−m}, x ∈ S  (16.21.49)
Then g is invariant for X.
Proof

Of course it now follows that the invariant probability density function f for X is given by f(x) = g(x)/c for x ∈ S, where c is the normalizing constant
c = ∑_{x=m}^{n} (1/x) (n − m choose x − m) (a/b)^{x−m}  (16.21.51)

The limiting distribution of X has probability density function f .
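Here is a small numeric sketch, with hypothetical parameters, that computes the invariant density of the logistics chain and confirms the one-step balance equation f(x)α(x) = f(x + 1)β(x + 1) on which the derivation rests:

```python
import numpy as np
from math import comb

# Invariant density of the logistics chain, with hypothetical parameters.
m, n, a, b = 2, 10, 1.5, 1.0
S = np.arange(m, n + 1)
g = np.array([comb(n - m, x - m) * (a / b) ** (x - m) / x for x in S])
f = g / g.sum()

alpha = lambda x: a * x * (n - x)
beta = lambda x: b * x * (x - m)
print(all(np.isclose(f[i] * alpha(x), f[i + 1] * beta(x + 1))
          for i, x in enumerate(S[:-1])))   # True: balance holds
print(np.round(f, 4))
```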

Other Special Birth-Death Chains


There are a number of special birth-death chains that are studied in other sections, because the models are important and lead to
special insights and analytic tools. These include
Queuing chains
The pure death branching chain
The Yule process, a pure birth branching chain
The general birth-death branching chain

This page titled 16.21: Continuous-Time Birth-Death Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

16.22: Continuous-Time Queuing Chains

Basic Theory
Introduction
In a queuing model, customers arrive at a station for service. As always, the terms are generic; here are some typical examples:
The customers are persons and the service station is a store.
The customers are file requests and the service station is a web server.

Figure 16.22.1: Ten customers and a server

Queuing models can be quite complex, depending on such factors as the probability distribution that governs the arrival of
customers, the probability distribution that governs the service of customers, the number of servers, and the behavior of the
customers when all servers are busy. Indeed, queuing theory has its own lexicon to indicate some of these factors. In this section,
we will discuss a few of the basic, continuous-time queuing chains. In a general sense, the main interest in any queuing model is
the number of customers in the system as a function of time, and in particular, whether the servers can adequately handle the flow
of customers. This section parallels the section on discrete-time queuing chains.

Our main assumptions are as follows:
1. There are k ∈ N_+ ∪ {∞} servers.
2. The customers arrive according to a Poisson process with rate μ ∈ (0, ∞).
3. If all of the servers are busy, a new customer goes to the end of a single line of customers waiting for service.
4. The time required to service a customer has an exponential distribution with parameter ν ∈ (0, ∞).
5. The service times are independent from customer to customer, and are independent of the arrival process.
5. The service times are independent from customer to customer, and are independent of the arrival process.

Assumption 2 means that the times between arrivals of customers are independent and exponentially distributed, with parameter μ. Assumption 3 means that we have a first-in, first-out model, often abbreviated FIFO. Note that there are three parameters in the model: the number of servers k, the exponential parameter μ that governs the arrivals, and the exponential parameter ν that governs the service times. The special cases k = 1 (a single server) and k = ∞ (infinitely many servers) deserve special attention. As you might guess, the assumptions lead to a continuous-time Markov chain.
As you might guess, the assumptions lead to a continuous-time Markov chain.

Let X_t denote the number of customers in the system (waiting in line or being served) at time t ∈ [0, ∞). Then X = {X_t : t ∈ [0, ∞)} is a continuous-time Markov chain on N, known as the M/M/k queuing chain.

In terms of the basic structure of the chain, the important quantities are the exponential parameters for the states and the transition
matrix for the embedded jump chain.

For the M/M/k chain X,
1. The exponential parameter function λ is given by λ(x) = μ + νx if x ∈ N and x < k, and λ(x) = μ + νk if x ∈ N and x ≥ k.
2. The transition matrix Q for the jump chain is given by
Q(x, x − 1) = νx/(μ + νx), Q(x, x + 1) = μ/(μ + νx), x ∈ N, x < k
Q(x, x − 1) = νk/(μ + νk), Q(x, x + 1) = μ/(μ + νk), x ∈ N, x ≥ k

So the M/M/k chain is a birth-death chain with 0 as a reflecting boundary point. That is, in state x ∈ N_+, the next state is either x − 1 or x + 1, while in state 0, the next state is 1. When k = 1, the single-server queue, the exponential parameter in state x ∈ N_+ is μ + ν and the transition probabilities for the jump chain are
Q(x, x − 1) = ν/(μ + ν), Q(x, x + 1) = μ/(μ + ν)  (16.22.1)
When k = ∞, the infinite server queue, the cases above for x ≥ k are vacuous, so the exponential parameter in state x ∈ N is μ + νx and the transition probabilities are
Q(x, x − 1) = νx/(μ + νx), Q(x, x + 1) = μ/(μ + νx)  (16.22.2)

Infinitesimal Generator
The infinitesimal generator of the chain gives the same information as the exponential parameter function and the jump transition
matrix, but in a more compact form.

For the M/M/k queuing chain X, the infinitesimal generator G is given by
G(x, x) = −(μ + νx), G(x, x − 1) = νx, G(x, x + 1) = μ; x ∈ N, x < k
G(x, x) = −(μ + νk), G(x, x − 1) = νk, G(x, x + 1) = μ; x ∈ N, x ≥ k

So for k = 1, the single server queue, the generator G is given by G(0, 0) = −μ and G(0, 1) = μ, while for x ∈ N_+, G(x, x) = −(μ + ν), G(x, x − 1) = ν, G(x, x + 1) = μ. For k = ∞, the infinite server case, the generator G is given by G(x, x) = −(μ + νx), G(x, x − 1) = νx, and G(x, x + 1) = μ for all x ∈ N.

Classification and Limiting Behavior


Again, let X = {X_t : t ∈ [0, ∞)} denote the M/M/k queuing chain with arrival rate μ, service rate ν, and k ∈ N_+ ∪ {∞} servers. As noted in the introduction, of fundamental importance is the question of whether the servers can handle the flow of customers, so that the queue eventually empties, or whether the length of the queue grows without bound. To understand the limiting behavior, we need to classify the chain as transient, null recurrent, or positive recurrent, and find the invariant functions. This will be easy to do using our results for more general continuous-time birth-death chains. Note first that X is irreducible. It's best to consider the single server and infinite server cases individually.

The single server queuing chain X is
1. Transient if ν < μ.
2. Null recurrent if ν = μ.
3. Positive recurrent if ν > μ. The invariant distribution is the geometric distribution on N with parameter μ/ν. The invariant probability density function f is given by
f(x) = (1 − μ/ν)(μ/ν)^x, x ∈ N  (16.22.3)

Proof
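Here is a simulation sketch for the positive recurrent case ν > μ, with arbitrary rates: the long-run fraction of time the chain spends in state x should approximate the geometric density f(x).

```python
import numpy as np

# Simulate the M/M/1 queue and compare occupation times with the invariant pdf.
rng = np.random.default_rng(5)
mu, nu, horizon = 1.0, 2.0, 50_000.0

t, x = 0.0, 0
occupation = np.zeros(30)
while t < horizon:
    rate = mu + (nu if x > 0 else 0.0)       # total transition rate in state x
    hold = rng.exponential(1 / rate)
    if x < occupation.size:
        occupation[x] += hold
    t += hold
    x += 1 if rng.random() < mu / rate else -1

empirical = occupation / occupation.sum()
invariant = (1 - mu / nu) * (mu / nu) ** np.arange(30)
print(np.round(empirical[:5], 3))
print(np.round(invariant[:5], 3))            # should be close
```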

The result makes intuitive sense. If the service rate is less than the arrival rate, the chain is transient and the length of the queue
grows to infinity. If the service rate is greater than the arrival rate, the chain is positive recurrent. At the boundary between these
two cases, when the arrival and service rates are the same, the chain is null recurrent.

The infinite server queuing chain X is positive recurrent. The invariant distribution is the Poisson distribution with parameter μ/ν. The invariant probability density function f is given by
f(x) = e^{−μ/ν} (μ/ν)^x / x!, x ∈ N  (16.22.4)

Proof

This result also makes intuitive sense.

This page titled 16.22: Continuous-Time Queuing Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

16.23: Continuous-Time Branching Chains

Basic Theory
Introduction
Generically, suppose that we have a system of particles that can generate or split into other particles of the same type. Here are
some typical examples:
The particles are biological organisms that reproduce.
The particles are neutrons in a chain reaction.
The particles are electrons in an electron multiplier.
We assume that the lifetime of each particle is exponentially distributed with parameter α ∈ (0, ∞), and that at the end of its life, each particle is replaced by a random number of new particles that we will refer to as children of the original particle. The number of children N of a particle has probability density function f on N. The particles act independently, so in addition to being identically distributed, the lifetimes and the numbers of children are independent from particle to particle. Finally, we assume that f(1) = 0, so that a particle cannot simply die and be replaced by a single new particle. Let μ and σ² denote the mean and variance of the number of offspring of a single particle. So
μ = E(N) = ∑_{n=0}^{∞} n f(n), σ² = var(N) = ∑_{n=0}^{∞} (n − μ)² f(n)  (16.23.1)

We assume that μ is finite, so that σ² makes sense. In our study of discrete-time Markov chains, we studied branching chains in terms of generational time. Here we want to study the model in real time.

Let X_t denote the number of particles at time t ∈ [0, ∞). Then X = {X_t : t ∈ [0, ∞)} is a continuous-time Markov chain on N, known as a branching chain. The exponential parameter function λ and jump transition matrix Q are given by
1. λ(x) = αx for x ∈ N
2. Q(x, x + k − 1) = f(k) for x ∈ N_+ and k ∈ N.

Proof

Of course 0 is an absorbing state, since this state means extinction with no particles. (Note that λ(0) = 0 and so by default,
Q(0, 0) = 1.) So with a branching chain, there are essentially two types of behavior: population extinction or population explosion.

For the branching chain X = {X_t : t ∈ [0, ∞)}, one of the following events occurs with probability 1:
1. Extinction: X_t = 0 for some t ∈ [0, ∞), and hence X_s = 0 for all s ≥ t.
2. Explosion: X_t → ∞ as t → ∞.
Proof

Without the assumption that μ < ∞, explosion can actually occur in finite time. On the other hand, the assumption that f(1) = 0 is for convenience. Without this assumption, X would still be a continuous-time Markov chain, but as discussed in the Introduction, the exponential parameter function would be λ(x) = α[1 − f(1)]x for x ∈ N and the jump transition matrix would be
Q(x, x + k − 1) = f(k)/(1 − f(1)), x ∈ N_+, k ∈ {0, 2, 3, …}  (16.23.2)

Because all particles act identically and independently, the branching chain starting with x ∈ N_+ particles is essentially x independent copies of the branching chain starting with 1 particle. In many ways, this is the fundamental insight into branching chains, and in particular it means that we can often condition on X_0 = 1.

Generator and Transition Matrices


As usual, we will let P = {P_t : t ∈ [0, ∞)} denote the semigroup of transition matrices of X, so that P_t(x, y) = P(X_t = y ∣ X_0 = x) for (x, y) ∈ N². Similarly, G denotes the infinitesimal generator matrix of X.
2

The infinitesimal generator G is given by
G(x, x) = −αx, x ∈ N
G(x, x + k − 1) = αx f(k), x ∈ N_+, k ∈ N

Proof

The Kolmogorov backward equation is
(d/dt) P_t(x, y) = −αx P_t(x, y) + αx ∑_{k=0}^{∞} f(k) P_t(x + k − 1, y), (x, y) ∈ N²  (16.23.3)

Proof

Unlike some of our other continuous-time models, the jump chain Y governed by Q is not the discrete-time version of the model. That is, Y is not a discrete-time branching chain, since in discrete time the index n represents the nth generation, whereas here it represents the nth time that a particle reproduces. However, there are lots of discrete-time branching chains embedded in the continuous-time chain.

Fix t ∈ (0, ∞) and define Z_t = {X_{nt} : n ∈ N}. Then Z_t is a discrete-time branching chain with offspring probability density function f_t given by f_t(x) = P_t(1, x) for x ∈ N.

Proof

Probability Generating Functions


As in the discrete case, probability generating functions are an important analytic tool for continuous-time branching chains.

For t ∈ [0, ∞) let Φ_t denote the probability generating function of X_t given X_0 = 1:
Φ_t(r) = E(r^{X_t} ∣ X_0 = 1) = ∑_{x=0}^{∞} r^x P_t(1, x)  (16.23.5)
Let Ψ denote the probability generating function of N:
Ψ(r) = E(r^N) = ∑_{n=0}^{∞} r^n f(n)  (16.23.6)
The generating functions are defined (the series are absolutely convergent) at least for r ∈ (−1, 1].

The collection of generating functions Φ = {Φ_t : t ∈ [0, ∞)} gives the same information as the collection of probability density functions {P_t(1, ⋅) : t ∈ [0, ∞)}. With the fundamental insight that the branching process starting with one particle determines the branching process in general, Φ actually determines the transition semigroup P = {P_t : t ∈ [0, ∞)}.

For t ∈ [0, ∞) and x ∈ N, the probability generating function of X_t given X_0 = x is Φ_t^x:
∑_{y=0}^{∞} r^y P_t(x, y) = [Φ_t(r)]^x  (16.23.7)

Proof

Note that Φ_t is the generating function of the offspring distribution for the embedded discrete-time branching chain Z_t = {X_{nt} : n ∈ N} for t ∈ (0, ∞). On the other hand, Ψ is the generating function of the offspring distribution for the continuous-time chain. So our main goal in this discussion is to see how Φ is built from Ψ. Because P is a semigroup under matrix multiplication, and because the particles act identically and independently, Φ is a semigroup under composition.

Φ_{s+t} = Φ_s ∘ Φ_t for s, t ∈ [0, ∞).


Proof

Note also that Φ_0(r) = E(r^{X_0} ∣ X_0 = 1) = r for all r ∈ R. This also follows from the semigroup property: Φ_0 = Φ_0 ∘ Φ_0. The fundamental relationship between the collection of generating functions Φ and the generating function Ψ is given in the following theorem:

The mapping t ↦ Φ_t satisfies the differential equation

d/dt Φ_t = α(Ψ ∘ Φ_t − Φ_t)  (16.23.8)

Proof

This differential equation, along with the initial condition Φ_0(r) = r for all r ∈ R, determines the collection of generating functions Φ. In fact, an implicit solution for Φ_t(r) is given by the integral equation

∫_r^{Φ_t(r)} 1 / [Ψ(u) − u] du = αt  (16.23.11)

Another relationship is given in the following theorem. Here, Φ′_t refers to the derivative of the generating function Φ_t with respect to its argument, of course (so r, not t).

For t ∈ [0, ∞),

Φ′_t = (Ψ ∘ Φ_t − Φ_t) / (Ψ − Φ_0)  (16.23.12)

Proof

Moments
In this discussion, we will study the mean and variance of the number of particles at time t ∈ [0, ∞). Let

m_t = E(X_t ∣ X_0 = 1),  v_t = var(X_t ∣ X_0 = 1),  t ∈ [0, ∞)  (16.23.16)

so that m_t and v_t are the mean and variance, starting with a single particle. As always with a branching process, it suffices to consider a single particle:

For t ∈ [0, ∞) and x ∈ N,

1. E(X_t ∣ X_0 = x) = x m_t
2. var(X_t ∣ X_0 = x) = x v_t

Proof

Recall also that μ and σ² are the mean and variance of the number of offspring of a particle. Here is the connection between the means:

m_t = e^{α(μ−1)t} for t ∈ [0, ∞).

1. If μ < 1 then m_t → 0 as t → ∞. This is extinction in the mean.
2. If μ > 1 then m_t → ∞ as t → ∞. This is explosion in the mean.
3. If μ = 1 then m_t = 1 for all t ∈ [0, ∞). This is stability in the mean.
Proof
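The mean function is easy to check by simulation. Here is a minimal Python sketch of the chain (a Gillespie-style simulation; the offspring distribution f(0) = 1/4, f(2) = 3/4, so that μ = 3/2, and all parameter values are illustrative choices, not from the text):

    import random, math

    def branching_state(alpha, offspring, t_max, x0=1, cap=10**6):
        # In state x the holding time is exponential with rate alpha * x;
        # at each jump one particle is replaced by a random number of children.
        x, t = x0, 0.0
        while x > 0:
            t += random.expovariate(alpha * x)
            if t > t_max:
                return x
            x += offspring() - 1
            if x > cap:          # crude guard against very large populations
                return x
        return 0

    # Check m_t = exp(alpha * (mu - 1) * t) with f(0) = 1/4, f(2) = 3/4 (mu = 3/2)
    alpha, t = 1.0, 1.0
    offspring = lambda: 2 if random.random() < 0.75 else 0
    runs = [branching_state(alpha, offspring, t) for _ in range(20000)]
    print(sum(runs) / len(runs), math.exp(alpha * 0.5 * t))  # both near 1.65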

This result is intuitively very appealing. As a function of time, the expected number of particles either grows or decays exponentially, depending on whether the expected number of offspring of a particle is greater or less than one. The connection between the variances is more complicated. We assume that σ² < ∞.

If μ ≠ 1 then

v_t = [σ²/(μ − 1) + (μ − 1)] [e^{2α(μ−1)t} − e^{α(μ−1)t}],  t ∈ [0, ∞)  (16.23.21)

If μ = 1 then v_t = ασ²t.

1. If μ < 1 then v_t → 0 as t → ∞
2. If μ ≥ 1 then v_t → ∞ as t → ∞

Proof

If μ < 1, so that m_t → 0 as t → ∞ and we have extinction in the mean, then v_t → 0 as t → ∞ also. If μ > 1, so that m_t → ∞ as t → ∞ and we have explosion in the mean, then v_t → ∞ as t → ∞ also. We would expect these results. On the other hand, if μ = 1, so that m_t = 1 for all t ∈ [0, ∞) and we have stability in the mean, then v_t grows linearly in t. This gives some insight into what to expect next when we consider the probability of extinction.

The Probability of Extinction


As shown above, there are two types of behavior for a branching process, either population extinction or population explosion. In
this discussion, we study the extinction probability, starting as usual with a single particle:
q = P(X_t = 0 for some t ∈ (0, ∞) ∣ X_0 = 1) = lim_{t→∞} P(X_t = 0 ∣ X_0 = 1)  (16.23.27)

Need we say it? The extinction probability starting with an arbitrary number of particles is easy.

For x ∈ N,

P(X_t = 0 for some t ∈ (0, ∞) ∣ X_0 = x) = lim_{t→∞} P(X_t = 0 ∣ X_0 = x) = q^x  (16.23.28)

Proof

We can easily relate extinction for the continuous-time branching chain X to extinction for any of the embedded discrete-time
branching chains.

If extinction occurs for X then extinction occurs for Z_t for every t ∈ (0, ∞). Conversely, if extinction occurs for Z_t for some t ∈ (0, ∞) then extinction occurs for Z_t for every t ∈ (0, ∞), and extinction occurs for X. Hence q is the minimum solution in (0, 1] of the equation Φ_t(r) = r for every t ∈ (0, ∞).

Proof

So whether or not extinction is certain depends on the critical parameter μ .

The extinction probability q and the mean of the offspring distribution μ are related as follows:
1. If μ ≤ 1 then q = 1 , so extinction is certain.
2. If μ > 1 then 0 < q < 1 , so there is a positive probability of extinction and a positive probability of explosion.
Proof

It would be nice to have an equation for q in terms of the offspring probability generating function Ψ. This is also easy.

The probability of extinction q is the minimum solution in (0, 1] of the equation Ψ(r) = r .
Proof
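Numerically, q can be approximated by iterating Ψ starting from 0, since the iterates increase to the minimum fixed point in (0, 1]. A minimal Python sketch, using a Poisson offspring distribution as an illustrative choice:

    import math

    def extinction_probability(psi, tol=1e-12, max_iter=10**6):
        # Iterate r -> psi(r) starting at 0; the iterates increase to the
        # minimum solution of psi(r) = r in (0, 1].
        r = 0.0
        for _ in range(max_iter):
            r_new = psi(r)
            if abs(r_new - r) < tol:
                break
            r = r_new
        return r

    mu = 1.5                                  # offspring mean (illustrative)
    psi = lambda r: math.exp(mu * (r - 1))    # pgf of the Poisson(mu) distribution
    print(extinction_probability(psi))        # less than 1 since mu > 1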

Special Models
We now turn our attention to a number of special branching chains that are important in applications or lead to interesting insights.
We will use the notation established above, so that α is the parameter of the exponential lifetime of a particle, Q is the transition matrix of the jump chain, G is the infinitesimal generator matrix, and P_t is the transition matrix at time t ∈ [0, ∞). Similarly, m_t = E(X_t ∣ X_0 = 1), v_t = var(X_t ∣ X_0 = 1), and Φ_t are the mean, variance, and generating function of the number of particles at time t ∈ [0, ∞), starting with a single particle. As always, be sure to try these exercises yourself before looking at the proofs and solutions.

The Pure Death Branching Chain


First we consider the branching chain in which each particle simply dies without offspring. Sadly for these particles, extinction is inevitable, but this case is still a good place to start because the analysis is simple and leads to explicit formulas. Thus, suppose that X = {X_t : t ∈ [0, ∞)} is a branching process with lifetime parameter α ∈ (0, ∞) and offspring probability density function f with f(0) = 1.

The transition matrix of the jump chain and the generator matrix are given by

1. Q(0, 0) = 1 and Q(x, x − 1) = 1 for x ∈ N_+
2. G(x, x) = −αx for x ∈ N and G(x, x − 1) = αx for x ∈ N_+

The time-varying functions are more interesting.

Let t ∈ [0, ∞). Then

1. m_t = e^{−αt}
2. v_t = e^{−αt} − e^{−2αt}
3. Φ_t(r) = 1 − (1 − r)e^{−αt} for r ∈ R
4. Given X_0 = x, the distribution of X_t is binomial with trial parameter x and success parameter e^{−αt}:

P_t(x, y) = (x choose y) e^{−αty} (1 − e^{−αt})^{x−y},  x ∈ N, y ∈ {0, 1, …, x}  (16.23.29)

Direct Proof
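The binomial form can also be seen directly by simulation: each of the x initial particles is still alive at time t if and only if its exponential lifetime exceeds t, independently of the others. A short Python check (parameter values are illustrative):

    import random, math

    alpha, t, x = 2.0, 0.3, 10
    # Each particle survives to time t with probability exp(-alpha * t)
    samples = [sum(1 for _ in range(x) if random.expovariate(alpha) > t)
               for _ in range(50000)]
    print(sum(samples) / len(samples), x * math.exp(-alpha * t))  # means agree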

In particular, note that P_t(x, 0) = (1 − e^{−αt})^x → 1 as t → ∞. That is, the probability of extinction by time t increases to 1 exponentially fast. Since we have an explicit formula for the transition matrices, we can find an explicit formula for the potential matrices as well. The result uses the beta function B.

For β ∈ (0, ∞) the potential matrix U_β is given by

U_β(x, y) = (1/α) (x choose y) B(y + β/α, x − y + 1),  x ∈ N, y ∈ {0, 1, …, x}  (16.23.31)

For β = 0, the potential matrix U is given by

1. U(x, 0) = ∞ for x ∈ N
2. U(x, y) = 1/(αy) for x, y ∈ N_+ with y ≤ x.

Proof

We could argue the results for the potential U directly. Recall that U(x, y) is the expected time spent in state y starting in state x. Since 0 is absorbing and all states lead to 0, U(x, 0) = ∞ for x ∈ N. If x, y ∈ N_+ and y ≤ x, then x leads to y with probability 1. Once in state y, the time spent in y has an exponential distribution with parameter λ(y) = αy, and so the mean is 1/(αy). Of course, when the chain leaves y, it never returns.
Recall that βU_β is a transition probability matrix for β > 0, and in fact βU_β(x, ⋅) is the probability density function of X_T given X_0 = x, where T is independent of X and has the exponential distribution with parameter β. For the next result, recall the ascending power notation

a^{[k]} = a(a + 1) ⋯ (a + k − 1),  a ∈ R, k ∈ N  (16.23.36)

For β > 0 and x ∈ N_+, the function βU_β(x, ⋅) is the beta-binomial probability density function with parameters x, β/α, and 1:

βU_β(x, y) = (x choose y) (β/α)^{[y]} 1^{[x−y]} / (1 + β/α)^{[x]},  x ∈ N, y ∈ {0, 1, …, x}  (16.23.37)

Proof

The Yule Process


Next we consider the pure birth branching chain in which each particle, at the end of its life, is replaced by 2 new particles. Equivalently, we can think of particles that never die, but each particle gives birth to a new particle at a constant rate. This chain could serve as the model for an unconstrained nuclear reaction, and is known as the Yule process, named for George Yule. So specifically, let X = {X_t : t ∈ [0, ∞)} be the branching chain with exponential parameter α ∈ (0, ∞) and offspring probability density function given by f(2) = 1. Explosion is inevitable, starting with at least one particle, but other properties of the Yule process are interesting. In particular, there are fascinating parallels with the pure death branching chain. Since 0 is an isolated, absorbing state, we will sometimes restrict our attention to positive states.

The transition matrix of the jump chain and the generator matrix are given by

1. Q(0, 0) = 1 and Q(x, x + 1) = 1 for x ∈ N_+
2. G(x, x) = −αx for x ∈ N and G(x, x + 1) = αx for x ∈ N_+

Since the Yule process is a pure birth process and the birth rate in state x ∈ N is αx, the process is also called the linear birth chain. As with the pure death process, we can give the distribution of X_t specifically.

Let t ∈ [0, ∞). Then

1. m_t = e^{αt}
2. v_t = e^{2αt} − e^{αt}
3. Φ_t(r) = re^{−αt} / (1 − r + re^{−αt}) for |r| < 1/(1 − e^{−αt})
4. Given X_0 = x, X_t has the negative binomial distribution on N_+ with stopping parameter x and success parameter e^{−αt}:

P_t(x, y) = (y − 1 choose x − 1) e^{−xαt} (1 − e^{−αt})^{y−x},  x ∈ N_+, y ∈ {x, x + 1, …}  (16.23.40)

Proof from the general results


Direct proof
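For a quick check of part (a), the Yule chain is easy to simulate, since from state y the next birth occurs after an exponential holding time with rate αy. A minimal Python sketch (parameter values are illustrative):

    import random, math

    def yule_state(alpha, t_max, x0=1):
        # Pure birth chain: from state y, wait an Exponential(alpha * y) time
        y, t = x0, 0.0
        while True:
            t += random.expovariate(alpha * y)
            if t > t_max:
                return y
            y += 1

    alpha, t = 1.0, 1.0
    runs = [yule_state(alpha, t) for _ in range(20000)]
    print(sum(runs) / len(runs), math.exp(alpha * t))  # m_t = e^{alpha t}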

Recall that the negative binomial distribution with parameters k ∈ N_+ and p ∈ (0, 1) governs the trial number of the kth success in a sequence of Bernoulli trials with success parameter p. So the occurrence of this distribution in the Yule process suggests such an interpretation. However, this interpretation is not nearly as obvious as with the binomial distribution in the pure death branching chain. Next we give the potential matrices.

For β ∈ [0, ∞) the potential matrix U_β is given by

U_β(x, y) = (1/α) (y − 1 choose x − 1) B(x + β/α, y − x + 1),  x ∈ N_+, y ∈ {x, x + 1, …}  (16.23.45)

If β > 0, the function βU_β(x, ⋅) is the beta-negative binomial probability density function with parameters x, β/α, and 1:

βU_β(x, y) = (y − 1 choose x − 1) (β/α)^{[x]} 1^{[y−x]} / (1 + β/α)^{[y]},  x ∈ N_+, y ∈ {x, x + 1, …}  (16.23.46)

Proof

If we think of the Yule process in terms of particles that never die, but each particle gives birth to a new particle at rate α, then we can study the age of the particles at a given time. As usual, we can start with a single, new particle at time 0. So to set up the notation, let X = {X_t : t ∈ [0, ∞)} be the Yule branching chain with birth rate α ∈ (0, ∞), and assume that X_0 = 1. Let τ_0 = 0 and for n ∈ N_+, let τ_n denote the time of the nth transition (birth).

For t ∈ [0, ∞), let A_t denote the total age of the particles at time t. Then

A_t = ∑_{n=0}^{X_t − 1} (t − τ_n),  t ∈ [0, ∞)  (16.23.49)

The random process A = {A_t : t ∈ [0, ∞)} is the age process.


Proof

Here is another expression for the age process.

Again, let A = {A_t : t ∈ [0, ∞)} be the age process for the Yule chain starting with a single particle. Then

A_t = ∫_0^t X_s ds,  t ∈ [0, ∞)  (16.23.50)

Proof

With the last representation, we can easily find the expected total age at time t .

Again, let A = {A_t : t ∈ [0, ∞)} be the age process for the Yule chain starting with a single particle. Then

E(A_t) = (e^{αt} − 1)/α,  t ∈ [0, ∞)  (16.23.53)

Proof

The General Birth-Death Branching Chain


Next we consider the continuous-time branching chain in which each particle, at the end of its life, leaves either no children or two children. At each transition, the number of particles either increases by 1 or decreases by 1, and so such a branching chain is also a continuous-time birth-death chain. Specifically, let X = {X_t : t ∈ [0, ∞)} be a continuous-time branching chain with lifetime parameter α ∈ (0, ∞) and offspring probability density function f given by f(0) = 1 − p, f(2) = p, where p ∈ [0, 1]. When p = 0 we have the pure death chain, and when p = 1 we have the Yule process. We have already studied these, so the interesting case is when p ∈ (0, 1), so that both extinction and explosion are possible.

The transition matrix of the jump chain and the generator matrix are given by

1. Q(0, 0) = 1, and Q(x, x − 1) = 1 − p, Q(x, x + 1) = p for x ∈ N_+
2. G(x, x) = −αx for x ∈ N, and G(x, x − 1) = α(1 − p)x, G(x, x + 1) = αpx for x ∈ N_+

As mentioned earlier, X is also a continuous-time birth-death chain on N, with 0 absorbing. In state x ∈ N_+, the birth rate is αpx and the death rate is α(1 − p)x. The moment functions are given next.

For t ∈ [0, ∞),

1. m_t = e^{α(2p−1)t}
2. If p ≠ 1/2,

v_t = [4p(1 − p)/(2p − 1) + (2p − 1)] [e^{2α(2p−1)t} − e^{α(2p−1)t}]  (16.23.55)

If p = 1/2, v_t = 4αp(1 − p)t.
Proof

The next result gives the generating function of the offspring distribution and the extinction probability.

For the birth-death branching chain,

1. Ψ(r) = pr² + (1 − p) for r ∈ R.
2. q = 1 if 0 < p ≤ 1/2, and q = (1 − p)/p if 1/2 < p < 1.
Proof

Figure 16.23.1: Graphs of r ↦ Ψ(r) and r ↦ r in the case p ≤ 1/2

Figure 16.23.2: Graphs of r ↦ Ψ(r) and r ↦ r in the case p > 1/2

For t ∈ [0, ∞), the generating function Φ_t is given by

Φ_t(r) = [pr − (1 − p) + (1 − p)(1 − r)e^{α(2p−1)t}] / [pr − (1 − p) + p(1 − r)e^{α(2p−1)t}],  if p ≠ 1/2

Φ_t(r) = [2r + (1 − r)αt] / [2 + (1 − r)αt],  if p = 1/2

Solution
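The closed form can be checked numerically against the differential equation above, d/dt Φ_t = α(Ψ ∘ Φ_t − Φ_t), using a central finite difference. A minimal Python sketch (parameter values are illustrative):

    import math

    alpha, p, r, t, h = 1.0, 0.7, 0.4, 0.8, 1e-6

    def phi(t):
        # Closed form for p != 1/2
        e = math.exp(alpha * (2 * p - 1) * t)
        return (p * r - (1 - p) + (1 - p) * (1 - r) * e) / \
               (p * r - (1 - p) + p * (1 - r) * e)

    psi = lambda u: p * u * u + (1 - p)      # offspring pgf
    lhs = (phi(t + h) - phi(t - h)) / (2 * h)
    rhs = alpha * (psi(phi(t)) - phi(t))
    print(lhs, rhs)                          # the two values agree closely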

This page titled 16.23: Continuous-Time Branching Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

CHAPTER OVERVIEW
17: Martingales
Martingales, and their cousins sub-martingales and super-martingales, are real-valued stochastic processes that are abstract generalizations of fair, favorable, and unfair gambling processes. The importance of martingales extends far beyond gambling, and indeed these random processes are among the most important in probability theory, with an incredible number and diversity of applications.
17.1: Introduction to Martingales
17.2: Properties and Constructions
17.3: Stopping Times
17.4: Inequalities
17.5: Convergence
17.6: Backwards Martingales

This page titled 17: Martingales is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

17.1: Introduction to Martingales

Basic Theory
Basic Assumptions
For our basic ingredients, we start with a stochastic process X = {X_t : t ∈ T} on an underlying probability space (Ω, F, P), having state space R, and where the index set T (representing time) is either N (discrete time) or [0, ∞) (continuous time). So to review what all this means, Ω is the sample space, F the σ-algebra of events, P the probability measure on (Ω, F), and X_t is a random variable with values in R for each t ∈ T. Next, we have a filtration F = {F_t : t ∈ T}, and we assume that X is adapted to F. To review again, F is an increasing family of sub σ-algebras of F, so that F_s ⊆ F_t ⊆ F for s, t ∈ T with s ≤ t, and X_t is measurable with respect to F_t for t ∈ T. We think of F_t as the collection of events up to time t ∈ T, thus encoding the information available at time t. Finally, we assume that E(|X_t|) < ∞, so that the mean of X_t exists as a real number, for each t ∈ T.

There are two important special cases of the basic setup. The simplest case, of course, is when F_t = σ{X_s : s ∈ T, s ≤ t} for t ∈ T, so that F is the natural filtration associated with X. Another case that arises frequently is when we have a second stochastic process Y = {Y_t : t ∈ T} on (Ω, F, P) with values in a general measure space (S, S), and F is the natural filtration associated with Y. So in this case, our main assumption is that X_t is measurable with respect to σ{Y_s : s ∈ T, s ≤ t} for t ∈ T.

The theory of martingales is beautiful, elegant, and mostly accessible in discrete time, when T = N. But as with the theory of Markov processes, martingale theory is technically much more complicated in continuous time, when T = [0, ∞). In this case, additional assumptions about the continuity of the sample paths t ↦ X_t and the filtration t ↦ F_t are often necessary in order to have a nice theory. Specifically, we will assume that the process X is right continuous and has left limits, and that the filtration F is right continuous and complete. These are the standard assumptions in continuous time.

Definitions
For the basic definitions that follow, you may need to review conditional expected value with respect to a σ-algebra.

The process X is a martingale with respect to F if E(X_t ∣ F_s) = X_s for all s, t ∈ T with s ≤ t.

In the special case that F is the natural filtration associated with X, we simply say that X is a martingale, without reference to the filtration. In the special case that we have a second stochastic process Y = {Y_t : t ∈ T} and F is the natural filtration associated with Y, we say that X is a martingale with respect to Y.


The term martingale originally referred to a portion of the harness of a horse, and was later used to describe gambling strategies, such as the one used in the Petersburg paradox, in which bets are doubled when a game is lost. To interpret the definitions above in terms of gambling, suppose that a gambler is at a casino, and that X_t represents her fortune at time t ∈ T and F_t the information available to her at time t. Suppose now that s, t ∈ T with s < t and that we think of s as the current time, so that t is a future time. If X is a martingale with respect to F then the games are fair in the sense that the gambler's expected fortune at the future time t is the same as her current fortune at time s. To venture a bit from the casino, suppose that X_t is the price of a stock, or the value of a stock index, at time t ∈ T. If X is a martingale, then the expected value at a future time, given all of our information, is the present value.

Figure 17.1.1: An English-style breastplate with a running martingale attachment. By Danielle M., CC BY 3.0, from Wikipedia

But as we will see, martingales are useful in probability far beyond the application to gambling and even far beyond financial
applications generally. Indeed, martingales are of fundamental importance in modern probability theory. Here are two related
definitions, with equality in the martingale condition replaced by inequalities.

Suppose again that the process X and the filtration F satisfy the basic assumptions above.

1. X is a sub-martingale with respect to F if E(X_t ∣ F_s) ≥ X_s for all s, t ∈ T with s ≤ t.
2. X is a super-martingale with respect to F if E(X_t ∣ F_s) ≤ X_s for all s, t ∈ T with s ≤ t.

In the gambling setting, a sub-martingale models games that are favorable to the gambler on average, while a super-martingale models games that are unfavorable to the gambler on average. To venture again from the casino, suppose that X_t is the price of a stock, or the value of a stock index, at time t ∈ T. If X is a sub-martingale, the expected value at a future time, given all of our information, is greater than the present value, and if X is a super-martingale then the expected value at the future time is less than the present value. One hopes that a stock index is a sub-martingale.
Clearly X is a martingale with respect to F if and only if it is both a sub-martingale and a super-martingale. Finally, recall that the
conditional expected value of a random variable with respect to a σ-algebra is itself a random variable, and so the equations and
inequalities in the definitions should be interpreted as holding with probability 1. In this section generally, statements involving
random variables are assumed to hold with probability 1.
The conditions that define martingale, sub-martingale, and super-martingale make sense if the index set T is any totally ordered set. In some applications that we will consider later, T = {0, 1, …, n} for fixed n ∈ N_+. In the section on backwards martingales, T = {−n : n ∈ N} or T = (−∞, 0]. In the case of discrete time when T = N, we can simplify the definitions slightly.

Suppose that X = {X_n : n ∈ N} satisfies the basic assumptions above.

1. X is a martingale with respect to F if and only if E(X_{n+1} ∣ F_n) = X_n for all n ∈ N.
2. X is a sub-martingale with respect to F if and only if E(X_{n+1} ∣ F_n) ≥ X_n for all n ∈ N.
3. X is a super-martingale with respect to F if and only if E(X_{n+1} ∣ F_n) ≤ X_n for all n ∈ N.

Proof

The relations that define martingales, sub-martingales, and super-martingales hold for the ordinary (unconditional) expected values.

Suppose that s, t ∈ T with s ≤ t.

1. If X is a martingale with respect to F then E(X_s) = E(X_t).
2. If X is a sub-martingale with respect to F then E(X_s) ≤ E(X_t).
3. If X is a super-martingale with respect to F then E(X_s) ≥ E(X_t).

Proof

So if X is a martingale then X has constant expected value, and this value is referred to as the mean of X.

Examples
The goal for the remainder of this section is to give some classical examples of martingales, and by doing so, to show the wide
variety of applications in which martingales occur. We will return to many of these examples in subsequent sections. Without
further ado, we assume that all random variables are real-valued, unless otherwise specified, and that all expected values mentioned
below exist in R. Be sure to try the proofs yourself before expanding the ones in the text.

Constant Sequence
Our first example is rather trivial, but still worth noting.

Suppose that F = {F_t : t ∈ T} is a filtration on the probability space (Ω, F, P) and that X is a random variable that is measurable with respect to F_0 and satisfies E(|X|) < ∞. Let X_t = X for t ∈ T. Then X = {X_t : t ∈ T} is a martingale with respect to F.
Proof

Partial Sums
For our next discussion, we start with one of the most basic martingales in discrete time, and the one with the simplest
interpretation in terms of gambling. Suppose that V = {V_n : n ∈ N} is a sequence of independent random variables with E(|V_k|) < ∞ for k ∈ N. Let

X_n = ∑_{k=0}^n V_k,  n ∈ N  (17.1.3)

so that X = {X_n : n ∈ N} is simply the partial sum process associated with V.

For the partial sum process X,

1. If E(V_n) ≥ 0 for n ∈ N_+ then X is a sub-martingale.
2. If E(V_n) ≤ 0 for n ∈ N_+ then X is a super-martingale.
3. If E(V_n) = 0 for n ∈ N_+ then X is a martingale.
Proof
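In particular, in case (c) the expected value E(X_n) is the same for every n. The following minimal Python sketch illustrates this with uniformly distributed steps (an illustrative choice of step distribution):

    import random

    n_paths, n_steps = 20000, 10
    means = [0.0] * (n_steps + 1)
    for _ in range(n_paths):
        x = 0.0                               # take X_0 = V_0 = 0
        for k in range(1, n_steps + 1):
            x += random.uniform(-1, 1)        # E(V_k) = 0
            means[k] += x / n_paths
    print([round(m, 3) for m in means])       # all entries near 0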

In terms of gambling, if X_0 = V_0 is the gambler's initial fortune and V_i is the gambler's net winnings on the ith game, then X_n is the gambler's net fortune after n games for n ∈ N_+. But partial sum processes associated with independent sequences are important far beyond gambling. In fact, much of classical probability deals with partial sums of independent and identically distributed variables. The entire chapter on Random Samples explores this setting.
Note that E(X_n) = ∑_{k=0}^n E(V_k). Hence condition (a) is equivalent to n ↦ E(X_n) increasing, condition (b) is equivalent to n ↦ E(X_n) decreasing, and condition (c) is equivalent to n ↦ E(X_n) constant. Here is another martingale associated with the partial sum process, known as the second moment martingale.

Suppose that E(V_k) = 0 for k ∈ N_+ and var(V_k) < ∞ for k ∈ N. Let

Y_n = X_n² − var(X_n),  n ∈ N  (17.1.6)

Then Y = {Y_n : n ∈ N} is a martingale with respect to X.


Proof

So under the assumptions in this theorem, both X and Y are martingales. We will generalize the results for partial sum processes
below in the discussion on processes with independent increments.

Martingale Difference Sequences


In the last discussion, we saw that the partial sum process associated with a sequence of independent, mean 0 variables is a
martingale. Conversely, every martingale in discrete time can be written as a partial sum process of uncorrelated mean 0 variables.
This representation gives some significant insight into the theory of martingales generally. Suppose that X = {X_n : n ∈ N} is a martingale with respect to the filtration F = {F_n : n ∈ N}.

Let V_0 = X_0 and V_n = X_n − X_{n−1} for n ∈ N_+. The process V = {V_n : n ∈ N} is the martingale difference sequence associated with X, and

X_n = ∑_{k=0}^n V_k,  n ∈ N  (17.1.8)

As promised, the martingale difference variables have mean 0, and in fact satisfy a stronger property.

Suppose that V = {V_n : n ∈ N} is the martingale difference sequence associated with X. Then

1. V is adapted to F.
2. E(V_n ∣ F_k) = 0 for k, n ∈ N with k < n.
3. E(V_n) = 0 for n ∈ N_+

Proof

Also as promised, if the martingale variables have finite variance, then the martingale difference variables are uncorrelated.

Suppose again that V = {V_n : n ∈ N} is the martingale difference sequence associated with the martingale X. Assume that var(X_n) < ∞ for n ∈ N. Then V is an uncorrelated sequence. Moreover,

var(X_n) = ∑_{k=0}^n var(V_k) = var(X_0) + ∑_{k=1}^n E(V_k²),  n ∈ N  (17.1.11)

Proof

We now know that a discrete-time martingale is the partial sum process associated with a sequence of uncorrelated variables.
Hence we might hope that there are martingale versions of the fundamental theorems that hold for a partial sum process associated
with an independent sequence. This turns out to be true, and is a basic reason for the importance of martingales.

Discrete-Time Random Walks


Suppose that V = {V_n : n ∈ N} is a sequence of independent random variables with {V_n : n ∈ N_+} identically distributed. We assume that E(|V_n|) < ∞ for n ∈ N and we let a denote the common mean of {V_n : n ∈ N_+}. Let X = {X_n : n ∈ N} be the partial sum process associated with V so that

X_n = ∑_{i=0}^n V_i,  n ∈ N  (17.1.13)

This setting is a special case of the more general partial sum process considered above. The process X is sometimes called a (discrete-time) random walk. The initial position X_0 = V_0 of the walker can have an arbitrary distribution, but then the steps that the walker takes are independent and identically distributed. In terms of gambling, X_0 = V_0 is the initial fortune of the gambler playing a sequence of independent and identical games. If V_i is the amount won (or lost) on game i ∈ N_+, then X_n is the gambler's net fortune after n games.

For the random walk X,

1. X is a martingale if a = 0.
2. X is a sub-martingale if a ≥ 0.
3. X is a super-martingale if a ≤ 0.

For the second moment martingale, suppose that V_n has common mean a = 0 and common variance b² < ∞ for n ∈ N_+, and that var(V_0) < ∞.

Let Y_n = X_n² − var(V_0) − b²n for n ∈ N. Then Y = {Y_n : n ∈ N} is a martingale with respect to X.
Proof

We will generalize the results for discrete-time random walks below, in the discussion on processes with stationary, independent
increments.

Partial Products
Our next discussion is similar to the one on partial sum processes above, but with products instead of sums. So suppose that V = {V_n : n ∈ N} is an independent sequence of nonnegative random variables with E(V_n) < ∞ for n ∈ N. Let

X_n = ∏_{i=0}^n V_i,  n ∈ N  (17.1.15)

so that X = {X_n : n ∈ N} is the partial product process associated with V.

For the partial product process X,

1. If E(V_n) = 1 for n ∈ N_+ then X is a martingale with respect to V
2. If E(V_n) ≥ 1 for n ∈ N_+ then X is a sub-martingale with respect to V
3. If E(V_n) ≤ 1 for n ∈ N_+ then X is a super-martingale with respect to V
Proof

As with random walks, a special case of interest is when {V_n : n ∈ N_+} is an identically distributed sequence.

The Simple Random Walk
Suppose now that V = {V_n : n ∈ N} is a sequence of independent random variables with P(V_i = 1) = p and P(V_i = −1) = 1 − p for i ∈ N_+, where p ∈ (0, 1). Let X = {X_n : n ∈ N} be the partial sum process associated with V so that

X_n = ∑_{i=0}^n V_i,  n ∈ N  (17.1.18)

Then X is the simple random walk with parameter p, and of course, is a special case of the more general random walk studied above. In terms of gambling, our gambler plays a sequence of independent and identical games, and on each game, wins €1 with probability p and loses €1 with probability 1 − p. So if V_0 is the gambler's initial fortune, then X_n is her net fortune after n games.

For the simple random walk,

1. If p > 1/2 then X is a sub-martingale.
2. If p < 1/2 then X is a super-martingale.
3. If p = 1/2 then X is a martingale.
Proof

So case (a) corresponds to favorable games, case (b) to unfavorable games, and case (c) to fair games.

Open the simulation of the simple symmetric random walk. For various values of the number of trials n, run the simulation 1000 times and note the general behavior of the sample paths.

Here is the second moment martingale for the simple, symmetric random walk.

Consider the simple random walk with parameter p = 1/2, and let Y_n = X_n² − var(V_0) − n for n ∈ N. Then Y = {Y_n : n ∈ N} is a martingale with respect to X.

Proof

But there is another martingale that can be associated with the simple random walk, known as De Moivre's martingale and named
for one of the early pioneers of probability theory, Abraham De Moivre.

For n ∈ N define

Z_n = ((1 − p)/p)^{X_n}  (17.1.19)

Then Z = {Z_n : n ∈ N} is a martingale with respect to X.


Proof
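A quick empirical check of the martingale property is that E(Z_n) is constant in n. A minimal Python sketch, taking V_0 = 0 so that Z_0 = 1 (an illustrative choice of parameters):

    import random

    p, n_paths, n_steps = 0.6, 20000, 20
    ratio = (1 - p) / p
    total = [0.0] * (n_steps + 1)
    for _ in range(n_paths):
        x = 0                                     # X_0 = 0, so Z_0 = 1
        total[0] += ratio ** x
        for n in range(1, n_steps + 1):
            x += 1 if random.random() < p else -1
            total[n] += ratio ** x
    print([round(s / n_paths, 2) for s in total])  # all entries near 1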

The Beta-Bernoulli Process


Recall that the beta-Bernoulli process is constructed by randomizing the success parameter in a Bernoulli trials process with a beta distribution. Specifically, we have a random variable P that has the beta distribution with parameters a, b ∈ (0, ∞), and a sequence of indicator variables X = (X_1, X_2, …) such that given P = p ∈ (0, 1), X is a sequence of independent variables with P(X_i = 1) = p for i ∈ N_+. As usual, we couch this in reliability terms, so that X_i = 1 means success on trial i and X_i = 0 means failure. In our study of this process, we showed that the finite-dimensional distributions are given by

P(X_1 = x_1, X_2 = x_2, …, X_n = x_n) = a^{[k]} b^{[n−k]} / (a + b)^{[n]},  n ∈ N_+, (x_1, x_2, …, x_n) ∈ {0, 1}^n  (17.1.22)

where k = x_1 + x_2 + ⋯ + x_n and we use the ascending power notation r^{[j]} = r(r + 1) ⋯ (r + j − 1) for r ∈ R and j ∈ N. Next, let Y = {Y_n : n ∈ N} denote the partial sum process associated with X, so that once again,

Y_n = ∑_{i=1}^n X_i,  n ∈ N  (17.1.23)

Of course Y_n is the number of successes in the first n trials, and has the beta-binomial distribution defined by

P(Y_n = k) = (n choose k) a^{[k]} b^{[n−k]} / (a + b)^{[n]},  k ∈ {0, 1, …, n}  (17.1.24)

Now let

Z_n = (a + Y_n)/(a + b + n),  n ∈ N  (17.1.25)

This variable also arises naturally. Let F_n = σ{X_1, X_2, …, X_n} for n ∈ N. Then as shown in the section on the beta-Bernoulli process, Z_n = E(X_{n+1} ∣ F_n) = E(P ∣ F_n). In statistical terms, the second equation means that Z_n is the Bayesian estimator of the unknown success probability p in a sequence of Bernoulli trials, when p is modeled by the random variable P.

Z = {Z_n : n ∈ N} is a martingale with respect to X.


Proof
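The constant mean E(Z_n) = E(P) = a/(a + b) is easy to verify empirically. A minimal Python sketch (parameter values are illustrative):

    import random

    a, b, n, n_paths = 2.0, 3.0, 25, 20000
    total = 0.0
    for _ in range(n_paths):
        p = random.betavariate(a, b)                         # randomized success probability
        y = sum(1 for _ in range(n) if random.random() < p)  # Y_n
        total += (a + y) / (a + b + n)                       # Z_n
    print(total / n_paths, a / (a + b))                      # both near 0.4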

Open the beta-Binomial experiment. Run the simulation 1000 times for various values of the parameters, and compare the
empirical probability density function with the true probability density function.

Pólya's Urn Process


Recall that in the simplest version of Pólya's urn process, we start with an urn containing a red and b green balls. At each discrete time step, we select a ball at random from the urn and then replace the ball and add c new balls of the same color to the urn. For the parameters, we need a, b ∈ N_+ and c ∈ N. For i ∈ N_+, let X_i denote the color of the ball selected on the ith draw, where 1 means red and 0 means green. The process X = {X_n : n ∈ N_+} is a classical example of a sequence of exchangeable yet dependent variables. Let Y = {Y_n : n ∈ N} denote the partial sum process associated with X, so that once again,

Y_n = ∑_{i=1}^n X_i,  n ∈ N  (17.1.28)

Of course Y_n is the total number of red balls selected in the first n draws. Hence at time n ∈ N, the total number of red balls in the urn is a + cY_n, while the total number of balls in the urn is a + b + cn, and so the proportion of red balls in the urn is

Z_n = (a + cY_n)/(a + b + cn)  (17.1.29)

Z = {Z_n : n ∈ N} is a martingale with respect to X.


Indirect proof
Direct Proof
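In particular, E(Z_n) = Z_0 = a/(a + b) for every n, which the following minimal Python sketch checks by direct simulation of the urn (parameter values are illustrative):

    import random

    a, b, c, n_draws, n_paths = 2, 3, 1, 50, 20000
    total = 0.0
    for _ in range(n_paths):
        red, green = a, b
        for _ in range(n_draws):
            if random.random() < red / (red + green):
                red += c                       # red ball drawn: add c red balls
            else:
                green += c                     # green ball drawn: add c green balls
        total += red / (red + green)           # Z_n after n_draws draws
    print(total / n_paths, a / (a + b))        # both near 0.4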

Open the simulation of Pólya's Urn Experiment. Run the simulation 1000 times for various values of the parameters, and
compare the empirical probability density function of the number of red ball selected to the true probability density function.

Processes with Independent Increments.


Our first example above concerned the partial sum process X associated with a sequence of independent random variables V. Such processes are the only ones in discrete time that have independent increments. That is, for m, n ∈ N with m ≤ n, X_n − X_m is independent of (X_0, X_1, …, X_m). The random walk process has the additional property of stationary increments. That is, the distribution of X_n − X_m is the same as the distribution of X_{n−m} − X_0 for m, n ∈ N with m ≤ n. Let's consider processes in discrete or continuous time with these properties. Thus, suppose that X = {X_t : t ∈ T} satisfies the basic assumptions above relative to the filtration F = {F_s : s ∈ T}. Here are the two definitions.

The process X has

1. Independent increments if X_t − X_s is independent of F_s for all s, t ∈ T with s ≤ t.
2. Stationary increments if X_t − X_s has the same distribution as X_{t−s} − X_0 for all s, t ∈ T.

Processes with stationary and independent increments were studied in the chapter on Markov processes. In continuous time (with the continuity assumptions we have imposed), such a process is known as a Lévy process, named for Paul Lévy, and also as a continuous-time random walk. For a process with independent increments (not necessarily stationary), the connection with martingales depends on the mean function m given by m(t) = E(X_t) for t ∈ T.

Suppose that X = {X_t : t ∈ [0, ∞)} has independent increments.

1. If m is increasing then X is a sub-martingale.
2. If m is decreasing then X is a super-martingale.
3. If m is constant then X is a martingale.
Proof

Compare this theorem with the corresponding theorem for the partial sum process above. Suppose now that X = {X_t : t ∈ [0, ∞)} is a stochastic process as above, with mean function m, and let Y_t = X_t − m(t) for t ∈ [0, ∞). The process Y = {Y_t : t ∈ [0, ∞)} is sometimes called the compensated process associated with X and has mean function 0. If X has independent increments, then clearly so does Y. Hence the following result is a trivial corollary to our previous theorem.

Suppose that X has independent increments. The compensated process Y is a martingale.

Next we give the second moment martingale for a process with independent increments, generalizing the second moment
martingale for a partial sum process.

Suppose that X = {X_t : t ∈ T} has independent increments with constant mean function and with var(X_t) < ∞ for t ∈ T. Then Y = {Y_t : t ∈ T} is a martingale, where

Y_t = X_t² − var(X_t),  t ∈ T  (17.1.35)

Proof

Of course, since the mean function is constant, X is also a martingale. For processes with independent and stationary increments
(that is, random walks), the last two theorems simplify, because the mean and variance functions simplify.

Suppose that X = {X_t : t ∈ T} has stationary, independent increments, and let a = E(X_1 − X_0). Then
1. X is a martingale if a = 0
2. X is a sub-martingale if a ≥ 0
3. X is a super-martingale if a ≤ 0
Proof

Compare this result with the corresponding one above for discrete-time random walks. Our next result is the second moment
martingale. Compare this with the second moment martingale for discrete-time random walks.

Suppose that X = {X_t : t ∈ T} has stationary, independent increments with E(X_0) = E(X_1) and b² = E(X_1²) < ∞. Then Y = {Y_t : t ∈ T} is a martingale, where

Y_t = X_t² − var(X_0) − b²t,  t ∈ T  (17.1.40)

Proof

In discrete time, as we have mentioned several times, all of these results reduce to the earlier results for partial sum processes and random walks. In continuous time, the Poisson processes, named of course for Simeon Poisson, provide examples. The standard (homogeneous) Poisson counting process N = {N_t : t ∈ [0, ∞)} with constant rate r ∈ (0, ∞) has stationary, independent increments and mean function given by m(t) = rt for t ∈ [0, ∞). More generally, suppose that r : [0, ∞) → (0, ∞) is piecewise continuous (and non-constant). The non-homogeneous Poisson counting process N = {N_t : t ∈ [0, ∞)} with rate function r has independent increments and mean function given by

m(t) = ∫_0^t r(s) ds,  t ∈ [0, ∞)  (17.1.41)

The increment N_t − N_s has the Poisson distribution with parameter m(t) − m(s) for s, t ∈ [0, ∞) with s < t, so the process does not have stationary increments. In all cases, m is increasing, so the following results are corollaries of our general results:

Let N = {N_t : t ∈ [0, ∞)} be the Poisson counting process with rate function r : [0, ∞) → (0, ∞). Then
1. N is a sub-martingale.
2. The compensated process X = {N_t − m(t) : t ∈ [0, ∞)} is a martingale.
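For the homogeneous case, the compensated process N_t − rt has mean 0 at every fixed time, which is easy to check by simulating the arrival times. A minimal Python sketch (parameter values are illustrative):

    import random

    r, t, n_paths = 2.0, 3.0, 20000
    total = 0.0
    for _ in range(n_paths):
        s, n = 0.0, 0
        while True:
            s += random.expovariate(r)     # interarrival times are Exponential(r)
            if s > t:
                break
            n += 1
        total += n - r * t                 # compensated value at time t
    print(total / n_paths)                 # near 0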

Open the simulation of the Poisson counting experiment. For various values of r and t , run the experiment 1000 times and
compare the empirical probability density function of the number of arrivals with the true probability density function.

We will see further examples of processes with stationary, independent increments in continuous time (and so also examples of
continuous-time martingales) in our study of Brownian motion.

Likelihood Ratio Tests


Suppose that (S, S, μ) is a general measure space, and that X = {X_n : n ∈ N} is a sequence of independent, identically distributed random variables, taking values in S. In statistical terms, X corresponds to sampling from the common distribution, which is usually not completely known. Indeed, the central problem in statistics is to draw inferences about the distribution from observations of X. Suppose now that the underlying distribution either has probability density function g_0 or probability density function g_1, with respect to μ. We assume that g_0 and g_1 are positive on S. Of course, the common special cases of this setup are

S is a measurable subset of R^n for some n ∈ N_+ and μ = λ_n is n-dimensional Lebesgue measure on S.
S is a countable set and μ = # is counting measure on S.


The likelihood ratio test is a hypothesis test, where the null and alternative hypotheses are

H_0: the probability density function is g_0.
H_1: the probability density function is g_1.

The test is based on the test statistic

L_n = ∏_{i=1}^n g_0(X_i)/g_1(X_i),  n ∈ N  (17.1.42)

known as the likelihood ratio test statistic. Small values of the test statistic are evidence in favor of the alternative hypothesis H_1.

Here is our result.

Under the alternative hypothesis H_1, the process L = {L_n : n ∈ N} is a martingale with respect to X, known as the likelihood ratio martingale.
Proof
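Concretely, the martingale property implies that E(L_n) = 1 for every n under H_1. The following minimal Python sketch checks this, using normal densities with different means as an illustrative choice of g_0 and g_1:

    import random, math

    def normal_pdf(x, m):
        return math.exp(-(x - m) ** 2 / 2) / math.sqrt(2 * math.pi)

    n, n_paths, total = 5, 100000, 0.0
    for _ in range(n_paths):
        L = 1.0
        for _ in range(n):
            x = random.gauss(1.0, 1.0)                      # sample from g1 (mean 1)
            L *= normal_pdf(x, 0.0) / normal_pdf(x, 1.0)    # likelihood ratio factor
        total += L
    print(total / n_paths)                                  # near 1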

Branching Processes
In the simplest model of a branching process, we have a system of particles each of which can die out or split into new particles of the same type. The fundamental assumption is that the particles act independently, each with the same offspring distribution on N. We will let f denote the (discrete) probability density function of the number of offspring of a particle, m the mean of the distribution, and ϕ the probability generating function of the distribution. Thus, if U is the number of children of a particle, then f(n) = P(U = n) for n ∈ N, m = E(U), and ϕ(t) = E(t^U), defined at least for t ∈ (−1, 1].

Our interest is in generational time rather than absolute time: the original particles are in generation 0, and recursively, the children of a particle in generation n belong to generation n + 1. Thus, the stochastic process of interest is X = {X_n : n ∈ N} where X_n is the number of particles in the nth generation for n ∈ N. The process X is a Markov chain and was studied in the section on discrete-time branching chains. In particular, one of the fundamental problems is to compute the probability q of extinction starting with a single particle:

q = P(X_n = 0 for some n ∈ N ∣ X_0 = 1)  (17.1.45)

Then, since the particles act independently, the probability of extinction starting with x ∈ N particles is simply q^x. We will assume that f(0) > 0 and f(0) + f(1) < 1. This is the interesting case, since it means that a particle has a positive probability of dying without children and a positive probability of producing more than 1 child. The fundamental result, you may recall, is that q is the smallest fixed point of ϕ (so that ϕ(q) = q) in the interval [0, 1]. Here are two martingales associated with the branching process:

Each of the following is a martingale with respect to X.

1. Y = {Y_n : n ∈ N} where Y_n = X_n/m^n for n ∈ N.
2. Z = {Z_n : n ∈ N} where Z_n = q^{X_n} for n ∈ N.
Proof

Doob's Martingale
Our next example is one of the simplest, but most important. Indeed, as we will see later in the section on convergence, this type of
martingale is almost universal in the sense that every uniformly integrable martingale is of this type. The process is constructed by
conditioning a fixed random variable on the σ-algebras in a given filtration, and thus accumulating information about the random
variable.

Suppose that F = {F_t : t ∈ T} is a filtration on the probability space (Ω, F, P), and that X is a real-valued random variable with E(|X|) < ∞. Define X_t = E(X ∣ F_t) for t ∈ T. Then X = {X_t : t ∈ T} is a martingale with respect to F.

Proof
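For a concrete finite example, let X = V_1 + ⋯ + V_5 with the V_i independent and uniform on (0, 1), and let F_n record the first n summands. Then X_n = E(X ∣ F_n) = V_1 + ⋯ + V_n + (5 − n)/2, and E(X_n) = 5/2 for every n. A minimal Python sketch of this illustrative case:

    import random

    N, n_paths = 5, 20000
    means = [0.0] * (N + 1)
    for _ in range(n_paths):
        v = [random.uniform(0, 1) for _ in range(N)]
        for n in range(N + 1):
            # X_n = E(X | F_n): observed summands plus the mean of the rest
            means[n] += (sum(v[:n]) + (N - n) * 0.5) / n_paths
    print([round(m, 3) for m in means])    # all entries near 2.5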

The martingale in the last theorem is known as Doob's martingale and is named for Joseph Doob who did much of the pioneering
work on martingales. It's also known as the Lévy martingale, named for Paul Lévy.
Doob's martingale arises naturally in the statistical context of Bayesian estimation. Suppose that X = (X_1, X_2, …) is a sequence of independent random variables whose common distribution depends on an unknown real-valued parameter θ, with values in a parameter space A ⊆ R. For each n ∈ N_+, let F_n = σ{X_1, X_2, …, X_n}, so that F = {F_n : n ∈ N_+} is the natural filtration associated with X. In Bayesian estimation, we model the unknown parameter θ with a random variable Θ taking values in A and having a specified prior distribution. The Bayesian estimator of θ based on the sample X_n = (X_1, X_2, …, X_n) is

U_n = E(Θ ∣ F_n),  n ∈ N_+  (17.1.50)

So it follows that the sequence of Bayesian estimators U = (U_n : n ∈ N_+) is a Doob martingale. The estimation referred to in the discussion of the beta-Bernoulli process above is a special case.

Density Functions
For this example, you may need to review general measures and density functions in the chapter on Distributions. We start with our probability space (Ω, F, P) and filtration F = {F_n : n ∈ N} in discrete time. Suppose now that μ is a finite measure on the sample space (Ω, F). For each n ∈ N, the restriction of μ to F_n is a measure on the measurable space (Ω, F_n), and similarly the restriction of P to F_n is a probability measure on (Ω, F_n). To save notation and terminology, we will refer to these as μ and P on F_n, respectively. Suppose now that μ is absolutely continuous with respect to P on F_n for each n ∈ N. Recall that this means that if A ∈ F_n and P(A) = 0 then μ(B) = 0 for every B ∈ F_n with B ⊆ A. By the Radon-Nikodym theorem, μ has a density function X_n : Ω → R with respect to P on F_n for each n ∈ N. The density function of a measure with respect to a positive measure is known as a Radon-Nikodym derivative. The theorem and the derivative are named for Johann Radon and Otto Nikodym. Here is our main result.

X = {X_n : n ∈ N} is a martingale with respect to F.


Proof

Note that μ may not be absolutely continuous with respect to P on F, or even on F_∞ = σ(⋃_{n=0}^∞ F_n). On the other hand, if μ is absolutely continuous with respect to P on F_∞ then μ has a density function X with respect to P on F_∞. So a natural question in this case is the relationship between the martingale X and the random variable X. You may have already guessed the answer, but at any rate it will be given in the section on convergence.
any rate it will be given in the section on convergence.

This page titled 17.1: Introduction to Martingales is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

17.2: Properties and Constructions

Basic Theory
Preliminaries
As in the Introduction, we start with a stochastic process X = {X_t : t ∈ T} on an underlying probability space (Ω, F, P), having state space R, and where the index set T (representing time) is either N (discrete time) or [0, ∞) (continuous time). Next, we have a filtration F = {F_t : t ∈ T}, and we assume that X is adapted to F. So F is an increasing family of sub σ-algebras of F and X_t is measurable with respect to F_t for t ∈ T. We think of F_t as the collection of events up to time t ∈ T. We assume that E(|X_t|) < ∞, so that the mean of X_t exists as a real number, for each t ∈ T. Finally, in continuous time where T = [0, ∞), we make the standard assumptions that X is right continuous and has left limits, and that the filtration F is right continuous and complete. Please recall the following from the Introduction:

Definitions
1. X is a martingale with respect to F if E(X_t ∣ F_s) = X_s for all s, t ∈ T with s ≤ t.
2. X is a sub-martingale with respect to F if E(X_t ∣ F_s) ≥ X_s for all s, t ∈ T with s ≤ t.
3. X is a super-martingale with respect to F if E(X_t ∣ F_s) ≤ X_s for all s, t ∈ T with s ≤ t.

Our goal in this section is to give a number of basic properties of martingales and to give ways of constructing martingales from
other types of processes. The deeper, fundamental theorems will be studied in the following sections.

Basic Properties
Our first result is that the martingale property is preserved under a coarser filtration.

Suppose that the process X and the filtration F satisfy the basic assumptions above and that G is a filtration coarser than F, so that G_t ⊆ F_t for t ∈ T. If X is a martingale (sub-martingale, super-martingale) with respect to F and X is adapted to G, then X is a martingale (sub-martingale, super-martingale) with respect to G.

Proof

In particular, if X is a martingale (sub-martingale, super-martingale) with respect to some filtration, then it is a martingale (sub-
martingale, super-martingale) with respect to its own natural filtration.
The relations that define martingales, sub-martingales, and super-martingales hold for the ordinary (unconditional) expected values.
We had this result in the last section, but it's worth repeating.

Suppose that s, t ∈ T with s ≤ t.

1. If X is a martingale with respect to F then E(X_s) = E(X_t).
2. If X is a sub-martingale with respect to F then E(X_s) ≤ E(X_t).
3. If X is a super-martingale with respect to F then E(X_s) ≥ E(X_t).

Proof

So if X is a martingale then X has constant expected value, and this value is referred to as the mean of X . The martingale
properties are preserved under sums of the stochastic processes.

For the processes X = {X_t : t ∈ T} and Y = {Y_t : t ∈ T}, let X + Y = {X_t + Y_t : t ∈ T}. If X and Y are martingales (sub-martingales, super-martingales) with respect to F then X + Y is a martingale (sub-martingale, super-martingale) with respect to F.
Proof

The sub-martingale and super-martingale properties are preserved under multiplication by a positive constant and are reversed
under multiplication by a negative constant.

For the process X = {X_t : t ∈ T} and the constant c ∈ R, let cX = {cX_t : t ∈ T}.

1. If X is a martingale with respect to F then cX is also a martingale with respect to F
2. If X is a sub-martingale with respect to F, then cX is a sub-martingale if c > 0 , a super-martingale if c < 0 , and a
martingale if c = 0 .
3. If X is a super-martingale with respect to F, then cX is a super-martingale if c > 0 , a sub-martingale if c < 0 , and a
martingale if c = 0 .
Proof

Property (a), together with the previous additive property, means that the collection of martingales with respect to a fixed filtration
F forms a vector space. Here is a class of transformations that turns martingales into sub-martingales.

Suppose that X takes values in an interval S ⊆ R and that g : S → R is convex with E[|g(X_t)|] < ∞ for t ∈ T. If either of the following conditions holds then g(X) = {g(X_t) : t ∈ T} is a sub-martingale with respect to F:

1. X is a martingale.
2. X is a sub-martingale and g is also increasing.
Proof

Figure 17.2.1 : A convex function and several supporting lines


Here is the most important special case of the previous result:

Suppose again that X is a martingale with respect to F. Let k ∈ [1, ∞) and suppose that E(|X_t|^k) < ∞ for t ∈ T. Then the process |X|^k = {|X_t|^k : t ∈ T} is a sub-martingale with respect to F.

Proof

Figure 17.2.2: The graphs of x ↦ |x|, x ↦ |x|² and x ↦ |x|³ on the interval [−2, 2]
2 3

In particular, if X is a martingale relative to F then |X| = {|X_t| : t ∈ T} is a sub-martingale relative to F. Here is a related result that we will need later. First recall that the positive and negative parts of x ∈ R are x⁺ = max{x, 0} and x⁻ = max{−x, 0}, so that x⁺ ≥ 0, x⁻ ≥ 0, x = x⁺ − x⁻, and |x| = x⁺ + x⁻.

Figure 17.2.3: The graph of x ↦ x⁺ on the interval [−5, 5]

If X = {X_t : t ∈ T} is a sub-martingale relative to F = {F_t : t ∈ T} then X⁺ = {X_t⁺ : t ∈ T} is also a sub-martingale relative to F.
Proof

Our last result of this discussion is that if we sample a continuous-time martingale at an increasing sequence of time points, we get
a discrete-time martingale.

Suppose again that the process X = {X_t : t ∈ [0, ∞)} and the filtration F = {F_t : t ∈ [0, ∞)} satisfy the basic assumptions above. Suppose also that {t_n : n ∈ N} ⊂ [0, ∞) is a strictly increasing sequence of time points with t_0 = 0, and define Y_n = X_{t_n} and G_n = F_{t_n} for n ∈ N. If X is a martingale (sub-martingale, super-martingale) with respect to F then Y = {Y_n : n ∈ N} is a martingale (sub-martingale, super-martingale) with respect to G.

Proof

This result is often useful for extending proofs of theorems in discrete time to continuous time.

The Martingale Transform


Our next discussion, in discrete time, shows how to build a new martingale from an existing martingale and a predictable process. This construction turns out to be very useful, and has an interesting gambling interpretation. To review the definition, recall that {Y_n : n ∈ N_+} is predictable relative to the filtration F = {F_n : n ∈ N} if Y_n is measurable with respect to F_{n−1} for n ∈ N_+. Think of Y_n as the bet that a gambler makes on game n ∈ N_+. The gambler can base the bet on all of the information she has at that point, including the outcomes of the previous n − 1 games. That is, she can base the bet on the information encoded in F_{n−1}.

Suppose that X = {X_n : n ∈ N} is adapted to the filtration F = {F_n : n ∈ N} and that Y = {Y_n : n ∈ N_+} is predictable relative to F. The transform of X by Y is the process Y ⋅ X defined by

(Y ⋅ X)_n = X_0 + ∑_{k=1}^n Y_k(X_k − X_{k−1}),  n ∈ N  (17.2.5)

The motivating example behind the transform, in terms of a gambler making a sequence of bets, is given in an example below. Note that Y ⋅ X is also adapted to F. Note also that the transform depends on X only through X_0 and {X_n − X_{n−1} : n ∈ N_+}. If X is a martingale, this sequence is the martingale difference sequence studied in the Introduction.

Suppose that X = {X_n : n ∈ N} is adapted to the filtration F = {F_n : n ∈ N} and that Y = {Y_n : n ∈ N} is a bounded process, predictable relative to F.
1. If X is a martingale relative to F then Y ⋅ X is also a martingale relative to F.
2. If X is a sub-martingale relative to F and Y is nonnegative, then Y ⋅ X is also a sub-martingale relative to F.
3. If X is a super-martingale relative to F and Y is nonnegative, then Y ⋅ X is also a super-martingale relative to F.
Proof
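In gambling terms, part (a) says that no bounded, predictable betting strategy can turn a fair game into a favorable one. The following minimal Python sketch checks this for an illustrative strategy (bet 1 after a win, 2 after a loss) on fair ±1 games:

    import random

    n_steps, n_paths = 20, 20000
    total = 0.0
    for _ in range(n_paths):
        wealth, prev = 0.0, 1
        for _ in range(n_steps):
            bet = 1.0 if prev == 1 else 2.0            # predictable: uses only the past
            step = 1 if random.random() < 0.5 else -1  # fair game
            wealth += bet * step
            prev = step
        total += wealth
    print(total / n_paths)                             # near 0: still a martingale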

This construction is known as a martingale transform, and is a discrete version of the stochastic integral that we will study in the chapter on Brownian motion. The result also holds if instead of Y being bounded, we have X bounded and E(|Y_n|) < ∞ for n ∈ N_+.

The Doob Decomposition
The next result, in discrete time, shows how to decompose a basic stochastic process into a martingale and a predictable process.
The result is known as the Doob decomposition theorem and is named for Joseph Doob who developed much of the modern theory
of martingales.

Suppose that X = {X_n : n ∈ N} satisfies the basic assumptions above relative to the filtration F = {F_n : n ∈ N}. Then X_n = Y_n + Z_n for n ∈ N, where Y = {Y_n : n ∈ N} is a martingale relative to F and Z = {Z_n : n ∈ N} is predictable relative to F. The decomposition is unique.

1. If X is a sub-martingale relative to F then Z is increasing.
2. If X is a super-martingale relative to F then Z is decreasing.
Proof
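For a concrete instance of the decomposition, let S be the fair ±1 random walk with S_0 = 0 and X_n = S_n². Then E(X_{n+1} ∣ F_n) − X_n = 1, so Z_n = n is the predictable increasing part and Y_n = S_n² − n is the martingale part. A short exact check in Python, averaging over all equally likely paths:

    import itertools

    # E(Y_n) = E(S_n^2 - n) should equal Y_0 = 0 exactly for each n
    for n in range(1, 6):
        paths = list(itertools.product([-1, 1], repeat=n))
        print(n, sum(sum(p) ** 2 - n for p in paths) / len(paths))  # 0.0 each time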

A decomposition of this form is more complicated in continuous time, in part because the definition of a predictable process is more subtle and complex. The decomposition theorem holds in continuous time, with our basic assumptions and the additional assumption that the collection of random variables {X_τ : τ is a finite-valued stopping time} is uniformly integrable. The result is known as the Doob-Meyer decomposition theorem, named additionally for Paul Meyer.

Markov Processes
As you might guess, there are important connections between Markov processes and martingales. Suppose that X = {X_t : t ∈ T} is a (homogeneous) Markov process with state space (S, S), relative to the filtration F = {F_t : t ∈ T}. Let P = {P_t : t ∈ T} denote the collection of transition kernels of X, so that

P_t(x, A) = P(X_t ∈ A ∣ X_0 = x),  x ∈ S, A ∈ S  (17.2.9)

Recall that (like all probability kernels), P_t operates (on the right) on (measurable) functions h : S → R by the rule

P_t h(x) = ∫_S P_t(x, dy) h(y) = E[h(X_t) ∣ X_0 = x],  x ∈ S  (17.2.10)

assuming as usual that the expected value exists. Here is the critical definition that we will need.

Suppose that h : S → R and that E[|h(X_t)|] < ∞ for t ∈ T.

1. h is harmonic for X if P_t h = h for t ∈ T.
2. h is sub-harmonic for X if P_t h ≥ h for t ∈ T.
3. h is super-harmonic for X if P_t h ≤ h for t ∈ T.

The following theorem gives the fundamental connection between the two types of stochastic processes. Given the similarity in the
terminology, the result may not be a surprise.

Suppose that h : S → R and E[|h(X_t)|] < ∞ for t ∈ T. Define h(X) = {h(X_t) : t ∈ T}.
1. h is harmonic for X if and only if h(X) is a martingale with respect to F.
2. h is sub-harmonic for X if and only if h(X) is a sub-martingale with respect to F.
3. h is super-harmonic for X if and only if h(X) is a super-martingale with respect to F.
Proof

Several of the examples given in the Introduction can be re-interpreted in the context of harmonic functions of Markov chains. We
explore some of these below.

Examples
Let R denote the usual collection of Borel measurable subsets of R, and for A ∈ R and x ∈ R let A − x = {y − x : y ∈ A}. Let I denote the identity function on R, so that I(x) = x for x ∈ R. We will need this notation in a couple of our applications below.

Random Walks
Suppose that V = {Vn : n ∈ N} is a sequence of independent, real-valued random variables, with {Vn : n ∈ N+} identically distributed, having common probability measure Q on (R, R) and mean a ∈ R. Recall from the Introduction that the partial sum process X = {Xn : n ∈ N} associated with V is given by
Xn = ∑_{i=0}^{n} Vi, n ∈ N (17.2.12)
and that X is a (discrete-time) random walk. But X is also a discrete-time Markov process with one-step transition kernel P given by P(x, A) = Q(A − x) for x ∈ R and A ∈ R.

The identity function I is


1. Harmonic for X if a = 0 .
2. Sub-harmonic for X if a ≥ 0 .
3. Super-harmonic for X if a ≤ 0 .
Proof

It now follows from our theorem above that X is a martingale if a = 0, a sub-martingale if a ≥ 0, and a super-martingale if a ≤ 0. We showed these results directly from the definitions in the Introduction.

The Simple Random Walk


Suppose now that V = {Vn : n ∈ N} is a sequence of independent random variables with P(Vi = 1) = p and P(Vi = −1) = 1 − p for i ∈ N+, where p ∈ (0, 1). Let X = {Xn : n ∈ N} be the partial sum process associated with V so that
Xn = ∑_{i=0}^{n} Vi, n ∈ N (17.2.14)
Then X is the simple random walk with parameter p. In terms of gambling, our gambler plays a sequence of independent and identical games, and on each game, wins €1 with probability p and loses €1 with probability 1 − p. So if X0 is the gambler's initial fortune, then Xn is her net fortune after n games. In the Introduction we showed that X is a martingale if p = 1/2, a super-martingale if p < 1/2, and a sub-martingale if p > 1/2. But suppose now that instead of making constant unit bets, the gambler makes bets that depend on the outcomes of previous games. This leads to a martingale transform as studied above.

Suppose that the gambler bets Yn on game n ∈ N+ (at even stakes), where Yn ∈ [0, ∞) depends on (V0, V1, V2, …, Vn−1) and satisfies E(Yn) < ∞. So the process Y = {Yn : n ∈ N+} is predictable with respect to X, and the gambler's net winnings after n games is
(Y ⋅ X)n = V0 + ∑_{k=1}^{n} Yk Vk = X0 + ∑_{k=1}^{n} Yk (Xk − Xk−1) (17.2.15)
1. Y ⋅ X is a sub-martingale if p > 1/2.
2. Y ⋅ X is a super-martingale if p < 1/2.
3. Y ⋅ X is a martingale if p = 1/2.

Proof

The simple random walk X is also a discrete-time Markov chain on Z with one-step transition matrix P given by P(x, x + 1) = p and P(x, x − 1) = 1 − p.

The function h given by h(x) = ((1 − p)/p)^x for x ∈ Z is harmonic for X.

Proof
It now follows from our theorem above that the process Z = {Zn : n ∈ N} given by Zn = ((1 − p)/p)^{Xn} for n ∈ N is a martingale. We showed this directly from the definition in the Introduction. As you may recall, this is De Moivre's martingale, named for Abraham De Moivre.
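As a quick numerical sanity check (an illustration, not part of the text), the harmonic property can be verified directly, since here Ph(x) = p h(x + 1) + (1 − p) h(x − 1). A minimal Python sketch:

```python
p = 0.3  # an arbitrary choice of the win probability

def h(x):
    """De Moivre's harmonic function h(x) = ((1 - p) / p) ** x."""
    return ((1 - p) / p) ** x

# Check Ph(x) = p*h(x+1) + (1-p)*h(x-1) = h(x) at several states.
for x in range(-3, 4):
    Ph = p * h(x + 1) + (1 - p) * h(x - 1)
    print(x, h(x), Ph, abs(Ph - h(x)) < 1e-9)
```

Algebraically, with r = (1 − p)/p, we have p r^{x+1} + (1 − p) r^{x−1} = r^x [(1 − p) + p] = r^x, which is the harmonic property.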

Branching Processes
Recall the discussion of the simple branching process from the Introduction. The fundamental assumption is that the particles act
independently, each with the same offspring distribution on N. As before, we will let f denote the (discrete) probability density
function of the number of offspring of a particle, m the mean of the distribution, and ϕ the probability generating function of the
distribution. We assume that f (0) > 0 and f (0) + f (1) < 1 so that a particle has a positive probability of dying without children
and a positive probability of producing more than 1 child. Recall that q denotes the probability of extinction, starting with a single
particle.
The stochastic process of interest is X = {Xn : n ∈ N} where Xn is the number of particles in the nth generation for n ∈ N. Recall that X is a discrete-time Markov chain on N with one-step transition matrix P given by P(x, y) = f^{∗x}(y) for x, y ∈ N, where f^{∗x} denotes the convolution power of order x of f.

The function h given by h(x) = q^x for x ∈ N is harmonic for X.

Proof

It now follows from our theorem above that the process Z = {Zn : n ∈ N} is a martingale, where Zn = q^{Xn} for n ∈ N. We showed this directly from the definition in the Introduction. We also showed that the process Y = {Yn : n ∈ N} is a martingale, where Yn = Xn / m^n for n ∈ N. But we can't write Yn = h(Xn) for a function h defined on the state space, so we can't interpret this martingale in terms of a harmonic function.
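To make this concrete, here is a small Python sketch (an illustration with a hypothetical offspring distribution, not part of the text): it computes the extinction probability q as the smallest fixed point of the generating function ϕ, then estimates E(q^{Xn}) by simulation, which should stay near q^{X0} = q by the martingale property.

```python
import random

random.seed(1)

# Hypothetical offspring pdf f on {0, 1, 2}; mean m = 0.25 + 2 * 0.40 = 1.05 > 1
f = {0: 0.35, 1: 0.25, 2: 0.40}

def phi(t):
    """Probability generating function of the offspring distribution."""
    return sum(p * t ** k for k, p in f.items())

# q is the smallest fixed point of phi in [0, 1]; iterate q <- phi(q) from 0.
q = 0.0
for _ in range(5000):
    q = phi(q)
print("extinction probability q ~", q)  # exact value for this f is 0.875

def next_generation(x):
    """One step of the branching chain: x particles reproduce independently."""
    total = 0
    for _ in range(x):
        u, cum = random.random(), 0.0
        for k, p in f.items():
            cum += p
            if u < cum:
                total += k
                break
    return total

# Estimate E(q^{X_n}) for X_0 = 1; by the martingale property it should be ~ q.
n, reps = 8, 20000
est = 0.0
for _ in range(reps):
    x = 1
    for _ in range(n):
        x = next_generation(x)
    est += q ** x
print("E(q^X_n) ~", est / reps)
```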

General Random Walks


Suppose that X = {Xt : t ∈ T} is a stochastic process satisfying the basic assumptions above relative to the filtration F = {Ft : t ∈ T}. Recall from the Introduction that the term increment refers to a difference of the form Xs+t − Xs for s, t ∈ T. The process X has independent increments if this increment is always independent of Fs, and has stationary increments if this increment always has the same distribution as Xt − X0. In discrete time, a process with stationary, independent increments is simply a random walk as discussed above. In continuous time, a process with stationary, independent increments (and with the continuity assumptions we have imposed) is called a continuous-time random walk, and also a Lévy process, named for Paul Lévy.
So suppose that X has stationary, independent increments. For t ∈ T let Qt denote the probability distribution of Xt − X0 on (R, R), so that Qt is also the probability distribution of Xs+t − Xs for every s, t ∈ T. From our previous study, we know that X is a Markov process with transition kernel Pt at time t ∈ T given by
Pt(x, A) = Qt(A − x); x ∈ R, A ∈ R (17.2.18)
We also know that E(Xt − X0) = at for t ∈ T where a = E(X1 − X0) (assuming of course that the last expected value exists in R).

The identity function I is


1. Harmonic for X if a = 0 .
2. Sub-harmonic for X if a ≥ 0 .
3. Super-harmonic for X if a ≤ 0 .
Proof

It now follows that X is a martingale if a = 0, a sub-martingale if a ≥ 0, and a super-martingale if a ≤ 0. We showed this directly in the Introduction. Recall that in continuous time, the Poisson counting process has stationary, independent increments, as does standard Brownian motion.

This page titled 17.2: Properties and Constructions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

17.3: Stopping Times

Basic Theory
As in the Introduction, we start with a stochastic process X = {Xt : t ∈ T} on an underlying probability space (Ω, F, P), having state space R, and where the index set T (representing time) is either N (discrete time) or [0, ∞) (continuous time). Next, we have a filtration F = {Ft : t ∈ T}, and we assume that X is adapted to F. So F is an increasing family of sub σ-algebras of F and Xt is measurable with respect to Ft for t ∈ T. We think of Ft as the collection of events up to time t ∈ T. We assume that E(|Xt|) < ∞, so that the mean of Xt exists as a real number, for each t ∈ T. Finally, in continuous time where T = [0, ∞), we make the standard assumption that X is right continuous and has left limits, and that the filtration F is right continuous and complete.
Our general goal in this section is to see if some of the important martingale properties are preserved if the deterministic time t ∈ T is replaced by a (random) stopping time. Recall that a random time τ with values in T ∪ {∞} is a stopping time relative to F if {τ ≤ t} ∈ Ft for t ∈ T. So a stopping time is a random time that does not require that we see into the future. That is, we can tell whether τ ≤ t from the information available at time t. Next recall that the σ-algebra associated with the stopping time τ is
Fτ = {A ∈ F : A ∩ {τ ≤ t} ∈ Ft for all t ∈ T} (17.3.1)

So Fτ is the collection of events up to the random time τ, just as Ft is the collection of events up to the deterministic time t ∈ T. In terms of a gambler playing a sequence of games, the time that the gambler decides to stop playing must be a stopping time, and in fact this interpretation is the origin of the name. That is, the time when the gambler decides to stop playing can only depend on the information that the gambler has up to that point in time.

Optional Stopping
The basic martingale equation E(Xt ∣ Fs) = Xs for s, t ∈ T with s ≤ t can be generalized by replacing both s and t by bounded stopping times. The result is known as Doob's optional stopping theorem and is named again for Joseph Doob. Suppose that X = {Xt : t ∈ T} satisfies the basic assumptions above with respect to the filtration F = {Ft : t ∈ T}.

Suppose that ρ and τ are bounded stopping times relative to F with ρ ≤ τ.
1. If X is a martingale relative to F then E(Xτ ∣ Fρ) = Xρ.
2. If X is a sub-martingale relative to F then E(Xτ ∣ Fρ) ≥ Xρ.
3. If X is a super-martingale relative to F then E(Xτ ∣ Fρ) ≤ Xρ.

Proof in discrete time


Proof in continuous time

The assumption that the stopping times are bounded is critical. A counterexample when this assumption does not hold is given
below. Here are a couple of simple corollaries:

Suppose again that ρ and τ are bounded stopping times relative to F with ρ ≤ τ.
1. If X is a martingale relative to F then E(Xτ) = E(Xρ).
2. If X is a sub-martingale relative to F then E(Xτ) ≥ E(Xρ).
3. If X is a super-martingale relative to F then E(Xτ) ≤ E(Xρ).
Proof

Suppose that τ is a bounded stopping time relative to F.
1. If X is a martingale relative to F then E(Xτ) = E(X0).
2. If X is a sub-martingale relative to F then E(Xτ) ≥ E(X0).
3. If X is a super-martingale relative to F then E(Xτ) ≤ E(X0).

The Stopped Martingale


For our next discussion, we first need to recall how to stop a stochastic process at a stopping time.

Suppose that X satisfies the assumptions above and that τ is a stopping time relative to the filtration F. The stopped process X^τ = {X^τ_t : t ∈ [0, ∞)} is defined by
X^τ_t = Xt∧τ, t ∈ [0, ∞) (17.3.7)
Details

So X^τ_t = Xt if t < τ and X^τ_t = Xτ if t ≥ τ. In particular, note that X^τ_0 = X0. If Xt is the fortune of a gambler at time t ∈ T, then X^τ_t is the revised fortune at time t when τ is the stopping time of the gambler. Our next result, known as the elementary stopping theorem, is that a martingale stopped at a stopping time is still a martingale.

Suppose again that X satisfies the assumptions above, and that τ is a stopping time relative to F.
1. If X is a martingale relative to F then so is X^τ.
2. If X is a sub-martingale relative to F then so is X^τ.
3. If X is a super-martingale relative to F then so is X^τ.

General proof
Special proof in discrete time

The elementary stopping theorem is bad news for the gambler playing a sequence of games. If the games are fair or unfavorable, then no stopping time, regardless of how cleverly designed, can help the gambler. Since a stopped martingale is still a martingale, the mean property holds.

Suppose again that X satisfies the assumptions above, and that τ is a stopping time relative to F. Let t ∈ T.
1. If X is a martingale relative to F then E(Xt∧τ) = E(X0)
2. If X is a sub-martingale relative to F then E(Xt∧τ) ≥ E(X0)
3. If X is a super-martingale relative to F then E(Xt∧τ) ≤ E(X0)

Optional Stopping in Discrete Time


A simple corollary of the optional stopping theorem is that if X is a martingale and τ a bounded stopping time, then E(Xτ) = E(X0) (with the appropriate inequalities if X is a sub-martingale or a super-martingale). Our next discussion centers on other conditions which give these results in discrete time. Suppose that X = {Xn : n ∈ N} satisfies the basic assumptions above with respect to the filtration F = {Fn : n ∈ N}, and that τ is a stopping time relative to F.

Suppose that |Xn| is bounded uniformly in n ∈ N and that τ is finite.
1. If X is a martingale then E(Xτ) = E(X0).
2. If X is a sub-martingale then E(Xτ) ≥ E(X0).
3. If X is a super-martingale then E(Xτ) ≤ E(X0).

Proof

Suppose that |Xn+1 − Xn| is bounded uniformly in n ∈ N and that E(τ) < ∞.
1. If X is a martingale then E(Xτ) = E(X0).
2. If X is a sub-martingale then E(Xτ) ≥ E(X0).
3. If X is a super-martingale then E(Xτ) ≤ E(X0).

Proof

Let's return to our original interpretation of a martingale X representing the fortune of a gambler playing fair games. The gambler
could choose to quit at a random time τ , but τ would have to be a stopping time, based on the gambler's information encoded in the
filtration F. Under the conditions of the theorem, no such scheme can help the gambler in terms of expected value.

Examples and Applications


The Simple Random Walk
Suppose that V = (V1, V2, …) is a sequence of independent, identically distributed random variables with P(Vi = 1) = p and P(Vi = −1) = 1 − p for i ∈ N+, where p ∈ (0, 1). Let X = (X0, X1, X2, …) be the partial sum process associated with V so that
Xn = ∑_{i=1}^{n} Vi, n ∈ N (17.3.14)
Then X is the simple random walk with parameter p. In terms of gambling, our gambler plays a sequence of independent and identical games, and on each game, wins €1 with probability p and loses €1 with probability 1 − p. So Xn is the gambler's total net winnings after n games. We showed in the Introduction that X is a martingale if p = 1/2 (the fair case), a sub-martingale if p > 1/2 (the favorable case), and a super-martingale if p < 1/2 (the unfair case). Now, for c ∈ Z, let
τc = inf{n ∈ N : Xn = c} (17.3.15)
where as usual, inf(∅) = ∞. So τc is the first time that the gambler's fortune reaches c. What if the gambler simply continues playing until her net winnings is some specified positive number (say €1 000 000)? Is that a workable strategy?

Suppose that p = 1/2 and that c ∈ N+.
1. P(τc < ∞) = 1
2. E(Xτc) = c ≠ 0 = E(X0)
3. E(τc) = ∞

Proof

Note that part (b) does not contradict the optional stopping theorem because of part (c). The strategy of waiting until the net
winnings reaches a specified goal c is unsustainable. Suppose now that the gambler plays until the net winnings either falls to a
specified negative number (a loss that she can tolerate) or reaches a specified positive number (a goal she hopes to reach).

Suppose again that p = 1/2. For a, b ∈ N+, let τ = τ−a ∧ τb. Then
1. E(τ) < ∞
2. E(Xτ) = 0
3. P(τ−a < τb) = b/(a + b)

Proof

So gambling until the net winnings either falls to −a or reaches b is a workable strategy, but alas has expected value 0. Here's
another example that shows that the first version of the optional sampling theorem can fail if the stopping times are not bounded.

Suppose again that p = 1/2. Let a, b ∈ N+ with a < b. Then τa < τb < ∞ but
b = E(Xτb ∣ Fτa) ≠ Xτa = a (17.3.17)

Proof

This result does not contradict the optional stopping theorem since the stopping times are not bounded.
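The two-sided results are easy to check by simulation. The Python sketch below (an illustration, not part of the text) estimates P(τ−a < τb) and E(τ) for the symmetric walk; the first should be near b/(a + b), and the second near the known value ab (a standard fact, not derived in the text).

```python
import random

random.seed(2)

def two_sided_ruin(a, b, reps=20000):
    """Simulate the symmetric simple random walk until it hits -a or b.

    Returns estimates of P(tau_{-a} < tau_b) and E(tau)."""
    hits_low, total_time = 0, 0
    for _ in range(reps):
        x, n = 0, 0
        while -a < x < b:
            x += random.choice((-1, 1))
            n += 1
        hits_low += (x == -a)
        total_time += n
    return hits_low / reps, total_time / reps

p_low, mean_tau = two_sided_ruin(3, 5)
print("P(tau_{-3} < tau_5) ~", p_low, "(exact: 5/8)")
print("E(tau) ~", mean_tau, "(exact: a*b = 15)")
```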

Wald's Equation
Wald's equation, named for Abraham Wald, is a formula for the expected value of the sum of a random number of independent, identically distributed random variables. We have considered this before, in our discussion of conditional expected value and our discussion of random samples, but martingale theory leads to a particularly simple and elegant proof.

Suppose that X = (Xn : n ∈ N+) is a sequence of independent, identically distributed variables with common mean μ ∈ R. If N is a stopping time for X with E(N) < ∞ then
E(∑_{k=1}^{N} Xk) = E(N) μ (17.3.18)

Proof
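A quick Monte Carlo check of Wald's equation (an illustration only, with an arbitrary choice of distribution and stopping time): take the Xn uniform on [0, 1] and let N be the first index with Xn > 0.9. Then N is a stopping time with E(N) = 10 and μ = 1/2, so Wald's equation gives E(∑_{k=1}^{N} Xk) = 5.

```python
import random

random.seed(3)

def wald_trial():
    """Run X_1, X_2, ... ~ Uniform(0, 1) until the first X_n > 0.9.

    N is a stopping time: {N <= n} depends only on X_1, ..., X_n."""
    total = 0.0
    while True:
        x = random.random()
        total += x
        if x > 0.9:
            return total

reps = 100000
est = sum(wald_trial() for _ in range(reps)) / reps
print("E(sum) ~", est, "; Wald's equation predicts E(N) * mu = 10 * 0.5 = 5")
```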

Patterns in Multinomial Trials
Patterns in multinomial trials were studied in the chapter on Renewal Processes. As is often the case, martingales provide a more elegant solution. Suppose that L = (L1, L2, …) is a sequence of independent, identically distributed random variables taking values in a finite set S, so that L is a sequence of multinomial trials. Let f denote the common probability density function, so that for a generic trial variable L, we have f(a) = P(L = a) for a ∈ S. We assume that all outcomes in S are actually possible, so f(a) > 0 for a ∈ S.

In this discussion, we interpret S as an alphabet, and we write the sequence of variables in concatenation form, L = L1 L2 ⋯, rather than standard sequence form. Thus the sequence is an infinite string of letters from our alphabet S. We are interested in the first occurrence of a particular finite substring of letters (that is, a “word” or “pattern”) in the infinite sequence. The following definition will simplify the notation.

If a = a1 a2 ⋯ ak is a word of length k ∈ N+ from the alphabet S, define
f(a) = ∏_{i=1}^{k} f(ai) (17.3.22)
so f(a) is the probability of k consecutive trials producing word a.

So, fix a word a = a1 a2 ⋯ ak of length k ∈ N+ from the alphabet S, and consider the number of trials Na until a is completed. Our goal is to compute ν(a) = E(Na). We do this by casting the problem in terms of a sequence of gamblers playing fair games and then using the optional stopping theorem above. So suppose that if a gambler bets c ∈ (0, ∞) on a letter a ∈ S on a trial, then the gambler wins c/f(a) if a occurs on that trial and wins 0 otherwise. The expected value of this bet is
f(a) [c/f(a)] − c = 0 (17.3.23)
and so the bet is fair. Consider now a gambler with an initial fortune 1. When she starts playing, she bets 1 on a1. If she wins, she bets her entire fortune 1/f(a1) on the next trial on a2. She continues in this way: as long as she wins, she bets her entire fortune on the next trial on the next letter of the word, until either she loses or completes the word a. Finally, we consider a sequence of independent gamblers playing this strategy, with gambler i starting on trial i for each i ∈ N+.

For a finite word a from the alphabet S, ν(a) is the total winnings by all of the players at time Na.

Proof

Given a, we can compute the total winnings precisely. By definition, trials Na − k + 1, …, Na form the word a for the first time. Hence for i ≤ Na − k, gambler i loses at some point. Also by definition, gambler Na − k + 1 wins all of her bets, completes word a, and so collects 1/f(a). The complicating factor is that gamblers Na − k + 2, …, Na may or may not have won all of their bets at the point when the game ends. The following exercise illustrates this.

Suppose that L is a sequence of Bernoulli trials (so S = {0, 1}) with success probability p ∈ (0, 1). For each of the following
strings, find the expected number of trials needed to complete the string.
1. 001
2. 010
Solution

The difference between the two words is that the word in (b) has a prefix (a proper string at the beginning of the word) that is also a suffix (a proper string at the end of the word). The word in (a) has no such prefix. Thus we are led naturally to the following dichotomy:

Suppose that a is a finite word from the alphabet S . If no proper prefix of a is also a suffix, then a is simple. Otherwise, a is
compound.

Here is the main result, which of course is the same as when the problem was solved using renewal theory.

Suppose that a is a finite word in the alphabet S.
1. If a is simple then ν(a) = 1/f(a).
2. If a is compound, then ν(a) = 1/f(a) + ν(b) where b is the longest word that is both a prefix and a suffix of a.
Proof

For a compound word, we can use (b) to reduce the computation to simple words.
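The dichotomy translates into a short recursion. Here is a Python sketch (an illustration, not part of the text): nu(word, f) computes ν(a) by repeatedly peeling off the longest proper prefix that is also a suffix, exactly as in part (b) above.

```python
def prob(word, f):
    """f(a): the probability that consecutive trials produce the word."""
    p = 1.0
    for letter in word:
        p *= f[letter]
    return p

def nu(word, f):
    """Expected number of trials until the word appears:
    nu(a) = 1/f(a) + nu(b), with b the longest proper prefix of a
    that is also a suffix (nu of the empty word is 0)."""
    if not word:
        return 0.0
    b = ""
    for k in range(1, len(word)):
        if word[:k] == word[-k:]:
            b = word[:k]
    return 1.0 / prob(word, f) + nu(b, f)

# Bernoulli trials with p = 1/2: '001' is simple, '010' is compound (b = '0').
f = {"0": 0.5, "1": 0.5}
print(nu("001", f))  # 1/f(001) = 8
print(nu("010", f))  # 1/f(010) + 1/f(0) = 8 + 2 = 10
```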

Consider Bernoulli trials with success probability p ∈ (0, 1) . Find the expected number of trials until each of the following
strings is completed.
1. 1011011
2. 11 ⋯ 1 (k times)
Solutions

Recall that an ace-six flat die is a six-sided die for which faces 1 and 6 have probability 1/4 each, while faces 2, 3, 4, and 5 have probability 1/8 each. Ace-six flat dice are sometimes used by gamblers to cheat.

Suppose that an ace-six flat die is thrown repeatedly. Find the expected number of throws until the pattern 6165616 occurs.
Solution

Suppose that a monkey types randomly on a keyboard that has the 26 lower-case letter keys and the space key (so 27 keys).
Find the expected number of keystrokes until the monkey produces each of the following phrases:
1. it was the best of times
2. to be or not to be
Solution

The Secretary Problem


The secretary problem was considered in the chapter on Finite Sampling Models. In this discussion we will solve a variation of the problem using martingales. Suppose that there are n ∈ N+ candidates for a job, or perhaps potential marriage partners. The candidates arrive sequentially in random order and are interviewed. We measure the quality of each candidate by a number in the interval [0, 1]. Our goal is to select the very best candidate, but once a candidate is rejected, she cannot be recalled. Mathematically, our assumptions are that the sequence of candidate variables X = (X1, X2, …, Xn) is independent and that each is uniformly distributed on the interval [0, 1] (and so has the standard uniform distribution). Our goal is to select a stopping time τ with respect to X that maximizes E(Xτ), the expected value of the chosen candidate. The following sequence will play a critical role as a sequence of thresholds.

Define the sequence a = (ak : k ∈ N) by a0 = 0 and ak+1 = (1 + ak²)/2 for k ∈ N. Then
1. ak < 1 for k ∈ N.
2. ak < ak+1 for k ∈ N.
3. ak → 1 as k → ∞.
4. If X is uniformly distributed on [0, 1] then E(X ∨ ak) = ak+1 for k ∈ N.


Proof

Since a0 =0 , all of the terms of the sequence are in [0, 1) by (a). Approximations of the first 10 terms are
(0, 0.5, 0.625, 0.695, 0.742, 0.775, 0.800, 0.820, 0.836, 0.850, 0.861, …) (17.3.27)

Property (d) gives some indication of why the sequence is important for the secretary problem. At any rate, the next theorem gives the solution. To simplify the notation, let Nn = {0, 1, …, n} and Nn+ = {1, 2, …, n}.

The stopping time τ = inf{k ∈ Nn+ : Xk > an−k} is optimal for the secretary problem with n candidates. The optimal value is E(Xτ) = an.

Proof

Here is a specific example:

For n = 5, the decision rule is as follows:
1. Select candidate 1 if X1 > 0.742; otherwise,
2. select candidate 2 if X2 > 0.695; otherwise,
3. select candidate 3 if X3 > 0.625; otherwise,
4. select candidate 4 if X4 > 0.5; otherwise,
5. select candidate 5.
The expected value of our chosen candidate is 0.775.
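The thresholds and the optimal value are easy to reproduce numerically. The Python sketch below (an illustration, not part of the text) computes the sequence ak and simulates the rule for n = 5; the average value of the chosen candidate should be near a5 ≈ 0.775.

```python
import random

random.seed(4)

def thresholds(n):
    """The sequence a_0, ..., a_n with a_{k+1} = (1 + a_k ** 2) / 2."""
    a = [0.0]
    for _ in range(n):
        a.append((1.0 + a[-1] ** 2) / 2.0)
    return a

def play(n, a):
    """Apply the optimal rule: accept the first X_k with X_k > a_{n-k}."""
    for k in range(1, n + 1):
        x = random.random()
        if x > a[n - k]:
            return x
    return x  # effectively unreachable: a_0 = 0 forces acceptance at k = n

n = 5
a = thresholds(n)
print("thresholds:", [round(v, 3) for v in a])
reps = 200000
value = sum(play(n, a) for _ in range(reps)) / reps
print("E(X_tau) ~", value, "; theoretical optimal value a_5 =", a[n])
```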

In our original version of the secretary problem, we could only observe the relative ranks of the candidates, and our goal was to maximize the probability of picking the best candidate. With n = 5, the optimal strategy is to let the first two candidates go by and then pick the first candidate after that who is better than all previous candidates, if she exists. If she does not exist, of course, we must select candidate 5. The probability of picking the best candidate is 0.433.

This page titled 17.3: Stopping Times is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

17.4: Inequalities

Basic Theory
In this section, we will study a number of interesting inequalities associated with martingales and their sub-martingale and super-martingale cousins. These turn out to be very important both for theoretical reasons and for applications. You may need to review infimums and supremums.

Basic Assumptions
As in the Introduction, we start with a stochastic process X = {Xt : t ∈ T} on an underlying probability space (Ω, F, P), having state space R, and where the index set T (representing time) is either N (discrete time) or [0, ∞) (continuous time). Next, we have a filtration F = {Ft : t ∈ T}, and we assume that X is adapted to F. So F is an increasing family of sub σ-algebras of F and Xt is measurable with respect to Ft for t ∈ T. We think of Ft as the collection of events up to time t ∈ T. We assume that E(|Xt|) < ∞, so that the mean of Xt exists as a real number, for each t ∈ T. Finally, in continuous time where T = [0, ∞), we make the standard assumptions that t ↦ Xt is right continuous and has left limits, and that the filtration F is right continuous and complete.

Maximal Inequalities
For motivation, let's review a modified version of Markov's inequality, named for Andrei Markov.

If X is a real-valued random variable then
P(X ≥ x) ≤ (1/x) E(X; X ≥ x), x ∈ (0, ∞) (17.4.1)

Proof

So Markov's inequality gives an upper bound on the probability that X exceeds a given positive value x, in terms of a moment of X. Now let's return to our stochastic process X = {Xt : t ∈ T}. To simplify the notation, let Tt = {s ∈ T : s ≤ t} for t ∈ T. Here is the main definition:

For the process X, define the corresponding maximal process U = { Ut : t ∈ T } by


Ut = sup{ Xs : s ∈ Tt }, t ∈ T (17.4.3)

Clearly, the maximal process is increasing, so that if s, t ∈ T with s ≤ t then Us ≤ Ut. A trivial application of Markov's inequality above would give
P(Ut ≥ x) ≤ (1/x) E(Ut; Ut ≥ x), x > 0 (17.4.4)

But when X is a sub-martingale, the following theorem gives a much stronger result by replacing the first occurrence of Ut on the right with Xt. The theorem is known as Doob's sub-martingale maximal inequality (or more simply as Doob's inequality), named once again for Joseph Doob, who did much of the pioneering work on martingales. A sub-martingale has an increasing property of sorts, in the sense that if s, t ∈ T with s ≤ t then E(Xt ∣ Fs) ≥ Xs, so it's perhaps not entirely surprising that such a bound is possible.

Suppose that X is a sub-martingale. For t ∈ T, let Ut = sup{Xs : s ∈ Tt}. Then
P(Ut ≥ x) ≤ (1/x) E(Xt; Ut ≥ x), x ∈ (0, ∞) (17.4.5)

Proof in discrete time


Proof in continuous time

There are a number of simple corollaries of the maximal inequality. For the first, recall that the positive part of x ∈ R is x+ = x ∨ 0, so that x+ = x if x > 0 and x+ = 0 if x ≤ 0.

Suppose that X is a sub-martingale. For t ∈ T, let Vt = sup{Xs+ : s ∈ Tt}. Then
P(Vt ≥ x) ≤ (1/x) E(Xt+; Vt ≥ x), x ∈ (0, ∞) (17.4.14)

Proof

As a further simple corollary, note that
P(Vt ≥ x) ≤ (1/x) E(Xt+), x ∈ (0, ∞) (17.4.15)
This is sometimes how the maximal inequality is given in the literature.

Suppose that X is a martingale. For t ∈ T, let Wt = sup{|Xs| : s ∈ Tt}. Then
P(Wt ≥ x) ≤ (1/x) E(|Xt|; Wt ≥ x), x ∈ (0, ∞) (17.4.16)

Proof

Once again, a further simple corollary is
P(Wt ≥ x) ≤ (1/x) E(|Xt|), x ∈ (0, ∞) (17.4.17)

Next recall that for k ∈ (1, ∞), the k-norm of a real-valued random variable X is ∥X∥k = [E(|X|^k)]^{1/k}, and the vector space Lk consists of all real-valued random variables for which this norm is finite. The following theorem is the norm version of Doob's maximal inequality.

Suppose again that X is a martingale. For t ∈ T, let Wt = sup{|Xs| : s ∈ Tt}. Then for k > 1,
∥Wt∥k ≤ [k/(k − 1)] ∥Xt∥k (17.4.18)

Proof

Once again, W = {Wt : t ∈ T} is the maximal process associated with |X| = {|Xt| : t ∈ T}. As noted in the proof, j = k/(k − 1) is the exponent conjugate to k, satisfying 1/j + 1/k = 1. So this version of the maximal inequality states that the k-norm of the maximum of the martingale X on Tt is bounded by j times the k-norm of Xt, where j and k are conjugate exponents. Stated just in terms of expected value, rather than norms, the Lk maximal inequality is
E(|Wt|^k) ≤ [k/(k − 1)]^k E(|Xt|^k) (17.4.27)

Our final result in this discussion is a variation of the maximal inequality for super-martingales.

Suppose that X = {Xt : t ∈ T} is a nonnegative super-martingale, and let U∞ = sup{Xt : t ∈ T}. Then
P(U∞ ≥ x) ≤ (1/x) E(X0), x ∈ (0, ∞) (17.4.28)

Proof

The Up-Crossing Inequality


The up-crossing inequality gives a bound on how much a sub-martingale (or super-martingale) can oscillate, and is the main tool in
the martingale convergence theorems that will be studied in the next section. It should come as no surprise by now that the
inequality is due to Joseph Doob. We start with the discrete-time case.

Suppose that x = (xn : n ∈ N) is a sequence of real numbers, and that a, b ∈ R with a < b. Define t0(x) = 0 and then recursively define
sk+1(x) = inf{n ∈ N : n ≥ tk(x), xn ≤ a}, k ∈ N
tk+1(x) = inf{n ∈ N : n ≥ sk+1(x), xn ≥ b}, k ∈ N
1. The number of up-crossings of the interval [a, b] by the sequence x up to time n ∈ N is
un(a, b, x) = sup{k ∈ N : tk(x) ≤ n} (17.4.32)
2. The total number of up-crossings of the interval [a, b] by the sequence x is
u∞(a, b, x) = sup{k ∈ N : tk(x) < ∞} (17.4.33)

Details

So informally, as the name suggests, un(a, b, x) is the number of times that the sequence (x0, x1, …, xn) goes from a value below a to one above b, and u∞(a, b, x) is the number of times the entire sequence x goes from a value below a to one above b. Here are a few simple properties:

Suppose again that x = (xn : n ∈ N) is a sequence of real numbers and that a, b ∈ R with a < b.
1. un(a, b, x) is increasing in n ∈ N.
2. un(a, b, x) → u∞(a, b, x) as n → ∞.
3. If c, d ∈ R with a < c < d < b then un(c, d, x) ≥ un(a, b, x) for n ∈ N, and u∞(c, d, x) ≥ u∞(a, b, x).
Proof
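The discrete-time definition is easy to implement directly. Here is a Python sketch (an illustration, not part of the text) that counts up-crossings of [a, b] by a finite sequence, scanning for alternating passages below a and above b.

```python
def upcrossings(x, a, b):
    """Count the up-crossings of [a, b] by the finite sequence x.

    Scan left to right: wait for a term <= a, then for a term >= b;
    each completed pair is one up-crossing."""
    count = 0
    below = False  # True once a term <= a is seen since the last up-crossing
    for value in x:
        if not below:
            if value <= a:
                below = True
        elif value >= b:
            count += 1
            below = False
    return count

# This sequence passes from below 0 to above 1 twice:
print(upcrossings([0.5, -0.2, 0.3, 1.4, 0.8, -1.0, 2.0], 0, 1))  # 2
```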

The importance of the definitions is found in the following theorem. Recall that the set of extended real numbers is R ∪ {−∞, ∞}, and Q is the set of rational real numbers.

Suppose again that x = (xn : n ∈ N) is a sequence of real numbers. Then lim_{n→∞} xn exists in the extended real numbers if and only if u∞(a, b, x) < ∞ for every a, b ∈ Q with a < b.

Proof

Clearly the theorem is true with Q replaced with R, but the countability of Q will be important in the martingale convergence theorem. As a simple corollary, if x is bounded and u∞(a, b, x) < ∞ for every a, b ∈ Q with a < b, then x converges in R. The up-crossing inequality for a discrete-time martingale X gives an upper bound on the expected number of up-crossings of X up to time n ∈ N in terms of a moment of Xn.

Suppose that X = {Xn : n ∈ N} satisfies the basic assumptions with respect to the filtration F = {Fn : n ∈ N}, and let a, b ∈ R with a < b. Let Un = un(a, b, X), the random number of up-crossings of [a, b] by X up to time n ∈ N.
1. If X is a super-martingale relative to F then
E(Un) ≤ [1/(b − a)] E[(Xn − a)−] ≤ [1/(b − a)] [E(Xn−) + |a|] ≤ [1/(b − a)] [E(|Xn|) + |a|], n ∈ N (17.4.34)
2. If X is a sub-martingale relative to F then
E(Un) ≤ [1/(b − a)] E[(Xn − a)+] ≤ [1/(b − a)] [E(Xn+) + |a|] ≤ [1/(b − a)] [E(|Xn|) + |a|], n ∈ N (17.4.35)

Proof
Additional details

Of course if X is a martingale with respect to F then both inequalities apply. In continuous time, as usual, the concepts are more
complicated and technical.

Suppose that x : [0, ∞) → R and that a, b ∈ R with a < b.
1. If I ⊂ [0, ∞) is finite, define t^I_0(x) = 0 and then recursively define
s^I_{k+1}(x) = inf{t ∈ I : t ≥ t^I_k(x), xt ≤ a}, k ∈ N
t^I_{k+1}(x) = inf{t ∈ I : t ≥ s^I_{k+1}(x), xt ≥ b}, k ∈ N
The number of up-crossings of the interval [a, b] by the function x restricted to I is
u_I(a, b, x) = sup{k ∈ N : t^I_k(x) < ∞} (17.4.39)
2. If I ⊆ [0, ∞) is infinite, the number of up-crossings of the interval [a, b] by x restricted to I is
u_I(a, b, x) = sup{u_J(a, b, x) : J is finite and J ⊂ I} (17.4.40)

To simplify the notation, we will let ut(a, b, x) = u_{[0,t]}(a, b, x), the number of up-crossings of [a, b] by x on [0, t], and u∞(a, b, x) = u_{[0,∞)}(a, b, x), the total number of up-crossings of [a, b] by x. In continuous time, the definition of up-crossings is built out of finite subsets of [0, ∞) for measurability concerns, which arise when we replace the deterministic function x with a stochastic process X. Here are the simple properties that are analogous to our previous ones.

Suppose again that x : [0, ∞) → R and that a, b ∈ R with a < b.
1. If I, J ⊆ [0, ∞) with I ⊆ J, then u_I(a, b, x) ≤ u_J(a, b, x).
2. If (In : n ∈ N) is an increasing sequence of sets in [0, ∞) and J = ⋃_{n=0}^{∞} In then u_{In}(a, b, x) → u_J(a, b, x) as n → ∞.
3. If c, d ∈ R with a < c < d < b and I ⊂ [0, ∞) then u_I(c, d, x) ≥ u_I(a, b, x).
Proof

The following result is the reason for studying up-crossings in the first place. Note that the definition built from finite sets is sufficient.

Suppose that x : [0, ∞) → R. Then lim_{t→∞} xt exists in the extended real numbers if and only if u∞(a, b, x) < ∞ for every a, b ∈ Q with a < b.
Proof

Finally, here is the up-crossing inequality for martingales in continuous time. Once again, the inequality gives a bound on the
expected number of up-crossings.

Suppose that X = {Xt : t ∈ [0, ∞)} satisfies the basic assumptions with respect to the filtration F = {Ft : t ∈ [0, ∞)}, and let a, b ∈ R with a < b. Let Ut = ut(a, b, X), the random number of up-crossings of [a, b] by X up to time t ∈ [0, ∞).
1. If X is a super-martingale relative to F then
E(Ut) ≤ [1/(b − a)] E[(Xt − a)−] ≤ [1/(b − a)] [E(Xt−) + |a|] ≤ [1/(b − a)] [E(|Xt|) + |a|], t ∈ [0, ∞) (17.4.42)
2. If X is a sub-martingale relative to F then
E(Ut) ≤ [1/(b − a)] E[(Xt − a)+] ≤ [1/(b − a)] [E(Xt+) + |a|] ≤ [1/(b − a)] [E(|Xt|) + |a|], t ∈ [0, ∞) (17.4.43)

Proof

Examples and Applications


Kolmogorov's Inequality
Suppose that X = {Xn : n ∈ N+} is a sequence of independent variables with E(Xn) = 0 and var(Xn) = E(Xn²) < ∞ for n ∈ N+. Let Y = {Yn : n ∈ N} be the partial sum process associated with X, so that
Yn = ∑_{i=1}^{n} Xi, n ∈ N (17.4.46)

From the Introduction we know that Y is a martingale. A simple application of the maximal inequality gives the following result,
which is known as Kolmogorov's inequality, named for Andrei Kolmogorov.

For n ∈ N+, let Un = max{|Yi| : i ∈ {1, 2, …, n}}. Then
P(Un ≥ x) ≤ (1/x²) var(Yn) = (1/x²) ∑_{i=1}^{n} E(Xi²), x ∈ (0, ∞) (17.4.47)

Proof
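Here is a small Python sketch (an illustration only, with an arbitrary choice of step distribution) comparing the empirical value of P(Un ≥ x) with the Kolmogorov bound var(Yn)/x².

```python
import random

random.seed(5)

def max_abs_partial_sum(n):
    """One sample of U_n = max |Y_i| for Y the partial sums of +/-1 steps."""
    y, u = 0, 0
    for _ in range(n):
        y += random.choice((-1, 1))
        u = max(u, abs(y))
    return u

n, x, reps = 100, 25, 20000
hits = sum(max_abs_partial_sum(n) >= x for _ in range(reps))
print("P(U_n >= x) ~", hits / reps)
print("Kolmogorov bound var(Y_n) / x**2 =", n / x**2)  # var(Y_n) = n here
```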

Red and Black
In the game of red and black, a gambler plays a sequence of Bernoulli games with success parameter p ∈ (0, 1) at even stakes. The gambler starts with an initial fortune x and plays until either she is ruined or reaches a specified target fortune a, where x, a ∈ (0, ∞) with x < a. When p ≤ 1/2, so that the games are fair or unfair, an optimal strategy is bold play: on each game, the gambler bets her entire fortune or just what is needed to reach the target, whichever is smaller. In the section on bold play we showed that when p = 1/2, so that the games are fair, the probability of winning (that is, reaching the target a starting with x) is x/a. We can use the maximal inequality for super-martingales to show that indeed, one cannot do better.

To set up the notation and review various concepts, let X0 denote the gambler's initial fortune and let Xn denote the outcome of game n ∈ N+, where 1 denotes a win and −1 a loss. So {Xn : n ∈ N+} is a sequence of independent variables with P(Xn = 1) = p and P(Xn = −1) = 1 − p for n ∈ N+. (The initial fortune X0 has an unspecified distribution on (0, ∞).) The gambler is at a casino after all, so of course p ≤ 1/2. Let
Yn = ∑_{i=0}^{n} Xi, n ∈ N (17.4.50)
so that Y = {Yn : n ∈ N} is the partial sum process associated with X = {Xn : n ∈ N}. Recall that Y is also known as the simple random walk with parameter p, and since p ≤ 1/2, is a super-martingale. The process {Xn : n ∈ N+} is the difference sequence associated with Y. Next let Zn denote the amount that the gambler bets on game n ∈ N+. The process Z = {Zn : n ∈ N+} is predictable with respect to X = {Xn : n ∈ N}, so that Zn is measurable with respect to σ{X0, X1, …, Xn−1} for n ∈ N+. So the gambler's fortune after n games is
Wn = X0 + ∑_{i=1}^{n} Zi Xi = X0 + ∑_{i=1}^{n} Zi (Yi − Yi−1) (17.4.51)

Recall that W = {Wn : n ∈ N} is the transform of Z with Y, denoted W = Z ⋅ Y. The gambler is not allowed to go into debt, and so we must have Zn ≤ Wn−1 for n ∈ N+: the gambler's bet on game n cannot exceed her fortune after game n − 1. What's the probability that the gambler can ever reach or exceed the target a starting with fortune x < a?

Let U∞ = sup{Wn : n ∈ N}. Suppose that x, a ∈ (0, ∞) with x < a and that X0 = x. Then
P(U∞ ≥ a) ≤ x/a (17.4.52)

Proof

Note that the only assumptions made on the gambler's sequence of bets Z are that the sequence is predictable, so that the gambler cannot see into the future, and that the gambler cannot go into debt. Under these basic assumptions, no strategy can do any better than bold play. However, there are strategies that do as well as bold play; these are variations on bold play.
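For comparison, here is a Python sketch of bold play itself (an illustration, not part of the text): with p = 1/2, the estimated probability of reaching the target a from fortune x should be close to the bound x/a.

```python
import random

random.seed(6)

def bold_play(x, a, p=0.5):
    """Play boldly until ruin or until the target a is reached."""
    while 0 < x < a:
        bet = min(x, a - x)  # entire fortune, or just enough to reach a
        if random.random() < p:
            x += bet
        else:
            x -= bet
    return x >= a

x, a, reps = 3.0, 8.0, 100000
wins = sum(bold_play(x, a) for _ in range(reps))
print("P(reach target) ~", wins / reps, "; bound x/a =", x / a)
```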

Open the simulation of the red and black game. Select bold play and p = 1/2. Play the game with various values of the initial and target fortunes.

This page titled 17.4: Inequalities is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

17.5: Convergence

Basic Theory
Basic Assumptions
As in the Introduction, we start with a stochastic process X = {Xt : t ∈ T} on an underlying probability space (Ω, F, P), having state space R, and where the index set T (representing time) is either N (discrete time) or [0, ∞) (continuous time). Next, we have a filtration F = {Ft : t ∈ T}, and we assume that X is adapted to F. So F is an increasing family of sub σ-algebras of F and Xt is measurable with respect to Ft for t ∈ T. We think of Ft as the collection of events up to time t ∈ T. We assume that E(|Xt|) < ∞, so that the mean of Xt exists as a real number, for each t ∈ T. Finally, in continuous time where T = [0, ∞), we need the additional assumptions that t ↦ Xt is right continuous and has left limits, and that the filtration F is standard (that is, right continuous and complete). Recall also that F∞ = σ(⋃_{t∈T} Ft), and this is the σ-algebra that encodes our information over all time.

The Martingale Convergence Theorems


If X is a sub-martingale relative to F then X has an increasing property of sorts: E(Xt ∣ Fs) ≥ Xs for s, t ∈ T with s ≤ t. Similarly, if X is a super-martingale relative to F then X has a decreasing property of sorts, since the last inequality is reversed. Thus, there is hope that if this increasing or decreasing property is coupled with an appropriate boundedness property, then the sub-martingale or super-martingale might converge, in some sense, as t → ∞. This is indeed the case, and is the subject of this section. The martingale convergence theorems, first formulated by Joseph Doob, are among the most important results in the theory of martingales. The first martingale convergence theorem states that if the expected absolute value is bounded in time, then the martingale process converges with probability 1.

Suppose that X = {Xt : t ∈ T} is a sub-martingale or a super-martingale with respect to F = {Ft : t ∈ T} and that E(|Xt|) is bounded in t ∈ T. Then there exists a random variable X∞ that is measurable with respect to F∞ such that E(|X∞|) < ∞ and Xt → X∞ as t → ∞ with probability 1.

Proof

The boundedness condition means that X is bounded (in norm) as a subset of the vector space L1 . Here is a very simple, but
useful corollary:

If X = {Xt : t ∈ T} is a nonnegative super-martingale with respect to F = {Ft : t ∈ T} then there exists a random variable X∞, measurable with respect to F∞, such that Xt → X∞ with probability 1.

Proof

Of course, the corollary applies to a nonnegative martingale as a special case. For the second martingale convergence theorem you will need to review uniformly integrable variables. Recall also that for k ∈ [1, ∞), the k-norm of a random variable X is
∥X∥k = [E(|X|^k)]^{1/k} (17.5.4)
and Lk is the normed vector space of all real-valued random variables for which this norm is finite. Convergence in mean refers to convergence in L1, and more generally, convergence in kth mean refers to convergence in Lk.

Suppose that X is uniformly integrable and is a sub-martingale or super-martingale with respect to F. Then there exists a random variable X∞, measurable with respect to F∞, such that Xt → X∞ as t → ∞ with probability 1 and in mean. Moreover, if X is a martingale with respect to F then Xt = E(X∞ ∣ Ft) for t ∈ T.

Proof

As a simple corollary, recall that if ∥Xt∥k is bounded in t ∈ T for some k ∈ (1, ∞) then X is uniformly integrable, and hence the second martingale convergence theorem applies. But we can do better.
second martingale convergence theorem applies. But we can do better.

Suppose again that X = {Xt : t ∈ T} is a sub-martingale or super-martingale with respect to F = {Ft : t ∈ T} and that ∥Xt∥k is bounded in t ∈ T for some k ∈ (1, ∞). Then there exists a random variable X∞ ∈ Lk such that Xt → X∞ as t → ∞ in Lk.

Proof

Examples and Applications


In this subsection, we consider a number of applications of the martingale convergence theorems. One indication of the importance
of martingale theory is the fact that many of the classical theorems of probability have simple and elegant proofs when formulated
in terms of martingales.

Simple Random Walk


Suppose now that V = {Vn : n ∈ N} is a sequence of independent random variables with P(Vi = 1) = p and P(Vi = −1) = 1 − p for i ∈ N+, where p ∈ (0, 1). Let X = {Xn : n ∈ N} be the partial sum process associated with V so that
Xn = ∑_{i=0}^{n} Vi, n ∈ N (17.5.8)

Recall that X is the simple random walk with parameter p. From our study of Markov chains, we know that if p > 1/2 then Xn → ∞ as n → ∞, and if p < 1/2 then Xn → −∞ as n → ∞. The chain is transient in these two cases. If p = 1/2, the chain is (null) recurrent and so visits every state in Z infinitely often. In this case Xn does not converge as n → ∞. But of course E(Xn) = n(2p − 1) for n ∈ N, so the martingale convergence theorems do not apply.


n

Doob's Martingale
Recall that if X is a random variable with E(|X|) < ∞ and we define Xt = E(X ∣ Ft) for t ∈ T, then X = {Xt : t ∈ T} is a martingale relative to F, and is known as a Doob martingale, named for you know whom. So the second martingale convergence theorem states that every uniformly integrable martingale is a Doob martingale. Moreover, we know that the Doob martingale X constructed from X and F is uniformly integrable, so the second martingale convergence theorem applies. The last remaining question is the relationship between X and the limiting random variable X∞. The answer may come as no surprise.

Let X = {Xt : t ∈ T} be the Doob martingale constructed from X and F. Then Xt → X∞ as t → ∞ with probability 1 and in mean, where
X∞ = E(X ∣ F∞) (17.5.9)

Of course if F∞ = F, which is quite possible, then X∞ = X. At the other extreme, if Ft = {∅, Ω}, the trivial σ-algebra, for all t ∈ T, then X∞ = E(X), a constant.

Kolmogorov Zero-One Law


Suppose that X = (Xn : n ∈ N+) is a sequence of random variables with values in a general state space (S, S). Let Gn = σ{Xk : k ≥ n} for n ∈ N+, and let G∞ = ⋂_{n=1}^{∞} Gn. So G∞ is the tail σ-algebra of X, the collection of events that depend only on the terms of the sequence with arbitrarily large indices. For example, if the sequence is real-valued (or more generally takes values in a metric space), then the event that Xn has a limit as n → ∞ is a tail event. If B ∈ S, then the event that Xn ∈ B for infinitely many n ∈ N+ is another tail event. The Kolmogorov zero-one law, named for Andrei Kolmogorov, states that if X is an independent sequence, then the tail events are essentially deterministic.

Suppose that X is a sequence of independent random variables. If A ∈ G∞ then P(A) = 0 or P(A) = 1.


Proof

Tail events and the Kolmogorov zero-one law were studied earlier in the section on measure in the chapter on probability spaces. A random variable that is measurable with respect to G∞ is a tail random variable. From the Kolmogorov zero-one law, a real-valued tail random variable for an independent sequence must be a constant (with probability 1).
tail random variable for an independent sequence must be a constant (with probability 1).

Branching Processes
Recall the discussion of the simple branching process from the Introduction. The fundamental assumption is that the particles act
independently, each with the same offspring distribution on N. As before, we will let f denote the (discrete) probability density
function of the number of offspring of a particle, m the mean of the distribution, and q the probability of extinction starting with a single particle. We assume that f(0) > 0 and f(0) + f(1) < 1, so that a particle has a positive probability of dying without children and a positive probability of producing more than 1 child.
The stochastic process of interest is X = {Xn : n ∈ N} where Xn is the number of particles in the nth generation for n ∈ N. Recall that X is a discrete-time Markov chain on N. Since 0 is an absorbing state, and all positive states lead to 0, we know that the positive states are transient and so are visited only finitely often with probability 1. It follows that either Xn → 0 as n → ∞ (extinction) or Xn → ∞ as n → ∞ (explosion). We have quite a bit of information about which of these events will occur from our study of Markov chains, but the martingale convergence theorems give more information.

Extinction and explosion


1. If m ≤ 1 then q = 1 and extinction is certain.
2. If m > 1 then q ∈ (0, 1). Either Xn → 0 as n → ∞ or Xn → ∞ as n → ∞ at an exponential rate.
Proof

The Beta-Bernoulli Process


Recall that the beta-Bernoulli process is constructed by randomizing the success parameter in a Bernoulli trials process with a beta distribution. Specifically, we start with a random variable P having the beta distribution with parameters a, b ∈ (0, ∞). Next we have a sequence X = (X1, X2, …) of indicator variables with the property that X is conditionally independent given P = p ∈ (0, 1), with P(Xi = 1 ∣ P = p) = p for i ∈ N+. Let Y = {Yn : n ∈ N} denote the partial sum process associated with X, so that once again, Yn = ∑_{i=1}^{n} Xi for n ∈ N. Next let Mn = Yn/n for n ∈ N+, so that Mn is the sample mean of (X1, X2, …, Xn). Finally let
Zn = (a + Yn)/(a + b + n), n ∈ N (17.5.10)
We showed in the Introduction that Z = {Zn : n ∈ N} is a martingale with respect to X.

Mn → P and Zn → P as n → ∞ with probability 1 and in mean.


Proof

This is a very nice result, and is reminiscent of the fact that for the ordinary Bernoulli trials sequence with success parameter p ∈ (0, 1) we have the law of large numbers: Mn → p as n → ∞ with probability 1 and in mean.

Pólya's Urn Process


Recall that in the simplest version of Pólya's urn process, we start with an urn containing a red and b green balls. At each discrete time step, we select a ball at random from the urn and then replace the ball and add c new balls of the same color to the urn. For the parameters, we need a, b ∈ N+ and c ∈ N. For i ∈ N+, let Xi denote the color of the ball selected on the ith draw, where 1 means red and 0 means green. For n ∈ N, let Yn = ∑_{i=1}^{n} Xi, so that Y = {Yn : n ∈ N} is the partial sum process associated with X = {Xi : i ∈ N+}. Since Yn is the number of red balls selected in the first n draws, the proportion of red balls selected up to time n ∈ N+ is Mn = Yn/n. On the other hand, the total number of balls in the urn at time n ∈ N is a + b + cn, so the proportion of red balls in the urn at time n is
Zn = (a + cYn)/(a + b + cn) (17.5.11)

We showed in the Introduction that Z = {Zn : n ∈ N} is a martingale. Now we are interested in the limiting behavior of Mn and Zn as n → ∞. When c = 0, the answer is easy. In this case, Yn has the binomial distribution with trial parameter n and success parameter a/(a + b), so by the law of large numbers, Mn → a/(a + b) as n → ∞ with probability 1 and in mean. On the other hand, Zn = a/(a + b) when c = 0. So the interesting case is when c > 0.


n

Suppose that c ∈ N+. Then there exists a random variable P such that Mn → P and Zn → P as n → ∞ with probability 1 and in mean. Moreover, P has the beta distribution with left parameter a/c and right parameter b/c.
Proof
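A simulation makes the limit visible. The Python sketch below (an illustration, not part of the text) runs many copies of the urn and compares the empirical mean and variance of Zn with those of the beta distribution with parameters a/c and b/c.

```python
import random

random.seed(7)

def polya_Z(a, b, c, n):
    """Run Polya's urn for n draws; return Z_n, the proportion of red balls."""
    red, total = a, a + b
    for _ in range(n):
        if random.random() < red / total:
            red += c
        total += c
    return red / total

a, b, c, n, reps = 1, 2, 1, 500, 5000
zs = [polya_Z(a, b, c, n) for _ in range(reps)]
mean = sum(zs) / reps
var = sum((z - mean) ** 2 for z in zs) / reps
# Limit P ~ Beta(a/c, b/c) = Beta(1, 2): mean 1/3, variance 1/18
print("Z_n mean ~", mean, "(beta mean 1/3)")
print("Z_n variance ~", var, "(beta variance 1/18)")
```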

Likelihood Ratio Tests
Recall the discussion of likelihood ratio tests in the Introduction. To review, suppose that (S, S, μ) is a general measure space, and that X = {Xn : n ∈ N} is a sequence of independent, identically distributed random variables, taking values in S, and having a common probability density function with respect to μ. The likelihood ratio test is a hypothesis test, where the null and alternative hypotheses are
H0: the probability density function is g0.
H1: the probability density function is g1.
We assume that g0 and g1 are positive on S. Also, it makes no sense for g0 and g1 to be the same, so we assume that g0 ≠ g1 on a set of positive measure. The test is based on the likelihood ratio test statistic
Ln = ∏_{i=1}^{n} g0(Xi)/g1(Xi), n ∈ N (17.5.12)
We showed that under the alternative hypothesis H1, L = {Ln : n ∈ N} is a martingale with respect to X, known as the likelihood ratio martingale.

Under H1, Ln → 0 as n → ∞ with probability 1.


Proof

This result is good news, statistically speaking. Small values of Ln are evidence in favor of H1, so the decision rule is to reject H0 in favor of H1 if Ln ≤ l for a chosen critical value l ∈ (0, ∞). If H1 is true and the sample size n is sufficiently large, we will reject H0. In the proof, note that ln(Ln) must diverge to −∞ at least as fast as n diverges to ∞. Hence Ln → 0 as n → ∞ exponentially fast, at least. It is also worth noting that L is a mean 1 martingale (under H1), so trivially E(Ln) → 1 as n → ∞ even though Ln → 0 as n → ∞ with probability 1. So the likelihood ratio martingale is a good example of a sequence where the interchange of limit and expected value is not valid.
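The dichotomy (Ln → 0 pathwise while E(Ln) = 1 for every n) is easy to see numerically. The Python sketch below is an illustration only, with hypothetical choices g0 = N(0, 1) and g1 = N(1, 1), sampling under H1.

```python
import math
import random

random.seed(8)

def g(x, m):
    """Normal(m, 1) probability density function."""
    return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2 * math.pi)

def L(n):
    """Likelihood ratio L_n = prod g0(X_i)/g1(X_i), with X_i sampled under H1."""
    prod = 1.0
    for _ in range(n):
        x = random.gauss(1.0, 1.0)
        prod *= g(x, 0.0) / g(x, 1.0)
    return prod

for n in (1, 5, 20):
    samples = sorted(L(n) for _ in range(10000))
    median, mean = samples[5000], sum(samples) / 10000
    print(n, "median ~", median, "mean ~", mean)
# The median collapses toward 0 as n grows, while E(L_n) = 1 for every n.
# (For large n the sample mean is itself an unreliable estimate of E(L_n):
# the mean is carried by rare, enormous values of L_n, a heavy-tail symptom.)
```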

Partial Products
Suppose that X = {Xn : n ∈ N+} is an independent sequence of nonnegative random variables with E(Xn) = 1 for n ∈ N+. Let
Yn = ∏_{i=1}^{n} Xi, n ∈ N (17.5.15)
so that Y = {Yn : n ∈ N} is the partial product process associated with X. From our discussion of this process in the Introduction, we know that Y is a martingale with respect to X. Since Y is nonnegative, the corollary of the first martingale convergence theorem applies, so there exists a random variable Y∞ such that Yn → Y∞ as n → ∞ with probability 1. What more can we say? The following result, known as the Kakutani product martingale theorem, is due to Shizuo Kakutani.

Let an = E(√Xn) for n ∈ N+ and let A = ∏_{i=1}^{∞} ai.
1. If A > 0 then Yn → Y∞ as n → ∞ in mean and E(Y∞) = 1.
2. If A = 0 then Y∞ = 0 with probability 1.
Proof

Density Functions
This discussion continues the one on density functions in the Introduction. To review, we start with our probability space (Ω, F, P) and a filtration F = {Fn : n ∈ N} in discrete time. Recall again that F∞ = σ(⋃_{n=0}^{∞} Fn). Suppose now that μ is a finite measure on the sample space (Ω, F). For each n ∈ N ∪ {∞}, the restriction of μ to Fn is a measure on (Ω, Fn), and similarly the restriction of P to Fn is a probability measure on (Ω, Fn). To save notation and terminology, we will refer to these as μ and P on Fn, respectively. Suppose now that μ is absolutely continuous with respect to P on Fn for each n ∈ N. By the Radon-Nikodym theorem, μ has a density function (or Radon-Nikodym derivative) Xn : Ω → R with respect to P on Fn for each n ∈ N. The theorem and the derivative are named for Johann Radon and Otto Nikodym. In the Introduction we showed that X = {Xn : n ∈ N} is a martingale with respect to F. Here is the convergence result:

There exists a random variable X∞ such that Xn → X∞ as n → ∞ with probability 1.
1. If μ is absolutely continuous with respect to P on F∞ then X∞ is a density function of μ with respect to P on F∞.
2. If μ and P are mutually singular on F∞ then X∞ = 0 with probability 1.

Proof

The martingale approach can be used to give a probabilistic proof of the Radon-Nikodym theorem, at least in certain cases. We start with a sample set Ω. Suppose that An = {A^n_i : i ∈ In} is a countable partition of Ω for each n ∈ N. Thus In is countable, A^n_i ∩ A^n_j = ∅ for distinct i, j ∈ In, and ⋃_{i∈In} A^n_i = Ω. Suppose also that An+1 refines An for each n ∈ N, in the sense that A^n_i is a union of sets in An+1 for each i ∈ In. Let Fn = σ(An). Thus Fn is generated by a countable partition, and so the sets in Fn are of the form ⋃_{j∈J} A^n_j where J ⊆ In. Moreover, by the refinement property, Fn ⊆ Fn+1 for n ∈ N, so that F = {Fn : n ∈ N} is a filtration. Let F = F∞ = σ(⋃_{n=0}^{∞} Fn) = σ(⋃_{n=0}^{∞} An), so that our sample space is (Ω, F). Finally, suppose that P is a probability measure on (Ω, F) with the property that P(A^n_i) > 0 for n ∈ N and i ∈ In. We now have a probability space (Ω, F, P). Interesting probability spaces that occur in applications are of this form, so the setting is not as specialized as you might think.
Suppose now that μ is a finite measure on (Ω, F). From our assumptions, the only null set for P on Fn is ∅, so μ is automatically absolutely continuous with respect to P on Fn. Moreover, for n ∈ N, we can give the density function of μ with respect to P on Fn explicitly:

The density function of μ with respect to P on Fn is the random variable Xn whose value on A^n_i is μ(A^n_i)/P(A^n_i) for each i ∈ In. Equivalently,
Xn = ∑_{i∈In} [μ(A^n_i)/P(A^n_i)] 1(A^n_i) (17.5.22)

Proof

By our theorem above, there exists a random variable X∞ such that Xn → X∞ as n → ∞ with probability 1. If μ is absolutely continuous with respect to P on F, then X∞ is a density function of μ with respect to P on F. The point is that we have given a more or less explicit construction of the density.
For a concrete example, consider Ω = [0, 1). For n ∈ N, let
An = {[j/2^n, (j + 1)/2^n) : j ∈ {0, 1, …, 2^n − 1}} (17.5.24)
This is the partition of [0, 1) into 2^n subintervals of equal length 1/2^n, based on the dyadic rationals (or binary rationals) of rank n or less. Note that every interval in An is the union of two adjacent intervals in An+1, so the refinement property holds. Let P be ordinary Lebesgue measure on [0, 1), so that P(A^n_i) = 1/2^n for n ∈ N and i ∈ {0, 1, …, 2^n − 1}. As above, let Fn = σ(An) and F = σ(⋃_{n=0}^{∞} Fn) = σ(⋃_{n=0}^{∞} An). The dyadic rationals are dense in [0, 1), so F is the ordinary Borel σ-algebra on [0, 1). Thus our probability space (Ω, F, P) is simply [0, 1) with the usual Euclidean structures. If μ is a finite measure on ([0, 1), F) then the density function of μ on Fn is the random variable Xn whose value on the interval [j/2^n, (j + 1)/2^n) is 2^n μ[j/2^n, (j + 1)/2^n). If μ is absolutely continuous with respect to P on F (so absolutely continuous in the usual sense), then a density function of μ is X∞ = lim_{n→∞} Xn.
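As a numerical illustration (not part of the text), take the hypothetical measure μ(A) = ∫_A 2x dx on [0, 1). Then Xn on [j/2^n, (j + 1)/2^n) equals 2^n times the μ-measure of that interval, which works out to (2j + 1)/2^n and converges to the density 2x. The Python sketch below evaluates the approximation.

```python
def mu(s, t):
    """mu[s, t) = integral of 2x over [s, t) = t**2 - s**2."""
    return t ** 2 - s ** 2

def X_n(n, x):
    """Dyadic density approximation: 2^n * mu(rank-n dyadic interval containing x)."""
    j = int(x * 2 ** n)  # index of the dyadic interval containing x
    return 2 ** n * mu(j / 2 ** n, (j + 1) / 2 ** n)

x = 0.3  # the density of mu at x is 2x = 0.6
for n in (1, 4, 8, 16):
    print(n, X_n(n, x))  # converges to 0.6
```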

This page titled 17.5: Convergence is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

17.6: Backwards Martingales

Basic Theory
A backwards martingale is a stochastic process that satisfies the martingale property reversed in time, in a certain sense. In some
ways, backward martingales are simpler than their forward counterparts, and in particular, satisfy a convergence theorem similar to
the convergence theorem for ordinary martingales. The importance of backward martingales stems from their numerous
applications. In particular, some of the fundamental theorems of classical probability can be formulated in terms of backward
martingales.

Definitions
As usual, we start with a stochastic process Y = {Y_t : t ∈ T} on an underlying probability space (Ω, F, P), having state space ℝ,
and where the index set T (representing time) is either ℕ (discrete time) or [0, ∞) (continuous time). So to review what all this
means, Ω is the sample space, F the σ-algebra of events, P the probability measure on (Ω, F), and Y_t is a random variable with
values in ℝ for each t ∈ T. But at this point our formulation diverges. Suppose that G_t is a sub σ-algebra of F for each t ∈ T, and
that G = {G_t : t ∈ T} is decreasing, so that if s, t ∈ T with s ≤ t then G_t ⊆ G_s. Let G_∞ = ⋂_{t ∈ T} G_t. We assume that Y_t is
measurable with respect to G_t and that E(|Y_t|) < ∞ for each t ∈ T.

The process Y = {Y_t : t ∈ T} is a backwards martingale (or reversed martingale) with respect to G = {G_t : t ∈ T} if
E(Y_s ∣ G_t) = Y_t for all s, t ∈ T with s ≤ t.

A backwards martingale can be formulated as an ordinary martingale by using negative times as the indices. Let
T⁻ = {−t : t ∈ T}, so that if T = ℕ (the discrete case) then T⁻ is the set of non-positive integers, and if T = [0, ∞) (the
continuous case) then T⁻ = (−∞, 0]. Recall also that the standard martingale definitions make sense for any totally ordered index
set.

Suppose again that Y = {Y_t : t ∈ T} is a backwards martingale with respect to G = {G_t : t ∈ T}. Let X_t = Y_{−t} and
F_t = G_{−t} for t ∈ T⁻. Then X = {X_t : t ∈ T⁻} is a martingale with respect to F = {F_t : t ∈ T⁻}.

Proof

Most authors define backwards martingales with negative indices, as above, in the first place. There are good reasons for doing so,
since some of the fundamental theorems of martingales apply immediately to backwards martingales. However, for the
applications of backwards martingales, this notation is artificial and clunky, so for the most part, we will use our original definition.
The next result is another way to view a backwards martingale as an ordinary martingale. This one preserves nonnegative time, but
introduces a finite time horizon. For t ∈ T, let T_t = {s ∈ T : s ≤ t}, a notation we have used often before.

Suppose again that Y = {Y_t : t ∈ T} is a backwards martingale with respect to G = {G_t : t ∈ T}. Fix t ∈ T and define
X^t_s = Y_{t−s} and F^t_s = G_{t−s} for s ∈ T_t. Then X^t = {X^t_s : s ∈ T_t} is a martingale relative to F^t = {F^t_s : s ∈ T_t}.

Proof

Properties
Backwards martingales satisfy a simple and important property.

Suppose that Y = {Y_t : t ∈ T} is a backwards martingale with respect to G = {G_t : t ∈ T}. Then Y_t = E(Y_0 ∣ G_t) for
t ∈ T, and hence Y is uniformly integrable.

Proof

Here is the Doob backwards martingale, analogous to the ordinary Doob martingale, and of course named for Joseph Doob. In a
sense, this is the converse to the previous result.

Suppose that Y is a random variable on our probability space (Ω, F, P) with E(|Y|) < ∞, and that G = {G_t : t ∈ T} is a
decreasing family of sub σ-algebras of F, as above. Let Y_t = E(Y ∣ G_t) for t ∈ T. Then Y = {Y_t : t ∈ T} is a backwards
martingale with respect to G.

Proof

The convergence theorems are the most important results for the applications of backwards martingales. Recall once again that for
k ∈ [1, ∞), the k-norm of a real-valued random variable X is

‖X‖_k = [E(|X|^k)]^{1/k}   (17.6.5)

and the normed vector space L_k consists of all X with ‖X‖_k < ∞. Convergence in the space L_1 is also referred to as
convergence in mean, and convergence in the space L_2 is referred to as convergence in mean square. Here is the primary
backwards martingale convergence theorem:

Suppose again that Y = {Y_t : t ∈ T} is a backwards martingale with respect to G = {G_t : t ∈ T}. Then there exists a
random variable Y_∞ such that

1. Y_t → Y_∞ as t → ∞ with probability 1.
2. Y_t → Y_∞ as t → ∞ in mean.
3. Y_∞ = E(Y_0 ∣ G_∞).

Proof

As a simple extension of the last result, if Y_0 ∈ L_k for some k ∈ [1, ∞) then the convergence is in L_k also.

Suppose again that Y = {Y_t : t ∈ T} is a backwards martingale relative to G = {G_t : t ∈ T}. If Y_0 ∈ L_k for some
k ∈ [1, ∞) then Y_t → Y_∞ as t → ∞ in L_k.

Proof

Applications
The Strong Law of Large Numbers
The strong law of large numbers is one of the fundamental theorems of classical probability. Our previous proof required that the
underlying distribution have finite variance. Here we present an elegant proof using backwards martingales that does not require
this extra assumption. So, suppose that X = {X_n : n ∈ ℕ₊} is a sequence of independent, identically distributed random
variables with common mean μ ∈ ℝ. In statistical terms, X corresponds to sampling from the underlying distribution. Next let

Y_n = ∑_{i=1}^n X_i,  n ∈ ℕ   (17.6.12)

so that Y = {Y_n : n ∈ ℕ} is the partial sum process associated with X. Recall that the sequence Y is also a discrete-time random
walk. Finally, let M_n = Y_n/n for n ∈ ℕ₊, so that M = {M_n : n ∈ ℕ₊} is the sequence of sample means.

The law of large numbers

1. M_n → μ as n → ∞ with probability 1.
2. M_n → μ as n → ∞ in mean.

Proof
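As a quick numerical illustration of the law of large numbers (a minimal Python sketch, not part of the text; the exponential sampling distribution, the seed, and the sample sizes are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(17)
mu = 2.0                                     # common mean (exponential, scale 2)
x = rng.exponential(mu, size=1_000_000)      # X_1, X_2, ..., X_N
m = np.cumsum(x) / np.arange(1, x.size + 1)  # sample means M_n = Y_n / n

for n in [10, 1_000, 100_000, 1_000_000]:
    print(n, m[n - 1])                       # M_n settles down near mu = 2
```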

Exchangeable Variables
We start with a probability space (Ω, F, P) and another measurable space (S, 𝒮). Suppose that X = (X_1, X_2, …) is a sequence
of random variables each taking values in S. Recall that X is exchangeable if for every n ∈ ℕ₊, every permutation of
(X_1, X_2, …, X_n) has the same distribution on (S^n, 𝒮^n) (where 𝒮^n is the n-fold product σ-algebra). Clearly if X is a sequence
of independent, identically distributed variables, then X is exchangeable. Conversely, if X is exchangeable then the variables are
identically distributed (by definition), but are not necessarily independent. The most famous example of a sequence that is
exchangeable but not independent is Pólya's urn process, named for George Pólya. On the other hand, conditionally independent
and identically distributed sequences are exchangeable. Thus suppose that (T, 𝒯) is another measurable space and that Θ is a
random variable taking values in T.

If X is conditionally independent and identically distributed given Θ, then X is exchangeable.

Proof

Often the setting of this theorem arises when we start with a sequence of independent, identically distributed random variables that
are governed by a parametric distribution, and then randomize one of the parameters. In a sense, we can always think of the setting
in this way: imagine that θ ∈ T is a parameter for a distribution on S. A special case is the beta-Bernoulli process, in which the
success parameter p in a sequence of Bernoulli trials is randomized with the beta distribution. On the other hand, Pólya's urn process
is an example of an exchangeable sequence that does not at first seem to have anything to do with randomizing parameters. But in
fact, we know that Pólya's urn process is a special case of the beta-Bernoulli process. This connection gives a hint of de Finetti's
theorem, named for Bruno de Finetti, which we consider next. This theorem states that any exchangeable sequence of indicator random
variables corresponds to randomizing the success parameter in a sequence of Bernoulli trials.

de Finetti's Theorem. Suppose that X = (X_1, X_2, …) is an exchangeable sequence of random variables, each taking values
in {0, 1}. Then there exists a random variable P with values in [0, 1], such that given P = p ∈ [0, 1], X is a sequence of
Bernoulli trials with success parameter p.

Proof
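The beta-Bernoulli process mentioned above makes de Finetti's theorem concrete and is easy to simulate. In the sketch below (NumPy; the beta parameters, the seed, and the sample size are arbitrary assumptions), the sample mean of each exchangeable sequence settles near its own hidden value of P, rather than near a fixed constant:

```python
import numpy as np

rng = np.random.default_rng(6)
a, b = 2.0, 3.0                            # beta prior parameters (arbitrary)

for _ in range(3):
    p = rng.beta(a, b)                     # hidden random success parameter P
    x = rng.binomial(1, p, size=50_000)    # Bernoulli trials given P = p
    print(round(p, 4), x.mean())           # sample mean M_n is close to P
```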

De Finetti's theorem has been extended to much more general sequences of exchangeable variables. Basically, if
X = (X_1, X_2, …) is an exchangeable sequence of random variables, each taking values in a sufficiently nice measurable space
(S, 𝒮), then there exists a random variable Θ such that X is independent and identically distributed given Θ. In the proof, the
result that M_n → P as n → ∞ with probability 1, where M_n = (1/n) ∑_{i=1}^n X_i, is known as de Finetti's strong law of large numbers.
De Finetti's theorem and its generalizations are important in Bayesian statistical inference. For an exchangeable sequence of
random variables (our observations in a statistical experiment), there is a hidden, random parameter Θ. Given Θ = θ, the variables
are independent and identically distributed. We gain information about Θ by imposing a prior distribution on Θ and then updating
this, based on our observations and using Bayes' theorem, to a posterior distribution.
Stated more in terms of distributions, de Finetti's theorem states that the distribution of n distinct variables in the exchangeable
sequence is a mixture of product measures. That is, if μ_θ is the distribution of a generic X on (S, 𝒮) given Θ = θ, and ν is the
distribution of Θ on (T, 𝒯), then the distribution of n of the variables on (S^n, 𝒮^n) is

B ↦ ∫_T μ_θ^n(B) dν(θ)   (17.6.28)

This page titled 17.6: Backwards Martingales is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
CHAPTER OVERVIEW
18: Brownian Motion
Brownian motion is a stochastic process of great theoretical importance, and as the basic building block of a variety of other
processes, of great practical importance as well. In this chapter we study Brownian motion and a number of random processes that
can be constructed from Brownian motion. We also study the Ito stochastic integral and the resulting calculus, as well as two
remarkable representation theorems involving stochastic integrals.
18.1: Standard Brownian Motion
18.2: Brownian Motion with Drift and Scaling
18.3: The Brownian Bridge
18.4: Geometric Brownian Motion

This page titled 18: Brownian Motion is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

18.1: Standard Brownian Motion

Basic Theory
History
In 1827, the botanist Robert Brown noticed that tiny particles from pollen, when suspended in water, exhibited continuous but very
jittery and erratic motion. In his “miracle year” in 1905, Albert Einstein explained the behavior physically, showing that the
particles were constantly being bombarded by the molecules of the water, and thus helping to firmly establish the atomic theory of
matter. Brownian motion as a mathematical random process was first constructed in a rigorous way by Norbert Wiener in a series of
papers starting in 1918. For this reason, the Brownian motion process is also known as the Wiener process.

Run the two-dimensional Brownian motion simulation several times in single-step mode to get an idea of what Mr. Brown may
have observed under his microscope.

Along with the Bernoulli trials process and the Poisson process, the Brownian motion process is of central importance in
probability. Each of these processes is based on a set of idealized assumptions that lead to a rich mathematical theory. In each case
also, the process is used as a building block for a number of related random processes that are of great importance in a variety of
applications. In particular, Brownian motion and related processes are used in applications ranging from physics to statistics to
economics.

Definition
A standard Brownian motion is a random process X = {X_t : t ∈ [0, ∞)} with state space ℝ that satisfies the following
properties:
1. X_0 = 0 (with probability 1).
2. X has stationary increments. That is, for s, t ∈ [0, ∞) with s < t, the distribution of X_t − X_s is the same as the
distribution of X_{t−s}.
3. X has independent increments. That is, for t_1, t_2, …, t_n ∈ [0, ∞) with t_1 < t_2 < ⋯ < t_n, the random variables
X_{t_1}, X_{t_2} − X_{t_1}, …, X_{t_n} − X_{t_{n−1}} are independent.
4. X_t is normally distributed with mean 0 and variance t for each t ∈ (0, ∞).
5. With probability 1, t ↦ X_t is continuous on [0, ∞).
To understand the assumptions physically, let's take them one at a time.

1. Suppose that we measure the position of a Brownian particle in one dimension, starting at an arbitrary time which we designate
as t = 0, with the initial position designated as x = 0. Then this assumption is satisfied by convention. Indeed, occasionally it's
convenient to relax this assumption and allow X_0 to have other values.
2. This is a statement of time homogeneity: the underlying dynamics (namely the jostling of the particle by the molecules of water)
do not change over time, so the distribution of the displacement of the particle in a time interval [s, t] depends only on the
length of the time interval.
3. This is an idealized assumption that would hold approximately if the time intervals are large compared to the tiny times between
collisions of the particle with the molecules.
4. This is another idealized assumption based on the central limit theorem: the position of the particle at time t is the result of a
very large number of collisions, each making a very small contribution. The fact that the mean is 0 is a statement of spatial
homogeneity: the particle is no more or less likely to be jostled to the right than to the left. Next, recall that the assumptions of
stationary, independent increments mean that var(X_t) = σ²t for some positive constant σ². By a change in time scale, we can
assume σ² = 1, although we will consider more general Brownian motions in the next section.
5. Finally, the continuity of the sample paths is an essential assumption, since we are modeling the position of a physical particle
as a function of time.
Of course, the first question we should ask is whether there exists a stochastic process satisfying the definition. Fortunately, the
answer is yes, although the proof is complicated.

There exists a probability space (Ω, F, P) and a stochastic process X = {X_t : t ∈ [0, ∞)} on this probability space
satisfying the assumptions in the definition.

Proof sketch

Run the simulation of the standard Brownian motion process a few times in single-step mode. Note the qualitative behavior of
the sample paths. Run the simulation 1000 times and compare the empirical density function and moments of X_t to the true
probability density function and moments.

Brownian Motion as a Limit of Random Walks

Clearly the underlying dynamics of the Brownian particle being knocked about by molecules suggest a random walk as a possible
model, but with tiny time steps and tiny spatial jumps. Let X = (X_0, X_1, X_2, …) be the symmetric simple random walk. Thus,
X_n = ∑_{i=1}^n U_i where U = (U_1, U_2, …) is a sequence of independent variables with P(U_i = 1) = P(U_i = −1) = 1/2 for each
i ∈ ℕ₊. Recall that E(X_n) = 0 and var(X_n) = n for n ∈ ℕ. Also, since X is the partial sum process associated with an IID
sequence, X has stationary, independent increments (but of course in discrete time). Finally, recall that by the central limit
theorem, X_n/√n converges to the standard normal distribution as n → ∞. Now, for h, d ∈ (0, ∞) the continuous time process

X_{h,d} = {d X_{⌊t/h⌋} : t ∈ [0, ∞)}   (18.1.1)

is a jump process with jumps at {0, h, 2h, …} and with jumps of size ±d. Basically we would like to let h ↓ 0 and d ↓ 0, but this
cannot be done arbitrarily. Note that E[X_{h,d}(t)] = 0 but var[X_{h,d}(t)] = d²⌊t/h⌋. Thus, by the central limit theorem, if we take
d = √h then the distribution of X_{h,d}(t) will converge to the normal distribution with mean 0 and variance t as h ↓ 0. More
generally, we might hope that all of the requirements in the definition are satisfied by the limiting process, and if so, we have a
standard Brownian motion.

Run the simulation of the random walk process for increasing values of n. In particular, run the simulation several times with
n = 100. Compare the qualitative behavior with the standard Brownian motion process. Note that the scaling of the random
walk in time and space is effectively accomplished by scaling the horizontal and vertical axes in the graph window.
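The scaling d = √h is also easy to check numerically. A minimal Python sketch (NumPy; the step size, seed, and replication count are arbitrary assumptions) approximates X_1 by a scaled random walk and verifies that the mean and variance are near 0 and 1:

```python
import numpy as np

rng = np.random.default_rng(18)
h = 1e-3                                   # time step; jump size d = sqrt(h)
n_steps = int(1.0 / h)                     # steps needed to reach t = 1
reps = 5000

steps = rng.choice([-1.0, 1.0], size=(reps, n_steps))
x1 = np.sqrt(h) * steps.sum(axis=1)        # X_{h,d}(1) = d * X_{floor(1/h)}

print(x1.mean(), x1.var())                 # near 0 and 1, as for X_1 ~ N(0, 1)
```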

Finite Dimensional Distributions

Let X = {X_t : t ∈ [0, ∞)} be a standard Brownian motion. It follows from part (d) of the definition that X_t has probability
density function f_t given by

f_t(x) = (1/√(2πt)) exp(−x²/(2t)),  x ∈ ℝ   (18.1.2)

This family of density functions determines the finite dimensional distributions of X.

If t_1, t_2, …, t_n ∈ (0, ∞) with 0 < t_1 < t_2 < ⋯ < t_n then (X_{t_1}, X_{t_2}, …, X_{t_n}) has probability density function f_{t_1, t_2, …, t_n} given
by

f_{t_1, t_2, …, t_n}(x_1, x_2, …, x_n) = f_{t_1}(x_1) f_{t_2 − t_1}(x_2 − x_1) ⋯ f_{t_n − t_{n−1}}(x_n − x_{n−1}),  (x_1, x_2, …, x_n) ∈ ℝ^n   (18.1.3)

Proof

X is a Gaussian process with mean function m(t) = 0 for t ∈ [0, ∞) and covariance function c(s, t) = min{s, t} for
s, t ∈ [0, ∞).

Proof

Recall that for a Gaussian process, the finite dimensional (multivariate normal) distributions are completely determined by the
mean function m and the covariance function c. Thus, it follows that a standard Brownian motion is characterized as a continuous
Gaussian process with the mean and covariance functions in the last theorem. Note also that

cor(X_s, X_t) = min{s, t}/√(st) = √(min{s, t}/max{s, t}),  (s, t) ∈ [0, ∞)²   (18.1.5)

We can also give the higher moments and the moment generating function for X_t.

For n ∈ ℕ and t ∈ [0, ∞),
1. E(X_t^{2n}) = 1 · 3 ⋯ (2n − 1) t^n = (2n)! t^n/(n! 2^n)
2. E(X_t^{2n+1}) = 0

Proof

For t ∈ [0, ∞), X_t has moment generating function given by

E(e^{uX_t}) = e^{tu²/2},  u ∈ ℝ   (18.1.6)

Proof

Simple Transformations
There are several simple transformations that preserve standard Brownian motion and will give us insight into some of its
properties. As usual, our starting place is a standard Brownian motion X = {X_t : t ∈ [0, ∞)}. Our first result is that reflecting the
paths of X in the line x = 0 gives another standard Brownian motion.

Let Y_t = −X_t for t ≥ 0. Then Y = {Y_t : t ≥ 0} is also a standard Brownian motion.

Proof

Our next result is related to the Markov property, which we explore in more detail below. If we “restart” Brownian motion at a
fixed time s, and shift the origin to X_s, then we have another standard Brownian motion. This means that Brownian motion is both
temporally and spatially homogeneous.

Fix s ∈ [0, ∞) and define Y_t = X_{s+t} − X_s for t ≥ 0. Then Y = {Y_t : t ∈ [0, ∞)} is also a standard Brownian motion.

Proof

Our next result is a simple time reversal, but to state this result, we need to restrict the time parameter to a bounded interval of the
form [0, T] where T > 0. The upper endpoint T is sometimes referred to as a finite time horizon. Note that {X_t : t ∈ [0, T]} still
satisfies the definition, but with the time parameters restricted to [0, T].

Define Y_t = X_{T−t} − X_T for 0 ≤ t ≤ T. Then Y = {Y_t : t ∈ [0, T]} is also a standard Brownian motion on [0, T].

Proof

Our next transformation involves scaling X both temporally and spatially, and is known as self-similarity.

Let a > 0 and define Y_t = (1/a) X_{a²t} for t ≥ 0. Then Y = {Y_t : t ∈ [0, ∞)} is also a standard Brownian motion.

Proof

Note that the graph of Y can be obtained from the graph of X by scaling the time axis t by a factor of a² and scaling the spatial
axis x by a factor of a. The fact that the temporal scale factor must be the square of the spatial scale factor is clearly related to
Brownian motion as the limit of random walks. Note also that this transformation amounts to “zooming in or out” of the graph of
X, and hence Brownian motion has a self-similar, fractal quality, since the graph is unchanged by this transformation. This also
suggests that, although continuous, t ↦ X_t is highly irregular. We consider this in the next subsection.

Our final transformation is referred to as time inversion.

Let Y_0 = 0 and Y_t = tX_{1/t} for t > 0. Then Y = {Y_t : t ∈ [0, ∞)} is also a standard Brownian motion.

Proof

Irregularity
The defining properties suggest that standard Brownian motion X = {X_t : t ∈ [0, ∞)} cannot be a smooth, differentiable
function. Consider the usual difference quotient at t,

(X_{t+h} − X_t)/h   (18.1.11)

By the stationary increments property, if h > 0 the numerator has the same distribution as X_h, while if h < 0 the numerator has
the same distribution as −X_{−h}, which in turn has the same distribution as X_{−h}. So, in both cases, the difference quotient has the
same distribution as X_{|h|}/h, and this variable has the normal distribution with mean 0 and variance |h|/h² = 1/|h|. So the
variance of the difference quotient diverges to ∞ as h → 0, and hence the difference quotient does not even converge in
distribution, the weakest form of convergence.
The temporal-spatial transformation above also suggests that Brownian motion cannot be differentiable. The intuitive meaning of
differentiable at t is that the function is locally linear at t: as we zoom in, the graph near t begins to look like a line (whose slope,
of course, is the derivative). But as we zoom in on Brownian motion (in the sense of the transformation), it always looks the same,
and in particular, just as jagged. More formally, if X is differentiable at t, then so is the transformed process Y, and the chain rule
gives Y′(t) = aX′(a²t). But Y is also a standard Brownian motion for every a > 0, so something is clearly wrong. While not
rigorous, these examples are motivation for the following theorem:

With probability 1, X is nowhere differentiable on [0, ∞).

Run the simulation of the standard Brownian motion process. Note the continuity but very jagged quality of the sample paths.
Of course, the simulation cannot really capture Brownian motion with complete fidelity.

The following theorems gives a more precise measure of the irregularity of standard Brownian motion.

Standard Brownian motion X has Hölder exponent 1/2. That is, X is Hölder continuous with exponent α for every α < 1/2, but
is not Hölder continuous with exponent α for any α > 1/2.

In particular, X is not Lipschitz continuous, and this shows again that it is not differentiable. The following result states that in
terms of Hausdorff dimension, the graph of standard Brownian motion lies midway between a simple curve (dimension 1) and the
plane (dimension 2).

The graph of standard Brownian motion has Hausdorff dimension 3/2.

Yet another indication of the irregularity of Brownian motion is that it has infinite total variation on any interval of positive length.

Suppose that a, b ∈ ℝ with a < b. Then the total variation of X on [a, b] is ∞.

The Markov Property and Stopping Times


As usual, we start with a standard Brownian motion X = {X_t : t ∈ [0, ∞)}. Recall that a Markov process has the property that
the future is independent of the past, given the present state. Because of the stationary, independent increments property, Brownian
motion has this property. As a minor note, to view X as a Markov process, we sometimes need to relax Assumption 1 and let X_0
have an arbitrary value in ℝ. Let F_t = σ{X_s : 0 ≤ s ≤ t}, the σ-algebra generated by the process up to time t ∈ [0, ∞). The
family of σ-algebras F = {F_t : t ∈ [0, ∞)} is known as a filtration.

Standard Brownian motion is a time-homogeneous Markov process with transition probability density p given by

p_t(x, y) = f_t(y − x) = (1/√(2πt)) exp[−(y − x)²/(2t)],  t ∈ (0, ∞); x, y ∈ ℝ   (18.1.12)

Proof

The transition density p satisfies the following diffusion equations. The first is known as the forward equation and the second as
the backward equation.

∂/∂t p_t(x, y) = (1/2) ∂²/∂y² p_t(x, y)   (18.1.13)

∂/∂t p_t(x, y) = (1/2) ∂²/∂x² p_t(x, y)   (18.1.14)

Proof

The diffusion equations are so named because the spatial derivative in the first equation is with respect to y, the state forward at
time t, while the spatial derivative in the second equation is with respect to x, the state backward at time 0.

Recall that a random time τ taking values in [0, ∞] is a stopping time with respect to the process X = {X_t : t ∈ [0, ∞)} if
{τ ≤ t} ∈ F_t for every t ∈ [0, ∞). Informally, we can determine whether or not τ ≤ t by observing the process up to time t. An
important special case is the first time that our Brownian motion hits a specified state. Thus, for x ∈ ℝ let
τ_x = inf{t ∈ [0, ∞) : X_t = x}. The random time τ_x is a stopping time.

For a stopping time τ, we need the σ-algebra of events that can be defined in terms of the process up to the random time τ,
analogous to F_t, the σ-algebra of events that can be defined in terms of the process up to a fixed time t. The appropriate definition
is

F_τ = {B ∈ F : B ∩ {τ ≤ t} ∈ F_t for all t ≥ 0}   (18.1.15)

See the section on Filtrations and Stopping Times for more information on filtrations, stopping times, and the σ-algebra associated
with a stopping time.
The strong Markov property is the Markov property generalized to stopping times. Standard Brownian motion X is also a strong
Markov process. The best way to say this is by a generalization of the temporal and spatial homogeneity result above.

Suppose that τ is a stopping time and define Y_t = X_{τ+t} − X_τ for t ∈ [0, ∞). Then Y = {Y_t : t ∈ [0, ∞)} is a standard
Brownian motion and is independent of F_τ.

The Reflection Principle

Many interesting properties of Brownian motion can be obtained from a clever idea known as the reflection principle. As usual, we
start with a standard Brownian motion X = {X_t : t ∈ [0, ∞)}. Let τ be a stopping time for X. Define

W_t = X_t for 0 ≤ t < τ,  W_t = 2X_τ − X_t for τ ≤ t < ∞   (18.1.16)

Thus, the graph of W = {W_t : t ∈ [0, ∞)} can be obtained from the graph of X by reflecting in the line x = X_τ after time τ. In
particular, if the stopping time is τ_a, the first time that the process hits a specified state a > 0, then the graph of W is obtained
from the graph of X by reflecting in the line x = a after time τ_a.

Open the simulation of reflecting Brownian motion. This app shows the process W corresponding to the stopping time τ_a,
the time of first visit to a positive state a. Run the simulation in single step mode until you see the reflected process several
times. Make sure that you understand how the process W works.

The reflected process W = {W_t : t ∈ [0, ∞)} is also a standard Brownian motion.

Run the simulation of the reflected Brownian motion process 1000 times. Compare the empirical density function and
moments of W_t to the true probability density function and moments.
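The reflection construction can also be checked with a short simulation. In the sketch below (a discrete-time approximation, so the reflection happens at the first grid point at or above a rather than exactly at τ_a; all parameter choices are assumptions), each path is reflected in the line x = a, and the endpoint of W behaves like that of a standard Brownian motion:

```python
import numpy as np

rng = np.random.default_rng(181)
n, dt, a = 1000, 0.001, 0.5                # path on [0, 1]; reflect at a = 0.5

def endpoints():
    x = np.concatenate([[0.0], np.cumsum(rng.normal(0, np.sqrt(dt), n))])
    w = x.copy()
    hit = np.argmax(x >= a)                # first index with X >= a (0 if never hit)
    if x[hit] >= a:                        # reflect in the line x = a after tau_a
        w[hit:] = 2 * a - x[hit:]
    return x[-1], w[-1]

vals = np.array([endpoints() for _ in range(5000)])
print(vals.mean(axis=0))                   # both near 0
print(vals.var(axis=0))                    # both near 1: W is also standard BM
```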

Martingales
As usual, let X = {X_t : t ∈ [0, ∞)} be a standard Brownian motion, and let F_t = σ{X_s : 0 ≤ s ≤ t} for t ∈ [0, ∞), so that
F = {F_t : t ∈ [0, ∞)} is the natural filtration for X. There are several important martingales associated with X. We will study a
couple of them in this section, and others in subsequent sections. Our first result is that X itself is a martingale, simply by virtue of
having stationary, independent increments and 0 mean.

X is a martingale with respect to F.

Proof

The next martingale is a little more interesting.

Let Y_t = X_t² − t for t ∈ [0, ∞). Then Y = {Y_t : t ∈ [0, ∞)} is a martingale with respect to F.

Proof

Maximums and Hitting Times
As usual, we start with a standard Brownian motion X = {X_t : t ∈ [0, ∞)}. For y ∈ [0, ∞) recall that
τ_y = min{t ≥ 0 : X_t = y} is the first time that the process hits state y. Of course, τ_0 = 0. For t ∈ [0, ∞), let
Y_t = max{X_s : 0 ≤ s ≤ t}, the maximum value of X on the interval [0, t]. Note that Y_t is well defined by the continuity of X,
and of course Y_0 = 0. Thus we have two new stochastic processes: {τ_y : y ∈ [0, ∞)} and {Y_t : t ∈ [0, ∞)}. Both have index set
[0, ∞) and (as we will see) state space [0, ∞). Moreover, the processes are inverses of each other in a sense:

For t, y ∈ (0, ∞), τ_y ≤ t if and only if Y_t ≥ y.

Proof

Thus, if we can compute the distribution of Y_t for each t ∈ (0, ∞) then we can compute the distribution of τ_y for each y ∈ (0, ∞),
and conversely.

For y > 0, τ_y has the same distribution as y²/Z², where Z is a standard normal variable. The probability density function g_y
is given by

g_y(t) = (y/√(2πt³)) exp(−y²/(2t)),  t ∈ (0, ∞)   (18.1.20)

Proof
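The representation of τ_y as y²/Z² gives a painless way to sample hitting times. A minimal sketch (NumPy and SciPy assumed available; the level y and the seed are arbitrary assumptions) compares the empirical distribution with the distribution function P(τ_y ≤ t) = 2[1 − Φ(y/√t)] that follows from the theorem above:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(182)
y = 1.5
z = rng.normal(size=200_000)
tau = y ** 2 / z ** 2            # tau_y has the same law as y^2 / Z^2

# CDF implied by the theorem: P(tau_y <= t) = P(Y_t >= y) = 2[1 - Phi(y / sqrt(t))]
for t in [0.5, 1.0, 2.0, 5.0]:
    print(t, np.mean(tau <= t), 2 * (1 - norm.cdf(y / np.sqrt(t))))
```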

The distribution of τ_y is the Lévy distribution with scale parameter y², and is named for the French mathematician Paul Lévy. The
Lévy distribution is studied in more detail in the chapter on special distributions.

Open the hitting time experiment. Vary y and note the shape and location of the probability density function of τ_y. For selected
values of the parameter, run the simulation in single step mode a few times. Then run the experiment 1000 times and compare
the empirical density function to the probability density function.

Open the special distribution simulator and select the Lévy distribution. Vary the parameters and note the shape and location of
the probability density function. For selected values of the parameters, run the simulation 1000 times and compare the
empirical density function to the probability density function.

Standard Brownian motion is recurrent. That is, P(τ_y < ∞) = 1 for every y ∈ ℝ.

Proof

Thus, for each y ∈ ℝ, X eventually hits y with probability 1. Actually we can say more:

With probability 1, X visits every point in ℝ.

Proof

On the other hand,

Standard Brownian motion is null recurrent. That is, E(τ_y) = ∞ for every y ∈ ℝ ∖ {0}.

Proof

The process {τ_x : x ∈ [0, ∞)} has stationary, independent increments.

Proof

The family of probability density functions {g_x : x ∈ (0, ∞)} is closed under convolution. That is, g_x ∗ g_y = g_{x+y} for
x, y ∈ (0, ∞).

Proof

Now we turn our attention to the maximum process {Y_t : t ∈ [0, ∞)}, the “inverse” of the hitting process {τ_y : y ∈ [0, ∞)}.

For t > 0, Y_t has the same distribution as |X_t|, known as the half-normal distribution with scale parameter t. The probability
density function is

h_t(y) = √(2/πt) exp(−y²/(2t)),  y ∈ [0, ∞)   (18.1.27)

Proof
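A discretized simulation gives a rough check of the half-normal distribution of Y_t (a sketch only; the time grid introduces a small downward bias in the running maximum, and all parameter choices are assumptions):

```python
import numpy as np

rng = np.random.default_rng(183)
t, n_steps, reps = 2.0, 1000, 5000
dt = t / n_steps

incs = rng.normal(0.0, np.sqrt(dt), size=(reps, n_steps))
paths = np.cumsum(incs, axis=1)
y_t = np.maximum(paths.max(axis=1), 0.0)   # running maximum, including X_0 = 0

# Half-normal moments: E(Y_t) = sqrt(2t/pi), var(Y_t) = t(1 - 2/pi)
print(y_t.mean(), np.sqrt(2 * t / np.pi))
print(y_t.var(), t * (1 - 2 / np.pi))
```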

The half-normal distribution is a special case of the folded normal distribution, which is studied in more detail in the chapter on
special distributions.

For t ≥ 0, the mean and variance of Y_t are

1. E(Y_t) = √(2t/π)
2. var(Y_t) = t(1 − 2/π)

Proof

In the standard Brownian motion simulation, select the maximum value. Vary the parameter t and note the shape of the
probability density function and the location and size of the mean-standard deviation bar. Run the simulation 1000 times and
compare the empirical density and moments to the true probability density function and moments.

Open the special distribution simulator and select the folded-normal distribution. Vary the parameters and note the shape and
location of the probability density function and the size and location of the mean-standard deviation bar. For selected values of
the parameters, run the simulation 1000 times and compare the empirical density function and moments to the true density
function and moments.

Zeros and Arcsine Laws

As usual, we start with a standard Brownian motion X = {X_t : t ∈ [0, ∞)}. Study of the zeros of X leads to a number of
probability laws referred to as arcsine laws, because as we might guess, the probabilities and distributions involve the arcsine
function.

For s, t ∈ [0, ∞) with s < t, let E(s, t) be the event that X has a zero in the time interval (s, t). That is,
E(s, t) = {X_u = 0 for some u ∈ (s, t)}. Then

P[E(s, t)] = 1 − (2/π) arcsin(√(s/t))   (18.1.29)

Proof

In particular, P[E(0, t)] = 1 for every t > 0, so with probability 1, X has a zero in (0, t). Actually, we can say a bit more:

For t > 0 , X has infinitely many zeros in (0, t) with probability 1.


Proof

The last result is further evidence of the very strange and irregular behavior of Brownian motion. Note also that P[E(s, t)]
depends only on the ratio s/t. Thus, P[E(s, t)] = P[E(1/t, 1/s)] and P[E(s, t)] = P[E(cs, ct)] for every c > 0. So, for
example, the probability of at least one zero in the interval (2, 5) is the same as the probability of at least one zero in (1/5, 1/2), the
same as the probability of at least one zero in (6, 15), and the same as the probability of at least one zero in (200, 500).

For t > 0, let Z_t denote the time of the last zero of X before time t. That is, Z_t = max{s ∈ [0, t] : X_s = 0}. Then Z_t has
the arcsine distribution with parameter t. The distribution function H_t and the probability density function h_t are given by

H_t(s) = (2/π) arcsin(√(s/t)),  0 ≤ s ≤ t   (18.1.34)

h_t(s) = 1/(π√(s(t − s))),  0 < s < t   (18.1.35)

Proof

The density function of Z_t is u-shaped and symmetric about the midpoint t/2, so the points with the largest density are those near
the endpoints 0 and t, a surprising result at first. The arcsine distribution is studied in more detail in the chapter on special
distributions.

The mean and variance of Z_t are

1. E(Z_t) = t/2
2. var(Z_t) = t²/8

Proof

In the simulation of standard Brownian motion, select the last zero variable. Vary the parameter t and note the shape of the
probability density function and the size and location of the mean-standard deviation bar. For selected values of t run the
simulation in single step mode a few times and note the position of the last zero. Finally, run the simulation 1000 times and
compare the empirical density function and moments to the true probability density function and moments.

Open the special distribution simulator and select the arcsine distribution. Vary the parameters and note the shape and location
of the probability density function and the size and location of the mean-standard deviation bar. For selected values of the
parameters, run the simulation 1000 times and compare the empirical density function and moments to the true density
function and moments.

Now let Z = {t ∈ [0, ∞) : X_t = 0} denote the set of zeros of X, so that Z is a random subset of [0, ∞). The theorem below
gives some of the strange properties of the random set Z, but to understand these, we need to review some definitions. A nowhere
dense set is a set whose closure has empty interior. A perfect set is a set with no isolated points. As usual, we let λ denote Lebesgue
measure on ℝ.

With probability 1,
1. Z is closed.
2. λ(Z) = 0
3. Z is nowhere dense.
4. Z is perfect.
Proof

The following theorem gives a deeper property of Z. The Hausdorff dimension of Z is midway between that of a point (dimension
0) and a line (dimension 1).

Z has Hausdorff dimension 1/2.

The Law of the Iterated Logarithm

As usual, let X = {X_t : t ∈ [0, ∞)} be standard Brownian motion. By definition, we know that X_t has the normal distribution
with mean 0 and standard deviation √t, so the function x = √t gives some idea of how the process grows in time. The precise
growth rate is given by the famous law of the iterated logarithm.

With probability 1,

lim sup_{t→∞} X_t/√(2t ln ln t) = 1   (18.1.37)

Computational Exercises
In the following exercises, X = {X_t : t ∈ [0, ∞)} is a standard Brownian motion process.

Explicitly find the probability density function, covariance matrix, and correlation matrix of (X_{0.5}, X_1, X_{2.3}).
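For this exercise, cov(X_s, X_t) = min{s, t} does all the work: the density is the multivariate normal density with mean vector 0 and the covariance matrix below. A sketch of the matrix computations (NumPy; offered as an illustration, not a worked solution from the text):

```python
import numpy as np

times = np.array([0.5, 1.0, 2.3])
cov = np.minimum.outer(times, times)   # cov(X_s, X_t) = min{s, t}
sd = np.sqrt(np.diag(cov))
cor = cov / np.outer(sd, sd)           # cor(X_s, X_t) = sqrt(min{s,t}/max{s,t})

print(cov)   # [[0.5 0.5 0.5], [0.5 1.0 1.0], [0.5 1.0 2.3]]
print(cor)
```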

This page titled 18.1: Standard Brownian Motion is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
18.2: Brownian Motion with Drift and Scaling

Basic Theory
Definition
We start with the assumptions that govern standard Brownian motion, except that we relax the restrictions on the parameters of the
normal distribution.

Suppose that μ ∈ ℝ and σ ∈ (0, ∞). Brownian motion with drift parameter μ and scale parameter σ is a random process
X = {X_t : t ∈ [0, ∞)} with state space ℝ that satisfies the following properties:
1. X_0 = 0 (with probability 1).
2. X has stationary increments. That is, for s, t ∈ [0, ∞) with s < t, the distribution of X_t − X_s is the same as the
distribution of X_{t−s}.
3. X has independent increments. That is, for t_1, t_2, …, t_n ∈ [0, ∞) with t_1 < t_2 < ⋯ < t_n, the random variables
X_{t_1}, X_{t_2} − X_{t_1}, …, X_{t_n} − X_{t_{n−1}} are independent.
4. X_t has the normal distribution with mean μt and variance σ²t for t ∈ [0, ∞).
5. With probability 1, t ↦ X_t is continuous on [0, ∞).

Note that we cannot assign the parameters of the normal distribution of X_t arbitrarily. We know that since X has stationary,
independent increments, E(X_t) and var(X_t) must be linear functions of t ∈ [0, ∞).

Open the simulation of Brownian motion with drift and scaling. Run the simulation in single step mode several times for
various values of the parameters. Note the behavior of the sample paths. For selected values of the parameters, run the
simulation 1000 times and compare the empirical density function and moments to the true density function and moments.

It's easy to construct Brownian motion with drift and scaling from a standard Brownian motion, so we don't have to worry about the
existence question.

Relation to standard Brownian motion.

1. Suppose that Z = {Z_t : t ∈ [0, ∞)} is a standard Brownian motion, and that μ ∈ ℝ and σ ∈ (0, ∞). Let X_t = μt + σZ_t
for t ∈ [0, ∞). Then X = {X_t : t ∈ [0, ∞)} is a Brownian motion with drift parameter μ and scale parameter σ.
2. Conversely, suppose that X = {X_t : t ∈ [0, ∞)} is a Brownian motion with drift parameter μ ∈ ℝ and scale parameter
σ ∈ (0, ∞). Let Z_t = (X_t − μt)/σ for t ∈ [0, ∞). Then Z = {Z_t : t ∈ [0, ∞)} is a standard Brownian motion.

Proof

In differential form, part (a) can be written as

dX_t = μ dt + σ dZ_t,  X_0 = 0   (18.2.1)
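Part (a) of the theorem above is also the natural way to simulate the process. A minimal sketch (NumPy; the drift, scale, grid, and seed are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(184)
mu, sigma = 1.0, 0.5                   # drift and scale (arbitrary choices)
n, dt = 1000, 0.001
t = np.arange(n + 1) * dt

# Standard Brownian motion path, then X_t = mu * t + sigma * Z_t
z = np.concatenate([[0.0], np.cumsum(rng.normal(0.0, np.sqrt(dt), n))])
x = mu * t + sigma * z

print(x[-1])                           # X_1 ~ N(mu, sigma^2) across repeated runs
```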

Finite Dimensional Distributions

Suppose that X = {X_t : t ∈ [0, ∞)} is Brownian motion with drift parameter μ ∈ ℝ and scale parameter σ ∈ (0, ∞). It follows
from part (d) of the definition that X_t has probability density function f_t given by

f_t(x) = (1/(σ√(2πt))) exp[−(x − μt)²/(2σ²t)],  x ∈ ℝ   (18.2.2)

This family of density functions determines the finite dimensional distributions of X.

If t_1, t_2, …, t_n ∈ (0, ∞) with 0 < t_1 < t_2 < ⋯ < t_n then (X_{t_1}, X_{t_2}, …, X_{t_n}) has probability density function f_{t_1, t_2, …, t_n} given
by

f_{t_1, t_2, …, t_n}(x_1, x_2, …, x_n) = f_{t_1}(x_1) f_{t_2 − t_1}(x_2 − x_1) ⋯ f_{t_n − t_{n−1}}(x_n − x_{n−1}),  (x_1, x_2, …, x_n) ∈ ℝ^n   (18.2.3)

Proof
X is a Gaussian process with mean function m and covariance function c given by
1. m(t) = μt for t ∈ [0, ∞)
2. c(s, t) = σ² min{s, t} for s, t ∈ [0, ∞).

Proof

The correlation function is independent of the parameters, and thus is the same as for standard Brownian motion. This is hardly
surprising since correlation is a standardized measure of association.

cor(X_s, X_t) = σ² min{s, t}/(σ√s · σ√t) = min{s, t}/√(st) = √(min{s, t}/max{s, t}),  (s, t) ∈ [0, ∞)²   (18.2.4)

Transformations
There are a couple of simple transformations that preserve Brownian motion, but perhaps change the drift and scale parameters. Our
starting place is a Brownian motion X = {X_t : t ∈ [0, ∞)} with drift parameter μ ∈ ℝ and scale parameter σ ∈ (0, ∞). Our first
result involves scaling X in time and space (and possibly reflecting in the spatial origin).

Let a ∈ ℝ ∖ {0} and b ∈ (0, ∞). Define Y_t = aX_{bt} for t ≥ 0. Then Y = {Y_t : t ≥ 0} is also a Brownian motion, with drift
parameter abμ and scale parameter |a|√b σ.

Proof

Suppose that a > 0 in the previous theorem, so that we are scaling temporally and spatially. In order to preserve the original drift
parameter μ we must have ab = 1 (if μ ≠ 0). In order to preserve the original scale parameter σ, we must have a√b = 1. We can't
have both unless μ = 0, which leads to a slight generalization of one of our results for standard Brownian motion:

Suppose that X is a Brownian motion with drift parameter μ = 0 and scale parameter σ > 0. Suppose also that c > 0 and let
Y_t = (1/c) X_{c²t} for t ≥ 0. Then Y = {Y_t : t ∈ [0, ∞)} is also a Brownian motion with drift parameter 0 and scale parameter σ.

Our next result is related to the Markov property, which we explore in more detail below. We return to the general case where
X = {X_t : t ∈ [0, ∞)} is a Brownian motion with drift parameter μ ∈ ℝ and scale parameter σ ∈ (0, ∞). If we “restart”
Brownian motion at a fixed time s, and shift the origin to X_s, then we have another Brownian motion with the same parameters.

Fix s ∈ [0, ∞) and define Y_t = X_{s+t} − X_s for t ≥ 0. Then Y = {Y_t : t ∈ [0, ∞)} is also a Brownian motion with the same
drift and scale parameters.

Proof

The Markov Property and Stopping Times

As usual, we start with a Brownian motion X = {X_t : t ∈ [0, ∞)} with drift parameter μ and scale parameter σ. Recall again that
a Markov process has the property that the future is independent of the past, given the present state. Because of the stationary,
independent increments property, Brownian motion has this property. As a minor note, to view X as a Markov process, we
sometimes need to relax Assumption 1 and let X_0 have an arbitrary value in ℝ. Let F_t = σ{X_s : 0 ≤ s ≤ t}, the σ-algebra
generated by the process up to time t ∈ [0, ∞). The family of σ-algebras F = {F_t : t ∈ [0, ∞)} is known as a filtration.

Brownian motion is a time-homogeneous Markov process with transition probability density p given by

p_t(x, y) = f_t(y − x) = (1/(σ√(2πt))) exp[−(y − x − μt)²/(2σ²t)],  t ∈ (0, ∞); x, y ∈ ℝ   (18.2.8)

Proof

The transition density p satisfies the following diffusion equations. The first is known as the forward equation and the second as
the backward equation.

∂/∂t p_t(x, y) = −μ ∂/∂y p_t(x, y) + (σ²/2) ∂²/∂y² p_t(x, y)   (18.2.9)

∂/∂t p_t(x, y) = μ ∂/∂x p_t(x, y) + (σ²/2) ∂²/∂x² p_t(x, y)   (18.2.10)

Proof

The diffusion equations are so named because the spatial derivative in the first equation is with respect to y, the state forward at
time t, while the spatial derivative in the second equation is with respect to x, the state backward at time 0.

Recall again that a random time τ taking values in [0, ∞] is a stopping time with respect to the process X if {τ ≤ t} ∈ F_t for
every t ∈ [0, ∞). The σ-algebra associated with τ is

F_τ = {B ∈ F : B ∩ {τ ≤ t} ∈ F_t for all t ≥ 0}   (18.2.11)

See the section on Filtrations and Stopping Times for more information on filtrations, stopping times, and the σ-algebra associated
with a stopping time. Brownian motion X is also a strong Markov process.

Suppose that τ is a stopping time and define Y_t = X_{τ+t} − X_τ for t ∈ [0, ∞). Then Y = {Y_t : t ∈ [0, ∞)} is a Brownian
motion with the same drift and scale parameters, and is independent of F_τ.

This page titled 18.2: Brownian Motion with Drift and Scaling is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

18.3: The Brownian Bridge

Basic Theory
Definition and Constructions
In the most common formulation, the Brownian bridge process is obtained by taking a standard Brownian motion process X,
restricted to the interval [0, 1], and conditioning on the event that X_1 = 0. Since X_0 = 0 also, the process is “tied down” at both
ends, and so the process in between forms a “bridge” (albeit a very jagged one). The Brownian bridge turns out to be an interesting
stochastic process with surprising applications, including a very important application to statistics. In terms of a definition,
however, we will give a list of characterizing properties as we did for standard Brownian motion and for Brownian motion with
drift and scaling.

A Brownian bridge is a stochastic process X = {X_t : t ∈ [0, 1]} with state space ℝ that satisfies the following properties:
1. X_0 = 0 and X_1 = 0 (each with probability 1).
2. X is a Gaussian process.
3. E(X_t) = 0 for t ∈ [0, 1].
4. cov(X_s, X_t) = min{s, t} − st for s, t ∈ [0, 1].
5. With probability 1, t ↦ X_t is continuous on [0, 1].

So, in short, a Brownian bridge X is a continuous Gaussian process with X_0 = X_1 = 0, and with mean and covariance functions
given in (c) and (d), respectively. Naturally, the first question is whether there exists such a process. The answer is yes, of course,
otherwise why would we be here? But in fact, we will see several ways of constructing a Brownian bridge from a standard
Brownian motion. To help with the proofs, recall that a standard Brownian motion process Z = {Z_t : t ∈ [0, ∞)} is a continuous
Gaussian process with Z_0 = 0, E(Z_t) = 0 for t ∈ [0, ∞) and cov(Z_s, Z_t) = min{s, t} for s, t ∈ [0, ∞). Here is our first
construction:

Suppose that Z = {Z_t : t ∈ [0, ∞)} is a standard Brownian motion, and let X_t = Z_t − tZ_1 for t ∈ [0, 1]. Then
X = {X_t : t ∈ [0, 1]} is a Brownian bridge.

Proof
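This first construction is also the simplest one to simulate. A minimal Python sketch (NumPy; the grid size and seed are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(185)
n = 1000
t = np.linspace(0.0, 1.0, n + 1)

# Standard Brownian motion on [0, 1], then X_t = Z_t - t * Z_1
z = np.concatenate([[0.0], np.cumsum(rng.normal(0.0, np.sqrt(1 / n), n))])
x = z - t * z[-1]

print(x[0], x[-1])          # both endpoints pinned at 0
print(x[n // 2])            # X_{1/2} ~ N(0, 1/4) across repeated runs
```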

Let's see the Brownian bridge in action.

Run the simulation of the Brownian bridge process in single step mode a few times.

For the Brownian bridge X, note in particular that X_t is normally distributed with mean 0 and variance t(1 − t) for t ∈ [0, 1].
Thus, the variance increases and then decreases on [0, 1], reaching a maximum of 1/4 at t = 1/2. Of course, the variance is 0 at
t = 0 and t = 1, since X_0 = X_1 = 0 deterministically.

Open the simulation of the Brownian bridge process. Vary t and note the change in the probability density function and
moments. For various values of t , run the simulation 1000 times and compare the empirical density function and moments to
the true density function and moments.

Conversely to the construction above, we can build a standard Brownian motion on the time interval [0, 1] from a Brownian bridge.

Suppose that X = {X_t : t ∈ [0, 1]} is a Brownian bridge, and suppose that Z is a random variable with a standard normal
distribution, independent of X. Let Z_t = X_t + tZ for t ∈ [0, 1]. Then Z = {Z_t : t ∈ [0, 1]} is a standard Brownian motion
on [0, 1].

Proof

Here's another way to construct a Brownian bridge from a standard Brownian motion.

Suppose that Z = {Z_t : t ∈ [0, ∞)} is a standard Brownian motion. Define X_1 = 0 and

X_t = (1 − t) Z(t/(1 − t)),  t ∈ [0, 1)   (18.3.1)

Then X = {X_t : t ∈ [0, 1]} is a Brownian bridge.

Proof

Conversely, we can construct a standard Brownian motion from a Brownian bridge.

Suppose that X = {X_t : t ∈ [0, 1]} is a Brownian bridge. Define

Z_t = (1 + t) X(t/(1 + t)),  t ∈ [0, ∞)   (18.3.4)

Then Z = {Z_t : t ∈ [0, ∞)} is a standard Brownian motion process.

Proof

We return to the comments at the beginning of this section, on conditioning a standard Brownian motion to be 0 at time 1. Unlike
the previous two constructions, note that we are not transforming the random variables; rather, we are changing the underlying
probability measure.

Suppose that X = {X_t : t ∈ [0, ∞)} is a standard Brownian motion. Then conditioned on X_1 = 0, the process
{X_t : t ∈ [0, 1]} is a Brownian bridge process.

Proof

Finally, the Brownian bridge can be defined in terms of a stochastic integral.

Suppose that Z = {Z_t : t ∈ [0, ∞)} is a standard Brownian motion. Define X_1 = 0 and

X_t = (1 − t) ∫_0^t [1/(1 − s)] dZ_s,  t ∈ [0, 1)   (18.3.7)

Then X = {X_t : t ∈ [0, 1]} is a Brownian bridge process.

Proof

In differential form, the process above can be written as

dX_t = −[X_t/(1 − t)] dt + dZ_t,  X_0 = 0   (18.3.11)

The General Brownian Bridge

The process constructed above (in several ways!) is the standard Brownian bridge. It's a simple matter to generalize the process
so that it starts at a and ends at b, for arbitrary a, b ∈ ℝ.

Suppose that Z = {Z_t : t ∈ [0, 1]} is a standard Brownian bridge process. Let a, b ∈ ℝ and define X_t = (1 − t)a + tb + Z_t
for t ∈ [0, 1]. Then X = {X_t : t ∈ [0, 1]} is a Brownian bridge process from a to b.

Of course, any of the constructions above for the standard Brownian bridge can be modified to produce a general Brownian bridge.
Here are the characterizing properties.

The Brownian bridge process X = {X_t : t ∈ [0, 1]} from a to b is characterized by the following properties:
1. X_0 = a and X_1 = b (each with probability 1).
2. X is a Gaussian process.
3. E(X_t) = (1 − t)a + tb for t ∈ [0, 1].
4. cov(X_s, X_t) = min{s, t} − st for s, t ∈ [0, 1].
5. With probability 1, t ↦ X_t is continuous on [0, 1].
Applications
The Empirical Distribution Function
We start with a problem that is one of the most basic in statistics. Suppose that T is a real-valued random variable with an unknown
distribution. Let F denote the distribution function of T, so that F(t) = P(T ≤ t) for t ∈ ℝ. Our goal is to construct an estimator
of F, so naturally our first step is to sample from the distribution of T. This generates a sequence T = (T_1, T_2, …) of independent
variables, each with the distribution of T (and so with distribution function F). Think of T as a sequence of independent copies of
T. For n ∈ ℕ₊ and t ∈ ℝ, the natural estimator of F(t) based on the first n sample values is

F_n(t) = (1/n) ∑_{i=1}^n 1(T_i ≤ t)   (18.3.12)

which is simply the proportion of the first n sample values that fall in the interval (−∞, t]. Appropriately enough, F_n is known as
the empirical distribution function corresponding to the sample of size n. Note that (1(T_1 ≤ t), 1(T_2 ≤ t), …) is a sequence of
independent, identically distributed indicator variables (and hence is a sequence of Bernoulli trials), and corresponds to sampling
from the distribution of 1(T ≤ t). The estimator F_n(t) is simply the sample mean of the first n of these variables. The numerator,
the number of the original sample variables with values in (−∞, t], has the binomial distribution with parameters n and F(t). Like
all sample means from independent, identically distributed samples, F_n(t) satisfies some basic and important properties. A
summary is given below, but to make sense of some of these facts, you need to recall the mean and variance of the indicator
variable that we are sampling from: E[1(T ≤ t)] = F(t), var[1(T ≤ t)] = F(t)[1 − F(t)].

For fixed t ∈ ℝ,
1. E[F_n(t)] = F(t), so F_n(t) is an unbiased estimator of F(t).
2. var[F_n(t)] = F(t)[1 − F(t)]/n, so F_n(t) is a consistent estimator of F(t).
3. F_n(t) → F(t) as n → ∞ with probability 1, the strong law of large numbers.
4. √n[F_n(t) − F(t)] has mean 0 and variance F(t)[1 − F(t)], and converges to the normal distribution with these
parameters as n → ∞, the central limit theorem.

The theorem above gives us a great deal of information about F_n(t) for fixed t, but now we want to let t vary and consider the
expression in (d), namely t ↦ √n[F_n(t) − F(t)], as a random process for each n ∈ ℕ₊. The key is to consider a very special
distribution first.

Suppose that T has the standard uniform distribution, that is, the continuous uniform distribution on the interval [0, 1]. In this case
the distribution function is simply F(t) = t for t ∈ [0, 1], so we have the sequence of stochastic processes
X_n = {X_n(t) : t ∈ [0, 1]} for n ∈ ℕ₊, where

X_n(t) = √n[F_n(t) − t]   (18.3.13)

Of course, the previous results apply, so the process X_n has mean function 0, variance function t ↦ t(1 − t), and for fixed
t ∈ [0, 1], the distribution of X_n(t) converges to the corresponding normal distribution as n → ∞. Here is the new bit of
information: the covariance function of X_n is the same as that of the Brownian bridge!

cov[X_n(s), X_n(t)] = min{s, t} − st for s, t ∈ [0, 1].

Proof
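This covariance identity is easy to test by simulation. In the sketch below (NumPy; the choices of n, s, t, the seed, and the replication count are arbitrary assumptions), the sample covariance of X_n(s) and X_n(t) is compared with min{s, t} − st:

```python
import numpy as np

rng = np.random.default_rng(186)
n, reps = 200, 50_000
s, t = 0.3, 0.7

u = rng.random((reps, n))                        # uniform samples on [0, 1]
x_s = np.sqrt(n) * ((u <= s).mean(axis=1) - s)   # X_n(s) for each replication
x_t = np.sqrt(n) * ((u <= t).mean(axis=1) - t)   # X_n(t)

print(np.cov(x_s, x_t)[0, 1], min(s, t) - s * t) # both near 0.09
```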

This page titled 18.3: The Brownian Bridge is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.

18.4: Geometric Brownian Motion

Basic Theory
Geometric Brownian motion, and other stochastic processes constructed from it, are often used to model population growth and
financial processes (such as the price of a stock over time) that are subject to random “noise”.

Definition

Suppose that Z = {Z_t : t ∈ [0, ∞)} is standard Brownian motion and that μ ∈ ℝ and σ ∈ (0, ∞). Let

X_t = exp[(μ − σ²/2)t + σZ_t],  t ∈ [0, ∞)   (18.4.1)

The stochastic process X = {X_t : t ∈ [0, ∞)} is geometric Brownian motion with drift parameter μ and volatility parameter
σ.

Note that the stochastic process

{(μ − σ²/2)t + σZ_t : t ∈ [0, ∞)}   (18.4.2)

is Brownian motion with drift parameter μ − σ²/2 and scale parameter σ, so geometric Brownian motion is simply the exponential
of this process. In particular, the process is always positive, one of the reasons that geometric Brownian motion is used to model
financial and other processes that cannot be negative. Note also that X_0 = 1, so the process starts at 1, but we can easily change
this. For x_0 ∈ (0, ∞), the process {x_0 X_t : t ∈ [0, ∞)} is geometric Brownian motion starting at x_0. You may well wonder about
the particular combination of parameters μ − σ²/2 in the definition. The short answer to the question is given in the following
theorem:

Geometric Brownian motion X = {X_t : t ∈ [0, ∞)} satisfies the stochastic differential equation

dX_t = μX_t dt + σX_t dZ_t   (18.4.3)

Note that the deterministic part of this equation is the standard differential equation for exponential growth or decay, with rate
parameter μ.

Run the simulation of geometric Brownian motion several times in single step mode for various values of the parameters. Note
the behavior of the process.
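Since X_t is an explicit function of Z_t, the process can be sampled exactly at any fixed time. A minimal sketch (NumPy; all parameter choices are arbitrary assumptions) that also checks the mean and variance formulas given below:

```python
import numpy as np

rng = np.random.default_rng(187)
mu, sigma, t = 0.1, 0.3, 2.0               # drift, volatility, time (arbitrary)
reps = 200_000

# Exact sampling at a fixed time: X_t = exp[(mu - sigma^2/2) t + sigma Z_t]
z_t = rng.normal(0.0, np.sqrt(t), size=reps)
x_t = np.exp((mu - sigma ** 2 / 2) * t + sigma * z_t)

print(x_t.mean(), np.exp(mu * t))          # E(X_t) = e^{mu t}
print(x_t.var(), np.exp(2 * mu * t) * (np.exp(sigma ** 2 * t) - 1))
```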

Distributions

For t ∈ (0, ∞), X_t has the lognormal distribution with parameters (μ − σ²/2)t and σ√t. The probability density function f_t
is given by

f_t(x) = (1/(√(2πt) σx)) exp(−[ln(x) − (μ − σ²/2)t]²/(2σ²t)),  x ∈ (0, ∞)   (18.4.4)

1. f_t increases and then decreases, with mode at x = exp[(μ − (3/2)σ²)t]
2. f_t is concave upward, then downward, then upward again, with inflection points at x = exp[(μ − 2σ²)t ± (σ/2)√(σ²t² + 4t)]

Proof

In particular, geometric Brownian motion is not a Gaussian process.

Open the simulation of geometric Brownian motion. Vary the parameters and note the shape of the probability density function
of X_t. For various values of the parameters, run the simulation 1000 times and compare the empirical density function to the
true probability density function.

For t ∈ (0, ∞), the distribution function F_t of X_t is given by

F_t(x) = Φ[(ln(x) − (μ − σ²/2)t)/(σ√t)],  x ∈ (0, ∞)   (18.4.5)

where Φ is the standard normal distribution function.

Proof

For t ∈ (0, ∞), the quantile function F_t^{−1} of X_t is given by

F_t^{−1}(p) = exp[(μ − σ²/2)t + σ√t Φ^{−1}(p)],  p ∈ (0, 1)   (18.4.6)

where Φ^{−1} is the standard normal quantile function.

Proof

Moments
For n ∈ ℕ and t ∈ [0, ∞),

E(X_t^n) = exp{[nμ + (σ²/2)(n² − n)]t}   (18.4.7)

Proof

In terms of the order of the moment n, the dominant term inside the exponential is σ²n²/2. If n > 1 − 2μ/σ² then
nμ + (σ²/2)(n² − n) > 0, so E(X_t^n) → ∞ as t → ∞. The mean and variance follow easily from the general moment result.

For t ∈ [0, ∞),

1. E(X_t) = e^{μt}
2. var(X_t) = e^{2μt}(e^{σ²t} − 1)

In particular, note that the mean function m(t) = E(X_t) = e^{μt} for t ∈ [0, ∞) satisfies the deterministic part of the stochastic
differential equation above. If μ > 0 then m(t) → ∞ as t → ∞. If μ = 0 then m(t) = 1 for all t ∈ [0, ∞). If μ < 0 then
m(t) → 0 as t → ∞.

Open the simulation of geometric Brownian motion. The graph of the mean function m is shown as a blue curve in the main
graph box. For various values of the parameters, run the simulation 1000 times and note the behavior of the random process in
relation to the mean function.

Open the simulation of geometric Brownian motion. Vary the parameters and note the size and location of the mean ± standard deviation bar for X_t. For various values of the parameters, run the simulation 1000 times and compare the empirical mean and standard deviation to the true mean and standard deviation.

Properties

The parameter μ − σ²/2 determines the asymptotic behavior of geometric Brownian motion.

Asymptotic behavior:

1. If μ > σ²/2 then X_t → ∞ as t → ∞ with probability 1.
2. If μ < σ²/2 then X_t → 0 as t → ∞ with probability 1.
3. If μ = σ²/2 then X_t has no limit as t → ∞ with probability 1.

Proof
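Behind the collapsed proof is the strong law for Brownian motion, Z_t/t → 0 as t → ∞ with probability 1, so ln(X_t)/t = (μ − σ²/2) + σZ_t/t → μ − σ²/2; the sign of this limit separates the three cases. A numerical sketch (horizon and parameter values are mine):

import numpy as np

rng = np.random.default_rng(seed=11)
sigma, t_max, n = 1.0, 1000.0, 100_000
t = np.linspace(0.0, t_max, n + 1)
z = np.concatenate(([0.0], np.cumsum(rng.normal(0.0, np.sqrt(t_max / n), n))))

for mu in (1.0, 0.5, 0.25):  # mu > sigma^2/2, mu = sigma^2/2, mu < sigma^2/2
    log_x = (mu - sigma**2 / 2) * t + sigma * z   # ln(X_t), avoids overflow
    print(mu, log_x[-1] / t_max)  # near mu - sigma^2/2: positive, 0, negative

In the boundary case μ = σ²/2, the limit of ln(X_t)/t is 0, but X_t = e^{σZ_t} itself fluctuates between values near 0 and arbitrarily large values, which is why it has no limit.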

It's interesting to compare this result with the asymptotic behavior of the mean function, given above, which depends only on the parameter μ. When the drift parameter is 0, geometric Brownian motion is a martingale.

If μ = 0, geometric Brownian motion X is a martingale with respect to the underlying Brownian motion Z.
Proof from stochastic integrals
Direct proof
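A sketch of the direct argument: with μ = 0 and s ≤ t, write X_t = X_s exp[−σ²(t − s)/2 + σ(Z_t − Z_s)]. The increment Z_t − Z_s is independent of the past and normal with mean 0 and variance t − s, so the same moment generating function computation as in the moments section gives

\[
\mathbb{E}\left(X_t \mid \mathscr{F}_s\right)
= X_s\, e^{-\sigma^2 (t - s)/2}\, \mathbb{E}\left(e^{\sigma (Z_t - Z_s)}\right)
= X_s\, e^{-\sigma^2 (t - s)/2}\, e^{\sigma^2 (t - s)/2}
= X_s
\]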

This page titled 18.4: Geometric Brownian Motion is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.

Index

A
arcsine distribution - 5.19: The Arcsine Distribution

B
Bayesian estimation - 7.4: Bayesian Estimation
Benford's law - 5.39: Benford's Law
Bernoulli Trials - 11.1: Introduction to Bernoulli Trials
Bertrand's paradox - 10.2: Bertrand's Paradox
Bertrand's problem - 10.2: Bertrand's Paradox
beta distribution - 5.17: The Beta Distribution
beta function - 5.17: The Beta Distribution
beta prime distribution - 5.18: The Beta Prime Distribution
binomial distribution - 2.5: Independence
bold play - 13.10: Bold Play
bonus number - 13.7: Lotteries
Brownian motion - 18: Brownian Motion; 18.1: Standard Brownian Motion

C
Chebyshev's inequality - 4.3: Variance
chi distribution - 5.9: Chi-Square and Related Distribution
conditional expected value - 4.7: Conditional Expected Value
conditional probability - 2.4: Conditional Probability
continuous distributions - 3.2: Continuous Distributions
Correlation - 2.4: Conditional Probability
craps - 13.4: Craps

D
disjointness - 2.5: Independence
distribution function - 3.6: Distribution and Quantile Functions; 3.9: General Distribution Functions

E
Ehrenfest chains - 16.8: The Ehrenfest Chains
equivalence class - 1.5: Equivalence Relations
extremal elements - 1.4: Partial Orders

F
Feller Markov processes - 16.2: Potentials and Generators for General Markov Processes

G
gamma distribution - 5.8: The Gamma Distribution; 14.3: The Gamma Distribution
gamma function - 5.8: The Gamma Distribution
geometric distribution - 11.3: The Geometric Distribution

H
Hasse graph - 1.4: Partial Orders

I
incomplete beta function - 5.17: The Beta Distribution
independence - 2.5: Independence
infinitely divisible distributions - 5.4: Infinitely Divisible Distributions
isomorphism - 1.4: Partial Orders

J
joint distributions - 3.4: Joint Distributions

K
keno - 13.7: Lotteries
kernels - 4.13: Kernels and Operators
Kolmogorov axioms - 2.3: Probability Measures
kurtosis - 4.4: Skewness and Kurtosis

L
Lévy distribution - 5.16: The Lévy Distribution
law of the iterated logarithm - 18.1: Standard Brownian Motion
likelihood ratio - 9.5: Likelihood Ratio Tests
lottery - 13.7: Lotteries

M
Martingales - 17: Martingales
matching problem - 12.5: The Matching Problem
metric spaces - 1.10: Metric Spaces
moments - 7.2: The Method of Moments
Monty Hall problem - 13.6: The Monty Hall Problem
multivariate normal distribution - 5.7: The Multivariate Normal Distribution

N
normal distribution - 5.6: The Normal Distribution
normal model - 9.2: Tests in the Normal Model

O
operators - 4.13: Kernels and Operators

P
Pólya's urn process - 12.8: Pólya's Urn Process; 17.1: Introduction to Martingales
Pareto distribution - 3.2: Continuous Distributions; 5.36: The Pareto Distribution
partial orders - 1.4: Partial Orders
poker dice - 13.3: Simple Dice Games
power series distributions - 5.5: Power Series Distributions

Q
queuing theory - 16.22: Continuous-Time Queuing Chains

R
random experiment - 2.1: Random Experiments
Rayleigh distribution - 5.14: The Rayleigh Distribution
Red and Black game - 13.8: The Red and Black Game
roulette - 13.5: Roulette

S
second central moment - 4.3: Variance
second moment - 4.3: Variance
skewness - 4.4: Skewness and Kurtosis
stable distributions - 5.3: Stable Distributions
standard triangle distribution - 5.24: The Triangle Distribution
Stirling's approximation - 5.8: The Gamma Distribution
strong law of large numbers - 2.6: Convergence
Student t distribution - 5.10: The Student t Distribution

T
The F distribution - 5.11: The F Distribution
timid play - 13.9: Timid Play

V
variance - 4.3: Variance

W
Wald distribution - 5.37: The Wald Distribution
weak law of large numbers - 2.6: Convergence
Weibull distribution - 5.38: The Weibull Distribution
Wiener process - 18.1: Standard Brownian Motion

Z
zeta distribution - 5.40: The Zeta Distribution
Zipf distribution - 5.40: The Zeta Distribution
Detailed Licensing
Overview
Title: Probability, Mathematical Statistics, and Stochastic Processes (Siegrist)
Webpages: 226
All licenses found:
CC BY 2.0: 98.7% (223 pages)
Undeclared: 1.3% (3 pages)

By Page
Probability, Mathematical Statistics, and Stochastic Processes (Siegrist) - CC BY 2.0
Front Matter - CC BY 2.0
TitlePage - CC BY 2.0
InfoPage - CC BY 2.0
Introduction - CC BY 2.0
Table of Contents - Undeclared
Licensing - Undeclared
Table of Contents - CC BY 2.0
Object Library - CC BY 2.0
Credits - CC BY 2.0
Sources and Resources - CC BY 2.0
1: Foundations - CC BY 2.0
1.1: Sets - CC BY 2.0
1.2: Functions - CC BY 2.0
1.3: Relations - CC BY 2.0
1.4: Partial Orders - CC BY 2.0
1.5: Equivalence Relations - CC BY 2.0
1.6: Cardinality - CC BY 2.0
1.7: Counting Measure - CC BY 2.0
1.8: Combinatorial Structures - CC BY 2.0
1.9: Topological Spaces - CC BY 2.0
1.10: Metric Spaces - CC BY 2.0
1.11: Measurable Spaces - CC BY 2.0
1.12: Special Set Structures - CC BY 2.0
2: Probability Spaces - CC BY 2.0
2.1: Random Experiments - CC BY 2.0
2.2: Events and Random Variables - CC BY 2.0
2.3: Probability Measures - CC BY 2.0
2.4: Conditional Probability - CC BY 2.0
2.5: Independence - CC BY 2.0
2.6: Convergence - CC BY 2.0
2.7: Measure Spaces - CC BY 2.0
2.8: Existence and Uniqueness - CC BY 2.0
2.9: Probability Spaces Revisited - CC BY 2.0
2.10: Stochastic Processes - CC BY 2.0
2.11: Filtrations and Stopping Times - CC BY 2.0
3: Distributions - CC BY 2.0
3.1: Discrete Distributions - CC BY 2.0
3.2: Continuous Distributions - CC BY 2.0
3.3: Mixed Distributions - CC BY 2.0
3.4: Joint Distributions - CC BY 2.0
3.5: Conditional Distributions - CC BY 2.0
3.6: Distribution and Quantile Functions - CC BY 2.0
3.7: Transformations of Random Variables - CC BY 2.0
3.8: Convergence in Distribution - CC BY 2.0
3.9: General Distribution Functions - CC BY 2.0
3.10: The Integral With Respect to a Measure - CC BY 2.0
3.11: Properties of the Integral - CC BY 2.0
3.12: General Measures - CC BY 2.0
3.13: Absolute Continuity and Density Functions - CC BY 2.0
3.14: Function Spaces - CC BY 2.0
4: Expected Value - CC BY 2.0
4.1: Definitions and Basic Properties - CC BY 2.0
4.2: Additional Properties - CC BY 2.0
4.3: Variance - CC BY 2.0
4.4: Skewness and Kurtosis - CC BY 2.0
4.5: Covariance and Correlation - CC BY 2.0
4.6: Generating Functions - CC BY 2.0
4.7: Conditional Expected Value - CC BY 2.0
4.8: Expected Value and Covariance Matrices - CC BY 2.0
4.9: Expected Value as an Integral - CC BY 2.0
4.10: Conditional Expected Value Revisited - CC BY 2.0
4.11: Vector Spaces of Random Variables - CC BY 2.0
4.12: Uniformly Integrable Variables - CC BY 2.0
4.13: Kernels and Operators - CC BY 2.0
5: Special Distributions - CC BY 2.0
5.1: Location-Scale Families - CC BY 2.0
5.2: General Exponential Families - CC BY 2.0
5.3: Stable Distributions - CC BY 2.0
5.4: Infinitely Divisible Distributions - CC BY 2.0
5.5: Power Series Distributions - CC BY 2.0
5.6: The Normal Distribution - CC BY 2.0
5.7: The Multivariate Normal Distribution - CC BY 2.0
5.8: The Gamma Distribution - CC BY 2.0
5.9: Chi-Square and Related Distribution - CC BY 2.0
5.10: The Student t Distribution - CC BY 2.0
5.11: The F Distribution - CC BY 2.0
5.12: The Lognormal Distribution - CC BY 2.0
5.13: The Folded Normal Distribution - CC BY 2.0
5.14: The Rayleigh Distribution - CC BY 2.0
5.15: The Maxwell Distribution - CC BY 2.0
5.16: The Lévy Distribution - CC BY 2.0
5.17: The Beta Distribution - CC BY 2.0
5.18: The Beta Prime Distribution - CC BY 2.0
5.19: The Arcsine Distribution - CC BY 2.0
5.20: General Uniform Distributions - CC BY 2.0
5.21: The Uniform Distribution on an Interval - CC BY 2.0
5.22: Discrete Uniform Distributions - CC BY 2.0
5.23: The Semicircle Distribution - CC BY 2.0
5.24: The Triangle Distribution - CC BY 2.0
5.25: The Irwin-Hall Distribution - CC BY 2.0
5.26: The U-Power Distribution - CC BY 2.0
5.27: The Sine Distribution - CC BY 2.0
5.28: The Laplace Distribution - CC BY 2.0
5.29: The Logistic Distribution - CC BY 2.0
5.30: The Extreme Value Distribution - CC BY 2.0
5.31: The Hyperbolic Secant Distribution - CC BY 2.0
5.32: The Cauchy Distribution - CC BY 2.0
5.33: The Exponential-Logarithmic Distribution - CC BY 2.0
5.34: The Gompertz Distribution - CC BY 2.0
5.35: The Log-Logistic Distribution - CC BY 2.0
5.36: The Pareto Distribution - CC BY 2.0
5.37: The Wald Distribution - CC BY 2.0
5.38: The Weibull Distribution - CC BY 2.0
5.39: Benford's Law - CC BY 2.0
5.40: The Zeta Distribution - CC BY 2.0
5.41: The Logarithmic Series Distribution - CC BY 2.0
6: Random Samples - CC BY 2.0
6.1: Introduction - CC BY 2.0
6.2: The Sample Mean - CC BY 2.0
6.3: The Law of Large Numbers - CC BY 2.0
6.4: The Central Limit Theorem - CC BY 2.0
6.5: The Sample Variance - CC BY 2.0
6.6: Order Statistics - CC BY 2.0
6.7: Sample Correlation and Regression - CC BY 2.0
6.8: Special Properties of Normal Samples - CC BY 2.0
7: Point Estimation - CC BY 2.0
7.1: Estimators - CC BY 2.0
7.2: The Method of Moments - CC BY 2.0
7.3: Maximum Likelihood - CC BY 2.0
7.4: Bayesian Estimation - CC BY 2.0
7.5: Best Unbiased Estimators - CC BY 2.0
7.6: Sufficient, Complete and Ancillary Statistics - CC BY 2.0
8: Set Estimation - CC BY 2.0
8.1: Introduction to Set Estimation - CC BY 2.0
8.2: Estimation in the Normal Model - CC BY 2.0
8.3: Estimation in the Bernoulli Model - CC BY 2.0
8.4: Estimation in the Two-Sample Normal Model - CC BY 2.0
8.5: Bayesian Set Estimation - CC BY 2.0
9: Hypothesis Testing - CC BY 2.0
9.1: Introduction to Hypothesis Testing - CC BY 2.0
9.2: Tests in the Normal Model - CC BY 2.0
9.3: Tests in the Bernoulli Model - CC BY 2.0
9.4: Tests in the Two-Sample Normal Model - CC BY 2.0
9.5: Likelihood Ratio Tests - CC BY 2.0
9.6: Chi-Square Tests - CC BY 2.0
10: Geometric Models - CC BY 2.0
10.1: Buffon's Problems - CC BY 2.0
10.2: Bertrand's Paradox - CC BY 2.0
10.3: Random Triangles - CC BY 2.0
11: Bernoulli Trials - CC BY 2.0
11.1: Introduction to Bernoulli Trials - CC BY 2.0
11.2: The Binomial Distribution - CC BY 2.0
11.3: The Geometric Distribution - CC BY 2.0
11.4: The Negative Binomial Distribution - CC BY 2.0
11.5: The Multinomial Distribution - CC BY 2.0
11.6: The Simple Random Walk - CC BY 2.0
11.7: The Beta-Bernoulli Process - CC BY 2.0
12: Finite Sampling Models - CC BY 2.0
12.1: Introduction to Finite Sampling Models - CC BY 2.0
12.2: The Hypergeometric Distribution - CC BY 2.0
12.3: The Multivariate Hypergeometric Distribution - CC BY 2.0
12.4: Order Statistics - CC BY 2.0
12.5: The Matching Problem - CC BY 2.0
12.6: The Birthday Problem - CC BY 2.0
12.7: The Coupon Collector Problem - CC BY 2.0
12.8: Pólya's Urn Process - CC BY 2.0
12.9: The Secretary Problem - CC BY 2.0
13: Games of Chance - CC BY 2.0
13.1: Introduction to Games of Chance - CC BY 2.0
13.2: Poker - CC BY 2.0
13.3: Simple Dice Games - CC BY 2.0
13.4: Craps - CC BY 2.0
13.5: Roulette - CC BY 2.0
13.6: The Monty Hall Problem - CC BY 2.0
13.7: Lotteries - CC BY 2.0
13.8: The Red and Black Game - CC BY 2.0
13.9: Timid Play - CC BY 2.0
13.10: Bold Play - CC BY 2.0
13.11: Optimal Strategies - CC BY 2.0
14: The Poisson Process - CC BY 2.0
14.1: Introduction to the Poisson Process - CC BY 2.0
14.2: The Exponential Distribution - CC BY 2.0
14.3: The Gamma Distribution - CC BY 2.0
14.4: The Poisson Distribution - CC BY 2.0
14.5: Thinning and Superposition - CC BY 2.0
14.6: Non-homogeneous Poisson Processes - CC BY 2.0
14.7: Compound Poisson Processes - CC BY 2.0
14.8: Poisson Processes on General Spaces - CC BY 2.0
15: Renewal Processes - CC BY 2.0
15.1: Introduction - CC BY 2.0
15.2: Renewal Equations - CC BY 2.0
15.3: Renewal Limit Theorems - CC BY 2.0
15.4: Delayed Renewal Processes - CC BY 2.0
15.5: Alternating Renewal Processes - CC BY 2.0
15.6: Renewal Reward Processes - CC BY 2.0
16: Markov Processes - CC BY 2.0
16.1: Introduction to Markov Processes - CC BY 2.0
16.2: Potentials and Generators for General Markov Processes - CC BY 2.0
16.3: Introduction to Discrete-Time Chains - CC BY 2.0
16.4: Transience and Recurrence for Discrete-Time Chains - CC BY 2.0
16.5: Periodicity of Discrete-Time Chains - CC BY 2.0
16.6: Stationary and Limiting Distributions of Discrete-Time Chains - CC BY 2.0
16.7: Time Reversal in Discrete-Time Chains - CC BY 2.0
16.8: The Ehrenfest Chains - CC BY 2.0
16.9: The Bernoulli-Laplace Chain - CC BY 2.0
16.10: Discrete-Time Reliability Chains - CC BY 2.0
16.11: Discrete-Time Branching Chain - CC BY 2.0
16.12: Discrete-Time Queuing Chains - CC BY 2.0
16.13: Discrete-Time Birth-Death Chains - CC BY 2.0
16.14: Random Walks on Graphs - CC BY 2.0
16.15: Introduction to Continuous-Time Markov Chains - CC BY 2.0
16.16: Transition Matrices and Generators of Continuous-Time Chains - CC BY 2.0
16.17: Potential Matrices - CC BY 2.0
16.18: Stationary and Limiting Distributions of Continuous-Time Chains - CC BY 2.0
16.19: Time Reversal in Continuous-Time Chains - CC BY 2.0
16.20: Chains Subordinate to the Poisson Process - CC BY 2.0
16.21: Continuous-Time Birth-Death Chains - CC BY 2.0
16.22: Continuous-Time Queuing Chains - CC BY 2.0
16.23: Continuous-Time Branching Chains - CC BY 2.0
17: Martingales - CC BY 2.0
17.1: Introduction to Martingales - CC BY 2.0
17.2: Properties and Constructions - CC BY 2.0
17.3: Stopping Times - CC BY 2.0
17.4: Inequalities - CC BY 2.0
17.5: Convergence - CC BY 2.0
17.6: Backwards Martingales - CC BY 2.0
18: Brownian Motion - CC BY 2.0
18.1: Standard Brownian Motion - CC BY 2.0
18.2: Brownian Motion with Drift and Scaling - CC BY 2.0
18.3: The Brownian Bridge - CC BY 2.0
18.4: Geometric Brownian Motion - CC BY 2.0
Back Matter - CC BY 2.0
Index - CC BY 2.0
Glossary - CC BY 2.0
Detailed Licensing - Undeclared
