
Department of Economics and Related Studies

MODULE TITLE: Probability I

MODULE NUMBER: ECO00011C.

Academic Year 2018 - 2019

Lecturer: Dr Francesco Bravo

1
Lecture notes and reading

The lecture notes that have been prepared contain many definitions and
basic results, so that time during the lectures can be devoted to listening and
understanding, rather than copying formulae. These notes are, however, not
intended to give detailed discussions with examples. Doing examples and read-
ing more detailed discussions will be needed to gain a sound understanding of
the topics to be covered. It is therefore important to use a text, as well as the
lecture notes.

2
Aims:

Probability I introduces students to some basic ideas in probability and
applications in economics.

Objectives:

Students will be able to compute marginal and conditional probabilities, for
events and for random variables.
Students will also be able to describe and apply the two core concepts in
introductory probability, the Law of Large Numbers and the Central Limit
Theorem.
By the application of a probabilistic model to simple examples, students
will be able to show that many economic phenomena may only be understood
in terms of a choice in a probabilistic environment. Examples will include the
choice of an insurance contract or the portfolio allocation for an investment,
and others.
Probability theory also underpins statistical and econometric inference,
which is studied in subsequent modules. By studying simple proofs in detail,
students will learn how to present arguments with mathematical precision. Stu-
dents will then be able to characterize and interpret properties of the statisti-
cal/econometric estimates in terms of their distributional assumptions.

Main References:

A book that is suitable as the main course text for most of the material is:
Miller, Irwin and Miller, Marylees, “John E. Freund’s Mathematical Sta-
tistics with Applications”, 8th ed., Pearson Prentice-Hall, 2013 (Library shelf
location SF FRE).
The book by Miller and Miller has many exercises, both solved and not
solved. A book that has other exercises (including many solved ones) is:
Spiegel, Murray R., Srinivasan, Alu and Schiller, John J. “Schaum’s Out-
line of Theory and Problems of Probability and Statistics”, 2nd ed., McGraw-
Hill, 2000, ISBN 0071350047 (Library shelf location S9 SPI).

3
Structure

This course is divided into four sections:

1. Preliminary mathematics and Introduction to probability;

2. Probability distributions and Random variables;


3. Mathematical expectations;
4. Special probability distributions.

A fifth section, with applications to economics and finance, is in a separate
dedicated booklet.

4
Section 1
Preliminary mathematics and introduction to probability

Aims and objectives

to introduce some useful mathematical operators


to introduce basic concepts concerning probability

References

Chapter 2.

5
Some mathematical operators

Suppose we want to measure some characteristics, say weight, in kg, of 50 people.


It is then useful to have some shorthand notation.
We will use x1, ..., x50, where xi is the characteristic for individual i, i = 1, ..., 50.
For a generic set of n values, we write x1, ..., xn or {x1, ..., xn}.
If two characteristics are measured simultaneously, for example weight and height, we
can use (x1, y1), ..., (xn, yn), with xi and yi being the characteristics (weight and
height) for individual i, i = 1, ..., n.
If one characteristic is measured for m classes of n individuals, for example gender
and weight, we use x11, ..., x1n, x21, ..., x2n, ..., xm1, ..., xmn, so for example x11,
..., x1n are the weights of the n female individuals, and x21, ..., x2n are the weights
of the n male individuals.

6
We are often interested in summarising the characteristics of the whole sample in
one single number, or in other transformations of numbers.
For a sequence of values X_1, X_2, ..., X_n, the summation operator ∑ is
defined by

∑_{i=1}^{n} X_i = X_1 + X_2 + ... + X_n

n may be finite or infinite (∞);

sometimes we do not show the limits of the summation, and use either
∑_i X_i or ∑ X_i;

double summation: for X_ij (1 ≤ i ≤ n, 1 ≤ j ≤ m),

∑_{i=1}^{n} ∑_{j=1}^{m} X_ij = ∑_{i=1}^{n} (X_i1 + X_i2 + ... + X_im) = ∑_{j=1}^{m} (X_1j + X_2j + ... + X_nj)
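As an illustration (a sketch added to these notes, not part of the original text), the single and double summation definitions can be checked on a small array; the numbers below are arbitrary.

```python
import numpy as np

# a small 2x3 array of values X_ij (n = 2 rows, m = 3 columns)
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# single summation over the first row: X_11 + X_12 + X_13
row_sum = X[0, 0] + X[0, 1] + X[0, 2]
print(row_sum == X[0, :].sum())                               # True

# double summation: summing row totals or column totals gives the same result
print(X.sum(axis=1).sum() == X.sum(axis=0).sum() == X.sum())  # True
```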

7
Rules of summations

if a and b are constants,

∑_{i=1}^{n} a = na,    ∑_{i=1}^{n} bX_i = b ∑_{i=1}^{n} X_i;

Verify this directly:

∑_{i=1}^{n} a = (a + a + ... + a)  [n times]  = na

∑_{i=1}^{n} bX_i = (bX_1 + ... + bX_n) = b ∑_{i=1}^{n} X_i

8
We can also combine these, for example for

Z_i = a + bX_i + cY_i,

for a, b, c constants,

∑_{i=1}^{n} Z_i = na + b ∑_{i=1}^{n} X_i + c ∑_{i=1}^{n} Y_i

We can also apply the operator ∑_{i=1}^{n} to non-linear functions g(X_i):

∑_{i=1}^{n} g(X_i) = g(X_1) + ... + g(X_n)

if, for example,

g(X_i) = X_i², then ∑_{i=1}^{n} X_i² = X_1² + ... + X_n²;

g(X_i) = (X_i + a)², then ∑_{i=1}^{n} (X_i + a)² = ∑_{i=1}^{n} X_i² + 2a ∑_{i=1}^{n} X_i + na².

9
Errors to avoid:

∑_{i=1}^{n} X_i² ≠ (∑_{i=1}^{n} X_i)²

Verify this setting n = 2, X_1 = 1, X_2 = 9:
then ∑_{i=1}^{n} X_i² = 1² + 9² = 82;  (∑_{i=1}^{n} X_i)² = (1 + 9)² = 100.

∑_{i=1}^{n} X_i Y_i ≠ ∑_{i=1}^{n} X_i ∑_{i=1}^{n} Y_i

Verify this setting n = 2, (X_1 = 1, Y_1 = 2), (X_2 = 2, Y_2 = 5):
then ∑_{i=1}^{n} X_i Y_i = (1·2 + 2·5) = 12;  ∑_{i=1}^{n} X_i ∑_{i=1}^{n} Y_i = (1 + 2)(2 + 5) = 21.

10
∑_{i=1}^{n} (X_i / Y_i) ≠ (∑_{i=1}^{n} X_i) / (∑_{i=1}^{n} Y_i)

Verify this setting n = 2, (X_1 = 1, Y_1 = 2), (X_2 = 2, Y_2 = 5):
then ∑_{i=1}^{n} X_i/Y_i = 1/2 + 2/5 = 9/10;  (∑_{i=1}^{n} X_i) / (∑_{i=1}^{n} Y_i) = (1 + 2)/(2 + 5) = 3/7.
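The three non-identities above are easy to confirm numerically. A minimal Python sketch (added for illustration, using toy numbers similar to those in the text):

```python
X = [1, 2]
Y = [2, 5]

sum_of_squares = sum(x**2 for x in X)              # 1^2 + 2^2 = 5
square_of_sum = sum(X)**2                          # (1 + 2)^2 = 9
print(sum_of_squares, square_of_sum)

sum_of_products = sum(x*y for x, y in zip(X, Y))   # 1*2 + 2*5 = 12
product_of_sums = sum(X) * sum(Y)                  # 3 * 7 = 21
print(sum_of_products, product_of_sums)

sum_of_ratios = sum(x/y for x, y in zip(X, Y))     # 1/2 + 2/5 = 9/10
ratio_of_sums = sum(X) / sum(Y)                    # 3/7
print(sum_of_ratios, ratio_of_sums)
```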

11
For a sequence of values X_1, X_2, ..., X_n, the sample average (sample
mean) X̄ is defined as

X̄ = (1/n) ∑_{i=1}^{n} X_i.

Alternative notation: in some references the notation X̄_n is also used.

From the rules of summation,

if Y_i = a + bX_i, then Ȳ = a + bX̄.

Notice that, setting a = −X̄, b = 1 (i.e. Y_i = X_i − X̄), then Ȳ = 0, i.e.

(1/n) ∑_{i=1}^{n} (X_i − X̄) = 0.

12
Factorial operator

For any positive integer k, the factorial operator is defined so that

k! = k (k − 1) ··· 1;

for k = 0, the value 0! = 1 is defined by convention.

13
Some examples with solutions:
Let
      X_1    X_2    X_3    X_4    X_5
      5      −6     3.5    4      −2.1
then

∑_{i=1}^{5} X_i = 5 − 6 + 3.5 + 4 − 2.1 = 4.4
i=1

14
Let
             j = 1    j = 2    j = 3    j = 4    j = 5
      X_1j   5        −6       3.5      4        −2.1
      X_2j   −0.2     2        1        0        −0.5
then

∑_{i=1}^{2} ∑_{j=1}^{5} X_ij = 5 − 6 + 3.5 + 4 − 2.1 − 0.2 + 2 + 1 + 0 − 0.5 = 6.7.

Factorial operator:
4! = 4 · 3 · 2 · 1 = 24.

15
F Example/Practice.

1. Write each of the following in full, without using the summation operator
notation: ∑_{i=1}^{3} x_i;  ∑_{i=1}^{3} (2x_i − 8);  ∑_{i=1}^{3} (x_i − 2)²;  and ∑_{i=1}^{3} x_i y_i.

2. If ∑_{i=1}^{3} x_i = 6 and ∑_{i=1}^{3} x_i² = 20, find the values of the following: ∑_{i=1}^{3} (2x_i − 8);
∑_{i=1}^{3} (x_i − 2)²;  and ∑_{i=1}^{3} x_i (x_i − 2).

3. Show that for the sample average X̄ it holds that

(1/n) ∑_{i=1}^{n} (X_i − X̄) = 0.

16
1 Probability

Set theory for probability. Random experiment

an experiment that may result in two or more different outcomes, with uncer-
tainty as to which will be observed.
Example: throwing a die.
Example: tossing a coin twice.

17
Sample space
the set of all the potential outcomes of the experiment, usually denoted by S.
Example: in the example of throwing a die, S = {1, 2, 3, 4, 5, 6}.
Example: in the example of tossing a coin twice, S = {HH, TH, HT, TT}.
Notice that in the example of tossing a coin twice we treated TH and HT
as two different outcomes, so they are both listed in S.
Notice that more than one sample space may describe all the potential
outcomes: in the example of tossing a coin twice, we could have also used
{(same side), (opposite sides)}.

18
Event
a subset of the sample space.
Example: in the example of throwing a die, three events are A = {score less than 4} =
{1, 2, 3}, B = {score even} = {2, 4, 6}, C = {score at least 5} = {5, 6}.

The concepts of "outcome", "sample space" and "event" are specific to prob-
ability theory. In set theory, the set containing all the elements is called the Uni-
verse.

19
In many cases we are interested in events that are combinations of two or
more events. Viewing the events as sets, we may represent these combinations
of events using operations that were originally de…ned for set theory, such as
taking union, intersection or complements.

Intersection of sets
denoted by A ∩ B, for A ⊆ S, B ⊆ S: the set of elements that belong to both A
and B.
Example: in the example of throwing a die, A ∩ B = {2}.

Union of sets
denoted by A ∪ B, for A ⊆ S, B ⊆ S: the set of elements that belong to A, or to B,
or to both.
Example: in the example of throwing a die, A ∪ B = {1, 2, 3, 4, 6}.

20
Difference of sets
for A ⊆ S, B ⊆ S, indicated as A − B: the set of elements that are in A but
not in B.
Example: in the example of throwing a die, A − B = {1, 3}.
Alternative notation: some references also use the notation A\B to indicate the
difference between two sets A and B.

Empty set
a set with no elements in it. Indicated as ∅.

Disjoint sets
for A ⊆ S, B ⊆ S: A and B are disjoint when they have no elements in common (so A ∩ B = ∅).
Mutually exclusive events. Two events having no outcome in common
(i.e., "mutually exclusive" is a definition that is only used in probability theory,
for what is commonly called "disjoint sets" in set theory).
Example: in the example of throwing a die, A ∩ C = ∅.

21
Complement set
denoted by A′, for A ⊆ S: the set of elements in S but not in A (A′ = S − A).
In the previous example, A′ = {4, 5, 6}.
Alternative notations: some references also use the notations A^c or Ā to indicate
the complement of a set. We will use A′ in these lecture notes.

We can also combine these set operations, for example A′ ∩ B, ...

Some interesting formulas (De Morgan's laws, followed by the distributive laws)

(A ∩ B)′ = A′ ∪ B′,
(A ∪ B)′ = A′ ∩ B′,
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C),
A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).

Verify these formulas in the example of throwing a die.

Countability of sets

22
It is convenient to classify sets according to the number of elements that
they contain.

A set is said to be countable if the number of elements contained in the
set can be matched one to one with the whole numbers. For example, the
set of the non-negative integers, {0, 1, 2, ...}, is countable.
The set of all the rational numbers, {1, 2, 1/2, 1/3, 3, 4, 3/2, 2/3, 1/4, ...}, is
also countable.
In our examples, the set of all the outcomes of throwing a die is also
countable.
A set is said to be uncountable if it is not countable. For example, the
set of all the real numbers in [0, 1] is uncountable.

23
The probability model. "Classical approach": if there are h possible out-
comes that lead to an event A, and n outcomes overall, and if all the outcomes are
equally likely, then the probability of that event is h/n.
"Frequency approach": we want probability to replicate the percentage of
occurrences of an uncertain event when an experiment takes place many times.
Neither of these approaches, however, leads to a satisfactory definition:
"equally likely" means "with the same probability", so the definition uses the
very concept that it is supposed to define; as for the frequency approach,
one should first prove that the ratio occurrences/experiments converges to a
well-behaved limit as the number of experiments increases to infinity. In order
to overcome these limitations, probability is defined in an axiomatic ("mathe-
matical") way, and the properties stated in the classical and frequency approaches
are derived as consequences of this mathematical definition.

24
The postulates (axioms) of probability. Let S be a sample space com-
posed of n < ∞ outcomes, and let A, B be two generic events such that A ⊆ S,
B ⊆ S. Probability, denoted P(·), is a function that associates to any A, B a
number such that

Postulate 1: P(A) ≥ 0
Postulate 2: P(S) = 1
Postulate 3: P(A ∪ B) = P(A) + P(B) if A ∩ B = ∅

Note: this definition requires n < ∞.

25
When n = ∞ is possible, Postulate 3 becomes

P(A_1 ∪ A_2 ∪ ...) = P(A_1) + P(A_2) + ...   if A_i ∩ A_j = ∅ for j ≠ i.

Note: A_i ∩ A_j = ∅ for j ≠ i means that A_i and A_j are mutually exclusive;

Note: A_1 ∪ A_2 ∪ ... is also indicated as ∪_{i=1}^{∞} A_i, and Postulate 3 is then also
stated as

P(∪_{i=1}^{∞} A_i) = ∑_{i=1}^{∞} P(A_i)   if A_i ∩ A_j = ∅ for j ≠ i.

Note: when n = ∞, it is possible that S is uncountable. In that case, the
collection of all the subsets of S is too large for probabilities to be assigned
reasonably to all its members. Probability is then only defined for some sets A_i
that satisfy further properties. This is a very delicate problem and we do not
discuss it.
In the rest of the notes we will assume that any event A_i we consider is such
that P(A_i) can be assigned.

26
Some useful results
(implications of the postulates of probability)
If A ⊆ B, then

(i) P(A) ≤ P(B)
(ii) P(B − A) = P(B) − P(A).

For two generic A and B,

(iii) 0 ≤ P(A) ≤ 1;
(iv) P(A′) = 1 − P(A);
(v) P(∅) = 0;
(vi) P(A ∪ B) = P(A) + P(B) − P(A ∩ B);
(vii) P(A) = P(A ∩ B) + P(A ∩ B′).

27
Proof of (vi).
Let C = (A ∩ B). Rewriting A = (A − C) ∪ C, notice that the sets (A − C) and
C are disjoint, so by Postulate 3, P(A) = P(A − C) + P(C). In the same
way, we can also prove that P(B) = P(B − C) + P(C). Finally, notice that
A ∪ B = (A − C) ∪ C ∪ (B − C) and that the three sets (A − C), C, (B − C) are
all disjoint (each pair of sets is disjoint). Therefore, by Postulate 3, P(A ∪ B) =
P(A − C) + P(C) + P(B − C). Adding P(C) to both sides of this identity,
we obtain P(A ∪ B) + P(C) = [P(A − C) + P(C)] + [P(B − C) + P(C)], which
means that P(A ∪ B) + P(A ∩ B) = P(A) + P(B). The result (vi) follows by
rearranging terms in this expression.

Proof of (vii).
Notice that A = (A ∩ B) ∪ (A ∩ B′), and that (A ∩ B) and (A ∩ B′) are
disjoint sets. The result then follows from Postulate 3.

28
Note: the formula (vi) is also known as "the addition rule".

Note: the implications of the postulates of probability can be generalised


to more than two sets.
Example: the addition rule for three events. If A, B and C are three
events in S,

P(A ∪ B ∪ C) = P(A) + P(B) + P(C)
               − P(A ∩ B) − P(A ∩ C) − P(B ∩ C)
               + P(A ∩ B ∩ C).
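As an illustration (a sketch added to these notes), the two-event addition rule (vi) can be verified by brute-force enumeration of the die example, with A = {1, 2, 3}, B = {2, 4, 6} and each outcome having probability 1/6:

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}           # sample space of a fair die
A = {1, 2, 3}                    # score less than 4
B = {2, 4, 6}                    # score even

def P(event):
    # classical probability with equally likely outcomes
    return Fraction(len(event & S), len(S))

# addition rule (vi): P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
print(P(A | B) == P(A) + P(B) - P(A & B))   # True
```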

F Example/Practice. What is the probability of drawing a Queen (Q) or a


Club (C), in a single drawing from a pack of cards?

29
Example/notation: in the example of throwing a die, we will use the notation
"P({i})" as a shorthand for "P({the outcome is i})". We will also assume that
the die is "fair" (i.e., the die has not been tampered with, and each one of the
six outcomes has the same probability). Therefore, the probability is assigned
as P({i}) = 1/6 for i = 1, ..., 6 (i.e., P({1}) = 1/6, ..., P({6}) = 1/6).

Prove the implications of the postulates of probability also using Venn
diagrams, and verify them in the example of throwing a fair die.

30
Conditional probability. Consider the example of throwing a fair die. In
this case, P(A) = P({1, 2, 3}) = 0.5. However, suppose the die is thrown and,
although you are not told the result, you are informed that the score is even.
What is the probability of a score less than 4 in this case?
For A, B being the events defined in the example of throwing a fair die, we
are interested in the probability of "A given B" or of "A conditional upon B":
this is indicated as
P(A|B).
Upon knowing that the outcome is even, the sample space changes to

S = {2, 4, 6}

only, so P({2}) = 1/3, P({4}) = 1/3, P({6}) = 1/3. Of these, only "2" is less
than 4, so P(A|B) = 1/3.

31
For two generic sets A, B, with P(A|B) we look at elements that are in
A ∩ B; however, because we have changed the sample space, we have to normalise our
probabilities so that the new sample space has probability 1. Therefore,

P(A|B) = P(A ∩ B) / P(B)

(notice that this new assignment of probability still satisfies the postulates of
probability).

32
Multiplicative rule

P(A ∩ B) = P(A|B) P(B)
P(A ∩ B) = P(B|A) P(A).

Notation/terminology: we already said that P(A|B) is called a conditional prob-
ability; P(A ∩ B) in this context is called a "joint probability" and P(B) is called
a "marginal probability". In the same way, P(B|A) is also a conditional proba-
bility, and P(A) is a marginal probability.

33
Independent events
Two events A and B are independent if and only if

P(A ∩ B) = P(A) P(B)

which implies that

P(A|B) = P(A),   P(B|A) = P(B).

Two mutually exclusive events A, B such that P(A) > 0, P(B) > 0 can-
not be independent, because mutually exclusive means P(A ∩ B) = 0, while
independence means P(A ∩ B) = P(A) P(B), which, since we also assumed
P(A) > 0, P(B) > 0, implies P(A ∩ B) > 0.

34
F Example/Practice. It is known that 48% of all undergraduates are female
and that 15% of all undergraduates are taking degrees involving economics. It
is also known that 6% of all undergraduates are female and taking degrees that
involve economics. Let A denote the event "undergraduate is female" and B
denote the event "undergraduate is taking a degree involving economics".
(a) What is the value of the conditional probability P(A|B)?
(b) Are the events A and B statistically independent?

35
F Example/Practice. Bookings at the indoor court
The indoor court of the Gym can be used for three activities: basketball
(BB), badminton (BM) and volleyball (VB); these are practised by two groups
of people: undergraduate students (UG) and postgraduate students (PG). The
percentage of bookings last year, divided per student and activity, are

Activity
Student BB BM VB
UG 0.30 0.19 0.21
PG 0.09 0.12 0.09

36
For any given booking of that year,
What is the probability that a court was booked by PG to play BM?
What is the probability that a court was booked to play BM?
What is the probability that a court was booked to play BM given that the
booking was made by a PG?
Are the events "the court was booked by UG" and "the court was booked
to play BM" mutually exclusive?
Are the events "the court was booked by PG" and "the court was booked
to play BM" independent?

37
Bayes' rule

P(A|B) = P(B|A) P(A) / [ P(B|A) P(A) + P(B|A′) P(A′) ]

Proof:

P(A|B) = P(A ∩ B) / P(B)
       = P(A ∩ B) / [ P(B ∩ A) + P(B ∩ A′) ]
       = P(B|A) P(A) / [ P(B|A) P(A) + P(B|A′) P(A′) ]

This is also known as Bayes' theorem.

38
Example: disease detection
A blood test is available to detect the presence of a disease, which is present
in 0.5% of the population. The test gives a positive outcome in 99% of the people
who have the disease (it correctly detects the disease with probability 99%), and
in 5% of the people who do not have the disease (it incorrectly detects the disease
with probability 5%).
What is the probability of having the disease, if the outcome of the test is
positive?

39
Let A be the event {the individual has the disease}, and B the event {the outcome of the test is positive};
then
P(A) = 0.005,  P(B|A) = 0.99,  P(B|A′) = 0.05,
and we are interested in P(A|B). We can compute P(A′) = 0.995, so, by Bayes'
rule,

P(A|B) = P(B|A) P(A) / [ P(B|A) P(A) + P(B|A′) P(A′) ]
       = (0.99 · 0.005) / (0.99 · 0.005 + 0.05 · 0.995)
       ≈ 0.0905
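The arithmetic of this example can be reproduced in a few lines of Python (an illustrative sketch; the numbers are the ones given above):

```python
p_disease = 0.005           # P(A): prevalence of the disease
p_pos_given_disease = 0.99  # P(B|A): test detects the disease
p_pos_given_healthy = 0.05  # P(B|A'): false positive rate

# total probability of a positive test, P(B)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' rule: P(A|B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 4))   # 0.0905
```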

40
We can generalise Bayes' rule to sets A_j such that

S = ∪_{j=1}^{n} A_j,  where A_j ∩ A_i = ∅ (for j ≠ i).

Then,

P(A_k|B) = P(A_k ∩ B) / P(B)
         = P(A_k ∩ B) / ∑_{j=1}^{n} P(A_j ∩ B)

so

P(A_k|B) = P(B|A_k) P(A_k) / ∑_{j=1}^{n} P(B|A_j) P(A_j).

41
Example: a supermarket receives tomato tin cans from three different suppli-
ers, named A, B, C, in proportions of 30%, 20% and 50% of the total supplies.
Some of these cans have a dent, and should be discarded: the probability of
being dented is 5% for cans from supplier A, 1% for cans from B, 3% for cans
from C. What is the probability that the dented can currently on the shelf is
from A?

P(A) = 0.3;  P(B) = 0.2;  P(C) = 0.5;
P(D|A) = 0.05;  P(D|B) = 0.01;  P(D|C) = 0.03.

Then,

P(D) = 0.05 · 0.3 + 0.01 · 0.2 + 0.03 · 0.5 = 0.032

P(A|D) = (0.05 · 0.3) / 0.032 ≈ 0.47

42
Section 2
Random variables and probability distributions

Aims and objectives

to introduce random variables to summarise the information content of a


random experiment
to characterise the functions that can be used to assign probabilities for
random variables

References

Chapter 3, Sections 3.1 to 3.7.

43
Random variables

In some experiments, the outcome is a number; otherwise, it may be trans-


formed into a number. For example, when tossing a coin, we could transform
the H or T outcomes into 1 and 0 respectively.
In many cases, transforming the outcome in a number may be advantageous,
because the probability tables may be summarised by a generic function.
In general, more than one random variable can be associated to each out-
come.

44
Random variable, for a finite dimensional space S:
a random variable is a function that associates the outcomes in S to
numbers.

Note: this definition may be generalised to comprise cases in which the
dimension of S is not finite, but still countable, and even cases in which the
dimension is not countable. When the dimension of S is finite or infinite
but countable, the random variable is said to be discrete; when the
dimension of S is not countable, it is said to be continuous.

Notice that the random variable is neither a variable (it is a function) nor
random (the association is certain).

45
A random variable is indicated with the use of an upper case letter. The
value the random variable takes when the experiment is run is called the realisation,
and it is indicated with the use of a lower case letter. This is also called the "image
of the random variable".
Example of throwing a fair die: X is "score", x is 1, 2, 3, 4, 5 or 6. Notice
that the dimension of S is finite, so X is a discrete random variable.
Example: height of individuals in a population. In this case, if we set X as
"height in metres", X can take virtually any value between 0 and, say, 3. In
this example, X is therefore a continuous random variable.

46
Probability distributions
for discrete random variables. Let X be a discrete random variable which
may take values {x_1, x_2, x_3, ...}, and such that we can assign a probability to
each realisation X = x_1, X = x_2, ... . We can then introduce a function f(x)
such that
f(x_j) = P(X = x_j) for x_j ∈ {x_1, x_2, x_3, ...},
while f(x) = 0 for x ∉ {x_1, x_2, x_3, ...}.
This function f(x) is known as the probability function, and it is also referred
to as the probability distribution, or as the probability mass function, because the
individual f(x_j) are probability masses.
By the postulates of probability, it follows that

(i) f(x) ≥ 0
(ii) ∑_x f(x) = 1

where ∑_x means that the sum is taken over all the possible values of x.

47
Alternative notation: given the definition of f(x_j), in some references these
properties are stated as

(i) f(x_j) ≥ 0
(ii) ∑_j f(x_j) = 1.

This seems more consistent with the notation commonly used for the summation
operator, but in these lecture notes, when possible, we follow the notation of the
main reference book.

Alternative notation: in some references, the notation f_X(x) (or f_X(x_j))
is used instead.

48
Examples.
When X is the score from throwing a fair die,
f(x) = 1/6 when x is 1, 2, 3, 4, 5 or 6.

When X is the number of heads from tossing a fair coin twice,
f(2) = 1/4,  f(1) = 1/2,  f(0) = 1/4.

When X is the number of heads from tossing a fair coin n times,

f(x) = [n! / (x!(n − x)!)] (1/2)^x (1/2)^(n−x)  when x is 0, 1, ..., n.

When X is the number of buses arriving at the stop per 10 minutes, if we
know that, on average, 12 buses per hour arrive,

f(x) = 2^x e^(−2) / x!  when x is 0, 1, ...
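A quick numerical sanity check of the last two probability functions (an illustrative sketch; the choice n = 10 tosses is arbitrary): both are non-negative and sum to 1 (the bus example only in the limit, so its partial sum is merely very close to 1).

```python
from math import comb, exp, factorial

n = 10                                    # number of coin tosses (arbitrary choice)
coin_pmf = [comb(n, x) * 0.5**x * 0.5**(n - x) for x in range(n + 1)]
print(sum(coin_pmf))                      # 1.0

bus_pmf = [2**x * exp(-2) / factorial(x) for x in range(50)]
print(sum(bus_pmf))                       # very close to 1
```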

49
Probability densities
for continuous random variables. When S is not countable, the random
variable can take values in a continuous subset of (−∞, ∞). In this case, it
is possible to define the probability using sets such as (a, b] (for b ≥ a). The
postulates of probability imply that, for a suitable random variable X,

P(−∞ < X < ∞) = 1,
0 ≤ P(a < X ≤ b) ≤ 1 for any b ≥ a.

Notice that this implies that P(X = c) = P(c ≤ X ≤ c) = 0.

50
Continuous random variables
A random variable is continuous if there is a probability density function,
f(·), such that, for any two a, b with a ≤ b, the probability that X occurs in the
interval (a, b] is

P(a < X ≤ b) = ∫_a^b f(x) dx.

The probability density function is often referred to as "density" or "density
function" or "pdf".
A function f(·) can serve as a density if

(i) f(x) ≥ 0 for any x, and (ii) ∫_{−∞}^{∞} f(x) dx = 1.

Note: because P(X = b) = 0 and P(X = a) = 0, it also follows that P(a ≤ X < b) =
∫_a^b f(x) dx. In general, including or excluding the boundary of the interval (a, b],
i.e. using "<" or "≤", makes no difference.

51
Example: on the beach.
Jasmine goes to the beach early in the morning. At around lunchtime, David
goes to the beach as well. The beach is 4 km long, and the probability that
Jasmine is in any part of the beach is continuously distributed with

f(x) = 0.25 if 0 ≤ x ≤ 4,
f(x) = 0 otherwise.

What is the probability that David will find Jasmine if he looks for her between
the first and second km of the beach?

52
Example: cm of rain per hour.
Let X be the rain per hour in cm. Can we model it as

f(x) = a (20 − x) if 0 < x < 20;  0 otherwise?

Requirement (i), f(x) ≥ 0: this is met for all x when a > 0;
Requirement (ii), ∫_{−∞}^{∞} f(x) dx = 1: here, we need

∫_0^{20} a (20 − x) dx = 1.

Since

∫_0^{20} a (20 − x) dx = a [20x − x²/2]_0^{20} = 200a,

requirement (ii) is met if a = 1/200 (and notice that a > 0 is met).
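As an illustrative sketch (not part of the original notes), the normalisation and the probability of a simple interval can be checked numerically with scipy:

```python
from scipy.integrate import quad

a = 1 / 200
f = lambda x: a * (20 - x)        # density of the cm of rain per hour on (0, 20)

total, _ = quad(f, 0, 20)
print(round(total, 10))           # 1.0: the density integrates to one

p, _ = quad(f, 0, 10)             # e.g. P(0 < X <= 10)
print(round(p, 4))                # 0.75
```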

53
F Example/Practice. Level of charge with one battery.
The level of charge in a battery is a random variable X which may take
values between 0 and 1 (so X = 0 means that the battery has no charge, and
X = 1 that it is fully charged), with density

f(x) = (3/2) x if x ∈ [0, 1]

and 0 elsewhere.
Compute the probabilities P(X ≤ 0.5) and P(0.1 < X ≤ 0.6).
Compute the value of x such that P(X ≤ x) = 0.5.

54
Cumulative distribution function. For any (discrete or continuous) ran-
dom variable, it is also possible to define a Cumulative distribution func-
tion, also known simply as the distribution function or as the "cdf".
For any x, this is defined as F(x) such that

F(x) = P(X ≤ x).

For a discrete random variable,

F(x) = ∑_{t ≤ x} f(t)

(where "∑_{t ≤ x}" means that the sum is taken over all the values up to, and in-
cluding, x);
for a continuous random variable,

F(x) = ∫_{−∞}^{x} f(t) dt.

55
In the example of tossing a coin twice, with X as the number of heads,

F(x) = 0 if x < 0;   F(x) = 1/4 if 0 ≤ x < 1;
F(x) = 3/4 if 1 ≤ x < 2;   F(x) = 1 if x ≥ 2.

In the example of the rain per hour,

F(x) = 0 if x < 0;
F(x) = ∫_0^x (1/200)(20 − t) dt = (1/400) x (40 − x) if 0 ≤ x < 20;
F(x) = 1 if x ≥ 20.

56
In general, a cumulative distribution function is characterised by three fea-
tures:

i) for x < y, F(x) ≤ F(y) (i.e. it is non-decreasing);

ii) lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1;

iii) lim_{h→0+} [F(x + h) − F(x)] = 0 (i.e. it is continuous from the right).

Moreover, for a discrete random variable F(x) is a step function: if X takes
values {x_1, x_2, x_3, ...}, the jumps correspond to those points {x_1, x_2, x_3, ...}, and
the size of the jump at the generic x_i is f(x_i). For a continuous random variable,
on the other hand, F(x) is a continuous function.

Alternative notation. In some references, the cumulative distribution func-
tion of X is denoted as F_X(x).

57
Joint distribution of random variables. Sometimes we are interested in
more than one event at the same time.
Example: stock and bond.
Let X be the return of a given stock, and Y the return of a given bond. The
stock return may either take value x_1 = 0 or x_2 = 10, and the bond return may
either take value y_1 = 0 or y_2 = 2. Four outcomes are then possible, and the
probability of each outcome is

                    y
                    0      2
     x    0         0.2    0.5
          10        0.2    0.1

58
Joint probability function. For two discrete random variables X and
Y, which take values on (x, y), let

f(x, y) = P(X = x, Y = y).

Then the list of the f(x, y) is the joint probability function. This is also known
as the "joint probability mass function", and as the "joint distribution".
The joint probability function for discrete random variables is such that

(i) f(x, y) ≥ 0
(ii) ∑_x ∑_y f(x, y) = 1

It is also possible to define a joint cumulative distribution function,

F(x, y) = P(X ≤ x, Y ≤ y) = ∑_{s ≤ x} ∑_{t ≤ y} f(s, t).

59
In the Stock and Bond example, f(0, 0) = 0.2, f(0, 2) = 0.5, f(10, 0) = 0.2,
f(10, 2) = 0.1; f(x, y) = 0 otherwise.

Note: the notation P(X = x, Y = y) is used here instead of the more appro-
priate P(X = x ∩ Y = y).

Alternative notation: some references specify that X and Y take values
on (x_i, y_j), and therefore define

f(x_i, y_j) = P(X = x_i ∩ Y = y_j)

and then state the conditions (i) and (ii) above as

(i) f(x_i, y_j) ≥ 0
(ii) ∑_i ∑_j f(x_i, y_j) = 1.

Alternative notation: in some references, f_XY(x, y) and F_XY(x, y) are used
instead of f(x, y) and F(x, y).

60
Marginal probability functions. The functions

g(x) = P(X = x),   h(y) = P(Y = y)

are the "marginal probability functions" ("marginal distributions") of X and
Y. They can be computed as

g(x) = ∑_y f(x, y),   h(y) = ∑_x f(x, y).

In the Stock and Bond example,

P(X = 0) = 0.2 + 0.5 = 0.7,   P(Y = 2) = 0.5 + 0.1 = 0.6,  ...

so

      x       0     10              y       0     2
      g(x)    0.7   0.3     and     h(y)    0.4   0.6
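For illustration (a sketch that is not part of the original notes), the marginals of the Stock and Bond example can be obtained by summing the joint table along rows and columns:

```python
import numpy as np

x_values = [0, 10]
y_values = [0, 2]
joint = np.array([[0.2, 0.5],    # rows: x = 0, 10; columns: y = 0, 2
                  [0.2, 0.1]])

g = joint.sum(axis=1)            # marginal of X: sum over y
h = joint.sum(axis=0)            # marginal of Y: sum over x
print(g.round(2))                # [0.7 0.3]
print(h.round(2))                # [0.4 0.6]
```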

61
Note: despite the apparently different notation (g(·) and h(·) instead of
f(·)), the marginal probability functions are probability functions, and meet all
the characteristics of probability functions.
Alternative notation. In some references f_XY(x, y) is used for the joint dis-
tribution function: this makes it then natural to use f_X(x), f_Y(y) for the
marginals.

62
When only two discrete random variables are considered, it is easy to sum-
marise joint and marginals in one table. Suppose that x ∈ {x_1, ..., x_m} (i.e.,
x only takes values x_1, ..., x_m) and y ∈ {y_1, ..., y_n}:

             y_1            y_2            ...   y_n            Totals
  x_1        f(x_1, y_1)    f(x_1, y_2)    ...   f(x_1, y_n)    ∑_y f(x_1, y) = g(x_1)
  x_2        f(x_2, y_1)    f(x_2, y_2)    ...   f(x_2, y_n)    ∑_y f(x_2, y) = g(x_2)
  ...        ...            ...            ...   ...            ...
  x_m        f(x_m, y_1)    f(x_m, y_2)    ...   f(x_m, y_n)    ∑_y f(x_m, y) = g(x_m)
  Totals     ∑_x f(x, y_1)  ∑_x f(x, y_2)  ...   ∑_x f(x, y_n)  1
             = h(y_1)       = h(y_2)       ...   = h(y_n)

Notice the "1" in the bottom-right box, resulting from ∑_y h(y) = 1 and from
∑_x g(x) = 1.

63
In the Stock and Bond Example, we can summarise joint and marginal
distributions as

                    y
                    0          2          Totals
     x    0         0.2        0.5        0.7  = g(x_1)
          10        0.2        0.1        0.3  = g(x_2)
     Totals         0.4        0.6        1
                    = h(y_1)   = h(y_2)

64
For two continuous random variables X and Y, there is a joint density
f(x, y) such that

P(a < X ≤ b, c < Y ≤ d) = ∫_a^b ∫_c^d f(x, y) dy dx

(replacing "<" with "≤" has no effect).

A function f(·) can serve as a density if

(i) f(x, y) ≥ 0
(ii) ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1

It is also possible to define a joint cumulative distribution function,

F(x, y) = P(X ≤ x, Y ≤ y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f(s, t) dt ds.

The marginal density functions, g(x) or h(y), can be obtained ("integrating
out") as

g(x) = ∫_{−∞}^{∞} f(x, y) dy,   h(y) = ∫_{−∞}^{∞} f(x, y) dx.

65
Example.
The life (in tens of years), Y, and the number of km (in 100,000), X, of a
motorcycle engine are jointly distributed as

f(x, y) = (1/15) (10 − 2x² − y² − xy) for 0 ≤ y ≤ 2, 0 ≤ x ≤ 1;
f(x, y) = 0 otherwise.

So, for example,

P(0.2 < X ≤ 0.4, 0 < Y ≤ 1) = ∫_{0.2}^{0.4} ∫_0^1 (1/15)(10 − 2x² − y² − xy) dy dx = 0.1244.

The marginals are

g(x) = ∫_0^2 (1/15)(10 − 2x² − y² − xy) dy = −(4/15) x² − (2/15) x + 52/45 for 0 ≤ x ≤ 1,
h(y) = ∫_0^1 (1/15)(10 − 2x² − y² − xy) dx = −(1/15) y² − (1/30) y + 28/45 for 0 ≤ y ≤ 2.
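The following sketch (added for illustration, not part of the original notes) uses scipy's dblquad to confirm that this joint density integrates to 1 and to reproduce the probability 0.1244:

```python
from scipy.integrate import dblquad

f = lambda y, x: (10 - 2*x**2 - y**2 - x*y) / 15     # joint density of (X, Y)

# dblquad integrates over y first (inner limits), then x (outer limits)
total, _ = dblquad(f, 0, 1, lambda x: 0, lambda x: 2)     # x in (0,1), y in (0,2)
print(round(total, 10))                                   # 1.0

p, _ = dblquad(f, 0.2, 0.4, lambda x: 0, lambda x: 1)     # P(0.2 < X <= 0.4, 0 < Y <= 1)
print(round(p, 4))                                        # 0.1244
```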

66
Given the joint distribution, it is also possible to define the marginal
cumulative distribution functions.
For discrete random variables these are

G(x) = P(X ≤ x) = ∑_{t ≤ x} g(t) = ∑_{t ≤ x} ∑_y f(t, y),
H(y) = P(Y ≤ y) = ∑_{s ≤ y} h(s) = ∑_x ∑_{s ≤ y} f(x, s)

and, for continuous random variables,

G(x) = P(X ≤ x) = ∫_{−∞}^{x} g(t) dt = ∫_{−∞}^{x} ∫_{−∞}^{∞} f(t, s) ds dt,
H(y) = P(Y ≤ y) = ∫_{−∞}^{y} h(s) ds = ∫_{−∞}^{∞} ∫_{−∞}^{y} f(t, s) ds dt.

67
Conditional probability functions. For two discrete random variables X,
Y, the lists of the probabilities P(X = x|Y = y), P(Y = y|X = x), where

P(X = x|Y = y) = P(X = x, Y = y) / P(Y = y)
P(Y = y|X = x) = P(X = x, Y = y) / P(X = x)

are the "conditional probability functions" ("conditional distributions") of X
given Y and of Y given X.
For discrete random variables with joint probability function f(x, y) and
marginals g(x), h(y), the conditional probability functions are indicated with
the notation f(x|y), w(y|x), where

f(x|y) = f(x, y) / h(y),  h(y) ≠ 0
w(y|x) = f(x, y) / g(x),  g(x) ≠ 0.

68
Notice that this definition is just the application of conditional probability:
when two events A and B are considered, we know that

P(A|B) = P(A ∩ B) / P(B)

so we have just applied this formula to the events A = {X = x}, B = {Y = y}.
However, as we may change the values of x or of y, this is a function.

Note: both the conditional probability functions (f(x|y) and w(y|x)) are
probability functions, and meet all the characteristics of probability functions.
Alternative notation. In some references, f(y|x) is used instead of w(y|x).
Moreover, when f_XY(x, y) is used for the joint probability function, it makes
it natural to use f_{X|Y}(x|y) and f_{Y|X}(y|x) for the conditional probability func-
tions.

69
Example: compute the conditional probability function of Y when X = 0 in
the Stock and Bond example:

0.2/0.7 = 0.285... ≈ 0.29,   0.5/0.7 = 0.714... ≈ 0.71

so
      y         0      2
      w(y|0)    0.29   0.71

Note: this example also shows another thing: as we fixed X (to X = 0 in
this case), w(y|x) is just a probability function for Y.

70
It is interesting to compare the conditional probability function w(y|0) to
the marginal h(y):

      y       0      2              y         0      2
      h(y)    0.4    0.6    ,       w(y|0)    0.29   0.71

We can see that the information on X is important to know something about
Y: for example, we know that if X = 0 then the probability of Y = 2 is higher
(0.71 vs 0.6) than when we know nothing about X.
We could also compute w(y|10), and in particular w(2|10) = 1/3 ≈ 0.33,
thus seeing that knowledge that X = 10 takes the probability of Y = 2 to 0.33.

71
Conditional density functions. For two continuous random variables Y
and X, with joint probability density f(x, y) and marginals g(x), h(y), the
conditional density functions f(x|y), w(y|x) are

f(x|y) = f(x, y) / h(y),  h(y) ≠ 0
w(y|x) = f(x, y) / g(x),  g(x) ≠ 0

Note: the conditional density functions are density functions, and meet all
the characteristics of density functions.
In the example of the Motorcycle Engine, if x = 0.5,

w(y|x) = f(x, y) / g(x)
       = [(1/15)(10 − 2·(1/2)² − y² − (1/2) y)] / [−(4/15)(1/2)² − (2/15)(1/2) + 52/45]
       = −(3/92)(y + 2y² − 19)

and notice that indeed

∫_0^2 −(3/92)(y + 2y² − 19) dy = 1.

72
F Example/Practice. Level of charge with two batteries.
The levels of charge in two batteries are two random variables X and Y with
joint density

f(x, y) = 2 − (x + y) if x ∈ [0, 1] and y ∈ [0, 1]

and 0 elsewhere.
Compute the probability P(X ≤ 0.5, Y ≤ 0.5).
Compute the marginal density g(x).
Compute the conditional density w(y|0.25).

73
Independently distributed random variables.
Two discrete random variables X and Y with joint probability function
f(x, y) and marginals g(x) and h(y) are independent if and only if

f(x, y) = g(x) h(y) for all x, y.

Notice that the reference to "all x, y" is important, because f(x, y) = g(x) h(y)
may hold for some combinations of x, y, and yet fail to hold for all of them.

In the same way, when X and Y are continuous random variables, with joint
probability density f(x, y) and marginals g(x) and h(y), independence implies
and is implied by

f(x, y) = g(x) h(y) for all x, y.

When X_1, ..., X_n are n independently, identically distributed ran-
dom variables, we also write X_1, ..., X_n i.i.d.

74
Notice that for discrete random variables this definition is just the applica-
tion of the definition of independence for two events: when two events A and B
are considered, we know that they are independent if

P(A ∩ B) = P(A) P(B)

so we have just applied this formula to the events A = {X = x}, B = {Y = y}.
However, we may change the values of x or of y: the definition of independent
random variables requires that P(A ∩ B) = P(A) P(B) holds for all the eligible
values.

75
Example
              y_1    y_2    y_3    g(x)
     x_1      1/4    0      1/4    1/2
     x_2      0      1/4    0      1/4
     x_3      1/4    0      0      1/4
     h(y)     1/2    1/4    1/4

Here, f(x_1, y_1) = 1/4, while g(x_1) = 1/2, h(y_1) = 1/2, so indeed f(x_1, y_1) =
g(x_1) h(y_1).
However, consider for example (x_1, y_2): f(x_1, y_2) = 0, while g(x_1) h(y_2) =
1/2 · 1/4 = 1/8.
Therefore, X and Y are not independent.

76
Example
              y_1    y_2    g(x)
     x_1      1/8    3/8    1/2
     x_2      1/8    3/8    1/2
     h(y)     1/4    3/4

Here, f(x_i, y_j) = g(x_i) h(y_j) in all the cases, so X and Y are independent.
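A small sketch (added for illustration) that checks, cell by cell, whether f(x_i, y_j) = g(x_i) h(y_j); applied to the two tables above it reports "not independent" and "independent" respectively:

```python
from fractions import Fraction as F

def is_independent(joint):
    g = [sum(row) for row in joint]            # marginal of X (row sums)
    h = [sum(col) for col in zip(*joint)]      # marginal of Y (column sums)
    return all(joint[i][j] == g[i] * h[j]
               for i in range(len(joint)) for j in range(len(joint[0])))

table1 = [[F(1, 4), F(0), F(1, 4)],
          [F(0), F(1, 4), F(0)],
          [F(1, 4), F(0), F(0)]]
table2 = [[F(1, 8), F(3, 8)],
          [F(1, 8), F(3, 8)]]

print(is_independent(table1))   # False
print(is_independent(table2))   # True
```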

77
Transformations of identically distributed random variables
Let X and Y be identically distributed random variables. Then, for any
continuous function g, g (X) and g (Y ) are also identically distributed random
variables.

Transformations of independent random variables


Let X and Y be independently distributed random variables. Then, for
any continuous function g, g (X) and g (Y ) are also independently distributed
random variables.

78
Section 3
Mathematical expectations

Aims and objectives

to introduce mathematical expectations to summarise the properties of


the random variables in only a few characteristics

References

Chapter 4, Sections 4.1 - 4.9; Chapter 8, Section 8.2.

79
It would be desirable to summarise some properties of the distribution of a
random variable in a few numbers (for example, in order to compare random
variables with different distributions).

In many cases, two such numbers are the mean and the variance.

80
Mean (expected value)

Mean of a discrete random variable

For a random variable X with P(X = x) = f(x), the expected value
E(X) is

E(X) = ∑_x f(x) x

Mean of a continuous random variable

For a random variable X with probability density f(x), the expected value
E(X) is

E(X) = ∫_{−∞}^{∞} x f(x) dx.

The mean is often denoted as μ (or μ_X).

81
Examples.
A lottery (one winner)
Suppose that you have a ticket for a lottery with a prize of £50, and that
100 tickets have been sold. Calling X the value of the ticket, the probability
function is
      x       0      50
      f(x)    0.99   0.01
and the expected value of the ticket is £0.5 (50p).
A lottery (two winners)
Suppose that there is also a second prize, with value £10. Then,
      x       0      10     50
      f(x)    0.98   0.01   0.01
and the expected value of the ticket is £0.6 (60p).

82
In the example of throwing a fair die, E(X) = 3.5.

Example: ER.
In the Accident and Emergency room of a hospital, 0, 1, 2, 3, or 4 injured people
arrive each hour, distributed according to the probability

      x       0      1      2      3      4
      f(x)    0.10   0.77   0.08   0.03   0.02

What is the expected number of injured per hour?

E(X) = 0.10·0 + 0.77·1 + 0.08·2 + 0.03·3 + 0.02·4 = 1.1.

In the example of the cm of rain per hour, E(X) = ∫_0^{20} x (1/200)(20 − x) dx = 20/3.

F Example/Practice. Compute the expected level of charge in the case of
the Level of charge with one battery.
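For illustration (a sketch, not part of the original notes), both expectations can be reproduced numerically: the ER mean as a weighted sum and the rain-per-hour mean by numerical integration.

```python
from scipy.integrate import quad

# discrete case: ER example
x_vals = [0, 1, 2, 3, 4]
probs  = [0.10, 0.77, 0.08, 0.03, 0.02]
mean_er = sum(p * x for x, p in zip(x_vals, probs))
print(round(mean_er, 4))             # 1.1

# continuous case: cm of rain per hour, f(x) = (20 - x)/200 on (0, 20)
mean_rain, _ = quad(lambda x: x * (20 - x) / 200, 0, 20)
print(round(mean_rain, 4), 20 / 3)   # approximately 6.6667 in both cases
```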

83
The expected value is a weighted average of the possible values xj with the
weights being the corresponding f (xj ).
The expected value is not necessarily the most frequent outcome, indeed
it may not be an outcome at all; it is informative about the "middle" of the
distribution, but does not necessarily split the distribution in two parts; it is
sometimes said that it is a measure of the central tendency, or the "centre of
gravity" of the distribution, because if we repeat the experiment many times
and we average the realisations, under regularity conditions that average should
be very close to the expected value.
In economics, the expected value is extremely important.
One very important application is the computation of the expected profit
for firms.

84
Variance of a random variable

Var(X) = E[(X − μ)²].

Notice that the formula for the variance is equivalent to

Var(X) = E[(X − E(X))²].

The variance is often denoted by σ² (or by σ²_X).

85
For a discrete random variable X with P(X = x) = f(x), the formula for
the variance implies that

Var(X) = ∑_x f(x) (x − μ)²

so the variance is a weighted average of the squared spreads (x − μ)², with the
weights being the corresponding f(x).

For a continuous random variable with density f(x), the formula for the
variance implies that

Var(X) = ∫_{−∞}^{∞} f(x) (x − μ)² dx.

86
The variance is a measure of the dispersion of the random variable around
μ. We can appreciate this with the example below.

Example: up and down

Consider the two discrete random variables, X and Y, with probability dis-
tributions f_X(x) and f_Y(y) respectively, as in

      x         −1     0      1            y         −1     0      1
      f_X(x)    1/3    1/3    1/3    ,     f_Y(y)    1/4    1/2    1/4

Here both random variables have the same support (i.e., they may take the same
realisations), and in both cases the expected value is 0. However,

Var(X) = 1/3·(−1)² + 1/3·0² + 1/3·1² = 2/3
Var(Y) = 1/4·(−1)² + 1/2·0² + 1/4·1² = 1/2

87
In the example of throwing a die, E[(X − E(X))²] is

(1 − 3.5)²·(1/6) + (2 − 3.5)²·(1/6) + (3 − 3.5)²·(1/6)
+ (4 − 3.5)²·(1/6) + (5 − 3.5)²·(1/6) + (6 − 3.5)²·(1/6)
= 2.9167

In the example of the ER,

0.10·(0 − 1.1)² + 0.77·(1 − 1.1)²
+ 0.08·(2 − 1.1)² + 0.03·(3 − 1.1)² + 0.02·(4 − 1.1)²
= 0.47

In the example of the cm of rain per hour,

∫_0^{20} (1/200)(20 − x)(x − 20/3)² dx = (1/200) ∫_0^{20} (20 − x)(x² − (40/3)x + 400/9) dx
= (1/200) ∫_0^{20} (−x³ + (100/3)x² − (2800/9)x + 8000/9) dx = 200/9

88
F Example/Practice. Compute the variance of the level of charge in the
case of the Level of charge with one battery.

Standard deviation
As the variance is scaled by squares of x and of μ, for comparison purposes
it is usually more interesting to use the standard deviation, often denoted as
σ (or as σ_X), instead. This is the square root of the variance:

σ = √Var(X).

89
Expectations for functions of random variables. When X is a ran-
dom variable with probabilities assigned by f(x), then g(X) (for a given func-
tion g : R → R) is also a random variable, with probabilities (still) assigned by
f(x). We can then define

E(g(X)) = ∑_x f(x) g(x)

or, for continuous random variables and for a generic function g,

E(g(X)) = ∫_{−∞}^{∞} f(x) g(x) dx.

Note: here g(·) indicates a function of X: do not confuse it with the
marginal density of f(x, y) (for which the reference book uses the same notation
g(x)).

90
In Economics we are often interested in the Utility Function, a function to
represent preferences over goods or services. If the possibility to receive these
goods or services is uncertain, then Utility is a function of a random variable.
In Probability and Statistics, one case of interest is when g(X) = (X − μ)^r,
for r some positive integer, in which case E(g(X)) is the "rth moment about
the mean": as a special case of this, the variance is the second moment about
the mean.

It is also of interest when g(X) = X^r, for r some positive integer, in which
case E(g(X)) is the "rth raw moment" (or the "rth moment about the origin").

91
The third moment about the mean is also called skewness, and it is often
denoted as μ_3:

μ_3 = E[(X − μ)³].

When μ_3 = 0, the function is symmetric; when μ_3 > 0, it is "skewed to the
right" (the right tail of the distribution is longer); when μ_3 < 0, it is "skewed to
the left" (the left tail of the distribution is longer). For comparison purposes,
the normalised measure

α_3 = μ_3 / σ³

is more of interest.
In the example of the ER,

μ_3 = 0.618,  σ² = 0.47,  α_3 = 1.918

92
The fourth moment about the mean is also called kurtosis, and it is often
denoted as μ_4:

μ_4 = E[(X − μ)⁴].

The kurtosis measures the peakedness of the distribution or of the density. For
comparison purposes, the normalised measure

α_4 = μ_4 / σ⁴

is more of interest.

93
Theorems for the mean and the variance. For a random variable X
with E(X) = μ, Var(X) = σ², and for constants a, b,

(i) E(a + bX) = a + bμ
(ii) Var(X) = E(X²) − μ²
(iii) Var(a + bX) = b²σ².

Proofs.
(i) Let g(X) = a + bX; then, for a discrete random variable,

E(a + bX) = ∑_x f(x)(a + bx) = a ∑_x f(x) + b ∑_x f(x) x = a + bμ,

using the property (ii) of the probability distribution, ∑_x f(x) = 1, and the
definition of μ; for a continuous random variable,

E(a + bX) = ∫_{−∞}^{∞} f(x)(a + bx) dx = a ∫_{−∞}^{∞} f(x) dx + b ∫_{−∞}^{∞} f(x) x dx = a + bμ.

94
(ii)

Var(X) = E[(X − μ)²] = E[X² + μ² − 2μX];

using (i), this is

= E(X²) + μ² − 2μE(X) = E(X²) + μ² − 2μ² = E(X²) − μ².

(iii)

Var(a + bX) = E[(a + bX)²] − (a + bμ)²
            = E[a² + b²X² + 2abX] − (a² + b²μ² + 2abμ)
            = a² + b²E(X²) + 2abE(X) − a² − b²μ² − 2abμ
            = b²(E(X²) − μ²) + 2ab(E(X) − μ),

using (i) and E(X) = μ; by (ii), E(X²) − μ² = Var(X), and, because E(X) = μ,
E(X) − μ = 0. The result then follows.

95
Standardisation
For the generic X with E(X) = μ, Var(X) = σ², the transformation

Z = −μ/σ + (1/σ) X,

so that

Z = (X − μ)/σ,

is very important.
F Example/Practice. Using the theorems (i) and (iii) for the mean and
variance, notice (verify) that

E(Z) = 0,  Var(Z) = 1.

96
Moment Generating Function. The computation of the moments may some-
times be difficult, especially when high moments are of interest. It is sometimes
possible to introduce a function which facilitates such computation of the mo-
ments. This is the Moment Generating Function (also known as the MGF). In what
follows, we indicate the MGF for a random variable X as M_X(t).

The Moment Generating Function for a random variable X is

M_X(t) = E(e^{tX}).

97
For a discrete random variable,

M_X(t) = ∑_x e^{tx} f(x)

(assuming convergence of the summation).

For a continuous random variable,

M_X(t) = ∫ e^{tx} f(x) dx

(assuming that the integral is defined).

98
Using the Moment Generating Function. By a Taylor expansion,

e^{tX} = 1 + tX + t²X²/2! + t³X³/3! + ... + t^r X^r/r! + ...

so

E(e^{tX}) = 1 + t E(X) + t² E(X²)/2! + t³ E(X³)/3! + ... + t^r E(X^r)/r! + ...

or, letting μ′_r = E(X^r), then

M_X(t) = 1 + μ t + μ′_2 t²/2! + μ′_3 t³/3! + ... + μ′_r t^r/r! + ...

99
Next, notice that

∂M_X(t)/∂t = μ + 2μ′_2 t/2! + 3μ′_3 t²/3! + ... + r μ′_r t^{r−1}/r! + ...

so, setting t = 0,

∂M_X(t)/∂t |_{t=0} = μ.

Next,

∂²M_X(t)/∂t² = μ′_2 + 3·2·μ′_3 t/3! + ... + r(r − 1) μ′_r t^{r−2}/r! + ...

so, setting t = 0,

∂²M_X(t)/∂t² |_{t=0} = μ′_2.

In general, then,

μ′_r = ∂^r M_X(t)/∂t^r |_{t=0}.

100
Example: let X be a continuously distributed random variable, with

f(x) = λ e^{−λx} for x ≥ 0
f(x) = 0 otherwise

with λ > 0. Then, for t < λ,

M_X(t) = ∫_0^{∞} e^{tx} λ e^{−λx} dx = λ/(λ − t)

and

∂M_X(t)/∂t = λ/(λ − t)², so, at t = 0, μ = 1/λ;
∂²M_X(t)/∂t² = 2λ/(λ − t)³, so, at t = 0, μ′_2 = 2/λ².
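As an illustrative sketch (not part of the original notes), these quantities can be checked numerically for a specific value of the parameter, say λ = 2 (an arbitrary choice): the integral E(e^{tX}) matches λ/(λ − t) at a particular t, and the first two raw moments equal 1/λ and 2/λ².

```python
from math import exp
from scipy.integrate import quad

lam = 2.0                                  # arbitrary choice of the parameter
f = lambda x: lam * exp(-lam * x)          # density on x >= 0

t = 0.5                                    # any t < lam
mgf_t, _ = quad(lambda x: exp(t * x) * f(x), 0, float("inf"))
print(round(mgf_t, 6), lam / (lam - t))    # both approximately 1.333333

mu1, _ = quad(lambda x: x * f(x), 0, float("inf"))      # first raw moment
mu2, _ = quad(lambda x: x**2 * f(x), 0, float("inf"))   # second raw moment
print(round(mu1, 6), 1 / lam)              # both 0.5
print(round(mu2, 6), 2 / lam**2)           # both 0.5
```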

101
Other applications of interest
(Uniqueness Theorem) Under some regularity conditions, the moments
characterise the distribution univocally, i.e., if we know all the moments,
then we know the distribution:
Uniqueness Theorem: let X and Y be random variables with moment
generating functions M_X(t) and M_Y(t) respectively. Then, X and Y
have the same probability distribution if and only if M_X(t) = M_Y(t) (for
all t ∈ (−h, h), for some h > 0).

Let X be a random variable with MGF M_X(t), and a, b two constants
(b ≠ 0); then the MGF of (X + a)/b is

M_{(X+a)/b}(t) = e^{at/b} M_X(t/b)

Let X and Y be two independent random variables with MGFs M_X(t)
and M_Y(t) respectively. Then the MGF of X + Y is

M_{X+Y}(t) = M_X(t) M_Y(t)

102
Moments for joint distributions. When we have more than one random
variable, it is important to characterise not only each random variable by itself, but
also its relation with the other ones ("jointly"). One moment that characterises this
co-movement is the covariance.
Covariance of random variables.
Let X and Y be two random variables with joint probability function f(x, y)
and marginal probability functions g(x) and h(y) respectively (when X and
Y are discrete random variables), or with joint density function f(x, y) and
marginal density functions g(x) and h(y) respectively (when X and Y are con-
tinuous random variables), and such that E(X) = μ_X, E(Y) = μ_Y, Var(X) = σ²_X,
Var(Y) = σ²_Y. Then

Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]

is the covariance between X and Y (often indicated as σ_XY).

103
For discrete random variables,

Cov(X, Y) = ∑_x ∑_y (x − μ_X)(y − μ_Y) f(x, y)

while for continuous random variables,

Cov(X, Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − μ_X)(y − μ_Y) f(x, y) dx dy.

Theorems for the covariance

For constants a, b,

(i) Cov(a + X, b + Y) = Cov(X, Y)
(ii) Cov(X, Y) = E(XY) − μ_X μ_Y
(iii) Cov(aX, bY) = ab Cov(X, Y)
(iv) If X and Y are independent, then Cov(X, Y) = 0

104
Proof.
(i) Notice that E(a + X) = a + μ_X, E(b + Y) = b + μ_Y. Substituting this
in the definition,

Cov(a + X, b + Y) = E[((a + X) − (a + μ_X))((b + Y) − (b + μ_Y))]
                  = E[(X − μ_X)(Y − μ_Y)] = Cov(X, Y).

(ii)

Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)] = E(XY) − E(X μ_Y) − E(Y μ_X) + μ_Y μ_X
          = E(XY) − E(X) μ_Y − E(Y) μ_X + μ_Y μ_X = E(XY) − μ_X μ_Y − μ_Y μ_X + μ_Y μ_X

using Theorem (i) for the mean. Simplifying, Cov(X, Y) = E(XY) − μ_X μ_Y
follows.
(iii)

Cov(aX, bY) = E[(aX − aμ_X)(bY − bμ_Y)] = E[a(X − μ_X) b(Y − μ_Y)]
            = ab E[(X − μ_X)(Y − μ_Y)] = ab Cov(X, Y)

where the second equality follows using Theorem (i) for the mean.

105
(iv)
When X, Y are discrete, by independence, f(x, y) = g(x) h(y), so

E(XY) = ∑_x ∑_y x y f(x, y) = ∑_x ∑_y x y g(x) h(y)
      = ∑_x x g(x) ∑_y y h(y) = E(X) E(Y) = μ_X μ_Y;

when X, Y are continuous, by independence, f(x, y) = g(x) h(y), so

E(XY) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x y f(x, y) dx dy = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x y g(x) h(y) dx dy
      = ∫_{−∞}^{∞} x g(x) dx ∫_{−∞}^{∞} y h(y) dy = E(X) E(Y) = μ_X μ_Y.

The result then follows using (ii).

106
However, Cov(X, Y) = 0 does not imply that X and Y are independent.
Verify this in the following example:

                       y
               −1     0      1      g(x)
      x   −1   1/4    0      1/4    1/2
          0    0      1/2    0      1/2
      h(y)     1/4    1/2    1/4
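An illustrative sketch (not part of the original notes) computing the covariance and checking independence cell by cell for this table:

```python
from fractions import Fraction as F

x_vals = [-1, 0]
y_vals = [-1, 0, 1]
joint = [[F(1, 4), F(0), F(1, 4)],     # x = -1
         [F(0), F(1, 2), F(0)]]        # x =  0

g = [sum(row) for row in joint]                    # marginal of X
h = [sum(col) for col in zip(*joint)]              # marginal of Y
mu_x = sum(x * p for x, p in zip(x_vals, g))
mu_y = sum(y * p for y, p in zip(y_vals, h))
e_xy = sum(x_vals[i] * y_vals[j] * joint[i][j]
           for i in range(2) for j in range(3))

print(e_xy - mu_x * mu_y)                          # 0: zero covariance
print(all(joint[i][j] == g[i] * h[j]
          for i in range(2) for j in range(3)))    # False: not independent
```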

107
The covariance is also informative of the association between two random variables.
However, it is affected by the unit of measurement. A scale-free measure is given by
the correlation.

Correlation of random variables. The correlation is a measure of the
association between two random variables that is not affected by the unit of
measurement. For two random variables X, Y, the correlation Cor(X, Y) is
defined as

Cor(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)).

The correlation is also indicated by ρ_XY, and it holds that

ρ_XY = σ_XY / (σ_X σ_Y).

It can be shown that −1 ≤ ρ_XY ≤ 1, and that the absolute value of the
correlation coefficient is not affected by linear transformations, so if A = α_1 + α_2 X,
B = β_1 + β_2 Y, where α_1, α_2, β_1, β_2 are constants, then ρ_AB = sign(α_2 β_2) ρ_XY,
where sign(α_2 β_2) = 1 if α_2 β_2 > 0, and sign(α_2 β_2) = −1 if α_2 β_2 < 0.

108
In the example of Stock and Bond,

ρ_XY = −1.6 / √(21 · 0.96) = −0.35635.

In the example of the Motorcycle Engine,

E(X) = ∫_0^1 (−(4/15) x² − (2/15) x + 52/45) x dx = 7/15;
Var(X) = ∫_0^1 (−(4/15) x² − (2/15) x + 52/45) (x − 7/15)² dx = 8.0741 × 10⁻²

and it is also possible to verify that E(Y) = 8/9, Var(Y) = 0.30914;

Cov(X, Y) = ∫_0^1 ∫_0^2 (1/15)(10 − 2x² − y² − xy) (x − 7/15)(y − 8/9) dy dx = −1/135

Cor(X, Y) = (−1/135) / √(8.0741 × 10⁻² · 0.30914) = −4.6826 × 10⁻²

109
Conditional moments. In the example of the Motorcycle Engine, we saw that
the expected life of the engine is (8/9)·10, i.e. approximately 8.88 years. However, we
also saw that there is a negative correlation (−0.047) between life and usage, so it
seems fair to conjecture, for example, that the expected life is lower if an extensive
mileage is run, and vice versa. In most cases, we are interested in modelling
conditional moments.
In conditional moments, the conditional rather than the marginal distribu-
tion (or density) is used in the computation of the expectation.
The conditional mean E(Y|x) (or E(Y|X = x)) is

E(Y|x) = ∑_y y w(y|x)  or  E(Y|x) = ∫_{−∞}^{∞} y w(y|x) dy

for discrete or continuous random variables, respectively.

In the same way, E(X|y) is

E(X|y) = ∑_x x f(x|y)  or  E(X|y) = ∫_{−∞}^{∞} x f(x|y) dx

110
In the Stock and Bond example, when X = 0,

E(Y|0) = 0 · 0.29 + 2 · 0.71 = 1.42.

(For comparison, notice that E(Y) = 1.2, and E(Y|10) = 0.66...).

In the Motorcycle Engine example, if X = 1,

f(y|1) = [(1/15)(8 − y² − y)] / [−4/15 − 2/15 + 52/45] = −(3/34)(y + y² − 8)

and

E(Y|1) = ∫_0^2 −(3/34)(y + y² − 8) y dy = 14/17.

The life expectancy for a motorcycle running 100,000 km is (14/17)·10, i.e. ap-
proximately 8.23 years. As we anticipated, increasing the mileage comes at a
cost in terms of life expectancy.

111
By the same argument, it may be of interest to consider the conditional
variance Var(Y|x) or Var(X|y), which are

Var(Y|x) = ∑_y w(y|x) (y − E(Y|x))²              (discrete r.v.)
Var(Y|x) = ∫_{−∞}^{∞} w(y|x) (y − E(Y|x))² dy    (continuous r.v.)
Var(X|y) = ∑_x f(x|y) (x − E(X|y))²              (discrete r.v.)
Var(X|y) = ∫_{−∞}^{∞} f(x|y) (x − E(X|y))² dx    (continuous r.v.).

In the Stock and Bond example, Var(Y|0) = 0.8236, Var(Y|10) = 0.88889
(for comparison, Var(Y) = 0.96).
The theorems for the mean and the variance still hold for the conditional
mean and variance.
In the same way, it is also possible to define higher conditional moments.
In the same way, it is also possible to de…ne higher conditional moments.

112
F Example/Practice. Two hospitals.
Let X and Y be the number of injured people arriving every hour in two
hospitals, and suppose that the joint distribution is

                          x
               0      1      2      3      4
          0    0.10   0.15   0.03   0      0
     y    1    0      0.62   0.05   0.01   0
          2    0      0      0      0.02   0.02

Compute the marginal distributions g(x) and h(y).
Compute Cov(X, Y).
Compute the conditional distribution f(x|2).
Compute the conditional expectation E(X|2).
Compute the conditional variance Var(X|2).

113
Moments of sums of random variables. Let

W = X + Y,

where E(X) = μ_X, Var(X) = σ²_X, E(Y) = μ_Y, Var(Y) = σ²_Y, Cov(X, Y) = σ_XY.
Then

(i) E(W) = μ_X + μ_Y
(ii) Var(W) = σ²_X + σ²_Y + 2σ_XY.

114
Proof of (i): when X, Y are discrete,

E(X + Y) = ∑_x ∑_y (x + y) f(x, y) = ∑_x ∑_y x f(x, y) + ∑_x ∑_y y f(x, y)
         = ∑_x x ∑_y f(x, y) + ∑_y y ∑_x f(x, y).

Recalling the formulas for the marginal distributions, ∑_y f(x, y) = g(x) and ∑_x f(x, y) = h(y), so

E(X + Y) = ∑_x x g(x) + ∑_y y h(y) = E(X) + E(Y).

When X, Y are continuous, E(X + Y) is

∫∫ (x + y) f(x, y) dx dy = ∫∫ x f(x, y) dx dy + ∫∫ y f(x, y) dx dy
= ∫ x [∫ f(x, y) dy] dx + ∫ y [∫ f(x, y) dx] dy = ∫ x g(x) dx + ∫ y h(y) dy
= E(X) + E(Y)

(all integrals taken over (−∞, ∞)).

115
Proof of (ii):

Var(W) = E[(W − E(W))²] = E[(X + Y − (μ_X + μ_Y))²]
       = E[((X − μ_X) + (Y − μ_Y))²]
       = E[(X − μ_X)²] + E[(Y − μ_Y)²] + 2E[(X − μ_X)(Y − μ_Y)]
       = σ²_X + σ²_Y + 2σ_XY.

Using these two results, and the properties stated in the Theorems for the
covariance, it is possible to prove that, for example,

E(X − Y) = μ_X − μ_Y,
Var(X − Y) = σ²_X + σ²_Y − 2σ_XY.

116
Example.
In the Stock and Bond example, consider the portfolio

W = (1/10) X + (9/10) Y.

We can verify that

μ_X = 3,  μ_Y = 1.2,  σ²_X = 21,  σ²_Y = 0.96

E(XY) = 0·0·0.2 + 0·2·0.5 + 10·0·0.2 + 10·2·0.1 = 2

σ_XY = 2 − 3·1.2 = −1.6.

Then, using the Theorems for the mean and for the variance, and the The-
orem for the sum of two random variables,

E(W) = (1/10)·3 + (9/10)·1.2 = 1.38
Var(W) = (1/10)²·21 + (9/10)²·0.96 + 2·(1/10)·(9/10)·(−1.6) = 0.6996
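A short sketch (added for illustration) reproducing these portfolio calculations directly from the joint table:

```python
import numpy as np

x_vals = np.array([0.0, 10.0])
y_vals = np.array([0.0, 2.0])
joint = np.array([[0.2, 0.5],        # rows: x = 0, 10; columns: y = 0, 2
                  [0.2, 0.1]])

g, h = joint.sum(axis=1), joint.sum(axis=0)          # marginals
mu_x, mu_y = g @ x_vals, h @ y_vals
var_x = g @ (x_vals - mu_x) ** 2
var_y = h @ (y_vals - mu_y) ** 2
cov_xy = (joint * np.outer(x_vals - mu_x, y_vals - mu_y)).sum()

a, b = 1 / 10, 9 / 10                                # portfolio weights
print(round(a * mu_x + b * mu_y, 4))                                  # E(W)   = 1.38
print(round(a**2 * var_x + b**2 * var_y + 2 * a * b * cov_xy, 4))     # Var(W) = 0.6996
```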

117
Generalising to more than two variables. These results may of course be
generalised to more than two random variables. One particularly important implica-
tion is that, for X_1, ..., X_n, all independent, with E(X_i) = μ_i, Var(X_i) = σ²_i for
i = 1, ..., n,

E(∑_{i=1}^{n} X_i) = ∑_{i=1}^{n} μ_i,   Var(∑_{i=1}^{n} X_i) = ∑_{i=1}^{n} σ²_i.

118
Further results concerning moments. Law of iterated expectations.
For random variables X, Y,

E(E(Y|X)) = E(Y) and E(E(X|Y)) = E(X).

Example: E(E(Y|X)) and E(Y) in the Stock and Bond example.

Recall that we computed E(Y) = 1.2, E(Y|0) = 1.42 and E(Y|10) = 0.66.
So, E(Y|x) changes as we change X = x, and we know the proba-
bilities P(X = x) are P(X = 0) = 0.7 and P(X = 10) = 0.3.
E(Y|X) is therefore a random variable (in X) with distribution

      E(Y|x)    0.66   1.42
      f(x)      0.3    0.7

119
We can then compute the expectation

E(E(Y|X)) ≈ 0.66·0.3 + 1.42·0.7 ≈ 1.19

Note: the apparent difference between E(E(Y|X)) and E(Y) in this exam-
ple is only due to rounding errors.
Note: E(Y|X) is a random variable, but E(Y|x) is not a random
variable. In E(Y|x) we fixed X = x; it is only when we allow X to take all the
values x_1, ..., x_m that we have the random variable E(Y|X).

120
Further results concerning moments. Jensen's inequality. For a
random variable X, and for a convex function g,

E(g(X)) ≥ g(E(X))

Example: consider the function g(X) = X², and suppose that X is distrib-
uted as
      x       0      1      2      3
      f(x)    0.25   0.25   0.25   0.25
so that
      g(x)       0      1      4      9
      f(g(x))    0.25   0.25   0.25   0.25

Then, E(X) = 3/2 and g(E(X)) = (3/2)² = 9/4, while E(g(X)) = 14/4.

121
Note:
taking g(X) = X² we can also see that, for a generic random variable X,

E(X²) ≥ (E(X))².

Rearranging terms,

E(X²) − (E(X))² ≥ 0.

Since E(X²) − (E(X))² = Var(X), we have verified that the variance is non-
negative (we already knew that the variance is non-negative; this application of
Jensen's inequality gives an alternative proof).

122
Further results concerning moments. Chebyshev's inequality. For
a random variable X with E(X) = μ, Var(X) = σ², and for any ε > 0,

P(|X − μ| ≥ ε) ≤ σ²/ε².

Chebyshev's inequality gives us a useful bound for P(|X − μ| ≥ ε) even
without knowledge of the probability function: only knowledge of μ and of σ²
is necessary. This is also useful when the probability distribution is known,
but the computation of the probability is difficult.

123
Further results concerning moments.

Law of large numbers. One example of the usefulness of the Chebyshev’s


inequality is in derivation of the Law of Large Number.

Example: Consider again the experiment of tossing a fair coin in n independent experiments, recording Xi = 1 for head and Xi = 0 for tail. Then, X1, ..., Xn are independent and identically distributed random variables, with
$$E(X_i) = 0.5, \qquad Var(X_i) = 0.25.$$
Consider the sample average
$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$$
(where notice that we have added the subscript n in the notation to emphasise the dependence on the sample size).

124
Then, E(X̄n) = 1/2; on the other hand,
$$\text{if } n = 10, \quad Var(\bar{X}_n) = \frac{1}{10}\,0.25 = 0.025,$$
$$\text{if } n = 100, \quad Var(\bar{X}_n) = \frac{1}{100}\,0.25 = 0.0025.$$
Heuristically, we can interpret this result as saying that, as n becomes larger, the probability mass is more concentrated around the expected value.
We will explore this more formally in the Law of Large Numbers.

125
Law of large numbers.
Let X1, ..., Xn be independent and identically distributed random variables, with E(Xi) = μ, Var(Xi) = σ², and let
$$S_n = \frac{1}{n}\sum_{i=1}^{n} X_i.$$
Then, for any ε > 0,
$$\lim_{n\to\infty} P(|S_n - \mu| \geq \varepsilon) = 0.$$

126
Proof.
Using independence and the theorems for the mean and the variance,
$$E(S_n) = \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \frac{1}{n}\sum_{i=1}^{n}\mu = \frac{1}{n}\,n\mu = \mu,$$
$$Var(S_n) = \frac{1}{n^2}\sum_{i=1}^{n} Var(X_i) = \frac{1}{n^2}\sum_{i=1}^{n}\sigma^2 = \frac{1}{n^2}\,n\sigma^2 = \frac{\sigma^2}{n}.$$
Therefore, by Chebyshev's inequality,
$$P(|S_n - \mu| \geq \varepsilon) \leq \frac{\sigma^2/n}{\varepsilon^2},$$
and notice that σ²/(nε²) → 0 as n → ∞. ∎

127
The Law of Large Numbers is a very important result. It means that, when we take the sample mean of some independent, identically distributed random variables (for which the second moment exists), the probability that the sample mean is arbitrarily close to the population mean goes to 1. In statistical jargon, the sample mean converges to the population mean in probability.
To appreciate the importance of this result, consider the example of tossing a coin, counting the number of heads, and dividing that count by the sample size. Then, this number converges in probability to 1/2.
The conditions for the Law of Large Numbers may be generalised, allowing for heterogeneity in the distribution (and in the moments), for some kinds of dependence in the data, and for weaker moment conditions (on the variance).
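A small simulation illustrates the result for the coin-tossing example; the sample sizes below are arbitrary choices:

```python
# Law of Large Numbers for fair coin tosses (Xi = 1 for head, 0 for tail).
import numpy as np

rng = np.random.default_rng(0)
for n in [10, 100, 10_000]:
    x = rng.integers(0, 2, size=n)   # i.i.d. Bernoulli(0.5) draws
    print(n, x.mean())               # sample means approach 0.5 as n grows
```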

128
Section 4
Special probability functions

Aims and objectives

to introduce some probability functions that are very useful


to examine links and approximations

References

Chapter 5, Sections 5.1, 5.3, 5.4 and 5.7; Chapter 6, Sections 6.1 - 6.3, 6.5 -
6.8 and Chapter 8, Sections 8.4 - 8.6.

129
Bernoulli. Let X be a random variable that can only take values 1 or 0, and such that
$$P(X = 1) = \theta, \qquad 0 < \theta < 1.$$
Then, X is Bernoulli distributed, with P(X = 1) = θ, and we write
$$X \sim bernoulli(\theta).$$
For x = 0, x = 1, the probability function is
$$f(x; \theta) = \theta^x (1 - \theta)^{1 - x}$$
(where of course P(X = x) = f(x; θ), but notice that here we have made explicit reference to the parameter θ in the notation).
Each Bernoulli experiment is also called a "bernoulli trial".
The bernoulli distribution is typically used for events with two outcomes which can be classified as "success" or "failure": one example would be tossing a coin, and classifying "head" as "success" (or tail!).
A random variable that can only take values 0 and 1 is also often called an "indicator function", and it is often indicated as I.

130
For X ~ bernoulli(θ),
$$E(X) = \theta, \qquad Var(X) = \theta(1 - \theta).$$

Proof:
$$E(X) = 0\cdot\theta^0(1-\theta)^{1-0} + 1\cdot\theta^1(1-\theta)^{1-1} = \theta,$$
$$
\begin{aligned}
Var(X) &= E(X^2) - \{E(X)\}^2 \\
&= 0^2\cdot\theta^0(1-\theta)^{1-0} + 1^2\cdot\theta^1(1-\theta)^{1-1} - \theta^2 \\
&= \theta - \theta^2 = \theta(1 - \theta). \qquad\blacksquare
\end{aligned}
$$

131
Binomial. For x = 0, 1, ..., n, 0 < θ < 1, let
$$b(x; n, \theta) = \binom{n}{x}\theta^x (1 - \theta)^{n - x},$$
where, for non-negative n, x, it holds that
$$\binom{n}{x} = \frac{n!}{x!\,(n - x)!}.$$
We indicate this distribution by saying that, if P(X = x) = b(x; n, θ), then
$$X \sim binomial(n, \theta).$$
It is also possible to define a cumulative distribution function
$$B(x; n, \theta) = \sum_{k=0}^{x} b(k; n, \theta).$$
Tables of b(x; n, θ) and of B(x; n, θ) for various values of n, θ are widely available.

132
If I1, ..., In are i.i.d. bernoulli(θ), and
$$X = \sum_{i=1}^{n} I_i,$$
then X ~ binomial(n, θ).

If X ~ binomial(n, θ), then
$$E(X) = n\theta, \qquad Var(X) = n\theta(1 - \theta).$$

Proof: using the fact that I1, ..., In are i.i.d. bernoulli(θ) and X = Σᵢ Iᵢ,
$$E(X) = E\left(\sum_{i=1}^{n} I_i\right) = \sum_{i=1}^{n} E(I_i) = \sum_{i=1}^{n}\theta = n\theta,$$
$$Var(X) = Var\left(\sum_{i=1}^{n} I_i\right) = \sum_{i=1}^{n} Var(I_i) = \sum_{i=1}^{n}\theta(1 - \theta) = n\theta(1 - \theta),$$
where the second calculation uses the independence of the trials. ∎

133
F Example/Practice. Mike Walters is an estate agent. At the moment, he has a portfolio of 6 houses, each of which he knows he will sell before the end of the month with probability θ = 0.6. For each house, the sale is independent of the other ones.
(a) How many houses does he expect to sell?
(b) Find the probability that he makes 2 sales or more.
(c) List the probability function.
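A sketch of how the answers could be checked numerically, assuming n = 6 and θ = 0.6 as in the text (scipy.stats.binom):

```python
# Numerical check of the Mike Walters exercise with the binomial distribution.
from scipy.stats import binom

n, theta = 6, 0.6
print(binom.mean(n, theta))           # (a) expected sales: n * theta = 3.6
print(1 - binom.cdf(1, n, theta))     # (b) P(X >= 2) = 1 - P(X <= 1)
for x in range(n + 1):                # (c) the probability function
    print(x, round(binom.pmf(x, n, theta), 4))
```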

134
Example: the Law of Large Numbers for bernoulli trials.
Let X1, ..., Xn be independent and identically distributed Bernoulli trials with P(Xi = 1) = θ. Then, we know that E(Xi) = θ, Var(Xi) = θ(1 − θ). Notice that this means that E(Xi) = μ, Var(Xi) = σ² (just let μ = θ, σ² = θ(1 − θ) to verify it), so the conditions for the Law of Large Numbers are met. Therefore, letting
$$S_n = \frac{1}{n}\sum_{i=1}^{n} X_i,$$
for any ε > 0, by the Law of Large Numbers,
$$\lim_{n\to\infty} P(|S_n - \theta| \geq \varepsilon) = 0.$$

135
However, in this example we can also verify the validity of the Law of Large Numbers directly.
Letting X = Σᵢ Xi, then X ~ binomial(n, θ), so
$$E(X) = n\theta, \qquad Var(X) = n\theta(1 - \theta),$$
and therefore
$$E\left(\frac{1}{n}X\right) = \frac{1}{n}\,n\theta = \theta, \qquad
Var\left(\frac{1}{n}X\right) = \frac{1}{n^2}\,n\theta(1 - \theta) = \frac{1}{n}\,\theta(1 - \theta),$$
so by the Chebyshev inequality
$$P\left(\left|\frac{X}{n} - \theta\right| \geq \varepsilon\right) \leq \frac{\theta(1 - \theta)}{n\varepsilon^2}$$
for any ε > 0, and θ(1 − θ)/(nε²) → 0 as n → ∞.

136
Poisson. For x = 0, 1, ...; λ > 0,
$$p(x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}.$$
We indicate this distribution by saying that, if P(X = x) = p(x; λ), then
$$X \sim Poisson(\lambda).$$
If X ~ Poisson(λ), then
$$E(X) = \lambda, \qquad Var(X) = \lambda.$$

137
The Poisson distribution is used to model the occurrence of events which
may recur over time / in space, such as "number of buses passing by a bus
stop in one hour" or "number of cars overtaken in a stretch of one km on
the motorway", provided that:
(i) the probability of one occurrence in a small interval of time / space is
proportional to the length of the interval
(ii) the probability of more than one occurrence in a small interval of time
/ space is negligible
(iii) the occurrence of the events is independent

138
Example: Phone Calls.
Let X be the number of phone calls per 10 minutes at the Economics Reception. Suppose that we know that, on average, the Reception receives 1.5 calls per 10 minutes: therefore, λ = 1.5.
Therefore,
$$P(X = 0) = p(0; 1.5) = e^{-1.5}\,\frac{1.5^0}{0!} = 0.22313,$$
$$P(X = 1) = p(1; 1.5) = e^{-1.5}\,\frac{1.5^1}{1!} = 0.33470,$$
$$P(X = 2) = p(2; 1.5) = e^{-1.5}\,\frac{1.5^2}{2!} = 0.25102,$$
$$P(X = 3) = p(3; 1.5) = e^{-1.5}\,\frac{1.5^3}{3!} = 0.12551, \;\ldots$$
It is also possible to compute that, for example, P(X ≤ 2) = 0.80885.
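These values are straightforward to verify with scipy.stats.poisson, using λ = 1.5 as in the example:

```python
# Poisson probabilities for the phone-call example.
from scipy.stats import poisson

lam = 1.5
for x in range(4):
    print(x, round(poisson.pmf(x, lam), 5))  # 0.22313, 0.3347, 0.25102, 0.12551
print(round(poisson.cdf(2, lam), 5))         # P(X <= 2) = 0.80885
```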

139
The Poisson is a suitable approximation of the binomial when the sample is large and θ is small (a "rare event"):
$$b(x; n, \theta) \approx p(x; \lambda) \quad\text{as } n \to \infty,$$
where we let λ = nθ in the approximation. As a rule of thumb, the approximation is satisfactory if n ≥ 20 and θ ≤ 0.05, and it is excellent if n ≥ 100 and nθ < 10.

Example. A certain disease is present in 2% of the population. If we have a sample of n = 40 individuals, what is the probability that at most two have the disease?
Then, n = 40, θ = 0.02 and λ = 40 × 0.02 = 0.8:

x              0          1          2          P(X ≤ 2)
b(x; n, θ)     0.4457     0.36384    0.14479    0.95433
p(x; λ)        0.44933    0.35946    0.14379    0.95258

Tables of p(x; λ) for various values of λ are widely available.
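The comparison table can be reproduced as follows, assuming n = 40, θ = 0.02 and λ = nθ = 0.8:

```python
# Binomial probabilities and their Poisson approximation for a rare event.
from scipy.stats import binom, poisson

n, theta = 40, 0.02
lam = n * theta
for x in range(3):
    print(x, round(binom.pmf(x, n, theta), 5), round(poisson.pmf(x, lam), 5))
print(round(binom.cdf(2, n, theta), 5), round(poisson.cdf(2, lam), 5))  # P(X <= 2)
```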

140
Uniform. Let X be a continuous random variable with density
$$u(x; \alpha, \beta) = \begin{cases} \dfrac{1}{\beta - \alpha} & \text{if } x \in (\alpha, \beta) \\ 0 & \text{elsewhere.} \end{cases}$$
Then X is uniformly distributed and we indicate this as
$$X \sim u(\alpha, \beta).$$
For example, when α = −3, β = 3, then u(x; α, β) = 1/6 if x ∈ [−3, 3]:

[Figure: the density u(x; −3, 3), equal to 1/6 on (−3, 3) and 0 elsewhere.]

141
If X ~ u(α, β), then
$$E(X) = \frac{\alpha + \beta}{2}, \qquad Var(X) = \frac{(\beta - \alpha)^2}{12}.$$

The uniform distribution assigns probability to subsets (intervals) of [α, β], and the probability assigned is proportional to the length of the interval.
For example, if X ~ u(−3, 3), then P(X ∈ (0, 1]) = 1/6, P(X ∈ (0, 2]) = 2/6.
Note: we defined the function so that u(x; α, β) = 1/(β − α) if x ∈ (α, β), so u(α; α, β) = 0 and u(β; α, β) = 0, but other conventions are possible: for example, we could use [α, β] in the definition, and then u(x; α, β) = 1/(β − α) if x ∈ [α, β], so u(α; α, β) = 1/(β − α) and u(β; α, β) = 1/(β − α).

We have already encountered an example of this distribution: it is the distribution that we used for the example of David and Jasmine on the beach.
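A quick check of these interval probabilities; note that scipy.stats.uniform is parameterised by loc and scale, with support [loc, loc + scale], so u(−3, 3) corresponds to loc = −3, scale = 6:

```python
# Interval probabilities and moments for X ~ u(-3, 3).
from scipy.stats import uniform

X = uniform(loc=-3, scale=6)
print(X.cdf(1) - X.cdf(0))   # P(0 < X <= 1) = 1/6
print(X.cdf(2) - X.cdf(0))   # P(0 < X <= 2) = 2/6
print(X.mean(), X.var())     # (alpha + beta)/2 = 0, (beta - alpha)^2/12 = 3
```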

142
Exponential. Let X be a continuous random variable with density
$$g(x; \theta) = \begin{cases} \dfrac{1}{\theta}\, e^{-x/\theta} & \text{if } x > 0 \\ 0 & \text{elsewhere,} \end{cases}$$
for θ > 0. Then X is exponentially distributed and we indicate this as
$$X \sim exp(\theta).$$
For example, when θ = 1, the density g(x; 1) is:

[Figure: the exponential density g(x; 1) = e^{−x} for x > 0, decreasing from 1 at x = 0.]

143
If X ~ exp(θ), then
$$E(X) = \theta, \qquad Var(X) = \theta^2.$$

The Cumulative Distribution Function is
$$\int_0^x \frac{1}{\theta}\, e^{-t/\theta}\,dt = \left[-e^{-t/\theta}\right]_0^x = 1 - e^{-x/\theta}.$$

The Moment Generating Function is
$$M_X(t) = \frac{1/\theta}{1/\theta - t} = \frac{1}{1 - \theta t}.$$

When the occurrence of certain events can be modelled with the Poisson distribution, the Exponential may be used to model the waiting time for an occurrence.

144
Proof: Let X ~ Poisson(λ), and let Y be the waiting time for the first occurrence. Recall that if X ~ Poisson(λ), then λ is the expected number of occurrences in one unit of time. Of course, if we consider y units of time instead, the expected number of occurrences would be λy (think of the example of the Phone Calls arriving at the Reception: we expected 1.5 calls per 10-minute interval, so we then expect 3 calls in a 20-minute interval).
Suppose we just observed an occurrence: then, the probability that we wait for y units of time or more, P(Y > y), is the probability that no occurrences have happened until then. Since these are Poisson distributed, this is the probability of 0 occurrences in an interval of length y, so
$$P(Y > y) = p(0; \lambda y) = e^{-\lambda y}.$$
Moreover, P(Y ≤ y) = 1 − P(Y > y), and recall that P(Y ≤ y) is the cumulative distribution function F(y). Therefore,
$$F(y) = P(Y \leq y) = 1 - P(Y > y) = 1 - p(0; \lambda y) = 1 - e^{-\lambda y},$$
which is the cumulative distribution function of an exponentially distributed random variable Y with θ = 1/λ. ∎

145
Example. Phone Calls.
Let X be the number of phone calls per 10 minutes at the Economics Reception, and suppose that we know that, on average, the Reception receives 1.5 calls every 10 minutes.
What is the probability that we wait for one minute or less for the first call?
The number of incoming calls in 10 minutes is Poisson distributed with λ = 1.5. One minute is 1/10 of the 10-minute interval over which λ is defined.
The probability of waiting for one minute or less is
$$P(Y \leq 0.1) = \int_0^{0.1} 1.5\, e^{-1.5x}\,dx = 1 - e^{-1.5 \times 0.1} \approx 0.14.$$
Moreover, after the first call, the waiting time for the second call is also exponentially distributed (with θ = 1/λ).
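A quick check of the waiting-time probability, assuming λ = 1.5 per 10-minute unit so that θ = 1/λ (scipy.stats.expon takes scale = θ):

```python
# Exponential waiting time for the first call, in units of 10 minutes.
from scipy.stats import expon

lam = 1.5
theta = 1 / lam
print(round(expon.cdf(0.1, scale=theta), 4))  # P(Y <= 0.1) ~ 0.1393
```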

146
Normal distribution (also known as Gaussian). Let X be a continuous random variable with density
$$n(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2} \quad\text{for } -\infty < x < \infty;$$
then X is Normally distributed. We indicate this with the notation
$$X \sim N(\mu, \sigma^2).$$
If X ~ N(μ, σ²), then
$$E(X) = \mu, \qquad Var(X) = \sigma^2.$$

The density n(x; μ, σ) is
(i) defined for all x;
(ii) symmetric around μ;
(iii) bell shaped (with shape depending on σ²).

When X ~ N(μ, σ²), then, for constants a and b, a + bX ~ N(a + bμ, b²σ²) (the proof of this statement can be easily obtained using the moment generating function, but we do not do it here).

147
The Standard Normal distribution.

When μ = 0, σ² = 1, the distribution is usually referred to as the Standard Normal, and the random variable is often indicated as Z, so
$$Z \sim N(0, 1).$$
The pdf of Z is usually indicated as φ(z); the cdf is usually indicated as Φ(z).

The (normalised) skewness of the Standard Normal is $\alpha_3 = 0$.
The (normalised) kurtosis of the Standard Normal is $\alpha_4 = 3$.

Standardization: when X ~ N(μ, σ²), then, for constants a and b, a + bX ~ N(a + bμ, b²σ²), and therefore
$$\frac{X - \mu}{\sigma} \sim N(0, 1).$$

148
Some examples: pdf and cdf of the standard normal.

[Figure: the standard normal pdf $\varphi(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2}$ and its cdf $\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}t^2}\,dt$.]
149
[Figure: the N(0, 1) density compared with the N(1, 1) density (μ = 0 vs μ = 1), and the N(0, 1) density compared with the N(0, 2) density (σ² = 1 vs σ² = 2).]
150
Probabilities for a normally distributed random variable
Direct calculation of probabilities such as P(X ≤ c) is not possible, because
$$\int_{-\infty}^{c} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2} dx$$
does not have a closed-form solution.
Calculation would then only be possible via numerical approximation, which of course is not practical in general.
Tables could be produced, but they would depend on μ and on σ²: this too would be practically infeasible.
Instead, tables are only produced for φ(z) and Φ(z), Z ~ N(0, 1).
To address probability statements for X, we transform them into statements for (X − μ)/σ:
$$
\begin{aligned}
P(c \leq X \leq d) &= P(c - \mu \leq X - \mu \leq d - \mu) \\
&= P\left(\frac{c - \mu}{\sigma} \leq \frac{X - \mu}{\sigma} \leq \frac{d - \mu}{\sigma}\right) \\
&= P\left(\frac{c - \mu}{\sigma} \leq Z \leq \frac{d - \mu}{\sigma}\right)
= \Phi\left(\frac{d - \mu}{\sigma}\right) - \Phi\left(\frac{c - \mu}{\sigma}\right).
\end{aligned}
$$
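In practice, normal probabilities can also be computed directly (without tables), for example with scipy.stats.norm; a minimal sketch, using the X ~ N(10, 4) example discussed below:

```python
# Normal probabilities computed directly, and via standardisation.
from scipy.stats import norm

mu, sigma = 10, 2                       # X ~ N(10, 4)
p = norm.cdf(14, mu, sigma) - norm.cdf(7, mu, sigma)
print(round(p, 4))                      # P(7 <= X <= 14) ~ 0.9104

# The same value via standardisation and the standard normal cdf Phi:
print(round(norm.cdf((14 - mu) / sigma) - norm.cdf((7 - mu) / sigma), 4))
```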

151
Example: using the tables for the standard normal, Φ(z).

The values of Φ(z) can be derived using Table 2.1 of Neave.

The rows are ordered according to the first decimal place, the columns according to the second decimal place. So, for example, in order to find Φ(1.43), we look at the row 1.4 and at the column .03, so
$$P(-\infty < Z \leq 1.43) = 0.9236.$$
Notice that the tables usually only allow for numbers with up to two decimal digits. If we have more decimal digits, we may use linear interpolation; however, often just looking at the nearest entry is sufficient.

Notice that the tables only cover c up to 4.99. When c ≥ 5, then P(−∞ < Z ≤ c) ≈ 1 (and, when c ≤ −5, P(−∞ < Z ≤ c) ≈ 0).

152
Notice that sometimes the tables are presented in a different way. For example, the entries in Table III of the Appendix of Miller and Miller give the probability
$$P(0 \leq Z \leq c)$$
when c ≥ 0.

If we are interested in Φ(1.43), notice that P(−∞ < Z ≤ 0) = 0.5, so
$$\Phi(1.43) = P(-\infty < Z \leq 0) + P(0 \leq Z \leq 1.43) = 0.5 + 0.4236 = 0.9236.$$

Notice that these tables only cover c ≥ 0: for negative values, notice that P(−c ≤ Z ≤ 0) = P(0 ≤ Z ≤ c) by symmetry. Therefore, for example, P(−1.43 ≤ Z ≤ 0) = 0.4236.

Example. Using Tables of your choice, verify that Φ(0) = 0.5.

Example. Using Tables of your choice, verify that Φ(−2.02) = 0.0217.

153
Example. Let
$$X \sim N(10, 4).$$
What is P(7 ≤ X ≤ 14)?

Discussion: we standardise X: when x = 7, then (x − 10)/2 = (7 − 10)/2 = −3/2. In general,
$$
\begin{aligned}
P(7 \leq X \leq 14) &= P\left(\frac{7 - 10}{2} \leq \frac{X - 10}{2} \leq \frac{14 - 10}{2}\right) \\
&= P\left(-\tfrac{3}{2} \leq Z \leq 2\right) = P(-\infty < Z \leq 2) - P\left(-\infty < Z < -\tfrac{3}{2}\right) \\
&= 0.9772 - 0.0668 = 0.9104.
\end{aligned}
$$
Using tables from Miller and Miller we would have done
$$P\left(-\tfrac{3}{2} \leq Z \leq 2\right) = P\left(-\tfrac{3}{2} \leq Z \leq 0\right) + P(0 \leq Z \leq 2) = 0.4332 + 0.4772 = 0.9104,$$
where we used P(−3/2 ≤ Z ≤ 0) = P(0 ≤ Z ≤ 3/2) for the table.

154
F Example/Practice. Let X have a Normal distribution with μ = 1 and σ² = 16.
Making approximations when necessary, calculate the following probabilities: P(X ≤ 0.052); P(X ≤ 10.44); P(X > 2.72); P(X > 1.04); and P(−2.0 < X < 7.0).
Find c such that P(X < c) = 0.25; P(X > c) = 0.05.

F Example/Practice. Let Z have a Standard Normal distribution (i.e., with μ = 0 and σ² = 1).
Calculate P(|Z| > 1.6);
Find c such that P(|Z| > c) = 0.05.

F Example/Practice. Normal returns
The daily returns of a stock are normally distributed, with mean 6p and standard deviation 1.2p. What is the probability, for a random day, of obtaining a return between 7 and 8?
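A sketch of how the quantile questions in the first two exercises could be checked numerically, assuming X ~ N(1, 16) and using norm.ppf (the inverse of the cdf):

```python
# Normal quantiles and tail probabilities for the practice exercises.
from scipy.stats import norm

mu, sigma = 1, 4
print(norm.ppf(0.25, mu, sigma))        # c with P(X < c) = 0.25
print(norm.ppf(1 - 0.05, mu, sigma))    # c with P(X > c) = 0.05
print(2 * (1 - norm.cdf(1.6)))          # P(|Z| > 1.6) for Z ~ N(0, 1)
```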

155
Bivariate Normal distribution. Let
$$X \sim N(\mu_1, \sigma_1^2), \qquad Y \sim N(\mu_2, \sigma_2^2);$$
then X and Y are jointly distributed according to a bivariate normal distribution. Letting
$$Cov(X, Y) = \sigma_{12}, \qquad Cor(X, Y) = \rho,$$
the bivariate normal has pdf
$$
f(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}}
\exp\left\{ -\frac{\left(\frac{x - \mu_1}{\sigma_1}\right)^2
- 2\rho\left(\frac{x - \mu_1}{\sigma_1}\right)\left(\frac{y - \mu_2}{\sigma_2}\right)
+ \left(\frac{y - \mu_2}{\sigma_2}\right)^2}{2(1 - \rho^2)} \right\}.
$$

156
If ρ = 0, then f(x, y) = g(x)h(y), so X and Y are independent.
If ρ ≠ 0,
$$Y \mid x \sim N\!\left(\mu_2 + \rho\,\frac{\sigma_2}{\sigma_1}(x - \mu_1),\; \sigma_2^2\left(1 - \rho^2\right)\right).$$

Recalling that ρ = σ₁₂/(σ₁σ₂), the conditional mean μ₂ + ρ(σ₂/σ₁)(x − μ₁) is also written as
$$\mu_2 + \frac{\sigma_{12}}{\sigma_1^2}(x - \mu_1).$$

The conditional mean is linear in x;
the conditional variance σ₂²(1 − ρ²) is smaller than σ₂².

157
Sum of Random Variables. We already know how to compute the expected value or the variance of sums of random variables. In some cases we also verified that some distributions may be seen as the distribution of a sum of random variables (for example, the binomial is the distribution of the sum of n independent bernoulli trials).
Moreover:

If X1 and X2 are (jointly) normally distributed, then X1 + X2 is also normally distributed.

As for the moments of X1 + X2, if X1 ~ N(μ₁, σ₁²), X2 ~ N(μ₂, σ₂²) and cov(X1, X2) = σ₁₂, then
$$X_1 + X_2 \sim N\left(\mu_1 + \mu_2,\; \sigma_1^2 + \sigma_2^2 + 2\sigma_{12}\right).$$

This is a very special result: it is not usually true that the sum of two random variables with a certain distribution is a random variable with the same distribution.

158
Central Limit Theorem. Let X1, ..., Xn be independent, identically distributed random variables, with E(Xi) = μ, Var(Xi) = σ² (0 < σ² < ∞), and
$$S_n = X_1 + \cdots + X_n = \sum_{i=1}^{n} X_i;$$
then
$$\lim_{n\to\infty} P\left(a \leq \frac{S_n - n\mu}{\sigma\sqrt{n}} \leq b\right) = \frac{1}{\sqrt{2\pi}}\int_a^b e^{-u^2/2}\,du.$$

This is often indicated with the notation
$$\frac{S_n - n\mu}{\sigma\sqrt{n}} \to_d N(0, 1) \quad\text{as } n \to \infty.$$

This result may be generalised to allow for heterogeneity in the means and in the variances of the distribution, and to allow for some types of dependence.

159
The Central Limit Theorem is also stated referring to Zn, where
$$Z_n = \sqrt{n}\,\frac{\bar{X} - \mu}{\sigma},$$
and claiming that the distribution of Zn converges to a N(0, 1) as n → ∞, i.e.
$$Z_n \to_d N(0, 1) \quad\text{as } n \to \infty.$$

160
Note:
the central limit theorem may be used to approximate the distribution of the sample mean (and sometimes of random variables derived from the sample mean).

To see this, we introduce the symbol ∼approx, meaning that the distribution is only an approximation.
Then, the central limit theorem says that, for large n,
$$\sqrt{n}\,\frac{\bar{X} - \mu}{\sigma} \underset{approx}{\sim} N(0, 1).$$
Rearranging terms, we also derive
$$\bar{X} \underset{approx}{\sim} N\!\left(\mu, \frac{\sigma^2}{n}\right).$$
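A small simulation illustrates the approximation; the sketch below uses i.i.d. uniform(0, 1) draws (so μ = 0.5 and σ² = 1/12), with arbitrary choices of n and of the number of replications:

```python
# Central Limit Theorem: standardised sample means of uniform draws vs N(0, 1).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, reps = 30, 100_000
xbar = rng.uniform(0, 1, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (xbar - 0.5) / np.sqrt(1 / 12)

# Compare P(Z_n <= 1) in the simulation with Phi(1):
print((z <= 1).mean(), norm.cdf(1))   # both close to 0.8413
```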

161
Relation between the binomial and the normal distribution
We already saw that the binomial distribution can be seen as the distribution of a sum of random variables (the i.i.d. bernoulli trials): it is immediate to verify that these random variables also meet the conditions for the central limit theorem, so, for X ~ binomial(n, θ),
$$\frac{X - n\theta}{\sqrt{n\theta(1 - \theta)}} \to_d N(0, 1) \quad\text{as } n \to \infty,$$
i.e.,
$$X \underset{approx}{\sim} N(n\theta,\; n\theta(1 - \theta)).$$

In practical situations, the approximation is very good already when nθ ≥ 5 and n(1 − θ) ≥ 5.

162
The idea of approximating the binomial, as well as other discrete distributions, with the normal may seem surprising, because the normal is a continuous distribution.
In order to understand the rationale of the approximation, let X ~ binomial(n, θ), Y ~ N(nθ, nθ(1 − θ)), and let fY(y), FY(y) be the density and cumulative distribution functions of Y. Then, we are approximating, for x ≥ 0, P(X = x) by FY(x + 0.5) − FY(x − 0.5).
In order to appreciate the size of the approximation error, notice that if fY were constant between x − 0.5 and x + 0.5, then FY(x + 0.5) − FY(x − 0.5) = fY(x), so fY(x) would equal P(X = x). However, this is not the case, because fY(x) varies with x (it is decreasing when x > nθ, and increasing when x < nθ), so by treating it as constant we are actually approximating it.

163
Example.
In the example of tossing a fair coin 10 times, what is the approximate probability of scoring 3 times? What is the approximate probability of scoring between 3 and 5 times (inclusive)?
Discussion. Let X be the number of scores; then X ~ binomial(10, 0.5) and Y ~ N(5, 2.5). Therefore,
$$
P(X = 3) \approx P(2.5 < Y \leq 3.5) = P\left(\frac{2.5 - 5}{\sqrt{2.5}} < \frac{Y - 5}{\sqrt{2.5}} \leq \frac{3.5 - 5}{\sqrt{2.5}}\right)
= P(-1.5811 < Z \leq -0.94868) = 0.4418 - 0.3289 = 0.1129
$$
(the exact probability is $P(X = 3) = \frac{10!}{3!(10-3)!}\,0.5^3(1 - 0.5)^{10-3} = 0.11719$);
$$
P(3 \leq X \leq 5) \approx P(2.5 < Y \leq 5.5) = P\left(\frac{2.5 - 5}{\sqrt{2.5}} < \frac{Y - 5}{\sqrt{2.5}} \leq \frac{5.5 - 5}{\sqrt{2.5}}\right)
= P(-1.5811 < Z \leq 0.31623) = 0.4418 + 0.1255 = 0.5673
$$
(the exact probability is 0.56836).
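The same approximation can also be computed exactly rather than with rounded table values, which is why the figures below differ slightly from those above; a sketch with n = 10 and θ = 0.5:

```python
# Normal approximation with continuity correction vs exact binomial probabilities.
import math
from scipy.stats import binom, norm

n, theta = 10, 0.5
mu, sd = n * theta, math.sqrt(n * theta * (1 - theta))

approx_3 = norm.cdf(3.5, mu, sd) - norm.cdf(2.5, mu, sd)
exact_3 = binom.pmf(3, n, theta)
print(round(approx_3, 4), round(exact_3, 5))        # ~0.1145 vs 0.11719

approx_3to5 = norm.cdf(5.5, mu, sd) - norm.cdf(2.5, mu, sd)
exact_3to5 = binom.cdf(5, n, theta) - binom.cdf(2, n, theta)
print(round(approx_3to5, 4), round(exact_3to5, 5))  # ~0.5672 vs 0.56836
```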

164
Example. In the case of the example of the houses of Mike Walters, we may use the normal approximation of the binomial. Letting X be the number of houses sold, and Y ~ N(nθ, nθ(1 − θ)), we could approximate the case X = 0 with Y ≤ 0.5, X = 1 with 0.5 < Y ≤ 1.5, ..., up to approximating X = 6 with Y > 5.5.
Now, letting n = 6, θ = 0.6, then
$$Z = \frac{Y - n\theta}{\sqrt{n\theta(1 - \theta)}} = \frac{Y - 3.6}{\sqrt{1.44}} = \frac{Y - 3.6}{1.2}.$$
$$P(Y \leq 0.5) = P\left(Z \leq \frac{0.5 - 3.6}{1.2}\right) = P(Z \leq -2.5833) = 0.0049,$$
$$P(0.5 < Y \leq 1.5) = P(-2.5833 < Z \leq -1.75) = 0.0352, \;\ldots$$

xj    0           1           2           3           4           5           6
pj    0.0041      0.0369      0.1382      0.2765      0.3110      0.1866      0.0467
y     (−∞, 0.5]   (0.5, 1.5]  (1.5, 2.5]  (2.5, 3.5]  (3.5, 4.5]  (4.5, 5.5]  (5.5, ∞)
p̃j    0.0049      0.0352      0.1396      0.2871      0.3066      0.1700      0.0567

(pj = n!/(xj!(n − xj)!) θ^xj (1 − θ)^(n−xj) is the exact binomial probability; p̃j is obtained from the normal approximation.)

165
Other important distributions: Chi-square. Let X be a continuous random variable with density
$$f(x; \nu) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\, x^{\frac{\nu - 2}{2}}\, e^{-\frac{x}{2}} \quad\text{for } 0 < x < \infty,$$
$$\text{where } \Gamma(\alpha) = \int_0^{\infty} y^{\alpha - 1} e^{-y}\,dy \quad\text{for } \alpha > 0,$$
and f(x; ν) = 0 elsewhere. Then X is Chi-square distributed with ν degrees of freedom, and we indicate this with the notation
$$X \sim \chi^2_{\nu}.$$

If X ~ χ²_ν, then
$$E(X) = \nu, \qquad Var(X) = 2\nu.$$

166
Note. In general, we are interested in knowing that f(x; ν) exists, not in the exact formula. However, we are interested in knowing the shape of the probability densities for various values of ν.

[Figure: the probability densities of χ²₁, χ²₃ and χ²₆ distributed variables; f(x; 1) is the green curve, f(x; 3) the black one and f(x; 6) the red one.]

167
Our interest in the χ² distributions is due to the following result:
Relation between the Chi-square and the normal distribution
If X1, ..., Xn are n random variables such that
i) Xi ~ N(0, 1) for any i = 1, ..., n;
ii) Xi is independent of Xj for any j ≠ i, for any i, j = 1, ..., n,
then
$$Y = \sum_{i=1}^{n} X_i^2 \sim \chi^2_n,$$
i.e., Σᵢ Xi² is χ²_n (chi-squared with n degrees of freedom) distributed.

Using this result, it is easy to verify that E(Y) = n, Var(Y) = 2n.

Due to this relation with the standard normal, Chi-square distributions have very important applications in sampling and in inferential statistics.
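A small simulation sketch of this result: sums of n squared N(0, 1) draws compared with the χ²_n distribution (here n = 4, and 9.488 is the tabulated 0.95 point mentioned below):

```python
# Sums of squared standard normals compared with the chi-square(n) distribution.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
n, reps = 4, 100_000
y = (rng.standard_normal((reps, n)) ** 2).sum(axis=1)

print(y.mean(), y.var())                            # close to n = 4 and 2n = 8
print((y <= 9.488).mean(), chi2.cdf(9.488, df=n))   # both close to 0.95
```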

168
Note: Xi ~ N(0, 1) for any i = 1, ..., n implies that the Xi are identically distributed (as we specified the distribution to be a N(0, 1)).

Alternative notations. In many cases, Zi ~ N(0, 1) is used in this statement, as Z is the symbol that is commonly used for a random variable with a standard normal distribution.
Moreover, as the degrees of freedom are commonly referred to as "ν" in some references, the notation Y = Σ_{i=1}^{ν} Zi², so that Y ~ χ²_ν, is also used.

The χ² distributions are extensively tabulated. For example, using Table V,
when X ~ χ²₄, P(X ≤ 9.488) = 0.95;
when X ~ χ²₁₆, P(X ≤ 32) = 0.99.

169
Other properties of the χ² distribution.

The central limit theorem also holds, i.e., if X ~ χ²_ν,
$$\frac{X - \nu}{\sqrt{2\nu}} \to_d N(0, 1) \quad\text{as } \nu \to \infty$$
(this is also useful to approximate probabilities when ν > 30).

If X1 ~ χ²_{ν₁}, ..., Xk ~ χ²_{νk} and X1, ..., Xk are independent, then
$$\sum_{i=1}^{k} X_i \sim \chi^2_K, \quad\text{where } K = \nu_1 + \cdots + \nu_k.$$

170
Other important distributions: t. Let T be a continuous random variable with density
$$f(t; \nu) = \frac{\Gamma\!\left(\frac{\nu + 1}{2}\right)}{\sqrt{\pi\nu}\,\Gamma(\nu/2)}\left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu + 1}{2}} \quad\text{for } -\infty < t < \infty.$$
Then T is t distributed with ν degrees of freedom, and we indicate this with the notation
$$T \sim t_{\nu}.$$

If T ~ t_ν, then
$$E(T) = 0; \qquad \text{if } \nu > 2, \quad Var(T) = \frac{\nu}{\nu - 2}.$$

The t distribution is also known as Student's distribution (with ν degrees of freedom).

171
Note. In general, we are interested in knowing that f(t; ν) exists, not in the exact formula. However, we are interested in knowing the shape of the probability densities for various values of ν, in particular in relation to the pdf of Z ~ N(0, 1).

[Figure: the densities of Z and of T (when ν = 3).]

i) The pdf of T, f(t; ν), has thicker tails than the pdf of Z, n(x; 0, 1).
ii) As ν → ∞, f(t; ν) → n(x; 0, 1).

172
Our interest in the t distributions is due to the following result:
Relation between the t and the normal distribution
If Z and Y are two random variables such that
i) Z ~ N(0, 1);
ii) Y ~ χ²_ν;
iii) Z and Y are independently distributed;
then
$$T = \frac{Z}{\sqrt{Y/\nu}} \sim t_{\nu}.$$

Due to this relation with the standard normal, t distributions have some important applications in sampling and in inferential statistics.

The t distributions are tabulated. For example, using Table IV,
when T ~ t₆, P(T ≤ 1.440) = 0.9;
when T ~ t₂₄, P(T ≤ 2.064) = 0.975.
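The tabulated values can be checked with scipy.stats.t:

```python
# Checking the tabulated t values.
from scipy.stats import t

print(round(t.cdf(1.440, df=6), 3))    # ~0.900
print(round(t.cdf(2.064, df=24), 3))   # ~0.975
```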

173
Other important distributions: F. Let F be a continuous random variable with density
$$g(f; \nu_1, \nu_2) = \frac{\Gamma\!\left(\frac{\nu_1 + \nu_2}{2}\right)}{\Gamma(\nu_1/2)\,\Gamma(\nu_2/2)}\left(\frac{\nu_1}{\nu_2}\right)^{\nu_1/2} f^{\frac{\nu_1}{2} - 1}\left(1 + \frac{\nu_1}{\nu_2} f\right)^{-\frac{\nu_1 + \nu_2}{2}} \quad\text{for } 0 < f < \infty,$$
and g(f; ν₁, ν₂) = 0 elsewhere. Then F is F distributed with ν₁ and ν₂ degrees of freedom, and we indicate this with the notation
$$F \sim F_{\nu_1, \nu_2}.$$

Note. In general, we are interested in knowing that g(f; ν₁, ν₂) exists, not in the exact formula.

174
Our interest in the F distributions is due to the following result:
Relation between the F and the χ² distributions
If U and V are two random variables such that
i) U ~ χ²_{ν₁};
ii) V ~ χ²_{ν₂};
iii) U and V are independently distributed;
then
$$F = \frac{U/\nu_1}{V/\nu_2} \sim F_{\nu_1, \nu_2}.$$

Due to this relation with the χ² distributions, F distributions have some applications in sampling and in inferential statistics.
The F distributions are tabulated. For example, using Table VI,
when F ~ F_{6,30}, P(F ≤ 2.42) = 0.95;
when F ~ F_{24,120}, P(F ≤ 1.95) = 0.99.
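Similarly, the tabulated F values can be checked with scipy.stats.f:

```python
# Checking the tabulated F values.
from scipy.stats import f

print(round(f.cdf(2.42, dfn=6, dfd=30), 3))     # ~0.95
print(round(f.cdf(1.95, dfn=24, dfd=120), 3))   # ~0.99
```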

175
Some other results for F distributions

(i) If F ~ F_{ν₁,ν₂}, then 1/F ~ F_{ν₂,ν₁}.

(ii) F_{1,ν₂} = t², where t ~ t_{ν₂}.

(iii) As ν₂ → ∞, the distribution of ν₁F approaches the χ²_{ν₁} distribution
(recall that, in the notation above, f(x; ν₁) is the pdf of χ²_{ν₁}).

176
