Econometrics I


English class from this year!! Too bad!! (for you and me)

Econometrics I —> Statistics
Econometrics II —> Econometrics

TA session: Tue., 3rd class (13:00 – 14:30), Room #4, from 4/17,
by Mr. Kinoshita (2nd-year doctoral student)

You can get this lecture note from:
www2.econ.osaka-u.ac.jp/˜tanizaki/class/2012

Some Textbooks
• Exercises in Probability and Statistics 1: Probability (ed. K. Kunisawa, 1966, Baifukan) [in Japanese]
• Exercises in Probability and Statistics 2: Statistics (ed. K. Kunisawa, 1966, Baifukan) [in Japanese]
• R.V. Hogg, J.W. McKean and A.T. Craig, 2005, Introduction to Mathematical Statistics (Sixth edition), Pearson Prentice Hall.
• H. Tanizaki, 2004, Computational Methods in Statistics and Econometrics (STATISTICS: textbooks and monographs, Vol. 172), Marcel Dekker.

More elementary statistics classes:
• Undergraduate: Tue., 3rd class, Room #5, Prof. Oya
• Graduate: Tue., 6th class, Room #1, Prof. Fukushige

1 Event and Probability

1.1 Event

We consider an experiment whose outcome is not known in advance but occurs with some probability; such an experiment is sometimes called a random experiment.

The sample space of an experiment is the set of all possible outcomes.

Each outcome in the sample space is called an element of the sample space or a sample point, which represents each outcome obtained by the experiment.

An event is any collection of outcomes contained in the sample space, or equivalently a subset of the sample space.
An elementary event consists of exactly one element and a compound event consists of more than one element.

The sample space is denoted by Ω and a sample point by ω.

Suppose that event A is a subset of the sample space Ω, and let ω be a sample point in event A. Then, we say that the sample point ω is contained in event A, which is denoted by ω ∈ A.

The event consisting of all sample points that do not belong to event A is called the complementary event of A, which is denoted by A^c.

The event which does not have any sample point is called the empty event, denoted by φ.

Conversely, the event which includes all possible sample points is called the whole event, represented by Ω, which is equivalent to the sample space.

Next, consider two events A and B.

The event which belongs to either event A or event B is called the sum event (union), which is denoted by A ∪ B.

The event which belongs to both event A and event B is called the product event (intersection), denoted by A ∩ B.

When A ∩ B = φ, we say that events A and B are exclusive.

Example 1.1: Consider an experiment of casting a die.

We have six sample points, which are denoted by ω1 = {1}, ω2 = {2}, ω3 = {3}, ω4 = {4}, ω5 = {5} and ω6 = {6}, where ωi represents the sample point that we roll i.

In this experiment, the sample space is given by Ω = {ω1, ω2, ω3, ω4, ω5, ω6}.

Let A be the event that we have even numbers and B be the event that we have multiples of three. Then, we can write A = {ω2, ω4, ω6} and B = {ω3, ω6}.

The complementary event of A is given by A^c = {ω1, ω3, ω5}, which is the event that we have odd numbers.

The sum event of A and B is written as A ∪ B = {ω2, ω3, ω4, ω6}, while the product event is A ∩ B = {ω6}.

Since A ∩ A^c = φ, we have the fact that A and A^c are exclusive.

Example 1.2: Cast a coin three times. In this case, we have the following eight sample points:

ω1 = (H,H,H), ω2 = (H,H,T), ω3 = (H,T,H), ω4 = (H,T,T),
ω5 = (T,H,H), ω6 = (T,H,T), ω7 = (T,T,H), ω8 = (T,T,T),

where H represents head while T indicates tail.

For example, (H,T,H) means that the first flip lands head, the second flip is tail and the third one is head.
Therefore, the sample space of this experiment can be written as:

Ω = {ω1, ω2, ω3, ω4, ω5, ω6, ω7, ω8}.

Let A be the event that we have two heads, B be the event that we obtain at least one tail, C be the event that we have head in the second flip, and D be the event that we obtain tail in the third flip.

Then, the events A, B, C and D are given by:

A = {ω2, ω3, ω5},  B = {ω2, ω3, ω4, ω5, ω6, ω7, ω8},
C = {ω1, ω2, ω5, ω6},  D = {ω2, ω4, ω6, ω8}.

Since A is a subset of B, denoted by A ⊂ B, the sum event is A ∪ B = B, while the product event is A ∩ B = A.

Moreover, we obtain C ∩ D = {ω2, ω6} and C ∪ D = {ω1, ω2, ω4, ω5, ω6, ω8}.
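These set relations can be checked directly by representing sample points and events as Python sets (a sketch; the event names A, B, C, D follow Example 1.2):

```python
# Sample points for three coin flips (Example 1.2); each outcome is a tuple.
from itertools import product

omega = set(product("HT", repeat=3))          # sample space, 8 points

A = {w for w in omega if w.count("H") == 2}   # exactly two heads
B = {w for w in omega if "T" in w}            # at least one tail
C = {w for w in omega if w[1] == "H"}         # head on the second flip
D = {w for w in omega if w[2] == "T"}         # tail on the third flip

assert A < B                                  # A is a proper subset of B
assert A | B == B and A & B == A              # hence A ∪ B = B and A ∩ B = A
assert len(C & D) == 2 and len(C | D) == 6    # C ∩ D and C ∪ D as in the text
```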

1.2 Probability

Let n(A) be the number of sample points in A. We have n(A) ≤ n(B) when A ⊆ B.

Suppose that each sample point is equally likely to occur. In the case of Example 1.1 (Section 1.1), each of the six possible outcomes has probability 1/6, and in Example 1.2 (Section 1.1), each of the eight possible outcomes has probability 1/8.

Thus, the probability that the event A occurs is defined as:

P(A) = n(A)/n(Ω).

In Example 1.1, P(A) = 3/6 and P(A ∩ B) = 1/6 are obtained, because n(Ω) = 6, n(A) = 3 and n(A ∩ B) = 1.

Similarly, in Example 1.2, we have P(C) = 4/8, P(A ∩ B) = P(A) = 3/8 and so on.

Note that we obtain P(A) ≤ P(B) because of A ⊆ B.

It is known that we have the following three properties on probability:

0 ≤ P(A) ≤ 1,   (1)
P(Ω) = 1,       (2)
P(φ) = 0.       (3)

φ ⊆ A ⊆ Ω implies n(φ) ≤ n(A) ≤ n(Ω). Dividing by n(Ω), we obtain:

n(φ)/n(Ω) ≤ n(A)/n(Ω) ≤ n(Ω)/n(Ω) = 1.

Therefore, we have:

P(φ) ≤ P(A) ≤ P(Ω) = 1.

Because φ has no sample point, the number of sample points is given by n(φ) = 0 and accordingly we have P(φ) = 0. Therefore, 0 ≤ P(A) ≤ 1 is obtained as in (1).
When events A and B are exclusive, i.e., when A ∩ B = φ, then P(A ∪ B) = P(A) + P(B) holds.

Moreover, since A and A^c are exclusive, P(A^c) = 1 − P(A) is obtained. Note that P(A ∪ A^c) = P(Ω) = 1 holds.

More generally, even when A and B are not exclusive, we have the following formula:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B),

which is known as the addition rule.

In Example 1.1, each probability is given by P(A ∪ B) = 2/3, P(A) = 1/2, P(B) = 1/3 and P(A ∩ B) = 1/6. Thus, in the example we can verify that the above addition rule holds.
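The addition rule and the complement formula can be verified on the die example by counting sample points with exact fractions (a minimal sketch):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}                    # die sample space (Example 1.1)
A = {2, 4, 6}                                 # even numbers
B = {3, 6}                                    # multiples of three

def P(event):
    # Equally likely outcomes: P(A) = n(A)/n(Ω)
    return Fraction(len(event), len(omega))

# Addition rule: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
assert P(A | B) == P(A) + P(B) - P(A & B) == Fraction(2, 3)
# Complement: P(A^c) = 1 − P(A)
assert P(omega - A) == 1 - P(A) == Fraction(1, 2)
```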

The probability that event A occurs, given that event B has occurred, is called the conditional probability, i.e.,

P(A|B) = n(A ∩ B)/n(B) = P(A ∩ B)/P(B),

or equivalently,

P(A ∩ B) = P(A|B)P(B),

which is called the multiplication rule.

When event A is independent of event B, we have P(A ∩ B) = P(A)P(B), which implies that P(A|B) = P(A). Conversely, P(A ∩ B) = P(A)P(B) implies that A is independent of B.

In Example 1.2, because of P(A ∩ C) = 1/4 and P(C) = 1/2, the conditional probability P(A|C) = 1/2 is obtained. From P(A) = 3/8, we have P(A ∩ C) ≠ P(A)P(C). Therefore, A is not independent of C.

As for C and D, since we have P(C) = 1/2, P(D) = 1/2 and P(C ∩ D) = 1/4, we can show that C is independent of D.

2 Random Variable and Distribution

2.1 Univariate Random Variable and Distribution

The random variable X is defined as a real-valued function on the sample space Ω. Since X is a function of a sample point ω, it is written as X = X(ω).

Suppose that X(ω) takes a real value on the interval I. That is, X depends on a set of sample points ω, i.e., {ω; X(ω) ∈ I}, which is simply written as {X ∈ I}.

In Example 1.1 (Section 1.1), suppose that X is a random variable which takes the number of spots up on the die. Then, X is a function of ω and takes the following values:

X(ω1) = 1, X(ω2) = 2, X(ω3) = 3, X(ω4) = 4, X(ω5) = 5, X(ω6) = 6.

Next, suppose that X is a random variable which takes 1 for odd numbers and 0 for even numbers on the die. Then, X is a function of ω and takes the following values:

X(ω1) = 1, X(ω2) = 0, X(ω3) = 1, X(ω4) = 0, X(ω5) = 1, X(ω6) = 0.

In Example 1.2 (Section 1.1), suppose that X is a random variable which takes the number of heads. Depending on the sample point ωi, X takes the following values:

X(ω1) = 3, X(ω2) = 2, X(ω3) = 2, X(ω4) = 1,
X(ω5) = 2, X(ω6) = 1, X(ω7) = 1, X(ω8) = 0.

Thus, the random variable depends on a sample point.

There are two kinds of random variables. One is a discrete random variable, while the other is a continuous random variable.
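The idea that a random variable is just a function ω ↦ X(ω) can be made concrete with a small Python dictionary (a sketch; `X_spots` and `X_odd` are illustrative names for the two die examples above):

```python
# A random variable is a real-valued function on the sample space: X = X(ω).
# Here ω runs over the die outcomes of Example 1.1.
omega = [1, 2, 3, 4, 5, 6]

X_spots = {w: w for w in omega}        # X(ω) = number of spots shown
X_odd = {w: w % 2 for w in omega}      # X(ω) = 1 for odd numbers, 0 for even

assert [X_spots[w] for w in omega] == [1, 2, 3, 4, 5, 6]
assert [X_odd[w] for w in omega] == [1, 0, 1, 0, 1, 0]
```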

Discrete Random Variable and Probability Function: Suppose that the discrete random variable X takes x1, x2, ···, where x1 < x2 < ··· is assumed.

Consider the probability that X takes xi, i.e., P(X = xi) = pi, which is a function of xi. That is, a function of xi, say f(xi), is associated with P(X = xi) = pi.

The function f(xi) represents the probability in the case where X takes xi. Therefore, we have the following relation:

P(X = xi) = pi = f(xi), i = 1, 2, ···,

where f(xi) is called the probability function of X.
More formally, the function f(xi) which has the following properties is defined as the probability function:

f(xi) ≥ 0, i = 1, 2, ···,
Σi f(xi) = 1.

Furthermore, for a set A, we can write a probability as the following equation:

P(X ∈ A) = Σ_{xi ∈ A} f(xi).

Several functional forms of f(xi) are as follows.

Discrete uniform distribution:

f(x) = 1/N for x = 1, 2, ···, N, and f(x) = 0 otherwise,

where N = 1, 2, ···.

Bernoulli distribution:

f(x) = p^x (1 − p)^(1−x) for x = 0, 1, and f(x) = 0 otherwise,

where 0 ≤ p ≤ 1.

Binomial distribution:

f(x) = n!/(x!(n − x)!) p^x (1 − p)^(n−x) for x = 0, 1, 2, ···, n, and f(x) = 0 otherwise,

where 0 ≤ p ≤ 1 and n = 1, 2, ···. We write X ∼ B(n, p).

(a + b)^n = Σ_{x=0}^{n} nCx a^x b^(n−x) —> Binomial Theorem

nCx = n!/(x!(n − x)!), where n! = 1 · 2 ··· n (factorial of n).

Poisson distribution:

f(x) = e^(−λ) λ^x / x! for x = 0, 1, ···, and f(x) = 0 otherwise,

where λ > 0.
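A quick check that the binomial probability function is nonnegative and sums to one (the binomial theorem with a = p, b = 1 − p) can be sketched as:

```python
from math import comb

def binomial_pmf(x, n, p):
    # f(x) = nCx p^x (1 - p)^(n - x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.3
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]

assert all(q >= 0 for q in probs)             # f(x) ≥ 0
assert abs(sum(probs) - 1.0) < 1e-12          # Σx f(x) = 1
```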

< Review > e

Note that the definition of e is given by:

e = lim_{x→0} (1 + x)^(1/x) = lim_{h→∞} (1 + 1/h)^h = 2.71828182845905.

Notation: exp(x) = e^x.

< Review > Taylor series expansion about x0

f(x) = f(x0) + f′(x0)(x − x0) + f″(x0)/2! (x − x0)^2 + ··· + f^(k)(x0)/k! (x − x0)^k + ···,

where the kth derivative of f(x) is f^(k)(x).

Taylor series expansion of f(x) = e^x about x = 0:

e^x = 1 + x + x^2/2! + x^3/3! + ··· = Σ_{k=0}^{∞} x^k/k!,

where f^(k)(x) = e^x. Setting x = λ and renaming the summation index k as x, we obtain:

1 = Σ_{x=0}^{∞} e^(−λ) λ^x / x!,

which shows that the Poisson probabilities sum to one.
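The Taylor-series identity above can be checked numerically: the partial sums of e^(−λ) λ^x / x! approach one as terms are added (a sketch with an arbitrary λ):

```python
from math import exp, factorial

lam = 2.5
# Partial sum of e^{-λ} λ^x / x!; by the Taylor expansion e^λ = Σ λ^x / x!,
# this converges to 1 as more terms are included.
total = sum(exp(-lam) * lam**x / factorial(x) for x in range(100))

assert abs(total - 1.0) < 1e-12
```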
Geometric distribution:

f(x) = p(1 − p)^x for x = 0, 1, ···, and f(x) = 0 otherwise,

where 0 < p ≤ 1.

Negative binomial distribution:

f(x) = C(r + x − 1, x) p^r (1 − p)^x for x = 0, 1, ···, and f(x) = 0 otherwise,

where 0 < p ≤ 1 and r > 0, and C(r + x − 1, x) denotes the binomial coefficient.

Taylor series expansion of f(x) = (1 − x)^(−r) about x = 0:

f(x) = 1 + rx + r(r + 1)/2! x^2 + r(r + 1)(r + 2)/3! x^3 + ··· + C(r + k − 1, k) x^k + ···,

so that

(1 − x)^(−r) = Σ_{k=0}^{∞} C(r + k − 1, k) x^k.

Setting x = 1 − p and renaming the summation index k as x, we obtain:

p^(−r) = Σ_{x=0}^{∞} C(r + x − 1, x) (1 − p)^x, i.e., 1 = Σ_{x=0}^{∞} C(r + x − 1, x) p^r (1 − p)^x.

Hypergeometric distribution:

f(x) = C(K, x) C(M − K, n − x) / C(M, n) for x = 0, 1, ···, n, and f(x) = 0 otherwise,

where M = 1, 2, ···, K = 0, 1, ···, M, and n = 1, 2, ···, M.

In Example 1.2 (Section 1.1), all the possible values of X are 0, 1, 2 and 3 (note that X denotes the number of heads when a coin is cast three times). That is, x1 = 0, x2 = 1, x3 = 2 and x4 = 3 are assigned in this case.

The probability that X takes x1, x2, x3 or x4 is given by:

P(X = 0) = f(0) = P({ω8}) = 1/8,
P(X = 1) = f(1) = P({ω4, ω6, ω7}) = P({ω4}) + P({ω6}) + P({ω7}) = 3/8,
P(X = 2) = f(2) = P({ω2, ω3, ω5}) = P({ω2}) + P({ω3}) + P({ω5}) = 3/8,
P(X = 3) = f(3) = P({ω1}) = 1/8,

which can be written as:

P(X = x) = f(x) = 3!/(x!(3 − x)!) (1/2)^x (1/2)^(3−x), x = 0, 1, 2, 3.

For P(X = 1) and P(X = 2), note that the sample points involved are mutually exclusive. The above probability function is called the binomial distribution. Thus, it is easy to check f(x) ≥ 0 and Σx f(x) = 1 in Example 1.2.
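That the counting argument and the binomial formula agree can be verified by enumerating all eight coin-flip outcomes (a sketch using exact fractions):

```python
from fractions import Fraction
from itertools import product
from math import comb

outcomes = list(product("HT", repeat=3))       # 8 equally likely sample points

# Probability function obtained by counting sample points directly
f_count = {x: Fraction(sum(1 for w in outcomes if w.count("H") == x), 8)
           for x in range(4)}

# Binomial formula f(x) = 3!/(x!(3-x)!) (1/2)^x (1/2)^(3-x) = C(3, x)/8
f_binom = {x: Fraction(comb(3, x), 8) for x in range(4)}

assert f_count == f_binom == {0: Fraction(1, 8), 1: Fraction(3, 8),
                              2: Fraction(3, 8), 3: Fraction(1, 8)}
```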
Continuous Random Variable and Probability Density Function: Whereas a discrete random variable assumes at most a countable set of possible values, a continuous random variable X takes any real number within an interval I.

For the interval I, the probability that X is contained in I is defined as:

P(X ∈ I) = ∫_I f(x) dx.

For example, let I be the interval between a and b for a < b. Then, we can rewrite P(X ∈ I) as follows:

P(a < X < b) = ∫_a^b f(x) dx,

where f(x) is called the probability density function of X, or simply the density function of X.

In order for f(x) to be a probability density function, f(x) has to satisfy the following properties:

f(x) ≥ 0,
∫_{−∞}^{∞} f(x) dx = 1.

Some functional forms of f(x) are as follows:

Uniform distribution:

f(x) = 1/(b − a) for a < x < b, and f(x) = 0 otherwise,

where −∞ < a < b < ∞. We write X ∼ U(a, b).

Normal distribution:

f(x) = 1/√(2πσ^2) exp(−(x − µ)^2/(2σ^2)), −∞ < x < ∞,

where −∞ < µ < ∞ and σ > 0. We write X ∼ N(µ, σ^2); N(0, 1) is the standard normal distribution.

Exponential distribution:

f(x) = λe^(−λx) for x > 0, and f(x) = 0 otherwise,

where λ > 0.
47 48
Gamma distribution:

f(x) = λ^r/Γ(r) x^(r−1) e^(−λx) for x > 0, and f(x) = 0 otherwise,

where λ > 0 and r > 0. The gamma distribution with r = 1 ⇐⇒ the exponential distribution.

Gamma function: Γ(a) = ∫_0^∞ x^(a−1) e^(−x) dx, a > 0.
Γ(a + 1) = aΓ(a) —> use integration by parts.
Γ(n + 1) = n! for integer n.
Γ(n + 1/2) = (1 · 3 · 5 ··· (2n − 1))/2^n √π, Γ(1/2) = 2Γ(3/2) = √π.

Beta distribution:

f(x) = 1/B(a, b) x^(a−1) (1 − x)^(b−1) for 0 < x < 1, and f(x) = 0 otherwise,

where a > 0 and b > 0.

Beta function: B(a, b) = ∫_0^1 x^(a−1) (1 − x)^(b−1) dx = Γ(a)Γ(b)/Γ(a + b), with B(a, b) = B(b, a).

Cauchy distribution:

f(x) = 1/(πβ(1 + (x − α)^2/β^2)), −∞ < x < ∞,

where −∞ < α < ∞ and β > 0.

Log-normal distribution:

f(x) = 1/(x√(2πσ^2)) exp(−(ln x − µ)^2/(2σ^2)) for 0 < x < ∞, and f(x) = 0 otherwise,

where −∞ < µ < ∞ and σ > 0.

Double exponential distribution, or Laplace distribution:

f(x) = 1/(2β) exp(−|x − α|/β), −∞ < x < ∞,

where −∞ < α < ∞ and β > 0.

Weibull distribution:

f(x) = abx^(b−1) exp(−ax^b) for x > 0, and f(x) = 0 otherwise,

where a > 0 and b > 0.

Logistic distribution:

F(x) = (1 + e^(−(x−α)/β))^(−1), −∞ < x < ∞,

where −∞ < α < ∞ and β > 0.

Gumbel distribution, or extreme value distribution:

F(x) = exp(−e^(−(x−α)/β)), −∞ < x < ∞,

where −∞ < α < ∞ and β > 0.

Pareto distribution:

f(x) = θx0^θ/x^(θ+1) for x > x0, and f(x) = 0 otherwise,

where x0 > 0 and θ > 0.
t distribution:

f(x) = Γ((k + 1)/2)/(Γ(k/2)√(kπ)) (1 + x^2/k)^(−(k+1)/2), −∞ < x < ∞,

where k > 0. X ∼ t(k) —> t distribution with k degrees of freedom.
t(1) ⇐⇒ Cauchy distribution; t(∞) ⇐⇒ N(0, 1).

F distribution:

f(x) = Γ((m + n)/2)/(Γ(m/2)Γ(n/2)) (m/n)^(m/2) x^(m/2−1)/(1 + (m/n)x)^((m+n)/2) for x > 0, and f(x) = 0 otherwise,

where m, n = 1, 2, ···. X ∼ F(m, n) —> F distribution with (m, n) degrees of freedom.

Chi-square distribution:

f(x) = 1/(Γ(k/2) 2^(k/2)) x^(k/2−1) e^(−x/2) for x > 0, and f(x) = 0 otherwise,

where k = 1, 2, ···. X ∼ χ^2(k) —> χ^2 distribution with k degrees of freedom.

For a continuous random variable, note as follows:

P(X = x) = ∫_x^x f(t) dt = 0.

In the case of discrete random variables, P(X = xi) represents the probability that X takes xi, i.e., pi = f(xi). Thus, the probability function f(xi) itself implies probability. However, in the case of continuous random variables, P(a < X < b) indicates the probability that X lies in the interval (a, b).

Example 1.3: As an example, consider the following function:

f(x) = 1 for 0 < x < 1, and f(x) = 0 otherwise.

Clearly, since f(x) ≥ 0 for −∞ < x < ∞ and ∫_{−∞}^{∞} f(x) dx = ∫_0^1 f(x) dx = [x]_0^1 = 1, the above function can be a probability density function. In fact, it is called a uniform distribution.

Example 1.4: As another example, consider the following function:

f(x) = 1/√(2π) e^(−x^2/2), for −∞ < x < ∞.

Clearly, we have f(x) ≥ 0 for all x. We check whether ∫_{−∞}^{∞} f(x) dx = 1.

First of all, we define I as I = ∫_{−∞}^{∞} f(x) dx. To show I = 1, we may prove I^2 = 1 because of f(x) > 0 for all x, which is shown as follows:
I^2 = (∫_{−∞}^{∞} f(x) dx)^2 = (∫_{−∞}^{∞} f(x) dx)(∫_{−∞}^{∞} f(y) dy)
    = (∫_{−∞}^{∞} 1/√(2π) exp(−x^2/2) dx)(∫_{−∞}^{∞} 1/√(2π) exp(−y^2/2) dy)
    = 1/(2π) ∫_{−∞}^{∞} ∫_{−∞}^{∞} exp(−(x^2 + y^2)/2) dx dy
    = 1/(2π) ∫_0^{2π} ∫_0^{∞} exp(−r^2/2) r dr dθ
    = 1/(2π) ∫_0^{2π} ∫_0^{∞} exp(−s) ds dθ = 1/(2π) · 2π [−exp(−s)]_0^{∞} = 1.

< Review > Integration by Substitution

Univariate Case: For a function of x, f(x), we perform integration by substitution, using x = ψ(y). Then, it is easy to obtain the following formula:

∫ f(x) dx = ∫ ψ′(y) f(ψ(y)) dy,

which is called the integration by substitution.

Proof: Let F(x) be the integration of f(x), i.e.,

F(x) = ∫_{−∞}^{x} f(t) dt,

which implies that F′(x) = f(x). Differentiating F(x) = F(ψ(y)) with respect to y, we have:

dF(ψ(y))/dy = dF(x)/dx · dx/dy = f(x)ψ′(y) = f(ψ(y))ψ′(y).

Bivariate Case: For f(x, y), define x = ψ1(u, v) and y = ψ2(u, v). Then:

∫∫ f(x, y) dx dy = ∫∫ |J| f(ψ1(u, v), ψ2(u, v)) du dv,

where J is called the Jacobian, which represents the following determinant:

J = | ∂x/∂u  ∂x/∂v |
    | ∂y/∂u  ∂y/∂v | = (∂x/∂u)(∂y/∂v) − (∂x/∂v)(∂y/∂u).

< End of Review >

< Go back to the Integration >

In the fifth equality, integration by substitution is used. The polar coordinate transformation is used as x = r cos θ and y = r sin θ. Note that 0 ≤ r < +∞ and 0 ≤ θ < 2π. The Jacobian is given by:

J = | ∂x/∂r  ∂x/∂θ |   | cos θ  −r sin θ |
    | ∂y/∂r  ∂y/∂θ | = | sin θ   r cos θ | = r.

In the inner integration of the sixth equality, again, integration by substitution is utilized, where the transformation is s = r^2/2.

Thus, we obtain the result I^2 = 1 and accordingly we have I = 1 because of f(x) ≥ 0. Therefore, f(x) = e^(−x^2/2)/√(2π) is also taken as a probability density function. Actually, this density function is called the standard normal probability density function.
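The result I = 1 can also be checked numerically with a simple midpoint Riemann sum over a wide interval (a sketch; the truncation at ±10 loses only a negligible tail mass):

```python
from math import exp, pi, sqrt

def phi(x):
    # Standard normal density f(x) = e^{-x^2/2} / sqrt(2π)
    return exp(-0.5 * x * x) / sqrt(2 * pi)

# Midpoint Riemann sum over [-10, 10]; the tails beyond are negligible.
n, a, b = 200_000, -10.0, 10.0
h = (b - a) / n
integral = sum(phi(a + (i + 0.5) * h) for i in range(n)) * h

assert abs(integral - 1.0) < 1e-6
```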
Distribution Function: The distribution function or the cumulative distribution function, denoted by F(x), is defined as:

P(X ≤ x) = F(x),

which represents the probability that X is less than or equal to x.

The properties of the distribution function F(x) are given by:

F(x1) ≤ F(x2) for x1 < x2 —> nondecreasing function,
P(a < X ≤ b) = F(b) − F(a) for a < b,
F(−∞) = 0, F(+∞) = 1.

The difference between the discrete and continuous random variables is given by:

1. Discrete random variable (Figure 1):

• F(x) = Σ_{i=1}^{r} f(xi) = Σ_{i=1}^{r} pi, where r denotes the integer which satisfies xr ≤ x < xr+1.

• F(xi) − F(xi − ε) = f(xi) = pi, where ε is a small positive number less than xi − xi−1.

2. Continuous random variable (Figure 2):

• F(x) = ∫_{−∞}^{x} f(t) dt,

• F′(x) = f(x).

f(x) and F(x) are displayed in Figure 1 for a discrete random variable and in Figure 2 for a continuous random variable.
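For the discrete case, these CDF relations can be verified on the coin-flip probabilities of Example 1.2 (a sketch; `F` implements F(x) = Σ f(xi) over xi ≤ x):

```python
from fractions import Fraction

xs = [0, 1, 2, 3]                                   # values of X (coin example)
ps = [Fraction(1, 8), Fraction(3, 8), Fraction(3, 8), Fraction(1, 8)]

def F(x):
    # F(x) = Σ f(x_i) over all x_i ≤ x: a nondecreasing step function
    return sum((p for xi, p in zip(xs, ps) if xi <= x), Fraction(0))

assert F(-1) == 0 and F(3) == 1                     # F(−∞) = 0, F(+∞) = 1
assert F(1.5) == Fraction(1, 2)                     # constant between jumps
assert F(2) - F(1) == ps[2]                         # F(x_i) − F(x_i − ε) = f(x_i)
```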

[Figure 1: Probability Function f(x) and Distribution Function F(x) — Discrete Case. The figure shows the step function F(x) = Σ_{i=1}^{r} f(xi), which jumps by f(xi) at each point x1, x2, ···; r is the integer which satisfies xr ≤ x < xr+1.]

[Figure 2: Density Function f(x) and Distribution Function F(x) — Continuous Case. The figure shows F(x) = ∫_{−∞}^{x} f(t) dt as the shaded area under the density f(x) up to the point x.]
2.2 Multivariate Random Variable and Distribution

We consider two random variables X and Y in this section. It is easy to extend to more than two random variables.

Discrete Random Variables: Suppose that discrete random variables X and Y take x1, x2, ··· and y1, y2, ···, respectively. The probability that the event {ω; X(ω) = xi and Y(ω) = yj} occurs is given by:

P(X = xi, Y = yj) = fxy(xi, yj),

where fxy(xi, yj) represents the joint probability function of X and Y. In order for fxy(xi, yj) to be a joint probability function, it has to satisfy the following properties:

fxy(xi, yj) ≥ 0, i, j = 1, 2, ···,
Σi Σj fxy(xi, yj) = 1.

Define fx(xi) and fy(yj) as:

fx(xi) = Σj fxy(xi, yj), i = 1, 2, ···,

fy(yj) = Σi fxy(xi, yj), j = 1, 2, ···.

Then, fx(xi) and fy(yj) are called the marginal probability functions of X and Y.

fx(xi) and fy(yj) also have the properties of probability functions, i.e., fx(xi) ≥ 0 and Σi fx(xi) = 1, and fy(yj) ≥ 0 and Σj fy(yj) = 1.

Continuous Random Variables: Consider two continuous random variables X and Y. For a domain D, the probability that the event {ω; (X(ω), Y(ω)) ∈ D} occurs is given by:

P((X, Y) ∈ D) = ∫∫_D fxy(x, y) dx dy,

where fxy(x, y) is called the joint probability density function of X and Y, or the joint density function of X and Y.

fxy(x, y) has to satisfy the following properties:

fxy(x, y) ≥ 0,
∫_{−∞}^{∞} ∫_{−∞}^{∞} fxy(x, y) dx dy = 1.

Define fx(x) and fy(y) as:

fx(x) = ∫_{−∞}^{∞} fxy(x, y) dy,
fy(y) = ∫_{−∞}^{∞} fxy(x, y) dx,

for all x and y, where fx(x) and fy(y) are called the marginal probability density functions of X and Y, or the marginal density functions of X and Y.

For example, consider the event {ω; a < X(ω) < b, c < Y(ω) < d}, which is a specific case of the domain D. Then, the probability that we have this event is written as:

P(a < X < b, c < Y < d) = ∫_a^b ∫_c^d fxy(x, y) dy dx.
The mixture of discrete and continuous random variables is also possible. For example, let X be a discrete random variable and Y be a continuous random variable, where X takes x1, x2, ···. The probability that X takes xi and Y takes a real number within the interval I is given by:

P(X = xi, Y ∈ I) = ∫_I fxy(xi, y) dy.

Then, we have the following properties:

fxy(xi, y) ≥ 0, for all y and i = 1, 2, ···,
Σi ∫_{−∞}^{∞} fxy(xi, y) dy = 1.

The marginal probability function of X is given by:

fx(xi) = ∫_{−∞}^{∞} fxy(xi, y) dy,

for i = 1, 2, ···. The marginal probability density function of Y is:

fy(y) = Σi fxy(xi, y).

2.3 Conditional Distribution

Discrete Random Variable: The conditional probability function of X given Y = yj is represented as:

P(X = xi | Y = yj) = fx|y(xi|yj) = fxy(xi, yj)/fy(yj) = fxy(xi, yj)/Σi fxy(xi, yj).

The second equality indicates the definition of the conditional probability.

The features of the conditional probability function fx|y(xi|yj) are:

fx|y(xi|yj) ≥ 0, i = 1, 2, ···,
Σi fx|y(xi|yj) = 1, for any j.

Continuous Random Variable: The conditional probability density function of X given Y = y (or the conditional density function of X given Y = y) is:

fx|y(x|y) = fxy(x, y)/fy(y) = fxy(x, y)/∫_{−∞}^{∞} fxy(x, y) dx.

The properties of the conditional probability density function fx|y(x|y) are given by:

fx|y(x|y) ≥ 0,
∫_{−∞}^{∞} fx|y(x|y) dx = 1, for any Y = y.
Independence of Random Variables: For discrete random variables X and Y, we say that X is independent (or stochastically independent) of Y if and only if fxy(xi, yj) = fx(xi) fy(yj).

Similarly, for continuous random variables X and Y, we say that X is independent of Y if and only if fxy(x, y) = fx(x) fy(y).

When X and Y are stochastically independent, g(X) and h(Y) are also stochastically independent, where g(X) and h(Y) are functions of X and Y.

3 Mathematical Expectation

3.1 Univariate Random Variable

Definition of Mathematical Expectation: Let g(X) be a function of random variable X. The mathematical expectation of g(X), denoted by E(g(X)), is defined as follows:
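The discrete independence condition fxy(xi, yj) = fx(xi) fy(yj) can be checked factor-by-factor on a small joint probability function (a sketch using two fair coin indicators):

```python
from fractions import Fraction
from itertools import product

# Joint pmf of two fair coins: X = first flip (1 = head), Y = second flip.
f_xy = {(x, y): Fraction(1, 4) for x, y in product((0, 1), repeat=2)}

f_x = {x: sum(f_xy[x, y] for y in (0, 1)) for x in (0, 1)}   # marginal of X
f_y = {y: sum(f_xy[x, y] for x in (0, 1)) for y in (0, 1)}   # marginal of Y

# Independence: f_xy(x, y) = f_x(x) f_y(y) for every pair (x, y).
assert all(f_xy[x, y] == f_x[x] * f_y[y] for x, y in f_xy)
```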

E(g(X)) = Σi g(xi) pi = Σi g(xi) f(xi) (discrete random variable),
E(g(X)) = ∫_{−∞}^{∞} g(x) f(x) dx (continuous random variable).

The following three functional forms of g(X) are important.

1. g(X) = X.

The expectation of X, E(X), is known as the mean of the random variable X:

E(X) = Σi xi f(xi) (discrete random variable),
E(X) = ∫_{−∞}^{∞} x f(x) dx (continuous random variable),
     = µ (or µx).

When the distribution of X is symmetric, the mean indicates the center of the distribution.

2. g(X) = (X − µ)^2.

The expectation of (X − µ)^2 is known as the variance of the random variable X, which is denoted by V(X):

V(X) = E((X − µ)^2)
     = Σi (xi − µ)^2 f(xi) (discrete random variable),
     = ∫_{−∞}^{∞} (x − µ)^2 f(x) dx (continuous random variable),
     = σ^2 (or σx^2).

If X is broadly distributed, σ^2 = V(X) becomes large. Conversely, if the distribution is concentrated on the center, σ^2 becomes small. Note that σ = √V(X) is called the standard deviation.
3. g(X) = e^(θX).

The expectation of e^(θX) is called the moment-generating function, which is denoted by φ(θ):

φ(θ) = E(e^(θX))
     = Σi e^(θxi) f(xi) (discrete random variable),
     = ∫_{−∞}^{∞} e^(θx) f(x) dx (continuous random variable).

Some Formulas of Mean and Variance:

1. Theorem: E(aX + b) = aE(X) + b, where a and b are constant.

Proof: When X is a discrete random variable,

E(aX + b) = Σi (axi + b) f(xi) = a Σi xi f(xi) + b Σi f(xi) = aE(X) + b.

Note that we have Σi xi f(xi) = E(X) from the definition of the mean and Σi f(xi) = 1 because f(xi) is a probability function.

If X is a continuous random variable,

E(aX + b) = ∫_{−∞}^{∞} (ax + b) f(x) dx = a ∫_{−∞}^{∞} x f(x) dx + b ∫_{−∞}^{∞} f(x) dx = aE(X) + b.

Similarly, note that we have ∫_{−∞}^{∞} x f(x) dx = E(X) from the definition of the mean and ∫_{−∞}^{∞} f(x) dx = 1 because f(x) is a probability density function.

2. Theorem: V(X) = E(X^2) − µ^2, where µ = E(X).

Proof: V(X) is rewritten as follows:

V(X) = E((X − µ)^2) = E(X^2 − 2µX + µ^2) = E(X^2) − 2µE(X) + µ^2 = E(X^2) − µ^2.

The first equality is due to the definition of the variance.

3. Theorem: V(aX + b) = a^2 V(X), where a and b are constant.

Proof:

From the definition of the mathematical expectation, V(aX + b) is represented as:

V(aX + b) = E(((aX + b) − E(aX + b))^2) = E((aX − aµ)^2) = E(a^2 (X − µ)^2) = a^2 E((X − µ)^2) = a^2 V(X).

The first and the last equalities are from the definition of the variance. We use E(aX + b) = aµ + b in the second equality.

4. Theorem: The random variable X is assumed to be distributed with mean E(X) = µ and variance V(X) = σ^2. Define Z = (X − µ)/σ. Then, we have E(Z) = 0 and V(Z) = 1.
Proof: E(Z) and V(Z) are obtained as:

E(Z) = E((X − µ)/σ) = (E(X) − µ)/σ = 0,
V(Z) = V((1/σ)X − µ/σ) = (1/σ^2) V(X) = 1.

The transformation from X to Z is known as normalization or standardization.

Example 1.5: In Example 1.2 of flipping a coin three times (Section 1.1), we see in Section 2.1 that the probability function is written as the following binomial distribution:

P(X = x) = f(x) = n!/(x!(n − x)!) p^x (1 − p)^(n−x), for x = 0, 1, 2, ···, n,

where n = 3 and p = 1/2. When X has the binomial distribution above, we obtain E(X), V(X) and φ(θ) as follows.

First, µ = E(X) is computed as:

µ = E(X) = Σ_{x=0}^{n} x f(x) = Σ_{x=1}^{n} x f(x) = Σ_{x=1}^{n} x n!/(x!(n − x)!) p^x (1 − p)^(n−x)
  = Σ_{x=1}^{n} n!/((x − 1)!(n − x)!) p^x (1 − p)^(n−x)
  = np Σ_{x=1}^{n} (n − 1)!/((x − 1)!(n − x)!) p^(x−1) (1 − p)^(n−x)
  = np Σ_{x′=0}^{n′} n′!/(x′!(n′ − x′)!) p^(x′) (1 − p)^(n′−x′) = np,

where n′ = n − 1 and x′ = x − 1 are set.

Second, in order to obtain σ^2 = V(X), we rewrite V(X) as:

σ^2 = V(X) = E(X^2) − µ^2 = E(X(X − 1)) + µ − µ^2.

E(X(X − 1)) is given by:

E(X(X − 1)) = Σ_{x=0}^{n} x(x − 1) f(x) = Σ_{x=2}^{n} x(x − 1) f(x)
            = Σ_{x=2}^{n} x(x − 1) n!/(x!(n − x)!) p^x (1 − p)^(n−x)
            = Σ_{x=2}^{n} n!/((x − 2)!(n − x)!) p^x (1 − p)^(n−x)

            = n(n − 1)p^2 Σ_{x=2}^{n} (n − 2)!/((x − 2)!(n − x)!) p^(x−2) (1 − p)^(n−x)
            = n(n − 1)p^2 Σ_{x′=0}^{n′} n′!/(x′!(n′ − x′)!) p^(x′) (1 − p)^(n′−x′)
            = n(n − 1)p^2,

where n′ = n − 2 and x′ = x − 2 are re-defined.

Therefore, σ^2 = V(X) is obtained as:

σ^2 = V(X) = E(X(X − 1)) + µ − µ^2 = n(n − 1)p^2 + np − n^2 p^2 = −np^2 + np = np(1 − p).
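The results E(X) = np and V(X) = np(1 − p) can be confirmed by direct summation over the probability function (a sketch with the n = 3, p = 1/2 case of the coin example):

```python
from math import comb

n, p = 3, 0.5
f = [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

mean = sum(x * f[x] for x in range(n + 1))               # E(X) = Σ x f(x)
var = sum(x * x * f[x] for x in range(n + 1)) - mean**2  # V(X) = E(X^2) − µ^2

assert abs(mean - n * p) < 1e-12               # E(X) = np
assert abs(var - n * p * (1 - p)) < 1e-12      # V(X) = np(1 − p)
```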
Finally, the moment-generating function φ(θ) is represented as:

φ(θ) = E(e^(θX)) = Σ_{x=0}^{n} e^(θx) n!/(x!(n − x)!) p^x (1 − p)^(n−x)
     = Σ_{x=0}^{n} n!/(x!(n − x)!) (pe^θ)^x (1 − p)^(n−x) = (pe^θ + 1 − p)^n.

In the last equality, we utilize the following formula:

(a + b)^n = Σ_{x=0}^{n} n!/(x!(n − x)!) a^x b^(n−x),

which is called the binomial theorem.

Example 1.6: As an example of continuous random variables, in Section 2.1 the uniform distribution is introduced, which is given by:

f(x) = 1 for 0 < x < 1, and f(x) = 0 otherwise.

When X has the uniform distribution above, E(X), V(X) and φ(θ) are computed as follows:

µ = E(X) = ∫_{−∞}^{∞} x f(x) dx = ∫_0^1 x dx = [x^2/2]_0^1 = 1/2,

σ^2 = V(X) = E(X^2) − µ^2 = ∫_{−∞}^{∞} x^2 f(x) dx − µ^2 = ∫_0^1 x^2 dx − µ^2 = [x^3/3]_0^1 − (1/2)^2 = 1/12,

φ(θ) = E(e^(θX)) = ∫_{−∞}^{∞} e^(θx) f(x) dx = ∫_0^1 e^(θx) dx = [e^(θx)/θ]_0^1 = (e^θ − 1)/θ.

Example 1.7: As another example of continuous random variables, we take the standard normal distribution:

f(x) = 1/√(2π) e^(−x^2/2), for −∞ < x < ∞.

When X has a standard normal distribution, i.e., when X ∼ N(0, 1), E(X), V(X) and φ(θ) are as follows.
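The uniform-distribution moments and moment-generating function above can be checked by approximating each integral with a midpoint Riemann sum on (0, 1) (a sketch; θ = 0.7 is an arbitrary evaluation point):

```python
from math import exp

# Midpoint Riemann sums for E(X), E(X^2) and φ(θ) under f(x) = 1 on (0, 1).
n = 100_000
xs = [(i + 0.5) / n for i in range(n)]

mean = sum(xs) / n
var = sum(x * x for x in xs) / n - mean**2
theta = 0.7
mgf = sum(exp(theta * x) for x in xs) / n

assert abs(mean - 0.5) < 1e-6                       # E(X) = 1/2
assert abs(var - 1 / 12) < 1e-6                     # V(X) = 1/12
assert abs(mgf - (exp(theta) - 1) / theta) < 1e-6   # φ(θ) = (e^θ − 1)/θ
```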

E(X) is obtained as:

E(X) = ∫_{−∞}^{∞} x f(x) dx = 1/√(2π) ∫_{−∞}^{∞} x e^(−x^2/2) dx = 1/√(2π) [−e^(−x^2/2)]_{−∞}^{∞} = 0,

because lim_{x→±∞} −e^(−x^2/2) = 0.

V(X) is computed as follows:

V(X) = E(X^2) = ∫_{−∞}^{∞} x^2 f(x) dx = ∫_{−∞}^{∞} x^2 1/√(2π) e^(−x^2/2) dx
     = 1/√(2π) ∫_{−∞}^{∞} x d(−e^(−x^2/2))/dx dx
     = 1/√(2π) [x(−e^(−x^2/2))]_{−∞}^{∞} + 1/√(2π) ∫_{−∞}^{∞} e^(−x^2/2) dx
     = ∫_{−∞}^{∞} 1/√(2π) e^(−x^2/2) dx = 1.
The first equality holds because of E(X) = 0.

In the fifth equality, use the following integration formula, called the integration by parts:

∫_a^b h(x) g′(x) dx = [h(x) g(x)]_a^b − ∫_a^b h′(x) g(x) dx,

where we take h(x) = x and g(x) = −e^(−x^2/2) in this case.

In the sixth equality, lim_{x→±∞} −xe^(−x^2/2) = 0 is utilized. The last equality is because the integration of the standard normal probability density function is equal to one.

φ(θ) is derived as follows:

φ(θ) = ∫_{−∞}^{∞} e^(θx) f(x) dx = ∫_{−∞}^{∞} e^(θx) 1/√(2π) e^(−x^2/2) dx
     = ∫_{−∞}^{∞} 1/√(2π) e^(−x^2/2+θx) dx = ∫_{−∞}^{∞} 1/√(2π) e^(−((x−θ)^2−θ^2)/2) dx
     = e^(θ^2/2) ∫_{−∞}^{∞} 1/√(2π) e^(−(x−θ)^2/2) dx = e^(θ^2/2).

The last equality holds because the integration indicates the normal density with mean θ and variance one.

Example 1.8: When the moment-generating function of X is given by φ_x(θ) = e^{θ²/2} (i.e., X has a standard normal distribution), we want to obtain the moment-generating function of Y = µ + σX.

Let φ_x(θ) and φ_y(θ) be the moment-generating functions of X and Y, respectively. Then, the moment-generating function of Y is obtained as follows:

φ_y(θ) = E(e^{θY}) = E(e^{θ(µ+σX)}) = e^{θµ} E(e^{θσX}) = e^{θµ} φ_x(θσ)

       = e^{θµ} e^{σ²θ²/2} = exp(µθ + σ²θ²/2).

Example 1.8(b): When X ∼ N(µ, σ²), what is the moment-generating function of X?

φ(θ) = ∫_{−∞}^{∞} e^{θx} f(x) dx

     = ∫_{−∞}^{∞} e^{θx} (1/√(2πσ²)) exp( −(x − µ)²/(2σ²) ) dx

     = ∫_{−∞}^{∞} (1/√(2πσ²)) exp( θx − (x − µ)²/(2σ²) ) dx

     = exp(µθ + σ²θ²/2) ∫_{−∞}^{∞} (1/√(2πσ²)) exp( −(x − µ − σ²θ)²/(2σ²) ) dx

     = exp(µθ + σ²θ²/2).
111 112
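The closed form exp(µθ + σ²θ²/2) for X ∼ N(µ, σ²) can be checked by integrating e^{θx} f(x) numerically; µ = 1, σ = 2, θ = 0.3 and the integration range are arbitrary illustrative choices.

```python
from math import exp, pi, sqrt

# Check Example 1.8(b): phi(theta) = exp(mu*theta + sigma^2*theta^2/2) for
# X ~ N(mu, sigma^2), by midpoint-rule integration of e^{theta x} f(x).
# mu, sigma, theta, the range and N are arbitrary illustrative choices.
mu, sigma, theta = 1.0, 2.0, 0.3

def f(x):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

a, b, N = mu - 10 * sigma, mu + 10 * sigma, 200_000
h = (b - a) / N
integral = sum(exp(theta * (a + (i + 0.5) * h)) * f(a + (i + 0.5) * h)
               for i in range(N)) * h

target = exp(mu * theta + 0.5 * sigma**2 * theta**2)
assert abs(integral - target) < 1e-6
```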

3.2 Bivariate Random Variable

Definition: Let g(X, Y) be a function of random variables X and Y. The mathematical expectation of g(X, Y), denoted by E(g(X, Y)), is defined as:

E(g(X, Y)) = Σ_i Σ_j g(x_i, y_j) f(x_i, y_j), (Discrete),

E(g(X, Y)) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f(x, y) dx dy, (Continuous).

The following four functional forms are important, i.e., mean, variance, covariance and the moment-generating function.

1. g(X, Y) = X:

The expectation of random variable X, i.e., E(X), is given by:

E(X) = Σ_i Σ_j x_i f(x_i, y_j), (Discrete),

E(X) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x f(x, y) dx dy, (Continuous),

which equals µ_x. The case of g(X, Y) = Y is exactly the same formulation as above, i.e., E(Y) = µ_y.

2. g(X, Y) = (X − µ_x)²:
113 114
The expectation of (X − µ_x)² is known as variance of X.

V(X) = E((X − µ_x)²) = Σ_i Σ_j (x_i − µ_x)² f(x_i, y_j), (Discrete),

V(X) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − µ_x)² f(x, y) dx dy, (Continuous),

which equals σ_x². The variance of Y is also obtained in the same way, i.e., V(Y) = σ_y².

3. g(X, Y) = (X − µ_x)(Y − µ_y):

The expectation of (X − µ_x)(Y − µ_y) is known as covariance (共分散) of X and Y, which is denoted by Cov(X, Y) and written as:
115 116

Cov(X, Y) = E((X − µ_x)(Y − µ_y))

          = Σ_i Σ_j (x_i − µ_x)(y_j − µ_y) f(x_i, y_j), (Discrete),

          = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − µ_x)(y − µ_y) f(x, y) dx dy, (Continuous).

Thus, covariance is defined in the case of bivariate random variables.

4. g(X, Y) = e^{θ1 X + θ2 Y}:

The mathematical expectation of e^{θ1 X + θ2 Y} is called the moment-generating function, which is denoted by:

φ(θ1, θ2) = E(e^{θ1 X + θ2 Y})

          = Σ_i Σ_j e^{θ1 x_i + θ2 y_j} f(x_i, y_j), (Discrete),

          = ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{θ1 x + θ2 y} f(x, y) dx dy, (Continuous).

In Section 5, the moment-generating function in the multivariate cases is discussed in more detail.
117 118

Some Formulas of Mean and Variance: We consider two random variables X For continuous random variables X and Y, we can show:
and Y. Z ∞Z ∞
E(X + Y) = (x + y) f xy (x, y) dx dy
1. Theorem: E(X + Y) = E(X) + E(Y). Z−∞ −∞
∞Z ∞
= x f xy (x, y) dx dy
Proof: −∞ −∞
Z ∞Z ∞
For discrete random variables X and Y, it is given by: + y f xy (x, y) dx dy
−∞ −∞
XX = E(X) + E(Y).
E(X + Y) = (xi + y j ) f xy (xi , y j )
i j
XX XX
= xi f xy (xi , y j ) + y j f xy (xi , y j )
i j i j
= E(X) + E(Y).

119 120
2. Theorem: E(XY) = E(X)E(Y), when X is independent of Y. For continuous random variables X and Y,
Z ∞Z ∞
Proof:
E(XY) = xy f xy (x, y) dx dy
For discrete random variables X and Y, Z−∞
∞Z ∞
−∞

= xy f x (x) fy (y) dx dy
XX XX −∞ −∞
E(XY) = xi y j f xy (xi , y j ) = xi y j f x (xi ) fy (y j ) Z ∞ Z ∞ 
i j i j
= x f x (x) dx y fy (y) dy = E(X)E(Y).
X X  −∞ −∞
= xi f x (xi ) y j fy (y j ) = E(X)E(Y).
i j When X is independent of Y, we have f xy (x, y) = f x (x) fy (y) in the second

If X is independent of Y, the second equality holds, i.e., f xy (xi , y j ) = f x (xi ) fy (y j ). equality.

121 122

3. Theorem: Cov(X, Y) = E(XY) − E(X)E(Y). = E(XY) − µ x µy

Proof: = E(XY) − E(X)E(Y).

For both discrete and continuous random variables, we can rewrite as follows: In the fourth equality, the theorem in Section 3.1 is used, i.e., E(µ x Y) =
µ x E(Y) and E(µy X) = µy E(X).
Cov(X, Y) = E((X − µ x )(Y − µy ))

= E(XY − µ x Y − µy X + µ x µy )

= E(XY) − E(µ x Y) − E(µy X) + µ x µy

= E(XY) − µ x E(Y) − µy E(X) + µ x µy

= E(XY) − µ x µy − µy µ x + µ x µy

123 124
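The identity Cov(X, Y) = E(XY) − E(X)E(Y) proved above can be illustrated on a small discrete joint distribution; the joint probabilities below are an arbitrary choice summing to one.

```python
# Illustrate Cov(X, Y) = E(XY) - E(X)E(Y) on a small discrete joint
# distribution.  The joint probabilities below are an arbitrary choice
# summing to one.
joint = {(0, 0): 0.2, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.4}

ex  = sum(x * p for (x, y), p in joint.items())
ey  = sum(y * p for (x, y), p in joint.items())
exy = sum(x * y * p for (x, y), p in joint.items())
cov_def = sum((x - ex) * (y - ey) * p for (x, y), p in joint.items())

assert abs(cov_def - (exy - ex * ey)) < 1e-12
```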

4. Theorem: Cov(X, Y) = 0, when X is independent of Y.

Proof:

From the above two theorems, we have E(XY) = E(X)E(Y) when X is independent of Y and Cov(X, Y) = E(XY) − E(X)E(Y). Therefore, Cov(X, Y) = 0 is obtained when X is independent of Y.

5. Definition: The correlation coefficient (相関係数) between X and Y, denoted by ρ_xy, is defined as:

ρ_xy = Cov(X, Y) / ( √V(X) √V(Y) ) = Cov(X, Y) / (σ_x σ_y).

ρ_xy > 0 =⇒ positive correlation between X and Y
ρ_xy −→ 1 =⇒ strong positive correlation
ρ_xy < 0 =⇒ negative correlation between X and Y
ρ_xy −→ −1 =⇒ strong negative correlation
125 126
6. Theorem: ρ xy = 0, when X is independent of Y. 7. Theorem: V(X ± Y) = V(X) ± 2Cov(X, Y) + V(Y).

Proof: Proof:

When X is independent of Y, we have Cov(X, Y) = 0. For both discrete and continuous random variables, V(X ± Y) is rewritten as
Cov(X, Y) follows:
We obtain the result ρ xy = √ √ = 0.
V(X) V(Y)
 
However, note that ρ xy = 0 does not mean the independence between X and V(X ± Y) = E ((X ± Y) − E(X ± Y))2
 
Y. = E ((X − µ x ) ± (Y − µy ))2

= E((X − µ x )2 ± 2(X − µ x )(Y − µy ) + (Y − µy )2 )

127 128

= E((X − µ x )2 ) ± 2E((X − µ x )(Y − µy )) 8. Theorem: −1 ≤ ρ xy ≤ 1.


2
+E((Y − µy ) ) Proof:
= V(X) ± 2Cov(X, Y) + V(Y). Consider the following function of t: f (t) = V(Xt − Y), which is always
greater than or equal to zero because of the definition of variance. Therefore,
for all t, we have f (t) ≥ 0. f (t) is rewritten as follows:

129 130

Therefore, we have:
Cov(X, Y)
f (t) = V(Xt − Y) = V(Xt) − 2Cov(Xt, Y) + V(Y) −1 ≤ √ √ ≤ 1.
V(X) V(Y)
= t2 V(X) − 2tCov(X, Y) + V(Y) Cov(X, Y)
From the definition of correlation coefficient, i.e., ρ xy = √ √ , we
 Cov(X, Y) 2 (Cov(X, Y))2 V(X) V(Y)
= V(X) t − + V(Y) − . obtain the result: −1 ≤ ρ xy ≤ 1.
V(X) V(X)
In order to have f (t) ≥ 0 for all t, we need the following condition:
(Cov(X, Y))2
V(Y) − ≥ 0,
V(X)
because the first term in the last equality is nonnegative, which implies:
(Cov(X, Y))2
≤ 1.
V(X)V(Y)

131 132
X XX
9. Theorem: V(X ± Y) = V(X) + V(Y), when X is independent of Y. V( ai Xi ) = ai a j Cov(Xi , X j ),
i i j
Proof:
where E(Xi ) = µi and ai is a constant value. Especially, when X1 , X2 , · · ·, Xn
From the theorem above, V(X ± Y) = V(X) ± 2Cov(X, Y) + V(Y) generally are mutually independent, we have the following:
holds. When random variables X and Y are independent, we have Cov(X, Y) = X X
V( ai Xi ) = a2i V(Xi ).
0. Therefore, V(X + Y) = V(X) + V(Y) holds, when X is independent of Y. i i

10. Theorem: For n random variables X1 , X2 , · · ·, Xn , Proof:


P
X X For mean of i ai Xi , the following representation is obtained.
E( ai X i ) = ai µi ,
X X X X
i i
E( ai Xi ) = E(ai Xi ) = ai E(Xi ) = ai µi .
i i i i

133 134

P
The first and second equalities come from the previous theorems on mean. For variance of i ai Xi , we can rewrite as follows:
X X 2
V( ai Xi ) = E ai (Xi − µi )
i i
X X 
=E ai (Xi − µi ) a j (X j − µ j )
i j
X X 
=E ai a j (Xi − µi )(X j − µ j )
i j
XX  
= ai a j E (Xi − µi )(X j − µ j )
i j
XX
= ai a j Cov(Xi , X j ).
i j

When X1 , X2 , · · ·, Xn are mutually independent, we obtain Cov(Xi , X j ) = 0

135 136

for all i , j from the previous theorem. Therefore, we obtain: 11. Theorem: n random variables X1 , X2 , · · ·, Xn are mutually independently
X X and identically distributed with mean µ and variance σ2 . That is, for all i =
V( ai Xi ) = a2i V(Xi ).
i i 1, 2, · · · , n, E(Xi ) = µ and V(Xi ) = σ2 are assumed. Consider arithmetic
P
Note that Cov(Xi , Xi ) = E((Xi − µ)2 ) = V(Xi ). average X = (1/n) ni=1 Xi . Then, mean and variance of X are given by:

σ2
E(X) = µ, V(X) = .
n

137 138
Proof: The variance of X is computed as follows:

1X 1 X 1 X
n n n
The mathematical expectation of X is given by:
V(X) = V( Xi ) = 2 V( Xi ) = 2 V(Xi )
n i=1 n n i=1
1X 1 X 1X
n n n i=1

1 X 2
E(X) = E( Xi ) = E( Xi ) = E(Xi ) n
n i=1 n i=1 n i=1 1 σ2
= 2 σ = 2 nσ2 = .
n i=1 n n
1X
n
1
= µ = nµ = µ.
n i=1 n We use V(aX) = a2 V(X) in the second equality and V(X + Y) = V(X) + V(Y)

E(aX) = aE(X) in the second equality and E(X +Y) = E(X)+E(Y) in the third for X independent of Y in the third equality, where X and Y denote random

equality are utilized, where X and Y are random variables and a is a constant variables and a is a constant value.

value.

139 140
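Theorem 11 can be illustrated by simulation: across many replications, the sample mean of n i.i.d. draws should have mean close to µ and variance close to σ²/n. All parameter values and the seed below are arbitrary choices.

```python
import random

# Simulation check of Theorem 11: across many replications, the sample
# mean Xbar of n i.i.d. N(mu, sigma^2) draws should have mean mu and
# variance sigma^2/n.  mu, sigma, n, reps and the seed are arbitrary.
random.seed(0)
mu, sigma, n, reps = 2.0, 3.0, 25, 20_000

xbars = [sum(random.gauss(mu, sigma) for _ in range(n)) / n for _ in range(reps)]
m = sum(xbars) / reps
v = sum((x - m) ** 2 for x in xbars) / reps

assert abs(m - mu) < 0.05            # E(Xbar) = mu = 2
assert abs(v - sigma**2 / n) < 0.05  # V(Xbar) = 9/25 = 0.36
```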

4 Transformation of Variables (変数変換)

Transformation of variables is used in the case of continuous random variables. Based on a distribution of a random variable, a distribution of the transformed random variable is derived. In other words, when a distribution of X is known, we can find a distribution of Y using the transformation of variables, where Y is a function of X.

4.1 Univariate Case

Distribution of Y = ψ⁻¹(X): Let f_x(x) be the probability density function of continuous random variable X and X = ψ(Y) be a one-to-one (一対一) transformation. Then, the probability density function of Y, i.e., f_y(y), is given by:

f_y(y) = |ψ'(y)| f_x( ψ(y) ).

We can derive the above transformation of variables from X to Y as follows. Let f_x(x) and F_x(x) be the probability density function and the distribution function of X, respectively. Note that F_x(x) = P(X ≤ x) and f_x(x) = F_x'(x).

141 142

When X = ψ(Y), we want to obtain the probability density function of Y. Let fy (y) respect to y, we can obtain the following expression:
and Fy (y) be the probability density function and the distribution function of Y,    
fy (y) = Fy0 (y) = ψ0 (y)F 0x ψ(y) = ψ0 (y) f x ψ(y) . (4)
respectively.
In the case of ψ0 (X) > 0, the distribution function of Y, Fy (y), is rewritten as follows:
 
Fy (y) = P(Y ≤ y) = P ψ(Y) ≤ ψ(y)
   
= P X ≤ ψ(y) = F x ψ(y) .

The first equality is the definition of the cumulative distribution function. The sec-
ond equality holds because of ψ0 (Y) > 0. Therefore, differentiating Fy (y) with

143 144
Next, in the case of ψ0 (X) < 0, the distribution function of Y, Fy (y), is rewritten as Note that −ψ0 (y) > 0.
follows: Thus, summarizing the above two cases, i.e., ψ0 (X) > 0 and ψ0 (X) < 0, equations
    (4) and (5) indicate the following result:
Fy (y) = P(Y ≤ y) = P ψ(Y) ≥ ψ(y) = P X ≥ ψ(y)
     
= 1 − P X < ψ(y) = 1 − F x ψ(y) . fy (y) = |ψ0 (y)| f x ψ(y) ,

Thus, in the case of ψ0 (X) < 0, pay attention to the second equality, where the which is called the transformation of variables.

inequality sign is reversed. Differentiating Fy (y) with respect to y, we obtain the


following result:
   
fy (y) = Fy0 (y) = −ψ0 (y)F 0x ψ(y) = −ψ0 (y) f x ψ(y) . (5)

145 146

Example 1.9: When X ∼ N(0, 1), we derive the probability density function of On Distribution of Y = X2 : As an example, when we know the distribution
Y = µ + σX. function of X as F x (x), we want to obtain the distribution function of Y, Fy (y),
Since we have: where Y = X 2 . Using F x (x), Fy (y) is rewritten as follows:
Y −µ
X = ψ(Y) = , √ √
σ Fy (y) = P(Y ≤ y) = P(X 2 ≤ y) = P(− y ≤ X ≤ y)
ψ0 (y) = 1/σ is obtained. Therefore, fy (y) is given by: √ √
= F x ( y) − F x (− y).
  1  1 
fy (y) = |ψ0 (y)| f x ψ(y) = √ exp − 2 (y − µ)2 ,
σ 2π 2σ The probability density function of Y is obtained as follows:

which indicates the normal distribution with mean µ and variance σ , denoted by2 1  √ √ 
fy (y) = Fy0 (y) = √ f x ( y) + f x (− y) .
2 y
N(µ, σ2 ).

147 148

Example: χ2 (1) Distribution: Define Y = X 2 , where X ∼ N(0, 1). Then, Y ∼ Note that the χ2 (n) distribution is:
χ2 (1). 1 n x
f x (x) = n x 2 −1 exp(− ), x > 0,
proof: 2 2 Γ( 2n ) 2
1  √ √  √
fy (y) = √ f x ( y) + f x (− y) where Γ( 2n ) = π
2 y
1  1 
= √ y−1/2 exp − y
2π 2
which is χ2 (1) distribution, where
1  1 
f x (x) = √ exp − x2
2π 2
√ √ 1  1 
f x ( y) = f x (− y) = √ exp − y
2π 2

149 150
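The χ²(1) result above can be checked pointwise: the transformation formula f_y(y) = (f_x(√y) + f_x(−√y))/(2√y) and the χ²(1) density y^{−1/2} e^{−y/2}/√(2π) should coincide for every y > 0. The test points below are arbitrary positive values.

```python
from math import exp, pi, sqrt

# Pointwise check of the transformation result for Y = X^2, X ~ N(0,1):
# fy(y) = (fx(sqrt(y)) + fx(-sqrt(y))) / (2*sqrt(y)) should coincide with
# the chi^2(1) density y^{-1/2} e^{-y/2} / sqrt(2*pi).
def fx(x):
    return exp(-0.5 * x * x) / sqrt(2 * pi)

def fy_transform(y):
    return (fx(sqrt(y)) + fx(-sqrt(y))) / (2 * sqrt(y))

def fy_chi2(y):
    return y ** -0.5 * exp(-y / 2) / sqrt(2 * pi)

for y in (0.1, 0.5, 1.0, 2.5, 5.0):
    assert abs(fy_transform(y) - fy_chi2(y)) < 1e-12
```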
4.2 Multivariate Cases where J is called the Jacobian of the transformation, which is defined as:
∂x ∂x
Bivariate Case: Let f xy (x, y) be a joint probability density function of X and Y. ∂u ∂v
.
J =
Let X = ψ1 (U, V) and Y = ψ2 (U, V) be a one-to-one transformation from (X, Y) to
∂y ∂y
(U, V). Then, we obtain a joint probability density function of U and V, denoted by ∂u ∂v
fuv (u, v), as follows:
 
fuv (u, v) = |J| f xy ψ1 (u, v), ψ2 (u, v) ,

151 152

Multivariate Case: Let f x (x1 , x2 , · · · , xn ) be a joint probability density function Then, we obtain a joint probability density function of Y1 , Y2 , · · ·, Yn , denoted by
of X1 , X2 , · · · Xn . Suppose that a one-to-one transformation from (X1 , X2 , · · · , Xn ) to fy (y1 , y2 , · · · , yn ), as follows:
(Y1 , Y2 , · · · , Yn ) is given by:
fy (y1 , y2 , · · · , yn )
 
X1 = ψ1 (Y1 , Y2 , · · · , Yn ), = |J| f x ψ1 (y1 , · · · , yn ), ψ2 (y1 , · · · , yn ), · · · , ψn (y1 , · · · , yn ) ,
X2 = ψ2 (Y1 , Y2 , · · · , Yn ),
..
.

Xn = ψn (Y1 , Y2 , · · · , Yn ).

153 154

where J is called the Jacobian of the transformation, which is defined as: Example: Normal Distribution: X ∼ N(µ1 , σ21 ) and Y ∼ N(µ2 , σ22 ). X is
∂x1 ∂x1 ∂x1
··· independent of Y.
∂y1 ∂y2 ∂yn
Then, X + Y ∼ N(µ1 + µ2 , σ21 + σ22 )
∂x2 ∂x2
∂x2
··· Proof:
∂y1 ∂y2 ∂yn
J= . The density functions of X and Y are:

... .. .. .. !
. . . 1 1
f x (x) = q exp − 2 (x − µ1 )2
2σ1
2πσ21
∂xn ∂xn ∂xn !
··· 1 1
∂y1 ∂y2 ∂yn fy (y) = q exp − 2 (y − µ2 )2
2πσ22 2σ2

155 156
The joint density of X and Y is: The joint density function of U and V, fuv (u, v), is given by:

f xy (x, y) = f x (x) fy (y) fuv (u, v) = |J| f xy (u − v, v)


! !
1 1 1 1 1 1
= exp − 2 (x − µ1 )2 − (y − µ2 )2
= exp − 2 (u − v − µ1 )2 − (v − µ2 )2
2πσ1 σ2 2σ1 2σ22 2πσ1 σ2 2σ1 2σ22

Define U = X + Y and V = Y. We obtain the joint distribution of U and V. The marginal density function of U is:
Using X = U − V and Y = V, the Jacobian is: Z
fu (u) = fuv (u, v)dv
∂x ∂x Z
1 −1  1 
J = ∂u ∂v = = 1 =
1
exp − 2 (u − v − µ1 )2 −
1
(v − µ2 )2 dv
∂y ∂y
0 1 2πσ1 σ2 2σ1 2σ22
∂u ∂v

157 158

Z Z
1 1 1
= exp − ((v − µ2 ) − (u − µ1 − µ2 ))2 = q
2πσ1 σ2 2σ21 2π/(1/σ21 + 1/σ22 )
!
1 !
− 2 (v − µ2 )2 dv 1 σ2
2σ2 × exp − ((v − µ2 ) − 2 2 2 (u − µ1 − µ2 ))2
Z 2/(1/σ21 2
+ 1/σ2 ) σ1 + σ2
1 1 !
= exp − ((v − µ2 ) 1 1 2
2πσ1 σ2 2/(1/σ21 + 1/σ22 ) × q exp − (u − µ 1 − µ2 ) dv
! 2π(σ2 + σ2 ) 2(σ21 + σ22 )
σ2 1 1 2
− 2 2 2 (u − µ1 − µ2 ))2 − (u − µ 1 − µ 2 )2
dv
σ1 + σ2 2(σ21 + σ22 )

159 160

Z
1 Example: χ2 Distribution: X ∼ χ2 (n) and Y ∼ χ2 (m). X is independent of Y.
= q
2π/(1/σ21 + 1/σ22 ) Then, X + Y ∼ χ2 (n + m).
1 Proof:
× exp − ((v − µ2 )
2/(1/σ21 + 1/σ22 )
! The density functions of X and Y are:
σ2
− 2 2 2 (u − µ1 − µ2 ))2 dv 1 n x
σ1 + σ2 f x (x) = x 2 −1 exp(− ), x > 0
! n
2 2 Γ( n2 ) 2
1 1
× q exp − 2 2
(u − µ1 − µ2 )2 1 m y
2π(σ21 + σ22 ) 2(σ1 + σ2 ) fy (y) = m m y 2 −1 exp(− ), y > 0
2 2 Γ( 2 ) 2
!
1 1 2
= q exp − (u − µ 1 − µ2 ) The joint density function of X and Y is:
2π(σ21 + σ22 ) 2(σ21 + σ22 )
f xy (x, y) = f x (x) fy (y)

161 162
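The convolution result just derived can be illustrated by simulation: the sample mean and variance of X + Y should be close to µ1 + µ2 and σ1² + σ2². The parameter values and the seed below are arbitrary choices.

```python
import random

# Simulation check: with X ~ N(mu1, s1^2) independent of Y ~ N(mu2, s2^2),
# X + Y should have mean mu1 + mu2 and variance s1^2 + s2^2.
random.seed(1)
mu1, s1, mu2, s2, reps = 1.0, 2.0, -0.5, 1.5, 200_000

sums = [random.gauss(mu1, s1) + random.gauss(mu2, s2) for _ in range(reps)]
m = sum(sums) / reps
v = sum((s - m) ** 2 for s in sums) / reps

assert abs(m - (mu1 + mu2)) < 0.05     # mean = 0.5
assert abs(v - (s1**2 + s2**2)) < 0.1  # variance = 6.25
```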
1 n x 1 m y The joint density function of U and V, fuv (u, v), is given by:
= n x 2 −1 exp(− ) m m y 2 −1 exp(− )
2 2 Γ( n2 ) 2 2 2 Γ( 2 ) 2
n m x+y fuv (u, v) = |J| f xy (u − v, v)
= Cx 2 −1 y 2 −1 exp(− )
2 n m u
= C(u − v) 2 −1 v 2 −1 exp(− )
1 2
where C = n+m .
2 2 Γ( 2n )Γ( m2 )
The marginal density function of U is:
From X = U − V and Y = V, the Jacobian is: Z
∂x ∂x fu (u) = fuv (u, v)dv
1 −1
J = ∂u ∂v = = 1 Z ∞
∂y ∂y u
0 1
n m
= C exp(− ) (u − v) 2 −1 v 2 −1 dv
∂u ∂v 2 0
Z ∞
u n m
= C exp(− ) (u − uw) 2 −1 (uw) 2 −1 udw
2 0

163 164

Z 1
n+m
2 −1
u n m Example: t Distribution: X ∼ N(0, 1) and Y ∼ χ2 (n). X is independent of Y.
= Cu exp(− ) (1 − w) 2 −1 w 2 −1 dw
2 0 X
n m n+m u Then, U = √ ∼ t(n)
= CB( , )u 2 −1 exp(− ) Y/n
2 2 2
1 Γ( n2 )Γ( m2 ) n+m −1 u
= n+m n m n+m u 2 exp(− )
2 2 Γ( 2 )Γ( 2 ) Γ( 2 ) 2
Note that the density function of t(n) is:
1 n+m u
= n+m n+m u 2 −1 exp(− ) !− n+1
2 2 Γ( 2 ) 2 Γ( n+1
2
) 1 x2 2
fu (x) = √ 1+
Γ( n2 ) nπ n
Beta function B(n, m) is:
Z ∞
Γ(n)Γ(m)
B(n, m) = (1 − x)n−1 xm−1 dx =
0 Γ(n + m)

165 166
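The χ² convolution result can likewise be checked by simulation: X + Y should behave like χ²(n + m), with mean n + m and variance 2(n + m). The degrees of freedom, sample size and seed below are arbitrary choices.

```python
import random

# Simulation check: X ~ chi^2(n) independent of Y ~ chi^2(m), so X + Y
# should behave like chi^2(n + m): mean n + m, variance 2(n + m).
random.seed(2)
n, m, reps = 3, 5, 100_000

def chi2_draw(k):
    return sum(random.gauss(0, 1) ** 2 for _ in range(k))

z = [chi2_draw(n) + chi2_draw(m) for _ in range(reps)]
mean = sum(z) / reps
var = sum((x - mean) ** 2 for x in z) / reps

assert abs(mean - (n + m)) < 0.1     # = 8
assert abs(var - 2 * (n + m)) < 0.5  # = 16
```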

r
Proof: The density functions of X and Y are: V
From X = U and Y = V, the Jacobian is:
n
1 1 r
f x (x) = √ exp(− x2 ), −∞ < x < ∞ ∂x ∂x v u r
2π 2
1 y J = ∂u ∂v = n 2 √nv = v

∂y ∂y 0
n
fy (y) = n n y 2 −1 exp(− ), y>0 1 n
2 2 Γ( 2 ) 2 ∂u ∂v
The joint density functions of X and Y, f xy (x, y), is: The joint density function of U and V, fuv (u, v), is:
r
v
fuv (u, v) = |J| f xy (u , v)
f xy (x, y) = f x (x) fy (y) n
r
1 1 1 n y v 1 1 u2 v 1 n v
= √ exp(− x2 ) n n y 2 −1 exp(− ) = √ exp(− ) n v 2 −1 exp(− )
2π 2 2 2 Γ( 2 ) 2 n 2π 2 n 2 2 Γ( n2 ) 2
!
1 1 n y 1 v u2
y 2 −1 exp(− − x2 )
n−1
= √ n n = Cv 2 exp − (1 + )
2π 2 2 Γ( 2 ) 2 2 2 n

167 168
1 1 1 !− n+1
where C = √ √ n+1 n . u2 2 n+1 n+1
n π 2 2 Γ( 2 ) =C 1+ 2 2 Γ( )
n 2
Z
1 n+1 1
The marginal density function of U is: × n+1
w 2 −1 exp(− w)dw
2 2 Γ( n+1 ) 2
Z 2
!− n+1
fu (u) = fuv (u, v)dv u2 2 n+1 n + 1
=C 1+ 2 2 Γ( )
Z ! n 2
n−1 v u2 !− n+1
= C v 2 exp − (1 + ) dv 1 1 1 n+1 n + 1 u2 2
2 n = √ n+1 n √ 2 2 Γ( ) 1+
Z ! n−1 !−1 π 2 2 Γ( 2 ) n 2 n
u2 −1 2 1 u2
=C w(1 + ) exp(− w) 1 + dw ! n+1
2 − 2
n 2 n Γ( n+1
2
) 1 u
! n+1 Z
= √ 1+
u2 2

n+1 1 Γ( n2 ) nπ n
=C 1+ w 2 −1 exp(− w)dw
n 2

169 170
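As a sanity check, the derived t(n) density should integrate to one. The sketch below uses the midpoint rule; n = 5 and the truncation range are arbitrary choices, and the cut tails are negligible here.

```python
from math import gamma, sqrt, pi

# Check that the t(n) density integrates to one (midpoint rule; the
# integrated tail mass beyond the truncation range is O(x^-5), negligible).
n = 5
c = gamma((n + 1) / 2) / (gamma(n / 2) * sqrt(n * pi))

def f(x):
    return c * (1 + x * x / n) ** (-(n + 1) / 2)

a, b, N = -100.0, 100.0, 200_000
h = (b - a) / N
total = sum(f(a + (i + 0.5) * h) for i in range(N)) * h

assert abs(total - 1.0) < 1e-6
```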

u2 Example: Cauchy Distribution: X ∼ N(0, 1) and Y ∼ N(0, 1). X is indepen-


Use integration by substitution by w = v(1 + ).
n
1 n+1 1 dent of Y.
Note that f (w) = n+1 n+1 w 2 −1 exp(− w) is the density function of χ2 (n + 1). X
2 2 Γ( ) 2 Then, U = is Cauchy.
2
Y

Note that the density function of U, fu (u), is:

1
f (u) =
π(1 + u2 )

171 172

x
Proof: The density functions of X and Y are: Transformation of variables by u = and v = y.
y
1 1 From x = uv and y = v, the Jacobian is:
f x (x) = √ exp(− x2 ), −∞ < x < ∞
2π 2 ∂x ∂x
v u
1 1 2
fy (y) = √ exp(− y ), −∞ < y < ∞ J = ∂u ∂v = = v
∂y
2π 2 ∂y 0 1
∂u ∂v
The joint density function of X and Y is:
The joint density function of U and V, fuv (u, v), is:
f xy (x, y) = f x (x) fy (y)
fuv (u, v) = |J| f xy (uv, v)
1 1 1 1
= √ exp(− x2 ) √ exp(− y2 ) 1 1
2π 2 2π 2 = |v| exp(− v2 (1 + u2 ))
1 1 2π 2
= exp(− (x2 + y2 ))
2π 2

173 174
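The joint density f_uv(u, v) just derived can be integrated over v numerically and compared with the Cauchy density 1/(π(1 + u²)) stated earlier. The midpoint grid and the test points u below are arbitrary choices.

```python
from math import exp, pi

# Check the marginalization of fuv(u, v) = |v| exp(-v^2 (1+u^2)/2) / (2*pi)
# over v against the Cauchy density 1/(pi*(1 + u^2)).
def marginal(u, N=60_000, vmax=12.0):
    h = 2 * vmax / N                 # midpoint rule over (-vmax, vmax)
    total = 0.0
    for i in range(N):
        v = -vmax + (i + 0.5) * h
        total += abs(v) * exp(-v * v * (1 + u * u) / 2) / (2 * pi)
    return total * h

for u in (0.0, 0.5, 2.0):
    assert abs(marginal(u) - 1 / (pi * (1 + u * u))) < 1e-6
```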
The marginal density function of U is:

f_u(u) = ∫ f_uv(u, v) dv

       = (1/2π) ∫_{−∞}^{∞} |v| exp( −v²(1 + u²)/2 ) dv

       = (1/π) ∫_0^{∞} v exp( −v²(1 + u²)/2 ) dv

       = (1/π) [ −(1/(1 + u²)) exp( −v²(1 + u²)/2 ) ]_{v=0}^{∞}

       = 1/( π(1 + u²) ).

5 Moment-Generating Function (積率母関数)

5.1 Univariate Case

As discussed in Section 3.1, the moment-generating function is defined as φ(θ) = E(e^{θX}).

For a random variable X, µ'_n ≡ E(X^n) is called the nth moment (n次の積率) of X.

175 176

1. Theorem: φ^(n)(0) = µ'_n ≡ E(X^n).

Proof:

First, from the definition of the moment-generating function, φ(θ) is written as:

φ(θ) = E(e^{θX}) = ∫_{−∞}^{∞} e^{θx} f(x) dx.

The nth derivative of φ(θ), denoted by φ^(n)(θ), is:

φ^(n)(θ) = ∫_{−∞}^{∞} x^n e^{θx} f(x) dx.

Evaluating φ^(n)(θ) at θ = 0, we obtain:

φ^(n)(0) = ∫_{−∞}^{∞} x^n f(x) dx = E(X^n) ≡ µ'_n,

where the second equality comes from the definition of the mathematical expectation.

177 178

2. Remark: The moment-generating function is a weighted sum of all the moments. Expanding f(θ) = e^{θX} around θ = 0 as f(θ) = Σ_{n=0}^{∞} (1/n!) f^(n)(0) θ^n, with f^(n)(θ) = X^n e^{θX} and hence f^(n)(0) = X^n, we obtain:

φ(θ) = E(e^{θX}) = E( Σ_{n=0}^{∞} (1/n!) X^n θ^n ) = Σ_{n=0}^{∞} (1/n!) E(X^n) θ^n.

3. Remark: φ(θ) does not exist, if E(X^n) for some n does not exist.

φ(θ) is finite. ⇐⇒ All the moments exist.
179 180
4. Remark: Let X and Y be two random variables. Suppose that both moment- 5. Theorem: Let φ(θ) be the moment-generating function of X. Then, the
generating functions exist. When the moment-generating function of X is moment-generating function of Y, where Y = aX + b, is given by ebθ φ(aθ).
equivalent to that of Y, we have the fact that X has the same distribution as Y. Proof:

Let φy (θ) be the moment-generating function of Y. Then, φy (θ) is rewritten as


φ x (θ) = φy (θ) ⇐⇒ E(X n ) = E(Y n ) for all n follows:
⇐⇒ f x (t) = fy (t)
φy (θ) = E(eθY ) = E(eθ(aX+b) ) = ebθ E(eaθX ) = ebθ φ(aθ).

φ(θ) represents the moment-generating function of X.

181 182

6. Theorem: Let φ1 (θ), φ2 (θ), · · ·, φn (θ) be the moment-generating functions Proof:


of X1 , X2 , · · ·, Xn , which are mutually independently distributed random vari- The moment-generating function of Y, i.e., φy (θ), is rewritten as:
ables. Define Y = X1 + X2 + · · · + Xn . Then, the moment-generating function
of Y is given by φ1 (θ)φ2 (θ) · · · φn (θ), i.e., φy (θ) = E(eθY ) = E(eθ(X1 +X2 +···+Xn ) )

= E(eθX1 )E(eθX2 ) · · · E(eθXn )


φy (θ) = E(eθY ) = φ1 (θ)φ2 (θ) · · · φn (θ),
= φ1 (θ)φ2 (θ) · · · φn (θ).
where φy (θ) represents the moment-generating function of Y.
The third equality holds because X1 , X2 , · · ·, Xn are mutually independently
distributed random variables.

183 184

7. Theorem: When X1 , X2 , · · ·, Xn are mutually independently and identically Proof:


distributed and the moment-generating function of Xi is given by φ(θ) for Using the above theorem, we have the following:
 n
all i, the moment-generating function of Y is represented by φ(θ) , where
Y = X1 + X2 + · · · + Xn . φy (θ) = φ1 (θ)φ2 (θ) · · · φn (θ)
 n
= φ(θ)φ(θ) · · · φ(θ) = φ(θ) .

Note that φi (θ) = φ(θ) for all i.

185 186
8. Theorem: When X1 , X2 , · · ·, Xn are mutually independently and identically Proof:
distributed and the moment-generating function of Xi is given by φ(θ) for Let φ x (θ) be the moment-generating function of X.
 θ n
all i, the moment-generating function of X is represented by φ( ) , where Y
n
P n θ Pn θ
X = (1/n) ni=1 Xi . φ x (θ) = E(eθX ) = E(e n i=1 Xi
)= E(e n Xi )
i=1
Y
n
θ  θ n
= φ( ) = φ( ) .
i=1
n n

187 188

Bernoulli Distribution: The probability function of Bernoulli random variable X Binomial Distribution: For the binomial random variable, the moment-generating
is: function φ(θ) is known as:
x 1−x
f (x) = p (1 − p) , x = 0, 1
φ(θ) = (peθ + 1 − p)n ,
The moment-generating function of X is:
which is discussed in Example 1.5 (Section 3.1). Using the moment-generating
φ(θ) = peθ + 1 − p function, we check whether E(X) = np and V(X) = np(1 − p) are obtained when X
is a binomial random variable.
Mean: E(X) = φ0 (0) = p
2 2 00 2 The first- and the second-derivatives with respect to θ are given by:
Variance: V(X) = E(X ) − (E(X)) = φ (0) − p = p(1 − p)

φ0 (θ) = npeθ (peθ + 1 − p)n−1 ,

189 190

φ00 (θ) = npeθ (peθ + 1 − p)n−1 + n(n − 1)p2 e2θ (peθ + 1 − p)n−2 . Poisson Distribution: The probability function of Poisson random variable X is:

λx
Evaluating at θ = 0, we have: f (x) = e−λ , x = 0, 1, 2, · · ·
x!

E(X) = φ0 (0) = np, E(X 2 ) = φ00 (0) = np + n(n − 1)p2 . The moment-generating function of X is:
X

λx
Therefore, V(X) = E(X 2 ) − (E(X))2 = np(1 − p) can be derived. Thus, we can make φ(θ) = eθx e−λ
x=0
x!
sure that E(X) and V(X) are obtained from φ(θ). X

θ θ (eθ λ) x
= e−λ ee λ e−e λ
x=0
x!
= exp(λ(eθ − 1))

191 192
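The closed form exp(λ(e^θ − 1)) for the Poisson moment-generating function can be checked against the defining sum. In the sketch below, terms are accumulated through the ratio λe^θ/x of consecutive terms to avoid overflowing factorials; λ = 3 and the θ values are arbitrary choices.

```python
from math import exp

# Check the Poisson MGF phi(theta) = exp(lam*(e^theta - 1)) against the
# defining sum, accumulating terms by their ratio lam*e^theta/x so that
# no large factorial is ever formed.  lam = 3 is an arbitrary choice.
lam = 3.0

def mgf_sum(theta, terms=200):
    term = exp(-lam)                   # x = 0 term of sum_x e^{theta x} f(x)
    total = term
    for x in range(1, terms):
        term *= lam * exp(theta) / x   # ratio of consecutive terms
        total += term
    return total

checks = [abs(mgf_sum(t) - exp(lam * (exp(t) - 1))) for t in (-1.0, 0.0, 0.5, 1.0)]
assert max(checks) < 1e-9
```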
Normal Distribution: When X ∼ N(µ, σ²), the moment-generating function of X is given by φ(θ) = exp(µθ + σ²θ²/2) from the previous example. Obtain E(X) and V(X), using φ(θ).

• E(X) = φ'(0) = µ, from φ'(θ) = (µ + σ²θ) exp(µθ + σ²θ²/2).

• E(X²) = φ''(0) = σ² + µ², from φ''(θ) = σ² exp(µθ + σ²θ²/2) + (µ + σ²θ)² exp(µθ + σ²θ²/2).

• V(X) = E(X²) − (E(X))² = (σ² + µ²) − µ² = σ².

Cauchy Distribution: The Cauchy density is f(x) = 1/( π(1 + x²) ) for −∞ < x < ∞.

E(X) = ∫ x f(x) dx = ∫ x/( π(1 + x²) ) dx = (1/2π) [ log(1 + x²) ]_{−∞}^{∞},

which diverges, i.e., E(X) does not exist. =⇒ φ(θ) does not exist.

Similarly, for the t(k) distribution, E(X^k) does not exist.

193 194

Uniform Distribution: The density function is: θ(beθb − aeθa ) − (eθb − eθa )
φ0 (θ) =
θ2 (b − a)
1 Mean:
f (x) = , a<x<b
b−a
E(X) = φ0 (0) ←− Use L’Hospital’s rule.
The moment-generating function is:
a+b
Z ∞ Z b =
1 2
φ(θ) = eθx f (x)dx = eθx dx
b−a
"
−∞
#b
a (*) f (θ) = θ(beθb − aeθa ) − (eθb − eθa ), g(θ) = θ2 (b − a)
θx θb θa
e e −e
= = f 0 (θ) = θ(b2 eθb − a2 eθa ), g0 (θ) = 2θ(b − a)
θ(b − a) a θ(b − a)
f (θ) f 0 (θ) θ(b2 eθb − a2 eθa ) a + b
lim = lim = lim =
θ→0 g(θ) θ→0 g0 (θ) θ→0 2θ(b − a) 2

195 196

(*) L’Hospital’s rule Variance: V(X) = E(X 2 ) − (E(X))2


For two continuous functions f (x) and g(x),
E(X 2 ) = φ00 (0)
f (x) f 0 (x) f (x) f 0 (x) θ2 (b2 eθb − a2 eθa ) − 2θ(beθb − aeθa ) + 2(eθb − eθa )
lim = lim 0 , or lim = lim 0 , =
x→∞ g(x) x→∞ g (x) x→0 g(x) x→0 g (x)
θ3 (b − a)
L’Hospital’s rule is used when we have: 


 2 2 θb 2 θa θb θa θb θa
 f (θ) = θ (b e − a e ) − 2θ(be − ae ) + 2(e − e )
f (x) ∞ 0 


lim = or ,  g(θ) = θ3 (b − a)
x→∞ g(x) ∞ 0

or 

 0 2 3 θb 3 θa
 f (θ) = θ (b e − a e )
f (x) ∞ 0 


lim = or .  g0 (θ) = 3θ2 (b − a)
x→0 g(x) ∞ 0

197 198
Exponential Distribution: The exponential distribution is:
0
f (θ) f (θ)
φ00 (0) = lim = lim f (x) = λe−λx , 0<x
θ→0 g(θ) θ→0 g0 (θ)
θ2 (b3 eθb − a3 eθa ) b2 + ba + a2
= lim = The moment-generating function is:
θ→0 3θ2 (b − a) 3
Z ∞ Z ∞
φ(θ) = eθx f (x)dx = eθx λe−λx dx
V(X) = E(X 2 ) − (E(X))2 −∞
Z ∞ 0
λ λ
= φ00 (0) − (φ0 (0))2 ←− L’Hospital’s rule = (λ − θ)e−(λ−θ)x dx =
λ−θ 0 λ−θ
b2 + ba + a2  a + b 2 (b − a)2
= − = Use the exponential distribution with parameter λ − θ in the integration.
3 2 12

199 200

1. Mean: E(X) = φ0 (0) χ2 Distribution: The density function is:


λ 1 n x
φ0 (θ) = f (x) = n x 2 −1 exp(− ), 0<x
(λ − θ)2 2 2 Γ( n2 ) 2
1 The moment-generating function is:
E(X) = φ0 (0) = Z ∞
λ
φ(θ) = eθx f (x)dx
2. Variance: V(X) = E(X 2 ) − (E(X))2 Z−∞ ∞
λ 1 n x
E(X 2 ) = φ00 (0) φ00 (θ) = 2 = eθx n n x 2 −1 exp(− )dx
(λ − θ)3 0 2 2 Γ( 2 ) 2
Z ∞ !
1 1
V(X) = E(X 2 ) − (E(X))2 = φ00 (0) − (φ0 (0))2
n
= x 2 −1 exp − (1 − 2θ)x dx
n n
0 2 2 Γ( 2 ) 2
2 1 1 Z ∞
= 2− 2 = 2 1  y  2 −1
n
λ λ λ 1 1
= n n
exp(− y) dy
0 2 2 Γ( 2 ) 1 − 2θ 2 1 − 2θ

201 202
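The exponential results E(X) = 1/λ and V(X) = 1/λ² can be recovered from φ(θ) = λ/(λ − θ) by numerical differentiation at θ = 0; λ = 2 and the step h are arbitrary choices.

```python
# Recover E(X) = 1/lambda and V(X) = 1/lambda^2 from the exponential
# moment-generating function phi(theta) = lambda/(lambda - theta) by
# central-difference differentiation at theta = 0.
lam, h = 2.0, 1e-4

def phi(t):
    return lam / (lam - t)

d1 = (phi(h) - phi(-h)) / (2 * h)            # ~ phi'(0)  = E(X)
d2 = (phi(h) - 2 * phi(0) + phi(-h)) / h**2  # ~ phi''(0) = E(X^2)

assert abs(d1 - 1 / lam) < 1e-6              # E(X) = 0.5
assert abs(d2 - d1**2 - 1 / lam**2) < 1e-4   # V(X) = 0.25
```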

! 2n Z ∞
1 1 n 1 1. Mean: E(X) = φ0 (0)
= n n
y 2 −1 exp(− y)dy
1 − 2θ 0 2 2 Γ( 2 ) 2 n

 1  2n φ0 (θ) = (− 2n )(−2)(1 − 2θ)− 2 −1


=
1 − 2θ E(X) = φ0 (0) = n
Use integration by substitution by y = (1 − 2θ)x
dx 2. Variance: V(X) = E(X 2 ) − (E(X))2
= (1 − 2θ)−1
dy E(X 2 ) = φ00 (0)
Use the χ2 (n) distribution in the integration.
n
φ00 (θ) = (− n2 )(− 2n − 1)(−2)2 (1 − 2θ)− 2 −1

V(X) = E(X 2 ) − (E(X))2 = φ00 (0) − (φ0 (0))2

= n(n + 2) − n2 = 2n

203 204
Sum of Bernoulli Random Variables: X1 , X2 , · · ·, Xn are mutually indepen- The moment-generating function of Y, φy (θ), is:
dently and identically distributed as Bernoulli random variable with parameter p.
φy (θ) = E(eθY ) = E(eθ(X1 +X2 +···+Xn ) )
Then, the probability function of Y = X1 + X2 + · · · + Xn is B(n, p).
= E(eθX1 )E(eθX2 ) · · · E(eθXn ) = φ1 (θ)φ2 (θ) · · · φn (θ)
Proof: The moment-generating function of Xi , φi (θ), is:
 n
= φ(θ) = (peθ + 1 − p)n ,
φi (θ) = peθ + 1 − p
which is the moment-generating function of B(n, p).
Note:
In the third equality, X1 , X2 , · · ·, Xn are mutually independent.
In the fifth equality, X1 , X2 , · · ·, Xn are identically distributed.

205 206
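The same conclusion can be reached without moment-generating functions by convolving the Bernoulli distribution with itself n times and comparing the result with the B(n, p) probability function; n = 6 and p = 0.35 are arbitrary illustrative choices.

```python
from math import comb

# Build the distribution of Y = X1 + ... + Xn by repeatedly convolving
# Bernoulli(p) probabilities, then compare with the B(n, p) probability
# function.
n, p = 6, 0.35
dist = [1.0]                       # distribution of the empty sum
for _ in range(n):
    new = [0.0] * (len(dist) + 1)
    for k, q in enumerate(dist):
        new[k] += q * (1 - p)      # next Bernoulli draw equals 0
        new[k + 1] += q * p        # next Bernoulli draw equals 1
    dist = new

max_err = max(abs(dist[k] - comb(n, k) * p**k * (1 - p)**(n - k))
              for k in range(n + 1))
assert max_err < 1e-12
```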

Sum of Two Normal Random Variables: X ∼ N(µ1, σ1²) and Y ∼ N(µ2, σ2²). X is independent of Y.

Then, aX + bY ∼ N(aµ1 + bµ2, a²σ1² + b²σ2²), where a and b are constants.

Proof: Suppose that the moment-generating functions of X and Y are given by φ_x(θ) and φ_y(θ):

φ_x(θ) = exp( µ1 θ + σ1²θ²/2 ),    φ_y(θ) = exp( µ2 θ + σ2²θ²/2 ).

The moment-generating function of W = aX + bY is:

φ_w(θ) = E(e^{θW}) = E(e^{θ(aX+bY)}) = E(e^{aθX}) E(e^{bθY}) = φ_x(aθ) φ_y(bθ)

       = exp( µ1(aθ) + σ1²(aθ)²/2 ) × exp( µ2(bθ) + σ2²(bθ)²/2 )

       = exp( (aµ1 + bµ2)θ + (a²σ1² + b²σ2²)θ²/2 ),

which is the moment-generating function of the normal distribution with mean aµ1 + bµ2 and variance a²σ1² + b²σ2².

Therefore, aX + bY ∼ N(aµ1 + bµ2, a²σ1² + b²σ2²).
207 208

Sum of Two χ² Random Variables: X ∼ χ²(n) and Y ∼ χ²(m). X is independent of Y.

Then, Z = X + Y ∼ χ²(n + m).

Proof: Let φ_x(θ) and φ_y(θ) be the moment-generating functions of X and Y, which are given by:

φ_x(θ) = ( 1/(1 − 2θ) )^{n/2},    φ_y(θ) = ( 1/(1 − 2θ) )^{m/2}.

The moment-generating function of Z = X + Y is:

φ_z(θ) ≡ E(e^{θZ}) = E(e^{θ(X+Y)}) = E(e^{θX}) E(e^{θY}) = φ_x(θ) φ_y(θ)

       = ( 1/(1 − 2θ) )^{n/2} ( 1/(1 − 2θ) )^{m/2} = ( 1/(1 − 2θ) )^{(n+m)/2},

which is the moment-generating function of the χ²(n + m) distribution. Therefore, Z ∼ χ²(n + m).

Note: In the third equality, X and Y are independent.
209 210
5.2 Multivariate Cases 1. Theorem: Consider two random variables X and Y. Let φ(θ1 , θ2 ) be the
moment-generating function of X and Y. Then, we have the following result:
Bivariate Case: As discussed in Section 3.2, for two random variables X and Y,
the moment-generating function is defined as φ(θ1 , θ2 ) = E(eθ1 X+θ2 Y ). Some useful ∂ j+k φ(0, 0)
= E(X j Y k ).
∂θ1j ∂θ2k
and important theorems and remarks are shown as follows.
Proof:

Let f xy (x, y) be the probability density function of X and Y. From the defini-
tion, φ(θ1 , θ2 ) is written as:
Z ∞ Z ∞
φ(θ1 , θ2 ) = E(eθ1 X+θ2 Y ) = eθ1 x+θ2 y f xy (x, y) dx dy.
−∞ −∞

211 212

Taking the jth derivative of φ(θ1 , θ2 ) with respect to θ1 and at the same time 2. Remark: Let (Xi , Yi ) be a pair of random variables. Suppose that the
the kth derivative with respect to θ2 , we have the following expression: moment-generating function of (X1 , Y1 ) is equivalent to that of (X2 , Y2 ). Then,
Z ∞Z ∞ (X1 , Y1 ) has the same distribution function as (X2 , Y2 ).
∂ j+k φ(θ1 , θ2 )
j
= x j yk eθ1 x+θ2 y f xy (x, y) dx dy.
k
∂θ1 ∂θ2 −∞ −∞

Evaluating the above equation at (θ1 , θ2 ) = (0, 0), we can easily obtain:
Z ∞Z ∞
∂ j+k φ(0, 0)
j
= x j yk f xy (x, y) dx dy ≡ E(X j Y k ).
k
∂θ1 ∂θ2 −∞ −∞

213 214

3. Theorem: Let φ(θ1 , θ2 ) be the moment-generating function of (X, Y). Proof:

The moment-generating function of X is given by φ1 (θ1 ) and that of Y is Again, the definition of the moment-generating function of X and Y is repre-
φ2 (θ2 ). sented as:
Z ∞ Z ∞
Then, we have the following facts:
φ(θ1 , θ2 ) = E(eθ1 X+θ2 Y ) = eθ1 x+θ2 y f xy (x, y) dx dy.
−∞ −∞

φ1 (θ1 ) = φ(θ1 , 0), φ2 (θ2 ) = φ(0, θ2 ).


When φ(θ1 , θ2 ) is evaluated at θ2 = 0, φ(θ1 , 0) is rewritten as follows:
Z ∞Z ∞
φ(θ1 , 0) = E(eθ1 X ) = eθ1 x f xy (x, y) dx dy
−∞ −∞
Z ∞ Z ∞ 
= eθ 1 x f xy (x, y) dy dx
−∞ −∞

215 216
Z ∞
= eθ1 x f x (x) dx = E(eθ1 X ) = φ1 (θ1 ). 4. Theorem: The moment-generating function of (X, Y) is given by φ(θ1 , θ2 ).
−∞
Let φ1 (θ1 ) and φ2 (θ2 ) be the moment-generating functions of X and Y, respec-
Thus, we obtain the result: φ(θ1 , 0) = φ1 (θ1 ).
tively.
Similarly, φ(0, θ2 ) = φ2 (θ2 ) can be derived.
If X is independent of Y, we have:

φ(θ1 , θ2 ) = φ1 (θ1 )φ2 (θ2 ).

217 218

Proof: Multivariate Case: For multivariate random variables X1 , X2 , · · ·, Xn , the moment-

From the definition of φ(θ1 , θ2 ), the moment-generating function of X and Y generating function is defined as:

is rewritten as follows:
φ(θ1 , θ2 , · · · , θn ) = E(eθ1 X1 +θ2 X2 +···+θn Xn ).

φ(θ1 , θ2 ) = E(eθ1 X+θ2 Y ) = E(eθ1 X )E(eθ2 Y ) = φ1 (θ1 )φ2 (θ2 ).

The second equality holds because X is independent of Y.

219 220

1. Theorem: If the multivariate random variables X1 , X2 , · · ·, Xn are mutually Proof:


independent, From the definition of the moment-generating function in the multivariate
the moment-generating function of X1 , X2 , · · ·, Xn , denoted by φ(θ1 , θ2 , · · ·, cases, we obtain the following:
θn ), is given by:
φ(θ1 , θ2 , · · · , θn ) = E(eθ1 X1 +θ2 X2 +···+θn Xn )
φ(θ1 , θ2 , · · · , θn ) = φ1 (θ1 )φ2 (θ2 ) · · · φn (θn ), = E(eθ1 X1 )E(eθ2 X2 ) · · · E(eθn Xn )

= φ1 (θ1 )φ2 (θ2 ) · · · φn (θn ).


where φi (θ) = E(eθXi ).

221 222
2. Theorem: Suppose that the multivariate random variables X1, X2, · · ·, Xn are mutually independently and identically distributed as Xi ∼ N(µ, σ²).
Let us define µ̂ = Σ_{i=1}^n ai Xi, where ai, i = 1, 2, · · · , n, are assumed to be known.
Then, µ̂ ∼ N(µ Σ_{i=1}^n ai, σ² Σ_{i=1}^n ai²).

Proof:
From Example 1.8 (p.111) and Example 1.9 (p.147), it is shown that the moment-generating function of X is given by φx(θ) = exp(µθ + (1/2)σ²θ²) when X is normally distributed as X ∼ N(µ, σ²).

Let φµ̂ be the moment-generating function of µ̂.

    φµ̂(θ) = E(e^{θ µ̂}) = E(e^{θ Σ_{i=1}^n ai Xi}) = Π_{i=1}^n E(e^{θ ai Xi})
           = Π_{i=1}^n φx(ai θ) = Π_{i=1}^n exp(µ ai θ + (1/2) σ² ai² θ²)
           = exp(µ (Σ_{i=1}^n ai) θ + (1/2) σ² (Σ_{i=1}^n ai²) θ²),

which is equivalent to the moment-generating function of the normal distribution with mean µ Σ_{i=1}^n ai and variance σ² Σ_{i=1}^n ai², where µ and σ² in φx(θ) are simply replaced by µ Σ_{i=1}^n ai and σ² Σ_{i=1}^n ai² in φµ̂(θ), respectively.

Moreover, note as follows.
When ai = 1/n is taken for all i = 1, 2, · · · , n, i.e., when µ̂ = X̄ is taken, µ̂ = X̄ is normally distributed as: X̄ ∼ N(µ, σ²/n).
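The theorem above can be checked by simulation. The following sketch (the weights and parameters are assumed for illustration, not from the lecture note) draws many replications of µ̂ = Σ ai Xi and compares its sample mean and variance with µ Σ ai and σ² Σ ai²:

```python
import random

random.seed(0)
mu, sigma = 2.0, 3.0
a = [0.1, 0.4, 0.5]           # hypothetical known weights a_i
R = 200_000                   # number of Monte Carlo replications

draws = []
for _ in range(R):
    xs = [random.gauss(mu, sigma) for _ in a]
    draws.append(sum(ai * xi for ai, xi in zip(a, xs)))

m = sum(draws) / R                                 # sample mean of mu-hat
v = sum((d - m) ** 2 for d in draws) / (R - 1)     # sample variance of mu-hat

print(abs(m - mu * sum(a)) < 0.05)                 # close to mu * sum(a_i)
print(abs(v - sigma ** 2 * sum(ai ** 2 for ai in a)) < 0.1)  # close to sigma^2 * sum(a_i^2)
```

With these weights, Σ ai = 1, so µ̂ is centered on µ itself, in line with the unbiasedness condition discussed later.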

6 Law of Large Numbers (大数の法則) and Central Limit Theorem (中心極限定理)

6.1 Chebyshev's Inequality (チェビシェフの不等式)

Theorem: Let g(X) be a nonnegative function of the random variable X, i.e., g(X) ≥ 0.
If E(g(X)) exists, then we have:

    P(g(X) ≥ k) ≤ E(g(X)) / k,    (6)

for a positive constant value k.
Proof:
We define the discrete random variable U as follows:

    U = 1, if g(X) ≥ k,
    U = 0, if g(X) < k.

Thus, the discrete random variable U takes 0 or 1.
Suppose that the probability function of U is given by:

    f(u) = P(U = u),

where P(U = u) is represented as:

    P(U = 1) = P(g(X) ≥ k),
    P(U = 0) = P(g(X) < k).

Then, regardless of the value which U takes, the following inequality always holds:

    g(X) ≥ kU,

because we have g(X) ≥ k = kU when U = 1 and g(X) ≥ 0 = kU when U = 0, where k is a positive constant value.
Therefore, taking the expectation on both sides, we obtain:

    E(g(X)) ≥ kE(U),    (7)

where E(U) is given by:

    E(U) = Σ_{u=0}^{1} u P(U = u) = 1 × P(U = 1) + 0 × P(U = 0)
         = P(U = 1) = P(g(X) ≥ k).    (8)

Accordingly, substituting equation (8) into equation (7), we have the following inequality:

    P(g(X) ≥ k) ≤ E(g(X)) / k.

Chebyshev's Inequality: Assume that E(X) = µ, V(X) = σ², and λ is a positive constant value. Then, we have the following inequality:

    P(|X − µ| ≥ λσ) ≤ 1/λ²,

or equivalently,

    P(|X − µ| < λσ) ≥ 1 − 1/λ²,

which is called Chebyshev's inequality.

Proof:
Take g(X) = (X − µ)² and k = λ²σ². Then, we have:

    P((X − µ)² ≥ λ²σ²) ≤ E((X − µ)²) / (λ²σ²),

which implies P(|X − µ| ≥ λσ) ≤ 1/λ².
Note that E((X − µ)²) = V(X) = σ².
Since we have P(|X − µ| ≥ λσ) + P(|X − µ| < λσ) = 1, we can derive the following inequality:

    P(|X − µ| < λσ) ≥ 1 − 1/λ².    (9)

An Interpretation of Chebyshev's Inequality: 1/λ² is an upper bound for the probability P(|X − µ| ≥ λσ).
Equation (9) is rewritten as:

    P(µ − λσ < X < µ + λσ) ≥ 1 − 1/λ².

That is, the probability that X falls within λσ units of µ is greater than or equal to 1 − 1/λ².
Taking an example of λ = 2, the probability that X falls within two standard deviations of its mean is at least 0.75.
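A quick numerical sketch of the λ = 2 case (the standard normal population is an assumed example): the empirical frequency of |X − µ| < 2σ must come out at least 0.75, as the inequality guarantees for any distribution with finite variance:

```python
import random

random.seed(1)
n = 100_000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]  # mu = 0, sigma = 1

lam = 2.0
inside = sum(1 for x in xs if abs(x) < lam) / n  # empirical P(|X - mu| < 2*sigma)

# Chebyshev guarantees at least 1 - 1/lam^2 = 0.75; for the normal it is about 0.954
print(inside >= 1 - 1 / lam ** 2)
```

The normal case illustrates how conservative the bound can be: the true probability (about 0.954) is well above the guaranteed 0.75.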
Furthermore, note as follows.
Taking ε = λσ, we obtain:

    P(|X − µ| ≥ ε) ≤ σ²/ε²,

i.e.,

    P(|X − E(X)| ≥ ε) ≤ V(X)/ε²,    (10)

which is used in the next section.

Remark: Equation (10) can be derived when we take g(X) = (X − µ)², µ = E(X) and k = ε² in equation (6).
Even when we have µ ≠ E(X), the following inequality still holds:

    P(|X − µ| ≥ ε) ≤ E((X − µ)²) / ε².

Note that E((X − µ)²) represents the mean square error (MSE).
When µ = E(X), the mean square error reduces to the variance.

6.2 Law of Large Numbers (大数の法則) and Convergence in Probability (確率収束)

Law of Large Numbers 1: Assume that X1, X2, · · ·, Xn are mutually independently and identically distributed with mean E(Xi) = µ for all i.
Suppose that the moment-generating function of Xi is finite.

Define X̄n = (1/n) Σ_{i=1}^n Xi.

Then, X̄n −→ µ as n −→ ∞.

Proof: The moment-generating function of Xi is written as:

    φ(θ) = 1 + µ'1 θ + (1/2!) µ'2 θ² + (1/3!) µ'3 θ³ + · · · = 1 + µ'1 θ + O(θ²),

where µ'k = E(Xi^k) for all k. That is, all the moments exist.
The moment-generating function of X̄n is then:

    φx̄(θ) = ( φ(θ/n) )^n = ( 1 + µ'1 (θ/n) + O(θ²/n²) )^n
           = ( (1 + x)^{1/x} )^{µθ + O(n^{-1})},   where x = µ'1 (θ/n) + O(θ²/n²),
           −→ exp(µθ) as x −→ 0 (i.e., as n −→ ∞),

which is the moment-generating function of the following probability function:

    f(x) = 1 if x = µ,
           0 otherwise.

Indeed, for this probability function,

    φ(θ) = Σ_x e^{θx} f(x) = e^{θµ} f(µ) = e^{θµ}.

Law of Large Numbers 2: Assume that X1, X2, · · ·, Xn are mutually independently and identically distributed with mean E(Xi) = µ and variance V(Xi) = σ² < ∞ for all i.
Then, for any positive value ε, as n −→ ∞, we have the following result:

    P(|X̄n − µ| ≥ ε) −→ 0,

where X̄n = (1/n) Σ_{i=1}^n Xi.
We say that X̄n converges in probability to µ.
Proof:
Using (10), Chebyshev's inequality is represented as follows:

    P(|X̄n − E(X̄n)| ≥ ε) ≤ V(X̄n)/ε²,

where X in (10) is replaced by X̄n.
We know E(X̄n) = µ and V(X̄n) = σ²/n, which are substituted into the above inequality.
Then, we obtain:

    P(|X̄n − µ| ≥ ε) ≤ σ²/(nε²).

Accordingly, when n −→ ∞, the following holds:

    P(|X̄n − µ| ≥ ε) ≤ σ²/(nε²) −→ 0.

That is, X̄n −→ µ is obtained as n −→ ∞, which is written as: plim X̄n = µ.
This theorem is called the law of large numbers.
The condition P(|X̄n − µ| ≥ ε) −→ 0, or equivalently P(|X̄n − µ| < ε) −→ 1, is used as the definition of convergence in probability (確率収束).
In this case, we say that X̄n converges in probability to µ.
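The law of large numbers can be visualized with a short simulation (the Bernoulli population below is an assumed example): the fraction of replications with |X̄n − µ| ≥ ε shrinks as n grows:

```python
import random

random.seed(2)
p = 0.3          # population mean of a Bernoulli(p) variable
eps = 0.05
R = 2_000        # replications per sample size

def freq_outside(n):
    # empirical P(|Xbar_n - mu| >= eps) over R replications
    out = 0
    for _ in range(R):
        xbar = sum(random.random() < p for _ in range(n)) / n
        out += abs(xbar - p) >= eps
    return out / R

f_small, f_large = freq_outside(20), freq_outside(500)
print(f_large < f_small)   # the outside-probability decreases with n
print(f_large < 0.05)      # and is already small at n = 500
```

This matches the Chebyshev bound σ²/(nε²): with σ² = p(1 − p) = 0.21 and ε = 0.05, the bound at n = 500 is 0.168, while the empirical frequency is far smaller.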

Theorem: In the case where X1, X2, · · ·, Xn are not identically distributed and they are not mutually independently distributed, define:

    mn = E( Σ_{i=1}^n Xi ),    Vn = V( Σ_{i=1}^n Xi ),

and assume that

    mn/n = E( (1/n) Σ_{i=1}^n Xi ) < ∞,    Vn/n² = V( (1/n) Σ_{i=1}^n Xi ) < ∞,
    Vn/n² −→ 0, as n −→ ∞.

Then, we obtain the following result:

    ( Σ_{i=1}^n Xi − mn ) / n −→ 0.

That is, X̄n converges in probability to lim_{n→∞} mn/n.
This theorem is also called the law of large numbers.

Proof:
Remember Chebyshev's inequality:

    P(|X − E(X)| ≥ ε) ≤ V(X)/ε².

Replace X, E(X) and V(X) by X̄n, E(X̄n) = mn/n and V(X̄n) = Vn/n².
Then, we obtain:

    P(|X̄n − mn/n| ≥ ε) ≤ Vn/(n²ε²).

As n goes to infinity,

    P(|X̄n − mn/n| ≥ ε) ≤ Vn/(n²ε²) −→ 0.

Therefore, X̄n −→ lim_{n→∞} mn/n as n −→ ∞.
6.3 Central Limit Theorem (中心極限定理) and Convergence in Distribution (分布収束)

Central Limit Theorem: X1, X2, · · ·, Xn are mutually independently and identically distributed with E(Xi) = µ and V(Xi) = σ² for all i. Both µ and σ² are finite.
Under the above assumptions, when n −→ ∞, we have:

    P( (X̄n − µ)/(σ/√n) < x ) −→ ∫_{-∞}^{x} (1/√(2π)) e^{-u²/2} du,

which is called the central limit theorem.

Proof:
Define Yi = (Xi − µ)/σ. We can rewrite as follows:

    (X̄n − µ)/(σ/√n) = (1/√n) Σ_{i=1}^n (Xi − µ)/σ = (1/√n) Σ_{i=1}^n Yi.

Since Y1, Y2, · · ·, Yn are mutually independently and identically distributed, the moment-generating function of Yi is identical for all i, which is denoted by φ(θ).

Using E(Yi) = 0 and V(Yi) = 1, the moment-generating function of Yi, φ(θ), is rewritten as:

    φ(θ) = E(e^{Yi θ}) = E( 1 + Yi θ + (1/2) Yi² θ² + (1/3!) Yi³ θ³ + · · · )
         = 1 + (1/2) θ² + O(θ³).

In the second equality, e^{Yi θ} is approximated by the Taylor series expansion around θ = 0.

(*) Remark:
O(x) implies a polynomial function of x and the higher-order terms, which is dominated by x.
In this case, O(θ³) is a function of θ³, θ⁴, · · ·.
Since the moment-generating function is conventionally evaluated around θ = 0, θ³ is the largest of θ³, θ⁴, · · ·, and accordingly O(θ³) is dominated by θ³ (in other words, θ⁴, θ⁵, · · · are small enough, compared with θ³).

Define Z as:

    Z = (1/√n) Σ_{i=1}^n Yi.

Then, the moment-generating function of Z, i.e., φz(θ), is given by:

    φz(θ) = E(e^{Zθ}) = E(e^{(θ/√n) Σ_{i=1}^n Yi}) = Π_{i=1}^n E(e^{(θ/√n) Yi}) = ( φ(θ/√n) )^n
          = ( 1 + θ²/(2n) + O(θ³/n^{3/2}) )^n = ( 1 + θ²/(2n) + O(n^{-3/2}) )^n.

We consider that n goes to infinity.
Therefore, O(θ³/n^{3/2}) indicates a function of n^{-3/2}.

Moreover, consider x = θ²/(2n) + O(n^{-3/2}).
Multiplying both sides of x = θ²/(2n) + O(n^{-3/2}) by n/x, we obtain:

    n = (1/x) ( θ²/2 + O(n^{-1/2}) ).

Substitute n = (1/x) ( θ²/2 + O(n^{-1/2}) ) into the moment-generating function of Z, i.e., φz(θ).
Then, we obtain:

    φz(θ) = ( 1 + θ²/(2n) + O(n^{-3/2}) )^n = ( (1 + x)^{1/x} )^{θ²/2 + O(n^{-1/2})} −→ e^{θ²/2}.
Note that x −→ 0 when n −→ ∞ and that lim_{x→0} (1 + x)^{1/x} = e, as in Section 2.3 (p.35).
Furthermore, we have O(n^{-1/2}) −→ 0 as n −→ ∞.

Since φz(θ) = e^{θ²/2} is the moment-generating function of the standard normal distribution (see p.110 in Section 3.1 for the moment-generating function of the standard normal probability density), we have:

    P( (X̄n − µ)/(σ/√n) < x ) −→ ∫_{-∞}^{x} (1/√(2π)) e^{-u²/2} du,

or equivalently,

    (X̄n − µ)/(σ/√n) −→ N(0, 1).

We say that (X̄n − µ)/(σ/√n) converges in distribution to N(0, 1).
=⇒ Convergence in distribution (分布収束)

The following expression is also possible:

    √n (X̄n − µ) −→ N(0, σ²).    (11)
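A simulation sketch of the theorem (the uniform population below is an assumed example, chosen because it is clearly non-normal): standardized sample means of uniform draws should be approximately N(0, 1), which we check through the "within two standard deviations" frequency of about 0.954:

```python
import random

random.seed(3)
n, R = 50, 20_000
mu, sigma = 0.5, (1 / 12) ** 0.5   # mean and sd of Uniform(0, 1)

zs = []
for _ in range(R):
    xbar = sum(random.random() for _ in range(n)) / n
    zs.append((xbar - mu) / (sigma / n ** 0.5))    # standardized sample mean

within2 = sum(1 for z in zs if abs(z) < 2) / R
print(abs(within2 - 0.954) < 0.01)   # close to the N(0,1) probability P(|Z| < 2)
```

Even at n = 50 the normal approximation is excellent here, because the uniform distribution is symmetric.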

Corollary 1: When E(Xi) = µ, V(Xi) = σ² and X̄n = (1/n) Σ_{i=1}^n Xi, note that

    ( X̄n − E(X̄n) ) / √V(X̄n) = (X̄n − µ)/(σ/√n).

Therefore, we can rewrite the above theorem as:

    P( ( X̄n − E(X̄n) ) / √V(X̄n) < x ) −→ ∫_{-∞}^{x} (1/√(2π)) e^{-u²/2} du.

Corollary 2: Consider the case where X1, X2, · · ·, Xn are not identically distributed and they are not mutually independently distributed.
Assume that

    lim_{n→∞} n V(X̄n) = σ² < ∞, where X̄n = (1/n) Σ_{i=1}^n Xi.

Then, when n −→ ∞, we have:

    P( ( X̄n − E(X̄n) ) / √V(X̄n) < x ) −→ ∫_{-∞}^{x} (1/√(2π)) e^{-u²/2} du.

Summary: Let X1, X2, · · ·, Xn, · · · be a sequence of random variables. Let X be a random variable. Let Fn be the distribution function of Xn and F be that of X.

• Xn converges in probability to X if lim_{n→∞} P(|Xn − X| ≥ ε) = 0, or lim_{n→∞} P(|Xn − X| < ε) = 1, for all ε > 0.
  Equivalently, we write Xn −→P X.

• Xn converges in distribution to X (or F) if lim_{n→∞} Fn(x) = F(x) for all x at which F is continuous.
  Equivalently, we write Xn −→D X or Xn −→D F.

7 Statistical Inference

7.1 Point Estimation (点推定)

Suppose that the underlying distribution is known but the parameter θ included in the distribution is not known.
The distribution function of the population is given by f(x; θ).
Let x1, x2, · · ·, xn be the n observed data drawn from the population distribution.
Consider estimating the parameter θ using the n observed data.
Let θ̂n(x1, x2, · · ·, xn) be a function of the observed data x1, x2, · · ·, xn.
θ̂n(x1, x2, · · ·, xn) is constructed to estimate the parameter θ.
θ̂n(x1, x2, · · ·, xn) takes a certain value given the n observed data.
θ̂n(x1, x2, · · ·, xn) is called the point estimate of θ, or simply the estimate of θ.

Example 1.11: Consider the case of θ = (µ, σ²), where the unknown parameters contained in the population are given by mean and variance.
A point estimate of the population mean µ is given by:

    µ̂n(x1, x2, · · · , xn) ≡ x̄ = (1/n) Σ_{i=1}^n xi.

A point estimate of the population variance σ² is:

    σ̂²n(x1, x2, · · · , xn) ≡ s² = (1/(n−1)) Σ_{i=1}^n (xi − x̄)².

An alternative point estimate of the population variance σ² is:

    σ̃²n(x1, x2, · · · , xn) ≡ s**² = (1/n) Σ_{i=1}^n (xi − x̄)².

7.2 Statistic, Estimate and Estimator (統計量、推定値、推定量)

The underlying distribution of the population is assumed to be known, but the parameter θ, which characterizes the underlying distribution, is unknown.
The probability density function of the population is given by f(x; θ).

Let X1, X2, · · ·, Xn be a subset of the population, which are regarded as the random variables and are assumed to be mutually independent.
x1, x2, · · ·, xn are taken as the experimental values of the random variables X1, X2, · · ·, Xn.
In statistics, we consider that the n-variate random variables X1, X2, · · ·, Xn take the experimental values x1, x2, · · ·, xn by chance.
Here, the experimental values and the actually observed data series are used in the same meaning.

θ̂n(x1, x2, · · ·, xn) denotes the point estimate of θ.
In the case where the observed data x1, x2, · · ·, xn are replaced by the corresponding random variables X1, X2, · · ·, Xn, a function of X1, X2, · · ·, Xn, i.e., θ̂n(X1, X2, · · ·, Xn), is called the estimator (推定量) of θ, which should be distinguished from the estimate (推定値) of θ, i.e., θ̂n(x1, x2, · · ·, xn).
Example 1.12: Let X1, X2, · · ·, Xn denote a random sample of n from a given distribution f(x; θ).
Consider the case of θ = (µ, σ²).

The estimator of µ is given by X̄ = (1/n) Σ_{i=1}^n Xi, while the estimate of µ is x̄ = (1/n) Σ_{i=1}^n xi.

The estimator of σ² is S² = (1/(n−1)) Σ_{i=1}^n (Xi − X̄)² and the estimate of σ² is s² = (1/(n−1)) Σ_{i=1}^n (xi − x̄)².

There are numerous estimators and estimates of θ.
All of (1/n) Σ_{i=1}^n Xi, (X1 + Xn)/2, the median of (X1, X2, · · ·, Xn) and so on are taken as the estimators of µ.
Of course, they are called the estimates of θ when Xi is replaced by xi for all i.
Both S² = (1/(n−1)) Σ_{i=1}^n (Xi − X̄)² and S**² = (1/n) Σ_{i=1}^n (Xi − X̄)² are the estimators of σ².

We need to choose one out of the numerous estimators of θ.
The problem of choosing an optimal estimator out of the numerous estimators is discussed in Sections 7.4 and 7.5.1.

In addition, note as follows.
A function of random variables is called a statistic (統計量). A statistic used for estimation of the parameter is called an estimator.
Therefore, an estimator is a particular kind of statistic.

7.3 Estimation of Mean and Variance

Suppose that the population distribution is given by f(x; θ).
The random sample X1, X2, · · ·, Xn is assumed to be drawn from the population distribution f(x; θ), where θ = (µ, σ²).
Therefore, we can assume that X1, X2, · · ·, Xn are mutually independently and identically distributed, where "identically" implies E(Xi) = µ and V(Xi) = σ² for all i.
Consider the estimators of θ = (µ, σ²) as follows.

1. The estimator of the population mean µ is:
• X̄ = (1/n) Σ_{i=1}^n Xi.

2. The estimators of the population variance σ² are:
• S*² = (1/n) Σ_{i=1}^n (Xi − µ)², when µ is known,
• S² = (1/(n−1)) Σ_{i=1}^n (Xi − X̄)²,
• S**² = (1/n) Σ_{i=1}^n (Xi − X̄)².

Properties of X̄: From the Theorem on p.138, the mean and variance of X̄ are obtained as follows:

    E(X̄) = µ,    V(X̄) = σ²/n.

Properties of S*², S² and S**²: The expectation of S*² is:

    E(S*²) = E( (1/n) Σ_{i=1}^n (Xi − µ)² ) = (1/n) Σ_{i=1}^n E((Xi − µ)²)
           = (1/n) Σ_{i=1}^n V(Xi) = (1/n) Σ_{i=1}^n σ² = σ².
Next, the expectation of S² is given by:

    E(S²) = E( (1/(n−1)) Σ_{i=1}^n (Xi − X̄)² ) = (1/(n−1)) E( Σ_{i=1}^n (Xi − X̄)² )
          = (1/(n−1)) E( Σ_{i=1}^n ((Xi − µ) − (X̄ − µ))² )
          = (1/(n−1)) E( Σ_{i=1}^n ((Xi − µ)² − 2(Xi − µ)(X̄ − µ) + (X̄ − µ)²) )
          = (1/(n−1)) E( Σ_{i=1}^n (Xi − µ)² − 2(X̄ − µ) Σ_{i=1}^n (Xi − µ) + n(X̄ − µ)² )
          = (1/(n−1)) E( Σ_{i=1}^n (Xi − µ)² − n(X̄ − µ)² )
          = (n/(n−1)) E( (1/n) Σ_{i=1}^n (Xi − µ)² ) − (n/(n−1)) E((X̄ − µ)²)
          = (n/(n−1)) σ² − (n/(n−1)) (σ²/n) = σ².

Σ_{i=1}^n (Xi − µ) = n(X̄ − µ) is used in the sixth equality.
E( (1/n) Σ_{i=1}^n (Xi − µ)² ) = E(S*²) = σ² and E((X̄ − µ)²) = V(X̄) = σ²/n are required in the eighth equality.

Finally, the expectation of S**² is represented by:

    E(S**²) = E( (1/n) Σ_{i=1}^n (Xi − X̄)² ) = ((n−1)/n) E( (1/(n−1)) Σ_{i=1}^n (Xi − X̄)² )
            = ((n−1)/n) E(S²) = ((n−1)/n) σ² ≠ σ².

Summarizing the above results, we obtain as follows:

    E(S*²) = σ²,    E(S²) = σ²,    E(S**²) = ((n−1)/n) σ² ≠ σ².

7.4 Point Estimation: Optimality

θ denotes the parameter to be estimated.
θ̂n(X1, X2, · · ·, Xn) represents the estimator of θ, while θ̂n(x1, x2, · · ·, xn) indicates the estimate of θ.
Hereafter, in the case of no confusion, θ̂n(X1, X2, · · ·, Xn) is simply written as θ̂n.
As discussed above, there are numerous candidates of the estimator θ̂n.
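The three expectations above can be verified by simulation. This sketch (a normal population with assumed µ and σ) averages S² and S**² over many samples; S² centers on σ² while S**² centers on ((n−1)/n)σ²:

```python
import random

random.seed(4)
mu, sigma, n, R = 1.0, 2.0, 10, 50_000

s2_sum, s2ss_sum = 0.0, 0.0
for _ in range(R):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    ssq = sum((x - xbar) ** 2 for x in xs)
    s2_sum += ssq / (n - 1)      # S^2, divides by n-1
    s2ss_sum += ssq / n          # S**2, divides by n

e_s2, e_s2ss = s2_sum / R, s2ss_sum / R
print(abs(e_s2 - sigma ** 2) < 0.1)                   # E(S^2) = sigma^2
print(abs(e_s2ss - (n - 1) / n * sigma ** 2) < 0.1)   # E(S**2) = (n-1)/n * sigma^2
```

With n = 10 the bias factor (n−1)/n = 0.9 is easy to see: the average of S**² sits near 3.6, not 4.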

The desired properties of θ̂n are:
• unbiasedness (不偏性),
• efficiency (有効性),
• consistency (一致性) and
• sufficiency (十分性). ←− Not discussed in this class.

Unbiasedness (不偏性): One of the desirable features that the estimator of the parameter should have is given by:

    E(θ̂n) = θ,    (12)

which implies that θ̂n is distributed around θ.
When (12) holds, θ̂n is called the unbiased estimator (不偏推定量) of θ.
E(θ̂n) − θ is defined as the bias (偏り).
As an example of unbiasedness, consider the case of θ = (µ, σ²).
Suppose that X1, X2, · · ·, Xn are mutually independently and identically distributed with mean µ and variance σ².
Consider the following estimators of µ and σ².

1. The estimator of µ is:
• X̄ = (1/n) Σ_{i=1}^n Xi.

2. The estimators of σ² are:
• S² = (1/(n−1)) Σ_{i=1}^n (Xi − X̄)²,
• S**² = (1/n) Σ_{i=1}^n (Xi − X̄)².

Since we have obtained E(X̄) = µ and E(S²) = σ², X̄ and S² are unbiased estimators of µ and σ².
We have obtained the result E(S**²) ≠ σ², and therefore S**² is not an unbiased estimator of σ².
According to the criterion of unbiasedness, S² is preferred to S**² for estimation of σ².

Efficiency (有効性): Consider two estimators, θ̂n and θ̃n.
Both are assumed to be unbiased.
That is, E(θ̂n) = θ and E(θ̃n) = θ.
When V(θ̂n) < V(θ̃n), we say that θ̂n is more efficient than θ̃n.
The unbiased estimator with the least variance is known as the efficient estimator (有効推定量).
There are cases where an efficient estimator does not exist.
In order to find the efficient estimator, we utilize the Cramer-Rao inequality (クラメール・ラオの不等式).

Suppose that X1, X2, · · ·, Xn are mutually independently and identically distributed and the distribution of Xi is f(xi; θ).
For any unbiased estimator of θ, denoted by θ̂n, it is known that we have the following inequality:

    V(θ̂n) ≥ σ²(θ)/n,    (13)

where

    σ²(θ) = 1 / E( (∂ log f(X; θ)/∂θ)² ) = 1 / V( ∂ log f(X; θ)/∂θ )
          = −1 / E( ∂² log f(X; θ)/∂θ² ),    (14)

which is known as the Cramer-Rao inequality (クラメール・ラオの不等式).
When there exists an unbiased estimator θ̂n such that the equality in (13) holds, θ̂n becomes the unbiased estimator with minimum variance, which is the efficient estimator (有効推定量).
σ²(θ)/n is called the Cramer-Rao lower bound (クラメール・ラオの下限).

Proof of the Cramer-Rao inequality: We prove the above inequality and the equalities in σ²(θ).

The likelihood function (尤度関数) l(θ; x) = l(θ; x1, x2, · · ·, xn) is a joint density of X1, X2, · · ·, Xn.
That is, l(θ; x) = l(θ; x1, x2, · · ·, xn) = Π_{i=1}^n f(xi; θ).
See Section 7.5.1 for the likelihood function (尤度関数).
The integration of l(θ; x1, x2, · · ·, xn) with respect to x1, x2, · · ·, xn is equal to one.
That is, we have the following equation:

    1 = ∫ l(θ; x) dx,    (15)

where the likelihood function l(θ; x) is given by l(θ; x) = Π_{i=1}^n f(xi; θ) and ∫ · · · dx implies an n-tuple integral.

Differentiating both sides of equation (15) with respect to θ, we obtain the following equation:

    0 = ∫ ∂l(θ; x)/∂θ dx = ∫ (1/l(θ; x)) (∂l(θ; x)/∂θ) l(θ; x) dx
      = ∫ (∂ log l(θ; x)/∂θ) l(θ; x) dx = E( ∂ log l(θ; X)/∂θ ),    (16)

which implies that the expectation of ∂ log l(θ; X)/∂θ is equal to zero.
In the third equality, note that d log x / dx = 1/x.

Now, let θ̂n be an estimator of θ. The definition of the mathematical expectation of the estimator θ̂n is represented as:

    E(θ̂n) = ∫ θ̂n l(θ; x) dx.    (17)

Differentiating equation (17) with respect to θ on both sides, we can rewrite as follows:

    ∂E(θ̂n)/∂θ = ∫ θ̂n (∂l(θ; x)/∂θ) dx = ∫ θ̂n (∂ log l(θ; x)/∂θ) l(θ; x) dx
               = ∫ ( θ̂n − E(θ̂n) ) ( ∂ log l(θ; x)/∂θ − E(∂ log l(θ; x)/∂θ) ) l(θ; x) dx
               = Cov( θ̂n, ∂ log l(θ; X)/∂θ ).    (18)

In the second equality, d log x / dx = 1/x is utilized.
The third equality holds because of E( ∂ log l(θ; X)/∂θ ) = 0 from equation (16).
For simplicity of discussion, suppose that θ is a scalar.

Taking the square on both sides of equation (18), we obtain the following expression:

    ( ∂E(θ̂n)/∂θ )² = ( Cov(θ̂n, ∂ log l(θ; X)/∂θ) )² = ρ² V(θ̂n) V( ∂ log l(θ; X)/∂θ )
                    ≤ V(θ̂n) V( ∂ log l(θ; X)/∂θ ),    (19)

where ρ denotes the correlation coefficient between θ̂n and ∂ log l(θ; X)/∂θ.
Note that the definition of ρ is given by:

    ρ = Cov( θ̂n, ∂ log l(θ; X)/∂θ ) / ( √V(θ̂n) √V(∂ log l(θ; X)/∂θ) ).

Moreover, we have −1 ≤ ρ ≤ 1 (i.e., ρ² ≤ 1).
Then, the inequality (19) is obtained, which is rewritten as:

    V(θ̂n) ≥ ( ∂E(θ̂n)/∂θ )² / V( ∂ log l(θ; X)/∂θ ).    (20)
When E(θ̂n) = θ, i.e., when θ̂n is an unbiased estimator of θ, the numerator in the right-hand side of equation (20) is equal to one.
Therefore, we have the following result:

    V(θ̂n) ≥ 1 / V( ∂ log l(θ; X)/∂θ ) = 1 / E( (∂ log l(θ; X)/∂θ)² ).

Note that we have V( ∂ log l(θ; X)/∂θ ) = E( (∂ log l(θ; X)/∂θ)² ) in the equality above, because of E( ∂ log l(θ; X)/∂θ ) = 0.

Moreover, the denominator in the right-hand side of the above inequality is rewritten as follows:

    E( (∂ log l(θ; X)/∂θ)² ) = E( ( Σ_{i=1}^n ∂ log f(Xi; θ)/∂θ )² ) = Σ_{i=1}^n E( (∂ log f(Xi; θ)/∂θ)² )
        = n E( (∂ log f(X; θ)/∂θ)² ) = n ∫_{-∞}^{∞} ( ∂ log f(x; θ)/∂θ )² f(x; θ) dx.

In the first equality, log l(θ; X) = Σ_{i=1}^n log f(Xi; θ) is utilized.

Since Xi, i = 1, 2, · · · , n, are mutually independent, the second equality holds.
The third equality holds because X1, X2, · · ·, Xn are identically distributed.
Therefore, we obtain the following inequality:

    V(θ̂n) ≥ 1 / E( (∂ log l(θ; X)/∂θ)² ) = 1 / ( n E( (∂ log f(X; θ)/∂θ)² ) ) = σ²(θ)/n,

which is equivalent to (13).
Next, we prove the equalities in (14), i.e.,

    −E( ∂² log f(X; θ)/∂θ² ) = E( (∂ log f(X; θ)/∂θ)² ) = V( ∂ log f(X; θ)/∂θ ).

Differentiating ∫ f(x; θ) dx = 1 with respect to θ, we obtain as follows:

    ∫ ∂f(x; θ)/∂θ dx = 0.

We assume that the range of x does not depend on the parameter θ and that ∂f(x; θ)/∂θ exists.
The above equation is rewritten as:

    ∫ ( ∂ log f(x; θ)/∂θ ) f(x; θ) dx = 0,    (21)

or equivalently,

    E( ∂ log f(X; θ)/∂θ ) = 0.    (22)

Again, differentiating equation (21) with respect to θ,

    ∫ ( ∂² log f(x; θ)/∂θ² ) f(x; θ) dx + ∫ ( ∂ log f(x; θ)/∂θ ) ( ∂f(x; θ)/∂θ ) dx = 0,

i.e.,

    ∫ ( ∂² log f(x; θ)/∂θ² ) f(x; θ) dx + ∫ ( ∂ log f(x; θ)/∂θ )² f(x; θ) dx = 0,

i.e.,

    E( ∂² log f(x; θ)/∂θ² ) + E( (∂ log f(x; θ)/∂θ)² ) = 0.

Thus, we obtain:

    −E( ∂² log f(x; θ)/∂θ² ) = E( (∂ log f(x; θ)/∂θ)² ).

Moreover, from equation (22), the following equation is derived:

    E( (∂ log f(x; θ)/∂θ)² ) = V( ∂ log f(x; θ)/∂θ ).

Therefore, we have:

    −E( ∂² log f(X; θ)/∂θ² ) = E( (∂ log f(X; θ)/∂θ)² ) = V( ∂ log f(X; θ)/∂θ ).
Thus, the Cramer-Rao inequality is derived as:

    V(θ̂n) ≥ σ²(θ)/n,

where

    σ²(θ) = 1 / E( (∂ log f(X; θ)/∂θ)² ) = 1 / V( ∂ log f(X; θ)/∂θ ) = −1 / E( ∂² log f(X; θ)/∂θ² ).

Example 1.13a (Efficient Estimator of µ): Suppose that X1, X2, · · ·, Xn are mutually independently, identically and normally distributed with mean µ and variance σ².

Then, we show that X̄ is an efficient estimator of µ.

V(X̄) is given by σ²/n, which does not depend on the distribution of Xi, i = 1, 2, · · · , n. . . . . . . (A)

Because Xi is normally distributed with mean µ and variance σ², the density function of Xi is given by:

    f(x; µ) = (2πσ²)^{-1/2} exp( −(x − µ)²/(2σ²) ).

The Cramer-Rao inequality is represented as:

    V(X̄) ≥ 1 / ( n E( (∂ log f(X; µ)/∂µ)² ) ),

where the logarithm of f(X; µ) is written as:

    log f(X; µ) = −(1/2) log(2πσ²) − (1/(2σ²)) (X − µ)².

The partial derivative of log f(X; µ) with respect to µ is:

    ∂ log f(X; µ)/∂µ = (1/σ²) (X − µ).

The Cramer-Rao inequality in this case is written as:

    V(X̄) ≥ 1 / ( n E( ((1/σ²)(X − µ))² ) ) = 1 / ( (n/σ⁴) E((X − µ)²) ) = σ²/n. . . . . . . (B)

From (A) and (B), the variance of X̄ is equal to the lower bound of the Cramer-Rao inequality, i.e., V(X̄) = σ²/n, which implies that the equality included in the Cramer-Rao inequality holds.
Therefore, we can conclude that the sample mean X̄ is an efficient estimator of µ.

Example 1.13b (Efficient Estimator of σ²): Suppose that X1, X2, · · ·, Xn are mutually independently, identically and normally distributed with mean µ and variance σ².

Is S² an efficient estimator of σ²?

E(S²) = σ² . . . . . . Unbiased estimator

Under the normality assumption, V(S²) is given by 2σ⁴/(n−1), because V(U) = 2(n−1) from U = (n−1)S²/σ² ∼ χ²(n−1). . . . . . . (A)
Because Xi is normally distributed with mean µ and variance σ², the density function of Xi is given by:

    f(x; σ²) = (2πσ²)^{-1/2} exp( −(x − µ)²/(2σ²) ).

The Cramer-Rao inequality is represented as:

    V(S²) ≥ 1 / ( n E( (∂ log f(X; σ²)/∂σ²)² ) ) = −1 / ( n E( ∂² log f(X; σ²)/∂(σ²)² ) ),

where the logarithm of f(X; σ²) is written as:

    log f(X; σ²) = −(1/2) log(2π) − (1/2) log(σ²) − (1/(2σ²)) (X − µ)².

The partial derivative of log f(X; σ²) with respect to σ² is:

    ∂ log f(X; σ²)/∂σ² = −1/(2σ²) + (1/(2σ⁴)) (X − µ)².

The second partial derivative of log f(X; σ²) with respect to σ² is:

    ∂² log f(X; σ²)/∂(σ²)² = 1/(2σ⁴) − (1/σ⁶) (X − µ)².

The Cramer-Rao inequality in this case is written as:

    V(S²) ≥ 1 / ( −n E( 1/(2σ⁴) − (1/σ⁶)(X − µ)² ) ) = 2σ⁴/n. . . . . . . (B)
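A simulation sketch (assumed standard normal population) of the comparison between V(S²) and the Cramer-Rao lower bound: the Monte Carlo variance of S² should be close to 2σ⁴/(n−1) and therefore strictly above 2σ⁴/n:

```python
import random

random.seed(5)
mu, sigma, n, R = 0.0, 1.0, 8, 100_000

s2s = []
for _ in range(R):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    s2s.append(sum((x - xbar) ** 2 for x in xs) / (n - 1))  # S^2

m = sum(s2s) / R
v = sum((s - m) ** 2 for s in s2s) / (R - 1)   # Monte Carlo V(S^2)

print(abs(v - 2 * sigma ** 4 / (n - 1)) < 0.02)  # V(S^2) = 2/(n-1)
print(v > 2 * sigma ** 4 / n)                    # strictly above the Cramer-Rao bound 2/n
```

With n = 8 the gap between 2/7 ≈ 0.286 and 2/8 = 0.25 is large enough to be visible through Monte Carlo noise.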

From (A) and (B), the variance of S² is not equal to the lower bound of the Cramer-Rao inequality, i.e., V(S²) = 2σ⁴/(n−1) > 2σ⁴/n.
Therefore, we can conclude that the sample unbiased variance S² is not an efficient estimator of σ².

Example 1.14: Minimum Variance Linear Unbiased Estimator (最小分散線形不偏推定量): Suppose that X1, X2, · · ·, Xn are mutually independently and identically distributed with mean µ and variance σ² (note that the normality assumption is excluded from Example 1.13).

Consider the following linear estimator: µ̂ = Σ_{i=1}^n ai Xi.

Then, we want to show that µ̂ is a minimum variance linear unbiased estimator if ai = 1/n for all i, i.e., if µ̂ = X̄.

Utilizing the Theorem on p.133, when E(Xi) = µ and V(Xi) = σ² for all i, we have:

    E(µ̂) = µ Σ_{i=1}^n ai    and    V(µ̂) = σ² Σ_{i=1}^n ai².

Since µ̂ is linear in Xi, µ̂ is called a linear estimator (線形推定量) of µ.

In order for µ̂ to be unbiased, we need to have the condition: E(µ̂) = µ Σ_{i=1}^n ai = µ.
That is, if Σ_{i=1}^n ai = 1 is satisfied, µ̂ gives us a linear unbiased estimator (線形不偏推定量).
Thus, as mentioned in Example 1.12 of Section 7.2, there are numerous unbiased estimators.

The variance of µ̂ is given by σ² Σ_{i=1}^n ai².
We obtain the value of ai which minimizes Σ_{i=1}^n ai² subject to the constraint Σ_{i=1}^n ai = 1.

Construct the Lagrange function as follows:

    L = (1/2) Σ_{i=1}^n ai² + λ( 1 − Σ_{i=1}^n ai ),

where λ denotes the Lagrange multiplier.
The 1/2 in the first term makes computation easier.

For minimization, the partial derivatives of L with respect to ai and λ are set equal to zero, i.e.,

    ∂L/∂ai = ai − λ = 0,  i = 1, 2, · · · , n,
    ∂L/∂λ = 1 − Σ_{i=1}^n ai = 0.

Solving the above equations, ai = λ = 1/n is obtained.
When ai = 1/n for all i, µ̂ has minimum variance in the class of linear unbiased estimators.
X̄ is a minimum variance linear unbiased estimator.
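A small numerical sketch of this result (the alternative weight vectors below are arbitrary examples): among weights summing to one, equal weights ai = 1/n give the smallest Σ ai², and hence the smallest variance σ² Σ ai²:

```python
n = 5
equal = [1 / n] * n                     # a_i = 1/n
# hypothetical alternative unbiased weightings (each sums to one)
others = [
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.4, 0.3, 0.2, 0.05, 0.05],
    [1.0, 0.0, 0.0, 0.0, 0.0],
]

def sum_sq(a):
    # variance of the linear estimator is sigma^2 * sum(a_i^2)
    return sum(ai ** 2 for ai in a)

print(all(abs(sum(a) - 1) < 1e-12 for a in others))    # all are unbiased weightings
print(all(sum_sq(equal) < sum_sq(a) for a in others))  # equal weights minimize sum(a_i^2)
```

Here Σ ai² = 0.2 for equal weights versus 0.5, 0.295 and 1.0 for the alternatives, matching the Lagrange solution ai = 1/n.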

The minimum variance linear unbiased estimator is different from the efficient estimator.
The former does not require the normality assumption.
The latter gives us the unbiased estimator whose variance is equal to the Cramer-Rao lower bound, and it is not restricted to the class of linear unbiased estimators.
Under the normality assumption, the minimum variance linear unbiased estimator leads to the efficient estimator.
Note that the efficient estimator does not necessarily exist.

Consistency (一致性): Let θ̂n be an estimator of θ.
Suppose that for any ε > 0 we have the following:

    P(|θ̂n − θ| ≥ ε) −→ 0,  as n −→ ∞,

which implies that θ̂n −→ θ as n −→ ∞.
We say that θ̂n is a consistent estimator (一致推定量) of θ.

Example 1.15: Suppose that X1, X2, · · ·, Xn are mutually independently and identically distributed with mean µ and variance σ².
Assume that σ² is known.
Then, it is shown that X̄ is a consistent estimator of µ.
For a random variable X, Chebyshev's inequality is given by:

    P(|X − E(X)| ≥ ε) ≤ V(X)/ε².

Here, replacing X by X̄, we obtain E(X̄) and V(X̄) as follows:

    E(X̄) = µ,    V(X̄) = σ²/n,

because E(Xi) = µ and V(Xi) = σ² < ∞ for all i.
Then, when n −→ ∞, we obtain the following result:

    P(|X̄ − µ| ≥ ε) ≤ σ²/(nε²) −→ 0,

which implies that X̄ −→ µ as n −→ ∞.
Therefore, X̄ is a consistent estimator of µ.
Summary:
When the distribution of Xi is not assumed for all i, X̄ is a minimum variance linear unbiased and consistent estimator of µ.
When the distribution of Xi is assumed to be normal for all i, X̄ leads to an efficient and consistent estimator of µ.

Example 1.16a: Suppose that X1, X2, · · ·, Xn are mutually independently, identically and normally distributed with mean µ and variance σ².
Consider S² = (1/(n−1)) Σ_{i=1}^n (Xi − X̄)², which is an unbiased estimator of σ².
We obtain the following Chebyshev's inequality:

    P(|S² − σ²| ≥ ε) ≤ E((S² − σ²)²)/ε².

We compute E((S² − σ²)²) ≡ V(S²).

    U = (n−1)S²/σ² ∼ χ²(n−1).

E(U) = n − 1 and V(U) = 2(n − 1).

    V(U) = V( (n−1)S²/σ² ) = ( (n−1)²/σ⁴ ) V(S²) = 2(n − 1),

and hence

    V(S²) = 2σ⁴/(n − 1).

Therefore,

    P(|S² − σ²| ≥ ε) ≤ E((S² − σ²)²)/ε² = 2σ⁴/((n − 1)ε²) −→ 0,

which implies that S² −→ σ² as n −→ ∞.
Therefore, S² is a consistent estimator of σ².

Example 1.16b: Suppose that X1, X2, · · ·, Xn are mutually independently, identically and normally distributed with mean µ and variance σ².

Consider S**² = (1/n) Σ_{i=1}^n (Xi − X̄)², which is an estimator of σ².

We obtain the following Chebyshev's inequality:

    P(|S**² − σ²| ≥ ε) ≤ E((S**² − σ²)²)/ε².

We compute E((S**² − σ²)²).
Define S² = (1/(n−1)) Σ_{i=1}^n (Xi − X̄)² as an estimator of σ².

From (n−1)S²/σ² ∼ χ²(n−1), we obtain E( (n−1)S²/σ² ) = n − 1 and V( (n−1)S²/σ² ) = 2(n − 1).
Therefore, E(S²) = σ² and V(S²) = 2σ⁴/(n−1) can be derived.

Using S**² = ((n−1)/n) S², we have the following:

    E((S**² − σ²)²) = E( ( ((n−1)/n) S² − σ² )² )
                    = E( ( ((n−1)/n)(S² − σ²) − σ²/n )² )
                    = ( (n−1)²/n² ) E((S² − σ²)²) + σ⁴/n²
                    = ( (n−1)²/n² ) V(S²) + σ⁴/n² = ( (2n − 1)/n² ) σ⁴.

Note that the cross term vanishes in the third equality because E(S² − σ²) = 0.
Therefore, as n −→ ∞, we obtain:

    P(|S**² − σ²| ≥ ε) ≤ (1/ε²) ( (2n − 1)/n² ) σ⁴ −→ 0.

Because S**² −→ σ², S**² is a consistent estimator of σ².
S**² is biased (see Section 7.3, p.273), but it is consistent.

7.5 Estimation Methods

• Maximum Likelihood Estimation Method (最尤推定法)
• Least Squares Estimation Method (最小二乗法)
• Method of Moments (積率法)

7.5.1 Maximum Likelihood Estimator (最尤推定量)

In Section 7.4, the properties of the estimators X̄ and S² are discussed.
It is shown that X̄ is an unbiased, efficient and consistent estimator of µ under the normality assumption and that S² is an unbiased and consistent estimator of σ².

The parameter θ is included in the underlying distribution f(x; θ).
θ = (µ, σ²) in the case of the normal distribution.
Now, in more general cases, we want to consider how to estimate θ.
The maximum likelihood estimator (最尤推定量) gives us one of the solutions.

Let X1, X2, · · ·, Xn be mutually independently and identically distributed random samples.
Xi has the probability density function f(x; θ).

The joint density function of X1, X2, · · ·, Xn is given by:

    f(x1, x2, · · · , xn; θ) = Π_{i=1}^n f(xi; θ),

where θ denotes the unknown parameter.
Given the actually observed data x1, x2, · · ·, xn, the joint density f(x1, x2, · · ·, xn; θ) is regarded as a function of θ, i.e.,

    l(θ) = l(θ; x) = l(θ; x1, x2, · · · , xn) = Π_{i=1}^n f(xi; θ).

l(θ) is called the likelihood function (尤度関数).

Let θ̂n be the θ which maximizes the likelihood function.
Given data x1, x2, · · ·, xn, θ̂n(x1, x2, · · ·, xn) is called the maximum likelihood estimate (MLE, 最尤推定値).
Replacing x1, x2, · · ·, xn by X1, X2, · · ·, Xn, θ̂n = θ̂n(X1, X2, · · ·, Xn) is called the maximum likelihood estimator (MLE, 最尤推定量).
That is, solving the following equation:

    ∂l(θ)/∂θ = 0,

the MLE θ̂n ≡ θ̂n(X1, X2, · · · , Xn) is obtained.
Example 1.17a: Suppose that X1, X2, · · ·, Xn are mutually independently, identically and normally distributed with mean µ and variance σ².
We derive the maximum likelihood estimators of µ and σ².
The joint density (or the likelihood function) of X1, X2, · · ·, Xn is:

    f(x1, x2, · · · , xn; µ, σ²) = Π_{i=1}^n f(xi; µ, σ²)
        = Π_{i=1}^n (2πσ²)^{-1/2} exp( −(1/(2σ²)) (xi − µ)² )
        = (2πσ²)^{-n/2} exp( −(1/(2σ²)) Σ_{i=1}^n (xi − µ)² ) = l(µ, σ²).

The logarithm of the likelihood function is given by:

    log l(µ, σ²) = −(n/2) log(2π) − (n/2) log(σ²) − (1/(2σ²)) Σ_{i=1}^n (xi − µ)²,

which is called the log-likelihood function (対数尤度関数).

For maximization of the likelihood function, differentiating the log-likelihood func- Solving the two equations, we obtain the maximum likelihood estimates as follows:
tion log l(µ, σ2 ) with respect to µ and σ2 , the first derivatives should be equal to zero, 1X
n
µ̂ = xi = x,
i.e., n i=1
1X 1X
n n
1 X
n
∂ log l(µ, σ2 ) σ̂2 = (xi − µ̂)2 = (xi − x)2 = s∗∗2 .
= 2 (xi − µ) = 0, n i=1 n i=1
∂µ σ i=1
1 X
n
∂ log l(µ, σ2 ) n 1 Replacing xi by Xi for i = 1, 2, · · · , n, the maximum likelihood estimators of µ and
=− 2 + (xi − µ)2 = 0.
∂σ2 2σ 2σ4 i=1
σ2 are given by X and S ∗∗2 .
2
Let µ̂ and σ̂ be the solution which satisfies the above two equations.

327 328
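As a numerical check of Example 1.17a, the sketch below computes the closed-form MLEs and verifies that they dominate nearby parameter values under the log-likelihood. It assumes numpy is available; the simulated data, true parameter values and seed are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)  # illustrative N(2, 1.5^2) sample

# Closed-form MLEs derived above: sample mean and s**2 (divisor n, not n-1)
mu_hat = x.mean()
sigma2_hat = ((x - mu_hat) ** 2).mean()

def log_lik(mu, sigma2):
    # log l(mu, sigma2) = -(n/2)log(2*pi) - (n/2)log(sigma2) - sum((x-mu)^2)/(2*sigma2)
    n = len(x)
    return (-0.5 * n * np.log(2 * np.pi) - 0.5 * n * np.log(sigma2)
            - ((x - mu) ** 2).sum() / (2 * sigma2))

# The closed-form solution should beat perturbed parameter values
assert log_lik(mu_hat, sigma2_hat) >= log_lik(mu_hat + 0.1, sigma2_hat)
assert log_lik(mu_hat, sigma2_hat) >= log_lik(mu_hat, sigma2_hat * 1.1)
```

With a different seed or sample size the estimates change, but the inequality checks continue to hold because (x, s∗∗2 ) maximizes the log-likelihood exactly.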

Since E(X) = µ, the maximum likelihood estimator of µ, X, is an unbiased estimator.

We have checked that X is efficient and consistent.

However, because E(S ∗∗2 ) = ((n − 1)/n) σ2 ≠ σ2 , as shown in Section 7.3, the maximum likelihood estimator of σ2 , S ∗∗2 , is not an unbiased estimator.

We have checked that S ∗∗2 is inefficient but consistent.

Example 1.17b: Suppose that X1 , X2 , · · ·, Xn are mutually independently and identically distributed as Bernoulli random variables with parameter p.

We derive the maximum likelihood estimator of p.

The joint density (or the likelihood function) of X1 , X2 , · · ·, Xn is:

f (x1 , x2 , · · · , xn ; p) = ∏_{i=1}^n f (xi ; p) = ∏_{i=1}^n p^{xi} (1 − p)^{1−xi}
   = p^{Σ_{i=1}^n xi} (1 − p)^{n − Σ_{i=1}^n xi} = l(p).

329 330
The log-likelihood function is given by:

log l(p) = ( Σ_{i=1}^n xi ) log(p) + ( n − Σ_{i=1}^n xi ) log(1 − p).

For maximization of the likelihood function, differentiating the log-likelihood function log l(p) with respect to p, the first derivative should be equal to zero, i.e.,

d log l(p)/dp = (1/p) Σ_{i=1}^n xi − (1/(1 − p)) ( n − Σ_{i=1}^n xi )
   = (n/p) x − (n/(1 − p)) (1 − x) = 0.

Let p̂ be the solution which satisfies the above equation.

We obtain the maximum likelihood estimates as follows:

p̂ = x = (1/n) Σ_{i=1}^n xi .

Replacing xi by Xi for i = 1, 2, · · · , n, the maximum likelihood estimator of p is given by p̂ = X = (1/n) Σ_{i=1}^n Xi .

331 332
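The Bernoulli result in Example 1.17b can be checked numerically: the sample proportion should maximize log l(p) over any grid of candidate values. A minimal sketch, assuming numpy is available (data and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.binomial(n=1, p=0.3, size=500)  # illustrative Bernoulli(p=0.3) sample

p_hat = x.mean()  # MLE derived above: the sample proportion

def log_lik(p):
    # log l(p) = (sum x_i) log p + (n - sum x_i) log(1 - p)
    s, n = x.sum(), len(x)
    return s * np.log(p) + (n - s) * np.log(1 - p)

# p_hat is the exact maximizer, so no grid point can do better
grid = np.linspace(0.01, 0.99, 99)
assert log_lik(p_hat) >= log_lik(grid).max()
```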

• We check whether p̂ is unbiased.

E( p̂) = E(X) = E( (1/n) Σ_{i=1}^n Xi ) = (1/n) Σ_{i=1}^n E(Xi ) = p

Remember that E(Xi ) = Σ_{xi =0}^1 xi p^{xi} (1 − p)^{1−xi} = p, where xi takes 0 or 1.

Thus, p̂ is an unbiased estimator of p.

• Next, we check whether p̂ is efficient.

From the Cramer-Rao inequality,

V( p̂) ≥ −1 / ( n E( d2 log f (X; p)/dp2 ) ).

f (X; p) = p^X (1 − p)^{1−X}
log f (X; p) = X log(p) + (1 − X) log(1 − p)
d log f (X; p)/dp = X/p − (1 − X)/(1 − p)
d2 log f (X; p)/dp2 = −X/p^2 − (1 − X)/(1 − p)^2

333 334

We need to check whether the equality holds.

V( p̂) = V( (1/n) Σ_{i=1}^n Xi ) = (1/n^2) V( Σ_{i=1}^n Xi ) = (1/n^2) Σ_{i=1}^n V(Xi )
   = (1/n^2) Σ_{i=1}^n p(1 − p) = p(1 − p)/n.

Note as follows:

V(Xi ) = E((Xi − p)^2) = Σ_{xi =0}^1 (xi − p)^2 p^{xi} (1 − p)^{1−xi} = p(1 − p).

The Cramer-Rao lower bound is:

−1 / ( n E( d2 log f (X; p)/dp2 ) ) = −1 / ( n E( −X/p^2 − (1 − X)/(1 − p)^2 ) )
   = −1 / ( −n ( E(X)/p^2 + (1 − E(X))/(1 − p)^2 ) )
   = 1 / ( n ( 1/p + 1/(1 − p) ) ) = p(1 − p)/n,

which is equal to V( p̂).

Thus, p̂ is an efficient estimator of p.

335 336
• We check whether p̂ is consistent.

From Chebyshev’s inequality,

P(| p̂ − p| ≥ ε) ≤ E(( p̂ − p)^2)/ε^2 = p(1 − p)/(n ε^2).

As n −→ ∞, P(| p̂ − p| ≥ ε) −→ 0.

That is, p̂ converges in probability to p.

Thus, p̂ is a consistent estimator of p.

Properties of Maximum Likelihood Estimator: For small sample (খඪຊ), the MLE has the following properties.

• MLE is not necessarily unbiased in general, but we often have the case where we can construct the unbiased estimator by an appropriate transformation.

For instance, the MLE of σ2 , S ∗∗2 , is not unbiased.

However, (n/(n − 1)) S ∗∗2 = S 2 is an unbiased estimator of σ2 .

• If the efficient estimator exists, the maximum likelihood estimator is efficient.

337 338
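The consistency argument above can be illustrated by simulation: the Monte Carlo frequency of |p̂ − p| ≥ ε shrinks as n grows, in line with the Chebyshev bound p(1 − p)/(n ε^2). A minimal sketch, assuming numpy is available (all numerical settings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
p, eps, reps = 0.3, 0.05, 2000

def freq_outside(n):
    # Monte Carlo estimate of P(|p_hat - p| >= eps) for sample size n
    p_hat = rng.binomial(n, p, size=reps) / n
    return np.mean(np.abs(p_hat - p) >= eps)

# The exceedance probability decreases with the sample size
assert freq_outside(4000) < freq_outside(100)
```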

Efficient estimator ⇐⇒ The variance of the estimator is equal to the Cramer-Rao lower bound.

For large sample (େඪຊ), as n −→ ∞, the maximum likelihood estimator of θ, θ̂n , has the following property:

√n (θ̂n − θ) −→ N(0, σ2 (θ)),     (23)

where

σ2 (θ) = 1 / E( (∂ log f (X; θ)/∂θ)^2 ) = −1 / E( ∂2 log f (X; θ)/∂θ2 ).

(23) indicates that the MLE has consistency, asymptotic unbiasedness (઴ۙෆภੑ), asymptotic efficiency (઴ۙ༗ޮੑ) and asymptotic normality (઴ۙਖ਼‫ن‬ੑ).

Asymptotic normality of the MLE comes from the central limit theorem discussed in Section 6.3.

Even though the underlying distribution is not normal, i.e., even though f (x; θ) is not normal, the MLE is asymptotically normally distributed.

339 340

Note that the properties as n −→ ∞ are called the asymptotic properties, which include consistency, asymptotic normality and so on.

By normalizing, as n −→ ∞, we obtain the following:

√n (θ̂n − θ)/σ(θ) = (θ̂n − θ)/(σ(θ)/√n) −→ N(0, 1).

√n (θ̂n − θ) has a limiting distribution which does not depend on n.

We write √n (θ̂n − θ) = O(1), where O(1) denotes a term that stays bounded as n grows.

That is, θ̂n − θ = n^{−1/2} × O(1) = O(n^{−1/2}).

As another representation, when n is large, we can approximate the distribution of θ̂n as follows:

θ̂n ∼ N( θ, σ2 (θ)/n ).

This implies that when n −→ ∞, the variance of θ̂n approaches the Cramer-Rao lower bound σ2 (θ)/n.

This property is called asymptotic efficiency.

341 342
Moreover, replacing θ in the variance σ2 (θ) by θ̂n , when n −→ ∞, we have the following property:

(θ̂n − θ)/(σ(θ̂n )/√n) −→ N(0, 1).     (24)

Practically, when n is large, we approximately use:

θ̂n ∼ N( θ, σ2 (θ̂n )/n ).     (25)

Proof of (23): By the central limit theorem (11) on p.254,

(1/√n) Σ_{i=1}^n ∂ log f (Xi ; θ)/∂θ −→ N( 0, 1/σ2 (θ) ),     (26)

where σ2 (θ) is defined in (14), i.e., V( ∂ log f (Xi ; θ)/∂θ ) = 1/σ2 (θ).

Note that E( ∂ log f (Xi ; θ)/∂θ ) = 0.

Apply the central limit theorem, taking ∂ log f (Xi ; θ)/∂θ as the ith random variable.

343 344
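Property (23) is easy to see by simulation. For the Bernoulli MLE p̂ (the sample mean), √n (p̂ − p) should be approximately N(0, p(1 − p)) when n is large. A minimal sketch, assuming numpy is available (the parameter values, sample size and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, reps = 0.3, 2000, 5000

# reps replications of p_hat, each based on a sample of size n
p_hat = rng.binomial(n, p, size=reps) / n
z = np.sqrt(n) * (p_hat - p)

print(z.mean())  # should be close to 0
print(z.var())   # should be close to p*(1-p) = 0.21
```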

By the Taylor series expansion around θ̂n = θ,

0 = (1/√n) Σ_{i=1}^n ∂ log f (Xi ; θ̂n )/∂θ
  = (1/√n) Σ_{i=1}^n ∂ log f (Xi ; θ)/∂θ + (1/√n) Σ_{i=1}^n ( ∂2 log f (Xi ; θ)/∂θ2 ) (θ̂n − θ)
    + (1/2!) (1/√n) Σ_{i=1}^n ( ∂3 log f (Xi ; θ)/∂θ3 ) (θ̂n − θ)^2 + · · ·

The third and higher-order terms on the right-hand side are:

(1/2!) (1/√n) Σ_{i=1}^n ( ∂3 log f (Xi ; θ)/∂θ3 ) (θ̂n − θ)^2 + · · · −→ 0.

It can be shown that the sum of these terms is equal to O(n^{−1/2}).

Note that (1/n) Σ_{i=1}^n ∂3 log f (Xi ; θ)/∂θ3 −→ E( ∂3 log f (Xi ; θ)/∂θ3 ) from Chebyshev’s inequality.

In addition, for now, we consider √n (θ̂n − θ)^2 −→ 0 as n −→ ∞. Actually, we obtain √n (θ̂n − θ)^2 = O(n^{−1/2}) from θ̂n − θ = O(n^{−1/2}).

345 346

Therefore,

(1/√n) Σ_{i=1}^n ∂ log f (Xi ; θ)/∂θ ≈ −( (1/n) Σ_{i=1}^n ∂2 log f (Xi ; θ)/∂θ2 ) √n (θ̂n − θ),

which implies that the asymptotic distribution of (1/√n) Σ_{i=1}^n ∂ log f (Xi ; θ)/∂θ is equivalent to that of

−( (1/n) Σ_{i=1}^n ∂2 log f (Xi ; θ)/∂θ2 ) √n (θ̂n − θ).

From (26) and the above equations, we obtain:

−( (1/n) Σ_{i=1}^n ∂2 log f (Xi ; θ)/∂θ2 ) √n (θ̂n − θ) −→ N( 0, 1/σ2 (θ) ).

The law of large numbers indicates as follows:

−(1/n) Σ_{i=1}^n ∂2 log f (Xi ; θ)/∂θ2 −→ −E( ∂2 log f (Xi ; θ)/∂θ2 ) = 1/σ2 (θ),

where the last equality comes from (14).

347 348
Thus, we have the following relationship:

−( (1/n) Σ_{i=1}^n ∂2 log f (Xi ; θ)/∂θ2 ) √n (θ̂n − θ) −→ (1/σ2 (θ)) √n (θ̂n − θ) −→ N( 0, 1/σ2 (θ) ).

Therefore, the asymptotic normality of the maximum likelihood estimator is obtained as follows:

√n (θ̂n − θ) −→ N(0, σ2 (θ)).

Thus, (23) is obtained.

7.5.2 Least Squares Estimation Method (࠷খೋ৐๏)

X1 , X2 , · · ·, Xn are mutually independently distributed with mean µ.

x1 , x2 , · · ·, xn are generated from X1 , X2 , · · ·, Xn , respectively.

Solve the following problem:

min_µ S (µ), where S (µ) = Σ_{i=1}^n (xi − µ)^2 .

Let µ̂ be the least squares estimate of µ.

dS (µ)/dµ = 0 =⇒ µ̂ = (1/n) Σ_{i=1}^n xi

349 350

The least squares estimator is given by:

µ̂ = (1/n) Σ_{i=1}^n Xi ,

which is equivalent to the MLE.

7.5.3 Method of Moments (ੵ཰๏)

The distribution of Xi is f (x; θ).

Let µ′k be the kth moment.

From the definition of the kth moment,

E(X^k ) = µ′k ,

where µ′k depends on θ.

Let µ̂′k be the estimate of the kth moment.

E(X^k ) ≈ (1/n) Σ_{i=1}^n xi^k = µ̂′k

351 352

The estimator of µ′k is:

µ̂′k = (1/n) Σ_{i=1}^n Xi^k

Example: θ = (µ, σ2 ): Because we have two parameters, we use the 1st and 2nd moments.

µ′1 = E(X) = µ
µ′2 = E(X^2 ) = V(X) + (E(X))^2 = σ2 + µ2

Estimates:

µ̂ = (1/n) Σ_{i=1}^n xi = x,     σ̂2 + µ̂2 = (1/n) Σ_{i=1}^n xi^2

σ̂2 = (1/n) Σ_{i=1}^n xi^2 − µ̂2 = (1/n) Σ_{i=1}^n xi^2 − x^2 = (1/n) Σ_{i=1}^n (xi − x)^2

Estimators:

µ̂ = (1/n) Σ_{i=1}^n Xi = X,     σ̂2 = (1/n) Σ_{i=1}^n (Xi − X)^2

353 354
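The two-moment example above can be sketched in a few lines: match the first two sample moments to E(X) = µ and E(X^2) = σ2 + µ2 and solve. This assumes numpy is available; the simulated data and seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)  # illustrative N(5, 4) sample

m1 = x.mean()          # 1st sample moment
m2 = (x ** 2).mean()   # 2nd sample moment

# Solve mu = m1 and sigma^2 + mu^2 = m2
mu_hat = m1
sigma2_hat = m2 - m1 ** 2  # identical to the divisor-n sample variance

assert np.isclose(sigma2_hat, x.var())  # numpy's var() uses divisor n by default
```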
7.6 Interval Estimation

In Sections 7.1 – 7.5.1, the point estimation is discussed.

It is important to know where the true parameter value of θ is likely to lie.

Suppose that the population distribution is given by f (x; θ).

Using the random sample X1 , X2 , · · ·, Xn drawn from the population distribution, we construct the two statistics, say, θU (X1 , X2 , · · ·, Xn ) and θL (X1 , X2 , · · ·, Xn ), where

P( θL (X1 , X2 , · · · , Xn ) < θ < θU (X1 , X2 , · · · , Xn ) ) = 1 − α.     (27)

(27) implies that θ lies on the interval ( θL (X1 , X2 , · · ·, Xn ), θU (X1 , X2 , · · ·, Xn ) ) with probability 1 − α.

355 356

Now, we replace the random variables X1 , X2 , · · ·, Xn by the experimental values x1 , x2 , · · ·, xn .

Then, we say that the interval:

( θL (x1 , x2 , · · · , xn ), θU (x1 , x2 , · · · , xn ) )

is called the 100 × (1 − α)% confidence interval (৴པ۠ؒ) of θ.

Thus, estimating the interval is known as the interval estimation (۠ؒਪఆ), which is distinguished from the point estimation.

In the interval, θL (x1 , x2 , · · ·, xn ) is known as the lower bound of the confidence interval, while θU (x1 , x2 , · · ·, xn ) is the upper bound of the confidence interval.

Given probability α, the θL (X1 , X2 , · · ·, Xn ) and θU (X1 , X2 , · · ·, Xn ) which satisfy equation (27) are not unique.

For estimation of the unknown parameter θ, it is preferable to minimize the width of the confidence interval.

Therefore, we should choose θL and θU which minimize the width θU (X1 , X2 , · · ·, Xn ) − θL (X1 , X2 , · · ·, Xn ).

357 358

Interval Estimation of X: Let X1 , X2 , · · ·, Xn be mutually independently and identically distributed random variables.

Xi has a distribution with mean µ and variance σ2 .

From the central limit theorem,

(X − µ)/(σ/√n) −→ N(0, 1).

Replacing σ2 by its estimator S 2 (or S ∗∗2 ),

(X − µ)/(S /√n) −→ N(0, 1).

Therefore, when n is large enough,

P( z∗ < (X − µ)/(S /√n) < z∗∗ ) = 1 − α,

where z∗ and z∗∗ (z∗ < z∗∗ ) are percent points from the standard normal density function.

Solving the inequality above with respect to µ, the following expression is obtained:

P( X − z∗∗ S /√n < µ < X − z∗ S /√n ) = 1 − α,

where θ̂L and θ̂U correspond to X − z∗∗ S /√n and X − z∗ S /√n , respectively.

359 360
The length of the confidence interval is given by:

θ̂U − θ̂L = (S /√n)(z∗∗ − z∗ ),

which should be minimized subject to:

∫_{z∗}^{z∗∗} f (x) dx = 1 − α,

i.e.,

F(z∗∗ ) − F(z∗ ) = 1 − α,

where F(·) denotes the standard normal cumulative distribution function.

Solving the minimization problem above, we can obtain the conditions that f (z∗ ) = f (z∗∗ ) for z∗ < z∗∗ and that f (x) is symmetric.

Therefore, we have:

−z∗ = z∗∗ = zα/2 ,

where zα/2 denotes the 100 × α/2 percent point from the standard normal density function.

Accordingly, replacing the estimators X and S 2 by their estimates x and s2 , the 100 × (1 − α)% confidence interval of µ is approximately represented as:

( x − zα/2 s/√n , x + zα/2 s/√n ),

361 362

for large n.

For now, we do not impose any assumptions on the distribution of Xi .

If we assume that Xi is normal, (X − µ)/(S /√n) has a t distribution with n − 1 degrees of freedom for any n.

Therefore, the 100 × (1 − α)% confidence interval of µ is given by:

( x − tα/2 (n − 1) s/√n , x + tα/2 (n − 1) s/√n ),

where tα/2 (n − 1) denotes the 100 × α/2 percent point of the t distribution with n − 1 degrees of freedom.

Interval Estimation of θ̂n : Let X1 , X2 , · · ·, Xn be mutually independently and identically distributed random variables.

Xi has the probability density function f (xi ; θ).

Suppose that θ̂n represents the maximum likelihood estimator of θ.

From (25), we can approximate the 100 × (1 − α)% confidence interval of θ as follows:

( θ̂n − zα/2 σ(θ̂n )/√n , θ̂n + zα/2 σ(θ̂n )/√n ).

363 364
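The two intervals for µ derived above can be sketched as follows, assuming numpy and scipy are available for the normal and t percent points (data and seed are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(loc=10.0, scale=3.0, size=50)
n, xbar, s = len(x), x.mean(), x.std(ddof=1)  # s^2 uses divisor n-1 (i.e. S^2)
alpha = 0.05

z = stats.norm.ppf(1 - alpha / 2)          # z_{alpha/2}
t = stats.t.ppf(1 - alpha / 2, df=n - 1)   # t_{alpha/2}(n-1)

ci_z = (xbar - z * s / np.sqrt(n), xbar + z * s / np.sqrt(n))  # large-sample interval
ci_t = (xbar - t * s / np.sqrt(n), xbar + t * s / np.sqrt(n))  # exact under normality

# The t interval is always slightly wider than the z interval for the same data
assert ci_t[0] < ci_z[0] and ci_z[1] < ci_t[1]
```

As n grows, tα/2(n − 1) approaches zα/2 and the two intervals coincide.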

8 Testing Hypothesis (Ծઆ‫ݕ‬ఆ)

8.1 Basic Concepts in Testing Hypothesis

Given the population distribution f (x; θ), we want to judge from the observed values x1 , x2 , · · ·, xn whether the hypothesis on the parameter θ, e.g. θ = θ0 , is correct or not.

The hypothesis that we want to test is called the null hypothesis (‫ؼ‬ແԾઆ), which is denoted by H0 : θ = θ0 .

The hypothesis against the null hypothesis, e.g. θ ≠ θ0 , is called the alternative hypothesis (ରཱԾઆ), which is denoted by H1 : θ ≠ θ0 .

365 366
Type I and Type II Errors (ୈҰछͷ‫ޡ‬Γɼୈೋछͷ‫ޡ‬Γ): When we test the null hypothesis H0 , as shown in Table 1 we have four cases, i.e.,

(i) we accept H0 when H0 is true,
(ii) we reject H0 when H0 is true,
(iii) we accept H0 when H0 is false, and
(iv) we reject H0 when H0 is false.

(i) and (iv) are correct judgments, while (ii) and (iii) are not correct.

(ii) is called a type I error (ୈҰछͷ‫ޡ‬Γ) and (iii) is called a type II error (ୈೋछͷ‫ޡ‬Γ).

The probability with which a type I error occurs is called the significance level (༗ҙਫ४), which is denoted by α, and the probability of committing a type II error is denoted by β.

The probability of (iv) is called the power (‫ݕ‬ग़ྗ) or the power function (‫ݕ‬ग़ྗؔ਺), because it is a function of the parameter θ.

367 368

Table 1: Type I and Type II Errors

                      H0 is true.                      H0 is false.
Acceptance of H0      Correct judgment                 Type II Error (ୈೋछͷ‫ޡ‬Γ)
                                                       (Probability β)
Rejection of H0       Type I Error (ୈҰछͷ‫ޡ‬Γ)           Correct judgment
                      (Probability α                   (1 − β = Power ‫ݕ‬ग़ྗ)
                      = Significance Level ༗ҙਫ४)

Testing Procedures: The testing procedure is summarized as follows.

1. Construct the null hypothesis (H0 ) on the parameter.

2. Consider an appropriate statistic, which is called a test statistic (‫ݕ‬ఆ౳‫ྔܭ‬).
   Derive the distribution of the test statistic when H0 is true.

3. From the observed data, compute the observed value of the test statistic.

4. Compare the distribution and the observed value of the test statistic.
   When the observed value of the test statistic is in the tails of the distribution,

369 370

we consider that H0 is not likely to occur and we reject H0 .

The region in which H0 is unlikely to occur and accordingly H0 is rejected is called the rejection region (‫غ‬٫Ҭ) or the critical region, denoted by R.

Conversely, the region in which H0 is likely to occur and accordingly H0 is accepted is called the acceptance region (࠾୒Ҭ), denoted by A.

Using the rejection region R and the acceptance region A, the type I and II errors and the power are formulated as follows.

Suppose that the test statistic is given by T = T (X1 , X2 , · · · , Xn ).

371 372
The probability of committing a type I error (ୈҰछͷ‫ޡ‬Γ), i.e., the significance level (༗ҙਫ४) α, is given by:

P( T (X1 , X2 , · · · , Xn ) ∈ R | H0 is true ) = α,

which is the probability of rejecting H0 when H0 is true.

Conventionally, the significance level α = 0.1, 0.05, 0.01 is chosen in practice.

The probability of committing a type II error (ୈೋछͷ‫ޡ‬Γ), i.e., β, is represented as:

P( T (X1 , X2 , · · · , Xn ) ∈ A | H0 is not true ) = β,

which corresponds to the probability of accepting H0 when H0 is not true.

The power (‫ݕ‬ग़ྗɼ·ͨ͸ɼ‫ݕ‬ఆྗ) is defined as 1 − β,

P( T (X1 , X2 , · · · , Xn ) ∈ R | H0 is not true ) = 1 − β,

which is the probability of rejecting H0 when H0 is not true.

373 374

8.2 Power Function (‫ݕ‬ग़ྗؔ਺)

Let X1 , X2 , · · ·, Xn be mutually independently, identically and normally distributed with mean µ and variance σ2 .

Assume that σ2 is known.

In Figure 3, we consider:

the null hypothesis H0 : µ = µ0 ,
the alternative hypothesis H1 : µ = µ1 ,

where µ1 > µ0 is taken.

[Figure 3: Type I Error (α) and Type II Error (β). The figure shows two density curves: the distribution under the null hypothesis (H0 ), centered at µ0 , and the distribution under the alternative hypothesis (H1 ), centered at µ1 . A critical value f ∗ divides the horizontal axis into the acceptance region A (to the left) and the rejection region R (to the right). α is the area under the H0 density to the right of f ∗ , and β is the area under the H1 density to the left of f ∗ . As α becomes small, i.e., as f ∗ moves to the right, β becomes large.]

375 376

The dark shaded area (probability α) corresponds to the probability of a type I error, i.e., the significance level, while the light shaded area (probability β) indicates the probability of a type II error.

The probability to the right of f ∗ under the H1 distribution represents the power of the test, i.e., 1 − β.

The distribution of the sample mean X is given by:

X ∼ N( µ, σ2 /n ).

By normalization, we have:

(X − µ)/(σ/√n) ∼ N(0, 1).

Therefore, under the null hypothesis H0 : µ = µ0 , we obtain:

(X − µ0 )/(σ/√n) ∼ N(0, 1),

where µ is replaced by µ0 .

377 378
Since the significance level α is the probability of rejecting H0 when H0 is true, it is given by:

α = P( X > µ0 + zα σ/√n ),

where zα denotes the 100 × α percent point of N(0, 1).

Therefore, the rejection region is given by: X > µ0 + zα σ/√n .

Since the power 1 − β is the probability of rejecting H0 when H1 is true, it is given by:

1 − β = P( X > µ0 + zα σ/√n ) = P( (X − µ1 )/(σ/√n) > (µ0 − µ1 )/(σ/√n) + zα )
      = 1 − P( (X − µ1 )/(σ/√n) < (µ0 − µ1 )/(σ/√n) + zα ) = 1 − F( (µ0 − µ1 )/(σ/√n) + zα ),

where F(·) represents the standard normal cumulative distribution function, which is given by:

F(x) = ∫_{−∞}^x (2π)^{−1/2} exp(−t^2 /2) dt.

The power function is a function of µ1 , given µ0 and α.

379 380
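The power expression above translates directly into code. A minimal sketch, assuming numpy and scipy are available for the normal cdf and percent points (all parameter values are illustrative):

```python
import numpy as np
from scipy import stats

def power(mu0, mu1, sigma, n, alpha=0.05):
    # power = 1 - F((mu0 - mu1)/(sigma/sqrt(n)) + z_alpha), sigma known
    z_alpha = stats.norm.ppf(1 - alpha)
    return 1 - stats.norm.cdf((mu0 - mu1) / (sigma / np.sqrt(n)) + z_alpha)

# Power increases with the sample size and with the distance mu1 - mu0
assert power(0.0, 0.5, 1.0, 100) > power(0.0, 0.5, 1.0, 25)
assert power(0.0, 1.0, 1.0, 25) > power(0.0, 0.5, 1.0, 25)
```

At µ1 = µ0 the formula reduces to 1 − F(zα) = α, the significance level, as expected.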

8.3 Small Sample Test (খඪຊ‫ݕ‬ఆ)

8.3.1 Testing Hypothesis on Mean

Known σ2 : Let X1 , X2 , · · ·, Xn be mutually independently, identically and normally distributed with µ and σ2 .

Consider testing the null hypothesis H0 : µ = µ0 .

When the null hypothesis H0 is true, the distribution of X is:

(X − µ0 )/(σ/√n) ∼ N(0, 1).

Therefore, the test statistic is given by: (X − µ0 )/(σ/√n).

Depending on the alternative hypothesis, we have the three cases.

1. The alternative hypothesis H1 : µ < µ0 (one-sided testɼยଆ‫ݕ‬ఆ): We have: P( (X − µ0 )/(σ/√n) < −zα ) = α. Therefore, when (x − µ0 )/(σ/√n) < −zα , we reject the null hypothesis H0 : µ = µ0 at the significance level α.

381 382

2. The alternative hypothesis H1 : µ > µ0 (one-sided testɼยଆ‫ݕ‬ఆ): We have: P( (X − µ0 )/(σ/√n) > zα ) = α. Therefore, when (x − µ0 )/(σ/√n) > zα , we reject the null hypothesis H0 : µ = µ0 at the significance level α.

3. The alternative hypothesis H1 : µ ≠ µ0 (two-sided testɼ྆ଆ‫ݕ‬ఆ): We have: P( |X − µ0 |/(σ/√n) > zα/2 ) = α. Therefore, when |x − µ0 |/(σ/√n) > zα/2 , we reject the null hypothesis H0 : µ = µ0 at the significance level α.

Unknown σ2 : Let X1 , X2 , · · ·, Xn be mutually independently, identically and normally distributed with µ and σ2 .

Test the null hypothesis H0 : µ = µ0 .

When the null hypothesis H0 is true, the distribution of X is given by:

(X − µ0 )/(S /√n) ∼ t(n − 1).

Therefore, the test statistic is given by: (X − µ0 )/(S /√n).

383 384
8.3.2 Testing Hypothesis on Variance

Testing Hypothesis on Variance: Let X1 , X2 , · · ·, Xn be mutually independently, identically and normally distributed with µ and σ2 .

Test the null hypothesis H0 : σ2 = σ20 .

When the null hypothesis H0 is true, the distribution of S 2 is given by:

(n − 1)S 2 /σ20 ∼ χ2 (n − 1)

Testing Equality of Two Variances: Let X1 , X2 , · · ·, Xn be mutually independently, identically and normally distributed with µ x and σ2x .

Let Y1 , Y2 , · · ·, Ym be mutually independently, identically and normally distributed with µy and σ2y .

Test the null hypothesis H0 : σ2x = σ2y .

(n − 1)S 2x /σ2x ∼ χ2 (n − 1), where S 2x = (1/(n − 1)) Σ_{i=1}^n (Xi − X)^2

(m − 1)S y2 /σ2y ∼ χ2 (m − 1), where S y2 = (1/(m − 1)) Σ_{i=1}^m (Yi − Y)^2

Both are independent.

385 386

Then, the ratio of the two χ2 random variables, each divided by its degrees of freedom, is:

[ ( (n − 1)S 2x /σ2x ) / (n − 1) ] / [ ( (m − 1)S y2 /σ2y ) / (m − 1) ] ∼ F(n − 1, m − 1)

Therefore, under the null hypothesis H0 : σ2x = σ2y ,

S 2x /S y2 ∼ F(n − 1, m − 1)

8.4 Large Sample Test (େඪຊ‫ݕ‬ఆ)

• Wald Test (ϫϧυ‫ݕ‬ఆ)

• Likelihood Ratio Test (໬౓ൺ‫ݕ‬ఆ)

• Lagrange Multiplier Test (ϥάϥϯδΣ৐਺‫ݕ‬ఆ) −→ Skipped in this class.

387 388
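The two-variance F test can be sketched as follows, assuming numpy and scipy are available for sampling and the F percent points (sample sizes and seed are illustrative; the data are generated with equal variances, so H0 is true here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(0.0, 1.0, size=30)
y = rng.normal(0.0, 1.0, size=40)

f_stat = x.var(ddof=1) / y.var(ddof=1)  # S_x^2 / S_y^2 ~ F(n-1, m-1) under H0
lower = stats.f.ppf(0.025, dfn=len(x) - 1, dfd=len(y) - 1)
upper = stats.f.ppf(0.975, dfn=len(x) - 1, dfd=len(y) - 1)

reject = f_stat < lower or f_stat > upper  # two-sided test at alpha = 0.05
```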

8.4.1 Wald Test (ϫϧυ‫ݕ‬ఆ)

From (24), under the null hypothesis H0 : θ = θ0 (scalar case), as n −→ ∞, the maximum likelihood estimator θ̂n is distributed as:

(θ̂n − θ0 )/(σ(θ̂n )/√n) −→ N(0, 1).

Or, equivalently,

( (θ̂n − θ0 )/(σ(θ̂n )/√n) )^2 −→ χ2 (1).

For H0 : θ = θ0 and H1 : θ ≠ θ0 , replacing X1 , · · ·, Xn in θ̂n by the observed values x1 , · · ·, xn , the testing procedure is as follows.

When we have: ( (θ̂n − θ0 )/(σ(θ̂n )/√n) )^2 > χ2α (1), we reject the null hypothesis H0 at the significance level α.

χ2α (1) denotes the 100 × α % point of the χ2 distribution with one degree of freedom.

This testing procedure is called the Wald test (ϫϧυ‫ݕ‬ఆ).

389 390
Example 1.18: X1 , X2 , · · ·, Xn are mutually independently, identically and exponentially distributed.

Consider the following exponential probability density function:

f (x; γ) = γ e^{−γx} ,

for 0 < x < ∞.

Using the Wald test, we want to test the null hypothesis H0 : γ = γ0 against the alternative hypothesis H1 : γ ≠ γ0 .

Generally, as n −→ ∞, the distribution of the maximum likelihood estimator of the parameter γ, γ̂n , is asymptotically represented as:

(γ̂n − γ)/(σ(γ̂n )/√n) −→ N(0, 1),

or, equivalently,

( (γ̂n − γ)/(σ(γ̂n )/√n) )^2 −→ χ2 (1),

where

σ2 (γ) = ( E( (d log f (X; γ)/dγ)^2 ) )^{−1} = −( E( d2 log f (X; γ)/dγ2 ) )^{−1} .

391 392

Therefore, under the null hypothesis H0 : γ = γ0 , when n is large enough, we have the following distribution:

( (γ̂n − γ0 )/(σ(γ̂n )/√n) )^2 −→ χ2 (1).

As for the null hypothesis H0 : γ = γ0 against the alternative hypothesis H1 : γ ≠ γ0 , if we have:

( (γ̂n − γ0 )/(σ(γ̂n )/√n) )^2 > χ2α (1),

we can reject H0 at the significance level α.

We need to derive σ2 (γ) and γ̂n for the testing procedure.

First, σ2 (γ) is given by:

σ2 (γ) = −( E( d2 log f (X; γ)/dγ2 ) )^{−1} = γ2 .

Note that the first- and the second-derivatives of log f (X; γ) with respect to γ are given by:

d log f (X; γ)/dγ = 1/γ − X,     d2 log f (X; γ)/dγ2 = −1/γ2 .

Next, the maximum likelihood estimator of γ, i.e., γ̂n , is obtained as follows.

Since X1 , X2 , · · ·, Xn are mutually independently and identically distributed, the

393 394

likelihood function l(γ) is given by:

l(γ) = ∏_{i=1}^n f (xi ; γ) = ∏_{i=1}^n γ e^{−γxi} = γ^n e^{−γ Σ_{i=1}^n xi} .

Therefore, the log-likelihood function is written as:

log l(γ) = n log(γ) − γ Σ_{i=1}^n xi .

We obtain the value of γ which maximizes log l(γ).

Solving the following equation:

d log l(γ)/dγ = n/γ − Σ_{i=1}^n xi = 0,

the MLE of γ, γ̂n , is represented as:

γ̂n = n / Σ_{i=1}^n Xi = 1/X.

Then, we have the following:

(γ̂n − γ)/(σ(γ̂n )/√n) = (γ̂n − γ)/(γ̂n /√n) −→ N(0, 1),

where γ̂n is given by 1/X.

Or, equivalently,

( (γ̂n − γ)/(σ(γ̂n )/√n) )^2 = ( (γ̂n − γ)/(γ̂n /√n) )^2 −→ χ2 (1).

395 396
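Example 1.18 can be sketched directly: compute γ̂n = 1/x̄ and the Wald statistic, then compare it with the χ2(1) percent point. This assumes numpy and scipy are available; the true parameter, sample size and seed are illustrative (the data are generated under H0, so rejection should be rare).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
gamma_true, gamma0, n = 2.0, 2.0, 500
# f(x; gamma) = gamma * exp(-gamma * x)  <=>  mean 1/gamma
x = rng.exponential(scale=1.0 / gamma_true, size=n)

gamma_hat = 1.0 / x.mean()  # MLE derived above
# Wald statistic with sigma(gamma_hat) = gamma_hat
wald = ((gamma_hat - gamma0) / (gamma_hat / np.sqrt(n))) ** 2

reject = wald > stats.chi2.ppf(0.95, df=1)  # test at the 5% significance level
```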
For H0 : γ = γ0 and H1 : γ ≠ γ0 , when we have:

( (γ̂n − γ0 )/(γ̂n /√n) )^2 > χ2α (1),

we reject H0 at the significance level α.

8.4.2 Likelihood Ratio Test (໬౓ൺ‫ݕ‬ఆ)

Suppose that the population distribution is given by f (x; θ), where θ = (θ1 , θ2 ).

Consider testing the null hypothesis H0 : θ1 = θ1∗ against the alternative hypothesis H1 : θ1 ≠ θ1∗ , using the observed values (x1 , · · ·, xn ) corresponding to the random sample (X1 , · · ·, Xn ).

Let θ1 and θ2 be 1 × k1 and 1 × k2 vectors, respectively.

θ = (θ1 , θ2 ) denotes a 1 × (k1 + k2 ) vector.

397 398

Since we take the null hypothesis as H0 : θ1 = θ1∗ , the number of restrictions is given by k1 , which is equal to the dimension of θ1 .

The likelihood function is written as:

l(θ1 , θ2 ) = ∏_{i=1}^n f (xi ; θ1 , θ2 ).

Let (θ̃1 , θ̃2 ) be the maximum likelihood estimator of (θ1 , θ2 ).

That is, (θ̃1 , θ̃2 ) indicates the solution of (θ1 , θ2 ), obtained from the following equations:

∂l(θ1 , θ2 )/∂θ1 = 0,     ∂l(θ1 , θ2 )/∂θ2 = 0.

The solution (θ̃1 , θ̃2 ) is called the unconstrained maximum likelihood estimator (੍໿ͳ͠࠷໬ਪఆྔ), because the null hypothesis H0 : θ1 = θ1∗ is not taken into account.

399 400

Let θ̂2 be the maximum likelihood estimator of θ2 under the null hypothesis H0 : θ1 = θ1∗ .

That is, θ̂2 is a solution of the following equation:

∂l(θ1∗ , θ2 )/∂θ2 = 0.

The solution θ̂2 is called the constrained maximum likelihood estimator (੍໿͖ͭ࠷໬ਪఆྔ) of θ2 , because the likelihood function is maximized with respect to θ2 subject to the constraint θ1 = θ1∗ .

Define λ as follows:

λ = l(θ1∗ , θ̂2 ) / l(θ̃1 , θ̃2 ),

which is called the likelihood ratio (໬౓ൺ).

As n goes to infinity, it is known that we have:

−2 log(λ) −→ χ2 (k1 ),

where k1 denotes the number of the constraints.

401 402
Let χ2α (k1 ) be the 100 × α percent point from the chi-square distribution with k1 degrees of freedom.

When −2 log(λ) > χ2α (k1 ), we reject the null hypothesis H0 : θ1 = θ1∗ at the significance level α.

This test is called the likelihood ratio test (໬౓ൺ‫ݕ‬ఆ).

If −2 log(λ) is close to zero, we accept the null hypothesis.

When (θ1∗ , θ̂2 ) is close to (θ̃1 , θ̃2 ), −2 log(λ) approaches zero.

Example 1.19: X1 , X2 , · · ·, Xn are mutually independently, identically and exponentially distributed.

Consider the exponential probability density function:

f (x; γ) = γ e^{−γx} ,

for 0 < x < ∞.

Using the likelihood ratio test, we test the null hypothesis H0 : γ = γ0 against the alternative hypothesis H1 : γ ≠ γ0 .

403 404

The likelihood ratio is given by:

λ = l(γ0 ) / l(γ̂n ),

where γ̂n is derived in Example 1.18, i.e.,

γ̂n = n / Σ_{i=1}^n Xi = 1/X.

Since the number of the constraints is equal to one, as the sample size n goes to infinity we have the following asymptotic distribution:

−2 log λ −→ χ2 (1).

The likelihood ratio is computed as follows:

λ = l(γ0 ) / l(γ̂n ) = γ0^n e^{−γ0 Σ Xi} / ( γ̂n^n e^{−n} ).

If −2 log λ > χ2α (1), we reject the null hypothesis H0 : γ = γ0 at the significance level α.

405 406

Example 1.20: Suppose that X1 , X2 , · · ·, Xn are mutually independently, identically and normally distributed with mean µ and variance σ2 .

The normal probability density function with mean µ and variance σ2 is given by:

f (x; µ, σ2 ) = (2πσ2 )^{−1/2} e^{−(x−µ)^2 /(2σ2 )} .

By the likelihood ratio test, we test the null hypothesis H0 : µ = µ0 against the alternative hypothesis H1 : µ ≠ µ0 .

The likelihood ratio is given by:

λ = l(µ0 , σ̃2 ) / l(µ̂, σ̂2 ),

where σ̃2 is the constrained maximum likelihood estimator under the constraint µ = µ0 , while (µ̂, σ̂2 ) denotes the unconstrained maximum likelihood estimator.

In this case, since the number of the constraints is one, the asymptotic distribution is as follows:

−2 log λ −→ χ2 (1).

407 408
We derive l(µ0 , σ̃2 ) and l(µ̂, σ̂2 ). l(µ, σ2 ) is written as:

l(µ, σ2 ) = f (x1 , x2 , · · · , xn ; µ, σ2 ) = ∏_{i=1}^n f (xi ; µ, σ2 )
   = ∏_{i=1}^n (2πσ2 )^{−1/2} exp( −(1/(2σ2 )) (xi − µ)^2 )
   = (2πσ2 )^{−n/2} exp( −(1/(2σ2 )) Σ_{i=1}^n (xi − µ)^2 ).

The log-likelihood function log l(µ, σ2 ) is represented as:

log l(µ, σ2 ) = −(n/2) log(2π) − (n/2) log(σ2 ) − (1/(2σ2 )) Σ_{i=1}^n (xi − µ)^2 .

For the numerator of the likelihood ratio, under the constraint µ = µ0 , maximize log l(µ0 , σ2 ) with respect to σ2 .

Since we obtain the first derivative:

∂ log l(µ0 , σ2 )/∂σ2 = −(n/2)(1/σ2 ) + (1/(2σ4 )) Σ_{i=1}^n (xi − µ0 )^2 = 0,

the constrained maximum likelihood estimate σ̃2 is:

σ̃2 = (1/n) Σ_{i=1}^n (xi − µ0 )^2 .

409 410

Therefore, replacing σ2 by σ̃2 , l(µ0 , σ̃2 ) is written as:

l(µ0 , σ̃2 ) = (2πσ̃2 )^{−n/2} exp( −(1/(2σ̃2 )) Σ_{i=1}^n (xi − µ0 )^2 )
   = (2πσ̃2 )^{−n/2} exp( −n/2 ).

For the denominator of the likelihood ratio, because the unconstrained maximum likelihood estimates are obtained as:

µ̂ = (1/n) Σ_{i=1}^n xi ,     σ̂2 = (1/n) Σ_{i=1}^n (xi − µ̂)^2 ,

l(µ̂, σ̂2 ) is written as:

l(µ̂, σ̂2 ) = (2πσ̂2 )^{−n/2} exp( −(1/(2σ̂2 )) Σ_{i=1}^n (xi − µ̂)^2 )
   = (2πσ̂2 )^{−n/2} exp( −n/2 ).

411 412

Thus, the likelihood ratio is given by:

λ = l(µ0 , σ̃2 ) / l(µ̂, σ̂2 ) = ( (2πσ̃2 )^{−n/2} exp(−n/2) ) / ( (2πσ̂2 )^{−n/2} exp(−n/2) ) = ( σ̃2 /σ̂2 )^{−n/2} .

Asymptotically, we have:

−2 log λ = n( log σ̃2 − log σ̂2 ) −→ χ2 (1).

When −2 log λ > χ2α (1), we reject the null hypothesis H0 : µ = µ0 at the significance level α.

Exam

July 31, 2012

60–70% from 16 exercises (in my Web) and two homeworks

413 414
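The likelihood ratio test of Example 1.20 reduces to −2 log λ = n(log σ̃2 − log σ̂2), which can be sketched as follows. It assumes numpy and scipy are available; the data are generated under H0 with illustrative settings and seed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
mu0, n = 0.0, 200
x = rng.normal(loc=0.0, scale=1.0, size=n)  # data generated under H0: mu = mu0

mu_hat = x.mean()
sigma2_hat = ((x - mu_hat) ** 2).mean()   # unconstrained MLE of sigma^2
sigma2_tilde = ((x - mu0) ** 2).mean()    # constrained MLE, with mu fixed at mu0

lr = n * (np.log(sigma2_tilde) - np.log(sigma2_hat))  # -2 log(lambda)
reject = lr > stats.chi2.ppf(0.95, df=1)  # test at the 5% significance level
```

Since σ̃2 ≥ σ̂2 always holds, the statistic is nonnegative, and it is close to zero when x̄ is close to µ0.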
