Econometrics I
You can get this lecture note from:
www2.econ.osaka-u.ac.jp/~tanizaki/class/2012

Classes:
• Undergraduate, Tue., 3rd class, Room #5, Prof. Oya
• Graduate, Tue., 6th class, Room #1, Prof. Fukushige

Some Textbooks
• R.V. Hogg, J.W. McKean and A.T. Craig, 2005, Introduction to Mathematical Statistics (Sixth edition), Pearson Prentice Hall.
• H. Tanizaki, 2004, Computational Methods in Statistics and Econometrics (STATISTICS: textbooks and monographs, Vol. 172), Marcel Dekker.
• Two Japanese exercise books on probability and statistics (1966), for more elementary statistics.
1 Event and Probability

1.1 Event

We consider an experiment whose outcome is not known in advance but an event occurs with some probability; such an experiment is sometimes called a random experiment.

The sample space of an experiment is the set of all possible outcomes.

Each outcome in the sample space is called an element of the sample space or a sample point, which represents each outcome obtained by the experiment.

An event is any collection of outcomes contained in the sample space, or equivalently a subset of the sample space.
An elementary event consists of exactly one element and a compound event consists of more than one element.

The sample space is denoted by Ω and a sample point is denoted by ω.

Suppose that event A is a subset of the sample space Ω. Let ω be a sample point in event A. Then, we say that the sample point ω is contained in event A, which is denoted by ω ∈ A.

The event which does not have any sample point is called the empty event, denoted by φ.

Conversely, the event which includes all possible sample points is called the whole event, represented by Ω, which is equivalent to the sample space.

The event which consists of the sample points not belonging to event A is called the complementary event of A, which is denoted by Ac.
Next, consider two events A and B.

The event which belongs to either event A or event B is called the sum event (union), which is denoted by A ∪ B.

The event which belongs to both event A and event B is called the product event (intersection), denoted by A ∩ B.

When A ∩ B = φ, we say that events A and B are exclusive.

Example 1.1: Consider an experiment of casting a die.

We have six sample points, which are denoted by ω1 = {1}, ω2 = {2}, ω3 = {3}, ω4 = {4}, ω5 = {5} and ω6 = {6}, where ωi represents the sample point that we have i.

In this experiment, the sample space is given by Ω = {ω1, ω2, ω3, ω4, ω5, ω6}.

Let A be the event that we have even numbers and B be the event that we have multiples of three. Then, we can write A = {ω2, ω4, ω6} and B = {ω3, ω6}.
The complementary event of A is given by Ac = {ω1, ω3, ω5}, which is the event that we have odd numbers.

The sum event of A and B is written as A ∪ B = {ω2, ω3, ω4, ω6}, while the product event is A ∩ B = {ω6}.

Since A ∩ Ac = φ, A and Ac are exclusive.

Example 1.2: Cast a coin three times. In this case, we have the following eight sample points:

ω1 = (H,H,H), ω2 = (H,H,T), ω3 = (H,T,H), ω4 = (H,T,T),
ω5 = (T,H,H), ω6 = (T,H,T), ω7 = (T,T,H), ω8 = (T,T,T).
Therefore, the sample space of this experiment can be written as:

Ω = {ω1, ω2, ω3, ω4, ω5, ω6, ω7, ω8}.

Let A be the event that we have two heads, B be the event that we obtain at least one tail, C be the event that we have a head in the second flip, and D be the event that we obtain a tail in the third flip.

Then, the events A, B, C and D are given by:

A = {ω2, ω3, ω5}, B = {ω2, ω3, ω4, ω5, ω6, ω7, ω8},
C = {ω1, ω2, ω5, ω6}, D = {ω2, ω4, ω6, ω8}.

Since A is a subset of B, denoted by A ⊂ B, the sum event is A ∪ B = B, while the product event is A ∩ B = A.

Moreover, we obtain C ∩ D = {ω2, ω6} and C ∪ D = {ω1, ω2, ω4, ω5, ω6, ω8}.
1.2 Probability

In the case of Example 1.1 (Section 1.1), each of the six possible outcomes has probability 1/6, and in Example 1.2 (Section 1.1), each of the eight possible outcomes has probability 1/8.

Thus, the probability with which event A occurs is defined as the number of sample points in A divided by the number of sample points in the sample space, i.e., P(A) = n(A)/n(Ω).

In Example 1.1 we have n(A) = 3 and n(A ∩ B) = 1, so that P(A) = 3/6 and P(A ∩ B) = 1/6.

Similarly, in Example 1.2, we have P(C) = 4/8, P(A ∩ B) = P(A) = 3/8 and so on. Note that we obtain P(A) ≤ P(B) because of A ⊆ B.
It is known that we have the following three properties on probability:

0 ≤ P(A) ≤ 1,  (1)
P(Ω) = 1,  (2)
P(φ) = 0.  (3)

φ ⊆ A ⊆ Ω implies n(φ) ≤ n(A) ≤ n(Ω). Dividing by n(Ω), we obtain:

n(φ)/n(Ω) ≤ n(A)/n(Ω) ≤ n(Ω)/n(Ω) = 1.

Therefore, we have:

P(φ) ≤ P(A) ≤ P(Ω) = 1.

Because φ has no sample point, the number of sample points is given by n(φ) = 0 and accordingly we have P(φ) = 0. Therefore, 0 ≤ P(A) ≤ 1 is obtained as in (1).
When events A and B are exclusive, i.e., when A ∩ B = φ, then P(A ∪ B) = P(A) + P(B) holds.

Moreover, since A and Ac are exclusive, P(Ac) = 1 − P(A) is obtained. Note that P(A ∪ Ac) = P(Ω) = 1 holds.

More generally, even when A and B are not exclusive, we have the following addition rule:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Thus, in the example we can verify that the above addition rule holds.
The probability that event A occurs, given that event B has occurred, is called the conditional probability, i.e.,

P(A|B) = n(A ∩ B)/n(B) = P(A ∩ B)/P(B),

or equivalently,

P(A ∩ B) = P(A|B)P(B).

When event A is independent of event B, we have P(A ∩ B) = P(A)P(B), which implies that P(A|B) = P(A). Conversely, P(A ∩ B) = P(A)P(B) implies that A is independent of B.
In Example 1.2, because of P(A ∩ C) = 1/4 and P(C) = 1/2, the conditional probability P(A|C) = 1/2 is obtained.

From P(A) = 3/8, we have P(A ∩ C) ≠ P(A)P(C). Therefore, A is not independent of C.

As for C and D, since we have P(C) = 1/2, P(D) = 1/2 and P(C ∩ D) = 1/4, we can show that C is independent of D.
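As a quick numerical illustration of Example 1.2, the following Python sketch enumerates the eight sample points and verifies P(A|C) = 1/2, that A and C are not independent, and that C and D are independent.

```python
from itertools import product

# Enumerate the sample space of three coin flips: each outcome is equally likely.
omega = list(product("HT", repeat=3))          # 8 sample points
P = lambda event: len(event) / len(omega)      # classical probability n(A)/n(Omega)

A = {w for w in omega if w.count("H") == 2}    # exactly two heads
C = {w for w in omega if w[1] == "H"}          # head in the second flip
D = {w for w in omega if w[2] == "T"}          # tail in the third flip

print(P(A & C) / P(C))                         # conditional probability P(A|C) = 0.5
print(P(A & C) == P(A) * P(C))                 # False: A is not independent of C
print(P(C & D) == P(C) * P(D))                 # True: C is independent of D
```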
2 Random Variable and Distribution

2.1 Univariate Random Variable and Distribution

The random variable X is defined as a real-valued function on the sample space Ω. Since X is a function of a sample point ω, it is written as X = X(ω).

Suppose that X(ω) takes a real value on the interval I. That is, X depends on a set of sample points ω, i.e., {ω; X(ω) ∈ I}, which is simply written as {X ∈ I}.

In Example 1.1 (Section 1.1), suppose that X is a random variable which takes the number of spots up on the die. Then, X is a function of ω and takes the values X(ω1) = 1, X(ω2) = 2, X(ω3) = 3, X(ω4) = 4, X(ω5) = 5, X(ω6) = 6.

Next, suppose that X is a random variable which takes 1 for odd numbers and 0 for even numbers on the die. Then, X is a function of ω and takes the following values:

X(ω1) = 1, X(ω2) = 0, X(ω3) = 1, X(ω4) = 0, X(ω5) = 1, X(ω6) = 0.

In Example 1.2 (Section 1.1), suppose that X is a random variable which takes the number of heads. Depending on the sample point ωi, X takes the values X(ω1) = 3, X(ω2) = X(ω3) = X(ω5) = 2, X(ω4) = X(ω6) = X(ω7) = 1, X(ω8) = 0.

There are two kinds of random variables. One is a discrete random variable, while the other is a continuous random variable.
Discrete Random Variable and Probability Function: Suppose that the discrete random variable X takes x1, x2, ···, where x1 < x2 < ··· is assumed.

Consider the probability that X takes xi, i.e., P(X = xi) = pi, which is a function of xi. That is, a function of xi, say f(xi), is associated with P(X = xi) = pi.

The function f(xi) represents the probability in the case where X takes xi. Therefore, we have the following relation:

P(X = xi) = pi = f(xi),  i = 1, 2, ···,

where f(xi) is called the probability function of X.
More formally, the function f(xi) which has the following properties is defined as the probability function:

f(xi) ≥ 0,  i = 1, 2, ···,
Σi f(xi) = 1.

Furthermore, for a set A, we can write a probability as the following equation:

P(X ∈ A) = Σ_{xi ∈ A} f(xi).

Several functional forms of f(xi) are as follows.

Discrete uniform distribution:

f(x) = 1/N for x = 1, 2, ···, N, and 0 otherwise,

where N = 1, 2, ···.

Bernoulli distribution:

f(x) = p^x (1 − p)^{1−x} for x = 0, 1, and 0 otherwise,

where 0 ≤ p ≤ 1.
Binomial distribution:

f(x) = nCx p^x (1 − p)^{n−x} for x = 0, 1, ···, n, and 0 otherwise,

where 0 ≤ p ≤ 1, n = 1, 2, ···, and

nCx = n!/(x!(n − x)!),  n! = 1 · 2 ··· n (factorial of n).

When X has this probability function, we write X ∼ B(n, p).
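As an illustration, the following Python sketch evaluates the discrete uniform, Bernoulli and binomial probability functions defined above and confirms numerically that each sums to one; the parameter values are arbitrary examples.

```python
from math import comb

def discrete_uniform(x, N):
    # f(x) = 1/N for x = 1, ..., N
    return 1.0 / N if 1 <= x <= N else 0.0

def bernoulli(x, p):
    # f(x) = p^x (1-p)^(1-x) for x = 0, 1
    return p**x * (1 - p)**(1 - x) if x in (0, 1) else 0.0

def binomial(x, n, p):
    # f(x) = nCx p^x (1-p)^(n-x) for x = 0, ..., n
    return comb(n, x) * p**x * (1 - p)**(n - x) if 0 <= x <= n else 0.0

print(sum(discrete_uniform(x, 6) for x in range(1, 7)))      # 1.0
print(sum(bernoulli(x, 0.3) for x in (0, 1)))                # 1.0
print(sum(binomial(x, 3, 0.5) for x in range(4)))            # 1.0
print(binomial(2, 3, 0.5))            # P(two heads in three tosses) = 0.375
```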
Geometric distribution:

f(x) = p(1 − p)^x for x = 0, 1, ···, and 0 otherwise,

where 0 < p ≤ 1.

Negative binomial distribution:

f(x) = ((r+x−1) choose x) p^r (1 − p)^x for x = 0, 1, ···, and 0 otherwise,

where 0 < p ≤ 1 and r > 0.

To see that the negative binomial probabilities sum to one, consider the Taylor series expansion of f(x) = (1 − x)^{−r} about x = 0:

f(x) = 1 + rx + (r(r+1)/2!) x² + (r(r+1)(r+2)/3!) x³ + ··· + ((r+k−1) choose k) x^k + ···,

where the kth term is ((r+k−1) choose k) x^k. That is,

(1 − x)^{−r} = Σ_{k=0}^{∞} ((r+k−1) choose k) x^k.

Setting x = 1 − p and k = x, we obtain

p^{−r} = Σ_{x=0}^{∞} ((r+x−1) choose x) (1 − p)^x,  i.e.,  1 = Σ_{x=0}^{∞} ((r+x−1) choose x) p^r (1 − p)^x.
Hypergeometric distribution:

f(x) = [ (K choose x) ((M−K) choose (n−x)) ] / (M choose n) for x = 0, 1, ···, n, and 0 otherwise,

where M = 1, 2, ···, K = 0, 1, ···, M, and n = 1, 2, ···, M.

In Example 1.2 (Section 1.1), all the possible values of X are 0, 1, 2 and 3 (note that X denotes the number of heads when a coin is cast three times). That is, x1 = 0, x2 = 1, x3 = 2 and x4 = 3 are assigned in this case.
Continuous Random Variable and Probability Density Function: Whereas a discrete random variable assumes at most a countable set of possible values, a continuous random variable X takes any real number within an interval I.

For the interval I, the probability that X is contained in I is defined as:

P(X ∈ I) = ∫_I f(x) dx.

For example, let I be the interval between a and b for a < b. Then, we can rewrite P(X ∈ I) as follows:

P(a < X < b) = ∫_a^b f(x) dx,

where f(x) is called the probability density function of X, or simply the density function of X.
In order for f(x) to be a probability density function, f(x) has to satisfy the following properties:

f(x) ≥ 0,
∫_{−∞}^{∞} f(x) dx = 1.

Some functional forms of f(x) are as follows.

Uniform distribution:

f(x) = 1/(b − a) for a < x < b, and 0 otherwise.
Gamma distribution:

f(x) = (λ^r/Γ(r)) x^{r−1} e^{−λx} for x > 0, and 0 otherwise,

where λ > 0 and r > 0.

The gamma distribution with r = 1 is the exponential distribution.

Gamma function: Γ(a) = ∫_0^∞ x^{a−1} e^{−x} dx, a > 0.
Γ(a + 1) = aΓ(a) (use integration by parts),
Γ(n + 1) = n! for integer n,
Γ(n + 1/2) = ((1 · 3 · 5 ··· (2n − 1))/2^n) √π,  Γ(1/2) = 2Γ(3/2) = √π.

Beta distribution:

f(x) = (1/B(a, b)) x^{a−1} (1 − x)^{b−1} for 0 < x < 1, and 0 otherwise,

where a > 0 and b > 0.

Beta function: B(a, b) = ∫_0^1 x^{a−1} (1 − x)^{b−1} dx = Γ(a)Γ(b)/Γ(a + b),  B(a, b) = B(b, a).
Cauchy distribution:

f(x) = 1/(πβ(1 + (x − α)²/β²)), −∞ < x < ∞,

where −∞ < α < ∞ and β > 0.

Laplace distribution:

f(x) = (1/(2β)) exp(−|x − α|/β), −∞ < x < ∞,

where −∞ < α < ∞ and β > 0.

Log-normal distribution:

f(x) = (1/(x√(2πσ²))) exp(−(1/(2σ²))(ln x − µ)²) for 0 < x < ∞, and 0 otherwise,

where −∞ < µ < ∞ and σ > 0.

Weibull distribution:

f(x) = abx^{b−1} exp(−ax^b) for x > 0, and 0 otherwise,

where a > 0 and b > 0.

The logistic distribution and the Gumbel distribution (extreme value distribution) are further examples of continuous distributions.
t distribution:

f(x) = [Γ((k+1)/2)/(Γ(k/2)√(kπ))] (1 + x²/k)^{−(k+1)/2}, −∞ < x < ∞,

where k > 0. X ∼ t(k) denotes the t distribution with k degrees of freedom.
t(1) ⇐⇒ Cauchy distribution,  t(∞) ⇐⇒ N(0, 1).

F distribution:

f(x) = [Γ((m+n)/2)/(Γ(m/2)Γ(n/2))] (m/n)^{m/2} x^{m/2−1}/(1 + (m/n)x)^{(m+n)/2} for x > 0, and 0 otherwise,

where m, n = 1, 2, ···. X ∼ F(m, n) denotes the F distribution with (m, n) degrees of freedom.
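As a numerical aside, the densities listed above are available in scipy.stats, which makes it easy to confirm that each integrates to one; the parameter values below are arbitrary examples.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# A few of the continuous densities introduced above, with arbitrary parameters,
# each integrated over its own support.
cases = [
    ("gamma (r=2, lambda=1.5)",  stats.gamma(a=2, scale=1/1.5).pdf, 0, np.inf),
    ("beta (a=2, b=3)",          stats.beta(2, 3).pdf,              0, 1),
    ("Cauchy (alpha=0, beta=1)", stats.cauchy(0, 1).pdf,            -np.inf, np.inf),
    ("t (k=5)",                  stats.t(5).pdf,                    -np.inf, np.inf),
    ("F (m=3, n=7)",             stats.f(3, 7).pdf,                 0, np.inf),
]

for name, pdf, lo, hi in cases:
    total, _ = quad(pdf, lo, hi)          # numerical integration of the density
    print(name, round(total, 6))          # each integral should equal 1
```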
Example 1.3: As an example, consider the following function:

f(x) = 1 for 0 < x < 1, and 0 otherwise.

Clearly, since f(x) ≥ 0 for −∞ < x < ∞ and ∫_{−∞}^{∞} f(x) dx = ∫_0^1 f(x) dx = [x]_0^1 = 1, the above function can be a probability density function. In fact, it is called a uniform distribution.

Example 1.4: As another example, consider the following function:

f(x) = (1/√(2π)) e^{−x²/2}, for −∞ < x < ∞.

Clearly, we have f(x) ≥ 0 for all x. We check whether ∫_{−∞}^{∞} f(x) dx = 1.

First of all, we define I as I = ∫_{−∞}^{∞} f(x) dx. To show I = 1, we may prove I² = 1 because of f(x) > 0 for all x, which is shown as follows:
I² = (∫_{−∞}^{∞} f(x) dx)² = ∫_{−∞}^{∞} f(x) dx ∫_{−∞}^{∞} f(y) dy
   = ∫_{−∞}^{∞} (1/√(2π)) exp(−x²/2) dx ∫_{−∞}^{∞} (1/√(2π)) exp(−y²/2) dy
   = (1/(2π)) ∫_{−∞}^{∞} ∫_{−∞}^{∞} exp(−(x² + y²)/2) dx dy
   = (1/(2π)) ∫_0^{2π} ∫_0^{∞} exp(−r²/2) r dr dθ
   = (1/(2π)) ∫_0^{2π} ∫_0^{∞} exp(−s) ds dθ = (1/(2π)) 2π [−exp(−s)]_0^{∞} = 1.

< Review > Integration by Substitution:

Univariate Case: For a function of x, f(x), we perform integration by substitution, using x = ψ(y). Then, it is easy to obtain the following formula:

∫ f(x) dx = ∫ ψ′(y) f(ψ(y)) dy,

which is called integration by substitution.
Proof: Let F(x) be the integral of f(x), i.e.,

F(x) = ∫_{−∞}^x f(t) dt,

which implies that F′(x) = f(x). Differentiating F(x) = F(ψ(y)) with respect to y, we have:

dF(ψ(y))/dy = (dF(x)/dx)(dx/dy) = f(x)ψ′(y) = f(ψ(y))ψ′(y).

Bivariate Case: For f(x, y), define x = ψ1(u, v) and y = ψ2(u, v). Then,

∫∫ f(x, y) dx dy = ∫∫ |J| f(ψ1(u, v), ψ2(u, v)) du dv,

where J is called the Jacobian, which represents the following determinant:

J = | ∂x/∂u  ∂x/∂v |
    | ∂y/∂u  ∂y/∂v | = (∂x/∂u)(∂y/∂v) − (∂x/∂v)(∂y/∂u).

< End of Review >
< Go back to the Integration >

In the fifth equality, integration by substitution is used. The polar coordinate transformation is used as x = r cos θ and y = r sin θ. Note that 0 ≤ r < +∞ and 0 ≤ θ < 2π. The Jacobian is given by:

J = | ∂x/∂r  ∂x/∂θ |   | cos θ  −r sin θ |
    | ∂y/∂r  ∂y/∂θ | = | sin θ   r cos θ | = r.

In the inner integration of the sixth equality, integration by substitution is utilized again, where the transformation is s = r²/2.

Thus, we obtain the result I² = 1 and accordingly we have I = 1 because of f(x) ≥ 0.

Therefore, f(x) = e^{−x²/2}/√(2π) is also taken as a probability density function. Actually, this density function is called the standard normal probability density function.
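As a quick numerical confirmation, the standard normal density integrates to one and the double integral used in the argument above is also close to one; this is a minimal sketch using finite integration limits, since the tails beyond ±10 are negligible.

```python
import numpy as np
from scipy.integrate import quad, dblquad

f = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)   # standard normal density

I, _ = quad(f, -np.inf, np.inf)                           # I = integral of f(x)
I2, _ = dblquad(lambda y, x: f(x) * f(y),                 # I^2 = double integral of f(x) f(y)
                -10, 10, lambda x: -10, lambda x: 10)

print(I)    # approximately 1.0
print(I2)   # approximately 1.0, consistent with I^2 = 1
```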
Distribution Function: The distribution function or the cumulative distribution function, denoted by F(x), is defined as:

P(X ≤ x) = F(x),

which represents the probability that X is less than or equal to x.

The properties of the distribution function F(x) are given by:

F(x1) ≤ F(x2) for x1 < x2 (nondecreasing function),
P(a < X ≤ b) = F(b) − F(a) for a < b,
F(−∞) = 0,  F(+∞) = 1.

The difference between the discrete and continuous random variables is given by:

1. Discrete random variable (Figure 1):
• F(x) = Σ_{i=1}^r f(xi) = Σ_{i=1}^r pi, where r denotes the integer which satisfies xr ≤ x < xr+1.

2. Continuous random variable (Figure 2):
• F(x) = ∫_{−∞}^x f(t) dt,
• F′(x) = f(x).
Figure 1: Probability Function f(x) and Distribution Function F(x) (Discrete Case). The figure plots f(x1), f(x2), ···, f(xr), ··· against x1, x2, ··· and indicates F(x) = Σ_{i=1}^r f(xi), where r is the integer which satisfies xr ≤ x < xr+1.

Figure 2: Density Function f(x) and Distribution Function F(x) (Continuous Case). The figure shades the area F(x) = ∫_{−∞}^x f(t) dt under the density f(x).
2.2 Multivariate Random Variable and Distribution

We consider two random variables X and Y in this section. It is easy to extend to more than two random variables.

Discrete Random Variables: Suppose that discrete random variables X and Y take x1, x2, ··· and y1, y2, ···, respectively. The probability that the event {ω; X(ω) = xi and Y(ω) = yj} occurs is given by:

P(X = xi, Y = yj) = fxy(xi, yj),

where fxy(xi, yj) represents the joint probability function of X and Y. In order for fxy(xi, yj) to be a joint probability function, it has to satisfy the following properties:

fxy(xi, yj) ≥ 0,  i, j = 1, 2, ···,
Σi Σj fxy(xi, yj) = 1.

Define fx(xi) and fy(yj) as:

fx(xi) = Σj fxy(xi, yj),  i = 1, 2, ···,
fy(yj) = Σi fxy(xi, yj),  j = 1, 2, ···.

Then, fx(xi) and fy(yj) are called the marginal probability functions of X and Y.

fx(xi) and fy(yj) also have the properties of probability functions, i.e., fx(xi) ≥ 0 and Σi fx(xi) = 1, and fy(yj) ≥ 0 and Σj fy(yj) = 1.

Continuous Random Variables: Consider two continuous random variables X and Y. For a domain D, the probability that the event {ω; (X(ω), Y(ω)) ∈ D} occurs is given by:

P((X, Y) ∈ D) = ∫∫_D fxy(x, y) dx dy,

where fxy(x, y) is called the joint probability density function of X and Y, or the joint density function of X and Y.
fxy(x, y) has to satisfy the following properties:

fxy(x, y) ≥ 0,
∫_{−∞}^{∞} ∫_{−∞}^{∞} fxy(x, y) dx dy = 1.

Define fx(x) and fy(y) as:

fx(x) = ∫_{−∞}^{∞} fxy(x, y) dy,
fy(y) = ∫_{−∞}^{∞} fxy(x, y) dx,

for all x and y, where fx(x) and fy(y) are called the marginal probability density functions of X and Y, or the marginal density functions of X and Y.

For example, consider the event {ω; a < X(ω) < b, c < Y(ω) < d}, which is a specific case of the domain D. Then, the probability that we have the event {ω; a < X(ω) < b, c < Y(ω) < d} is written as:

P(a < X < b, c < Y < d) = ∫_a^b ∫_c^d fxy(x, y) dy dx.
A mixture of discrete and continuous random variables is also possible. For example, let X be a discrete random variable and Y be a continuous random variable, where X takes x1, x2, ···. The probability that X takes xi and Y takes a real number within the interval I is given by:

P(X = xi, Y ∈ I) = ∫_I fxy(xi, y) dy.

Then, fxy(xi, y) has the corresponding properties of a joint probability function.

The marginal probability function of X is given by:

fx(xi) = ∫_{−∞}^{∞} fxy(xi, y) dy,

for i = 1, 2, ···. The marginal probability density function of Y is:

fy(y) = Σi fxy(xi, y).
2.3 Conditional Distribution

Discrete Random Variable: The conditional probability function of X given Y = yj is represented as:

P(X = xi | Y = yj) = fx|y(xi|yj) = fxy(xi, yj)/fy(yj) = fxy(xi, yj)/Σi fxy(xi, yj).

The features of the conditional probability function fx|y(xi|yj) are:

fx|y(xi|yj) ≥ 0,  i = 1, 2, ···,
Σi fx|y(xi|yj) = 1, for any j.
Continuous Random Variable: The conditional probability density function of X given Y = y (or the conditional density function of X given Y = y) is:

fx|y(x|y) = fxy(x, y)/fy(y) = fxy(x, y)/∫_{−∞}^{∞} fxy(x, y) dx.

The properties of the conditional probability density function fx|y(x|y) are given by:

fx|y(x|y) ≥ 0,
∫_{−∞}^{∞} fx|y(x|y) dx = 1, for any Y = y.
Independence of Random Variables: For discrete random variables X and Y, we say that X is independent (or stochastically independent) of Y if and only if fxy(xi, yj) = fx(xi) fy(yj).

Similarly, for continuous random variables X and Y, we say that X is independent of Y if and only if fxy(x, y) = fx(x) fy(y).

When X and Y are stochastically independent, g(X) and h(Y) are also stochastically independent, where g(X) and h(Y) are functions of X and Y.
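As an illustration of Sections 2.2 and 2.3, the following sketch takes a small joint probability table (an arbitrary example chosen to factorize), computes the marginal and conditional probability functions, and checks independence.

```python
import numpy as np

# Joint probability function f_xy(x_i, y_j) as a table: rows index x, columns index y.
# This particular table factorizes, so X and Y are independent by construction.
f_xy = np.array([[0.08, 0.12, 0.20],
                 [0.12, 0.18, 0.30]])

f_x = f_xy.sum(axis=1)            # marginal probability function of X
f_y = f_xy.sum(axis=0)            # marginal probability function of Y
f_x_given_y = f_xy / f_y          # conditional probability function f_{x|y}(x_i|y_j)

print(f_xy.sum())                                # 1.0: valid joint probability function
print(f_x_given_y.sum(axis=0))                   # each column sums to 1
print(np.allclose(f_xy, np.outer(f_x, f_y)))     # True: X is independent of Y
```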
3 Mathematical Expectation

3.1 Univariate Random Variable

Definition: Let g(X) be a function of random variable X. The mathematical expectation of g(X), denoted by E(g(X)), is defined as:

E(g(X)) = Σi g(xi) pi = Σi g(xi) f(xi)  (Discrete RV),
E(g(X)) = ∫_{−∞}^{∞} g(x) f(x) dx  (Continuous RV).

The following three functional forms of g(X) are important.

1. g(X) = X.

The expectation of X, E(X), is known as the mean of random variable X:

E(X) = Σi xi f(xi)  (Discrete RV),
E(X) = ∫_{−∞}^{∞} x f(x) dx  (Continuous RV),
     = µ (or µx).

2. g(X) = (X − µ)².
The expectation of (X − µ)² is known as the variance of random variable X, which is denoted by V(X):

V(X) = E((X − µ)²)
     = Σi (xi − µ)² f(xi)  (Discrete RV),
     = ∫_{−∞}^{∞} (x − µ)² f(x) dx  (Continuous RV),
     = σ² (or σx²).

If X is broadly distributed, σ² = V(X) becomes large. Conversely, if the distribution is concentrated on the center, σ² becomes small. Note that σ = √V(X) is called the standard deviation.
3. g(X) = e^{θX}.

The expectation of e^{θX} is called the moment-generating function, which is denoted by φ(θ):

φ(θ) = E(e^{θX})
     = Σi e^{θxi} f(xi)  (Discrete RV),
     = ∫_{−∞}^{∞} e^{θx} f(x) dx  (Continuous RV).

Some Formulas of Mean and Variance:

1. Theorem: E(aX + b) = aE(X) + b, where a and b are constant.

Proof: When X is a discrete random variable,

E(aX + b) = Σi (axi + b) f(xi) = a Σi xi f(xi) + b Σi f(xi) = aE(X) + b.
Note that we have Σi xi f(xi) = E(X) from the definition of the mean and Σi f(xi) = 1 because f(xi) is a probability function.

If X is a continuous random variable,

E(aX + b) = ∫_{−∞}^{∞} (ax + b) f(x) dx = a ∫_{−∞}^{∞} x f(x) dx + b ∫_{−∞}^{∞} f(x) dx = aE(X) + b.

Similarly, note that we have ∫_{−∞}^{∞} x f(x) dx = E(X) from the definition of the mean and ∫_{−∞}^{∞} f(x) dx = 1 because f(x) is a probability density function.

2. Theorem: V(X) = E(X²) − µ², where µ = E(X).

Proof: V(X) is rewritten as follows:

V(X) = E((X − µ)²) = E(X² − 2µX + µ²) = E(X²) − 2µE(X) + µ² = E(X²) − µ².

The first equality is due to the definition of variance.

3. Theorem: V(aX + b) = a²V(X), where a and b are constant.

Proof:

V(aX + b) = E((aX + b − E(aX + b))²) = E((aX + b − (aµ + b))²) = E((a(X − µ))²) = a²E((X − µ)²) = a²V(X).

The first and the fifth equalities are from the definition of variance. We use E(aX + b) = aµ + b in the second equality.
Theorem: Define Z = (X − µ)/σ, where µ = E(X) and σ² = V(X). Then, E(Z) = 0 and V(Z) = 1.

Proof: E(Z) and V(Z) are obtained as:

E(Z) = E((X − µ)/σ) = (E(X) − µ)/σ = 0,
V(Z) = V((1/σ)X − µ/σ) = (1/σ²)V(X) = 1.

The transformation from X to Z is known as normalization or standardization.

Example 1.5: In Example 1.2 of flipping a coin three times (Section 1.1), we see in Section 2.1 that the probability function is written as the following binomial distribution:

P(X = x) = f(x) = (n!/(x!(n − x)!)) p^x (1 − p)^{n−x}, for x = 0, 1, 2, ···, n,

where n = 3 and p = 1/2.

When X has the binomial distribution above, we obtain E(X), V(X) and φ(θ) as follows.
First, µ = E(X) is computed as:

µ = E(X) = Σ_{x=0}^n x f(x) = Σ_{x=1}^n x f(x) = Σ_{x=1}^n x (n!/(x!(n − x)!)) p^x (1 − p)^{n−x}
  = Σ_{x=1}^n (n!/((x − 1)!(n − x)!)) p^x (1 − p)^{n−x}
  = np Σ_{x=1}^n ((n − 1)!/((x − 1)!(n − x)!)) p^{x−1} (1 − p)^{n−x}
  = np Σ_{x′=0}^{n′} (n′!/(x′!(n′ − x′)!)) p^{x′} (1 − p)^{n′−x′} = np,

where n′ = n − 1 and x′ = x − 1 are set.

Second, in order to obtain σ² = V(X), we rewrite V(X) as:

σ² = V(X) = E(X²) − µ² = E(X(X − 1)) + µ − µ².

E(X(X − 1)) is given by:

E(X(X − 1)) = Σ_{x=0}^n x(x − 1) f(x) = Σ_{x=2}^n x(x − 1) f(x)
  = Σ_{x=2}^n x(x − 1) (n!/(x!(n − x)!)) p^x (1 − p)^{n−x}
  = Σ_{x=2}^n (n!/((x − 2)!(n − x)!)) p^x (1 − p)^{n−x}
  = n(n − 1)p² Σ_{x=2}^n ((n − 2)!/((x − 2)!(n − x)!)) p^{x−2} (1 − p)^{n−x}
  = n(n − 1)p² Σ_{x′=0}^{n′} (n′!/(x′!(n′ − x′)!)) p^{x′} (1 − p)^{n′−x′}
  = n(n − 1)p²,

where n′ = n − 2 and x′ = x − 2 are set.

Therefore, σ² = V(X) is obtained as:

σ² = V(X) = E(X(X − 1)) + µ − µ² = n(n − 1)p² + np − n²p² = −np² + np = np(1 − p).
Finally, the moment-generating function φ(θ) is represented as:

φ(θ) = E(e^{θX}) = Σ_{x=0}^n e^{θx} (n!/(x!(n − x)!)) p^x (1 − p)^{n−x}
     = Σ_{x=0}^n (n!/(x!(n − x)!)) (pe^θ)^x (1 − p)^{n−x} = (pe^θ + 1 − p)^n.

In the last equality, we utilize the following formula:

(a + b)^n = Σ_{x=0}^n (n!/(x!(n − x)!)) a^x b^{n−x},

which is called the binomial theorem.

Example 1.6: As an example of continuous random variables, in Section 2.1 the uniform distribution is introduced, which is given by:

f(x) = 1 for 0 < x < 1, and 0 otherwise.

When X has the uniform distribution above, E(X), V(X) and φ(θ) are computed as follows:

µ = E(X) = ∫_{−∞}^{∞} x f(x) dx = ∫_0^1 x dx = [x²/2]_0^1 = 1/2.
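The remaining computations of Example 1.6 can be sketched symbolically; a computation along the following lines gives V(X) = 1/12 and φ(θ) = (e^θ − 1)/θ for the uniform distribution on (0, 1), consistent with the general uniform-distribution results in Section 5.

```python
import sympy as sp

x, theta = sp.symbols('x theta', positive=True)
f = sp.Integer(1)                       # density of the uniform distribution on (0, 1)

EX  = sp.integrate(x * f, (x, 0, 1))                    # mean
EX2 = sp.integrate(x**2 * f, (x, 0, 1))                 # second moment
VX  = sp.simplify(EX2 - EX**2)                          # variance
MGF = sp.integrate(sp.exp(theta * x) * f, (x, 0, 1))    # moment-generating function

print(EX)                  # 1/2
print(VX)                  # 1/12
print(sp.simplify(MGF))    # (exp(theta) - 1)/theta
```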
For the standard normal density of Example 1.4, the mean is E(X) = 0 and the variance V(X) = E(X²) = ∫ x² f(x) dx equals one; the steps of that computation use the following facts.

The first equality holds because of E(X) = 0.

In the fifth equality, use the following integration formula, called integration by parts:

∫_a^b h(x)g′(x) dx = [h(x)g(x)]_a^b − ∫_a^b h′(x)g(x) dx,

where we take h(x) = x and g(x) = −e^{−x²/2} in this case.

In the sixth equality, lim_{x→±∞} −x e^{−x²/2} = 0 is utilized.

The last equality holds because the integration of the standard normal probability density function is equal to one.

φ(θ) is derived as follows:

φ(θ) = ∫_{−∞}^{∞} e^{θx} f(x) dx = ∫_{−∞}^{∞} e^{θx} (1/√(2π)) e^{−x²/2} dx
     = ∫_{−∞}^{∞} (1/√(2π)) e^{−x²/2 + θx} dx = ∫_{−∞}^{∞} (1/√(2π)) e^{−((x−θ)² − θ²)/2} dx
     = e^{θ²/2} ∫_{−∞}^{∞} (1/√(2π)) e^{−(x−θ)²/2} dx = e^{θ²/2}.

The last equality holds because the remaining integral is that of the normal density with mean θ and variance one, which equals one.
Example 1.8: When the moment-generating function of X is given by φx(θ) = e^{θ²/2} (i.e., X has a standard normal distribution), we want to obtain the moment-generating function of Y = µ + σX.

Let φx(θ) and φy(θ) be the moment-generating functions of X and Y, respectively. Then, the moment-generating function of Y is obtained as follows:

φy(θ) = E(e^{θY}) = E(e^{θ(µ+σX)}) = e^{θµ} E(e^{θσX}) = e^{θµ} φx(θσ)
      = e^{θµ} e^{σ²θ²/2} = exp(µθ + (1/2)σ²θ²).

Example 1.8(b): When X ∼ N(µ, σ²), what is the moment-generating function of X?

φ(θ) = ∫_{−∞}^{∞} e^{θx} f(x) dx = ∫_{−∞}^{∞} e^{θx} (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)) dx
     = ∫_{−∞}^{∞} (1/√(2πσ²)) exp(θx − (x − µ)²/(2σ²)) dx
     = exp(µθ + (1/2)σ²θ²) ∫_{−∞}^{∞} (1/√(2πσ²)) exp(−(x − µ − σ²θ)²/(2σ²)) dx
     = exp(µθ + (1/2)σ²θ²).
3.2 Bivariate Random Variable

Definition: Let g(X, Y) be a function of random variables X and Y. The mathematical expectation of g(X, Y), denoted by E(g(X, Y)), is defined as:

E(g(X, Y)) = Σi Σj g(xi, yj) f(xi, yj)  (Discrete),
E(g(X, Y)) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f(x, y) dx dy  (Continuous).

The following four functional forms are important, i.e., mean, variance, covariance and the moment-generating function.

1. g(X, Y) = X:

The expectation of random variable X, i.e., E(X), is given by:

E(X) = Σi Σj xi f(xi, yj)  (Discrete),
E(X) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x f(x, y) dx dy  (Continuous),
     = µx.

The case of g(X, Y) = Y is exactly the same formulation as above, i.e., E(Y) = µy.

2. g(X, Y) = (X − µx)²:
The expectation of (X − µx)² is known as the variance of X:

V(X) = E((X − µx)²)
     = Σi Σj (xi − µx)² f(xi, yj)  (Discrete),
     = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − µx)² f(x, y) dx dy  (Continuous),
     = σx².

The variance of Y is also obtained in the same way, i.e., V(Y) = σy².

3. g(X, Y) = (X − µx)(Y − µy):

The expectation of (X − µx)(Y − µy) is known as the covariance of X and Y, which is denoted by Cov(X, Y) and written as:

Cov(X, Y) = E((X − µx)(Y − µy)).
Some Formulas of Mean and Variance: We consider two random variables X and Y.

1. Theorem: E(X + Y) = E(X) + E(Y).

Proof: For discrete random variables X and Y, it is given by:

E(X + Y) = Σi Σj (xi + yj) fxy(xi, yj) = Σi Σj xi fxy(xi, yj) + Σi Σj yj fxy(xi, yj) = E(X) + E(Y).

For continuous random variables X and Y, we can show:

E(X + Y) = ∫∫ (x + y) fxy(x, y) dx dy = ∫∫ x fxy(x, y) dx dy + ∫∫ y fxy(x, y) dx dy = E(X) + E(Y).
2. Theorem: E(XY) = E(X)E(Y), when X is independent of Y.

Proof: For discrete random variables X and Y,

E(XY) = Σi Σj xi yj fxy(xi, yj) = Σi Σj xi yj fx(xi) fy(yj) = Σi xi fx(xi) Σj yj fy(yj) = E(X)E(Y).

For continuous random variables X and Y,

E(XY) = ∫∫ xy fxy(x, y) dx dy = ∫∫ xy fx(x) fy(y) dx dy = ∫ x fx(x) dx ∫ y fy(y) dy = E(X)E(Y).

When X is independent of Y, we have fxy(x, y) = fx(x) fy(y), which is used in the second equality.

3. Theorem: Cov(X, Y) = E(XY) − E(X)E(Y).

Proof: For both discrete and continuous random variables, we can rewrite as follows:

Cov(X, Y) = E((X − µx)(Y − µy)) = E(XY − µxY − µyX + µxµy)
          = E(XY) − E(µxY) − E(µyX) + µxµy
          = E(XY) − µxE(Y) − µyE(X) + µxµy
          = E(XY) − µxµy − µyµx + µxµy = E(XY) − µxµy = E(XY) − E(X)E(Y).

In the fourth equality, the theorem in Section 3.1 is used, i.e., E(µxY) = µxE(Y) and E(µyX) = µyE(X).
4. Theorem: Cov(X, Y) = 0, when X is independent of Y.

Proof: From the above two theorems, we have E(XY) = E(X)E(Y) when X is independent of Y and Cov(X, Y) = E(XY) − E(X)E(Y). Therefore, Cov(X, Y) = 0 is obtained when X is independent of Y.

5. Definition: The correlation coefficient between X and Y, denoted by ρxy, is defined as:

ρxy = Cov(X, Y)/(√V(X) √V(Y)) = Cov(X, Y)/(σx σy).

ρxy > 0 =⇒ positive correlation between X and Y,
ρxy → 1 =⇒ strong positive correlation.
6. Theorem: ρxy = 0, when X is independent of Y.

Proof: When X is independent of Y, we have Cov(X, Y) = 0. We obtain the result ρxy = Cov(X, Y)/(√V(X) √V(Y)) = 0.

However, note that ρxy = 0 does not mean independence between X and Y.

7. Theorem: V(X ± Y) = V(X) ± 2Cov(X, Y) + V(Y).

Proof: For both discrete and continuous random variables, V(X ± Y) is rewritten as follows:

V(X ± Y) = E(((X ± Y) − E(X ± Y))²) = E(((X − µx) ± (Y − µy))²)
         = E((X − µx)² ± 2(X − µx)(Y − µy) + (Y − µy)²) = V(X) ± 2Cov(X, Y) + V(Y).
8. Theorem: −1 ≤ ρxy ≤ 1.

Proof: Consider f(t) = V(Xt − Y) ≥ 0 for any t. Then,

f(t) = V(Xt − Y) = V(Xt) − 2Cov(Xt, Y) + V(Y) = t²V(X) − 2tCov(X, Y) + V(Y)
     = V(X)(t − Cov(X, Y)/V(X))² + V(Y) − (Cov(X, Y))²/V(X).

In order to have f(t) ≥ 0 for all t, we need the following condition:

V(Y) − (Cov(X, Y))²/V(X) ≥ 0,

because the first term in the last equality is nonnegative, which implies:

(Cov(X, Y))²/(V(X)V(Y)) ≤ 1.

Therefore, we have:

−1 ≤ Cov(X, Y)/(√V(X) √V(Y)) ≤ 1.

From the definition of the correlation coefficient, i.e., ρxy = Cov(X, Y)/(√V(X) √V(Y)), we obtain the result: −1 ≤ ρxy ≤ 1.
9. Theorem: V(X ± Y) = V(X) + V(Y), when X is independent of Y.

Proof: From the theorem above, V(X ± Y) = V(X) ± 2Cov(X, Y) + V(Y) generally holds. When random variables X and Y are independent, we have Cov(X, Y) = 0. Therefore, V(X ± Y) = V(X) + V(Y) holds, when X is independent of Y.

10. Theorem:

V(Σi ai Xi) = Σi Σj ai aj Cov(Xi, Xj),

where E(Xi) = µi and ai is a constant value. Especially, when X1, X2, ···, Xn are mutually independent, we have the following:

V(Σi ai Xi) = Σi ai² V(Xi).
For the mean, E(Σi ai Xi) = Σi ai µi; the first and second equalities in its derivation come from the previous theorems on the mean.

For the variance of Σi ai Xi, we can rewrite as follows:

V(Σi ai Xi) = E((Σi ai (Xi − µi))²) = E(Σi ai (Xi − µi) Σj aj (Xj − µj))
            = E(Σi Σj ai aj (Xi − µi)(Xj − µj)) = Σi Σj ai aj E((Xi − µi)(Xj − µj))
            = Σi Σj ai aj Cov(Xi, Xj).
When X1, X2, ···, Xn are mutually independent, Cov(Xi, Xj) = 0 for all i ≠ j from the previous theorem. Therefore, we obtain:

V(Σi ai Xi) = Σi ai² V(Xi).

Note that Cov(Xi, Xi) = E((Xi − µi)²) = V(Xi).

11. Theorem: n random variables X1, X2, ···, Xn are mutually independently and identically distributed with mean µ and variance σ². That is, for all i = 1, 2, ···, n, E(Xi) = µ and V(Xi) = σ² are assumed. Consider the arithmetic average X̄ = (1/n) Σ_{i=1}^n Xi. Then, the mean and variance of X̄ are given by:

E(X̄) = µ,  V(X̄) = σ²/n.
Proof: The mathematical expectation of X̄ is given by:

E(X̄) = E((1/n) Σ_{i=1}^n Xi) = (1/n) E(Σ_{i=1}^n Xi) = (1/n) Σ_{i=1}^n E(Xi) = (1/n) Σ_{i=1}^n µ = (1/n) nµ = µ.

E(aX) = aE(X) in the second equality and E(X + Y) = E(X) + E(Y) in the third equality are utilized, where X and Y are random variables and a is a constant value.

The variance of X̄ is computed as follows:

V(X̄) = V((1/n) Σ_{i=1}^n Xi) = (1/n²) V(Σ_{i=1}^n Xi) = (1/n²) Σ_{i=1}^n V(Xi) = (1/n²) Σ_{i=1}^n σ² = (1/n²) nσ² = σ²/n.

We use V(aX) = a²V(X) in the second equality and V(X + Y) = V(X) + V(Y) for X independent of Y in the third equality, where X and Y denote random variables and a is a constant value.
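A quick simulation illustrates Theorem 11: with i.i.d. draws of mean µ and variance σ², the sample average has mean µ and variance σ²/n. The parameter values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, replications = 2.0, 3.0, 25, 100_000

# Each row is one sample of size n; each row mean is one realization of X-bar.
samples = rng.normal(mu, sigma, size=(replications, n))
xbar = samples.mean(axis=1)

print(xbar.mean())    # close to mu = 2.0
print(xbar.var())     # close to sigma^2 / n = 9/25 = 0.36
```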
4 Transformation of Variables

4.1 Univariate Case

Transformation of variables is used in the case of continuous random variables. Based on the distribution of a random variable, the distribution of the transformed random variable is derived. In other words, when the distribution of X is known, we can derive the distribution of a function of X.

Distribution of Y = ψ^{−1}(X): Let fx(x) be the probability density function of the continuous random variable X and X = ψ(Y) be a one-to-one transformation. Then, the probability density function of Y, i.e., fy(y), is obtained as follows.

When X = ψ(Y), we want to obtain the probability density function of Y. Let fy(y) and Fy(y) be the probability density function and the distribution function of Y, respectively.

In the case of ψ′(X) > 0, the distribution function of Y, Fy(y), is rewritten as follows:

Fy(y) = P(Y ≤ y) = P(ψ(Y) ≤ ψ(y)) = P(X ≤ ψ(y)) = Fx(ψ(y)).

The first equality is the definition of the cumulative distribution function. The second equality holds because of ψ′(Y) > 0. Therefore, differentiating Fy(y) with respect to y, we can obtain the following expression:

fy(y) = Fy′(y) = ψ′(y) Fx′(ψ(y)) = ψ′(y) fx(ψ(y)).  (4)

Next, in the case of ψ′(X) < 0, the distribution function of Y, Fy(y), is rewritten as follows:

Fy(y) = P(Y ≤ y) = P(ψ(Y) ≥ ψ(y)) = P(X ≥ ψ(y)) = 1 − P(X < ψ(y)) = 1 − Fx(ψ(y)).

Thus, in the case of ψ′(X) < 0, pay attention to the second equality, where the inequality is reversed. Differentiating with respect to y, we obtain:

fy(y) = −ψ′(y) fx(ψ(y)).  (5)

Note that −ψ′(y) > 0.

Thus, summarizing the above two cases, i.e., ψ′(X) > 0 and ψ′(X) < 0, equations (4) and (5) indicate the following result:

fy(y) = |ψ′(y)| fx(ψ(y)),

which is called the transformation of variables.
Example 1.9: When X ∼ N(0, 1), we derive the probability density function of Y = µ + σX.

Since we have:

X = ψ(Y) = (Y − µ)/σ,

ψ′(y) = 1/σ is obtained. Therefore, fy(y) is given by:

fy(y) = |ψ′(y)| fx(ψ(y)) = (1/(σ√(2π))) exp(−(y − µ)²/(2σ²)),

which indicates the normal distribution with mean µ and variance σ², denoted by N(µ, σ²).

On the Distribution of Y = X²: As an example, when we know the distribution function of X as Fx(x), we want to obtain the distribution function of Y, Fy(y), where Y = X². Using Fx(x), Fy(y) is rewritten as follows:

Fy(y) = P(Y ≤ y) = P(X² ≤ y) = P(−√y ≤ X ≤ √y) = Fx(√y) − Fx(−√y).

The probability density function of Y is obtained as follows:

fy(y) = Fy′(y) = (1/(2√y)) (fx(√y) + fx(−√y)).
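As a numerical check of Example 1.9, a histogram of simulated values of Y = µ + σX with X ∼ N(0, 1) can be compared against the density fy(y) = |ψ′(y)| fx(ψ(y)); the sketch below compares the two at a few arbitrary points.

```python
import numpy as np
from scipy import stats

mu, sigma = 1.0, 2.0
rng = np.random.default_rng(1)
y = mu + sigma * rng.standard_normal(500_000)      # simulated Y = mu + sigma * X

def fy(v):
    # transformation of variables: psi(y) = (y - mu)/sigma, |psi'(y)| = 1/sigma
    return stats.norm.pdf((v - mu) / sigma) / sigma

grid = np.array([-3.0, 0.0, 1.0, 4.0])
hist, edges = np.histogram(y, bins=200, range=(-9, 11), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
empirical = np.interp(grid, centers, hist)          # empirical density near each grid point

print(np.round(fy(grid), 4))     # density implied by the transformation formula
print(np.round(empirical, 4))    # simulated density: close to the line above
```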
Example: χ²(1) Distribution: Define Y = X², where X ∼ N(0, 1). Then, Y ∼ χ²(1).

Proof:

fy(y) = (1/(2√y)) (fx(√y) + fx(−√y)) = (1/√(2π)) y^{−1/2} exp(−y/2),

which is the χ²(1) distribution, where

fx(x) = (1/√(2π)) exp(−x²/2),  fx(√y) = fx(−√y) = (1/√(2π)) exp(−y/2).

Note that the χ²(n) distribution is:

fx(x) = (1/(2^{n/2} Γ(n/2))) x^{n/2−1} exp(−x/2), x > 0,

where Γ(1/2) = √π.
4.2 Multivariate Cases

Bivariate Case: Let fxy(x, y) be the joint probability density function of X and Y. Let X = ψ1(U, V) and Y = ψ2(U, V) be a one-to-one transformation from (X, Y) to (U, V). Then, we obtain the joint probability density function of U and V, denoted by fuv(u, v), as follows:

fuv(u, v) = |J| fxy(ψ1(u, v), ψ2(u, v)),

where J is called the Jacobian of the transformation, which is defined as:

J = | ∂x/∂u  ∂x/∂v |
    | ∂y/∂u  ∂y/∂v |.

Multivariate Case: Let fx(x1, x2, ···, xn) be the joint probability density function of X1, X2, ···, Xn. Suppose that a one-to-one transformation from (X1, X2, ···, Xn) to (Y1, Y2, ···, Yn) is given by:

X1 = ψ1(Y1, Y2, ···, Yn),
X2 = ψ2(Y1, Y2, ···, Yn),
···
Xn = ψn(Y1, Y2, ···, Yn).

Then, we obtain the joint probability density function of Y1, Y2, ···, Yn, denoted by fy(y1, y2, ···, yn), as follows:

fy(y1, y2, ···, yn) = |J| fx(ψ1(y1, ···, yn), ψ2(y1, ···, yn), ···, ψn(y1, ···, yn)),
where J is called the Jacobian of the transformation, which is defined as the determinant:

J = | ∂x1/∂y1  ∂x1/∂y2  ···  ∂x1/∂yn |
    | ∂x2/∂y1  ∂x2/∂y2  ···  ∂x2/∂yn |
    |   ···      ···     ···    ···    |
    | ∂xn/∂y1  ∂xn/∂y2  ···  ∂xn/∂yn |.

Example: Normal Distribution: X ∼ N(µ1, σ1²) and Y ∼ N(µ2, σ2²), where X is independent of Y. Then, X + Y ∼ N(µ1 + µ2, σ1² + σ2²).

Proof:

The density functions of X and Y are:

fx(x) = (1/√(2πσ1²)) exp(−(x − µ1)²/(2σ1²)),
fy(y) = (1/√(2πσ2²)) exp(−(y − µ2)²/(2σ2²)).
The joint density of X and Y is:

fxy(x, y) = fx(x) fy(y) = (1/(2πσ1σ2)) exp(−(x − µ1)²/(2σ1²) − (y − µ2)²/(2σ2²)).

Define U = X + Y and V = Y. We obtain the joint distribution of U and V. Using X = U − V and Y = V, the Jacobian is:

J = | ∂x/∂u  ∂x/∂v |   | 1  −1 |
    | ∂y/∂u  ∂y/∂v | = | 0   1 | = 1.

The joint density function of U and V, fuv(u, v), is given by:

fuv(u, v) = |J| fxy(u − v, v).

The marginal density function of U is:

fu(u) = ∫ fuv(u, v) dv
      = ∫ (1/(2πσ1σ2)) exp(−(u − v − µ1)²/(2σ1²) − (v − µ2)²/(2σ2²)) dv
      = ∫ (1/(2πσ1σ2)) exp(−(1/(2σ1²))((v − µ2) − (u − µ1 − µ2))² − (1/(2σ2²))(v − µ2)²) dv
      = ∫ (1/(2πσ1σ2)) exp(−(1/(2/(1/σ1² + 1/σ2²)))((v − µ2) − (σ2²/(σ1² + σ2²))(u − µ1 − µ2))²
                            − (1/(2(σ1² + σ2²)))(u − µ1 − µ2)²) dv
      = ∫ (1/√(2π/(1/σ1² + 1/σ2²))) exp(−(1/(2/(1/σ1² + 1/σ2²)))((v − µ2) − (σ2²/(σ1² + σ2²))(u − µ1 − µ2))²) dv
        × (1/√(2π(σ1² + σ2²))) exp(−(1/(2(σ1² + σ2²)))(u − µ1 − µ2)²)
      = (1/√(2π(σ1² + σ2²))) exp(−(1/(2(σ1² + σ2²)))(u − µ1 − µ2)²),

because the remaining integral is that of a normal density in v and equals one. This is the N(µ1 + µ2, σ1² + σ2²) density, so X + Y ∼ N(µ1 + µ2, σ1² + σ2²).
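A simulation confirms the result: the sum of independent N(µ1, σ1²) and N(µ2, σ2²) draws behaves like N(µ1 + µ2, σ1² + σ2²); the parameter values below are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
mu1, s1, mu2, s2 = 1.0, 2.0, -0.5, 1.5

x = rng.normal(mu1, s1, size=1_000_000)
y = rng.normal(mu2, s2, size=1_000_000)
u = x + y

print(u.mean(), u.var())   # close to 0.5 and 2^2 + 1.5^2 = 6.25
# KS statistic near zero: consistent with N(mu1 + mu2, s1^2 + s2^2)
print(stats.kstest(u, "norm", args=(mu1 + mu2, np.hypot(s1, s2))))
```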
Example: χ² Distribution: X ∼ χ²(n) and Y ∼ χ²(m), where X is independent of Y. Then, X + Y ∼ χ²(n + m).

Proof:

The density functions of X and Y are:

fx(x) = (1/(2^{n/2} Γ(n/2))) x^{n/2−1} exp(−x/2), x > 0,
fy(y) = (1/(2^{m/2} Γ(m/2))) y^{m/2−1} exp(−y/2), y > 0.

The joint density function of X and Y is:

fxy(x, y) = fx(x) fy(y) = (1/(2^{n/2} Γ(n/2))) x^{n/2−1} exp(−x/2) (1/(2^{m/2} Γ(m/2))) y^{m/2−1} exp(−y/2)
          = C x^{n/2−1} y^{m/2−1} exp(−(x + y)/2),

where C = 1/(2^{(n+m)/2} Γ(n/2) Γ(m/2)).

Define U = X + Y and V = Y. From X = U − V and Y = V, the Jacobian is:

J = | ∂x/∂u  ∂x/∂v |   | 1  −1 |
    | ∂y/∂u  ∂y/∂v | = | 0   1 | = 1.

The joint density function of U and V, fuv(u, v), is given by:

fuv(u, v) = |J| fxy(u − v, v) = C (u − v)^{n/2−1} v^{m/2−1} exp(−u/2).

The marginal density function of U is:

fu(u) = ∫ fuv(u, v) dv = C exp(−u/2) ∫_0^u (u − v)^{n/2−1} v^{m/2−1} dv
      = C exp(−u/2) ∫_0^1 (u − uw)^{n/2−1} (uw)^{m/2−1} u dw,

where the substitution v = uw is used.
      = C u^{(n+m)/2−1} exp(−u/2) ∫_0^1 (1 − w)^{n/2−1} w^{m/2−1} dw
      = C B(n/2, m/2) u^{(n+m)/2−1} exp(−u/2)
      = (1/(2^{(n+m)/2} Γ(n/2) Γ(m/2))) (Γ(n/2) Γ(m/2)/Γ((n+m)/2)) u^{(n+m)/2−1} exp(−u/2)
      = (1/(2^{(n+m)/2} Γ((n+m)/2))) u^{(n+m)/2−1} exp(−u/2),

which is the χ²(n + m) density. The beta function B(n, m) is:

B(n, m) = ∫_0^1 (1 − x)^{n−1} x^{m−1} dx = Γ(n)Γ(m)/Γ(n + m).

Example: t Distribution: X ∼ N(0, 1) and Y ∼ χ²(n), where X is independent of Y. Then,

U = X/√(Y/n) ∼ t(n).

Note that the density function of t(n) is:

fu(x) = (Γ((n+1)/2)/(Γ(n/2)√(nπ))) (1 + x²/n)^{−(n+1)/2}.
Proof: The density functions of X and Y are:

fx(x) = (1/√(2π)) exp(−x²/2), −∞ < x < ∞,
fy(y) = (1/(2^{n/2} Γ(n/2))) y^{n/2−1} exp(−y/2), y > 0.

The joint density function of X and Y, fxy(x, y), is:

fxy(x, y) = fx(x) fy(y) = (1/√(2π)) exp(−x²/2) (1/(2^{n/2} Γ(n/2))) y^{n/2−1} exp(−y/2)
          = (1/√(2π)) (1/(2^{n/2} Γ(n/2))) y^{n/2−1} exp(−y/2 − x²/2).

Define U = X/√(Y/n) and V = Y. From X = U√(V/n) and Y = V, the Jacobian is:

J = | ∂x/∂u  ∂x/∂v |   | √(v/n)  u/(2√(nv)) |
    | ∂y/∂u  ∂y/∂v | = |   0          1      | = √(v/n).

The joint density function of U and V, fuv(u, v), is:

fuv(u, v) = |J| fxy(u√(v/n), v)
          = √(v/n) (1/√(2π)) exp(−u²v/(2n)) (1/(2^{n/2} Γ(n/2))) v^{n/2−1} exp(−v/2)
          = C v^{(n−1)/2} exp(−(v/2)(1 + u²/n)),
where C = (1/(√n √π)) (1/(2^{(n+1)/2} Γ(n/2))).

The marginal density function of U is:

fu(u) = ∫ fuv(u, v) dv = C ∫ v^{(n−1)/2} exp(−(v/2)(1 + u²/n)) dv
      = C ∫ (w(1 + u²/n)^{−1})^{(n−1)/2} exp(−w/2) (1 + u²/n)^{−1} dw
      = C (1 + u²/n)^{−(n+1)/2} ∫ w^{(n+1)/2−1} exp(−w/2) dw
      = C (1 + u²/n)^{−(n+1)/2} 2^{(n+1)/2} Γ((n+1)/2) ∫ (1/(2^{(n+1)/2} Γ((n+1)/2))) w^{(n+1)/2−1} exp(−w/2) dw
      = C (1 + u²/n)^{−(n+1)/2} 2^{(n+1)/2} Γ((n+1)/2)
      = (1/(√π √n)) (1/(2^{(n+1)/2} Γ(n/2))) 2^{(n+1)/2} Γ((n+1)/2) (1 + u²/n)^{−(n+1)/2}
      = (Γ((n+1)/2)/(Γ(n/2)√(nπ))) (1 + u²/n)^{−(n+1)/2},

where the second equality uses the substitution w = v(1 + u²/n), and the χ²(n + 1) density integrates to one.

Example: Ratio of Two Standard Normals (Cauchy Distribution): X ∼ N(0, 1) and Y ∼ N(0, 1), where X is independent of Y. Then, U = X/Y has the Cauchy density:

f(u) = 1/(π(1 + u²)).
Proof: The density functions of X and Y are:

fx(x) = (1/√(2π)) exp(−x²/2), −∞ < x < ∞,
fy(y) = (1/√(2π)) exp(−y²/2), −∞ < y < ∞.

The joint density function of X and Y is:

fxy(x, y) = fx(x) fy(y) = (1/√(2π)) exp(−x²/2) (1/√(2π)) exp(−y²/2) = (1/(2π)) exp(−(x² + y²)/2).

Transformation of variables by u = x/y and v = y. From x = uv and y = v, the Jacobian is:

J = | ∂x/∂u  ∂x/∂v |   | v  u |
    | ∂y/∂u  ∂y/∂v | = | 0  1 | = v.

The joint density function of U and V, fuv(u, v), is:

fuv(u, v) = |J| fxy(uv, v) = |v| (1/(2π)) exp(−v²(1 + u²)/2).
The marginal density function of U is:

fu(u) = ∫ fuv(u, v) dv = (1/(2π)) ∫_{−∞}^{∞} |v| exp(−v²(1 + u²)/2) dv
      = (1/π) ∫_0^{∞} v exp(−v²(1 + u²)/2) dv
      = (1/π) [−(1/(1 + u²)) exp(−v²(1 + u²)/2)]_{v=0}^{∞}
      = 1/(π(1 + u²)).

5 Moment-Generating Function

5.1 Univariate Case

As discussed in Section 3.1, the moment-generating function is defined as φ(θ) = E(e^{θX}).

For a random variable X, µ′n ≡ E(X^n) is called the nth moment of X.
2. Remark: The moment-generating function is a weighted sum of all the moments. With f(θ) = e^{θX}, the Taylor series expansion f(θ) = Σ_{n=0}^{∞} (1/n!) f^{(n)}(0) θ^n and f^{(n)}(θ) = X^n e^{θX} give:

φ(θ) = E(e^{θX}) = E(Σ_{n=0}^{∞} (1/n!) f^{(n)}(0) θ^n) = E(Σ_{n=0}^{∞} (1/n!) X^n θ^n) = Σ_{n=0}^{∞} (1/n!) E(X^n) θ^n.

3. Remark: φ(θ) does not exist if E(X^n) for some n does not exist.

φ(θ) is finite ⇐⇒ all the moments exist.

4. Remark: Let X and Y be two random variables. Suppose that both moment-generating functions exist. When the moment-generating function of X is equivalent to that of Y, we have the fact that X has the same distribution as Y.

5. Theorem: Let φ(θ) be the moment-generating function of X. Then, the moment-generating function of Y, where Y = aX + b, is given by e^{bθ} φ(aθ).
8. Theorem: When X1, X2, ···, Xn are mutually independently and identically distributed and the moment-generating function of Xi is given by φ(θ) for all i, the moment-generating function of X̄ is represented by (φ(θ/n))^n, where X̄ = (1/n) Σ_{i=1}^n Xi.

Proof:

Let φx̄(θ) be the moment-generating function of X̄.

φx̄(θ) = E(e^{θX̄}) = E(e^{(θ/n) Σ_{i=1}^n Xi}) = Π_{i=1}^n E(e^{(θ/n) Xi}) = Π_{i=1}^n φ(θ/n) = (φ(θ/n))^n.
Bernoulli Distribution: The probability function of the Bernoulli random variable X is:

f(x) = p^x (1 − p)^{1−x}, x = 0, 1.

The moment-generating function of X is:

φ(θ) = pe^θ + 1 − p.

Mean: E(X) = φ′(0) = p.
Variance: V(X) = E(X²) − (E(X))² = φ′′(0) − p² = p(1 − p).

Binomial Distribution: For the binomial random variable, the moment-generating function φ(θ) is known as:

φ(θ) = (pe^θ + 1 − p)^n,

which is discussed in Example 1.5 (Section 3.1). Using the moment-generating function, we check whether E(X) = np and V(X) = np(1 − p) are obtained when X is a binomial random variable.

The first- and the second-derivatives with respect to θ are given by:

φ′(θ) = npe^θ (pe^θ + 1 − p)^{n−1},
φ′′(θ) = npe^θ (pe^θ + 1 − p)^{n−1} + n(n − 1)p² e^{2θ} (pe^θ + 1 − p)^{n−2}.

Evaluating at θ = 0, we have:

E(X) = φ′(0) = np,  E(X²) = φ′′(0) = np + n(n − 1)p².

Therefore, V(X) = E(X²) − (E(X))² = np(1 − p) can be derived. Thus, we can make sure that E(X) and V(X) are obtained from φ(θ).

Poisson Distribution: The probability function of the Poisson random variable X is:

f(x) = e^{−λ} λ^x/x!, x = 0, 1, 2, ···.

The moment-generating function of X is:

φ(θ) = Σ_{x=0}^{∞} e^{θx} e^{−λ} λ^x/x! = e^{−λ} e^{e^θ λ} Σ_{x=0}^{∞} e^{−e^θ λ} (e^θ λ)^x/x! = exp(λ(e^θ − 1)).
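Differentiating these moment-generating functions at θ = 0 can also be done symbolically; the following sketch recovers the mean and variance of the binomial and Poisson distributions from φ(θ).

```python
import sympy as sp

theta, p, lam = sp.symbols('theta p lambda', positive=True)
n = sp.Symbol('n', positive=True, integer=True)

def mean_and_variance(phi):
    # E(X) = phi'(0), E(X^2) = phi''(0), V(X) = E(X^2) - E(X)^2
    m1 = sp.diff(phi, theta).subs(theta, 0)
    m2 = sp.diff(phi, theta, 2).subs(theta, 0)
    return sp.simplify(m1), sp.simplify(sp.expand(m2 - m1**2))

phi_binom = (p * sp.exp(theta) + 1 - p)**n
phi_pois  = sp.exp(lam * (sp.exp(theta) - 1))

print(mean_and_variance(phi_binom))   # mean n*p, variance n*p*(1-p) (possibly in expanded form)
print(mean_and_variance(phi_pois))    # (lambda, lambda)
```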
Normal Distribution: When X ∼ N(µ, σ²), the moment-generating function of X is given by φ(θ) = exp(µθ + (1/2)σ²θ²) from the previous example. Obtain E(X) and V(X), using φ(θ).

• E(X) = φ′(0) = µ, from φ′(θ) = (µ + σ²θ) exp(µθ + (1/2)σ²θ²).
• E(X²) = φ′′(0) = σ² + µ², from φ′′(θ) = σ² exp(µθ + (1/2)σ²θ²) + (µ + σ²θ)² exp(µθ + (1/2)σ²θ²).
• V(X) = E(X²) − (E(X))² = (σ² + µ²) − µ² = σ².

Cauchy Distribution: The Cauchy distribution is f(x) = 1/(π(1 + x²)) for −∞ < x < ∞. Its mean does not exist, because

∫_0^{∞} x f(x) dx = ∫_0^{∞} x/(π(1 + x²)) dx = (1/(2π)) [log(1 + x²)]_0^{∞} = ∞.

=⇒ φ(θ) does not exist.

For the t(k) distribution, E(X^k) does not exist.
Uniform Distribution: The density function is:

f(x) = 1/(b − a), a < x < b.

The moment-generating function is:

φ(θ) = ∫_{−∞}^{∞} e^{θx} f(x) dx = ∫_a^b e^{θx}/(b − a) dx = [e^{θx}/(θ(b − a))]_a^b = (e^{θb} − e^{θa})/(θ(b − a)).

φ′(θ) = (θ(be^{θb} − ae^{θa}) − (e^{θb} − e^{θa}))/(θ²(b − a)).

Mean:

E(X) = φ′(0) = (a + b)/2  (use L'Hospital's rule).

(*) With f(θ) = θ(be^{θb} − ae^{θa}) − (e^{θb} − e^{θa}) and g(θ) = θ²(b − a),

f′(θ) = θ(b²e^{θb} − a²e^{θa}),  g′(θ) = 2θ(b − a),

lim_{θ→0} f(θ)/g(θ) = lim_{θ→0} f′(θ)/g′(θ) = lim_{θ→0} θ(b²e^{θb} − a²e^{θa})/(2θ(b − a)) = (a + b)/2.
0
f (θ) f (θ)
φ00 (0) = lim = lim f (x) = λe−λx , 0<x
θ→0 g(θ) θ→0 g0 (θ)
θ2 (b3 eθb − a3 eθa ) b2 + ba + a2
= lim = The moment-generating function is:
θ→0 3θ2 (b − a) 3
Z ∞ Z ∞
φ(θ) = eθx f (x)dx = eθx λe−λx dx
V(X) = E(X 2 ) − (E(X))2 −∞
Z ∞ 0
λ λ
= φ00 (0) − (φ0 (0))2 ←− L’Hospital’s rule = (λ − θ)e−(λ−θ)x dx =
λ−θ 0 λ−θ
b2 + ba + a2 a + b 2 (b − a)2
= − = Use the exponential distribution with parameter λ − θ in the integration.
3 2 12
199 200
201 202
χ² Distribution: For X ∼ χ²(n), evaluating the integral of the rescaled χ²(n) density gives the moment-generating function

φ(θ) = (1/(1 − 2θ))^{n/2},

so that

Mean: E(X) = φ′(0) = n,
Variance: V(X) = E(X²) − (E(X))² = n(n + 2) − n² = 2n.
Sum of Bernoulli Random Variables: X1, X2, ···, Xn are mutually independently and identically distributed as Bernoulli random variables with parameter p. Then, the probability function of Y = X1 + X2 + ··· + Xn is B(n, p).

Proof: The moment-generating function of Xi, φi(θ), is:

φi(θ) = pe^θ + 1 − p.

The moment-generating function of Y, φy(θ), is:

φy(θ) = E(e^{θY}) = E(e^{θ(X1+X2+···+Xn)}) = E(e^{θX1}) E(e^{θX2}) ··· E(e^{θXn}) = φ1(θ) φ2(θ) ··· φn(θ)
      = (φ(θ))^n = (pe^θ + 1 − p)^n,

which is the moment-generating function of B(n, p).

Note: In the third equality, the mutual independence of X1, X2, ···, Xn is used. In the fifth equality, the fact that X1, X2, ···, Xn are identically distributed is used.
Sum of Two Normal Random Variables: X ∼ N(µ1, σ1²) and Y ∼ N(µ2, σ2²), where X is independent of Y. Then, aX + bY ∼ N(aµ1 + bµ2, a²σ1² + b²σ2²), where a and b are constant.

Proof: Suppose that the moment-generating functions of X and Y are given by φx(θ) and φy(θ):

φx(θ) = exp(µ1θ + (1/2)σ1²θ²),  φy(θ) = exp(µ2θ + (1/2)σ2²θ²).

The moment-generating function of W = aX + bY is:

φw(θ) = E(e^{θW}) = E(e^{θ(aX+bY)}) = E(e^{aθX}) E(e^{bθY}) = φx(aθ) φy(bθ)
      = exp(µ1(aθ) + (1/2)σ1²(aθ)²) × exp(µ2(bθ) + (1/2)σ2²(bθ)²)
      = exp((aµ1 + bµ2)θ + (1/2)(a²σ1² + b²σ2²)θ²),

which is the moment-generating function of the normal distribution with mean aµ1 + bµ2 and variance a²σ1² + b²σ2². Therefore, aX + bY ∼ N(aµ1 + bµ2, a²σ1² + b²σ2²).
Sum of Two χ² Random Variables: X ∼ χ²(n) and Y ∼ χ²(m), where X is independent of Y. Then, Z = X + Y ∼ χ²(n + m).

Proof:

Let φx(θ) and φy(θ) be the moment-generating functions of X and Y, which are given by:

φx(θ) = (1/(1 − 2θ))^{n/2},  φy(θ) = (1/(1 − 2θ))^{m/2}.

The moment-generating function of Z = X + Y is:

φz(θ) ≡ E(e^{θZ}) = E(e^{θ(X+Y)}) = E(e^{θX}) E(e^{θY}) = φx(θ) φy(θ) = (1/(1 − 2θ))^{n/2} (1/(1 − 2θ))^{m/2} = (1/(1 − 2θ))^{(n+m)/2},

which is the moment-generating function of the χ²(n + m) distribution. Therefore, Z ∼ χ²(n + m).

Note: In the third equality, the independence of X and Y is used.
5.2 Multivariate Cases

Bivariate Case: As discussed in Section 3.2, for two random variables X and Y, the moment-generating function is defined as φ(θ1, θ2) = E(e^{θ1X+θ2Y}). Some useful and important theorems and remarks are shown as follows.

1. Theorem: Consider two random variables X and Y. Let φ(θ1, θ2) be the moment-generating function of X and Y. Then, we have the following result:

∂^{j+k}φ(0, 0)/(∂θ1^j ∂θ2^k) = E(X^j Y^k).

Proof:

Let fxy(x, y) be the probability density function of X and Y. From the definition, φ(θ1, θ2) is written as:

φ(θ1, θ2) = E(e^{θ1X+θ2Y}) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{θ1x+θ2y} fxy(x, y) dx dy.
Taking the jth derivative of φ(θ1, θ2) with respect to θ1 and at the same time the kth derivative with respect to θ2, we have the following expression:

∂^{j+k}φ(θ1, θ2)/(∂θ1^j ∂θ2^k) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x^j y^k e^{θ1x+θ2y} fxy(x, y) dx dy.

Evaluating the above equation at (θ1, θ2) = (0, 0), we can easily obtain:

∂^{j+k}φ(0, 0)/(∂θ1^j ∂θ2^k) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x^j y^k fxy(x, y) dx dy ≡ E(X^j Y^k).

2. Remark: Let (Xi, Yi) be a pair of random variables. Suppose that the moment-generating function of (X1, Y1) is equivalent to that of (X2, Y2). Then, (X1, Y1) has the same distribution function as (X2, Y2).

3. Theorem: The moment-generating function of X is given by φ1(θ1) and that of Y is φ2(θ2). Then, we have the following facts:

φ(θ1, 0) = φ1(θ1),  φ(0, θ2) = φ2(θ2).
Proof: Again, the definition of the moment-generating function of X and Y is represented as:

φ(θ1, θ2) = E(e^{θ1X+θ2Y}) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{θ1x+θ2y} fxy(x, y) dx dy.

Setting θ2 = 0,

φ(θ1, 0) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{θ1x} fxy(x, y) dx dy = ∫_{−∞}^{∞} e^{θ1x} fx(x) dx = E(e^{θ1X}) = φ1(θ1).

Thus, we obtain the result φ(θ1, 0) = φ1(θ1). Similarly, φ(0, θ2) = φ2(θ2) can be derived.

4. Theorem: The moment-generating function of (X, Y) is given by φ(θ1, θ2). Let φ1(θ1) and φ2(θ2) be the moment-generating functions of X and Y, respectively. If X is independent of Y, we have:

φ(θ1, θ2) = φ1(θ1) φ2(θ2).

Proof: From the definition of φ(θ1, θ2), when X is independent of Y we have fxy(x, y) = fx(x) fy(y), so the double integral factors into ∫ e^{θ1x} fx(x) dx ∫ e^{θ2y} fy(y) dy = φ1(θ1) φ2(θ2).

Multivariate Case: For the multivariate random variables X1, X2, ···, Xn, the moment-generating function is defined as:

φ(θ1, θ2, ···, θn) = E(e^{θ1X1+θ2X2+···+θnXn}).
2. Theorem: Suppose that the multivariate random variables X1, X2, ···, Xn are mutually independently and identically distributed, with Xi ∼ N(µ, σ²). Let us define µ̂ = Σ_{i=1}^n ai Xi, where ai, i = 1, 2, ···, n, are assumed to be known. Then, µ̂ ∼ N(µ Σ_{i=1}^n ai, σ² Σ_{i=1}^n ai²).

Proof:

From Examples 1.8 and 1.9, it is shown that the moment-generating function of X is given by φx(θ) = exp(µθ + (1/2)σ²θ²) when X is normally distributed as X ∼ N(µ, σ²).

φµ̂(θ) = E(e^{θµ̂}) = E(e^{θ Σ ai Xi}) = Π_{i=1}^n E(e^{θai Xi}) = Π_{i=1}^n φx(ai θ)
       = Π_{i=1}^n exp(µai θ + (1/2)σ² ai² θ²) = exp(µ Σ_{i=1}^n ai θ + (1/2)σ² Σ_{i=1}^n ai² θ²),

which is the moment-generating function of N(µ Σ ai, σ² Σ ai²).

When ai = 1/n is taken for all i = 1, 2, ···, n, i.e., when µ̂ = X̄ is taken, µ̂ = X̄ is normally distributed as X̄ ∼ N(µ, σ²/n).
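As a simulation check of this theorem, a linear combination Σ ai Xi of i.i.d. N(µ, σ²) draws should have mean µ Σ ai and variance σ² Σ ai²; the coefficients below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 1.0, 2.0
a = np.array([0.2, 0.5, -0.3, 1.0])                   # known coefficients a_i

X = rng.normal(mu, sigma, size=(500_000, a.size))     # rows: replications of (X_1, ..., X_n)
mu_hat = X @ a                                        # realizations of sum_i a_i X_i

print(mu_hat.mean(), mu * a.sum())                    # both close to 1.4
print(mu_hat.var(), sigma**2 * (a**2).sum())          # both close to 4 * 1.38 = 5.52
```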
6 Law of Large Numbers and Central Limit Theorem

6.1 Chebyshev's Inequality

Theorem: Let g(X) be a nonnegative function of the random variable X, i.e., g(X) ≥ 0. If E(g(X)) exists, then for any positive constant k we have:

P(g(X) ≥ k) ≤ E(g(X))/k.  (6)
Proof:

We define the discrete random variable U as follows:

U = 1 if g(X) ≥ k, and U = 0 if g(X) < k.

Thus, the discrete random variable U takes 0 or 1. Suppose that the probability function of U is given by f(u) = P(U = u), where P(U = u) is represented as:

P(U = 1) = P(g(X) ≥ k),
P(U = 0) = P(g(X) < k).

Then, regardless of the value which U takes, the following inequality always holds:

g(X) ≥ kU,

which implies that we have g(X) ≥ k when U = 1 and g(X) ≥ 0 when U = 0, where k is a positive constant value.

Therefore, taking the expectation on both sides, we obtain:

E(g(X)) ≥ kE(U),  (7)

where E(U) is given by:

E(U) = Σ_{u=0}^1 u P(U = u) = 1 × P(U = 1) + 0 × P(U = 0) = P(U = 1) = P(g(X) ≥ k).  (8)

Accordingly, substituting equation (8) into equation (7), we have the following inequality:

P(g(X) ≥ k) ≤ E(g(X))/k.

Chebyshev's Inequality: Assume that E(X) = µ, V(X) = σ², and λ is a positive constant value. Then, we have the following inequality:

P(|X − µ| ≥ λσ) ≤ 1/λ²,

or equivalently,

P(|X − µ| < λσ) ≥ 1 − 1/λ²,

which is called Chebyshev's inequality.
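An empirical check of Chebyshev's inequality: for any distribution with finite variance, the observed frequency of |X − µ| ≥ λσ should not exceed 1/λ². The exponential distribution below is an arbitrary example.

```python
import numpy as np

rng = np.random.default_rng(4)

# A skewed distribution (exponential with rate 1): mu = 1, sigma = 1.
x = rng.exponential(scale=1.0, size=1_000_000)
mu, sigma = 1.0, 1.0

for lam in (1.5, 2.0, 3.0):
    freq = np.mean(np.abs(x - mu) >= lam * sigma)    # empirical P(|X - mu| >= lambda*sigma)
    print(lam, round(freq, 4), "<=", round(1 / lam**2, 4))   # bound 1/lambda^2 holds
```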
Furthermore, note as follows. Taking ε = λσ, we obtain:

P(|X − µ| ≥ ε) ≤ σ²/ε².  (10)

Remark: Equation (10) can be derived when we take g(X) = (X − µ)², µ = E(X) and k = ε² in equation (6).

6.2 Law of Large Numbers and Convergence in Probability
Consider the random variable which takes the value µ with probability one, i.e., the following probability function:

f(x) = 1 if x = µ, and 0 otherwise.

Its moment-generating function is written as:

φ(θ) = Σx e^{θx} f(x) = e^{θµ} f(µ) = e^{θµ}.

Law of Large Numbers 2: Assume that X1, X2, ···, Xn are mutually independently and identically distributed with mean E(Xi) = µ and variance V(Xi) = σ² < ∞ for all i.

Then, for any positive value ε, as n → ∞, we have the following result:

P(|X̄n − µ| ≥ ε) → 0,

where X̄n = (1/n) Σ_{i=1}^n Xi. We say that X̄n converges in probability to µ.
Proof:

Using (10), Chebyshev's inequality is represented as follows:

P(|X̄n − E(X̄n)| ≥ ε) ≤ V(X̄n)/ε²,

where X in (10) is replaced by X̄n. We know E(X̄n) = µ and V(X̄n) = σ²/n, which are substituted into the above inequality. Then, we obtain:

P(|X̄n − µ| ≥ ε) ≤ σ²/(nε²).

Accordingly, when n → ∞, the following holds:

P(|X̄n − µ| ≥ ε) ≤ σ²/(nε²) → 0.

That is, X̄n → µ is obtained as n → ∞, which is written as: plim X̄n = µ. This theorem is called the law of large numbers.

The condition P(|X̄n − µ| ≥ ε) → 0, or equivalently P(|X̄n − µ| < ε) → 1, is used as the definition of convergence in probability. In this case, we say that X̄n converges in probability to µ.
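A short simulation illustrates the law of large numbers: the frequency of |X̄n − µ| ≥ ε shrinks as n grows, and always stays below the Chebyshev bound σ²/(nε²). The parameter values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, eps = 0.5, 1.0, 0.05

for n in (10, 100, 1000, 10000):
    xbar = rng.normal(mu, sigma, size=(1_000, n)).mean(axis=1)   # 1,000 sample means
    freq = np.mean(np.abs(xbar - mu) >= eps)                     # P(|Xbar_n - mu| >= eps)
    print(n, round(freq, 4), "bound:", round(sigma**2 / (n * eps**2), 4))
```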
Theorem: In the case where X1, X2, ···, Xn are not identically distributed and they are not mutually independently distributed, define:

mn = E(Σ_{i=1}^n Xi),  Vn = V(Σ_{i=1}^n Xi),

and assume that

mn/n = E((1/n) Σ_{i=1}^n Xi) < ∞,  Vn/n² = V((1/n) Σ_{i=1}^n Xi) < ∞,
Vn/n² → 0, as n → ∞.

Then, we obtain the following result:

(Σ_{i=1}^n Xi − mn)/n → 0.

That is, X̄n converges in probability to lim_{n→∞} mn/n. This theorem is also called the law of large numbers.
6.3 Central Limit Theorem and Convergence in Distribution

Central Limit Theorem: X1, X2, ···, Xn are mutually independently and identically distributed with E(Xi) = µ and V(Xi) = σ² for all i. Both µ and σ² are finite. Under the above assumptions, when n → ∞, we have:

P((X̄n − µ)/(σ/√n) < x) → ∫_{−∞}^x (1/√(2π)) e^{−u²/2} du,

which is called the central limit theorem.

Proof:

Define Yi = (Xi − µ)/σ. We can rewrite as follows:

(X̄n − µ)/(σ/√n) = (1/√n) Σ_{i=1}^n (Xi − µ)/σ = (1/√n) Σ_{i=1}^n Yi.

Since Y1, Y2, ···, Yn are mutually independently and identically distributed, the moment-generating function of Yi is identical for all i, which is denoted by φ(θ).
Using E(Yi) = 0 and V(Yi) = 1, the moment-generating function of Yi, φ(θ), is rewritten as:

φ(θ) = E(e^{Yiθ}) = E(1 + Yiθ + (1/2)Yi²θ² + (1/3!)Yi³θ³ + ···) = 1 + (1/2)θ² + O(θ³).

In the second equality, e^{Yiθ} is approximated by the Taylor series expansion around θ = 0.

(*) Remark: O(x) denotes a polynomial function of x and higher-order terms which is dominated by x. In this case, O(θ³) is a function of θ³, θ⁴, ···. Since the moment-generating function is conventionally evaluated at θ = 0, θ³ is the largest of θ³, θ⁴, ···, and accordingly O(θ³) is dominated by θ³ (in other words, θ⁴, θ⁵, ··· are small enough, compared with θ³).
Define Z as:

Z = (1/√n) Σ_{i=1}^n Yi.

Then, the moment-generating function of Z, i.e., φz(θ), is given by:

φz(θ) = E(e^{Zθ}) = E(e^{(θ/√n) Σ Yi}) = Π_{i=1}^n E(e^{(θ/√n) Yi}) = (φ(θ/√n))^n
      = (1 + θ²/(2n) + O(θ³/n^{3/2}))^n = (1 + θ²/(2n) + O(n^{−3/2}))^n.

We consider that n goes to infinity. Therefore, O(θ³/n^{3/2}) indicates a function of n^{−3/2}.

Moreover, consider x = θ²/(2n) + O(n^{−3/2}). Multiplying both sides of x = θ²/(2n) + O(n^{−3/2}) by n/x, we obtain n = (1/x)(θ²/2 + O(n^{−1/2})).

Substitute n = (1/x)(θ²/2 + O(n^{−1/2})) into the moment-generating function of Z, i.e., φz(θ). Then, we obtain:

φz(θ) = (1 + θ²/(2n) + O(n^{−3/2}))^n = (1 + x)^{(1/x)(θ²/2 + O(n^{−1/2}))} = ((1 + x)^{1/x})^{θ²/2 + O(n^{−1/2})} → e^{θ²/2}.
Note that x → 0 when n → ∞ and that lim_{x→0} (1 + x)^{1/x} = e as in Section 2.3. Furthermore, we have O(n^{−1/2}) → 0 as n → ∞.

Since φz(θ) = e^{θ²/2} is the moment-generating function of the standard normal distribution (see Section 3.1 for the moment-generating function of the standard normal probability density), we have:

P((X̄n − µ)/(σ/√n) < x) → ∫_{−∞}^x (1/√(2π)) e^{−u²/2} du,

or equivalently,

(X̄n − µ)/(σ/√n) → N(0, 1).

We say that (X̄n − µ)/(σ/√n) converges in distribution to N(0, 1).

=⇒ Convergence in distribution

The following expression is also possible:

√n (X̄n − µ) → N(0, σ²).  (11)
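A simulation illustrates the central limit theorem: standardized sample means of i.i.d. non-normal draws behave like N(0, 1) for moderately large n. The exponential population below is an arbitrary choice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, replications = 200, 50_000

# A clearly non-normal population: exponential with mean 1 and variance 1.
samples = rng.exponential(scale=1.0, size=(replications, n))
z = (samples.mean(axis=1) - 1.0) / (1.0 / np.sqrt(n))    # (Xbar_n - mu)/(sigma/sqrt(n))

# Compare a few quantiles of z with the standard normal quantiles.
probs = [0.05, 0.25, 0.5, 0.75, 0.95]
print(np.round(np.quantile(z, probs), 3))
print(np.round(stats.norm.ppf(probs), 3))   # close to the empirical quantiles above
```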
Corollary 1: When E(Xi) = µ, V(Xi) = σ² and X̄n = (1/n) Σ_{i=1}^n Xi, note that

Corollary 2: Consider the case where X1, X2, ···, Xn are not identically distributed.
Consider estimating the parameter θ using the n observed data.
Let θ̂n(x1, x2, · · ·, xn) be a function of the observed data x1, x2, · · ·, xn.
θ̂n(x1, x2, · · ·, xn) is constructed to estimate the parameter θ.
θ̂n(x1, x2, · · ·, xn) takes a certain value given the n observed data.
θ̂n(x1, x2, · · ·, xn) is called the point estimate of θ, or simply the estimate of θ.

Example 1.11: Consider the case of θ = (µ, σ²), where the unknown parameters contained in the population are given by mean and variance.
A point estimate of the population mean µ is given by:
µ̂n(x1, x2, · · ·, xn) ≡ x̄ = (1/n) Σ_{i=1}^{n} xi.
A point estimate of the population variance σ² is:
σ̂²n(x1, x2, · · ·, xn) ≡ s² = (1/(n−1)) Σ_{i=1}^{n} (xi − x̄)².
An alternative point estimate of the population variance σ² is:
σ̃²n(x1, x2, · · ·, xn) ≡ s∗∗² = (1/n) Σ_{i=1}^{n} (xi − x̄)².

7.2 Statistic, Estimate and Estimator (౷ྔܭɼਪఆɼਪఆྔ)

The underlying distribution of the population is assumed to be known, but the parameter θ, which characterizes the underlying distribution, is unknown.
The probability density function of the population is given by f(x; θ).
Let X1, X2, · · ·, Xn be a subset of the population, which are regarded as the random variables and are assumed to be mutually independent.
There, the experimental values and the actually observed data series are used in the same meaning.
Example 1.12: Let X1, X2, · · ·, Xn denote a random sample of n from a given distribution f(x; θ).
Consider the case of θ = (µ, σ²).
The estimator of µ is given by X̄ = (1/n) Σ_{i=1}^{n} Xi, while the estimate of µ is x̄ = (1/n) Σ_{i=1}^{n} xi.
The estimator of σ² is S² = (1/(n−1)) Σ_{i=1}^{n} (Xi − X̄)² and the estimate of σ² is s² = (1/(n−1)) Σ_{i=1}^{n} (xi − x̄)².

There are numerous estimators and estimates of θ.
All of (1/n) Σ_{i=1}^{n} Xi, (X1 + Xn)/2, the median of (X1, X2, · · ·, Xn) and so on are taken as estimators of µ.
Of course, they are called the estimates of θ when Xi is replaced by xi for all i.
Both S² = (1/(n−1)) Σ_{i=1}^{n} (Xi − X̄)² and S∗∗² = (1/n) Σ_{i=1}^{n} (Xi − X̄)² are estimators of σ².
We need to choose one out of the numerous estimators of θ.

7.3 Estimation of Mean and Variance
1. The estimator of population mean µ is:
• X̄ = (1/n) Σ_{i=1}^{n} Xi.
2. The estimators of population variance σ² are:
• S∗² = (1/n) Σ_{i=1}^{n} (Xi − µ)², when µ is known,
• S² = (1/(n−1)) Σ_{i=1}^{n} (Xi − X̄)²,
• S∗∗² = (1/n) Σ_{i=1}^{n} (Xi − X̄)².

Properties of X̄: From Theorem on p.138, mean and variance of X̄ are obtained as follows:
E(X̄) = µ,  V(X̄) = σ²/n.

Properties of S∗², S² and S∗∗²: The expectation of S∗² is:
E(S∗²) = E( (1/n) Σ_{i=1}^{n} (Xi − µ)² ) = (1/n) Σ_{i=1}^{n} E( (Xi − µ)² )
       = (1/n) Σ_{i=1}^{n} V(Xi) = (1/n) Σ_{i=1}^{n} σ² = σ².
Next, the expectation of S² is given by:
E(S²) = E( (1/(n−1)) Σ_{i=1}^{n} (Xi − X̄)² ) = (1/(n−1)) E( Σ_{i=1}^{n} (Xi − X̄)² )
      = (1/(n−1)) E( Σ_{i=1}^{n} ((Xi − µ) − (X̄ − µ))² )
      = (1/(n−1)) E( Σ_{i=1}^{n} ((Xi − µ)² − 2(Xi − µ)(X̄ − µ) + (X̄ − µ)²) )
      = (1/(n−1)) E( Σ_{i=1}^{n} (Xi − µ)² − 2(X̄ − µ) Σ_{i=1}^{n} (Xi − µ) + n(X̄ − µ)² )
      = (1/(n−1)) E( Σ_{i=1}^{n} (Xi − µ)² − n(X̄ − µ)² )
      = (n/(n−1)) E( (1/n) Σ_{i=1}^{n} (Xi − µ)² ) − (n/(n−1)) E( (X̄ − µ)² )
      = (n/(n−1)) σ² − (n/(n−1)) (σ²/n) = σ².
Σ_{i=1}^{n} (Xi − µ) = n(X̄ − µ) is used in the sixth equality.
E( (1/n) Σ_{i=1}^{n} (Xi − µ)² ) = E(S∗²) = σ² and E( (X̄ − µ)² ) = V(X̄) = σ²/n are required in the eighth equality.
Finally, the expectation of S∗∗² is represented by:
E(S∗∗²) = E( (1/n) Σ_{i=1}^{n} (Xi − X̄)² ) = ((n−1)/n) E( (1/(n−1)) Σ_{i=1}^{n} (Xi − X̄)² )
        = ((n−1)/n) E(S²) = ((n−1)/n) σ² ≠ σ².

7.4 Point Estimation: Optimality

θ denotes the parameter to be estimated.
θ̂n(X1, X2, · · ·, Xn) represents the estimator of θ, while θ̂n(x1, x2, · · ·, xn) indicates the estimate of θ.

The desired properties of θ̂n are:
• unbiasedness (ෆภੑ),
• efficiency (༗ޮੑ),
• consistency (Ұகੑ) and
• sufficiency (ेੑ). ←− Not discussed in this class.

Unbiasedness (ෆภੑ): One of the desirable features that the estimator of the parameter should have is given by:
E(θ̂n) = θ,     (12)
which implies that θ̂n is distributed around θ.
As an example of unbiasedness, consider the case of θ = (µ, σ²).
Suppose that X1, X2, · · ·, Xn are mutually independently and identically distributed with mean µ and variance σ².
Consider the following estimators of µ and σ².
1. The estimator of µ is:
• X̄ = (1/n) Σ_{i=1}^{n} Xi.
2. The estimators of σ² are:
• S² = (1/(n−1)) Σ_{i=1}^{n} (Xi − X̄)²,
• S∗∗² = (1/n) Σ_{i=1}^{n} (Xi − X̄)².

Since we have obtained E(X̄) = µ and E(S²) = σ², X̄ and S² are unbiased estimators of µ and σ².
We have obtained the result E(S∗∗²) ≠ σ² and therefore S∗∗² is not an unbiased estimator of σ².
According to the criterion of unbiasedness, S² is preferred to S∗∗² for estimation of σ².
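A minimal Monte Carlo sketch (not part of the note; NumPy assumed, artificial normal data) checks the two expectations just derived: E(S²) = σ² while E(S∗∗²) = ((n−1)/n)σ².

```python
# Monte Carlo check (sketch, not from the note) of E(S²) = σ² and
# E(S**²) = ((n−1)/n)σ² for i.i.d. normal data.
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2, n, reps = 0.0, 4.0, 10, 100000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
s2 = x.var(axis=1, ddof=1)          # S²   with divisor n−1
s2_star2 = x.var(axis=1, ddof=0)    # S**² with divisor n

print("E(S²)   ~", s2.mean())           # close to σ² = 4
print("E(S**²) ~", s2_star2.mean())     # close to ((n−1)/n)σ² = 3.6
```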
which is known as the Cramer-Rao inequality (ΫϥϝʔϧɾϥΦͷෆࣜ).
When there exists the unbiased estimator θ̂n such that the equality in (13) holds, θ̂n becomes the unbiased estimator with minimum variance, which is the efficient estimator (༗ޮਪఆྔ).
σ²(θ)/n is called the Cramer-Rao lower bound (ΫϥϝʔϧɾϥΦͷԼݶ).

Proof of the Cramer-Rao inequality: We prove the above inequality and the equalities in σ²(θ).
The likelihood function (ؔ) l(θ; x) = l(θ; x1, x2, · · ·, xn) is a joint density of X1, X2, · · ·, Xn.
That is, l(θ; x) = l(θ; x1, x2, · · ·, xn) = Π_{i=1}^{n} f(xi; θ).
The integration of l(θ; x1, x2, · · ·, xn) with respect to x1, x2, · · ·, xn is equal to one.
That is, we have the following equation:
1 = ∫ l(θ; x) dx,     (15)
where the likelihood function l(θ; x) is given by l(θ; x) = Π_{i=1}^{n} f(xi; θ) and ∫ · · · dx implies an n-tuple integral.

Differentiating both sides of equation (15) with respect to θ, we obtain the following equation:
0 = ∫ ∂l(θ; x)/∂θ dx = ∫ (1/l(θ; x)) (∂l(θ; x)/∂θ) l(θ; x) dx
  = ∫ (∂ log l(θ; x)/∂θ) l(θ; x) dx = E( ∂ log l(θ; X)/∂θ ),     (16)
which implies that the expectation of ∂ log l(θ; X)/∂θ is equal to zero.
In the third equality, note that d log x / dx = 1/x.
Now, let θ̂n be an estimator of θ. The definition of the mathematical expectation of the estimator θ̂n is represented as:
E(θ̂n) = ∫ θ̂n l(θ; x) dx.     (17)

Differentiating both sides of equation (17) with respect to θ, we obtain:
∂E(θ̂n)/∂θ = ∫ θ̂n (∂l(θ; x)/∂θ) dx = ∫ θ̂n (∂ log l(θ; x)/∂θ) l(θ; x) dx
           = E( θ̂n ∂ log l(θ; X)/∂θ ) = Cov( θ̂n, ∂ log l(θ; X)/∂θ ),     (18)
where the last equality uses E( ∂ log l(θ; X)/∂θ ) = 0 from (16).
In the second equality, d log x / dx = 1/x is utilized.
Taking the square on both sides of equation (18), we obtain the following expression:
( ∂E(θ̂n)/∂θ )² = ( Cov( θ̂n, ∂ log l(θ; X)/∂θ ) )² = ρ² V(θ̂n) V( ∂ log l(θ; X)/∂θ )
              ≤ V(θ̂n) V( ∂ log l(θ; X)/∂θ ),     (19)
where ρ denotes the correlation coefficient between θ̂n and ∂ log l(θ; X)/∂θ.

Note that the definition of ρ is given by:
ρ = Cov( θ̂n, ∂ log l(θ; X)/∂θ ) / ( √V(θ̂n) √V( ∂ log l(θ; X)/∂θ ) ).
Moreover, we have −1 ≤ ρ ≤ 1 (i.e., ρ² ≤ 1).
Then, the inequality (19) is obtained, which is rewritten as:
V(θ̂n) ≥ ( ∂E(θ̂n)/∂θ )² / V( ∂ log l(θ; X)/∂θ ).     (20)
When E(θ̂n) = θ, i.e., when θ̂n is an unbiased estimator of θ, the numerator in the right-hand side of equation (20) is equal to one.
Therefore, we have the following result:
V(θ̂n) ≥ 1 / V( ∂ log l(θ; X)/∂θ ) = 1 / E( ( ∂ log l(θ; X)/∂θ )² ).
Note that we have V( ∂ log l(θ; X)/∂θ ) = E( ( ∂ log l(θ; X)/∂θ )² ) in the equality above, because of E( ∂ log l(θ; X)/∂θ ) = 0.

Moreover, the denominator in the right-hand side of the above inequality is rewritten as follows:
E( ( ∂ log l(θ; X)/∂θ )² ) = E( ( Σ_{i=1}^{n} ∂ log f(Xi; θ)/∂θ )² ) = Σ_{i=1}^{n} E( ( ∂ log f(Xi; θ)/∂θ )² )
  = n E( ( ∂ log f(X; θ)/∂θ )² ) = n ∫_{−∞}^{∞} ( ∂ log f(x; θ)/∂θ )² f(x; θ) dx.
In the first equality, log l(θ; X) = Σ_{i=1}^{n} log f(Xi; θ) is utilized.
Since Xi, i = 1, 2, · · ·, n, are mutually independent, the second equality holds.
The third equality holds because X1, X2, · · ·, Xn are identically distributed.
Therefore, we obtain the following inequality:
V(θ̂n) ≥ 1 / E( ( ∂ log l(θ; X)/∂θ )² ) = 1 / ( n E( ( ∂ log f(X; θ)/∂θ )² ) ) = σ²(θ)/n,
which is equivalent to (13).
Next, we prove the equalities in (14), i.e.,
−E( ∂² log f(X; θ)/∂θ² ) = E( ( ∂ log f(X; θ)/∂θ )² ) = V( ∂ log f(X; θ)/∂θ ).

Differentiating ∫ f(x; θ) dx = 1 with respect to θ, we obtain as follows:
∫ ∂f(x; θ)/∂θ dx = 0.
We assume that the range of x does not depend on the parameter θ and that ∂f(x; θ)/∂θ exists.
The above equation is rewritten as:
∫ ( ∂ log f(x; θ)/∂θ ) f(x; θ) dx = 0.     (21)
Thus, the Cramer-Rao inequality is derived as:
V(θ̂n) ≥ σ²(θ)/n,
where
σ²(θ) = 1 / E( ( ∂ log f(X; θ)/∂θ )² ) = 1 / V( ∂ log f(X; θ)/∂θ ) = −1 / E( ∂² log f(X; θ)/∂θ² ).

Example 1.13a (Efficient Estimator of µ): Suppose that X1, X2, · · ·, Xn are mutually independently, identically and normally distributed with mean µ and variance σ².
Then, we show that X̄ is an efficient estimator of µ.
V(X̄) is given by σ²/n, which does not depend on the distribution of Xi, i = 1, 2, · · ·, n. . . . . . . (A)
Because Xi is normally distributed with mean µ and variance σ², the density function of Xi is given by:
f(x; µ) = (1/√(2πσ²)) exp( −(1/(2σ²)) (x − µ)² ).
The Cramer-Rao inequality is represented as:
V(X̄) ≥ 1 / ( n E( ( ∂ log f(X; µ)/∂µ )² ) ),
where the logarithm of f(X; µ) is written as:
log f(X; µ) = −(1/2) log(2πσ²) − (1/(2σ²)) (X − µ)².

The partial derivative of log f(X; µ) with respect to µ is:
∂ log f(X; µ)/∂µ = (1/σ²) (X − µ).
The Cramer-Rao inequality in this case is written as:
V(X̄) ≥ 1 / ( n E( ( (1/σ²)(X − µ) )² ) ) = 1 / ( (n/σ⁴) E( (X − µ)² ) ) = σ²/n. . . . . . . (B)
From (A) and (B), the variance of X̄ is equal to the lower bound of the Cramer-Rao inequality, i.e., V(X̄) = σ²/n, which implies that the equality included in the Cramer-Rao inequality holds.
Therefore, we can conclude that the sample mean X̄ is an efficient estimator of µ.

Example 1.13b (Efficient Estimator of σ²): Suppose that X1, X2, · · ·, Xn are mutually independently, identically and normally distributed with mean µ and variance σ².
Is S² an efficient estimator of σ²?
Because Xi is normally distributed with mean µ and variance σ², the density function of Xi is given by:
f(x; σ²) = (1/√(2πσ²)) exp( −(1/(2σ²)) (x − µ)² ).
The Cramer-Rao inequality is represented as:
V(S²) ≥ 1 / ( n E( ( ∂ log f(X; σ²)/∂σ² )² ) ) = 1 / ( −n E( ∂² log f(X; σ²)/∂(σ²)² ) ),
where the logarithm of f(X; σ²) is written as:
log f(X; σ²) = −(1/2) log(2π) − (1/2) log(σ²) − (1/(2σ²)) (X − µ)².

The partial derivative of log f(X; σ²) with respect to σ² is:
∂ log f(X; σ²)/∂σ² = −1/(2σ²) + (1/(2σ⁴)) (X − µ)².
The second partial derivative of log f(X; σ²) with respect to σ² is:
∂² log f(X; σ²)/∂(σ²)² = 1/(2σ⁴) − (1/σ⁶) (X − µ)².
The Cramer-Rao inequality in this case is written as:
V(S²) ≥ 1 / ( −n E( 1/(2σ⁴) − (1/σ⁶) (X − µ)² ) ) = 2σ⁴/n. . . . . . . (B)
From (A) and (B), the variance of S² is not equal to the lower bound of the Cramer-Rao inequality, i.e., V(S²) = 2σ⁴/(n−1) > 2σ⁴/n.
Therefore, we can conclude that the sample unbiased variance S² is not an efficient estimator of σ².

Example 1.14: Minimum Variance Linear Unbiased Estimator (࠷খࢄઢܗෆภਪఆྔ): Suppose that X1, X2, · · ·, Xn are mutually independently and identically distributed with mean µ and variance σ² (note that the normality assumption is excluded from Example 1.13).
Consider the following linear estimator: µ̂ = Σ_{i=1}^{n} ai Xi.
Utilizing Theorem on p.133, when E(Xi) = µ and V(Xi) = σ² for all i, we have:
E(µ̂) = µ Σ_{i=1}^{n} ai and V(µ̂) = σ² Σ_{i=1}^{n} ai².
Since µ̂ is linear in Xi, µ̂ is called a linear estimator (ઢܗਪఆྔ) of µ.
In order for µ̂ to be unbiased, we need to have the condition: E(µ̂) = µ Σ_{i=1}^{n} ai = µ.

That is, if Σ_{i=1}^{n} ai = 1 is satisfied, µ̂ gives us a linear unbiased estimator (ઢܗෆภਪఆྔ).
Thus, as mentioned in Example 1.12 of Section 7.2, there are numerous unbiased estimators.
The variance of µ̂ is given by σ² Σ_{i=1}^{n} ai².
We obtain the value of ai which minimizes Σ_{i=1}^{n} ai² with the constraint Σ_{i=1}^{n} ai = 1.
Construct the Lagrange function as follows:
L = (1/2) Σ_{i=1}^{n} ai² + λ ( 1 − Σ_{i=1}^{n} ai ),
where λ denotes the Lagrange multiplier.
The 1/2 in the first term makes computation easier.

For minimization, the partial derivatives of L with respect to ai and λ are equal to zero, i.e.,
∂L/∂ai = ai − λ = 0,  i = 1, 2, · · ·, n,
∂L/∂λ = 1 − Σ_{i=1}^{n} ai = 0.
Solving the above equations, ai = λ = 1/n is obtained.
When ai = 1/n for all i, µ̂ has minimum variance in the class of linear unbiased estimators.
X̄ is a minimum variance linear unbiased estimator.
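As a quick numerical illustration (not in the note; NumPy assumed, weights chosen only for illustration), the variance σ² Σ ai² of a few linear unbiased estimators can be compared directly; the equal weights ai = 1/n give the smallest value, in line with the Lagrange argument above.

```python
# Sketch (not from the note): compare σ²·Σaᵢ² for several linear unbiased
# estimators Σaᵢ·Xᵢ with Σaᵢ = 1; equal weights minimize the variance.
import numpy as np

sigma2, n = 1.0, 10
candidates = {
    "equal weights 1/n":  np.full(n, 1.0 / n),
    "first obs. only":    np.eye(n)[0],
    "decreasing weights": np.arange(n, 0, -1) / np.arange(n, 0, -1).sum(),
}
for name, a in candidates.items():
    assert abs(a.sum() - 1.0) < 1e-12      # unbiasedness constraint Σaᵢ = 1
    print(f"{name:20s}  variance = {sigma2 * np.sum(a**2):.4f}")
```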
The minimum variance linear unbiased estimator is different from the efficient estimator.
The former does not require the normality assumption.
The latter gives us the unbiased estimator whose variance is equal to the Cramer-Rao lower bound, and it is not restricted to the class of linear unbiased estimators.
Under the normality assumption, the minimum variance linear unbiased estimator leads to the efficient estimator.

Consistency (Ұகੑ): Let θ̂n be an estimator of θ.
Suppose that for any ε > 0 we have the following:
P(|θ̂n − θ| ≥ ε) −→ 0, as n −→ ∞,
which implies that θ̂n −→ θ as n −→ ∞.
Example 1.15: Suppose that X1, X2, · · ·, Xn are mutually independently and identically distributed with mean µ and variance σ².
Assume that σ² is known.
Then, it is shown that X̄ is a consistent estimator of µ.
For a random variable X, Chebyshev's inequality is given by:
P(|X − E(X)| ≥ ε) ≤ V(X)/ε².
Here, E(X̄) = µ and V(X̄) = σ²/n,
because E(Xi) = µ and V(Xi) = σ² < ∞ for all i.
Then, when n −→ ∞, we obtain the following result:
P(|X̄ − µ| ≥ ε) ≤ σ²/(n ε²) −→ 0,
which implies that X̄ −→ µ as n −→ ∞.
Therefore, X̄ is a consistent estimator of µ.
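The following sketch (not from the note; NumPy assumed, one artificial sample) shows consistency from a complementary angle: along a single growing sample the running mean X̄n settles down to µ.

```python
# Sketch (not from the note): the running mean of one growing sample
# approaches µ, illustrating consistency of the sample mean (Example 1.15).
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 5.0, 2.0
x = rng.normal(mu, sigma, size=100000)
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)

for n in [10, 100, 1000, 10000, 100000]:
    print(f"n={n:6d}   Xbar_n = {running_mean[n-1]:.4f}"
          f"   |Xbar_n - mu| = {abs(running_mean[n-1] - mu):.4f}")
```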
Summary: When the distribution of Xi is not assumed for all i, X̄ is a minimum variance linear unbiased and consistent estimator of µ.

Example 1.16a: Suppose that X1, X2, · · ·, Xn are mutually independently, identically and normally distributed with mean µ and variance σ².
Consider S² = (1/(n−1)) Σ_{i=1}^{n} (Xi − X̄)², which is an unbiased estimator of σ².
Since U ≡ (n−1)S²/σ² is distributed as χ²(n−1), we have E(U) = n − 1 and V(U) = 2(n − 1).
V(U) = V( (n−1)S²/σ² ) = 2(n − 1),
( (n−1)²/σ⁴ ) V(S²) = 2(n − 1),
V(S²) = 2σ⁴/(n − 1).
We obtain the following Chebyshev's inequality:
P(|S² − σ²| ≥ ε) ≤ E( (S² − σ²)² )/ε² = 2σ⁴/( ε² (n − 1) ) −→ 0,
which implies that S² −→ σ² as n −→ ∞.
Therefore, S² is a consistent estimator of σ².

Example 1.16b: Suppose that X1, X2, · · ·, Xn are mutually independently, identically and normally distributed with mean µ and variance σ².
Consider S∗∗² = (1/n) Σ_{i=1}^{n} (Xi − X̄)², which is an estimator of σ².
We obtain the following Chebyshev's inequality:
P(|S∗∗² − σ²| ≥ ε) ≤ E( (S∗∗² − σ²)² )/ε².
We compute E( (S∗∗² − σ²)² ).
Define S² = (1/(n−1)) Σ_{i=1}^{n} (Xi − X̄)² as an estimator of σ².
Therefore, as n −→ ∞, we obtain:
P(|S∗∗² − σ²| ≥ ε) ≤ (1/ε²) ( (2n − 1)/n² ) σ⁴ −→ 0.
Because S∗∗² −→ σ², S∗∗² is a consistent estimator of σ².
S∗∗² is biased (see Section 7.3, p.273), but it is consistent.

7.5 Estimation Methods

• Maximum Likelihood Estimation Method (࠷ਪఆ๏)
• Least Squares Estimation Method (࠷খೋ๏)
• Method of Moments (ੵ๏)
7.5.1 Maximum Likelihood Estimator (࠷ਪఆྔ)

The parameter θ is included in the underlying distribution f(x; θ).
For example, θ = (µ, σ²) in the case of the normal distribution.
Xi has the probability density function f(x; θ).

The joint density function of X1, X2, · · ·, Xn is given by:
f(x1, x2, · · ·, xn; θ) = Π_{i=1}^{n} f(xi; θ),
where θ denotes the unknown parameter.
Given the actually observed data x1, x2, · · ·, xn, the joint density f(x1, x2, · · ·, xn; θ) is regarded as a function of θ:
l(θ) = l(θ; x) = l(θ; x1, x2, · · ·, xn) = Π_{i=1}^{n} f(xi; θ).
l(θ) is called the likelihood function (ؔ).

Let θ̂n be the θ which maximizes the likelihood function.
Given data x1, x2, · · ·, xn, θ̂n(x1, x2, · · ·, xn) is called the maximum likelihood estimate (MLE, ࠷ਪఆ).
Replacing x1, x2, · · ·, xn by X1, X2, · · ·, Xn, θ̂n = θ̂n(X1, X2, · · ·, Xn) is called the maximum likelihood estimator (MLE, ࠷ਪఆྔ).
Solving ∂l(θ)/∂θ = 0, the MLE θ̂n ≡ θ̂n(X1, X2, · · ·, Xn) is obtained.
Example 1.17a: Suppose that X1, X2, · · ·, Xn are mutually independently, identically and normally distributed with mean µ and variance σ².
We derive the maximum likelihood estimators of µ and σ².
The joint density (or the likelihood function) of X1, X2, · · ·, Xn is:
f(x1, x2, · · ·, xn; µ, σ²) = Π_{i=1}^{n} f(xi; µ, σ²)
  = Π_{i=1}^{n} (2πσ²)^{−1/2} exp( −(1/(2σ²)) (xi − µ)² )
  = (2πσ²)^{−n/2} exp( −(1/(2σ²)) Σ_{i=1}^{n} (xi − µ)² )
  = l(µ, σ²).

The logarithm of the likelihood function is given by:
log l(µ, σ²) = −(n/2) log(2π) − (n/2) log(σ²) − (1/(2σ²)) Σ_{i=1}^{n} (xi − µ)²,
which is called the log-likelihood function (ରؔ).

For maximization of the likelihood function, differentiating the log-likelihood function log l(µ, σ²) with respect to µ and σ², the first derivatives should be equal to zero, i.e.,
∂ log l(µ, σ²)/∂µ = (1/σ²) Σ_{i=1}^{n} (xi − µ) = 0,
∂ log l(µ, σ²)/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^{n} (xi − µ)² = 0.
Let µ̂ and σ̂² be the solution which satisfies the above two equations.

Solving the two equations, we obtain the maximum likelihood estimates as follows:
µ̂ = (1/n) Σ_{i=1}^{n} xi = x̄,
σ̂² = (1/n) Σ_{i=1}^{n} (xi − µ̂)² = (1/n) Σ_{i=1}^{n} (xi − x̄)² = s∗∗².
Replacing xi by Xi for i = 1, 2, · · ·, n, the maximum likelihood estimators of µ and σ² are given by X̄ and S∗∗².
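A short numerical check, not part of the note (NumPy assumed, artificial data), confirms that the closed-form solutions µ̂ = x̄ and σ̂² = (1/n) Σ (xi − x̄)² indeed maximize the log-likelihood: perturbing either parameter lowers log l(µ, σ²).

```python
# Sketch (not from the note): verify numerically that the closed-form normal
# MLEs maximize the log-likelihood on an artificial sample.
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(10.0, 3.0, size=500)
n = x.size

def loglik(mu, sigma2):
    # log l(µ, σ²) for i.i.d. normal observations
    return (-0.5 * n * np.log(2 * np.pi) - 0.5 * n * np.log(sigma2)
            - np.sum((x - mu) ** 2) / (2 * sigma2))

mu_hat = x.mean()
sigma2_hat = np.mean((x - mu_hat) ** 2)      # divisor n, i.e. s**²

print("MLE:", mu_hat, sigma2_hat, " log-likelihood:", loglik(mu_hat, sigma2_hat))
# Perturbing either parameter lowers the log-likelihood (both lines print True).
print(loglik(mu_hat + 0.1, sigma2_hat) < loglik(mu_hat, sigma2_hat))
print(loglik(mu_hat, 1.1 * sigma2_hat) < loglik(mu_hat, sigma2_hat))
```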
Since E(X̄) = µ, the maximum likelihood estimator of µ, X̄, is an unbiased estimator.
We have checked that X̄ is efficient and consistent.

Example 1.17b: Suppose that X1, X2, · · ·, Xn are mutually independently and identically distributed as Bernoulli random variables with parameter p.
We derive the maximum likelihood estimator of p.
The log-likelihood function is given by:
log l(p) = ( Σ_{i=1}^{n} xi ) log(p) + ( n − Σ_{i=1}^{n} xi ) log(1 − p).
For maximization of the likelihood function, differentiating the log-likelihood function log l(p) with respect to p, the first derivative should be equal to zero, i.e.,
d log l(p)/dp = (1/p) Σ_{i=1}^{n} xi − (1/(1 − p)) ( n − Σ_{i=1}^{n} xi )
             = (n/p) x̄ − (n/(1 − p)) (1 − x̄) = 0.

We obtain the maximum likelihood estimate as follows:
p̂ = x̄ = (1/n) Σ_{i=1}^{n} xi.
Replacing xi by Xi for i = 1, 2, · · ·, n, the maximum likelihood estimator of p is given by p̂ = X̄ = (1/n) Σ_{i=1}^{n} Xi.
E(p̂) = E(X̄) = E( (1/n) Σ_{i=1}^{n} Xi ) = (1/n) Σ_{i=1}^{n} E(Xi) = p.
Remember that E(Xi) = Σ_{xi=0}^{1} xi p^{xi} (1 − p)^{1−xi} = p, where xi takes 0 or 1.
Thus, p̂ is an unbiased estimator of p.

From the Cramer-Rao inequality,
V(p̂) ≥ −1 / ( n E( d² log f(X; p)/dp² ) ).
Here,
f(X; p) = p^X (1 − p)^{1−X},
log f(X; p) = X log(p) + (1 − X) log(1 − p),
d log f(X; p)/dp = X/p − (1 − X)/(1 − p),
d² log f(X; p)/dp² = −X/p² − (1 − X)/(1 − p)².
We need to check whether the equality holds.
V(p̂) = V( (1/n) Σ_{i=1}^{n} Xi ) = (1/n²) V( Σ_{i=1}^{n} Xi ) = (1/n²) Σ_{i=1}^{n} V(Xi)
      = (1/n²) Σ_{i=1}^{n} p(1 − p) = p(1 − p)/n.
Note as follows:
V(Xi) = E( (Xi − p)² ) = Σ_{xi=0}^{1} (xi − p)² p^{xi} (1 − p)^{1−xi} = p(1 − p).

The Cramer-Rao lower bound is:
−1 / ( n E( d² log f(X; p)/dp² ) ) = −1 / ( n E( −X/p² − (1 − X)/(1 − p)² ) )
  = 1 / ( n ( E(X)/p² + (1 − E(X))/(1 − p)² ) ) = 1 / ( n ( 1/p + 1/(1 − p) ) ) = p(1 − p)/n,
which is equal to V(p̂).
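A brief simulation sketch (not in the note; NumPy assumed) illustrates the equality just obtained: the Monte Carlo variance of p̂ = X̄ for Bernoulli data is essentially the Cramer-Rao lower bound p(1 − p)/n.

```python
# Sketch (not from the note): simulated V(p̂) vs. the Cramer-Rao bound p(1−p)/n.
import numpy as np

rng = np.random.default_rng(5)
p, n, reps = 0.3, 50, 200000

p_hat = rng.binomial(1, p, size=(reps, n)).mean(axis=1)
print("simulated V(p_hat):", p_hat.var())
print("p(1-p)/n          :", p * (1 - p) / n)    # Cramer-Rao lower bound
```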
• We check whether p̂ is consistent. From Chebyshev's inequality,

Properties of the Maximum Likelihood Estimator: For small sample (খඪຊ), the MLE has the following properties.
Efficient estimator ⇐⇒ The variance of the estimator is equal to the Cramer-Rao lower bound.
For large sample (େඪຊ), as n −→ ∞, the maximum likelihood estimator of θ, θ̂n, satisfies:
√n (θ̂n − θ) −→ N(0, σ²(θ)).     (23)

(23) indicates that the MLE has consistency, asymptotic unbiasedness (ۙෆภੑ), asymptotic efficiency (ۙ༗ޮੑ) and asymptotic normality (ۙਖ਼نੑ).
Asymptotic normality of the MLE comes from the central limit theorem discussed in Section 6.3.
Note that the properties as n −→ ∞ are called the asymptotic properties, which include consistency, asymptotic normality and so on.
By normalizing, as n −→ ∞, we obtain as follows:
√n (θ̂n − θ) / σ(θ) = (θ̂n − θ) / ( σ(θ)/√n ) −→ N(0, 1).
√n (θ̂n − θ) has a limiting distribution which does not depend on n.
We write √n (θ̂n − θ) = O(1), where O(1) denotes a term that neither grows nor vanishes with n.
That is, θ̂n − θ = n^{−1/2} × O(1) = O(n^{−1/2}).

As another representation, when n is large, we can approximate the distribution of θ̂n as follows:
θ̂n ∼ N( θ, σ²(θ)/n ).
This implies that when n −→ ∞, the variance of θ̂n approaches the Cramer-Rao lower bound σ²(θ)/n.
This property is called asymptotic efficiency.
Moreover, replacing θ in the variance σ²(θ) by θ̂n, when n −→ ∞, we have the following property:
(θ̂n − θ) / ( σ(θ̂n)/√n ) −→ N(0, 1).     (24)
Practically, when n is large, we approximately use:
θ̂n ∼ N( θ, σ²(θ̂n)/n ).     (25)

Proof of (23): By the central limit theorem (11) on p.254,
(1/√n) Σ_{i=1}^{n} ∂ log f(Xi; θ)/∂θ −→ N( 0, 1/σ²(θ) ),     (26)
where σ²(θ) is defined in (14), i.e., V( ∂ log f(Xi; θ)/∂θ ) = 1/σ²(θ).
Note that E( ∂ log f(Xi; θ)/∂θ ) = 0.
Apply the central limit theorem, taking ∂ log f(Xi; θ)/∂θ as the ith random variable.
By the Taylor series expansion of the first-order condition around θ̂n = θ,
0 = (1/√n) Σ_{i=1}^{n} ∂ log f(Xi; θ̂n)/∂θ
  = (1/√n) Σ_{i=1}^{n} ∂ log f(Xi; θ)/∂θ + (1/√n) Σ_{i=1}^{n} ( ∂² log f(Xi; θ)/∂θ² ) (θ̂n − θ) + · · ·.
The third and higher-order terms in the right-hand side are negligible as n −→ ∞, which implies that the asymptotic distribution of (1/√n) Σ_{i=1}^{n} ∂ log f(Xi; θ)/∂θ is equivalent to that of
−(1/√n) Σ_{i=1}^{n} ( ∂² log f(Xi; θ)/∂θ² ) (θ̂n − θ).

The law of large numbers indicates as follows:
−(1/n) Σ_{i=1}^{n} ∂² log f(Xi; θ)/∂θ² −→ −E( ∂² log f(Xi; θ)/∂θ² ) = 1/σ²(θ),
where the last equality comes from (14).
Thus, we have the following relationship:
−(1/n) Σ_{i=1}^{n} ( ∂² log f(Xi; θ)/∂θ² ) √n (θ̂n − θ) −→ (1/σ²(θ)) √n (θ̂n − θ) −→ N( 0, 1/σ²(θ) ).
Therefore, the asymptotic normality of the maximum likelihood estimator is obtained as follows:
√n (θ̂n − θ) −→ N( 0, σ²(θ) ).
Thus, (23) is obtained.

7.5.2 Least Squares Estimation Method (࠷খೋ๏)

X1, X2, · · ·, Xn are mutually independently distributed with mean µ.
x1, x2, · · ·, xn are generated from X1, X2, · · ·, Xn, respectively.
Solve the following problem:
min_µ S(µ), where S(µ) = Σ_{i=1}^{n} (xi − µ)².
Let µ̂ be the least squares estimate of µ.
dS(µ)/dµ = 0  =⇒  µ̂ = (1/n) Σ_{i=1}^{n} xi.
The least squares estimator is given by:
µ̂ = (1/n) Σ_{i=1}^{n} Xi,
which is equivalent to the MLE.

7.5.3 Method of Moments (ੵ๏)

The distribution of Xi is f(x; θ).
Let µ′k be the kth moment.
From the definition of the kth moment, E(X^k) = µ′k.

Example: θ = (µ, σ²): Because we have two parameters, we use the 1st and 2nd moments.
µ′1 = E(X) = µ,
µ′2 = E(X²) = V(X) + (E(X))² = σ² + µ².
Equating these population moments to the corresponding sample moments gives the estimates:
µ̂ = (1/n) Σ_{i=1}^{n} xi = x̄,
σ̂² = (1/n) Σ_{i=1}^{n} xi² − µ̂² = (1/n) Σ_{i=1}^{n} xi² − x̄² = (1/n) Σ_{i=1}^{n} (xi − x̄)².
Estimators:
µ̂ = (1/n) Σ_{i=1}^{n} Xi = X̄,  σ̂² = (1/n) Σ_{i=1}^{n} (Xi − X̄)².
7.6 Interval Estimation

In Sections 7.1 – 7.5, the point estimation is discussed.
It is important to know where the true parameter value of θ is likely to lie.
Suppose that the population distribution is given by f(x; θ).

Using the random sample X1, X2, · · ·, Xn drawn from the population distribution, we construct the two statistics, say, θU(X1, X2, · · ·, Xn) and θL(X1, X2, · · ·, Xn), where
P( θL(X1, X2, · · ·, Xn) < θ < θU(X1, X2, · · ·, Xn) ) = 1 − α.     (27)
(27) implies that θ lies on the interval ( θL(X1, X2, · · ·, Xn), θU(X1, X2, · · ·, Xn) ) with probability 1 − α.
Now, we replace the random variables X1, X2, · · ·, Xn by the experimental values x1, x2, · · ·, xn.
Then, the interval
( θL(x1, x2, · · ·, xn), θU(x1, x2, · · ·, xn) )
is called the 100 × (1 − α)% confidence interval (৴པ۠ؒ) of θ.
Thus, estimating the interval is known as the interval estimation (۠ؒਪఆ), which is distinguished from the point estimation.
In the interval, θL(x1, x2, · · ·, xn) is known as the lower bound of the confidence interval, while θU(x1, x2, · · ·, xn) is the upper bound of the confidence interval.

Given probability α, the θL(X1, X2, · · ·, Xn) and θU(X1, X2, · · ·, Xn) which satisfy equation (27) are not unique.
For estimation of the unknown parameter θ, it is more optimal to minimize the width of the confidence interval.
Therefore, we should choose θL and θU which minimize the width θU(X1, X2, · · ·, Xn) − θL(X1, X2, · · ·, Xn).
Interval Estimation of X̄: Let X1, X2, · · ·, Xn be mutually independently and identically distributed random variables.
Xi has a distribution with mean µ and variance σ².
From the central limit theorem,
(X̄ − µ) / ( σ/√n ) −→ N(0, 1).
Replacing σ² by its estimator S² (or S∗∗²),
(X̄ − µ) / ( S/√n ) −→ N(0, 1).

Therefore, when n is large enough,
P( z∗ < (X̄ − µ)/( S/√n ) < z∗∗ ) = 1 − α,
where z∗ and z∗∗ (z∗ < z∗∗) are percent points from the standard normal density function.
Solving the inequality above with respect to µ, the following expression is obtained:
P( X̄ − z∗∗ S/√n < µ < X̄ − z∗ S/√n ) = 1 − α,
where θ̂L and θ̂U correspond to X̄ − z∗∗ S/√n and X̄ − z∗ S/√n, respectively.
The length of the confidence interval is given by:
( X̄ − z∗ S/√n ) − ( X̄ − z∗∗ S/√n ) = ( z∗∗ − z∗ ) S/√n.
Solving the minimization problem above, we can obtain the condition that f(z∗) = f(z∗∗), where f(·) denotes the standard normal density; that is, the shortest interval is the symmetric one with z∗∗ = −z∗ = z_{α/2}.
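The following sketch is not part of the note (it assumes NumPy and SciPy and uses artificial data); it computes the symmetric large-sample interval x̄ ± z_{α/2} s/√n that results from the choice z∗∗ = −z∗ = z_{α/2}.

```python
# Sketch (not from the note): symmetric large-sample 95% confidence interval for µ.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
x = rng.normal(50.0, 10.0, size=400)          # artificial data
n, alpha = x.size, 0.05

xbar = x.mean()
s = x.std(ddof=1)                             # S, divisor n−1
z = norm.ppf(1 - alpha / 2)                   # z_{α/2}, about 1.96

lower, upper = xbar - z * s / np.sqrt(n), xbar + z * s / np.sqrt(n)
print(f"{100*(1-alpha):.0f}% confidence interval for mu: ({lower:.2f}, {upper:.2f})")
```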
8 Testing Hypothesis (Ծઆݕఆ)

8.1 Basic Concepts in Testing Hypothesis

Given the population distribution f(x; θ), we want to judge from the observed values x1, x2, · · ·, xn whether the hypothesis on the parameter θ, e.g. θ = θ0, is correct or not.
The hypothesis that we want to test is called the null hypothesis (ؼແԾઆ), which is denoted by H0 : θ = θ0.
The hypothesis against the null hypothesis, e.g. θ ≠ θ0, is called the alternative hypothesis (ରཱԾઆ), which is denoted by H1 : θ ≠ θ0.
Type I and Type II Errors (ୈҰछͷޡΓɼୈೋछͷޡΓ): When we test the null hypothesis H0, as shown in Table 1 we have four cases, i.e.,
(i) we accept H0 when H0 is true,
(ii) we reject H0 when H0 is true,
(iii) we accept H0 when H0 is false, and
(iv) we reject H0 when H0 is false.
(i) and (iv) are correct judgments, while (ii) and (iii) are not correct.
(ii) is called a type I error (ୈҰछͷޡΓ) and (iii) is called a type II error (ୈೋछͷޡΓ).

The probability with which a type I error occurs is called the significance level (༗ҙਫ४), which is denoted by α, and the probability of committing a type II error is denoted by β.
The probability of (iv) is called the power (ݕग़ྗ) or the power function (ݕग़ྗؔ), because it is a function of the parameter θ.
Table 1 (fragment). Rejection of H0: when H0 is true, this is a Type I Error (ୈҰछͷޡΓ), with probability α = significance level (༗ҙਫ४); when H0 is false, it is a correct judgment, with probability 1 − β = power (ݕग़ྗ).

2. Derive a distribution function of the test statistic when H0 is true.
3. From the observed data, compute the observed value of the test statistic.
4. Compare the distribution and the observed value of the test statistic.
When the observed value of the test statistic is in the tails of the distribution,
we consider that H0 is not likely to occur and we reject H0.

The region where H0 is unlikely to occur and accordingly H0 is rejected is called the rejection region (غ٫Ҭ) or the critical region, denoted by R.
Using the rejection region R and the acceptance region A, the type I and II errors and the power are formulated as follows.
The probability of committing a type I error (ୈҰछͷޡΓ), i.e., the significance level (༗ҙਫ४) α, is given by:
P( T(X1, X2, · · ·, Xn) ∈ R | H0 is true ) = α,
which is the probability that we reject H0 when H0 is true.
Conventionally, the significance level α = 0.1, 0.05, 0.01 is chosen in practice.

The probability of committing a type II error (ୈೋछͷޡΓ), i.e., β, is represented as:
P( T(X1, X2, · · ·, Xn) ∈ A | H0 is not true ) = β,
which corresponds to the probability that we accept H0 when H0 is not true.
The power (ݕग़ྗɼ·ͨɼݕఆྗ) is defined as 1 − β.
8.2 Power Function (ݕग़ྗؔ)

Let X1, X2, · · ·, Xn be mutually independently, identically and normally distributed with mean µ and variance σ².
Assume that σ² is known.
In Figure 3, we consider the null hypothesis H0 : µ = µ0 against the alternative hypothesis H1 : µ = µ1.

Figure 3: Type I Error (α) and Type II Error (β). The figure shows the distribution under the null hypothesis (H0) and, to its right, the distribution under the alternative hypothesis (H1), separated by the critical value f∗. As α becomes small, i.e., as f∗ moves to the right, β becomes large.
The dark shaded area (probability α) corresponds to the probability of a type I error, i.e., the significance level, while the light shaded area (probability β) indicates the probability of a type II error.
The probability on the right-hand side of f∗ in the distribution under H1 represents the power of the test, i.e., 1 − β.

The distribution of the sample mean X̄ is given by:
X̄ ∼ N( µ, σ²/n ).
By normalization, we have:
(X̄ − µ) / ( σ/√n ) ∼ N(0, 1).
Therefore, under the null hypothesis H0 : µ = µ0, we obtain:
(X̄ − µ0) / ( σ/√n ) ∼ N(0, 1),
where µ is replaced by µ0.
Since the significance level α is the probability of rejecting H0 when H0 is true, it is given by:
α = P( X̄ > µ0 + zα σ/√n ),
where zα denotes the 100 × α percent point of N(0, 1).
Therefore, the rejection region is given by: X̄ > µ0 + zα σ/√n.

Since the power 1 − β is the probability of rejecting H0 when H1 is true, it is given by:
1 − β = P( X̄ > µ0 + zα σ/√n ) = P( (X̄ − µ1)/(σ/√n) > (µ0 − µ1)/(σ/√n) + zα )
      = 1 − P( (X̄ − µ1)/(σ/√n) < (µ0 − µ1)/(σ/√n) + zα ) = 1 − F( (µ0 − µ1)/(σ/√n) + zα ),
where F(·) represents the standard normal cumulative distribution function, which is given by:
F(x) = ∫_{−∞}^{x} (2π)^{−1/2} exp( −t²/2 ) dt.
The power function is a function of µ1, given µ0 and α.
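A small sketch (not from the note; SciPy assumed) evaluates this power function for a few values of µ1; at µ1 = µ0 the power equals the significance level α, and it rises toward one as µ1 moves away from µ0.

```python
# Sketch (not from the note): power 1 − β = 1 − F((µ0 − µ1)/(σ/√n) + z_α)
# for the one-sided test of H0: µ = µ0.
import numpy as np
from scipy.stats import norm

mu0, sigma, n, alpha = 0.0, 1.0, 25, 0.05
z_alpha = norm.ppf(1 - alpha)                 # upper 100·α % point of N(0, 1)

for mu1 in [0.0, 0.1, 0.2, 0.3, 0.5]:
    power = 1 - norm.cdf((mu0 - mu1) / (sigma / np.sqrt(n)) + z_alpha)
    print(f"mu1 = {mu1:.1f}   power = {power:.3f}")
```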
Known σ²: Let X1, X2, · · ·, Xn be mutually independently, identically and normally distributed with µ and σ².
Consider testing the null hypothesis H0 : µ = µ0.
When the null hypothesis H0 is true, the distribution of X̄ is:
(X̄ − µ0) / ( σ/√n ) ∼ N(0, 1).

1. The alternative hypothesis H1 : µ < µ0 (one-sided testɼยଆݕఆ): We have
P( (X̄ − µ0)/(σ/√n) < −zα ) = α.
Therefore, when (x̄ − µ0)/(σ/√n) < −zα, we reject the null hypothesis H0 : µ = µ0 at the significance level α.

2. The alternative hypothesis H1 : µ > µ0 (one-sided testɼยଆݕఆ): We have
P( (X̄ − µ0)/(σ/√n) > zα ) = α.
Therefore, when (x̄ − µ0)/(σ/√n) > zα, we reject the null hypothesis H0 : µ = µ0 at the significance level α.

3. The alternative hypothesis H1 : µ ≠ µ0 (two-sided testɼ྆ଆݕఆ): We have
P( |X̄ − µ0|/(σ/√n) > zα/2 ) = α.
Therefore, when |x̄ − µ0|/(σ/√n) > zα/2, we reject the null hypothesis H0 : µ = µ0 at the significance level α.

Unknown σ²: Let X1, X2, · · ·, Xn be mutually independently, identically and normally distributed with µ and σ².
Test the null hypothesis H0 : µ = µ0.
When the null hypothesis H0 is true, the distribution of X̄ is given by:
(X̄ − µ0) / ( S/√n ) ∼ t(n − 1).
Therefore, the test statistic is given by: (X̄ − µ0) / ( S/√n ).
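As an illustration (not part of the note; NumPy and SciPy assumed, artificial data), the two-sided t test of the unknown-σ² case can be carried out as follows.

```python
# Sketch (not from the note): two-sided t test of H0: µ = µ0 with statistic
# (x̄ − µ0)/(s/√n) compared with the t(n−1) critical value.
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(7)
x = rng.normal(10.5, 2.0, size=30)            # artificial data
mu0, alpha, n = 10.0, 0.05, x.size

t_stat = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
t_crit = t.ppf(1 - alpha / 2, df=n - 1)       # two-sided critical value

print(f"t statistic = {t_stat:.3f}, critical value = +/-{t_crit:.3f}")
print("reject H0" if abs(t_stat) > t_crit else "do not reject H0")
```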
8.3.2 Testing Hypothesis on Variance

Testing Hypothesis on Variance: Let X1, X2, · · ·, Xn be mutually independently, identically and normally distributed with µ and σ².
Test the null hypothesis H0 : σ² = σ0².
When the null hypothesis H0 is true, the distribution of S² is given by:
(n − 1) S² / σ0² ∼ χ²(n − 1).

Testing Equality of Two Variances: Let X1, X2, · · ·, Xn be mutually independently, identically and normally distributed with µx and σx².
Let Y1, Y2, · · ·, Ym be mutually independently, identically and normally distributed with µy and σy².
Test the null hypothesis H0 : σx² = σy².
(n − 1) Sx² / σx² ∼ χ²(n − 1), where Sx² = (1/(n−1)) Σ_{i=1}^{n} (Xi − X̄)².
(m − 1) Sy² / σy² ∼ χ²(m − 1), where Sy² = (1/(m−1)) Σ_{i=1}^{m} (Yi − Ȳ)².
Both are independent.
Then, the ratio of the two χ² random variables, each divided by its degrees of freedom, is:
[ ((n − 1) Sx²/σx²) / (n − 1) ] / [ ((m − 1) Sy²/σy²) / (m − 1) ] = (Sx²/σx²) / (Sy²/σy²) ∼ F(n − 1, m − 1).
Therefore, under the null hypothesis H0 : σx² = σy²,
Sx² / Sy² ∼ F(n − 1, m − 1).

8.4 Large Sample Test (େඪຊݕఆ)

• Wald Test (ϫϧυݕఆ)
• Likelihood Ratio Test (ൺݕఆ)
• Lagrange Multiplier Test (ϥάϥϯδΣݕఆ) −→ Skipped in this class.
8.4.1 Wald Test (ϫϧυݕఆ)

From (24), under the null hypothesis H0 : θ = θ0 (scalar case), as n −→ ∞, the maximum likelihood estimator θ̂n is distributed as:
(θ̂n − θ0) / ( σ(θ̂n)/√n ) −→ N(0, 1).
Or, equivalently,
( (θ̂n − θ0) / ( σ(θ̂n)/√n ) )² −→ χ²(1).

For H0 : θ = θ0 and H1 : θ ≠ θ0, replacing X1, · · ·, Xn in θ̂n by the observed values x1, · · ·, xn, the testing procedure is as follows.
When we have:
( (θ̂n − θ0) / ( σ(θ̂n)/√n ) )² > χ²α(1),
we reject the null hypothesis H0 at the significance level α.
χ²α(1) denotes the 100 × α % point of the χ² distribution with one degree of freedom.
This testing procedure is called the Wald test (ϫϧυݕఆ).
Example 1.18: X1, X2, · · ·, Xn are mutually independently, identically and exponentially distributed.
Consider the following exponential probability density function:
f(x; γ) = γ e^{−γx}, for 0 < x < ∞.

Generally, as n −→ ∞, the distribution of the maximum likelihood estimator of the parameter γ, γ̂n, is asymptotically represented as:
(γ̂n − γ) / ( σ(γ̂n)/√n ) −→ N(0, 1).
Therefore, under the null hypothesis H0 : γ = γ0, when n is large enough, we have the following distribution:
( (γ̂n − γ0) / ( σ(γ̂n)/√n ) )² −→ χ²(1).
As for the null hypothesis H0 : γ = γ0 against the alternative hypothesis H1 : γ ≠ γ0, if we have:
( (γ̂n − γ0) / ( σ(γ̂n)/√n ) )² > χ²α(1),
we can reject H0 at the significance level α.
We need to derive σ²(γ) and γ̂n for the testing procedure.

First, σ²(γ) is given by:
σ²(γ) = [ −E( d² log f(X; γ)/dγ² ) ]^{−1} = γ².
Note that the first and second derivatives of log f(X; γ) with respect to γ are given by:
d log f(X; γ)/dγ = 1/γ − X,  d² log f(X; γ)/dγ² = −1/γ².

Next, the maximum likelihood estimator of γ, i.e., γ̂n, is obtained as follows.
Since X1, X2, · · ·, Xn are mutually independently and identically distributed, the likelihood function l(γ) is given by:
l(γ) = Π_{i=1}^{n} f(xi; γ) = Π_{i=1}^{n} γ e^{−γ xi} = γ^n e^{−γ Σ_{i=1}^{n} xi}.
Therefore, the log-likelihood function is written as:
log l(γ) = n log(γ) − γ Σ_{i=1}^{n} xi.
We obtain the value of γ which maximizes log l(γ).
Solving the following equation:
d log l(γ)/dγ = n/γ − Σ_{i=1}^{n} xi = 0,
the MLE of γ, γ̂n, is represented as:
γ̂n = n / Σ_{i=1}^{n} Xi = 1/X̄.
Then, we have the following:
(γ̂n − γ) / ( σ(γ̂n)/√n ) = (γ̂n − γ) / ( γ̂n/√n ) −→ N(0, 1),
where γ̂n is given by 1/X̄.
Or, equivalently,
( (γ̂n − γ) / ( σ(γ̂n)/√n ) )² = ( (γ̂n − γ) / ( γ̂n/√n ) )² −→ χ²(1).
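A compact sketch (not in the note; NumPy and SciPy assumed, artificial data) carries out this Wald test for exponential data with γ̂n = 1/x̄ and σ(γ̂n) = γ̂n.

```python
# Sketch (not from the note): Wald test of H0: γ = γ0 for exponential data,
# statistic ((γ̂ − γ0)/(γ̂/√n))² compared with the χ²_α(1) critical value.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(8)
gamma_true, gamma0, n, alpha = 1.5, 1.0, 200, 0.05

x = rng.exponential(scale=1.0 / gamma_true, size=n)   # exponential with rate γ
gamma_hat = 1.0 / x.mean()                             # MLE: γ̂ = 1/x̄

wald = ((gamma_hat - gamma0) / (gamma_hat / np.sqrt(n))) ** 2
crit = chi2.ppf(1 - alpha, df=1)                       # χ²_α(1), about 3.84

print(f"Wald statistic = {wald:.2f}, critical value = {crit:.2f}")
print("reject H0" if wald > crit else "do not reject H0")
```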
For H0 : γ = γ0 and H1 : γ ≠ γ0, when we have:
( (γ̂n − γ0) / ( γ̂n/√n ) )² > χ²α(1),
we reject H0 at the significance level α.

8.4.2 Likelihood Ratio Test (ൺݕఆ)

Suppose that the population distribution is given by f(x; θ), where θ = (θ1, θ2).
Consider testing the null hypothesis H0 : θ1 = θ1∗ against the alternative hypothesis H1 : θ1 ≠ θ1∗, using the observed values (x1, · · ·, xn) corresponding to the random sample (X1, · · ·, Xn).
Since we take the null hypothesis as H0 : θ1 = θ1∗, the number of restrictions is given by k1, which is equal to the dimension of θ1.
The likelihood function is written as:
l(θ1, θ2) = Π_{i=1}^{n} f(xi; θ1, θ2).
Let (θ̃1, θ̃2) be the maximum likelihood estimator of (θ1, θ2).

That is, (θ̃1, θ̃2) indicates the solution of (θ1, θ2), obtained from the following equations:
∂l(θ1, θ2)/∂θ1 = 0,  ∂l(θ1, θ2)/∂θ2 = 0.
The solution (θ̃1, θ̃2) is called the unconstrained maximum likelihood estimator (੍ͳ͠࠷ਪఆྔ), because the null hypothesis H0 : θ1 = θ1∗ is not taken into account.
Let θ̂2 be the maximum likelihood estimator of θ2 under the null hypothesis H0 : θ1 = θ1∗.
That is, θ̂2 is a solution of the following equation:
∂l(θ1∗, θ2)/∂θ2 = 0.
The solution θ̂2 is called the constrained maximum likelihood estimator.

Define λ as follows:
λ = l(θ1∗, θ̂2) / l(θ̃1, θ̃2),
which is called the likelihood ratio (ൺ).
Then, as n −→ ∞,
−2 log(λ) −→ χ²(k1).
Let χ²α(k1) be the 100 × α percent point of the chi-square distribution with k1 degrees of freedom.
When −2 log(λ) > χ²α(k1), we reject the null hypothesis H0 : θ1 = θ1∗ at the significance level α.
This test is called the likelihood ratio test (ൺݕఆ).
If −2 log(λ) is close to zero, we accept the null hypothesis.
When (θ1∗, θ̂2) is close to (θ̃1, θ̃2), −2 log(λ) approaches zero.

Example 1.19: X1, X2, · · ·, Xn are mutually independently, identically and exponentially distributed.
Consider the exponential probability density function:
f(x; γ) = γ e^{−γx},
for 0 < x < ∞.
Using the likelihood ratio test, we test the null hypothesis H0 : γ = γ0 against the alternative hypothesis H1 : γ ≠ γ0.
The likelihood ratio is given by:
λ = l(γ0) / l(γ̂n),
where γ̂n is derived in Example 1.18, i.e.,
γ̂n = n / Σ_{i=1}^{n} Xi = 1/X̄.
Since the number of the constraints is equal to one, as the sample size n goes to infinity we have the following asymptotic distribution:
−2 log λ −→ χ²(1).

The likelihood ratio is computed as follows:
λ = l(γ0) / l(γ̂n) = γ0^n e^{−γ0 Σ Xi} / ( γ̂n^n e^{−n} ).
If −2 log λ > χ²α(1), we reject the null hypothesis H0 : γ = γ0 at the significance level α.
Example 1.20: Suppose that X1, X2, · · ·, Xn are mutually independently, identically and normally distributed with mean µ and variance σ².
The normal probability density function with mean µ and variance σ² is given by:
f(x; µ, σ²) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}.
By the likelihood ratio test, we test the null hypothesis H0 : µ = µ0 against the alternative hypothesis H1 : µ ≠ µ0.

The likelihood ratio is given by:
λ = l(µ0, σ̃²) / l(µ̂, σ̂²),
where σ̃² is the constrained maximum likelihood estimator with the constraint µ = µ0, while (µ̂, σ̂²) denotes the unconstrained maximum likelihood estimator.
In this case, since the number of the constraints is one, the asymptotic distribution is as follows:
−2 log λ −→ χ²(1).
We derive l(µ0, σ̃²) and l(µ̂, σ̂²). l(µ, σ²) is written as:
l(µ, σ²) = f(x1, x2, · · ·, xn; µ, σ²) = Π_{i=1}^{n} f(xi; µ, σ²)
         = Π_{i=1}^{n} (1/√(2πσ²)) exp( −(1/(2σ²)) (xi − µ)² )
         = (2πσ²)^{−n/2} exp( −(1/(2σ²)) Σ_{i=1}^{n} (xi − µ)² ).
The log-likelihood function log l(µ, σ²) is represented as:
log l(µ, σ²) = −(n/2) log(2π) − (n/2) log(σ²) − (1/(2σ²)) Σ_{i=1}^{n} (xi − µ)².

For the numerator of the likelihood ratio, under the constraint µ = µ0, maximize log l(µ0, σ²) with respect to σ².
Since we obtain the first derivative:
∂ log l(µ0, σ²)/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^{n} (xi − µ0)² = 0,
the constrained maximum likelihood estimate σ̃² is:
σ̃² = (1/n) Σ_{i=1}^{n} (xi − µ0)².
Therefore, replacing σ² by σ̃², l(µ0, σ̃²) is written as:
l(µ0, σ̃²) = (2πσ̃²)^{−n/2} exp( −(1/(2σ̃²)) Σ_{i=1}^{n} (xi − µ0)² )
          = (2πσ̃²)^{−n/2} exp( −n/2 ).

For the denominator of the likelihood ratio, because the unconstrained maximum likelihood estimates are obtained as:
µ̂ = (1/n) Σ_{i=1}^{n} xi,  σ̂² = (1/n) Σ_{i=1}^{n} (xi − µ̂)²,
l(µ̂, σ̂²) is written as:
l(µ̂, σ̂²) = (2πσ̂²)^{−n/2} exp( −(1/(2σ̂²)) Σ_{i=1}^{n} (xi − µ̂)² )
         = (2πσ̂²)^{−n/2} exp( −n/2 ).
Therefore, the likelihood ratio is λ = l(µ0, σ̃²)/l(µ̂, σ̂²) = ( σ̂²/σ̃² )^{n/2}, so that −2 log λ = n log( σ̃²/σ̂² ).
When −2 log λ > χ²α(1), we reject the null hypothesis H0 : µ = µ0 at the significance level α.
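To close, a short sketch (not part of the note; NumPy and SciPy assumed, artificial data) carries out the likelihood ratio test of Example 1.20 using −2 log λ = n log(σ̃²/σ̂²).

```python
# Sketch (not from the note): likelihood ratio test of H0: µ = µ0 for normal data.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(9)
x = rng.normal(0.3, 1.0, size=100)             # artificial sample
mu0, alpha, n = 0.0, 0.05, x.size

sigma2_tilde = np.mean((x - mu0) ** 2)         # constrained MLE of σ² (µ = µ0)
sigma2_hat = np.mean((x - x.mean()) ** 2)      # unconstrained MLE of σ²

lr_stat = n * np.log(sigma2_tilde / sigma2_hat)   # −2 log λ
crit = chi2.ppf(1 - alpha, df=1)

print(f"-2 log lambda = {lr_stat:.2f}, critical value = {crit:.2f}")
print("reject H0" if lr_stat > crit else "do not reject H0")
```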