Chapter 1

Statistics 131: Parametric Statistical Inference

0. REVIEW OF PROBABILITY THEORY

0.1 JOINT AND MARGINAL DISTRIBUTIONS

1. Let (Ω, A, P) be a probability space. A random vector (X1, X2, …, Xk)′ is a function with domain Ω
and counterdomain ℝ^k, that is, (X1, X2, …, Xk)′ : Ω → ℝ^k, such that for any set of k real numbers, say
x1, x2, …, xk, the set {ω ∈ Ω : X1(ω) ≤ x1, X2(ω) ≤ x2, …, Xk(ω) ≤ xk} ∈ A.

2. In the bivariate case, i.e., for k = 2, for a probability space (Ω, A, P), we define a bivariate random
vector (X, Y)′ : Ω → ℝ^2, such that for any pair of real numbers, say u and v,
{ω ∈ Ω : X(ω) ≤ u, Y(ω) ≤ v} ∈ A.

3. Let X1, X2, …, Xk be k random variables, all defined on the same probability space (Ω, A, P). The
joint cumulative distribution function (or joint CDF), or simply joint distribution function, of the
random variables X1, X2, …, Xk, denoted by F_{X1,X2,…,Xk}(·, ·, …, ·), or simply F_{X1,X2,…,Xk}, is defined as

   F_{X1,X2,…,Xk}(x1, x2, …, xk) = P(X1 ≤ x1, X2 ≤ x2, …, Xk ≤ xk),  ∀ (x1, x2, …, xk)′ ∈ ℝ^k.

Properties of Joint Distribution Functions

Let F_{X,Y} be the joint CDF of two random variables X and Y.


1. Boundedness:

   F_{X,Y}(−∞, v) = lim_{u→−∞} F_{X,Y}(u, v) = 0
   F_{X,Y}(u, −∞) = lim_{v→−∞} F_{X,Y}(u, v) = 0
   F_{X,Y}(+∞, +∞) = lim_{u→∞, v→∞} F_{X,Y}(u, v) = 1

   Thus, ∀ (u, v)′ ∈ ℝ^2, 0 ≤ F_{X,Y}(u, v) ≤ 1.

2. Monotonicity:
If a, b, c, d are any real numbers such that a < b and c < d, then

   P(a < X ≤ b, c < Y ≤ d) = F_{X,Y}(b, d) − F_{X,Y}(b, c) − F_{X,Y}(a, d) + F_{X,Y}(a, c) ≥ 0.

Note: The result is analogous to P(a < X ≤ b) = F(b) − F(a) in the univariate case.

3. Continuity from the Right:


   F_{X,Y} is continuous from the right in each of its arguments. That is, for any fixed x ∈ ℝ, F_{X,Y}
   is continuous from the right in y, and for any fixed y ∈ ℝ, F_{X,Y} is continuous from the right in
   x. That is,

      lim_{h↓0} F_{X,Y}(u + h, v) = F_{X,Y}(u⁺, v) = F_{X,Y}(u, v)

      lim_{h↓0} F_{X,Y}(u, v + h) = F_{X,Y}(u, v⁺) = F_{X,Y}(u, v).

4. Given a probability space (Ω, A, P), a k-dimensional random vector (X1, X2, …, Xk)′ is defined to be
a k-dimensional discrete random vector if, and only if, it can assume values only at a countable number
of points (x1, x2, …, xk)′ in the k-dimensional Euclidean space ℝ^k.

5. If (X1, X2, …, Xk)′ is a k-dimensional discrete random vector, then the joint probability mass function
(or joint PMF), or joint discrete density function, of the random variables X1, X2, …, Xk, denoted by
p_{X1,X2,…,Xk}(·, ·, …, ·) or p_X(·), is defined as

   p_X(x) = p_{X1,X2,…,Xk}(x1, x2, …, xk) = P(X1 = x1, X2 = x2, …, Xk = xk) > 0,

where x = (x1, x2, …, xk)′ is a possible value of X = (X1, X2, …, Xk)′; and is defined to be 0 otherwise.

6. The collection of all points x = (x1, x2, …, xk)′ for which the joint PMF is strictly greater than zero, i.e.,
p_X(x) = p_{X1,X2,…,Xk}(x1, x2, …, xk) > 0, is called the set of mass points of the discrete random vector
X = (X1, X2, …, Xk)′.

7. A Multinomial experiment is one that possesses the following properties:

a. The experiment consists of n repeated trials.


b. Each trial can result in any one of (k+1) distinct possible outcomes, denoted by E1, E2, …, E_{k+1}.

c. The probability of the ith possible outcome Ei is P(Ei) = pi, ∀ i = 1, 2, …, k+1.

d. The repeated trials are independent.

8. If, in a Multinomial experiment, Xi denotes the number of trials (out of n) that result in the outcome
Ei, ∀ i = 1, 2, …, k+1, then the random variables X1, X2, …, Xk (excluding X_{k+1}) are said to have a
Multinomial distribution, with joint PMF:

   p_{X1,X2,…,Xk}(x1, x2, …, xk; p1, p2, …, pk, n)
      = [ n! / (x1! x2! ⋯ x_{k+1}!) ] p1^{x1} p2^{x2} ⋯ p_{k+1}^{x_{k+1}} ∏_{i=1}^{k} I_{{0,1,…,n}}(xi),

   with ∑_{i=1}^{k+1} xi = n and ∑_{i=1}^{k+1} pi = 1.
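
Illustration (Python): a minimal sketch evaluating the Multinomial joint PMF directly from the formula
above; the trial count, cell probabilities, and counts used below are arbitrary.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """P(X1 = x1, ..., X_{k+1} = x_{k+1}) for a Multinomial experiment with n = sum(counts)."""
    n = sum(counts)
    coeff = factorial(n) / prod(factorial(x) for x in counts)   # n! / (x1! x2! ... x_{k+1}!)
    return coeff * prod(p ** x for x, p in zip(counts, probs))

# Example: n = 10 trials, three outcomes with probabilities 0.2, 0.3, 0.5
print(multinomial_pmf([2, 3, 5], [0.2, 0.3, 0.5]))   # ~0.08505
```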

9. A generalized Hypergeometric experiment is one that possesses the following properties:

a. A sample of size n is taken randomly without replacement from a population with N elements.
b. N1 of the N population elements are of a 1st kind, N2 of the N population elements are of a 2nd kind,
and so on, and N_{k+1} of the N population elements are of a (k+1)th kind.

10. If, in a generalized Hypergeometric experiment, Xi denotes the number of sample elements that are of
the ith kind, ∀ i = 1, 2, …, k+1, then the random variables X1, X2, …, Xk (excluding X_{k+1}) are said to
have a generalized Hypergeometric distribution, with joint PMF:

   p_{X1,X2,…,Xk}(x1, x2, …, xk; N, N1, N2, …, Nk, n) = [ C(N1, x1) C(N2, x2) ⋯ C(N_{k+1}, x_{k+1}) ] / C(N, n),

   with ∑_{i=1}^{k+1} xi = n and ∑_{i=1}^{k+1} Ni = N,

   where C(a, b) denotes the binomial coefficient "a choose b".
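
Illustration (Python): a minimal sketch evaluating the generalized Hypergeometric joint PMF with
binomial coefficients; the population split and sample counts below are arbitrary.

```python
from math import comb, prod

def gen_hypergeom_pmf(counts, group_sizes):
    """P(X1 = x1, ..., X_{k+1} = x_{k+1}) when sampling n = sum(counts) without replacement."""
    N, n = sum(group_sizes), sum(counts)
    return prod(comb(Ni, xi) for Ni, xi in zip(group_sizes, counts)) / comb(N, n)

# Example: N = 20 items split 8/7/5, sample of n = 6 with 3, 2, and 1 of each kind
print(gen_hypergeom_pmf([3, 2, 1], [8, 7, 5]))   # C(8,3)*C(7,2)*C(5,1) / C(20,6)
```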

11. Given a probability space (Ω, A, P), a k-dimensional random vector (X1, X2, …, Xk)′, k a positive
integer, is defined to be a (k-dimensional absolutely) continuous random vector if, and only if, there
exists a nonnegative function, denoted f_{X1,X2,…,Xk}(·, ·, …, ·) or f_X(·), such that for any
x = (x1, x2, …, xk)′ ∈ ℝ^k,

   F_X(x) = ∫_{−∞}^{xk} ∫_{−∞}^{x_{k−1}} ⋯ ∫_{−∞}^{x1} f_{X1,X2,…,Xk}(u1, u2, …, uk) du1 du2 ⋯ duk.

12. If (X1, X2, …, Xk)′ is a k-dimensional continuous random vector, then the nonnegative function
f_{X1,X2,…,Xk}(·, ·, …, ·) is called the joint probability density function (or joint PDF) of the random
variables X1, X2, …, Xk.

13. A bivariate continuous random vector (X, Y)′ is said to have a Bivariate Uniform distribution over
the region A ⊂ ℝ^2 if, and only if, the joint PDF of X and Y is given by:

   f_{X,Y}(x, y) = k,  ∀ (x, y) ∈ A,  and  f_{X,Y}(x, y) = 0 otherwise,

   where k = 1 / (area of A).

14. A bivariate continuous random vector (X, Y)′ is said to have a Bivariate Normal distribution if, and
only if, the joint PDF of X and Y is given by:

   f_{X,Y}(x, y) = 1 / (2π σ_X σ_Y √(1 − ρ²))
        × exp{ −[1 / (2(1 − ρ²))] [ ((x − μ_X)/σ_X)² − 2ρ((x − μ_X)/σ_X)((y − μ_Y)/σ_Y) + ((y − μ_Y)/σ_Y)² ] },

   where μ_X, μ_Y, σ_X, σ_Y and ρ are constants such that:

   −∞ < μ_X < ∞,  σ_X > 0,  −1 < ρ < 1,
   −∞ < μ_Y < ∞,  σ_Y > 0.

   We write (X, Y)′ ~ BVN(μ_X, μ_Y, σ_X², σ_Y², ρ).

15. Given a probability space, let X = (X1, X2, …, Xk)′ be a k-dimensional random vector with joint
distribution F_X(·) or F_{X1,X2,…,Xk}(·, ·, …, ·). For some i = 1, 2, …, k, the marginal (cumulative) distribution
function (or marginal CDF) of Xi, denoted F_{Xi}(·), is defined as:

   F_{Xi}(xi) = lim_{xj → ∞, ∀ j ≠ i} F_{X1,X2,…,Xk}(x1, x2, …, xi, …, xk)
             = F_{X1,X2,…,Xk}(∞, …, ∞, xi, ∞, …, ∞).

16. Given a k-dimensional discrete random vector X = (X1, X2, …, Xk)′, with joint PMF p_X(·) or
p_{X1,X2,…,Xk}(·, ·, …, ·). For some i = 1, 2, …, k, the marginal probability mass function (or marginal
PMF) of Xi, denoted p_{Xi}(·), is defined as:

   p_{Xi}(xi) = ∑ p_{X1,X2,…,Xk}(x1, x2, …, xi, …, xk),

   where the summation is taken with respect to all xj, j ≠ i.

17. Given a k-dimensional (absolutely) continuous random vector X = (X1, X2, …, Xk)′, with joint PDF
f_X(·) or f_{X1,X2,…,Xk}(·, ·, …, ·). For some i = 1, 2, …, k, the marginal probability density function (or
marginal PDF) of Xi, denoted f_{Xi}(·), is defined as:

   f_{Xi}(xi) = ∫ ⋯ ∫ f_{X1,X2,…,Xk}(x1, x2, …, xk) dx1 dx2 ⋯ dx_{i−1} dx_{i+1} ⋯ dxk,

   where the (k − 1) integrals are taken over the entire real line ℝ.
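
Illustration (Python): a minimal sketch showing how marginal PMFs are obtained from a hypothetical
bivariate joint PMF by summing the joint PMF over the other variable.

```python
joint_pmf = {  # (x, y): p_{X,Y}(x, y) -- illustrative values only
    (0, 0): 0.10, (0, 1): 0.20,
    (1, 0): 0.30, (1, 1): 0.40,
}

p_X, p_Y = {}, {}
for (x, y), p in joint_pmf.items():
    p_X[x] = p_X.get(x, 0.0) + p   # sum over y
    p_Y[y] = p_Y.get(y, 0.0) + p   # sum over x

print(p_X)  # ≈ {0: 0.3, 1: 0.7} (up to floating-point rounding)
print(p_Y)  # ≈ {0: 0.4, 1: 0.6}
```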

0.2 CONDITIONAL DISTRIBUTIONS AND STOCHASTIC INDEPENDENCE

1. Let X and Y be jointly discrete random variables, i.e., (X, Y)′ is a bivariate discrete random vector,
with joint PMF p_{X,Y}(·, ·). The conditional probability mass function (or conditional PMF) of Y
given X = x0, denoted p_{Y|X}(y | x0), is defined as:

   p_{Y|X}(y | x0) = P(Y = y | X = x0) = p_{X,Y}(x0, y) / p_X(x0),

   provided p_X(x0) = P(X = x0) > 0. Otherwise, it is undefined.

   Similarly, the conditional probability mass function (or conditional PMF) of X given Y = y0, denoted
   p_{X|Y}(x | y0), is defined as:

   p_{X|Y}(x | y0) = P(X = x | Y = y0) = p_{X,Y}(x, y0) / p_Y(y0),

   provided p_Y(y0) = P(Y = y0) > 0. Otherwise, it is undefined.



2. Let X and Y be jointly (absolutely) continuous random variables, i.e., (X, Y)′ is a bivariate continuous
random vector, with joint PDF f_{X,Y}(·, ·). The conditional probability density function (or conditional
PDF) of Y given X = x0, denoted by f_{Y|X}(y | x0), is defined as:

   f_{Y|X}(y | x0) = f_{X,Y}(x0, y) / f_X(x0),

   provided f_X(x0) > 0. Otherwise, it is undefined.

   Similarly, the conditional probability density function (or conditional PDF) of X given Y = y0, denoted
   by f_{X|Y}(x | y0), is defined as:

   f_{X|Y}(x | y0) = f_{X,Y}(x, y0) / f_Y(y0),

   provided f_Y(y0) > 0. Otherwise, it is undefined.

3. Let X1, X2, …, Xk be k random variables with joint CDF F_{X1,X2,…,Xk}(·, ·, …, ·). The random variables
X1, X2, …, Xk are said to be (stochastically, mutually) independent if, and only if, ∀ (x1, x2, …, xk)′ ∈ ℝ^k,

   F_{X1,X2,…,Xk}(x1, x2, …, xk) = F_{X1}(x1) F_{X2}(x2) ⋯ F_{Xk}(xk) = ∏_{j=1}^{k} F_{Xj}(xj).

   Otherwise, X1, X2, …, Xk are said to be dependent.

0.3 EXPECTATION OF SEVERAL RANDOM VARIABLES

1. Let X = (X1, X2, …, Xk)′ be a k-dimensional discrete or continuous random vector, with joint PMF
p_{X1,X2,…,Xk}(·, ·, …, ·) or joint PDF f_{X1,X2,…,Xk}(·, ·, …, ·), respectively.

   a. The expectation of the random vector X, denoted by E(X), or the joint expectation of the
      random variables X1, X2, …, Xk, is defined as:

      E(X) = ( E[X1], E[X2], …, E[Xk] )′,

      where E[Xi] is the expectation of Xi, ∀ i = 1, 2, …, k.

   b. The expectation of a function of the random vector X, say g(X), is defined as:

      E[g(X)] = ∑_{x ∈ A} g(x) p_X(x), for the discrete case, provided this sum is finite, where A is the
      set of mass points of X;

      E[g(X)] = ∫_{ℝ^k} g(x) f_X(x) dx, for the continuous case, provided g(X) is itself a random variable
      (or vector) and the integral exists.

Basic Properties of Expectation

Result #1. Expectation of Linear Combinations


Let X = (X1, X2, …, Xk)′ be a k-dimensional random vector, and let g1(·), g2(·), …, gm(·) be m real-
valued functions of X. Then

   E[ ∑_{i=1}^{m} ai gi(X) ] = ∑_{i=1}^{m} ai E[gi(X)], for any constants a1, a2, …, am.

Some particular cases:

1. E(X1 + X2 + ⋯ + Xk) = E(X1) + E(X2) + ⋯ + E(Xk)

2. E(aXi + b) = a E(Xi) + b

3. Bivariate Case: Let (X, Y)′ be a bivariate random vector. Then

   E(aX + bY) = a E(X) + b E(Y).

Result #2. Expectation of Products


Let X  ( X1 , X 2 , , X k )' be a k-dimensional random vector, with X1 , X 2 , , X k independent, and let
g i () be a function of X i alone,  i  1, 2, , k .

 k  k
E  ai gi ( X i )  
 i 1 
 a E  g ( X ) ,
i 1
i i i for any constants a1 , a2 , , ak .

Some particular cases:

1. E( X 1 X 2  X k )  E( X 1 ) E( X 2 ) E( X k ) , when X1 , X 2 , , X k independent.

2. Bivariate Case: Let ( X , Y )' be a bivariate random vector, with X and Y independent.

a. E( XY )  E( X ) E(Y ) .

b. E[ g ( X )h(Y )]  E[ g ( X )]E[h(Y )] .

Some Special Expectations

Let X = (X1, X2, …, Xk)′ be a k-dimensional random vector. Then

a. the covariance between any pair of random variables Xi and Xj, denoted by Cov(Xi, Xj), is defined as:

   Cov(Xi, Xj) = E{ [Xi − E(Xi)][Xj − E(Xj)] },

   provided the expectation exists, where E(Xi) and E(Xj) are the marginal expectations of Xi
   and Xj, respectively.

b. the correlation coefficient between any pair of random variables Xi and Xj, denoted by ρ(Xi, Xj),
   is defined as:

   ρ(Xi, Xj) = Cov(Xi, Xj) / √( V(Xi) V(Xj) ),

   provided the covariance exists, where V(Xi) > 0 and V(Xj) > 0 are the variances of Xi and Xj,
   respectively.

Some General Results of Linear Combinations

Result #1: If X1 and X2 are two random variables, then ∀ a1, a2 ∈ ℝ,

   V(a1 X1 + a2 X2) = a1² V(X1) + a2² V(X2) + 2 a1 a2 Cov(X1, X2).

Result #2: Let X1, X2, …, Xk be k random variables. Then ∀ a1, a2, …, ak ∈ ℝ,

   V( ∑_{i=1}^{k} ai Xi ) = ∑_{i=1}^{k} ai² V(Xi) + 2 ∑∑_{i<j} ai aj Cov(Xi, Xj).

Result #3: Let X1, X2, …, Xk and Y1, Y2, …, Yk be 2 sets of k random variables, and let
a1, a2, …, ak ∈ ℝ and b1, b2, …, bk ∈ ℝ be 2 sets of real numbers. Then,

   Cov( ∑_{i=1}^{k} ai Xi , ∑_{j=1}^{k} bj Yj ) = ∑_{i=1}^{k} ∑_{j=1}^{k} ai bj Cov(Xi, Yj).

Some Particular Cases:

1. V(X + Y) = V(X) + V(Y) + 2 Cov(X, Y).
2. V(X − Y) = V(X) + V(Y) − 2 Cov(X, Y).
3. V(aX + bY) = a² V(X) + b² V(Y) + 2ab Cov(X, Y).
4. V(X + Y) = V(X) + V(Y), if X and Y are independent.
5. V(X − Y) = V(X) + V(Y), if X and Y are independent.
6. V(aX + bY) = a² V(X) + b² V(Y), if X and Y are independent.
7. If X1, X2, …, Xk are uncorrelated random variables, then Result #2 reduces to:

   V( ∑_{i=1}^{k} ai Xi ) = ∑_{i=1}^{k} ai² V(Xi).

2. The Cauchy-Schwarz Inequality Theorem

   Let X and Y be two random variables with finite second (raw) moments. Then, the following inequality
   holds:

   E(|XY|) ≤ √( E(X²) E(Y²) ),

   with equality if, and only if, P(Y = a + bX) = 1 for some constants a and b, i.e., Y is a linear
   function of X with probability 1.

3. Let (X, Y)′ be a bivariate discrete or continuous random vector and let g(X, Y) be a function of X and
   Y. The conditional expectation of g(X, Y) given X = x0, denoted as E[g(X, Y) | X = x0], is defined
   as:

   E[g(X, Y) | X = x0] = ∑_y g(x0, y) p_{Y|X}(y | x0), for the discrete case;

   E[g(X, Y) | X = x0] = ∫ g(x0, y) f_{Y|X}(y | x0) dy, for the continuous case.

   Similarly, if (X, Y)′ is a bivariate discrete or continuous random vector and g(X, Y) is a function of
   X and Y, then the conditional expectation of g(X, Y) given Y = y0, denoted as E[g(X, Y) | Y = y0], is
   defined as:

   E[g(X, Y) | Y = y0] = ∑_x g(x, y0) p_{X|Y}(x | y0), for the discrete case;

   E[g(X, Y) | Y = y0] = ∫ g(x, y0) f_{X|Y}(x | y0) dx, for the continuous case.

   The definition extends to sub-vectors. If, for example, X(1) = (X1, X3)′ and X(2) = (X4, X6)′
   are sub-vectors of X, the conditional expectation of g(X(1), X(2)), given X(2) = x(2) = (x4, x6)′, is:

   E[g(X(1), X(2)) | X(2) = x(2)] = ∑_{x1} ∑_{x3} g(x1, x3, x4, x6) p_{X1,X3|X4,X6}(x1, x3 | x4, x6), discrete case;

   E[g(X(1), X(2)) | X(2) = x(2)] = ∫∫ g(x1, x3, x4, x6) f_{X1,X3|X4,X6}(x1, x3 | x4, x6) dx1 dx3, continuous case.

Notes:

Just as conditional distributions satisfy all the properties of ordinary distributions, conditional
expectations and conditional variances also satisfy all the properties of ordinary expectations and
ordinary variances, respectively. For instance, if ( X , Y )' is a bivariate random vector and a and b are
real numbers, we have the following results:

a. E(aX + b | Y = y0) = a E(X | Y = y0) + b
b. E(aY + b | X = x0) = a E(Y | X = x0) + b
c. V(X | Y = y0) = E(X² | Y = y0) − [E(X | Y = y0)]²
d. V(Y | X = x0) = E(Y² | X = x0) − [E(Y | X = x0)]²
e. V(aX + b | Y = y0) = a² V(X | Y = y0)
f. V(aY + b | X = x0) = a² V(Y | X = x0)

Also, if g1(·) and g2(·) are real-valued functions of only one of the random variables, we have the
following results:

a. E[g1(X) | X = x0] = g1(x0)
b. E[g2(Y) | Y = y0] = g2(y0)
c. E[g1(Y) + g2(Y) | X = x0] = E[g1(Y) | X = x0] + E[g2(Y) | X = x0]
d. E[g1(X) + g2(X) | Y = y0] = E[g1(X) | Y = y0] + E[g2(X) | Y = y0]
e. E[g1(X) g2(Y) | X = x0] = g1(x0) E[g2(Y) | X = x0]
f. E[g1(Y) g2(X) | Y = y0] = g1(y0) E[g2(X) | Y = y0]

4. Theorem: Let (X, Y)′ be a bivariate random vector and let g(·) be a real-valued function of Y (or X).
   Then

   E[g(Y)] = E_X{ E_Y[g(Y) | X] }   or   E[g(X)] = E_Y{ E_X[g(X) | Y] }.

   These results simplify to:

   E[g(Y)] = ∑_{x0} E[g(Y) | X = x0] p_X(x0), for the discrete case;
   E[g(Y)] = ∫ E[g(Y) | X = x] f_X(x) dx, for the continuous case;

   E[g(X)] = ∑_{y0} E[g(X) | Y = y0] p_Y(y0), for the discrete case;
   E[g(X)] = ∫ E[g(X) | Y = y] f_Y(y) dy, for the continuous case.

   Special Cases:

   1. E(Y) = E_X[ E_Y(Y | X) ]
   2. E(X) = E_Y[ E_X(X | Y) ]
   3. V(Y) = E_X[ V_Y(Y | X) ] + V_X[ E_Y(Y | X) ]   (iterated variance
   4. V(X) = E_Y[ V_X(X | Y) ] + V_Y[ E_X(X | Y) ]    formulas)
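
Illustration (Python): a simulation sketch of the iterated expectation and iterated variance formulas for
a hypothetical hierarchy X ~ Po(5) and Y | X = x ~ N(x, 1), for which E(Y) = E(X) = 5 and
V(Y) = E[V(Y|X)] + V[E(Y|X)] = 1 + 5 = 6.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.poisson(5, size=500_000)         # X ~ Po(5)
Y = rng.normal(loc=X, scale=1.0)         # Y | X = x ~ N(x, 1)

# E(Y) = E[E(Y|X)] = E(X) = 5 ;  V(Y) = E[V(Y|X)] + V[E(Y|X)] = 1 + V(X) = 6
print(Y.mean(), Y.var())                 # approximately 5 and 6
```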

5. Let X = (X1, X2, …, Xk)′ be a k-dimensional random vector. The joint moment generating function
   (or joint MGF) of the random variables X1, X2, …, Xk, denoted as m_{X1,X2,…,Xk}(·, ·, …, ·), is defined
   as:

   m_{X1,X2,…,Xk}(t1, t2, …, tk) = E[ exp{t1 X1 + t2 X2 + ⋯ + tk Xk} ] = E[ exp{ ∑_{i=1}^{k} ti Xi } ],

   provided the expectation exists for all values t1, t2, …, tk such that −hi < ti < hi, for some hi > 0,
   ∀ i = 1, 2, …, k.

   The joint raw moments of the random variables X1, X2, …, Xk are denoted as:

   E[ X1^{r1} X2^{r2} ⋯ Xk^{rk} ],

   where ri is either zero or a positive integer, ∀ i = 1, 2, …, k.

   The joint central moments of the random variables X1, X2, …, Xk (about their respective means) are
   denoted as:

   E[ (X1 − μ_{X1})^{r1} (X2 − μ_{X2})^{r2} ⋯ (Xk − μ_{Xk})^{rk} ],

   where ri is either zero or a positive integer, ∀ i = 1, 2, …, k.

6. For the bivariate case, if (X, Y)′ is a bivariate random vector, the joint moment generating function
   (or joint MGF) of the random variables X and Y, denoted as m_{X,Y}(·, ·), is defined as:

   m_{X,Y}(t1, t2) = E[ exp{t1 X + t2 Y} ],

   provided the expectation exists for all values t1 and t2 such that −a < t1 < a and −b < t2 < b, for some
   positive real numbers a and b.

   The joint raw moments of the random variables X and Y are denoted as:

   E[X^r Y^s], where r and s are either zeros or positive integers.

   The joint central moments of the random variables X and Y (about their respective means) are denoted
   as:

   E[(X − μ_X)^r (Y − μ_Y)^s], where r and s are either zeros or positive integers.



Generation of Moments

Result #1. Joint Raw Moments

Let X = (X1, X2, …, Xk)′ be a k-dimensional random vector. The joint raw moments of X1, X2, …, Xk,
denoted E(X1^{r1} X2^{r2} ⋯ Xk^{rk}), can be obtained from the joint MGF of X1, X2, …, Xk by differentiating
the joint MGF r1 times with respect to t1, r2 times with respect to t2, and so on, and rk times with respect
to tk. The limit of the resulting derivative is then taken as all ti's go to zero. This may be extended to
the case of the joint raw moments of any sub-vector of X.

Result #2. Marginal Raw Moments

Let X = (X1, X2, …, Xk)′ be a k-dimensional random vector. The rth marginal raw moment of Xi,
denoted E(Xi^r), can be obtained from the joint MGF of X1, X2, …, Xk by differentiating the joint
MGF r times with respect to ti. The limit of the resulting derivative is then taken as all ti's go to zero.

7. A bivariate continuous random vector (X, Y)′ is said to have a Bivariate Normal distribution if, and
   only if, the joint PDF of X and Y is given by:

   f_{X,Y}(x, y) = 1 / (2π σ_X σ_Y √(1 − ρ²))
        × exp{ −[1 / (2(1 − ρ²))] [ ((x − μ_X)/σ_X)² − 2ρ((x − μ_X)/σ_X)((y − μ_Y)/σ_Y) + ((y − μ_Y)/σ_Y)² ] },

   −∞ < x < ∞,  −∞ < y < ∞,

   where μ_X, μ_Y, σ_X, σ_Y and ρ are constants such that:

   −∞ < μ_X < ∞,  σ_X² > 0,  −1 < ρ < 1,
   −∞ < μ_Y < ∞,  σ_Y² > 0.

   We write (X, Y)′ ~ BVN(μ_X, μ_Y, σ_X², σ_Y², ρ).

   Let (X, Y)′ be a bivariate continuous random vector having a Bivariate Normal distribution, i.e.,
   (X, Y)′ ~ BVN(μ_X, μ_Y, σ_X², σ_Y², ρ). The joint moment generating function (or joint MGF) of the
   random variables X and Y is given, ∀ t1, t2 ∈ ℝ, by:

   m_{X,Y}(t1, t2) = exp{ t1 μ_X + t2 μ_Y + (1/2)(t1² σ_X² + 2ρ t1 t2 σ_X σ_Y + t2² σ_Y²) }.

   Theorem:
   Let (X, Y)′ be a bivariate continuous random vector having a Bivariate Normal distribution, i.e.,
   (X, Y)′ ~ BVN(μ_X, μ_Y, σ_X², σ_Y², ρ). Then, the marginal distributions of X and Y are each
   (univariate) Normal distributions, i.e.,

   X ~ N(μ_X, σ_X²)   and   Y ~ N(μ_Y, σ_Y²).

Theorem:
Let (X, Y)′ be a bivariate continuous random vector having a Bivariate Normal distribution, i.e.,
(X, Y)′ ~ BVN(μ_X, μ_Y, σ_X², σ_Y², ρ). Then the conditional distribution of Y given X = x0 is
univariate Normal, and similarly, the conditional distribution of X given Y = y0 is univariate
Normal, i.e.,

   Y | X = x0 ~ N( μ_Y + ρ (σ_Y/σ_X)(x0 − μ_X),  σ_Y²(1 − ρ²) )

   X | Y = y0 ~ N( μ_X + ρ (σ_X/σ_Y)(y0 − μ_Y),  σ_X²(1 − ρ²) )
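
Illustration (Python): a simulation sketch, with arbitrary parameter values, comparing the sample mean
of Y for observations with X near x0 against the conditional mean given by the theorem.

```python
import numpy as np

mu_X, mu_Y, sd_X, sd_Y, rho = 1.0, 2.0, 2.0, 3.0, 0.6          # arbitrary BVN parameters
cov = [[sd_X**2, rho * sd_X * sd_Y], [rho * sd_X * sd_Y, sd_Y**2]]

rng = np.random.default_rng(2)
X, Y = rng.multivariate_normal([mu_X, mu_Y], cov, size=500_000).T

x0 = 0.0
near_x0 = np.abs(X - x0) < 0.05                                 # crude conditioning window
theory = mu_Y + rho * (sd_Y / sd_X) * (x0 - mu_X)               # E(Y | X = x0)
print(Y[near_x0].mean(), theory)                                # should be close
```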

0.4 DISTRIBUTIONS OF FUNCTIONS OF RANDOM VARIABLES

1. Univariate Discrete Case

Let X be a discrete random variable with a countably infinite (or finite) set of mass points {x1, x2, …}
and with PMF given by p_X(x). Let Y = g(X) be a real-valued function with domain ℝ. The following
are true about Y = g(X):

a. The random variable Y = g(X) is also a discrete random variable with mass points y1 = g(x1),
   y2 = g(x2), …

b. The function g(·) need not be one-to-one. If g(·) is one-to-one, then, for every mass point xi of
   X, there corresponds one, and only one, mass point yi = g(xi) of Y. Otherwise, it is possible for
   at least two distinct mass points xi and xj of X to yield the same mass point of Y, i.e.,
   g(xi) = g(xj).

c. Since Y is discrete, its distribution can be expressed in terms of its PMF. To find the PMF of Y, it
   suffices to specify the mass points of Y and the probabilities for each of these mass points. The
   latter is found by:

   p_Y(yi) = P(Y = yi) = ∑ p_X(xj),

   where the summation is taken over the set {xj : g(xj) = yi}, i.e., the set of all mass points of X
   which correspond to the specific mass point yi of Y.
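
Illustration (Python): a minimal sketch of the rule in (c) for a hypothetical PMF and a function g that
is not one-to-one.

```python
p_X = {-2: 0.1, -1: 0.2, 0: 0.3, 1: 0.25, 2: 0.15}   # hypothetical PMF of X
g = lambda x: x ** 2                                  # not one-to-one

p_Y = {}
for x, p in p_X.items():
    y = g(x)
    p_Y[y] = p_Y.get(y, 0.0) + p     # sum p_X over all mass points mapping to y

print(p_Y)   # ≈ {4: 0.25, 1: 0.45, 0: 0.3} (up to floating-point rounding)
```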

2. Univariate (Absolutely) Continuous Case

Method 1: CDF Technique

Let X be a continuous random variable with CDF F_X(·). Then, the CDF of the random variable Y =
g(X), denoted by F_Y(·), can be determined by:

   F_Y(y) = P(Y ≤ y) = P(X ∈ A),  ∀ y ∈ ℝ,

where A = {x : g(x) ≤ y}, i.e., {X ∈ A} iff {Y ≤ y}. Consequently, the CDF of Y can be found, since
it can be represented in terms of the CDF of X, which is known.

Method 2: Transformation Technique

Let X be a continuous random variable with PDF f_X(·). Let g(·) be a strictly monotone (increasing
or decreasing), differentiable (and thus continuous) function. Then, the PDF of the random variable Y
= g(X), denoted by f_Y(·), can be determined by:

   f_Y(y) = | d/dy g⁻¹(y) | f_X[g⁻¹(y)],  ∀ y such that y = g(x) for some x.

Otherwise, the PDF is zero. In the above description, g⁻¹(·) is the inverse function of g(·), i.e.,
we define g⁻¹(y) as that value of x for which g(x) = y.

This can be generalized to the case when g(·) is not one-to-one over the entire range of X. If the range
of X can be decomposed into sub-ranges such that g(·) becomes one-to-one in each sub-range, then the
PDF of Y = g(X) can be determined by:

   f_Y(y) = ∑ | d/dy g_i⁻¹(y) | f_X[g_i⁻¹(y)],

where the summation is taken over all the sub-ranges into which the range of X was decomposed, and
g_i⁻¹(·) is the inverse function of g_i(·), the restriction of g(·) to the ith sub-range.

Method 3: MGF Technique

Let X be a continuous random variable with PDF f_X(·). Assuming it exists, the MGF of the random
variable Y = g(X) can be derived using the PDF of X, as:

   m_Y(t) = E[exp{tY}] = E[exp{t g(X)}] = ∫_ℝ exp{t g(x)} f_X(x) dx.

If, upon simplification, the resulting function of t can be recognized as the MGF of some known (usually
special) distribution, then it follows that Y = g(X) has that known distribution.

3. Distribution of a Function of Several Random Variables

a. CDF Technique

Let X1, X2, …, Xk be k random variables with joint CDF given by F_{X1,X2,…,Xk}(·, ·, …, ·). Let Y1, Y2, …, Yk
be k real-valued functions defined as:

   Y1 = g1(X1, X2, …, Xk),  Y2 = g2(X1, X2, …, Xk),  … ,  Yk = gk(X1, X2, …, Xk).

The functions Y1, Y2, …, Yk are also random variables, whose joint CDF, by definition, is:

   F_{Y1,Y2,…,Yk}(y1, y2, …, yk) = P(Y1 ≤ y1, Y2 ≤ y2, …, Yk ≤ yk).

For simplicity, let X1, X2, …, Xk be denoted as X. Then, for each point (y1, y2, …, yk) in the k-
dimensional hyper-space, the event

   {Y1 ≤ y1, Y2 ≤ y2, …, Yk ≤ yk} = {g1(X) ≤ y1, g2(X) ≤ y2, …, gk(X) ≤ yk}.

Thus, the joint CDF of the random variables Y1, Y2, …, Yk can be expressed as:

   F_{Y1,Y2,…,Yk}(y1, y2, …, yk) = P(Y1 ≤ y1, Y2 ≤ y2, …, Yk ≤ yk)
                                = P(g1(X) ≤ y1, g2(X) ≤ y2, …, gk(X) ≤ yk),

a function involving the joint CDF of the random variables X1, X2, …, Xk. Consequently, the joint
CDF of the random variables Y1, Y2, …, Yk can be found through the joint CDF of the random variables
X1, X2, …, Xk. Note that the number of Y random variables need not be equal to the number of X
random variables; we can define r functions of X, say g1(X), g2(X), …, gr(X).

Some Results:

1. Theorem: CDF of the Minimum and the Maximum

   Let X1, X2, …, Xn be independent random variables with marginal CDFs denoted by
   F_{X1}(·), F_{X2}(·), …, F_{Xn}(·), respectively. Let Y1 and Yn be real-valued functions defined as
   Y1 = min{X1, X2, …, Xn} and Yn = max{X1, X2, …, Xn}.
   Then, the marginal CDFs of Y1 and Yn are given by:

      F_{Y1}(y) = 1 − ∏_{i=1}^{n} [1 − F_{Xi}(y)]   and   F_{Yn}(y) = ∏_{i=1}^{n} F_{Xi}(y).

   If the random variables X1, X2, …, Xn are independent and identically distributed (iid) random
   variables with common CDF denoted by F_X(·), then the marginal CDFs of Y1 and Yn reduce to:

      F_{Y1}(y) = 1 − [1 − F_X(y)]^n   and   F_{Yn}(y) = [F_X(y)]^n.

   Corollary: PDF of the Minimum and the Maximum

   If the random variables X1, X2, …, Xn are independent and identically distributed (iid) continuous
   random variables with common CDF denoted by F_X(·) and common PDF denoted by f_X(·),
   then the marginal PDFs of Y1 and Yn are given by:

      f_{Y1}(y) = n[1 − F_X(y)]^{n−1} f_X(y)   and   f_{Yn}(y) = n[F_X(y)]^{n−1} f_X(y).

2. Theorem: Distribution of the Sum and the Difference of Two Random Variables

   Let X and Y be continuous random variables with joint PDF denoted by f_{X,Y}(·, ·). Let Z and V be
   defined as Z = X + Y and V = X − Y. Then,

      f_Z(z) = ∫_ℝ f_{X,Y}(x, z − x) dx = ∫_ℝ f_{X,Y}(z − y, y) dy,   and

      f_V(v) = ∫_ℝ f_{X,Y}(x, x − v) dx = ∫_ℝ f_{X,Y}(v + y, y) dy.

   Corollary: Convolution Formula

   If X and Y are independent continuous random variables and Z = X + Y, then,

      f_Z(z) = ∫_ℝ f_Y(z − x) f_X(x) dx = ∫_ℝ f_X(z − y) f_Y(y) dy.
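
Illustration (Python): the convolution formula evaluated numerically (by a simple Riemann sum) for two
independent Exp(1) variables, whose sum follows a Ga(2, 1) distribution with density f_Z(z) = z·exp(−z).

```python
import numpy as np

def f_exp(t):                                    # Exp(1) density
    return np.where(t >= 0, np.exp(-t), 0.0)

z = 3.0
m = 10_000
x = np.linspace(0.0, z, m, endpoint=False)       # the integrand vanishes outside [0, z]
dx = z / m
f_Z = np.sum(f_exp(z - x) * f_exp(x)) * dx       # ∫ f_Y(z − x) f_X(x) dx
print(f_Z, z * np.exp(-z))                       # both approximately 0.149
```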

3. Theorem: Distribution of the Product and the Quotient of Two Random Variables

   Let X and Y be 2 continuous random variables with joint PDF denoted by f_{X,Y}(·, ·). Let Z and V
   be defined as Z = XY and V = X/Y. Then,

      f_Z(z) = ∫_ℝ (1/|x|) f_{X,Y}(x, z/x) dx = ∫_ℝ (1/|y|) f_{X,Y}(z/y, y) dy,   and

      f_V(v) = ∫_ℝ |y| f_{X,Y}(vy, y) dy.

b. MGF Technique

Let X1, X2, …, Xk be k random variables with joint PMF/PDF given by p_{X1,X2,…,Xk}(·, ·, …, ·) or
f_{X1,X2,…,Xk}(·, ·, …, ·), respectively. Let Y1, Y2, …, Yk be k real-valued functions defined as:

   Y1 = g1(X1, X2, …, Xk),  Y2 = g2(X1, X2, …, Xk),  … ,  Yk = gk(X1, X2, …, Xk).

The functions Y1, Y2, …, Yk are also random variables. For simplicity, let X1, X2, …, Xk be denoted as
X. If the joint MGF of Y1, Y2, …, Yk exists, then we can derive it by definition, and consequently we
can express it in terms of X as:

   m_{Y1,Y2,…,Yk}(t1, t2, …, tk) = E[exp{t1 Y1 + t2 Y2 + ⋯ + tk Yk}]
                                = E[exp{t1 g1(X) + t2 g2(X) + ⋯ + tk gk(X)}],

which is an expectation of a function involving the random variables X1 , X 2 , , X k . If this expectation


exists, it can be derived by using the joint PMF/PDF of the random variables X1 , X 2 , , X k . (This can
be done by summing or integrating the product of the exponential function above and the joint
PMF/PDF over all the values of the random variables X1 , X 2 , , X k .) If the resulting function can be
recognized as the joint MGF of some known distribution, then the random variables Y1 , Y2 , , Yk will
have that joint distribution.

Some Results:

1. Theorem: Distribution of Sums of Independent Random Variables

   Let X1, X2, …, Xn be n independent random variables, whose marginal MGFs, denoted by m_{X1}(t),
   m_{X2}(t), …, m_{Xn}(t), respectively, all exist for −h < t < h for some h > 0. Let
   Y = X1 + X2 + ⋯ + Xn = ∑ Xi. Then, the MGF of Y is:

      m_Y(t) = E[exp{t ∑ Xi}] = ∏_{i=1}^{n} m_{Xi}(t),  for −h < t < h.

   Corollary:
   Let X1, X2, …, Xn be n independent and identically distributed (iid) random variables with common
   MGF denoted by m_X(t), which exists for −h < t < h for some h > 0. Let
   Y = X1 + X2 + ⋯ + Xn = ∑ Xi. Then, the MGF of Y is:

      m_Y(t) = E[exp{t ∑ Xi}] = [m_X(t)]^n,  for −h < t < h.

2. If X1, X2, …, Xn are independent and identically distributed random variables, with common MGF
   m_X(t), and S = X1 + X2 + ⋯ + Xn = ∑ Xi, then m_S(t) = [m_X(t)]^n. If the distribution of S belongs
   to the same parametric family as the common distribution of X1, X2, …, Xn, then we say that the
   distribution is reproductive.

3. Theorem: Central Limit Theorem (CLT)

   If, for each positive integer n, X1, X2, …, Xn are independent and identically distributed random
   variables, with common mean μ and common variance σ², then ∀ z ∈ ℝ,

      F_{Zn}(z) converges to Φ(z), as n → ∞,

   where:  Zn = [X̄n − E(X̄n)] / √V(X̄n) = (X̄n − μ) / (σ/√n)   and   X̄n = (1/n) ∑_{i=1}^{n} Xi.
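
Illustration (Python): a simulation sketch of the CLT for iid Exponential(1) variables (mean 1,
variance 1), comparing the empirical distribution of Zn with the standard Normal CDF at a single point.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 50, 100_000
X = rng.exponential(scale=1.0, size=(reps, n))       # iid Exp(1): mean 1, variance 1
Zn = (X.mean(axis=1) - 1.0) / (1.0 / np.sqrt(n))     # (X̄n − μ) / (σ/√n)

z = 1.0
print((Zn <= z).mean())   # should be close to Φ(1.0) ≈ 0.8413 for large n
```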
i 1

c. Method of Transformation

Discrete Case

Let X1, X2, …, Xk be k discrete random variables with joint PMF p_{X1,X2,…,Xk}(·, ·, …, ·). Let H be the
set of mass points of X1, X2, …, Xk, that is:

   H = { (x1, x2, …, xk)′ : p_{X1,X2,…,Xk}(x1, x2, …, xk) > 0 }.

Let Y1, Y2, …, Yk be k real-valued functions defined as:

   Y1 = g1(X1, X2, …, Xk),  Y2 = g2(X1, X2, …, Xk),  … ,  Yk = gk(X1, X2, …, Xk).

The functions Y1, Y2, …, Yk are also discrete random variables, with joint PMF given by:

   p_{Y1,Y2,…,Yk}(y1, y2, …, yk) = P(Y1 = y1, Y2 = y2, …, Yk = yk)
                                = ∑ p_{X1,X2,…,Xk}(x1, x2, …, xk),

where the summation is taken over all mass points (x1, x2, …, xk)′ ∈ H for which:

   (y1, y2, …, yk) = ( g1(x1, x2, …, xk), g2(x1, x2, …, xk), …, gk(x1, x2, …, xk) ).

Note that the number of Y random variables need not be equal to the number of X random variables.

Continuous Case

Let X1, X2, …, Xk be k continuous random variables with joint PDF f_{X1,X2,…,Xk}(·, ·, …, ·). Let
Y1, Y2, …, Yk be k real-valued functions defined as:

   Y1 = g1(X1, X2, …, Xk),  Y2 = g2(X1, X2, …, Xk),  … ,  Yk = gk(X1, X2, …, Xk).

The functions Y1, Y2, …, Yk are also random variables. To find the joint density of Y1, Y2, …, Yk, we
assume that the functions g1(·, ·, …, ·), g2(·, ·, …, ·), …, gk(·, ·, …, ·) satisfy some regularity
conditions (in particular, that the transformation is one-to-one with continuous partial derivatives).

Then, the joint PDF of Y1, Y2, …, Yk, denoted by f_{Y1,Y2,…,Yk}(·, ·, …, ·), is determined as

   f_{Y1,Y2,…,Yk}(y1, y2, …, yk)
      = |J| f_{X1,X2,…,Xk}( g1⁻¹(y1, y2, …, yk), g2⁻¹(y1, y2, …, yk), …, gk⁻¹(y1, y2, …, yk) ) I_D(y1, y2, …, yk),

where D is the subset of the k-dimensional hyper-space consisting of points (y1, y2, …, yk) for which:

   (y1, y2, …, yk) = ( g1(x1, x2, …, xk), g2(x1, x2, …, xk), …, gk(x1, x2, …, xk) )

for some (x1, x2, …, xk) in the support of X, and the Jacobian of the transformation, J, is the determinant
of the matrix of partial derivatives, that is:

   J = det | ∂X1/∂Y1   ∂X1/∂Y2   ⋯   ∂X1/∂Yk |
           | ∂X2/∂Y1   ∂X2/∂Y2   ⋯   ∂X2/∂Yk |
           |    ⋮          ⋮             ⋮   |
           | ∂Xk/∂Y1   ∂Xk/∂Y2   ⋯   ∂Xk/∂Yk |

4. Probability Integral Transform

If X is a random variable with continuous CDF F_X(·), then U = F_X(X) is uniformly distributed over
the interval (0, 1). Conversely, if U is uniformly distributed over the interval (0, 1), then X = F_X⁻¹(U)
has CDF F_X(·).
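
Illustration (Python): the converse of the probability integral transform used as an inverse-CDF sampler,
here with the Exponential(λ) CDF F_X(x) = 1 − exp(−λx), so that F_X⁻¹(u) = −ln(1 − u)/λ; the value of
λ below is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
lam = 2.0
U = rng.uniform(size=100_000)           # U ~ Uniform(0, 1)
X = -np.log(1.0 - U) / lam              # X = F_X^{-1}(U) ~ Exp(lam)

print(X.mean(), 1.0 / lam)              # sample mean vs. theoretical mean 1/λ
```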

0.5 SAMPLING AND SAMPLING DISTRIBUTIONS

1. The totality of elements which are under discussion, and about which information is desired will be
called the target population.

2. Given a probability space and a positive integer n, a collection of n independent random variables
X1 , X 2 ,..., X n , all having common distribution FX is called a random sample from the population (with
distribution) FX .

3. A random sample (r.s.) can be viewed as a random vector X  ( X1 , X 2 ,..., X n )' defined on the n-
dimensional real space Rn . Further, it can also be interpreted as the outcome of a series of n
independent trials of an experiment performed under identical conditions.

4. The common distribution FX is usually called the sampled population, the collection of all elements
from which the sample is actually selected. In certain cases, FX may be replaced with the
corresponding PDF or PMF.

5. Since the r.s. X  ( X1 , X 2 ,..., X n )' consists of independent and identically distributed (iid) random
variables, then the distribution of the r.s. X is simply the joint distribution of X1 , X 2 ,..., X n .

6. Let X  ( X1 , X 2 ,..., X n )' be an observable random vector. Any observable function of X , say T ( X ) ,
which is itself a r.v. (or random vector) is called a statistic. The probability distribution of a statistic is
called a sampling distribution. The standard deviation of a statistic is called its standard error.

7. A statistic is always a function of observable random variables, is itself a random variable, and does
not contain any unknown parameter.

8. Theorem:
   Let X = (X1, X2, …, Xn)′ be a r.s. from F_X, with common mean E(X) and common variance V(X).
   Then,
   a. E(X̄) = E(X)
   b. V(X̄) = V(X)/n

9. Theorem: Weak Law of Large Numbers (WLLN)

   Let X1, X2, …, Xn be a r.s. from the PDF f_X, with common mean E(X) and common variance V(X)
   (< ∞). Let ε and δ be arbitrary numbers such that ε > 0 and 0 < δ < 1. If n ≥ V(X)/(ε²δ), then

      P( |X̄ − E(X)| < ε ) ≥ 1 − δ.

10. Theorem: Central Limit Theorem (CLT)

    Let X1, X2, …, Xn be a r.s. from the PDF f_X, with common mean E(X) and finite common variance
    V(X). Let X̄ be the sample mean of the r.s. and define the r.v. Zn as:

       Zn = [X̄ − E(X̄)] / √V(X̄) = [X̄ − E(X)] / [Sd(X)/√n].

    Then, as n → ∞, the distribution of Zn approaches the Standard Normal, i.e., Zn → N(0, 1) in
    distribution.

    (Implications about the distribution of the sample mean? The sample sum?)

11. Sampling from the Normal Distribution

a. Characterization of a Chi-square r.v.


b. Characterization of an F-distributed r.v.
c. Characterization of a t-distributed r.v.

12. Reading Standard Normal, Chi-square, F, and t tables



1. INTRODUCTION ON STATISTICAL INFERENCE

1.1 PROBABILITY THEORY VERSUS STATISTICAL INFERENCE

A typical problem in Probability Theory is of the following form: a sample space and an underlying
probability distribution are specified, and we are asked to compute the probability of a given event
(or events).

Example: Experiment of rolling a loaded die, for which the chance of landing an odd-
numbered outcome is twice as much as that of an even-numbered outcome
a. Define the sample space.
b. Define an appropriate probability space.
c. Let E be event of getting either a 3 or a 4. Find P(E).

In a typical problem of Statistical Inference, it is not a single underlying probability distribution which
is specified, but rather a class of probability distributions, any of which may possibly be the one that
governs the chance experiment, whose outcome we shall observe. We know that the underlying
probability distribution is a member of this class, but we do not know which one it is. The objective
is thus to determine a “good” way of guessing, on the basis of the observed outcome of the
experiment, which of the underlying probability distributions is the one that actually governs the
experiment.

Example: Consider the experiment of rolling a die, about which we know nothing.
a. Define the random variable X as the number of dots, i.e., face value of the die. What
are the possible values of X?
b. What would be a possible (probability) distribution of X?
c. Suppose the die is rolled 5 times. On the ith roll, let Xi be the number of dots. What
would be the (joint) distribution of X?

Specifications of a Statistical Problem

We now consider the specification of a statistical problem. Suppose that there is an experiment whose
outcome can be observed by the statistician. This outcome is described by a random variable X (or
random vector X), which takes on values in the space S. The distribution function of X, say FX, (or in
the case of a random vector, the distribution of X is FX ) is unknown to the statistician, but it is known
that FX belongs to a specified class of distribution functions, the class Ω. The collection of possible
actions that the statistician can take, or the collection of possible statements that can be made, at the
end of the experiment, is called the decision space, denoted as D. At the conclusion of the experiment,
the statistician actually chooses only one action (or makes only one statement) from the possible
choices in D.

In summary, therefore, any statistical problem can be specified by defining each of the components of
the triplet (S, Ω, D).

Example

Suppose we are given a coin, about which we know nothing. We are allowed to perform 10
independent flips with the coin, on each of which, the probability of getting a head is p. We do not
know the value of p, but we know that p will be in the interval [0,1].

In this example, we can let 𝑋 = (𝑋1 , 𝑋2 , … , 𝑋10 )′ with each 𝑋𝑖 defined to be either a “1” or a “0”
according to whether the ith flip is a head or tail. Then, 𝑺 consists of all the ___ possible values of the
vector 𝑋. The class 𝛀 consists of all possible probability mass functions of 𝑋 for which the 𝑋𝑖 𝑠 are
independently and identically distributed Bernoulli random variables with probability of success 𝑝,
i.e., iid 𝐵𝑒(𝑝). Thus, for a specific value of the vector 𝑋, say 𝑥 = (𝑥1 , 𝑥2 , … , 𝑥10 )′,

𝛀 = { p_X : p_X(x) = p^(∑ xi) (1 − p)^(10 − ∑ xi), 0 ≤ p ≤ 1 } .

If we are required to guess the value of 𝑝, we can define 𝑫 as follows:

𝑫 = { 𝑝̂ ∶ 0 ≤ 𝑝̂ ≤ 1} .

This type of statistical problem is referred to as Point Estimation. If, on the other hand, we do not
merely want a guess as to the value of 𝑝, but rather a statement of an interval of values which is thought
to enclose the true value of the 𝑝, then we can define 𝑫 as

𝑫 = { (𝑝𝐿 , 𝑝𝑈 ) ∶ 0 ≤ 𝑝𝐿 ≤ 𝑝𝑈 ≤ 1 } .

This approach is called Interval Estimation.

Suppose that we are not required to come up with a numerical guess as to the value of 𝑝, but only to
know whether the coin is fair or not. In this case, 𝑫 can be defined as

𝑫 = { 𝑑1 : 𝑐𝑜𝑖𝑛 𝑖𝑠 𝑓𝑎𝑖𝑟 , 𝑑2 : 𝑐𝑜𝑖𝑛 𝑖𝑠 𝑛𝑜𝑡 𝑓𝑎𝑖𝑟 } .

This type of problem is called Hypothesis Testing.

Note that 𝑫 can be viewed as the collection of possible answers to a question (e.g., “What do you guess
𝑝 to be?” or “Within what interval do you guess 𝑝 to lie?” or “Is the coin fair?”) asked of the statistician.
The real problem in Statistical Inference lies in choosing the best “guessing method.” Note also
that there are infinitely many ways of arriving at a guess of the value of 𝑝 or arriving at a decision
whether the coin is fair or not. Which of these ways of forming a guess from the experimental data
should we actually employ? This is the real problem of inference.

1.2 CLASSIFICATION OF STATISTICAL PROBLEMS

When statisticians discuss statistical problems, they naturally classify them in certain ways. In our
case, we shall classify statistical problems on the basis of the structure of 𝛀 and 𝑫.

1.2.1 Classification based on the Structure of 𝛀

Statistical problems are classified as either parametric or nonparametric depending on the
structure of 𝛀. Parametric statistical problems are all those in which the class 𝛀 (of all distribution
functions 𝐹𝑋) can be specified in terms of a parameter (or vector of parameters) 𝜃. In this case,
inferences made on 𝐹𝑋 necessarily focus on the parameter 𝜃; hence the name. All other problems
not falling into this formulation are called nonparametric statistical problems.

Example:

Suppose we take a measurement on the length of an object, using a given measuring


instrument. The experiment is thus the observation of a random variable 𝑋, where 𝑋
represents the measurement taken (i.e., the length), having an unknown distribution 𝐹𝑋 .
This problem becomes a parametric statistical problem if we assume Ω to be a
parametric family of distributions. For instance, we can take Ω to be the Normal family
of distributions with parameter vector 𝜃 = (𝜇, 𝜎 2 )′ with −∞ < 𝜇 < ∞ and 𝜎 2 > 0.
Note that “saying something” about 𝐹𝑋 is equivalent to “saying something” about 𝜃.

In contrast, a nonparametric treatment of the above problem would entail only slight
assumptions regarding the distributions 𝐹𝑋 in Ω, such that Ω consists of absolutely
continuous distributions, not necessarily having common parameters.

1.2.2 Classification based on the Structure of D

Statistical problems may be classified based on the structure of D as follows:

a. Point Estimation
b. Interval (or Region) Estimation
c. Hypothesis Testing
d. Ranking / Multiple Decision Problems
e. Regression / Experimental Designs Problems

Letters (a) – (c) were discussed in the example of the previous section. Region Estimation involves
estimating a vector of parameters 𝜃 = (𝜃1 , 𝜃2 , … , 𝜃𝑘 )′. For example, we might be interested in
estimating the mean 𝜇 and variance 𝜎 2 of a distribution simultaneously. In this case, our estimate
will be of the form:

𝑫 = { (𝑢, 𝑣) ∶ 𝑢1 ≤ 𝑢 ≤ 𝑢2 ; 𝑣1 ≤ 𝑣 ≤ 𝑣2 } ,

which is a region, namely, a rectangle, in the Cartesian coordinate plane.

Multiple Decision Problems are decision problems where there are a finite (more than 2) number
of possible decisions. Note that hypothesis testing is just a special case of this problem with the
number of possible decisions equal to two. Ranking Problems are those for which a decision is a
statement as to the complete ordering of certain objects or things, as in “Method A is best, B is next,
and C is the worst.”

A Regression Problem looks into investigating the (linear) relationship between one variable
(dependent variable) and a set of other variables (independent variables). For this type of a problem,
the objective is to try to predict the value of the dependent variable based on the observed values of
independent variables. If Regression Analysis investigates the linear relationships between
variables, an Experimental Designs Problem, on the other hand, looks into investigating the causal
relationship between a dependent variable and several independent variables. Such problems aim
to determine whether changes in the independent variables will cause some effect on the dependent
variables.

1.2.3 Other Topics

Topics that do not fall into any of the classifications just mentioned but are of practical importance
to us are listed below. Some of the more important topics usually discussed are

a. Sampling Methods
b. Cost Considerations
c. Randomization
d. Asymptotic Theory

In Sampling Methods, the focus oftentimes falls on the so-called fixed-sample-size and sequential
procedures. The former entails fixing the sample size even before any data are collected, while the
latter is characterized by taking the observations sequentially, hence, the sample size is not fixed in
advance. Cost Considerations frequently become a deciding factor in the choice between the two
sampling procedures.

Some mathematically oriented problems in statistics involve the use of Randomization. Loosely,
this is the process of incorporating some element of chance into the manner in which the experiment
is being performed, so as to minimize the possibility of having biases. Asymptotic Theory is the
class of results and theories that apply for cases using very large samples. The word asymptotic is
usually used to describe a method, a result, a theorem, or a definition associated with very large
samples.

Discussion on these four topics in this course will be selective and superficial. Sampling Methods
will be discussed extensively in Stat138, while Randomization will be discussed in Stat148. Cost
Considerations and other statistical methods are applied through a research process under Stat143.
Asymptotic Theory is discussed extensively in courses in the graduate programs.

1.3 Some Important Topics

Defn: The totality of elements which are under discussion, and about which information is desired
will be called the target population.

Remarks:

1. The target population must be capable of being well defined. It may be real or hypothetical.

2. The object of any investigation is to find out something about a given target population.

3. It is generally impossible or impractical to examine the entire population, but, on the basis of
examining a part of it, inferences regarding the entire target population can be made.

1.3.1 Concept of a Random Sample

Consider 𝑋~𝐹𝑋 . You wish to make inferences about FX on the basis of n independent observations
of 𝑋, say 𝑋1 , 𝑋2 , … , 𝑋𝑛 . The problem now is how to select a part of the population, i.e., how do we
obtain a sample? In answering this question, the following consideration should be taken into
account: If the sample is selected in a certain way, we can make probabilistic statements about
the population.

Defn: Given a probability space and a positive integer n, a collection of n independent random
variables 𝑋1 , 𝑋2 , … , 𝑋𝑛 , all having common distribution 𝐹𝑋 is called a random sample from
a population (with distribution) 𝐹𝑋 .

Note: We assume that each physical element of the population has some numerical value
associated with it and that the distribution of these values is given by the distribution
function 𝐹𝑋 .

Remarks:

1. A random sample (r.s.) can be viewed as a random vector 𝑋 = (𝑋1 , 𝑋2 , … , 𝑋𝑛 )′ defined on the
n-dimensional real space Rn. Further, it can also be interpreted as the outcome of a series of n
independent trials of an experiment performed under identical conditions.

2. Inference works under the assumption that the sample (data) reflects the truth about the
population. To ferret out this truth, inference employs the so-called process of inductive
argumentation.

Example: If a given coin is biased (loaded) in favor of heads, we would expect to observe heads
more frequently than tails in repeated tosses of the coin. Out of 20 tosses of the coin,
14 were heads and only 6 were tails. This is thus taken as evidence that the coin may
not be fair.

Since we cannot make absolutely certain generalizations, uncertainty will always be present in
all inductive inferences we make. This is why statistical inference is based on laws of
probability.

3. The distribution 𝐹𝑋 is usually called the sampled population, the collection of all elements from
which the sample is actually selected. (In certain cases, 𝐹𝑋 may be replaced with the
corresponding PMF or PDF.)

Example: target population : all 25-year old males in the country


sampled population : all 25-year old males in Quezon City

4. Sampling from the distribution 𝐹𝑋 is sometimes referred to as sampling from an infinite


population or sampling with replacement from a finite population.

5. Implicitly, sampling without replacement from a finite population is ruled out in the above
definition.

6. Since the r.s. 𝑋 = (𝑋1 , 𝑋2 , … , 𝑋𝑛 )′ consists of independent and identically distributed (iid)
random variables, then the distribution of the r.s. X, which is simply the joint distribution of
𝑋1 , 𝑋2 , … , 𝑋𝑛 , is thus,

      F_X(x) = F_{X1,X2,…,Xn}(x1, x2, …, xn) = F_{X1}(x1) F_{X2}(x2) ⋯ F_{Xn}(xn) = ∏_{i=1}^{n} F_{Xi}(xi).

   If the PDF or PMF exists, then F_X may be replaced accordingly. Thus,

      f_X(x) = f_{X1,X2,…,Xn}(x1, x2, …, xn) = f_{X1}(x1) f_{X2}(x2) ⋯ f_{Xn}(xn) = ∏_{i=1}^{n} f_{Xi}(xi),

   or

      p_X(x) = p_{X1,X2,…,Xn}(x1, x2, …, xn) = p_{X1}(x1) p_{X2}(x2) ⋯ p_{Xn}(xn) = ∏_{i=1}^{n} p_{Xi}(xi).

Examples:

1. In studying the “reliability” of light bulbs, the lifetime X (in hours) of a given light bulb is taken
   to be a r.v. with density function:

      f_X(x) = λ exp{−λx} I_(0,∞)(x),  where λ > 0 is unknown.

   A collection of n light bulbs is put to a “reliability test” and their lifetimes are recorded. Then,
   𝑋 = (𝑋1 , 𝑋2 , … , 𝑋𝑛 )′ can be considered a r.s. (from an exponential population with parameter
   λ). What is the PDF of the r.s. X? What are S and 𝛀 for this statistical problem?

2. A r.s. 𝑋 = (𝑋1 , 𝑋2 , … , 𝑋𝑛 )′ from a Bernoulli population with parameter 0 ≤ p ≤ 1 will have a PMF
   of the form

      p_X(x) = ∏_{i=1}^{n} p^{xi} (1 − p)^{1−xi} = p^{∑ xi} (1 − p)^{n − ∑ xi},  xi ∈ {0, 1}.
One way of regarding a r.s. from a Bernoulli population is as follows:

Consider a “Yes or No” type of question, answered by each of n respondents,


chosen at random with replacement from a population of N people. Then, the
random vector X = (X1, X2, …, Xn)’ is a r.s. from a Bernoulli population with
parameter p, the unknown proportion of “Yes” answers, and each Xi is equal to
“1” or “0” according to whether the ith respondent answers “Yes” or “No”.

1.3.2 Statistics and Sampling Distributions

1.3.2.1 Statistics

Defn: Let X = (X1, X2, …, Xn)’ be an observable random vector. Any observable function of
X, say T(X), which is itself a r.v. (or random vector), is called a statistic. The standard
deviation of a statistic is called its standard error.

Remarks:

1. A statistic is always
a. a function of observable random variables
b. itself a r.v.
c. does not contain any unknown parameter

2. By “observable”, we mean that the value of the statistic T(X) can be computed directly from
the values of the r.v.’s in the r.s.

Examples:

For the given random samples, which of the given are statistics?

1. Let X be a r.s. (of size 1) from N(μ, σ²), where μ and σ² are both unknown.
   a. X − μ                                  d. X² + 3
   b. X/σ                                    e. X² + log X²
   c. X

2. Let X = (X1, X2, …, X10)′ be a r.s. from F_{X;θ}, where θ is unknown.
   a. X̄ = ( ∑_{i=1}^{10} Xi ) / 10          d. ∑_{i=1}^{5} Xi
   b. min(X1, X2, …, X10)                    e. X4
   c. X̄ − θ

3. Let X = (X1, X2, …, Xn)′ be a random sample from N(μ, σ²).
   a. Sn = ∑_{i=1}^{n} Xi                    d. S² = ∑_{i=1}^{n} (Xi − X̄)² / (n − 1)
   b. X̄ = n⁻¹ Sn                            e. Z = (X1 − μ)/σ
   c. T1 = (X̄ − μ) ∑_{i=1}^{n} (Xi − μ)²    f. T2 = (X̄ − μ)(n − 1)S²/σ²

Some Important Statistics

Let X = (X1, X2, …, Xn)’ be a r.s. from FX. Some common statistics are:

1. Sample Sum:                  Sn = ∑_{i=1}^{n} Xi

2. Sample Mean:                 X̄ = ( ∑_{i=1}^{n} Xi ) / n

3. Sample Variance:             S² = ∑_{i=1}^{n} (Xi − X̄)² / (n − 1)

4. rth Sample Raw Moment:       Mr′ = ( ∑_{i=1}^{n} Xi^r ) / n,  r = 1, 2, …

5. rth Sample Central Moment:   Mr = ( ∑_{i=1}^{n} (Xi − X̄)^r ) / n,  r = 1, 2, …

6. Order Statistics:            X(1) ≤ X(2) ≤ ⋯ ≤ X(n)

Theorem: Let X = (X1, X2, …, Xn)′ be a r.s. from F_X. Then ∀ r = 1, 2, 3, …,

   a. E(Mr′) = E(X^r) = μr′, if E(X^r) exists; and,

   b. Var(Mr′) = [ E(X^{2r}) − [E(X^r)]² ] / n, if E(X^{2r}) exists.

Corollary: Let X = (X1, X2, …, Xn)′ be a r.s. from F_X, with mean μ and variance σ². Then,

   a. E(X̄) = E(X) = μ; and

   b. Var(X̄) = σ²/n.

Theorem: Let X = (X1, X2, …, Xn)′ be a r.s. from F_X, with mean μ and variance σ². Then,

   a. E(S²) = σ²; and

   b. Var(S²) = [ μ4 − ((n − 3)/(n − 1)) σ⁴ ] / n.

1.3.2.2 Sampling Distributions

Defn: The probability distribution of a statistic is called a sampling distribution.

Remarks:

1. A statistic, being a r.v., has thus its own probability distribution, which is called the sampling
distribution.

2. The sampling distribution of a statistic is affected by the sample size n, the population size
N (for finite cases), and the way X was observed (i.e., the manner in which the r.s. was
selected).

Examples:

1. Sampling from a finite population:

Suppose we have a population of N = 4 numbers: {0, 1, 2, 3}. From this population,
a sample of size n = 2 is drawn at random with replacement. If X1 is defined as the
number observed on the 1st draw and X2 the number on the 2nd draw, determine the
sampling distribution of the sample mean X̄ = (X1 + X2)/2. Also, determine the
sampling distribution of S².
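
Illustration (Python): the sampling distribution of X̄ in Example 1 can be obtained by enumerating all
16 equally likely ordered samples, as sketched below.

```python
from itertools import product
from collections import Counter

population = [0, 1, 2, 3]
means = [(x1 + x2) / 2 for x1, x2 in product(population, repeat=2)]   # all 16 ordered samples
dist = Counter(means)                                                 # each pair has probability 1/16

for xbar in sorted(dist):
    print(xbar, dist[xbar] / 16)    # the PMF of X̄: 0.0, 0.5, ..., 3.0 with probabilities 1/16, 2/16, ...
```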

2. Let X = (X1, X2, …, Xn)′ be a r.s. from Be(p), where 0 ≤ p ≤ 1.

   a. The sampling distribution of the sample sum is
   b. The sampling distribution of the sample mean is

3. Let X = (X1, X2, …, Xn)′ be a r.s. from Po(λ), where λ > 0.

   a. The sampling distribution of the sample sum is
   b. The sampling distribution of the sample mean is

4. Let X = (X1, X2, …, Xn)′ be a r.s. from N(μ, σ²), where μ ∈ ℝ and σ² > 0.

   a. The sampling distribution of the sample sum is
   b. The sampling distribution of the sample mean is

5. Let X = (X1, X2, …, Xn)′ be a r.s. from Ga(1, λ), where λ > 0.

   a. The sampling distribution of the sample sum is
   b. The sampling distribution of the sample mean is

Special Results on Sums of Random Variables

Defn: The family of density (or mass) functions { f(x; θ), with parameter θ ∈ Θ } is said to
      be reproductive with respect to the parameter θ if, and only if,

         X1 ~ f(x1; θ1) and X2 ~ f(x2; θ2), with X1 and X2 independent,
         ⟹ (X1 + X2) ~ f(x1 + x2; θ1 + θ2).

Remark: The above definition also applies for more than 2 independent r.v.'s.

Examples:

Let X1, X2, …, Xn be independent random variables.

1. Xi ~ Bi(mi, p) ∀ i = 1, 2, …, n  ⟹  Sn ~ Bi( ∑_{i=1}^{n} mi , p );
   and if mi = m ∀ i, then Sn ~ Bi(nm, p).

2. Xi ~ Po(λi) ∀ i = 1, 2, …, n  ⟹  Sn ~ Po( ∑_{i=1}^{n} λi );
   and if λi = λ ∀ i, then Sn ~ Po(nλ).

3. Xi ~ Neg Bi(ri, p) ∀ i = 1, 2, …, n  ⟹  Sn ~ Neg Bi( ∑_{i=1}^{n} ri , p );
   and if ri = r ∀ i, then Sn ~ Neg Bi(nr, p).

4. Xi ~ N(μi, σi²) ∀ i = 1, 2, …, n  ⟹  Sn ~ N( ∑_{i=1}^{n} μi , ∑_{i=1}^{n} σi² );
   and if μi = μ and σi² = σ² ∀ i, then Sn ~ N(nμ, nσ²).

5. Xi ~ Ga(ri, λ) ∀ i = 1, 2, …, n  ⟹  Sn ~ Ga( ∑_{i=1}^{n} ri , λ );
   and if ri = r ∀ i, then Sn ~ Ga(nr, λ).

Special Results:

•  Xi ~ N(μi, σi²) ∀ i = 1, 2, …, n  ⟹  ∑_{i=1}^{n} ai Xi ~ N( ∑_{i=1}^{n} ai μi , ∑_{i=1}^{n} ai² σi² )

•  Xi ~ Exp(λ) ∀ i = 1, 2, …, n  ⟹  Sn ~ Ga(n, λ)

1.3.2.3 Sampling from the Normal Distribution

Defn: A continuous r.v. X is said to have a chi-square distribution with k degrees of freedom
      (d.f.) if, and only if, the PDF of X is given by

         f_X(x) = [ 1 / ( 2^{k/2} Γ(k/2) ) ] x^{k/2 − 1} exp{−x/2} I_(0,∞)(x),  k ∈ ℤ⁺.

      Notation :  X ~ χ²(k)
      Mean     :  E(X) = k
      Variance :  Var(X) = 2k
      MGF      :  m_X(t) = [ 1/(1 − 2t) ]^{k/2}

FIGURE 1.1. Graph of the chi-square distribution with varying degrees of freedom (k = 2, 5, 10, 15).

Remarks

1. The degrees of freedom (d.f.) of the chi-square distribution completely specify the
distribution of a chi-square r.v.

2. A chi-square r.v. can take on only positive real numbers.

3. A chi-square r.v. with k d.f. is equivalent to a Gamma r.v. with parameters r = k/2 and λ = ½,
   i.e., χ²(k) ≡ Ga(r = k/2, λ = ½).

Theorem: If the r.v.'s X1, X2, …, Xk are normally and independently distributed with means
         μi and variances σi², i = 1, 2, …, k, respectively, then,

            U = ∑_{i=1}^{k} [ (Xi − μi)/σi ]² ~ χ²(k).

Corollary: If X1, X2, …, Xn is a r.s. from N(μ, σ²), then,

            U = ∑_{i=1}^{n} [ (Xi − μ)/σ ]² ~ χ²(n).

Remarks:

1. The theorem states that the sum of the squares of independent standard normal random
variables is a chi-square random variable, with d.f. equal to the number of r.v.’s (number
of terms) in the sum.
2. If X ~ N(μ, σ²), then $\left(\dfrac{X - \mu}{\sigma}\right)^2 \sim \chi^2(1)$.

3. If X ~ N(0,1), then $X^2 \sim \chi^2(1)$.

4. The chi-square family of densities is reproductive with respect to the degrees of freedom.
   a. X1, X2, …, Xn independent with Xi ~ χ²(ki) ⟹ $S_n \sim \chi^2\!\left(\sum_{i=1}^{n} k_i\right)$

   b. X1, X2, …, Xn iid χ²(k) ⟹ $S_n \sim \chi^2(nk)$

Illustration:

Four r.s.’s, each of size 100, from N(0,1) are obtained using PHStat for MS Excel. The
histograms of the normal random samples, and the histograms for the squares of the values are
shown below.

[Histograms of the four standard normal samples (panels n1, n2, n3, n4) and of their squared values (panels sq1, sq2, sq3, sq4).]

Theorem: Let X = (X1, X2, …, Xn)' be a r.s. from N(μ, σ²), where μ ∈ ℝ, σ² > 0, and n ≥ 2. Then,

a. $\bar{X}$ and S² are independent; and,

b. $\dfrac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1)$.
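A simulation sketch of this theorem, with arbitrary choices of μ, σ, n, the seed, and the number of replications (NumPy/SciPy assumed):

```python
# For normal samples, (n-1)S^2/sigma^2 should follow chi-square with n-1 d.f.,
# and Xbar should be (nearly) uncorrelated with S^2 across replications.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)
mu, sigma, n, reps = 5.0, 2.0, 10, 50_000

x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)                      # sample variance with divisor n - 1
q = (n - 1) * s2 / sigma**2

print("KS test vs chi2(n-1):", stats.kstest(q, "chi2", args=(n - 1,)))
print("corr(Xbar, S^2):", np.corrcoef(xbar, s2)[0, 1])   # should be close to 0
```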

Defn: A continuous r.v. X is said to have an F-distribution with m (numerator) and n (denominator) degrees of freedom (d.f.) if, and only if, the PDF of X is given by

$$f_X(x) = \frac{\Gamma\!\left(\frac{m+n}{2}\right)}{\Gamma\!\left(\frac{m}{2}\right)\Gamma\!\left(\frac{n}{2}\right)} \left(\frac{m}{n}\right)^{m/2} \frac{x^{\frac{m}{2}-1}}{\left(1+\frac{m}{n}x\right)^{\frac{m+n}{2}}}\, I_{(0,\infty)}(x), \quad m, n \in \mathbb{Z}^+ .$$

Notation : X ~ F(m, n)
Mean : E(X) = $\dfrac{n}{n-2}$, n > 2
Variance : Var(X) = $\dfrac{2n^2(m+n-2)}{m(n-2)^2(n-4)}$, n > 4

MGF : DNE

FIGURE 1.2. Graph of F-distribution with (m, n) degrees of freedom.

Remarks:

1. The numerator (m) and denominator (n) degrees of freedom completely specify the
distribution.

2. An F-distributed r.v. can take on only positive real numbers.

Theorem: Let U ~ χ²(m) and V ~ χ²(n), with U and V independent. Then,

$$X = \frac{U/m}{V/n} \sim F(m, n).$$
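The theorem can also be checked numerically; the degrees of freedom, seed, and replication count below are arbitrary, and NumPy/SciPy are assumed.

```python
# Monte Carlo check: (U/m)/(V/n) with independent chi-square U and V should
# follow F(m, n).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
m, n, reps = 5, 12, 100_000

u = rng.chisquare(m, size=reps)
v = rng.chisquare(n, size=reps)
ratio = (u / m) / (v / n)

print(stats.kstest(ratio, "f", args=(m, n)))     # large p-value expected
print("simulated mean:", ratio.mean(), " vs n/(n-2) =", n / (n - 2))
```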

Remark: The theorem states that the ratio of two independent chi-square r.v.’s over their
respective d.f. is an F-distributed r.v., with numerator d.f. equal to the d.f. of the
chi-square r.v. in the numerator and denominator d.f. equal to the d.f. of the chi-
square r.v. in the denominator.

Corollary: If X1, X2, …, Xm is a r.s. from N(μX, σ²) and Y1, Y2, …, Yn is another independent r.s. from N(μY, σ²), then

$$\frac{S_X^2}{S_Y^2} \sim F(m-1, n-1),$$

where $S_X^2 = \dfrac{\sum_{i=1}^{m} (X_i - \bar{X})^2}{m-1}$ and $S_Y^2 = \dfrac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1}$.

Corollary: If X~ Fm, n  , then (1/X) ~ Fn, m  .

Illustration: Using 4 of the r.s.’s of size 100 each from N(0,1) in the earlier illustration, the
histograms of sq1/sq2 and sq3/sq4 are shown below.

[Histograms of the two ratios (panels f1_2 and f5_4).]

Defn: A continuous r.v. X is said to have a Student's t-distribution with k degrees of freedom (d.f.) if, and only if, the PDF of X is given by

$$f_X(x) = \frac{\Gamma\!\left(\frac{k+1}{2}\right)}{\Gamma\!\left(\frac{k}{2}\right)\sqrt{k\pi}} \left(1 + \frac{x^2}{k}\right)^{-\frac{k+1}{2}} I_{(-\infty,\infty)}(x), \quad k \in \mathbb{Z}^+ .$$

Notation : X ~ t(k)
Mean : E(X) = 0
Variance : Var(X) = $\dfrac{k}{k-2}$, k > 2
MGF : DNE

FIGURE 1.3. Graph of the t-distribution with varying degrees of freedom (v = 2, 5, and 25; dotted curve: v = 25) and the standard normal distribution (black curve).

Remarks:

1. The degrees of freedom (d.f.) completely specify the Student’s t-distribution.

2. A t-distributed r.v. can take on any real number.

3. The t-distribution is symmetric about its mean 0.

4. The t-distribution is more variable than the standard normal distribution.

5. As the d.f. k grows large, the t-distribution approaches the standard normal distribution.

Theorem: Let Z ~ N(0,1) and U ~ χ²(k), with Z and U independent. Then,

$$T = \frac{Z}{\sqrt{U/k}} \sim t(k).$$

Remark: The theorem states that the ratio of a standard normal random variable to the
square root of an independent chi-square random variable over its degrees of
freedom is a t-distributed random variable, with d.f. equal to the d.f. of the chi-
square random variable in the denominator.

Corollary: If X1, X2, …, Xn is a r.s. from N(μ, σ²), with sample mean $\bar{X}$ and sample standard deviation S, then

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n-1).$$
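A simulation sketch of this corollary, with arbitrary parameter values, seed, and replication count (NumPy/SciPy assumed):

```python
# For normal samples, T = (Xbar - mu)/(S/sqrt(n)) should follow t(n-1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
mu, sigma, n, reps = 10.0, 3.0, 8, 100_000

x = rng.normal(mu, sigma, size=(reps, n))
t_stat = (x.mean(axis=1) - mu) / (x.std(axis=1, ddof=1) / np.sqrt(n))

print(stats.kstest(t_stat, "t", args=(n - 1,)))          # large p-value expected
print("P(|T| > 2):", np.mean(np.abs(t_stat) > 2),
      " vs 2*P(t(n-1) > 2) =", 2 * stats.t.sf(2, n - 1))
```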

Corollary: If X~ tk  , then X2~ F1, k  .

Illustration: The histograms of n1/n2 and of n5/√(sq4) (panels n1_n2 and n5_rootsq4) are shown below.


[Histograms of n1_n2 and n5_rootsq4.]

Some Important Results:

Let $X_1, X_2, \ldots, X_{n_1}$ be a r.s. from $N(\mu_1, \sigma_1^2)$ and $Y_1, Y_2, \ldots, Y_{n_2}$ be another independent r.s. from $N(\mu_2, \sigma_2^2)$. Then,

1. $\dfrac{\bar{X} - \mu_1}{S_1/\sqrt{n_1}} \sim t(n_1 - 1)$ and $\dfrac{\bar{Y} - \mu_2}{S_2/\sqrt{n_2}} \sim t(n_2 - 1)$,

   where $S_1^2 = \dfrac{\sum_{i=1}^{n_1} (X_i - \bar{X})^2}{n_1 - 1}$ and $S_2^2 = \dfrac{\sum_{i=1}^{n_2} (Y_i - \bar{Y})^2}{n_2 - 1}$.

2. $\dfrac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \sim N(0, 1)$

3. $\dfrac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{\sqrt{S_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \sim t(n_1 + n_2 - 2)$, assuming $\sigma_1^2 = \sigma_2^2$,

   where $S_p^2 = \dfrac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$ (pooled variance)

4. $\dfrac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F(n_1 - 1, n_2 - 1)$

5. $\dfrac{S_1^2}{S_2^2} \sim F(n_1 - 1, n_2 - 1)$, assuming $\sigma_1^2 = \sigma_2^2$
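Result 3 (the pooled two-sample t statistic) can be checked by simulation; the parameter values below are arbitrary, and NumPy/SciPy are assumed.

```python
# With equal variances, the pooled two-sample statistic should follow t(n1+n2-2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
mu1, mu2, sigma = 3.0, 5.0, 2.0
n1, n2, reps = 6, 9, 50_000

x = rng.normal(mu1, sigma, size=(reps, n1))
y = rng.normal(mu2, sigma, size=(reps, n2))
s1sq = x.var(axis=1, ddof=1)
s2sq = y.var(axis=1, ddof=1)
sp2 = ((n1 - 1) * s1sq + (n2 - 1) * s2sq) / (n1 + n2 - 2)   # pooled variance
t_stat = ((x.mean(axis=1) - y.mean(axis=1)) - (mu1 - mu2)) / np.sqrt(sp2 * (1/n1 + 1/n2))

print(stats.kstest(t_stat, "t", args=(n1 + n2 - 2,)))       # large p-value expected
```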

1.3.2.4 Order Statistics

Defn: Let X1, X2, …, Xn be a r.s. from FX. Let X(1) ≤ X(2) ≤ … ≤ X(n) be the Xi's arranged in
increasing order. Then, X(1), X(2), …, X(n) are called the order statistics corresponding
to the r.s. X1, X2, …, Xn, and X(r) is called the rth order statistic.

Remarks

1. For each r = 1, 2, …, n, X(r) is a random variable.

2. In general, the order statistics (o.s.) are not independent, unless FX is a distribution that is
degenerate at some constant c.

3. The first and the last order statistics, X(1) and X(n), are called the sample minimum and
sample maximum, respectively.

Theorem: Let X(1), X(2), …, X(n) represent the o.s. of a r.s. from the distribution FX. For r
=1, 2, …, n, the CDF of X(r) is given by

$$F_r(y) = \sum_{j=r}^{n} \binom{n}{j} \left[F_X(y)\right]^j \left[1 - F_X(y)\right]^{n-j}.$$

Corollary: The CDF of the sample minimum X(1) and maximum X(n) are, respectively,

$$F_1(y) = 1 - \left[1 - F_X(y)\right]^n \quad \text{and} \quad F_n(y) = \left[F_X(y)\right]^n.$$

Example: Suppose 20 identical light bulbs operate independently in a system. The system
stops when one light bulb expires. For i = 1, 2, …, 20, let Xi represent the lifetime
(in days) of the ith bulb, with each Xi ~ Exp(λ). Find the CDF of X(4). What is the
probability that the system will still be working after 150 days?
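A computational sketch of this example using the order-statistic CDF above. The exercise does not give a numerical value of λ, so λ = 1/1000 (a mean bulb life of 1000 days) is used purely for illustration, and "the system is still working after 150 days" is read here as the event {X(4) > 150}.

```python
# Evaluate F_4(y) = sum_{j=4}^{20} C(20,j) F(y)^j (1-F(y))^(20-j) for exponential
# bulb lifetimes, then P(system still working) = 1 - F_4(150).
from math import comb, exp

n, r, lam, y = 20, 4, 1 / 1000, 150.0        # lam is a hypothetical rate per day

F = 1 - exp(-lam * y)                        # per-bulb probability of failing by day y

F_r = sum(comb(n, j) * F**j * (1 - F)**(n - j) for j in range(r, n + 1))

print(f"P(4th failure by day {y:.0f}) = F_4({y:.0f}) = {F_r:.4f}")
print(f"P(system still working)      = {1 - F_r:.4f}")
```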

Theorem: If FX is absolutely continuous with PDF fX, then for r = 1, 2, …, n, the PDF of
X(r) denoted fr, is given by

$$f_r(y) = r \binom{n}{r} \left[F_X(y)\right]^{r-1} f_X(y) \left[1 - F_X(y)\right]^{n-r}.$$

Corollary: If FX is absolutely continuous with PDF fX, then the PDFs of the sample minimum X(1) and maximum X(n) are, respectively,

$$f_1(y) = n\left[1 - F_X(y)\right]^{n-1} f_X(y) \quad \text{and} \quad f_n(y) = n\left[F_X(y)\right]^{n-1} f_X(y).$$

Theorem: Let X(1), X(2), …, X(n) represent the o.s. of a r.s. from the distribution FX. For r, s
=1, 2, …, n, and r < s, if FX has PDF fX, then the joint PDF of X(r) and X(s),
denoted by fr,s, is given by

n!FX x  r 1 f X x FX  y   FX x  s  r 1 f X  y 1  FX  y  n  s


f r , s x, y   , x y.
r  1!s  r  1!n  s !
Corollary: The joint PDF of the sample minimum X(1) and maximum X(n) is given by

$$f_{1,n}(x, y) = n(n-1)\left[F_X(y) - F_X(x)\right]^{n-2} f_X(x) f_X(y), \quad x < y.$$

Corollary: In general, the joint PDF of the o.s. is given by

$$f_{1,2,\ldots,n}(x_1, x_2, \ldots, x_n) = n! \prod_{j=1}^{n} f_X(x_j), \quad x_1 < x_2 < \cdots < x_n.$$

Functions of Order Statistics and Their Distributions:

1. Sample Mean of the o.s. : $\bar{Y} = \dfrac{\sum_{j=1}^{n} X_{(j)}}{n} = \bar{X}$

   The distribution of $\bar{Y}$ is the same as the distribution of $\bar{X}$.

2. Sample Median : $\tilde{X} = X_{\left(\frac{n+1}{2}\right)}$, if n is odd;

   $\tilde{X} = \dfrac{X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}}{2}$, if n is even.

   For n odd, the distribution of the median is $f_M$, where $M = \frac{n+1}{2}$.

   For n even, the distribution of the median can be derived (using transformation) from the joint PDF $f_{M, M+1}$, where $M = \frac{n}{2}$.

3. Sample Range : $R = X_{(n)} - X_{(1)}$

   The distribution of R can be derived (using transformation) from the joint PDF $f_{1,n}$.

Examples:

1. Suppose we take a r.s. of size n from Bi(m, p). Find the CDF and the PMF of X(r).

2. Consider n identical batteries operating independently in an electrical system. For each i = 1, 2, …, n, let Xi denote the length of life (in hours) of the ith battery, with each Xi ~ Exp(1/λ).
Determine the distribution of the length of life of the system if the batteries operate
simultaneously (i.e., when one dies, the system dies), and if the batteries operate in a series
(i.e., the system dies when the last battery dies).

3. Let X1, X2, …, Xn be a r.s. from U(0, θ), n ≥ 2. Find the mean and the variance of the r.v. $\dfrac{n+1}{n} X_{(n)}$.
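A quick Monte Carlo check of Example 3, with θ = 2 and n = 5 chosen purely for illustration (NumPy assumed); working out the moments of the maximum of a U(0, θ) sample gives mean θ and variance θ²/(n(n+2)) for this statistic, which the simulation should roughly reproduce.

```python
# Simulate (n+1)/n times the sample maximum of a U(0, theta) sample and compare
# its simulated mean and variance with theta and theta^2 / (n(n+2)).
import numpy as np

rng = np.random.default_rng(5)
theta, n, reps = 2.0, 5, 200_000

samples = rng.uniform(0.0, theta, size=(reps, n))
w = (n + 1) / n * samples.max(axis=1)          # (n+1)/n times the sample maximum

print("simulated mean:", w.mean(), " vs theta =", theta)
print("simulated var: ", w.var(), " vs theta^2/(n(n+2)) =", theta**2 / (n * (n + 2)))
```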

1.3.3 Some Asymptotic Theory

Asymptotic theory deals with results that arise for sample size n approaching infinity, or for very
large n. The following asymptotic results will be useful in obtaining approximate (or asymptotic)
sampling distributions of certain statistics.

Theorem: (Chebyshev's Inequality) Let X be a r.v. with mean μ and finite variance σ² < ∞. Then, ∀ ε > 0,

$$P\left(|X - \mu| \geq \varepsilon\right) \leq \frac{\sigma^2}{\varepsilon^2}.$$

Corollary: $P\left(|X - \mu| \geq k\sigma\right) \leq \dfrac{1}{k^2}$.

Remark: Alternatively, this inequality is also equivalent to:

$$P\left(|X - \mu| < k\sigma\right) \geq 1 - \frac{1}{k^2}, \quad \text{or} \quad P\left(\mu - k\sigma < X < \mu + k\sigma\right) \geq 1 - \frac{1}{k^2}.$$

Special Results

1. For k = 1, P(μ − σ < X < μ + σ) ≥ 0.

2. For k = 2, P(μ − 2σ < X < μ + 2σ) ≥ 0.75.

3. For k = 3, P(μ − 3σ < X < μ + 3σ) ≥ 8/9 ≈ 0.8889.



Examples

1. Let X~N(0,1). Find the probabilities that X is within 1 standard deviation from the mean μ, 2 standard deviations from the mean μ, and 3 standard deviations from the mean μ.

2. Let X~Bi(n=10, p=0.9). Find the probabilities that X is within 1 standard deviation from the mean μ, 2 standard deviations from the mean μ, and 3 standard deviations from the mean μ.

3. Let X~Po(λ=9). Find the probabilities that X is within 1 standard deviation from the mean μ, 2 standard deviations from the mean μ, and 3 standard deviations from the mean μ.

4. Let X~t(6). Find the probabilities that X is within 1 standard deviation from the mean μ, 2 standard deviations from the mean μ, and 3 standard deviations from the mean μ.
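The probabilities in Examples 1-4 can be computed with SciPy and compared with the Chebyshev lower bound 1 − 1/k². The helper below is a sketch; for the two discrete cases it evaluates the half-open interval stated in its docstring, one of several reasonable readings of "within k standard deviations".

```python
# Compare exact "within k sd" probabilities with the Chebyshev bound 1 - 1/k^2.
import numpy as np
from scipy import stats

def within_k(dist, mean, sd, k):
    """P(mean - k*sd < X <= mean + k*sd); equals P(|X - mean| < k*sd) for the continuous cases."""
    return dist.cdf(mean + k * sd) - dist.cdf(mean - k * sd)

cases = {
    "N(0,1)":      (stats.norm(0, 1), 0.0, 1.0),
    "Bi(10, 0.9)": (stats.binom(10, 0.9), 10 * 0.9, np.sqrt(10 * 0.9 * 0.1)),
    "Po(9)":       (stats.poisson(9), 9.0, 3.0),
    "t(6)":        (stats.t(6), 0.0, np.sqrt(6 / 4)),   # Var(t(6)) = 6/(6-2)
}

chebyshev = [0.0, 0.75, round(8 / 9, 4)]
for name, (dist, mean, sd) in cases.items():
    probs = [round(within_k(dist, mean, sd, k), 4) for k in (1, 2, 3)]
    print(f"{name:12s} k=1,2,3: {probs}   Chebyshev bounds: {chebyshev}")
```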

Theorem: Weak Law of Large Numbers (WLLN) Let X1, X2, …, Xn be a r.s. from the PDF fX, with mean μ and finite variance σ² < ∞. Let ε and δ be two arbitrary numbers such that ε > 0 and 0 < δ < 1. If

$$n \geq \frac{\sigma^2}{\varepsilon^2 \delta},$$

then $P\left(\left|\bar{X} - \mu\right| < \varepsilon\right) \geq 1 - \delta$.

Remarks

1. In Probability Theory, we say "$\bar{X}$ converges in probability to μ", that is,

$$\lim_{n \to \infty} P\left(\left|\bar{X} - \mu\right| > \varepsilon\right) = 0, \quad \forall\, \varepsilon > 0, \quad \text{and we write } \bar{X} \xrightarrow{P} \mu.$$

2. Explanation of the WLLN: The probability that $\bar{X}$ will deviate from the true population mean μ by more than some arbitrarily small nonzero value ε can be made arbitrarily small by choosing n sufficiently large. Because of this, the sample mean can be used to estimate μ reliably.

3. If X1, X2, …, Xn is a r.s. from the PDF fX, with mean μ and variance σ² < ∞, we can determine n ∈ ℤ⁺ so that the probability that $\bar{X}$ will differ from μ by less than an arbitrarily small amount ε can be made as close to 1 as possible. Thus, $\bar{X}$ can be used to estimate μ with a high degree of accuracy:

$$P\left(\left|\bar{X} - \mu\right| < \varepsilon\right) \to 1 \text{ as } n \to \infty, \quad \forall\, \varepsilon > 0.$$

4. If n is sufficiently large, $\left|\bar{X} - \mu\right|$ is likely to be small, but this does not imply that $\left|\bar{X} - \mu\right|$ is small for all large n.

5. The result does not imply that $P\left(\left|\bar{X} - \mu\right| < \varepsilon\right) = 1$. It only means that $\bar{X}$ is very likely to be near μ for sufficiently large n, not that $\bar{X}$ is guaranteed to be near μ once n is increased.

Example: Consider a distribution with unknown mean μ and variance σ² = 1. How large a sample should be taken so that a probability of at least 0.95 is attained that the sample mean $\bar{X}$ will not deviate from the population mean μ by more than 0.4 units?
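A direct computation of the required sample size from the WLLN/Chebyshev bound n ≥ σ²/(ε²δ), with σ² = 1, ε = 0.4, and 1 − δ = 0.95:

```python
# Smallest n satisfying the WLLN bound n >= sigma^2 / (eps^2 * delta).
import math

sigma2, eps, delta = 1.0, 0.4, 0.05
n = math.ceil(sigma2 / (eps**2 * delta))
print("required n:", n)        # 1 / (0.16 * 0.05) = 125
```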

Theorem: Central Limit Theorem (CLT) Let X1, X2, …, Xn be a r.s. from the PDF fX, with mean μ and finite variance σ². Let $\bar{X}$ be the sample mean of the r.s. and define the r.v. Zn as

$$Z_n = \frac{\bar{X} - E(\bar{X})}{\sqrt{Var(\bar{X})}} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}.$$

Then, the distribution of Zn approaches the standard normal distribution as n → ∞, i.e., Zn → N(0,1) as n → ∞.

Remarks

1. In Probability Theory, we say "Zn converges in distribution to the standard normal distribution", i.e.,

$$\lim_{n \to \infty} F_{Z_n}(z) = \Phi(z), \quad \forall\, z, \quad \text{and we write } Z_n \xrightarrow{D} N(0, 1).$$

2. The CLT also implies that $\bar{X}$ is approximately distributed as $N\!\left(\mu, \dfrac{\sigma^2}{n}\right)$ as n → ∞, and $S_n$ is approximately distributed as $N(n\mu, n\sigma^2)$ as n → ∞.

3. The CLT result holds for all r.s.’s, regardless of the form of the parent PMF/PDF, for as long
as this distribution has finite variance.

4. Importance of the CLT: In making inferences about population parameter(s), we need the
distribution of certain statistics, e.g., the sample mean X . Finding the sampling distributions of
statistics is often mathematically easier if samples are taken from the normal distribution.
However, if the r.s. is not taken from the normal distribution, finding the sampling distribution
of X can become very difficult. The CLT states that, for as long as (1) the parent PMF/PDF of
the r.s. has finite variance, and (2) the sample size is large, the approximate distribution of the
sample mean is a normal distribution.

Examples

1. Consider a distribution with unknown mean μ and variance σ² = 1. How large a sample should be taken so that a probability of (exactly) 0.95 is attained that the sample mean $\bar{X}$ will not deviate from the population mean μ by more than 0.4 units?

2. An electrical firm manufactures light bulbs that have an average length of life equal to 800 hours and a standard deviation of 40 hours. Find the probability that a random sample of 16 bulbs will have an average life of less than 775 hours.
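Computational sketches for the two examples (SciPy assumed):

```python
# CLT-based calculations for the two examples above.
from math import sqrt, ceil
from scipy import stats

# Example 1: choose n so that P(|Xbar - mu| < 0.4) is approximately 0.95 when
# sigma^2 = 1. By the CLT, 0.4*sqrt(n)/sigma should equal z_{0.975}.
z = stats.norm.ppf(0.975)
n = ceil((z * 1.0 / 0.4) ** 2)
print("Example 1: n =", n)                         # (1.96/0.4)^2 ~ 24.01 -> 25

# Example 2: Xbar for n = 16 bulbs, mu = 800, sigma = 40, so SE = 40/4 = 10.
p = stats.norm.cdf((775 - 800) / (40 / sqrt(16)))
print("Example 2: P(Xbar < 775) =", round(p, 4))   # Phi(-2.5) ~ 0.0062
```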

Corollary: De Moivre-Laplace Theorem (Normal Approximation to Binomial Distribution) If Z~Bi(n, p), with p close to ½ (i.e., not very close to 0 or 1), then the approximate (asymptotic) distribution of Z is N(np, npq), where q = 1 − p, as n → ∞.

Remark: The De Moivre-Laplace Theorem uses a normal distribution to approximate the
probabilities under a binomial distribution. However, the approximation is appropriate
only for binomial distributions with (1) very large values of n, and (2) values of p that
are not very close to 0 or 1. When the value of p is very close to 0 or 1, and when the
value of n is very large, the following corollary, which uses the normal distribution to
approximate the Poisson distribution, will be more appropriate.

Examples

1. Toss a pair of dice 600 times. Find the probability that there will be between 90 and 110 tosses
(exclusive) resulting in a total of “7” on the pair of dice.

2. The probability that a patient recovers from a rare blood disease is 0.6. If 100 people are known
to have contracted the disease, what is the probability that less than half of them will survive?

3. A multiple-choice quiz has 200 questions, each with 4 possible answers, only 1 of which is the
correct answer. What is the probability that sheer guesswork yields from 25 to 30 correct
answers for 80 of the 200 problems about which the student has no knowledge?
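A sketch for Example 2 above (the blood-disease problem), comparing the exact binomial probability with a continuity-corrected normal approximation (SciPy assumed):

```python
# X ~ Bi(100, 0.6); P(fewer than half survive) = P(X <= 49).
# Normal approximation: N(np, npq) with a continuity correction of 0.5.
from math import sqrt
from scipy import stats

n, p = 100, 0.6
mu, sd = n * p, sqrt(n * p * (1 - p))

exact = stats.binom.cdf(49, n, p)
approx = stats.norm.cdf((49.5 - mu) / sd)          # continuity-corrected
print("exact binomial:", round(exact, 4), " normal approx:", round(approx, 4))
```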

Corollary: (Normal Approximation to Poisson Distribution) If X1, X2, …, Xn is a r.s. from Po(λ), with λ small, the sample sum $S_n = \sum_i X_i$ is approximately (or asymptotically) distributed as N(nλ, nλ) as n → ∞.

Note that the exact distribution of Sn is Po(nλ).

Examples

1. Suppose that, on average, 1 person in every 1000 is alcoholic. Find the probability that a random
sample of 8000 people will yield fewer than 7 alcoholics.

2. The probability that a person dies from a respiratory infection is 0.002. Find the probability that
fewer than 5 of the next 2000 so infected will die.
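A sketch for Example 1 above (the alcoholism problem), comparing the exact binomial probability with the Poisson and continuity-corrected normal approximations (SciPy assumed):

```python
# X ~ Bi(8000, 0.001); P(fewer than 7 alcoholics) = P(X <= 6).
# Poisson approximation uses Po(n*p) = Po(8); the normal approximation of that
# Poisson uses N(8, 8) with a continuity correction.
from math import sqrt
from scipy import stats

n, p = 8000, 0.001
lam = n * p                                        # 8

exact   = stats.binom.cdf(6, n, p)
poisson = stats.poisson.cdf(6, lam)
normal  = stats.norm.cdf((6.5 - lam) / sqrt(lam))  # continuity-corrected
print("binomial:", round(exact, 4), " Poisson:", round(poisson, 4), " normal:", round(normal, 4))
```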
