Lecture 5

Section 4
Mathematical statistics
Sourabh B Paul (IIT Delhi) Econometric Methods II Semester 2023-24 65

Review of Random Sampling I
Population: Well defined group of subjects

Statistical inference involves learning something about a population
given the availability of a sample from that population.
Broadly, this involves estimation and hypothesis testing
Random Sample: If Y1, Y2, ..., Yn are independent random variables
with a common probability density function f(y; θ), then
{Y1, Y2, ..., Yn} is said to be a random sample from f(y; θ), or a
random sample drawn from a population represented by f(y; θ).
Different outcomes are possible before the sampling is actually carried
out
Once a sample is obtained, we have a set of numbers, say,
{y1, y2, ..., yn}, which constitute the data that we work with.

Estimator and Estimate I
Given a random sample {Y1, Y2, ..., Yn} drawn from a population
distribution that depends on an unknown parameter θ, an estimator of
θ is a rule that assigns each possible outcome of the sample a value of
θ.
An estimator W of a parameter θ can be expressed as an abstract
mathematical formula:
W = h(Y1, Y2, ..., Yn)
Example: Let Y1, Y2, ..., Yn be a random sample from the same distribution
with mean µ. An estimator of µ is the sample average,
$$\bar{Y} = n^{-1} \sum_{i=1}^{n} Y_i$$
Note that an estimator W is also a random variable.

Estimator and Estimate II
We study various properties of the probability distribution of the
random variable W
The distribution of an estimator is called its sampling distribution
When a particular set of numbers y1, y2, ..., yn is plugged into the
function h(.), we obtain an estimate of θ
w = h(y1, y2, ..., yn)
W is called a point estimator and w is called a point estimate
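
The distinction between the estimator W (a random variable) and the estimate w (a number) can be illustrated with a small simulation. This is only a sketch: the normal population and the values of mu, sigma and n are hypothetical choices, not taken from the lecture.

```python
import random
import statistics

def h(sample):
    """The estimator W = h(Y1, ..., Yn); here h is the sample average."""
    return sum(sample) / len(sample)

rng = random.Random(0)          # fixed seed so the run is repeatable
mu, sigma, n = 5.0, 2.0, 25     # hypothetical population values

# One realised sample gives one point estimate w = h(y1, ..., yn).
one_sample = [rng.gauss(mu, sigma) for _ in range(n)]
w = h(one_sample)

# Repeating the sampling many times traces out the sampling distribution of W.
estimates = [h([rng.gauss(mu, sigma) for _ in range(n)]) for _ in range(5000)]
mean_of_W = statistics.fmean(estimates)
sd_of_W = statistics.stdev(estimates)

print(f"single estimate w               = {w:.3f}")
print(f"mean of W over repeated samples = {mean_of_W:.3f}")
print(f"sd of W over repeated samples   = {sd_of_W:.3f}")
```

Each run of h on a new sample gives a different w, while the collection of values approximates the sampling distribution of W.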

There are many ways to combine data to estimate parameters (many
estimators for the same parameter)
We need some sensible criteria (properties) to choose among estimators

General Approaches to Parameter Estimation
Are there general approaches to estimation that produce estimators
with good properties, such as unbiasedness, consistency, and efficiency?
- Method of Moments (the parameter θ is shown to be related to some
  expected value in the distribution of Y, e.g. sample average, sample
  correlation coefficient)
- Maximum Likelihood (out of all the possible values for θ, the value that
  makes the likelihood of the observed data largest should be chosen)
- Least Squares (the estimator makes the sum of squared deviations as small
  as possible)
We will treat them in depth as required

Finite Sample Properties of Estimators
Unbiasedness: An estimator W of θ is an unbiased estimator if
E(W) = θ.
Example: the sample mean is an unbiased estimator of the population mean.
Example: The sample variance, defined as
$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$
is unbiased for σ², where the sample Yi is drawn from a population
represented by a distribution with mean E(Y) = µ and variance
Var(Y) = σ².
Note: If µ is known, we can replace Ȳ by µ and divide by n instead of
n − 1. µ is rarely known in practice.
Bias is defined as Bias(W) = E(W) − θ
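
The role of the n − 1 divisor can be checked by simulation. The sketch below assumes a normal population with hypothetical values of mu, sigma and n; it averages the sample variance over many repeated samples, once with divisor n − 1 and once with divisor n.

```python
import random

rng = random.Random(1)
mu, sigma, n, reps = 0.0, 3.0, 5, 40000   # hypothetical values; sigma^2 = 9

sum_unbiased = 0.0   # accumulates S^2 (divisor n - 1)
sum_biased = 0.0     # accumulates the divisor-n version

for _ in range(reps):
    y = [rng.gauss(mu, sigma) for _ in range(n)]
    ybar = sum(y) / n
    ss = sum((yi - ybar) ** 2 for yi in y)   # sum of squared deviations
    sum_unbiased += ss / (n - 1)
    sum_biased += ss / n

avg_unbiased = sum_unbiased / reps
avg_biased = sum_biased / reps

print(f"true variance                 = {sigma ** 2:.2f}")
print(f"average of S^2 (divisor n-1)  = {avg_unbiased:.2f}")
print(f"average with divisor n        = {avg_biased:.2f}")
```

The divisor-n version has expected value σ²(n − 1)/n, so its bias is −σ²/n, which is noticeable when n is as small as 5.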

The Sampling Variance of Estimators I
How spread out is the distribution of an estimator?

The variance of an estimator is often called its sampling variance
because it is the variance associated with a sampling distribution.
The sampling variance is not a random variable; it is a constant, but it
might be unknown.
Example: Find Var (Ȳ )
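
One way to work this example, using only the independence of the Yi and Var(Yi) = σ²:

$$\mathrm{Var}(\bar{Y}) = \mathrm{Var}\left(n^{-1} \sum_{i=1}^{n} Y_i\right) = n^{-2} \sum_{i=1}^{n} \mathrm{Var}(Y_i) = n^{-2} \cdot n\sigma^2 = \frac{\sigma^2}{n}$$

The second equality uses independence, which makes all covariance terms zero.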
Relative Efficiency: If W1 and W2 are two unbiased estimators of θ,
W1 is efficient relative to W2 when Var(W1) ≤ Var(W2) for all θ, with
strict inequality for at least one value of θ.
Comparing variances is meaningless if we do not restrict our attention
to unbiased estimators.
One way to compare estimators that are not necessarily unbiased is to
compute the Mean Squared Error (MSE) of the estimators.
If W is an estimator of θ, then the MSE of W is defined as
MSE(W) = E[(W − θ)²]

The Sampling Variance of Estimators II
It can be shown that MSE(W) = Var(W) + [Bias(W)]²
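
A short derivation of this decomposition, adding and subtracting E(W) inside the square:

$$E[(W - \theta)^2] = E\big[(W - E(W))^2\big] + 2\,(E(W) - \theta)\,E[W - E(W)] + (E(W) - \theta)^2 = \mathrm{Var}(W) + [\mathrm{Bias}(W)]^2$$

The cross term vanishes because E[W − E(W)] = 0, and E(W) − θ is a constant equal to Bias(W).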

[Figure 3: Relative Efficiency. Densities f(w) of W1 and W2 plotted against w, with θ marked on the horizontal axis.]

Large sample or Asymptotic property of an estimator
Consistency: Let Wn be an estimator of θ based on a sample
{Y1, Y2, ..., Yn} of size n. Then, Wn is a consistent estimator of θ if for
every ε > 0,
P(|Wn − θ| > ε) → 0 as n → ∞
When Wn is consistent, we also say that θ is the probability limit of
Wn, written as plim(Wn) = θ.
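
The definition can be explored numerically by estimating P(|Wn − θ| > ε) for increasing n. The sketch below takes Wn to be the sample mean of Uniform(0, 1) draws, so θ = 0.5; the population, ε and the sample sizes are illustrative assumptions only.

```python
import random

def tail_prob(n, eps, reps, rng, mu=0.5):
    """Monte Carlo estimate of P(|Ybar_n - mu| > eps) for Uniform(0, 1) draws."""
    exceed = 0
    for _ in range(reps):
        ybar = sum(rng.random() for _ in range(n)) / n
        if abs(ybar - mu) > eps:
            exceed += 1
    return exceed / reps

rng = random.Random(2)
eps = 0.05
probs = {n: tail_prob(n, eps, reps=1000, rng=rng) for n in (10, 100, 1000)}

for n, p in probs.items():
    print(f"n = {n:5d}: estimated P(|Ybar - mu| > {eps}) = {p:.3f}")
```

The estimated probability shrinks towards zero as n grows, which is what consistency of the sample mean requires for this particular ε.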

Consistency
[Figure: sampling densities f_Wn(w) for n = 4, 16 and 40, plotted against w, with θ marked on the horizontal axis.]
Exercise
What is the Law of Large Numbers?
What is the Central Limit Theorem?

Properties of plim
plim(g(Wn)) = g(plim(Wn)) for any continuous function g(.)
Example: $S_n^2 = (n-1)^{-1} \sum_{i=1}^{n} (Y_i - \bar{Y}_n)^2$ is unbiased for σ². You can prove
that it is also consistent.
A natural estimator of σ is $S_n = \sqrt{S_n^2}$. But this is not unbiased
because the expected value of the square root is not the square root of the
expected value. However, Sn is consistent.

Properties of plim
If plim(Tn) = α and plim(Un) = β, then
plim(Tn + Un) = α + β
plim(Tn Un) = αβ
plim(Tn / Un) = α/β, provided β ≠ 0

Exercise
Let µm and µf be the population means of annual earnings of male and
female IIT graduates respectively. You are interested in the percentage
difference in annual earnings γ ≡ 100(µm − µf)/µf.
Propose an estimator of γ. Is it unbiased? Is it consistent?

Shape of the sampling distribution
The three properties above do not tell us anything about the shape
of the distribution of an estimator
We need to approximate it for constructing interval estimators or
for hypothesis testing
Asymptotic normality results are very useful for this purpose
Let {Z1, Z2, ..., Zn} be a sequence of random variables, such that for
all numbers z
P(Zn ≤ z) → Φ(z) as n → ∞,
where Φ(z) is the standard normal distribution function. Then, Zn is
said to have an asymptotic standard normal distribution. In short,
$$Z_n \overset{a}{\sim} N(0, 1)$$

Central Limit Theorem
Let {Y1, Y2, ..., Yn} be a random sample with mean µ and variance
σ². Then,
$$Z_n = \frac{\bar{Y}_n - \mu}{\sigma/\sqrt{n}}$$
has an asymptotic standard normal distribution.
Exercise: If we replace σ by its sample counterpart Sn in the above
standardised Zn, what kind of distribution does it follow when n → ∞
and when n is small?
When two consistent estimators have asymptotic normal distributions,
we choose the estimator with the smallest asymptotic variance.
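
The Central Limit Theorem can be illustrated with a population that is clearly not normal. The sketch below assumes an Exponential(1) population (mean 1, standard deviation 1), a choice made here purely for illustration, and checks how often the standardised mean Zn falls inside (−1.96, 1.96).

```python
import math
import random

rng = random.Random(3)
n, reps = 50, 5000
mu, sigma = 1.0, 1.0   # mean and sd of an Exponential(1) population

inside = 0
for _ in range(reps):
    ybar = sum(rng.expovariate(1.0) for _ in range(n)) / n
    z = (ybar - mu) / (sigma / math.sqrt(n))   # standardised sample mean
    if abs(z) < 1.96:
        inside += 1

share_inside = inside / reps
print(f"share of Z_n in (-1.96, 1.96): {share_inside:.3f}")
print("value for an exact N(0, 1):    0.950")
```

Even though each draw is strongly skewed, the share is close to the standard normal value of 0.95 at n = 50.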

Interval Estimation and Confidence Intervals
A point estimate obtained from a particular sample does not, by itself,
provide enough information for testing economic theories or for
informing policy discussions.
It provides no information about how close the estimate is “likely” to
be to the population parameter.
How are we to know whether crime rates in states with higher literacy
are close to those in states with lower literacy?
How do we know that increasing tax rates makes a big difference in
tobacco consumption?
Reporting the standard deviation of the estimator, along with the point
estimate, provides some information on the accuracy of our estimate.
However, that makes no direct statement about where the population
value is likely to lie in relation to the estimate

Confidence interval
Example: Suppose the population has a N(µ, σ²) distribution and let
{Y1, ..., Yn} be a random sample from this population. The variance of
the population is known.
The sample average, Ȳ, has a normal distribution with mean µ and
variance σ²/n: Ȳ ~ N(µ, σ²/n).
We can standardize Ȳ, and, because the standardized version of Ȳ has
a standard normal distribution, we have
$$P\left(-1.96 < \frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} < 1.96\right) = 0.95$$
This information allows us to construct an interval estimate of µ.

Probabilistic interpretation: the probability that the random interval
$\left[\bar{Y} - 1.96\,\sigma/\sqrt{n},\ \bar{Y} + 1.96\,\sigma/\sqrt{n}\right]$ contains the population mean µ is 0.95, or
95%
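
The step from the probability statement to the random interval is a rearrangement of the inequalities inside P(·):

$$-1.96 < \frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} < 1.96 \iff -1.96\,\frac{\sigma}{\sqrt{n}} < \bar{Y} - \mu < 1.96\,\frac{\sigma}{\sqrt{n}} \iff \bar{Y} - 1.96\,\frac{\sigma}{\sqrt{n}} < \mu < \bar{Y} + 1.96\,\frac{\sigma}{\sqrt{n}}$$

The three events are identical, so each has probability 0.95. The randomness is in the endpoints, not in µ.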

When σ is unknown I
For unknown σ, we must use an estimate
$$s = \left(\frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2\right)^{1/2}$$
We obtain a confidence interval that depends entirely on the observed
data
Unfortunately, this does not preserve the 95% level of confidence
because s depends on the particular sample
The random interval [Ȳ ± 1.96(S/√n)] no longer contains µ with
probability 0.95
How should we proceed?
$$\frac{\bar{Y} - \mu}{S/\sqrt{n}} \sim\ ?$$

When σ is unknown I
For a normal population, it follows a t distribution (why?)
$$\frac{\bar{Y} - \mu}{S/\sqrt{n}} \sim t_{n-1}$$
Let c denote the 97.5th percentile in the t_{n−1} distribution. Then
P(−c < t_{n−1} < c) = 0.95
The value of c depends on the degrees of freedom parameter
Once c has been properly chosen, the random interval
[Ȳ − c·S/√n, Ȳ + c·S/√n]
contains µ with probability 0.95.
The associated random variable S/√n is called the standard error of Ȳ
For a particular sample, the 95% confidence interval is calculated as
[ȳ − c·s/√n, ȳ + c·s/√n]

When σ is unknown II
More generally, let c_α denote the 100(1 − α) percentile in the t_{n−1}
distribution. Then, a 100(1 − α)% confidence interval is obtained as
[ȳ − c_{α/2}·s/√n, ȳ + c_{α/2}·s/√n]
Obtaining c_{α/2} requires choosing α and knowing the degrees of freedom
n − 1
A simple rule of thumb: [ȳ ± 2·se(ȳ)]
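
The loss of coverage from using 1.96 with an estimated standard deviation, and its repair by the t critical value, can be seen in a small simulation. This is a sketch under assumed values: a normal population with hypothetical mu and sigma, n = 5, and the critical value 2.776 read from a standard t table for 4 degrees of freedom.

```python
import math
import random

rng = random.Random(4)
mu, sigma, n, reps = 10.0, 2.0, 5, 20000   # hypothetical normal population
c_normal = 1.96    # 97.5th percentile of N(0, 1)
c_t4 = 2.776       # 97.5th percentile of t with n - 1 = 4 df (t table)

cover_normal = 0
cover_t = 0
for _ in range(reps):
    y = [rng.gauss(mu, sigma) for _ in range(n)]
    ybar = sum(y) / n
    s = math.sqrt(sum((v - ybar) ** 2 for v in y) / (n - 1))
    se = s / math.sqrt(n)                    # standard error of ybar
    if abs(ybar - mu) < c_normal * se:       # interval ybar +/- 1.96 se covers mu
        cover_normal += 1
    if abs(ybar - mu) < c_t4 * se:           # interval ybar +/- c se covers mu
        cover_t += 1

rate_normal = cover_normal / reps
rate_t = cover_t / reps
print(f"coverage using 1.96        : {rate_normal:.3f}")
print(f"coverage using t(4) value c: {rate_t:.3f}")
```

With so few observations the 1.96 interval covers µ noticeably less often than 95% of the time, while the t-based interval is close to the nominal level.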

[Figure 5: The 97.5th percentile, c, in a t distribution. The area between −c and c is 0.95; the area in each tail is 0.025.]

Exercise
Holzer, Block, Cheatham, and Knott (1993) studied the effects of job
training grants on worker productivity by collecting information on “scrap
rates” for a sample of Michigan manufacturing firms receiving job training
grants in 1988. There were no grants awarded in 1987. We are interested in
constructing CI for the change in scrap rate from 1987 to 1988 for the
population of all manufacturing firms.
The data given below is for a sample of 20 firms that received job training
grants in 1988. Scrap rate is measured as number of items per 100
produced that are not usable.

Scrap rate data I
Table 1: Scrap rate (firms 1 to 10)
Firm    1987    1988
   1   10.00    3.00
   2    1.00    1.00
   3    6.00    5.00
   4    0.45    0.50
   5    1.25    1.54
   6    1.30    1.50
   7    1.06    0.80
   8    3.00    2.00
   9    8.18    0.67
  10    1.67    1.17

Table 2: Scrap rate (firms 11 to 20)
Firm    1987    1988
  11    0.98    0.51
  12    1.00    0.50
  13    0.45    0.61
  14    5.03    6.70
  15    8.00    4.00
  16    9.00    7.00
  17   18.00   19.00
  18    0.28    0.20
  19    7.00    5.00
  20    3.97    3.83
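
One way to approach the exercise is to work with the change in scrap rate for each firm and apply the t-based interval to those 20 changes. The sketch below uses the tabulated data; the critical value 2.093 is the 97.5th percentile of the t distribution with 19 degrees of freedom, taken from a standard t table, and treating the changes as a random sample from a normal population is an assumption.

```python
import math

scrap_1987 = [10.00, 1.00, 6.00, 0.45, 1.25, 1.30, 1.06, 3.00, 8.18, 1.67,
              0.98, 1.00, 0.45, 5.03, 8.00, 9.00, 18.00, 0.28, 7.00, 3.97]
scrap_1988 = [3.00, 1.00, 5.00, 0.50, 1.54, 1.50, 0.80, 2.00, 0.67, 1.17,
              0.51, 0.50, 0.61, 6.70, 4.00, 7.00, 19.00, 0.20, 5.00, 3.83]

# Change in scrap rate for each firm (1988 minus 1987).
change = [after - before for before, after in zip(scrap_1987, scrap_1988)]
n = len(change)

ybar = sum(change) / n
s = math.sqrt(sum((d - ybar) ** 2 for d in change) / (n - 1))
se = s / math.sqrt(n)

c = 2.093   # 97.5th percentile of t with 19 df, from a standard t table
lower, upper = ybar - c * se, ybar + c * se

print(f"mean change = {ybar:.4f}, s = {s:.4f}, se = {se:.4f}")
print(f"95% CI for the mean change: [{lower:.3f}, {upper:.3f}]")
```

The mean change is about −1.15 and the interval runs from roughly −2.28 to −0.03, so it lies entirely below zero, although only just.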

Non-normal populations and asymptotics
Sometimes the population is clearly not normal
As the sample size gets larger, we can rely on asymptotic normality and
construct the CI as [ȳ ± 1.96·se(ȳ)]
Example: Gender discrimination in job offers. Are females discriminated
against?

Gender discrimination in hiring: matched pairs analysis
Construct a pair of CVs, one for a male and one for a female, for several
job applications.
In each pair, one person is male and the other is female, with exactly
the same CV in terms of experience, qualifications, etc., except for their
gender.
Each person in the pair was interviewed by an employer for the same
job, and the researchers recorded who got the job offer (both may get
the offer as well).
This is an example of a matched pairs analysis, where each trial
consists of data on two subjects that are thought to be similar in many
respects but different in one important characteristic.

Example contd.
Let θm be the probability that a male is offered a job and θf be that
for a female.
Our interest is in the difference θf − θm.
Let Mi denote a binary random variable equal to 1 if the male gets a
job offer from employer i and zero otherwise. We define Fi in a
similar way for the female.
We can construct several such pairs (with varying job
profiles/industries/qualifications) to have greater representation across
the whole spectrum of the labour market.
Our sample will consist of the pool of all such cases (trials), i.e. pairs of
interviews by employers

Example contd.
Let n = 241 (pairs of interviews)
Unbiased estimators of θm and θf are the sample proportions (M̄ and F̄) of
interviews for which males and females were offered jobs, respectively.
Define a random variable Yi = Fi − Mi. It takes three possible values
(−1, 0, 1), so it is discrete and not normally distributed
Our interest is in the population parameter µ ≡ E(Yi) = E(Fi) − E(Mi) = θf − θm.
Though Yi is not normally distributed, can we construct an approximate
confidence interval for µ as n is quite large?

Example contd.
Data gives: f̄ = .224 and m̄ = .357.
Then ȳ = .224 − .357 = −.133
To construct the CI, we need s. Data: s = .482
Approximate 95% CI for µ is −.133 ± 1.96(.482/√241)
This example demonstrates how to find a point estimate of a population
parameter and construct a confidence interval.
But can we answer whether females are discriminated against with a
definite “yes” or “no”?
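
The approximate interval in the matched pairs example can be evaluated directly from the reported figures. The sketch below only carries out that arithmetic; all inputs are the values quoted above.

```python
import math

n = 241
f_bar, m_bar = 0.224, 0.357   # sample proportions reported above
s = 0.482                     # sample standard deviation of Y_i reported above

y_bar = f_bar - m_bar         # point estimate of mu = theta_f - theta_m
se = s / math.sqrt(n)         # standard error of y_bar
lower, upper = y_bar - 1.96 * se, y_bar + 1.96 * se

print(f"y_bar = {y_bar:.3f}, se = {se:.4f}")
print(f"approximate 95% CI: [{lower:.3f}, {upper:.3f}]")
```

The interval is roughly [−.194, −.072], which lies entirely below zero.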

Hypothesis Testing I
Sometimes the question we are interested in has a definite yes or no
answer
Devising methods for answering such questions, using a sample of data,
is known as hypothesis testing.
Example: How strong is the sample evidence that crime rates in states with
lower literacy rates differ from those in states with higher literacy rates?
We set up a hypothesis test
In order to test a hypothesis we specify a null (H0 ) and an alternative
(H1 ) hypothesis.
Null hypothesis is presumed to be true until the data strongly suggest
otherwise (just as a defendant is presumed to be innocent until proven
guilty)
We need to choose a test statistic (or statistic, for short) and a critical
value.
In hypothesis testing, we can make two kinds of mistakes

Hypothesis Testing II
- First, we can reject the null hypothesis when it is in fact true. This is
  called a Type I error
- Second, failing to reject the null when it is actually false is a Type II error
A test statistic, denoted T, is some function of the random sample.
Given a test statistic, we can define a rejection rule that determines
when H0 is rejected in favour of H1.
In order to conclude that H0 is false and that H1 is true, we must have
evidence “beyond reasonable doubt” against H0
How do we quantify “beyond reasonable doubt”?
In hypothesis testing, we can make two kinds of mistakes: Type I and
Type II errors
After deciding whether or not to reject H0, either we have decided
correctly or we have committed an error
We will never know with certainty whether an error was committed.
However, we can compute the probability of making either a Type I or
a Type II error

Hypothesis Testing III
Hypothesis testing rules are constructed to make the probability of
committing a Type I error fairly small (significance level)
We define the significance level as
α = P(Reject H0 | H0 is true)
Hypothesis testing requires that we initially specify a significance level
for a test
Once we have chosen the significance level, we would then like to
minimize the probability of a Type II error (alternatively, we would like
to maximize the power of a test against all relevant alternatives)
Power of a test is
π(θ) = P(Reject H0 | θ) = 1 − P(Type II | θ)
where θ denotes the actual value of the parameter
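
The power function can be written out explicitly in a simple special case. The sketch below assumes a normal population with known σ and a one-sided test of H0: µ = µ0 against H1: µ > µ0 based on Z = √n(Ȳ − µ0)/σ; this setting and the numerical values are assumptions made here for illustration, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def power_one_sided_z(mu, mu0, sigma, n, alpha=0.05):
    """P(reject H0: mu = mu0 in favour of H1: mu > mu0) when the true mean is mu.

    Assumes a normal population with known sigma, so the test statistic
    Z = sqrt(n) * (Ybar - mu0) / sigma is exactly normal with unit variance.
    """
    std_normal = NormalDist()
    c = std_normal.inv_cdf(1 - alpha)            # critical value for level alpha
    shift = math.sqrt(n) * (mu - mu0) / sigma    # mean of Z under the true mu
    return 1 - std_normal.cdf(c - shift)

mu0, sigma, n = 0.0, 1.0, 25   # hypothetical values
for mu in (0.0, 0.2, 0.5, 1.0):
    print(f"true mu = {mu:.1f}: power = {power_one_sided_z(mu, mu0, sigma, n):.3f}")
```

At µ = µ0 the power equals the significance level α, and it rises towards 1 as the true mean moves further into the alternative.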

Hypothesis Testing IV
We would like the power to equal unity whenever the null hypothesis is
false (this is impossible to achieve while keeping the significance level
small)
We choose our tests to maximize the power for a given significance
level
In order to test a null hypothesis against an alternative, we need to
choose a test statistic (or statistic, for short) and a critical value
Given a test statistic, we can define a rejection rule that determines
when H0 is rejected in favour of H1
Usually, rejection rules are based on comparing the value of a test
statistic, t, to a critical value, c.
To determine the critical value, we must first decide on a significance
level (α) of the test.
The values of t that result in rejection of the null hypothesis are
collectively known as the rejection region.

Hypothesis Testing V
Then, given α, the critical value associated with α is determined by the
distribution of T, assuming that H0 is true.
Example: Hypotheses about the mean µ from a N(µ, σ²) population
H0: µ = µ0
where µ0 is a value that we specify.
The rejection rule we choose depends on the nature of the alternative
hypothesis
There are three possible alternative hypotheses.
- H1: µ > µ0
- H1: µ < µ0
- H1: µ ≠ µ0
Intuitively, in the first case we should reject the null in favour of the
alternative when the sample average is “sufficiently” greater than µ0.
But how should we determine what counts as “sufficiently” greater? This
requires knowing the probability of rejecting the null hypothesis when it
is true.
Hypothesis testing
Let us define a standardised version of the sample mean
$$t = \frac{\sqrt{n}\,(\bar{y} - \mu_0)}{s} = \frac{\bar{y} - \mu_0}{\mathrm{se}(\bar{y})}$$
Given the sample of data, it is easy to obtain the above value.
Under the null hypothesis, the random variable
$$T = \frac{\sqrt{n}\,(\bar{Y} - \mu_0)}{S}$$
has a t_{n−1} distribution
For the 5% significance level, the critical value c is chosen so that
P(T > c | H0) = .05 (one-tailed test)
Once we have found c (this is the 100(1 − α) percentile in a t_{n−1} distribution),
the rejection rule is t > c.
The value t is often called the t statistic.

Two-sided test
Similarly, the rejection rule for the second one-sided alternative is t < −c.
For the two-sided alternative, H1: µ ≠ µ0, we reject the null if the sample
mean is far from the hypothesised value µ0 in absolute terms. The
rejection rule is
|t| > c,
where the critical value is the 100(1 − α/2) percentile in the
t_{n−1} distribution.
We usually report the outcome in a form such as “we fail to reject H0 in
favour of H1 at the 5% significance level”
With large n, we can compare the t statistic with the critical values
from a standard normal distribution.
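
The rejection rules can be applied to the scrap rate data from the earlier exercise, testing H0: µ = 0 for the mean change in scrap rate. This is a sketch: the critical values 1.729 and 2.093 are the 95th and 97.5th percentiles of the t distribution with 19 degrees of freedom from a standard t table, and the choice of µ0 = 0 and of the alternatives is an assumption made here.

```python
import math

scrap_1987 = [10.00, 1.00, 6.00, 0.45, 1.25, 1.30, 1.06, 3.00, 8.18, 1.67,
              0.98, 1.00, 0.45, 5.03, 8.00, 9.00, 18.00, 0.28, 7.00, 3.97]
scrap_1988 = [3.00, 1.00, 5.00, 0.50, 1.54, 1.50, 0.80, 2.00, 0.67, 1.17,
              0.51, 0.50, 0.61, 6.70, 4.00, 7.00, 19.00, 0.20, 5.00, 3.83]
change = [after - before for before, after in zip(scrap_1987, scrap_1988)]

n = len(change)
ybar = sum(change) / n
s = math.sqrt(sum((d - ybar) ** 2 for d in change) / (n - 1))

mu0 = 0.0                                   # hypothesised mean change
t_stat = math.sqrt(n) * (ybar - mu0) / s    # t = sqrt(n) (ybar - mu0) / s

c_one_sided = 1.729   # 95th percentile of t with 19 df (t table)
c_two_sided = 2.093   # 97.5th percentile of t with 19 df (t table)

reject_one_sided = t_stat < -c_one_sided        # H1: mu < mu0, rule t < -c
reject_two_sided = abs(t_stat) > c_two_sided    # H1: mu != mu0, rule |t| > c

print(f"t statistic = {t_stat:.3f}")
print(f"reject H0 against mu < 0 at 5%:  {reject_one_sided}")
print(f"reject H0 against mu != 0 at 5%: {reject_two_sided}")
```

The t statistic is about −2.15, so H0 is rejected at the 5% level against both alternatives, although the two-sided rejection is marginal.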

p-values I
Different α values can lead to different conclusions for the same test
What is the largest significance level at which we could carry out the
test and still fail to reject the null hypothesis? This value is known as
the p-value of a test.
Given a value of t, we can find the largest significance level at which
we would fail to reject H0
This is the significance level associated with using t as our critical value
For a one-sided test (H1: µ > µ0)
p-value = P(T > t | H0)
For a two-sided test
p-value = P(|T| > |t| | H0) = 2P(T > |t| | H0)
Suppose the p-value is 0.065. Then the largest significance level at
which we can carry out this test and fail to reject is 6.5%.
p-values II
If we carry out the test at a level below 6.5% (such as at 5%), we fail
to reject H0
If we carry out the test at a level larger than 6.5% (such as 10%), we
reject H0
Generally, small p-values are evidence against H0
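
As an illustration, p-values for the matched pairs example can be computed with the large-sample normal approximation. The sketch below uses the figures reported for that example (n = 241, ȳ = −.133, s = .482); testing H0: µ = 0 and using the lower tail for the one-sided alternative H1: µ < 0 are choices made here, and the helper name reject is hypothetical.

```python
import math
from statistics import NormalDist

n, y_bar, s = 241, -0.133, 0.482   # matched pairs figures from the example
mu0 = 0.0                          # H0: no difference in offer probabilities

t_stat = math.sqrt(n) * (y_bar - mu0) / s

std_normal = NormalDist()
# One-sided alternative H1: mu < mu0, so small values of t count against H0.
p_one_sided = std_normal.cdf(t_stat)
# Two-sided alternative H1: mu != mu0.
p_two_sided = 2 * (1 - std_normal.cdf(abs(t_stat)))

def reject(p_value, alpha):
    """Reject H0 exactly when the p-value is below the significance level."""
    return p_value < alpha

print(f"t statistic       = {t_stat:.2f}")
print(f"one-sided p-value = {p_one_sided:.2e}")
print(f"two-sided p-value = {p_two_sided:.2e}")
print(f"p = 0.065 at  5%: reject = {reject(0.065, 0.05)}")
print(f"p = 0.065 at 10%: reject = {reject(0.065, 0.10)}")
```

Both p-values are far below any conventional significance level, and the last two lines reproduce the 6.5% example: no rejection at 5%, rejection at 10%.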
