AB1202 Statistics and Analysis: Statistical Inferences Based On Two Samples


AB1202

Statistics and Analysis


Lecture 7
Statistical Inferences Based on Two Samples

Chin Chee Kai


cheekai@ntu.edu.sg
Nanyang Business School
Nanyang Technological University
NBS 2016S1 AB1202 CCK-STAT-018

Statistical Inferences Based on Two Samples
• Comparing Two Population Means
• Paired Difference Experiments
• Comparing Population Proportions
• Special Case with Equality of Proportions
• When To Assume Equal Variance?

Comparing Two Population Means


• We would like to compare two population means
whose values we don’t really know.
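[Figure: the sampling distributions of $\bar{X}_1$ and $\bar{X}_2$, centered at $\mu_1$ and $\mu_2$ with variances $\sigma_1^2/n_1$ and $\sigma_2^2/n_2$. Sample once each from these sampling distributions, then take the difference. Keep doing that indefinitely to form a population of differences $\bar{X}_1 - \bar{X}_2$, centered at $\mu_1 - \mu_2$ with variance $\sigma_1^2/n_1 + \sigma_2^2/n_2$. On the standardized $Z$ axis the center is $z = 0$, and the rejection region lies beyond the critical value $z_c$.]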

Points To Note
• Two parent populations: $X_1$ with mean $\mu_1$ and variance
$\sigma_1^2$, and $X_2$ with mean $\mu_2$ and variance $\sigma_2^2$.
• Two samples with means $\bar{x}_1$, $\bar{x}_2$, variances $s_1^2$, $s_2^2$ and
sample sizes $n_1$, $n_2$ (where both sample sizes are
roughly similar)

• Must check:
▫ Unpaired or paired?
▫ Both large, or both small sample sizes?
▫ Parent population variances known or unknown?
▫ Both variances assumed equal, or unequal?
▫ Both parent distributions are normal, or not?
• Unequal variance assumption is more general (less
restrictive) than equal variance assumption.

Unpaired Tests
• We begin by comparing two population means under
unpaired test scenarios. There are 5 populations at work!
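[Figure: the five populations. (1, 2) Parent populations $X_1$, $X_2$ with means $\mu_1$, $\mu_2$ and variances $\sigma_1^2$, $\sigma_2^2$. (3, 4) Populations of sample means $\bar{X}_1$, $\bar{X}_2$ with means $\mu_1$, $\mu_2$ and variances $\sigma_1^2/n_1$, $\sigma_2^2/n_2$. (5) Population of the difference $\bar{X}_1 - \bar{X}_2$ with mean $\mu_1 - \mu_2$ and variance $\sigma_1^2/n_1 + \sigma_2^2/n_2$, shown on a $Z$ axis with center $z = 0$ and the rejection region beyond $z_c$.]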

Large n, Var Known, Var Unequal


• If parent populations are Normal, then sampling
distribution will be Normal anyway.
• If parent populations are not normal, then by
CLT (𝑛1 , 𝑛2 ≥ 30), both sampling distributions
will still approximate Normal.

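• Test random variable: $\bar{X} = \bar{X}_1 - \bar{X}_2$
• $H_0: \mu_1 - \mu_2 \ge d_0$, $H_1: \mu_1 - \mu_2 < d_0$
• $Var(\bar{X}) = Var(\bar{X}_1) + Var(\bar{X}_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$
• Test statistic: $z = \frac{\bar{x}_1 - \bar{x}_2 - d_0}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$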
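
The z statistic for this case can be sketched in a few lines of Python. This is only an illustration using the standard library: the function name `two_sample_z` and the numbers in the usage lines are hypothetical, and the left-tailed p-value matches the H0/H1 pair stated on this slide.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(xbar1, xbar2, var1, var2, n1, n2, d0=0.0):
    """Left-tailed test of H0: mu1 - mu2 >= d0 vs H1: mu1 - mu2 < d0.

    var1 and var2 are the KNOWN parent population variances.
    Returns (z, p_value).
    """
    # Standard error of the difference of the two sample means.
    se = sqrt(var1 / n1 + var2 / n2)
    z = (xbar1 - xbar2 - d0) / se
    # Left-tailed p-value from the standard normal distribution.
    p_value = NormalDist().cdf(z)
    return z, p_value


# Hypothetical numbers, for illustration only.
z, p = two_sample_z(xbar1=48.0, xbar2=50.0, var1=16.0, var2=25.0, n1=40, n2=50)
print(f"z = {z:.3f}, p-value = {p:.4f}")
```

Reject H0 at significance level α when the p-value is below α, which is equivalent to z falling in the rejection region to the left of the critical value.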

Large n, Var Known, Var Equal


• If both variances are known to be equal, OR may
be assumed to be equal, then we still proceed as
before, setting $\sigma_1^2 = \sigma_2^2 = \sigma^2$.

• Note that as before, parent populations can be
normally distributed, or not. We still use normal
distribution to test the sample statistic due to
CLT.

Large n, Var Unknown, Var Unequal


• If both variances are unknown, we need to use
Student-t Distribution to test the sample statistic.

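• Test random variable: $\bar{X} = \bar{X}_1 - \bar{X}_2$
• $H_0: \mu_1 - \mu_2 \ge d_0$, $H_1: \mu_1 - \mu_2 < d_0$
• $Var(\bar{X}) = Var(\bar{X}_1) + Var(\bar{X}_2) = \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}$
• Test statistic: $t = \frac{\bar{x}_1 - \bar{x}_2 - d_0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
• d.f. (Welch-Satterthwaite formula, round DOWN): $v = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \frac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}$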

• Note that we don’t need to pool variance, nor fall back on
assuming equal variance, since our sample sizes are large.
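
As a rough sketch of the unequal-variance t statistic and the Welch-Satterthwaite degrees of freedom, the following Python uses only the standard library. The name `welch_t` and the sample figures are hypothetical. The critical value for the resulting d.f. would still be read from a t-table, since the standard library has no Student-t distribution.

```python
from math import floor, sqrt


def welch_t(xbar1, xbar2, s1_sq, s2_sq, n1, n2, d0=0.0):
    """Unequal-variance t statistic with Welch-Satterthwaite d.f.

    s1_sq and s2_sq are SAMPLE variances. Returns (t, v) where v is
    the degrees of freedom rounded DOWN to an integer.
    """
    a = s1_sq / n1  # estimated variance of the first sample mean
    b = s2_sq / n2  # estimated variance of the second sample mean
    t = (xbar1 - xbar2 - d0) / sqrt(a + b)
    v = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, floor(v)


# Hypothetical summary statistics, for illustration only.
t, v = welch_t(xbar1=20.0, xbar2=22.0, s1_sq=9.0, s2_sq=16.0, n1=10, n2=12)
print(f"t = {t:.4f} with {v} d.f.")
```

Rounding the d.f. down is the conservative choice: a smaller d.f. gives a t distribution with heavier tails, so the critical value sits slightly further from zero.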

Large n, Var Unknown, Var Equal


• If both variances are known to be equal (but we
don’t know the value), OR may be assumed to be
equal (due to contextual understanding), then we
still proceed as before, but we would pool the
sample variances:
$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
• Test statistic: $t = \frac{\bar{x}_1 - \bar{x}_2 - d_0}{\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$
with $v = n_1 + n_2 - 2$ d.f.
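
The pooled calculation can be sketched as follows. The function name `pooled_t` and the numbers are hypothetical, and the sketch assumes the equal-variance assumption has already been justified by context.

```python
from math import sqrt


def pooled_t(xbar1, xbar2, s1_sq, s2_sq, n1, n2, d0=0.0):
    """Equal-variance (pooled) t statistic.

    Returns (t, v, sp_sq): the test statistic, the degrees of freedom
    n1 + n2 - 2, and the pooled sample variance.
    """
    v = n1 + n2 - 2
    # Weighted average of the two sample variances, weights = d.f. of each.
    sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / v
    t = (xbar1 - xbar2 - d0) / sqrt(sp_sq * (1 / n1 + 1 / n2))
    return t, v, sp_sq


# Hypothetical summary statistics, for illustration only.
t, v, sp_sq = pooled_t(xbar1=20.0, xbar2=22.0, s1_sq=9.0, s2_sq=16.0, n1=10, n2=12)
print(f"pooled variance = {sp_sq:.2f}, t = {t:.4f} with {v} d.f.")
```

The pooled variance always lies between the two sample variances, and it is pulled toward the variance of the larger sample.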

Small n Cases
• With smallish sample sizes, we require
assumption of parent populations being normally
distributed to proceed. If so, then sampling
distribution will be normal.
• If parent populations are not normal, or cannot
be assumed to be approximately normal, then we
have to use other methods not discussed here.
• This will be the case (with small sample sizes)
whether variances are known or unknown,
and whether they are equal or unequal.
• In what follows, parent populations are all
assumed to be normally distributed.

Small n, Var Known, Var Unequal or Equal


• Since parent populations are normally
distributed, sampling distribution will be normal
(despite smallish sample sizes).
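• Test random variable: $\bar{X} = \bar{X}_1 - \bar{X}_2$
• $H_0: \mu_1 - \mu_2 \ge d_0$, $H_1: \mu_1 - \mu_2 < d_0$
• $Var(\bar{X}) = Var(\bar{X}_1) + Var(\bar{X}_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$
• Test statistic: $z = \frac{\bar{x}_1 - \bar{x}_2 - d_0}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$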

• If both variances are known to be equal, OR may be
assumed to be equal, then we still proceed as before,
setting $\sigma_1^2 = \sigma_2^2 = \sigma^2$.

Small n, Var Unknown, Var Unequal


• If both variances are unknown, then just like large
sample size case, we use Student-t Distribution to
test the sample statistic.
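• Test random variable: $\bar{X} = \bar{X}_1 - \bar{X}_2$
• $H_0: \mu_1 - \mu_2 \ge d_0$, $H_1: \mu_1 - \mu_2 < d_0$
• $Var(\bar{X}) = Var(\bar{X}_1) + Var(\bar{X}_2) = \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}$
• Test statistic: $t = \frac{\bar{x}_1 - \bar{x}_2 - d_0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
• d.f. (Welch-Satterthwaite formula, round DOWN): $v = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \frac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}$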

• BUT, unlike the large sample size case, sample variances may be
very poor point-estimates of population variances, since
sample sizes are small. So we may unknowingly suffer from inflated
Type I or Type II error probabilities.

Small n, Var Unknown, Var Unequal


• Small sample sizes $n_1$, $n_2$ give poor variance point-estimates:
$Var(\bar{X}) = Var(\bar{X}_1) + Var(\bar{X}_2) \not\approx \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}$
• We attempt to trade-off modeling error with precision by
making a (bold) assumption that parent population
variances are equal (without proof).
• This assumption allows us to pool-estimate the variance
value.
• Pooling sample variances tends to improve precision of the
variance estimate, since the two samples are combined as
if they were a single “larger” sample.
• We proceed by assuming “Small n, Var Unknown, Var
Equal” (next slide).
• But as we knowingly contradict our knowledge that
population variances are unequal, we might still suffer from
modeling inaccuracies. (There’s no free lunch)

Small n, Var Unknown, Var Equal


• For unknown variances, we use Student-t Distribution.
• If both variances are known to be equal (but we don’t
know the value), OR may be assumed to be equal (due
to contextual understanding), then we would pool the
sample variances:
$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
• Pooled sample variance is a more precise point-
estimate of population variance.
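• Test random variable: $\bar{X} = \bar{X}_1 - \bar{X}_2$
• $H_0: \mu_1 - \mu_2 \ge d_0$, $H_1: \mu_1 - \mu_2 < d_0$
• Test statistic: $t = \frac{\bar{x}_1 - \bar{x}_2 - d_0}{\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$ with $v = n_1 + n_2 - 2$ d.f.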

Paired Difference Experiments


• Why pair?
▫ When experimental unit measurements are related (in
whichever way deemed proper), paired-test is more
appropriate than unpaired-test.
• What is it?
▫ Individual experimental unit measurements are paired
and their paired-difference calculated to create a NEW
population (of delta values).
▫ Test the mean of this NEW population.
• Pairing situations arise commonly from:
▫ Before-after measurements
▫ Data collected somewhat simultaneously on
experimental units which are deemed related.
  ▪ E.g., intelligence of children paired with that of their parents
• Note that whenever paired-test can be calculated,
unpaired-test can also be done, but might not be
proper or meaningful.

Paired, Var Known


• As in one-variable hypothesis testing, either the parent
population should be normally distributed, or the
sample size should be large.
• If variance is known, then sampling distribution
used will be the normal distribution.

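• Test random variable: $\bar{D}$,
where data $d_i = x_{1,i} - x_{2,i}$ have mean $\mu_D$ and variance $\sigma_D^2$
• $H_0: \mu_D \ge d_0$, $H_1: \mu_D < d_0$
• $Var(\bar{D}) = \frac{\sigma_D^2}{n}$
• Test statistic: $z = \frac{\bar{d} - d_0}{\sigma_D / \sqrt{n}}$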

Paired, Var Unknown


• Again, we need to assume parent population is normal
or else the sample size used is large.
• If variance is unknown, then we need to estimate
population variance using sample variance. Sampling
distribution used will be Student-t distribution.

• Test random variable: $\bar{D}$,
where data $d_i = x_{1,i} - x_{2,i}$ have mean $\mu_D$
• $H_0: \mu_D \ge d_0$, $H_1: \mu_D < d_0$
• $Var(\bar{D}) = \frac{s_D^2}{n}$
• Test statistic: $t = \frac{\bar{d} - d_0}{s_D / \sqrt{n}}$ with $v = n - 1$ d.f.
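
A paired test reduces to a one-sample t-test on the differences, which the sketch below makes explicit. The function name `paired_t` and the before/after scores are hypothetical, and the sketch assumes the two lists are aligned so that position i in each list refers to the same experimental unit.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(x1, x2, d0=0.0):
    """Paired-difference t statistic for H0: mu_D >= d0 vs H1: mu_D < d0.

    x1 and x2 are equal-length sequences of paired measurements.
    Returns (t, v) with v = n - 1 degrees of freedom.
    """
    if len(x1) != len(x2):
        raise ValueError("paired samples must have the same length")
    # The NEW population of delta values: d_i = x1_i - x2_i.
    d = [a - b for a, b in zip(x1, x2)]
    n = len(d)
    d_bar = mean(d)
    s_d = stdev(d)  # sample standard deviation (divides by n - 1)
    t = (d_bar - d0) / (s_d / sqrt(n))
    return t, n - 1


# Hypothetical before/after scores for five experimental units.
before = [72, 68, 75, 80, 66]
after = [75, 70, 74, 85, 70]
t, v = paired_t(before, after)
print(f"t = {t:.4f} with {v} d.f.")
```

With only five pairs, this relies on the normality assumption for the differences stated above.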

Comparing Population Proportions


• The variance of a sample proportion depends on
the very proportion value which we test.
• We will always work with large samples
($n_1 p_1$, $n_1(1 - p_1)$, $n_2 p_2$, $n_2(1 - p_2)$ all $\ge 5$).

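• Test random variable: $\hat{P} = \hat{P}_1 - \hat{P}_2$
• $H_0: p_1 - p_2 \ge p_0$, $H_1: p_1 - p_2 < p_0$
• $Var(\hat{P}) = \frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2} \approx \frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}$ (estimated using the sample proportions)
• Test statistic: $z = \frac{\hat{p}_1 - \hat{p}_2 - p_0}{\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}}$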
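
The general (unpooled) proportion test can be sketched as below. The function name `two_proportion_z` and the counts are hypothetical. The large-sample check is applied to the observed success and failure counts, which is one common way to apply the rule of 5 when the true proportions are unknown.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2, p0=0.0):
    """Left-tailed test of H0: p1 - p2 >= p0 vs H1: p1 - p2 < p0.

    x1, x2 are success counts out of n1, n2 trials.
    Returns (z, p_value). Uses the unpooled variance estimate.
    """
    # Large-sample check on observed successes and failures.
    if min(x1, n1 - x1, x2, n2 - x2) < 5:
        raise ValueError("samples too small for the normal approximation")
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    z = (p1_hat - p2_hat - p0) / se
    return z, NormalDist().cdf(z)


# Hypothetical counts: 40 of 100 vs 60 of 120, testing a difference of 0.05.
z, p = two_proportion_z(x1=40, n1=100, x2=60, n2=120, p0=0.05)
print(f"z = {z:.4f}, p-value = {p:.4f}")
```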

Population Proportions – When 𝑑0 = 0


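• When $d_0 = 0$ (the hypothesized difference, written $p_0$ in the general test), our null hypothesis claims that both
population proportions are equal.
• But this implies both population variances are equal, since each variance $p(1 - p)$ depends only on the proportion!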
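• We pool-estimate sample proportion:
$\hat{p} = \frac{\text{count of all "successes" in both samples}}{\text{total count in both samples}}$
• We pool-estimate sample proportion variance:
$s_p^2 = \frac{\hat{p}(1 - \hat{p})}{n_1} + \frac{\hat{p}(1 - \hat{p})}{n_2} = \hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)$
• $H_0: p_1 - p_2 \ge 0$, $H_1: p_1 - p_2 < 0$
• Test statistic: $z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$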
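
For the zero-difference case, a minimal sketch of the pooled calculation follows. The function name `pooled_proportion_z` and the counts are hypothetical.

```python
from math import sqrt


def pooled_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: p1 - p2 >= 0 vs H1: p1 - p2 < 0.

    x1, x2 are success counts out of n1, n2 trials.
    Returns (z, p_pool), where p_pool is the pooled sample proportion.
    """
    # All successes in both samples over the total count in both samples.
    p_pool = (x1 + x2) / (n1 + n2)
    # Pooled variance estimate of the difference in sample proportions.
    var_pool = p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)
    z = (x1 / n1 - x2 / n2) / sqrt(var_pool)
    return z, p_pool


# Hypothetical counts: 40 of 100 vs 60 of 120.
z, p_pool = pooled_proportion_z(x1=40, n1=100, x2=60, n2=120)
print(f"pooled proportion = {p_pool:.4f}, z = {z:.4f}")
```

The pooled proportion is a sample-size-weighted average of the two sample proportions, so it always lies between them.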

When To Assume Equal Variance?


• Small sample sizes:
▫ Instead of letting each (small) sample estimate its own variance
poorly, we make a trade-off: we assume equal variance,
which lets us pool both samples together and
point-estimate the assumed-equal variance more accurately.
▫ As long as actual population variances are not too far apart,
the trade-off would be safe to make.
• Both population variances are known to be the same.
▫ Usually if the same population is measured at different
times, we could gather contextual knowledge that the
variances are presumably the same.
• Null hypothesis postulates or implies population
variances are the same.
▫ Typically when the variances are dependent only on means,
and means are hypothesized to be equal.
