
COMPARING TWO MEANS
Sharon Morein
Sharon.morein@anglia.ac.uk
Parametric vs non-parametric
• Parametric tests tend to be more powerful
• Use all the information available
• But also have more constraints (‘rules’ or assumptions)

• Non-parametric:
• Do not rely on parametric assumptions
• More suited for small n’s
• Allow more levels of measurement of variables (e.g., ordinal)
Comparing means
Conducting, concluding and reporting tests when the SD of the
population is unknown (otherwise we would do a Z test)
Where we have 2 samples [to compare to a specific value you
could use a one-sample t-test, though a CI is simpler]

Between subjects
Parametric (independent Student’s t-test or Welch’s t-test)
Non-parametric (e.g., Mann-Whitney, Wilcoxon rank-sum)
Within subjects
Parametric (repeated=paired=dependent t-test)
Non-parametric (e.g., Wilcoxon signed-rank)
The sampling distribution
• Hypothetical scenario
• Samples are randomly and repeatedly taken
• Unit of measurement is a sample (group of observations)
rather than a single observation
• SE = SD of the sampling distribution
• “Typical” variability between samples

• Central limit theorem: with large samples (n>30), the
sampling distribution will become normal, regardless of the
shape of the population distribution
• The larger the n: more normal and smaller SE
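The two claims above can be checked with a short simulation (a sketch in Python; the skewed population here is made up): the distribution of sample means tightens as n grows, and its SD matches the predicted standard error.

```python
# Illustrative sketch of the central limit theorem: even for a heavily
# skewed (exponential) population, the SD of the sample means shrinks
# toward population_SD / sqrt(n) as n grows.
import numpy as np

rng = np.random.default_rng(42)
population = rng.exponential(scale=1.0, size=100_000)  # heavily skewed

for n in (5, 30, 100):
    # Draw 10,000 samples of size n and keep each sample's mean
    means = rng.choice(population, size=(10_000, n)).mean(axis=1)
    # SE predicted by theory: population SD / sqrt(n)
    print(f"n={n:3d}  SD of sample means={means.std():.3f}  "
          f"predicted SE={population.std() / np.sqrt(n):.3f}")
```

With n = 100 the sample-mean distribution is close to normal despite the skewed parent population.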
T-tests

NHST – the logic:


• We have two samples, each with a mean, we can look at
the difference
• If the samples come from the same population, then their
means should be roughly equal. While they could differ by
chance alone, we would expect large differences between
sample means to occur very infrequently.
• In our test, we assume the null hypothesis is true (there is
no difference between the means), i.e., the two samples
came from the same population (distribution)
T-tests

Test statistic = signal / noise

NHST – the logic:


• We compare the difference between our sample means
to the expected difference if there were no effect (i.e. if H0
were true). We use the standard error as a gauge of the
variability between sample means.
• If the difference between the samples we have collected
is larger than what we would expect based on the
standard error then we can assume either:
• There is no effect, we have, by chance, collected two samples that
are atypical of the same population.
• The two samples come from different populations and are typical of
their respective parent population (thus H0 is incorrect).
T distributions vary depending on df

Probability distribution (like the normal distribution) telling us
the probability of obtaining a given score
“Recipe” [Memory when happy is different from memory when sad]

• Null hypothesis H0: μ1 = μ2 (alternative H1: μ1 ≠ μ2)
[Usually states that an effect is absent]
• All parameters are equal, and differences are due to chance
• If we reject the null hypothesis, we provide support for the
alternative hypothesis

• We will stick to two-tailed unless there is a REALLY good reason

X̄happy = 51.1; SDhappy = 10.1; nhappy = 60
X̄sad = 49.67; SDsad = 9.7; nsad = 60
T-test Assumptions
• Both t-tests (between and within) assume:
• Data are at least an interval scale (so ratio fine too)
• Random independent selection of the sample: each observation (or
pair) in the population has the same chances to be included in the
sample (errors are random), i.e., each observation (or pair) is
chosen randomly and independently
• Independent t-test
• The sampling distribution is normally distributed
• Homogeneity of variance (HOV: σ1² = σ2²)
• Scores are independent in different conditions
• Dependent t-test
• d (difference) is normally distributed
• Scores come from the same/related observations
HOV
There is no difference in variance between the two groups
Levene’s test for HOV
• We require a non-significant result in order to meet the
HOV assumption
• H0: σ1² = σ2² – the variances in the two groups are equal:
• Levene’s statistic (F) is not significant (i.e. p > .05)
• A significant result (i.e. p < .05) indicates that the
variances are significantly different and so the assumption
of HOV does not hold (reject H0, so σ1² ≠ σ2²)

• Unequal variances create bias and inconsistency in the
estimate of the SE → CI, t and p values are biased
• Better not to use the t-test (we’ll get back to this!)
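Levene's test is available directly in scipy; a minimal sketch (the group data here are simulated, not the real memory scores):

```python
# Checking the HOV assumption with Levene's test via scipy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
happy = rng.normal(loc=51.1, scale=10.1, size=60)  # simulated scores
sad = rng.normal(loc=49.67, scale=9.7, size=60)

stat, p = stats.levene(happy, sad)
# p > .05 -> no evidence the variances differ; HOV assumption is met
print(f"Levene's F = {stat:.2f}, p = {p:.3f}")
```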
Independent t-test
X̄happy = 51.1; SDhappy = 10.1; nhappy = 60
X̄sad = 49.67; SDsad = 9.7; nsad = 60

t = (observed difference between sample means − expected
difference between population means if the null hypothesis is
true) / (estimate of the standard error of the difference
between two sample means)

t = (X̄1 − X̄2) / √(sp²/n1 + sp²/n2)

Numerator: difference between the means
Denominator: SE of the sampling distribution of differences

• Variance sum law – the variance of a difference between two
independent variables is the sum of the variances:
variance1 + variance2
• The equation will take unequal n’s into account – BUT should
we even go there?
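The independent t-test can be run straight from the slide's summary statistics with scipy's `ttest_ind_from_stats`, which pools the variances in the same way as the formula above:

```python
# Independent t-test from summary statistics (means, SDs, n's).
from scipy import stats

t, p = stats.ttest_ind_from_stats(
    mean1=51.1, std1=10.1, nobs1=60,   # happy group
    mean2=49.67, std2=9.7, nobs2=60,   # sad group
    equal_var=True,                    # assumes HOV; False gives Welch's t
)
print(f"t = {t:.2f}, p = {p:.3f}")
```

Note that this treats the two groups as independent; the deck's own worked example later uses the dependent design, so the t values differ.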
Dependent t-test
X̄happy = 51.1; SDhappy = 10.1
X̄sad = 49.7; SDsad = 9.7
For each person: di = Xhappy − Xsad; nhappy = nsad = 60
X̄diff = 51.1 − 49.7 = 1.4
SDdiff = 15.6

t = (D̄ − μD) / (sD / √N)

t = signal / noise: the numerator is the difference between the
means (μD assumed to be 0 under H0); the denominator is the
(estimated) SE of the differences

If the average difference is large and the SE of differences is
small, it is likely that the difference in our sample is not a
chance result
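In scipy this is `ttest_rel`; a sketch with made-up before/after scores, checking it against the hand formula above:

```python
# Paired (dependent) t-test; ttest_rel computes t = mean(d) / (SD(d)/sqrt(N)).
import numpy as np
from scipy import stats

before = np.array([51, 47, 60, 55, 49, 52, 58, 45], dtype=float)
after = np.array([49, 50, 56, 50, 48, 51, 50, 47], dtype=float)

t, p = stats.ttest_rel(before, after)

# The same t computed by hand from the differences
d = before - after
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
print(f"t = {t:.3f}, p = {p:.3f}, manual t = {t_manual:.3f}")
```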
Type I and Type II Error
Null hypothesis H0; alternative hypothesis H1

Stricter α means larger β (more confidence but higher risk of
failing to detect a genuine effect)

What is really going on:
                            “H0 is true”                “H1 is true”
                            There is no effect          There is an effect
Study does not reject H0    1−α                         β
(indicates no effect)       Correct decision            Fail to reject H0
                                                        (False negative / miss)
Study rejects H0            α                           Power
(indicates effect)          Incorrect rejection of H0   Correct decision
                            (False positive)            (experimenter bliss)
Comparing to critical value t

t = (X̄1 − X̄2) / √(sp²/n1 + sp²/n2)

• Our computed t value
• Appropriate t-distribution (mathematically determined)
• What is the probability of our computed value occurring?
• The further away from 0 the less likely
• The cut-off we typically use is 5% (α = 0.05) – “pretty
unlikely”
• If our t value yields a probability less than the critical cut-off,
we reject the null (H0) as it is pretty unlikely

For a given df…
• Our t value = 0.5 → p value is 0.69
• Our t value = 1.5 → p value is 0.07
• Our t value = 2 → p value is 0.025
Conclusion and reporting
• Compare resulting t/p value to critical value
• Calculate effect size: Cohen’s d or r

• Example:
• Memory was not significantly different in the happy (M=51.1,
SD=10.1) vs the sad (M=49.67, SD=9.7) condition. The
difference (M=1.43, 95% CI [-2.18, 4.98]) was not statistically
significant, t(59)=1.45, p=.15. The effect size was small (Cohen’s
d=0.14).
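The reported Cohen's d can be reproduced from the summary statistics with the pooled-SD formula (one of several d variants; this one matches the reported value):

```python
# Cohen's d with a pooled SD, from the example's summary statistics.
import math

m1, s1, n1 = 51.1, 10.1, 60   # happy
m2, s2, n2 = 49.67, 9.7, 60   # sad

s_pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
d = (m1 - m2) / s_pooled
print(f"Cohen's d = {d:.2f}")  # prints: Cohen's d = 0.14
```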
Non-parametric
• More robust to outliers
• If the distributions are not normal, then with small n’s the
sampling distribution will likely be non-normal

• Mann-Whitney
[& Wilcoxon rank-sum test]
• Wilcoxon signed-rank test

• Assumptions
• Ordinal scale and above
• Each observation (M-W) or pair (Wilcoxon) is chosen randomly and
independently
Mann-Whitney Rationale
• Ranking – we compute the statistic on ranks rather than on
raw values

• Combine all observations into one group and rank them (tied
values all get their averaged rank or are ignored, depending
on the test)

• If there is no difference, the sums of ranks should be about
the same
Mann-Whitney
• H0 – the two distributions (not mean or median!) are not
different
• Sum up the ranks and use:

U1 = n1n2 + n1(n1+1)/2 − R1
(R1 = sum of ranks in group 1)

• Can compare the smallest U to a ‘critical U’, or use the
normal approximation if the sample is large (how to compute
standardized values: Seminar, Field, p. 31-40)
• Wilcoxon rank sum
• Unequal n’s: the sum of ranks in the smaller group
• Equal n’s: the lower sum of ranks
U1 + U2 = n1n2
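A sketch in scipy with made-up scores, including a check of the identity U1 + U2 = n1·n2:

```python
# Mann-Whitney U test via scipy; the returned statistic is the U for
# the first sample, and the two U's always sum to n1*n2.
import numpy as np
from scipy import stats

group1 = np.array([3, 5, 8, 9, 12], dtype=float)
group2 = np.array([1, 2, 4, 6, 7], dtype=float)

u1, p = stats.mannwhitneyu(group1, group2, alternative="two-sided")
u2 = len(group1) * len(group2) - u1   # identity: U1 + U2 = n1*n2
print(f"U1 = {u1}, U2 = {u2}, p = {p:.3f}")
```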
Logic of Wilcoxon (signed rank)
• H0 – the two distributions (not mean or median!) are not
different
• Compute the difference for each pair of observations
• Rank them (here ignore ties altogether)
• Sum the positive ranks separately from the negative ranks

happy  sad  |diff|  rank  sign  pos   neg
15     28   |13|    2.5   +     2.5
25     25   0       (tie – ignored)
16     32   |16|    4     +     4
23     10   |-13|   2.5   −           2.5
25     21   |-4|    1     −           1
40     20   |20|    5     +     5

R1 = 2.5 + 4 + 5 = 11.5
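In scipy this is `stats.wilcoxon`; a sketch with made-up paired scores (chosen without zero or tied differences, so the exact test applies; scipy's default `zero_method="wilcox"` drops zero differences, matching the "ignore ties" rule above):

```python
# Wilcoxon signed-rank test; the statistic is the smaller of the
# positive and negative rank sums.
import numpy as np
from scipy import stats

happy = np.array([15, 26, 16, 22, 25, 40, 32, 22], dtype=float)
sad = np.array([28, 25, 32, 10, 21, 20, 27, 30], dtype=float)

stat, p = stats.wilcoxon(happy, sad, zero_method="wilcox")
print(f"W = {stat}, p = {p:.3f}")
```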
Conclusion and reporting
• Non-parametric tests tend to report the median, as it is less
influenced by outliers than the mean – but this depends on why
the non-parametric option was chosen in the first place

Effect size: r = 1 − (2Usmallest)/(n1 × n2), or approximately
r = z/√N

For the Mann–Whitney test:
– Ecstasy users (Median = 33.50) were significantly
more depressed than alcohol users (Median = 7.50),
U = 4.00, z = −3.48, p < .001, r = .78.
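The two effect-size formulas above as a sketch (the group sizes are assumed here, since the slide does not state them; with n1 = n2 = 10, z/√N reproduces the reported r = .78):

```python
# Two effect-size conversions for the Mann-Whitney test.
import math

def rank_biserial(u_smallest, n1, n2):
    """r = 1 - 2U/(n1*n2): the rank-biserial correlation."""
    return 1 - (2 * u_smallest) / (n1 * n2)

def r_from_z(z, n_total):
    """Approximate r = |z| / sqrt(N), for the normal approximation."""
    return abs(z) / math.sqrt(n_total)

# Hypothetical n1 = n2 = 10 (not given on the slide)
print(f"rank-biserial r = {rank_biserial(4.0, 10, 10):.2f}")
print(f"r from z       = {r_from_z(-3.48, 20):.2f}")
```

Note the two measures are on different scales and generally give different numbers for the same data.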
I want to do a t-test – but how do I spot non-normality?
• Normal Q-Q (quantile) plot
Look for kurtosis (points leaving the line) and skew (points
snaking around the line)

• Tests, e.g. Shapiro-Wilk, Kolmogorov-Smirnov (H1: the
distribution differs from the normal distribution) – but these
don’t operate well with smaller n’s AND can be too sensitive
with very large n’s
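A Shapiro-Wilk sketch in scipy (the scores are simulated), keeping in mind the small-n and large-n caveats above:

```python
# Normality check via Shapiro-Wilk.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
scores = rng.normal(loc=50, scale=10, size=60)  # simulated data

w, p = stats.shapiro(scores)
# p > .05 -> no evidence against normality (not proof of normality!)
print(f"Shapiro-Wilk W = {w:.3f}, p = {p:.3f}")

# For a Q-Q plot: stats.probplot(scores, dist="norm") with matplotlib
```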
In practice
If you have a problem with normality (no 100% solution):
• Not too bad with equal n’s of sufficient size – just go ahead
• Aim for larger samples (central limit theorem), but this
depends on the distribution (if heavy-tailed, you need n’s
larger than 30)
• Can consider cleaning/trimming/transforming
• Non-parametric tests
• Bootstrapping (& robust tests if R is available) – not currently
available

If you have a problem with HOV:
• Not so much a problem with equal n’s
• Welch’s t
• Non-parametric tests

Warning: Levene’s test doesn’t work so well with small n’s and
unequal n’s
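In scipy, Welch's t is just the `equal_var=False` flag; a sketch with simulated groups of clearly unequal variance. With equal n's the two t statistics coincide (only the df, and hence p, differ), which is why equal n's soften the HOV problem:

```python
# Welch's t-test vs Student's t-test on groups with unequal variances.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
group1 = rng.normal(loc=52, scale=5, size=40)    # small variance
group2 = rng.normal(loc=48, scale=15, size=40)   # large variance

t_welch, p_welch = stats.ttest_ind(group1, group2, equal_var=False)
t_student, p_student = stats.ttest_ind(group1, group2, equal_var=True)
print(f"Welch:   t = {t_welch:.2f}, p = {p_welch:.3f}")
print(f"Student: t = {t_student:.2f}, p = {p_student:.3f}")
```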
So What to do???
• Normality and HOV hold (and equal n’s)
T-test
• Normality holds but HOV violated
Welch’s t-test
• Normality violated
Look at sample size, equality of sample sizes and HOV
If sample sizes are (roughly) equal and reasonably large and
normality is not massively violated – OK to stick with Welch’s t-test

Alternatives:
Non-parametrics
Transforming the data and starting again