
Session 12: Two Group Hypothesis Tests (F-test, t-test, 2 Proportion)

Hypothesis Tests:
One Factor - Two Groups

F-tests (variances)
t-tests (means)
Two Proportions

Topic Motivation:
Which Group is performing better?

Case I:
Measure: Time to reconcile account, days
Group A (International Accounts): Average = 8.5 days
Group B (Domestic Accounts): Average = 9.0 days
Based on 20 samples from each group taken over 1 month
(lower the better)

Case II:
Measure: % Defective (two proportions)
Shift 1 = 3.2%
Shift 2 = 3.0%
Based on 200 samples taken over 1 month

Comments:

University of Michigan: Six Sigma Black Belt, P. Hammett 1



Topics

I. Statistical Inference and Hypothesis Tests


Null vs. Alternative Hypothesis
Types of errors (Type I and Type II errors)
Statistical Significance and p-values
Statistical vs. Practical Significance

II. Two Sample (Groups) Hypothesis Tests


A. F-Test
B. t-test Independent samples
C. Paired t-test
D. Two Proportion Tests

I. Statistical Inference Tests

Use hypothesis tests (statistical inference tests) to compare performance/preferences among groups

Among common tests to compare two groups (A vs. B):


Two Sample F-test: Compare two variances
Two Sample t-test: Compare two means
Example: If a process may be done using two methods, t-test may
be used to compare the mean of Method 1 vs. Mean of Method 2

Two Proportion test: Compare two proportions (e.g., yields)


Null and Alternative Hypothesis


In statistical inference testing, start with a favored claim: the null hypothesis (Ho)
Note: We do not prove the null!
Then identify the alternative hypothesis (Ha): what to test for as different

Mean Test
Ho: $\mu_1 = \mu_2$
Ha: $\mu_1 \neq \mu_2$

Variance Test
Ho: $\sigma_1^2 = \sigma_2^2$
Ha: $\sigma_1^2 \neq \sigma_2^2$

Example: Task Completion Time
Ho: $\mu_1 = \mu_2$
Ha: Means are not equal

Statistical tests may be applied to many other situations
Ho: Data are Normal; Ha: Data are not Normal
Ho: Process is in statistical control; Ha: Process is not in statistical control

Note: The above are 2-sided tests

Statistical Hypothesis Tests and Errors


For any statistical test, there is the TRUTH (result if we measured
every item from population) and what we CONCLUDE (say)
based on the sample of data collected in experiment
Thus, given any test, we have 4 possible outcomes:
Two Correct Decisions:
Conclude (Say) no difference exists when no difference exists (TRUTH)
Conclude (Say) a difference exists when a difference exists (TRUTH)
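Two Possible Errors:
Conclude (Say) a difference exists when no difference exists (Type I error, false alarm)
Conclude (Say) no difference exists when a difference exists (Type II error)

                                              Ho is actually:
Conclude (Say)                                True            False
Reject Ho (Conclude difference)               Type I error    Correct
Fail to Reject Ho (Conclude no difference)    Correct         Type II error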


Hypothesis Tests and p-values


To test for significance in a hypothesis test, use software to
compute a p-value
p-value - probability of obtaining a test statistic at least as
extreme as the one observed if the null hypothesis were true
In simple, non-technical terms, the smaller the p-value, the weaker the support for the null,
and therefore the stronger the evidence that a significant difference exists

To determine statistical significance, compare p-values to the Type I (False Alarm) error threshold known as the alpha value (α)
For most experiments, let α = 0.05 or α = 0.01
Note: Confidence may be expressed as (1 - α) × 100%

In hypothesis testing, we start with the premise that no significant difference exists and look for evidence otherwise
Ho: All babies are born beautiful

p-values & concluding significance


(Statistical vs. Practical Significance)

Given p-value and alpha, assess statistical significance:
p-value < alpha (α): conclude a statistically significant difference (reject Ho)
p-value ≥ alpha (α): fail to conclude a difference (i.e., not enough evidence of a difference; this does not prove the null)

To minimize errors, seek a sufficiently representative sample:
Representative of the population of interest
With a sufficiently large sample to identify desired effects
Seek minimal Type I error (false alarm) and Type II error (fail to detect a significant shift)
With sufficiently low inherent variance within groups to see effects

Remember, with statistical tests, we
First, assess if a statistically significant difference exists
Then, assess if the result is useful (practically significant)
Practical: Where the difference effect is sufficiently large for management
to change the system (e.g., impact cost, requirements, # defects)


II. Two Sample (Two Groups) Hypothesis Tests

Choice of test depends on data type, statistic, and assumptions
Note: Other than for paired tests, the sample size per group may vary

Y Data
  Continuous (Assume Normal Data)
    A. Two Variances (independent samples): F-test
    Two Means: t-test
      B. Independent Samples
        Independent Samples t-test (Pooled t-test) (Assume Variances Equal)
        Independent t-test (Assume Variances Not Equal)
      C. Paired Data: Paired t-test
  Binary Output (Large Sample)
    D. Two Proportion Test

Note: See Minitab for additional types of tests (e.g., Z-test)

Applying F-test and t-test


Suppose you want to compare:
Two machines, two gages, two user groups, two methods,
two types of a product or service

Common Approach:
Step 1 (Test variances):
If have two groups of independent samples and may assume
underlying distribution is Normal, use F-test to test for
differences in the two variances
If concerned about normality, may use Levene Test
Step 2 (Test means):
If data from the two groups are independent - use independent
samples t-test either:
Assuming equal variances
Assuming unequal variances
Note: If data are paired - use Paired t-test
Paired: Same physical unit in both groups


Case Study
Document Handling Time (DHT)
Background: Financial services firm is required to check that original Trade
Fund documents meet Govt compliance requirements
Process: Review request receive time to notification sent to Adviser
Metric: Time required to check document compliance (DHT)
Scope: Internal Requirement: DHT < 10 min per review of Std. Document
Control Factors
Location (Region A vs. Region B)
Type of Customer (VIP vs. Regular)
Time of year (Peak vs Non-peak)
Staff Skill Level (Expert vs. Proficient)
Submission In Good Order vs. Not
Language (E vs. F)

Inputs (Prior Process Outputs)
Req Type (Internet, Portal, Paper, Email, Phone)
Source (Affiliate, Supplier, Business Unit)
Time of day request arrives

Process
Check Compliance of Support Documents (Queue 11)

Outputs (Metrics)
Document Handling Time (DHT)
% Compliance

Uncontrollable Noise Factors
Volume of work per day
Unplanned staff absence
Time of day mail is delivered

Current State Performance


Not meeting quality goals
Mean DHT = 12.03
St Dev = 5.6
Cpk = 0.17
% Out Spec = 30%

Possible Factors (Two Groups)
Region (A vs. B)
Language (E vs. F)
Staff Experience (Proficient vs. New Hire)
VIP Status (Yes=1 vs. No=0)
Document Not in Good Order/Rework (Yes=1, No=0)
Measurement System Study: Automated Data Collection vs. Manual Tracking

Next, apply statistical tests to compare alternatives
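As a rough check on the current-state figures, the sketch below recomputes the capability index and the expected fraction out of specification from the reported mean and standard deviation. It assumes an upper specification limit of 15 minutes (the USL quoted on the NGO results slide), a one-sided Cpk, and Normally distributed handling times; none of these assumptions is confirmed for this particular chart.

```python
from statistics import NormalDist

# Summary values reported for current-state DHT (minutes).
mean_dht = 12.03
sd_dht = 5.6
usl = 15.0  # assumption: upper spec limit taken from the NGO results slide

# One-sided capability index against the upper limit only (assumption).
cpk_upper = (usl - mean_dht) / (3 * sd_dht)

# Expected fraction above the USL if DHT were Normal (assumption).
frac_out = 1 - NormalDist(mean_dht, sd_dht).cdf(usl)

print(f"Cpk (upper) = {cpk_upper:.3f}")
print(f"Fraction above USL = {frac_out:.1%}")
```

Under these assumptions the results land close to the reported Cpk of 0.17 and roughly 30% out of specification.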


DHT By Region (Two Groups)

Suppose want to compare DHT between Region A vs. B


Two Key Questions:
Are the variances significantly different?
Are the means significantly different?

[Figure: Multi-Box Plot of DHT by Region]

A. F-tests (two samples)


Test if two sample groups have the same underlying variance ($\sigma^2$)
Key assumptions: Data for each sample group are randomly selected from
two Independent Normal Populations
Ho (Null Hypothesis): $\sigma_A^2 = \sigma_B^2$
Ha (Alternative): $\sigma_A^2 \neq \sigma_B^2$
F-Test Statistic:
$$F = \frac{S_A^2}{S_B^2}$$
Use software to generate p-value for F-test statistic:
If p-value < α: conclude variances are different (Reject Ho)
If p-value ≥ α: do not conclude variances are different (Fail to Reject Ho)
Understanding F-test p-values
If two variances are equal, what value for F-test statistic would we get?
Thus, p-values will be small under what conditions for F-statistic?


Minitab: Two Variance Test


Compare Variance of Region A vs. B
Minitab: STAT >> Basic Statistics >> Two Variances
Note: Different formats may be used to enter data. See drop down list in Minitab
Check box to assume Normal for F-test

Two Variances: Minitab Results


Would you conclude a significant difference in variances?

Another Way: What do non-overlapping 95% Confidence Intervals of Std. Dev. suggest?

Assume Normal; assume α = 0.05

Method  DF1  DF2  Statistic  P-Value
F       163   45       1.60    0.067
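The F statistic and degrees of freedom in this output can be reproduced by hand. The sketch below assumes the sample standard deviations and sample sizes reported for Region A and Region B in the two-sample t-test output (5.81 with n = 164, and 4.60 with n = 46); the p-value itself still needs statistical software or an F table.

```python
# Sample standard deviations and sizes for the two regions
# (taken from the Region A vs. B summary statistics).
s_a, n_a = 5.81, 164
s_b, n_b = 4.60, 46

# F statistic: ratio of the two sample variances.
f_stat = s_a**2 / s_b**2

# Numerator and denominator degrees of freedom.
df1, df2 = n_a - 1, n_b - 1

print(f"F = {f_stat:.2f} with DF1 = {df1}, DF2 = {df2}")
```

Equal sample variances give F = 1, so the further the ratio moves from 1 (in either direction for a two-sided test), the smaller the p-value.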


B. Independent Samples t-tests


(assuming unknown group variances)

Used to test if two groups have different means
Key assumptions: Data for each sample group are randomly selected from Two Independent Normal Populations

Types of independent sample t-tests:
1. Independent samples test assuming equal variances
E.g., assume $\sigma_1^2 = \sigma_2^2$ {often called a pooled t-test}
2. Independent samples test assuming unequal variances

Also may obtain p-value for t-test using Excel: =TTEST(array1, array2, tails, type)

Handling Time: Process Factors


Among the different process factors, for which would you likely
use two independent samples (groups)?
Might any of these be tested using paired data, where
you have the same units (or customers) in both groups?

Region (A vs. B)
Language (E vs. F)
Staff Experience (Proficient vs. New Hire)
VIP Status (Yes=1 vs. No=0)
Document Not in Good Order/Rework (Yes=1, No=0)
Measurement System Study: Automated Collection System vs. Manual


Pooled t-test
(Independent two-sample test)
Pooled t-tests assume equal variances (pooled variance $S_p^2$)
Ho: $\mu_1 = \mu_2$
Ha: $\mu_1 \neq \mu_2$
Test Statistic:
$$t = \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}$$

If p-value < α: conclude means are different (Reject Ho)
If p-value ≥ α: do not conclude means are different (Fail to reject Ho)
Understanding t-test statistic p-values
If two means are equal, what value for t-test statistic would we get?
Thus, p-values will be small under what conditions for t statistic?

Two Independent Samples t-test


(assuming unequal variances)
Independent Samples t-test (assuming unequal variances)
Similar to the pooled t-test, but uses a slightly different test statistic:
$$t_{\text{unequal variances}} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}$$

Given our two variance test results, which test might you recommend?
(Note: Or, choose independently of F-test results based on
understanding of the process)
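To see how the unequal-variances statistic behaves on the case-study data, here is a minimal sketch using the Region A and Region B summary statistics from the t-test output (means 11.52 and 13.84, standard deviations 5.81 and 4.60, n = 164 and 46). The degrees-of-freedom formula is the standard Welch-Satterthwaite approximation, which is an addition here and is not shown on the slides; `welch_t` is a hypothetical helper name.

```python
from math import sqrt

def welch_t(mean1, s1, n1, mean2, s2, n2):
    """Return (t, df) for the two-sample t-test assuming unequal variances."""
    v1, v2 = s1**2 / n1, s2**2 / n2          # variance of each sample mean
    t = (mean1 - mean2) / sqrt(v1 + v2)
    # Welch-Satterthwaite approximation for degrees of freedom (assumption).
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

t_welch, df_welch = welch_t(11.52, 5.81, 164, 13.84, 4.60, 46)
print(f"t = {t_welch:.2f}, approx. df = {df_welch:.1f}")
```

Because the smaller group (Region B) also has the smaller standard deviation, the unequal-variances statistic comes out somewhat larger in magnitude than the pooled one for these data.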


Minitab: t-test
(Region A vs. B)

Ho: $\mu_1 = \mu_2$    Ha: $\mu_1 \neq \mu_2$

Minitab: STAT >> Basic Statistics >> Two Sample t
Check box if assume equal variances

t-test Results (Region A vs. B)


Assuming Equal Variances (assume α = 0.05)
Would you conclude a mean difference exists?

Two-Sample T-Test and CI: DHT, Region

Two-sample T for DHT

Region      N   Mean  StDev  SE Mean
Region A  164  11.52   5.81     0.45
Region B   46  13.84   4.60     0.68

Difference = μ (Region A) - μ (Region B)
Estimate for difference: -2.323
95% CI for difference: (-4.156, -0.490)
T-Test of difference = 0 (vs ≠): T-Value = -2.50  P-Value = 0.013  DF = 208
Both use Pooled StDev = 5.5732
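The pooled t-test figures in this output can be checked from the summary statistics alone. The sketch below applies the pooled-variance formula to the rounded means and standard deviations shown in the table, so small differences from Minitab (which uses the raw data) are expected; `pooled_t` is a hypothetical helper name.

```python
from math import sqrt

def pooled_t(mean1, s1, n1, mean2, s2, n2):
    """Return (t, pooled_sd, df) for the equal-variances two-sample t-test."""
    df = n1 + n2 - 2
    # Pooled standard deviation: weighted average of the two sample variances.
    sp = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)
    t = (mean1 - mean2) / (sp * sqrt(1 / n1 + 1 / n2))
    return t, sp, df

t_pooled, sp, df = pooled_t(11.52, 5.81, 164, 13.84, 4.60, 46)
print(f"Pooled StDev = {sp:.2f}, t = {t_pooled:.2f}, DF = {df}")
```

If SciPy is available, `scipy.stats.ttest_ind_from_stats` accepts the same summary inputs and also returns the p-value.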


Interval Plots
Another way to compare Means is Interval Plot (Minitab Graphs)
What is the visual indicator of a significant difference?


Minitab Assistant: Two Sample t-test


Minitab Assistant also provides an interface with a more
comprehensive analysis for hypothesis testing
Suppose we wish to test if a difference exists by Language
Ho: $\mu_{\text{Language E}} = \mu_{\text{Language F}}$
Ha: Means Not Equal

[Figure: Minitab Assistant dialog. Annotation: Will discuss in Power and Sample Size Planning]


Summary Report: Minitab Assistant

Provides similar information plus interpretations/graphical output

Conclusion: Fail to Reject Ho (i.e., not enough evidence to conclude means are different)

One-Sided (One-tail) Statistical Tests


For two-sample hypothesis tests, we typically test whether the
variances (or means) are different or NOT different
This is known as a Two-Sided Test
Then, if we conclude a difference, we may look at the two groups to
see which is larger (or smaller) (this is a quick approach)
Technically, to conclude one group is larger (or smaller) requires a
One-Sided Test (Note: See Minitab Setting)
For example, suppose we wish to test if Group A has a larger mean
Note: Assess significance using p-values the same way as before

Ho: $\mu_A \leq \mu_B$
Ha: $\mu_A > \mu_B$
Alternative Ha is the difference of interest to test


Minitab Assistant:
One-Sided (one-tail) Test
Important: Alternative Ha is what you are interested in concluding!
Suppose we wish to test if New Hires take longer to complete review
Tests Ha that: Mean New Hire > Mean Proficient

Any concerns with this analysis?

Minitab Results: One-Sided Test (assume α = 0.05)

P-value = 0.005

Mean Difference Effect = 4.84 (= 16.64 - 11.80)

Conclude Mean for New Hires is significantly greater than
for Proficient Staff (though not many new hire samples)


C. t-tests with paired data

Paired data: Same parts (units) measured in both groups, so the samples are dependent

$\mu_D = \mu_1 - \mu_2$
Ho: $\mu_D = 0$    Ha: $\mu_D \neq 0$
Test Statistic:
$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$
Here, $\bar{d}$ is the average difference
$s_d$ is the std. dev. of the differences
$n$ is the number of paired samples

Minitab: STAT >> Basic Statistics >> Paired t

Paired t-test Example


Suppose you may measure the same units using two different methods:
Automated Collection System
Manual Data Collection
See Excel File for conditions and full data set

For paired data, compute the difference (+/-) for each pair
Diff (i) = Group A (i) - Group B (i)
i = observation (sample) #

MeasID  DHT-System  DHT-Manual  Difference
1             13.9          12         1.9
2              4.5           4         0.5
3             16.2          16         0.2
4               11          11           0
5              3.6           3         0.6
6               12          12           0
7              7.5           7         0.5
8             10.4           9         1.4
9              5.4           7        -1.6
10             4.4           4         0.4
11            10.9          11        -0.1
12             5.9           7        -1.1
13             8.5           8         0.5
14             8.3           8         0.3
15             5.5           5         0.5
See Excel File for Full Data Set (30)

Avg Diff = 0.16
StDev Diff = 0.96


Minitab Results
Paired t-test

Paired t-test result: P-value = 0.377 (assume α = 0.05)

Does the measurement method matter?
Is this a good finding?

Paired T for DHT-System - DHT-Manual

             N   Mean  StDev  SE Mean
DHT-System  30  9.057  4.135    0.755
DHT-Manual  30  8.900  4.245    0.775
Difference  30  0.157  0.957    0.175

95% CI for mean difference: (-0.201, 0.514)
T-Test of mean difference = 0 (vs ≠ 0):
T-Value = 0.90 P-Value = 0.377
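The paired t statistic and confidence interval follow directly from the "Difference" row of the output. The sketch below works from those summary values; the critical value 2.045 (two-sided 95%, 29 degrees of freedom) comes from a standard t table and is an assumption not shown in the slides.

```python
from math import sqrt

# Summary of the 30 paired differences (DHT-System minus DHT-Manual).
d_bar, s_d, n = 0.157, 0.957, 30

se = s_d / sqrt(n)        # standard error of the mean difference
t_paired = d_bar / se     # paired t statistic, df = n - 1

# 95% CI for the mean difference; 2.045 is the t-table value for df = 29
# (assumption: looked up, not computed here).
t_crit = 2.045
ci = (d_bar - t_crit * se, d_bar + t_crit * se)

print(f"SE = {se:.3f}, t = {t_paired:.2f}")
print(f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The interval contains zero, which agrees with the p-value of 0.377 being above α = 0.05.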


Paired vs. Unpaired Data


(Different Example)
Using an independent samples t-test WITH paired data may result in
failure to detect a mean difference (Type II, or beta, error from
excluding information about units)
Example: Suppose you give 15 properties to 2 different appraisers
and wish to test for a mean effect

Two-sample T for Appraiser1 vs Appraiser2

             N    Mean  StDev  SE Mean
Appraiser1  15  107030  46903    12110
Appraiser2  15  110749  48763    12590

Estimate for difference: -3719
95% CI for difference: (-39503, 32066)
T-Test of difference = 0 (vs ≠): T-Value = -0.21  P-Value = 0.833  DF = 28
Both use Pooled StDev = 47842.1162

Paired T for Appraiser1 - Appraiser2

             N    Mean  StDev  SE Mean
Appraiser1  15  107030  46903    12110
Appraiser2  15  110749  48763    12590
Difference  15   -3719   5254     1356

95% CI for mean difference: (-6628, -809)
T-Test of mean difference = 0 (vs ≠ 0): T-Value = -2.74  P-Value = 0.016
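The contrast between the two analyses comes entirely from the standard error in the denominator. The sketch below recomputes both t statistics from the summary values in the outputs above; variable names are illustrative only.

```python
from math import sqrt

n = 15                    # properties appraised by both appraisers
mean_diff = -3719         # Appraiser1 mean minus Appraiser2 mean
pooled_sd = 47842.1162    # spread between properties (two-sample analysis)
sd_of_diffs = 5254        # spread of within-property differences (paired analysis)

# Independent-samples analysis: property-to-property variation stays in the error term.
se_independent = pooled_sd * sqrt(1 / n + 1 / n)
t_independent = mean_diff / se_independent

# Paired analysis: each property acts as its own control, removing that variation.
se_paired = sd_of_diffs / sqrt(n)
t_paired = mean_diff / se_paired

print(f"Independent: SE = {se_independent:.0f}, t = {t_independent:.2f}")
print(f"Paired:      SE = {se_paired:.0f}, t = {t_paired:.2f}")
```

The mean difference is identical in both analyses; only the standard error changes, by more than a factor of ten for these data.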


D. Two Proportion Tests


Compare yields or % defective rates from:
Two processes
Two departments
Two procedures
Two machines

May use for either of the following outputs:


Yields
Fraction defective


Two Proportion Hypothesis Test


Proportion, p = fraction defective (or yield)
Null Hypothesis: Ho: $p_1 = p_2$
Alternate Hypothesis: Ha: $p_1 \neq p_2$

Necessary data for each group to test:
Sample Size: n1 and n2 (test does not require equal sample sizes)
# defective: X1, X2 (or # not defective, where p = yield)
Hence: p1 = X1/n1; p2 = X2/n2
Assumptions:
Two groups are independent and have a sufficiently large sample for
each group to assume the Normal Approximation to the Binomial
Guideline: $n_i p_i > 5$ and $n_i (1 - p_i) > 5$ for each group i
Note: If the Normality Assumption is not satisfied, Minitab gives a Warning

Again, use p-values to determine statistical significance


Example: Two Proportions

Suppose you wish to compare the number of mis-routed documents between Two Types: Paper vs. Phone/Email
Type 1 (Phone/Email): 187 of 479 Misrouted (Defect Rate = 39%)
Type 2 (Paper): 214 of 1128 Misrouted (Defect Rate = 19%)

Suppose you wish to test if Type 1 has a significantly greater defect rate
Ho: $p_1 \leq p_2$
Ha: $p_1 > p_2$

Proportion Test Summary Data


(Stat >> Basic Statistics >> 2 Proportions)

In Minitab, may use raw data or summary data (more common)

Sample    X     N  Sample p
1       187   479  0.390397
2       214  1128  0.189716

Ha: p1 > p2


Minitab Results

Does Type 1 (Phone/Email) have a significantly greater defect rate? (Assume α = 0.05)

Test and CI for Two Proportions

Sample    X     N  Sample p
1       187   479  0.390397
2       214  1128  0.189716

Difference = p (1) - p (2)
Estimate for difference: 0.200680
95% lower bound for difference: 0.159293
Test for difference = 0 (vs > 0): Z = 8.50
P-Value = 0.000

Based on pooled estimate of proportion method in Minitab
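The Z statistic, one-sided p-value, and lower bound can be reproduced with the Normal approximation. This is a minimal sketch assuming the pooled-proportion standard error for the test (as the output states) and the unpooled standard error for the lower confidence bound, which appears to match the reported value but is an inference rather than something the slides state.

```python
from math import sqrt
from statistics import NormalDist

x1, n1 = 187, 479     # Type 1 (Phone/Email): misrouted, total
x2, n2 = 214, 1128    # Type 2 (Paper): misrouted, total

p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2

# Guideline check for the Normal approximation: all expected counts > 5.
approx_ok = min(x1, n1 - x1, x2, n2 - x2) > 5

# Test statistic using the pooled estimate of the common proportion under Ho.
p_pool = (x1 + x2) / (n1 + n2)
z = diff / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

# One-sided p-value for Ha: p1 > p2 (upper tail of the standard Normal).
p_value = NormalDist().cdf(-z)

# 95% lower bound for the difference, using the unpooled standard error (assumption).
se_unpooled = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
lower_bound = diff - NormalDist().inv_cdf(0.95) * se_unpooled

print(f"z = {z:.2f}, one-sided p = {p_value:.3g}, lower bound = {lower_bound:.4f}")
```

The p-value is far below 0.001, which is why Minitab displays it as 0.000.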

Summary
Hypothesis Tests for differences between two groups:
F-test: Test two variances
T-test: Test two means (equal variances, unequal variances, paired)
Two Proportion Test: Test two proportions
Different types of t-tests exist based on whether:
Samples are independent or dependent (e.g., Paired t-test uses same units
in each group)
Within group variances are assumed equal or unequal
Results may be affected by the test setup: alpha error, sample size, 1-sided vs.
2-sided tests, and how representative the sample is of the population
Hypothesis tests (e.g., F-test, t-test, two proportion test) provide a tool
to assess if a difference is statistically significant
Ultimately, users must determine if statistically significant also
implies practically significant

Appendix:
Effect Size, ni, and Practical Significance

Particularly with large sample sizes*, mean differences that are
statistically significant may not be Practically Meaningful

Here, should consider and report the mean difference effect
Mean Difference Effect = Mean 1 - Mean 2 (or use Absolute Difference)
Advanced: Compute Cohen's d or Glass's Delta to help categorize the effect
$$\text{Cohen's } d = \frac{\mu_2 - \mu_1}{\sigma} \quad \text{(estimate } \sigma \text{ using the pooled std. deviation)}$$
$$\text{Glass's } \Delta = \frac{\mu_2 - \mu_{\text{control}}}{\sigma_{\text{control}}} \quad \text{(here, must establish a control (baseline) group)}$$

*Defining a very large sample size for a continuous variable would depend on the
standard deviation and test result implications, though generally a sample size of 10,000
(or perhaps 1000) may be viewed as very large. Of course, regardless of sample size,
one could have statistically significant but not practically meaningful results

General Guidelines for Cohens d

Cohen proposed the following cutoffs* for effect size, where $|\text{Cohen's } d| = |\mu_2 - \mu_1| / \sigma$
|d| < 0.2: Minimal or Near Zero Effect
|d| 0.2 to < 0.5: Small Effect
|d| 0.5 to < 0.8: Medium Effect
|d| >= 0.8: Large Effect

Effect Category  Cohen |d|
Small                  0.2
Medium                 0.5
Large                  0.8

Factor             Mean 1  Mean 2  Pooled S**  Mean Difference  |Cohen's d|  Effect Category
Region (A vs B)     11.52   13.84        5.57            -2.32         0.42  Small
Language (E vs F)   12.15   11.66        5.65             0.49         0.09  Minimal or Near Zero
In Order vs. NGO     10.3      18        4.65            -7.70         1.66  Large

*Should be viewed only as guidelines for descriptive purposes
** From Two Sample t-test assuming equal variances

Or, assess how practically meaningful the effect size is based on cost of poor quality,
defect rates, or revenue improvement (Subjective Management Decision)
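The effect-size table can be reproduced from the group means and pooled standard deviations. The sketch below is illustrative: `effect_category` is a hypothetical helper that applies the cutoffs listed above, treating |d| of exactly 0.8 as Large.

```python
def effect_category(d_abs):
    """Map |Cohen's d| to a descriptive category using Cohen's guideline cutoffs."""
    if d_abs < 0.2:
        return "Minimal or Near Zero"
    if d_abs < 0.5:
        return "Small"
    if d_abs < 0.8:
        return "Medium"
    return "Large"

# (factor, mean 1, mean 2, pooled std. deviation) from the case-study t-tests.
factors = [
    ("Region (A vs B)", 11.52, 13.84, 5.57),
    ("Language (E vs F)", 12.15, 11.66, 5.65),
    ("In Order vs. NGO", 10.3, 18.0, 4.65),
]

results = {}
for name, m1, m2, sp in factors:
    d_abs = abs(m1 - m2) / sp          # |Cohen's d| with pooled std. deviation
    results[name] = (round(d_abs, 2), effect_category(d_abs))
    print(f"{name}: |d| = {d_abs:.2f} ({results[name][1]})")
```

Only the rework factor (In Order vs. NGO) reaches the Large category, which is consistent with the attention it receives in the effect-estimation slides.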


Estimating Average Effects (Delta: Δ)

In addition to testing for significance, may use mean hypothesis tests
to estimate Mean Difference Effects (Difference (Δ): Mean 1 - Mean 2)
Note: If no significant difference, assume Mean Δ = 0
Advanced: If multiple factors are significant, may use General Linear
Model or Multiple Regression Techniques to estimate effects of
individual factors after accounting for other factors (discuss later)

Suppose we wish to estimate the effect of rework (cases not in
good order, or NGO = 1)

Currently, 24% of case files are submitted NGO

Minitab Results: Mean Difference Effect


Here, for those Not in Good Order, the average effect on handling time
is ~8 min (Mean Difference). Although this only happens on 24% of
cases, it contributes to high variation and out-of-specification conditions

Significant with Mean Difference = -7.7

Note: Average for NGO Mean (18) is greater than USL (15 minutes)
