12T Hypoth FT 2PropTests

Session 12: Two Group Hypothesis Tests (F-test, t-test, 2 Proportion)
Hypothesis Tests:
One Factor - Two Groups
F-tests (variances)
t-tests (means)
Two Proportions
Topic Motivation:
Which Group is performing better?
Case I:
Measure: Time to reconcile account, days
Group A (International Accounts): Average = 8.5 days
Group B (Domestic Accounts): Average = 9.0 days
Based on 20 samples from each group taken over 1 month
(lower the better)
Case II:
Measure: % Defective (two proportions)
Shift 1 = 3.2%
Shift 2 = 3.0%
Based on 200 samples taken over 1 month
Comments:
University of Michigan: Six Sigma Black Belt, P. Hammett 1

Topics
I. Statistical Inference and Hypothesis Tests

Null vs. Alternative Hypothesis
Types of errors (Type I and Type II errors)
Statistical Significance and p-values
Statistical vs. Practical Significance
II. Two Sample (Groups) Hypothesis Tests

A. F-Test
B. t-test Independent samples
C. Paired t-test
D. Two Proportion Tests
I. Statistical Inference Tests
Use hypothesis tests (statistical inference tests) to compare
performance/preferences among groups
Among common tests to compare two groups (A vs. B):
Two Sample F-test: Compare two variances
Two Sample t-test: Compare two means
Example: If a process may be done using two methods, t-test may
be used to compare the mean of Method 1 vs. Mean of Method 2
Two Proportion test: Compare two proportions (e.g., yields)

Null and Alternative Hypothesis

In statistical inference testing, start with a favored claim, Ho
Null hypothesis (Ho)
Note: We do not prove the null!
And, identify the Alternative Hypothesis, Ha
What to test for as different
Example: Task Completion Time
Ho: $\mu_1 = \mu_2$
Ha: Means are not equal
Mean Test:
Ho: $\mu_1 = \mu_2$
Ha: $\mu_1 \neq \mu_2$
Variance Test:
Ho: $\sigma_1^2 = \sigma_2^2$
Ha: $\sigma_1^2 \neq \sigma_2^2$
Note: Above are 2-sided tests
Statistical Tests may be applied to many other situations
Ho: Data are Normal;
Ha: Data are not Normal
Ho: Process is in statistical control;
Ha: Process is not in statistical control
Statistical Hypothesis Tests and Errors

For any statistical test, there is the TRUTH (result if we measured
every item from population) and what we CONCLUDE (say)
based on the sample of data collected in experiment
Thus, given any test, we have 4 possible outcomes:
Two Correct Decisions:
Conclude (Say) no difference exists when no difference exists (TRUTH)
Conclude (Say) a difference exists when a difference exists (TRUTH)
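Other two possible decisions (errors):
Conclude (Say) a difference exists when no difference exists: Type I error (False Alarm)
Conclude (Say) no difference exists when a difference exists: Type II error (Fail to detect)

Conclude (Say)                                Ho is actually True   Ho is actually False
Reject Ho (Conclude difference)               Type I error          Correct
Fail to Reject Ho (Conclude no difference)    Correct               Type II error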

Hypothesis Tests and p-values

To test for significance in a hypothesis test, use software to
compute a p-value
p-value - probability of obtaining a test statistic at least as
extreme as the one observed if the null hypothesis were true
In simple, non-technical terms, the smaller the p-value, the stronger the evidence
against the null, and therefore the more likely a significant difference exists
To determine statistical significance, compare p-values to the Type I
(False Alarm) error threshold known as the alpha value (α)
For most experiments, let α = 0.05 or α = 0.01
Note: Confidence may be expressed as (1 - α) x 100%
In hypothesis testing, we start with the premise that no significant
difference exists and look for evidence otherwise
Ho: All babies are born beautiful
p-values & concluding significance

(Statistical vs. Practical Significance)
Given p-value and alpha, assess statistical significance:
p-value < alpha (α): conclude a statistically significant difference
p-value > alpha (α): fail to conclude a difference (i.e., not enough evidence to say the groups differ)
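
The decision rule above can be sketched in a few lines of Python. This is only an illustration: the function name `assess_significance` is a hypothetical helper, and the two p-values are the ones reported later in this session for the Region A vs. B variance and mean tests.

```python
def assess_significance(p_value: float, alpha: float = 0.05) -> str:
    """Apply the p-value vs. alpha decision rule for a hypothesis test."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must be between 0 and 1")
    if p_value < alpha:
        return "Reject Ho: statistically significant difference"
    return "Fail to reject Ho: not enough evidence of a difference"


# p-values taken from the Region A vs. B examples in this session
print(assess_significance(0.067))  # F-test on variances
print(assess_significance(0.013))  # pooled t-test on means
```

Note that the same p-value of 0.013 would not be significant at α = 0.01, so the conclusion depends on the alpha chosen before the test.
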
To minimize errors, seek a sufficiently representative sample:
Representative of Population of interest
With sufficiently large sample to identify desired effects
Seek minimal Type I (False Alarm) and Type II (fail to detect a significant shift) errors
With sufficiently low inherent variance within groups to see effects
Remember, with statistical tests, we
First, assess if a statistically significant difference exists
Then, assess if the result is practically significant (useful)
Practical: Where the difference effect is sufficiently large for management
to change the system (e.g., impact cost, requirements, # defects)

II. Two Sample (Two Groups) Hypothesis Tests
Choice of test depends on data type, statistic, and assumptions
Note: Other than for paired tests, the sample size per group may vary
Y Data
  Continuous (Assume Normal Data)
    Two Variances
      A. F-test (independent samples)
    Two Means (t-test)
      B. Independent Samples
        Independent Samples t-test (Pooled t-test) (Assume Variances Equal)
        Independent t-test (Assume Variances Not Equal)
      C. Paired Data
        Paired t-test
  Binary Output (Large Sample)
    D. Two Proportion Test
Note: See Minitab for additional types of tests (e.g., Z-test)
Applying F-test and t-test

Suppose you want to compare:
Two machines, two gages, two user groups, two methods,
two types of a product or service
Common Approach:
Step 1 (Test variances):
If you have two groups of independent samples and may assume the
underlying distribution is Normal, use F-test to test for
differences in the two variances
If concerned about normality, may use Levene Test
Step 2 (Test means):
If data from the two groups are independent, use independent
samples t-test either:
Assuming equal variances
Assuming unequal variances
Note: If data are paired, use Paired t-test
Paired: Same physical unit in both groups

Case Study
Document Handling Time (DHT)
Background: Financial services firm is required to check that original Trade
Fund documents meet Govt compliance requirements
Process: From time review request is received to time notification is sent to Adviser
Metric: Time required to check document compliance (DHT)
Scope: Internal Requirement: DHT < 10 min per review of Std. Document
Control Factors
Location (Region A vs. Region B)
Type of Customer (VIP vs. Regular)
Time of year (Peak vs Non-peak)
Staff Skill Level (Expert vs. Proficient)
Submission In Good Order vs. Not
Language (E vs. F)
Inputs (Prior Process Outputs)
Req Type (Internet, Portal, Paper, Email, Phone)
Source (Affiliate, Supplier, Business Unit)
Time of day request arrives
Process
Check Compliance of Support Documents (Queue 11)
Process Outputs (Metrics)
Document Handling Time (DHT)
% Compliance
Uncontrollable Noise Factors

Volume of work per day
Unplanned staff absence
Time of day mail is delivered
Current State Performance

Not meeting quality goals
Possible Factors (Two Groups)
Region (A vs. B)
Language (E vs. F)
Staff Experience (Proficient vs. New Hire)
VIP Status (Yes=1 vs. No=0)
Document Not in Good Order/Rework (Yes=1, No=0)
Measurement System Study: Automated Data Collection vs. Manual Tracking
Mean DHT = 12.03
St Dev = 5.6
Cpk = 0.17
% Out Spec = 30%
Next, apply statistical tests to compare alternatives

DHT By Region (Two Groups)
Suppose we want to compare DHT between Region A vs. B
Two Key Questions:
Are the variances significantly different?
Are the means significantly different?
[Figure: Multi-Box Plot of DHT by Region]
A. F-tests (two samples)

Test if two sample groups have the same underlying variance ($\sigma^2$)
Key assumptions: Data for each sample group are randomly selected from
two Independent Normal Populations
Ho Null Hypothesis; Ha Alternative
Ho: $\sigma_A^2 = \sigma_B^2$
Ha: $\sigma_A^2 \neq \sigma_B^2$
F-Test Statistic:
$$F = \frac{S_A^2}{S_B^2}$$
Use software to generate p-value for F-test statistic:
If p-value < α, conclude variances are different (Reject Ho)
If p-value > α, conclude variances are not different (Fail to Reject Ho)
Understanding F-test p-values
If two variances are equal, what value for F-test statistic would we get?
Thus, p-values will be small under what conditions for F-statistic?
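
As a rough check on these questions, the F statistic can be computed directly from two sample standard deviations. The sketch below uses the rounded Region A and Region B summary values reported in the two-sample t-test output later in this session (StDev 5.81 with N = 164, and StDev 4.60 with N = 46). The helper name `f_statistic` is hypothetical. Turning F into a p-value still requires the F distribution from statistical software.

```python
def f_statistic(sd_a: float, n_a: int, sd_b: float, n_b: int):
    """Return (F, df1, df2) for the two-sample F-test, F = S_A^2 / S_B^2."""
    f_value = sd_a**2 / sd_b**2
    return f_value, n_a - 1, n_b - 1


# Rounded summary statistics for DHT by Region (assumed inputs)
f_value, df1, df2 = f_statistic(sd_a=5.81, n_a=164, sd_b=4.60, n_b=46)
print(f"F = {f_value:.2f}, DF1 = {df1}, DF2 = {df2}")
# prints: F = 1.60, DF1 = 163, DF2 = 45
```

If the two sample variances were equal, F would be 1. The further F moves from 1 in either direction, the smaller the p-value becomes.
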

Minitab: Two Variance Test

Compare Variance of Region A vs. B
Minitab: STAT >> Basic Statistics >> Two Variances
Note: Different formats may be used to enter data. See drop down list in Minitab
Check box to assume Normal for F-test
Two Variances: Minitab Results

Would you conclude a significant difference in variances?
Another Way: What do non-overlapping 95% Confidence Intervals of Std. Dev. suggest?
Assume Normal
Assume α = 0.05
Method  DF1  DF2  Statistic  P-Value
F       163   45       1.60    0.067

B. Independent Samples t-tests

(assuming unknown group variances)
Used to test if two groups have different means

Key assumptions: Data for each sample group are randomly
selected from Two Independent Normal Populations
Types of independent sample t-tests:

1. Independent samples test assuming equal variances
E.g., assume $\sigma_1^2 = \sigma_2^2$ {often called a pooled t-test}
2. Independent samples test assuming unequal variances
Also may obtain p-value for t-test using Excel: =ttest(array1,array2,#tails,type)

Handling Time: Process Factors

Among the different process factors, for which would you likely
use two independent samples (groups)?
Might any of these be tested using paired data, where
you have the same units (or customers) in both groups?
Region (A vs. B)
Language (E vs. F)
Staff Experience (Proficient vs. New Hire)
VIP Status (Yes=1 vs. No=0)
Document Not in Good Order/Rework (Yes=1, No=0)
Measurement System Study: Automated Collection System vs. Manual

Pooled t-test
(Independent two-sample test)
Pooled t-tests assume equal variances
(pooled variance $S_p^2$)
Ho: $\mu_1 = \mu_2$
Ha: $\mu_1 \neq \mu_2$
Test Statistic:
$$t = \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}$$
If p-value < α, conclude means are different (Reject Ho)
If p-value > α, conclude means are not different (Fail to reject Ho)
Understanding t-test statistic p-values
If two means are equal, what value for t-test statistic would we get?
Thus, p-values will be small under what conditions for t statistic?
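
A minimal sketch of the pooled t statistic, assuming only summary statistics are available. The inputs are the rounded Region A vs. B values from the Minitab output in this session. The function name `pooled_t` is a hypothetical helper, not a Minitab or library call.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, pooled_sd, df) for the equal-variance two-sample t-test."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    pooled_sd = math.sqrt(pooled_var)
    t_value = (mean1 - mean2) / (pooled_sd * math.sqrt(1 / n1 + 1 / n2))
    return t_value, pooled_sd, df


# Region A vs. Region B summary statistics (rounded, assumed inputs)
t_value, pooled_sd, df = pooled_t(11.52, 5.81, 164, 13.84, 4.60, 46)
print(f"t = {t_value:.2f}, Sp = {pooled_sd:.2f}, DF = {df}")
# prints: t = -2.50, Sp = 5.57, DF = 208
```

Equal means would give t = 0, and p-values shrink as |t| grows. The pooled standard deviation differs slightly from Minitab's 5.5732 because the inputs here are rounded.
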
Two Independent Samples t-test

(assuming unequal variances)
Independent Samples t-test (assuming unequal variances)
Similar to Pooled t-test, however, uses slightly different test statistic
$$t_{\text{unequal variances}} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}$$
Given our two variance test results, which test might you recommend?
(Note: Or, choose independently of F-test results based on
understanding of the process)
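
To see how much the choice matters for the Region data, the unequal-variance statistic can be sketched from the same rounded summary values. The degrees of freedom below use the Welch-Satterthwaite approximation, which is the standard companion to this statistic but is not shown on the slides; `welch_t` is a hypothetical helper name.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, df) for the unequal-variance (Welch) two-sample t-test."""
    v1, v2 = sd1**2 / n1, sd2**2 / n2  # variance of each group mean
    t_value = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t_value, df


# Region A vs. Region B summary statistics (rounded, assumed inputs)
t_value, df = welch_t(11.52, 5.81, 164, 13.84, 4.60, 46)
print(f"t = {t_value:.2f}, approx. DF = {df:.1f}")
```

With these inputs the statistic is about -2.84 on roughly 89 degrees of freedom, compared with -2.50 on 208 degrees of freedom for the pooled version.
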

Minitab: t-test
(Region A vs. B)
Ho: $\mu_1 = \mu_2$; Ha: $\mu_1 \neq \mu_2$
Check box if assume equal variances
Minitab: STAT >> Basic Statistics >> Two Sample t
t-test Results (Region A vs. B)

Assuming Equal Variances
Would you conclude a mean difference exists?
Assume α = 0.05
Two-Sample T-Test and CI: DHT, Region
Two-sample T for DHT
Region      N   Mean  StDev  SE Mean
Region A  164  11.52   5.81     0.45
Region B   46  13.84   4.60     0.68
Difference = (Region A) - (Region B)
Estimate for difference: -2.323
95% CI for difference: (-4.156, -0.490)
T-Test of difference = 0 (vs ≠): T-Value = -2.50 P-Value = 0.013 DF = 208
Both use Pooled StDev = 5.5732

Interval Plots
Another way to compare Means is an Interval Plot (Minitab Graphs)
What is the visual indicator of a significant difference?
Minitab Assistant: Two Sample t-test

Minitab Assistant also provides an interface with a more
comprehensive analysis for hypothesis testing
Suppose we wish to test if a difference exists by Language
Ho: $\mu_{\text{Language E}} = \mu_{\text{Language F}}$
Ha: Means Not Equal
[Figure: Minitab Assistant dialog. Callout: Will discuss in Power and Sample Size Planning]

Summary Report: Minitab Assistant
Provides similar information plus interpretations/graphical output
Conclusion: Fail to Reject Ho (i.e., not enough evidence to conclude means are different)
One-Sided (One-tail) Statistical Tests

For two-sample hypothesis tests, we typically test whether the
variances (or means) are different or NOT different
This is known as a Two-Sided Test
Then, if we conclude a difference, we may look at the two groups to
see which is larger (or smaller) (this is a quick approach)
Technically, to conclude one group is larger (or smaller) requires a One
Sided Test (Note: See Minitab Setting)
For example, suppose we wish to test if Group A has a larger mean
Note: Assess significance using p-values the same way as before
Ho: $\mu_A \leq \mu_B$
Ha: $\mu_A > \mu_B$
Alternative Ha is the difference of interest to test

Minitab Assistant:
One-Sided (one-tail) Test
Important: Alternative Ha is what you are interested in concluding!
Suppose we wish to test if New Hires take longer to complete review
Tests Ha that: Mean New Hire > Mean Proficient
Any concerns with this analysis?
Minitab Results: One-Sided Test
Assume α = 0.05
P-value = 0.005
Mean Difference Effect = 4.84 (= 16.64 - 11.80)
Conclude Mean for New Hires is significantly greater than
for Proficient Staff (though not many new hire samples)

C. t-tests with paired data
Paired data: Same parts (units) measured in both
groups so the samples are dependent
$\mu_D = \mu_1 - \mu_2$
Ho: $\mu_D = 0$; Ha: $\mu_D \neq 0$
Test Statistic:
$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$
Here, $\bar{d}$ is the average difference
$s_d$ is the std. dev. of the differences
$n$ is the number of paired samples
Minitab: STAT >> Basic Statistics >> Paired t
Paired t-test Example

Suppose you may measure the same units using two different methods:
Automated Collection System
Manual Data Collection
See Excel File for conditions and full data set
For paired data, compute the difference (+/-) for each pair
Diff (i) = Group A - Group B
i = observation (sample) #

MeasID  DHT-System  DHT-Manual  Difference
1       13.9        12           1.9
2       4.5         4            0.5
3       16.2        16           0.2
4       11          11           0
5       3.6         3            0.6
6       12          12           0
7       7.5         7            0.5
8       10.4        9            1.4
9       5.4         7           -1.6
10      4.4         4            0.4
11      10.9        11          -0.1
12      5.9         7           -1.1
13      8.5         8            0.5
14      8.3         8            0.3
15      5.5         5            0.5
See Excel File for Full Data Set (30)
Avg Diff    0.16
StDev Diff  0.96

Minitab Results
Paired t-test
Paired t-test result: P-value = 0.377
Does the measurement method matter?
Is this a good finding?
Assume α = 0.05
Paired T for DHT-System - DHT-Manual
             N   Mean  StDev  SE Mean
DHT-System  30  9.057  4.135    0.755
DHT-Manual  30  8.900  4.245    0.775
Difference  30  0.157  0.957    0.175
95% CI for mean difference: (-0.201, 0.514)
T-Test of mean difference = 0 (vs ≠ 0):
T-Value = 0.90 P-Value = 0.377
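
As a check on the output, the T-Value can be reproduced from the reported mean and standard deviation of the 30 differences using the paired test statistic, with the rounded values from the Difference row as inputs:

$$t = \frac{\bar{d}}{s_d / \sqrt{n}} = \frac{0.157}{0.957 / \sqrt{30}} \approx \frac{0.157}{0.175} \approx 0.90$$

The denominator, 0.175, is the SE Mean of the differences. The average difference is less than one standard error from zero, which is why the 95% CI includes 0 and the p-value is large.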
Paired vs. Unpaired Data

(Different Example)
Using independent samples t-test WITH Paired Data may result in
failure to detect a mean difference (Type II Beta Error by excluding
information about units)
Example: Suppose you give 15 properties to 2 different appraisers
and wish to test for mean effect

Two-sample T for Appraiser1 vs Appraiser2
             N    Mean  StDev  SE Mean
Appraiser1  15  107030  46903    12110
Appraiser2  15  110749  48763    12590
Estimate for difference: -3719
95% CI for difference: (-39503, 32066)
T-Test of difference = 0 (vs ≠): T-Value = -0.21 P-Value = 0.833 DF = 28
Both use Pooled StDev = 47842.1162

Paired T for Appraiser1 - Appraiser2
             N    Mean  StDev  SE Mean
Appraiser1  15  107030  46903    12110
Appraiser2  15  110749  48763    12590
Difference  15   -3719   5254     1356
95% CI for mean difference: (-6628, -809)
T-Test of mean difference = 0 (vs ≠ 0): T-Value = -2.74 P-Value = 0.016
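
The gap between the two analyses comes entirely from the denominator of the test statistic. The sketch below recomputes both T-Values from the summary statistics in the appraiser output. Variable names are illustrative, and the inputs are the rounded values shown in the output.

```python
import math

n = 15
mean_diff = 107030 - 110749  # Appraiser1 - Appraiser2 = -3719
sd1, sd2 = 46903, 48763      # standard deviation for each appraiser
sd_diff = 5254               # standard deviation of the paired differences

# Independent (pooled) analysis: noise includes property-to-property variation
pooled_sd = math.sqrt(((n - 1) * sd1**2 + (n - 1) * sd2**2) / (2 * n - 2))
t_independent = mean_diff / (pooled_sd * math.sqrt(1 / n + 1 / n))

# Paired analysis: noise is only the variation of within-property differences
t_paired = mean_diff / (sd_diff / math.sqrt(n))

print(f"independent t = {t_independent:.2f}, paired t = {t_paired:.2f}")
# prints: independent t = -0.21, paired t = -2.74
```

The mean difference is identical in both cases. Pairing removes the large variation between properties, so the same -3719 difference is judged against a much smaller standard error.
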

D. Two Proportion Tests

Compare yields or % defective rates from:
Two processes
Two departments
Two procedures
Two machines
May use for either of the following outputs:

Yields
Fraction defective
Two Proportion Hypothesis Test

Proportion, p = fraction defective (or yield)
Null Hypothesis: Ho: $p_1 = p_2$
Alternate Hypothesis: Ha: $p_1 \neq p_2$
Necessary Data for each group to test
Sample Size: n1 and n2 (test does not require equal sample sizes)
# defective: X1, X2 (or # not defective where p = yield)
Hence: p1 = X1/n1 ; p2 = X2/n2
Assumptions:
Two groups are independent and have sufficiently large sample for
each group to assume Normal Approximation to Binomial
Guideline: All combinations of ni*pi and ni*(1-pi) > 5
Note: If the data do not satisfy the Normality Assumption, Minitab gives a Warning
Again, use p-values to determine statistical significance

Example: Two Proportions
Suppose you wish to compare the number of mis-routed
documents between Two Types: Paper vs. Phone/Email
Type 1 (Phone/Email): 187 of 479 Misrouted (Defect Rate = 39%)
Type 2 (Paper): 214 of 1128 Misrouted (Defect Rate = 19%)
Suppose you wish to test if Type 1 has a significantly greater
defect rate
Ho: $p_1 \leq p_2$
Ha: $p_1 > p_2$
Proportion Test Summary Data

(Stat >> Basic Statistics >> 2 Proportions)
In Minitab, may use raw data or summary data (more common)
Sample    X     N  Sample p
1       187   479  0.390397
2       214  1128  0.189716
Ha: p1 > p2

Minitab Results
Does Type 1 have a significantly greater defect rate?
Assume α = 0.05
Test and CI for Two Proportions
Sample    X     N  Sample p
1       187   479  0.390397
2       214  1128  0.189716
Difference = p (1) - p (2)
Estimate for difference: 0.200680
95% lower bound for difference: 0.159293
Test for difference = 0 (vs > 0): Z = 8.50 P-Value = 0.000
Based on pooled estimate of proportion method in Minitab
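
The Z value in this output can be reproduced with a short sketch of the pooled two-proportion test, using only the Python standard library. The function name `two_proportion_z` is a hypothetical helper. The standard error formula is the usual pooled-estimate form, which is assumed here to match Minitab's pooled option.

```python
import math
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """One-sided (Ha: p1 > p2) two-proportion Z-test using the pooled estimate."""
    p1, p2 = x1 / n1, x2 / n2
    # Guideline check for the Normal approximation to the Binomial
    counts = [n1 * p1, n1 * (1 - p1), n2 * p2, n2 * (1 - p2)]
    if min(counts) <= 5:
        raise ValueError("sample too small for the Normal approximation")
    p_pooled = (x1 + x2) / (n1 + n2)
    se_pooled = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se_pooled
    p_value = 1 - NormalDist().cdf(z)  # upper-tail area for Ha: p1 > p2
    return p1 - p2, z, p_value


# Misrouted documents: Type 1 (Phone/Email) vs. Type 2 (Paper)
diff, z, p_value = two_proportion_z(187, 479, 214, 1128)
print(f"difference = {diff:.4f}, Z = {z:.2f}, p-value = {p_value:.3f}")
# prints: difference = 0.2007, Z = 8.50, p-value = 0.000
```

A p-value displayed as 0.000 means it is below 0.0005, not that it is exactly zero.
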
Summary
Hypothesis Tests for differences between two groups:
F-test: Test two variances
T-test: Test two means (equal variances, unequal variances, paired)
Two Proportion Test: Test two proportions
Different types of t-tests exist based on whether:
Samples are independent or dependent (e.g., Paired t-test uses same units
in each group)
Within group variances are assumed equal or unequal
Results may be affected by setup: alpha error, sample size, 1-sided vs. 2-sided
tests, and how representative the sample is of the population
Hypothesis tests (e.g., F-test, t-test, two proportion test) provide a tool
to assess if a difference is statistically significant
Ultimately, users must determine if statistically significant also
implies practically significant

Appendix:
Effect Size, ni, and Practical Significance
Particularly with large sample sizes*, mean differences that are
statistically significant may not be Practically Meaningful
Here, should consider and report the mean difference effect
Mean Difference Effect = Mean 1 - Mean 2 (or use Absolute Difference)
Advanced: Compute Cohen's d or Glass's Delta to help categorize the effect
$$\text{Cohen's } d = \frac{\mu_2 - \mu_1}{\sigma}, \qquad \text{Glass's } \Delta = \frac{\mu_2 - \mu_{\text{control}}}{\sigma_{\text{control}}}$$
For Cohen's d, estimate $\sigma$ using the Pooled Std Deviation
For Glass's Delta, must establish the control (baseline) group
*Defining very large sample size for a continuous variable would depend on the
standard deviation and test result implications though generally a sample size of 10,000
(or perhaps 1000) may be viewed as very large. Of course, regardless of sample size,
one could have statistically significant but not practically meaningful results
General Guidelines for Cohen's d
Cohen proposed the following cutoffs* for effect size
|d| < 0.2: Minimal or Near Zero Effect
|d| 0.2 to < 0.5: Small Effect
|d| 0.5 to < 0.8: Medium Effect
|d| >= 0.8: Large Effect
*Should be viewed only as guidelines for descriptive purposes
$$|\text{Cohen's } d| = \left| \frac{\mu_2 - \mu_1}{\sigma} \right|$$
Effect Category  Cohen |d|
Small            0.2
Medium           0.5
Large            0.8

Factor             Mean 1  Mean 2  Pooled S**  Mean Difference  |Cohen's d|  Effect Category
Region (A vs B)    11.52   13.84   5.57        -2.32            0.42         Small
Language (E vs F)  12.15   11.66   5.65         0.49            0.09         Minimal or Near Zero
In Order vs. NGO   10.3    18      4.65        -7.70            1.66         Large
** From Two Sample t-test assuming equal variances

Or, assess how practically meaningful the effect size is based on cost of poor quality,
defect rates, or revenue improvement (Subjective Management Decision)
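
The effect sizes and categories for the three factors can be recomputed with a small sketch. The function name `effect_category` is a hypothetical helper that encodes the cutoffs listed in this section; the means and pooled standard deviations are the rounded values from the factor table.

```python
def effect_category(d: float) -> str:
    """Map |Cohen's d| to the descriptive categories used in this section."""
    size = abs(d)
    if size < 0.2:
        return "Minimal or Near Zero"
    if size < 0.5:
        return "Small"
    if size < 0.8:
        return "Medium"
    return "Large"


# (factor, mean 1, mean 2, pooled std dev) taken from the factor table
factors = [
    ("Region (A vs B)", 11.52, 13.84, 5.57),
    ("Language (E vs F)", 12.15, 11.66, 5.65),
    ("In Order vs. NGO", 10.3, 18.0, 4.65),
]
for name, mean1, mean2, pooled_sd in factors:
    d = (mean2 - mean1) / pooled_sd
    print(f"{name}: |d| = {abs(d):.2f} -> {effect_category(d)}")
```

The Region effect is statistically significant (p = 0.013) yet falls in the Small category, which illustrates why effect size is reported alongside the p-value.
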

Estimating Average Effects (Delta: Δ)

In addition to testing for significance, may use mean hypothesis tests
to estimate Mean Difference Effects (Difference (Δ): Mean 1 - Mean 2)
Note: If no significant difference, assume Mean Δ = 0
Advanced: If multiple factors are significant, may use General Linear
Model or Multiple Regression Techniques to estimate effects of
individual factors after accounting for other factors (discuss later)
Suppose we wish to estimate the effect of rework (cases not in
good order or NGO=1)
Currently, 24% of case files are submitted NGO
Minitab Results: Mean Difference Effect

Here, for those Not in Good Order, the average effect on handling time
is ~8 min (Mean Difference). Although this only happens on 24% of
cases, it contributes to high variation and out-of-specification conditions
Significant with Mean Difference = -7.7
Note: Average for NGO (Mean = 18) is greater than USL (15 minutes)
