STATS 2 Part 2 Rev 2.0 With Exercise
Part 2
(Rev 2.0)
ST Restricted
Structure of the course
Module 4: Hypothesis testing
Module 4 objectives
Hypothesis testing
e.g., Test the claim that the population mean weight is 120 pounds
Correct: H0: μ = 3.  Incorrect: H0: X̄ = 3 (hypotheses are stated about population parameters, not sample statistics).
[Figure: sampling distribution of x̄ centered at μ = 50; a "threshold" on the right tail marks the Critical Value (for a given α).]
Adopting this criterion to define the Critical Value implies that we accept a risk that some large values of x̄ (larger than the critical value) will lead to an erroneous rejection of H0 (the yellow right tail of the distribution in the figure).
- The probability associated with this event (our risk) is α.
- Conversely, with probability (1 − α) we do not commit this error.
[Figure: two-sided case: sampling distribution of x̄ centered at μ = 50, with a Critical Value on each tail.]
H0: μ = 0,  H1: μ ≠ 0 (two-sided)
Only population parameter symbols are used in the hypothesis statement, never sample statistics.
Step 4
Assume H0 is true. The user defines α; in this case α = 0.05 (0.025 in each tail).
[Figure: sampling distribution of x̄ centered at µx̄ = 5, with 0.025 in each tail and the value 0.67 marked.]
[Figure: t distribution of x̄ centered at µx̄ = 5; the critical boundaries (x̄ − 5)/(s/√n) = ±2.045 separate the ACCEPT region from the two REJECT regions (0.025 in each tail); the observed value 0.67 falls in the accept region.]
3. Interpretation of results.
Practical Conclusion:
The process is centered; the machine can be released for production.
Perform the same analysis for the Y-Offset data. What is your conclusion?
Statistical Conclusion: Fail to reject H0.
Side-by-side comparison: Confidence Interval vs Hypothesis Testing on the same data
Assume H0 is true.
[Figure: two t distributions with dof = 29 centered at 0, one for the confidence-interval view and one for the hypothesis-test view.]
95% CI: (−1.27) − (2.045)(3.69/√30) to (−1.27) + (2.045)(3.69/√30), i.e. (−2.65, 0.11).
The interval represents the "yellow" area shown above; hence we know the Test Statistic is within the Acceptance Region.
Practical simulation to show that a Confidence Interval and a Hypothesis Test provide the same statistical result when performed at the same significance level α (simulation using sheet Pop 1 from the Central Limit Theorem.xls file, where the known population mean is 5.02).
- Confidence Interval performed at α = 0.05 (95% confidence level): on average, 4/100 of the intervals do not contain the population mean µ (5.02).
- Hypothesis Statement: H0: μ = 5.02, H1: μ ≠ 5.02. Hypothesis test performed at α = 0.05 (95% confidence level): on average, 4/100 of the samples give a P-value < 0.05.
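The simulation described above can be sketched in Python (a minimal sketch: the population mean 5.02 comes from the slide, while σ = 1, n = 30 and the critical value t(29, 0.975) = 2.045 are assumptions for illustration). It checks, sample by sample, that the 95% CI excludes the hypothesized mean exactly when the t statistic exceeds the critical value:

```python
import math
import random
import statistics

# Assumed parameters: MU_POP from the slide; sigma, N and T_CRIT illustrative
MU_POP = 5.02
SIGMA = 1.0
N = 30
T_CRIT = 2.045   # t(29, 0.975), the two-sided critical value for n = 30

random.seed(42)

trials = 100
agreements = 0
for _ in range(trials):
    sample = [random.gauss(MU_POP, SIGMA) for _ in range(N)]
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)
    # Confidence-interval view: does the 95% CI exclude the true mean?
    half_width = T_CRIT * s / math.sqrt(N)
    ci_excludes_mu0 = abs(xbar - MU_POP) > half_width
    # Hypothesis-test view: does |t| exceed the critical value?
    t_stat = (xbar - MU_POP) / (s / math.sqrt(N))
    test_rejects = abs(t_stat) > T_CRIT
    # The two criteria are algebraically identical, so they always agree
    agreements += (ci_excludes_mu0 == test_rejects)

print(agreements, "/", trials, "agreements")
```

The two decisions agree on every trial because "CI excludes μ0" and "|t| > t_crit" are the same inequality rewritten.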
Lowering α, the probability of type I error (with no change in the available data), increases β, the probability of type II error.
β increases:
o when the difference between the hypothesized parameter and its true value decreases
o when α decreases
o when σ increases
β decreases:
o when n increases
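The behaviour of β can be illustrated with a short sketch for a right-tailed z-test (hypothetical numbers; the closed form β = Φ(zα − (μ1 − μ0)√n/σ) assumes a normal population with known σ):

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def beta_right_tailed_z(mu0, mu1, sigma, n, z_alpha):
    """Type II error probability for a right-tailed z-test of H0: mu = mu0
    when the true mean is mu1 > mu0 (sketch: normal population, known sigma).
    H0 is not rejected when xbar <= mu0 + z_alpha * sigma / sqrt(n); under the
    true mean mu1, that event has probability below."""
    return phi(z_alpha - (mu1 - mu0) * math.sqrt(n) / sigma)

# Illustrative (assumed) numbers: mu0 = 50, true mean 52, z_alpha = 1.645
b_small_n     = beta_right_tailed_z(50, 52, 10, 25, 1.645)   # baseline
b_large_n     = beta_right_tailed_z(50, 52, 10, 100, 1.645)  # larger n
b_large_sigma = beta_right_tailed_z(50, 52, 20, 25, 1.645)   # larger sigma
b_small_gap   = beta_right_tailed_z(50, 51, 10, 25, 1.645)   # smaller gap
```

Comparing the four values reproduces the bullets above: β falls when n grows, and rises when σ grows or when the true gap shrinks.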
Test on the mean: σ² known / σ² unknown.
DECISION RULE:
Test statistic > Critical value (at level α) → REJECT H0
Test statistic ≤ Critical value (at level α) → DO NOT REJECT H0
Test Statistic and Critical Value: EXAMPLE

"A phone industry manager thinks that customer monthly cell phone bills have increased, and now average over $52 per month. The company wishes to test this claim." (Assume σ = 10 is known.)

HYPOTHESIS TESTING PROCEDURE (method: critical value and rejection region):
1. Define the HYPOTHESES to test.
2. Draw a SAMPLE from the population.
3. Calculate the TEST STATISTIC using sample data.
4. Define the LEVEL OF SIGNIFICANCE (α).
5. Find the CRITICAL VALUE and the REJECTION REGION.
6. Make your DECISION (reject H0 or not).

1. Hypotheses formulation: H0: μ ≤ 52 (the average is not over $52 per month); H1: μ > 52 (the average is greater than $52 per month).
2. Sample extraction: the following results are obtained: n = 64, x̄ = 53.1, and it is known that σ = 10.
3. Test statistic calculation: z = (x̄ − μ0)/(σ/√n) = (53.1 − 52)/(10/√64) = 0.88.
4. Level of significance: α = 0.10.
5. Critical value and rejection region: z(α=0.1) = 1.28 (from statistical tables). Rejection region: z > 1.28.
6. Decision: do not reject H0 at the significance level α = 0.1, since z = 0.88 < 1.28 (i.e., at the 10% level there is not sufficient evidence that the mean bill is over $52).
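The six steps of the phone-bill example can be reproduced in a short sketch (the normal CDF is computed with math.erf instead of the statistical tables used in the slide):

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Numbers from the example: n = 64, xbar = 53.1, mu0 = 52, sigma = 10, alpha = 0.10
n, xbar, mu0, sigma = 64, 53.1, 52.0, 10.0
z = (xbar - mu0) / (sigma / math.sqrt(n))   # test statistic
z_crit = 1.28                               # z(alpha = 0.10) from the tables
p_value = 1.0 - phi(z)                      # right-tailed test
decision = "reject H0" if z > z_crit else "do not reject H0"
print(round(z, 2), round(p_value, 4), decision)
```

The sketch reproduces z = 0.88 and the p-value of about 0.1894 quoted later in the slides.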
The Test Statistic has not fallen into the Rejection Region → Do not reject H0
DECISION RULE:
P-value < α → REJECT H0
P-value ≥ α → DO NOT REJECT H0
[Figure: standard normal curve marking Zα (Critical Value) and Z (Test Statistic).]
6. Conclusions: Do not reject H0 at the significance level α = 0.1, since p-value = 0.1894 > α = 0.10.
(*) Calculations are carried out by statistical software. For more details, see also the Manual of Statistical Methodology, ANNEX 6 (8482919 ver. 2).
[Figure: standard normal curve with α = 0.1; the "Do not reject H0" region lies left of Zα = 1.28 (Critical Value) and the "Reject H0" region to its right; the Test Statistic Z = 0.88 falls in the do-not-reject region.]
• The “test statistic” is considered large if: |Test Statistic| > Critical Value (from tables).
• The “p-value” is considered large if: P-value > Significance Level (α).
• According to the formulation of H1 (one- or two-sided test), the comparison between the test statistic and the critical value is carried out according to the following rules:
o One-sided (left) test: reject H0 if test statistic < critical value
o One-sided (right) test: reject H0 if test statistic > critical value
o Two-sided test: reject H0 if test statistic < critical value 1 OR test statistic > critical value 2
• The comparison between the significance level (α) and the p-value is carried out according to the following rule: reject H0 if p-value < significance level (α).
TEST FOR THE MEAN (variance known)

The Test Statistic: z = (x̄ − μ0)/(σ/√n) is a value of the standard normal distribution.

HYPOTHESES: reject H0 if P-value < α, or for test statistics with the condition stated below:
H0: μ = μ0 (or H0: μ ≥ μ0), H1: μ < μ0 → z < −zα
H0: μ = μ0 (or H0: μ ≤ μ0), H1: μ > μ0 → z > zα
H0: μ = μ0, H1: μ ≠ μ0 → z < −zα/2 OR z > zα/2
TEST FOR THE MEAN (variance unknown)

The Test Statistic: t = (x̄ − μ0)/(s/√n) is a value of the t distribution with (n − 1) DF.

HYPOTHESES: reject H0 if P-value < α, or for test statistics with the condition stated below:
H0: μ = μ0 (or H0: μ ≥ μ0), H1: μ < μ0 → t = (x̄ − μ0)/(s/√n) < −t(n−1, α)
H0: μ = μ0 (or H0: μ ≤ μ0), H1: μ > μ0 → t = (x̄ − μ0)/(s/√n) > t(n−1, α)
H0: μ = μ0, H1: μ ≠ μ0 → |t| = |x̄ − μ0|/(s/√n) > t(n−1, α/2), i.e. t > t(n−1, α/2) OR t < −t(n−1, α/2)
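A minimal sketch of the t statistic and the two-sided rule (the sample numbers are illustrative assumptions, and the critical value t(29, 0.975) = 2.045 is taken from the slides; in practice a statistical package supplies the critical value or the p-value):

```python
import math

def t_statistic(xbar, s, n, mu0):
    """Test statistic for the mean with unknown variance:
    t = (xbar - mu0) / (s / sqrt(n)), with n - 1 degrees of freedom."""
    return (xbar - mu0) / (s / math.sqrt(n))

def decide_two_sided(t, t_crit):
    """Two-sided rule from the table above: reject H0 if |t| > t(n-1, alpha/2)."""
    return "reject H0" if abs(t) > t_crit else "do not reject H0"

# Hypothetical sample: xbar = 5.2, s = 1.5, n = 30, testing H0: mu = 5
t = t_statistic(xbar=5.2, s=1.5, n=30, mu0=5.0)
decision = decide_two_sided(t, t_crit=2.045)   # t(29, 0.975)
```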
TEST FOR THE VARIANCE

The Test Statistic: χ²(n−1) = (n − 1)s²/σ0² is a value of the χ² (Chi-squared) distribution with n − 1 d.f.

HYPOTHESES: reject H0 if P-value < α, or graphically for test statistics with the condition stated below:
H0: σ² = σ0² (or σ² ≥ σ0²), H1: σ² < σ0² → χ²(n−1) < χ²(n−1, 1−α) (reject region: the left tail of area α)
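A sketch of the left-tailed variance test (the numbers are hypothetical; the lower critical value χ²(19, 0.05) ≈ 10.117 is taken from standard tables):

```python
def chi2_statistic(s2, n, sigma0_sq):
    """Chi-squared test statistic for the variance: (n - 1) * s^2 / sigma0^2,
    with n - 1 degrees of freedom."""
    return (n - 1) * s2 / sigma0_sq

def decide_left_tailed(chi2, chi2_crit_low):
    """Left-tailed rule from the table above: reject H0 (sigma^2 >= sigma0^2)
    when the statistic falls below chi2(n-1, 1-alpha)."""
    return "reject H0" if chi2 < chi2_crit_low else "do not reject H0"

# Hypothetical sample: s^2 = 0.5 on n = 20 observations, testing sigma0^2 = 1
chi2 = chi2_statistic(s2=0.5, n=20, sigma0_sq=1.0)
decision = decide_left_tailed(chi2, chi2_crit_low=10.117)  # chi2(19, 0.05)
```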
Since the minimum P-value is less than the significance level (0.05), we REJECT the NULL HYPOTHESIS.
Practical Conclusion:
The process variance is statistically not equal to 1.
TEST FOR THE PROPORTION

The Test Statistic: z = (p̂ − p0)/√(p0(1 − p0)/n) is a value of the standard normal distribution.

ASSUMPTION: the binomial distribution can be approximated by a normal distribution.
Rule of thumb → the normal approximation holds when np(1 − p) > 9.

HYPOTHESES: reject H0 if P-value < α, or for test statistics with the condition stated below:
H0: p = p0 (or p ≥ p0), H1: p < p0 → z < −zα
H0: p = p0 (or p ≤ p0), H1: p > p0 → z > zα
H0: p = p0, H1: p ≠ p0 → z < −zα/2 OR z > zα/2
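The proportion statistic, with the rule-of-thumb check, can be sketched as follows (p̂, p0 and n are hypothetical):

```python
import math

def proportion_z(p_hat, p0, n):
    """Test statistic for a proportion: z = (p_hat - p0) / sqrt(p0 (1 - p0) / n).
    Valid only when the normal approximation holds (rule of thumb: n p (1-p) > 9)."""
    assert n * p0 * (1 - p0) > 9, "normal approximation rule of thumb not met"
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Hypothetical sample: 60 conforming out of 100, testing H0: p = 0.5
z = proportion_z(p_hat=0.6, p0=0.5, n=100)
```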
Since the P-value is greater than the significance level (0.05), we FAIL TO REJECT the NULL HYPOTHESIS.
Practical Conclusion:
The conformity proportion is equal to 0.8.
TEST THE DIFFERENCE OF MEANS OF TWO DEPENDENT (OR "PAIRED") SAMPLES

[Decision tree: two normal populations → dependent vs independent samples; for independent samples, variances known vs unknown; if unknown, variances assumed equal vs assumed unequal.]

The Test Statistic: t = (d̄ − d0)/(sd/√n) is a value of the t distribution with (n − 1) degrees of freedom.

H0: μx − μy = d0, H1: μx − μy ≠ d0 → reject H0 if t > t(n−1, α/2) OR t < −t(n−1, α/2)
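A sketch of the paired-samples statistic (the before/after data are invented for illustration):

```python
import math
import statistics

def paired_t(x, y, d0=0.0):
    """Paired-samples t statistic: t = (dbar - d0) / (s_d / sqrt(n)),
    where d are the pairwise differences; df = n - 1."""
    d = [xi - yi for xi, yi in zip(x, y)]
    n = len(d)
    dbar = statistics.mean(d)
    s_d = statistics.stdev(d)
    return (dbar - d0) / (s_d / math.sqrt(n)), n - 1

# Hypothetical before/after measurements on the same 5 units
before = [10.0, 12.0, 11.0, 13.0, 12.0]
after = [9.0, 10.0, 8.0, 11.0, 10.0]
t, df = paired_t(before, after)
```

Note that the pairing matters: the statistic is built from the n differences, not from the two samples treated independently.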
Conclusion:
Since the P-value is less than the significance level (0.05), we REJECT the NULL HYPOTHESIS: the mean difference is not equal to 0.
Test the difference of means of two independent samples from two normal populations (variances KNOWN / variances UNKNOWN).
Go to Add-Ins > Statistical Calculators > Hypothesis Test for Two Means
Conclusion:
Since the P-value is less than the significance level (0.05), we REJECT the NULL HYPOTHESIS: the mean difference is not equal to 0.
(*) If a historical standard deviation from a larger sample is available, use it instead of the one estimated from these 30 samples.
Since the P-value > significance level (0.05), we FAIL TO REJECT the NULL HYPOTHESIS: the variances can be considered equal.
Since the P-value < significance level (0.05), we REJECT the NULL HYPOTHESIS: the variances are not equal.
TEST THE DIFFERENCE OF TWO PROPORTIONS

Two large independent random samples of sizes nx and ny are drawn. The normal approximation holds (still the rule of thumb: np(1 − p) > 9).

H0: px − py = 0 (or ≥ 0), H1: px − py < 0 → z < −zα
H0: px − py = 0 (or ≤ 0), H1: px − py > 0 → z > zα
H0: px − py = 0, H1: px − py ≠ 0 → z < −zα/2 OR z > zα/2

where p̂0 is a weighted estimate of the common proportion (under H0).
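The statistic with the weighted (pooled) estimate p̂0 can be sketched as follows (the counts are hypothetical):

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: px - py = 0, using the weighted (pooled)
    estimate p0_hat of the common proportion under H0."""
    p1, p2 = x1 / n1, x2 / n2
    p0_hat = (x1 + x2) / (n1 + n2)   # weighted estimate under H0
    se = math.sqrt(p0_hat * (1 - p0_hat) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical counts: 40/100 successes vs 30/100 successes
z = two_proportion_z(x1=40, n1=100, x2=30, n2=100)
```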
[The rejection condition depends on the direction of the test: left-tailed, right-tailed, or two-sided.]
TEST FOR THE CORRELATION COEFFICIENT

[Decision tree: two normal populations → difference of 2 means (dependent or independent samples; variances known/unknown, assumed equal/unequal), ratio of 2 variances, difference of 2 proportions, correlation coefficient.]

The TEST STATISTIC: t = r√(n − 2)/√(1 − r²) is a value of the t distribution with (n − 2) d.f.
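The correlation statistic in a one-function sketch (r and n are illustrative assumptions):

```python
import math

def correlation_t(r, n):
    """Test statistic for H0: rho = 0:
    t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Hypothetical sample correlation r = 0.5 on n = 27 observations
t = correlation_t(r=0.5, n=27)
```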
One variable:
For a given parameter (e.g. "Ball Diameter"), test the alignment between k machines (1 independent variable = MACHINE, k levels = the k MACHINE_IDs).
(*) The F-test tells us whether at least one mean is different from at least one other. It does not tell us which mean is different from which; to know this, we can use "multiple comparison methods", which identify groups of homogeneous means.
The dots (∙) substitute the indices used for averaging. For example, in ȳi∙ the dot replaces the index j (columns) to permit the calculation of row averages; or it replaces both indices i and j, as in ȳ∙∙, to indicate the grand average.
NOTE
For simplicity, in this example the replications are "balanced", i.e. the same number of measurements (n) is taken from each machine. This is not a necessary condition for data analysis: unbalanced cases can be analyzed as well.
[Figure: k normal populations with common standard deviation σ and means μ1, μ2, …, μk; H1: ∃ i: μi ≠ μ.]
F test to compare K means (1 variable)
Two sources of variability generate two different types of useful information to test the hypotheses on the
equality of means:
NOTE:
Statistics helps us fix the relative concepts of "large" and "small" once a level of significance has been established.
F(calc) = ?
Advanced Explanation
Numerically
To calculate the test statistic F, we first decompose SST, the total deviance of the observations (SST stands for "Total Sum of Squares"), into the following components:

SST = SSX + SSE

where:
- SSX is the deviance of the sample averages (i.e. the variability "between" the samples, due to the differences between the levels of the variable X, e.g. the machines), and
- SSE is the sum of the deviances "within" the samples, the inherent process variability.

With machines i = 1, 2, ⋯, k, replications j = 1, 2, ⋯, n, sample averages ȳi∙ and grand average ȳ∙∙:

Σi Σj (yij − ȳ∙∙)² = n Σi (ȳi∙ − ȳ∙∙)² + Σi Σj (yij − ȳi∙)²

Since var(X) = Dev(X)/DF, in our case:

Deviance    Degrees of Freedom (DF)
SST         kn − 1
SSX         k − 1
SSE         k(n − 1)

So, to get the variances (also called Mean Squares, MS), we simply divide the deviances (Sums of Squares) by the corresponding DF:

Deviance (SS)   Degrees of Freedom (DF)   Variance (MS)
SSX             k − 1                     MSX = SSX/(k − 1)
SSE             k(n − 1)                  MSE = SSE/(k(n − 1))
SST             kn − 1                    MST = SST/(kn − 1), not used
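The decomposition and the F ratio can be sketched for a small balanced example (two hypothetical machines, three replications each):

```python
import statistics

def one_way_anova(groups):
    """Decompose SST = SSX + SSE for k balanced groups of n replications each,
    then form F = MSX / MSE (a sketch of the computation described above)."""
    k = len(groups)
    n = len(groups[0])
    all_y = [y for g in groups for y in g]
    grand = statistics.mean(all_y)
    means = [statistics.mean(g) for g in groups]
    ssx = n * sum((m - grand) ** 2 for m in means)                     # "between"
    sse = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)  # "within"
    sst = sum((y - grand) ** 2 for y in all_y)                         # total
    msx = ssx / (k - 1)
    mse = sse / (k * (n - 1))
    return sst, ssx, sse, msx / mse

# Two hypothetical machines, three replications each
sst, ssx, sse, f = one_way_anova([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])
```

Running this confirms the identity SST = SSX + SSE on real numbers, not just symbolically.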
Advanced Explanation
One-factor (or "one-way") ANOVA table:

Source of variability   Deviance (SS)   DF          Variance (MS)          Test statistic   P-value
Variable X (machines)   SSX             k − 1       MSX = SSX/(k − 1)      FX = MSX/MSE     p-valueX
Error                   SSE             k(n − 1)    MSE = SSE/(k(n − 1))
Total                   SST             kn − 1

NOTE:
Be careful not to be misled. This procedure is called analysis of variance (ANOVA) since it uses the ratio between two variances. However, the variances are not the object of our investigation (see the hypotheses); we just "use" them to study the population means!
Notation (two variables, X1 with k levels and X2 with h levels, n replications):
- yijr: the r-th measurement (r = 1, 2, ⋯, n) obtained when X1 and X2 are set on their i-th and j-th levels respectively.
- ȳij∙: average of the measurements obtained at the (i, j) combination of levels.
- ȳ∙j∙: average of the measurements obtained when X2 is set on its j-th level.
- ȳ∙∙∙: grand average (average of all khn measurements).

The effect of an interaction between 2 variables X1 and X2 is significant when, for some combinations (or settings) of X1 and X2, the value of the response variable is significantly higher (or lower) than what we might expect considering X1 and X2 independently. This is called a "multiplicative" effect of X1 and X2. If X1 and X2 are independent, their effect is said to be only "additive".
Advanced Explanation
To test the effects of X1, X2 and their interaction, X1*X2, we decompose SST, the total sum of squares (or deviance), into the following components (i = 1, …, k; j = 1, …, h; r = 1, …, n):

Σi Σj Σr (yijr − ȳ∙∙∙)² = nh Σi (ȳi∙∙ − ȳ∙∙∙)² + nk Σj (ȳ∙j∙ − ȳ∙∙∙)² + n Σi Σj (ȳij∙ − ȳi∙∙ − ȳ∙j∙ + ȳ∙∙∙)² + Σi Σj Σr (yijr − ȳij∙)²
Advanced Explanation
To test the equality of means for X1, X2 and their interaction, three test statistics are calculated:
Advanced Explanation
And finally, the ANOVA table can be created.
Machine B is different from Machines A and C; Machines A and C show no statistically significant difference.
Module 4 Key Learnings
Annex: Overview of outlier detection methods
Annex 2 objectives
At the end of this chapter, you will be able to:
Introduction
Outlier detection
As pointed out in the Manual of Statistical Methodology (8482919 ver.2), Chapter 7, great
importance resides in the adoption of effective methods to detect outliers. The quality of the
results of statistical analyses performed on contaminated data is heavily affected by the
presence of outliers in the dataset. As an example, consider two important statistical
applications which are heavily affected by the presence of outliers: Regression Analysis (with
OLS method) and Control Charts for process monitoring.
Moreover, from “Outlier identification in high dimensions” (2006), P. Filzmoser, R. Maronna, and M. Werner:
“Accurate identification of outliers plays an important role in statistical analysis. If classical statistical models
are blindly applied to data containing outliers, the results can be misleading at best. In addition, outliers
themselves are often the special points of interest in many practical situations and their identification is the
main purpose of the investigation. Classical tools based on the mean and covariance matrix are rarely able
to detect all the multivariate outliers in a given sample due to the masking effect (Becker and Gather, 1999),
with the consequence that methods based on classical measures are unsuitable for general use unless it is
certain that outliers are not present. Contaminated data are commonly found in several situations, and so
robust methods that identify or downweight outliers are essential tools for statisticians”.
Methods for outlier detection
Several methods have been developed to detect outliers.
A first classification level separates univariate from multivariate methods. While most surveys collect multivariate data, univariate outlier detection methods are usually preferred for their simplicity; but these methods fail to detect observations that violate the correlational structure of the dataset.
[Figure: scatterplot of bivariate data with a point labeled OUTLIER that violates the correlational structure.]
Methods for outlier detection
Yet, the methods for outlier detection can be divided into different groups according to the statistical procedure/approach adopted:
• Distribution-based methods.
• Distance-based methods.
• Density-based methods.
• Methods based on clustering.
Methods for outlier detection
Distribution-based methods
These methods assume a known distribution of the data and test whether the target extreme value is an outlier of that distribution, i.e., whether or not it deviates from the assumed distribution. Examples of this group of methods are the Dixon and Grubbs tests. In real-world data it is often not easy to fulfill the distributional requirements, which limits their use.
Distance-based methods
Several outlier detection methods use some measure of distance to evaluate how far away an observation is from the centre of the data. The sample mean and variance may be used to measure this distance, but since they are not robust to outliers they can mask the very observations we seek to detect. In other terms, a method which is not robust, i.e. which is itself affected by the outliers, is of little (if any) help in detecting them. To avoid this masking effect, variability and location estimators need to be "robustified", that is, made less sensitive to outliers. It is for this reason that many outlier detection methods use order statistics, such as the median or quartiles.
Methods for robustification of the estimators include, among others, the Minimum Covariance Determinant (MCD) due to Rousseeuw (the MCD estimator is determined by the subset of points of size h which minimizes the determinant of the variance-covariance matrix over all subsets of size h).
In univariate statistics, distance-based methods provide interesting results and are often preferred for their relative simplicity. However, in high-dimensional space the notion of outlier based on distance may become meaningless.
Methods for outlier detection
Density-based methods
These methods assign to each object a degree of being an outlier, called the Local Outlier Factor (LOF) of the object. It is "local" because the considered property is the density of objects in the neighborhood surrounding the object itself.
Methods based on clustering
Clustering is a basic method to detect outliers. From the viewpoint of a clustering algorithm, potential outliers are data which are not located in any cluster. Furthermore, if a cluster significantly differs from the other clusters, the objects in that cluster might be outliers.
Graphically: [Figure: two groups of points, CLUSTER A and CLUSTER B, with isolated points lying outside both.]
Methods for univariate outlier detection
The methods listed below are based on "distance considerations" and are generally considered robust in the case of non-normal data (→ they do not require the normality assumption). The idea of "distance" means that an observed value is defined as an outlier if its distance from what is considered the centre of the distribution is greater than a cut-off value.
Methods for outlier detection
In a simulation study within the STATS Program, these methods were tested on a representative number of FE SPC variables with real data (results available).
The conclusions of the study about the most pertinent methods are summarized as follows:
➢ The MADe and MD methods provide equivalent and pertinent results on both production and monitor data.
➢ The Box Plot method is generally aligned with MD and MADe, but in several cases the proposed limits are not well adapted to the distribution. This method provides good results when employed on contaminated data.
➢ The Adjusted Box Plot (with Johnson Fit or Bootstrap methods) does not provide correct limits.
Methods for outlier detection
The MADe Method
The MADe method, using the Median and the Median Absolute Deviation (MAD), is one of the
basic robust methods which are largely unaffected by the presence of outliers in the dataset.
This approach is similar to the SD method. However, here the median and MADe are employed
instead of the mean and the standard deviation.
The MADe method is defined as follows:
RULE: An observation is considered outlier if its value is outside the interval:
MED ± 3 MADe
MAD is an estimator of data variability. It is similar to the standard deviation and, like the median, has an approximately 50% breakdown point.
When the MAD value is scaled by a factor of 1.483, it is similar to the standard deviation in a
normal distribution. This scaled MAD value is the MADe.
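The MADe rule can be sketched in a few lines (the six-point dataset is invented for illustration):

```python
import statistics

def made_outliers(data, k=3.0):
    """MADe method: flag x as an outlier if it lies outside MED +/- k * MADe,
    where MAD = median(|x - MED|) and MADe = 1.483 * MAD (the scaling that
    makes MAD comparable to the standard deviation under normality)."""
    med = statistics.median(data)
    mad = statistics.median([abs(x - med) for x in data])
    made = 1.483 * mad
    lo, hi = med - k * made, med + k * made
    return [x for x in data if x < lo or x > hi]

# Hypothetical data with one gross outlier
outliers = made_outliers([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
```

Because the median and MAD are barely moved by the extreme point, the rule isolates it cleanly, whereas mean-and-SD limits would be dragged toward it.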
Key Learnings
Now you know:
• The importance of adopting effective filters for outliers in
every statistical analysis.
• That several methods and approaches exist to detect outliers.
• How to detect outliers at univariate level (using distance-
based methods).
Conclusion
Post-test
• Complete the post-test to the best of your knowledge
10-15 minutes
Customer satisfaction
CONGRATULATIONS!!
File Revision