Outliers and Robustness

Outliers
All univariate samples have extremes: the sample minimum value $x_{(1)}$ and the sample maximum value $x_{(n)}$, which
are observations of the extreme order statistics $X_{(1)}$ and $X_{(n)}$. Such extremes feature widely in the study of
environmental issues.
The variable of natural interest is often an extreme: the hottest temperature in the year, or the highest ozone level.
Even if we are interested in the whole range of values of an environmental variable, for example when monitoring
pollution levels in an urban area, the extremes will still be important. In particular, if the highest pollution level
is especially high, it might indicate violation of some environmental standard.
Thus we might be concerned with situations where the extremes are not only the smallest and the largest
observations but where one or both of them is extremely extreme, apparently inconsistent with the remainder of
the sample. Such an observation is known as an outlier.
For multivariate data the concepts of order, extremeness and outlying behavior are much less readily defined, although
the interest and stimulus of such notions may be no less relevant or important.
Outlier Aims and Objectives
Outliers are commonly encountered in environmental studies and it is important to be able to answer fundamental
questions such as the following.

What does the outlier imply about the generating mechanism of the data and about a reasonable model for
it? Is there some mixing of distributions, some contamination of the source?

How can we formally define and examine possible contamination?

What should we do about outliers or contaminants?

Three possible answers to the last question are that we might

reject them

identify them for special consideration

accommodate them by means of robust statistical inference procedures which are relatively free from the
influence of the outliers - in order to study the principal (outlier-free) component of the data-generating
mechanism.

We think of outliers as extreme observations which, by the extent of their extremeness, lead us to question whether
they really have arisen from the same distribution as the rest of the data (i.e., from that of the basic random variable
$X$). The alternative prospect, of course, is that the sample is contaminated, that is, that it contains observations
from some other source.

Regarding the relationships between extremes, outliers and contaminants, note the following observations.

An outlier is an extreme, but an extreme need not be an outlier.

An outlier may or may not be a contaminant (depending on the form of contamination). For example, upward
slippage of the mean might be manifest in the appearance of an upper outlier, since the larger mean will
promote the prospect of larger sample values.

A contaminant may or may not be an outlier. It could reside in the midst of the data mass and would not
show up as an outlier.

Such varying prospects are illustrated in Figure 1, where $F$ is the main model for the bulk of the data and $G$ is the
contaminating distribution.

Contamination can of course take many forms. It may be a reading or recording error, in which case rejection
(or correction) might be the only possibility (supported perhaps by a test of discordancy). Alternatively, it might
reflect low-incidence mixing of $X$ with another random variable $Y$ whose source and manifestation are uninteresting.
If so, a robust approach which draws inference about the distribution of $X$ while accommodating $Y$ in an
uninfluential way might be required. Then again, the contaminant may reflect an unexpected prospect, and we would
be interested in identifying its origin and probabilistic characteristics if possible.
Outlier Generating Models
Accommodation, identification and rejection, the three approaches to outlier handling, must clearly be set in terms of
some model $F$ for the basic distribution of $X$. This is necessary whether we are examining univariate data, time
series, generalized linear model outcomes or multivariate observations, and whatever our outlier interest within the
wide range of methods now available. But if we reject $F$, what form of model might be relevant to explain any
contaminants?
Discordancy and Models for Outlier Generation
Any test for discordancy is a test of an initial (basic, null) hypothesis $H: F$, which declares that the sample arises
from a distribution $F$. A significant result implies rejection of $H$ in favor of an alternative hypothesis $\bar{H}$, which
explains the presence of a contaminant likely to be manifest as an outlier.
Let us consider the form of a test of discordancy for a single upper outlier. On the null hypothesis, we have a
random sample from some distribution $F$, and we need to examine whether the upper outlier, $x_{(n)}$, is a reasonable
value for an upper extreme from $F$ or whether, on the contrary, it is statistically unreasonably large.

Thus we would need to evaluate $P(X_{(n)} > x_{(n)})$ when the data arise from the model $F$. If we were to find that

$$P(X_{(n)} > x_{(n)}) < \alpha,$$

we would conclude that the upper outlier is discordant at level $\alpha$ and might thus be indicative
of contamination.
Example:
The distribution of $X_{(n)}$ is easily determined. Suppose that $X$ has distribution function $F(x)$ and $X_{(n)}$ has
distribution function $F_n(x)$. The condition $X_{(n)} \le x$ implies that we need all $X_i \le x$, $i = 1, 2, \ldots, n$.
So we immediately conclude that

$$F_n(x) = \{P(X \le x)\}^n = \{F(x)\}^n.$$

So, if $X$ has an exponential distribution with parameter $\lambda$, then

$$F_n(x) = (1 - e^{-\lambda x})^n,$$

which is quite a complex form.
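
The result $F_n(x) = \{F(x)\}^n$ gives the tail probability needed for an upper-outlier test directly, since $P(X_{(n)} > x) = 1 - \{F(x)\}^n$. The following is a minimal sketch for the exponential case, assuming the rate parameter is known; the function names and the default level 0.05 are illustrative choices, not part of the text.

```python
import math


def upper_outlier_tail_prob(x_max: float, n: int, rate: float) -> float:
    """P(X_(n) > x_max) for n independent exponential(rate) observations."""
    if n < 1 or rate <= 0 or x_max < 0:
        raise ValueError("need n >= 1, rate > 0 and x_max >= 0")
    cdf = 1.0 - math.exp(-rate * x_max)  # F(x) for the exponential model
    return 1.0 - cdf ** n                # 1 - {F(x)}^n


def is_discordant_upper(sample, rate: float, alpha: float = 0.05) -> bool:
    """True if the sample maximum is discordant at level alpha (rate assumed known)."""
    return upper_outlier_tail_prob(max(sample), len(sample), rate) < alpha
```

For example, `is_discordant_upper([0.2, 0.5, 1.1, 0.7, 12.0], rate=1.0)` flags the maximum, whereas replacing 12.0 by 2.0 does not.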

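In contrast, the minimum $x_{(1)}$ arises from a distribution with distribution function $F_1(x)$, where

$$F_1(x) = 1 - \{P(X > x)\}^n = 1 - \{1 - (1 - e^{-\lambda x})\}^n = 1 - e^{-n\lambda x},$$

so that $X_{(1)}$ is also seen to have an exponential distribution, but with parameter $n\lambda$.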
Thus a test of discordancy for a lower outlier in an exponential sample is particularly straightforward in form.
Suppose $\lambda$ is known. Then we would conclude that a lower outlier $x_{(1)}$ is discordant at level $\alpha$ if

$$1 - \exp(-n\lambda x_{(1)}) < \alpha,$$

that is, if

$$x_{(1)} < -\frac{1}{n\lambda} \ln(1 - \alpha).$$
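
The lower-outlier rule can be sketched in a few lines. This assumes, as in the text, that the exponential parameter is known; the function names and the default level are hypothetical.

```python
import math


def lower_outlier_critical_value(n: int, rate: float, alpha: float = 0.05) -> float:
    """Threshold c such that x_(1) < c is discordant at level alpha.

    Uses the fact that the minimum of n exponential(rate) values is
    exponential with parameter n * rate.
    """
    if n < 1 or rate <= 0 or not 0 < alpha < 1:
        raise ValueError("need n >= 1, rate > 0 and 0 < alpha < 1")
    return -math.log(1.0 - alpha) / (n * rate)


def is_discordant_lower(sample, rate: float, alpha: float = 0.05) -> bool:
    """True if the sample minimum falls below the critical value."""
    return min(sample) < lower_outlier_critical_value(len(sample), rate, alpha)
```

Because the critical value shrinks like $1/n$, a minimum that looks unremarkable in a small sample can fall above the threshold while an equally small value in a large sample does not.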

Specification of the alternative model is important for comparing different tests of discordancy, for calculating
performance characteristics such as power, or if we are interested in the identification aspect of outlier study. We
need to consider some of the possible forms which might be adopted for the alternative model. We continue to consider
the case of a single upper outlier. The most obvious possibilities are the following.
1) Deterministic Alternative: Suppose that $x_i$ is known to be spurious. Then we would have

$$H: x_j \in F \quad (j = 1, 2, \ldots, n), \qquad \bar{H}: x_j \in F \quad (j \ne i),$$

where $\bar{H}$ is the alternative hypothesis. This just says that $x_i$ does not arise from $F$.

2) Inherent Alternative: Here we would declare

$$H: x_j \in F \quad (j = 1, 2, \ldots, n), \qquad \bar{H}: x_j \in G \ne F \quad (j = 1, 2, \ldots, n),$$

under which scheme the outlier triggers rejection of model $F$ for the whole sample in favor of another model
$G$ for the whole sample (e.g. lognormal, not normal). Whilst the outlier may be the stimulus for contrasting $F$
and $G$, the test of discordancy will not, of course, be the only basis for such comparison.
Between these extreme prospects of an obvious specified contaminant and a replacement homogeneous model are
intermediate possibilities more specifically prompted by the concepts of contamination and of outlying values.
3) Mixture Alternative: Here we contemplate the possibility (under the alternative hypothesis $\bar{H}$) that the sample
contains a (small) proportion $p$ of observations from a distribution $G$. Then we have

$$H: x_j \in F \quad (j = 1, 2, \ldots, n), \qquad \bar{H}: x_j \in (1 - p)F + pG \quad (j = 1, 2, \ldots, n).$$

For $\bar{H}$ to reflect outliers, $G$ needs to have an appropriate form - for example larger or smaller location, or
greater dispersion, than $F$. If $p$ is small enough we might encounter just one, or a few, outliers.
4) Slippage Alternative: This is the most common form of outlier model employed in the literature. Its general
form is

$$H: x_j \in F \quad (j = 1, 2, \ldots, n), \qquad \bar{H}: x_j \in F \quad (j \ne i), \quad x_i \in G,$$

where, again, $G$ is such that the alternative hypothesis $\bar{H}$ is likely to be reflected by outliers.
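
To see how the slippage and mixture alternatives differ in practice, it can help to simulate from each. The sketch below is illustrative only: the function names, the argument `p` for the contamination proportion, and the idea of passing sampling functions for $F$ and $G$ are all assumptions made here, not constructions from the text.

```python
import random


def sample_slippage(n, draw_f, draw_g, rng):
    """Slippage alternative: n - 1 values from F and exactly one from G.

    Returns the sample and the index i of the contaminant.
    """
    i = rng.randrange(n)  # position of the single contaminant
    xs = [draw_g(rng) if j == i else draw_f(rng) for j in range(n)]
    return xs, i


def sample_mixture(n, p, draw_f, draw_g, rng):
    """Mixture alternative: each value comes from G with probability p, else from F.

    Returns the sample and a list of booleans marking the contaminants.
    """
    from_g = [rng.random() < p for _ in range(n)]
    xs = [draw_g(rng) if c else draw_f(rng) for c in from_g]
    return xs, from_g


if __name__ == "__main__":
    rng = random.Random(0)
    normal_f = lambda r: r.gauss(0.0, 1.0)  # basic model F (illustrative choice)
    normal_g = lambda r: r.gauss(5.0, 1.0)  # contaminating model G, shifted upward
    xs, i = sample_slippage(20, normal_f, normal_g, rng)
    print("slippage contaminant index:", i)
    xs, flags = sample_mixture(20, 0.05, normal_f, normal_g, rng)
    print("mixture contaminants drawn:", sum(flags))
```

Under slippage the number of contaminants is fixed at one, whereas under the mixture model it is random and may be zero, which matters when comparing the power of discordancy tests.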


Tests of Discordancy for Specific Distributions
Normal Distribution
Suppose we were to adopt a normal distribution as the basic (non-contamination) model. Different tests of
discordancy are possible here. For a single lower outlier the test statistics are

$$t_1 = \frac{\bar{x} - x_{(1)}}{s} \quad \text{and} \quad t_2 = \frac{x_{(2)} - x_{(1)}}{x_{(n)} - x_{(1)}},$$

where $\bar{x}$ and $s$ are the sample mean and standard deviation respectively.
For a single upper outlier an obvious modification of $t_1$ yields $(x_{(n)} - \bar{x})/s$. A corresponding test for an
upper and lower outlier pair can be based on $(x_{(n)} - x_{(1)})/s$. Other forms have been proposed and investigated for
further cases of single or multiple outliers under different sets of assumptions about the state of knowledge of the
mean and variance of the normal distribution. A particular example is $(x_{(n)} - \mu)/\sigma$ for an upper outlier from a
normal distribution when both the mean $\mu$ and the variance $\sigma^2$ are known.
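
The studentized statistics for the normal case are straightforward to compute. The sketch below makes two assumptions not stated in the text: $s$ uses the $n - 1$ divisor, and the function and key names are hypothetical. Critical values for the studentized statistics must come from published tables and are not reproduced here; only the known-parameter case has a simple closed-form tail probability, via $F_n(x) = \{F(x)\}^n$.

```python
import statistics


def normal_discordancy_statistics(sample):
    """Studentized discordancy statistics for a normal basic model."""
    xs = sorted(sample)
    if len(xs) < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(xs)
    s = statistics.stdev(xs)  # assumption: n - 1 divisor
    if s == 0:
        raise ValueError("sample has zero spread")
    return {
        "lower": (mean - xs[0]) / s,    # single lower outlier
        "upper": (xs[-1] - mean) / s,   # single upper outlier
        "pair": (xs[-1] - xs[0]) / s,   # upper and lower outlier pair
    }


def upper_tail_prob_known_params(sample, mu: float, sigma: float) -> float:
    """P(X_(n) > x_(n)) when the normal mean and standard deviation are known."""
    z = (max(sample) - mu) / sigma
    return 1.0 - statistics.NormalDist().cdf(z) ** len(sample)
```

Note that the pair statistic is simply the sum of the lower and upper statistics, since the sample mean cancels.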
Exponential and Gamma
Suppose $F$ has probability density function

$$f(x) = \frac{\lambda^r x^{r-1} e^{-\lambda x}}{\Gamma(r)} \qquad (x > 0).$$

That is, $X$ has a gamma distribution, so widely relevant to environmental problems (including as special cases the
exponential distribution, where $r = 1$, and the $\chi^2_n$ distribution, where $\lambda = 1/2$ and $r = n/2$). Some existing
tests of discordancy for upper outliers have natural form. Thus we might choose to employ test statistics

$$\frac{x_{(n)}}{\sum x_i} \quad \text{or} \quad \frac{x_{(n)} - x_{(n-1)}}{x_{(n)} - x_{(1)}}.$$

The test statistic $x_{(n)}/x_{(1)}$ provides a useful test for an upper and lower outlier pair.

Multiple Outliers: Masking and Swamping


Often we may wish to test for more than one discordant outlier. For example, we may want to test both $x_{(n-1)}$ and
$x_{(n)}$. There are two possible approaches: a block test of the pair $(x_{(n-1)}, x_{(n)})$, or consecutive tests, first of
$x_{(n)}$ and then of $x_{(n-1)}$. Both lead to conceptual difficulties, aside from any distributional intractabilities
(which are also likely).
Consider a sample configuration in which the two largest values, $x_{(n-1)}$ and $x_{(n)}$, lie close together and well
apart from the remainder of the sample.

The consecutive test may fail at the first stage when testing $x_{(n)}$ because of the proximity of $x_{(n-1)}$, which masks
the effect of $x_{(n)}$. For example, the statistic

$$\frac{x_{(n)} - x_{(n-1)}}{x_{(n)} - x_{(1)}}$$

will be prone to such masking for obvious reasons. On the other hand, we may find that a block test of
$(x_{(n-1)}, x_{(n)})$ convincingly declares the pair to be discordant.
Consider a slightly different sample.

Now a block test of $(x_{(n-1)}, x_{(n)})$ may declare the pair to be discordant, whereas consecutive tests show $x_{(n)}$ to
be discordant, but not $x_{(n-1)}$. The marked outlier $x_{(n)}$ has carried the innocuous $x_{(n-1)}$ with it in the block
test: this is known as swamping.
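
Masking and swamping can be illustrated numerically with two small made-up samples. The gap-to-range statistic is the one mentioned in the discussion of masking; the block statistic used here, $(x_{(n)} - x_{(n-2)})/(x_{(n)} - x_{(1)})$, is one Dixon-type choice assumed for illustration and is not named in the text. No critical values are applied; the sketch only compares magnitudes.

```python
def gap_ratio(sample):
    """(x_(n) - x_(n-1)) / (x_(n) - x_(1)): single upper-outlier statistic."""
    xs = sorted(sample)
    if len(xs) < 3 or xs[-1] == xs[0]:
        raise ValueError("need at least three values with non-zero range")
    return (xs[-1] - xs[-2]) / (xs[-1] - xs[0])


def block_pair_ratio(sample):
    """(x_(n) - x_(n-2)) / (x_(n) - x_(1)): assumed block statistic for the top pair."""
    xs = sorted(sample)
    if len(xs) < 3 or xs[-1] == xs[0]:
        raise ValueError("need at least three values with non-zero range")
    return (xs[-1] - xs[-3]) / (xs[-1] - xs[0])


# Made-up data: two close upper extremes (masking) versus one marked extreme (swamping).
masked = [1.0, 1.5, 2.0, 2.5, 3.0, 9.5, 10.0]
swamped = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 10.0]

if __name__ == "__main__":
    for name, data in (("masked", masked), ("swamped", swamped)):
        print(name, round(gap_ratio(data), 3), round(block_pair_ratio(data), 3))
```

In the first sample the gap ratio is about 0.06, so a consecutive test stops at the first stage, while the block statistic is about 0.78. In the second sample the gap ratio is about 0.72 and the block statistic is again about 0.78, so a block test would treat the unremarkable value 3.5 as discordant along with 10.0.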
The dangers of these effects can be minimized by appropriate choice of test statistics, that is, of the form of test of
discordancy. Such protection is not so readily available for automated procedures on regularly collected data sets.
Different tests will have differing vulnerabilities to masking or swamping and the choice of test needs to take this
into account.