Statistics By Jim
Making statistics intuitive

5 Ways to Find Outliers in Your Data


By Jim Frost — 25 Comments

Outliers are data points that are far from other data points. In other words, they’re unusual
values in a dataset. Outliers are problematic for many statistical analyses because they can cause
tests to either miss significant findings or distort real results.

Unfortunately, there are no strict statistical rules for definitively identifying outliers. Finding
outliers depends on subject-area knowledge and an understanding of the data collection process.
While there is no solid mathematical definition, there are guidelines and statistical tests you can
use to find outlier candidates.

In this post, I’ll explain what outliers are and why they are problematic, and present various
methods for finding them. Additionally, I close this post by comparing the different techniques
for identifying outliers and share my preferred approach.

Outliers and Their Impact


Outliers are a simple concept—they are values that are notably different from other data points
and they can cause problems in statistical procedures.

To demonstrate how much a single outlier can affect the results, let’s examine the properties of
an example dataset. It contains 15 height measurements of human males. One of those values is
an outlier. The table below shows the mean height and standard deviation with and without the
outlier.

Throughout this post, I’ll be using this example CSV dataset: Outliers.

                     With Outlier       Without Outlier      Difference
Mean                 2.4m (7’ 10.5”)    1.8m (5’ 10.8”)      0.6m (~2 feet)
Standard deviation   2.3m (7’ 6”)       0.14m (5.5 inches)   2.16m (~7 feet)

From the table, it’s easy to see how a single outlier can distort reality. A single value changes the
mean height by 0.6m (2 feet) and the standard deviation by a whopping 2.16m (7 feet)!
Hypothesis tests that use the mean with the outlier are off the mark. And, the much larger
standard deviation will severely reduce statistical power!
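
As a rough sketch, the effect can be reproduced in Python. The heights below are hypothetical values chosen for illustration (they are not the article's CSV dataset), and the sample standard deviation is an assumption about how the table was computed.

```python
import statistics

# Hypothetical heights in meters (NOT the article's dataset): 14 plausible
# values plus one implausible value standing in for the outlier.
heights = [1.62, 1.68, 1.71, 1.73, 1.75, 1.77, 1.78, 1.80,
           1.82, 1.84, 1.86, 1.89, 1.93, 1.97, 10.8]

def summarize(values):
    """Return (mean, sample standard deviation) for a list of numbers."""
    return statistics.mean(values), statistics.stdev(values)

mean_with, sd_with = summarize(heights)
mean_without, sd_without = summarize(heights[:-1])  # drop the last value

print(f"With outlier:    mean={mean_with:.2f}  sd={sd_with:.2f}")
print(f"Without outlier: mean={mean_without:.2f}  sd={sd_without:.2f}")
```

The single extreme value shifts the mean noticeably, but it inflates the standard deviation far more, which is the pattern the table shows.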

Before performing statistical analyses, you should identify potential outliers. That’s the subject of
this post. In the next post, we’ll move on to figuring out what to do with them.

There are a variety of ways to find outliers. All these methods employ different approaches for
finding values that are unusual compared to the rest of the dataset. I’ll start with visual
assessments and then move onto more analytical assessments.

Let’s find that outlier! I’ve got five methods for you to try.

Sorting Your Datasheet to Find Outliers


Sorting your datasheet is a simple but effective way to highlight unusual values. Simply sort your
data sheet for each variable and then look for unusually high or low values.

For example, I’ve sorted the example dataset in ascending order, as shown below. The highest
value is clearly different than the others. While this approach doesn’t quantify the outlier’s
degree of unusualness, I like it because, at a glance, you’ll find the unusually high or low values.

Graphing Your Data to Identify Outliers


Boxplots, histograms, and scatterplots can highlight outliers.

Boxplots display asterisks or other symbols on the graph to indicate explicitly when datasets
contain outliers. These graphs use the interquartile method with fences to find outliers, which I
explain later. The boxplot below displays our example dataset. It’s clear that the outlier is quite
different than the typical data value.

You can also use boxplots to find outliers when you have groups in your data. The boxplot below
shows a different dataset that has an outlier in the Method 2 group. Click here to learn more
about boxplots.

Histograms also emphasize the existence of outliers. Look for isolated bars, as shown below. Our
outlier is the bar far to the right. The graph crams the legitimate data points on the far left.

Click here to learn more about histograms.

Most of the outliers I discuss in this post are univariate outliers. We look at a data distribution for
a single variable and find values that fall outside the distribution. However, you can use a
scatterplot to detect outliers in a multivariate setting.

In the graph below, we’re looking at two variables, Input and Output. The scatterplot with
regression line shows how most of the points follow the fitted line for the model. However, the
circled point does not fit the model well.

Interestingly, the Input value (~14) for this observation isn’t unusual at all because the other
Input values range from 10 through 20 on the X-axis. Also, notice how the Output value (~50) is
similarly within the range of values on the Y-axis (10 – 60). Neither the Input nor the Output
values themselves are unusual in this dataset. Instead, it’s an outlier because it doesn’t fit the
model.

This type of outlier can be a problem in regression analysis. Given the multifaceted nature of
multivariate regression, there are numerous types of outliers in that realm. In my ebook about
regression analysis, I detail various methods and tests for identifying outliers in a multivariate
context.

For the rest of this post, we’ll focus on univariate outliers.

Using Z-scores to Detect Outliers


Z-scores can quantify the unusualness of an observation when your data follow the normal
distribution. Z-scores are the number of standard deviations above and below the mean that
each value falls. For example, a Z-score of 2 indicates that an observation is two standard
deviations above the average while a Z-score of -2 signifies it is two standard deviations below
the mean. A Z-score of zero represents a value that equals the mean.

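To calculate the Z-score for an observation, take the raw measurement, subtract the mean, and
divide by the standard deviation. Mathematically, the formula for that process is the following:

$$Z = \frac{X - \mu}{\sigma}$$

where X is the raw measurement, μ is the mean, and σ is the standard deviation.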

The further away an observation’s Z-score is from zero, the more unusual it is. A standard cut-off
value for finding outliers is a Z-score of +/-3 or further from zero. The probability distribution
below displays the distribution of Z-scores in a standard normal distribution. Z-scores beyond +/-
3 are so extreme you can barely see the shading under the curve.

In a population that follows the normal distribution, Z-score values more extreme than +/- 3 have
a probability of 0.0027 (2 * 0.00135), which is about 1 in 370 observations. However, if your data
don’t follow the normal distribution, this approach might not be accurate.
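
A minimal sketch of the Z-score approach in Python follows. The data are hypothetical heights (not the article's dataset), the helper names are made up for illustration, and the sample standard deviation is assumed.

```python
import statistics

def z_scores(values):
    """Z-score of each value, using the sample mean and sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def flag_outliers(values, cutoff=3.0):
    """Return the values whose absolute Z-score meets or exceeds the cutoff."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) >= cutoff]

# Hypothetical heights in meters; the last value is an implausible entry.
heights = [1.62, 1.68, 1.71, 1.73, 1.75, 1.77, 1.78, 1.80,
           1.82, 1.84, 1.86, 1.89, 1.93, 1.97, 10.8]

for value, z in zip(heights, z_scores(heights)):
    print(f"{value:6.2f}  z = {z:+.2f}")

print("Flagged:", flag_outliers(heights))
```

The cutoff is a parameter so that it is easy to see how the flagged set changes if a different threshold is preferred.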

Z-scores and Our Example Dataset

Below, I display the values in the example dataset along with their Z-scores. This approach
identifies the same observation as being an outlier.

Note that Z-scores can be misleading with small datasets because the maximum Z-score is
limited to (n−1) / √ n.*

Indeed, our Z-score of ~3.6 is right near the maximum value for a sample size of 15. Sample sizes
of 10 or fewer observations cannot have Z-scores that exceed a cutoff value of +/-3.

Also, note that the outlier’s presence throws off the Z-scores because it inflates the mean and
standard deviation as we saw earlier. Notice how all the Z-scores are negative except the outlier’s
value. If we calculated Z-scores without the outlier, they’d be different! Be aware that if your
dataset contains outliers, Z-values are biased such that they appear to be less extreme (i.e.,
closer to zero).
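
The small-sample limit of (n−1) / √n can be checked numerically. The sketch below simply evaluates that formula for a few sample sizes; the function name is made up for illustration.

```python
import math

def max_abs_z(n):
    """Largest possible absolute Z-score in a sample of size n: (n - 1) / sqrt(n)."""
    return (n - 1) / math.sqrt(n)

for n in (5, 10, 11, 15):
    print(f"n = {n:2d}  max |Z| = {max_abs_z(n):.3f}")

# Smallest sample size for which a Z-score can exceed a cutoff of 3.
smallest_n = next(n for n in range(2, 100) if max_abs_z(n) > 3)
print("Smallest n with max |Z| > 3:", smallest_n)
```

For n = 15 the limit is about 3.615, which is why the example's Z-score of ~3.6 sits so close to the maximum.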

This Z-score cutoff value is based on the empirical rule. For more information, read my post,
Empirical Rule: Definition, Formula, and Uses.

Related posts: Normal Distribution and Understanding Probability Distributions

Using the Interquartile Range to Create Outlier Fences


You can use the interquartile range (IQR), several quartile values, and an adjustment factor to
calculate boundaries for what constitutes minor and major outliers. Minor and major denote the
unusualness of the outlier relative to the overall distribution of values. Major outliers are more
extreme. Analysts also refer to these categorizations as mild and extreme outliers.

The IQR is the middle 50% of the dataset. It’s the range of values between the third quartile and
the first quartile (Q3 – Q1). We can take the IQR, Q1, and Q3 values to calculate the following
outlier fences for our dataset: lower outer, lower inner, upper inner, and upper outer. These
fences determine whether data points are outliers and whether they are mild or extreme.

Values that fall inside the two inner fences are not outliers. Let’s see how this method works
using our example dataset.

Click here to learn more about interquartile ranges and percentiles.

Calculating the Outlier Fences Using the Interquartile Range

Using statistical software, I can determine the interquartile range along with the Q1 and Q3
values for our example dataset. We’ll need these values to calculate the “fences” for identifying
minor and major outliers. The output below indicates that our Q1 value is 1.714 and the Q3 value
is 1.936. Our IQR is 1.936 – 1.714 = 0.222.

To calculate the outlier fences, do the following:

1. Take your IQR and multiply it by 1.5 and 3. We’ll use these values to obtain the inner and
outer fences. For our example, the IQR equals 0.222. Consequently, 0.222 * 1.5 = 0.333 and
0.222 * 3 = 0.666. We’ll use 0.333 and 0.666 in the following steps.
2. Calculate the inner and outer lower fences. Take the Q1 value and subtract the two values
from step 1. The two results are the lower inner and outer outlier fences. For our example,
Q1 is 1.714. So, the lower inner fence = 1.714 – 0.333 = 1.381 and the lower outer fence =
1.714 – 0.666 = 1.048.
3. Calculate the inner and outer upper fences. Take the Q3 value and add the two values from
step 1. The two results are the upper inner and outer outlier fences. For our example, Q3 is
1.936. So, the upper inner fence = 1.936 + 0.333 = 2.269 and the upper outer fence = 1.936
+ 0.666 = 2.602.

Using the Outlier Fences with Our Example Dataset

For our example dataset, the values for these fences are 1.048, 1.381, 2.269, and 2.602. Almost
all of our data should fall between the inner fences, which are 1.381 and 2.269. At this point, we
look at our data values and determine whether any qualify as being major or minor outliers. 14
out of the 15 data points fall inside the inner fences—they are not outliers. The 15th data point
falls outside the upper outer fence—it’s a major or extreme outlier.
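
The fence arithmetic can be sketched in a few lines of Python. The Q1 and Q3 values come from the example above; the function names and the three test heights are hypothetical. Quartiles are passed in rather than computed because different software packages use slightly different quartile definitions.

```python
def outlier_fences(q1, q3):
    """Return the four outlier fences computed from the first and third quartiles."""
    iqr = q3 - q1
    return {
        "lower_outer": q1 - 3.0 * iqr,
        "lower_inner": q1 - 1.5 * iqr,
        "upper_inner": q3 + 1.5 * iqr,
        "upper_outer": q3 + 3.0 * iqr,
    }

def classify(value, fences):
    """Label a value as a major outlier, a minor outlier, or not an outlier."""
    if value < fences["lower_outer"] or value > fences["upper_outer"]:
        return "major"
    if value < fences["lower_inner"] or value > fences["upper_inner"]:
        return "minor"
    return "not an outlier"

fences = outlier_fences(q1=1.714, q3=1.936)
for name, bound in fences.items():
    print(f"{name}: {bound:.3f}")

# Hypothetical heights in meters, chosen only to exercise each category.
for height in (1.80, 2.40, 3.00):
    print(height, "->", classify(height, fences))
```

A value between an inner and an outer fence is labeled minor, and anything beyond an outer fence is labeled major, matching the definitions in this section.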

The IQR method is helpful because it uses percentiles, which do not depend on a specific
distribution. Additionally, percentiles are relatively robust to the presence of outliers compared
to the other quantitative methods.

Boxplots use the IQR method to determine the inner fences. Typically, I’ll use boxplots rather
than calculating the fences myself when I want to use this approach. Of the quantitative
approaches in this post, this is my preferred method. The interquartile range is robust to outliers,
which is clearly a crucial property when you’re looking for outliers!

Related post: What are Robust Statistics?

Finding Outliers with Hypothesis Tests


You can use hypothesis tests to find outliers. Many outlier tests exist, but I’ll focus on one to
illustrate how they work. In this post, I demonstrate Grubbs’ test, which tests the following
hypotheses:

Null: All values in the sample were drawn from a single population that follows the same
normal distribution.
Alternative: One value in the sample was not drawn from the same normally distributed
population as the other values.

If the p-value for this test is less than your significance level, you can reject the null and conclude
that one of the values is an outlier. The analysis identifies the value in question.

Let’s perform this hypothesis test using our sample dataset. Grubbs’ test assumes your data are
drawn from a normally distributed population, and it can detect only one outlier. If you suspect
you have additional outliers, use a different test.

Grubbs’ outlier test produced a p-value of 0.000 (rounded to three decimal places). Because it is
less than our significance level, we can conclude that our dataset contains an outlier. The output
indicates it is the high value we
found before.
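
For readers who want to see what the test computes, here is a minimal sketch of the Grubbs' test statistic in Python. It is not the software output described above: the data are hypothetical, the function names are made up, and the sketch stops short of a p-value because that requires a critical value from the t distribution, which must be looked up in a table or a statistics package.

```python
import math
import statistics

def grubbs_statistic(values):
    """Return (G, suspect): the largest absolute deviation from the sample mean,
    in sample standard deviation units, and the value that produces it."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    suspect = max(values, key=lambda v: abs(v - mean))
    return abs(suspect - mean) / sd, suspect

def grubbs_critical_value(n, t_crit):
    """Two-sided critical value for G.

    t_crit is an input, not computed here: it must be the upper alpha / (2n)
    critical value of a t distribution with n - 2 degrees of freedom.
    """
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit**2 / (n - 2 + t_crit**2))

# Hypothetical heights in meters (not the article's dataset).
heights = [1.62, 1.68, 1.71, 1.73, 1.75, 1.77, 1.78, 1.80,
           1.82, 1.84, 1.86, 1.89, 1.93, 1.97, 10.8]

g, suspect = grubbs_statistic(heights)
print(f"G = {g:.3f} for suspect value {suspect}")
```

The null hypothesis is rejected when G exceeds the critical value for the chosen significance level. Note that G is the largest absolute Z-score in the sample, so the small-sample limit on Z-scores applies to it as well.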

If you use Grubbs’ test and find an outlier, don’t remove that outlier and perform the analysis
again. That process can cause you to remove values that are not outliers.

Challenges of Using Outlier Hypothesis Tests: Masking and Swamping

When performing an outlier test, you either need to choose a procedure based on the number of
outliers or specify the number of outliers for a test. Grubbs’ test checks for only one outlier.
However, other procedures, such as the Tietjen-Moore Test, require you to specify the number of
outliers. That’s hard to do correctly! After all, you’re performing the test to find outliers! Masking
and swamping are two problems that can occur when you specify the incorrect number of
outliers in a dataset.

Masking occurs when you specify too few outliers. The additional outliers that exist can affect the
test so that it detects no outliers. For example, if you specify one outlier when there are two, the
test can miss both outliers.

Conversely, swamping occurs when you specify too many outliers. In this case, the test identifies
too many data points as being outliers. For example, if you specify two outliers when there is
only one, the test might determine that there are two outliers.

Because of these problems, I’m not a big fan of outlier tests. More on this in the next section!

My Philosophy about Finding Outliers


As you saw, there are many ways to identify outliers. My philosophy is that you must use your
in-depth knowledge about all the variables when analyzing data. Part of this knowledge is knowing
what values are typical, unusual, and impossible.

I find that when you have this in-depth knowledge, it’s best to use the more straightforward,
visual methods. At a glance, data points that are potential outliers will pop out under your
knowledgeable gaze. Consequently, I’ll often use boxplots, histograms, and good old-fashioned
data sorting! These simple tools provide enough information for me to find unusual data points
for further investigation.

Typically, I don’t use Z-scores and hypothesis tests to find outliers because of their various
complications. Using outlier tests can be challenging because they usually assume your data
follow the normal distribution, and then there’s masking and swamping. Additionally, the
existence of outliers makes Z-scores less extreme. It’s ironic, but these methods for identifying
outliers are actually sensitive to the presence of outliers! Fortunately, as long as researchers use
a simple method to display unusual values, a knowledgeable analyst is likely to know which
values need further investigation.

In my view, the more formal statistical tests and calculations are overkill because they can’t
definitively identify outliers. Ultimately, analysts must investigate unusual values and use their
expertise to determine whether they are legitimate data points. Statistical procedures don’t know
the subject matter or the data collection process and can’t make the final determination. You
should not include or exclude an observation based entirely on the results of a hypothesis test or
statistical measure.

At this stage of the analysis, we’re only identifying potential outliers for further investigation. It’s
just the first step in handling them. If we err, we want to err on the side of investigating too many
values rather than too few.

In my next post, I’ll explain what you’re looking for when investigating outliers and how that
helps you determine whether to remove them from your dataset. Not all outliers are bad and
some should not be deleted. In fact, outliers can be very informative about the subject-area and
data collection process. It’s important to understand how outliers occur and whether they might
happen again as a normal part of the process or study area.

Read my Guidelines for Removing and Handling Outliers.


If you’re learning about hypothesis testing and like the approach I use in my blog, check out my
eBook!

Reference

Ronald E. Shiffler (1988) Maximum Z Scores and Outliers, The American Statistician, 42:1, 79-80
DOI: 10.1080/00031305.1988.10475530


Comments

Zohaib says
July 9, 2021 at 5:17 am

The DGP of multiple linear regression model is given

Y_i=0.3+2X_1i+1.5X_2i+ε_i

Where ε_i ~Norm(0,10)

Understand this DGP carefully and generate 500 observations of each variable in Excel.

And prove that:

In the case of normally distributed data, the SEs (of the estimators) are efficient, the
t-statistics are valid, and the parameters are not biased.

Generate an outlier with a value of 3000 (in Y_i) and show how a single outlier distorts
the distribution of the data, so that the SEs of the parameters are no longer efficient, the
t-statistics are not valid, and the parameters become biased.

Note: Answers to the above question should be given on the official university answer sheet
(which I have given you), but the results from the Excel file should be pasted into this Word
document; only a Word document for your Excel results will be accepted.

Jonas says
March 23, 2021 at 10:01 am

Hi Jim,

Great article!

Can you elaborate on why it is incorrect to use Grubbs’ test several times to remove more
than one outlier in a dataset? In a sufficiently large data set, why can’t you just run it
several times?

I would think that one strategy to automate the procedure could be to take another few
extra samples, then run Grubbs’ test a few times.

(assuming we can afford to take maybe 30 samples or more)

Jim Frost says


March 23, 2021 at 3:49 pm

Hi Jonas,

What happens if you repeat Grubbs’ test is that it’ll tend to remove data points that
are not outliers. The first outlier it finds is based on the entire distribution. Then,
you remove an outlier and the distribution of the remaining data now has less
variability. A point that was not an outlier might now appear to be an outlier
because of the reduced variability. Statisticians recommend using Grubbs’ test only
once per dataset because of its propensity for removing valid data
points when you use the test multiple times.

I hope that helps!

Joseph Lombardi says


February 23, 2021 at 4:20 pm

Hey, Jim.

I have a question re the Fitted Line Plot — Output as a function of Input. Visually, the
one data point does not “fit the model.” If we look at the residuals, we should get a
mean of zero. Can we not use the Standard Deviation of the residuals to calculate the Z-
scores for THEM and then determine whether a datum or two can be omitted? Or is that
subject to the same issues you mention above regarding the actual observed data?

If using the Z-scores of residuals is not a great idea, can we use percentiles instead? Can
we just flag data with a residual of less than, say, the 1st percentile and greater than the
99th percentile? Again, I am referring to the residuals, here.

Cheers,

Joe

Jim Frost says


February 25, 2021 at 4:23 pm

Hi Joe,

My overall point is to use all the methods carefully. Most of them have some
drawback but I’m not saying to avoid them. In the end, it really comes down to your
subject area knowledge and the investigation of candidate outliers. It’s always
possible that an unusual value is part of the natural variation of the process rather
than a problematic point!

The residuals you describe are known as standardized residuals and, yes, they do
have a value equivalent to a Z-score. They are a good way to identify potential
outliers. With raw residuals, you might see a value of X, but whether X is
unusual depends on the data units, data variability, etc. Standardized residuals are a
good way to determine whether a residual is unusual given the properties of the
data because they incorporate those factors. However, the same caveats for Z-values
apply for standardized residuals. Just something to be aware of while assessing
them. Additionally, remember that it’s normal for about 5% of the residuals to have
standardized scores beyond +/-2. Again, use subject area knowledge and investigate
particular data points.

Another useful type of residual is the deleted Studentized residual. These are like
the standardized residuals above, but the calculation for the ith deleted
Studentized residual does not include the ith observation. That helps avoid the
problem where the presence of an unusual residual actually causes its own
standardized value to be lower because it’s inflating the residuals’ variability.

You could convert to percentiles as you describe. Most software I see use either +/-
2 or 3 standard deviations to identify candidate outliers.
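
To make the difference between standardized and deleted Studentized residuals concrete, here is a rough sketch for simple linear regression in plain Python. Everything in it is an assumption for illustration: the Input/Output values are hypothetical (loosely modeled on the scatterplot in the post), the function name is made up, and the standardized residual is taken to be the residual divided by its estimated standard error including leverage, which is one common software definition.

```python
import math

def residual_diagnostics(x, y):
    """Fit y = b0 + b1*x by least squares and return
    (standardized residuals, deleted Studentized residuals)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))   # residual standard error
    lev = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]   # leverage of each point
    standardized = [e / (s * math.sqrt(1 - h)) for e, h in zip(resid, lev)]
    # Equivalent to refitting without observation i, with 2 model parameters.
    deleted = [r * math.sqrt((n - 3) / (n - 2 - r ** 2)) for r in standardized]
    return standardized, deleted

# Hypothetical data: points near a line, except the fifth point (Input = 14).
inputs = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
outputs = [11, 14, 21, 24, 50, 36, 39, 46, 49, 56, 59]

standardized, deleted = residual_diagnostics(inputs, outputs)
for xi, r, t in zip(inputs, standardized, deleted):
    print(f"Input {xi}: standardized = {r:+.2f}, deleted Studentized = {t:+.2f}")
```

In this made-up example, the unusual point's deleted Studentized residual is much larger than its standardized residual, because leaving the point out removes its own contribution to the residual variability.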
In regression, there are multiple ways that an observation can be unusual.
Residuals measure unusualness in the y-direction. But you can also have unusual
observations in the X-direction. And points that individually affect the model’s
coefficient estimates by a larger than usual degree. I can’t remember if you have
my regression book, but I discuss those issues in it!

Take care!

Hosam Salman says


November 28, 2020 at 2:43 am

Dear Jim,

I hope all is great and well.

I have a master data sheet that includes a few variables. In order to perform my regression, I
need to make sure I get rid of the outliers. I understand that there are many ways to
get the outliers out. I plan to use the boxplot method or the Z-score method. The distributions
of the data are sometimes normal, sometimes left skewed, and sometimes right skewed. Overall, none
is consistent. The master data sheet will be re-sorted based on specific variable values,
so I will create a few specific data sheets from the master data sheet.

The questions that I need help with are listed below:

1. How do we deal with outliers when the master data sheet includes various distributions?
It is not consistent; some of them are normal and the majority are skewed.

2. Should we apply one method to remove the outliers, or can we apply more than one
method, like these two methods? And when should they be applied? Should this be applied to the
master data sheet, or do we still need to apply it after sorting the data as indicated above?

3. If we use the boxplot to fix one column of a variable, it will impact the other variables
since it eliminates one complete row. That row may have good test results for other
variables.

4. Any advice or suggestions in general for dealing with the outliers while not
significantly impacting the obtained data?

Thank you so much!

Hosam

Bon says
August 3, 2020 at 1:23 pm

Hi Jim,

Is there a correct way to run an outlier analysis? What if there are 3 variates (12 variables),
is there any rule about this? Thank you

d says
August 1, 2020 at 10:16 pm

It’s best if you use several methods to find outliers. The true outliers will satisfy multiple
methods.

You can compare the findings of the different methods and have confidence those data
points can be treated as outliers when flagged by different methods independently.

It also helps to have a clear understanding of what your dataset describes in reality, and
what those outliers really represent, how they came to be in existence. If it’s likely they
are errors, great, that’s more justification to ignore them.

If they are certainly correctly sampled you must consider what it means to remove
them from your study, how their removal affects the integrity of your analysis.

Jim Frost says


August 2, 2020 at 12:24 am

Hi D,

Thanks for writing! In terms of flagging observations for investigation, I’d agree that
if multiple methods find the same values, there’s good reason to investigate them.
However, flagging by multiple methods doesn’t necessarily increase the likelihood
that removing those values is appropriate.

I definitely agree that understanding what your dataset describes and how the
outliers came to be are both crucial tasks.

However, removal really depends on understanding how those values came to
exist. That can get fairly complicated. I discuss those issues in my post about
determining whether to remove outliers. Identifying the candidate is the easy part.
You’re just looking for unusual values. Why they exist and what to do about them is
where it can get complicated!

KECHLER POLYCARPE says


May 26, 2020 at 9:47 pm

Sorry didn’t want to blind you brother. Thank you I’m studying it now. I sent you a
message I had a question.

Jim Frost says


May 26, 2020 at 10:11 pm

No worries! 🙂

I will reply to your other question soon.

KECHLER POLYCARPE says


May 26, 2020 at 8:56 pm

HELLO JIM, IF YOU DON’T HAVE DATA AT START CAN YOU STILL CRAFT THE RESEARCH
QUESTION THEN DESIGN A STUDY TO COLLECT DATA? IF SO HOW?

-Thank you

Jim Frost says


May 26, 2020 at 9:30 pm

Hi Kechler, first, please don’t use ALL CAPS! It hurts my eyes!

Yes, you can craft your research question and study design before collecting data.
In fact, that’s the preferred approach. If you start collecting data before you have a
research question and design, it’s very likely that you won’t be collecting the
necessary data to answer your question. I write about this in my post about
conducting scientific studies with statistical analyses.

HTITI Sarah says
April 15, 2020 at 6:49 am

Is it legitimate to detect outliers based on the Z-score for a large population ( 800 K
observations) even if it’s not normal?

Or it’s more appropriate to use the IQR and then compute an Upper hinge and lower
hinge ? or are there other methods to apply in this case ?

Thanks in advance

SANJAYA KUMAR SUBBA says


April 13, 2020 at 12:59 am

Hello Sir Good morning,

I’m a research scholar and I’m comparing the means of two independent sample data sets
(i.e., stock price returns of two pricing methods). In order to use a parametric test, my
data should be normally distributed, but when I tested normality, I found my data are not
normal. My guide is suggesting that I normalize my data using Z-scores and find
their areas under the curve, but I’m not able to understand that. Please help me, sir, to
normalize the data.

Suruchi Sarvate says


January 28, 2020 at 1:03 am

Hi Jim,

I have a dataset with 11 columns and I have written a common function
detect_outliers() to find outliers in the columns.

For the first 6 columns, the function works, but for the remaining 5 columns, the function
returns an empty list even though the columns have outliers. You can see the code below:

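################

import numpy as np

def detect_outliers(data):
    """Return the values in data whose absolute Z-score exceeds the threshold."""
    outliers = []
    threshold = 3
    mean = np.mean(data)
    std = np.std(data)  # population SD (ddof=0), NumPy's default
    for i in data:
        z_score = (i - mean) / std
        print(z_score)  # debugging output: one Z-score per value
        if np.abs(z_score) > threshold:
            outliers.append(i)
    return outliers

################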

As you can see, I have set “threshold = 3”. For the first six columns, the
function works because z_score > 3 for the outliers.

But for the rest of the columns, the z_score for the outliers is only greater than 1 (z_score > 1), so
the threshold would have to be 1 for the remaining five columns.

Here I have only 11 columns in the dataset. But what if I have 1000 columns in my
dataset? In that case, I can’t check the threshold for each and every column.

Please!!!! help me and reply at the earliest.

Jim Frost says


January 28, 2020 at 11:23 pm

Hi Suruchi,

Why would you use a Z-score of 1 to detect outliers? I’m not sure why it’s not
working but with such a low threshold you should have more detections. How
many observations per column?

Rutvij says
November 27, 2019 at 7:47 am

Hi Jim, Thanks for sharing details on outliers. I have one question and would be happy if you can
advise me.

My question is: I am building a machine learning model, and I have a training dataset and a
testing dataset. I removed outliers from the training dataset and built an ML model with
a good efficiency level. However, I have a large number of outliers in the testing dataset (which I
have to submit as it is).

Now, my ML model is less efficient when I apply it to the unseen test dataset (with
outliers).

Can you please advise me how I can achieve more efficiency on the test dataset? Plus, I
don’t want to lose any observed values in the test dataset.

Thanks.

Rutvij

Narasimha Patro says


October 29, 2019 at 2:01 am

How to treat outliers ??? Please help me

Jim Frost says


October 29, 2019 at 11:35 am

Hi Narasimha,

Read my follow up post to this one: Guidelines for Removing and Handling Outliers.

Denny Fernandez del Viso says


October 24, 2019 at 12:36 pm

I usually use a Q-Q plot to detect outliers – just a visualization of what you suggest as
using the Z-score.

Jim Frost says


October 24, 2019 at 3:26 pm

Hi Denny,

Thanks for the suggestion. Unusual Z-scores might stand out more in a plot than a
list. Just be aware of the constraints on Z-scores in small samples and the fact that
Z-scores themselves are sensitive to outliers.

Brion Hurley says


October 10, 2019 at 1:37 pm

I haven’t seen this formula before related to Z-scores: (n−1) / √ n

Can you share more details about where this comes from? It’s not intuitive to me at first
glance

Jim Frost says


October 11, 2019 at 3:58 pm

Hi Brion,

I’ve added a reference to this post for this formula. The referenced article discusses
this limitation in the context of finding outliers and it includes references to other
sources where the limit was derived. In a nutshell, maximizing Z-scores depends on
minimizing the standard deviation (or variance). As I showed earlier in this post, the
outlier is far from the mean score. While it increases the mean, it drastically
increases the standard deviation. The net result of both increases is that it limits
the maximum Z-value. In small samples, this limitation is even greater and severely
constrains the maximum absolute Z-scores.

In general, an outlier pulls the mean towards it and inflates the standard deviation.
Both effects reduce its Z-score. Indeed, our outlier’s Z-score of ~3.6 is greater than
3, but just barely. The Z-score seems to indicate that the value is just across the
boundary for being an outlier. However, it’s truly a severe outlier when you observe
how unusual it is. Both the boxplot and IQR method make this clear. And,
simply comparing the value to reasonable values, it is very far beyond
legitimately possible values for human height.

The article uses an example of a dataset with 5 values {0, 0, 0, 0, 1 million}. The Z-
score for the value of 1 million is only 1.789! Not an outlier using Z-scores!
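
That five-value example is easy to check. The sketch below assumes the sample standard deviation and compares the result with the (n−1) / √n limit.

```python
import math
import statistics

data = [0, 0, 0, 0, 1_000_000]

# Z-score of the extreme value, using the sample mean and sample SD.
z = (data[-1] - statistics.mean(data)) / statistics.stdev(data)

# Largest Z-score any value can have in a sample of this size.
bound = (len(data) - 1) / math.sqrt(len(data))

print(round(z, 3), round(bound, 3))  # prints: 1.789 1.789
```

The extreme value sits exactly at the limit, so making it larger would not change its Z-score.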

To quote the article, “The concept of a Z score as a measure of a value’s position
within a data set in terms of standard deviations is intuitively appealing.
Unfortunately, the behavior of Z is quite constrained for small data sets.”

To illustrate this constraint, I’m including the table below that lists the maximum
absolute Z-scores by sample size. Note how absolute Z-scores can exceed 3 only
when the sample size is 11 and greater.

I hope this helps!

Fernando Augusto Deheza Zambrana says


October 10, 2019 at 8:12 am

Excellent work. Congratulations

Copyright © 2021 · Jim Frost · Privacy Policy
