
Vol. 6, 2021-05

Testing for Normality: What is the Best Method?

Hugo Hernandez
ForsChem Research, 050030 Medellin, Colombia
hugo.hernandez@forschem.org

doi: 10.13140/RG.2.2.13926.14406

Abstract

Determining whether or not a data sample has been obtained from a normally-distributed
population is a common practice in statistics and data analysis. To date, several dozen
methods have been proposed in the scientific literature for testing normality. Thus, a
common question that arises is: What is the best method for testing the normality of a sample?
The first part of this report briefly reviews the most important types of methods used for
testing normality. Then, a survey of several power comparisons between normality tests
published in the last 3 decades is used to rank 55 of the most common methods. The overall
winner of this analysis was the regression-based Shapiro-Wilk (SW) normality test. The SW test
is briefly explained. Then, two possible approximations for the calculation of the corresponding
test statistic are proposed in order to simplify the implementation of the method (the mean
and the median order statistics approximations). An optimal, sample size-dependent
significance level is used as a strategy to reduce the error introduced by such approximations,
making the total test error comparable to that of the original SW test when 5% or 8% significance
levels are used (for the median and mean order statistics approximations, respectively). The concept
of normality value (N-value) is also introduced, which is positive when the distribution is more
likely normal. As an additional benefit of this strategy, there is no need to arbitrarily define the
significance level of the test, making it less subjective and independent of the analyst.

Keywords
Cumulative Probability, Hypothesis Testing, Moments, Normal Distribution, Normality value,
Order Statistics, P-value, Regression, Shapiro-Wilk, Statistic Tests

1. Introduction

The normal distribution is probably the most frequently observed distribution in Nature. This is
a direct consequence of the Central Limit Theorem [1], which indicates that the normal
distribution emerges whenever a relatively large number of realizations of a random variable

28/04/2021 ForsChem Research Reports Vol. 6, 2021-05 (1 / 38)


www.forschem.org

are added (or averaged). The normal distribution is also the most widely explored distribution,
from a mathematical point of view. There are, for example, specific properties and statistical
tests and analyses that only apply to normal distributions. Thus, it is important to verify if a
certain random variable behaves normally or not before using these tools.

The normal distribution is characterized by the following probability density function:

f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)

(1.1)

where x represents any real-valued realization of a normal random variable X, and \mu and \sigma are
constant parameters representing the mean value and the standard deviation of X, considering
that:
\mu = E(X) = \int_{-\infty}^{\infty} x\, f(x)\, dx

(1.2)

\sigma^2 = Var(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\, dx

(1.3)
where E is the expected value operator and Var is the variance operator.

Furthermore, any k-th absolute central moment of the normal distribution is expressed as [2]:

E\left( |X - \mu|^k \right) = \int_{-\infty}^{\infty} |x - \mu|^k f(x)\, dx =
\begin{cases}
\sigma^k \prod_{j=1}^{k/2} (2j-1), & k \text{ even} \\
\sigma^k \sqrt{2/\pi}\, \prod_{j=1}^{(k-1)/2} 2j, & k \text{ odd}
\end{cases}

(1.4)
where k is a positive integer.

Also, the cumulative probability function of a normal distribution is given by:

F(x) = \int_{-\infty}^{x} f(t)\, dt = \frac{1}{2} \left[ 1 + \operatorname{erf}\!\left( \frac{x - \mu}{\sigma \sqrt{2}} \right) \right]

(1.5)
where erf is the error function.
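Both functions are straightforward to implement; a minimal Python sketch of Eqs. (1.1) and (1.5), using the standard library's `math.erf`:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal probability density function, Eq. (1.1)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Normal cumulative probability function, Eq. (1.5), via math.erf."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
```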


There are currently more than 50 different methods for testing the normality of the
distribution from a sample of observations. So the following question necessarily emerges:
Which method should be selected for testing normality? Many different guidelines can be
found in the scientific literature for selecting the best test [3], particularly depending on the
nature of the distribution to be tested (e.g. if the data is symmetrical or not, or if it is skewed or
not, or if the alternative distribution is known). Unfortunately, due to the uncertainty involved
in sampling, particularly for samples of small size, such information is not necessarily reliable
or available. In addition, a non-statistician user of these tests might prefer a universal,
widely available, simple and fast, but sufficiently reliable method for testing normality, rather
than delving into this ocean of statistical methods.

The purpose of this report is precisely to answer the question: What is the best method for
testing normality? To do this, the basis of normality testing will first be explained. Then, the
best method will be identified after clearly defining the selection criterion. And finally, a simple
implementation of the method (which does not require specialized software but only a
spreadsheet) will be provided.

2. Overview of Normality Tests

The purpose of this section is only to illustrate the rationale behind the different normality tests
available, rather than to provide a detailed description of the methods. The reader is kindly
invited to look into the cited references for additional information.

2.1. Graphical Methods

Graphical methods are not statistical tests in the proper sense, but rather subjective
approaches for evaluating normality. They are briefly described here because they inspired
some of the analytical tests available, and can be quickly employed by experienced users.
Additional details about graphical methods can be found elsewhere [4].

The starting point for identifying normality is the histogram, which is simply a frequency bar-
plot (either absolute or relative) of the data. In histograms of normally-distributed data, the
bars should resemble a symmetric bell (similar to the bell described by the probability density
function of the normal distribution given by Eq. 1.1). Figure 1 shows different histograms of
random data obtained from known probability distributions, along with the probability density
function of a normal distribution with parameters \mu and \sigma estimated using the corresponding
sample statistics x̄ (sample average) and s (sample standard deviation). Different distributions
and different replications are included in the graph. Even though in some cases the bell shape is
clearly observed for normal data, in some other cases it is not perfectly clear. Similarly, other
distributions may eventually present a bell-like shape giving the erroneous idea of normality.


This example illustrates the main difficulty of using graphical methods for subjectively testing
normality.

Other similar graphical approaches include the stem-and-leaf plot and the box-and-whisker
plot, which also provide a visual idea of the distribution of the data. However, in practice, it is
also not easy to determine normality with certainty using these plots.


Figure 1. Histograms of frequencies for samples of 30 random values taken from different
probability distributions. Red line: Probability density function of a normal distribution with
parameters estimated from the sample statistics. Top plots: Standard normal distribution.
Middle plots: Standard uniform distribution. Bottom plots: Exponential distribution.

An alternative graphical method consists in plotting the empirical cumulative distribution
function (CDF) of the observed data. This plot is constructed by assigning a relative frequency
of 1/n to each observation in a sample of n observations, and plotting the cumulative relative
frequency versus the value of each observation. This plot can then be compared with the
typical sigmoid behavior described by Eq. 1.5. Examples of CDF plots for random data (30
observations) obtained from normal, uniform and exponential distributions are illustrated in
Figure 2. Again, the parameters of the normal distribution function are estimated from the
sample statistics. The differences are more clearly appreciated than with histograms but,
nevertheless, conclusions are not always easily obtained.
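The construction described above (rank-based cumulative relative frequencies, with normal parameters estimated from the sample) can be sketched as follows; the simple i/n plotting position is the one described in the text:

```python
import math

def empirical_cdf(sample):
    """Empirical CDF points: each observation gets relative frequency 1/n,
    so the observation with ascending rank i has cumulative frequency i/n."""
    n = len(sample)
    return [(x, (i + 1) / n) for i, x in enumerate(sorted(sample))]

def sample_stats(sample):
    """Sample average and sample standard deviation (n - 1 denominator),
    used to parameterize the reference normal CDF."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return mean, math.sqrt(var)
```

Plotting the pairs returned by `empirical_cdf` together with the normal CDF parameterized by `sample_stats` reproduces the kind of comparison shown in Figure 2.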

Figure 2. Empirical Cumulative Distribution Functions for samples of 30 random values taken
from different probability distributions. Red line: Cumulative probability function of a normal
distribution with parameters estimated from the sample statistics. Top plots: Standard normal
distribution. Middle plots: Standard uniform distribution. Bottom plots: Exponential
distribution.

The visualization of differences in the CDF plot can be facilitated by modifying the scale of one
axis. This scale modification transforms the sigmoid cumulative probability function into a
straight line, making deviations from normality easier to observe. This type of plot was initially
done using normal probability paper, and nowadays it is easily created by statistical software. It
is possible to transform either the observed data or the cumulative probability.

The observed data can be transformed into a theoretical normal cumulative probability using
Eq. (1.5), assuming either \mu ≈ x̄ and \sigma ≈ s, or any other suitable estimates of these parameters.


Therefore, the median value§ of the empirical cumulative distribution function (EDF) is
compared to the theoretical normal cumulative probability, which would result in a straight line
with unit slope and zero intercept if the data is perfectly normal. This plot is also denoted as a
normal P-P (probability-probability) plot, and some examples are presented in Figure 3.

Figure 3. P-P plots for samples of 30 random values taken from different probability
distributions. Red line: Normal behavior. Top plots: Standard normal distribution. Middle plots:
Standard uniform distribution. Bottom plots: Exponential distribution.

On the other hand, the transformation employed for the cumulative probability consists of
calculating an equivalent standard value (z) under the assumption of normality. The values of
z can be obtained by solving Eq. (1.5) for the standard normal distribution (\mu = 0, \sigma = 1), that is:

z_i = \sqrt{2}\, \operatorname{erf}^{-1}(2 p_i - 1) = \Phi^{-1}(p_i)

(2.1)

§
The median cumulative probability assigned to each observation is determined from its ascending rank.
However, other heuristic expressions can be used for assigning a value to the EDF.


where p_i is the median cumulative probability assigned to the observation with ascending rank i,
erf⁻¹ is the inverse error function, and Φ⁻¹ is the inverse cumulative probability function of the
standard normal distribution. Various software tools and online calculators** are available for the
inverse error function. There is also a function available in MS Excel for directly calculating Φ⁻¹
(without calculating the inverse error function): NORM.S.INV. The plot obtained using this
transformation is denoted as a normal
Q-Q (quantile-quantile) plot. Examples of Q-Q plots are presented in Figure 4. The straight line
observed in this plot can be obtained from the mean and standard deviation estimated from
the sample, or by linear regression. When linear regression is used, the coefficients can be
considered as estimates of the mean (intercept) and standard deviation (slope) of the normal
distribution. While some plots clearly show whether the data is normal or not, other plots are
more difficult to classify. This is where analytical tests become useful.
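The inverse standard normal transformation is available in most statistical software; for illustration, a dependency-free sketch that inverts the standard normal CDF by bisection and builds Q-Q plot coordinates. The (r + 0.5)/n plotting position used here is one common heuristic, not necessarily the one used in this report:

```python
import math

def inv_std_normal(p, tol=1e-10):
    """Inverse CDF of the standard normal distribution, obtained by
    bisection on 0.5 * (1 + erf(z / sqrt(2))) = p (cf. Eq. 2.1)."""
    lo, hi = -10.0, 10.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def qq_points(sample):
    """(theoretical standard-normal quantile, observed value) pairs for a
    normal Q-Q plot, using the (r + 0.5)/n plotting position."""
    n = len(sample)
    return [(inv_std_normal((r + 0.5) / n), x)
            for r, x in enumerate(sorted(sample))]
```

Regressing the observed values on the theoretical quantiles then yields the intercept and slope interpreted as estimates of the mean and standard deviation.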

Figure 4. Q-Q plots for samples of 30 random values taken from different probability
distributions. Red line: Normal behavior. Top plots: Standard normal distribution. Middle plots:
Standard uniform distribution. Bottom plots: Exponential distribution.

**
See for example: https://www.wolframalpha.com/input/?i=InverseErf


2.2. Tests based on the Empirical Cumulative Distribution Function

These tests are based on the plots presented in Figure 2. In this case, the differences between
the empirical cumulative distribution function and the theoretical normal cumulative
probability are quantified using a suitable test statistic. The distribution of the particular test
statistic must be known, so that it can be used to quantify the probability of erroneously
rejecting the normal hypothesis. This probability is usually denoted as the P-value. If this
probability is larger than the maximum acceptable risk (\alpha) of erroneously rejecting the
normal hypothesis, then the distribution is considered normal. In other words, P-values larger
than \alpha indicate that the empirical evidence does not provide enough support to the non-
normal hypothesis.

The formulation of this type of test is as follows:

T = g\left( \hat{F}(x_i) - F_N(x_i) \right)

(2.2)

where T is the test statistic, \hat{F}(x_i) is the empirical cumulative distribution function evaluated at
observation x_i, F_N(x_i) is the theoretical normal cumulative probability evaluated at observation x_i,
and g is a suitable function.

In this category we can find many different tests, including the Cramér-von Mises (CVM) test
[5], the Kolmogorov-Smirnov (KS) test [5,6], the Anderson-Darling (AD) improvement of the
CVM test [7], the Kuiper (KU) [8], Watson (WA) [9], and Ajne (AJ) [10] tests based on random
points on a circle, the Lilliefors (LF) modification of the KS test [11], the Frosini (FRO) test [12],
and Bakshaev’s N-distance (ND) test [13].

The formulation of the test statistic can be further generalized into:

T = g\left( \hat{F}(x_i), F_N(x_i) \right)

(2.3)

where the relation between the empirical and the theoretical normal distribution is not limited
to the subtraction operation. Tests considering the formulation given in Eq. (2.3) include the
Glen-Leemis-Barr (GLM) test [14], the likelihood-ratio tests (ZA and ZC) proposed by Zhang and
Wu [15], and the G-test (GT) of Chen and Ye [16], involving two adjacent data points in the
calculation of the test statistic.
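As a concrete instance of the formulation in Eq. (2.2), the Kolmogorov-Smirnov statistic takes g as the maximum absolute difference between the empirical and theoretical CDFs. A minimal sketch (with \mu and \sigma assumed known; the Lilliefors variant instead estimates them from the sample):

```python
import math

def ks_statistic(sample, mu, sigma):
    """Kolmogorov-Smirnov distance between the empirical CDF and the
    normal CDF with parameters mu and sigma: an instance of Eq. (2.2)
    with g taken as the maximum absolute difference."""
    n = len(sample)
    d = 0.0
    for i, x in enumerate(sorted(sample)):
        f = 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
        # The empirical CDF jumps from i/n to (i + 1)/n at x, so both
        # one-sided gaps must be checked.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d
```

A P-value would then be obtained from the (tabulated or approximated) null distribution of the statistic, which is omitted here.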

2.3. Tests based on Regression and Correlation (Order Statistics)

Now, considering the linear behavior observed in the Q-Q plots described in Section 2.1, Shapiro
and Wilk [17] proposed using regression analysis for testing normality (SW test). The resulting
test statistic (W) is then used to calculate the probability (P-value) of erroneously rejecting the
normal hypothesis. Their method requires the calculation of weights for each sample, and
these values depend on the sample size. They provided the weights for sample sizes between 3
and 50. D’Agostino [18] combined the Shapiro-Wilk idea of linear regression with Downton’s
estimator of the standard deviation [19] to yield a new test (DAD), suitable for moderate to
large samples and not requiring tables of pre-calculated values. Then, Shapiro and Francia [20]
also provided an improvement of the original SW test, suitable for large samples (SF test).
Many other normality tests emerged based on regression analysis, including the test by
De Wet & Venter (DWV) [21] based on the coefficient of determination†† in a Q-Q plot; Filliben’s
test (FB) [22] based on the correlation coefficient‡‡ in a Q-Q plot based on the medians rather
than on the means of the order statistic; the Hegazy and Green [23] tests based on the mean
absolute deviation (HG1) and mean square deviation (HG2) between observed and theoretical
quantiles; the Weisberg-Bingham (WB) modification of the SF test [24]; the Ryan-Joiner (RJ)
[25] and Royston (ROY) [26] approximations for the calculation of the SW test; the Gan-Koehler
[27] tests (GK and GK0) based on the linear regression in P-P plots; the Chen-Shapiro (CS)
modification of the WB test [28]; the Rahman-Govindarajulu (RG) modification of the SW test
[29]; the del Barrio-Cuesta-Matrán-Rodríguez (BCMR) approach based on the Wasserstein
distance between random variables [30]; Zhang’s Q-test (ZH) based on the logarithm of the
ratio between two different linear combinations [31]; and Coin’s (CO) polynomial approach
extending the analysis of Q-Q plots beyond linear regression [32].

2.4. Tests based on Moments

Another important characteristic property of distribution models is the behavior of the raw and
central moments of the distribution. The raw moments of the distribution (cf. Eq. 1.4) are
defined in general as follows:

m_k(X) = E(X^k) = \int_{-\infty}^{\infty} x^k f(x)\, dx

(2.4)

where m_k is the k-th raw moment operator, E is the expected value operator, and f is the
probability density function of the random variable.

In addition, the central moments are defined as follows:

††
Because it has the structure of the determination coefficient even though it does not arise from
sampling a bivariate distribution.
‡‡
Because it has the structure of the correlation coefficient even though it does not arise from sampling
a bivariate distribution.


\mu_k(X) = E\left( (X - E(X))^k \right) = \int_{-\infty}^{\infty} (x - \mu)^k f(x)\, dx

(2.5)

where \mu_k is the k-th central moment operator, and \mu is the mean value of the random variable.

While the moments considered are usually positive integers, this concept can be extended to
real values (including negative real values in some cases). By doing this, certain distributions,
like for example any unbounded normal distribution, will result in complex moment values [33].
Figure 5 shows the complex behavior of the raw moments for three different distributions: the
standard normal distribution and two different uniform distributions. Clearly, this type of plot
may allow distinguishing between different distribution models. However, analytical tests
based on the behavior of the moments are usually restricted to one or two specific moments of
the distribution.

Figure 5. Three-dimensional representation of the raw moments for different distributions. The
order of the moment is presented in the vertical axis. Left: Standard normal distribution.
Center and right: Two different uniform distributions [33].

D’Agostino and Pearson [34] explored different tests based on moments, including a test based
on the skewness§§ (DAS), a test based on the kurtosis*** (DAK), and tests based on the
combination of both, like the K² test (DAP) and the Pearson-D’Agostino-Bowman test (PDAB)
[35]. Several tests are also based either on skewness, on kurtosis, or on a combination of both,
including the Jarque-Bera (JB) combined test [36], the adjusted Jarque-Bera (AJB) test
proposed by Urzúa [37], the Bonett-Seier (BS) kurtosis test [38], the Cabaña-Cabaña skewness
(CCS) and kurtosis (CCK) tests [39], the Brys-Hubert-Struyf (BHS) skewness test [40], the
combined BHS skewness-BS kurtosis (BHSBS) test proposed by Romão, Delgado and Costa
[41], the Doornik-Hansen (DH) modification of the JB test [42], and the robust JB test (RJB)
proposed by Gel and Gastwirth [43].

§§
Third central moment expressed as a dimensionless value by dividing by the third power of the
standard deviation, which is the square root of the variance (second central moment).
***
Fourth central moment expressed as a dimensionless value by dividing by the squared second
central moment.


In addition, there are tests based on moments other than skewness and kurtosis. Hosking [44]
proposed the use of L-moments (HLM) instead of conventional moments, which are linear
combinations of order statistics. Elamir and Seheult [45] proposed a trimmed version of the
L-moments (TLM) in order to reduce the effect of extreme values on the test. Bontemps and
Meddahi (BM) [46] used the generalized method of moments for testing the Stein conditions
of standard normality [47] applied to different moments. Finally, Desgagné and Lafaye [48]
proposed a test based on 2nd-power skewness and kurtosis (DL), instead of the conventional
3rd-power skewness and 4th-power kurtosis. They also proposed a directional test based only
on the 2nd-power kurtosis (DDL).

2.5. Other Analytical Tests

There are different strategies for testing normality other than the empirical distribution
function (EDF), regression-based methods and moment-based methods. For example, several
tests are based on the \chi^2 distribution, where the problem of testing normality is reduced to
comparing the frequency of observations to the theoretical normal frequency in a certain
number of equiprobable subintervals [49]. The most commonly used test of this kind is the
classical \chi^2 test (CHI2) proposed by Karl Pearson [50].

A different approach was proposed by Geary [51], using a test (GE) based on the ratio between
the mean deviation and the standard deviation. This test was later modified by Spiegelhalter
(SH) [52], particularly suited to test normality against symmetrical alternatives. In a similar
approach, Martinez and Iglewicz [53] used the ratio between two different variance estimators
(a biweight estimator) as test statistic (MI). Gel, Miao and Gastwirth [54] also proposed a ratio
test statistic (GMG) between the classical standard deviation and a robust, median-based
measure of spread.

Vasicek [55] formulated a normality test (VA) based on an estimation of the entropy of the
sample (which can also be considered as a measure of spread). Different tests have been
proposed following this entropy approach [56].

Finally, Epps and Pulley (EP) [57] suggested a test statistic based on an empirical moment-
generating characteristic function, integrating the concepts behind the EDF-based tests and
the moment-based tests.

3. Selecting the Best Test for Normality

More than 50 different statistical tests for normality have been mentioned in the previous
section. This again raises the question of which test should be used for determining the
normality of a population after observing a sample. A typical approach for comparing the
efficacy of normality tests is by determining their power (\Pi). The power of a normality test
can be interpreted as the ability to correctly identify a sample coming from a non-normal
distribution [3], expressed in terms of probability. The power of a normality test can be easily
estimated using a Monte Carlo simulation approach, where different samples of different sizes
are obtained from different non-normal distributions. The normality test is performed
considering a pre-defined significance level, and a conclusion about normality is obtained. The
proportion of correct rejections of the normal hypothesis is the power of the test. This
approach, introduced by Shapiro, Wilk and Chen in the 1960s [58], has been extensively used
to compare the performance of different normality tests. Unfortunately, the results obtained
by different authors might lead to somewhat different conclusions even when the normality
tests evaluated are the same, since no common standard is used for the design of power
comparisons [3].

Notice that the power of a normality test is also influenced by the sample size (n), the
significance level (\alpha) used, and the similitude (S) between the non-normal distribution being
tested and the best normal model describing the data. In addition, random sampling also
increases the uncertainty of the data, reducing the power of the test. Figure 6 shows the
empirical cumulative distribution functions obtained from uniform random samples of
different sizes, compared to the normal cumulative distribution function and the original
uniform cumulative distribution used to obtain the sample. For small sample sizes, it is difficult
to decide whether the data follows a normal distribution or not. As the sample size increases, it
becomes easier to identify that the data follows the uniform distribution and not the normal
distribution. Thus, it is not surprising that most normality tests present simulated powers of
almost 100% when large samples are considered.

The similitude between the normal and the non-normal distribution tested also influences the
power of the test. It is possible to quantify the similitude between two distributions as follows
[59]:

S = 1 - \frac{1}{2} \int_{-\infty}^{\infty} \left| f(x) - f_N(x) \right| dx

(3.1)
where f is the probability density of the tested distribution and f_N is always given by Eq. (1.1).

As the similitude between distributions increases, the power of normality tests decreases
because it is more difficult to distinguish between both distributions from the observation of a
random sample of identical size. Figure 7 shows the similitude between the normal distribution
and some representative non-normal distributions used in power tests.
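Under the common-area reading illustrated in Figure 7 (the overlap of two densities equals one minus half the integral of their absolute difference), the similitude can be estimated numerically; a midpoint-rule sketch:

```python
import math

def std_normal_pdf(x):
    """Standard normal density, Eq. (1.1) with mu = 0 and sigma = 1."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def uniform01_pdf(x):
    """Density of the standard uniform distribution on [0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def similitude(f, g, lo=-10.0, hi=10.0, steps=20000):
    """Common area under two densities, via the identity
    overlap = 1 - (1/2) * integral |f - g|, using the midpoint rule."""
    h = (hi - lo) / steps
    total = sum(abs(f(lo + (i + 0.5) * h) - g(lo + (i + 0.5) * h)) * h
                for i in range(steps))
    return 1.0 - 0.5 * total
```

For the standard normal against the standard uniform, the overlap reduces to the normal probability mass on [0, 1], about 0.34.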


Figure 6. Empirical cumulative distribution functions for random samples of different size
obtained from a uniform distribution. Blue diamonds: Median value of the empirical
cumulative relative frequency of the random sample. Red curve: Cumulative probability
function of a normal distribution with parameters estimated from the sample statistics. Green
line: Cumulative probability function of the original uniform distribution.

Figure 7. Similitude between normal and non-normal distributions represented as the common
area under the probability density function.


In this section, the results from a sample of 20 studies published between 1990 and 2021
comparing the power of different normality tests are summarized. These dates allow including
the comparison of recent tests. Usually the comparison is performed considering the average
power over a wide range of different non-normal distributions. However, considering the
effect of similitude on power, and in order to provide a fair comparison between the tests, the
values of power will be expressed relative to the average power of all tests considered (value
obtained using the same non-normal distribution, the same sample size and the same
significance level), as follows:

\Pi_{rel}(T_i; \alpha, n, f) = \frac{\Pi(T_i; \alpha, n, f)}{\langle \Pi(T; \alpha, n, f) \rangle}

(3.2)

where \Pi_{rel}(T_i; \alpha, n, f) is the relative power of test T_i, evaluated with a significance level \alpha
on a sample of size n randomly obtained from a non-normal distribution with probability
density f; \Pi(T_i; \alpha, n, f) is the corresponding absolute power of test T_i, and \langle \Pi(T; \alpha, n, f) \rangle
is the average absolute power of all tests evaluated at the same conditions (\alpha, n, f).

In this approach, differences in power are more relevant when the non-normal distribution is
more difficult to discriminate from the normal model (low power values). In this sense, only the
results obtained for small samples were considered, as the differences in power
between tests are more evident. The resulting relative powers are averaged for each test and
ranked in descending order for each study. The ranks obtained following this method are
presented in Table 1.

These results are summarized in Table 2. Different criteria were used to summarize the results
including: The best rank obtained in the different studies, the number of wins (whenever the
first place is obtained), the winning ratio (percentage of wins with respect to total number of
studies that included the test), and the average rank. The last criterion is not reliable because
the number of tests evaluated in each study was different. However, it can be used to break a
tie, if all other criteria have the same values. The data in Table 2 was sorted according to the
previous criteria, in the same priority order as they were presented. The overall rank thus
obtained from the studies considered indicates that the Shapiro-Wilk (SW) test is the best test
available for determining normality. The SW test was the winner in 7 out of 19 power comparisons
performed (36.8% winning ratio), and with an average rank of 2.9. The second place is for the
Anderson-Darling (AD) test, winning 3 out of 18 comparisons (16.7% winning ratio), and with
an average rank of 3.8. The third place was obtained by the Shapiro-Francia (SF) test, winning 2
out of 10 comparisons (20.0% winning ratio), and with an average rank of 5.7. Considering that
the SF test is an extension of the SW test, their combined winning score is 9 out of 20
comparisons (45.0% winning ratio), clearly indicating that their approach is definitely the overall
winner.


Table 1. Ranks of average relative power obtained in different studies reported in the scientific
and academic literature for different normality tests between 1990 and 2021.
Test | Type | Rank in each referenced comparison study: [27] [32] [41] [60] [61] [62] [63] [64] [65] [66] [67] [68] [69] [70] [71] [72] [73] [74] [75] [76]
AD E 3 1 10 2 4 3 3 2 1 2 3 1 4 9 4 3 5 8
AJB M 7 6 11 1
AJN E 5
BCMR R 12 5
BHS M 28 17 29
BHSBS M 11 1 24
BM M 27 13 25 9
BS M 4 3 12 21 22
CCK M 24 19
CCS M 26
CHI2 O 4 4 8 9 14 16 7 6 18
CO R 7 1 23
CS R 6 1 3
CVM E 5 5 4 3 5 3 11 5 16 9 26
DAD R 3 14 7 7 4 7 17 31
DAK M 7 13 2 10
DAP M 13 6 5 6 6 2 24 2 16
DAS M 9 8 4 7
DDL M 12
DH M 19 8 10 15
DL M 6 7
DWV R 8
EP O 16 10
FB R 20 6 9 2 14
FRO E 12 12
GE O 7 12 18 15
GK R 7
GK0 R 6
GLB E 8 11
GMG O 4 3 20 2
GT E 16 22
HG1 R 10 27
HG2 R 2 23
HLM M 5 1
JB M 8 25 9 1 6 5 4 6 11 8 15 4 8 5 8 21
KS E 9 5 21 8 8 2 7 4 7 10 7 2 17 1 11 27
KU E 6 6 7 18
LF E 3 3 5 9 5 11 14 6 7 17
MI O 29 25
ND E 30
PDAB M 2
RG R 2 8 12
RJ R 3
RJB M 23 10 8 20
ROY R 3 4
SF R 18 3 3 2 1 2 5 1 9 13
SH O 15 13
SW R 1 2 7 1 2 9 5 1 1 1 2 4 2 1 5 3 1 4 4
TLM M 22
VA O 8 2
WA E 4 6 14
WB R 4 1 3 19
ZA E 6 15 6
ZC E 9
ZH R 17 5 28
Test Types: E (Empirical distribution function), R (Regression/Correlation), M (Moments), O (Other)


Table 2. Results summary for the ranks of normality tests based on average relative power.
Columns: Overall Rank | Test | Extended description | Reference | Type | Best Rank | #Wins | Winning Ratio | Average Rank | #Studies
1 SW Shapiro-Wilk [17] Regression 1 7 36.8% 2.9 19
2 AD Anderson-Darling [7] EDF 1 3 16.7% 3.8 18
3 SF Shapiro-Francia [20] Regression 1 2 20.0% 5.7 10
4 HLM Hosking L-moments [44] Moments 1 1 50.0% 3.0 2
5 CS Chen-Shapiro [28] Regression 1 1 33.3% 3.3 3
6 CO Coin [32] Regression 1 1 33.3% 10.3 3
7 BHSBS Brys–Hubert–Struyf–Bonett–Seier [41] Moments 1 1 33.3% 12.0 3
8 AJB Adjusted Jarque-Bera [37] Moments 1 1 25.0% 6.3 4
9 WB Weisberg-Bingham [24] Regression 1 1 25.0% 6.8 4
10 JB Jarque-Bera [36] Moments 1 1 6.3% 9.0 16
11 KS Kolmogorov-Smirnov [6] EDF 1 1 6.3% 9.1 16
12 PDAB Pearson-D'Agostino-Bowman [35] Moments 2 0 0.0% 2.0 1
13 VA Vasicek [55] Other 2 0 0.0% 5.0 2
14 GMG Gel-Miao-Gastwirth [54] Other 2 0 0.0% 7.3 4
15 RG Rahman-Govindarajulu [29] Regression 2 0 0.0% 7.3 3
16 DAK D'Agostino-Pearson Kurtosis [34] Moments 2 0 0.0% 8.0 4
17 DAP D'Agostino-Pearson K2 [34] Moments 2 0 0.0% 8.9 9
18 FB Filliben [22] Regression 2 0 0.0% 10.2 5
19 HG2 Hegazy-Green 2 [23] Regression 2 0 0.0% 12.5 2
20 RJ Ryan-Joiner [25] Regression 3 0 0.0% 3.0 1
21 ROY Royston [26] Regression 3 0 0.0% 3.5 2
22 LF Lilliefors [11] EDF 3 0 0.0% 8.0 10
23 CVM Cramer-von Mises [5] EDF 3 0 0.0% 8.4 11
24 DAD D'Agostino D [18] Regression 3 0 0.0% 11.3 8
25 BS Bonett-Seier [38] Moments 3 0 0.0% 12.4 5
26 DAS D'Agostino-Pearson Skewness [34] Moments 4 0 0.0% 7.0 4
27 WA Watson [9] EDF 4 0 0.0% 8.0 3
28 CHI2 Chi2 [50] Other 4 0 0.0% 9.6 9
29 AJN Ajne [10] EDF 5 0 0.0% 5.0 1
30 BCMR del Barrio-Cuesta-Matrán-Rodríguez [30] Regression 5 0 0.0% 8.5 2
31 ZH Zhang [31] Regression 5 0 0.0% 16.7 3
32 GK0 Gan-Koehler k02 [27] Regression 6 0 0.0% 6.0 1
33 DL Desgagné-Lafaye [48] Moments 6 0 0.0% 6.5 2
34 ZA Zhang-Wu ZA [15] EDF 6 0 0.0% 9.0 3
35 KU Kuiper [8] EDF 6 0 0.0% 9.3 4
36 GK Gan-Koehler k2 [27] Regression 7 0 0.0% 7.0 1
37 GE Geary [51] Other 7 0 0.0% 13.0 4
38 DWV De Wet-Venter [21] Regression 8 0 0.0% 8.0 1
39 GLB Glen-Leemis-Barr [14] EDF 8 0 0.0% 9.5 2
40 DH Doornik-Hansen [42] Moments 8 0 0.0% 13.0 4
41 RJB Robust Jarque-Bera [43] Moments 8 0 0.0% 15.3 4
42 ZC Zhang-Wu ZC [15] EDF 9 0 0.0% 9.0 1
43 BM Bontemps-Meddahi [46] Moments 9 0 0.0% 18.5 4
44 EP Epps-Pulley [57] Other 10 0 0.0% 13.0 2
45 HG1 Hegazy-Green 1 [23] Regression 10 0 0.0% 18.5 2
46 DDL Directional Desgagné-Lafaye [48] Moments 12 0 0.0% 12.0 1
47 FRO Frosini [12] EDF 12 0 0.0% 12.0 2
48 SH Spiegelhalter [52] Other 13 0 0.0% 14.0 2
49 GT G Test [16] EDF 16 0 0.0% 19.0 2
50 BHS Brys–Hubert–Struyf [40] Moments 17 0 0.0% 24.7 3
51 CCK Cabaña-Cabaña Kurtosis [39] Moments 19 0 0.0% 21.5 2
52 TLM Trimmed L-moments [45] Moments 22 0 0.0% 22.0 1
53 MI Martinez–Iglewicz [53] Other 25 0 0.0% 27.0 2
54 CCS Cabaña-Cabaña Skewness [39] Moments 26 0 0.0% 26.0 1
55 ND N-distance [13] EDF 30 0 0.0% 30.0 1


Another interesting result was obtained by Hosking's L-moments (HLM) test, with a winning ratio of 50% (Table 2). However, it was only tested twice, which is not enough to consider it the best test, even though in both comparisons the HLM test performed better than the SW or SF tests. Also, the trimmed L-moments approach (TLM), related to the HLM test, did not perform well in its single appearance. Nevertheless, Hosking obtained a very good correlation between the power of the SW test and the L-kurtosis moment [77], indicating that both methods provide similar results. Regarding the average rank, the Pearson-D'Agostino-Bowman (PDAB) test obtained the best value (2.0, Table 2), but with only one comparison; precisely in that study, the SW test obtained the first place. All these facts support the decision of considering the SW test as the overall winner.

A summary of the results obtained grouped by test types is presented in Table 3 indicating that,
in general, regression tests are the best type of normality tests.

Table 3. Results summary for the ranks of normality tests grouped by type
Criteria \ Type EDF Regression Moments Other
Best Rank 1 1 1 2
#Wins 4 12 4 0
Winning ratio 21.1% 60.0% 22.2% 0.0%
Average Rank 8.1 6.5 11.3 11.8
#Studies 19 20 18 12
#Tests evaluations 74 70 62 33

4. The Shapiro-Wilk Test of Normality

Shapiro and Wilk [17] assumed that if a certain random variable $x$ is normal, then it can be expressed in general as follows:

$$x = \mu + \sigma z \qquad (4.1)$$

where $\mu$ and $\sigma$ are parameters representing the mean and standard deviation values of $x$, and $z$ is a standard normal random variable (with a mean value of $0$ and standard deviation $1$). Thus, in a sample of $n$ observations, each ordered experimental observation $x_{(i)}$ will be related to an ordered realization $z_{(i)}$ in a random sample of $n$ elements obtained from the standard normal distribution:

$$x_{(i)} = \mu + \sigma z_{(i)} \qquad (4.2)$$
They then defined the test statistic as follows:


$$W = \frac{b^2}{S^2} \qquad (4.3)$$

where $b$ is proportional to the best linear unbiased estimate of the slope of a linear regression between the experimental values observed ($x_{(i)}$) and the expected values of the standard normal order statistics ($m_i = E[z_{(i)}]$):

$$b = \sum_{i=1}^{n} a_i x_{(i)} \qquad (4.4)$$

$S^2$ is the sum of squares of the experimental observations:

$$S^2 = \sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2 \qquad (4.5)$$

and the vector of coefficients $a = (a_1, \ldots, a_n)$ is determined as follows:

$$a = \frac{m^{T} V^{-1}}{\left(m^{T} V^{-1} V^{-1} m\right)^{1/2}} \qquad (4.6)$$

where $m$ is the vector of expected standard normal order statistics and $V$ is their covariance matrix; the $a_i$ are the normalized best linear unbiased coefficients, as given by Sarhan and Greenberg [78].
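In practice, the tabulated coefficients $a_i$ do not need to be computed by hand, since the exact test is implemented in standard statistical software. The following is a minimal sketch, assuming Python with SciPy (equivalent to the shapiro.test function in R used later in this report):

```python
# Sketch: computing the exact Shapiro-Wilk statistic W and its P-value
# with SciPy, whose implementation handles the coefficients a_i of
# Eq. (4.6) internally (equivalent to R's shapiro.test).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=2.0, size=50)  # a sample from a normal population

W, p = stats.shapiro(x)
# W close to 1 supports normality; a small p-value rejects it.
```

For a truly normal sample of this size, $W$ is typically well above 0.9.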

An empirical cumulative distribution function of $W$ was obtained for each sample size, which is then used to calculate the $P$-value of the normality test. Royston [26] approximated the distribution function of $W$ as a cumulative standard normal distribution function using the following transformations (also as a function of sample size $n$):

$$z_W = \frac{g(W) - \mu_W}{\sigma_W} \qquad (4.7)$$

where

$$g(W) = \begin{cases} -\ln\left(\gamma - \ln(1 - W)\right), & 4 \le n \le 11 \\ \ln(1 - W), & 12 \le n \le 2000 \end{cases} \qquad (4.8)$$

with $\gamma = -2.273 + 0.459\,n$,

$$\mu_W = \begin{cases} 0.5440 - 0.39978\,n + 0.025054\,n^2 - 0.0006714\,n^3, & 4 \le n \le 11 \\ -1.5861 - 0.31082 \ln n - 0.083751 \left(\ln n\right)^2 + 0.0038915 \left(\ln n\right)^3, & 12 \le n \le 2000 \end{cases} \qquad (4.9)$$

$$\sigma_W = \begin{cases} \exp\left(1.3822 - 0.77857\,n + 0.062767\,n^2 - 0.0020322\,n^3\right), & 4 \le n \le 11 \\ \exp\left(-0.4803 - 0.082676 \ln n + 0.0030302 \left(\ln n\right)^2\right), & 12 \le n \le 2000 \end{cases} \qquad (4.10)$$


Then, the $P$-value of the normality test is simply determined as:

$$P = 1 - \Phi(z_W) \qquad (4.11)$$

where $\Phi$ is the standard normal cumulative distribution function.

Figure 8 shows the behavior of the $P$-value of the SW test as a function of the sample size ($n$) and the corresponding test statistic value ($W$). The original empirical values presented by Shapiro and Wilk [17] are compared to Royston's approximation [26]. In general, it is a very good approximation, although some deviations are observed for higher $P$-values in large sample sizes.

Figure 8. $P$-value as a function of sample size and statistic value in the SW normality test. Black line and dots: Original Shapiro-Wilk data [17]. Blue dots: Royston approximation (Eq. 4.7 – 4.11) [26].
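Royston's transformation can be sketched in a few lines of code. The polynomial coefficients below are the published AS R94 values for the $12 \le n \le 2000$ branch, quoted from Royston's paper rather than from this report, so they should be verified against [26] before serious use:

```python
# Sketch of Royston's normal approximation for the SW p-value
# (Eqs. 4.7-4.11), valid for 12 <= n <= 2000. Coefficients are the
# published AS R94 values; verify against Royston (1992) before use.
import math
from statistics import NormalDist

def royston_p(W: float, n: int) -> float:
    if not (12 <= n <= 2000):
        raise ValueError("this branch of the approximation needs 12 <= n <= 2000")
    ln_n = math.log(n)
    mu = -1.5861 - 0.31082 * ln_n - 0.083751 * ln_n**2 + 0.0038915 * ln_n**3
    sigma = math.exp(-0.4803 - 0.082676 * ln_n + 0.0030302 * ln_n**2)
    z = (math.log(1.0 - W) - mu) / sigma     # Eq. 4.7 with g(W) = ln(1 - W)
    return 1.0 - NormalDist().cdf(z)         # Eq. 4.11

p_low, p_high = royston_p(0.95, 50), royston_p(0.99, 50)
# A larger W (closer to the normal ideal) must give a larger p-value.
```

For $n = 50$, a statistic of $W = 0.95$ falls near the usual rejection region, while $W = 0.99$ is clearly consistent with normality.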

5. Alternative Implementation of the Shapiro-Wilk Test of Normality

5.1. Approximated Calculation of $W$

One of the main disadvantages of the Shapiro-Wilk test of normality is that it is only readily available in specialized statistical software, because it requires some cumbersome calculations involving tabulated coefficients. Even though Royston's approximation was an important step toward simplifying the calculations, non-statistician users requiring a normality test might still find it difficult to use or implement unless they have access to statistical software.


Perhaps the most difficult part of the SW test is the calculation of $b$. Shapiro and Wilk [17] suggested approximating the term $b$ by the unbiased estimate of the standard deviation $\hat{\sigma}$ using a sample-size-dependent coefficient (denoted here as $\kappa_n$):

$$b \approx \kappa_n\, \hat{\sigma} \qquad (5.1)$$

Thus, using Eq. (5.1) in (4.3), $W$ can be approximately expressed as:

$$W \approx \frac{\kappa_n^2\, \hat{\sigma}^2}{\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2} \qquad (5.2)$$

Now, if the parameter $\sigma$ in Eq. (4.2) is estimated as the slope in the linear regression between the observed values ($x_{(i)}$) and the expected order statistics ($m_i$), we obtain:

$$\hat{\sigma} = \frac{\sum_{i=1}^{n} \left(x_{(i)} - \bar{x}\right)\left(m_i - \bar{m}\right)}{\sum_{i=1}^{n} \left(m_i - \bar{m}\right)^2} = \frac{\sum_{i=1}^{n} m_i x_{(i)}}{\sum_{i=1}^{n} m_i^2} \qquad (5.3)$$

The last expression is obtained considering that, by the symmetry of the standard normal distribution,

$$\bar{m} = \frac{1}{n}\sum_{i=1}^{n} m_i = 0 \qquad (5.4)$$

Notice that the intercept in the same regression is the estimate of $\mu$:

$$\hat{\mu} = \bar{x} - \hat{\sigma}\,\bar{m} = \bar{x} \qquad (5.5)$$

and the determination coefficient of the regression is:

$$R_m^2 = \frac{\left(\sum_{i=1}^{n} m_i x_{(i)}\right)^2}{\left(\sum_{i=1}^{n} \left(x_{(i)} - \bar{x}\right)^2\right)\left(\sum_{i=1}^{n} m_i^2\right)} \qquad (5.6)$$

Thus, using Eq. (5.3) and (5.6) in Eq. (5.2), the test statistic can be approximated by:

$$W \approx \frac{\kappa_n^2\, R_m^2}{\sum_{i=1}^{n} m_i^2} \qquad (5.7)$$

On the other hand, the expected order statistics can be determined as follows:


$$m_i = \frac{\displaystyle\int_{\Phi^{-1}\left(\frac{i-1}{n}\right)}^{\Phi^{-1}\left(\frac{i}{n}\right)} z\,\phi(z)\,dz}{\displaystyle\int_{\Phi^{-1}\left(\frac{i-1}{n}\right)}^{\Phi^{-1}\left(\frac{i}{n}\right)} \phi(z)\,dz} = \frac{\phi\!\left(\Phi^{-1}\!\left(\frac{i-1}{n}\right)\right) - \phi\!\left(\Phi^{-1}\!\left(\frac{i}{n}\right)\right)}{1/n} \qquad (5.8)$$

where $\Phi^{-1}(p)$ is a standard normal quantile with cumulative probability $p$, $i$ is the ascending rank of $x_i$, $\phi$ and $\Phi$ are the probability density function and cumulative probability function of the standard normal distribution, respectively, and $\Phi^{-1}$ is the inverse function of the cumulative probability for the standard normal distribution.

Using Eq. (2.1), Eq. (5.8) becomes:

$$m_i = n \left( \phi\!\left(\Phi^{-1}\!\left(\tfrac{i-1}{n}\right)\right) - \phi\!\left(\Phi^{-1}\!\left(\tfrac{i}{n}\right)\right) \right) \qquad (5.9)$$

Therefore:

$$\sum_{i=1}^{n} m_i^2 = \sum_{i=1}^{n} n^2 \left( \phi\!\left(\Phi^{-1}\!\left(\tfrac{i-1}{n}\right)\right) - \phi\!\left(\Phi^{-1}\!\left(\tfrac{i}{n}\right)\right) \right)^2 \qquad (5.10)$$

The sum in Eq. (5.10) can be approximated using an empirical, sample size-dependent expression, denoted here as $f(n)$ (cf. Figure 9):

$$\sum_{i=1}^{n} n^2 \left( \phi\!\left(\Phi^{-1}\!\left(\tfrac{i-1}{n}\right)\right) - \phi\!\left(\Phi^{-1}\!\left(\tfrac{i}{n}\right)\right) \right)^2 \approx f(n) \qquad (5.11)$$

Figure 9. Empirical approximation (5.11) for $\sum m_i^2$. Blue diamonds: Values calculated using the left-hand expression. Dashed red line: Right-hand approximation.
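The slice-average form of the mean order statistics is straightforward to compute. A minimal sketch, assuming SciPy for $\phi$ and $\Phi^{-1}$; the boundary ranks are handled automatically because $\phi\left(\Phi^{-1}(p)\right)$ vanishes at $p = 0$ and $p = 1$:

```python
# Sketch of the mean order statistics of Eq. (5.9):
# m_i = n * (phi(Phi^-1((i-1)/n)) - phi(Phi^-1(i/n))),
# i.e., the average of z over the i-th probability slice of width 1/n.
import numpy as np
from scipy.stats import norm

def mean_order_stats(n: int) -> np.ndarray:
    p = np.arange(n + 1) / n    # slice boundaries 0, 1/n, ..., 1
    f = norm.pdf(norm.ppf(p))   # phi(Phi^-1(p)); equals 0 at p = 0 and p = 1
    return n * (f[:-1] - f[1:])  # one value per rank i = 1..n

m = mean_order_stats(20)
# The m_i are antisymmetric around zero and increase with rank.
```

By construction the $m_i$ telescope to a zero sum, consistent with Eq. (5.4).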


Thus, Eq. (5.10) becomes:

$$\sum_{i=1}^{n} m_i^2 \approx f(n) \qquad (5.12)$$

and (Eq. 5.7):

$$W \approx c(n)\, R_m^2 \qquad (5.13)$$

where

$$c(n) = \frac{\kappa_n^2}{f(n)} \qquad (5.14)$$

is a sample size-dependent coefficient.

Filliben [22] suggested using the median of $z_{(i)}$ instead of its expected value as order statistics:

$$M_i = \Phi^{-1}(u_i), \qquad u_i = \begin{cases} 1 - 0.5^{1/n}, & i = 1 \\ \dfrac{i - 0.3175}{n + 0.365}, & i = 2, \ldots, n-1 \\ 0.5^{1/n}, & i = n \end{cases} \qquad (5.15)$$

Thus, another possible approximation of $W$ would be:

$$W \approx C(n)\, R_M^2 \qquad (5.16)$$

where

$$R_M^2 = \frac{\left(\sum_{i=1}^{n} M_i x_{(i)}\right)^2}{\left(\sum_{i=1}^{n} \left(x_{(i)} - \bar{x}\right)^2\right)\left(\sum_{i=1}^{n} M_i^2\right)} \qquad (5.17)$$
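Filliben's median order statistics and the resulting determination coefficient can be sketched as follows (assuming SciPy; the function names are illustrative):

```python
# Sketch of Filliben's median order statistics (Eq. 5.15) and the
# resulting determination coefficient R_M^2 of Eq. (5.17).
import numpy as np
from scipy.stats import norm

def filliben_medians(n: int) -> np.ndarray:
    u = (np.arange(1, n + 1) - 0.3175) / (n + 0.365)  # Filliben plotting positions
    u[0] = 1.0 - 0.5 ** (1.0 / n)
    u[-1] = 0.5 ** (1.0 / n)
    return norm.ppf(u)                                # M_i = Phi^-1(u_i)

def r2_order(x: np.ndarray, m: np.ndarray) -> float:
    xs = np.sort(x)
    num = (m @ xs) ** 2                               # (sum M_i x_(i))^2
    den = np.sum((xs - xs.mean()) ** 2) * np.sum(m**2)
    return num / den

rng = np.random.default_rng(7)
x = rng.normal(size=100)
W_approx = r2_order(x, filliben_medians(len(x)))      # Eq. (5.19): W ~ R_M^2
```

For a normal sample, $R_M^2$ lands very close to 1, mimicking the behavior of $W$; the same `r2_order` helper works unchanged with the mean order statistics of Eq. (5.9).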

5.2. Model Identification using Monte Carlo Simulation

The proposed relationships between $W$ and the determination coefficients were tested by Monte Carlo simulation. Six different distributions were used: i) the type I standard normal distribution (an explanation of the different types of standard distributions is presented in [79]), ii) the type III standard uniform distribution, iii) the type II standard exponential distribution, iv) the type II $\chi^2$ distribution with one degree of freedom, v) the type II standard Maxwell-Boltzmann distribution [80], and vi) the log-normal distribution of the type I standard normal random variable. Random samples of different sizes were obtained from each distribution, with multiple independent random samples generated for each sample size of the standard normal distribution and of each non-normal distribution. The SW normality test was performed for each sample using the function


shapiro.test available in R (CRAN project). This function provides the value of the SW test statistic $W$, and the corresponding $P$-value obtained using the Royston approximation [26]. Then, Eq. (5.6) and (5.17) were used to calculate $R_m^2$ and $R_M^2$. The $W$ values obtained in the different samples, as well as the corresponding $P$-values, spanned practically their entire possible ranges.

Figure 10 compares the $W$ value of the SW test with the determination coefficients considered in the previous section. Both coefficients are shown to be potential estimators of $W$. In addition, the mean order statistics were able to estimate $W$ with less variation than the median order statistics.

Figure 10. Empirical relationship obtained for $W$ vs. $R_m^2$ and $R_M^2$, determined by Monte Carlo simulation. Left plot: Mean order statistic. Right plot: Median order statistic. Red line: Zero-intercept linear regression model.

Figure 11. Empirical relationship obtained for $W/R^2$ as a function of $n$, determined by Monte Carlo simulation. Blue diamonds: Mean order statistic. Green diamonds: Median order statistic. Red line: Data fitted using an exponential decay model.


The coefficients $c(n)$ and $C(n)$ relating $W$ and $R^2$ are expected to be functions of sample size. Figure 11 shows the effect of $n$ on the $W/R^2$ ratio for both order statistics. The data obtained can be approximately fitted using an exponential decay model. However, as can be seen in the plot, such a model is not statistically significant. Thus, we can conclude that:

$$c(n) \approx C(n) \approx 1 \qquad (5.18)$$

And therefore,

$$W \approx R_m^2 \approx R_M^2 \qquad (5.19)$$

Let us now consider the effect of these approximations on the estimation of the $P$-value for the normality test. Figure 12 shows the relation between the $P$-values calculated by the Royston approximation using the values of $W$, and the $P$-values calculated by the Royston expressions using the approximations $R_m^2$ and $R_M^2$. Even though the overall trend is preserved, the $P$-value data is more disperse and heteroscedastic compared to the test statistic data. It is therefore important to determine the effect of the variability introduced by the approximations on the error of the normality test. It is expected, however, that some compensation of errors might occur due to the stochastic nature of these deviations. Figure 13 and Figure 14 show the effect of the mean and median order statistics approximations on the Type I and Type II error of the normality test.

Figure 12. Empirical relationship obtained for $P(R_m^2)$ and $P(R_M^2)$ vs. $P(W)$, determined by Monte Carlo simulation. Left plot: Mean order statistic. Right plot: Median order statistic. Red line: $P(R^2) = P(W)$.

The type I error represents the fraction of tested samples obtained from normal distributions
which are erroneously identified as non-normal. On the other hand, the type II error represents
the proportion of tested samples obtained from non-normal distributions which are
erroneously identified as normal. For both approximations considered, the type I error is in
general lower than the corresponding value obtained in the original SW test, while the type II


error is higher. The overall effect on the total test error (type I error plus type II error) is illustrated in Figure 15. It can be clearly observed that a minimum total error exists, which is lower for the original SW test. For the SW test, the overall optimal significance level ($\alpha_{opt}$) is about 8%, while the optimal significance levels for the mean and median order statistic approximations are somewhat higher. These, of course, are overall optimal values. However, the total test error will also be affected by the sample size and by the nature of the non-normal distributions employed. Since the last factor is usually unknown in a real test, let us only consider the effect of sample size.
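The two error types can be illustrated with a small-scale Monte Carlo run, using scipy.stats.shapiro as the reference SW test (a sketch; the exponential alternative and sample size are arbitrary choices, not the full design of Section 5.2):

```python
# Small-scale illustration of the type I / type II errors discussed
# above, using scipy.stats.shapiro as the reference SW test.
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)
alpha, reps, n = 0.05, 300, 30

# Type I error: normal samples wrongly declared non-normal.
type1 = np.mean([shapiro(rng.normal(size=n)).pvalue < alpha
                 for _ in range(reps)])

# Type II error: exponential (non-normal) samples wrongly declared normal.
type2 = np.mean([shapiro(rng.exponential(size=n)).pvalue >= alpha
                 for _ in range(reps)])
```

As expected, the type I rate hovers near the nominal $\alpha$, while the type II rate depends entirely on the alternative distribution and the sample size.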

Figure 13. Type I normality test error as a function of the significance level ($\alpha$) employed, for two different sample sizes (left and right plots). Black lines: SW test results (shapiro.test function in R). Green lines: Mean order statistic approximation ($R_m^2$). Blue lines: Median order statistic approximation ($R_M^2$).

Figure 14. Type II normality test error as a function of the significance level ($\alpha$) employed, for two different sample sizes (left and right plots). Black lines: SW test results (shapiro.test function in R). Green lines: Mean order statistic approximation ($R_m^2$). Blue lines: Median order statistic approximation ($R_M^2$).


Figure 15. Total normality test error as a function of the significance level ($\alpha$) employed, for two different sample sizes (left and right plots). Black lines: SW test results (shapiro.test function in R). Green lines: Mean order statistic approximation ($R_m^2$). Blue lines: Median order statistic approximation ($R_M^2$).

Figure 16 summarizes the effect of sample size on the total test error for the original SW test and both proposed approximations. As the sample size increases, the minimum total test error is reduced, and the value of the optimal significance level also decreases. Figure 17 shows the effect of sample size on the optimal significance level estimated from the Monte Carlo simulation data. The data was fitted using an empirical decay model, denoted here as:

$$\alpha_{opt} = g(n) \qquad (5.20)$$

Figure 16. Total normality test error as a function of the size of the sample ($n$) and the significance level ($\alpha$) employed. Top left plot: Mean order statistic approximation ($R_m^2$). Top right plot: Median order statistic approximation ($R_M^2$). Bottom plot: Original SW test.


Figure 17. Optimal significance level ($\alpha_{opt}$) obtained by Monte Carlo simulation as a function of the size of the sample ($n$), considering different methods for estimating $W$ (Black diamond: SW test, Green circle: $R_m^2$, Blue triangle: $R_M^2$). Dashed red line: Empirical model (5.20).

In addition, the minimum total test error obtained as a function of sample size is graphically presented in Figure 18, and mathematically represented by an empirical expression of the form:

$$E_{min} = h(n) \qquad (5.21)$$

where $h$ is a fitted decreasing function of the sample size.

Figure 18. Minimum total test error ($E_{min}$) obtained by Monte Carlo simulation as a function of the size of the sample ($n$), considering different methods for estimating $W$ (Black diamond: SW test, Green circle: $R_m^2$, Blue triangle: $R_M^2$). Dashed red line: Empirical model (5.21).

Thus, the minimum sample size ($n_{min}$) required for achieving a maximum test error $E_{max}$ would be:

$$n_{min} = \left\lceil h^{-1}\left(E_{max}\right) \right\rceil \qquad (5.22)$$

where $\lceil \cdot \rceil$ represents the ceiling rounding operator.


The correlation given in Eq. (5.20) can be used to define an alternative method for quantifying the normality of the data, denoted here as the $N$-value (normality value):

$$N = \ln P - \ln \alpha_{opt}(n) = \ln\!\left(\frac{P}{\alpha_{opt}(n)}\right) \qquad (5.23)$$

A positive $N$-value indicates a normal distribution, whereas a negative or zero $N$-value represents a non-normal distribution. Higher $N$-values represent better fits of the normal model to the data sample.

The sample size-dependent critical value of $W$ at the optimal significance level (denoted as $W_{crit}$) can be determined by inverting the Royston transformation at $N$-value $= 0$ (i.e., $P = \alpha_{opt}$):

$$W_{crit}(n) = \begin{cases} 1 - \exp\!\left(\gamma - \exp\!\left(-\mu_W - \sigma_W\, \Phi^{-1}\!\left(1 - \alpha_{opt}\right)\right)\right), & 4 \le n \le 11 \\ 1 - \exp\!\left(\mu_W + \sigma_W\, \Phi^{-1}\!\left(1 - \alpha_{opt}\right)\right), & 12 \le n \le 2000 \end{cases} \qquad (5.24)$$

where $\mu_W$ and $\sigma_W$ are obtained from Eq. (4.9) and (4.10), respectively.

The behavior described by Eq. (5.24) is illustrated in Figure 19. From this plot it is possible to conclude that, as a general rule of thumb, any value of $W$ below the critical curve corresponds to a non-normal distribution. For larger values of $W$, the $N$-value must be calculated using Eq. (5.23) in order to draw a definitive conclusion.

Figure 19. Critical value of $W$ ($W_{crit}$) calculated at the optimal significance level ($\alpha_{opt}$) using Eq. (5.24).
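The inversion behind Eq. (5.24) is easy to sketch for the large-sample branch (here with the published AS R94 coefficients, which should be checked against [26]; the significance level is left as a free parameter rather than the fitted $\alpha_{opt}(n)$):

```python
# Sketch of Eq. (5.24): the critical W at significance level alpha,
# obtained by inverting Royston's transformation (n >= 12 branch).
import math
from statistics import NormalDist

def w_crit(n: int, alpha: float) -> float:
    ln_n = math.log(n)
    mu = -1.5861 - 0.31082 * ln_n - 0.083751 * ln_n**2 + 0.0038915 * ln_n**3
    sigma = math.exp(-0.4803 - 0.082676 * ln_n + 0.0030302 * ln_n**2)
    z = NormalDist().inv_cdf(1.0 - alpha)  # P(W) = alpha  =>  z = Phi^-1(1 - alpha)
    return 1.0 - math.exp(mu + sigma * z)  # invert g(W) = ln(1 - W)

wc = w_crit(50, 0.05)
# Any observed W below wc would be rejected at the 5% level for n = 50.
```

A stricter significance level yields a smaller critical value, as expected.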

5.3. Implementation of the Approximated SW Test

Summarizing, it is possible to implement the SW test in a simple, approximated way using the
following procedure:


 Obtain the data sample ($x_i$) of size $n$ for testing normality. If a maximum error for the test has been defined ($E_{max}$), then the minimum sample size required is given by Eq. (5.22).

 Rank the data in ascending order, i.e., determine the rank $r_i = \sum_{j=1}^{n} \mathbf{1}\left(x_j \le x_i\right)$ of each observation, so that $x_{(1)} \le x_{(2)} \le \ldots \le x_{(n)}$.

 Calculate the mean ($m_i$) or median ($M_i$) order statistic for each data point in the sample:
o Mean order statistics (Eq. 5.8 or 5.9):
$$m_i = n \left( \phi\!\left(\Phi^{-1}\!\left(\tfrac{i-1}{n}\right)\right) - \phi\!\left(\Phi^{-1}\!\left(\tfrac{i}{n}\right)\right) \right)$$
o Median order statistics (Eq. 5.15):
$$M_i = \Phi^{-1}(u_i)$$
with $u_1 = 1 - 0.5^{1/n}$, $u_n = 0.5^{1/n}$, and $u_i = (i - 0.3175)/(n + 0.365)$ otherwise.

 Calculate the determination coefficient ($R_m^2$ or $R_M^2$) of the linear regression between the data in the sample and the corresponding order statistics (Eq. 5.6 or Eq. 5.17):
$$R^2 = \frac{\left(\sum_{i=1}^{n} m_i x_{(i)}\right)^2}{\left(\sum_{i=1}^{n} \left(x_{(i)} - \bar{x}\right)^2\right)\left(\sum_{i=1}^{n} m_i^2\right)}$$

 Estimate the SW test statistic using the determination coefficient of the previous regression (Eq. 5.19):
$$W \approx R^2$$
 If $W$ is below the critical value $W_{crit}(n)$ of Eq. (5.24), then the distribution is not normal and the procedure can be stopped (unless the $P$- and $N$-values are of interest).

 Estimate the $P$-value of the test using the Royston approximation [26] (Eq. 4.7 to 4.11). For $4 \le n \le 11$:
$$P = 1 - \Phi\!\left(\frac{-\ln\!\left(\gamma - \ln(1 - W)\right) - \mu_W}{\sigma_W}\right)$$
For $12 \le n \le 2000$:
$$P = 1 - \Phi\!\left(\frac{\ln(1 - W) - \mu_W}{\sigma_W}\right)$$

For larger samples ($n > 2000$), the $P$-value can be determined from a Monte Carlo simulation of the standard normal distribution using the same size as the data sample. A large number of standard normal random samples are generated, and the $W$ value is determined


for each sample (denoted as $W_{MC}$) using the same order statistic approximation used for the test. Then, the $P$-value is estimated as:

$$P \approx \Phi\left(z_W\right)$$

where

$$z_W = \frac{W - \langle W_{MC} \rangle}{\sqrt{\left\langle \left(W_{MC} - \langle W_{MC} \rangle \right)^2 \right\rangle}}$$

and $\langle \cdot \rangle$ denotes the average over the Monte Carlo samples. Alternatively, a smaller sub-sample (within the range of validity of Eq. 4.7 – 4.11) can be randomly extracted from the data, and only this sub-sample used to determine normality.

 Calculate the $N$-value of the test (Eq. 5.23):
$$N = \ln P - \ln \alpha_{opt}(n)$$

 If $N > 0$, then the data can be considered normal. Otherwise, it is non-normal.

The proposed algorithm has been implemented in R language (cf. Appendix) and also in MS
Excel® using the mean order statistic approximation (ForsChem Nitrogen) and the median
order statistic approximation (ForsChem Thorium). In addition, a simple calculation sheet is
also provided as supplementary data.
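The steps above can be assembled into a short sketch (Python rather than the report's R implementation; the Royston coefficients and the rule $W \approx R_m^2$ are as given earlier, and the optimal-$\alpha$ / $N$-value step is omitted because Eq. (5.20) is an empirical fit):

```python
# Assembled sketch of the approximated SW test of Section 5.3 using the
# mean order statistics: W is estimated as R_m^2 (Eq. 5.19) and the
# p-value via Royston's 12 <= n <= 2000 branch (coefficients as in [26]).
import math
import numpy as np
from scipy.stats import norm

def approx_sw(x: np.ndarray) -> tuple:
    n = len(x)
    xs = np.sort(x)
    p = np.arange(n + 1) / n
    f = norm.pdf(norm.ppf(p))            # phi(Phi^-1(.)), zero at the endpoints
    m = n * (f[:-1] - f[1:])             # mean order statistics, Eq. (5.9)
    r2 = (m @ xs) ** 2 / (np.sum((xs - xs.mean()) ** 2) * np.sum(m**2))
    ln_n = math.log(n)                   # Royston approximation, 12 <= n <= 2000
    mu = -1.5861 - 0.31082 * ln_n - 0.083751 * ln_n**2 + 0.0038915 * ln_n**3
    sigma = math.exp(-0.4803 - 0.082676 * ln_n + 0.0030302 * ln_n**2)
    pval = 1.0 - norm.cdf((math.log(1.0 - r2) - mu) / sigma)
    return r2, pval

rng = np.random.default_rng(3)
w_n, p_n = approx_sw(rng.normal(size=100))       # normal sample
w_e, p_e = approx_sw(rng.exponential(size=100))  # clearly non-normal sample
```

The normal sample yields a statistic near 1 with a comfortable $P$-value, while the exponential sample is firmly rejected.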

5.4. Error Analysis of the Approximated SW Test

A Monte Carlo simulation as described in Section 5.2 was used to validate the error involved in the proposed approximations of the SW test, this time with an increased number of random samples for each distribution. For each random sample, four different normality tests were performed: i) the SW test (shapiro.test function in R) considering a typical significance level $\alpha = 5\%$, ii) the SW test (shapiro.test function in R) considering an overall optimal significance level $\alpha = 8\%$, iii) the SW test approximated by the mean order statistics (N.norm.test function in R), considering a sample size-dependent optimal significance level, and iv) the SW test approximated by the median order statistics (N.norm.test function in R), also considering a sample size-dependent optimal significance level. For each test, the type I error was determined as the fraction of normal samples erroneously identified as non-normal; the type II error was determined as the fraction of non-normal samples erroneously identified as normal; and the total test error was determined as the sum of error types I and II. The results obtained are summarized in Table 4.

These results indicate that the SW test using $\alpha = 5\%$ provided the lowest type I error. However, the type II error and the total test error were also the largest. Notice that this is the typical significance level used for this test. By increasing the significance level to $\alpha = 8\%$, the type I error increased accordingly, but the type II and total test errors decreased. Thus, using $\alpha = 8\%$ is a better option for the SW normality test. We are using the total test error as


decision criterion instead of one of the two types of error independently (related to the
confidence and power of the test), because in a real sample the true nature of the distribution
is unknown. It can be either truly normal or truly non-normal and therefore we should minimize
the sum of both errors.

Table 4. Test error obtained by Monte Carlo simulation for different SW tests performed in R.
SW Test | Type I error | Type II error | Total test error
shapiro.test ($\alpha$ = 5%) | 4.957% | 34.214% | 39.171%
shapiro.test ($\alpha$ = 8%) | 8.002% | 30.315% | 38.317%
N.norm.test ("mean") | 11.290% | 26.159% | 37.449%
N.norm.test ("median") | 10.952% | 27.984% | 38.936%

The proposed approximations in the estimation of $W$, using a sample size-dependent optimal significance level, resulted in an important increase in type I error (about 11%, Table 4), but with type II errors between 26% and 28%. The approximation using the median order statistic resulted in a total test error comparable to the original SW test using $\alpha = 5\%$. On the other hand, the mean order statistic approximation presented the lowest total test error. Thus, despite the error involved in the estimation of $W$, the overall performance of the test improved thanks to the optimization of the significance level. This also has the additional advantage of avoiding the subjective definition of significance levels, which is particularly troublesome for many practitioners. Of course, the type II error and the total test error will be influenced by the nature of the non-normal distributions considered. Thus, better empirical correlations might be obtained by using a wider selection of non-normal distributions.

6. Conclusion

Dozens of methods have been proposed for testing the normality of a distribution from a data sample. Most of these methods were inspired by the graphical behavior of the normal distribution. They can be classified in general into: i) methods based on the empirical cumulative distribution function (EDF), ii) methods based on regression or correlation, iii) methods based on the moments of the distribution, and iv) other methods based on other criteria (including $\chi^2$ tests, entropy-related criteria, and the ratio of different measures). The performance of each test strongly depends on the nature and size of the sample used, and on the rejection criterion employed (e.g. the significance level of the test). Different performance comparisons of normality tests, based on the power of the test, have been reported in the scientific and academic literature since the 1960s. A sample of the most recent comparisons (since 1990) has been used to rank 55 different normality tests. In general, regression-based normality tests are better ranked than other types of tests. Furthermore, the overall winner of this analysis is the regression-based Shapiro-Wilk (SW) test of normality.


Two different approximations for the calculation of the SW test statistic were presented: i) The
determination coefficient between the observed data and the mean order statistics of the
normal distribution, and ii) the determination coefficient between the observed data and the
median order statistics of the normal distribution. While these approximations introduce some
error in the test, they facilitate the implementation of the SW test by not requiring the use of
tabulated coefficients. In order to compensate for the error introduced by these
approximations, the use of an optimal, sample size-dependent significance level is proposed for
minimizing the total test error. This optimal strategy can be easily implemented using the
normality value (N-value) presented in Eq. (5.23), which provides positive values only when the
distribution is more likely normal. By using the optimal significance level strategy, the proposed
approximations are able to yield total test errors similar to those obtained with the original SW
test. This approach also eliminates the need for arbitrarily choosing a significance level for the
test. The detailed implementation of the proposed approximations to the SW test is
summarized in Section 5.3.

Acknowledgments

This research did not receive any specific grant from funding agencies in the public,
commercial, or not-for-profit sectors.

References

[1] Hernandez, H. (2019). Sums and Averages of Large Samples Using Standard
Transformations: The Central Limit Theorem and the Law of Large Numbers. ForsChem
Research Reports, 4, 2019-01. doi: 10.13140/RG.2.2.32429.33767.
[2] Hernandez, H. (2018). Expected Value, Variance and Covariance of Natural Powers of
Representative Standard Random Variables. ForsChem Research Reports, 3, 2018-08. doi:
10.13140/RG.2.2.15187.07205.
[3] Thode, H. C. (2002). Testing for Normality. Marcel Dekker, Inc., New York.
[4] Das, K. R., & Imon, A. H. M. R. (2016). A brief review of tests for normality. American Journal
of Theoretical and Applied Statistics, 5(1), 5-12.
[5] Darling, D. A. (1957). The Kolmogorov-Smirnov, Cramer-von Mises tests. The Annals of
Mathematical Statistics, 28(4), 823-838.
[6] Massey Jr, F. J. (1951). The Kolmogorov-Smirnov test for goodness of fit. Journal of the
American statistical Association, 46(253), 68-78.
[7] Anderson, T. W., & Darling, D. A. (1952). Asymptotic theory of certain "goodness of fit"
criteria based on stochastic processes. The annals of mathematical statistics, 193-212.


[8] Kuiper, N. H. (1960). Tests concerning random points on a circle. Nederl. Akad. Wetensch.
Proc. Ser. A, 63 (1), 38-47.
[9] Watson, G. S. (1962). Goodness-of-fit tests on a circle. II. Biometrika, 49(1/2), 57-63.
[10] Ajne, B. (1968). A simple test for uniformity of a circular distribution. Biometrika, 55(2), 343-
354.
[11] Lilliefors, H. W. (1967). On the Kolmogorov-Smirnov test for normality with mean and
variance unknown. Journal of the American statistical Association, 62(318), 399-402.
[12] Frosini, B. V. (1978). A survey of a class of goodness-of-fit statistics. Metron, 36(1-2), 3-49.
[13] Bakshaev, A. (2009). Goodness of fit and homogeneity tests on the basis of N-distances.
Journal of statistical planning and inference, 139(11), 3750-3758.
[14] Glen, A. G., Leemis, L. M., & Barr, D. R. (2001). Order statistics in goodness-of-fit testing.
IEEE Transactions on Reliability, 50(2), 209-213.
[15] Zhang, J., & Wu, Y. (2005). Likelihood-ratio tests for normality. Computational statistics &
data analysis, 49(3), 709-721.
[16] Chen, Z., & Ye, C. (2009). An alternative test for uniformity. International Journal of
Reliability, Quality and Safety Engineering, 16(04), 343-356.
[17] Shapiro, S. S., & Wilk, M. B. (1965). An analysis of variance test for normality (complete
samples). Biometrika, 52(3/4), 591-611.
[18] D'Agostino, R. B. (1971). An omnibus test of normality for moderate and large size samples.
Biometrika, 58(2), 341-348.
[19] Downton, F. (1966). Linear estimates with polynomial coefficients. Biometrika, 53(1/2), 129-
141.
[20] Shapiro, S. S., & Francia, R. S. (1972). An approximate analysis of variance test for
normality. Journal of the American statistical Association, 67(337), 215-216.
[21] De Wet, T. & Venter, J. H. (1972). Asymptotic distributions of certain test criteria of
normality. South African Statistical Journal, 6(2), 135-149.
[22] Filliben, J. J. (1975). The probability plot correlation coefficient test for normality.
Technometrics, 17(1), 111-117.
[23] Hegazy, Y. A. S., & Green, J. R. (1975). Some New Goodness‐Of‐Fit Tests Using Order
Statistics. Journal of the Royal Statistical Society: Series C (Applied Statistics), 24(3), 299-308.
[24] Weisberg, S., & Bingham, C. (1975). An approximate analysis of variance test for non-
normality suitable for machine calculation. Technometrics, 17(1), 133-134.
[25] Ryan, T. A., & Joiner, B. L. (1976). Normal probability plots and tests for normality: technical
report. Statistics Department, The Pennsylvania State University, State College, PA, 1-7.
[26] Royston, P. (1992). Approximating the Shapiro-Wilk W-test for non-normality. Statistics
and computing, 2(3), 117-119.


[27] Gan, F. F., & Koehler, K. J. (1990). Goodness-of-Fit Tests Based on P-P Probability Plots.
Technometrics, 32(3), 289-303.
[28] Chen, L., & Shapiro, S. S. (1995). An alternative test for normality based on normalized
spacings. Journal of Statistical Computation and Simulation, 53(3-4), 269-287.
[29] Rahman, M. M., & Govindarajulu, Z. (1997). A modification of the test of Shapiro and Wilk
for normality. Journal of Applied Statistics, 24(2), 219-236.
[30] Del Barrio, E., Cuesta-Albertos, J. A., Matrán, C., & Rodríguez-Rodríguez, J. M. (1999). Tests
of goodness of fit based on the L2-Wasserstein distance. Annals of Statistics, 1230-1239.
[31] Zhang, P. (1999). Omnibus test of normality using the Q statistic. Journal of Applied
Statistics, 26(4), 519-528.
[32] Coin, D. (2008). A goodness-of-fit test for normality based on polynomial regression.
Computational Statistics & Data Analysis, 52(4), 2185-2198.
[33] Hernandez, H. (2018). The Realm of Randomistic Variables. ForsChem Research Reports, 3,
2018-10. doi: 10.13140/RG.2.2.29034.16326.
[34] D'Agostino, R., & Pearson, E. S. (1973). Tests for departure from normality. Empirical
results for the distributions of b2 and √b1. Biometrika, 60(3), 613-622.
[35] Pearson, E. S., D’Agostino, R. B., & Bowman, K. O. (1977). Tests for departure from
normality: Comparison of powers. Biometrika, 64(2), 231-246.
[36] Jarque, C. M., & Bera, A. K. (1980). Efficient tests for normality, homoscedasticity and serial
independence of regression residuals. Economics Letters, 6(3), 255-259.
[37] Urzúa, C. M. (1996). On the correct use of omnibus tests for normality. Economics Letters,
53(3), 247-251.
[38] Bonett, D. G., & Seier, E. (2002). A test of normality with high uniform power.
Computational Statistics & Data Analysis, 40(3), 435-445.
[39] Cabaña, A., & Cabaña, E. M. (2003). Tests of normality based on transformed empirical
processes. Methodology and Computing in Applied Probability, 5(3), 309-335.
[40] Brys, G., Hubert, M., & Struyf, A. (2008). Goodness-of-fit tests based on a robust measure
of skewness. Computational Statistics, 23(3), 429-442.
[41] Romão, X., Delgado, R., & Costa, A. (2009). An empirical power comparison of univariate
goodness-of-fit tests for normality. Journal of Statistical Computation and Simulation, 80(5),
545-591.
[42] Doornik, J. A., & Hansen, H. (2008). An omnibus test for univariate and multivariate
normality. Oxford Bulletin of Economics and Statistics, 70, 927-939.
[43] Gel, Y. R., & Gastwirth, J. L. (2008). A robust modification of the Jarque–Bera test of
normality. Economics Letters, 99(1), 30-32.
[44] Hosking, J. R. (1990). L‐moments: Analysis and estimation of distributions using linear
combinations of order statistics. Journal of the Royal Statistical Society: Series B
(Methodological), 52(1), 105-124.


[45] Elamir, E. A., & Seheult, A. H. (2003). Trimmed L-moments. Computational Statistics & Data
Analysis, 43(3), 299-314.
[46] Bontemps, C., & Meddahi, N. (2005). Testing normality: a GMM approach. Journal of
Econometrics, 124(1), 149-186.
[47] Stein, C. (1972). A bound for the error in the normal approximation to the distribution of a
sum of dependent random variables. Proceedings of the Sixth Berkeley Symposium on
Mathematical Statistics and Probability, Volume 2: Probability Theory. The Regents of the
University of California. pp. 583-602.
[48] Desgagné, A., & Lafaye de Micheaux, P. (2018). A powerful and interpretable alternative to
the Jarque–Bera test of normality based on 2nd-power skewness and kurtosis, using the Rao's
score test on the APD family. Journal of Applied Statistics, 45(13), 2307-2327.
[49] Moore, D. S. (1986). Tests of Chi-Squared Type. In: D’Agostino, R. B. & Stephens, M. A.
(Eds.) Goodness-of-fit Techniques, Marcel-Dekker, Inc., New York, Chapter 3, 63-95.
[50] Pearson, K. (1900). X. On the criterion that a given system of deviations from the probable
in the case of a correlated system of variables is such that it can be reasonably supposed to
have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine
and Journal of Science, 50(302), 157-175.
[51] Geary, R. C. (1935). The ratio of the mean deviation to the standard deviation as a test of
normality. Biometrika, 27(3/4), 310-332.
[52] Spiegelhalter, D. J. (1977). A test for normality against symmetric alternatives. Biometrika,
64(2), 415-418.
[53] Martinez, J., & Iglewicz, B. (1981). A test for departure from normality based on a biweight
estimator of scale. Biometrika, 68(1), 331-333.
[54] Gel, Y. R., Miao, W., & Gastwirth, J. L. (2007). Robust directed tests of normality against
heavy-tailed alternatives. Computational Statistics & Data Analysis, 51(5), 2734-2746.
[55] Vasicek, O. (1976). A test for normality based on sample entropy. Journal of the Royal
Statistical Society: Series B (Methodological), 38(1), 54-59.
[56] Esteban, M. D., Castellanos, M. E., Morales, D., & Vajda, I. (2001). Monte Carlo comparison
of four normality tests using different entropy estimates. Communications in Statistics-
Simulation and Computation, 30(4), 761-785.
[57] Epps, T. W., & Pulley, L. B. (1983). A test for normality based on the empirical characteristic
function. Biometrika, 70(3), 723-726.
[58] Shapiro, S. S., Wilk, M. B., & Chen, H. J. (1968). A comparative study of various tests for
normality. Journal of the American Statistical Association, 63(324), 1343-1372.
[59] Hernandez, H. (2018). Parameter Identification using Standard Transformations: An
Alternative Hypothesis Testing Method. ForsChem Research Reports, 3, 2018-04. doi:
10.13140/RG.2.2.14895.02728.
[60] Dufour, J. M., et al. (1998). Simulation-based finite sample normality tests in linear
regression. Econometrics Journal, 1, 154-173.


[61] Seier, E. (2002). Comparison of tests for univariate normality. InterStat Statistical Journal, 1,
1-17.
[62] Yazici, B., & Yolacan, S. (2007). A comparison of various tests of normality. Journal of
Statistical Computation and Simulation, 77(2), 175-183.
[63] Hain, J. (2010). Comparison of common tests for normality. Diploma thesis. Julius-
Maximilians-Universität Würzburg, Germany.
[64] Noughabi, H. A., & Arghami, N. R. (2011). Monte Carlo comparison of seven normality tests.
Journal of Statistical Computation and Simulation, 81(8), 965-972.
[65] Razali, N. M., & Wah, Y. B. (2011). Power comparisons of Shapiro-Wilk, Kolmogorov-
Smirnov, Lilliefors and Anderson-Darling tests. Journal of Statistical Modeling and Analytics,
2(1), 21-33.
[66] Ul-Islam, T. (2011). Normality testing-A new direction. International Journal of Business and
Social Science, 2(3).
[67] Yap, B. W., & Sim, C. H. (2011). Comparisons of various types of normality tests. Journal of
Statistical Computation and Simulation, 81(12), 2141-2155.
[68] Ahmad, F., & Khan, R. A. (2015). A power comparison of various normality tests. Pakistan
Journal of Statistics and Operation Research, 331-345.
[69] Mbah, A. K., & Paothong, A. (2015). Shapiro–Francia test compared to other normality test
using expected p-value. Journal of Statistical Computation and Simulation, 85(15), 3002-3016.
[70] Adefisoye, J., Golam Kibria, B., & George, F. (2016). Performances of several univariate
tests of normality: An empirical study. J. Biom. Biostat, 7:4.
[71] Patrício, M., Ferreira, F., Oliveiros, B., & Caramelo, F. (2017). Comparing the performance of
normality tests with ROC analysis and confidence intervals. Communications in Statistics-
Simulation and Computation, 46(10), 7535-7551.
[72] Pekgör, A., Erişoğlu, M., Karakoca, A., & Erişoğlu, Ü. (2018). Empirical Type 1 Error Rate and
Power Comparisons of Normality Tests with R. Cumhuriyet Science Journal, 39(3), 799-811.
[73] Stolfo, E. (2018). Test di normalità: un confronto tramite un esperimento Monte Carlo
[Normality tests: a comparison through a Monte Carlo experiment]. Università degli studi di
Padova. http://tesi.cab.unipd.it/61430/1/Stolfo_Elena.pdf
[74] Siraj-Ud-Doulah, M. (2019). A Comparison among Twenty-Seven Normality Tests. Res. Rev.
J. Stat, 8, 41-59.
[75] Wijekularathna, D. K., Manage, A. B., & Scariano, S. M. (2019). Power analysis of several
normality tests: A Monte Carlo simulation study. Communications in Statistics-Simulation and
Computation, 1-17.
[76] Arnastauskaitė, J., Ruzgas, T., & Bražėnas, M. (2021). An Exhaustive Power Comparison of
Normality Tests. Mathematics, 9(7), 788.
[77] Hosking, J. R. (1992). Moments or L moments? An example comparing two measures of
distributional shape. The American Statistician, 46(3), 186-189.


[78] Sarhan, A. E., & Greenberg, B. G. (1956). Estimation of location and scale parameters by
order statistics from singly and doubly censored samples. Part I. The normal distribution up to
samples of size 10. The Annals of Mathematical Statistics, 27(2), 427-451.
[79] Hernandez, H. (2018). Multidimensional Randomness, Standard Random Variables and
Variance Algebra. ForsChem Research Reports, 3, 2018-02. doi: 10.13140/RG.2.2.11902.48966.
[80] Hernandez, H. (2017). Standard Maxwell-Boltzmann distribution: Definition and properties.
ForsChem Research Reports, 2, 2017-2. doi: 10.13140/RG.2.2.29888.74244.

Appendix. Algorithm Implementation in R

The proposed algorithm for computing approximate values of the Shapiro-Wilk test statistic
can be implemented in R with the following code:
#Approximated Shapiro-Wilk test based on order statistics
#Optimal significance level used
#Hugo Hernandez
#ForsChem Research Reports 2021-05
#28/04/2021

N.norm.test <- function(x, type = "mean", maxerror = NULL, display = TRUE) {
  if (display) {
    cat("Approximated Shapiro-Wilk Normality Test", "\n")
    cat("\n")
  }
  #Determine the minimum sample size (4 by default)
  if (is.null(maxerror)) {
    nmin <- 4
  } else {
    nmin <- ceiling(13 / maxerror - 12)
  }
  #Determine the sample size
  n <- length(x)
  #Maximum sample size = 2000: larger samples are reduced to a random sub-sample
  if (n > 2000) {
    x <- x[order(runif(n))][1:2000]
    n <- 2000 #update the sample size to match the sub-sample
    if (display) {
      cat("A random sub-sample of 2000 elements will be used for testing normality", "\n")
    }
  }
  Ln <- log(n)
  if (n >= nmin) {
    #Ascending ranking
    r <- rank(x, ties.method = "average")
    #Calculation of order statistics
    if (type == "median") {
      #Median order statistic
      o <- qnorm((r - 0.5) / n)
      if (display) {
        cat("Median order statistics", "\n")
      }
    } else {
      #Mean order statistic
      o <- (n / sqrt(2 * pi)) *
        (exp(-0.5 * (qnorm((r - 1) / n))^2) - exp(-0.5 * (qnorm(r / n))^2))
      if (display) {
        cat("Mean order statistics", "\n")
      }
    }
    #Linear regression of the sample on the order statistics
    lrmodel <- lm(x ~ o)
    #Estimation of W (coefficient of determination of the regression)
    W <- summary(lrmodel)$r.squared
    #Calculation of P-value
    if (n < 12) {
      P <- 1 - pnorm((-log(-2.273 + 0.459 * n - log(1 - W)) - 0.544 + 0.39978 * n -
             0.025054 * n^2 + 0.0006714 * n^3) /
             exp(1.3822 - 0.77857 * n + 0.062767 * n^2 - 0.0020322 * n^3))
    } else {
      P <- 1 - pnorm((log(1 - W) + 1.5861 + 0.31082 * Ln + 0.083751 * Ln^2 -
             0.0038915 * Ln^3) /
             exp(-0.4803 - 0.082676 * Ln + 0.0030302 * Ln^2))
    }
    #Calculation of N-value
    N <- log(P) + 0.7 * Ln
    #Estimate total test error
    eT <- 13 / (n + 12)
    if (display) {
      cat(paste("W =", W), "\n")
      cat(paste("P-value =", P), "\n")
      cat(paste("N-value =", N), "\n")
      cat(paste("Estimated total test error:", 100 * eT, "%"), "\n")
      #Decision
      if (N > 0) {
        cat("The data is normally distributed", "\n")
      } else {
        cat("The data is not normally distributed", "\n")
      }
    }
    output <- data.frame(W, P, N, eT)
    return(output)
  } else {
    cat(paste("A minimum sample size of", nmin, "observations is required!"))
  }
}
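As a self-contained illustration of the core computation (this sketch is not part of the original appendix; the sample and variable names are chosen for illustration only), the median order-statistics approximation of W reduces to a rank transformation followed by a simple linear regression in base R:

```r
# Sketch: core of the median order-statistics approximation of W,
# using only base R. The sample x below is illustrative.
set.seed(42)
x <- rnorm(30)                        # sample drawn from a normal population
n <- length(x)
r <- rank(x, ties.method = "average") # ascending ranks
o <- qnorm((r - 0.5) / n)             # approximate median order statistics
W <- summary(lm(x ~ o))$r.squared     # W is the R-squared of the regression x ~ o
W                                     # values close to 1 support normality
```

With the full N.norm.test function defined, the same sample can be tested directly, e.g. N.norm.test(x, type = "median"), which additionally reports the P-value, the N-value, and the estimated total test error.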
