Normal Distribution: Theory and Testing of Normality


Chapter 4
Normal Distribution: Theory, Application, and Testing
Outline
Historical Aspects of Normal Distribution
Understanding and Applying Normal Distribution
Testing Normality
Problems and Solutions Associated with Non-normal Data
Multivariate Normal Distribution: Testing Multivariate Normality and Outliers
Historical Aspects of Normal Distribution
Development
Abraham de Moivre (1667–1754): Approximatio ad summam terminorum binomii (a + b)ⁿ in seriem expansi
Carl Friedrich Gauss (1809): Theoria motus corporum coelestium in sectionibus conicis solem ambientium
Marquis de Laplace (1749–1827): Laplace's error function
Pearson popularized the term "normal curve."
Testing Normality
Fisher (1930), Bartlett (1935), E. S. Pearson (1931), Geary (1947), Box (1953), John Tukey (1960), Pearson and Please (1975), D'Agostino and Lee (1977)
Understanding and Applying Normal Distribution

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²)),  for −∞ < x < ∞

Properties
The mean, median, and mode are the same.
The normal curve is symmetric.
Normal is a two-parameter distribution: μ and σ, where μ is the expected value and σ is the standard deviation.
The normal distribution is a continuous distribution that can take values from −∞ to +∞.
The highest frequency is in the middle, and the frequency tapers down at either extreme of the normal curve.
Most of the area under the normal curve is within the first three standard deviations on both sides (99.74% of the area), whereas 68.26% of the area is within the first standard deviation.
Zero skewness and zero excess kurtosis.
Finding the Area under the Normal Distribution: Using the Z Score
Z (the standard normal distribution) is useful in finding the area under the normal distribution.
Example:
A variable has mean 100 and SD = 10. How many cases will be above 120?

Z = (120 − 100) / 10 = 20 / 10 = 2

From the table of the normal distribution, the area beyond Z for Z = 2 is 0.0228. In terms of percentages, it is 0.0228 × 100 = 2.28%.
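The table lookup above can be reproduced in a few lines. This is an illustrative sketch (the function name `area_above` is ours, not from the chapter), using the standard-normal survival function expressed through the complementary error function:

```python
import math

def area_above(x, mean, sd):
    """P(X > x) for a normal variable, via the Z score."""
    z = (x - mean) / sd
    # Standard-normal survival function: P(Z > z) = 0.5 * erfc(z / sqrt(2))
    return 0.5 * math.erfc(z / math.sqrt(2))

# Example from the slide: mean 100, SD 10, cases above 120 (Z = 2)
p = area_above(120, 100, 10)
print(round(p * 100, 2))  # about 2.28 percent
```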
Testing Normality
Tests Using Moments: population moments; moment tests; the skewness test; the kurtosis test; absolute moment test (Geary's test).
Goodness-of-Fit and Related Tests: Kolmogorov–Smirnov test; Lilliefors test; Kuiper's V test; Anderson–Darling test (AD test); Cramér–von Mises test; Jarque–Bera test (JB test); Shapiro–Wilk test; the D'Agostino–Pearson test (D'Agostino K² test).
Other Tests for Normality: likelihood ratio test; D'Agostino's test; Oja's test; Lin and Mudholkar's test.
Graphical Methods for Testing Normality
Plotting raw data: histogram; box-whisker plot
Plotting probability: Q–Q plot, detrended Q–Q plot
Skewness Test
Skewness test (g₁): when g₁ > 0 the data are skewed to the right, and when g₁ < 0 the data are skewed to the left.

mₖ = (1/n) Σᵢ₌₁ⁿ (xᵢ − x̄)ᵏ        g₁ = m₃ / m₂^(3/2)

var(g₁) = 6(n − 2) / [(n + 1)(n + 3)]        Y = g₁ √[(n + 1)(n + 3) / (6(n − 2))]

Z = δ ln(Y/a + √((Y/a)² + 1))

where W² = −1 + √(2(β₂ − 1)),  δ = 1 / √(ln W),  a = √(2 / (W² − 1)),
and β₂ is the kurtosis of g₁ under normality.
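The transformation above can be sketched in pure Python. The function name is ours, and the expression for β₂ is the published D'Agostino (1970) formula (an assumption beyond what the slide shows); the approximation requires n ≥ 8:

```python
import math

def dagostino_skew_z(xs):
    """D'Agostino's transformed skewness statistic Z (requires n >= 8)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    g1 = m3 / m2 ** 1.5
    Y = g1 * math.sqrt((n + 1) * (n + 3) / (6.0 * (n - 2)))
    # Kurtosis of g1 under normality (D'Agostino 1970)
    beta2 = (3.0 * (n * n + 27 * n - 70) * (n + 1) * (n + 3)
             / ((n - 2) * (n + 5) * (n + 7) * (n + 9)))
    W2 = -1 + math.sqrt(2 * (beta2 - 1))
    delta = 1 / math.sqrt(math.log(math.sqrt(W2)))
    a = math.sqrt(2 / (W2 - 1))
    return delta * math.log(Y / a + math.sqrt((Y / a) ** 2 + 1))
```

Symmetric data give Z = 0 exactly (g₁ = 0), while a right-skewed sample gives Z > 0.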
The Kurtosis Test
The fourth-moment test for symmetric departures from normality is based on b₂:

b₂ = m₄ / m₂²

Anscombe and Glynn (1983):

Z = [ (1 − 2/(9A)) − ( (1 − 2/A) / (1 + x √(2/(A − 4))) )^(1/3) ] / √(2/(9A))

where A = 6 + (8/√β₁(b₂)) [ 2/√β₁(b₂) + √(1 + 4/β₁(b₂)) ],
x = (b₂ − E(b₂)) / √var(b₂),
and √β₁(b₂) is the third standardized moment of b₂.
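A pure-Python sketch of the Anscombe–Glynn calculation (the function name is ours; the closed forms for E(b₂), var(b₂), and √β₁(b₂) are the standard published expressions, which the slide does not spell out):

```python
import math

def anscombe_glynn_z(xs):
    """Anscombe-Glynn (1983) transformed kurtosis statistic Z."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    b2 = m4 / m2 ** 2
    e_b2 = 3.0 * (n - 1) / (n + 1)
    var_b2 = 24.0 * n * (n - 2) * (n - 3) / ((n + 1) ** 2 * (n + 3) * (n + 5))
    x = (b2 - e_b2) / math.sqrt(var_b2)
    # Third standardized moment of b2
    sb1 = (6.0 * (n * n - 5 * n + 2) / ((n + 7) * (n + 9))
           * math.sqrt(6.0 * (n + 3) * (n + 5) / (n * (n - 2) * (n - 3))))
    A = 6.0 + 8.0 / sb1 * (2.0 / sb1 + math.sqrt(1 + 4.0 / sb1 ** 2))
    # Assumes the cube-root base stays positive (holds for ordinary samples)
    term = ((1 - 2.0 / A) / (1 + x * math.sqrt(2.0 / (A - 4)))) ** (1.0 / 3)
    return ((1 - 2.0 / (9 * A)) - term) / math.sqrt(2.0 / (9 * A))
```

Platykurtic data (e.g., uniform) give Z < 0; leptokurtic data give Z > 0.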
Absolute Moment Test: Geary's Test
Geary (1935) proposed the test statistic

a = Σᵢ₌₁ⁿ |xᵢ − x̄| / (n √m₂)

D'Agostino (1970):

Z = √n (a − 0.7979) / 0.2123
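These two formulas translate directly to code (function name ours). Under normality a ≈ √(2/π) ≈ 0.7979; flatter distributions push a upward:

```python
import math

def geary_test(xs):
    """Geary's a statistic and D'Agostino's (1970) normal approximation Z."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    a = sum(abs(x - mean) for x in xs) / (n * math.sqrt(m2))
    z = math.sqrt(n) * (a - 0.7979) / 0.2123
    return a, z
```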
Goodness-of-Fit Tests
Empirical distribution function (EDF) tests
The hypothesized theoretical distribution (in our case, the normal distribution) is expressed as F₀(X).
H₀: F(X) = F₀(X)
H_A: F(X) ≠ F₀(X)
The EDF for a sample is Fₙ(X):

Fₙ(x) = 0      for x < x₍₁₎
Fₙ(x) = i/n    for x₍ᵢ₎ ≤ x < x₍ᵢ₊₁₎
Fₙ(x) = 1      for x₍ₙ₎ ≤ x

If no two observations are equal, the empirical distribution function is a step function that jumps 1/n in height at each observation x₍ₖ₎.
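The step-function definition above can be sketched as (function name ours):

```python
def ecdf(sample):
    """Return the empirical distribution function Fn as a step function."""
    xs = sorted(sample)
    n = len(xs)
    def Fn(x):
        # Fn(x) = (number of observations <= x) / n; jumps 1/n at each point
        return sum(1 for v in xs if v <= x) / n
    return Fn
```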
Kolmogorov–Smirnov Test
Kolmogorov (1933) developed a one-sample test, and Smirnov (1939) independently developed a two-sample procedure.
The vertical distance between the sample cumulative probability distribution and the hypothesized cumulative probability distribution can be obtained for each value of X.
The test statistic is the largest value of this vertical distance:

Dₙ = sup_x |Fₙ(X) − F₀(X)|

Lilliefors Test: the Kolmogorov–Smirnov one-sample test when μ and σ are unknown. The sample mean and sample SD are used as estimators of the population mean and population SD in the KS test.
Kuiper's V Test: combine the maximum deviations above and below (V = D⁺ + D⁻) and obtain V*.
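A sketch of the Dₙ computation (function names ours). Because Fₙ jumps at each observation, the supremum is checked on both sides of every jump; `normal_cdf` supplies F₀ for a normality test:

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of N(mu, sigma^2) via the complementary error function."""
    return 0.5 * math.erfc((mu - x) / (sigma * math.sqrt(2)))

def ks_statistic(sample, cdf):
    """Dn = sup_x |Fn(x) - F0(x)|, evaluated at both sides of each jump."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, abs(i / n - f), abs((i - 1) / n - f))
    return d
```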
Anderson–Darling Test (AD Test)
Anderson and Darling (1952):

A² = −n − s,  where  s = Σᵢ₌₁ⁿ [(2i − 1)/n] [ln(pᵢ) + ln(1 − p₍ₙ₋ᵢ₊₁₎)]

Stephens (1974, 1986): the AD test has high power.
Cramér–von Mises Test

W² = 1/(12n) + Σᵢ₌₁ⁿ [ (2i − 1)/(2n) − pᵢ ]²
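The W² formula as a one-liner (function name ours; pᵢ = F₀(x₍ᵢ₎)):

```python
def cramer_von_mises(sample, cdf):
    """W^2 = 1/(12n) + sum_i ((2i-1)/(2n) - p_i)^2, with p_i = F0(x_(i))."""
    xs = sorted(sample)
    n = len(xs)
    return 1 / (12 * n) + sum(((2 * i - 1) / (2 * n) - cdf(x)) ** 2
                              for i, x in enumerate(xs, start=1))
```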
Jarque–Bera Test (JB Test)
Carlos Jarque and Anil K. Bera (1987):

JB = (n/6) [ S² + 0.25(K − 3)² ]

The JB test asymptotically follows the chi-square distribution with df = 2.
JB ≥ 5.99 is significant at p = 0.05.
JB ≥ 9.21 is significant at p = 0.01.
JB for multiple regression (with k regressors):

JB = [(n − k)/6] [ S² + 0.25(K − 3)² ]
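The one-sample JB statistic in code (function name ours), with S and K computed from the sample moments defined earlier:

```python
def jarque_bera(xs):
    """JB = (n/6) * (S^2 + 0.25*(K - 3)^2), moment-based S and K."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    S = m3 / m2 ** 1.5   # skewness (0 under normality)
    K = m4 / m2 ** 2     # kurtosis (3 under normality)
    return n / 6 * (S ** 2 + 0.25 * (K - 3) ** 2)
```

A large uniform sample (kurtosis near 1.8) comfortably exceeds the 9.21 cutoff.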
Shapiro–Wilk Test
Shapiro and Wilk (1965):

W = ( Σᵢ₌₁ⁿ aᵢ x₍ᵢ₎ )² / Σᵢ₌₁ⁿ (xᵢ − x̄)²

x₍ᵢ₎ are the sample values ordered from smallest to largest.
aᵢ is a constant generated using the means, variances, and covariances of the order statistics of a sample from the normal distribution.
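The exact aᵢ require tabulated order-statistic moments, so a faithful one-file implementation is impractical. A closely related statistic, Shapiro–Francia W′, replaces aᵢ with normalized normal scores; the sketch below (function name ours, Blom-type plotting positions as an assumption) illustrates the same idea — W near 1 supports normality:

```python
from statistics import NormalDist

def shapiro_francia(sample):
    """Shapiro-Francia W': like W, with normal scores standing in for a_i."""
    xs = sorted(sample)
    n = len(xs)
    nd = NormalDist()
    # Blom-type scores approximate the expected normal order statistics
    m = [nd.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    mean = sum(xs) / n
    num = sum(mi * xi for mi, xi in zip(m, xs)) ** 2
    den = sum(mi * mi for mi in m) * sum((x - mean) ** 2 for x in xs)
    return num / den
```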
Other Tests
The D'Agostino–Pearson test (D'Agostino K² test)
Likelihood ratio test
D'Agostino's D
Oja's test
Lin and Mudholkar's test
Graphical Methods for Testing Normality
Raw data plotting methods (histogram, stem-and-leaf plot, box plots or box-whisker plots)
Probability plotting methods (P–P plot, Q–Q plot), detrended probability plots (detrended Q–Q plot)
Empirical CDF plots are also commonly used for testing normality.
Stem-and-leaf and box-whisker plots are the more useful of the raw data methods.
The histogram is the least useful in understanding the distribution.
Plotting Probability: Q–Q Plot
pᵢ = (i − 0.5)/n and pᵢ = i/(n + 1) are common plotting positions for a Q–Q plot.
Sample quantiles are on the x-axis and theoretical quantiles are on the y-axis.
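The coordinates of a normal Q–Q plot can be computed directly (function name ours), using the first plotting position above; points near a straight line suggest normality:

```python
from statistics import NormalDist

def qq_points(sample):
    """(sample quantile, theoretical normal quantile) pairs for a Q-Q plot."""
    xs = sorted(sample)
    n = len(xs)
    nd = NormalDist()
    # plotting position p_i = (i - 0.5)/n; theoretical quantile z(p_i)
    return [(x, nd.inv_cdf((i - 0.5) / n)) for i, x in enumerate(xs, start=1)]
```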
Non-normal Data
Data Transformation
Square-root transformation
Log transformation
Inverse (reciprocal) transformation
Non-parametric Statistics
Computationally Intensive Methods: Bootstrapping
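A minimal percentile-bootstrap sketch (function name, resample count, and seed are ours): resample the data with replacement, recompute the statistic each time, and read the confidence interval off the percentiles of the replicates — no normality assumption needed:

```python
import random

def bootstrap_ci(sample, stat, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for any statistic."""
    rng = random.Random(seed)
    n = len(sample)
    reps = sorted(stat([rng.choice(sample) for _ in range(n)])
                  for _ in range(n_boot))
    lo = reps[int(n_boot * alpha / 2)]
    hi = reps[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```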
Multivariate Normal Distribution

f(x) = (1 / ((2π)^(m/2) |Σ|^(1/2))) exp[ −(1/2)(x − μ)′ Σ⁻¹ (x − μ) ]
Bivariate Normal Distribution and MVN

MVN: the marginal distribution of each x follows the univariate normal distribution; this is a necessary but not a sufficient condition.
Linearity: first, the relationships among the correlated variables are strictly linear, and second, any linear combination of the variables is normally distributed.
Quadratic form of the MVN.
Squared Mahalanobis Distance (Squared Radii)
The distance of an observation from the centroid of the remaining observations.
The squared Mahalanobis distance follows the chi-square distribution under MVN.
Use alpha = 0.001 with degrees of freedom (df) equal to the number of variables.
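For the bivariate case the 2×2 covariance inverse can be written out by hand, giving a self-contained sketch of the squared distances (function name ours); each d² would then be compared with the chi-square cutoff for df = 2 at α = 0.001, which is 13.816:

```python
def mahalanobis_sq_2d(data):
    """Squared Mahalanobis distance of each bivariate point from the centroid."""
    n = len(data)
    mx = sum(p[0] for p in data) / n
    my = sum(p[1] for p in data) / n
    # sample covariance matrix entries (divisor n - 1)
    sxx = sum((p[0] - mx) ** 2 for p in data) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in data) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in data) / (n - 1)
    det = sxx * syy - sxy * sxy  # nonzero as long as the variables are not collinear
    out = []
    for x, y in data:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] S^-1 [dx dy]^T, with the 2x2 inverse written out
        out.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return out
```

A handy check: with the (n − 1)-divisor covariance, the d² values always sum to (n − 1) × (number of variables).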
Cook's Distance: An Alternative Way
Handling a multivariate outlier: 1. Delete; 2. Modify; 3. Keep unchanged
Statistical Tests for Assessing Multivariate Normality

Mardia's MVN test
Henze–Zirkler's MVN test
Royston's MVN test