S244 18 Non Parametric Statistical Techniques
Statistical Methods
Definition
When the data are generated from a process (model) that is known except for a finite number of unknown parameters, the model is called a parametric model.
H0: median = m0
If H0 is true then S will have a binomial distribution with p = 0.50, n = sample size.
If H0 is not true then S will still have a binomial distribution. However, p will not be equal to 0.50:
• If m0 > median then p < 0.50.
• If m0 < median then p > 0.50.
Here p = the probability that an observation is greater than m0.
Summarizing: If H0 is true then S will have a binomial distribution with p = 0.50, n = sample size.
The binomial distribution with n = 10, p = 0.50
x p(x)
0 0.0010
1 0.0098
2 0.0439
3 0.1172
4 0.2051
5 0.2461
6 0.2051
7 0.1172
8 0.0439
9 0.0098
10 0.0010
The critical and acceptance region:
Choose the critical region so that α is close to 0.05 or 0.01.
e.g. If the critical region is {0, 1, 9, 10} then α = 0.0010 + 0.0098 + 0.0098 + 0.0010 = 0.0216.
e.g. If the critical region is {0, 1, 2, 8, 9, 10} then α = 0.0010 + 0.0098 + 0.0439 + 0.0439 + 0.0098 + 0.0010 = 0.1094.
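The size of a candidate critical region can be checked directly from the binomial probabilities. The following is a minimal Python sketch; the function names `binom_pmf` and `critical_region_alpha` are illustrative and not part of the original slides. Because it sums exact probabilities rather than table entries rounded to four decimals, the first region comes out as 0.0215 instead of 0.0216.

```python
from math import comb


def binom_pmf(x: int, n: int, p: float = 0.5) -> float:
    """Binomial probability P[S = x] for n trials with success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def critical_region_alpha(region, n: int) -> float:
    """Probability, under H0 (p = 0.5), that S falls in the critical region."""
    return sum(binom_pmf(x, n) for x in region)


# The two candidate critical regions discussed for n = 10.
alpha_small = critical_region_alpha({0, 1, 9, 10}, n=10)
alpha_large = critical_region_alpha({0, 1, 2, 8, 9, 10}, n=10)
print(round(alpha_small, 4), round(alpha_large, 4))
```

The first region gives a test size near 0.02, the second a size near 0.11, so with n = 10 no symmetric region lands exactly on 0.05.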
Example
Suppose that we are interested in determining
if a new drug is effective in reducing
cholesterol.
Hence we administer the drug to n = 10
patients with high cholesterol and measure the
reduction.
The data
Cholesterol
Case Initial Final Reduction
1 240 228 12
2 237 222 15
3 264 262 2
4 233 224 9
5 236 240 -4
6 234 237 -3
7 264 264 0
8 241 219 22
9 261 252 9
10 256 254 2
Suppose we want to test
H0: the drug is not effective
median reduction ≤ 0
against
HA: the drug is effective
median reduction > 0
To carry out the sign test we:
1. Compute S = the number of observations greater than m0.
2. Let s_observed = the observed value of S.
3. Compute the p-value = P[S ≥ s_observed] (2 P[S ≥ s_observed] for a two-tailed test). Use the table for the binomial distribution (p = ½, n = sample size).
4. Conclude HA (reject H0) if the p-value is less than 0.05 (or 0.01).
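As an illustration, the steps can be applied to the ten cholesterol reductions listed in the example, with m0 = 0. This is a sketch under one stated assumption: the case with a reduction of exactly 0 is kept in the sample and simply not counted as greater than m0, so n stays at 10 as in "n = sample size". Other treatments of zeros (for instance dropping them) would change the numbers. The function name `sign_test_upper_p` is hypothetical.

```python
from math import comb

# Reductions in cholesterol for the 10 cases in the example.
reductions = [12, 15, 2, 9, -4, -3, 0, 22, 9, 2]
m0 = 0


def sign_test_upper_p(data, m0):
    """Return (S, P[S >= s_observed]) under H0, where S ~ Binomial(n, 1/2)."""
    n = len(data)
    s = sum(1 for x in data if x > m0)  # number of observations above m0
    p_value = sum(comb(n, k) for k in range(s, n + 1)) / 2**n
    return s, p_value


s_observed, p_value = sign_test_upper_p(reductions, m0)
print(s_observed, round(p_value, 4))
```

Under that assumption S = 7 and the one-sided p-value is p(7) + p(8) + p(9) + p(10), about 0.17, which is not below 0.05.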
Sign Test for Large Samples
If n is large we can use the Normal approximation
to the Binomial.
Namely S has a Binomial distribution with p = ½
and n = sample size.
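Hence for large n, S has approximately a Normal distribution with mean
$$\mu_S = np = \frac{n}{2}$$
and standard deviation
$$\sigma_S = \sqrt{npq} = \sqrt{n \cdot \tfrac{1}{2} \cdot \tfrac{1}{2}} = \frac{\sqrt{n}}{2}$$
Hence for large n, use as the test statistic (in place of S)
$$z = \frac{S - \mu_S}{\sigma_S} = \frac{S - \frac{n}{2}}{\frac{\sqrt{n}}{2}}$$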
Choose the critical region for z from the
Standard Normal distribution.
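A short Python sketch of the large-sample statistic follows. The values s = 60 and n = 100 are hypothetical and chosen only to make the arithmetic easy to follow. No continuity correction is applied, since none is described here.

```python
from math import sqrt
from statistics import NormalDist


def sign_test_z(s: float, n: int) -> float:
    """Standardize S using mean n/2 and standard deviation sqrt(n)/2."""
    return (s - n / 2) / (sqrt(n) / 2)


def upper_tail_p(z: float) -> float:
    """P[Z >= z] for a standard Normal variable Z."""
    return 1 - NormalDist().cdf(z)


# Hypothetical illustration: 60 of 100 observations exceed m0.
z = sign_test_z(s=60, n=100)
p_one_sided = upper_tail_p(z)
print(z, round(p_one_sided, 4))
```

Here the mean is 50 and the standard deviation is 5, so z = 2.0, which would fall in a one-sided 5% critical region (z > 1.645).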
Confidence Intervals
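Assume that the data, x1, x2, x3, …, xn, is a sample from an unknown distribution. Let
x(1) = the smallest observation
x(2) = the 2nd smallest observation
…
x(n) = the largest observation
Then x(k) to x(n – k + 1) is a P·100% confidence interval for the median, where P = p(k) + p(k + 1) + … + p(n – k) and p(x) is the binomial probability with p = ½, n = sample size.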
The data arranged in order
k x(k)
1 -4
2 -3
3 0
4 2
5 2
6 9
7 9
8 12
9 15
10 22
x(2) = -3 to x(9) = 15 is a 97.84% confidence interval for the median.
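The interval and its confidence level can be reproduced with a few lines of Python. This is a sketch; the function name `median_ci` is illustrative. The coverage is computed as the binomial probability that S lies between k and n – k, which for n = 10, k = 2 is 1 minus the probability of the region {0, 1, 9, 10}.

```python
from math import comb


def median_ci(data, k: int):
    """Return (x(k), x(n-k+1), coverage) using order statistics.

    coverage = p(k) + ... + p(n-k) for a Binomial(n, 1/2) distribution.
    """
    n = len(data)
    ordered = sorted(data)
    coverage = sum(comb(n, x) for x in range(k, n - k + 1)) / 2**n
    # x(k) is the k-th smallest value, i.e. index k-1 in a 0-based list.
    return ordered[k - 1], ordered[n - k], coverage


reductions = [12, 15, 2, 9, -4, -3, 0, 22, 9, 2]
lower, upper, coverage = median_ci(reductions, k=2)
print(lower, upper, round(coverage, 4))
```

The exact coverage is about 0.9785; the 97.84% quoted with the table comes from subtracting probabilities that were rounded to four decimals.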
Example
Suppose that, as in the previous example, we repeat the study with n = 20 patients with high cholesterol.
The data
Cholesterol
Case Initial Final Reduction
1 231 212 19
2 241 235 6
3 270 268 2
4 264 255 9
5 256 252 4
6 240 234 6
7 269 256 13
8 242 243 -1
9 269 264 5
10 230 221 9
11 234 220 14
12 266 267 -1
13 253 252 1
14 267 242 25
15 248 248 0
16 238 239 -1
17 266 268 -2
18 266 240 26
19 244 247 -3
20 249 250 -1
The binomial distribution with n = 20, p = 0.5
x p(x)
0 0.0000
1 0.0000
2 0.0002
3 0.0011
4 0.0046
5 0.0148
6 0.0370
7 0.0739
8 0.1201
9 0.1602
10 0.1762
11 0.1602
12 0.1201
13 0.0739
14 0.0370
15 0.0148
16 0.0046
17 0.0011
18 0.0002
19 0.0000
20 0.0000
Note:
p(6) + p(7) + p(8) + p(9) + p(10) + p(11) + p(12) + p(13) + p(14)
= 0.0370 + 0.0739 + 0.1201 + 0.1602 + 0.1762 + 0.1602 + 0.1201 + 0.0739 + 0.0370 = 0.9586
Hence x(6) to x(15) is a 95.86% confidence interval for the median reduction in cholesterol.
The data arranged in order
k x(k)
1 -3
2 -2
3 -1
4 -1
5 -1
6 -1
7 0
8 1
9 2
10 4
11 5
12 6
13 6
14 9
15 9
16 13
17 14
18 19
19 25
20 26
x(6) = -1 to x(15) = 9 is a 95.86% confidence interval for the median.
For large values of n one can use the normal
approximation to the Binomial to find the value of k so
that x(k) to x(n – k + 1) is a 95% confidence interval for the
median.
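i.e. we want to find k so that
$$P[S \le k] = P\left[\frac{S - \frac{n}{2}}{\frac{\sqrt{n}}{2}} \le \frac{k - \frac{n}{2}}{\frac{\sqrt{n}}{2}}\right] \approx P\left[Z \le \frac{2k - n}{\sqrt{n}}\right] = 0.025$$
but $P[Z \le -1.960] = 0.025$, hence
$$\frac{2k - n}{\sqrt{n}} = -1.960 \quad \text{or} \quad k = \frac{n - 1.96\sqrt{n}}{2}$$
Using $k = \frac{n - 1.96\sqrt{n}}{2}$:
n k
20 5.6
40 13.8
100 40.2
200 86.1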
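The tabulated values of k can be regenerated as follows. This Python sketch uses a hypothetical helper `approx_k`; it obtains the Normal quantile from the standard library instead of hard-coding 1.96, which makes no difference at one decimal place.

```python
from math import sqrt
from statistics import NormalDist


def approx_k(n: int, confidence: float = 0.95) -> float:
    """Normal-approximation value of k so that x(k) to x(n-k+1)
    is roughly a `confidence` interval for the median."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return (n - z * sqrt(n)) / 2


for n in (20, 40, 100, 200):
    print(n, round(approx_k(n), 1))
```

Since k must be an integer to pick out an order statistic, the result has to be rounded in practice; rounding down gives a slightly wider interval and is the conservative choice.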
Next we will consider:
1. The Wilcoxon signed rank test
The Wilcoxon signed rank test is an
alternative to the Sign test, a test for the
central location of a single population
The sign test
• If H0 is true (m0 = true median) then W+ ≈ W- ≈ n(n + 1)/4.
• If m0 > true median then W- is large and W+ is small.
• If m0 < true median then W- is small and W+ is large.
Note: It is possible to work out the sampling
distribution of W+ ( and W-) when H0 is true.
6 0.4063 0.2188
7 0.5000 0.2813
8 0.3438
9 0.4219
10 0.5000
Table A.6 (page A-15) only goes up to n = sample size = 12.
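For sample sizes n > 12 we can use the fact that T (W+ or W-) has approximately a normal distribution with mean
$$\mu_T = \frac{n(n+1)}{4}$$
standard deviation
$$\sigma_T = \sqrt{\frac{n(n+1)(2n+1)}{24}}$$
and
$$P[T \le t] = P\left[\frac{T - \mu_T}{\sigma_T} \le \frac{t - \mu_T}{\sigma_T}\right] \approx P\left[Z \le \frac{t - \mu_T}{\sigma_T}\right]$$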
n = 12
t   P[T ≤ t] (Exact Values)   P[Z ≤ (t − μ_T)/σ_T] (Normal Approximation)
1 0.0005 0.0014
2 0.0008 0.0019
3 0.0013 0.0024
4 0.0018 0.0030
5 0.0025 0.0038
6 0.0035 0.0048
7 0.0047 0.0060
8 0.0062 0.0075
9 0.0081 0.0093
10 0.0105 0.0115
11 0.0135 0.0140
12 0.0171 0.0171
13 0.0213 0.0207
14 0.0262 0.0249
15 0.0320 0.0299
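To show how such a comparison can be produced, here is a minimal Python sketch. It assumes no ties and no zero differences, so that under H0 each rank 1, …, n is attached to a positive difference independently with probability ½. The function names are illustrative. The exact distribution is obtained by counting, for each total t, the subsets of {1, …, n} whose ranks add up to t.

```python
from math import sqrt
from statistics import NormalDist


def signed_rank_exact_cdf(t: int, n: int) -> float:
    """Exact P[T <= t] under H0 for the signed rank statistic (no ties)."""
    max_sum = n * (n + 1) // 2
    counts = [0] * (max_sum + 1)  # counts[s] = number of rank subsets summing to s
    counts[0] = 1
    for rank in range(1, n + 1):
        # Go downwards so each rank is used at most once per subset.
        for total in range(max_sum, rank - 1, -1):
            counts[total] += counts[total - rank]
    return sum(counts[: t + 1]) / 2**n


def signed_rank_normal_cdf(t: float, n: int) -> float:
    """Normal approximation to P[T <= t] with the stated mean and sd."""
    mean = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return NormalDist(mean, sd).cdf(t)


exact_12 = signed_rank_exact_cdf(12, n=12)
normal_12 = signed_rank_normal_cdf(12, n=12)
print(round(exact_12, 4), round(normal_12, 4))
```

For n = 12 and t = 12 both values are close to 0.0171, so the approximation is already usable at the edge of the tabled range.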
Example
• In this example we are interested in the
quantity FVC (Forced Vital Capacity) in
patients with cystic fibrosis
• FVC (Forced Vital Capacity) = the volume of
air that a person can expel from the lungs in a
6 sec period.
• This will be reduced with time for cystic
fibrosis patients
The research question: Will this reduction be
less when a new experimental drug is
administered?
The Experimental Design
• The design will be a matched pair design
• Pairs of patients are matched (Using initial FVC
readings)
• One member of the pair is given the new drug the
other member is given a placebo
• We measure the reduction in FVC for each member
and compute the difference
– xi = Reduction in FVC (placebo) – Reduction in FVC
(drug)
These values will generally be positive if the drug is effective in minimizing the deterioration in Forced Vital Capacity (FVC). In that case W+ will be large (W- will be small).
Table: Reduction in forced vital capacity (FVC) for a matched pair sample of patients with cystic fibrosis
Subject   Reduction in FVC (Placebo)   Reduction in FVC (Drug)   Difference xi   Rank   Signed Rank
1 224 213 11 1 1
2 8 95 -15 2 -2
3 75 33 42 3 3
4 541 440 101 4 4
5 74 -32 106 5 5
6 85 -28 113 6 6
7 293 445 -152 7 -7
8 -23 -178 155 8 8
9 525 367 158 9 9
10 -38 140 -179 10 -10
11 508 323 185 11 11
12 255 10 245 12 12
13 525 65 460 13 13
14 1023 343 680 14 14
W+ = 86 W- = 19
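We have to judge if W+ = 86 is large (or W- = 19 is small).
$$\mu_W = \frac{n(n+1)}{4} = \frac{14(15)}{4} = 52.5$$
$$\sigma_W = \sqrt{\frac{n(n+1)(2n+1)}{24}} = \sqrt{\frac{14(15)(29)}{24}} = 15.92953$$
$$P[W^- \le 19] = P\left[\frac{W^- - 52.5}{15.92953} \le \frac{19 - 52.5}{15.92953}\right] \approx P[Z \le -2.10] = 0.0179 = \text{p-value}$$
Since the p-value is small (< 0.05) we conclude the drug is effective in reducing the deterioration of FVC.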
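The calculation for this example can be sketched in Python using the 14 differences from the table. The helper `signed_rank_sums` is illustrative and assumes there are no zero differences and no ties among the absolute values, which holds for these data.

```python
from math import sqrt
from statistics import NormalDist

# Differences xi (placebo minus drug) as listed in the table.
differences = [11, -15, 42, 101, 106, 113, -152, 155, 158, -179, 185, 245, 460, 680]


def signed_rank_sums(diffs):
    """Return (W+, W-): rank sums of positive and negative differences.

    Ranks are assigned by absolute value; assumes no zeros and no ties.
    """
    ordered = sorted(diffs, key=abs)
    w_plus = sum(rank for rank, d in enumerate(ordered, start=1) if d > 0)
    w_minus = sum(rank for rank, d in enumerate(ordered, start=1) if d < 0)
    return w_plus, w_minus


w_plus, w_minus = signed_rank_sums(differences)
n = len(differences)
mean_w = n * (n + 1) / 4
sd_w = sqrt(n * (n + 1) * (2 * n + 1) / 24)
z = (w_minus - mean_w) / sd_w
p_value = NormalDist().cdf(z)  # lower tail, because W- small supports HA
print(w_plus, w_minus, round(z, 2), round(p_value, 4))
```

The unrounded z is about -2.103, so the p-value differs from 0.0179 only in the fourth decimal place; the conclusion is unchanged.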
Summarizing: To carry out Wilcoxon's signed rank test we:
1. Compute T = W+ or W- (usually it would be the smaller of the two).
2. Let t_observed = the observed value of T.
3. Compute the p-value = P[T ≤ t_observed] (2 P[T ≤ t_observed] for a two-tailed test).
i. For n ≤ 12 use the table.
ii. For n > 12 use the Normal approximation.
4. Conclude HA (reject H0) if the p-value is less than 0.05 (or 0.01).
Alternative tests for this example
1. The t-test
test statistic $t = \frac{\bar{x} - 0}{s/\sqrt{n}} = \frac{\bar{x}}{s/\sqrt{n}}$
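To carry out the sign test we:
1. Compute S = the number of observations greater than m0.
2. Let s_observed = the observed value of S.
3. Compute the p-value = P[S ≤ s_observed] (2 P[S ≤ s_observed] for a two-tailed test) when small values of S support HA; use P[S ≥ s_observed] when large values of S support HA, i.e. when HA states that the median is greater than m0.
Use the table for the binomial distribution (p = ½, n = sample size).
Use the Normal approximation to the binomial if n is large.
4. Conclude HA (reject H0) if the p-value is less than 0.05 (or 0.01).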
• If m0 = true median then S ≈ n/2.
• If m0 > true median then S is small (close to 0).
• If m0 < true median then S is large (close to n).
Wilcoxon’s signed-Rank test
• For Wilcoxon’s signed-Rank test we would
assign ranks to the absolute values of (x1 – m0 , x2
– m0 , … , xn – m0).
• A rank of 1 to the value of xi – m0 which is
smallest in absolute value.
• A rank of n to the value of xi – m0 which is
largest in absolute value.
W+ = the sum of the ranks associated with
positive values of xi – m0.
W- = the sum of the ranks associated with
negative values of xi – m0.
Note: W+ + W- = 1 + 2+ 3+ …n = n(n + 1)/2
• If H0 is true (m0 = true median) then W+ ≈ W- ≈ n(n + 1)/4.
• If m0 > true median then W- is large and W+ is small.
• If m0 < true median then W- is small and W+ is large.
Nonparametric
Statistical Methods
Single sample nonparametric tests for central
location
1. The sign test
2. Wilcoxon’s signed rank test
Two-sample Non-parametric tests
Mann-Whitney Test
Under H0, U (U1 or U2) has mean nm/2.
• The distribution function of U (U1 or U2) has been tabled for various values of n and m (< n) when the two samples come from the same distribution.
• These tables can be used to set up critical
regions for the Mann-Whitney U test.
Example
A researcher was interested in comparing “brightness” of
paper prepared by two different processes
A measure of brightness in paper was made for n = m = 9
samples drawn randomly from each of the two processes. The
data is presented below:
Process A 6.1 9.2 8.7 8.9 7.6 7.1 9.5 8.3 9.0
Process B 9.1 8.2 8.6 6.9 7.5 7.9 8.3 7.8 8.9
Process A      Process B
xi   rank      yj   rank
6.1  1         9.1  16
9.2  17        8.2  8
8.7  12        8.6  11
8.9  13.5      6.9  2
7.6  5         7.5  4
7.1  3         7.9  7
9.5  18        8.3  9.5
8.3  9.5       7.8  6
9.0  15        8.9  13.5
WA = 94        WB = 77
Ranks are averaged because observations are tied.
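It can be shown that
$$U_A = nm + \frac{n(n+1)}{2} - W_A = 9(9) + \frac{9(10)}{2} - 94 = 32$$
and
$$U_B = nm + \frac{m(m+1)}{2} - W_B = 9(9) + \frac{9(10)}{2} - 77 = 49$$
From the table for n = m = 9, $P[U_i \le 18] = 0.025$.
Hence we will reject H0 if either UA or UB ≤ 18.
Thus H0 is accepted, since UA = 32 and UB = 49 are both greater than 18.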
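The rank sums and U statistics for the paper brightness data can be reproduced with the Python sketch below. The helper `midranks` is illustrative; it gives tied observations the average of the ranks they occupy, as in the table of ranks.

```python
def midranks(values):
    """Map each distinct value to its rank in the pooled sample,
    averaging the ranks of tied observations."""
    ordered = sorted(values)
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and ordered[j + 1] == ordered[i]:
            j += 1
        # Positions i..j (0-based) hold the same value; average ranks i+1..j+1.
        rank_of[ordered[i]] = (i + 1 + j + 1) / 2
        i = j + 1
    return rank_of


process_a = [6.1, 9.2, 8.7, 8.9, 7.6, 7.1, 9.5, 8.3, 9.0]
process_b = [9.1, 8.2, 8.6, 6.9, 7.5, 7.9, 8.3, 7.8, 8.9]

rank_of = midranks(process_a + process_b)
w_a = sum(rank_of[x] for x in process_a)
w_b = sum(rank_of[y] for y in process_b)

n, m = len(process_a), len(process_b)
u_a = n * m + n * (n + 1) / 2 - w_a
u_b = n * m + m * (m + 1) / 2 - w_b
print(w_a, w_b, u_a, u_b)
```

A quick consistency check is that UA + UB equals nm = 81, which holds for 32 and 49.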
The Mann-Whitney test for large
samples
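For large samples (n > 10 and m > 10) the statistics U1 and U2 have approximately a Normal distribution with mean and standard deviation
$$\mu_{U_i} = \frac{nm}{2}, \qquad \sigma_{U_i} = \sqrt{\frac{nm(n+m+1)}{12}}$$
Thus we can convert Ui to a standard normal statistic
$$z = \frac{U_i - \mu_{U_i}}{\sigma_{U_i}} = \frac{U_i - \frac{nm}{2}}{\sqrt{\frac{nm(n+m+1)}{12}}}$$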
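The conversion to a standard normal statistic is simple enough to sketch in a few lines of Python. The sample sizes n = m = 12 and the value U = 40 below are hypothetical, chosen only to satisfy the large-sample condition; the function name `mann_whitney_z` is illustrative.

```python
from math import sqrt


def mann_whitney_z(u: float, n: int, m: int) -> float:
    """Standardize U using mean nm/2 and sd sqrt(nm(n+m+1)/12)."""
    mean_u = n * m / 2
    sd_u = sqrt(n * m * (n + m + 1) / 12)
    return (u - mean_u) / sd_u


# Hypothetical illustration with n = m = 12 and U1 = 40, so U2 = nm - U1 = 104.
z1 = mann_whitney_z(40, n=12, m=12)
z2 = mann_whitney_z(12 * 12 - 40, n=12, m=12)
print(round(z1, 3), round(z2, 3))
```

Because U1 + U2 = nm, the two z values are equal in size and opposite in sign, so either statistic leads to the same two-sided conclusion.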
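The Kruskal-Wallis statistic for k samples of sizes n1, …, nk (N = n1 + … + nk observations in total, ranked together from 1 to N) is
$$K = \frac{12}{N(N-1)} \sum_{i=1}^{k} n_i \left(\bar{r}_i - \bar{r}\right)^2 = \frac{12}{N(N-1)} \sum_{i=1}^{k} \frac{U_i^2}{n_i} - \frac{3(N+1)^2}{N-1}$$
where $U_i = \sum_{j=1}^{n_i} r_{ij} = r_{i1} + \dots + r_{in_i}$ is the sum of the ranks in sample i, $\bar{r}_i = U_i / n_i$ is the mean rank in sample i, and $\bar{r} = (N+1)/2$ is the overall mean rank.
Reject
H0: the k populations have same central location
if $K \ge \chi^2_{\alpha}$ with d.f. = k − 1.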
Example
In this example we are measuring an enzyme
level in three groups of patients who have
received “open heart” surgery. The three groups
of patients differ in age:
1. Age 30 – 45
2. Age 46 – 60
3. Age 61+
The data
Age Group
30 - 45   46 - 60   61+
218.3     166.1     197.9
140.8     124.1     120.6
268.4     84.3      81.1
201       124.2     181.6
          70.2      118.9
          70.7
Computation of the Kruskal-Wallis
statistic
The raw data
Age Group
30 - 45   46 - 60   61+
218.3     166.1     197.9
140.8     124.1     120.6
268.4     84.3      81.1
201       124.2     181.6
          70.2      118.9
          70.7
The data ranked
Age Group
     30 - 45   46 - 60   61+
     14        10        12
     9         7         6
     15        4         3
     13        8         11
               1         5
               2
ni   4         6         5
r̄i   12.75     5.33      7.40
Ui   51        32        37
$$\sum_{i=1}^{3} \frac{U_i^2}{n_i} = \frac{51^2}{4} + \frac{32^2}{6} + \frac{37^2}{5} = 1094.717$$
The Kruskal-Wallis statistic
$$K = \frac{12}{N(N-1)} \sum_{i=1}^{k} \frac{U_i^2}{n_i} - \frac{3(N+1)^2}{N-1} = \frac{12}{15(14)}(1094.717) - \frac{3(16)^2}{14} = 7.698$$
Since $K > \chi^2_{0.05} = 5.99$ for d.f. = k − 1 = 2, H0 is rejected.
There are significant differences in the central enzyme levels between the three age groups.
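The whole computation can be checked with the Python sketch below, which ranks the pooled enzyme levels (there are no ties in these data) and evaluates the statistic. The worked example divides by N(N − 1) with N = 15; as a point of comparison only, and not something stated in the slides, the sketch also evaluates the form more commonly given in textbooks, H = 12/(N(N + 1)) Σ Ui²/ni − 3(N + 1). The variable names are illustrative.

```python
groups = {
    "30-45": [218.3, 140.8, 268.4, 201.0],
    "46-60": [166.1, 124.1, 84.3, 124.2, 70.2, 70.7],
    "61+": [197.9, 120.6, 81.1, 181.6, 118.9],
}

# Rank all N observations together (valid as written only when there are no ties).
pooled = sorted(x for g in groups.values() for x in g)
rank_of = {x: i for i, x in enumerate(pooled, start=1)}

rank_sums = {name: sum(rank_of[x] for x in g) for name, g in groups.items()}
N = len(pooled)
sum_u2_over_n = sum(rank_sums[name] ** 2 / len(g) for name, g in groups.items())

# Form used in the worked example (denominators N(N-1) and N-1).
k_slides = 12 / (N * (N - 1)) * sum_u2_over_n - 3 * (N + 1) ** 2 / (N - 1)
# Common textbook form, shown only for comparison (an addition, not from the slides).
h_standard = 12 / (N * (N + 1)) * sum_u2_over_n - 3 * (N + 1)

print(rank_sums, round(sum_u2_over_n, 3), round(k_slides, 3), round(h_standard, 3))
```

The two forms differ by the factor (N + 1)/(N − 1). For these data both values (about 7.70 and 6.74) exceed 5.99, so the decision to reject H0 is the same either way.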