
STA2020F

Non-parametric Statistical Techniques

Original Authors:
Allan Clark, Kutlwano Ramaboa, Karl Stielau, Christien Thiart, Melvin Varaghuse
Revised for ERT by Ehsaan Rajak with help from slides as prepared by Neil Watson
Department of Statistical Sciences
University of Cape Town

Contents

1 Introduction
    1.1 Parametric vs Non-parametric Tests
    1.2 Review – Measurement Scales or Types of Data
    1.3 Brief Overview of Non-parametric Tests
        1.3.1 Tests for one population
        1.3.2 Tests for two populations
        1.3.3 Tests for three or more populations
        1.3.4 Tests of association
2 Single Sample Tests
    2.1 The Runs Test for Randomness
3 Two Sample Tests
    3.1 The Sign Test
    3.2 The Wilcoxon Signed Rank Sum Test
4 Two Independent samples
    4.1 The Wilcoxon Rank Sum Test
5 k-Independent Samples
    5.1 The Kruskal-Wallis Test
6 k-Matched or Related Samples
    6.1 The Friedman Test
7 Tests of Association
    7.1 The Spearman Rank-Order Correlation Coefficient Test
8 Advantages and Disadvantages of Non-parametric Tests
9 Additional Exercises

1 Introduction

REVIEW CHAPTERS 8, 9 AND 10 OF INTROSTAT: APPLICATION OF THE NORMAL DISTRIBUTION IN HYPOTHESIS TESTING

1.1 Parametric vs Non-parametric Tests

The inferential statistics that you have encountered thus far, such as the t-test and ANOVA, are
examples of parametric tests. A parametric test makes many assumptions about the nature of the
population from which the observations or data were drawn (e.g. normal distribution; two samples
of data drawn from populations having the same variance (σ²), etc.). Parametric tests are more
powerful (see footnote 1) when all the assumptions required by a particular statistical test are met.

In contrast, non-parametric tests make very few and less stringent assumptions about the underlying
distribution of the data. Thus, non-parametric statistics are useful when the required distribution
of the data is unknown or other assumptions required by a parametric test are not met. This is
because the majority of the non-parametric tests do not focus on the numerical values of the scores
but rather on the rank of the scores.

1.2 Review – Measurement Scales or Types of Data

Data comes in different forms and it is important to know about these because it is part of what
determines the statistical test that you can use to analyse your data.

There are two main classifications of scales of measurement, and four scales of measurement that
must be considered:

CATEGORICAL vs. NUMERICAL (qualitative vs. quantitative)

Data that represent categories, such as nominal and ordinal observations, are collectively called
categorical (or qualitative) data. Data that are counted or measured using a numerically defined
method are called numerical (or quantitative) data.

Nominal-scale Data that is measured on a nominal scale can be placed into categories, but the
categories do not have a natural order. For example, colour (black, white, red, yellow, etc.)

Ordinal-scale Data that is measured on an ordinal scale can be placed into categories, but the
objects in one category of a scale are not only different from the objects in the other categories
of that scale but also stand in some kind of relation to them. However, the differences between
categories cannot be defined. For example, students’ grade on a course, where possible grades
are A, B, C, D, etc.; when we classify the size of companies into ‘small’, ‘medium’ and ‘large’;
Likert scales; etc.

Footnote 1: Power in the statistical sense refers to how likely a test is to reject a false null hypothesis.

Interval-scale Data that is measured on an interval scale has all the characteristics of an ordinal
scale and, in addition, the differences between any two numbers on the scale have meaning.
Ratios between numbers on this scale are not meaningful, so operations such as multiplication
and division cannot be carried out directly, although ratios of differences can be expressed. The
zero point does not indicate the absence of the characteristic being measured, but is arbitrary
or undefined. Examples of interval scales are rare, but include calendar time (dates) and
temperature in degrees centigrade (zero degrees is not ‘no temperature’).

Ratio-scale Data that is measured on a ratio scale has all the characteristics of an interval scale
and, in addition, has a true zero point as its origin (i.e. a zero does indicate the absence of the
characteristic being measured). Examples include length; weight; etc.

Note that for most statistical procedures, the distinction between interval-scale and ratio-scale
does not matter and it is common to use the term “interval” to refer to ratio data as well.

Knowledge Check
Determine the type of data for the following:

1. The number of students in a statistics class.
2. The make of car driven by each of a sample of executives.
3. The rating (Extremely poor[1], Very Poor[2], Poor[3], Unsure[4], Good[5], Very Good[6],
Excellent[7]) reported for a particular television program by each of a sample of viewers.
4. The weekly closing price of gold throughout the year.
5. The month of highest sales for each firm in a sample.
6. The socio-economic status of people who reside in Cape Town (upper class, middle class, lower
class).
7. The responses by citizens on a 5-point rating scale (where 1=Strongly Disagree, 2=Disagree,
3=Unsure, 4=Agree, 5=Strongly Agree) to the statement:
“South Africa should be divided into two time zones”.
8. The gender of UCT employees.
9. The maximum temperature recorded in March.
10. The rating (excellent, good, fair or poor) given to a particular television program by each of a
sample of viewers.

Solution

1. Ratio-Scale
2. Nominal
3. Ordinal
4. Ratio-Scale
5. Nominal
6. Ordinal
7. Ordinal
8. Nominal
9. Interval-Scale
10. Ordinal

1.3 Brief Overview of Non-parametric Tests

Different tests require an assumption about the measurement scale of the data. The following tables
are a summary of the various nonparametric tests that will be studied in this course.

1.3.1 Tests for one population

Test | Data type | Data | Corresponding parametric test
Tests for randomness of order (Runs test) | Nominal | Independent observations |
1.3.2 Tests for two populations

Test | Data type | Data | Corresponding parametric test
Wilcoxon Rank Sum (also called Mann-Whitney) | Ordinal or non-normal quantitative | Independent samples | t-test for difference of means
Wilcoxon Signed Rank Sum | Non-normal quantitative | Matched/paired samples | Matched pairs t-test for difference of means
Sign Test | Ordinal | Matched/paired/dependent samples | Matched pairs t-test for difference of means

1.3.3 Tests for three or more populations

Test | Data type | Data | Corresponding parametric test
Kruskal-Wallis | Ordinal or non-normal quantitative | Independent samples | Single factor one-way ANOVA
Friedman | Ordinal or non-normal quantitative | Matched/blocked dependent samples in experimental design | Randomized block ANOVA

1.3.4 Tests of association

Test | Data type | Data | Corresponding parametric test
Spearman’s Rank Correlation | Ordinal or non-normal quantitative | Two random samples | Pearson’s correlation coefficient

2 Single Sample Tests

2.1 The Runs Test for Randomness

In every statistical test and estimation procedure that we have encountered so far, we have assumed
that the data comprise a random sample from the population. It is possible for a data set to not be
a random sample from a population, but to have some internal sequential pattern. A majority of
statistical tests however require that the data be random.

Random means that the process generating the sample produces a sequence of data within which the
values are independent of one another. The statistical test that enables us to determine
whether the data are random is called the Runs test. The Runs test is based on the order or sequence
in which the data were originally obtained.

Data Assumptions

• The data must be observed and recorded in some natural or chronological order.
• The data (either originally or after some transformation) consists of two mutually exclusive
and exhaustive categories. For example, the following sequence of data has two categories only:

M M M F M M F F

Terminology

The Runs test is based on the number of runs which a sample exhibits.

• A run is defined as any sequence of observations of one type (i.e. a succession of identical
categories), bounded by observations of the other type, or by no observations. For example,
the following sequence:

M M M F M M F F

has R = 4 runs in n = 8 observations. The sample begins with a run of three Ms followed by
a run which consists of one F, then another run which consists of two Ms, followed by a run of
two Fs. By grouping and numbering each succession of identical observations, we observe
the 4 runs.

(M M M) (F) (M M) (F F)
   1     2    3     4

The total number of runs in a sample can give an indication of whether or not the sample is
random. Non-randomness is observed when either:

1. Too many runs occur. This feature indicates that observations in one category tend to
follow observations of the other category, and form a repeated alternating pattern within
the observed data sequence.

For example,

N D N D N D N D N D N D N D N N

Number of runs R = ____ in n = ____ observations.

2. Too few runs occur. This feature indicates that observations tend to have the same cate-
gory as their predecessors, and hence definite grouping or clustering is present within the
observed data sequence. For example,

N N N N N N N N N D D D D D D D

Number of runs R = ____ in n = ____ observations.

• The length of the run, l = number of observations in the run. For example, in the following
sequence:

M M M F M M F F

there are 4 runs in 8 observations, and the length of the first run l1 = 3.

What is the length of the last run in the above sequence? l4 = ____

Hypotheses

We construct the null and alternative hypotheses as follows:

Null Hypothesis: H0: The sequence of data is random
Alternative Hypothesis: H1: The sequence of data is not random; a pattern exists (Two-tailed test)

Data Summary

• calculate R, the number of runs in the binary data
• determine n1 and n2, the number of observations in each type of category (note: n = n1 + n2)
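
The data summary above can be sketched in a few lines of Python. This is only an illustration, not part of the course material: the function name runs_summary is made up for this sketch, and the input is the M/F sequence used in the Terminology section.

```python
from itertools import groupby


def runs_summary(sequence):
    """Return (R, counts, runs) for a sequence containing two category labels.

    R      : number of runs
    counts : dict mapping each label to its number of observations (n1 and n2)
    runs   : list of (label, run length) in the order the runs occur
    """
    # groupby collects consecutive identical labels, so each group is one run.
    runs = [(label, len(list(group))) for label, group in groupby(sequence)]
    counts = {}
    for label, length in runs:
        counts[label] = counts.get(label, 0) + length
    if len(counts) > 2:
        raise ValueError("the Runs test needs exactly two categories")
    return len(runs), counts, runs


# The sequence from the Terminology section.
R, counts, runs = runs_summary("M M M F M M F F".split())
print("R =", R)            # number of runs
print("counts =", counts)  # observations per category
print("runs =", runs)      # label and length of each run
```

The run lengths l1, l2, ... are the second entries of the pairs in runs.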

Test Statistic

Small Sample Test statistic (for n1 ≤ 20 AND n2 ≤ 20)

Test statistic = R (number of runs)

Large Sample Test statistic (for n1 > 20 AND/OR n2 > 20)

For large samples (i.e. n1 > 20 AND/OR n2 > 20), the sampling distribution of R is approximately
Gaussian (normal), with

\[
\mu_R = \frac{2 n_1 n_2}{n_1 + n_2} + 1, \qquad
\sigma_R^2 = \frac{2 n_1 n_2 \,(2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 \,(n_1 + n_2 - 1)}
\]

and the test statistic is:

\[
z = \frac{R - \mu_R}{\sigma_R}
\]
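
As a rough sketch of how the large-sample statistic could be computed, the snippet below evaluates the mean, standard deviation and z-value of R. The function name runs_z and the inputs (R = 20, n1 = 25, n2 = 30) are hypothetical values chosen only for illustration; they do not come from any example in these notes.

```python
import math


def runs_z(R, n1, n2):
    """Large-sample Runs test: return (mu_R, sigma_R, z)."""
    n = n1 + n2
    mu = 2 * n1 * n2 / n + 1
    variance = 2 * n1 * n2 * (2 * n1 * n2 - n1 - n2) / (n ** 2 * (n - 1))
    sigma = math.sqrt(variance)
    return mu, sigma, (R - mu) / sigma


# Hypothetical inputs, for illustration only.
mu_R, sigma_R, z = runs_z(R=20, n1=25, n2=30)
print(f"mu_R = {mu_R:.3f}, sigma_R = {sigma_R:.3f}, z = {z:.3f}")
```

A negative z means fewer runs than expected (clustering); a positive z means more runs than expected (alternation).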

Critical Region

Small Sample Critical Region (for n1 ≤ 20 AND n2 ≤ 20)

• The critical values for the small sample Runs test are given in the Runs Test table
• There are two critical values, and we reject H0 if R is smaller than or equal to RL, the lower
critical value, or if R is greater than or equal to RU, the upper critical value, and infer that
the sequence of data is not random (that is, the categories either cluster together or alternate
too regularly).

Large Sample Critical Region (for n1 > 20 AND/OR n2 > 20)

• We reject H0 if |z| ≥ c, where c is the critical value, or if the calculated p-value ≤ α, where
p-value = Pr(|Z| ≥ |z|) and z is the observed value of the test statistic.
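
The two equivalent decision rules can be checked numerically. The sketch below uses the NormalDist class from Python's standard statistics module; the helper names two_sided_p_value and critical_value are assumptions made for this illustration, and in the course the same numbers would be read from the normal tables.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def two_sided_p_value(z):
    """p-value = Pr(|Z| >= |z|) for a standard normal Z."""
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))


def critical_value(alpha):
    """Two-tailed critical value c such that Pr(|Z| >= c) = alpha."""
    return STANDARD_NORMAL.inv_cdf(1 - alpha / 2)


z_observed = -2.27  # hypothetical test statistic, for illustration only
alpha = 0.05
c = critical_value(alpha)
p = two_sided_p_value(z_observed)
print(f"c = {c:.3f}, p-value = {p:.4f}")
print("reject H0:", abs(z_observed) >= c, "and equivalently", p <= alpha)
```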

Class Examples

Example 1: Consider an industrial production line which produces television sets. As each
successive set leaves the production line it is subjected to a quality control check and classified as
either non-defective (N) or defective (D).
We expect the production sequence of sets to be random. If not, it may indicate a systematic
problem in production which should be corrected.
The following results from a two-hour period were recorded in the order in which the television
sets were checked:

N N N D N N D D N N N N N D D

1. Find n, n1 , n2 and R
2. Test at the 5% significance level whether the sequence of data is random.

Example 2: Assume that the results of the quality control check were extended to cover an entire
8-hour shift. The following results were recorded:

N N D N D D N N D D D N N N N N D N N
N D N N N D N D N N N N N D D D N D N
N D N N N N D N D N N N D D D N N N N
N N N

1. Find n, n1 , n2 and R
2. Test at the 2.5% significance level whether the full sequence of data is random
3. What is the p-value of the test?

Example 3: In a study of aggression in young children, an experimenter observed pairs of children
in a controlled play situation. Most of the 24 children who served as subjects in the study
came from the same nursery school and thus played together daily. Since the experimenter
was able to arrange to observe only two children on any day, she was concerned that biases
might be introduced into the study by conversations among the children who had already
served as subjects and those who were to serve later. If such discussions had any effect on the
level of aggression in the play sessions, this effect might show up as lack of randomness in the
aggression scores in the order in which they were collected. After the study was completed,
the randomness of the sequence of scores was tested by converting each child’s aggression score
to a plus or minus depending upon whether that score fell above or below the group median
score.
The table below shows the aggression scores for each child in the precise order in
which the scores were obtained. If the median of the set of scores is 25.5,

1. Determine the position of the score with respect to the median by assigning a + if above
median, or − if below the median
2. Find n, n1 , n2 and R
3. Test the hypothesis of random order

Child | Score | Position of the score with respect to the median
1 31
2 23
3 36
4 43
5 51
6 44
7 12
8 26
9 43
10 75
11 2
12 3
13 15
14 18
15 78
16 24
17 13
18 27
19 86
20 61
21 13
22 7
23 6
24 8

3 Two Sample Tests

3.1 The Sign Test

The Sign test gets its name from the fact that it is based on the sign of the difference between two
related observations. The test is used when we wish to compare two matched or paired populations
when the data are ordinal.

Data Assumptions

• Two paired samples, or equivalently, one sample of bivariate data.
• Both variables must be ordinal.
• The n pairs of observations are statistically independent.
• Under H0, the count of + signs (or − signs) follows a binomial distribution with p = 0.5

Hypotheses
The null hypothesis for the Sign test is that there is no difference between the two populations. If
this assumption is true, then the number of + signs (or − signs) should have a binomial distribution
with p = 0.5.

Null Hypothesis: H0: The two population locations (medians) are the same (OR p = 0.5)
Alternative Hypothesis: H1: The two population locations (medians) are not the same
(OR p ≠ 0.5) (Two-tailed test)

OR H1: The location on an ordinal scale of the first population is to the right of the
location of the second population (OR p > 0.5) (One-tailed test)

OR H1: The location on an ordinal scale of the first population is to the left of the
location of the second population (OR p < 0.5) (One-tailed test)

Data Summary

• calculate the difference for each pair (i.e. for each observation, subtract the 2nd value from
the 1st value).
• eliminate all pairs whose difference = 0, and effectively reduce the sample size. NOTE: n =
number of non-zero differences

• Record the sign of all the paired-differences (we usually designate a plus sign (+) for a positive
difference, and a minus sign (−) for a negative difference)

Let S− = count of − signs, S+ = count of + signs, and n = S− + S+.

Test Statistic

Small Sample Test statistic (for n ≤ 20)


Test statistic S = S+ = count of + signs. The sampling distribution of S+ is binomial with
parameters n (the number of non-zero differences) and p = 0.5. Can you explain why this is true?
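
If H0 holds, each non-zero difference is equally likely to be positive or negative, independently of the others, which is why S+ behaves like a binomial count. The sketch below computes exact binomial tail probabilities instead of reading them from tables. The helper names and the inputs (S+ = 8 out of n = 10) are hypothetical and serve only as an illustration.

```python
from math import comb


def upper_tail(s, n):
    """Pr(S >= s) when S ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(s, n + 1)) / 2 ** n


def lower_tail(s, n):
    """Pr(S <= s) when S ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(0, s + 1)) / 2 ** n


def two_sided_p_value(s_plus, n):
    """Two-tailed p-value: double the smaller tail, capped at 1."""
    return min(1.0, 2 * min(upper_tail(s_plus, n), lower_tail(s_plus, n)))


# Hypothetical result: 8 plus signs among 10 non-zero differences.
p_right = upper_tail(8, 10)        # one-tailed test with H1: p > 0.5
p_two = two_sided_p_value(8, 10)   # two-tailed test
print(f"one-tailed p-value = {p_right:.4f}, two-tailed p-value = {p_two:.4f}")
```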

Large Sample Test statistic (for n > 20)


For large samples (i.e. n > 20), let S = S+ = count of + signs. The sampling distribution of S is
approximately Gaussian with

\[
\mu_S = \frac{n}{2}, \qquad \sigma_S^2 = \frac{n}{4}
\]

and the test statistic is:

\[
z = \frac{(S \pm 0.5) - \mu_S}{\sigma_S}
\]

where (S + 0.5) is used when S < n/2, and (S − 0.5) is used when S > n/2. The 0.5 is a “correction
for continuity”. The correction is necessary because the Gaussian distribution is continuous while
the Sign test involves discrete values.
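
A minimal sketch of the continuity-corrected statistic follows. The function name sign_test_z and the inputs (S = 30, n = 40) are hypothetical. The notes do not say what to do when S equals n/2 exactly; the sketch assumes no correction in that case, which gives z = 0.

```python
import math


def sign_test_z(S, n):
    """Large-sample Sign test statistic with the 0.5 continuity correction."""
    mu = n / 2
    sigma = math.sqrt(n / 4)
    if S < mu:
        corrected = S + 0.5   # move half a unit towards the mean
    elif S > mu:
        corrected = S - 0.5
    else:
        corrected = S         # assumption: no correction when S == n/2
    return (corrected - mu) / sigma


# Hypothetical counts, for illustration only.
z = sign_test_z(S=30, n=40)
print(f"z = {z:.3f}")
```

The correction always moves S towards n/2, so it makes |z| slightly smaller than the uncorrected value.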

Critical Region

Small Sample Critical Region (for n ≤ 20)

• We refer to the binomial tables to find the critical region which is defined by the alternative
hypothesis
• Reject H0 if p-value ≤ α

Large Sample Critical Region (for n > 20)

• We reject H0 if |z| ≥ c, where c is the critical value, or the calculated p-value ≤ α.

Class Examples

Example 1: In an experiment to determine which car is perceived to have the more comfortable
ride, 25 people took two rides:
− One ride in a European model.
− One ride in a North American car.

Each person rated the cars on a scale of 1 (ride is very uncomfortable) to 5 (ride is very
comfortable). Do the data below allow us to conclude that the European car is
perceived to be more comfortable? Test at the 1% significance level.

Person European American Difference
1 4 5
2 2 1
3 5 4
4 3 2
5 2 1
6 5 3
7 1 3
8 4 2
9 4 2
10 2 2
11 3 2
12 4 3
13 2 1
14 3 4
15 2 1
16 4 3
17 2 1
18 4 3
19 5 4
20 3 1
21 4 2
22 3 3
23 3 4
24 5 2
25 2 3

Example 2: Two different additives were compared to see which one is better for improving the
durability of concrete. 100 small batches of concrete were mixed under various conditions, and
during the mixing each batch was divided into two parts. One part received additive A and
the other part received additive B. After the concrete hardened, the two parts in each batch
were crushed against each other, and an observer rated the two parts to determine the part
that appeared to be most durable. In 77 cases the concrete with additive A was rated more
durable; in 23 cases the concrete with additive B was rated more durable. Is there a significant
difference between the effects of the two additives?

Example 3: Some 22 customers in a grocery store were asked to taste each of two types of cheese
and declare their preference. 7 customers preferred one kind, 12 preferred the other kind, and
3 had no preference. Does this data set indicate a significant difference in preference? Test at
the 5% significance level.

3.2 The Wilcoxon Signed Rank Sum Test

The Sign test discussed above utilizes information only about the direction of the differences within
paired observations. The Wilcoxon Signed Rank Sum test is used if we wish to consider the direction
as well as the size of the difference within paired observations. Thus the test is used when we wish
to compare two matched or paired populations when the data are quantitative (i.e. either interval
or ratio-scaled).

The Wilcoxon Signed Rank Sum test is a non-parametric counterpart of the parametric matched or
paired t-test.

Data Assumptions

• Two paired samples, or one sample of bivariate data.


• The data is quantitative but not Gaussian (normal).
• The n paired observations are statistically independent.
• The distribution of the population of differences within pairs is symmetric.

Hypotheses

Null Hypothesis: H0: The location of the paired differences = 0
Alternative Hypothesis: H1: The location of the paired differences ≠ 0 (Two-tailed test)
OR H1: The location of the paired differences > 0 (One-tailed test)
OR H1: The location of the paired differences < 0 (One-tailed test)

Data Summary

• calculate the difference for each pair
• calculate the absolute value of the difference for each pair
• eliminate all differences equal to 0

NOTE: n = number of non-zero differences

• rank the absolute value of the differences

NOTE: Data are ranked by ordering them from lowest to highest and assigning them, in order,
the integer values from 1 to n (e.g. numbers 11, 16, 17, 25, 31 are assigned ranks 1, 2, 3, 4,
and 5). Ties are resolved by assigning any tied values the mean of the ranks they would have
received if there were no ties (e.g. numbers 11, 17, 17, 25, 31 are assigned ranks 1, 2.5, 2.5, 4,
and 5. The two tied numbers, which would have been assigned rank 2 and 3, are assigned the
mean of 2 and 3 = 2.5).

• record the sign of the paired difference (we usually designate a plus sign (+) for a positive
difference, and a minus sign (−) for a negative difference)
• calculate T, the sum of the signed ranks.
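
The steps above can be sketched as follows. The helper names average_ranks and signed_rank_sum are made up for this illustration, and the paired values are hypothetical. Differences are rounded before ranking (an assumption of this sketch) so that decimal data which tie on paper also tie in floating-point arithmetic.

```python
def average_ranks(values):
    """Rank from smallest (1) to largest (n); tied values share the mean rank."""
    positions = {}
    for position, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def signed_rank_sum(first, second, decimals=10):
    """Return (T, n): the sum of signed ranks and the number of non-zero differences."""
    diffs = [round(a - b, decimals) for a, b in zip(first, second)]
    diffs = [d for d in diffs if d != 0]               # eliminate zero differences
    ranks = average_ranks([abs(d) for d in diffs])     # rank the absolute differences
    T = sum(r if d > 0 else -r for d, r in zip(diffs, ranks))
    return T, len(diffs)


# The tie example from the NOTE above.
print(average_ranks([11, 17, 17, 25, 31]))

# Hypothetical paired observations, for illustration only.
first = [10, 12, 9, 15, 11, 14]
second = [8, 12, 10, 11, 14, 9]
T, n = signed_rank_sum(first, second)
print("T =", T, "n =", n)
```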

Test Statistic

Large Sample Test statistic (for n ≥ 10)


For large samples (i.e. n ≥ 10), the sampling distribution of T is approximately Gaussian with

\[
\mu_T = 0, \qquad \sigma_T^2 = \frac{n(n + 1)(2n + 1)}{6}
\]

and the test statistic is:

\[
z = \frac{T - \mu_T}{\sigma_T}
\]
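
A short sketch of this calculation is given below. The function name signed_rank_z and the inputs (T = 35, n = 10) are hypothetical and only illustrate the arithmetic.

```python
import math


def signed_rank_z(T, n):
    """Large-sample Wilcoxon Signed Rank Sum statistic (mu_T = 0 under H0)."""
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 6)
    return (T - 0) / sigma


# Hypothetical values: n = 10 non-zero differences with signed rank sum T = 35.
z = signed_rank_z(T=35, n=10)
print(f"z = {z:.3f}")
```

Since the ranks are 1 to n, T always lies between −n(n + 1)/2 and +n(n + 1)/2, which gives a quick sanity check on a hand calculation.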

Critical Region
Large Sample Critical Region (for n ≥ 10)

• We reject H0 if |z| ≥ c, where c is the critical value, or calculated p-value ≤ α.

Class Examples

Example 1: A manufacturing firm is attempting to determine if a difference exists in
task-completion times for two different production methods.
A sample of 11 workers was selected at random, and each worker completed a production task
using each of the two production methods. (Note: the choice of method used first was selected
randomly to avoid biases).
Do the data below (in minutes) indicate that the methods are significantly different in terms
of task completion times? Use a 5% significance level.

Worker Method 1 Method 2 Difference |Difference| Rank Signed Rank


1 16.2 15.7
2 13.7 13.7
3 21.6 22.4
4 14.9 12.6
5 16.7 15.8
6 18.4 18.3
7 16.1 15.6
8 15.7 14.9
9 18.2 18.3
10 17 17.3
11 17.6 16.9

Example 2: A regional water authority has been carrying out some new pollution-control measures
on one of the main rivers under its control.
Pollution was measured at 12 sites before new controls were implemented, and then again four
years later at the same 12 sites.
The water authority wants to determine whether the new controls have been effective. Use a
2.5% significance level.

Site Before After Difference |Difference| Rank Signed Rank


1 17.4 13.6 3.8 3.8
2 15.7 10.1 5.6
3 12.9 10.3 2.6 2.6
4 9.8 9.2 0.6
5 13.4 11.1 2.3 2.3
6 18.7 20.4 −1.7
7 13.9 10.4 3.5 3.5
8 11 11.4 −0.4
9 5.4 4.9 0.5 0.5
10 10.4 8.9 1.5 1.5
11 16.4 11.2 5.2
12 5.6 4.8 0.8 0.8

Example 3: Does a flexi-time work-schedule help reduce the travel time of workers to work?
A random sample of 32 workers was selected, and workers recorded their travel time before
and after the program was implemented.

8:00-Arr Flexi-time Difference |Difference| Rank Signed Rank


34 31
35 31
43 44
46 44
16 15
26 28
68 63
38 39
61 63
52 54
68 65
13 12
69 71
18 13
53 55
18 19
41 38
25 23
17 14
26 21
44 40
30 33
19 18
48 51
29 33
24 21
51 50
40 38
26 22
20 19
19 21
42 38

4 Two Independent samples

4.1 The Wilcoxon Rank Sum Test

Also called the U-test, the Mann-Whitney test, the Wilcoxon-Mann-Whitney test, or just the Rank
Sum test.

The Wilcoxon Rank Sum test is used to determine whether two independent random sample groups
have been drawn from the same population, and is a counterpart to the parametric t-test for two
normal independent random samples.

Data Assumptions

• 2 random samples, of size n1 and n2 .


• the data are either ordinal or quantitative (but with completely arbitrary distributions and
not necessarily normal).
• samples and observations within samples are independent.

• the distributions of the two populations differ with respect to location only (if they differ at
all).

Hypotheses

Null Hypothesis: H0: The two population locations are the same
Alternative Hypothesis: H1: The two population locations are different (Two-tailed test)
OR H1: The location of the first population is to the right of the location of the
second population (One-tailed test, >)
OR H1: The location of the first population is to the left of the location of the
second population (One-tailed test, <)

Data Summary

• combine the two samples
• rank the combined set of observations from smallest to largest, that is, from 1 to n1 + n2
(remember to assign the mean value of ranks to each observation for ties)
• find the sum of the ranks for each sample, T1 and T2. (Note that T1 + T2 = (n1 + n2)(n1 +
n2 + 1)/2.)
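
The data summary above can be sketched in Python as follows. The helper names and the two samples are hypothetical and chosen only to show the pooling, the tie handling and the check on T1 + T2.

```python
def average_ranks(values):
    """Rank from smallest (1) to largest (n); tied values share the mean rank."""
    positions = {}
    for position, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def rank_sums(sample1, sample2):
    """Return (T1, T2), the rank sums of each sample within the combined ranking."""
    ranks = average_ranks(list(sample1) + list(sample2))
    n1 = len(sample1)
    return sum(ranks[:n1]), sum(ranks[n1:])


# Hypothetical samples, for illustration only (15 appears in both samples).
sample1 = [12, 15, 11, 18]
sample2 = [14, 15, 20, 22, 19]
T1, T2 = rank_sums(sample1, sample2)
n_total = len(sample1) + len(sample2)
print("T1 =", T1, "T2 =", T2)
print("check:", T1 + T2, "=", n_total * (n_total + 1) / 2)
```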

Test Statistic

Small Sample Test statistic (for n1 and n2 ≤ 10)


Test statistic T = T1 (= sum of ranks of the first sample).

Large Sample Test statistic (for n1 or n2 or both > 10)

For large samples, the sampling distribution of T is approximately Gaussian with

\[
\mu_T = \frac{n_1 (n_1 + n_2 + 1)}{2}, \qquad
\sigma_T^2 = \frac{n_1 n_2 (n_1 + n_2 + 1)}{12}
\]

and the test statistic is:

\[
z = \frac{T - \mu_T}{\sigma_T}
\]
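
As an illustration of the large-sample calculation, the sketch below evaluates the mean, standard deviation and z-value of T = T1. The function name rank_sum_z and the inputs (T = 130, n1 = 12, n2 = 15) are hypothetical.

```python
import math


def rank_sum_z(T, n1, n2):
    """Large-sample Wilcoxon Rank Sum test: return (mu_T, sigma_T, z) for T = T1."""
    mu = n1 * (n1 + n2 + 1) / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return mu, sigma, (T - mu) / sigma


# Hypothetical values, for illustration only.
mu_T, sigma_T, z = rank_sum_z(T=130, n1=12, n2=15)
print(f"mu_T = {mu_T:.1f}, sigma_T = {sigma_T:.3f}, z = {z:.3f}")
```

A negative z indicates that the first sample tends to hold the smaller ranks, that is, its location lies to the left of the second.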

Critical Region

Small Sample Critical Region (for n1 and n2 ≤ 10)

• The critical values for the small sample Wilcoxon Rank Sum test are given in the Wilcoxon
Rank Sum Test Tables.
• There are two critical values, and we reject H0 if T is smaller than or equal to TL , the lower
critical value, or if T is greater than or equal to TU , the upper critical value.
• The upper critical value can be calculated by TU = n1 (n1 + n2 + 1) − TL

Large Sample Critical Region (for n1 or n2 or both > 10)

We reject H0 if |z| ≥ c, where c is the critical value, or calculated p-value ≤ α.

Class Examples

Example 1: Based on the two independent samples shown below, can we infer at the 5% significance
level that the location of population 1 is to the left of the location of population 2?
Sample 1: 20, 23, 22, 18, 24
Sample 2: 22, 27, 26, 28, 25

Example 2: The ABC Company has sent 13 of its employees to a privately-run programme that
provides word-processing skills training. Six of the employees were randomly chosen from the
data-processing (DP) department and the others were from the typing (T) pool.
At the end of the programme the company received a report indicating the score received by
each of its employees out of a total possible score of 100. The scores of the 13 employees of
ABC are given in the table below.
Is there a difference in the performance of the two groups in the word-processing programme?
Test at a 5% significance level.

DP    T
70    59
52    70
46    75
65    85
60    50
40    82
      64

Example 3: A pharmaceutical company is planning to introduce a new painkiller. To determine
the effectiveness of the drug in comparison to aspirin, 30 people were randomly selected.

• 15 were given the new drug (Sample 1).
• 15 were given aspirin (Sample 2).

Each participant was asked to indicate which one of the following five statements best
represented the effectiveness of the drug they took.

The drug taken was . . .


(5) Extremely effective
(4) Quite effective
(3) Somewhat effective
(2) Slightly effective
(1) Not at all effective

The responses of the participants are shown below:

New Rank Aspirin Rank
3 4
5 1
4 3
3 2
2 4
5 1
1 3
4 4
5 2
3 2
3 2
5 4
5 3
5 4
4 5

Can we conclude that the new painkiller is perceived to be more effective?

Example 4: The number of defective products from each of two production lines was recorded daily
for a period of 14 days on each production line (the measurements were taken over different
14-day periods for the two production lines). The results are shown below:

Day A B
1 172 201
2 165 180
3 206 159
4 185 192
5 175 177
6 142 170
7 190 182
8 169 179
9 161 169
10 184 192
11 191 180
12 170 174
13 138 159
14 172 166

Assume that both production lines produce the same daily output. Test whether production
line B produces more defective products than production line A.

5 k-Independent Samples

5.1 The Kruskal-Wallis Test

The Kruskal-Wallis test is used to determine whether k independent samples come from different
populations. It is therefore an extension of the Wilcoxon Rank Sum test for two independent samples.
The test is the non-parametric counterpart of a single factor ANOVA, and is used to determine whether
differences among three or more groups are significant in situations that do not meet the assumptions
necessary for single factor/one-way ANOVA.

Data Assumptions

• The data are either ordinal or quantitative but not necessarily Gaussian.
• The random samples (i.e. treatment levels) and observations within the random samples
(treatment levels) are independent.
• nj ≥ 3 (i.e. at least three observations per sample).
• The distributions of the k populations differ with respect to location only (if they
differ at all).

Hypotheses

Null Hypothesis: H0: The locations of all the k populations (groups) are the same
Alternative Hypothesis: H1: At least two population locations differ

Data Summary

• combine observations from all k groups to form one sample (n = Σ nj)
• rank the observations from 1 (smallest) to n (largest)
• calculate Tj, the sum of ranks within each of the k treatment levels (check that Σ Tj =
n(n + 1)/2, summing over j = 1, …, k)

Test Statistic

The test statistic is

\[
H = \left[ \frac{12}{n(n + 1)} \sum_{j=1}^{k} \frac{T_j^2}{n_j} \right] - 3(n + 1)
\]
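
A minimal sketch of the H calculation is shown below. The helper names and the three groups of values are hypothetical; the groups are deliberately well separated so that the rank sums are easy to verify by hand.

```python
def average_ranks(values):
    """Rank from smallest (1) to largest (n); tied values share the mean rank."""
    positions = {}
    for position, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def kruskal_wallis_h(groups):
    """Return (H, rank_sums) for a list of k independent samples."""
    pooled = [x for group in groups for x in group]
    n = len(pooled)
    ranks = average_ranks(pooled)          # rank the combined sample
    rank_sums, start = [], 0
    for group in groups:                   # split the ranks back into groups
        rank_sums.append(sum(ranks[start:start + len(group)]))
        start += len(group)
    total = sum(t ** 2 / len(g) for t, g in zip(rank_sums, groups))
    return 12 / (n * (n + 1)) * total - 3 * (n + 1), rank_sums


# Hypothetical data: k = 3 groups with 3 observations each.
groups = [[27, 31, 29], [35, 38, 36], [41, 44, 47]]
H, T = kruskal_wallis_h(groups)
print("rank sums:", T)
print(f"H = {H:.2f} with {len(groups) - 1} degrees of freedom")
```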

Critical Region
The statistic H has approximately a chi-squared (χ²) distribution, with degrees of freedom equal to
k − 1. Therefore, we use the χ² tables.

• We reject H0 if H ≥ c, where c is the critical value, or
• We reject H0 if the approximate or exact p-value ≤ α.

Class Examples

Example 1: How do customers rate three shifts with respect to speed of service in a particular
restaurant?
Three samples of 10 customer response-cards were randomly selected, one sample from each
shift, and customer ratings were recorded:

4:00-mid Mid-8:00 8:00-4:00


4 3 3
4 4 1
3 2 3
4 2 2
3 3 1
3 4 3
3 3 4
3 3 2
2 2 4
3 3 1

Can we conclude at a 5% significance level that customers perceive the speed of service to be
different among the three shifts?

Example 2: To determine whether absentee rates are the same amongst three different levels of
employees in a company, samples comprising 4 top managers, 5 middle managers and 5
workers were selected and their records examined to determine how many days they reported
sick in the last year.

Top   Middle   Worker
4     6        8
0     0        6
2     1        1
2     3        6
      2        5

Is there evidence that the absentee rates differ from one level of employee to another?

Example 3: The ages of executives in top management in 4 firms – A, B, C, and D are as follows:

Firm A Rank Firm B Rank Firm C Rank Firm D Rank


58 42 49 57
63 47 52 55
60 56 61 61
54 51 63 63
59 48 60 60
66 47 54
58 50

If these 4 samples are considered representative of the firms’ top management age structure,
test whether the average age of the executives varies from firm to firm. Conduct the test at a
10% significance level.

6 k-Matched or Related Samples

6.1 The Friedman Test

When we wish to compare k matched samples (where k > 2), the Friedman test is used. The test is
an extension of both the Sign test and the Wilcoxon Signed Rank Sum test, and is an alternative to
the parametric randomised block design two-way ANOVA.

Data Assumptions

• The data are either ordinal or quantitative but not Gaussian (normal)
• The data are generated from a blocked experiment with b blocks and k treatments
• The measurements within a block are dependent or related
• The measurements from different blocks are independent
• The patterns within blocks are random

Hypotheses

Null Hypothesis: H0: The locations of all the k populations are the same
Alternative Hypothesis: H1: At least two population locations differ

Data Summary

• identify the blocks and treatment levels
• rank the observations from smallest to largest within each block
• calculate Tj , the sum of ranks within each of the k treatment levels

Test Statistic

The test statistic is

\[ F_r = \left[ \frac{12}{bk(k+1)} \sum_{j=1}^{k} T_j^2 \right] - 3b(k+1) \]

where

b is the number of blocks, and k is the number of treatment levels
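
As an illustration, the calculation of Tj and Fr can be sketched in Python. This is a minimal sketch and not part of the original notes: the function names are made up for illustration, tied observations within a block are assumed to receive the average of the ranks they occupy, and no correction for ties is applied. The data are the sealed bids from Class Example 2 of this section (rows are plots, i.e. blocks; columns are companies, i.e. treatments).

```python
def midranks(values):
    """Rank values from smallest to largest; ties get the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j to the end of the group of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = average_rank
        i = j + 1
    return ranks


def friedman_statistic(blocks):
    """Return (rank sums T_j, F_r) for a list of blocks of k observations.

    Assumption: no tie correction is applied to F_r.
    """
    b = len(blocks)
    k = len(blocks[0])
    rank_sums = [0.0] * k
    for block in blocks:
        # rank within each block, then accumulate per treatment
        for j, r in enumerate(midranks(block)):
            rank_sums[j] += r
    fr = 12.0 / (b * k * (k + 1)) * sum(t * t for t in rank_sums) - 3.0 * b * (k + 1)
    return rank_sums, fr


# Bids in thousands of Rands: plots A to F, companies 1 to 4
bids = [
    [37, 32, 43, 36],
    [127, 110, 139, 130],
    [15, 12, 16, 15],
    [340, 340, 390, 360],
    [100, 90, 120, 120],
    [225, 210, 240, 230],
]

rank_sums, fr = friedman_statistic(bids)
print("Rank sums T_j:", rank_sums)
print("F_r =", round(fr, 2))
```

The resulting Fr is then compared with the χ2 critical value with k − 1 degrees of freedom read from the tables.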

Critical Region
The statistic Fr has approximately a chi-squared (χ2 ) distribution (provided that either b or k ≥ 5),
with degrees of freedom equal to k − 1. Therefore the critical values (and corresponding probability
levels) from the χ2 tables will be used to draw conclusions about the null hypothesis.

• We reject H0 if Fr ≥ c, where c is the critical value, or


• We reject H0 if the approximate/exact p-value ≤ α.

Class Examples

Example 1: Four managers evaluate applicants for a job in an accounting firm on several dimensions.
Eight applicants were selected, and their evaluations by the four managers recorded.
There are 5 possibilities:
1) The candidate is in the top 5% of applicants
2) The candidate is in the top 10% of applicants, but not in the top 5%
3) The candidate is in the top 25% of applicants, but not in the top 10%
4) The candidate is in the top 50% of applicants, but not in the top 25%
5) The candidate is in the bottom 50% of applicants
Can we conclude that there are differences in the way managers evaluate applicants?

Manager
Applicant 1 2 3 4
1 2 1 2 2
2 4 2 3 2
3 2 2 2 3
4 3 1 3 2
5 3 2 3 5
6 2 2 3 4
7 4 1 5 5
8 3 2 5 3

Example 2: Four property development companies enter sealed bids for a number of vacant plots
of land on auction. From the random sample of plots and bids (in thousands of Rands) made
below, does it appear as if some firms tend to make higher bids on average than other firms?
Test at the 0.5% significance level.

Plot Company 1 Company 2 Company 3 Company 4


A 37 32 43 36
B 127 110 139 130
C 15 12 16 15
D 340 340 390 360
E 100 90 120 120
F 225 210 240 230

Example 3: Twelve home-owners are randomly selected to participate in an experiment with a
plant nursery. Each home-owner was asked to select four fairly identical areas in their garden
and to plant four different types of grasses, one in each area. At the end of a specified length
of time each home-owner was asked to rank the grass types in order of preference, weighing
important criteria such as expense, maintenance and upkeep required, beauty, family’s preference,
etc. The rank 1 was assigned to the least preferred grass and the rank 4 to the favourite.
Can we conclude that there are differences in the preferences for the grass types among the
home-owners?

Home-owner
Grass 1 2 3 4 5 6 7 8 9 10 11 12
1 4 4 3 3 4 2 1 2 3.5 4 4 3.5
2 3 2 1.5 1 2 2 3 4 1 1 2 1
3 2 3 1.5 2 1 2 2 1 2 3 3 2
4 1 1 4 4 3 4 4 3 3.5 2 1 3.5

7 Tests of Association

The tests introduced in this section can be used with variables whose joint distribution is any specified
distribution, including the bivariate normal, or whose joint distribution is completely unknown and
therefore not specified.

Recall from INTROSTAT Chapter 12, that if an association exists between two variables, no matter
how the association is measured, this association cannot and should not be interpreted as implying
a cause and effect relationship between the two variables.

In general two variables may be associated because

• they interact with each other (i.e. one (or both) variable(s) affects the other), or
• of mere coincidence, or
• both variables are affected by other variables that have not been measured in the
study.

7.1 The Spearman Rank-Order Correlation Coefficient Test

The Spearman rank-order correlation coefficient test is used when we wish to measure the degree of
association between two variables that are measured on at least an ordinal scale.

Data Assumptions

• There are a total of n randomly selected paired observations.


• Both variables are measured on at least an ordinal scale.

Hypotheses

Null Hypothesis: H0 : ρs = 0
(no rank correlation exists between the two variables)
Alternative Hypothesis: H1 : ρs ≠ 0
(significant rank correlation exists between the two variables)
(Two-tailed test)
OR H1 : ρs > 0
(the rank correlation between the two variables is positive)
(One-tailed test)
OR H1 : ρs < 0
(the rank correlation between the two variables is negative)
(One-tailed test)

Data Summary

We denote the sample of n paired observations (X, Y ) by (x1 , y1 ), (x2 , y2 ), . . . (xn , yn )

1. Rank the values of X and of Y separately (each from 1 to n).

2. Calculate the difference di for each pair of ranks, that is, di = rank(xi ) − rank(yi ).

Test Statistic

\[ r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \]

For large samples (n ≥ 10), rs is approximately normally distributed, and the test
statistic is

\[ z = r_s \sqrt{n - 1} \]
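
The data summary and test statistic above can be sketched in Python as follows. This is an illustrative sketch only and not part of the original notes: the function names are made up, and tied values are assumed to receive the average of the ranks they occupy, with the formula for rs applied unchanged to those average ranks. The data are the truck ages and running costs from Class Example 1 of this section.

```python
import math


def midranks(values):
    """Rank values from smallest to largest; ties get the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j to the end of the group of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = average_rank
        i = j + 1
    return ranks


def spearman_rs(x, y):
    """r_s = 1 - 6 * sum(d_i^2) / (n (n^2 - 1)), with d_i the rank differences."""
    n = len(x)
    rank_x = midranks(x)
    rank_y = midranks(y)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1.0 - 6.0 * sum_d2 / (n * (n * n - 1))


def large_sample_z(rs, n):
    """Large-sample test statistic z = r_s * sqrt(n - 1)."""
    return rs * math.sqrt(n - 1)


# Age (years) and annual running cost (R'000s) of the 8 sampled trucks
age = [1, 9, 12, 3, 6, 4, 3, 7]
cost = [5.3, 10, 11, 7.2, 9.2, 7.6, 9.1, 6.4]

rs = spearman_rs(age, cost)
print(f"r_s = {rs:.3f}")
```

With only 8 trucks the normal approximation is not appropriate here; large_sample_z is intended for the large-sample case described above.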

Critical Region

Large Sample Critical Region (for n > 30)

We reject H0 if |z| ≥ c, where c is the critical value, or calculated p-value ≤ α.

Class Examples

Example 1: A company runs a large fleet of trucks, which vary in age from 1 year old to 12 years
old. The annual running costs (in thousands of rands) and age of a random sample of 8 trucks
are given in the following table:

Truck no. Age (a) Cost (R’000s) (b)


1 1 5.3
2 9 10
3 12 11
4 3 7.2
5 6 9.2
6 4 7.6
7 3 9.1
8 7 6.4

Can the company infer at the 2.5% significance level that age of trucks is positively correlated
with running costs?

Example 2: A production manager wants to examine the relationship between:


• Aptitude test score taken prior to hiring, and
• Performance rating three months after starting work.

A random sample of 20 production workers was selected. The test scores and performance
rating scores were recorded for each person.
Can the firm’s manager infer at the 1% significance level that aptitude test scores are correlated
with performance rating?

Aptitude Rank Aptitude Performance Rank Performance
59 9 3 10.5
47 3 2 3.5
58 8 4 17
66 14 3 10.5
77 20 2 3.5
57 7 4 17
62 12 3 10.5
68 16 3 10.5
69 17 5 19.5
36 1 1 1
48 4 3 10.5
65 13 3 10.5
51 5 2 3.5
61 11 3 10.5
40 2 3 10.5
67 15 4 17
60 10 2 3.5
56 6 3 10.5
76 19 3 10.5
71 18 5 19.5

Example 3: After several semesters without much success, Pat Statstud (a student in the lowest
quarter of a statistics course) decided to try and improve. Pat needed to know the secret of
success for university students. After many hours of discussion with other more successful
students, Pat postulated a rather radical theory: The longer one studied, the better one’s
grade.
To test the theory, Pat took a random sample of 35 students in an economics course and
asked each to report the average amount of time he or she studied economics, and the final
percentage mark obtained.
Test to determine whether grade and study time are positively related.

Time Rank Time Mark (%) Rank Mark
30 17 71 9
5 4 30 4
36 30.5 82 17.5
37 32 98 34
32 22.5 78 14
23 7 73 10.5
34 28 82 17.5
2 2.5 25 3
34 28 94 32
43 35 99 35
34 28 85 22
32 22.5 74 12
30 17 79 15
36 30.5 82 17.5
40 34 88 26
24 8.5 55 5
0 1 7 1
25 10.5 62 6
29 13.5 91 29.5
21 5 66 8
31 20.5 86 23
30 17 73 10.5
33 25 90 28
30 17 88 26
33 25 91 29.5
22 6 64 7
29 13.5 83 20
24 8.5 87 24
30 17 96 33
2 2.5 16 2
31 20.5 84 21
33 25 92 31
25 10.5 82 17.5
38 33 88 26
26 12 75 13

8 Advantages and Disadvantages of Non-parametric Tests

Advantages of Non-parametric Tests

• The tests can be used when parametric methods are inapplicable, or the validity of their
parametric assumptions is uncertain.
• The tests are useful when sample sizes are small as there may be no equivalent parametric
test, unless the population distribution is known exactly.
• The tests often involve less computational work and are therefore sometimes easier and quicker
to apply than a corresponding parametric test.
• The assumptions are usually few and easily met, in contrast to assumptions for parametric
techniques.

• The tests are not just restricted to quantitative data i.e. the tests can be used on all types of
data.

Disadvantages of Non-parametric Tests

• The major disadvantage is the absence of parameters.
Because the procedures are non-parametric, there are no parameters to estimate and the results
become more difficult to describe in precise terms.
• Information is lost by ranking or taking signs.
Because information is lost, non-parametric procedures tend to be less efficient or less
statistically “powerful” than the equivalent parametric test (when one is appropriate for the data).
This implies that for similar sample sizes the null hypotheses may not be rejected as often as
they ought to be rejected, in the circumstances when all the assumptions for a parametric test
are valid. Thus non-parametric tests may be wasteful if all assumptions for parametric tests
are met.
• The theory behind some of the tests is complicated (In this course we do not involve ourselves
in the theory behind the tests).

9 Additional Exercises! . . .
1. The human resource manager of a large company wanted to compare how long business and
non-business graduates worked for the company before quitting. Two samples of 25 business
graduates and 20 non-business graduates were randomly selected from the lists of former
employees. The data representing their time with the company were recorded (in months):

Business Non-Business
60 25
11 60
18 22
19 24
5 23
25 36
60 39
7 15
8 35
17 16
37 28
4 9
8 60
28 29
27 16
11 22
60 60
25 17
5 60
13 32
22
11
17
9
4

Can the personnel manager conclude at a 5% significance level that a difference in duration of
employment exists between business and non-business graduates?

2. Two kinds of emergency flares are compared on the basis of the following burning times
(rounded to the nearest tenth of a minute):

Brand A 14.9 11.3 13.2 16.6 17.0 14.1 15.4 13.0 16.9
Brand B 15.2 19.8 14.7 18.3 16.2 21.2 18.9 12.2 15.3 19.4

Test whether the average burning time of Brand A flares is less than that of Brand B flares.
Use a 5% significance level.

3. The data below show IQ scores (Wechsler IQ test) of children with severe learning problems
after taking a placebo and after taking a drug (Ethosuximide). The order in which the placebo
and the drug were administered was randomized.

Child placebo drug


1 97 113
2 106 113
3 106 101
4 95 119
5 102 111
6 111 122
7 115 121
8 104 106
9 90 110
10 96 126

Test whether the drug has a significant effect on measured IQ, in particular that the drug has
an adverse effect on IQ. Use a 2.5% significance level.

4. The following are the final examination grades of samples from three groups of students who
were taught Swahili by one of three different methods (classroom instruction and language
laboratory (A), only classroom instruction (B), and only self-study in language laboratory (C)):

A 94 88 91 74 87 97
B 85 82 79 84 61 72 80
C 89 67 72 76 69

Are the three methods equally effective?

5. Until its recent indictment as a possible carcinogen, cyclamate was a widely used sweetener
in soft drinks. The following data show a comparison of three laboratory methods for
determining the percentage of sodium cyclamate in commercially produced orange drink. All three
procedures were applied to each of 12 samples.

Method
Sample A B C
1 0.598 0.628 0.632
2 0.614 0.628 0.630
3 0.600 0.600 0.622
4 0.580 0.612 0.584
5 0.596 0.600 0.650
6 0.592 0.628 0.606
7 0.616 0.628 0.644
8 0.614 0.644 0.644
9 0.604 0.644 0.624
10 0.608 0.612 0.619
11 0.602 0.628 0.632
12 0.614 0.644 0.616

Do the three methods give different results?

6. The data below represent the monthly sales and the promotional expenses for a women’s
apparel store that specializes in sportswear for younger women.

Month Sales Promotional Expense


(×R1000) (×R1000)
1 62.4 3.9
2 68.5 4.8
3 70.2 5.5
4 79.6 6.0
5 80.1 6.8
6 88.7 7.7
7 98.6 7.9
8 104.3 9.0
9 106.5 9.2
10 107.3 9.7
11 115.8 10.9
12 120.1 11.0

Calculate rs , the Spearman’s rank correlation between monthly sales and promotional ex-
penses. Use a 1% significance level to test for evidence of rank association.

7. Monthly returns of Intel for the period January 1993 to December 1995 are shown below:

Intel S&P Index Return


0.228084 0.007291 P
0.091335 0.013532 P
-0.01288 0.021217 F
-0.17188 -0.02452 F
0.165572 0.026318 P
-0.00789 0.003479 F
-0.0491 -0.00499 F
0.229665 0.037531 P
0.101167 -0.00764 P
-0.10531 0.019717 F
-0.02767 -0.00897 F
0.00813 0.01219 P
0.053246 0.033612 P
0.05364 -0.02683 P
-0.01818 -0.04345 F
-0.09558 0.012034 F
0.02459 0.017335 P
-0.064 -0.02492 F
0.013969 0.032639 P
0.109705 0.041427 P
-0.06464 -0.02412 F
0.011173 0.022596 P
0.016097 -0.03632 P
0.011881 0.014704 P
0.087031 0.025741 P
0.14955 0.039381 P
0.064263 0.029446 P
0.206922 0.029453 P
0.096459 0.039665 P
0.128062 0.024231 P
0.027265 0.032912 P
-0.05577 0.003592 F
-0.01833 0.041988 F
0.160416 -0.00363 P
-0.1288 0.044155 F
-0.06776 0.017771 F

Test whether positive returns are more likely to follow positive returns (i.e. determine whether
there is evidence against the data sequence being random).

8. A well known soft drink manufacturer has used the same secret recipe for its product since
its introduction more than 100 years ago. In response to decreasing market share, however,
the president of the company is contemplating changing the recipe. She has developed two
alternative recipes. In a preliminary study, she asked 15 randomly selected people to taste the
original recipe and the two new recipes. Each person was then asked to evaluate the product
on a 5-point scale, where 1=awful, 2=poor, 3=fair, 4=good and 5=wonderful. The data are
shown below:

Original New Recipe 1 New Recipe 2
5 5 5
3 4 5
4 5 5
2 4 4
3 3 5
2 2 3
3 3 2
4 3 5
1 1 1
1 3 2
2 4 3
3 3 4
5 3 4
3 2 3
4 3 5

Can we conclude at the 1% significance level that there are differences in the ratings of the
three recipes?
9. A cell phone user suspects that instances of satisfactory (S) and poor (P) signal reception do
not occur at random, but rather in periods lasting several minutes. Over an hour, the cell
phone user checks reception at one minute intervals, making the following 28 observations:

S S P P S S P S S S S S P P S S S P S P S S S P
S S P S

Test whether the observations support the suspicions of the cell phone user.

10. The high cost of medical care makes it imperative that hospitals operate efficiently and
effectively. As part of a larger study, a random sample of 60 patients leaving a hospital were
surveyed. They were asked how satisfied they were with the treatment they received. The
responses were recorded with a measure of the degree of severity of their illness (as determined
by the admitting physician) and the length of their stay.
Satisfaction levels were coded from 1 for very unsatisfied to 5 for very satisfied and severity of
illness was coded from 1 for least severe to 10 for most severe.
The Spearman rank correlations between all pairs of variables are presented below:

Severity Satisfaction Days


Severity 1
Satisfaction −0.2604 1
Days 0.0376 −0.3846 1

Is the satisfaction level affected by the severity of the illness? Conduct a suitable test.
