Non Parametric Statistics

Engineering Data Analysis

Department of Mathematics and Statistics


College of Science

Non Parametric Statistics

- Nonparametric statistics refers to statistical methods that do not require the data to
follow a normal distribution. For this reason, they are sometimes referred to as
distribution-free tests.
- Nonparametric tests serve as an alternative to parametric tests.
- Most nonparametric tests apply to data on an ordinal scale, and some apply to
data on a nominal scale.
Note: Do not use nonparametric procedures if the assumptions of a parametric procedure are met.

Advantages of Non Parametric Statistical Procedures

- Because most nonparametric tests have very few requirements, they are unlikely
to be used improperly.
- For some nonparametric procedures, the computations are fairly easy.
- The procedures can be used for count data or rank data, such as the rankings of
a movie as excellent, good, fair, or poor.

Disadvantages of Non Parametric Statistical Procedures

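- Nonparametric procedures are less efficient (less powerful) than parametric procedures
when the assumptions of the parametric procedure hold.
- Because they are distribution-free and typically use only signs or ranks rather than
the actual values, their results may be less precise than those of a parametric procedure.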

Non Parametric Tests

- One Sample Sign Test
- Wilcoxon Signed Rank Test
- Mann Whitney U-Test
- Kruskal Wallis H-Test
- Spearman Rank Correlation Test
- Chi-Square Test

One Sample Sign Test

The One Sample Sign Test is a nonparametric counterpart of tests regarding a single
population mean. It tests a hypothesis about the population median M.

Command for One Sample Sign Test

To use the command for the one sample sign test, you need to install and load the package
signmedian.test.

signmedian.test(<numeric vector>, mu = <hypothesized median>,
alternative = "<alternative>", conf.level = 1 - α)

One Sample Sign Test

Assumption:

The samples must be independent.

One Sample Sign Test

Null and Alternative Hypothesis

H0 : M = M0
Ha : M ≠ M0    two-tailed: two.sided
H0 : M ≤ M0
Ha : M > M0    one-tailed: greater
H0 : M ≥ M0
Ha : M < M0    one-tailed: less

One Sample Sign Test

Example 1: A website administrator for a company claims that the median number of visitors per day to the
company's website is no more than 1500. An employee doubts the accuracy of this claim. The numbers of visitors per
day for 20 randomly selected days are listed below. At α = 0.05, can the employee reject the administrator's claim?

No.   No. of Visitors   No.   No. of Visitors
1     1469              11    1525
2     1463              12    1568
3     1487              13    1602
4     1579              14    1544
5     1462              15    1548
6     1476              16    1492
7     1523              17    1500
8     1620              18    1452
9     1634              19    1511
10    1570              20    1823

Procedures for Testing Hypothesis
Step 1: H0 : M ≤ 1500 and Ha : M > 1500
Step 2: α = 0.05
Step 3: Since we are comparing the median of one sample to a hypothesized
median, we will use the one sample sign test.
Step 4: Determine the p-value.
Command for One Sample Sign Test

To use the command for the one sample sign test, you need to install and load the package
signmedian.test.

signmedian.test(<numeric vector>, mu = <hypothesized median>,
alternative = "<condition>", conf.level = 1 - α)

Procedures for Testing Hypothesis

Step 5: Since the p-value (0.1796) is greater than the 0.05 level of significance, we fail
to reject H0.
Step 6: There is not sufficient evidence to reject the claim of the website
administrator.
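
As a rough cross-check of the p-value in Step 5, the sign test can be sketched in base R with binom.test, which avoids installing an extra package. This is only a sketch, not the signmedian.test command itself: the variable names (visitors, m0, n_above, n_below, sign_result) are illustrative assumptions, and values equal to 1500 are dropped before the signs are counted.

```r
# Daily visitor counts for the 20 sampled days (Example 1)
visitors <- c(1469, 1463, 1487, 1579, 1462, 1476, 1523, 1620, 1634, 1570,
              1525, 1568, 1602, 1544, 1548, 1492, 1500, 1452, 1511, 1823)
m0 <- 1500                        # hypothesized median under H0

n_above <- sum(visitors > m0)     # number of "+" signs
n_below <- sum(visitors < m0)     # number of "-" signs (ties with m0 are dropped)

# Under H0 the count of "+" signs is Binomial(n_above + n_below, 0.5);
# Ha : M > 1500 corresponds to an unusually large number of "+" signs.
sign_result <- binom.test(n_above, n_above + n_below, p = 0.5,
                          alternative = "greater")
sign_result$p.value
```

Here 12 days are above 1500, 7 are below, and 1 is tied, so the p-value is P(X ≥ 12) for X ~ Binomial(19, 0.5), which is about 0.1796 and agrees with Step 5.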

One Sample Sign Test

Example 2: Recent studies of the private practices of physicians who saw no
Medicaid patients suggested that the median length of each patient visit was 22
minutes. It is believed that the median visit length in practices with a large Medicaid
load is shorter than 22 minutes. A random sample of 20 visits in practices with a
large Medicaid load yielded, in order, the following visit lengths:

One Sample Sign Test

Example 2:
No.   Time (minutes)   No.   Time (minutes)
1     3.3              11    16.8
2     4.3              12    23.4
3     3.4              13    18.1
4     20.1             14    23.5
5     15.6             15    18.7
6     20.4             16    24.8
7     16.2             17    18.9
8     21.6             18    24.9
9     16.4             19    19.1
10    21.9             20    26.8

Based on these data, is there sufficient evidence to conclude that the median visit length in practices with a large
Medicaid load is shorter than 22 minutes?

Procedures for Testing Hypothesis
Step 1: H0 : M ≥ 22 and Ha : M < 22
Step 2: α = 0.05
Step 3: Since we are comparing the median of one sample to a hypothesized
median, we will use the one sample sign test.
Step 4: Determine the p-value.
Command for One Sample Sign Test

To use the command for the one sample sign test, you need to install and load the package
signmedian.test.

signmedian.test(<numeric vector>, mu = 22,
alternative = "<condition>", conf.level = 1 - α)

Procedures for Testing Hypothesis

Step 5: Since the p-value (0.02069) is less than the 0.05 level of significance, we reject
H0.
Step 6: There is sufficient evidence to conclude that the median visit length in
practices with a large Medicaid load is shorter than 22 minutes.
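
The same base R cross-check can be sketched for this example. Again this uses binom.test rather than the signmedian.test package, and the variable names (visit_min, m0, n_above, n_below, sign_result) are illustrative assumptions.

```r
# Visit lengths in minutes for the 20 sampled visits (Example 2)
visit_min <- c(3.3, 4.3, 3.4, 20.1, 15.6, 20.4, 16.2, 21.6, 16.4, 21.9,
               16.8, 23.4, 18.1, 23.5, 18.7, 24.8, 18.9, 24.9, 19.1, 26.8)
m0 <- 22                           # hypothesized median under H0

n_above <- sum(visit_min > m0)     # number of "+" signs
n_below <- sum(visit_min < m0)     # number of "-" signs (no value equals 22 here)

# Ha : M < 22 corresponds to an unusually small number of "+" signs.
sign_result <- binom.test(n_above, n_above + n_below, p = 0.5,
                          alternative = "less")
sign_result$p.value
```

Only 5 of the 20 visits exceed 22 minutes, so the p-value is P(X ≤ 5) for X ~ Binomial(20, 0.5), about 0.0207, which agrees with Step 5.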

Wilcoxon Signed Rank Test

The Wilcoxon Signed Rank Test is a nonparametric equivalent of the t-test for two related
samples.
Command for Wilcoxon Signed Rank Test

wilcox.test(<a>, <b>, alternative = "<condition>", paired = TRUE,
conf.level = 1 - α, exact = FALSE)

If there are ties (or zero differences) in your data, add the argument exact = FALSE
to avoid a warning message in the console. R then uses a normal approximation for the p-value.

a: numeric vector of data values
b: numeric vector of data values

Wilcoxon Signed Rank Test

Null and Alternative Hypothesis

H0 : M1 = M2
Ha : M1 ≠ M2    two-tailed: two.sided
H0 : M1 ≤ M2
Ha : M1 > M2    one-tailed: greater
H0 : M1 ≥ M2
Ha : M1 < M2    one-tailed: less

Wilcoxon Signed Rank Test

Assumptions:
- Your dependent variable should be measured at the ordinal or continuous level.
- Your independent variable should consist of two categorical "related groups" or
"matched pairs".

Wilcoxon Signed Rank Test

Example 1: Ten women participate in a study. A physical therapist measures the women’s waistlines before and
8 weeks after a rigorous exercise program begins. Test whether the program decreased the mean waistline at the
α = 0.01 level of significance.

Before (inches)   After (inches)
48.0              42.5
23.1              20.7
28.6              18.3
23.9              24.8
25.7              58.3
47.8              33.7
38.9              33.5
22.1              22.1
49.7              43.1
31.1              23.6

Procedures for Testing Hypothesis

Step 1: H0 : Mbefore ≤ Mafter and Ha : Mbefore > Mafter
Step 2: α = 0.01
Step 3: Since we are comparing two related groups (the same women before and after the
program), we will use the Wilcoxon signed rank test.
Step 4: Determine the p-value.
Command for Wilcoxon Signed Rank Test

wilcox.test(<a>, <b>, alternative = "<condition>", paired = TRUE,
conf.level = 1 - α, exact = FALSE)

Procedures for Testing Hypothesis

Step 5: Since the p-value (0.07757) is greater than the 0.01 level of significance, we fail
to reject H0.
Step 6: There is not sufficient evidence to conclude that the program helped to
decrease the mean waistline.
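
A minimal sketch of how Step 4 might be run for this example, using the wilcox.test command given earlier with the data from the table. The variable names (before, after, waist_result) are illustrative assumptions.

```r
# Waistlines in inches for the ten women (Example 1)
before <- c(48.0, 23.1, 28.6, 23.9, 25.7, 47.8, 38.9, 22.1, 49.7, 31.1)
after  <- c(42.5, 20.7, 18.3, 24.8, 58.3, 33.7, 33.5, 22.1, 43.1, 23.6)

# Ha : Mbefore > Mafter, so alternative = "greater".
# exact = FALSE because one pair has a zero difference (22.1 vs 22.1),
# which rules out an exact p-value.
waist_result <- wilcox.test(before, after, alternative = "greater",
                            paired = TRUE, conf.level = 0.99, exact = FALSE)
waist_result$p.value
```

The zero difference is discarded, leaving 9 pairs whose positive ranks sum to V = 35. The normal approximation gives a p-value of about 0.0776, in line with Step 5.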

Wilcoxon Signed Rank Test
Example 2: An analyst might want to determine whether there is a difference in the
cost per mile of airfares in the United States between 1979 and 2009 for various
cities. The data in the table represent the costs per mile of airline tickets for a sample of
17 cities for both 1979 and 2009.

City   1979    2009    City   1979    2009
1      20.07   23.07   10     19.37   18.4
2      19.63   21.46   11     18.25   20.02
3      19.2    18.6    12     19.75   21.77
4      19.98   20.72   13     22.4    22.26
5      18.18   18.91   14     20.96   21.28
6      20.3    22.8    15     20.27   21.11
7      23.8    19.51   16     23.77   18.27
8      23.57   19.05   17     23.59   22.10
9      19.72   21.85
Procedures for Testing Hypothesis

Step 1: H0 : M1979 = M2009 and Ha : M1979 ≠ M2009
Step 2: α = 0.05
Step 3: Since we are comparing two related groups (the same cities in both years), we will
use the Wilcoxon signed rank test.
Step 4: Determine the p-value.
Command for Wilcoxon Signed Rank Test

wilcox.test(<a>, <b>, alternative = "<condition>", paired = TRUE,
conf.level = 1 - α, exact = FALSE)

Step 5: Since the p-value (0.6701) is greater than the 0.05 level of significance, we fail
to reject H0.
Step 6: There is not sufficient evidence to conclude that there is a difference in
the cost per mile of airfares in the United States between 1979 and 2009 for various
cities.
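
A minimal sketch of Step 4 for the airfare data, assuming the two table columns are entered in city order. The variable names (fare_1979, fare_2009, fare_result) are illustrative assumptions.

```r
# Cost per mile for the 17 cities, listed in city order 1..17 (Example 2)
fare_1979 <- c(20.07, 19.63, 19.2, 19.98, 18.18, 20.3, 23.8, 23.57, 19.72,
               19.37, 18.25, 19.75, 22.4, 20.96, 20.27, 23.77, 23.59)
fare_2009 <- c(23.07, 21.46, 18.6, 20.72, 18.91, 22.8, 19.51, 19.05, 21.85,
               18.4, 20.02, 21.77, 22.26, 21.28, 21.11, 18.27, 22.10)

# Ha : M1979 != M2009, so the test is two-sided.
# exact = FALSE follows the command on the slide (normal approximation).
fare_result <- wilcox.test(fare_1979, fare_2009, alternative = "two.sided",
                           paired = TRUE, conf.level = 0.95, exact = FALSE)
fare_result$p.value
```

The positive ranks of the 17 differences sum to V = 67 against an expected 76.5 under H0, which yields a p-value of about 0.67, consistent with Step 5.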

Mann Whitney U-Test

The Mann Whitney U-Test is a nonparametric procedure used to compare two populations
(stated here in terms of M1 and M2) using independent samples. It is the nonparametric
equivalent of the independent-samples t-test.
Command for Mann Whitney U-Test

wilcox.test(<a>, <b>, alternative = "<condition>",
conf.level = 1 - α, exact = FALSE)

If there are ties in your data, add the argument exact = FALSE
to avoid a warning message in the console. R then uses a normal approximation for the p-value.

Mann Whitney U-Test

Null and Alternative Hypothesis

H0 : M1 = M2
Ha : M1 ≠ M2    two-tailed: two.sided
H0 : M1 ≤ M2
Ha : M1 > M2    one-tailed: greater
H0 : M1 ≥ M2
Ha : M1 < M2    one-tailed: less

Mann Whitney U-Test

Assumptions:
- Your dependent variable should be measured at the ordinal or continuous level.
- Your independent variable should consist of two categorical "independent
groups".

Mann Whitney U-Test

Example 1: When exposed to an infection, a person typically develops antibodies.
The extent to which the antibodies respond can be measured by looking at a
person's titer, which is a measure of the number of antibodies present. The higher
the titer is, the more antibodies are present. The data in the table represent the
titers of 11 ill people and 11 healthy people exposed to the tularemia virus in
Vermont. Is the level of titer in the ill group greater than the level of titer in the
healthy group? Use the α = 0.10 level of significance.

Mann Whitney U-Test
ill healthy
640 10
80 320
1280 320
160 320
640 80
640 160
1280 10
640 640
160 160
320 320
160 320
Procedures for Testing Hypothesis

Step 1: H0 : Mill ≤ Mhealthy and Ha : Mill > Mhealthy
Step 2: α = 0.10
Step 3: Since we are comparing two independent groups, we will use
the Mann Whitney U-Test.
Step 4: Determine the p-value.
Command for Mann Whitney U-Test

wilcox.test(<a>, <b>, alternative = "<condition>", conf.level = 0.9,
exact = FALSE)

Procedures for Testing Hypothesis

Step 5: Since the p-value (0.04657) is less than the 0.10 level of significance, we reject
H0.
Step 6: There is sufficient evidence to conclude that the level of titer in the ill
group is greater than the level of titer in the healthy group.
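
A minimal sketch of Step 4 for the titer data, using the wilcox.test command above without paired = TRUE. The variable names (ill, healthy, titer_result) are illustrative assumptions.

```r
# Titers for the 11 ill and 11 healthy people (Example 1)
ill     <- c(640, 80, 1280, 160, 640, 640, 1280, 640, 160, 320, 160)
healthy <- c(10, 320, 320, 320, 80, 160, 10, 640, 160, 320, 320)

# Ha : Mill > Mhealthy, so alternative = "greater".
# The data contain many tied values, so exact = FALSE is needed.
titer_result <- wilcox.test(ill, healthy, alternative = "greater",
                            conf.level = 0.9, exact = FALSE)
titer_result$p.value
```

R reports the statistic as W, which here equals 86, against an expected 60.5 under H0. With the tie-corrected normal approximation the p-value is about 0.0466, matching Step 5.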

Mann Whitney U-Test

Example 2: An engineer is comparing the time to failure (in flight hours) of two
different air conditioners for airplanes and wants to determine if the mean time to
failure for model Y is longer than the mean time to failure for model X. She obtains
a random sample of 26 failure times for model X and an independent random sample
of 17 failure times for model Y. Do the data in the table suggest that the time to failure
for model Y is longer? Use the α = 0.05 level of significance.

Mann Whitney U-Test

Model X   Model Y   Model X   Model Y
7         115       109       168
20        55        33        118
5         219       25        122
52        245       19        253
103       239       59
17        130       287
7         412       128
4         62        68
76        225       3
19        129       4
25        71        91
4         12        472
76        200       28

Procedures for Testing Hypothesis

Step 1: H0 : Mx ≥ My and Ha : Mx < My
Step 2: α = 0.05
Step 3: Since we are comparing two independent groups, we will use
the Mann Whitney U-Test.
Step 4: Determine the p-value.
Command for Mann Whitney U-Test

wilcox.test(<a>, <b>, alternative = "<condition>", conf.level = 0.95,
exact = FALSE)

Procedures for Testing Hypothesis

Step 5: Since the p-value (0.0001362) is less than the 0.05 level of significance, we
reject H0.
Step 6: There is sufficient evidence to conclude that the mean time to failure for
model Y is longer than the mean time to failure for model X.
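
A minimal sketch of Step 4 for the failure-time data. It assumes the table is read as two side-by-side column pairs, giving 26 values for model X and 17 for model Y as stated in the problem. The variable names (model_x, model_y, failure_result) are illustrative assumptions.

```r
# Time to failure in flight hours (Example 2)
model_x <- c(7, 20, 5, 52, 103, 17, 7, 4, 76, 19, 25, 4, 76,
             109, 33, 25, 19, 59, 287, 128, 68, 3, 4, 91, 472, 28)
model_y <- c(115, 55, 219, 245, 239, 130, 412, 62, 225, 129, 71, 12, 200,
             168, 118, 122, 253)

# Ha : Mx < My, so alternative = "less" with model X as the first argument.
# Model X contains tied values, so exact = FALSE is needed.
failure_result <- wilcox.test(model_x, model_y, alternative = "less",
                              conf.level = 0.95, exact = FALSE)
failure_result$p.value
```

With this reading of the table the statistic is W = 74, far below the 221 expected under H0, and the p-value is about 0.00014, in line with Step 5.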

Kruskal Wallis H-Test

The Kruskal Wallis H-Test is a rank-based nonparametric test that can be used to
determine whether there are statistically significant differences between two or more groups
of an independent variable on a continuous or ordinal dependent variable. It is the
nonparametric equivalent of one-way ANOVA.

Command for Kruskal Wallis H-Test

kruskal.test(<numeric vector> ~ <grouping factor>)

Kruskal Wallis H-Test

Null and Alternative Hypothesis

H0 : µ1 = µ2 = ... = µk
Ha : At least one of the population means is different from the others.

Kruskal Wallis H-Test

Assumptions:
- There is one independent variable with two or more levels (independent groups). The
test is more commonly used when you have three or more levels.
- The dependent variable is measured at the ordinal, interval, or ratio
level.
- Your observations should be independent.

Kruskal Wallis H-Test

Example 1: Researchers wanted to compare math test scores of students at the end of secondary school from
various cities. Eight randomly selected students each from Makati, Manila, and Quezon City were administered
the same exam; the results are presented in the table. Can the researchers conclude that the distribution of exam
scores is different for each city at the α = 0.01 level of significance?

Makati   Manila   Quezon City
578      568      506
548      530      518
521      571      485
555      569      480
548      563      458
530      535      456
502      561      513
492      450      491

Procedures for Testing Hypothesis

Step 1:
H0 : The distribution of exam scores is the same for each city.
Ha : The distribution of exam scores is different for each city.
Step 2: α = 0.01
Step 3: Since we are comparing more than two independent groups, we
will use the Kruskal Wallis H-Test.
Step 4: Determine the p-value.
Command for Kruskal Wallis H-Test

kruskal.test(<numeric vector> ~ <grouping factor>)

Procedures for Testing Hypothesis

Step 5: Since the p-value (0.007318) is less than the 0.01 level of significance, we reject
H0.
Step 6: There is sufficient evidence to conclude that the distribution of exam scores is
different for each city.
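
A minimal sketch of Step 4 for the exam scores. The scores are stacked into one numeric vector and paired with a grouping factor, as the kruskal.test formula requires. The variable names (score, city, exam_result) are illustrative assumptions.

```r
# Exam scores stacked city by city: Makati, then Manila, then Quezon City
score <- c(578, 548, 521, 555, 548, 530, 502, 492,
           568, 530, 571, 569, 563, 535, 561, 450,
           506, 518, 485, 480, 458, 456, 513, 491)

# Grouping factor: eight students per city, in the same order as the scores
city <- factor(rep(c("Makati", "Manila", "Quezon City"), each = 8))

exam_result <- kruskal.test(score ~ city)
exam_result$statistic   # H, corrected for ties
exam_result$p.value
```

The rank sums are 115.5, 134.5, and 50 for the three cities, giving H of about 9.83 on 2 degrees of freedom and a p-value of about 0.0073, which agrees with Step 5.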

Kruskal Wallis H-Test
Example 2: A family doctor claims that the distributions of HDL cholesterol in males for the age groups 20 to 29
years old, 40 to 49 years old, and 60 to 69 years old are different. He obtains a simple random sample of 12
individuals from each age group and determines their HDL cholesterol. The results are presented in the table. Test
the doctor's claim at the α = 0.05 level of significance.

No.   20-29 years old   40-49 years old   60-69 years old
1     54                61                44
2     43                41                70
3     38                44                80
4     30                47                53
5     61                33                51
6     53                29                49
7     35                59                49
8     34                35                42
9     39                34                35
10    46                74                44
11    50                50                37
12    35                65                38

Procedures for Testing Hypothesis
Step 1:
H0 : The distributions of HDL cholesterol in males for the age groups 20 to 29 years
old, 40 to 49 years old, and 60 to 69 years old are the same.
Ha : The distributions of HDL cholesterol in males for the age groups 20 to 29 years
old, 40 to 49 years old, and 60 to 69 years old are different.
Step 2: α = 0.05
Step 3: Since we are comparing more than two independent groups, we
will use the Kruskal Wallis H-Test.
Step 4: Determine the p-value.
Command for Kruskal Wallis H-Test

kruskal.test(<numeric vector> ~ <grouping factor>)

Procedures for Testing Hypothesis

Step 5: Since the p-value (0.5774) is greater than the 0.05 level of significance, we fail
to reject H0.
Step 6: There is not sufficient evidence to support the claim of the doctor.
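
A minimal sketch of Step 4 for the HDL data, following the same stacking approach. The variable names (hdl, age_group, hdl_result) are illustrative assumptions.

```r
# HDL cholesterol stacked by age group: 20-29, then 40-49, then 60-69
hdl <- c(54, 43, 38, 30, 61, 53, 35, 34, 39, 46, 50, 35,
         61, 41, 44, 47, 33, 29, 59, 35, 34, 74, 50, 65,
         44, 70, 80, 53, 51, 49, 49, 42, 35, 44, 37, 38)

# Grouping factor: twelve individuals per age group, in the same order
age_group <- factor(rep(c("20-29", "40-49", "60-69"), each = 12))

hdl_result <- kruskal.test(hdl ~ age_group)
hdl_result$statistic   # H, corrected for ties
hdl_result$p.value
```

The rank sums are 194.5, 223, and 248.5, giving H of about 1.10 on 2 degrees of freedom and a p-value of about 0.577, consistent with Step 5.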

Spearman Rank Correlation

Spearman Rank Correlation (Spearman Rho) is used to measure the strength and
direction of the association between two ordinal or continuous variables. It is a
nonparametric version of the Pearson Product-Moment correlation.

Command for Spearman Rho

cor.test(<numeric vector (independent)>, <numeric vector (dependent)>,
method = "spearman", conf.level = 1 - α)

Spearman Rank Correlation

Null and Alternative Hypothesis

H0 : There is no significant relationship between the two continuous/ordinal variables.
Ha : There is a significant relationship between the two continuous/ordinal variables.

Spearman Rank Correlation
Assumptions:
- The two variables should be measured on an ordinal or continuous scale.
- There needs to be a monotonic relationship between the two variables.

Spearman Rank Correlation

Example 1: Here are the data for 10 participants in a triathlon. Is there a relationship between the individual ranks
obtained in swimming and cycling at the 0.05 level of significance?

Swimming Rank   Cycling Rank
46              99
45              98
18              10
22              25
17              16
31              32
48              33
1               2
61              59
5               8

Procedures for Testing Hypothesis
Step 1:
H0 : There is no significant relationship between the individual ranks obtained in
swimming and cycling.
Ha : There is a significant relationship between the individual ranks obtained in
swimming and cycling.
Step 2: α = 0.05
Step 3: Since we are testing for a significant relationship between two ordinal variables, we
will use Spearman Rho.
Step 4: Determine the p-value.
Command for Spearman Rho

cor.test(<numeric vector (independent)>, <numeric vector (dependent)>,
method = "spearman", conf.level = 1 - α)
Procedures for Testing Hypothesis

Step 5: Since the p-value (0.00138) is less than the 0.05 level of significance, we
reject H0.
Step 6: There is a significant relationship between the individual ranks obtained in
swimming and cycling, and the correlation coefficient (0.8909) indicates that the
relationship is very strong.
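
A minimal sketch of Step 4 for the triathlon data, using all ten rows of the table. The variable names (swim, cycle, tri_result) are illustrative assumptions.

```r
# Swimming and cycling ranks for the rows of the table (Example 1)
swim  <- c(46, 45, 18, 22, 17, 31, 48, 1, 61, 5)
cycle <- c(99, 98, 10, 25, 16, 32, 33, 2, 59, 8)

# Spearman's rho and its p-value; there are no tied values in either variable
tri_result <- cor.test(swim, cycle, method = "spearman")
tri_result$estimate   # rho
tri_result$p.value    # the slides report 0.00138
```

The squared rank differences sum to 18, so rho = 1 - 6(18) / (10 × 99), which is about 0.8909, the coefficient quoted in Step 6.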

Spearman Rank Correlation
Example 2: The following are the ranks in statistics and the ranks in mathematics of
10 students in an examination. Determine if there is a relationship between the
ranks of students in the two subjects. Use 0.05 level of significance.
Subject   Statistics   Mathematics
1         56           66
2         75           70
3         45           20
4         71           60
5         62           65
6         64           56
7         58           59
8         80           77
9         76           67
10        61           68
Procedures for Testing Hypothesis
Step 1:
H0 : There is no significant relationship between the ranks of students in statistics
and mathematics subjects.
Ha : There is a significant relationship between the ranks of students in statistics and
mathematics subjects.
Step 2: α = 0.05
Step 3: Since we are testing for a significant relationship between two ordinal variables, we
will use Spearman Rho.
Step 4: Determine the p-value.
Command for Spearman Rho

cor.test(<numeric vector (independent)>, <numeric vector (dependent)>,
method = "spearman", conf.level = 1 - α)
Procedures for Testing Hypothesis

Step 5: Since the p-value (0.06025) is greater than the 0.05 level of significance, we
fail to reject H0.
Step 6: There is not sufficient evidence of a significant relationship between the ranks of
students in statistics and mathematics subjects. The sample correlation coefficient (0.6242)
suggests a moderately strong positive association, but it is not statistically significant
at the 0.05 level.
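
A minimal sketch of Step 4 for the statistics and mathematics data. The variable names (stat, math, subj_result) are illustrative assumptions.

```r
# Values for the 10 students, in subject order 1..10 (Example 2)
stat <- c(56, 75, 45, 71, 62, 64, 58, 80, 76, 61)
math <- c(66, 70, 20, 60, 65, 56, 59, 77, 67, 68)

# Spearman's rho and its p-value; there are no tied values in either variable
subj_result <- cor.test(stat, math, method = "spearman")
subj_result$estimate   # rho
subj_result$p.value    # the slides report 0.06025
```

The squared rank differences sum to 62, so rho = 1 - 6(62) / (10 × 99), which is about 0.6242, the coefficient quoted in Step 6.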

Chi-Square Test

The Chi-Square test for independence is used to determine whether there is an association
between two categorical variables.

Command for Chi-Square Test

chisq.test(<x>, <y>)

x: numeric vector or matrix (x and y can also both be factors)
y: numeric vector; ignored if x is a matrix

Chi-Square Test

Null and Alternative Hypothesis

H0 : The two categorical variables are independent.


Ha : The two categorical variables are dependent.

Chi-Square Test

Assumptions:
- There are 2 variables, and both are measured as categories, usually at the
nominal level. However, categories may be ordinal. Interval or ratio data that
have been collapsed into ordinal categories may also be used.
- The two variables should consist of two or more categorical, independent groups.
- The data in the cells should be frequencies, or counts of cases, rather than
percentages or some other transformation of the data.
- For a 2 by 2 table, all expected frequencies > 5.
- For a larger table, all expected frequencies > 1 and no more than 20% of all
cells may have expected frequencies < 5.

Chi-Square Test

Example 1: The Gallup Organization conducted a survey in 2014 asking individuals
questions pertaining to social well-being, such as strength of relationship with spouse,
partner, or closest friend, making time for trips or vacations, and having someone
who encourages them to be healthy. Social well-being scores were determined based
on answers to these questions and used to categorize individuals as thriving,
struggling, or suffering in their social well-being. In addition, body mass index (BMI)
was determined based on the height and weight of the individual. This allowed for
classification as obese, overweight, normal weight, or underweight.

Chi-Square Test

The data in the following contingency table are based on the results of this survey.

                Thriving   Struggling   Suffering
Obese           202        250          102
Overweight      294        302          110
Normal Weight   300        295          103
Underweight     17         17           8

Researchers wanted to determine whether the sample data suggests that there is an
association between weight classification and social well-being.

Procedures for Testing Hypothesis

Step 1:
H0 : There is no association between weight classification and social well-being.
Ha : There is an association between weight classification and social well-being.
Step 2: α = 0.05
Step 3: Since we are testing for a significant relationship between two categorical variables,
we will use the Chi-square test.

Procedures for Testing Hypothesis
Step 4: Determine the p-value.
The data are presented as a contingency table; the raw data are not given. To
solve this problem, we need to construct a matrix.
Syntax: matrix(<numeric vector>, nrow = <n>, ncol = <m>, byrow =
<bool>, dimnames = list(<vector>,<vector>))
Command for Chi-Square Test

chisq.test(<matrix>)

Step 5: Since the p-value (0.3057) is greater than the 0.05 level of significance, we
fail to reject H0.
Step 6: There is not sufficient evidence to conclude that there is an association
between weight classification and social well-being.
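
A minimal sketch of Step 4 for the survey data, building the matrix row by row from the contingency table and passing it to chisq.test. The variable names (wellbeing, chi_result) are illustrative assumptions.

```r
# Contingency table entered row by row (byrow = TRUE)
wellbeing <- matrix(c(202, 250, 102,
                      294, 302, 110,
                      300, 295, 103,
                       17,  17,   8),
                    nrow = 4, ncol = 3, byrow = TRUE,
                    dimnames = list(c("Obese", "Overweight",
                                      "Normal Weight", "Underweight"),
                                    c("Thriving", "Struggling", "Suffering")))

chi_result <- chisq.test(wellbeing)
chi_result$expected   # expected counts, useful for checking the assumptions
chi_result$p.value
```

The smallest expected count is about 6.8 (underweight and suffering), so the expected-frequency assumptions for a larger table are satisfied. The statistic is about 7.17 on 6 degrees of freedom, giving a p-value of about 0.306, in line with Step 5.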
Chi-Square Test

Example 2: Educators are always looking for novel ways to teach statistics to
undergraduates as part of a non-statistics degree course (e.g., psychology). With
current technology, it is possible to present how-to guides for statistical programs
online instead of in a book. However, different people learn in different ways. An
educator would like to know whether gender (male/female) is associated with the
preferred type of learning medium (online vs. books). Import the Excel file
"CHISQUARE(gender vs learning medium)".

Testing the Assumption

Contingency Table
To Construct Contingency Table

table(<row category>, <column category>)

Check that all expected frequencies are greater than 5.
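
The check can be sketched as follows. The Excel file is not reproduced in these slides, so the data below are entirely hypothetical stand-ins and will not reproduce the p-value reported for this example. The variable names (gender, medium, tab, expected_counts) are also illustrative assumptions.

```r
# HYPOTHETICAL raw data standing in for the Excel file (60 made-up students)
gender <- rep(c("Male", "Female"), times = c(30, 30))
medium <- c(rep(c("Online", "Books"), times = c(20, 10)),   # made-up males
            rep(c("Online", "Books"), times = c(12, 18)))   # made-up females

# Contingency table of observed counts: rows = gender, columns = medium
tab <- table(gender, medium)
tab

# Expected counts under independence, as computed by chisq.test
expected_counts <- chisq.test(tab)$expected
expected_counts
all(expected_counts > 5)   # TRUE means the 2 by 2 assumption is satisfied
```

With the real data, gender and medium would be the two columns of the imported data frame. The assumption concerns the expected counts rather than the observed counts, which is why the sketch inspects the expected component of the chisq.test result.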

Procedures for Testing the Hypothesis

Step 1:
H0 : Gender is not associated with the preferred type of learning medium.
Ha : Gender is associated with the preferred type of learning medium.
Step 2: α = 0.05
Step 3: Since we are testing for a significant relationship between two categorical variables,
we will use the Chi-square test.
Step 4: Determine the p-value.
Command for Chi-Square Test

chisq.test(<matrix>)

Procedures for Testing the Hypothesis

Step 5: Since the p-value (0.02635) is less than the 0.05 level of significance, we
reject H0.
Step 6: There is sufficient evidence, based on the sample data, that the gender of
students is associated with the preferred type of learning medium.

Thank You!
