Summary 2022 2023

Biostatistics for Students of Medicine and Health Sciences
10216235
First Semester 2022-2023
Elementary Statistics: A Step by Step Approach, Bluman, 7th Edition 2022-2023
BIOSTATISTICS
Chapter 1: Definitions and Concepts

Introduction
Statistics is used in almost all fields of human endeavor. In sports, for example, a statistician may keep
records of the number of yards a running back gains during a football game, or the number of hits a baseball
player gets in a season. In other areas, such as public health, an administrator might be concerned with the
number of residents who contract a new strain of flu virus during a certain year. In education, a researcher
might want to know if new methods of teaching are better than old ones. These are only a few
examples of how statistics can be used in various occupations.
Statistics is the science of conducting studies to collect, organize, summarize, analyze, and draw conclusions
from data.
Biostatistics: The branch of statistics responsible for the proper interpretation of scientific data generated
in biology, public health, and other health sciences (i.e., the biomedical sciences). In these sciences,
subjects are patients, mice, cells, etc.
Branches of Statistics
There are two branches of statistics:
Descriptive statistics consists of the collection, organization, summarization, and presentation of data.
Inferential statistics consists of generalizing from samples to populations, performing estimations and
hypothesis tests, determining relationships among variables, and making predictions.
Basic Definitions
A population consists of all subjects (human or otherwise) that are being studied.
A sample is a group of subjects selected from a population. If the subjects of a sample are properly selected,
most of the time they should possess the same or similar characteristics as the subjects in the population.
A variable is a characteristic or attribute that can assume different values.
Data are the values (measurements or observations) that the variables can assume.
A collection of data values forms a data set. Each value in the data set is called a data value or a datum.
When data are collected from every subject in the population, it is called a census.
An-Najah National University CH 1 – Page 1
Variables and Types of Data
Variables can be classified as qualitative or quantitative.
Qualitative variables are variables that have distinct categories according to some characteristic or
attribute.
For example, if subjects are classified according to gender (male or female), then the
variable gender is qualitative. Other examples of qualitative variables are religious preference and
geographic locations.
Quantitative variables are variables that can be counted or measured.
For example, the variable age is numerical, and people can be ranked in order according to the value of their
ages. Other examples of quantitative variables are heights, weights, and body temperatures.
Example 1:
Determine whether the following variables are qualitative or quantitative:
a. Sizes of soft drinks sold by a fast-food restaurant (small, medium, and large)
b. Cholesterol counts for individuals
c. Microwave wattage
d. Number of degrees awarded by a college each year for the last 10 years
e. Ratings of teachers
Quantitative variables can be further classified into two groups: discrete and continuous.
Discrete variables can be assigned values such as 0, 1, 2, 3 and are said to be countable.
Examples of discrete variables are the number of children in a family, the number of students in a classroom,
and the number of calls received by a call center each day for a month. Discrete variables assume values that
can be counted.
Continuous variables, by comparison, can assume an infinite number of values in an interval between any
two specific values. They are obtained by measuring, and they often include fractions and decimals.
Temperature, for example, is a continuous variable, since the variable can assume an infinite number of
values between any two given temperatures.

Example 2:
Determine whether the following variables are discrete or continuous:
a. Cholesterol counts for individuals
b. Microwave wattage
c. Number of degrees awarded by a college each year for the last 10 years
Level of Measurement
In addition to being classified as qualitative or quantitative, variables can be classified by how they are
categorized, counted, or measured. There are 4 levels:
The nominal level of measurement classifies data into mutually exclusive (non-overlapping) categories in
which no order or ranking can be imposed on the data.
The ordinal level of measurement classifies data into categories that can be ranked; however, precise
differences between the ranks do not exist.
The interval level of measurement ranks data, and precise differences between units of measure do exist;
however, there is no meaningful zero.
The ratio level of measurement possesses all the characteristics of interval measurement, and there exists a
true zero. In addition, true ratios exist when the same variable is measured on two different members of the
population.
Example 3:
Classify each as nominal-level, ordinal-level, interval-level, or ratio-level measurement:
a. Horsepower of automobile engines
b. Blood types—O, A, B, AB
c. Scores on a statistical final exam
d. Ratings of teachers
e. Number of degrees awarded by a college each year for the last 10 years
f. Temperatures at a seashore resort
g. Sizes of soft drinks sold by a fast-food restaurant (small, medium, and large)
h. Sizes of soft drinks sold by a fast-food restaurant in ml

Sampling Techniques
To obtain samples, statisticians use four basic methods of sampling:
A random sample is a sample in which all members of the population have an equal chance of being
selected.
A systematic sample is a sample obtained by selecting every 𝑘th member of the population where 𝑘 is a
counting number.
A stratified sample is a sample obtained by dividing the population into subgroups or strata according to
some characteristic relevant to the study. (There can be several subgroups.) Then subjects are selected at
random from each subgroup.
A cluster sample is obtained by dividing the population into sections or clusters and then selecting one or
more clusters at random and using all members in the cluster(s) as the members of the sample.
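The four sampling methods above can be sketched in Python. This is only an illustration under stated assumptions: the population of subjects numbered 1 to 100, the two strata, the ten clusters named hospital_0 to hospital_9, and all function names are hypothetical and do not come from the textbook.

```python
import random

def simple_random_sample(population, n, rng):
    """Random sample: every member has an equal chance of being selected."""
    return rng.sample(population, n)

def systematic_sample(population, k, rng):
    """Systematic sample: random start among the first k members, then every k-th member."""
    start = rng.randrange(k)
    return population[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    """Stratified sample: select subjects at random from each subgroup (stratum)."""
    return {label: rng.sample(members, n_per_stratum) for label, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Cluster sample: pick whole clusters at random and use all of their members."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for label in chosen for member in clusters[label]]

rng = random.Random(0)  # fixed seed so that the sketch is repeatable

# Hypothetical population: subjects numbered 1 to 100
population = list(range(1, 101))
strata = {"male": list(range(1, 51)), "female": list(range(51, 101))}
clusters = {f"hospital_{i}": list(range(10 * i + 1, 10 * i + 11)) for i in range(10)}

srs = simple_random_sample(population, 10, rng)
sys_sample = systematic_sample(population, 10, rng)
strat = stratified_sample(strata, 5, rng)
clus = cluster_sample(clusters, 2, rng)
```

Note the difference between the last two functions: the stratified sample takes a few subjects from every subgroup, while the cluster sample takes every subject from a few subgroups.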
Example 4:
State which sampling method was used.
a. Out of 10 hospitals in a municipality, a researcher selects one and collects records for a 24-hour
period on the types of emergencies that were treated there.
b. A researcher divides a group of students according to gender, major field, and low, average, and high
grade point average. Then she randomly selects six students from each group to answer questions in
a survey.
c. The subscribers to a magazine are numbered. Then a sample of these people is selected using
random numbers.
d. Every 10th bottle of Energized Soda is selected, and the amount of liquid in the bottle is measured.
The purpose is to see if the machines that fill the bottles are working properly.

BIOSTATISTICS
CHAPTER 2: Frequency Distributions and Graphs

When conducting a statistical study, the researcher must gather data for the particular variable under study.
To describe situations, draw conclusions, or make inferences about events, the researcher must organize the
data in some meaningful way. The most convenient method of organizing data is to construct a frequency
distribution or frequency table.
After organizing the data, the researcher must present them so they can be understood by those who will
benefit from reading the study. The most useful method of presenting the data is by constructing statistical
charts and graphs. There are many different types of charts and graphs, and each one has a specific purpose.
I. Organizing Categorical (Qualitative) Data.
Example 1:
Twenty-five patients were given a blood test to determine their blood type. The data set is
A B B AB O O O B AB B B B O
A O A O O O AB AB A O B A
a) Construct a frequency distribution.
Blood type   Frequency   Relative frequency   Percent
A            5           5/25 = 0.20          20%
B            7           7/25 = 0.28          28%
AB           4           4/25 = 0.16          16%
O            9           9/25 = 0.36          36%
Total        25 = n      1.00                 100%

• The sum of the frequencies equals the number of observations n.
• Relative frequency = frequency / n.
• The sum of the relative frequencies equals 1.
• Percent = relative frequency × 100%.
• The sum of the percents equals 100%.
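As a check on the table, the same frequency distribution can be built in Python. This is a minimal sketch; the variable names are illustrative, and only the blood type data come from Example 1.

```python
from collections import Counter

# Blood types of the 25 patients in Example 1
blood_types = ("A B B AB O O O B AB B B B O "
               "A O A O O O AB AB A O B A").split()

n = len(blood_types)          # number of observations
freq = Counter(blood_types)   # frequency of each category

table = {}
for category in ["A", "B", "AB", "O"]:
    f = freq[category]
    rel = f / n                               # relative frequency = frequency / n
    table[category] = (f, rel, rel * 100)     # (frequency, relative frequency, percent)
    print(f"{category:<3} {f:>3} {rel:>6.2f} {rel * 100:>5.0f}%")
```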

b) Construct a bar graph.
[Figure: Bar Chart of Blood Type. Horizontal axis: Blood Type (A, AB, B, O); vertical axis: Frequency.]
c) Construct a pie graph.
[Figure: Pie Chart of Blood Type, with sections A, AB, B, and O.]
The angle of each section in the above pie graph is evaluated as follows:
• The angle of section A is $\theta_A = \text{rel. freq}_A \times 360^\circ = 0.20 \times 360^\circ = 72^\circ$
• Similarly, the angle of section B is $\theta_B = \text{rel. freq}_B \times 360^\circ = 0.28 \times 360^\circ = 100.8^\circ$
• And the angle of section AB is $\theta_{AB} = \text{rel. freq}_{AB} \times 360^\circ = 0.16 \times 360^\circ = 57.6^\circ$
• Finally, the angle of section O is $\theta_O = \text{rel. freq}_O \times 360^\circ = 0.36 \times 360^\circ = 129.6^\circ$
Notice that the sum of all angles is $72^\circ + 100.8^\circ + 57.6^\circ + 129.6^\circ = 360^\circ$.
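The angle computation can be sketched in a few lines of Python. The dictionary name and layout are illustrative; the frequencies are those of Example 1.

```python
# Frequencies of the blood types in Example 1
frequencies = {"A": 5, "B": 7, "AB": 4, "O": 9}
n = sum(frequencies.values())

# Angle of each pie section = relative frequency x 360 degrees
angles = {blood_type: f / n * 360 for blood_type, f in frequencies.items()}

for blood_type, angle in angles.items():
    print(f"{blood_type:<3} {angle:6.1f} degrees")
```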

II. Organizing Quantitative Data.
Example 2:
The following data are ages of 25 randomly selected college students
19 18 22 20 20 21 23 20 19 20 22 21 20
19 19 21 19 20 18 20 22 19 20 21 20
a) Construct a frequency distribution.
Age     Frequency   Relative frequency   Percent   Cumulative frequency
18      2           0.08                 8%        2
19      6           0.24                 24%       2 + 6 = 8
20      9           0.36                 36%       8 + 9 = 17
21      4           0.16                 16%       17 + 4 = 21
22      3           0.12                 12%       21 + 3 = 24
23      1           0.04                 4%        24 + 1 = 25 = n
Total   25 = n      1.00                 100%
For the age 20 we can say:
• 9 of the students are 20 years old.
• 36% of the students are 20 years old.
• 17 of the students are 20 years old or younger.
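The cumulative frequency column is a running total of the frequency column. A short Python sketch of that idea, using the ages of Example 2 (variable names are illustrative):

```python
from collections import Counter
from itertools import accumulate

# Ages of the 25 students in Example 2
ages = [19, 18, 22, 20, 20, 21, 23, 20, 19, 20, 22, 21, 20,
        19, 19, 21, 19, 20, 18, 20, 22, 19, 20, 21, 20]

n = len(ages)
freq = Counter(ages)
values = sorted(freq)                    # distinct ages in increasing order
counts = [freq[v] for v in values]       # frequency of each age
cumulative = list(accumulate(counts))    # running total of the frequencies

# One row per age: (age, frequency, relative frequency, cumulative frequency)
rows = [(v, f, f / n, c) for v, f, c in zip(values, counts, cumulative)]
for row in rows:
    print(row)
```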
b) Construct a frequency histogram.
[Figure: Histogram of Age. Horizontal axis: Age (18 to 23); vertical axis: Frequency.]

c) Construct a frequency polygon.
[Figure: Polygon for Ages. Horizontal axis: Ages (17 to 24); vertical axis: Frequency.]
III. Grouped Frequency Distributions.
When the range of the data is large, the data must be grouped into classes that are more than one unit in
width, in what is called a grouped frequency distribution.
Example 3:
The data shown here represent the number of grams per serving of 30 randomly selected brands of cakes.
32 47 51 41 46 30 46 38 34 34 52 48 48 38 43
41 21 24 25 29 33 45 51 32 32 27 23 23 34 35
These data can be grouped into 5 classes as shown in the following table.
Class limits   Frequency   Relative frequency   Class mark (midpoint)   Cumulative frequency
21 – 27        6           0.20                 (21 + 27)/2 = 24        6
28 – 34        9           0.30                 31                      15
35 – 41        5           0.17                 38                      20
42 – 48        7           0.23                 45                      27
49 – 55        3           0.10                 52                      30 = n
Total          30 = n      1.00

Notice that:
• The numbers 21, 28, 35, 42, and 49 are called the lower limits.
• The numbers 27, 34, 41, 48, and 55 are called the upper limits.
• The midpoint of each class is the average of the lower and upper limits.
• The difference between any two successive lower limits is the same, and it equals the difference
between any two successive upper limits and any two successive midpoints. In the above example
this difference equals 7.
• The first class contains the lowest data value and the final class contains the highest data value.
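The grouping can be reproduced with a short Python sketch. It assumes the class structure of the table above (first lower limit 21, a difference of 7 between successive lower limits, and 5 classes); the variable names are illustrative.

```python
# Grams per serving for the 30 brands of cakes in Example 3
grams = [32, 47, 51, 41, 46, 30, 46, 38, 34, 34, 52, 48, 48, 38, 43,
         41, 21, 24, 25, 29, 33, 45, 51, 32, 32, 27, 23, 23, 34, 35]

lowest_limit, step, n_classes = 21, 7, 5

# Each class is (lower limit, upper limit); successive lower limits differ by `step`
classes = [(lowest_limit + i * step, lowest_limit + i * step + step - 1)
           for i in range(n_classes)]

# Frequency: number of data values between the limits of each class (inclusive)
freqs = [sum(lo <= g <= hi for g in grams) for lo, hi in classes]

# Midpoint: average of the lower and upper limits
midpoints = [(lo + hi) / 2 for lo, hi in classes]
```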
a) Construct a frequency histogram.
[Figure: Histogram of Number of Grams. Horizontal axis: Number of Grams (class midpoints 24, 31, 38, 45, 52); vertical axis: Frequency.]
b) Construct a frequency polygon.
[Figure: Polygon for Number of Grams. Horizontal axis: Number of Grams (17, 24, 31, 38, 45, 52, 59); vertical axis: Frequency.]

Example 4:
The following frequency distribution gives the health care quality scores for selected hospitals.
Class limits     Frequency   Relative frequency   Cumulative frequency   Class mark (midpoint)
83.1 – 88.9      2           2/25 = 0.08          2                      86.0
89.0 – 94.8      4           0.16                 6                      91.9
94.9 – 100.7     4           0.16                 10                     97.8
100.8 – 106.6    5           0.20                 15                     103.7
106.7 – 112.5    7           0.28                 22                     109.6
112.6 – 118.4    3           0.12                 25                     115.5
a) What is the number of hospitals in the sample?

The number of hospitals is $n = \sum \text{freq} = 2 + 4 + 4 + 5 + 7 + 3 = 25$
b) Find the relative frequencies and midpoints.

See the table above.
c) Construct a frequency polygon.
[Figure: frequency polygon. Horizontal axis: Score (80.1, 86.0, 91.9, 97.8, 103.7, 109.6, 115.5, 121.4); vertical axis: Frequency.]

PROBLEMS
1. Suppose that a set of data was grouped into 7 classes as follows:
Class limits 0.6 – 1.0 1.1 – 1.5 1.6 – 2.0 2.1 – 2.5 2.6 – 3.0 3.1 – 3.5 3.6 – 4.0
Frequency 3 4 7 8 5 2 1
a. What is the mark (midpoint) of the 7th class?
b. In which class will the observation 2.57 be placed?
c. Determine the number of data values that are 1.5 or less.
d. Determine the percentage of data values that are 2.6 or greater.
2. Suppose that weights of shipments, to the nearest pound, are grouped as follows:
Class limits   1 – 75   76 – 150   151 – 225   226 – 300   301 – 375   376 – 450   451 – 525   526 – 600
Frequency      12       21         33          42          19          14          6           5
Determine the number of shipments weighing:
a. 75 pounds?
b. Less than 200 pounds?
c. 300 pounds or less?
d. Less than 300 pounds?
e. More than 450 pounds?
f. 450 pounds or more?
3. The following are numbers of courses taken by 20 randomly selected college students:
3 4 3 5 1 4 6 3 4 1
3 5 2 2 4 2 3 5 4 4
a. Construct a frequency table and show the relative and cumulative frequencies.
b. Construct a bar chart.

4. Weight category and smoking status were observed for 65 randomly selected adolescents, and the following data
were obtained:
# W S # W S # W S # W S # W S
1 A Yes 14 A No 27 A No 40 A No 53 U Yes
2 A Yes 15 A No 28 O No 41 O Yes 54 A Yes
3 A No 16 O No 29 A No 42 A Yes 55 A No
4 A No 17 O Yes 30 A No 43 A No 56 A No
5 A No 18 A No 31 A Yes 44 A No 57 A Yes
6 A No 19 A No 32 A No 45 A No 58 U No
7 A No 20 A No 33 A No 46 O No 59 A No
8 O No 21 U No 34 A No 47 A No 60 A No
9 U No 22 O No 35 A No 48 A No 61 A No
10 A Yes 23 U No 36 A Yes 49 A No 62 A No
11 O No 24 A No 37 A No 50 A No 63 A No
12 A No 25 O Yes 38 A No 51 A No 64 U Yes
13 A No 26 A Yes 39 A No 52 A Yes 65 A No
*W: weight, S: smoking, U: underweight, O: overweight, A: appropriate weight
a. Write the frequencies in the following table:
Smoking
Yes No
Underweight ___ ___

Overweight ___ ___
Appropriate ___ ___
b. Construct a pie chart to show the proportions of weight categories among adolescent smokers.
c. Construct a pie chart to show the proportions of weight categories among adolescent
nonsmokers.
Answers
1. a. 3.8 b. The 5th class c. 7 d. 8 of the 30 values, i.e. 8/30 ≈ 26.7%
2. a. Between 0 and 12 b. Between 33 and 66 c. 108 d. Between 66 and 108 e. 11 f. Between 11 and 25

3. a. The frequency table
Number of courses   Frequency   Relative frequency   Cumulative frequency
1                   2           0.10                 2
2                   3           0.15                 5
3                   5           0.25                 10
4                   6           0.30                 16
5                   3           0.15                 19
6                   1           0.05                 20
b. The bar chart.
[Figure: bar chart. Horizontal axis: Number of courses (1 to 6); vertical axis: Frequency.]
4. a. The frequencies are in the following table:
Smoking
Yes No
Underweight 2 4
Overweight 3 6
Appropriate 10 40
b. The pie chart to show the proportions of weight categories among smoker adolescents is below.
[Figure: pie chart with categories Appropriate, Overweight, and Underweight.]
c. The pie chart to show the proportions of weight categories among nonsmoker adolescents is below.
[Figure: pie chart with categories Appropriate, Overweight, and Underweight.]

BIOSTATISTICS
CHAPTER 3: Data Description

Chapter 2 showed how you can gain useful information from raw data by organizing them into frequency
distributions and then presenting the data by using various graphs. This chapter shows the statistical
methods that can be used to summarize data.
Definition:
A statistic is a characteristic or measure obtained by using the data values from a sample.
A parameter is a characteristic or measure obtained by using all the data values from a specific population.
In statistics, Greek letters are used to denote parameters, and Roman letters are used to denote statistics.
For this chapter assume that the data are obtained from samples unless otherwise specified.
Section 3-1: Measures of Central Tendency
I. The Mean
The mean, also known as the arithmetic average, is found by adding the values of the data and
dividing by the total number of values. For example, the mean of 3, 2, 6, 5, and 4 is found by adding
3 + 2 + 6 + 5 + 4 = 20 and dividing by 5; hence, the mean of the data is 20 ÷ 5 = 4. The values of
the data are represented by $X$'s. In this data set, $X_1 = 3$, $X_2 = 2$, $X_3 = 6$, $X_4 = 5$, and $X_5 = 4$. To show
a sum of the total $X$ values, the symbol $\Sigma$ (the capital Greek letter sigma) is used, and $\sum X$ means to
find the sum of the $X$ values in the data set.
The mean is the sum of the values, divided by the total number of values. The symbol $\bar{x}$ represents the
sample mean.
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x}{n} \]
where $n$ represents the total number of values in the sample.
For a population, the Greek letter $\mu$ (mu) is used for the mean.
\[ \mu = \frac{X_1 + X_2 + \cdots + X_N}{N} = \frac{\sum X}{N} \]
where $N$ represents the total number of values in the population.
Notice that $\bar{x}$ is a statistic and $\mu$ is a parameter.

Example 1:
The data show the number of patients in a sample of 6 hospitals who acquired an infection while
hospitalized. Find the mean.
110 76 29 38 105 31
The mean number of infections for the 6 hospitals is
\[ \bar{x} = \frac{\sum x}{n} = \frac{110 + 76 + 29 + 38 + 105 + 31}{6} = \frac{389}{6} = 64.8 \]
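The formula for the sample mean translates directly into Python. This is a minimal sketch; the function name sample_mean is illustrative, and the data are those of Example 1.

```python
def sample_mean(values):
    """Sample mean: the sum of the values divided by the number of values."""
    return sum(values) / len(values)

# Number of patients who acquired an infection in each of the 6 hospitals
infections = [110, 76, 29, 38, 105, 31]
mean_infections = sample_mean(infections)
print(round(mean_infections, 1))
```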
Example 2:
Find the mean age for the following selected students:
Age Frequency Age × Frequency

18 2 18 × 2 = 36
19 6 19 × 6 = 114
20 9 20 × 9 = 180
21 4 21 × 4 = 84
22 3 22 × 3 = 66
23 1 23 × 1 = 23
Total 25 503
To find the mean age,

1. Find the number of students in the sample by adding the frequencies, $n = \sum \text{freq} = 25$.
2. Construct a new column and evaluate (Age × Freq) for each row.
3. The sum of all ages is the sum of the results in the new column. Here, $\sum x = 503$.
4. The mean is $\bar{x} = \frac{503}{25} = 20.12$ years.
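The four steps above can be sketched in Python. The dictionary age_freq is an illustrative way of storing the frequency table of Example 2.

```python
# Frequency table of Example 2: age -> frequency
age_freq = {18: 2, 19: 6, 20: 9, 21: 4, 22: 3, 23: 1}

n = sum(age_freq.values())                           # step 1: number of students
products = {age: age * f for age, f in age_freq.items()}   # step 2: Age x Freq column
total = sum(products.values())                       # step 3: sum of all ages
mean_age = total / n                                 # step 4: the mean

print(n, total, mean_age)
```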
Example 3:
Find the mean for the following data sets:

a) 53, 52, 50, 55, 53
b) 53, 52, 50, 55, 53, 1
The mean for set (a) is 263/5 = 52.6
The mean for set (b) is 264/6 = 44
The data value "1" in set (b) is an extreme value, and it strongly affects the mean.

Properties of the mean:
1. The mean is computed for quantitative data.

2. The mean is found by using all the values of the data.
3. The mean for the data set is unique and not necessarily one of the data values.
4. For any data set, the mean is not smaller than the minimum value in the data set and not greater
than the maximum value.
5. The mean is affected by extremely high or low values, called outliers, and may not be the
appropriate average to use in these situations.
II. The Median
The median is the midpoint of the data array. The symbol for the median is $MD$.
This means the median divides the data set into two equal parts: 50% of the data are below the median and
50% are above the median.
How to find the median for the data set $x_1, x_2, \ldots, x_n$:
1. Arrange the data in order $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$, where $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$.
2. If $n$ is odd, then the median is $MD = x_{\left(\frac{n+1}{2}\right)}$.
3. If $n$ is even, then the median is $MD = \frac{1}{2}\left(x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right)$.
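The rule for the median can be sketched as a small Python function. The function name is illustrative; note that Python lists are indexed from 0, so the observation $x_{(k)}$ of the formula is ordered[k - 1] in the code.

```python
def median(values):
    """Median of a data set, following the odd/even rule for ordered data."""
    ordered = sorted(values)          # step 1: arrange the data in order
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # n odd: the single middle value x_((n+1)/2)
        return ordered[mid]
    # n even: the average of the two middle values x_(n/2) and x_(n/2 + 1)
    return (ordered[mid - 1] + ordered[mid]) / 2

asthma = [253, 125, 328, 417, 201, 70, 90]      # data of Example 4 (n = 7)
infections = [110, 76, 29, 38, 105, 31]         # data of Example 5 (n = 6)
print(median(asthma), median(infections))
```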
Example 4:
The number of children with asthma during a specific year in seven local districts is shown. Find the
median.
253 125 328 417 201 70 90
To find the median:

1. Arrange the data → 70, 90, 125, 201, 253, 328, 417
2. $n = 7$, odd number
3. $MD = x_{\left(\frac{7+1}{2}\right)} = x_{(4)} = 201$

Example 5:
The data show the number of patients in a sample of 6 hospitals who acquired an infection while
hospitalized. Find the median.
110 76 29 38 105 31
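To find the median:
1. Arrange the data → 29, 31, 38, 76, 105, 110
2. $n = 6$, even number
3. $MD = \frac{1}{2}\left(x_{\left(\frac{6}{2}\right)} + x_{\left(\frac{6}{2}+1\right)}\right) = \frac{1}{2}\left(x_{(3)} + x_{(4)}\right) = \frac{1}{2}(38 + 76) = 57$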
Example 6:
Find the median age for the following selected students:
Age     Frequency   Cumulative frequency
18      2           2 < 13
19      6           8 < 13
20      9           17 ≥ 13   ← $x_{(13)}$
21      4
22      3
23      1
Total   25
To find the median age,

1. The data here are already arranged in order.
2. $n = \sum \text{freq} = 25$, odd number.
3. $MD = x_{\left(\frac{25+1}{2}\right)} = x_{(13)} = 20$. To find the 13th observation, construct the column of cumulative
frequency.
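The search through the cumulative frequency column can be sketched in Python. The function name and the dictionary layout are illustrative; the function looks for the first value whose cumulative frequency reaches the position of the middle observation (or of the two middle observations when $n$ is even).

```python
def median_from_frequencies(freq_table):
    """Median of data given as a frequency table {value: frequency}."""
    n = sum(freq_table.values())
    # 1-based position(s) of the middle observation(s)
    positions = [(n + 1) // 2] if n % 2 == 1 else [n // 2, n // 2 + 1]
    found = []
    for pos in positions:
        cumulative = 0
        for value in sorted(freq_table):
            cumulative += freq_table[value]
            if cumulative >= pos:      # first value whose cumulative frequency reaches pos
                found.append(value)
                break
    return sum(found) / len(found)

age_freq = {18: 2, 19: 6, 20: 9, 21: 4, 22: 3, 23: 1}   # table of Example 6
median_age = median_from_frequencies(age_freq)
print(median_age)
```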
Properties of the median:
1. The median is computed for quantitative data.

2. The median is used to find the centre or middle value of a data set.
3. The median for the data set is unique and not necessarily one of the data values.
4. For any data set, the median is not smaller than the minimum value in the data set and not greater
than the maximum value.
5. The median is affected less than the mean by extremely high or extremely low values.

III. The Mode
The value that occurs most often in a data set is called the mode.
A data set that has only one value that occurs with the greatest frequency is said to be unimodal. If a data set
has two values that occur with the same greatest frequency, both values are considered to be the mode and
the data set is said to be bimodal. When no data value occurs more than the others, the data set is said to
have no mode.
Example 7:
Find the mode for the following data:
a) 18.0, 14.0, 34.5, 10.0, 11.3, 10.0, 12.4, 10.0

Since 10.0 occurred 3 times, a frequency larger than that of any other number, the mode is 10.0.
The data set here is unimodal.
b) 18.0, 14.0, 34.5, 10.0, 11.3, 10.0, 12.4, 10.0, 12.4, 11.3, 12.4
Since 10.0 and 12.4 occurred 3 times each, a frequency larger than that of any other number, there are two
modes, 10.0 and 12.4. The data set is bimodal.
c) 18.0, 14.0, 34.5, 10.0, 11.3, 10.0, 12.4, 10.0, 12.4, 11.3, 12.4, 11.3
Since 10.0, 12.4, and 11.3 occurred 3 times each, a frequency larger than that of any other number, there are
three modes, 10.0, 12.4, and 11.3. The data set is trimodal.
d) 18.0, 14.0, 34.5, 10.0, 11.3, 10.0, 12.4, 10.0, 12.4, 11.3, 12.4, 11.3, 14.0, 14.0
Here four values (10.0, 11.3, 12.4, and 14.0) occur 3 times each. The maximum number of modes is three;
when the number of modes exceeds 3, we say there is no mode. So, the above data have no mode.
Example 8:
Find the mode for the following data
a) 401, 334, 209, 201, 227, 353.

b) 401, 334, 209, 201, 227, 353, 401, 334, 209, 201, 227, 353.
Since each value occurs as many times as any other value, there is no mode for data set (a) or data set
(b).
Note: Do not say that the mode is zero. That would be incorrect, because in some data zero can be an actual
value.
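The rules used in Examples 7 and 8 can be sketched as a Python function. The function name and the parameter max_modes are illustrative. The sketch encodes two conventions of these notes as assumptions: a data set in which every distinct value occurs equally often has no mode, and a data set with more than three tied values has no mode. "No mode" is returned as an empty list, never as zero.

```python
from collections import Counter

def modes(values, max_modes=3):
    """Return the sorted list of modes, or an empty list when there is no mode."""
    counts = Counter(values)
    top = max(counts.values())                          # the greatest frequency
    tied = sorted(v for v, c in counts.items() if c == top)
    # Convention of these notes: no mode if all values tie or more than max_modes values tie
    if len(tied) == len(counts) or len(tied) > max_modes:
        return []
    return tied

set_a = [18.0, 14.0, 34.5, 10.0, 11.3, 10.0, 12.4, 10.0]
set_b = set_a + [12.4, 11.3, 12.4]
set_c = set_b + [11.3]
set_d = set_c + [14.0, 14.0]
print(modes(set_a), modes(set_b), modes(set_c), modes(set_d))
```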

Example 9:
Find the mode age for the following selected students:
Age Frequency
18 2
19 6
Mode → 20 9 ← The largest frequency
21 4
22 3
23 1
The mode age is 20, because the age 20 occurs more than any other age.
Properties of the mode:
1. The mode is used when the most typical case is desired.

2. The mode can be used when the data are nominal or categorical.
3. The mode is not always unique. A data set can have more than one mode, or the mode may not exist
for the data set.
Exercise 1:
The scores in a short quiz are distributed as given in the following table:
Score 1 2 3 4 5
Relative frequency 0.05 0.15 0.20 0.35 0.25
Find the mean, the median and the mode score.
Exercise 2:
The scores in a short quiz are distributed as given in the following table:
Score 1 2 3 4 5
Cumulative frequency 3 8 18 24 30
Find the mean, the median and the mode score.

Exercise 3:
Find the mean, median and mode for the following data set:
[Figure: bar chart. Horizontal axis: x (1 to 10); vertical axis: Frequency.]
PROBLEMS
1. Determine the mean, median, and mode for each of the following samples:
a. 20, 22, 23, 26, 29, 30   $\bar{x} = 25$, $MD = 24.5$, no mode
b. 120, 122, 123, 126, 126, 130, 134   $\bar{x} = 125.86$, $MD = 126$, Mode = 126
c. 40, 44, 44, 46, 52, 52, 58, 60, 61, 65, 70, 72   $\bar{x} = 55.33$, $MD = 55$, Mode = 44, 52
2. Given the data below, find the mean, the median and the mode:
132, 117, 143, 114, 125, 133, 197, 134, 114, 143, 121, 108, 131, 109, 117, 116, 84, 102, 153, 116, 98,
122, 127, 113, 111, 65, 122, 114
$\bar{x} = 120.75$, $MD = 117$, Mode = 114
3. Find the mean, median, and mode for the following ages:
Age 17 18 19 20 21 22 23
Frequency 2 4 8 11 14 7 3
$\bar{x} = 20.31$, $MD = 20$, Mode = 21

4. Two groups of students were given a test and the following data are obtained:
Group Number of students Mean score Proportion of students who pass

I 20 78 0.9
II 30 75 0.8
If the two groups are combined together find:

a. The combined mean. 76.2
b. The proportion of students who pass. 0.84
5. A sample of adult males produced a mean weight of 154 pounds, a median weight of 160 pounds, and
a mode of 157 pounds. If the unit of measurement is converted from pound to kilogram find the
mean, the median and the mode in kilograms. (Hint: 1 kg = 2.205 pounds)
$\bar{x} = \frac{154}{2.205} = 69.84$ kg, $MD = 72.56$ kg, Mode = 71.20 kg
6. The mean, the median and the mode of a certain data set of size $n = 20$ are 123, 125 and 130,
respectively. If two new observations of values 120 and 148 are obtained, find the new values of the
mean, the median and the mode.
$\bar{x} = 124$, $MD = 125$, Mode = 130, or 130 and 120, or 130 and 148, or 130, 120 and 148
7. Compute the mean, the median and the mode for each of the following:
[Figures a, b, and c: three bar charts. Horizontal axis: Score (16 to 24); vertical axis: Frequency.]
a. $\bar{x} = 20$, $MD = 20$, Mode = 20
b. $\bar{x} = 19.24$, $MD = 19$, Mode = 18
c. $\bar{x} = 19.84$, $MD = 20$, Mode = 18, 20

Section 3-2: Measures of Variation

In statistics, to describe the data set accurately, statisticians must know more than the measures of central
tendency. Consider the following example.
Example 1:
A testing lab wishes to test two experimental brands of paint to see how long each will last before fading.
The testing lab makes 6 gallons of each brand to test. The results (in months) are shown.
Brand A 10 60 50 30 40 20
Brand B 35 45 30 35 40 25
The mean of brand A is $\mu = \frac{10+60+50+30+40+20}{6} = 35$ months.
The mean of brand B is $\mu = \frac{35+45+30+35+40+25}{6} = 35$ months.
Since the means are equal, you might conclude that both brands of paint last equally well, but this is not the
case. Even though the means are the same for the two brands, the spread, or variation, is quite different. You
can see that brand B performs more consistently; it is less variable.
For the spread or variability of a data set, three measures are commonly used: range, variance, and standard
deviation. Each measure will be discussed in this section.
I. The Range
The range is the simplest of the three measures.
The range is the absolute difference between the highest value and the lowest value in the data set. The
symbol 𝑅 is used for the range.
𝑅 = 𝑀𝑎𝑥 – 𝑀𝑖𝑛
Example 2:
Find the ranges for the paints in Example 1.
For brand A, the range is 𝑅 = 60 – 10 = 50 months

For brand B, the range is 𝑅 = 45 – 25 = 20 months
The range of brand A shows that 50 months separate the largest data value from the smallest data value.
For brand B, 20 months separate the largest data value from the smallest data value, which is less than
one-half of brand A's range.
Note: The range cannot be a negative number.

Example 3:
Find the range for the following data sets:
Set 1 40 30 25 15 18
Set 2 40 30 25 15 18 100
For data set 1, the range is 𝑅 = 40 – 15 = 25

For data set 2, the range is 𝑅 = 100 – 15 = 85
The range of data set 2 is very large compared to the range of data set 1, since data set 2 contains an
extreme value (outlier).
Note: One extremely high or low data value can affect the range markedly.
II. The Variance
The variance is the average of the squares of the deviations of each value from the mean. The symbol for the
population variance is $\sigma^2$.
The formula for the population variance is
\[ \sigma^2 = \frac{(X_1 - \mu)^2 + (X_2 - \mu)^2 + \cdots + (X_N - \mu)^2}{N} = \frac{\sum (X - \mu)^2}{N} \]
You might wonder why the squared deviations are used instead of the actual deviations. One reason is that
the sum of the deviations will always be zero: $\sum (X - \mu) = 0$.
Example 4:
Find the variances for the paints in Example 1.
Brand A 10 60 50 30 40 20
Brand B 35 45 30 35 40 25
To find the variance of brand A:

1. Find the population size. Here, $N = 6$.
2. Compute the population mean: $\mu = \frac{10+60+50+30+40+20}{6} = 35$ months.
3. Compute the variance. Here,
\[ \sigma^2 = \frac{(10-35)^2 + (60-35)^2 + (50-35)^2 + (30-35)^2 + (40-35)^2 + (20-35)^2}{6} = \frac{1750}{6} = 291.7 \text{ months}^2 \]
Similarly, the variance of brand B is $\sigma^2 = 41.7$ months$^2$.
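The population variance of the two brands can be checked with a short Python sketch. The function names are illustrative. The sketch also confirms that the plain deviations from the mean sum to zero, which is why the deviations are squared.

```python
def population_variance(values):
    """Population variance: the average of the squared deviations from the mean."""
    N = len(values)
    mu = sum(values) / N
    return sum((x - mu) ** 2 for x in values) / N

def sum_of_deviations(values):
    """Sum of the (unsquared) deviations from the mean; always zero up to rounding."""
    mu = sum(values) / len(values)
    return sum(x - mu for x in values)

brand_a = [10, 60, 50, 30, 40, 20]   # months before fading, brand A
brand_b = [35, 45, 30, 35, 40, 25]   # months before fading, brand B

var_a = population_variance(brand_a)
var_b = population_variance(brand_b)
print(round(var_a, 1), round(var_b, 1))
```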

Sample Variance
The formula for the sample variance, denoted by $s^2$, is
\[ s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1} = \frac{\sum (x - \bar{x})^2}{n-1} \]
Example 5:
The data show the number of patients in a sample of 6 hospitals who acquired an infection while
hospitalized. Find the variance.
110 76 29 38 105 32
To find the variance:
1. The sample size is $n = 6$.
2. The sample mean is $\bar{x} = \frac{110+76+29+38+105+32}{6} = \frac{390}{6} = 65$ infections.
3. The variance is
\[ s^2 = \frac{(110-65)^2 + (76-65)^2 + (29-65)^2 + (38-65)^2 + (105-65)^2 + (32-65)^2}{6-1} = \frac{6860}{5} = 1372 \text{ infections}^2 \]
Shortcut formulas for $s^2$
The shortcut formulas for computing the variance are as follows:
\[ s^2 = \frac{\sum x^2 - n\bar{x}^2}{n-1}, \qquad s^2 = \frac{n \sum x^2 - \left(\sum x\right)^2}{n(n-1)} \]
Example 6:
Find the variance for the following sample:
11.2 11.9 12.0 12.8 13.4 14.3
To find the variance:

1. The sample size is $n = 6$.
2. The sample sum is $\sum x = 11.2 + 11.9 + 12.0 + 12.8 + 13.4 + 14.3 = 75.6$
3. Square and sum: $\sum x^2 = 11.2^2 + 11.9^2 + 12.0^2 + 12.8^2 + 13.4^2 + 14.3^2 = 958.94$
4. The variance is $s^2 = \frac{958.94 - 75.6^2/6}{6-1} = 1.276$
Note: $\sum x^2 \ne \left(\sum x\right)^2$. The notation $\sum x^2 = x_1^2 + x_2^2 + \cdots + x_n^2$, while $\left(\sum x\right)^2 = (x_1 + x_2 + \cdots + x_n)^2$.
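The definition of the sample variance and the second shortcut formula should give the same value. A small Python sketch, with illustrative function names, compares the two on the data of Example 6 and of Example 5:

```python
def sample_variance_definition(values):
    """s^2 from the definition: sum of squared deviations divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_variance_shortcut(values):
    """s^2 from the shortcut: (n * sum(x^2) - (sum x)^2) / (n * (n - 1))."""
    n = len(values)
    sum_x = sum(values)                       # sum of the values
    sum_x2 = sum(x * x for x in values)       # sum of the squared values
    return (n * sum_x2 - sum_x ** 2) / (n * (n - 1))

sample = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]      # Example 6
infections = [110, 76, 29, 38, 105, 32]            # Example 5

print(round(sample_variance_definition(sample), 3),
      round(sample_variance_shortcut(sample), 3))
```

With decimal data the two results can differ in the last few digits because of floating-point rounding, so they are compared here after rounding.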

III. The Standard deviation
The standard deviation is the positive square root of the variance.
The population standard deviation is $\sigma = \sqrt{\sigma^2}$
The sample standard deviation is $s = \sqrt{s^2}$
Why is it necessary to take the square root? The reason is that since the observations were squared, the unit
of the variance is the square of the unit of the original raw data. Finding the square root of the variance puts
the standard deviation in the same units as the raw data.
Note that the variance and the standard deviation cannot be negative numbers.
Example 7:
Find the standard deviations for the paints in Example 1.
Standard deviation for brand A is $\sigma = \sqrt{\sigma^2} = \sqrt{291.7} = 17.1$ months.

Standard deviation for brand B is $\sigma = \sqrt{\sigma^2} = \sqrt{41.7} = 6.5$ months.
Example 8:
Find the standard deviation for the following ages:
Age     Frequency   Age² × Freq
18      2           18² × 2
19      6           19² × 6
20      9           20² × 9
21      4           21² × 4
22      3           22² × 3
23      1           23² × 1
Total   25          10159
The standard deviation can be computed as follows:
1. The sample size is $n = \sum \text{freq} = 25$
2. The mean is $\bar{x} = 20.12$ years (the solution is shown in Example 2 in Section 3-1)
3. The sum of squares is $\sum x^2 = \sum (\text{Age}^2 \times \text{Freq}) = 10159$
4. The variance is $s^2 = \frac{10159 - 25 \times 20.12^2}{25-1} = 1.61$ years$^2$.
5. The standard deviation is $s = \sqrt{1.61} = 1.27$ years
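The five steps can be sketched in Python for data given as a frequency table. The variable names are illustrative. The sketch uses the equivalent form $s^2 = \frac{\sum x^2 - (\sum x)^2 / n}{n-1}$, which avoids rounding the mean before squaring it.

```python
import math

# Frequency table of Example 8: age -> frequency
age_freq = {18: 2, 19: 6, 20: 9, 21: 4, 22: 3, 23: 1}

n = sum(age_freq.values())                                   # sample size
sum_x = sum(age * f for age, f in age_freq.items())          # sum of Age x Freq
sum_x2 = sum(age ** 2 * f for age, f in age_freq.items())    # sum of Age^2 x Freq

variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)   # sample variance in years^2
sd = math.sqrt(variance)                         # sample standard deviation in years

print(sum_x2, round(variance, 2), round(sd, 2))
```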

Example 9:
The mean and the variance of the data set 25, 15, 18, 21, 16 are 19 and 16.5, respectively.
a) If each value is increased by 3 units, find the new mean and variance.
The new data will be 28, 18, 21, 24, 19
• The new mean is $\bar{x} = 19 + 3 = 22$
• The new variance is $s^2 = \frac{28^2 + 18^2 + 21^2 + 24^2 + 19^2 - 5 \times 22^2}{5-1} = 16.5$ (adding a constant does not affect the
variance)
b) If each value is multiplied by 2, find the new mean and variance.
The new data will be 50, 30, 36, 42, 32
• The new mean is $\bar{x} = 2 \times 19 = 38$
• The new variance is $s^2 = \frac{50^2 + 30^2 + 36^2 + 42^2 + 32^2 - 5 \times 38^2}{5-1} = 66 = 2^2 \times 16.5$ (multiplying by a constant
multiplies the variance by the square of that constant)
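Both effects can be verified with Python's standard statistics module, whose mean and variance functions compute the sample mean and the sample variance (with divisor $n - 1$). The list names are illustrative.

```python
from statistics import mean, variance

data = [25, 15, 18, 21, 16]
shifted = [x + 3 for x in data]   # each value increased by 3
scaled = [2 * x for x in data]    # each value multiplied by 2

# Adding a constant shifts the mean but leaves the variance unchanged
print(mean(data), variance(data))
print(mean(shifted), variance(shifted))

# Multiplying by a constant c multiplies the mean by c and the variance by c**2
print(mean(scaled), variance(scaled))
```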
IV. Coefficient of Variation

Suppose that a statistician wants to compare the variability of two samples or populations. Whenever the
two data sets have the same units of measure, the variance and standard deviation for each can be compared
directly. But if the two data sets have different units of measure, a statistic called the coefficient of variation
can be used to compare the variability. A larger coefficient of variation means larger variability. Note that the
coefficient of variation has no units of measure.
The coefficient of variation, denoted by CV, is given by

\[ \text{CV} = \frac{\sigma}{\mu} \cdot 100\% \text{ for populations and } \text{CV} = \frac{s}{\bar{x}} \cdot 100\% \text{ for samples} \]

Example 10:
Compare the variability of the following two samples,
              Sample 1   Sample 2
Age           25 years   10 years
Mean weight   70 kg      30 kg
Variance      200 kg²    200 kg²
In this example the units of measure are the same for the two samples, but the nature of the people in the two
samples is different. In sample 1 all people are of age 25, so they are adults, but in sample 2 they are
children, so we must use the coefficient of variation to compare the variability.
• The coefficient of variation for sample 1 is $\text{CV} = \frac{s}{\bar{x}} \cdot 100\% = \frac{\sqrt{200}}{70} \cdot 100\% = 20.20\%$
• The coefficient of variation for sample 2 is $\text{CV} = \frac{\sqrt{200}}{30} \cdot 100\% = 47.14\%$
• Sample 2 has larger variability than sample 1.
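The comparison can be sketched in Python. The function name is illustrative; the standard deviation of each sample is the square root of its variance of 200 kg².

```python
import math

def coefficient_of_variation(sd, mean):
    """Coefficient of variation as a percentage: (standard deviation / mean) * 100."""
    return sd / mean * 100

sd = math.sqrt(200)                        # both samples have variance 200 kg^2
cv1 = coefficient_of_variation(sd, 70)     # sample 1: mean weight 70 kg
cv2 = coefficient_of_variation(sd, 30)     # sample 2: mean weight 30 kg

print(f"CV1 = {cv1:.2f}%, CV2 = {cv2:.2f}%")
```

The same standard deviation is a much larger share of the smaller mean, which is why sample 2 is the more variable one in relative terms.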
PROBLEMS
1. Determine the range, variance, and standard deviation for each of the following samples:
a. 20, 22, 23, 26, 29, 30   $R = 10$, $s^2 = 16$, $s = 4$
b. 120, 122, 123, 126, 126, 130, 134   $R = 14$, $s^2 = 23.48$, $s = 4.85$
c. 40, 44, 44, 46, 52, 52, 58, 60, 61, 65, 70, 72   $R = 32$, $s^2 = 113.52$, $s = 10.65$
2. Given the data below, find the range, variance, and standard deviation:
132, 117, 143, 114, 125, 133, 197, 134, 114, 143, 121, 108, 131, 109, 117, 116, 84, 102, 153, 116, 98,
122, 127, 113, 111, 65, 122, 114
𝑅 = 132, 𝑠 2 = 535.38, 𝑠 = 23.14
3. Find the range, variance, and standard deviation for the following ages:
Age 17 18 19 20 21 22 23
Frequency 2 4 8 11 14 7 3
𝑅 = 6, 𝑠 2 = 2.22, 𝑠 = 1.49

4. A group of students were given two tests, A and B, and the following data were obtained:
Test   Number of students   Mean score   Variance
A      20                   78           120
B      25                   75           16
a. Which of the tests is more highly variable?     Test A
b. If the two groups are combined, find the combined mean.     76.33
c. Find the pooled variance (combined variance) and pooled standard deviation.     61.95, 7.87
5. A sample of adult males produced a mean weight of 154 pounds with a standard deviation of
28 pounds. If the unit of measurement is converted from pounds to kilograms,
find the mean and the variance after the conversion. (Hint: 1 kg = 2.205 pounds)
$\bar{x} = \frac{154}{2.205} = 69.84\ kg$, $s^2 = \left(\frac{28}{2.205}\right)^2 = 161.25\ kg^2$, $s = \frac{28}{2.205} = 12.70\ kg$
6. The mean of the waiting times in an emergency room is 80.2 minutes with a standard deviation of
10.5 minutes for people who are admitted for additional treatment. The mean waiting time for patients
who are discharged after receiving treatment is 120.6 minutes with a standard deviation of 18.3
minutes. Which times are more variable?
The waiting time for patients who are discharged after receiving treatment is more variable.
7. Three sections were given a test and the scores for each section are given below. Compare the
variances for these sections:
[Figure: three frequency bar charts, (a), (b), and (c), each plotting Frequency against Score for scores from 16 to 24]
$s_a^2 = 4.15 < s_b^2 = 6.86 < s_c^2 = 9.03$

Section 3-3: Measures of Position

In addition to measures of central tendency and measures of variation, there are measures of position or
location. These measures include standard score (𝑧-score), percentiles, deciles, and quartiles. They are used
to locate the relative position of a data value in the data set.
I. Standard Score
A standard score or 𝑧-score tells how many standard deviations a data value is above or below the mean for
a specific distribution of values.
The 𝑧-score or standard score for a value, denoted by 𝑧, is obtained as follows:
• The 𝑧-score of a data value $X$ that belongs to a population with mean $\mu$ and standard deviation $\sigma$ is
$z = \frac{X - \mu}{\sigma}$
• The 𝑧-score of a data value $x$ that belongs to a sample with mean $\bar{x}$ and standard deviation $s$ is
$z = \frac{x - \bar{x}}{s}$
Note: If the 𝑧-score is positive, the data value is above the mean. If the 𝑧-score is 0, the data value is the same
as the mean. If the 𝑧-score is negative, the data value is below the mean.
Example 1:
A sample has a mean of 120 and a variance of 225, so $s = \sqrt{225} = 15$. Find:
a) The 𝑧-score for the data value 150
$z = \frac{x - \bar{x}}{s} = \frac{150 - 120}{\sqrt{225}} = 2$; the value 150 is 2 standard deviations above the mean.
b) The 𝑧-score for the data value 114
$z = \frac{x - \bar{x}}{s} = \frac{114 - 120}{\sqrt{225}} = -0.4$; the value 114 is 0.4 standard deviations below the mean.
c) The data value that has a 𝑧-score of 2.6
$z = \frac{x - \bar{x}}{s} = \frac{x - 120}{\sqrt{225}} = 2.6 \rightarrow x = 120 + 2.6 \times \sqrt{225} = 159$
d) The data value that is 1.2 standard deviations below the mean
$z = \frac{x - \bar{x}}{s} = \frac{x - 120}{\sqrt{225}} = -1.2 \rightarrow x = 120 - 1.2 \times \sqrt{225} = 102$
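The four parts of this example use the same formula in two directions: from a data value to its 𝑧-score, and from a 𝑧-score back to a data value. A minimal Python sketch follows; the helper names `z_score` and `value_from_z` are illustrative assumptions.

```python
import math

def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

def value_from_z(z, mean, sd):
    """Data value that corresponds to a given z-score."""
    return mean + z * sd

mean = 120
sd = math.sqrt(225)   # the example gives the variance, so take its square root

print(z_score(150, mean, sd))        # part (a)
print(z_score(114, mean, sd))        # part (b)
print(value_from_z(2.6, mean, sd))   # part (c)
print(value_from_z(-1.2, mean, sd))  # part (d)
```

A common mistake is to divide by the variance (225) instead of the standard deviation (15); taking the square root first avoids it.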

The 𝑧-score can be used to compare the position of two data values from two different data sets. See the
following example.
Example 2:
A student scored 65 on a calculus test that had a mean of 50 and a standard deviation of 10; she scored 30
on a statistics test with a mean of 25 and a standard deviation of 5. Compare her relative positions on the
two tests.
• For calculus, the 𝑧-score is $z = \frac{x - \bar{x}}{s} = \frac{65 - 50}{10} = 1.5$
• For statistics, the 𝑧-score is $z = \frac{x - \bar{x}}{s} = \frac{30 - 25}{5} = 1.0$
• Since the 𝑧-score for calculus is larger, her relative position in the calculus class is higher than her
relative position in the statistics class.
Example 3:
A student scored 36 on test A that had a mean of 40 and a standard deviation of 5; she scored 94 on test B
that had a mean of 100 and a standard deviation of 10. Compare her relative positions on the two tests.
• For test A, the 𝑧-score is $z = \frac{x - \bar{x}}{s} = \frac{36 - 40}{5} = -0.8$
• For test B, the 𝑧-score is $z = \frac{x - \bar{x}}{s} = \frac{94 - 100}{10} = -0.6$
• Since $-0.6 > -0.8$, the score for test B is relatively higher than the score for test A.
Note: When all data for a variable are transformed into 𝑧-scores, the resulting distribution will have a mean
of 0 and a standard deviation of 1. See the following example.
Example 4:
Given the following sample: 8, 10, 12, 8, 12
a) Find the mean and the standard deviation of the given sample.
$\bar{x} = 10$ and $s = 2$
b) Evaluate the 𝑧-score for each value in the sample.

The 𝑧-scores are −1, 0, 1, −1, 1
c) Find the mean and the standard deviation of the resulting 𝑧-scores.
The mean of the resulting 𝑧-scores is $\bar{z} = 0$ and the standard deviation is $s_z = 1$.
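The claim that standardized data have mean 0 and standard deviation 1 can be verified directly for this sample. The sketch below uses Python's standard `statistics` module; the variable names are illustrative.

```python
import statistics

sample = [8, 10, 12, 8, 12]

mean = statistics.mean(sample)
sd = statistics.stdev(sample)   # sample standard deviation (n - 1 in the denominator)

# Standardize every value with z = (x - mean) / sd
z_scores = [(x - mean) / sd for x in sample]

z_mean = statistics.mean(z_scores)
z_sd = statistics.stdev(z_scores)

print("z-scores:", z_scores)
print("mean of z-scores:", z_mean)
print("standard deviation of z-scores:", z_sd)
```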

II. Percentiles
Percentiles are position measures used in educational and health-related fields to indicate the position of an
individual in a group.
Percentiles divide the data set into 100 equal parts, and they are symbolized by $P_1, P_2, \ldots, P_{99}$.
As shown in the graph above, 1% of the data values are less than the first percentile, $P_1$, while 99% are
greater than $P_1$. Similarly, 2% of the data values are less than the second percentile, $P_2$, while 98% are
greater than $P_2$, and so on. It is clear that $P_1 \le P_2 \le \cdots \le P_{99}$.
Example 5:
A data set of size 500 has the following percentiles: $P_{20} = 104$, $P_{60} = 127$, $P_{85} = 143$. Find the number of
data values that are:
a) Less than 104.

$500 \times 0.20 = 100$ values.
b) Greater than 143.

$500 \times (1 - 0.85) = 75$ values.
c) Between 104 and 127.

$500 \times (0.60 - 0.20) = 200$ values.
d) Between 104 and 143.

$500 \times (0.85 - 0.20) = 325$ values.

How to evaluate the 𝑘th percentile, $P_k$, for a data set containing $n$ observations:
• Arrange the data set from the smallest to the largest value: $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$
• The position of $P_k$ is $l = \frac{k}{100} \times n$
• If $l$ is an integer, then $P_k = \frac{1}{2}\left(x_{(l)} + x_{(l+1)}\right)$
• If $l$ is not an integer, then round $l$ up to the next integer, say $L$, and the 𝑘th percentile will be $P_k = x_{(L)}$.
Example 6:
Twenty-four patients admitted to a hospital are tested for levels of blood sugar with the results:
87, 51, 83, 67, 78, 77, 69, 76, 68, 85, 84, 85,
77, 70, 68, 80, 74, 79, 66, 85, 73, 75, 78, 81.
a) Find the 20th percentile.
To find $P_{20}$:
• Arrange the data values:
Position    1   2   3   4   5   6   7   8   9  10  11  12
Value      51  66  67  68  68  69  70  73  74  75  76  77
Position   13  14  15  16  17  18  19  20  21  22  23  24
Value      77  78  78  79  80  81  83  84  85  85  85  87
• Find the position of $P_{20}$: $l = \frac{k}{100} \times n = \frac{20}{100} \times 24 = 4.8$
• Since $l$ is not an integer, round it up to 5.
• Then $P_{20} = x_{(5)} = 68$
b) Find the 75th percentile.
To find $P_{75}$:
• Find the position of $P_{75}$: $l = \frac{75}{100} \times 24 = 18$
• Since $l$ is an integer, $P_{75} = \frac{1}{2}\left(x_{(18)} + x_{(19)}\right) = \frac{81 + 83}{2} = 82$
c) Find the 85th percentile.
To find $P_{85}$:
• Find the position of $P_{85}$: $l = \frac{85}{100} \times 24 = 20.4$
• Round $l$ up to 21.
• Then $P_{85} = x_{(21)} = 85$

III. Quartiles
The quartiles are 3 values, $Q_1$, $Q_2$, and $Q_3$, that divide an ordered data set into 4 equal parts, each
part containing about 25% of the observations.
• The first quartile is denoted by $Q_1$ and it is equal to the 25th percentile $P_{25}$
• The second quartile is denoted by $Q_2$ and it is equal to the 50th percentile $P_{50} = MD$ (the median)
• The third quartile is denoted by $Q_3$ and it is equal to the 75th percentile $P_{75}$
Example 7:
Find the three quartiles for the data in Example 6.
a) To find the 1st quartile:
• Find the position of $Q_1 = P_{25}$: $l = \frac{25}{100} \times 24 = 6$
• Then $Q_1 = \frac{1}{2}\left(x_{(6)} + x_{(7)}\right) = \frac{69 + 70}{2} = 69.5$
b) To find the 2nd quartile:
• Find the position of $Q_2 = MD = P_{50}$: $l = \frac{50}{100} \times 24 = 12$
• Then $Q_2 = \frac{x_{(12)} + x_{(13)}}{2} = \frac{77 + 77}{2} = 77$
c) The third quartile is $Q_3 = P_{75} = 82$, as shown in the previous example.
IV. Inter-Quartile Range
The inter-quartile range, denoted by $IQR$, is the difference between the third and first
quartiles:
$IQR = Q_3 - Q_1 = P_{75} - P_{25}$
Relatively large values of IQR indicate relatively large variability; smaller values indicate less variability.
Note: The $IQR$ is used to measure the variability when the data set contains outliers or when the
distribution is badly skewed.

Example 8:
Find the inter-quartile range for the data in Example 6.
$IQR = Q_3 - Q_1 = 82 - 69.5 = 12.5$
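The percentile rule used in Examples 6 to 8 can be written as a short function. The sketch below is one possible Python rendering of that rule; the name `percentile` is an assumption, and the position $l = kn/100$ is computed with integer arithmetic so that the integer test is exact. Library routines such as those in NumPy use other interpolation conventions and may return slightly different values.

```python
def percentile(data, k):
    """k-th percentile (integer k from 1 to 99) using the rule in these notes."""
    ordered = sorted(data)
    n = len(ordered)
    whole, remainder = divmod(k * n, 100)   # position l = k*n/100, kept exact
    if remainder == 0:
        # l is an integer: average x_(l) and x_(l+1); Python lists are 0-based
        return (ordered[whole - 1] + ordered[whole]) / 2
    # l is not an integer: round up to L = whole + 1 and take x_(L)
    return ordered[whole]

blood_sugar = [87, 51, 83, 67, 78, 77, 69, 76, 68, 85, 84, 85,
               77, 70, 68, 80, 74, 79, 66, 85, 73, 75, 78, 81]

q1 = percentile(blood_sugar, 25)
q2 = percentile(blood_sugar, 50)
q3 = percentile(blood_sugar, 75)
iqr = q3 - q1

print("P20 =", percentile(blood_sugar, 20))
print("P85 =", percentile(blood_sugar, 85))
print("Q1, Q2, Q3 =", q1, q2, q3)
print("IQR =", iqr)
```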
V. Outliers
A data set should be checked for extremely high or extremely low values. These values are called outliers.
There are many methods to check a data set for outliers. One of them will be shown in this section.
How to check a data set for outliers:
1. Find the 1st and 3rd quartiles.
2. Evaluate $IQR$.
3. Find the inner fences $a_1$ and $a_2$, where $a_1 = Q_1 - 1.5\,IQR$ and $a_2 = Q_3 + 1.5\,IQR$.
4. An observation $x$ is considered an outlier if $x < a_1$ or $x > a_2$.
Example 9:
Check the following data set for outliers:
1.34 1.69 1.78 1.89 2.03 2.16 2.27 2.34 2.39 2.88
2.92 3.13 3.36 3.57 3.67 3.83 5.44 8.63 11.82
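1. With $n = 19$ ordered values, the positions of $Q_1$ and $Q_3$ are $0.25 \times 19 = 4.75$ and $0.75 \times 19 = 14.25$, which round up to 5 and 15, so $Q_1 = 2.03$ and $Q_3 = 3.67$.
2. The inter-quartile range is $IQR = 3.67 - 2.03 = 1.64$
3. The inner fences are $a_1 = 2.03 - 1.5 \times 1.64 = -0.43$ and $a_2 = 3.67 + 1.5 \times 1.64 = 6.13$
4. There are two outliers: 8.63 and 11.82.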
Note: Not all data sets have outliers.
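The four-step outlier check can be sketched in Python as follows. The function names are illustrative assumptions, and the quartiles are computed with the percentile rule of these notes (round a non-integer position up, average two neighbours for an integer position).

```python
def percentile(data, k):
    """k-th percentile (integer k from 1 to 99) using the rule in these notes."""
    ordered = sorted(data)
    whole, remainder = divmod(k * len(ordered), 100)
    if remainder == 0:
        return (ordered[whole - 1] + ordered[whole]) / 2
    return ordered[whole]

def find_outliers(data):
    """Return the inner fences (a1, a2) and the list of outliers."""
    q1, q3 = percentile(data, 25), percentile(data, 75)   # step 1
    iqr = q3 - q1                                         # step 2
    a1, a2 = q1 - 1.5 * iqr, q3 + 1.5 * iqr               # step 3
    outliers = [x for x in data if x < a1 or x > a2]      # step 4
    return a1, a2, outliers

example_9 = [1.34, 1.69, 1.78, 1.89, 2.03, 2.16, 2.27, 2.34, 2.39, 2.88,
             2.92, 3.13, 3.36, 3.57, 3.67, 3.83, 5.44, 8.63, 11.82]

a1, a2, outliers = find_outliers(example_9)
print(f"inner fences: {a1:.2f} and {a2:.2f}")
print("outliers:", outliers)
```

An empty list of outliers simply means that every observation lies between the two fences.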

Example 10:
Twenty-four patients admitted to a hospital are tested for levels of blood sugar with the results:
87, 51, 83, 67, 78, 77, 69, 76, 68, 85, 84, 85,
77, 70, 68, 80, 74, 79, 66, 85, 73, 75, 78, 81.
$Q_1 = 69.5$, $Q_3 = 82$
Check for outliers.
1. The inter-quartile range is $IQR = 82 - 69.5 = 12.5$
2. The inner fences are
$a_1 = 69.5 - 1.5 \times 12.5 = 50.75$ and $a_2 = 82 + 1.5 \times 12.5 = 100.75$
3. There are no data values less than 50.75 or greater than 100.75, so there are no outliers.
Why might outliers occur?
1. The data value may have resulted from a measurement or observational error. Perhaps the
researcher measured the variable incorrectly.
2. The data value may have resulted from a recording error. That is, it may have been written or
typed incorrectly.
3. The data value may have been obtained from a subject that is not in the defined population. For
example, suppose test scores were obtained from a 7th grade class, but a student in that class was
actually in the 6th grade and had special permission to attend the class. This student might have
scored extremely low on that particular exam on that day.
4. The data value might be a legitimate value that occurred by chance.
What to do with outliers?
There are no hard-and-fast rules on what to do with outliers, nor is there complete agreement among
statisticians on ways to identify them. Obviously, if they occurred as a result of an error, an attempt
should be made to correct the error or else the data value should be omitted entirely. When they
occur naturally by chance, the statistician must make a decision about whether to include them in the
data set.
Note: When a distribution is normal or bell-shaped, data values that are beyond 3 standard deviations of the
mean can be considered suspected outliers.

VI. Box Plot
The box plot is a graph used to detect outliers, and it can also be used to determine the direction of skewness.
How to construct the box plot:
• Calculate the three quartiles, $Q_1$, $MD$, and $Q_3$, and the inner fences.
• Detect the outliers (if any).
• The largest and the smallest non-outlier observations are called adjacent values.
• Draw a horizontal line representing the scale of measurements.
• Form a box just above the horizontal line with the left and right ends at $Q_1$ and $Q_3$.
• Draw a vertical line through the box at the location of the median.
• Locate the adjacent values using the scale along the horizontal line, and connect them to the box with
horizontal lines.
• The outliers are marked on the graph with open circles (○) or with stars (∗).
Note: The box plot can be used to describe the shape of a data distribution by looking at the position of
the median line compared to the $Q_1$ and $Q_3$ lines. If the median is close to the middle of the box, the
distribution is approximately symmetric. The distribution is skewed to the right if the median line is
to the left of the centre. The distribution is skewed to the left if the median line is to the right of the
centre.
Example 11:
Construct a box plot for the data set given in Example 9.
There are two outliers on the right of the distribution, which makes the distribution skewed to right
(positively skewed)
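The numbers needed to draw this box plot can be collected with a short script. The sketch below takes the quartiles quoted in Example 9 as given and only computes the fences, the adjacent values (the whisker ends), and the outliers; drawing the plot itself would need a plotting library, which is not shown.

```python
data = [1.34, 1.69, 1.78, 1.89, 2.03, 2.16, 2.27, 2.34, 2.39, 2.88,
        2.92, 3.13, 3.36, 3.57, 3.67, 3.83, 5.44, 8.63, 11.82]

q1, q3 = 2.03, 3.67   # quartiles quoted in Example 9
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Adjacent values: smallest and largest observations that are not outliers
non_outliers = [x for x in data if lower_fence <= x <= upper_fence]
adjacent_low, adjacent_high = min(non_outliers), max(non_outliers)
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("box from", q1, "to", q3)
print("whiskers reach", adjacent_low, "and", adjacent_high)
print("outliers:", outliers)
```

The right whisker is much longer than the left one and both outliers lie on the right, which agrees with the positive skew noted above.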

PROBLEMS
1. A test (A) of job-related stress was standardized and found to have a mean of 112.6 with a standard
deviation of 13.8. A second test (B) had a mean of 44.6 with a standard deviation of 6.3.
a) Which test is more highly variable? Test B
b) Ali was given the two tests A and B; his scores were $x_A = 119$ and $x_B = 54$. On which test was the
score better?     Test B
2. Given the following sample: 5, 6, 12, 13, 15, 18, 20, 22, 50
a) Evaluate the interquartile range.     8
b) Check for outliers.     50
c) Construct a box plot.
3. A dietitian is interested in comparing the sodium content for real cheese with the sodium content of a
cheese substitute. The data for two random samples are shown. Compare the distributions, using box
plots.
Real cheese              Cheese substitute
310   420    45    40    270   180   250   290
220   240   180    90    130   260   340   310
[Figure: box plots for "Real" and "Substitute" on a vertical scale from 0 to 400]
4. The distribution of a certain data set is symmetric. The first and the second quartiles are 65 and 70
respectively.
a) Find the mean.     70
b) Find the third quartile.     75
c) Find the interquartile range.     10
d) Is the data value 45 an outlier? Why?     Yes, because $45 < Q_1 - 1.5\,IQR = 65 - 15 = 50$

CHAPTER 4: Probability
I. Basic Concepts:
 An experiment is the process by which an observation (or measurement) is obtained
 A random experiment is an experiment, the outcome of which cannot be determined with certainty
before performing the experiment.
 An outcome is the result of a single trial of a random experiment.
For example: measuring the rainfall for the next month, testing for blood type, and testing whether a patient
is diabetic or not are random experiments.
 The set of all possible outcomes is called the sample space and denoted by Ω.
 An event is a subset of the sample space.
Example 1:
The sample spaces of some random experiments are shown below:
a) Test a patient for a blood type

Ω = {A+, A-, B+, B-, AB+, AB-, O+, O-}
b) Test whether a patient is diabetic or not

Ω = {Diabetic, Non-diabetic}
c) Test three patients whether they have cancer or not

Let C: The patient has cancer, C*: The patient does not have cancer
Then Ω = {CCC, CCC*, CC*C, CC*C*, C*CC, C*CC*, C*C*C, C*C*C*}
d) Five pints of blood are stored in a hospital laboratory. It is known that exactly two pints are type O,
but it is not known which ones. Pints are selected one by one until getting a pint of type O.
Let O: The selected pint is type O, O*: The selected pint is not type O
Then Ω = {O, O*O, O*O*O, O*O*O*O}.
Exercise:
For Example 1 Part (d), write the sample space if pints are selected one by one until both type O
pints have been selected.

Example 2:
Suppose the experiment is recording a person's blood type and Rh factor.

Here Ω = {A−, A+, B−, B+, AB−, AB+, O−, O+}
Define the event E: the person can receive blood of type A+. Then E = {A+, AB+}
Example 3:
In the experiment of testing three patients for cancer, define the event A: exactly
two of the three patients do not have cancer.
Here, A = {CC*C*, C*CC*, C*C*C}
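Sample spaces like this one can be listed mechanically. The following Python sketch enumerates the eight outcomes for three patients and then picks out the event A; the labels "C" and "C*" follow Example 1, and the variable names are illustrative.

```python
from itertools import product

# Each patient either has cancer ("C") or does not ("C*")
outcomes = list(product(["C", "C*"], repeat=3))
sample_space = ["".join(outcome) for outcome in outcomes]

# Event A: exactly two of the three patients do not have cancer
event_a = ["".join(outcome) for outcome in outcomes if outcome.count("C*") == 2]

print("sample space:", sample_space)
print("event A:", event_a)
```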
 Equally likely outcomes are outcomes that have the same probability of occurring.
Example 4:
Consider the following random experiment with equally likely outcomes: select 3 people
at random from a population in which 50% are males and 50% are females, and record the gender
of each selected person. Then Ω = {MMM, MMF, MFM, MFF, FMM, FMF, FFM, FFF}. Here, Ω consists of 8
equally likely outcomes.
Exercise:
Give other examples of random experiments whose outcomes are equally likely.
Operations on Sets
• The intersection of events $A$ and $B$, denoted by $A \cap B$, is the event that both $A$ and $B$ occur.
• If $A \cap B = \emptyset$, then $A$ and $B$ are called disjoint or mutually exclusive events, which means the two
events cannot occur together.
• The union of events $A$ and $B$, denoted by $A \cup B$, is the event that $A$ or $B$ or both occur.
• The complement of an event $A$, denoted by $A^c$, consists of all outcomes in Ω that are not in $A$. $A^c$ is the
event that $A$ will not occur.
• De Morgan's laws:
1) $(A \cap B)^c = A^c \cup B^c$     2) $(A \cup B)^c = A^c \cap B^c$

Note: De Morgan's laws can be applied to three or more events.
You can use Venn diagrams to represent these operations.
[Figure: Venn diagrams of two events A and B illustrating $A \cap B$, $A \cup B$, $A^c$, $A \cap B^c$, $A^c \cap B$, $A^c \cap B^c$, and $(A \cap B)^c$]
Example 5:
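Let Ω = {1, 2, 3, 4, 5, 6, 7}, and let $A$ and $B$ be subsets of Ω such that
$A = \{1, 3, 5, 7\}$ and $B = \{1, 2, 3\}$. Find:
1. $A \cap B = \{1, 3\}$
2. $A \cup B = \{1, 2, 3, 5, 7\}$
3. $A^c = \{2, 4, 6\}$
4. $A \cap B^c = \{1, 3, 5, 7\} \cap \{4, 5, 6, 7\} = \{5, 7\}$
5. $A^c \cup B = \{2, 4, 6\} \cup \{1, 2, 3\} = \{1, 2, 3, 4, 6\}$
6. $A \cap \Omega = A$
7. $A \cup \Omega = \Omega$
8. $A \cap \emptyset = \emptyset$
9. $A \cup \emptyset = A$
10. $(A \cup B)^c = \{1, 2, 3, 5, 7\}^c = \{4, 6\}$
11. $A^c \cap B^c = \{2, 4, 6\} \cap \{4, 5, 6, 7\} = \{4, 6\} = (A \cup B)^c$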
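These set operations map directly onto Python's built-in `set` type, which makes it easy to check the answers and De Morgan's laws. This is a small sketch; the variable names (`omega`, `A_c`, `B_c`) are illustrative.

```python
omega = set(range(1, 8))   # sample space {1, ..., 7}
A = {1, 3, 5, 7}
B = {1, 2, 3}

A_c = omega - A            # complement of A
B_c = omega - B            # complement of B

print("A ∩ B   =", A & B)
print("A ∪ B   =", A | B)
print("A ∩ B^c =", A & B_c)
print("A^c ∪ B =", A_c | B)

# De Morgan's laws
print(omega - (A | B) == A_c & B_c)
print(omega - (A & B) == A_c | B_c)
```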

II. Probability Rules:
1. If an experiment has $n$ equally likely outcomes, $r$ of which constitute an event $A$, then the probability that
the event $A$ will occur is $P(A) = \frac{r}{n}$
Example 6:
Given the experiment in Example 4, find the probability of the following events:
a) A: Two of the selected people are males and the other is female.
A = {MMF, MFM, FMM} → $P(A) = \frac{3}{8}$
b) B: At least one of the selected people is female.
B = {MMF, MFM, MFF, FMM, FMF, FFM, FFF} → $P(B) = \frac{7}{8}$
c) C: All of the selected people are males.
C = {MMM} → $P(C) = \frac{1}{8}$
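With equally likely outcomes, a probability is a simple count ratio, so the three answers above can be checked by enumeration. The sketch below uses exact fractions; the helper `probability` is an illustrative name, not part of any library.

```python
from fractions import Fraction
from itertools import product

# All 8 equally likely gender sequences for three selected people
sample_space = ["".join(o) for o in product("MF", repeat=3)]

def probability(event):
    """P(event) = (number of favourable outcomes) / (number of outcomes)."""
    favourable = [o for o in sample_space if event(o)]
    return Fraction(len(favourable), len(sample_space))

p_a = probability(lambda o: o.count("M") == 2)   # two males, one female
p_b = probability(lambda o: "F" in o)            # at least one female
p_c = probability(lambda o: o == "MMM")          # all males

print(p_a, p_b, p_c)
```

Note that events B and C are complements, so their probabilities add up to 1.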
2. The probability that an event 𝐴 will occur is equal to the sum of the probabilities of the outcomes that
make up the event 𝐴
Example 7:
Given Ω = {a, b, c, d} such that P(a) = 0.1, P(b) = 0.2, P(c) = 0.3, and P(d) = 0.4. Find,
a) P(E1), where E1 = {a, c}

P(E1) = P(a) + P(c) = 0.1 + 0.3 = 0.4
b) P(Ω)
P(Ω) = P(a) + P(b) + P(c) + P(d) = 1
c) P(E2), where E2 = {a, b, c}

P(E2) = P(a) + P(b) + P(c) = 0.1 + 0.2 + 0.3 = 0.6 or P(E2) = 1 –P(d) = 1 – 0.4 = 0.6
3. For any event $A$, $0 \le P(A) \le 1$
4. For any sample space Ω, $P(\Omega) = 1$ and $P(\emptyset) = 0$
5. For any event $A$, $P(A^c) = 1 - P(A)$

6. For any events $A$ and $B$, $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
7. If $A$ and $B$ are mutually exclusive, then $P(A \cup B) = P(A) + P(B)$
8. For any events $A$ and $B$, $P(A) = P(A \cap B) + P(A \cap B^c) \rightarrow P(A \cap B^c) = P(A) - P(A \cap B)$
9. If $A \subseteq B$, then $P(A) \le P(B)$
Example 8:
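If $P(A) = 0.6$, $P(B) = 0.7$, and $P(A \cap B) = 0.4$, then
a) $P(A \cup B) = P(A) + P(B) - P(A \cap B) = 0.6 + 0.7 - 0.4 = 0.9$
b) $P(A \cap B^c) = P(A) - P(A \cap B) = 0.6 - 0.4 = 0.2$
c) $P(A^c \cap B) = P(B) - P(A \cap B) = 0.7 - 0.4 = 0.3$
d) $P(A^c \cap B^c) = P((A \cup B)^c) = 1 - P(A \cup B) = 1 - 0.9 = 0.1$
e) $P(A \cup B^c) = P(A) + P(B^c) - P(A \cap B^c) = P(A) + (1 - P(B)) - (P(A) - P(A \cap B)) = 0.7$
[Figure: Venn diagram of A and B with region probabilities $P(A^c \cap B) = 0.3$, $P(A \cap B) = 0.4$, $P(A \cap B^c) = 0.2$, and $P(A^c \cap B^c) = 0.1$]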

Example 9:
Let $A$ and $B$ be two events. Compare the probabilities $P(A)$, $P(A \cap B)$, and $P(A \cup B)$.
Since $A \cap B \subseteq A \subseteq A \cup B$, it follows that $P(A \cap B) \le P(A) \le P(A \cup B)$.
Example 10:
(True or False) There are two disjoint events $A$ and $B$ such that $P(A) = 0.8$ and $P(B) = 0.3$.
If $A$ and $B$ are disjoint, then, using Rule 7, $P(A \cup B) = P(A) + P(B) = 0.8 + 0.3 = 1.1 > 1$.
This result contradicts Rule 3, so the statement is false.
Example 11:
If $P(A) = 0.5$ and $P(A^c \cap B) = 0.2$, find $P(A \cup B)$.
$P(A \cup B) = P(A) + P(B) - P(A \cap B) = P(A) + P(A^c \cap B) = 0.5 + 0.2 = 0.7$
Example 12:
Medical records of a doctor's patients show the following data:
                          Patients who were
Patients who have         Overweight   Not overweight   Total
High blood pressure       121          63               184
Normal blood pressure     52           88               140
Total                     173          151              324
What is the probability that one of the doctor's patients:
a) Is overweight and has high blood pressure.
$P(O \cap H) = \frac{121}{324} = 0.373$
b) Has high blood pressure.
$P(H) = \frac{184}{324} = 0.568$
c) Is not overweight.
$P(O^c) = \frac{151}{324} = 0.466$
d) Has normal blood pressure or is overweight.
$P(N \cup O) = \frac{140}{324} + \frac{173}{324} - \frac{52}{324} = \frac{261}{324} = 0.806$
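Probabilities from a two-way table are just cell counts divided by the grand total. The Python sketch below stores the four cells in a dictionary and sums the cells that satisfy a condition; the key labels and the helper `prob` are illustrative assumptions.

```python
# Cell counts keyed by (blood pressure, weight status)
counts = {
    ("high", "overweight"): 121,
    ("high", "not_overweight"): 63,
    ("normal", "overweight"): 52,
    ("normal", "not_overweight"): 88,
}
total = sum(counts.values())   # 324 patients

def prob(condition):
    """Add up the cells that satisfy the condition and divide by the total."""
    return sum(c for (bp, weight), c in counts.items() if condition(bp, weight)) / total

p_o_and_h = prob(lambda bp, w: bp == "high" and w == "overweight")
p_h = prob(lambda bp, w: bp == "high")
p_not_o = prob(lambda bp, w: w == "not_overweight")
p_n_or_o = prob(lambda bp, w: bp == "normal" or w == "overweight")

print(round(p_o_and_h, 3), round(p_h, 3), round(p_not_o, 3), round(p_n_or_o, 3))
```

Summing the qualifying cells counts each patient once, which is why no subtraction of the overlap is needed for the "or" event here.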

Example 13:
Suppose that the probability that Ali will get an A in biostatistics is about 0.4, the probability that Omar will get an A is about
0.35, and, because they studied together, the probability that both Ali and Omar will get an A is 0.2.
Determine the probability that:
a) Either Ali or Omar will get an A
First define the following events with their probabilities:
• Event $A$: Ali will get an A. $P(A) = 0.4$
• Event $O$: Omar will get an A. $P(O) = 0.35$
• Event $A \cap O$: Both Ali and Omar will get an A. $P(A \cap O) = 0.2$
Now the event “Either Ali or Omar will get an A” can be denoted by $A \cup O$,
and $P(A \cup O) = P(A) + P(O) - P(A \cap O) = 0.4 + 0.35 - 0.2 = 0.55$
b) At least one of them will get an A
$P(A \cup O) = P(A) + P(O) - P(A \cap O) = 0.55$
c) Ali will get an A but Omar will not
$P(A \cap O^c) = P(A) - P(A \cap O) = 0.4 - 0.2 = 0.2$
d) Only one of them will get an A
$P(A \cap O^c) + P(A^c \cap O) = (P(A) - P(A \cap O)) + (P(O) - P(A \cap O)) = 0.2 + 0.15 = 0.35$
e) None of them will get an A
$P(A^c \cap O^c) = P((A \cup O)^c) = 1 - P(A \cup O) = 1 - 0.55 = 0.45$
f) At most one of them will get an A
This event is equivalent to “Not both of them will get an A”.
Now, $P((A \cap O)^c) = 1 - P(A \cap O) = 0.8$
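All six answers follow from the three given probabilities and the rules of Section II. The sketch below redoes the arithmetic with exact fractions to avoid rounding noise; the variable names are illustrative.

```python
from fractions import Fraction

p_a = Fraction("0.4")      # Ali gets an A
p_o = Fraction("0.35")     # Omar gets an A
p_both = Fraction("0.2")   # both get an A

p_union = p_a + p_o - p_both       # at least one (addition rule)
p_ali_only = p_a - p_both          # Ali but not Omar
p_omar_only = p_o - p_both         # Omar but not Ali
p_exactly_one = p_ali_only + p_omar_only
p_neither = 1 - p_union            # De Morgan plus complement rule
p_at_most_one = 1 - p_both         # complement of "both"

for name, value in [("at least one", p_union), ("Ali only", p_ali_only),
                    ("exactly one", p_exactly_one), ("neither", p_neither),
                    ("at most one", p_at_most_one)]:
    print(name, float(value))
```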

Example 14:
Among a group of children, 35% suffer from cognitive dissonance as a result of school experiences and
40% have a distorted sense of reality as measured by a standard measuring device. In a group of 200
children for which these percentages hold, a total of 32 have both cognitive dissonance and distorted
sense of reality. How many of these children are free from both problems?
First define the following events with their probabilities:
• Event $C$: Child suffers from cognitive dissonance. $P(C) = 0.35$
• Event $D$: Child has a distorted sense of reality. $P(D) = 0.40$
• Event $C \cap D$: Child suffers from both problems. $P(C \cap D) = \frac{32}{200} = 0.16$
• Event $C^c \cap D^c$: Child is free from both problems.
Now,
$P(C^c \cap D^c) = P((C \cup D)^c) = 1 - P(C \cup D) = 1 - (0.35 + 0.40 - 0.16) = 0.41$
The number of children that are free from both problems is $200 \times 0.41 = 82$.
Exercise:
Among a group of children, 35% suffer from cognitive dissonance only and 40% have a distorted sense of
reality only. In a group of 200 children for which these percentages hold, a total of 32 have both cognitive
dissonance and distorted sense of reality. How many of these children are free from both problems?
III. Conditional Probability Rule:
The conditional probability of an event $A$ given an event $B$ is defined as the probability that event
$A$ occurs given that event $B$ has already occurred. It can be evaluated using the following formula:
$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$, $P(B) \neq 0$

Example 15:
A class consists of 10 girls and 15 boys. 2 girls and 6 boys are left-handed. One student is selected at
random from this class. Determine the probability that the selected student is:
a) A girl → $P(G) = \frac{10}{10 + 15} = 0.4$
b) A girl and left-handed → $P(G \cap L) = \frac{2}{25} = 0.08$
c) Left-handed → $P(L) = \frac{2 + 6}{25} = 0.32$
d) Left-handed if she is a girl → $P(L \mid G) = \frac{2}{10} = 0.2$ or $P(L \mid G) = \frac{P(L \cap G)}{P(G)} = \frac{2/25}{10/25} = \frac{2}{10} = 0.2$
e) A boy if he is left-handed → $P(B \mid L) = \frac{6}{2 + 6} = 0.75$ or $P(B \mid L) = \frac{P(L \cap B)}{P(L)} = \frac{6/25}{8/25} = \frac{6}{8} = 0.75$
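Parts (d) and (e) show that counting within the reduced group and applying the formula give the same answer. A minimal Python sketch with exact fractions follows; the helper name `conditional` is an assumption made for illustration.

```python
from fractions import Fraction

girls, boys = 10, 15
left_girls, left_boys = 2, 6
n = girls + boys

p_g = Fraction(girls, n)                    # P(G)
p_l = Fraction(left_girls + left_boys, n)   # P(L)
p_g_and_l = Fraction(left_girls, n)         # P(G and L)
p_b_and_l = Fraction(left_boys, n)          # P(B and L)

def conditional(p_joint, p_condition):
    """P(A | B) = P(A and B) / P(B), defined only when P(B) is not 0."""
    if p_condition == 0:
        raise ValueError("conditioning event has probability 0")
    return p_joint / p_condition

p_l_given_g = conditional(p_g_and_l, p_g)   # left-handed, given a girl
p_b_given_l = conditional(p_b_and_l, p_l)   # boy, given left-handed

print(p_l_given_g, p_b_given_l)
```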
Example 16:
According to Example 13, what is the probability that
a) Omar will get an A if Ali gets an A → $P(O \mid A) = \frac{P(O \cap A)}{P(A)} = \frac{0.2}{0.4} = 0.5$
b) Omar will get an A if Ali does not get an A → $P(O \mid A^c) = \frac{P(O \cap A^c)}{P(A^c)} = \frac{0.35 - 0.2}{1 - 0.4} = \frac{0.15}{0.6} = 0.25$
Example 17:
According to Example 12, what is the probability that the selected patient is:
a) Overweight if he or she has high blood pressure → $P(O \mid H) = \frac{121}{184} = 0.658$
b) Not overweight if he or she has high blood pressure → $P(O^c \mid H) = \frac{63}{184} = 0.342$

Example 18:
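Let $A$ and $B$ be two events from a sample space Ω such that $P(A) = 0.6$, $P(B) = 0.7$, and $P(A \cap B) = 0.4$.
Find:
a) $P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{0.4}{0.7} = \frac{4}{7}$
b) $P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{0.4}{0.6} = \frac{2}{3}$. Note that in general $P(A \mid B) \neq P(B \mid A)$.
c) $P(A \mid \Omega) = \frac{P(A \cap \Omega)}{P(\Omega)} = \frac{P(A)}{1} = P(A) = 0.6$
d) $P(\Omega \mid A) = \frac{P(A \cap \Omega)}{P(A)} = \frac{P(A)}{P(A)} = 1$
e) $P(\emptyset \mid A) = \frac{P(A \cap \emptyset)}{P(A)} = \frac{0}{P(A)} = 0$
f) $P(A \mid A) = \frac{P(A \cap A)}{P(A)} = \frac{P(A)}{P(A)} = 1$
g) $P(A^c \mid B) = \frac{P(A^c \cap B)}{P(B)} = \frac{0.7 - 0.4}{0.7} = \frac{3}{7}$. Notice that $P(A^c \mid B) = 1 - P(A \mid B)$.
h) $P(B^c \mid B) = \frac{P(B^c \cap B)}{P(B)} = \frac{P(\emptyset)}{P(B)} = 0$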
IV. Multiplication Rule of Probability
The multiplication rule can be used to find the probability of two or more events that occur in sequence. For
example, if you select three students from a class of 40 students, you can find the probability that the three
students are left-handed.
The probability that both events $A$ and $B$ occur is $P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$
Example 19:
Let $A$ and $B$ be two events such that $P(A \mid B) = 0.4$, $P(A \mid B^c) = 0.3$ and $P(B) = 0.6$. Find:
a) $P(A \cap B) = P(A \mid B)\,P(B) = 0.4 \times 0.6 = 0.24$
b) $P(A \cap B^c) = P(A \mid B^c)\,P(B^c) = 0.3 \times (1 - 0.6) = 0.12$
c) $P(A) = P(A \cap B) + P(A \cap B^c) = 0.24 + 0.12 = 0.36$
d) $P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{0.24}{0.36} = \frac{2}{3} = 0.667$

Example 20:
Suppose that there are 5 smokers in a group of 20 men. Two men are to be selected at random, one
by one, from this group. Find the probability that:
a) The first selected man is a smoker
$P(A) = \frac{5}{20}$, where $A$: The first selected man is a smoker
b) The second is a smoker if the first man is a smoker
$P(B \mid A) = \frac{4}{19}$, where $B$: The second selected man is a smoker
c) The two men are smokers (the first is a smoker and the second is a smoker)
$P(A \cap B) = P(B \mid A)\,P(A) = \frac{4}{19} \times \frac{5}{20} = \frac{1}{19}$
d) None of them is a smoker
$P(A^c \cap B^c) = P(B^c \mid A^c)\,P(A^c) = \frac{14}{19} \times \frac{15}{20} = \frac{21}{38}$
Notice that:
1. The sample space can be determined from the tree diagram: $\Omega = \{SS, SS^c, S^cS, S^cS^c\}$
2. We can find the probability of each outcome in the sample space using the multiplication rule.
3. We can find the probability of any event related to the above experiment. For example, the
probability that at least one of the two men is a smoker equals $P(\{SS, SS^c, S^cS\}) = \frac{4}{76} + \frac{15}{76} + \frac{15}{76} = \frac{34}{76}$
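Because the men are drawn without replacement, the multiplication rule can be cross-checked by brute force: list every ordered pair of distinct men and count. The Python sketch below does this with exact fractions; the labels "S" and "N" and the helper `prob` are illustrative assumptions.

```python
from fractions import Fraction
from itertools import permutations

men = ["S"] * 5 + ["N"] * 15   # S = smoker, N = non-smoker

# Every ordered (first, second) selection of two different men is equally likely
ordered_pairs = list(permutations(range(len(men)), 2))   # 20 * 19 = 380 pairs

def prob(event):
    hits = sum(1 for i, j in ordered_pairs if event(men[i], men[j]))
    return Fraction(hits, len(ordered_pairs))

p_both = prob(lambda first, second: first == "S" and second == "S")
p_none = prob(lambda first, second: first == "N" and second == "N")
p_at_least_one = 1 - p_none

print(p_both, p_none, p_at_least_one)
```

The value of `p_at_least_one` reduces to 17/38, the same number as 34/76 in the note above.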

Example 21:
Suppose that 60% of the students in a class are boys, and suppose that 30% of the boys and 25% of the girls wear
eyeglasses. One student is selected at random from this class. Determine the probability that:
a) The selected student wears eyeglasses if he is a boy.
$P(W \mid B) = 0.3$
b) The selected student is a boy and wears eyeglasses.
$P(W \cap B) = P(W \mid B)\,P(B) = 0.3 \times 0.6 = 0.18$
c) The selected student wears eyeglasses.
$P(W) = P(W \cap B) + P(W \cap G) = P(W \mid B)\,P(B) + P(W \mid G)\,P(G) = 0.3 \times 0.6 + 0.25 \times 0.4 = 0.28$
d) The selected student is a girl if she wears eyeglasses.
$P(G \mid W) = \frac{P(W \cap G)}{P(W)} = \frac{0.25 \times 0.4}{0.28} = \frac{10}{28} = 0.357$
e) The selected student is a boy or does not wear eyeglasses.
$P(B \cup W^c) = P(B) + P(W^c) - P(B \cap W^c) = 0.6 + 0.72 - 0.42 = 0.9$
You can use a tree diagram as follows:
First branch    Second branch          Outcome       Probability
Boy (0.6)       Wears (0.3)            B and W       0.6 × 0.3 = 0.18
Boy (0.6)       Does not wear (0.7)    B and W^c     0.6 × 0.7 = 0.42
Girl (0.4)      Wears (0.25)           G and W       0.4 × 0.25 = 0.10
Girl (0.4)      Does not wear (0.75)   G and W^c     0.4 × 0.75 = 0.30
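The tree diagram amounts to multiplying along each branch and then adding the branches that belong to an event. The sketch below repeats parts (b) to (e) in Python with exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

p_b = Fraction("0.6")            # P(boy)
p_g = 1 - p_b                    # P(girl)
p_w_given_b = Fraction("0.3")    # P(wears | boy)
p_w_given_g = Fraction("0.25")   # P(wears | girl)

# Multiplication rule along each branch of the tree
p_w_and_b = p_w_given_b * p_b
p_w_and_g = p_w_given_g * p_g

# Add the two branches that end in "wears" to get P(W)
p_w = p_w_and_b + p_w_and_g

# Reverse the conditioning: P(girl | wears)
p_g_given_w = p_w_and_g / p_w

# Addition rule: P(B or W^c) = P(B) + P(W^c) - P(B and W^c)
p_b_or_not_w = p_b + (1 - p_w) - (p_b - p_w_and_b)

print(float(p_w), float(p_g_given_w), float(p_b_or_not_w))
```

The exact value of `p_g_given_w` is 5/14, which is about 0.357.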
V. Independent Events
• When the outcome or occurrence of the first event affects the outcome or occurrence of the second
event in such a way that the probability is changed, the events are said to be dependent events.
• When the outcome or occurrence of the first event does not affect the outcome or occurrence of the
second event, so that the probability remains the same, the events are said to be independent
events.
• If two events $A$ and $B$ are independent, then $P(A \mid B) = P(A)$ and $P(B \mid A) = P(B)$.
• If two events $A$ and $B$ are independent, then $P(A \cap B) = P(A)\,P(B)$.
• Disjoint events are not necessarily independent. In fact, two disjoint events that both have positive
probability are dependent, because $P(A \cap B) = 0 \neq P(A)\,P(B)$.

Example 22:
Let $A$ and $B$ be two independent events such that $P(A) = 0.4$ and $P(B) = 0.7$. Find:
a) $P(A \mid B) = P(A) = 0.4$
b) $P(A \cap B) = P(A)\,P(B) = 0.4 \times 0.7 = 0.28$
c) $P(A \mid B^c) = \frac{P(A \cap B^c)}{P(B^c)} = \frac{P(A) - P(A \cap B)}{1 - P(B)} = \frac{0.4 - 0.28}{1 - 0.7} = 0.4 = P(A)$.
Note that $A$ and $B^c$ are independent; also $A^c$ and $B$ are independent, and $A^c$ and $B^c$ are independent.
d) $P(A \cup B) = 0.4 + 0.7 - 0.28 = 0.82$
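The note in part (c), that independence of $A$ and $B$ carries over to their complements, can be checked numerically for this example. The sketch below uses exact fractions so that the equality test is reliable; the helper name `independent` is an assumption.

```python
from fractions import Fraction

p_a, p_b = Fraction("0.4"), Fraction("0.7")
p_ab = p_a * p_b   # independence: P(A and B) = P(A) P(B)

def independent(p_x, p_y, p_xy):
    """Two events are independent when P(X and Y) equals P(X) P(Y)."""
    return p_xy == p_x * p_y

p_a_bc = p_a - p_ab                 # P(A and B^c)
p_ac_b = p_b - p_ab                 # P(A^c and B)
p_ac_bc = 1 - (p_a + p_b - p_ab)    # P(A^c and B^c) = 1 - P(A or B)

print(independent(p_a, 1 - p_b, p_a_bc))
print(independent(1 - p_a, p_b, p_ac_b))
print(independent(1 - p_a, 1 - p_b, p_ac_bc))
```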
Example 23:
Approximately 10% of men have a type of color blindness that prevents them from distinguishing between red
and green. If 3 men are selected at random, find the probability that:
a) All of them will have this type of color blindness.
$P(BBB) = 0.1 \times 0.1 \times 0.1 = 0.001$
b) None of them will have this type of color blindness.
$P(B^cB^cB^c) = 0.9 \times 0.9 \times 0.9 = 0.729$
c) At least one of them will have this type of color blindness.
$1 - P(B^cB^cB^c) = 1 - 0.729 = 0.271$
d) Exactly one of them will have this type of color blindness.
$P(BB^cB^c) + P(B^cBB^c) + P(B^cB^cB)$
$= 0.1 \times 0.9 \times 0.9 + 0.9 \times 0.1 \times 0.9 + 0.9 \times 0.9 \times 0.1 = 3(0.1 \times 0.9 \times 0.9) = 0.243$
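Since the three men are selected independently, the probability of each sequence is the product of the individual probabilities, and any event is a sum over sequences. The following Python sketch enumerates all eight sequences; the labels "B" (color blind) and "N" (not color blind) are illustrative.

```python
from itertools import product
from math import prod

p_blind = 0.1

# Probability of every sequence such as ("B", "N", "N"), using independence
sequence_probs = {
    seq: prod(p_blind if s == "B" else 1 - p_blind for s in seq)
    for seq in product("BN", repeat=3)
}

def prob(event):
    """Add the probabilities of all sequences that satisfy the event."""
    return sum(p for seq, p in sequence_probs.items() if event(seq))

p_all = prob(lambda seq: seq.count("B") == 3)
p_none = prob(lambda seq: seq.count("B") == 0)
p_at_least_one = 1 - p_none
p_exactly_one = prob(lambda seq: seq.count("B") == 1)

print(round(p_all, 3), round(p_none, 3), round(p_at_least_one, 3), round(p_exactly_one, 3))
```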

Exercise:
A group of 100 students is classified according to their scores on a certain test and their sections.
Complete the following table so that the two events "The student's score is high" and "The student is
from section 1" are independent.
                    Score
Section       High ($H$)   Not high ($H^c$)   Total
1 ($S_1$)     ____         ____               40
2 ($S_2$)     ____         ____               ____
Total         20           ____               100
PROBLEMS
1. A group of students consists of 3 boys, Ali, Omar and Khalid, and 2 girls, Rana and Huda. One student will
be selected at random from this group.
a) What is the sample space? Ω ={ Ali, Omar, Khalid, Rana, Huda}
b) Find the probability that:
1. Omar will be selected 0.2
2. Omar or Huda will be selected 0.4
3. A boy will be selected 0.6
4. A boy or girl will be selected 1.0
5. Ali will not be selected 0.8
2. A group of students consists of 3 boys, Ali, Omar and Khalid, and 2 girls, Rana and Huda. Two students
will be selected at random from this group.
a) What is the sample space? Ω = {(A,O), (A,K), (A,R), (A,H), (O,K), (O,R), (O,H), (K,R), (K,H), (R,H)}
b) Find the probability that:
1. Omar will be selected 0.4
2. Omar and Huda will be selected 0.1
3. Two boys will be selected 0.3
4. At least one boy will be selected 0.9
5. A boy and a girl will be selected 0.6
6. Two boys or two girls will be selected 0.4
7. Ali will be selected if Omar has been selected 0.25

3. In a genetics experiment, the researcher mated two Drosophila fruit flies and observed the traits of 300
offspring. The results are shown in the following table:
                  Wing size
Eye color     Normal   Miniature
Normal        140      6
Vermillion    3        151
One of these offspring is randomly selected and observed for the two genetic traits.
What is the probability that the fly has:
a) Normal eye color and normal wing size?     140/300
b) Vermillion eyes?     154/300
c) Either vermillion eyes or miniature wings?     160/300
d) Normal eye color if it has normal wing size?     140/143
4. Out of 100 applicants for a certain job, 70 have some experience, 28 are over 40 years old, and 65 are
men. The distribution of applicants over these three factors is shown here:
            Experience               No experience
            Over 40    Under 40      Over 40    Under 40
Men         15         40            3          7
Women       5          10            5          15
One person is chosen at random from the 100. Find the probability that the selected person is:
a) Over 40 years old     0.28
b) A man with some experience     0.55
c) Under 40 or a woman     0.82
d) A woman if the person has no experience     0.667
5. Let $A$ and $B$ be two events such that $P(A) = 0.6$, $P(B) = 0.8$ and $P(A \cap B) = 0.5$. Answer the following:
a) Find $P(A \cup B)$     0.9
b) Find $P(A \cap B^c)$     0.1
c) Find $P(A \cup B^c)$     0.7
d) Find $P(A^c \cap B^c)$     0.1
e) Find $P(A \mid B)$     0.625
f) Find $P(A^c \mid B)$     0.375
g) Find $P(B \mid A)$     0.833
h) Find $P(A \mid B^c)$     0.5
i) Are $A$ and $B$ mutually exclusive?     No
j) Are $A$ and $B$ independent events?     No

6. Suppose that 20% of adults in a large population are smokers. A random sample of three persons is
selected, find the probability that the sample will contain:
a) Two nonsmokers and one smoker. 0.384

b) At most two smokers. 0.992
c) No smokers. 0.512
d) Smokers and nonsmokers. 0.480
7. Suppose that 30% of adults in a certain population are smokers, 20% of smokers have lung cancer and
10% of nonsmokers have lung cancer. One person is selected at random from this population, find the
probability that the selected person:
a) has lung cancer if he is smoker. 0.2

b) is smoker and has lung cancer. 0.06
c) has lung cancer. 0.13
d) is smoker or has lung cancer. 0.37
e) is smoker if he has lung cancer. 0.462
8. A survey of people in a given region showed that 20% were smokers. The probability of death due to
lung cancer, given that a person smoked, was roughly ten times the probability of death due to lung
cancer, given that a person did not smoke. If the probability of death due to lung cancer in the region is
0.007
a) What is the probability of death due to lung cancer given that a person is a smoker?
0.025
b) If a person died due to lung cancer, what is the probability that the person was smoker?
0.714
9. In a group of 200 students, 160 take an English course, 120 take a mathematics course, and 16 take
neither. Are the two events “takes an English course” and “takes a mathematics course” independent?
Yes
See the following from the textbook (Chapter 4)

Examples: 1, 3, 6, 8, 9, 10, 11, 13, 14, 19, 21, 23, 25, 26, 27, 28, 34, 36
Exercises:
Section 4-1: 10, 12, 13, 15, 17, 18, 19, 20, 21, 24, 28, 35, 39
Section 4-2: 2, 3, 7, 11, 14, 19, 22
Section 4-3: 5, 7, 9, 11, 17, 18, 22, 23, 33, 34, 38, 46
BIOSTATISTICS
CHAPTER 5: Discrete Probability Distributions
Section 5-1: Probability Distributions

One of the central ideas in probability and statistics is that of a probability distribution which describes how
probabilities in a sample space are distributed. In Chapter 4, we studied probabilities for a number of
different variables. The most important variables in statistics are known as random variables.
Random Variables
In Chapter 1, a variable was defined as a quantity whose value is not fixed. A variable that takes on numerical
values is a numerical variable. A numerical variable whose value is determined by chance is called a random
variable. A random variable can be defined as follows:
A variable that assumes a unique numerical value for each of the outcomes in the sample space of a
random experiment is called a random variable.
Example 1:
If a random sample of 5 patients is selected from a hospital, a letter such as 𝑋 can be used to represent the
number of diabetic patients in the sample. The values that 𝑋 can assume are then 0, 1, 2, 3, 4, or 5. The set of
all possible values of 𝑋 is called the space of 𝑋 and is denoted by $\Omega_X$.
Example 2:
If 𝑇 is the time it takes a student to finish a one-hour exam, then 𝑇 is a random variable with possible values
from 15 minutes to 60 minutes.

Types of Random Variables

There are two types of random variables: discrete and continuous.
DISCRETE RANDOM VARIABLE
When the space of a random variable 𝑋, $\Omega_X$, is a finite or countable set, 𝑋 is called a discrete random
variable. In general, discrete random variables are obtained from data that can be counted.
Example 3:
The following are discrete random variables:
1. The number of children a family has.

2. The number of times a person visits a doctor.
3. The number of courses a college student is taking.
4. The number of infected olive trees in a field.
5. The number of left-handed students in a class.
6. The sum of the two result numbers if you roll a die twice.
CONTINUOUS RANDOM VARIABLE
When the space of a random variable 𝑋, $\Omega_X$, is an interval of real numbers, 𝑋 is called a continuous
random variable. In general, continuous random variables are obtained from data that can
be measured rather than counted.
Example 4:
The following are continuous random variables:
1. The barometric pressure at 12:00 PM in a city.

2. The weight in grams of a newborn baby.
3. The length of time it takes to complete a statistics exam.
4. The blood pressure of a patient admitted to a certain hospital.
5. The amount of water in liters a person drinks in a day.

Discrete Probability Distributions

In this chapter we will discuss the probability distribution for discrete random variables only. Chapter 6
explains continuous random variables and their distributions.
A discrete probability distribution of a random variable $X$, denoted by $P(x)$, is a formula, graph, or table
that consists of the values $X$ can assume along with their probabilities. The probabilities are determined
theoretically or by observation.
Example 5:
Suppose that 10% of men in a population are color blind. A sample of 3 men is selected at random and
each one of them is checked for color blindness.
a) Write the sample space for the given random experiment

Let $B$ denote a color-blind man and $B^c$ denote a non-color-blind man.
$\Omega = \{BBB,\ BBB^c,\ BB^cB,\ BB^cB^c,\ B^cBB,\ B^cBB^c,\ B^cB^cB,\ B^cB^cB^c\}$
Notice that the sample space can be determined using a tree diagram.
b) Construct a probability distribution for the number of color blind men in the sample.
Let the random variable $X$ denote the number of color-blind men in the sample.
To construct a probability distribution for $X$, we must determine all possible values of $X$ and
their probabilities.
When the outcome is $BBB$, then $X = 3$ with probability $P(3) = 0.1 \times 0.1 \times 0.1 = 0.001$
When the outcome is one of $BBB^c$, $BB^cB$, $B^cBB$, then $X = 2$ with probability
$P(2) = 3 \times 0.1 \times 0.1 \times 0.9 = 0.027$
When the outcome is one of $BB^cB^c$, $B^cBB^c$, $B^cB^cB$, then $X = 1$ with
probability $P(1) = 3 \times 0.1 \times 0.9 \times 0.9 = 0.243$
When the outcome is $B^cB^cB^c$, then $X = 0$ with probability $P(0) = 0.9 \times 0.9 \times 0.9 = 0.729$
The probability distribution for $X$ is
$X$      0      1      2      3
$P(X)$   0.729  0.243  0.027  0.001
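The construction in Example 5 can be checked by listing the eight outcomes and adding up their probabilities. The following is a minimal sketch; the labels "B" and "Bc" and the variable names are illustrative choices.

```python
from itertools import product
from collections import defaultdict

p_blind = 0.1  # probability that one man is color blind (from the example)

# Each outcome is a tuple such as ("B", "Bc", "B"); "Bc" marks a non-color-blind man.
distribution = defaultdict(float)
for outcome in product(["B", "Bc"], repeat=3):
    x = outcome.count("B")                          # value of X for this outcome
    prob = p_blind ** x * (1 - p_blind) ** (3 - x)  # the men are checked independently
    distribution[x] += prob

for x in sorted(distribution):
    print(x, round(distribution[x], 3))
```

The printed values should match the table above and sum to 1.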

Example 6:
In a group of 12 men, there are 4 smokers. A sample of 3 men is selected at random from this group.
a) Write the sample space for this experiment.
Let $S$ denote a smoker and $S^c$ denote a nonsmoker.
$\Omega = \{SSS,\ SSS^c,\ SS^cS,\ SS^cS^c,\ S^cSS,\ S^cSS^c,\ S^cS^cS,\ S^cS^cS^c\}$
b) Construct a probability distribution for the number of smokers in the sample.
Let the random variable $X$ denote the number of smokers in the sample. To construct a
probability distribution for $X$, we must determine all possible values of $X$ and their
probabilities.
When the outcome is $SSS$, then $X = 3$ with probability $P(3) = \frac{4}{12} \times \frac{3}{11} \times \frac{2}{10} = \frac{1}{55}$
When the outcome is one of $SSS^c$, $SS^cS$, $S^cSS$, then $X = 2$ with probability
$P(2) = 3 \times \frac{4}{12} \times \frac{3}{11} \times \frac{8}{10} = \frac{12}{55}$
When the outcome is one of $SS^cS^c$, $S^cSS^c$, $S^cS^cS$, then $X = 1$ with probability
$P(1) = 3 \times \frac{4}{12} \times \frac{8}{11} \times \frac{7}{10} = \frac{28}{55}$
When the outcome is $S^cS^cS^c$, then $X = 0$ with probability $P(0) = \frac{8}{12} \times \frac{7}{11} \times \frac{6}{10} = \frac{14}{55}$
The probability distribution for $X$ is
$X$      0      1      2      3
$P(X)$   14/55  28/55  12/55  1/55
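Because the men in Example 6 are selected without replacement, the probabilities change from draw to draw. One way to confirm the table is to count every ordered selection of 3 distinct men. This is a sketch only; the labels "S" and "N" are illustrative.

```python
from fractions import Fraction
from itertools import permutations
from collections import Counter

# 12 men: 4 smokers ("S") and 8 nonsmokers ("N").
group = ["S"] * 4 + ["N"] * 8

# Ordered selections of 3 distinct men; all are equally likely.
counts = Counter()
for i, j, k in permutations(range(12), 3):
    picked = (group[i], group[j], group[k])
    counts[picked.count("S")] += 1

total = sum(counts.values())   # 12 * 11 * 10 = 1320
dist = {x: Fraction(c, total) for x, c in sorted(counts.items())}
print(dist)
```

Exact fractions are used so that the result can be compared directly with 14/55, 28/55, 12/55 and 1/55.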
Requirements for Discrete Probability Distributions
Let $P(X)$ be the probability distribution of a discrete random variable $X$. Then:
1. The sum of the probabilities over all values of $X$ is 1; that is, $\sum P(X) = 1$.
2. The probability of each value of $X$ must be between 0 and 1, inclusive; that is, $0 \le P(X) \le 1$.

Example 7:
Determine whether each distribution is a probability distribution.
a. $X$      -5    0     5     10    15
   $P(X)$   0.2   0.2   0.1   0.2   0.3
b. $Y$      2     3     7
   $P(Y)$   0.5   0.3   0.4
c. $Z$      0     2     4     6
   $P(Z)$   -1.0  1.5   0.3   0.2
$P(X)$ is a probability distribution.
$P(Y)$ is not, since $\sum P(Y) = 1.2 \ne 1$.
$P(Z)$ is not, since a probability cannot be 1.5 or $-1.0$.
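The two requirements can be turned into a small check. The function below is a hypothetical helper written for this illustration; the tolerance `tol` is an assumption that allows for floating-point rounding.

```python
def is_probability_distribution(probs, tol=1e-9):
    """Return True if every value lies in [0, 1] and the values sum to 1."""
    in_range = all(0 <= p <= 1 for p in probs)
    sums_to_one = abs(sum(probs) - 1) <= tol
    return in_range and sums_to_one

print(is_probability_distribution([0.2, 0.2, 0.1, 0.2, 0.3]))   # True
print(is_probability_distribution([0.5, 0.3, 0.4]))             # False
print(is_probability_distribution([-1.0, 1.5, 0.3, 0.2]))       # False
```

Note that the third list sums to 1 but still fails, because two of its values fall outside the interval from 0 to 1.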
Example 8:
The probabilities that a patient will have 0, 1, 2, or 3 medical tests performed on entering a hospital are 6/15,
5/15, 3/15, and 1/15, respectively.
a) Construct a probability distribution for the number of tests that a patient will have.
Let 𝑋 denote the number of tests that a patient will have. The probability distribution of 𝑋 is
𝑋 0 1 2 3
𝑃(𝑋) 6/15 5/15 3/15 1/15
b) Find the probability that a patient will have exactly 2 tests.
$P(2) = \frac{3}{15}$
c) Find the probability that a patient will have at least 2 tests.
$P(2) + P(3) = \frac{3}{15} + \frac{1}{15} = \frac{4}{15}$
d) Find the probability that a patient will have at most 2 tests.
$P(0) + P(1) + P(2) = \frac{6}{15} + \frac{5}{15} + \frac{3}{15} = \frac{14}{15}$
e) Find the probability that a patient will have no tests.
$P(0) = \frac{6}{15}$
f) If 300 patients entered the hospital, how many patients would you expect to have exactly 2
tests?
$300 \times P(2) = 300 \times \frac{3}{15} = 60$

Example 9:
Let $P(X) = \frac{0.48}{X}$, $X = 1, 2, 3, 4$, be the probability distribution of a discrete random variable $X$. Find
a) $P(X = 3)$
→ $P(X = 3) = \frac{0.48}{3} = 0.16$
b) $P(1.5 < X < 2.5)$
→ $P(1.5 < X < 2.5) = P(X = 2) = \frac{0.48}{2} = 0.24$
c) $P(X \ne 2)$
→ $P(X \ne 2) = 1 - P(X = 2) = 1 - \frac{0.48}{2} = 0.76$
d) $P(X \le 2)$
→ $P(X \le 2) = P(X = 1) + P(X = 2) = \frac{0.48}{1} + \frac{0.48}{2} = 0.72$
e) $P(X < 2)$
→ $P(X < 2) = P(X = 1) = \frac{0.48}{1} = 0.48$
f) Sketch a bar chart for the distribution.
[Bar chart: $P(x)$ on the vertical axis (0.1 to 0.5) against $x = 1, 2, 3, 4$ on the horizontal axis]
g) What is the mode?

The mode is 𝑋 = 1 because it has the largest probability.
Section 5-2: Mathematical Expectations

If we perform a random experiment a very large number of times and record the value of the corresponding
random variable each time, we can obtain the mean and standard deviation of this set of values as shown in
Chapter 3. These statistics will be almost equal to the mean and standard deviation of the theoretical
population of outcomes represented by the probability distribution. We can obtain these population
parameters from the probability distribution by using mathematical expectation.

Let $P(X)$, $X \in \Omega_X$, be the probability distribution of a discrete random variable $X$ and let $u(X)$ be a function
of $X$. The expected value of $u(X)$ is defined by
$E[u(X)] = \sum_{\Omega_X} u(X)\,P(X)$
Example 10:
Let $X$ be a random variable that has the probability distribution $P(X)$ below. Evaluate the following:
$X$      1    2    3    4
$P(X)$   0.4  0.3  0.2  0.1
a) $E(X)$ → $E(X) = \sum X\,P(X) = 1\,P(1) + 2\,P(2) + 3\,P(3) + 4\,P(4)$
$= 1(0.4) + 2(0.3) + 3(0.2) + 4(0.1) = 2$
b) $E(5)$ → $E(5) = \sum 5\,P(X) = 5\,P(1) + 5\,P(2) + 5\,P(3) + 5\,P(4)$
$= 5[P(1) + P(2) + P(3) + P(4)] = 5(0.4 + 0.3 + 0.2 + 0.1) = 5(1) = 5$
c) $E(3X)$ → $E(3X) = \sum 3X\,P(X) = 3\sum X\,P(X) = 3E(X) = 3(2) = 6$
d) $E(3X - 5)$ → $E(3X - 5) = \sum (3X - 5)\,P(X) = \sum 3X\,P(X) - \sum 5\,P(X) = 6 - 5 = 1$
e) $E(X^2)$ → $E(X^2) = \sum X^2\,P(X) = 1^2 P(1) + 2^2 P(2) + 3^2 P(3) + 4^2 P(4)$
$= 1(0.4) + 4(0.3) + 9(0.2) + 16(0.1) = 5$
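The definition of $E[u(X)]$ translates directly into a weighted sum. The sketch below reproduces the five parts of Example 10; `expected_value` is a hypothetical helper name, and the distribution is stored as a dictionary from values to probabilities.

```python
def expected_value(dist, u=lambda x: x):
    """E[u(X)]: the sum of u(x) * P(x) over the space of X."""
    return sum(u(x) * p for x, p in dist.items())

dist = {1: 0.4, 2: 0.3, 3: 0.2, 4: 0.1}   # distribution from Example 10

print(round(expected_value(dist), 10))                       # 2.0
print(round(expected_value(dist, lambda x: 5), 10))          # 5.0
print(round(expected_value(dist, lambda x: 3 * x), 10))      # 6.0
print(round(expected_value(dist, lambda x: 3 * x - 5), 10))  # 1.0
print(round(expected_value(dist, lambda x: x ** 2), 10))     # 5.0
```

Rounding is applied only to hide floating-point noise in the last decimal places.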
Properties of Expectation
Let $u(X)$ and $v(X)$ be two functions of a random variable $X$ and let $c$ be a constant. Then:
1. $E(c) = c$
2. $E[c\,u(X)] = c\,E[u(X)]$
3. $E[u(X) + v(X)] = E[u(X)] + E[v(X)]$
4. $E[u(X) - v(X)] = E[u(X)] - E[v(X)]$
5. In general, $E[u(X) \times v(X)] \ne E[u(X)] \times E[v(X)]$
6. In general, $E[u(X) \div v(X)] \ne E[u(X)] \div E[v(X)]$

Example 11:
Let $X$ be a random variable such that $E(X) = 20$ and $E(X^2) = 500$. Find $E[(X - 20)^2]$.
$E[(X - 20)^2] = E(X^2 - 40X + 400) = E(X^2) - 40E(X) + 400 = 500 - 40(20) + 400 = 100$
Mean
Since the expected value of a random variable is the long-term average (or mean value) of the theoretical
population, it is identical to the population mean 𝜇 introduced in Chapter 3. We thus define the mean of a
random variable as follows:
Let $X$ be a random variable with probability distribution $P(X)$. The mean of $X$ is defined by
$\mu = E(X) = \sum X\,P(X)$
Example 12:
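Refer to Example 8. Determine the mean number of tests that patients will have on entering a hospital.
$\mu = E(X) = \sum X\,P(X) = 0\left(\frac{6}{15}\right) + 1\left(\frac{5}{15}\right) + 2\left(\frac{3}{15}\right) + 3\left(\frac{1}{15}\right) = \frac{14}{15} \cong 0.93$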
Variance and Standard Deviation

Since random variables with the same mean can have different probability distributions, we must use a
measure of variability or dispersion. We saw in Chapter 3 that the most common of these is the variance,
symbolized 𝜎 2 . The variance of a random variable is defined as the expected value of the squared deviations
from the mean of the random variable.
Let $X$ be a random variable with probability distribution $P(X)$ and mean $\mu$. The variance of $X$ is defined by
$\sigma^2 = E[(X - \mu)^2] = \sum (X - \mu)^2\,P(X)$
Another formula for the variance is $\sigma^2 = E(X^2) - \mu^2 = \sum X^2\,P(X) - \mu^2$
The standard deviation of $X$ is the square root of the variance: $\sigma = \sqrt{\sigma^2}$

Example 13:
Find the mean, variance and standard deviation for the random variable in Example 10.
1. The mean is $\mu = E(X) = \sum X\,P(X) = 1(0.4) + 2(0.3) + 3(0.2) + 4(0.1) = 2$
2. The variance is $\sigma^2 = E[(X - \mu)^2] = \sum (X - 2)^2\,P(X)$
$= (1 - 2)^2(0.4) + (2 - 2)^2(0.3) + (3 - 2)^2(0.2) + (4 - 2)^2(0.1) = 1$
Or we can find the variance using the second formula as follows:
$\sigma^2 = E(X^2) - \mu^2 = \sum X^2\,P(X) - \mu^2$
$= 1^2(0.4) + 2^2(0.3) + 3^2(0.2) + 4^2(0.1) - 2^2 = 1$
3. The standard deviation is $\sigma = \sqrt{\sigma^2} = \sqrt{1} = 1$
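Both variance formulas can be evaluated side by side to confirm that they agree. This is a minimal sketch for the distribution of Example 10; the variable names are illustrative.

```python
from math import sqrt

dist = {1: 0.4, 2: 0.3, 3: 0.2, 4: 0.1}   # distribution from Example 10

mu = sum(x * p for x, p in dist.items())

# Definition: expected squared deviation from the mean.
var_definition = sum((x - mu) ** 2 * p for x, p in dist.items())

# Second formula: E(X^2) minus the square of the mean.
var_shortcut = sum(x ** 2 * p for x, p in dist.items()) - mu ** 2

sigma = sqrt(var_definition)
print(round(mu, 10), round(var_definition, 10), round(var_shortcut, 10), round(sigma, 10))
```

The second formula usually needs less arithmetic by hand, since it avoids subtracting the mean from every value.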
Example 14:
A committee of 3 members is to be selected at random from a group of 2 nurses and 4 doctors. Let $X$
denote the number of nurses in the selected committee. Find the mean and the standard deviation of $X$.
1. To find the mean of $X$ (the expected value of $X$), we must construct a probability distribution of $X$.
Here, the sample space is $\Omega = \{DDD, DDN, DND, DNN, NDD, NDN, NND\}$ and the space of $X$ is
$\Omega_X = \{0, 1, 2\}$.
The probability of no nurses in the committee is $P(X = 0) = P(DDD) = \frac{4}{6} \times \frac{3}{5} \times \frac{2}{4} = 0.2$
Similarly, $P(X = 1) = 0.6$ and $P(X = 2) = 0.2$
The probability distribution of $X$ is
$X$      0    1    2
$P(X)$   0.2  0.6  0.2
2. The mean number of nurses in the committee is $\mu = E(X) = 0(0.2) + 1(0.6) + 2(0.2) = 1$
3. To find the standard deviation, we must first find the variance.
The variance is $\sigma^2 = E(X^2) - \mu^2 = 0^2(0.2) + 1^2(0.6) + 2^2(0.2) - 1^2 = 0.4$
4. The standard deviation is $\sigma = \sqrt{\sigma^2} = \sqrt{0.4} \cong 0.63$

Factorial Notation
For any positive integer $n$, the factorial of $n$ is $n! = n \times (n - 1) \times (n - 2) \times \dots \times 1$
$0! = 1$
$(n + 1)! = (n + 1)\,n!$
Example 15:
Evaluate
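a) $5! = 5 \times 4 \times 3 \times 2 \times 1 = 120$
b) $1! = 1$
c) $\frac{7!}{5!} = \frac{7 \times 6 \times 5!}{5!} = 7 \times 6 = 42$
d) $\frac{8!}{3! \times 5!} = \frac{8 \times 7 \times 6 \times 5!}{(3 \times 2 \times 1) \times 5!} = 8 \times 7 = 56$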
Combinations Rule:
Let $r$ and $n$ be two nonnegative integers such that $r \le n$. The number of combinations of $n$ objects taken $r$ at a
time is computed as follows:
$_nC_r = \binom{n}{r} = \frac{n!}{r! \times (n - r)!}$
Example 16:
Evaluate
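a) $_8C_3 = \binom{8}{3} = \frac{8!}{3! \times (8 - 3)!} = \frac{8 \times 7 \times 6 \times 5!}{3 \times 2 \times 1 \times 5!} = 56$
b) $_{10}C_1 = \binom{10}{1} = \frac{10!}{1! \times (10 - 1)!} = \frac{10 \times 9!}{1 \times 9!} = 10$
c) $_5C_0 = \binom{5}{0} = \frac{5!}{0! \times (5 - 0)!} = \frac{5!}{1 \times 5!} = 1$
d) $_7C_7 = \binom{7}{7} = \frac{7!}{7! \times (7 - 7)!} = \frac{7!}{7! \times 0!} = \frac{7!}{7! \times 1} = 1$
e) $_{15}C_{14} = \binom{15}{14} = \frac{15!}{14! \times (15 - 14)!} = \frac{15 \times 14!}{14! \times 1!} = \frac{15 \times 14!}{14! \times 1} = 15$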
Note:
1. $\binom{n}{0} = \binom{n}{n} = 1$
2. $\binom{n}{1} = \binom{n}{n - 1} = n$
3. $\binom{n}{r} = \binom{n}{n - r}$
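The combinations rule and the three notes can be verified with Python's standard library. The helper `n_choose_r` below is a hypothetical name used only for this sketch; `math.comb` requires Python 3.8 or later.

```python
from math import comb, factorial

def n_choose_r(n, r):
    """nCr computed directly from the factorial formula."""
    return factorial(n) // (factorial(r) * factorial(n - r))

# The factorial formula agrees with the built-in math.comb.
for n, r in [(8, 3), (10, 1), (5, 0), (7, 7), (15, 14)]:
    assert n_choose_r(n, r) == comb(n, r)
    print(n, r, comb(n, r))

# Note 3: choosing r items is the same as leaving out n - r items.
assert all(comb(10, r) == comb(10, 10 - r) for r in range(11))
```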

Example 17:
a) $\binom{100}{0} = 1$
b) $\binom{20}{20} = 1$
c) $\binom{25}{1} = 25$
d) If $\binom{10}{3} = 120$, then $\binom{10}{7} = 120$
Note:
In binomial distribution, the combination (𝑛C𝑟) is used to count the ways of arranging 𝑟 successes
within 𝑛 trials.
Example 18:
Suppose that 15% of people in a large population are left-handed. A random sample of size $n = 6$ is
selected from this population. In how many ways can the sample contain 2 left-handed people?
Writing L for a left-handed person and R for a right-handed person, we have the following arrangements:
{L L R R R R} , {L R L R R R} , {L R R L R R} , {L R R R L R} , {L R R R R L} , {R L L R R R} ,
{R L R L R R} , {R L R R L R} , {R L R R R L} , {R R L L R R} , {R R L R L R} , {R R L R R L} ,
{R R R L L R} , {R R R L R L} , {R R R R L L}
As you can see, there are 15 arrangements of 2 left-handed people among 6 people.
We can find the number of arrangements without listing each one of them by evaluating the following
combination: $_6C_2 = \binom{6}{2} = \frac{6!}{2! \times 4!} = \frac{6 \times 5}{2 \times 1} = 15$
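The list in Example 18 can also be generated by choosing which 2 of the 6 positions hold a left-handed person. The following sketch uses `itertools.combinations` for that choice; the variable names are illustrative.

```python
from itertools import combinations
from math import comb

n, k = 6, 2
arrangements = []
for left_positions in combinations(range(n), k):   # positions of the left-handed people
    row = ["L" if i in left_positions else "R" for i in range(n)]
    arrangements.append(" ".join(row))

print(len(arrangements))   # 15
print(arrangements[0])     # L L R R R R
```

The count equals `comb(6, 2)`, which is why the combination formula can replace the explicit list.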

Section 5-3: The Binomial Distribution

Many types of probability problems have only two outcomes or can be reduced to two outcomes. For
example, when a coin is tossed, it can land heads or tails. When a baby is born, it will be either male or
female. A true/false item can be answered in only two ways, true or false. Other situations can be reduced to
two outcomes. For example, a medical treatment can be classified as effective or ineffective, depending on
the results. A person can be classified as having normal or abnormal blood pressure, depending on the
measure of the blood pressure gauge. A multiple-choice question, even though there are 4 or 5 answer
choices, can be classified as correct or incorrect. Situations like these are called binomial experiments.
A binomial experiment is a random experiment that satisfies the following:
1. There must be a fixed number of trials, say 𝑛.

2. Each trial can have only two outcomes or outcomes that can be reduced to two outcomes. These
outcomes can be considered as either success (S) or failure (F).
3. The outcomes of each trial must be independent of one another.
4. The probability of success, denoted 𝑝, must remain the same from trial to trial. (The probability of
failure is 1 − 𝑝) .
The outcomes of a binomial experiment and the corresponding probabilities of these outcomes constitute
the binomial distribution.
Let $X$ denote the number of successes in $n$ trials of a binomial experiment. The probability distribution of
$X$ is
$P(X) = \binom{n}{X}\,p^X (1 - p)^{n - X}$, $X = 0, 1, 2, \dots, n$
where $\binom{n}{X} = \frac{n!}{(n - X)!\,X!}$ and $n! = n(n - 1)(n - 2) \cdots (2)(1)$.
$n$ and $p$ are the parameters of the binomial distribution.
• The mean of $X$ is $\mu = np$
• The variance of $X$ is $\sigma^2 = np(1 - p)$
• The standard deviation is $\sigma = \sqrt{np(1 - p)}$

Example 19:
A survey of a large population found that one out of 5 people visited a doctor in any given month. If 10
people are selected at random, find:
a) The probability that exactly 3 will have visited a doctor last month.
This experiment is a binomial experiment for which $n = 10$, S: the selected person will have
visited the doctor last month, $p = \frac{1}{5} = 0.2$, and $X$ is the number of people (from the 10) who will
have visited a doctor last month.
The probability distribution of $X$ is $P(x) = \binom{10}{x}(0.2)^x (0.8)^{10 - x}$, $x = 0, 1, 2, \dots, 10$
Hence, the solution of part (a) is $P(3) = \binom{10}{3}(0.2)^3 (0.8)^7 = 0.201$
b) The probability that at most 2 will have visited a doctor last month.
$P(X \le 2) = P(0) + P(1) + P(2)$
$= \binom{10}{0}(0.2)^0 (0.8)^{10} + \binom{10}{1}(0.2)^1 (0.8)^9 + \binom{10}{2}(0.2)^2 (0.8)^8 = 0.678$
c) The probability that at least 3 will have visited a doctor last month.
$P(X \ge 3) = P(3) + P(4) + \dots + P(10) = 1 - [P(0) + P(1) + P(2)] = 1 - 0.678 = 0.322$
d) The probability that none will have visited a doctor last month.
$P(X = 0) = P(0) = \binom{10}{0}(0.2)^0 (0.8)^{10} = 0.107$
e) The probability that all 10 will have visited a doctor last month.
$P(X = 10) = P(10) = \binom{10}{10}(0.2)^{10} (0.8)^0 \cong 0.0$
f) The expected number of people (from the 10) who will have visited a doctor last month.
$E(X) = \mu = np = 10(0.2) = 2$
g) The variance of the number of people (from the 10) who will have visited a doctor last month.
$\sigma^2 = np(1 - p) = 10(0.2)(0.8) = 1.6$
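The binomial formula is easy to evaluate with `math.comb`. The sketch below recomputes several parts of Example 19; `binomial_pmf` is a hypothetical helper name chosen for this illustration.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial random variable with parameters n and p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.2
p_exactly_3 = binomial_pmf(3, n, p)
p_at_most_2 = sum(binomial_pmf(x, n, p) for x in range(3))
p_at_least_3 = 1 - p_at_most_2          # complement rule
mean = n * p
variance = n * p * (1 - p)

print(round(p_exactly_3, 3), round(p_at_most_2, 3), round(p_at_least_3, 3))
print(round(mean, 2), round(variance, 2))
```

Using the complement in part (c) avoids adding eight separate terms.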

Example 20:
The following histograms and tables show the probability distributions for binomial random variables with $n = 10$
and $p = 0.2, 0.5, 0.8$.
[Histogram: Binomial, n=10, p=0.2; Probability against X]
$x$      0     1     2     3     4     5     6     7     8     9     10
$P(x)$   .107  .268  .302  .201  .088  .026  .006  .001  .000  .000  .000
[Histogram: Binomial, n=10, p=0.5; Probability against X]
$x$      0     1     2     3     4     5     6     7     8     9     10
$P(x)$   .001  .010  .044  .117  .205  .246  .205  .117  .044  .010  .001
[Histogram: Binomial, n=10, p=0.8; Probability against X]
$x$      0     1     2     3     4     5     6     7     8     9     10
$P(x)$   .000  .000  .000  .001  .006  .026  .088  .201  .302  .268  .107
Notice that the shape of the binomial distribution is:
1. Symmetric when $p = \frac{1}{2}$
2. Approximately symmetric when $p$ is close to $\frac{1}{2}$
3. More skewed to the right as $p$ gets closer to 0
4. More skewed to the left as $p$ gets closer to 1
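One way to quantify these shape statements is the skewness (the third standardized moment), which is positive for a right-skewed distribution, negative for a left-skewed one, and zero for a symmetric one. The notes do not use skewness here, so the sketch below is offered only as an illustration; the function names are hypothetical.

```python
from math import comb, sqrt

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial random variable with parameters n and p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def skewness(n, p):
    """Third standardized moment, computed directly from the pmf."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    return sum(((x - mu) / sigma) ** 3 * binomial_pmf(x, n, p) for x in range(n + 1))

for p in (0.1, 0.2, 0.5, 0.8, 0.9):
    print(p, round(skewness(10, p), 3))
```

The printed values should decrease from positive to negative as $p$ moves from 0.1 to 0.9, with the largest magnitudes at the extremes.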
PROBLEMS
1. Omar is taking 4 courses for the semester. He believes that the probability distribution for the random
variable 𝑋 = number of courses for which he will get an A grade is given below:
𝑋 0 1 2 3 4
𝑃(𝑋) .15 .20 .30 .20 .15
a) What is the probability that he will get an A in 2 courses or more? 0.65

b) What is the probability that he will get an A in all courses that he is taking? 0.15
c) What is the probability he will not get an A in one course only? 0.20
d) What is the probability that he will get an A in some courses but not all of them? 0.70
e) What is the mean number of courses for which he will get an A? 2

2. A random variable 𝑋 has this probability distribution:
𝑋 0 1 2 3 4 5
𝑃(𝑋) .10 .30 .40 .10 𝑐 .05
a) Find 𝑃(4) 0.05

b) Find 𝑃(1 ≤ 𝑋 < 3) 0.70
c) Find 𝑃( 𝑋 < 4 | 𝑋 > 2 ) 0.50
d) Find 𝜇 1.85
e) Find 𝜎 $\sqrt{1.4275} \cong 1.19$
f) What is the probability that 𝑋 will lie within 1.2 standard deviations of the mean? 0.80
3. 8 pints of blood are stored in a hospital laboratory. It is known that exactly 3 pints are type O, but it is
not known which ones. 2 pints of type O blood are needed. One pint at a time is removed and typed. If it
is type O, it is used; if not, it is labeled and the next pint is tested.
a) Make a probability distribution for the number of pints that must be tested in order to obtain 2 pints
of type O.
𝑋 2 3 4 5 6 7
𝑃(𝑋) 3/28 5/28 6/28 6/28 5/28 3/28
b) Determine the mean and the standard deviation of the variable.

𝜇 = 4.5, 𝜎 = 3/2
4. Suppose 𝑋 has a binomial distribution with 𝑛 = 10 and 𝑝 = 0.4. Determine the following:
a) The probability distribution of 𝑋 $P(x) = \binom{10}{x}(0.4)^x (0.6)^{10 - x}$, $x = 0, 1, \dots, 10$
b) 𝑃(𝑋 = 3) 0.215
c) 𝑃(𝑋 ≥ 4) 0.618
d) 𝑃(𝑋 ≥ 9) 0.002
e) 𝑃( 𝑋 = 3 | 𝑋 ≤ 3 ) 0.562
f) Mean of 𝑋 4
g) Variance of 𝑋 2.4
h) 𝐸(𝑋 2 ) 18.4

5. It is known that 20% of a certain variety of flower bulb will not grow. If 15 bulbs are planted, what is the
probability that:
a) Exactly 12 will bloom? 0.250

b) At least 4 will bloom? ≅1
c) No more than 13 will bloom? 0.833
d) All of them will bloom? 0.035
e) Exactly 5 will not bloom? 0.103
f) 10 will bloom and 5 will not? 0.103
g) What is the expected number of bulbs that will bloom? 12
h) Find the variance of the number of bulbs that will bloom. 2.4
6. The gender ratio of humans at birth is 100 females to 105 males.
a) What is the probability that, in 6 single births, at least half the babies born are females?
0.633
b) Compare this probability with the result you would obtain if you used a gender ratio of 1 to 1?
0.656
7. Suppose that 30% of adults in a certain large population are smokers. A random sample of size 20 is
selected from this population, find:
a) The expected number of smokers in the sample. 6
b) The probability that the number of smokers in the sample is more than expected. 0.392
8. A multiple-choice quiz has 8 questions with 4 responses (one correct) on each question. To pass, you
must get at least 5 correct.
a) If you guess on every question, what is the probability that you will pass? 0.027
b) If the probability of correct answer is 0.9, what is the probability that you will pass? 0.995
9. Let $X$ be a binomial random variable such that the mean is $\mu = 5$ and the variance is $\sigma^2 = 3.75$. Find
$P(X = 5)$ 0.202

See the following examples and exercises in the textbook:
Section 5-1
Examples: 1, 2, 3, 4
Exercises: 7, 9, 11, 13, 15, 17, 18, 19
Section 5-2
Examples: 6, 10, 11, 12, 13
Exercises: 1, 3, 7, 9, 12
Section 5-3
Examples: 15, 16, 17, 20, 23
Exercises: 3, 4, 5, 11, 17, 28

BIOSTATISTICS
CHAPTER 6: The Normal Distribution
Introduction
When a random variable 𝑋 is discrete, you can assign a positive probability to each value that 𝑋 can take and
get the probability distribution of 𝑋. The sum of all probabilities associated with the different values of 𝑋 is one.
However, continuous random variables, such as heights, weights, length of life of a particular product, or time
required to complete a task, can assume infinitely many values corresponding to points on a line interval.
If you try to assign a positive probability to each of these uncountable values, the probabilities will no longer
sum to 1. Therefore, you must use a different approach to find the probabilities for continuous random
variables.
Definition:
Let $X$ be a continuous random variable with space $\Omega_X$. The probability distribution of $X$ is a function $f$ that
satisfies the following:
1. $f(x) \ge 0$, $x \in \Omega_X$
2. $\int_{\Omega_X} f(x)\,dx = 1$, which means that the area under the curve of $f$ is 1
The probability distribution of $X$ is called the probability density function.
Example 1:
Determine whether each of the following is a probability distribution of a continuous random variable or
not:
[Figure: three curves, labeled a), b) and c)]
The curves in parts (a) and (c) are probability distributions because the area under each of these curves is 1.
The area under the curve in part (b) is 1.5, so it is not a probability distribution.

Properties:
Let $X$ be a continuous random variable with space $\Omega_X$ and probability distribution $f(x)$. Then:
1. $P(a \le X \le b) = \int_a^b f(x)\,dx$ = the area under the curve of $f$ from $x = a$ to $x = b$
2. $P(X = a) = 0$, for any value of $a$
3. $P(a \le X \le b) = P(a \le X < b) = P(a < X \le b) = P(a < X < b)$
4. The $k$th percentile, $P_k$, can be evaluated by solving the equation $P(X \le P_k) = \frac{k}{100}$. This means the
area under the curve of $f(x)$ to the left of $x = P_k$ is $k/100$.
5. The mode of the distribution is the value of $X$ that maximizes $f(x)$.
Example 2:
The probability density function of the time it takes a hematology cell counter to complete a test on a blood
sample is 𝑓 𝑥 = 0.04 , 50 < 𝑥 < 75 seconds
a) Find the probability that a test will require exactly 70 seconds to complete.
$P(X = 70) = 0$, since $X$ is a continuous random variable.
b) Find the probability that a test will require more than 70 seconds to complete.
$P(X > 70) = 0.04(75 - 70) = 0.2$.
c) What is the percentage of tests that require less than one minute to complete?
$P(X < 60) = 0.04(60 - 50) = 0.4$, that is, 40% of tests.
d) What time is exceeded by 80% of tests?
$P(X > c) = 0.04(75 - c) = 0.8$ → $c = 55$ seconds.
e) Determine the mean of the time to complete a test.
The probability distribution is symmetric about $X = 62.5$, so the mean is $\mu = 62.5$ seconds.
f) Determine the median of the time to complete a test.
The probability distribution is symmetric about $X = 62.5$, so the median equals $\mu = 62.5$ seconds.
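For a constant density, every probability in Example 2 is the area of a rectangle: height 0.04 times the width of the interval. The sketch below encodes that idea; `prob_between` and the constant names are hypothetical and chosen only for this illustration.

```python
# Uniform density f(x) = 0.04 on 50 < x < 75 (Example 2).
LOW, HIGH, HEIGHT = 50.0, 75.0, 0.04

def prob_between(a, b):
    """Area under f between a and b, clipped to the interval (LOW, HIGH)."""
    a, b = max(a, LOW), min(b, HIGH)
    return HEIGHT * max(b - a, 0.0)

print(round(prob_between(70, 75), 2))   # 0.2
print(round(prob_between(50, 60), 2))   # 0.4
print(round(prob_between(70, 70), 2))   # 0.0

# Time exceeded by 80% of tests: solve 0.04 * (75 - c) = 0.8 for c.
c = HIGH - 0.8 / HEIGHT
print(round(c, 2))                      # 55.0
```

The zero result for a single point mirrors property 2: an interval of zero width has zero area.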
Section 6-1: Normal Distribution

Many continuous random variables observed in nature, such as the heights of adult men, body temperatures
of rats, and cholesterol levels of adults, have distributions that are bell-shaped. These are called
approximately normally distributed random variables.

Definition:
A random variable $X$ with probability density function
$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x - \mu)^2}{2\sigma^2}}$, $-\infty < x < \infty$, where $e \cong 2.7$ and $\pi \cong 3.14$,
is a normal random variable with mean $\mu$ and variance $\sigma^2$. The notation $X: N(\mu, \sigma^2)$ is used to denote
the distribution.
Properties of the Normal Distribution
1. A normal distribution curve is bell-shaped.
2. The normal distribution is symmetric about the mean $\mu$.
3. The measures of central tendency (mean, median and mode) are equal.
4. The area under a normal distribution curve is equal to 1; hence the area to the right of $\mu$ and the
area to the left of $\mu$ each equal 0.5.
5. The shape of the distribution is determined by the variance, $\sigma^2$. Large values of $\sigma^2$ reduce the
height of the curve and increase the spread; small values of $\sigma^2$ increase the height of the curve and
reduce the spread. See the following graphs.
6. About 99.7% of the area under the normal curve lies within 3 standard deviations of the mean.
[Figure: normal curves with mean 120 and standard deviations 10 and 20]
[Figure: normal curves with (mean, standard deviation) = (0, 0.45), (0, 1.4), (0, 2.24) and (-2, 0.71)]
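Property 5 can be seen numerically by evaluating the density at the mean for two standard deviations. The sketch below implements the density formula from the definition; `normal_pdf` is a hypothetical helper name.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / sqrt(2 * pi * sigma ** 2)

# Peak heights (at x = mu) for two curves with mean 120.
print(round(normal_pdf(120, 120, 10), 4))   # 0.0399
print(round(normal_pdf(120, 120, 20), 4))   # 0.0199
```

Doubling the standard deviation halves the peak height, which is consistent with a flatter, more spread-out curve.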
Example 3:
Let $X$ be a normal random variable with a mean of 75 and a standard deviation of 5.
Given that $P(75 < X < 80) = 0.34$ and $P(75 < X < 85) = 0.48$, find
a) $P(X = 75)$ → $P(X = 75) = 0$, since $X$ is a continuous random variable.
b) The median of $X$ → the median equals the mean, which equals 75
c) The mode of $X$ → the mode equals the mean, which equals 75
d) $P(X < 80) = P(X < 75) + P(75 < X < 80) = 0.5 + 0.34 = 0.84$
e) $P(80 < X < 85) = P(75 < X < 85) - P(75 < X < 80) = 0.48 - 0.34 = 0.14$
f) $P(X < 70) = P(X < 75) - P(70 < X < 75) = 0.5 - 0.34 = 0.16$
Note that $P(70 < X < 75) = P(75 < X < 80) = 0.34$ since the distribution is symmetric about 75.
g) The 84th percentile → $P_{84} = 80$, since the area to the left of 80 is 0.84, as found in part (d).
The Standard Normal Distribution

Since each normally distributed variable has its own mean and standard deviation, the shape and location of
these curves will vary. In particular applications you would have to have a table of areas under the curve for
each variable. To simplify this situation, statisticians use what is called the standard normal distribution.
Definition:
The Standard normal distribution is a normal distribution with a mean of 0 and a standard deviation
of 1. We will use, 𝑍, to denote a standard normal variable.

Example 4:
Let $Z: N(0, 1)$. Find
a) $P(Z = 2.5) = 0$, since $Z$ is a continuous random variable
b) $P(0 < Z < 2.5) = 0.4938$
c) $P(-1.25 < Z < 0) = P(0 < Z < 1.25) = 0.3944$
d) $P(Z < 2/3) = P(Z < 0.67) = 0.5 + P(0 < Z < 0.67) = 0.5 + 0.2486 = 0.7486$
e) $P(Z < -1) = 0.5 - P(-1 < Z < 0) = 0.5 - 0.3413 = 0.1587$
f) $P(-2 < Z < 1) = P(-2 < Z < 0) + P(0 < Z < 1) = 0.4772 + 0.3413 = 0.8185$
g) $P(1 < Z < 5) = P(0 < Z < 5) - P(0 < Z < 1) \cong 0.5 - 0.3413 = 0.1587$
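The table areas used in Example 4 can be reproduced without a table by using the error function, since $P(Z < z) = \frac{1}{2}\left[1 + \operatorname{erf}(z/\sqrt{2})\right]$. The sketch below relies on that identity; the function name `phi` is an illustrative choice.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cumulative probability P(Z < z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

print(round(phi(2.5) - phi(0), 4))      # 0.4938
print(round(phi(0) - phi(-1.25), 4))    # 0.3944
print(round(phi(-1), 4))                # 0.1587
print(round(phi(1) - phi(-2), 4))       # 0.8186
```

The last value differs from the table-based 0.8185 only because the table entries are themselves rounded to four decimal places before being added.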

Example 5:
Let $Z: N(0, 1)$. Find $c$ so that:
a) $P(0 < Z < c) = 0.497$ → $c = 2.75$
b) $P(c < Z < 0) = 0.291$ → $c = -0.81$
c) $P(Z < c) = 0.975$ → $P(0 < Z < c) = 0.975 - 0.5 = 0.475$ → $c = 1.96$
d) $P(Z < c) = 0.015$ → $P(c < Z < 0) = 0.5 - 0.015 = 0.485$ → $c = -2.17$
e) $P(1 < Z < c) = 0.1557$ → $P(0 < Z < c) - P(0 < Z < 1) = P(0 < Z < c) - 0.3413 = 0.1557$ →
$P(0 < Z < c) = 0.497$ → $c = 2.75$

Converting a Normal Distribution to Standard Normal
Theorem:
Let $X$ be a normal random variable with a mean of $\mu$ and a standard deviation of $\sigma$, and let
$Z = \frac{X - \mu}{\sigma}$
Then the random variable $Z$ is a standard normal random variable.
Example 6:
Let $X: N(50, 100)$, so that $\sigma = \sqrt{100} = 10$. Find
a) $P(50 < X < 75) = P\left(\frac{50 - 50}{10} < \frac{X - 50}{10} < \frac{75 - 50}{10}\right) = P(0 < Z < 2.5) = 0.4938$
b) $P(X > 38) = P(Z > -1.2) = 0.5 + P(-1.2 < Z < 0) = 0.5 + 0.3849 = 0.8849$
c) $P(55 < X < 75) = P(0.5 < Z < 2.5) = 0.4938 - 0.1915 = 0.3023$
d) Find the value of $c$ if $P(X < c) = 0.9$ → $P(Z < d) = 0.9$ → $P(0 < Z < d) = 0.4$, where $d = \frac{c - 50}{10}$
→ $d \cong 1.28$ → $c = 50 + 1.28 \times 10 = 62.8$.
e) $P_5$ → $P(X < P_5) = 0.05$ → $P(Z < d) = 0.05$ → $P(d < Z < 0) = 0.45$, where $d = \frac{P_5 - 50}{10}$
→ $d \cong -1.645$ → $P_5 = 50 - 1.645 \times 10 = 33.55$.
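Python's standard library (version 3.8 or later) provides `statistics.NormalDist`, which performs the standardization internally. The sketch below recomputes Example 6 with it; note that the class takes the standard deviation, not the variance.

```python
from statistics import NormalDist

X = NormalDist(mu=50, sigma=10)   # N(50, 100): variance 100, so sigma = 10

print(round(X.cdf(75) - X.cdf(50), 4))   # 0.4938
print(round(1 - X.cdf(38), 4))           # 0.8849
print(round(X.cdf(75) - X.cdf(55), 4))   # 0.3023
print(round(X.inv_cdf(0.90), 2))         # 62.82
print(round(X.inv_cdf(0.05), 2))         # 33.55
```

The fourth value is 62.82 rather than 62.8 because the table lookup rounds $d$ to 1.28, while `inv_cdf` uses the unrounded value.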

Example 7:
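The mean of a normal random variable $X$ is $\mu = 100$, and $P(X > 121) = 0.1587$.
Find the variance of $X$, $\sigma^2$.
$P(X > 121) = 0.1587$ → $P(Z > d) = 0.1587$ → $P(0 < Z < d) = 0.5 - 0.1587 = 0.3413$, where $d = \frac{121 - 100}{\sigma} = \frac{21}{\sigma}$
→ $d = 1.00$ → $\sigma = \frac{21}{1.00} = 21$ → $\sigma^2 = \left(\frac{21}{1.00}\right)^2 = 441$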
Section 6-2: Applications of the Normal Distribution
The standard normal distribution curve can be used to solve a wide variety of practical problems. The only
requirement is that the variable be normally or approximately normally distributed. To solve problems by
using the standard normal distribution, transform the original variable to a standard normal and use the
table to find the required areas and probabilities.
Example 8:
A study shows that the systolic blood pressures for adults in a certain population are approximately
normally distributed with mean of 120 and standard deviation of 8.
a) Find the probability that a randomly selected person will have a blood pressure between 110 and 130.
$P(110 < X < 130) = P(-1.25 < Z < 1.25) = 2P(0 < Z < 1.25) = 2(0.3944) = 0.7888$.
b) What percentage of the population have blood pressures less than 140?
$P(X < 140) = P(Z < 2.5) = 0.5 + P(0 < Z < 2.5) = 0.9938$, that is, about 99.38%.
c) For a medical study, a researcher wishes to select people in the middle 60% of the population based on
blood pressure. Find the upper and lower readings that would qualify people to participate in the study.
The middle 60% lies between $Z = -c$ and $Z = c$, where $P(0 < Z < c) = 0.3$ → $c \cong 0.84$
The lower bound is $a = 120 - 0.84(8) = 113.28 \cong 113$
The upper bound is $b = 120 + 0.84(8) = 126.72 \cong 127$
d) A sample of 5 people is selected at random; find the probability that exactly 2 of the selected people have
blood pressure less than 130.
The probability that a person will have a blood pressure less than 130 is
$P(X < 130) = P\left(\frac{X - 120}{8} < \frac{130 - 120}{8}\right) = P(Z < 1.25) = 0.5 + P(0 < Z < 1.25) = 0.8944$
The number of people in the sample whose blood pressure is less than 130, say $Y$, is a binomial random
variable with $n = 5$ and $p = 0.8944$.
We want to find $P(Y = 2) = \binom{5}{2}(0.8944)^2 (1 - 0.8944)^{5 - 2} = 10 \times 0.8944^2 \times 0.1056^3 = 0.009$
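Part (d) of Example 8 chains two distributions: a normal probability becomes the success probability of a binomial. The sketch below shows that chain; it assumes Python 3.8 or later for `statistics.NormalDist` and `math.comb`, and the variable names are illustrative.

```python
from math import comb
from statistics import NormalDist

pressure = NormalDist(mu=120, sigma=8)
p = pressure.cdf(130)          # P(X < 130), the binomial success probability

n, k = 5, 2
p_exactly_2 = comb(n, k) * p ** k * (1 - p) ** (n - k)

print(round(p, 4))             # 0.8944
print(round(p_exactly_2, 3))   # 0.009
```

The result is small because a reading below 130 is very likely for each person, so seeing only 2 such readings out of 5 is rare.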
Example 9:
The final exam scores in a statistics class are normally distributed with a mean of 70 and a standard
deviation of 10.
a) If the lowest passing mark is 60, what proportion of the class fails?
$P(X < 60) = P(Z < -1) = 0.5 - 0.3413 = 0.1587$.
To be more accurate, we must evaluate $P(X < 59.5) = P(Z < -1.05) = 0.5 - 0.3531 = 0.1469$.
b) If the highest 80% are to pass, what should be the lowest passing score?
$P(X > c) = 0.8$ → $P(Z > d) = 0.8$ → $P(d < Z < 0) = 0.3$, where $d = \frac{c - 70}{10}$
→ $d \cong -0.84$ → $c = 70 - 0.84 \times 10 = 61.6$ → $c = 62$.

Example 10:
The sick-leave time of employees in a firm in a month is normally distributed with a mean of 100 hours and a
standard deviation of 20 hours.
a) What is the probability that the sick-leave time for next month will be between 50 and 80 hours?
Let $X$ denote the sick-leave time per month. Then $X: N(100, 20^2)$, and we want to find $P(50 < X < 80)$.
$P(50 < X < 80) = P\left(\frac{50 - 100}{20} < \frac{X - 100}{20} < \frac{80 - 100}{20}\right) = P(-2.5 < Z < -1) = 0.4938 - 0.3413 = 0.1525$
b) How much time should be budgeted for sick leave if the budgeted amount should be exceeded with a
probability of only 10%?
Let $c$ denote the time that should be budgeted for sick leave. Then $P(X > c) = 0.1$
$P(X > c) = P\left(Z > \frac{c - 100}{20}\right) = 0.1$ → $P\left(0 < Z < \frac{c - 100}{20}\right) = 0.5 - 0.1 = 0.4$
By using the table, it is found that $\frac{c - 100}{20} = 1.28$ → $c = 100 + 1.28(20) = 125.6$ hours.
Section 6-3: The Central Limit Theorem

In addition to knowing how individual data values vary about the mean for a population, statisticians are
interested in knowing how the means of samples of the same size taken from the same population vary
about the population mean.
Distribution of Sample Means

Suppose a researcher selects a sample of 30 adults and finds the mean of blood pressures for the sample to
be 121. Then suppose a second sample is selected, and the mean of that sample is found to be 125. Continue
the process for 100 samples. What happens then is that the mean becomes a random variable, and the
sample means 121, 125,…, 119 constitute a sampling distribution of sample means.
A sampling distribution of sample means is a distribution using the means computed from all possible
random samples of a specific size taken from a population.

Example 11:
Given a population {1, 3, 5, 7}, note that $\mu = 4$ and $\sigma^{2} = 5$.
a) If a random sample of size $n = 2$ is to be selected from this population (the selection is with
replacement), find the sampling distribution of the sample mean $\bar{X}$ and evaluate $\mu_{\bar{X}}$ and $\sigma_{\bar{X}}$.
The following table shows all possible samples corresponding with their means.
Sample 1,1 1,3 1,5 1,7 3,1 3,3 3,5 3,7 5,1 5,3 5,5 5,7 7,1 7,3 7,5 7,7
𝑿 1 2 3 4 2 3 4 5 3 4 5 6 4 5 6 7
Then the sampling distribution of the sample mean will be:
𝑿 1 2 3 4 5 6 7
𝑷 𝑿 1/16 2/16 3/16 4/16 3/16 2/16 1/16
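$\mu_{\bar{X}} = E(\bar{X}) = \sum \bar{X}\,P(\bar{X}) = 1\left(\frac{1}{16}\right) + 2\left(\frac{2}{16}\right) + 3\left(\frac{3}{16}\right) + 4\left(\frac{4}{16}\right) + 5\left(\frac{3}{16}\right) + 6\left(\frac{2}{16}\right) + 7\left(\frac{1}{16}\right) = 4 = \mu$
$\sigma_{\bar{X}}^{2} = E(\bar{X} - \mu_{\bar{X}})^{2} = \sum (\bar{X} - 4)^{2}\,P(\bar{X})$
$= (1-4)^{2}\frac{1}{16} + (2-4)^{2}\frac{2}{16} + (3-4)^{2}\frac{3}{16} + (4-4)^{2}\frac{4}{16} + (5-4)^{2}\frac{3}{16} + (6-4)^{2}\frac{2}{16} + (7-4)^{2}\frac{1}{16} = \frac{5}{2} = \frac{\sigma^{2}}{n}$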
[Figure: distribution of $\bar{X}$ for $n = 2$, compared with the normal distribution]

b) If a random sample of size $n = 3$ is to be selected from this population (the selection is with
replacement), find the sampling distribution of the sample mean $\bar{X}$ and evaluate $\mu_{\bar{X}}$ and $\sigma_{\bar{X}}$.
The following table shows all possible samples corresponding with their means.
Sample 1,1,1 1,1,3 1,1,5 1,1,7 1,3,1 1,3,3 1,3,5 1,3,7 1,5,1 1,5,3 1,5,5 1,5,7 1,7,1
𝑋 1 5/3 7/3 3 5/3 7/3 3 11/3 7/3 3 11/3 13/3 3
Sample 1,7,3 1,7,5 1,7,7 3,1,1 3,1,3 3,1,5 3,1,7 3,3,1 3,3,3 3,3,5 3,3,7 3,5,1 3,5,3
𝑋 11/3 13/3 5 5/3 7/3 3 11/3 7/3 3 11/3 13/3 3 11/3
Sample 3,5,5 3,5,7 3,7,1 3,7,3 3,7,5 3,7,7 5,1,1 5,1,3 5,1,5 5,1,7 5,3,1 5,3,3 5,3,5
𝑋 13/3 5 11/3 13/3 5 17/3 7/3 3 11/3 13/3 3 11/3 13/3
Sample 5,3,7 5,5,1 5,5,3 5,5,5 5,5,7 5,7,1 5,7,3 5,7,5 5,7,7 7,1,1 7,1,3 7,1,5 7,1,7
𝑋 5 11/3 13/3 5 17/3 13/3 5 17/3 19/3 3 11/3 13/3 5
Sample 7,3,1 7,3,3 7,3,5 7,3,7 7,5,1 7,5,3 7,5,5 7,5,7 7,7,1 7,7,3 7,7,5 7,7,7
𝑋 11/3 13/3 5 17/3 13/3 5 17/3 19/3 5 17/3 19/3 7
Then the sampling distribution of the sample mean will be:
Sample mean 1 5/3 7/3 3 11/3 13/3 5 17/3 19/3 7

𝑷 𝑿 1/64 3/64 6/64 10/64 12/64 12/64 10/64 6/64 3/64 1/64
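$\mu_{\bar{X}} = E(\bar{X}) = \sum \bar{X}\,P(\bar{X}) = 4 = \mu$ and $\sigma_{\bar{X}}^{2} = E(\bar{X} - \mu_{\bar{X}})^{2} = \sum (\bar{X} - 4)^{2}\,P(\bar{X}) = \frac{5}{3} = \frac{\sigma^{2}}{n}$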
[Figure: distribution of $\bar{X}$ for $n = 3$, compared with the normal distribution]

Notice that the distribution of $\bar{X}$ when $n = 3$ is closer to the normal distribution than the distribution when
$n = 2$.
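The enumeration in Example 11 can be automated. The sketch below lists every ordered sample drawn with replacement, tabulates the sample means, and computes the mean and variance of the resulting distribution exactly with fractions. The function names are illustrative and not part of the course material.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def sampling_distribution(population, n):
    """Exact distribution of the sample mean for ordered samples of size n,
    drawn with replacement from the given population."""
    samples = list(product(population, repeat=n))
    counts = Counter(Fraction(sum(s), n) for s in samples)
    total = len(samples)
    return {mean: Fraction(c, total) for mean, c in sorted(counts.items())}


def mean_and_variance(dist):
    """Mean and variance of a discrete distribution given as {value: probability}."""
    mu = sum(x * p for x, p in dist.items())
    var = sum((x - mu) ** 2 * p for x, p in dist.items())
    return mu, var


population = [1, 3, 5, 7]
for n in (2, 3):
    dist = sampling_distribution(population, n)
    mu, var = mean_and_variance(dist)
    print(f"n={n}: mean={mu}, variance={var}")
```

For $n = 2$ this gives mean 4 and variance 5/2, and for $n = 3$ it gives mean 4 and variance 5/3, matching $\sigma^{2}/n$ in both cases.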
The Sampling Distribution of the Sample Mean, $\bar{X}$
If a random sample of $n$ observations is selected from a large population with mean $\mu$ and variance $\sigma^{2}$,
then
1. The sampling distribution of the sample mean $\bar{X}$ will have mean $\mu_{\bar{X}} = \mu$ and variance $\sigma_{\bar{X}}^{2} = \frac{\sigma^{2}}{n}$.
2. The standard deviation of $\bar{X}$ is called the standard error of $\bar{X}$: $S.E(\bar{X}) = \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$.
3. If the population has a normal distribution, the sampling distribution of $\bar{X}$ will be exactly
normally distributed, regardless of the sample size $n$.
4. If the population distribution is not normal, the sampling distribution of $\bar{X}$ will be
approximately normally distributed for large samples ($n \ge 30$). This result is called the
Central Limit Theorem.
Example 12:
A random sample is selected from a population the mean of which is 𝜇 = 120 and the standard deviation is
𝜎 = 8.
a) What is the expected value of the sample mean?

$E(\bar{X}) = \mu = 120$
b) How large must the random sample be if we want the standard error of the sample mean to be 0.5 or
less?
$S.E(\bar{X}) = \frac{8}{\sqrt{n}} \le 0.5 \rightarrow \frac{\sqrt{n}}{8} \ge 2 \rightarrow \sqrt{n} \ge 16 \rightarrow n \ge 256$; the sample size must be 256 or more.
c) How large must the random sample be if we want the standard error of the sample mean to be 2 or less?
$S.E(\bar{X}) \le 2 \rightarrow \frac{8}{\sqrt{n}} \le 2 \rightarrow n \ge \left(\frac{8}{2}\right)^{2} = 16$. The sample size must be 16 or more.
d) If the sample size is very large, how does this affect the expected value of the sample mean and its
standard error?
The expected value of the sample mean is $E(\bar{X}) = \mu$ for any sample size.
The standard error gets smaller as the sample size increases. When the sample size is very large, the
standard error will be close to 0.
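Parts (b) and (c) of Example 12 both solve $\sigma/\sqrt{n} \le \text{S.E}$ for $n$. A minimal sketch of that calculation follows; the function name and the small tolerance are assumptions made for illustration (the tolerance only guards against floating-point noise when the exact answer is a whole number).

```python
import math


def min_sample_size_for_se(sigma: float, max_se: float) -> int:
    """Smallest n with sigma / sqrt(n) <= max_se, i.e. n >= (sigma / max_se)^2."""
    n_raw = (sigma / max_se) ** 2
    # Subtract a tiny tolerance so that a result such as 256.0000000001,
    # produced by rounding error, is not pushed up to 257.
    return math.ceil(n_raw - 1e-9)


print(min_sample_size_for_se(8, 0.5))  # part (b)
print(min_sample_size_for_se(8, 2))    # part (c)
```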
Example 13:
The weights of 10-year-old boys are normally distributed with mean of 44 kilograms and standard deviation
of 5 kilograms.
a) A boy is selected at random, what is the probability that he weighs between 43 and 45 kilograms?
$P(43 < X < 45) = P\left(\frac{43 - 44}{5} < \frac{X - 44}{5} < \frac{45 - 44}{5}\right) = P(-0.2 < Z < 0.2) = 2(0.0793) = 0.1586$.
b) 4 boys are to be selected, what is the probability that their mean weight will be between 43 and 45
kilograms?
Here $\bar{X} \sim N\left(44, \frac{25}{4}\right) \rightarrow P(43 < \bar{X} < 45) = P(-0.4 < Z < 0.4) = 2(0.1554) = 0.3108$
c) A random sample of 25 boys is selected, what is the probability that their mean weight will be between
43 and 45 kilograms?
Here $\bar{X} \sim N\left(44, \frac{25}{25}\right) = N(44, 1) \rightarrow P(43 < \bar{X} < 45) = P(-1 < Z < 1) \cong 2(0.3413) = 0.6826$
d) Suppose that all random samples of size 25 are selected. Give an interval that is symmetric about the population
mean and contains about 95% of all the sample means.
Here, $\bar{X} \sim N(44, 1)$, as shown in the previous part.
We want to find two numbers $a$ and $b$ such that
$P(a < \bar{X} < b) = 0.95$, where 44 is at the center between $a$ and $b$.
$\rightarrow P(a < \bar{X} < 44) = P(44 < \bar{X} < b) = \frac{0.95}{2} = 0.475$.
From the table, an area of 0.475 corresponds to $z = 1.96$, so $a = 44 - 1.96(1) \cong 42$ and $b = 44 + 1.96(1) \cong 46$ kilograms.
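Example 13 shows how the same interval (43, 45) captures more probability as the sample size grows. The sketch below repeats parts (a) to (d) numerically; it assumes Python 3.8 or later for `statistics.NormalDist`, and the helper name `prob_mean_between` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

MU, SIGMA = 44, 5  # population mean and standard deviation from the example


def prob_mean_between(low: float, high: float, n: int) -> float:
    """P(low < sample mean < high) for samples of size n from N(MU, SIGMA^2)."""
    xbar = NormalDist(MU, SIGMA / sqrt(n))  # standard error = sigma / sqrt(n)
    return xbar.cdf(high) - xbar.cdf(low)


# Parts (a), (b), (c): n = 1 (a single boy), n = 4, n = 25.
for n in (1, 4, 25):
    print(n, round(prob_mean_between(43, 45, n), 4))

# Part (d): symmetric interval holding about 95% of the sample means, n = 25.
z = NormalDist().inv_cdf(0.975)
se = SIGMA / sqrt(25)
a, b = MU - z * se, MU + z * se
print(round(a, 2), round(b, 2))
```

The probabilities differ from the table-based answers only in the fourth decimal place, which comes from rounding in the table.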

Example 14:
The average number of milligrams of sodium in a certain brand of low-salt microwave frozen dinners is 660
mg, and the standard deviation is 35 mg.
a) If a sample of 10 dinners is selected, can you find the probability that the mean of the sample will be
larger than 670 mg? If yes, find it.
We cannot find the probability because we do not know the distribution of the sample mean: the population is
not stated to be normal, and we cannot apply the Central Limit Theorem since $n = 10 < 30$.
b) If a sample of 50 dinners is selected, can you find the probability that the mean of the sample will be
larger than 670 mg? If yes, find it
Here, we can apply the Central Limit Theorem since $n = 50 > 30$.
$P(\bar{X} > 670) = P\left(\frac{\bar{X} - 660}{35/\sqrt{50}} > \frac{670 - 660}{35/\sqrt{50}}\right) = P(Z > 2.02) = 0.5 - 0.4783 = 0.0217$
PROBLEMS
1. Suppose that $f(x) = \frac{1}{2}$, $3 < x < 5$, is a probability distribution function; find:
a) 𝑃(3.5 < 𝑋 < 4.5) 0.5

b) 𝑃(𝑋 > 4) 0.5
c) 𝑐, if 𝑃 4 < 𝑋 < 𝑐 = 0.2 4.4
d) The mean 4
e) The median 4
2. Suppose𝑍 is a standard normal random variable, find
a) 𝑃 0 < 𝑍 < 1.25 0.3944

b) 𝑃 𝑍>2 0.0228
c) 𝑃 1<𝑍<2 0.1359
d) $P(Z > -2)$ 0.9772
e) $P\left(-1 < Z < \frac{2}{3}\right)$ 0.5899
f) $P(-1 < Z < 10)$ 0.8413
g) First quartile, $Q_1$ −0.67
h) Third quartile, 𝑄3 0.67

3. Suppose𝑍 is a standard normal random variable, find 𝑐 so that
a) 𝑃 0 < 𝑍 < 𝑐 = .17 0.44

b) 𝑃 𝑍 < 𝑐 = .9 1.28
c) 𝑃 𝑍 < 𝑐 = .352 – 0.38
d) 𝑃 −1 < 𝑍 < 𝑐 = .1156 – 0.6
4. Assume $X$ is normally distributed with mean 40 and variance 16. Determine:
a) 𝑃 𝑋 < 46 0.9332
b) 𝑃 𝑋 > 38 0.6915
c) 𝑃 32 < 𝑋 < 48 0.9544
d) 𝑃 −2 < 𝑋 < 36 0.1587
e) 𝑐 so that 𝑃 𝑋 < 𝑐 = .95 46.58
f) 𝑐 so that 𝑃 𝑐 < 𝑋 < 40 = .3 36.64
g) 𝑐 so that 𝑃 −𝑐 < 𝑋 − 40 < 𝑐 = .95 7.84
h) The first quartile, 𝑄1 37.32
i) The third quartile, 𝑄3 42.68
5. The IQs of individuals admitted to a school for the mentally retarded are approximately normally
distributed with a mean of 60 and a standard deviation of 10.
a) Find the proportion of IQs that exceed 75. 0.0668

b) What is the probability that an individual selected at random will have an IQ between 55 and 75?
0.6247
c) What is the IQ above which 90% of the IQs lie? 47.2
d) If 10 individuals are selected at random, what is the probability that exactly 2 of them will have IQ
greater than 75? 0.115
6. The time it takes a cell to divide is normally distributed with an average time of one hour and a
standard deviation of 5 minutes.
a) What is the probability that it takes a cell more than 65 minutes to divide? 0.1587
b) What is the proportion of cells that divide in less than 45 minutes? 0.0013
c) Find the time that is exceeded by 15% of cell divisions. 65.2 minutes
7. A nurse supervisor has found that staff nurses, on the average, complete a certain task in 10 minutes. If
the times required to complete the task are approximately normally distributed with a standard
deviation of 3 minutes, find:
a) The proportion of nurses completing the task in less than 4 minutes. 0.0228
b) The proportion of nurses requiring more than 5 minutes to complete the task. 0.9525
c) The probability that a nurse will complete the task within 3 minutes. 0.0099

8. A doctor goes daily from his home to his clinic. The trip takes 24 minutes on the average, with a
standard deviation of 4 minutes. Assume the distribution of trip times is normally distributed.
a) What is the probability that a trip will take at least 15 minutes? 0.9878
b) If the clinic opens at 9:00 a.m. and he leaves his house at 8:30 a.m. daily, on what percentage of days will he
be late for work? 0.0668 (that is, 6.68% of days)
9. Random samples of size 𝑛 were selected from populations with the means and variances given here.
Find the mean and the standard error of the sample mean in each case:
a) 𝑛 = 36, 𝜇 = 10, 𝜎 2 = 9 𝜇𝑥 = 10 and 𝑆. 𝐸 = 0.5

b) 𝑛 = 100, 𝜇 = 5, 𝜎 2 = 4 𝜇𝑥 = 5 and 𝑆. 𝐸 = 0.2
c) 𝑛 = 8, 𝜇 = 120, 𝜎 2 = 1 𝜇𝑥 = 120 and 𝑆. 𝐸 = 0.354
10. Refer to Problem 9
a) If the sampled populations are normal, what is the sampling distribution of 𝑋 for parts a, b, and c.
𝑋 is normally distributed
b) According to the Central Limit Theorem, if the sampled populations are not normal, what can be
said about the sampling distribution of 𝑋 for parts a, b, and c.
If the population is not normal then the sample mean 𝑋 is approximately normally distributed in parts
(a) and (b) since the sample size 𝑛 is greater than 30 in these parts.
In part (c) the Central Limit Theorem does not guarantee that $\bar{X}$ is approximately normal, since the sample size $n$ is less than 30.
11. A normal population has mean 100 and variance 40. How large must a random sample be if we want
standard error of the sample mean to be 1.5 or less? 18 or more
12. Suppose a random sample of 𝑛 = 25 observations is to be selected from a population that is normally
distributed, with mean equal to 106 and standard deviation equal to 12.
a) Find the probability that the sample mean 𝑋 exceeds 110. 0.0475
b) Find the probability that the sample mean deviates from the population mean by no more than 4
units. 0.905
13. The IQ scores of students at a certain university are normally distributed with a mean of 125 and a
standard deviation of 14.
a) What is the percentage of scores that are greater than 146? 6.68%
b) What is the probability that a random sample of 49 students will have a mean IQ score greater than
128? 0.0668
c) What is the probability that the mean will be between 122.5 and 126.0? 0.5859

14. The normal daily human potassium requirement is in the range of 2000 to 6000 mg. Suppose that the
amount of potassium in a banana is normally distributed with mean 630 mg and standard deviation of
40 mg. If you eat 3 bananas per day, find the probability that your total daily intake of potassium from
the 3 bananas will be in the required range. 0.0559
See and solve the following examples and exercises from the textbook:
Section 6 – 1 :
Examples: 1, 2, 3, 4, 5
Exercises: All odd-number problems
Section 6 – 2 :
Examples: 6, 7, 8, 9, 10
Exercises: 5, 8, 10, 11, 14, 25, 26, 35, 38
Section 6 – 3 :
Examples: 13, 14, 15
Exercises: 9, 11, 15, 17, 18, 23

AREAS UNDER THE NORMAL DISTRIBUTION
z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
0.0 .0000 .0040 .0080 .0120 .0160 .0199 .0239 .0279 .0319 .0359
0.1 .0398 .0438 .0478 .0517 .0557 .0596 .0636 .0675 .0714 .0753
0.2 .0793 .0832 .0871 .0910 .0948 .0987 .1026 .1064 .1103 .1141
0.3 .1179 .1217 .1255 .1293 .1331 .1368 .1406 .1443 .1480 .1517
0.4 .1554 .1591 .1628 .1664 .1700 .1736 .1772 .1808 .1844 .1879
0.5 .1915 .1950 .1985 .2019 .2054 .2088 .2123 .2157 .2190 .2224
0.6 .2257 .2291 .2324 .2357 .2389 .2422 .2454 .2486 .2517 .2549
0.7 .2580 .2611 .2642 .2673 .2704 .2734 .2764 .2794 .2823 .2852
0.8 .2881 .2910 .2939 .2967 .2995 .3023 .3051 .3078 .3106 .3133
0.9 .3159 .3186 .3212 .3238 .3264 .3289 .3315 .3340 .3365 .3389
1.0 .3413 .3438 .3461 .3485 .3508 .3531 .3554 .3577 .3599 .3621
1.1 .3643 .3665 .3686 .3708 .3729 .3749 .3770 .3790 .3810 .3830
1.2 .3849 .3869 .3888 .3907 .3925 .3944 .3962 .3980 .3997 .4015
1.3 .4032 .4049 .4066 .4082 .4099 .4115 .4131 .4147 .4162 .4177
1.4 .4192 .4207 .4222 .4236 .4251 .4265 .4279 .4292 .4306 .4319
1.5 .4332 .4345 .4357 .4370 .4382 .4394 .4406 .4418 .4429 .4441
1.6 .4452 .4463 .4474 .4484 .4495 .4505 .4515 .4525 .4535 .4545
1.7 .4554 .4564 .4573 .4582 .4591 .4599 .4608 .4616 .4625 .4633
1.8 .4641 .4649 .4656 .4664 .4671 .4678 .4686 .4693 .4699 .4706
1.9 .4713 .4719 .4726 .4732 .4738 .4744 .4750 .4756 .4761 .4767
2.0 .4772 .4778 .4783 .4788 .4793 .4798 .4803 .4808 .4812 .4817
2.1 .4821 .4826 .4830 .4834 .4838 .4842 .4846 .4850 .4854 .4857
2.2 .4861 .4864 .4868 .4871 .4875 .4878 .4881 .4884 .4887 .4890
2.3 .4893 .4896 .4898 .4901 .4904 .4906 .4909 .4911 .4913 .4916
2.4 .4918 .4920 .4922 .4925 .4927 .4929 .4931 .4932 .4934 .4936
2.5 .4938 .4940 .4941 .4943 .4945 .4946 .4948 .4949 .4951 .4952
2.6 .4953 .4955 .4956 .4957 .4959 .4960 .4961 .4962 .4963 .4964
2.7 .4965 .4966 .4967 .4968 .4969 .4970 .4971 .4972 .4973 .4974
2.8 .4974 .4975 .4976 .4977 .4977 .4978 .4979 .4979 .4980 .4981
2.9 .4981 .4982 .4982 .4983 .4984 .4984 .4985 .4985 .4986 .4986
3.0 .4987 .4987 .4987 .4988 .4988 .4989 .4989 .4989 .4990 .4990
3.1 .4990 .4991 .4991 .4991 .4992 .4992 .4992 .4992 .4993 .4993
3.2 .4993 .4993 .4994 .4994 .4994 .4994 .4994 .4995 .4995 .4995
3.3 .4995 .4995 .4996 .4996 .4996 .4996 .4996 .4996 .4996 .4997
3.4 .4997 .4997 .4997 .4997 .4997 .4997 .4997 .4997 .4998 .4998
3.5 .4998 .4998 .4998 .4998 .4998 .4998 .4998 .4998 .4998 .4998

BIOSTATISTICS
CHAPTER 7: Confidence Intervals and Sample Size
Introduction
As stated in Chapter 1, Statistics is divided into two branches; descriptive and inferential. In this chapter we
will learn about the first area of the inferential statistics which is estimation. Before discussing this topic, it is
better to recall the following concepts:
A parameter is a measure that describes the population of interest, such as the population mean $\mu$, population
proportion $p$, and population standard deviation $\sigma$. Parameters are considered constants, and they are
difficult to evaluate for large populations, so they are estimated.
A statistic is a measure that describes a sample selected from the population, such as the sample mean $\bar{x}$,
sample proportion $\hat{p}$, and sample standard deviation $s$. Statistics are considered random variables, and
they are computed to estimate the population parameters.
Estimation is the process of estimating the value of a parameter from information obtained from a sample.
Point Estimation
A point estimate is a specific numerical value estimate of a parameter. The best point estimator of the
population mean $\mu$ is the sample mean $\bar{x}$.
For example, suppose a college president wishes to estimate the mean age of students attending classes this
semester. The president could select a random sample of 100 students and find the mean age of these students,
say 21.3 years. From the sample mean, the president could infer that the mean age of all the students is 21.3
years. This type of estimation is called point estimation.
Example 1:
A random sample of 20 men was selected and gave the following data for the hemoglobin reading. Based on
this sample, estimate the mean hemoglobin reading for all men in the population.
17 18 16 19 15 17 15 16 18 15
13 15 14 17 15 17 15 16 14 14
The best point estimator is the sample mean. Here, $\bar{x} = \frac{\sum x}{n} = \frac{17 + 18 + \cdots + 14}{20} = 15.8$
The point estimate for the proportion of men whose hemoglobin readings are above 17 is $\hat{p} = \frac{3}{20} = 0.15$

A good estimator should satisfy the following properties:
1. The estimator should be an unbiased estimator. That is, the expected value of the estimator is equal to
the parameter being estimated.
2. The estimator should be a consistent estimator. That is, the value of the estimator approaches the value
of the parameter estimated as the sample size increases.
Example 2:
Show that the sample mean $\bar{x}$ is an unbiased and consistent estimator for the population mean $\mu$.
1. To show that $\bar{x}$ is an unbiased estimator for $\mu$, we must show $E(\bar{x}) = \mu$.
The sampling distribution of the sample mean (Section 6-3) has mean $\mu$, which means $E(\bar{x}) = \mu$.
2. To show that $\bar{x}$ is a consistent estimator, we must find the variance of $\bar{x}$. From the same result,
$var(\bar{x}) = \frac{\sigma^{2}}{n}$. The variance of $\bar{x}$ decreases as $n$ increases, which means
the values of $\bar{x}$ will be closer to $\mu$ as $n$ increases.
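The two properties in Example 2 can also be derived directly. The short derivation below is a sketch under one stated assumption that the text does not spell out: the observations $X_1, \dots, X_n$ are independent, each with mean $\mu$ and variance $\sigma^{2}$.

```latex
\begin{align*}
E(\bar{x}) &= E\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
            = \frac{1}{n}\sum_{i=1}^{n} E(X_i)
            = \frac{1}{n}\,(n\mu) = \mu, \\
\operatorname{var}(\bar{x}) &= \frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{var}(X_i)
            = \frac{1}{n^{2}}\,(n\sigma^{2})
            = \frac{\sigma^{2}}{n} \longrightarrow 0 \quad \text{as } n \to \infty .
\end{align*}
```

The first line gives unbiasedness. The second line shows the spread of $\bar{x}$ around $\mu$ shrinking to zero, which is the idea behind consistency.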
Section 7-1: Confidence Intervals for 𝝁 When 𝝈 is Known

Another type of estimate of a parameter is an interval estimate, which is an interval or a range of values used to
estimate the parameter. Note that an interval estimate may or may not contain the value of the parameter
estimated.
In an interval estimate, the parameter is specified as being between two values. For example, an interval
estimate for the mean age of all students might be between 20.9 and 21.7 years, or it might be 21.3 ± 0.4 years.
Either the interval contains the parameter or it does not. A degree of confidence can be assigned before an
interval estimate is made. For instance, you may wish to be 95% confident that the interval contains the true
parameter.
A confidence interval is a specific interval estimate of a parameter determined by using data obtained from a
sample and by using the specific confidence level of the estimate.
The confidence level of an interval estimate of a parameter is the probability that the interval estimate will
contain the parameter. The confidence level is denoted by $(1 - \alpha)100\%$.
For example, a 95% confidence interval means the probability that the interval estimate will contain the
parameter is 0.95 and the probability that the interval estimate will not contain the parameter is 0.05; this
probability is denoted by $\alpha$ (this means that $1 - \alpha = 0.95$ and $\alpha = 0.05$).

How to construct a confidence interval for $\mu$ when $\sigma^{2}$ is known:
1. Here, the population must be normally distributed or the sample size $n \ge 30$.
2. Select a random sample and evaluate the sample mean $\bar{x}$ (which is a point estimator for $\mu$).
3. Evaluate the standard error of $\bar{x}$, $S.E = \frac{\sigma}{\sqrt{n}}$.
4. Determine the confidence level $1 - \alpha$, and evaluate $z_{\alpha/2}$ by solving the equation $P(Z > z_{\alpha/2}) = \frac{\alpha}{2}$.
For example, $z_{0.1} = 1.28$, $z_{0.05} = 1.645$, $z_{0.025} = 1.96$, $z_{0.01} = 2.33$, and $z_{0.005} = 2.575$.
5. A $(1 - \alpha)100\%$ confidence interval for $\mu$ is
$\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = \bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$
6. The term $E = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ is called the maximum error of the estimate or the margin of error.
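The steps above can be sketched as a small function. This is an illustration only: the function name is an assumption, and the critical value comes from `statistics.NormalDist` (Python 3.8 or later) rather than from the rounded table, so the last digits may differ slightly from hand calculations. The numbers used are those of the hospital noise example that follows (sample mean 61.2, $\sigma = 8.0$, $n = 85$).

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(xbar: float, sigma: float, n: int, confidence: float):
    """Confidence interval for the mean when sigma is known.

    Returns (lower limit, upper limit, margin of error).
    """
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    margin = z * sigma / sqrt(n)             # E = z_{alpha/2} * sigma / sqrt(n)
    return xbar - margin, xbar + margin, margin


low, high, margin = z_confidence_interval(61.2, 8.0, 85, 0.95)
print(round(low, 1), round(high, 1), round(margin, 1))
```

Doubling the sample size or lowering the confidence level both shrink `margin`, which is the pattern the worked examples illustrate.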
Example 3:
Noise levels at various area urban hospitals were measured in decibels; the population standard deviation
from a previous study was 8.0 decibels.
a) The mean of the noise levels in 85 randomly selected corridors was 61.2 decibels. Find a 95%
confidence interval of the true mean (population mean).
$n = 85 > 30$, $\bar{x} = 61.2$, $\sigma = 8.0$ (known), $\alpha = 0.05 \rightarrow \alpha/2 = 0.025 \rightarrow z_{0.025} = 1.96$
A 95% C.I is $61.2 \pm 1.96\frac{8.0}{\sqrt{85}} = 61.2 \pm 1.7 = (59.5, 62.9)$ (margin of error = 1.7)
Based on the given data, we can say with 95% confidence that the true mean falls between 59.5 and 62.9.
b) The mean of the noise levels in 85 randomly selected corridors was 61.2 decibels. Find a 98%
confidence interval of the true mean.
$n = 85 > 30$, $\bar{x} = 61.2$, $\sigma = 8.0$ (known), $\alpha = 0.02 \rightarrow \alpha/2 = 0.01 \rightarrow z_{0.01} = 2.33$
A 98% C.I is $61.2 \pm 2.33\frac{8.0}{\sqrt{85}} = 61.2 \pm 2.0 = (59.2, 63.2)$ (margin of error = 2.0)
Based on the given data, we can say with 98% confidence that the true mean falls between 59.2 and 63.2.

c) The mean of the noise levels in 170 randomly selected corridors was 60.8 decibels. Find a 95%
confidence interval of the true mean.
$n = 170 > 30$, $\bar{x} = 60.8$, $\sigma = 8.0$ (known), $\alpha = 0.05 \rightarrow \alpha/2 = 0.025 \rightarrow z_{0.025} = 1.96$
A 95% C.I is $60.8 \pm 1.96\frac{8.0}{\sqrt{170}} = 60.8 \pm 1.2 = (59.6, 62.0)$ (margin of error = 1.2)
Note: The larger the sample size, the smaller the margin of error (the narrower the C.I).
Example 4:
The average cholesterol level for a random sample of 25 adult women is 263 units with a standard deviation
of 43 units. From previous study, it is known that the population is normally distributed with a standard
deviation of 40 units. Find a 90% C.I for the mean cholesterol level for all adult women in the population.
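Here, the population is normally distributed, $n = 25 < 30$ (small), $\bar{x} = 263$, $s = 43$, $\sigma = 40$, and
$\alpha = 0.10 \rightarrow \alpha/2 = 0.05 \rightarrow z_{0.05} = 1.645$. Because $\sigma$ is known, it is used instead of $s$.
A 90% C.I is $263 \pm 1.645\frac{40}{\sqrt{25}} \cong 263 \pm 13.16 = (249.84, 276.16)$ (margin of error = 13.16)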
Sample size for Means

Sample size determination is closely related to statistical estimation. Often, you ask: how large a sample is
necessary to make an accurate estimate? The answer depends on three things: the margin of error, the
population variance, and the level of confidence. For the purpose of this chapter, it will be assumed that the
population variance is known or has been estimated from a previous study.
To find the required sample size, solve the margin-of-error formula for $n$:
$E = z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \rightarrow n = \left(\frac{z_{\alpha/2} \cdot \sigma}{E}\right)^{2}$

Example 5:
A scientist wishes to estimate the average depth of a river. He wants to be 98% confident that the estimate
is accurate within 2 feet. From a previous study, the standard deviation of the depths measured was 4.38
feet. Find the required sample size.
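$E = 2$, $\sigma = 4.38$, $\alpha = 0.02 \rightarrow \frac{\alpha}{2} = 0.01 \rightarrow z_{0.01} = 2.33$; we want to find $n$.
$E = z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \rightarrow 2 = 2.33\frac{4.38}{\sqrt{n}} \rightarrow n = \left(\frac{2.33 \times 4.38}{2}\right)^{2} = 26.04$, then round up to get $n = 27$.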
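The sample size formula for means, including the round-up step, can be sketched as follows. The function name is illustrative, and the critical value is passed in as read from the table (2.33 for 98% confidence) so the result matches the hand calculation; the tiny tolerance is an added guard against floating-point noise and is not part of the formula.

```python
import math


def sample_size_for_mean(z: float, sigma: float, margin: float) -> int:
    """n = (z * sigma / E)^2, rounded up to the next whole number."""
    n_raw = (z * sigma / margin) ** 2
    # The tolerance keeps an exact whole-number result from being bumped up
    # by rounding error; any genuine fraction still rounds up.
    return math.ceil(n_raw - 1e-9)


# River depth example: 98% confidence, sigma = 4.38 feet, E = 2 feet.
print(sample_size_for_mean(2.33, 4.38, 2))
```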
Section 7-2: Confidence Intervals for 𝝁 When 𝝈 Is Unknown

When 𝜎 is known and the sample size is 30 or more or the population is normally distributed, the confidence
interval for the mean can be found by using the 𝑧 distribution as shown in Section 7 – 1. However, most of the
time, the value of 𝜎 is unknown, so it will be estimated by the sample standard deviation 𝑠. When 𝑠 is used,
especially when the sample size is small, 𝑧 distribution is no longer usable to find confidence intervals. Instead
of 𝑧 distribution, another distribution will be used which is called 𝑡 distribution.
Characteristics of the 𝒕 Distribution
The 𝑡 distribution shares some characteristics of the normal distribution and differs from it in others. The 𝑡
distribution is similar to the standard normal distribution in these ways:
1. It is bell-shaped.
2. It is symmetric about the mean.
3. The mean, median, and mode are equal to 0.
The 𝑡 distribution differs from the standard normal distribution in the following ways:
1. The variance is greater than 1.
2. The 𝑡 distribution is a family of curves based on the concept of degrees of freedom, which is related to
the sample size.
3. As the sample size increases, the 𝑡 distribution approaches the standard normal distribution. See the
following figure.
[Figure: $t$ distribution curves for d.f. = 1, 5, and 10 compared with the standard normal ($z$) curve]

The following table gives the values of 𝑡𝛼 for 𝛼 = 0.1, 0.05, 0.025, 0.01, and 0.005
𝒕𝜶
D.f 𝒕.𝟏𝟎𝟎 𝒕.𝟎𝟓𝟎 𝒕.𝟎𝟐𝟓 𝒕.𝟎𝟏𝟎 𝒕.𝟎𝟎𝟓

1 3.078 6.314 12.706 31.821 63.657
2 1.886 2.920 4.303 6.965 9.925
3 1.638 2.353 3.182 4.541 5.841
4 1.533 2.132 2.776 3.747 4.604
5 1.476 2.015 2.571 3.365 4.032
6 1.440 1.943 2.447 3.143 3.707

7 1.415 1.895 2.365 2.998 3.499
8 1.397 1.860 2.306 2.896 3.355
9 1.383 1.833 2.262 2.821 3.250
10 1.372 1.812 2.228 2.764 3.169
11 1.363 1.796 2.201 2.718 3.106

12 1.356 1.782 2.179 2.681 3.055
13 1.350 1.771 2.160 2.650 3.012
14 1.345 1.761 2.145 2.624 2.977
15 1.341 1.753 2.131 2.602 2.947
16 1.337 1.746 2.120 2.583 2.921

17 1.333 1.740 2.110 2.567 2.898
18 1.330 1.734 2.101 2.552 2.878
19 1.328 1.729 2.093 2.539 2.861
20 1.325 1.725 2.086 2.528 2.845
21 1.323 1.721 2.080 2.518 2.831

22 1.321 1.717 2.074 2.508 2.819
23 1.319 1.714 2.069 2.500 2.807
24 1.318 1.711 2.064 2.492 2.797
25 1.316 1.708 2.060 2.485 2.787
26 1.315 1.706 2.056 2.479 2.779

27 1.314 1.703 2.052 2.473 2.771
28 1.313 1.701 2.048 2.467 2.763
29 1.311 1.699 2.045 2.462 2.756
∞ 1.282 1.645 1.960 2.326 2.576

Example 6:
If $\alpha = 0.05$ and the degrees of freedom are $D.F = 25$, find $t_{\alpha/2}$.
Here we want to find $t_{0.05/2} = t_{0.025} = 2.06$, the reading under $t_{0.025}$ in the row of $D.F = 25$.
To construct a confidence interval for $\mu$ when $\sigma^{2}$ is unknown:
1. The population must be normally distributed.
2. Select a random sample and evaluate the sample mean $\bar{x}$ (which is a point estimator for $\mu$).
3. Evaluate the sample standard deviation $s$; it will be used as an estimate for $\sigma$.
4. Estimate the standard error of $\bar{x}$, $S.E = \frac{s}{\sqrt{n}}$.
5. Determine the confidence level $1 - \alpha$.
6. Find $t_{\alpha/2}$ that corresponds to degrees of freedom $D.F = n - 1$.
7. A $(1 - \alpha)100\%$ confidence interval for $\mu$ is
$\left(\bar{x} - t_{\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + t_{\alpha/2}\frac{s}{\sqrt{n}}\right) = \bar{x} \pm t_{\alpha/2}\frac{s}{\sqrt{n}}$
8. The term $E = t_{\alpha/2}\frac{s}{\sqrt{n}}$ is called the maximum error of the estimate or the margin of error.
Example 7:
A sample of 6 college wrestlers had an average weight of 276 pounds with a standard deviation of 12 pounds.
Assuming normality:
a) Find a 90% C.I of the true mean weight of all college wrestlers.
$n = 6$, $\bar{x} = 276$, $s = 12$, $\alpha = 0.10 \rightarrow \alpha/2 = 0.05$, $D.F = 6 - 1 = 5 \rightarrow t_{0.05,5} = 2.015$.
The 90% C.I is $276 \pm 2.015\frac{12}{\sqrt{6}} \cong 276 \pm 10 = (266, 286)$ pounds
b) If a coach claimed that the average weight of the wrestlers on the team was 310, would the claim be
believable?
No, it would not be, since 310 is not contained in the obtained C.I.

Example 8:
100 randomly selected people were asked how long they slept at night. The mean time was 7.6 hours, and the
standard deviation was 0.9 hour. Construct a 95% C.I of the true mean
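$n = 100$, $\bar{x} = 7.6$, $s = 0.9$, $\alpha = 0.05 \rightarrow \alpha/2 = 0.025$, $D.F = 99$. The table has no row for 99 degrees of
freedom, so the $\infty$ row is used: $t_{0.025,\infty} = 1.960$
A 95% C.I is $7.6 \pm 1.96\frac{0.9}{\sqrt{100}} = 7.6 \pm 0.18 = (7.42, 7.78)$ (margin of error = 0.18)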
Example 9:
The number of unhealthy days based on the Air Quality Index for a random sample of metropolitan areas is
shown below. Assuming normality, construct a 98% C.I based on the data.
61 12 6 40 27 38 93 5 13 40
1. Evaluate the sample mean and sample standard deviation. Here, $\bar{x} = 33.5$ and $s = 27.68$
2. $\alpha = 0.02 \rightarrow \frac{\alpha}{2} = 0.01$, $D.F = 10 - 1 = 9 \rightarrow t_{0.01,9} = 2.821$
3. A 98% C.I is $33.5 \pm 2.821\frac{27.68}{\sqrt{10}} = 33.5 \pm 24.7 = (8.8, 58.2)$
Therefore, one can be 98% confident that the population mean is between 8.8 and 58.2.
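Example 9 can be reproduced from the raw data. The Python standard library has no $t$ distribution, so this sketch takes the critical value from the table as an input (an assumption of the example, not a computed quantity); the function name is illustrative.

```python
from math import sqrt
from statistics import mean, stdev


def t_confidence_interval(data, t_critical: float):
    """Confidence interval for the mean when sigma is unknown.

    t_critical is t_{alpha/2} read from the t table for n - 1 degrees of freedom.
    """
    n = len(data)
    xbar = mean(data)
    s = stdev(data)  # sample standard deviation (divides by n - 1)
    margin = t_critical * s / sqrt(n)
    return xbar - margin, xbar + margin


# Unhealthy days data from the example; t_{0.01, 9} = 2.821 from the table.
days = [61, 12, 6, 40, 27, 38, 93, 5, 13, 40]
low, high = t_confidence_interval(days, t_critical=2.821)
print(round(low, 1), round(high, 1))
```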
Section 7-3: Confidence Intervals and Sample Size for Proportions 𝒑

When we say 10% of adults in a population are left-handed, the parameter 10% is called a proportion. It
means that of all adults in the population, 10 out of every 100 are left-handed. A proportion represents a part of
a whole. It can be expressed as a fraction, decimal, or percentage. In this case, $10\% = 0.10 = \frac{10}{100} = \frac{1}{10}$.
Proportions can also represent probabilities. In this case, if an adult is selected at random, the probability that
he or she is left-handed is 0.10.
Proportions can be obtained from samples or populations. The following symbols will be used.
$p$ = population proportion, $\hat{p}$ = sample proportion,
$\hat{p} = \frac{X}{n}$, where $X$ = number of sample units that possess the characteristic of interest and $n$ = sample size

Example 10:
Suppose that 20% of people in a certain population are diabetic. A random sample of 150 people is selected
and it is found that there are 42 diabetic people in the sample.
Here, the population proportion is $p = 0.20$; this value is a parameter.
The sample proportion is $\hat{p} = \frac{X}{n} = \frac{42}{150} = 0.28$; this value is a statistic.
How to construct a confidence interval for $p$:
1. Select a random sample, of size $n$, from the population and evaluate the corresponding sample
proportion $\hat{p}$.
2. The best point estimator of $p$ is $\hat{p}$.
3. The standard error of $\hat{p}$ is $S.E = \sqrt{\frac{p(1 - p)}{n}}$.
4. Since $p$ is unknown, the standard error of $\hat{p}$ can be estimated by using the formula $S.E = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$.
5. Make sure that both $n\hat{p}$ and $n(1 - \hat{p})$ are greater than or equal to 5.
6. Determine the confidence level $1 - \alpha$, and find $z_{\alpha/2}$.
7. A $(1 - \alpha)100\%$ confidence interval for the population proportion $p$ is
$\left(\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}},\ \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\right) = \hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
8. The term $E = z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$ is called the maximum error of the estimate or the margin of error.
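The procedure above can be sketched as a function, including the check in step 5. The function name and the choice to raise an error when the check fails are assumptions made for illustration. The critical value is computed with `statistics.NormalDist` (Python 3.8 or later), so results may differ from table-based answers in the last digit. The numbers are those of the nursing applications example that follows (60 men among 500 applications, 90% confidence).

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(successes: int, n: int, confidence: float):
    """Confidence interval for a population proportion p."""
    # Step 5: n * p_hat and n * (1 - p_hat) must both be at least 5.
    if successes < 5 or n - successes < 5:
        raise ValueError("normal approximation not appropriate for this sample")
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{alpha/2}
    margin = z * sqrt(p_hat * (1 - p_hat) / n)          # estimated S.E times z
    return p_hat - margin, p_hat + margin


low, high = proportion_confidence_interval(60, 500, 0.90)
print(round(low, 3), round(high, 3))
```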
Example 11:
A sample of 500 nursing applications included 60 from men. Find a 90% C.I of the true population proportion
of men who applied to the nursing program.
1. $n = 500$, $\hat{p} = \frac{X}{n} = \frac{60}{500} = 0.12$
2. $n\hat{p} = 60 \ge 5$ and $n(1 - \hat{p}) = 440 \ge 5$
3. $\alpha = 0.10 \rightarrow \alpha/2 = 0.05 \rightarrow z_{0.05} = 1.645$
4. A 90% C.I of $p$ is $0.12 \pm 1.645\sqrt{\frac{0.12(1 - 0.12)}{500}} \cong 0.12 \pm 0.024 = (0.096, 0.144)$
Hence, you can be 90% confident that the percentage of applicants who are men is between 9.6% and 14.4%.
Also, you can be 90% confident that the percentage of applicants who are women is between 85.6% and 90.4%.

Example 12:
To determine what proportion of people use brand X cough syrup, 500 people are questioned. It is found that
40 use brand X, 160 use some other brand, and 300 do not use cough syrup. Construct a 95% C.I for:
a) The proportion of the population who use brand X.

1. $n = 500$, $\hat{p} = \frac{X}{n} = \frac{40}{500} = 0.08$
2. $n\hat{p} = 40 \ge 5$ and $n(1 - \hat{p}) = 460 \ge 5$
3. $\alpha = 0.05 \rightarrow \alpha/2 = 0.025 \rightarrow z_{0.025} = 1.96$
4. A 95% C.I of $p$ is $0.08 \pm 1.96\sqrt{\frac{0.08(1 - 0.08)}{500}} \cong 0.08 \pm 0.024 = (0.056, 0.104)$
5. The proportion of the population who use brand X is between 5.6% and 10.4% with a confidence
level of 95%.
b) The proportion of cough syrup users who use brand X.

1. $n = 200$, $\hat{p} = \frac{X}{n} = \frac{40}{200} = 0.2$
2. $n\hat{p} = 40 \ge 5$ and $n(1 - \hat{p}) = 160 \ge 5$
3. $\alpha = 0.05 \rightarrow \alpha/2 = 0.025 \rightarrow z_{0.025} = 1.96$
4. A 95% C.I of $p$ is $0.2 \pm 1.96\sqrt{\frac{0.2(1 - 0.2)}{200}} \cong 0.2 \pm 0.055 = (0.145, 0.255)$
5. The proportion of cough syrup users who use brand X is between 14.5% and 25.5% with a
confidence level of 95%.
c) The proportion of the population who use cough syrup.

1. $n = 500$, $\hat{p} = \frac{X}{n} = \frac{200}{500} = 0.4$
2. $n\hat{p} = 200 \ge 5$ and $n(1 - \hat{p}) = 300 \ge 5$
3. $\alpha = 0.05 \rightarrow \alpha/2 = 0.025 \rightarrow z_{0.025} = 1.96$
4. A 95% C.I of $p$ is $0.4 \pm 1.96\sqrt{\frac{0.4(1 - 0.4)}{500}} \cong 0.4 \pm 0.043 = (0.357, 0.443)$
5. The proportion of the population who use cough syrup is between 35.7% and 44.3% with a
confidence level of 95%.

Sample Size for Proportions

To find the sample size needed to determine a confidence interval about a proportion, use this formula:
$E = z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \rightarrow n = \hat{p}(1 - \hat{p})\left(\frac{z_{\alpha/2}}{E}\right)^{2}$
Round the result up to obtain $n$ as a whole number.
Notice that the value of $\hat{p}$ can be obtained from a previous study. If no information is given about $\hat{p}$, you
should use $\hat{p} = 0.5$.
Example 13:
A researcher wishes to estimate, with 95% confidence, the proportion of people who own a home computer.
The researcher wishes to be accurate within 0.02 of the true proportion. Find the required sample size if
there is a previous study shows that 0.60 of those interviewed had a computer at home.
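Here, $\alpha = 0.05 \rightarrow \frac{\alpha}{2} = 0.025 \rightarrow z_{0.025} = 1.96$, $\hat{p} = 0.60$, and margin of error $E = 0.02$
$\rightarrow n = \hat{p}(1 - \hat{p})\left(\frac{z_{\alpha/2}}{E}\right)^{2} = 0.6(1 - 0.6)\left(\frac{1.96}{0.02}\right)^{2} = 2304.96$; rounding up gives $n = 2305$ people.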
Example 14:
Repeat Example 13, if there is no previous study.
Here, we must use $\hat{p} = 0.5$
$\rightarrow n = 0.5(1 - 0.5)\left(\frac{1.96}{0.02}\right)^{2} = 2401$ people.
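Examples 13 and 14 differ only in the value used for $\hat{p}$. A minimal sketch of the sample size formula for proportions follows; the function name, the default of 0.5 when no prior estimate is available, and the small floating-point tolerance are choices made for this illustration.

```python
import math


def sample_size_for_proportion(z: float, margin: float, p_hat: float = 0.5) -> int:
    """n = p_hat * (1 - p_hat) * (z / E)^2, rounded up to a whole number.

    p_hat defaults to 0.5, the value used when no previous study is available.
    """
    n_raw = p_hat * (1 - p_hat) * (z / margin) ** 2
    # The tolerance keeps an exact whole-number result (such as 2401) from
    # being pushed to the next integer by floating-point rounding error.
    return math.ceil(n_raw - 1e-9)


print(sample_size_for_proportion(1.96, 0.02, p_hat=0.60))  # Example 13
print(sample_size_for_proportion(1.96, 0.02))              # Example 14
```

Using $\hat{p} = 0.5$ maximizes $\hat{p}(1 - \hat{p})$, so it gives the largest (most conservative) sample size for a given margin of error.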

PROBLEMS
1. The daily yield for a local chemical plant has averaged 880 tons for the last several years. The quality
control manager would like to know whether this average has changed in recent months. He randomly
selects 50 days from the computer database and computes the average and standard deviation of the
$n = 50$ yields as $\bar{x} = 871$ tons and $s = 21$ tons, respectively.
a) Construct a 90% C.I for the average daily yield 866 , 876
b) Can we say the mean daily yield has changed? Yes, since the constructed C.I does not contain 880
2. It is recognized that cigarette smoking has a deleterious effect on lung function. In their study of the
effect of cigarette smoking on the carbon monoxide diffusing capacity (DL) of the lung, researchers
found that current smokers had DL readings significantly lower than those of nonsmokers. The carbon
monoxide diffusing capacities for a random sample of 20 current smokers are listed below:
104, 87, 73, 123, 91, 92, 62, 91, 84, 76, 101, 88, 71, 82, 89, 103, 109, 73, 107, and 90
Assuming normality, construct a 98% C.I for the mean DL reading for current smokers in the population.
(81.3 , 98.3)
3. A water company wishes to discover the mean water consumption for the month of July in all homes in a
certain region. There are 600 homes in the region, and a random sample of 15 homes showed a mean
consumption of 11.6 m³ with a standard deviation of 1.2 m³. Assume normality.
a) Construct a 95% confidence interval for the mean water consumption per home in this region.
10.94 , 12.26
b) Construct a 95% confidence interval for the total water consumption for all homes in this region.
6564 , 7356
4. A peony plant with red petals was crossed with another plant that has streaky petals. 100 seeds from this
cross were collected and germinated, and it was found that 58 plants had red petals.
a) Construct a 90% C.I for the percentage of plants that have red petals. Answer: (49.88%, 66.12%)
b) A geneticist states that 75% of the offspring resulting from this cross will have red flowers. Based on the
given data, is this claim true?
Answer: No, since the constructed C.I does not contain the percentage 75%.

5. Suppose that we want to estimate the proportion of smokers in a certain population of 2500 adults. A
random sample of 100 adults is selected from the population. It is found that 20 of the sampled adults
are smokers.
a) Construct a 95% C.I for the proportion of smokers in the population. Answer: (0.12, 0.28)
b) Construct a 95% C.I for the proportion of nonsmokers in the population. Answer: (0.72, 0.88)
c) Construct a 95% C.I for the number of smokers in the population. Answer: (300, 700)
6. The manager of a machine shop wishes to estimate the average time an operator needs to complete a
simple task. 20 operators are selected at random and timed (in minutes). The observed results are:
7.4 7.1 4.3 4.5 7.3 4.8 5.1 4.7 6.3 5.1
6.5 6.8 7.6 6.1 5.9 6.3 7.3 4.6 5.9 7.6
$\sum x_i = 121.2$, $\sum x_i^2 = 759.02$
a) Construct a 95% confidence interval for the average time for completion of the task among all
operators. (5.53 , 6.59)
b) Construct a 90% confidence interval for the proportion of operators that need more than 7 minutes
to complete a simple task. (0.13 , 0.47)
7. It is desired to estimate the mean number of chocolate chips per cookie for a large national brand. How
many cookies would have to be sampled to estimate the true mean number of chips per cookie within 2
chips with 98% confidence? Assume that $\sigma = 10.1$ chips. Answer: 𝑛 = 139
8. A medical researcher wishes to determine the percentage of females who take vitamins. He wishes to be
99% confident that the estimate is within 3% of the true proportion.
a) How large should the sample size be, if a previous study showed that 25% of females took
vitamins? Answer: 𝑛 = 1383
b) If no previous study is available, how large should the sample size be? Answer: 𝑛 = 1844

See and solve the following examples and exercises from the text book:
Section 7 – 1
Exercises: 9, 11, 19, 25, 26
Section 7 – 2
Examples: 5, 6, 7
Exercises: 5, 9, 14, 16
Section 7 – 3
Examples: 8, 9, 10, 11
Exercises: 3, 13, 15, 18

BIOSTATISTICS
CHAPTER 8: Hypothesis Testing

Introduction
Researchers are interested in answering many types of questions. For example, a scientist might want to know
whether the earth is warming up. A physician might want to know whether a new medication will lower a
person’s blood pressure. An educator might wish to see whether a new teaching technique is better than a
traditional one. These types of questions can be addressed through statistical hypothesis testing, which is a
decision-making process for evaluating claims about a population. In hypothesis testing, the researcher must:
1. Define the population under study.

2. State the particular hypotheses that will be investigated.
3. Give the significance level.
4. Select a sample from the population.
5. Collect the data.
6. Perform the calculations required for the statistical test.
7. Reach a conclusion.
In this chapter hypotheses concerning the population mean and the population proportion will be explained.
Section 8 – 1: Steps in Hypothesis Testing

STEP 1
Every hypothesis test begins with the statement of a hypothesis.
A statistical hypothesis is a statement about a population parameter. This statement may or may not be true.
There are two types of statistical hypotheses for each situation: the null hypothesis and the alternative
hypothesis.
1. The null hypothesis, symbolized by 𝐻𝑜 , is a statistical hypothesis that states that there is no difference
between a parameter and a specific value, or that there is no difference between two parameters.
2. The alternative hypothesis, symbolized by 𝐻1 or 𝐻𝑎 , is a statistical hypothesis that states that there is a
difference between a parameter and a specific value, or states that there is a difference between two
parameters.

Example 1:
As an illustration of how hypotheses should be stated, three different statistical studies will be used as
examples.
Situation 1:
A medical researcher is interested in finding out whether a medication will have any undesirable side effects. The
researcher is particularly concerned with the pulse rate of the patients who take the medication. Will the pulse
rate increase, decrease, or remain unchanged after a patient takes the medication?
The researcher knows that the mean pulse rate for the population under study is 82 beats per minute, so the
hypotheses for this situation are 𝐻𝑜 : 𝜇 = 82 versus 𝐻1 : 𝜇 ≠ 82. This test is called a two-tailed test.
Situation 2:
A chemist invents an additive to increase the life of a car battery. If the mean lifetime of the car battery without
the additive is 36 months, then the hypotheses are 𝐻𝑜 : 𝜇 = 36 versus 𝐻1 : 𝜇 > 36. This test is called a
right-tailed test.
Situation 3:
A contractor wishes to lower heating bills by using a special type of insulation in houses. If the average of the
monthly heating bills is $78, the hypotheses about heating cost are 𝐻𝑜 : 𝜇 = 78 versus 𝐻1 : 𝜇 < 78. This test is
called a left-tailed test.
Example 2:
State the null and alternative hypotheses for each of the following situations:
a) A researcher thinks that if expectant mothers use vitamin pills, the birth weight of the babies will
increase. The average birth weight of the population is 8.6 pounds.
Solution: 𝐻𝑜 : 𝜇 = 8.6 versus 𝐻1 : 𝜇 > 8.6, right-tailed test.
b) An engineer hypothesized that the percentage of defective compact disks can be decreased by using
robots instead of humans. The percentage of defective disks is 1.8%.
Solution: 𝐻𝑜 : 𝑝 = 0.018 versus 𝐻1 : 𝑝 < 0.018, left-tailed test.
c) A psychologist feels that playing soft music during a test will change the results of the test. The
psychologist is not sure whether the grades will be higher or lower. In the past, the mean of the grades
was 73.
Solution: 𝐻𝑜 : 𝜇 = 73 versus 𝐻1 : 𝜇 ≠ 73, two-tailed test.

STEP 2
After stating the hypotheses, the researcher designs the study. The researcher selects a sample from the
population and chooses the correct statistical test. In situation 1, for instance, the researcher will select a
sample of patients who will be given the drug. After allowing a suitable time for the drug to be absorbed, the
researcher will measure each person’s pulse rate. Then the researcher will compute the sample mean and
variance in order to evaluate the test statistic.
A test statistic (test value) is a value computed from the data obtained from a sample, and it is used to make a
decision about whether the null hypothesis should be rejected.
STEP 3
In hypothesis testing there are 4 possible outcomes, shown below:
                      𝑯𝒐 is true          𝑯𝒐 is false
Reject 𝑯𝒐             Type I error        Correct decision
Do not reject 𝑯𝒐      Correct decision    Type II error
If a null hypothesis is true and it is rejected, then a type I error is made. In situation 1, for instance, the
medication might not significantly change the pulse rate of all the users in the population; but it might change
the rate, by chance, of the subjects in the sample. In this case, the researcher will reject the null hypothesis
when it is really true, thus committing a type I error.
If a null hypothesis is false and it is not rejected, then a type II error is made.
The probability of a type I error is denoted by 𝛼, and it is called the level of significance.
Statisticians generally agree on using 3 arbitrary significance levels: 𝛼 = 0.10, 0.05, and 0.01. That is, if the null
hypothesis is rejected, the probability of a type I error will be 0.10, 0.05, or 0.01, depending on which level of
significance is used. The level of significance does not have to be 0.10, 0.05, or 0.01. It can be any level,
depending on the seriousness of the type I error.
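A small simulation can make the meaning of 𝛼 concrete. The sketch below (Python, standard library only) repeatedly draws samples from a normal population in which the null hypothesis is true and counts how often a two-tailed 𝑧-test rejects it anyway. The population standard deviation of 5, the sample size of 30, the number of trials, and the function name are illustrative assumptions and do not come from the text; only the mean of 82 echoes Situation 1.

```python
import random
from statistics import NormalDist, fmean


def type_one_error_rate(mu0, sigma, n, alpha, trials, seed=0):
    """Fraction of two-tailed z-tests that reject H0 when H0 is actually true."""
    rng = random.Random(seed)                      # fixed seed for repeatability
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    rejections = 0
    for _ in range(trials):
        # Every sample is drawn with true mean mu0, so any rejection is a type I error.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (fmean(sample) - mu0) / (sigma / n ** 0.5)
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials


rate = type_one_error_rate(mu0=82, sigma=5, n=30, alpha=0.05, trials=5000)
print(f"observed type I error rate: {rate:.3f}")
```

The observed rate should land close to the chosen 𝛼, which is what "the probability of a type I error is 𝛼" means in practice.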

STEP 4
After a significance level is chosen, a critical value is selected from a distribution table (the 𝑧 or 𝑡 table), depending on
the method of computing the test statistic.
1. The critical value separates the critical (rejection) region from the noncritical region. The symbol for the
critical value is 𝑧𝛼 , 𝑧𝛼/2 , 𝑡𝛼 , … The symbol used depends on the type of the test and on the table used to
select the critical value.
2. The critical or rejection region is the range of values of the test statistic that indicates that there is a
significant difference and that the null hypothesis 𝐻𝑜 should be rejected.
3. The noncritical or non-rejection region is the range of values of the test statistic that indicates that the
null hypothesis 𝐻𝑜 should not be rejected.
For right-tailed tests the rejection region is in the right tail of the distribution, for left-tailed tests the rejection
region is in the left tail, and for two-tailed tests the rejection region is divided into two smaller
regions, one in the right tail and the other in the left tail.
Notice that the total area of the rejection region is 𝜶.
In all types of tests, the null hypothesis 𝐻𝑜 should be rejected if the test statistic (test value) belongs to the
rejection region. On the other hand, the null hypothesis 𝐻𝑜 should not be rejected if the test statistic (test value)
does not belong to the rejection region.
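For the 𝑧 distribution, the critical values that are usually read from a table can also be computed. The following is a minimal sketch using Python's standard library; the function name `z_critical` and the `tail` labels are assumptions made for illustration. The standard library has no 𝑡 distribution, so 𝑡 critical values would still come from a table or a separate statistics package.

```python
from statistics import NormalDist


def z_critical(alpha, tail):
    """Critical value(s) of the standard normal distribution for a test type."""
    std = NormalDist()                   # mean 0, standard deviation 1
    if tail == "right":
        return std.inv_cdf(1 - alpha)    # z_alpha
    if tail == "left":
        return -std.inv_cdf(1 - alpha)   # -z_alpha
    if tail == "two":
        z = std.inv_cdf(1 - alpha / 2)   # z_{alpha/2}
        return (-z, z)
    raise ValueError("tail must be 'right', 'left', or 'two'")


for alpha in (0.10, 0.05, 0.01):
    one_tailed = round(z_critical(alpha, "right"), 3)
    two_tailed = round(z_critical(alpha, "two")[1], 3)
    print(alpha, one_tailed, two_tailed)
```

In every case the area cut off by the critical value(s) adds up to 𝛼; the two-tailed test simply splits it into 𝛼/2 in each tail.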
Section 8 – 2: 𝒛 - Test for a Population Mean 𝝁

Assumptions:
1. The population is normally distributed or the sample size 𝑛 ≥ 30.

2. The population standard deviation 𝜎 is known.
3. The sample is a random sample
Note: We will assume the population is normally distributed for all examples in this section.
Steps of the 𝒛-test:
1. State the null and alternative hypotheses and identify the claim.
There are 3 types of tests:
𝐻𝑜 : 𝜇 = 𝜇𝑜 versus 𝐻1 : 𝜇 ≠ 𝜇𝑜 . This test is called a two-tailed test.
𝐻𝑜 : 𝜇 = 𝜇𝑜 versus 𝐻1 : 𝜇 > 𝜇𝑜 . This test is called a right-tailed test.
𝐻𝑜 : 𝜇 = 𝜇𝑜 versus 𝐻1 : 𝜇 < 𝜇𝑜 . This test is called a left-tailed test.

2. Select a random sample and then compute the test statistic, $z_o = \dfrac{\bar{x} - \mu_o}{\sigma/\sqrt{n}}$.
3. Choose the level of significance 𝛼 that will be used to determine the rejection region as explained in the
following step.
4. Evaluate the critical value which separates the rejection region from the non-rejection region.
• For right-tailed tests the rejection region is the region on the right of 𝑧𝛼 . (Reject 𝐻𝑜 if 𝑧𝑜 > 𝑧𝛼 )
• For left-tailed tests the rejection region is the region on the left of −𝑧𝛼 . (Reject 𝐻𝑜 if 𝑧𝑜 < −𝑧𝛼 )
• For two-tailed tests the rejection region is the region on the right of 𝑧𝛼/2 and the region on the left
of −𝑧𝛼/2 . (Reject 𝐻𝑜 if |𝑧𝑜| > 𝑧𝛼/2 )
[Figure: rejection regions (shaded) for the right-tailed, left-tailed, and two-tailed tests.]
Note: The total area of the rejection region is 𝛼.
5. Make the decision to reject or not reject the null hypothesis 𝐻𝑜 . In all types of tests, the null hypothesis
should be rejected when the value of the test statistic 𝑧𝑜 belongs to the rejection region.
6. Summarize the results.

Example 3:
A researcher claims that the average wind speed in a certain city is 8 miles per hour. A sample of 32 days has an
average wind speed of 8.2 miles per hour. The standard deviation of the population is 0.6 miles per hour. At 5%
level of significance, is there sufficient evidence to reject the claim?
Solution: We will use the 𝑧-test since the population standard deviation is known.
$\mu_o = 8$, $n = 32$, $\bar{x} = 8.2$, $\sigma = 0.6$ and $\alpha = 0.05$
1. State the hypotheses and identify the claim.
𝐻𝑜 : 𝜇 = 8 versus 𝐻1 : 𝜇 ≠ 8. This is a two-tailed test; here 𝐻𝑜 is the claim.
2. Compute the test statistic: $z_o = \dfrac{\bar{x} - \mu_o}{\sigma/\sqrt{n}} = \dfrac{8.2 - 8}{0.6/\sqrt{32}} = 1.89$
3. 𝛼 = 0.05 and the test is two-tailed, so 𝑧𝛼/2 = 𝑧0.025 = 1.96
4. The rejection region is the region on the right of 1.96 or on the left of −1.96
5. Make the decision. Do not reject the null hypothesis 𝐻𝑜 , since the test statistic 𝑧𝑜 does not fall in the
rejection region (𝑧𝑜 falls in the non-rejection region).
6. Summarize the results. At the 5% level of significance, there is not sufficient evidence to reject the claim that
the average wind speed in the city is 8 miles per hour.

Example 4:
A manager claims that in his factory, the average number of days per year missed by the employees due to
illness is less than 7 days per year. The following data show the number of days missed by 40 employees last
year. Is there sufficient evidence to believe the manager’s claim at 𝛼 = 0.05? Assume the population standard
deviation is 4 days.
0 6 12 3 3 5 4 1 3 9 6 0 7 6 3
4 7 4 7 1 0 8 12 3 2 5 10 5 15 3
2 5 3 11 8 2 2 4 9 1
Solution: We will use the 𝑧-test since the population standard deviation 𝜎 is known.
𝜇𝑜 = 7, 𝑛 = 40, 𝜎 = 4 and 𝛼 = 0.05
1. State the hypotheses and identify the claim.
𝐻𝑜 : 𝜇 = 7 versus 𝐻1 : 𝜇 < 7. This is a left-tailed test; here 𝐻1 is the claim.
2. Compute the test statistic. First, we must evaluate the sample mean.
$$\bar{x} = \frac{\sum x}{n} = \frac{201}{40} = 5.025$$
$$z_o = \frac{\bar{x} - \mu_o}{\sigma/\sqrt{n}} = \frac{5.025 - 7}{4/\sqrt{40}} = -3.123$$
3. 𝛼 = 0.05 and the test is one-tailed, so 𝑧𝛼 = 𝑧0.05 = 1.645
4. The rejection region is the region on the left of −1.645
5. Make the decision. Reject the null hypothesis 𝐻𝑜 , since the test statistic 𝑧𝑜 falls in the rejection region
6. Summarize the results. At 5% level of significance, there is sufficient evidence to support the claim that
the average number of days per year missed by the employees due to illness is less than 7 days per year.
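The two examples above follow the same six steps, so they can be reproduced with one small routine. The sketch below is written in Python with the standard library only; the function name `z_test_mean` and the `tail` labels are illustrative assumptions. It computes the critical value from the normal distribution instead of reading it from a table, which gives the same decisions here.

```python
from statistics import NormalDist, fmean


def z_test_mean(x_bar, mu0, sigma, n, alpha, tail):
    """Return (z_o, reject) for H0: mu = mu0 when sigma is known."""
    z_o = (x_bar - mu0) / (sigma / n ** 0.5)   # test statistic
    std = NormalDist()
    if tail == "two":
        reject = abs(z_o) > std.inv_cdf(1 - alpha / 2)
    elif tail == "right":
        reject = z_o > std.inv_cdf(1 - alpha)
    elif tail == "left":
        reject = z_o < -std.inv_cdf(1 - alpha)
    else:
        raise ValueError("tail must be 'two', 'right', or 'left'")
    return z_o, reject


# Example 3: wind speed, two-tailed test
z3, reject3 = z_test_mean(8.2, 8, 0.6, 32, 0.05, "two")
print(round(z3, 2), reject3)

# Example 4: days missed, left-tailed test on the raw data
days = [0, 6, 12, 3, 3, 5, 4, 1, 3, 9, 6, 0, 7, 6, 3,
        4, 7, 4, 7, 1, 0, 8, 12, 3, 2, 5, 10, 5, 15, 3,
        2, 5, 3, 11, 8, 2, 2, 4, 9, 1]
z4, reject4 = z_test_mean(fmean(days), 7, 4, len(days), 0.05, "left")
print(round(z4, 3), reject4)
```

Passing the raw data through `fmean` also serves as a check on the hand-computed sample mean of 5.025.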

Section 8 – 3: 𝒕- Test for a Population Mean 𝝁

When the population standard deviation is unknown, the z test is not used for testing hypotheses involving
means. A different test, called the t test, is used provided that the following assumptions are satisfied.
Assumptions:
1. The population is normally distributed.

2. The population standard deviation 𝜎 is unknown.
3. The sample is a random sample.
Note: We will assume the population is normally distributed for all examples in this section.
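Steps of the 𝒕-test:
1. State the null and alternative hypotheses and identify the claim.
There are 3 types of tests:
𝐻𝑜 : 𝜇 = 𝜇𝑜 versus 𝐻1 : 𝜇 ≠ 𝜇𝑜 . This test is called a two-tailed test.
𝐻𝑜 : 𝜇 = 𝜇𝑜 versus 𝐻1 : 𝜇 > 𝜇𝑜 . This test is called a right-tailed test.
𝐻𝑜 : 𝜇 = 𝜇𝑜 versus 𝐻1 : 𝜇 < 𝜇𝑜 . This test is called a left-tailed test.
2. Select a random sample and then compute the test statistic, $t_o = \dfrac{\bar{x} - \mu_o}{s/\sqrt{n}}$.
3. Choose the level of significance 𝛼 that will be used to determine the rejection region as explained in the
following step.
4. Evaluate the critical value which separates the rejection region from the non-rejection region.
• For right-tailed tests the rejection region is the region on the right of 𝑡𝛼 . (Reject 𝐻𝑜 if 𝑡𝑜 > 𝑡𝛼 )
• For left-tailed tests the rejection region is the region on the left of −𝑡𝛼 . (Reject 𝐻𝑜 if 𝑡𝑜 < −𝑡𝛼 )
• For two-tailed tests the rejection region is the region on the right of 𝑡𝛼/2 and the region on the left of
−𝑡𝛼/2 . (Reject 𝐻𝑜 if |𝑡𝑜| > 𝑡𝛼/2 )
• For the three cases, the degrees of freedom are 𝐷. 𝑓 = 𝑛 − 1
[Figure: rejection regions (shaded) for the right-tailed test (right of 𝑡𝛼), the left-tailed test (left of −𝑡𝛼), and the two-tailed test (outside −𝑡𝛼/2 and 𝑡𝛼/2).]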

Note: The area of the rejection region is 𝛼.
5. Make the decision to reject or not reject the null hypothesis 𝐻𝑜 . In all types of tests, the null hypothesis
should be rejected when the value of the test statistic 𝑡𝑜 belongs to the rejection region.
6. Summarize the results.
Example 5:
A medical investigation claims that the average number of infections per week at a hospital is 16.3. A random
sample of 10 weeks had a mean number of 17.7 infections. The sample standard deviation is 1.8. Is there
sufficient evidence to reject the investigator’s claim at 𝛼 = 0.05?
Solution: $\mu_o = 16.3$, $n = 10$, $\bar{x} = 17.7$, $s = 1.8$, $\alpha = 0.05$, $\sigma$ is unknown
Here we will use the 𝑡-test since the population standard deviation is unknown.
1. State the hypotheses and identify the claim.
𝐻𝑜 : 𝜇 = 16.3 versus 𝐻1 : 𝜇 ≠ 16.3. This is a two-tailed test; here 𝐻𝑜 is the claim.
2. Compute the test statistic. $t_o = \dfrac{\bar{x} - \mu_o}{s/\sqrt{n}} = \dfrac{17.7 - 16.3}{1.8/\sqrt{10}} = 2.46$
3. 𝛼 = 0.05 and the test is two-tailed, so 𝑡𝛼/2,𝑛−1 = 𝑡0.025,9 = 2.262.
4. The rejection region is the region on the right of 2.262 or on the left of −2.262.
5. Make the decision. Reject the null hypothesis 𝐻𝑜 , since the test statistic 𝑡𝑜 falls in the rejection region.
6. Summarize the results. At 5% level of significance, there is sufficient evidence to reject the claim that
the average number of infections per week at the hospital is 16.3. Or we can say, at 𝛼 = 0.05, the
average number of infections per week at the hospital is significantly different from 16.3.

Example 6:
A physician claims that joggers’ maximal volume oxygen uptake is greater than the average of all adults. A
sample of 15 joggers has a mean of 40.6 ml/kg and a standard deviation of 6 ml/kg. If the average of all adults is
36.7 ml/kg, is there sufficient evidence to support the physician’s claim at 𝛼 = 0.10?
Solution: $\mu_o = 36.7$, $n = 15$, $\bar{x} = 40.6$, $s = 6$, $\alpha = 0.10$, $\sigma$ is unknown
Here we will use the 𝑡-test since the population standard deviation is unknown.
1. State the hypotheses and identify the claim.
𝐻𝑜 : 𝜇 = 36.7 versus 𝐻1 : 𝜇 > 36.7. This is a right-tailed test; here 𝐻1 is the claim.
2. Compute the test statistic. $t_o = \dfrac{\bar{x} - \mu_o}{s/\sqrt{n}} = \dfrac{40.6 - 36.7}{6/\sqrt{15}} = 2.517$
3. 𝛼 = 0.10 and the test is one-tailed, so 𝑡𝛼,𝑛−1 = 𝑡0.10,14 = 1.345
4. The rejection region is the region on the right of 1.345
5. Make the decision. Reject the null hypothesis 𝐻𝑜 , since the test statistic 𝑡𝑜 falls in the rejection region.
6. Summarize the results. At the 10% level of significance, there is sufficient evidence to support the claim
that joggers’ mean maximal volume oxygen uptake is greater than the average of all adults. Or we can
say, at 𝛼 = 0.10, that joggers’ maximal volume oxygen uptake is significantly greater than the average of
all adults.
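The arithmetic in the two 𝑡-test examples above can be checked with a short Python sketch. Because Python's standard library does not provide the 𝑡 distribution, the critical value is passed in as a number looked up in the 𝑡 table, exactly as in the worked solutions. The function names `t_statistic` and `reject_h0` are assumptions made for this illustration.

```python
def t_statistic(x_bar, mu0, s, n):
    """Test statistic t_o = (x_bar - mu0) / (s / sqrt(n)), with n - 1 degrees of freedom."""
    return (x_bar - mu0) / (s / n ** 0.5)


def reject_h0(t_o, t_crit, tail):
    """Decision rule; t_crit is the positive table value (t_{alpha/2} for 'two')."""
    if tail == "two":
        return abs(t_o) > t_crit
    if tail == "right":
        return t_o > t_crit
    if tail == "left":
        return t_o < -t_crit
    raise ValueError("tail must be 'two', 'right', or 'left'")


# Example 5: infections per week, two-tailed, table value t_{0.025, 9} = 2.262
t5 = t_statistic(17.7, 16.3, 1.8, 10)
print(round(t5, 2), reject_h0(t5, 2.262, "two"))

# Example 6: joggers' oxygen uptake, right-tailed, table value t_{0.10, 14} = 1.345
t6 = t_statistic(40.6, 36.7, 6, 15)
print(round(t6, 3), reject_h0(t6, 1.345, "right"))
```

Keeping the table lookup outside the function mirrors the manual procedure: the only thing that changes between the 𝑧-test and the 𝑡-test is the standard deviation used and the table consulted.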

Confidence Intervals and Hypothesis Testing

There is a relationship between confidence intervals and hypothesis testing. When the null hypothesis
𝐻𝑜 : 𝜇 = 𝜇𝑜 is rejected in a hypothesis testing situation, the confidence interval for the mean using the same
level of significance, 𝛼, will not contain the hypothesized mean 𝜇𝑜 . Likewise, when the null hypothesis 𝐻𝑜 is not
rejected, the confidence interval computed using the same level of significance 𝛼 will contain the hypothesized
mean 𝜇𝑜 . The following example shows this concept for two-tailed tests.
To test
𝐻𝑜 : 𝜇 = 𝜇𝑜 versus 𝐻1 : 𝜇 ≠ 𝜇𝑜
at a level of significance 𝛼, you can construct a (1 − 𝛼)100% confidence interval for 𝜇. Then:
1. If the confidence interval contains 𝜇𝑜 , do not reject 𝐻𝑜 .

2. If the confidence interval does not contain 𝜇𝑜 , reject 𝐻𝑜 .
Example 7:
Sugar is packed in 1-kilogram bags. An inspector suspects the bags may not contain 1 kg. A sample of 50 bags
produces a mean of 0.92 kg and a standard deviation of 0.14 kg.
a) Find a 95% C.I of the true mean

b) Is there sufficient evidence to conclude that the bags do not contain 1 kg as stated at 𝛼 = 0.05?
Solution:
a) $\bar{x} = 0.92$, $s = 0.14$, $n = 50$, $\alpha = 0.05 \rightarrow \alpha/2 = 0.025$, $D.f = 50 - 1 = 49 \rightarrow t_{0.025} \cong 2.01$.
The 95% C.I is $0.92 \pm 2.01\,\dfrac{0.14}{\sqrt{50}} = 0.92 \pm 0.04 = (0.88, 0.96)$
We can say the true mean falls between 0.88 kg and 0.96 kg with a confidence level of 0.95.
b) 𝐻𝑜 : 𝜇 = 1 versus 𝐻1 : 𝜇 ≠ 1; the claim is 𝐻1 .

Since the 95% C.I in part (a) does not contain 𝜇𝑜 = 1, we reject the null hypothesis. There is sufficient
evidence to support the claim that the bags do not contain 1 kg.
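The link between the confidence interval and the two-tailed test in the sugar-bag example can be sketched in a few lines of Python. The function names are illustrative assumptions, and the critical value is supplied by hand as a table lookup; 2.01 is used here as the approximate table value for 49 degrees of freedom.

```python
def t_confidence_interval(x_bar, s, n, t_crit):
    """(1 - alpha)100% confidence interval for the mean, given t_{alpha/2, n-1}."""
    margin = t_crit * s / n ** 0.5
    return x_bar - margin, x_bar + margin


def two_tailed_decision(interval, mu0):
    """Two-tailed test at the same alpha: reject H0 exactly when mu0 is outside the interval."""
    low, high = interval
    return "do not reject H0" if low <= mu0 <= high else "reject H0"


ci = t_confidence_interval(0.92, 0.14, 50, t_crit=2.01)
print(tuple(round(v, 2) for v in ci), two_tailed_decision(ci, 1.0))
```

This shortcut applies to two-tailed tests only; a one-tailed test at level 𝛼 does not correspond to the usual two-sided (1 − 𝛼)100% interval.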

Section 8 – 4: 𝒛 - Test for a Population Proportion 𝒑

Many situations arise that call for tests of proportions or percentages rather than means. For example, a
researcher claims that the percentage of smokers in the population is greater than 40%. How can we make such
a test? In this section we will study tests involving proportions or percentages.
Steps of the 𝒛-test:
1. State the null and alternative hypotheses and identify the claim.
There are 3 types of tests:
𝐻𝑜 : 𝑝 = 𝑝𝑜 versus 𝐻1 : 𝑝 ≠ 𝑝𝑜 . This test is called a two-tailed test.
𝐻𝑜 : 𝑝 = 𝑝𝑜 versus 𝐻1 : 𝑝 > 𝑝𝑜 . This test is called a right-tailed test.
𝐻𝑜 : 𝑝 = 𝑝𝑜 versus 𝐻1 : 𝑝 < 𝑝𝑜 . This test is called a left-tailed test.
2. Compute the test statistic, $z_o = \dfrac{\hat{p} - p_o}{\sqrt{p_o(1 - p_o)/n}}$, where $\hat{p}$ is the proportion in the sample.
3. Choose the level of significance 𝛼 that will be used to determine the rejection region as explained in the
following step.
4. Evaluate the critical value which separates the rejection region from the non-rejection region.
• For right-tailed tests the rejection region is the region on the right of 𝑧𝛼 .
• For left-tailed tests the rejection region is the region on the left of −𝑧𝛼 .
• For two-tailed tests the rejection region is the region on the right of 𝑧𝛼/2 and the region on the left
of −𝑧𝛼/2 .
Note: We can use the z-test for proportions only if 𝑛𝑝𝑜 ≥ 5 and 𝑛(1 − 𝑝𝑜) ≥ 5
5. Make the decision to reject or not reject the null hypothesis 𝐻𝑜 . In all types of tests, the null hypothesis
should be rejected when the value of the test statistic 𝑧𝑜 belongs to the rejection region.
6. Summarize the results.

Example 8:
A dietitian claims that more than 60% of people are trying to avoid trans fats in their diets. She randomly
selected 200 people and found that 128 people stated that they were trying to avoid trans fats in their diets. At
𝛼 = 0.05, is there sufficient evidence to support the claim?
Solution: $p_o = 0.6$, $n = 200$, $\hat{p} = \dfrac{128}{200} = 0.64$, $\alpha = 0.05$.
Here we can use the 𝑧-test since 𝑛𝑝𝑜 = 120 ≥ 5 and 𝑛(1 − 𝑝𝑜) = 80 ≥ 5
1. State the hypotheses and identify the claim.
𝐻𝑜 : 𝑝 = 0.6 versus 𝐻1 : 𝑝 > 0.6. This is a right-tailed test; here 𝐻1 is the claim.
2. Compute the test statistic. $z_o = \dfrac{\hat{p} - p_o}{\sqrt{p_o(1 - p_o)/n}} = \dfrac{0.64 - 0.6}{\sqrt{0.6(1 - 0.6)/200}} = 1.15$
3. 𝛼 = 0.05 and the test is one-tailed, so 𝑧𝛼 = 𝑧0.05 = 1.645.
4. The rejection region is the region on the right of 1.645
5. Make the decision. Do not reject the null hypothesis 𝐻𝑜 , since the test statistic 𝑧𝑜 does not fall in the
rejection region.
6. Summarize the results. At the 5% level of significance, there is not sufficient evidence to support the claim that
more than 60% of people are trying to avoid trans fats in their diets. Or we can say, at 𝛼 = 0.05, the
percentage of people who are trying to avoid trans fats in their diets is not significantly greater than 60%.

Example 9:
An automobile association claims that 54% of fatal car/truck accidents are caused by driver error. A researcher
studies 40 randomly selected fatal accidents and finds that 19 were caused by driver error. Using 𝛼 = 0.05, can
the claim be refuted?
Solution: $p_o = 0.54$, $n = 40$, $\hat{p} = \dfrac{19}{40} = 0.475$, $\alpha = 0.05$.
Here we can use the 𝑧-test since 𝑛𝑝𝑜 = 21.6 ≥ 5 and 𝑛(1 − 𝑝𝑜) = 18.4 ≥ 5
1. State the hypotheses and identify the claim.
𝐻𝑜 : 𝑝 = 0.54 versus 𝐻1 : 𝑝 ≠ 0.54. This is a two-tailed test; here 𝐻𝑜 is the claim.
2. Compute the test statistic. $z_o = \dfrac{\hat{p} - p_o}{\sqrt{p_o(1 - p_o)/n}} = \dfrac{0.475 - 0.54}{\sqrt{0.54(1 - 0.54)/40}} = -0.825$
3. 𝛼 = 0.05 and the test is two-tailed, so 𝑧𝛼/2 = 𝑧0.025 = 1.96.
4. The rejection region is the region on the right of 1.96 or on the left of −1.96
5. Make the decision. Do not reject the null hypothesis 𝐻𝑜 , since the test statistic 𝑧𝑜 does not fall in the
rejection region.
6. Summarize the results. At the 5% level of significance, there is not sufficient evidence to reject the claim that
54% of fatal car/truck accidents are caused by driver error.
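The proportion test in the two examples above can be sketched in Python with the standard library. The function name `z_test_proportion` and its argument names are assumptions made for illustration; the check on 𝑛𝑝𝑜 and 𝑛(1 − 𝑝𝑜) implements the note in the steps of this section.

```python
from statistics import NormalDist


def z_test_proportion(successes, n, p0, alpha, tail):
    """Return (z_o, reject) for H0: p = p0 using the normal approximation."""
    if n * p0 < 5 or n * (1 - p0) < 5:
        raise ValueError("z-test not appropriate: need n*p0 >= 5 and n*(1 - p0) >= 5")
    p_hat = successes / n                              # sample proportion
    z_o = (p_hat - p0) / (p0 * (1 - p0) / n) ** 0.5    # test statistic
    std = NormalDist()
    if tail == "two":
        reject = abs(z_o) > std.inv_cdf(1 - alpha / 2)
    elif tail == "right":
        reject = z_o > std.inv_cdf(1 - alpha)
    elif tail == "left":
        reject = z_o < -std.inv_cdf(1 - alpha)
    else:
        raise ValueError("tail must be 'two', 'right', or 'left'")
    return z_o, reject


z8, reject8 = z_test_proportion(128, 200, 0.60, 0.05, "right")  # Example 8
z9, reject9 = z_test_proportion(19, 40, 0.54, 0.05, "two")      # Example 9
print(round(z8, 2), reject8)
print(round(z9, 3), reject9)
```

Note that the standard error uses the hypothesized proportion 𝑝𝑜 rather than the sample proportion, because the test statistic is computed under the assumption that 𝐻𝑜 is true.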

Section 8 – 5: The 𝑷-Value Approach

Many computer statistical packages give a 𝑃-value for the hypothesis tests instead of critical values 𝑧𝛼 , 𝑡𝛼 , and
others. What is the 𝑃-value and how will it be used to make a decision? The answer is below:
The 𝑷-value or observed significance level of a statistical test is the smallest value of 𝛼 for which the null
hypothesis 𝐻𝑜 can be rejected.
If the 𝑃-value is less than a pre-assigned significance level 𝛼 (𝑃-value < 𝛼), then the null hypothesis 𝐻𝑜 can be
rejected, and you can report that the results are statistically significant at 𝛼.
If 𝑃-value ≥ 𝛼, the decision is not to reject the null hypothesis 𝐻𝑜 , and we say that the results are not
statistically significant at 𝛼.
Example 10:
Suppose that we want to test 𝐻𝑜 : 𝜇 = 120 versus 𝐻1 : 𝜇 ≠ 120 at 𝛼 = 0.05. A random sample is
selected and yields a 𝑃-value of 0.0001
Here, we reject the null hypothesis 𝐻𝑜 since 𝑃-value= 0.0001 < 𝛼 = 0.05
Example 11:
Suppose that we want to test 𝐻𝑜 : 𝑝 = 0.54 versus 𝐻1 : 𝑝 < 0.54 at 𝛼 = 0.05. A random sample is selected
and yields a 𝑃-value of 0.20
Here, we do not reject the null hypothesis 𝐻𝑜 since 𝑃-value = 0.20 > 𝛼 = 0.05
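For a 𝑧-test, the 𝑃-value that a statistical package reports can be computed directly from the test statistic. The sketch below uses Python's standard library; the function name `p_value_z` is an assumption for illustration, and the two test statistics are the ones computed in Examples 3 and 4 of Section 8 – 2.

```python
from statistics import NormalDist


def p_value_z(z_o, tail):
    """P-value of a z-test: tail area at least as extreme as the observed z_o."""
    std = NormalDist()
    if tail == "right":
        return 1 - std.cdf(z_o)
    if tail == "left":
        return std.cdf(z_o)
    if tail == "two":
        return 2 * (1 - std.cdf(abs(z_o)))   # both tails
    raise ValueError("tail must be 'right', 'left', or 'two'")


alpha = 0.05
for label, z_o, tail in [("Example 3", 1.89, "two"), ("Example 4", -3.123, "left")]:
    p = p_value_z(z_o, tail)
    decision = "reject H0" if p < alpha else "do not reject H0"
    print(label, round(p, 4), decision)
```

The decisions agree with the critical-value approach: comparing the 𝑃-value with 𝛼 is equivalent to comparing the test statistic with the critical value.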
PROBLEMS
1. The daily yield for a local chemical plant has averaged 880 tons for the last several years. The quality
control manager would like to know whether this average has changed in recent months. He randomly
selects 50 days from the computer data base and computes the average and standard deviation of the
𝑛 = 50 yields as 𝑥 = 871 tons and 𝑠 = 21 tons, respectively. At 𝛼 = 0.05, can we say the mean daily yield
has changed?
Two- tailed t-test, reject 𝐻𝑜
2. Standards set by government agencies indicate that adults should not exceed an average daily sodium
intake of 3300 mg. To find out whether adults in the population are exceeding this limit, a sample of 100
adults is selected, and the mean and the standard deviation of daily sodium intake are found to be 3400 mg
and 1100 mg, respectively. Use α = 0.05 to conduct a test of hypothesis
Right-tailed t-test, do not reject 𝐻𝑜

3. It is recognized that cigarette smoking has a deleterious effect on lung function. In their study of the effect
of cigarette smoking on the carbon monoxide diffusing capacity (DL) of the lung, researchers found that
current smokers had DL readings significantly lower than those of nonsmokers. The carbon monoxide
diffusing capacities for a random sample of 20 current smokers are listed below:
104, 87, 73, 123, 91, 92, 62, 91, 84, 76, 101, 88, 71, 82, 89, 103, 109, 73, 107, and 90
Assuming normality, do these data indicate that the mean DL reading for current smokers is significantly
lower than 100? Use 𝛼 = 0.01
Left-tailed t-test, reject 𝐻𝑜
4. It is known that the IQ scores of a certain population of adults are approximately normally distributed with
standard deviation of 15. A random sample of 25 adults selected from this population had a mean IQ score
of 105 with a standard deviation of 14. On the basis of the given data can we conclude that the mean IQ
score for the population is not 100?
Two-tailed z-test, do not reject 𝐻𝑜
5. A random sample of 16 adults selected from a certain normal population yielded a mean weight of 64 kg
with variance of 49. Do the sample data provide sufficient evidence to conclude that the mean weight for
the population is greater than 60 kg?
Right-tailed t-test, reject 𝐻𝑜
6. A dietitian wishes to see if a person’s cholesterol level will decrease if the diet is supplemented by a certain
mineral. Six subjects were pretested, and then they took the mineral supplement for a 5-week period. The
results are shown in the following table. At 𝛼 = 0.10, can it be concluded that the cholesterol level has
decreased? Assume the variable is approximately normally distributed.
Subject 1 2 3 4 5 6
Before 210 235 208 190 172 244
After 190 170 210 188 173 228
One-tailed t-test, reject 𝐻𝑜
7. A peony plant with red petals was crossed with another plant that has streaky petals. 100 seeds from this
cross were collected and germinated, and it was found that 58 plants had red petals. A geneticist states that
75% of the offspring resulting from this cross will have red flowers. Test this claim using 𝛼 = 0.01
Two-tailed z-test, reject 𝐻𝑜
8. Suppose that we want to estimate the proportion of smokers in a certain population of adults. A random
sample of 300 adults is selected from the population. It is found that 69 of the sampled adults are smokers.
Test the claim that the percentage of smokers is less than 25%.
Left-tailed z-test, do not reject 𝐻𝑜

See and solve the following examples and exercises from the textbook:
Section 8 – 1
Examples: 1, 2
Exercises: 12, 13
Section 8 – 2
Exercises: 5, 7, 16, 19, 25
Section 8 – 3
Examples: 8, 9, 10, 11, 12, 13
Exercises: 14, 16, 17, 20
Section 8 – 4
Examples: 17, 18
Exercises: 8, 11, 12, 15, 19
Section 8 – 6
Examples: 30, 31
Exercises: 1, 2, 5

BIOSTATISTICS
CHAPTER 9: Testing the Difference between Two Means and Two Proportions
There are many instances when researchers wish to compare two means. For example, two different
brands of fertilizer might be tested to see whether one is better than the other for growing plants. Or two
brands of cough syrup might be tested to see whether one brand is more effective than the other.
In the comparison of two means, the same basic steps for hypothesis testing shown in Chapter 8 are used, and
the 𝑧- and 𝑡-tests are also used. The 𝑧-test can also be used to compare two proportions.
Section 9.1: Testing the Difference between Two Means Using the z-Test
In many cases, researchers may not be interested in the true mean of a certain population; instead, they are
interested in comparing the means of two populations. Here, the hypotheses are:
Case I: Two-Tailed Test

𝐻𝑜 : 𝜇1 − 𝜇2 = 0 versus 𝐻1 : 𝜇1 − 𝜇2 ≠ 0
Or equivalently 𝐻𝑜 : 𝜇1 = 𝜇2 versus 𝐻1 : 𝜇1 ≠ 𝜇2
The researcher wants to know whether the two true means 𝜇1 and 𝜇2 are different or not.
Case II: Right-Tailed Test
𝐻𝑜 : 𝜇1 − 𝜇2 = 0 versus 𝐻1 : 𝜇1 − 𝜇2 > 0
Or equivalently 𝐻𝑜 : 𝜇1 = 𝜇2 versus 𝐻1 : 𝜇1 > 𝜇2
The researcher wants to know whether the first mean, 𝜇1 , exceeds the second mean, 𝜇2 , or not.
Case III: Left-Tailed Test
𝐻𝑜 : 𝜇1 − 𝜇2 = 0 versus 𝐻1 : 𝜇1 − 𝜇2 < 0
Or equivalently 𝐻𝑜 : 𝜇1 = 𝜇2 versus 𝐻1 : 𝜇1 < 𝜇2
The researcher wants to know whether the first mean, 𝜇1 , is lower than the second mean, 𝜇2 ,or not.

To use the z-test, make sure the following assumptions are satisfied:
1. The samples must be independent of each other. That is, there can be no relationship between the subjects
in each sample.
2. The standard deviations, 𝜎1 and 𝜎2 , of both populations must be known.
3. If the sample sizes, 𝑛1 and 𝑛2 , are less than 30, the populations must be normally distributed.
The steps of the hypothesis testing about the difference between means using z-test:
Step 1: State the null and alternative hypotheses and identify the claim.
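Step 2: Compute the test value (test statistic)
$$z_o = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$
Note:
1. $\bar{x}_1 - \bar{x}_2$ is the best point estimator for $\mu_1 - \mu_2$.
2. The term $\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$ is the standard error of $\bar{x}_1 - \bar{x}_2$.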
Step 3: Choose the level of significance, 𝛼.
Step 4: Evaluate the critical value and find the critical (rejection) region.
For right-tailed tests the rejection region is the region on the right of 𝑧𝛼
For left-tailed tests the rejection region is the region on the left of −𝑧𝛼 .
For two-tailed tests the rejection region is the region on the right of 𝑧𝛼/2 and the region on the left of
−𝑧𝛼/2 .
Step 5: Make the decision to reject or not reject the null hypothesis 𝐻𝑜 .
In all types of tests, the null hypothesis should be rejected when the value of the test statistic 𝑧𝑜 belongs to the
rejection region.

Confidence intervals for the difference between two means

Confidence intervals for the difference between two means can also be found. When you are hypothesizing a
difference of 𝐷𝑜 , if the confidence interval contains 𝐷𝑜 , the null hypothesis is not rejected; if the confidence
interval does not contain 𝐷𝑜 , the null hypothesis is rejected. In particular, we can conclude that the two means
are not different if the C.I contains zero.
The formula for the (1 − 𝛼)100% confidence interval for the difference between two population means is
$$(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
Example 1:
Analyses of drinking water samples for 100 homes in each of two different sections of a city gave the following
means of lead levels (in parts per million):
            Sample size   Sample mean   Population standard deviation
Section 1   100           34.1          5.9
Section 2   100           36.0          6.0
Do the data provide sufficient evidence to indicate that there is a difference in the two population means? Use
𝛼 = 0.05
Solution:
Here, we will use the z-test since the population standard deviations are known.
1. State the hypotheses and identify the claim.
𝐻𝑜 : 𝜇1 − 𝜇2 = 0 versus 𝐻1 : 𝜇1 − 𝜇2 ≠ 0. This is a two-tailed test; here 𝐻1 is the claim.
2. 𝛼 = 0.05 and the test is two-tailed, so 𝑧𝛼/2 = 𝑧0.025 = 1.96
3. Compute the test statistic $z_o = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \dfrac{34.1 - 36.0}{\sqrt{\dfrac{(5.9)^2}{100} + \dfrac{(6.0)^2}{100}}} = -2.258$
4. The rejection region is the region on the right of 1.96 or on the left of −1.96.
5. Make the decision. Reject the null hypothesis 𝐻𝑜 , since the test statistic 𝑧𝑜 falls in the rejection region.
6. Summarize the results. At the 5% level of significance, there is sufficient evidence to support the claim that
the means of lead levels in water for the 2 sections are different.
Example 2:
Using the data given in Example 1, construct a 95% C.I for the difference between the mean of lead
levels in the sections.
Solution:
Here, $\alpha = 0.05 \rightarrow z_{\alpha/2} = z_{0.025} = 1.96$
A 95% C.I for the difference $\mu_2 - \mu_1$ is $(36 - 34.1) \pm 1.96\sqrt{\dfrac{(5.9)^2}{100} + \dfrac{(6.0)^2}{100}} = 1.9 \pm 1.65 = (0.25, 3.55)$
Since the constructed interval does not contain the number 0, we can say the two means are not equal
at 𝛼 = 0.05
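The two-sample 𝑧-test and the matching confidence interval from the lead-level examples above can be computed together, since both use the same standard error. The following Python sketch uses only the standard library; the function name `two_sample_z` and its argument order are assumptions made for illustration. It takes population variances, so standard deviations must be squared before they are passed in.

```python
from statistics import NormalDist


def two_sample_z(x1, var1, n1, x2, var2, n2, alpha):
    """Two-tailed z-test of H0: mu1 = mu2 and the (1 - alpha)100% CI for mu1 - mu2."""
    se = (var1 / n1 + var2 / n2) ** 0.5            # standard error of x1 - x2
    z_o = (x1 - x2) / se                           # test statistic
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    diff = x1 - x2
    ci = (diff - z_crit * se, diff + z_crit * se)
    return z_o, abs(z_o) > z_crit, ci


# Lead levels: Section 2 minus Section 1, as in the confidence interval of Example 2
z_o, reject, ci = two_sample_z(36.0, 6.0 ** 2, 100, 34.1, 5.9 ** 2, 100, alpha=0.05)
print(round(z_o, 3), reject, tuple(round(v, 2) for v in ci))
```

Swapping the order of the two samples only changes the sign of the test statistic and of the interval endpoints; the two-tailed decision is the same, and it agrees with whether the interval contains 0.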

Example 3:
A researcher claims that the mean height for 9-year-old girls exceeds the mean height of 9-year-old boys. Two
random samples yielded the following results
                      Boys     Girls
Sample size           60       50
Mean height           123.5    126.2
Population variance   98       120
At 𝛼 = 0.10, is there sufficient evidence to support the claim?
Solution:
1. $H_o: \mu_G - \mu_B = 0$ versus $H_1: \mu_G - \mu_B > 0$. This is a right-tailed test, and $H_1$ is the claim.
2. $\alpha = 0.10$ and the test is right-tailed, so $z_\alpha = z_{0.10} = 1.282$.
3. The given data are $n_B = 60$, $\bar{x}_B = 123.5$, $\sigma_B^2 = 98$, $n_G = 50$, $\bar{x}_G = 126.2$, $\sigma_G^2 = 120$.
4. Compute the test statistic:
$$z_o = \frac{\bar{x}_G - \bar{x}_B}{\sqrt{\frac{\sigma_G^2}{n_G} + \frac{\sigma_B^2}{n_B}}} = \frac{126.2 - 123.5}{\sqrt{\frac{120}{50} + \frac{98}{60}}} = 1.344$$
5. The rejection region is the region on the right of 1.282.
6. Reject the null hypothesis $H_o$, since the test statistic $z_o$ falls in the rejection region.
7. At the 10% level of significance, the mean height of girls is significantly higher than the mean height of boys at 9 years old.

Example 4:
The researcher claims that the mean height for 9-year-old boys exceeds 120 cm. At 𝛼 = 0.10, is there sufficient
evidence to support the claim?
                Boys     Girls
Sample size     60       50
Mean height     123.5    126.2
Solution:
1. $H_o: \mu_B = 120$ versus $H_1: \mu_B > 120$. This is a right-tailed test, and $H_1$ is the claim.
2. $\alpha = 0.10$ and the test is right-tailed, so $z_\alpha = z_{0.10} = 1.282$.
3. The given data are $n_B = 60$, $\bar{x}_B = 123.5$, $\sigma_B^2 = 98$ (the population variance given in Example 3).
4. Compute the test statistic:
$$z_o = \frac{\bar{x}_B - \mu_0}{\sigma_B/\sqrt{n_B}} = \frac{123.5 - 120}{\sqrt{\frac{98}{60}}} = 2.739$$
5. The rejection region is the region on the right of 1.282.
6. Reject the null hypothesis $H_o$, since the test statistic $z_o$ falls in the rejection region.
7. At the 10% level of significance, the mean height of 9-year-old boys is significantly higher than 120 cm.

Section 9.2: Testing the Difference between Two Means Using the t-Test
The z-test is used to test the difference between two means when the population standard deviations are
known and the populations are normally distributed, or when both sample sizes are greater than or equal to 30.
In many situations, however, these conditions cannot be met—that is, the population standard deviations are
not known. In these cases, a t-test is used when the two samples are independent and when the samples are
taken from two normally distributed populations. Samples are independent when they are not related.
To use the t-test, make sure the following assumptions are satisfied:
1. The samples must be independent of each other. That is, there can be no relationship between the
subjects in each sample.
2. The standard deviations, 𝜎1 and 𝜎2 , of both populations are not known.
3. The populations must be normally distributed or approximately normally distributed.
The steps of the hypothesis testing about the difference between means using the t-test:
The test statistic is
$$t_o = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
For right-tailed tests the rejection region is the region on the right of $t_\alpha$.
For left-tailed tests the rejection region is the region on the left of $-t_\alpha$.
For two-tailed tests the rejection region is the region on the right of $t_{\alpha/2}$ and the region on the left of $-t_{\alpha/2}$.
The degrees of freedom are $D.f = Min\{n_1 - 1, n_2 - 1\}$.
Step 5: Make the decision to reject or not reject the null hypothesis 𝐻𝑜 . In all types of tests, the null hypothesis
should be rejected when the value of the test statistic 𝑡𝑜 belongs to the rejection region.
Step 6: Summarize the results.
Confidence intervals
Confidence intervals can also be found for the difference between two means. The formula for the (1 − α)100% confidence interval for the difference between two population means is
$$(\bar{x}_1 - \bar{x}_2) \mp t_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}, \quad D.f = Min\{n_1 - 1, n_2 - 1\}$$

Example 5:
The following table gives the percentages of oxygen uptake by air (the rest is by water) for redfish exposed to two temperature environments.
25°C: 49 34 24 32 52 14 28 18 28 47 60
33°C: 28 55 45 51 41 27 44 48 54 67 46 59
Assuming normality, do the given data present sufficient evidence to indicate that the mean percentage of oxygen uptake by air for redfish at 25°C is less than the mean at 33°C? Test using α = 0.05.
Solution:
Here, we will use the t-test since the population standard deviations are unknown.
1. $H_o: \mu_1 - \mu_2 = 0$ versus $H_1: \mu_1 - \mu_2 < 0$. This is a left-tailed test, and $H_1$ is the claim.
2. $\alpha = 0.05$ and the test is left-tailed, with $D.f = Min\{11 - 1, 12 - 1\} = 10$, so $t_{.05,10} = 1.812$.
3. To compute the test statistic, we must find the means and standard deviations of the two samples using the formulas given in Chapter 3: $\bar{x}_1 = 35.09$, $s_1 = 14.88$, $\bar{x}_2 = 47.08$, $s_2 = 11.62$.
$$t_o = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} = \frac{35.09 - 47.08}{\sqrt{\frac{(14.88)^2}{11} + \frac{(11.62)^2}{12}}} = -2.140$$
4. The rejection region is the region on the left of −1.812.
5. Make the decision. Reject the null hypothesis $H_o$, since the test statistic $t_o$ falls in the rejection region.
6. Summarize the results. At the 5% level of significance, there is sufficient evidence to support the claim that the mean percentage of oxygen uptake by air for redfish at 25°C is less than the mean at 33°C.

Example 6:
Using the data in Example 5, construct a 90% C.I for the difference between the mean percentage of oxygen uptake at 25°C and the mean at 33°C.
Solution:
Here, $\alpha = .10 \to \alpha/2 = .05$ and $D.f = Min\{11 - 1, 12 - 1\} = 10$, so $t_{.05,10} = 1.812$.
A 90% C.I for the difference is
$$(47.08 - 35.09) \mp (1.812)\sqrt{\frac{(14.88)^2}{11} + \frac{(11.62)^2}{12}} = 11.99 \mp 10.15 = (1.84, 22.14)$$
Since the constructed interval does not contain the number 0, we can say the two means are not equal at $\alpha = .10$; also, we can conclude the mean at 33°C exceeds the mean at 25°C.
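Examples 5 and 6 start from raw data, so the sample means and standard deviations have to be computed first. The sketch below is an illustration in Python, not taken from the notes: `two_mean_t` is a hypothetical helper name, and the critical value 1.812 is typed in from the t table because the Python standard library has no t-distribution quantile function.

```python
from math import sqrt
from statistics import mean, stdev

# Oxygen-uptake percentages from Example 5.
temp_25 = [49, 34, 24, 32, 52, 14, 28, 18, 28, 47, 60]
temp_33 = [28, 55, 45, 51, 41, 27, 44, 48, 54, 67, 46, 59]


def two_mean_t(sample1, sample2):
    """Return (t_o, D.f, standard error) for Ho: mu1 - mu2 = 0."""
    n1, n2 = len(sample1), len(sample2)
    se = sqrt(stdev(sample1) ** 2 / n1 + stdev(sample2) ** 2 / n2)
    t_o = (mean(sample1) - mean(sample2)) / se
    df = min(n1 - 1, n2 - 1)  # degrees of freedom rule used in these notes
    return t_o, df, se


t_o, df, se = two_mean_t(temp_25, temp_33)

# Assumption: t_{.05,10} = 1.812 is read from a t table.
t_crit = 1.812
reject_left_tailed = t_o < -t_crit

# 90% C.I for (mean at 33°C) - (mean at 25°C), as in Example 6.
diff = mean(temp_33) - mean(temp_25)
ci = (diff - t_crit * se, diff + t_crit * se)
print(round(t_o, 2), df, reject_left_tailed)
print(tuple(round(v, 2) for v in ci))
```

Because the script works with unrounded means and standard deviations, its results can differ from the hand calculation in the last decimal place.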
Example 7:
The average size of a farm in region 1 is 191 acres. The average size of a farm in region 2 is 199 acres. Assume
the data were obtained from two samples with standard deviations of 38 and 12 acres, respectively, and
sample sizes of 8 and 10, respectively. Can it be concluded at 𝛼 = 0.05 that the average size of the farms in the
two regions is different? Assume the populations are normally distributed.
Solution:
Here, we will use the t-test since the population standard deviations are unknown.
1. $H_o: \mu_1 - \mu_2 = 0$ versus $H_1: \mu_1 - \mu_2 \neq 0$. This is a two-tailed test, and $H_1$ is the claim.
2. $\alpha = 0.05$ and the test is two-tailed, with $D.f = Min\{8 - 1, 10 - 1\} = 7$, so $t_{.025,7} = 2.365$.
3. The means and standard deviations of the two samples are $n_1 = 8$, $\bar{x}_1 = 191$, $s_1 = 38$, $n_2 = 10$, $\bar{x}_2 = 199$, $s_2 = 12$.
$$t_o = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} = \frac{191 - 199}{\sqrt{\frac{(38)^2}{8} + \frac{(12)^2}{10}}} = -0.57$$
4. The rejection region is the region on the left of −2.365 and on the right of 2.365.
5. Do not reject the null hypothesis $H_o$, since the test statistic $t_o$ does not fall in the rejection region.
6. At the 5% level of significance, there is not sufficient evidence to support the claim that the average size of the farms is different.
Section 9.4: Testing the Difference Between Proportions

The z-test can be used to test the equality of two proportions. For example, a researcher might ask, Is the
proportion of men who exercise regularly less than the proportion of women who exercise regularly? Is there a
difference in the percentage of students who own a personal computer and the percentage of nonstudents who
own one?
Case I: Two-Tailed Test

𝐻𝑜 : 𝑝1 − 𝑝2 = 0 versus 𝐻1 : 𝑝1 − 𝑝2 ≠ 0
Or equivalently 𝐻𝑜 : 𝑝1 = 𝑝2 versus 𝐻1 : 𝑝1 ≠ 𝑝2
Here, the researcher wants to know whether the two true proportions are equal or not.
Case II: Right-Tailed Test

𝐻𝑜 : 𝑝1 − 𝑝2 = 0 versus 𝐻1 : 𝑝1 − 𝑝2 > 0
Or equivalently 𝐻𝑜 : 𝑝1 = 𝑝2 versus 𝐻1 : 𝑝1 > 𝑝2
Here, the researcher wants to know whether the first proportion exceeds the second proportion or not.
Case III: Left-Tailed Test

𝐻𝑜 : 𝑝1 − 𝑝2 = 0 versus 𝐻1 : 𝑝1 − 𝑝2 < 0
Or equivalently 𝐻𝑜 : 𝑝1 = 𝑝2 versus 𝐻1 : 𝑝1 < 𝑝2
Here, the researcher wants to know whether the first proportion is lower than the second proportion or not.

To use the z-test, make sure the following assumptions are satisfied:
1. The samples must be independent of each other. That is, there can be no relationship between the subjects
in each sample.
2. $n_1\hat{p}_1 \geq 5$ and $n_1(1 - \hat{p}_1) \geq 5$.
3. $n_2\hat{p}_2 \geq 5$ and $n_2(1 - \hat{p}_2) \geq 5$.
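The steps of the hypothesis testing about the difference in proportions using the z-test:
The test statistic is
$$z_o = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\bar{p}(1 - \bar{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}, \quad \text{where } \hat{p}_1 = \frac{X_1}{n_1},\ \hat{p}_2 = \frac{X_2}{n_2},\ \bar{p} = \frac{X_1 + X_2}{n_1 + n_2} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}$$
Note:
1. $\hat{p}_1 - \hat{p}_2$ is the best point estimator for $p_1 - p_2$.
2. The term $\sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$ is the standard error of $\hat{p}_1 - \hat{p}_2$.
3. Since $p_1$ and $p_2$ are unknown, the standard error is estimated by $\sqrt{\bar{p}(1 - \bar{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$.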
For right-tailed tests the rejection region is the region on the right of 𝑧𝛼
For left-tailed tests the rejection region is the region on the left of −𝑧𝛼 .
For two-tailed tests the rejection region is the region on the right of 𝑧𝛼/2 and the region on the left of
−𝑧𝛼/2 .
Step 5: Make the decision to reject or not reject the null hypothesis 𝐻𝑜 .
In all types of tests, the null hypothesis should be rejected when the value of the test statistic 𝑧𝑜 belongs to the
rejection region.
Step 6: Summarize the results.
Confidence intervals for the difference between two proportions
Confidence intervals for the difference between two proportions can also be found. The formula for the (1 − α)100% confidence interval for the difference between two population proportions is
$$(\hat{p}_1 - \hat{p}_2) \mp z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$
Example 8:
Suppose a drug company develops a new drug, designed to prevent colds. The company states that the drug is
equally effective for men and women. To test this claim, they choose a random sample of 100 women and 200
men from the population. At the end of the study, 48 of the women did not catch a cold; and 102 of the men did
not catch a cold. Based on these results, can we reject the company's claim that the drug is equally effective for
men and women? Use 𝛼 = 0.05
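Solution:
1. $H_o: p_w - p_m = 0$ versus $H_1: p_w - p_m \neq 0$. This is a two-tailed test, and $H_o$ is the claim.
2. $\alpha = 0.05$ and the test is two-tailed, so $z_{\alpha/2} = z_{0.025} = 1.96$.
3. Here, $\hat{p}_w = \frac{48}{100} = 0.48$, $\hat{p}_m = \frac{102}{200} = 0.51$, $\bar{p} = \frac{48 + 102}{100 + 200} = \frac{150}{300} = 0.5$.
4. $n_w\hat{p}_w = 100(0.48) = 48 \geq 5$ and $n_w(1 - \hat{p}_w) = 100(1 - 0.48) = 52 \geq 5$.
5. $n_m\hat{p}_m = 200(0.51) = 102 \geq 5$ and $n_m(1 - \hat{p}_m) = 200(1 - 0.51) = 98 \geq 5$.
6. Compute the test statistic:
$$z_o = \frac{\hat{p}_w - \hat{p}_m}{\sqrt{\bar{p}(1 - \bar{p})\left(\frac{1}{n_w} + \frac{1}{n_m}\right)}} = \frac{0.48 - 0.51}{\sqrt{0.5(1 - 0.5)\left(\frac{1}{100} + \frac{1}{200}\right)}} = -0.490$$
7. The rejection region is the region on the left of −1.96 and on the right of 1.96.
8. Do not reject the null hypothesis $H_o$, since the test statistic $z_o$ does not fall in the rejection region.
9. At the 5% level of significance, we cannot reject the company's claim that the drug is equally effective for men and women.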
Example 9:
Find the 95% C.I for the difference of proportions for the data in Example 8.
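Solution:
A (1 − α)100% C.I for the difference is
$$(\hat{p}_m - \hat{p}_w) \mp z_{\alpha/2}\sqrt{\frac{\hat{p}_m(1 - \hat{p}_m)}{n_m} + \frac{\hat{p}_w(1 - \hat{p}_w)}{n_w}}$$
$$(0.51 - 0.48) \mp 1.96\sqrt{\frac{0.51(1 - 0.51)}{200} + \frac{0.48(1 - 0.48)}{100}} = 0.03 \mp 0.12 = (-0.09, 0.15)$$
We can say with 95% confidence that $p_m - p_w \in (-0.09, 0.15)$.
Also, since the interval contains 0, we cannot conclude that the two proportions are different.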
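The test for two proportions uses the pooled proportion in the test statistic but the separate sample proportions in the confidence interval, which is easy to mix up. The following Python sketch makes the distinction explicit; `two_proportion_z` is an illustrative name, not something defined in the notes, and the critical value is taken from `statistics.NormalDist` as a stand-in for the z table.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2, alpha=0.05):
    """Two-tailed z-test and (1 - alpha)100% C.I for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    p_bar = (x1 + x2) / (n1 + n2)                               # pooled proportion
    se_pooled = sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))   # used in the test
    z_o = (p1 - p2) / se_pooled
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se_ci = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)       # used in the C.I
    margin = z_crit * se_ci
    return z_o, z_crit, (p1 - p2 - margin, p1 - p2 + margin)


# Example 8: 48 of 100 women and 102 of 200 men did not catch a cold.
z_o, z_crit, ci = two_proportion_z(48, 100, 102, 200)
print(round(z_o, 3), abs(z_o) > z_crit)
print(tuple(round(v, 2) for v in ci))

# Example 10: 45% of 200 workers now versus 35% of 200 ten years ago.
z_10, z_crit_10, _ = two_proportion_z(90, 200, 70, 200, alpha=0.01)
print(round(z_10, 2), abs(z_10) > z_crit_10)
```

The interval here is for women minus men, so it is the mirror image of the interval in Example 9.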

Example 10:
In a sample of 200 workers, 45% said that they missed work because of personal illness. Ten years ago in a
sample of 200 workers, 35% said that they missed work because of personal illness. At 𝛼 = 0.01, is there a
difference in the proportion?
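Solution:
1. $H_o: p_1 - p_2 = 0$ versus $H_1: p_1 - p_2 \neq 0$. This is a two-tailed test, and $H_1$ is the claim.
2. $\alpha = 0.01$ and the test is two-tailed, so $z_{\alpha/2} = z_{0.005} = 2.575$.
3. Here, $\hat{p}_1 = 0.45$, $\hat{p}_2 = 0.35$, $\bar{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2} = \frac{200(0.45) + 200(0.35)}{200 + 200} = \frac{160}{400} = 0.4$.
4. $n_1\hat{p}_1 = 200(0.45) = 90 \geq 5$ and $n_1(1 - \hat{p}_1) = 200(1 - 0.45) = 110 \geq 5$.
5. $n_2\hat{p}_2 = 200(0.35) = 70 \geq 5$ and $n_2(1 - \hat{p}_2) = 200(1 - 0.35) = 130 \geq 5$.
6. Compute the test statistic:
$$z_o = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\bar{p}(1 - \bar{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{0.45 - 0.35}{\sqrt{0.4(1 - 0.4)\left(\frac{1}{200} + \frac{1}{200}\right)}} = 2.04$$
7. The rejection region is the region on the left of −2.575 and on the right of 2.575.
8. Do not reject the null hypothesis $H_o$, since the test statistic $z_o$ does not fall in the rejection region.
9. At the 1% level of significance, there is not sufficient evidence to support the claim that there is a difference in the proportions.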
PROBLEMS
1. At age 9 the average weight (21.3 kg) and the average height (124.5 cm) for both boys and girls are exactly
the same. A random sample of 9-year-olds yielded these results. Estimate the mean difference in height
between boys and girls with 95% confidence interval. Does your interval support the claim?
Boys Girls
Sample size 60 50
Mean height, cm 123.5 126.2
2.7 ∓ 3.94

2. The average length of "short hospital stays" for men is longer than that for women, 5.2 days versus 4.5 days.
A random sample of recent hospital stays for both men and women revealed the following. At 𝛼 = 0.01, is
there sufficient evidence to conclude that the average hospital stay for men is longer than the average
hospital stay for women?
Men Women
Sample size 32 30
Sample mean, days 5.5 4.2
Population standard deviation 1.2 1.5
𝑧𝑜 = 3.75, 𝑟𝑒𝑗𝑒𝑐𝑡 𝐻𝑜
3. According to the Nielsen Media Research, children (ages 2 – 11) spend an average of 21 hours 30 minutes
watching television per week while teens (ages 12 – 17) spend an average of 20 hours 40 minutes. Based on
the sample statistics obtained below, is there sufficient evidence to conclude a difference in average
television watching times between the two groups? Use 𝛼 = 0.01.
Children Teens
Sample size 15 15
Sample mean, hours 22.45 18.50
Sample variance 16.4 18.2
𝑡𝑜 = 2.60, 𝑑𝑜 𝑛𝑜𝑡 𝑟𝑒𝑗𝑒𝑐𝑡 𝐻𝑜
4. Females and males alike from the general adult population volunteer. A random sample of 20 female
college students and 18 male college students indicated these results concerning the amount of time spent
in volunteer service per week. At the 0.10 level of significance, is there sufficient evidence to conclude that
the mean number of volunteer hours per week for male is less than the mean number of volunteer hours
per week for females?
Male Female
Sample size 18 20
Sample mean, hours 2.5 3.8
Sample variance 2.2 3.5
𝑡𝑜 = −2.38, 𝑟𝑒𝑗𝑒𝑐𝑡 𝐻𝑜

5. Health Care Knowledge Systems reported that an insured woman spends on average 2.3 days in the
hospital for a routine childbirth, while an uninsured woman spends on average 1.9 days. Assume two
samples of 16 women each were used in both samples. The standard deviation of the first sample is 0.6 day,
and the standard deviation of the second sample is 0.3 day. At 𝛼 = 0.05, test the claim that the means are
equal. Find the 95% confidence interval for the difference of the means.
𝑡𝑜 = 2.39, 𝑟𝑒𝑗𝑒𝑐𝑡 𝐻𝑜
6. In a sample of 200 men, 130 said they used seat belts. In a sample of 300 women, 63 said they used seat belts. Test the claim that men are more safety-conscious than women, at $\alpha = 0.05$. Find the 90% confidence interval for the difference in proportions.
$z_o = 9.90$, reject $H_o$; $0.44 \mp 0.07$

BIOSTATISTICS
CHAPTER 11: Chi-Square Tests
Section 11 – 1: Chi-Square Test for Independence

The chi-square test for independence can be used to test the independence of two variables. For example,
suppose a new postoperative procedure is administered to a number of patients in a large hospital. The
researcher can ask the question, Do the doctors feel differently about this procedure from the nurses, or do
they feel basically the same way? Note that the question is not whether they prefer the procedure or not but
whether there is a difference of opinion between the two groups; doctors and nurses.
To answer this question, a researcher selects a sample of nurses and doctors and tabulates the data in table
form, as shown.
Group     Prefer new procedure   Prefer old procedure   No preference
Nurses    100                    80                     20
Doctors   30                     60                     10
As the survey indicates, 100 nurses prefer the new procedure, 80 prefer the old procedure, and 20 have no
preference; 30 doctors prefer the new procedure, 60 prefer the old procedure, and 10 have no preference. Since
the main question is whether there is a difference in opinion, the null hypothesis is stated as follows:
𝑯𝒐 : The opinion about the procedure is independent of (not related to) the profession.
The alternative hypothesis is stated as follows:
𝑯𝟏 : The opinion about the procedure is dependent on (related to) the profession.
If the null hypothesis is not rejected, the test means that both professions feel basically the same way about the
procedure and the differences are due to chance. If the null hypothesis is rejected, the test means that one
group feels differently about the procedure from the other. Remember that rejection does not mean that one
group favors the procedure and the other does not. Perhaps the two groups favor it or both dislike it, but in
different proportions.

To test the null hypothesis by using chi-square test for independence, you must follow the steps:
Step 1
Arrange the data obtained from the sample in a contingency table. The table is made up of $R$ rows and $C$ columns. Here $R = 2$ and $C = 3$. A contingency table is designated as an $R \times C$ table; in this example the table is a 2 × 3 contingency table. Each block in the table is called a cell and is designated by its row and column position. For example, the cell with the observed frequency of 80 is designated as $O_{12}$. The cells are shown below:
         Column 1   Column 2   Column 3
Row 1    $O_{11}$   $O_{12}$   $O_{13}$
Row 2    $O_{21}$   $O_{22}$   $O_{23}$
Step 2
Compute the expected frequencies, assuming that the null hypothesis is true. These frequencies are computed by using the observed frequencies given in the table. The expected frequency for each cell ($E_{ij}$) is computed by using the formula
$$E_{ij} = \frac{TR_i \times TC_j}{n}$$
where $TR_i$ is the sum of frequencies in the $i$th row, $TC_j$ is the sum of frequencies in the $j$th column, and $n$ is the sample size (the sum of all frequencies).
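The observed frequencies, with the expected frequencies in parentheses, are:
Group     Prefer new procedure   Prefer old procedure   No preference   Total
Nurses    100 (86.67)            80 (93.33)             20 (20)         200
Doctors   30 (43.33)             60 (46.67)             10 (10)         100
Total     130                    140                    30              300
For example, $E_{11} = \frac{TR_1 \times TC_1}{n} = \frac{200 \times 130}{300} = 86.67$ and $E_{12} = \frac{TR_1 \times TC_2}{n} = \frac{200 \times 140}{300} = 93.33$.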
Note:
1. Sum of observed frequencies in the ith row = Sum of expected frequencies in the ith row = 𝑇𝑅𝑖
2. Sum of observed frequencies in the jth column = Sum of expected frequencies in the jth column = 𝑇𝐶𝑗
3. Sum of all observed frequencies = Sum of all expected frequencies = n
Step 3
Compute the test statistic, $\chi_o^2$, by using the following formula
$$\chi_o^2 = \sum \frac{(O - E)^2}{E} = \sum \frac{O^2}{E} - n$$
Note: The value of the test statistic cannot be negative.
For this example,
$$\chi_o^2 = \frac{100^2}{86.67} + \frac{80^2}{93.33} + \frac{20^2}{20} + \frac{30^2}{43.33} + \frac{60^2}{46.67} + \frac{10^2}{10} - 300 = 11.86$$
Step 4
Determine the level of significance $\alpha$ and find the value $\chi_\alpha^2$ at degrees of freedom $D.f = (R - 1)(C - 1)$.
In this example, $D.f = (2 - 1)(3 - 1) = 2$. If $\alpha = 0.05$, the critical value will be $\chi_{0.05,2}^2 = 5.991$.
Step 5
The 5th step is to make the decision. The null hypothesis will be rejected if 𝜒𝑜2 > 𝜒𝛼2
In this example, 𝜒𝑜2 = 11.86 > 5.991 = 𝜒𝛼2 , the decision is to reject 𝐻𝑜
Step 6
Summarize the results. There is sufficient evidence to support the claim that the opinion is related to (dependent on) the profession, that is, the doctors and nurses differ in their opinions about the procedure.
Assumptions for the chi-square test
1. The data are obtained from a random sample.

2. The expected frequency in each cell must be 5 or more ($E_{ij} \geq 5$).
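Steps 2 to 4 of the test for independence can be sketched in Python as follows. This is an illustration under stated assumptions rather than part of the notes: `chi_square_independence` is a hypothetical helper, the table is passed as a list of rows, and the critical value 5.991 is copied from the chi-square table.

```python
def chi_square_independence(observed):
    """Return (chi-square statistic, D.f, expected table) for an R x C table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # E_ij = (row total i) * (column total j) / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected


# Nurses and doctors: prefer new, prefer old, no preference.
observed = [[100, 80, 20], [30, 60, 10]]
chi2, df, expected = chi_square_independence(observed)

# Assumption check: every expected frequency should be at least 5.
assumption_ok = all(e >= 5 for row in expected for e in row)

chi2_crit = 5.991  # chi-square critical value for alpha = 0.05, D.f = 2 (from the table)
print(round(chi2, 2), df, assumption_ok, chi2 > chi2_crit)
```

With unrounded expected frequencies the statistic comes out near 11.87; the value 11.86 in the worked example results from rounding the expected frequencies to two decimals first.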

Example 1:
A researcher wishes to determine whether there is a relationship between the gender of an individual and the
amount of coffee consumed. A random sample of 68 people is selected, and the following data are obtained:
          Coffee consumption
Gender    Low   Moderate   High   Total
Male      10    9          8      27
Female    13    16         12     41
Total     23    25         20     68
At 𝛼 = 0.10, can the researcher conclude the coffee consumption is related to gender?
Solution:
1. State the hypotheses.

𝐻𝑜 : The amount of coffee that a person consumes is not related to the individual’s gender.
𝐻1 : The amount of coffee that a person consumes is related to the individual’s gender. (claim)
2. Compute the expected frequencies:
• $E_{1,1} = \frac{27 \times 23}{68} = 9.13$
• $E_{1,2} = \frac{27 \times 25}{68} = 9.93$
• $E_{1,3} = \frac{27 \times 20}{68} = 7.94 = 27 - (9.13 + 9.93)$
• $E_{2,1} = \frac{41 \times 23}{68} = 13.87 = 23 - 9.13$
• $E_{2,2} = \frac{41 \times 25}{68} = 15.07 = 25 - 9.93$
• $E_{2,3} = \frac{41 \times 20}{68} = 12.06 = 20 - 7.94$
          Coffee consumption
Gender    Low          Moderate     High         Total
Male      10 (9.13)    9 (9.93)     8 (7.94)     27
Female    13 (13.87)   16 (15.07)   12 (12.06)   41
Total     23           25           20           68
3. Compute the test value:
$$\chi_o^2 = \frac{10^2}{9.13} + \frac{9^2}{9.93} + \frac{8^2}{7.94} + \frac{13^2}{13.87} + \frac{16^2}{15.07} + \frac{12^2}{12.06} - 68 = 0.283$$
4. Find the critical value at $\alpha = 0.10$ with $D.f = (2 - 1)(3 - 1) = 2$: $\chi_{0.10,2}^2 = 4.605$.
5. Make the decision. Do not reject $H_o$, since $\chi_o^2 = 0.283 < 4.605 = \chi_\alpha^2$.
6. Summarize the results. There is not sufficient evidence to support the claim that the amount of coffee a person consumes is related to the individual's gender.
7. Notice that all expected frequencies are greater than 5.
Example 2:
Use the data in the above Example to construct a 95% confidence interval for:
a) The proportion of the population who consume coffee with high level.
b) The proportion of males who consume coffee with low level.
c) The difference between the proportions of males and females who consume coffee with high level.
Solution Part (a)
1. To find the confidence interval we must determine $n$ and $\hat{p}$; here, $n = 68$, $\hat{p} = \frac{20}{68} = 0.294$.
2. Compute $n\hat{p}$ and $n(1 - \hat{p})$; here $n\hat{p} = 20 \geq 5$ and $n(1 - \hat{p}) = 48 \geq 5$, so we can use $z$.
3. Here, $\alpha = 0.05 \to \frac{\alpha}{2} = 0.025 \to z_{0.025} = 1.96$.
4. The 95% C.I is
$$0.294 \mp 1.96\sqrt{\frac{0.294(1 - 0.294)}{68}} = 0.294 \mp 0.108 = (0.186, 0.402)$$
Solution Part (b)
1. To find the confidence interval we must determine $n$ and $\hat{p}$; here, $n = 27$, $\hat{p} = \frac{10}{27} = 0.37$.
2. Compute $n\hat{p}$ and $n(1 - \hat{p})$; here $n\hat{p} = 10 \geq 5$ and $n(1 - \hat{p}) = 17 \geq 5$, so we can use $z$.
3. Here, $\alpha = 0.05 \to \frac{\alpha}{2} = 0.025 \to z_{0.025} = 1.96$.
4. The 95% C.I is
$$0.37 \mp 1.96\sqrt{\frac{0.37(1 - 0.37)}{27}} = 0.37 \mp 0.18 = (0.19, 0.55)$$

Solution Part (c)
1. To find the confidence interval we must determine $n_1$, $n_2$, $\hat{p}_1$, $\hat{p}_2$; here, $n_1 = 27$, $n_2 = 41$, $\hat{p}_1 = \frac{8}{27} = 0.296$, $\hat{p}_2 = \frac{12}{41} = 0.293$.
2. Compute $n\hat{p}$ and $n(1 - \hat{p})$ for each group; here $n_1\hat{p}_1 = 8 \geq 5$, $n_1(1 - \hat{p}_1) = 19 \geq 5$ and $n_2\hat{p}_2 = 12 \geq 5$, $n_2(1 - \hat{p}_2) = 29 \geq 5$, so we can use $z$.
3. Here, $\alpha = 0.05 \to \frac{\alpha}{2} = 0.025 \to z_{0.025} = 1.96$.
4. The 95% C.I is
$$(0.296 - 0.293) \mp 1.96\sqrt{\frac{0.296(1 - 0.296)}{27} + \frac{0.293(1 - 0.293)}{41}} = 0.003 \mp 0.221 = (-0.218, 0.224)$$
Since the interval contains 0, there is no evidence that the proportion of adults who consume coffee with high levels differs between males and females.
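Parts (a) and (b) of Example 2 apply the same one-proportion interval to different counts. A small Python sketch of that calculation follows; the helper name `proportion_ci` is an assumption made for illustration, and `statistics.NormalDist` supplies the z value in place of the table.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(x, n, confidence=0.95):
    """Large-sample C.I for a proportion: p_hat -/+ z * sqrt(p_hat(1 - p_hat)/n)."""
    p_hat = x / n
    # The notes require n*p_hat >= 5 and n*(1 - p_hat) >= 5 before using z.
    if min(x, n - x) < 5:
        raise ValueError("normal approximation not justified for these counts")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Part (a): 20 of 68 people consume coffee at a high level.
low_a, high_a = proportion_ci(20, 68)
# Part (b): 10 of 27 males consume coffee at a low level.
low_b, high_b = proportion_ci(10, 27)
print(round(low_a, 3), round(high_a, 3))
print(round(low_b, 2), round(high_b, 2))
```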
Section 11 – 2: Test for Goodness of Fit

The chi-square statistic can be used to see whether a frequency distribution fits a specific pattern. For example,
an emergency service may want to see whether it receives more calls at certain times of the day than at others,
so it can provide adequate staffing. A traffic engineer may wish to see whether accidents occur more often on
some days than on others, so that he can increase police patrols accordingly.
When you are testing to see whether a frequency distribution fits a specific pattern, you can use the chi-square
goodness-of-fit test.
For example, suppose as a market analyst you wish to see whether consumers have any preference among five
flavors of a new fruit soda. A sample of 100 people provided these data:
Mango   Strawberry   Orange   Lime   Grape
32      28           16       14     10

To answer this question, we will use the chi-square goodness-of-fit test as shown below:
Step 1:
State the hypotheses 𝐻𝑜 and 𝐻1 , and identify which one is the claim.
In the above example:
$H_o$: Consumers show no preference for flavors of the fruit soda ($p_M = p_S = p_O = p_L = p_G = \frac{1}{5}$).
$H_1$: Consumers show a preference (not $H_o$).
Step 2:
Compute the test value $\chi_o^2$ by evaluating the expected frequencies, $E_i = np_i$.
In this example: $E_M = E_S = E_O = E_L = E_G = 100\left(\frac{1}{5}\right) = 20$
           Mango   Strawberry   Orange   Lime   Grape   Total
Observed   32      28           16       14     10      100
Expected   20      20           20       20     20      100
$$\chi_o^2 = \sum \frac{(O - E)^2}{E} = \sum \frac{O^2}{E} - n = \frac{32^2}{20} + \frac{28^2}{20} + \frac{16^2}{20} + \frac{14^2}{20} + \frac{10^2}{20} - 100 = 18$$
Step 3: Choose the level of significance, $\alpha$.
Step 4: Find the rejection region (critical region), which is on the right of $\chi_\alpha^2$ with $D.f = k - 1$, where $k$ is the number of categories.
In this example $D.f = k - 1 = 5 - 1 = 4$. At $\alpha = 0.05$, $\chi_{.05,4}^2 = 9.49$.
Step 5: If $\chi_o^2 > \chi_\alpha^2$, then reject $H_o$. Here $\chi_o^2 = 18 > \chi_{.05,4}^2 = 9.49$, so reject $H_o$.
We can say at $\alpha = .05$ that consumers show a preference for flavors of the fruit soda.
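The goodness-of-fit steps above can be sketched in Python as follows. The helper `goodness_of_fit` is a hypothetical name introduced for illustration, the hypothesized proportions are passed in explicitly, and the critical value 9.49 is copied from the chi-square table rather than computed.

```python
def goodness_of_fit(observed, proportions):
    """Return (chi-square statistic, D.f, expected counts) for Ho: p_i as given."""
    n = sum(observed)
    expected = [n * p for p in proportions]   # E_i = n * p_i
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, len(observed) - 1, expected


# Mango, strawberry, orange, lime, grape.
observed = [32, 28, 16, 14, 10]
chi2, df, expected = goodness_of_fit(observed, [1 / 5] * 5)

chi2_crit = 9.49  # chi-square critical value for alpha = 0.05, D.f = 4 (from the table)
print(round(chi2, 2), df, chi2 > chi2_crit)
```

Passing a different list of proportions, for example `[0.3, 0.3, 0.2, 0.1, 0.1]`, tests a non-uniform pattern with the same function.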

Example 3:
For the data in the above example, test the claim that the consumers prefer the mango, strawberry, orange, lime, and grape flavors with ratio 3:3:2:1:1, respectively. Use $\alpha = 0.10$.
           Mango   Strawberry   Orange   Lime   Grape   Total
Observed   32      28           16       14     10      100
To answer this question, we will use the chi-square goodness-of-fit test as shown below:
Step 1:
State the hypotheses 𝐻𝑜 and 𝐻1 , and identify which one is the claim.
In the above example:
$p_M = p_S = \frac{3}{3+3+2+1+1} = 0.3$, $p_O = \frac{2}{3+3+2+1+1} = 0.2$, $p_L = p_G = \frac{1}{3+3+2+1+1} = 0.1$
$H_o$: $p_M = 0.3$, $p_S = 0.3$, $p_O = 0.2$, $p_L = 0.1$, $p_G = 0.1$
$H_1$: (not $H_o$)
Step 2:
Compute the test value $\chi_o^2$ by evaluating the expected frequencies, $E_i = np_i$.
In this example:
$E_M = np_M = 100(0.3) = 30$,
$E_S = np_S = 100(0.3) = 30$,
$E_O = np_O = 100(0.2) = 20$,
$E_L = np_L = 100(0.1) = 10$,
$E_G = np_G = 100(0.1) = 10$
           Mango   Strawberry   Orange   Lime   Grape   Total
Observed   32      28           16       14     10      100
Expected   30      30           20       10     10      100
$$\chi_o^2 = \sum \frac{(O - E)^2}{E} = \sum \frac{O^2}{E} - n = \frac{32^2}{30} + \frac{28^2}{30} + \frac{16^2}{20} + \frac{14^2}{10} + \frac{10^2}{10} - 100 = 2.667$$
Step 3: Choose the level of significance, $\alpha$.
Step 4: Find the rejection region (critical region), which is on the right of $\chi_\alpha^2$ with $D.f = k - 1$, where $k$ is the number of categories.
In this example $D.f = k - 1 = 5 - 1 = 4$. At $\alpha = 0.10$, $\chi_{.10,4}^2 = 7.78$.
Step 5: If $\chi_o^2 > \chi_\alpha^2$, then reject $H_o$. Here $\chi_o^2 = 2.667 < \chi_{.10,4}^2 = 7.78$, so do not reject $H_o$.
At $\alpha = .10$, there is not sufficient evidence to reject the claim that the consumers prefer the mango, strawberry, orange, lime, and grape flavors with ratio 3:3:2:1:1.

Example 4:
In a certain population, the percents of people with each blood type are as follows: O, 6%; A, 40%; B, 42%; and
AB, 12%. At a recent blood drive at a large university, the donors were classified as shown below. At 𝛼 = .05, is
there sufficient evidence to conclude that the proportions differ from those stated above?
O    A    B    AB   Total
10   65   60   15   150
Solution:
Step 1:
$H_o$: $p_O = .06$, $p_A = .40$, $p_B = .42$, $p_{AB} = .12$
$H_1$: not $H_o$ (claim)
Step 2:
$n = 150$
$E_O = np_O = 150(.06) = 9$,
$E_A = np_A = 150(.40) = 60$,
$E_B = np_B = 150(.42) = 63$,
and $E_{AB} = np_{AB} = 150(.12) = 18$
           O    A    B    AB   Total
Observed   10   65   60   15   150
Expected   9    60   63   18   150
$$\chi_o^2 = \frac{10^2}{9} + \frac{65^2}{60} + \frac{60^2}{63} + \frac{15^2}{18} - 150 = 1.171$$
Step 3:
$\alpha = 0.05$ and the degrees of freedom are $D.f = 4 - 1 = 3$, so $\chi_{.05,3}^2 = 7.81$.
Step 4:
Since $\chi_o^2 = 1.171$ is not greater than $\chi_{.05,3}^2 = 7.81$, do not reject the null hypothesis; there is not sufficient evidence to support the claim that the proportions differ from those stated.

PROBLEMS
1. Listed below is information regarding organ transplantation for three different years. Based on these data,
is there sufficient evidence at 𝛼 = 0.01 to conclude that a relationship exists between year and type of
transplant?
Year Heart Kidney Lung

2003 2056 870 1085
2004 2016 880 1173
2005 2127 903 1408
𝜒𝑜2 = 23.210
2. A study is being conducted to determine whether the age of the customer is related to type of movie he or
she rents. A sample of renters gives the data shown here. At 𝛼 = 0.10, is the type of movie selected related
to customer’s age?
Type of movie
Age Documentary Comedy Mystery
12 – 20 14 9 8
21 – 29 15 14 9
30 – 38 9 21 39
39 – 47 7 22 17
48 and over 6 38 12
𝜒𝑜2 = 46.696
3. To test the effectiveness of a new drug, a researcher gives one group of individuals the new drug and
another group a placebo. The results of the study are shown here. At 𝛼 = 0.05, can the researcher conclude
that the drug is effective?
Medication Effective Not effective

Drug 32 9
Placebo 12 18
𝜒𝑜2 = 10.637
4. To test whether a die is fair, a student rolled the die 300 times and the following data were obtained
1 2 3 4 5 6 Total
Observed frequency 45 52 60 47 48 48 300
At 𝛼 = 0.10, can the student conclude that the die is fair? 𝜒𝑜2 = 2.92

CRITICAL VALUES OF $\chi_\alpha^2$
D.F $\chi_{.100}^2$ $\chi_{.050}^2$ $\chi_{.025}^2$ $\chi_{.010}^2$ $\chi_{.005}^2$ D.F

1 2.706 3.841 5.024 6.635 7.879 1
2 4.605 5.991 7.378 9.210 10.597 2
3 6.251 7.815 9.348 11.345 12.838 3
4 7.780 9.488 11.143 13.277 14.860 4
5 9.236 11.071 12.833 15.086 16.750 5
6 10.645 12.592 14.449 16.812 18.548 6

7 12.017 14.067 16.013 18.475 20.278 7
8 13.362 15.507 17.535 20.090 21.955 8
9 14.684 16.919 19.023 21.666 23.589 9
10 15.987 18.307 20.483 23.209 25.188 10
11 17.275 19.675 21.920 24.725 26.757 11

12 18.549 21.026 23.337 26.217 28.300 12
13 19.812 22.362 24.736 27.688 29.819 13
14 21.064 23.685 26.119 29.141 31.319 14
15 22.307 24.996 27.488 30.578 32.801 15
16 23.542 26.296 28.845 32.000 34.267 16

17 24.769 27.587 30.191 33.409 35.719 17
18 25.989 28.869 31.526 34.805 37.156 18
19 27.204 30.144 32.852 36.191 38.582 19
20 28.412 31.410 34.170 37.566 39.997 20
21 29.615 32.671 35.479 38.932 41.401 21

22 30.813 33.924 36.781 40.289 42.796 22
23 32.007 35.173 38.076 41.638 44.181 23
24 33.196 36.415 39.364 42.980 45.559 24
25 34.382 37.653 40.647 44.314 46.928 25
26 35.563 38.885 41.923 45.642 48.290 26

27 36.741 40.113 43.194 46.963 49.645 27
28 37.916 41.337 44.461 48.278 50.993 28
29 39.088 42.557 45.722 49.588 52.336 29
30 40.256 43.773 46.979 50.892 53.672 30

BIOSTATISTICS
CHAPTER 12: One-Way Analysis of Variance (ANOVA)

Introduction
One of the techniques that can be used to compare two or more population means is the analysis of variance (ANOVA)
technique. This technique is used to test claims involving two or more means. For example, suppose a
researcher wishes to see whether the means of the time it takes three groups of students to solve a computer
problem using Fortran, Basic, and Pascal are different. The researcher will use the ANOVA technique for this
test.
The analysis of variance that is used to compare two or more means is called a one-way analysis of variance
since it contains only one variable. In the previous example, the variable is the type of computer language used.
The analysis of variance can be extended to studies involving two variables, such as type of computer language
used and mathematical background of the students. These studies involve a two-way analysis of variance.
Case Study
A researcher wishes to try three different methods to lower the blood pressure of individuals diagnosed with
high blood pressure. The subjects were randomly assigned to three groups each of 5 subjects; the first group
takes medication, the second group exercises, and the third group follows a special diet. After four weeks, the
reduction in each person’s blood pressure is recorded and the following data obtained:
Medication Exercise Diet
10 6 5
12 8 9
9 3 12
15 0 8
13 2 4
The researcher wishes to know whether the three methods are equivalent or not, if not which is the best
method to lower the blood pressure.
To answer these questions, we will explain how to use the ANOVA technique in the following section.

One-Way Analysis of Variance (ANOVA)

Suppose that we want to compare 𝑘 population means, we can use the ANOVA technique only when the
following assumptions are satisfied:
Assumptions for ANOVA
1. The populations from which the samples were obtained must be normally distributed.
2. The samples must be independent from one another.
3. The variances of the populations must be equal. 𝜎12 = 𝜎22 = ⋯ = 𝜎𝑘2 = 𝜎 2
To test whether there are differences between the 𝑘 population means, follow the steps below:
Step 1:
State the hypotheses and identify the claim.
$H_o: \mu_1 = \mu_2 = \cdots = \mu_k$
$H_1$: At least one mean is different from the others (not $H_o$).
Step 2:
Compute the test value. Even though you are comparing means in the ANOVA, variances are used in the test
instead of means. Two different estimates of the population variance are made.
1. The first estimate is called the between-group variance (denoted by $s_B^2$), and it involves finding the variance of the means.
2. The second estimate, the within-group variance (denoted by $s_W^2$), is made by computing the mean of the group variances, and it is not affected by differences in the means.
Note: In this case, the variance $s_W^2$ is also called the pooled variance and is denoted by $s_p^2$.
If there is no difference in the means, the between-group variance estimate will be approximately equal to the
within-group variance estimate, and then the null hypothesis will not be rejected. However, when the means
differ largely, the between-group variance will be much larger than the within-group variance, and then the
null hypothesis will be rejected.
Since variances are compared, this procedure is called analysis of variance (ANOVA).

**How to evaluate the between-group variance ($s_B^2$) and the within-group variance ($s_W^2$)
1. Evaluate the mean and the variance for each group: $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_k$ and $s_1^2, s_2^2, \ldots, s_k^2$.
2. Evaluate the combined mean, $\bar{x}_c = \frac{\sum n_i\bar{x}_i}{n} = \frac{n_1\bar{x}_1 + n_2\bar{x}_2 + \cdots + n_k\bar{x}_k}{n}$, where $n_i$ is the number of observations in the $i$th group and $n = n_1 + n_2 + \cdots + n_k$.
Note: If the groups are of equal sizes, $n_1 = n_2 = \cdots = n_k$, then $\bar{x}_c = \frac{\sum \bar{x}_i}{k} = \frac{\bar{x}_1 + \bar{x}_2 + \cdots + \bar{x}_k}{k}$.
3. Then the between-group variance is $s_B^2 = \frac{\sum n_i(\bar{x}_i - \bar{x}_c)^2}{k - 1} = \frac{\sum n_i\bar{x}_i^2 - n\bar{x}_c^2}{k - 1}$; this formula finds the variance among the means.
4. Evaluate the within-group variance, $s_W^2 = \frac{\sum (n_i - 1)s_i^2}{n - k} = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \cdots + (n_k - 1)s_k^2}{n - k}$.
Note: If the groups are of equal sizes, $n_1 = n_2 = \cdots = n_k$, then $s_W^2 = \frac{\sum s_i^2}{k} = \frac{s_1^2 + s_2^2 + \cdots + s_k^2}{k}$.
Note: The within-group variance $s_W^2$ is considered as an estimator of $\sigma^2$.
5. Evaluate the test statistic $F_o = \frac{s_B^2}{s_W^2}$.
Step 3:
Determine the level of significance $\alpha$ and then find the critical value $F_\alpha$ with $D.f_1 = k - 1$ and
$D.f_2 = n - k$. This critical value is obtained from the table of the F-distribution.
Step 4:
Make the decision. The decision is to reject the null hypothesis when the evaluated test statistic is
greater than the critical value, that is, when 𝐹𝑜 > 𝐹𝛼.
Step 5:
Summarize the results.
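
As a rough sketch of Steps 2 to 4, the Python function below computes the two variance estimates and the test statistic from group summaries and compares the result with a critical value. The function name `one_way_anova`, its arguments, and the dictionary keys are illustrative assumptions and not part of the course text. The critical value is not computed; it is passed in after being read from the F table. The inputs are the group sizes, means, and variances used in Example 1 of this section.

```python
def one_way_anova(ns, means, variances, f_critical):
    """One-way ANOVA from group sizes, group means and sample variances.

    f_critical is assumed to be read from an F table with
    D.f1 = k - 1 and D.f2 = n - k.
    """
    k = len(ns)   # number of groups
    n = sum(ns)   # total number of observations
    combined_mean = sum(n_i * m for n_i, m in zip(ns, means)) / n

    # Between-group variance: weighted squared deviations of the group means.
    ssb = sum(n_i * (m - combined_mean) ** 2 for n_i, m in zip(ns, means))
    s_b2 = ssb / (k - 1)

    # Within-group (pooled) variance.
    ssw = sum((n_i - 1) * v for n_i, v in zip(ns, variances))
    s_w2 = ssw / (n - k)

    f_o = s_b2 / s_w2
    return {"s_B2": s_b2, "s_W2": s_w2, "F_o": f_o, "reject_H0": f_o > f_critical}


# Group summaries from Example 1; 3.89 is the tabulated F value for D.f 2 and 12.
result = one_way_anova([5, 5, 5], [11.8, 3.8, 7.6], [5.7, 10.2, 10.3], f_critical=3.89)
print(result)
```

Because the combined mean is not rounded in the intermediate steps, the printed test statistic (about 9.17) can differ in the third decimal place from a hand calculation that rounds along the way.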

Example 1:
Given the data in the above study case, test the claim that there is no difference among the means at 𝛼 = 0.05
Solution:
1. State the hypotheses and identify the claim.

𝐻𝑜 : 𝜇1 = 𝜇2 = 𝜇3 (claim) and 𝐻1 : At least one mean is different from the others
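2. Compute the test statistic, $F_o$
• Compute the means: $\bar{x}_1 = 11.8$, $\bar{x}_2 = 3.8$, and $\bar{x}_3 = 7.6$
• Compute the variances: $s_1^2 = 5.7$, $s_2^2 = 10.2$, and $s_3^2 = 10.3$
• The combined mean is $\bar{x}_c = \frac{5(11.8) + 5(3.8) + 5(7.6)}{5+5+5} = \frac{116}{15} = 7.733$
• The between-group variance is $s_B^2 = \frac{5(11.8)^2 + 5(3.8)^2 + 5(7.6)^2 - 15(7.733)^2}{3-1} = \frac{160.21}{2} = 80.105$
• The within-group variance is $s_W^2 = \frac{(5-1)(5.7) + (5-1)(10.2) + (5-1)(10.3)}{15-3} = \frac{104.8}{12} = 8.733$
• Evaluate the test statistic $F_o = \frac{80.105}{8.733} = 9.173$
Note: These values use the rounded mean 7.733. Carrying $\bar{x}_c = 116/15$ without rounding gives a numerator
of 160.13 and $F_o \approx 9.17$; the decision is the same.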
3. Find the critical value
• $D.f_1 = k - 1 = 3 - 1 = 2$
• $D.f_2 = n - k = 15 - 3 = 12$
• At $\alpha = 0.05$, $F_\alpha = F_{0.05} = 3.89$
4. Since $F_o = 9.173 > F_\alpha = 3.89$, the decision is to reject $H_o$ and accept $H_1$ (p-value = 0.004)
5. There is sufficient evidence to reject the claim and conclude that at least one mean is different from the
others.

Another Notation
In statistical programs such as MINITAB, SPSS, and R, the calculations above are summarized in a table as
follows:
Source            Sum of Squares    D.F.     Mean Squares    𝑭𝒐
Between groups    𝑆𝑆𝐵               𝑘 − 1    𝑀𝑆𝐵             𝑀𝑆𝐵/𝑀𝑆𝐸
Within (Error)    𝑆𝑆𝑊 or 𝑆𝑆𝐸        𝑛 − 𝑘    𝑀𝑆𝑊 or 𝑀𝑆𝐸
Total             𝑆𝑆𝑇               𝑛 − 1
In the above table,
1. $SSB$ = sum of squares between groups = $\sum n_i\bar{x}_i^2 - n\bar{x}_c^2$ = the numerator of $s_B^2$
2. $SSW$ or $SSE$ = sum of squares within groups = $\sum (n_i - 1)s_i^2$ = the numerator of $s_W^2$
3. $SST = \sum x^2 - n\bar{x}_c^2 = SSB + SSW$, and it is called the total sum of squares
4. $k$ = number of groups
5. $n = \sum n_i$ = number of all observations
6. Notice that $n - 1 = (k - 1) + (n - k)$
7. $MSB = \frac{SSB}{k-1} = s_B^2$
8. $MSE = MSW = \frac{SSW}{n-k} = s_W^2$
9. $F_o = MSB/MSE$
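
The identity that the total sum of squares equals the between-groups plus the within-groups sum of squares can be checked with a short derivation. The sketch below writes $x_{ij}$ for the $j$th observation in the $i$th group; that double-index notation is an assumption introduced here and is not used elsewhere in these notes.

```latex
\begin{align*}
SST &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_c)^2
     = \sum_{i=1}^{k}\sum_{j=1}^{n_i} \bigl[(x_{ij} - \bar{x}_i) + (\bar{x}_i - \bar{x}_c)\bigr]^2 \\
    &= \sum_{i}\sum_{j} (x_{ij} - \bar{x}_i)^2
       + 2\sum_{i} (\bar{x}_i - \bar{x}_c)\sum_{j} (x_{ij} - \bar{x}_i)
       + \sum_{i} n_i(\bar{x}_i - \bar{x}_c)^2 \\
    &= \sum_{i} (n_i - 1)s_i^2 + 0 + \sum_{i} n_i(\bar{x}_i - \bar{x}_c)^2
     = SSW + SSB
\end{align*}
```

The middle term vanishes because the deviations from a group's own mean sum to zero, $\sum_j (x_{ij} - \bar{x}_i) = 0$.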
As an illustration, the ANOVA table for the previous example is constructed as follows:
Source            Sum of Squares    D.F.    Mean Squares    𝑭𝒐       𝑭𝟎.𝟎𝟓
Between           160.21            2       80.105          9.173    3.89
Within (Error)    104.8             12      8.733
Total             265.01            14
The above table is constructed based on the following terms:

$s_B^2 = \frac{160.21}{2} = 80.105$ → $SSB = 160.21$ and $MSB = s_B^2 = 80.105$
$s_W^2 = \frac{104.8}{12} = 8.733$ → $SSW = 104.8$ and $MSW = s_W^2 = 8.733$
$SST = SSB + SSW = 265.01$, $n = 15$, $k = 3$

Example 2:
Complete the following ANOVA table. State the hypotheses, and make a decision.
Source            Sum of Squares    D.F.     Mean Squares    𝑭𝒐       𝑭𝟎.𝟎𝟓
Between           ___a___           3        __c__           __e__    __f__
Within (Error)    42.333            __b__    __d__
Total             92.950            19
Solution:
1. 𝑆𝑆𝑇 = 𝑆𝑆𝐵 + 𝑆𝑆𝐸 → 𝑎 = 𝑆𝑆𝐵 = 𝑆𝑆𝑇 – 𝑆𝑆𝐸 = 92.950 − 42.333 = 50.617
2. 𝑐 = 𝑀𝑆𝐵 = 𝑆𝑆𝐵 / 3 = 50.617/3 = 16.872
3. 3 + 𝑏 = 19 → 𝑏 = 19 – 3 = 16
4. 𝑑 = 𝑀𝑆𝐸 = 𝑆𝑆𝐸/16 = 42.333/16 = 2.646
5. $e = F_o = MSB/MSE = 16.872/2.646 = 6.376$
6. $f = F_{0.05,3,16} = 3.24$
7. Since $k - 1 = 3$, there are $k = 4$ groups: $H_o: \mu_1 = \mu_2 = \mu_3 = \mu_4$ versus $H_1$: At least one mean is different from the others
8. Reject $H_o$ since $F_o = 6.376 > F_\alpha = 3.24$ (p-value = 0.005)
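
The fill-in steps of Example 2 follow a fixed pattern, which the short Python sketch below makes explicit. The function name `complete_anova_table` and its argument names are assumptions chosen for illustration. The critical value still has to be read from the F table.

```python
def complete_anova_table(sst, sse, df_between, df_total):
    """Fill in the missing ANOVA table entries from SST, SSE and the known D.F."""
    ssb = sst - sse                      # SST = SSB + SSE
    df_within = df_total - df_between    # (k - 1) + (n - k) = n - 1
    msb = ssb / df_between
    mse = sse / df_within
    return {
        "SSB": ssb,
        "df_within": df_within,
        "MSB": msb,
        "MSE": mse,
        "F_o": msb / mse,
    }


# Known entries from the table in Example 2.
table = complete_anova_table(sst=92.950, sse=42.333, df_between=3, df_total=19)
for name, value in table.items():
    print(f"{name}: {value:.3f}")
```

The code does not round the mean squares before dividing, so its test statistic may differ from the hand-computed 6.376 in the last decimal place.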
Example 3:
For the blood pressure study in Example 1:
a) Construct a 95% C.I for 𝜇3 .

b) Construct a 95% C.I for 𝜇1 − 𝜇2 .
Solution:
a) To construct a 95% C.I for $\mu_3$ we need to evaluate:

   1. $n_3 = 5$, $\bar{x}_3 = 7.6$, $s_W^2 = s_p^2 = MSE = 8.733$ as an estimator of $\sigma^2$ → $s_W = \sqrt{8.733} = 2.955$
   2. Here, $\alpha = 0.05$ → $\alpha/2 = 0.025$, $D.f = n - k = 15 - 3 = 12$ → $t_{0.025} = 2.179$
   3. A 95% C.I for $\mu_3$ is $\bar{x}_3 \pm t_{\alpha/2}\frac{s_W}{\sqrt{n_3}} = 7.6 \pm 2.179\frac{2.955}{\sqrt{5}} \cong 7.6 \pm 2.88 = (4.72, 10.48)$
   4. We are 95% confident that the mean reduction in blood pressure using diet is between 4.72 and
      10.48.

b) To construct a 95% C.I for $\mu_1 - \mu_2$ we need:

   1. $n_1 = 5$, $\bar{x}_1 = 11.8$, $n_2 = 5$, $\bar{x}_2 = 3.8$, $s_W^2 = 8.733$
   2. Here, $\alpha = 0.05$ → $\alpha/2 = 0.025$, $D.f = n - k = 15 - 3 = 12$ → $t_{0.025} = 2.179$
   3. A 95% C.I for $\mu_1 - \mu_2$ is
      $(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2}\sqrt{\frac{s_W^2}{n_1} + \frac{s_W^2}{n_2}} = (11.8 - 3.8) \pm 2.179\sqrt{\frac{8.733}{5} + \frac{8.733}{5}} = 8.0 \pm 4.07 = (3.93, 12.07)$
   4. At $\alpha = 0.05$, we can conclude that the two means are different since the C.I does not contain 0.
      We can also say that the mean reduction using medication is higher than the mean reduction using
      exercise, with a difference of 3.93 to 12.07 units.
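
The two intervals in Example 3 can be reproduced with a few lines of Python. This is only a sketch: the function names `ci_single_mean` and `ci_mean_difference` are assumptions, and the t value is not computed but passed in after being read from a t table with 𝑛 − 𝑘 degrees of freedom.

```python
import math


def ci_single_mean(mean, n, mse, t_crit):
    """C.I for one group mean, using MSE as the pooled estimate of sigma^2."""
    margin = t_crit * math.sqrt(mse / n)
    return mean - margin, mean + margin


def ci_mean_difference(mean1, n1, mean2, n2, mse, t_crit):
    """C.I for the difference of two group means, using the pooled MSE."""
    margin = t_crit * math.sqrt(mse / n1 + mse / n2)
    diff = mean1 - mean2
    return diff - margin, diff + margin


T_CRIT = 2.179  # t_{0.025} with 12 degrees of freedom, read from a t table
MSE = 8.733     # within-group variance from Example 1

low3, high3 = ci_single_mean(7.6, 5, MSE, T_CRIT)
low12, high12 = ci_mean_difference(11.8, 5, 3.8, 5, MSE, T_CRIT)
print(f"mu_3:        ({low3:.2f}, {high3:.2f})")
print(f"mu_1 - mu_2: ({low12:.2f}, {high12:.2f})")
```

Both functions use the pooled variance rather than each group's own variance, which matches the equal-variance assumption of ANOVA.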
Example 4:
The number of grams of fiber per serving for a random sample of three different kinds of food is listed. Is there
sufficient evidence at 𝛼 = 0.05 to conclude that there is a difference in mean fiber content among breakfast
cereals, fruits, and vegetables? Given the following data, do a complete ANOVA.
Food                 Data                           Sample size    Sample mean          Sample variance
Breakfast cereals    3 4 6 4 10 5 9 8 5             $n_1 = 9$      $\bar{x}_1 = 6$      $s_1^2 = 6$
Fruit                7.1 2 4.4 0.6 3.8 4.5 2.8      $n_2 = 7$      $\bar{x}_2 = 3.6$    $s_2^2 = 4.323$
Vegetables           10 1.4 3.5 2.7 2.5 6.5 4 3     $n_3 = 8$      $\bar{x}_3 = 4.2$    $s_3^2 = 7.697$
Solution:
1. State the hypotheses and identify the claim.

𝐻𝑜 : 𝜇1 = 𝜇2 = 𝜇3 and 𝐻1 : At least one mean is different from the others (claim)
2. Compute the test statistic, $F_o$

• Compute the means $\bar{x}_1$, $\bar{x}_2$, and $\bar{x}_3$, shown in the table above
• Compute the variances $s_1^2$, $s_2^2$, and $s_3^2$, shown in the table above
• The combined mean is $\bar{x}_c = \frac{9(6) + 7(3.6) + 8(4.2)}{9+7+8} = \frac{112.8}{24} = 4.7$
• The between-group variance is $s_B^2 = \frac{9(6)^2 + 7(3.6)^2 + 8(4.2)^2 - 24(4.7)^2}{3-1} = \frac{25.68}{2} = 12.84$
• The within-group variance is $s_W^2 = \frac{(9-1)(6) + (7-1)(4.323) + (8-1)(7.697)}{24-3} = \frac{127.817}{21} = 6.087$
• Evaluate the test statistic $F_o = \frac{12.84}{6.087} = 2.109$

3. Find the critical value

• $D.f_1 = k - 1 = 3 - 1 = 2$
• $D.f_2 = n - k = 24 - 3 = 21$
• At $\alpha = 0.05$, $F_\alpha = F_{0.05} = 3.47$
4. Make the decision. Since $F_o = 2.109 < F_\alpha = 3.47$, the decision is to not reject $H_o$ (p-value = 0.146)
5. Summarize the results. There is not sufficient evidence to support the claim that the mean fiber content differs
among the three kinds of food.
6. There is no need to compare means, since the null hypothesis is not rejected.
7. The ANOVA table is given below
Source            Sum of Squares    D.F.    Mean Squares    𝑭𝒐       𝑭𝟎.𝟎𝟓
Between           25.68             2       12.84           2.109    3.47
Within (Error)    127.817           21      6.087
Total             153.497           23
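
When the raw observations are available, as in Example 4, the whole ANOVA table can be computed directly from the data. The sketch below uses only the Python standard library; the function name `anova_table` and the dictionary keys are illustrative assumptions. It also computes SST separately so that the identity SST = SSB + SSW can be checked numerically.

```python
from statistics import mean, variance

# Fiber data (grams per serving) from Example 4.
groups = {
    "Breakfast cereals": [3, 4, 6, 4, 10, 5, 9, 8, 5],
    "Fruit": [7.1, 2, 4.4, 0.6, 3.8, 4.5, 2.8],
    "Vegetables": [10, 1.4, 3.5, 2.7, 2.5, 6.5, 4, 3],
}


def anova_table(samples):
    """Build the one-way ANOVA table entries from a list of raw samples."""
    k = len(samples)
    all_values = [x for sample in samples for x in sample]
    n = len(all_values)
    combined_mean = mean(all_values)

    ssb = sum(len(s) * (mean(s) - combined_mean) ** 2 for s in samples)
    ssw = sum((len(s) - 1) * variance(s) for s in samples)  # sample variances
    sst = sum((x - combined_mean) ** 2 for x in all_values)

    msb = ssb / (k - 1)
    msw = ssw / (n - k)
    return {"SSB": ssb, "SSW": ssw, "SST": sst, "MSB": msb, "MSW": msw, "F_o": msb / msw}


table = anova_table(list(groups.values()))
for name, value in table.items():
    print(f"{name}: {value:.3f}")
```

The resulting test statistic (about 2.11) is then compared with the tabulated critical value 3.47 for D.f 2 and 21, as in step 4 of the solution.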

PROBLEMS
1. In an experiment to determine the effect of nutrition on the attention spans of elementary school students,
15 students were randomly assigned to 3 meal plans (5 students per plan): no breakfast, light breakfast, and
full breakfast. Their attention spans (in minutes) were recorded during a morning reading period and are
shown in the following table. Does the type of breakfast affect the attention spans? Test using 𝛼 = 0.05
Breakfast Attention spans (min)

No 8 7 9 13 10
Light 14 16 12 17 11
Full 10 12 16 15 12
𝐹𝑜 = 4.93
2. A researcher wishes to see whether there is any difference in the weight gains of athletes following one of
three special diets. Athletes are randomly assigned to three groups and placed on the diet for 6 weeks. The
weight gains (in pounds) are shown below. At 𝛼 = 0.05, can the researcher conclude that there is a
difference in the diets?
Diet    Weight gain          ∑x    ∑x²
A       3 6 7 4              20    110
B       9 12 11 14 8 6       60    642
C       8 3 2 5              18    102
𝐹𝑜 = 7.16
3. The amount of sodium (in mg) in one serving for a random sample of three different kinds of foods is
measured and summarized in the following ANOVA table. Complete the table and test whether there is a
difference in mean sodium amounts among condiments, cereals, and desserts. Use 𝛼 = 0.05
Source            Sum of Squares    D.F.    Mean Squares    𝑭𝒐        𝑭𝟎.𝟎𝟓
Between           275.4             ____    ______          ______    ______
Within (Error)    _______           19      ______
Total             1366.3            21
𝐹𝑜 = 2.40
See and solve the following examples and exercises from the textbook:
Section 12 – 1
Examples: 1, 2
Exercises: 8, 9, 12, 14

Percentage points of the F distribution (𝜶 = 0.05)
Df1
Df2 1 2 3 4 5 6 7 8 9 10 Df2
1 161.4 199.5 215.7 224.6 230.2 234.0 236.8 238.9 240.5 241.9 1
2 18.51 19.00 19.16 19.25 19.30 19.33 19.35 19.37 19.38 19.40 2
3 10.13 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.79 3
4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.96 4
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.74 5
6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 4.06 6
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.64 7
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 3.35 8
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 3.14 9
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98 10
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85 11
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75 12
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67 13
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60 14
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54 15
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49 16
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45 17
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41 18
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38 19
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35 20
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32 21
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30 22
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27 23
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25 24
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24 25
26 4.23 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27 2.22 26
27 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.25 2.20 27
28 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24 2.19 28
29 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.22 2.18 29
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16 30
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08 40
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99 60
120 3.92 3.07 2.68 2.45 2.29 2.17 2.09 2.02 1.96 1.91 120
∞ 3.84 3.00 2.60 2.37 2.21 2.10 2.01 1.94 1.88 1.83 ∞
