MT (P) 1252 Statistics I Module
1 Course Description
This course is divided into the following three major topics:
2 Course Objectives
The course aims at enabling students to:
3 Learning outcomes
By the end of the course learners should be able to;
4 Detailed description
4.1 Frequency distribution
Raw data, grouped frequency distributions, class intervals, class boundaries, width of a class
interval, discrete and continuous variables, histograms, histograms for grouped distribution,
open-ended classes, discrete distributions, frequency polygons, frequency curves.
Types of frequency curve; cumulative frequency distributions, percentage Ogives.
Continuous distributions, the normal distribution and applications.
Normal approximation to Binomial distribution.
we can infer or make decisions concerning the amount of complex sugars in the entire soft
drinks industry.
Definition 5.2. Sample is a portion of the population that is selected for analysis, e.g., to
analyse the quality of bread produced by the Kyambogo bakery in the month of December,
we may only select a sample of 100 loaves for analysis from the entire amount produced in
December.
Definition 5.4. Statistic is a summary measure that is computed to describe the area of
interest from a given sample of the population.
Notes:
1. The major aspect of inferential statistics is the process of using sample statistics to draw
conclusions about population parameters. The need for inferential statistical methods arises
from the need for sampling: as the population becomes larger, it becomes costly and time
consuming to obtain information from the entire population. Decisions concerning the
population characteristics therefore have to be based on the information obtained from a
sample of the population.
2. Probability theory provides the link by ascertaining the likelihood that the results from
the sample reflect the results from the entire population.
6 Data Collection
Data are the facts and figures that are collected, analyzed, and summarized for presentation
and interpretation. Data may be classified as either quantitative or qualitative.
We discuss the types of data below.
4. Secondary data: Data that is sourced by someone other than the user.
5. Discrete data: These are data that can take only specific values.
6. Continuous data: These are data that can take values from a given range.
Example 6.1. Age and annual income are quantitative variables; the corresponding data
values indicate how many years and how much money for each individual.
Example 6.2. Gender and marital status are qualitative variables. The labels male and
female provide the qualitative data for gender, and the labels single, married, divorced, and
widowed indicate marital status.
Sample survey methods are used to collect data from observational studies, and experimental
design methods are used to collect data from experimental studies.
The area of descriptive statistics is concerned primarily with methods of presenting and in-
terpreting data using graphs, tables, and numerical summaries.
Whenever statisticians use data from a sample (i.e., a subset of the population) to make
statements about a population, they are performing statistical inference.
Estimation and hypothesis testing are procedures used to make statistical inferences. Fields
such as health care, biology, chemistry, physics, education, engineering, business, and eco-
nomics make extensive use of statistical inference.
There are four main reasons as to why we need to collect data:
(i) To provide the necessary input to a research study
(ii) To measure the performance in an ongoing service or production process.
(iii) To assist in formulating alternative courses of action in the decision making process.
(iv) To satisfy curiosity.
For statistical analysis to be useful in the decision making process, the input data must be
appropriate; hence proper data collection is extremely important. If the data has flaws or is
biased, no statistical methods are likely to compensate for such deficiencies. It is therefore
important to collect the right data using the right method.
There are several methods which are used to obtain data:
(1) By using published sources of data (Secondary data), e.g., journals, magazines,
newspapers, bar codes, etc.
(2) By designing an experiment: For such experiments, there must be strict controls over
the treatments given to the participants e.g., in an experiment to test the effectiveness of
a herbal drug, the researcher would determine which participants in the study area would
use the drug and those who would not use the drug (control).
(3) Conducting a survey: No control is normally exercised over the behaviour of the people
being surveyed. They are asked questions about their beliefs, attitudes, behaviours and
other characteristics, and then the responses are edited, coded and
tabulated for analysis.
(4) Observational study: Here the researcher observes the area of interest directly and, in
most cases, in its natural setting. This provides research information which cannot be
obtained by the more structured methods of data collection such as experiments and
surveys.
7 OBTAINING DATA THROUGH SURVEY RESEARCH 5
Disadvantages
i. You may not select enough individuals with your characteristic of interest, especially if
the characteristic is uncommon.
Systematic sampling:
Individuals are selected at regular intervals from the sampling frame. The intervals are chosen
to ensure an adequate sample size. If you need a sample of size n from a population of size x, then you
should select every (x/n)th individual for the sample, e.g., if you wanted a sample size of 100
from a population of 1000, select every (1000/100)th = 10th member of the sampling frame.
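The selection rule can be sketched in a few lines of Python. This is only an illustration and not part of the original notes: the function name `systematic_sample`, the numbered sampling frame and the random starting point within the first interval are assumptions.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Select every k-th member of the sampling frame, where k = len(frame) // n."""
    k = len(frame) // n                  # sampling interval, e.g. 1000 // 100 = 10
    rng = random.Random(seed)
    start = rng.randrange(k)             # assumed: random start within the first interval
    return frame[start::k][:n]

frame = list(range(1, 1001))             # hypothetical sampling frame of 1000 members
sample = systematic_sample(frame, 100, seed=1)
print(len(sample), sample[:3])
```

Whatever the starting point, consecutive selected members are always 10 positions apart in the frame.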
Advantages
Disadvantages
Stratified sampling:
The population is first divided into subgroups (or strata) whose members all share a similar characteristic.
It is used when we might reasonably expect the measurement of interest to vary between the
different subgroups and want to ensure representation from all the subgroups, e.g., in a study
of stroke outcomes, we may stratify the population by sex, to ensure equal representation of
males and females.
The study sample is then obtained by taking equal sample sizes from each stratum.
Advantages
i. It improves the accuracy and representation of the results by reducing sampling bias.
Disadvantages
Clustered sampling:
In a clustered sample, subgroups of the population are used as a sampling unit rather than
individuals. The population is divided into subgroups known as clusters which are randomly
selected to be included in the study.
In single stage cluster sampling, all members of the cluster are then included in the study.
In a two-stage cluster sampling, a selection of individuals from each cluster is then randomly
selected for inclusion.
Advantages
i. It is more efficient than simple random sampling especially where a study takes place over
a wide geographical area.
Disadvantages
i. Increased risk of bias, if the chosen clusters are not representative of the population,
resulting in an increased sampling error.
8 Presentation of Data
The key objective of statistics is to collect and organize data. One of the basics of data orga-
nization comes from presentation of data in a recognizable form so that it can be interpreted
easily. You can organize data in the form of tables or you can present it pictorially.
Pictorial representation of data takes the form of bar charts, pie charts, histograms or fre-
quency polygons.
The benefit of this is that data in the visual form is easy to understand in one glance.
Data can basically be organized in two ways;
5 4 8 7 3 6 5 9 6 10 7 8 6 4 2
2 3 4 4 5 5 6 6 6 7 7 8 8 9 10
When data is arranged in an ordered array, the major features can be evaluated quickly,
since it becomes easier to identify the extremes, the typical values and where the values
are concentrated.
$$\frac{\text{Range}}{\text{number of desired classes}}$$
For convenience, we use a whole number close to this value.
can be properly tabled without overlapping. We can use groupings such as 10-<20,
20-<30, 30-<40, etc., for continuous data or 5-9, 10-14, 15-19, 20-24 for
discrete data.
The mid-value of each class, which is halfway between the boundaries of the class, is
representative of the data within the class, e.g., for the class 10-20,
$$\frac{10 + 20}{2} = 15.$$
NOTE. Before making a frequency distribution table for raw data, we use tallies to group the
data before obtaining the frequency.
Example 8.2. Present the data in Example 8.1 in a frequency distribution table
Value   Frequency (f)
2       1
3       1
4       2
5       2
6       3
7       2
8       2
9       1
10      1
$\sum f = 15$
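The same table can be produced programmatically. The sketch below assumes that the data of Example 8.1 are the fifteen raw values listed earlier in this section (5 4 8 7 3 ...); the variable names are illustrative only.

```python
from collections import Counter

# Assumed raw data of Example 8.1 (the unordered values listed earlier)
scores = [5, 4, 8, 7, 3, 6, 5, 9, 6, 10, 7, 8, 6, 4, 2]

ordered = sorted(scores)                 # the ordered array
frequency = Counter(ordered)             # value -> number of occurrences

for value in sorted(frequency):
    print(value, frequency[value])
print("Total:", sum(frequency.values()))
```

`Counter` plays the role of the tally: each value is counted once per occurrence, and the counts add up to the number of observations.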
9 RELATIVE FREQUENCY DISTRIBUTIONS AND PERCENTAGE DISTRIBUTIONS9
15, 20, 7, 20, 35, 31, 43, 7, 28, 7, 49, 5, 28, 19, 20, 32, 7, 10, 43, 50, 45, 27, 21, 32, 43, 46, 37, 18, 12, 21
Draw a frequency distribution table showing relative frequency and percentage frequency.
SOLUTION.
Range = 50 − 5 = 45
Ten groups: $\frac{45}{10} = 4.5 \approx 5$
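A possible way to build the requested table is sketched below. The class width of 5 comes from the calculation above; starting the first class at the minimum value 5 (giving classes 5-9, 10-14, ...) is an assumption, since the original table is not reproduced here.

```python
data = [15, 20, 7, 20, 35, 31, 43, 7, 28, 7, 49, 5, 28, 19, 20,
        32, 7, 10, 43, 50, 45, 27, 21, 32, 43, 46, 37, 18, 12, 21]

width = 5                                 # class width from the calculation above
start = 5                                 # assumed lower limit of the first class
n = len(data)

table = []                                # rows: (lower, upper, f, relative f, percentage f)
lower = start
while lower <= max(data):
    upper = lower + width - 1
    f = sum(lower <= x <= upper for x in data)
    table.append((lower, upper, f, f / n, 100 * f / n))
    lower += width

for lo, hi, f, rel, pct in table:
    print(f"{lo}-{hi}: f={f}, relative={rel:.3f}, percentage={pct:.1f}%")
```

The relative frequencies sum to 1 and the percentage frequencies sum to 100, which is a quick check that no observation was missed.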
PROBLEM 9.1 (Group activity). In groups of five, discuss the following questions:
10 Frequency distributions
It is often useful to organize or arrange a body of data into a frequency distribution. This
breaks up the data into groups or classes and shows the number of observations in each class.
A relative frequency distribution is obtained by dividing the number of observations in each
class by the total number of observations in the data as a whole. The sum of the relative
frequencies equals 1.
Example 10.1. Forty students in a class sat for a mathematics quiz marked out of 50. The
grades the students scored are given below.
17 25 36 12 28 17 16 37 13 39 13 35 26 37 29 28 22 34 27 39
10 14 15 15 24 36 17 44 18 42 14 16 17 48 13 46 17 29 10 35
Construct a frequency distribution table showing class intervals, class midpoints, frequencies
and cumulative frequencies for the grades, using a class interval of 5 and starting with 10-14.
SOLUTION. Consider the frequency distribution table
where cf stands for cumulative frequency.
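As a cross-check, the columns of such a table can be computed directly from the forty grades. This is a sketch rather than the original solution table; the variable names are illustrative.

```python
from itertools import accumulate

grades = [17, 25, 36, 12, 28, 17, 16, 37, 13, 39, 13, 35, 26, 37, 29, 28, 22, 34, 27, 39,
          10, 14, 15, 15, 24, 36, 17, 44, 18, 42, 14, 16, 17, 48, 13, 46, 17, 29, 10, 35]

classes = [(lo, lo + 4) for lo in range(10, 50, 5)]        # 10-14, 15-19, ..., 45-49
midpoints = [(lo + hi) / 2 for lo, hi in classes]          # class midpoints
freqs = [sum(lo <= g <= hi for g in grades) for lo, hi in classes]
cum_freqs = list(accumulate(freqs))                        # running total of f

for (lo, hi), x, f, cf in zip(classes, midpoints, freqs, cum_freqs):
    print(f"{lo}-{hi}  midpoint={x}  f={f}  cf={cf}")
```

The last cumulative frequency equals the number of students, 40.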
11 Histogram
A histogram is a bar graph of a frequency distribution, where classes are measured along the
horizontal axis and frequencies along the vertical axis.
Example 11.1. Forty students in a class sat for a mathematics quiz marked out of 50. The
grades the students scored are given below. Present the data in the form of a histogram.
SOLUTION. Consider the frequency distribution table below
17 25 36 12 28 17 16 37 13 39 13 35 26 37 29 28 22 34 27 39
10 14 15 15 24 36 17 44 18 42 14 16 17 48 13 46 17 29 10 35
PROBLEM 11.1 (Activity). Present the data given in the table below in a histogram:
Marks obtained 10-19 20-29 30-39 40-49 50-59 60-69 70-79 80-89
Number of students 4 10 16 22 26 18 8 2
Figure 1: Histogram
12 Frequency polygon
A frequency polygon is a line graph of a frequency distribution resulting from joining the
frequency of each class plotted at the class midpoint.
A frequency polygon is a graphical form of representation of data. It is used to depict the
shape of the data and to depict trends. It is usually drawn with the help of a histogram but
can be drawn without it as well.
Frequency polygons give an idea about the shape of the data and the trends that a particular
data set follows.
(ii) Calculate the class mark (or midpoint) for each class interval. The formula for class
mark is:
$$\text{Class mark} = \frac{\text{Upper limit} + \text{Lower limit}}{2}$$
(iii) Mark all the class marks on the horizontal axis. The class mark is also known as the
mid-value of the class.
(iv) Corresponding to each class mark, plot the frequency as given to you. The height always
depicts the frequency. Make sure that the frequency is plotted against the class mark
and not the upper or lower limit of any class.
(v) Join all the plotted points using line segments. The curve obtained will be kinked.
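The steps above amount to pairing each class mark with its frequency. A minimal sketch follows, using the marks table from Problem 11.1 as input; it computes only the points to be plotted and leaves the drawing to the reader.

```python
# Class limits and frequencies taken from the table in Problem 11.1
classes = [(10, 19), (20, 29), (30, 39), (40, 49), (50, 59), (60, 69), (70, 79), (80, 89)]
freqs = [4, 10, 16, 22, 26, 18, 8, 2]

# Class mark = (upper limit + lower limit) / 2
class_marks = [(lo + hi) / 2 for lo, hi in classes]

# Vertices of the frequency polygon, from left to right
points = list(zip(class_marks, freqs))
for x, f in points:
    print(x, f)
```

Joining consecutive points with straight line segments gives the polygon.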
Example 12.1. One hundred college students sat for a mid-semester examination marked out
of 100%, and the table below shows the results they scored.
Construct a frequency polygon using the data given in Table 1 above.
SOLUTION. We first need to calculate the class mark and class boundary from the test
scores given.
PROBLEM 12.1 (Activity). Make a frequency polygon and histogram using the given data:
13 CUMULATIVE FREQUENCY DISTRIBUTION CURVE OR OGIVE 15
Example 13.1. One hundred college students sat for a mid-semester examination marked out
of 100%, and the table below shows the results they scored.
Construct a cumulative frequency curve (Ogive) using the data given in Table 2 above.
SOLUTION. We first need to calculate the cumulative frequency from the frequency given.
1. The marks obtained by 100 college students in an examination are given below
Exam marks 0-4 5-9 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49
Frequency (f) 2 5 6 8 10 25 20 18 4 2
2. Construct histogram, frequency polygon and frequency curve from the following data:
Marks obtained 0-9 10-19 20-29 30-39 40-49 50-59 60-69 70-79
Number of students 10 16 20 20 22 15 8 5
14 Statistical averages
14.1 Introduction
The measures of central tendency enable us to make a statistical summary of a large body of
organized data. One such measure of central tendency in statistics is the arithmetic
mean. The single value into which a large amount of data is condensed is known as a measure
of central tendency.
For example, while reading a newspaper in the early morning, have you observed the daily
temperature reports? The temperature varies all day, so how can a single temperature
indicate the condition for the entire day? Or when you get your scorecard in exams, instead of
analyzing your performance based on the percentage in each subject, the performance is based
upon the aggregate percentage.
The significance of indicating a single value for a large amount of data in real life makes it
easy to study and analyze the collection of data and deduce important information out of it.
Let us discuss the arithmetic mean in Statistics and examples in detail.
To calculate the mean when the frequencies of the observations are given, such that $x_1, x_2, x_3, \ldots, x_n$
are the recorded observations and $f_1, f_2, f_3, \ldots, f_n$ are the respective frequencies of the
observations, we use
$$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + f_3 x_3 + \cdots + f_n x_n}{f_1 + f_2 + f_3 + \cdots + f_n}$$
This can be expressed briefly as:
$$\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}.$$
The above method of calculating the arithmetic mean is used when the data is ungrouped in
nature.
Example 14.1. The avocados picked from the tree were weighed and found to have the following
masses.
400g, 200g, 350g, 320g, 200g
Find the mean mass of the avocados collected on this day.
SOLUTION.
$$\bar{x} = \frac{\sum x_i}{n} = \frac{400 + 200 + 350 + 320 + 200}{5} = \frac{1470}{5} = 294\text{ g}.$$
Example 14.2. The following are masses of sweets in Jozes bag
Mass 10 15 18 21 25
Number 2 4 9 5 6
Mass (x)   Number (f)   fx
10         2            20
15         4            60
18         9            162
21         5            105
25         6            150
$\sum f = 26$, $\sum fx = 497$
$$\bar{x} = \frac{\sum fx}{\sum f} = \frac{497}{26} = 19.115.$$
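The frequency-weighted mean of Example 14.2 can be checked with a short script. This is a sketch; the variable names are illustrative.

```python
masses = [10, 15, 18, 21, 25]            # x: mass of a sweet
counts = [2, 4, 9, 5, 6]                 # f: number of sweets with that mass

sum_f = sum(counts)                                    # total frequency
sum_fx = sum(f * x for f, x in zip(counts, masses))    # sum of f * x
mean = sum_fx / sum_f

print(sum_f, sum_fx, round(mean, 3))
```

Each product f * x contributes one row of the fx column, so the row for mass 25 is 6 * 25 = 150.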
Properties:
(1) The sum of the deviations about the mean is zero:
$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0.$$
E.g., for the avocado masses with $\bar{x} = 294$,
$$\sum_{i=1}^{n} (x_i - \bar{x}) = (400 - 294) + (200 - 294) + (350 - 294) + (320 - 294) + (200 - 294) = 0.$$
(2) The sum of the squares of the deviations about the mean is always a minimum, i.e., it is
smaller than the sum of squared deviations about any other value, including the other
measures of central tendency:
$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \text{minimum value}.$$
For the same data,
$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = (400 - 294)^2 + (200 - 294)^2 + (350 - 294)^2 + (320 - 294)^2 + (200 - 294)^2$$
$$= 11{,}236 + 8{,}836 + 3{,}136 + 676 + 8{,}836 = 32{,}720.$$
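Both properties can be verified numerically for the avocado masses. The sketch below compares the sum of squared deviations about the mean with the sum about the median (320) as one example of "another value"; the helper name `sum_sq_dev` is illustrative.

```python
masses = [400, 200, 350, 320, 200]
mean = sum(masses) / len(masses)         # 294.0

def sum_sq_dev(values, centre):
    """Sum of squared deviations of the values about a chosen centre."""
    return sum((x - centre) ** 2 for x in values)

sum_dev = sum(x - mean for x in masses)  # property (1): should be zero
about_mean = sum_sq_dev(masses, mean)    # property (2): the minimum
about_median = sum_sq_dev(masses, 320)   # larger than the value about the mean

print(sum_dev, about_mean, about_median)
```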
Example 14.3. To obtain Grade A, Ben must achieve an average of at least 70 in five tests.
If his average mark for the first four tests is 86, what is the lowest mark he can get in his fifth test
and still obtain Grade A?
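One way to reason about Example 14.3 is through totals: the mean multiplied by the number of tests gives the total mark. The sketch below follows that idea; it is an illustration, not the worked solution from the notes.

```python
target_mean, n_tests = 70, 5
mean_first_four = 86

required_total = target_mean * n_tests            # total needed over five tests
total_so_far = mean_first_four * (n_tests - 1)    # total already scored in four tests
lowest_fifth_mark = required_total - total_so_far

print(lowest_fifth_mark)
```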
14.3 Median
This is the middle value in an ordered sequence of data. If there are no ties, half of the
observations will be smaller than the median, and the other half will be bigger. Unlike the
mean, the median is not affected by extreme observations in the data set. Hence where
extreme observations exist, it is better to use the median instead of the mean.
To calculate the median of raw data, we first arrange the data in an ordered array so that
we obtain the value in the middle. The median is always in the $\left(\frac{n+1}{2}\right)$th
position. Where the number of data values is even, the median is the average of the two
middlemost values.
Example 14.4. Find the median given
(iii) 5, 8, 4, 4, 2, 9
SOLUTION. (i) Arranging the data in ascending order, we obtain 200, 200, 320, 350, 400.
Since n = 5, the median is in position
$$\left(\frac{n+1}{2}\right)^{\text{th}} = \left(\frac{5+1}{2}\right)^{\text{th}} = \left(\frac{6}{2}\right)^{\text{th}} = 3^{\text{rd}}$$
Thus median = 320
(ii) Arranging the data in ascending order, we obtain 5, 10, 12, 16, 18, 21. Since n is even, we
take the average of the two middlemost values:
$$\text{Median} = \frac{12 + 16}{2} = \frac{28}{2} = 14$$
(iii) Arranging the data in ascending order, we obtain 2, 4, 4, 5, 8, 9. Since n is even, we take
the average of the two middlemost values:
$$\text{Median} = \frac{4 + 5}{2} = \frac{9}{2} = 4.5$$
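The positional rule can be written as a small function and compared with Python's built-in `statistics.median`. The function name `median_by_position` is illustrative; the three data sets are the ones appearing in the solutions above (the unsorted order of set (ii) is not given, so its sorted form is used).

```python
import statistics

def median_by_position(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the ((n + 1) / 2)-th value
    return (ordered[mid - 1] + ordered[mid]) / 2  # mean of the two middlemost values

set_i = [400, 200, 350, 320, 200]
set_ii = [5, 10, 12, 16, 18, 21]
set_iii = [5, 8, 4, 4, 2, 9]

for data in (set_i, set_ii, set_iii):
    print(median_by_position(data), statistics.median(data))
```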
14.4 Mode
The mode is another measure of central tendency; it is the value which appears
most frequently. The mode is also not affected by extreme values. However, it is used only for
descriptive purposes because it is more variable from sample to sample than other
measures of central tendency.
NOTE. Some data may have more than one mode.
Example 14.5.
16, 12, 18, 3, 12, 18, 9
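Assuming Example 14.5 asks for the mode of the values listed, a quick check with the standard library (Python 3.8 or later) shows that this data set has more than one mode, as the NOTE above anticipates.

```python
import statistics

values = [16, 12, 18, 3, 12, 18, 9]

# multimode returns every value that shares the highest frequency
modes = statistics.multimode(values)
print(modes)
```

Here both 12 and 18 occur twice, so the data set is bimodal.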
Example 14.6. Construct an Ogive curve using the following set of data
Use the curve to estimate the following
(i) Median
Class   f   cf
10-19   2   2
20-29   8   10
30-39   3   13
40-49   6   19
50-59   2   21
$\sum f = 21$

Class   f   cf   Class boundaries
10-19   2   2    9.5-19.5
20-29   8   10   19.5-29.5
30-39   3   13   29.5-39.5
40-49   6   19   39.5-49.5
50-59   2   21   49.5-59.5
$\sum f = 21$
(i) Median position $= \left(\frac{N}{2}\right)^{\text{th}} = \left(\frac{21}{2}\right)^{\text{th}} = 10.5^{\text{th}}$ position
Thus from Figure 4, Median = 31.5.
(ii) 1st quartile position $= \frac{1}{4} \times 21 = 5.25$
Thus from Figure 4, 1st quartile = 21.5.
(iii) 60th percentile position $= \frac{60}{100} \times 21 = 12.6$
Thus from Figure 4, 60th percentile = 41.5.
(iv) 8th decile position $= \frac{8}{10} \times 21 = 16.8$
Thus from Figure 4, 8th decile = 45.5.
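Reading a value off an Ogive can be imitated numerically by joining the plotted points (upper class boundary, cumulative frequency) with straight lines. The sketch below does this for the table above; the function name `value_at_position` is illustrative, and straight-line interpolation will generally not reproduce readings taken from a smooth hand-drawn curve.

```python
# Points of the ogive: class boundaries and cumulative frequency at each boundary
boundaries = [9.5, 19.5, 29.5, 39.5, 49.5, 59.5]
cum_freq = [0, 2, 10, 13, 19, 21]

def value_at_position(pos):
    """Read the ogive at cumulative frequency `pos` by linear interpolation."""
    for i in range(1, len(cum_freq)):
        if pos <= cum_freq[i]:
            span = cum_freq[i] - cum_freq[i - 1]
            frac = (pos - cum_freq[i - 1]) / span
            return boundaries[i - 1] + frac * (boundaries[i] - boundaries[i - 1])
    raise ValueError("position exceeds total frequency")

n = cum_freq[-1]
median = value_at_position(n / 2)            # 10.5th position
first_quartile = value_at_position(n / 4)    # 5.25th position
print(round(median, 2), round(first_quartile, 2))
```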
14.5 Skewness
Skewness shows how the data is distributed, i.e., it can either be symmetric or not. If the
data is not symmetric, it is said to be skewed or asymmetric.
For data which is symmetric, the mean, mode, and median are equal, and therefore it is said
to be zero skewed, as is the case for normally distributed data.
If the mean is bigger than the median, the data is said to be positively skewed or right skewed.
If the median is bigger than the mean, the data is said to be negatively skewed or left skewed.
Example 14.7. The following are marks obtained by 34 primary six students in a mid term
mathematics examination marked out of 40. Find
(i) Mode
(ii) Mean
(iii) Median
Class   f    cf   x    fx    Class boundaries
5-9     2    2    7    14    4.5-9.5
10-14   14   16   12   168   9.5-14.5
15-19   8    24   17   136   14.5-19.5
20-24   3    27   22   66    19.5-24.5
25-29   5    32   27   135   24.5-29.5
30-34   2    34   32   64    29.5-34.5
$\sum f = 34$, $\sum fx = 583$
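(i)
$$\text{Mode} = L + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) \times c = 9.5 + \left(\frac{14 - 2}{(14 - 2) + (14 - 8)}\right) \times 5 = 9.5 + \frac{12}{18} \times 5 = 12.83.$$
(ii)
$$\text{Mean } \bar{x} = \frac{\sum fx}{\sum f} = \frac{583}{34} = 17.147.$$
(iii)
$$\text{Median} = L + \left(\frac{\frac{N}{2} - f_b}{f}\right) \times c = 14.5 + \left(\frac{17 - 16}{8}\right) \times 5 = 14.5 + 0.625 = 15.125.$$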
(iv) Since the mean is bigger than the median, the data is said to be positively skewed or right
skewed.
(v)
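The calculations in parts (i) to (iii) can be cross-checked with a short script that locates the modal and median classes from the frequency table itself. This is a sketch under the assumption of equal class widths; the variable names are illustrative.

```python
# Lower class boundaries and frequencies from the table of Example 14.7
lower_bounds = [4.5, 9.5, 14.5, 19.5, 24.5, 29.5]
freqs = [2, 14, 8, 3, 5, 2]
width = 5
n = sum(freqs)
midpoints = [lb + width / 2 for lb in lower_bounds]   # class marks 7, 12, ..., 32

# Mean: sum of f * x divided by sum of f
mean = sum(f * x for f, x in zip(freqs, midpoints)) / n

# Mode: interpolate inside the class with the highest frequency
m = freqs.index(max(freqs))
d1 = freqs[m] - (freqs[m - 1] if m > 0 else 0)
d2 = freqs[m] - (freqs[m + 1] if m < len(freqs) - 1 else 0)
mode = lower_bounds[m] + d1 / (d1 + d2) * width

# Median: interpolate inside the class containing the (n / 2)-th observation
cum = 0
for i, f in enumerate(freqs):
    if cum + f >= n / 2:
        median = lower_bounds[i] + (n / 2 - cum) / f * width
        break
    cum += f

print(round(mode, 2), round(mean, 3), median)
```

Since the mean exceeds the median, the script agrees with the conclusion that the data is positively skewed.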
15 MEASURES OF CENTRAL TENDENCY 24
1. The times, to the nearest minute, taken by a group of 120 students to write a particular
essay were recorded and are grouped in the table below. Construct the cumulative frequency
table for this distribution and draw the cumulative frequency curve.
Use your curve to estimate
2. The table below shows the frequency distribution of the masses of 52 women students at a
college. Measurements have been recorded to the nearest kilogram.
(a) Construct the cumulative frequency table for this distribution and draw the cumulative
frequency curve.
(b) How many students weighed less than 57 kg?
(c) How many students weighed more than 61 kg?
(d) 20% were heavier than xkg.
Find the value of x.
(e) Estimate the median.
(f) Estimate the interquartile range.
There are many different measures of central tendency. The three most widely used measures
of central tendency are the mean, median, and mode. These measures are defined for both
samples and populations.
25 46 34 45 37 36 40 30 29 37 44 56 50 47 23
40 30 27 38 47 58 22 29 56 40 46 38 19 49 50
SOLUTION.
$$\bar{x} = \frac{\sum x}{n} = \frac{1168}{30} = 38.9.$$
15.1.2 Median
The median of a set of data is a value that divides the bottom 50% of the data from the top
50% of the data.
To find the median of a data set, first arrange the data in increasing order.
If the number of observations is odd, the median is the number in the middle of the ordered
list.
If the number of observations is even, the median is the mean of the two values closest to the
middle of the ordered list.
There is no widely used symbol used to represent the median.
Example 15.2. In a certain study center of Kyambogo University, students sat an online
mathematics quiz marked out of 100%. Below are the test scores of 30 students chosen
at random.
Find the median score.
25 46 34 45 37 36 40 30 29 37 44 56 50 47 23
40 30 27 38 47 58 22 29 56 40 46 38 19 49 50
SOLUTION. To find the median, first arrange the data in increasing order:
19 22 23 25 27 29 29 30 30 34 36 37 37 38 38
40 40 40 44 45 46 46 47 47 49 50 50 56 56 58
The two values closest to the middle are 38 and 40. The median is the mean of these two values.
Thus,
$$\text{Median} = \frac{38 + 40}{2} = \frac{78}{2} = 39.$$
15.1.3 Mode
The mode is the value in a data set that occurs the most often. If no such value exists, we
say that the data set has no mode. If two such values exist, we say the data set is bimodal.
If three such values exist, we say the data set is trimodal. There is no symbol that is used to
represent the mode.
Example 15.3. In a certain study center of Kyambogo University, students sat an online
mathematics quiz marked out of 100%. Below are the test scores of 30 students chosen
at random. Find the mode.
25 46 34 45 37 36 40 30 29 37 44 56 50 47 23
40 30 27 38 47 58 22 29 56 40 46 38 19 49 50
SOLUTION. When the data are examined, it is seen that 40 occurs three times, and that no
other value occurs that often. The mode is equal to 40.
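The three measures for the 30 test scores can be confirmed with Python's standard `statistics` module. This is a sketch for checking the arithmetic; the variable names are illustrative.

```python
import statistics

scores = [25, 46, 34, 45, 37, 36, 40, 30, 29, 37, 44, 56, 50, 47, 23,
          40, 30, 27, 38, 47, 58, 22, 29, 56, 40, 46, 38, 19, 49, 50]

mean = statistics.mean(scores)       # sum of the scores divided by 30
median = statistics.median(scores)   # mean of the 15th and 16th ordered values
mode = statistics.mode(scores)       # most frequent value

print(round(mean, 1), median, mode)
```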
For a large data set, as the number of classes is increased (and the width of the classes is
decreased), the histogram becomes a smooth curve. Oftentimes, the smooth curve assumes a
shape like that shown in Figure 9.
In this case, the data set is said to have a bell-shaped distribution or a mound-shaped distri-
bution. For such a distribution, the mean, median, and mode are equal and they are located
at the center of the curve.
For a data set having a skewed to the right distribution, the mode is usually less than the
median which is usually less than the mean. For a data set having a skewed to the left
distribution, the mean is usually less than the median which is usually less than the mode.
Example 15.4. Find the mean, median, and mode for the following three data sets and
SOLUTION. Table 3 gives the shape of the distribution, the mean, the median, and the mode
for the three data sets.
2. The children in a class state how many children there are in their family.
The numbers they state are given below
1 2 1 3 2 1 2 4 2 2 1 3 1 2
2 2 1 1 7 3 1 2 1 2 2 1 2 3
(a) Find the mean, median and mode for this data
(b) Which is the most sensible average to use in this case?
16 Measures of dispersion
In addition to measures of central tendency, it is desirable to have numerical values to describe
the spread or dispersion of a data set. Measures that describe the spread of a data set are
called measures of dispersion.
The range for a data set is equal to the maximum value in the data set minus the minimum
value in the data set. It is clear that the range is reflective of the spread in the data set since
the difference between the largest and the smallest value is directly related to the spread in
the data.
16.1.2 Variance
The variance and the standard deviation of a data set measure the spread of the data about
the mean of the data set.
The variance of a sample of size n is represented by $s^2$ and is given by
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
and the variance of a population of size N is represented by $\sigma^2$ and is given by
$$\sigma^2 = \frac{\sum (x - \bar{x})^2}{N}$$
The square root of the variance is called the standard deviation and the standard deviation
is measured in the same units as the variable.
The sample standard deviation is
$$s = \sqrt{s^2}$$
and the population standard deviation is
$$\sigma = \sqrt{\sigma^2}$$
The shortcut formulas for computing sample and population variances are
$$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$$
and
$$\sigma^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N}$$
respectively.
Example 16.1. The heights of class 5 students of a certain primary school, measured in
centimetres, are 100, 102, 118, 124, 126. Find the standard deviation.
x      x^2
100    10000
102    10404
118    13924
124    15376
126    15876
$\sum x = 570$, $\sum x^2 = 65580$
$$s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}} = \sqrt{\frac{65580 - \frac{(570)^2}{5}}{5 - 1}} = \sqrt{\frac{65580 - 64980}{4}} = \sqrt{\frac{600}{4}} = \sqrt{150} = 12.25\text{ cm}$$
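The shortcut calculation of Example 16.1 can be reproduced step by step and compared with the library routine `statistics.stdev`, which also divides by n - 1. A sketch:

```python
import math
import statistics

heights = [100, 102, 118, 124, 126]      # in centimetres
n = len(heights)

sum_x = sum(heights)                      # sum of x
sum_x2 = sum(x * x for x in heights)      # sum of x squared

# Shortcut formula for the sample standard deviation
s_shortcut = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))
s_library = statistics.stdev(heights)

print(sum_x, sum_x2, round(s_shortcut, 2), round(s_library, 2))
```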
(a) Find
(b) Which average would you use if you wanted to claim that the staff were
2. A farmer buys 10 packets of seeds from two different companies. Each pack contains 20
seeds and he records the number of plants which grow from each pack.
Company A 20 5 20 20 20 6 20 20 20 8
Company B 17 18 15 16 18 18 17 15 17 18
(a) Find the mean, median, mode, variance and standard deviation for each company’s
seeds
(b) Which company does the mode suggest is best?
(c) Which company does the mean suggest is best?
(d) Find the range for each company’s seeds.
17 Measures of central tendency and dispersion for grouped data
Statistical data are often given in grouped form, i.e., in the form of a frequency distribution,
and the raw data corresponding to the grouped data are not available or may be difficult to
obtain.
The articles that appear in newspapers and professional journals do not give the raw data,
but give the results in grouped form.
For grouped data, the mid value for any group is representative of the properties within that
group.
17.1 Mean
The mean for grouped data is given by
$$\bar{x} = \frac{\sum fx}{n}$$
where x represents the class marks, f represents the class frequencies, and $n = \sum f$.
Example 17.1. The frequency distributions of seed yield of 50 groundnut plants are given
below.
Table 5: Frequency distribution table showing seed yield for 50 groundnut plants
Thus
$$\text{Mean weight} = \frac{\sum fx}{n} = \frac{271}{50} = 5.42\text{ g}$$
Example 17.2. The following table gives the frequency distribution of marks scored by 80
students at a certain Primary Teachers College in a statistics test marked out of 30.
Table 7: Shows marks scored by 80 students in a statistics test marked out of 30.
Class f x fx
10-12 4 11 44
13-15 12 14 168
16-18 20 17 340
19-21 14 20 280
22-24 16 23 368
25-27 9 26 234
28-30 5 29 145
$\sum f = 80$, $\sum fx = 1579$
Thus
$$\text{Mean score} = \frac{\sum fx}{\sum f} = \frac{1579}{80} = 19.7375 \approx 19.7.$$
17.2 Median
The median for grouped data is found by locating the value that divides the data into two
equal parts. In finding the median for grouped data, it is assumed that the data in each class
are uniformly spread across the class.
$$\text{Median} = L_1 + \left(\frac{\frac{N}{2} - f_b}{f}\right) \times C.$$
where $L_1$ is the lower boundary of the median class, N is the total frequency, $f_b$ is the
cumulative frequency before the median class, f is the frequency of the median class, and C is
the class width or interval.
17.3 Mode
The modal class is defined to be the class with the maximum frequency. The mode for grouped
data is defined to be the class mark of the modal class.
$$\text{Mode} = L_1 + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) \times C.$$
where $L_1$ is the lower boundary of the modal class, C is the class interval/width, $\Delta_1$ is the
difference between the modal frequency and the frequency of the class before the modal class
and $\Delta_2$ is the difference between the modal frequency and the frequency of the class after the
modal class.
$$s^2 = \frac{\sum f(x - \bar{x})^2}{n - 1}.$$
Alternatively, we can write
$$s^2 = \frac{\sum fx^2 - \frac{(\sum fx)^2}{n}}{n - 1}$$
and the standard deviation is given by
$$s = \sqrt{\text{Variance}}$$
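The two forms of the grouped variance give the same result. The sketch below checks this on the class marks and frequencies of Table 7 (the statistics test marks); the resulting variance is computed here only for illustration and is not a figure quoted in the notes.

```python
import math

# Class marks and frequencies from Table 7
midpoints = [11, 14, 17, 20, 23, 26, 29]
freqs = [4, 12, 20, 14, 16, 9, 5]

n = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
sum_fx2 = sum(f * x * x for f, x in zip(freqs, midpoints))
mean = sum_fx / n

# Definition: squared deviations about the mean, weighted by frequency
var_definition = sum(f * (x - mean) ** 2 for f, x in zip(freqs, midpoints)) / (n - 1)

# Shortcut: uses only the sums of fx and fx^2
var_shortcut = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)

sd = math.sqrt(var_shortcut)
print(n, sum_fx, round(var_definition, 3), round(var_shortcut, 3), round(sd, 3))
```

The shortcut form avoids computing each deviation and is usually the more convenient one for hand calculation.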
Example 17.3. The following are marks scored by 24 students in a mathematics examination
20, 19, 23, 28, 42, 51, 44, 48, 76, 58, 64, 90, 87, 59, 36, 32, 45, 83, 15, 76, 66, 53, 57, 91
(i) Starting with 10 − 19 and using equal groups of interval, 10, make a frequency table
$$\sum f = 24, \quad \sum fx = 1{,}268, \quad \sum f(x - \bar{x})^2 = 15669.498$$
(ii)
$$\text{Mean } \bar{x} = \frac{\sum fx}{\sum f} = \frac{1{,}268}{24} = 52.833$$
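(iii) Modal class is 50-59, thus
$$\text{Mode} = L_1 + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) \times C = 49.5 + \left(\frac{1}{1 + 3}\right) \times 10 = 52$$
(iv)
$$\text{Median} = L_1 + \left(\frac{\frac{N}{2} - f_b}{f}\right) \times C = 49.5 + \left(\frac{\frac{24}{2} - 11}{5}\right) \times 10 = 51.5$$
(v)
$$\text{Variance} = \frac{\sum f(x - \bar{x})^2}{n - 1} = \frac{15{,}669.498}{24 - 1} = \frac{15{,}669.498}{23} = 681.28252 \approx 681.3.$$
(vi)
$$\text{Standard deviation} = \sqrt{\text{Variance}} = \sqrt{681.3} = 26.1.$$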
Example 17.4. The frequency distributions of seed yield of 50 groundnut plants are given
below.
Table 9: Frequency distribution table showing seed yield for 50 groundnut plants
Find the
(i) variance
SOLUTION. Consider frequency distribution table below
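(i) Variance
$$s^2 = \frac{\sum fx^2 - \frac{(\sum fx)^2}{n}}{n - 1} = \frac{1{,}537 - \frac{(271)^2}{50}}{50 - 1} = \frac{1{,}537 - 1{,}468.82}{49} = \frac{68.18}{49} = 1.391$$
$$\delta^2 = \frac{\sum f(x - \bar{x})^2}{n}$$
$$s^2 = \frac{\sum f(x - \bar{x})^2}{n - 1} = \frac{n\delta^2}{n - 1}$$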
(iv) Sample standard deviation is given by
$$s = \sqrt{\frac{\sum f(x - \bar{x})^2}{n - 1}} = \sqrt{\frac{n\delta^2}{n - 1}}$$
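The relation between the variance with divisor n and the sample variance with divisor n - 1 can be checked numerically. The sketch below reuses the five heights from Example 16.1 as ungrouped data; `statistics.pvariance` divides by n and `statistics.variance` divides by n - 1.

```python
import statistics

heights = [100, 102, 118, 124, 126]
n = len(heights)

var_n = statistics.pvariance(heights)        # divisor n (the delta-squared form)
var_n_minus_1 = statistics.variance(heights) # divisor n - 1 (the sample variance)

# s^2 = n * delta^2 / (n - 1)
rescaled = n * var_n / (n - 1)
print(var_n, var_n_minus_1, rescaled)
```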
Example 17.5. The following are masses of sample food items from a store
(i) Using a histogram to obtain the mode
(iii)
$$\text{Mean } \bar{x} = \frac{\sum fx}{\sum f} = \frac{704.5}{21} = 33.55$$
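$$\text{Variance of sample } s^2 = \frac{\sum f(x - \bar{x})^2}{n - 1} = \frac{2{,}980.9}{21 - 1} = 149.045$$
$$\text{Standard deviation of sample } s = \sqrt{\text{Variance}} = \sqrt{149.045} = 12.21.$$
(iv)
$$\text{Coefficient of variation} = \frac{s}{\bar{x}} \times 100\% = \frac{12.21}{33.55} \times 100\% = 36.39\% \approx 36.4\%.$$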
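The chain from the sum of squared deviations to the coefficient of variation can be retraced with the totals quoted in this example (sum of squared deviations 2,980.9, n = 21, mean 33.55). A sketch for checking the arithmetic:

```python
import math

sum_sq_dev = 2980.9       # sum of f * (x - mean)^2, as quoted in the example
n = 21                    # total frequency
mean = 33.55              # mean found in part (iii)

variance = sum_sq_dev / (n - 1)
sd = math.sqrt(variance)
cv_percent = sd / mean * 100      # standard deviation as a percentage of the mean

print(round(variance, 3), round(sd, 2), round(cv_percent, 1))
```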
1. There are twenty pupils in class A and twenty pupils in class B. All the pupils in class A
were given an I.Q test. Their scores on the test are given below
100 104 106 107 109 110 113 114 116 117
118 119 119 121 124 125 127 127 130 134
(a) Construct a frequency distribution table for above data starting with class interval of
100-104.
(b) Calculate the mean score and the standard deviation for pupils in class A.
(c) Class B takes the same I.Q test. They obtain a mean of 110 and standard deviation of
21. Compare the data for class A and Class B.
5 4 8 7 3 6 5 9 6 10 7 8 6 4 2
5 6 6 7 7 4 7 8 3 7
After making any necessary calculations for the second set, compare the two sets of
scores. Your answer should be understandable to someone who does not study statistics.
3. 40 boys sat a test which was marked out of 50. Their marks were
28 42 35 17 49 12 48 38 24 27 23 24 30 42 44 13 48 33 26 17
18 12 45 27 17 16 28 33 34 38 27 46 25 20 40 169 43 38 25 21
(a) Starting with class interval of 10-14, construct a frequency distribution table for above
data.
(b) Calculate
(i) the mean of the marks
(ii) the standard deviation of the marks
(c) 40 girls sat the same test. Their marks had a mean of 30 and a standard deviation of
6.5. Compare the performances of the boys and girls.
18 Distributions
Probability distributions are divided into two categories:
(i) Discrete probability distributions. These include the Binomial, Poisson and Hypergeometric
distributions.
(ii) Continuous probability distributions. These include the normal, uniform and exponential
distributions.
Sample point HH HT TH TT
X 2 1 1 0
Thus, for example, in the case of HH (i.e., 2 heads), X = 2, while for TH (i.e., 1 head), X = 1.
It follows that X is a random variable.
It should be noted that many other variables could be defined on this sample space, for example,
the square of the number of heads or the number of heads minus the number of tails.
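The idea of a random variable as a rule that assigns a number to each sample point can be illustrated with a short Python sketch. It lists the sample space for two tosses of a coin and evaluates X (the number of heads) together with the two other variables mentioned above. The variable names are assumptions made for this illustration.

```python
from itertools import product

# Sample space for two tosses of a coin: HH, HT, TH, TT
sample_space = ["".join(outcome) for outcome in product("HT", repeat=2)]

# X assigns to each sample point its number of heads
X = {point: point.count("H") for point in sample_space}

# Two other random variables defined on the same sample space
heads_squared = {point: X[point] ** 2 for point in sample_space}
heads_minus_tails = {
    point: point.count("H") - point.count("T") for point in sample_space
}

for point in sample_space:
    print(point, X[point], heads_squared[point], heads_minus_tails[point])
```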
A random variable that takes on a finite or countably infinite number of values is called a
discrete random variable while one which takes on a noncountably infinite number of values
is called a continuous random variable.
• thickness of an item
• time required to complete a task
• temperature of a solution
• height, in inches
These can potentially take on any value, depending only on the ability to measure accurately.
In this unit, we shall study Binomial distributions and normal distributions.
19 Binomial Distribution
Binomial distribution deals with the possible numbers of successes when there are n trials,
each of which may be a success (with probability p) or a failure (with probability q); p and q
are fixed positive numbers and
p + q = 1.
This distribution is denoted by B(n, p).
For B(n, p), the probability of r successes in n trials is found by the same argument as before.
Each success has probability p and each failure has probability q, so the probability of r
successes and (n − r) failures in a particular order is $p^r q^{n-r}$. The positions in the sequence of
n trials which the successes occupy can be chosen in $^nC_r$ ways. Therefore
\[ P(X = r) = {}^nC_r \, p^r q^{n-r} \quad \text{for } 0 \le r \le n. \]
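As a minimal sketch, the formula above translates directly into Python using `math.comb` for the number of combinations. The function name `binomial_pmf` is an assumption made for this illustration, and the values n = 4, p = 0.25 are hypothetical.

```python
from math import comb

def binomial_pmf(r, n, p):
    """P(X = r) for X ~ B(n, p): nCr * p^r * q^(n - r)."""
    q = 1 - p
    return comb(n, r) * p**r * q**(n - r)

# The probabilities over r = 0, 1, ..., n should add up to 1
probs = [binomial_pmf(r, 4, 0.25) for r in range(5)]
total = sum(probs)
print(probs, total)
```

Summing the probabilities over all possible values of r is a quick check that the distribution has been set up correctly.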
(i) They have a fixed number of trials, n, e.g., a coin is tossed 5 times or 8 questions are
attempted.
(ii) There are only two possible outcomes for each trial, e.g., a success and a failure, correct
and wrong, head and tail, etc. The letter p denotes the probability of a success on one
trial, and q denotes the probability of a failure on one trial, so p + q = 1.
(iii) The probability of a success is the same for each trial, e.g., the probability of getting a
head is 0.5 on every toss of a fair coin.
(iv) The trials are independent, i.e., the outcome of each trial does not depend on the outcome
of any previous trial.
For n trials, the mean is E(X) = np, the variance is Var(X) = npq and the standard deviation
is √(npq).
Example 19.1 (When to apply binomial distribution). Examples below show when to apply
a binomial distribution
• A firm bidding for a contract will either get the contract or not
• A marketing research firm receives survey responses of “yes I will buy” or “no I will not”
Example 19.2. Extensive research has shown that 1 person out of every 4 is allergic to a
particular grass seed. A group of 20 university students volunteer to try out a new treatment.
(a) What is the expectation of the number of allergic people in the group?
(c) How large a sample would be needed for the probability of it containing at least one
allergic person to be greater than 99.9%?
Example 19.3. A fair coin is tossed 5 times, find the probability of getting
(i) 2 heads
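(i)
\[
\begin{aligned}
P(H = 2) &= \binom{5}{2}(0.5)^2(0.5)^3 \\
&= \binom{5}{2}(0.03125) \\
&= 10 \times 0.03125 \\
&= 0.3125
\end{aligned}
\]
(ii)
\[
\begin{aligned}
P(x \ge 4) &= P(x = 4) + P(x = 5) \\
&= \binom{5}{4}(0.5)^4(0.5)^1 + \binom{5}{5}(0.5)^5(0.5)^0 \\
&= 0.15625 + 0.03125 \\
&= 0.1875
\end{aligned}
\]
(iii)
\[
\begin{aligned}
P(x \le 1) &= P(x = 1) + P(x = 0) \\
&= \binom{5}{1}(0.5)^1(0.5)^4 + \binom{5}{0}(0.5)^0(0.5)^5 \\
&= 0.15625 + 0.03125 \\
&= 0.1875
\end{aligned}
\]
(iv)
\[
\begin{aligned}
P(x > 2) &= P(x = 3) + P(x = 4) + P(x = 5) \\
&= \binom{5}{3}(0.5)^3(0.5)^2 + \binom{5}{4}(0.5)^4(0.5)^1 + \binom{5}{5}(0.5)^5(0.5)^0 \\
&= 0.3125 + 0.15625 + 0.03125 \\
&= 0.5
\end{aligned}
\]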
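The four answers of Example 19.3 can be checked with a short Python sketch that adds up individual binomial probabilities over a range of values. The helper names `binomial_pmf` and `prob_between` are assumptions made for this illustration.

```python
from math import comb

def binomial_pmf(r, n, p):
    """P(X = r) for X ~ B(n, p)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

def prob_between(lo, hi, n, p):
    """P(lo <= X <= hi), found by adding the individual probabilities."""
    return sum(binomial_pmf(r, n, p) for r in range(lo, hi + 1))

n, p = 5, 0.5  # a fair coin tossed 5 times

exactly_two = prob_between(2, 2, n, p)    # part (i)
at_least_four = prob_between(4, 5, n, p)  # part (ii)
at_most_one = prob_between(0, 1, n, p)    # part (iii)
more_than_two = prob_between(3, 5, n, p)  # part (iv)

print(exactly_two, at_least_four, at_most_one, more_than_two)
```

Writing every part as a sum over a range of r makes the difference between "at least", "at most" and "more than" explicit in the limits of the range.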
Example 19.4. A basket has 8 good fish and 12 rotting fish. A fish is picked at random
and then put back before the next pick is made. If the picking is done 10 times, find the
probability that:
SOLUTION. P(good) = 8/20 = 0.4, n = 10, q = 1 − 0.4 = 0.6
(a) x = 3
\[ P(x = 3) = \binom{10}{3}(0.4)^3(0.6)^7 = 0.2150 \]
(b)
\[
\begin{aligned}
P(x > 7) &= P(x = 8) + P(x = 9) + P(x = 10) \\
&= \binom{10}{8}(0.4)^8(0.6)^2 + \binom{10}{9}(0.4)^9(0.6)^1 + \binom{10}{10}(0.4)^{10}(0.6)^0 \\
&= 0.0106 + 0.0016 + 0.0001 \\
&= 0.0123
\end{aligned}
\]
(c)
\[ E(x) = np = 10 \times 0.4 = 4 \]
(d)
\[ Var(x) = npq = 10 \times 0.4 \times 0.6 = 2.4 \]
(e) Standard deviation = $\sqrt{2.4} = 1.549$
Example 19.5. It is estimated that 42% of women ages 45 to 54 are overweight. If 20 females
between 45 and 54 are randomly selected, what is the probability that one-half of them are
overweight?
SOLUTION. Let X represent the number of women in the 20 who are overweight.
Then, X has a binomial distribution with n = 20 and p = 0.42, q = 1 − 0.42 = 0.58.
The probability P (X = 10) is given as follows:
\[
\begin{aligned}
P(X = 10) &= \binom{20}{10}(0.42)^{10}(0.58)^{10} \\
&= \frac{20!}{10!\,10!}(0.42)^{10}(0.58)^{10} \\
&= 0.1359.
\end{aligned}
\]
Example 19.6. Seventy-five percent of employed women say their income is essential to
support their family. Let X be the number in a sample of 200 employed women who will say
their income is essential to support their family. What is the mean and standard deviation of
X?
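No worked solution is shown here for Example 19.6. As a hedged sketch, the mean and standard deviation can be computed with the same formulas used in Example 19.4, E(X) = np and standard deviation √(npq), assuming X ∼ B(200, 0.75). The function name `binomial_mean_sd` is an assumption made for this illustration.

```python
from math import sqrt

def binomial_mean_sd(n, p):
    """Mean np and standard deviation sqrt(npq) of X ~ B(n, p)."""
    q = 1 - p
    return n * p, sqrt(n * p * q)

# Example 19.6: n = 200 employed women, p = 0.75
mean, sd = binomial_mean_sd(200, 0.75)
print(mean, round(sd, 3))

# Example 19.4 parts (c) and (e), for comparison: n = 10, p = 0.4
mean_fish, sd_fish = binomial_mean_sd(10, 0.4)
print(mean_fish, round(sd_fish, 3))
```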
1. Thirty percent of the trees in a national forest are infested with a parasite. Fifty trees are
randomly selected from this forest and X is defined to equal the number of trees in the 50
sampled that are infested with the parasite. The infestation is uniformly spread throughout
the forest. Identify the values for n, p, and q.
2. There is a fault in a machine making microchips, with the result that only 80% of those it
produces work. A random sample of eight microchips made by this machine is taken. What
is the probability that exactly six of them work?
20 Normal Distribution
20.1 Introduction
The normal distribution is the most commonly used of all probability distributions in
statistical analysis. Many distributions found in nature and industry are approximately normal. Some
examples are the IQs (intelligence quotients), weights, and heights of a large number of people
and the variations in dimensions of a large number of parts produced by a machine.
The normal distribution can often be used to approximate other distributions, such as the
binomial and the Poisson distributions, and provides the basis for statistical inference.
In this section, you will learn how to
X ∼ N(µ, σ²).
(ii) The mean, mode and median are equal for normal distribution
To find this probability, you need to find the area under the normal curve between a and b.
One way of finding areas is to integrate, but since the normal function is complicated and
very difficult to integrate, tables are used instead (see Appendix B).
Example 20.1. Find
SOLUTION. (i)
P (z > 2.15) = 0.5 − P (0 < z < 2.15)
= 0.5 − 0.4842
= 0.0158.
(ii)
P (z < 1.72) = 0.5 + P (0 < z < 1.72)
= 0.5 + 0.4573
= 0.9573.
(iii)
P(1.5 < z ≤ 2.62) = P(0 < z < 2.62) − P(0 < z < 1.5)
= 0.4956 − 0.4332
= 0.0624.
(iv)
P(z ≤ −1.28) = 0.5 − P(0 < z < 1.28)
= 0.5 − 0.3997
= 0.1003.
(v)
P(z > −2.94) = 0.5 + P(0 < z < 2.94)
= 0.5 + 0.4984
= 0.9984
(vi)
P(−2.31 ≤ z ≤ −1.28) = P(0 < z < 2.31) − P(0 < z < 1.28)
= 0.4896 − 0.3997
= 0.0899
(vii)
P(−1.52 ≤ z ≤ 1.84) = P(0 < z < 1.52) + P(0 < z < 1.84)
= 0.4357 + 0.4671
= 0.9028
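The table look-ups in Example 20.1 can be cross-checked numerically. The sketch below builds the standard normal cumulative probability from the error function `math.erf`. The function names `phi` and `area_0_to` are assumptions made for this illustration, and results may differ from the table values in the fourth decimal place because the table entries are rounded.

```python
from math import erf, sqrt

def phi(z):
    """Cumulative probability P(Z <= z) for the standard normal."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def area_0_to(z):
    """Area under the standard normal curve between 0 and z (z >= 0),
    i.e. the quantity P(0 < Z < z) read from the table."""
    return phi(z) - 0.5

p_i = 1 - phi(2.15)              # P(z > 2.15)
p_iii = phi(2.62) - phi(1.5)     # P(1.5 < z <= 2.62)
p_vi = phi(-1.28) - phi(-2.31)   # P(-2.31 <= z <= -1.28)
p_vii = phi(1.84) - phi(-1.52)   # P(-1.52 <= z <= 1.84)

print(round(area_0_to(2.15), 4), round(p_i, 4), round(p_iii, 4))
print(round(p_vi, 4), round(p_vii, 4))
```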
The table lists only positive values of z. Since the normal curve is symmetric about z = 0,
the area between 0 and −z is equal to the area between 0 and z.
Example 20.3. Bread produced by Ntake Bakery is normally distributed with a mean mass
of 500g and a standard deviation of 15g. Find the probability that a loaf of Ntake Bakery bread
picked at random has a mass
SOLUTION. We shall first standardize each value and then use the table in Appendix B to
obtain each of the probabilities asked for above.
(i)
\[ p(x < 470) = p\left(z < \frac{470 - 500}{15}\right) = p(z < -2) \]
(ii)
\[ p(x > 510) = p\left(z > \frac{510 - 500}{15}\right) = p(z > 0.67) \]
(iii)
\[ p(485 \le x \le 510) = p\left(\frac{485 - 500}{15} \le z \le \frac{510 - 500}{15}\right) = p(-1 \le z \le 0.67) \]
(iv)
\[ p(480 \le x \le 490) = p\left(\frac{480 - 500}{15} \le z \le \frac{490 - 500}{15}\right) = p(-1.33 \le z \le -0.67) \]
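The standardized probabilities of Example 20.3 can be evaluated with Python's standard-library `statistics.NormalDist`, which works directly with the mean and standard deviation so that no table look-up is needed. This is a sketch for checking purposes. The variable names are assumptions, and the results can differ slightly from table-based answers because the table uses z rounded to two decimal places.

```python
from statistics import NormalDist

# Mass of a loaf in grams: mean 500, standard deviation 15
mass = NormalDist(mu=500, sigma=15)

p_i = mass.cdf(470)                    # P(X < 470)
p_ii = 1 - mass.cdf(510)               # P(X > 510)
p_iii = mass.cdf(510) - mass.cdf(485)  # P(485 <= X <= 510)
p_iv = mass.cdf(490) - mass.cdf(480)   # P(480 <= X <= 490)

for label, prob in [("(i)", p_i), ("(ii)", p_ii), ("(iii)", p_iii), ("(iv)", p_iv)]:
    print(label, round(prob, 4))
```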
SOLUTION. Let X be the burning time of a bulb, measured in hours. We are asked to
find P(115 < X < 133) given µ = 100h and σ = 12h, letting X1 = 115h and X2 = 133h.
We shall now standardize these values and get
\[ z_1 = \frac{X_1 - \mu}{\sigma} = \frac{115 - 100}{12} = \frac{15}{12} = 1.25. \]
\[ z_2 = \frac{X_2 - \mu}{\sigma} = \frac{133 - 100}{12} = \frac{33}{12} = 2.75. \]
To obtain the required probability, we use the cumulative normal table to get the shaded area
between z1 = 1.25 and z2 = 2.75 as shown in figure 12 below
Looking up z1 = 1.25 in Appendix B, we get 0.3944. This is the area from z = 0 to
z1 = 1.25.
Looking up z2 = 2.75 in Appendix B, we get 0.4970. This is the area from z = 0 to
z2 = 2.75.
Subtracting 0.3944 from 0.4970, we get
0.4970 − 0.3944 = 0.1026
or 10.26%, for the shaded area that gives P(115 < X < 133).
The probability that a bulb picked at random will have a lifetime between 115 and 133 burning
hours is 0.1026 or 10.26%.
1. The mean weight of a large group of people is 80kg and the standard deviation is 6kg. If
the weights are normally distributed, find the probability that a person picked at random
from the group will weigh
(a) between 71kg and 80kg (b) above 95kg (c) below 68kg
2. If 20% of the students entering college drop out before receiving their diplomas, find the
probability that out of 20 students picked at random from the very large number of students
entering college, less than 3 drop out.
3. If 90% of the bulbs produced in a plant are acceptable, what is the probability that out of 10
bulbs picked at random from the very large output of the plant, 8 are acceptable?
4. In a bid to fill various positions within Uganda Revenue Authority (URA) that were
advertised in April 2021, an online assessment was conducted from 3rd to 5th January 2022.
The assessment was conducted through an independent service provider, "Test Gorilla",
a Netherlands-based company. A total of 30,471 applicants attempted the online
assessment while 11,946 did not attempt the assessment. The pass mark was 40%.
Assuming that the results obtained by the applicants who sat for the online assessment are
normally distributed with a mean µ = 56% and standard deviation σ = 16%, what is the
probability that an applicant picked at random scored
(i) between 46% and 80%
(ii) between 40% and 100%
(iii) between 20% and 40%
(iv) below 40%
5. Heights of college women have a distribution that can be approximated by a normal curve
with a mean of 65 inches and a standard deviation equal to 3 inches. About what proportion
of college women are between 65 and 67 inches tall?
21 Normal approximation
The normal distribution can be used as an approximation to
(b) Poisson approximation: The normal distribution can also be used to approximate the
Poisson distribution for large values of the mean of the Poisson distribution.
In this section, we shall look at how the normal distribution can be used as an approximation
to the binomial distribution.
Recall that if X is the binomial random variable, then X ∼ B(n, p). The shape of the binomial
distribution needs to be similar to the shape of the normal distribution. To ensure this, the
quantities np and nq must both be greater than five (np > 5 and nq > 5); the approximation
is better if they are both greater than or equal to 10.
Then the binomial can be approximated by the normal distribution with mean µ = np and
standard deviation σ = √(npq).
Remember that q = 1 − p. In order to get the best approximation, add 0.5 to x or subtract
0.5 from x (use x + 0.5 or x − 0.5). For example, P(X ≥ x) is approximated by the normal
area above x − 0.5, and P(X ≤ x) by the normal area below x + 0.5.
The number 0.5 is called the continuity correction factor.
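The procedure described above can be sketched in Python as follows: check that np and nq both exceed 5, build the normal distribution with µ = np and σ = √(npq), and apply the continuity correction. The function names are assumptions made for this illustration, and the values n = 100, p = 0.5 in the usage lines are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def normal_approx(n, p):
    """Normal distribution approximating B(n, p); requires np > 5 and nq > 5."""
    q = 1 - p
    if n * p <= 5 or n * q <= 5:
        raise ValueError("np and nq should both be greater than 5")
    return NormalDist(mu=n * p, sigma=sqrt(n * p * q))

def approx_at_least(x, n, p):
    """Approximate P(X >= x) by the normal area above x - 0.5."""
    return 1 - normal_approx(n, p).cdf(x - 0.5)

def approx_at_most(x, n, p):
    """Approximate P(X <= x) by the normal area below x + 0.5."""
    return normal_approx(n, p).cdf(x + 0.5)

def approx_exactly(x, n, p):
    """Approximate P(X = x) by the normal area between x - 0.5 and x + 0.5."""
    dist = normal_approx(n, p)
    return dist.cdf(x + 0.5) - dist.cdf(x - 0.5)

# Hypothetical illustration: n = 100 trials with p = 0.5
print(round(approx_at_least(55, 100, 0.5), 4))
print(round(approx_exactly(50, 100, 0.5), 4))
```

Raising an error when the conditions fail is a design choice of this sketch. It prevents the approximation from being used silently where the binomial shape is too skewed.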
Example 21.1. Experience indicates that 30% of the people entering a store make a purchase.
Using
(a) the binomial distribution and
(b) the normal approximation to the binomial,
find the probability that out of 30 people entering the store, 10 or more will make a purchase.
SOLUTION. Let X be the number of people, out of the 30 entering the store, who make a
purchase. We are required to compute P(X ≥ 10).
(a) Here n = 30, p = 0.3, q = 1 − 0.3 = 0.7. From the binomial distribution table in
Appendix A, we obtain
P(X ≥ 10) = P(10) + P(11) + · · · + P(30)
= 0.1416 + 0.1103 + 0.0749 + 0.0444 + 0.0231 + 0.0106 + 0.0042 + 0.0015
+ 0.0005 + 0.0001
= 0.4112.
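(b) Here µ = np = 30 × 0.3 = 9 and σ = √(npq) = √(30 × 0.3 × 0.7) = √6.3 ≈ 2.51.
Applying the continuity correction, P(X ≥ 10) is approximated by P(X > 9.5), for which
z = (9.5 − 9)/2.51 ≈ 0.20. From Appendix B, P(0 < z < 0.20) = 0.0793.
Therefore
P(X > 9.5) = 0.5 − 0.0793
= 0.4207.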
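The two answers of Example 21.1 can be compared with a short Python sketch: the exact binomial sum for part (a) and the normal approximation with continuity correction for part (b). The variable names are assumptions made for this illustration. The normal result differs slightly from 0.4207 in the fourth decimal place because the table look-up rounds z to 0.20.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 30, 0.3
q = 1 - p

# (a) Exact binomial: P(X >= 10) = P(10) + P(11) + ... + P(30)
exact = sum(comb(n, r) * p**r * q**(n - r) for r in range(10, n + 1))

# (b) Normal approximation with continuity correction: P(X > 9.5)
mu = n * p
sigma = sqrt(n * p * q)
approx = 1 - NormalDist(mu, sigma).cdf(9.5)

print(round(exact, 4), round(approx, 4))
```

The two values agree to about two decimal places, which is typical of the approximation when np and nq are only moderately larger than 5.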
1. Use the normal approximation to the binomial with n = 30 and p = 0.5 to find the proba-
bility P (X = 18).
2. Use the normal approximation to the binomial with n = 10 and p = 0.5 to find the proba-
bility P (X > 7).
4. Past experience indicates that 60% of the students entering college get their degrees. Using
find the probability that out of 30 students picked at random from the entering class, more
than 20 will receive their degrees.
5. Suppose X is a binomial random variable with n = 600 trials and probability of success
p = 0.35 . Find the probability using the normal approximation to the binomial distribution
with a continuity correction.
6. Assume that the probability of a college student having a car on campus is 0.30. A random
sample of 12 students is taken. What is the probability that at least 4 will have a car on
campus?
7. According to the Nation’s Report Card, also known as the National Assessment of Educa-
tional Progress (NAEP), only 25% of senior two students are proficient in mathematics.
Suppose 200 senior two students from Ugandan schools are selected at random. Answer
each problem using the normal approximation to the binomial distribution.
(i) Find the approximate probability that at least 55 students are proficient in mathemat-
ics.
(ii) Find the approximate probability that between 60 and 65 (inclusive) students are pro-
ficient in mathematics.
(iii) Suppose the NAEP test results for each student are used to find that 36 (of the 200)
students are proficient in mathematics. Is there any evidence to suggest that fewer
than 25% of senior two students are proficient in mathematics?
Justify your answer.
8. (a) (i) Write down two conditions for X ∼ B(n, p) to be approximated by a normal
distribution Y ∼ N(µ, σ²).
(ii) Write down the mean and variance of this normal approximation in terms of n
and p.
(b) A factory manufactures 2000 DVDs every day. It is known that 3% of DVDs are faulty.
Using a normal approximation, estimate the probability that at least 40 faulty DVDs
are produced in one day.
9. In a certain college, 20% of students own a touch screen laptop. A random sample of
n students is chosen from the college. Using a normal approximation, the probability that
more than 55 of these n students own a touch screen laptop is 0.0401 correct to 3 significant
figures. Find the value of n.
Appendices
A Binomial Distribution table
Example A.1.
P (X = 3, n = 5, p = 0.30) = 0.1323