Descriptive Statistics - Notes2
DESCRIPTIVE STATISTICS: Grouped data
1.1. Measures of central location
1. Arithmetic mean (AM)
The AM is the most commonly used measure of central tendency. To calculate the AM for ungrouped
data, we use
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Example:
For 10 years a company declared its percentage dividends as follows:
Year             1   2   3   4   5   6   7   8   9   10
Dividend (x_i)   5   6   14  20  30  10  15  20  20  30
Calculate the average percentage dividend declared by the company during the 10 years.
Solution
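As a quick check of the formula, the average dividend can be computed directly. This is a minimal Python sketch rather than the original worked solution; the variable names are illustrative.

```python
# Percentage dividends declared over the 10 years (from the table above)
dividends = [5, 6, 14, 20, 30, 10, 15, 20, 20, 30]

# Arithmetic mean: sum of the observations divided by their count
mean_dividend = sum(dividends) / len(dividends)

print(mean_dividend)  # 17.0
```

The observations sum to 170, so the average dividend over the 10 years is 170/10 = 17%.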
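Calculating the AM from a frequency distribution
The AM of a discrete frequency distribution is calculated as
$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$$
where $x_i$ is the $i$th distinct value, $f_i$ is its frequency, and $k$ is the number of distinct values.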
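Example:
In a survey of 50 retail outlets, the following data were collected.
Annual profit   Number of outlets
10              3
15              8
20              23
25              10
30              6
Calculate the mean annual profit.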
Solution
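The frequency-weighted formula can be sketched in Python for the annual profit table above. This is an illustrative calculation, not the original worked solution; the variable names are assumptions.

```python
# Annual profit values and the number of outlets reporting each value
values = [10, 15, 20, 25, 30]
freqs = [3, 8, 23, 10, 6]

# Numerator: sum of f_i * x_i; denominator: sum of f_i
weighted_sum = sum(f * x for f, x in zip(freqs, values))
total_freq = sum(freqs)

mean_profit = weighted_sum / total_freq
print(weighted_sum, total_freq, mean_profit)  # 1040 50 20.8
```

Each value is weighted by how often it occurs, so the mean is pulled towards 20, the value with the largest frequency.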
Calculating the AM from a grouped frequency distribution
The AM of a grouped frequency distribution is calculated as
$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$$
where $x_i$ is the class midpoint of the $i$th class, $f_i$ is the number of observations falling in
the $i$th class, and $k$ is the number of classes.
Example:
The following frequency distribution summarizes data on service times in minutes at the
checkout counter of a supermarket.
Time interval (min)   Customers
2.00‐<2.50            3
2.50‐<3.00            8
3.00‐<3.50            23
3.50‐<4.00            10
4.00‐<4.50            6
Calculate the estimated average time a customer takes for a checkout at the counter in this
supermarket.
Solution
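For grouped data the class midpoints stand in for the individual observations, so the result is an estimate. The sketch below is an illustrative Python calculation, not the original worked solution. It assumes the first class starts at 2.00 (as in the later median and mode examples) and that all classes have width 0.5.

```python
# Lower class limits and common class width (minutes); first limit assumed to be 2.00
lower_limits = [2.00, 2.50, 3.00, 3.50, 4.00]
width = 0.5
freqs = [3, 8, 23, 10, 6]

# Class midpoint = lower limit + half the class width
midpoints = [lo + width / 2 for lo in lower_limits]

# Estimated mean = sum(f_i * x_i) / sum(f_i), with x_i the class midpoints
mean_time = sum(f * x for f, x in zip(freqs, midpoints)) / sum(freqs)

print(midpoints)            # [2.25, 2.75, 3.25, 3.75, 4.25]
print(round(mean_time, 2))  # 3.33
```

The estimated average checkout time is therefore about 3.33 minutes.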
2. Median (Mdn/Md)
The median is defined as the middle value when the data are arranged in ascending order. It divides
the data set into two equal parts.
Calculating the median for a grouped data set:
Calculating the median for a discrete frequency distribution
Steps:
1. Construct the less-than cumulative frequency distribution.
2. Calculate $n/2$, where $n$ is the total frequency (the last cumulative frequency).
3. Find the first cumulative frequency that is equal to or just greater than the value of $n/2$ calculated in step 2.
4. The value of the variable corresponding to the cumulative frequency found in step 3 is the
median of the data set.
Example:
In a survey of 50 retail outlets, the following data were collected.
Annual profit   Number of outlets
10              3
15              8
20              23
25              10
30              6
Calculate the median for the annual profit.
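The four steps for a discrete frequency distribution can be sketched as follows. This is an illustrative Python version applied to the table above, not the original worked solution; the variable names are assumptions.

```python
from itertools import accumulate

values = [10, 15, 20, 25, 30]
freqs = [3, 8, 23, 10, 6]

# Step 1: less-than cumulative frequencies
cum_freqs = list(accumulate(freqs))  # [3, 11, 34, 44, 50]

# Step 2: n / 2, where n is the total frequency
half_n = cum_freqs[-1] / 2           # 25.0

# Steps 3 and 4: first value whose cumulative frequency is >= n / 2
median_profit = next(v for v, cf in zip(values, cum_freqs) if cf >= half_n)

print(median_profit)  # 20
```

The first cumulative frequency that reaches 25 is 34, which belongs to the value 20, so the median annual profit is 20.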
Calculating the median for a grouped frequency distribution with equal intervals
Steps:
1. Construct the less-than cumulative frequency distribution.
2. Calculate $n/2$, where $n$ is the total frequency (the last cumulative frequency).
3. Find the first cumulative frequency that is equal to or just greater than the value of $n/2$ calculated in step 2.
4. The median class is the class whose cumulative frequency is the one found in step 3.
5. Calculate the median using the following formula:
$$M_d = l_{M_d} + \frac{h}{f}\left(\frac{n}{2} - F\right)$$
Where,
$M_d$ is the median of the data set
$l_{M_d}$ is the lower class limit of the median class
$h$ is the width of the median class
$f$ is the frequency of the median class
$n$ is the total frequency
$F$ is the cumulative frequency of the class immediately before the median class.
Example:
The following frequency distribution summarizes data on service times in minutes at the checkout
counter of a supermarket.
Time interval (min)   Customers
2.00‐<2.50            3
2.50‐<3.00            8
3.00‐<3.50            23
3.50‐<4.00            10
4.00‐<4.50            6
Calculate the median for the time it takes for a customer to be checked out at counter in this
supermarket.
Solution
Step 1:
Time interval (min)   Customers (f_i)   Cumulative frequency (F_i)
2.00‐<2.50            3                 3
2.50‐<3.00            8                 11
3.00‐<3.50            23                34
3.50‐<4.00            10                44
4.00‐<4.50            6                 50
Step 2:
$$\frac{n}{2} = \frac{50}{2} = 25$$
Step 3: The cumulative frequency equal to or just greater than $n/2 = 25$ is 34.
Step 4:
The median class is 3.00‐<3.50.
Step 5: The median is found by using the interpolation formula
$$M_d = l_{M_d} + \frac{h}{f}\left(\frac{n}{2} - F\right)$$
$l_{M_d} = 3.00$ is the lower class limit of the median class
$h = 0.5$ is the width of the median class
$f = 23$ is the frequency of the median class
$n = 50$ is the total of the frequencies
$F = 11$ is the cumulative frequency of the class immediately before the median class.
$$M_d = 3 + \frac{0.5}{23}\left(\frac{50}{2} - 11\right) \approx 3.30 \text{ minutes}$$
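The five steps can be collected into one small function. The following is a minimal Python sketch under the assumption of equal class widths; the function name `grouped_median` and its parameters are illustrative and do not come from the notes.

```python
from itertools import accumulate

def grouped_median(lower_limits, width, freqs):
    """Interpolated median for a grouped frequency distribution with equal class widths."""
    cum_freqs = list(accumulate(freqs))   # Step 1: cumulative frequencies
    half_n = cum_freqs[-1] / 2            # Step 2: n / 2
    # Steps 3 and 4: index of the first class whose cumulative frequency is >= n / 2
    idx = next(i for i, cf in enumerate(cum_freqs) if cf >= half_n)
    # F: cumulative frequency of the class before the median class (0 if there is none)
    F = cum_freqs[idx - 1] if idx > 0 else 0
    # Step 5: interpolation formula
    return lower_limits[idx] + (width / freqs[idx]) * (half_n - F)

median_time = grouped_median([2.0, 2.5, 3.0, 3.5, 4.0], 0.5, [3, 8, 23, 10, 6])
print(round(median_time, 2))  # 3.3
```

For the checkout data this reproduces the hand calculation: 3 + (0.5/23)(25 - 11), which is about 3.30 minutes.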
3. Mode ($M_o$)
The mode of a data set is the value that occurs with the greatest frequency, that is, the data
point that appears most often among the measurements that make up the data set.
Calculating the mode from grouped data set:
Calculating the mode from a discrete frequency distribution:
The mode is the value that has the highest frequency.
Example:
In a survey of 50 retail outlets, the following data were collected.
Annual profit (thousands N$)   Number of outlets
10                             3
15                             8
20                             23
25                             10
30                             6
Calculate the mode for the annual profit.
Solution
The highest frequency is 23. Therefore, 20 is the mode.
Calculating the mode for a grouped frequency distribution.
The mode is calculated using the following interpolation formula:
$$M_o = l_{M_o} + h\,\frac{f_1 - f_0}{(f_1 - f_0) + (f_1 - f_2)} = l_{M_o} + h\,\frac{f_1 - f_0}{2f_1 - f_0 - f_2}$$
Where
$M_o$ is the mode
$l_{M_o}$ is the lower limit of the modal class
$h$ is the width of the modal class
$f_1$ is the frequency of the modal class
$f_0$ is the frequency of the class immediately before the modal class
$f_2$ is the frequency of the class immediately after the modal class
Definition:
A modal class is the class interval having the highest frequency.
Example
The following frequency distribution summarizes data on service times in minutes at the
checkout counter of a supermarket.
Time interval (min)   Customers
2.00‐<2.50            3
2.50‐<3.00            8
3.00‐<3.50            23
3.50‐<4.00            10
4.00‐<4.50            6
Calculate the mode for the time it takes for a customer to be checked out at the counter in this
supermarket.
Solution:
The modal class is 3.00‐<3.50, as it has the highest frequency, 23.
$l_{M_o} = 3.0$ is the lower limit of the modal class
$h = 0.5$ is the width of the modal class
$f_1 = 23$ is the frequency of the modal class
$f_0 = 8$ is the frequency of the class immediately before the modal class
$f_2 = 10$ is the frequency of the class immediately after the modal class
Therefore,
$$M_o = 3 + 0.5 \times \frac{23 - 8}{(23 - 8) + (23 - 10)} = 3 + 0.5 \times \frac{15}{28} \approx 3.27 \text{ minutes}$$
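The interpolation formula for the mode can be sketched in Python as below. The function name `grouped_mode` is illustrative. Two choices here are assumptions that the notes do not specify: a missing neighbouring class (when the modal class is first or last) is given frequency 0, and ties are resolved by taking the first class with the highest frequency.

```python
def grouped_mode(lower_limits, width, freqs):
    """Interpolated mode for a grouped frequency distribution with equal class widths."""
    # Modal class: index of the (first) class with the highest frequency
    m = max(range(len(freqs)), key=lambda i: freqs[i])
    f1 = freqs[m]
    # Assumption: a missing neighbouring class is treated as having frequency 0
    f0 = freqs[m - 1] if m > 0 else 0
    f2 = freqs[m + 1] if m < len(freqs) - 1 else 0
    return lower_limits[m] + width * (f1 - f0) / (2 * f1 - f0 - f2)

mode_time = grouped_mode([2.0, 2.5, 3.0, 3.5, 4.0], 0.5, [3, 8, 23, 10, 6])
print(round(mode_time, 2))  # 3.27
```

The result matches the hand calculation, 3 + 0.5 × 15/28.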
1.2. Measures of dispersion
1.3.1 Partition values: Quartiles and Percentiles
Partition values are values of a variable that divide a data set into a number of equal parts,
e.g. quartiles, deciles, and percentiles.
1. Quartiles
Quartiles of a data set are values (partition values) that divide the data set into four equal parts
when data are arranged in ascending order.
There are three quartiles: the lower quartile ($Q_1$), the middle quartile (second quartile, $Q_2$),
and the upper quartile ($Q_3$).
Calculating quartiles from frequency distributions
To calculate the kth quartile from a grouped frequency distribution, we use the following
procedure:
Step 1: Construct the less-than cumulative frequency distribution.
Step 2: Calculate $n_k = \frac{k}{4}\,n$
For $Q_1$, the value of k=1
For $Q_2$, the value of k=2
For $Q_3$, the value of k=3
Step 3: Find the first cumulative frequency equal to or just greater than the value of $\frac{k}{4}\,n$
calculated in step 2.
Step 4: The kth quartile class is the class whose cumulative frequency is the one found in step 3.
Step 5: The kth quartile is calculated using the following interpolation formula:
$$Q_k = l_k + \frac{h}{f_k}\left(\frac{k}{4}\,n - F\right)$$
Where
$Q_k$ is the kth quartile for the data set;
$l_k$ is the lower class limit of the kth quartile class;
$h$ is the width of the kth quartile class;
$f_k$ is the frequency of the kth quartile class;
$F$ is the cumulative frequency of the class immediately before the kth quartile class;
$n$ is the total frequency.
Example:
The human resource department of a company analyzed the level of absenteeism of 56
employees who reported ill over the past year.
Absenteeism level (days absent)   Number of employees (f_i)
3‐<7                              14
7‐<11                             22
11‐<15                            11
15‐<19                            6
19‐<23                            3
Determine the first quartile, the second quartile, and the third quartile.
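The quartile procedure can be sketched in Python and applied to the absenteeism data. This is an illustrative calculation rather than the original worked solution; the function name `grouped_partition_value` and its parameters are assumptions, and equal class widths are assumed.

```python
from itertools import accumulate

def grouped_partition_value(lower_limits, width, freqs, k, parts):
    """kth partition value (parts=4 for quartiles) of a grouped distribution."""
    cum_freqs = list(accumulate(freqs))       # Step 1
    target = k * cum_freqs[-1] / parts        # Step 2: (k / parts) * n
    # Steps 3 and 4: first class whose cumulative frequency is >= target
    idx = next(i for i, cf in enumerate(cum_freqs) if cf >= target)
    F = cum_freqs[idx - 1] if idx > 0 else 0  # cumulative frequency before that class
    # Step 5: interpolation formula
    return lower_limits[idx] + width * (target - F) / freqs[idx]

lower_limits = [3, 7, 11, 15, 19]
freqs = [14, 22, 11, 6, 3]

q1, q2, q3 = (grouped_partition_value(lower_limits, 4, freqs, k, 4) for k in (1, 2, 3))
print(round(q1, 2), round(q2, 2), round(q3, 2))  # 7.0 9.55 13.18
```

With n = 56, the targets are 14, 28 and 42. Because the first cumulative frequency is exactly 14, the first quartile falls on the upper boundary of the class 3‐<7.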
2. Percentiles
The percentiles of a data set are values that divide the data set into one hundred equal parts,
each containing 1% of the values, when the values are arranged in ascending order.
There are ninety‐nine percentiles, called the first percentile, second percentile, ..., and
ninety‐ninth percentile.
The 50th percentile is the median of the data set,
the 25th percentile is the 1st quartile,
and the 75th percentile is the 3rd quartile.
Calculating percentiles from frequency distributions
To calculate the kth percentile from a grouped frequency distribution, we use the following
procedure:
Step 1: Construct the less-than cumulative frequency distribution.
Step 2: Calculate $n_k = \frac{k}{100}\,n$
For $p_1$, the value of k=1
For $p_2$, the value of k=2
For $p_3$, the value of k=3
...
For $p_{99}$, the value of k=99
Step 3: Find the first cumulative frequency equal to or just greater than the value of $\frac{k}{100}\,n$
calculated in step 2.
Step 4: The kth percentile class is the class whose cumulative frequency is the one found in step 3.
Step 5: The kth percentile is calculated using the following interpolation formula:
$$p_k = l_k + \frac{h}{f_k}\left(\frac{k}{100}\,n - F\right)$$
Where
$p_k$ is the kth percentile for the data set;
$l_k$ is the lower class limit of the kth percentile class;
$h$ is the width of the kth percentile class;
$f_k$ is the frequency of the kth percentile class;
$F$ is the cumulative frequency of the class immediately before the kth percentile class;
$n$ is the total frequency.
Example:
The human resource department of a company analyzed the level of absenteeism of 56
employees who reported ill over the past year.
Absenteeism level (days absent)   Number of employees (f_i)
3‐<7                              14
7‐<11                             22
11‐<15                            11
15‐<19                            6
19‐<23                            3
Determine the 65th percentile, the 70th percentile, and the 90th percentile
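The percentile procedure differs from the quartile procedure only in the divisor (100 instead of 4). The sketch below is an illustrative Python calculation, not the original worked solution; the function name `grouped_percentile` is an assumption, and equal class widths are assumed.

```python
from itertools import accumulate

def grouped_percentile(lower_limits, width, freqs, k):
    """kth percentile of a grouped frequency distribution with equal class widths."""
    cum_freqs = list(accumulate(freqs))       # Step 1
    target = k * cum_freqs[-1] / 100          # Step 2: (k / 100) * n
    # Steps 3 and 4: first class whose cumulative frequency is >= target
    idx = next(i for i, cf in enumerate(cum_freqs) if cf >= target)
    F = cum_freqs[idx - 1] if idx > 0 else 0  # cumulative frequency before that class
    # Step 5: interpolation formula
    return lower_limits[idx] + width * (target - F) / freqs[idx]

lower_limits = [3, 7, 11, 15, 19]
freqs = [14, 22, 11, 6, 3]

p65, p70, p90 = (grouped_percentile(lower_limits, 4, freqs, k) for k in (65, 70, 90))
print(round(p65, 2), round(p70, 2), round(p90, 2))  # 11.15 12.16 17.27
```

With n = 56, the targets are 36.4, 39.2 and 50.4. The first two fall in the class 11‐<15 and the third falls in the class 15‐<19.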
Two or more data sets may have the same mean and yet be very different in the way they spread
out. To describe this difference quantitatively, we use measures of dispersion. A measure of
dispersion indicates the amount of variation in a data set. Some of the commonly used measures of
spread are the range, inter‐quartile range, semi‐interquartile range (quartile deviation), variance,
standard deviation, and coefficient of variation.
1.3.2 Range
The range is the difference between the highest and lowest values in a data set.
It measures the distance across the entire data set.
Example:
18 26 17 10 7 27 24 17 17 23 29 28
18 10 23 16 9 12 26 5 12 23 22 24
16 5
$x_{\max} = 29$
$x_{\min} = 5$
$\text{Range} = 29 - 5 = 24$
1.3.3 Inter‐quartile range (IQR)
Definition
Quartiles of a data set are values (partition values) that divide the data set into four equal parts when
data are arranged in ascending order.
There are three quartiles: the lower quartile, the middle quartile (second quartile), and the upper quartile.
The inter‐quartile range is the difference between the upper and lower quartiles:
$$IQR = Q_3 - Q_1$$
1.3.4 Semi interquartile range or quartile deviation
$$SIQR\ (Q.D.) = \frac{Q_3 - Q_1}{2}$$
Example:
Let
$Q_1 = 14.5$ days
$Q_2 = 18.89$ days
$Q_3 = 23.93$ days
$$SIQR\ (Q.D.) = \frac{23.93 - 14.5}{2} = 4.715 \text{ days}$$
Interpretation: About 50% of all observations are expected to lie within 4.715 days on either side
of the median of 18.89 days. That is, about 25% of observations are expected to lie within 4.715
days below the median and about 25% within 4.715 days above the median.
Exercise
1. A company employs 12 persons in managerial positions. Their seniority (in years of service)
and sex are listed below:
Sex               F   M   F   M   F   M   M   F   F   F   F   M
Seniority (yrs)   8   15  6   2   9   21  9   3   4   7   2   10
Find the seniority mean, the seniority median and the seniority mode for the above data set.
2. The daily percentage change (to the nearest percent) of an equity traded on the JSE was
monitored for 100 days by an investment analyst. These daily percentage changes were
summarized into the frequency distribution below.
Daily percentage change of an equity (%)   Number of days
2                                          15
3                                          30
4                                          25
5                                          19
6                                          8
7                                          2
8                                          1
Find the mean daily percentage change, the median daily percentage change, and mode
daily percentage change.
3. Mary is employed as an “Affirmative Action Officer” by Ortex electronics. Mary reports
directly to the plant manager, and is responsible for monitoring and making
recommendations on Ortex hiring procedures, working conditions and compensation plans.
As part of her ongoing monitoring of compensation plans, Mary collected data on hourly
earnings on all non‐salaried employees at Ortex. To aid in interpreting the data, Mary
organized the data into the following frequency distribution:
Hourly earnings (Rands)   Number of women   Number of men
4.70‐4.90                 6                 5
4.90‐5.10                 31                16
5.10‐5.30                 15                25
5.30‐5.50                 29                30
5.50‐5.70                 19                24
Calculate the mean, median and the mode of the hourly earnings for the men
4. The annual earnings of a company’s salesmen at its Johannesburg and Cape Town offices are
as follows:
Earnings (R1000s)   Number of salesmen,   Number of salesmen,
                    Johannesburg          Cape Town
6‐<8                3                     2
8‐<10               7                     3
10‐<12              13                    6
12‐<14              17                    8
14‐<16              4                     3
16‐<20              4                     2
20‐<25              2                     6
(a) Compare the salesmen’s earnings in the Johannesburg and Cape Town offices by finding the
means, medians and quartile deviations.
(b) Find the standard deviation
Variance
The most useful and reliable measures of dispersion are those that:
(a) take every observation into account, and
(b) are based on an average deviation from a central value.
Because the variance satisfies both of these properties, it has become the most
commonly used measure of dispersion. It is extensively used in statistical analysis.
The variance is calculated as the average of the squared deviations from the mean, using the divisor $n - 1$ for sample data.
For ungrouped data, the variance is calculated using the following formula:
$$S_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$$
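Computational formulae for grouped data
$$S_x^2 = \frac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{k} f_i x_i^2 - n\bar{x}^2}{n - 1}$$
where
$f_i$ is the frequency of the $i$th interval;
$x_i$ is the midpoint of the $i$th interval;
$k$ is the number of intervals and $n = \sum_{i=1}^{k} f_i$ is the total number of observations.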
The variance measures the average squared deviation about the arithmetic mean. It is
expressed in squared units. Consequently, its meaning in a practical sense is obscure.
Because of this interpretation problem, a measure that uses the original units is derived from the
variance: the standard deviation.
Standard deviation
$$S_x = \sqrt{S_x^2}$$
The standard deviation describes how observations are spread about the mean.
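The variance and standard deviation formulae can be checked numerically. The Python sketch below applies them to two data sets from earlier in these notes: the ten dividend values (ungrouped) and the checkout times (grouped, using class midpoints and assuming the first class is 2.00‐<2.50). The function names are illustrative; `statistics.variance` from the Python standard library is used only as a cross-check of the ungrouped result.

```python
import math
import statistics

def sample_variance(xs):
    """Definitional form: sum of squared deviations divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Computational form: (sum of x^2 - n * mean^2) / (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return (sum(x * x for x in xs) - n * mean ** 2) / (n - 1)

def grouped_sample_variance(midpoints, freqs):
    """Grouped form: sum of f_i * (x_i - mean)^2 divided by n - 1, with n = sum of f_i."""
    n = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, midpoints)) / n
    return sum(f * (x - mean) ** 2 for f, x in zip(freqs, midpoints)) / (n - 1)

# Ungrouped data: percentage dividends over 10 years
dividends = [5, 6, 14, 20, 30, 10, 15, 20, 20, 30]
var_div = sample_variance(dividends)
sd_div = math.sqrt(var_div)
print(round(var_div, 2), round(sd_div, 2))  # 76.89 8.77

# Grouped data: checkout times, class midpoints and frequencies
midpoints = [2.25, 2.75, 3.25, 3.75, 4.25]
freqs = [3, 8, 23, 10, 6]
var_time = grouped_sample_variance(midpoints, freqs)
sd_time = math.sqrt(var_time)
print(round(var_time, 3), round(sd_time, 3))  # 0.269 0.519
```

For the dividends, the squared deviations sum to 692, so the variance is 692/9 and both forms of the formula agree. The standard deviation of about 8.77 is in percentage points, the same units as the data.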
1.3.5 Coefficient of variation
Sometimes, it is necessary to compare the samples of data from different random variables to
establish which sample data shows greater variability. A direct comparison of their respective
standard deviations would be misleading as the random variables may be measured in different
units. Thus, a meaningful comparison should be based on a measure of variability that does not
depend on the units. This is achieved by producing a measure of relative variability (i.e. relative
to the mean) expressed in percentage terms, called the coefficient of variation:
$$CV = \frac{S_x}{\bar{x}} \times 100\%$$
This ratio describes how large the measure of dispersion is relative to the
mean of the observations.
A coefficient of variation close to zero indicates low variability and tight clustering of
observations about mean. A large coefficient of variation value indicates that observations are
more spread out about their mean.
Example
                     Turnover/month   Employee age
Mean                 R54588           38.2 yrs
Standard deviation   R8444            7.9 yrs
CV                   15.47%           20.68%
The age characteristic shows greater variability than turnover/month.
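The two CV values in the table can be reproduced from the means and standard deviations. This is a minimal Python sketch; the function name `coefficient_of_variation` is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return std_dev / mean * 100

cv_turnover = coefficient_of_variation(8444, 54588)  # Rands
cv_age = coefficient_of_variation(7.9, 38.2)         # years

print(round(cv_turnover, 2), round(cv_age, 2))  # 15.47 20.68
```

Because the units cancel in the ratio, the two percentages can be compared directly even though one variable is measured in Rands and the other in years.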
1.4. Measures of skewness
Skewness is a measure used to describe the shape of a distribution.
It refers to the degree of departure from symmetry.
Three common shapes are generally observed:
(a) Symmetrical distribution: all three commonly used measures of central location
(i.e. mean, median and mode) lie at the same location.
(b) Skewed to the right (positively skewed) distribution: it has a long tail to the right, caused by
a few relatively large values in the data set.
(c) Skewed to the left (negatively skewed) distribution: it has a long tail to the left, caused by
a few relatively small values in the data set.
If the distribution is skewed, the median or mode is more representative than
the mean. The median is usually the best measure of central location in this case, as it is
not pulled by the extreme values.
Two descriptive statistics can be used to measure the degree of skewness,
namely Pearson’s coefficient of skewness and Bowley’s coefficient of skewness.
1. Pearson’s coefficient of skewness
Pearson’s coefficient of skewness measures the degree of departure from symmetry
based on the difference between the mean and the median, or between mean and the
mode. This measure is valid only for quantitative random variables.
$$Sk_p = \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}}$$
Or
$$Sk_p = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}}$$
Interpretation:
If $Sk_p = 0$, the distribution is symmetrical.
If $Sk_p > 0$, then the distribution is skewed to the right (positive skewness).
If $Sk_p < 0$, then the distribution is skewed to the left (negative skewness).
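Both versions of Pearson's coefficient and the interpretation rule can be sketched in Python. The inputs below are the rounded values for the checkout-time example: mean 3.33, median 3.30 and mode 3.27 minutes. The standard deviation of about 0.519 is an assumption, obtained by applying the grouped variance formula to that example rather than taken from the notes. All function names are illustrative.

```python
def pearson_skewness_median(mean, median, std_dev):
    """Pearson's coefficient based on the mean and the median."""
    return 3 * (mean - median) / std_dev

def pearson_skewness_mode(mean, mode, std_dev):
    """Pearson's coefficient based on the mean and the mode."""
    return (mean - mode) / std_dev

def describe_skewness(sk):
    """Interpretation rule: sign of the coefficient gives the direction of skew."""
    if sk > 0:
        return "skewed to the right"
    if sk < 0:
        return "skewed to the left"
    return "symmetrical"

# Rounded summary values for the checkout-time example (std_dev is an assumed input)
mean, median, mode, std_dev = 3.33, 3.30, 3.27, 0.519

sk_median = pearson_skewness_median(mean, median, std_dev)
sk_mode = pearson_skewness_mode(mean, mode, std_dev)

print(describe_skewness(sk_median))  # skewed to the right
print(describe_skewness(sk_mode))    # skewed to the right
```

Both coefficients are small and positive here, which suggests only a slight skew to the right. The two versions generally give different numerical values, so results should be compared using the same version.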
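2. Bowley’s coefficient of skewness
Bowley’s coefficient of skewness is based on the quartiles: it compares the distance from the
median to the upper quartile with the distance from the lower quartile to the median.
$$Sk_b = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1}$$
$$-1 \le Sk_b \le 1$$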
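As an illustration, Bowley's coefficient can be computed for the quartiles used in the quartile deviation example ($Q_1 = 14.5$, $Q_2 = 18.89$, $Q_3 = 23.93$ days). This is a minimal Python sketch; the function name `bowley_skewness` is illustrative.

```python
def bowley_skewness(q1, q2, q3):
    """Bowley's coefficient: ((Q3 - Q2) - (Q2 - Q1)) / (Q3 - Q1)."""
    return ((q3 - q2) - (q2 - q1)) / (q3 - q1)

sk_b = bowley_skewness(14.5, 18.89, 23.93)
print(round(sk_b, 3))  # 0.069
```

The upper quartile is slightly further from the median (5.04 days) than the lower quartile is (4.39 days), so the coefficient is small and positive, indicating a slight skew to the right.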