3 Data Description and Measures of Central Tendency
Summary Statistics:
E. D. Chikaka
(BSc Stats & Maths, MSc Biostatistics)
Outline of the lecture
• Data types
• Scales of measurement
• Graphical presentation of data
• Data description
– Measures of central tendency
– Measures of dispersion
– Measures of position?
Data presentation methods
1. Graphical methods
1. Pie chart
2. Bar Graphs
3. Histogram
4. Stem-and-leaf
5. Box Plots
2. Measures of central tendency
3. Measures of Variability
Guidelines in constructing a Pie Chart
[Figure: bar chart of Frequency (axis 0 to 70) by Price category: Less than $10, $11-25, $26-50, More than $50. Bar labels visible in the source: 20, 63, 55.]
Frequency Distributions/Histograms and Polygons
• Pick a suitable starting point less than or equal to the minimum value.
You will be able to cover: "the class width times the number of classes"
values. You need to cover one more value than the range. Follow this
rule and you'll be okay: The starting point plus the number of classes
times the class width must be greater than the maximum value. Your
starting point is the lower limit of the first class. Continue to add the
class width to this lower limit to get the rest of the lower limits.
• To find the upper limit of the first class, subtract one from the lower
limit of the second class. Then continue to add the class width to this
upper limit to find the rest of the upper limits.
• Find the boundaries by subtracting 0.5 units from the lower limits and
adding 0.5 units to the upper limits. The boundaries are also halfway
between the upper limit of one class and the lower limit of the next
class. Depending on what you're trying to accomplish, it may not be
necessary to find the boundaries.
• Tally the data.
• Find the frequencies.
• Find the cumulative frequencies. Depending on what you're trying to
accomplish, it may not be necessary to find the cumulative frequencies.
• If necessary, find the relative frequencies and/or relative cumulative
frequencies.
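The steps above can be sketched in Python. This is a minimal sketch, not a definitive implementation: it assumes integer-valued data, takes the minimum as the starting point, and uses a hypothetical helper named frequency_table. The data are the ten heights (in cm) used in the range example of this lecture, and the choice of five classes is an assumption made for illustration.

```python
import math

def frequency_table(data, num_classes):
    """Build class limits, boundaries and frequencies for integer data."""
    lo, hi = min(data), max(data)
    # Cover one more value than the range, so that
    # start + num_classes * width is greater than the maximum.
    width = math.ceil((hi - lo + 1) / num_classes)
    rows = []
    cumulative = 0
    for k in range(num_classes):
        lower = lo + k * width        # keep adding the class width
        upper = lower + width - 1     # one less than the next lower limit
        freq = sum(lower <= x <= upper for x in data)  # tally
        cumulative += freq
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lower - 0.5, upper + 0.5),
            "freq": freq,
            "cum_freq": cumulative,
            "rel_freq": freq / len(data),
        })
    return rows

heights = [70, 95, 100, 103, 105, 107, 110, 112, 115, 140]
table = frequency_table(heights, 5)
for row in table:
    print(row["limits"], row["boundaries"], row["freq"], row["cum_freq"])
```

With five classes the class width works out to 15, giving the classes 70-84, 85-99, 100-114, 115-129 and 130-144.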
Work to try at home
Survival times are shown for patients with severe chronic left-ventricular heart
failure.
a) Construct a frequency distribution.
b) Construct a histogram.
c) Construct a stem-and-leaf plot.
d) Which plot describes the data best? Why?
4 15 24 10
1 27 31 14
2 16 32 7
13 36 29 6
14 18 14 15
18 6 13 21
20 8 3 24
Table 1.
Distribution of mercury concentration in hair of 300 high school students
Concentration (micrograms Hg/g of hair)   Frequency
0-0.49                                        95
0.5-0.99                                      91
1.0-1.49                                      47
1.5-1.99                                      30
2.0-2.49                                      16
2.5-2.99                                       8
3.0-3.49                                       9
3.5-3.99                                       4
Figure 2: Histogram of mercury concentrations in hair of 300 students (x-axis: micrograms Hg/g of hair, classes 0-0.49 to 3.5-3.99; y-axis: frequency, 0 to 100)
Figure 3: Frequency polygon of mercury concentrations in hair of 300 students (x-axis: micrograms Hg/g of hair, classes 0-0.49 to 3.5-3.99; y-axis: frequency, 0 to 100)
Tables
                           Frequency   Percent   Valid Percent   Cumulative Percent
Valid     Less than $10         2         1.0         1.4               1.4
          $11-25               20        10.0        14.3              15.7
          $26-50               63        31.5        45.0              60.7
          More than $50        55        27.5        39.3             100.0
          Total               140        70.0       100.0
Missing   System Missing       60        30.0
          Total                60        30.0
Total                         200       100.0
Stem-and-leaf
• A clever, simple device for constructing a
histogram-like distribution.
• It allows us to use the information
contained in a frequency distribution to show:
– The range of scores
– Where the scores are concentrated
– The shape of the distribution
– Whether there are any values not represented
– Whether there are any extreme values or outliers
Guidelines for constructing stem-and-leaf plots
• Split each score or value into two sets of digits.
The first or leading set is the stem and the second
or trailing is the leaf
• List all possible stem digits from lowest to highest
• For each score in the mass of data write down the
leaf numbers on the line labelled by the
appropriate stem number.
• If the display looks too cramped and narrow, we
can stretch the display by using two lines per stem
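The guidelines above can be sketched in Python. This is a minimal sketch under stated assumptions: the values are non-negative integers, the last digit is the leaf and the leading digits are the stem, and stem_and_leaf is a hypothetical helper name. The data are the ten heights (in cm) from the range example of this lecture.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf display (leaf = last digit)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)    # split into leading and trailing digits
        leaves[stem].append(leaf)
    lines = []
    # List every possible stem from lowest to highest, including empty ones,
    # so that gaps and outliers stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(d) for d in leaves[stem])
        lines.append(f"{stem:>2} | {row}".rstrip())
    return lines

heights = [70, 95, 100, 103, 105, 107, 110, 112, 115, 140]
lines = stem_and_leaf(heights)
print("\n".join(lines))
```

The empty stems (8, 12 and 13) show at a glance which values are not represented, and the isolated stems 7 and 14 stand out as extreme values.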
Example
a) Measures of location
b) Measures of dispersion/variability/spread
[Figure: distributions of Population A and Population B (y-axis: No. of People)]
Quick definitions
– Mode
• the most frequently occurring score
– Median
• the mid-point of a set of ordered scores
– Mean
• the result of dividing the arithmetic sum of
scores by the number of scores
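Symbols and Formulae
$\Sigma$ (uppercase "Sigma"): "sum of"
$\mu$ (lowercase "mu"): population mean
$\sigma$ (lowercase "sigma"): population standard deviation
$\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \dots + x_n$
$\left(\sum x\right)^2 = (x_1 + x_2 + x_3 + \dots + x_n)^2$
$\sum x^2 = x_1^2 + x_2^2 + x_3^2 + \dots + x_n^2$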
Finding the Mode
• Annual salary (units of $10k), as recorded
– 4, 3, 3, 2, 3, 8, 4, 3, 7, 2
• Annual salary, sorted
– 2, 2, 3, 3, 3, 3, 4, 4, 7, 8
• The mode is 3 (it occurs four times)
In this case, n = 9 (an odd number); therefore, the median is the (9 + 1)/2 = 5th observation.
For this simple problem, you could compute the mean with pencil and paper by summing the
numbers in the salary column and dividing by “n” (10).
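As a quick check of the three definitions, the ten salary values from the mode example can be run through Python's standard statistics module. This is a small sketch; the variable names are illustrative only.

```python
from statistics import mean, median, mode

# Annual salaries in units of $10k, as listed in the mode example.
salaries = [4, 3, 3, 2, 3, 8, 4, 3, 7, 2]

ordered = sorted(salaries)
salary_mode = mode(salaries)      # most frequently occurring score
salary_median = median(salaries)  # n = 10 is even: average of 5th and 6th values
salary_mean = mean(salaries)      # sum of scores (39) divided by n (10)

print(ordered)
print(salary_mode, salary_median, salary_mean)
```

Because n = 10 is even here, the median is the average of the two middle values of the sorted list rather than a single observation.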
Method for Computing the Mean
[Figure: two distributions, Value of Factor K and Value of Factor J (y-axis: No. of People)]
The mean and the median
• Did you notice that the median was the same, 8
(the 5th value), for both data examples?
• Take a sample of 10 heights (70, 95, 100, 103, 105, 107, 110, 112,
115, 140 cm)
Lowest (minimum) value = 70 cm
Highest (maximum) value = 140 cm
Range is therefore 140 – 70 = 70 cm
Simple to understand but far from perfect. Why?
The range is derived from extreme values. It says nothing about the
values in between.
Not stable (as sample size increases the range can change dramatically).
It does not lend itself to further statistical analysis.
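The dependence of the range on the two extreme values can be seen with a short Python sketch using the ten heights above. Dropping the smallest and largest observation is done here only to illustrate the point; it is not a procedure taken from the lecture.

```python
heights = [70, 95, 100, 103, 105, 107, 110, 112, 115, 140]

# Range of the full sample: maximum minus minimum.
full_range = max(heights) - min(heights)

# Same sample without its two extreme values (illustration only).
inner = sorted(heights)[1:-1]
inner_range = max(inner) - min(inner)

print(full_range, inner_range)
```

Removing just two of the ten observations shrinks the range from 70 cm to 20 cm, even though the eight values in between are unchanged.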
Figure 8. Two distributions with the same range but different mean and variability (y-axis: No. of People)
• Percentiles: Those values in a series of observations, arranged
in ascending order of magnitude, which divide the distribution
into 100 equal parts (thus the median is the 50th percentile).
The median is the middle value (if n is odd) or the average of the two middle
values (if n is even). It is a measure of the "center" of the data.
• Interquartile Range
– the difference between the score representing the 75th percentile and the score
representing the 25th percentile
– Arrange: 24, 25, 29, 29, 30, 31 (n = 6)
» Q1 = value at position (n+1)/4 = 1.75
» Q1 = 24 + 0.75 × (25 – 24) = 24.75
» Q3 = value at position 3(n+1)/4 = 5.25
» Q3 = 30 + 0.25 × (31 – 30) = 30.25
» Q3 – Q1 = 30.25 – 24.75 = 5.5
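The quartile calculation can be sketched in Python as follows. This is a minimal sketch assuming the (n+1) position rule used in the example, with linear interpolation between neighbouring ordered values; value_at_position is a hypothetical helper name.

```python
def value_at_position(ordered, pos):
    """Value at a 1-based, possibly fractional position in sorted data."""
    whole = int(pos)          # whole part of the position
    frac = pos - whole        # fractional part used for interpolation
    if whole < 1:
        return ordered[0]
    if whole >= len(ordered):
        return ordered[-1]
    below, above = ordered[whole - 1], ordered[whole]
    return below + frac * (above - below)

data = sorted([24, 25, 29, 29, 30, 31])
n = len(data)

q1 = value_at_position(data, (n + 1) / 4)        # position 1.75
q3 = value_at_position(data, 3 * (n + 1) / 4)    # position 5.25
iqr = q3 - q1

print(q1, q3, iqr)
```

Other software may use different quartile conventions, so results for small samples can differ slightly from this rule.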
Exercise
– 0, 3, 0, 7, 2, 1, 0, 1, 5, 2, 4, 2, 8, 1, 3, 0, 1, 2, 1
So how do we get a single mathematical measure to summarise the
variability of an observed set of values?
Why divide by n – 1?
This is an adjustment for the fact that the sample mean is only an estimate of the true
population mean. Dividing by n – 1 rather than n makes the variance slightly bigger.
Measures of Data Variability
• Standard Deviation
– The standard deviation is the square root of the average
squared deviation from the mean
$SD = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$
$SD = \sqrt{\dfrac{n \sum x_i^2 - \left(\sum x_i\right)^2}{n(n - 1)}}$
Calculating Standard Deviation
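A short Python sketch of the calculation, using the six values from the interquartile range example (24, 25, 29, 29, 30, 31) as assumed input. It computes the sample standard deviation twice, once from the squared deviations and once from the computational shortcut formula, to show that both give the same answer.

```python
import math

data = [24, 25, 29, 29, 30, 31]
n = len(data)
mean = sum(data) / n                      # 168 / 6 = 28

# Deviations from the mean always sum to zero.
deviations = [x - mean for x in data]

# Definition: square root of the summed squared deviations over n - 1.
sd_definition = math.sqrt(sum(d ** 2 for d in deviations) / (n - 1))

# Shortcut: uses only the sum of x and the sum of x squared.
sum_x = sum(data)
sum_x2 = sum(x * x for x in data)
sd_shortcut = math.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))

print(sum(deviations), sd_definition, sd_shortcut)
```

Here the squared deviations sum to 40, so the variance is 40/5 = 8 and the standard deviation is about 2.83.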
Choosing the Measures of Central Location and Dispersion
The Coefficient of Variation
$\text{Sample CV} = \dfrac{s}{\bar{x}} \times 100\%$
• Solution:
Since the CV is larger for the revenues, there is more variability in the
recorded revenues than in the number of tickets issued.
Grouped data
Mean
$\bar{x} = \dfrac{\sum x f}{\sum f}$
• Determine the mean and median for the data
presented below
                           Frequency   Percent   Valid Percent   Cumulative Percent
Valid     Less than $10         2         1.0         1.4               1.4
          $11-25               20        10.0        14.3              15.7
          $26-50               63        31.5        45.0              60.7
          More than $50        55        27.5        39.3             100.0
          Total               140        70.0       100.0
Missing   System Missing       60        30.0
          Total                60        30.0
Total                         200       100.0
1. The median for grouped data calculation requires you to use the frequency distribution
output and the class intervals that are the question's response categories.
2. The formula for the median is:
$\text{Med} = B_l + \left(\dfrac{n/2 - cf_p}{f_m}\right) i$
where:
Bl = lower boundary of class containing median
n = sample size
cfp = cumulative frequency of classes preceding class containing the median
fm = number of observations in class containing the median
i = width of the interval containing the median
• Step 1: set up the frequency distribution table
• Step 2: identify the median class, i.e. the class interval in which the
cumulative percent first reaches 50%
• Step 3: use the formula to find the median
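In our example,
The median class interval is the 26-50 class interval.
Bl = 26
n = 140
cfp = 2 + 20 = 22 (cumulative frequency of the two classes preceding the median class)
fm = 63
i = 24
$\text{Med} = B_l + \left(\dfrac{n/2 - cf_p}{f_m}\right) i = 26 + \left(\dfrac{140/2 - 22}{63}\right) \times 24 = 44.29$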
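The grouped median formula can be sketched as a small Python function. This is a sketch under stated assumptions: grouped_median is a hypothetical helper name, the lower boundary (26) and interval width (24) are the values used in this example, and cfp is taken as the cumulative frequency of the classes before the median class (2 + 20 = 22 observations), in line with the definition of cfp, rather than the cumulative percent of 15.7 shown in the table.

```python
def grouped_median(lower_boundary, n, cum_freq_before, freq_median_class, width):
    """Median for grouped data: Bl + ((n/2 - cfp) / fm) * i."""
    return lower_boundary + (n / 2 - cum_freq_before) / freq_median_class * width

# Price table: 140 valid responses, median class is $26-50.
med = grouped_median(
    lower_boundary=26,
    n=140,
    cum_freq_before=2 + 20,   # frequencies of "Less than $10" and "$11-25"
    freq_median_class=63,
    width=24,
)
print(round(med, 2))
```

Under these assumptions the median price is about $44.29.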
Variance
$s^2 = \dfrac{1}{n - 1}\left[\sum f_i x_i^2 - \dfrac{\left(\sum f_i x_i\right)^2}{n}\right]$
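The grouped mean and variance formulas can be sketched in Python using the mercury concentrations of Table 1. This is a minimal sketch, assuming each class is represented by its midpoint (the average of its lower and upper limits); the price table is not used because its first and last classes are open-ended and have no midpoint.

```python
# (lower limit, upper limit, frequency) for each class of Table 1.
classes = [
    (0.0, 0.49, 95), (0.5, 0.99, 91), (1.0, 1.49, 47), (1.5, 1.99, 30),
    (2.0, 2.49, 16), (2.5, 2.99, 8), (3.0, 3.49, 9), (3.5, 3.99, 4),
]

midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]
freqs = [f for _, _, f in classes]

n = sum(freqs)                                             # 300 students
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))      # sum of f * x
sum_fx2 = sum(f * x * x for f, x in zip(freqs, midpoints)) # sum of f * x^2

grouped_mean = sum_fx / n
grouped_var = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)

# Cross-check against the definition based on squared deviations.
direct_var = sum(
    f * (x - grouped_mean) ** 2 for f, x in zip(freqs, midpoints)
) / (n - 1)

print(n, grouped_mean, grouped_var)
```

The grouped mean comes out at about 1.01 micrograms Hg/g of hair, and the shortcut variance agrees with the deviation-based calculation.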