3 Data Description and Measures of Central Tendency


Data Presentation

Summary Statistics:

E. D. Chikaka
(BSc Stats & Maths, MSc Biostatistics)
Outline of the lecture

• Data types
• Measures of measurements
• Graphical presentation of data
• Data description
– Measures of central tendency
– Measures of dispersion
– Measures of position?
Data presentation methods
1. Graphical methods
1. Pie chart
2. Bar Graphs
3. Histogram
4. Stem-and-leaf
5. Box Plots
2. Measures of central tendency
3. Measures of Variability
Guidelines in constructing a Pie Chart

• Choose a small number of categories for a variable, preferably 5 or 6. Too many categories make the pie chart difficult to interpret.
• Whenever possible, construct the pie chart so that percentages are in either ascending or descending order.
Guidelines in constructing a bar graph

• Label frequencies along one axis and the categories of the variable along the other axis.
• Construct a rectangle at each category of the variable with a height equal to the frequency in that category.
• Leave a space between categories to connote distinct, separate categories and to clarify the presentation.
Bar chart

[Figure: bar chart of Price. Vertical axis: Frequency (0 to 70). Categories: Less than $10, $11-25, $26-50, More than $50. Bar labels visible: 20, 63 and 55.]
Frequency Distributions/Histograms and Polygons

• One of the simplest ways to examine a distribution is to list, for each value of a variable, the number of times that value occurs in the study population.
• This is the frequency distribution and can be displayed as a table, as a histogram or as a frequency polygon.
• A frequency distribution shows the values that a variable can take, and the number of people or observations with each value.
Guidelines for constructing class intervals

• There should be between 5 and 20 classes.
• The class width should be an odd number. This will guarantee that the class midpoints are integers instead of decimals.
• The classes must be mutually exclusive. This means that no data value can fall into two different classes.
• The classes must be all inclusive or exhaustive. This means that all data values must be included.
• The classes must be continuous. There are no gaps in a frequency distribution. Classes that have no values in them must be included (unless it is the first or last class, which is dropped).
• The classes must be equal in width. The exception here is the first or last class. It is possible to have a "below ..." or "... and above" class. This is often used with ages.
Guidelines for constructing class intervals

• Find the largest and smallest values.
• Compute the Range = Maximum - Minimum.
• Select the number of classes desired. This is usually between 5 and 20.
• Find the class width by dividing the range by the number of classes and rounding up. There is one thing to be careful of here: you must round up, not off.
Guidelines for constructing class intervals

• Pick a suitable starting point less than or equal to the minimum value.
You will be able to cover: "the class width times the number of classes"
values. You need to cover one more value than the range. Follow this
rule and you'll be okay: The starting point plus the number of classes
times the class width must be greater than the maximum value. Your
starting point is the lower limit of the first class. Continue to add the
class width to this lower limit to get the rest of the lower limits.
• To find the upper limit of the first class, subtract one from the lower
limit of the second class. Then continue to add the class width to this
upper limit to find the rest of the upper limits.
• Find the boundaries by subtracting 0.5 units from the lower limits and
adding 0.5 units to the upper limits. The boundaries are also half-way
between the upper limit of one class and the lower limit of the next
class. Depending on what you're trying to accomplish, it may not be
necessary to find the boundaries.
• Tally the data.
• Find the frequencies.
• Find the cumulative frequencies. Depending on what you're trying to
accomplish, it may not be necessary to find the cumulative frequencies.
• If necessary, find the relative frequencies and/or relative cumulative
frequencies.
Work to try at home……
Survival times are shown for patients with severe chronic left-ventricular heart
failure.
a) Construct a frequency distribution
b) construct a histogram
c) construct a stem and leaf plot.
d) which plot describes the data best? Why?

4 15 24 10
1 27 31 14
2 16 32 7
13 36 29 6
14 18 14 15
18 6 13 21
20 8 3 24
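The class-interval guidelines given earlier can be tried on these survival times with a short Python sketch. The function names `class_limits` and `frequency_table` and the choice of six classes are illustrative assumptions, not part of the lecture, and the sketch assumes whole-number data.

```python
import math

# Survival times from the exercise above.
survival = [4, 15, 24, 10, 1, 27, 31, 14, 2, 16, 32, 7, 13, 36,
            29, 6, 14, 18, 14, 15, 18, 6, 13, 21, 20, 8, 3, 24]

def class_limits(values, num_classes):
    """Return (lower, upper) limits for equal-width classes of integer data."""
    smallest, largest = min(values), max(values)
    width = math.ceil((largest - smallest) / num_classes)  # round up, not off
    # Rule from the guidelines: start + classes * width must exceed the maximum.
    if smallest + num_classes * width <= largest:
        width += 1
    lowers = [smallest + k * width for k in range(num_classes)]
    return [(lower, lower + width - 1) for lower in lowers]

def frequency_table(values, limits):
    """Return rows of (lower, upper, frequency, cumulative, relative)."""
    rows, cumulative = [], 0
    for lower, upper in limits:
        freq = sum(lower <= v <= upper for v in values)  # tally the class
        cumulative += freq
        rows.append((lower, upper, freq, cumulative, freq / len(values)))
    return rows

limits = class_limits(survival, 6)
table = frequency_table(survival, limits)
for lower, upper, freq, cum, rel in table:
    print(f"{lower:>2}-{upper:<2}  f={freq:>2}  cum f={cum:>2}  rel f={rel:.2f}")
```

A different number of classes gives a different, equally valid table; six is used here only because it keeps the class width small for 28 observations.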
Table 1.
Distribution of mercury concentration in hair of 300 high school students

Mercury concentration (micrograms/g)    No. of children
0-0.49                                  95
0.5-0.99                                91
1.0-1.49                                47
1.5-1.99                                30
2.0-2.49                                16
2.5-2.99                                 8
3.0-3.49                                 9
3.5-3.99                                 4
Figure 2:
Histogram of mercury concentrations in hair of 300 students
[Figure: horizontal axis: micrograms Hg/g of hair, classes 0-0.49 to 3.5-3.99; vertical axis scale: 0 to 100.]
Figure 3:
Frequency polygon of mercury concentrations in hair of 300 students
[Figure: horizontal axis: micrograms Hg/g of hair, classes 0-0.49 to 3.5-3.99; vertical axis scale: 0 to 100.]
Tables

• Tables (e.g. frequency distributions) are a convenient way to present specific information about the distribution of values of a variable, and graphs (e.g. histograms or frequency polygons) can provide a general picture of the pattern of observations. It is often useful to provide, in addition, a numerical summary of the important characteristics of the distribution of a variable.
PRICE

                             Frequency   Percent   Valid Percent   Cumulative Percent
Valid     Less than $10          2          1.0         1.4               1.4
          $11-25                20         10.0        14.3              15.7
          $26-50                63         31.5        45.0              60.7
          More than $50         55         27.5        39.3             100.0
          Total                140         70.0       100.0
Missing   System Missing        60         30.0
          Total                 60         30.0
Total                          200        100.0
Stem-and-leaf
• Is a clever, simple device for constructing a histogram-like distribution.
• It allows us to use the information contained in a frequency distribution to show:
– The range of scores
– Where the scores are concentrated
– The shape of the distribution
– Whether there are any values not represented
– Whether there are any extreme values or outliers
Guidelines for constructing Stem-and-leaf
plots
• Split each score or value into two sets of digits.
The first or leading set is the stem and the second
or trailing is the leaf
• List all possible stem digits from lowest to highest
• For each score in the mass of data write down the
leaf numbers on the line labelled by the
appropriate stem number.
• If the display looks too cramped and narrow, we
can stretch the display by using two lines per stem
Example

• Construct a stem-and-leaf diagram for the following data.

2.9 3.0 4.4
0.8 2.7 1.6
3.5 3.6 1.2
1.9 3.8 2.2
2.6 3.9 1.5
2.8 4.4 0.9
2.5 4.1 2.3
4.5 3.5 2.5

Stem | Leaf   (stem = units digit, leaf = tenths digit)
0 | 89
1 | 2569
2 | 23556789
3 | 055689
4 | 1445
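The construction steps can be sketched in Python as a check on the display. The function name `stem_and_leaf` is an illustrative assumption, and the sketch assumes one-decimal data, so the units digit is the stem and the tenths digit is the leaf. Note that the data contain four values with stem 1 (1.2, 1.5, 1.6 and 1.9).

```python
from collections import defaultdict

data = [2.9, 3.0, 4.4, 0.8, 2.7, 1.6, 3.5, 3.6, 1.2, 1.9, 3.8, 2.2,
        2.6, 3.9, 1.5, 2.8, 4.4, 0.9, 2.5, 4.1, 2.3, 4.5, 3.5, 2.5]

def stem_and_leaf(values):
    """Return display lines such as '1 | 2569' for one-decimal data."""
    leaves = defaultdict(list)
    for value in values:
        tenths = round(value * 10)        # 2.9 -> 29
        stem, leaf = divmod(tenths, 10)   # 29 -> stem 2, leaf 9
        leaves[stem].append(leaf)
    lowest, highest = min(leaves), max(leaves)
    lines = []
    # List every possible stem, even one with no leaves.
    for stem in range(lowest, highest + 1):
        row = "".join(str(leaf) for leaf in sorted(leaves[stem]))
        lines.append(f"{stem} | {row}")
    return lines

display = stem_and_leaf(data)
print("\n".join(display))
```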
Descriptive Statistics

This session will address the following topics:


• Calculation and interpretation of measures of central tendency
– Arithmetic mean
– Median
– Mode
– Geometric mean
• The appropriate application of measure of Central Tendency
• Calculation and interpretation of Measures of variability
– Range
– Inter-quartile range
– Standard deviation
– Standard error for the mean
• Application of appropriate measures of dispersion
• For continuous variables we have two major mathematical descriptions at our disposal, and we need both to completely describe the shape of the distribution of observations:

a) Measures of location
b) Measures of dispersion/variability/spread

• These are summary statistics. In addition to providing a description of data in mathematical terms, they are also necessary for precise and efficient comparisons of different sets of data.

• Consider Figures 5 and 6. There may be differences in the location of the distributions and differences in the shape of the distributions (i.e. their variability).
Figure 5. Distribution of the value of factor X in two populations A and B
[Figure: curves for Population A and Population B; horizontal axis: Factor X; vertical axis: No. of people. Same location, different variability.]
Figure 6. Distribution of the value of factor Y in two populations A and B
[Figure: curves for Population A and Population B; horizontal axis: Factor Y; vertical axis: No. of people. Same variability, different locations.]
Measures of Central Tendency

• Three measures are frequently used to provide a “typical value” for a given continuous variable in a specific population.
Measures of Central Tendency

Quick definitions
– Mode
• the most frequently occurring score
– Median
• the mid-point of a set of ordered scores
– Mean
• the result of dividing the arithmetic sum of
scores by the number of scores
Symbols and Formulae
Σ (uppercase “sigma”): “sum of”
μ (lowercase “mu”): population mean
σ (lowercase “sigma”): population standard deviation

\[ \sum_{i=1}^{n} x_i = \sum x = x_1 + x_2 + x_3 + \dots + x_n \]
\[ \left( \sum x \right)^2 = (x_1 + x_2 + x_3 + \dots + x_n)^2 \]
\[ \sum x^2 = x_1^2 + x_2^2 + x_3^2 + \dots + x_n^2 \]
Finding the Mode

• Annual salary (units of $10k)
– 4, 3, 3, 2, 3, 8, 4, 3, 7, 2

• Incubation period for 6 Hepatitis affected persons
– 29, 31, 24, 29, 30, 25
Calculating the Mode

To compute the mode:

• Arrange the data in sequence from low to high
• Count the number of times each value appears
• The most frequently appearing value is the mode
Finding the Mode

• Annual salary
– 2, 2, 3, 3, 3, 3, 4, 4, 7, 8
• The mode is 3

• Incubation period for 6 Hepatitis affected persons
– 24, 25, 29, 29, 30, 31
• The mode is 29
• The Mode of a distribution is the value that is observed most frequently in a given data set (rarely used).

- There may be no mode - when?
- There may be more than one mode - when?
- Can be misinterpreted (is a distribution skewed or bimodal?).
- Not very amenable to statistical tests.
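The counting procedure for the mode can be sketched in Python. The function name `modes` is an illustrative assumption, as is the convention of reporting "no mode" when every distinct value occurs equally often. Returning a list lets the sketch report more than one mode.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values; an empty list means no mode."""
    counts = Counter(values)              # value -> number of appearances
    highest = max(counts.values())
    # Assumed convention: if several distinct values all occur equally
    # often, no value stands out, so report no mode.
    if len(counts) > 1 and len(set(counts.values())) == 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)

salaries = [4, 3, 3, 2, 3, 8, 4, 3, 7, 2]      # units of $10k
incubation = [29, 31, 24, 29, 30, 25]

print(modes(salaries))        # one mode
print(modes(incubation))      # one mode
print(modes([1, 2, 3]))       # every value appears once
print(modes([1, 1, 2, 2, 3])) # two values tie for most frequent
```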
Median

• The Median describes literally the middle of the data. It is defined as the value above or below which half (50%) of the observations fall.
Finding the Median
Exercise
Person Sex Salary ($)
1 F 4000
2 F 3000
3 M 3000
4 F 2000
5 F 3000
6 M 8000
7 M 4000
8 M 3000
9 F 7000
10 F 2000

Which salary figure is the median?


Computing the Median

– The number of observations or scores is referred to as "n".

• Arrange the scores in order from smallest to largest (ascending order)
• Count the number of scores (determine n)
• If n is an odd number, then median = the (n+1)/2 th observation

For example, consider the observations
8, 25, 7, 5, 8, 3, 10, 12, 9
Arranged in order, the observations are
3, 5, 7, 8, 8, 9, 10, 12, 25

In this case, n = 9 (an odd number); therefore, the median is the (9+1)/2 = 5th observation, which is 8.

• If n is an even number, then median = the average of the (n/2)th and (n/2 + 1)th observations
Computing the Median
(even number of observations)

– For another example, consider the observations
• 11, 7, 10, 9, 15, 13

• Arranged in order, the observations are
• 7, 9, 10, 11, 13, 15

• In this case, n = 6 (an even number); therefore, the median is:
• the average of the (n/2)th and (n/2 + 1)th observations
• the average of the 3rd and 4th observations
= (10+11)/2
= 10.5
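The odd and even rules can be sketched in a few lines of Python. The function name `median` is an illustrative choice; Python's standard library also offers `statistics.median`.

```python
def median(values):
    """Middle value of the ordered data, using the odd/even rules above."""
    ordered = sorted(values)                 # ascending order
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the (n+1)/2 th observation (lists are 0-based, so subtract 1).
        return ordered[(n + 1) // 2 - 1]
    # Even n: average of the (n/2)th and (n/2 + 1)th observations.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

odd_example = [8, 25, 7, 5, 8, 3, 10, 12, 9]
even_example = [11, 7, 10, 9, 15, 13]

print(median(odd_example))
print(median(even_example))
```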
Median

• The advantage of this measure is that it is unaffected by extreme values!
• The disadvantage is that it is selected by its rank and does not contain information on the other values in the distribution.
• It is also less amenable than the mean to statistical tests.
Mean (arithmetic average)

Most commonly used measure of location. It is calculated by adding all the observed values and dividing by the total sample size.
Each observation is denoted x
The total number of observations is denoted n
The summation process is denoted by sigma, Σ
The mean itself is expressed as $\bar{x}$:

\[ \bar{x} = \frac{\sum x}{n} \]
Computing the Mean
Exercise
Person Sex Salary ($)
1 F 4000
2 F 3000
3 M 3000
4 F 2000
5 F 3000
6 M 8000
7 M 4000
8 M 3000
9 F 7000
10 F 2000

For this simple problem, you could compute the mean with pencil and paper by summing the
numbers in the salary column and dividing by “n” (10).
Method for Computing the Mean

To compute the mean:

– Count the number of scores (determine “n”)
– Determine the sum of the scores by adding them
– Divide the sum by “n”

• For example, consider the observations
– 8, 25, 7, 5, 8, 3, 10, 12, 9
– In this case, n = 9 and the sum = 87; therefore, the mean = 87 / 9 = 9.67

• For another example, consider the observations
– 8, 45, 7, 5, 8, 3, 10, 12, 9
– In this case, n = 9 and the sum = 107; therefore, the mean = 107 / 9 = 11.89
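The three steps translate directly into Python. This is a minimal sketch; the function name `mean` is an illustrative choice, and the salary list is taken from the exercise table above.

```python
def mean(values):
    """Sum of the scores divided by the number of scores."""
    n = len(values)          # count the scores
    total = sum(values)      # add them up
    return total / n         # divide the sum by n

first = [8, 25, 7, 5, 8, 3, 10, 12, 9]
second = [8, 45, 7, 5, 8, 3, 10, 12, 9]
salaries = [4000, 3000, 3000, 2000, 3000, 8000, 4000, 3000, 7000, 2000]

print(round(mean(first), 2))    # 87 / 9
print(round(mean(second), 2))   # 107 / 9
print(mean(salaries))           # 39000 / 10
```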
• The mean has many good theoretical properties and is used as the basis of many statistical tests. For a symmetrical distribution the mean is a good summary statistic. It is less useful for an asymmetric distribution.

Q. What is its limitation as a summary statistic in asymmetrical distributions?
A. It can be distorted by outliers, therefore giving a poor “typical” value.

Imagine weight in kg in a sample of 5 people:

50, 60, 50, 40, 120

The mean is calculated as 320 / 5 = 64 kg. Is this value of 64 kg “typical” for the observations?
Figure 4: Symmetric and asymmetric distributions
[Figure: two panels; vertical axes: No. of people; horizontal axes: Value of Factor K and Value of Factor J.]
The mean and the median
• Did you notice that the median was the same, 8 (the 5th value), for both data examples?

• On the other hand, the mean changed from 9.67 to 11.89 with the one extreme score changing from 25 to 45.

• Extreme scores in a set of data have a more pronounced effect on the mean than on the median.
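The comparison can be reproduced with Python's standard `statistics` module. This is a small sketch; the variable names are illustrative.

```python
import statistics

original = [8, 25, 7, 5, 8, 3, 10, 12, 9]
changed = [8, 45, 7, 5, 8, 3, 10, 12, 9]   # extreme score 25 replaced by 45

summary = {
    "original": (statistics.mean(original), statistics.median(original)),
    "changed": (statistics.mean(changed), statistics.median(changed)),
}

for label, (mean_value, median_value) in summary.items():
    print(f"{label}: mean = {mean_value:.2f}, median = {median_value}")
```

Only the mean moves when the extreme score changes, because the median depends on the rank of the middle observation and not on the size of the largest one.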
Choosing a Measure of Central Tendency

• Depends on the nature of the distribution

• For continuous variables in a unimodal and symmetric distribution the mean, median and mode are identical.

• With a skewed distribution the median may be more useful

• For statistical analyses the mean is the preferred measure.
Measures of Spread, Dispersion, Variability

• In addition to a measure of central tendency, in describing a distribution it is important to provide information concerning the relative position of other data points in the sample (that is, a measure of spread or variability).
Range: the simplest measure = highest value minus lowest value

• Take a sample of 10 heights (70, 95, 100, 103, 105, 107, 110, 112, 115, 140 cm)
Lowest (minimum) value = 70 cm
Highest (maximum) value = 140 cm
Range is therefore 140 – 70 = 70 cm
Simple to understand but far from perfect. Why?
• The range is derived from the extreme values. It says nothing about the values in between.
• Not stable (as sample size increases the range can change dramatically).
• Not very amenable to statistical tests.
Figure 8. Two distributions with the same range
[Figure: two curves; vertical axis: No. of people. Same range, different mean and variability.]
• Percentiles: Those values in a series of observations, arranged in ascending order of magnitude, which divide the distribution into 100 equal parts (thus the median is the 50th percentile).

• Quartiles: The values which divide a series of observations, arranged in ascending order, into 4 equal parts (thus the 2nd quartile is the median).

• The Interquartile Range represents the central portion of the distribution and is calculated as the difference between the third quartile and the first quartile. This range includes about one-half of the observations in the set, leaving one quarter of the observations on each side.
Median and quartiles
Sort the data in increasing order

The median is the middle value (if n is odd) or the average of the two middle values (if n is even); it is a measure of the “center” of the data

Quartiles: dividing the set of ordered values into 4 equal parts
Q2 = second quartile = median

first 25% | second 25% | third 25% | fourth 25%
          Q1           Q2          Q3

IQR = Interquartile range = Q3 - Q1
Measures of Data Variability

• Interquartile Range
– the difference between the score representing the 75th percentile and the score representing the 25th percentile

– Arrange the observations in ascending order
– Find the positions of Q1 and Q3
– Identify the values; the inter-quartile range = Q3 - Q1
– Example: 29, 31, 24, 29, 30, 25

– Arrange: 24, 25, 29, 29, 30, 31

» Position of Q1 = (n+1)/4 = 7/4 = 1.75
» Q1 = 24 + 0.75 × (25 – 24) = 24.75

» Position of Q3 = 3(n+1)/4 = 21/4 = 5.25
» Q3 = 30 + 0.25 × (31 – 30) = 30.25

» Q3 – Q1 = 30.25 – 24.75 = 5.5
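The position rule can be sketched in Python as follows. The function name `quartile` is an illustrative assumption; the sketch places Q1 at position (n+1)/4 and Q3 at position 3(n+1)/4 and interpolates between the two neighbouring observations when the position is fractional. With n = 6 the Q3 position is 21/4 = 5.25, so Q3 is 30.25 and the interquartile range is 5.5.

```python
def quartile(values, k):
    """k-th quartile (k = 1, 2, 3) using the (n + 1) position rule."""
    ordered = sorted(values)
    n = len(ordered)
    position = k * (n + 1) / 4       # 1-based position, may be fractional
    whole = int(position)            # observation just below the position
    fraction = position - whole      # how far towards the next observation
    if whole < 1:                    # position falls before the first value
        return ordered[0]
    if whole >= n:                   # position falls at or past the last value
        return ordered[-1]
    below, above = ordered[whole - 1], ordered[whole]
    return below + fraction * (above - below)

incubation = [29, 31, 24, 29, 30, 25]
q1 = quartile(incubation, 1)
q3 = quartile(incubation, 3)
iqr = q3 - q1
print(q1, q3, iqr)
```

Other quartile conventions exist and can give slightly different values on small samples. Python's standard `statistics.quantiles(data, n=4, method="exclusive")` uses the same (n + 1) positioning.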
Exercise

• Determine the first and third quartiles and the interquartile range for the following data

– 0, 3, 0, 7, 2, 1, 0, 1, 5, 2, 4, 2, 8, 1, 3, 0, 1, 2, 1
So how do we get a single mathematical measure to summarise the variability of an observed set of values?

• The most frequently used and most informative measure is the VARIANCE and its related functions

• The variance is computed in stages:

• 1. Calculate the mean as a measure of central location (MEAN): $\bar{x}$

• 2. Calculate the difference between each observation and the mean (DEVIATION): $(x - \bar{x})$

• 3. Next square the differences (SQUARED DEVIATION): $(x - \bar{x})^2$

• Q. What is the effect of this?
– Negative and positive deviations will not cancel each other out.
– Values further from the mean have a bigger impact.

• 4. Sum up these squared deviations (SUM OF THE SQUARED DEVIATIONS): $\sum (x - \bar{x})^2$

• 5. Divide this SUM OF THE SQUARED DEVIATIONS by the total number of observations minus 1 (n-1) to give the VARIANCE:

\[ \text{Variance} = \frac{\sum (x - \bar{x})^2}{n - 1} \]

This is a measure of the variability of the data

Why divide by n - 1?

This is an adjustment for the fact that the sample mean is just an estimate of the true population mean. Dividing by n - 1 rather than n makes the variance slightly bigger.
Measures of Data Variability
• Standard Deviation
– The standard deviation is the square root of the average squared deviation from the mean

\[ SD = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \]

– Computational formula:

\[ SD = \sqrt{\frac{n \sum x_i^2 - \left( \sum x_i \right)^2}{n(n - 1)}} \]
Calculating Standard Deviation

Score (x)   Mean (x̄)   Deviation (x – x̄)   Squared deviation (x – x̄)²
13
12
13
14
10
16
15
24
20
18
Σx = 155
Calculating Standard Deviation

Score (x)   Mean (x̄)   Deviation (x – x̄)   Squared deviation (x – x̄)²
13          15.5        -2.5                  6.25
12          15.5        -3.5                 12.25
13          15.5        -2.5                  6.25
14          15.5        -1.5                  2.25
10          15.5        -5.5                 30.25
16          15.5         0.5                  0.25
15          15.5        -0.5                  0.25
24          15.5         8.5                 72.25
20          15.5         4.5                 20.25
18          15.5         2.5                  6.25
Σx = 155                Σ(x – x̄) = 0         Σ(x – x̄)² = 156.5

\[ SD = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{156.5}{9}} = 4.17 \]

Let's use the computational formula...
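As a check, both the definitional and the computational formula can be evaluated in Python for the ten scores in the table. This is a sketch with illustrative variable names; the two routes should agree up to floating-point rounding.

```python
import math

scores = [13, 12, 13, 14, 10, 16, 15, 24, 20, 18]
n = len(scores)
mean = sum(scores) / n                                # 155 / 10

# Definitional formula: squared deviations from the mean, divided by n - 1.
sum_sq_dev = sum((x - mean) ** 2 for x in scores)
variance = sum_sq_dev / (n - 1)
sd_definitional = math.sqrt(variance)

# Computational formula: needs only the sum of x and the sum of x squared.
sum_x = sum(scores)
sum_x2 = sum(x * x for x in scores)
sd_computational = math.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))

print(mean, sum_sq_dev, round(sd_definitional, 2), round(sd_computational, 2))
```

Here Σx² = 2559, so the computational formula gives √((10 × 2559 − 155²) / (10 × 9)) = √(1565 / 90), the same value as √(156.5 / 9).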
Choosing the Measures of
Central Location and Dispersion
The Coefficient of Variation

• The coefficient of variation (CV) allows us to compare the variation of two (or more) different variables.

• Explanation of the term – sample coefficient of variation: the sample coefficient of variation is defined as the sample standard deviation divided by the sample mean of the data set.
• Usually, the result is expressed as a percentage.
The Coefficient of Variation

\[ \text{Sample CV} = \frac{s}{\bar{x}} \times 100\% \]

NOTE: The sample coefficient of variation standardizes the variation by dividing it by the sample mean.
The Coefficient of Variation

• The coefficient of variation has no units since the standard deviation and the mean have the same units, and thus cancel each other out.
• Because of this property, we can use this measure to compare the variations for different variables with different units.
The Coefficient of Variation

• Example: The mean number of parking tickets issued in a neighborhood over a four-month period was 90, and the standard deviation was 5. The average revenue generated from the tickets was $5,400, and the standard deviation was $775. Compare the variations of the two variables.
• Solution is on the next slide.
The Coefficient of Variation

• Solution:

Since the CV is larger for the revenues, there is more variability in the recorded revenues than in the number of tickets issued.
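The two coefficients behind this conclusion can be computed from the figures given in the example. This is a minimal Python sketch; the function name `coefficient_of_variation` is an illustrative choice.

```python
def coefficient_of_variation(sd, mean):
    """Sample CV as a percentage: (standard deviation / mean) * 100."""
    return sd / mean * 100

cv_tickets = coefficient_of_variation(5, 90)       # tickets: mean 90, SD 5
cv_revenue = coefficient_of_variation(775, 5400)   # revenue: mean $5,400, SD $775

print(f"Tickets CV: {cv_tickets:.1f}%")
print(f"Revenue CV: {cv_revenue:.1f}%")
```

The ticket counts have a CV of about 5.6% and the revenues about 14.4%, which is why the revenues are the more variable of the two.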
Grouped data
Mean

• Step 1: construct a frequency distribution
• Step 2: find x, the midpoint of each class
• Step 3: calculate x*f, i.e. x multiplied by the frequency
• Step 4: find the sum of x*f
• Step 5: divide the sum of x*f by the total frequency

\[ \bar{x} = \frac{\sum x f}{\sum f} \]
• Determine the mean, median for the data
presented below

Class interval frequency


0–2 1
3–5 3
6–8 5
9 – 11 4
12 – 14 2
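The five steps for the grouped mean can be sketched in Python for this table. The function name `grouped_mean` and the data layout (a list of class limits paired with frequencies) are illustrative assumptions.

```python
# Each entry: ((lower limit, upper limit), frequency), from the table above.
classes = [((0, 2), 1), ((3, 5), 3), ((6, 8), 5), ((9, 11), 4), ((12, 14), 2)]

def grouped_mean(classes):
    """Sum of midpoint * frequency divided by the total frequency."""
    total_f = 0
    total_xf = 0.0
    for (lower, upper), freq in classes:
        midpoint = (lower + upper) / 2     # step 2: class midpoint x
        total_xf += midpoint * freq        # steps 3 and 4: accumulate x * f
        total_f += freq
    return total_xf / total_f              # step 5

print(grouped_mean(classes))
```

The midpoints are 1, 4, 7, 10 and 13, so the sum of x*f is 114 and the total frequency is 15.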
The Mean, median and variance for grouped data
• As part of the soccer camp study, the investigators wanted to estimate
how much the respondents would be willing to pay for their child to
attend the camp. They felt it was best to measure with price ranges (an
ordinal measurement scale) rather than with specific prices (a ratio
measurement scale). The following question was asked as part of their
survey:
• Q) Assuming the camp would run for five days, two hours each day,
how much would you be willing to pay for your child to attend the
camp?
• 1. Less than $10
• 2. $11 - $25
• 3. $26 - $50
• 4. More than $50
• What is the mean price the respondents would be willing to pay?
• What is the median price the respondents would be willing to pay?
• Calculate the variance and hence the standard deviation?
PRICE

                             Frequency   Percent   Valid Percent   Cumulative Percent
Valid     Less than $10          2          1.0         1.4               1.4
          $11-25                20         10.0        14.3              15.7
          $26-50                63         31.5        45.0              60.7
          More than $50         55         27.5        39.3             100.0
          Total                140         70.0       100.0
Missing   System Missing        60         30.0
          Total                 60         30.0
Total                          200        100.0

1. The median for grouped data calculation requires you to use the frequency distribution output and the class intervals that are the question’s response categories.
2. The formula for the median is:

\[ \text{Med} = B_l + \left( \frac{\frac{n}{2} - cf_p}{f_m} \right) i \]

where:
Bl = lower boundary of class containing median
n = sample size
cfp = cumulative frequency of classes preceding class containing the median
fm = number of observations in class containing the median
i = width of the interval containing the median
• Step 1: set up the frequency distribution table
• Step 2: identify the median class, i.e. the class interval containing the value that has 50% of the observations above it and 50% below it
• Step 3: use the formula to find the median

In our example,
The median class interval is the 26-50 class interval.
Bl = 26
n = 140
cfp = 2 + 20 = 22 (cumulative frequency of the first two classes)
fm = 63
i = 24

\[ \text{Med} = B_l + \left( \frac{\frac{n}{2} - cf_p}{f_m} \right) i = 26 + \left( \frac{70 - 22}{63} \right) 24 = 26 + 18.29 = 44.29 \]

The median price is $44.29
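The grouped-median formula can be written as a small Python function. This is a sketch with an illustrative function name and parameter names. It uses the slide's values Bl = 26, fm = 63 and i = 24, and takes cfp as a cumulative frequency, as the definition requires: the two classes before the median class hold 2 + 20 = 22 respondents (15.7 in the table is their cumulative percent, not a frequency).

```python
def grouped_median(lower_boundary, n, cum_freq_before, freq_median_class, width):
    """Med = Bl + ((n / 2 - cfp) / fm) * i for grouped data."""
    return lower_boundary + (n / 2 - cum_freq_before) / freq_median_class * width

median_price = grouped_median(
    lower_boundary=26,         # Bl, as used on the slide
    n=140,                     # valid responses
    cum_freq_before=2 + 20,    # cfp: frequencies of the two preceding classes
    freq_median_class=63,      # fm
    width=24,                  # i, as used on the slide
)
print(round(median_price, 2))
```

With these inputs the median works out to 26 + (48 / 63) × 24, about $44.29.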


Variance and standard deviation

\[ \text{Variance} = \frac{1}{n - 1} \left[ \sum f_i x_i^2 - \frac{\left( \sum f_i x_i \right)^2}{n} \right] \]
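The grouped variance formula can be illustrated in Python. Because the price categories have open-ended classes with no midpoint, this sketch instead uses the small grouped-data exercise from earlier (classes 0-2, 3-5, 6-8, 9-11 and 12-14 with frequencies 1, 3, 5, 4 and 2). The function name is an illustrative assumption, and the standard deviation is taken as the square root of the variance.

```python
import math

midpoints = [1, 4, 7, 10, 13]     # x_i: midpoints of the five classes
frequencies = [1, 3, 5, 4, 2]     # f_i

def grouped_variance(midpoints, frequencies):
    """(sum f*x^2 - (sum f*x)^2 / n) / (n - 1), with n = sum of f."""
    n = sum(frequencies)
    sum_fx = sum(f * x for x, f in zip(midpoints, frequencies))
    sum_fx2 = sum(f * x * x for x, f in zip(midpoints, frequencies))
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)

variance = grouped_variance(midpoints, frequencies)
standard_deviation = math.sqrt(variance)
print(round(variance, 2), round(standard_deviation, 2))
```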
