Descriptive Statistics:
Measures of Central Tendency and Dispersion
and Graphical Presentation of Data
Community Medicine Unit
International Medical School
Management and Science University
Measures of Central Tendency
• Mean.
• Median.
• Mode.

Mean
• Also called the sample average or arithmetic mean.
• Sensitive to extreme values: a single data point can change the sample mean considerably.
• To compute it, add up the data, then divide by the sample size (n).
• The sample size n is the number of observations.
• The formula is: $\bar{x} = \frac{\sum x_i}{n}$
Characteristics of the mean
• Uniqueness: for a given set of data there is only one arithmetic mean.
• Simplicity: the mean is easily understood and easy to compute.
• Extreme values influence the mean and in some cases can distort it so much that it becomes undesirable as a measure of central tendency.
Example: What is the mean SBP among the cases?
n = 5 systolic blood pressures (mmHg)
X1 = 120
X2 = 80
X3 = 90
X4 = 110
X5 = 95
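The example can be worked through in a few lines. This is a minimal sketch using the five readings listed above; the variable names are illustrative only.

```python
# SBP readings (mmHg) taken from the example above
sbp = [120, 80, 90, 110, 95]

# Mean = sum of the observations divided by the sample size n
n = len(sbp)
mean_sbp = sum(sbp) / n

print(f"n = {n}, sum = {sum(sbp)}, mean = {mean_sbp} mmHg")
```

The sum is 495, so the mean SBP is 495 / 5 = 99 mmHg.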
• The median is the middle value, or the 50th percentile, of a set of ordered observations.
• E.g. SBP measurements: 80, 90, 95, 110, 120 (median = 95).
• When n is odd, the median is the middle value: the [(n+1)/2]th observation.
• When n is even, the median is the average of the two middle observations: the (n/2)th and the [(n/2)+1]th.
• E.g. measurements: 80, 90, 95, 96, 110, 120 (median = (95 + 96)/2 = 95.5).
• In normally distributed data, the median is equal (or almost equal) to the mean.
• The sample median is not sensitive to extreme values.
• Very useful for summarizing non-normally distributed (skewed) data.
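The odd and even rules can be sketched as a small function. This is an illustration only; the function name `median` is chosen here, and the two data sets are the SBP examples from this slide.

```python
def median(values):
    """Median by the slide's rules: middle value for odd n,
    average of the two middle observations for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # odd n: the [(n+1)/2]th observation (1-based) is index n // 2
        return ordered[mid]
    # even n: average of the (n/2)th and [(n/2)+1]th observations
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_sbp = [80, 90, 95, 110, 120]
even_sbp = [80, 90, 95, 96, 110, 120]
print(median(odd_sbp))   # 95
print(median(even_sbp))  # 95.5
```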
Characteristics of the median
• Uniqueness.
• Simplicity: easy to calculate.
• Unlike the mean, it is not affected by extreme values.
• The mode is the observation (or observations) that occurs most frequently.
• Less useful as a descriptive statistic.
• Can be used for continuous or ordinal data; sometimes used as the average for nominal data (modal category).
• There can be only one mode (unimodal distribution), two (bimodal distribution), or even more.
Characteristics of mode

 If all values of a sample are different there is no mode.

 There may be more than one mode.
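A short sketch of these two characteristics, assuming Python 3.8 or later for `statistics.multimode`. The helper name `modes` and the second and third data sets are made up for illustration.

```python
from collections import Counter
from statistics import multimode

def modes(values):
    """Return every most-frequent value, or [] when all values differ."""
    counts = Counter(values)
    if not counts or max(counts.values()) == 1:
        return []  # all values different: no mode
    return multimode(values)

print(modes([6, 3, 8, 5, 3]))         # [3]     one mode (unimodal)
print(modes([1, 1, 2, 2, 3]))         # [1, 2]  two modes (bimodal)
print(modes([80, 90, 95, 110, 120]))  # []      no mode
```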
Measures of Dispersion
Range
• The range is the difference between the maximum and minimum values of the distribution.
• Tends to increase with sample size.
• Sensitive to very extreme values.
• Very easy to calculate.
• The simplest and least useful measure of variability; suitable only for a quick estimate of variability.
• Takes into account only two values.
• $R = X_L - X_S$, where R is the range, $X_L$ is the largest value and $X_S$ is the smallest value.
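As a quick illustration, the range of the five SBP readings used earlier can be computed as below. The extra reading of 200 is a hypothetical value, added only to show how a single extreme observation changes the range.

```python
sbp = [120, 80, 90, 110, 95]

# R = X_L - X_S: largest value minus smallest value
r = max(sbp) - min(sbp)
print(r)  # 40

# Hypothetical extreme reading, for illustration only
with_extreme = sbp + [200]
r_extreme = max(with_extreme) - min(with_extreme)
print(r_extreme)  # 120
```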

• The variance of a population is denoted by $\sigma^2$.
• The variance of a sample is denoted by $s^2$ ("s-squared").
• Measures the average squared deviation of data points from their mean.
• Highly affected by outliers. Best for symmetric data.

Variance (for a sample)

$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

• Steps:
  • Compute each deviation
  • Square each deviation
  • Sum all the squares
  • Divide by the data size (sample size) minus one: n - 1

Worked example (n = 5):

Step 1      Step 3               Step 4
x           $(x - \bar{x})$      $(x - \bar{x})^2$
6            1                   1
3           -2                   4
8            3                   9
5            0                   0
3           -2                   4
Sum: 25      0                   18

Step 2: $\bar{x} = \frac{\sum x}{n} = \frac{25}{5} = 5$
Step 5: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{18}{4} = 4.5$, so $s = \sqrt{s^2} = \sqrt{4.5} = 2.12$

NOTE: The sum of the deviations, $\sum (x - \bar{x})$, is always zero.
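The same steps can be followed in code. This is a minimal sketch on the five values from the worked example; the variable names are illustrative.

```python
import math

x = [6, 3, 8, 5, 3]
n = len(x)

mean = sum(x) / n                      # 25 / 5 = 5
deviations = [xi - mean for xi in x]   # compute each deviation
squares = [d ** 2 for d in deviations] # square each deviation
sum_of_squares = sum(squares)          # sum all the squares
variance = sum_of_squares / (n - 1)    # divide by n - 1
sd = math.sqrt(variance)               # standard deviation

print(sum(deviations))   # 0.0, the deviations always sum to zero
print(variance)          # 4.5
print(round(sd, 2))      # 2.12
```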

Standard deviation:
• The smaller the standard deviation, the more consistent the data set.
• The smaller the standard deviation, the less the data deviate from their center.
• For example, if the SD for age is 15 years, the data values lie, on average, 15 years away from their center.

• Variance measures the amount of spread or variability of observations from their mean.
• The sample variance ($s^2$) is the average of the squared deviations about the sample mean (population variance = $\sigma^2$).
• Not usually reported in descriptive statistics because squared units are difficult to interpret.
• Formula: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
• We use (n - 1) instead of n because one degree of freedom is used up in estimating the mean from the sample.

Standard deviation
• The square root of the variance.
• The most widely used and generally preferred measure of variability.
• The smaller the value, the closer the observations lie to the mean.
• Like the mean, the standard deviation is sensitive to extreme values.
• Therefore, the standard deviation is best used to describe distributions that are symmetrical with a single peak.

• Calculate the sample variance and standard deviation of the monthly income (in USD) of nine workers.
Interquartile range (IQR) or percentiles
• The IQR is the most common inter-percentile measure.
• It is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).
• Like the median, the IQR is not sensitive to very extreme values (outliers).
• Usually reported together with the median for badly skewed distributions.
• It is equal to the range of the middle 50% of the observations.
• Formula: IQR = Q3 - Q1
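A brief sketch of the calculation, using the six SBP measurements from the median example. Note that several conventions exist for locating quartiles, and software packages may give slightly different values; the "inclusive" method of Python's `statistics.quantiles` (Python 3.8+) is an assumption made here, not a method stated in these slides.

```python
from statistics import quantiles

sbp = [80, 90, 95, 96, 110, 120]

# Quartiles split the ordered data into four parts; the middle cut is the median
q1, q2, q3 = quantiles(sbp, n=4, method="inclusive")

# IQR = Q3 - Q1: the spread of the middle 50% of the observations
iqr = q3 - q1

print(q1, q2, q3)  # 91.25 95.5 106.5
print(iqr)         # 15.25
```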
The Coefficient of Variation
• It is a measure of relative variation rather than absolute variation.
• It expresses the standard deviation as a percentage of the mean.
• The formula is: $CV = \frac{s}{\bar{x}} \times 100$
• It has no unit, because the standard deviation and the mean of the sample are measured in the same unit and the units cancel each other.
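For illustration, the coefficient of variation of the five values used in the variance example (mean 5, standard deviation about 2.12) can be computed as follows. The function name is chosen here for the sketch.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return stdev(values) / mean(values) * 100

x = [6, 3, 8, 5, 3]
cv = coefficient_of_variation(x)
print(f"CV = {cv:.1f}%")  # CV = 42.4%
```

Because the result is a percentage with no unit, it can be used to compare the variability of variables measured on different scales.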
Organizing & Displaying Data for Categorical Variables
Frequency Tables
• Tables organize data into values and categories, with titles and captions.
• The title should answer: which variables? when? where? what sample size (n)?
• A frequency table may include:
  • Categories (listed in some natural order)
  • Frequency
  • Cumulative frequency
  • Relative frequency
  • Proportion/percent

24 04/22/2020
Examples of Frequency Table_1
(SPSS output)
Gender distribution in a sample of 111 patients

                  Frequency   Percent   Valid Percent   Cumulative Percent
Valid  male              40      36.0            36.0                 36.0
       female            71      64.0            64.0                100.0
       Total            111     100.0           100.0

Stone location in the same sample of 111 patients

                  Frequency   Percent   Valid Percent   Cumulative Percent
Valid  proximal          46      41.4            41.4                 41.4
       distal            62      55.9            55.9                 97.3
       both               3       2.7             2.7                100.0
       Total            111     100.0           100.0
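The percent and cumulative percent columns can be reproduced from the raw frequencies. The sketch below uses the proximal/distal/both counts from the table above; it is an illustration, not the SPSS procedure itself.

```python
counts = {"proximal": 46, "distal": 62, "both": 3}
total = sum(counts.values())  # 111

rows = []
cumulative = 0
for category, freq in counts.items():
    cumulative += freq
    percent = round(100 * freq / total, 1)
    cumulative_percent = round(100 * cumulative / total, 1)
    rows.append((category, freq, percent, cumulative_percent))

for category, freq, percent, cumulative_percent in rows:
    print(f"{category:<10}{freq:>5}{percent:>8}{cumulative_percent:>8}")
```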
Examples of Frequency Table_2
(SPSS output)
Continuous data (age) is grouped and converted into ordinal data (age group).

age group
                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid  20 and below           4       3.6             3.6                  3.6
       21 - 30                6       5.4             5.4                  9.0
       31 - 40               18      16.2            16.2                 25.2
       41 - 50               30      27.0            27.0                 52.3
       51 - 60               24      21.6            21.6                 73.9
       61 - 70               17      15.3            15.3                 89.2
       71 and above          12      10.8            10.8                100.0
       Total                111
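Grouping a continuous variable into ordinal classes can be sketched as below. The class boundaries follow the age groups in the table; the function name, the label spellings and the list of ages are hypothetical, and ages are assumed to be whole years.

```python
from collections import Counter

def age_group(age):
    """Map an age in whole years to the ordinal classes used in the table."""
    if age <= 20:
        return "20 and below"
    if age >= 71:
        return "71 and above"
    lower = ((age - 1) // 10) * 10 + 1  # 21, 31, 41, 51 or 61
    return f"{lower} - {lower + 9}"

# Hypothetical ages, for illustration only (not the study data)
ages = [18, 25, 30, 31, 45, 70, 71, 83]
frequency = Counter(age_group(a) for a in ages)
print(frequency)
```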
Bar graph or chart
• Graphical presentation of the frequency distribution of categorical data (nominal or ordinal).
• Bar height represents frequency or relative frequency.
• Y axis: frequency or relative frequency.
• X axis: categorical variable.
• Bars are of equal width and separated by equal gaps.
[Figure 1: Gender distribution among 111 renal stone patients (bar chart; x axis: sex, male and female; y axis: frequency)]
Type of Bar Charts
• Cluster
• Stacked/Component
• Composite with percent
• Horizontal, if long category names (e.g. Kedah Darulaman, Kelantan Darulnaim)
[Figures: example bar charts of East and West values for the 1st, 2nd and 3rd quarters]
Pie chart
• Graphical presentation of the frequency distribution of categorical data (usually nominal).
• The circle represents 360°; slices start at 12 o'clock.
• Each slice represents one category.
• The size of a slice represents frequency or percent.
[Figure: Stone location among 111 cases in HKB, 2003-04 (pie chart)]
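Since the full circle is 360°, the angle of each slice is the category's share of the total multiplied by 360. A small sketch using the stone location counts from the frequency table (proximal 46, distal 62, both 3):

```python
counts = {"proximal": 46, "distal": 62, "both": 3}
total = sum(counts.values())

# Slice angle = (frequency / total) x 360 degrees
angles = {category: 360 * freq / total for category, freq in counts.items()}

for category, angle in angles.items():
    print(f"{category:<10}{angle:6.1f} degrees")
```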


Excellent graphs (Schmid, 1983)
• Accuracy
  • Data are properly entered
  • Not misleading, distorted or susceptible to misinterpretation
• Clarity
  • The ideas and concepts conveyed are clearly understood
• Simplicity
  • Straightforward; avoid gridlines or odd lettering
• Appearance
  • Should be appealing to the viewer
• Well-designed structure
  • Patterns are highlighted; lettering is horizontal

Organizing & Displaying Data of Numerical Variables

Graphs for quantitative data
• Graphs are the visual presentation of a frequency distribution, and may show:
  • differences in spread (variability)
  • differences in shape of the distribution
• Types of useful graphs:
  • Histogram
  • Polygon
  • Stem and leaf
  • Line graphs
  • Box plot

[Figure: Histogram of age distribution among 111 cases, with normality curve line (Mean = 51.0, Std. Dev = 14.99, N = 111)]
• Each bar represents an interval class.
• Bar height represents frequency or percent.
• Interval classes have no gaps in between.
Normal Distribution

• Also called Gaussian or bell-shaped

• Mean = Median = Mode
Normal distribution curve
1. The mean, median and mode all have the same value.
2. The curve is symmetric around the mean; the skew is 0.
3. The kurtosis is 3.
4. The tails of the curve get closer and closer to the x-axis as you move away from the mean, but they never quite reach it.

Measures of skewness or symmetry
• We can use Pearson's skewness coefficient.
• The formula: $\text{skewness} = \frac{\bar{x} - \text{median}}{s}$
• If it is equal to zero, the distribution is symmetric, as in a normal distribution.
• If positive, the data are skewed to the right (mean > median).
• If negative, the data are skewed to the left (mean < median).
• Usually the result will be between -1 and 1.
• A result above 0.2 or below -0.2 indicates rather severe skewness.
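A minimal sketch of the calculation, assuming the coefficient takes the form (mean - median) / SD, which is consistent with the sign rules and the ±0.2 thresholds above. Another common version multiplies this quantity by 3, so check which form a given textbook or package uses. The data are the five SBP readings from the mean example.

```python
from statistics import mean, median, stdev

def pearson_skewness(values):
    """(mean - median) / sample SD; assumed form, see the note above."""
    return (mean(values) - median(values)) / stdev(values)

sbp = [120, 80, 90, 110, 95]
skew = pearson_skewness(sbp)
print(round(skew, 2))  # 0.25: mean (99) > median (95), so skewed to the right
```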

Rule for the Normal Distribution

 68% of the observations fall within 1 SD of the mean

 95% of the observations fall within 2 SD of the mean
 99.7% of the observations fall within 3 SD of the mean

[Figure: Distribution of blood pressure in men; mean (SD) = 125 (14) mm Hg]
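Applied to the blood pressure example (mean 125, SD 14 mm Hg), the rule gives the intervals computed below. This is a sketch that assumes the distribution is approximately normal.

```python
mean_bp, sd_bp = 125, 14

# k standard deviations either side of the mean
intervals = {k: (mean_bp - k * sd_bp, mean_bp + k * sd_bp) for k in (1, 2, 3)}
coverage = {1: "68%", 2: "95%", 3: "99.7%"}

for k, (low, high) in intervals.items():
    print(f"{coverage[k]} of men: {low} to {high} mm Hg")
```

So about 68% of men fall between 111 and 139 mm Hg, about 95% between 97 and 153 mm Hg, and about 99.7% between 83 and 167 mm Hg.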
Histogram and Distribution

• A frequency polygon is a graph that displays the data using lines to connect points plotted for the frequencies.
• The frequencies correspond to the heights of the vertical bars in the histogram, so the polygon can be superimposed on it.
• The line connects the midpoints at the top of the bars.
• The polygon is tied down at both ends.
'Stem and leaf' plot
• Another tool for visually displaying continuous data.
• Very similar to a histogram.
• Allows for easier identification of individual values in the data.
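A simple sketch of how such a plot is built: each value is split into a stem (here the tens) and a leaf (the units). The function name is hypothetical, whole-number data are assumed, and the data are the six SBP measurements from the median example. Stems with no observations are simply omitted in this version.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group whole-number values by stem (tens) with leaves (units)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)

sbp = [80, 90, 95, 96, 110, 120]
plot = stem_and_leaf(sbp)
for stem, leaves in plot.items():
    print(f"{stem:>3} | {' '.join(str(leaf) for leaf in leaves)}")
```

Each individual value can be read back from the display, for example stem 9 with leaves 0, 5 and 6 stands for 90, 95 and 96.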


Box plot
• A graphical display that uses descriptive statistics based on percentiles.
• Also called a '5 number summary plot'.
• Provides information about central tendency and the variability of the middle 50% of the distribution.
• The 'box' represents the IQR: 25th to 75th percentile.
• An outlier is an observation more than 1.5 times the IQR away from the edges of the box (more than 3.0 times the IQR: extreme outlier).
• The lines extend to the smallest and largest values that are not outliers.
• Box plots make it easy to compare continuous data across multiple groups, since they can be plotted side by side.
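The outlier rule can be sketched as follows. The quartile method ("inclusive" in `statistics.quantiles`, Python 3.8+) is an assumption, since packages differ in how they locate quartiles; the function name is hypothetical, and the reading of 200 is a made-up extreme value added for illustration.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Quartiles, IQR, whisker ends and outliers using the 1.5 x IQR rule."""
    data = sorted(values)
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1, "median": q2, "q3": q3, "iqr": iqr,
        # lines end at the smallest and largest values that are not outliers
        "whiskers": (min(inside), max(inside)),
        "outliers": [v for v in data if v < low_fence or v > high_fence],
        "extreme": [v for v in data if v < q1 - 3 * iqr or v > q3 + 3 * iqr],
    }

summary = box_plot_summary([80, 90, 95, 96, 110, 120, 200])
print(summary)
```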
[Figure: Boxplot of age distribution by gender (male, female) in renal stone cases, HKB, 2003-2004. Annotations mark the outliers, the box, the 25th, 50th (median) and 75th percentiles, and the smallest and largest values that are not outliers.]
Sources of the outliers

 Error in recording the data.

 A failure of data collection. E.g. not following sample
 An actual extreme value from an unusual subject.

Thank You


