02 - Descriptive Statistics

Descriptive Statistics: Measures of Central Tendency and Dispersion, and Graphical Presentation of Data
Community Medicine Unit
International Medical School
Management and Science University
Measures of Central Tendency
Contents
• Measures of Central Tendency
• Mean
• Median
• Mode
Mean
• Also called the sample average or arithmetic mean.
• Sensitive to extreme values: a single data point can greatly change the sample mean.
• To calculate it, add up the data, then divide by the sample size (n).
• The sample size n is the number of observations.
• The formula is: $\bar{x} = \frac{\sum x}{n}$
Characteristics of the mean
• Uniqueness: for a given set of data there is only one arithmetic mean.
• Simplicity: the mean is easily understood and easy to compute.
• Sensitivity to extreme values: extreme values influence the mean and in some cases can distort it so much that it becomes undesirable as a measure of central tendency.
Example: What is the mean SBP among the cases?
n = 5 systolic blood pressures (mmHg)
X1 = 120
X2 = 80
X3 = 90
X4 = 110
X5 = 95
Mean = (120 + 80 + 90 + 110 + 95) / 5 = 495 / 5 = 99 mmHg
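The calculation for this example can be sketched in a few lines of Python. This is only an illustration: the function name `sample_mean` and the extreme reading of 220 are made up here to show how one value can shift the mean.

```python
# SBP readings (mmHg) from the example above.
sbp = [120, 80, 90, 110, 95]


def sample_mean(values):
    """Add up the data, then divide by the sample size n."""
    n = len(values)  # n is the number of observations
    return sum(values) / n


mean_sbp = sample_mean(sbp)
print(mean_sbp)

# Hypothetical illustration of sensitivity to extreme values:
# a single reading of 220 in place of 120 shifts the mean noticeably.
mean_with_extreme = sample_mean([220, 80, 90, 110, 95])
print(mean_with_extreme)
```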
Median
• The median is the middle value, or the 50th percentile, of a set of ordered numbers/measurements.
• When n is odd, the median is the middle value: the [(n+1)/2]th observation.
  - E.g. SBP measurements: 80, 90, 95, 110, 120. Here n = 5, so the median is the 3rd value, 95.
• When n is even, the median is the average of the two middle observations: the (n/2)th and the [(n/2)+1]th.
  - E.g. measurements: 80, 90, 95, 96, 110, 120. Here n = 6, so the median is (95 + 96)/2 = 95.5.
• In normally distributed data, the median is equal (or almost equal) to the mean.
• The sample median is not sensitive to extreme values.
• Very useful when summarizing data with a non-normal (skewed) distribution.
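As a rough sketch, the odd and even rules can be written out directly in Python. The function name `sample_median` is an illustrative choice, and the reading of 220 is a made-up extreme value.

```python
import statistics


def sample_median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # n odd: the [(n+1)/2]th observation (index `middle` when counting from 0)
        return ordered[middle]
    # n even: average of the (n/2)th and [(n/2)+1]th observations
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_case = sample_median([80, 90, 95, 110, 120])
even_case = sample_median([80, 90, 95, 96, 110, 120])
print(odd_case, even_case)

# Hypothetical extreme value: the median does not move when 120 becomes 220.
with_extreme = sample_median([80, 90, 95, 110, 220])
print(with_extreme)

# The standard library function gives the same result.
print(statistics.median([80, 90, 95, 96, 110, 120]))
```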
Characteristics of the median
• Uniqueness: for a given set of data there is only one median.
• Simplicity: it is easy to calculate.
• Unlike the mean, it is not affected by extreme values.
Mode
• The observation(s) that occur most frequently.
• Less useful as a descriptive statistic.
• Can be used for continuous or ordinal data; sometimes used as the average for nominal data (modal category).
• There can be only one mode (unimodal distribution), two (bimodal distribution), or even more.
Characteristics of mode
• If all values of a sample are different, there is no mode.
• There may be more than one mode.
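A minimal Python sketch of these rules follows. The function name `modes` and all three data sets are hypothetical, chosen only to show the unimodal, bimodal and no-mode cases.

```python
from collections import Counter


def modes(values):
    """Return all most-frequent values, or an empty list if every value occurs once."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # all values are different: no mode
    return sorted(value for value, count in counts.items() if count == highest)


unimodal = modes([1, 2, 2, 3])      # hypothetical data with one mode
bimodal = modes([1, 1, 2, 2, 3])    # hypothetical data with two modes
no_mode = modes([1, 2, 3])          # hypothetical data, all values different
print(unimodal, bimodal, no_mode)
```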
Measures of Dispersion
Range
• It represents the difference between the maximum and minimum values of the distribution.
• Tends to increase with sample size.
• Sensitive to very extreme values.
• It is very easy to calculate.
• Simplest and least useful measure of variability; suitable only for a quick estimate of variability.
• It takes into account only two values.
• $R = X_L - X_S$, where R is the range, $X_L$ is the largest value and $X_S$ is the smallest value.
Variance
• The variance of a population is denoted by $\sigma^2$ ("sigma-squared").
• The variance of a sample is denoted by $s^2$ ("s-squared").
• Measures the average squared deviation of data points from their mean.
• Highly affected by outliers. Best for symmetric data.
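Variance (for a sample)
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
• Steps:
  - Compute each deviation
  - Square each deviation
  - Sum all the squares
  - Divide by the data size (sample size) minus one: n - 1

Worked example (n = 5):
Step 1: list the values x; their sum is $\sum x = 25$.
Step 2: compute the mean, $\bar{x} = \frac{\sum x}{n} = \frac{25}{5} = 5$.
Step 3: compute each deviation $x - \bar{x}$.
Step 4: square each deviation and sum the squares.

x       x - mean    (x - mean)^2
6        1           1
3       -2           4
8        3           9
5        0           0
3       -2           4
Sum 25   0          18

Step 5: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{18}{4} = 4.5$, so $s = \sqrt{s^2} = \sqrt{4.5} = 2.12$.
NOTE: The sum of the deviations, $\sum (x - \bar{x})$, is always zero.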
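The same steps can be followed in Python as a quick check of the worked example. This is a sketch that uses the five values from the table; the variable names are illustrative.

```python
import math
import statistics

x = [6, 3, 8, 5, 3]  # the five values from the worked example
n = len(x)

mean = sum(x) / n                            # 25 / 5
deviations = [value - mean for value in x]   # compute each deviation
squared = [d ** 2 for d in deviations]       # square each deviation
sum_of_squares = sum(squared)                # sum all the squares
variance = sum_of_squares / (n - 1)          # divide by n - 1
sd = math.sqrt(variance)                     # standard deviation

print(sum(deviations), sum_of_squares, variance, round(sd, 2))

# The standard library's sample variance also divides by n - 1.
print(statistics.variance(x))
```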
Standard deviation:
The smaller the standard deviation, the more consistent the data set is.
The smaller the standard deviation, the less the data deviate from their center.
For example, if the SD for age is 15 years, the data values lie, on average, about 15 years away from their center (mean).
Variance
• Measures the amount of spread or variability of observations around their mean.
• The sample variance ($s^2$) is the average of the squared deviations about the sample mean (the population variance is $\sigma^2$).
• Not usually used in descriptive statistics because a 'squared' unit of data is difficult to interpret.
• Formula: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
• We divide by (n-1) instead of (n) because n-1 is the number of degrees of freedom.
Standard deviation
• Square root of the variance.
• The most widely used measure of variability, and easier to interpret than the variance because it is in the same units as the data.
• The smaller the value, the closer the observations lie to the mean.
• Like the mean, the standard deviation is sensitive to extreme values.
• Therefore the standard deviation is best used to describe distributions that are symmetrical with a single peak.
Exercise
• Calculate the sample variance and standard deviation of the monthly income (in USD) of nine workers.
Interquartile range (IQR) or percentiles
• The most common inter-percentile measure.
• The range between the 1st quartile (25th percentile, Q1) and the 3rd quartile (75th percentile, Q3).
• The 2nd quartile (50th percentile) is equal to the median.
• Like the median, the IQR is not sensitive to very extreme values (outliers).
• Usually reported together with the median when the distribution of observations is badly skewed.
• Formula: IQR = Q3 - Q1
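A short Python sketch of the IQR as Q3 minus Q1 follows. It reuses the five SBP readings from the median example. Statistical packages use slightly different quartile conventions, so the choice of `method="inclusive"` in `statistics.quantiles` is an assumption made here, and other software may report slightly different quartiles for the same data.

```python
import statistics

sbp = [80, 90, 95, 110, 120]  # SBP readings (mmHg) reused from the median example

# Cut the ordered data into four parts; returns Q1, Q2 (median) and Q3.
q1, q2, q3 = statistics.quantiles(sbp, n=4, method="inclusive")
iqr = q3 - q1  # 3rd quartile minus 1st quartile

print(q1, q2, q3, iqr)
```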
The Coefficient of Variation
• It is a measure of relative variation rather than absolute variation.
• It expresses the standard deviation as a percentage of the mean.
• The formula is $CV = \frac{s}{\bar{x}} \times 100$
• It has no unit because the standard deviation and the mean of the sample are measured in the same unit, and the units cancel each other.
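The sketch below computes the coefficient of variation as the sample standard deviation divided by the mean, times 100. The first data set is the one from the variance worked example; the heights are hypothetical values added only to show that changing the unit (cm to m) leaves the result unchanged.

```python
import math
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


cv = coefficient_of_variation([6, 3, 8, 5, 3])
print(round(cv, 2))

# Hypothetical heights: the same people measured in cm and in m.
cv_cm = coefficient_of_variation([150, 160, 170])
cv_m = coefficient_of_variation([1.50, 1.60, 1.70])
print(math.isclose(cv_cm, cv_m))
```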
Organizing & Displaying data
for Categorical Variables
Frequency Tables
• Tables organize data into values and categories, with a title and caption.
• Title: which variables? when? where? sample size (n)?
• A frequency table may include:
  - Categories, which should be listed in some natural order
  - Frequency
  - Cumulative frequency
  - Relative frequency
  - Proportion/percent
24 04/22/2020
Examples of Frequency Table_1 (SPSS output)

Gender distribution in a sample of 111 patients
            Frequency   Percent   Valid Percent   Cumulative Percent
male        40          36.0      36.0            36.0
female      71          64.0      64.0            100.0
Total       111         100.0     100.0

stoneLocation
            Frequency   Percent   Valid Percent   Cumulative Percent
proximal    46          41.4      41.4            41.4
distal      62          55.9      55.9            97.3
both        3           2.7       2.7             100.0
Total       111         100.0     100.0
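The percent and cumulative percent columns can be reproduced from the frequencies alone. The following Python sketch uses the stone location counts from the table; the function name `frequency_table` is illustrative, and rounding to one decimal place is assumed to match the SPSS display.

```python
def frequency_table(counts):
    """Return rows of (category, frequency, percent, cumulative percent)."""
    total = sum(counts.values())
    rows = []
    running = 0
    for category, frequency in counts.items():
        running += frequency  # cumulative frequency so far
        percent = round(frequency / total * 100, 1)
        cumulative = round(running / total * 100, 1)
        rows.append((category, frequency, percent, cumulative))
    return rows


stone_location = {"proximal": 46, "distal": 62, "both": 3}
table = frequency_table(stone_location)
for row in table:
    print(row)
```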
Examples of Frequency Table_2 (SPSS output)
Continuous data (age) is grouped and converted into ordinal data (age group).

age group
                Frequency   Percent   Valid Percent   Cumulative Percent
20 and below    4           3.6       3.6             3.6
21 - 30         6           5.4       5.4             9.0
31 - 40         18          16.2      16.2            25.2
41 - 50         30          27.0      27.0            52.3
51 - 60         24          21.6      21.6            73.9
61 - 70         17          15.3      15.3            89.2
71 and above    12          10.8      10.8            100.0
Total           111         100.0     100.0
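Grouping a continuous variable into ordinal classes can be sketched as below. The class boundaries follow the age groups in the table, with the labels written out in full. The list of ages is hypothetical (the original data are not given), and whole-number ages are assumed.

```python
from collections import Counter


def age_group(age):
    """Map a whole-number age in years to its age-group label."""
    if age <= 20:
        return "20 and below"
    if age >= 71:
        return "71 and above"
    lower = (age - 1) // 10 * 10 + 1  # lower class limit: 21, 31, 41, 51 or 61
    return f"{lower} - {lower + 9}"


ages = [18, 25, 45, 47, 50, 51, 72]  # hypothetical ages
group_counts = Counter(age_group(age) for age in ages)
print(group_counts)
```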
Bar graph or chart
• Graphical presentation of the frequency distribution of categorical data (nominal or ordinal).
• Bar height represents frequency or percent.
• Bars are of equal width and separated by equal gaps.
• X axis: the categorical variable. Y axis: frequency or relative frequency.
[Figure 1: Gender distribution among 111 renal stone patients. Bar chart of frequency (Y axis) by SEX (X axis: male, female).]
Type of Bar Charts
• Cluster bar chart: bars for each series (e.g. East and West) placed side by side within each category (1st Qtr, 2nd Qtr, 3rd Qtr).
• Stacked (component) bar chart: the series are stacked within a single bar for each category.
• Composite bar chart with percent: each bar is scaled to 100%.
• Horizontal bar chart: used if the category labels are long (e.g. Terengganu Daruliman, Kedah Darulaman, Kelantan Darulnaim).
Pie chart
• Graphical presentation of the frequency distribution of categorical data (usually nominal).
• The circle represents 360°, starting at 12 o'clock.
• Each slice represents one category.
• The size of each slice represents frequency or percent.
[Figure: Stone location among 111 cases in HKB, 2003 - 04. Pie chart with slices proximal 41.4%, distal 55.9%, both 2.7%.]
Excellence graphs (Schmid, 1983)
• Accuracy
  - Data properly entered
  - Not misleading, distorted or susceptible to misinterpretation
• Clarity
  - The ideas and concepts conveyed are clearly understood
• Simplicity
  - Straightforward; avoid gridlines or odd lettering
• Appearance
  - Should be appealing to the viewer
• Well-designed structure
  - Pattern highlighted; lettering is horizontal
Organizing & Displaying
Data of Numerical Variables
Graphs for quantitative data
• Graphs are the visual presentation of a frequency distribution, and may show:
  - differences in spread (variability)
  - differences in the shape of the distribution
• Types of useful graphs:
  - Histogram
  - Polygon
  - Stem and leaf
  - Line graphs
  - Box plot
Histogram
• Each bar represents an interval class; there are no gaps in between the interval classes.
• Bar height represents frequency or percent.
• A normality curve line can be drawn over the bars.
[Figure: Age Distribution among 111 cases. Histogram of AGE (X axis, interval classes from 15.0 to 80.0) with a normality curve; Std. Dev = 14.99, Mean = 51.0, N = 111.]
Normal Distribution
• Also called Gaussian or bell-shaped

• Mean = Median = Mode
Normal distribution curve properties
1. The mean, median and mode all have the same value.
2. The curve is symmetric around the mean; the skew is 0.
3. The kurtosis is 3.
4. The tails of the curve get closer and closer to the x-axis as you move away from the mean, but they never quite reach it.
Measures of skewness or symmetry
• We can use Pearson's skewness coefficient.
• The formula: $\text{skewness} = \frac{\bar{x} - \text{median}}{s}$
• If it is equal to zero, the distribution is symmetric, as in a normal distribution.
• If +ve, the data are skewed to the right (mean > median).
• If -ve, the data are skewed to the left (mean < median).
• The result will usually be between -1 and 1.
• A result above 0.2 or below -0.2 indicates rather severe skewness.
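The rules above can be tried out with a small Python sketch. It assumes the coefficient is computed as (mean - median) / SD with the sample standard deviation, which is consistent with the sign rules and the -1 to 1 range stated here; the function name `pearson_skewness` is illustrative. The first data set reuses the SBP readings from the median example and the second is hypothetical.

```python
import statistics


def pearson_skewness(values):
    """(mean - median) / SD, using the sample standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)
    return (mean - median) / sd


# SBP readings: mean 99 is above median 95, so the result is positive.
skew = pearson_skewness([80, 90, 95, 110, 120])
print(round(skew, 4))

# Hypothetical symmetric data: mean equals median, so the result is zero.
symmetric = pearson_skewness([1, 2, 3, 4, 5])
print(symmetric)
```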
68-95-99.7 Rule for the Normal Distribution
• 68% of the observations fall within 1 SD of the mean
• 95% of the observations fall within 2 SD of the mean
• 99.7% of the observations fall within 3 SD of the mean
[Figure: Distribution of Blood Pressure in Men; Mean (SD) = 125 (14) mm Hg]
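Applied to the blood pressure figure (mean 125, SD 14 mm Hg), the rule gives concrete intervals. The Python sketch below computes them and checks the coverage against a normal distribution using `statistics.NormalDist`. It assumes blood pressure is exactly normally distributed, which real data only approximate.

```python
from statistics import NormalDist

mean, sd = 125, 14  # mm Hg, from the figure
bp = NormalDist(mu=mean, sigma=sd)

intervals = {}
coverage = {}
for k in (1, 2, 3):
    low, high = mean - k * sd, mean + k * sd
    intervals[k] = (low, high)
    # Proportion of a normal distribution lying between low and high.
    coverage[k] = bp.cdf(high) - bp.cdf(low)
    print(f"within {k} SD: {low} to {high} mm Hg, {coverage[k]:.1%}")
```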
Histogram and Distribution
Polygon
• A frequency polygon is a graph that displays the data using lines to connect points plotted for the frequencies.
• The frequencies are represented by the heights of the vertical bars in the histogram, on which the polygon can be superimposed.
• The line segments pass through the midpoints at the top of the rectangles.
• The polygon is tied down at both ends.
‘Stem and leaf’ plot
• Another tool for visually displaying continuous data
• Very similar to a histogram
• Allows for easier identification of individual values in the sample
[Figure: stem and leaf plot, with the "leaves" labelled]
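A stem and leaf display can be sketched in Python as follows. The tens are taken as the stem and the units as the leaf, which is one common convention and an assumption here. The data are the six measurements from the median example, and the function name `stem_and_leaf` is illustrative.

```python
def stem_and_leaf(values):
    """Split each whole number into a stem (tens) and a leaf (units)."""
    ordered = sorted(values)
    # Include every stem between the smallest and largest, even if it has no leaves.
    stems = {stem: [] for stem in range(ordered[0] // 10, ordered[-1] // 10 + 1)}
    for value in ordered:
        stems[value // 10].append(value % 10)
    return stems


plot = stem_and_leaf([80, 90, 95, 96, 110, 120])
for stem, leaves in plot.items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")
```

Each individual value can be read back from the display, for example stem 9 with leaf 5 is the measurement 95.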
Box plot
• A graphical display that uses descriptive statistics based on percentiles.
• Also called a '5 number summary plot' (minimum, 1st quartile, median, 3rd quartile, maximum).
• Provides information about central tendency and the variability of the middle 50% of the distribution.
  - The 'box' represents the IQR: the 25th to the 75th percentile.
  - An outlier is an observation more than 1.5 times the IQR away from the edges of the box (more than 3.0 times is an extreme outlier).
  - The lines (whiskers) extend to the smallest and largest values that are not outliers.
• Box plots make it easy to compare continuous data in multiple groups, as they can be plotted side by side.
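The outlier rule can be illustrated with a short Python sketch. The ages are hypothetical, the function name `find_outliers` is illustrative, and the quartiles use the `inclusive` method of `statistics.quantiles`, which may differ slightly from the convention of a given statistics package.

```python
import statistics


def find_outliers(values):
    """Return (outliers, extreme outliers) using the 1.5 x IQR and 3.0 x IQR rules."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    outliers, extreme = [], []
    for value in sorted(values):
        distance = max(q1 - value, value - q3)  # distance beyond the edges of the box
        if distance > 3.0 * iqr:
            extreme.append(value)
        elif distance > 1.5 * iqr:
            outliers.append(value)
    return outliers, extreme


ages = [22, 35, 41, 45, 48, 52, 55, 58, 63, 98]  # hypothetical ages
outliers, extreme = find_outliers(ages)
print(outliers, extreme)
```

With these values the box runs from 42.0 to 57.25 years, so the age of 98 lies more than 1.5 times the IQR above the box but less than 3.0 times.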
Boxplot
[Figure: Boxplot: Age distribution between gender in renal stone cases, HKB, 2003 - 2004. Y axis: age; X axis: sex (male, female). Labelled parts: the box (25th percentile, 50th percentile or median, 75th percentile), the whiskers (ending at the smallest and largest values which are not outliers), and the outliers.]
Sources of the outliers
• Error in recording the data.
• A failure of data collection, e.g. not following the sample criteria.
• An actual extreme value from an unusual subject.
Thank You
