Statistical Measures


BR Descriptive Statistics

Major Points
1. Measures of Central Tendency
   1. Mean
   2. Median
   3. Mode
2. Measures of Position
   1. Quartiles
   2. Deciles
   3. Percentiles
3. Measures of Dispersion
   1. Range
   2. Variance
   3. Standard Deviation
   4. Coefficient of variation
   5. Interquartile Range
4. Other descriptive measures
   1. Geometric mean
   2. Weighted mean
Measures of Central Tendency

• Commonly known as averages: an average is a numerical value that
indicates the middle point or central region of the raw data.

• The 3 most frequently used measures of central tendency:
– Mode
– Median
– Mean
Why use measures of Central Tendency?
• Mathematically summarize data in order to make appropriate
comparisons.
e.g. You want to describe the age of students attending King Saud
University. Therefore you randomly ask 1000 students for their age.

Frequency   Age
159         19
219         20
172         21
146         22
123         23
83          25
48          27
16          29
20          32
14          40
Mean Example
Ten PHL 541 students received a score out of 20 on their quiz:
9, 10, 12, 13, 15, 15, 15, 16, 18, 19

$\bar{X} = \frac{\sum X}{n} = \frac{9 + 10 + 12 + 13 + 15 + 15 + 15 + 16 + 18 + 19}{10} = \frac{142}{10} = 14.2$

Therefore, the mean of this sample is 14.2.
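The calculation can be checked with a few lines of Python. This is a minimal sketch; the function name `arithmetic_mean` is chosen here for illustration and does not come from the slides.

```python
def arithmetic_mean(scores):
    """Return the sum of the scores divided by the number of scores."""
    if not scores:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(scores) / len(scores)


# Quiz scores of the ten students in the example
quiz_scores = [9, 10, 12, 13, 15, 15, 15, 16, 18, 19]
print(sum(quiz_scores))              # 142
print(arithmetic_mean(quiz_scores))  # 14.2
```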


Arithmetic Mean

• Arithmetic Mean: the sum of the scores divided by the number of
scores (generally thought of as the average).
• The mean of a sample of X scores is symbolized as $\bar{X}$,
which is read as "X bar".
• The mean of a population of X scores is symbolized by the Greek
letter mu (µ).
Sample Mean
• The algebraic definition of the sample mean is as follows:

$\bar{X} = \frac{\sum X}{n}$

n is the number of scores in the data set (termed the sample size).
Population Mean
• The algebraic definition of the population mean is as follows:

$\mu = \frac{\sum X}{N}$

N is the number of scores in the data set (termed the population size).
Properties of the Mean

• Uniqueness: For a given set of data there is one and only one mean.
• Simplicity: Easily understood and computed.
• The mean uses all the information available: Every
value in the given set of data is used in the computation;
it is therefore affected by every value. Extreme values
have an influence on the mean and in some cases, can
so distort it that it becomes undesirable as a measure of
location. It may not be "typical" when there are extreme
values present.
Median
• Median: The middle point of the distribution, or the score
which divides the set of scores into two equal parts.
Median = value of the ((n + 1)/2)th observation in the ordered data

NOTE!
• When determining the median, you must arrange the
scores in ascending or descending order first!
Median
• If there is an ODD number of scores, the median is the middle score:
1, 3, 6, 7, 8, 13, 15, 17, 18, 21, 23
Median = value of the ((n + 1)/2)th observation: (11 + 1)/2 = 6.
Look at the value of the 6th observation:

Median = 13

There are 5 scores above the median, and 5 below.


Median
• If there is an EVEN number of scores, the median is the midpoint
between the two middle scores:

1, 3, 6, 7, 8, 13, 15, 17, 18, 23

Median = value of the ((n + 1)/2)th observation: (10 + 1)/2 = 5.5,
so take the mean of the 5th and 6th observations:

Median = (8 + 13)/2 = 10.5


Steps to Finding the Median

1. Arrange data in ascending or descending order.
2. Count the number of scores (n).
3. If there is an odd number of scores, find the middle score: this
is the median.
4. If there is an even number of scores, find the 2 middle scores,
add them, and divide by 2: this is the median.
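The steps above can be sketched in Python as follows. The function name `median` is illustrative; the two data sets are the odd and even examples from the slides.

```python
def median(scores):
    """Middle value of the ordered scores (mean of the two middle values if n is even)."""
    ordered = sorted(scores)          # step 1: arrange in ascending order
    n = len(ordered)                  # step 2: count the scores
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    middle = n // 2
    if n % 2 == 1:                    # step 3: odd n, take the middle score
        return ordered[middle]
    # step 4: even n, average the two middle scores
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_scores = [1, 3, 6, 7, 8, 13, 15, 17, 18, 21, 23]
even_scores = [1, 3, 6, 7, 8, 13, 15, 17, 18, 23]
print(median(odd_scores))   # 13
print(median(even_scores))  # 10.5
```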
Properties of the Median

1. Uniqueness: As is true with the mean, there is only one median for
a given set of data.
2. Simplicity: The median is easy to calculate.
3. It is not as drastically affected by extreme values as the mean.
4. Because it uses only the middle value(s) of the data set, the
median is not a very reliable measure.
5. The median does not use all the information available.
Mode
• Mode: The most frequently occurring score in a set of data.

Frequency   Age
159         19
219         20
172         21
146         22
123         23
83          25
48          27
16          29
20          32
14          40

Our example:
– 20 is the most frequently occurring age in our sample.
– Therefore the mode of this distribution is 20.
– This is a unimodal distribution.
Mode
Frequency   Age
159         19
219         20
172         21
146         22
123         23
219         25
48          27
16          29
20          32
14          40

Our example:
– 20 and 25 are the most frequently occurring ages in our sample.
– Therefore the modes of this distribution are 20 and 25.
– This is a bimodal distribution.
Mode
Frequency   Age
159         19
219         20
172         21
146         22
123         23
219         25
48          27
16          29
219         30
14          40

Our example:
– 20, 25 and 30 are the most frequently occurring ages in our sample.
– Therefore the modes of this distribution are 20, 25 and 30.
– This is a multimodal distribution.
Mode
Frequency   Age
120         19
120         20
120         21
120         22
120         23
120         25
120         27
120         29
120         32
120         40

Our example:
– All the scores have the same frequency.
– Therefore the data has no mode.
Mode Example
Frequency   Age
159         19
219         20
219         21
146         22
123         23
83          25
48          27
16          29
20          32
14          40

Our example:
– 20 & 21 are the most frequently occurring ages in our sample.
– Therefore the modes of this distribution are 20 & 21.
– This is a bimodal distribution.
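Reading the mode(s) off a frequency table can be sketched as below. The helper name `modes_from_frequencies` and the choice to return an empty list when every value has the same frequency (the "no mode" case) are assumptions made for this illustration.

```python
def modes_from_frequencies(frequency_by_value):
    """Return the value(s) with the highest frequency, or [] if all frequencies tie."""
    frequencies = list(frequency_by_value.values())
    highest = max(frequencies)
    # If every value occurs equally often, the data has no mode.
    if len(frequencies) > 1 and all(f == highest for f in frequencies):
        return []
    return sorted(v for v, f in frequency_by_value.items() if f == highest)


# Age -> frequency, taken from the bimodal example table
ages = {19: 159, 20: 219, 21: 219, 22: 146, 23: 123,
        25: 83, 27: 48, 29: 16, 32: 20, 40: 14}
print(modes_from_frequencies(ages))        # [20, 21]

# Every age reported by 120 students: no mode
flat_ages = {age: 120 for age in ages}
print(modes_from_frequencies(flat_ages))   # []
```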
Ordered Data
Age of patients:
8, 10, 11, 11, 13, 13, 16, 17, 17, 18, 19, 20, 20, 21, 21, 22, 22, 25, 61

Modes = 11, 13, 17, 20, 21, 22
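For raw (ungrouped) data such as the patient ages, Python's standard library offers `statistics.multimode` (available from Python 3.8), which returns every value tied for the highest count. A short sketch using the ages listed above:

```python
from statistics import multimode

patient_ages = [8, 10, 11, 11, 13, 13, 16, 17, 17, 18,
                19, 20, 20, 21, 21, 22, 22, 25, 61]

# multimode returns all values tied for the highest count,
# in the order they are first encountered in the data.
modes = multimode(patient_ages)
print(modes)  # [11, 13, 17, 20, 21, 22]
```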
Properties of the Mode
• Like the median, it does not take into account all of
the data - only the one most frequently occurring
score.
• May appear in a distribution in places other than the
centre.
• It is the score with the highest bar in a histogram, or the
highest point in a frequency polygon.
• The only valid measure of central tendency for
nominal data.
• The least frequently used measure of central
tendency as it does not lend itself to mathematical
operations.
Effect of outliers on mean & median
With outliers
Patients' ages:
8, 10, 11, 11, 13, 13, 16, 17, 17, 18, 19, 20, 20, 21, 21, 22, 22, 25, 61

Mean = 365/19 = 19.2
Median = 18 (the 10th of the 19 ordered values)
Effect of outliers on mean & median
Without outliers
Patients' ages (the outlier 61 excluded):
8, 10, 11, 11, 13, 13, 16, 17, 17, 18, 19, 20, 20, 21, 21, 22, 22, 25

Mean = 16.9
Median = (17 + 18)/2 = 17.5
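The effect of the single outlier can be checked directly. This sketch uses `mean` and `median` from Python's `statistics` module on the 19 listed ages; for those values the sum is 365, so the mean with the outlier evaluates to 365/19, about 19.2.

```python
from statistics import mean, median

ages_with_outlier = [8, 10, 11, 11, 13, 13, 16, 17, 17, 18,
                     19, 20, 20, 21, 21, 22, 22, 25, 61]
# Drop the extreme value 61 to see how each measure reacts
ages_without_outlier = [age for age in ages_with_outlier if age != 61]

print(round(mean(ages_with_outlier), 1), median(ages_with_outlier))        # 19.2 18
print(round(mean(ages_without_outlier), 1), median(ages_without_outlier))  # 16.9 17.5
```

The mean shifts by more than two years when the outlier is removed, while the median moves by only half a year.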
Measures of Position
Quartiles, Deciles and Percentiles:

– They locate special points that break a distribution into a given
number of equal parts.
– If a set of data is arranged in order of magnitude, the middle
value, which divides the set into two equal parts, is the median.
– By extending this idea we can think of the values which divide the
set into four, ten and a hundred equal parts.
Quartiles
– One of the most frequently used quantiles is the quartile.
– Quartiles divide the values of a data set into four subsets of
equal size, each comprising 25% of the observations.
– To find the first, second, and third quartiles:
1. Arrange the N data values into an ordered array.
2. First quartile, Q1 = data value at position (N + 1)/4
3. Second quartile, Q2 = data value at position 2(N + 1)/4
4. Third quartile, Q3 = data value at position 3(N + 1)/4
Quartiles Example
• Weight of patients:
175, 260, 150, 165, 170, 180, 190, 210, 210, 235, 240, 270
• Step 1: Rank the data and divide them into 4 parts:
150, 165, 170 | 175, 180, 190 | 210, 210, 235 | 240, 260, 270
              Q1              Q2              Q3
• Step 2: Take the midpoint of the two values on either side of each cut:
Q1 = (170 + 175)/2 = 172.5
Q2 = (190 + 210)/2 = 200.0
Q3 = (235 + 240)/2 = 237.5
Quartiles Example

• Q1 = 1st quartile, is the value such that 1/4 of the observations
are less than or equal to that quartile
• e.g.: 2, 2, 4, 6, 7, 7, 8, 9, 10, 10, 10, 12
• To find which value, use the following formula:
(n + 1)/4 = 13/4 = 3.25
Position 3.25 lies between the 3rd value (4) and the 4th value (6),
so Q1 is taken as their midpoint:
Q1 = (4 + 6)/2 = 5
• Q2 = 2nd quartile = median
2(n + 1)/4 = 26/4 = 6.5
Q2 = (7 + 8)/2 = 7.5
• Q3 = 3rd quartile, is the value such that 3/4 of the observations
are less than or equal to that quartile
3(n + 1)/4 = 39/4 = 9.75
Q3 = (10 + 10)/2 = 10
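The position rule can be sketched in Python as below. The function name `quartile` is illustrative, and the handling of fractional positions (taking the midpoint of the two neighbouring values) is an assumption that mirrors the worked examples in these slides; other texts and software interpolate differently and may give slightly different quartiles.

```python
import math


def quartile(values, k):
    """k-th quartile (k = 1, 2, 3) using the k(n + 1)/4 position rule.

    Assumption: a fractional position is resolved by averaging the two
    neighbouring ordered values, as in the worked examples.
    """
    ordered = sorted(values)
    n = len(ordered)
    position = k * (n + 1) / 4                 # 1-based position in the ordered data
    lower = min(max(math.floor(position), 1), n)
    upper = min(max(math.ceil(position), 1), n)
    # When the position is a whole number, lower == upper and the value itself is returned.
    return (ordered[lower - 1] + ordered[upper - 1]) / 2


scores = [2, 2, 4, 6, 7, 7, 8, 9, 10, 10, 10, 12]
print([quartile(scores, k) for k in (1, 2, 3)])   # [5.0, 7.5, 10.0]

weights = [175, 260, 150, 165, 170, 180, 190, 210, 210, 235, 240, 270]
print([quartile(weights, k) for k in (1, 2, 3)])  # [172.5, 200.0, 237.5]
```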
Deciles & Percentiles

• Similarly, the values which divide the data into ten equal parts
are called deciles and are denoted by D1, D2, ..., D9, while the
values dividing the data into one hundred equal parts are called
percentiles and are denoted by P1, P2, ..., P99.
• E.g.: the 90th percentile is the value such that 90% of the
observations are less than or equal to it.
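As an illustration, Python's `statistics.quantiles` (Python 3.8+) returns the cut points for any number of equal parts. Its default `exclusive` method places the k-th cut at position k(n + 1)/parts and interpolates linearly between neighbours, so for fractional positions it can differ slightly from a midpoint shortcut. The data set reused here is only an example.

```python
from statistics import quantiles

data = [2, 2, 4, 6, 7, 7, 8, 9, 10, 10, 10, 12]

deciles = quantiles(data, n=10)        # D1 .. D9 (9 cut points)
percentiles = quantiles(data, n=100)   # P1 .. P99 (99 cut points)

# D5 and P50 both coincide with the median
print(deciles[4], percentiles[49])     # 7.5 7.5
# P90: 90% of the observations are less than or equal to this value
print(percentiles[89])
```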
Percentile
The value below or above which a particular percentage of values fall
(the median is the 50th percentile).
e.g. 5th percentile: 5% of values fall below it, 95% of values fall
above it.
A series of percentiles (1st, 5th, 25th, 50th, 75th, 95th, 99th) gives
a good general idea of the scatter and shape of the data.

[Figure: the 1st, 5th, 25th, 50th, 75th, 95th and 99th percentiles and the range, marked on a height scale from 5’6” to 6’2”]


Measures of Dispersion or Variability
• Consider the following two data sets on the ages of all patients
suffering from bladder cancer (BC) and prostatic cancer (PC).

BC: 39 45 36 40 35 38 47
PC: 27 52 18 33 70

• The mean age of each of the two groups is 40 years.
• If we do not know the ages of individual patients and are told only
that the mean age of the patients in the two groups is the same, we
may deduce that the patients in the two groups have a similar age
distribution.
• However, the variation in the patients' ages in these two groups is
very different.
• The ages of the prostatic cancer patients have a much larger
variation than the ages of the bladder cancer patients.
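A quick check of the two groups shows the same mean but very different spread. This sketch uses the range (maximum minus minimum) as a first rough indication of spread; the variable names are illustrative.

```python
from statistics import mean

bc_ages = [39, 45, 36, 40, 35, 38, 47]   # bladder cancer patients
pc_ages = [27, 52, 18, 33, 70]           # prostatic cancer patients

for label, ages in (("BC", bc_ages), ("PC", pc_ages)):
    spread = max(ages) - min(ages)       # range of the group
    print(label, mean(ages), spread)
# BC 40 12
# PC 40 52
```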
MEASURES OF DISPERSION

In order to describe a frequency distribution adequately, it is
necessary not only to determine the centre of the distribution but
also to have an idea about the "variation", "dispersion" or "scatter"
of the measurements.
Two groups of data could have the same mean, but different variations
from that mean. The mean is never used as a measure of dispersion.
e.g. Blood urea level (mg/dl) of 2 groups of 5 individuals each:
Measures of Variability

• Measure the “spread” in the data
• Some important measures
– Range
– Mean deviation
– Variance
– Standard Deviation
– Standard Error
– Coefficient of variation
– Interquartile Range
Variability
• The purpose of the majority of medical, behavioural
and social science research is to explain or account for
variance or differences among individuals or groups.

Examples
1. What factors account for the variance (or difference) in IQ
among individuals?
2. What factors account for the variance in treatment
compliance among different groups of patients?
1- Range
• The range tells us the span over which the data are distributed,
and is only a very rough measure of variability.
• Range: The difference between the maximum and minimum scores
(X max - X min)
– Example: The highest amount of tips made in a night is $270 and
the lowest is $150. Therefore, the range of tips made that night is
270 – 150 = $120
• Range is the simplest measure of dispersion.
• It is not the best measure of dispersion, as it depends entirely on
the extreme scores and tells us nothing about the middle values. It
does not take into consideration all the values in a series of scores.
Variation
This is an example of data with NO variability:

X     $X - \bar{X}$
5     0.00
5     0.00
5     0.00
5     0.00
5     0.00

$\sum X = 25$, n = 5, $\bar{X} = 5$
Variation
This is an example of data with low variability:

X     $X - \bar{X}$
6     +1.00
4     -1.00
6     +1.00
5     0.00
4     -1.00

$\sum X = 25$, n = 5, $\bar{X} = 5$
Variation
This is an example of data with higher variability:

X     $X - \bar{X}$
8     +3.00
1     -4.00
9     +4.00
5     0.00
2     -3.00

$\sum X = 25$, n = 5, $\bar{X} = 5$
2- Mean deviation

• The best measures of dispersion should:
– take into account all the scores in the distribution
– describe the average deviation of the scores around the mean.
• Normally, to find the average we would want to sum all deviations
from the mean and then divide by n, i.e.,

$\frac{\sum (X - \bar{X})}{n}$

BUT: We have a problem.
$\sum (X - \bar{X})$ will always add up to zero.
Mean Deviation

The average deviation is the average of the absolute deviations (i.e.
regardless of the sign) of the individual observations from their mean.
e.g. Blood urea level (mg/dl) for 5 individuals:
This indicates that, on average, the values of x (blood urea level)
deviate 11.2 mg/dl from the mean of the distribution.
Deviations from the mean
• In any group of scores, the sum of the deviations from the mean
equals zero:

X     X - µ
3     3 - 5.50 = -2.50
5     5 - 5.50 = -0.50
9     9 - 5.50 = +3.50
2     2 - 5.50 = -3.50
8     8 - 5.50 = +2.50
6     6 - 5.50 = +0.50

ΣX = 33     Σ(X - µ) = 0.00

n = 6, µ = ΣX/n = 33/6 = 5.50
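The table can be reproduced in a few lines. The sketch below also computes the mean of the absolute deviations for the same six scores, which is one way around the zero-sum problem; the variable names are illustrative.

```python
scores = [3, 5, 9, 2, 8, 6]
mu = sum(scores) / len(scores)                 # 33 / 6 = 5.5

deviations = [x - mu for x in scores]
print(deviations)        # [-2.5, -0.5, 3.5, -3.5, 2.5, 0.5]
print(sum(deviations))   # 0.0: positive and negative deviations cancel

# Ignoring the sign gives the mean (absolute) deviation
mean_deviation = sum(abs(d) for d in deviations) / len(scores)
print(round(mean_deviation, 2))  # 2.17
```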
Variance & Standard Deviation

• However, if we square each of the deviations from the mean, we
obtain a sum that is not equal to zero.

• This is the basis for the measures of variance and standard
deviation, the two most common measures of variability (or
dispersion) of data.
3- Variance

• The sum of squared deviations from the mean divided by the number
of degrees of freedom, n - 1 (this gives an estimate of the
population variance):

$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$
Disadvantages of Variance

1- The original observations are measured in a certain unit, BUT the
variance is expressed in the square of this unit.

2- It cannot be added to or subtracted from the mean.


4- Standard Deviation Formula

$SD = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$
Steps to calculate standard deviation

• Compute the mean.
• Subtract the mean from each observation.
• Square each of the deviations.
• Sum them.
• Divide by one less than the number of observations (the result is
almost the mean of the squared deviations).
• Take the square root.
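The steps above can be sketched as follows, using the high-variability data set 8, 1, 9, 5, 2 from this deck. The function name `sample_standard_deviation` is illustrative.

```python
import math


def sample_standard_deviation(observations):
    """Standard deviation with n - 1 in the denominator, following the listed steps."""
    n = len(observations)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = sum(observations) / n                      # compute the mean
    deviations = [x - mean for x in observations]     # subtract the mean
    squared = [d ** 2 for d in deviations]            # square each deviation
    sum_of_squares = sum(squared)                     # sum them
    variance = sum_of_squares / (n - 1)               # divide by n - 1
    return math.sqrt(variance)                        # take the square root


data = [8, 1, 9, 5, 2]
sd = sample_standard_deviation(data)
print(round(sd ** 2, 2))  # variance: 12.5
print(round(sd, 2))       # standard deviation: 3.54
```

For comparison, `statistics.stdev` in the Python standard library uses the same n - 1 denominator.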
Standard Deviation (SD)

The standard deviation is defined as the square root of the average of
the squared deviations of the measurements from their mean; that is,
it is the square root of the "variance".
Variance & Standard Deviation

X     $X - \bar{X}$     $(X - \bar{X})^2$
8     +3.00             9.00
1     -4.00             16.00
9     +4.00             16.00
5     0.00              0.00
2     -3.00             9.00

$\sum X = 25$, $\sum (X - \bar{X}) = 0.00$, $\sum (X - \bar{X})^2 = 50.00$

Note: $\sum (X - \bar{X})^2$ is called the Sum of Squares.
Why use Standard Deviation and not Variance?
• Normally, you will only calculate variance in order to calculate
standard deviation, as standard deviation is what we typically want.

• Why? Because standard deviation expresses variability in the same
units as the mean. SD can be added to or subtracted from the mean. SD
takes into consideration all the values in the series of observations.

• Example: The standard deviation of ages in a class is 3.7 years
(and the variance would be (3.7)² = 13.69 years²).
Standard Error (SE)
The results are then expressed as:

"mean ± SD" or "mean ± SE"

(136 ± 23.93 or 136 ± 5.35 mmHg)

N.B. Variation of the data is accepted if the mean > 2.5 SD or > 10 SE.
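The slides do not give the formula here, but the standard error of the mean is conventionally computed as SE = SD / √n. The sketch below applies that relation; the sample size n = 20 is an assumption inferred from the two reported values (23.93 and 5.35) and is not stated in the source.

```python
import math


def standard_error(sd, n):
    """Standard error of the mean: the standard deviation divided by sqrt(n)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return sd / math.sqrt(n)


# n = 20 is an assumed sample size, chosen because it reproduces the reported SE
print(round(standard_error(23.93, 20), 2))  # 5.35
```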
May 2003 Exam:
Coefficient of variation (CV)

This measure is used to compare the variability or dispersion within
2 groups of data, since it is invalid to compare 2 standard deviations
directly when the groups are measured in different units.
Therefore, the CV is a measure of the relative but not the absolute
variability.
e.g. In a group of individuals, compare the variation in serum
cholesterol with that of body weight.

for serum cholesterol
CV = 50 / 180 x 100 = 27.78%

for body weight
CV = 30 / 85 x 100 = 35.29% (higher variability)
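The two calculations follow the pattern CV = (SD / mean) × 100. A minimal sketch, reading 50 and 30 as the standard deviations and 180 and 85 as the means of the two measurements (the function name is illustrative):

```python
def coefficient_of_variation(sd, mean):
    """Relative variability: the standard deviation as a percentage of the mean."""
    if mean == 0:
        raise ValueError("the CV is undefined when the mean is zero")
    return sd / mean * 100


cholesterol_cv = coefficient_of_variation(50, 180)
weight_cv = coefficient_of_variation(30, 85)
print(round(cholesterol_cv, 2))  # 27.78
print(round(weight_cv, 2))       # 35.29

# Body weight shows the higher relative variability
print(weight_cv > cholesterol_cv)  # True
```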
