Lecture 3+, Descriptive Statistics (Slide)
Standard Deviation
Definition of sample standard deviation:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
Standard deviation is in the same units as the mean
Variance is in squared units: $s^2$, with $s = \sqrt{s^2}$
What is the Standard
Deviation?
The standard deviation of a data set
is based on how much each data
value deviates from the mean, and is
equal to the square root of the
variance. The greater the dispersion
of values, the larger the standard
deviation. Much of statistical theory is
based on the standard deviation and
the 'normal' distribution.
When is the Standard Deviation
Useful?
It is a useful measure when your data
distribution is very close to a normal curve.
In this situation, the mean is the best
measure of central tendency, and the
standard deviation is the best measure of
dispersion.
In a normal distribution, if you measure 1
standard deviation to either side of the
mean, you will find that 68.3% of the
observations fall into this area; 95.4% of
the observations fall within 2 standard
deviations to either side of the mean; and
99.7% of observations fall within 3
standard deviations of the mean.
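The coverage figures for 1, 2, and 3 standard deviations can be checked numerically. The sketch below assumes an exactly normal distribution and uses the identity P(|Z| < k) = erf(k / sqrt(2)); the function name `normal_coverage` is made up for illustration.

```python
import math

def normal_coverage(k: float) -> float:
    """Probability that a normal variable falls within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

# Print the share of observations within 1, 2, and 3 standard deviations
for k in (1, 2, 3):
    print(f"within {k} SD: {100 * normal_coverage(k):.2f}%")
```

Real data are never exactly normal, so observed shares will only approximate these values.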
Calculation of the Sample Standard Deviation
using the Theoretical (Squared Deviation)
Method

Squared Deviation from the Mean Age for a Sample of 11 Chicken Pox Sufferers

         Child's Age     Child's Age (X) Minus the         Squared Deviation
         (X) in Years    Mean Age ($\bar{X}$) in Years     $(X - \bar{X})^2$
$X_1$    2               2 - 6 = -4                        (-4)^2 = 16
$X_2$    4               4 - 6 = -2                        (-2)^2 = 4
$X_3$    5               5 - 6 = -1                        (-1)^2 = 1
$X_4$    5               5 - 6 = -1                        (-1)^2 = 1
$X_5$    6               6 - 6 = 0                         (0)^2 = 0
$X_6$    6               6 - 6 = 0                         (0)^2 = 0
$X_7$    6               6 - 6 = 0                         (0)^2 = 0
$X_8$    7               7 - 6 = 1                         (1)^2 = 1
$X_9$    7               7 - 6 = 1                         (1)^2 = 1
$X_{10}$ 8               8 - 6 = 2                         (2)^2 = 4
$X_{11}$ 10              10 - 6 = 4                        (4)^2 = 16

$\sum X = 66$ years; $\sum (X - \bar{X}) = 0$; $\sum (X - \bar{X})^2 = 44$
$\bar{X} = \sum X / n = 66 / 11 = 6$ years; n = 11; n - 1 = 10
Calculation of the Sample Standard
Deviation Using the Data in Table 5.6 and
the Theoretical Formula:
$$S = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} = \sqrt{\frac{44}{10}} = \sqrt{4.4} = 2.10 \text{ years}$$
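The squared-deviation calculation can be reproduced in a few lines. This is a minimal sketch using the 11 ages from the table; the variable names are illustrative, and `statistics.stdev` from the Python standard library is used only as a cross-check.

```python
import math
import statistics

# Child's age (X) in years for the 11 chicken pox sufferers in the table
ages = [2, 4, 5, 5, 6, 6, 6, 7, 7, 8, 10]

n = len(ages)
mean_age = sum(ages) / n                           # mean of X
squared_deviations = [(x - mean_age) ** 2 for x in ages]
sum_sq_dev = sum(squared_deviations)               # sum of (X - mean)^2
s = math.sqrt(sum_sq_dev / (n - 1))                # sample standard deviation

print(f"mean = {mean_age}, sum of squared deviations = {sum_sq_dev}, s = {s:.2f} years")

# Cross-check against the standard library's sample standard deviation
assert math.isclose(s, statistics.stdev(ages))
```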
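Calculation of the Sample Standard
Deviation Using the Computational (Sum of
Squares) Formula:

Child's Age (X) in Years    $X^2$
2                           4
4                           16
5                           25
5                           25
6                           36
6                           36
6                           36
7                           49
7                           49
8                           64
10                          100

$\sum X = 66$, $\sum X^2 = 440$, where n = 11

Computation Formula:
$$S = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}} = \sqrt{\frac{440 - \frac{(66)^2}{11}}{10}} = \sqrt{\frac{440 - 396}{10}} = \sqrt{4.4} = 2.10 \text{ years}$$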
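The sum-of-squares route needs only two running totals, the sum of X and the sum of X squared. A small sketch with the same 11 ages, using illustrative variable names:

```python
import math

# Child's age (X) in years for the 11 chicken pox sufferers
ages = [2, 4, 5, 5, 6, 6, 6, 7, 7, 8, 10]

n = len(ages)
sum_x = sum(ages)                      # sum of X
sum_x_sq = sum(x * x for x in ages)    # sum of X^2

# Computational formula: sqrt((sum X^2 - (sum X)^2 / n) / (n - 1))
s = math.sqrt((sum_x_sq - sum_x ** 2 / n) / (n - 1))

print(f"sum X = {sum_x}, sum X^2 = {sum_x_sq}, s = {s:.2f} years")
```

Both formulas are algebraically identical. In floating-point arithmetic the computational form can lose precision when the values are large relative to their spread, so the squared-deviation form is usually the safer choice in software.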
Coefficient of Variation (CV)
What is the Coefficient of Variation?
The coefficient of variation measures variability in relation
to the mean (or average) and is used to compare the
relative dispersion in one type of data with the relative
dispersion in another type of data. The data to be compared
may be in the same units, in different units, with the same
mean, or with different means.
When is the Coefficient of Variation Useful?
Suppose you want to evaluate the relative dispersion of
grades for two classes of students: Class A and Class B. The
coefficient of variation can be used to compare these two
groups and determine how the grade dispersion in Class A
compares to the grade dispersion in Class B. This is one
example of how the coefficient of variation can be applied.
Coefficient of Variation
Relative variation rather than absolute
variation such as standard deviation
Definition of C.V.:
$$C.V. = \left(\frac{s}{\bar{x}}\right)(100)$$
Useful in comparing variation between two
distributions
Used particularly in comparing laboratory
measures to identify those determinations with
more variation
Also used in QC analyses for comparing
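As an illustration of the two-class comparison, the sketch below computes the C.V. for two sets of grades. The grades are hypothetical, invented only for this example, and `coefficient_of_variation` is an illustrative helper name.

```python
import statistics

def coefficient_of_variation(values):
    """Return the C.V. in percent: 100 * s / mean, where s is the sample standard deviation."""
    return 100 * statistics.stdev(values) / statistics.mean(values)

# Hypothetical grades, not taken from the lecture
class_a = [62, 70, 75, 81, 92]
class_b = [78, 80, 82, 84, 86]

cv_a = coefficient_of_variation(class_a)
cv_b = coefficient_of_variation(class_b)
print(f"Class A C.V. = {cv_a:.1f}%, Class B C.V. = {cv_b:.1f}%")
```

With these made-up numbers, Class A shows the larger relative dispersion. The C.V. is only meaningful for data with a true zero and a mean well away from zero.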
Standard Deviation of the Mean
(SE)
The standard deviation of the mean (often
called the standard error) is a measure of
the variation in means of repeated
samples. It is defined as the standard
deviation divided by the square root of the
sample size: $SE = s / \sqrt{n}$. To calculate the
standard deviation of the mean, do the
following:
1. Calculate the standard deviation (s).
2. Calculate the square root of the sample size (n).
3. Divide the standard deviation by the result of step 2.
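The three steps translate directly into code. This sketch applies them to the 11 ages from the chicken pox example earlier in the lecture; the variable names are illustrative.

```python
import math
import statistics

# Child's age (X) in years for the 11 chicken pox sufferers
ages = [2, 4, 5, 5, 6, 6, 6, 7, 7, 8, 10]

s = statistics.stdev(ages)        # step 1: sample standard deviation
root_n = math.sqrt(len(ages))     # step 2: square root of the sample size
se = s / root_n                   # step 3: standard error of the mean

print(f"s = {s:.2f}, sqrt(n) = {root_n:.2f}, SE = {se:.2f} years")
```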
Percentiles and Quartiles
Definition of Percentiles
Given a set of n observations $x_1, x_2, \ldots, x_n$, the
pth percentile P is the value of X such that p
percent or less of the observations are less
than P and (100 - p) percent or less are greater
than P
$P_{10}$ indicates the 10th percentile, etc.
Definition of Quartiles
First quartile is $P_{25}$
Second quartile is the median or $P_{50}$
Third quartile is $P_{75}$
Measures of Position
Quartiles, Deciles,
Percentiles
$Q_1$, $Q_2$, $Q_3$ divide ranked scores into four equal parts
[Figure: ranked scores from minimum to maximum, split into four 25% segments by $Q_1$, $Q_2$ (median), and $Q_3$]
Finding the Percentile of a
Given Score
$$\text{Percentile of score } x = \frac{\text{number of scores less than } x}{\text{total number of scores}} \cdot 100$$
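A short sketch of this formula, applied to the 11 ages from the chicken pox example. The helper name `percentile_of_score` is illustrative, and the count uses strictly smaller scores, as the formula states.

```python
def percentile_of_score(scores, x):
    """Percent of scores that are strictly less than x."""
    below = sum(1 for value in scores if value < x)
    return 100 * below / len(scores)

# Child's age (X) in years for the 11 chicken pox sufferers
ages = [2, 4, 5, 5, 6, 6, 6, 7, 7, 8, 10]

# Seven of the eleven ages are below 7
print(f"Percentile of age 7: {percentile_of_score(ages, 7):.1f}")
```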
Inter-quartile Range
Better description of distribution
than range
Range of middle 50 percent of the
distribution
Definition of Inter-quartile Range:
$IQR = Q_3 - Q_1$
[Figure: Frequency distributions of values with inter-quartile range of 5 to 9. Two panels, each marking the lower 25%, middle 50%, and upper 25% of values: one with values 1 to 14 (Range = 14 - 1 = 13) and one with values 1 to 21 (Range = 21 - 1 = 20).]
Interquartile Range (or IQR): $Q_3 - Q_1$
Semi-interquartile Range: $\frac{Q_3 - Q_1}{2}$
Midquartile: $\frac{Q_1 + Q_3}{2}$
10 - 90 Percentile Range: $P_{90} - P_{10}$
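These four measures can be computed once the quartiles and the 10th and 90th percentiles are known. The sketch below uses `statistics.quantiles` from the Python standard library on the 11 ages from the chicken pox example. Its `"exclusive"` method places each cut point at position p*(n+1) and interpolates linearly between neighbours, which is one convention among several, so other software may report slightly different values.

```python
import statistics

# Child's age (X) in years for the 11 chicken pox sufferers
ages = [2, 4, 5, 5, 6, 6, 6, 7, 7, 8, 10]

# Quartiles (n=4 gives three cut points) and deciles (n=10 gives nine)
q1, q2, q3 = statistics.quantiles(ages, n=4, method="exclusive")
deciles = statistics.quantiles(ages, n=10, method="exclusive")
p10, p90 = deciles[0], deciles[-1]

iqr = q3 - q1                        # interquartile range
semi_iqr = iqr / 2                   # semi-interquartile range
midquartile = (q1 + q3) / 2          # midquartile
percentile_range_10_90 = p90 - p10   # 10 - 90 percentile range

print(iqr, semi_iqr, midquartile, round(percentile_range_10_90, 1))
```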
Percentiles
A "percentile" shows how a single system may be
compared to all other systems. Percentiles range
from lowest (1) to highest (99), with the median
equal to 50.
The pth percentile (with p written as a proportion from 0 to 1)
is a value such that roughly 100p% of the data is smaller and
100(1-p)% of the data is larger. Percentiles can be computed for
ordinal, interval, or ratio data.
There are three steps for computing a percentile.
1. Sort the data from low to high;
2. Count the number of values (n);
3. Select the p*(n+1)th observation.
If p*(n+1) is not a whole number, then go halfway
between the two adjacent observations.
If p*(n+1) < 1, select the smallest observation.
If p*(n+1) > n, select the largest observation.
Examples
The following data represents cotinine levels in saliva
(nmol/l) after smoking. We want to compute the 50th
percentile.
73, 58, 67, 93, 33, 18, 147
1. Sorted data: 18, 33, 58, 67, 73, 93, 147
2. There are n=7 observations.
3. Select 0.50*(7+1) = 4th observation.
Therefore, the 50th percentile equals 67. Notice that
there are three observations larger than 67 and three
observations smaller than 67.
Suppose we want to compute the 20th percentile.
Notice that p*(n+1) = 0.20*(7+1)=1.6. This is not a
whole number so we select halfway between 1st and
2nd observation or 25.5. (Some people see the 1.6 and
think they have to go six tenths of the way to the
second value. You can do this if you like, but I think life
is too short to worry about such details.)
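The three-step rule, including the halfway shortcut and the two boundary cases, can be sketched as a small function. The name `percentile` and its signature are illustrative; p is passed as a proportion between 0 and 1, and a tolerance check guards against floating-point noise when deciding whether p*(n+1) is a whole number.

```python
import math

def percentile(data, p):
    """Percentile by the p*(n+1) rule, going halfway between neighbours when needed."""
    ordered = sorted(data)            # step 1: sort low to high
    n = len(ordered)                  # step 2: count the values
    position = p * (n + 1)            # step 3: 1-based position to select
    if position < 1:
        return ordered[0]             # below the first observation
    if position > n:
        return ordered[-1]            # beyond the last observation
    nearest = round(position)
    if math.isclose(position, nearest):
        return ordered[nearest - 1]   # whole number: take that observation
    lower = math.floor(position)
    return (ordered[lower - 1] + ordered[lower]) / 2  # halfway between neighbours

# Cotinine levels in saliva (nmol/l) from the example
cotinine = [73, 58, 67, 93, 33, 18, 147]
print(percentile(cotinine, 0.50), percentile(cotinine, 0.20))
```

Replacing the last line of the function with linear interpolation would give the "six tenths of the way" variant mentioned in the example.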
Suppose we want to compute the 10th percentile. Since
The five number summary
A five number summary uses percentiles to
describe a set of data. The five number
summary consists of
MAX - the maximum value
75% - the 75th percentile (3rd quartile)
50% - the 50th percentile (2nd quartile or
median)
25% - the 25th percentile (1st quartile)
MIN - the minimum value
The five number summary splits the data
into four regions, each of which contains
25% of the data.
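A five number summary is quick to assemble in code. This sketch uses the cotinine data from the percentile example and `statistics.quantiles` with the `"exclusive"` method, which places the quartiles at position p*(n+1); the helper name `five_number_summary` is illustrative, and other quartile conventions may give slightly different middle values.

```python
import statistics

def five_number_summary(data):
    """Return (MIN, 25%, 50%, 75%, MAX) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
    return min(data), q1, median, q3, max(data)

# Cotinine levels in saliva (nmol/l) from the percentile example
cotinine = [73, 58, 67, 93, 33, 18, 147]

labels = ("MIN", "25%", "50%", "75%", "MAX")
for label, value in zip(labels, five_number_summary(cotinine)):
    print(f"{label}: {value}")
```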
Summary
In practice, descriptive statistics play a
major role
Always the first 1-2 tables/figures in a paper
Statistician needs to know about each variable
before deciding how to analyze to answer
research questions
In any analysis, 90% of the effort goes
into setting up the data
Descriptive statistics are part of that 90%