Biostatistics: Descriptive Statistics


Biostatistics

Introduction
Descriptive Statistics

1
Introduction
Introduction to the basic concepts of statistics as applied to
problems in biological science.
 Goal of the course
 Understand statistical concepts (population, sample,
t-test, slope, significance, etc.);
 Identify appropriate methods for your data (e.g.,
paired t-test or independent t-test, one-way block or
two-way ANOVA);
 Select the correct SAS procedures to analyze data (several
SAS procedures may serve the same purpose; choose
whichever is more suitable);
 Scientific reading and interpretation.

2
Biostatistics+Computer
Applications
 Why study Biostatistics?
 Statistical methods are widely used in the biological
sciences;
 Examples are drawn from biology, so they are practical and
useful;
 The focus is on application rather than mathematical
derivation;
 It helps you evaluate published papers critically.
Statistics - the science and art of obtaining reliable results and conclusions from
data that are subject to variation.
Biostatistics (Biometry) - the application of statistics to the biological sciences.

3
Biostatistics+Computer
Applications
 Why Computer Applications?
 Many statistical methods are computationally
complicated (ANOVA, regression, etc.);
 Advances in computer technology and
statistical software make applying
statistical methods much easier today
than before;
 Software such as SAS takes time to learn.

4
Is Biostatistics hard to study?
 Factors that make it hard for some students to learn
statistics:
 The terminology is deceptive. To understand
statistics, you have to understand that the statistical
meanings of terms such as significant, error, and
hypothesis are distinct from the ordinary uses of
these words.
 Statistics requires mastering abstract concepts. It
is not easy to think about theoretical concepts
such as populations, probability distributions, and
null hypotheses.

5
Is Biostatistics hard? (cont)
 Statistics is at the interface of mathematics and science.
To really grasp the concepts of statistics, you need to be
able to think about them from both angles.
 The derivation of many statistical tests involves difficult
math. However, you can learn to use statistical tests and
interpret the results even if you do not fully understand
how they work. You only need to know enough about how
the tools work to avoid using them in
inappropriate situations.
Basically, you can calculate statistical tests and interpret
results even if you don’t understand how the equations
were derived, as long as you know enough to use the
statistical tests appropriately.

6
Questions about this class
 Is this class going to be hard?
 No. The concepts are easy and the procedures are clear.
 Why do we spend time on theoretical
stuff?
 It helps you understand the applications
 Do we need to know all the stuff?
 You may not need all of it, but be prepared

7
Role of statistics in Biological
Science
Step   Science                               Statistics
1.     Idea or question                      Mathematical model / hypothesis
2.     Collect data / make observations      Study design
3.     Describe data / observations          Descriptive statistics
4.     Assess the strength of evidence       Inferential statistics
       for / against the hypothesis

8
Contents of the course
 Descriptive statistics
 Graph, table, mean and standard deviation
 Inferential statistics
 Probability and distribution
 Hypothesis test
 Analysis of variance (ANOVA)
 Correlation and regression analysis
 Other special topics

9
Basic Concept

 Data
 numerical facts, measurements, or observations
obtained from an investigation or experiment
aimed at answering a question
 Statistical analyses deal with numbers
 Variable
 a characteristic that can take on different values
for different persons, places or things
 Statistical analyses need variability; otherwise
there is nothing to study
 Examples:
 Concentration of a substance, pH values obtained
from atmospheric precipitation, birth weight of
babies whose mothers are smokers, etc.
10
Basic Concept (cont.)

 Type of Variable
 Continuous variable
 Between any two values of a variable, there is
another possible value
 Examples: height, weight, concentration
 Discrete variable
 Values can only be integers (counts)
 Examples: number of people, number of plants, etc.

11
Basic Concept (cont.)

 Population
 Population: a set or collection of objects
we are interested in (may be finite or infinite).
 Parameter: a descriptive measure associated with a
variable of an entire population, usually unknown because
the whole population cannot be enumerated.
Examples of populations:
plant height under warming conditions;
graduates in the US; smokers in the world.
12
Basic Concept (cont.)
 Sample
 Sample: a subset of subjects drawn from a
population and used to make inferences about the
population;
 Random sample: A sample of size n drawn from a
population of size N in such a way that every
possible sample of size n has the same chance of
being selected.
 Statistic: a descriptive measure computed from the
values of a variable in a sample.
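
A minimal sketch of drawing a random sample in Python, assuming a hypothetical population of N = 20 numbered plants (the IDs and the seed are illustrative, not from the course data). `random.sample` draws without replacement, so every subset of size n is equally likely.

```python
import random

# Hypothetical population: ID numbers of N = 20 plants (illustrative only).
population = list(range(1, 21))
n = 5  # sample size

rng = random.Random(42)              # fixed seed so the draw is reproducible
sample = rng.sample(population, n)   # simple random sample, without replacement

# A statistic is computed from the sample; a parameter from the whole population.
sample_mean = sum(sample) / n
population_mean = sum(population) / len(population)

print("sample:", sorted(sample))
print("sample mean (statistic):", sample_mean)
print("population mean (parameter):", population_mean)
```

In practice the population mean is unknown; it is computed here only because the toy population is fully enumerated.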

13
Basic Concept (cont.)

 Population and Sample

[Diagram: a sample is drawn from the population. The sample statistic is used to generalize to the population parameter; the population is used to predict properties of the sample.]

Sample → Population, Statistic → Parameter

14
Descriptive Statistics
 Graphical Summaries
 Frequency distribution
 Histogram
 Stem and Leaf plot
 (Barplot, Boxplot)
 Numerical Summaries
 Location - mean, median, mode.
 Spread - range, variance, standard deviation
 (Shape - skewness, kurtosis)

15
Frequency Distribution (discrete var.)

 Example: Number of sedge plants, Carex flacca, found in each of
800 sample quadrats (1 m²) in an ecological study of grasses:

1, 4, 1, 0, 0, 1, 0, 0, 2, 3, 1, 2, 3, 1, 0, 2, 0, 1, 2,
………………………….
1, 2, 3, 2, 1, 1, 0, 5, 0, 0, 1, 0, 1, 0, 2, 4, 7, 2, 1,0

How is the plant number in a quadrat distributed?

16
Frequency Distribution (discrete var.)

 Table 1. Frequency, relative frequency, and cumulative relative
frequency of sedge plants per quadrat (n = 800 quadrats).

Plants/quadrat (Xi)   Frequency (fi)   Relative frequency (%)   Cumulative relative frequency (%)
0                     268              33.500                   33.500
1                     316              39.500                   73.000
2                     135              16.875                   89.875
3                     61               7.625                    97.500
4                     15               1.875                    99.375
5                     3                0.375                    99.750
6                     1                0.125                    99.875
7                     1                0.125                    100.000
Total                 800              100.000

• frequency - the number of times a value occurs in the data (for a population, the corresponding proportion is a probability).
• relative frequency - the % of the time that the value occurs (frequency/n × 100).
• cumulative relative frequency - the % of the sample that is equal to or smaller
than the value (cumulative frequency/n × 100).
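
The three definitions can be checked with a short Python sketch. It starts from the frequencies in Table 1 (the raw quadrat counts are elided in the slide, so they are not reconstructed here); the variable names are illustrative.

```python
# Frequencies from Table 1: plants per quadrat -> number of quadrats.
freq = {0: 268, 1: 316, 2: 135, 3: 61, 4: 15, 5: 3, 6: 1, 7: 1}

n = sum(freq.values())  # total number of quadrats

rows = []
cumulative = 0
for value in sorted(freq):
    cumulative += freq[value]
    rel = freq[value] / n * 100       # relative frequency (%)
    cum_rel = cumulative / n * 100    # cumulative relative frequency (%)
    rows.append((value, freq[value], rel, cum_rel))

for value, f, rel, cum_rel in rows:
    print(f"{value:>2} {f:>4} {rel:8.3f} {cum_rel:8.3f}")
```

With the raw observations at hand, `collections.Counter(data)` would produce the `freq` mapping directly.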
17
Frequency Distribution (cont. var.)

 Grouping of continuous outcomes
 Examples: weight, height.
 Grouped values show what the data look like better
than the individual values do
 Example: Fiber length (mm) of a cotton variety (n=106)
Data:
27.5,28.6,29.4,30.5,31.4,29.8,27.6,28.7,27.6
…………
31.8,32.0,27.8

18
Frequency Distribution

Table 2. Frequency and relative frequency distribution
of fiber length (mm) of a cotton variety (n=106)

Length (Xi, mm)   Frequency (fi)   Relative frequency (%)   Cumulative relative frequency (%)
27.0~27.5         1                0.94                     0.94
27.5~28.0         3                2.83                     3.77
28.0~28.5         6                5.66                     9.43
28.5~29.0         13               12.26                    21.70
29.0~29.5         18               16.98                    38.68
29.5~30.0         19               17.92                    56.60
30.0~30.5         17               16.04                    72.64
30.5~31.0         16               15.09                    87.74
31.0~31.5         6                5.66                     93.40
31.5~32.0         5                4.72                     98.11
32.0~32.5         2                1.89                     100.00
Total             106              100.00

19
Frequency Distribution (cont. var.)

 Calculate the range: R = max(X) - min(X) = 5.13
 Set the number of intervals g and the interval
width i
 Some “rules” exist, but generally create 8-15
equal-sized intervals; here g = 11
 i = R/(g-1) = 5.13/10 ≈ 0.5
 Set the intervals
 L1 = min(X) - i/2 = 27.0, L2 = L1 + i = 27.5, …
 Count the number of observations in each interval
20
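
The grouping steps (range, interval width, lower limits, counting) can be sketched in Python. This is only an illustration: it uses the twelve fiber lengths that are visible in the slide rather than the full n = 106 data set, and g = 5 is an arbitrary choice for so few values.

```python
# Only the fiber lengths (mm) visible in the slide; the full data set is elided.
data = [27.5, 28.6, 29.4, 30.5, 31.4, 29.8, 27.6, 28.7, 27.6, 31.8, 32.0, 27.8]

g = 5                               # number of intervals (assumed for this small subset)
r = max(data) - min(data)           # range R
width = r / (g - 1)                 # interval width i = R / (g - 1)
lower = min(data) - width / 2       # first lower limit L1 = min(X) - i/2

counts = [0] * g
for x in data:
    k = int((x - lower) / width)    # index of the interval containing x
    counts[k] += 1

for k, c in enumerate(counts):
    lo = lower + k * width
    print(f"{lo:6.2f} ~ {lo + width:6.2f}  {c}")
```

Because L1 sits half an interval below the minimum, the smallest and largest observations fall in the middle of the first and last intervals, so no value lands outside the g intervals.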
Histogram (Bar graph) and polygon

 Histogram: a graph of frequencies
 Can be used to visually compare frequencies
 Makes it easier to assess the magnitude of differences
than judging from the numbers alone
 Frequency polygon - similar to histogram

[Figure: two panels showing the same data as a histogram and as a frequency polygon; x-axis: Plants/Quadrat, y-axis: Frequency]

Fig. 1. Frequency distribution of plants in a quadrat.
22


Histogram (Bar graph) and polygon

[Figure: histogram and frequency polygon of fiber length (x-axis: Length (mm), y-axis: Frequency), plus a third panel of cumulative relative frequency (%) against Length (mm)]

Fig. 2. Frequency distribution of fiber length of a cotton variety.


23
Stem-and-Leaf Displays

 Another way to assess frequencies
 Preserves the individual measurements, so it is
not practical for large data sets
 Stem is the first digit(s) of each measurement, leaves are
the last digit
 Most useful for two-digit numbers, more
cumbersome for three or more digits

Tally        Stem | Leaf
20: X        2*   | 1
30: XXX      3*   | 244
40: XXXX     4*   | 2468
50: XX       5*   | 26
60: X        6*   | 4
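
A small Python sketch of how such a display can be built. The eleven values are read off the display above (stem = tens digit, leaf = units digit); the variable names are illustrative.

```python
from collections import defaultdict

# Values implied by the display: 2*|1 -> 21, 3*|244 -> 32, 34, 34, and so on.
data = [21, 32, 34, 34, 42, 44, 46, 48, 52, 56, 64]

stems = defaultdict(list)
for x in sorted(data):
    stem, leaf = divmod(x, 10)   # split each value into tens and units
    stems[stem].append(leaf)

lines = [
    f"{stem}* | {''.join(str(leaf) for leaf in leaves)}"
    for stem, leaves in sorted(stems.items())
]
print("\n".join(lines))
```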
24
Summary

 In practice, descriptive statistics play a major role
 Always the first 1-2 tables/figures in a paper
 A statistician needs to know about each variable
before deciding how to analyze the data to answer
the research questions
 In any analysis, 90% of the effort goes into
setting up the data
 Descriptive statistics are part of that 90%

25
Descriptive Statistics:
Measures of Location

 Descriptive measure computed from population data - parameter
 Descriptive measure computed from sample data - statistic
 Most common measures of location
 Mean
 Median
 Mode
 Geometric Mean, harmonic mean

26
Arithmetic mean (population)

Suppose we have N measurements of a particular variable in
a population. We denote these N measurements as:
X1, X2, X3,…,XN
where X1 is the first measurement, X2 is the second, etc.

Definition
$$\mu = \frac{1}{N}X_1 + \frac{1}{N}X_2 + \dots + \frac{1}{N}X_N = \frac{\sum_{i=1}^{N} X_i}{N}$$
More accurately called the arithmetic mean, it is defined as the
sum of the measurements observed divided by the number of
observations.

27
Arithmetic mean (sample)

Sample: Suppose we have n measurements of a particular
variable, drawn from a population with N measurements. The n
measurements are:
X1, X2, X3,…,Xn
where X1 is the first measurement, X2 is the second, etc.
Definition
$$\bar{x} = \frac{1}{n}X_1 + \frac{1}{n}X_2 + \dots + \frac{1}{n}X_n = \frac{\sum X_i}{n}$$

28
Arithmetic mean (sample)

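Some Properties of the Arithmetic Mean

1. Writing the deviations as $x_i = X_i - \bar{x}$, the deviations sum to zero:
$$\sum x_i = \sum (X_i - \bar{x}) = 0$$

2. The sum of squared deviations about the mean is a minimum:
$$\sum x_i^2 = \sum (X_i - \bar{x})^2 = \min$$

Proof:
1. $$\sum x_i = \sum (X_i - \bar{x}) = \sum X_i - n\bar{x} = 0$$

2. Let $x' = \bar{x} + e$ be any other value. Then
$$\sum (X_i - x')^2 = \sum (X_i - \bar{x} - e)^2 = \sum [(X_i - \bar{x}) - e]^2$$
$$= \sum [(X_i - \bar{x})^2 - 2e(X_i - \bar{x}) + e^2] = \sum (X_i - \bar{x})^2 - 2e\sum (X_i - \bar{x}) + \sum e^2$$
$$= \sum (X_i - \bar{x})^2 + \sum e^2 \ge \sum (X_i - \bar{x})^2$$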
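
Both properties can be checked numerically. The sketch below uses three hypothetical values (not course data) and shifts the centre by several offsets e; the sum of squares about the shifted centre should exceed the minimum by n·e².

```python
# Hypothetical measurements, chosen only to keep the arithmetic easy to follow.
data = [2, 4, 9]
n = len(data)
mean = sum(data) / n


def sum_sq(center):
    """Sum of squared deviations of the data about `center`."""
    return sum((x - center) ** 2 for x in data)


# Property 1: deviations from the mean sum to zero.
deviation_sum = sum(x - mean for x in data)
print("sum of deviations:", deviation_sum)

# Property 2: any other centre x' = mean + e gives a larger sum of squares.
for e in (-2, -0.5, 0, 0.5, 2):
    print(f"e = {e:5}: SS about mean + e = {sum_sq(mean + e):.2f}, "
          f"SS about mean + n*e^2 = {sum_sq(mean) + n * e ** 2:.2f}")
```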

29
Median

 Frequently used if there are extreme values in a
distribution or if the distribution is non-normal
 Definition
 The value that divides the ‘ordered array’ into two
equal parts
 If there is an odd number of observations, the median Md is
the ((n+1)/2)th observation
 ex.: median of 11 observations is the 6th observation
 If there is an even number of observations, the median Md is
the midpoint between the middle two observations
 ex.: median of 12 observations is the midpoint between
the 6th and 7th

30
Mode

 Definition
 Value that occurs most frequently in data set
 Example
2 3 4 5 3 4 5 6 7 5 3 2 5, mode Mo=5
 If all values different, no mode
 May be more than one mode
 Bimodal or multimodal
Not used very frequently in practice

31
Example: Central Location
Suppose the ages of the 10 trees you are studying are:
34,24,56,52,21,44,64,34,42,46
Then the mean age of this group is:
$$\bar{x} = \frac{1}{n}\sum X_i = (34+24+56+52+21+44+64+34+42+46)/10 = 417/10 = 41.7 \text{ years}$$
The mean is the most commonly used measure of location.
To find the median, first order the data:
21,24,34,34,42,44,46,52,56,64
$$\text{Median} = \frac{1}{2}\left(X_{(10/2)} + X_{(10/2+1)}\right) = \frac{1}{2}(42+44) = 43 \text{ years}$$

The mode is 34 years, Mo=34 (occurred twice).
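
The same three measures can be obtained with Python's standard `statistics` module; this is offered as a quick cross-check of the hand calculation, not as the course's SAS workflow.

```python
import statistics

# Ages (years) of the 10 trees from the example.
ages = [34, 24, 56, 52, 21, 44, 64, 34, 42, 46]

mean_age = statistics.mean(ages)      # arithmetic mean
median_age = statistics.median(ages)  # midpoint of the two middle values (n is even)
mode_age = statistics.mode(ages)      # most frequent value

print("mean:", mean_age)
print("median:", median_age)
print("mode:", mode_age)
```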


32
Geometric mean

 Used to calculate mean growth rate
 Definition
$$G = (X_1 X_2 \cdots X_n)^{1/n}$$
 Antilog of the mean of the log Xi
$$\log G = \frac{\log X_1 + \log X_2 + \dots + \log X_n}{n}$$

33
Geometric mean

 Example: Root growth at 25 °C; calculate the
mean growth rate.
Day     Root length (mm)   Growth rate (Xi, ratio to previous day)   log10(Xi)
0       17
1       23                 1.352941176                               0.131279
2       30                 1.304347826                               0.115393
3       38                 1.266666667                               0.102662
4       51                 1.342105263                               0.127787
5       72                 1.411764706                               0.149762
6       86                 1.194444444                               0.077166
Total                      7.872270083                               0.70405
$$\log G = \frac{0.7040}{6} = 0.1173, \quad G = \log^{-1}(0.1173) = 10^{0.1173} = 1.31$$
That is, the root length is multiplied by about 1.31 per day on average.
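
A short Python sketch of the same calculation, starting from the root lengths in the table. Each growth rate is taken as the ratio of one day's length to the previous day's, which is how the tabulated Xi values are obtained.

```python
import math

# Root length (mm) on days 0..6, from the table.
lengths = [17, 23, 30, 38, 51, 72, 86]

# Daily growth rate Xi = length on day t / length on day t-1.
rates = [b / a for a, b in zip(lengths, lengths[1:])]

# Geometric mean as the antilog of the mean log10 rate.
log_g = sum(math.log10(r) for r in rates) / len(rates)
g = 10 ** log_g

print("log G =", round(log_g, 4))
print("G =", round(g, 2))
```

Since the product of the six ratios telescopes to 86/17, G is also the sixth root of 86/17.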
34
Descriptive Statistics
Measures of Dispersion

 Look at these two data sets:
Set 1: 100, 30, 20, 7, -20, -30, -100
Set 2: 10, 3, 2, 7, -2, -3, -10

 If we calculate the mean:
Set 1. $n = 7, \bar{x} = 1$
Set 2. $n = 7, \bar{x} = 1$
The means are equal, so how do we measure dispersion (spread,
variability)?
35
Descriptive Statistics
Measures of Dispersion

 Common measures
 Range
 Variance and Standard deviation
 Coefficient of variation
 Many distributions are well described by a
measure of location together with a measure of dispersion

36
Range

 Range is the difference between the largest
and smallest values in the data set
 R=Max(Xi)-Min(Xi)
 Heavily influenced by the two most extreme values and
ignores the rest of the distribution
Set 1: 100, 30, 20, 7, -20, -30, -100
Set 2: 10, 3, 2, 7, -2, -3, -10
 R1=200
 R2=20

37
Variance and standard deviation
(population)

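Suppose we have N measurements of a particular variable in
a population: X1, X2, X3,…,XN.
The mean is $\mu$. Because $\sum (X_i - \mu) = 0$, the raw deviations cannot measure spread, so we define:
$$\sigma^2 = \frac{1}{N}(X_1 - \mu)^2 + \frac{1}{N}(X_2 - \mu)^2 + \dots + \frac{1}{N}(X_N - \mu)^2 = \frac{\sum (X_i - \mu)^2}{N}$$
as the variance; its unit is the square of the unit of X, and
$$\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}$$
as the standard deviation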
38
Variance and standard deviation
(sample)

Suppose we have n measurements of a particular variable in a
sample: X1, X2, X3,…,Xn.
The mean is $\bar{x}$; we define:
$$s^2 = \frac{\sum (X_i - \bar{x})^2}{n-1}$$
as the mean square, or sample variance, and
$$s = \sqrt{\frac{\sum (X_i - \bar{x})^2}{n-1}}$$
as the standard deviation
39
Variance and standard deviation

$$s^2 = \frac{\sum (X_i - \bar{x})^2}{n-1}$$
$$SS = \sum (X_i - \bar{x})^2 = \sum X_i^2 - \frac{(\sum X_i)^2}{n}$$
Corrected Sum of Squares (CSS)

$df = \nu = n - 1$: degrees of freedom
 n-1 is used because if we know n-1 deviations, the
nth deviation is known
 Deviations have to sum to zero
40
Example:

Suppose the ages of the 10 trees you are studying are:
34,24,56,52,21,44,64,34,42,46. We calculated $\bar{x} = 41.7$.
Calculate the range, variance, standard deviation and CV.

No.     Xi    x_bar   Xi-x_bar   (Xi-x_bar)^2   Xi^2
1       34    41.7    -7.7       59.29          1156
2       24    41.7    -17.7      313.29         576
3       56    41.7    14.3       204.49         3136
4       52    41.7    10.3       106.09         2704
5       21    41.7    -20.7      428.49         441
6       44    41.7    2.3        5.29           1936
7       64    41.7    22.3       497.29         4096
8       34    41.7    -7.7       59.29          1156
9       42    41.7    0.3        0.09           1764
10      46    41.7    4.3        18.49          2116
Total   417           0          1692.1         19081

R = 64-21 = 43 years, s² = 1692.1/9 = 188.01 years², s = 13.71 years, CV = s/x_bar × 100 = 13.71/41.7 × 100 = 32.9%.
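
The worked example can be reproduced in a few lines of Python. The sketch computes the corrected sum of squares both from the definition and from the shortcut formula SS = ΣXi² - (ΣXi)²/n, then the sample variance and standard deviation with n-1 degrees of freedom.

```python
import math

# Ages (years) of the 10 trees from the example.
ages = [34, 24, 56, 52, 21, 44, 64, 34, 42, 46]
n = len(ages)
mean = sum(ages) / n

# Corrected sum of squares, two equivalent ways.
ss_definition = sum((x - mean) ** 2 for x in ages)
ss_shortcut = sum(x * x for x in ages) - sum(ages) ** 2 / n

variance = ss_definition / (n - 1)   # sample variance, n - 1 degrees of freedom
sd = math.sqrt(variance)             # sample standard deviation
value_range = max(ages) - min(ages)

print("range:", value_range)
print("SS (definition):", round(ss_definition, 2))
print("SS (shortcut):", round(ss_shortcut, 2))
print("variance:", round(variance, 2))
print("standard deviation:", round(sd, 2))
```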


41
Coefficient of Variation

 Relative variation rather than absolute
variation such as standard deviation
 Definition of C.V.
$$CV = \frac{s}{\bar{x}} \times 100$$
 Useful in comparing variation between
two distributions
 Used particularly in comparing laboratory
measures to identify those determinations
with more variation
42
Example:
Set 1: 100, 30, 20, 7, -20, -30, -100
Set 2: 10, 3, 2, 7, -2, -3, -10
Calculate $\bar{x}$, s², s and CV.

Set   x_bar   s²       s      CV (%)
1     1       3773.7   61.4   6143
2     1       44.7     6.7    668
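
A minimal Python check of this example, applying the definition CV = s / x_bar × 100 with the sample standard deviation (n-1 in the denominator). The helper name `cv_percent` is illustrative.

```python
import statistics

set1 = [100, 30, 20, 7, -20, -30, -100]
set2 = [10, 3, 2, 7, -2, -3, -10]


def cv_percent(values):
    """Coefficient of variation in percent: sample SD divided by the mean, times 100."""
    return statistics.stdev(values) / statistics.mean(values) * 100


for name, values in (("Set 1", set1), ("Set 2", set2)):
    print(name,
          "mean =", statistics.mean(values),
          "s2 =", round(statistics.variance(values), 1),
          "s =", round(statistics.stdev(values), 1),
          "CV% =", round(cv_percent(values)))
```

Because both means equal 1, the CV in percent is simply 100 times the standard deviation here.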

43
Descriptive Statistics (Summary)
 Graphical Summaries
 Frequency distribution
 Histogram
 Stem and Leaf plot
 Boxplot
 Numerical Summaries
 Location - mean, median, mode.
 Dispersion - range, variance, standard deviation
 Shape - (lab)

44
Software

 Statistical software
 SAS
 SPSS
 Stata
 BMDP
 MINITAB
 Graphical software
 Sigmaplot
 Harvard Graphics
 PowerPoint
 Excel
45
Box Plots (explain later)

 Descriptive method to convey information
about measures of location and dispersion
 Box-and-Whisker plots
 Construction of boxplot
 Box spans the interquartile range (IQR)
 Line at the median
 Whiskers extend to the smallest and largest observations
 Other conventions can be used, especially to
represent extreme values
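
The numbers a boxplot draws can be sketched in Python as a five-number summary. The tree ages from the earlier example serve as illustrative data, and the `"inclusive"` quartile method is an assumption: software packages use different quartile conventions, so the box edges may differ slightly between programs.

```python
import statistics

# Tree ages (years) from the earlier example, used here only as illustrative data.
ages = [34, 24, 56, 52, 21, 44, 64, 34, 42, 46]

# Quartiles: the box runs from q1 to q3 with a line at the median.
q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1

# Whiskers at the smallest and largest observations (the simplest convention).
low_whisker, high_whisker = min(ages), max(ages)

print("min:", low_whisker)
print("Q1:", q1)
print("median:", median)
print("Q3:", q3)
print("max:", high_whisker)
print("IQR:", iqr)
```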

46
Box Plots

[Figure: box plots of the increment in systolic blood pressure (y-axis) for drugs 1 to 4 (x-axis)]
47
