Introduction to

Biostatistics
Mag.Dr. Christian Fesl and Mag. Gabriel Singer
Contents

Basic definitions
Graphs
Data and data matrix
Data description: measures of location and variation
Empirical and theoretical frequency distributions
Normal distribution and standard normal distribution
The Central Limit Theorem
Confidence intervals
Accuracy and precision; estimation of sample size
Statistical decision theory
Statistical tests: t-test, U-test,...
Analysis of variance (ANOVA): 1-way, 2-way
Correlations and regressions
Design (Sampling, Experiment)
Presentation of results in text, table, graphs
Basic definitions

Statistics = systematic collection and display of numerical data; mathematical
method to handle uncertainty (random error)

1. Descriptive statistics (= explorative statistics = deductive statistics)


central tendency
variation
frequency distribution

2. Statistical inference (= inductive statistics)


use of a sample to draw conclusions about a population
Basic definitions
• Population = entire collection of people, animals, plants or things from which
we may collect data

• Sample = group of units selected from a larger group (random samples!)


• Sample unit = person, animal, plant or thing which is actually studied by a
researcher (the basic object).
1 sample unit delivers only one independent value per variable (a variate),
cf. „sample“ in colloquial use!
• Parameter = value representing a certain population characteristic
• Statistic = quantity calculated from the sample data to give information
about parameters

• Estimation = process of indicating the value of an unknown quantity in a
population (estimator, estimate)
“a sample statistic estimates a population parameter”
• Sampling distribution describes probabilities associated with a statistic
(= probability distribution for the statistic)
[Diagram: a population from which several samples are drawn, each consisting of sample units]
Probability theory

Probability = quantitative description of the likeliness of occurrence of a


particular event

scale from 0 to 1

long-run relative frequency

equally-likely outcomes model (Laplace):

P(E) = (number of outcomes corresponding to event E) / (total number of outcomes)
Probability theory

• Outcome = is one result of an experiment or other situation involving


uncertainty

• Event = any collection of outcomes of an experiment


Impossible event: P(E) = 0
Inevitable event: P(E) = 1
Complementary event Ē: P(Ē) = 1 – P(E)

• Sample space = exhaustive list of all possible outcomes of an experiment


(universe, population)
Probability theory

• Independent events
no influence on each other

P(A ∩ B) = P(A) · P(B)

Example: A man and a woman each have a pack of 52 playing cards. Find
the probability that they (i) each and (ii) both draw the ace of clubs.

• Mutually exclusive events


impossible to occur together

A ∩ B = ∅, i.e. P(A ∩ B) = 0

Example: A subject in a study cannot be both male and female.


Probability theory

• Addition rule (for mutually exclusive events)
P that event E1 or E2 or ... or En occurs
P(E1 ∪ E2 ∪ ... ∪ En) = P(E1) + P(E2) + ... + P(En)

• Multiplication rule (for independent events)
P that event E1 and E2 and ... and En occurs
P(E1 ∩ E2 ∩ ... ∩ En) = P(E1) · P(E2) · ... · P(En)

• Conditional probability, law of total probability and Bayes´ Theorem


Mathematical terms and notation

• Variable X = the actual property measured by individual observation


• Value x = a single observation of a variable (case, variate)
xi = ith value of variable X: <x1, x2, x3, ..., xi, ..., xn>

Σ (i = 1 to n) xi = sum of the xi = x1 + x2 + x3 + ... + xi + ... + xn

Π (i = 1 to n) xi = product of the xi = x1 · x2 · x3 · ... · xi · ... · xn

• Function
if values of X correspond with values of variable Y, there is a functional
dependence
Y = f(x), Y = dependent, X = independent
e.g. y = f (x) = a + bx
Mathematical terms and notation

• Logarithm

logA(x) = y  ⇔  A^y = x

A = base
x = numerus (antilogarithm)
y = logarithm

log10(x) = lg(x) = common logarithm


loge(x) = ln(x) = natural logarithm (e = 2.718.....)

log(A B) = log(A) + log(B)


log(A / B) = log(A) – log(B)
log(A^B) = B · log(A)
Graphs (= charts, diagrams, plots)

• Abscissa (x-axis)
• Ordinate (y-axis)
• Origin
Graphs (= charts, diagrams, plots)

(a) Bar chart / column graph – with variation (e.g. confidence intervals)
(b) Scatter plot – with regression line
(c) Line graph
(d) Pie chart
(e) Box-(whisker-)plot

[Example panels (a)–(e): bar chart with error bars, scatter plot with fitted
regression line, line graph, pie chart and box plot; axis values omitted]
Data

• result from interviews, observations, measurements or experiments


• preferably noted using numbers

Categories of data according to the level of scale

Non-metric scale
• Nominal variable / attribute: classification of qualitative expressions of
  properties. Example: eye colour. Possible calculations: frequency
  distribution; location: mode.
• Ordinal / ranked variable: ordination of ranks possible. Example: military
  ranks. Possible calculations: frequency distribution; location: median;
  variation: range, percentiles; correlation of ranks.

Metric scale
• Discontinuous / discrete measurement variable: values form an aggregate of
  separate properties (discrete events), no intermediate values possible.
  Example: number of animals. Possible calculations: probability distribution;
  location: arithmetic mean; variation: standard deviation (approximation by a
  continuous distribution possible).
• Continuous measurement variable: realisation of any value within a given
  interval possible. Example: body mass. Possible calculations: probability
  density function; location: arithmetic mean; variation: standard deviation.
Data matrix

Structured data for further processing in statistical packages

             Variables
Objects      X1    X2    ...   Xj    ...   Xk
O1           x11   x12   ...   x1j   ...   x1k
O2           x21   x22   ...   x2j   ...   x2k
:            :     :     ...   :     ...   :
Oi           xi1   xi2   ...   xij   ...   xik
:            :     :     ...   :     ...   :
On           xn1   xn2   ...   xnj   ...   xnk

n = sample size (number of objects)
k = number of variables

Classification of analyses with respect to the number of variables

Univariate: analysis with one variable
Bivariate: analysis with two variables
Multivariate: analysis with more than two variables
Nominal scale

Examples: Colour of eyes, names, sex, bits, presence-absence, …

Operations: Equality / inequality

Statistics:

Absolute frequency F
Relative frequency f = F / n (proportion)
n = total number of objects
Mode x* = most frequent value
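
As an illustration (not from the original slides), a minimal R sketch of these
statistics, using hypothetical eye-colour observations:

# Hypothetical eye-colour observations (values assumed for illustration)
eye <- c(rep("green", 4), rep("blue", 2), rep("brown", 9), rep("grey", 5))
Fi <- table(eye)                     # absolute frequencies F
fi <- Fi / length(eye)               # relative frequencies f = F / n
x_mode <- names(Fi)[which.max(Fi)]   # mode x* = most frequent value
Fi; fi; x_mode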
Nominal scale

Eye colour      Counts        Fi     fi
x1 (green)      IIII           4     0.20
x2 (blue)       II             2     0.10
x3 (brown)      IIIII IIII     9     0.45
x4 (grey)       IIIII          5     0.25
Sum                           20     1.00

[Bar charts of F and f per eye colour – the order along X is interchangeable!]


Ordinal scale

Examples: Water quality index, grades, military ranks......

Operations: Ranking

Statistics:

Percentiles pi, i ∈ [1, 100]


Deciles D1 = p10
Quartiles Q1 = p25 , Q2 = p50 , Q3 = p75 , Q4 = p100
Minimum / maximum
Median

Range = maximum - minimum


Interquartile range (IQR) = Q3 – Q1
Ordinal scale
Education                 Fi    Cumulative F    fi      Cumulative f
x1 (No education)         65         65         0.25        0.25
x2 (Elementary school)    63        128         0.25        0.50
x3 (Work)                 64        192         0.25        0.75
x4 (High school)          43        235         0.17        0.92
x5 (University)           21        256         0.08        1.00
Sum                      256                    1.00

[Figures: frequency polygon (F, f versus X) and cumulative frequency polygon
(= 'ogive', cumulative F, f versus X, with the median and quartiles marked)]
Metric scale – Discrete (discontinuous / meristic) variables

Examples: number of trees in a plot, counts of animals


Operations: discrete frequency distribution
            bar chart (gaps!)
            discrete probability distributions

No. of trees in plot     F      f        Cumulative f
 1                       0      0        0
 2                       1      0.025    0.025
 3                       2      0.05     0.075
 4                       4      0.1      0.175
 5                       5      0.125    0.3
 6                       8      0.2      0.5
 7                      10      0.25     0.75
 8                       5      0.125    0.875
 9                       3      0.075    0.95
10                       2      0.05     1
11                       0      0        1
Sum                     40      1

[Bar chart: absolute frequency of plots versus number of trees in plot]

Large sample space and large sample size → approximation by continuous
distributions
Metric scale – Continuous variables

Examples: fish length, body weight, count data (approximated)

Operations: Continuous frequency distribution


histograms (any value possible, no „gaps“)
continuous probability distributions

Raw data
classes (= consecutive categories)
frequency distribution
Weight (kg) Abs. frequency (F) Rel. f Cumulative f Class center
45 - <50 0 0/100= 0.0 0 47.5
50 - <55 3 3/100= 0.03 0.03 52.5
55 - <60 13 13/100=0.13 0.16 57.5
60 - <65 20 20/100=0.20 0.36 62.5
65 - <70 33 33/100=0.33 0.69 67.5
70 - <75 25 25/100=0.25 0.94 72.5
75 - <80 5 5/100= 0.05 0.99 77.5
80 - <85 1 1/100= 0.01 1 82.5
Sum 100 1.00
Metric scale – Continuous variables

Raw data
classes (= consecutive categories)
frequency distribution
bar chart without gaps = histogram

f = bar height
f = bar area!

[Histograms: relative frequency (%) versus classes]
Metric scale – Continuous variables

Statistics:

Arithmetic mean:  x̄ = (1/n) · Σ (i = 1 to n) xi

Standard deviation s (variance s²):  s = √[ Σ (i = 1 to n) (xi – x̄)² / (n – 1) ]

Coefficient of variation:  C.V. = (s / x̄) · 100

Skewness = degree of asymmetry (Sk)

Kurtosis = degree of peakedness (K)

Geometric mean:  xg = ⁿ√( Π (i = 1 to n) xi )
Sample statistics

describe observed (empirical) frequency distributions:


location: average value, minimum, maximum
variation: dispersion around average value
shape of the distribution: symmetry, peakedness...

describe average trend of a distribution with a few values only


for further statistical analysis
Sample statistics

1. Central tendency and other measures of location (first moment)

Mode x*, median x̃ (= p50 = Q2), arithmetic mean x̄, geometric mean xg,
weighted mean x̄w = (f1·x1 + f2·x2 + ... + fn·xn) / (f1 + f2 + ... + fn)

Arithmetic mean:  x̄ = (1/n) · Σ (i = 1 to n) xi = (1/n) · (x1 + x2 + ... + xn)

Geometric mean:  xg = ⁿ√(x1 · ... · xn)  or  xg = antilog[ (1/n) · Σ ln(x) ]

For ln(x + 1)-transformed values:  xg = antilog[ (1/n) · Σ ln(x + 1) ] – 1
Sample statistics

1. Central tendency and other measures of location (first moment)

Median:  for odd n:   x̃ = x((n+1)/2)

         for even n:  x̃ = ½ · ( x(n/2) + x(n/2 + 1) )

50% of the values are below, 50% are above the median

Mode: x* , most frequent value

Percentiles, deciles, quartiles

Minimum, maximum
Sample statistics

2. Measures of spread/variation (second moment)

Range:  sM = x(n) – x(1) = max – min

Interquartile range:  IQR = Q3 – Q1

Standard deviation:  s = √[ Σ (i = 1 to n) (xi – x̄)² / (n – 1) ]

Variance:  s² = Σ (i = 1 to n) (xi – x̄)² / (n – 1)

Coefficient of variation:  C.V. = s / x̄

Standard error of the arithmetic mean:  S.E.M. = s / √n

information about the quality of a measurement:  x̄ ± S.E.M.
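
A minimal R sketch of these sample statistics, assuming a hypothetical vector
of body weights (all values invented for illustration):

set.seed(1)
x <- rnorm(25, mean = 70, sd = 8)    # hypothetical body weights (kg)
mean(x); median(x)                   # arithmetic mean, median
quantile(x, c(0.25, 0.75)); IQR(x)   # quartiles and interquartile range
sd(x); var(x)                        # standard deviation and variance (n - 1 denominator)
100 * sd(x) / mean(x)                # coefficient of variation (%)
sd(x) / sqrt(length(x))              # standard error of the mean (S.E.M.)
exp(mean(log(x)))                    # geometric mean (positive values only)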
Sample statistics

3. Asymmetry (third moment)

Skewness Sk

Normal distribution = symmetrical around the mean

Skewed to the right = positive skewness: maximum of the distribution at
the left side
Skewed to the left = negative skewness: maximum of the distribution at
the right side

Sk = x̄ – x*        Sk = (x̄ – x*) / s        Sk = 3 · (x̄ – x̃) / s

Quartile skewness:  Sk = [ (Q3 – Q2) – (Q2 – Q1) ] / (Q3 – Q1)

Sample statistics

Skewness
Example: distribution skewed to the right

[Figure: right-skewed distribution with mode x* < median x̃ < mean x̄]
Sample statistics

4. Peakedness (fourth moment)

Kurtosis K

Positive kurtosis (positive excess): steep peak, maximum higher than


compared with the normal distribution
Negative kurtosis (negative excess): flat peak, maximum lower

Leptokurtic / platykurtic / mesokurtic

K = (Q3 – Q1) / [ 2 · (p90 – p10) ]          (K of the normal distribution: KND = 0.263)
Sample statistics

Skewness and kurtosis

[Figure: distributions illustrating positive skewness (skewed to the right) and
kurtosis – leptokurtic (positive kurtosis), mesokurtic (normal distribution),
platykurtic (negative kurtosis)]
Sample statistics and population parameters

Whole population (census): frequency distribution described by definite
population parameters.

(Random) sample from the population: frequency distribution described by
(unsure) sample statistics, which estimate the population parameters.

Theoretical measures of location and spread:

Measure of location: expectation E(X)

Measure of spread: variation Var(X)

[Diagram: population from which samples are drawn]
Sample statistics and population parameters

Normal distribution

Most important measure of location: arithmetic mean

Population: E(X) = µ
Random sample: x̄

Most important measure of spread: variance

Population: Var(X) = σ²
Random sample: s²

The parameters of the population (µ, σ²) are estimated by the statistics of
the random sample (x̄, s²).

„Mean“: sample mean (x̄) versus population or parametric mean µ
Sample statistics and population parameters

Unbiased estimator

take several samples
calculate the sample statistic repeatedly
average of the sample statistics x̄ = unbiased estimator for µ
gives the parameter

Biased estimator

e.g. use (1/n) · Σ (i = 1 to n) (xi – x̄)² to calculate the sample variance

resulting quantity is biased: consistent underestimation of σ²
due to use of x̄, which is already an (unsure) estimator!

use d.f. = n – 1 to get an unbiased estimator
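
A small simulation sketch (not part of the original slides) showing this
underestimation: repeatedly draw samples from a population with known σ² and
compare the 1/n and 1/(n – 1) estimators.

set.seed(42)
sigma2 <- 4                                    # true population variance
est <- replicate(10000, {
  x <- rnorm(5, mean = 0, sd = sqrt(sigma2))   # small sample, n = 5
  c(biased   = sum((x - mean(x))^2) / length(x),
    unbiased = var(x))                         # var() already divides by n - 1
})
rowMeans(est)    # 'biased' averages clearly below 4, 'unbiased' close to 4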
Empirical distributions

[Histograms: a discontinuous empirical distribution (relative frequency per
value) and a continuous empirical distribution (relative frequency per class)]
Empirical + theoretical distributions

Discontinuous distribution:                   Continuous distribution:
Binomial distribution                         Normal distribution

P(X = i) = (n choose i) · p^i · q^(n–i)       f(x) = 1/(σ·√(2π)) · exp[ –½ · ((x – µ)/σ)² ]

E(X) = n·p      Var(X) = n·p·q                E(X) = µ        Var(X) = σ²
Empirical + theoretical distributions

Discontinuous distribution:                   Continuous distribution:
Poisson distribution                          Log-normal distribution

P(X = i) = e^(–µ) · µ^i / i!                  f(x) = 1/(x·σ·√(2π)) · exp[ –½ · ((ln x – µ)/σ)² ]

E(X) = Var(X) = µ                             E(X) = exp(µ + σ²/2)
                                              Var(X) = exp(2µ + σ²) · [exp(σ²) – 1]
Frequency distributions

Mathematical distributions as models for natural


frequency distributions

Elimination of irregularities in empirical distribution


Simple mathematical handling
Estimation of population parameters
Statement about the derivation of the data is possible
Important for choosing appropriate statistical methods
Some advantageous statistical properties (e.g. mean)
Frequency distributions

Examples of different theoretical distributions

Discontinuous distributions

• Positive binomial    σ² < µ    regular
• Poisson series       σ² = µ    random
• Negative binomial    σ² > µ    aggregated

Continuous distributions

• Normal distribution and standard normal distribution (z)
• χ²-distribution
• t-distribution
• F-distribution
Normal distribution

Histogram of an empirical frequency distribution

relative frequency = bar height represents probability


probability also represented by bar area
[Histogram: relative frequency versus classes]
Normal distribution

Empirical frequency distribution: high n, small classes


Approximation by curve
Theoretical normal probability distribution = Normal probability density function

• Smooth
• Bell shaped
• Symmetrical around the mean

[Figure: histogram with fitted smooth bell-shaped curve; relative frequency
versus classes]
Normal distribution

Normal probability density function (PDF)

f(x) = 1/(σ·√(2π)) · exp[ –(x – µ)² / (2σ²) ]

µ = arithmetic mean
σ = standard deviation

[Curve: probability density (relative frequency) versus x]
Normal distribution

Many different (general) normal probability density functions are possible

f(x) = 1/(σ·√(2π)) · exp[ –(x – µ)² / (2σ²) ]

[Figure: several normal PDFs with different µ and σ; probability density
(relative frequency) versus x]
Standard normal distribution

Centering:  xi → xi – µ

Standardising:  zi = (xi – µ) / σ

Standard normal distribution
z-values: standardised values with µ = 0 and σ = 1 according to the
formula:  Z = (X – µ) / σ
Standard normal distribution
= standard normal PDF

f(z) = 1/√(2π) · exp( –z² / 2 )

[Curve: standard normal PDF for z from –3 to +3]
Standard normal distribution
Area under the curve = integral of the standard normal PDF
= cumulative standard normal distribution function
= probability to find z within a definite range

The whole area under the curve = 1

[Figure: standard normal curve from z = –3 to +3; the areas within ±1, ±2 and
±3 standard deviations are 68.27%, 95.45% and 99.73%]
z-value p (z) z-value p (z) z-value p (z) z-value p (z) z-value p (z) z-value p (z)
0.00 0.50000 0.50 0.69146 1.00 0.84134 1.50 0.93319 2.00 0.97725 2.50 0.99379
0.01 0.50399 0.51 0.69497 1.01 0.84375 1.51 0.93448 2.01 0.97778 2.51 0.99396
0.02 0.50798 0.52 0.69847 1.02 0.84614 1.52 0.93574 2.02 0.97831 2.52 0.99413
0.03 0.51197 0.53 0.70194 1.03 0.84849 1.53 0.93699 2.03 0.97882 2.53 0.99430
0.04 0.51595 0.54 0.70540 1.04 0.85083 1.54 0.93822 2.04 0.97932 2.54 0.99446
0.05 0.51994 0.55 0.70884 1.05 0.85314 1.55 0.93943 2.05 0.97982 2.55 0.99461
0.06 0.52392 0.56 0.71226 1.06 0.85543 1.56 0.94062 2.06 0.98030 2.56 0.99477
0.07 0.52790 0.57 0.71566 1.07 0.85769 1.57 0.94179 2.07 0.98077 2.57 0.99492
0.08 0.53188 0.58 0.71904 1.08 0.85993 1.58 0.94295 2.08 0.98124 2.58 0.99506
0.09 0.53586 0.59 0.72240 1.09 0.86214 1.59 0.94408 2.09 0.98169 2.59 0.99520
0.10 0.53983 0.60 0.72575 1.10 0.86433 1.60 0.94520 2.10 0.98214 2.60 0.99534
0.11 0.54380 0.61 0.72907 1.11 0.86650 1.61 0.94630 2.11 0.98257 2.61 0.99547
0.12 0.54776 0.62 0.73237 1.12 0.86864 1.62 0.94738 2.12 0.98300 2.62 0.99560
0.13 0.55172 0.63 0.73565 1.13 0.87076 1.63 0.94845 2.13 0.98341 2.63 0.99573
0.14 0.55567 0.64 0.73891 1.14 0.87286 1.64 0.94950 2.14 0.98382 2.64 0.99585
0.15 0.55962 0.65 0.74215 1.15 0.87493 1.65 0.95053 2.15 0.98422 2.65 0.99598
0.16 0.56356 0.66 0.74537 1.16 0.87698 1.66 0.95154 2.16 0.98461 2.66 0.99609
0.17 0.56749 0.67 0.74857 1.17 0.87900 1.67 0.95254 2.17 0.98500 2.67 0.99621
0.18 0.57142 0.68 0.75175 1.18 0.88100 1.68 0.95352 2.18 0.98537 2.68 0.99632
0.19 0.57535 0.69 0.75490 1.19 0.88298 1.69 0.95449 2.19 0.98574 2.69 0.99643
0.20 0.57926 0.70 0.75804 1.20 0.88493 1.70 0.95543 2.20 0.98610 2.70 0.99653
0.21 0.58317 0.71 0.76115 1.21 0.88686 1.71 0.95637 2.21 0.98645 2.71 0.99664
0.22 0.58706 0.72 0.76424 1.22 0.88877 1.72 0.95728 2.22 0.98679 2.72 0.99674
0.23 0.59095 0.73 0.76730 1.23 0.89065 1.73 0.95818 2.23 0.98713 2.73 0.99683
0.24 0.59483 0.74 0.77035 1.24 0.89251 1.74 0.95907 2.24 0.98745 2.74 0.99693
0.25 0.59871 0.75 0.77337 1.25 0.89435 1.75 0.95994 2.25 0.98778 2.75 0.99702
0.26 0.60257 0.76 0.77637 1.26 0.89617 1.76 0.96080 2.26 0.98809 2.76 0.99711
0.27 0.60642 0.77 0.77935 1.27 0.89796 1.77 0.96164 2.27 0.98840 2.77 0.99720
0.28 0.61026 0.78 0.78230 1.28 0.89973 1.78 0.96246 2.28 0.98870 2.78 0.99728
0.29 0.61409 0.79 0.78524 1.29 0.90147 1.79 0.96327 2.29 0.98899 2.79 0.99736
0.30 0.61791 0.80 0.78814 1.30 0.90320 1.80 0.96407 2.30 0.98928 2.80 0.99744
0.31 0.62172 0.81 0.79103 1.31 0.90490 1.81 0.96485 2.31 0.98956 2.81 0.99752
0.32 0.62552 0.82 0.79389 1.32 0.90658 1.82 0.96562 2.32 0.98983 2.82 0.99760
0.33 0.62930 0.83 0.79673 1.33 0.90824 1.83 0.96638 2.33 0.99010 2.83 0.99767
0.34 0.63307 0.84 0.79955 1.34 0.90988 1.84 0.96712 2.34 0.99036 2.84 0.99774
0.35 0.63683 0.85 0.80234 1.35 0.91149 1.85 0.96784 2.35 0.99061 2.85 0.99781
0.36 0.64058 0.86 0.80511 1.36 0.91308 1.86 0.96856 2.36 0.99086 2.86 0.99788
0.37 0.64431 0.87 0.80785 1.37 0.91466 1.87 0.96926 2.37 0.99111 2.87 0.99795
0.38 0.64803 0.88 0.81057 1.38 0.91621 1.88 0.96995 2.38 0.99134 2.88 0.99801
0.39 0.65173 0.89 0.81327 1.39 0.91774 1.89 0.97062 2.39 0.99158 2.89 0.99807
0.40 0.65542 0.90 0.81594 1.40 0.91924 1.90 0.97128 2.40 0.99180 2.90 0.99813
0.41 0.65910 0.91 0.81859 1.41 0.92073 1.91 0.97193 2.41 0.99202 2.91 0.99819
0.42 0.66276 0.92 0.82121 1.42 0.92220 1.92 0.97257 2.42 0.99224 2.92 0.99825
0.43 0.66640 0.93 0.82381 1.43 0.92364 1.93 0.97320 2.43 0.99245 2.93 0.99831
0.44 0.67003 0.94 0.82639 1.44 0.92507 1.94 0.97381 2.44 0.99266 2.94 0.99836
0.45 0.67364 0.95 0.82894 1.45 0.92647 1.95 0.97441 2.45 0.99286 2.95 0.99841
0.46 0.67724 0.96 0.83147 1.46 0.92785 1.96 0.97500 2.46 0.99305 2.96 0.99846
0.47 0.68082 0.97 0.83398 1.47 0.92922 1.97 0.97558 2.47 0.99324 2.97 0.99851
0.48 0.68439 0.98 0.83646 1.48 0.93056 1.98 0.97615 2.48 0.99343 2.98 0.99856
0.49 0.68793 0.99 0.83891 1.49 0.93189 1.99 0.97670 2.49 0.99361 2.99 0.99861
Standard normal distribution

Tables
z-value p (z) z-value p (z)
0.00 0.50000 0.50 0.69146
0.01 0.50399 0.51 0.69497
0.02 0.50798 0.52 0.69847
0.03 0.51197 0.53 0.70194
0.04 0.51595 0.54 0.70540
0.05 0.51994 0.55 0.70884
0.06 0.52392 0.56 0.71226
0.07 0.52790 0.57 0.71566
0.08 0.53188 0.58 0.71904
0.09 0.53586 0.59 0.72240

P(z) = P(Z ≤ z)

P(z1 ≤ Z ≤ z2) = P(Z ≤ z2) – P(Z ≤ z1)
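
Instead of the printed table, the cumulative standard normal distribution can
be evaluated directly, e.g. in R (sketch; the values match the table above):

pnorm(1.96)              # P(Z <= 1.96) = 0.975
pnorm(1) - pnorm(-1)     # P(-1 <= Z <= 1) = 0.6827
qnorm(0.975)             # z with P(Z <= z) = 0.975, i.e. 1.96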


Normal distribution

The following distributions can be approximated by a normal distribution


under the following conditions:

Positive binomial    n > 30 and s² ≥ 3
Poisson series       µ > 10
Negative binomial    large k

Data that are not normally distributed can be transformed to approximate a
normal distribution:

Transformation              Back-transformation

log10(x)                    10^y
loge(x)                     e^y
√x                          y²
1/x                         1/y
The central limit theorem

The means of samples drawn from a normally distributed population


are themselves normally distributed regardless of sample size n.

As sample size increases, the means of samples drawn from a population of any
distribution will approach the normal distribution.

The standard deviation of the distribution of the means is given by:

S.E.M. = σ / √n
The central limit theorem

S.E.M. = σ / √n  ... decreases as sample size n increases!

z = (x̄ – µ) / (σ / √n)

[Figure: distribution of the original population (= distribution of means with
n = 1), distribution of means with low n, and distribution of means with high
n; the higher n, the narrower the distribution of means around µ]
The central limit theorem

S.E.M. = σ / √n  ... decreases as sample size n increases!

z = (x̄ – µ) / (σ / √n)

[Figure: distribution of the z-standardised means x̄, i.e. the standard normal
distribution]
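
A short simulation sketch of the central limit theorem (hypothetical data):
means of samples drawn from a skewed (exponential) distribution become
approximately normal, and their spread shrinks roughly with 1/√n.

set.seed(1)
means_n5  <- replicate(5000, mean(rexp(5)))    # means of small samples
means_n50 <- replicate(5000, mean(rexp(50)))   # means of larger samples
sd(means_n5); sd(means_n50)                    # S.E.M. decreases with increasing n
hist(means_n50, breaks = 40)                   # close to bell-shaped despite skewed raw data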
The t-distribution

σ and S.E.M. have to be estimated:  S.E.M. = s / √n

t = (x̄ – µ) / (s / √n)

z = t with infinite d.f.

[Figure: t-distributions for d.f. = 1, d.f. = 2 and d.f. = ∞ (= z); the
t-distribution is wider and flatter when n is low]
The t-distribution

[Figure: normal distribution and t-distribution compared; the t-distribution
is wider with heavier tails]
The t-distribution
t-values for different degrees of freedom and α

d.f.        α = 0.05       α = 0.01       α = 0.001


1 12.70615 63.65590 636.57761
2 4.30266 9.92499 31.59977
3 3.18245 5.84085 12.92443
4 2.77645 4.60408 8.61008
5 2.57058 4.03212 6.86850
6 2.44691 3.70743 5.95872
7 2.36462 3.49948 5.40807
8 2.30601 3.35538 5.04137
9 2.26216 3.24984 4.78089
10 2.22814 3.16926 4.58676
Confidence interval (C.I.)
Interval around x̄ which includes µ with a certain confidence
(a probability close to 1, ~0.95).

σ is known:
a random sample gives x̄, which is normally distributed with S.E.M. = σ / √n

z = (x̄ – µ) / (σ / √n)

P( –z(α/2) ≤ (x̄ – µ) / (σ / √n) ≤ z(α/2) ) = 0.95 = 1 – α

P( –z(α/2) · σ/√n ≤ x̄ – µ ≤ z(α/2) · σ/√n ) = 1 – α

P( x̄ – z(α/2) · σ/√n ≤ µ ≤ x̄ + z(α/2) · σ/√n ) = 1 – α      C.I.:  x̄ ± z(α/2) · σ/√n

z = tabulated value from the standard normal distribution, depends on α
α = significance level, 1 – α = confidence / accuracy
Confidence interval (C.I.)
σ is unknown:
a random sample gives x̄ and s; x̄ is t-distributed with S.E.M. = s / √n

t = (x̄ – µ) / (s / √n)

P( x̄ – t(α/2, d.f.) · s/√n ≤ µ ≤ x̄ + t(α/2, d.f.) · s/√n ) = 1 – α      C.I.:  x̄ ± t(α/2, d.f.) · s/√n

t = tabulated value from the t-distribution, depends on α and d.f.
α = significance level (0.05 or 0.01)
d.f. = n – 1 = degrees of freedom
Confidence interval (C.I.)

[Figure: t-distribution around x̄ with tail areas α/2 below –t(α/2, d.f.) and
above +t(α/2, d.f.)]

Be aware that t-tables are usually two-sided!
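
A minimal R sketch of a 95% confidence interval for µ with σ unknown (sample
values invented for illustration); t.test() returns the same interval as the
manual formula:

set.seed(2)
x <- rnorm(20, mean = 50, sd = 6)                 # hypothetical sample
n <- length(x); alpha <- 0.05
mean(x) + c(-1, 1) * qt(1 - alpha/2, df = n - 1) * sd(x) / sqrt(n)   # manual C.I.
t.test(x, conf.level = 0.95)$conf.int             # same interval from t.test()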


Confidence interval (C.I.)

Accuracy and precision

Accuracy (/ confidence) (1 – α) is the probability that the true mean of the
population lies within a given confidence interval.

Precision is the width of the confidence interval:

• expresses how close sample values (means) lie to each other (s)
• demonstrates the quality of the estimation of µ (n)

Precision, accuracy (1 – α) and the number of samples n are interdependent.

To get higher precision but keep same accuracy


increase sample size n
Accuracy and precision

Neither precise nor accurate Precise, not accurate

Accurate, not precise Precise and accurate


Accuracy and precision

Formula assuming a normal distribution

Calculation of the precision with a given accuracy (α) and sample size n:

Absolute precision:  G = t(α/2, d.f.) · s / √n

Precision relative to the mean:  G′ = t(α/2, d.f.) · s / (x̄ · √n)

Calculation of the necessary sample size with predefined accuracy and
precision:

n = [ t(α/2, d.f.) · s / (G′ · x̄) ]²

(The equation has to be solved iteratively, because n appears on both sides.)
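
A sketch of the iterative solution in R, with an assumed pilot estimate of s
and x̄ and an assumed target relative precision (all numbers hypothetical):

s <- 12; xbar <- 70            # pilot estimates (assumed)
Gprime <- 0.05; alpha <- 0.05  # target relative precision and accuracy
n <- 10                        # starting guess
repeat {
  n_new <- ceiling((qt(1 - alpha/2, df = n - 1) * s / (Gprime * xbar))^2)
  if (n_new == n) break        # stop when n no longer changes
  n <- n_new
}
n                              # required sample size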
Accuracy and precision

Example: increase of sample size

[Figure: with increasing sample size n, the confidence interval around µ
narrows – the relative precision G′ improves (e.g. from 20% to 5% at an
accuracy of 1 – α = 70%, or from 50% to 20% at 1 – α = 95%) – or, at constant
precision, the accuracy can be increased]
Statistical decision theory

Statistical decision = decisions about population based on sample


information.

Statistical hypothesis = assumption about the population to reach


decision

Null hypothesis H0 = assumption that the result obtained is ‘due entirely


to chance‘ (initial innocence)

Alternative hypothesis HA = any hypothesis that differs from a given H0


Statistical decision theory

Example: testing for differences between 2 populations (each


represented by 1 sample) with regard to a certain variable.

H0: Two populations do not differ. Thus, the 2 samples come in fact from
one underlying population and any possibly observed difference between
the two samples is entirely due to chance.

HA: Two populations differ. Thus, an observed difference between the


samples is not due to chance but reflects the fact that the 2 samples come
from 2 different underlying populations.
Statistical decision theory

Type I (α) and type II (β) error

Decision of the test          Reality: H0 true, HA false      Reality: H0 false, HA true
H0 kept, HA rejected          correct decision                type II error
                              probability 1 – α               probability β
H0 rejected, HA accepted      type I error                    correct decision
                              probability α                   probability 1 – β (“power“)

Type I error: ’wrong alarm’
Type II error: ’missed opportunity’

Controlling the errors: decreasing the type I error (α) increases the type II
error (β); increasing the sample size n decreases both.
Statistical decision theory
Type I and Type II error

[Figure: overlapping distributions under H0 and HA with the critical value
separating the type I error area (α), the type II error area (β) and the power
(1 – β); shifting the critical value or the distributions changes α, β and the
power]
Steps to conduct a statistical test
• Define H0 and HA
• Set α in advance!
• Calculate the test statistic TSemp from the data
• Calculate d.f.

Critical-value approach:
• Find / calculate the critical value of TS for the given α and d.f. from the
  known distributions (z, t, F, χ²) = TScrit
• Compare TSemp with TScrit
• Decision:
  if TSemp ≤ TScrit   do not accept HA; nor H0
  if TSemp > TScrit   accept HA; reject H0

P-value approach:
• Calculate P(TS ≥ TSemp) from TSemp and d.f. (probability to get TSemp or any
  larger TS = probability of error when accepting HA)
• Compare P(TS ≥ TSemp) with α
• Decision:
  if P ≥ α   do not accept HA; nor H0
  if P < α   accept HA; reject H0
Significance levels
type I error α    probability of observed outcome under true H0    meaning                symbol
α = 5%            P ≥ 0.05                                         not significant        n.s.
α = 5%            P < 0.05                                         significant at 5%      *
α = 1%            P < 0.01                                         significant at 1%      **
α = 0.1%          P < 0.001                                        significant at 0.1%    ***

“A significant difference between the phosphorus concentration of lake A


and lake B could be demonstrated (t=4.5, d.f.=20, P<0.01).”

“We were not able to demonstrate significant differences between plant


biomass of the fertilized and non-fertilized treatment plots (t=0.75, d.f.=20,
P=0.45).”

“A one-way ANOVA showed a significant effect of the factor ‘nutrient’ on
primary productivity (F=24.2, df1=2, df2=9, P<0.001).”
One-sample test (sample vs. fixed „true“ value)

Example: mice population on an island – census, weight ~ ND(µ0, σ)
on a drifting log: a single exceptionally heavy mouse (weight x)
a new species? from the mainland??

1) Testable hypotheses:

H0: The ‘new’ mouse belongs to the island population, its weight is
similar to those of other island mice: x ≤ µ0. Its relatively high weight is
entirely due to chance, it´s just a slightly heavy mouse of the population.

HA: The ‘new’ mouse does not belong to the island population, its weight
is higher than that of other island mice, it must belong to some other
mouse population, say from the mainland: x > µ0
One-sample test (sample vs. fixed „true“ value)

2) now believe in H0 (initial innocence)!

how likely is it to find heavy mouse x?
weight ~ ND(µ0, σ) → calculate the probability to find heavy mouse x or larger

TS:  z = (x – µ0) / σ    →    P(Z ≥ z)

[Figure: normal distribution under H0 with the upper-tail area P(Z ≥ z) shaded
beyond the observed z]
One-sample test (sample vs. fixed „true“ value)
3) Set a threshold for P to decide between H0 and HA (in advance!)
set limit = significance level α

Or: calculate the critical z from α: for α = 0.05, P(Z ≥ zcrit) = 0.05 → zcrit = 1.64

[Figure: normal distribution under H0 with the critical value zcrit cutting off
the upper-tail area α]

4) Decision:

if P(Z ≥ z) > α  /  z ≤ zcrit                 if P(Z ≥ z) < α  /  z > zcrit
mouse x (or larger) is likely                 mouse x (or larger) is unlikely
“keep” H0, reject HA                          reject H0, accept HA
One-sample test (sample vs. fixed „true“ value)

5) Decision wrong: type I error!

e.g. we reject H0 while it is true! (we make a type I error)
chance of making a type I error?
α = maximum probability of error when rejecting H0

[Figure: distribution under H0 with the rejection area α = P(Z ≥ zcrit) shaded;
see the decision table above (type I error α, type II error β, power 1 – β)]
One-sample test (sample vs. fixed „true“ value)

5) Decision wrong: type II error! To evaluate β we need a definite HA!

e.g. HA: mouse from the mainland, where mouse weight ~ ND(µ1, σ2) and µ1 > µ0

assume α and calculate the threshold weight xcrit from zcrit:

zcrit = (xcrit – µ0) / σ = 1.64    →    xcrit

β = probability to find a mouse with weight xcrit (or less) under a true HA:

TS:  z = (xcrit – µ1) / σ2    →    β = P(Z ≤ z)

power = 1 – β  (correctly rejecting H0 and accepting HA)

[Figure: distributions under H0 (mean µ0) and HA (mean µ1) with xcrit marking
the areas α, β and the power 1 – β]
One-sample test (sample vs. fixed „true“ value)

[Figure: island mice population (mean µ0) and mainland mice population (mean
µ1); sampling distributions of sample means are wide for low n and narrow for
high n]

H0: The sample of drifting mice belongs to the island population. The
population mean µ estimated from the sample is equal to (or smaller than) the
µ0 of the island population: µ ≤ µ0

HA: The sample of drifting mice belongs to a different population with a µ
which is larger than the µ0 of the island population: µ > µ0
One-sample test (sample vs. fixed „true“ value)

Known σ: Gauss test

TS:  z = (x̄ – µ) / (σ / √n)    compare with zcrit = z(α)    alternative: compare P(z) with α

Unknown σ: one-sample t-test

σ estimated by s

TS:  t = (x̄ – µ) / (s / √n)    compare with tcrit = t(α, d.f.)    alternative: compare P(t) with α

t-value: from Student´s t-distribution
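
A minimal R sketch of the one-sample t-test, with hypothetical mouse weights
and an assumed reference value µ0 = 20 (one-sided, as in the example above):

x <- c(21.3, 22.1, 19.8, 23.0, 22.4, 21.7, 20.9, 22.8)   # hypothetical weights (g)
t.test(x, mu = 20, alternative = "greater")               # H0: µ <= 20, HA: µ > 20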


One-sided versus two-sided test

Beforehand information (model, idea, experience):      No beforehand information:

H0: µ ≤ µ0                                              H0: µ = µ0
HA: µ > µ0                                              HA: µ ≠ µ0

One-sided test:                                         Two-sided test:
type I error α in one tail, P(Z > z) = α                type I error α/2 in each tail,
                                                        P(Z < –z) = α/2 and P(Z > z) = α/2

[Figures: standard normal curve with the rejection area α in one tail
(one-sided) or α/2 in each tail (two-sided); the central area is 1 – α in both
cases]
Important limits z of the standard normal distribution

α          1 – α        z (two-sided)      z (one-sided)

0.100 0.900 1.64485 1.28155


0.050 0.950 1.95996 1.64485
0.025 0.975 2.24140 1.95996
0.010 0.990 2.57583 2.32635
0.001 0.999 3.29053 3.09023
Types of statistical tests referring to certain assumptions

• Parametric tests: assume a known parameterized probability distribution,

e.g. ND(µ, σ²) – assume ND

• Non-parametric tests: no assumptions about the frequency distributions


ND not assumed, “distribution free”

• Independent samples: do not depend on each other

• Dependent / paired samples: samples depend on each other, e.g. testing


differences before and after a treatment on the same object
Selection of different standard tests

Assumption about distribution | Number of samples | Dependency | Test

Parametric     | 2   | independent | t-test after STUDENT and WELCH-test**
Parametric     | 2   | dependent   | t-test for dependent samples
Parametric     | >2  | independent | one-way analysis of variance (ANOVA) and
                                     WELCH variant one-way analysis of means**
Parametric     | >2  | dependent   | repeated measures (or paired) ANOVA
Non-parametric | 2   | independent | U-test after MANN & WHITNEY
Non-parametric | 2   | dependent   | WILCOXON-test for paired differences
Non-parametric | >2  | independent | H-test after KRUSKAL & WALLIS
Non-parametric | >2  | dependent   | FRIEDMAN-test
Check normal distribution

1) Skewness and kurtosis

calculate S.E. for skewness and kurtosis


(repeated sampling, build distribution of statistics Sk and K, standard
deviation)
Sk and K follow ND
rough C.I.:
Skewness (kurtosis) ± 2 x S.E. of the skewness (kurtosis)

[Sketch: continuum of possible Sk values with the interval Sk ± 2·SE(Sk) around
the observed Sk and the value 0 marked]

Deviation from ND will be assumed if value 0 outside of the C.I.!


Check normal distribution

2) Histograms
[Histograms of VAR00001 (mean = 5.13, SD = 1.08, N = 100) and VAR00002
(mean = 7.7, SD = 3.27, N = 100)]

3) Normal quantile plots


[Normal Q–Q plots of VAR00001 and VAR00002: expected normal quantiles versus
observed values]


Check normal distribution

4) Compare location of mean, median and mode

Example: distribution skewed to the right

[Figure: for a distribution skewed to the right, mode x* < median x̃ < mean x̄]
Check normal distribution

5) Run statistical test for normal distribution

e.g. Kolmogorov-Smirnov-test, Shapiro-Wilk-test

H0: The distribution of the data is normal.


HA: The distribution of the data differs from a normal distribution.

„hope“ for high P!!


Tests of Normality (Kolmogorov-Smirnov, Lilliefors significance correction):

Variable      Statistic      df      Sig.
VAR00001      .049           100     .200*
VAR00002      .172           100     .000

*. This is a lower bound of the true significance.

[Histograms of VAR00001 (mean = 5.13, SD = 1.08, N = 100) and VAR00002
(mean = 7.7, SD = 3.27, N = 100)]

low n – no serious judgement possible


high n – small deviation from ND detectable
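
A small R sketch of step 5 with simulated data; shapiro.test() is the
Shapiro-Wilk test, ks.test() a plain Kolmogorov-Smirnov variant (without the
Lilliefors correction used in the output above):

set.seed(3)
x1 <- rnorm(100, mean = 5, sd = 1)       # roughly normal
x2 <- rlnorm(100)                        # right-skewed
shapiro.test(x1)                         # high P: no evidence against ND
shapiro.test(x2)                         # low P: deviation from ND detectable
ks.test(x1, "pnorm", mean(x1), sd(x1))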
Data transformation

The original variable is replaced by another variable according to a


specific mathematical function.
The same transformation procedure is applied to all variates.

What for?

• to facilitate interpretation and presentation of data


• to approximate the empirical distribution by a normal distribution, and
then use tests which assume a normal distribution (parametric tests)
• to recognize atypical values (extremes, outliers)
• to reduce effect of extreme values
• to linearize functional relationships
Data transformation

Transform non-ND data to approximate ND

Distribution skewed to the left:   power function (x², x³, ...)
Distribution skewed to the right:  square root (√x)
                                   logarithm (ln(x), log(x))
                                   reciprocal function (1/√x, 1/x, 1/x², ...)

Transformation              Back-transformation

y = log10(x)                x = 10^y
y = loge(x) = ln(x)         x = e^y
y = √x                      x = y²
y = 1/x                     x = 1/y

Ecological data
• A log-normal distribution can often be assumed
• Approximation of a normal distribution by use of logarithms
• In case of occurrence of zero values:  xT = ln(x + 1)
F-distribution
1. draw two (!) samples from a population ~ ND(µ, σ²)
2. calculate s1² (sample 1 with n1) and s2² (sample 2 with n2)
3. calculate the statistic:

   F = s1² / s2²

4. repeat 1.–3. and build the distribution of F-values

s1² and s2² are both estimates for σ²  →  F ≈ 1

“F-distribution”
shape determined by d.f.1 = n1 – 1 and d.f.2 = n2 – 1
separate F-distribution for each combination of d.f.1 and d.f.2
F-distribution

[Figure: F-distributions F(1,20), F(5,25) and F(25,5); the critical value at
α = 0.05 for F(5,25) lies at about F = 2.6]
F-test: checking variance homogeneity

H0: The sample variances estimate the same parametric variance (σ1² = σ2²)
    variance homogeneity = homoscedasticity
HA: The sample variances estimate different parametric variances (σ1² ≠ σ2²)
    variance heterogeneity = heteroscedasticity

α = 0.05
TS: variance ratio Fs

2-tailed test:  Fs = s1² / s2²          1-tailed test:  Fs = smax² / smin²

[Figure: F(9,9) distribution; 2-tailed test with rejection areas α/2 = 0.025 in
each tail (Fcrit = 0.2 and Fcrit = 4.0); 1-tailed test with rejection area
0.025 in the upper tail (Fcrit = 4.0)]

Decision (1-tailed test):

1) if P(F) ≥ α/2 (equivalent to: Fs ≤ Fcrit)
   do not accept HA, nor H0
   assume variance homogeneity

2) if P(F) < α/2 (equivalent to: Fs > Fcrit)
   accept HA, reject H0
   variance heterogeneity
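
In R the F-test of variance homogeneity is available as var.test(); a sketch
with simulated samples:

set.seed(4)
x1 <- rnorm(10, mean = 50, sd = 5)
x2 <- rnorm(10, mean = 50, sd = 5)
var(x1) / var(x2)        # variance ratio Fs
var.test(x1, x2)         # two-sided F-test, F(9, 9)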
t-test (after Student) for independent samples

Parametric procedure for testing significant differences of means between two


independent samples from normally distributed populations by means of one
variable

1) Check ND

2) Hypotheses

H0: µ1 = µ2 Sample means estimate same parametric mean µ.


Both samples drawn from same population.

HA: µ1 ≠ µ2 Sample means estimate different parametric means µ1 and µ2.
Samples drawn from different populations.

α = 0.05
two-sided test, TS = t

3) Check variance homogeneity (F-test)


t-test (after Student) for independent samples

4) Test statistic t

Variance homogeneity:

t = (x̄1 – x̄2) / √[ ((n1 – 1)·s1² + (n2 – 1)·s2²) / (n1 + n2 – 2) · (n1 + n2) / (n1·n2) ]
d.f. = n1 + n2 – 2

Variance heterogeneity (Welch test):

t = (x̄1 – x̄2) / √( s1²/n1 + s2²/n2 )
d.f. approximated from s1², s2², n1 and n2 (Welch–Satterthwaite)

5) Decision (two-sided)
tcrit = t(α/2, d.f.)

if P(t) ≥ α/2 (equivalent to: |t| ≤ tcrit)     do not accept HA, nor H0     “could not show difference”
if P(t) < α/2 (equivalent to: |t| > tcrit)     accept HA, reject H0         populations are different
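
A minimal R sketch with two simulated groups; var.equal = TRUE gives Student's
t-test, the default gives the Welch test:

set.seed(5)
g1 <- rnorm(12, mean = 50, sd = 5)
g2 <- rnorm(12, mean = 55, sd = 5)
t.test(g1, g2, var.equal = TRUE)   # Student's t-test (variance homogeneity assumed)
t.test(g1, g2)                     # Welch test (default, no homogeneity assumed)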
t-test (after Student) for dependent samples

Parametrical procedure for testing significant differences of means between


two dependent samples from normally distributed populations by means of
one variable

1) Check ND

2) Hypotheses

H0: µ1 = µ2 Sample means estimate same parametric mean µ.


Both samples drawn from same population.

HA: µ1 ≠ µ2 Sample means estimate different parametric means µ1 and µ2.
Samples drawn from different populations.

α = 0.05
two-sided test, TS = t
t-test (after Student) for dependent samples
patient before after differences
X1 X2 X1-X2
Gandalf 6 4 2
Saruman 4 3 1
Arwen 7 5 2
Frodo 3 2 1
...

3) Calculate the differences and the standard deviation of the differences

di = x1i – x2i          s = √[ Σ (i = 1 to n) (di – d̄)² / (n – 1) ]

4) Test statistic t

t = (d̄ – 0) / (s / √n) = d̄ · √n / s          one-sample t-test with µ0 = 0 !!!
t-test (after Student) for dependent samples
5) Decision (two-sided)

d.f. = n – 1          tcrit = t(α/2, d.f.)

if P(t) > α  /  |t| ≤ tcrit                    if P(t) < α  /  |t| > tcrit
do not accept HA, nor H0                       accept HA, reject H0
“could not show difference”                    populations are different
Non-parametric tests based on ranks

General principles of tests based on ranks:

• distribution-free
• non-parametric
• ranks: sort all values (rank order) and number sequentially.
• replace each original variate by its rank (reduce data to ordinal scale).
• generally less powerful than parametric procedures

Mann-Whitney U-test (analogous to independent t-test)


Wilcoxon test (analogous to dependent t-test)
Kruskal-Wallis-ANOVA (also called H-test, analogous to 1-way ANOVA)
U-test (after Mann & Whitney)
Non-parametric procedure for testing significant differences between two
independent samples from non-normally distributed populations with regard
to one variable.

compares the sums of ranks of the two samples

1) Hypotheses
H0: Two samples come from populations with identical “locations” (medians).
HA: Two samples come from populations which differ in location (median).
U-test (after Mann & Whitney)

2) Ranking of all observations, ignoring groups. Ties get average ranks.

3) Sums of ranks R1 and R2 for both samples.


Under true H0: ranks randomly mixed between the two samples, similar
mixture of ranks and equal rank sums

4) Calculation of test statistic U based on the sums of ranks.


When n > 20 U approaches ND
use z-distribution to calculate P(U) and zcrit

(small samples: “exact probability” based on probability distribution of U


calculated by repeated randomization of observations to groups)

5) Decision: as usual by comparing P(U) = P(z) with α
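
A minimal R sketch of the U-test with two small hypothetical samples (R's
wilcox.test() reports the statistic as W):

g1 <- c(3, 5, 7, 2, 9, 4)     # hypothetical counts, ND not assumed
g2 <- c(8, 12, 10, 9, 15, 11)
wilcox.test(g1, g2)           # Mann-Whitney U-test (two-sided)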


Multiple comparisons

“Bonferroni”-correction (Dunn-Sidak):

αt = 1 – (1 – α)^k

α    type I error of a single comparison
αt   total (overall) type I error
k    number of comparisons

[Diagram: tree of failure/success outcomes over repeated comparisons]

Overall αt for different single α and different numbers of comparisons k:

              k=2      k=3      k=4      k=5      k=10     k=100
α = 0.05      0.098    0.143    0.185    0.226    0.401    0.994
α = 0.01      0.020    0.030    0.039    0.049    0.096    0.634
α = 0.001     0.002    0.003    0.004    0.005    0.010    0.095
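
The table values can be reproduced directly, and R's p.adjust() applies the
corresponding corrections to P-values (sketch; the example P-values are
invented):

alpha <- 0.05
k <- c(2, 3, 4, 5, 10, 100)
round(1 - (1 - alpha)^k, 3)                             # overall type I error αt
p.adjust(c(0.04, 0.03, 0.01), method = "bonferroni")    # corrected P-values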
ANalysis Of VAriance

Simple analysis of variance (one-way ANOVA)

Parametric procedure for testing significant differences between more


than two independent groups from normally distributed populations
by means of one variable

Continuous response variable = dependent


Categorical group coding variable = independent = factor
(groups = factor levels)
Different types of ANOVA

• ANOVA with more than one factor = multifactorial ANOVA


• ANOVA, where the comparison between groups should be independent
of one or more continuous variables (= covariables) = ANCOVA
= analysis of covariance
• ANOVA with more than one dependent variable = MANOVA =
multivariate ANOVA
• ANOVA with dependent samples = Repeated measures ANOVA
• model I ANOVA: with treatment factors (deliberate manipulation)
• model II ANOVA: with random effects (e.g. random replication at
multiple levels – fish within cages within ponds)
Assumptions for ANOVA
• Variable X has to be normally distributed at each factor level (BUT: ANOVA
considered robust against violations)
• Homogeneity of variances (critical!)

When assumption of variance homogeneity violated:


1) Transform dependent variable, e.g. log(x).
2) Non-parametric test (Kruskal-Wallis H test).
3) Multiple pairwise comparisons using t-tests or U-tests and correct P.
4) Variant of Welch test “one way analysis of means”, in R: oneway.test().

At severe violation of the ND assumption


Non-parametric procedures

Example: Simple one-way analysis of variance (one-way ANOVA)


• equal group sizes (same n)
• 1 factor defining 3 groups, i.e. a 3-level ANOVA
• (e.g. body fat of students studying ecology, statistics, sports)
• assumptions fulfilled
(one-way ANOVA)

Hypotheses:

H0: The three groups are not different (come from the same population):
    µ1 = µ2 = µ3

HA: At least one group differs from at least one other group (comes from a
    different population), e.g. µ1 ≠ µ2 = µ3, or µ1 = µ2 ≠ µ3, or µ1 ≠ µ2 ≠ µ3
(one-way ANOVA)
Scheme of the analysis of variance

[Figure: x-values plotted for the three factor levels, with the group means
x̄1, x̄2, x̄3 and the grand mean x̄ marked]
(one-way ANOVA)

Calculation of the sum of squares between the groups
(squared differences between the group means and the grand mean):

SSb(between) = n · Σ (z = 1 to Z) (x̄z – x̄)²

Calculation of the sum of squares within the groups
(squared differences between each data point and its group mean):

SSw(within) = Σ (z = 1 to Z) Σ (i = 1 to n) (xiz – x̄z)²

[Figure: the same scheme with the deviations (x̄z – x̄) and (xiz – x̄z) indicated]
(one-way ANOVA)

1) Partitioning the total sum of squares (SS) (“Splitting of variance”):

Total SS
SSt(total) = Σ (z = 1 to Z) Σ (i = 1 to n) (xiz – x̄)²
sum of the squared total deviations – measure of total variation

Explained SS = Between-groups SS
SSb(between) = n · Σ (z = 1 to Z) (x̄z – x̄)²
sum of the squared deviations between groups – measure of group-to-group variation

Not explained SS = Within-groups SS
SSw(within) = Σ (z = 1 to Z) Σ (i = 1 to n) (xiz – x̄z)²
sum of squared deviations within groups – measure of within-group variation

z = group (z = 1, 2, ..., Z)
i = value number (i = 1, 2, ..., n)
(one-way ANOVA)

1) Partitioning the total sum of squares (SS) (“Splitting of variance”):

SSt(total) = SSb(between) + SSw(within)

Sums of squares are additive!

Variation of the whole dataset is partitioned into two parts depending on its origin!
(one-way ANOVA)

2) Mean squared deviations = SS / d.f.

MSt = SSt / (Z·n – 1)        MSb = SSb / (Z – 1)        MSw = SSw / (Z·(n – 1))

Under a true H0 the mean squares are variances and estimate σ² of the
(same) population.

MSt: data treated as 1 sample, its variance is an estimate for σ²

MSw: average within-group variation, “intragroup MS” or “error MS”
     = average variance of the groups, estimate for σ²

MSb: all means come from 1 population
     expected variance of the group means is S.E.M.² = s²/n
     multiply the variance of the means by n (already done for SSb)
     MSb = another estimate for σ²
(one-way ANOVA)

2) Mean squared deviations = SS / d.f.

MSt = SSt / (Z·n – 1)        MSb = SSb / (Z – 1)        MSw = SSw / (Z·(n – 1))

Under a true HA (different populations):

MSw: still the average within-group variation
     estimate for σ1² = σ2² = σ3² = σ² (variance homogeneity!)

MSb: now includes substantial group-to-group variation
     estimate for σ² larger than expected!
(one-way ANOVA)

3) Calculation of the test statistic

TS:  Femp = MSb / MSw

… close to 1 under a true H0
… >> 1 when MSb includes a group effect

4) F-distribution: F-values for two variance estimates from the same population

Calculate Fcrit under a true H0 at significance α with d.f.1 = Z – 1 and d.f.2 = Z·(n – 1)
Comparison of Femp with Fcrit

if Femp > Fcrit   Femp is an improbable (< α) value under a true H0
                  reject H0 / accept HA
if Femp ≤ Fcrit   Femp is a probable value under a true H0
                  reject HA
(one-way ANOVA)

5) Results: ANOVA-table

source of variation sum of squares df mean square Femp P


between groups 20.57 2 10.28 1.132 0.328
within groups 572.45 63 9.08
total 593.03 65

6) Post-hoc tests

which group differs from which one?
e.g. µ1 ≠ µ2 = µ3, or µ1 = µ2 ≠ µ3, or µ1 ≠ µ2 ≠ µ3

multiple pairwise comparisons with correction of P (Bonferroni and others)

Two-factorial analysis of variance (two-way ANOVA)

two categorical factors considered simultaneously


(or more multifactorial ANOVA)

Example: 2 factors with 2 levels each

• food consumption of rats (dependent variable)


• study both sexes (factor 1 = sex, 2 levels: male and female)
• compare food types (factor 2 = food, 2 levels: fresh and old)

collect replicates for each possible combination of factors


4 combinations (= groups = cells) with several replicates each
(two-way ANOVA)

Possible outcomes of experiment:

1. difference in food consumption between sexes


2. preference for a certain food type
3. difference in food preference among sexes, e.g. males prefer food 1,
females prefer food 2

1. and 2. are main effects


3. is interaction: dependence of effect of one factor on level of other
factor (can be: inhibition or synergism)
(two-way ANOVA)

3 sets of hypotheses, the null hypotheses are:

1. H0: no difference between sexes

2. H0: no difference between food types

3. H0: no interaction
(two-way ANOVA)
1) Partitioning the total sum of squares:

SSt(total) = SSsex + SSfood + SSinteraction + SSw(within)

SSsex from the means of the sexes (pooled over food types):
SSsex = F · n · Σ (s = 1 to S) (x̄s – x̄)²

SSfood from the means of the food types (pooled over sexes):
SSfood = S · n · Σ (f = 1 to F) (x̄f – x̄)²

SSw from the group means (groups: all sex × food combinations):
SSw = Σ (s = 1 to S) Σ (f = 1 to F) Σ (i = 1 to n) (xisf – x̄sf)²

SSinteraction by difference:
SSinteraction = SSt(total) – SSsex – SSfood – SSw(within)

s … index for sex (S = total number of levels)
f … index for food type (F = total number of levels)
(two-way ANOVA)
2) Mean squared deviations = SS / d.f.

MSt = SSt / (S·F·n – 1)                    MSw = SSw / (S·F·(n – 1))

MSsex = SSsex / (S – 1)                    MSfood = SSfood / (F – 1)

MSinteraction = SSinteraction / ((S – 1) · (F – 1))

Under a true H0 the mean squares are variances and estimate σ² of the (same)
population.

When there is an effect of sex, food or interaction
it adds additional variation to the corresponding MS
the MS will be larger than expected
(two-way ANOVA)
3) Calculation of the test statistics

Fsex = MSsex / MSw        Ffood = MSfood / MSw        Finteraction = MSinteraction / MSw

… close to 1 under a true H0
… >> 1 when the MS includes an effect

4) Calculate Fcrit under a true H0 at significance α with d.f.1 and d.f.2 for each H0
Comparison of Fsex / Ffood / Finteraction with the corresponding Fcrit

if Femp > Fcrit   Femp is an improbable (< α) value under a true H0
                  reject H0 / accept HA
if Femp ≤ Fcrit   Femp is a probable value under a true H0
                  reject HA
(two-way ANOVA)
5) Results: ANOVA-table

source of variation sum of squares df mean square F P


sex 10.21 1 10.21 0.549 0.46
food 37.41 1 37.41 2.012 0.16
interaction: sex x food 0.21 1 0.21 0.011 0.92
error 2156.97 116 18.59
total 2204.80 119

6) Post-hoc tests

in case of factor with > 2 levels: which group differs from which one?

multiple pairwise comparisons with correction of P (Bonferroni and others)
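
A minimal R sketch of the rat example as a two-way ANOVA with interaction (all
data simulated, so no real effects are expected):

set.seed(6)
dat <- expand.grid(sex = c("male", "female"), food = c("fresh", "old"), rep = 1:10)
dat$consumption <- rnorm(nrow(dat), mean = 20, sd = 4)      # simulated response
fit <- aov(consumption ~ sex * food, data = dat)            # main effects + interaction
summary(fit)
interaction.plot(dat$sex, dat$food, dat$consumption)        # parallel lines = no interaction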


(two-way ANOVA)
7) Interaction plots

Significant interaction term scrutinize results for main effects!


(a significant interaction can make main effect results worthless!)

[Interaction plots (a), (b), (c): mean food consumption of males and females
for food 1 and food 2]

(a) parallel response – no interaction

(b) and (c) interaction


Correlations

describe the mutual variation of two variables and measures the degree to
which variables are related. No functional dependence between the
variables is assumed.
a) Positive (= direct) correlation
b) Negative (= inverse) correlation
c) No correlation
d) Non-linear correlation

[Scatter plots (a)–(d) of Y versus X illustrating these four cases]
Correlations

Types of correlation according to the level of scale

                        Level of scale of X
Level of scale of Y     Metric                       Ordinal                      Nominal
Metric                  Product-moment correlation   Rank correlation:
                        (PEARSON´S r)                SPEARMAN´S rs, KENDALL´S Tau
Ordinal                 Rank correlation:            Rank correlation:
                        SPEARMAN´S rs, KENDALL´S Tau SPEARMAN´S rs, KENDALL´S Tau
Nominal                 Contingency tables: Φ-coefficient, KRAMER coefficient

The variable at the lowest level of scale always determines the choice of the
correlation measure.
Correlations between nominal-scaled variables
- contingency tables

to test the hypothesis that the frequency of occurrence in the categories


of one variable is related to the frequencies in the second variable.
The simplest case is for binary data (two categories only).

Contingency table

         x1      x2      Σ

y1       F11     F12     F1.
y2       F21     F22     F2.
Σ        F.1     F.2     F..

Σ = sum; F = frequency
Correlations between nominal-scaled variables

1) Setting up hypotheses
H0: No correlation between the two variables X and Y.
HA: Correlation between the two variables.

2) Calculation of the Φ-coefficient
The Φ-coefficient is one possible correlation coefficient, calculated
from the cell frequencies and the sums per column and per row:

Φ = (F11 · F22 – F12 · F21) / √(F1. · F2. · F.1 · F.2)

The Φ-coefficient ranges from
Φ = 0   no correlation   to
Φ = 1   perfect correlation
Correlations between nominal-scaled variables

3) Calculation of expected frequencies


In order to compare the observed frequencies with those frequencies
which we would expect if the null hypothesis is true, we first have to
calculate the expected frequencies E.
x1 x2

y1 E11 = F.1 x F1. / F.. E12 = F.2 x F1. / F..


y2 E21 = F.1 x F2. / F.. E22 = F.2 x F2. / F..

4) Calculation of the test statistic

χ² = Σi Σj (Fij – Eij)² / Eij          d.f. = (k – 1) · (r – 1)

k = number of columns; r = number of rows

Yates correction for continuity for 2 x 2 tables:

χ² = Σi Σj (|Fij – Eij| – 0.5)² / Eij
Correlations between nominal-scaled variables

5) Calculate P(χ²) = P(χ² ≥ χ²emp) (the tail probability) from the calculated
χ² and d.f.. Then compare to α.

Or as an alternative: calculate χ²crit (critical value) from α and d.f.,
then compare the calculated χ² to χ²crit.

Decision:
1) if P(χ²) ≥ α (equivalent to: χ² ≤ χ²crit)   do not accept HA; do
   not accept H0, either. You were not able to detect any significant
   correlation.
2) if P(χ²) < α (equivalent to: χ² > χ²crit)   accept HA; reject H0.
   There is a significant correlation between the two variables.

6) Stating results from a Φ-correlation analysis
Give the calculated Φ-correlation coefficient, the total number of
observations (n = F..) and the calculated P (if not significant) or the
significance level α (if significant).
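
A minimal R sketch for a hypothetical 2 x 2 contingency table; chisq.test()
applies the Yates correction to 2 x 2 tables by default, and Φ can be computed
from the uncorrected χ²:

tab <- matrix(c(30, 10, 15, 45), nrow = 2,
              dimnames = list(Y = c("y1", "y2"), X = c("x1", "x2")))   # invented counts
chisq.test(tab)                                    # with Yates continuity correction
chi2 <- chisq.test(tab, correct = FALSE)$statistic
sqrt(chi2 / sum(tab))                              # Φ = sqrt(χ² / n) for a 2 x 2 table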
Correlations between nominal-scaled variables

Kramer coefficient C
is another correlation coefficient for contingency tables, based on the
χ²-values:

C = √( χ² / (χ² + n) )

This coefficient is not standardised between 0 and 1. To do so, one has to
calculate the maximum possible C-value, which is given by

Cmax = √( (k – 1) / k )

The standardised coefficient is then obtained from

Cstand = C / Cmax
Rank correlation coefficients

must/can be used in several situations:


• At least one of the variables is ordinal-scaled and none of them is
nominal.
• Both variables are metric but not normally distributed.
• The relationship is not linear but monotonically increasing or
decreasing.

The procedure is similar to non-parametric tests for differences between


groups, i.e. it is based on ranks. Therefore, metric data have to be
transformed into ranks.
Spearman´s rank correlation rs

1) Setting up hypotheses
The (unknown) population correlation coefficient is often denoted by
ρs; it has to be estimated from the observed correlation coefficient rs.
H0: ρs = 0   No correlation between the two variables.
HA: ρs ≠ 0   Correlation between the two variables.

2) Calculation of rs
Each variable is ranked separately and for each object i the squared
difference (di²) between the rank of X and the rank of Y is computed.

rs = 1 – [ 6 · Σ (i = 1 to n) di² ] / [ n · (n² – 1) ]

rs = –1   perfect negative (inverse) correlation
rs = 0    no correlation
rs = +1   perfect positive correlation
Spearman´s rank correlation rs

3) The critical values
are tabulated for rs and have to be looked up in the respective tables.

4) Decision
1) if P(rs) ≥ α (equivalent to: |rs| ≤ rs,crit)   do not accept HA; do not
   accept H0, either. You were not able to detect any significant
   correlation.
2) if P(rs) < α (equivalent to: |rs| > rs,crit)   accept HA; reject H0. There
   is a significant correlation between the two variables.
5) Stating results
calculated correlation coefficient rs, the total number of observations
(n = number of objects) and the calculated P (if not significant) or the
significance level (if significant).
Product-moment correlation after Pearson

requires two metric-scaled variables, both of which have to be normally


distributed and the relationship is assumed to be linear.

Variations of X and Y:

Variance of variable X:  sx² = Σ (i = 1 to n) (xi – x̄)² / (n – 1)

Variance of variable Y:  sy² = Σ (i = 1 to n) (yi – ȳ)² / (n – 1)

Covariation = mutual variation of two variables, measured by the covariance:

Cov(X, Y) = Σ (i = 1 to n) (xi – x̄) · (yi – ȳ)

Covariance = mean value of the cross-product of the deviations of X and
Y from their mean values:

sxy = (1/n) · Σ (i = 1 to n) (xi – x̄) · (yi – ȳ)
Product-moment correlation after Pearson

Note that the covariance of a variable with itself equals the variance. The
covariance is not standardised (between –1 and +1). Instead, this
measure depends on the units that X and Y are measured in.

Standardised covariance = product-moment correlation r:

r = Cov(X, Y) / √( Var(X) · Var(Y) ) = Σ (xi – x̄)(yi – ȳ) / √[ Σ (xi – x̄)² · Σ (yi – ȳ)² ]

r = –1   perfect negative, linear correlation
r = 0    no linear correlation
r = +1   perfect positive, linear correlation

ρ (rho) refers to the correlation coefficient of the population:
ρ = average of [ (X – µx)/σx · (Y – µy)/σy ]
Product-moment correlation after Pearson

1) Setting up hypotheses
H0: ρ = 0   No linear correlation between the two variables X and Y.
HA: ρ ≠ 0   Linear correlation between the two variables.

2) Calculation of the correlation coefficient r

r = Σ (xi – x̄)(yi – ȳ) / √[ Σ (xi – x̄)² · Σ (yi – ȳ)² ]

The standard error sr of r is obtained by

sr = √[ (1 – r²) / (n – 2) ]
Product-moment correlation after Pearson

3) Calculation of significance
a) t-statistic

t = \frac{r}{s_r}        d.f. = n − 2;   t_{crit} = t_{α/2; d.f.}

Decision:
1) if P(t) ≥ α/2 (equivalent to: t ≤ t_crit) do not accept HA; do not accept H0, either. You were not able to detect any significant correlation.
2) if P(t) < α/2 (equivalent to: t > t_crit) accept HA; reject H0. There is a significant correlation between the two variables.
Product-moment correlation after Pearson

4) Calculation of significance
b) F-statistic

F = \frac{1 + r}{1 - r}        d.f._1 = n − 2;   d.f._2 = n − 2;   F_{crit} = F_{α/2; d.f._1; d.f._2}

Decision:
1) if P(F) ≥ α/2 (equivalent to: F ≤ F_crit) do not accept HA; do not accept H0, either. You were not able to detect any significant correlation.
2) if P(F) < α/2 (equivalent to: F > F_crit) accept HA; reject H0. There is a significant correlation between the two variables.
Product-moment correlation after Pearson

5) Stating results
State the calculated correlation coefficient r, the total number of observations (n = number of objects) and the calculated P (if not significant) or the significance level α (if significant).
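A minimal Python sketch of r, its standard error and the t-based significance described above; scipy is assumed and the data are invented:

```python
import numpy as np
from scipy import stats

# Hypothetical metric data for two (approximately) normally distributed variables
x = np.array([3.1, 4.2, 5.0, 6.3, 7.8, 8.4, 9.9, 11.2])
y = np.array([1.2, 1.9, 2.1, 2.8, 3.3, 3.1, 4.0, 4.6])

n = len(x)
r, p = stats.pearsonr(x, y)            # r and its two-tailed P directly

# The same test done "by hand" with the formulae from the slides
s_r = np.sqrt((1 - r**2) / (n - 2))    # standard error of r
t = r / s_r                            # t-statistic with d.f. = n - 2
p_manual = 2 * stats.t.sf(abs(t), df=n - 2)

print(f"r = {r:.3f}, t = {t:.2f}, d.f. = {n - 2}, P = {p_manual:.4f} (scipy: {p:.4f})")
```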
Potential pitfalls in correlation analysis

a) Heterogeneity of the data
Because of different groups within the data set, correlation analysis may lead to wrong results.

Example: fish weight (Y) plotted against fish length (X) for a data set that contains several distinct groups (figure).
Potential pitfalls in correlation analysis

b) Set-subset (e.g. body weight – liver weight)
A set and its subset are not independent in a statistical sense and inevitably have to be highly correlated.

c) Correlation induced by a third variable
Sometimes two variables (X, Y) seem to be correlated. In fact, X and Y may not be directly related; the correlation may instead be induced by a third variable Z ("lurking variable"). The influence of this "hidden" variable is often difficult to detect (diagram: Z influences both X and Y).
Potential pitfalls in correlation analysis

Partial correlation
If the lurking variable Z is known (or measured), its influence may be removed to obtain the correlation between the remaining variables of interest X and Y:

r_{XY/Z} = \frac{r_{XY} - r_{XZ} \, r_{YZ}}{\sqrt{(1 - r_{XZ}^2)(1 - r_{YZ}^2)}}

where r_{XY/Z} is the partial correlation between X and Y without the influence of variable Z.
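A Python sketch of the partial-correlation formula; the data are simulated so that Z drives both X and Y:

```python
import numpy as np
from scipy.stats import pearsonr

# Simulated example: Z influences both X and Y, inducing an X-Y correlation
rng = np.random.default_rng(1)
z = rng.normal(size=50)
x = 0.8 * z + rng.normal(scale=0.5, size=50)
y = 0.7 * z + rng.normal(scale=0.5, size=50)

r_xy = pearsonr(x, y)[0]
r_xz = pearsonr(x, z)[0]
r_yz = pearsonr(y, z)[0]

# Partial correlation of X and Y with the influence of Z removed
r_xy_z = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))
print(f"r_xy = {r_xy:.2f}, partial r_xy.z = {r_xy_z:.2f}")
```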
Simple linear regression

Investigation of the direct effect of one variable X (independent or explanatory variable) on a variable Y (dependent or response variable).

Therefore, we presume a direction of this relationship, i.e. a functional dependency.

In graphical representations of regressions, the independent variable is always given on the x-axis (abscissa) and the dependent variable on the y-axis (ordinate).
Simple linear regression

Example: wing lengths of sparrows at various times after hatching (from Zar 1984)

Age X (days):         3.0, 4.0, 5.0, 6.0, 8.0, 9.0, 10.0, 11.0, 12.0, 14.0, 15.0, 16.0, 17.0
Wing length Y (cm):   1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0

(Scatterplot: wing length (cm) against age (days).)
Simple linear regression

Our objective is to fit a line whose equation is of the form:

Y = \alpha + \beta X + \varepsilon        (expected values of Y: \hat{Y} = \alpha + \beta X)

α and β … the regression coefficients of the population, which are estimated by a and b from our sample
ε … the error (residuals)
Simple linear regression

Estimation:

Y = a + bX + e,   with residuals e = y − ŷ and the sum of squared residuals Σe² = Σ(y − ŷ)²

a … intercept (point of intersection of the linear regression line with the y-axis)
b … slope of the linear regression = Δy / Δx; the change in Y that accompanies a unit change in X

positive slope: increase of Y with an increase of X
negative slope: decrease of Y with an increase of X
Simple linear regression

(Figure: the fitted regression line through the sparrow data, wing length (cm) against age (days), with the intercept a and the slope b = Δy/Δx indicated on the plot.)
Simple linear regression

Estimation:
a and b are selected so that the sum of squared residuals is minimised:

\sum e^2 = \sum (y - \hat{y})^2 \rightarrow \min

This is the criterion of ordinary least squares (OLS), and this type of regression analysis is the OLS regression. The criterion is met by the formulae:

b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}

a = \bar{y} - b \, \bar{x}
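A minimal Python sketch applying these formulae to the sparrow data listed above (numpy assumed); the printed coefficients should agree with the values reported further below:

```python
import numpy as np

# Sparrow data from the example (Zar 1984)
age  = np.array([3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17], dtype=float)
wing = np.array([1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0])

x_mean, y_mean = age.mean(), wing.mean()

# OLS estimates of slope b and intercept a
b = np.sum((age - x_mean) * (wing - y_mean)) / np.sum((age - x_mean) ** 2)
a = y_mean - b * x_mean

print(f"wing length = {a:.3f} + {b:.3f} * age")
```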
Simple linear regression

1) State hypotheses
If b = 0, Y would not depend on X, because Y would not change with changing X (the regression would result in a more or less horizontal line). Therefore, we have to test whether the slope b is significantly different from 0.

H0: β = 0  No linear dependence of variable Y on variable X.
HA: β ≠ 0  Significant linear dependence.
Simple linear regression

2) ANOVA procedure
The overall significance of the model is tested by an ANOVA procedure.

Total SS = sum of the squared total deviations:

SS_t = \sum_{i=1}^{n} (y_i - \bar{y})^2        d.f._t = n − 1        MS_t = \frac{SS_t}{d.f._t}

Regression SS (= variance explained by the model) = linear regression sum of squares:

SS_{reg} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2        d.f._{reg} = 1        MS_{reg} = SS_{reg}

Residual SS (= unexplained variance) = the error term e:

SS_{res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = SS_t − SS_{reg}        d.f._{res} = d.f._t − d.f._{reg} = n − 2        MS_{res} = \frac{SS_{res}}{d.f._{res}}
Simple linear regression

(Figure: decomposition of the total deviation y_i − ȳ into the explained part ŷ_i − ȳ and the residual y_i − ŷ_i, where ŷ_i is the estimated y_i-value on the regression line.)

\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
Simple linear regression

2) ANOVA procedure
ANOVA table in simple regression analysis:

Source of variation | Sum of squares           | d.f.  | Mean square                 | F_emp            | P
Regression          | SS_reg = Σ(ŷ_i − ȳ)²     | 1     | MS_reg = SS_reg             | MS_reg / MS_res  | P(F_emp)
Residuals           | SS_res = Σ(y_i − ŷ_i)²   | n − 2 | MS_res = SS_res / d.f._res  |                  |
Total               | SS_t = Σ(y_i − ȳ)²       | n − 1 | MS_t = SS_t / d.f._t        |                  |
Simple linear regression

3) Calculation of the test statistic F
We test the ratio of the explained variance to the unexplained variance:

TS = F_{emp} = \frac{MS_{reg}}{MS_{res}}        d.f._{reg} = 1,   d.f._{res} = n − 2

4) Comparison with the critical F-value at a certain significance level
If F_emp ≤ F_crit, then F_emp is a quite likely outcome under a true null hypothesis. We therefore cannot accept HA.
If F_emp > F_crit, then F_emp is a highly unlikely (i.e., less probable than α) value under a true null hypothesis, and we therefore decide to reject H0 and accept HA.
As an alternative we can also calculate the tail probability P(F_emp) = P(F ≥ F_emp) and compare it to the significance level α.
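A self-contained Python sketch of this ANOVA and F-test for the sparrow data (scipy assumed; the OLS fit is obtained with numpy's polyfit instead of the hand formulae):

```python
import numpy as np
from scipy import stats

# Sparrow data and OLS fit (as in the previous sketch)
age  = np.array([3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17], dtype=float)
wing = np.array([1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0])
n = len(wing)
b, a = np.polyfit(age, wing, deg=1)      # slope and intercept

# ANOVA quantities
y_hat    = a + b * age
ss_total = np.sum((wing - wing.mean()) ** 2)
ss_reg   = np.sum((y_hat - wing.mean()) ** 2)
ss_res   = np.sum((wing - y_hat) ** 2)

ms_reg = ss_reg / 1            # d.f._reg = 1
ms_res = ss_res / (n - 2)      # d.f._res = n - 2

F  = ms_reg / ms_res
p  = stats.f.sf(F, 1, n - 2)   # tail probability P(F >= F_emp)
r2 = ss_reg / ss_total         # coefficient of determination (see below)

print(f"F(1, {n - 2}) = {F:.1f}, P = {p:.2e}, r2 = {r2:.3f}")
```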
Simple linear regression

5) Quality of the regression fit
The proportion (or percentage) of the total variation of Y that is explained, or accounted for, by the fitted regression is termed the coefficient of determination r², which measures the strength of the straight-line relationship:

r^2 = \frac{SS_{reg}}{SS_t}

In simple regression this is equal to the squared product-moment correlation coefficient.

r² = 0  no fit of the regression model; no variance is explained by the model
r² = 1  perfect fit of the model; the whole variance in the data is explained by the model (i.e. all data points lie exactly on the line)
Simple linear regression

6) Standard errors (S.E.) of the regression coefficients
The standard deviation of Y for given values of X is called s_{y.x}:

s_{y.x} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}}

The estimated values ŷ are obtained from the regression model:

\hat{y} = a + b x

The standard errors of a (S.E._a) and of b (S.E._b) are given by:

S.E._b = \frac{s_{y.x}}{\sqrt{\sum (x_i - \bar{x})^2}}

S.E._a = s_{y.x} \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}}
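The same quantities for the sparrow example, sketched in Python as a self-contained snippet:

```python
import numpy as np

# Sparrow data and OLS fit (as above)
age  = np.array([3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17], dtype=float)
wing = np.array([1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0])
n = len(wing)
b, a = np.polyfit(age, wing, deg=1)
y_hat = a + b * age

# Residual standard deviation of Y for given X
s_yx = np.sqrt(np.sum((wing - y_hat) ** 2) / (n - 2))

sxx  = np.sum((age - age.mean()) ** 2)
se_b = s_yx / np.sqrt(sxx)                               # standard error of the slope
se_a = s_yx * np.sqrt(1 / n + age.mean() ** 2 / sxx)     # standard error of the intercept

print(f"a = {a:.3f} +/- {se_a:.3f}, b = {b:.3f} +/- {se_b:.3f}")
```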
Simple linear regression

7) Stating the results
First, the type of model must be stated. The best way is to give the formula, with the regression coefficients a and b together with their standard errors. Further necessary results are the coefficient of determination r², the F_emp-value with the degrees of freedom and the significance.

Example:
The linear regression between wing length and age is highly significant (Y = 0.713 (±0.148) + 0.27 (±0.013) X; r² = 0.973; F_{1, 11} = 401.1; P < 0.001).
Simple linear regression

Assumptions of the regression analysis
1) normal distribution of the X and Y variables
2) in the population there exists a normal distribution of Y values for any value of X
3) the variances of these population distributions of Y must be equal to one another = homogeneity of variances or homoscedasticity (in contrast to heterogeneity of variances = heteroscedasticity)
4) the errors in Y are assumed to be additive = additivity
5) the values of Y have to be independent
6) measurements of X must be obtained without error (often impossible; then we assume that the errors in X are negligible, or at least small compared with the measurement errors in Y)

Regression statistics are known to be robust to violations of at least some of the underlying assumptions; therefore, as long as violations are not too severe, they are not of concern.
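As a rough, informal sketch (Python, repeating the sparrow data for completeness), some of these assumptions can be eyeballed via the residuals; note that with n = 13 such checks have little power:

```python
import numpy as np
from scipy import stats

# Sparrow data, OLS fit and residuals (as above)
age  = np.array([3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17], dtype=float)
wing = np.array([1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0])
b, a = np.polyfit(age, wing, deg=1)
fitted = a + b * age
residuals = wing - fitted

# Normality of the residuals (Shapiro-Wilk test)
w, p_norm = stats.shapiro(residuals)

# Homoscedasticity: rank correlation of |residuals| with the fitted values
r_het, p_het = stats.spearmanr(np.abs(residuals), fitted)

print(f"Shapiro-Wilk P = {p_norm:.3f}; |residual| vs fitted: rs = {r_het:.2f} (P = {p_het:.3f})")
```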
Simple linear regression

Linearisation
In order to conduct a linear regression analysis with data which do not show a linear dependency, the data can be linearised.
In the following figure, some possible linearisation procedures are given. Each panel shows a curved relationship between X and Y together with transformations that may straighten it: instead of x one may use x², x³, log(x) or −1/x, or instead of y one may use y², y³, log(y) or −1/y; which transformation is appropriate depends on the direction of curvature.
Simple linear regression

Linearisation, examples
Logarithmic function
(Figures: Y against X on the original scale of X and on a logarithmic scale of X; with the logarithmic X-scale the relationship appears linear.)

The logarithmic function is given by   y = a + b \ln x
Simple linear regression

Linearisation, examples
Exponential function
(Figures: Y against X on the original scale of Y and on a logarithmic scale of Y; with the logarithmic Y-scale the relationship appears linear.)

The exponential function is given by   y = a \cdot e^{bx}
Simple linear regression

Linearisation, examples
Linearisation of the exponential function:

y = a \cdot e^{bx};   taking the natural logarithm gives   \ln y = \ln a + b x

(Figure with logarithmic values on the y-axis: plotting ln Y, or Y on a logarithmic axis, against X turns the exponential curve into a straight line.)
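A Python sketch of this linearisation, with invented data that follow an exponential trend; the log-transformed response is then fitted by an ordinary linear regression:

```python
import numpy as np

# Invented data roughly following y = a * exp(b * x) with multiplicative noise
rng = np.random.default_rng(0)
x = np.linspace(1, 17, 20)
y = 2.0 * np.exp(0.3 * x) * rng.lognormal(sigma=0.05, size=x.size)

# Linearise: ln y = ln a + b * x, then fit a straight line
b_fit, ln_a_fit = np.polyfit(x, np.log(y), deg=1)
a_fit = np.exp(ln_a_fit)

print(f"fitted exponential: y = {a_fit:.2f} * exp({b_fit:.3f} * x)")
```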
Simple linear regression

Linearisation, examples
Power functions
(Figures: Y against X on the original scales of X and Y and on logarithmic scales of X and Y; with both axes logarithmic the relationship appears linear.)

The power function is given by   y = a \cdot x^{b}
Simple linear regression

Linearisation, examples
Linearisation of the power function:

y = a \cdot x^{b};   taking the natural logarithm gives   \ln y = \ln a + b \ln x

(Figure with logarithmic values on both the x- and the y-axis: plotting ln Y against ln X, or using logarithmic scales on both axes, turns the power function into a straight line.)
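The analogous Python sketch for the power function, again with invented data; both variables are log-transformed before the linear fit:

```python
import numpy as np

# Invented data roughly following y = a * x**b with multiplicative noise
rng = np.random.default_rng(0)
x = np.linspace(1, 800, 30)
y = 1.5 * x ** 0.65 * rng.lognormal(sigma=0.05, size=x.size)

# Linearise: ln y = ln a + b * ln x, then fit a straight line
b_fit, ln_a_fit = np.polyfit(np.log(x), np.log(y), deg=1)
a_fit = np.exp(ln_a_fit)

print(f"fitted power function: y = {a_fit:.2f} * x**{b_fit:.2f}")
```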