Biostatistics Delft2009 Presentation
Biostatistics
Mag.Dr. Christian Fesl and Mag. Gabriel Singer
Contents
Basic definitions
Graphs
Data and data matrix
Data description: measures of location and variation
Empirical and theoretical frequency distributions
Normal distribution and standard normal distribution
The Central Limit Theorem
Confidence intervals
Accuracy and precision; estimation of sample size
Statistical decision theory
Statistical tests: t-test, U-test,...
Analysis of variance (ANOVA): 1-way, 2-way
Correlations and regressions
Design (Sampling, Experiment)
Presentation of results in text, table, graphs
Basic definitions
[Figure: population with samples; 1 sample unit]
Probability theory
Probability P: scale from 0 to 1
• Independent events
no influence on each other
P(A ∩ B) = P(A) · P(B)
Example: A man and a woman each have a pack of 52 playing cards. Find
the probability that they (i) each and (ii) both draw the ace of clubs.
• Addition rule (mutually exclusive events)
P that event E1 or E2 or ... or En occurs:
P(E1 ∪ E2 ∪ ... ∪ En) = P(E1) + P(E2) + ... + P(En)
• Multiplication rule (independent events)
P that event E1 and E2 and ... and En occurs:
P(E1 ∩ E2 ∩ ... ∩ En) = P(E1) · P(E2) · ... · P(En)
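The card example can be checked with a short sketch (variable names are illustrative, not from the slides):

```python
# Independent events: P(A and B) = P(A) * P(B)
p_ace = 1 / 52          # P(one person draws the ace of clubs)
p_each = p_ace          # (i) each person separately: 1/52
p_both = p_ace * p_ace  # (ii) both draw it: multiplication rule
```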
• Function
If values of variable X correspond with values of variable Y, there is a
functional dependence:
Y = f(X); Y = dependent variable, X = independent variable
e.g. y = f(x) = a + bx
Mathematical terms and notation
• Logarithm
logA(x) = y  ⟺  A^y = x
A = base
x = numerus (antilogarithm)
y = logarithm
• Abscissa (x-axis)
• Ordinate (y-axis)
• Origin
Graphs (= charts, diagrams, plots)
(a) Bar chart / column graph, with variation (e.g. confidence intervals)
(b) [chart type; label lost in extraction]
(c) [chart type; label lost in extraction]
(d) Pie chart
(e) Box-(whisker)-plot
[Figure: example panels (a)-(e); numeric axis residue removed]
Data and data matrix

Objects   Variable X
O1        x1
O2        x2
:         :
Oi        xi
:         :
On        xn

n = sample size (number of objects)

With several variables per object (data matrix):

Objects   X1     X2
:         :      :
Oi        xi1    xi2
:         :      :
On        xn1    xn2
Statistics:
Absolute frequency F
Relative frequency f = F / n (proportion)
n = total number of objects
Mode x* = most frequent value

Nominal scale
Eye colour   Counts       Fi    fi
x1 (green)   IIII          4    0.20
x2 (blue)    II            2    0.10
x3 (brown)   IIIII IIII    9    0.45
x4 (grey)    IIIII         5    0.25
Sum                       20    1.00
“Barchart“ of F or f
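The tally table above can be reproduced with the standard library (the raw observations are reconstructed from the tallies):

```python
from collections import Counter

# Eye-colour observations matching the tallies on the slide
observations = ["green"] * 4 + ["blue"] * 2 + ["brown"] * 9 + ["grey"] * 5
n = len(observations)

F = Counter(observations)            # absolute frequencies F
f = {k: F[k] / n for k in F}         # relative frequencies f = F / n
mode = F.most_common(1)[0][0]        # mode x* = most frequent value
```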
Operations: Ranking
Statistics: cumulative F, cumulative f
[Figure: absolute (F) and relative (f) frequencies with cumulative curves, for education level (no ed, elem, work, high, univ)]
[Figure: frequency distribution with Q1, Md, Q2 marked]

No. of trees in plot    F     f      Cumulative f
1                       0     0      0
2                       1     0.025  0.025
3                       2     0.05   0.075
4                       4     0.1    0.175
5                       5     0.125  0.3
6                       8     0.2    0.5
7                      10     0.25   0.75
8                       5     0.125  0.875
9                       3     0.075  0.95
10                      2     0.05   1
11                      0     0      1
Sum                    40     1

large sample space and large sample size → approximation by continuous distributions
Metric scale – Continuous variables
Raw data
classes (= consecutive categories)
frequency distribution
Weight (kg) Abs. frequency (F) Rel. f Cumulative f Class center
45 - <50 0 0/100= 0.0 0 47.5
50 - <55 3 3/100= 0.03 0.03 52.5
55 - <60 13 13/100=0.13 0.16 57.5
60 - <65 20 20/100=0.20 0.36 62.5
65 - <70 33 33/100=0.33 0.69 67.5
70 - <75 25 25/100=0.25 0.94 72.5
75 - <80 5 5/100= 0.05 0.99 77.5
80 - <85 1 1/100= 0.01 1 82.5
Sum 100 1.00
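The class-frequency procedure above can be sketched in a few lines; the weights here are hypothetical, only the binning into 5-kg classes follows the table:

```python
# Hypothetical raw weights; classes 45-<50, 50-<55, ... as in the table
weights = [52.1, 57.3, 58.0, 61.2, 63.5, 66.8, 67.1, 68.0, 71.4, 74.9]
edges = list(range(45, 90, 5))                      # lower class limits

F = [sum(lo <= w < lo + 5 for w in weights) for lo in edges]   # abs. frequency
n = len(weights)
rel = [Fi / n for Fi in F]                          # relative frequency
cum = [sum(rel[:i + 1]) for i in range(len(rel))]   # cumulative frequency
centers = [lo + 2.5 for lo in edges]                # class centers
```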
Metric scale – Continuous variables
Raw data → classes (= consecutive categories) → frequency distribution
→ bar chart without gaps = histogram
Equal class widths: f = bar height
Unequal class widths: f = bar area!
[Figure: histogram over classes]
Metric scale – Continuous variables
Statistics:
Arithmetic mean:  x̄ = (1/n) · Σ_{i=1}^{n} xi
Standard deviation s (variance s²):  s = √( (1/(n−1)) · Σ_{i=1}^{n} (xi − x̄)² )
Coefficient of variation:  C.V. = (s / x̄) · 100
Geometric mean:  x_g = ( Π_{i=1}^{n} xi )^(1/n)
Sample statistics
x*, x̃ (= p50 = Q2), x̄, x_g, x_w
Weighted mean:  x_w = (f1·x1 + f2·x2 + ... + fn·xn) / (f1 + f2 + ... + fn)
Arithmetic mean:  x̄ = (1/n) · Σ_{i=1}^{n} xi = (1/n)(x1 + x2 + ... + xn)
Geometric mean:  x_g = (x1 · ... · xn)^(1/n)  or  x_g = antilog( (1/n) · Σ_{i=1}^{n} ln(x) )
For ln(x+1)-transformed values:  x_g = antilog( (1/n) · Σ_{i=1}^{n} ln(x + 1) ) − 1
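These statistics are all available in Python's standard library; the sample below is hypothetical:

```python
import statistics as st

x = [2.0, 4.0, 4.0, 5.0, 5.0, 8.0]    # hypothetical sample

mean = st.mean(x)                      # arithmetic mean
s = st.stdev(x)                        # standard deviation (n - 1 denominator)
cv = s / mean * 100                    # coefficient of variation in %
gmean = st.geometric_mean(x)           # geometric mean (nth root of the product)
```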
Sample statistics
Median x̃, for even n:  x̃ = ½ · ( x_(n/2) + x_(n/2 + 1) )
50% of the values are below, 50% are above the median.
Minimum, maximum
Sample statistics
Skewness Sk
Example: distribution skewed to the right:  x* < x̃ < x̄

Sample statistics
Kurtosis K
K = (Q3 − Q1) / (2 · (p90 − p10))    (K_ND = 0.263)
Leptokurtic
Platykurtic: negative kurtosis
Sample statistics and population parameters
[Figure: samples drawn from a population]
Normal distribution
Unbiased estimator
Biased estimator
e.g. using (1/n) · Σ_{i=1}^{n} (xi − x̄)² to calculate the sample variance
Empirical + theoretical distributions (classes)

Discrete variables:
Binomial distribution:  P(X = x) = (n choose x) · p^x · q^(n−x),  E(X) = np,  Var(X) = npq
Poisson distribution:   P(X = i) = e^(−λ) · λ^i / i!,  E(X) = Var(X) = λ

Continuous variables:
Normal distribution:     f(x) = 1/(σ√(2π)) · exp( −½ · ((x − µ)/σ)² ),  E(X) = µ,  Var(X) = σ²
Log-normal distribution: f(x) = 1/(xσ√(2π)) · exp( −½ · ((ln x − µ)/σ)² ),
E(X) = exp(µ + σ²/2),  Var(X) = exp(2µ + σ²) · (exp(σ²) − 1)
Frequency distributions
Discontinuous distributions
Continuous distributions
• t-distribution
• F-distribution
Normal distribution
• Smooth
• Bell shaped
[Figure: probability density (relative frequency) of x over classes]
Standard normal distribution
Centering:  x'i = xi − µ
Standardising:  zi = (xi − µ) / σ
z-values: standardised values with µ = 0 and σ = 1, according to the formula:
Z = (X − µ) / σ
Standard normal distribution
Standard normal PDF:
f(z) = 1/√(2π) · exp(−z²/2)
[Figure: curve over z from −3 to 3]
Standard normal distribution
Area under the curve = integral of the standard normal PDF
= cumulative standard normal distribution function
= probability to find z within a definite range
Within ±1: 68.27%, within ±2: 95.45%, within ±3: 99.73% of the area.
z-value p (z) z-value p (z) z-value p (z) z-value p (z) z-value p (z) z-value p (z)
0.00 0.50000 0.50 0.69146 1.00 0.84134 1.50 0.93319 2.00 0.97725 2.50 0.99379
0.01 0.50399 0.51 0.69497 1.01 0.84375 1.51 0.93448 2.01 0.97778 2.51 0.99396
0.02 0.50798 0.52 0.69847 1.02 0.84614 1.52 0.93574 2.02 0.97831 2.52 0.99413
0.03 0.51197 0.53 0.70194 1.03 0.84849 1.53 0.93699 2.03 0.97882 2.53 0.99430
0.04 0.51595 0.54 0.70540 1.04 0.85083 1.54 0.93822 2.04 0.97932 2.54 0.99446
0.05 0.51994 0.55 0.70884 1.05 0.85314 1.55 0.93943 2.05 0.97982 2.55 0.99461
0.06 0.52392 0.56 0.71226 1.06 0.85543 1.56 0.94062 2.06 0.98030 2.56 0.99477
0.07 0.52790 0.57 0.71566 1.07 0.85769 1.57 0.94179 2.07 0.98077 2.57 0.99492
0.08 0.53188 0.58 0.71904 1.08 0.85993 1.58 0.94295 2.08 0.98124 2.58 0.99506
0.09 0.53586 0.59 0.72240 1.09 0.86214 1.59 0.94408 2.09 0.98169 2.59 0.99520
0.10 0.53983 0.60 0.72575 1.10 0.86433 1.60 0.94520 2.10 0.98214 2.60 0.99534
0.11 0.54380 0.61 0.72907 1.11 0.86650 1.61 0.94630 2.11 0.98257 2.61 0.99547
0.12 0.54776 0.62 0.73237 1.12 0.86864 1.62 0.94738 2.12 0.98300 2.62 0.99560
0.13 0.55172 0.63 0.73565 1.13 0.87076 1.63 0.94845 2.13 0.98341 2.63 0.99573
0.14 0.55567 0.64 0.73891 1.14 0.87286 1.64 0.94950 2.14 0.98382 2.64 0.99585
0.15 0.55962 0.65 0.74215 1.15 0.87493 1.65 0.95053 2.15 0.98422 2.65 0.99598
0.16 0.56356 0.66 0.74537 1.16 0.87698 1.66 0.95154 2.16 0.98461 2.66 0.99609
0.17 0.56749 0.67 0.74857 1.17 0.87900 1.67 0.95254 2.17 0.98500 2.67 0.99621
0.18 0.57142 0.68 0.75175 1.18 0.88100 1.68 0.95352 2.18 0.98537 2.68 0.99632
0.19 0.57535 0.69 0.75490 1.19 0.88298 1.69 0.95449 2.19 0.98574 2.69 0.99643
0.20 0.57926 0.70 0.75804 1.20 0.88493 1.70 0.95543 2.20 0.98610 2.70 0.99653
0.21 0.58317 0.71 0.76115 1.21 0.88686 1.71 0.95637 2.21 0.98645 2.71 0.99664
0.22 0.58706 0.72 0.76424 1.22 0.88877 1.72 0.95728 2.22 0.98679 2.72 0.99674
0.23 0.59095 0.73 0.76730 1.23 0.89065 1.73 0.95818 2.23 0.98713 2.73 0.99683
0.24 0.59483 0.74 0.77035 1.24 0.89251 1.74 0.95907 2.24 0.98745 2.74 0.99693
0.25 0.59871 0.75 0.77337 1.25 0.89435 1.75 0.95994 2.25 0.98778 2.75 0.99702
0.26 0.60257 0.76 0.77637 1.26 0.89617 1.76 0.96080 2.26 0.98809 2.76 0.99711
0.27 0.60642 0.77 0.77935 1.27 0.89796 1.77 0.96164 2.27 0.98840 2.77 0.99720
0.28 0.61026 0.78 0.78230 1.28 0.89973 1.78 0.96246 2.28 0.98870 2.78 0.99728
0.29 0.61409 0.79 0.78524 1.29 0.90147 1.79 0.96327 2.29 0.98899 2.79 0.99736
0.30 0.61791 0.80 0.78814 1.30 0.90320 1.80 0.96407 2.30 0.98928 2.80 0.99744
0.31 0.62172 0.81 0.79103 1.31 0.90490 1.81 0.96485 2.31 0.98956 2.81 0.99752
0.32 0.62552 0.82 0.79389 1.32 0.90658 1.82 0.96562 2.32 0.98983 2.82 0.99760
0.33 0.62930 0.83 0.79673 1.33 0.90824 1.83 0.96638 2.33 0.99010 2.83 0.99767
0.34 0.63307 0.84 0.79955 1.34 0.90988 1.84 0.96712 2.34 0.99036 2.84 0.99774
0.35 0.63683 0.85 0.80234 1.35 0.91149 1.85 0.96784 2.35 0.99061 2.85 0.99781
0.36 0.64058 0.86 0.80511 1.36 0.91308 1.86 0.96856 2.36 0.99086 2.86 0.99788
0.37 0.64431 0.87 0.80785 1.37 0.91466 1.87 0.96926 2.37 0.99111 2.87 0.99795
0.38 0.64803 0.88 0.81057 1.38 0.91621 1.88 0.96995 2.38 0.99134 2.88 0.99801
0.39 0.65173 0.89 0.81327 1.39 0.91774 1.89 0.97062 2.39 0.99158 2.89 0.99807
0.40 0.65542 0.90 0.81594 1.40 0.91924 1.90 0.97128 2.40 0.99180 2.90 0.99813
0.41 0.65910 0.91 0.81859 1.41 0.92073 1.91 0.97193 2.41 0.99202 2.91 0.99819
0.42 0.66276 0.92 0.82121 1.42 0.92220 1.92 0.97257 2.42 0.99224 2.92 0.99825
0.43 0.66640 0.93 0.82381 1.43 0.92364 1.93 0.97320 2.43 0.99245 2.93 0.99831
0.44 0.67003 0.94 0.82639 1.44 0.92507 1.94 0.97381 2.44 0.99266 2.94 0.99836
0.45 0.67364 0.95 0.82894 1.45 0.92647 1.95 0.97441 2.45 0.99286 2.95 0.99841
0.46 0.67724 0.96 0.83147 1.46 0.92785 1.96 0.97500 2.46 0.99305 2.96 0.99846
0.47 0.68082 0.97 0.83398 1.47 0.92922 1.97 0.97558 2.47 0.99324 2.97 0.99851
0.48 0.68439 0.98 0.83646 1.48 0.93056 1.98 0.97615 2.48 0.99343 2.98 0.99856
0.49 0.68793 0.99 0.83891 1.49 0.93189 1.99 0.97670 2.49 0.99361 2.99 0.99861
P(z) = P(Z ≤ z)
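The tabulated p(z) values are the standard normal CDF, which Python's standard library can reproduce:

```python
from statistics import NormalDist

Z = NormalDist(mu=0.0, sigma=1.0)   # standard normal distribution
p = Z.cdf(1.96)                     # P(Z <= 1.96), as in the table row z = 1.96
```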
The Central Limit Theorem
As sample size increases, the means of samples drawn from a population of any
distribution will approach the normal distribution.
S.E.M. = σ / √n
The central limit theorem
S.E.M. = σ / √n   ...decreases as sample size n increases!
z = (x̄ − µ) / (σ / √n)
[Figure: relative frequency of means; distribution of means narrow for high n, wider for low n; distribution of the original population = distribution of means with n = 1]
[Figure: z-standardised means x̄]
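A rough simulation (hypothetical uniform population, illustrative only) shows the S.E.M. shrinking with n:

```python
import random
import statistics as st

random.seed(1)
# Non-normal (uniform) population
population = [random.uniform(0, 10) for _ in range(10_000)]

def sem_of_means(n, reps=500):
    """Standard deviation of sample means for sample size n."""
    means = [st.mean(random.sample(population, n)) for _ in range(reps)]
    return st.stdev(means)

sem_small = sem_of_means(5)     # S.E.M. for small samples
sem_large = sem_of_means(50)    # S.E.M. for larger samples: smaller
```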
The t-distribution
σ unknown → S.E.M. has to be estimated:  S.E.M. = s / √n
t = (x̄ − µ) / (s / √n)
wider and flatter when n is low; t with infinite d.f. = z (standard normal)
[Figure: t-distributions for d.f. = 1, d.f. = 2 and d.f. = ∞ over standardised means]
[Figure: normal distribution vs. t-distribution]
The t-distribution
t-values are tabulated for different degrees of freedom and α.

Confidence interval (σ known):
P( −z_{α/2} ≤ (x̄ − µ)/(σ/√n) ≤ z_{α/2} ) = 1 − α
P( x̄ − z_{α/2}·σ/√n ≤ µ ≤ x̄ + z_{α/2}·σ/√n ) = 1 − α   →   µ = x̄ ± z_{α/2}·σ/√n
z = tabulated value from the standard normal distribution, depends on α
α = significance level; 1 − α = confidence / accuracy
Confidence interval (C.I.)
σ unknown: a random sample gives x̄ and s; x̄ is t-distributed with S.E.M. = s / √n
t = (x̄ − µ) / (s / √n)
P( x̄ − t_{α/2,d.f.}·s/√n ≤ µ ≤ x̄ + t_{α/2,d.f.}·s/√n ) = 1 − α   →   µ = x̄ ± t_{α/2,d.f.}·s/√n
[Figure: α/2 in each tail beyond −t_{α/2,d.f.} and +t_{α/2,d.f.}]
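A minimal sketch of this C.I., on a hypothetical sample; the critical t is hard-coded from a t-table (t with 9 d.f. at α = 0.05, two-sided):

```python
import statistics as st
from math import sqrt

x = [4.2, 5.1, 4.8, 5.5, 4.9, 5.0, 4.6, 5.2, 4.7, 5.3]   # hypothetical sample
n = len(x)
mean, s = st.mean(x), st.stdev(x)
t_crit = 2.262                    # t_{alpha/2, d.f.=9} from a t-table

sem = s / sqrt(n)                 # estimated S.E.M.
ci = (mean - t_crit * sem, mean + t_crit * sem)   # 95% C.I. for mu
```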
Accuracy (confidence) (1 − α) is the probability that the true mean of the
population lies within a given confidence interval.
Absolute precision:  G = t_{α/2,d.f.} · s / √n
Precision relative to the mean:  G' = t_{α/2,d.f.} · s / (x̄ · √n)
Calculation of the necessary sample size with predefined accuracy and
precision:
n = ( t_{α/2,d.f.} · s / (G' · x̄) )²
(The equation has to be solved iteratively, because n appears on both sides.)
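A first-pass sketch of the sample-size formula; for simplicity the critical value is the normal z = 1.96 (α = 0.05) instead of t, which is where the iteration with t_{α/2,n−1} would start. The pilot estimates are hypothetical:

```python
from math import ceil

s, mean = 10.0, 50.0     # hypothetical pilot estimates of s and x-bar
G_rel = 0.10             # desired relative precision (10% of the mean)
z = 1.96                 # starting value; iterate with t_{alpha/2, n-1} if needed

n = ceil((z * s / (G_rel * mean)) ** 2)
```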
Accuracy and precision
[Figure: confidence intervals around µ for combinations of accuracy (1 − α = 70% or 95%) and relative precision (G' = 5%, 20%, 50%); widening the interval decreases precision but increases accuracy]
Statistical decision theory
H0: Two populations do not differ. Thus, the two samples come in fact from
one underlying population, and any observed difference between the two
samples is entirely due to chance.
Statistical decision theory
Type I and Type II error
[Table: decisions under true H0 vs. true HA]
• Decision:
if TS_emp ≤ TS_crit: do not accept HA, nor H0
if TS_emp > TS_crit: accept HA; reject H0
Equivalently, via P:
if P ≥ α: do not accept HA, nor H0
if P < α: accept HA; reject H0
Significance levels
type I error α   probability of observed outcome under true H0   meaning               symbol
α = 5%           P ≥ 0.05                                        not significant       n.s.
α = 5%           P < 0.05                                        significant at 5%     *
α = 1%           P < 0.01                                        significant at 1%     **
α = 0.1%         P < 0.001                                       significant at 0.1%   ***
1) Testable hypotheses:
H0: The ‘new’ mouse belongs to the island population; its weight is
similar to that of the other island mice: x̄ ≤ µ0. Its relatively high weight is
entirely due to chance; it is just a slightly heavy mouse of the population.
HA: The ‘new’ mouse does not belong to the island population; its weight
is higher than that of the other island mice, so it must belong to some other
mouse population, say from the mainland: x̄ > µ0
One-sample test (sample vs. fixed „true“ value)
2) Calculate the test statistic:
TS = z = (x̄ − µ0) / (σ / √n),  then  P(Z ≥ z)
[Figure: sampling distribution under true H0; shaded area P(Z ≥ z) beyond the observed z]
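A sketch of this one-sample z-test with hypothetical mouse-weight numbers (the population σ is assumed known):

```python
from statistics import NormalDist
from math import sqrt

mu0, sigma = 20.0, 2.0     # hypothetical island population mean and s.d.
x_bar, n = 22.0, 4         # hypothetical sample mean and size

z = (x_bar - mu0) / (sigma / sqrt(n))   # TS = z
p = 1 - NormalDist().cdf(z)             # P(Z >= z), one-sided
```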
One-sample test (sample vs. fixed „true“ value)
3) Set a threshold for P to decide between H0 and HA (in advance!):
set limit α = significance level
Or: calculate the critical z from α: for α = 0.05, P(z) = 0.95 → critical z = 1.64
[Figure: distribution under true H0 with zcrit marked; areas P(Z ≥ z) and α]
One-sample test (sample vs. fixed „true“ value)
4) Decision: compare P(Z ≥ z) with α
[Figure: distributions under true H0 and true HA with threshold x̄crit; α = type I error, β = type II error, power = 1 − β]
One-sample test (sample vs. fixed „true“ value)
[Figure: island mice population (µ0) and mainland mice population (µ1); distributions of sample means for low n]
H0: The sample of drifting mice belongs to the island population. The
population mean µ estimated from the sample is equal to (or smaller than)
the µ0 of the island population: µ ≤ µ0
HA: µ > µ0 (one-sided)   or   HA: µ ≠ µ0 (two-sided)
[Figure: one-sided test with type I error α in one tail; two-sided test with α/2 in each tail]
1) Skewness
[Diagram: continuum of possible Sk values from Sk − 2·SE_Sk to Sk + 2·SE_Sk, with 0 inside the interval]
2) Histograms
[Figure: histograms (frequency vs. value) of VAR00001 and VAR00002]
3) Normal probability plots
[Figure: expected normal values vs. observed values for VAR00001 and VAR00002]
Check normal distribution
Test output:
           Statistic   df    Sig.
VAR00001   .049        100   .200*
VAR00002   .172        100   .000
VAR00001 is compatible with a normal distribution (P = 0.200); VAR00002 is not (P < 0.001).
[Figure: histograms of VAR00001 and VAR00002]
What for?
Ecological data
• Log-normal distribution can often be assumed
• Approximation of the normal distribution by use of logarithms
• In case of occurrence of zero values:  xT = ln(x + 1)
F-distribution
1. take two (!) samples from a population ND(µ, σ²)
2. calculate s1² (sample 1 with n1) and s2² (sample 2 with n2)
3. calculate the statistic:  F = s1² / s2²
4. repeat 1.–3. and build the distribution of F-values:
“F-distribution”
shape determined by d.f.1 = n1 − 1 and d.f.2 = n2 − 1
separate F-distribution for each combination of d.f.1 and d.f.2
F-distribution
[Figure: F-distributions F1,20, F5,25 and F25,5; critical region α = 0.05 beyond F ≈ 2.6]
F-test: checking variance homogeneity
H0: The sample variances estimate the same parametric variance: σ1² = σ2²
variance homogeneity = homoscedasticity
HA: The sample variances estimate different parametric variances: σ1² ≠ σ2²
variance heterogeneity = heteroscedasticity
α = 0.05
TS: variance ratio  Fs = s²max / s²min   (1-tailed test)
[Figure: F9,9 distribution with critical region]
Decision:
1) if P(F) ≥ α/2 (equivalent to: Fs ≤ Fcrit): assume variance homogeneity
2) if P(F) < α/2 (equivalent to: Fs > Fcrit): assume variance heterogeneity
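The variance ratio itself is a one-liner; the two samples below are hypothetical:

```python
import statistics as st

sample1 = [4.1, 5.2, 6.3, 5.8, 4.9]   # hypothetical data
sample2 = [5.0, 5.1, 4.9, 5.2, 5.0]

v1, v2 = st.variance(sample1), st.variance(sample2)
Fs = max(v1, v2) / min(v1, v2)        # larger variance on top, so Fs >= 1
```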
t-test (after Student) for independent samples
1) Check ND
2) Hypotheses
3) α = 0.05; two-sided test, TS = t
4) Test statistic t:
t = (x̄1 − x̄2) / S.E.(x̄1 − x̄2),   df = n1 + n2 − 2
5) Decision (two-sided):
tcrit = t_{α/2,d.f.}
t-test (after Student) for dependent samples
1) Check ND
2) Hypotheses
3) α = 0.05; two-sided test, TS = t
patient before after differences
X1 X2 X1-X2
Gandalf 6 4 2
Saruman 4 3 1
Arwen 7 5 2
Frodo 3 2 1
...
4) Test statistic t:
TS = t = (d̄ − 0) / (s_d / √n) = d̄ · √n / s_d     → a one-sample t-test with µ0 = 0 !!!
t-test (after Student) for dependent samples
5) Decision (two-sided):  df = n − 1,  tcrit = t_{α/2,d.f.}
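Using the before/after differences from the table above, the paired t works out as:

```python
import statistics as st
from math import sqrt

d = [2, 1, 2, 1]                       # X1 - X2 for Gandalf, Saruman, Arwen, Frodo
n = len(d)
d_bar, s_d = st.mean(d), st.stdev(d)

t = (d_bar - 0) / (s_d / sqrt(n))      # one-sample t-test with mu0 = 0
df = n - 1
```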
U-test (after Mann & Whitney)
• distribution-free / non-parametric
• ranks: sort all values (rank order) and number them sequentially
• replace each original variate by its rank (reduce data to ordinal scale)
• generally less powerful than parametric procedures
1) Hypotheses
H0: The two samples come from populations with identical “locations” (medians).
HA: The two samples come from populations which differ in location (median).
“Bonferroni”-correction (Dunn-Sidak):
α_t = 1 − (1 − α)^k
α = type I error (per comparison)
α_t = total error
k = number of comparisons
[Figure: failure/success outcomes across repeated comparisons]
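The formula above, and its inversion into a corrected per-test α, in a few lines:

```python
# Total type I error across k independent comparisons at per-test alpha
alpha, k = 0.05, 3
alpha_total = 1 - (1 - alpha) ** k          # Dunn-Sidak total error

# Inverting the formula gives the corrected per-test alpha
# that keeps the total error at 0.05:
alpha_per_test = 1 - (1 - 0.05) ** (1 / k)
```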
Hypotheses:
H0: The three groups are not different (come from the same population):  µ1 = µ2 = µ3
HA: At least one group differs from at least one other group (one comes
from a different population):  at least one µ differs
(one-way ANOVA)
Scheme of the analysis of variance
[Figure: x-values of three groups against the factor levels 1–3, with group means x̄1, x̄2, x̄3 and the grand mean x̄]
(one-way ANOVA)
Between-groups sum of squares (deviations of the group means x̄z from the grand mean x̄):
SSb(between) = n · Σ_{z=1}^{Z} (x̄z − x̄)²
Within-groups sum of squares (deviations of the values from their group mean):
SSw(within) = Σ_{z=1}^{Z} Σ_{i=1}^{n} (xiz − x̄z)²
[Figure: the deviations (x̄z − x̄)² and (xiz − x̄z)² illustrated for three groups]
(one-way ANOVA)
Total SS:  SSt(total) = Σ_{z=1}^{Z} Σ_{i=1}^{n} (xiz − x̄)²
= sum of the squared total deviations = measure of total variation
Explained SS = Between-groups SS:  SSb(between) = n · Σ_{z=1}^{Z} (x̄z − x̄)²
= sum of the squared deviations between groups = measure of group-to-group variation
Variation of the whole dataset is partitioned into two parts depending on origin!
(one-way ANOVA)
SSt = SSb + SSw
MSt = SSt / (Z·n − 1);  MSb = SSb / (Z − 1);  MSw = SSw / (Z·(n − 1))
5) Results: ANOVA-table
6) Post-hoc tests
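The SS partition above can be sketched directly; the three equal-sized groups are hypothetical:

```python
import statistics as st

groups = [[1, 2, 3], [2, 3, 4], [6, 7, 8]]      # hypothetical data, Z groups of n
Z, n = len(groups), len(groups[0])
grand = st.mean([x for g in groups for x in g]) # grand mean

SSb = n * sum((st.mean(g) - grand) ** 2 for g in groups)   # between groups
SSw = sum((x - st.mean(g)) ** 2 for g in groups for x in g) # within groups

MSb = SSb / (Z - 1)
MSw = SSw / (Z * (n - 1))
F = MSb / MSw
```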
3. H0: no interaction
(two-way ANOVA)
1) Partitioning the total sum of squares:
SSt partitioned into SSw and the between-groups components;
MSt = SSt / (S·F·n − 1),  MSw = SSw / (S·F·(n − 1))   (S, F = numbers of levels of the two factors)
Under a true H0 the mean squares are variances and estimate σ² of the (same)
population.
4) Calculate Fcrit under true H0 at significance α with df1 and df2 for each H0
Comparison of Fsex / Ffood / Finteraction with the corresponding Fcrit
6) Post-hoc tests
in case of a factor with > 2 levels: which group differs from which one?
[Figure: interaction plot for food 1 and food 2]
Correlations
describe the mutual variation of two variables and measure the degree to
which the variables are related. No functional dependence between the
variables is assumed.
[Figure: four scatter plots of Y against X]
a) Positive (= direct) correlation
b) Negative (= inverse) correlation
c) No correlation
d) Non-linear correlation
Correlations
The variable at the lowest level always determines the choice of the correlation
measure.
Correlations between nominal-scaled variables
- contingency tables
Contingency table
[Table: frequencies F for categories x1, x2 crossed with the categories of Y; Σ = sum, F = frequency]
Correlations between nominal-scaled variables
1) Setting up hypotheses
H0: No correlation between the two variables X and Y.
HA: Correlation between the two variables.
2) Calculation of the Φ-coefficient
The Φ-coefficient is one possible correlation coefficient, calculated
from the sums per column and per row.
4) Calculation of the test statistic
χ² = Σ_i Σ_j (F_ij − E_ij)² / E_ij,   d.f. = (k − 1)·(r − 1)
k = number of columns; r = number of rows
Yates correction for continuity for 2 × 2 tables:
χ² = Σ_i Σ_j ( |F_ij − E_ij| − 0.5 )² / E_ij
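The χ² statistic for a 2 × 2 table, with expected frequencies from the margins (the observed counts are hypothetical):

```python
# Observed frequencies F_ij (hypothetical 2 x 2 table)
F_obs = [[10, 20],
         [20, 10]]

row = [sum(r) for r in F_obs]             # row sums
col = [sum(c) for c in zip(*F_obs)]       # column sums
n = sum(row)

# Expected frequencies E_ij = row sum * column sum / n
E = [[row[i] * col[j] / n for j in range(2)] for i in range(2)]
chi2 = sum((F_obs[i][j] - E[i][j]) ** 2 / E[i][j]
           for i in range(2) for j in range(2))
df = (2 - 1) * (2 - 1)
```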
Correlations between nominal-scaled variables
Kramer-coefficient C
is another correlation coefficient for contingency tables, based on the
χ²-values:
C = √( χ² / (χ² + n) )
This coefficient is not standardised between 0 and 1. To do so, one has to
calculate the maximum possible C-value, which is given by
Cmax = √( (k − 1) / k )
The standardised coefficient is then obtained from
Cstand = C / Cmax
Rank correlation coefficients
1) Setting up hypotheses
The (unknown) population correlation coefficient is often denoted by
ρs; it has to be estimated from the observed correlation coefficient rs.
H0: ρs = 0  No correlation between the two variables.
HA: ρs ≠ 0  Correlation between the two variables.
2) Calculation of rs
Each variable is ranked separately, and for each object i the squared
difference (di²) between the rank of X and the rank of Y is computed.
rS = 1 − ( 6 · Σ_{i=1}^{n} di² ) / ( n · (n² − 1) )
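A sketch of Spearman's rS from the definition above (no ties are assumed, so the simple ranking below suffices):

```python
x = [10, 20, 30, 40, 50]   # hypothetical data
y = [ 2,  1,  4,  3,  5]

def ranks(v):
    """Rank 1..n by increasing value (assumes no ties)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

rx, ry = ranks(x), ranks(y)
n = len(x)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))   # sum of squared rank differences
rs = 1 - 6 * d2 / (n * (n ** 2 - 1))
```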
Variations of X and Y:
Variance of variable X:  sx² = (1/(n − 1)) · Σ_{i=1}^{n} (xi − x̄)²
Variance of variable Y:  sy² = (1/(n − 1)) · Σ_{i=1}^{n} (yi − ȳ)²
Covariation = mutual variation of two variables, measured by the
covariance:
Cov(X, Y) = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ)
Covariance = mean value of the cross-products of the deviations of X and
Y from their mean values:
sxy = (1/n) · Σ_{i=1}^{n} (xi − x̄)(yi − ȳ)
Product-moment correlation after Pearson
Note that the covariance of a variable with itself equals the variance. The
covariance is not standardised (between −1 and +1); instead, this
measure depends on the units in which X and Y are measured.
r = Cov(X, Y) / √( Var(X) · Var(Y) ) = Σ(xi − x̄)(yi − ȳ) / √( Σ(xi − x̄)² · Σ(yi − ȳ)² )
r = −1  perfect negative, linear correlation
r = 0   no linear correlation
r = +1  perfect positive, linear correlation
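Pearson's r computed directly from the definition above, on a hypothetical, nearly linear data set:

```python
from math import sqrt

x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]    # hypothetical, roughly linear in x

mx, my = sum(x) / len(x), sum(y) / len(y)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y))        # cross-products
r = cov / sqrt(sum((a - mx) ** 2 for a in x)
               * sum((b - my) ** 2 for b in y))
```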
1) Setting up hypotheses
H0: ρ = 0  No linear correlation between the two variables X and Y.
HA: ρ ≠ 0  Linear correlation between the two variables.
3) Calculation of significance
a) t-statistics
t = r / sr   (sr = standard error of r);  d.f. = n − 2;  tcrit = t_{α/2,d.f.}
Decision:
1) if P(t) ≥ α/2 (equivalent to: t ≤ tcrit): do not accept HA; do not
accept H0, either. You were not able to detect any significant
correlation.
2) if P(t) < α/2 (equivalent to: t > tcrit): accept HA; reject H0. There
is a significant correlation between the two variables.
Product-moment correlation after Pearson
4) Calculation of significance
b) F-statistics
F = (1 + |r|) / (1 − |r|);   d.f.1 = n − 2,  d.f.2 = n − 2;  Fcrit = F_{α/2; d.f.1; d.f.2}
Decision:
1) if P(F) ≥ α/2 (equivalent to: F ≤ Fcrit): do not accept HA; do not
accept H0, either. You were not able to detect any significant
correlation.
2) if P(F) < α/2 (equivalent to: F > Fcrit): accept HA; reject H0.
There is a significant correlation between the two variables.
Product-moment correlations after Pearson
4) Stating results:
calculated correlation coefficient r, the total number of observations
(n = number of objects) and the calculated P (if not significant) or the
significance level (if significant).
Potential pitfalls in correlation analysis
Example: Fish
[Figure: Weight (Y) plotted against Length (X)]
Potential pitfalls in correlation analysis
[Diagram: a lurking variable Z influencing both X and Y]
Potential pitfalls in correlation analysis
Partial correlation
If the lurking variable Z is known (or measured) its influence may be
removed to obtain the correlation between the remaining variables of
interest X and Y:
Age (days)   Wing length (cm)
3.0          1.4
4.0          1.5
5.0          2.2
6.0          2.4
8.0          3.1
9.0          3.2
10.0         3.2
11.0         3.9
12.0         4.1
14.0         4.7
15.0         4.5
16.0         5.2
17.0         5.0
[Figure: scatter plot of wing length (cm) against age (days)]
Simple linear regression
Y = α + β·X + ε
Ŷ = expected values of Y
α and β = the regression coefficients of the population,
which are estimated by a and b from our sample
ε = the error (residuals)
Simple linear regression
Estimation:
Y = a + b·X + e,   Σe² = Σ(y − ŷ)²
a = intercept (point of intersection of the linear regression line with
the y-axis)
b = slope of the linear regression = Δy / Δx;
the change in Y that accompanies a unit change in X
[Figure: fitted regression line through the wing length vs. age scatter plot]
a and b are selected so that the sum of squared residuals is minimised:
Σe² = Σ(y − ŷ)² → min
b = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
a = ȳ − b·x̄
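Applying these least-squares formulas to the wing-length data above reproduces the model reported later (Y = 0.713 + 0.27 X):

```python
# Wing-length example data from the table above
age = [3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17]
wing = [1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0]

n = len(age)
mx, my = sum(age) / n, sum(wing) / n
b = (sum((x - mx) * (y - my) for x, y in zip(age, wing))
     / sum((x - mx) ** 2 for x in age))     # slope
a = my - b * mx                             # intercept
```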
Simple linear regression
1) State hypotheses
If b = 0, Y would not depend on X, because Y would not change with
changing X (the regression would result in a more or less horizontal
line). Therefore, we have to test whether the slope b is significantly
different from 0.
2) ANOVA procedure
The overall significance of the model is tested by an ANOVA
procedure.
Total SS = sum of the squared total deviations:
SSt(total) = Σ_{i=1}^{n} (yi − ȳ)²,   d.f.t = n − 1,   MSt = SSt / d.f.t
Regression SS (= variance explained by the model) = linear regression
sum of squares:
SSreg(regression) = Σ_{i=1}^{n} (ŷi − ȳ)²,   d.f.reg = 1,   MSreg = SSreg / 1
Residual SS (= unexplained variance) = the error term e:
SSres(residuals) = Σ_{i=1}^{n} (yi − ŷi)² = SSt − SSreg
d.f.res = d.f.t − d.f.reg = n − 2,   MSres = SSres / d.f.res
Simple linear regression
[Figure: for each observation, the deviation yi − ȳ splits into (ŷi − ȳ) + (yi − ŷi)]
Σ_{i=1}^{n} (yi − ȳ)² = Σ_{i=1}^{n} (ŷi − ȳ)² + Σ_{i=1}^{n} (yi − ŷi)²
Simple linear regression
2) ANOVA procedure
ANOVA table in simple regression analysis:
Source of variation   Sum of squares        d.f.    Mean square              Femp            P
Regression            SSreg = Σ(ŷi − ȳ)²    1       MSreg = SSreg / 1        MSreg / MSres   P(Femp)
Residuals             SSres = Σ(yi − ŷi)²   n − 2   MSres = SSres / d.f.res
Total                 SSt = Σ(yi − ȳ)²      n − 1   MSt = SSt / d.f.t
Simple linear regression
Coefficient of determination:
r² = SSreg / SSt
In simple regression this is equal to the squared product-moment
correlation coefficient.
Standard error of the estimate:
s_y.x = √( Σ(yi − ŷi)² / (n − 2) )
The estimated values ŷ are obtained from the regression model:
ŷ = a + b·x
The standard errors of a (S.E.a) and of b (S.E.b) are given by:
S.E.b = s_y.x / √( Σ(xi − x̄)² )
S.E.a = s_y.x · √( 1/n + x̄² / Σ(xi − x̄)² )
Simple linear regression
Example:
The linear regression between wing length and age is highly
significant (Y = 0.713 (±0.148) + 0.27 (±0.013) X; r2 = 0.973;
F1, 11 = 401.1; P < 0.001).
Simple linear regression
Linearisation
In order to conduct a linear regression analysis with data which do not
show a linear dependency, the data can be linearised.
In the following figure, some possible linearisation procedures are
given:
[Figure: two curve shapes with suggested substitutions; left panel: instead of x use log(x) or −1/x, or instead of y use y², y³; right panel: instead of x use x², x³, or instead of y use y², y³]
Simple linear regression
Linearisation, examples
Logarithmic function
[Figure: Y against X on the original (linear) x-axis (curved) and on a logarithmic x-axis (straight line)]
Linearisation, examples
Exponential function
[Figure: Y against X on the original (linear) y-axis (curved) and on a logarithmic y-axis (straight line)]
Linearisation, examples
Linearisation of the exponential function:
y = a · e^(b·x)   —ln→   ln y = ln a + b·x
[Figure: exponential curve on the original y-scale and the corresponding straight line on a logarithmic y-scale]
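The back-transformation can be sketched with synthetic, noise-free data (so the fit is exact): regress ln(y) on x, then recover a from the intercept.

```python
import math

# Synthetic exponential data y = a * exp(b * x), hypothetical a and b
a_true, b_true = 2.0, 0.3
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [a_true * math.exp(b_true * x) for x in xs]

# Linearise: ln y = ln a + b * x, then ordinary least squares
ln_y = [math.log(y) for y in ys]
mx, my = sum(xs) / len(xs), sum(ln_y) / len(ln_y)
b = (sum((x - mx) * (v - my) for x, v in zip(xs, ln_y))
     / sum((x - mx) ** 2 for x in xs))
a = math.exp(my - b * mx)      # back-transform the intercept: ln a -> a
```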
Simple linear regression
Linearisation, examples
Power functions
[Figure: Y against X on the original scales (curved) and on logarithmic scales of both X and Y (straight line)]
Linearisation, examples
Linearisation of the power function:
y = a · x^b   —ln→   ln y = ln a + b · ln x
[Figure: power function on the original axes and the corresponding straight line on double-logarithmic axes]