Introduction to

Biostatistics
Mag.Dr. Christian Fesl and Mag. Gabriel Singer
Contents

Basic definitions
Graphs
Data and data matrix
Data description: measures of location and variation
Empirical and theoretical frequency distributions
Normal distribution and standard normal distribution
The Central Limit Theorem
Confidence intervals
Accuracy and precision; estimation of sample size
Statistical decision theory
Statistical tests: t-test, U-test,...
Analysis of variance (ANOVA): 1-way, 2-way
Correlations and regressions
Design (Sampling, Experiment)
Presentation of results in text, table, graphs
Basic definitions

Statistics = systematic collection and display of numerical data; mathematical
method to handle uncertainty (random error)

1. Descriptive statistics (= explorative statistics = deductive statistics)


central tendency
variation
frequency distribution

2. Statistical inference (= inductive statistics)


use of a sample to draw conclusions about a population
Basic definitions
• Population = entire collection of people, animals, plants or things from which
we may collect data

• Sample = group of units selected from a larger group (random samples!)


• Sample unit = person, animal, plant or thing which is actually studied by a
researcher (the basic object).
1 sample unit delivers only one independent value per variable (a variate),
cf. „sample“ in colloquial use!
• Parameter = value representing a certain population characteristic
• Statistic = quantity calculated from the sample data to give information
about parameters

• Estimation = process of indicating the value of an unknown quantity in a
population (estimator, estimate)
“a sample statistic estimates a population parameter”
• Sampling distribution describes probabilities associated with a statistic
(= probability distribution for the statistic)
[Diagram: a population from which several samples are drawn, each consisting of sample units]
Probability theory

Probability = quantitative description of the likeliness of occurrence of a


particular event

scale from 0 to 1

long-run relative frequency

equally-likely outcomes model (Laplace):

P(E) = (number of outcomes corresponding to event E) / (total number of outcomes)
Probability theory

• Outcome = is one result of an experiment or other situation involving


uncertainty

• Event = any collection of outcomes of an experiment


Impossible event: P(E) = 0
Inevitable event: P(E) = 1
Complementary event Ē: P(Ē) = 1 – P(E)

• Sample space = exhaustive list of all possible outcomes of an experiment


(universe, population)
Probability theory

• Independent events
no influence on each other

P(A ∩ B) = P(A) · P(B)

Example: A man and a woman each have a pack of 52 playing cards. Find
the probability that they (i) each and (ii) both draw the ace of clubs.

• Mutually exclusive events


impossible to occur together

A ∩ B = ∅, i.e. P(A ∩ B) = 0

Example: A subject in a study cannot be both male and female.


Probability theory

• Addition rule (for mutually exclusive events)
P that event E1 or E2 or ... or En occurs
P(E1 ∪ E2 ∪ ... ∪ En) = P(E1) + P(E2) + ... + P(En)

• Multiplication rule (for independent events)
P that event E1 and E2 and ... and En occurs
P(E1 ∩ E2 ∩ ... ∩ En) = P(E1) · P(E2) · ... · P(En)

• Conditional probability, law of total probability and Bayes´ Theorem


Mathematical terms and notation

• Variable X = the actual property measured by individual observation


• Value x = a single observation of a variable (case, variate)
xi = ith value of variable X: <x1, x2, x3, ..., xi, ..., xn>

Σ (i = 1 to n) xi = sum of the xi = x1 + x2 + x3 + ... + xi + ... + xn

Π (i = 1 to n) xi = product of the xi = x1 · x2 · x3 · ... · xi · ... · xn

• Function
if values of X correspond with values of variable Y, there is a functional
dependence
Y = f(x), Y = dependent, X = independent
e.g. y = f (x) = a + bx
Mathematical terms and notation

• Logarithm

logA(x) = y  ⇔  A^y = x

A = base
x = numerus (antilogarithm)
y = logarithm

log10(x) = lg(x) = common logarithm


loge(x) = ln(x) = natural logarithm (e = 2.718.....)

log(A B) = log(A) + log(B)


log(A / B) = log(A) – log(B)
log(A^B) = B · log(A)
Graphs (= charts, diagrams, plots)

• Abscissa (x-axis)
• Ordinate (y-axis)
• Origin
Graphs (= charts, diagrams, plots)

(a) Bar chart / column graph – with variation (e.g. confidence intervals)
(b) Scatter plot – with regression line
(c) Line graph
(d) Pie chart
(e) Box-(whisker-)plot

[Example panels (a)–(e): bar chart with error bars, scatter plot with fitted
regression line, line graph, pie chart and box plot; axis values omitted]
Data

• result from interviews, observations, measurements or experiments


• preferably noted using numbers

Categories of data according to the level of scale

Non-metric scale
• Nominal variable / attribute: classification of qualitative expressions of
  properties. Example: eye colour. Possible calculations: frequency
  distribution; location: mode.
• Ordinal / ranked variable: ordination of ranks possible. Example: military
  ranks. Possible calculations: frequency distribution; location: median;
  variation: range, percentiles; correlation of ranks.

Metric scale
• Discontinuous / discrete measurement variable: values form an aggregate of
  separate properties (discrete events), no intermediate values possible.
  Example: number of animals. Possible calculations: probability distribution;
  location: arithmetic mean; variation: standard deviation (approximation by a
  continuous distribution possible).
• Continuous measurement variable: realisation of any value within a given
  interval possible. Example: body mass. Possible calculations: probability
  density function; location: arithmetic mean; variation: standard deviation.
Data matrix

Structured data for further processing in statistical packages

             Variables
Objects      X1    X2    ...   Xj    ...   Xk
O1           x11   x12   ...   x1j   ...   x1k
O2           x21   x22   ...   x2j   ...   x2k
:            :     :     ...   :     ...   :
Oi           xi1   xi2   ...   xij   ...   xik
:            :     :     ...   :     ...   :
On           xn1   xn2   ...   xnj   ...   xnk

n = sample size (number of objects)
k = number of variables

Classification of analyses with respect to the number of variables

Univariate: analysis with one variable
Bivariate: analysis with two variables
Multivariate: analysis with more than two variables
Nominal scale

Examples: Colour of eyes, names, sex, bits, presence-absence, …

Operations: Equality / inequality

Statistics:

Absolute frequency F
Relative frequency f = F / n (proportion)
n = total number of objects
Mode x* = most frequent value
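
As an illustration (not from the original slides), a minimal R sketch of these
statistics, using hypothetical eye-colour observations:

# Hypothetical eye-colour observations (values assumed for illustration)
eye <- c(rep("green", 4), rep("blue", 2), rep("brown", 9), rep("grey", 5))
Fi <- table(eye)                     # absolute frequencies F
fi <- Fi / length(eye)               # relative frequencies f = F / n
x_mode <- names(Fi)[which.max(Fi)]   # mode x* = most frequent value
Fi; fi; x_mode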
Nominal scale

Eye colour      Counts        Fi     fi
x1 (green)      IIII           4     0.20
x2 (blue)       II             2     0.10
x3 (brown)      IIIII IIII     9     0.45
x4 (grey)       IIIII          5     0.25
Sum                           20     1.00

[Bar charts of F and f per eye colour – the order along X is interchangeable!]


Ordinal scale

Examples: Water quality index, grades, military ranks......

Operations: Ranking

Statistics:

Percentiles pi, i ∈ [1, 100]


Deciles D1 = p10
Quartiles Q1 = p25 , Q2 = p50 , Q3 = p75 , Q4 = p100
Minimum / maximum
Median

Range = maximum - minimum


Interquartile range (IQR) = Q3 – Q1
Ordinal scale
Education                 Fi    Cumulative F    fi      Cumulative f
x1 (No education)         65         65         0.25        0.25
x2 (Elementary school)    63        128         0.25        0.50
x3 (Work)                 64        192         0.25        0.75
x4 (High school)          43        235         0.17        0.92
x5 (University)           21        256         0.08        1.00
Sum                      256                    1.00

[Figures: frequency polygon (F, f versus X) and cumulative frequency polygon
(= 'ogive', cumulative F, f versus X, with the median and quartiles marked)]
Metric scale – Discrete (discontinuous / meristic) variables

Examples: number of trees in a plot, counts of animals


Operations: discrete frequency distribution
            bar chart (gaps!)
            discrete probability distributions

No. of trees in plot     F      f        Cumulative f
 1                       0      0        0
 2                       1      0.025    0.025
 3                       2      0.05     0.075
 4                       4      0.1      0.175
 5                       5      0.125    0.3
 6                       8      0.2      0.5
 7                      10      0.25     0.75
 8                       5      0.125    0.875
 9                       3      0.075    0.95
10                       2      0.05     1
11                       0      0        1
Sum                     40      1

[Bar chart: absolute frequency of plots versus number of trees in plot]

Large sample space and large sample size → approximation by continuous
distributions
Metric scale – Continuous variables

Examples: fish length, body weight, count data (approximated)

Operations: Continuous frequency distribution


histograms (any value possible, no „gaps“)
continuous probability distributions

Raw data
classes (= consecutive categories)
frequency distribution
Weight (kg) Abs. frequency (F) Rel. f Cumulative f Class center
45 - <50 0 0/100= 0.0 0 47.5
50 - <55 3 3/100= 0.03 0.03 52.5
55 - <60 13 13/100=0.13 0.16 57.5
60 - <65 20 20/100=0.20 0.36 62.5
65 - <70 33 33/100=0.33 0.69 67.5
70 - <75 25 25/100=0.25 0.94 72.5
75 - <80 5 5/100= 0.05 0.99 77.5
80 - <85 1 1/100= 0.01 1 82.5
Sum 100 1.00
Metric scale – Continuous variables

Raw data
classes (= consecutive categories)
frequency distribution
bar chart without gaps = histogram

f = bar height
f = bar area!

[Histograms: relative frequency (%) versus classes]
Metric scale – Continuous variables

Statistics:

Arithmetic mean:  x̄ = (1/n) · Σ (i = 1 to n) xi

Standard deviation s (variance s²):  s = √[ Σ (i = 1 to n) (xi – x̄)² / (n – 1) ]

Coefficient of variation:  C.V. = (s / x̄) · 100

Skewness = degree of asymmetry (Sk)

Kurtosis = degree of peakedness (K)

Geometric mean:  xg = ⁿ√( Π (i = 1 to n) xi )
Sample statistics

describe observed (empirical) frequency distributions:


location: average value, minimum, maximum
variation: dispersion around average value
shape of the distribution: symmetry, peakedness...

describe average trend of a distribution with a few values only


for further statistical analysis
Sample statistics

1. Central tendency and other measures of location (first moment)

Mode x*, median x̃ (= p50 = Q2), arithmetic mean x̄, geometric mean xg,
weighted mean x̄w = (f1·x1 + f2·x2 + ... + fn·xn) / (f1 + f2 + ... + fn)

Arithmetic mean:  x̄ = (1/n) · Σ (i = 1 to n) xi = (1/n) · (x1 + x2 + ... + xn)

Geometric mean:  xg = ⁿ√(x1 · ... · xn)  or  xg = antilog[ (1/n) · Σ ln(x) ]

For ln(x + 1)-transformed values:  xg = antilog[ (1/n) · Σ ln(x + 1) ] – 1
Sample statistics

1. Central tendency and other measures of location (first moment)

Median:  for odd n:   x̃ = x((n+1)/2)

         for even n:  x̃ = ½ · ( x(n/2) + x(n/2 + 1) )

50% of the values are below, 50% are above the median

Mode: x* , most frequent value

Percentiles, deciles, quartiles

Minimum, maximum
Sample statistics

2. Measures of spread/variation (second moment)

Range:  sM = x(n) – x(1) = max – min

Interquartile range:  IQR = Q3 – Q1

Standard deviation:  s = √[ Σ (i = 1 to n) (xi – x̄)² / (n – 1) ]

Variance:  s² = Σ (i = 1 to n) (xi – x̄)² / (n – 1)

Coefficient of variation:  C.V. = s / x̄

Standard error of the arithmetic mean:  S.E.M. = s / √n

information about the quality of a measurement:  x̄ ± S.E.M.
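
A minimal R sketch of these sample statistics, assuming a hypothetical vector
of body weights (all values invented for illustration):

set.seed(1)
x <- rnorm(25, mean = 70, sd = 8)    # hypothetical body weights (kg)
mean(x); median(x)                   # arithmetic mean, median
quantile(x, c(0.25, 0.75)); IQR(x)   # quartiles and interquartile range
sd(x); var(x)                        # standard deviation and variance (n - 1 denominator)
100 * sd(x) / mean(x)                # coefficient of variation (%)
sd(x) / sqrt(length(x))              # standard error of the mean (S.E.M.)
exp(mean(log(x)))                    # geometric mean (positive values only)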
Sample statistics

3. Asymmetry (third moment)

Skewness Sk

Normal distribution = symmetrical around the mean

Skewed to the right = positive skewness: maximum of the distribution at
the left side
Skewed to the left = negative skewness: maximum of the distribution at
the right side

Sk = x̄ – x*        Sk = (x̄ – x*) / s        Sk = 3 · (x̄ – x̃) / s

Quartile skewness:  Sk = [ (Q3 – Q2) – (Q2 – Q1) ] / (Q3 – Q1)

Sample statistics

Skewness
Example: distribution skewed to the right

[Figure: right-skewed distribution with mode x* < median x̃ < mean x̄]
Sample statistics

4. Peakedness (fourth moment)

Kurtosis K

Positive kurtosis (positive excess): steep peak, maximum higher than


compared with the normal distribution
Negative kurtosis (negative excess): flat peak, maximum lower

Leptokurtic / platykurtic / mesokurtic

K = (Q3 – Q1) / [ 2 · (p90 – p10) ]          (K of the normal distribution: KND = 0.263)
Sample statistics

Skewness and kurtosis

[Figure: distributions illustrating positive skewness (skewed to the right) and
kurtosis – leptokurtic (positive kurtosis), mesokurtic (normal distribution),
platykurtic (negative kurtosis)]
Sample statistics and population parameters

Whole population (census): frequency distribution described by definite
population parameters.

(Random) sample from the population: frequency distribution described by
(unsure) sample statistics, which estimate the population parameters.

Theoretical measures of location and spread:

Measure of location: expectation E(X)

Measure of spread: variation Var(X)

[Diagram: population from which samples are drawn]
Sample statistics and population parameters

Normal distribution

Most important measure of location: arithmetic mean

Population: E(X) = µ
Random sample: x̄

Most important measure of spread: variance

Population: Var(X) = σ²
Random sample: s²

The parameters of the population (µ, σ²) are estimated by the statistics of
the random sample (x̄, s²).

„Mean“: sample mean (x̄) versus population or parametric mean µ
Sample statistics and population parameters

Unbiased estimator

take several samples
calculate the sample statistic repeatedly
average of the sample statistics x̄ = unbiased estimator for µ
gives the parameter

Biased estimator

e.g. use (1/n) · Σ (i = 1 to n) (xi – x̄)² to calculate the sample variance

resulting quantity is biased: consistent underestimation of σ²
due to use of x̄, which is already an (unsure) estimator!

use d.f. = n – 1 to get an unbiased estimator
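
A small simulation sketch (not part of the original slides) showing this
underestimation: repeatedly draw samples from a population with known σ² and
compare the 1/n and 1/(n – 1) estimators.

set.seed(42)
sigma2 <- 4                                    # true population variance
est <- replicate(10000, {
  x <- rnorm(5, mean = 0, sd = sqrt(sigma2))   # small sample, n = 5
  c(biased   = sum((x - mean(x))^2) / length(x),
    unbiased = var(x))                         # var() already divides by n - 1
})
rowMeans(est)    # 'biased' averages clearly below 4, 'unbiased' close to 4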
Empirical distributions

[Histograms: a discontinuous empirical distribution (relative frequency per
value) and a continuous empirical distribution (relative frequency per class)]
Empirical + theoretical distributions

Discontinuous distribution:                   Continuous distribution:
Binomial distribution                         Normal distribution

P(X = i) = (n choose i) · p^i · q^(n–i)       f(x) = 1/(σ·√(2π)) · exp[ –½ · ((x – µ)/σ)² ]

E(X) = n·p      Var(X) = n·p·q                E(X) = µ        Var(X) = σ²
Empirical + theoretical distributions

Discontinuous distribution:                   Continuous distribution:
Poisson distribution                          Log-normal distribution

P(X = i) = e^(–µ) · µ^i / i!                  f(x) = 1/(x·σ·√(2π)) · exp[ –½ · ((ln x – µ)/σ)² ]

E(X) = Var(X) = µ                             E(X) = exp(µ + σ²/2)
                                              Var(X) = exp(2µ + σ²) · [exp(σ²) – 1]
Frequency distributions

Mathematical distributions as models for natural


frequency distributions

Elimination of irregularities in empirical distribution


Simple mathematical handling
Estimation of population parameters
Statement about the derivation of the data is possible
Important for choosing appropriate statistical methods
Some advantageous statistical properties (e.g. mean)
Frequency distributions

Examples of different theoretical distributions

Discontinuous distributions

• Positive binomial    σ² < µ    regular
• Poisson series       σ² = µ    random
• Negative binomial    σ² > µ    aggregated

Continuous distributions

• Normal distribution and standard normal distribution (z)
• χ²-distribution
• t-distribution
• F-distribution
Normal distribution

Histogram of an empirical frequency distribution

relative frequency = bar height represents probability


probability also represented by bar area
[Histogram: relative frequency versus classes]
Normal distribution

Empirical frequency distribution: high n, small classes


Approximation by curve
Theoretical normal probability distribution = Normal probability density function

• Smooth
• Bell shaped
• Symmetrical around the mean

[Figure: histogram with fitted smooth bell-shaped curve; relative frequency
versus classes]
Normal distribution

Normal probability density function (PDF)

f(x) = 1/(σ·√(2π)) · exp[ –(x – µ)² / (2σ²) ]

µ = arithmetic mean
σ = standard deviation

[Curve: probability density (relative frequency) versus x]
Normal distribution

Many different (general) normal probability density functions are possible

f(x) = 1/(σ·√(2π)) · exp[ –(x – µ)² / (2σ²) ]

[Figure: several normal PDFs with different µ and σ; probability density
(relative frequency) versus x]
Standard normal distribution

Centering:  xi → xi – µ

Standardising:  zi = (xi – µ) / σ

Standard normal distribution
z-values: standardised values with µ = 0 and σ = 1 according to the
formula:  Z = (X – µ) / σ
Standard normal distribution
= standard normal PDF

f(z) = 1/√(2π) · exp( –z² / 2 )

[Curve: standard normal PDF for z from –3 to +3]
Standard normal distribution
Area under the curve = integral of the standard normal PDF
= cumulative standard normal distribution function
= probability to find z within a definite range

The whole area under the curve = 1

[Figure: standard normal curve from z = –3 to +3; the areas within ±1, ±2 and
±3 standard deviations are 68.27%, 95.45% and 99.73%]
z-value p (z) z-value p (z) z-value p (z) z-value p (z) z-value p (z) z-value p (z)
0.00 0.50000 0.50 0.69146 1.00 0.84134 1.50 0.93319 2.00 0.97725 2.50 0.99379
0.01 0.50399 0.51 0.69497 1.01 0.84375 1.51 0.93448 2.01 0.97778 2.51 0.99396
0.02 0.50798 0.52 0.69847 1.02 0.84614 1.52 0.93574 2.02 0.97831 2.52 0.99413
0.03 0.51197 0.53 0.70194 1.03 0.84849 1.53 0.93699 2.03 0.97882 2.53 0.99430
0.04 0.51595 0.54 0.70540 1.04 0.85083 1.54 0.93822 2.04 0.97932 2.54 0.99446
0.05 0.51994 0.55 0.70884 1.05 0.85314 1.55 0.93943 2.05 0.97982 2.55 0.99461
0.06 0.52392 0.56 0.71226 1.06 0.85543 1.56 0.94062 2.06 0.98030 2.56 0.99477
0.07 0.52790 0.57 0.71566 1.07 0.85769 1.57 0.94179 2.07 0.98077 2.57 0.99492
0.08 0.53188 0.58 0.71904 1.08 0.85993 1.58 0.94295 2.08 0.98124 2.58 0.99506
0.09 0.53586 0.59 0.72240 1.09 0.86214 1.59 0.94408 2.09 0.98169 2.59 0.99520
0.10 0.53983 0.60 0.72575 1.10 0.86433 1.60 0.94520 2.10 0.98214 2.60 0.99534
0.11 0.54380 0.61 0.72907 1.11 0.86650 1.61 0.94630 2.11 0.98257 2.61 0.99547
0.12 0.54776 0.62 0.73237 1.12 0.86864 1.62 0.94738 2.12 0.98300 2.62 0.99560
0.13 0.55172 0.63 0.73565 1.13 0.87076 1.63 0.94845 2.13 0.98341 2.63 0.99573
0.14 0.55567 0.64 0.73891 1.14 0.87286 1.64 0.94950 2.14 0.98382 2.64 0.99585
0.15 0.55962 0.65 0.74215 1.15 0.87493 1.65 0.95053 2.15 0.98422 2.65 0.99598
0.16 0.56356 0.66 0.74537 1.16 0.87698 1.66 0.95154 2.16 0.98461 2.66 0.99609
0.17 0.56749 0.67 0.74857 1.17 0.87900 1.67 0.95254 2.17 0.98500 2.67 0.99621
0.18 0.57142 0.68 0.75175 1.18 0.88100 1.68 0.95352 2.18 0.98537 2.68 0.99632
0.19 0.57535 0.69 0.75490 1.19 0.88298 1.69 0.95449 2.19 0.98574 2.69 0.99643
0.20 0.57926 0.70 0.75804 1.20 0.88493 1.70 0.95543 2.20 0.98610 2.70 0.99653
0.21 0.58317 0.71 0.76115 1.21 0.88686 1.71 0.95637 2.21 0.98645 2.71 0.99664
0.22 0.58706 0.72 0.76424 1.22 0.88877 1.72 0.95728 2.22 0.98679 2.72 0.99674
0.23 0.59095 0.73 0.76730 1.23 0.89065 1.73 0.95818 2.23 0.98713 2.73 0.99683
0.24 0.59483 0.74 0.77035 1.24 0.89251 1.74 0.95907 2.24 0.98745 2.74 0.99693
0.25 0.59871 0.75 0.77337 1.25 0.89435 1.75 0.95994 2.25 0.98778 2.75 0.99702
0.26 0.60257 0.76 0.77637 1.26 0.89617 1.76 0.96080 2.26 0.98809 2.76 0.99711
0.27 0.60642 0.77 0.77935 1.27 0.89796 1.77 0.96164 2.27 0.98840 2.77 0.99720
0.28 0.61026 0.78 0.78230 1.28 0.89973 1.78 0.96246 2.28 0.98870 2.78 0.99728
0.29 0.61409 0.79 0.78524 1.29 0.90147 1.79 0.96327 2.29 0.98899 2.79 0.99736
0.30 0.61791 0.80 0.78814 1.30 0.90320 1.80 0.96407 2.30 0.98928 2.80 0.99744
0.31 0.62172 0.81 0.79103 1.31 0.90490 1.81 0.96485 2.31 0.98956 2.81 0.99752
0.32 0.62552 0.82 0.79389 1.32 0.90658 1.82 0.96562 2.32 0.98983 2.82 0.99760
0.33 0.62930 0.83 0.79673 1.33 0.90824 1.83 0.96638 2.33 0.99010 2.83 0.99767
0.34 0.63307 0.84 0.79955 1.34 0.90988 1.84 0.96712 2.34 0.99036 2.84 0.99774
0.35 0.63683 0.85 0.80234 1.35 0.91149 1.85 0.96784 2.35 0.99061 2.85 0.99781
0.36 0.64058 0.86 0.80511 1.36 0.91308 1.86 0.96856 2.36 0.99086 2.86 0.99788
0.37 0.64431 0.87 0.80785 1.37 0.91466 1.87 0.96926 2.37 0.99111 2.87 0.99795
0.38 0.64803 0.88 0.81057 1.38 0.91621 1.88 0.96995 2.38 0.99134 2.88 0.99801
0.39 0.65173 0.89 0.81327 1.39 0.91774 1.89 0.97062 2.39 0.99158 2.89 0.99807
0.40 0.65542 0.90 0.81594 1.40 0.91924 1.90 0.97128 2.40 0.99180 2.90 0.99813
0.41 0.65910 0.91 0.81859 1.41 0.92073 1.91 0.97193 2.41 0.99202 2.91 0.99819
0.42 0.66276 0.92 0.82121 1.42 0.92220 1.92 0.97257 2.42 0.99224 2.92 0.99825
0.43 0.66640 0.93 0.82381 1.43 0.92364 1.93 0.97320 2.43 0.99245 2.93 0.99831
0.44 0.67003 0.94 0.82639 1.44 0.92507 1.94 0.97381 2.44 0.99266 2.94 0.99836
0.45 0.67364 0.95 0.82894 1.45 0.92647 1.95 0.97441 2.45 0.99286 2.95 0.99841
0.46 0.67724 0.96 0.83147 1.46 0.92785 1.96 0.97500 2.46 0.99305 2.96 0.99846
0.47 0.68082 0.97 0.83398 1.47 0.92922 1.97 0.97558 2.47 0.99324 2.97 0.99851
0.48 0.68439 0.98 0.83646 1.48 0.93056 1.98 0.97615 2.48 0.99343 2.98 0.99856
0.49 0.68793 0.99 0.83891 1.49 0.93189 1.99 0.97670 2.49 0.99361 2.99 0.99861
Standard normal distribution

Tables
z-value p (z) z-value p (z)
0.00 0.50000 0.50 0.69146
0.01 0.50399 0.51 0.69497
0.02 0.50798 0.52 0.69847
0.03 0.51197 0.53 0.70194
0.04 0.51595 0.54 0.70540
0.05 0.51994 0.55 0.70884
0.06 0.52392 0.56 0.71226
0.07 0.52790 0.57 0.71566
0.08 0.53188 0.58 0.71904
0.09 0.53586 0.59 0.72240

P(z) = P(Z ≤ z)

P(z1 ≤ Z ≤ z2) = P(Z ≤ z2) – P(Z ≤ z1)
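
Instead of the printed table, the cumulative standard normal distribution can
be evaluated directly, e.g. in R (sketch; the values match the table above):

pnorm(1.96)              # P(Z <= 1.96) = 0.975
pnorm(1) - pnorm(-1)     # P(-1 <= Z <= 1) = 0.6827
qnorm(0.975)             # z with P(Z <= z) = 0.975, i.e. 1.96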


Normal distribution

The following distributions can be approximated by a normal distribution


under the following conditions:

Positive binomial    n > 30 and s² ≥ 3
Poisson series       µ > 10
Negative binomial    large k

Data that are not normally distributed can be transformed to approximate a
normal distribution:

Transformation              Back-transformation

log10(x)                    10^y
loge(x)                     e^y
√x                          y²
1/x                         1/y
The central limit theorem

The means of samples drawn from a normally distributed population


are themselves normally distributed regardless of sample size n.

As sample size increases, the means of samples drawn from a population of any
distribution will approach the normal distribution.

The standard deviation of the distribution of the means is given by:

S.E.M. = σ / √n
The central limit theorem

S.E.M. = σ / √n  ... decreases as sample size n increases!

z = (x̄ – µ) / (σ / √n)

[Figure: distribution of the original population (= distribution of means with
n = 1), distribution of means with low n, and distribution of means with high
n; the higher n, the narrower the distribution of means around µ]
The central limit theorem

S.E.M. = σ / √n  ... decreases as sample size n increases!

z = (x̄ – µ) / (σ / √n)

[Figure: distribution of the z-standardised means x̄, i.e. the standard normal
distribution]
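
A short simulation sketch of the central limit theorem (hypothetical data):
means of samples drawn from a skewed (exponential) distribution become
approximately normal, and their spread shrinks roughly with 1/√n.

set.seed(1)
means_n5  <- replicate(5000, mean(rexp(5)))    # means of small samples
means_n50 <- replicate(5000, mean(rexp(50)))   # means of larger samples
sd(means_n5); sd(means_n50)                    # S.E.M. decreases with increasing n
hist(means_n50, breaks = 40)                   # close to bell-shaped despite skewed raw data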
The t-distribution

σ and S.E.M. have to be estimated:  S.E.M. = s / √n

t = (x̄ – µ) / (s / √n)

z = t with infinite d.f.

[Figure: t-distributions for d.f. = 1, d.f. = 2 and d.f. = ∞ (= z); the
t-distribution is wider and flatter when n is low]
The t-distribution

[Figure: normal distribution and t-distribution compared; the t-distribution
is wider with heavier tails]
The t-distribution
t-values for different degrees of freedom and α

d.f.        α = 0.05       α = 0.01       α = 0.001


1 12.70615 63.65590 636.57761
2 4.30266 9.92499 31.59977
3 3.18245 5.84085 12.92443
4 2.77645 4.60408 8.61008
5 2.57058 4.03212 6.86850
6 2.44691 3.70743 5.95872
7 2.36462 3.49948 5.40807
8 2.30601 3.35538 5.04137
9 2.26216 3.24984 4.78089
10 2.22814 3.16926 4.58676
Confidence interval (C.I.)
Interval around x̄ which includes µ with a certain confidence
(a probability close to 1, ~0.95).

σ is known:
a random sample gives x̄, which is normally distributed with S.E.M. = σ / √n

z = (x̄ – µ) / (σ / √n)

P( –z(α/2) ≤ (x̄ – µ) / (σ / √n) ≤ z(α/2) ) = 0.95 = 1 – α

P( –z(α/2) · σ/√n ≤ x̄ – µ ≤ z(α/2) · σ/√n ) = 1 – α

P( x̄ – z(α/2) · σ/√n ≤ µ ≤ x̄ + z(α/2) · σ/√n ) = 1 – α      C.I.:  x̄ ± z(α/2) · σ/√n

z = tabulated value from the standard normal distribution, depends on α
α = significance level, 1 – α = confidence / accuracy
Confidence interval (C.I.)
σ is unknown:
a random sample gives x̄ and s; x̄ is t-distributed with S.E.M. = s / √n

t = (x̄ – µ) / (s / √n)

P( x̄ – t(α/2, d.f.) · s/√n ≤ µ ≤ x̄ + t(α/2, d.f.) · s/√n ) = 1 – α      C.I.:  x̄ ± t(α/2, d.f.) · s/√n

t = tabulated value from the t-distribution, depends on α and d.f.
α = significance level (0.05 or 0.01)
d.f. = n – 1 = degrees of freedom
Confidence interval (C.I.)

[Figure: t-distribution around x̄ with tail areas α/2 below –t(α/2, d.f.) and
above +t(α/2, d.f.)]

Be aware that t-tables are usually two-sided!
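
A minimal R sketch of a 95% confidence interval for µ with σ unknown (sample
values invented for illustration); t.test() returns the same interval as the
manual formula:

set.seed(2)
x <- rnorm(20, mean = 50, sd = 6)                 # hypothetical sample
n <- length(x); alpha <- 0.05
mean(x) + c(-1, 1) * qt(1 - alpha/2, df = n - 1) * sd(x) / sqrt(n)   # manual C.I.
t.test(x, conf.level = 0.95)$conf.int             # same interval from t.test()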


Confidence interval (C.I.)

Accuracy and precision

Accuracy (/ confidence) (1 – α) is the probability that the true mean of the
population lies within a given confidence interval.

Precision is the width of the confidence interval:

• expresses how close sample values (means) lie to each other (s)
• demonstrates the quality of the estimation of µ (n)

Precision, accuracy (1 – α) and the number of samples n are interdependent.

To get higher precision but keep same accuracy


increase sample size n
Accuracy and precision

Neither precise nor accurate Precise, not accurate

Accurate, not precise Precise and accurate


Accuracy and precision

Formula assuming a normal distribution

Calculation of the precision with a given accuracy (α) and sample size n:

Absolute precision:  G = t(α/2, d.f.) · s / √n

Precision relative to the mean:  G′ = t(α/2, d.f.) · s / (x̄ · √n)

Calculation of the necessary sample size with predefined accuracy and
precision:

n = [ t(α/2, d.f.) · s / (G′ · x̄) ]²

(The equation has to be solved iteratively, because n appears on both sides.)
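
A sketch of the iterative solution in R, with an assumed pilot estimate of s
and x̄ and an assumed target relative precision (all numbers hypothetical):

s <- 12; xbar <- 70            # pilot estimates (assumed)
Gprime <- 0.05; alpha <- 0.05  # target relative precision and accuracy
n <- 10                        # starting guess
repeat {
  n_new <- ceiling((qt(1 - alpha/2, df = n - 1) * s / (Gprime * xbar))^2)
  if (n_new == n) break        # stop when n no longer changes
  n <- n_new
}
n                              # required sample size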
Accuracy and precision

Example: increase of sample size

[Figure: with increasing sample size n, the confidence interval around µ
narrows – the relative precision G′ improves (e.g. from 20% to 5% at an
accuracy of 1 – α = 70%, or from 50% to 20% at 1 – α = 95%) – or, at constant
precision, the accuracy can be increased]
Statistical decision theory

Statistical decision = decisions about population based on sample


information.

Statistical hypothesis = assumption about the population to reach


decision

Null hypothesis H0 = assumption that the result obtained is ‘due entirely


to chance‘ (initial innocence)

Alternative hypothesis HA = any hypothesis that differs from a given H0


Statistical decision theory

Example: testing for differences between 2 populations (each


represented by 1 sample) with regard to a certain variable.

H0: Two populations do not differ. Thus, the 2 samples come in fact from
one underlying population and any possibly observed difference between
the two samples is entirely due to chance.

HA: Two populations differ. Thus, an observed difference between the


samples is not due to chance but reflects the fact that the 2 samples come
from 2 different underlying populations.
Statistical decision theory

Type I (α) and type II (β) error

Decision of the test          Reality: H0 true, HA false      Reality: H0 false, HA true
H0 kept, HA rejected          correct decision                type II error
                              probability 1 – α               probability β
H0 rejected, HA accepted      type I error                    correct decision
                              probability α                   probability 1 – β (“power“)

Type I error: ’wrong alarm’
Type II error: ’missed opportunity’

Controlling the errors: decreasing the type I error (α) increases the type II
error (β); increasing the sample size n decreases both.
Statistical decision theory
Type I and Type II error

[Figure: overlapping distributions under H0 and HA with the critical value
separating the type I error area (α), the type II error area (β) and the power
(1 – β); shifting the critical value or the distributions changes α, β and the
power]
Steps to conduct a statistical test
• Define H0 and HA
• Set α in advance!
• Calculate the test statistic TSemp from the data
• Calculate d.f.

Critical-value approach:
• Find / calculate the critical value of TS for the given α and d.f. from the
  known distributions (z, t, F, χ²) = TScrit
• Compare TSemp with TScrit
• Decision:
  if TSemp ≤ TScrit   do not accept HA; nor H0
  if TSemp > TScrit   accept HA; reject H0

P-value approach:
• Calculate P(TS ≥ TSemp) from TSemp and d.f. (probability to get TSemp or any
  larger TS = probability of error when accepting HA)
• Compare P(TS ≥ TSemp) with α
• Decision:
  if P ≥ α   do not accept HA; nor H0
  if P < α   accept HA; reject H0
Significance levels
type I error α    probability of observed outcome under true H0    meaning                symbol
α = 5%            P ≥ 0.05                                         not significant        n.s.
α = 5%            P < 0.05                                         significant at 5%      *
α = 1%            P < 0.01                                         significant at 1%      **
α = 0.1%          P < 0.001                                        significant at 0.1%    ***

“A significant difference between the phosphorus concentration of lake A


and lake B could be demonstrated (t=4.5, d.f.=20, P<0.01).”

“We were not able to demonstrate significant differences between plant


biomass of the fertilized and non-fertilized treatment plots (t=0.75, d.f.=20,
P=0.45).”

“A one-way ANOVA showed a significant effect of the factor ‘nutrient’ on
primary productivity (F=24.2, df1=2, df2=9, P<0.001).”
One-sample test (sample vs. fixed „true“ value)

Example: mice population on an island – census, weight ~ ND(µ0, σ)
on a drifting log: a single exceptionally heavy mouse (weight x)
a new species? from the mainland??

1) Testable hypotheses:

H0: The ‘new’ mouse belongs to the island population, its weight is
similar to those of other island mice: x ≤ µ0. Its relatively high weight is
entirely due to chance, it´s just a slightly heavy mouse of the population.

HA: The ‘new’ mouse does not belong to the island population, its weight
is higher than that of other island mice, it must belong to some other
mouse population, say from the mainland: x > µ0
One-sample test (sample vs. fixed „true“ value)

2) now believe in H0 (initial innocence)!

how likely is it to find heavy mouse x?
weight ~ ND(µ0, σ) → calculate the probability to find heavy mouse x or larger

TS:  z = (x – µ0) / σ    →    P(Z ≥ z)

[Figure: normal distribution under H0 with the upper-tail area P(Z ≥ z) shaded
beyond the observed z]
One-sample test (sample vs. fixed „true“ value)
3) Set a threshold for P to decide between H0 and HA (in advance!)
set limit = significance level α

Or: calculate the critical z from α: for α = 0.05, P(Z ≥ zcrit) = 0.05 → zcrit = 1.64

[Figure: normal distribution under H0 with the critical value zcrit cutting off
the upper-tail area α]

4) Decision:

if P(Z ≥ z) > α  /  z ≤ zcrit                 if P(Z ≥ z) < α  /  z > zcrit
mouse x (or larger) is likely                 mouse x (or larger) is unlikely
“keep” H0, reject HA                          reject H0, accept HA
One-sample test (sample vs. fixed „true“ value)

5) Decision wrong: type I error!

e.g. we reject H0 while it is true! (we make a type I error)
chance of making a type I error?
α = maximum probability of error when rejecting H0

[Figure: distribution under H0 with the rejection area α = P(Z ≥ zcrit) shaded;
see the decision table above (type I error α, type II error β, power 1 – β)]
One-sample test (sample vs. fixed „true“ value)

5) Decision wrong: type II error! To evaluate β we need a definite HA!

e.g. HA: mouse from the mainland, where mouse weight ~ ND(µ1, σ2) and µ1 > µ0

assume α and calculate the threshold weight xcrit from zcrit:

zcrit = (xcrit – µ0) / σ = 1.64    →    xcrit

β = probability to find a mouse with weight xcrit (or less) under a true HA:

TS:  z = (xcrit – µ1) / σ2    →    β = P(Z ≤ z)

power = 1 – β  (correctly rejecting H0 and accepting HA)

[Figure: distributions under H0 (mean µ0) and HA (mean µ1) with xcrit marking
the areas α, β and the power 1 – β]
One-sample test (sample vs. fixed „true“ value)

[Figure: island mice population (mean µ0) and mainland mice population (mean
µ1); sampling distributions of sample means are wide for low n and narrow for
high n]

H0: The sample of drifting mice belongs to the island population. The
population mean µ estimated from the sample is equal to (or smaller than) the
µ0 of the island population: µ ≤ µ0

HA: The sample of drifting mice belongs to a different population with a µ
which is larger than the µ0 of the island population: µ > µ0
One-sample test (sample vs. fixed „true“ value)

Known σ: Gauss test

TS:  z = (x̄ – µ) / (σ / √n)    compare with zcrit = z(α)    alternative: compare P(z) with α

Unknown σ: one-sample t-test

σ estimated by s

TS:  t = (x̄ – µ) / (s / √n)    compare with tcrit = t(α, d.f.)    alternative: compare P(t) with α

t-value: from Student´s t-distribution
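
A minimal R sketch of the one-sample t-test, with hypothetical mouse weights
and an assumed reference value µ0 = 20 (one-sided, as in the example above):

x <- c(21.3, 22.1, 19.8, 23.0, 22.4, 21.7, 20.9, 22.8)   # hypothetical weights (g)
t.test(x, mu = 20, alternative = "greater")               # H0: µ <= 20, HA: µ > 20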


One-sided versus two-sided test

Beforehand information (model, idea, experience):      No beforehand information:

H0: µ ≤ µ0                                              H0: µ = µ0
HA: µ > µ0                                              HA: µ ≠ µ0

One-sided test:                                         Two-sided test:
type I error α in one tail, P(Z > z) = α                type I error α/2 in each tail,
                                                        P(Z < –z) = α/2 and P(Z > z) = α/2

[Figures: standard normal curve with the rejection area α in one tail
(one-sided) or α/2 in each tail (two-sided); the central area is 1 – α in both
cases]
Important limits z of the standard normal distribution

α          1 – α        z (two-sided)      z (one-sided)

0.100 0.900 1.64485 1.28155


0.050 0.950 1.95996 1.64485
0.025 0.975 2.24140 1.95996
0.010 0.990 2.57583 2.32635
0.001 0.999 3.29053 3.09023
Types of statistical tests referring to certain assumptions

• Parametric tests: assume a known parameterized probability distribution,

e.g. ND(µ, σ²) – assume ND

• Non-parametric tests: no assumptions about the frequency distributions


ND not assumed, “distribution free”

• Independent samples: do not depend on each other

• Dependent / paired samples: samples depend on each other, e.g. testing


differences before and after a treatment on the same object
Selection of different standard tests

Assumption about distribution | Number of samples | Dependency | Test

Parametric     | 2   | independent | t-test after STUDENT and WELCH-test**
Parametric     | 2   | dependent   | t-test for dependent samples
Parametric     | >2  | independent | one-way analysis of variance (ANOVA) and
                                     WELCH variant one-way analysis of means**
Parametric     | >2  | dependent   | repeated measures (or paired) ANOVA
Non-parametric | 2   | independent | U-test after MANN & WHITNEY
Non-parametric | 2   | dependent   | WILCOXON-test for paired differences
Non-parametric | >2  | independent | H-test after KRUSKAL & WALLIS
Non-parametric | >2  | dependent   | FRIEDMAN-test
Check normal distribution

1) Skewness and kurtosis

calculate S.E. for skewness and kurtosis


(repeated sampling, build distribution of statistics Sk and K, standard
deviation)
Sk and K follow ND
rough C.I.:
Skewness (kurtosis) ± 2 x S.E. of the skewness (kurtosis)

[Sketch: continuum of possible Sk values with the interval Sk ± 2·SE(Sk) around
the observed Sk and the value 0 marked]

Deviation from ND will be assumed if value 0 outside of the C.I.!


Check normal distribution

2) Histograms
[Histograms of VAR00001 (mean = 5.13, SD = 1.08, N = 100) and VAR00002
(mean = 7.7, SD = 3.27, N = 100)]

3) Normal quantile plots


[Normal Q–Q plots of VAR00001 and VAR00002: expected normal quantiles versus
observed values]


Check normal distribution

4) Compare location of mean, median and mode

Example: distribution skewed to the right

[Figure: for a distribution skewed to the right, mode x* < median x̃ < mean x̄]
Check normal distribution

5) Run statistical test for normal distribution

e.g. Kolmogorov-Smirnov-test, Shapiro-Wilk-test

H0: The distribution of the data is normal.


HA: The distribution of the data differs from a normal distribution.

„hope“ for high P!!


Tests of Normality (Kolmogorov-Smirnov, Lilliefors significance correction):

Variable      Statistic      df      Sig.
VAR00001      .049           100     .200*
VAR00002      .172           100     .000

*. This is a lower bound of the true significance.

[Histograms of VAR00001 (mean = 5.13, SD = 1.08, N = 100) and VAR00002
(mean = 7.7, SD = 3.27, N = 100)]

low n – no serious judgement possible


high n – small deviation from ND detectable
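
A small R sketch of step 5 with simulated data; shapiro.test() is the
Shapiro-Wilk test, ks.test() a plain Kolmogorov-Smirnov variant (without the
Lilliefors correction used in the output above):

set.seed(3)
x1 <- rnorm(100, mean = 5, sd = 1)       # roughly normal
x2 <- rlnorm(100)                        # right-skewed
shapiro.test(x1)                         # high P: no evidence against ND
shapiro.test(x2)                         # low P: deviation from ND detectable
ks.test(x1, "pnorm", mean(x1), sd(x1))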
Data transformation

The original variable is replaced by another variable according to a


specific mathematical function.
The same transformation procedure is applied to all variates.

What for?

• to facilitate interpretation and presentation of data


• to approximate the empirical distribution by a normal distribution, and
then use tests which assume a normal distribution (parametric tests)
• to recognize atypical values (extremes, outliers)
• to reduce effect of extreme values
• to linearize functional relationships
Data transformation

Transform non-ND data to approximate ND

Distribution skewed to the left:   power function (x², x³, ...)
Distribution skewed to the right:  square root (√x)
                                   logarithm (ln(x), log(x))
                                   reciprocal function (1/√x, 1/x, 1/x², ...)

Transformation              Back-transformation

y = log10(x)                x = 10^y
y = loge(x) = ln(x)         x = e^y
y = √x                      x = y²
y = 1/x                     x = 1/y

Ecological data
• A log-normal distribution can often be assumed
• Approximation of a normal distribution by use of logarithms
• In case of occurrence of zero values:  xT = ln(x + 1)
F-distribution
1. draw two (!) samples from a population ~ ND(µ, σ²)
2. calculate s1² (sample 1 with n1) and s2² (sample 2 with n2)
3. calculate the statistic:

   F = s1² / s2²

4. repeat 1.–3. and build the distribution of F-values

s1² and s2² are both estimates for σ²  →  F ≈ 1

“F-distribution”
shape determined by d.f.1 = n1 – 1 and d.f.2 = n2 – 1
separate F-distribution for each combination of d.f.1 and d.f.2
F-distribution

[Figure: F-distributions F(1,20), F(5,25) and F(25,5); the critical value at
α = 0.05 for F(5,25) lies at about F = 2.6]
F-test: checking variance homogeneity

H0: The sample variances estimate the same parametric variance (σ1² = σ2²)
    variance homogeneity = homoscedasticity
HA: The sample variances estimate different parametric variances (σ1² ≠ σ2²)
    variance heterogeneity = heteroscedasticity

α = 0.05
TS: variance ratio Fs

2-tailed test:  Fs = s1² / s2²          1-tailed test:  Fs = smax² / smin²

[Figure: F(9,9) distribution; 2-tailed test with rejection areas α/2 = 0.025 in
each tail (Fcrit = 0.2 and Fcrit = 4.0); 1-tailed test with rejection area
0.025 in the upper tail (Fcrit = 4.0)]

Decision (1-tailed test):

1) if P(F) ≥ α/2 (equivalent to: Fs ≤ Fcrit)
   do not accept HA, nor H0
   assume variance homogeneity

2) if P(F) < α/2 (equivalent to: Fs > Fcrit)
   accept HA, reject H0
   variance heterogeneity
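
In R the F-test of variance homogeneity is available as var.test(); a sketch
with simulated samples:

set.seed(4)
x1 <- rnorm(10, mean = 50, sd = 5)
x2 <- rnorm(10, mean = 50, sd = 5)
var(x1) / var(x2)        # variance ratio Fs
var.test(x1, x2)         # two-sided F-test, F(9, 9)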
t-test (after Student) for independent samples

Parametric procedure for testing significant differences of means between two


independent samples from normally distributed populations by means of one
variable

1) Check ND

2) Hypotheses

H0: µ1 = µ2 Sample means estimate same parametric mean µ.


Both samples drawn from same population.

HA: µ1 ≠ µ2 Sample means estimate different parametric means µ1 and µ2.
Samples drawn from different populations.

α = 0.05
two-sided test, TS = t

3) Check variance homogeneity (F-test)


t-test (after Student) for independent samples

4) Test statistic t

Variance homogeneity:

t = (x̄1 – x̄2) / √[ ((n1 – 1)·s1² + (n2 – 1)·s2²) / (n1 + n2 – 2) · (n1 + n2) / (n1·n2) ]
d.f. = n1 + n2 – 2

Variance heterogeneity (Welch test):

t = (x̄1 – x̄2) / √( s1²/n1 + s2²/n2 )
d.f. approximated from s1², s2², n1 and n2 (Welch–Satterthwaite)

5) Decision (two-sided)
tcrit = t(α/2, d.f.)

if P(t) ≥ α/2 (equivalent to: |t| ≤ tcrit)     do not accept HA, nor H0     “could not show difference”
if P(t) < α/2 (equivalent to: |t| > tcrit)     accept HA, reject H0         populations are different
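
A minimal R sketch with two simulated groups; var.equal = TRUE gives Student's
t-test, the default gives the Welch test:

set.seed(5)
g1 <- rnorm(12, mean = 50, sd = 5)
g2 <- rnorm(12, mean = 55, sd = 5)
t.test(g1, g2, var.equal = TRUE)   # Student's t-test (variance homogeneity assumed)
t.test(g1, g2)                     # Welch test (default, no homogeneity assumed)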
t-test (after Student) for dependent samples

Parametrical procedure for testing significant differences of means between


two dependent samples from normally distributed populations by means of
one variable

1) Check ND

2) Hypotheses

H0: µ1 = µ2 Sample means estimate same parametric mean µ.


Both samples drawn from same population.

HA: µ1 ≠ µ2 Sample means estimate different parametric means µ1 and µ2.
Samples drawn from different populations.

α = 0.05
two-sided test, TS = t
t-test (after Student) for dependent samples
patient before after differences
X1 X2 X1-X2
Gandalf 6 4 2
Saruman 4 3 1
Arwen 7 5 2
Frodo 3 2 1
...

3) Calculate the differences and the standard deviation of the differences

di = x1i – x2i          s = √[ Σ (i = 1 to n) (di – d̄)² / (n – 1) ]

4) Test statistic t

t = (d̄ – 0) / (s / √n) = d̄ · √n / s          one-sample t-test with µ0 = 0 !!!
t-test (after Student) for dependent samples
5) Decision (two-sided)

d.f. = n – 1          tcrit = t(α/2, d.f.)

if P(t) > α  /  |t| ≤ tcrit                    if P(t) < α  /  |t| > tcrit
do not accept HA, nor H0                       accept HA, reject H0
“could not show difference”                    populations are different
Non-parametric tests based on ranks

General principles of tests based on ranks:

• distribution-free
• non-parametric
• ranks: sort all values (rank order) and number sequentially.
• replace each original variate by its rank (reduce data to ordinal scale).
• generally less powerful than parametric procedures

Mann-Whitney U-test (analogous to independent t-test)


Wilcoxon test (analogous to dependent t-test)
Kruskal-Wallis-ANOVA (also called H-test, analogous to 1-way ANOVA)
U-test (after Mann & Whitney)
Non-parametric procedure for testing significant differences between two
independent samples from non-normally distributed populations with regard
to one variable.

compares the sums of ranks of the two samples

1) Hypotheses
H0: Two samples come from populations with identical “locations” (medians).
HA: Two samples come from populations which differ in location (median).
U-test (after Mann & Whitney)

2) Ranking of all observations, ignoring groups. Ties get average ranks.

3) Sums of ranks R1 and R2 for both samples.


Under true H0: ranks randomly mixed between the two samples, similar
mixture of ranks and equal rank sums

4) Calculation of test statistic U based on the sums of ranks.


When n > 20 U approaches ND
use z-distribution to calculate P(U) and zcrit

(small samples: “exact probability” based on probability distribution of U


calculated by repeated randomization of observations to groups)

5) Decision: as usual by comparing P(U) = P(z) with α
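
A minimal R sketch of the U-test with two small hypothetical samples (R's
wilcox.test() reports the statistic as W):

g1 <- c(3, 5, 7, 2, 9, 4)     # hypothetical counts, ND not assumed
g2 <- c(8, 12, 10, 9, 15, 11)
wilcox.test(g1, g2)           # Mann-Whitney U-test (two-sided)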


Multiple comparisons

“Bonferroni”-correction (Dunn-Sidak):

αt = 1 – (1 – α)^k

α    type I error of a single comparison
αt   total (overall) type I error
k    number of comparisons

[Diagram: tree of failure/success outcomes over repeated comparisons]

Overall αt for different single α and different numbers of comparisons k:

              k=2      k=3      k=4      k=5      k=10     k=100
α = 0.05      0.098    0.143    0.185    0.226    0.401    0.994
α = 0.01      0.020    0.030    0.039    0.049    0.096    0.634
α = 0.001     0.002    0.003    0.004    0.005    0.010    0.095
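
The table values can be reproduced directly, and R's p.adjust() applies the
corresponding corrections to P-values (sketch; the example P-values are
invented):

alpha <- 0.05
k <- c(2, 3, 4, 5, 10, 100)
round(1 - (1 - alpha)^k, 3)                             # overall type I error αt
p.adjust(c(0.04, 0.03, 0.01), method = "bonferroni")    # corrected P-values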
ANalysis Of VAriance

Simple analysis of variance (one-way ANOVA)

Parametric procedure for testing significant differences between more


than two independent groups from normally distributed populations
by means of one variable

Continuous response variable = dependent


Categorical group coding variable = independent = factor
(groups = factor levels)
Different types of ANOVA

• ANOVA with more than one factor = multifactorial ANOVA


• ANOVA, where the comparison between groups should be independent
of one or more continuous variables (= covariables) = ANCOVA
= analysis of covariance
• ANOVA with more than one dependent variable = MANOVA =
multivariate ANOVA
• ANOVA with dependent samples = Repeated measures ANOVA
• model I ANOVA: with treatment factors (deliberate manipulation)
• model II ANOVA: with random effects (e.g. random replication at
multiple levels – fish within cages within ponds)
Assumptions for ANOVA
• Variable X has to be normally distributed at each factor level (BUT: ANOVA
considered robust against violations)
• Homogeneity of variances (critical!)

When assumption of variance homogeneity violated:


1) Transform dependent variable, e.g. log(x).
2) Non-parametric test (Kruskal-Wallis H test).
3) Multiple pairwise comparisons using t-tests or U-tests and correct P.
4) Variant of Welch test “one way analysis of means”, in R: oneway.test().

At severe violation of the ND assumption


Non-parametric procedures

Example: Simple one-way analysis of variance (one-way ANOVA)


• equal group sizes (same n)
• 1 factor defining 3 groups, i.e. a 3-level ANOVA
• (e.g. body fat of students studying ecology, statistics, sports)
• assumptions fulfilled
(one-way ANOVA)

Hypotheses:

H0: The three groups are not different (come from the same population):
    µ1 = µ2 = µ3

HA: At least one group differs from at least one other group (comes from a
    different population), e.g. µ1 ≠ µ2 = µ3, or µ1 = µ2 ≠ µ3, or µ1 ≠ µ2 ≠ µ3
(one-way ANOVA)
Scheme of the analysis of variance

[Figure: x-values plotted for the three factor levels, with the group means
x̄1, x̄2, x̄3 and the grand mean x̄ marked]
(one-way ANOVA)

Calculation of the sum of squares between the groups
(squared differences between the group means and the grand mean):

SSb(between) = n · Σ (z = 1 to Z) (x̄z – x̄)²

Calculation of the sum of squares within the groups
(squared differences between each data point and its group mean):

SSw(within) = Σ (z = 1 to Z) Σ (i = 1 to n) (xiz – x̄z)²

[Figure: the same scheme with the deviations (x̄z – x̄) and (xiz – x̄z) indicated]
(one-way ANOVA)

1) Partitioning the total sum of squares (SS) (“Splitting of variance”):

Total SS
SSt(total) = Σ (z = 1 to Z) Σ (i = 1 to n) (xiz – x̄)²
sum of the squared total deviations – measure of total variation

Explained SS = Between-groups SS
SSb(between) = n · Σ (z = 1 to Z) (x̄z – x̄)²
sum of the squared deviations between groups – measure of group-to-group variation

Not explained SS = Within-groups SS
SSw(within) = Σ (z = 1 to Z) Σ (i = 1 to n) (xiz – x̄z)²
sum of squared deviations within groups – measure of within-group variation

z = group (z = 1, 2, ..., Z)
i = value number (i = 1, 2, ..., n)
(one-way ANOVA)

1) Partitioning the total sum of squares (SS) (“Splitting of variance”):

SSt(total) = SSb(between) + SSw(within)

Sums of squares are additive!

Variation of the whole dataset is partitioned into two parts depending on its origin!
(one-way ANOVA)

2) Mean squared deviations = SS / d.f.

MSt = SSt / (Z·n – 1)        MSb = SSb / (Z – 1)        MSw = SSw / (Z·(n – 1))

Under a true H0 the mean squares are variances and estimate σ² of the
(same) population.

MSt: data treated as 1 sample, its variance is an estimate for σ²

MSw: average within-group variation, “intragroup MS” or “error MS”
     = average variance of the groups, estimate for σ²

MSb: all means come from 1 population
     expected variance of the group means is S.E.M.² = s²/n
     multiply the variance of the means by n (already done for SSb)
     MSb = another estimate for σ²
(one-way ANOVA)

2) Mean squared deviations = SS / d.f.

MSt = SSt / (Z·n – 1)        MSb = SSb / (Z – 1)        MSw = SSw / (Z·(n – 1))

Under a true HA (different populations):

MSw: still the average within-group variation
     estimate for σ1² = σ2² = σ3² = σ² (variance homogeneity!)

MSb: now includes substantial group-to-group variation
     estimate for σ² larger than expected!
(one-way ANOVA)

3) Calculation of the test statistic

TS:  Femp = MSb / MSw

… close to 1 under a true H0
… >> 1 when MSb includes a group effect

4) F-distribution: F-values for two variance estimates from the same population

Calculate Fcrit under a true H0 at significance α with d.f.1 = Z – 1 and d.f.2 = Z·(n – 1)
Comparison of Femp with Fcrit

if Femp > Fcrit   Femp is an improbable (< α) value under a true H0
                  reject H0 / accept HA
if Femp ≤ Fcrit   Femp is a probable value under a true H0
                  reject HA
(one-way ANOVA)

5) Results: ANOVA-table

source of variation sum of squares df mean square Femp P


between groups 20.57 2 10.28 1.132 0.328
within groups 572.45 63 9.08
total 593.03 65

6) Post-hoc tests

which group differs from which one?
e.g. µ1 ≠ µ2 = µ3, or µ1 = µ2 ≠ µ3, or µ1 ≠ µ2 ≠ µ3

multiple pairwise comparisons with correction of P (Bonferroni and others)

Two-factorial analysis of variance (two-way ANOVA)

two categorical factors considered simultaneously


(or more multifactorial ANOVA)

Example: 2 factors with 2 levels each

• food consumption of rats (dependent variable)


• study both sexes (factor 1 = sex, 2 levels: male and female)
• compare food types (factor 2 = food, 2 levels: fresh and old)

collect replicates for each possible combination of factors


4 combinations (= groups = cells) with several replicates each
(two-way ANOVA)

Possible outcomes of experiment:

1. difference in food consumption between sexes


2. preference for a certain food type
3. difference in food preference among sexes, e.g. males prefer food 1,
females prefer food 2

1. and 2. are main effects


3. is interaction: dependence of effect of one factor on level of other
factor (can be: inhibition or synergism)
(two-way ANOVA)

3 sets of hypotheses, the null hypotheses are:

1. H0: no difference between sexes

2. H0: no difference between food types

3. H0: no interaction
(two-way ANOVA)
1) Partitioning the total sum of squares:

SSt(total) = SSsex + SSfood + SSinteraction + SSw(within)

SSsex from the means of the sexes (pooled over food types):
SSsex = F · n · Σ (s = 1 to S) (x̄s – x̄)²

SSfood from the means of the food types (pooled over sexes):
SSfood = S · n · Σ (f = 1 to F) (x̄f – x̄)²

SSw from the group means (groups: all sex × food combinations):
SSw = Σ (s = 1 to S) Σ (f = 1 to F) Σ (i = 1 to n) (xisf – x̄sf)²

SSinteraction by difference:
SSinteraction = SSt(total) – SSsex – SSfood – SSw(within)

s … index for sex (S = total number of levels)
f … index for food type (F = total number of levels)
(two-way ANOVA)
2) Mean squared deviations = SS / d.f.

MSt = SSt / (S·F·n – 1)                    MSw = SSw / (S·F·(n – 1))

MSsex = SSsex / (S – 1)                    MSfood = SSfood / (F – 1)

MSinteraction = SSinteraction / ((S – 1) · (F – 1))

Under a true H0 the mean squares are variances and estimate σ² of the (same)
population.

When there is an effect of sex, food or interaction
it adds additional variation to the corresponding MS
the MS will be larger than expected
(two-way ANOVA)
3) Calculation of the test statistics

Fsex = MSsex / MSw        Ffood = MSfood / MSw        Finteraction = MSinteraction / MSw

… close to 1 under a true H0
… >> 1 when the MS includes an effect

4) Calculate Fcrit under a true H0 at significance α with d.f.1 and d.f.2 for each H0
Comparison of Fsex / Ffood / Finteraction with the corresponding Fcrit

if Femp > Fcrit   Femp is an improbable (< α) value under a true H0
                  reject H0 / accept HA
if Femp ≤ Fcrit   Femp is a probable value under a true H0
                  reject HA
(two-way ANOVA)
5) Results: ANOVA-table

source of variation sum of squares df mean square F P


sex 10.21 1 10.21 0.549 0.46
food 37.41 1 37.41 2.012 0.16
interaction: sex x food 0.21 1 0.21 0.011 0.92
error 2156.97 116 18.59
total 2204.80 119

6) Post-hoc tests

in case of factor with > 2 levels: which group differs from which one?

multiple pairwise comparisons with correction of P (Bonferroni and others)
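
A minimal R sketch of the rat example as a two-way ANOVA with interaction (all
data simulated, so no real effects are expected):

set.seed(6)
dat <- expand.grid(sex = c("male", "female"), food = c("fresh", "old"), rep = 1:10)
dat$consumption <- rnorm(nrow(dat), mean = 20, sd = 4)      # simulated response
fit <- aov(consumption ~ sex * food, data = dat)            # main effects + interaction
summary(fit)
interaction.plot(dat$sex, dat$food, dat$consumption)        # parallel lines = no interaction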


(two-way ANOVA)
7) Interaction plots

Significant interaction term scrutinize results for main effects!


(a significant interaction can make main effect results worthless!)

[Interaction plots (a), (b), (c): mean food consumption of males and females
for food 1 and food 2]

(a) parallel response – no interaction

(b) and (c) interaction


Correlations

describe the mutual variation of two variables and measures the degree to
which variables are related. No functional dependence between the
variables is assumed.
a) Positive (= direct) correlation
b) Negative (= inverse) correlation
c) No correlation
d) Non-linear correlation

[Scatter plots (a)–(d) of Y versus X illustrating these four cases]
Correlations

Types of correlation according to the level of scale

                        Level of scale of X
Level of scale of Y     Metric                       Ordinal                      Nominal
Metric                  Product-moment correlation   Rank correlation:
                        (PEARSON´S r)                SPEARMAN´S rs, KENDALL´S Tau
Ordinal                 Rank correlation:            Rank correlation:
                        SPEARMAN´S rs, KENDALL´S Tau SPEARMAN´S rs, KENDALL´S Tau
Nominal                 Contingency tables: Φ-coefficient, KRAMER coefficient

The variable at the lowest level of scale always determines the choice of the
correlation measure.
Correlations between nominal-scaled variables
- contingency tables

to test the hypothesis that the frequency of occurrence in the categories


of one variable is related to the frequencies in the second variable.
The simplest case is for binary data (two categories only).

Contingency table

         x1      x2      Σ

y1       F11     F12     F1.
y2       F21     F22     F2.
Σ        F.1     F.2     F..

Σ = sum; F = frequency
Correlations between nominal-scaled variables

1) Setting up hypotheses
H0: No correlation between the two variables X and Y.
HA: Correlation between the two variables.

2) Calculation of the Φ-coefficient
The Φ-coefficient is one possible correlation coefficient, calculated
from the cell frequencies and the sums per column and per row:

Φ = (F11 · F22 – F12 · F21) / √(F1. · F2. · F.1 · F.2)

The Φ-coefficient ranges from
Φ = 0   no correlation   to
Φ = 1   perfect correlation
Correlations between nominal-scaled variables

3) Calculation of expected frequencies


In order to compare the observed frequencies with those frequencies
which we would expect if the null hypothesis is true, we first have to
calculate the expected frequencies E.
x1 x2

y1 E11 = F.1 x F1. / F.. E12 = F.2 x F1. / F..


y2 E21 = F.1 x F2. / F.. E22 = F.2 x F2. / F..

4) Calculation of the test statistic

χ² = Σi Σj (Fij – Eij)² / Eij          d.f. = (k – 1) · (r – 1)

k = number of columns; r = number of rows

Yates correction for continuity for 2 x 2 tables:

χ² = Σi Σj (|Fij – Eij| – 0.5)² / Eij
Correlations between nominal-scaled variables

5) Calculate P(χ²) = P(χ² ≥ χ²emp) (the tail probability) from the calculated
χ² and d.f.. Then compare to α.

Or as an alternative: calculate χ²crit (critical value) from α and d.f.,
then compare the calculated χ² to χ²crit.

Decision:
1) if P(χ²) ≥ α (equivalent to: χ² ≤ χ²crit)   do not accept HA; do
   not accept H0, either. You were not able to detect any significant
   correlation.
2) if P(χ²) < α (equivalent to: χ² > χ²crit)   accept HA; reject H0.
   There is a significant correlation between the two variables.

6) Stating results from a Φ-correlation analysis
Give the calculated Φ-correlation coefficient, the total number of
observations (n = F..) and the calculated P (if not significant) or the
significance level α (if significant).
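
A minimal R sketch for a hypothetical 2 x 2 contingency table; chisq.test()
applies the Yates correction to 2 x 2 tables by default, and Φ can be computed
from the uncorrected χ²:

tab <- matrix(c(30, 10, 15, 45), nrow = 2,
              dimnames = list(Y = c("y1", "y2"), X = c("x1", "x2")))   # invented counts
chisq.test(tab)                                    # with Yates continuity correction
chi2 <- chisq.test(tab, correct = FALSE)$statistic
sqrt(chi2 / sum(tab))                              # Φ = sqrt(χ² / n) for a 2 x 2 table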
Correlations between nominal-scaled variables

Kramer coefficient C
is another correlation coefficient for contingency tables, based on the
χ²-values:

C = √( χ² / (χ² + n) )

This coefficient is not standardised between 0 and 1. To do so, one has to
calculate the maximum possible C-value, which is given by

Cmax = √( (k – 1) / k )

The standardised coefficient is then obtained from

Cstand = C / Cmax
Rank correlation coefficients

must/can be used in several situations:


• At least one of the variables is ordinal-scaled and none of them is
nominal.
• Both variables are metric but not normally distributed.
• The relationship is not linear but monotonically increasing or
decreasing.

The procedure is similar to non-parametric tests for differences between


groups, i.e. it is based on ranks. Therefore, metric data have to be
transformed into ranks.
Spearman´s rank correlation rs

1) Setting up hypotheses
The (unknown) population correlation coefficient is often denoted by
ρs; it has to be estimated from the observed correlation coefficient rs.
H0: ρs = 0   No correlation between the two variables.
HA: ρs ≠ 0   Correlation between the two variables.

2) Calculation of rs
Each variable is ranked separately and for each object i the squared
difference (di²) between the rank of X and the rank of Y is computed.

rs = 1 – [ 6 · Σ (i = 1 to n) di² ] / [ n · (n² – 1) ]

rs = –1   perfect negative (inverse) correlation
rs = 0    no correlation
rs = +1   perfect positive correlation
Spearman´s rank correlation rs

3) The critical values
are tabulated for rs and have to be looked up in the respective tables.

4) Decision
1) if P(rs) ≥ α (equivalent to: |rs| ≤ rs,crit)   do not accept HA; do not
   accept H0, either. You were not able to detect any significant
   correlation.
2) if P(rs) < α (equivalent to: |rs| > rs,crit)   accept HA; reject H0. There
   is a significant correlation between the two variables.
5) Stating results
calculated correlation coefficient rs, the total number of observations
(n = number of objects) and the calculated P (if not significant) or the
significance level (if significant).
Product-moment correlation after Pearson

requires two metric-scaled variables, both of which have to be normally


distributed and the relationship is assumed to be linear.

Variations of X and Y:

Variance of variable X:  sx² = Σ (i = 1 to n) (xi – x̄)² / (n – 1)

Variance of variable Y:  sy² = Σ (i = 1 to n) (yi – ȳ)² / (n – 1)

Covariation = mutual variation of two variables, measured by the covariance:

Cov(X, Y) = Σ (i = 1 to n) (xi – x̄) · (yi – ȳ)

Covariance = mean value of the cross-product of the deviations of X and
Y from their mean values:

sxy = (1/n) · Σ (i = 1 to n) (xi – x̄) · (yi – ȳ)
Product-moment correlation after Pearson

Note that the covariance of a variable with itself equals the variance. The
covariance is not standardised (between –1 and +1). Instead, this
measure depends on the units that X and Y are measured in.

Standardised covariance = product-moment correlation r:

r = Cov(X, Y) / √( Var(X) · Var(Y) ) = Σ (xi – x̄)(yi – ȳ) / √[ Σ (xi – x̄)² · Σ (yi – ȳ)² ]

r = –1   perfect negative, linear correlation
r = 0    no linear correlation
r = +1   perfect positive, linear correlation

ρ (rho) refers to the correlation coefficient of the population:
ρ = average of [ (X – µx)/σx · (Y – µy)/σy ]
Product-moment correlation after Pearson

1) Setting up hypotheses
H0: ρ = 0   No linear correlation between the two variables X and Y.
HA: ρ ≠ 0   Linear correlation between the two variables.

2) Calculation of the correlation coefficient r

r = Σ (xi – x̄)(yi – ȳ) / √[ Σ (xi – x̄)² · Σ (yi – ȳ)² ]

The standard error sr of r is obtained by

sr = √[ (1 – r²) / (n – 2) ]
Product-moment correlation after Pearson

3) Calculation of significance
a) t-statistic

t = \frac{r}{s_r}        d.f. = n − 2;   t_{crit} = t_{α/2; d.f.}

Decision:
1) if P(t) ≥ α/2 (equivalent to: t ≤ t_crit) do not accept HA; do not accept H0, either. You were not able to detect any significant correlation.
2) if P(t) < α/2 (equivalent to: t > t_crit) accept HA; reject H0. There is a significant correlation between the two variables.
Product-moment correlation after Pearson

4) Calculation of significance
b) F-statistic

F = \frac{1 + r}{1 - r}        d.f._1 = n − 2;   d.f._2 = n − 2;   F_{crit} = F_{α/2; d.f._1; d.f._2}

Decision:
1) if P(F) ≥ α/2 (equivalent to: F ≤ F_crit) do not accept HA; do not accept H0, either. You were not able to detect any significant correlation.
2) if P(F) < α/2 (equivalent to: F > F_crit) accept HA; reject H0. There is a significant correlation between the two variables.
Product-moment correlation after Pearson

5) Stating results
State the calculated correlation coefficient r, the total number of observations (n = number of objects) and the calculated P (if not significant) or the significance level α (if significant).
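A minimal Python sketch of r, its standard error and the t-based significance described above; scipy is assumed and the data are invented:

```python
import numpy as np
from scipy import stats

# Hypothetical metric data for two (approximately) normally distributed variables
x = np.array([3.1, 4.2, 5.0, 6.3, 7.8, 8.4, 9.9, 11.2])
y = np.array([1.2, 1.9, 2.1, 2.8, 3.3, 3.1, 4.0, 4.6])

n = len(x)
r, p = stats.pearsonr(x, y)            # r and its two-tailed P directly

# The same test done "by hand" with the formulae from the slides
s_r = np.sqrt((1 - r**2) / (n - 2))    # standard error of r
t = r / s_r                            # t-statistic with d.f. = n - 2
p_manual = 2 * stats.t.sf(abs(t), df=n - 2)

print(f"r = {r:.3f}, t = {t:.2f}, d.f. = {n - 2}, P = {p_manual:.4f} (scipy: {p:.4f})")
```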
Potential pitfalls in correlation analysis

a) Heterogeneity of the data
Because of different groups within the data set, correlation analysis may lead to wrong results.

Example: fish weight (Y) plotted against fish length (X) for a data set that contains several distinct groups (figure).
Potential pitfalls in correlation analysis

b) Set-subset (e.g. body weight – liver weight)
A set and its subset are not independent in a statistical sense and inevitably have to be highly correlated.

c) Correlation induced by a third variable
Sometimes two variables (X, Y) seem to be correlated. In fact, X and Y may not be directly related; the correlation may instead be induced by a third variable Z ("lurking variable"). The influence of this "hidden" variable is often difficult to detect (diagram: Z influences both X and Y).
Potential pitfalls in correlation analysis

Partial correlation
If the lurking variable Z is known (or measured), its influence may be removed to obtain the correlation between the remaining variables of interest X and Y:

r_{XY/Z} = \frac{r_{XY} - r_{XZ} \, r_{YZ}}{\sqrt{(1 - r_{XZ}^2)(1 - r_{YZ}^2)}}

where r_{XY/Z} is the partial correlation between X and Y without the influence of variable Z.
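A Python sketch of the partial-correlation formula; the data are simulated so that Z drives both X and Y:

```python
import numpy as np
from scipy.stats import pearsonr

# Simulated example: Z influences both X and Y, inducing an X-Y correlation
rng = np.random.default_rng(1)
z = rng.normal(size=50)
x = 0.8 * z + rng.normal(scale=0.5, size=50)
y = 0.7 * z + rng.normal(scale=0.5, size=50)

r_xy = pearsonr(x, y)[0]
r_xz = pearsonr(x, z)[0]
r_yz = pearsonr(y, z)[0]

# Partial correlation of X and Y with the influence of Z removed
r_xy_z = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))
print(f"r_xy = {r_xy:.2f}, partial r_xy.z = {r_xy_z:.2f}")
```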
Simple linear regression

Investigation of the direct effect of one variable X (independent or explanatory variable) on a variable Y (dependent or response variable).

Therefore, we presume a direction of this relationship, i.e. a functional dependency.

In graphical representations of regressions, the independent variable is always given on the x-axis (abscissa) and the dependent variable on the y-axis (ordinate).
Simple linear regression

Example: wing lengths of sparrows at various times after hatching (from Zar 1984)

Age X (days):         3.0, 4.0, 5.0, 6.0, 8.0, 9.0, 10.0, 11.0, 12.0, 14.0, 15.0, 16.0, 17.0
Wing length Y (cm):   1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0

(Scatterplot: wing length (cm) against age (days).)
Simple linear regression

Our objective is to fit a line whose equation is of the form:

Y = \alpha + \beta X + \varepsilon        (expected values of Y: \hat{Y} = \alpha + \beta X)

α and β … the regression coefficients of the population, which are estimated by a and b from our sample
ε … the error (residuals)
Simple linear regression

Estimation:

Y = a + bX + e,   with residuals e = y − ŷ and the sum of squared residuals Σe² = Σ(y − ŷ)²

a … intercept (point of intersection of the linear regression line with the y-axis)
b … slope of the linear regression = Δy / Δx; the change in Y that accompanies a unit change in X

positive slope: increase of Y with an increase of X
negative slope: decrease of Y with an increase of X
Simple linear regression

(Figure: the fitted regression line through the sparrow data, wing length (cm) against age (days), with the intercept a and the slope b = Δy/Δx indicated on the plot.)
Simple linear regression

Estimation:
a and b are selected so that the sum of squared residuals is minimised:

\sum e^2 = \sum (y - \hat{y})^2 \rightarrow \min

This is the criterion of ordinary least squares (OLS), and this type of regression analysis is the OLS regression. The criterion is met by the formulae:

b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}

a = \bar{y} - b \, \bar{x}
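A minimal Python sketch applying these formulae to the sparrow data listed above (numpy assumed); the printed coefficients should agree with the values reported further below:

```python
import numpy as np

# Sparrow data from the example (Zar 1984)
age  = np.array([3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17], dtype=float)
wing = np.array([1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0])

x_mean, y_mean = age.mean(), wing.mean()

# OLS estimates of slope b and intercept a
b = np.sum((age - x_mean) * (wing - y_mean)) / np.sum((age - x_mean) ** 2)
a = y_mean - b * x_mean

print(f"wing length = {a:.3f} + {b:.3f} * age")
```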
Simple linear regression

1) State hypotheses
If b = 0, Y would not depend on X, because Y would not change with changing X (the regression would result in a more or less horizontal line). Therefore, we have to test whether the slope b is significantly different from 0.

H0: β = 0  No linear dependence of variable Y on variable X.
HA: β ≠ 0  Significant linear dependence.
Simple linear regression

2) ANOVA procedure
The overall significance of the model is tested by an ANOVA procedure.

Total SS = sum of the squared total deviations:

SS_t = \sum_{i=1}^{n} (y_i - \bar{y})^2        d.f._t = n − 1        MS_t = \frac{SS_t}{d.f._t}

Regression SS (= variance explained by the model) = linear regression sum of squares:

SS_{reg} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2        d.f._{reg} = 1        MS_{reg} = SS_{reg}

Residual SS (= unexplained variance) = the error term e:

SS_{res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = SS_t − SS_{reg}        d.f._{res} = d.f._t − d.f._{reg} = n − 2        MS_{res} = \frac{SS_{res}}{d.f._{res}}
Simple linear regression

(Figure: decomposition of the total deviation y_i − ȳ into the explained part ŷ_i − ȳ and the residual y_i − ŷ_i, where ŷ_i is the estimated y_i-value on the regression line.)

\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
Simple linear regression

2) ANOVA procedure
ANOVA table in simple regression analysis:

Source of variation | Sum of squares           | d.f.  | Mean square                 | F_emp            | P
Regression          | SS_reg = Σ(ŷ_i − ȳ)²     | 1     | MS_reg = SS_reg             | MS_reg / MS_res  | P(F_emp)
Residuals           | SS_res = Σ(y_i − ŷ_i)²   | n − 2 | MS_res = SS_res / d.f._res  |                  |
Total               | SS_t = Σ(y_i − ȳ)²       | n − 1 | MS_t = SS_t / d.f._t        |                  |
Simple linear regression

3) Calculation of the test statistic F
We test the ratio of the explained variance to the unexplained variance:

TS = F_{emp} = \frac{MS_{reg}}{MS_{res}}        d.f._{reg} = 1,   d.f._{res} = n − 2

4) Comparison with the critical F-value at a certain significance level
If F_emp ≤ F_crit, then F_emp is a quite likely outcome under a true null hypothesis. We therefore cannot accept HA.
If F_emp > F_crit, then F_emp is a highly unlikely (i.e., less probable than α) value under a true null hypothesis, and we therefore decide to reject H0 and accept HA.
As an alternative we can also calculate the tail probability P(F_emp) = P(F ≥ F_emp) and compare it to the significance level α.
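A self-contained Python sketch of this ANOVA and F-test for the sparrow data (scipy assumed; the OLS fit is obtained with numpy's polyfit instead of the hand formulae):

```python
import numpy as np
from scipy import stats

# Sparrow data and OLS fit (as in the previous sketch)
age  = np.array([3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17], dtype=float)
wing = np.array([1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0])
n = len(wing)
b, a = np.polyfit(age, wing, deg=1)      # slope and intercept

# ANOVA quantities
y_hat    = a + b * age
ss_total = np.sum((wing - wing.mean()) ** 2)
ss_reg   = np.sum((y_hat - wing.mean()) ** 2)
ss_res   = np.sum((wing - y_hat) ** 2)

ms_reg = ss_reg / 1            # d.f._reg = 1
ms_res = ss_res / (n - 2)      # d.f._res = n - 2

F  = ms_reg / ms_res
p  = stats.f.sf(F, 1, n - 2)   # tail probability P(F >= F_emp)
r2 = ss_reg / ss_total         # coefficient of determination (see below)

print(f"F(1, {n - 2}) = {F:.1f}, P = {p:.2e}, r2 = {r2:.3f}")
```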
Simple linear regression

5) Quality of the regression fit
The proportion (or percentage) of the total variation of Y that is explained, or accounted for, by the fitted regression is termed the coefficient of determination r², which measures the strength of the straight-line relationship:

r^2 = \frac{SS_{reg}}{SS_t}

In simple regression this is equal to the squared product-moment correlation coefficient.

r² = 0  no fit of the regression model; no variance is explained by the model
r² = 1  perfect fit of the model; the whole variance in the data is explained by the model (i.e. all data points lie exactly on the line)
Simple linear regression

6) Standard errors (S.E.) of the regression coefficients
The standard deviation of Y for given values of X is called s_{y.x}:

s_{y.x} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}}

The estimated values ŷ are obtained from the regression model:

\hat{y} = a + b x

The standard errors of a (S.E._a) and of b (S.E._b) are given by:

S.E._b = \frac{s_{y.x}}{\sqrt{\sum (x_i - \bar{x})^2}}

S.E._a = s_{y.x} \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}}
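The same quantities for the sparrow example, sketched in Python as a self-contained snippet:

```python
import numpy as np

# Sparrow data and OLS fit (as above)
age  = np.array([3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17], dtype=float)
wing = np.array([1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0])
n = len(wing)
b, a = np.polyfit(age, wing, deg=1)
y_hat = a + b * age

# Residual standard deviation of Y for given X
s_yx = np.sqrt(np.sum((wing - y_hat) ** 2) / (n - 2))

sxx  = np.sum((age - age.mean()) ** 2)
se_b = s_yx / np.sqrt(sxx)                               # standard error of the slope
se_a = s_yx * np.sqrt(1 / n + age.mean() ** 2 / sxx)     # standard error of the intercept

print(f"a = {a:.3f} +/- {se_a:.3f}, b = {b:.3f} +/- {se_b:.3f}")
```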
Simple linear regression

7) Stating the results
First, the type of model must be stated. The best way is to give the formula, with the regression coefficients a and b together with their standard errors. Further necessary results are the coefficient of determination r², the F_emp-value with the degrees of freedom and the significance.

Example:
The linear regression between wing length and age is highly significant (Y = 0.713 (±0.148) + 0.27 (±0.013) X; r² = 0.973; F_{1, 11} = 401.1; P < 0.001).
Simple linear regression

Assumptions of the regression analysis
1) normal distribution of the X and Y variables
2) in the population there exists a normal distribution of Y values for any value of X
3) the variances of these population distributions of Y must be equal to one another = homogeneity of variances or homoscedasticity (in contrast to heterogeneity of variances = heteroscedasticity)
4) the errors in Y are assumed to be additive = additivity
5) the values of Y have to be independent
6) measurements of X must be obtained without error (often impossible; then we assume that the errors in X are negligible, or at least small compared with the measurement errors in Y)

Regression statistics are known to be robust to violations of at least some of the underlying assumptions; therefore, as long as violations are not too severe, they are not of concern.
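As a rough, informal sketch (Python, repeating the sparrow data for completeness), some of these assumptions can be eyeballed via the residuals; note that with n = 13 such checks have little power:

```python
import numpy as np
from scipy import stats

# Sparrow data, OLS fit and residuals (as above)
age  = np.array([3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17], dtype=float)
wing = np.array([1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0])
b, a = np.polyfit(age, wing, deg=1)
fitted = a + b * age
residuals = wing - fitted

# Normality of the residuals (Shapiro-Wilk test)
w, p_norm = stats.shapiro(residuals)

# Homoscedasticity: rank correlation of |residuals| with the fitted values
r_het, p_het = stats.spearmanr(np.abs(residuals), fitted)

print(f"Shapiro-Wilk P = {p_norm:.3f}; |residual| vs fitted: rs = {r_het:.2f} (P = {p_het:.3f})")
```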
Simple linear regression

Linearisation
In order to conduct a linear regression analysis with data which do not show a linear dependency, the data can be linearised.
In the following figure, some possible linearisation procedures are given. Each panel shows a curved relationship between X and Y together with transformations that may straighten it: instead of x one may use x², x³, log(x) or −1/x, or instead of y one may use y², y³, log(y) or −1/y; which transformation is appropriate depends on the direction of curvature.
Simple linear regression

Linearisation, examples
Logarithmic function
(Figures: Y against X on the original scale of X and on a logarithmic scale of X; with the logarithmic X-scale the relationship appears linear.)

The logarithmic function is given by   y = a + b \ln x
Simple linear regression

Linearisation, examples
Exponential function
(Figures: Y against X on the original scale of Y and on a logarithmic scale of Y; with the logarithmic Y-scale the relationship appears linear.)

The exponential function is given by   y = a \cdot e^{bx}
Simple linear regression

Linearisation, examples
Linearisation of the exponential function:

y = a \cdot e^{bx};   taking the natural logarithm gives   \ln y = \ln a + b x

(Figure with logarithmic values on the y-axis: plotting ln Y, or Y on a logarithmic axis, against X turns the exponential curve into a straight line.)
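A Python sketch of this linearisation, with invented data that follow an exponential trend; the log-transformed response is then fitted by an ordinary linear regression:

```python
import numpy as np

# Invented data roughly following y = a * exp(b * x) with multiplicative noise
rng = np.random.default_rng(0)
x = np.linspace(1, 17, 20)
y = 2.0 * np.exp(0.3 * x) * rng.lognormal(sigma=0.05, size=x.size)

# Linearise: ln y = ln a + b * x, then fit a straight line
b_fit, ln_a_fit = np.polyfit(x, np.log(y), deg=1)
a_fit = np.exp(ln_a_fit)

print(f"fitted exponential: y = {a_fit:.2f} * exp({b_fit:.3f} * x)")
```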
Simple linear regression

Linearisation, examples
Power functions
(Figures: Y against X on the original scales of X and Y and on logarithmic scales of X and Y; with both axes logarithmic the relationship appears linear.)

The power function is given by   y = a \cdot x^{b}
Simple linear regression

Linearisation, examples
Linearisation of the power function:

y = a \cdot x^{b};   taking the natural logarithm gives   \ln y = \ln a + b \ln x

(Figure with logarithmic values on both the x- and the y-axis: plotting ln Y against ln X, or using logarithmic scales on both axes, turns the power function into a straight line.)
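The analogous Python sketch for the power function, again with invented data; both variables are log-transformed before the linear fit:

```python
import numpy as np

# Invented data roughly following y = a * x**b with multiplicative noise
rng = np.random.default_rng(0)
x = np.linspace(1, 800, 30)
y = 1.5 * x ** 0.65 * rng.lognormal(sigma=0.05, size=x.size)

# Linearise: ln y = ln a + b * ln x, then fit a straight line
b_fit, ln_a_fit = np.polyfit(np.log(x), np.log(y), deg=1)
a_fit = np.exp(ln_a_fit)

print(f"fitted power function: y = {a_fit:.2f} * x**{b_fit:.2f}")
```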