2023 BSC Stochastik Skript (en)
I Descriptive Statistics
1 Statistical attributes and variables ... 5
2 Measures to describe statistical distributions ... 31
3 Two dimensional distributions ... 71
4 Linear regression ... 103
II Probability Theory
5 Combinatorics and counting principles ... 120
6 Fundamentals of probability theory ... 131
7 Random variables in one dimension ... 169
8 Multidimensional random variables ... 210
9 Stochastic models and special distributions ... 237
10 Limit theorems ... 262
III Inferential Statistics
11 Point estimators for parameters of a population ... 280
12 Interval estimators ... 289
13 Statistical testing ... 320
IV Appendix
Literature ... 351
Index ... 352
Statistics
Literature:
[Anderson et al., 2012] Anderson, D. R., Sweeney, D. J., Williams, T. A., Camm, J. D., and Cochran, J. D. (2012). Statistics for Business and Economics. South-Western Cengage Learning, 12th edition.
[Bleymüller, 2012] Bleymüller, J. (2012). Statistik für Wirtschaftswissenschaftler. Vahlen, 16th edition.
[Lehn and Wegmann, 2000] Lehn, J. and Wegmann, H. (2000). Einführung in die Statistik. Vieweg+Teubner Verlag.
[Newbold et al., 2013] Newbold, P., Carlson, W. L., and Thorne, B. M. (2013). Statistics for Business and Economics. Pearson, 8th edition.
[Schira, 2016] Schira, J. (2016). Statistische Methoden der VWL und BWL – Theorie und Praxis. Pearson, 5th edition.
Part I – Descriptive Statistics
1 Statistical attributes and variables
Definition:
Statistical units are the objects whose attributes are of interest in a given context and
are focussed on and observed, surveyed, or measured within the scope of empirical
investigation.
The identification of similar statistical units belonging to a statistical population is essentially given by objective and precise identification criteria (IC) relating to
1. time
2. space and
3. objectivity
Definition:
The set
Ω := {ω | ω fulfills (IC)}
of all statistical units ω that fulfill the well-defined identification criteria (IC) is called the population.
Synonyms: statistical mass, collective.
It is not the statistical units ω themselves that are of direct interest, but some of their attributes M(ω).
Distinguishable manifestations of a characteristic are called characteristic values or
modes.
Examples:
The characteristic gender has the possible values {male, female, diverse}.
The characteristic eye color has the possible values {blue, green, grey, brown}.
For the characteristic body weight of adult humans all values between 30 and
300 kg have to be allowed as possible values.
Definition:
The statistical variable assigns a real number x to a statistical unit ω or, more precisely, to its characteristic M(ω). Thus
x = X(ω) = f(M(ω)).
X(ω) is a real-valued function of the characteristic values M(ω) and thus of the statistical units:
X : Ω → R
ω ↦ X(ω) = x
Synonyms: characteristics or variables.
Definition:
Each proper subset Ω∗ of Ω is called a subpopulation or sample of the whole population.
Subpopulations are called random samples if chance played a significant role in the
selection of the elements.
To get more realistic results, the random sample is selected on a representative basis: consequently, other characteristics are taken into consideration that could have a statistical influence on party preference. The random sample needs to reflect the share of women in the population of all eligible voters. The age structure should also conform to that of the whole population.
This already makes the sample quite representative for this purpose. It would certainly
still be important to take the geographical distribution into account to avoid a situation
where too many respondents happen to live in Baden-Württemberg. Furthermore, it
would be good if the professional structure were at least analogous in the
characteristics workers, employees, civil servants, self-employed. Yes, and of course
students must be in the sample, otherwise Green voters might be underrepresented.
Definition:
The (finite) sequence of the n values
x1, x2, ..., xi, ..., xn
observed for the statistical variable X is called the (raw) data set.
If the order of the observations does not matter, it is often helpful to sort and renumber the variable values:
x1 ≤ x2 ≤ x3 ≤ ... ≤ xi ≤ ... ≤ xn
Example: n = 20 observations
Definition:
The absolute frequency
ni := absH(X = xi )
indicates how often the statistical variable X takes a certain value xi .
Definition:
The tables

x:  x1  x2  ...  xk          x:  x1  x2  ...  xk
n:  n1  n2  ...  nk    and   h:  h1  h2  ...  hk

with $\sum_{i=1}^{k} n_i = n$ and $\sum_{i=1}^{k} h_i = 1$
are called the absolute and relative frequency distribution of the statistical variable X, respectively.
Example:
[Figure: graph of a frequency distribution — absolute frequencies n_i (axis 2 to 10) and relative frequencies h_i (axis 0.1 to 0.5) plotted over the values x_i = 1.6, 3, 4.1, 5.]
Definition:
The function
$h(x) = \begin{cases} h_i, & \text{if } x = x_i \\ 0, & \text{otherwise} \end{cases}$
is called the (relative) frequency function of the statistical variable X.
The function
$H(x) = \sum_{x_i \le x} h(x_i)$
is called the empirical (cumulative) distribution function.
[Figure: frequency function and distribution function H(x) for x = 1, ..., 6; H(x) is a step function with cumulative values 0.1, 0.3, 0.7, 1.0.]
Frequency and distribution function
Properties of the empirical cumulative distribution function
H is right-continuous, $\lim_{\Delta x \to 0^+} H(x + \Delta x) = H(x)$, with
$\lim_{x \to -\infty} H(x) = 0, \qquad \lim_{x \to \infty} H(x) = 1$
1. The difference H(b) − H(a) = relH(a < X ≤ b) specifies the relative frequency of observed values of the variable X that are greater than a, but not greater than b.
2. The function value at a point x indicates the relative frequency with which values less than or equal to x occur in the data set:
H(x) = relH(X ≤ x)
3. At each point, the values of the frequency function are obtained from the empirical distribution function as the difference
$h(x) = H(x) - \lim_{\Delta x \to 0^+} H(x - \Delta x)$
Class size
Class frequency
Approximation by polygons
Frequency density of the classes
Frequency density function and histogram
[Figure: x-axis divided into size classes by the boundaries ξ0 < ξ1 < ξ2 < ... < ξm.]
$\Delta_i := \xi_i - \xi_{i-1}, \quad i = 1, \dots, m$  (class size)
$h_i := \text{relH}(\xi_{i-1} < X \le \xi_i), \quad i = 1, \dots, m$  (class frequency)
Note: [Schira, 2016] uses right-hand inclusion, while [Anderson et al., 2012] and [Newbold et al., 2013] use left-hand inclusion.
Definition:
By assigning the class frequencies to the upper limits of the classes (an alternative possibility would be to assign the class frequencies to the class centers), the following frequency table can be drawn from the values:

ξ:  ξ1  ξ2  ...  ξm
h:  h1  h2  ...  hm        with $\sum_{i=1}^{m} h_i = 1$
Exercise:
What does the distribution function for the classes shown on the left look like? (Choose an appropriate upper limit for the final class.)
By focussing on the upper limits (or any other single point) of the classes we lose
information about the distribution within the classes. The assumption of a uniform
distribution leads to the definition below.
Definition:
Let H_K(x) be the distribution function of a characteristic X obtained by size classes with upper class limits ξ1, ξ2, ..., ξm. Then the ratio
$\frac{H_K(\xi_i) - H_K(\xi_{i-1})}{\xi_i - \xi_{i-1}} = \frac{h_i}{\Delta_i}$
is called the (average) frequency density of the i-th size class (i = 1, ..., m).
Approximation of the distribution function Hk
by a polygonal line H̄ (x )
Taking the derivative of H̄(x) leads to the (average) frequency density function of the size classes:
$\bar{h}(x) := \frac{d\bar{H}(x)}{dx}$
Its graph is called a histogram.
The area of a column corresponds to the
relative class frequency.
The total area of the columns of the
histogram is one.
3. Why are mainly representative random samples taken into account in practice?
4. What are the properties of the step function? What is its information content?
7. What is the difference between a bar chart (look up definition if necessary) and a
histogram? Under what condition do they both look the same?
Especially for statistical data sets with many different characteristic values, one would like to describe the entire distribution of the characteristic with the help of a few numbers. Such numbers are called measures or parameters of a distribution.
We distinguish between
measures of location
measures of dispersion
Definition:
A measure of central tendency or measure of location is a parameter used to describe the distribution of a variable by providing a „typical“ value. In particular, it describes the location of the data set, i.e. where or in which order of magnitude the values of the variable are located.
Definition:
A number xMod with
h(xMod ) ≥ h(xi ) for all i
is called mode or modal value of an empirical data set.
Examples:
The data set
2 3 3 4 4 4 5 6
has the unique mode x_Mod = 4, whereas the data set
1 2 3 3 3 4 5 6 6 6 7
has the two modes 3 and 6 (it is bimodal).
Definition:
The value
$\bar{x} := \frac{1}{n}\sum_{j=1}^{n} x_j$
is called the arithmetic mean of the data set.
1. Central property:
$\sum_{j=1}^{n} (x_j - \bar{x}) = 0$
2. Shifting all values of a data set X by the constant value a shifts the arithmetic
mean by exactly this value:
yi := xi + a ⇒ ȳ = x̄ + a
3. Multiplication of all values of a data set X with the constant factor b multiplies the
arithmetic mean by exactly this value:
zi := b · xi ⇒ z̄ = b · x̄
Definition:
The geometric mean of a data set with positive values is
$G_X := \sqrt[n]{x_1 \cdot x_2 \cdots x_n}, \qquad x_i > 0$
For the geometric mean, the individual characteristic values are multiplied and the n-th root is taken of the product. It is only defined if all values of the data set X are positive.
The logarithm of the geometric mean corresponds to the arithmetic mean of the logarithms (important for the calculation of the overall return on an investment):
$\log G_X = \frac{1}{n}\sum_{i=1}^{n} \log x_i$
Example:
For the data set X with the values
2 6 12 9
the geometric mean is $G_X = \sqrt[4]{2 \cdot 6 \cdot 12 \cdot 9} = \sqrt[4]{1296} = 6$, while the arithmetic mean is x̄ = 29/4 = 7.25.
Note: The geometric mean of a data set with only positive values is always smaller than the arithmetic mean, unless all the values in the data set are the same.
Question: Which mean is best suited for the calculation of average growth?
Arithmetic mean of the growth factors:
$\overline{1+r} = \frac{1}{4}(1.20 + 0.85 + 1.40 + 1.25) = 1.175$
Geometric mean:
$G_{1+r} = \sqrt[4]{1.20 \cdot 0.85 \cdot 1.40 \cdot 1.25} = 1.1559$
Second example — an investment doubles in the first year and halves in the second:
Arithmetic mean:
$\overline{1+r} = \frac{1}{2}\bigl((1 + 100\,\%) + (1 - 50\,\%)\bigr) = 1 + 25\,\% \Rightarrow 25\,\%$
Geometric mean:
$G_{1+r} = \sqrt{2 \cdot 0.5} = 1 \Rightarrow 0\,\%$
Only the geometric mean reproduces the actual overall growth.
Definition:
From the values x_i > 0 of a data set, one can calculate the reciprocal values 1/x_i and then the arithmetic mean of these values:
$\frac{1}{n}\left(\frac{1}{x_1} + \cdots + \frac{1}{x_n}\right)$
Taking the reciprocal of the result again yields the so-called harmonic mean
$H_X := \frac{n}{\sum_{j=1}^{n} \frac{1}{x_j}}$.
Example:
For the data set X with the values
2 6 12 9
the harmonic mean is $H_X = \frac{4}{\frac{1}{2} + \frac{1}{6} + \frac{1}{12} + \frac{1}{9}} = \frac{144}{31} \approx 4.645$.
Note: For every data set with (different) positive values, it can be shown that
$H_X < G_X < \bar{x}$
Two trucks travel at speeds of v1 = 60 km/h and v2 = 80 km/h on the highway. Thus the average speed (arithmetic mean) is
$\bar{v} = \frac{1}{2}\left(60\,\tfrac{\text{km}}{\text{h}} + 80\,\tfrac{\text{km}}{\text{h}}\right) = 70\,\tfrac{\text{km}}{\text{h}}$.
To estimate the (average) transport times t̄, and thus transport capacities and transport costs, for a distance of, say, Hamburg to Duisburg, one would divide the corresponding distance d = 420 km by this value and obtain with
$\frac{d}{\bar{v}} = \frac{420\text{ km}}{70\text{ km/h}} = 6\text{ h}$
a wrong value. Indeed, the transport times of the two trucks are t1 = 7 h and t2 = 5.25 h. Thus the average transport time is
$\bar{t} = 6.125\text{ h}$.
If, on the contrary, one divides the distance by the harmonic mean
$H_V = \frac{2}{\frac{1}{60\text{ km/h}} + \frac{1}{80\text{ km/h}}} = \frac{480}{7}\,\tfrac{\text{km}}{\text{h}} \approx 68.57\,\tfrac{\text{km}}{\text{h}}$,
then one receives with
$\frac{d}{H_V} = \frac{420\text{ km}}{\frac{480}{7}\text{ km/h}} = 6.125\text{ h} = \bar{t}$
the correct value.
In this example we want to calculate an average transport time for a fixed distance d. The problem with the average speed is that it is not valid over the whole time, because the first truck arrives already after 5.25 h and then stops while the other one is still moving. For the calculation of the mean transportation time, the speeds appear in the denominator due to the principle $t_i = \frac{d}{v_i}$. This leads to the harmonic mean:
$\bar{t} = \frac{1}{n}(t_1 + \cdots + t_n) = \frac{1}{n}\left(\frac{d}{v_1} + \cdots + \frac{d}{v_n}\right) = d \cdot \frac{1}{n}\left(\frac{1}{v_1} + \cdots + \frac{1}{v_n}\right) = d \cdot \frac{1}{H_V} = \frac{d}{H_V}$.
If, in contrast, we want to know how far the trucks have come on (arithmetic) average after a certain time t, the calculation is
$\bar{d} = \frac{1}{n}(d_1 + \cdots + d_n) = \frac{1}{n}(t \cdot v_1 + \cdots + t \cdot v_n) = t \cdot \frac{1}{n}(v_1 + \cdots + v_n) = t \cdot \bar{v}$
Robust measures
The measures presented so far are quite sensitive to outliers. This means that strong
deviations of individual values significantly influence these measures. This is not the
case with so-called robust measures.
Definition:
Starting with the raw data
x = (x1 , x2 , . . . , xn )
of a data set of size n, the characteristic values xi are arranged in ascending order:
Annotation:
In the following, the parentheses in the index are omitted for ordered samples. It should always be
clear from the context whether a data set is ordered or not.
x1 ≤ x2 ≤ x3 ≤ · · · ≤ xi ≤ · · · ≤ xn .
Definition:
A number x_Med with
relH(X ≤ x_Med) ≥ 50 %  and  relH(X ≥ x_Med) ≥ 50 %
is called a median of the data set.
Example: For the ordered data set (n = 12)
4 7 7 7 12 12 13 16 19 23 23 97
both conditions hold for x_Med = ½(x_6 + x_7) = ½(12 + 13) = 12.5.
Strictly speaking, in this example 12 or 13 or 12.2 would also be medians, since they divide the data in the middle as well.
In practice, the arithmetic mean and the median are the most important characteristic measures of
location for a given distribution. Colloquially, however, a distinction is not always made between the
two measures, especially when it comes to income or wealth distributions.
Think about when and why the average income differs from the median income.
Quartiles
In addition to the median, two other values can be defined that further divide the ordered
statistical data set:
Definition:
The characteristic values of the data set are arranged in ascending order
x1 ≤ x2 ≤ · · · ≤ xn
and divided into four segments with (as far as possible) the same number of values.
The three values
Q1 ≤ Q2 = x_Med ≤ Q3
are called quartiles and are defined in such a way that they lie in between the four segments, just as the median x_Med does.
Consequently about 50 % of the observations are found between Q1 and Q3 .
Definition:
A number x_[q] with 0 < q < 1 is called a q-quantile if it splits the data set X such that at least 100·q % of its observed values are less than or equal to x_[q] and at the same time at least 100·(1 − q) % are greater than or equal to x_[q], that is:
relH(X ≤ x_[q]) ≥ q  and  relH(X ≥ x_[q]) ≥ 1 − q.
Special quantiles:
Quartiles: Q1 = x_[0.25], Q2 = x_Med = x_[0.5], Q3 = x_[0.75]
Graphically, a q-quantile can be read off the distribution function as
$x_{[q]} = H^{-1}(q)$
This also works for step function shaped distribution functions if one directly hits a jump.
However, if one lands on a stairstep, the inverse function is not uniquely determined.
Then, in fact, every value between the adjacent jumps is a q-quantile:
xi ≤ x[q ] ≤ xi +1 .
To obtain a unique value, one then usually takes the arithmetic mean of both jump points:
$x_{[q]} = \frac{1}{2}(x_i + x_{i+1})$
For a data set, the q-quantile can also be determined without the detour via the graph of the distribution function:
$x_{[0.80]} = \frac{1}{2}(x_{16} + x_{17}) = \frac{1}{2}(46\,656 + 51\,854) = 49\,255$
[Figure: Turnover of 20 industrial companies (ordered bar chart, values up to 150 000) and the 80 % quantile.]
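A small Python sketch of this rule: for n ordered values and a probability q, average the two neighbouring order statistics whenever n·q lands exactly on an observation index, otherwise round up. Only x16 = 46 656 and x17 = 51 854 are taken from the script; the remaining turnover values below are made up for illustration.

```python
import math

def empirical_quantile(data, q):
    """q-quantile of a data set: average the two neighbouring
    order statistics if n*q is an integer, else round up."""
    xs = sorted(data)
    n = len(xs)
    k = n * q
    if k.is_integer():
        k = int(k)
        return 0.5 * (xs[k - 1] + xs[k])   # 1-based indices k and k+1
    return xs[math.ceil(k) - 1]

# 20 turnovers; only x16 and x17 are from the script, the rest is hypothetical.
turnover = [8000, 9500, 11000, 14000, 17000, 20000, 23000, 26000,
            29000, 32000, 35000, 38000, 40000, 42000, 44000, 46656,
            51854, 70000, 95000, 150000]
print(empirical_quantile(turnover, 0.80))   # 0.5*(46656+51854) = 49255.0
```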
Definition:
The range is the difference between the largest and the smallest value in a data set:
R := x_max − x_min
Definition:
The so-called mean absolute deviation
$\text{MAD} := \frac{1}{n}\sum_{j=1}^{n} |x_j - \bar{x}|$
is calculated as the arithmetic mean of the absolute values of the deviations of the characteristic values from their mean.
We recall the median and the quartiles Q1 ≤ Q2 = xMed ≤ Q3 , that divide the ordered
data set into four approximately equally sized parts.
Definition:
The difference
IQR := Q3 − Q1
is known as the interquartile range.
Definition:
The arithmetic mean of the deviations of the quartiles from the median is called the quartile deviation or semi-interquartile range:
$\frac{1}{2}\bigl((Q_3 - x_{Med}) + (x_{Med} - Q_1)\bigr) = \frac{Q_3 - Q_1}{2}$
Example:
For a data set with n = 14 values we are looking for the quartile deviation.
As median we take the arithmetic mean of both neighbours and obtain Q2 = xMed = 26.8.
Definition:
The average quadratic deviation from the arithmetic mean
$s_X^2 := \frac{1}{n}\sum_{j=1}^{n} (x_j - \bar{x})^2$
is called the (empirical) variance of the data set.
Variance calculation using relative frequencies: $s_X^2 = \sum_{j=1}^{k} h_j\,(x_j - \bar{x})^2$
The following distribution is given:

x_i:  4    5    6
h_i:  1/4  1/2  1/4
Example: For the n = 20 values
3 5 9 9 6 6 3 7 7 6 7 6 5 7 6 9 6 5 3 5
Arithmetic mean: 6
Variance: 3.1
Standard deviation: ≈ 1.761
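These three numbers can be reproduced directly; a minimal sketch (note that the script uses the population variance with divisor n, i.e. `statistics.pvariance`, not the sample variance with divisor n − 1):

```python
import statistics

data = [3, 5, 9, 9, 6, 6, 3, 7, 7, 6, 7, 6, 5, 7, 6, 9, 6, 5, 3, 5]

mean = statistics.mean(data)        # 6
var = statistics.pvariance(data)    # (1/n) * sum((x - mean)**2) = 3.1
std = statistics.pstdev(data)       # sqrt(3.1) ≈ 1.7607

print(mean, var, std)
```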
For classified data with frequency density function h(x), mean and variance are calculated as integrals:
$\bar{x} = \int_{-\infty}^{\infty} x\, h(x)\, dx \qquad s_X^2 = \int_{-\infty}^{\infty} (x - \bar{x})^2\, h(x)\, dx$
Example: For the frequency density $h(x) = \frac{3}{2}\left(x - \frac{1}{2}x^2\right)$ on $[0, 2]$ with mean $\bar{x} = 1$:
$s_X^2 = \int_0^2 (x - \bar{x})^2 h(x)\, dx = \int_0^2 (x-1)^2\, \frac{3}{2}\left(x - \frac{1}{2}x^2\right) dx = \frac{3}{2}\int_0^2 \left(x - \frac{5}{2}x^2 + 2x^3 - \frac{1}{2}x^4\right) dx$
$= \frac{3}{2}\left[\frac{1}{2}x^2 - \frac{5}{6}x^3 + \frac{1}{2}x^4 - \frac{1}{10}x^5\right]_0^2 = 3 - 10 + 12 - \frac{24}{5} = \frac{1}{5}$
and the standard deviation as its root: $s_X = \sqrt{1/5} \approx 0.4472$.
2. Shifting all values of a data set X by the constant value a leaves the variance unchanged:
$y_i := x_i + a \;\Rightarrow\; s_Y^2 = s_X^2$
3. Multiplication of all values of a data set X by a constant factor b multiplies the variance by the square of this value:
$z_i := b \cdot x_i \;\Rightarrow\; s_Z^2 = b^2\, s_X^2$
Note: $s_Z = |b|\, s_X$
4. In the special case of d = 0, the following formula for the simplified calculation of the variance is obtained:
$s_X^2 = \frac{1}{n}\sum_{j=1}^{n} x_j^2 - \bar{x}^2 = \overline{x^2} - \bar{x}^2 \qquad (2)$
Exercise : Use formula (2) to recalculate the variances of the preceding examples.
This means that the average quadratic deviation from the arithmetic mean x̄ is always smaller than the average quadratic deviation from any other value d (minimum property). Multiplying the equation by n, we get for the sum of the squared deviations from any d ∈ R:
$\text{SSE}(d) := \sum_{j=1}^{n} (x_j - d)^2 \;\ge\; \sum_{j=1}^{n} (x_j - \bar{x})^2$.
That is, SSE becomes minimal at x̄. This provides us with an alternative definition of the mean:
Definition:
The arithmetic mean x̄ is the value d ∈ R that solves
$\text{SSE}(d) \longrightarrow \min_d$
Variance and standard deviation
Definition:
The quotient of the standard deviation and the absolute value of the mean of a data set with x̄ ≠ 0,
$CV_X := \frac{s_X}{|\bar{x}|}$,
is called the coefficient of variation.
Example:
Over a period of 250 trading days, the Volkswagen share price had a mean value of 174.56 € and a standard deviation of 10.28 €. For the same period, a standard deviation of 4.68 € with a mean value of 36.96 € is determined for the BMW AG share. The two coefficients of variation as a measure of the volatility of the share prices are as follows:
$CV_X = \frac{10.28\,€}{174.56\,€} = 0.0589$ for VW and
$CV_Y = \frac{4.68\,€}{36.96\,€} = 0.1266$ for BMW
Thus, despite a lower absolute standard deviation, BMW stock is more volatile in relative
terms.
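A quick check of the two coefficients of variation in Python (a trivial sketch; the means and standard deviations are the ones from the example above):

```python
def coefficient_of_variation(std: float, mean: float) -> float:
    """CV = s_X / |mean| — relative, unit-free dispersion."""
    return std / abs(mean)

cv_vw = coefficient_of_variation(10.28, 174.56)   # ≈ 0.0589
cv_bmw = coefficient_of_variation(4.68, 36.96)    # ≈ 0.1266
print(cv_vw, cv_bmw)   # BMW is more volatile in relative terms
```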
The distribution of a data set can be analyzed quite well with only a few values. In
practice, one often uses the so-called five-point summary.
It divides the data set into four parts, so that each part contains about a quarter of the
observed values. It contains the median as a measure of location and the range and
interquartile range IQR as measures of variation.
Definition:
The graphical representation of the five-point summary is called a box plot.
[Figure: box plot ranging from x_min to x_max, with the box from Q1 to Q3 and the median inside.]
2. When is the arithmetic and when is the geometric mean used and why?
3. How does the variance change if all values of a data set are converted from DM to
euro?
4. Describe the translation theorem as a property of the variance! What feature results
from the special case d = 0?
5. What is the minimum property of the variance? What does the principle of least
squares mean in this context?
6. What does the coefficient of variation mean? Which measure from portfolio theory in
business administration comes to your mind?
Multi-dimensional statistics:
Each statistical unit ω_i of a population Ω can have a variety of characteristics.
Definition:
Univariate statistics takes only one characteristic or variable into account.
Multivariate statistics considers several variables for each unit ω_i.
Example:
For a person ω_i we measure the duration of education X1(ω_i) and the income X2(ω_i) five years after the end of education.
The simplest case: two variables X(ω_i) and Y(ω_i) are of interest. The result is paired data (x_i, y_i) for each ω_i. These can be represented as points in a scatter plot.
Definition: The contingency table represents the joint distribution of the statistical variables X and Y in a concise way.

        y1    y2    ...   yj    ...   yl   | Σ
x1      n11   n12   ...   n1j   ...   n1l  | n1•
x2      n21   n22   ...   n2j   ...   n2l  | n2•
...                                        | ...
xi      ni1   ni2   ...   nij   ...   nil  | ni•   (row sums)
...                                        | ...
xk      nk1   nk2   ...   nkj   ...   nkl  | nk•
Σ       n•1   n•2   ...   n•j   ...   n•l  | n     (column sums; n = total sum)

Here
n_ij = absH(X = x_i ∩ Y = y_j) is the absolute frequency with which the combination (x_i, y_j) was observed, and
$n_{i\bullet} = \sum_{j=1}^{l} n_{ij}$ and $n_{\bullet j} = \sum_{i=1}^{k} n_{ij}$ are the absolute frequencies with which x_i or y_j was observed ⇒ marginal frequencies.
A representation with relative frequencies is also common. For this purpose, the absolute frequencies, including the marginal frequencies, are divided by n.

        y1    y2    ...   yj    ...   yl   | Σ
x1      h11   h12   ...   h1j   ...   h1l  | h1•
x2      h21   h22   ...   h2j   ...   h2l  | h2•
...                                        | ...
xi      hi1   hi2   ...   hij   ...   hil  | hi•   (row sums)
...                                        | ...
xk      hk1   hk2   ...   hkj   ...   hkl  | hk•
Σ       h•1   h•2   ...   h•j   ...   h•l  | 1     (column sums; total sum 1)
Definition:
The one-dimensional distributions
$h_{i\bullet} = \text{relH}(X = x_i) = \frac{n_{i\bullet}}{n}, \quad i = 1, \dots, k$
and
$h_{\bullet j} = \text{relH}(Y = y_j) = \frac{n_{\bullet j}}{n}, \quad j = 1, \dots, l$
are called the marginal distributions of X and Y, respectively.
Observed data: (30, 1), (30, 2), (60, 4), (30, 2), (60, 1), (30, 4), . . . , (60, 2).
Sort and count: 24 × (30, 1), 24 × (30, 2), 32 × (30, 4), . . . , 68 × (60, 4)
Contingency table (absolute frequencies):

            Y
        1    2    4   | Σ
X  30   24   24   32  | 80
   60   16   36   68  | 120
Σ       40   60  100  | 200

Relative frequencies, with the marginal distribution of X in the last column and the marginal distribution of Y in the last row:

        1     2     4    | Σ
30      0.12  0.12  0.16 | 0.4
60      0.08  0.18  0.34 | 0.6
Σ       0.20  0.30  0.50 | 1
We now consider the distribution of X, given that (conditional on) Y has a fixed value y_j.
Definition:
Normalizing the columns of the contingency table to a column sum of 1 leads to a total of l one-dimensional distributions for j = 1, ..., l. These are called conditional distributions of X (conditional on Y = y_j):
$h_{i|Y=y_j} = \text{relH}(X = x_i \mid Y = y_j) = \frac{h_{ij}}{h_{\bullet j}}$.
Similarly, normalizing the rows to a row sum of 1 for i = 1, ..., k leads to the conditional distributions of Y (conditional on X = x_i):
$h_{j|X=x_i} = \text{relH}(Y = y_j \mid X = x_i) = \frac{h_{ij}}{h_{i\bullet}}$.
Example:
For the joint distribution of the previous numerical example, there are three conditional distributions of X next to the marginal distribution of X:

X    h_{i|Y=1}  h_{i|Y=2}  h_{i|Y=4}  h_{i•}
30   0.60       0.40       0.32       0.4
60   0.40       0.60       0.68       0.6
Σ    1          1          1          1

... and two conditional distributions of Y next to the marginal distribution of Y:

Y            1      2      4     | Σ
h_{j|X=30}   0.300  0.300  0.400 | 1
h_{j|X=60}   0.133  0.300  0.567 | 1
h_{•j}       0.2    0.3    0.5   | 1
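The marginal and conditional distributions of this example follow mechanically from the table of counts; a minimal sketch without external libraries:

```python
# Joint absolute frequencies n_ij from the example: rows X = 30, 60; cols Y = 1, 2, 4
counts = {30: {1: 24, 2: 24, 4: 32},
          60: {1: 16, 2: 36, 4: 68}}

n = sum(sum(row.values()) for row in counts.values())          # 200
row_sum = {x: sum(row.values()) for x, row in counts.items()}  # n_i. = 80, 120
col_sum = {y: sum(counts[x][y] for x in counts) for y in (1, 2, 4)}  # 40, 60, 100

marginal_X = {x: row_sum[x] / n for x in counts}               # 0.4, 0.6
marginal_Y = {y: col_sum[y] / n for y in col_sum}              # 0.2, 0.3, 0.5

# Conditional distribution of X given Y = y_j: normalize each column to 1
cond_X_given_Y = {y: {x: counts[x][y] / col_sum[y] for x in counts} for y in col_sum}
print(cond_X_given_Y[4])   # {30: 0.32, 60: 0.68}
```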
Definition:
If the joint distribution h_ij of the statistical variables X and Y is equal to the product of the two marginal distributions,
$h_{ij} = h_{i\bullet} \cdot h_{\bullet j}$
for i = 1, ..., k and j = 1, ..., l, then X and Y are called statistically independent.
In this case all conditional distributions coincide with the corresponding marginal distributions:
$h_{i|Y=y_j} = h_{i\bullet}, \quad i = 1, \dots, k \qquad h_{j|X=x_i} = h_{\bullet j}, \quad j = 1, \dots, l$
[Tables of a worked example: returns classified into ≤−2 %, (−2 %,−1 %], (−1 %,0 %], (0 %,1 %], (1 %,2 %], >2 %; shown are the absolute and relative frequencies, the conditional distributions, the marginal distribution 0.0702 / 0.9298 of the second characteristic (AD / ND), and the marginal distribution 0.0238, 0.0782, 0.3539, 0.4368, 0.0849, 0.0225 of the return classes.]
The mean value of a sum (difference) is equal to the sum (difference) of the mean values:
$\overline{x+y} = \bar{x} + \bar{y} \qquad \overline{x-y} = \bar{x} - \bar{y}$
This is true regardless of the joint distribution, and equally true for statistically independent as well as statistically dependent variables.
Special case:
$s_{X \pm Y}^2 = s_X^2 + s_Y^2, \quad \text{if } c_{XY} := \frac{1}{n}\sum_{j=1}^{n}(x_j - \bar{x})(y_j - \bar{y}) = 0$
Definition:
The quantity calculated from the n pairs of values (x_i, y_i)
$c_{XY} := \frac{1}{n}\sum_{j=1}^{n}(x_j - \bar{x})(y_j - \bar{y})$
is called the empirical covariance or, in short, the covariance between the statistical variables X and Y.
Simplified calculation:
$c_{XY} = \frac{1}{n}\sum_{j=1}^{n} x_j\, y_j - \bar{x}\,\bar{y} = \overline{x\,y} - \bar{x}\,\bar{y}$
The covariance can also be calculated using the relative frequencies from the contingency table:
$c_{XY} = \sum_{i=1}^{k}\sum_{j=1}^{l} h_{ij}\,(x_i - \bar{x})(y_j - \bar{y})$
Simplified calculation:
$c_{XY} = \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{l} h_{ij}\, x_i\, y_j}_{\overline{x\,y}} - \bar{x}\,\bar{y}$
Proposition :
If two variables X and Y are statistically independent, then the covariance cXY between
them is zero.
This proposition is not reversible because the covariance measures only the linear part of
the statistical dependence.
Definition:
The ratio
$r_{XY} := \frac{c_{XY}}{s_X \cdot s_Y}$
is called the (empirical) correlation coefficient between X and Y.
Properties:
Consider the linearly transformed variables
$U := a_1 + b_1 X \quad \text{and} \quad V := a_2 + b_2 Y \quad \text{with } b_1, b_2 \ne 0$.
Then we obtain
$r_{UV} = \frac{c_{UV}}{s_U \cdot s_V} = \frac{b_1 \cdot b_2 \cdot c_{XY}}{|b_1|\, s_X \cdot |b_2|\, s_Y} = \frac{b_1 \cdot b_2}{|b_1| \cdot |b_2|}\, r_{XY}$.
This means that |r_UV| = |r_XY|.
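This invariance is easy to check numerically; the sketch below uses small made-up data and a plain-Python implementation of r_XY (dividing by n throughout, as in the definitions above):

```python
import math

def corr(xs, ys):
    """Empirical correlation r_XY = c_XY / (s_X * s_Y), with 1/n throughout."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

x = [1, 2, 4, 5, 7]              # hypothetical data
y = [2, 1, 5, 4, 8]
u = [3 + 2 * xi for xi in x]     # U = a1 + b1*X with b1 = 2 > 0
v = [1 - 4 * yi for yi in y]     # V = a2 + b2*Y with b2 = -4 < 0

print(corr(x, y))   # some value r
print(corr(u, v))   # -r: same magnitude, sign flipped because b1*b2 < 0
```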
Example: With $c_{XY} = 3.6$, $s_X^2 = 216$ and $s_Y^2 = 1.56$ we obtain
$r_{XY} = \frac{3.6}{\sqrt{216} \cdot \sqrt{1.56}} = 0.1961$,
which indicates a weak positive correlation.
[Scatter plot examples: goals against, age of trainer, body weight.]
Note: The covariance or the correlation coefficient does not necessarily indicate a causal relationship between the characteristics. The available observations merely show a statistical tendency, which could also be purely coincidental.
[Diagram: father's height and son's height — correlation, but causality?]
Besides the correlation coefficient according to Bravais-Pearson there is another one, namely the one according to Spearman, also called the rank correlation coefficient.
Definition:
The rank correlation coefficient, named after Charles Edward Spearman,
$r^{Sp}_{XY} := r_{rg(X),\,rg(Y)}$
is the correlation coefficient between the ranks of the observations.
The following table shows the results of the Abitur examinations of ten students in the subjects German (feature G) and History (feature H). The maximum achievable score is 15 in each case.

Student  G   H   Rank G    Rank H
1        13  15  4         1
2        14  8   2.5 (2)   4 (3)
3        8   1   9         10
4        10  7   7         6.5 (6)
5        15  9   1         2
6        1   5   10        9
7        14  8   2.5 (3)   4 (4)
8        12  7   5         6.5 (7)
9        9   6   8         8
10       11  8   6         4 (5)
Question: Are the grades correlated? Does good performance in German go along
with good knowledge of history?
First, we determine the rankings for each student in each of the two subjects. To do this,
we arrange the students according to the results they obtained in the subjects. Students
with the same result are assigned the arithmetic mean of those rankings they would have
received if they had been arranged randomly (given in parentheses in each case). This
may result in rankings like 2.5 or 6.5.
Then we compute the variances, standard deviations, and the covariance of the ranks and obtain with
$r^{Sp}_{GH} = \frac{6.95}{2.8636 \cdot 2.8284} = 0.8581$
a fairly strong positive correlation, which was to be expected.
(Compare: r_GH = 0.549 for the raw scores.)
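The rank correlation, including the averaged ranks for ties, can be reproduced with `scipy.stats.spearmanr`, which ranks both samples (ties get average ranks) and computes the Pearson correlation of the ranks — exactly the procedure in the table above:

```python
from scipy.stats import spearmanr

german  = [13, 14, 8, 10, 15, 1, 14, 12, 9, 11]
history = [15, 8, 1, 7, 9, 5, 8, 7, 6, 8]

rho, p_value = spearmanr(german, history)
print(round(rho, 4))   # ≈ 0.8581, as in the script
```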
1. What is the difference between univariate and multivariate statistics? Think about an
example of bivariate statistics.
2. What is the structure and function of contingency tables? Are there also contingency
tables for more than two characteristics?
4. When is the variance of a sum smaller than the sum of the variances?
We look for a straight line
y = a + bx
in the scatter plot. Here we distinguish between an (in the mathematical sense) independent variable X and a dependent variable Y.
This straight line is supposed to be a mean straight line, that is, it is supposed to pass through the observed characteristic values (x_i, y_i) in such a way that it indicates the location and main direction of the point cloud in the scatter plot.
[Figure: scatter plot with the line y = a + bx and the vertical „deviation“ e_i of a point from the line.]
The method of least squares (LSM) uniquely assigns a mean straight line to the scatter plot.
Idea: We want to explain the observed values of Y as well as possible by the values of X:
y_i = a + bx_i + e_i
with the vertical deviation
e_i := y_i − ŷ_i
Question: What does „as well as possible“ mean? How do we determine the regression line?
It would be possible to minimize the deviations e_i, the sum of the deviations $\sum_{i=1}^{n} e_i$, or even the sum of the absolute values of the deviations $\sum_{i=1}^{n} |e_i|$. Unfortunately, none of these approaches leads to a useful calculation rule or to a unique determination of the straight line.
Instead, the LSM minimizes the sum of the squared deviations
$\text{SSE}(a, b) := \sum_{j=1}^{n} e_j^2 = \sum_{j=1}^{n} (y_j - a - bx_j)^2 \longrightarrow \min_{a,b}$
This means that the straight line is placed within the point cloud in such a way that SSE reaches the smallest possible value for the corresponding parameters a and b.
Definition:
Let (x1, y1), (x2, y2), ..., (xn, yn) be observed pairs of values of a two-dimensional statistical variable (X, Y) and let s_X > 0. The straight line
y(x) = a + bx
whose parameters solve SSE(a, b) → min is called the regression line.
The algebraic solution of this minimization task leads to the normal equations below.
(Partial) differentiation with respect to a and b and setting to zero yields the two so-called normal equations:
$\frac{\partial}{\partial a}\text{SSE}(a, b) = \sum_{j=1}^{n} 2(y_j - a - bx_j)\cdot(-1) \overset{!}{=} 0$
$\frac{\partial}{\partial b}\text{SSE}(a, b) = \sum_{j=1}^{n} 2(y_j - a - bx_j)\cdot(-x_j) \overset{!}{=} 0$
that is,
$\sum_{j=1}^{n} (y_j - a - bx_j) = 0, \qquad \sum_{j=1}^{n} (y_j - a - bx_j)\,x_j = 0$.
Splitting the sums results in
$\sum_{j=1}^{n} y_j - a\,n - b\sum_{j=1}^{n} x_j = 0, \qquad \sum_{j=1}^{n} x_j y_j - a\sum_{j=1}^{n} x_j - b\sum_{j=1}^{n} x_j^2 = 0$
Dividing both equations by n gives
$\bar{y} - a - b\bar{x} = 0, \qquad \overline{xy} - a\bar{x} - b\,\overline{x^2} = 0$,
and after solving the linear system of equations with respect to a and b we obtain
$a = \bar{y} - b\bar{x}, \qquad b = \frac{c_{XY}}{s_X^2}$.
Properties of regression lines
1. Mean line: The regression line passes through the center of mass (x̄, ȳ) of the point cloud:
$\bar{y} = a + b\bar{x} = y(\bar{x})$.
The sum of the deviations e_j, and thus their mean value, is zero:
$\sum_{j=1}^{n} (y_j - a - bx_j) = \sum_{j=1}^{n} e_j = 0 = \bar{e} \qquad (4)$
Furthermore,
$\sum_{j=1}^{n} e_j x_j = 0 \quad \text{and} \quad \sum_{j=1}^{n} e_j \hat{y}_j = 0. \qquad (5)$
2. Minimal variance of the deviations: The variance of the deviations, $s_e^2 = \frac{1}{n}\sum_{j=1}^{n} e_j^2$, is identical to the sum of least squares except for the factor 1/n. This means that the regression line also minimizes the variance of the deviations.
3. Variance decomposition: The total variance of Y splits into two parts, namely into the variance of the regression values and the variance of the deviations. The variance of the regression values measures the share of the variation in Y that is described or explained by the variation of the independent variable X. The variance of the deviations measures the share of the total variance not explained by the variation in X. Thus, the explained variance is smaller than the total variance, at least as long as there are deviations.
$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} e_i^2 + 2\underbrace{\sum_{i=1}^{n}\hat{y}_i e_i}_{=0\;(5)} - 2\bar{y}\underbrace{\sum_{i=1}^{n} e_i}_{=0\;(4)} = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} e_i^2$
Now we divide both sides by n and obtain with (6) the desired equation
$s_Y^2 = s_{\hat{Y}}^2 + s_e^2$.
Properties of regression lines
Definition:
The ratio of the variance explained in a linear regression to the total variance of the dependent variable Y,
$R^2 := \frac{s_{\hat{Y}}^2}{s_Y^2}$,
is called the coefficient of determination of the linear regression.
The larger R² is, the better the fit of the regression line to the point cloud. It is therefore used as a measure of goodness of fit.
Properties:
$0 \le R^2 \le 1 \quad \text{and} \quad R^2 = \left(\frac{c_{XY}}{s_X s_Y}\right)^2 = r_{XY}^2$.
Example:
Revenues y_j:   201 184 220 240 180 164 186 150 182 210
Marketing x_j:   24  16  20  26  14  16  20  12  18  22
[Worked table with columns i, x_i, y_i, x_i², x_i y_i, y_i² omitted.]
The resulting regression line is
y = 93.4249 + 5.2274 · x
Interpretation?
The coefficient of determination is
$R^2 = (r_{XY})^2 = 0.760$.
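The coefficients and R² of this example follow directly from the formulas a = ȳ − b x̄, b = c_XY / s_X² and R² = r_XY²; a compact sketch:

```python
x = [24, 16, 20, 26, 14, 16, 20, 12, 18, 22]             # marketing
y = [201, 184, 220, 240, 180, 164, 186, 150, 182, 210]   # revenues
n = len(x)

mx, my = sum(x) / n, sum(y) / n
cxy = sum(xi * yi for xi, yi in zip(x, y)) / n - mx * my  # covariance
sx2 = sum(xi ** 2 for xi in x) / n - mx ** 2              # variance of X
sy2 = sum(yi ** 2 for yi in y) / n - my ** 2              # variance of Y

b = cxy / sx2                  # ≈ 5.2274
a = my - b * mx                # ≈ 93.4249
r2 = cxy ** 2 / (sx2 * sy2)    # ≈ 0.760

print(a, b, r2)
```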
Suitable, for example, are nonlinear functions that can be transformed into linear functions by simple
transformation, such as:
Logarithmic approaches
Logarithmic linear approach
Semi-logarithmic approach
Quadratic approaches
If a relationship between more than two variables is to be established, the so-called multiple regres-
sion is used.
117 - 1
Logarithmic approaches
Definition: The logarithmic linear approach formulates a linear relationship not between the data itself, but between the logarithms of the data:
log y = a + b log x
Transforming back, we obtain a power function as the relationship between the originally observed values:
$y = a^* \cdot x^b$.
The coefficients of this regression are calculated with the already known formulas, but before that the initial data have to be transformed by taking the logarithms of the observed values.
Definition: With the so-called semi-logarithmic approach, only one of the two variables is logarithmically transformed:
log y = a + bx
Transforming back, we obtain an exponential function as the relationship between the originally observed values:
$y = a^* \cdot e^{bx}$.
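In practice one simply runs the ordinary LSM on the transformed data. A sketch of the log-log case with made-up data roughly following a power law (`a_star = exp(a)` recovers the prefactor of the back-transformed power function):

```python
import math

# Hypothetical (x, y) data roughly following y = 3 * x**0.7
xs = [1, 2, 4, 8, 16, 32]
ys = [3.0, 4.9, 8.0, 13.0, 21.2, 34.5]

# Ordinary least squares on (log x, log y)
lx = [math.log(v) for v in xs]
ly = [math.log(v) for v in ys]
n = len(lx)
mx, my = sum(lx) / n, sum(ly) / n
b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum((u - mx) ** 2 for u in lx)
a = my - b * mx

a_star = math.exp(a)                      # back-transformed prefactor
print(f"y ~ {a_star:.3f} * x^{b:.3f}")    # close to 3 * x^0.7
```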
Quadratic approaches
The quadratic approach fits a parabola
y = a + b1 x + b2 x²
Using the observed values, the three coefficients a, b1 and b2 are calculated with the LSM. For this, the method of multiple regression is used (see below). The variables x and x² are treated mathematically as two different variables, although they are not, of course.
Regression parabolas have the advantage that they can represent relationships whose direction reverses. This is useful when the relationship not only weakens with increasing x values, but also changes its sign, as illustrated in the figure (example from happiness research: young and old people are happier than people in middle age – midlife crisis).
Example: A farmer measures the statistical relationship between the use of fertilizer and crop yield. He conducts experiments with different fertilizer rates on 14 sections of his corn acreage (values in kg/ha):

Fertilizer:  15    30    45    60    75    90    105   120   135   150   165   180   195   210
Yield:       1800  3600  6840  7200  8100  8460  8640  9000  9180  9000  8640  8460  8100  7740

The fitted regression parabola is
y = 881 + 123x − 0.44x²
The maximum of this function is at x = 139.8 kg/ha. Whether this input is also optimal depends on the prices for fertilizer and corn.
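The parabola can be fitted with `numpy.polyfit`, which solves exactly this least squares problem for polynomial coefficients; the result should be close to the coefficients quoted above (small differences are rounding):

```python
import numpy as np

fertilizer = np.arange(15, 211, 15)             # 15, 30, ..., 210 kg/ha
yield_ = np.array([1800, 3600, 6840, 7200, 8100, 8460, 8640,
                   9000, 9180, 9000, 8640, 8460, 8100, 7740])

# polyfit returns the coefficients [b2, b1, a] of y = b2*x^2 + b1*x + a
b2, b1, a = np.polyfit(fertilizer, yield_, 2)
print(a, b1, b2)                 # roughly 881, 123, -0.44

x_max = -b1 / (2 * b2)           # vertex of the parabola
print(x_max)                     # ≈ 139.8 kg/ha
```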
Multiple regression
yi = b0 + b1 x1i + b2 x2i + ei
Simplified Example:
Dependent variable
Y : Percentage of students enrolled in private schools
Explanatory variables
X1 : income
X2 : Percentage of population who have completed 4 or
more years of college
Application-oriented and simplified explanation in anticipation of the last chapter:
The values for the coefficients are the respective least squares estimators for the constant term as well as for the prefactors of the variables of the linear regression model. One can now ask whether the true values for b0, b1, ... are indeed significantly different from 0, that is, whether the respective independent variables X1, X2, ... or the constant term really have an influence on the variable Y to be explained.
On the basis of a given data set, this cannot be determined with absolute certainty. But it is possible (under certain assumptions) to provide probabilities for the parameters to be significantly different from 0. For this purpose, a so-called test statistic (here the so-called t-quotient) is calculated and compared with a reference value. The p-value (so-called significance level) indicates the probability with which one would erroneously assume significance of the respective parameter if it were actually 0. The lower the p-value, the more confidently the respective parameter can be regarded as significant (different from 0). This is usually highlighted by different codes ***, **, *, to identify at a glance the parameters with high or low significance.
Control questions
1. What properties should a straight line have that best describes the average linear
relationship between two variables?
4. What properties of the least squares method of calculating a regression line do you
know?
5. What is the relationship between the slope of the regression line and the correlation
coefficient?
6. What is the relationship between R 2 and the correlation coefficient? Which values
can R 2 attain (extrema), and which statements can thus be made about the statistical
correlation?
5 Combinatorics and counting principles
Examples
a) On a Saturday evening a student of Frankfurt School is looking for two friends, who she assumes to be at one of four parties. How many possibilities does she have to visit the four parties one after another?
b) After class, seven students meet to play skat. How many possibilities do they have to form a group of three players?
Factorials and binomial coefficients
To determine the number of permutations that can be formed from n distinct objects, we may think of n placeholders on which we successively place the n elements.
For the first placeholder, n elements are at our disposal; therefore, there are n possibilities to occupy the first place. For the second placeholder, only n − 1 elements are left. Together with the n possibilities for the first place, we have n · (n − 1) possibilities to allocate the first two places. For the third place there are n − 2 possibilities, and so forth, until for the allocation of the n-th placeholder only one possibility (the last remaining element) is left. In total there are n · (n − 1) · ... · 1 possibilities to arrange n elements in a sequence. The product n · (n − 1) · ... · 1 of the first n natural numbers is denoted by n! (read: „n factorial“). Furthermore, it is convenient to set 0! = 1.
The student from example a) has 4! = 1 · 2 · 3 · 4 = 24 possibilities to visit the four parties one after another.
Factorials
[Table of values for factorials omitted.]
There are already more than 3.6 million possibilities to arrange only 10 distinct books on a shelf! There are 20! ways to visit 20 cities one after another; the magnitude of this number (about 2.4 · 10^18) is comparable to the age of the universe measured in seconds (~10^18)!
Calculation rules for factorials:
$\frac{n!}{k!} = \frac{n \cdot (n-1) \cdots (k+1) \cdot k \cdots 1}{k \cdots 1} = n \cdots (k+1)$  (reduction of fractions)
Binomial coefficients
Definition:
$\binom{n}{k} := \frac{n!}{k!\,(n-k)!}$
The binomial coefficient $\binom{n}{k}$ (read: „n choose k“) equals the number of k-element subsets of an n-element set.
Binomial coefficients
Example:
$\binom{5}{3} = \frac{5!}{3!\,(5-3)!} = \frac{5!}{3! \cdot 2!} = \frac{5 \cdot 4 \cdot 3 \cdot 2 \cdot 1}{3 \cdot 2 \cdot 1 \cdot 2 \cdot 1} = 10$
There are 10 different possibilities to choose 3 out of 5 distinct books.
If we ask for the number of possibilities to choose k elements out of a set of n elements, we may proceed as follows. We find all k-element subsets by taking only the first k elements from each of the n! possible arrangements of the n-element set. In doing so, each k-element subset will appear k!(n − k)! times on the first k places. Thus an n-element set has
$\frac{n!}{k!\,(n-k)!}$
k-element subsets. This term is denoted as the binomial coefficient $\binom{n}{k}$ (read: „n choose k“).
Referring to example b), the students have $\binom{7}{3} = \frac{7!}{3! \cdot 4!} = \frac{7 \cdot 6 \cdot 5}{3 \cdot 2} = 35$ possibilities to form a group of three players.
From its definition we recognize the relationship
$\binom{n}{k} = \binom{n}{n-k}$
and one can verify
$\binom{n+1}{k} = \binom{n}{k-1} + \binom{n}{k} \qquad (7)$
by calculation. Using this formula, binomial coefficients are derived by simple addition within the so-called Pascal's triangle.
Fundamental principle of counting
Theorem:
The number of possibilities to fulfill k issues simultaneously, each of which can be fulfilled independently in n_i ways (i = 1, ..., k), is just equal to the product of the individual numbers of possibilities and amounts to
$T = n_1 \cdot n_2 \cdots n_k$.
In many applications, each of the k issues can be satisfied in the same number of ways, that is, all n_i = n. Then the number of possibilities is simply
$T_k = n^k$
Examples:
Number of possible outcomes when tossing a coin and then rolling a die:
$T = n_1 \cdot n_2 = 2 \cdot 6 = 12$.
Number of possible outcomes when rolling a red and a blue die (i.e. two distinguishable dice):
$T_2 = 6^2 = 36$.
Definition:
Consider a set of n elements. Each arrangement of all these elements in any order is
called a permutation of these n elements.
Proposition:
If all n elements are distinguishable, the number of possible permutations is
$_nP = n!$
If not all elements of the set to be permuted are different, the number of distinguishable permutations will be smaller, of course.
Proposition:
If not all elements of the set to be permuted are different, we form m groups (classes) of equal elements from them; then let group i contain n_i ≥ 1 elements, so that n = n_1 + n_2 + ... + n_m. Then the number of distinguishable permutations of these elements is
$_nP_{n_1, n_2, \dots, n_m} = \frac{n!}{n_1! \cdot n_2! \cdots n_m!}$
Question: How many distinguishable permutations exist for the set {a, b, a, b}?
Definition:
Consider a set with n different elements. We are interested in the number of ways to choose a k-element subset from these elements. We distinguish two cases:
1. A variation of k-th order, also called an arrangement or k-permutation of n, is an ordered arrangement of a k-element subset of an n-set. The number of variations is
$_nV_k = \frac{n!}{(n-k)!}$
2. A combination of k-th order or k-combination is a selection of items from a set such that the order of selection does not matter. The number of possible combinations is
$_nC_k = \binom{n}{k} = \frac{n!}{k! \cdot (n-k)!}$
From the elements of the set {a, b, c}, taking into account the order, the following six variations of order 2 can be formed:
ab  ba
ac  ca       ⇒ 6 = 3 · 2 = 3!/1! = ₃V₂
bc  cb
If the order does not matter, you get only the three combinations
ab ≙ ba
ac ≙ ca      ⇒ 3 = 3!/(2! · 1!) = $\binom{3}{2}$ = ₃C₂
bc ≙ cb
Task:
At the Summer Olympics, n = 8 sprinters start the 100-meter race. How many variations (i.e. taking the order into account) are there for the gold, silver and bronze ranks?
Solution:
There are
$_8V_3 = \frac{8!}{(8-3)!} = 8 \cdot 7 \cdot 6 = 336$
possible orders for the three medal ranks.
If you want to know how many possibilities there are to place a bet in the lottery, you calculate the number of 6-combinations out of 49 elements (without taking the order into account), i.e. the binomial coefficient
$_{49}C_6 = \binom{49}{6} = \frac{49 \cdot 48 \cdot 47 \cdot 46 \cdot 45 \cdot 44}{6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1} = 13\,983\,816$
For a five in the lottery, you need five out of the six right and one out of the 43 wrong picks. There are
$\binom{6}{5} \cdot \binom{43}{1} = 6 \cdot 43 = 258$
different combinations of five. For a four, you need four out of the six correct numbers and at the same time two out of the 43 wrong numbers. So there are
$\binom{6}{4} \cdot \binom{43}{2} = 15 \cdot 903 = 13\,545$
different combinations of four.
Summary

k independent issues:                             $T = n_1 \cdot n_2 \cdots n_k$
Permutation (not all elements distinguishable):   $_nP_{n_1,\dots,n_m} = \frac{n!}{n_1! \cdot n_2! \cdots n_m!}$
k-variation (order matters):                      $_nV_k = \frac{n!}{(n-k)!}$
k-combination (order does not matter):            $_nC_k = \binom{n}{k} = \frac{n!}{k! \cdot (n-k)!}$
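Python's `math` module covers all of these counting functions directly (`math.perm` and `math.comb` exist since Python 3.8); the examples from this chapter:

```python
import math

print(math.factorial(4))    # 24       ways to visit the four parties
print(math.comb(7, 3))      # 35       skat groups of three out of seven students
print(math.perm(8, 3))      # 336      gold/silver/bronze among 8 sprinters
print(math.comb(49, 6))     # 13983816 lottery tickets
print(math.comb(6, 5) * math.comb(43, 1))   # 258   "five" combinations
print(math.comb(6, 4) * math.comb(43, 2))   # 13545 "four" combinations
```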
Definition:
An experiment is called a random experiment if it
1. is run according to a certain rule,
2. can be repeated under the same conditions any number of times,
3. and has an uncertain outcome that cannot be predicted.
Definition:
The individual, mutually exclusive and indivisible outcomes or results of a random ex-
periment are called elementary events or basic outcomes.
Example: When rolling a die, there are the basic outcomes: „1“, „2“, „3“, „4“, „5“ and „6“.
Definition:
The set Ω of all elementary events of a random experiment is called sample space or
random sample space of this random experiment.
Examples:
In the random experiment „throwing a die“ the sample space is Ω = {1, 2, 3, 4, 5, 6}
and has finitely many elements.
The random experiment „flip a coin until heads appears“ has the sample space Ω = {H, TH, TTH, TTTH, ...}, consisting of countably infinitely many elements.
Definition:
A random event A is a subset of the sample space Ω. The event A has occurred if the
result of the random experiment is an element of this subset A.
Example:
A : „even number of pips“ when rolling the dice
⇒ A = {2, 4, 6} ⊂ Ω
When rolling two dice, the random event A: „Sum of pips is higher than 10“ consists of the three
elementary events
A = {(5, 6), (6, 5), (6, 6)}
In the random experiment „flip a coin until heads appears“ the event B: „Heads does not appear
before the 5th flip“
B = {TTTTH, TTTTTH, TTTTTTH, . . . }
Events, sample space, set of events
Taking all events of a random experiment together, we obtain a set whose elements are
subsets of Ω.
Definition:
All events of a random experiment with sample space Ω form the associated set of
events E (Ω).
Example:
The random experiment „flipping a coin“ has the sample space Ω = {H, T}. The
corresponding set of events is
$E(\Omega) = \bigl\{\{\},\ \{H\},\ \{T\},\ \{H, T\}\bigr\}$
Since events are defined as sets, we can reuse the notations and operations from set
theory here. Thus we can calculate with events as with sets. These calculations lead
again to events in E (Ω).
Negation Ā: The event „not A“ occurs if and only if A does not occur.
Two events A and D are called disjoint or mutually exclusive if
A ∩ D = ∅.
Definition:
If an event A always occurs when an event C occurs, we say that C is a subevent of A and write C ⊆ A.
[Venn diagrams: complement Ā, union and intersection of A and B, disjoint events, and subevent C ⊆ A in Ω.]
Laplace:
„If an experiment can produce a (finite) number of different and equally possible outcomes, and some of them are to be considered favorable, then the probability of a favorable outcome (event A) is equal to the ratio of the number of favorable to the number of possible outcomes“:
$P(A) := \frac{\text{number of favorable outcomes}}{\text{number of possible outcomes}} = \frac{g}{m}$
(Pierre-Simon, Marquis de Laplace, 1749–1827)
$\Omega = \{e_1, e_2, \dots, e_m\}$
Definition:
A random experiment with finitely many equally probable elementary events is called a Laplace experiment.
In a Laplace experiment, the probability P(A) of event A with sample space Ω is given by
$P(A) = \frac{\text{number of elements of } A}{\text{number of elements of } \Omega} = \frac{|A|}{|\Omega|}$.
Task: A coin and a die are thrown together. What is the (Laplace) probability that heads and a number greater than 4 will appear?
Solution: The sample space is
Ω = {(H,1), (T,1), (H,2), (T,2), (H,3), (T,3), (H,4), (T,4), (H,5), (T,5), (H,6), (T,6)}
and the event is A = {(H,5), (H,6)}. Thus the Laplace probability is given by
$P(A) = \frac{2}{12} = \frac{1}{6}$.
Experiment: Throw a die. Write down how often the number 6 occurs.
Definition:
The limit of the relative frequencies
$P(A) = \lim_{n \to \infty} h_n(A)$
is called the statistical probability of the event A.
The convergence of the relative frequency is also called the law of large numbers. In general it holds that the larger the number of observations, the better the estimate
$P(A) \approx h_n(A)$, i.e. the estimate $\hat{P}(A) = h_n(A)$.
Risk situation A:
You get 1000 € with probability p. You obtain 0 € with probability 1 − p.
Risk situation B:
You get 1000 € if the DAX rises by at least 200 points within the next 3 months. If not, you get nothing.
Now p is varied until the individual is indifferent between these two risk situations (e.g. p = 40 %). Then the number p indicates the subjective probability that the DAX will rise by at least 200 points in the next three months.
Axioms of probability theory
Any function P : E → R that assigns a real number to each event A from the set of events E may be called a probability function if the following axioms are satisfied:
Axioms of Kolmogorov:
K1: P(A) ≥ 0 for all A ∈ E
K2: P(Ω) = 1
K3: P(A ∪ B) = P(A) + P(B) for disjoint events A ∩ B = ∅
Definition:
The sample space Ω together with the set of events E and the probability function P,
(Ω, E, P),
is called a probability space. It contains all the necessary information to determine and calculate probabilities for all events from a sample space.
Theorem 1:
P(Ā) = 1 − P(A)
for all A ∈ E.
Theorem 2:
The impossible event has probability zero:
P(∅) = 0 .
Theorem 3:
If the events A1, A2, ..., An are pairwise disjoint, then the probability of the event resulting from the union of all these events is equal to the sum of the individual probabilities:
$P(A_1 \cup A_2 \cup \cdots \cup A_n) = \sum_{j=1}^{n} P(A_j)$.
Theorem 4:
If A ⊆ B, then
P(A) ≤ P(B).
Example:
A die has been thrown. The probability that a „6“ occurred is
P({6}) = 1/6.
With the additional information that an even number of pips was thrown, the probability that a „6“ occurred is higher, namely
P({6} | even number of pips) = 1/3.
[Venn diagram: Ω = {1, ..., 6}, A = {6}, B = {2, 4, 6}.]
The probability of occurrence of an event A under the condition that event B has
occurred (or occurs simultaneously with A) is called conditional probability of A under
the condition B.
Definition:
Let A and B be two events of a given probability space (Ω, E, P). The conditional probability of A under the condition B is defined as
$P(A|B) := \frac{P(A \cap B)}{P(B)}$.
[Venn diagram: B as the new reference set, with A ∩ B inside it.]
Definition:
Two events A and B are called stochastically independent or, briefly, independent if
P(A|B) = P(A).
It is then as well P(B|A) = P(B) and, by the multiplication theorem, P(A ∩ B) = P(A) · P(B).
According to L APLACE, the probability for this would have to be P(A) = 2/37. According
to the multiplication theorem for independent events we get for the total probability
Independent events must not be confused with disjoint events! For disjoint events P(A ∩ B ) = 0
holds. One could even say that disjoint events are highly dependent events, because if one of them
occurs, the other cannot occur at all.
Total probability
Example:
A bulk article is produced by two machines. The faster machine has slightly more rejects than the other, but produces twice as much.
[Venn diagram: Ω split into H1 and H2 (the two machines), with the event A (defective article) overlapping both: A = (A ∩ H1) ∪ (A ∩ H2).]
Definition:
Any n events H1, H2, ..., Hn that are mutually exclusive but together fill the sample space entirely, i.e.
H_i ∩ H_j = ∅ for i ≠ j and H1 ∪ H2 ∪ ... ∪ Hn = Ω,
form a partition (a complete system of events) of the sample space.
[Diagram: Ω partitioned into H1, H2, ..., Hn, with an event A stretching across several H_i.]
Every event A can then be decomposed as
A = (A ∩ H1) ∪ (A ∩ H2) ∪ ... ∪ (A ∩ Hn).
The following applies to the individual summands according to the multiplication theorem: P(A ∩ H_i) = P(H_i) · P(A|H_i), and hence the theorem of total probability
$P(A) = \sum_{i=1}^{n} P(H_i)\, P(A|H_i)$.
Total probability
Example
Question: What is the total probability of drawing a white ball in the end?
Solution:
According to the theorem of total probability, the probability of drawing a white ball in the
end is:
From P(A ∩ B) = P(B) · P(A|B) = P(A) · P(B|A) we obtain Bayes' theorem:
$P(A|B) = \frac{P(A) \cdot P(B|A)}{P(B)}$.
Example from slide 159: The probability that a piece picked at random from the day's production was produced by machine 1 is a priori P(H1) = 2/3 (machine 1 produces twice as much).
However, if the piece is faulty, one would guess (since machine 1 has a larger scrap rate) that the probability of the piece being produced by M1 is higher. In fact, according to Bayes' theorem, the probability is
$P(\text{article produced on M1} \mid \text{article broken}) = \frac{\frac{2}{3} \cdot 0.1}{\frac{2}{3} \cdot 0.1 + \frac{1}{3} \cdot 0.07} = \frac{20}{27} = 0.7407$
Definition:
In Bayesian statistics, H1, ..., Hn denote alternative hypotheses. P(H_i|B) is called the a posteriori probability of the i-th hypothesis after knowledge of the observation B.
Given: (from experience about the reliability of the test and/or from disease statistics)
P(cancer) = P(A) = 2 %
P(test positive|cancer) = P(B|A) = 95 % ⇒ P(test negative|cancer) = P(B̄|A) = 5 %
P(test negative|no cancer) = P(B̄|Ā) = 90 % ⇒ P(test positive|no cancer) = P(B|Ā) = 10 %
Wanted: P(cancer|test positive) = P(A|B)
$P(A|B) = \frac{0.95 \cdot 0.02}{0.95 \cdot 0.02 + 0.1 \cdot 0.98} = 16.24\,\%$
The probability that the man actually has cancer has thus increased from 2 % a priori with the infor-
mation of the test result to 16.24 % a posteriori. Thus, the probability of disease is still a good 8 times
higher than assumed a priori.
If the test had been negative, the probability of having the disease anyway would have been
$P(A|\bar{B}) = \frac{0.05 \cdot 0.02}{0.05 \cdot 0.02 + 0.9 \cdot 0.98} = 0.11\,\%$
The same result with natural frequencies, for 10 000 men:
Ω: 10 000
A (cancer): 200                    Ā (no cancer): 9 800
  B|A (test positive): 190           B|Ā (test positive): 980
  B̄|A (test negative): 10            B̄|Ā (test negative): 8 820
P(A|B) = 190 / (190 + 980) = 16.24 %
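A small sketch of this Bayesian update; the direct formula and the natural-frequency view give the same posterior:

```python
p_cancer = 0.02              # prior P(A)
p_pos_given_cancer = 0.95    # sensitivity P(B|A)
p_pos_given_healthy = 0.10   # false positive rate P(B|not A)

# Bayes' theorem: P(A|B) = P(A) * P(B|A) / P(B)
p_pos = p_cancer * p_pos_given_cancer + (1 - p_cancer) * p_pos_given_healthy
posterior = p_cancer * p_pos_given_cancer / p_pos
print(round(posterior, 4))   # 0.1624

# Same computation with natural frequencies for 10 000 men:
true_pos = 10_000 * p_cancer * p_pos_given_cancer          # 190
false_pos = 10_000 * (1 - p_cancer) * p_pos_given_healthy  # 980
print(true_pos / (true_pos + false_pos))                   # ≈ 0.1624
```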
Control questions
2. Is it possible to specify different sample spaces for the same random experiment?
6. For which question is the multiplication theorem helpful, for which the addition
theorem?
Based on the above results, what value will the reporting manager calculate for the probability that these 16 stores will receive their deliveries on time every day (6 days total) during a week, or at least receive their goods, even if late?
According to the observations in the past, the statistical probability that a truck arrives on time on a single trip is 92.338 %.
To supply the 16 stores six days a week, 96 trips are required. The probability that all deliveries will be made on time is
$P(\text{all in time}) = (0.923\,38)^{96} = 0.000\,474\,8$.
The probability that all stores will be reached and supplied on all days of a week, even if late, is
7 Random variables in one dimension
Definition:
Let a probability space (Ω, E , P) be given. A function
X : Ω → R,
e 7→ X (e) ∈ R ,
that assigns a real number X (e) to each elementary event e ∈ Ω is called random
variable or stochastical variable.
[Figure: random variable X as a mapping of the sample space onto the real axis (values −1, 0, 1, 2, 3).]
Technical constraint: for any real number r , the set Ar = {e|X (e) ≤ r } has to be an event, i.e.
Ar ∈ E.
Random variables
Example
Ω = {(i , j )|i = 1, . . . , 6; j = 1, . . . , 6} .
X (i , j ) = i + j .
Y (i , j ) = |i − j | .
The codomain of Y is CY = {0, 1, . . . , 5}, the random variable Y can therefore only
take on six different values.
The random variable defined this way can now take all real
values between 0 and c (0 ≤ Z < c ).
Definition:
If the codomain C ⊂ R of a random variable X consists of finitely many or countably infinitely many values,
C = {x1, x2, x3, ...},
the random variable is called discrete.
Question: Which of the random variables in the examples on the previous pages are
discrete, and which are continuous?
Definition:
The function
F (x ) := P (X ≤ x ) ,
that assigns to each real number x the probability with which the random variable X
takes a value X ≤ x is called the distribution function of the random variable X .
Ax = {e|X (e) ≤ x } .
That is the reason for the restrictive condition in the definition of a random variable.
Distribution function
Example: A random variable taking the values 0 and 1, each with probability ½ (e.g. one coin toss), has the distribution function
$F(x) = \begin{cases} 0 & \text{for } x < 0 \\ \frac{1}{2} & \text{for } 0 \le x < 1 \\ 1 & \text{for } 1 \le x \end{cases}$
1. right-continuous: $\lim_{\Delta x \to 0^+} F(x + \Delta x) = F(x)$
2. monotonically increasing
3. $\lim_{x \to -\infty} F(x) = 0, \qquad \lim_{x \to \infty} F(x) = 1$
Alternative Definition:
Any function F (x ) on the domain of the real numbers and with the codomain C = [0, 1]
that has the above three properties is called a distribution function and defines a
random variable.
Definition:
If X is a discrete random variable then the function
f (x ) := P(X = x )
is called the probability mass function or in short the mass function of the random
variable X .
Properties:
1. $f(x) \ge 0$
2. $\sum_{\text{all } i} f(x_i) = 1$
Definition:
If X is a continuous random variable with distribution function F, then the first derivative
$f(x) := \frac{d}{dx} F(x)$
is called the probability density function or, in short, the density function of the random variable X.
Properties:
1. $f(x) \ge 0$
2. $\int_{-\infty}^{\infty} f(x)\, dx = 1$
Note that for a continuous random variable each single value has probability zero: P(X = a) = 0.
Example:
$F(x) = \begin{cases} 0 & \text{for } x < 0 \\ \frac{1}{27}(x-3)^3 + 1 & \text{for } 0 \le x \le 3 \\ 1 & \text{for } x > 3 \end{cases} \qquad f(x) = \begin{cases} 0 & \text{for } x < 0 \\ \frac{1}{9}(x-3)^2 & \text{for } 0 \le x \le 3 \\ 0 & \text{for } x > 3 \end{cases}$
$P(1 < X < 2) = \int_1^2 f(x)\, dx = F(2) - F(1) = 0.2593$
Definition:
Let X be a random variable and f its mass or density function. Its expected value E(X) is defined as
$E(X) := \sum_{\text{all } j} x_j\, f(x_j)$, if X is a discrete, and
$E(X) := \int_{-\infty}^{\infty} x\, f(x)\, dx$, if X is a continuously distributed random variable.
If the series or the improper integral has no finite value, then the random variable X has
no expected value.
The expected value is a parameter of location. To express its numerical value (but often
also instead of the notation E(X )), usually the Greek letter „mu“ is used:
µ or µX or µ(X )
$E(X) = 1 \cdot \frac{1}{3} + 2 \cdot \frac{2}{3} = \frac{5}{3} = 1.666\ldots = \mu$
Example: For the density $f(x) = \frac{x}{4}$ on $[1, 3]$ (and 0 elsewhere), the integral
$E(X) = \int_{-\infty}^{\infty} x\, f(x)\, dx = \int_{-\infty}^{1} 0\, dx + \int_1^3 x \cdot \frac{x}{4}\, dx + \int_3^{\infty} 0\, dx$
is split into three parts, of which only the middle one has to be calculated:
$E(X) = \int_1^3 \frac{x^2}{4}\, dx = \left[\frac{1}{12}x^3\right]_1^3 = \frac{27}{12} - \frac{1}{12} = \frac{26}{12} = 2.1667$
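Expected values of continuous distributions can be checked numerically, e.g. with `scipy.integrate.quad`; a sketch for the density above:

```python
from scipy.integrate import quad

def f(x):
    """Density f(x) = x/4 on [1, 3], 0 elsewhere."""
    return x / 4 if 1 <= x <= 3 else 0.0

total, _ = quad(f, 1, 3)                      # normalization check: 1.0
expected, _ = quad(lambda x: x * f(x), 1, 3)  # E(X) = 26/12 ≈ 2.1667
print(total, expected)
```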
Analogously, the expected value of a function g of the random variable X is defined as
$E[g(X)] := \sum_{\text{all } j} g(x_j)\, f(x_j)$, if X is a discrete, and
$E[g(X)] := \int_{-\infty}^{\infty} g(x)\, f(x)\, dx$, if X is a continuously distributed random variable.
(Again, the expected value exists only if the series or improper integral has a finite value.)
Calculation rules for expected values:
1. Constant: E(a) = a
4. Linear transformation: E(a + b·X) = a + b·E(X)
Centering: E(X − µ_X) = 0. Proof: follows directly from rule 4 (linear transformation) with b = 1 and a = −µ_X.
Definition:
Let X be a random variable and µ_X its expected value. The expected value of the squared deviation of the random variable from µ_X,
$V(X) := E[(X - \mu_X)^2]$,
is called the variance of X.
If the series or the improper integral has no finite value, then the random variable X has no variance.
The variance is a parameter of dispersion. To express its numerical value (but often also instead of the notation V(X)), usually the Greek letter „sigma“ is used:
$\sigma^2$ or $\sigma_X^2$ or $\sigma^2(X)$
$V(X) := \int_{-\infty}^{\infty} (x - \mu_X)^2 f(x)\, dx$, if X is a continuously distributed random variable.
Example:
x:    0    1    2
f(x): 1/4  1/2  1/4
$$V(X) = (0-1)^2 \cdot \tfrac{1}{4} + (1-1)^2 \cdot \tfrac{1}{2} + (2-1)^2 \cdot \tfrac{1}{4} = \tfrac{1}{4} + \tfrac{1}{4} = \tfrac{1}{2} = 0.5$$
$$\sigma_X = \tfrac{1}{\sqrt{2}} \approx 0.7071\,.$$
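The same computation as a minimal Python sketch:

```python
import numpy as np

x = np.array([0, 1, 2])
p = np.array([1/4, 1/2, 1/4])

mu = np.sum(x * p)                   # E(X) = 1
var = np.sum((x - mu) ** 2 * p)      # V(X) = 0.5
print(mu, var, np.sqrt(var))         # 1.0, 0.5, 0.7071...
```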
Example: For the density f(x) = (3/2)(x − x²/2) on [0, 2]:
$$V(X) = \int_0^2 (x-1)^2 f(x)\,dx = \frac{3}{2}\int_0^2 \left(x - \frac{5}{2}x^2 + 2x^3 - \frac{1}{2}x^4\right) dx$$
$$= \frac{3}{2}\left[\frac{1}{2}x^2 - \frac{5}{6}x^3 + \frac{1}{2}x^4 - \frac{1}{10}x^5\right]_0^2 = 3 - 10 + 12 - \frac{24}{5} = \frac{1}{5} = \sigma_X^2 \;\Rightarrow\; \sigma_X \approx 0.4472$$
1. Constant: V(a) = 0
3. Factor: V(b·X) = b²·V(X),  σ_{b·X} = |b|·σX
4. Linear transformation: V(a + b·X) = b²·V(X),  σ_{a+b·X} = |b|·σX
The variance is defined as the „expected value of the squared deviation from µ“. If we compare it to the expected value of the squared deviation from some other value d, then:
5. In the special case d = 0, the following formula for the simplified calculation of the variance is obtained:
$$V(X) = E(X^2) - \mu^2 = E(X^2) - [E(X)]^2$$
Using the calculation rules for expected values and variances, we can transform random variables in a „clever way“:
$$X \;\longrightarrow\; Y := X - \mu \;\longrightarrow\; Z := \frac{X - \mu}{\sigma}$$
Definition:
If X is a random variable with expected value µ and standard deviation σ > 0, then the transformed random variable
$$Z := \frac{X - \mu}{\sigma}$$
is called standardized.
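A brief empirical illustration of standardizing, as a sketch (the exponential distribution below is just an arbitrary assumed input):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)   # arbitrary skewed data

z = (x - x.mean()) / x.std()    # standardize with (sample) mean and std
print(z.mean(), z.std())        # approximately 0 and 1
```

Standardizing changes location and scale only; the shape of the distribution stays the same.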
Wanted: The probability that a random variable X falls into an interval between a and b:
$$P(a < X \le b) = \int_a^b f(x)\,dx \ \text{(continuous)} \qquad \text{or} \qquad P(a < X \le b) = \sum_{a < x_j \le b} f(x_j) \ \text{(discrete)}$$
For this, however, the density or mass function f must be known. The following inequality provides an estimate even for unknown f, if at least the expected value and the standard deviation of the distribution are known.
Theorem of CHEBYSHEV:
Let X be an arbitrary continuous or discrete random variable with expected value µ and
standard deviation σ , then the inequality
$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}$$
always holds for any k > 0 and completely independent of the distribution.
For single values of k, the following estimates are obtained outside the kσ bound:
k = 1:   P(|X − µ| ≥ σ) ≤ 1 (trivial)
k = 1.5: P(|X − µ| ≥ 1.5σ) ≤ 0.444…
k = 2:   P(|X − µ| ≥ 2σ) ≤ 0.25
k = 2.5: P(|X − µ| ≥ 2.5σ) ≤ 0.16
k = 3:   P(|X − µ| ≥ 3σ) ≤ 0.111…
The complementary probability inside the kσ bound is
$$P(|X - \mu| < k\sigma) = 1 - P(|X - \mu| \ge k\sigma) \ge 1 - \frac{1}{k^2}$$
For single values of k, the following estimates are obtained inside the kσ bound:
k = 1.5: ≥ 0.555…,  k = 2: ≥ 0.75,  k = 2.5: ≥ 0.84,  k = 3: ≥ 0.888…
Example: Since X is a discrete random variable and CHEBYSHEV's inequality for ranges outside a kσ bound is formulated with „≥“, we reformulate the desired probability as follows:
$$P(X < 3 \cup X > 6) = P(|X - 4.5| \ge 2.5) \le \frac{1}{k^2}$$
according to CHEBYSHEV, where 2.5 = kσ has to hold. This results in k = 2.5/σ = 2.5/√(8/3) ≈ 1.5309, and we get
$$P(X < 3 \cup X > 6) \le \frac{1}{k^2} \approx 0.4267\,.$$
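The universality of the bound can be checked by simulation; a minimal sketch (the exponential distribution here is only an assumed stand-in for an "arbitrary" distribution):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=1_000_000)   # mu = 1, sigma = 1
mu = sigma = 1.0

for k in (1.5, 2.0, 2.5, 3.0):
    tail = np.mean(np.abs(x - mu) >= k * sigma)  # observed tail probability
    print(k, tail, 1 / k**2)                     # always below the bound 1/k^2
```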
Moments
Moments of a distribution are the expected values of powers of random variables.
Definition:
Mk := E(X k )
is called the k-th moment or moment of k-th order of the distribution, if it exists. The expected value of the k-th power of the deviation from the mean,
$$M_k^Z := E\left[(X - \mu)^k\right],$$
is called the k-th central moment. The sequence of moments (or central moments) characterizes the distribution.
Example:
$$M_3 = \int_0^2 x^3 f(x)\,dx = \int_0^2 x^3 \cdot \frac{3}{2}\left(x - \frac{1}{2}x^2\right) dx = \frac{3}{2}\int_0^2 \left(x^4 - \frac{1}{2}x^5\right) dx$$
$$= \frac{3}{2}\left[\frac{1}{5}x^5 - \frac{1}{12}x^6\right]_0^2 = \frac{3}{2} \cdot 2^5 \cdot \left(\frac{1}{5} - \frac{1}{6}\right) = 1.6$$
and
$$M_3^Z = \int_0^2 (x - \mu)^3 f(x)\,dx = \int_0^2 (x-1)^3 \cdot \frac{3}{2}\left(x - \frac{1}{2}x^2\right) dx$$
$$= \frac{3}{2}\int_0^2 \left(-x + \frac{7}{2}x^2 - \frac{9}{2}x^3 + \frac{5}{2}x^4 - \frac{1}{2}x^5\right) dx$$
$$= \frac{3}{2}\left[-\frac{1}{2}x^2 + \frac{7}{6}x^3 - \frac{9}{8}x^4 + \frac{1}{2}x^5 - \frac{1}{12}x^6\right]_0^2 = 0$$
Definition:
The ratio
$$\gamma := \frac{E\left[(X - \mu)^3\right]}{\sigma^3}$$
is called the skewness of a distribution. For symmetric distributions, γ = 0.
A right-skewed continuous and a left-skewed discrete distribution
The 4th central moment is – if it exists – positive for every distribution. It gives information about the kurtosis or „tailedness“ of a distribution. To obtain a measure independent of scale and scatter, one divides by the 4th power of the standard deviation:
Definition:
The ratio
$$\kappa := \frac{E\left[(X - \mu)^4\right]}{\sigma^4}$$
is called the kurtosis of a distribution.
κ = 3 is considered normal. Distributions with larger κ values have narrower and more peaked density functions; those with smaller κ values are more broadly curved than the normal distribution.
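For the example density f(x) = (3/2)(x − x²/2) used above, the moments can be obtained by numerical integration; a sketch assuming scipy:

```python
from scipy.integrate import quad

f = lambda x: 1.5 * (x - x**2 / 2)                  # density on [0, 2]
E = lambda g: quad(lambda x: g(x) * f(x), 0, 2)[0]  # expected value of g(X)

mu = E(lambda x: x)                                  # 1.0
var = E(lambda x: (x - mu) ** 2)                     # 0.2
gamma = E(lambda x: (x - mu) ** 3) / var ** 1.5      # 0 (symmetric)
kappa = E(lambda x: (x - mu) ** 4) / var ** 2        # ~2.14, flatter than normal
print(mu, var, gamma, kappa)
```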
Median and Quantiles
The median or central value xMed of a random variable X is a number that lies in the middle of the distribution in such a way that the probabilities for X to take a value greater or less than xMed are just equal:
$$P(X \le x_{Med}) = F(x_{Med}) = \frac{1}{2}\,,$$
hence
$$x_{Med} = F^{-1}\!\left(\tfrac{1}{2}\right),$$
if the inverse function exists. This is usually the case for continuous random variables, but not for discrete ones. More generally, xMed is a median if both P(X ≤ xMed) ≥ 1/2 and P(X ≥ xMed) ≥ 1/2.
Example: Let Y be the number of heads when flipping two coins. The median is clearly yMed = 1, because only for this number it holds that
$$P(Y \le 1) = \frac{3}{4} \ge \frac{1}{2} \qquad \text{and} \qquad P(Y \ge 1) = \frac{3}{4} \ge \frac{1}{2}\,.$$
Actually, the median is only a special case of the more generally defined quantiles:
Definition:
A number x[q] with 0 < q < 1 is called a q-quantile if at the same time
$$P(X \le x_{[q]}) \ge q \qquad \text{and} \qquad P(X \ge x_{[q]}) \ge 1 - q\,.$$
A very important distribution whose quantiles are needed by every statistician is the so-called standard normal distribution. The sketch shows the density function of a standard normally distributed random variable Z. Its median is zMed = z[0.5] = 0.

q      z[q]
0.5    0.000
0.9    1.282
0.95   1.645
0.975  1.960
0.99   2.327
0.995  2.575
1. Describe the difference between a random variable and a „normal“ variable in your own words.
2. What must the codomain of a random variable be like for it to be called continuous?
3. Which properties must a mass function have, which a density function?
4. How are the density function and the distribution function related? How can the mass function be extracted from the distribution function?
5. Is the expected value the most likely value for a random variable?
6. What is measured by the variance of a random variable?
7. What is the effect of standardizing? What characterizes a standardized random
variable?
8. Can you make probability statements about random variables whose distribution you
do not know?
9. To what extent can C HEBYSHEV’s inequality be said to provide a „rough“ estimate?
Definition:
For a two-dimensional discrete random variable (X, Y), the function
$$f(x, y) := P(X = x \cap Y = y)$$
is called the joint mass function.
Properties:
The following always holds:
1. f(x, y) ≥ 0,
2. $\sum_i \sum_j f(x_i, y_j) = 1$,
       y1    y2    ...  yj    ...  yl   | Σ
x1     p11   p12   ...  p1j   ...  p1l  | p1•
x2     p21   p22   ...  p2j   ...  p2l  | p2•
...
xi     pi1   pi2   ...  pij   ...  pil  | pi•
...
xk     pk1   pk2   ...  pkj   ...  pkl  | pk•
Σ      p•1   p•2   ...  p•j   ...  p•l  | 1

Margins:
$$p_{i\bullet} = \sum_j p_{ij}\,, \qquad p_{\bullet j} = \sum_i p_{ij}$$
Example: Draw two balls from the same urn without replacement
For the case of drawing without replacement, the calculation of the
joint distribution is a bit more difficult, because now the general
multiplication theorem has to be applied: The corresponding
conditional probabilities for the component Y , after a „1“ has
already appeared at the 1st draw and has not been replaced are:
P(Y = 1|X = 1) = 2/5,  P(Y = 2|X = 1) = 2/5,  P(Y = 3|X = 1) = 1/5
Now we have
P(X = 1 ∩ Y = 1) = P(X = 1) · P(Y = 1|X = 1) = 1/5
P(X = 1 ∩ Y = 2) = P(X = 1) · P(Y = 2|X = 1) = 1/5 . . . etc.
Joint distributions of (X, Y):

With replacement (independent draws):
       Y=1    Y=2    Y=3   | Σ
X=1    1/4    1/6    1/12  | 1/2
X=2    1/6    1/9    1/18  | 1/3
X=3    1/12   1/18   1/36  | 1/6
Σ      1/2    1/3    1/6   | 1

Without replacement:
       Y=1    Y=2    Y=3   | Σ
X=1    1/5    1/5    1/10  | 1/2
X=2    1/5    1/15   1/15  | 1/3
X=3    1/10   1/15   0     | 1/6
Σ      1/2    1/3    1/6   | 1
Definition:
The joint distribution function
F (x , y ) = P(X ≤ x ∩ Y ≤ y )
indicates the probability with which the random variable X takes on values less than or
equal to x and at the same time the random variable Y takes on values less than or
equal to y . F is obtained by summing up the joint mass function:
$$F(x, y) = \sum_{x_i \le x} \sum_{y_j \le y} f(x_i, y_j)$$
Definition:
A function f(x, y) for which
$$\int_a^b \int_c^d f(x, y)\,dy\,dx = P(a < X \le b \cap c < Y \le d)$$
for all a < b and c < d is called the joint probability density function of X and Y.
Properties:
1. f(x, y) ≥ 0,
2. $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dy\,dx = 1$.
Example:
$$f(x, y) = \frac{1}{2\pi}\, e^{-\frac{1}{2}(x^2 + y^2)}$$
Again, the joint distribution function can be given analogously to the discrete case. The summation
of the mass function is simply replaced by the integration of the density function:
F (x , y ) = P(X ≤ x ∩ Y ≤ y )
indicates the probability with which the random variable X takes on values less than or equal to x
and at the same time the random variable Y takes on values less than or equal to y . F is obtained
by integrating the joint density function:
$$F(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f(u, v)\,dv\,du$$
Marginal distributions
Discrete random variables
Definition:
The distribution of a single component of a multidimensional random variable without
regard to the other components is called marginal distribution.
For discrete random variables, the marginal distributions are calculated as the column
and row sums:
$$f_X(x_i) := P(X = x_i) = \sum_j f(x_i, y_j) = p_{i\bullet}$$
$$f_Y(y_j) := P(Y = y_j) = \sum_i f(x_i, y_j) = p_{\bullet j}$$
                         X = number of heads
                         0     1     2     3     4    | marginal distribution of Y
Y = number of       0    1/16  0     0     0     1/16 | 1/8
changes             1    0     1/8   1/8   1/8   0    | 3/8
                    2    0     1/8   1/8   1/8   0    | 3/8
                    3    0     0     1/8   0     0    | 1/8
Σ                        1/16  1/4   3/8   1/4   1/16 | 1
                         (marginal distribution of X)

xi       0     1     2     3     4           yi       0     1     2     3
fX(xi)   1/16  1/4   3/8   1/4   1/16        fY(yi)   1/8   3/8   3/8   1/8
$$\mu_X = E(X) = \sum_i x_i f_X(x_i)$$
$$\sigma_X^2 = V(X) = \sum_i (x_i - \mu_X)^2 f_X(x_i)$$
Continuous random variables
For continuous random variables, the marginal densities are calculated analogously by integrating out the other component:
$$f_X(x) := \int_{-\infty}^{\infty} f(x, y)\,dy\,, \qquad f_Y(y) := \int_{-\infty}^{\infty} f(x, y)\,dx$$
Expected values and variances of the components:
For multidimensional continuous random variables with a joint distribution, one calculates the ex-
pected value and variance and further moments of the individual components using the correspond-
ing marginal distributions via integration:
$$\mu_X = E(X) = \int_{-\infty}^{\infty} x f_X(x)\,dx$$
$$\sigma_X^2 = V(X) = \int_{-\infty}^{\infty} (x - \mu_X)^2 f_X(x)\,dx$$
Conditional distributions
Discrete random variables
The conditional distributions provide information about the distribution of one variable
under the constraint that the other takes a certain value.
Definition:
For discrete random variables (X , Y ) we define the mass function f1 of the conditional
distribution of X under the condition Y = yj as
$$f_1(x | y_j) := \frac{f(x, y_j)}{f_Y(y_j)} \quad \text{for } j = 1, \ldots, m$$
and, analogously, the conditional distribution of Y under the condition X = xi as
$$f_2(y | x_i) := \frac{f(x_i, y)}{f_X(x_i)} \quad \text{for } i = 1, \ldots, n\,.$$
From the values of the joint distribution from the previous example (slide 218) one
calculates four conditional distributions of X
X         0      1      2      3      4
f1(x|0)   0.5    0      0      0      0.5    | 1
f1(x|1)   0      0.333  0.333  0.333  0      | 1
f1(x|2)   0      0.333  0.333  0.333  0      | 1
f1(x|3)   0      0      1      0      0      | 1

and, analogously, five conditional distributions of Y:

Y         0      1      2      3
f2(y|0)   1      0      0      0      | 1
f2(y|1)   0      0.5    0.5    0      | 1
f2(y|2)   0      0.333  0.333  0.333  | 1
f2(y|3)   0      0.5    0.5    0      | 1
f2(y|4)   1      0      0      0      | 1
Definition:
For continuous random variables, the conditional distributions are defined by the density functions
$$f_1(x | y) := \frac{f(x, y)}{f_Y(y)} \qquad \text{and} \qquad f_2(y | x) := \frac{f(x, y)}{f_X(x)}\,.$$
Stochastic independence
Basic idea:
If the conditional distributions of X for different conditions y1 and y2 are different,
$$f_1(x | y_1) \ne f_1(x | y_2)\,,$$
it means that the distribution of X depends on what value Y takes. In this case, X and Y are said to be stochastically dependent.
In order to assess a joint distribution, it is particularly important to know whether X and Y
are dependent or independent.
Example:
Are DAX 30 returns distributed differently in January than in the rest of the months
(x = DAX return, y1 = January, y2 = February to y12 = December)?
Definition:
The random variables X and Y are called stochastically independent, or independent for short, if the joint mass or density function is the product of the marginals:
$$f(x, y) = f_X(x) \cdot f_Y(y) \quad \text{for all } x, y\,.$$
In the case of independence, all conditional distributions are equal – and equal to the corresponding marginal distribution:
$$f_1(x | y) = \frac{f_X(x) \cdot f_Y(y)}{f_Y(y)} = f_X(x)\,, \qquad f_2(y | x) = \frac{f_Y(y) \cdot f_X(x)}{f_X(x)} = f_Y(y)$$
In the joint distributions from the previous example (slides 213-215) the random variables
are stochastically independent [1] resp. dependent [2]. The conditional distributions are
equal [1] or unequal [2] to the marginal distribution:
[1] Independent case (with replacement):
          Y=1   Y=2   Y=3
f2(y|1)   1/2   1/3   1/6   | 1
f2(y|2)   1/2   1/3   1/6   | 1
f2(y|3)   1/2   1/3   1/6   | 1
fY(y)     1/2   1/3   1/6   | 1

[2] Dependent case (without replacement):
          Y=1   Y=2   Y=3
f2(y|1)   2/5   2/5   1/5   | 1
f2(y|2)   3/5   1/5   1/5   | 1
f2(y|3)   3/5   2/5   0     | 1
fY(y)     1/2   1/3   1/6   | 1
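The marginal and conditional distributions of such a table are easy to compute mechanically; a minimal Python sketch using the without-replacement table from above:

```python
import numpy as np

# joint mass function, rows = X = 1,2,3, columns = Y = 1,2,3
p = np.array([[1/5,  1/5,  1/10],
              [1/5,  1/15, 1/15],
              [1/10, 1/15, 0.0]])

f_X = p.sum(axis=1)          # marginal of X: 1/2, 1/3, 1/6
f_Y = p.sum(axis=0)          # marginal of Y: 1/2, 1/3, 1/6
f2_given_x1 = p[0] / f_X[0]  # f2(y|1) = 2/5, 2/5, 1/5  (not equal to f_Y!)
print(f_X, f_Y, f2_given_x1)
```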
Definition:
Let X and Y be components of a two-dimensional random variable with expected values
µX and µY. The quantity
$$\mathrm{Cov}(X, Y) = \sum_i \sum_j (x_i - \mu_X)(y_j - \mu_Y) f(x_i, y_j)$$
for discrete or
$$\mathrm{Cov}(X, Y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mu_X)(y - \mu_Y) f(x, y)\,dy\,dx$$
for continuous random variables is called the covariance of X and Y.
Using the calculation rules for expected values, the definition of the covariance can be reformulated:
$$\mathrm{Cov}(X, Y) = E(XY) - E(X) \cdot E(Y)$$
Covariance and correlation coefficient
Example: Draw two balls from an urn without replacement (see slides 213–215)
Expected values:
$$E(X) = 1 \cdot \tfrac{1}{2} + 2 \cdot \tfrac{1}{3} + 3 \cdot \tfrac{1}{6} = \tfrac{5}{3} = E(Y)$$
Variances:
$$E(X^2) = 1 \cdot \tfrac{1}{2} + 4 \cdot \tfrac{1}{3} + 9 \cdot \tfrac{1}{6} = \tfrac{10}{3} = E(Y^2)$$
$$V(X) = E(X^2) - E(X)^2 = \tfrac{10}{3} - \tfrac{25}{9} = \tfrac{5}{9} = V(Y)$$
Covariance:
$$E(XY) = \tfrac{1 \cdot 1}{5} + \tfrac{1 \cdot 2}{5} + \tfrac{1 \cdot 3}{10} + \tfrac{2 \cdot 1}{5} + \tfrac{2 \cdot 2}{15} + \tfrac{2 \cdot 3}{15} + \tfrac{3 \cdot 1}{10} + \tfrac{3 \cdot 2}{15} + 3 \cdot 3 \cdot 0 = \tfrac{8}{3}$$
$$\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y) = \tfrac{8}{3} - \tfrac{5}{3} \cdot \tfrac{5}{3} = -\tfrac{1}{9}$$
Example: Number of heads and number of changes (see the table on slide 218)
Variances:
$$E(X^2) = 0 \cdot \tfrac{1}{16} + 1 \cdot \tfrac{1}{4} + 4 \cdot \tfrac{3}{8} + 9 \cdot \tfrac{1}{4} + 16 \cdot \tfrac{1}{16} = 5$$
$$E(Y^2) = 0 \cdot \tfrac{1}{8} + 1 \cdot \tfrac{3}{8} + 4 \cdot \tfrac{3}{8} + 9 \cdot \tfrac{1}{8} = 3$$
$$V(X) = E(X^2) - E(X)^2 = 5 - 4 = 1\,, \qquad V(Y) = E(Y^2) - E(Y)^2 = 3 - 2.25 = 0.75$$
Covariance:
$$E(XY) = \tfrac{1 \cdot 1}{8} + \tfrac{2 \cdot 1}{8} + \tfrac{3 \cdot 1}{8} + \tfrac{1 \cdot 2}{8} + \tfrac{2 \cdot 2}{8} + \tfrac{3 \cdot 2}{8} + \tfrac{2 \cdot 3}{8} = \tfrac{1+2+3+2+4+6+6}{8} = \tfrac{24}{8} = 3$$
$$\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y) = 3 - 2 \cdot 1.5 = 0$$
Definition:
The ratio of the covariance and the standard deviations of X and Y,
$$\rho_{XY} := \frac{\mathrm{Cov}(X, Y)}{\sigma_X \cdot \sigma_Y}\,,$$
is called correlation coefficient of X and Y .
Properties :
The correlation coefficient
1. has the same sign as the covariance,
2. is normalized: −1 ≤ ρXY ≤ 1,
3. and indicates the strength of the linear stochastic relationship, independent of the
magnitudes and variances of the two variables.
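Covariance and correlation of the urn example can be reproduced directly from the joint table; a minimal Python sketch:

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([1, 2, 3])
p = np.array([[1/5,  1/5,  1/10],    # joint table, without replacement
              [1/5,  1/15, 1/15],
              [1/10, 1/15, 0.0]])

fx, fy = p.sum(axis=1), p.sum(axis=0)
mx, my = x @ fx, y @ fy              # E(X) = E(Y) = 5/3
cov = x @ p @ y - mx * my            # E(XY) - E(X)E(Y) = -1/9
vx = x**2 @ fx - mx**2               # 5/9
vy = y**2 @ fy - my**2               # 5/9
print(cov, cov / np.sqrt(vx * vy))   # -0.111..., rho = -0.2
```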
Let X and Y be random variables and their joint distribution f (x , y ) be known. We define
a new random variable X + Y .
Question: What does the distribution fX +Y look like?
First: expected value and variance. The expected value is additive, E(X + Y) = E(X) + E(Y) = µX + µY. For the variance,
$$V(X + Y) = E\left[\left((X + Y) - (\mu_X + \mu_Y)\right)^2\right] = E\left[\left((X - \mu_X) + (Y - \mu_Y)\right)^2\right]$$
$$= E\left[(X - \mu_X)^2 + (Y - \mu_Y)^2 + 2(X - \mu_X)(Y - \mu_Y)\right]$$
$$= E\left[(X - \mu_X)^2\right] + E\left[(Y - \mu_Y)^2\right] + 2\,E\left[(X - \mu_X)(Y - \mu_Y)\right]$$
$$= V(X) + V(Y) + 2\,\mathrm{Cov}(X, Y)$$
Sum of random variables
Properties :
Estimate for the variance if one does not know the covariance:
|σX − σY | ≤ σX +Y ≤ σX + σY
We denote with
$$\bar{X} := \frac{1}{n}(X_1 + X_2 + \cdots + X_n)$$
the random variable „arithmetic mean“ and calculate
$$E(\bar{X}) = \frac{1}{n}\,E(X_1 + X_2 + \cdots + X_n) = \frac{1}{n}\left[E(X_1) + \cdots + E(X_n)\right] = \frac{1}{n}(\mu + \cdots + \mu) = \frac{1}{n}(n\mu) = \mu$$
and, if the Xi are independent,
$$V(\bar{X}) = \frac{1}{n^2}\,V(X_1 + X_2 + \cdots + X_n) = \frac{1}{n^2}\left[V(X_1) + \cdots + V(X_n)\right] = \frac{1}{n^2}(n\sigma^2) = \frac{\sigma^2}{n}\,.$$
Proposition:
If n random variables Xi have the expected value E(Xi ) = µ, then their arithmetic mean
has the same expected value
E(X̄ ) = µ .
The √n-law:
If n independent random variables have the same standard deviation σ, then the standard deviation of their arithmetic mean,
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\,,$$
is smaller by a factor of √n.
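A quick simulation of the √n-law, as a sketch (the normal distribution here is only an assumed example input):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, n = 2.0, 100

# 10_000 independent samples of size n, each reduced to its mean
means = rng.normal(loc=5.0, scale=sigma, size=(10_000, n)).mean(axis=1)
print(means.std(), sigma / np.sqrt(n))   # both approximately 0.2
```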
Definition:
A discrete random variable X with mass function
$$f_X(x) = f_U(x; m) = \begin{cases} \frac{1}{m} & \text{for } x = x_1, x_2, \ldots, x_m \\ 0 & \text{otherwise} \end{cases}$$
is called discretely uniformly distributed. For the standard case xi = i, i = 1, …, m:
$$E(X) = \frac{m+1}{2}\,, \qquad V(X) = \frac{m^2 - 1}{12}\,.$$
[Sketch: mass function and distribution function]
Definition:
A continuous random variable X with density function
$$f_X(x) = f_U(x; a, b) = \begin{cases} \frac{1}{b-a} & \text{for } a \le x \le b \\ 0 & \text{otherwise} \end{cases}$$
is called uniformly distributed in the interval [a, b] or, for short, U(a, b)-distributed.
Properties:
$$E(X) = \frac{a+b}{2}\,, \qquad V(X) = \frac{(b-a)^2}{12}\,.$$
Density function of the random variable X = „waiting time for the subway“ running every 10 minutes, with a = 0 and b = 10:
$$f_X(x) = f_U(x; 0, 10) = \begin{cases} \frac{1}{10} & \text{for } 0 \le x \le 10 \\ 0 & \text{otherwise} \end{cases}$$
with distribution function
$$F_X(x) = F_U(x; 0, 10) = \begin{cases} 0 & \text{for } x < 0 \\ \frac{x}{10} & \text{for } 0 \le x < 10 \\ 1 & \text{for } 10 \le x \end{cases}$$
Definition:
A discrete random variable X with the mass function
$$f_X(x) = f_{Be}(x; p) = \begin{cases} 1 - p & \text{for } x = 0 \\ p & \text{for } x = 1 \\ 0 & \text{otherwise} \end{cases}$$
is called BERNOULLI-distributed. With q := 1 − p,
$$E(X) = p\,, \qquad V(X) = p \cdot (1 - p) = p \cdot q\,.$$
[Sketch: mass function and distribution function]
Example:
In order to start in ludo (Mensch ärgere dich nicht), you must
roll at least one six on three rolls (=„success“).
Definition:
A discrete random variable X with the mass function
$$f_X(x) = f_B(x; n, p) = \binom{n}{x}\, p^x (1-p)^{n-x}\,, \quad x = 0, 1, \ldots, n\,,$$
is binomially distributed, with
$$E(X) = n \cdot p\,, \qquad V(X) = n \cdot p \cdot (1 - p)\,.$$
All binomial distributions with p = 1/2 are symmetric. Probabilities p < 1/2 result in right-skewed distributions, p > 1/2 in left-skewed distributions.
Binomial distribution
Example: Let there be 20 balls in an urn, four of which are red. Let X be the number of red balls if we draw three balls from it with replacement. We calculate with p = 4/20 = 0.2 and n = 3:
$$P(X = 0) = f_B(0; 3, 0.2) = \binom{3}{0} \cdot 0.2^0 \cdot 0.8^3 = 1 \cdot 1 \cdot 0.512 = 0.512$$
$$P(X = 1) = f_B(1; 3, 0.2) = \binom{3}{1} \cdot 0.2^1 \cdot 0.8^2 = 3 \cdot 0.2 \cdot 0.64 = 0.384$$
$$P(X = 2) = f_B(2; 3, 0.2) = \binom{3}{2} \cdot 0.2^2 \cdot 0.8^1 = 3 \cdot 0.04 \cdot 0.8 = 0.096$$
$$P(X = 3) = f_B(3; 3, 0.2) = \binom{3}{3} \cdot 0.2^3 \cdot 0.8^0 = 1 \cdot 0.008 \cdot 1 = 0.008$$
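The same values can be reproduced with a few lines of Python, a sketch using only the standard library:

```python
from math import comb

def binom_pmf(x, n, p):
    # mass function f_B(x; n, p) = C(n, x) * p^x * (1-p)^(n-x)
    return comb(n, x) * p**x * (1 - p) ** (n - x)

for x in range(4):
    print(x, binom_pmf(x, 3, 0.2))   # 0.512, 0.384, 0.096, 0.008
```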
Example:
It is µ = 12 · 0.3 = 3.6 with a standard deviation of σX = √(12 · 0.3 · 0.7) ≈ 1.5875. We also compute the probability P(X > 6) that left voters are in the majority in the sample, using the binomial distribution table.
Definition:
A continuous random variable Z with the density function
$$f_{St}(z) := \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}$$
for −∞ < z < ∞ is called standard-normally distributed or, for short, N(0, 1)-distributed.
Note:
There are many normal distributions, but this one is the standard because it has an expected value of 0 and a standard deviation of 1:
$$E(Z) = 0\,, \qquad V(Z) = 1\,.$$
[Sketch of the density: maximum at z = 0, points of inflection at z = −1 and z = 1, quickly approaches the x-axis asymptotically]
While one can easily calculate the values of the density function with a pocket calculator, the integral of the distribution function
$$F_{St}(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-\frac{1}{2}u^2}\,du$$
is not elementary. Therefore there are tables for it – already since LAPLACE.
Normal distribution
Normal distribution
distribution function (tabulated):
$$P(Z \le z) = \int_{-\infty}^{z} f_{St}(u)\,du = F_{St}(z)$$
$$P(Z \ge z) = \int_{z}^{\infty} f_{St}(u)\,du = 1 - F_{St}(z)$$
interval:
$$P(a < Z \le b) = \int_a^b f_{St}(u)\,du = F_{St}(b) - F_{St}(a)$$
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 .5000 .5040 .5080 .5120 .5160 .5199 .5239 .5279 .5319 .5359
0.1 .5398 .5438 .5478 .5517 .5557 .5596 .5636 .5675 .5714 .5753
0.2 .5793 .5832 .5871 .5910 .5948 .5987 .6026 .6064 .6103 .6141
0.3 .6179 .6217 .6255 .6293 .6331 .6368 .6406 .6443 .6480 .6517
0.4 .6554 .6591 .6628 .6664 .6700 .6736 .6772 .6808 .6844 .6879
0.5 .6915 .6950 .6985 .7019 .7054 .7088 .7123 .7157 .7190 .7224
0.6 .7257 .7291 .7324 .7357 .7389 .7422 .7454 .7486 .7517 .7549
0.7 .7580 .7611 .7642 .7673 .7704 .7734 .7764 .7794 .7823 .7852
0.8 .7881 .7910 .7939 .7967 .7995 .8023 .8051 .8078 .8106 .8133
0.9 .8159 .8186 .8212 .8238 .8264 .8289 .8315 .8340 .8365 .8389
1.0 .8413 .8438 .8461 .8485 .8508 .8531 .8554 .8577 .8599 .8621
1.1 .8643 .8665 .8686 .8708 .8729 .8749 .8770 .8790 .8810 .8830
1.2 .8849 .8869 .8888 .8907 .8925 .8944 .8962 .8980 .8997 .9015
1.3 .9032 .9049 .9066 .9082 .9099 .9115 .9131 .9147 .9162 .9177
1.4 .9192 .9207 .9222 .9236 .9251 .9265 .9279 .9292 .9306 .9319
1.5 .9332 .9345 .9357 .9370 .9382 .9394 .9406 .9418 .9429 .9441
1.6 .9452 .9463 .9474 .9484 .9495 .9505 .9515 .9525 .9535 .9545
1.7 .9554 .9564 .9573 .9582 .9591 .9599 .9608 .9616 .9625 .9633
1.8 .9641 .9649 .9656 .9664 .9671 .9678 .9686 .9693 .9699 .9706
1.9 .9713 .9719 .9726 .9732 .9738 .9744 .9750 .9756 .9761 .9767
2.0 .9772 .9778 .9783 .9788 .9793 .9798 .9803 .9808 .9812 .9817
2.1 .9821 .9826 .9830 .9834 .9838 .9842 .9846 .9850 .9854 .9857
2.2 .9861 .9864 .9868 .9871 .9875 .9878 .9881 .9884 .9887 .9890
2.3 .9893 .9896 .9898 .9901 .9904 .9906 .9909 .9911 .9913 .9916
2.4 .9918 .9920 .9922 .9925 .9927 .9929 .9931 .9932 .9934 .9936
2.5 .9938 .9940 .9941 .9943 .9945 .9946 .9948 .9949 .9951 .9952
2.6 .9953 .9955 .9956 .9957 .9959 .9960 .9961 .9962 .9963 .9964
2.7 .9965 .9966 .9967 .9968 .9969 .9970 .9971 .9972 .9973 .9974
2.8 .9974 .9975 .9976 .9977 .9977 .9978 .9979 .9979 .9980 .9981
2.9 .9981 .9982 .9982 .9983 .9984 .9984 .9985 .9985 .9986 .9986
3.0 .9987 .9987 .9987 .9988 .9988 .9989 .9989 .9989 .9990 .9990
Symmetric intervals:
$$P(-z < Z \le z) = \int_{-z}^{z} f_{St}(u)\,du =: D(z)$$
Definition:
A continuous random variable X with the density function
$$f_N(x) = f_N(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}}$$
for −∞ < x < ∞ is called normally distributed with the parameters µ and σ², or N(µ, σ²)-distributed for short. It holds that
$$E(X) = \mu\,, \qquad V(X) = \sigma^2\,.$$
[Figures: normal densities for µ = 0, σ = 1; µ = 1, σ = 1; µ = 0, σ = 2; µ = 1, σ = 2; µ = 0, σ = 0.6; µ = −2, σ = 0.6]
Two-parameter family: the expected value and the variance serve as parameters of fN(x; µ, σ²).
Properties:
1. symmetrical around x = µ
2. points of inflection at x = µ − σ and x = µ + σ
3. the density function is flatter the larger the dispersion; every normal density is a rescaled standard normal density:
$$f_N(x; \mu, \sigma) = \frac{1}{\sigma}\, f_{St}\!\left(\frac{x - \mu}{\sigma}\right)$$
Example:
Let the random variable X (e.g. stock return) be normally distributed with E(X ) = 8 %
and V(X ) = 625 %2 .
Wanted: P(0 % < X ≤ 20 %)
Solution:
First standardize:
$$\frac{0\,\% - 8\,\%}{25\,\%} < \frac{X - 8\,\%}{25\,\%} \le \frac{20\,\% - 8\,\%}{25\,\%} \quad\Longleftrightarrow\quad -0.32 < Z \le 0.48$$
From the table: P(0 % < X ≤ 20 %) = FSt(0.48) − FSt(−0.32) = 0.6844 − (1 − 0.6255) = 0.3099.
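The same probability in Python, as a sketch assuming scipy (norm.cdf plays the role of the table of FSt):

```python
from scipy.stats import norm

mu, sigma = 8.0, 25.0                                  # in %
p = norm.cdf(20, mu, sigma) - norm.cdf(0, mu, sigma)   # direct
p_std = norm.cdf(0.48) - norm.cdf(-0.32)               # via standardization
print(p, p_std)                                        # both ~0.3099
```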
1. How is the binomial distribution defined? Does the B ERNOULLI distribution belong to
the family of binomial distributions?
2. Why is the normal distribution considered the most important distribution in statistics?
3. Why do you need only the values of the standard normal distribution when calculating
with normal distributions?
sum: $S_n := X_1 + X_2 + \cdots + X_n$
arithm. mean: $\bar{X}_n := \frac{1}{n}(X_1 + X_2 + \cdots + X_n)$
Then
$$E(\bar{X}_n) = \mu\,, \qquad V(\bar{X}_n) = \frac{\sigma^2}{n}\,.$$
Law of large numbers:
$$P(|\bar{X}_n - \mu| \ge \varepsilon) \to 0 \quad \text{for } n \to \infty$$
other notations:
$$P(|\bar{X}_n - \mu| < \varepsilon) \to 1\,, \qquad \operatorname{plim}_{n \to \infty} \bar{X}_n = \mu$$
The statement follows from CHEBYSHEV's inequality,
$$P(|\bar{X}_n - \mu| \ge k\sigma_{\bar{X}}) \le \frac{1}{k^2}\,,$$
where the standard deviation is $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$. With the substitution $k \cdot \sigma_{\bar{X}} = \varepsilon$, hence $k^2 = \varepsilon^2 \cdot \frac{n}{\sigma^2}$, it follows that
$$P(|\bar{X}_n - \mu| \ge \varepsilon) \le \frac{\sigma^2}{\varepsilon^2 \cdot n}\,.$$
1. Statistical probability
Determination of probabilities by experimental means:
hn (observed rel. frequency) is a good approximation or useful estimate for p if n is sufficiently
large.
2. Sampling method
For qualitative characteristics:
p = proportion of statistical units in the population for which the characteristic has a specific
value or property.
hn = relative frequency in random sample. It will be closer and closer to the value p as the
sample size increases.
The law of large numbers
Example: 9114 historical lottery numbers show the law of large numbers quite illustratively (discrete uniform distribution with m = 49).
theoretically:
$$\mu = \frac{49 + 1}{2} = 25\,, \qquad \sigma_X = \sqrt{\frac{49^2 - 1}{12}} = \sqrt{200} = 14.1421$$
empirically (n = 9114):
$$\bar{x} = \frac{1}{9114} \sum n_j x_j = 25.2211\,, \qquad s_X = \sqrt{200.6512} = 14.1651$$
Law of large numbers: The mean value of a sample converges stochastically towards the expected value.
Question: Can the probability distribution F (x ) also be determined experimentally?
For this purpose, one calculates the empirical distribution function from the n sample or
measured values:
x1 , x2 , . . . , xn ⇒ Hn (x )
Idea: Hn (x ) → F (x ) if n → ∞?
Example: Random numbers were drawn on a PC, uniformly distributed over the
interval [0, 10]:
Problem: In many applications, however, it is not sufficient to know only the two moments E and V.
Question: What is the distribution function of the random variable X̄n?
Consider the observed sum and the corresponding random variable,
$$s_n := x_1 + x_2 + \cdots + x_n\,, \qquad S_n := X_1 + X_2 + \cdots + X_n\,.$$
Here,
$$E(S_n) = \mu_{S_n} = n \cdot \mu \qquad \text{and} \qquad V(S_n) = \sigma_{S_n}^2 = n \cdot \sigma^2\,.$$
Standardization yields
$$Z_n := \frac{S_n - \mu_{S_n}}{\sigma_{S_n}} = \frac{S_n - n\mu}{\sigma\sqrt{n}}\,,$$
which is equivalent to
$$Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}} = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}\,.$$
The central limit theorem: the distribution of Zn tends to the standard normal distribution as n increases.
This is why the normal distribution and the CLT are so important:
Decisive advantage: The CLT does not impose any requirement on the initial
distribution. Whatever the identical and independent distribution of Xi may be, the
distribution function of the sum or the arithmetic mean always converges to the
normal distribution.
It is to this circumstance that the normal distribution owes its universal theoretical
and practical importance.
Empirical Distributions: The CLT also explains why so many empirical distributions
are close to the normal distribution and can be approximated by it quite well.
The binomial distribution converges to the normal distribution for n → ∞. In particular, the distribution function of the standardized variable
$$Z_n := \frac{B_n - np}{\sqrt{npq}} \equiv \frac{H_n - p}{\sqrt{pq/n}}$$
converges to the standard normal distribution function (limit theorem of DE MOIVRE and LAPLACE).
Convergence in distribution:
[Figure: approximation of the distribution functions of the binomial distribution for p = 1/4 and n = 1, 2, 3 and 6 toward the normal distribution function]
Properties:
If n is sufficiently large, the distribution of a sum or arithmetic mean can be approximated
by the normal distribution.
$$P(Y \le y) \approx F_{St}\!\left(\frac{y - \mu_Y}{\sigma_Y}\right)$$
Example:
The average processing time of a BAföG application by a clerk at the
Students’ Union is µ = 35 min with a standard deviation of
σ = 18 min.
Question: What is the probability that the clerk will complete more than 15 applications
in an 8-hour workday, i.e.
P(S16 ≤ 480) = ?
Solution:
According to the CLT, the sum S16 is approximately normally distributed with E(S16) = 35 · 16 = 560 and the standard deviation σ_{S16} = 18 · √16 = 72. We calculate:
$$P(S_{16} \le 480) \approx F_N(480;\, 560,\, 72^2) = F_{St}\!\left(\frac{480 - 560}{72}\right) = F_{St}(-1.1111) = 0.1331$$
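A sketch of the same calculation, together with a simulation under an assumed concrete processing-time distribution (the gamma distribution below is only an illustrative choice with mean 35 and standard deviation 18; the CLT result does not depend on it):

```python
import numpy as np
from scipy.stats import norm

mu, sigma, n = 35.0, 18.0, 16

# CLT approximation
print(norm.cdf((480 - n * mu) / (sigma * np.sqrt(n))))   # 0.1331

# simulation with gamma-distributed processing times (assumption)
rng = np.random.default_rng(3)
shape = (mu / sigma) ** 2                 # gamma shape for mean 35, sd 18
times = rng.gamma(shape, mu / shape, size=(100_000, n)).sum(axis=1)
print(np.mean(times <= 480))              # close to the CLT value
```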
The continuity correction Sk is always half the step size of the random variable Y. For example, if Y can take only integers or only natural numbers, then Sk = 1/2.
Question: How large should n be in order to use the normal distribution as an approximation?
There is no generally valid answer here. A rule of thumb says that for n > 30 the approximation is
generally quite good. For binomially distributed random variables it is often required that npq > 9
holds.
11 Point estimators for parameters of a population
Motivation:
Complete information about the distribution of characteristics in a population can only
be obtained by a census (full sample). In most cases, censuses are uneconomical,
often even impossible.
The representative sample: It is made sure that the sample has the same or similar
structure as the basic population with regard to other characteristics.
Pure random sample: Each element ωi of the basic population has an equal chance of
entering the sample.
Question: Is a random survey of n = 100 people between 2pm and 3pm on the Zeil in
front of Karstadt representative?
Definition:
Urn model to describe the pure random sampling:
The urn contains N numbered balls (= number of statistical units in the basic population).
The number on the ball is assigned to exactly one statistical unit.
A sample of size n = 10 was drawn from the basic population of students in a lecture.
The body height X in cm was determined and recorded in the following table:
i 1 2 3 4 5 6 7 8 9 10
xi 176 180 181 168 177 186 184 173 182 177
This data set has the mean value 178.4 cm. The point estimate for the height of students
in the lecture hall is simply:
µ̂ = x̄ = 178.4 cm
The value µ̂ is an estimate for the unknown mean µ. Most of the time the estimated value will not exactly match the true mean; it is very rarely true that µ̂ = µ.
Every single observed characteristic value xi is a realization of a random variable Xi . For each of
these random variables Xi the probability distribution is given by the frequency distribution of the
basic population. Thus for each random variable Xi the following applies
E(Xi ) = µ, V(Xi ) = σ 2 .
Thus, one can consider the observed sample values and their mean as realizations of an n-dimensional
random variable (X1 , X2 , . . . , Xn ). If the random sample is carried out with replacement, all Xi are
independent and identically distributed.
As stated above, the estimated value will rather rarely meet the true mean value, it is possible or likely
that an estimation error
e := µ − µ̂
will occur. However, the crucial question is whether the estimated value µ̂ hits the true value at
least on average (that is, if we determine many realizations of µ̂). For this purpose we calculate the
expected value
$$E(\hat{\mu}) = E(\bar{X}_n) = E\!\left(\frac{1}{n}\sum X_j\right) = \frac{1}{n}\sum E(X_j) = \frac{1}{n}\, n\mu\,,$$
hence it holds that
$$E(\hat{\mu}) = \mu\,.$$
This property of µ̂ is called unbiasedness. For an unbiased estimator, the estimation error vanishes
on average, i.e. E(e) = 0.
If an estimator is not unbiased, we call it biased, the expected value of the estimation error
bias := E(e)
is called bias.
Unbiasedness is not the only important property. If we calculate the variance of our estimator µ̂, we get
$$V(\hat{\mu}) = V(\bar{X}_n) = V\!\left(\frac{1}{n}\sum X_j\right) = \frac{1}{n^2}\sum V(X_j) = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n}\,.$$
We notice that with increasing sample size n the variance of the sample mean becomes smaller and smaller, hence
$$\lim_{n \to \infty} V(\hat{\mu}) = 0\,.$$
According to the law of large numbers
plim µ̂ = µ.
This property is called consistency. It means that the larger the sample size, the more accurate the
estimate.
An important prerequisite for the calculation of the variance and also for the application of the law of
large numbers is the independence of the variables Xi , which, however, is given in every case for
samples with replacement. However, this is also approximately true for samples without replacement,
if the basic population is very large in relation to the sample size.
Point estimator for the variance
$$s^2 := \frac{1}{n}\sum (x_j - \bar{x})^2$$
Using the observed values from the example before, we can also make an estimate for the variance of the body height of the students in a lecture. To do this, we first calculate the variance of the sample values:
$$s_X^2 = \frac{1}{10}\left(176^2 + 180^2 + \cdots + 177^2\right) - 178.4^2 = 31\,852.4 - 31\,826.56 = 25.84$$
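The point estimates of this example in Python, as a sketch (the ddof argument controls the divisor n versus n − 1):

```python
import numpy as np

x = np.array([176, 180, 181, 168, 177, 186, 184, 173, 182, 177])

mu_hat = x.mean()              # 178.4
s2 = x.var(ddof=0)             # sample variance s^2 = 25.84 (divisor n)
sigma2_hat = x.var(ddof=1)     # unbiased n/(n-1) * s^2 = 28.71 (divisor n-1)
print(mu_hat, s2, sigma2_hat)
```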
An estimator
$$\hat{\theta} = \hat{\theta}(X_1, \ldots, X_n)\,,$$
which depends on the random variables X1, X2, ..., Xn, is again a random variable and accordingly also has a probability distribution. Certain stochastic properties of the estimator follow from this.
Unbiasedness: An estimator θ̂ is unbiased if its expected value is equal to the true parameter
E(θ̂) = θ .
If an estimator is biased, it would be good if the bias became smaller with increasing sample size and
would disappear for n → ∞.
Examples
1. The estimator µ̂ = x̄ is unbiased.
2. The estimator σ̂ 2 = s2 is not unbiased but asymptotically unbiased.
Consistency: An (asymptotically) unbiased estimator θ̂ is consistent if its variance vanishes with growing sample size:
$$\lim_{n \to \infty} V(\hat{\theta}) = 0\,.$$
An estimated value usually does not agree with the true value of the parameter. However, it would be
good if it is close to the parameter, or at least has a good chance of being close. Thus, the estimation
error |θ̂ − θ| should be as small as possible and become smaller and smaller, especially for larger
sample sizes. The property of consistency means that the probability of an estimation error ε > 0,
however small, tends to zero as n increases.
Efficiency: We call an (unbiased) estimator θ̂1 more efficient than another (unbiased) estimator
θ̂2 if it has a smaller variance,
V(θ̂1 ) < V(θ̂2 ).
Thus, the most efficient or best unbiased estimator θ* would be the one among all unbiased estimators that has the smallest variance, that is, V(θ*) ≤ V(θ̂) for every unbiased estimator θ̂.
Mean squared error (MSE): The mean squared error of an estimator is the expected value of its squared deviation from the true parameter value, i.e.
$$\mathrm{MSE}(\hat{\theta}) := E\left[(\hat{\theta} - \theta)^2\right].$$
The MSE accounts for both the variance and the bias:
$$\mathrm{MSE}(\hat{\theta}) = V(\hat{\theta}) + \mathrm{bias}^2\,.$$
It may be advantageous to give preference to a slightly biased estimator, provided that this achieves an effective reduction in variance, which is often the case.
Example
Consider the distributions of two alternative estimators for a parameter θ. One is unbiased, the other has a small bias but a much smaller variance.
[Figure: two sampling densities f(θ̂) around θ; the biased one is narrower and offset from θ by the bias]
Motivation:
Measures of samples, such as mean, variance, and others, are realizations of random
variables. Their probability distributions are called sampling distributions. In
particular, we are interested in:
Note:
Sampling distributions follow the normal distribution in many cases, because
1. characteristics are often a priori approximately normally distributed,
2. for larger samples, the Central Limit Theorem (CLT) applies.
2. The symmetrical intervals are also useful:
$$P(-z < Z \le z) = \int_{-z}^{z} f_{St}(u)\,du = D(z)$$
This is true for any distribution of the characteristic X as long as the individual sample
elements are drawn independently.
By standardizing the random variable X̄ we obtain:
4. $\dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ is approximately standard normally distributed.
From 4. and with $\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}}$ it follows directly that
$$P\!\left(-z < \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} \le z\right) \approx F_{St}(z) - F_{St}(-z) = D(z)\,.$$
By transforming the inequality inside the probability function we get
Question 1 (Given interval): With what probability will the sample mean fall into the
interval 182 cm < X̄ ≤ 184 cm
The normal distribution can be taken as the sampling distribution, since the initial distribution is already approximately normally distributed.
It holds that
$$E(\bar{X}) = 183\ \text{cm} \qquad \text{and} \qquad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{10\ \text{cm}}{\sqrt{25}} = 2\ \text{cm}\,.$$
So the above interval has a length of ± half a standard deviation and thus the probability:
$$P\!\left(183\ \text{cm} - \tfrac{1}{2} \cdot 2\ \text{cm} < \bar{X} \le 183\ \text{cm} + \tfrac{1}{2} \cdot 2\ \text{cm}\right) \approx D\!\left(\tfrac{1}{2}\right) = 0.3830$$
Sampling distributions
Question 1: Reading off the table of the normal distribution (standard normal table as above):
$$D(0.5) = 2 \cdot F_{St}(0.5) - 1 = 2 \cdot 0.6915 - 1 = 0.3830$$
Question 2 (Given probability): What is the interval in which the sample mean falls with
a high probability of 0.9?
To do this, we need to determine the z value for which D(z) = 0.9. In the table of the standard normal distribution we find z = 1.645, such that
$$0.9 = D(1.645) \approx P(183\ \text{cm} - 1.645 \cdot 2\ \text{cm} < \bar{X} \le 183\ \text{cm} + 1.645 \cdot 2\ \text{cm})$$
$$= P(183\ \text{cm} - 3.29\ \text{cm} < \bar{X} \le 183\ \text{cm} + 3.29\ \text{cm}) = P(179.71\ \text{cm} < \bar{X} \le 186.29\ \text{cm})$$
Definition:
A sample is considered a large sample if the deviation of the actual sampling distribution
from the normal distribution can be neglected.
Prerequisite: CLT (variance σ known) → $\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}}$
$$P\!\left(-z < \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} \le z\right) \approx F_{St}(z) - F_{St}(-z) = D(z)$$
This can be transformed into an approximate probability statement:
Definition:
Inference is the statistical conclusion from the sample to the unknown basic population.
Definition: Confidence interval for µ for large samples with known variance σ²:
$$\mathrm{CI}(\mu, 1 - \alpha) = \left[\bar{x} - z\,\sigma_{\bar{X}},\; \bar{x} + z\,\sigma_{\bar{X}}\right] \quad \text{with } z = z_{[1-\alpha/2]}\,.$$
[Figure: standard normal density with central area D(z) = 1 − α and two tails of α/2 each at ±z]
Mostly, the significance level α is given and then the corresponding z value is determined as a quantile.
Interval estimators for large samples
Definition: Confidence interval for µ for large samples with unknown variance:
$$\mathrm{CI}(\mu, 1 - \alpha) = \left[\bar{x} - z\,\hat{\sigma}_{\bar{X}},\; \bar{x} + z\,\hat{\sigma}_{\bar{X}}\right] \quad \text{with } z = z_{[1-\alpha/2]}\,.$$
1. Distribution in the basic population: In which interval do two thirds of the rents per
square meter paid probably lie?
2. Interval estimation: What is the confidence interval on a confidence level of 0.9 for
the average net rent µ?
1. Distribution in the basic population:
We assume normal distribution. We take the sample mean as point estimate of the actual average rent and
$$\hat{\sigma} = \sqrt{\tfrac{50}{49}} \cdot 2.07\ € = 2.10\ €$$
as the point estimate for the standard deviation. Thus, the variable
$$Z = \frac{X - 8.30}{2.10}$$
would be standard normally distributed. According to the standard normal distribution table, 66.7 % of all observations of the variable Z are in the interval −0.97 < Z ≤ 0.97. We undo the standardization and obtain the interval
$$[8.30\ € - 0.97 \cdot 2.10\ €;\; 8.30\ € + 0.97 \cdot 2.10\ €] = [6.26\ €;\; 10.34\ €]$$
[Figure: 66.7 % of the rents lie between 6.26 € and 10.34 €]
2. Interval estimation: What is the confidence interval on a confidence level of 0.9 for the average net rent µ?
It holds that
$$\hat{\sigma}_{\bar{X}} = \frac{\hat{\sigma}}{\sqrt{n}} = \frac{2.10\ €}{\sqrt{50}} \approx 0.29\ €\,.$$
It is α = 0.1. According to the table, the 0.95 quantile is z[0.95] = 1.645. Thus, the confidence interval is
$$\mathrm{CI}(\mu, 0.9) = [8.30\ € - 1.645 \cdot 0.29\ €;\; 8.30\ € + 1.645 \cdot 0.29\ €] = [7.82\ €;\; 8.78\ €]\,.$$
Thus, the unknown mean of the basic population lies in the interval [7.82 €; 8.78 €] with a probability of 90 %.
[Figure: interval 7.82 € – 8.78 € around x̄ = 8.30 €, covering probability 0.9]
Here, as always, $\hat{\sigma}_{\bar{X}} = \dfrac{\hat{\sigma}}{\sqrt{n}}$.
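The rent confidence interval as a Python sketch, assuming scipy (small differences to the slide values come from rounding the standard error):

```python
import numpy as np
from scipy.stats import norm

n, x_bar, sigma_hat, alpha = 50, 8.30, 2.10, 0.10

z = norm.ppf(1 - alpha / 2)             # 1.645
se = sigma_hat / np.sqrt(n)             # ~0.297 (slides round to 0.29)
print(x_bar - z * se, x_bar + z * se)   # ~[7.81, 8.79], slides: [7.82, 8.78]
```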
The chi-squared distribution
Definition:
The random variable
$$\chi_n^2 := Z_1^2 + Z_2^2 + \cdots + Z_n^2\,,$$
where Z1, ..., Zn are independent standard normally distributed random variables, is called chi-square distributed with n degrees of freedom.
Thus, the chi-square distributions form a whole family of distributions. They are
continuous distributions and have positive probability densities in the interval (0, ∞).
The chi-square distributions are suitable as test distributions for many typical test
situations and thus have multiple applications in practice.
[Figure: densities f(χ²) of the chi-square distribution for n = 1, 3, 5, 10, 15]
Properties:
$$E(\chi_n^2) = n\,, \qquad V(\chi_n^2) = 2n$$
Definition:
The random variable
$$T_n := \frac{Z}{\sqrt{\frac{1}{n}\,\chi_n^2}}\,,$$
where Z is standard normally distributed and independent of χ²n, is called t-distributed with n degrees of freedom.
The t-distributions are similar to the normal distribution, but slightly wider. For an increasing number of degrees of freedom they tend towards the standard normal distribution.
[Figure: densities of the t-distribution for n = 1, 3, 10 together with the standard normal density]
Properties:
$$E(T_n) = 0\,, \qquad V(T_n) = \frac{n}{n-2} > 1 \quad (n > 2)$$
The actual sampling distribution would have to be used. However, this depends on
the distribution of the characteristic in the basic population and therefore varies from
case to case and is usually difficult to calculate.
Only in the special situation when the characteristic in the basic population is already normally distributed, or follows the normal distribution quite well, does the construction of confidence intervals become simple again.
Theorem 11:
If the characteristic X is normally distributed in the basic population, then 1. the sample mean X̄ is also normally distributed, and 2. the standardized quantity (X̄ − µ)/σ̂X̄ is t-distributed with n − 1 degrees of freedom.
From this follows immediately $P\!\left(-t < \frac{\bar{X}-\mu}{\hat{\sigma}_{\bar{X}}} \le t\right) = F_{T_{n-1}}(t) - F_{T_{n-1}}(-t)$ and thus the
Definition: Confidence interval for µ for small samples with normally distributed basic population and unknown variance:
$$\mathrm{CI}(\mu, 1 - \alpha) = \left[\bar{x} - t\,\hat{\sigma}_{\bar{X}},\; \bar{x} + t\,\hat{\sigma}_{\bar{X}}\right] \quad \text{with } t = t_{n-1;[1-\alpha/2]}\,.$$
It is
$$t_{n-1;[1-\alpha/2]} = -t_{n-1;[\alpha/2]}\,.$$
This property, by the way, also holds for the quantiles of the normal distribution z[·], which can be found in the quantile table of the STUDENT-t distribution in the bottom row for n = ∞.
It is important to emphasize here once again that the use of the t-distribution presupposes a normally distributed basic population!
Interval estimators for small samples
Example
$$t_{24;[1-0.025]} = 2.064$$
[Figure: 95 % confidence interval for the mean, with 2.5 % in each tail: 40 084 € to 45 355 € around x̄ = 42 720 €]
STUDENT's t-distribution
In the quantile table of the STUDENT-t distribution, we find the row with 24 degrees of freedom. We are looking for the 0.975 quantile, so we look in the corresponding column. It is t24;[1−0.025] = t24;[0.975] = 2.064.

Degrees of          Quantiles
freedom  t[0.6]  t[0.667] t[0.75] t[0.8]  t[0.875] t[0.9]  t[0.95] t[0.975] t[0.99]  t[0.995] t[0.999]
1        0.325   0.577    1.000   1.376   2.414    3.078   6.314   12.706   31.821   63.657   318.31
2        0.289   0.500    0.816   1.061   1.604    1.886   2.920   4.303    6.965    9.925    22.327
3        0.277   0.476    0.765   0.978   1.423    1.638   2.353   3.182    4.541    5.841    10.215
4        0.271   0.464    0.741   0.941   1.344    1.533   2.132   2.776    3.747    4.604    7.173
5        0.267   0.457    0.727   0.920   1.301    1.476   2.015   2.571    3.365    4.032    5.893
6        0.265   0.453    0.718   0.906   1.273    1.440   1.943   2.447    3.143    3.707    5.208
7        0.263   0.449    0.711   0.896   1.254    1.415   1.895   2.365    2.998    3.499    4.785
8        0.262   0.447    0.706   0.889   1.240    1.397   1.860   2.306    2.896    3.355    4.501
9        0.261   0.445    0.703   0.883   1.230    1.383   1.833   2.262    2.821    3.250    4.297
10       0.260   0.444    0.700   0.879   1.221    1.372   1.812   2.228    2.764    3.169    4.144
11       0.260   0.443    0.697   0.876   1.214    1.363   1.796   2.201    2.718    3.106    4.025
12       0.259   0.442    0.695   0.873   1.209    1.356   1.782   2.179    2.681    3.055    3.930
13       0.259   0.441    0.694   0.870   1.204    1.350   1.771   2.160    2.650    3.012    3.852
14       0.258   0.440    0.692   0.868   1.200    1.345   1.761   2.145    2.624    2.977    3.787
15       0.258   0.439    0.691   0.866   1.197    1.341   1.753   2.131    2.602    2.947    3.733
16       0.258   0.439    0.690   0.865   1.194    1.337   1.746   2.120    2.583    2.921    3.686
17       0.257   0.438    0.689   0.863   1.191    1.333   1.740   2.110    2.567    2.898    3.646
18       0.257   0.438    0.688   0.862   1.189    1.330   1.734   2.101    2.552    2.878    3.610
19       0.257   0.438    0.688   0.861   1.187    1.328   1.729   2.093    2.539    2.861    3.579
20       0.257   0.437    0.687   0.860   1.185    1.325   1.725   2.086    2.528    2.845    3.552
21       0.257   0.437    0.686   0.859   1.183    1.323   1.721   2.080    2.518    2.831    3.527
22       0.256   0.437    0.686   0.858   1.182    1.321   1.717   2.074    2.508    2.819    3.505
23       0.256   0.436    0.685   0.858   1.180    1.319   1.714   2.069    2.500    2.807    3.485
24       0.256   0.436    0.685   0.857   1.179    1.318   1.711   2.064    2.492    2.797    3.467
25       0.256   0.436    0.684   0.856   1.178    1.316   1.708   2.060    2.485    2.787    3.450
26       0.256   0.436    0.684   0.856   1.177    1.315   1.706   2.056    2.479    2.779    3.435
27       0.256   0.435    0.684   0.855   1.176    1.314   1.703   2.052    2.473    2.771    3.421
28       0.256   0.435    0.683   0.855   1.175    1.313   1.701   2.048    2.467    2.763    3.408
29       0.256   0.435    0.683   0.854   1.174    1.311   1.699   2.045    2.462    2.756    3.396
30       0.256   0.435    0.683   0.854   1.173    1.310   1.697   2.042    2.457    2.750    3.385
35       0.255   0.434    0.682   0.852   1.170    1.306   1.690   2.030    2.438    2.724    3.340
40       0.255   0.434    0.681   0.851   1.167    1.303   1.684   2.021    2.423    2.704    3.307
45       0.255   0.434    0.680   0.850   1.165    1.301   1.679   2.014    2.412    2.690    3.281
50       0.255   0.433    0.679   0.849   1.164    1.299   1.676   2.009    2.403    2.678    3.261
55       0.255   0.433    0.679   0.848   1.163    1.297   1.673   2.004    2.396    2.668    3.245
60       0.254   0.433    0.679   0.848   1.162    1.296   1.671   2.000    2.390    2.660    3.232
∞        0.253   0.431    0.674   0.842   1.150    1.282   1.645   1.960    2.326    2.576    3.090
The empirical variance S 2 of a sample is also a random variable. Its distribution can be
calculated for the case that the characteristic is approximately normally distributed in the
basic population with the mean µ and the standard deviation σ – and the individual
samples are drawn independently (i.e. with replacement).
Theorem 12:
The quantity
$$n\,\frac{S^2}{\sigma^2} = \chi_{n-1}^2$$
is chi-square distributed with n − 1 degrees of freedom.
It follows that
$$P\!\left(\chi_{n-1;[\alpha/2]}^2 < \frac{n S^2}{\sigma^2} \le \chi_{n-1;[1-\alpha/2]}^2\right) = 1 - \alpha\,.$$
Definition: Confidence interval for σ² for small samples with normally distributed basic population:
$$\mathrm{CI}(\sigma^2, 1 - \alpha) = \left[\frac{n \cdot s^2}{\chi_{\text{upper}}^2},\; \frac{n \cdot s^2}{\chi_{\text{lower}}^2}\right]$$
with the quantiles $\chi_{\text{lower}}^2 = \chi_{n-1;[\alpha/2]}^2$ and $\chi_{\text{upper}}^2 = \chi_{n-1;[1-\alpha/2]}^2$.
[Figure: chi-square density with central area 1 − α between χ²lower and χ²upper and tails of α/2 each]
Task:
From a sample of size n = 30 from a normally distributed basic population, the empirical
variance s2 = 225 is obtained. For the variance σ 2 of the basic population a point
estimator as well as a confidence interval at a confidence level of 0.95 shall be given.
Solution:
1. Point estimate for the variance of the basic population:
$$\hat{\sigma}^2 = \frac{30}{29} \cdot 225 = 232.76$$
2. From the table of the chi-square distribution with 29 degrees of freedom we find the two values χ²29;[0.025] = 16.047 and χ²29;[0.975] = 45.722.
3. Confidence interval:
$$\mathrm{CI}(\sigma^2, 0.95) = \left[\frac{30 \cdot 225}{45.722};\; \frac{30 \cdot 225}{16.047}\right] = [147.6;\; 420.6]$$
In the quantile table of the chi-square distribution we read off χ²29;[0.025] = 16.047 and χ²29;[0.975] = 45.722.

df   χ²[0.005] χ²[0.01]  χ²[0.025] χ²[0.05]  χ²[0.1]   χ²[0.9]   χ²[0.95]  χ²[0.975] χ²[0.99]  χ²[0.995]
1    0.000     0.000     0.001     0.004     0.016     2.706     3.841     5.024     6.635     7.879
2    0.010     0.020     0.051     0.103     0.211     4.605     5.991     7.378     9.210     10.597
3    0.072     0.115     0.216     0.352     0.584     6.251     7.815     9.348     11.345    12.838
4    0.207     0.297     0.484     0.711     1.064     7.779     9.488     11.143    13.277    14.860
5    0.412     0.554     0.831     1.145     1.610     9.236     11.070    12.833    15.086    16.750
…
25   10.520    11.524    13.120    14.611    16.473    34.382    37.652    40.646    44.314    46.928
26   11.160    12.198    13.844    15.379    17.292    35.563    38.885    41.923    45.642    48.290
27   11.808    12.879    14.573    16.151    18.114    36.741    40.113    43.195    46.963    49.645
28   12.461    13.565    15.308    16.928    18.939    37.916    41.337    44.461    48.278    50.993
29   13.121    14.256    16.047    17.708    19.768    39.087    42.557    45.722    49.588    52.336
30   13.787    14.953    16.791    18.493    20.599    40.256    43.773    46.979    50.892    53.672
35   17.192    18.509    20.569    22.465    24.797    46.059    49.802    53.203    57.342    60.275
40   20.707    22.164    24.433    26.509    29.051    51.805    55.758    59.342    63.691    66.766
45   24.311    25.901    28.366    30.612    33.350    57.505    61.656    65.410    69.957    73.166
50   27.991    29.707    32.357    34.764    37.689    63.167    67.505    71.420    76.154    79.490
55   31.735    33.570    36.398    38.958    42.060    68.796    73.311    77.380    82.292    85.749
60   35.534    37.485    40.482    43.188    46.459    74.397    79.082    83.298    88.379    91.952
70   43.275    45.442    48.758    51.739    55.329    85.527    90.531    95.023    100.425   104.215
80   51.172    53.540    57.153    60.391    64.278    96.578    101.879   106.629   112.329   116.321
90   59.196    61.754    65.647    69.126    73.291    107.565   113.145   118.136   124.116   128.299
100  67.328    70.065    74.222    77.929    82.358    118.498   124.342   129.561   135.807   140.169
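The chi-square quantiles and the confidence interval for σ² can also be obtained programmatically; a sketch assuming scipy:

```python
from scipy.stats import chi2

n, s2, alpha = 30, 225.0, 0.05

q_lo = chi2.ppf(alpha / 2, df=n - 1)       # 16.047
q_hi = chi2.ppf(1 - alpha / 2, df=n - 1)   # 45.722
print(n * s2 / q_hi, n * s2 / q_lo)        # [147.6, 420.6]
```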
Sample variance
$$s^2 = \frac{1}{n}\sum_{j=1}^{n} (x_j - \bar{x})^2$$
After a sample is drawn, the quantities x̄ and s² are always calculated. Here n is the sample size and
$$\bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j\,.$$
The variance in the basic population
$$\sigma^2 = \frac{1}{N}\sum_{j=1}^{N} (x_j - \mu)^2$$
with the size N of the basic population and its arithmetic mean
$$\mu = \frac{1}{N}\sum_{j=1}^{N} x_j$$
is usually unknown when estimating and testing. Only after a census of the characteristic X could µ and σ² be calculated in this way.
Estimated variance in the basic population
$$\hat{\sigma}^2 = \frac{n}{n-1}\, s^2$$
This estimation formula yields an unbiased estimate for σ² (given independence). Here, n − 1 is the number of degrees of freedom.
The variance of the sample mean
$$V(\bar{X}) = \sigma_{\bar{X}}^2 = \frac{\sigma^2}{n}$$
can be calculated in the case that the variance in the basic population is known. Otherwise, we calculate the estimated variance of the sample mean,
$$\hat{V}(\bar{X}) = \hat{\sigma}_{\bar{X}}^2 = \frac{\hat{\sigma}^2}{n}\,.$$
1. Large sample with known variance:
$$\mathrm{CI}(\mu, 1 - \alpha) = [\bar{x} - z\,\sigma_{\bar{X}},\; \bar{x} + z\,\sigma_{\bar{X}}] \quad \text{with } z = z_{[1-\alpha/2]}\,.$$
2. Large sample with unknown variance:
$$\mathrm{CI}(\mu, 1 - \alpha) = [\bar{x} - z\,\hat{\sigma}_{\bar{X}},\; \bar{x} + z\,\hat{\sigma}_{\bar{X}}] \quad \text{with } z = z_{[1-\alpha/2]}\,.$$
3. Small sample with normally distributed basic population and known variance is to be treated as 1., since according to Theorem 11 item 1. the sample mean X̄ is also normally distributed:
$$\mathrm{CI}(\mu, 1 - \alpha) = [\bar{x} - z\,\sigma_{\bar{X}},\; \bar{x} + z\,\sigma_{\bar{X}}] \quad \text{with } z = z_{[1-\alpha/2]}\,.$$
4. Small sample with normally distributed basic population and unknown variance:
$$\mathrm{CI}(\mu, 1 - \alpha) = [\bar{x} - t\,\hat{\sigma}_{\bar{X}},\; \bar{x} + t\,\hat{\sigma}_{\bar{X}}] \quad \text{with } t = t_{n-1;[1-\alpha/2]}\,.$$
Confidence interval for the variance σ² for small samples with normally distributed basic population:
$$\mathrm{CI}(\sigma^2, 1 - \alpha) = \left[\frac{n \cdot s^2}{\chi_{\text{upper}}^2},\; \frac{n \cdot s^2}{\chi_{\text{lower}}^2}\right]$$
with the quantiles χ²lower = χ²n−1;[α/2] and χ²upper = χ²n−1;[1−α/2] of the chi-square distribution.
Control questions
1. About which random variable does the sampling distribution provide information?
2. What properties should samples have in order to provide reliable information about
the basic population?
3. What is the role of the central limit theorem in estimation, and what is the role of the
limit theorem of DE MOIVRE and LAPLACE?
4. When is a sample considered to be „large“?
5. What does a confidence interval provide information about?
6. How are the chi-square distribution and the normal distribution related?
7. Why is the t distribution mostly tabulated only up to n = 100?
8. If the sample size is too small, one must use the t distribution. Is this statement
correct without restrictions?
9. When do you use the t distribution to determine a confidence interval? Which
quantity is t-distributed in these cases?
Definition:
Null hypothesis or initial hypothesis
H0 : θ = θ0
H0 can be right or wrong. In any case, it will be retained until sufficient evidence is
provided to the contrary (sample).
Alternative hypothesis
H1 : θ ̸= θ0
H0 : θ ≤ θ 0 against H1 : θ > θ0
or
H0 : θ ≥ θ 0 against H1 : θ < θ0 .
Test decision   | H0 is correct              | H0 is wrong
retain H0       | o.k.                       | Type 2 error (β-error)
reject H0       | Type 1 error (α-error)     | o.k.

Definition:
Type 1 error: the null hypothesis is rejected even though it is correct.
Type 2 error: the null hypothesis is retained even though it is wrong.
Test procedure :
1. Formulate hypothesis: H0 vs. H1
2. Calculate test statistic/test quantity (from sample) T (x1 , . . . , xn )
3. Determine critical values and rejection region A.
4. Test decision
T (x1 , . . . , xn ) ∈ A ⇒ reject H0
T (x1 , . . . , xn ) ̸∈ A ⇒ retain H0
Null hypothesis: H0 : µ = µ0
Calculate: x̄
Deviation: |x̄ − µ0 | > 0
Rejection regions should be constructed such that the probability of the sample mean x̄ falling within the rejection region, even though H0 is correct, is at most α.
[Figure: two-sided test – rejection region (α/2), acceptance region (1 − α), rejection region (α/2) around µ0; note the difference between two-sided and one-sided questions]
[Figure: sampling distribution f(x̄) under the condition that the expected value of the basic population is µ = µ0]
upper-sided test: H0: µ ≤ µ0 vs. H1: µ > µ0, with the rejection region (probability α) to the right of the acceptance region
lower-sided test: H0: µ ≥ µ0 vs. H1: µ < µ0, with the rejection region to the left
Definition:
The probability of the type 1 error,
$$P(\bar{X} \in A \mid H_0 \text{ right}) = \alpha\,,$$
is called the significance level. A denotes the rejection region.
Two-sided test:
In the two-sided test, the rejection region is symmetrically arranged on both sides of
the acceptance region.
Assumption: variance σ² known
The standardized test variable $\frac{\bar{X} - \mu}{\sigma_{\bar{X}}}$ is standard normally distributed under H0.
For large samples (and for small samples with a normally distributed basic population):
$$P\!\left(\left|\frac{\bar{X} - \mu}{\sigma_{\bar{X}}}\right| > z_{[1-\alpha/2]} \;\Big|\; \mu = \mu_0\right) = \alpha$$
Test procedure:
1. Formulate hypothesis (two-sided): H0: µ = 0.4 L vs. H1: µ ≠ 0.4 L
2. To calculate the test statistic, we need the variance of the sample mean:
$$\sigma_{\bar{X}}^2 = \frac{0.0064\ \text{L}^2}{50} = 0.000\,128\ \text{L}^2 \;\Rightarrow\; \sigma_{\bar{X}} = \sqrt{0.000\,128\ \text{L}^2} = 0.011\,31\ \text{L}$$
3. Calculate the test quantity:
$$T(x_1, \ldots, x_n) = \frac{\bar{x} - \mu_0}{\sigma_{\bar{X}}} = \frac{0.38 - 0.4}{0.011\,31} = -1.77$$
4. Test decision: since −1.96 < −1.77 ≤ 1.96, the test quantity falls into the acceptance region ⇒ retain H0.
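The two-sided GAUSS test as a Python sketch, assuming scipy for the quantile:

```python
from math import sqrt
from scipy.stats import norm

x_bar, mu0, sigma2, n, alpha = 0.38, 0.40, 0.0064, 50, 0.05

T = (x_bar - mu0) / sqrt(sigma2 / n)   # -1.77
z_crit = norm.ppf(1 - alpha / 2)       # 1.96
print(T, abs(T) > z_crit)              # False -> retain H0
```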
One-sided test:
In the one-sided test, the rejection region is not symmetrically arranged on the two
sides of the acceptance region.
Test procedure:
1. Formulate hypothesis (lower-sided): H0: µ ≥ 0.4 L vs. H1: µ < 0.4 L
2. To calculate the test statistic, we need the variance of the sample mean:
$$\sigma_{\bar{X}}^2 = \frac{0.0064\ \text{L}^2}{50} = 0.000\,128\ \text{L}^2 \;\Rightarrow\; \sigma_{\bar{X}} = 0.011\,31\ \text{L}$$
3. Calculate the test quantity:
$$T(x_1, \ldots, x_n) = \frac{\bar{x} - \mu_0}{\sigma_{\bar{X}}} = \frac{0.38 - 0.4}{0.011\,31} = -1.77$$
4. Test decision: the rejection region now lies entirely to the left, with critical value z[0.05] = −1.645 instead of ±1.96; since T = −1.77 < −1.645, the test quantity falls into the new rejection region ⇒ reject H0.
This means that a different test decision is made here than in the previous example.
A small retail chain knows from experience that the average sales
of its 48 stores are 25 % higher in December than in November.
On New Year’s Eve, a small random sample of n = 8 stores is
hastily drawn. It yields the following sales increases in percent:
i 1 2 3 4 5 6 7 8
Question: At a significance level of 5 %, can the null hypothesis that the average
increase in sales was 25 % be retained?
First we calculate
$$\bar{x} = 25.8625\ \% \qquad \text{and} \qquad s_X^2 = 4.0548\ \%^2\,.$$
Test procedure:
1. Formulate hypothesis (two-sided): H0 : µ = 25 % vs. H1 : µ ̸= 25 %
2. To calculate the test statistic, we need the estimated variance or standard deviation of the sample mean, respectively:
$$\hat{\sigma}_X^2 = \frac{8}{7} \cdot 4.0548\ \%^2 = 4.634\,06\ \%^2 \;\Rightarrow\; \sigma_{\bar{X}}^2 = \frac{\hat{\sigma}_X^2}{8} = 0.579\,26\ \%^2 \;\Rightarrow\; \sigma_{\bar{X}} = 0.7611\ \%$$
3. Calculate the test quantity:
$$T(x_1, \ldots, x_n) = \frac{\bar{x} - \mu_0}{\hat{\sigma}_{\bar{X}}} = \frac{25.8625\ \% - 25\ \%}{0.7611\ \%} = 1.1332$$
4. Test decision: the critical value of the t-test with 7 degrees of freedom is t7;[0.975] = 2.365; since |1.1332| < 2.365 ⇒ retain H0.
[Figure: chi-square test – acceptance region between χ²lower and χ²upper, rejection regions of α/2 each]
as the gasoline consumption and now claims that the standard deviation in consumption
is way too large with 0.35 L/100 km. Is this true?
Task: Assume that the basic population is normally distributed. Test the hypothesis that
the standard deviation is at most 0.3 L/100 km, as required, at a significance level
of 10 %.
$$T(x_1, \ldots, x_n) = n\,\frac{s^2}{\sigma_0^2} = \frac{30 \cdot 0.1188}{0.09} = 39.6$$
3. Critical value for α = 10 % and 29 degrees of freedom (upper-sided test): χ²29;[0.90] = 39.087.
4. Test decision: 39.6 > 39.087 ⇒ reject H0.
If the variance is known (small sample with normally distributed basic population, or large sample):
⇒ the test quantity $T(x_1, \ldots, x_n) = \frac{\bar{x} - \mu}{\sigma_{\bar{X}}}$ is standard normally distributed, with $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$.
If the variance is unknown and it is a large sample:
⇒ the test quantity $T(x_1, \ldots, x_n) = \frac{\bar{x} - \mu}{\hat{\sigma}_{\bar{X}}}$ is approximately standard normally distributed, with $\hat{\sigma}_{\bar{X}} = \frac{\hat{\sigma}}{\sqrt{n}}$.
Then
hypothesis       H0: µ = µ0          H0: µ ≤ µ0         H0: µ ≥ µ0
                 H1: µ ≠ µ0          H1: µ > µ0         H1: µ < µ0
test quantity    T(x1, ..., xn) = (x̄ − µ)/σX̄  or  T(x1, ..., xn) = (x̄ − µ)/σ̂X̄
critical value   k =                 k =                k =
GAUSS test:      z[1−α/2]            z[1−α]             z[α] = −z[1−α]
t-test:          tn−1;[1−α/2]        tn−1;[1−α]         tn−1;[α] = −tn−1;[1−α]
Tests regarding the variance: chi-square test
hypothesis       H0: σ = σ0          H0: σ ≤ σ0         H0: σ ≥ σ0
                 H1: σ ≠ σ0          H1: σ > σ0         H1: σ < σ0
test quantity    T = T(x1, ..., xn) = n·s²/σ0²
critical value   χ²lower = χ²n−1;[α/2]    χ²upper = χ²n−1;[1−α]    χ²lower = χ²n−1;[α]
                 χ²upper = χ²n−1;[1−α/2]
Comparison of two means
Assume two independent samples of size n1 and n2
with the means x̄1 and x̄2
were taken. We will now test the hypothesis whether the two samples originate from the same basic
population or at least are taken from populations with the same mean:
H0 : µ 1 = µ 2 vs. H1 : µ1 ̸= µ2
The difference
$$\Delta = \bar{X}_1 - \bar{X}_2$$
is approximately normally distributed under the null hypothesis if the sample sizes are large (CLT) or if the characteristic is normally distributed in the basic population. Then it holds that
$$E(\Delta) = 0 \qquad \text{and} \qquad V(\Delta) = \sigma_\Delta^2 = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$$
if the samples are independent, and hence
$$\sigma_\Delta = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\,.$$
Just as with the G AUSS test for one sample, there are one-sided tests as well:
If the variance or variances of the basic populations have to be estimated, one uses
$$\hat{\sigma}_\Delta = \sqrt{\frac{\hat{\sigma}_1^2}{n_1} + \frac{\hat{\sigma}_2^2}{n_2}}$$
in the two-sample GAUSS test accordingly for large samples.
For small samples from normally distributed basic populations, the additional condition
$$\sigma_1^2 = \sigma_2^2 = \sigma^2$$
must hold. The common variance is then estimated from both samples (pooled),
$$\hat{\sigma}^2 = \frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2 - 2}\,,$$
and used to calculate
$$\hat{\sigma}_\Delta = \sqrt{\frac{\hat{\sigma}^2}{n_1} + \frac{\hat{\sigma}^2}{n_2}} = \hat{\sigma}\sqrt{\frac{n_1 + n_2}{n_1 \cdot n_2}}\,.$$
Example:
Stiftung Warentest praises the new car tire »Super ZZ«. It is said to have more than 10 % higher
mileage than its predecessor »Z«. The organization has tested four sets of each type of tire and
obtained the following result:
Mileage in km
»Super ZZ« »Z«
50 000 43 000
41 000 44 000
40 000 36 000
49 000 37 000
We test the null hypothesis that the new tire X1 has no higher mileage than the old tire X2 at a
significance level of 5 %:
346 - 6
The sample variances (with mileage measured in 1000 km) are
s1² = (1/4) · [(50 − 45)² + (41 − 45)² + (40 − 45)² + (49 − 45)²]
    = (1/4) · [5² + 4² + 5² + 4²] = 82/4 = 20.5
and
s2² = (1/4) · [(43 − 40)² + (44 − 40)² + (36 − 40)² + (37 − 40)²]
    = (1/4) · [3² + 4² + 4² + 3²] = 50/4 = 12.5 .
The pooled variance estimate is
σ̂² = (4 s1² + 4 s2²)/(4 + 4 − 2) = (82 + 50)/6 = 22 .
346 - 7
Test procedure:
1. Formulate hypothesis (one-sided): H0 : µ1 ≤ µ2 vs. H1 : µ1 > µ2
2. Calculate the test quantity, with σ̂∆ = σ̂ · √((4 + 4)/(4 · 4)) = √11 = 3.3166:
T = (x̄1 − x̄2 )/σ̂∆ = 5/3.3166 = 1.5076
3. Critical value: k = tn1 +n2 −2;[1−α] = t6;[0.95] = 1.943
4. Test decision:
1.5076 < 1.943 ⇒ retain H0 !
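For comparison (an illustrative sketch, not part of the original script), scipy's pooled two-sample t test reproduces this result; its internal pooling (sums of squared deviations divided by n1 + n2 − 2) coincides with the n·s² pooling used above:

    from scipy import stats

    super_zz = [50, 41, 40, 49]   # mileage in 1000 km
    tire_z = [43, 44, 36, 37]

    T, p_two_sided = stats.ttest_ind(super_zz, tire_z, equal_var=True)
    p_one_sided = p_two_sided / 2          # one-sided alternative H1: mu1 > mu2
    print(round(T, 4))                     # 1.5076
    print(p_one_sided > 0.05)              # True -> retain H0 at the 5 % level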
346 - 8
Comparison of two variances
We have already learned about the t-distribution and the chi-square distribution in the last chapter.
These are not really suitable for stochastic models, otherwise we would have already covered them
in chapter 9 on special distributions. Rather, they are so-called test distributions, which are very
useful in estimation and testing. Another test distribution is the so-called F-distribution:
F^m_n := (χ²_m / m) / (χ²_n / n) ,
i.e. the ratio of two independent χ²-distributed random variables, each divided by its degrees of freedom.
347 - 1
For the quantiles of the F-distribution we get
F^m_{n;[α]} = 1 / F^n_{m;[1−α]} .
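This reciprocity can be checked numerically; a short Python sketch (illustrative, scipy assumed; scipy calls the numerator degrees of freedom dfn and the denominator degrees of freedom dfd):

    from scipy import stats

    m, n, alpha = 20, 12, 0.025
    left = stats.f.ppf(alpha, dfn=m, dfd=n)             # F^m_{n;[alpha]}
    right = 1.0 / stats.f.ppf(1 - alpha, dfn=n, dfd=m)  # 1 / F^n_{m;[1-alpha]}
    print(round(left, 4), round(right, 4))              # both approx. 0.374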
(Figure: densities of the F-distribution for m = 20, n = 20 and for m = 6, n = 4.)
347 - 2
Now, from two different independent samples of size n1 and n2 , respectively, the variances s12 and s22
have been calculated. The aim is to check whether both samples are taken from basic populations
with the same variance:
H0 : σ1² = σ2² (= σ²) vs. H1 : σ1² ̸= σ2² .
Under H0 ,
n1 S1²/σ² ∼ χ²n1 −1 and n2 S2²/σ² ∼ χ²n2 −1 ,
so the quotient
[n1 S1²/(n1 − 1)] / [n2 S2²/(n2 − 1)] = F^{n1 −1}_{n2 −1}
is F-distributed with n1 − 1 and n2 − 1 degrees of freedom. Of course, this only works if the characteristic is normally distributed in the basic population.
Since the F-distribution is not symmetric, two critical values have to be used for the two-sided question. They are placed in such a way that the risk of error α is divided equally between the two parts of the rejection region.
347 - 3
Test procedure F test :
1. Formulate hypothesis: H0 : σ12 = σ22 vs. H1 : σ12 ̸= σ22
2. Test statistic/test quantity: T = [n1 s1²/(n1 − 1)] / [n2 s2²/(n2 − 1)]
3. Critical values: Flower = F^{n1 −1}_{n2 −1;[α/2]} and Fupper = F^{n1 −1}_{n2 −1;[1−α/2]}
4. Test decision:
If T < Flower or T > Fupper ⇒ reject H0
Example:
In a random sample of 21 newly issued AAA-rated corporate bonds, the maturity had a variance of
58.35 (years2 ). In contrast, in a random sample of 13 newly issued corporate bonds rated CCC, the
variance in maturity was only 4.69. Is this difference significant?
347 - 4
Test procedure:
1. Formulate hypothesis: H0 : σ1² = σ2² vs. H1 : σ1² ̸= σ2²
2. Test quantity:
T = [21 · 58.35/20] / [13 · 4.69/12] = 61.27/5.08 = 12.06
3. Critical values:
Flower = F^{20}_{12;[0.025]} = 1/F^{12}_{20;[0.975]} = 1/2.676 = 0.374
Fupper = F^{20}_{12;[0.975]} = 3.0728
4. Test decision:
12.06 > 3.073 ⇒ reject H0 !
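The same F test in a short Python sketch (illustrative, not part of the script; the n/(n − 1) factors convert the 1/n-convention variances used above):

    from scipy import stats

    n1, s1_sq = 21, 58.35   # AAA bonds
    n2, s2_sq = 13, 4.69    # CCC bonds
    alpha = 0.05

    T = (n1 * s1_sq / (n1 - 1)) / (n2 * s2_sq / (n2 - 1))        # 12.06
    f_lo = stats.f.ppf(alpha / 2, dfn=n1 - 1, dfd=n2 - 1)        # approx. 0.374
    f_hi = stats.f.ppf(1 - alpha / 2, dfn=n1 - 1, dfd=n2 - 1)    # approx. 3.073
    print(round(T, 2), T < f_lo or T > f_hi)                     # True -> reject H0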
347 - 5
Read from the table of quantiles of the F-distribution:
F^{20}_{12;[0.025]} = 1/F^{12}_{20;[0.975]} = 1/2.676 = 0.374
347 - 6
F^{20}_{12;[0.975]} = 3.073
347 - 7
Regression analysis
We recall section 4.1: Linear Regression. The task there was to compute a regression line
y = a + bx
Question: How can we test the individual parameters of the regression for statistical significance?
The term statistical significance usually refers to the test of whether the parameter a or b or the
correlation coefficient rXY is significantly different from zero. The significance of these parameters
can be tested with a t-test.
We will now briefly review what we learned in section 4.1 and then derive a test first for the correlation
coefficient rXY , then for the two parameters a and b.
348 - 1
regression line:  y = a + bx
model for the data points:  yi = a + bxi + ei
„deviation“ (residual):  ei := yi − ŷi
empirical correlation coefficient:  rXY := cXY /(sX · sY )
348 - 2
Let Ei be the random variable describing the error term in the i-th data point. In order to test the correlation coefficient rXY or the parameters a and b for their statistical significance, a few assumptions have to be made about the distribution of the error terms Ei .
Assumptions:
1. There are no systematic influences on Y other than X :
E(Ei ) = 0 for all i
2. All error terms originate from a distribution with the same standard deviation, so-called homoscedasticity:
σ(Ei ) = const. for all i
3. The error terms are not correlated with each other:
Cov(Ei , Ej ) = 0 for all i ̸= j
4. The random variables Ei are normally distributed with N (0, σ 2 ) for all i.
348 - 3
t-test for the correlation coefficient
Question: When is the correlation coefficient rXY significantly different from zero? In other words:
When does the variable X show a significant (linear) correlation with the variable Y ?
Example: In a random sample of 5 individuals, the following relationship is found between their annual income and the amount they spend on their annual vacation:
(The table with the five income/vacation pairs is not reproduced in this extract.)
348 - 4
Test procedure:
1. Formulate hypothesis: H0 : ρXY = 0 vs. H1 : ρXY ̸= 0
2. Test quantity: T = rXY · √(n − 2)/√(1 − rXY²) , which is t-distributed with n − 2 degrees of freedom; for this sample, T = 6.485.
3. Critical value: t3;[0.975] = 3.182
4. Test decision:
|6.485| > 3.182 ⇒ reject H0
Thus, we assume that the correlation coefficient rXY is significantly different from 0.
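A hedged Python sketch of this test (the five raw data pairs are not reproduced in this extract, so the test statistic T = 6.485 from the script is used directly; the helper function is ours):

    from math import sqrt
    from scipy import stats

    def corr_t_stat(r, n):
        """Test statistic for H0: rho = 0; t-distributed with n - 2 df."""
        return r * sqrt(n - 2) / sqrt(1 - r**2)

    n, alpha = 5, 0.05
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # 3.182
    T = 6.485                                       # value from the script's sample
    print(abs(T) > t_crit)                          # True -> reject H0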
348 - 5
t-test for the regression parameters a and b
Question: When are the intercept a and the slope b significantly different from zero?
For the regression line
y = a + bx
one is mostly interested in whether the influence of X on Y , which is estimated from a sample, is statistically significant (or just random).
How are the parameters a and b distributed? To answer this question, the four assumptions from a
few pages earlier are necessary.
Representing the regression equation in matrix form makes it easier to derive the test statistics (and
to move to multivariate regressions, i.e., multiple X ). We start with two independent variables X1 and
X2 to determine the matrix form. An extension to k columns is easily done.
Starting from the linear model y = a + b1 x1 + b2 x2 , for each data point (x1i , x2i , yi ) we obtain the
equation
yi = a + b1 x1i + b2 x2i + ei .
348 - 6
For n observations we obtain
y1 = a + b1 x11 + b2 x21 + e1
y2 = a + b1 x12 + b2 x22 + e2
...
yn = a + b1 x1n + b2 x2n + en
or, collecting the terms into vectors and a matrix,
y = X · b + e
with
y = (y1 , y2 , . . . , yn )ᵀ , e = (e1 , e2 , . . . , en )ᵀ , b = (a, b1 , b2 )ᵀ ,
and the n × 3 matrix X whose i-th row is (1, x1i , x2i ).
348 - 7
In the univariate case (only one X ), we estimated the regression parameters using the least squares
method:
∑ᵢ ei² = ∑ᵢ (yi − (a + bxi ))² −→ min (over a and b)
The same equation (for two or more X variables) in matrix form reads as
eᵀ · e = (y − X · b)ᵀ · (y − X · b) −→ min (over b)
We find the minimum by taking the partial derivative with respect to b and setting the gradient equal
to zero
d(eᵀ · e)/db = d[(y − X · b)ᵀ · (y − X · b)]/db = −2 · Xᵀ · y + 2 · Xᵀ · X · b = 0
and obtain the so-called normal equations,
Xᵀ · X · b = Xᵀ · y ,
348 - 8
Numerical example (to keep the calculation simple):
        3             1 3 5
        1             1 1 4
y =     8  ,    X =   1 5 6  .
        3             1 2 4
        5             1 4 6
With
            5  15   25                   20
Xᵀ · X =   15  55   81    and Xᵀ · y =   76  ,
           25  81  129                  109
as a solution of the LSE we obtain (for example, by applying the GAUSSian elimination method)
      a       4
b =   b1  =   2.5  .
      b2     −1.5
348 - 9
and hence the regression plane
y = 4 + 2.5x1 − 1.5x2 .
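The normal equations can also be solved directly in Python with numpy (an illustrative sketch, not part of the script):

    import numpy as np

    X = np.array([[1, 3, 5],
                  [1, 1, 4],
                  [1, 5, 6],
                  [1, 2, 4],
                  [1, 4, 6]], dtype=float)
    y = np.array([3, 1, 8, 3, 5], dtype=float)

    # Solve the normal equations X^T X b = X^T y:
    b = np.linalg.solve(X.T @ X, X.T @ y)
    print(b)   # [ 4.   2.5 -1.5]  ->  y = 4 + 2.5*x1 - 1.5*x2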
348 - 10
The solution of the normal equations Xᵀ · X · b = Xᵀ · y is therefore
b = (Xᵀ · X)⁻¹ · Xᵀ · y    (8)
We use β̂ = b as estimator for the model parameters and assume that the true relationship between
X and y is given by the linear model
y =X·β+u
with the unknown parameters β and the so-called disturbance terms (error variables) u. Substituting this into (8) yields
β̂ = b = (Xᵀ · X)⁻¹ · Xᵀ · (X · β + u)
       = (Xᵀ · X)⁻¹ · Xᵀ · X · β + (Xᵀ · X)⁻¹ · Xᵀ · u
       = β + (Xᵀ · X)⁻¹ · Xᵀ · u
348 - 11
What are the properties of the estimator β̂ ?
E(β̂) = E(β + (Xᵀ · X)⁻¹ · Xᵀ · u)
      = E(β) + E((Xᵀ · X)⁻¹ · Xᵀ · u)
      = β + (Xᵀ · X)⁻¹ · Xᵀ · E(u) = β ,  since E(u) = 0 .
For the covariance matrix, using E(u · uᵀ ) = σ² · I, we get
V(β̂) = E[(β̂ − E(β̂)) · (β̂ − E(β̂))ᵀ ]
      = E[(Xᵀ · X)⁻¹ · Xᵀ · u · uᵀ · X · (Xᵀ · X)⁻¹ ]
      = (Xᵀ · X)⁻¹ · Xᵀ · σ² · I · X · (Xᵀ · X)⁻¹ = σ² · (Xᵀ · X)⁻¹ ,
348 - 12
where σ² is estimated unbiasedly by
σ̂² = eᵀ · e / (n − (k + 1)) .
Here k is the number of independent variables (X1 , . . . , Xk ); together with the intercept a, there are k + 1 parameters to be estimated.
For the numerical example,
               26.7    4.5   −8
(Xᵀ · X)⁻¹ =    4.5    1    −1.5  .
                −8    −1.5    2.5
With
e = y − X · b = (−1, 0.5, 0.5, 0, 0)ᵀ
we estimate
σ̂² = eᵀ · e / (n − (k + 1)) = 1.5/(5 − 3) = 0.75
and finally obtain
                            20.025    3.375   −6
V(β̂) = σ̂² · (Xᵀ · X)⁻¹ =     3.375    0.75   −1.125  .
                             −6     −1.125    1.875
348 - 13
For the variances of the estimated parameters we need the diagonal of this matrix and get
V(a) = 20.025 ,  V(b1 ) = 0.75 ,  V(b2 ) = 1.875 .
For our numerical example, we perform a two-sided test at a significance level of 5 % to see if the
regression parameters are significantly different from zero.
H0 : a = 0 vs. H1 : a ̸= 0
test statistic: T = (4 − 0)/√20.025 = 0.89
348 - 14
H0 : b1 = 0 vs. H1 : b1 ̸= 0
test statistic: T = (2.5 − 0)/√0.75 = 2.89
H0 : b2 = 0 vs. H1 : b2 ̸= 0
test statistic: T = (−1.5 − 0)/√1.875 = −1.10
The critical value is k = t5−(2+1);[1−α/2] = t2;[0.975] = 4.303. Thus, none of the null hypotheses are
rejected, so the regression parameters are not significant.
Question: What happens to tn−(k +1);[0.975] as the sample size n becomes large? It converges to the standard normal quantile z[0.975] ≈ 1.96 ≈ 2.
This provides a simple rule of thumb:
If |T | > 2 ⇒ reject H0
Thus, when test statistics of regression parameters are greater than 2 or less than -2, they are
usually said to be significant. This would correspond to a two-sided test at a significance level of
approximately α = 0.05.
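The complete parameter tests of the numerical example in one Python sketch (illustrative only; numpy and scipy assumed, variable names are ours):

    import numpy as np
    from scipy import stats

    X = np.array([[1, 3, 5], [1, 1, 4], [1, 5, 6], [1, 2, 4], [1, 4, 6]], dtype=float)
    y = np.array([3, 1, 8, 3, 5], dtype=float)
    n, k = X.shape[0], X.shape[1] - 1

    b = np.linalg.solve(X.T @ X, X.T @ y)         # [4, 2.5, -1.5]
    e = y - X @ b                                 # residuals
    sigma2_hat = e @ e / (n - (k + 1))            # 0.75
    V = sigma2_hat * np.linalg.inv(X.T @ X)       # covariance matrix of b
    t_stats = b / np.sqrt(np.diag(V))             # [0.89, 2.89, -1.10]
    t_crit = stats.t.ppf(0.975, df=n - (k + 1))   # 4.303
    print(np.abs(t_stats) > t_crit)               # [False False False]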
348 - 15
Let us consider again the income/holiday example:
In a random sample of 5 individuals, the following relationship is found between their annual income
and the amount they spend on their annual vacation:
(The table with the five income/vacation pairs is not reproduced in this extract.)
348 - 16
The result of the regression analysis (the output table is not reproduced in this extract) is as follows.
The t-values in the output correspond to the test statistics or test variables and are often noted in
parentheses under the regression parameters:
y = −255 + 0.054x
(−0.51) (6.49)
The critical value is t3;[0.975] = 3.182. Thus, the intercept is not significant, but the slope of the
regression line is.
Interpretation of the (significant) coefficients? Individuals with higher income spend more on
vacations (in total about 5.4 % of salary)
348 - 17
Discussion: In the last example, are the requirements for the t-test satisfied?
1. There are no systematic influences on Y other than X :
E(Ei ) = 0 for all i
2. All error terms originate from a distribution with the same standard deviation, so-called homoscedasticity:
σ(Ei ) = const. for all i
3. The error terms are not correlated with each other:
Cov(Ei , Ej ) = 0 for all i ̸= j
4. The random variables Ei are normally distributed with N (0, σ 2 ) for all i.
348 - 18
Control questions
Index
approach
  logarithmic linear, 131
  quadratic, 132
  semi-logarithmic, 131
arrangement, 153
average, 38
bias, 354
binomial coefficient, 146, 148
box plot, 77
Central property, 39
characteristic value, 10
class intervals, 27
class limits, 27
class size, 27
coefficient
  of determination, 125
  of variation, 75
combination, 153
confidence
  interval, 379, 381
  level, 379
consistency, 355, 359
contingency table, 82
continuity correction, 343
correlation coefficient, 103
  BRAVAIS-PEARSON, 103
  empirical, 103
  rank, 108
  SPEARMAN, 108
covariance, 99
  empirical, 99
curtosis, 249
352 - 1
data
  paired, 81
data set, 17
decile, 56
density
  marginal, 85
  normal, 316
  probability, 358
  standard normal, 254, 311
352 - 7