
Bachelor of Science

Statistics and Probability

Module coordinator: Pia Domschke as of February 14, 2023


Course overview

I Descriptive Statistics
1 Statistical attributes and variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Measures to describe statistical distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3 Two dimensional distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4 Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .103

II Probability Theory
5 Combinatorics and counting principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6 Fundamentals of probability theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7 Random variables in one dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
8 Multidimensional random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
9 Stochastic models and special distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
10 Limit theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262

1-1
III Inferential Statistics
11 Point estimators for parameters of a population . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .280
12 Interval estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
13 Statistical testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320

IV Appendix
Literature. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .351 - 1
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352 - 1

1-2
Statistics

2
Literature:

[Anderson et al., 2012] Anderson, D. R., Sweeney, D. J., Williams, T. A., Camm, J. D.,
and Cochran, J. D. (2012).
Statistics for Business and Economics.
South-Western Cengage Learning, 12th edition.
[Bleymüller, 2012] Bleymüller, J. (2012).
Statistik für Wirtschaftswissenschaftler.
Vahlen, 16th edition.
[Lehn and Wegmann, 2000] Lehn, J. and Wegmann, H. (2000).
Einführung in die Statistik.
Vieweg+Teubner Verlag.
[Newbold et al., 2013] Newbold, P., Carlson, W. L., and Thorne, B. M. (2013).
Statistics for business and economics.
Pearson, 8th edition.
[Schira, 2016] Schira, J. (2016).
Statistische Methoden der VWL und BWL – Theorie und Praxis.
Pearson, 5th edition.
3
Part I – Descriptive Statistics

1 Statistical attributes and variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Measures to describe statistical distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3 Two dimensional distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

4 Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .103

4
1 Statistical attributes and variables

1.1 Statistical units and populations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6


1.2 Attributes/Characteristics and their values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Subpopulations, random samples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12
1.4 Statistical distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.5 Frequency and distribution function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.6 Frequency density and histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

according to [Schira, 2016], chapter 1


see also: [Anderson et al., 2012], chapter 1 & 2; and [Newbold et al., 2013], chapter 1
1. Statistical attributes and variables 5
Statistical units and populations

Definition:
Statistical units are the objects whose attributes are of interest in a given context and
that are observed, surveyed, or measured within the scope of an empirical
investigation.

The identification of similar statistical units belonging to a statistical population is
essentially given by objective and precise identification criteria (IC) relating to

1. time,
2. space, and
3. subject matter.

Examples of statistical units:


Motor vehicles, buildings, horses, students, civil servants, farms, branches, apples, sales,
marriages, births, accidents, bank accounts, etc.

1. Statistical attributes and variables 6


Statistical units and populations

Definition:
The set
Ω := {ω | ω fulfills (IC)}
of all statistical units ω that fulfill the well-defined identification criteria (IC) is called the
population.
Synonyms are statistical mass and collective.

Examples of statistical populations:

Traffic accidents in Bavaria in 2002


Traffic accidents with personal injury in Germany in 1999
Students in lectures on Tuesday, 25.02.2014 at 2:15pm in Frankfurt School of
Finance and Management, Germany
Registered bankruptcies of building companies in North Rhine Westphalia in April
2002

1. Statistical attributes and variables 7


Attributes/Characteristics and their values

Statistical units ω are usually not of direct interest in themselves, but rather some of their
attributes M(ω) are.
Distinguishable manifestations of a characteristic are called characteristic values or
modes.

Examples:

The characteristic gender has the possible values {male, female, diverse}.
The characteristic eye color has the possible values {blue, green, grey, brown}.
For the characteristic body weight of adult humans all values between 30 and
300 kg have to be allowed as possible values.

1. Statistical attributes and variables 8


Statistical variable

Definition:
The statistical variable assigns a real number x to a statistical unit ω or its characteristic
M(ω). Thus

x = X(ω) = f(M(ω)).

X(ω) is a real-valued function of the characteristic values M(ω) and thus of the statistical
units:

X: Ω → ℝ
ω ↦ X(ω) = x

Difference between M and X :


X (ω) is always a real number. M (ω) can also be „green“ or „married“.
Characteristic (if numerical) and statistical variable are often used as synonyms,
although strictly speaking they do not precisely denote the same thing.
One often simply says: „the statistical variable X “ or „the characteristic X “

1. Statistical attributes and variables 9


Types of attributes

Qualitative characteristics are e.g.:


gender
religious belief
legal form of companies, etc.

Quantitative characteristics or variables are e.g.:

discrete characteristics or variables: age (in whole years), number of children
continuous characteristics or variables: income, living space, length of a line

1. Statistical attributes and variables 10


Measurement levels (scales)

(The following measurement levels are listed in increasing degree of measurability.)

Nominally measurable variables


Nominal attributes are always qualitative.
Examples: religion, nationality, professions, color of a car, the legal structure of a
company

Ordinally measurable variables


There exists a natural or meaningful way to determine the ranking.
Examples: intelligence quotient, school grades or table positions in the German
football league.

Cardinally measurable variables


In the case of cardinally scaled variables, the difference between outcomes is also
meaningful.
Examples: GDP, investment, inflation, costs, revenue and profit

1. Statistical attributes and variables 11


Subpopulations, random samples

Possibilities to sample the values of a characteristic


Full sampling (but in many cases not feasible)
Partial sampling

Definition:
Each proper subset Ω∗ of Ω is called a subpopulation or sample of the whole popula-
tion.
Subpopulations are called random samples if chance played a significant role in the
selection of the elements.

1. Statistical attributes and variables 12


Subpopulations, random samples

Pure random sampling:


Each part of the population has the same chance of being selected in the random
sample.

Representative random sampling:


The aim is to select a subpopulation that is representative of the whole population.
As the structure of the characteristic we are interested in is unknown before sampling,
we try to ensure that the sample is representative with respect to other characteristics
for which we assume a certain „statistical relationship“ to the characteristic under
investigation.

1. Statistical attributes and variables 13


Subpopulations, random samples
Example

A research institute creates an election forecast.


For this purpose, 3000 eligible voters are asked the so-called
Sunday question: „Which party would you choose if there were
elections next Sunday?“

To get more realistic results, the random sample is selected on a representative basis:
consequently, other characteristics are taken into consideration that could have a
statistical influence on party preference. The random sample needs to reflect the share
of women in the population of all eligible voters. The age structure should also conform
to the whole population.
This already makes the sample quite representative for this purpose. It would certainly
still be important to take the geographical distribution into account to avoid a situation
where too many respondents happen to live in Baden-Württemberg. Furthermore, it
would be good if the professional structure were at least analogous in the
characteristics workers, employees, civil servants, self-employed. Yes, and of course
students must be in the sample, otherwise Green voters might be underrepresented.

1. Statistical attributes and variables 14


Statistical distribution

Raw data table

Elements            ω1   ω2   ...   ωi   ...   ωn
Observed values     x1   x2   ...   xi   ...   xn

Definition:
The (finite) sequence of the n values

x1, x2, . . . , xi, . . . , xn

with xi = X(ωi) for i = 1, . . . , n is called the observation series of the variable X or simply
the data set X.

1. Statistical attributes and variables 15


Statistical distribution

If the order of the observations does not matter, it is often helpful to sort and renumber
the variable values.

x1 ≤ x2 ≤ x3 ≤ · · · ≤ xi ≤ · · · ≤ xn

Example: n = 20 observations

1.6  1.6  3.0  3.0  3.0  3.0  4.1  4.1  4.1  4.1
4.1  4.1  4.1  4.1  5.0  5.0  5.0  5.0  5.0  5.0

with k = 4 different outcomes: 1.6, 3.0, 4.1, 5.0

1. Statistical attributes and variables 16


Absolute and relative frequency

Definition:
The absolute frequency

ni := absH(X = xi)

indicates how often the statistical variable X takes a certain value xi.

The relative frequency

hi := relH(X = xi) = ni / n,   0 < hi ≤ 1,

indicates the share of the characteristic value xi in the population.

1. Statistical attributes and variables 17


Statistical distribution

Definition:
The tables

 xi | x1  x2  ...  xk              xi | x1  x2  ...  xk
 ni | n1  n2  ...  nk     and      hi | h1  h2  ...  hk

with ∑_{i=1}^{k} ni = n and ∑_{i=1}^{k} hi = 1

are called the absolute and relative frequency distribution of the statistical variable X,
respectively.

Example:

 xi | 1.6  3.0  4.1  5.0           xi | 1.6  3.0  4.1  5.0
 ni |  2    4    8    6            hi | 0.1  0.2  0.4  0.3

with ∑_{i=1}^{4} ni = 20 and ∑_{i=1}^{4} hi = 1
1. Statistical attributes and variables 18
Statistical distribution

1.6  1.6  3.0  3.0  3.0  3.0  4.1  4.1  4.1  4.1
4.1  4.1  4.1  4.1  5.0  5.0  5.0  5.0  5.0  5.0

[Figure: graph of the frequency distribution — bar chart of the absolute frequencies ni
(left axis, 0 to 10) and relative frequencies hi (right axis, 0 to 0.5) over the values
xi = 1.6, 3.0, 4.1, 5.0]

1. Statistical attributes and variables 19


Frequency and distribution function

Definition:
The function

h(x) = hi   if x = xi,
h(x) = 0    otherwise,

is called the (relative) frequency function of the statistical variable X.

The function

H(x) = ∑_{xi ≤ x} h(xi)

is called the empirical cumulative distribution function of the statistical variable X.
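As an illustration, here is a minimal Python sketch (using the 20-value example data set from the surrounding slides) that computes the relative frequencies and evaluates the frequency function h(x) and the empirical cumulative distribution function H(x):

```python
from collections import Counter

# Example data set from the slides (n = 20 observations)
data = [1.6, 1.6, 3.0, 3.0, 3.0, 3.0, 4.1, 4.1, 4.1, 4.1,
        4.1, 4.1, 4.1, 4.1, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0]
n = len(data)

# Absolute and relative frequencies n_i and h_i
abs_freq = Counter(data)                       # {1.6: 2, 3.0: 4, 4.1: 8, 5.0: 6}
rel_freq = {x: ni / n for x, ni in abs_freq.items()}

def h(x):
    """(Relative) frequency function: h_i if x equals some x_i, else 0."""
    return rel_freq.get(x, 0.0)

def H(x):
    """Empirical cumulative distribution function: sum of h(x_i) over all x_i <= x."""
    return sum(hi for xi, hi in rel_freq.items() if xi <= x)

print(rel_freq)          # {1.6: 0.1, 3.0: 0.2, 4.1: 0.4, 5.0: 0.3}
print(H(3.0), H(4.5))    # 0.3 and 0.7 (up to floating-point rounding)
```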

1. Statistical attributes and variables 20


Frequency and distribution function
Example

 xi | 1.6  3.0  4.1  5.0
 hi | 0.1  0.2  0.4  0.3
 Hi | 0.1  0.3  0.7  1.0

[Figure: the frequency function h(x) (spikes of height hi at the values xi) and the empirical
distribution function H(x), a step function increasing from 0 to 1, plotted for x between 1 and 6]
1. Statistical attributes and variables 21
Frequency and distribution function
Properties of the empirical cumulative distribution function

The empirical cumulative distribution function H is

1. everywhere at least continuous from the right,

   lim_{Δx→0+} H(x + Δx) = H(x)

   (at jumps it is only continuous from the right),

2. monotonically increasing,

   H(a) ≤ H(b)   if a < b,

3. and has lower limit 0 and upper limit 1,

   lim_{x→−∞} H(x) = 0,     lim_{x→∞} H(x) = 1.

1. Statistical attributes and variables 22


Frequency and distribution function
Further properties of the distribution function

1. For a < b, the difference

   H(b) − H(a) = relH(a < X ≤ b)

   specifies the relative frequency of observed values of the variable X that are greater
   than a, but not greater than b.

2. The function value at a point x indicates the relative frequency with which values less
   than or equal to x occur in the data set:

   H(x) = relH(X ≤ x)

3. At each point, the value of the frequency function is obtained from the empirical
   distribution function as the difference

   h(x) = H(x) − lim_{Δx→0+} H(x − Δx)

1. Statistical attributes and variables 23


Frequency density and histograms

Due to limitations in practice, we often apply the formation of class intervals or layers:

class size
class frequency
distribution function of the classes
approximation by polygons
frequency density of the classes
frequency density function and histogram
approximation by smooth curves
continuous density function

[Figure: income distribution, Germany as a whole]

1. Statistical attributes and variables 24


Frequency density and histograms

Formation of class intervals or layers with appropriately selected class limits
ξ0, ξ1, ξ2, . . . , ξm:

[Figure: number line with the class limits ξ0 < ξ1 < ξ2 < ξ3 < . . . < ξm marked on the x-axis]

The m sections have the class sizes

Δi := ξi − ξi−1,   i = 1, . . . , m,

and the class frequency of the values in each size class is

hi := relH(ξi−1 < X ≤ ξi),   i = 1, . . . , m.

Note: [Schira, 2016] uses right hand inclusion while [Anderson et al., 2012] and
[Newbold et al., 2013] use left hand inclusion

1. Statistical attributes and variables 25


Frequency density and histograms

Definition:
By assigning the class frequencies to the upper limits of the classes (an alternative
possibility would be to assign the class frequencies to the class centers), the following
frequency table can be drawn from the values

 ξ1  ξ2  ...  ξm
 h1  h2  ...  hm        with ∑_{i=1}^{m} hi = 1,

and hence the so-called distribution function of the classes HK(x).

Exercise:
What does the distribution function for the classes shown on the left look like? (Choose
an appropriate upper limit for the final class.)

1. Statistical attributes and variables 26


Frequency density and histograms

By focussing on the upper limits (or any other single point) of the classes we lose
information about the distribution within the classes. The assumption of a uniform
distribution leads to the definition below.

Definition:
Let HK(x) be the distribution function of a characteristic X obtained by size classes with
upper class limits ξ1, ξ2, . . . , ξm. Then, the ratio

(HK(ξi) − HK(ξi−1)) / (ξi − ξi−1) = hi / Δi

is called the (average) frequency density of the i-th size class (i = 1, . . . , m).

1. Statistical attributes and variables 27


Frequency density and histograms

[Figure: the distribution function HK approximated by a polygonal line H̄(x); below it the
corresponding histogram]

Approximation of the distribution function HK by a polygonal line H̄(x).

Taking the derivative of H̄(x) leads to the (average) frequency density function h̄(x)
of the size classes:

h̄(x) := dH̄(x)/dx

Its graph is called a histogram.

The area of a column corresponds to the relative class frequency.
The total area of the columns of the histogram is one.
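A minimal Python sketch of this construction; the class limits and data below are made up for illustration. It computes the relative class frequencies, the frequency densities hi/Δi (the column heights of the histogram), and checks that the columns have total area one:

```python
# Hypothetical class limits xi_0 < xi_1 < ... < xi_m and illustrative data
limits = [0, 10, 20, 40, 80]
data = [3, 7, 12, 15, 18, 22, 25, 31, 38, 45, 52, 60, 75]
n = len(data)

freqs, densities = [], []
for lo, up in zip(limits[:-1], limits[1:]):
    # right-hand inclusion as in [Schira, 2016]: lo < x <= up
    h_i = sum(1 for x in data if lo < x <= up) / n   # relative class frequency
    freqs.append(h_i)
    densities.append(h_i / (up - lo))                # frequency density h_i / Delta_i

# Each histogram column has area (density * class size) = h_i, so the areas sum to 1
total_area = sum(d * (up - lo) for d, lo, up in zip(densities, limits[:-1], limits[1:]))
print(freqs)        # relative class frequencies
print(densities)    # column heights of the histogram
print(total_area)   # 1.0 (up to floating-point rounding)
```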

1. Statistical attributes and variables 28


Frequency density and histograms

[Figure: the distribution function of the classes approximated by a smooth curve; below it
the corresponding continuous density function]

Approximation of the distribution function of the classes HK by a smooth curve H̃(x).

Taking the derivative of H̃(x) leads to the (continuous) density function

h̃(x) := dH̃(x)/dx

1. Statistical attributes and variables 29


Control questions

1. What is the difference between characteristic and variable?

2. What different types of scales are there? Examples!

3. Why are mainly representative random samples taken into account in practice?

4. What are the properties of the step function? What is its information content?

5. Why is the formation of size classes often necessary?

6. What is the implicit assumption underlying the approximating distribution function H̄(x)?

7. What is the difference between a bar chart (look up definition if necessary) and a
histogram? Under what condition do they both look the same?

1. Statistical attributes and variables 30


2 Measures to describe statistical distributions

2.1 Measures to describe statistical distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32


2.2 Measures of central tendency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3 Mode. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .34
2.4 Arithmetic mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.5 Geometric mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.6 Harmonic mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.7 Robust measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.8 Median . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.9 Quartiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.10 Quantiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.11 Measures of variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.12 Variance and standard deviation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.13 Five-point summary and box plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

according to [Schira, 2016], chapter 2


see also: [Anderson et al., 2012], chapter 3;
and [Newbold et al., 2013], chapter 2

2. Measures to describe statistical distributions 31


Measures to describe statistical distributions

Especially for statistical data sets with many different characteristic values, one would
like to describe the entire distribution of the characteristic with the help of a few numbers.
Such numbers are called measures or parameters of a distribution.

We distinguish between
measures of location
measures of dispersion

2. Measures to describe statistical distributions 32


Measures of central tendency

Definition:
A measure of central tendency or measure of location is a parameter used to de-
scribe the distribution of a random variable and provides a „typical“ value. In particular,
it describes the location of the data set, i.e. where or in which order of magnitude the
values of the variable are located.

2. Measures to describe statistical distributions 33


Mode

Definition:
A number xMod with
h(xMod ) ≥ h(xi ) for all i
is called mode or modal value of an empirical data set.

Useful measure of location especially for purely qualitative characteristics

2. Measures to describe statistical distributions 34


Mode

Examples:
The data set

2 3 3 4 4 4 5 6

has the mode xMod = 4 (and is thus unimodal)

There are two „most frequent“ values in the data set

1 2 3 3 3 4 5 6 6 6 7

namely the values 3 and 6.

The mode is the value that occurs with highest frequency

2. Measures to describe statistical distributions 35


Arithmetic mean

Definition:
The value

x̄ := (1/n) ∑_{j=1}^{n} xj

is called the arithmetic mean, mean value, or average of a statistical distribution.

Alternative calculation using absolute or relative frequencies:

x̄ = (1/n) ∑_{j=1}^{k} nj xj = ∑_{j=1}^{k} hj xj
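As a quick check in Python (a sketch using the frequency distribution from the earlier example, xi = 1.6, 3.0, 4.1, 5.0 with hi = 0.1, 0.2, 0.4, 0.3), both formulas give the same mean:

```python
data = [1.6, 1.6, 3.0, 3.0, 3.0, 3.0, 4.1, 4.1, 4.1, 4.1,
        4.1, 4.1, 4.1, 4.1, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0]
n = len(data)

# Arithmetic mean from the raw data
mean_raw = sum(data) / n

# Arithmetic mean from the relative frequency distribution
values = [1.6, 3.0, 4.1, 5.0]
h      = [0.1, 0.2, 0.4, 0.3]
mean_freq = sum(hj * xj for hj, xj in zip(h, values))

print(mean_raw, mean_freq)   # both 3.9 (up to floating-point rounding)
```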

2. Measures to describe statistical distributions 36


Arithmetic mean

Properties of the arithmetic mean

1. Central property:

   ∑_{j=1}^{n} (xj − x̄) = 0

2. Shifting all values of a data set X by the constant value a shifts the arithmetic
   mean by exactly this value:

   yi := xi + a   ⇒   ȳ = x̄ + a

3. Multiplying all values of a data set X by the constant factor b multiplies the
   arithmetic mean by exactly this factor:

   zi := b · xi   ⇒   z̄ = b · x̄

2. Measures to describe statistical distributions 37


Geometric mean

Definition:
The geometric mean is

GX := (x1 · x2 · . . . · xn)^(1/n),   xi > 0.

For the geometric mean, the individual characteristic values are multiplied and the n-th
root is taken of the product. It is only defined if all values of the data set X are positive.

The logarithm of the geometric mean corresponds to the arithmetic mean of the
logarithms (important for the calculation of the overall return on an investment):

log GX = (1/n) ∑_{i=1}^{n} log xi

2. Measures to describe statistical distributions 38


Geometric mean

Example:
For the data set X with the values

2 6 12 9

the geometric mean is Gx = 6


and the arithmetic mean is x̄ = 7.25

Note: The geometric mean for each set with only positive values is always smaller than
the arithmetic mean unless all the values in the data set are the same.

2. Measures to describe statistical distributions 39


Example

In five consecutive years, the turnover Y (given in thousand €) of a company developed
as follows:

2. Measures to describe statistical distributions 40


Example
(continued)

Question: Which mean is best suited for the calculation of the average growth?

Arithmetic mean:

1 + r = (1.20 + 0.85 + 1.40 + 1.25)/4 = 1.175

Geometric mean:

G1+r = (1.20 · 0.85 · 1.40 · 1.25)^(1/4) = 1.1559

An average increase in turnover of 17.5 % would result in a turnover of 2287 k€ in 2001,
whereas an average increase in turnover of 15.59 % results in the actual value of
2142 k€.

2. Measures to describe statistical distributions 41


Example Stock return

Financial advisor: „The share is a top investment, with an average return of 25 %.“

Arithmetic mean:

1 + r = ½ · ((1 + 100 %) + (1 − 50 %)) = 1 + 25 %   ⇒   25 %

Geometric mean:

G1+r = √(2 · 0.5) = 1   ⇒   0 %
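A short Python sketch contrasting the two means for growth/return factors, using the stock example above (+100 % followed by −50 %) and the turnover example:

```python
import math

factors = [2.0, 0.5]                              # growth factors: +100 %, then -50 %
n = len(factors)

arithmetic = sum(factors) / n                     # 1.25 -> "average return 25 %"
geometric = math.prod(factors) ** (1 / n)         # 1.0  -> true average return 0 %
print(arithmetic - 1, geometric - 1)              # 0.25 0.0

# Turnover example: average growth factor over four years
g = math.prod([1.20, 0.85, 1.40, 1.25]) ** 0.25
print(round(g, 4))                                # 1.1559
```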

2. Measures to describe statistical distributions 42


Harmonic mean

Definition:
From the values xi > 0 of a data set, one can calculate the reciprocal values 1/xi and
then take the arithmetic mean of these values,

(1/n) · (1/x1 + · · · + 1/xn).

Taking the reciprocal of this result again yields the so-called harmonic mean

HX := n / ∑_{j=1}^{n} (1/xj).

2. Measures to describe statistical distributions 43


Harmonic mean

Example:
For the data set X with the values

2 6 12 9

the harmonic mean is Hx = 4.645


the geometric mean is Gx = 6
and the arithmetic mean is x̄ = 7.25

Note: For every data set of positive values that are not all equal, it can be shown that

HX < GX < x̄

2. Measures to describe statistical distributions 44


Example:

Two trucks travel at speeds of v1 = 60 km/h and v2 = 80 km/h on the highway. Thus the average
speed (arithmetic mean) is

v̄ = ½ · (60 km/h + 80 km/h) = 70 km/h.

To estimate the (average) transport time t̄, and thus transport capacities and transport costs, for
a distance of, say, Hamburg to Duisburg, one would divide the corresponding distance d = 420 km
by this value and obtain, with

d / v̄ = 420 km / (70 km/h) = 6 h,

a wrong value. Indeed, the transport times of the two trucks are t1 = 7 h and t2 = 5.25 h. Thus the
average transport time is

t̄ = 6.125 h.

If, on the contrary, one divides the distance by the harmonic mean

HV = 2 / (1/(60 km/h) + 1/(80 km/h)) = (480/7) km/h ≈ 68.57 km/h,

44 - 1
then one obtains, with

d / HV = 420 km / ((480/7) km/h) = 6.125 h = t̄,

the correct result.

If you are doing a salary calculation based on an hourly wage, then this question is highly
relevant.

Question: Why is the first calculation wrong?

In this example we want to calculate an average transport time for a fixed distance d. The problem
with the average speed is that it is not valid over the whole time, because the faster truck arrives
already after 5.25 h and then stops while the slower one is still moving. For the calculation of the
mean transport time, the speeds appear in the denominator due to the relation ti = d/vi. This leads
to the harmonic mean:

t̄ = (1/n) · (t1 + · · · + tn) = (1/n) · (d/v1 + · · · + d/vn)
  = d · (1/n) · (1/v1 + · · · + 1/vn) = d · (1/HV) = d/HV

44 - 2
If, in contrast, we want to know how far the trucks have come on (arithmetic) average after a certain
time t, the calculation

d̄ = (1/n) · (d1 + · · · + dn) = (1/n) · (t · v1 + · · · + t · vn) = t · (1/n) · (v1 + · · · + vn) = t · v̄

is the correct one.
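A small Python check of this example (speeds 60 km/h and 80 km/h, distance 420 km):

```python
speeds = [60.0, 80.0]        # km/h
d = 420.0                    # km, distance driven by each truck
n = len(speeds)

arithmetic_v = sum(speeds) / n                    # 70 km/h
harmonic_v = n / sum(1 / v for v in speeds)       # 480/7 ~ 68.57 km/h
true_mean_time = sum(d / v for v in speeds) / n   # 6.125 h

print(d / arithmetic_v)   # 6.0   -> wrong estimate of the mean transport time
print(d / harmonic_v)     # 6.125 -> matches the true mean transport time
print(true_mean_time)     # 6.125
```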

44 - 3
Robust measures

The measures presented so far are quite sensitive to outliers. This means that strong
deviations of individual values significantly influence these measures. This is not the
case with so-called robust measures.

Definition:
Starting with the raw data
x = (x1 , x2 , . . . , xn )
of a data set of size n, the characteristic values xi are arranged in ascending order:

x(1) ≤ x(2) ≤ · · · ≤ x(n) .

The resulting list


x( ) = (x(1) , x(2) , . . . , x(n) )
is called an ordered sample (for the raw data x).

Annotation:
In the following, the parentheses in the index are omitted for ordered samples. It should always be
clear from the context whether a data set is ordered or not.

2. Measures to describe statistical distributions 45


Median

First order and renumber the observed values:

x1 ≤ x2 ≤ x3 ≤ · · · ≤ xi ≤ · · · ≤ xn .

New indexing is also called „rank“.

Definition: A number xMed with

relH(X ≤ xMed ) ≥ 50 %
relH(X ≥ xMed ) ≥ 50 %

is called the median or central value of the empirical data set X .


For an even number n it can happen that the median is not uniquely determined. For a
unique value we define

xMed = x_{(n+1)/2}                      if n is odd,
xMed = ½ · (x_{n/2} + x_{n/2+1})        if n is even.
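A direct Python translation of this convention (a sketch, deliberately not relying on any library's default quantile rule):

```python
def median(values):
    """Median per the convention above: middle element for odd n,
    average of the two middle elements for even n."""
    x = sorted(values)
    n = len(x)
    if n % 2 == 1:
        return x[(n + 1) // 2 - 1]                 # x_{(n+1)/2}, 1-based -> 0-based
    return 0.5 * (x[n // 2 - 1] + x[n // 2])       # (x_{n/2} + x_{n/2+1}) / 2

data = [4, 7, 7, 7, 12, 12, 13, 16, 19, 23, 23, 97]   # example from the next slide
print(median(data))            # 12.5
print(sum(data) / len(data))   # 20.0 -> the outlier 97 pulls up the mean, not the median
```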

2. Measures to describe statistical distributions 46


Median

Example:

The data set

4 7 7 7 12 12 13 16 19 23 23 97

has the arithmetic mean x̄ = 20
and the median xMed = 12.5.

Note: Outliers tend to influence the mean, but not the median.

2. Measures to describe statistical distributions 47


Annotation:

With the definition of the median via the relative frequencies

relH(X ≤ xMed) ≥ 50 %
relH(X ≥ xMed) ≥ 50 %

one obtains the potentially non-unique definition

xMed = x_{(n+1)/2}                   if n is odd,
x_{n/2} ≤ xMed ≤ x_{n/2+1}           if n is even.

Strictly speaking, in the previous example, 12 or 13 or 12.2 would also be medians, since they divide
the data in the middle as well.

47 - 1
In practice, the arithmetic mean and the median are the most important characteristic measures of
location for a given distribution. Colloquially, however, a distinction is not always made between the
two measures, especially when it comes to income or wealth distributions.

Example: Median income is the income for which there are just as many people with a
higher income as with a lower income. Median income, which is explicitly not identical
with average income, is used in the social sciences and economics to undertake poverty
calculations, for example. It is more robust against outliers in a sample and is therefore
often preferred to the arithmetic mean (average).

Think about when and why the average income differs from the median income.

47 - 2
Quartiles

In addition to the median, two other values can be defined that further divide the ordered
statistical data set:

Definition:
The characteristic values of the data set are arranged in ascending order

x1 ≤ x2 ≤ · · · ≤ xn

and divided into four segments with (as far as possible) the same number of values.
The three values
Q1 ≤ Q2 = xMed ≤ Q3
are called quartiles and are defined in such a way that they lie in between the four
segments, just as the median xMed does.
Consequently, about 50 % of the observations are found between Q1 and Q3.

Median and quartiles are special cases of the more general quantiles.

2. Measures to describe statistical distributions 48


Quantiles

Definition:
A number x[q ] with 0 < q < 1 is called q-quantile if it splits the data set X such that at
least 100 · q % of its observed values are less than or equal to x[q ] and at the same time
at least 100 · (1 − q )% are greater than or equal to x[q ] , that is:

relH(X ≤ x[q ] ) ≥ q and relH(X ≥ x[q ] ) ≥ 1 − q .

Special quantiles:

Quartiles:

Q1 = x[0.25] lower quartile


Q2 = x[0.5] = xMed Median
Q3 = x[0.75] upper quartile

Deciles: x[0.1] , x[0.2] , x[0.3] , . . . , x[0.9]


Percentiles: x[0.01] , x[0.02] , x[0.03] , . . . , x[0.99]

2. Measures to describe statistical distributions 49


Quantiles

Calculation of the q-quantiles:


For continuously approximated distribution functions, the following holds for the
q-quantiles:
H(x[q]) = q.
This yields the q-quantiles from the inverse of the distribution function,

x[q] = H⁻¹(q).

This also works for step function shaped distribution functions if one directly hits a jump.
However, if one lands on a stairstep, the inverse function is not uniquely determined.
Then, in fact, every value between the adjacent jumps is a q-quantile:

xi ≤ x[q ] ≤ xi +1 .

To obtain a unique value, one then usually takes the arithmetic mean of the two adjacent
jumps, x[q] = ½ · (xi + xi+1).

2. Measures to describe statistical distributions 50


Quantiles

2. Measures to describe statistical distributions 51


Quantiles

For a data set, the q-quantile can also be determined without the detour via the graph of
the distribution function

The q-quantile of an ordered data set x1, . . . , xn is determined by

x[q] = ½ · (x_{n·q} + x_{n·q+1})     if n · q is an integer,
x[q] = x_{⌈n·q⌉}                     otherwise.

Here ⌈n · q⌉ means that the number n · q is rounded up to the nearest integer.

2. Measures to describe statistical distributions 52


Example

Table: Turnover of large industrial companies in Germany (2005, in million €)

Company Turnover Company Turnover

20 DaimlerChrysler 149 776 10 Bayer 27 383


19 Volkswagen 95 268 9 Shell Deutschland 24 300
18 Siemens 75 445 8 RAG AG 21 869
17 E.ON 51 854 7 Hochtief 14 854
16 BMW 46 656 6 MAN 14 671
15 BASF 42 745 5 Continental 13 837
14 ThyssenKrupp 42 064 4 Henkel 11 974
13 Bosch 41 461 3 ZF Friedrichshafen 10 833
12 RWE 40 518 2 EnBW 10 769
11 Deutsche BP 37 432 1 Vattenfall 10 543

x[0.80] = ½ · (x16 + x17) = ½ · (46 656 + 51 854) = 49 255
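The rule above can be coded directly. A Python sketch that reproduces the 80 %-quantile of the 20 turnover values (n·q = 16 is an integer, so the two neighbouring order statistics are averaged):

```python
import math

def quantile(values, q):
    """q-quantile following the convention given two slides earlier."""
    x = sorted(values)
    nq = len(x) * q
    if abs(nq - round(nq)) < 1e-9:          # n*q is (numerically) an integer
        k = int(round(nq))
        return 0.5 * (x[k - 1] + x[k])      # average of x_{n*q} and x_{n*q+1}
    return x[math.ceil(nq) - 1]             # x_{ceil(n*q)}, 1-based -> 0-based

turnover = [10543, 10769, 10833, 11974, 13837, 14671, 14854, 21869, 24300, 27383,
            37432, 40518, 41461, 42064, 42745, 46656, 51854, 75445, 95268, 149776]

print(quantile(turnover, 0.80))   # 49255.0
print(quantile(turnover, 0.50))   # median: 0.5 * (27383 + 37432) = 32407.5
```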

2. Measures to describe statistical distributions 53


Example
(continued)

[Figure: bar chart of the turnover of the 20 industrial companies (ranks 1–20 on the x-axis,
turnover from 30 000 up to 150 000 on the y-axis) with the 80 %-quantile = 49 255 marked
as a horizontal line. Caption: Turnover of 20 industrial companies and 80 % quantile]

2. Measures to describe statistical distributions 54


Example

2. Measures to describe statistical distributions 55


Measures of variation

[Figure: three histograms showing distributions with different degrees of spread or variability]

The extent of the spread or variation or dispersion of a distribution needs to be
expressed as a measure.

2. Measures to describe statistical distributions 56


Measures of variation

The descriptive statistics provides some measures of variation:

Definition:
The range is the difference between the largest and the smallest value in a data set:

range := xmax − xmin

Definition:
The so-called mean absolute deviation

MAD := (1/n) ∑_{j=1}^{n} |xj − x̄|

is calculated as the arithmetic mean of the absolute deviations of the characteristic
values from their mean.

2. Measures to describe statistical distributions 57


Measures of variation

We recall the median and the quartiles Q1 ≤ Q2 = xMed ≤ Q3 , that divide the ordered
data set into four approximately equally sized parts.

Definition:
The difference
IQR := Q3 − Q1
is known as the interquartile range.

Definition:
The arithmetic mean of the deviations of the quartiles from the median is called the
quartile deviation or semi-interquartile range:

QD := ((Q3 − Q2) + (Q2 − Q1)) / 2 = IQR / 2

2. Measures to describe statistical distributions 58


Measures of variation

Example:
For a data set with n = 14 values we are looking for the quartile deviation.

As median we take the arithmetic mean of both neighbours and obtain Q2 = xMed = 26.8.

2. Measures to describe statistical distributions 59


Variance and standard deviation

These are the most important measures of variation in statistics:

Definition:
The average quadratic deviation from the arithmetic mean

s²X := (1/n) ∑_{j=1}^{n} (xj − x̄)²

is called the empirical variance or, in short, the variance of an observed data set X.

Calculation also using relative frequencies:

s²X = ∑_{j=1}^{k} hj (xj − x̄)²

Definition: The positive root of the variance,

sX := +√(s²X),

is called the standard deviation.


2. Measures to describe statistical distributions 60
Example

Variance calculation using relative frequencies: s²X := ∑_{j=1}^{k} hj (xj − x̄)²

The following distribution is given:

 xi | 4    5    6
 hi | 1/4  1/2  1/4

The arithmetic mean is x̄ = 5. Its variance is

s²X = (4 − 5)² · 1/4 + (5 − 5)² · 1/2 + (6 − 5)² · 1/4
    = 1/4 + 0 + 1/4 = 1/2

and its standard deviation is

sX = √(1/2) = 1/√2 ≈ 0.7071.

2. Measures to describe statistical distributions 61


Example

For the data set

3 5 9 9 6 6 3 7 7 6 7 6 5 7 6 9 6 5 3 5

consisting of 20 numbers, a working table yields:

Arithmetic mean:      6
Variance:             3.1
Standard deviation:   1.761
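The same numbers in Python, computed once with the definition formula and once with the frequency-based formula as a cross-check:

```python
from collections import Counter

data = [3, 5, 9, 9, 6, 6, 3, 7, 7, 6, 7, 6, 5, 7, 6, 9, 6, 5, 3, 5]
n = len(data)

mean = sum(data) / n
var_def = sum((x - mean) ** 2 for x in data) / n     # definition formula

# Frequency-based formula: s_X^2 = sum_j h_j * (x_j - mean)^2
var_freq = sum((cnt / n) * (x - mean) ** 2 for x, cnt in Counter(data).items())

print(mean, var_def, var_freq, var_def ** 0.5)       # 6.0 3.1 3.1 1.7606...
```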

2. Measures to describe statistical distributions 62


Variance and standard deviation

The variance can also be calculated for density functions.


Example: Assume that the distribution of a statistical variable has been approximated by
a frequency density function h(x) in the interval 0 < x < 2 given by the parabola

h(x) = (3/2) · (x − ½ x²)   for x ∈ (0, 2),
h(x) = 0                    otherwise.

As the graph shows, for symmetry reasons, the mean value is equal to 1. To calculate the
value, the summation symbol is replaced by the integral

x̄ = ∫_{−∞}^{∞} x · h(x) dx

In the same way the variance is calculated by

s²X = ∫_{−∞}^{∞} (x − x̄)² h(x) dx

2. Measures to describe statistical distributions 63


Variance and standard deviation

Example (continued): Hence we calculate the variance as the definite integral

s²X = ∫₀² (x − x̄)² h(x) dx = ∫₀² (x − 1)² · (3/2) · (x − ½ x²) dx

    = ∫₀² (3/2) · (x − (5/2) x² + 2 x³ − ½ x⁴) dx

    = (3/2) · [ ½ x² − (5/6) x³ + ½ x⁴ − (1/10) x⁵ ]₀²

    = (3/2) · (4/2 − 40/6 + 16/2 − 32/10) = 3 − 10 + 12 − 24/5 = 1/5

and the standard deviation as its root: sX = √(1/5) ≈ 0.4472.

2. Measures to describe statistical distributions 64


Variance and standard deviation

Properties of the variance

1. The variance is always greater than or equal to zero:

   s²X ≥ 0

2. Shifting all values of a data set X by the constant value a leaves the variance
   unchanged:

   yi := xi + a   ⇒   s²Y = s²X

3. Multiplying all values of a data set X by a constant factor b multiplies the
   variance by the square of this factor:

   zi := b · xi   ⇒   s²Z = b² · s²X

   Note: sZ = |b| · sX

2. Measures to describe statistical distributions 65


Variance and standard deviation

Steiner's translation theorem

For each constant d ∈ ℝ it holds that

(1/n) ∑_{j=1}^{n} (xj − x̄)² = (1/n) ∑_{j=1}^{n} (xj − d)² − (x̄ − d)²        (1)

where (x̄ − d) is the shift (= translation) from the mean.

Properties of the variance (contd.)

4. In the special case of d = 0, the following formula for the simplified calculation of
   the variance is obtained:

   s²X = (1/n) ∑_{j=1}^{n} xj² − x̄²        (2)

   (the mean of the squared values minus the squared mean)

Exercise: Use formula (2) to recalculate the variances of the preceding examples.
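A Python sketch of the shortcut formula (2), re-checking the 20-value working-table example from above:

```python
data = [3, 5, 9, 9, 6, 6, 3, 7, 7, 6, 7, 6, 5, 7, 6, 9, 6, 5, 3, 5]
n = len(data)

mean = sum(data) / n
mean_of_squares = sum(x * x for x in data) / n

var_shortcut = mean_of_squares - mean ** 2             # formula (2)
var_direct = sum((x - mean) ** 2 for x in data) / n    # definition formula

print(var_shortcut, var_direct)                        # 3.1 3.1
```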

2. Measures to describe statistical distributions 66


Minimum property of the variance

Since in the translation theorem (1) the term (x̄ − d)² can never be negative, for every d ≠ x̄ it
always holds that

(1/n) ∑_{j=1}^{n} (xj − x̄)² < (1/n) ∑_{j=1}^{n} (xj − d)².

This means that the average quadratic deviation from the arithmetic mean x̄ is always smaller than
the average quadratic deviation from any other value d (minimum property). Multiplying the inequality
by n, we get for the sum of the squared deviations from any d ∈ ℝ:

SSE(d) := ∑_{j=1}^{n} (xj − d)² ≥ ∑_{j=1}^{n} (xj − x̄)².

That is, SSE becomes minimal at x̄. This provides us with an alternative definition of the mean:

Definition:
SSE(d) → min over d

⇒ Principle of least squares.

66 - 1
Variance and standard deviation

Definition:
The quotient of the standard deviation and the absolute value of the mean of a data set
with x̄ ≠ 0,

CVX := sX / |x̄|,

is called the coefficient of variation.

The coefficient of variation is a relative measure. It measures the dispersion relative to
the level or absolute size of the data set and thus makes data sets with different scales
comparable.

2. Measures to describe statistical distributions 67


Variance and standard deviation

Example:
Over a period of 250 trading days, the Volkswagen share price had a mean value of
174.56 € and a standard deviation of 10.28 €. For the same period, a standard deviation
of 4.68 € with a mean value of 36.96 € is determined for the BMW AG share. The two
coefficients of variation as a measure of the volatility of the share prices are as follows:

CVX = 10.28 € / 174.56 € = 0.0589   for VW and
CVY = 4.68 € / 36.96 €   = 0.1266   for BMW.

Thus, despite a lower absolute standard deviation, the BMW stock is more volatile in
relative terms.

2. Measures to describe statistical distributions 68


Five-point summary and box plot

The distribution of a data set can be analyzed quite well with only a few values. In
practice, one often uses the so-called five-point summary.

(xmin , x[0.25] , xMed , x[0.75] , xmax )

It divides the data set into four parts, so that each part contains about a quarter of the
observed values. It contains the median as a measure of location and the range and
interquartile range IQR as measures of variation.

Definition:
The graphical representation of the five-point summary is called a box plot.

[Figure: box plot — a box from x[0.25] to x[0.75] with the median xMed marked inside it, and
whiskers extending out to xmin and xmax]
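A Python sketch of the five-point summary, reusing the quantile convention from the quantile slides (note that library functions such as numpy.quantile use different interpolation rules by default):

```python
import math

def quantile(values, q):
    x = sorted(values)
    nq = len(x) * q
    if abs(nq - round(nq)) < 1e-9:
        k = int(round(nq))
        return 0.5 * (x[k - 1] + x[k])
    return x[math.ceil(nq) - 1]

def five_point_summary(values):
    x = sorted(values)
    return (x[0], quantile(x, 0.25), quantile(x, 0.5), quantile(x, 0.75), x[-1])

data = [4, 7, 7, 7, 12, 12, 13, 16, 19, 23, 23, 97]
print(five_point_summary(data))   # (4, 7.0, 12.5, 21.0, 97)
```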

2. Measures to describe statistical distributions 69


Control questions

1. What is the central property of the arithmetic mean?

2. When is the arithmetic and when is the geometric mean used and why?

3. How does the variance change if all values of a data set are converted from DM to
euro?

4. Describe the translation theorem as a property of the variance! What feature results
from the special case d = 0?

5. What is the minimum property of the variance? What does the principle of least
squares mean in this context?

6. What does the coefficient of variation mean? Which measure from portfolio theory in
business administration comes to your mind?

2. Measures to describe statistical distributions 70


3 Two dimensional distributions

3.1 Scatterplot and joint distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72


3.2 Marginal distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.3 Conditional distributions and statistical correlations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.4 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.5 Correlation coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.6 Rank correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

according to: [Schira, 2016], chapter 3


see also: [Anderson et al., 2012], chapter 3.5
and [Newbold et al., 2013], chapter 2.4
3. Two dimensional distributions 71
Scatterplot and joint distribution

Multi-dimensional statistics:
Each statistical unit ωi of a population Ω can have a variety of characteristics.

Definition:
Univariate statistics takes only one characteristic or variable into account.
Multivariate statistics considers several variables for each unit ωi:

X1(ωi), X2(ωi), . . . , Xm(ωi)

Example:
For a person ωi we measure the duration of education X1(ωi) and the income X2(ωi) five
years after the end of education.

3. Two dimensional distributions 72


Scatterplot and joint distribution

Most simple case: two variables X(ωi) and Y(ωi) are of interest.
The result is paired data (xi, yi) for each ωi.
These can be represented as points

P1 := (x1, y1),   P2 := (x2, y2),   . . . ,   Pn := (xn, yn)

in a scatter plot.

3. Two dimensional distributions 73


Scatterplot and joint distribution

Definition: The contingency table represents the joint distribution of the statistical
variables X and Y in a concise way.

        y1    y2   ...   yj   ...   yl   |  Σ
  x1    n11   n12  ...   n1j  ...   n1l  |  n1•
  x2    n21   n22  ...   n2j  ...   n2l  |  n2•
  ...                                    |  ...
  xi    ni1   ni2  ...   nij  ...   nil  |  ni•   (sum of row i)
  ...                                    |  ...
  xk    nk1   nk2  ...   nkj  ...   nkl  |  nk•
  -----------------------------------------------
  Σ     n•1   n•2  ...   n•j  ...   n•l  |  n     (column sums; n is the total sum)

Here
nij = absH(X = xi ∩ Y = yj) is the absolute frequency with which the combination
(xi, yj) was observed,
ni• = ∑_{j=1}^{l} nij and n•j = ∑_{i=1}^{k} nij are the absolute frequencies with which xi or yj was
observed. ⇒ marginal frequencies

3. Two dimensional distributions 74


Scatterplot and joint distribution

Real life example: Routes of three soccer players at Bayern Munich

Task: Match the following players to their respective contingency tables:


1. Thomas Müller
2. Franck Ribéry
3. Arjen Robben

[Figure: three heat maps, labelled A, B, and C]

This representation of the contingency table is also called a heat map.


Solution: 1B, 2C, 3A

3. Two dimensional distributions 75


Scatterplot and joint distribution

A representation with relative frequencies is also common. For this purpose, the
absolute frequencies, including the marginal frequencies, are divided by n.

        y1    y2   ...   yj   ...   yl   |  Σ
  x1    h11   h12  ...   h1j  ...   h1l  |  h1•
  x2    h21   h22  ...   h2j  ...   h2l  |  h2•
  ...                                    |  ...
  xi    hi1   hi2  ...   hij  ...   hil  |  hi•   (sum of row i)
  ...                                    |  ...
  xk    hk1   hk2  ...   hkj  ...   hkl  |  hk•
  -----------------------------------------------
  Σ     h•1   h•2  ...   h•j  ...   h•l  |  1     (column sums; total sum is 1)

Here
hij = relH(X = xi ∩ Y = yj) is the relative frequency with which the combination
(xi, yj) was observed,
hi• = ∑_{j=1}^{l} hij and h•j = ∑_{i=1}^{k} hij are the relative frequencies with which xi or yj was
observed. ⇒ marginal frequencies

3. Two dimensional distributions 76


Marginal distributions

Definition:
The one-dimensional distributions

hi• = relH(X = xi) = ni• / n,   i = 1, . . . , k,

and

h•j = relH(Y = yj) = n•j / n,   j = 1, . . . , l,

are called marginal distributions of the statistical variables X and Y.

3. Two dimensional distributions 77


Marginal distributions

Calculation of mean and variance:

The mean and variance of the individual components X and Y of two- or multidimensional
random variables are easily calculated using the marginal distributions (for cardinally
measurable variables):

x̄ = ∑_{i=1}^{k} hi• xi              ȳ = ∑_{j=1}^{l} h•j yj

s²X = ∑_{i=1}^{k} hi• (xi − x̄)²      s²Y = ∑_{j=1}^{l} h•j (yj − ȳ)²

3. Two dimensional distributions 78


Marginal distributions
Example

Abstract calculation example for a two-dimensional frequency distribution:

Characteristic values for X: x1 = 30, x2 = 60; and for Y: y1 = 1, y2 = 2, y3 = 4

Observed data: (30, 1), (30, 2), (60, 4), (30, 2), (60, 1), (30, 4), . . . , (60, 2).

Sort and count: 24 × (30, 1), 24 × (30, 2), 32 × (30, 4), . . . , 68 × (60, 4)

Contingency table:

            Y = 1   Y = 2   Y = 4  |  Σ
  X = 30      24      24      32   |   80
  X = 60      16      36      68   |  120
  ----------------------------------------
  Σ           40      60     100   |  200

3. Two dimensional distributions 79


Marginal distributions
Example

The relative frequencies are obtained by dividing all values by n = 200:

            Y = 1   Y = 2   Y = 4  |  Σ  (marginal distribution of X)
  X = 30     0.12    0.12    0.16  |  0.4
  X = 60     0.08    0.18    0.34  |  0.6
  ------------------------------------------
  Σ          0.2     0.3     0.5   |  1

(The bottom row is the marginal distribution of Y.)
3. Two dimensional distributions 80


Marginal distributions
Example

The marginal distribution for X:

  xi  | 30   60
  hi• | 0.4  0.6

x̄ = ∑_{i=1}^{k} hi• xi = 48
s²X = ∑_{i=1}^{k} hi• (xi − x̄)² = 216

The marginal distribution for Y:

  yj  | 1    2    4
  h•j | 0.2  0.3  0.5

ȳ = ∑_{j=1}^{l} h•j yj = 2.8
s²Y = ∑_{j=1}^{l} h•j (yj − ȳ)² = 1.56
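A Python sketch of these calculations, starting from the absolute contingency table of the example:

```python
x_values = [30, 60]
y_values = [1, 2, 4]
counts = [[24, 24, 32],      # row X = 30
          [16, 36, 68]]      # row X = 60
n = sum(sum(row) for row in counts)                      # 200

# Relative joint frequencies h_ij and the two marginal distributions
h = [[nij / n for nij in row] for row in counts]
h_x = [sum(row) for row in h]                                                      # [0.4, 0.6]
h_y = [sum(h[i][j] for i in range(len(x_values))) for j in range(len(y_values))]   # [0.2, 0.3, 0.5]

x_bar = sum(hi * xi for hi, xi in zip(h_x, x_values))                    # 48.0
y_bar = sum(hj * yj for hj, yj in zip(h_y, y_values))                    # 2.8
var_x = sum(hi * (xi - x_bar) ** 2 for hi, xi in zip(h_x, x_values))     # 216.0
var_y = sum(hj * (yj - y_bar) ** 2 for hj, yj in zip(h_y, y_values))     # 1.56

print(x_bar, var_x, y_bar, var_y)
```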

3. Two dimensional distributions 81


Conditional distributions and statistical correlations

We now consider the distribution of X, given that (conditional on) Y has a fixed value yj.

Definition:
Normalizing the columns of the contingency table to a column sum of 1 leads to a total
of l one-dimensional distributions for j = 1, . . . , l. These are called conditional distri-
butions of X (conditional on Y = yj),

hi|Y=yj = relH(X = xi | Y = yj) = hij / h•j.

Similarly, normalizing the rows to a row sum of 1 for i = 1, . . . , k leads to the conditional
distributions of Y (conditional on X = xi),

hj|X=xi = relH(Y = yj | X = xi) = hij / hi•.

3. Two dimensional distributions 82


Conditional distributions and statistical correlations

Example:
For the joint distribution of the previous numerical example, there are three conditional
distributions of X and the marginal distribution of X:

  X   | hi|Y=1   hi|Y=2   hi|Y=4  |  hi•
  30  |  0.60     0.40     0.32   |  0.4
  60  |  0.40     0.60     0.68   |  0.6
  Σ   |  1        1        1      |  1

. . . and two conditional distributions of Y and the marginal distribution of Y:

  Y        | 1      2      4     |  Σ
  hj|X=30  | 0.300  0.300  0.400 |  1
  hj|X=60  | 0.133  0.300  0.567 |  1
  h•j      | 0.2    0.3    0.5   |  1

Observation: The conditional distributions differ. This gives an indication of a
dependence of the statistical variables X and Y.

3. Two dimensional distributions 83


Conditional distributions and statistical correlations

Definition:
If the joint distribution hij of the statistical variables X and Y is equal to the product of
the two marginal distributions,

hij = hi• · h•j    for i = 1, . . . , k and j = 1, . . . , l,

then X and Y are called statistically independent.

Otherwise, there is a statistical correlation. We can distinguish between linear and
nonlinear statistical correlations.
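A small Python check of this product criterion for the numerical example above (it fails, so X and Y are statistically dependent):

```python
h = [[0.12, 0.12, 0.16],
     [0.08, 0.18, 0.34]]
h_x = [sum(row) for row in h]                               # [0.4, 0.6]
h_y = [sum(h[i][j] for i in range(2)) for j in range(3)]    # [0.2, 0.3, 0.5]

independent = all(
    abs(h[i][j] - h_x[i] * h_y[j]) < 1e-12
    for i in range(2) for j in range(3)
)
print(independent)   # False: e.g. h_11 = 0.12, but h_1. * h_.1 = 0.4 * 0.2 = 0.08
```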

3. Two dimensional distributions 84


Conditional distributions and statistical correlations

Properties of independent statistical variables


For independent statistical variables, the conditional distributions are identical and
equal to the marginal distribution.

Thus, for all j = 1, . . . , l conditional distributions of X , it holds that

hi |Y =yj = hi • , i = 1, . . . , k

and for all i = 1, . . . , k conditional distributions of Y

hj |X =xi = h•j , j = 1, . . . , l

3. Two dimensional distributions 85


Conditional distributions and statistical correlations

Practical example: stock returns in the US (since 1963)

Are daily stock returns on announcement days (AD = FED meetings and labor statistics)
different from non-announcement days (ND)?

absolute frequency
        ≤−2 %   (−2 %,−1 %]   (−1 %,0 %]   (0 %,1 %]   (1 %,2 %]   >2 %   |   Σ
  AD      27        63           305         385          111        31   |    922
  ND     286       963          4341        5349         1003       264   |  12206
  Σ      313      1026          4646        5734         1114       295   |  13128

relative frequency
        ≤−2 %   (−2 %,−1 %]   (−1 %,0 %]   (0 %,1 %]   (1 %,2 %]   >2 %   |   Σ
  AD    0.0021     0.0048        0.0232      0.0293       0.0085   0.0024 |  0.0702
  ND    0.0218     0.0734        0.3307      0.4074       0.0764   0.0201 |  0.9298
  Σ     0.0238     0.0782        0.3539      0.4368       0.0849   0.0225 |  1

3. Two dimensional distributions 86


Conditional distributions and statistical correlations

conditional distribution
           ≤−2 %    (−2 %,−1 %]   (−1 %,0 %]   (0 %,1 %]   (1 %,2 %]   >2 %   |  Σ
  hj|X=AD   0.0293      0.0683       0.3308       0.4176      0.1204   0.0336 |  1
  hj|X=ND   0.0234      0.0789       0.3556       0.4382      0.0822   0.0216 |  1
  h•j       0.0238      0.0782       0.3539       0.4368      0.0849   0.0225 |  1

Daily returns in the U.S. are not independent from AD or ND!

3. Two dimensional distributions 87


Conditional distributions and statistical correlations

conditional distribution
           ≤−2 %    (−2 %,−1 %]   (−1 %,0 %]   (0 %,1 %]   (1 %,2 %]   >2 %   |  Σ
  hj|X=AD   0.0293      0.0683       0.3308       0.4176      0.1204   0.0336 |  1
  hj|X=ND   0.0234      0.0789       0.3556       0.4382      0.0822   0.0216 |  1
  h•j       0.0238      0.0782       0.3539       0.4368      0.0849   0.0225 |  1

Daily returns in the U.S. are not independent from AD or ND!

Question: Given independence, how would the joint distribution look if we keep the
marginal distributions the same?

relative frequency
        ≤−2 %   (−2 %,−1 %]   (−1 %,0 %]   (0 %,1 %]   (1 %,2 %]   >2 %   |   Σ
  AD      ?          ?             ?            ?           ?         ?   |  0.0702
  ND      ?          ?             ?            ?           ?         ?   |  0.9298
  Σ     0.0238     0.0782        0.3539      0.4368       0.0849   0.0225 |  1

3. Two dimensional distributions 87


Conditional distributions and statistical correlations

Mean of the sum and the difference:

The elements ωi, i = 1, . . . , n, of a statistical mass Ω of extent n have been analyzed for
two characteristics, and the statistical variables xi = X(ωi) and yi = Y(ωi) have been
collected as paired data.
From both variables, both the means and the variances have been calculated.
It holds that:

The mean value of a sum (difference) is equal to the sum (difference) of the mean
values:

(x + y)‾ = x̄ + ȳ        (x − y)‾ = x̄ − ȳ

(where (·)‾ denotes the arithmetic mean of the pairwise sums or differences).

This is true regardless of the joint distribution, and equally true for statistically
independent as well as statistically dependent variables.

3. Two dimensional distributions 88


Conditional distributions and statistical correlations

Variance of the sum and the difference:

The variance is calculated by applying the binomial formula:

Variance of the sum:

s²X+Y = s²X + s²Y + 2 · (1/n) ∑_{j=1}^{n} (xj − x̄)(yj − ȳ)

Variance of the difference:

s²X−Y = s²X + s²Y − 2 · (1/n) ∑_{j=1}^{n} (xj − x̄)(yj − ȳ)

Special case:

s²X±Y = s²X + s²Y,   if cXY := (1/n) ∑_{j=1}^{n} (xj − x̄)(yj − ȳ) = 0

3. Two dimensional distributions 89


Covariance

Definition:
The quantity calculated from the n pairs of values (xi, yi),

cXY := (1/n) ∑_{j=1}^{n} (xj − x̄)(yj − ȳ),

is called the empirical covariance or, in short, the covariance between the statistical
variables X and Y.

Simplified calculation:

cXY = (1/n) ∑_{j=1}^{n} xj · yj − x̄ · ȳ = (x·y)‾ − x̄ · ȳ

3. Two dimensional distributions 90


Covariance

Illustration of the covariance:

3. Two dimensional distributions 91


Covariance

The covariance can also be calculated using the relative frequencies from the contin-
gency table:

cXY = ∑_{i=1}^{k} ∑_{j=1}^{l} hij (xi − x̄)(yj − ȳ)

Simplified calculation:

cXY = ∑_{i=1}^{k} ∑_{j=1}^{l} hij xi yj − x̄ · ȳ = (x·y)‾ − x̄ · ȳ

3. Two dimensional distributions 92


Covariance
Covariance and dependency

Proposition :
If two variables X and Y are statistically independent, then the covariance cXY between
them is zero.

This proposition is not reversible because the covariance measures only the linear part of
the statistical dependence.

correct:    X and Y are independent  ⇒  cXY = 0

correct:    cXY ≠ 0  ⇒  X and Y are dependent

incorrect:  cXY = 0  ⇒  X and Y are independent

3. Two dimensional distributions 93


Correlation coefficient

Definition:
The ratio

rXY := cXY / (sX · sY)

is called the (empirical) correlation coefficient between X and Y.

Scatter plots and correlation coefficients:

[Figure: three scatter plots with correlation coefficients rXY = 0.97, rXY = −0.52, and rXY = 0.06]

3. Two dimensional distributions 94


Correlation coefficient

Properties:

The correlation coefficient represents a normalized measure of the strength of the
linear statistical relationship:

−1 ≤ rXY ≤ 1

The absolute value of the correlation coefficient remains unchanged if one or both
variables are transformed linearly. Suppose that

U := a1 + b1 X   and   V := a2 + b2 Y   with b1, b2 ≠ 0.

Then we obtain

rUV = cUV / (sU · sV) = (b1 · b2 · cXY) / (|b1| sX · |b2| sY) = (b1 · b2) / (|b1| · |b2|) · rXY.

This means that |rUV| = |rXY|.

3. Two dimensional distributions 95


Correlation coefficient

Example: Calculation of the correlation coefficient

For the joint distribution from the numerical example (slide 79 ff.), one obtains for the
covariance, using the simplified calculation approach,

cXY = ∑_{i=1}^{2} ∑_{j=1}^{3} hij (xi − x̄)(yj − ȳ) = ∑_{i=1}^{2} ∑_{j=1}^{3} hij xi yj − x̄ · ȳ = (x·y)‾ − x̄ · ȳ
    = 138 − 48 · 2.8 = 138 − 134.4 = 3.6

The correlation coefficient is thus

rXY = 3.6 / (√216 · √1.56) = 0.1961,

which indicates a weak positive correlation.
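A Python sketch reproducing this calculation from the relative contingency table of the example:

```python
import math

x_values, y_values = [30, 60], [1, 2, 4]
h = [[0.12, 0.12, 0.16],
     [0.08, 0.18, 0.34]]

h_x = [sum(row) for row in h]
h_y = [sum(h[i][j] for i in range(2)) for j in range(3)]
x_bar = sum(hi * x for hi, x in zip(h_x, x_values))               # 48
y_bar = sum(hj * y for hj, y in zip(h_y, y_values))               # 2.8

mean_xy = sum(h[i][j] * x_values[i] * y_values[j]
              for i in range(2) for j in range(3))                # 138
cov = mean_xy - x_bar * y_bar                                     # 3.6

s_x = math.sqrt(sum(hi * (x - x_bar) ** 2 for hi, x in zip(h_x, x_values)))   # sqrt(216)
s_y = math.sqrt(sum(hj * (y - y_bar) ** 2 for hj, y in zip(h_y, y_values)))   # sqrt(1.56)

print(cov, cov / (s_x * s_y))   # ~3.6 and ~0.1961 (up to floating-point rounding)
```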

3. Two dimensional distributions 96


Correlation coefficient
Examples

[Figure: three scatter plots — body weight against body height, goals against vs. goals
scored, and age of trainer vs. league position]

Note: The covariance or the correlation coefficient does not necessarily mean a causal
relationship between the characteristics. The available observations merely show a
statistical tendency, which could also be purely by chance.

Correlation of the day: www.correlated.org :-)

3. Two dimensional distributions 97


Correlation vs. causality

[Figure: scatter plot of son's height (inch) against father's height (inch); below it a diagram
linking father's height and son's height by the observed correlation, with the question
„causality?“ attached to the arrows in both directions]

3. Two dimensional distributions 98


Rank correlation

Besides the correlation coefficient according to Bravais-Pearson there is another one,
namely the one according to Spearman, also called the rank correlation coefficient.

Definition:
The rank correlation coefficient or correlation coefficient named after Charles Edward
Spearman,

r^Sp_XY := r_{rg(X), rg(Y)},

is the correlation coefficient between the ranks of the observations.

It is used for ordinally scaled characteristics.

3. Two dimensional distributions 99


Rank correlation
Example school grades

The following table shows the results of the Abitur examinations of ten students in the
subjects German (feature G) and History (feature H). The maximum achievable score is
15 in each case.

Pupil i   German (G)   History (H)   rg(G)      rg(H)
   1          13           15          4          1
   2          14            8          2.5 (2)    4 (3)
   3           8            1          9         10
   4          10            7          7          6.5 (6)
   5          15            9          1          2
   6           1            5         10          9
   7          14            8          2.5 (3)    4 (4)
   8          12            7          5          6.5 (7)
   9           9            6          8          8
  10          11            8          6          4 (5)

3. Two dimensional distributions 100


Rank correlation
Example school grades

Question: Are the grades correlated? Does good performance in German go along
with good knowledge of history?
First, we determine the rankings for each student in each of the two subjects. To do this,
we arrange the students according to the results they obtained in the subjects. Students
with the same result are assigned the arithmetic mean of those rankings they would have
received if they had been arranged randomly (given in parentheses in each case). This
may result in rankings like 2.5 or 6.5.
Then we compute variances, standard deviations, and the covariance of the ranks and
obtain, with

r^Sp_GH = 6.95 / (2.8636 · 2.8284) = 0.8581,

a fairly strong positive correlation, which was to be expected.
(Compare: the ordinary correlation coefficient of the scores is rGH = 0.549.)
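A Python sketch of the rank correlation with average ranks for ties, hand-rolled to mirror the procedure on this slide. (Because the standard deviations quoted above are rounded, the value obtained here can differ from 0.8581 in the third decimal.)

```python
def average_ranks(scores):
    """Rank 1 for the best score; tied scores get the average of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg = (pos + 1 + end + 1) / 2            # average of the 1-based positions
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = (sum((a - mx) ** 2 for a in x) / n) ** 0.5
    sy = (sum((b - my) ** 2 for b in y) / n) ** 0.5
    return cov / (sx * sy)

german  = [13, 14, 8, 10, 15, 1, 14, 12, 9, 11]
history = [15, 8, 1, 7, 9, 5, 8, 7, 6, 8]

print(pearson(average_ranks(german), average_ranks(history)))  # ~0.857 (Spearman)
print(pearson(german, history))                                # ~0.549 (Bravais-Pearson)
```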

Question: When should rank correlation be preferred to normal correlation?

3. Two dimensional distributions 101


Control questions

1. What is the difference between univariate and multivariate statistics? Think about an
example of bivariate statistics.

2. What is the structure and function of contingency tables? Are there also contingency
tables for more than two characteristics?

3. How many marginal distributions does a 3-dimensional statistical distribution have?

4. When is the variance of a sum smaller than the sum of the variances?

5. What is statistical independence? What is the relationship between covariance and


independence?

6. What is the meaning of the correlation coefficient? Does an empirical correlation


coefficient of 0 imply that there is no factual relationship between the characteristics
under consideration?

7. What is a rank correlation? How do you measure it?

3. Two dimensional distributions 102


4 Linear regression

4.1 The regression line . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104


4.2 Properties of regression lines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.3 Nonlinear and multiple regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .118

according to: [Schira, 2016], chapter 4
see also: [Anderson et al., 2012], chapter 14; and [Newbold et al., 2013], chapter 11

4. Linear regression 103
The regression line

Correlation and regression calculation:


Covariance and correlation coefficient are merely summary measures. In correlation analysis, the
statistical variables (X , Y ) are treated as completely equal.
The regression calculation goes one step further: The average linear relationship
between the characteristic values of a two-dimensional statistical variable (X , Y ) is now
to be represented by a linear function, i.e. a straight line

y = a + bx

in the scatter plot. Here we distinguish between an (in the mathematical sense)
independent variable X and a dependent variable Y .

This straight line is supposed to be a mean straight line, that is, it is supposed to pass
through the observed characteristic values (xi , yi ) in such a way that it indicates the
location and main direction of the point cloud in the scatter plot.

4. Linear regression 104


The regression line
Example

[Figure: scatter plot of the data and the fitted regression line]

4. Linear regression 105


The regression line

[Figure: point cloud and straight line y = a + bx in the scatter plot; the vertical distance of each point from the line is the „deviation“ ei]

The method of least squares (LSM) uniquely assigns a mean straight line to the scatter
plot.

4. Linear regression 106


The regression line

Idea: We want to explain the observed values of Y as well as possible by the values of X .

For the observed values it holds that

yi = a + bxi + ei

The y -values on the linear regression line are

ŷi := y (xi ) = a + bxi

Vertical deviation
ei := yi − ŷi

Question: What does „as good as possible“ mean? How to determine the regression
line?
It would be possible to minimize the deviations ei, the sum of the deviations e1 + e2 + · · · + en, or
even the sum of the absolute values of the deviations |e1| + |e2| + · · · + |en|. Unfortunately, none of
these approaches leads to a useful calculation rule or to a unique determination of the straight line.

4. Linear regression 107


The regression line

Determination of the regression line:


Minimization of the sum of squared errors (SSE)
    SSE(a, b) := Σ_{j=1}^n ej² = Σ_{j=1}^n (yj − ŷj)² = Σ_{j=1}^n (yj − a − bxj)²

This means that the straight line is placed within the point cloud in such a way that SSE
reaches the smallest possible value for the corresponding parameters a and b.
The algebraic solution of this minimization task

    SSE(a, b) −→ min (over a and b)        (3)

leads to

4. Linear regression 108


The regression line

Definition:
Let (x1, y1), (x2, y2), . . . , (xn, yn) be observed pairs of values of a two-dimensional statistical
variable (X , Y ) and let sX > 0. The straight line

y (x ) = a + bx

with the coefficients


    b = cXY / sX²   and   a = ȳ − b x̄
is called the regression line of a regression from Y to X . The y values belonging to the
individual xi values on the regression line are called regression values.

4. Linear regression 109


Determination of the regression line
Minimization of the sum of squared errors (SSE)

SSE(a, b) −→ min (over a and b)

(Partial) differentiation with respect to a and b and setting to zero yields the two so-called normal
equations

    ∂/∂a SSE(a, b) = Σ_{j=1}^n 2(yj − a − bxj) · (−1)  = 0
    ∂/∂b SSE(a, b) = Σ_{j=1}^n 2(yj − a − bxj) · (−xj) = 0

Transforming both normal equations leads to

    Σ_{j=1}^n (yj − a − bxj) = 0 ,
    Σ_{j=1}^n (yj − a − bxj) xj = 0 .

109 - 1
Splitting the sums results in

    Σ_{j=1}^n yj − a·n − b·Σ_{j=1}^n xj = 0 ,
    Σ_{j=1}^n xj yj − a·Σ_{j=1}^n xj − b·Σ_{j=1}^n xj² = 0

and thus after dividing by n

    ȳ − a − b x̄ = 0 ,
    (1/n) Σ_j xj yj − a x̄ − b (1/n) Σ_j xj² = 0 .

Using the definitions of variance and covariance

    sX² = (1/n) Σ_j xj² − x̄²   and   cXY = (1/n) Σ_j xj yj − x̄ ȳ

we obtain after solving the linear system of equations with respect to a and b

    a = ȳ − b x̄ ,
    b = cXY / sX² .

109 - 2
Properties of regression lines

1. mean line: The regression line passes through the center of mass (x̄ , ȳ ) of the
point cloud:
ȳ = a + bx̄ = y (x̄ ) .
The sum of the deviations ej and thus their mean value are zero,
    Σ_{j=1}^n (yj − a − bxj) = Σ_{j=1}^n ej = 0 = ē .        (4)

Furthermore

    Σ_{j=1}^n ej xj = 0   and   Σ_{j=1}^n ej ŷj = 0 .        (5)

2. minimization of variance: The variance of the deviations from the regression line

    sE² = (1/n) Σ_{j=1}^n (ej − ē)² = (1/n) Σ_{j=1}^n ej² = (1/n) SSE(a, b)        (6)

is identical to the sum of squared errors except for the factor 1/n. This means that the
regression line minimizes the variance of the deviations.

4. Linear regression 110


Properties of regression lines

3. slope and correlation: The slope of the regression line is

    b = cXY / sX² = (cXY / (sX · sY)) · (sY / sX) = rXY · sY / sX

and becomes flatter the smaller the correlation is in absolute value.

[Figure: regression lines becoming flatter with decreasing correlation]

4. Linear regression 111


Properties of regression lines

4. variance decomposition: The total variance of Y can be decomposed into two parts:

    sY² = sŶ² + sE²

namely into the variance of the regression values and the variance of the
deviations. The variance of the regression values measures the share of the
variation in Y that is described or explained by the variation of the independent
variable X . The variance of the deviations measures the share of the total variance
not explained by the variation in X . Thus, the explained variance is smaller than the
total variance, at least as long as there are deviations.

4. Linear regression 112


Derivation of 4. The identity

(yi − ȳ ) ≡ (ŷi − ȳ ) + (yi − ŷi ) .

holds true. If we square both sides, we obtain using ei = yi − ŷi :

    (yi − ȳ)² = ((ŷi − ȳ) + (yi − ŷi))²
              = (ŷi − ȳ)² + (yi − ŷi)² + 2(ŷi − ȳ)(yi − ŷi)
              = (ŷi − ȳ)² + ei² + 2(ŷi ei − ȳ ei)

If we sum over all i, we get

    Σ_{i=1}^n (yi − ȳ)² = Σ_{i=1}^n (ŷi − ȳ)² + Σ_{i=1}^n ei² + 2 Σ_{i=1}^n ŷi ei − 2ȳ Σ_{i=1}^n ei
                        = Σ_{i=1}^n (ŷi − ȳ)² + Σ_{i=1}^n ei² ,

since the last two sums vanish by (5) and (4), respectively.

Now we divide both sides by n and obtain with (6) the desired equation

    sY² = sŶ² + sE² .

112 - 1
Properties of regression lines

Just as the correlation coefficient is a standardized measure in correlation analysis (i.e.,
independent of the units of measurement and the magnitudes of the statistical variables), we
would like to have a standardized measure in regression as well.

Definition:
The ratio of the variance explained in a linear regression to the total variance of the
dependent variable Y
    R² := sŶ² / sY²
is called the coefficient of determination of the linear regression.

The larger R² is, the better the fit of the regression line to the point cloud. It is
therefore used as a measure of goodness of fit.

Properties:

    0 ≤ R² ≤ 1   and   R² = (cXY / (sX · sY))² = rXY² .

4. Linear regression 113


Example

A sample of ten randomly selected monthly sales of a large bakery with the corresponding
marketing expenditures provided the following data (in thousands of €):

Revenues yj 201 184 220 240 180 164 186 150 182 210
Marketing xj 24 16 20 26 14 16 20 12 18 22

Marketing expenses X and revenues Y

4. Linear regression 114


Example
(continued)

We calculate using the following working table:

i xi yi xi2 xi yi yi2

1 24 201 576 4824 40 401


2 16 184 256 2944 33 856
3 20 220 400 4400 48 400
4 26 240 676 6240 57 600
5 14 180 196 2520 32 400
6 16 164 256 2 624 26 896
7 20 186 400 3 720 34 596
8 12 150 144 1 800 22 500
9 18 182 324 3 276 33 124
10 22 210 484 4 620 44 100

Σ 188 1917 3 712 36 968 373 873

4. Linear regression 115


Example
(continued)

This results in the following values:


mean values:               x̄ = 188/10 = 18.8
                           ȳ = 1917/10 = 191.7
variances:                 sX² = (1/n) Σ xj² − x̄² = 3712/10 − 18.8² = 17.76
                           sY² = (1/n) Σ yj² − ȳ² = 373 873/10 − 191.7² = 638.41
covariance:                cXY = (1/n) Σ xj yj − x̄ ȳ = 36 968/10 − 18.8 · 191.7 = 92.84
correlation coefficient:   rXY = cXY / (sX · sY) = 92.84 / √(17.76 · 638.41) = 0.872
The correlation coefficient confirms the presumed strong positive correlation.

4. Linear regression 116


Example
(continued)

This yields the following values:


slope:         b = cXY / sX² = 92.84 / 17.76 = 5.2274
y-intercept:   a = ȳ − b x̄ = 191.7 − 5.2274 · 18.8 = 93.4249

and thus the regression line

y = 93.4249 + 5.2274 · x

Interpretation?
The coefficient of determination is

    R² = (rXY)² = 0.760 .
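
The whole calculation can be reproduced with a short Python sketch (the data and the 1/n variance convention are taken from the working table above; the variable names are our own):

import numpy as np

x = np.array([24, 16, 20, 26, 14, 16, 20, 12, 18, 22])               # marketing
y = np.array([201, 184, 220, 240, 180, 164, 186, 150, 182, 210])     # revenues

x_bar, y_bar = x.mean(), y.mean()      # 18.8 and 191.7
s2_x = (x**2).mean() - x_bar**2        # 17.76  (variance with factor 1/n)
s2_y = (y**2).mean() - y_bar**2        # 638.41
c_xy = (x*y).mean() - x_bar*y_bar      # 92.84

r  = c_xy / np.sqrt(s2_x * s2_y)       # 0.872
b  = c_xy / s2_x                       # 5.2274
a  = y_bar - b * x_bar                 # 93.4249
R2 = r**2                              # 0.760

print(a, b, r, R2)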

4. Linear regression 117


Nonlinear and multiple regression
Regression line: The assumption of linearity can be a good approximation to a nonlinear reality.
However, it must be abandoned if the data in the scatter plot very clearly suggest a nonlinear relationship.

Question: Which types of functions should you choose then?

Suitable, for example, are nonlinear functions that can be transformed into linear functions by simple
transformation, such as:

Logarithmic approaches
Logarithmic linear approach
Semi-logarithmic approach
Quadratic approaches

If a relationship between more than two variables is to be established, the so-called multiple regression is used.

A brief description of these approaches is given below.

117 - 1
Logarithmic approaches

Definition: The logarithmic linear approach formulates a linear relationship not between the data
itself, but between the logarithms of the data:

log y = a + b log x

Transforming back, we obtain a power function as a correlation between the originally observed
values
    y = a* · x^b .

The coefficients of this regression are calculated with the already known formulas, but before that the
initial data have to undergo a transformation and the logarithms of the observed values have to be
taken.

Definition: With the so-called semi-logarithmic approach, only one of the two variables is logarith-
mically transformed:
log y = a + bx
Transforming back, we obtain as a correlation between the originally observed values an exponen-
tial function
    y = a* · e^(bx) .
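
As an illustration of how such transformations are used in practice, here is a small Python sketch (the data are made up for this example and are assumed to follow a power law; they are not part of the script):

import numpy as np

# Hypothetical observations, roughly following y = a* x^b
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 7.9, 18.2, 31.5, 50.4, 71.8])

# Logarithmic linear approach: log y = a + b log x, fitted with the usual LSM formulas
lx, ly = np.log(x), np.log(y)
b = np.cov(lx, ly, bias=True)[0, 1] / lx.var()   # slope b
a = ly.mean() - b * lx.mean()                    # intercept a
a_star = np.exp(a)                               # back-transformed prefactor

print(f"y ≈ {a_star:.3f} * x^{b:.3f}")

# The semi-logarithmic approach (log y = a + b x) works the same way:
# only y is transformed, and the back-transform is y = a* e^(bx).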

117 - 2
Quadratic approaches

Definition: In the quadratic approach, the relationship between X and Y is formulated as a 2nd degree polynomial:

    y = a + b1 x + b2 x² .

Using the observed values, the three coefficients a, b1 and b2 are calculated with the LSM. For
this, the method of multiple regression is used (see below). The variables x and x 2 are treated
mathematically as two different variables, although they are not, of course.

[Figure: regression parabolas]

The quadratic approaches have the advantage that they can represent relationships whose direction
reverses. This is useful when the relationship not only weakens with increasing x values, but also
changes its sign, as illustrated in the figure (example from happiness research: young and old people
are happier than people in middle age – midlife crisis).

117 - 3
Example: A farmer measures the statistical relationship between the use of
fertilizer and crop yield. He conducts experiments with different fertilizer rates
on 14 sections of his corn acreage.

fertilizer   corn
    15       1800
    30       3600
    45       6840
    60       7200
    75       8100
    90       8460
   105       8640
   120       9000
   135       9180
   150       9000
   165       8640
   180       8460
   195       8100
   210       7740

[Figure: scatter plot of corn yield vs. fertilizer with the fitted regression parabola]

Fitted regression parabola: y = 881 + 123x − 0.44x²        (values in kg/ha)

The maximum of this function is attained at x = 139.8 kg/ha. Whether this input is also optimal depends on the
prices for fertilizer and corn.
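
A quadratic least squares fit of these data can be sketched in Python as follows (numpy's polyfit solves the same minimization problem; because of rounding, the coefficients come out close to, but not exactly equal to, the rounded values quoted above):

import numpy as np

fertilizer = np.arange(15, 211, 15)                       # 15, 30, ..., 210 kg/ha
corn = np.array([1800, 3600, 6840, 7200, 8100, 8460, 8640,
                 9000, 9180, 9000, 8640, 8460, 8100, 7740])   # kg/ha

# Fit y = b2 x² + b1 x + a by least squares (coefficients returned highest degree first)
b2, b1, a = np.polyfit(fertilizer, corn, deg=2)
print(a, b1, b2)            # roughly 881 + 123x − 0.44x²

# Vertex of the parabola = fertilizer input with maximal predicted yield
x_max = -b1 / (2 * b2)
print(round(x_max, 1))      # about 139.8 kg/ha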

117 - 4
Multiple regression

In some cases, it is appropriate to represent the variation of a statistical variable Y as a function of
two other variables X1 and X2 , of the form:

yi = b0 + b1 x1i + b2 x2i + ei

To compute the three coefficients b0 , b1 , and b2 , one has to solve the corresponding minimization problem as in (3).

Now, this is no longer a regression line, but a regression plane in a three-dimensional coordinate system.

This principle can also be applied to more than two variables. The regression relationship would then be

yi = b0 + b1 x1i + b2 x2i + · · · + bk xki + ei .
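
In practice the least squares coefficients of such a model are obtained from the normal equations in matrix form. A minimal Python sketch with purely illustrative, made-up data (six observations would of course be far too few for a serious analysis):

import numpy as np

# Illustrative data: y explained by two regressors x1 and x2
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([3.1, 4.0, 7.2, 7.9, 11.1, 11.8])

# Design matrix with a column of ones for the constant term b0
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares solution of min ||y − X b||², the same principle as in (3)
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)          # estimated values for b0, b1, b2

y_hat = X @ b     # values on the regression plane
e = y - y_hat     # deviations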

117 - 5
Simplified Example:

Dependent variable
Y : Percentage of students enrolled in private schools

Explanatory variables
X1 : income
X2 : Percentage of population who have completed 4 or
more years of college

Typical output of a standard software package

[Figure: annotated regression output showing the estimated values for b0 , b1 , b2 and indicators for the significance of the coefficients]

117 - 6
Application-oriented and simplified explanation in anticipation of the last chapter.

The values for the coefficients are the respective least squares estimators for the constant term as
well as for the prefactors of the variables of the linear regression model. One can now ask whether
the true values for b0 , b1 , . . . are indeed significantly different from 0, that is, whether the
respective independent variables X1 , X2 , . . . or the constant term really have an influence on the
variable Y to be explained.

On the basis of a given data set, this cannot be determined with absolute certainty. But it is possible
(under certain assumptions) to provide probabilities for the parameters to be significantly different
from 0. For this purpose, a so-called test statistic (here the so-called t-quotient) is calculated and
compared with a reference value. The p value (so-called significance level) indicates the probability
with which one would erroneously assume significance of the respective parameter if it was actually 0.

For the exemplary calculation this means:


The probability of erroneously assuming that the constant term b0 is not equal to 0 is 16.1 %.
The probability of erroneously assuming that the factor b1 of the variable X1 (income) is not
equal to 0, in contrast, is only 0.0036 %.
The probability of incorrectly assuming that the factor b2 of variable X2 (higher education) is not
zero is 8.31 %.

The lower the p value, the more likely it is that the respective parameter is significant (different from
0). This is usually highlighted by different codings ***, **, *, to identify at a glance the parameters
with high or low significance.

117 - 7
Control questions

1. What properties should a straight line have that best describes the average linear
relationship between two variables?

2. What are the normal equations?

3. What is the principle of least squares?

4. What properties of the least squares method of calculating a regression line do you
know?

5. What is the relationship between the slope of the regression line and the correlation
coefficient?

6. What is the relationship between R 2 and the correlation coefficient? Which values
can R 2 attain (extrema), and which statements can thus be made about the statistical
correlation?

7. What is linear regression and what is nonlinear regression?

4. Linear regression 118


Part II – Probability Theory

5 Combinatorics and counting principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

6 Fundamentals of probability theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

7 Random variables in one dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169

8 Multidimensional random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210

9 Stochastic models and special distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237

10 Limit theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262

4. 119
5 Combinatorics and counting principles

5.1 Elementary counting principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121


5.2 Factorials and binomial coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.3 Fundamental principle of counting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.4 Permutations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.5 Variations and Combinations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

according to [Schira, 2016], chapter 7


see also: [Anderson et al., 2012], chapter 4; and [Newbold et al., 2013], chapter 3

5. Combinatorics and counting principles 120


Elementary counting principles
Selection and arrangement of objects

How many ways are there


to arrange n (distinct) elements within a sequence,

to select k elements out of a set of n elements?

5. Combinatorics and counting principles 121


Many problems concerning the arrangement or selection of elements of a given set lead to the basic
questions on how many ways it is possible,

to arrange n elements in a sequence or


to select k elements from a set containing n elements.

Examples

a) On a Saturday evening a student of Frankfurt School is looking for two friends, whom she
assumes to be at one of four parties. How many possibilities does she have to visit the four
parties one after another?
b) After class, seven students meet to play skat. How many possibilities do they have to form a
group of three players?

121 - 1
Factorials and binomial coefficients
To determine the number of permutations that can be achieved by n distinct objects, we may think of
n placeholders, where we successively put the n elements on.

To allocate the first placeholder n elements are at disposal. Therefore, there are n possibilities to
occupy the first place. For the second placeholder only n − 1 elements are left. Together with the n
possibilities for the first place we have n · (n − 1) possibilities to allocate the first two places. For the
third place there are n − 2 possibilities and so forth until for the allocation of the nth placeholder only
one possibility (the last remaining element) is left. In total there are n · (n − 1) · · · · · 1 possibilities
to arrange n elements in a sequence. The product n · (n − 1) · · · · · 1 of the first n natural numbers
is denoted by n! (read: „factorial of n“). Furthermore, it is convenient to set 0! = 1.

The student from example a) has got 4! = 1 · 2 · 3 · 4 = 24 possibilities to visit the four parties one
after another.

122 - 1
Factorials

For n ∈ N0 we define n! (read: „factorial of n“) by

    n! = ∏_{i=1}^n i = 1 · 2 · · · · · (n − 1) · n   for n ≥ 1,   and   0! = 1 .

Example: There are 5! = 1 · 2 · 3 · 4 · 5 = 120 different possibilities to arrange 5 distinct books on a shelf.

122 - 2
Table of values for factorials

Factorials grow very rapidly:

1! = 1 11! = 39 916 800


2! = 2 12! = 479 001 600
3! = 6 13! = 6 227 020 800
4! = 24 14! = 87 178 291 200
5! = 120 15! = 1 307 674 368 000
6! = 720 16! = 20 922 789 888 000
7! = 5 040 17! = 355 687 428 096 000
8! = 40 320 18! = 6 402 373 705 728 000
9! = 362 880 19! = 121 645 100 408 832 000
10! = 3 628 800 20! = 2 432 902 008 176 640 000

There are already more than 3.6 million possibilities to arrange only 10 distinct books on a shelf!
There are 20! ways to visit 20 cities one after another. The magnitude of this number is comparable
to the age of the universe measured in seconds (about 10^18)!

122 - 3
Calculation rules for factorials

For numbers n, k ∈ N0 and k ≤ n we have

(n + 1)! = n! · (n + 1) recursive calculation of factorials

    n!/k! = (n · (n − 1) · · · (k + 1) · k · · · 1) / (k · · · 1) = n · (n − 1) · · · (k + 1)        reduction of fractions

Examples:

    20!/18! = 19 · 20 = 380
    (10! · 5!) / (8! · 7!) = (9 · 10)/(6 · 7) = 15/7
    5!/(3! · 2!) = (4 · 5)/2! = 2 · 5 = 10
    ∏_{i=1}^5 (2i) = 2⁵ · 5! = 32 · 120 = 3840
122 - 4
Binomial coefficients

Definition:

For n, k ∈ N0 , k ≤ n we define the binomial coefficient

    (n choose k) := n! / (k! · (n − k)!) .

The binomial coefficient (read: „n choose k“) equals the number of subsets with k elements
of a given set with n elements.

122 - 5
Binomial coefficients
Examples:

• (5 choose 3) = 5!/(3! · (5 − 3)!) = 5!/(3! · 2!) = 10
  There are 10 different possibilities to choose 3 out of 5 distinct books.

• (9 choose 4) = 9!/(4! · 5!) = (9 · 8 · 7 · 6)/(4 · 3 · 2 · 1) = 126
  There are 126 possibilities to choose four out of nine different text books on statistics.

• (49 choose 6) = (49 · 48 · 47 · 46 · 45 · 44)/(6 · 5 · 4 · 3 · 2 · 1) = 13 983 816
  Number of possibilities for the lottery „6 out of 49“

122 - 6
If we ask for the number of possibilities to choose k elements out of a set with n elements, we may
proceed as follows. We find all subsets with k elements by taking only the first k elements from
each of the n! possible arrangements of the n-element set. In doing so, each subset with k elements
appears k!(n − k)! times on the first k places. Thus a set with n elements has n!/(k!(n − k)!)
subsets with k elements. This term is denoted as the binomial coefficient (read: „n choose k“).

Referring to example b), the students have (7 choose 3) = 7!/(3! · 4!) = (7 · 6 · 5)/(3 · 2 · 1) = 35 possibilities
to form a group of three players.

122 - 7
From its definition above we recognize the relationship

    (n choose k) = (n choose n − k)

as well as some particular cases:

    (n choose 0) = (n choose n) = 1 ,   (n choose 1) = (n choose n − 1) = n   and   (n choose 2) = (n choose n − 2) = n(n − 1)/2 .

Furthermore, we may validate the recursive formula

    (n + 1 choose k) = (n choose k − 1) + (n choose k)        (7)

by calculation. Using this formula, binomial coefficients are derived by simple addition within the
so-called PASCAL’s triangle.

122 - 8
Fundamental principle of counting

Theorem:
The number of possibilities to fulfill k issues simultaneously, each of which can be fulfilled
independently in ni ways (i = 1, . . . , k), is just equal to the product of the individual
numbers of possibilities and amounts to

T = n1 · n2 · · · · · nk .

In many applications, each of the k issues can be satisfied in the same number of ways,
that is, all ni = n. Then the number of possibilities is simply

    nTk = n^k
Examples:

Number of possible outcomes when tossing a coin and then rolling a die:

T = n1 · n2 = 2 · 6 = 12 .

Number of possible outcomes when rolling a red and a blue die (i.e. two distinguishable dice):

    6T2 = 6² = 36 .

5. Combinatorics and counting principles 123


Permutations

Definition:
Consider a set of n elements. Each arrangement of all these elements in any order is
called a permutation of these n elements.

Example:

From the set {a, b, c }, 6 permutations can be formed, namely



abc bac cab
⇒ 6 = 3!
acb bca cba

The set {a, b, c , d } already has 24 permutations, that is



abcd bacd cabd dabc 

abdc badc cadb dacb




acbd bcad cbad dbac

⇒ 24 = 4!
acdb bcda cbda dbca 

adbc bdac cdab dcab




adcb bdca cdba dcba

5. Combinatorics and counting principles 124


Permutations

When calculating the number of possible permutations of n elements, it is important to
consider whether or not these elements are all distinct or distinguishable.

Proposition :
If all n elements are distinguishable, the number of possible permutations is

nP = n!
If not all elements of the set to be permuted are different, the number of
distinguishable permutations will be smaller, of course.

Proposition :
If not all elements of the set to be permuted are different, we form m groups (classes)
of equal elements from them; then let the group i contain ni ≥ 1 elements, so that
n = n1 + n2 + · · · + nm . Then the number of distinguishable permutations of these
elements is
    nPn1,n2,...,nm = n! / (n1! · n2! · · · · · nm!)

Question: How many distinguishable permutations exist for the set {a, b, a, b}?

5. Combinatorics and counting principles 125


Variations and Combinations

Definition:
Consider a set with n different elements. We are interested in the number of ways to
choose a k -element subset from these elements.
We distinguish two cases:
1. A Variation of k-th order, also called arrangement or k-permutation of n, is an
   ordered arrangement of a k-element subset of an n-set. The number of variations is

       nVk = n! / (n − k)!

2. A Combination of k-th order or k-combination is a selection of items from a set,
   such that the order of selection does not matter. The number of possible combinations is

       nCk = (n choose k) = n! / (k! · (n − k)!)

5. Combinatorics and counting principles 126


Variations and Combinations
Example

From the elements of the set {a, b, c }, taking into account the order, the following six
variations of order 2 can be formed:

    ab, ba
    ac, ca        ⇒ 6 = 3 · 2 = 3!/1! = 3V2
    bc, cb

If the order does not matter, you get only the three combinations

    ab (= ba)
    ac (= ca)     ⇒ 3 = 3!/(2! · 1!) = (3 choose 2) = 3C2
    bc (= cb)

5. Combinatorics and counting principles 127



Variations and Combinations

Task:
At the Summer Olympics, n = 8 sprinters start the 100-meter race. How many variations
(i.e. taking the order into account) are there for the gold, silver and bronze ranks?

Solution:
There are

    8V3 = 8! / (8 − 3)! = 8 · 7 · 6 = 336

different variations for gold, silver and bronze.

5. Combinatorics and counting principles 128


Example

If you want to know how many possibilities there are to place a bet in the lottery, you calculate the
number of 6-combinations out of 49 elements (without taking the order into account), i.e. the binomial
coefficient

    49C6 = (49 choose 6) = (49 · 48 · 47 · 46 · 45 · 44)/(6 · 5 · 4 · 3 · 2 · 1) = 13 983 816

But only one of these many combinations is the winning six.

For a five in the lottery, you need five out of the six right and one out of the 43 wrong picks. There are

    (6 choose 5) · (43 choose 1) = 6 · 43 = 258

different combinations of five. For a four, you need four out of the six correct numbers and at the
same time two out of the 43 wrong numbers. So there are

    (6 choose 4) · (43 choose 2) = 15 · 903 = 13 545

different combinations of fours possible.

128 - 1
Summary

Situation                                            Number of possibilities

k independent issues                                 T = n1 · n2 · · · · · nk
Permutation (all elements distinguishable)           nP = n!
Permutation (not all elements distinguishable)       nPn1,n2,...,nm = n! / (n1! · n2! · · · · · nm!)
k-variation (order matters)                          nVk = n! / (n − k)!
k-combination (order does not matter)                nCk = (n choose k) = n! / (k! · (n − k)!)
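
Python's standard library provides these counts directly; a small sketch reproducing numbers from the examples above (math.perm and math.comb require Python 3.8 or newer):

from math import factorial, perm, comb

print(factorial(4))       # 4! = 24 ways to visit the four parties
print(factorial(10))      # 10! = 3 628 800 arrangements of 10 distinct books

# Permutations with repeated elements, e.g. for the set {a, b, a, b}:
print(factorial(4) // (factorial(2) * factorial(2)))   # 4!/(2!·2!) = 6

print(perm(8, 3))         # 8V3 = 336 podium variations of 8 sprinters
print(comb(7, 3))         # 7C3 = 35 skat groups of three players
print(comb(49, 6))        # 49C6 = 13 983 816 possible lottery bets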

5. Combinatorics and counting principles 129


Control questions

1. What is a factorial? Give an example.

2. Why are there no negative numbers in a binomial coefficient?

3. What is the purpose of combinatorics?

4. What is the difference between permutation and variation/combination?

5. With the variations/combinations one distinguishes between the cases with consideration
of the order and without consideration of the order. Why is this not done for permutations?

6. For combinatorial models, we sometimes distinguish between models with replacement
and those without replacement. What does this mean?

5. Combinatorics and counting principles 130


6 Fundamentals of probability theory

6.1 Events, sample space, set of events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132


6.2 Calculating with events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.3 Classical probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.4 Statistical probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.5 Subjective probability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .145
6.6 Axioms of probability theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.7 Important rules for the calculation of probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.8 Conditional probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.9 Stochastic independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.10 Total probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .159
6.11 The B AYES theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

according to [Schira, 2016], chapter 8


see also: [Anderson et al., 2012], chapter 4; and [Newbold et al., 2013], chapter 3
6. Fundamentals of probability theory 131
Events, sample space, set of events

Definition:
An experiment is called a random experiment if it is
1. run according to a certain rule,
2. can be repeated under the same conditions any number of times,
3. and the outcome is uncertain and cannot be predicted.

Examples: Throwing a die or a coin, drawing a card, spinning a roulette wheel or a wheel of fortune.

Definition:
The individual, mutually exclusive and indivisible outcomes or results of a random ex-
periment are called elementary events or basic outcomes.

Example: When rolling a die, there are the basic outcomes: „1“, „2“, „3“, „4“, „5“ and „6“.

6. Fundamentals of probability theory 132


Events, sample space, set of events

Definition:
The set Ω of all elementary events of a random experiment is called sample space or
random sample space of this random experiment.

Examples:
In the random experiment „throwing a die“ the sample space is Ω = {1, 2, 3, 4, 5, 6}
and has finitely many elements.
The random experiment „flip a coin until heads appears“ has the sample space
Ω = {H, TH, TTH, TTTH, . . . }, consisting of infinitely many elements.

6. Fundamentals of probability theory 133


Events, sample space, set of events

Definition:
A random event A is a subset of the sample space Ω. The event A has occurred if the
result of the random experiment is an element of this subset A.

Example:
A : „even number of pips“ when rolling the dice

⇒ A = {2, 4, 6} ⊂ Ω

6. Fundamentals of probability theory 134


further Examples:

When rolling two dice, the random event A: „Sum of pips is higher than 10“ consists of the three
elementary events
A = {(5, 6), (6, 5), (6, 6)}

In the random experiment „flip a coin until heads appears“ the event B: „Heads does not appear
before the 5th flip“
B = {TTTTH, TTTTTH, TTTTTTH, . . . }

consists of infinitely many basic outcomes.


In the same random experiment, the event C: „heads appears in the 3rd or 4th flip“ consists of
the two elementary events
C = {TTH, TTTH}

134 - 1
Events, sample space, set of events

Taking all events of a random experiment together, we obtain a set whose elements are
subsets of Ω.

Definition:
All events of a random experiment with sample space Ω form the associated set of
events E (Ω).

Example:
The random experiment „flipping a coin“ has the sample space Ω = {H, T}. The
corresponding set of events is
n o
E (Ω) = {}, {H}, {T}, {H, T}

Two particular events need to be highlighted:


the certain event Ω: The event that always occurs: the sample space itself.
the impossible event ∅ = { }: the event that can never occur: the empty set.

6. Fundamentals of probability theory 135


Calculating with events

Since events are defined as sets, we can reuse the notations and operations from set
theory here. Thus we can calculate with events as with sets. These calculations lead
again to events in E (Ω).

Negation Ā: The event not A occurs if and only if A does not
occur.

Union A ∪ B: Event A or B occurs whenever event A or event B


or both occur at the same time.

Intersection A ∩ B: Event A and B occurs if and only if both events A


and B occur at the same time.

Difference A \ B: Event A without B occurs exactly if A but not B


occurs.

6. Fundamentals of probability theory 136


Calculating with events

Definition: Two events A and D are called disjoint or mutually exclusive, if

A ∩ D = ∅.

Definition: The event


Ā = Ω \ A
is called the complementary event of A.

Definition:
If an event A always occurs when an event C occurs, we say,

Event C implies event A.

This is the case if and only if


C ⊆ A.

6. Fundamentals of probability theory 137


Calculating with events

V ENN diagrams to illustrate events and event operations :

Ω Ω Ω

A A B A B

complement of A union A ∪ B intersection A ∩ B

Ω Ω Ω

A B A A C

difference A \ B disjoint events C implies A

6. Fundamentals of probability theory 138


Classical probability

LAPLACE (PIERRE-SIMON, MARQUIS DE LAPLACE, 1749–1827):
„If an experiment can produce a (finite) number of different and equally possible outcomes,
and some of them are to be considered favorable, then the probability of a favorable outcome
(event A) is equal to the ratio of the number of favorable to the number of possible outcomes“:

    P(A) := number of favorable outcomes / number of possible outcomes = g/m

Thus, if we consider a sample space

    Ω = {e1, e2, . . . , em}

with m elementary events that are all equally probable, then

    P(e1) = P(e2) = · · · = P(em) = 1/m

holds, where the sum of all probabilities equals one.

6. Fundamentals of probability theory 139



Classical probability

Definition:
A random experiment with finitely many equally probable elementary events is called
Laplace experiment.
In a Laplace experiment, the probability P(A) of event A with sample space Ω is given
by
    P(A) = (number of elements of A) / (number of elements of Ω) = |A| / |Ω| .

Task: A coin and a die are thrown together. What is the (Laplace) probability that heads
and a number greater than 4 will appear?
Solution: The sample space is

Ω = {(H, 1), (T, 1), (H, 2), (T, 2), (H, 3), (T, 3), (H, 4), (T, 4), (H, 5), (T, 5), (H, 6), (T, 6)}

and the event A = {(H, 5), (H, 6)}. Thus the Laplace probability is given by

    P(A) = 2/12 = 1/6 .
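
For small Laplace experiments such a probability can also be checked by simply enumerating the sample space; a brief Python sketch of the coin-and-die task above (the names are our own):

from itertools import product
from fractions import Fraction

# Sample space of the combined experiment: coin side and number of pips
omega = list(product(["H", "T"], [1, 2, 3, 4, 5, 6]))

# Event A: heads and a number greater than 4
A = [(side, pips) for (side, pips) in omega if side == "H" and pips > 4]

print(Fraction(len(A), len(omega)))   # 2/12 = 1/6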

6. Fundamentals of probability theory 140


Statistical probability
Example

Experiment: Throw a die. Write down how often the number 6 occurs.

number of absolute frequency relative frequency


throws n of the number 6 of the number 6
1 1 1.000 00
2 1 0.500 00
3 1 0.333 33
4 1 0.250 00
5 2 0.400 00
10 2 0.200 00
20 5 0.250 00
100 12 0.120 00
200 39 0.195 00
500 76 0.152 00
700 120 0.171 43
1000 170 0.170 00
2000 343 0.171 50
3000 506 0.168 67

It seems as if the relative frequencies converge for large n.


6. Fundamentals of probability theory 141
Statistical probability

Question: But what to do if the classical concept of probability is not applicable?


Frequentist probability or frequentism according to JOHN VENN (1834-1923) and
RICHARD VON MISES (1883-1953)
Repeat an experiment very often (n times)
Calculate the relative frequencies hn (A)
If the frequencies stabilize, it can be assumed that a limit exists against which the
relative frequencies converge: hn(A) −→ P(A) as n → ∞

Definition:
The limit
P(A) = lim hn (A)
n→∞

is called the statistical probability of the event A.

The convergence of the relative frequency is also called the law of large numbers.
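
The stabilization of the relative frequencies can be illustrated with a small simulation; a Python sketch (the exact numbers vary with the seed, but hn(„6“) approaches 1/6 ≈ 0.1667 for large n):

import numpy as np

rng = np.random.default_rng(seed=1)     # fixed seed so the run is reproducible

for n in [10, 100, 1_000, 10_000, 100_000]:
    throws = rng.integers(1, 7, size=n)     # n throws of a fair die
    h_n = np.mean(throws == 6)              # relative frequency of a "6"
    print(f"n = {n:>7}: h_n = {h_n:.4f}")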

6. Fundamentals of probability theory 142


Statistical probability
Example

Experiment: Throw a die 300 times.

elementary observed frequency L APLACE’s


event absolute relative probability

„1“ 51 0.170 00 0.166 666. . .


„2“ 53 0.176 67 0.166 666. . .
„3“ 48 0.160 00 0.166 666. . .
„4“ 52 0.173 33 0.166 666. . .
„5“ 49 0.163 33 0.166 666. . .
„6“ 47 0.156 67 0.166 666. . .

sum 300 1.000 00 1.000 000

In general it holds that the larger the number of observations, the better the estimate.

6. Fundamentals of probability theory 143


Statistical probability

Note on statistical probability:


(also called empirical probability or objective probability)
The limit n → ∞ is of course empirically unattainable: no one has ever rolled a die that often!
Nevertheless, the concept of statistical probability is of the utmost practical utility: The
observed relative frequency is then used as an approximation

P(A) ≈ hn (A)

or estimate P̂

P̂(A) = hn (A)

of the probability P(A) we are looking for.

6. Fundamentals of probability theory 144


Subjective probability

Risk situation A:
You get 1000 € with probability p. You obtain 0 € with probability 1 − p.

Risk situation B:
You get 1000 € if the DAX rises by at least 200 points within the next 3 months. If not, you get
nothing.

Now p is varied until the individual is indifferent to these two risk situations (e.g. p = 40 %). Then
the number p indicates the subjective probability that the DAX will rise by at least 200 points in the
next three months.

144 - 1
Axioms of probability theory

Any function P : E → R that assigns a real number to each event A from the set of
events E may be called a probability function if the following axioms are satisfied:

Axioms of KOLMOGOROV:

K1: P(A) ≥ 0, for all A ∈ E

The probability P(A) of every event A is a non-negative number.

K2: P(Ω) = 1

The certain event has probability one.

K3: P(A ∪ B ) = P(A) + P(B ), if A ∩ B = ∅

Addition rule for disjoint events.

Extension of K3 for pairwise disjoint events Ai , i = 1, 2, . . . :


K3*: P(A1 ∪ A2 ∪ A3 ∪ . . . ) = P(A1 ) + P(A2 ) + P(A3 ) + . . .

6. Fundamentals of probability theory 145


Probability space

Definition:
The sample space Ω together with the set of events E and probability function P:

(Ω, E , P)

is called probability space.

It contains all the necessary information to determine and calculate with probabilities for
all events from a sample space.

6. Fundamentals of probability theory 146


Important rules for the calculation of probabilities

Theorem 1:

The probability of the event complementary to A is always

P(Ā) = 1 − P(A)

for all A ∈ E.

Proof: A and Ā are disjoint events. According to axioms K3 and K2 it holds that

    P(A) + P(Ā) = P(Ω) = 1
    ⇒ P(Ā) = 1 − P(A)

6. Fundamentals of probability theory 147


Important rules for the calculation of probabilities

Theorem 2:
The impossible event has probability zero:

P(∅) = 0 .

Proof: ∅ and Ω are complementary events. According to axiom K2 it is P(Ω) = 1, and


with Theorem 1 we obtain
P (∅) = 1 − 1 = 0 .

6. Fundamentals of probability theory 148


Important rules for the calculation of probabilities

Theorem 3:

If the events A1 , A2 ,. . . , An are pairwise disjoint, then the probability for the event result-
ing from the union of all these events is equal to the sum of the individual probabilities:
n
X
P(A1 ∪ A2 ∪ · · · ∪ An ) = P(Aj ) .
j =1

Proof: by mathematical induction starting from axiom K3.

[Venn diagram: pairwise disjoint events A1, A2, A3, . . . , An in Ω]

6. Fundamentals of probability theory 149


Important rules for the calculation of probabilities

Theorem 4:

For a difference set A \ B the following always holds

P(A \ B ) = P(A) − P(A ∩ B ) .

Proof: The event A is composed of the two disjoint events A \ B and A ∩ B. Thus from axiom K3 it
follows that

    P(A) = P(A \ B) + P(A ∩ B)

and hence the desired result.

6. Fundamentals of probability theory 150


Important rules for the calculation of probabilities

Theorem 5 (Addition theorem for arbitrary events):


For two arbitrary events A and B from E it always holds that

P(A ∪ B ) = P(A) + P(B ) − P(A ∩ B ) .

Proof: The event A ∪ B is composed of the three disjoint events A \ B, A ∩ B and B \ A.
According to Theorem 4 it holds that

P(A \ B ) = P(A) − P(A ∩ B )


P(B \ A) = P(B ) − P(A ∩ B )

Using Theorem 3 we get

P(A ∪ B ) = P(A \ B ) + P(A ∩ B ) + P(B \ A) .

Inserting the above provides the result.

6. Fundamentals of probability theory 151


Important rules for the calculation of probabilities

Theorem 6 (Monotonicity of the probability measure):


If the event A implies the event B, then the probability of B is never smaller than that of
A, i.e.
A ⊆ B ⇒ P(A) ≤ P(B ) .

Proof: Event B is composed of the two disjoint events A and B \ A.
According to axiom K3, it holds that

P(A) + P(B \ A) = P(B ) .

Since P(B \ A) cannot be negative according to axiom K1, it follows that

P(A) ≤ P(B ) .

6. Fundamentals of probability theory 152


Conditional probability

Example:
A die has been thrown. The probability that a „6“ occurred is

    P({6}) = 1/6 .

With the additional information that an even number of pips was thrown, the probability
that a „6“ occurred is higher, namely

    P({6}|even number of pips) = 1/3 .

The probability of occurrence of an event A under the condition that event B has
occurred (or occurs simultaneously with A) is called conditional probability of A under
the condition B.

6. Fundamentals of probability theory 153


Conditional probability

Definition:
Let A and B be two events of a given probability space (Ω, E , P). The conditional
probability of A under the condition B is defined as

P (A ∩ B )
P(A|B ) := .
P(B )

for P(B ) > 0 and remains undefined for P(B ) = 0.


6. Fundamentals of probability theory 154


Conditional probability

Theorem 7 (Multiplication theorem of probability theory):

The probability that two events A and B occur simultaneously is

P(A ∩ B ) = P(A) · P(B |A) .

In the same way

P(B ∩ A) = P(B ) · P(A|B ) .

6. Fundamentals of probability theory 155


Stochastic independence

Definition:
Two events A and B are called stochastically independent or briefly independent, if

P(A|B ) = P(A) .

Then it also holds that

P(B |A) = P(B ) .

6. Fundamentals of probability theory 156


Stochastic independence

Theorem 8 (Multiplication theorem for independent events):


If two events A and B are independent, then the probability that A and B occur simulta-
neously is just equal to the product of the individual probabilities:

P(A ∩ B ) = P(A) · P(B ) .

Proof: Follows directly from the definition of independence and Theorem 7.

6. Fundamentals of probability theory 157


Stochastic independence
Example

On New Year’s Eve 1988, a remarkable event happened in the casino in Constance: the event

    A = {0, 3},

on which you can bet and then obtain 18 times the stake, occurred nine times in a row.

According to L APLACE, the probability for this would have to be P(A) = 2/37. According
to the multiplication theorem for independent events we get for the total probability

    P(A1 ∩ A2 ∩ · · · ∩ A9) = P(A1) · P(A2) · · · · · P(A9) = (2/37)⁹ = 0.000 000 000 004

6. Fundamentals of probability theory 158


Annotation:

Independent events must not be confused with disjoint events! For disjoint events P(A ∩ B ) = 0
holds. One could even say that disjoint events are highly dependent events, because if one of them
occurs, the other cannot occur at all.

158 - 1
Total probability

Example:
A bulk article is produced by two machines. The faster machine has slightly more rejects
than the other, but produces twice as much. It holds that

    P(article produced on M1) = 2/3
    P(article produced on M2) = 1/3
    P(article broken|article produced on M1) = 0.1
    P(article broken|article produced on M2) = 0.07

Question: What is the total probability P(article broken)?

6. Fundamentals of probability theory 159


Total probability

Solution: The sample space Ω is partitioned into the two disjoint events H1 (article produced on M1)
and H2 = H̄1 (article produced on M2). Using this, the event A can be divided into the two disjoint
events A ∩ H1 and A ∩ H2.

Using the multiplication theorem we get

    P(A ∩ H1) = P(A|H1) · P(H1)
    P(A ∩ H2) = P(A|H2) · P(H2)

and with axiom K3

    P(A) = P(A ∩ H1) + P(A ∩ H2) = P(A|H1) · P(H1) + P(A|H2) · P(H2)

For the example above, the total probability is thus given by

    P(article broken) = 0.1 · 2/3 + 0.07 · 1/3 = 0.09 .

6. Fundamentals of probability theory 160


Total probability

Definition:
Any n events H1 , H2 , . . . , Hn , that are mutually exclusive but together fill the sample
space entirely, i.e.

Hi ∩ Hj = ∅ for i ̸= j and
H1 ∪ H2 ∪ · · · ∪ Hn = Ω

are called a division or partition of Ω.

[Venn diagram: a partition H1, H2, . . . , Hn of Ω with an event A overlapping several Hi]

6. Fundamentals of probability theory 161


Total probability

Theorem 9 (Theorem of total probability):


Let H1, H2, . . . , Hn be a partition of Ω. Then for each event A ∈ E it holds that

    P(A) = Σ_{j=1}^n P(A|Hj) · P(Hj) .

[Venn diagram: partition H1, H2, . . . , Hn of Ω and event A]

6. Fundamentals of probability theory 162


Proof: The events A ∩ H1 , A ∩ H2 , . . . , A ∩ Hn taken together result in exactly the event

A = (A ∩ H1 ) ∪ (A ∩ H2 ) ∪ · · · ∪ (A ∩ Hn ) .

Since they are pairwise disjoint, axiom K3 states that

P(A) = P(A ∩ H1 ) + P(A ∩ H2 ) + · · · + P(A ∩ Hn ) .

The following applies to the individual summands according to the multiplication theorem

P(A ∩ Hi ) = P(A|Hi ) · P(Hi )

and thus in total

    P(A) = P(A|H1) · P(H1) + P(A|H2) · P(H2) + · · · + P(A|Hn) · P(Hn) = Σ_{j=1}^n P(A|Hj) · P(Hj) .

162 - 1
Total probability
Example

Experimental setup of a two-stage experiment:


A die and a cupboard with three drawers.

drawer D1 contains 14 white and 6 black balls,


drawer D2 however contains 2 white and 8 black balls,
drawer D3 finally contains 3 white and 7 black balls.

1st stage: First the die is rolled.


If a number less than 4 appears on the die, drawer D1 is selected, if a 4 or 5 appears,
drawer D2 is selected, otherwise drawer D3.
2nd stage: After that, a ball is drawn at random from the drawer selected in this way.

Question: What is the total probability of drawing a white ball in the end?

6. Fundamentals of probability theory 163


Total probability
Example

Solution:
According to the theorem of total probability, the probability of drawing a white ball in the
end is:

    P(white) = P(white|D1) · P(D1) + P(white|D2) · P(D2) + P(white|D3) · P(D3)
             = 14/20 · 3/6 + 2/10 · 2/6 + 3/10 · 1/6
             = 28/60
             = 0.4667 .
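
The same calculation in exact fractions; a small Python sketch of the theorem of total probability for this two-stage experiment (dictionary keys are our own labels):

from fractions import Fraction

# P(drawer) from the die, P(white | drawer) from the contents of each drawer
p_drawer = {"D1": Fraction(3, 6), "D2": Fraction(2, 6), "D3": Fraction(1, 6)}
p_white  = {"D1": Fraction(14, 20), "D2": Fraction(2, 10), "D3": Fraction(3, 10)}

p_total = sum(p_white[d] * p_drawer[d] for d in p_drawer)
print(p_total, float(p_total))   # 28/60 = 7/15 ≈ 0.4667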

6. Fundamentals of probability theory 164


The B AYES theorem

The B AYES theorem establishes a connection between two conditional probabilities

P(A|B ) and P(B |A).

From the multiplication theorem it follows that

P(A) · P(B |A) = P(A ∩ B ) = P(B ) · P(A|B )

and hence

    P(A|B) = P(A) · P(B|A) / P(B) .

This also applies to any partition Hi, i = 1, . . . , n:

    P(Hi|B) = P(Hi) · P(B|Hi) / P(B) .

If we replace P(B ) by the total probability, we get

6. Fundamentals of probability theory 165


The B AYES theorem

Theorem 10 (BAYES theorem):


If the events H1 , H2 , . . . , Hn form any partition of the sample space Ω and B is an event
with P(B ) > 0. Then for each Hi it holds that

P(B |Hi ) · P(Hi )


P(Hi |B ) = Pn .
j =1 P(B |Hj ) · P(Hj )

Example from slide 159: The probability that a piece picked at random from the day’s
production was produced by machine 1 is a priori

P(article produced on M1) = 2/3 = 0.6667

However, if the piece is faulty, one would guess (since machine 1 has a larger scrap) that
the probability of the piece being produced by M1 is higher. In fact, according to Bayes’
theorem, the probability is
    P(article produced on M1|article broken) = (2/3 · 0.1) / (2/3 · 0.1 + 1/3 · 0.07) = 20/27 = 0.7407

6. Fundamentals of probability theory 166


The B AYES theorem

Definition:
In BAYES statistics, H1 , . . . , Hn denote alternative hypotheses.

P(Hi ) is called the a priori probability of the i-th hypothesis.

P(Hi |B ) is called the a posteriori probability of the i-th hypothesis after knowl-
edge of observation B.

6. Fundamentals of probability theory 167


Example from practice
A man goes to the doctor for a routine cancer diagnostic test. The diagnostic test shows a positive
result. What is the probability that he actually has the disease?

Given: (from experience about the reliability of the test and/or from disease statistics)

A = {cancer} ⇒ Ā = {no cancer}


B = {test positive} ⇒ B̄ = {test negative}

A priori probability for disease

P(cancer) = P(A) = 2 % ⇒ P(no cancer) = P(Ā) = 98 %

Sensitivity of the test (a sick person is detected as sick):

P(test positive|cancer) = P(B |A) = 95 % ⇒ P(test negative|cancer) = P(B̄ |A) = 5 %

Specificity of the test (a healthy person is detected as healthy):

P(test negative|no cancer) = P(B̄ |Ā) = 90 % ⇒ P(test positive|no cancer) = P(B |Ā) = 10 %
167 - 1
Wanted: P(cancer|test positive) = P(A|B )

The hypotheses H1 = A = {cancer} and H2 = Ā = {no cancer} form a partition of Ω. According to
the theorem of BAYES it holds that

    P(A|B) = P(B|A) · P(A) / ( P(B|A) · P(A) + P(B|Ā) · P(Ā) )
           = 0.95 · 0.02 / (0.95 · 0.02 + 0.1 · 0.98)
           = 16.24 %

The probability that the man actually has cancer has thus increased from 2 % a priori with the infor-
mation of the test result to 16.24 % a posteriori. Thus, the probability of disease is still a good 8 times
higher than assumed a priori.

If the test had been negative, the probability of having the disease anyway would have been

    P(A|B̄) = P(B̄|A) · P(A) / ( P(B̄|A) · P(A) + P(B̄|Ā) · P(Ā) )
            = 0.05 · 0.02 / (0.05 · 0.02 + 0.9 · 0.98)
            = 0.11 %

and hence much lower than assumed a priori.
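
Both posterior probabilities follow directly from the Bayes formula; a minimal Python sketch with the numbers given above (variable names are our own):

# Prior and test characteristics from above
p_cancer = 0.02
p_pos_given_cancer = 0.95       # sensitivity
p_pos_given_healthy = 0.10      # 1 − specificity

p_healthy = 1 - p_cancer

# Total probability of a positive test (denominator of Bayes' theorem)
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * p_healthy

# A posteriori probabilities
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
p_cancer_given_neg = (1 - p_pos_given_cancer) * p_cancer / (1 - p_pos)

print(round(p_cancer_given_pos, 4))   # about 0.1624
print(round(p_cancer_given_neg, 4))   # about 0.0011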


167 - 2
Visualization:

Of Ω = 10 000 persons, 200 have cancer (A) and 9800 do not (Ā).
Of the 200 with cancer, 190 test positive (B|A) and 10 test negative (B̄|A).
Of the 9800 without cancer, 980 test positive (B|Ā) and 8820 test negative (B̄|Ā).
Hence P(A|B) = 190 / (190 + 980) = 190/1170 ≈ 16.24 %.

167 - 3
Control questions

1. What is the difference between an elementary event and an event?

2. Is it possible to specify different sample spaces for the same random experiment?

3. Are complementary events disjoint? Are disjoint events complementary?

4. Which three concepts of probability do you know? On which concept of probability
are Kolmogorov’s axioms based?

5. Event A implies event B. Can B then be independent of A?

6. For which question is the multiplication theorem helpful, for which the addition
theorem?

7. Can independent events be disjoint? Can disjoint events be independent?

6. Fundamentals of probability theory 168


Problem from practice: Just in Time
A manager of a food retail chain is to report to the company management about the experience of
the Just In Time pilot project. Just-In-Time delivery means daily, early morning deliveries to stores.
However, shifting inventory to the street poses risks. For example, according to records kept by the
manager, in the past month the 46 stores in his area have been approached 1200 times. 9 times
an expected truck did not arrive at the store to be approached in the morning due to an accident
or breakdown, resulting in significant lost sales, and 83 times there were delays. This pilot study is
to serve as the basis for the design of deliveries to 16 new stores in eastern Germany, which are
to operate without storage space. As far as possible, no delivery problems should occur during the
start-up phase.

Based on the above results, what value will the reporting manager calculate for the probability that
these 16 stores will receive their deliveries on time every day (6 days total) during a week, or at least
receive their goods, even if late?

According to the observations in the past, the statistical probability that a truck on a single trip

1. will have an accident or breakdown during the trip and will not arrive:   P(breakdown) = 9/1200 = 0.0075,
2. arrives late:                                                             P(too late) = 83/1200 = 0.069 12,
3. will have a breakdown or arrives late:                                    P(breakdown ∪ too late) = 0.076 62,
4. arrives in time:                                                          P(in time) = 1 − 0.076 62 = 0.923 38.

168 - 1
To supply the 16 stores six days a week, 96 trips are required. The probability that all deliveries will
be made on time is
P(all in time) = (0.923 38)⁹⁶ = 0.000 474 8 .

The probability that all stores will be reached and supplied on all days of a week, even if late, is

P(all in time ∪ too late) = (0.9925)⁹⁶ = 0.4854 .
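
These powers are easily reproduced; a short Python sketch using the rounded per-trip probabilities from above:

p_in_time = 1 - 0.07662      # per-trip probability of an undisturbed, punctual delivery
p_arrives = 1 - 0.0075       # per-trip probability that the truck arrives at all (possibly late)

trips = 16 * 6               # 16 stores, 6 delivery days => 96 independent trips

print(round(p_in_time ** trips, 7))   # about 0.0004748
print(round(p_arrives ** trips, 4))   # about 0.4854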

168 - 2
7 Random variables in one dimension

7.1 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170


7.2 Distribution function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
7.3 Discrete random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
7.4 Continuous random variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .182
7.5 Expected values of random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
7.6 Variances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
7.7 Standardization of random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
7.8 C HEBYSHEV’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
7.9 Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
7.10 Median and Quantiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203

according to [Schira, 2016], chapter 9


see also: [Anderson et al., 2012], chapter 5 and [Newbold et al., 2013], chapter 4
7. Random variables in one dimension 169
Random variables

Definition:
Let a probability space (Ω, E , P) be given. A function

X : Ω → R,
e 7→ X (e) ∈ R ,

that assigns a real number X (e) to each elementary event e ∈ Ω is called random
variable or stochastical variable.

Random variable as a mapping of the sample space onto the real axis:

[Figure: X maps each elementary event of Ω to a point on the real axis]
7. Random variables in one dimension 170
Technical constraint: for any real number r , the set Ar = {e|X (e) ≤ r } has to be an event, i.e.
Ar ∈ E.

170 - 1
Random variables
Example

A coin is tossed once.


The sample space is therefore Ω = {heads, tails}. Let the random variable X denote the
number of heads that may result from this experiment. The random variable defined in
this way can take only two values, namely

X (tails) = 0 and X (heads) = 1 .

and thus has the value range (codomain) C = {0, 1}.


The set of events E consists of the four events

E = {∅, {tails}, {heads}, Ω}

7. Random variables in one dimension 171


Random variables
Example

Two regular dice are rolled.


The sample space consists of 36 elementary events

Ω = {(i , j )|i = 1, . . . , 6; j = 1, . . . , 6} .

Here several random variables can be formed, examples are:


Let X denote the sum of the pips:

X (i , j ) = i + j .

The codomain of X is CX = {2, 3, . . . , 12}, X can therefore take on eleven different


values.
Let Y denote the absolute difference of the numbers of pips:

Y (i , j ) = |i − j | .

The codomain of Y is CY = {0, 1, . . . , 5}, the random variable Y can therefore only
take on six different values.

7. Random variables in one dimension 172


Random variables
Example

A point inside a circle of radius c is chosen at random. If x and y denote the coordinates of the point
in a coordinate system through the center of the circle, the sample space can be written as

    Ω = {(x, y) | x, y ∈ R and x² + y² < c²} .

The random variable Z is now defined as the distance of the chosen point from the center of the circle,

    Z(x, y) := √(x² + y²) .

The random variable defined this way can now take all real
values between 0 and c (0 ≤ Z < c ).

7. Random variables in one dimension 173


Random variables
discrete vs. continuous

Definition:
If the codomain C ⊂ R of a random variable X consists of a finitely many or countably
infinitely many values
C = {x1 , x2 , x3 . . . } ,
the random variable is called discrete.

If the codomain C ⊂ R of a random variable X consists of the whole real axis or of


subintervals
C = {x ∈ R|a ≤ x ≤ b}, −∞ < a < b < ∞ ,
it is called continuous. Its codomain then consists of uncountably infinitely many values.

Question: Which of the random variables in the examples on the previous pages are
discrete, and which are continuous?

7. Random variables in one dimension 174


Distribution function

Definition:
The function
F (x ) := P (X ≤ x ) ,
that assigns to each real number x the probability with which the random variable X
takes a value X ≤ x is called the distribution function of the random variable X .

7. Random variables in one dimension 175


The probability P (X ≤ x ) is equal to the probability of the event

Ax = {e|X (e) ≤ x } .

That is the reason for the restrictive condition in the definition of a random variable.

175 - 1
Distribution function
Example

Number of heads in a simple coin toss:


This random variable X can only take the two values 0 or 1. If the probability of throwing
heads is 0.5, then the associated distribution function is given by

F(x) = 0    for x < 0
       1/2  for 0 ≤ x < 1
       1    for 1 ≤ x

7. Random variables in one dimension 176


Distribution function
Example

A point inside a circle of radius c is chosen at random. Each point is equally probable.
The random variable Z is defined as the distance of the chosen point from the center of
the circle, thus the probability is

P(Z ≤ z) = circle area(A_z) / circle area(Ω) = z²π / (c²π) = z²/c² .

Thus, the distribution function of Z is given by

F(z) = 0      for z < 0
       z²/c²  for 0 ≤ z < c
       1      for c ≤ z

7. Random variables in one dimension 177


Distribution function

Properties of the distribution function :


The distribution function F is
1. continuous from the right at each x

   lim_{Δx→0+} F(x + Δx) = F(x)

2. monotonically increasing

   F(a) ≤ F(b), if a < b

3. and has lower limit 0 and upper limit 1

   lim_{x→−∞} F(x) = 0 ,    lim_{x→∞} F(x) = 1

7. Random variables in one dimension 178


Distribution function

Alternative Definition:
Any function F (x ) on the domain of the real numbers and with the codomain C = [0, 1]
that has the above three properties is called a distribution function and defines a
random variable.

7. Random variables in one dimension 179


Discrete random variables

Definition:
If X is a discrete random variable then the function

f (x ) := P(X = x )

is called the probability mass function or in short the mass function of the random
variable X .

Properties :

1. f(x) ≥ 0

2. Σ_{all i} f(x_i) = 1

3. From 1. and 2. it directly follows f(x_i) ≤ 1

7. Random variables in one dimension 180


Discrete random variables

Relationship between distribution function and


mass function of a discrete random variable X :
F(x) = Σ_{x_i ≤ x} f(x_i)

7. Random variables in one dimension 181
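As a small illustration (not part of the original slides), the following Python sketch cumulates a given mass function into the distribution function of a discrete random variable; the pmf of the single coin toss from the earlier example is used as assumed input.

# Sketch: F(x) = sum of f(x_i) over all x_i <= x for a discrete random variable.
def make_cdf(pmf):
    """Return the distribution function F for a pmf given as {value: probability}."""
    xs = sorted(pmf)
    def F(x):
        return sum(pmf[xi] for xi in xs if xi <= x)
    return F

pmf_coin = {0: 0.5, 1: 0.5}          # mass function of the coin-toss example
F = make_cdf(pmf_coin)
print(F(-0.5), F(0), F(0.7), F(1))   # 0.0 0.5 0.5 1.0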


Continuous random variables

Definition:
If X is a continuous random variable with distribution function F , then the first derivative
f(x) := dF(x)/dx

is called the probability density function or in short density function of the random
variable X .

Properties :

1. f(x) ≥ 0

2. ∫_{−∞}^{∞} f(x) dx = 1

Note that f (x ) > 1 can also occur.

7. Random variables in one dimension 182


Continuous random variables

Probabilities as areas beneath the density function

P(a < X < b) = P(a ≤ X < b)
             = P(a < X ≤ b) = P(a ≤ X ≤ b)
             = F(b) − F(a) = ∫_a^b f(x) dx

In particular, the probability that a


continuous random variable takes on a
particular single value a is

P (X = a) = 0.

7. Random variables in one dimension 183


Continuous random variables
Example

 
0
 for x < 0, 0
 for x < 0,
1 3 1 2
F (x ) = (x − 3) + 1 for 0 ≤ x ≤ 3, f (x ) = (x − 3) for 0 ≤ x ≤ 3,
 27 9
1 for x > 3, 0 for x > 3.
 

Z2
P(1 < X < 2) = f (x ) dx = F (2) − F (1) = 0.2593
1

7. Random variables in one dimension 184
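A short numerical cross-check of the example above (not part of the original slides); the closed-form F and f are taken from the slide, and the integral is additionally approximated by a simple midpoint rule.

# Check P(1 < X < 2) = F(2) - F(1) for F(x) = (1/27)(x-3)^3 + 1, f(x) = (1/9)(x-3)^2 on [0, 3].
def F(x):
    if x < 0:
        return 0.0
    return (x - 3) ** 3 / 27 + 1 if x <= 3 else 1.0

def f(x):
    return (x - 3) ** 2 / 9 if 0 <= x <= 3 else 0.0

print(F(2) - F(1))                                   # 0.2592...

n, h = 10_000, 1 / 10_000                            # midpoint rule on (1, 2)
print(sum(f(1 + (i + 0.5) * h) for i in range(n)) * h)   # ~0.2593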


Expected values of random variables

Definition:
Let X be a random variable and f its mass or density function. Its expected value E(X )
is defined as
E(X) := Σ_{all j} x_j f(x_j) ,              if X is a discrete and

E(X) := ∫_{−∞}^{∞} x f(x) dx ,    if X is a continuously distributed random variable.

If the series or the improper integral has no finite value, then the random variable X has
no expected value.
The expected value is a parameter of location. To express its numerical value (but often
also instead of the notation E(X )), usually the Greek letter „mu“ is used:

µ or µX or µ(X )

7. Random variables in one dimension 185


Expected values of random variables
Example

A discrete random variable X has the mass function



f(x) = 1/3  for x = 1
       2/3  for x = 2
       0    otherwise

Its expected value is

E(X) = 1 · 1/3 + 2 · 2/3 = 5/3 = 1.666… = µ

7. Random variables in one dimension 186


Expected values of random variables
Example

Let a continuous random variable X have the density function

f(x) = (1/4)x  for 1 ≤ x ≤ 3
       0       otherwise

[Figure: density f with the expected value µ = 2.1667 marked]

To calculate the expected value, the integral

E(X) = ∫_{−∞}^{∞} x f(x) dx = ∫_{−∞}^{1} 0 dx + ∫_{1}^{3} x · (1/4)x dx + ∫_{3}^{∞} 0 dx

is split into three parts, from which only the middle one has to be calculated:

∫_1^3 (1/4)x² dx = [ (1/12)x³ ]_1^3 = 27/12 − 1/12 = 26/12 = 2.1667

7. Random variables in one dimension 187
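The expected value of the example can also be checked numerically; a minimal sketch, assuming SciPy is available (the density f(x) = x/4 on [1, 3] is taken from the slide).

# Numerical cross-check of E(X) = integral of x*f(x) dx.
from scipy.integrate import quad

f = lambda x: x / 4 if 1 <= x <= 3 else 0.0

expected_value, _ = quad(lambda x: x * f(x), 1, 3)
print(expected_value)    # 2.1666...  (= 26/12)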


Expected values of random variables

Expected value of a function of random variables:


Let X be a random variable with mass or density function f . Instead of X we now
consider a transformation g (X ) and ask for the expected value E[g (X )]:

E[g(X)] := Σ_{all j} g(x_j) f(x_j) ,              if X is a discrete and

E[g(X)] := ∫_{−∞}^{∞} g(x) f(x) dx ,    if X is a continuously distributed random variable.

(Again, the expected value exists only if the series or improper integral has a finite value.)

7. Random variables in one dimension 188


Expected values of random variables

Calculation rules for expected values:

1. Constant E (a ) = a

2. Factor E[b · g (X )] = b · E[g (X )]

3. Sum E[g1 (X ) + g2 (X )] = E[g1 (X )] + E[g2 (X )]

4. Linear transformation E(a + b · X ) = a + b · E(X )

Proposition (central property of the expected value)


The expected value of deviations of a random variable X from its expected value µX is
equal to zero:
E(X − µX ) = 0

Proof: Follows directly from rule 4. (Linear transformation) with b = 1 and a = −µX .

7. Random variables in one dimension 189


Variances

Definition:
Let X be a random variable and µX its expected value. The expected value of the
squared deviation of the random variable from µX
V(X) := E[(X − µ_X)²]

is called variance of the random variable X .

The positive root of this
σ_X := +√(V(X))
is called standard deviation.

If the series or the improper integral has no finite value, then the random variable X has
no variance.
The variance is a parameter of dispersion. To express its numerical value (but often
also instead of the notation V(X )), usually the Greek letter „sigma“ is used:

σ² or σ_X² or σ²(X)

7. Random variables in one dimension 190


Variances

Calculation of the variance:


Using the definition of the expected value on slide 185 we obtain for the variance
V(X) := Σ_{all j} (x_j − µ_X)² f(x_j) ,              if X is a discrete and

V(X) := ∫_{−∞}^{∞} (x − µ_X)² f(x) dx ,    if X is a continuously distributed random variable.

7. Random variables in one dimension 191


Variances
Example

Number of heads in two coin tosses:


The discrete random variable X has the distribution

x:     0    1    2
f(x):  1/4  1/2  1/4

and, because of symmetry, the expected value E(X) = 1. Its variance is

V(X) = (0 − 1)² · 1/4 + (1 − 1)² · 1/2 + (2 − 1)² · 1/4
     = 1/4 + 1/4 = 1/2 = 0.5

and the standard deviation

σ_X = 1/√2 ≈ 0.7071 .

7. Random variables in one dimension 192
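For discrete distributions like this one, expected value, variance and standard deviation can be computed directly from the mass function; a small Python sketch (not from the original slides), using the pmf of the two-coin-toss example.

# E(X), V(X) and sigma for a discrete pmf given as {value: probability}.
from math import sqrt

pmf = {0: 0.25, 1: 0.5, 2: 0.25}

mu = sum(x * p for x, p in pmf.items())
var = sum((x - mu) ** 2 * p for x, p in pmf.items())
sigma = sqrt(var)
print(mu, var, sigma)    # 1.0 0.5 0.7071...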


Variances
Example

Continuous random variable with density


f(x) = (3/2)(x − 0.5x²)  for 0 < x < 2
       0                 otherwise

Due to symmetry, the expected value is E(X) = 1. The variance is the definite integral

V(X) = ∫_0^2 (x − 1)² f(x) dx = ∫_0^2 (x − 1)² · (3/2)(x − (1/2)x²) dx

     = (3/2) ∫_0^2 ( x − (5/2)x² + 2x³ − (1/2)x⁴ ) dx

     = (3/2) [ (1/2)x² − (5/6)x³ + (1/2)x⁴ − (1/10)x⁵ ]_0^2

     = (3/2) ( 4/2 − 40/6 + 16/2 − 32/10 ) = 3 − 10 + 12 − 24/5 = 1/5 = σ_X²   ⇒   σ_X = 0.4472

7. Random variables in one dimension 193


Variances

Calculation rules for variances :

1. Constant                  V(a) = 0

2. Shift                     V(X + a) = V(X)

3. Factor                    V(b · X) = b² V(X)
                             σ_{b·X} = |b| σ_X

4. Linear transformation     V(a + b · X) = b² · V(X)
                             σ_{a+b·X} = |b| σ_X

7. Random variables in one dimension 194


Variances

The variance is defined as the „expected value of squared deviation from µ“. If we
compare it to the expected value of the squared deviation from some other value d then:

S TEINER’s translation theorem


For each constant d ∈ R it holds that

V(X) = E[(X − d)²] − (µ − d)²

where (µ − d ) is the shift (=translation) from the expected value.

Calculation rules for variances (contd.)

5. In the special case of d = 0, the following formula for the simplified calculation of
the variance is obtained
V(X) = E(X²) − µ² = E(X²) − E(X)²

7. Random variables in one dimension 195


Standardization of random variables

Using the calculation rules for expected values and variances, we can transform random
variables in a „clever way“:

Start: Random variable X with E(X) = µ < ∞ and V(X) = σ² < ∞.

Shift of X by µ:
    Y := X − µ

Stretching Y by the factor 1/σ:
    Z := Y/σ

The following happens:

    X          →    Y := X − µ    →    Z := (X − µ)/σ

    E(X) = µ        E(Y) = 0           E(Z) = 0
    V(X) = σ²       V(Y) = σ²          V(Z) = 1

7. Random variables in one dimension 196


Standardization of random variables

Definition:
If X is a random variable with expected value µ and standard deviation σ > 0, then the
transformed random variable
Z := (X − µ)/σ
is called standardized.

Each standardized random variable has mean 0 and variance 1.

7. Random variables in one dimension 197


CHEBYSHEV's inequality

Wanted: The probability that a random variable X falls into an interval between a and b:

P(a < X ≤ b) = ∫_a^b f(x) dx   (continuous)     or     P(a < X ≤ b) = Σ_{a < x_j ≤ b} f(x_j)   (discrete)

For this, however, the density or mass function f must be known. The inequality provides
an estimate even for unknown f , if at least the expected value and the standard
deviation of the distribution are known.

Theorem of CHEBYSHEV:
Let X be an arbitrary continuous or discrete random variable with expected value µ and
standard deviation σ , then the inequality

P(|X − µ| ≥ kσ) ≤ 1/k²
always holds for any k > 0 and completely independent of the distribution.

7. Random variables in one dimension 198


CHEBYSHEV's inequality

CHEBYSHEV's inequality provides an upper bound on the probability of realizations
occurring for a random variable X that are away from the expected value µ by at least
k times the standard deviation σ .

P(|X − µ| ≥ kσ) ≤ 1/k²

7. Random variables in one dimension 199


CHEBYSHEV's inequality
Example

P(|X − µ| ≥ kσ) ≤ 1/k²

For single values of k , the following estimates are obtained outside the kσ bound:

k = 1   :  P(|X − µ| ≥ σ)     ≤ 1 (trivial)
k = 1.5 :  P(|X − µ| ≥ 1.5σ)  ≤ 0.444…
k = 2   :  P(|X − µ| ≥ 2σ)    ≤ 0.25
k = 2.5 :  P(|X − µ| ≥ 2.5σ)  ≤ 0.16
k = 3   :  P(|X − µ| ≥ 3σ)    ≤ 0.111…

7. Random variables in one dimension 200


CHEBYSHEV's inequality
Example

P(|X − µ| < kσ) = 1 − P(|X − µ| ≥ kσ) ≥ 1 − 1/k²

For single values of k , the following estimates are obtained inside the kσ bound:

k = 1.5 :  P(µ − 1.5σ < X < µ + 1.5σ)  ≥ 0.555…
k = 2   :  P(µ − 2σ < X < µ + 2σ)      ≥ 0.75
k = 2.5 :  P(µ − 2.5σ < X < µ + 2.5σ)  ≥ 0.84
k = 3   :  P(µ − 3σ < X < µ + 3σ)      ≥ 0.888…

7. Random variables in one dimension 201


CHEBYSHEV's inequality
Example

Let a discrete random variable X have as possible outcomes the integers
x_i = 0, 1, 2, . . . , 12. Let its expected value be 4.5 and its variance 8/3, otherwise the
distribution is unknown.
Task:
Estimate the probability that X < 3 or X > 6, that is, that X takes one of the values
0, 1, 2 or 7, 8, . . . , 12.

0 1 2 3 4 5 6 7 8 9 10 11 12

7. Random variables in one dimension 202


Solution: (µ = 4.5, σ² = 8/3)

Since X is a discrete random variable and C HEBYSHEV’s inequality for ranges outside a k σ bound is
formulated as „≥“, we reformulate the desired probability as follows.

P(X < 3 ∪ X > 6) = P(X ≤ 2 ∪ X ≥ 7) .

Transformation of the variable (subtract µ) yields

= P(X − 4.5 ≤ −2.5 ∪ X − 4.5 ≥ 2.5) .

Combining the two inequalities results in

= P(|X − 4.5| ≥ 2.5) ≤ 1/k² ,

according to CHEBYSHEV, where it has to be 2.5 = kσ. This results in
k = 2.5/σ = 2.5/√(8/3) ≈ 1.5309 and we get

P(X < 3 ∪ X > 6) ≤ 1/k² ≈ 0.4267 .

202 - 1
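The arithmetic of the worked example can be reproduced in a few lines of Python (a sketch, not part of the original slides); only µ, σ² and the distance 2.5 from the expected value are used.

# CHEBYSHEV bound for P(X <= 2 or X >= 7) = P(|X - 4.5| >= 2.5) with sigma^2 = 8/3.
from math import sqrt

mu, var = 4.5, 8 / 3
sigma = sqrt(var)

k = 2.5 / sigma          # 2.5 = k * sigma
bound = 1 / k ** 2       # upper bound from Chebyshev's inequality
print(k, bound)          # 1.5309...  0.4266...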
Moments
Moments of a distribution are the expected values of powers of random variables.

Definition:

If X is a random variable, then the expected value of the k-th power

M_k := E(X^k)

is called the k-th moment or k-th-order moment of the distribution, if it exists. The expected
value of the k-th power of the deviation from the mean

M_k^Z := E[(X − µ)^k]

is called the k-th central moment.

202 - 1
Characterizing sequence of moments

M_1 = E(X) = µ       expected value
M_2 = E(X²)          2nd moment
M_3 = E(X³)          3rd moment
  ⋮
M_k = E(X^k)         k-th moment
  ⋮

or central moments

M_1^Z = E(X − µ) = 0         central property
M_2^Z = E[(X − µ)²] = σ²     variance
M_3^Z = E[(X − µ)³]          3rd central moment
  ⋮
M_k^Z = E[(X − µ)^k]         k-th central moment
  ⋮

202 - 2
Example:

Continuous random variable with density


f(x) = (3/2)(x − 0.5x²)  for 0 < x < 2
       0                 otherwise

Wanted: M_3 = E(X³) and M_3^Z = E[(X − µ)³]

Because of the symmetry µ = 1. It follows that

M_3 = ∫_0^2 x³ f(x) dx = ∫_0^2 x³ · (3/2)(x − (1/2)x²) dx

    = (3/2) ∫_0^2 ( x⁴ − (1/2)x⁵ ) dx = (3/2) [ (1/5)x⁵ − (1/12)x⁶ ]_0^2

    = (3/2) · 2⁵ · ( 1/5 − 1/6 ) = 1.6

202 - 3
and

M_3^Z = ∫_0^2 (x − µ)³ f(x) dx = ∫_0^2 (x − 1)³ · (3/2)(x − (1/2)x²) dx

      = (3/2) ∫_0^2 ( −x + (7/2)x² − (9/2)x³ + (5/2)x⁴ − (1/2)x⁵ ) dx

      = (3/2) [ −(1/2)x² + (7/6)x³ − (9/8)x⁴ + (1/2)x⁵ − (1/12)x⁶ ]_0^2 = 0

Definition:

The ratio
γ := E[(X − µ)³] / σ³
is called the skewness of a distribution. For symmetric distributions γ = 0.

If γ < 0, the distribution is said to have a negative skewness or to be left-skewed, left-tailed, or


skewed to the left. Distributions with positive skewness γ > 0, on the other hand, are called right-
skewed or right-tailed, or skewed to the right.

202 - 4
A right-skewed continuous and a left-skewed discrete distribution

The 4th central moment is – if it exists – positive for every distribution. It gives information about the
kurtosis or „tailedness“ of a distribution. To obtain a measure independent of scale and scatter, here
one divides by the 4th power of the standard deviation:

Definition:

The ratio
κ := E[(X − µ)⁴] / σ⁴

is called the kurtosis of a distribution.

κ = 3 is considered as normal. Distributions with larger κ values have narrower and more peaked
density functions, those with smaller κ values are more broadly curved than the normal distribution.

202 - 5
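Skewness and kurtosis of the example density above can be checked numerically; a minimal sketch, assuming SciPy is available (the density is the one used throughout this example).

# Skewness and kurtosis of f(x) = (3/2)(x - 0.5*x^2) on (0, 2) via numerical integration.
from scipy.integrate import quad

f = lambda x: 1.5 * (x - 0.5 * x ** 2)

mu, _ = quad(lambda x: x * f(x), 0, 2)
var, _ = quad(lambda x: (x - mu) ** 2 * f(x), 0, 2)
m3, _ = quad(lambda x: (x - mu) ** 3 * f(x), 0, 2)
m4, _ = quad(lambda x: (x - mu) ** 4 * f(x), 0, 2)

gamma = m3 / var ** 1.5     # skewness, 0 for this symmetric density
kappa = m4 / var ** 2       # kurtosis, 15/7 ≈ 2.14 here, i.e. flatter than the normal's 3
print(mu, var, gamma, kappa)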
Median and Quantiles

The median or central value xMed of a random variable X is a number that lies in the
middle of the distribution in such a way that the probability for X to take a value greater or
less than xMed would be just equal:

P(X ≤ x_Med) = F(x_Med) = 1/2 ,

hence

x_Med = F⁻¹(1/2) ,

if the inverse function exists. This is usually the case for continuous random variables,
but not for discrete ones. More generally:

Definition: The number x_Med is called median, if at the same time

P(X ≤ x_Med) ≥ 1/2   and
P(X ≥ x_Med) ≥ 1/2   holds.

7. Random variables in one dimension 203


Median and Quantiles

Determination of the median: 3 situations

7. Random variables in one dimension 204


Median and Quantiles
Examples

Let Y be the number of heads when flipping two coins. The median is clearly
yMed = 1, because only for this number it holds
P(Y ≤ 1) = 3/4 ≥ 1/2   and   P(Y ≥ 1) = 3/4 ≥ 1/2 .

Let X be the number of pips when rolling a die. It holds

P(X ≤ 3) = 3/6 ≥ 1/2   and   P(X ≥ 3) = 4/6 ≥ 1/2
P(X ≤ 4) = 4/6 ≥ 1/2   and   P(X ≥ 4) = 3/6 ≥ 1/2
but also any other number in the closed interval 3 ≤ xMed ≤ 4 satisfies the definition.
We choose as median the value xMed = 3.5 (arithmetic mean of the limits).

7. Random variables in one dimension 205


Median and Quantiles

Actually, the median is only a special case of the more generally defined quantiles:

Definition:
A number x[q ] with 0 < q < 1 is called q-quantile, if at the same time

P(X ≤ x_[q]) ≥ q   and
P(X ≥ x_[q]) ≥ 1 − q   holds.

Thus, the median is a 0.5-quantile or 50%-quantile. For the q-quantiles of continuous
distributions it always holds that:

F(x_[q]) = q ,    x_[q] = F⁻¹(q) ,

7. Random variables in one dimension 206


Median and Quantiles
Example

A very important distribution whose quantiles are needed by every statistician is the
so-called standard normal distribution. The sketch shows the density function of a
standard normally distributed random variable Z . Its median is

zMed = z[0.5] = 0 .

Table of some quantiles

 q       z_[q]
 0.5     0.000
 0.9     1.282
 0.95    1.645
 0.975   1.960
 0.99    2.327
 0.995   2.575

7. Random variables in one dimension 207
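The tabulated quantiles can be reproduced with the inverse of the standard normal distribution function; a short sketch assuming SciPy is available (values agree with the table up to rounding).

# Quantiles z_[q] of the standard normal distribution; ppf is the inverse of F_St.
from scipy.stats import norm

for q in (0.5, 0.9, 0.95, 0.975, 0.99, 0.995):
    print(q, round(norm.ppf(q), 3))
# 0.5 0.0, 0.9 1.282, 0.95 1.645, 0.975 1.96, 0.99 2.326, 0.995 2.576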


Why is the normal distribution important?

Example: Stock returns

[Figure: empirical density function of stock returns]

Many variables in economics are (approximately) normally distributed.

7. Random variables in one dimension 208


Control questions

1. Describe the difference between a random variable and a „normal“ variable in your own
words.
2. What must the codomain of a random variable be like for it to be called continuous?
3. Which properties must a mass function have, which a density function?
4. How are density function and distribution function related? How to extract the mass
function from the distribution function?
5. Is the expected value the most likely value for a random variable?
6. What is measured by the variance of a random variable?
7. What is the effect of standardizing? What characterizes a standardized random
variable?
8. Can you make probability statements about random variables whose distribution you
do not know?
9. To what extent can C HEBYSHEV’s inequality be said to provide a „rough“ estimate?

7. Random variables in one dimension 209


8 Multidimensional random variables

8.1 Joint distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .211


8.2 Marginal distributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .217
8.3 Conditional distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
8.4 Stochastic independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
8.5 Covariance and correlation coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
8.6 Sum of random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232

according to [Schira, 2016], chapter 10


see also: [Anderson et al., 2012], chapter 6; and [Newbold et al., 2013], chapter 5

8. Multidimensional random variables 210


Joint distribution
Discrete random variables

Definition:
For a two-dimensional discrete random variable (X , Y ) the function

f (x , y ) := P(X = x ∩ Y = y )

is called the joint probability mass function of X and Y .

Properties:
The following always holds:
1. f(x, y) ≥ 0,

2. Σ_i Σ_j f(x_i, y_j) = 1,

3. f(x_i, y_j) ≤ 1 for all i and j.

8. Multidimensional random variables 211


Joint distribution
Discrete random variables

Matrix of the probability masses: pij := f (xi , yj )

y1 y2 ... yj ... yl Σ
x1 p11 p12 ... p1j ... p1l p1•
x2 p21 p22 ... p2j ... p2l p2•
. . . . . .
. . . . . .
. . . . . .
xi pi1 pi2 ... pij ... pil pi •
. . . . . .
. . . . . .
. . . . . .
xk pk 1 pk 2 ... pkj ... pkl pk •
Σ p•1 p•2 ... p•j ... p•l 1

Margins:     p_{i•} = Σ_j p_ij ,     p_{•j} = Σ_i p_ij

8. Multidimensional random variables 212


Joint distribution
Discrete random variables

Example: Draw two balls from an urn with replacement


The urn contains 6 balls (see figure). Let the random variable
(X , Y ) be defined as

X = „number on 1st ball“ ,


Y = „number on 2nd ball“ .
It is

P(X = 1) = 1/2     P(X = 2) = 1/3     P(X = 3) = 1/6
P(Y = 1) = 1/2     P(Y = 2) = 1/3     P(Y = 3) = 1/6

and because of independence

P(X = 1 ∩ Y = 1) = P(X = 1) · P(Y = 1) = 1/4
P(X = 1 ∩ Y = 2) = P(X = 1) · P(Y = 2) = 1/6   . . . etc.

8. Multidimensional random variables 213


Joint distribution
Discrete random variables

Example: Draw two balls from the same urn without replacement
For the case of drawing without replacement, the calculation of the
joint distribution is a bit more difficult, because now the general
multiplication theorem has to be applied: The corresponding
conditional probabilities for the component Y , after a „1“ has
already appeared at the 1st draw and has not been replaced are:

P(Y = 1|X = 1) = 2/5     P(Y = 2|X = 1) = 2/5     P(Y = 3|X = 1) = 1/5

Now we have

P(X = 1 ∩ Y = 1) = P(X = 1) · P(Y = 1|X = 1) = 1/5
P(X = 1 ∩ Y = 2) = P(X = 1) · P(Y = 2|X = 1) = 1/5   . . . etc.

8. Multidimensional random variables 214


Joint distribution
Discrete random variables

Example: Draw two balls from an urn

[1] with replacement                      [2] without replacement

            Y                                         Y
        1     2     3     Σ                       1     2     3     Σ
    1   1/4   1/6   1/12  1/2                 1   1/5   1/5   1/10  1/2
X   2   1/6   1/9   1/18  1/3             X   2   1/5   1/15  1/15  1/3
    3   1/12  1/18  1/36  1/6                 3   1/10  1/15  0     1/6
    Σ   1/2   1/3   1/6   1                   Σ   1/2   1/3   1/6   1

Joint distributions of (X , Y )

8. Multidimensional random variables 215


Joint distribution
Discrete random variables

Definition:
The joint distribution function

F (x , y ) = P(X ≤ x ∩ Y ≤ y )

indicates the probability with which the random variable X takes on values less than or
equal to x and at the same time the random variable Y takes on values less than or
equal to y . F is obtained by summing up the joint mass function:
F(x, y) = Σ_{x_i ≤ x} Σ_{y_j ≤ y} f(x_i, y_j)

8. Multidimensional random variables 216


Joint distribution of continuous random variables
Also for the continuous case one can specify the joint distribution of random variables in several
dimensions. As with the one-dimensional variables, the summation is replaced by integration. The
naming is analogous to the one-dimensional case:

discrete: probability mass function


continuous: probability density function

Definition: For a two-dimensional continuous random variable (X , Y ), the function f (x , y ) with

∫_a^b ∫_c^d f(x, y) dy dx = P(a < X ≤ b ∩ c < Y ≤ d)

for a < b and c < d is called the joint probability density function of X and Y .

Properties : The following always holds:

1. f(x, y) ≥ 0,

2. ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dy dx = 1.

216 - 1
Example:

The density function of the two-dimensional standard


normal distribution is given by

f(x, y) = (1/(2π)) e^(−(x² + y²)/2)

Again, the joint distribution function can be given analogously to the discrete case. The summation
of the mass function is simply replaced by the integration of the density function:

Definition: The joint distribution function

F (x , y ) = P(X ≤ x ∩ Y ≤ y )

indicates the probability with which the random variable X takes on values less than or equal to x
and at the same time the random variable Y takes on values less than or equal to y . F is obtained
by integrating the joint density function:
F(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f(u, v) dv du

216 - 2
Marginal distributions
Discrete random variables

Definition:
The distribution of a single component of a multidimensional random variable without
regard to the other components is called marginal distribution.
For discrete random variables, the marginal distributions are calculated as the column
and row sums:
f_X(x_i) := P(X = x_i) = Σ_j f(x_i, y_j) = p_{i•}

f_Y(y_j) := P(Y = y_j) = Σ_i f(x_i, y_j) = p_{•j}

8. Multidimensional random variables 217


Marginal distributions
Discrete random variables

Example: Toss a coin 4 times

X = number of heads,   Y = number of changes

Joint distribution f(x, y) with marginal distributions:

                  X
           0     1     2     3     4     Σ
   Y  0    1/16  0     0     0     1/16  1/8
      1    0     1/8   1/8   1/8   0     3/8
      2    0     1/8   1/8   1/8   0     3/8
      3    0     0     1/8   0     0     1/8
      Σ    1/16  1/4   3/8   1/4   1/16  1

marginal distribution of X                      marginal distribution of Y

x_i        0     1    2    3    4              y_i        0    1    2    3
f_X(x_i)   1/16  1/4  3/8  1/4  1/16           f_Y(y_i)   1/8  3/8  3/8  1/8

8. Multidimensional random variables 218
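The marginal distributions are simply the row and column sums of the joint mass function; a small Python sketch (not from the original slides) using the joint distribution of the four coin tosses above.

# Marginals of the joint pmf of (Y = changes, X = heads) for 4 coin tosses.
from fractions import Fraction as Fr

joint = {  # (y, x): f(x, y); cells not listed are zero
    (0, 0): Fr(1, 16), (0, 4): Fr(1, 16),
    (1, 1): Fr(1, 8), (1, 2): Fr(1, 8), (1, 3): Fr(1, 8),
    (2, 1): Fr(1, 8), (2, 2): Fr(1, 8), (2, 3): Fr(1, 8),
    (3, 2): Fr(1, 8),
}

f_X = {x: sum(p for (y, xx), p in joint.items() if xx == x) for x in range(5)}
f_Y = {y: sum(p for (yy, x), p in joint.items() if yy == y) for y in range(4)}
print(f_X)   # X-marginal: 1/16, 1/4, 3/8, 1/4, 1/16
print(f_Y)   # Y-marginal: 1/8, 3/8, 3/8, 1/8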


Marginal distributions
Discrete random variables

Expected values and variances of the components:


For multidimensional discrete random variables with a joint distribution, one calculates
the expected value and variance and further moments of the individual components
using the corresponding marginal distributions:

µ_X = E(X) = Σ_i x_i f_X(x_i)

σ_X² = V(X) = Σ_i (x_i − µ_X)² f_X(x_i)

8. Multidimensional random variables 219


Marginal distributions of continuous random variables

Definition:

The distribution of a single component of a multidimensional random variable without regard to the
other components is called marginal distribution.

For continuous random variables, it is the integrals:

f_X(x) := ∫_{−∞}^{∞} f(x, y) dy

f_Y(y) := ∫_{−∞}^{∞} f(x, y) dx

219 - 1
Expected values and variances of the components:

For multidimensional continuous random variables with a joint distribution, one calculates the ex-
pected value and variance and further moments of the individual components using the correspond-
ing marginal distributions via integration:

µ_X = E(X) = ∫_{−∞}^{∞} x f_X(x) dx

σ_X² = V(X) = ∫_{−∞}^{∞} (x − µ_X)² f_X(x) dx

219 - 2
Conditional distributions
Discrete random variables

The conditional distributions provide information about the distribution of one variable
under the constraint that the other takes a certain value.

Definition:
For discrete random variables (X , Y ) we define the mass function f1 of the conditional
distribution of X under the condition Y = yj as

f_1(x|y_j) := f(x, y_j) / f_Y(y_j)     for j = 1, . . . , m

and correspondingly the conditional distribution of Y under the condition X = x_i

f_2(y|x_i) := f(x_i, y) / f_X(x_i)     for i = 1, . . . , n

(There are m conditional distributions of X and n conditional distributions of Y ).

8. Multidimensional random variables 220


Conditional distributions

Example: Why are conditional distributions important?


Consider f (x , y ) with
X : Distribution of stock returns and
Y : economic growth.
What do the conditional distributions of stock returns f (x |Y = good) and f (x |Y = bad)
look like?
From the point of view of an investor or asset manager, conditional distributions of
security returns are particularly important!
Conditions (represented by the random variable Y) are manifold, e.g.
Y1 = economic growth
Y2 = central bank policy
Y3 = interest rate development
..
.

8. Multidimensional random variables 221


Conditional distributions
Example

From the values of the joint distribution from the previous example (slide 218) one
calculates four conditional distributions of X

X          0     1      2      3      4
f_1(x|0)   0.5   0      0      0      0.5     1
f_1(x|1)   0     0.333  0.333  0.333  0       1
f_1(x|2)   0     0.333  0.333  0.333  0       1
f_1(x|3)   0     0      1      0      0       1

and five conditional distributions of Y

Y          0     1      2      3
f_2(y|0)   1     0      0      0       1
f_2(y|1)   0     0.5    0.5    0       1
f_2(y|2)   0     0.333  0.333  0.333   1
f_2(y|3)   0     0.5    0.5    0       1
f_2(y|4)   1     0      0      0       1

8. Multidimensional random variables 222


Conditional distributions of continuous random variables
Analogously we obtain the

Definition:

For continuous random variables, the conditional distributions are defined by the density function

f_1(x|y) := f(x, y) / f_Y(y)     and     f_2(y|x) := f(x, y) / f_X(x) .

222 - 1
Stochastic independence

Basic idea:
If the conditional distributions of X for different conditions y1 and y2 are different

f_1(x|y1) ≠ f_1(x|y2) ,

it means that the distribution of X depends on what value Y takes. In this case, X and Y
are said to be stochastically dependent.
In order to assess a joint distribution, it is particularly important to know whether X and Y
are dependent or independent.
Example:
Are DAX 30 returns distributed differently in January than in the rest of the months
(x = DAX return, y1 = January, y2 = February to y12 = December)?

8. Multidimensional random variables 223


Stochastic independence

Definition:
The random variables X and Y are called stochastically independent, or independent
for short, if the joint mass or density function

f (x , y ) = fX (x ) · fY (y )

is just equal to the product of the two marginal distributions.

In the case of independence, all conditional distributions are equal – and equal to the
corresponding marginal distribution:

f_1(x|y) = f_X(x) · f_Y(y) / f_Y(y) = f_X(x)

f_2(y|x) = f_Y(y) · f_X(x) / f_X(x) = f_Y(y)

8. Multidimensional random variables 224


Stochastic independence
Example

In the joint distributions from the previous example (slides 213-215) the random variables
are stochastically independent [1] resp. dependent [2]. The conditional distributions are
equal [1] or unequal [2] to the marginal distribution:

[1] independent                            [2] dependent

Y          1     2     3                   Y          1     2     3
f_2(y|1)   1/2   1/3   1/6   1             f_2(y|1)   2/5   2/5   1/5   1
f_2(y|2)   1/2   1/3   1/6   1             f_2(y|2)   3/5   1/5   1/5   1
f_2(y|3)   1/2   1/3   1/6   1             f_2(y|3)   3/5   2/5   0     1
f_Y(y)     1/2   1/3   1/6   1             f_Y(y)     1/2   1/3   1/6   1

8. Multidimensional random variables 225


Covariance and correlation coefficient

Definition:
Let X and Y be components of a two-dimensional random variable with expected values
µX and µY . The quantity

Cov(X , Y ) := E[(X − µX )(Y − µY )]

is called covariance of X and Y .

8. Multidimensional random variables 226


For practical calculation of the covariance one needs the joint distribution f (x , y ):

Cov(X, Y) = Σ_i Σ_j (x_i − µ_X)(y_j − µ_Y) f(x_i, y_j)

for discrete or

Cov(X, Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − µ_X)(y − µ_Y) f(x, y) dy dx

for continuously distributed random variables.

Using the calculation rules for expected values, the definition of covariance can be reformulated:

Cov(X , Y ) : = E[(X − µX )(Y − µY )] = E(XY − X µY − µX Y + µX µY )


= E(XY ) − E(X )µY − µX E(Y ) + µX µY

Applying E(X ) = µX and E(Y ) = µY we obtain an

226 - 1
Covariance and correlation coefficient

Alternative way of calculating the covariance :

Cov(X , Y ) = E(XY ) − E(X )E(Y )

This results in the

Multiplication theorem for expected values:


If X and Y are arbitrary random variables for which E(X) and E(Y) exist, then always

E(XY) = E(X)E(Y) + Cov(X, Y) .

If X and Y are independent, then

E(XY) = E(X)E(Y) .

8. Multidimensional random variables 227


Attention: again, the theorem cannot be reversed. The following is the case

correct:    X and Y are independent  ⇒  Cov(X, Y) = 0

correct:    Cov(X, Y) ≠ 0  ⇒  X and Y are dependent

incorrect:  Cov(X, Y) = 0  ⇒  X and Y are independent

227 - 1
Covariance and correlation coefficient

Example: Draw two balls from an urn without replacement (see slides 213-215)

Expected values:

E(X) = 1 · 1/2 + 2 · 1/3 + 3 · 1/6 = 5/3 = E(Y)

Variances:

E(X²) = 1 · 1/2 + 4 · 1/3 + 9 · 1/6 = 10/3 = E(Y²)

V(X) = E(X²) − E(X)² = 10/3 − 25/9 = 5/9 = V(Y)

Covariance (using the joint distribution [2] from slide 215):

E(XY) = (1·1)/5 + (1·2)/5 + (1·3)/10 + (2·1)/5 + (2·2)/15 + (2·3)/15 + (3·1)/10 + (3·2)/15 + 3·3·0 = 8/3

Cov(X, Y) = E(XY) − E(X)E(Y) = 8/3 − (5/3)·(5/3) = −1/9

8. Multidimensional random variables 228
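The same calculation can be done mechanically from the joint mass function; a short Python sketch (not from the original slides), with the joint distribution of the urn example entered as exact fractions.

# Covariance from the joint distribution (two balls drawn without replacement).
from fractions import Fraction as Fr

joint = {(1, 1): Fr(1, 5), (1, 2): Fr(1, 5), (1, 3): Fr(1, 10),
         (2, 1): Fr(1, 5), (2, 2): Fr(1, 15), (2, 3): Fr(1, 15),
         (3, 1): Fr(1, 10), (3, 2): Fr(1, 15), (3, 3): Fr(0)}

E_X = sum(x * p for (x, y), p in joint.items())
E_Y = sum(y * p for (x, y), p in joint.items())
E_XY = sum(x * y * p for (x, y), p in joint.items())

cov = E_XY - E_X * E_Y
print(E_X, E_Y, E_XY, cov)    # 5/3, 5/3, 8/3, -1/9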


Covariance and correlation coefficient

Further Example: Toss a coin 4 times (see slide 218)


Expected values (due to symmetry):

E(X) = 2 ,   E(Y) = 1.5

Variances:

E(X²) = 0 · 1/16 + 1 · 1/4 + 4 · 3/8 + 9 · 1/4 + 16 · 1/16 = 5
E(Y²) = 0 · 1/8 + 1 · 3/8 + 4 · 3/8 + 9 · 1/8 = 3

V(X) = E(X²) − E(X)² = 5 − 4 = 1
V(Y) = E(Y²) − E(Y)² = 3 − 2.25 = 0.75

Covariance (using the joint distribution from slide 218):

E(XY) = (1·1)/8 + (2·1)/8 + (3·1)/8 + (1·2)/8 + (2·2)/8 + (3·2)/8 + (2·3)/8
      = (1 + 2 + 3 + 2 + 4 + 6 + 6)/8 = 24/8 = 3

Cov(X, Y) = E(XY) − E(X)E(Y) = 3 − 2 · 1.5 = 0

although X and Y are dependent!


8. Multidimensional random variables 229
Covariance and correlation coefficient

Properties/calculation rules of the covariance:

1. Relation to variance V(X ) = Cov(X , X )

2. Sign Cov(X , −Y ) = −Cov(X , Y )

3. Symmetry Cov(X , Y ) = Cov(Y , X )

4. Bilinearity Cov(aX + b, cY + d ) = a c Cov(X , Y )


Cov(X , (eY + f ) + (gZ + h)) = e Cov(X , Y ) + g Cov(X , Z )

8. Multidimensional random variables 230


Covariance and correlation coefficient

Definition:
The ratio of the covariance and the standard deviations of X and Y
ρ_XY := Cov(X, Y) / (σ_X · σ_Y)
is called correlation coefficient of X and Y .

Properties :
The correlation coefficient
1. has the same sign as the covariance,
2. is normalized: −1 ≤ ρXY ≤ 1,
3. and indicates the strength of the linear stochastic relationship, independent of the
magnitudes and variances of the two variables.

8. Multidimensional random variables 231


Sum of random variables

Let X and Y be random variables and their joint distribution f (x , y ) be known. We define
a new random variable X + Y .
Question: What does the distribution fX +Y look like?
First: expected value and variance

Addition theorem for expected values:


Let X and Y be any random variables for which E(X ) and E(Y ) exist. Then the expected
value of the sum is always equal to the sum of the expected values and the expected
value of the difference is equal to the difference of the expected values:

E(X + Y ) = E(X ) + E(Y )


E(X − Y ) = E(X ) − E(Y )

8. Multidimensional random variables 232


Sum of random variables

Addition theorem for variances:


If X and Y are any random variables for which V(X ) and V(Y ) exist, then:

V(X + Y ) = V(X ) + V(Y ) + 2Cov(X , Y )


V(X − Y ) = V(X ) + V(Y ) − 2Cov(X , Y )

However, if X and Y are uncorrelated, then

V(X + Y ) = V(X ) + V(Y )


V(X − Y ) = V(X ) + V(Y )

8. Multidimensional random variables 233


Verification by calculation:

V(X + Y) = E[ ((X + Y) − (µ_X + µ_Y))² ]
         = E[ ((X − µ_X) + (Y − µ_Y))² ]
         = E[ (X − µ_X)² + (Y − µ_Y)² + 2(X − µ_X)(Y − µ_Y) ]
         = E[(X − µ_X)²] + E[(Y − µ_Y)²] + 2 E[(X − µ_X)(Y − µ_Y)]
         = V(X) + V(Y) + 2 Cov(X, Y)

233 - 1
Sum of random variables

Properties :

Notation using the correlation coefficient

σ²_{X+Y} = σ_X² + σ_Y² + 2 σ_X σ_Y ρ_XY

Estimate for the variance if one does not know the covariance:

(σ_X − σ_Y)² ≤ σ²_{X+Y} ≤ (σ_X + σ_Y)²

Triangle inequality for standard deviations

|σ_X − σ_Y| ≤ σ_{X+Y} ≤ σ_X + σ_Y

8. Multidimensional random variables 234


Arithmetic mean of random variables
If, for example, one wants to achieve a high accuracy in a measurement, one will take the arithmetic
mean of many individual measurements. Each single measurement is a random variable Xi and has
the actual measured value µ as the expected value if the measurement is arranged appropriately. If
each measurement is made under the same conditions, the variables are stochastically independent
and have the same variance.

We denote with

X̄ := (1/n)(X1 + X2 + · · · + Xn)

the random variable „arithmetic mean“ and calculate

E(X̄) = E[ (1/n)(X1 + X2 + · · · + Xn) ]
      = (1/n)( E(X1) + E(X2) + · · · + E(Xn) )
      = (1/n)(µ + µ + · · · + µ) = (1/n)(nµ) = µ

234 - 1
and

V(X̄) = V[ (1/n)(X1 + X2 + · · · + Xn) ]
      = (1/n²)( V(X1) + V(X2) + · · · + V(Xn) )
      = (1/n²)(σ² + σ² + · · · + σ²) = (1/n²)(nσ²) = σ²/n .

The following two important propositions follow from the calculations.

234 - 2
Sum of random variables

Proposition:
If n random variables Xi have the expected value E(Xi ) = µ, then their arithmetic mean
has the same expected value
E(X̄ ) = µ .


The √n law:
If n independent random variables have the same standard deviation σ , then the standard
deviation of their arithmetic mean

σ_X̄ = σ/√n

is smaller by a factor √n.

8. Multidimensional random variables 235
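The √n law is easy to see in a simulation; the following sketch (not from the original slides) uses repeated dice rolls, whose single-roll standard deviation √(35/12) appears later in the uniform-distribution chapter.

# sqrt(n) law: the standard deviation of the mean of n dice rolls shrinks like sigma/sqrt(n).
import numpy as np

rng = np.random.default_rng(0)
sigma = np.sqrt(35 / 12)                       # std. deviation of a single die roll

for n in (1, 4, 16, 64):
    means = rng.integers(1, 7, size=(100_000, n)).mean(axis=1)
    print(n, means.std(), sigma / np.sqrt(n))  # empirical vs. theoretical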


Control questions

1. What do you need the concept of multidimensional random variables for?


2. What does the joint mass function and what does the joint density function indicate?
3. „From the two marginal distributions of a two-dimensional random variable, the joint
distribution can be calculated.“
Under which conditions is this proposition correct?
4. What is the difference between a marginal distribution and a conditional distribution?
5. What do you use to calculate the expected value and variance of the components of a
multidimensional random variable? The marginal distribution or the joint distribution?
6. What does „stochastic independence“ mean?
7. Why does V(X ) + V(Y ) = V(X + Y ) not imply the stochastic independence of X and
Y ? Are X and Y independent if their covariance is zero?
8. What is the covariance a measure of?
9. Why do you use the correlation coefficient and not just the covariance?

8. Multidimensional random variables 236


9 Stochastic models and special distributions

9.1 Stochastic models and special distributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .238


9.2 Uniform distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
9.3 B ERNOULLI distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
9.4 Binomial distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
9.5 Normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251

according to [Schira, 2016], chapter 11


see also: [Anderson et al., 2012], chapter 5 & 6; and [Newbold et al., 2013], chapter 4 & 5

9. Stochastic models and special distributions 237


Uniform distribution
Discrete random variables

Example: Dice roll, lotto, roulette


P(X = x1) = P(X = x2) = · · · = P(X = x_m) = 1/m

Definition:
A discrete random variable X with mass function
f_X(x) = f_U(x; m) = 1/m  for x = x1, x2, . . . , x_m
                     0    otherwise

is called discrete uniformly distributed or for short U(m)-distributed.

For the special case x1 = 1, x2 = 2, . . . , xm = m (e.g. dice, lottery numbers) it holds:

E(X) = (m + 1)/2 ,     V(X) = (m² − 1)/12 .
2 12

9. Stochastic models and special distributions 238


Uniform distribution
Example: classical L APLACE experiment dice roll

Mass function of the random variable X =„number of pips“ with m = 6:

f_X(x) = f_U(x; 6) = 1/6  for x = 1, 2, . . . , 6
                     0    otherwise

with distribution function

                       0    for x < 1
                       1/6  for 1 ≤ x < 2
                       2/6  for 2 ≤ x < 3
F_X(x) = F_U(x; 6) =   3/6  for 3 ≤ x < 4
                       4/6  for 4 ≤ x < 5
                       5/6  for 5 ≤ x < 6
                       1    for 6 ≤ x

expected value and variance:

E(X) = (m + 1)/2 = 3.5

V(X) = (m² − 1)/12 = 35/12 = 2.9167   ⇒   σ_X = 1.7078
9. Stochastic models and special distributions 239
Uniform distribution
Example: classical L APLACE experiment dice roll

mass function

distribution function

9. Stochastic models and special distributions 240


Uniform distribution
Continuous random variables

Example: Waiting time for the subway at fixed frequency

Definition:
A continuous random variable X with density function
f_X(x) = f_U(x; a, b) = 1/(b − a)  for a ≤ x ≤ b
                        0          otherwise

is called uniformly distributed in the interval [a, b] or for short U(a, b)-distributed.

For geometric reasons, this is often referred to as a rectangular distribution.

Properties :

E(X) = (a + b)/2 ,     V(X) = (b − a)²/12 .
2 12

9. Stochastic models and special distributions 241


Uniform distribution
Example

Density function of the random variable X =„waiting time for the subway“ running every
10 minutes with a = 0 and b = 10:

f_X(x) = f_U(x; 0, 10) = 1/10  for 0 ≤ x ≤ 10
                         0     otherwise

with distribution function

                         0      for x < 0
F_X(x) = F_U(x; 0, 10) = x/10   for 0 ≤ x < 10
                         1      for 10 ≤ x

expected value and variance:

E(X) = (0 + 10)/2 = 5

V(X) = (10 − 0)²/12 = 100/12 = 8.3333   ⇒   σ_X = 2.8868

9. Stochastic models and special distributions 242


BERNOULLI distribution

Example: Share rises/falls, product successful/unsuccessful, . . .

so-called BERNOULLI experiments
possible outcomes: success and failure
one single parameter: probability of success P(success) = p.

Definition:
A discrete random variable X with the mass function

f_X(x) = f_Be(x; p) = 1 − p  for x = 0
                      p      for x = 1
                      0      otherwise

is called BERNOULLI distributed with parameter p.

Properties : expected value and variance

E(X) = p     V(X) = p · (1 − p) = p · q ,   where q := 1 − p

9. Stochastic models and special distributions 243


BERNOULLI distribution
Example: probability of success p = 1/3

mass function

distribution function

9. Stochastic models and special distributions 244


B ERNOULLI distribution

Example:
In order to start in ludo (Mensch ärgere dich nicht), you must
roll at least one six on three rolls (=„success“).

The probability of success is p = 1 − (5/6)³ = 0.4213

From this we calculate:

expected value: µ = p = 0.4213
variance: σ² = p · q = 0.4213 · 0.5787 = 0.2438
standard deviation: σ = 0.4938

9. Stochastic models and special distributions 245


Binomial distribution

Example: Share rises/falls over several days, number of defective products in a


production series, . . .
Several B ERNOULLI experiments with the same success probability p are performed
independently (one after the other or simultaneously).

Definition:
A discrete random variable X with the mass function
f_X(x) = f_B(x; n, p) = (n choose x) · p^x · (1 − p)^(n−x)

for x = 0, 1, . . . , n, where n is a natural number and 0 < p < 1 is a real number


between zero and one, is called binomially distributed or short B(n, p)-distributed.
n: number of tries, x: number of successes, p: probability of success

Properties : expected value and variance

E (X ) = n · p V(X ) = n · p · (1 − p)

9. Stochastic models and special distributions 246


From a purely mathematical point of view: The individual probability masses are the summands from
the binomial formula

(p + q)^n = Σ_{x=0}^{n} (n choose x) p^x q^(n−x)

All binomial distributions with p = 1/2 are symmetric. Probabilities p < 1/2 result in right-skewed
distributions, p > 1/2 result in left-skewed distributions:

246 - 1
Binomial distribution

Example: Urn model

The share of red balls in an urn is p.


From this, a random sample of size n is drawn with replacement.

The random variable

„number of red balls in the sample“ X : 0, 1, . . . , n

is binomially distributed.

9. Stochastic models and special distributions 247


Binomial distribution
Example

Let there be 20 balls in an urn, four of which are red. Let X be the number of red balls if
we draw from it three balls with replacement. We calculate with p = 4/20 = 0.2 and n = 3.

P(X = 0) = f_B(0; 3, 0.2) = (3 choose 0) · 0.2⁰ · 0.8³ = 1 · 1 · 0.512 = 0.512
P(X = 1) = f_B(1; 3, 0.2) = (3 choose 1) · 0.2¹ · 0.8² = 3 · 0.2 · 0.64 = 0.384
P(X = 2) = f_B(2; 3, 0.2) = (3 choose 2) · 0.2² · 0.8¹ = 3 · 0.04 · 0.8 = 0.096
P(X = 3) = f_B(3; 3, 0.2) = (3 choose 3) · 0.2³ · 0.8⁰ = 1 · 0.008 · 1 = 0.008

The sum of these four probabilities is one and it is:


E(X ) = 3 · 0.2 = 0.6
V(X ) = 3 · 0.2 · 0.8 = 0.48
σX = 0.6928.

9. Stochastic models and special distributions 248
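The binomial probabilities of the urn example can be reproduced directly from the formula; a short Python sketch (not from the original slides) using the standard-library binomial coefficient.

# Binomial probabilities f_B(x; n, p) for n = 3, p = 0.2.
from math import comb

def binom_pmf(x, n, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

probs = [binom_pmf(x, 3, 0.2) for x in range(4)]
print(probs, sum(probs))        # ≈ [0.512, 0.384, 0.096, 0.008], sum ≈ 1

n, p = 3, 0.2
print(n * p, n * p * (1 - p))   # E(X) = 0.6, V(X) = 0.48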




Binomial distribution

Example:

Imagine p = 30 % of eligible voters wanted to vote left. A random sample


of size n = 12 is to be asked about their voting intention. What will be
the outcome of the sample?


It is µ = 12 · 0.3 = 3.6 with a standard deviation of σ_X = √(12 · 0.3 · 0.7) = 1.5875. We
also compute the probability P(X > 6) that left voters are in the majority in the sample,
using the binomial distribution table:

P(X > 6) = 1 − P(X ≤ 6) = 1 − FB (6; 0.3, 12)


= 1 − 0.9614 = 0.0386

9. Stochastic models and special distributions 249


Binomial distribution

9. Stochastic models and special distributions 250


Normal distribution

It is the most important distribution of all in statistics.


Why important? Some reasons
Many empirically observed distributions correspond at least approximately to the
normal distribution
It gives a good approximation for certain discrete distributions, such as the binomial
distribution and the POISSON distribution
Distribution of sample means approaches the normal distribution the larger the sample is
Basis of many theoretical models

Examples: Energy consumption, stock returns, . . .

9. Stochastic models and special distributions 251


Normal distribution
Standard normal distribution

Definition:
A continuous random variable Z with the density function
f_St(z) := (1/√(2π)) · e^(−z²/2)

for −∞ < z < ∞ is called standard-normally distributed or short N(0, 1)-distributed.

Note:
There are many normal distributions, but this one is standard because it has an
expected value of 0 and a standard deviation of 1:

E(Z ) = 0 , V (Z ) = 1 .

9. Stochastic models and special distributions 252


Normal distribution

Properties of the „bell curve“:


[Figure: density of the standard normal distribution („bell curve“)]

Maximum at z = 0
Points of inflection at z = −1 and z = 1
Quickly asymptotes to the x-axis

While one can easily calculate the values of the density function with the pocket
calculator, the integral of the distribution function

F_St(z) = ∫_{−∞}^{z} (1/√(2π)) e^(−u²/2) du

is not elementary. Therefore there are tables for it – already since LAPLACE.
9. Stochastic models and special distributions 253
Normal distribution

Probabilities as areas below the density function:

distribution function (tables):

P(Z ≤ z) = ∫_{−∞}^{z} f_St(u) du = F_St(z)

P(Z ≥ z) = ∫_{z}^{∞} f_St(u) du = 1 − F_St(z)

interval:

P(a < Z ≤ b) = ∫_a^b f_St(u) du = F_St(b) − F_St(a)

9. Stochastic models and special distributions 254


Normal distribution
Table

z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09

0.0 .5000 .5040 .5080 .5120 .5160 .5199 .5239 .5279 .5319 .5359
0.1 .5398 .5438 .5478 .5517 .5557 .5596 .5636 .5675 .5714 .5753
0.2 .5793 .5832 .5871 .5910 .5948 .5987 .6026 .6064 .6103 .6141
0.3 .6179 .6217 .6255 .6293 .6331 .6368 .6406 .6443 .6480 .6517
0.4 .6554 .6591 .6628 .6664 .6700 .6736 .6772 .6808 .6844 .6879
0.5 .6915 .6950 .6985 .7019 .7054 .7088 .7123 .7157 .7190 .7224
0.6 .7257 .7291 .7324 .7357 .7389 .7422 .7454 .7486 .7517 .7549
0.7 .7580 .7611 .7642 .7673 .7704 .7734 .7764 .7794 .7823 .7852
0.8 .7881 .7910 .7939 .7967 .7995 .8023 .8051 .8078 .8106 .8133
0.9 .8159 .8186 .8212 .8238 .8264 .8289 .8315 .8340 .8365 .8389
1.0 .8413 .8438 .8461 .8485 .8508 .8531 .8554 .8577 .8599 .8621
1.1 .8643 .8665 .8686 .8708 .8729 .8749 .8770 .8790 .8810 .8830
1.2 .8849 .8869 .8888 .8907 .8925 .8944 .8962 .8980 .8997 .9015
1.3 .9032 .9049 .9066 .9082 .9099 .9115 .9131 .9147 .9162 .9177
1.4 .9192 .9207 .9222 .9236 .9251 .9265 .9279 .9292 .9306 .9319
1.5 .9332 .9345 .9357 .9370 .9382 .9394 .9406 .9418 .9429 .9441
1.6 .9452 .9463 .9474 .9484 .9495 .9505 .9515 .9525 .9535 .9545
1.7 .9554 .9564 .9573 .9582 .9591 .9599 .9608 .9616 .9625 .9633
1.8 .9641 .9649 .9656 .9664 .9671 .9678 .9686 .9693 .9699 .9706
1.9 .9713 .9719 .9726 .9732 .9738 .9744 .9750 .9756 .9761 .9767
2.0 .9772 .9778 .9783 .9788 .9793 .9798 .9803 .9808 .9812 .9817
2.1 .9821 .9826 .9830 .9834 .9838 .9842 .9846 .9850 .9854 .9857
2.2 .9861 .9864 .9868 .9871 .9875 .9878 .9881 .9884 .9887 .9890
2.3 .9893 .9896 .9898 .9901 .9904 .9906 .9909 .9911 .9913 .9916
2.4 .9918 .9920 .9922 .9925 .9927 .9929 .9931 .9932 .9934 .9936
2.5 .9938 .9940 .9941 .9943 .9945 .9946 .9948 .9949 .9951 .9952
2.6 .9953 .9955 .9956 .9957 .9959 .9960 .9961 .9962 .9963 .9964
2.7 .9965 .9966 .9967 .9968 .9969 .9970 .9971 .9972 .9973 .9974
2.8 .9974 .9975 .9976 .9977 .9977 .9978 .9979 .9979 .9980 .9981
2.9 .9981 .9982 .9982 .9983 .9984 .9984 .9985 .9985 .9986 .9986
3.0 .9987 .9987 .9987 .9988 .9988 .9989 .9989 .9989 .9990 .9990

because of symmetry: FSt (−z ) = 1 − FSt (z )

9. Stochastic models and special distributions 255


Normal distribution

Symmetric intervals:

P(−z < Z ≤ z) = ∫_{−z}^{z} f_St(u) du =: D(z)

Calculation: D (z ) = FSt (z ) − FSt (−z ) = 2FSt (z ) − 1

Examples:

P(Z ≤ 0) = 0.5 = FSt (0)


P(Z ≤ 1) = 0.8413 = FSt (1)
P(Z ≤ 1.8) = 0.9641 = FSt (1.8)

P(−1 < Z ≤ 1) = 0.6826 = D (1)


P(−2 < Z ≤ 2) = 0.9544 = D (2)
P(−1.96 ≤ Z ≤ 1.96) = 0.95 = D (1.96)

P(−1 < Z ≤ 2.5) = 0.9938 − 0.1587 = 0.8351

9. Stochastic models and special distributions 256


Normal distribution
The general normal distribution

By introducing a parameter of location µ and a parameter of dispersion σ 2 > 0 one


obtains further normal distributions:

Definition:
A continuous random variable X with the density function

f_N(x) = f_N(x; µ, σ²) = (1/√(2πσ²)) · e^(−(x − µ)²/(2σ²))

for −∞ < x < ∞ is called normally distributed with the parameters µ and σ² or short
N(µ, σ²)-distributed.

Properties : expected value and variance

E(X) = µ ,     V(X) = σ² .

9. Stochastic models and special distributions 257


Normal distribution

Two-parameter family:
The expected value and the variance serve as parameters: f_N(x; µ, σ²)

[Figures: normal densities for (µ, σ) = (0, 1), (1, 1), (0, 2), (1, 2), (0, 0.6), (−2, 0.6)]

Properties :

1. symmetrical around x = µ

2. points of inflection at x = µ − σ and x = µ + σ

3. the density function is flatter the larger the dispersion:

   f_N(x; µ, σ) = (1/σ) f_St((x − µ)/σ)

9. Stochastic models and special distributions 258


Normal distribution

Properties (contd.):

4. The following applies to the distribution function


F_N(x) = F_St((x − µ)/σ)
5. To determine probabilities, the table of the standard normal distribution is sufficient:
only standardization is required before using the table.

9. Stochastic models and special distributions 259




Normal distribution

Example:
Let the random variable X (e.g. stock return) be normally distributed with E(X) = 8 %
and V(X) = 625 %² (i.e. σ = 25 %).
Wanted: P(0 % < X ≤ 20 %)
Solution:
First standardize

(0 % − 8 %)/25 % < (X − 8 %)/25 % ≤ (20 % − 8 %)/25 %

−0.32 < Z ≤ 0.48

Then calculate the probability

P(0 % < X ≤ 20 %) = P(−0.32 < Z ≤ 0.48) = F_St(0.48) − F_St(−0.32)
                  = 0.6844 − (1 − 0.6255) = 0.3099

9. Stochastic models and special distributions 260
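The same standardization can be carried out in a few lines of Python (a sketch, not from the original slides); the standard normal distribution function is expressed via the error function from the standard library.

# P(0% < X <= 20%) for X ~ N(mu = 8, sigma = 25), via standardization.
from math import erf, sqrt

def Phi(z):                       # distribution function of N(0, 1)
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 8, 25
a, b = 0, 20
print(Phi((b - mu) / sigma) - Phi((a - mu) / sigma))   # ~0.3099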


Control questions

1. How is the binomial distribution defined? Does the B ERNOULLI distribution belong to
the family of binomial distributions?

2. Why is the normal distribution considered the most important distribution in statistics?

3. Why do you need only the values of the standard normal distribution when calculating
with normal distributions?

4. What is a „family“ or „class“ of distributions?

5. What is a stochastic model? What is the purpose of stochastic models?

9. Stochastic models and special distributions 261


10 Limit theorems

10.1 The law of large numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266


10.2 The fundamental theorem of statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
10.3 The central limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
10.4 Normal distribution as approximation distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .275

according to [Schira, 2016], chapter 12


see also: [Anderson et al., 2012], chapter 7; and [Newbold et al., 2013], chapter 6

10. Limit theorems 262


Limit theorems

Goal: Draw conclusions about a population and determine their reliability


Use: sampling (with replacement), series of experiments or hypothetical population
n-dimensional random variable
X1 , X2 , . . . , Xn
What can we say about asymptotic behaviour of sequences of random variables?

10. Limit theorems 263


Random Variables
Sum and arithmetic mean

n-dimensional random variable


X1 , X2 , . . . , Xn
independent and identically distributed with

E(X_i) = µ     V(X_i) = σ²

define new random variables

sum:          S_n := X1 + X2 + · · · + Xn
arithm. mean: X̄_n := (1/n)(X1 + X2 + · · · + Xn)

Then

Proposition : Let X1, X2, . . . , Xn be independent and identically distributed with E(X_i) = µ
and V(X_i) = σ². Then the following applies to the arithmetic mean X̄_n

E(X̄_n) = µ ,     V(X̄_n) = σ²/n .

10. Limit theorems 264


Limit theorems

Important limit theorems (n → ∞)

1. Law of large numbers: the mean value converges towards the expected value:
   X̄_n → µ

   Special case: BERNOULLI's law of large numbers: relative frequency converges
   towards probability: h_n → p

2. Main theorem of statistics: the empirical distribution converges towards the
   probability distribution: H_n(x) → F(x)

3. Central limit theorem: the distribution of the mean converges to the normal
   distribution: F_n(z_n) → F_St(z)

10. Limit theorems 265


The law of large numbers

The weak law of large numbers :


Let X1 , X2 , . . . , Xn be independent and identically distributed random variables whose
expected values E(Xi ) = µ and variances V(Xi ) = σ 2 exist, and let X̄n be the arithmetic
mean of them.
Then for any ε > 0, no matter how small, the following is true

P(|X̄_n − µ| ≥ ε) → 0   for n → ∞

other notations:

P(|X̄_n − µ| < ε) → 1
plim_{n→∞} X̄_n = µ

(plim ≡ „probability limit“)

10. Limit theorems 266


Proof: According to CHEBYSHEV's inequality, the deviation of any random variable from its mean value
– regardless of its distribution – hence also for X̄_n is

P(|X̄_n − µ| ≥ k σ_X̄) ≤ 1/k² ,

where the standard deviation is σ_X̄ = σ/√n.

With the substitution k · σ_X̄ = ε, hence k² = ε² · n/σ², it follows that

P(|X̄_n − µ| ≥ ε) ≤ σ²/(ε² · n)

and from this for n → ∞

σ²/(ε² · n) → 0 .

Weak and strong: different types of convergence

1. Weak law of large numbers: plim X̄n = µ


„probability limit“ or „stochastic convergence“
2. Strong law of large numbers: P(lim X̄n = µ) = 1
„convergence with probability one“ or „almost sure convergence“
3. But wrong would be „sure convergence“ lim X̄n = µ
266 - 1
Practical meaning of the laws of large numbers

1. Statistical probability
Determination of probabilities by experimental means:
hn (observed rel. frequency) is a good approximation or useful estimate for p if n is sufficiently
large.
2. Sampling method
For qualitative characteristics:
p = proportion of statistical units in the population for which the characteristic has a specific
value or property.
hn = relative frequency in random sample. It will be closer and closer to the value p as the
sample size increases.

266 - 2
The law of large numbers

Example: 9114 historical lottery numbers show the law of large numbers quite
illustratively. (discrete uniform distribution with m = 49)

theoretically:

µ = (49 + 1)/2 = 25

σ_X = √((49² − 1)/12) = √200 = 14.1421

empirically: n = 9114

x̄ = (1/9114) Σ_j n_j x_j = 25.2211

s_X = √200.6512 = 14.1651

10. Limit theorems 267
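The same behaviour can be reproduced with simulated lottery numbers; a small Python sketch (not from the original slides), drawing from the discrete uniform distribution on 1..49.

# Law of large numbers: the running mean of simulated lottery numbers approaches mu = 25.
import numpy as np

rng = np.random.default_rng(1)
draws = rng.integers(1, 50, size=10_000)     # uniform on {1, ..., 49}

for n in (10, 100, 1_000, 10_000):
    print(n, draws[:n].mean())               # tends towards 25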


The fundamental theorem of statistics

Law of large numbers: The mean value of a sample converges stochastically towards
the expected value.
Question: Can the probability distribution F (x ) also be determined experimentally?
For this purpose, one calculates the empirical distribution function from the n sample or
measured values:
x1 , x2 , . . . , xn ⇒ Hn (x )
Idea: Hn (x ) → F (x ) if n → ∞?

Fundamental theorem of statistics:


The empirical distribution functions obtained using samples of size n converge with
probability one

P( lim_{n→∞} H_n(x) = F(x) ) = 1

to the distribution function of the random variable X if n tends to infinity.

10. Limit theorems 268


The fundamental theorem of statistics

Example: Random numbers were drawn on a PC, uniformly distributed over the
interval [0, 10]:

Convergence of the empirical distribution function towards the


probability distribution function.

10. Limit theorems 269




The central limit theorem

We consider the mean value of a series of experiments or random sample


x̄_n := (1/n)(x1 + x2 + · · · + xn)

as a realization of a random variable

X̄_n := (1/n)(X1 + X2 + · · · + Xn) ,

where

E(X1) = · · · = E(Xn) = µ   and   V(X1) = · · · = V(Xn) = σ² .

Then

E(X̄_n) = µ_X̄n = µ   and   V(X̄_n) = σ²_X̄n = σ²/n .
So far so good!

Problem: In many applications, however, it is not sufficient to know only the two
moments E and V.
Question: What is the distribution function F (x̄n ) of the random variable X̄n ?

10. Limit theorems 270


We also consider the sum of a series of experiments or random sample

sn := x1 + x2 + · · · + xn

as a realization of a random variable

Sn := X1 + X2 + · · · + Xn .

Here,
E(S_n) = µ_Sn = n · µ   and   V(S_n) = σ²_Sn = n · σ² .

Standardization yields

Z_n := (S_n − µ_Sn)/σ_Sn = (S_n − nµ)/(σ√n)

which is equivalent to

= (X̄_n − µ)/(σ/√n) = (X̄_n − µ_X̄n)/σ_X̄n

270 - 1
The central limit theorem

Central limit theorem :


Let X1, X2, . . . , Xn be independent and identically distributed random variables whose expected values E(Xi) = µ and variances V(Xi) = σ² exist, and let Sn be their sum and X̄n = Sn/n their arithmetic mean.
Then the distribution function Fn of the standardized quantity

Zn := (Sn − nµ)/(σ√n) = (X̄n − µ)/(σ/√n)

tends to the standard normal distribution as n increases:

Fn(zn) → FSt(z) for n → ∞.

10. Limit theorems 271


The central limit theorem

(Figure: illustration of the CLT — mass functions of the sum of the numbers of pips for 1, 2, 3, and 6 independent dice rolls.)
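The mass functions of such dice sums can be computed exactly by repeated convolution. The following Python sketch (not part of the script) does this and compares one value of the exact distribution function with the normal approximation; it already uses the half-step correction discussed further below.

import numpy as np
from scipy.stats import norm

die = np.full(6, 1 / 6)          # mass function of a single die on 1..6
mu, var = 3.5, 35 / 12           # mean and variance of one roll

for n in (1, 2, 3, 6):
    pmf = die.copy()
    for _ in range(n - 1):
        pmf = np.convolve(pmf, die)              # mass function of the sum on n..6n
    values = np.arange(n, 6 * n + 1)
    exact = pmf[values <= n * mu].sum()          # exact P(Sn <= n*mu)
    z = (np.floor(n * mu) + 0.5 - n * mu) / np.sqrt(n * var)
    print(f"n = {n}: exact = {exact:.4f}, normal approximation = {norm.cdf(z):.4f}")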

10. Limit theorems 272


The central limit theorem

This is why the normal distribution and the CLT are so important:

Decisive advantage: The CLT does not impose any requirement on the initial distribution. Whatever the common distribution of the independent, identically distributed Xi may be, the distribution function of the sum or the arithmetic mean always converges to the normal distribution.
It is to this circumstance that the normal distribution owes its universal theoretical
and practical importance.
Empirical Distributions: The CLT also explains why so many empirical distributions
are close to the normal distribution and can be approximated by it quite well.

10. Limit theorems 273


Of course, the CLT also applies to the B ERNOULLI distributions

The sum Bn := X1 + X2 + · · · + Xn with independent Bernoulli variables Xi is by definition binomially distributed with µBn = np and σ²Bn = npq, and its arithmetic mean is a relative frequency or a proportion value

Hn := (1/n)(X1 + X2 + · · · + Xn) = Bn/n

with µHn = p and σ²Hn = pq/n.

D E M OIVRE-L APLACE theorem:

The binomial distribution converges to the normal distribution for n → ∞. In particular, the distribution function of the standardized variable

Zn := (Bn − np)/√(npq) ≡ (Hn − p)/√(pq/n)

converges to the standard normal distribution

Fn(zn) → FSt(z) for n → ∞.
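A short numerical illustration (Python sketch, not part of the script; p = 0.25 and the values of n are arbitrary choices) compares the exact binomial distribution function at the expected count with the standard normal value FSt(z):

import numpy as np
from scipy.stats import binom, norm

p, q = 0.25, 0.75
for n in (10, 100, 1000):
    k = int(n * p)                               # evaluate at the expected count
    z = (k - n * p) / np.sqrt(n * p * q)         # standardized variable Zn
    print(f"n = {n:4d}: P(Bn <= {k}) = {binom.cdf(k, n, p):.4f}, FSt(z) = {norm.cdf(z):.4f}")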

273 - 1
The central limit theorem

(Figure: convergence in distribution — approximation of the distribution functions of the binomial distribution for p = 1/4 and n = 1, 2, 3 and 6 toward the normal distribution function.)

10. Limit theorems 274


Normal distribution as approximation distribution

Properties:
If n is sufficiently large, the distribution of a sum or arithmetic mean can be approximated by the normal distribution:

P(Y ≤ y) ≈ FSt( (y − µY)/σY )

For an interval a < b an approximation can also be given:

P(a < Y ≤ b) ≈ FSt( (b − µY)/σY ) − FSt( (a − µY)/σY )

Here, Y can be one of the following random variables:

sum Sn:            µSn = nµ,    σ²Sn = nσ²
arithm. mean X̄n:   µX̄n = µ,     σ²X̄n = σ²/n

10. Limit theorems 275


Normal distribution as approximation distribution

Example:
The average processing time of a BAföG application by a clerk at the
Students’ Union is µ = 35 min with a standard deviation of
σ = 18 min.
Question: What is the probability that the clerk will complete more than 15 applications
in an 8-hour workday, i.e.
P(S16 ≤ 480) = ?

Solution:
According to the CLT, the sum S16 is approximately normally distributed with E(S16) = 35 · 16 = 560 and the standard deviation σS16 = 18 · 4 = 72. We calculate:

P(S16 ≤ 480) ≈ FN(480; 560, 72²) = FSt( (480 − 560)/72 ) = FSt(−1.1111) = 0.1331
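For comparison, the same value can be obtained directly from the normal distribution function, e.g. with scipy (a minimal sketch, not part of the script); the slide's 0.1331 stems from the rounded table value:

from scipy.stats import norm

mu, sigma, n = 35, 18, 16
print(norm.cdf(480, loc=n * mu, scale=sigma * n**0.5))   # approx. 0.1333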

10. Limit theorems 276


Normal distribution as approximation distribution
Continuity correction

A note for practice: the approximation of discrete random variables by the normal distribution is better in the middle of the steps than at their edges.
→ continuity correction Sk = (step size)/2

If n is sufficiently large, the distribution of a sum or arithmetic mean of discrete random variables and hence also of a binomially distributed random variable can in general be approximated better by using the continuity correction:

P(Y ≤ y) ≈ FSt( (y + Sk − µY)/σY )

For an interval a < b an approximation can also be given:

P(a < Y ≤ b) ≈ FSt( (b + Sk − µY)/σY ) − FSt( (a + Sk − µY)/σY ).

10. Limit theorems 277


Now here, Y can be one of the following (discrete) random variables:

sum Sn:                      µSn = nµ,    σ²Sn = nσ²
arithm. mean X̄n:             µX̄n = µ,     σ²X̄n = σ²/n
binomially distributed Bn:   µBn = np,    σ²Bn = npq

The continuity correction Sk is always half the step size of the random variable Y. For example, if Y can take only integers or only natural numbers, then Sk = 1/2.

Question: How large should n be in order to use the normal distribution as an approximation?

There is no generally valid answer here. A rule of thumb says that for n > 30 the approximation is
generally quite good. For binomially distributed random variables it is often required that npq > 9
holds.
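The following Python sketch (not part of the script; n = 40, p = 0.5 and the evaluation point are arbitrary choices satisfying npq > 9) shows how much the continuity correction improves the approximation of a binomial distribution function:

import numpy as np
from scipy.stats import binom, norm

n, p = 40, 0.5
mu, sigma = n * p, np.sqrt(n * p * (1 - p))   # npq = 10 > 9
y = 23

print("exact             :", round(binom.cdf(y, n, p), 4))
print("without correction:", round(norm.cdf((y - mu) / sigma), 4))
print("with Sk = 1/2     :", round(norm.cdf((y + 0.5 - mu) / sigma), 4))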

277 - 1
Control questions

1. How can a sequence of independently and identically distributed random variables


be obtained?
2. What concepts of convergence have you come across?
3. What is the difference between the weak and the strong law of large numbers?
4. Why do some people in roulette bet on numbers that have rarely won? Is this a good
strategy?
5. Some people bet on numbers that have won particularly often. Do you think that’s a
rational strategy?
6. What is special about B ERNOULLI’s law of large numbers?
7. How can we empirically or experimentally infer the unknown distribution function of a
random variable?
8. Why is the normal distribution so popular?
9. Why is continuity correction useful when approximating a discrete distribution by a
continuous distribution?

10. Limit theorems 278


Part III – Inferential Statistics

11 Point estimators for parameters of a population . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .280

12 Interval estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289

13 Statistical testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320

10. 279
11 Point estimators for parameters of a population

11.1 Random sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281


11.2 Point estimator for the mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
11.3 Point estimator for the variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
11.4 Properties of point estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288

according to [Schira, 2016], chapter 13


see also: [Anderson et al., 2012], chapter 7; and [Newbold et al., 2013], chapter 6

11. Point estimators for parameters of a population 280


Random sampling

Motivation:
Complete information about the distribution of characteristics in a population can only
be obtained by a census (full sample). In most cases, censuses are uneconomical,
often even impossible.

The representative sample: It is made sure that the sample has the same or similar
structure as the basic population with regard to other characteristics.

Pure random sample: Each element ωi of the basic population has an equal chance of
entering the sample.

Question: Is a random survey of n = 100 people between 2pm and 3pm on the Zeil in
front of Karstadt representative?

11. Point estimators for parameters of a population 281


Random sampling

Definition:
Urn model to describe the pure random sampling:

The urn contains N numbered balls (= number of statistical units in the basic population).
The number on the ball is assigned to exactly one statistical unit.

We distinguish between two options:


(i) Drawing with replacement
The number of the drawn ball is recorded, the ball is returned to the urn, and it is
reshuffled. ( This is the procedure that is mainly assumed in the following.)
(ii) Drawing without replacement
The ball is not put back into the urn. (closer to practice; only little deviation from (i) for samples that are small
relative to the basic population).

11. Point estimators for parameters of a population 282


Random sampling

Properties of pure random samples

The characteristic value Xi of each individual element of the sample is a random


variable.
The probability distribution of this random variable Xi is determined by the
frequency distribution of the characteristic X in the basic population.

With the characteristic values xi observed in the sample, we now try to


estimate this distribution or at least its mean and variance.

11. Point estimators for parameters of a population 283


Point estimator for the mean

Let the mean µ of the metric characteristic X of a basic population be unknown.


We want to estimate it using a random sample of size n.
Observed characteristic values xi of the sample elements are realizations of the
random variable X .
We calculate the arithmetic mean

{x1, x2, . . . , xn} → x̄ = (1/n) Σ xj

Estimator for the mean value:


µ̂ := x̄
(µ̂ is the estimated value for the unknown mean µ; the hat „ ˆ “ denotes the estimator)
Such an estimate is called point estimate, because a single (punctual) value is given
as an estimate and not an interval for example. Also, no probability is given with
which the estimate could be correct or incorrect.

11. Point estimators for parameters of a population 284


Point estimator for the mean
Example

A sample of size n = 10 was drawn from the basic population of students in a lecture.
The body height X in cm was determined and recorded in the following table:

i 1 2 3 4 5 6 7 8 9 10

xi 176 180 181 168 177 186 184 173 182 177

This data set has the mean value 178.4 cm. The point estimate for the height of students
in the lecture hall is simply:
µ̂ = x̄ = 178.4 cm

11. Point estimators for parameters of a population 285


Question: Is this a good estimate?

The value µ̂ is an estimate for the unknown mean µ. Therefore, most of the time the estimated value
will not exactly match the true mean (i.e. µ̂ ̸= µ). That means it is very rarely true that µ̂ = µ.

First, we need to understand that the estimate µ̂ is a realization of a random variable:

Every single observed characteristic value xi is a realization of a random variable Xi . For each of
these random variables Xi the probability distribution is given by the frequency distribution of the
basic population. Thus for each random variable Xi the following applies

E(Xi ) = µ, V(Xi ) = σ 2 .

Thus, one can consider the observed sample values and their mean as realizations of an n-dimensional
random variable (X1 , X2 , . . . , Xn ). If the random sample is carried out with replacement, all Xi are
independent and identically distributed.

We define a new random variable


1X
X̄n := Xj .
n
Thus, the point estimate µ̂ is a realization of the random variable X̄n .

As stated above, the estimated value will only rarely hit the true mean value; it is possible or even likely that an estimation error

e := µ − µ̂

285 - 1
will occur. However, the crucial question is whether the estimated value µ̂ hits the true value at
least on average (that is, if we determine many realizations of µ̂). For this purpose we calculate the
expected value
E(µ̂) = E(X̄n) = E( (1/n) Σ Xj ) = (1/n) Σ E(Xj) = (1/n) · nµ = µ,

hence it holds that

E(µ̂) = µ.

This property of µ̂ is called unbiasedness. For an unbiased estimator, the estimation error vanishes on average, i.e. E(e) = 0.

If an estimator is not unbiased, we call it biased, the expected value of the estimation error

bias := E(e)

is called bias.

Unbiasedness is not the only important property. If we calculate the variance of our estimate µ̂, we
get
V(µ̂) = V(X̄n) = V( (1/n) Σ Xj ) = (1/n²) Σ V(Xj) = (1/n²) · nσ² = σ²/n.
We notice that with increasing sample size n the variance of the sample mean becomes smaller and
smaller, hence
lim V(µ̂) = 0 .
n→∞

285 - 2
According to the law of large numbers
plim µ̂ = µ.

This property is called consistency. It means that the larger the sample size, the more accurate the
estimate.

An important prerequisite for the calculation of the variance and also for the application of the law of
large numbers is the independence of the variables Xi , which, however, is given in every case for
samples with replacement. However, this is also approximately true for samples without replacement,
if the basic population is very large in relation to the sample size.
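Both properties can be checked by simulation. The following Python sketch (not part of the script; the population values µ = 178 and σ = 10 are assumptions for illustration only) draws many samples of each size and looks at the resulting estimates µ̂ = x̄:

import numpy as np

rng = np.random.default_rng(seed=3)
mu, sigma = 178, 10          # assumed population values, for illustration only

for n in (10, 100, 1000):
    estimates = rng.normal(mu, sigma, size=(5000, n)).mean(axis=1)   # 5000 samples of size n
    print(f"n = {n:4d}: mean of mu_hat = {estimates.mean():8.3f}, "
          f"V(mu_hat) = {estimates.var():6.3f}, sigma^2/n = {sigma**2 / n:6.3f}")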

285 - 3
Point estimator for the variance

Assume that the variance of a metric characteristic in a basic population is


unknown.
It is to be estimated using the empirical variance of the random sample.

s² := (1/n) Σ (xj − x̄)²

Unfortunately, the estimation formula σ̂ 2 = s2 , which seems obvious at first sight, is


not unbiased, but biased.
After adjusting for the number of degrees of freedom, we obtain a

Unbiased estimator of the variance:


σ̂² := (n/(n−1)) · s² = (1/(n−1)) Σ (xj − x̄)²

11. Point estimators for parameters of a population 286


Point estimator for the variance
Example

Using the observed values from the example before, we can also make an estimate for
the variance of the body size of the students in a lecture. To do this, we first calculate
the variance of the sample values

s²X = (1/10)(176² + 180² + · · · + 177²) − 178.4² = 31 852.4 − 31 826.56 = 25.84

and thus the unbiased point estimate

σ̂² = (10/9) · 25.84 = 28.7111 or σ̂ = 5.3583.
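The same numbers follow directly from the data with numpy (a minimal sketch, not part of the script); ddof=0 gives the empirical variance s², ddof=1 the unbiased estimate σ̂²:

import numpy as np

x = np.array([176, 180, 181, 168, 177, 186, 184, 173, 182, 177])

mu_hat = x.mean()                  # 178.4
s2 = x.var(ddof=0)                 # empirical variance s^2 = 25.84
sigma2_hat = x.var(ddof=1)         # unbiased estimate (n/(n-1)) * s^2 = 28.7111
print(mu_hat, round(s2, 4), round(sigma2_hat, 4), round(sigma2_hat**0.5, 4))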
9

11. Point estimators for parameters of a population 287


Properties of point estimators
We have observed that the estimated value of a parameter of a basic population, such as mean µ,
variance σ 2 , or even a proportion value p, is itself again a random variable. Although the parameter
itself is unknown, it is still a constant quantity. We denote such a parameter in the following by the Greek letter θ, and by θ̂ – as before with a „hat“ – the estimated value of the parameter.

An estimator or an estimating function

θ̂ = θ̂(X1 , . . . , Xn ) ,

that depends on the random variables X1 , X2 , . . . , Xn is again a random variable and accordingly
also has a probability distribution. Certain stochastic properties of the estimator follow from this.

To assess the quality of an estimator, we use the following desirable properties.

Unbiasedness: An estimator θ̂ is unbiased if its expected value is equal to the true parameter

E(θ̂) = θ .

287 - 1
If an estimator is biased, it would be good if the bias became smaller with increasing sample size and
would disappear for n → ∞.

Asymptotic unbiasedness: An estimator θ̂ is asymptotically unbiased if


lim E(θ̂) = θ .
n→∞

Asymptotic unbiasedness is thus a somewhat weaker property than unbiasedness.

Examples
1. The estimator µ̂ = x̄ is unbiased.
2. The estimator σ̂ 2 = s2 is not unbiased but asymptotically unbiased.

Consistency: An estimator θ̂ is consistent if it is unbiased or at least asymptotically unbiased


and, furthermore, its variance tends to zero as the sample size increases

lim V(θ̂) = 0 .
n→∞

An estimated value usually does not agree with the true value of the parameter. However, it would be
good if it is close to the parameter, or at least has a good chance of being close. Thus, the estimation
error |θ̂ − θ| should be as small as possible and become smaller and smaller, especially for larger

287 - 2
sample sizes. The property of consistency means that the probability of an estimation error ε > 0,
however small, tends to zero as n increases.

287 - 3
Efficiency: We call an (unbiased) estimator θ̂1 more efficient than another (unbiased) estimator
θ̂2 if it has a smaller variance,
V(θ̂1 ) < V(θ̂2 ).

Thus, the most efficient or best unbiased estimator θ∗ is the one among all unbiased estimators with the smallest variance, that is

V(θ∗) ≤ V(θ̂) for every unbiased estimator θ̂.

Mean squared error (MSE): The mean squared error (MSE) of an estimator is the expected value
of its squared deviation from the true parameter value, i.e.

MSE(θ̂) = E[(θ̂ − θ)2 ].

The MSE accounts for both the variance and the bias:

MSE(θ̂) = V(θ̂) + bias2 .

It may be advantageous to give preference to a slightly biased estimator, provided that this achieves
an effective reduction in variance, which is often the case.

287 - 4
Example

Consider the distribution of two alternative estimators for a parameter θ . One is unbiased, the other
has a small bias but a much smaller variance.

(Figure: densities f(θ̂) of the two estimators around the true value θ; the biased estimator is shifted by its bias but is much more concentrated.)

Which estimator to choose?

It seems obvious to prefer the biased estimator in this particular case.

287 - 5
Control questions

1. What is a representative sample? How does a representative sample differ from a


purely random sample?
2. What is meant by the estimation error? Is the estimation error a stochastic quantity?
3. What is the difference between the estimation error and the bias?
4. What is the reason for the bias when estimating the variance of a basic population
with the empirical variance of the sample?
5. What is consistency?

11. Point estimators for parameters of a population 288


12 Interval estimators

12.1 Sampling distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290


12.2 Interval estimators for large samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
12.3 The chi-squared distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
12.4 The S TUDENT-t distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
12.5 Interval estimators for small samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
12.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .319

according to [Schira, 2016], chapter 14


see also: [Anderson et al., 2012], chapter 8; and [Newbold et al., 2013], chapter 7

12. Interval estimators 289


Sampling distributions

Motivation:
Measures of samples, such as mean, variance, and others, are realizations of random
variables. Their probability distributions are called sampling distributions. In
particular, we are interested in:

Distribution of the sample mean


Distribution of the sample variance

Note:
Sampling distributions follow the normal distribution in many cases, because
1. characteristics are often a priori approximately normally distributed,
2. for larger samples, the Central Limit Theorem (CLT) applies.

12. Interval estimators 290


Sampling distributions

Recall: Using the normal distribution (e.g. as a sampling distribution).

1. Values of the distribution function:

P(Z ≤ z) = ∫_{−∞}^{z} fSt(u) du = FSt(z)

2. The symmetrical intervals are also useful:

P(−z < Z ≤ z) = ∫_{−z}^{z} fSt(u) du = D(z)

⇒ D(z) = FSt(z) − FSt(−z) = 2FSt(z) − 1

(Figures: areas below the density function corresponding to FSt(z) and D(z).)

12. Interval estimators 291


Sampling distributions
Distribution of the sample mean

Distribution of the sample mean:


The metric characteristic X in a basic population has mean µ and variance σ². For a large sample size n, the distribution of the sample mean is:
1. E(X̄) = µ,
2. σX̄ = σ/√n,
3. X̄ is approximately normally distributed.

This is true for any distribution of the characteristic X as long as the individual sample elements are drawn independently.
By standardizing the random variable X̄ we obtain:
4. (X̄ − µ)/(σ/√n) is approximately standard normally distributed.

12. Interval estimators 292


Sampling distributions
Distribution of the sample mean

From 4. and with σX̄ = σ/√n it follows directly that

P(−z < (X̄ − µ)/σX̄ ≤ z) ≈ FSt(z) − FSt(−z) = D(z).
By transforming the inequality inside the probability function we get

P(−z · σX̄ < X̄ − µ ≤ z · σX̄ ) ≈ D (z )

and finally the

Formula for the direct conclusion

P(µ − z · σX̄ < X̄ ≤ µ + z · σX̄ ) ≈ D (z )

⇒ Conclusion from the basic population to the sample

12. Interval estimators 293


Sampling distributions
Example

For the introductory lecture in statistics, N = 800


students have come to the Audimax. Their average
body height µ is 183 cm, with a standard deviation σ
of 10 cm. From this we draw a random sample of size
n = 25 (with replacement).

Question 1 (Given interval): With what probability will the sample mean fall into the interval 182 cm < X̄ ≤ 184 cm?

The normal distribution can be taken as the sampling distribution, since the initial
distribution is already approximately normally distributed.


It holds that

E(X̄) = 183 cm and σX̄ = σ/√n = 10 cm/√25 = 2 cm

So the above interval extends half a standard deviation of the sample mean to each side of µ, and thus the probability is:

P(183 cm − ½ · 2 cm < X̄ ≤ 183 cm + ½ · 2 cm) ≈ D(½) = 0.3830

12. Interval estimators 294


Alternatively we calculate

P(182 < X̄ ≤ 184) = P( (182 − 183)/2 < (X̄ − µ)/σX̄ ≤ (184 − 183)/2 )
                  = P( −1/2 < (X̄ − µ)/σX̄ ≤ 1/2 )
                  ≈ D(1/2) = 0.3830

294 - 1
Sampling distributions
Question 1: Reading off the table of the normal distribution

Standard normal distribution


z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09

0.0 .5000 .5040 .5080 .5120 .5160 .5199 .5239 .5279 .5319 .5359
0.1 .5398 .5438 .5478 .5517 .5557 .5596 .5636 .5675 .5714 .5753
0.2 .5793 .5832 .5871 .5910 .5948 .5987 .6026 .6064 .6103 .6141
0.3 .6179 .6217 .6255 .6293 .6331 .6368 .6406 .6443 .6480 .6517
0.4 .6554 .6591 .6628 .6664 .6700 .6736 .6772 .6808 .6844 .6879
0.5 .6915 .6950 .6985 .7019 .7054 .7088 .7123 .7157 .7190 .7224
0.6 .7257 .7291 .7324 .7357 .7389 .7422 .7454 .7486 .7517 .7549
D (0.5) = 2 · 0.6915 − 1 = 0.3830
0.7 .7580 .7611 .7642 .7673 .7704 .7734 .7764 .7794 .7823 .7852
0.8 .7881 .7910 .7939 .7967 .7995 .8023 .8051 .8078 .8106 .8133
0.9 .8159 .8186 .8212 .8238 .8264 .8289 .8315 .8340 .8365 .8389
1.0 .8413 .8438 .8461 .8485 .8508 .8531 .8554 .8577 .8599 .8621
1.1 .8643 .8665 .8686 .8708 .8729 .8749 .8770 .8790 .8810 .8830
1.2 .8849 .8869 .8888 .8907 .8925 .8944 .8962 .8980 .8997 .9015
1.3 .9032 .9049 .9066 .9082 .9099 .9115 .9131 .9147 .9162 .9177
1.4 .9192 .9207 .9222 .9236 .9251 .9265 .9279 .9292 .9306 .9319
1.5 .9332 .9345 .9357 .9370 .9382 .9394 .9406 .9418 .9429 .9441
1.6 .9452 .9463 .9474 .9484 .9495 .9505 .9515 .9525 .9535 .9545
1.7 .9554 .9564 .9573 .9582 .9591 .9599 .9608 .9616 .9625 .9633
1.8 .9641 .9649 .9656 .9664 .9671 .9678 .9686 .9693 .9699 .9706
1.9 .9713 .9719 .9726 .9732 .9738 .9744 .9750 .9756 .9761 .9767
2.0 .9772 .9778 .9783 .9788 .9793 .9798 .9803 .9808 .9812 .9817
2.1 .9821 .9826 .9830 .9834 .9838 .9842 .9846 .9850 .9854 .9857
2.2 .9861 .9864 .9868 .9871 .9875 .9878 .9881 .9884 .9887 .9890
2.3 .9893 .9896 .9898 .9901 .9904 .9906 .9909 .9911 .9913 .9916
2.4 .9918 .9920 .9922 .9925 .9927 .9929 .9931 .9932 .9934 .9936
2.5 .9938 .9940 .9941 .9943 .9945 .9946 .9948 .9949 .9951 .9952
2.6 .9953 .9955 .9956 .9957 .9959 .9960 .9961 .9962 .9963 .9964
2.7 .9965 .9966 .9967 .9968 .9969 .9970 .9971 .9972 .9973 .9974
2.8 .9974 .9975 .9976 .9977 .9977 .9978 .9979 .9979 .9980 .9981
2.9 .9981 .9982 .9982 .9983 .9984 .9984 .9985 .9985 .9986 .9986
3.0 .9987 .9987 .9987 .9988 .9988 .9989 .9989 .9989 .9990 .9990

because of symmetry: FSt (−z ) = 1 − FSt (z )


moreover: D (z ) = 2FSt (z ) − 1
12. Interval estimators 295
Sampling distributions
Example

For the introductory lecture in statistics, N = 800


students have come to the Audimax. Their average
body height µ is 183 cm, with a standard deviation σ
of 10 cm. From this we draw a random sample of size
n = 25 (with replacement).

Question 2 (Given probability): What is the interval in which the sample mean falls with
a high probability of 0.9?

To do this, we need to determine the z value for which D(z) = 0.9. In the table of the standard normal distribution we find z = 1.645, such that

0.9 = D(1.645)
    ≈ P(183 cm − 1.645 · 2 cm < X̄ ≤ 183 cm + 1.645 · 2 cm)
    = P(183 cm − 3.29 cm < X̄ ≤ 183 cm + 3.29 cm)
    = P(179.71 cm < X̄ ≤ 186.29 cm)
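Both questions of this example can be reproduced with scipy (a minimal sketch, not part of the script); small deviations from the slide values are due to table rounding:

from scipy.stats import norm

mu, sigma_xbar = 183, 2

# Question 1: probability of the given interval for the sample mean
p = norm.cdf(184, mu, sigma_xbar) - norm.cdf(182, mu, sigma_xbar)
print(round(p, 4))                                                     # 0.3829

# Question 2: symmetric interval that contains the sample mean with probability 0.9
z = norm.ppf(0.95)                                                     # 1.6449
print(round(mu - z * sigma_xbar, 2), round(mu + z * sigma_xbar, 2))    # 179.71, 186.29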

12. Interval estimators 296


Sampling distributions
Question 2: Reading off the quantile table

S TUDENT’s t-distribution
In the quantile table of the STUDENT’s t-distribution one finds the ∞ symbol in the last row. The entries there are the quantiles z[q] of the normal distribution! It holds that D(z) = 2FSt(z) − 1 = 0.9 and hence FSt(z) = 0.95. Thus we look for z[0.95] in the table and find z[0.95] = 1.645.

Degrees of freedom    t[0.6] t[0.667] t[0.75] t[0.8] t[0.875] t[0.9] t[0.95] t[0.975] t[0.99] t[0.995] t[0.999]
1 0.325 0.577 1.000 1.376 2.414 3.078 6.314 12.706 31.821 63.657 318.31
2 0.289 0.500 0.816 1.061 1.604 1.886 2.920 4.303 6.965 9.925 22.327
3 0.277 0.476 0.765 0.978 1.423 1.638 2.353 3.182 4.541 5.841 10.215
4 0.271 0.464 0.741 0.941 1.344 1.533 2.132 2.776 3.747 4.604 7.173
5 0.267 0.457 0.727 0.920 1.301 1.476 2.015 2.571 3.365 4.032 5.893
6 0.265 0.453 0.718 0.906 1.273 1.440 1.943 2.447 3.143 3.707 5.208
7 0.263 0.449 0.711 0.896 1.254 1.415 1.895 2.365 2.998 3.499 4.785
8 0.262 0.447 0.706 0.889 1.240 1.397 1.860 2.306 2.896 3.355 4.501
9 0.261 0.445 0.703 0.883 1.230 1.383 1.833 2.262 2.821 3.250 4.297
10 0.260 0.444 0.700 0.879 1.221 1.372 1.812 2.228 2.764 3.169 4.144
11 0.260 0.443 0.697 0.876 1.214 1.363 1.796 2.201 2.718 3.106 4.025
12 0.259 0.442 0.695 0.873 1.209 1.356 1.782 2.179 2.681 3.055 3.930
13 0.259 0.441 0.694 0.870 1.204 1.350 1.771 2.160 2.650 3.012 3.852
14 0.258 0.440 0.692 0.868 1.200 1.345 1.761 2.145 2.624 2.977 3.787
15 0.258 0.439 0.691 0.866 1.197 1.341 1.753 2.131 2.602 2.947 3.733
16 0.258 0.439 0.690 0.865 1.194 1.337 1.746 2.120 2.583 2.921 3.686
17 0.257 0.438 0.689 0.863 1.191 1.333 1.740 2.110 2.567 2.898 3.646
18 0.257 0.438 0.688 0.862 1.189 1.330 1.734 2.101 2.552 2.878 3.610
19 0.257 0.438 0.688 0.861 1.187 1.328 1.729 2.093 2.539 2.861 3.579
20 0.257 0.437 0.687 0.860 1.185 1.325 1.725 2.086 2.528 2.845 3.552
21 0.257 0.437 0.686 0.859 1.183 1.323 1.721 2.080 2.518 2.831 3.527
22 0.256 0.437 0.686 0.858 1.182 1.321 1.717 2.074 2.508 2.819 3.505
23 0.256 0.436 0.685 0.858 1.180 1.319 1.714 2.069 2.500 2.807 3.485
24 0.256 0.436 0.685 0.857 1.179 1.318 1.711 2.064 2.492 2.797 3.467
25 0.256 0.436 0.684 0.856 1.178 1.316 1.708 2.060 2.485 2.787 3.450
26 0.256 0.436 0.684 0.856 1.177 1.315 1.706 2.056 2.479 2.779 3.435
27 0.256 0.435 0.684 0.855 1.176 1.314 1.703 2.052 2.473 2.771 3.421
28 0.256 0.435 0.683 0.855 1.175 1.313 1.701 2.048 2.467 2.763 3.408
29 0.256 0.435 0.683 0.854 1.174 1.311 1.699 2.045 2.462 2.756 3.396
30 0.256 0.435 0.683 0.854 1.173 1.310 1.697 2.042 2.457 2.750 3.385
35 0.255 0.434 0.682 0.852 1.170 1.306 1.690 2.030 2.438 2.724 3.340
40 0.255 0.434 0.681 0.851 1.167 1.303 1.684 2.021 2.423 2.704 3.307
45 0.255 0.434 0.680 0.850 1.165 1.301 1.679 2.014 2.412 2.690 3.281
50 0.255 0.433 0.679 0.849 1.164 1.299 1.676 2.009 2.403 2.678 3.261
55 0.255 0.433 0.679 0.848 1.163 1.297 1.673 2.004 2.396 2.668 3.245
60 0.254 0.433 0.679 0.848 1.162 1.296 1.671 2.000 2.390 2.660 3.232
∞ 0.253 0.431 0.674 0.842 1.150 1.282 1.645 1.960 2.326 2.576 3.090

12. Interval estimators 297


Interval estimators for large samples

Question: What is a „large“ sample size?

Definition:
A sample is considered a large sample if the deviation of the actual sampling distribution
from the normal distribution can be neglected.

Rule of thumb: n > 30


But:
1. The required sample size also depends on objectively justified accuracy requirements.
2. General principle: The more similar the initial distribution is to the normal distribution, the smaller the sample size may be.

12. Interval estimators 298


Interval estimators for large samples
Confidence intervals for mean values

Prerequisite: CLT (variance σ² known) → σX̄ = σ/√n

P(−z < (X̄ − µ)/σX̄ ≤ z) ≈ FSt(z) − FSt(−z) = D(z)
This can be transformed into an approximate probability statement:

1. direct conclusion P(µ − z · σX̄ < X̄ ≤ µ + z · σX̄ ) ≈ D (z )

2. inference P(X̄ − z · σX̄ < µ ≤ X̄ + z · σX̄ ) ≈ D (z ) = 1 − α

Definition:
Inference is the statistical conclusion from the sample to the unknown basic population.

12. Interval estimators 299


Interval estimators for large samples

Inference: P(X̄ − z · σX̄ < µ ≤ X̄ + z · σX̄ ) ≈ D (z ) = 1 − α


Replacing the random variable X̄ with the current sample mean x̄ results in the

Definition: Confidence interval for µ for large samples with known variance σ 2

CI(µ, 1 − α) = [x̄ − z σX̄ , x̄ + z σX̄ ]

with z = z[1−α/2] . Here,


1 − α is the confidence level
α the probability of error or significance level. It indicates how often one is wrong
on average when setting up confidence intervals of this type.

12. Interval estimators 300


Connection between α and z
The value 1 − α = D(z) := FSt(z) − FSt(−z) corresponds to the (white) area in the symmetric interval [−z, z] below the density function of the standard normal distribution. The left and right (blue colored) tails together have the area α. Accordingly, FSt(z) = 1 − α/2, and thus the z = z[1−α/2] we are looking for is the (1 − α/2) quantile.

(Figure: density fSt(z) with central area D(z) = 1 − α between −z and z and tail areas α/2 on each side.)

Mostly, the significance level α is given and then the corresponding z value is determined as quantile.

300 - 1
Interval estimators for large samples

If the variance in the basic population is unknown. . . we have to find a way to


estimate it:
σ̂² = (n/(n−1)) · s² and hence σ̂X̄ = σ̂/√n = s/√(n−1)

Definition: Confidence interval for µ for large samples with unknown variance

CI(µ, 1 − α) = [x̄ − z σ̂X̄ , x̄ + z σ̂X̄ ]

with z = z[1−α/2] .

However, this means that an additional inaccuracy is introduced.


Way out: Somewhat larger sample sizes. This makes the point estimate of the variance
more accurate and the sampling distribution even closer to the normal distribution.

12. Interval estimators 301


Interval estimators for large samples
Example

For the local rent index, the local government


asks 50 households that have rented apartments
ranging from 80 to 100 m2 in size about the net
rent per square meter.
They find a sample mean of 8.30 € with an empirical standard deviation of s = 2.07 €.
Let us discuss two different questions in this regard:

1. Distribution in the basic population: In which interval do two thirds of the rents per
square meter paid probably lie?

2. Interval estimation: What is the confidence interval on a confidence level of 0.9 for
the average net rent µ?

12. Interval estimators 302


Interval estimators for large samples
Example

1. Distribution in the basic population: In which interval do two thirds of the rents per
square meter paid probably lie?
We assume a normal distribution. We take the sample mean as point estimate of the actual average rent and

σ̂ = √(50/49) · 2.07 € = 2.10 €

as the point estimate for the standard deviation. Thus, the variable

Z = (X − 8.3)/2.10

would be standard normally distributed. According to the standard normal distribution table, 66.7 % of all observations of variable Z are in the interval −0.97 < Z ≤ 0.97. We undo the standardization and obtain the interval

[8.30 € − 0.97 · 2.10 €; 8.30 € + 0.97 · 2.10 €] = [6.26 €; 10.34 €]

(Figure: normal density with the central 66.7 % between 6.26 € and 10.34 €.)

12. Interval estimators 303




Interval estimators for large samples
Example

2. Interval estimation: What is the confidence interval on a confidence level of 0.9 for
the average net rent µ?

It holds that

CI(µ, 1 − α) = [x̄ − z σ̂X̄, x̄ + z σ̂X̄]

with

σ̂X̄ = σ̂/√n = 2.10 €/√50 ≈ 0.29 €.

It is α = 0.1. According to the table, the 0.95 quantile is z[0.95] = 1.645. Thus, the confidence interval is

CI(µ, 0.9) = [8.30 € − 1.645 · 0.29 €; 8.30 € + 1.645 · 0.29 €]
           = [7.82 €; 8.78 €].

Thus, the unknown mean of the basic population lies in the interval [7.82 €; 8.78 €] with a probability of 90 %.

(Figure: the 0.9 interval [7.82 €, 8.78 €] around x̄ = 8.30 €.)
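A minimal Python sketch (not part of the script) reproduces this confidence interval; it yields [7.81 €, 8.79 €] because it does not round σ̂X̄ to 0.29 € as the slide does:

import numpy as np
from scipy.stats import norm

n, xbar, s, alpha = 50, 8.30, 2.07, 0.1
sigma_hat = np.sqrt(n / (n - 1)) * s          # approx. 2.09
se = sigma_hat / np.sqrt(n)                   # estimated sigma of the sample mean
z = norm.ppf(1 - alpha / 2)                   # 1.645
print(round(xbar - z * se, 2), round(xbar + z * se, 2))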

12. Interval estimators 305


1. The first question is about the actual rents, i.e., the random variable X. Therefore, here the random variable X is standardized with the estimated variance σ̂² = (n/(n−1)) s², i.e. with the standard deviation

σ̂ = √(n/(n−1)) · s.

2. The second question deals with the average net rent, so here the random variable under consideration is the sample mean X̄. Accordingly, the estimated standard deviation of the sample mean is also calculated according to the √n-law,

σ̂X̄ = σ̂/√n

with σ̂ from question 1.

305 - 1
The chi-squared distribution

Prerequisite: Random variables Z1 , Z2 , . . . , Zn are standard normally distributed and


independent
Squaring these n random variables and then summing them yields a new random
variable.

Definition:
The random variable
χ²n := Z1² + Z2² + · · · + Zn²
is called chi-square distributed with n degrees of freedom.

Thus, the chi-square distributions form a whole family of distributions. They are
continuous distributions and have positive probability densities in the interval (0, ∞).

12. Interval estimators 306


The chi-squared distribution

The chi-square distributions are suitable as test distributions for many typical test
situations and thus have multiple applications in practice.

(Figure: probability densities f(χ²) of the chi-square distribution for n = 1, 3, 5, 10 and 15 degrees of freedom.)

Properties:
E(χ²n) = n,  V(χ²n) = 2n

12. Interval estimators 307


The S TUDENT-t distribution

Prerequisite: Let χ2n be a chi-square distributed


random variable and Z be a standard normally
distributed random variable and let both be
independent:

Definition:
The random variable

Tn := Z / √( (1/n) · χ²n )

is called STUDENT-t distributed with n degrees of freedom.

(Picture: WILLIAM SEALY GOSSET, pseudonym „Student“, 1876–1937)

12. Interval estimators 308


The S TUDENT-t distribution

The t-distributions are similar to the normal distribution, but slightly wider. For increasing
number of degrees of freedom they tend towards the standard normal distribution.

(Figure: densities of the standard normal distribution and of the t-distribution for n = 1, 3 and 10 degrees of freedom.)

Properties:
E(Tn) = 0,  V(Tn) = n/(n − 2) > 1  (n > 2)

12. Interval estimators 309


Interval estimators for small samples

We have seen that for large samples the CLT holds


⇒ normal distribution is a quite good approximation for the true sampling distribution

samples that are too small ⇒ the CLT does not help

The actual sampling distribution would have to be used. However, this depends on
the distribution of the characteristic in the basic population and therefore varies from
case to case and is usually difficult to calculate.

Only in the special situation where the characteristic in the basic population is already normally distributed, or follows the normal distribution quite well, does the construction of confidence intervals become simple again.

12. Interval estimators 310


Interval estimators for small samples

Theorem 11:

If the characteristic is normally distributed in the basic population, then


1. the sample mean is normally distributed even for small samples,
2. the fraction

Tn−1 = (X̄ − µ)/σ̂X̄

is exactly STUDENT-t-distributed with n − 1 degrees of freedom.

From this follows immediately P(−t < (X̄ − µ)/σ̂X̄ ≤ t) = FTn−1(t) − FTn−1(−t) and thus the

Definition: confidence interval for µ for small samples with normally distributed basic
population and unknown variance.

CI(µ, 1 − α) = [x̄ − tn−1 σ̂X̄ , x̄ + tn−1 σ̂X̄ ]


with the (1 − α/2) quantile tn−1 = tn−1;[1−α/2] .

12. Interval estimators 311


Property of t quantiles

It is
tn−1;[1−α/2] = −tn−1;[α/2]

This property, by the way, also holds for the quantiles of the normal distribution z[·] , which can be
found in the quantile table of the STUDENT-t distribution in the bottom row for n = ∞.

It is important to emphasize here once again that the use of the t-distribution presupposes a normally
distributed basic population!

311 - 1
Interval estimators for small samples
Example

Question: How much do university graduates


earn on average five years after graduation?
A survey of randomly selected 25 alumni yields
an average gross income of 42 720 € with an empirical standard deviation of 6256 €. The
income can be considered normally distributed
to a good approximation.

1. Point estimate of the standard deviation of the basic population:


σ̂ = √(25/24) · 6256 € = 6385 €

2. Estimated standard deviation of the sample mean:

σ̂X̄ = σ̂/√n = 6385 €/√25 = 1277 €

12. Interval estimators 312


Interval estimators for small samples
Example

3. We determine the t value at a significance level of 5 % with 24 degrees of freedom:

t24;[1−0.025] = 2.064

4. The confidence interval is then

CI(µ, 0.95) = [42 720 € − 2.064 · 1277 €; 42 720 € + 2.064 · 1277 €]
            = [40 084 €; 45 355 €]

(Figure: 95 % of the sampling distribution between 40 084 € and 45 355 €, with 2.5 % in each tail, centered at 42 720 €.)
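The same interval can be computed with the t quantile from scipy (a minimal sketch, not part of the script); the upper bound comes out as 45 356 € without the slide's intermediate rounding:

import numpy as np
from scipy.stats import t

n, xbar, s = 25, 42_720, 6_256
sigma_hat = np.sqrt(n / (n - 1)) * s          # 6385
se = sigma_hat / np.sqrt(n)                   # 1277
q = t.ppf(0.975, df=n - 1)                    # t_24;[0.975] = 2.064
print(round(xbar - q * se), round(xbar + q * se))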

12. Interval estimators 313


Interval estimators for small samples
Example: Reading from the quantile table

In the quantile table of the STUDENT-t distribution, we find the row with 24 degrees of freedom. We are looking for the 0.975 quantile, so we look in the corresponding column. It is t24;[1−0.025] = t24;[0.975] = 2.064.

STUDENT’s t-distribution
Degrees of Quantiles
freedom t[0.6] t[0.667] t[0.75] t[0.8] t[0.875] t[0.9] t[0.95] t[0.975] t[0.99] t[0.995] t[0.999]

1 0.325 0.577 1.000 1.376 2.414 3.078 6.314 12.706 31.821 63.657 318.31
2 0.289 0.500 0.816 1.061 1.604 1.886 2.920 4.303 6.965 9.925 22.327
3 0.277 0.476 0.765 0.978 1.423 1.638 2.353 3.182 4.541 5.841 10.215
4 0.271 0.464 0.741 0.941 1.344 1.533 2.132 2.776 3.747 4.604 7.173
5 0.267 0.457 0.727 0.920 1.301 1.476 2.015 2.571 3.365 4.032 5.893
6 0.265 0.453 0.718 0.906 1.273 1.440 1.943 2.447 3.143 3.707 5.208
7 0.263 0.449 0.711 0.896 1.254 1.415 1.895 2.365 2.998 3.499 4.785
8 0.262 0.447 0.706 0.889 1.240 1.397 1.860 2.306 2.896 3.355 4.501
9 0.261 0.445 0.703 0.883 1.230 1.383 1.833 2.262 2.821 3.250 4.297
10 0.260 0.444 0.700 0.879 1.221 1.372 1.812 2.228 2.764 3.169 4.144
11 0.260 0.443 0.697 0.876 1.214 1.363 1.796 2.201 2.718 3.106 4.025
12 0.259 0.442 0.695 0.873 1.209 1.356 1.782 2.179 2.681 3.055 3.930
13 0.259 0.441 0.694 0.870 1.204 1.350 1.771 2.160 2.650 3.012 3.852
14 0.258 0.440 0.692 0.868 1.200 1.345 1.761 2.145 2.624 2.977 3.787
15 0.258 0.439 0.691 0.866 1.197 1.341 1.753 2.131 2.602 2.947 3.733
16 0.258 0.439 0.690 0.865 1.194 1.337 1.746 2.120 2.583 2.921 3.686
17 0.257 0.438 0.689 0.863 1.191 1.333 1.740 2.110 2.567 2.898 3.646
18 0.257 0.438 0.688 0.862 1.189 1.330 1.734 2.101 2.552 2.878 3.610
19 0.257 0.438 0.688 0.861 1.187 1.328 1.729 2.093 2.539 2.861 3.579
20 0.257 0.437 0.687 0.860 1.185 1.325 1.725 2.086 2.528 2.845 3.552
21 0.257 0.437 0.686 0.859 1.183 1.323 1.721 2.080 2.518 2.831 3.527
22 0.256 0.437 0.686 0.858 1.182 1.321 1.717 2.074 2.508 2.819 3.505
23 0.256 0.436 0.685 0.858 1.180 1.319 1.714 2.069 2.500 2.807 3.485
24 0.256 0.436 0.685 0.857 1.179 1.318 1.711 2.064 2.492 2.797 3.467
25 0.256 0.436 0.684 0.856 1.178 1.316 1.708 2.060 2.485 2.787 3.450
26 0.256 0.436 0.684 0.856 1.177 1.315 1.706 2.056 2.479 2.779 3.435
27 0.256 0.435 0.684 0.855 1.176 1.314 1.703 2.052 2.473 2.771 3.421
28 0.256 0.435 0.683 0.855 1.175 1.313 1.701 2.048 2.467 2.763 3.408
29 0.256 0.435 0.683 0.854 1.174 1.311 1.699 2.045 2.462 2.756 3.396
30 0.256 0.435 0.683 0.854 1.173 1.310 1.697 2.042 2.457 2.750 3.385
35 0.255 0.434 0.682 0.852 1.170 1.306 1.690 2.030 2.438 2.724 3.340
40 0.255 0.434 0.681 0.851 1.167 1.303 1.684 2.021 2.423 2.704 3.307
45 0.255 0.434 0.680 0.850 1.165 1.301 1.679 2.014 2.412 2.690 3.281
50 0.255 0.433 0.679 0.849 1.164 1.299 1.676 2.009 2.403 2.678 3.261
55 0.255 0.433 0.679 0.848 1.163 1.297 1.673 2.004 2.396 2.668 3.245
60 0.254 0.433 0.679 0.848 1.162 1.296 1.671 2.000 2.390 2.660 3.232
∞ 0.253 0.431 0.674 0.842 1.150 1.282 1.645 1.960 2.326 2.576 3.090

12. Interval estimators 314


Interval estimators for small samples

The empirical variance S 2 of a sample is also a random variable. Its distribution can be
calculated for the case that the characteristic is approximately normally distributed in the
basic population with the mean µ and the standard deviation σ – and the individual
samples are drawn independently (i.e. with replacement).

Theorem 12:

If the characteristic is normally distributed in the basic population, the quotient

n S²/σ² = χ²n−1

is chi-square distributed with n − 1 degrees of freedom.

It follows that

P( χ²n−1;[α/2] < n S²/σ² ≤ χ²n−1;[1−α/2] ) = 1 − α

12. Interval estimators 315


Interval estimators for small samples

Definition: Confidence interval for σ 2 for small samples with normally distributed basic
population:

CI(σ², 1 − α) = [ n · s²/χ²upper , n · s²/χ²lower ]

with the quantiles χ²lower = χ²n−1;[α/2] and χ²upper = χ²n−1;[1−α/2].

(Figure: chi-square density with central area 1 − α between χ²lower and χ²upper and tail areas α/2 on each side.)

12. Interval estimators 316


Interval estimators for small samples
Example

Task:
From a sample of size n = 30 from a normally distributed basic population, the empirical
variance s2 = 225 is obtained. For the variance σ 2 of the basic population a point
estimator as well as a confidence interval at a confidence level of 0.95 shall be given.

Solution:
1. Point estimate for the basic population

σ̂² = (30/29) · 225 = 232.76

2. From the table of the chi-square distribution with 29 degrees of freedom we find the two values

χ²lower = χ²29;[0.025] = 16.047    χ²upper = χ²29;[0.975] = 45.722

3. Confidence interval

CI(σ², 0.95) = [ 30 · 225/45.722 ; 30 · 225/16.047 ] = [147.6; 420.6]
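A minimal Python sketch (not part of the script) reproduces the quantiles and the interval:

from scipy.stats import chi2

n, s2, alpha = 30, 225, 0.05
lower_q = chi2.ppf(alpha / 2, df=n - 1)        # 16.047
upper_q = chi2.ppf(1 - alpha / 2, df=n - 1)    # 45.722
print(round(n * s2 / upper_q, 1), round(n * s2 / lower_q, 1))   # 147.6, 420.6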

12. Interval estimators 317


Interval estimators for small samples
Example: Reading from the quantile table

In the quantile table of the chi-square distribution, we find the row with 29 degrees of freedom. We are looking for the 0.025 and the 0.975 quantile, so we look in the corresponding columns. We have χ²29;[0.025] = 16.047 and χ²29;[0.975] = 45.722.

Chi-square (χ²) distribution


Degrees of Quantiles
freedom χ2[0.005] χ2[0.01] χ2[0.025] χ2[0.05] χ2[0.1] χ2[0.9] χ2[0.95] χ2[0.975] χ2[0.99] χ2[0.995]

1 0.000 0.000 0.001 0.004 0.016 2.706 3.841 5.024 6.635 7.879
2 0.010 0.020 0.051 0.103 0.211 4.605 5.991 7.378 9.210 10.597
3 0.072 0.115 0.216 0.352 0.584 6.251 7.815 9.348 11.345 12.838
4 0.207 0.297 0.484 0.711 1.064 7.779 9.488 11.143 13.277 14.860
5 0.412 0.554 0.831 1.145 1.610 9.236 11.070 12.833 15.086 16.750

6 0.676 0.872 1.237 1.635 2.204 10.645 12.592 14.449 16.812 18.548
7 0.989 1.239 1.690 2.167 2.833 12.017 14.067 16.013 18.475 20.278
8 1.344 1.646 2.180 2.733 3.490 13.362 15.507 17.535 20.090 21.955
9 1.735 2.088 2.700 3.325 4.168 14.684 16.919 19.023 21.666 23.589
10 2.156 2.558 3.247 3.940 4.865 15.987 18.307 20.483 23.209 25.188
11 2.603 3.053 3.816 4.575 5.578 17.275 19.675 21.920 24.725 26.757
12 3.074 3.571 4.404 5.226 6.304 18.549 21.026 23.337 26.217 28.300

13 3.565 4.107 5.009 5.892 7.042 19.812 22.362 24.736 27.688 29.819
14 4.075 4.660 5.629 6.571 7.790 21.064 23.685 26.119 29.141 31.319
15 4.601 5.229 6.262 7.261 8.547 22.307 24.996 27.488 30.578 32.801
16 5.142 5.812 6.908 7.962 9.312 23.542 26.296 28.845 32.000 34.267
17 5.697 6.408 7.564 8.672 10.085 24.769 27.587 30.191 33.409 35.718
18 6.265 7.015 8.231 9.390 10.865 25.989 28.869 31.526 34.805 37.156
19 6.844 7.633 8.907 10.117 11.651 27.204 30.144 32.852 36.191 38.582
20 7.434 8.260 9.591 10.851 12.443 28.412 31.410 34.170 37.566 39.997
21 8.034 8.897 10.283 11.591 13.240 29.615 32.671 35.479 38.932 41.401
22 8.643 9.542 10.982 12.338 14.041 30.813 33.924 36.781 40.289 42.796
23 9.260 10.196 11.689 13.091 14.848 32.007 35.172 38.076 41.638 44.181
24 9.886 10.856 12.401 13.848 15.659 33.196 36.415 39.364 42.980 45.559

25 10.520 11.524 13.120 14.611 16.473 34.382 37.652 40.646 44.314 46.928
26 11.160 12.198 13.844 15.379 17.292 35.563 38.885 41.923 45.642 48.290
27 11.808 12.879 14.573 16.151 18.114 36.741 40.113 43.195 46.963 49.645
28 12.461 13.565 15.308 16.928 18.939 37.916 41.337 44.461 48.278 50.993
29 13.121 14.256 16.047 17.708 19.768 39.087 42.557 45.722 49.588 52.336
30 13.787 14.953 16.791 18.493 20.599 40.256 43.773 46.979 50.892 53.672
35 17.192 18.509 20.569 22.465 24.797 46.059 49.802 53.203 57.342 60.275
40 20.707 22.164 24.433 26.509 29.051 51.805 55.758 59.342 63.691 66.766
45 24.311 25.901 28.366 30.612 33.350 57.505 61.656 65.410 69.957 73.166
50 27.991 29.707 32.357 34.764 37.689 63.167 67.505 71.420 76.154 79.490
55 31.735 33.570 36.398 38.958 42.060 68.796 73.311 77.380 82.292 85.749
60 35.534 37.485 40.482 43.188 46.459 74.397 79.082 83.298 88.379 91.952
70 43.275 45.442 48.758 51.739 55.329 85.527 90.531 95.023 100.425 104.215
80 51.172 53.540 57.153 60.391 64.278 96.578 101.879 106.629 112.329 116.321
90 59.196 61.754 65.647 69.126 73.291 107.565 113.145 118.136 124.116 128.299
100 67.328 70.065 74.222 77.929 82.358 118.498 124.342 129.561 135.807 140.169

12. Interval estimators 318


Summary
Useful is the following overview of the „different variances“

Sample variance

s² = (1/n) Σ_{j=1}^{n} (xj − x̄)²

After a sample is drawn, the quantities x̄ and s² are always calculated. Here n is the sample size and

x̄ = (1/n) Σ_{j=1}^{n} xj

the sample mean.

318 - 1
The variance in the basic population

σ² = (1/N) Σ_{j=1}^{N} (xj − µ)²

with the size N of the basic population and its arithmetic mean

µ = (1/N) Σ_{j=1}^{N} xj

is usually unknown when estimating and testing. Only after a census of the characteristic X
could µ and σ 2 be calculated in this way.
Estimated variance in the basic population

σ̂² = (n/(n−1)) s²

This estimation formula yields an unbiased estimate for σ 2 (given independence). Here, n − 1 is
the number of degrees of freedom.

318 - 2
The variance of the sample mean

V(X̄) = σ²X̄ = σ²/n

can be calculated in the case that the variance in the basic population is known.
Otherwise, we calculate the estimated variance of the sample mean

V̂(X̄) = σ̂²X̄ = σ̂²/n,

from the estimated variance in the basic population.

Confidence intervals for the mean µ

1. Large sample with known variance

CI(µ, 1 − α) = [x̄ − z σX̄ , x̄ + z σX̄ ]

with z = z[1−α/2] .

318 - 3
2. Large sample with unknown variance

CI(µ, 1 − α) = [x̄ − z σ̂X̄ , x̄ + z σ̂X̄ ]

with z = z[1−α/2] .
3. Small sample with normally distributed basic population with known variance is to be
treated as 1. since according to Theorem 11 item 1. the sample mean X̄ is also normally
distributed:
CI(µ, 1 − α) = [x̄ − z σX̄ , x̄ + z σX̄ ]

with z = z[1−α/2] .
4. Small sample with normally distributed basic population and unknown variance

CI(µ, 1 − α) = [x̄ − tn−1 σ̂X̄ , x̄ + tn−1 σ̂X̄ ]

with the (1 − α/2)-quantile tn−1 = tn−1;[1−α/2] of the S TUDENT-t-distribution.

Confidence interval for the variance σ² for small samples with normally distributed basic population

CI(σ², 1 − α) = [ n · s²/χ²upper , n · s²/χ²lower ]

with the quantiles χ2lower = χ2n−1;[α/2] and χ2upper = χ2n−1;[1−α/2] of the chi-square distribution.

318 - 4
Control questions

1. About which random variable does the sampling distribution provide information?
2. What properties should samples have in order to provide reliable information about
the basic population?
3. What is the role of the central limit theorem in estimation, and what is the role of the
limit theorem of DE M OIVRE and L APLACE?
4. When is a sample considered to be „large“?
5. What does a confidence interval provide information about?
6. How are the chi-square distribution and the normal distribution related?
7. Why is the t distribution mostly tabulated only up to n = 100?
8. If the sample size is too small, one must use the t distribution. Is this statement
correct without restrictions?
9. When do you use the t distribution to determine a confidence interval? Which
quantity is t-distributed in these cases?

12. Interval estimators 319


13 Statistical testing

13.1 Null hypothesis, alternative hypothesis, decision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322


13.2 Testing hypotheses regarding mean values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
13.3 Testing hypotheses regarding variances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
13.4 Summary of one-sample tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
13.5 Comparison of two means. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .347
13.6 Comparison of two variances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
13.7 Regression analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349

according to [Schira, 2016], chapter 15 and 17


see also: [Anderson et al., 2012], chapters 9-11; and [Newbold et al., 2013], chapter 9, 10
13. Statistical testing 320
Statistical testing

Methods of estimation and testing are applications of the sampling theory.


Aim: Making a decision about a hypothesis

Definition: Hypotheses are assumptions, e.g., about a distribution or about individual


parameters of the distribution of a characteristic in a basic population.

Where do these hypotheses come from?


theoretical considerations
principle of insufficient cause
reasonable conjecture
previous observations

Note: Whether a formulated hypothesis is true or false cannot be determined with


a sample!
Test decision: Retain or reject hypothesis.

13. Statistical testing 321


Null hypothesis, alternative hypothesis, decision

Given: Hypothesis about the numerical value θ0 of a parameter θ of a distribution.


Mostly the distribution of a characteristic in a basic population or the probability
distribution of a random variable is concerned.
The parameter under question θ can be, for example:
Mean µ
Proportion p
Standard deviation σ
or another measure.

13. Statistical testing 322


Null hypothesis, alternative hypothesis, decision

Definition:
Null hypothesis or initial hypothesis

H0 : θ = θ0

H0 can be right or wrong. In any case, it will be retained until sufficient evidence is
provided to the contrary (sample).
Alternative hypothesis
H1 : θ ̸= θ0

Often only the one-sided question is of interest and one formulates

H0 : θ ≤ θ 0 against H1 : θ > θ0
or
H0 : θ ≥ θ 0 against H1 : θ < θ0 .

It is important to note that H0 and H1 must be mutually exclusive.

13. Statistical testing 323


Null hypothesis, alternative hypothesis, decision

Four ways reality and the test decision can collide:

                                 Reality
                      H0 is correct              H0 is wrong
Test decision
  retain H0           o.k.                       Type 2 error (β-error)
  reject H0           Type 1 error (α-error)     o.k.

Definition:
Type 1 error: the null hypothesis is rejected even though it is correct.
Type 2 error: the null hypothesis is retained even though it is wrong.

The main focus lies on the type 1 error:

P(reject H0 |H0 right) = α

should be as small as possible.


13. Statistical testing 324
Null hypothesis, alternative hypothesis, decision

Test procedure :
1. Formulate hypothesis: H0 vs. H1
2. Calculate test statistic/test quantity (from sample) T (x1 , . . . , xn )
3. Determine critical values and rejection region A.
4. Test decision

T (x1 , . . . , xn ) ∈ A ⇒ reject H0
T (x1 , . . . , xn ) ̸∈ A ⇒ retain H0

T (x1 , . . . , xn ) and A are the two essential quantities.

13. Statistical testing 325


Testing hypotheses regarding mean values

Let µ be the mean of the metric characteristic X in a basic population.

Null hypothesis: H0 : µ = µ0

Draw a random sample from the basic population: x1 , . . . , xn

Calculate: x̄
Deviation: |x̄ − µ0 | > 0

Question: Reject null hypothesis?

Is the deviation significant or just random? How to decide?


A correct null hypothesis should only be rejected with a very low probability α.
Common significance levels are α = 0.05 or α = 0.01.

13. Statistical testing 326


Testing hypotheses regarding mean values

Rejection regions should be constructed such that the probability of the sample mean x̄
to fall within the rejection region, even though H0 is correct, is at most α:

(Figure: sampling distribution of X̄ under H0 — acceptance region of probability 1 − α around µ0, bounded by the two critical values, with rejection regions of probability α/2 on each side.)

P(X̄ ∈ rejection region A | H0 right) = α

13. Statistical testing 327


Testing hypotheses regarding mean values

(Figure: acceptance and rejection regions — difference between two-sided and one-sided questions. Shown is the sampling distribution f(x̄) under the condition that the expected value of the basic population is µ = µ0.)

two-sided test:    H0: µ = µ0 vs. H1: µ ≠ µ0 — rejection regions of probability α/2 on both sides of the acceptance region
upper-sided test:  H0: µ ≤ µ0 vs. H1: µ > µ0 — rejection region of probability α above the acceptance region
lower-sided test:  H0: µ ≥ µ0 vs. H1: µ < µ0 — rejection region of probability α below the acceptance region

Definition:
The probability of the type 1 error

P(X̄ ∈ A | H0 right) = α

is called significance level. A denotes the rejection region.
13. Statistical testing 328
Testing hypotheses regarding mean values

In addition to the one-sided or two-sided question, we again have to distinguish different


cases in the following – as we did with the confidence intervals – because the underlying
distributions differ:
Variance of basic population known/unknown?
Large/small sample?
We distinguish between the following tests for mean values:
The G AUSS test is used in the following situations:
Variance of the basic population known, large sample
Variance of the basic population known, small sample, normally distributed basic
population
variance of basic population unknown, large sample;
we then use the estimated variance of the sample mean

The t test is used for


Variance of the basic population unknown, small sample, (approximately)
normally distributed basic population

13. Statistical testing 329


Testing hypotheses regarding mean values
G AUSS-Test

Two-sided test:
In the two-sided test, the rejection region is symmetrically arranged on both sides of
the acceptance region.
Assumption: variance σ² known
The standardized test variable (X̄ − µ)/σX̄ is standard normally distributed under H0.
For large samples (and for small samples with a normally distributed basic population):

   P( |X̄ − µ0| / σX̄ > z[1−α/2] | µ = µ0 ) = α

Test procedure two-sided Gauss test:
1. Formulate hypothesis: H0 : µ = µ0 vs. H1 : µ ≠ µ0
2. Test statistic/test quantity: T(x1, . . . , xn) = (x̄ − µ0)/σX̄
3. Critical value: k = z[1−α/2]
4. Test decision: If |T(x1, . . . , xn)| > k ⇒ reject H0
13. Statistical testing 330


Testing hypotheses regarding mean values
Example

In the student pub Finkenkrug, calibrated beer glasses are


supposed to contain 0.4 L beer. For a sample of size n = 50, the
average filling quantity is 0.38 L with a known variance of
0.0064 L² (standard deviation = 0.08 L).
Question: At a significance level of 5 %, can we retain the null
hypothesis that on average there is 0.4 L of beer in the glass?

Test procedure:
1. Formulate hypothesis (two-sided): H0 : µ = 0.4 L vs. H1 : µ ≠ 0.4 L
2. To calculate the test statistic, we need the variance of the sample mean:

   σX̄² = 0.0064 L² / 50 = 0.000 128 L²   ⇒   σX̄ = √(0.000 128 L²) = 0.011 31 L

   Calculate the test quantity:

   T(x1, . . . , xn) = (x̄ − µ0)/σX̄ = (0.38 − 0.4)/0.011 31 = −1.77

13. Statistical testing 331


Testing hypotheses regarding mean values

3. critical value for α = 0.05:

k = z[1−0.025] = z[0.975] = 1.96

4. Test decision:

|−1.77| = 1.77 ≤ 1.96 ⇒ retain H0

[Figure: standard normal density f(z); the test statistic −1.77 lies inside the acceptance region between the critical values −1.96 and 1.96]

The null hypothesis H0 cannot be rejected at a significance level of 5 %.
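
The decision above can be cross-checked numerically. The following sketch is not part of the script; it assumes NumPy and SciPy are available and simply re-evaluates the test quantity and the critical value.

```python
# Sketch of the two-sided Gauss test for the beer-glass example (values from the slides).
import numpy as np
from scipy.stats import norm

mu0, n, x_bar, sigma = 0.4, 50, 0.38, 0.08   # hypothesized mean, sample size, sample mean, known std. dev.
sigma_xbar = sigma / np.sqrt(n)              # standard deviation of the sample mean
T = (x_bar - mu0) / sigma_xbar               # test quantity, approx. -1.77
k = norm.ppf(1 - 0.05 / 2)                   # critical value z[0.975] = 1.96
print(round(T, 2), round(k, 2), "reject H0" if abs(T) > k else "retain H0")
```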

13. Statistical testing 332


Testing hypotheses regarding mean values
Gauss test

One-sided test:
In the one-sided test, the whole rejection region lies on one side of the acceptance region.

Test procedure upper-sided Gauss test:
1. Formulate hypothesis: H0 : µ ≤ µ0 vs. H1 : µ > µ0
2. Test statistic/test quantity: T(x1, . . . , xn) = (x̄ − µ0)/σX̄
3. Critical value: k = z[1−α]
4. Test decision: If T(x1, . . . , xn) > k ⇒ reject H0

Test procedure lower-sided Gauss test:
1. Formulate hypothesis: H0 : µ ≥ µ0 vs. H1 : µ < µ0
2. Test statistic/test quantity: T(x1, . . . , xn) = (x̄ − µ0)/σX̄
3. Critical value: k = z[α]
4. Test decision: If T(x1, . . . , xn) < k ⇒ reject H0
13. Statistical testing 333


Testing hypotheses regarding mean values
Example

In the student pub Finkenkrug, a one-sided test would be more


appropriate. After all, no one will complain if there is too much beer
in the glass. The null hypothesis would thus have to be modified.
(µ0 = 0.4 L, n = 50, x̄ = 0.38 L, σ² = 0.0064 L² or σ = 0.08 L).
Question: At a significance level of 5 %, can we retain the null
hypothesis that on average there is at least 0.4 L of beer in the
glass?

Test procedure:
1. Formulate hypothesis (lower-sided): H0 : µ ≥ 0.4 L vs. H1 : µ < 0.4 L
2. To calculate the test statistic, we need the variance of the sample mean:

   σX̄² = 0.0064 L² / 50 = 0.000 128 L²   ⇒   σX̄ = √(0.000 128 L²) = 0.011 31 L

   Calculate the test quantity:

   T(x1, . . . , xn) = (x̄ − µ0)/σX̄ = (0.38 − 0.4)/0.011 31 = −1.77

13. Statistical testing 334


Testing hypotheses regarding mean values

3. Critical value for α = 0.05:

k = z[0.05] = −z[0.95] = −1.645

4. Test decision:

−1.77 < −1.645 ⇒ reject H0 !

[Figure: standard normal density f(z) with the test statistic −1.77; the old two-sided rejection regions lie beyond ±1.96, the new one-sided rejection region lies below −1.645]

This means that a different test decision is made here than in the previous example.
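
The changed decision follows from the same numbers with a different critical value; a minimal sketch, again assuming SciPy:

```python
# Sketch of the lower-sided Gauss test with the same data as above.
from scipy.stats import norm

T = -1.77                                     # test quantity from the example
k = norm.ppf(0.05)                            # critical value z[0.05] = -1.645
print("reject H0" if T < k else "retain H0")  # -1.77 < -1.645 -> reject H0
```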

13. Statistical testing 335


Testing hypotheses regarding mean values
t-Test

Assumption: variance σ² is unknown and has to be estimated ⇒ σ̂²

For large samples, the Gauss test can still be used because of the CLT (see
slides 330 and 333). Here, however, the variance σ² (since it is unknown) has to be
replaced in the formulas by the unbiased estimate σ̂².
For small samples, the t distribution is used – just as for the confidence intervals.
Prerequisite: the basic population is at least approximately normally distributed.

Test procedure two-sided t-test:
1. Formulate hypothesis: H0 : µ = µ0 vs. H1 : µ ≠ µ0
2. Test statistic/test quantity: T(x1, . . . , xn) = (x̄ − µ0)/σ̂X̄
3. Critical value: k = tn−1;[1−α/2]
4. Test decision: If |T(x1, . . . , xn)| > k ⇒ reject H0

13. Statistical testing 336


Testing hypotheses regarding mean values
t-Test

Accordingly, we obtain the one-sided t tests:

Test procedure upper-sided t-test:
1. Formulate hypothesis: H0 : µ ≤ µ0 vs. H1 : µ > µ0
2. Test statistic/test quantity: T(x1, . . . , xn) = (x̄ − µ0)/σ̂X̄
3. Critical value: k = tn−1;[1−α]
4. Test decision: If T(x1, . . . , xn) > k ⇒ reject H0

Test procedure lower-sided t-test:
1. Formulate hypothesis: H0 : µ ≥ µ0 vs. H1 : µ < µ0
2. Test statistic/test quantity: T(x1, . . . , xn) = (x̄ − µ0)/σ̂X̄
3. Critical value: k = tn−1;[α]
4. Test decision: If T(x1, . . . , xn) < k ⇒ reject H0

13. Statistical testing 337


Testing hypotheses regarding mean values
Example

A small retail chain knows from experience that the average sales
of its 48 stores are 25 % higher in December than in November.
On New Year’s Eve, a small random sample of n = 8 stores is
hastily drawn. It yields the following sales increases in percent:

i 1 2 3 4 5 6 7 8

xi 26.5 22.5 25.9 25.2 25.4 24.0 28.2 29.2

Question: At a significance level of 5 %, can the null hypothesis that the average
increase in sales was 25 % be retained?

13. Statistical testing 338


Testing hypotheses regarding mean values
Example

i 1 2 3 4 5 6 7 8

xi 26.5 22.5 25.9 25.2 25.4 24.0 28.2 29.2

First we calculate

   x̄ = 25.8625 %   and   sX² = 4.0548 %² .
Test procedure:
1. Formulate hypothesis (two-sided): H0 : µ = 25 % vs. H1 : µ ̸= 25 %
2. To calculate the test statistic, we need the estimated variance or standard deviation
of the sample mean, respectively.

   σ̂X² = (8/7) · 4.0548 %² = 4.634 06 %²   ⇒   σ̂X̄² = σ̂X²/8 = 0.579 26 %²   ⇒   σ̂X̄ = 0.7611 %

   Calculate the test quantity:

   T(x1, . . . , xn) = (x̄ − µ0)/σ̂X̄ = (25.8625 % − 25 %)/0.7611 % = 1.1332

13. Statistical testing 339


Testing hypotheses regarding mean values

3. Critical value for α = 0.05 and 7 degrees of freedom:

k = t7;[1−0.025] = t7;[0.975] = 2.365

4. Test decision:

|1.1332| = 1.1332 ≤ 2.365 ⇒ retain H0 !
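
A sketch of the same computation, not from the script and assuming SciPy: scipy.stats.ttest_1samp uses the unbiased variance estimate internally, so it reproduces the test quantity 1.1332.

```python
# Sketch: two-sided one-sample t-test for the sales-increase example.
import numpy as np
from scipy.stats import t, ttest_1samp

x = np.array([26.5, 22.5, 25.9, 25.2, 25.4, 24.0, 28.2, 29.2])   # sales increases in percent
res = ttest_1samp(x, popmean=25.0)            # test statistic T = 1.1332
k = t.ppf(0.975, df=len(x) - 1)               # critical value t7;[0.975] = 2.365
print(round(res.statistic, 4), round(k, 3),
      "reject H0" if abs(res.statistic) > k else "retain H0")
```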

13. Statistical testing 340


Testing hypotheses regarding variances
Chi-square test
What needs to be tested is the null hypothesis that the variance of a basic population
does not deviate from a given hypothetical value:
   H0 : σ² = σ0²   vs.   H1 : σ² ≠ σ0²

We recall theorem 12 on slide 315:

If the characteristic is normally distributed in the basic population, the quotient
n S²/σ² = χ²n−1 is chi-square distributed with n − 1 degrees of freedom.
This yields what is known as the chi-square test for variances:

Test procedure two-sided chi-square test:
1. Formulate hypothesis: H0 : σ² = σ0² vs. H1 : σ² ≠ σ0²
2. Test statistic/test quantity: T(x1, . . . , xn) = n s²/σ0²
3. Critical values: χ²lower = χ²n−1;[α/2] and χ²upper = χ²n−1;[1−α/2]
4. Test decision: If T(x1, . . . , xn) < χ²lower or T(x1, . . . , xn) > χ²upper ⇒ reject H0
13. Statistical testing 341
Testing hypotheses regarding variances
Chi-square test

[Figure: chi-square density f(χ²); the acceptance region lies between χ²lower and χ²upper, with rejection regions of probability α/2 in each tail]

13. Statistical testing 342


Testing hypotheses regarding variances
Chi-square test
There are one-sided tests available as well:

Test procedure upper-sided chi-square test:
1. Formulate hypothesis: H0 : σ² ≤ σ0² vs. H1 : σ² > σ0²
2. Test statistic/test quantity: T(x1, . . . , xn) = n s²/σ0²
3. Critical value: χ²upper = χ²n−1;[1−α]
4. Test decision: If T(x1, . . . , xn) > χ²upper ⇒ reject H0

Test procedure lower-sided chi-square test:
1. Formulate hypothesis: H0 : σ² ≥ σ0² vs. H1 : σ² < σ0²
2. Test statistic/test quantity: T(x1, . . . , xn) = n s²/σ0²
3. Critical value: χ²lower = χ²n−1;[α]
4. Test decision: If T(x1, . . . , xn) < χ²lower ⇒ reject H0

13. Statistical testing 343


Testing hypotheses regarding variances
Example

Cars of the same model and year of manufacture differ in


gasoline consumption. However, it is required that the
standard deviation of gasoline consumption is not greater than
0.3 L/100 km. An automobile magazine has tested 30
vehicles of a new model and determined

   8.0  8.3  8.2  7.4  7.9  8.3  7.0  7.5  7.8  7.3
   8.2  8.7  7.5  7.7  7.9  7.8  8.2  8.1  7.9  8.0
   7.9  8.3  8.1  8.1  7.8  7.7  8.1  8.0  8.3  7.9

as the gasoline consumption and now claims that the standard deviation in consumption
is way too large with 0.35 L/100 km. Is this true?

Task: Assume that the basic population is normally distributed. Test the hypothesis that
the standard deviation is at most 0.3 L/100 km, as required, at a significance level
of 10 %.

13. Statistical testing 344


Testing hypotheses regarding variances
Example

1. Formulate hypothesis (upper-sided)

   H0 : σ² ≤ 0.09   vs.   H1 : σ² > 0.09

2. Calculate the sample variance

   s² = (1/30) Σ (xj − x̄)²   (or equivalently the mean of the xj² minus x̄²)   = · · · = 0.1188

   This yields the test statistic:

   T(x1, . . . , xn) = n s²/σ0² = 30 · 0.1188/0.09 = 39.6

3. Critical value for α = 10 % and 29 degrees of freedom:

   χ²upper = χ²29;[1−0.1] = χ²29;[0.9] = 39.09

4. Test decision: 39.6 > 39.09 ⇒ reject H0 !
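
The chi-square test of this example can be checked with a few lines. This is a sketch, not part of the script; it assumes NumPy and SciPy, and s² is the variance with divisor n, as on the slides.

```python
# Sketch: upper-sided chi-square test for the gasoline-consumption example.
import numpy as np
from scipy.stats import chi2

x = np.array([8.0, 8.3, 8.2, 7.4, 7.9, 8.3, 7.0, 7.5, 7.8, 7.3,
              8.2, 8.7, 7.5, 7.7, 7.9, 7.8, 8.2, 8.1, 7.9, 8.0,
              7.9, 8.3, 8.1, 8.1, 7.8, 7.7, 8.1, 8.0, 8.3, 7.9])
n, sigma0_sq = len(x), 0.09
s_sq = x.var(ddof=0)                          # sample variance with divisor n, approx. 0.1188
T = n * s_sq / sigma0_sq                      # test quantity, approx. 39.6
k = chi2.ppf(0.9, df=n - 1)                   # critical value chi^2 29;[0.9] = 39.09
print(round(T, 1), round(k, 2), "reject H0" if T > k else "retain H0")
```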

13. Statistical testing 345


Summary of one-sample tests
Tests regarding the mean:

The Gauss test is used in case
  the variance is known (small sample with normally distributed basic population or large sample)
  ⇒ test quantity T(x1, . . . , xn) = (x̄ − µ)/σX̄ is standard normally distributed with σX̄ = σ/√n
  the variance is unknown and it's a large sample
  ⇒ test quantity T(x1, . . . , xn) = (x̄ − µ)/σ̂X̄ is standard normally distributed with σ̂X̄ = σ̂/√n

The t test is used if
  the variance is unknown and the sample is small
  (the basic population has to be at least approximately normally distributed)
  ⇒ test quantity T(x1, . . . , xn) = (x̄ − µ)/σ̂X̄ is t-distributed with n − 1 degrees of freedom
    and σ̂X̄ = σ̂/√n

345 - 1
Then

                   two-sided                     upper-sided                lower-sided

hypothesis         H0 : µ = µ0                   H0 : µ ≤ µ0                H0 : µ ≥ µ0
                   H1 : µ ≠ µ0                   H1 : µ > µ0                H1 : µ < µ0

test quantity      T(x1, . . . , xn) = (x̄ − µ)/σX̄   or   T(x1, . . . , xn) = (x̄ − µ)/σ̂X̄

critical value k =
  Gauss test:      z[1−α/2]                      z[1−α]                     z[α] = −z[1−α]
  t test:          tn−1;[1−α/2]                  tn−1;[1−α]                 tn−1;[α] = −tn−1;[1−α]

reject if          |T(x1, . . . , xn)| > k       T(x1, . . . , xn) > k      T(x1, . . . , xn) < k
retain if          |T(x1, . . . , xn)| ≤ k       T(x1, . . . , xn) ≤ k      T(x1, . . . , xn) ≥ k

345 - 2
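
The decision rules of this table can be collected in one small helper. The sketch below is illustrative only: the function name, its signature, and the large-sample threshold are assumptions, not part of the script; SciPy provides the quantiles.

```python
# Sketch: one-sample test for the mean, covering the Gauss and t variants from the table.
import numpy as np
from scipy.stats import norm, t

def mean_test(x_bar, mu0, sd_xbar, n, alpha=0.05, side="two-sided", known_variance=True):
    """Return the test quantity and the decision 'reject H0' / 'retain H0'."""
    T = (x_bar - mu0) / sd_xbar
    # quantile function: standard normal (Gauss test) if the variance is known or the
    # sample is large (n > 30 as a rule of thumb), otherwise t with n-1 degrees of freedom
    q = norm.ppf if (known_variance or n > 30) else (lambda p: t.ppf(p, df=n - 1))
    if side == "two-sided":
        reject = abs(T) > q(1 - alpha / 2)
    elif side == "upper":
        reject = T > q(1 - alpha)
    else:  # "lower"
        reject = T < q(alpha)
    return T, ("reject H0" if reject else "retain H0")

# Beer-glass example: two-sided Gauss test
print(mean_test(0.38, 0.4, 0.08 / np.sqrt(50), n=50))   # (-1.77, 'retain H0')
```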
Tests regarding the variance: chi-square test

is used if the basic population is approximately normally distributed.

                   two-sided                      upper-sided               lower-sided

hypothesis         H0 : σ² = σ0²                  H0 : σ² ≤ σ0²             H0 : σ² ≥ σ0²
                   H1 : σ² ≠ σ0²                  H1 : σ² > σ0²             H1 : σ² < σ0²

test quantity      T = T(x1, . . . , xn) = n s²/σ0²

critical value     χ²lower = χ²n−1;[α/2]          χ²upper = χ²n−1;[1−α]     χ²lower = χ²n−1;[α]
                   χ²upper = χ²n−1;[1−α/2]

reject if          T < χ²lower or T > χ²upper     T > χ²upper               T < χ²lower
retain if          χ²lower ≤ T ≤ χ²upper          T ≤ χ²upper               T ≥ χ²lower

345 - 3
Comparison of two means
Assume two independent samples of size n1 and n2
with the means x̄1 and x̄2
were taken. We will now test the hypothesis whether the two samples originate from the same basic
population or at least are taken from populations with the same mean:

H0 : µ 1 = µ 2 vs. H1 : µ1 ̸= µ2

The random variable defined by the difference of the two sample means,

∆ = X̄1 − X̄2

is approximately normally distributed under the null hypothesis if the sample sizes are large (CLT) or
if the characteristic is normally distributed in the basic population. Then it holds

E(∆) = 0

and

V(∆) = V(X̄1 ) + V(X̄2 )

346 - 1
if the samples are independent, and hence

   σ∆ = √(σ1²/n1 + σ2²/n2) .

The test quantity is therefore

   T = (x̄1 − x̄2)/σ∆

and the critical value for two-sided questions is again k = z[1−α/2], as in the Gauss test for one sample.

This hypothesis test is also called the two-sample Gauss test:

Test procedure two-sided two-sample Gauss test:
1. Formulate hypothesis: H0 : µ1 = µ2 vs. H1 : µ1 ≠ µ2
2. Test statistic/test quantity: T = (x̄1 − x̄2)/σ∆
3. Critical value: k = z[1−α/2]
4. Test decision: If |T| > k ⇒ reject H0

346 - 2
Just as with the Gauss test for one sample, there are one-sided tests as well:

Test procedure upper-sided two-sample Gauss test:
1. Formulate hypothesis: H0 : µ1 ≤ µ2 vs. H1 : µ1 > µ2
2. Test statistic/test quantity: T = (x̄1 − x̄2)/σ∆
3. Critical value: k = z[1−α]
4. Test decision: If T > k ⇒ reject H0

Test procedure lower-sided two-sample Gauss test:
1. Formulate hypothesis: H0 : µ1 ≥ µ2 vs. H1 : µ1 < µ2
2. Test statistic/test quantity: T = (x̄1 − x̄2)/σ∆
3. Critical value: k = z[α]
4. Test decision: If T < k ⇒ reject H0

If the variance or variances of the basic populations have to be estimated, one uses

   σ̂∆ = √(σ̂1²/n1 + σ̂2²/n2)

346 - 3
in the two-sample Gauss test accordingly for large samples.

For small samples from normally distributed basic populations, the additional condition

   σ1² = σ2² = σ²

must hold; then

   (X̄1 − X̄2)/σ̂∆ = Tn1+n2−2

is t-distributed with n1 + n2 − 2 degrees of freedom. The variance σ² is estimated from both samples,

   σ̂² = (n1 s1² + n2 s2²)/(n1 + n2 − 2) ,

and used to calculate

   σ̂∆ = √(σ̂²/n1 + σ̂²/n2) = σ̂ · √((n1 + n2)/(n1 · n2))

Test procedure two-sided two-sample t test:
1. Formulate hypothesis: H0 : µ1 = µ2 vs. H1 : µ1 ≠ µ2
2. Test statistic/test quantity: T = (x̄1 − x̄2)/σ̂∆
3. Critical value: k = tn1+n2−2;[1−α/2]
4. Test decision: If |T| > k ⇒ reject H0
346 - 4
Likewise, there are also one-sided tests available:

Test procedure upper-sided two-sample t test:
1. Formulate hypothesis: H0 : µ1 ≤ µ2 vs. H1 : µ1 > µ2
2. Test statistic/test quantity: T = (x̄1 − x̄2)/σ̂∆
3. Critical value: k = tn1+n2−2;[1−α]
4. Test decision: If T > k ⇒ reject H0

Test procedure lower-sided two-sample t test:
1. Formulate hypothesis: H0 : µ1 ≥ µ2 vs. H1 : µ1 < µ2
2. Test statistic/test quantity: T = (x̄1 − x̄2)/σ̂∆
3. Critical value: k = tn1+n2−2;[α]
4. Test decision: If T < k ⇒ reject H0

346 - 5
Example:

Stiftung Warentest praises the new car tire »Super ZZ«. It is said to have more than 10 % higher
mileage than its predecessor »Z«. The organization has tested four sets of each type of tire and
obtained the following result:

Mileage in km
»Super ZZ« »Z«

50 000 43 000
41 000 44 000
40 000 36 000
49 000 37 000

We test the null hypothesis that the new tire X1 has no higher mileage than the old tire X2 at a
significance level of 5 %:

From the two samples we first calculate (in thousands of km)

x̄1 = 45 and x̄2 = 40 .

346 - 6
The sample variances are

   s1² = (1/4) [ (50 − 45)² + (41 − 45)² + (40 − 45)² + (49 − 45)² ]
       = (1/4) [ 5² + 4² + 5² + 4² ] = 82/4 = 20.5

and

   s2² = (1/4) [ (43 − 40)² + (44 − 40)² + (36 − 40)² + (37 − 40)² ]
       = (1/4) [ 3² + 4² + 4² + 3² ] = 50/4 = 12.5 .

This results in the estimated variance of the basic population

   σ̂² = (4 s1² + 4 s2²)/(4 + 4 − 2) = (82 + 50)/6 = 22

and finally the estimated standard deviation of the difference ∆,

   σ̂∆ = σ̂ · √((n1 + n2)/(n1 · n2)) = √(22 · 8/16) = 3.3166 .

346 - 7
Test procedure:

1. Formulate hypothesis (upper-sided)

H0 : µ1 ≤ µ2 vs. H1 : µ1 > µ2 .

2. Calculate the test quantity:

   T = (x̄1 − x̄2)/σ̂∆ = 5/3.3166 = 1.5076

3. critical value for α = 0.05 and 6 degrees of freedom:

k = t6;[1−0.05] = t6;[0.95] = 1.943

4. Test decision:
1.5076 < 1.943 ⇒ retain H0 !

The null hypothesis cannot be rejected at a significance level of 5 %.
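
A sketch of this two-sample t test, not from the script and assuming a recent SciPy: with equal_var=True, scipy.stats.ttest_ind pools the variances exactly as in the formula above and reproduces T = 1.5076.

```python
# Sketch: upper-sided two-sample t-test for the tire-mileage example (values in thousands of km).
from scipy.stats import t, ttest_ind

x1 = [50, 41, 40, 49]                          # "Super ZZ"
x2 = [43, 44, 36, 37]                          # "Z"
res = ttest_ind(x1, x2, equal_var=True, alternative="greater")   # pooled-variance t-test
k = t.ppf(0.95, df=len(x1) + len(x2) - 2)      # critical value t6;[0.95] = 1.943
print(round(res.statistic, 4), round(k, 3),
      "reject H0" if res.statistic > k else "retain H0")
```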

346 - 8
Comparison of two variances
We have already learned about the t-distribution and the chi-square distribution in the last chapter.
These are not really suitable for stochastic models, otherwise we would have already covered them
in chapter 9 on special distributions. Rather, they are so-called test distributions, which are very
useful in estimation and testing. Another test distribution is the so-called F-distribution:

Definition: Given two independent chi-square distributed random variables

   χ²m   and   χ²n

with m and n degrees of freedom, respectively. Then the random variable

   F^m_n := (χ²m / m) / (χ²n / n)

is called F-distributed with m and n degrees of freedom.

Important: Swapping the numerator and denominator leads to different F distributions, F^m_n ≠ F^n_m. In fact,

   F^m_n = 1 / F^n_m .

347 - 1
For the quantiles of the F-distribution we get

   F^m_n;[α] = 1 / F^n_m;[1−α] .

[Figure: density functions of F-distributed random variables, shown for m = 20, n = 20 and for m = 6, n = 4]

347 - 2
Now, from two different independent samples of size n1 and n2, respectively, the variances s1² and s2²
have been calculated. The aim is to check whether both samples are taken from basic populations
with the same variance:

   H0 : σ1² = σ2² (= σ²)   vs.   H1 : σ1² ≠ σ2² .

We already know that the two random variables

   n1 S1²/σ² = χ²n1−1   and   n2 S2²/σ² = χ²n2−1

are chi-square distributed with n1 − 1 and n2 − 1 degrees of freedom. Thus, by definition,

   [ n1/(n1 − 1) · S1² ] / [ n2/(n2 − 1) · S2² ] = F^(n1−1)_(n2−1)

is F-distributed with n1 − 1 and n2 − 1 degrees of freedom. Of course, this only works if the charac-
teristic is normally distributed in the basic population.

Since the F-distribution is not symmetric, two critical values have to be used for the two-sided ques-
tion. They are placed in such a way that the risk of error α is divided equally between the two parts
of the rejection region.

347 - 3
Test procedure F test:
1. Formulate hypothesis: H0 : σ1² = σ2² vs. H1 : σ1² ≠ σ2²
2. Test statistic/test quantity: T = [ n1/(n1 − 1) · s1² ] / [ n2/(n2 − 1) · s2² ]
3. Critical values: Flower = F^(n1−1)_(n2−1);[α/2] and Fupper = F^(n1−1)_(n2−1);[1−α/2]
4. Test decision: If T < Flower or T > Fupper ⇒ reject H0

Example:

In a random sample of 21 newly issued AAA-rated corporate bonds, the maturity had a variance of
58.35 (years2 ). In contrast, in a random sample of 13 newly issued corporate bonds rated CCC, the
variance in maturity was only 4.69. Is this difference significant?

Assuming normal distribution in the basic population, we calculate at a significance level of 5 %:

347 - 4
Test procedure:

1. Formulate hypothesis (two-sided)

   H0 : σ1² = σ2²   vs.   H1 : σ1² ≠ σ2² .

2. Calculate the test quantity:

   T = [ 21/20 · 58.35 ] / [ 13/12 · 4.69 ] = 61.27/5.08 = 12.06

3. Critical values for α = 0.05 and 20 and 12 degrees of freedom:

   Flower = F^20_12;[0.025] = 1 / F^12_20;[0.975] = 1/2.676 = 0.374

   Fupper = F^20_12;[0.975] = 3.073

4. Test decision:
   12.06 > 3.073 ⇒ reject H0 !

347 - 5
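
A sketch of the F test above, not part of the script and assuming SciPy; scipy.stats.f.ppf returns the quantiles directly, including the lower one obtained via the reciprocal relation.

```python
# Sketch: two-sided F test for the bond-maturity example.
from scipy.stats import f

n1, s1_sq = 21, 58.35          # AAA bonds: sample size and variance (divisor n)
n2, s2_sq = 13, 4.69           # CCC bonds
T = (n1 / (n1 - 1) * s1_sq) / (n2 / (n2 - 1) * s2_sq)     # test quantity, approx. 12.06
F_lower = f.ppf(0.025, dfn=n1 - 1, dfd=n2 - 1)            # = 1 / f.ppf(0.975, 12, 20) = 0.374
F_upper = f.ppf(0.975, dfn=n1 - 1, dfd=n2 - 1)            # = 3.073
print(round(T, 2), round(F_lower, 3), round(F_upper, 3),
      "reject H0" if (T < F_lower or T > F_upper) else "retain H0")
```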
Read from the table of quantiles of the F distribution:

   F^20_12;[0.025] = 1 / F^12_20;[0.975] = 1/2.676 = 0.374

[Figure: excerpt of the F-distribution quantile table]

347 - 6
   F^20_12;[0.975] = 3.073

[Figure: excerpt of the F-distribution quantile table]

347 - 7
Regression analysis
We recall section 4.1: Linear Regression. The task there was to compute a regression line

y = a + bx

in the least squares sense from given pairs of data (xi , yi ), i = 1, . . . , n.

Question: How can we test the individual parameters of the regression for statistical significance?

The term statistical significance usually refers to the test of whether the parameter a or b or the
correlation coefficient rXY is significantly different from zero. The significance of these parameters
can be tested with a t-test.

The t-test is the most widely used test!

We will now briefly review what we learned in section 4.1 and then derive a test first for the correlation
coefficient rXY , then for the two parameters a and b.

348 - 1
Regression line: y = a + bx

   ŷi := y(xi) = a + b xi
   yi = a + b xi + ei ,   with „deviation“ ei := yi − ŷi

Correlation coefficient: rXY := cXY / (sX sY)

[Figure: point cloud and regression line in a scatter plot]

348 - 2
Let Ei be the random variable describing the error term in the i-th data point. In order to test the cor-
relation coefficient rXY or the parameters a and b for their statistical significance, a few assumptions
have to be made about the distribution of the error terms Ei .

Assumptions:
1. There are no systematic influences on Y other than X :

E(Ei ) = 0 for all i

2. All error terms originate from a distribution with the same standard deviation, so-called
homoscedasticity:
σ(Ei ) = const. for all i
3. The error terms are not correlated with each other:

Cov(Ei , Ej ) = 0 for all i ̸= j

4. The random variables Ei are normally distributed with N (0, σ 2 ) for all i.

348 - 3
t-test for the correlation coefficient

Question: When is the correlation coefficient rXY significantly different from zero? In other words:
When does the variable X show a significant (linear) correlation with the variable Y ?

Test procedure two-sided t-test for the correlation coefficient:
1. Formulate hypothesis: H0 : rXY = 0 vs. H1 : rXY ≠ 0
2. Test statistic/test quantity: T = rXY · √(n − 2) / √(1 − rXY²) is t-distributed with n − 2 degrees of freedom
3. Critical value: k = tn−2;[1−α/2]
4. Test decision: If |T| > k ⇒ reject H0

Example: In a random sample of 5 individuals, the following relationship is found between their annual
income and the amount they spend on their annual vacation:

individual 1 2 3 4 5

income (X ) 50000 30000 20000 80000 90000


Holiday exp. (Y ) 2500 1800 500 3500 5000

We test the significance of the correlation coefficient rXY at a significance level of 5 %.

348 - 4
Test procedure:

1. Formulate hypothesis (two-sided)

H0 : rXY = 0 vs. H1 : rXY ̸= 0

2. Calculate the test quantity:

   T = rXY · √(n − 2) / √(1 − rXY²) = 0.966 · √(5 − 2) / √(1 − 0.966²) = 6.485

3. critical value for α = 0.05 and 3 degrees of freedom:

k = tn−2;[1−α/2] = t3;[0.975] = 3.182

4. Test decision:
|6.485| > 3.182 ⇒ reject H0

Thus, we assume that the correlation coefficient rXY is significantly different from 0.
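
A sketch, not from the script and assuming NumPy and SciPy, that computes rXY and the test quantity for the income/holiday data:

```python
# Sketch: two-sided t-test for the correlation coefficient (income vs. holiday expenses).
import numpy as np
from scipy.stats import t

x = np.array([50000, 30000, 20000, 80000, 90000])   # annual income
y = np.array([2500, 1800, 500, 3500, 5000])         # holiday expenses
n = len(x)
r = np.corrcoef(x, y)[0, 1]                          # r_XY, approx. 0.966
T = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)           # test quantity, approx. 6.49
k = t.ppf(0.975, df=n - 2)                           # critical value t3;[0.975] = 3.182
print(round(r, 3), round(T, 3), "reject H0" if abs(T) > k else "retain H0")
```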

348 - 5
t-test for the regression parameters a and b

Question: When are the intercept a and the slope b significantly different from zero?

In economics, this question is of central importance. For the regression line

y = a + bx

one is mostly interested in whether the correlation of X and Y , which is estimated by a sample, is
statistically significant (or just random).

How are the parameters a and b distributed? To answer this question, the four assumptions from a
few pages earlier are necessary.

Representing the regression equation in matrix form makes it easier to derive the test statistics (and
to move to multivariate regressions, i.e., multiple X ). We start with two independent variables X1 and
X2 to determine the matrix form. An extension to k columns is easily done.

Starting from the linear model y = a + b1 x1 + b2 x2 , for each data point (x1i , x2i , yi ) we obtain the
equation
yi = a + b1 x1i + b2 x2i + ei .

348 - 6
For n observations we obtain

   y1 = a + b1 x11 + b2 x21 + e1
   y2 = a + b1 x12 + b2 x22 + e2
   ⋮
   yn = a + b1 x1n + b2 x2n + en

This can be represented in matrix notation as

   y = X · b + e

with

   y = (y1, y2, . . . , yn)ᵀ ,   b = (a, b1, b2)ᵀ ,   e = (e1, e2, . . . , en)ᵀ

and

        ⎛ 1  x11  x21 ⎞
        ⎜ 1  x12  x22 ⎟
   X =  ⎜ ⋮    ⋮    ⋮  ⎟ .
        ⎝ 1  x1n  x2n ⎠

348 - 7
In the univariate case (only one X ), we estimated the regression parameters using the least squares
method:
   Σ ei² = Σ (yi − (a + b xi))²  →  min over a, b     (sum over i = 1, . . . , n)

The same equation (for two or more X variables) in matrix form reads as

   eᵀ · e = (y − X · b)ᵀ · (y − X · b)  →  min over b

We find the minimum by taking the partial derivative with respect to b and setting the gradient equal
to zero,

   d(eᵀ · e)/db = d[(y − X · b)ᵀ · (y − X · b)]/db = −2 · Xᵀ · y + 2 · Xᵀ · X · b = 0 ,

and obtain the so-called normal equations,

   Xᵀ · X · b = Xᵀ · y ,

the solution of which is the minimum b.

348 - 8
Numerical example (to keep the calculation simple):

        ⎛ 3 ⎞        ⎛ 1  3  5 ⎞
        ⎜ 1 ⎟        ⎜ 1  1  4 ⎟
   y =  ⎜ 8 ⎟ ,  X = ⎜ 1  5  6 ⎟ .
        ⎜ 3 ⎟        ⎜ 1  2  4 ⎟
        ⎝ 5 ⎠        ⎝ 1  4  6 ⎠

With

             ⎛  5  15   25 ⎞                 ⎛  20 ⎞
   Xᵀ · X =  ⎜ 15  55   81 ⎟   and   Xᵀ · y = ⎜  76 ⎟
             ⎝ 25  81  129 ⎠                 ⎝ 109 ⎠

the normal equations are

   ⎛  5  15   25 ⎞         ⎛  20 ⎞
   ⎜ 15  55   81 ⎟ · b  =  ⎜  76 ⎟ .
   ⎝ 25  81  129 ⎠         ⎝ 109 ⎠

As a solution of the linear system we obtain (for example, by applying the Gaussian elimination method)

        ⎛ a  ⎞   ⎛  4   ⎞
   b =  ⎜ b1 ⎟ = ⎜  2.5 ⎟
        ⎝ b2 ⎠   ⎝ −1.5 ⎠

348 - 9
and hence the regression plane
y = 4 + 2.5x1 − 1.5x2 .

348 - 10
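
The normal equations of the numerical example can be set up and solved in a few lines. This is a sketch, not part of the script; it assumes NumPy.

```python
# Sketch: setting up and solving the normal equations X^T X b = X^T y for the numerical example.
import numpy as np

X = np.array([[1, 3, 5],
              [1, 1, 4],
              [1, 5, 6],
              [1, 2, 4],
              [1, 4, 6]], dtype=float)
y = np.array([3, 1, 8, 3, 5], dtype=float)

b = np.linalg.solve(X.T @ X, X.T @ y)   # solves the normal equations
print(b)                                # approx. [ 4.   2.5  -1.5]
```

In practice one would rather call np.linalg.lstsq(X, y, rcond=None), which solves the same least squares problem without forming Xᵀ · X explicitly.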
The solution of the normal equations

   Xᵀ · X · b = Xᵀ · y

can be reformulated with the help of the inverse matrix into

   b = (Xᵀ · X)⁻¹ · Xᵀ · y                                      (8)

We use β̂ = b as estimator for the model parameters and assume that the true relationship between
X and y is given by the linear model

   y = X · β + u

with the unknown parameters β and the so-called confounding variables or latent variables u.
Substituting this into (8) yields

   β̂ = b = (Xᵀ · X)⁻¹ · Xᵀ · (X · β + u)
         = (Xᵀ · X)⁻¹ · Xᵀ · X · β + (Xᵀ · X)⁻¹ · Xᵀ · u
         = β + (Xᵀ · X)⁻¹ · Xᵀ · u

348 - 11
What are the properties of the estimator β̂?

The estimator is unbiased, E(β̂) = β, since

   E(β̂) = E(β + (Xᵀ · X)⁻¹ · Xᵀ · u)
        = E(β) + E((Xᵀ · X)⁻¹ · Xᵀ · u)
        = β + (Xᵀ · X)⁻¹ · Xᵀ · E(u) = β        (since E(u) = 0)

Variance of the estimator: With

   β̂ − E(β̂) = (Xᵀ · X)⁻¹ · Xᵀ · u

we get

   V(β̂) = E[ (β̂ − E(β̂)) · (β̂ − E(β̂))ᵀ ]
        = E[ (Xᵀ · X)⁻¹ · Xᵀ · u · uᵀ · X · (Xᵀ · X)⁻¹ ]
        = (Xᵀ · X)⁻¹ · Xᵀ · σ² · I · X · (Xᵀ · X)⁻¹ = σ² · (Xᵀ · X)⁻¹ ,

348 - 12
where σ² is estimated without bias by

   σ̂² = eᵀe / (n − (k + 1)) .

Here k is the number of independent variables (X1, . . . , Xk). Hence, together with the
intercept a, there are k + 1 parameters to be estimated.

For our numerical example we have

                ⎛  5  15   25 ⎞⁻¹    ⎛ 26.7   4.5  −8   ⎞
   (Xᵀ · X)⁻¹ = ⎜ 15  55   81 ⎟   =  ⎜  4.5   1    −1.5 ⎟ .
                ⎝ 25  81  129 ⎠      ⎝ −8    −1.5   2.5 ⎠

With

   e = y − X · b = (−1, 0.5, 0.5, 0, 0)ᵀ

we estimate

   σ̂² = eᵀe / (n − (k + 1)) = 1.5 / (5 − 3) = 0.75

and finally obtain

                             ⎛ 20.025   3.375  −6     ⎞
   V(β̂) = σ̂² · (Xᵀ · X)⁻¹ =  ⎜  3.375   0.75   −1.125 ⎟ .
                             ⎝ −6      −1.125   1.875 ⎠

348 - 13
For the variances of the estimated parameters we need the diagonal of the matrix and get

   V(a) = 20.025 ,   V(b1) = 0.75 ,   V(b2) = 1.875 .

Test procedure two-sided t-test for regression parameters:
1. Formulate hypothesis: H0 : regression parameter = 0 vs. H1 : regression parameter ≠ 0
2. Test statistic/test quantity: T = (regression parameter − 0) / √(V(regression parameter))
   is t-distributed with n − (k + 1) degrees of freedom.
   Here, the estimated variance of all regression parameters is V(β̂) = σ̂² · (Xᵀ · X)⁻¹.
3. Critical value: k = tn−(k+1);[1−α/2]
4. Test decision: If |T| > k ⇒ reject H0

For our numerical example, we perform a two-sided test at a significance level of 5 % to see if the
regression parameters are significantly different from zero.

H0 : a = 0 vs. H1 : a ≠ 0
   test statistic: T = (4 − 0)/√20.025 = 0.89

348 - 14
H0 : b1 = 0 vs. H1 : b1 ≠ 0
   test statistic: T = (2.5 − 0)/√0.75 = 2.89

H0 : b2 = 0 vs. H1 : b2 ≠ 0
   test statistic: T = (−1.5 − 0)/√1.875 = −1.10

The critical value is k = t5−(2+1);[1−α/2] = t2;[0.975] = 4.303. Thus, none of the null hypotheses is
rejected, so the regression parameters are not significant.
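
The whole chain of the numerical example, residuals, σ̂², V(β̂), and the three test quantities, can be verified with the following sketch (not part of the script; it assumes NumPy and SciPy).

```python
# Sketch: t-tests for the regression parameters of the numerical example.
import numpy as np
from scipy.stats import t

X = np.array([[1, 3, 5], [1, 1, 4], [1, 5, 6], [1, 2, 4], [1, 4, 6]], dtype=float)
y = np.array([3, 1, 8, 3, 5], dtype=float)
n, kp1 = X.shape                                  # n observations, k+1 parameters

b = np.linalg.solve(X.T @ X, X.T @ y)             # [4, 2.5, -1.5]
e = y - X @ b                                     # residuals
sigma2_hat = (e @ e) / (n - kp1)                  # 0.75
V = sigma2_hat * np.linalg.inv(X.T @ X)           # estimated covariance matrix of b
T = b / np.sqrt(np.diag(V))                       # test quantities: approx. [0.89, 2.89, -1.10]
k_crit = t.ppf(0.975, df=n - kp1)                 # critical value t2;[0.975] = 4.303
print(np.round(T, 2), round(k_crit, 3), np.abs(T) > k_crit)
```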

Question: What happens to tn−(k +1);[0.975] as the sample size n becomes large?

From the table we find

   tn−(k+1);[0.975] ≈ 2   for n > 50 .

This provides a

5. Simplified test decision for large n:

If |T | > 2 ⇒ reject H0

Thus, when test statistics of regression parameters are greater than 2 or less than -2, they are
usually said to be significant. This would correspond to a two-sided test at a significance level of
approximately α = 0.05.

348 - 15
Let us consider again the income/holiday example:

In a random sample of 5 individuals, the following relationship is found between their annual income
and the amount they spend on their annual vacation:

individual 1 2 3 4 5

income (X ) 50000 30000 20000 80000 90000


Holiday exp. (Y ) 2500 1800 500 3500 5000

The regression line is


y = −255 + 0.054x

348 - 16
and the result of the regression analysis is given by the following output.

[Figure: regression output with the estimated coefficients and their t-values]

The t-values in the output correspond to the test statistics or test variables and are often noted in
parentheses under the regression parameters:

y = −255 + 0.054x
(−0.51) (6.49)

The critical value is t3;[0.975] = 3.182. Thus, the intercept is not significant, but the slope of the
regression line is.

Interpretation of the (significant) coefficients? Individuals with higher income spend more on
vacations (in total about 5.4 % of salary)
348 - 17
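
The same matrix recipe reproduces the regression line and the t-values quoted above. Again a sketch, not part of the script, assuming NumPy and SciPy.

```python
# Sketch: simple regression of holiday expenses on income with t-values for intercept and slope.
import numpy as np
from scipy.stats import t

income = np.array([50000, 30000, 20000, 80000, 90000], dtype=float)
holiday = np.array([2500, 1800, 500, 3500, 5000], dtype=float)

X = np.column_stack([np.ones_like(income), income])   # design matrix with intercept column
b = np.linalg.solve(X.T @ X, X.T @ holiday)            # approx. [-255, 0.054]
e = holiday - X @ b
sigma2_hat = (e @ e) / (len(holiday) - 2)
V = sigma2_hat * np.linalg.inv(X.T @ X)
t_values = b / np.sqrt(np.diag(V))                     # approx. [-0.51, 6.49]
print(np.round(b, 3), np.round(t_values, 2), round(t.ppf(0.975, df=3), 3))
```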
Discussion: In the last example, are the requirements for the t-test satisfied?

1. There are no systematic influences on Y other than X :

E(Ei ) = 0 for all i

2. All error terms originate from a distribution with the same standard deviation, so-called
homoscedasticity:
σ(Ei ) = const. for all i
3. The error terms are not correlated with each other:

Cov(Ei , Ej ) = 0 for all i ̸= j

4. The random variables Ei are normally distributed with N (0, σ 2 ) for all i.

348 - 18
Control questions

1. Can a statistical test be used to falsify a hypothesis?


2. Which two errors can occur in a hypothesis test, and how should they be interpreted?
3. Why do the two risks of error (type 1 and type 2) not add up to one?
4. What do you usually formulate hypotheses about in statistical testing?
5. Which random variable is the focus of testing?
6. Why is it possible to make statements about the test distribution even if we do not
know the distribution of characteristics in the basic population?
7. What is a „one-sided“ and what is a „two-sided test“?
8. Why do you divide the α error by two to determine the critical value in the two-sided
test?
9. In which situations do you use the t distribution?
10. What do you want to find out with the test for the difference of means?
11. When do you apply the chi-square and when do you apply the F distributions?

13. Statistical testing 349


Part IV – Appendix

Literature. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .351 - 1

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352 - 1

350
Literature

[Anderson et al., 2012] Anderson, D. R., Sweeney, D. J., Williams, T. A., Camm, J. D., and Cochran,
J. D. (2012).
Statistics for Business and Economics.
South-Western Cengage Learning, 12th edition.
[Bleymüller, 2012] Bleymüller, J. (2012).
Statistik für Wirtschaftswissenschaftler.
Vahlen, 16th edition.
[Lehn and Wegmann, 2000] Lehn, J. and Wegmann, H. (2000).
Einführung in die Statistik.
Vieweg+Teubner Verlag.
[Newbold et al., 2013] Newbold, P., Carlson, W. L., and Thorne, B. M. (2013).
Statistics for business and economics.
Pearson, 8th edition.
[Schira, 2016] Schira, J. (2016).
Statistische Methoden der VWL und BWL – Theorie und Praxis.
Pearson, 5th edition.

351 - 1
Index

approach of variation, 75
logarithmic linear, 131 coefficient of variation, 75
quadratic, 132 combination, 153
confidence
semi-logarithmic, 131
interval, 379, 381
arrangement, 153
level, 379
average, 38
consistency, 355, 359
bias, 354 contingency table, 82
binomial coefficient, 146, 148 continuity correction, 343
box plot, 77 correlation coefficient, 103
Central property, 39 B RAVAIS -P EARSON, 103
S PEARMAN, 108
characteristic value, 10
empirical, 103
class intervals, 27
rank, 108
class limits, 27
covariance, 99
class size, 27 empirical, 99
coefficient curtosis, 249
of determination, 125 data
352 - 1
paired, 81 marginal, 85
data set, 17 normal, 316
decile, 56 probability, 358
density standard normal, 254, 311

frequency, 29 uniform, 295, 298

deviation distribution function


empirical cumulative, 22, 24
mean absolute, 64
of the classes, 28
quartile, 65
division, 194
standard, 67
Drawing
dispersion, 63
with replacement, 349
distribution
without replacement, 349
S TUDENT-t, 389
efficiency, 361
Bernoulli, 300 error
binomial, 303 estimation, 353
chi-square, 387 mean squared, 361
conditional, 90 type 1, 412
F, 447 type 2, 412
joint, 82 estimate
352 - 2
point, 351 marginal, 82, 84
estimator, 358 relative, 19
event frequency distribution
certain, 165 absolute, 20

complementary, 167 relative, 20

disjoint, 167 function


density, 31, 70, 223
elementary, 161
distribution, 215, 220
impossible, 165
estimating, 358
mutually exclusive, 167
frequency, 22
random, 163
frequency density, 30
experiment
mass, 221
Laplace, 170, 171
probability, 177
random, 161 probability density, 223
factorial, 142, 143 probability mass, 221
five-point summary, 77 heat map, 83
frequency histogram, 30
absolute, 19 homoscedasticity, 456, 471
class, 27 hypothesis, 201, 409
352 - 3
alternative, 411 geometric, 40
initial, 411 harmonic, 45
null, 411 mean line, 121
identification criteria (IC), 8 mean value, 70

independence, 188 of a sum/difference, 97

stochastical, 188 measure, 34


of central tendency, 34, 35
independent
of location, 34, 35
statistically, 92
robust, 50
inference, 378
median, 51, 250
interquartile range, 65
method
law
of least squares, 115
of large numbers, 173
mode, 10, 36
layers, 27 moment, 245
level central, 245
confidence, 379 normal equations, 119
significance, 379, 414, 416 observation series, 17
mean outcome
arithmetic, 38 basic, 161
352 - 4
parameter objective, 175
dispersion, 231 statistical, 173
of location, 226 subjective, 176
partition, 194 quantile, 56, 253

percentile, 56 q-, 56, 59

permutation, 151 quartile, 55, 56


random sample space, 162
k -, 153
range, 64
plot
interquartile, 65
box, 77
semi-interquartile, 65
scatter, 81
regression
population, 9
line, 118
Principle of least squares, 74
slope, 122
probability values, 118
a posteriori, 201 regression analysis, 113
a priori, 201 sample, 14
conditional, 185, 186 large, 377
empirical, 175 ordered, 50
frequentist, 173 random, 14
352 - 5
sample space, 162 subpopulation, 14
scatter plot, 81 table
set contingency, 82
of events, 165 test

significance t-, 417, 426, 436

statistical, 454 G AUSS, 417, 422, 436


chi-square, 431, 438
significant, 414
F, 450
skewness, 248
one-sided, 422
space
procedure, 413
probability, 178
two-sample, 440
sample, 162
two-sided, 418
spread, 63
unbiased, 354, 358
standard deviation, 67, 231 asymptotically, 359
statistical unit, 8 unimodal, 36
statistics urn model, 349
Bayes, 201 value
multivariate, 80 central, 51
univariate, 80 expected, 226
352 - 6
mean, 38, 70 stochastical, 209
modal, 36 variance, 67, 70, 231
variable
decomposition, 123
continuous, 214
empirical, 67
discrete, 214
minimization, 121
random, 209
standardized, 238 of the sum/difference, 98
statistical, 11 variation, 63, 153

352 - 7
