2023 BSC Stochastik Skript (en)
I Descriptive Statistics
1 Statistical attributes and variables ... 5
2 Measures to describe statistical distributions ... 31
3 Two dimensional distributions ... 71
4 Linear regression ... 103
II Probability Theory
5 Combinatorics and counting principles ... 120
6 Fundamentals of probability theory ... 131
7 Random variables in one dimension ... 169
8 Multidimensional random variables ... 210
9 Stochastic models and special distributions ... 237
10 Limit theorems ... 262
III Inferential Statistics
11 Point estimators for parameters of a population ... 280
12 Interval estimators ... 289
13 Statistical testing ... 320
IV Appendix
Literature ... 351
Index ... 352
Statistics
Literature:
[Anderson et al., 2012] Anderson, D. R., Sweeney, D. J., Williams, T. A., Camm, J. D., and Cochran, J. D. (2012). Statistics for Business and Economics. South-Western Cengage Learning, 12th edition.
[Bleymüller, 2012] Bleymüller, J. (2012). Statistik für Wirtschaftswissenschaftler. Vahlen, 16th edition.
[Lehn and Wegmann, 2000] Lehn, J. and Wegmann, H. (2000). Einführung in die Statistik. Vieweg+Teubner Verlag.
[Newbold et al., 2013] Newbold, P., Carlson, W. L., and Thorne, B. M. (2013). Statistics for Business and Economics. Pearson, 8th edition.
[Schira, 2016] Schira, J. (2016). Statistische Methoden der VWL und BWL – Theorie und Praxis. Pearson, 5th edition.
Part I – Descriptive Statistics
1 Statistical attributes and variables
Definition:
Statistical units are the objects whose attributes are of interest in a given context and
are focussed on and observed, surveyed, or measured within the scope of empirical
investigation.
The identification of similar statistical units belonging to a statistical population is essentially given by objective and precise identification criteria (IC) relating to
1. time
2. space and
3. objectivity
Definition:
The set
Ω := {ω | ω fulfills (IC)}
of all statistical units ω that fulfill the well-defined identification criteria (IC) is called the population.
Synonyms: statistical mass, collective.
It is not the statistical units ω themselves that are of direct interest, but some of their attributes M(ω).
Distinguishable manifestations of a characteristic are called characteristic values or
modes.
Examples:
The characteristic gender has the possible values {male, female, diverse}.
The characteristic eye color has the possible values {blue, green, grey, brown}.
For the characteristic body weight of adult humans all values between 30 and
300 kg have to be allowed as possible values.
Definition:
The statistical variable assigns a real number x to a statistical unit ω or, more precisely, to its characteristic M(ω). Thus
x = X(ω) = f(M(ω)).
X(ω) is a real-valued function of the characteristic values M(ω) and thus of the statistical units:
X : Ω → R
ω ↦ X(ω) = x
Synonyms: characteristics or variables.
Definition:
Each proper subset Ω∗ of Ω is called a subpopulation or sample of the whole population.
Subpopulations are called random samples if chance played a significant role in the
selection of the elements.
To get more realistic results, the random sample is selected on a representative basis: consequently, other characteristics are taken into consideration that could have a statistical influence on party preference. The random sample needs to reflect the share of women in the population of all eligible voters. The age structure should also conform to that of the whole population.
This already makes the sample quite representative for this purpose. It would certainly
still be important to take the geographical distribution into account to avoid a situation
where too many respondents happen to live in Baden-Württemberg. Furthermore, it
would be good if the professional structure were at least analogous in the
characteristics workers, employees, civil servants, self-employed. Yes, and of course
students must be in the sample, otherwise Green voters might be underrepresented.
Definition:
The (finite) sequence of the n values
x1, x2, ..., xi, ..., xn
observed for the statistical variable X is called the (raw) data set.
If the order of the observations does not matter, it is often helpful to sort and renumber the variable values:
x1 ≤ x2 ≤ x3 ≤ ... ≤ xi ≤ ... ≤ xn
Example: n = 20 observations
Definition:
The absolute frequency
ni := absH(X = xi )
indicates how often the statistical variable X takes a certain value xi .
Definition:
The tables

x:  x1  x2  ...  xk          x:  x1  x2  ...  xk
n:  n1  n2  ...  nk    and   h:  h1  h2  ...  hk

with $\sum_{i=1}^{k} n_i = n$ and $\sum_{i=1}^{k} h_i = 1$
are called the absolute and relative frequency distribution of the statistical variable X, respectively.
Example:
[Figure: graph of a frequency distribution — absolute frequencies n_i (axis 2 to 10) and relative frequencies h_i (axis 0.1 to 0.5) plotted over the values x_i = 1.6, 3, 4.1, 5.]
Definition:
The function
$h(x) = \begin{cases} h_i, & \text{if } x = x_i \\ 0, & \text{otherwise} \end{cases}$
is called the (relative) frequency function of the statistical variable X.
The function
$H(x) = \sum_{x_i \le x} h(x_i)$
is called the empirical (cumulative) distribution function.
[Figure: frequency function and distribution function H(x) for x = 1, ..., 6; H(x) is a step function with cumulative values 0.1, 0.3, 0.7, 1.0.]
Frequency and distribution function
Properties of the empirical cumulative distribution function
H is right-continuous, $\lim_{\Delta x \to 0^+} H(x + \Delta x) = H(x)$, with
$\lim_{x \to -\infty} H(x) = 0, \qquad \lim_{x \to \infty} H(x) = 1$
1. The difference H(b) − H(a) = relH(a < X ≤ b) specifies the relative frequency of observed values of the variable X that are greater than a, but not greater than b.
2. The function value at a point x indicates the relative frequency with which values less than or equal to x occur in the data set:
H(x) = relH(X ≤ x)
3. At each point, the values of the frequency function are obtained from the empirical distribution function as the difference
$h(x) = H(x) - \lim_{\Delta x \to 0^+} H(x - \Delta x)$
Class size
Class frequency
Approximation by polygons
Frequency density of the classes
Frequency density function and histogram
[Figure: x-axis divided into size classes by the boundaries ξ0 < ξ1 < ξ2 < ... < ξm.]
$\Delta_i := \xi_i - \xi_{i-1}, \quad i = 1, \dots, m$  (class size)
$h_i := \text{relH}(\xi_{i-1} < X \le \xi_i), \quad i = 1, \dots, m$  (class frequency)
Note: [Schira, 2016] uses right-hand inclusion, while [Anderson et al., 2012] and [Newbold et al., 2013] use left-hand inclusion.
Definition:
By assigning the class frequencies to the upper limits of the classes (an alternative possibility would be to assign the class frequencies to the class centers), the following frequency table can be drawn from the values:

ξ:  ξ1  ξ2  ...  ξm
h:  h1  h2  ...  hm        with $\sum_{i=1}^{m} h_i = 1$
Exercise:
What does the distribution function for the classes shown on the left look like? (Choose an appropriate upper limit for the final class.)
By focussing on the upper limits (or any other single point) of the classes we lose
information about the distribution within the classes. The assumption of a uniform
distribution leads to the definition below.
Definition:
Let H_K(x) be the distribution function of a characteristic X obtained by size classes with upper class limits ξ1, ξ2, ..., ξm. Then the ratio
$\frac{H_K(\xi_i) - H_K(\xi_{i-1})}{\xi_i - \xi_{i-1}} = \frac{h_i}{\Delta_i}$
is called the (average) frequency density of the i-th size class (i = 1, ..., m).
Approximation of the distribution function Hk
by a polygonal line H̄ (x )
Taking the derivative of H̄(x) leads to the (average) frequency density function of the size classes:
$\bar{h}(x) := \frac{d\bar{H}(x)}{dx}$
Its graph is called a histogram.
The area of a column corresponds to the
relative class frequency.
The total area of the columns of the
histogram is one.
3. Why are mainly representative random samples taken into account in practice?
4. What are the properties of the step function? What is its information content?
7. What is the difference between a bar chart (look up definition if necessary) and a
histogram? Under what condition do they both look the same?
Especially for statistical data sets with many different characteristic values, one would like to describe the entire distribution of the characteristic with the help of a few numbers. Such numbers are called measures or parameters of a distribution.
We distinguish between
measures of location
measures of dispersion
Definition:
A measure of central tendency or measure of location is a parameter used to describe the distribution of a variable by providing a „typical“ value. In particular, it describes the location of the data set, i.e. where or in which order of magnitude the values of the variable are located.
Definition:
A number xMod with
h(xMod ) ≥ h(xi ) for all i
is called mode or modal value of an empirical data set.
Examples:
The data set
2 3 3 4 4 4 5 6
has the unique mode x_Mod = 4, whereas the data set
1 2 3 3 3 4 5 6 6 6 7
has the two modes 3 and 6 (it is bimodal).
Definition:
The value
$\bar{x} := \frac{1}{n}\sum_{j=1}^{n} x_j$
is called the arithmetic mean of the data set.
1. Central property:
$\sum_{j=1}^{n} (x_j - \bar{x}) = 0$
2. Shifting all values of a data set X by the constant value a shifts the arithmetic
mean by exactly this value:
yi := xi + a ⇒ ȳ = x̄ + a
3. Multiplication of all values of a data set X with the constant factor b multiplies the
arithmetic mean by exactly this value:
zi := b · xi ⇒ z̄ = b · x̄
Definition:
The geometric mean of a data set with positive values is
$G_X := \sqrt[n]{x_1 \cdot x_2 \cdots x_n}, \qquad x_i > 0$
For the geometric mean, the individual characteristic values are multiplied and the n-th root is taken of the product. It is only defined if all values of the data set X are positive.
The logarithm of the geometric mean corresponds to the arithmetic mean of the logarithms (important for the calculation of the overall return on an investment):
$\log G_X = \frac{1}{n}\sum_{i=1}^{n} \log x_i$
Example:
For the data set X with the values
2 6 12 9
the geometric mean is $G_X = \sqrt[4]{2 \cdot 6 \cdot 12 \cdot 9} = \sqrt[4]{1296} = 6$, while the arithmetic mean is x̄ = 29/4 = 7.25.
Note: The geometric mean of a data set with only positive values is always smaller than the arithmetic mean, unless all the values in the data set are the same.
Question: Which mean is best suited for the calculation of average growth?
Arithmetic mean of the growth factors:
$\overline{1+r} = \frac{1}{4}(1.20 + 0.85 + 1.40 + 1.25) = 1.175$
Geometric mean:
$G_{1+r} = \sqrt[4]{1.20 \cdot 0.85 \cdot 1.40 \cdot 1.25} = 1.1559$
Second example — an investment doubles in the first year and halves in the second:
Arithmetic mean:
$\overline{1+r} = \frac{1}{2}\bigl((1 + 100\,\%) + (1 - 50\,\%)\bigr) = 1 + 25\,\% \Rightarrow 25\,\%$
Geometric mean:
$G_{1+r} = \sqrt{2 \cdot 0.5} = 1 \Rightarrow 0\,\%$
Only the geometric mean reproduces the actual overall growth.
Definition:
From the values x_i > 0 of a data set, one can calculate the reciprocal values 1/x_i and then the arithmetic mean of these values:
$\frac{1}{n}\left(\frac{1}{x_1} + \cdots + \frac{1}{x_n}\right)$
Taking the reciprocal of the result again yields the so-called harmonic mean
$H_X := \frac{n}{\sum_{j=1}^{n} \frac{1}{x_j}}$.
Example:
For the data set X with the values
2 6 12 9
the harmonic mean is $H_X = \frac{4}{\frac{1}{2} + \frac{1}{6} + \frac{1}{12} + \frac{1}{9}} = \frac{144}{31} \approx 4.645$.
Note: For every data set with (different) positive values, it can be shown that
$H_X < G_X < \bar{x}$
Two trucks travel at speeds of v1 = 60 km/h and v2 = 80 km/h on the highway. Thus the average speed (arithmetic mean) is
$\bar{v} = \frac{1}{2}\left(60\,\tfrac{\text{km}}{\text{h}} + 80\,\tfrac{\text{km}}{\text{h}}\right) = 70\,\tfrac{\text{km}}{\text{h}}$.
To estimate the (average) transport times t̄, and thus transport capacities and transport costs, for a distance of, say, Hamburg to Duisburg, one would divide the corresponding distance d = 420 km by this value and obtain with
$\frac{d}{\bar{v}} = \frac{420\text{ km}}{70\text{ km/h}} = 6\text{ h}$
a wrong value. Indeed, the transport times of the two trucks are t1 = 7 h and t2 = 5.25 h. Thus the average transport time is
$\bar{t} = 6.125\text{ h}$.
If, on the contrary, one divides the distance by the harmonic mean
$H_V = \frac{2}{\frac{1}{60\text{ km/h}} + \frac{1}{80\text{ km/h}}} = \frac{480}{7}\,\tfrac{\text{km}}{\text{h}} \approx 68.57\,\tfrac{\text{km}}{\text{h}}$,
then one receives with
$\frac{d}{H_V} = \frac{420\text{ km}}{\frac{480}{7}\text{ km/h}} = 6.125\text{ h} = \bar{t}$
the correct value.
In this example we want to calculate an average transport time for a fixed distance d. The problem with the average speed is that it is not valid over the whole time, because the first truck arrives already after 5.25 h and then stops while the other one is still moving. For the calculation of the mean transportation time, the speeds appear in the denominator due to the principle $t_i = \frac{d}{v_i}$. This leads to the harmonic mean:
$\bar{t} = \frac{1}{n}(t_1 + \cdots + t_n) = \frac{1}{n}\left(\frac{d}{v_1} + \cdots + \frac{d}{v_n}\right) = d \cdot \frac{1}{n}\left(\frac{1}{v_1} + \cdots + \frac{1}{v_n}\right) = d \cdot \frac{1}{H_V} = \frac{d}{H_V}$.
If, in contrast, we want to know how far the trucks have come on (arithmetic) average after a certain time t, the calculation is
$\bar{d} = \frac{1}{n}(d_1 + \cdots + d_n) = \frac{1}{n}(t \cdot v_1 + \cdots + t \cdot v_n) = t \cdot \frac{1}{n}(v_1 + \cdots + v_n) = t \cdot \bar{v}$
Robust measures
The measures presented so far are quite sensitive to outliers. This means that strong
deviations of individual values significantly influence these measures. This is not the
case with so-called robust measures.
Definition:
Starting with the raw data
x = (x1 , x2 , . . . , xn )
of a data set of size n, the characteristic values xi are arranged in ascending order:
Annotation:
In the following, the parentheses in the index are omitted for ordered samples. It should always be
clear from the context whether a data set is ordered or not.
x1 ≤ x2 ≤ x3 ≤ · · · ≤ xi ≤ · · · ≤ xn .
Definition:
A number x_Med with
relH(X ≤ x_Med) ≥ 50 %  and  relH(X ≥ x_Med) ≥ 50 %
is called a median of the data set.
Example: For the ordered data set (n = 12)
4 7 7 7 12 12 13 16 19 23 23 97
both conditions hold for x_Med = ½(x_6 + x_7) = ½(12 + 13) = 12.5.
Strictly speaking, in this example 12 or 13 or 12.2 would also be medians, since they divide the data in the middle as well.
In practice, the arithmetic mean and the median are the most important characteristic measures of
location for a given distribution. Colloquially, however, a distinction is not always made between the
two measures, especially when it comes to income or wealth distributions.
Think about when and why the average income differs from the median income.
Quartiles
In addition to the median, two other values can be defined that further divide the ordered
statistical data set:
Definition:
The characteristic values of the data set are arranged in ascending order
x1 ≤ x2 ≤ · · · ≤ xn
and divided into four segments with (as far as possible) the same number of values.
The three values
Q1 ≤ Q2 = x_Med ≤ Q3
are called quartiles and are defined in such a way that they lie in between the four segments, just as the median x_Med does.
Consequently about 50 % of the observations are found between Q1 and Q3 .
Definition:
A number x_[q] with 0 < q < 1 is called a q-quantile if it splits the data set X such that at least 100·q % of its observed values are less than or equal to x_[q] and at the same time at least 100·(1 − q) % are greater than or equal to x_[q], that is:
relH(X ≤ x_[q]) ≥ q  and  relH(X ≥ x_[q]) ≥ 1 − q.
Special quantiles:
Quartiles: Q1 = x_[0.25], Q2 = x_Med = x_[0.5], Q3 = x_[0.75]
Graphically, a q-quantile can be read off the distribution function as
$x_{[q]} = H^{-1}(q)$
This also works for step function shaped distribution functions if one directly hits a jump.
However, if one lands on a stairstep, the inverse function is not uniquely determined.
Then, in fact, every value between the adjacent jumps is a q-quantile:
xi ≤ x[q ] ≤ xi +1 .
To obtain a unique value, one then usually takes the arithmetic mean of both jump points:
$x_{[q]} = \frac{1}{2}(x_i + x_{i+1})$
For a data set, the q-quantile can also be determined without the detour via the graph of the distribution function:
$x_{[0.80]} = \frac{1}{2}(x_{16} + x_{17}) = \frac{1}{2}(46\,656 + 51\,854) = 49\,255$
[Figure: Turnover of 20 industrial companies (ordered bar chart, values up to 150 000) and the 80 % quantile.]
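A small Python sketch of this rule: for n ordered values and a probability q, average the two neighbouring order statistics whenever n·q lands exactly on an observation index, otherwise round up. Only x16 = 46 656 and x17 = 51 854 are taken from the script; the remaining turnover values below are made up for illustration.

```python
import math

def empirical_quantile(data, q):
    """q-quantile of a data set: average the two neighbouring
    order statistics if n*q is an integer, else round up."""
    xs = sorted(data)
    n = len(xs)
    k = n * q
    if k.is_integer():
        k = int(k)
        return 0.5 * (xs[k - 1] + xs[k])   # 1-based indices k and k+1
    return xs[math.ceil(k) - 1]

# 20 turnovers; only x16 and x17 are from the script, the rest is hypothetical.
turnover = [8000, 9500, 11000, 14000, 17000, 20000, 23000, 26000,
            29000, 32000, 35000, 38000, 40000, 42000, 44000, 46656,
            51854, 70000, 95000, 150000]
print(empirical_quantile(turnover, 0.80))   # 0.5*(46656+51854) = 49255.0
```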
Definition:
The range is the difference between the largest and the smallest value in a data set:
R := x_max − x_min
Definition:
The so-called mean absolute deviation
$\text{MAD} := \frac{1}{n}\sum_{j=1}^{n} |x_j - \bar{x}|$
is calculated as the arithmetic mean of the absolute values of the deviations of the characteristic values from their mean.
We recall the median and the quartiles Q1 ≤ Q2 = xMed ≤ Q3 , that divide the ordered
data set into four approximately equally sized parts.
Definition:
The difference
IQR := Q3 − Q1
is known as the interquartile range.
Definition:
The arithmetic mean of the deviations of the quartiles from the median is called the quartile deviation or semi-interquartile range:
$\frac{1}{2}\bigl((Q_3 - x_{Med}) + (x_{Med} - Q_1)\bigr) = \frac{Q_3 - Q_1}{2}$
Example:
For a data set with n = 14 values we are looking for the quartile deviation.
As median we take the arithmetic mean of both neighbours and obtain Q2 = xMed = 26.8.
Definition:
The average quadratic deviation from the arithmetic mean
$s_X^2 := \frac{1}{n}\sum_{j=1}^{n} (x_j - \bar{x})^2$
is called the (empirical) variance of the data set.
Variance calculation using relative frequencies: $s_X^2 = \sum_{j=1}^{k} h_j\,(x_j - \bar{x})^2$
The following distribution is given:

x_i:  4    5    6
h_i:  1/4  1/2  1/4
Example: For the n = 20 values
3 5 9 9 6 6 3 7 7 6 7 6 5 7 6 9 6 5 3 5
Arithmetic mean: 6
Variance: 3.1
Standard deviation: ≈ 1.761
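These three numbers can be reproduced directly; a minimal sketch (note that the script uses the population variance with divisor n, i.e. `statistics.pvariance`, not the sample variance with divisor n − 1):

```python
import statistics

data = [3, 5, 9, 9, 6, 6, 3, 7, 7, 6, 7, 6, 5, 7, 6, 9, 6, 5, 3, 5]

mean = statistics.mean(data)        # 6
var = statistics.pvariance(data)    # (1/n) * sum((x - mean)**2) = 3.1
std = statistics.pstdev(data)       # sqrt(3.1) ≈ 1.7607

print(mean, var, std)
```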
For classified data with frequency density function h(x), mean and variance are calculated as integrals:
$\bar{x} = \int_{-\infty}^{\infty} x\, h(x)\, dx \qquad s_X^2 = \int_{-\infty}^{\infty} (x - \bar{x})^2\, h(x)\, dx$
Example: For the frequency density $h(x) = \frac{3}{2}\left(x - \frac{1}{2}x^2\right)$ on $[0, 2]$ with mean $\bar{x} = 1$:
$s_X^2 = \int_0^2 (x - \bar{x})^2 h(x)\, dx = \int_0^2 (x-1)^2\, \frac{3}{2}\left(x - \frac{1}{2}x^2\right) dx = \frac{3}{2}\int_0^2 \left(x - \frac{5}{2}x^2 + 2x^3 - \frac{1}{2}x^4\right) dx$
$= \frac{3}{2}\left[\frac{1}{2}x^2 - \frac{5}{6}x^3 + \frac{1}{2}x^4 - \frac{1}{10}x^5\right]_0^2 = 3 - 10 + 12 - \frac{24}{5} = \frac{1}{5}$
and the standard deviation as its root: $s_X = \sqrt{1/5} \approx 0.4472$.
2. Shifting all values of a data set X by the constant value a leaves the variance unchanged:
$y_i := x_i + a \;\Rightarrow\; s_Y^2 = s_X^2$
3. Multiplication of all values of a data set X by a constant factor b multiplies the variance by the square of this value:
$z_i := b \cdot x_i \;\Rightarrow\; s_Z^2 = b^2\, s_X^2$
Note: $s_Z = |b|\, s_X$
4. In the special case of d = 0, the following formula for the simplified calculation of the variance is obtained:
$s_X^2 = \frac{1}{n}\sum_{j=1}^{n} x_j^2 - \bar{x}^2 = \overline{x^2} - \bar{x}^2 \qquad (2)$
Exercise : Use formula (2) to recalculate the variances of the preceding examples.
This means that the average quadratic deviation from the arithmetic mean x̄ is always smaller than the average quadratic deviation from any other value d (minimum property). Multiplying the equation by n, we get for the sum of the squared deviations from any d ∈ R:
$\text{SSE}(d) := \sum_{j=1}^{n} (x_j - d)^2 \;\ge\; \sum_{j=1}^{n} (x_j - \bar{x})^2$.
That is, SSE becomes minimal at x̄. This provides us with an alternative definition of the mean:
Definition:
The arithmetic mean x̄ is the value d ∈ R that solves
$\text{SSE}(d) \longrightarrow \min_d$
Variance and standard deviation
Definition:
The quotient of the standard deviation and the absolute value of the mean of a data set with x̄ ≠ 0,
$CV_X := \frac{s_X}{|\bar{x}|}$,
is called the coefficient of variation.
Example:
Over a period of 250 trading days, the Volkswagen share price had a mean value of 174.56 € and a standard deviation of 10.28 €. For the same period, a standard deviation of 4.68 € with a mean value of 36.96 € is determined for the BMW AG share. The two coefficients of variation as a measure of the volatility of the share prices are as follows:
$CV_X = \frac{10.28\,€}{174.56\,€} = 0.0589$ for VW and
$CV_Y = \frac{4.68\,€}{36.96\,€} = 0.1266$ for BMW
Thus, despite a lower absolute standard deviation, BMW stock is more volatile in relative
terms.
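A quick check of the two coefficients of variation in Python (a trivial sketch; the means and standard deviations are the ones from the example above):

```python
def coefficient_of_variation(std: float, mean: float) -> float:
    """CV = s_X / |mean| — relative, unit-free dispersion."""
    return std / abs(mean)

cv_vw = coefficient_of_variation(10.28, 174.56)   # ≈ 0.0589
cv_bmw = coefficient_of_variation(4.68, 36.96)    # ≈ 0.1266
print(cv_vw, cv_bmw)   # BMW is more volatile in relative terms
```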
The distribution of a data set can be analyzed quite well with only a few values. In
practice, one often uses the so-called five-point summary.
It divides the data set into four parts, so that each part contains about a quarter of the
observed values. It contains the median as a measure of location and the range and
interquartile range IQR as measures of variation.
Definition:
The graphical representation of the five-point summary is called a box plot.
[Figure: box plot ranging from x_min to x_max, with the box from Q1 to Q3 and the median inside.]
2. When is the arithmetic and when is the geometric mean used and why?
3. How does the variance change if all values of a data set are converted from DM to
euro?
4. Describe the translation theorem as a property of the variance! What feature results
from the special case d = 0?
5. What is the minimum property of the variance? What does the principle of least
squares mean in this context?
6. What does the coefficient of variation mean? Which measure from portfolio theory in
business administration comes to your mind?
Multi-dimensional statistics:
Each statistical unit ω_i of a population Ω can have a variety of characteristics.
Definition:
Univariate statistics takes only one characteristic or variable into account.
Multivariate statistics considers several variables for each unit ω_i.
Example:
For a person ω_i we measure the duration of education X1(ω_i) and the income X2(ω_i) five years after the end of education.
The simplest case: two variables X(ω_i) and Y(ω_i) are of interest. The result is paired data (x_i, y_i) for each ω_i. These can be represented as points in a scatter plot.
Definition: The contingency table represents the joint distribution of the statistical variables X and Y in a concise way.

        y1    y2    ...   yj    ...   yl   | Σ
x1      n11   n12   ...   n1j   ...   n1l  | n1•
x2      n21   n22   ...   n2j   ...   n2l  | n2•
...                                        | ...
xi      ni1   ni2   ...   nij   ...   nil  | ni•   (row sums)
...                                        | ...
xk      nk1   nk2   ...   nkj   ...   nkl  | nk•
Σ       n•1   n•2   ...   n•j   ...   n•l  | n     (column sums; n = total sum)

Here
n_ij = absH(X = x_i ∩ Y = y_j) is the absolute frequency with which the combination (x_i, y_j) was observed, and
$n_{i\bullet} = \sum_{j=1}^{l} n_{ij}$ and $n_{\bullet j} = \sum_{i=1}^{k} n_{ij}$ are the absolute frequencies with which x_i or y_j was observed ⇒ marginal frequencies.
A representation with relative frequencies is also common. For this purpose, the absolute frequencies, including the marginal frequencies, are divided by n.

        y1    y2    ...   yj    ...   yl   | Σ
x1      h11   h12   ...   h1j   ...   h1l  | h1•
x2      h21   h22   ...   h2j   ...   h2l  | h2•
...                                        | ...
xi      hi1   hi2   ...   hij   ...   hil  | hi•   (row sums)
...                                        | ...
xk      hk1   hk2   ...   hkj   ...   hkl  | hk•
Σ       h•1   h•2   ...   h•j   ...   h•l  | 1     (column sums; total sum 1)
Definition:
The one-dimensional distributions
$h_{i\bullet} = \text{relH}(X = x_i) = \frac{n_{i\bullet}}{n}, \quad i = 1, \dots, k$
and
$h_{\bullet j} = \text{relH}(Y = y_j) = \frac{n_{\bullet j}}{n}, \quad j = 1, \dots, l$
are called the marginal distributions of X and Y, respectively.
Observed data: (30, 1), (30, 2), (60, 4), (30, 2), (60, 1), (30, 4), . . . , (60, 2).
Sort and count: 24 × (30, 1), 24 × (30, 2), 32 × (30, 4), . . . , 68 × (60, 4)
Contingency table (absolute frequencies):

            Y
        1    2    4   | Σ
X  30   24   24   32  | 80
   60   16   36   68  | 120
Σ       40   60  100  | 200

Relative frequencies, with the marginal distribution of X in the last column and the marginal distribution of Y in the last row:

        1     2     4    | Σ
30      0.12  0.12  0.16 | 0.4
60      0.08  0.18  0.34 | 0.6
Σ       0.20  0.30  0.50 | 1
We now consider the distribution of X, given that (conditional on) Y has a fixed value y_j.
Definition:
Normalizing the columns of the contingency table to a column sum of 1 leads to a total of l one-dimensional distributions for j = 1, ..., l. These are called conditional distributions of X (conditional on Y = y_j):
$h_{i|Y=y_j} = \text{relH}(X = x_i \mid Y = y_j) = \frac{h_{ij}}{h_{\bullet j}}$.
Similarly, normalizing the rows to a row sum of 1 for i = 1, ..., k leads to the conditional distributions of Y (conditional on X = x_i):
$h_{j|X=x_i} = \text{relH}(Y = y_j \mid X = x_i) = \frac{h_{ij}}{h_{i\bullet}}$.
Example:
For the joint distribution of the previous numerical example, there are three conditional distributions of X next to the marginal distribution of X:

X    h_{i|Y=1}  h_{i|Y=2}  h_{i|Y=4}  h_{i•}
30   0.60       0.40       0.32       0.4
60   0.40       0.60       0.68       0.6
Σ    1          1          1          1

... and two conditional distributions of Y next to the marginal distribution of Y:

Y            1      2      4     | Σ
h_{j|X=30}   0.300  0.300  0.400 | 1
h_{j|X=60}   0.133  0.300  0.567 | 1
h_{•j}       0.2    0.3    0.5   | 1
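The marginal and conditional distributions of this example follow mechanically from the table of counts; a minimal sketch without external libraries:

```python
# Joint absolute frequencies n_ij from the example: rows X = 30, 60; cols Y = 1, 2, 4
counts = {30: {1: 24, 2: 24, 4: 32},
          60: {1: 16, 2: 36, 4: 68}}

n = sum(sum(row.values()) for row in counts.values())          # 200
row_sum = {x: sum(row.values()) for x, row in counts.items()}  # n_i. = 80, 120
col_sum = {y: sum(counts[x][y] for x in counts) for y in (1, 2, 4)}  # 40, 60, 100

marginal_X = {x: row_sum[x] / n for x in counts}               # 0.4, 0.6
marginal_Y = {y: col_sum[y] / n for y in col_sum}              # 0.2, 0.3, 0.5

# Conditional distribution of X given Y = y_j: normalize each column to 1
cond_X_given_Y = {y: {x: counts[x][y] / col_sum[y] for x in counts} for y in col_sum}
print(cond_X_given_Y[4])   # {30: 0.32, 60: 0.68}
```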
Definition:
If the joint distribution h_ij of the statistical variables X and Y is equal to the product of the two marginal distributions,
$h_{ij} = h_{i\bullet} \cdot h_{\bullet j}$
for i = 1, ..., k and j = 1, ..., l, then X and Y are called statistically independent.
In this case all conditional distributions coincide with the corresponding marginal distributions:
$h_{i|Y=y_j} = h_{i\bullet}, \quad i = 1, \dots, k \qquad h_{j|X=x_i} = h_{\bullet j}, \quad j = 1, \dots, l$
[Tables of a worked example: returns classified into ≤−2 %, (−2 %,−1 %], (−1 %,0 %], (0 %,1 %], (1 %,2 %], >2 %; shown are the absolute and relative frequencies, the conditional distributions, the marginal distribution 0.0702 / 0.9298 of the second characteristic (AD / ND), and the marginal distribution 0.0238, 0.0782, 0.3539, 0.4368, 0.0849, 0.0225 of the return classes.]
The mean value of a sum (difference) is equal to the sum (difference) of the mean values:
$\overline{x+y} = \bar{x} + \bar{y} \qquad \overline{x-y} = \bar{x} - \bar{y}$
This is true regardless of the joint distribution, and equally true for statistically independent as well as statistically dependent variables.
Special case:
$s_{X \pm Y}^2 = s_X^2 + s_Y^2, \quad \text{if } c_{XY} := \frac{1}{n}\sum_{j=1}^{n}(x_j - \bar{x})(y_j - \bar{y}) = 0$
Definition:
The quantity calculated from the n pairs of values (x_i, y_i)
$c_{XY} := \frac{1}{n}\sum_{j=1}^{n}(x_j - \bar{x})(y_j - \bar{y})$
is called the empirical covariance or, in short, the covariance between the statistical variables X and Y.
Simplified calculation:
$c_{XY} = \frac{1}{n}\sum_{j=1}^{n} x_j\, y_j - \bar{x}\,\bar{y} = \overline{x\,y} - \bar{x}\,\bar{y}$
The covariance can also be calculated using the relative frequencies from the contingency table:
$c_{XY} = \sum_{i=1}^{k}\sum_{j=1}^{l} h_{ij}\,(x_i - \bar{x})(y_j - \bar{y})$
Simplified calculation:
$c_{XY} = \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{l} h_{ij}\, x_i\, y_j}_{\overline{x\,y}} - \bar{x}\,\bar{y}$
Proposition :
If two variables X and Y are statistically independent, then the covariance cXY between
them is zero.
This proposition is not reversible because the covariance measures only the linear part of
the statistical dependence.
Definition:
The ratio
$r_{XY} := \frac{c_{XY}}{s_X \cdot s_Y}$
is called the (empirical) correlation coefficient between X and Y.
Properties:
Consider the linearly transformed variables
$U := a_1 + b_1 X \quad \text{and} \quad V := a_2 + b_2 Y \quad \text{with } b_1, b_2 \ne 0$.
Then we obtain
$r_{UV} = \frac{c_{UV}}{s_U \cdot s_V} = \frac{b_1 \cdot b_2 \cdot c_{XY}}{|b_1|\, s_X \cdot |b_2|\, s_Y} = \frac{b_1 \cdot b_2}{|b_1| \cdot |b_2|}\, r_{XY}$.
This means that |r_UV| = |r_XY|.
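This invariance is easy to check numerically; the sketch below uses small made-up data and a plain-Python implementation of r_XY (dividing by n throughout, as in the definitions above):

```python
import math

def corr(xs, ys):
    """Empirical correlation r_XY = c_XY / (s_X * s_Y), with 1/n throughout."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

x = [1, 2, 4, 5, 7]              # hypothetical data
y = [2, 1, 5, 4, 8]
u = [3 + 2 * xi for xi in x]     # U = a1 + b1*X with b1 = 2 > 0
v = [1 - 4 * yi for yi in y]     # V = a2 + b2*Y with b2 = -4 < 0

print(corr(x, y))   # some value r
print(corr(u, v))   # -r: same magnitude, sign flipped because b1*b2 < 0
```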
Example: With $c_{XY} = 3.6$, $s_X^2 = 216$ and $s_Y^2 = 1.56$ we obtain
$r_{XY} = \frac{3.6}{\sqrt{216} \cdot \sqrt{1.56}} = 0.1961$,
which indicates a weak positive correlation.
[Scatter plot examples: goals against, age of trainer, body weight.]
Note: The covariance or the correlation coefficient does not necessarily indicate a causal relationship between the characteristics. The available observations merely show a statistical tendency, which could also be purely coincidental.
[Diagram: father's height and son's height — correlation, but causality?]
Besides the correlation coefficient according to Bravais-Pearson there is another one, namely the one according to Spearman, also called the rank correlation coefficient.
Definition:
The rank correlation coefficient, named after Charles Edward Spearman,
$r^{Sp}_{XY} := r_{rg(X),\,rg(Y)}$
is the correlation coefficient between the ranks of the observations.
The following table shows the results of the Abitur examinations of ten students in the subjects German (feature G) and History (feature H). The maximum achievable score is 15 in each case.

Student  G   H   Rank G    Rank H
1        13  15  4         1
2        14  8   2.5 (2)   4 (3)
3        8   1   9         10
4        10  7   7         6.5 (6)
5        15  9   1         2
6        1   5   10        9
7        14  8   2.5 (3)   4 (4)
8        12  7   5         6.5 (7)
9        9   6   8         8
10       11  8   6         4 (5)
Question: Are the grades correlated? Does good performance in German go along
with good knowledge of history?
First, we determine the rankings for each student in each of the two subjects. To do this,
we arrange the students according to the results they obtained in the subjects. Students
with the same result are assigned the arithmetic mean of those rankings they would have
received if they had been arranged randomly (given in parentheses in each case). This
may result in rankings like 2.5 or 6.5.
Then we compute the variances, standard deviations, and the covariance of the ranks and obtain with
$r^{Sp}_{GH} = \frac{6.95}{2.8636 \cdot 2.8284} = 0.8581$
a fairly strong positive correlation, which was to be expected.
(Compare: r_GH = 0.549 for the raw scores.)
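The rank correlation, including the averaged ranks for ties, can be reproduced with `scipy.stats.spearmanr`, which ranks both samples (ties get average ranks) and computes the Pearson correlation of the ranks — exactly the procedure in the table above:

```python
from scipy.stats import spearmanr

german  = [13, 14, 8, 10, 15, 1, 14, 12, 9, 11]
history = [15, 8, 1, 7, 9, 5, 8, 7, 6, 8]

rho, p_value = spearmanr(german, history)
print(round(rho, 4))   # ≈ 0.8581, as in the script
```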
1. What is the difference between univariate and multivariate statistics? Think about an
example of bivariate statistics.
2. What is the structure and function of contingency tables? Are there also contingency
tables for more than two characteristics?
4. When is the variance of a sum smaller than the sum of the variances?
We look for a straight line
y = a + bx
in the scatter plot. Here we distinguish between an (in the mathematical sense) independent variable X and a dependent variable Y.
This straight line is supposed to be a mean straight line, that is, it is supposed to pass through the observed characteristic values (x_i, y_i) in such a way that it indicates the location and main direction of the point cloud in the scatter plot.
[Figure: scatter plot with the line y = a + bx and the vertical „deviation“ e_i of a point from the line.]
The method of least squares (LSM) uniquely assigns a mean straight line to the scatter plot.
Idea: We want to explain the observed values of Y as well as possible by the values of X:
y_i = a + bx_i + e_i
with the vertical deviation
e_i := y_i − ŷ_i
Question: What does „as well as possible“ mean? How do we determine the regression line?
It would be possible to minimize the deviations e_i, the sum of the deviations $\sum_{i=1}^{n} e_i$, or even the sum of the absolute values of the deviations $\sum_{i=1}^{n} |e_i|$. Unfortunately, none of these approaches leads to a useful calculation rule or to a unique determination of the straight line.
Instead, the LSM minimizes the sum of the squared deviations
$\text{SSE}(a, b) := \sum_{j=1}^{n} e_j^2 = \sum_{j=1}^{n} (y_j - a - bx_j)^2 \longrightarrow \min_{a,b}$
This means that the straight line is placed within the point cloud in such a way that SSE reaches the smallest possible value for the corresponding parameters a and b.
Definition:
Let (x1, y1), (x2, y2), ..., (xn, yn) be observed pairs of values of a two-dimensional statistical variable (X, Y) and let s_X > 0. The straight line
y(x) = a + bx
whose parameters solve SSE(a, b) → min is called the regression line.
The algebraic solution of this minimization task leads to the normal equations below.
(Partial) differentiation with respect to a and b and setting to zero yields the two so-called normal equations:
$\frac{\partial}{\partial a}\text{SSE}(a, b) = \sum_{j=1}^{n} 2(y_j - a - bx_j)\cdot(-1) \overset{!}{=} 0$
$\frac{\partial}{\partial b}\text{SSE}(a, b) = \sum_{j=1}^{n} 2(y_j - a - bx_j)\cdot(-x_j) \overset{!}{=} 0$
that is,
$\sum_{j=1}^{n} (y_j - a - bx_j) = 0, \qquad \sum_{j=1}^{n} (y_j - a - bx_j)\,x_j = 0$.
Splitting the sums results in
$\sum_{j=1}^{n} y_j - a\,n - b\sum_{j=1}^{n} x_j = 0, \qquad \sum_{j=1}^{n} x_j y_j - a\sum_{j=1}^{n} x_j - b\sum_{j=1}^{n} x_j^2 = 0$
Dividing both equations by n gives
$\bar{y} - a - b\bar{x} = 0, \qquad \overline{xy} - a\bar{x} - b\,\overline{x^2} = 0$,
and after solving the linear system of equations with respect to a and b we obtain
$a = \bar{y} - b\bar{x}, \qquad b = \frac{c_{XY}}{s_X^2}$.
Properties of regression lines
1. Mean line: The regression line passes through the center of mass (x̄, ȳ) of the point cloud:
$\bar{y} = a + b\bar{x} = y(\bar{x})$.
The sum of the deviations e_j, and thus their mean value, is zero:
$\sum_{j=1}^{n} (y_j - a - bx_j) = \sum_{j=1}^{n} e_j = 0 = \bar{e} \qquad (4)$
Furthermore,
$\sum_{j=1}^{n} e_j x_j = 0 \quad \text{and} \quad \sum_{j=1}^{n} e_j \hat{y}_j = 0. \qquad (5)$
2. Minimal variance of the deviations: The variance of the deviations, $s_e^2 = \frac{1}{n}\sum_{j=1}^{n} e_j^2$, is identical to the sum of least squares except for the factor 1/n. This means that the regression line also minimizes the variance of the deviations.
3. Variance decomposition: The total variance of Y splits into two parts, namely into the variance of the regression values and the variance of the deviations. The variance of the regression values measures the share of the variation in Y that is described or explained by the variation of the independent variable X. The variance of the deviations measures the share of the total variance not explained by the variation in X. Thus, the explained variance is smaller than the total variance, at least as long as there are deviations.
$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} e_i^2 + 2\underbrace{\sum_{i=1}^{n}\hat{y}_i e_i}_{=0\;(5)} - 2\bar{y}\underbrace{\sum_{i=1}^{n} e_i}_{=0\;(4)} = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} e_i^2$
Now we divide both sides by n and obtain with (6) the desired equation
$s_Y^2 = s_{\hat{Y}}^2 + s_e^2$.
Properties of regression lines
Definition:
The ratio of the variance explained in a linear regression to the total variance of the dependent variable Y,
$R^2 := \frac{s_{\hat{Y}}^2}{s_Y^2}$,
is called the coefficient of determination of the linear regression.
The larger R² is, the better the fit of the regression line to the point cloud. It is therefore used as a measure of goodness of fit.
Properties:
$0 \le R^2 \le 1 \quad \text{and} \quad R^2 = \left(\frac{c_{XY}}{s_X s_Y}\right)^2 = r_{XY}^2$.
Example:
Revenues y_j:   201 184 220 240 180 164 186 150 182 210
Marketing x_j:   24  16  20  26  14  16  20  12  18  22
[Worked table with columns i, x_i, y_i, x_i², x_i y_i, y_i² omitted.]
The resulting regression line is
y = 93.4249 + 5.2274 · x
Interpretation?
The coefficient of determination is
$R^2 = (r_{XY})^2 = 0.760$.
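The coefficients and R² of this example follow directly from the formulas a = ȳ − b x̄, b = c_XY / s_X² and R² = r_XY²; a compact sketch:

```python
x = [24, 16, 20, 26, 14, 16, 20, 12, 18, 22]             # marketing
y = [201, 184, 220, 240, 180, 164, 186, 150, 182, 210]   # revenues
n = len(x)

mx, my = sum(x) / n, sum(y) / n
cxy = sum(xi * yi for xi, yi in zip(x, y)) / n - mx * my  # covariance
sx2 = sum(xi ** 2 for xi in x) / n - mx ** 2              # variance of X
sy2 = sum(yi ** 2 for yi in y) / n - my ** 2              # variance of Y

b = cxy / sx2                  # ≈ 5.2274
a = my - b * mx                # ≈ 93.4249
r2 = cxy ** 2 / (sx2 * sy2)    # ≈ 0.760

print(a, b, r2)
```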
Suitable, for example, are nonlinear functions that can be transformed into linear functions by simple
transformation, such as:
Logarithmic approaches
Logarithmic linear approach
Semi-logarithmic approach
Quadratic approaches
If a relationship between more than two variables is to be established, the so-called multiple regres-
sion is used.
117 - 1
Logarithmic approaches
Definition: The logarithmic linear approach formulates a linear relationship not between the data itself, but between the logarithms of the data:
log y = a + b log x
Transforming back, we obtain a power function as the relationship between the originally observed values:
$y = a^* \cdot x^b$.
The coefficients of this regression are calculated with the already known formulas, but before that the initial data have to be transformed by taking the logarithms of the observed values.
Definition: With the so-called semi-logarithmic approach, only one of the two variables is logarithmically transformed:
log y = a + bx
Transforming back, we obtain an exponential function as the relationship between the originally observed values:
$y = a^* \cdot e^{bx}$.
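In practice one simply runs the ordinary LSM on the transformed data. A sketch of the log-log case with made-up data roughly following a power law (`a_star = exp(a)` recovers the prefactor of the back-transformed power function):

```python
import math

# Hypothetical (x, y) data roughly following y = 3 * x**0.7
xs = [1, 2, 4, 8, 16, 32]
ys = [3.0, 4.9, 8.0, 13.0, 21.2, 34.5]

# Ordinary least squares on (log x, log y)
lx = [math.log(v) for v in xs]
ly = [math.log(v) for v in ys]
n = len(lx)
mx, my = sum(lx) / n, sum(ly) / n
b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum((u - mx) ** 2 for u in lx)
a = my - b * mx

a_star = math.exp(a)                      # back-transformed prefactor
print(f"y ~ {a_star:.3f} * x^{b:.3f}")    # close to 3 * x^0.7
```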
Quadratic approaches
The quadratic approach fits a parabola
y = a + b1 x + b2 x²
Using the observed values, the three coefficients a, b1 and b2 are calculated with the LSM. For this, the method of multiple regression is used (see below). The variables x and x² are treated mathematically as two different variables, although they are not, of course.
Regression parabolas have the advantage that they can represent relationships whose direction reverses. This is useful when the relationship not only weakens with increasing x values, but also changes its sign, as illustrated in the figure (example from happiness research: young and old people are happier than people in middle age – midlife crisis).
Example: A farmer measures the statistical relationship between the use of fertilizer and crop yield. He conducts experiments with different fertilizer rates on 14 sections of his corn acreage (values in kg/ha):

Fertilizer:  15    30    45    60    75    90    105   120   135   150   165   180   195   210
Yield:       1800  3600  6840  7200  8100  8460  8640  9000  9180  9000  8640  8460  8100  7740

The fitted regression parabola is
y = 881 + 123x − 0.44x²
The maximum of this function is at x = 139.8 kg/ha. Whether this input is also optimal depends on the prices for fertilizer and corn.
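The parabola can be fitted with `numpy.polyfit`, which solves exactly this least squares problem for polynomial coefficients; the result should be close to the coefficients quoted above (small differences are rounding):

```python
import numpy as np

fertilizer = np.arange(15, 211, 15)             # 15, 30, ..., 210 kg/ha
yield_ = np.array([1800, 3600, 6840, 7200, 8100, 8460, 8640,
                   9000, 9180, 9000, 8640, 8460, 8100, 7740])

# polyfit returns the coefficients [b2, b1, a] of y = b2*x^2 + b1*x + a
b2, b1, a = np.polyfit(fertilizer, yield_, 2)
print(a, b1, b2)                 # roughly 881, 123, -0.44

x_max = -b1 / (2 * b2)           # vertex of the parabola
print(x_max)                     # ≈ 139.8 kg/ha
```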
Multiple regression
yi = b0 + b1 x1i + b2 x2i + ei
Simplified Example:
Dependent variable
Y : Percentage of students enrolled in private schools
Explanatory variables
X1 : income
X2 : Percentage of population who have completed 4 or
more years of college
Application-oriented and simplified explanation in anticipation of the last chapter:
The values for the coefficients are the respective least squares estimators for the constant term as well as for the prefactors of the variables of the linear regression model. One can now ask whether the true values for b0, b1, ... are indeed significantly different from 0, that is, whether the respective independent variables X1, X2, ... or the constant term really have an influence on the variable Y to be explained.
On the basis of a given data set, this cannot be determined with absolute certainty. But it is possible (under certain assumptions) to provide probabilities for the parameters to be significantly different from 0. For this purpose, a so-called test statistic (here the so-called t-quotient) is calculated and compared with a reference value. The p-value (so-called significance level) indicates the probability with which one would erroneously assume significance of the respective parameter if it were actually 0. The lower the p-value, the more confidently the respective parameter can be regarded as significant (different from 0). This is usually highlighted by different codes ***, **, *, to identify at a glance the parameters with high or low significance.
Control questions
1. What properties should a straight line have that best describes the average linear
relationship between two variables?
4. What properties of the least squares method of calculating a regression line do you
know?
5. What is the relationship between the slope of the regression line and the correlation
coefficient?
6. What is the relationship between R 2 and the correlation coefficient? Which values
can R 2 attain (extrema), and which statements can thus be made about the statistical
correlation?
5 Combinatorics and counting principles
Examples
a) On a Saturday evening a student of Frankfurt School is looking for two friends, who she assumes to be at one of four parties. How many possibilities does she have to visit the four parties one after another?
b) After class, seven students meet to play skat. How many possibilities do they have to form a group of three players?
Factorials and binomial coefficients
To determine the number of permutations that can be formed from n distinct objects, we may think of n placeholders on which we successively place the n elements.
For the first placeholder, n elements are at our disposal; therefore, there are n possibilities to occupy the first place. For the second placeholder, only n − 1 elements are left. Together with the n possibilities for the first place, we have n · (n − 1) possibilities to allocate the first two places. For the third place there are n − 2 possibilities, and so forth, until for the allocation of the n-th placeholder only one possibility (the last remaining element) is left. In total there are n · (n − 1) · ... · 1 possibilities to arrange n elements in a sequence. The product n · (n − 1) · ... · 1 of the first n natural numbers is denoted by n! (read: „n factorial“). Furthermore, it is convenient to set 0! = 1.
The student from example a) has 4! = 1 · 2 · 3 · 4 = 24 possibilities to visit the four parties one after another.
Factorials
[Table of values for factorials omitted.]
There are already more than 3.6 million possibilities to arrange only 10 distinct books on a shelf! There are 20! ways to visit 20 cities one after another; the magnitude of this number (about 2.4 · 10^18) is comparable to the age of the universe measured in seconds (~10^18)!
Calculation rules for factorials:
$\frac{n!}{k!} = \frac{n \cdot (n-1) \cdots (k+1) \cdot k \cdots 1}{k \cdots 1} = n \cdots (k+1)$  (reduction of fractions)
Binomial coefficients
Definition:
$\binom{n}{k} := \frac{n!}{k!\,(n-k)!}$
The binomial coefficient $\binom{n}{k}$ (read: „n choose k“) equals the number of k-element subsets of an n-element set.
Binomial coefficients
Example:
$\binom{5}{3} = \frac{5!}{3!\,(5-3)!} = \frac{5!}{3! \cdot 2!} = \frac{5 \cdot 4 \cdot 3 \cdot 2 \cdot 1}{3 \cdot 2 \cdot 1 \cdot 2 \cdot 1} = 10$
There are 10 different possibilities to choose 3 out of 5 distinct books.
If we ask for the number of possibilities to choose k elements out of a set of n elements, we may proceed as follows. We find all k-element subsets by taking only the first k elements from each of the n! possible arrangements of the n-element set. In doing so, each k-element subset will appear k!(n − k)! times on the first k places. Thus an n-element set has
$\frac{n!}{k!\,(n-k)!}$
k-element subsets. This term is denoted as the binomial coefficient $\binom{n}{k}$ (read: „n choose k“).
Referring to example b), the students have $\binom{7}{3} = \frac{7!}{3! \cdot 4!} = \frac{7 \cdot 6 \cdot 5}{3 \cdot 2} = 35$ possibilities to form a group of three players.
From its definition we recognize the relationship
$\binom{n}{k} = \binom{n}{n-k}$
and one can verify
$\binom{n+1}{k} = \binom{n}{k-1} + \binom{n}{k} \qquad (7)$
by calculation. Using this formula, binomial coefficients are derived by simple addition within the so-called Pascal's triangle.
Fundamental principle of counting
Theorem:
The number of possibilities to fulfill k issues simultaneously, each of which can be fulfilled independently in n_i ways (i = 1, ..., k), is just equal to the product of the individual numbers of possibilities and amounts to
$T = n_1 \cdot n_2 \cdots n_k$.
In many applications, each of the k issues can be satisfied in the same number of ways, that is, all n_i = n. Then the number of possibilities is simply
$T_k = n^k$
Examples:
Number of possible outcomes when tossing a coin and then rolling a die:
$T = n_1 \cdot n_2 = 2 \cdot 6 = 12$.
Number of possible outcomes when rolling a red and a blue die (i.e. two distinguishable dice):
$T_2 = 6^2 = 36$.
Definition:
Consider a set of n elements. Each arrangement of all these elements in any order is
called a permutation of these n elements.
Proposition:
If all n elements are distinguishable, the number of possible permutations is
$_nP = n!$
If not all elements of the set to be permuted are different, the number of distinguishable permutations will be smaller, of course.
Proposition:
If not all elements of the set to be permuted are different, we form m groups (classes) of equal elements from them; then let group i contain n_i ≥ 1 elements, so that n = n_1 + n_2 + ... + n_m. Then the number of distinguishable permutations of these elements is
$_nP_{n_1, n_2, \dots, n_m} = \frac{n!}{n_1! \cdot n_2! \cdots n_m!}$
Question: How many distinguishable permutations exist for the set {a, b, a, b}?
Definition:
Consider a set with n different elements. We are interested in the number of ways to choose a k-element subset from these elements. We distinguish two cases:
1. A variation of k-th order, also called an arrangement or k-permutation of n, is an ordered arrangement of a k-element subset of an n-set. The number of variations is
$_nV_k = \frac{n!}{(n-k)!}$
2. A combination of k-th order or k-combination is a selection of items from a set such that the order of selection does not matter. The number of possible combinations is
$_nC_k = \binom{n}{k} = \frac{n!}{k! \cdot (n-k)!}$
From the elements of the set {a, b, c}, taking into account the order, the following six variations of order 2 can be formed:
ab  ba
ac  ca       ⇒ 6 = 3 · 2 = 3!/1! = ₃V₂
bc  cb
If the order does not matter, you get only the three combinations
ab ≙ ba
ac ≙ ca      ⇒ 3 = 3!/(2! · 1!) = $\binom{3}{2}$ = ₃C₂
bc ≙ cb
Task:
At the Summer Olympics, n = 8 sprinters start the 100-meter race. How many variations (i.e. taking the order into account) are there for the gold, silver and bronze ranks?
Solution:
There are
$_8V_3 = \frac{8!}{(8-3)!} = 8 \cdot 7 \cdot 6 = 336$
possible orders for the three medal ranks.
If you want to know how many possibilities there are to place a bet in the lottery, you calculate the number of 6-combinations out of 49 elements (without taking the order into account), i.e. the binomial coefficient
$_{49}C_6 = \binom{49}{6} = \frac{49 \cdot 48 \cdot 47 \cdot 46 \cdot 45 \cdot 44}{6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1} = 13\,983\,816$
For a five in the lottery, you need five out of the six right and one out of the 43 wrong picks. There are
$\binom{6}{5} \cdot \binom{43}{1} = 6 \cdot 43 = 258$
different combinations of five. For a four, you need four out of the six correct numbers and at the same time two out of the 43 wrong numbers. So there are
$\binom{6}{4} \cdot \binom{43}{2} = 15 \cdot 903 = 13\,545$
different combinations of four.
Summary

k independent issues:                             $T = n_1 \cdot n_2 \cdots n_k$
Permutation (not all elements distinguishable):   $_nP_{n_1,\dots,n_m} = \frac{n!}{n_1! \cdot n_2! \cdots n_m!}$
k-variation (order matters):                      $_nV_k = \frac{n!}{(n-k)!}$
k-combination (order does not matter):            $_nC_k = \binom{n}{k} = \frac{n!}{k! \cdot (n-k)!}$
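Python's `math` module covers all of these counting functions directly (`math.perm` and `math.comb` exist since Python 3.8); the examples from this chapter:

```python
import math

print(math.factorial(4))    # 24       ways to visit the four parties
print(math.comb(7, 3))      # 35       skat groups of three out of seven students
print(math.perm(8, 3))      # 336      gold/silver/bronze among 8 sprinters
print(math.comb(49, 6))     # 13983816 lottery tickets
print(math.comb(6, 5) * math.comb(43, 1))   # 258   "five" combinations
print(math.comb(6, 4) * math.comb(43, 2))   # 13545 "four" combinations
```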
Definition:
An experiment is called a random experiment if it
1. is run according to a certain rule,
2. can be repeated under the same conditions any number of times,
3. and has an uncertain outcome that cannot be predicted.
Definition:
The individual, mutually exclusive and indivisible outcomes or results of a random ex-
periment are called elementary events or basic outcomes.
Example: When rolling a die, there are the basic outcomes: „1“, „2“, „3“, „4“, „5“ and „6“.
Definition:
The set Ω of all elementary events of a random experiment is called sample space or
random sample space of this random experiment.
Examples:
In the random experiment „throwing a die“ the sample space is Ω = {1, 2, 3, 4, 5, 6}
and has finitely many elements.
The random experiment „flip a coin until heads appears“ has the sample space Ω = {H, TH, TTH, TTTH, ...}, consisting of countably infinitely many elements.
Definition:
A random event A is a subset of the sample space Ω. The event A has occurred if the
result of the random experiment is an element of this subset A.
Example:
A : „even number of pips“ when rolling the dice
⇒ A = {2, 4, 6} ⊂ Ω
When rolling two dice, the random event A: „Sum of pips is higher than 10“ consists of the three
elementary events
A = {(5, 6), (6, 5), (6, 6)}
In the random experiment „flip a coin until heads appears“ the event B: „Heads does not appear
before the 5th flip“
B = {TTTTH, TTTTTH, TTTTTTH, . . . }
Events, sample space, set of events
Taking all events of a random experiment together, we obtain a set whose elements are
subsets of Ω.
Definition:
All events of a random experiment with sample space Ω form the associated set of
events E (Ω).
Example:
The random experiment „flipping a coin“ has the sample space Ω = {H, T}. The
corresponding set of events is
$E(\Omega) = \bigl\{\{\},\ \{H\},\ \{T\},\ \{H, T\}\bigr\}$
Since events are defined as sets, we can reuse the notations and operations from set
theory here. Thus we can calculate with events as with sets. These calculations lead
again to events in E (Ω).
Negation Ā: The event „not A“ occurs if and only if A does not occur.
Two events A and D are called disjoint or mutually exclusive if
A ∩ D = ∅.
Definition:
If an event A always occurs when an event C occurs, we say that C is a subevent of A and write C ⊆ A.
[Venn diagrams: complement Ā, union and intersection of A and B, disjoint events, and subevent C ⊆ A in Ω.]
Laplace:
„If an experiment can produce a (finite) number of different and equally possible outcomes, and some of them are to be considered favorable, then the probability of a favorable outcome (event A) is equal to the ratio of the number of favorable to the number of possible outcomes“:
$P(A) := \frac{\text{number of favorable outcomes}}{\text{number of possible outcomes}} = \frac{g}{m}$
(Pierre-Simon, Marquis de Laplace, 1749–1827)
$\Omega = \{e_1, e_2, \dots, e_m\}$
Definition:
A random experiment with finitely many equally probable elementary events is called a Laplace experiment.
In a Laplace experiment, the probability P(A) of event A with sample space Ω is given by
$P(A) = \frac{\text{number of elements of } A}{\text{number of elements of } \Omega} = \frac{|A|}{|\Omega|}$.
Task: A coin and a die are thrown together. What is the (Laplace) probability that heads and a number greater than 4 will appear?
Solution: The sample space is
Ω = {(H,1), (T,1), (H,2), (T,2), (H,3), (T,3), (H,4), (T,4), (H,5), (T,5), (H,6), (T,6)}
and the event is A = {(H,5), (H,6)}. Thus the Laplace probability is given by
$P(A) = \frac{2}{12} = \frac{1}{6}$.
Experiment: Throw a die. Write down how often the number 6 occurs.
Definition:
The limit of the relative frequencies
$P(A) = \lim_{n \to \infty} h_n(A)$
is called the statistical probability of the event A.
The convergence of the relative frequency is also called the law of large numbers. In general it holds that the larger the number of observations, the better the estimate
$P(A) \approx h_n(A)$, i.e. the estimate $\hat{P}(A) = h_n(A)$.
Risk situation A:
You get 1000 € with probability p. You obtain 0 € with probability 1 − p.
Risk situation B:
You get 1000 € if the DAX rises by at least 200 points within the next 3 months. If not, you get nothing.
Now p is varied until the individual is indifferent between these two risk situations (e.g. p = 40 %). Then the number p indicates the subjective probability that the DAX will rise by at least 200 points in the next three months.
Axioms of probability theory
Any function P : E → R that assigns a real number to each event A from the set of events E may be called a probability function if the following axioms are satisfied:
Axioms of Kolmogorov:
K1: P(A) ≥ 0 for all A ∈ E
K2: P(Ω) = 1
K3: P(A ∪ B) = P(A) + P(B) for disjoint events A ∩ B = ∅
Definition:
The sample space Ω together with the set of events E and the probability function P,
(Ω, E, P),
is called a probability space. It contains all the necessary information to determine and calculate probabilities for all events from a sample space.
Theorem 1:
P(Ā) = 1 − P(A)
for all A ∈ E.
Theorem 2:
The impossible event has probability zero:
P(∅) = 0 .
Theorem 3:
If the events A1, A2, ..., An are pairwise disjoint, then the probability of the event resulting from the union of all these events is equal to the sum of the individual probabilities:
$P(A_1 \cup A_2 \cup \cdots \cup A_n) = \sum_{j=1}^{n} P(A_j)$.
Theorem 4:
If A ⊆ B, then
P(A) ≤ P(B).
Example:
A die has been thrown. The probability that a „6“ occurred is
P({6}) = 1/6.
With the additional information that an even number of pips was thrown, the probability that a „6“ occurred is higher, namely
P({6} | even number of pips) = 1/3.
[Venn diagram: Ω = {1, ..., 6}, A = {6}, B = {2, 4, 6}.]
The probability of occurrence of an event A under the condition that event B has
occurred (or occurs simultaneously with A) is called conditional probability of A under
the condition B.
Definition:
Let A and B be two events of a given probability space (Ω, E, P). The conditional probability of A under the condition B is defined as
$P(A|B) := \frac{P(A \cap B)}{P(B)}$.
[Venn diagram: B as the new reference set, with A ∩ B inside it.]
Definition:
Two events A and B are called stochastically independent or, briefly, independent if
P(A|B) = P(A).
It is then as well P(B|A) = P(B) and, by the multiplication theorem, P(A ∩ B) = P(A) · P(B).
According to L APLACE, the probability for this would have to be P(A) = 2/37. According
to the multiplication theorem for independent events we get for the total probability
Independent events must not be confused with disjoint events! For disjoint events P(A ∩ B ) = 0
holds. One could even say that disjoint events are highly dependent events, because if one of them
occurs, the other cannot occur at all.
Total probability
Example:
A bulk article is produced by two machines. The faster machine has slightly more rejects than the other, but produces twice as much.
[Venn diagram: Ω split into H1 and H2 (the two machines), with the event A (defective article) overlapping both: A = (A ∩ H1) ∪ (A ∩ H2).]
Definition:
Any n events H1, H2, ..., Hn that are mutually exclusive but together fill the sample space entirely, i.e.
H_i ∩ H_j = ∅ for i ≠ j and H1 ∪ H2 ∪ ... ∪ Hn = Ω,
form a partition (a complete system of events) of the sample space.
[Diagram: Ω partitioned into H1, H2, ..., Hn, with an event A stretching across several H_i.]
Every event A can then be decomposed as
A = (A ∩ H1) ∪ (A ∩ H2) ∪ ... ∪ (A ∩ Hn).
The following applies to the individual summands according to the multiplication theorem: P(A ∩ H_i) = P(H_i) · P(A|H_i), and hence the theorem of total probability
$P(A) = \sum_{i=1}^{n} P(H_i)\, P(A|H_i)$.
Total probability
Example
Question: What is the total probability of drawing a white ball in the end?
Solution:
According to the theorem of total probability, the probability of drawing a white ball in the
end is:
From P(A ∩ B) = P(B) · P(A|B) = P(A) · P(B|A) we obtain Bayes' theorem:
$P(A|B) = \frac{P(A) \cdot P(B|A)}{P(B)}$.
Example from slide 159: The probability that a piece picked at random from the day's production was produced by machine 1 is a priori P(H1) = 2/3 (machine 1 produces twice as much).
However, if the piece is faulty, one would guess (since machine 1 has a larger scrap rate) that the probability of the piece being produced by M1 is higher. In fact, according to Bayes' theorem, the probability is
$P(\text{article produced on M1} \mid \text{article broken}) = \frac{\frac{2}{3} \cdot 0.1}{\frac{2}{3} \cdot 0.1 + \frac{1}{3} \cdot 0.07} = \frac{20}{27} = 0.7407$
Definition:
In Bayesian statistics, H1, ..., Hn denote alternative hypotheses. P(H_i|B) is called the a posteriori probability of the i-th hypothesis after knowledge of the observation B.
Given: (from experience about the reliability of the test and/or from disease statistics)
P(cancer) = P(A) = 2 %
P(test positive|cancer) = P(B|A) = 95 % ⇒ P(test negative|cancer) = P(B̄|A) = 5 %
P(test negative|no cancer) = P(B̄|Ā) = 90 % ⇒ P(test positive|no cancer) = P(B|Ā) = 10 %
Wanted: P(cancer|test positive) = P(A|B)
$P(A|B) = \frac{0.95 \cdot 0.02}{0.95 \cdot 0.02 + 0.1 \cdot 0.98} = 16.24\,\%$
The probability that the man actually has cancer has thus increased from 2 % a priori with the infor-
mation of the test result to 16.24 % a posteriori. Thus, the probability of disease is still a good 8 times
higher than assumed a priori.
If the test had been negative, the probability of having the disease anyway would have been
$P(A|\bar{B}) = \frac{0.05 \cdot 0.02}{0.05 \cdot 0.02 + 0.9 \cdot 0.98} = 0.11\,\%$
The same result with natural frequencies, for 10 000 men:
Ω: 10 000
A (cancer): 200                    Ā (no cancer): 9 800
  B|A (test positive): 190           B|Ā (test positive): 980
  B̄|A (test negative): 10            B̄|Ā (test negative): 8 820
P(A|B) = 190 / (190 + 980) = 16.24 %
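A small sketch of this Bayesian update; the direct formula and the natural-frequency view give the same posterior:

```python
p_cancer = 0.02              # prior P(A)
p_pos_given_cancer = 0.95    # sensitivity P(B|A)
p_pos_given_healthy = 0.10   # false positive rate P(B|not A)

# Bayes' theorem: P(A|B) = P(A) * P(B|A) / P(B)
p_pos = p_cancer * p_pos_given_cancer + (1 - p_cancer) * p_pos_given_healthy
posterior = p_cancer * p_pos_given_cancer / p_pos
print(round(posterior, 4))   # 0.1624

# Same computation with natural frequencies for 10 000 men:
true_pos = 10_000 * p_cancer * p_pos_given_cancer          # 190
false_pos = 10_000 * (1 - p_cancer) * p_pos_given_healthy  # 980
print(true_pos / (true_pos + false_pos))                   # ≈ 0.1624
```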
Control questions
2. Is it possible to specify different sample spaces for the same random experiment?
6. For which question is the multiplication theorem helpful, for which the addition
theorem?
Based on the above results, what value will the reporting manager calculate for the probability that these 16 stores will receive their deliveries on time every day (6 days total) during a week, or at least receive their goods, even if late?
According to the observations in the past, the statistical probability that a truck arrives on time on a single trip is 92.338 %.
To supply the 16 stores six days a week, 96 trips are required. The probability that all deliveries will be made on time is
$P(\text{all in time}) = (0.923\,38)^{96} = 0.000\,474\,8$.
The probability that all stores will be reached and supplied on all days of a week, even if late, is
7 Random variables in one dimension
Definition:
Let a probability space (Ω, E , P) be given. A function
X : Ω → R,
e 7→ X (e) ∈ R ,
that assigns a real number X (e) to each elementary event e ∈ Ω is called random
variable or stochastical variable.
[Figure: random variable X as a mapping of the sample space onto the real axis (values −1, 0, 1, 2, 3).]
Technical constraint: for any real number r , the set Ar = {e|X (e) ≤ r } has to be an event, i.e.
Ar ∈ E.
Random variables
Example
Ω = {(i , j )|i = 1, . . . , 6; j = 1, . . . , 6} .
X (i , j ) = i + j .
Y (i , j ) = |i − j | .
The codomain of Y is CY = {0, 1, . . . , 5}, the random variable Y can therefore only
take on six different values.
The random variable defined this way can now take all real
values between 0 and c (0 ≤ Z < c ).
Definition:
If the codomain C ⊂ R of a random variable X consists of finitely many or countably infinitely many values,
C = {x1, x2, x3, ...},
the random variable is called discrete.
Question: Which of the random variables in the examples on the previous pages are
discrete, and which are continuous?
Definition:
The function
F (x ) := P (X ≤ x ) ,
that assigns to each real number x the probability with which the random variable X
takes a value X ≤ x is called the distribution function of the random variable X .
Ax = {e|X (e) ≤ x } .
That is the reason for the restrictive condition in the definition of a random variable.
Distribution function
Example: A random variable taking the values 0 and 1, each with probability ½ (e.g. one coin toss), has the distribution function
$F(x) = \begin{cases} 0 & \text{for } x < 0 \\ \frac{1}{2} & \text{for } 0 \le x < 1 \\ 1 & \text{for } 1 \le x \end{cases}$
1. right-continuous: $\lim_{\Delta x \to 0^+} F(x + \Delta x) = F(x)$
2. monotonically increasing
3. $\lim_{x \to -\infty} F(x) = 0, \qquad \lim_{x \to \infty} F(x) = 1$
Alternative Definition:
Any function F (x ) on the domain of the real numbers and with the codomain C = [0, 1]
that has the above three properties is called a distribution function and defines a
random variable.
Definition:
If X is a discrete random variable then the function
f (x ) := P(X = x )
is called the probability mass function or in short the mass function of the random
variable X .
Properties:
1. $f(x) \ge 0$
2. $\sum_{\text{all } i} f(x_i) = 1$
Definition:
If X is a continuous random variable with distribution function F, then the first derivative
$f(x) := \frac{d}{dx} F(x)$
is called the probability density function or, in short, the density function of the random variable X.
Properties:
1. $f(x) \ge 0$
2. $\int_{-\infty}^{\infty} f(x)\, dx = 1$
Note that for a continuous random variable each single value has probability zero: P(X = a) = 0.
Example:
$F(x) = \begin{cases} 0 & \text{for } x < 0 \\ \frac{1}{27}(x-3)^3 + 1 & \text{for } 0 \le x \le 3 \\ 1 & \text{for } x > 3 \end{cases} \qquad f(x) = \begin{cases} 0 & \text{for } x < 0 \\ \frac{1}{9}(x-3)^2 & \text{for } 0 \le x \le 3 \\ 0 & \text{for } x > 3 \end{cases}$
$P(1 < X < 2) = \int_1^2 f(x)\, dx = F(2) - F(1) = 0.2593$
Definition:
Let X be a random variable and f its mass or density function. Its expected value E(X) is defined as
$E(X) := \sum_{\text{all } j} x_j\, f(x_j)$, if X is a discrete, and
$E(X) := \int_{-\infty}^{\infty} x\, f(x)\, dx$, if X is a continuously distributed random variable.
If the series or the improper integral has no finite value, then the random variable X has
no expected value.
The expected value is a parameter of location. To express its numerical value (but often
also instead of the notation E(X )), usually the Greek letter „mu“ is used:
µ or µX or µ(X )
$E(X) = 1 \cdot \frac{1}{3} + 2 \cdot \frac{2}{3} = \frac{5}{3} = 1.666\ldots = \mu$
Example: For the density $f(x) = \frac{x}{4}$ on $[1, 3]$ (and 0 elsewhere), the integral
$E(X) = \int_{-\infty}^{\infty} x\, f(x)\, dx = \int_{-\infty}^{1} 0\, dx + \int_1^3 x \cdot \frac{x}{4}\, dx + \int_3^{\infty} 0\, dx$
is split into three parts, of which only the middle one has to be calculated:
$E(X) = \int_1^3 \frac{x^2}{4}\, dx = \left[\frac{1}{12}x^3\right]_1^3 = \frac{27}{12} - \frac{1}{12} = \frac{26}{12} = 2.1667$
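Expected values of continuous distributions can be checked numerically, e.g. with `scipy.integrate.quad`; a sketch for the density above:

```python
from scipy.integrate import quad

def f(x):
    """Density f(x) = x/4 on [1, 3], 0 elsewhere."""
    return x / 4 if 1 <= x <= 3 else 0.0

total, _ = quad(f, 1, 3)                      # normalization check: 1.0
expected, _ = quad(lambda x: x * f(x), 1, 3)  # E(X) = 26/12 ≈ 2.1667
print(total, expected)
```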
Analogously, the expected value of a function g of the random variable X is defined as
$E[g(X)] := \sum_{\text{all } j} g(x_j)\, f(x_j)$, if X is a discrete, and
$E[g(X)] := \int_{-\infty}^{\infty} g(x)\, f(x)\, dx$, if X is a continuously distributed random variable.
(Again, the expected value exists only if the series or improper integral has a finite value.)
Calculation rules for expected values:
1. Constant: E(a) = a
4. Linear transformation: E(a + b·X) = a + b·E(X)
Centering: E(X − µ_X) = 0. Proof: follows directly from rule 4 (linear transformation) with b = 1 and a = −µ_X.
Definition:
Let X be a random variable and µ_X its expected value. The expected value of the squared deviation of the random variable from µ_X,
$V(X) := E[(X - \mu_X)^2]$,
is called the variance of X.
If the series or the improper integral has no finite value, then the random variable X has no variance.
The variance is a parameter of dispersion. To express its numerical value (but often also instead of the notation V(X)), usually the Greek letter „sigma“ is used:
$\sigma^2$ or $\sigma_X^2$ or $\sigma^2(X)$
$V(X) := \int_{-\infty}^{\infty} (x - \mu_X)^2 f(x)\, dx$, if X is a continuously distributed random variable.
Example:
x:    0    1    2
f(x): 1/4  1/2  1/4
$$V(X) = (0-1)^2 \cdot \tfrac{1}{4} + (1-1)^2 \cdot \tfrac{1}{2} + (2-1)^2 \cdot \tfrac{1}{4} = \tfrac{1}{4} + \tfrac{1}{4} = \tfrac{1}{2} = 0.5$$
$$\sigma_X = \tfrac{1}{\sqrt{2}} \approx 0.7071\,.$$
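The same computation as a minimal Python sketch:

```python
import numpy as np

x = np.array([0, 1, 2])
p = np.array([1/4, 1/2, 1/4])

mu = np.sum(x * p)                   # E(X) = 1
var = np.sum((x - mu) ** 2 * p)      # V(X) = 0.5
print(mu, var, np.sqrt(var))         # 1.0, 0.5, 0.7071...
```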
Example: For the density f(x) = (3/2)(x − x²/2) on [0, 2]:
$$V(X) = \int_0^2 (x-1)^2 f(x)\,dx = \frac{3}{2}\int_0^2 \left(x - \frac{5}{2}x^2 + 2x^3 - \frac{1}{2}x^4\right) dx$$
$$= \frac{3}{2}\left[\frac{1}{2}x^2 - \frac{5}{6}x^3 + \frac{1}{2}x^4 - \frac{1}{10}x^5\right]_0^2 = 3 - 10 + 12 - \frac{24}{5} = \frac{1}{5} = \sigma_X^2 \;\Rightarrow\; \sigma_X \approx 0.4472$$
1. Constant: V(a) = 0
3. Factor: V(b·X) = b²·V(X),  σ_{b·X} = |b|·σX
4. Linear transformation: V(a + b·X) = b²·V(X),  σ_{a+b·X} = |b|·σX
The variance is defined as the „expected value of the squared deviation from µ“. If we compare it to the expected value of the squared deviation from some other value d, then:
5. In the special case d = 0, the following formula for the simplified calculation of the variance is obtained:
$$V(X) = E(X^2) - \mu^2 = E(X^2) - [E(X)]^2$$
Using the calculation rules for expected values and variances, we can transform random variables in a „clever way“:
$$X \;\longrightarrow\; Y := X - \mu \;\longrightarrow\; Z := \frac{X - \mu}{\sigma}$$
Definition:
If X is a random variable with expected value µ and standard deviation σ > 0, then the transformed random variable
$$Z := \frac{X - \mu}{\sigma}$$
is called standardized.
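A brief empirical illustration of standardizing, as a sketch (the exponential distribution below is just an arbitrary assumed input):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)   # arbitrary skewed data

z = (x - x.mean()) / x.std()    # standardize with (sample) mean and std
print(z.mean(), z.std())        # approximately 0 and 1
```

Standardizing changes location and scale only; the shape of the distribution stays the same.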
Wanted: The probability that a random variable X falls into an interval between a and b:
$$P(a < X \le b) = \int_a^b f(x)\,dx \ \text{(continuous)} \qquad \text{or} \qquad P(a < X \le b) = \sum_{a < x_j \le b} f(x_j) \ \text{(discrete)}$$
For this, however, the density or mass function f must be known. The following inequality provides an estimate even for unknown f, if at least the expected value and the standard deviation of the distribution are known.
Theorem of CHEBYSHEV:
Let X be an arbitrary continuous or discrete random variable with expected value µ and
standard deviation σ , then the inequality
$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}$$
always holds for any k > 0 and completely independent of the distribution.
For single values of k, the following estimates are obtained outside the kσ bound:
k = 1:   P(|X − µ| ≥ σ) ≤ 1 (trivial)
k = 1.5: P(|X − µ| ≥ 1.5σ) ≤ 0.444…
k = 2:   P(|X − µ| ≥ 2σ) ≤ 0.25
k = 2.5: P(|X − µ| ≥ 2.5σ) ≤ 0.16
k = 3:   P(|X − µ| ≥ 3σ) ≤ 0.111…
The complementary probability inside the kσ bound is
$$P(|X - \mu| < k\sigma) = 1 - P(|X - \mu| \ge k\sigma) \ge 1 - \frac{1}{k^2}$$
For single values of k, the following estimates are obtained inside the kσ bound:
k = 1.5: ≥ 0.555…,  k = 2: ≥ 0.75,  k = 2.5: ≥ 0.84,  k = 3: ≥ 0.888…
Example: Since X is a discrete random variable and CHEBYSHEV's inequality for ranges outside a kσ bound is formulated with „≥“, we reformulate the desired probability as follows:
$$P(X < 3 \cup X > 6) = P(|X - 4.5| \ge 2.5) \le \frac{1}{k^2}$$
according to CHEBYSHEV, where 2.5 = kσ has to hold. This results in k = 2.5/σ = 2.5/√(8/3) ≈ 1.5309, and we get
$$P(X < 3 \cup X > 6) \le \frac{1}{k^2} \approx 0.4267\,.$$
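The universality of the bound can be checked by simulation; a minimal sketch (the exponential distribution here is only an assumed stand-in for an "arbitrary" distribution):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=1_000_000)   # mu = 1, sigma = 1
mu = sigma = 1.0

for k in (1.5, 2.0, 2.5, 3.0):
    tail = np.mean(np.abs(x - mu) >= k * sigma)  # observed tail probability
    print(k, tail, 1 / k**2)                     # always below the bound 1/k^2
```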
Moments
Moments of a distribution are the expected values of powers of random variables.
Definition:
Mk := E(X k )
is called the k-th moment or moment of k-th order of the distribution, if it exists. The expected value of the k-th power of the deviation from the mean,
$$M_k^Z := E\left[(X - \mu)^k\right],$$
is called the k-th central moment. The sequence of moments (or central moments) characterizes the distribution.
Example:
$$M_3 = \int_0^2 x^3 f(x)\,dx = \int_0^2 x^3 \cdot \frac{3}{2}\left(x - \frac{1}{2}x^2\right) dx = \frac{3}{2}\int_0^2 \left(x^4 - \frac{1}{2}x^5\right) dx$$
$$= \frac{3}{2}\left[\frac{1}{5}x^5 - \frac{1}{12}x^6\right]_0^2 = \frac{3}{2} \cdot 2^5 \cdot \left(\frac{1}{5} - \frac{1}{6}\right) = 1.6$$
and
$$M_3^Z = \int_0^2 (x - \mu)^3 f(x)\,dx = \int_0^2 (x-1)^3 \cdot \frac{3}{2}\left(x - \frac{1}{2}x^2\right) dx$$
$$= \frac{3}{2}\int_0^2 \left(-x + \frac{7}{2}x^2 - \frac{9}{2}x^3 + \frac{5}{2}x^4 - \frac{1}{2}x^5\right) dx$$
$$= \frac{3}{2}\left[-\frac{1}{2}x^2 + \frac{7}{6}x^3 - \frac{9}{8}x^4 + \frac{1}{2}x^5 - \frac{1}{12}x^6\right]_0^2 = 0$$
Definition:
The ratio
$$\gamma := \frac{E\left[(X - \mu)^3\right]}{\sigma^3}$$
is called the skewness of a distribution. For symmetric distributions, γ = 0.
A right-skewed continuous and a left-skewed discrete distribution
The 4th central moment is – if it exists – positive for every distribution. It gives information about the kurtosis or „tailedness“ of a distribution. To obtain a measure independent of scale and scatter, one divides by the 4th power of the standard deviation:
Definition:
The ratio
$$\kappa := \frac{E\left[(X - \mu)^4\right]}{\sigma^4}$$
is called the kurtosis of a distribution.
κ = 3 is considered normal. Distributions with larger κ values have narrower and more peaked density functions; those with smaller κ values are more broadly curved than the normal distribution.
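For the example density f(x) = (3/2)(x − x²/2) used above, the moments can be obtained by numerical integration; a sketch assuming scipy:

```python
from scipy.integrate import quad

f = lambda x: 1.5 * (x - x**2 / 2)                  # density on [0, 2]
E = lambda g: quad(lambda x: g(x) * f(x), 0, 2)[0]  # expected value of g(X)

mu = E(lambda x: x)                                  # 1.0
var = E(lambda x: (x - mu) ** 2)                     # 0.2
gamma = E(lambda x: (x - mu) ** 3) / var ** 1.5      # 0 (symmetric)
kappa = E(lambda x: (x - mu) ** 4) / var ** 2        # ~2.14, flatter than normal
print(mu, var, gamma, kappa)
```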
Median and Quantiles
The median or central value xMed of a random variable X is a number that lies in the middle of the distribution in such a way that the probabilities for X to take a value greater or less than xMed are just equal:
$$P(X \le x_{Med}) = F(x_{Med}) = \frac{1}{2}\,,$$
hence
$$x_{Med} = F^{-1}\!\left(\tfrac{1}{2}\right),$$
if the inverse function exists. This is usually the case for continuous random variables, but not for discrete ones. More generally, xMed is a median if both P(X ≤ xMed) ≥ 1/2 and P(X ≥ xMed) ≥ 1/2.
Example: Let Y be the number of heads when flipping two coins. The median is clearly yMed = 1, because only for this number it holds that
$$P(Y \le 1) = \frac{3}{4} \ge \frac{1}{2} \qquad \text{and} \qquad P(Y \ge 1) = \frac{3}{4} \ge \frac{1}{2}\,.$$
Actually, the median is only a special case of the more generally defined quantiles:
Definition:
A number x[q] with 0 < q < 1 is called a q-quantile if at the same time
$$P(X \le x_{[q]}) \ge q \qquad \text{and} \qquad P(X \ge x_{[q]}) \ge 1 - q\,.$$
A very important distribution whose quantiles are needed by every statistician is the so-called standard normal distribution. The sketch shows the density function of a standard normally distributed random variable Z. Its median is zMed = z[0.5] = 0.

q      z[q]
0.5    0.000
0.9    1.282
0.95   1.645
0.975  1.960
0.99   2.327
0.995  2.575
1. Describe the difference between a random variable and a „normal“ variable in your own words.
2. What must the codomain of a random variable be like for it to be called continuous?
3. Which properties must a mass function have, which a density function?
4. How are the density function and the distribution function related? How can the mass function be extracted from the distribution function?
5. Is the expected value the most likely value for a random variable?
6. What is measured by the variance of a random variable?
7. What is the effect of standardizing? What characterizes a standardized random
variable?
8. Can you make probability statements about random variables whose distribution you
do not know?
9. To what extent can C HEBYSHEV’s inequality be said to provide a „rough“ estimate?
Definition:
For a two-dimensional discrete random variable (X, Y), the function
$$f(x, y) := P(X = x \cap Y = y)$$
is called the joint mass function.
Properties:
The following always holds:
1. f(x, y) ≥ 0,
2. $\sum_i \sum_j f(x_i, y_j) = 1$,
       y1    y2    ...  yj    ...  yl   | Σ
x1     p11   p12   ...  p1j   ...  p1l  | p1•
x2     p21   p22   ...  p2j   ...  p2l  | p2•
...
xi     pi1   pi2   ...  pij   ...  pil  | pi•
...
xk     pk1   pk2   ...  pkj   ...  pkl  | pk•
Σ      p•1   p•2   ...  p•j   ...  p•l  | 1

Margins:
$$p_{i\bullet} = \sum_j p_{ij}\,, \qquad p_{\bullet j} = \sum_i p_{ij}$$
Example: Draw two balls from the same urn without replacement
For the case of drawing without replacement, the calculation of the
joint distribution is a bit more difficult, because now the general
multiplication theorem has to be applied: The corresponding
conditional probabilities for the component Y , after a „1“ has
already appeared at the 1st draw and has not been replaced are:
P(Y = 1|X = 1) = 2/5,  P(Y = 2|X = 1) = 2/5,  P(Y = 3|X = 1) = 1/5
Now we have
P(X = 1 ∩ Y = 1) = P(X = 1) · P(Y = 1|X = 1) = 1/5
P(X = 1 ∩ Y = 2) = P(X = 1) · P(Y = 2|X = 1) = 1/5 . . . etc.
Joint distributions of (X, Y):

With replacement (independent draws):
       Y=1    Y=2    Y=3   | Σ
X=1    1/4    1/6    1/12  | 1/2
X=2    1/6    1/9    1/18  | 1/3
X=3    1/12   1/18   1/36  | 1/6
Σ      1/2    1/3    1/6   | 1

Without replacement:
       Y=1    Y=2    Y=3   | Σ
X=1    1/5    1/5    1/10  | 1/2
X=2    1/5    1/15   1/15  | 1/3
X=3    1/10   1/15   0     | 1/6
Σ      1/2    1/3    1/6   | 1
Definition:
The joint distribution function
F (x , y ) = P(X ≤ x ∩ Y ≤ y )
indicates the probability with which the random variable X takes on values less than or
equal to x and at the same time the random variable Y takes on values less than or
equal to y . F is obtained by summing up the joint mass function:
$$F(x, y) = \sum_{x_i \le x} \sum_{y_j \le y} f(x_i, y_j)$$
Definition:
A function f(x, y) for which
$$\int_a^b \int_c^d f(x, y)\,dy\,dx = P(a < X \le b \cap c < Y \le d)$$
for all a < b and c < d is called the joint probability density function of X and Y.
Properties:
1. f(x, y) ≥ 0,
2. $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dy\,dx = 1$.
Example:
$$f(x, y) = \frac{1}{2\pi}\, e^{-\frac{1}{2}(x^2 + y^2)}$$
Again, the joint distribution function can be given analogously to the discrete case. The summation
of the mass function is simply replaced by the integration of the density function:
F (x , y ) = P(X ≤ x ∩ Y ≤ y )
indicates the probability with which the random variable X takes on values less than or equal to x
and at the same time the random variable Y takes on values less than or equal to y . F is obtained
by integrating the joint density function:
$$F(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f(u, v)\,dv\,du$$
Marginal distributions
Discrete random variables
Definition:
The distribution of a single component of a multidimensional random variable without
regard to the other components is called marginal distribution.
For discrete random variables, the marginal distributions are calculated as the column
and row sums:
$$f_X(x_i) := P(X = x_i) = \sum_j f(x_i, y_j) = p_{i\bullet}$$
$$f_Y(y_j) := P(Y = y_j) = \sum_i f(x_i, y_j) = p_{\bullet j}$$
                         X = number of heads
                         0     1     2     3     4    | marginal distribution of Y
Y = number of       0    1/16  0     0     0     1/16 | 1/8
changes             1    0     1/8   1/8   1/8   0    | 3/8
                    2    0     1/8   1/8   1/8   0    | 3/8
                    3    0     0     1/8   0     0    | 1/8
Σ                        1/16  1/4   3/8   1/4   1/16 | 1
                         (marginal distribution of X)

xi       0     1     2     3     4           yi       0     1     2     3
fX(xi)   1/16  1/4   3/8   1/4   1/16        fY(yi)   1/8   3/8   3/8   1/8
$$\mu_X = E(X) = \sum_i x_i f_X(x_i)$$
$$\sigma_X^2 = V(X) = \sum_i (x_i - \mu_X)^2 f_X(x_i)$$
Continuous random variables
For continuous random variables, the marginal densities are calculated analogously by integrating out the other component:
$$f_X(x) := \int_{-\infty}^{\infty} f(x, y)\,dy\,, \qquad f_Y(y) := \int_{-\infty}^{\infty} f(x, y)\,dx$$
Expected values and variances of the components:
For multidimensional continuous random variables with a joint distribution, one calculates the ex-
pected value and variance and further moments of the individual components using the correspond-
ing marginal distributions via integration:
$$\mu_X = E(X) = \int_{-\infty}^{\infty} x f_X(x)\,dx$$
$$\sigma_X^2 = V(X) = \int_{-\infty}^{\infty} (x - \mu_X)^2 f_X(x)\,dx$$
Conditional distributions
Discrete random variables
The conditional distributions provide information about the distribution of one variable
under the constraint that the other takes a certain value.
Definition:
For discrete random variables (X , Y ) we define the mass function f1 of the conditional
distribution of X under the condition Y = yj as
$$f_1(x | y_j) := \frac{f(x, y_j)}{f_Y(y_j)} \quad \text{for } j = 1, \ldots, m$$
and, analogously, the conditional distribution of Y under the condition X = xi as
$$f_2(y | x_i) := \frac{f(x_i, y)}{f_X(x_i)} \quad \text{for } i = 1, \ldots, n\,.$$
From the values of the joint distribution from the previous example (slide 218) one
calculates four conditional distributions of X
X         0      1      2      3      4
f1(x|0)   0.5    0      0      0      0.5    | 1
f1(x|1)   0      0.333  0.333  0.333  0      | 1
f1(x|2)   0      0.333  0.333  0.333  0      | 1
f1(x|3)   0      0      1      0      0      | 1

and, analogously, five conditional distributions of Y:

Y         0      1      2      3
f2(y|0)   1      0      0      0      | 1
f2(y|1)   0      0.5    0.5    0      | 1
f2(y|2)   0      0.333  0.333  0.333  | 1
f2(y|3)   0      0.5    0.5    0      | 1
f2(y|4)   1      0      0      0      | 1
Definition:
For continuous random variables, the conditional distributions are defined by the density functions
$$f_1(x | y) := \frac{f(x, y)}{f_Y(y)} \qquad \text{and} \qquad f_2(y | x) := \frac{f(x, y)}{f_X(x)}\,.$$
Stochastic independence
Basic idea:
If the conditional distributions of X for different conditions y1 and y2 are different,
$$f_1(x | y_1) \ne f_1(x | y_2)\,,$$
it means that the distribution of X depends on what value Y takes. In this case, X and Y are said to be stochastically dependent.
In order to assess a joint distribution, it is particularly important to know whether X and Y
are dependent or independent.
Example:
Are DAX 30 returns distributed differently in January than in the rest of the months
(x = DAX return, y1 = January, y2 = February to y12 = December)?
Definition:
The random variables X and Y are called stochastically independent, or independent for short, if the joint mass or density function is the product of the marginals:
$$f(x, y) = f_X(x) \cdot f_Y(y) \quad \text{for all } x, y\,.$$
In the case of independence, all conditional distributions are equal – and equal to the corresponding marginal distribution:
$$f_1(x | y) = \frac{f_X(x) \cdot f_Y(y)}{f_Y(y)} = f_X(x)\,, \qquad f_2(y | x) = \frac{f_Y(y) \cdot f_X(x)}{f_X(x)} = f_Y(y)$$
In the joint distributions from the previous example (slides 213-215) the random variables
are stochastically independent [1] resp. dependent [2]. The conditional distributions are
equal [1] or unequal [2] to the marginal distribution:
[1] Independent case (with replacement):
          Y=1   Y=2   Y=3
f2(y|1)   1/2   1/3   1/6   | 1
f2(y|2)   1/2   1/3   1/6   | 1
f2(y|3)   1/2   1/3   1/6   | 1
fY(y)     1/2   1/3   1/6   | 1

[2] Dependent case (without replacement):
          Y=1   Y=2   Y=3
f2(y|1)   2/5   2/5   1/5   | 1
f2(y|2)   3/5   1/5   1/5   | 1
f2(y|3)   3/5   2/5   0     | 1
fY(y)     1/2   1/3   1/6   | 1
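The marginal and conditional distributions of such a table are easy to compute mechanically; a minimal Python sketch using the without-replacement table from above:

```python
import numpy as np

# joint mass function, rows = X = 1,2,3, columns = Y = 1,2,3
p = np.array([[1/5,  1/5,  1/10],
              [1/5,  1/15, 1/15],
              [1/10, 1/15, 0.0]])

f_X = p.sum(axis=1)          # marginal of X: 1/2, 1/3, 1/6
f_Y = p.sum(axis=0)          # marginal of Y: 1/2, 1/3, 1/6
f2_given_x1 = p[0] / f_X[0]  # f2(y|1) = 2/5, 2/5, 1/5  (not equal to f_Y!)
print(f_X, f_Y, f2_given_x1)
```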
Definition:
Let X and Y be components of a two-dimensional random variable with expected values
µX and µY. The quantity
$$\mathrm{Cov}(X, Y) = \sum_i \sum_j (x_i - \mu_X)(y_j - \mu_Y) f(x_i, y_j)$$
for discrete or
$$\mathrm{Cov}(X, Y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mu_X)(y - \mu_Y) f(x, y)\,dy\,dx$$
for continuous random variables is called the covariance of X and Y.
Using the calculation rules for expected values, the definition of the covariance can be reformulated:
$$\mathrm{Cov}(X, Y) = E(XY) - E(X) \cdot E(Y)$$
Covariance and correlation coefficient
Example: Draw two balls from an urn without replacement (see slides 213–215)
Expected values:
$$E(X) = 1 \cdot \tfrac{1}{2} + 2 \cdot \tfrac{1}{3} + 3 \cdot \tfrac{1}{6} = \tfrac{5}{3} = E(Y)$$
Variances:
$$E(X^2) = 1 \cdot \tfrac{1}{2} + 4 \cdot \tfrac{1}{3} + 9 \cdot \tfrac{1}{6} = \tfrac{10}{3} = E(Y^2)$$
$$V(X) = E(X^2) - E(X)^2 = \tfrac{10}{3} - \tfrac{25}{9} = \tfrac{5}{9} = V(Y)$$
Covariance:
$$E(XY) = \tfrac{1 \cdot 1}{5} + \tfrac{1 \cdot 2}{5} + \tfrac{1 \cdot 3}{10} + \tfrac{2 \cdot 1}{5} + \tfrac{2 \cdot 2}{15} + \tfrac{2 \cdot 3}{15} + \tfrac{3 \cdot 1}{10} + \tfrac{3 \cdot 2}{15} + 3 \cdot 3 \cdot 0 = \tfrac{8}{3}$$
$$\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y) = \tfrac{8}{3} - \tfrac{5}{3} \cdot \tfrac{5}{3} = -\tfrac{1}{9}$$
Example: Number of heads and number of changes (see the table on slide 218)
Variances:
$$E(X^2) = 0 \cdot \tfrac{1}{16} + 1 \cdot \tfrac{1}{4} + 4 \cdot \tfrac{3}{8} + 9 \cdot \tfrac{1}{4} + 16 \cdot \tfrac{1}{16} = 5$$
$$E(Y^2) = 0 \cdot \tfrac{1}{8} + 1 \cdot \tfrac{3}{8} + 4 \cdot \tfrac{3}{8} + 9 \cdot \tfrac{1}{8} = 3$$
$$V(X) = E(X^2) - E(X)^2 = 5 - 4 = 1\,, \qquad V(Y) = E(Y^2) - E(Y)^2 = 3 - 2.25 = 0.75$$
Covariance:
$$E(XY) = \tfrac{1 \cdot 1}{8} + \tfrac{2 \cdot 1}{8} + \tfrac{3 \cdot 1}{8} + \tfrac{1 \cdot 2}{8} + \tfrac{2 \cdot 2}{8} + \tfrac{3 \cdot 2}{8} + \tfrac{2 \cdot 3}{8} = \tfrac{1+2+3+2+4+6+6}{8} = \tfrac{24}{8} = 3$$
$$\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y) = 3 - 2 \cdot 1.5 = 0$$
Definition:
The ratio of the covariance and the standard deviations of X and Y,
$$\rho_{XY} := \frac{\mathrm{Cov}(X, Y)}{\sigma_X \cdot \sigma_Y}\,,$$
is called correlation coefficient of X and Y .
Properties :
The correlation coefficient
1. has the same sign as the covariance,
2. is normalized: −1 ≤ ρXY ≤ 1,
3. and indicates the strength of the linear stochastic relationship, independent of the
magnitudes and variances of the two variables.
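Covariance and correlation of the urn example can be reproduced directly from the joint table; a minimal Python sketch:

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([1, 2, 3])
p = np.array([[1/5,  1/5,  1/10],    # joint table, without replacement
              [1/5,  1/15, 1/15],
              [1/10, 1/15, 0.0]])

fx, fy = p.sum(axis=1), p.sum(axis=0)
mx, my = x @ fx, y @ fy              # E(X) = E(Y) = 5/3
cov = x @ p @ y - mx * my            # E(XY) - E(X)E(Y) = -1/9
vx = x**2 @ fx - mx**2               # 5/9
vy = y**2 @ fy - my**2               # 5/9
print(cov, cov / np.sqrt(vx * vy))   # -0.111..., rho = -0.2
```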
Let X and Y be random variables and their joint distribution f (x , y ) be known. We define
a new random variable X + Y .
Question: What does the distribution fX +Y look like?
First: expected value and variance. The expected value is additive, E(X + Y) = E(X) + E(Y) = µX + µY. For the variance,
$$V(X + Y) = E\left[\left((X + Y) - (\mu_X + \mu_Y)\right)^2\right] = E\left[\left((X - \mu_X) + (Y - \mu_Y)\right)^2\right]$$
$$= E\left[(X - \mu_X)^2 + (Y - \mu_Y)^2 + 2(X - \mu_X)(Y - \mu_Y)\right]$$
$$= E\left[(X - \mu_X)^2\right] + E\left[(Y - \mu_Y)^2\right] + 2\,E\left[(X - \mu_X)(Y - \mu_Y)\right]$$
$$= V(X) + V(Y) + 2\,\mathrm{Cov}(X, Y)$$
Sum of random variables
Properties :
Estimate for the variance if one does not know the covariance:
|σX − σY | ≤ σX +Y ≤ σX + σY
We denote with
$$\bar{X} := \frac{1}{n}(X_1 + X_2 + \cdots + X_n)$$
the random variable „arithmetic mean“ and calculate
$$E(\bar{X}) = \frac{1}{n}\,E(X_1 + X_2 + \cdots + X_n) = \frac{1}{n}\left[E(X_1) + \cdots + E(X_n)\right] = \frac{1}{n}(\mu + \cdots + \mu) = \frac{1}{n}(n\mu) = \mu$$
and, if the Xi are independent,
$$V(\bar{X}) = \frac{1}{n^2}\,V(X_1 + X_2 + \cdots + X_n) = \frac{1}{n^2}\left[V(X_1) + \cdots + V(X_n)\right] = \frac{1}{n^2}(n\sigma^2) = \frac{\sigma^2}{n}\,.$$
Proposition:
If n random variables Xi have the expected value E(Xi ) = µ, then their arithmetic mean
has the same expected value
E(X̄ ) = µ .
The √n-law:
If n independent random variables have the same standard deviation σ, then the standard deviation of their arithmetic mean,
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\,,$$
is smaller by a factor of √n.
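A quick simulation of the √n-law, as a sketch (the normal distribution here is only an assumed example input):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, n = 2.0, 100

# 10_000 independent samples of size n, each reduced to its mean
means = rng.normal(loc=5.0, scale=sigma, size=(10_000, n)).mean(axis=1)
print(means.std(), sigma / np.sqrt(n))   # both approximately 0.2
```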
Definition:
A discrete random variable X with mass function
$$f_X(x) = f_U(x; m) = \begin{cases} \frac{1}{m} & \text{for } x = x_1, x_2, \ldots, x_m \\ 0 & \text{otherwise} \end{cases}$$
is called discretely uniformly distributed. For the standard case xi = i, i = 1, …, m:
$$E(X) = \frac{m+1}{2}\,, \qquad V(X) = \frac{m^2 - 1}{12}\,.$$
[Sketch: mass function and distribution function]
Definition:
A continuous random variable X with density function
$$f_X(x) = f_U(x; a, b) = \begin{cases} \frac{1}{b-a} & \text{for } a \le x \le b \\ 0 & \text{otherwise} \end{cases}$$
is called uniformly distributed in the interval [a, b] or, for short, U(a, b)-distributed.
Properties:
$$E(X) = \frac{a+b}{2}\,, \qquad V(X) = \frac{(b-a)^2}{12}\,.$$
Density function of the random variable X = „waiting time for the subway“ running every 10 minutes, with a = 0 and b = 10:
$$f_X(x) = f_U(x; 0, 10) = \begin{cases} \frac{1}{10} & \text{for } 0 \le x \le 10 \\ 0 & \text{otherwise} \end{cases}$$
with distribution function
$$F_X(x) = F_U(x; 0, 10) = \begin{cases} 0 & \text{for } x < 0 \\ \frac{x}{10} & \text{for } 0 \le x < 10 \\ 1 & \text{for } 10 \le x \end{cases}$$
Definition:
A discrete random variable X with the mass function
$$f_X(x) = f_{Be}(x; p) = \begin{cases} 1 - p & \text{for } x = 0 \\ p & \text{for } x = 1 \\ 0 & \text{otherwise} \end{cases}$$
is called BERNOULLI-distributed. With q := 1 − p,
$$E(X) = p\,, \qquad V(X) = p \cdot (1 - p) = p \cdot q\,.$$
[Sketch: mass function and distribution function]
Example:
In order to start in ludo (Mensch ärgere dich nicht), you must
roll at least one six on three rolls (=„success“).
Definition:
A discrete random variable X with the mass function
$$f_X(x) = f_B(x; n, p) = \binom{n}{x}\, p^x (1-p)^{n-x}\,, \quad x = 0, 1, \ldots, n\,,$$
is binomially distributed, with
$$E(X) = n \cdot p\,, \qquad V(X) = n \cdot p \cdot (1 - p)\,.$$
All binomial distributions with p = 1/2 are symmetric. Probabilities p < 1/2 result in right-skewed distributions, p > 1/2 in left-skewed distributions.
Binomial distribution
Example: Let there be 20 balls in an urn, four of which are red. Let X be the number of red balls if we draw three balls from it with replacement. We calculate with p = 4/20 = 0.2 and n = 3:
$$P(X = 0) = f_B(0; 3, 0.2) = \binom{3}{0} \cdot 0.2^0 \cdot 0.8^3 = 1 \cdot 1 \cdot 0.512 = 0.512$$
$$P(X = 1) = f_B(1; 3, 0.2) = \binom{3}{1} \cdot 0.2^1 \cdot 0.8^2 = 3 \cdot 0.2 \cdot 0.64 = 0.384$$
$$P(X = 2) = f_B(2; 3, 0.2) = \binom{3}{2} \cdot 0.2^2 \cdot 0.8^1 = 3 \cdot 0.04 \cdot 0.8 = 0.096$$
$$P(X = 3) = f_B(3; 3, 0.2) = \binom{3}{3} \cdot 0.2^3 \cdot 0.8^0 = 1 \cdot 0.008 \cdot 1 = 0.008$$
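The same values can be reproduced with a few lines of Python, a sketch using only the standard library:

```python
from math import comb

def binom_pmf(x, n, p):
    # mass function f_B(x; n, p) = C(n, x) * p^x * (1-p)^(n-x)
    return comb(n, x) * p**x * (1 - p) ** (n - x)

for x in range(4):
    print(x, binom_pmf(x, 3, 0.2))   # 0.512, 0.384, 0.096, 0.008
```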
Example:
It is µ = 12 · 0.3 = 3.6 with a standard deviation of σX = √(12 · 0.3 · 0.7) ≈ 1.5875. We also compute the probability P(X > 6) that left voters are in the majority in the sample, using the binomial distribution table.
Definition:
A continuous random variable Z with the density function
$$f_{St}(z) := \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}$$
for −∞ < z < ∞ is called standard-normally distributed or, for short, N(0, 1)-distributed.
Note:
There are many normal distributions, but this one is the standard because it has an expected value of 0 and a standard deviation of 1:
$$E(Z) = 0\,, \qquad V(Z) = 1\,.$$
[Sketch of the density: maximum at z = 0, points of inflection at z = −1 and z = 1, quickly approaches the x-axis asymptotically]
While one can easily calculate the values of the density function with a pocket calculator, the integral of the distribution function
$$F_{St}(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-\frac{1}{2}u^2}\,du$$
is not elementary. Therefore there are tables for it – already since LAPLACE.
Normal distribution
Normal distribution
distribution function (tabulated):
$$P(Z \le z) = \int_{-\infty}^{z} f_{St}(u)\,du = F_{St}(z)$$
$$P(Z \ge z) = \int_{z}^{\infty} f_{St}(u)\,du = 1 - F_{St}(z)$$
interval:
$$P(a < Z \le b) = \int_a^b f_{St}(u)\,du = F_{St}(b) - F_{St}(a)$$
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 .5000 .5040 .5080 .5120 .5160 .5199 .5239 .5279 .5319 .5359
0.1 .5398 .5438 .5478 .5517 .5557 .5596 .5636 .5675 .5714 .5753
0.2 .5793 .5832 .5871 .5910 .5948 .5987 .6026 .6064 .6103 .6141
0.3 .6179 .6217 .6255 .6293 .6331 .6368 .6406 .6443 .6480 .6517
0.4 .6554 .6591 .6628 .6664 .6700 .6736 .6772 .6808 .6844 .6879
0.5 .6915 .6950 .6985 .7019 .7054 .7088 .7123 .7157 .7190 .7224
0.6 .7257 .7291 .7324 .7357 .7389 .7422 .7454 .7486 .7517 .7549
0.7 .7580 .7611 .7642 .7673 .7704 .7734 .7764 .7794 .7823 .7852
0.8 .7881 .7910 .7939 .7967 .7995 .8023 .8051 .8078 .8106 .8133
0.9 .8159 .8186 .8212 .8238 .8264 .8289 .8315 .8340 .8365 .8389
1.0 .8413 .8438 .8461 .8485 .8508 .8531 .8554 .8577 .8599 .8621
1.1 .8643 .8665 .8686 .8708 .8729 .8749 .8770 .8790 .8810 .8830
1.2 .8849 .8869 .8888 .8907 .8925 .8944 .8962 .8980 .8997 .9015
1.3 .9032 .9049 .9066 .9082 .9099 .9115 .9131 .9147 .9162 .9177
1.4 .9192 .9207 .9222 .9236 .9251 .9265 .9279 .9292 .9306 .9319
1.5 .9332 .9345 .9357 .9370 .9382 .9394 .9406 .9418 .9429 .9441
1.6 .9452 .9463 .9474 .9484 .9495 .9505 .9515 .9525 .9535 .9545
1.7 .9554 .9564 .9573 .9582 .9591 .9599 .9608 .9616 .9625 .9633
1.8 .9641 .9649 .9656 .9664 .9671 .9678 .9686 .9693 .9699 .9706
1.9 .9713 .9719 .9726 .9732 .9738 .9744 .9750 .9756 .9761 .9767
2.0 .9772 .9778 .9783 .9788 .9793 .9798 .9803 .9808 .9812 .9817
2.1 .9821 .9826 .9830 .9834 .9838 .9842 .9846 .9850 .9854 .9857
2.2 .9861 .9864 .9868 .9871 .9875 .9878 .9881 .9884 .9887 .9890
2.3 .9893 .9896 .9898 .9901 .9904 .9906 .9909 .9911 .9913 .9916
2.4 .9918 .9920 .9922 .9925 .9927 .9929 .9931 .9932 .9934 .9936
2.5 .9938 .9940 .9941 .9943 .9945 .9946 .9948 .9949 .9951 .9952
2.6 .9953 .9955 .9956 .9957 .9959 .9960 .9961 .9962 .9963 .9964
2.7 .9965 .9966 .9967 .9968 .9969 .9970 .9971 .9972 .9973 .9974
2.8 .9974 .9975 .9976 .9977 .9977 .9978 .9979 .9979 .9980 .9981
2.9 .9981 .9982 .9982 .9983 .9984 .9984 .9985 .9985 .9986 .9986
3.0 .9987 .9987 .9987 .9988 .9988 .9989 .9989 .9989 .9990 .9990
Symmetric intervals:
$$P(-z < Z \le z) = \int_{-z}^{z} f_{St}(u)\,du =: D(z)$$
Definition:
A continuous random variable X with the density function
$$f_N(x) = f_N(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}}$$
for −∞ < x < ∞ is called normally distributed with the parameters µ and σ², or N(µ, σ²)-distributed for short. It holds that
$$E(X) = \mu\,, \qquad V(X) = \sigma^2\,.$$
[Figures: normal densities for µ = 0, σ = 1; µ = 1, σ = 1; µ = 0, σ = 2; µ = 1, σ = 2; µ = 0, σ = 0.6; µ = −2, σ = 0.6]
Two-parameter family: the expected value and the variance serve as parameters of fN(x; µ, σ²).
Properties:
1. symmetrical around x = µ
2. points of inflection at x = µ − σ and x = µ + σ
3. the density function is flatter the larger the dispersion; every normal density is a rescaled standard normal density:
$$f_N(x; \mu, \sigma) = \frac{1}{\sigma}\, f_{St}\!\left(\frac{x - \mu}{\sigma}\right)$$
Example:
Let the random variable X (e.g. stock return) be normally distributed with E(X ) = 8 %
and V(X ) = 625 %2 .
Wanted: P(0 % < X ≤ 20 %)
Solution:
First standardize:
$$\frac{0\,\% - 8\,\%}{25\,\%} < \frac{X - 8\,\%}{25\,\%} \le \frac{20\,\% - 8\,\%}{25\,\%} \quad\Longleftrightarrow\quad -0.32 < Z \le 0.48$$
From the table: P(0 % < X ≤ 20 %) = FSt(0.48) − FSt(−0.32) = 0.6844 − (1 − 0.6255) = 0.3099.
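The same probability in Python, as a sketch assuming scipy (norm.cdf plays the role of the table of FSt):

```python
from scipy.stats import norm

mu, sigma = 8.0, 25.0                                  # in %
p = norm.cdf(20, mu, sigma) - norm.cdf(0, mu, sigma)   # direct
p_std = norm.cdf(0.48) - norm.cdf(-0.32)               # via standardization
print(p, p_std)                                        # both ~0.3099
```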
1. How is the binomial distribution defined? Does the B ERNOULLI distribution belong to
the family of binomial distributions?
2. Why is the normal distribution considered the most important distribution in statistics?
3. Why do you need only the values of the standard normal distribution when calculating
with normal distributions?
sum: $S_n := X_1 + X_2 + \cdots + X_n$
arithm. mean: $\bar{X}_n := \frac{1}{n}(X_1 + X_2 + \cdots + X_n)$
Then
$$E(\bar{X}_n) = \mu\,, \qquad V(\bar{X}_n) = \frac{\sigma^2}{n}\,.$$
Law of large numbers:
$$P(|\bar{X}_n - \mu| \ge \varepsilon) \to 0 \quad \text{for } n \to \infty$$
other notations:
$$P(|\bar{X}_n - \mu| < \varepsilon) \to 1\,, \qquad \operatorname{plim}_{n \to \infty} \bar{X}_n = \mu$$
The statement follows from CHEBYSHEV's inequality,
$$P(|\bar{X}_n - \mu| \ge k\sigma_{\bar{X}}) \le \frac{1}{k^2}\,,$$
where the standard deviation is $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$. With the substitution $k \cdot \sigma_{\bar{X}} = \varepsilon$, hence $k^2 = \varepsilon^2 \cdot \frac{n}{\sigma^2}$, it follows that
$$P(|\bar{X}_n - \mu| \ge \varepsilon) \le \frac{\sigma^2}{\varepsilon^2 \cdot n}\,.$$
1. Statistical probability
Determination of probabilities by experimental means:
hn (observed rel. frequency) is a good approximation or useful estimate for p if n is sufficiently
large.
2. Sampling method
For qualitative characteristics:
p = proportion of statistical units in the population for which the characteristic has a specific
value or property.
hn = relative frequency in random sample. It will be closer and closer to the value p as the
sample size increases.
The law of large numbers
Example: 9114 historical lottery numbers show the law of large numbers quite illustratively (discrete uniform distribution with m = 49).
theoretically:
$$\mu = \frac{49 + 1}{2} = 25\,, \qquad \sigma_X = \sqrt{\frac{49^2 - 1}{12}} = \sqrt{200} = 14.1421$$
empirically (n = 9114):
$$\bar{x} = \frac{1}{9114} \sum n_j x_j = 25.2211\,, \qquad s_X = \sqrt{200.6512} = 14.1651$$
Law of large numbers: The mean value of a sample converges stochastically towards the expected value.
Question: Can the probability distribution F (x ) also be determined experimentally?
For this purpose, one calculates the empirical distribution function from the n sample or
measured values:
x1 , x2 , . . . , xn ⇒ Hn (x )
Idea: Hn (x ) → F (x ) if n → ∞?
Example: Random numbers were drawn on a PC, uniformly distributed over the
interval [0, 10]:
Problem: In many applications, however, it is not sufficient to know only the two moments E and V.
Question: What is the distribution function of the random variable X̄n?
Consider the observed sum and the corresponding random variable,
$$s_n := x_1 + x_2 + \cdots + x_n\,, \qquad S_n := X_1 + X_2 + \cdots + X_n\,.$$
Here,
$$E(S_n) = \mu_{S_n} = n \cdot \mu \qquad \text{and} \qquad V(S_n) = \sigma_{S_n}^2 = n \cdot \sigma^2\,.$$
Standardization yields
$$Z_n := \frac{S_n - \mu_{S_n}}{\sigma_{S_n}} = \frac{S_n - n\mu}{\sigma\sqrt{n}}\,,$$
which is equivalent to
$$Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}} = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}\,.$$
The central limit theorem: the distribution of Zn tends to the standard normal distribution as n increases.
This is why the normal distribution and the CLT are so important:
Decisive advantage: The CLT does not impose any requirement on the initial
distribution. Whatever the identical and independent distribution of Xi may be, the
distribution function of the sum or the arithmetic mean always converges to the
normal distribution.
It is to this circumstance that the normal distribution owes its universal theoretical
and practical importance.
Empirical Distributions: The CLT also explains why so many empirical distributions
are close to the normal distribution and can be approximated by it quite well.
The binomial distribution converges to the normal distribution for n → ∞. In particular, the distribution function of the standardized variable
$$Z_n := \frac{B_n - np}{\sqrt{npq}} \equiv \frac{H_n - p}{\sqrt{pq/n}}$$
converges to the standard normal distribution function (limit theorem of DE MOIVRE and LAPLACE).
Convergence in distribution:
[Figure: approximation of the distribution functions of the binomial distribution for p = 1/4 and n = 1, 2, 3 and 6 toward the normal distribution function]
Properties:
If n is sufficiently large, the distribution of a sum or arithmetic mean can be approximated
by the normal distribution.
$$P(Y \le y) \approx F_{St}\!\left(\frac{y - \mu_Y}{\sigma_Y}\right)$$
Example:
The average processing time of a BAföG application by a clerk at the
Students’ Union is µ = 35 min with a standard deviation of
σ = 18 min.
Question: What is the probability that the clerk will complete more than 15 applications
in an 8-hour workday, i.e.
P(S16 ≤ 480) = ?
Solution:
According to the CLT, the sum S16 is approximately normally distributed with E(S16) = 35 · 16 = 560 and the standard deviation σ_{S16} = 18 · √16 = 72. We calculate:
$$P(S_{16} \le 480) \approx F_N(480;\, 560,\, 72^2) = F_{St}\!\left(\frac{480 - 560}{72}\right) = F_{St}(-1.1111) = 0.1331$$
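A sketch of the same calculation, together with a simulation under an assumed concrete processing-time distribution (the gamma distribution below is only an illustrative choice with mean 35 and standard deviation 18; the CLT result does not depend on it):

```python
import numpy as np
from scipy.stats import norm

mu, sigma, n = 35.0, 18.0, 16

# CLT approximation
print(norm.cdf((480 - n * mu) / (sigma * np.sqrt(n))))   # 0.1331

# simulation with gamma-distributed processing times (assumption)
rng = np.random.default_rng(3)
shape = (mu / sigma) ** 2                 # gamma shape for mean 35, sd 18
times = rng.gamma(shape, mu / shape, size=(100_000, n)).sum(axis=1)
print(np.mean(times <= 480))              # close to the CLT value
```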
The continuity correction Sk is always half the step size of the random variable Y. For example, if Y can take only integers or only natural numbers, then Sk = 1/2.
Question: How large should n be in order to use the normal distribution as an approximation?
There is no generally valid answer here. A rule of thumb says that for n > 30 the approximation is
generally quite good. For binomially distributed random variables it is often required that npq > 9
holds.
11 Point estimators for parameters of a population
Motivation:
Complete information about the distribution of characteristics in a population can only
be obtained by a census (full sample). In most cases, censuses are uneconomical,
often even impossible.
The representative sample: It is made sure that the sample has the same or similar
structure as the basic population with regard to other characteristics.
Pure random sample: Each element ωi of the basic population has an equal chance of
entering the sample.
Question: Is a random survey of n = 100 people between 2pm and 3pm on the Zeil in
front of Karstadt representative?
Definition:
Urn model to describe the pure random sampling:
The urn contains N numbered balls (= number of statistical units in the basic population).
The number on the ball is assigned to exactly one statistical unit.
A sample of size n = 10 was drawn from the basic population of students in a lecture.
The body height X in cm was determined and recorded in the following table:
i 1 2 3 4 5 6 7 8 9 10
xi 176 180 181 168 177 186 184 173 182 177
This data set has the mean value 178.4 cm. The point estimate for the height of students
in the lecture hall is simply:
µ̂ = x̄ = 178.4 cm
The value µ̂ is an estimate for the unknown mean µ. Most of the time the estimated value will not exactly match the true mean; it is very rarely true that µ̂ = µ.
Every single observed characteristic value xi is a realization of a random variable Xi . For each of
these random variables Xi the probability distribution is given by the frequency distribution of the
basic population. Thus for each random variable Xi the following applies
E(Xi ) = µ, V(Xi ) = σ 2 .
Thus, one can consider the observed sample values and their mean as realizations of an n-dimensional
random variable (X1 , X2 , . . . , Xn ). If the random sample is carried out with replacement, all Xi are
independent and identically distributed.
As stated above, the estimated value will rather rarely meet the true mean value, it is possible or likely
that an estimation error
e := µ − µ̂
will occur. However, the crucial question is whether the estimated value µ̂ hits the true value at
least on average (that is, if we determine many realizations of µ̂). For this purpose we calculate the
expected value
$$E(\hat{\mu}) = E(\bar{X}_n) = E\!\left(\frac{1}{n}\sum X_j\right) = \frac{1}{n}\sum E(X_j) = \frac{1}{n}\, n\mu\,,$$
hence it holds that
$$E(\hat{\mu}) = \mu\,.$$
This property of µ̂ is called unbiasedness. For an unbiased estimator, the estimation error vanishes
on average, i.e. E(e) = 0.
If an estimator is not unbiased, we call it biased, the expected value of the estimation error
bias := E(e)
is called bias.
Unbiasedness is not the only important property. If we calculate the variance of our estimator µ̂, we get
$$V(\hat{\mu}) = V(\bar{X}_n) = V\!\left(\frac{1}{n}\sum X_j\right) = \frac{1}{n^2}\sum V(X_j) = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n}\,.$$
We notice that with increasing sample size n the variance of the sample mean becomes smaller and smaller, hence
$$\lim_{n \to \infty} V(\hat{\mu}) = 0\,.$$
According to the law of large numbers
plim µ̂ = µ.
This property is called consistency. It means that the larger the sample size, the more accurate the
estimate.
An important prerequisite for the calculation of the variance and also for the application of the law of
large numbers is the independence of the variables Xi , which, however, is given in every case for
samples with replacement. However, this is also approximately true for samples without replacement,
if the basic population is very large in relation to the sample size.
Point estimator for the variance
$$s^2 := \frac{1}{n}\sum (x_j - \bar{x})^2$$
Using the observed values from the example before, we can also make an estimate for the variance of the body height of the students in a lecture. To do this, we first calculate the variance of the sample values:
$$s_X^2 = \frac{1}{10}\left(176^2 + 180^2 + \cdots + 177^2\right) - 178.4^2 = 31\,852.4 - 31\,826.56 = 25.84$$
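The point estimates of this example in Python, as a sketch (the ddof argument controls the divisor n versus n − 1):

```python
import numpy as np

x = np.array([176, 180, 181, 168, 177, 186, 184, 173, 182, 177])

mu_hat = x.mean()              # 178.4
s2 = x.var(ddof=0)             # sample variance s^2 = 25.84 (divisor n)
sigma2_hat = x.var(ddof=1)     # unbiased n/(n-1) * s^2 = 28.71 (divisor n-1)
print(mu_hat, s2, sigma2_hat)
```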
An estimator
$$\hat{\theta} = \hat{\theta}(X_1, \ldots, X_n)\,,$$
which depends on the random variables X1, X2, ..., Xn, is again a random variable and accordingly also has a probability distribution. Certain stochastic properties of the estimator follow from this.
Unbiasedness: An estimator θ̂ is unbiased if its expected value is equal to the true parameter
E(θ̂) = θ .
If an estimator is biased, it would be good if the bias became smaller with increasing sample size and
would disappear for n → ∞.
Examples
1. The estimator µ̂ = x̄ is unbiased.
2. The estimator σ̂ 2 = s2 is not unbiased but asymptotically unbiased.
Consistency: An (asymptotically) unbiased estimator θ̂ is consistent if its variance vanishes with growing sample size:
$$\lim_{n \to \infty} V(\hat{\theta}) = 0\,.$$
An estimated value usually does not agree with the true value of the parameter. However, it would be
good if it is close to the parameter, or at least has a good chance of being close. Thus, the estimation
error |θ̂ − θ| should be as small as possible and become smaller and smaller, especially for larger
sample sizes. The property of consistency means that the probability of an estimation error ε > 0,
however small, tends to zero as n increases.
Efficiency: We call an (unbiased) estimator θ̂1 more efficient than another (unbiased) estimator
θ̂2 if it has a smaller variance,
V(θ̂1 ) < V(θ̂2 ).
Thus, the most efficient or best unbiased estimator θ* would be the one among all unbiased estimators that has the smallest variance, that is, V(θ*) ≤ V(θ̂) for every unbiased estimator θ̂.
Mean squared error (MSE): The mean squared error of an estimator is the expected value of its squared deviation from the true parameter value, i.e.
$$\mathrm{MSE}(\hat{\theta}) := E\left[(\hat{\theta} - \theta)^2\right].$$
The MSE accounts for both the variance and the bias:
$$\mathrm{MSE}(\hat{\theta}) = V(\hat{\theta}) + \mathrm{bias}^2\,.$$
It may be advantageous to give preference to a slightly biased estimator, provided that this achieves an effective reduction in variance, which is often the case.
Example
Consider the distributions of two alternative estimators for a parameter θ. One is unbiased, the other has a small bias but a much smaller variance.
[Figure: two sampling densities f(θ̂) around θ; the biased one is narrower and offset from θ by the bias]
Motivation:
Measures of samples, such as mean, variance, and others, are realizations of random
variables. Their probability distributions are called sampling distributions. In
particular, we are interested in:
Note:
Sampling distributions follow the normal distribution in many cases, because
1. characteristics are often a priori approximately normally distributed,
2. for larger samples, the Central Limit Theorem (CLT) applies.
2. The symmetrical intervals are also useful:
$$P(-z < Z \le z) = \int_{-z}^{z} f_{St}(u)\,du = D(z)$$
This is true for any distribution of the characteristic X as long as the individual sample
elements are drawn independently.
By standardizing the random variable X̄ we obtain:
4. $\dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ is approximately standard normally distributed.
From 4. and with $\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}}$ it follows directly that
$$P\!\left(-z < \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} \le z\right) \approx F_{St}(z) - F_{St}(-z) = D(z)\,.$$
By transforming the inequality inside the probability function we get
Question 1 (Given interval): With what probability will the sample mean fall into the
interval 182 cm < X̄ ≤ 184 cm
The normal distribution can be taken as the sampling distribution, since the initial distribution is already approximately normally distributed.
It holds that
$$E(\bar{X}) = 183\ \text{cm} \qquad \text{and} \qquad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{10\ \text{cm}}{\sqrt{25}} = 2\ \text{cm}\,.$$
So the above interval has a length of ± half a standard deviation and thus the probability:
$$P\!\left(183\ \text{cm} - \tfrac{1}{2} \cdot 2\ \text{cm} < \bar{X} \le 183\ \text{cm} + \tfrac{1}{2} \cdot 2\ \text{cm}\right) \approx D\!\left(\tfrac{1}{2}\right) = 0.3830$$
Sampling distributions
Question 1: Reading off the table of the normal distribution (standard normal table as above):
$$D(0.5) = 2 \cdot F_{St}(0.5) - 1 = 2 \cdot 0.6915 - 1 = 0.3830$$
Question 2 (Given probability): What is the interval in which the sample mean falls with
a high probability of 0.9?
To do this, we need to determine the z value for which D(z) = 0.9. In the table of the standard normal distribution we find z = 1.645, such that
$$0.9 = D(1.645) \approx P(183\ \text{cm} - 1.645 \cdot 2\ \text{cm} < \bar{X} \le 183\ \text{cm} + 1.645 \cdot 2\ \text{cm})$$
$$= P(183\ \text{cm} - 3.29\ \text{cm} < \bar{X} \le 183\ \text{cm} + 3.29\ \text{cm}) = P(179.71\ \text{cm} < \bar{X} \le 186.29\ \text{cm})$$
Definition:
A sample is considered a large sample if the deviation of the actual sampling distribution
from the normal distribution can be neglected.
Prerequisite: CLT (variance σ known) → $\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}}$
$$P\!\left(-z < \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} \le z\right) \approx F_{St}(z) - F_{St}(-z) = D(z)$$
This can be transformed into an approximate probability statement:
Definition:
Inference is the statistical conclusion from the sample to the unknown basic population.
Definition: Confidence interval for µ for large samples with known variance σ²:
$$\mathrm{CI}(\mu, 1 - \alpha) = \left[\bar{x} - z\,\sigma_{\bar{X}},\; \bar{x} + z\,\sigma_{\bar{X}}\right] \quad \text{with } z = z_{[1-\alpha/2]}\,.$$
[Figure: standard normal density with central area D(z) = 1 − α and two tails of α/2 each at ±z]
Mostly, the significance level α is given and then the corresponding z value is determined as a quantile.
Interval estimators for large samples
Definition: Confidence interval for µ for large samples with unknown variance:
$$\mathrm{CI}(\mu, 1 - \alpha) = \left[\bar{x} - z\,\hat{\sigma}_{\bar{X}},\; \bar{x} + z\,\hat{\sigma}_{\bar{X}}\right] \quad \text{with } z = z_{[1-\alpha/2]}\,.$$
1. Distribution in the basic population: In which interval do two thirds of the rents per
square meter paid probably lie?
2. Interval estimation: What is the confidence interval on a confidence level of 0.9 for
the average net rent µ?
1. Distribution in the basic population:
We assume normal distribution. We take the sample mean as point estimate of the actual average rent and
$$\hat{\sigma} = \sqrt{\tfrac{50}{49}} \cdot 2.07\ € = 2.10\ €$$
as the point estimate for the standard deviation. Thus, the variable
$$Z = \frac{X - 8.30}{2.10}$$
would be standard normally distributed. According to the standard normal distribution table, 66.7 % of all observations of the variable Z are in the interval −0.97 < Z ≤ 0.97. We undo the standardization and obtain the interval
$$[8.30\ € - 0.97 \cdot 2.10\ €;\; 8.30\ € + 0.97 \cdot 2.10\ €] = [6.26\ €;\; 10.34\ €]$$
[Figure: 66.7 % of the rents lie between 6.26 € and 10.34 €]
2. Interval estimation: What is the confidence interval on a confidence level of 0.9 for the average net rent µ?
It holds that
$$\hat{\sigma}_{\bar{X}} = \frac{\hat{\sigma}}{\sqrt{n}} = \frac{2.10\ €}{\sqrt{50}} \approx 0.29\ €\,.$$
It is α = 0.1. According to the table, the 0.95 quantile is z[0.95] = 1.645. Thus, the confidence interval is
$$\mathrm{CI}(\mu, 0.9) = [8.30\ € - 1.645 \cdot 0.29\ €;\; 8.30\ € + 1.645 \cdot 0.29\ €] = [7.82\ €;\; 8.78\ €]\,.$$
Thus, the unknown mean of the basic population lies in the interval [7.82 €; 8.78 €] with a probability of 90 %.
[Figure: interval 7.82 € – 8.78 € around x̄ = 8.30 €, covering probability 0.9]
Here, as always, $\hat{\sigma}_{\bar{X}} = \dfrac{\hat{\sigma}}{\sqrt{n}}$.
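The rent confidence interval as a Python sketch, assuming scipy (small differences to the slide values come from rounding the standard error):

```python
import numpy as np
from scipy.stats import norm

n, x_bar, sigma_hat, alpha = 50, 8.30, 2.10, 0.10

z = norm.ppf(1 - alpha / 2)             # 1.645
se = sigma_hat / np.sqrt(n)             # ~0.297 (slides round to 0.29)
print(x_bar - z * se, x_bar + z * se)   # ~[7.81, 8.79], slides: [7.82, 8.78]
```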
The chi-squared distribution
Definition:
The random variable
$$\chi_n^2 := Z_1^2 + Z_2^2 + \cdots + Z_n^2\,,$$
where Z1, ..., Zn are independent standard normally distributed random variables, is called chi-square distributed with n degrees of freedom.
Thus, the chi-square distributions form a whole family of distributions. They are
continuous distributions and have positive probability densities in the interval (0, ∞).
The chi-square distributions are suitable as test distributions for many typical test
situations and thus have multiple applications in practice.
[Figure: densities f(χ²) of the chi-square distribution for n = 1, 3, 5, 10, 15]
Properties:
$$E(\chi_n^2) = n\,, \qquad V(\chi_n^2) = 2n$$
Definition:
The random variable
$$T_n := \frac{Z}{\sqrt{\frac{1}{n}\,\chi_n^2}}\,,$$
where Z is standard normally distributed and independent of χ²n, is called t-distributed with n degrees of freedom.
The t-distributions are similar to the normal distribution, but slightly wider. For an increasing number of degrees of freedom they tend towards the standard normal distribution.
[Figure: densities of the t-distribution for n = 1, 3, 10 together with the standard normal density]
Properties:
$$E(T_n) = 0\,, \qquad V(T_n) = \frac{n}{n-2} > 1 \quad (n > 2)$$
The actual sampling distribution would have to be used. However, this depends on
the distribution of the characteristic in the basic population and therefore varies from
case to case and is usually difficult to calculate.
Only in the special situation when the characteristic in the basic population is already normally distributed, or follows the normal distribution quite well, does the construction of confidence intervals become simple again.
Theorem 11:
If the characteristic X is normally distributed in the basic population, then 1. the sample mean X̄ is also normally distributed, and 2. the standardized quantity (X̄ − µ)/σ̂X̄ is t-distributed with n − 1 degrees of freedom.
From this follows immediately $P\!\left(-t < \frac{\bar{X}-\mu}{\hat{\sigma}_{\bar{X}}} \le t\right) = F_{T_{n-1}}(t) - F_{T_{n-1}}(-t)$ and thus the
Definition: Confidence interval for µ for small samples with normally distributed basic population and unknown variance:
$$\mathrm{CI}(\mu, 1 - \alpha) = \left[\bar{x} - t\,\hat{\sigma}_{\bar{X}},\; \bar{x} + t\,\hat{\sigma}_{\bar{X}}\right] \quad \text{with } t = t_{n-1;[1-\alpha/2]}\,.$$
It is
$$t_{n-1;[1-\alpha/2]} = -t_{n-1;[\alpha/2]}\,.$$
This property, by the way, also holds for the quantiles of the normal distribution z[·], which can be found in the quantile table of the STUDENT-t distribution in the bottom row for n = ∞.
It is important to emphasize here once again that the use of the t-distribution presupposes a normally distributed basic population!
Interval estimators for small samples
Example
$$t_{24;[1-0.025]} = 2.064$$
[Figure: 95 % confidence interval for the mean, with 2.5 % in each tail: 40 084 € to 45 355 € around x̄ = 42 720 €]
STUDENT's t-distribution
In the quantile table of the STUDENT-t distribution, we find the row with 24 degrees of freedom. We are looking for the 0.975 quantile, so we look in the corresponding column. It is t24;[1−0.025] = t24;[0.975] = 2.064.

Degrees of          Quantiles
freedom  t[0.6]  t[0.667] t[0.75] t[0.8]  t[0.875] t[0.9]  t[0.95] t[0.975] t[0.99]  t[0.995] t[0.999]
1        0.325   0.577    1.000   1.376   2.414    3.078   6.314   12.706   31.821   63.657   318.31
2        0.289   0.500    0.816   1.061   1.604    1.886   2.920   4.303    6.965    9.925    22.327
3        0.277   0.476    0.765   0.978   1.423    1.638   2.353   3.182    4.541    5.841    10.215
4        0.271   0.464    0.741   0.941   1.344    1.533   2.132   2.776    3.747    4.604    7.173
5        0.267   0.457    0.727   0.920   1.301    1.476   2.015   2.571    3.365    4.032    5.893
6        0.265   0.453    0.718   0.906   1.273    1.440   1.943   2.447    3.143    3.707    5.208
7        0.263   0.449    0.711   0.896   1.254    1.415   1.895   2.365    2.998    3.499    4.785
8        0.262   0.447    0.706   0.889   1.240    1.397   1.860   2.306    2.896    3.355    4.501
9        0.261   0.445    0.703   0.883   1.230    1.383   1.833   2.262    2.821    3.250    4.297
10       0.260   0.444    0.700   0.879   1.221    1.372   1.812   2.228    2.764    3.169    4.144
11       0.260   0.443    0.697   0.876   1.214    1.363   1.796   2.201    2.718    3.106    4.025
12       0.259   0.442    0.695   0.873   1.209    1.356   1.782   2.179    2.681    3.055    3.930
13       0.259   0.441    0.694   0.870   1.204    1.350   1.771   2.160    2.650    3.012    3.852
14       0.258   0.440    0.692   0.868   1.200    1.345   1.761   2.145    2.624    2.977    3.787
15       0.258   0.439    0.691   0.866   1.197    1.341   1.753   2.131    2.602    2.947    3.733
16       0.258   0.439    0.690   0.865   1.194    1.337   1.746   2.120    2.583    2.921    3.686
17       0.257   0.438    0.689   0.863   1.191    1.333   1.740   2.110    2.567    2.898    3.646
18       0.257   0.438    0.688   0.862   1.189    1.330   1.734   2.101    2.552    2.878    3.610
19       0.257   0.438    0.688   0.861   1.187    1.328   1.729   2.093    2.539    2.861    3.579
20       0.257   0.437    0.687   0.860   1.185    1.325   1.725   2.086    2.528    2.845    3.552
21       0.257   0.437    0.686   0.859   1.183    1.323   1.721   2.080    2.518    2.831    3.527
22       0.256   0.437    0.686   0.858   1.182    1.321   1.717   2.074    2.508    2.819    3.505
23       0.256   0.436    0.685   0.858   1.180    1.319   1.714   2.069    2.500    2.807    3.485
24       0.256   0.436    0.685   0.857   1.179    1.318   1.711   2.064    2.492    2.797    3.467
25       0.256   0.436    0.684   0.856   1.178    1.316   1.708   2.060    2.485    2.787    3.450
26       0.256   0.436    0.684   0.856   1.177    1.315   1.706   2.056    2.479    2.779    3.435
27       0.256   0.435    0.684   0.855   1.176    1.314   1.703   2.052    2.473    2.771    3.421
28       0.256   0.435    0.683   0.855   1.175    1.313   1.701   2.048    2.467    2.763    3.408
29       0.256   0.435    0.683   0.854   1.174    1.311   1.699   2.045    2.462    2.756    3.396
30       0.256   0.435    0.683   0.854   1.173    1.310   1.697   2.042    2.457    2.750    3.385
35       0.255   0.434    0.682   0.852   1.170    1.306   1.690   2.030    2.438    2.724    3.340
40       0.255   0.434    0.681   0.851   1.167    1.303   1.684   2.021    2.423    2.704    3.307
45       0.255   0.434    0.680   0.850   1.165    1.301   1.679   2.014    2.412    2.690    3.281
50       0.255   0.433    0.679   0.849   1.164    1.299   1.676   2.009    2.403    2.678    3.261
55       0.255   0.433    0.679   0.848   1.163    1.297   1.673   2.004    2.396    2.668    3.245
60       0.254   0.433    0.679   0.848   1.162    1.296   1.671   2.000    2.390    2.660    3.232
∞        0.253   0.431    0.674   0.842   1.150    1.282   1.645   1.960    2.326    2.576    3.090
The empirical variance S 2 of a sample is also a random variable. Its distribution can be
calculated for the case that the characteristic is approximately normally distributed in the
basic population with the mean µ and the standard deviation σ – and the individual
samples are drawn independently (i.e. with replacement).
Theorem 12:
The quantity
$$n\,\frac{S^2}{\sigma^2} = \chi_{n-1}^2$$
is chi-square distributed with n − 1 degrees of freedom.
It follows that
$$P\!\left(\chi_{n-1;[\alpha/2]}^2 < \frac{n S^2}{\sigma^2} \le \chi_{n-1;[1-\alpha/2]}^2\right) = 1 - \alpha\,.$$
Definition: Confidence interval for σ² for small samples with normally distributed basic population:
$$\mathrm{CI}(\sigma^2, 1 - \alpha) = \left[\frac{n \cdot s^2}{\chi_{\text{upper}}^2},\; \frac{n \cdot s^2}{\chi_{\text{lower}}^2}\right]$$
with the quantiles $\chi_{\text{lower}}^2 = \chi_{n-1;[\alpha/2]}^2$ and $\chi_{\text{upper}}^2 = \chi_{n-1;[1-\alpha/2]}^2$.
[Figure: chi-square density with central area 1 − α between χ²lower and χ²upper and tails of α/2 each]
Task:
From a sample of size n = 30 from a normally distributed basic population, the empirical
variance s2 = 225 is obtained. For the variance σ 2 of the basic population a point
estimator as well as a confidence interval at a confidence level of 0.95 shall be given.
Solution:
1. Point estimate for the variance of the basic population:
$$\hat{\sigma}^2 = \frac{30}{29} \cdot 225 = 232.76$$
2. From the table of the chi-square distribution with 29 degrees of freedom we find the two values χ²29;[0.025] = 16.047 and χ²29;[0.975] = 45.722.
3. Confidence interval:
$$\mathrm{CI}(\sigma^2, 0.95) = \left[\frac{30 \cdot 225}{45.722};\; \frac{30 \cdot 225}{16.047}\right] = [147.6;\; 420.6]$$
In the quantile table of the chi-square distribution we read off χ²29;[0.025] = 16.047 and χ²29;[0.975] = 45.722.

df   χ²[0.005] χ²[0.01]  χ²[0.025] χ²[0.05]  χ²[0.1]   χ²[0.9]   χ²[0.95]  χ²[0.975] χ²[0.99]  χ²[0.995]
1    0.000     0.000     0.001     0.004     0.016     2.706     3.841     5.024     6.635     7.879
2    0.010     0.020     0.051     0.103     0.211     4.605     5.991     7.378     9.210     10.597
3    0.072     0.115     0.216     0.352     0.584     6.251     7.815     9.348     11.345    12.838
4    0.207     0.297     0.484     0.711     1.064     7.779     9.488     11.143    13.277    14.860
5    0.412     0.554     0.831     1.145     1.610     9.236     11.070    12.833    15.086    16.750
…
25   10.520    11.524    13.120    14.611    16.473    34.382    37.652    40.646    44.314    46.928
26   11.160    12.198    13.844    15.379    17.292    35.563    38.885    41.923    45.642    48.290
27   11.808    12.879    14.573    16.151    18.114    36.741    40.113    43.195    46.963    49.645
28   12.461    13.565    15.308    16.928    18.939    37.916    41.337    44.461    48.278    50.993
29   13.121    14.256    16.047    17.708    19.768    39.087    42.557    45.722    49.588    52.336
30   13.787    14.953    16.791    18.493    20.599    40.256    43.773    46.979    50.892    53.672
35   17.192    18.509    20.569    22.465    24.797    46.059    49.802    53.203    57.342    60.275
40   20.707    22.164    24.433    26.509    29.051    51.805    55.758    59.342    63.691    66.766
45   24.311    25.901    28.366    30.612    33.350    57.505    61.656    65.410    69.957    73.166
50   27.991    29.707    32.357    34.764    37.689    63.167    67.505    71.420    76.154    79.490
55   31.735    33.570    36.398    38.958    42.060    68.796    73.311    77.380    82.292    85.749
60   35.534    37.485    40.482    43.188    46.459    74.397    79.082    83.298    88.379    91.952
70   43.275    45.442    48.758    51.739    55.329    85.527    90.531    95.023    100.425   104.215
80   51.172    53.540    57.153    60.391    64.278    96.578    101.879   106.629   112.329   116.321
90   59.196    61.754    65.647    69.126    73.291    107.565   113.145   118.136   124.116   128.299
100  67.328    70.065    74.222    77.929    82.358    118.498   124.342   129.561   135.807   140.169
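The chi-square quantiles and the confidence interval for σ² can also be obtained programmatically; a sketch assuming scipy:

```python
from scipy.stats import chi2

n, s2, alpha = 30, 225.0, 0.05

q_lo = chi2.ppf(alpha / 2, df=n - 1)       # 16.047
q_hi = chi2.ppf(1 - alpha / 2, df=n - 1)   # 45.722
print(n * s2 / q_hi, n * s2 / q_lo)        # [147.6, 420.6]
```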
Sample variance
$$s^2 = \frac{1}{n}\sum_{j=1}^{n} (x_j - \bar{x})^2$$
After a sample is drawn, the quantities x̄ and s² are always calculated. Here n is the sample size and
$$\bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j\,.$$
The variance in the basic population
$$\sigma^2 = \frac{1}{N}\sum_{j=1}^{N} (x_j - \mu)^2$$
with the size N of the basic population and its arithmetic mean
$$\mu = \frac{1}{N}\sum_{j=1}^{N} x_j$$
is usually unknown when estimating and testing. Only after a census of the characteristic X could µ and σ² be calculated in this way.
Estimated variance in the basic population
$$\hat{\sigma}^2 = \frac{n}{n-1}\, s^2$$
This estimation formula yields an unbiased estimate for σ² (given independence). Here, n − 1 is the number of degrees of freedom.
The variance of the sample mean
$$V(\bar{X}) = \sigma_{\bar{X}}^2 = \frac{\sigma^2}{n}$$
can be calculated in the case that the variance in the basic population is known. Otherwise, we calculate the estimated variance of the sample mean,
$$\hat{V}(\bar{X}) = \hat{\sigma}_{\bar{X}}^2 = \frac{\hat{\sigma}^2}{n}\,.$$
1. Large sample with known variance:
$$\mathrm{CI}(\mu, 1 - \alpha) = [\bar{x} - z\,\sigma_{\bar{X}},\; \bar{x} + z\,\sigma_{\bar{X}}] \quad \text{with } z = z_{[1-\alpha/2]}\,.$$
2. Large sample with unknown variance:
$$\mathrm{CI}(\mu, 1 - \alpha) = [\bar{x} - z\,\hat{\sigma}_{\bar{X}},\; \bar{x} + z\,\hat{\sigma}_{\bar{X}}] \quad \text{with } z = z_{[1-\alpha/2]}\,.$$
3. Small sample with normally distributed basic population and known variance is to be treated as 1., since according to Theorem 11 item 1. the sample mean X̄ is also normally distributed:
$$\mathrm{CI}(\mu, 1 - \alpha) = [\bar{x} - z\,\sigma_{\bar{X}},\; \bar{x} + z\,\sigma_{\bar{X}}] \quad \text{with } z = z_{[1-\alpha/2]}\,.$$
4. Small sample with normally distributed basic population and unknown variance:
$$\mathrm{CI}(\mu, 1 - \alpha) = [\bar{x} - t\,\hat{\sigma}_{\bar{X}},\; \bar{x} + t\,\hat{\sigma}_{\bar{X}}] \quad \text{with } t = t_{n-1;[1-\alpha/2]}\,.$$
Confidence interval for the variance σ² for small samples with normally distributed basic population:
$$\mathrm{CI}(\sigma^2, 1 - \alpha) = \left[\frac{n \cdot s^2}{\chi_{\text{upper}}^2},\; \frac{n \cdot s^2}{\chi_{\text{lower}}^2}\right]$$
with the quantiles χ²lower = χ²n−1;[α/2] and χ²upper = χ²n−1;[1−α/2] of the chi-square distribution.
Control questions
1. About which random variable does the sampling distribution provide information?
2. What properties should samples have in order to provide reliable information about
the basic population?
3. What is the role of the central limit theorem in estimation, and what is the role of the
limit theorem of DE MOIVRE and LAPLACE?
4. When is a sample considered to be „large“?
5. What does a confidence interval provide information about?
6. How are the chi-square distribution and the normal distribution related?
7. Why is the t distribution mostly tabulated only up to n = 100?
8. If the sample size is too small, one must use the t distribution. Is this statement
correct without restrictions?
9. When do you use the t distribution to determine a confidence interval? Which
quantity is t-distributed in these cases?
Definition:
Null hypothesis or initial hypothesis
H0 : θ = θ0
H0 can be right or wrong. In any case, it will be retained until sufficient evidence is
provided to the contrary (sample).
Alternative hypothesis
H1 : θ ̸= θ0
H0 : θ ≤ θ 0 against H1 : θ > θ0
or
H0 : θ ≥ θ 0 against H1 : θ < θ0 .
Test decision   | H0 is correct              | H0 is wrong
retain H0       | o.k.                       | Type 2 error (β-error)
reject H0       | Type 1 error (α-error)     | o.k.

Definition:
Type 1 error: the null hypothesis is rejected even though it is correct.
Type 2 error: the null hypothesis is retained even though it is wrong.
Test procedure :
1. Formulate hypothesis: H0 vs. H1
2. Calculate test statistic/test quantity (from sample) T (x1 , . . . , xn )
3. Determine critical values and rejection region A.
4. Test decision
T (x1 , . . . , xn ) ∈ A ⇒ reject H0
T (x1 , . . . , xn ) ̸∈ A ⇒ retain H0
Null hypothesis: H0 : µ = µ0
Calculate: x̄
Deviation: |x̄ − µ0 | > 0
Rejection regions should be constructed such that the probability of the sample mean x̄ falling within the rejection region, even though H0 is correct, is at most α.
[Figure: two-sided test – rejection region (α/2), acceptance region (1 − α), rejection region (α/2) around µ0; note the difference between two-sided and one-sided questions]
[Figure: sampling distribution f(x̄) under the condition that the expected value of the basic population is µ = µ0]
upper-sided test: H0: µ ≤ µ0 vs. H1: µ > µ0, with the rejection region (probability α) to the right of the acceptance region
lower-sided test: H0: µ ≥ µ0 vs. H1: µ < µ0, with the rejection region to the left
Definition:
The probability of the type 1 error,
$$P(\bar{X} \in A \mid H_0 \text{ right}) = \alpha\,,$$
is called the significance level. A denotes the rejection region.
Two-sided test:
In the two-sided test, the rejection region is symmetrically arranged on both sides of
the acceptance region.
Assumption: variance σ² known
The standardized test variable $\frac{\bar{X} - \mu}{\sigma_{\bar{X}}}$ is standard normally distributed under H0.
For large samples (and for small samples with a normally distributed basic population):
$$P\!\left(\left|\frac{\bar{X} - \mu}{\sigma_{\bar{X}}}\right| > z_{[1-\alpha/2]} \;\Big|\; \mu = \mu_0\right) = \alpha$$
Test procedure:
1. Formulate hypothesis (two-sided): H0: µ = 0.4 L vs. H1: µ ≠ 0.4 L
2. To calculate the test statistic, we need the variance of the sample mean:
$$\sigma_{\bar{X}}^2 = \frac{0.0064\ \text{L}^2}{50} = 0.000\,128\ \text{L}^2 \;\Rightarrow\; \sigma_{\bar{X}} = \sqrt{0.000\,128\ \text{L}^2} = 0.011\,31\ \text{L}$$
3. Calculate the test quantity:
$$T(x_1, \ldots, x_n) = \frac{\bar{x} - \mu_0}{\sigma_{\bar{X}}} = \frac{0.38 - 0.4}{0.011\,31} = -1.77$$
4. Test decision: since −1.96 < −1.77 ≤ 1.96, the test quantity falls into the acceptance region ⇒ retain H0.
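The two-sided GAUSS test as a Python sketch, assuming scipy for the quantile:

```python
from math import sqrt
from scipy.stats import norm

x_bar, mu0, sigma2, n, alpha = 0.38, 0.40, 0.0064, 50, 0.05

T = (x_bar - mu0) / sqrt(sigma2 / n)   # -1.77
z_crit = norm.ppf(1 - alpha / 2)       # 1.96
print(T, abs(T) > z_crit)              # False -> retain H0
```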
One-sided test:
In the one-sided test, the rejection region is not symmetrically arranged on the two
sides of the acceptance region.
Test procedure:
1. Formulate hypothesis (lower-sided): H0: µ ≥ 0.4 L vs. H1: µ < 0.4 L
2. To calculate the test statistic, we need the variance of the sample mean:
$$\sigma_{\bar{X}}^2 = \frac{0.0064\ \text{L}^2}{50} = 0.000\,128\ \text{L}^2 \;\Rightarrow\; \sigma_{\bar{X}} = 0.011\,31\ \text{L}$$
3. Calculate the test quantity:
$$T(x_1, \ldots, x_n) = \frac{\bar{x} - \mu_0}{\sigma_{\bar{X}}} = \frac{0.38 - 0.4}{0.011\,31} = -1.77$$
4. Test decision: the rejection region now lies entirely to the left, with critical value z[0.05] = −1.645 instead of ±1.96; since T = −1.77 < −1.645, the test quantity falls into the new rejection region ⇒ reject H0.
This means that a different test decision is made here than in the previous example.
A small retail chain knows from experience that the average sales
of its 48 stores are 25 % higher in December than in November.
On New Year’s Eve, a small random sample of n = 8 stores is
hastily drawn. It yields the following sales increases in percent:
i 1 2 3 4 5 6 7 8
Question: At a significance level of 5 %, can the null hypothesis that the average
increase in sales was 25 % be retained?
First we calculate
$$\bar{x} = 25.8625\ \% \qquad \text{and} \qquad s_X^2 = 4.0548\ \%^2\,.$$
Test procedure:
1. Formulate hypothesis (two-sided): H0 : µ = 25 % vs. H1 : µ ̸= 25 %
2. To calculate the test statistic, we need the estimated variance or standard deviation of the sample mean, respectively:
$$\hat{\sigma}_X^2 = \frac{8}{7} \cdot 4.0548\ \%^2 = 4.634\,06\ \%^2 \;\Rightarrow\; \sigma_{\bar{X}}^2 = \frac{\hat{\sigma}_X^2}{8} = 0.579\,26\ \%^2 \;\Rightarrow\; \sigma_{\bar{X}} = 0.7611\ \%$$
3. Calculate the test quantity:
$$T(x_1, \ldots, x_n) = \frac{\bar{x} - \mu_0}{\hat{\sigma}_{\bar{X}}} = \frac{25.8625\ \% - 25\ \%}{0.7611\ \%} = 1.1332$$
4. Test decision: the critical value of the t-test with 7 degrees of freedom is t7;[0.975] = 2.365; since |1.1332| < 2.365 ⇒ retain H0.
[Figure: chi-square test – acceptance region between χ²lower and χ²upper, rejection regions of α/2 each]
as the gasoline consumption and now claims that the standard deviation in consumption
is way too large with 0.35 L/100 km. Is this true?
Task: Assume that the basic population is normally distributed. Test the hypothesis that
the standard deviation is at most 0.3 L/100 km, as required, at a significance level
of 10 %.
$$T(x_1, \ldots, x_n) = n\,\frac{s^2}{\sigma_0^2} = \frac{30 \cdot 0.1188}{0.09} = 39.6$$
3. Critical value for α = 10 % and 29 degrees of freedom (upper-sided test): χ²29;[0.90] = 39.087.
4. Test decision: 39.6 > 39.087 ⇒ reject H0.
If the variance is known (small sample with normally distributed basic population, or large sample):
⇒ the test quantity $T(x_1, \ldots, x_n) = \frac{\bar{x} - \mu}{\sigma_{\bar{X}}}$ is standard normally distributed, with $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$.
If the variance is unknown and it is a large sample:
⇒ the test quantity $T(x_1, \ldots, x_n) = \frac{\bar{x} - \mu}{\hat{\sigma}_{\bar{X}}}$ is approximately standard normally distributed, with $\hat{\sigma}_{\bar{X}} = \frac{\hat{\sigma}}{\sqrt{n}}$.
Then
hypothesis       H0: µ = µ0          H0: µ ≤ µ0         H0: µ ≥ µ0
                 H1: µ ≠ µ0          H1: µ > µ0         H1: µ < µ0
test quantity    T(x1, ..., xn) = (x̄ − µ)/σX̄  or  T(x1, ..., xn) = (x̄ − µ)/σ̂X̄
critical value   k =                 k =                k =
GAUSS test:      z[1−α/2]            z[1−α]             z[α] = −z[1−α]
t-test:          tn−1;[1−α/2]        tn−1;[1−α]         tn−1;[α] = −tn−1;[1−α]
Tests regarding the variance: chi-square test
hypothesis       H0: σ = σ0          H0: σ ≤ σ0         H0: σ ≥ σ0
                 H1: σ ≠ σ0          H1: σ > σ0         H1: σ < σ0
test quantity    T = T(x1, ..., xn) = n·s²/σ0²
critical value   χ²lower = χ²n−1;[α/2]    χ²upper = χ²n−1;[1−α]    χ²lower = χ²n−1;[α]
                 χ²upper = χ²n−1;[1−α/2]
Comparison of two means
Assume two independent samples of size n1 and n2
with the means x̄1 and x̄2
were taken. We will now test the hypothesis whether the two samples originate from the same basic
population or at least are taken from populations with the same mean:
H0 : µ 1 = µ 2 vs. H1 : µ1 ̸= µ2
The difference
$$\Delta = \bar{X}_1 - \bar{X}_2$$
is approximately normally distributed under the null hypothesis if the sample sizes are large (CLT) or if the characteristic is normally distributed in the basic population. Then it holds that
$$E(\Delta) = 0 \qquad \text{and} \qquad V(\Delta) = \sigma_\Delta^2 = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$$
if the samples are independent, and hence
$$\sigma_\Delta = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\,.$$
Just as with the G AUSS test for one sample, there are one-sided tests as well:
If the variance or variances of the basic populations have to be estimated, one uses
$$\hat{\sigma}_\Delta = \sqrt{\frac{\hat{\sigma}_1^2}{n_1} + \frac{\hat{\sigma}_2^2}{n_2}}$$
in the two-sample GAUSS test accordingly for large samples.
For small samples from normally distributed basic populations, the additional condition
$$\sigma_1^2 = \sigma_2^2 = \sigma^2$$
must hold. The common variance is then estimated from both samples (pooled),
$$\hat{\sigma}^2 = \frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2 - 2}\,,$$
and used to calculate
$$\hat{\sigma}_\Delta = \sqrt{\frac{\hat{\sigma}^2}{n_1} + \frac{\hat{\sigma}^2}{n_2}} = \hat{\sigma}\sqrt{\frac{n_1 + n_2}{n_1 \cdot n_2}}\,.$$
Example:
Stiftung Warentest praises the new car tire »Super ZZ«. It is said to have more than 10 % higher
mileage than its predecessor »Z«. The organization has tested four sets of each type of tire and
obtained the following result:
Mileage in km
»Super ZZ« »Z«
50 000 43 000
41 000 44 000
40 000 36 000
49 000 37 000
We test the null hypothesis that the new tire X1 has no higher mileage than the old tire X2 at a
significance level of 5 %:
346 - 6
The sample variances (with mileage measured in 1000 km) are
s1² = (1/4) · [(50 − 45)² + (41 − 45)² + (40 − 45)² + (49 − 45)²]
    = (1/4) · [5² + 4² + 5² + 4²] = 82/4 = 20.5
and
s2² = (1/4) · [(43 − 40)² + (44 − 40)² + (36 − 40)² + (37 − 40)²]
    = (1/4) · [3² + 4² + 4² + 3²] = 50/4 = 12.5 .
The pooled variance estimate is
σ̂² = (4 s1² + 4 s2²)/(4 + 4 − 2) = (82 + 50)/6 = 22 .
346 - 7
Test procedure:
1. Formulate hypothesis (one-sided): H0 : µ1 ≤ µ2 vs. H1 : µ1 > µ2
2. Calculate the test quantity, with σ̂∆ = σ̂ · √((4 + 4)/(4 · 4)) = √11 = 3.3166:
T = (x̄1 − x̄2 )/σ̂∆ = 5/3.3166 = 1.5076
3. Critical value: k = tn1 +n2 −2;[1−α] = t6;[0.95] = 1.943
4. Test decision:
1.5076 < 1.943 ⇒ retain H0 !
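For comparison (an illustrative sketch, not part of the original script), scipy's pooled two-sample t test reproduces this result; its internal pooling (sums of squared deviations divided by n1 + n2 − 2) coincides with the n·s² pooling used above:

    from scipy import stats

    super_zz = [50, 41, 40, 49]   # mileage in 1000 km
    tire_z = [43, 44, 36, 37]

    T, p_two_sided = stats.ttest_ind(super_zz, tire_z, equal_var=True)
    p_one_sided = p_two_sided / 2          # one-sided alternative H1: mu1 > mu2
    print(round(T, 4))                     # 1.5076
    print(p_one_sided > 0.05)              # True -> retain H0 at the 5 % level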
346 - 8
Comparison of two variances
We have already learned about the t-distribution and the chi-square distribution in the last chapter.
These are not really suitable for stochastic models, otherwise we would have already covered them
in chapter 9 on special distributions. Rather, they are so-called test distributions, which are very
useful in estimation and testing. Another test distribution is the so-called F-distribution:
F^m_n := (χ²_m / m) / (χ²_n / n) ,
i.e. the ratio of two independent χ²-distributed random variables, each divided by its degrees of freedom.
347 - 1
For the quantiles of the F-distribution we get
F^m_{n;[α]} = 1 / F^n_{m;[1−α]} .
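This reciprocity can be checked numerically; a short Python sketch (illustrative, scipy assumed; scipy calls the numerator degrees of freedom dfn and the denominator degrees of freedom dfd):

    from scipy import stats

    m, n, alpha = 20, 12, 0.025
    left = stats.f.ppf(alpha, dfn=m, dfd=n)             # F^m_{n;[alpha]}
    right = 1.0 / stats.f.ppf(1 - alpha, dfn=n, dfd=m)  # 1 / F^n_{m;[1-alpha]}
    print(round(left, 4), round(right, 4))              # both approx. 0.374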
(Figure: densities of the F-distribution for m = 20, n = 20 and for m = 6, n = 4.)
347 - 2
Now, from two different independent samples of size n1 and n2 , respectively, the variances s12 and s22
have been calculated. The aim is to check whether both samples are taken from basic populations
with the same variance:
H0 : σ1² = σ2² (= σ²) vs. H1 : σ1² ̸= σ2² .
Under H0 ,
n1 S1²/σ² ∼ χ²n1 −1 and n2 S2²/σ² ∼ χ²n2 −1 ,
so the quotient
[n1 S1²/(n1 − 1)] / [n2 S2²/(n2 − 1)] = F^{n1 −1}_{n2 −1}
is F-distributed with n1 − 1 and n2 − 1 degrees of freedom. Of course, this only works if the characteristic is normally distributed in the basic population.
Since the F-distribution is not symmetric, two critical values have to be used for the two-sided question. They are placed in such a way that the risk of error α is divided equally between the two parts of the rejection region.
347 - 3
Test procedure F test :
1. Formulate hypothesis: H0 : σ12 = σ22 vs. H1 : σ12 ̸= σ22
2. Test statistic/test quantity: T = [n1 s1²/(n1 − 1)] / [n2 s2²/(n2 − 1)]
3. Critical values: Flower = F^{n1 −1}_{n2 −1;[α/2]} and Fupper = F^{n1 −1}_{n2 −1;[1−α/2]}
4. Test decision:
If T < Flower or T > Fupper ⇒ reject H0
Example:
In a random sample of 21 newly issued AAA-rated corporate bonds, the maturity had a variance of
58.35 (years2 ). In contrast, in a random sample of 13 newly issued corporate bonds rated CCC, the
variance in maturity was only 4.69. Is this difference significant?
347 - 4
Test procedure:
1. Formulate hypothesis: H0 : σ1² = σ2² vs. H1 : σ1² ̸= σ2²
2. Test quantity:
T = [21 · 58.35/20] / [13 · 4.69/12] = 61.27/5.08 = 12.06
3. Critical values:
Flower = F^{20}_{12;[0.025]} = 1/F^{12}_{20;[0.975]} = 1/2.676 = 0.374
Fupper = F^{20}_{12;[0.975]} = 3.0728
4. Test decision:
12.06 > 3.073 ⇒ reject H0 !
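The same F test in a short Python sketch (illustrative, not part of the script; the n/(n − 1) factors convert the 1/n-convention variances used above):

    from scipy import stats

    n1, s1_sq = 21, 58.35   # AAA bonds
    n2, s2_sq = 13, 4.69    # CCC bonds
    alpha = 0.05

    T = (n1 * s1_sq / (n1 - 1)) / (n2 * s2_sq / (n2 - 1))        # 12.06
    f_lo = stats.f.ppf(alpha / 2, dfn=n1 - 1, dfd=n2 - 1)        # approx. 0.374
    f_hi = stats.f.ppf(1 - alpha / 2, dfn=n1 - 1, dfd=n2 - 1)    # approx. 3.073
    print(round(T, 2), T < f_lo or T > f_hi)                     # True -> reject H0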
347 - 5
Read from the table of quantiles of the F-distribution:
F^{20}_{12;[0.025]} = 1/F^{12}_{20;[0.975]} = 1/2.676 = 0.374
347 - 6
F^{20}_{12;[0.975]} = 3.073
347 - 7
Regression analysis
We recall section 4.1: Linear Regression. The task there was to compute a regression line
y = a + bx
Question: How can we test the individual parameters of the regression for statistical significance?
The term statistical significance usually refers to the test of whether the parameter a or b or the
correlation coefficient rXY is significantly different from zero. The significance of these parameters
can be tested with a t-test.
We will now briefly review what we learned in section 4.1 and then derive a test first for the correlation
coefficient rXY , then for the two parameters a and b.
348 - 1
regression line:  y = a + bx
model for the data points:  yi = a + bxi + ei
„deviation“ (residual):  ei := yi − ŷi
empirical correlation coefficient:  rXY := cXY /(sX · sY )
348 - 2
Let Ei be the random variable describing the error term in the i-th data point. In order to test the correlation coefficient rXY or the parameters a and b for their statistical significance, a few assumptions have to be made about the distribution of the error terms Ei .
Assumptions:
1. There are no systematic influences on Y other than X :
E(Ei ) = 0 for all i
2. All error terms originate from a distribution with the same standard deviation, so-called homoscedasticity:
σ(Ei ) = const. for all i
3. The error terms are not correlated with each other:
Cov(Ei , Ej ) = 0 for all i ̸= j
4. The random variables Ei are normally distributed with N (0, σ 2 ) for all i.
348 - 3
t-test for the correlation coefficient
Question: When is the correlation coefficient rXY significantly different from zero? In other words:
When does the variable X show a significant (linear) correlation with the variable Y ?
Example: In a random sample of 5 individuals, the following relationship is found between their annual income and the amount they spend on their annual vacation:
(The table with the five income/vacation pairs is not reproduced in this extract.)
348 - 4
Test procedure:
1. Formulate hypothesis: H0 : ρXY = 0 vs. H1 : ρXY ̸= 0
2. Test quantity: T = rXY · √(n − 2)/√(1 − rXY²) , which is t-distributed with n − 2 degrees of freedom; for this sample, T = 6.485.
3. Critical value: t3;[0.975] = 3.182
4. Test decision:
|6.485| > 3.182 ⇒ reject H0
Thus, we assume that the correlation coefficient rXY is significantly different from 0.
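A hedged Python sketch of this test (the five raw data pairs are not reproduced in this extract, so the test statistic T = 6.485 from the script is used directly; the helper function is ours):

    from math import sqrt
    from scipy import stats

    def corr_t_stat(r, n):
        """Test statistic for H0: rho = 0; t-distributed with n - 2 df."""
        return r * sqrt(n - 2) / sqrt(1 - r**2)

    n, alpha = 5, 0.05
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # 3.182
    T = 6.485                                       # value from the script's sample
    print(abs(T) > t_crit)                          # True -> reject H0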
348 - 5
t-test for the regression parameters a and b
Question: When are the intercept a and the slope b significantly different from zero?
For the regression line
y = a + bx
one is mostly interested in whether the influence of X on Y , which is estimated from a sample, is statistically significant (or just random).
How are the parameters a and b distributed? To answer this question, the four assumptions from a
few pages earlier are necessary.
Representing the regression equation in matrix form makes it easier to derive the test statistics (and
to move to multivariate regressions, i.e., multiple X ). We start with two independent variables X1 and
X2 to determine the matrix form. An extension to k columns is easily done.
Starting from the linear model y = a + b1 x1 + b2 x2 , for each data point (x1i , x2i , yi ) we obtain the
equation
yi = a + b1 x1i + b2 x2i + ei .
348 - 6
For n observations we obtain
y1 = a + b1 x11 + b2 x21 + e1
y2 = a + b1 x12 + b2 x22 + e2
...
yn = a + b1 x1n + b2 x2n + en
or, collecting the terms into vectors and a matrix,
y = X · b + e
with
y = (y1 , y2 , . . . , yn )ᵀ , e = (e1 , e2 , . . . , en )ᵀ , b = (a, b1 , b2 )ᵀ ,
and the n × 3 matrix X whose i-th row is (1, x1i , x2i ).
348 - 7
In the univariate case (only one X ), we estimated the regression parameters using the least squares
method:
∑ᵢ ei² = ∑ᵢ (yi − (a + bxi ))² −→ min (over a and b)
The same equation (for two or more X variables) in matrix form reads as
eᵀ · e = (y − X · b)ᵀ · (y − X · b) −→ min (over b)
We find the minimum by taking the partial derivative with respect to b and setting the gradient equal
to zero
d(eᵀ · e)/db = d[(y − X · b)ᵀ · (y − X · b)]/db = −2 · Xᵀ · y + 2 · Xᵀ · X · b = 0
and obtain the so-called normal equations,
Xᵀ · X · b = Xᵀ · y ,
348 - 8
Numerical example (to keep the calculation simple):
        3             1 3 5
        1             1 1 4
y =     8  ,    X =   1 5 6  .
        3             1 2 4
        5             1 4 6
With
            5  15   25                   20
Xᵀ · X =   15  55   81    and Xᵀ · y =   76  ,
           25  81  129                  109
as a solution of the LSE we obtain (for example, by applying the GAUSSian elimination method)
      a       4
b =   b1  =   2.5  .
      b2     −1.5
348 - 9
and hence the regression plane
y = 4 + 2.5x1 − 1.5x2 .
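The normal equations can also be solved directly in Python with numpy (an illustrative sketch, not part of the script):

    import numpy as np

    X = np.array([[1, 3, 5],
                  [1, 1, 4],
                  [1, 5, 6],
                  [1, 2, 4],
                  [1, 4, 6]], dtype=float)
    y = np.array([3, 1, 8, 3, 5], dtype=float)

    # Solve the normal equations X^T X b = X^T y:
    b = np.linalg.solve(X.T @ X, X.T @ y)
    print(b)   # [ 4.   2.5 -1.5]  ->  y = 4 + 2.5*x1 - 1.5*x2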
348 - 10
The solution of the normal equations Xᵀ · X · b = Xᵀ · y is therefore
b = (Xᵀ · X)⁻¹ · Xᵀ · y    (8)
We use β̂ = b as estimator for the model parameters and assume that the true relationship between
X and y is given by the linear model
y =X·β+u
with the unknown parameters β and the so-called disturbance terms (error variables) u. Substituting this into (8) yields
β̂ = b = (Xᵀ · X)⁻¹ · Xᵀ · (X · β + u)
       = (Xᵀ · X)⁻¹ · Xᵀ · X · β + (Xᵀ · X)⁻¹ · Xᵀ · u
       = β + (Xᵀ · X)⁻¹ · Xᵀ · u
348 - 11
What are the properties of the estimator β̂ ?
E(β̂) = E(β + (Xᵀ · X)⁻¹ · Xᵀ · u)
      = E(β) + E((Xᵀ · X)⁻¹ · Xᵀ · u)
      = β + (Xᵀ · X)⁻¹ · Xᵀ · E(u) = β ,  since E(u) = 0 .
For the covariance matrix, using E(u · uᵀ ) = σ² · I, we get
V(β̂) = E[(β̂ − E(β̂)) · (β̂ − E(β̂))ᵀ ]
      = E[(Xᵀ · X)⁻¹ · Xᵀ · u · uᵀ · X · (Xᵀ · X)⁻¹ ]
      = (Xᵀ · X)⁻¹ · Xᵀ · σ² · I · X · (Xᵀ · X)⁻¹ = σ² · (Xᵀ · X)⁻¹ ,
348 - 12
where σ² is estimated unbiasedly by
σ̂² = eᵀ · e / (n − (k + 1)) .
Here k is the number of independent variables (X1 , . . . , Xk ); together with the intercept a, there are k + 1 parameters to be estimated.
For the numerical example,
               26.7    4.5   −8
(Xᵀ · X)⁻¹ =    4.5    1    −1.5  .
                −8    −1.5    2.5
With
e = y − X · b = (−1, 0.5, 0.5, 0, 0)ᵀ
we estimate
σ̂² = eᵀ · e / (n − (k + 1)) = 1.5/(5 − 3) = 0.75
and finally obtain
                            20.025    3.375   −6
V(β̂) = σ̂² · (Xᵀ · X)⁻¹ =     3.375    0.75   −1.125  .
                             −6     −1.125    1.875
348 - 13
For the variances of the estimated parameters we need the diagonal of this matrix and get
V(a) = 20.025 ,  V(b1 ) = 0.75 ,  V(b2 ) = 1.875 .
For our numerical example, we perform a two-sided test at a significance level of 5 % to see if the
regression parameters are significantly different from zero.
H0 : a = 0 vs. H1 : a ̸= 0
test statistic: T = (4 − 0)/√20.025 = 0.89
348 - 14
H0 : b1 = 0 vs. H1 : b1 ̸= 0
test statistic: T = (2.5 − 0)/√0.75 = 2.89
H0 : b2 = 0 vs. H1 : b2 ̸= 0
test statistic: T = (−1.5 − 0)/√1.875 = −1.10
The critical value is k = t5−(2+1);[1−α/2] = t2;[0.975] = 4.303. Thus, none of the null hypotheses are
rejected, so the regression parameters are not significant.
Question: What happens to tn−(k +1);[0.975] as the sample size n becomes large? It converges to the standard normal quantile z[0.975] ≈ 1.96 ≈ 2.
This provides a simple rule of thumb:
If |T | > 2 ⇒ reject H0
Thus, when test statistics of regression parameters are greater than 2 or less than -2, they are
usually said to be significant. This would correspond to a two-sided test at a significance level of
approximately α = 0.05.
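The complete parameter tests of the numerical example in one Python sketch (illustrative only; numpy and scipy assumed, variable names are ours):

    import numpy as np
    from scipy import stats

    X = np.array([[1, 3, 5], [1, 1, 4], [1, 5, 6], [1, 2, 4], [1, 4, 6]], dtype=float)
    y = np.array([3, 1, 8, 3, 5], dtype=float)
    n, k = X.shape[0], X.shape[1] - 1

    b = np.linalg.solve(X.T @ X, X.T @ y)         # [4, 2.5, -1.5]
    e = y - X @ b                                 # residuals
    sigma2_hat = e @ e / (n - (k + 1))            # 0.75
    V = sigma2_hat * np.linalg.inv(X.T @ X)       # covariance matrix of b
    t_stats = b / np.sqrt(np.diag(V))             # [0.89, 2.89, -1.10]
    t_crit = stats.t.ppf(0.975, df=n - (k + 1))   # 4.303
    print(np.abs(t_stats) > t_crit)               # [False False False]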
348 - 15
Let us consider again the income/holiday example:
In a random sample of 5 individuals, the following relationship is found between their annual income
and the amount they spend on their annual vacation:
(The table with the five income/vacation pairs is not reproduced in this extract.)
348 - 16
The result of the regression analysis (the output table is not reproduced in this extract) is as follows.
The t-values in the output correspond to the test statistics or test variables and are often noted in
parentheses under the regression parameters:
y = −255 + 0.054x
(−0.51) (6.49)
The critical value is t3;[0.975] = 3.182. Thus, the intercept is not significant, but the slope of the
regression line is.
Interpretation of the (significant) coefficients? Individuals with higher income spend more on
vacations (in total about 5.4 % of salary)
348 - 17
Discussion: In the last example, are the requirements for the t-test satisfied?
1. There are no systematic influences on Y other than X :
E(Ei ) = 0 for all i
2. All error terms originate from a distribution with the same standard deviation, so-called homoscedasticity:
σ(Ei ) = const. for all i
3. The error terms are not correlated with each other:
Cov(Ei , Ej ) = 0 for all i ̸= j
4. The random variables Ei are normally distributed with N (0, σ 2 ) for all i.
348 - 18
Control questions
Index
approach
  logarithmic linear, 131
  quadratic, 132
  semi-logarithmic, 131
arrangement, 153
average, 38
bias, 354
binomial coefficient, 146, 148
box plot, 77
Central property, 39
characteristic value, 10
class intervals, 27
class limits, 27
class size, 27
coefficient
  of determination, 125
  of variation, 75
combination, 153
confidence
  interval, 379, 381
  level, 379
consistency, 355, 359
contingency table, 82
continuity correction, 343
correlation coefficient, 103
  BRAVAIS-PEARSON, 103
  empirical, 103
  rank, 108
  SPEARMAN, 108
covariance, 99
  empirical, 99
curtosis, 249
352 - 1
data
  paired, 81
data set, 17
decile, 56
density
  marginal, 85
  normal, 316
  probability, 358
  standard normal, 254, 311
352 - 7