Probabilistic models
Machine Learning
Andrey Filchenkov
13.06.2016
Lecture plan
• Bayesian classification
• Non-parametric density estimation
• Parametric density estimation
• Normal discriminant analysis
• Logistic regression
Example (Bayes' rule for a diagnostic test): suppose the test detects the condition with probability $\Pr(t=1 \mid d=1) = 0.95$, fires falsely with probability $\Pr(t=1 \mid d=0) = 0.05$, and the condition occurs with prior probability $\Pr(d=1) = 0.01$. Then

$$\Pr(d=1 \mid t=1) = \frac{\Pr(t=1 \mid d=1)\Pr(d=1)}{\Pr(t=1 \mid d=1)\Pr(d=1) + \Pr(t=1 \mid d=0)\Pr(d=0)} = \frac{0.95 \cdot 0.01}{0.95 \cdot 0.01 + 0.05 \cdot 0.99} \approx \mathbf{0.16}.$$
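A quick numeric check of this computation (the variable names are illustrative):

```python
# Bayes' rule for the example above: Pr(d=1 | t=1).
p_t_given_d1 = 0.95   # Pr(t=1 | d=1)
p_t_given_d0 = 0.05   # Pr(t=1 | d=0)
p_d1 = 0.01           # Pr(d=1)

posterior = (p_t_given_d1 * p_d1) / (
    p_t_given_d1 * p_d1 + p_t_given_d0 * (1 - p_d1))
print(round(posterior, 2))  # 0.16
```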
Probabilistic classification problem

Mean risk of classifier $a$:

$$R(a) = \sum_{y \in Y} \sum_{s \in Y} \lambda_{ys} \Pr(A_s, y),$$

where $A_s = \{x \in X : a(x) = s\}$ and $\lambda_{ys}$ is the loss incurred by assigning an object of class $y$ to class $s$.
The main equation

$$p(x, y) = p(x) \Pr(y \mid x) = \Pr(y)\, p(x \mid y).$$

Theorem
If $\Pr(y)$ and $p(x \mid y)$ are known, then the minimal mean risk $R(a)$ is achieved by the Bayesian classifier

$$a_{OB}(x) = \arg\min_{s \in Y} \sum_{y \in Y} \lambda_{ys} \Pr(y)\, p(x \mid y).$$
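A minimal sketch of this classifier for two one-dimensional Gaussian classes, with known priors, densities, and losses (all numbers here are illustrative, not from the lecture):

```python
# Optimal Bayesian classifier: a_OB(x) = argmin_s sum_y lambda_{ys} Pr(y) p(x|y).
from scipy.stats import norm

classes = [0, 1]
priors = {0: 0.7, 1: 0.3}                           # Pr(y), assumed known
densities = {0: norm(0.0, 1.0), 1: norm(2.0, 1.0)}  # p(x|y), assumed known
lam = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 5.0, (1, 1): 0.0}  # lambda_{ys}

def bayes_classify(x):
    risk = {s: sum(lam[y, s] * priors[y] * densities[y].pdf(x) for y in classes)
            for s in classes}
    return min(risk, key=risk.get)

print(bayes_classify(0.5), bayes_classify(1.8))  # 0 1
```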
The estimate $p_h(x)$ converges to $p(x)$.
Generalization to the multidimensional case

Estimate $p_h(x)$ with

$$p_h(x) = \frac{1}{m V(h)} \sum_{i=1}^{m} K\!\left(\frac{\rho(x, x_i)}{h}\right),$$

where $K$ is a kernel, $\rho$ is a metric on $X$, and $V(h)$ is the normalizing volume of the window.
Parzen window:

$$a(x; T^\ell, h) = \arg\max_{y \in Y} \lambda_y \Pr(y)\, \ell_y^{-1} \sum_{i: y_i = y} K\!\left(\frac{\rho(x, x_i)}{h}\right).$$

$$\Gamma_y(x) = \lambda_y \Pr(y)\, \ell_y^{-1} \sum_{i: y_i = y} K\!\left(\frac{\rho(x, x_i)}{h}\right)$$

is the closeness of $x$ to class $y$.
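A rough numpy sketch of this classifier with a Gaussian kernel, Euclidean metric, and unit losses (the data and function name are illustrative):

```python
# Parzen-window classification: score each class by
# lambda_y * Pr(y) * l_y^{-1} * sum_{i: y_i = y} K(rho(x, x_i) / h).
import numpy as np

def parzen_classify(x, X, y, h):
    classes = np.unique(y)
    scores = {}
    for c in classes:
        Xc = X[y == c]
        prior = len(Xc) / len(X)                 # Pr(y) from class frequencies
        r = np.linalg.norm(Xc - x, axis=1) / h   # rho(x, x_i) / h
        K = np.exp(-0.5 * r ** 2)                # Gaussian kernel
        scores[c] = prior * K.mean()             # lambda_y = 1 for all classes
    return max(scores, key=scores.get)

X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [2.8, 3.1]])
y = np.array([0, 0, 1, 1])
print(parzen_classify(np.array([0.1, 0.0]), X, y, h=1.0))  # 0
```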
Naive Bayes classifier:

$$a(x) = \arg\max_{y \in Y} \left( \ln \lambda_y \Pr(y) + \sum_{j=1}^{n} \ln p_j(\xi_j \mid y) \right).$$
Python: GaussianNB, MultinomialNB, BernoulliNB
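A short usage example of scikit-learn's Gaussian naive Bayes (the dataset is illustrative):

```python
# GaussianNB fits a per-class, per-feature normal density p_j(xi_j | y).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
print(clf.score(X_test, y_test))
```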
Likelihood:

$$L(\theta, T^\ell) = \prod_{i=1}^{\ell} \varphi(x_i, y_i, \theta).$$

MAP:

$$a_\theta(x) = \arg\max_y \varphi(x, y, \theta).$$

Take the logarithm:

$$-\ln L(\theta, T^\ell) = -\sum_{i=1}^{\ell} \ln \varphi(x_i, y_i, \theta) \to \min_\theta.$$

Define the loss function:

$$L(a_\theta, x) = -\ell \ln \varphi(x, y, \theta).$$
Maximum likelihood:

$$L(\theta; X^m) = \sum_{i=1}^{m} \ln \varphi(x_i; \theta) \to \max_\theta,$$

and its weighted version:

$$L(\theta; X^m, W^m) = \sum_{i=1}^{m} w_i \ln \varphi(x_i; \theta) \to \max_\theta,$$

where $W^m = \{w_1, \dots, w_m\}$ is the vector of object weights.
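A minimal sketch of (weighted) maximum likelihood for a one-dimensional Gaussian model, where the weighted optimum is available in closed form (the data are simulated):

```python
# Maximize sum_i w_i ln phi(x_i; theta) over theta = (mu, sigma^2).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=1000)
w = np.ones_like(x)                              # unit weights = ordinary ML

mu = np.sum(w * x) / np.sum(w)                   # weighted ML mean
sigma2 = np.sum(w * (x - mu) ** 2) / np.sum(w)   # weighted ML variance
print(mu, np.sqrt(sigma2))                       # close to the true (2.0, 0.5)
```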
Maximum joint likelihood principle

$$Q(a_\theta, T^\ell) = -\sum_{i=1}^{\ell} \ln \varphi(x_i, y_i, \theta) \to \min_\theta,$$

$$\varphi(x_i, y_i, \theta) = p(x_i, y_i \mid w)\, p(w, \gamma),$$

where $p(x_i, y_i \mid w)$ is a probabilistic data model, $p(w, \gamma)$ is the prior distribution of the model parameters, and $\gamma$ is a hyper-parameter. Hence

$$\sum_{i=1}^{\ell} \ln p(x_i, y_i \mid w) + \ln p(w, \gamma) \to \max_{w, \gamma}.$$
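As an illustrative aside (not from the slides): with a Gaussian prior $p(w, \gamma) \propto \exp(-\lVert w \rVert^2 / 2\gamma^2)$, the term $\ln p(w, \gamma)$ turns into an L2 penalty on the negative log-likelihood. A sketch for the logistic model:

```python
# Joint likelihood = data log-likelihood + log-prior; a Gaussian prior on w
# contributes the familiar quadratic (L2) penalty.
import numpy as np

def penalized_nll(w, X, y, gamma):
    """-sum_i ln p(y_i | x_i, w) - ln p(w, gamma), with y_i in {-1, +1}."""
    margins = y * (X @ w)
    nll = np.sum(np.log1p(np.exp(-margins)))   # logistic negative log-likelihood
    penalty = w @ w / (2 * gamma ** 2)         # from the Gaussian prior on w
    return nll + penalty
```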
Quadratic separating surface

Theorem:
If the class densities are normal, then:
1) the separating surface

$$\{x \in X \mid \lambda_y \Pr(y)\, p(x \mid y) = \lambda_s \Pr(s)\, p(x \mid s)\}$$

is quadratic;
2) if $\Sigma_{y_+} = \Sigma_{y_-}$, then it is linear.
Theorem:
The maximum weighted likelihood estimates, for each $y \in Y$, are

$$\mu_y = \frac{1}{W_y} \sum_{i: y_i = y} w_i x_i;$$

$$\Sigma_y = \frac{1}{W_y} \sum_{i: y_i = y} w_i (x_i - \mu_y)(x_i - \mu_y)^\top;$$

where $W_y = \sum_{i: y_i = y} w_i$.
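A numpy sketch of these per-class weighted estimates (the function name is illustrative):

```python
# Weighted class mean mu_y and covariance Sigma_y from the theorem above.
import numpy as np

def weighted_class_estimates(X, y, w, cls):
    mask = (y == cls)
    Xc, wc = X[mask], w[mask]
    W = wc.sum()                              # W_y = sum of in-class weights
    mu = (wc[:, None] * Xc).sum(axis=0) / W   # mu_y
    D = Xc - mu
    Sigma = (wc[:, None] * D).T @ D / W       # Sigma_y
    return mu, Sigma
```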
Quadratic discriminant:

$$a(x) = \arg\max_{y \in Y} \left( \ln \lambda_y \Pr(y) - \frac{1}{2} (x - \mu_y)^\top \Sigma_y^{-1} (x - \mu_y) - \frac{1}{2} \ln \det \Sigma_y \right).$$
Python: LinearDiscriminantAnalysis
QuadraticDiscriminantAnalysis
WEKA: none
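A usage sketch of the scikit-learn classes named above (the dataset is illustrative):

```python
# LDA assumes a shared covariance (linear surface); QDA fits one per class
# (quadratic surface), matching the theorem above.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)
print(lda.score(X, y), qda.score(X, y))
```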
Bayesian classifier:

$$a_{OB}(x) = \arg\max_{y \in Y} \lambda_y \Pr(y)\, p(x \mid y),$$

where $\lambda_y$ is the loss for misclassifying an object of class $y$.
Linear classifier:

$$a_w(x, T^\ell) = \operatorname{sign}\left( \sum_{i=1}^{n} w_i f_i(x) - w_0 \right),$$

where $w_1, \dots, w_n \in \mathbb{R}$ are feature weights. With a constant feature absorbing $w_0$, in vector form:

$$a_w(x, T^\ell) = \operatorname{sign} \langle w, x \rangle.$$
Empirical risk:

$$Q(a_\theta, T^\ell) = \frac{1}{\ell} \sum_{i=1}^{\ell} L(a_\theta, x_i) = -\sum_{i=1}^{\ell} \ln \varphi(x_i, y_i, \theta) \to \min_\theta.$$
Bayesian classifier for two classes:

$$a(x) = \operatorname{sign}\left( \lambda_+ \Pr(y_+ \mid x) - \lambda_- \Pr(y_- \mid x) \right) = \operatorname{sign}\left( \frac{p(x \mid y_+)}{p(x \mid y_-)} - \frac{\lambda_- \Pr(y_-)}{\lambda_+ \Pr(y_+)} \right).$$

The separating surface

$$\lambda_+ \Pr(y_+)\, p(x \mid y_+) = \lambda_- \Pr(y_-)\, p(x \mid y_-)$$

is linear.
Let $\theta = \Sigma^{-1} \mu$ and $\delta = \Sigma$. Then

$$\mathcal{N}(x; \mu, \Sigma) = \frac{\exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)}{\sqrt{(2\pi)^n \det \Sigma}} = \exp\left( \mu^\top \Sigma^{-1} x - \frac{1}{2} \mu^\top \Sigma^{-1} \Sigma \Sigma^{-1} \mu - \frac{1}{2} x^\top \Sigma^{-1} x - \frac{n}{2} \ln 2\pi - \frac{1}{2} \ln \det \Sigma \right),$$

i.e. the normal density is an exponential-family density with shift parameter $\theta$ and spread parameter $\delta$.
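A quick numeric check (illustrative) that the expanded form matches the standard multivariate normal density:

```python
# Compare exp(mu' S^-1 x - 1/2 mu' S^-1 S S^-1 mu - 1/2 x' S^-1 x
#             - n/2 ln 2pi - 1/2 ln det S) with scipy's pdf.
import numpy as np
from scipy.stats import multivariate_normal

n = 3
rng = np.random.default_rng(1)
mu, x = rng.normal(size=n), rng.normal(size=n)
A = rng.normal(size=(n, n))
Sigma = A @ A.T + n * np.eye(n)        # a random SPD covariance
Si = np.linalg.inv(Sigma)

expanded = np.exp(mu @ Si @ x - 0.5 * mu @ Si @ Sigma @ Si @ mu
                  - 0.5 * x @ Si @ x
                  - 0.5 * n * np.log(2 * np.pi)
                  - 0.5 * np.log(np.linalg.det(Sigma)))
print(np.isclose(expanded, multivariate_normal(mu, Sigma).pdf(x)))  # True
```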
Theorem:
If the class densities $p_y$ are exponential-family densities with equal spread parameters and $f_0(x) = \mathrm{const}$, then:
1) the Bayesian classifier

$$a(x) = \operatorname{sign}\left( \frac{p(x \mid y_+)}{p(x \mid y_-)} - \frac{\lambda_- \Pr(y_-)}{\lambda_+ \Pr(y_+)} \right)$$

is linear: $a(x) = \operatorname{sign}(\langle w, x \rangle - w_0)$, where $w_0 = \ln \frac{\lambda_-}{\lambda_+}$;
2) the posterior probabilities of the classes are

$$\Pr(y \mid x) = \sigma(\langle w, x \rangle y),$$

where $\sigma(s) = \frac{1}{1 + e^{-s}}$ is the logistic (sigmoid) function.
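This is the model behind logistic regression; a sketch checking that scikit-learn's predicted probabilities are exactly the sigmoid of the linear score (the dataset is illustrative):

```python
# Pr(y = +1 | x) = sigma(<w, x> + w_0) for a fitted logistic regression.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

s = X @ clf.coef_.ravel() + clf.intercept_[0]    # linear score <w, x> + w_0
manual = 1.0 / (1.0 + np.exp(-s))                # sigmoid of the score
print(np.allclose(manual, clf.predict_proba(X)[:, 1]))  # True
```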