Probabilistic models
Machine Learning
Andrey Filchenkov
13.06.2016
Lecture plan
• Bayesian classification
• Non-parametric density estimation
• Parametric density estimation
• Normal discriminant analysis
• Logistic regression
Example (Bayes' rule for a diagnostic test): suppose the test detects the condition with probability $\Pr(t=1 \mid d=1) = 0.95$, fires falsely with probability $\Pr(t=1 \mid d=0) = 0.05$, and the condition occurs with prior probability $\Pr(d=1) = 0.01$. Then

$$\Pr(d=1 \mid t=1) = \frac{\Pr(t=1 \mid d=1)\Pr(d=1)}{\Pr(t=1 \mid d=1)\Pr(d=1) + \Pr(t=1 \mid d=0)\Pr(d=0)} = \frac{0.95 \cdot 0.01}{0.95 \cdot 0.01 + 0.05 \cdot 0.99} \approx \mathbf{0.16}.$$
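A quick numeric check of this computation (the variable names are illustrative):

```python
# Bayes' rule for the example above: Pr(d=1 | t=1).
p_t_given_d1 = 0.95   # Pr(t=1 | d=1)
p_t_given_d0 = 0.05   # Pr(t=1 | d=0)
p_d1 = 0.01           # Pr(d=1)

posterior = (p_t_given_d1 * p_d1) / (
    p_t_given_d1 * p_d1 + p_t_given_d0 * (1 - p_d1))
print(round(posterior, 2))  # 0.16
```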
Probabilistic classification problem

Mean risk of classifier $a$:

$$R(a) = \sum_{y \in Y} \sum_{s \in Y} \lambda_{ys} \Pr(A_s, y),$$

where $A_s = \{x \in X : a(x) = s\}$ and $\lambda_{ys}$ is the loss incurred by assigning an object of class $y$ to class $s$.
The main equation

$$p(x, y) = p(x) \Pr(y \mid x) = \Pr(y)\, p(x \mid y).$$

Theorem
If $\Pr(y)$ and $p(x \mid y)$ are known, then the minimal mean risk $R(a)$ is achieved by the Bayesian classifier

$$a_{OB}(x) = \arg\min_{s \in Y} \sum_{y \in Y} \lambda_{ys} \Pr(y)\, p(x \mid y).$$
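A minimal sketch of this classifier for two one-dimensional Gaussian classes, with known priors, densities, and losses (all numbers here are illustrative, not from the lecture):

```python
# Optimal Bayesian classifier: a_OB(x) = argmin_s sum_y lambda_{ys} Pr(y) p(x|y).
from scipy.stats import norm

classes = [0, 1]
priors = {0: 0.7, 1: 0.3}                           # Pr(y), assumed known
densities = {0: norm(0.0, 1.0), 1: norm(2.0, 1.0)}  # p(x|y), assumed known
lam = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 5.0, (1, 1): 0.0}  # lambda_{ys}

def bayes_classify(x):
    risk = {s: sum(lam[y, s] * priors[y] * densities[y].pdf(x) for y in classes)
            for s in classes}
    return min(risk, key=risk.get)

print(bayes_classify(0.5), bayes_classify(1.8))  # 0 1
```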
The estimate $p_h(x)$ converges to $p(x)$.
Generalization to the multidimensional case

Estimate $p_h(x)$ with

$$p_h(x) = \frac{1}{m V(h)} \sum_{i=1}^{m} K\!\left(\frac{\rho(x, x_i)}{h}\right),$$

where $K$ is a kernel, $\rho$ is a metric on $X$, and $V(h)$ is the normalizing volume of the window.
Parzen window:

$$a(x; T^\ell, h) = \arg\max_{y \in Y} \lambda_y \Pr(y)\, \ell_y^{-1} \sum_{i: y_i = y} K\!\left(\frac{\rho(x, x_i)}{h}\right).$$

$$\Gamma_y(x) = \lambda_y \Pr(y)\, \ell_y^{-1} \sum_{i: y_i = y} K\!\left(\frac{\rho(x, x_i)}{h}\right)$$

is the closeness of $x$ to class $y$.
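A rough numpy sketch of this classifier with a Gaussian kernel, Euclidean metric, and unit losses (the data and function name are illustrative):

```python
# Parzen-window classification: score each class by
# lambda_y * Pr(y) * l_y^{-1} * sum_{i: y_i = y} K(rho(x, x_i) / h).
import numpy as np

def parzen_classify(x, X, y, h):
    classes = np.unique(y)
    scores = {}
    for c in classes:
        Xc = X[y == c]
        prior = len(Xc) / len(X)                 # Pr(y) from class frequencies
        r = np.linalg.norm(Xc - x, axis=1) / h   # rho(x, x_i) / h
        K = np.exp(-0.5 * r ** 2)                # Gaussian kernel
        scores[c] = prior * K.mean()             # lambda_y = 1 for all classes
    return max(scores, key=scores.get)

X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [2.8, 3.1]])
y = np.array([0, 0, 1, 1])
print(parzen_classify(np.array([0.1, 0.0]), X, y, h=1.0))  # 0
```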
Naive Bayes classifier:

$$a(x) = \arg\max_{y \in Y} \left( \ln \lambda_y \Pr(y) + \sum_{j=1}^{n} \ln p_j(\xi_j \mid y) \right).$$
Python: GaussianNB, MultinomialNB, BernoulliNB
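A short usage example of scikit-learn's Gaussian naive Bayes (the dataset is illustrative):

```python
# GaussianNB fits a per-class, per-feature normal density p_j(xi_j | y).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
print(clf.score(X_test, y_test))
```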
Likelihood:

$$L(\theta, T^\ell) = \prod_{i=1}^{\ell} \varphi(x_i, y_i, \theta).$$

MAP:

$$a_\theta(x) = \arg\max_y \varphi(x, y, \theta).$$

Take the logarithm:

$$-\ln L(\theta, T^\ell) = -\sum_{i=1}^{\ell} \ln \varphi(x_i, y_i, \theta) \to \min_\theta.$$

Define the loss function:

$$L(a_\theta, x) = -\ell \ln \varphi(x, y, \theta).$$
Maximum likelihood:

$$L(\theta; X^m) = \sum_{i=1}^{m} \ln \varphi(x_i; \theta) \to \max_\theta,$$

and its weighted version:

$$L(\theta; X^m, W^m) = \sum_{i=1}^{m} w_i \ln \varphi(x_i; \theta) \to \max_\theta,$$

where $W^m = \{w_1, \dots, w_m\}$ is the vector of object weights.
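A minimal sketch of (weighted) maximum likelihood for a one-dimensional Gaussian model, where the weighted optimum is available in closed form (the data are simulated):

```python
# Maximize sum_i w_i ln phi(x_i; theta) over theta = (mu, sigma^2).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=1000)
w = np.ones_like(x)                              # unit weights = ordinary ML

mu = np.sum(w * x) / np.sum(w)                   # weighted ML mean
sigma2 = np.sum(w * (x - mu) ** 2) / np.sum(w)   # weighted ML variance
print(mu, np.sqrt(sigma2))                       # close to the true (2.0, 0.5)
```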
Maximum joint likelihood principle

$$Q(a_\theta, T^\ell) = -\sum_{i=1}^{\ell} \ln \varphi(x_i, y_i, \theta) \to \min_\theta,$$

$$\varphi(x_i, y_i, \theta) = p(x_i, y_i \mid w)\, p(w, \gamma),$$

where $p(x_i, y_i \mid w)$ is a probabilistic data model, $p(w, \gamma)$ is the prior distribution of the model parameters, and $\gamma$ is a hyper-parameter. Hence

$$\sum_{i=1}^{\ell} \ln p(x_i, y_i \mid w) + \ln p(w, \gamma) \to \max_{w, \gamma}.$$
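As an illustrative aside (not from the slides): with a Gaussian prior $p(w, \gamma) \propto \exp(-\lVert w \rVert^2 / 2\gamma^2)$, the term $\ln p(w, \gamma)$ turns into an L2 penalty on the negative log-likelihood. A sketch for the logistic model:

```python
# Joint likelihood = data log-likelihood + log-prior; a Gaussian prior on w
# contributes the familiar quadratic (L2) penalty.
import numpy as np

def penalized_nll(w, X, y, gamma):
    """-sum_i ln p(y_i | x_i, w) - ln p(w, gamma), with y_i in {-1, +1}."""
    margins = y * (X @ w)
    nll = np.sum(np.log1p(np.exp(-margins)))   # logistic negative log-likelihood
    penalty = w @ w / (2 * gamma ** 2)         # from the Gaussian prior on w
    return nll + penalty
```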
Quadratic separating surface

Theorem:
If the class densities are normal, then:
1) the separating surface

$$\{x \in X \mid \lambda_y \Pr(y)\, p(x \mid y) = \lambda_s \Pr(s)\, p(x \mid s)\}$$

is quadratic;
2) if $\Sigma_{y_+} = \Sigma_{y_-}$, then it is linear.
Theorem:
The maximum weighted likelihood estimates, for each $y \in Y$, are

$$\mu_y = \frac{1}{W_y} \sum_{i: y_i = y} w_i x_i;$$

$$\Sigma_y = \frac{1}{W_y} \sum_{i: y_i = y} w_i (x_i - \mu_y)(x_i - \mu_y)^\top;$$

where $W_y = \sum_{i: y_i = y} w_i$.
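A numpy sketch of these per-class weighted estimates (the function name is illustrative):

```python
# Weighted class mean mu_y and covariance Sigma_y from the theorem above.
import numpy as np

def weighted_class_estimates(X, y, w, cls):
    mask = (y == cls)
    Xc, wc = X[mask], w[mask]
    W = wc.sum()                              # W_y = sum of in-class weights
    mu = (wc[:, None] * Xc).sum(axis=0) / W   # mu_y
    D = Xc - mu
    Sigma = (wc[:, None] * D).T @ D / W       # Sigma_y
    return mu, Sigma
```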
Quadratic discriminant:

$$a(x) = \arg\max_{y \in Y} \left( \ln \lambda_y \Pr(y) - \frac{1}{2} (x - \mu_y)^\top \Sigma_y^{-1} (x - \mu_y) - \frac{1}{2} \ln \det \Sigma_y \right).$$
Python: LinearDiscriminantAnalysis
QuadraticDiscriminantAnalysis
WEKA: none
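A usage sketch of the scikit-learn classes named above (the dataset is illustrative):

```python
# LDA assumes a shared covariance (linear surface); QDA fits one per class
# (quadratic surface), matching the theorem above.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)
print(lda.score(X, y), qda.score(X, y))
```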
Bayesian classifier:

$$a_{OB}(x) = \arg\max_{y \in Y} \lambda_y \Pr(y)\, p(x \mid y),$$

where $\lambda_y$ is the loss for misclassifying an object of class $y$.
Linear classifier:

$$a_w(x, T^\ell) = \operatorname{sign}\left( \sum_{i=1}^{n} w_i f_i(x) - w_0 \right),$$

where $w_1, \dots, w_n \in \mathbb{R}$ are feature weights. With a constant feature absorbing $w_0$, in vector form:

$$a_w(x, T^\ell) = \operatorname{sign} \langle w, x \rangle.$$
Empirical risk:

$$Q(a_\theta, T^\ell) = \frac{1}{\ell} \sum_{i=1}^{\ell} L(a_\theta, x_i) = -\sum_{i=1}^{\ell} \ln \varphi(x_i, y_i, \theta) \to \min_\theta.$$
Bayesian classifier for two classes:

$$a(x) = \operatorname{sign}\left( \lambda_+ \Pr(y_+ \mid x) - \lambda_- \Pr(y_- \mid x) \right) = \operatorname{sign}\left( \frac{p(x \mid y_+)}{p(x \mid y_-)} - \frac{\lambda_- \Pr(y_-)}{\lambda_+ \Pr(y_+)} \right).$$

The separating surface

$$\lambda_+ \Pr(y_+)\, p(x \mid y_+) = \lambda_- \Pr(y_-)\, p(x \mid y_-)$$

is linear.
Let $\theta = \Sigma^{-1} \mu$ and $\delta = \Sigma$. Then

$$\mathcal{N}(x; \mu, \Sigma) = \frac{\exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)}{\sqrt{(2\pi)^n \det \Sigma}} = \exp\left( \mu^\top \Sigma^{-1} x - \frac{1}{2} \mu^\top \Sigma^{-1} \Sigma \Sigma^{-1} \mu - \frac{1}{2} x^\top \Sigma^{-1} x - \frac{n}{2} \ln 2\pi - \frac{1}{2} \ln \det \Sigma \right),$$

i.e. the normal density is an exponential-family density with shift parameter $\theta$ and spread parameter $\delta$.
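A quick numeric check (illustrative) that the expanded form matches the standard multivariate normal density:

```python
# Compare exp(mu' S^-1 x - 1/2 mu' S^-1 S S^-1 mu - 1/2 x' S^-1 x
#             - n/2 ln 2pi - 1/2 ln det S) with scipy's pdf.
import numpy as np
from scipy.stats import multivariate_normal

n = 3
rng = np.random.default_rng(1)
mu, x = rng.normal(size=n), rng.normal(size=n)
A = rng.normal(size=(n, n))
Sigma = A @ A.T + n * np.eye(n)        # a random SPD covariance
Si = np.linalg.inv(Sigma)

expanded = np.exp(mu @ Si @ x - 0.5 * mu @ Si @ Sigma @ Si @ mu
                  - 0.5 * x @ Si @ x
                  - 0.5 * n * np.log(2 * np.pi)
                  - 0.5 * np.log(np.linalg.det(Sigma)))
print(np.isclose(expanded, multivariate_normal(mu, Sigma).pdf(x)))  # True
```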
Theorem:
If the class densities $p_y$ are exponential-family densities with equal spread parameters and $f_0(x) = \mathrm{const}$, then:
1) the Bayesian classifier

$$a(x) = \operatorname{sign}\left( \frac{p(x \mid y_+)}{p(x \mid y_-)} - \frac{\lambda_- \Pr(y_-)}{\lambda_+ \Pr(y_+)} \right)$$

is linear: $a(x) = \operatorname{sign}(\langle w, x \rangle - w_0)$, where $w_0 = \ln \frac{\lambda_-}{\lambda_+}$;
2) the posterior probabilities of the classes are

$$\Pr(y \mid x) = \sigma(\langle w, x \rangle y),$$

where $\sigma(s) = \frac{1}{1 + e^{-s}}$ is the logistic (sigmoid) function.
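This is the model behind logistic regression; a sketch checking that scikit-learn's predicted probabilities are exactly the sigmoid of the linear score (the dataset is illustrative):

```python
# Pr(y = +1 | x) = sigma(<w, x> + w_0) for a fitted logistic regression.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

s = X @ clf.coef_.ravel() + clf.intercept_[0]    # linear score <w, x> + w_0
manual = 1.0 / (1.0 + np.exp(-s))                # sigmoid of the score
print(np.allclose(manual, clf.predict_proba(X)[:, 1]))  # True
```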