
Lecture 5

Probabilistic models

Machine Learning
Andrey Filchenkov

13.06.2016
Lecture plan
• Bayesian classification
• Non-parametric density recovery
• Parametric density recovery
• Normal discriminant analysis
• Logistic regression

• The presentation is based on materials from K.V. Vorontsov’s course “Machine Learning”.
Lecture plan
• Bayesian classification
• Non-parametric density recovery
• Parametric density recovery
• Normal discriminant analysis
• Logistic regression



Problem

An illness affects 1% of the population. A test for this illness gives the correct answer in 95% of cases. Someone receives a positive result. What is the probability that they actually have the illness?



Problem: options

An illness affects 1% of the population. A test for this illness gives the correct answer in 95% of cases. Someone receives a positive result. What is the probability x that they actually have the illness?
97.5% ≤ x ≤ 100%
95% ≤ x < 97.5%
92% ≤ x < 95%
81% ≤ x < 92%
70% ≤ x < 81%
55% ≤ x < 70%
30% ≤ x < 55%
x < 30%



Problem: answer

An illness affects 1% of the population. A test for this illness gives the correct answer in 95% of cases. Someone receives a positive result. What is the probability that they actually have the illness?

Pr(d = 1 | t = 1) =
= Pr(t = 1 | d = 1) Pr(d = 1) / (Pr(t = 1 | d = 1) Pr(d = 1) + Pr(t = 1 | d = 0) Pr(d = 0)) =
= 0.95 × 0.01 / (0.95 × 0.01 + 0.05 × 0.99) ≈ 0.16.
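A quick numeric check of this Bayes-rule computation (a minimal sketch; the variable names are illustrative):

```python
# Bayes' rule for the illness test: Pr(d=1 | t=1) = Pr(t=1 | d=1) Pr(d=1) / Pr(t=1)
prevalence = 0.01       # Pr(d = 1)
accuracy = 0.95         # Pr(t = 1 | d = 1) = Pr(t = 0 | d = 0)

p_positive = accuracy * prevalence + (1 - accuracy) * (1 - prevalence)  # Pr(t = 1)
posterior = accuracy * prevalence / p_positive

print(round(posterior, 3))  # 0.161
```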
Probabilistic classification problem

Instead of an unknown target function y*(x), we consider an unknown probability distribution on X × Y with density p(x, y).

A simple, or independent identically distributed (i.i.d.), sample is a sample of ℓ independent random observations T^ℓ = {(x_i, y_i)}_{i=1}^ℓ.

Instead of algorithm models, we now have families of distributions {φ(x, y, θ) | θ ∈ Θ}.

Problem: find the algorithm that minimizes the probability of error.



Problem statement

a: X → Y splits X into non-overlapping domains A_y:
A_y = {x ∈ X | a(x) = y}.
An error occurs when an object x with true label y is assigned to A_s, s ≠ y.
Error probability: Pr(A_s, y) = ∫_{A_s} p(x, y) dx.
Error loss: λ_{ys} > 0 for all (y, s) ∈ Y × Y.
Usually λ_{yy} = 0 and λ_y = λ_{ys} = λ_{yt} for all s, t ∈ Y, s ≠ y, t ≠ y.

Mean risk of a:
R(a) = Σ_{y∈Y} Σ_{s∈Y} λ_{ys} Pr(A_s, y).
The main equation

p(x, y) = p(x) Pr(y|x) = Pr(y) p(x|y)

Pr(y) is the prior probability of class y.
p(x|y) is the likelihood of class y.
Pr(y|x) is the posterior probability of class y.



Two problems

First problem: probability density recovery
Given: T^ℓ = {(x_i, y_i)}_{i=1}^ℓ.
Problem: find empirical estimates of Pr(y) and p(x|y), y ∈ Y.

Second problem: mean risk minimization
Given:
• prior probabilities Pr(y),
• likelihoods p(x|y), y ∈ Y.
Problem: find the classifier a that minimizes R(a).



Maximum a posteriori probability

Let Pr(y) and p(x|y) be known for all y ∈ Y.

p(x, y) = p(x) Pr(y|x) = Pr(y) p(x|y).

Main idea: choose the class for which the object is most probable.

Maximum a posteriori probability (MAP):
a(x) = argmax_{y∈Y} Pr(y|x) = argmax_{y∈Y} Pr(y) p(x|y).

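As an illustration of the MAP rule, a minimal sketch for two one-dimensional Gaussian classes (the priors and class parameters below are made up for the example):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-class setup: priors Pr(y) and class likelihoods p(x|y)
priors = {0: 0.7, 1: 0.3}
likelihoods = {0: norm(loc=0.0, scale=1.0), 1: norm(loc=2.0, scale=1.0)}

def map_classify(x):
    # a(x) = argmax_y Pr(y) p(x|y)
    return max(priors, key=lambda y: priors[y] * likelihoods[y].pdf(x))

print(map_classify(0.5))  # 0: both the prior and the likelihood favour class 0
print(map_classify(2.5))  # 1: the likelihood outweighs the smaller prior
```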


Optimal Bayesian classifier

Theorem
If Pr(y) and p(x|y) are known, then the minimal mean risk R(a) is achieved by the Bayesian classifier
a_OB(x) = argmin_{s∈Y} Σ_{y∈Y} λ_{ys} Pr(y) p(x|y).

If λ_{yy} = 0 and λ_y = λ_{ys} = λ_{yt} for all s, t ∈ Y, s ≠ y, t ≠ y, then
a_OB(x) = argmax_{y∈Y} λ_y Pr(y) p(x|y).

The classifier a_OB(x) is the optimal Bayesian classifier.

The Bayesian risk is the minimal value of R(a).

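A sketch of the general risk-minimizing rule with an explicit loss matrix λ_{ys} (the numbers are illustrative; with zero diagonal and equal off-diagonal losses it reduces to the MAP-style rule above):

```python
import numpy as np

# Illustrative quantities at a fixed object x
prior = np.array([0.7, 0.3])          # Pr(y)
likelihood = np.array([0.05, 0.20])   # p(x|y) at this x
loss = np.array([[0.0, 1.0],          # loss[y, s] = lambda_ys: cost of predicting s when the truth is y
                 [5.0, 0.0]])         # missing class 1 is five times worse than a false alarm

# a_OB(x) = argmin_s sum_y lambda_ys Pr(y) p(x|y)
risk = loss.T @ (prior * likelihood)
print(risk, risk.argmin())            # [0.3 0.035] -> predict class 1
```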


Separating surface

The separating surface for classes a and b is the locus of points x ∈ X at which the maximum in the Bayesian decision rule is achieved for both y = a and y = b:
λ_a Pr(a) p(x|a) = λ_b Pr(b) p(x|b).



Lecture plan
• Bayesian classification
• Non-parametric density recovery
• Parametric density recovery
• Normal discriminant analysis
• Logistic regression



Two subproblems

The problem is to estimate the prior probability and the likelihood of each class:
Pr(y) = ?
p(x|y) = ?

The first subproblem is solved easily:
Pr(y) = |X_y| / ℓ, where X_y = {(x_i, y_i) ∈ T^ℓ | y_i = y}.

The second one is much more complex.
Instead of recovering p(x|y), we will recover p(x) from the subsample T^m = {(x_(1), s), …, (x_(m), s)} separately for each s ∈ Y.

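A minimal sketch of the first subproblem, estimating Pr(y) by class frequencies (the label array is hypothetical):

```python
import numpy as np

y = np.array([0, 0, 1, 0, 2, 1, 0])   # hypothetical labels y_i from T^l

classes, counts = np.unique(y, return_counts=True)
priors = dict(zip(classes, counts / len(y)))   # Pr(y) = |X_y| / l
print(priors)  # {0: 0.571..., 1: 0.286..., 2: 0.143...}
```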


One-dimensional case

If Pr is a probability measure on [a, b], then
p(x) = lim_{h→0} (1 / 2h) Pr([x − h, x + h]).

Empirical density estimate with a window of width h:
p_h(x) = (1 / 2mh) Σ_{i=1}^m [ |x − x_i| < h ].

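A minimal sketch of this rectangular-window estimate (the sample, evaluation point and bandwidth are made up):

```python
import numpy as np

def window_density(x, sample, h):
    # p_h(x) = 1/(2mh) * sum_i [|x - x_i| < h]
    return np.mean(np.abs(x - sample) < h) / (2 * h)

rng = np.random.default_rng(0)
sample = rng.normal(size=1000)                 # hypothetical one-dimensional data
print(window_density(0.0, sample, h=0.5))      # close to the true N(0, 1) density at 0 (~0.399)
```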


Parzen-Rosenblatt window

Empirical density estimation with a window of width h:
p_h(x) = (1 / 2hm) Σ_{i=1}^m [ |x − x_i| / h < 1 ].

Parzen-Rosenblatt estimation for a window with width h:
p_h(x) = (1 / hm) Σ_{i=1}^m K( (x − x_i) / h ),
where K(r) is a kernel function.

p_h(x) converges to p(x).
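A minimal sketch of the Parzen-Rosenblatt estimate with a Gaussian kernel (data and bandwidth are made up for the example):

```python
import numpy as np

def parzen_density(x, sample, h):
    # p_h(x) = 1/(hm) * sum_i K((x - x_i)/h) with a Gaussian kernel K
    r = (x - sample) / h
    K = np.exp(-0.5 * r**2) / np.sqrt(2 * np.pi)
    return K.sum() / (h * len(sample))

rng = np.random.default_rng(0)
sample = rng.normal(size=1000)
print(parzen_density(0.0, sample, h=0.3))   # close to the true N(0, 1) density at 0 (~0.399)
```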
Generalization to multidimensional case

1. If objects are described with n numeric features f_j: X → ℝ, j = 1, …, n:
p_h(x) = (1 / m) Σ_{i=1}^m Π_{j=1}^n (1 / h_j) K( (f_j(x) − f_j(x_i)) / h_j ).

2. If X is a (metric) space with a distance ρ(x, x′):
p_h(x) = (1 / mV(h)) Σ_{i=1}^m K( ρ(x, x_i) / h ),
where V(h) = ∫_X K( ρ(x, x_i) / h ) dx is a normalizing factor.



Multidimensional Parzen window

Estimate p_h(x) with
p_h(x) = (1 / mV(h)) Σ_{i=1}^m K( ρ(x, x_i) / h ).

Parzen window classifier:
a(x; T^ℓ, h) = argmax_{y∈Y} λ_y Pr(y) ℓ_y^{−1} Σ_{i: y_i = y} K( ρ(x, x_i) / h ).

Γ_y(x) = λ_y Pr(y) ℓ_y^{−1} Σ_{i: y_i = y} K( ρ(x, x_i) / h ) is the closeness of x to class y.

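A minimal sketch of the Parzen window classifier with a Gaussian kernel and Euclidean distance (the data are synthetic and λ_y = 1 for all classes):

```python
import numpy as np

def parzen_classify(x, X, y, priors, h):
    # Gamma_y(x) = Pr(y) * l_y^{-1} * sum_{i: y_i = y} K(rho(x, x_i) / h), with lambda_y = 1
    scores = {}
    for c in np.unique(y):
        r = np.linalg.norm(X[y == c] - x, axis=1) / h   # Euclidean rho
        scores[c] = priors[c] * np.exp(-0.5 * r**2).mean()
    return max(scores, key=scores.get)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(parzen_classify(np.array([2.5, 2.5]), X, y, priors={0: 0.5, 1: 0.5}, h=1.0))  # 1
```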


Naïve Bayesian classifier

Hypothesis (naïve): the features are independent random variables with probability densities p_j(ξ|y), y ∈ Y, j = 1, …, n.

Then the class likelihoods can be factorized:
p(x|y) = p_1(ξ_1|y) ⋅ … ⋅ p_n(ξ_n|y), x = (ξ_1, …, ξ_n), y ∈ Y.

Naïve Bayesian classifier:
a(x) = argmax_{y∈Y} ( ln λ_y Pr(y) + Σ_{j=1}^n ln p_j(ξ_j|y) ).



Implementation

Python (scikit-learn): GaussianNB
MultinomialNB
BernoulliNB

WEKA: NaiveBayes (Gaussian by default), with the -K option for a kernel density estimator

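A minimal usage sketch of the scikit-learn estimator named above, on synthetic data (GaussianNB assumes each feature is normally distributed within a class):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = GaussianNB().fit(X, y)
print(clf.predict([[2.5, 2.5]]))        # expected: [1]
print(clf.predict_proba([[2.5, 2.5]]))  # posterior probabilities Pr(y|x)
```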


Lecture plan
• Bayesian classification
• Non-parametric density recovery
• Parametric density recovery
• Normal discriminant analysis
• Logistic regression



Parametric notation

Joint probability density of a sample:
p(T^ℓ) = p((x_1, y_1), …, (x_ℓ, y_ℓ)) = Π_{i=1}^ℓ p(x_i, y_i).

Likelihood:
L(θ, T^ℓ) = Π_{i=1}^ℓ φ(x_i, y_i, θ).

MAP:
a_θ(x) = argmax_y φ(x, y, θ).



Relation with empirical risk

Take the negative logarithm:
−ln L(θ, T^ℓ) = −Σ_{i=1}^ℓ ln φ(x_i, y_i, θ) → min_θ.

Define the loss function:
L(a_θ, x) = −ℓ ln φ(x, y, θ).

Then the empirical risk minimization problem is:
Q(a_θ, T^ℓ) = (1/ℓ) Σ_{i=1}^ℓ L(a_θ, x_i) =
= −(1/ℓ) Σ_{i=1}^ℓ ℓ ln φ(x_i, y_i, θ) = −Σ_{i=1}^ℓ ln φ(x_i, y_i, θ) → min_θ.
Maximum likelihood

Principle of maximum likelihood:
L(θ; X^m) = Σ_{i=1}^m ln φ(x_i; θ) → max_θ.

The optimum in θ is attained at a point where the derivative is zero.

Principle of maximum weighted likelihood:
L(θ; X^m, W^m) = Σ_{i=1}^m w_i ln φ(x_i; θ) → max_θ,
where W^m = (w_1, …, w_m) is the vector of object weights.
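A small sketch of the maximum likelihood principle for a one-dimensional Gaussian, where setting the derivative to zero gives a closed-form optimum (the sample below is synthetic):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # hypothetical sample X^m

# Zeroing the derivative of sum_i ln N(x_i; mu, sigma^2) gives closed-form estimates:
mu_ml = x.mean()
sigma_ml = x.std()                              # ML (biased) estimate, divides by m

print(mu_ml, sigma_ml)                          # close to the true 2.0 and 1.5
print(norm.logpdf(x, mu_ml, sigma_ml).sum())    # the maximized log-likelihood L(theta; X^m)
```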
Maximum joint likelihood principle

Q(a_θ, T^ℓ) = −Σ_{i=1}^ℓ ln φ(x_i, y_i, θ) → min_θ.

φ(x_i, y_i, θ) = p(x_i, y_i | w) p(w; γ),
where p(x_i, y_i | w) is a probabilistic data model, p(w; γ) is the prior distribution of the model parameters, γ is a hyper-parameter.

Maximum joint likelihood principle:
Σ_{i=1}^ℓ ln p(x_i, y_i | w) + ln p(w; γ) → max_{w,γ}.
Quadratic penalty conditions

Let w ∈ ℝ^n be described by an n-dimensional Gaussian distribution:
p(w; σ) = (2πσ)^{−n/2} exp( −‖w‖² / (2σ) )
(the weights are independent, their expectations are zero, and their variances are all equal to σ).

This leads to a quadratic penalty:
−ln p(w; σ) = (1 / 2σ) ‖w‖² + const(w).

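A quick numeric check that the negative log of this Gaussian prior is a quadratic penalty plus a term that does not depend on w (the dimension, σ and weights are arbitrary):

```python
import numpy as np

n, sigma = 5, 2.0
w = np.array([0.5, -1.0, 2.0, 0.0, 1.5])

p_w = (2 * np.pi * sigma) ** (-n / 2) * np.exp(-(w @ w) / (2 * sigma))   # Gaussian prior p(w; sigma)
neg_log_prior = -np.log(p_w)
penalty = w @ w / (2 * sigma)
const = 0.5 * n * np.log(2 * np.pi * sigma)                              # const(w)

print(neg_log_prior, penalty + const)   # equal: -ln p(w; sigma) = ||w||^2 / (2 sigma) + const(w)
```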


Lecture plan
• Bayesian classification
• Non-parametric density recovery
• Parametric density recovery
• Normal discriminant analysis
• Logistic regression



Key hypothesis

Key hypothesis: the classes have n-dimensional normal densities:
p(x|y) = N(x; μ_y, Σ_y) = exp( −(1/2) (x − μ_y)^T Σ_y^{−1} (x − μ_y) ) / sqrt( (2π)^n det Σ_y ),
where μ_y is the expectation vector of class y ∈ Y,
Σ_y ∈ ℝ^{n×n} is the covariance matrix of class y ∈ Y; it is symmetric, nonsingular, and positive definite.



Theorem on separating surface

Theorem:
If the class densities are normal, then
1) the separating surface
{x ∈ X | λ_y Pr(y) p(x|y) = λ_s Pr(s) p(x|s)}
is quadratic;
2) if Σ_{y+} = Σ_{y−}, it is linear.



Quadratic analysis

Principle of maximum weighted likelihood:
L(θ; X^m, W^m) = Σ_{i=1}^m w_i ln φ(x_i; θ) → max_θ,
where W^m = (w_1, …, w_m) is the vector of object weights.

The optimum in θ is attained at a point where the derivative is zero.



Quadratic discriminant

Theorem:
The maximum weighted likelihood estimates, for y ∈ Y, are:
μ_y = (1 / W_y) Σ_{i: y_i = y} w_i x_i;
Σ_y = (1 / W_y) Σ_{i: y_i = y} w_i (x_i − μ_y)(x_i − μ_y)^T;
where W_y = Σ_{i: y_i = y} w_i.

Quadratic discriminant:
a(x) = argmax_{y∈Y} ( ln λ_y Pr(y) − (1/2) (x − μ_y)^T Σ_y^{−1} (x − μ_y) − (1/2) ln det Σ_y ).

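A minimal sketch of these estimates with unit weights (w_i = 1, so W_y = ℓ_y) and of the quadratic discriminant score; the data are synthetic and λ_y = 1:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

params = {}
for c in np.unique(y):
    Xc = X[y == c]
    mu = Xc.mean(axis=0)                          # mu_y
    Sigma = np.cov(Xc, rowvar=False, bias=True)   # Sigma_y (ML estimate, divides by l_y)
    prior = len(Xc) / len(X)                      # Pr(y)
    params[c] = (mu, Sigma, prior)

def qda_score(x, mu, Sigma, prior):
    # ln Pr(y) - 1/2 (x - mu)^T Sigma^{-1} (x - mu) - 1/2 ln det Sigma, with lambda_y = 1
    d = x - mu
    return np.log(prior) - 0.5 * d @ np.linalg.solve(Sigma, d) - 0.5 * np.log(np.linalg.det(Sigma))

x = np.array([2.0, 2.0])
print(max(params, key=lambda c: qda_score(x, *params[c])))  # expected: 1
```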


Method problems

• If ℓ_y < n, then Σ_y is singular.
• The smaller ℓ_y is, the less robust Σ_y is.
• The estimates μ_y and Σ_y are sensitive to noise.
• The distributions are required to be normal.



Linear discriminant analysis

Hypothesis: the covariance matrices of all classes are equal, with pooled estimate
Σ = (1 / W) Σ_{i=1}^ℓ w_i (x_i − μ_{y_i})(x_i − μ_{y_i})^T, where W = Σ_{i=1}^ℓ w_i.

Fisher’s linear discriminant:
a(x) = argmax_{y∈Y} λ_y Pr(y) p(x|y) =
= argmax_{y∈Y} ( ln λ_y Pr(y) − (1/2) μ_y^T Σ^{−1} μ_y + x^T Σ^{−1} μ_y ) =
= argmax_{y∈Y} ( β_y + x^T α_y ), which for two classes is sign( ⟨x, w⟩ − w_0 ).



Mahalanobis distance

Theorem: the error probability of Fisher’s linear discriminant equals
R(a) = Φ( −(1/2) ‖μ_1 − μ_2‖_Σ ),
where Φ(r) = ∫_{−∞}^r N(x; 0, 1) dx is the standard normal CDF and ‖μ_1 − μ_2‖_Σ is the Mahalanobis distance between the class means.



Implementation

Python (scikit-learn): LinearDiscriminantAnalysis
QuadraticDiscriminantAnalysis
WEKA: none

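A minimal usage sketch of the scikit-learn estimators named above, on synthetic two-class data:

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

for model in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
    model.fit(X, y)
    print(type(model).__name__, model.predict([[2.0, 2.0]]), model.score(X, y))
```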


Lecture plan
• Bayesian classification
• Non-parametric density recovery
• Parametric density recovery
• Normal discriminant analysis
• Logistic regression



Bayesian classification

A distribution p(x, y) on the object-answer space.
A simple sample of size ℓ:
T^ℓ = {(x_i, y_i)}_{i=1}^ℓ.

Bayesian classifier:
a_OB(x) = argmax_{y∈Y} λ_y Pr(y) p(x|y),
where λ_y is the loss for class y.



Linear classifiers

Constraint: Y = {−1, +1} = {y_{−1}, y_{+1}}.

Linear classifier:
a_w(x, T^ℓ) = sign( Σ_{i=1}^n w_i f_i(x) − w_0 ),
where w_1, …, w_n ∈ ℝ are feature weights.

Absorbing w_0 into the weight vector via a constant feature, this can be written as
a_w(x, T^ℓ) = sign⟨w, x⟩.



Linear Bayesian classifiers

Q(a_θ, T^ℓ) = (1/ℓ) Σ_{i=1}^ℓ L(a_θ, x_i) = −Σ_{i=1}^ℓ ln φ(x_i, y_i, θ) → min_θ.

Bayesian classifier for two classes:
a(x) = sign( λ_+ Pr(y_+ | x) − λ_− Pr(y_− | x) ) =
= sign( p(x|y_+) / p(x|y_−) − λ_− Pr(y_−) / (λ_+ Pr(y_+)) ).

The separating surface
λ_+ Pr(y_+) p(x|y_+) = λ_− Pr(y_−) p(x|y_−)
is linear.



Key hypothesis

Key hypothesis: the classes are described by n-dimensional overdispersed exponential densities:
p(x|y) = exp( c_y(δ) ⟨θ_y, x⟩ + b_y(δ, θ_y) + d(x, δ) ),
where θ_y ∈ ℝ^n is the shift parameter,
δ is the dispersion parameter,
b_y, c_y, d are some numeric functions.

The overdispersed exponential distribution family includes the uniform, normal, hypergeometric, Poisson, binomial, Γ- and other distributions.



Example: Gaussian

Let θ = Σ^{−1} μ and δ = Σ. Then
N(x; μ, Σ) = exp( −(1/2) (x − μ)^T Σ^{−1} (x − μ) ) / sqrt( (2π)^n det Σ ) =
= exp( μ^T Σ^{−1} x − (1/2) μ^T Σ^{−1} Σ Σ^{−1} μ − (1/2) x^T Σ^{−1} x − (n/2) ln 2π − (1/2) ln det Σ ),
which has the required form with ⟨θ, x⟩ = ⟨Σ^{−1} μ, x⟩ = μ^T Σ^{−1} x.



The main theorem

Theorem:
If the class densities p_y are overdispersed exponential and f_0(x) = const, then
1) the Bayesian classifier
a(x) = sign( p(x|y_+) / p(x|y_−) − λ_− Pr(y_−) / (λ_+ Pr(y_+)) )
is linear: a(x) = sign( ⟨w, x⟩ − w_0 ), where w_0 = ln(λ_− / λ_+);
2) the posterior probabilities of the classes are
Pr(y|x) = σ( ⟨w, x⟩ y ),
where σ(s) = 1 / (1 + e^{−s}) is the logistic (sigmoid) function.
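A minimal sketch of the resulting model: a linear score passed through the sigmoid gives the class posteriors, which is exactly the form logistic regression fits (the weights below are made up; estimating them from data is the subject of logistic regression, e.g. scikit-learn's LogisticRegression):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Hypothetical linear classifier <w, x> - w_0
w = np.array([1.5, -0.8])
w0 = 0.3
x = np.array([0.4, 1.2])

score = w @ x - w0
for y in (+1, -1):
    print(y, sigmoid(score * y))   # Pr(y|x) = sigma(y * (<w, x> - w_0)); the two values sum to 1
```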
