Intro To ML PART3
Marc Sebban
Bayes Error
The Bayes error ϵB is the lowest possible error rate (or irreducible error) achievable by any hypothesis h (for the sake of simplicity, f and p will be used interchangeably). In the binary setting,

\epsilon_B = \int_x f(y_{\min}(x) \mid x)\, f(x)\, dx
[Figure: class-conditional densities p(x|y1) and p(x|y2), and the posterior f(y1|x)]
Bayesian Classifier
The Bayesian classifier predicts the optimal class y∗ ∈ Y of an example x ∈ X by applying the Maximum A Posteriori (MAP) decision rule:

\forall y_c \in Y, \quad p(y_c \mid x) = \frac{p(x \mid y_c)\, p(y_c)}{p(x)}
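As an illustration (a minimal sketch, not from the original slides), the MAP rule can be applied directly once p(x|yc) and p(yc) are known; here the two one-dimensional Gaussian class-conditional densities and the priors are hypothetical:

import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities p(x|y_c): 1-D Gaussians (mean, std).
class_params = {"y1": (0.0, 1.0), "y2": (2.0, 1.0)}
priors = {"y1": 0.5, "y2": 0.5}  # p(y_c)

def map_predict(x):
    # MAP rule: argmax_c p(x|y_c) * p(y_c); p(x) is a common factor and can be dropped.
    scores = {c: norm.pdf(x, mu, sigma) * priors[c]
              for c, (mu, sigma) in class_params.items()}
    return max(scores, key=scores.get)

print(map_predict(0.3))  # -> 'y1'
print(map_predict(1.8))  # -> 'y2'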
Exercise
Reminder:

\epsilon_B = \int_x f(y_{\min}(x) \mid x)\, f(x)\, dx

Consider the two class-conditional densities

f_{c_1}(y) = \frac{3}{2} y^2 + y \quad \text{for } 0 < y < 1
f_{c_2}(y) = 1 \quad \text{for } \frac{3}{4} < y < \frac{7}{4}

Question
After drawing these densities in the plane, compute the Bayes error ϵB.
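A numerical sanity check of this exercise (my sketch; it assumes equal priors p(c1) = p(c2) = 1/2, which the exercise does not state):

import numpy as np

# Class-conditional densities from the exercise.
def f_c1(y):
    return np.where((y > 0) & (y < 1), 1.5 * y**2 + y, 0.0)

def f_c2(y):
    return np.where((y > 0.75) & (y < 1.75), 1.0, 0.0)

p1 = p2 = 0.5  # assumed equal priors (not given in the exercise)

# eps_B = integral of min(p1 * f_c1(y), p2 * f_c2(y)) dy.
ys = np.linspace(0.0, 2.0, 200_001)
eps_B = np.trapz(np.minimum(p1 * f_c1(ys), p2 * f_c2(ys)), ys)
print(f"Bayes error ~ {eps_B:.4f}")  # -> ~0.125 under the equal-prior assumption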
Theorem
Let p̂n(x) = \frac{k_n}{n V_n} denote the estimate of p(x) from a training sample S of size n. As n increases, p̂n(x) converges to p(x) if the following three conditions are fulfilled:

\lim_{n \to \infty} V_n = 0, \qquad \lim_{n \to \infty} k_n = \infty, \qquad \lim_{n \to \infty} \frac{k_n}{n} = 0
Interpretation
If n is large, we get:
- a good estimate p̂(x) of p(x) (resp. p̂(x|yj) of p(x|yj) if we take the class into account);
- a good approximation of the Bayesian method.
Bayesian Classifier
k-nearest neighbors density estimation with k_n = \sqrt{n}:

[Figure: k-NN estimates of the standard normal density p(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} x^2}]
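A small sketch of this estimator (assuming, as the figure suggests, k_n = √n and a sample drawn from the standard normal):

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
sample = rng.standard_normal(n)  # points drawn from p(x) = N(0, 1)
k_n = int(np.sqrt(n))            # k_n = sqrt(n), as in the figure

def knn_density(x):
    # p_hat_n(x) = k_n / (n * V_n), where V_n is the length of the smallest
    # symmetric interval around x containing its k_n nearest sample points.
    dists = np.sort(np.abs(sample - x))
    V_n = 2.0 * dists[k_n - 1]
    return k_n / (n * V_n)

print(knn_density(0.0))  # close to 1/sqrt(2*pi) ~ 0.3989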
k-NN as a classifier
Let us suppose we have nj examples of class yj in S, such that \sum_j n_j = n, and that the hypersphere around x contains kj examples of class yj (\sum_j k_j = k). To classify a new example x from its k nearest neighbors, we apply the Bayes rule with the k-NN density estimates:

h(x) = \arg\max_j \frac{\hat{p}(x \mid y_j)\, \hat{p}(y_j)}{\hat{p}(x)} = \arg\max_j \frac{\frac{k_j}{n_j V_n} \times \frac{n_j}{n}}{\frac{k}{n V_n}} = \arg\max_j \frac{k_j}{k}
Input: x, S, k, d
Output: class of x
foreach (x′, y′) ∈ S do
    compute the distance d(x′, x);
Sort the n distances in increasing order;
Count the occurrences of each class yj among the k nearest neighbors;
Assign to x the most frequent class;
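This pseudocode translates directly into a brute-force Python sketch (using the Euclidean distance as d):

import numpy as np
from collections import Counter

def knn_classify(x, S, k, d=lambda a, b: np.linalg.norm(a - b)):
    # S is a list of (example, label) pairs.
    # Compute the distance d(x', x) for each (x', y') in S.
    dists = [(d(np.asarray(xp), np.asarray(x)), yp) for xp, yp in S]
    # Sort the n distances in increasing order.
    dists.sort(key=lambda t: t[0])
    # Count the occurrences of each class among the k nearest neighbors.
    votes = Counter(yp for _, yp in dists[:k])
    # Assign to x the most frequent class.
    return votes.most_common(1)[0][0]

S = [((1.0, 1.0), "y1"), ((1.2, 0.8), "y1"), ((3.0, 3.2), "y2"), ((2.9, 3.0), "y2")]
print(knn_classify((1.1, 1.0), S, k=3))  # -> 'y1'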
Theorem
Let x′ be the nearest neighbor of x, and let pϵ denote the probability mass of the ball of radius ϵ centered at x. Then

p(d(x, x') > \epsilon) = (1 - p_\epsilon)^n \xrightarrow[n \to \infty]{} 0

Corollary
If n → ∞, p(yj |x′) ≈ p(yj |x).
Theorem
The generalization error ϵ1NN of the 1-nearest neighbor rule is bounded by twice the (optimal) Bayes error ϵB:

\epsilon_{1NN} \le 2\, \epsilon_B
[Figure: densities p(x|y1) and p(x|y2); in the region where p(y2|x) > p(y1|x), the optimal prediction is y∗(x) = y2 and the pointwise Bayes error is ϵB(x) = p(y1|x)]
Proof.
Bayes error:

\epsilon_B = \sum_{x \in X} p(y_{\min} \mid x)\, p(x)

1NN error (with x′ the nearest neighbor of x):

\epsilon_{1NN} = \sum_{x \in X} \left( p(y_1 \mid x)\, p(y_2 \mid x') + p(y_2 \mid x)\, p(y_1 \mid x') \right) p(x) \le 2\, \epsilon_B

and, in the continuous case,

\epsilon_{1NN} = \int_{x \in X} \left( f(y_1 \mid x)\, f(y_2 \mid x') + f(y_2 \mid x)\, f(y_1 \mid x') \right) f(x)\, dx \le 2\, \epsilon_B
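An empirical illustration of the bound (my simulation on a hypothetical one-dimensional mixture, not part of the slides): two equiprobable classes with p(x|y=0) = N(0,1) and p(x|y=1) = N(2,1) have a known Bayes error ϵB = Φ(−1) ≈ 0.159, and the 1-NN test error stays below 2ϵB:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def draw(m):
    # Two equiprobable classes: p(x|y=0) = N(0,1), p(x|y=1) = N(2,1).
    y = rng.integers(0, 2, size=m)
    return rng.normal(2.0 * y, 1.0), y

x_tr, y_tr = draw(4_000)
x_te, y_te = draw(4_000)

# 1-NN rule: each test point takes the label of its closest training point.
nn = np.abs(x_te[:, None] - x_tr[None, :]).argmin(axis=1)
err_1nn = np.mean(y_tr[nn] != y_te)

eps_B = norm.cdf(-1.0)  # Bayes error of this mixture
print(f"eps_1NN ~ {err_1nn:.3f} <= 2 * eps_B = {2 * eps_B:.3f}")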
Nearest Neighbors
Graphically
[Figure: plot with both axes graduated from 0 to 0.5]
Reminder
The previous property holds only asymptotically (n → ∞).
→ possible compromise between k and n: k = \sqrt{\frac{n}{|Y|}}
Exercise
Let us consider the following discrete probability functions of two classes y1 (solid line) and y2 (dashed line).
[Figure: discrete probability functions of y1 (solid) and y2 (dashed)]
Question
Verify that the previous theorem holds.
Curse of dimensionality
Fukunaga [1987] studied the effect of the representation dimension on the behavior of the nearest-neighbor rule (the curse of dimensionality).
Problems of k-NN
Preliminary step: remove from S the outliers and the examples lying in the Bayes error region.

Input: S
Output: Scleaned
Split S randomly into two subsets S1 and S2;
while S1 and S2 are not stabilized do
    classify S1 with S2 using the 1-NN rule;
    remove from S1 the misclassified instances;
    classify S2 with the new S1 using the 1-NN rule;
    remove from S2 the misclassified instances;
Scleaned = S1 ∪ S2;
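A Python sketch of this editing procedure (brute-force 1-NN; the helper names are mine):

import numpy as np

def nn_labels(queries, ref_X, ref_y):
    # 1-NN rule: label of the closest reference point, brute force.
    idx = [int(np.argmin(np.linalg.norm(ref_X - q, axis=1))) for q in queries]
    return ref_y[np.asarray(idx, dtype=int)]

def edit_dataset(X, y, seed=0):
    rng = np.random.default_rng(seed)
    # Split S randomly into two subsets S1 and S2 (as index sets).
    perm = rng.permutation(len(X))
    parts = [perm[: len(X) // 2], perm[len(X) // 2:]]
    sizes = None
    while sizes != (len(parts[0]), len(parts[1])):  # until stabilization
        sizes = (len(parts[0]), len(parts[1]))
        for a, b in ((0, 1), (1, 0)):
            # Classify S_a with S_b using the 1-NN rule,
            # then remove from S_a the misclassified instances.
            pred = nn_labels(X[parts[a]], X[parts[b]], y[parts[b]])
            parts[a] = parts[a][pred == y[parts[a]]]
    keep = np.concatenate(parts)  # S_cleaned = S1 U S2
    return X[keep], y[keep]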
Illustration
[Figure: step-by-step 1-NN search for a query X among the points X1, ..., X10: the distances d(X, Xi) are computed one by one (e.g. d(X, X2), d(X, X4)) and the current minimum dmin is updated, yielding NN(X)]
[Figure: construction of a kd-tree on the points (2,3), (5,4), (9,6), (4,7), (8,1), (7,2), then search for the nearest neighbor of the query X = (6,7)]
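These are the points of a classic kd-tree example; a sketch of the illustrated search using scipy's KDTree (a reconstruction, not the original slide code):

from scipy.spatial import KDTree

# The six points from the illustration.
points = [(4, 7), (9, 6), (5, 4), (2, 3), (7, 2), (8, 1)]
tree = KDTree(points)

# Nearest neighbor of the query X = (6, 7).
dist, idx = tree.query((6, 7), k=1)
print(points[idx], dist)  # -> (4, 7), distance 2.0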
Conclusion

y^*(x) = \arg\max_c p(y_c \mid x) = \arg\max_c p(y_c \mid x_1, \dots, x_d),

which corresponds to

y^*(x) = \arg\max_c p(x \mid y_c)\, p(y_c) = \arg\max_c p(x_1, \dots, x_d \mid y_c)\, p(y_c).
Then...
A classifier h(x) that predicts h(x) = −1 for all x is 99.8% accurate while missing all the frauds!
\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} = 1 - \text{Error}
F-Measure
When the positives are far outnumbered by the negatives (#positives ≪ #negatives), one rather optimizes other metrics such as the Fβ-Measure:

F_\beta = \frac{(1 + \beta^2) \times \text{Precision} \times \text{Recall}}{(\beta^2 \times \text{Precision}) + \text{Recall}}

where

\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}
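For concreteness, a small sketch computing these quantities from hypothetical confusion-matrix counts:

def f_beta(tp, fp, fn, beta=1.0):
    # F_beta-Measure from confusion-matrix counts.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return ((1 + beta**2) * precision * recall) / (beta**2 * precision + recall)

# Hypothetical fraud-detection counts: few positives, many negatives.
print(f_beta(tp=40, fp=20, fn=60))           # F1 = 0.5
print(f_beta(tp=40, fp=20, fn=60, beta=2.0)) # F2 weights recall more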
The Fβ-Measure is non-convex (even with surrogates) and non-separable, i.e. F_\beta \ne \sum_{(x_i, y_i) \in S} \dots : the loss for one point depends on the others, so it is impossible to optimize directly.
Learning from Imbalanced Datasets

AP = \frac{1}{P} \sum_{i=1}^{P} \text{precision}(k_i)
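A sketch of this computation (assuming, as is standard, that precision(k_i) is the precision at the rank of the i-th positive among the P positives):

import numpy as np

def average_precision(y_true, scores):
    # AP = (1/P) * sum over the P positives of the precision at their rank.
    order = np.argsort(scores)[::-1]   # rank by decreasing score
    y_sorted = np.asarray(y_true)[order]
    hits = np.cumsum(y_sorted)         # number of positives seen so far
    ranks = np.arange(1, len(y_sorted) + 1)
    return (hits / ranks)[y_sorted == 1].mean()

print(average_precision([1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.4, 0.2]))  # -> ~0.833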
AUCROC versus AP
AP is much better suited when the budget, in terms of the number of checks allowed to human controllers, is limited, as in bank or tax fraud detection.
Summary
The F-Measure, AUCROC and AP are neither differentiable nor convex; they are therefore difficult to optimize directly.
Other strategies:
- Sampling methods (a sketch follows below)
- Modification of the decision boundaries (e.g. γ-NN)
Sampling methods
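The slides stop here; as a minimal illustration of what such sampling methods do (my sketch, not from the slides), random undersampling and random oversampling can be written as:

import numpy as np

def random_under_sample(X, y, majority_label, seed=0):
    # Randomly drop majority examples until both classes have the same size.
    rng = np.random.default_rng(seed)
    maj = np.flatnonzero(y == majority_label)
    mino = np.flatnonzero(y != majority_label)
    keep = np.concatenate([rng.choice(maj, size=len(mino), replace=False), mino])
    return X[keep], y[keep]

def random_over_sample(X, y, minority_label, seed=0):
    # Randomly duplicate minority examples until both classes have the same size.
    rng = np.random.default_rng(seed)
    mino = np.flatnonzero(y == minority_label)
    maj = np.flatnonzero(y != minority_label)
    keep = np.concatenate([maj, rng.choice(mino, size=len(maj), replace=True)])
    return X[keep], y[keep]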