
Introduction to Machine Learning

PART 3: k-Nearest Neighbors

Marc Sebban

Laboratoire Hubert Curien, UMR CNRS 5516


Université Jean Monnet Saint-Étienne

Marc Sebban (LabHC) Introduction to Machine Learning 1 / 64


Bayesian Classifier

Bayes Error
The Bayes error ϵB is the lowest possible error rate (or irreducible error)
for any hypothesis h (for the sake of simplicity, f and p will be used
interchangeably). In the binary setting,

ϵB = ∫_x f(ymin(x)|x) f(x) dx

where (x, y) is a labeled instance and

ymin(x) = arg min_{y∈{y1,y2}} f(y|x)

[Figure: class-conditional densities p(x|y1) and p(x|y2); in the region where
f(y2|x) > f(y1|x), the pointwise Bayes error is ϵB(x) = f(y1|x).]

Marc Sebban (LabHC) Introduction to Machine Learning 2 / 64


Bayesian Classifier

Bayesian Classifier
The Bayesian classifier predicts the optimal class y* ∈ Y given an example
x ∈ X by applying the Maximum a posteriori (MAP) decision rule:

∀yc ∈ Y,  p(yc|x) = p(x|yc) · p(yc) / p(x)

y*(x) = arg max_c p(yc|x)

that corresponds to y*(x) = arg max_c p(x|yc) · p(yc).

If this calculation is possible, the Bayesian classifier is optimal from a
probabilistic point of view, with an associated error ϵB.

Marc Sebban (LabHC) Introduction to Machine Learning 3 / 64
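
To make the MAP rule concrete, here is a minimal sketch (assuming, only for illustration, two classes with known one-dimensional Gaussian class-conditional densities and known priors; none of these numbers come from the slides):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities p(x|y_c) and priors p(y_c) (illustration only)
class_conditionals = {0: norm(loc=-1.0, scale=1.0), 1: norm(loc=2.0, scale=0.5)}
priors = {0: 0.7, 1: 0.3}

def map_predict(x):
    """MAP rule: y*(x) = argmax_c p(x|y_c) * p(y_c)."""
    scores = {c: class_conditionals[c].pdf(x) * priors[c] for c in priors}
    return max(scores, key=scores.get)

print(map_predict(0.2))   # close to the class-0 mean -> 0
print(map_predict(1.8))   # close to the class-1 mean -> 1
```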


Bayesian Classifier

Exercise

Reminder

ϵB = ∫_x f(ymin(x)|x) f(x) dx

Let fc1 and fc2 be the densities of two classes c1 and c2:

fc1(y) = (3/2) y² + y   for 0 < y < 1

fc2(y) = 1              for 3/4 < y < 7/4

Question
After drawing these densities in a 2-dimensional space, calculate the
Bayes error ϵB.

Marc Sebban (LabHC) Introduction to Machine Learning 4 / 64


Bayesian Classifier

Underlying conditions to solve this problem

To compute ϵB, one needs some prior knowledge:


1 Know the a priori probabilities p(yj ) of the different classes.
2 Know the probabilities of the observations given the classes p(x|yj ).

Unfortunately, p(yj ) and p(x|yj ) are unknown. One needs to estimate


these two quantities from the training sample S.

Marc Sebban (LabHC) Introduction to Machine Learning 5 / 64


Bayesian Classifier

Estimation of the a priori probability of the classes p(yj )

What about p(yj )?


An unbiased estimate of p(yj) is given by the observed frequency
p̂(yj) = |Sj| / |S|, where |Sj| is the number of training examples belonging to the
class yj.

Marc Sebban (LabHC) Introduction to Machine Learning 6 / 64


Bayesian Classifier

Estimation of the conditional probabilities p(x|yj )

What about p(x|yj )?


We can distinguish two types of approaches:
1 Parametric methods, which assume that p(x|yj) follows a given
statistical distribution. In this case, the problem to solve consists in
estimating the parameters of the considered distribution (e.g. a normal
distribution with µ and σ, or a Binomial distribution with p)
⇒ Likelihood Maximization.
2 Non-parametric methods, which do not impose any constraint
on the underlying distribution, and for which the densities p(x|yj)
are locally estimated around x.

Marc Sebban (LabHC) Introduction to Machine Learning 7 / 64


Bayesian Classifier

Non parametric methods

No assumption is made on the underlying distribution...


... we only suppose that this target distribution is locally regular.
The objective is to estimate p(x|yj ). Since this must be done
∀yj ∈ Y, for the sake of simplicity, let us focus on p(x) first.

Marc Sebban (LabHC) Introduction to Machine Learning 8 / 64


Bayesian Classifier

Non parametric methods

Let p be the unknown target probability density. The probability P to
observe x in a volume V is:

P = ∫_V p(x) dx

Assuming that p(x) does not significantly change in V (locally
regular), we can approximate P such that:

P̂ ≃ p(x) × V   (1)

P can also be estimated by the proportion of training data in V:

P̂ ≃ k/n   (2)

Therefore, we can deduce that

p(x) ≈ k / (nV) = p̂(x).

Marc Sebban (LabHC) Introduction to Machine Learning 9 / 64
Bayesian Classifier

Theorem
Let us denote by p̂n(x) = kn / (n Vn) the estimate of p(x) from a training sample
S of size n. As n increases, p̂n(x) converges to p(x) if the following
three conditions are fulfilled:

lim_{n→∞} Vn = 0

lim_{n→∞} kn = ∞

lim_{n→∞} kn/n = 0

Interpretation
If n is high, we get:
a good estimate p̂(x) (resp. p̂(x|yj) if we take the class into account)
of p(x) (resp. p(x|yj));
a good approximation of the Bayesian method.
Marc Sebban (LabHC) Introduction to Machine Learning 10 / 64
Bayesian Classifier

Non parametric methods


k-Nearest Neighbors (Cover & Hart 1968)
It turns out that the k-nearest neighbor method satisfies the previous
conditions: fix a number kn of examples (growing w.r.t. n) and
adapt a volume Vn (e.g. a hypersphere centered at x) so that kn
examples are in the volume (e.g. kn = √n).

Marc Sebban (LabHC) Introduction to Machine Learning 11 / 64


Nearest Neighbors

k-nearest neighbors

kNN as a good way to estimate p(x)


To estimate p(x) from the n training examples of S, let us build a
hypersphere centered at x that contains kn examples.

p̂n(x) = (kn / n) / Vn  is a good estimate of p(x)

Marc Sebban (LabHC) Introduction to Machine Learning 12 / 64


Nearest Neighbors

Example 1 with k = √n

p(x) = (1 / √(2π)) e^(−x²/2)

Marc Sebban (LabHC) Introduction to Machine Learning 13–14 / 64
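
To illustrate the estimator on this example, here is a minimal sketch (the sample size, the evaluation grid and the use of kn = √n are assumptions made only for illustration; in one dimension the "volume" Vn is the length of the interval reaching the kn-th nearest neighbor):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
sample = rng.standard_normal(n)     # training sample drawn from p(x) = N(0, 1)
k = int(np.sqrt(n))                 # k_n = sqrt(n), as in the example

def knn_density(x, sample, k):
    """p_hat(x) = (k/n) / V_n, with V_n the length of the smallest
    interval centered at x containing k sample points."""
    r = np.sort(np.abs(sample - x))[k - 1]    # distance to the k-th nearest neighbor
    volume = 2 * r                             # 1-D "volume" = interval length
    return (k / len(sample)) / volume

for x in np.linspace(-3, 3, 7):
    print(f"x = {x:+.1f}   p_hat = {knn_density(x, sample, k):.3f}   "
          f"p = {np.exp(-x**2 / 2) / np.sqrt(2 * np.pi):.3f}")
```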


Nearest Neighbors

k-NN as a classifier

Let us suppose we have nj examples of class yj in S, such that Σ_j nj = n.
Suppose the hypersphere contains kj examples of class yj (Σ_j kj = k). If
we aim at classifying a new example x with its k nearest neighbors, and
applying the Bayes rule, we get:

h(x) = arg max_j [ p̂(x|yj) · p̂(yj) ] / p̂(x)
     = arg max_j [ (kj / (Vn nj)) × (nj / n) ] / [ k / (n Vn) ]
     = arg max_j kj / k

Marc Sebban (LabHC) Introduction to Machine Learning 15 / 64


Nearest Neighbors

k-nearest neighbors algorithm


How to classify an unknown example x using S?

Input: x, S, d, k
Output: class of x
foreach (x′, y′) ∈ S do
    Compute the distance d(x′, x);
Sort the n distances in increasing order;
Count the number of occurrences of each class yj among the k nearest
neighbors;
Assign to x the most frequent class;

Marc Sebban (LabHC) Introduction to Machine Learning 16 / 64
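
A minimal Python sketch of this procedure (the Euclidean distance and the toy data are assumptions of the sketch, not part of the slides):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """Classify x by a majority vote among its k nearest neighbors in (X_train, y_train)."""
    # Compute all distances d(x', x)
    distances = np.linalg.norm(X_train - x, axis=1)
    # Sort them in increasing order and keep the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Assign to x the most frequent class among those neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(np.array([0.2, 0.1]), X_train, y_train, k=3))  # -> 0
```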


Nearest Neighbors

Special case of the (k = 1)-nearest neighbor rule

Decision boundaries of the (k = 1)-nearest neighbor rule.

In a 2-dimensional space, the (k = 1)-nearest neighbor rule induces a partition of
the space into Voronoi cells (obtained from the perpendicular bisectors of the
edges of the Delaunay triangulation), each cell labeled with the class of the
training example it contains.

Marc Sebban (LabHC) Introduction to Machine Learning 17 / 64


Nearest Neighbors

Special case of the (k = 1)-nearest neighbor rule

Convergence properties of the (k = 1)-nearest neighbor rule

Theorem
Let x′ be the nearest neighbor of x,

lim_{n→∞} p( d(x, x′) > ϵ ) = 0,  ∀ϵ > 0

Corollary
If n → ∞, p(yj|x′) ≈ p(yj|x).

Marc Sebban (LabHC) Introduction to Machine Learning 18 / 64


Nearest Neighbors

Special case of the (k = 1)-nearest neighbor rule

Proof.
Let P be the probability that the hypersphere s(x, ϵ) centered at x with
radius ϵ does not contain any point of S (the xi being drawn i.i.d.):

P = p(x1 ∉ s(x, ϵ), ..., xn ∉ s(x, ϵ)) = Π_{i=1}^{n} p(xi ∉ s(x, ϵ))
  = Π_{i=1}^{n} (1 − p(xi ∈ s(x, ϵ))) = (1 − pϵ)^n.

Since the nearest neighbor x′ of x is also outside of the sphere, we get

p(d(x, x′) > ϵ) = (1 − pϵ)^n

therefore, lim_{n→∞} p(d(x, x′) > ϵ) = lim_{n→∞} (1 − pϵ)^n = 0

Marc Sebban (LabHC) Introduction to Machine Learning 19 / 64


Nearest Neighbors

Theorem
The (asymptotic) generalization error ϵ1NN of the 1-nearest neighbor rule is bounded by
twice the (optimal) Bayes error ϵB.

ϵ1NN ≤ 2ϵB

Marc Sebban (LabHC) Introduction to Machine Learning 20 / 64


Nearest Neighbors

[Figure: class-conditional densities p(x|y1) and p(x|y2); where p(y2|x) > p(y1|x),
the Bayes prediction is y*(x) = y2 and the pointwise error is ϵB(x) = p(y1|x).]

Proof.
Bayes error:

∀x ∈ X,  ϵB(x) = min {p(y1|x), p(y2|x)} = p(ymin|x)

1NN error (an error occurs when the label of x and the label of its nearest
neighbor x′ disagree):

ϵ1NN(x) = p(y1|x)p(y2|x′) + p(y2|x)p(y1|x′)

(if n is large) ≈ p(y1|x)p(y2|x) + p(y2|x)p(y1|x)
= 2p(ymin|x)(1 − p(ymin|x)) ≤ 2p(ymin|x) = 2ϵB(x)

Marc Sebban (LabHC) Introduction to Machine Learning 21 / 64


Nearest Neighbors

We can deduce that:

In the discrete case

ϵB = Σ_{x∈X} p(ymin|x) · p(x)

ϵ1NN = Σ_{x∈X} ( p(y1|x)p(y2|x′) + p(y2|x)p(y1|x′) ) · p(x) ≤ 2ϵB

In the continuous case

ϵB = ∫_{x∈X} f(ymin|x) · f(x) dx

ϵ1NN = ∫_{x∈X} ( f(y1|x)f(y2|x′) + f(y2|x)f(y1|x′) ) · f(x) dx ≤ 2ϵB
Marc Sebban (LabHC) Introduction to Machine Learning 22 / 64
Nearest Neighbors

Graphically

ϵB(x) = min {p(y1|x), p(y2|x)} = p(ymin|x)

ϵ1NN(x) = 2p(ymin|x)(1 − p(ymin|x)) ≤ 2ϵB(x)

Marc Sebban (LabHC) Introduction to Machine Learning 23 / 64


Nearest Neighbors

Effect of k on the estimation quality

ϵB ≤ ϵkNN ≤ ϵ(k−1)NN ≤ . . . ≤ ϵ1NN ≤ 2ϵB

[Figure: error rates of the k-NN rule for k = 1, 3, 5, 7 compared with the Bayes error.]

Reminder
The previous property holds only asymptotically (n → ∞)
→ possible compromise between k and n: k = √(n / |Y|).

Marc Sebban (LabHC) Introduction to Machine Learning 24 / 64
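
As a rough illustration of this compromise (a sketch only; the synthetic dataset and the use of scikit-learn's KNeighborsClassifier are assumptions, not part of the slides):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=10, n_classes=2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

n, n_classes = len(X_tr), len(np.unique(y_tr))
k = max(1, int(np.sqrt(n / n_classes)))   # heuristic k = sqrt(n / |Y|)

clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
print(f"k = {k}, test accuracy = {clf.score(X_te, y_te):.3f}")
```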


Nearest Neighbors

Exercise
Let us consider the following discrete probability functions of two classes
y1 (solid line) and y2 (dashed line).

[Figure: the two discrete distributions of y1 and y2.]

Question
Verify that the previous theorem holds.

Marc Sebban (LabHC) Introduction to Machine Learning 25 / 64


Nearest Neighbors

Curse of dimensionality
Fukunaga [1987] studied the effect of the representation dimension on the
NN behavior (curse of dimensionality).

Marc Sebban (LabHC) Introduction to Machine Learning 26 / 64


Nearest Neighbors

Problems of k-NN

To prevent the kNN algorithm from facing the curse of dimensionality,
dimensionality reduction is required (PCA, feature selection, metric
learning, etc.).
To converge, the k-nearest neighbor algorithm requires a large
number of training examples.
However, a large number of training examples implies a large
space/time complexity.

Two strategies to overcome these problems:

Reduce the size of S while keeping the most relevant examples (e.g.
the condensed nearest neighbor rule (Hart 1968)).
Simplify the calculation of the nearest neighbor.

Marc Sebban (LabHC) Introduction to Machine Learning 27 / 64


Nearest Neighbors

Data reduction techniques

Preliminary step: remove from S the outliers and the examples lying in the
Bayes error region.

Input: S
Output: Scleaned
Split S randomly into two subsets S1 and S2;
while no stabilization of S1 and S2 do
    Classify S1 with S2 using the 1-NN rule;
    Remove from S1 the misclassified instances;
    Classify S2 with the new set S1 using the 1-NN rule;
    Remove from S2 the misclassified instances;
Scleaned = S1 ∪ S2;

Marc Sebban (LabHC) Introduction to Machine Learning 28 / 64
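
A minimal sketch of this cleaning step (the 1-NN helper, the random split and the stopping test are details assumed here, not the authors' code):

```python
import numpy as np

def nn_predict(X_ref, y_ref, X):
    """1-NN prediction of each row of X using the reference set (X_ref, y_ref)."""
    d = np.linalg.norm(X[:, None, :] - X_ref[None, :, :], axis=2)  # pairwise distances
    return y_ref[np.argmin(d, axis=1)]

def preliminary_cleaning(X, y, seed=0):
    """Cross-classify two random halves with the 1-NN rule and drop the
    misclassified points, until both halves are stable."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    s1, s2 = list(idx[: len(X) // 2]), list(idx[len(X) // 2:])
    changed = True
    while changed:                       # "while no stabilization of S1 and S2"
        changed = False
        # Classify S1 with S2 using the 1-NN rule; remove the misclassified instances
        keep = nn_predict(X[s2], y[s2], X[s1]) == y[s1]
        if not keep.all():
            s1 = [i for i, k in zip(s1, keep) if k]
            changed = True
        # Classify S2 with the new S1 using the 1-NN rule; remove the misclassified instances
        keep = nn_predict(X[s1], y[s1], X[s2]) == y[s2]
        if not keep.all():
            s2 = [i for i, k in zip(s2, keep) if k]
            changed = True
    kept = s1 + s2                       # S_cleaned = S1 ∪ S2
    return X[kept], y[kept]
```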


Nearest Neighbors

Illustration

Marc Sebban (LabHC) Introduction to Machine Learning 29 / 64


Nearest Neighbors

The condensed nearest neighbor rule (CNN)

Second step: remove the irrelevant examples.


Input: S
Output: STORAGE
STORAGE ← ∅; DUSTBIN ← ∅;
Draw randomly a training example from S and put it in STORAGE;
while no stabilization of STORAGE do
    foreach xi ∈ S do
        if xi is correctly classified with STORAGE using the 1-NN rule then
            DUSTBIN ← xi
        else
            STORAGE ← xi
return STORAGE;

Marc Sebban (LabHC) Introduction to Machine Learning 30 / 64
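
A minimal sketch of the CNN rule (the 1-NN helper and the random starting example are the only design choices made here; this is a sketch, not Hart's original code):

```python
import numpy as np

def one_nn_label(X_ref, y_ref, x):
    """Label of the nearest neighbor of x in the reference set."""
    return y_ref[np.argmin(np.linalg.norm(X_ref - x, axis=1))]

def condensed_nn(X, y, seed=0):
    """Condensed nearest neighbor rule (Hart 1968): keep only the examples
    needed for the 1-NN rule to classify S correctly."""
    rng = np.random.default_rng(seed)
    storage = [rng.integers(len(X))]          # start with one random example
    changed = True
    while changed:                            # "while no stabilization of STORAGE"
        changed = False
        for i in range(len(X)):
            if i in storage:
                continue
            # If xi is misclassified by the current STORAGE, add it to STORAGE
            if one_nn_label(X[storage], y[storage], X[i]) != y[i]:
                storage.append(i)
                changed = True
    return X[storage], y[storage]
```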


Nearest Neighbors

Illustration

Marc Sebban (LabHC) Introduction to Machine Learning 31 / 64


Nearest Neighbors

How to speed-up the nearest-neighbor calculation?

X: new query to classify
X′: current NN of X at a distance dmin

[Figures (slides 32–41): the training points X1, ..., X10 are examined one after the
other; the current nearest neighbor X′ and the radius dmin are updated, and at each
step whole groups of training points are eliminated without computing their distance
to X.]

Pruning rule (illustrated with the examined point X2)
Since d(X, X2) > dmin, the training points that are
- in the sphere centered at X2 of radius d(X, X2) − dmin, or
- out of the sphere centered at X2 of radius d(X, X2) + dmin
cannot be NN(X).

Explanation (for a point X9 inside the inner sphere)
d(X9, X2) ≤ d(X, X2) − dmin ⟺ d(X9, X2) + dmin ≤ d(X, X2)   (1)
By the triangle inequality:
d(X, X2) ≤ d(X, X9) + d(X9, X2)   (2)
From (1) and (2), we get:
d(X9, X2) + dmin ≤ d(X, X9) + d(X9, X2) ⟺ dmin ≤ d(X, X9)

The same elimination rule is then applied with the next examined points (e.g. X4),
shrinking the set of candidate neighbors until NN(X) is found.

Marc Sebban (LabHC) Introduction to Machine Learning 32–41 / 64
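
A minimal sketch of this kind of elimination (using a single pivot with precomputed distances is an assumption made for compactness; the slides apply the rule to several examined points in turn):

```python
import numpy as np

def nn_with_pruning(X_train, x, pivot_idx=0):
    """Nearest-neighbor search that skips points ruled out by the triangle inequality,
    using precomputed distances to one pivot point."""
    d_pivot = np.linalg.norm(X_train - X_train[pivot_idx], axis=1)  # d(Xi, pivot), computed offline
    d_x_pivot = np.linalg.norm(x - X_train[pivot_idx])              # d(X, pivot)
    best_i, d_min = pivot_idx, d_x_pivot
    for i in range(len(X_train)):
        # |d(X, pivot) - d(Xi, pivot)| > d_min  =>  d(X, Xi) > d_min, so Xi cannot be NN(X)
        if abs(d_x_pivot - d_pivot[i]) > d_min:
            continue
        d = np.linalg.norm(x - X_train[i])   # exact distance only for surviving candidates
        if d < d_min:
            best_i, d_min = i, d
    return best_i, d_min
```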


Nearest Neighbors

kNN and KD-trees

[Figures (slide 42): a KD-tree is built over the points (2,3), (5,4), (9,6), (4,7),
(8,1), (7,2) by splitting alternately on the x and y coordinates:
root (7,2); left subtree rooted at (5,4) with children (2,3) and (4,7);
right subtree rooted at (9,6) with child (8,1).]

How to search the NN for a new query point? X = (6,7)

Start with Best distance = infinity, Current best = none, then descend the tree
and backtrack, pruning the branches that cannot contain a closer point:

d((6,7); (7,2)) = 5.09 → Current best = (7,2), Best distance = 5.09
d((6,7); (5,4)) = 3.16 → Current best = (5,4), Best distance = 3.16
d((6,7); (4,7)) = 2    → Current best = (4,7), Best distance = 2
d((6,7); (9,6)) = 3.16 → no improvement
d((6,7); (8,1)) = 6.32 → no improvement

The nearest neighbor of X = (6,7) is therefore (4,7).

Marc Sebban (LabHC) Introduction to Machine Learning 42 / 64
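
A minimal KD-tree sketch reproducing this example (the median-split construction and the recursive search below are standard textbook choices, assumed here rather than taken from the slides):

```python
import math

def build_kdtree(points, depth=0):
    """Build a 2-d tree by splitting alternately on the x and y coordinates."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nn_search(node, query, best=None):
    """Return (best_point, best_distance), pruning subtrees farther than the current best."""
    if node is None:
        return best
    d = math.dist(node["point"], query)
    if best is None or d < best[1]:
        best = (node["point"], d)
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nn_search(near, query, best)
    if abs(diff) < best[1]:                   # splitting plane closer than the best distance:
        best = nn_search(far, query, best)    # the other side may still contain the NN
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
print(nn_search(tree, (6, 7)))   # -> ((4, 7), 2.0)
```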


Nearest Neighbors

Conclusion

With a sufficiently large number of training examples, an NN classifier
is able to converge towards very complex target functions.
It is simple and theoretically well founded.
There exist several solutions to overcome its algorithmic complexity
issues (time and space).

Marc Sebban (LabHC) Introduction to Machine Learning 43 / 64


Naive Bayes Classifier

Back to the Bayes Classifier

Bayes Classifier
Let x be a d-dimensional feature vector x = (x1, ..., xd).

y*(x) = arg max_c p(yc|x) = arg max_c p(yc|x1, ..., xd)

that corresponds to

y*(x) = arg max_c p(x|yc) · p(yc) = arg max_c p(x1, ..., xd|yc) · p(yc).

Naive Bayes Classifier

The naive Bayes classifier assumes that the value of a particular feature is
independent of the value of any other feature, given the class variable.

ŷ(x) = ŷ(x1, ..., xd) = arg max_c p(yc) · Π_{i=1}^{d} p(xi|yc)

Marc Sebban (LabHC) Introduction to Machine Learning 44 / 64
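
A minimal sketch of this decision rule for categorical features (the add-one smoothing, the log-space computation and the toy data are assumptions made for illustration):

```python
import numpy as np

def fit_naive_bayes(X, y, alpha=1.0):
    """Estimate p(y_c) and p(x_i | y_c) for categorical features, with add-alpha smoothing."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    cond = {}                                    # cond[(c, i)][v] = p(x_i = v | y_c)
    for c in classes:
        Xc = X[y == c]
        for i in range(X.shape[1]):
            values = np.unique(X[:, i])          # all values observed for feature i
            counts_c = np.array([np.sum(Xc[:, i] == v) for v in values])
            probs = (counts_c + alpha) / (len(Xc) + alpha * len(values))
            cond[(c, i)] = dict(zip(values, probs))
    return priors, cond

def predict_naive_bayes(x, priors, cond):
    """y_hat(x) = argmax_c p(y_c) * prod_i p(x_i | y_c), computed in log space."""
    scores = {}
    for c in priors:
        log_p = np.log(priors[c])
        for i, v in enumerate(x):
            log_p += np.log(cond[(c, i)][v])
        scores[c] = log_p
    return max(scores, key=scores.get)

# Toy usage: 2 categorical features, 2 classes
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 1, 1, 1])
priors, cond = fit_naive_bayes(X, y)
print(predict_naive_bayes([0, 1], priors, cond))   # -> 0
```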


Naive Bayes Classifier

Naive Bayes Classifier


PROS
Highly scalable. Each probability can be independently estimated as a
one-dimensional distribution.
This helps alleviate problems stemming from the curse of
dimensionality (the need for a number of examples that scales exponentially
with the number of features).
Allows us to smooth the probabilities.

Marc Sebban (LabHC) Introduction to Machine Learning 45 / 64




Learning from Imbalanced Datasets

Learning from Imbalanced Datasets

Marc Sebban (LabHC) Introduction to Machine Learning 46 / 64


Learning from Imbalanced Datasets

Learning from Imbalanced Datasets

In real-world applications, datasets can be (highly) imbalanced:


Credit card transactions: < 0.2% frauds and 99.8% genuine
transactions
Medical screening: e.g. HIV prevalence in the USA is 0.4%
Disk drive failure: < 1%
etc.

What is the impact of an imbalanced dataset on the performance of the


classifiers?

Marc Sebban (LabHC) Introduction to Machine Learning 47 / 64


Learning from Imbalanced Datasets

k-NN and imbalanced datasets

[Figures (slide 48): behavior of the k-NN decision rule on a dataset where one class
heavily outnumbers the other.]

Marc Sebban (LabHC) Introduction to Machine Learning 48 / 64


Learning from Imbalanced Datasets

Neural Network and imbalanced datasets

"Parallel one-class extreme learning machine for imbalance learning based on
Bayesian approach", Journal of Ambient Intelligence and Humanized Computing,
Li et al., 2018.

[Figure: linear model with Gaussian noise.]

Marc Sebban (LabHC) Introduction to Machine Learning 49 / 64


Learning from Imbalanced Datasets

Learning from imbalanced datasets

More generally, how to learn from (highly) imbalanced datasets?

In credit card transactions:


< 0.2% correspond to frauds and 99.8% genuine transactions.
Binary classification problem where
y = +1 for a fraud (called a positive example)
y = −1 for a genuine transaction (called a negative example)
and we are looking for a hypothesis h(x) as accurate as possible.

Then...
A classifier h(x) which would predict h(x) = −1 ∀x is 99.8% accurate
while missing all the frauds!

→ Conclusion: minimizing the error rate (or a surrogate function) is
not suited to an imbalanced setting.

Marc Sebban (LabHC) Introduction to Machine Learning 50 / 64


Learning from Imbalanced Datasets

Metrics for learning from imbalanced datasets

Accuracy = (TP + TN) / (TP + FP + TN + FN) = 1 − Error

Marc Sebban (LabHC) Introduction to Machine Learning 51 / 64


Learning from Imbalanced Datasets

F-Measure
When #positives << #negatives, we would rather optimize other metrics such
as the Fβ-Measure:

Fβ-Measure = (1 + β²) × Precision × Recall / ((β² × Precision) + Recall)

where:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

The Fβ-Measure is a non-convex (even with surrogates) and non-separable
function (Fβ ≠ Σ_{(xi,yi)∈S} ...):
The loss for one point depends on the others.
It is impossible to optimize directly.
Marc Sebban (LabHC) Introduction to Machine Learning 52 / 64
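
A minimal sketch of these quantities computed from raw predictions (the toy labels in {−1, +1} are an assumption):

```python
import numpy as np

def f_beta(y_true, y_pred, beta=1.0):
    """Precision, recall and F_beta from binary labels in {-1, +1}."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == -1))
    fn = np.sum((y_pred == -1) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

# Toy imbalanced example: 2 positives among 10 points
y_true = np.array([1, 1, -1, -1, -1, -1, -1, -1, -1, -1])
y_pred = np.array([1, -1, 1, -1, -1, -1, -1, -1, -1, -1])
print(f_beta(y_true, y_pred, beta=1.0))   # precision 0.5, recall 0.5, F1 0.5
```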
Learning from Imbalanced Datasets

Learning from imbalanced datasets


Fβ − Measure depends on the decision threshold defined for the classifier.

Example (k-NN): Let p+(x) (resp. p−(x)) denote the proportion of positives
(resp. negatives) in the neighborhood of a query x.
In a standard k-NN, we usually use the following decision rule: if
p+(x) > p−(x), which is equivalent to p+(x) > 0.5, then h(x) = +1,
otherwise h(x) = −1.
The threshold 0.5 is well suited to the balanced scenario. For imbalanced
datasets, we can play with this threshold to increase the F-Measure. One
can also tune k so as to maximize the F-Measure.
Marc Sebban (LabHC) Introduction to Machine Learning 53 / 64
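
A minimal sketch of this threshold tuning (the grid of candidate thresholds, the F1 choice and the validation split are assumptions of the sketch):

```python
import numpy as np

def knn_positive_proportion(x, X_train, y_train, k=5):
    """p+(x): proportion of positive neighbors among the k nearest neighbors of x."""
    nearest = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
    return np.mean(y_train[nearest] == 1)

def tune_threshold(X_val, y_val, X_train, y_train, k=5):
    """Pick the decision threshold on p+(x) that maximizes the F1-measure on a validation set."""
    p_plus = np.array([knn_positive_proportion(x, X_train, y_train, k) for x in X_val])
    best_t, best_f1 = 0.5, -1.0
    for t in np.linspace(0.05, 0.95, 19):        # candidate thresholds (0.5 = standard rule)
        y_pred = np.where(p_plus > t, 1, -1)
        tp = np.sum((y_pred == 1) & (y_val == 1))
        fp = np.sum((y_pred == 1) & (y_val == -1))
        fn = np.sum((y_pred == -1) & (y_val == 1))
        f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```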
Learning from Imbalanced Datasets

Ranking Metrics independent of a threshold


AUCROC = (1 / (P·N)) Σ_{i=1}^{P} Σ_{j=1}^{N} [ p+(xi) > p+(xj) ]

AP = (1/P) Σ_{i=1}^{P} precision(ki)

where precision(ki) is the precision with respect to the rank ki of the
positive example xi, P is the number of positive examples and N the number of
negative examples.

Marc Sebban (LabHC) Introduction to Machine Learning 54 / 64
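
A minimal sketch computing both ranking metrics directly from these definitions (the toy scores are an assumption):

```python
import numpy as np

def auc_roc(scores_pos, scores_neg):
    """AUC-ROC = fraction of (positive, negative) pairs ranked correctly."""
    correct = sum(sp > sn for sp in scores_pos for sn in scores_neg)
    return correct / (len(scores_pos) * len(scores_neg))

def average_precision(scores_pos, scores_neg):
    """AP = mean of precision@rank over the ranks of the positive examples."""
    scores = np.concatenate([scores_pos, scores_neg])
    labels = np.concatenate([np.ones(len(scores_pos)), np.zeros(len(scores_neg))])
    labels = labels[np.argsort(-scores)]                  # sort by decreasing score
    precisions = []
    for rank, is_pos in enumerate(labels, start=1):
        if is_pos:
            precisions.append(np.sum(labels[:rank]) / rank)   # precision at this rank
    return float(np.mean(precisions))

scores_pos = np.array([0.9, 0.6, 0.3])        # scores p+(x) of the positives
scores_neg = np.array([0.8, 0.4, 0.2, 0.1])   # scores of the negatives
print(auc_roc(scores_pos, scores_neg))        # 9/12 = 0.75
print(average_precision(scores_pos, scores_neg))
```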


Learning from Imbalanced Datasets

AUCROC versus AP
AP is much better suited when the budget, in terms of the number of checks allowed
to human controllers, is limited, as in bank or tax fraud detection.

Marc Sebban (LabHC) Introduction to Machine Learning 55 / 64


Learning from Imbalanced Datasets

Summary
F-Measure, AUCROC and AP are neither differentiable nor convex.
Therefore, they are difficult to optimize directly.

Other strategies:
Sampling methods
Modification of the decision boundaries (e.g. γ-NN)

Marc Sebban (LabHC) Introduction to Machine Learning 56 / 64


Learning from Imbalanced Datasets

Sampling methods

Marc Sebban (LabHC) Introduction to Machine Learning 57 / 64


Learning from Imbalanced Datasets

Sampling methods

Marc Sebban (LabHC) Introduction to Machine Learning 58 / 64


Learning from Imbalanced Datasets

SMOTE: Synthetic Minority Oversampling Technique

Marc Sebban (LabHC) Introduction to Machine Learning 59 / 64
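
A minimal sketch of the SMOTE interpolation idea (the number of neighbors, the amount of oversampling and the toy data are assumptions; this is not the full original algorithm):

```python
import numpy as np

def smote(X_minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority examples by interpolating each selected
    example with one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        # k nearest minority neighbors of x_i (excluding x_i itself)
        d = np.linalg.norm(X_minority - X_minority[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]
        j = rng.choice(neighbors)
        lam = rng.random()                       # interpolation coefficient in [0, 1)
        synthetic.append(X_minority[i] + lam * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

# Toy usage: oversample a 4-point minority class with 10 synthetic points
X_min = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]])
print(smote(X_min, n_new=10, k=2).shape)   # (10, 2)
```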


Learning from Imbalanced Datasets

Sampling methods

Can be unstable by generating positives in “negative” regions.

Marc Sebban (LabHC) Introduction to Machine Learning 60 / 64
Learning from Imbalanced Datasets

Modification of the boundaries: γ-NN [Viola et al. 2021]

[Figures (slides 61–63): effect of the γ-NN modification of the decision boundaries
on an imbalanced dataset.]

Marc Sebban (LabHC) Introduction to Machine Learning 61–63 / 64


Learning from Imbalanced Datasets

Presentation of the Project


+
Multiple Choice Questions MCQ3

Marc Sebban (LabHC) Introduction to Machine Learning 64 / 64
