Intro To ML PART3
Marc Sebban
Bayes Error
The Bayes error ϵB is the lowest possible error rate (or irreducible error) achievable by any hypothesis h (for the sake of simplicity, f and p will be used interchangeably). In the binary setting,

\epsilon_B = \int_x f(y_{\min}(x) \mid x)\, f(x)\, dx
[Figure: class-conditional densities p(x|y1) and p(x|y2), and the posterior f(y1|x)]
Bayesian Classifier
The Bayesian classifier predicts the optimal class y∗ ∈ Y of an example x ∈ X by applying the Maximum A Posteriori (MAP) decision rule:

\forall y_c \in Y, \quad p(y_c \mid x) = \frac{p(x \mid y_c)\, p(y_c)}{p(x)}
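As an illustration (a minimal sketch, not from the original slides), the MAP rule can be applied directly once p(x|yc) and p(yc) are known; here the two one-dimensional Gaussian class-conditional densities and the priors are hypothetical:

import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities p(x|y_c): 1-D Gaussians (mean, std).
class_params = {"y1": (0.0, 1.0), "y2": (2.0, 1.0)}
priors = {"y1": 0.5, "y2": 0.5}  # p(y_c)

def map_predict(x):
    # MAP rule: argmax_c p(x|y_c) * p(y_c); p(x) is a common factor and can be dropped.
    scores = {c: norm.pdf(x, mu, sigma) * priors[c]
              for c, (mu, sigma) in class_params.items()}
    return max(scores, key=scores.get)

print(map_predict(0.3))  # -> 'y1'
print(map_predict(1.8))  # -> 'y2'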
Exercise
Reminder:

\epsilon_B = \int_x f(y_{\min}(x) \mid x)\, f(x)\, dx

Consider the two class-conditional densities

f_{c_1}(y) = \frac{3}{2} y^2 + y \quad \text{for } 0 < y < 1
f_{c_2}(y) = 1 \quad \text{for } \frac{3}{4} < y < \frac{7}{4}

Question
After drawing these densities in the plane, compute the Bayes error ϵB.
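A numerical sanity check of this exercise (my sketch; it assumes equal priors p(c1) = p(c2) = 1/2, which the exercise does not state):

import numpy as np

# Class-conditional densities from the exercise.
def f_c1(y):
    return np.where((y > 0) & (y < 1), 1.5 * y**2 + y, 0.0)

def f_c2(y):
    return np.where((y > 0.75) & (y < 1.75), 1.0, 0.0)

p1 = p2 = 0.5  # assumed equal priors (not given in the exercise)

# eps_B = integral of min(p1 * f_c1(y), p2 * f_c2(y)) dy.
ys = np.linspace(0.0, 2.0, 200_001)
eps_B = np.trapz(np.minimum(p1 * f_c1(ys), p2 * f_c2(ys)), ys)
print(f"Bayes error ~ {eps_B:.4f}")  # -> ~0.125 under the equal-prior assumption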
Theorem
Let p̂n(x) = \frac{k_n}{n V_n} denote the estimate of p(x) from a training sample S of size n. As n increases, p̂n(x) converges to p(x) if the following three conditions are fulfilled:

\lim_{n \to \infty} V_n = 0, \qquad \lim_{n \to \infty} k_n = \infty, \qquad \lim_{n \to \infty} \frac{k_n}{n} = 0
Interpretation
If n is large, we get:
- a good estimate p̂(x) of p(x) (resp. p̂(x|yj) of p(x|yj) if we take the class into account);
- a good approximation of the Bayesian method.
Bayesian Classifier
k-nearest neighbors density estimation with k_n = \sqrt{n}:

[Figure: k-NN estimates of the standard normal density p(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} x^2}]
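A small sketch of this estimator (assuming, as the figure suggests, k_n = √n and a sample drawn from the standard normal):

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
sample = rng.standard_normal(n)  # points drawn from p(x) = N(0, 1)
k_n = int(np.sqrt(n))            # k_n = sqrt(n), as in the figure

def knn_density(x):
    # p_hat_n(x) = k_n / (n * V_n), where V_n is the length of the smallest
    # symmetric interval around x containing its k_n nearest sample points.
    dists = np.sort(np.abs(sample - x))
    V_n = 2.0 * dists[k_n - 1]
    return k_n / (n * V_n)

print(knn_density(0.0))  # close to 1/sqrt(2*pi) ~ 0.3989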
k-NN as a classifier
Let us suppose we have nj examples of class yj in S, such that \sum_j n_j = n, and that the hypersphere around x contains kj examples of class yj (\sum_j k_j = k). To classify a new example x from its k nearest neighbors, we apply the Bayes rule with the k-NN density estimates:

h(x) = \arg\max_j \frac{\hat{p}(x \mid y_j)\, \hat{p}(y_j)}{\hat{p}(x)} = \arg\max_j \frac{\frac{k_j}{n_j V_n} \times \frac{n_j}{n}}{\frac{k}{n V_n}} = \arg\max_j \frac{k_j}{k}
Input: x, S, k, d
Output: class of x
foreach (x′, y′) ∈ S do
    compute the distance d(x′, x);
Sort the n distances in increasing order;
Count the occurrences of each class yj among the k nearest neighbors;
Assign to x the most frequent class;
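This pseudocode translates directly into a brute-force Python sketch (using the Euclidean distance as d):

import numpy as np
from collections import Counter

def knn_classify(x, S, k, d=lambda a, b: np.linalg.norm(a - b)):
    # S is a list of (example, label) pairs.
    # Compute the distance d(x', x) for each (x', y') in S.
    dists = [(d(np.asarray(xp), np.asarray(x)), yp) for xp, yp in S]
    # Sort the n distances in increasing order.
    dists.sort(key=lambda t: t[0])
    # Count the occurrences of each class among the k nearest neighbors.
    votes = Counter(yp for _, yp in dists[:k])
    # Assign to x the most frequent class.
    return votes.most_common(1)[0][0]

S = [((1.0, 1.0), "y1"), ((1.2, 0.8), "y1"), ((3.0, 3.2), "y2"), ((2.9, 3.0), "y2")]
print(knn_classify((1.1, 1.0), S, k=3))  # -> 'y1'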
Theorem
Let x′ be the nearest neighbor of x, and let pϵ denote the probability mass of the ball of radius ϵ centered at x. Then

p(d(x, x') > \epsilon) = (1 - p_\epsilon)^n \xrightarrow[n \to \infty]{} 0

Corollary
If n → ∞, p(yj |x′) ≈ p(yj |x).
Theorem
The generalization error ϵ1NN of the 1-nearest neighbor rule is bounded by twice the (optimal) Bayes error ϵB:

\epsilon_{1NN} \le 2\, \epsilon_B
[Figure: densities p(x|y1) and p(x|y2); in the region where p(y2|x) > p(y1|x), the optimal prediction is y∗(x) = y2 and the pointwise Bayes error is ϵB(x) = p(y1|x)]
Proof.
Bayes error:

\epsilon_B = \sum_{x \in X} p(y_{\min} \mid x)\, p(x)

1NN error (with x′ the nearest neighbor of x):

\epsilon_{1NN} = \sum_{x \in X} \left( p(y_1 \mid x)\, p(y_2 \mid x') + p(y_2 \mid x)\, p(y_1 \mid x') \right) p(x) \le 2\, \epsilon_B

and, in the continuous case,

\epsilon_{1NN} = \int_{x \in X} \left( f(y_1 \mid x)\, f(y_2 \mid x') + f(y_2 \mid x)\, f(y_1 \mid x') \right) f(x)\, dx \le 2\, \epsilon_B
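An empirical illustration of the bound (my simulation on a hypothetical one-dimensional mixture, not part of the slides): two equiprobable classes with p(x|y=0) = N(0,1) and p(x|y=1) = N(2,1) have a known Bayes error ϵB = Φ(−1) ≈ 0.159, and the 1-NN test error stays below 2ϵB:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def draw(m):
    # Two equiprobable classes: p(x|y=0) = N(0,1), p(x|y=1) = N(2,1).
    y = rng.integers(0, 2, size=m)
    return rng.normal(2.0 * y, 1.0), y

x_tr, y_tr = draw(4_000)
x_te, y_te = draw(4_000)

# 1-NN rule: each test point takes the label of its closest training point.
nn = np.abs(x_te[:, None] - x_tr[None, :]).argmin(axis=1)
err_1nn = np.mean(y_tr[nn] != y_te)

eps_B = norm.cdf(-1.0)  # Bayes error of this mixture
print(f"eps_1NN ~ {err_1nn:.3f} <= 2 * eps_B = {2 * eps_B:.3f}")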
Nearest Neighbors
Graphically
[Figure: plot with both axes graduated from 0 to 0.5]
Reminder
The previous property holds only asymptotically (n → ∞).
→ possible compromise between k and n: k = \sqrt{\frac{n}{|Y|}}
Exercise
Let us consider the following discrete probability functions of two classes y1 (solid line) and y2 (dashed line).
[Figure: discrete probability functions of y1 (solid) and y2 (dashed)]
Question
Verify that the previous theorem holds.
Curse of dimensionality
Fukunaga [1987] studied the effect of the representation dimension on the behavior of the nearest-neighbor rule (the curse of dimensionality).
Problems of k-NN
Preliminary step: remove from S the outliers and the examples lying in the Bayes error region.

Input: S
Output: Scleaned
Split S randomly into two subsets S1 and S2;
while S1 and S2 are not stabilized do
    classify S1 with S2 using the 1-NN rule;
    remove from S1 the misclassified instances;
    classify S2 with the new S1 using the 1-NN rule;
    remove from S2 the misclassified instances;
Scleaned = S1 ∪ S2;
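A Python sketch of this editing procedure (brute-force 1-NN; the helper names are mine):

import numpy as np

def nn_labels(queries, ref_X, ref_y):
    # 1-NN rule: label of the closest reference point, brute force.
    idx = [int(np.argmin(np.linalg.norm(ref_X - q, axis=1))) for q in queries]
    return ref_y[np.asarray(idx, dtype=int)]

def edit_dataset(X, y, seed=0):
    rng = np.random.default_rng(seed)
    # Split S randomly into two subsets S1 and S2 (as index sets).
    perm = rng.permutation(len(X))
    parts = [perm[: len(X) // 2], perm[len(X) // 2:]]
    sizes = None
    while sizes != (len(parts[0]), len(parts[1])):  # until stabilization
        sizes = (len(parts[0]), len(parts[1]))
        for a, b in ((0, 1), (1, 0)):
            # Classify S_a with S_b using the 1-NN rule,
            # then remove from S_a the misclassified instances.
            pred = nn_labels(X[parts[a]], X[parts[b]], y[parts[b]])
            parts[a] = parts[a][pred == y[parts[a]]]
    keep = np.concatenate(parts)  # S_cleaned = S1 U S2
    return X[keep], y[keep]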
Illustration
[Figure: step-by-step 1-NN search for a query X among the points X1, ..., X10: the distances d(X, Xi) are computed one by one (e.g. d(X, X2), d(X, X4)) and the current minimum dmin is updated, yielding NN(X)]
[Figure: construction of a kd-tree on the points (2,3), (5,4), (9,6), (4,7), (8,1), (7,2), then search for the nearest neighbor of the query X = (6,7)]
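These are the points of a classic kd-tree example; a sketch of the illustrated search using scipy's KDTree (a reconstruction, not the original slide code):

from scipy.spatial import KDTree

# The six points from the illustration.
points = [(4, 7), (9, 6), (5, 4), (2, 3), (7, 2), (8, 1)]
tree = KDTree(points)

# Nearest neighbor of the query X = (6, 7).
dist, idx = tree.query((6, 7), k=1)
print(points[idx], dist)  # -> (4, 7), distance 2.0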
Conclusion

y^*(x) = \arg\max_c p(y_c \mid x) = \arg\max_c p(y_c \mid x_1, \dots, x_d),

which corresponds to

y^*(x) = \arg\max_c p(x \mid y_c)\, p(y_c) = \arg\max_c p(x_1, \dots, x_d \mid y_c)\, p(y_c).
Then...
A classifier h(x) that predicts h(x) = −1 for all x is 99.8% accurate while missing all the frauds!
\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} = 1 - \text{Error}
F-Measure
When the positives are far outnumbered by the negatives (#positives ≪ #negatives), one rather optimizes other metrics such as the Fβ-Measure:

F_\beta = \frac{(1 + \beta^2) \times \text{Precision} \times \text{Recall}}{(\beta^2 \times \text{Precision}) + \text{Recall}}

where

\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}
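For concreteness, a small sketch computing these quantities from hypothetical confusion-matrix counts:

def f_beta(tp, fp, fn, beta=1.0):
    # F_beta-Measure from confusion-matrix counts.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return ((1 + beta**2) * precision * recall) / (beta**2 * precision + recall)

# Hypothetical fraud-detection counts: few positives, many negatives.
print(f_beta(tp=40, fp=20, fn=60))           # F1 = 0.5
print(f_beta(tp=40, fp=20, fn=60, beta=2.0)) # F2 weights recall more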
The Fβ-Measure is non-convex (even with surrogates) and non-separable, i.e. F_\beta \ne \sum_{(x_i, y_i) \in S} \dots : the loss for one point depends on the others, so it is impossible to optimize directly.
Learning from Imbalanced Datasets

AP = \frac{1}{P} \sum_{i=1}^{P} \text{precision}(k_i)
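A sketch of this computation (assuming, as is standard, that precision(k_i) is the precision at the rank of the i-th positive among the P positives):

import numpy as np

def average_precision(y_true, scores):
    # AP = (1/P) * sum over the P positives of the precision at their rank.
    order = np.argsort(scores)[::-1]   # rank by decreasing score
    y_sorted = np.asarray(y_true)[order]
    hits = np.cumsum(y_sorted)         # number of positives seen so far
    ranks = np.arange(1, len(y_sorted) + 1)
    return (hits / ranks)[y_sorted == 1].mean()

print(average_precision([1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.4, 0.2]))  # -> ~0.833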
AUCROC versus AP
AP is much better suited when the budget, in terms of the number of checks allowed to human controllers, is limited, as in bank or tax fraud detection.
Summary
The F-Measure, AUCROC and AP are neither differentiable nor convex; they are therefore difficult to optimize directly.
Other strategies:
- Sampling methods (a sketch follows below)
- Modification of the decision boundaries (e.g. γ-NN)
Sampling methods
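The slides stop here; as a minimal illustration of what such sampling methods do (my sketch, not from the slides), random undersampling and random oversampling can be written as:

import numpy as np

def random_under_sample(X, y, majority_label, seed=0):
    # Randomly drop majority examples until both classes have the same size.
    rng = np.random.default_rng(seed)
    maj = np.flatnonzero(y == majority_label)
    mino = np.flatnonzero(y != majority_label)
    keep = np.concatenate([rng.choice(maj, size=len(mino), replace=False), mino])
    return X[keep], y[keep]

def random_over_sample(X, y, minority_label, seed=0):
    # Randomly duplicate minority examples until both classes have the same size.
    rng = np.random.default_rng(seed)
    mino = np.flatnonzero(y == minority_label)
    maj = np.flatnonzero(y != minority_label)
    keep = np.concatenate([maj, rng.choice(mino, size=len(maj), replace=True)])
    return X[keep], y[keep]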