Lec02 Machine Learning
September 9, 2014
Outline

1 Administration
Administration

Entrance exam
- All were graded; about 87% have passed.
- 96 students in Prof. Sha's section and 48 in Prof. Liu's section.
Intuitive example
Recognizing flowers
Intuitive example

[Figure: pairwise scatter plots of the iris features: petal length, petal width, sepal length, sepal width]
Multi-class classification
More terminology
Algorithm

Nearest neighbor

x(1) = x_{nn(x)}

where nn(x) \in [N] = \{1, 2, \ldots, N\}, i.e., the index of one of the training instances, and

nn(x) = \arg\min_{n \in [N]} \|x - x_n\|_2^2 = \arg\min_{n \in [N]} \sum_{d=1}^{D} (x_d - x_{nd})^2

Classification rule

y = f(x) = y_{nn(x)}
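The nearest-neighbor rule above can be sketched in a few lines of plain Python; the names `train_x` and `train_y` are placeholders for the training inputs and labels, not part of the slides.

```python
def nn_index(x, train_x):
    """Return nn(x): the index of the training instance closest to x
    under squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((ad - bd) ** 2 for ad, bd in zip(a, b))
    return min(range(len(train_x)), key=lambda n: sq_dist(x, train_x[n]))

def nn_classify(x, train_x, train_y):
    """Classification rule: y = f(x) = y_{nn(x)}."""
    return train_y[nn_index(x, train_x)]
```

Scanning all N training points for every query is what makes prediction O(ND), as noted later in the mini-summary.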
Algorithm

Visual example

In this 2-dimensional example, the nearest point to x is a red training instance; thus, x will be labeled as red.

[Figure (a): 2-dimensional example with axes x_1 and x_2]
Algorithm

category (y)   distance
setosa         1.75
versicolor     0.72
virginica      0.76
Algorithm

Decision boundary

For every point in the space, we can determine its label using the NNC rule. This gives rise to a decision boundary that partitions the space into different regions.

[Figure (b): decision boundary in the (x_1, x_2) plane]

Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu)
Algorithm

An alternative distance measure, the L1 (Manhattan) distance:

\sum_{d=1}^{D} |x_d - x_{nd}|
Algorithm

K-nearest neighbors: count the votes for each label c among the K nearest neighbors,

\sum_{n \in \mathrm{knn}(x)} \mathbb{I}(y_n = c), \quad c \in [C]
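A minimal sketch of this K-nearest-neighbor vote: each of the K closest training points votes with its label, and the majority wins. As before, `train_x` and `train_y` are placeholder names.

```python
from collections import Counter

def knn_classify(x, train_x, train_y, K=3):
    def sq_dist(a, b):
        return sum((ad - bd) ** 2 for ad, bd in zip(a, b))
    # indices of the K closest training instances: the set knn(x)
    knn = sorted(range(len(train_x)), key=lambda n: sq_dist(x, train_x[n]))[:K]
    # vote counts: sum over n in knn(x) of I(y_n == c), for each label c
    votes = Counter(train_y[n] for n in knn)
    return votes.most_common(1)[0][0]
```

With K = 1 this reduces to the plain nearest-neighbor rule.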
Algorithm

Example

[Figure (a): example points in the (x_1, x_2) plane]
Algorithm

[Figures: decision boundaries and the labels of x_6 and x_7 for K = 1, K = 3, and K = 31]
Algorithm

Mini-summary

Advantages of NNC
- Computationally simple and easy to implement: just compute distances.
- Theoretically, has strong guarantees of doing the right thing.

Disadvantages of NNC
- Computationally intensive for large-scale problems: O(ND) for labeling a data point.
- We need to carry the training data around; without it, we cannot do classification. This type of method is called nonparametric.
- Choosing the right distance measure and K can be involved.
We then compare our simple NNC classifier to the ideal one and show that it performs nearly as well.
Measuring performance

Training accuracy and error:

A^{train} = \frac{1}{N} \sum_n \mathbb{I}[f(x_n) = y_n], \quad \epsilon^{train} = \frac{1}{N} \sum_n \mathbb{I}[f(x_n) \neq y_n]

Test accuracy and error:

A^{test} = \frac{1}{M} \sum_m \mathbb{I}[f(x_m) = y_m], \quad \epsilon^{test} = \frac{1}{M} \sum_m \mathbb{I}[f(x_m) \neq y_m]
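These accuracy and error-rate definitions translate directly into code; the sketch below works for any classifier `f` and any labeled set of (x, y) pairs.

```python
def accuracy(f, xs, ys):
    """A = (1/N) * sum over n of I[f(x_n) == y_n]."""
    return sum(1 for x, y in zip(xs, ys) if f(x) == y) / len(xs)

def error_rate(f, xs, ys):
    """epsilon = (1/N) * sum over n of I[f(x_n) != y_n] = 1 - A."""
    return 1.0 - accuracy(f, xs, ys)
```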
Measuring performance

Example

Training data: A^{train} = 100%, \epsilon^{train} = 0%
Test data: A^{test} = 0%, \epsilon^{test} = 100%
Measuring performance

Leave-one-out (LOO)

Idea: for each training instance x_n, take it out of the training set and then label it. For NNC, x_n's nearest neighbor will not be itself, so the error rate would not necessarily become 0.
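The LOO idea above can be sketched for the 1-NN classifier: each x_n is labeled by its nearest neighbor among the *other* training points, and we report the fraction of mistakes.

```python
def loo_error(train_x, train_y):
    """Leave-one-out error rate of the 1-NN classifier."""
    def sq_dist(a, b):
        return sum((ad - bd) ** 2 for ad, bd in zip(a, b))
    mistakes = 0
    for i, (x, y) in enumerate(zip(train_x, train_y)):
        # nearest neighbor with x_i itself held out
        nn = min((n for n in range(len(train_x)) if n != i),
                 key=lambda n: sq_dist(x, train_x[n]))
        mistakes += train_y[nn] != y
    return mistakes / len(train_x)
```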
Measuring performance

Expected mistakes

Setup
- Assume our data (x, y) is drawn from the joint and unknown distribution p(x, y).
- Classification mistake on a single data point x with the ground-truth label y:

L(f(x), y) = 0 if f(x) = y, and 1 if f(x) \neq y

- Expected classification mistake on a single data point x:

R(f, x) = \mathbb{E}_{y \sim p(y|x)} L(f(x), y)

- The average classification mistake by the classifier itself:

R(f) = \mathbb{E}_{x \sim p(x)} R(f, x) = \mathbb{E}_{(x,y) \sim p(x,y)} L(f(x), y)
Measuring performance

Jargon
- L(f(x), y) is called the 0/1 loss function; many other forms of loss functions exist for different learning problems.
- Expected conditional risk: R(f, x) = \mathbb{E}_{y \sim p(y|x)} L(f(x), y)
- Expected risk: R(f) = \mathbb{E}_{(x,y) \sim p(x,y)} L(f(x), y)
- Empirical risk: R_D(f) = \frac{1}{N} \sum_n L(f(x_n), y_n)
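A small sketch tying these terms together: with the 0/1 loss, the empirical risk R_D(f) is exactly the error rate computed earlier.

```python
def zero_one_loss(prediction, y):
    """L(f(x), y): 0 if the prediction is correct, 1 otherwise."""
    return 0 if prediction == y else 1

def empirical_risk(f, xs, ys, loss=zero_one_loss):
    """R_D(f) = (1/N) * sum over n of L(f(x_n), y_n)."""
    return sum(loss(f(x), y) for x, y in zip(xs, ys)) / len(xs)
```

Swapping in a different `loss` function gives the empirical risk for other learning problems.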
The Bayes optimal classifier: f^*(x) = 0 if \eta(x) < 1/2, where \eta(x) = p(y = 1|x); equivalently, f^*(x) = 0 if p(y = 1|x) < p(y = 0|x).

Theorem
For any labeling function f(\cdot), R(f^*, x) \le R(f, x). Similarly, R(f^*) \le R(f). Namely, f^*(\cdot) is optimal.
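A toy check of the Bayes optimal rule, assuming a hypothetical discrete input space where `eta[x]` stores p(y = 1 | x) (the values below are made up for illustration). Under the 0/1 loss, the conditional risk of predicting a label is the probability that y differs from it, so the Bayes rule can never do worse than the opposite label.

```python
eta = {"A": 0.9, "B": 0.3, "C": 0.5}  # assumed values of p(y = 1 | x)

def bayes_classify(x):
    """f*(x) = 0 if p(y = 1 | x) < 1/2, else 1."""
    return 0 if eta[x] < 0.5 else 1

def conditional_risk(label, x):
    """R(f, x) under 0/1 loss: probability that y differs from the label."""
    return eta[x] if label == 0 else 1.0 - eta[x]
```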
Hyperparameters in NNC

Two practical issues about NNC
- Choosing K, i.e., the number of nearest neighbors (default is 1).
- Choosing the right distance measure (default is the Euclidean distance), for example, from the following generalized distance measure:

\|x - x_n\|_p = \left( \sum_{d=1}^{D} |x_d - x_{nd}|^p \right)^{1/p}

for p \ge 1.

These are not specified by the algorithm itself; resolving them requires empirical studies and is task/dataset-specific.
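The generalized (Minkowski) distance above is a one-liner; p = 2 recovers the Euclidean distance and p = 1 the L1 (Manhattan) distance.

```python
def lp_distance(x, xn, p=2):
    """||x - x_n||_p = (sum_d |x_d - x_nd|^p)^(1/p), for p >= 1."""
    return sum(abs(xd - xnd) ** p for xd, xnd in zip(x, xn)) ** (1.0 / p)
```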
Cross-validation

What if we do not have validation data? We split the training data into S equal parts.
Recipe
- Split the training data into S equal parts; denote each part as D_s^{train}.
- For each possible value of the hyperparameter (say K = 1, 3, \ldots, 100):
  - For every s \in [1, S]:
    - Train a model using D_{\backslash s}^{train} = D^{train} - D_s^{train}.
    - Evaluate the performance of the model on D_s^{train}.
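The recipe above can be sketched generically; here `train_model` and `evaluate` are placeholder callables (not from the slides) standing in for the actual training routine and an accuracy-style score to be maximized.

```python
def cross_validate(data, candidate_ks, train_model, evaluate, S=5):
    """Pick the hyperparameter value with the best average held-out score."""
    # Split the training data into S (roughly) equal parts D_s^train.
    folds = [data[s::S] for s in range(S)]
    scores = {}
    for K in candidate_ks:
        fold_scores = []
        for s in range(S):
            held_out = folds[s]
            # D^train minus D_s^train: every fold except fold s
            rest = [x for t, fold in enumerate(folds) if t != s for x in fold]
            model = train_model(rest, K)
            fold_scores.append(evaluate(model, held_out))
        scores[K] = sum(fold_scores) / S
    return max(scores, key=scores.get)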
Preprocessing data

Preprocess data
- Normalize the data so that each feature looks like a sample from a normal distribution.
- Compute the mean and standard deviation of each feature:

\bar{x}_d = \frac{1}{N} \sum_n x_{nd}, \quad s_d^2 = \frac{1}{N-1} \sum_n (x_{nd} - \bar{x}_d)^2

- Then transform each feature value:

x_{nd} \leftarrow \frac{x_{nd} - \bar{x}_d}{s_d}
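A minimal sketch of this per-feature standardization, using the 1/(N-1) variance as above:

```python
def standardize(column):
    """Map each x_nd to (x_nd - mean_d) / s_d for one feature column."""
    N = len(column)
    mean = sum(column) / N
    var = sum((x - mean) ** 2 for x in column) / (N - 1)
    sd = var ** 0.5
    return [(x - mean) / sd for x in column]
```

In practice this would be applied to every feature column, with the training-set means and standard deviations reused to transform test data.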
Summary so far