SWE622 Lecture 3
Classification
Dr. Mohammad Abdullah-Al-Wadud
mwadud@ksu.edu.sa
f( [image of an apple] ) = “apple”
f( [image of a tomato] ) = “tomato”
f( [image of a cow] ) = “cow”
Slide credit: L. Lazebnik
The machine learning framework

y = f(x)

Here y is the output (the prediction), f is the function learned by the classifier, and x is the input image's feature representation.

Putting features together yields a vector: a feature vector is a list of features, where each feature is a single numeric value.
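As a toy sketch of the y = f(x) framework (the rule inside f is made up for illustration, not the actual classifier from the slides):

```python
def f(x):
    """A toy classifier: maps a feature vector x to a predicted label y.
    Hypothetical rule: call the image 'apple' when red dominates."""
    red, green, blue = x  # one numeric value per feature
    return "apple" if red > green and red > blue else "other"

x = [0.8, 0.3, 0.2]  # a feature vector: a list of numeric feature values
print(f(x))          # 'apple'
```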
Testing

Test Image → Image Features → Learned model → Prediction

Slide credit: D. Hoiem and L. Lazebnik
• Raw pixels: put the colours in the image directly into the feature vector. If we want to detect the location of the girl in the image, raw pixel colour is a good idea, but the representation is huge.

• Histograms: instead of raw colour we can use a histogram; for 8-bit intensities it has 256 bars (bins). The total number of points in the histogram does not change as pixels move around, but the histogram does not give the exact location of any pixel. So to detect the girl, a histogram can find her colour by its frequency, but it will fail if we want her exact location.

• GIST descriptors
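A minimal sketch of the histogram feature in Python, assuming an 8-bit grayscale image stored as a NumPy array (the bin count and normalization are illustrative choices):

```python
import numpy as np

def histogram_feature(image, bins=256):
    """Turn an 8-bit grayscale image into a histogram feature vector.

    The histogram counts how many pixels fall into each intensity bin,
    so the feature ignores *where* pixels are, only how frequent each
    value is. Normalizing makes images of different sizes comparable.
    """
    hist, _ = np.histogram(image, bins=bins, range=(0, 256))
    return hist / hist.sum()  # bins now sum to 1

# Hypothetical example: two random "images" of different sizes still
# yield feature vectors of the same fixed length.
img_a = np.random.randint(0, 256, size=(480, 640))
img_b = np.random.randint(0, 256, size=(100, 100))
print(histogram_feature(img_a).shape, histogram_feature(img_b).shape)  # (256,) (256,)
```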
Training

[Figure: training examples from class 1 and class 2, plus a test example, plotted in a 2-D feature space]

Here the features are x and y, so each example is a 2-dimensional feature vector.
• All we need is a distance function for our inputs; the cost depends on the number of features and the number of examples.
• Two common choices are the Euclidean distance (measured along straight lines) and the City block distance.
• In most cases, City block distance yields results similar to the Euclidean
distance. Note, however, that with City block distance, the effect of a large
difference in a single dimension is dampened (since the distances are not
squared).
• Consider two points in the xy-plane.
– The shortest distance between the two points is along the hypotenuse,
which is the Euclidean distance.
– The City block distance is instead calculated as the distance in x plus the
distance in y, which is similar to the way you move in a city (like
Manhattan) where you have to move around the buildings instead of
going straight through.
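A small NumPy sketch of the two distances, using the xy-plane example above (the 3-4-5 triangle makes the hypotenuse vs. city-block difference concrete; the points are made up):

```python
import numpy as np

def euclidean(a, b):
    # Straight-line (hypotenuse) distance: sqrt of summed squared differences.
    return np.sqrt(np.sum((a - b) ** 2))

def city_block(a, b):
    # Sum of absolute differences along each axis, like walking around
    # city blocks instead of cutting straight through.
    return np.sum(np.abs(a - b))

p, q = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(euclidean(p, q))   # 5.0  (the hypotenuse)
print(city_block(p, q))  # 7.0  (3 along x plus 4 along y)
```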
Distance measures

• Correlation

The correlation between two points, a and b, with k dimensions is calculated as

$$\mathrm{corr}(a, b) = \frac{s_{ab}}{s_a \, s_b}$$

where

$$s_{ab} = \frac{1}{k-1}\sum_{i=1}^{k}(a_i - \bar{a})(b_i - \bar{b}), \quad s_a = \sqrt{\frac{1}{k-1}\sum_{i=1}^{k}(a_i - \bar{a})^2}, \quad s_b = \sqrt{\frac{1}{k-1}\sum_{i=1}^{k}(b_i - \bar{b})^2}$$

This is called the Pearson product-moment correlation, referred to simply as Pearson's correlation or Pearson's r. It ranges from +1 to -1, where +1 is the highest correlation; completely opposite points have correlation -1.
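A quick NumPy check of Pearson's r; the function is an illustrative re-implementation of the formula above, and np.corrcoef gives the same value:

```python
import numpy as np

def pearson_r(a, b):
    # Center both vectors, then divide the covariance by the
    # product of the standard deviations.
    a_c, b_c = a - a.mean(), b - b.mean()
    return (a_c @ b_c) / np.sqrt((a_c @ a_c) * (b_c @ b_c))

a = np.array([1.0, 2.0, 3.0, 4.0])
print(pearson_r(a, 2 * a + 1))          #  1.0: perfectly correlated
print(pearson_r(a, -a))                 # -1.0: complete opposites
print(np.corrcoef(a, 2 * a + 1)[0, 1])  # same result via NumPy
```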
Distance measures
• Other distances
– Chi-Square distance
– Mahalanobis distance
– Chebyshev distance
– Etc.
Classifiers: Linear

A linear classifier in 3-D:

$$f(\mathbf{x}) = ax + by + cz + k = (a, b, c) \cdot (x, y, z) + k = \mathbf{w} \cdot \mathbf{x} + k$$

The decision boundary is $f(\mathbf{x}) = 0$, and the predicted class is

$$f(\mathbf{x}) = \mathrm{sgn}(\mathbf{w} \cdot \mathbf{x} + b)$$
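A minimal sketch of this decision rule in NumPy; the weights w and offset b are made-up values for illustration:

```python
import numpy as np

def linear_classify(x, w, b):
    # sgn(w . x + b): positive side of the plane -> class +1,
    # negative side -> class -1.
    return np.sign(w @ x + b)

# Hypothetical 3-D example with w = (a, b, c) and offset k = b.
w = np.array([1.0, -2.0, 0.5])
b = 0.25
print(linear_classify(np.array([2.0, 0.0, 1.0]), w, b))  #  1.0
print(linear_classify(np.array([0.0, 2.0, 0.0]), w, b))  # -1.0
```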
[Figure: example image classified as “Contains a motorbike”]
Sources of test error:
• Unavoidable error
• Error due to incorrect assumptions
• Error due to variance of the training samples

Slide credit: D. Hoiem
How to reduce variance?
1-nearest neighbor
[Figure: query points (+) among x and o training points in the (x1, x2) plane; each query takes the label of its single nearest neighbor]

3-nearest neighbor
[Figure: the same data; each query is labeled by a vote over its 3 nearest neighbors]

5-nearest neighbor
[Figure: the same data; each query is labeled by a vote over its 5 nearest neighbors]
Using K-NN
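A minimal k-NN sketch in NumPy, with a hypothetical 2-D dataset in the spirit of the scatter plots above:

```python
import numpy as np

def knn_predict(x_query, X_train, y_train, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training example.
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]  # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]  # most common label wins

# Hypothetical data: class 'x' in one corner, class 'o' in the other.
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y = np.array(['x', 'x', 'x', 'o', 'o', 'o'])
print(knn_predict(np.array([0.5, 0.5]), X, y, k=1))  # 'x'
print(knn_predict(np.array([5.5, 5.5]), X, y, k=3))  # 'o'
```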
Using Naïve Bayes

[Figure: two classes of points in the (x1, x2) feature plane, where x1 is the pitch of voice]

$$\log\frac{P(x_1, x_2 \mid y = 1)}{P(x_1, x_2 \mid y = -1)} = \mathbf{w}^{T}\mathbf{x}$$

$$P(y = 1 \mid x_1, x_2) = \frac{1}{1 + \exp(-\mathbf{w}^{T}\mathbf{x})}$$
Using Logistic Regression
Features can be continuous values, for example a temperature reading.
• Use L2 or L1 regularization
  – L1 does feature selection and is robust to irrelevant features, but is slower to train (see the sketch below)
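A short scikit-learn sketch, assuming a hypothetical 1-D temperature feature with a toy labeling rule; the penalty argument switches between L1 and L2 regularization:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical 1-D feature (temperature) with a binary label.
rng = np.random.default_rng(0)
temperature = rng.uniform(0, 40, size=200).reshape(-1, 1)
label = (temperature.ravel() > 20).astype(int)  # toy rule for illustration

# penalty="l2" is the default; penalty="l1" (with a compatible solver)
# additionally drives irrelevant weights to exactly zero.
clf = LogisticRegression(penalty="l1", solver="liblinear")
clf.fit(temperature, label)

# P(y=1 | x) = 1 / (1 + exp(-w.x - b)), as in the formula above.
print(clf.predict_proba([[25.0]])[0, 1])  # probability the label is 1
```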
Classifiers: Linear SVM

[Figure: x and o points in the (x1, x2) plane with a candidate separating line]
[Figure: the same data with the maximum-margin separating line]
[Figure: similar data where one o lies among the x's, so the classes are not perfectly linearly separable]
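A minimal scikit-learn sketch of a linear SVM on hypothetical 2-D data like the figures above:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical 2-D data in the (x1, x2) plane: 'x' points vs 'o' points.
X = np.array([[1, 1], [1, 2], [2, 1], [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# LinearSVC fits w and b to maximize the margin between the classes;
# C controls how strongly margin violations are penalized.
clf = LinearSVC(C=1.0)
clf.fit(X, y)
print(clf.coef_, clf.intercept_)  # learned w and b
print(clf.predict([[2.0, 2.0]]))  # falls on the 'x' side -> class 0
```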
Nonlinear SVMs

• Datasets that are linearly separable work out great:

[Figure: 1-D data on the x axis that a single threshold separates]
[Figure: 1-D data on the x axis that no single threshold separates]

• Mapping the data to a higher-dimensional space can make it linearly separable:

Φ: x → φ(x)

• The kernel trick: the SVM decision function uses the data only through dot products, so $\varphi(\mathbf{x}_i)\cdot\varphi(\mathbf{x})$ can be replaced by a kernel function $K$:

$$y(\mathbf{x}) = \sum_i \alpha_i y_i \,\varphi(\mathbf{x}_i)\cdot\varphi(\mathbf{x}) + b = \sum_i \alpha_i y_i \,K(\mathbf{x}_i, \mathbf{x}) + b$$

C. Burges, “A Tutorial on Support Vector Machines for Pattern Recognition,” Data Mining and Knowledge Discovery, 1998
Nonlinear kernel: Example

• Consider the mapping $\varphi(x) = (x, x^2)$

$$\varphi(x)\cdot\varphi(y) = (x, x^2)\cdot(y, y^2) = xy + x^2y^2$$

$$K(x, y) = xy + x^2y^2$$
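A quick numeric check of this identity (the values of x and y are chosen arbitrarily):

```python
import numpy as np

def phi(x):
    # The explicit mapping from the example: x -> (x, x^2).
    return np.array([x, x ** 2])

def K(x, y):
    # The kernel computes the same value without forming phi explicitly.
    return x * y + x ** 2 * y ** 2

x, y = 3.0, -2.0
print(phi(x) @ phi(y))  # -6 + 36 = 30.0
print(K(x, y))          # 30.0, identical, as the derivation shows
```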
Kernels for bags of features

• Histogram intersection kernel:

$$I(h_1, h_2) = \sum_{i=1}^{N} \min(h_1(i), h_2(i))$$
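The histogram intersection kernel is one line in NumPy (the two histograms here are made up):

```python
import numpy as np

def histogram_intersection(h1, h2):
    # Sum, over all N bins, of the smaller of the two bin values.
    return np.minimum(h1, h2).sum()

h1 = np.array([0.2, 0.5, 0.3])
h2 = np.array([0.4, 0.1, 0.5])
print(histogram_intersection(h1, h2))  # 0.2 + 0.1 + 0.3 = 0.6
```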
• Cons
  – No “direct” multi-class SVM; must combine two-class SVMs
  – Computation, memory
    • During training, must compute the matrix of kernel values for every pair of examples
    • Learning can take a very long time for large-scale problems
Summary: Classifiers
• Nearest-neighbor and k-nearest-neighbor
classifiers
– L1 distance, χ2 distance, quadratic distance,
histogram intersection
• Support vector machines
– Linear classifiers
– Margin maximization
– The kernel trick
– Kernel functions: histogram intersection, generalized
Gaussian, pyramid match
– Multi-class
• Of course, there are many other classifiers
Slide credit: L. Lazebnik
Classifiers: Decision Trees
[Figure: x and o training points in the (x1, x2) plane separated by a decision-tree boundary]
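A minimal scikit-learn sketch of a decision tree on hypothetical 2-D data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical 2-D data: a decision tree carves the (x1, x2) plane into
# axis-aligned regions rather than using a single straight line.
X = np.array([[1, 1], [1, 2], [2, 1], [5, 5], [6, 5], [1, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X, y)
print(tree.predict([[2.0, 2.0], [5.5, 5.5]]))  # [0 1]
```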
Ideals for a classification
algorithm
• Objective function: encodes the right loss for the
problem
Comparison

Naïve Bayes
  – Learning objective: maximize $\sum_i \left[\sum_j \log P(x_{ij} \mid y_i; \theta_j) + \log P(y_i; \theta_0)\right]$
  – Training: counting with smoothing, $\theta_{kj} = \dfrac{\sum_i \delta(x_{ij}=1,\, y_i=k) + r}{\sum_i \delta(y_i=k) + Kr}$
  – Inference: $\theta_1^{T}\mathbf{x} + \theta_0^{T}(1-\mathbf{x}) > 0$, where $\theta_{1j} = \log \dfrac{P(x_j=1 \mid y=1)}{P(x_j=1 \mid y=0)}$ and $\theta_{0j} = \log \dfrac{P(x_j=0 \mid y=1)}{P(x_j=0 \mid y=0)}$

Linear SVM
  – Learning objective: maximize the margin, i.e. minimize $\frac{1}{2}\lVert\theta\rVert^{2} + \lambda\sum_i \xi_i$
  – Training: linear programming
  – Inference: $\theta^{T}\mathbf{x} > 0$

Kernelized SVM
  – Learning objective: complicated to write
  – Training: quadratic programming
  – Inference: $\sum_i \alpha_i y_i K(\hat{\mathbf{x}}_i, \mathbf{x}) > 0$

Nearest Neighbor
  – Learning objective: most similar features → same label
  – Training: record the training data
  – Inference: $y_{i^*}$, where $i^* = \operatorname{argmin}_i K(\hat{\mathbf{x}}_i, \mathbf{x})$
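To make the Naïve Bayes row concrete, here is a minimal sketch of the training (counting with smoothing r) and inference (log-odds) formulas above, assuming binary features and two classes; the prior term is omitted here, which is harmless for balanced classes:

```python
import numpy as np

def train_nb(X, y, r=1.0):
    """Naive Bayes training: counting with smoothing r (binary features)."""
    theta = {}
    for k in (0, 1):
        Xk = X[y == k]
        # theta_kj = (count of x_j=1 in class k + r) / (count of class k + Kr),
        # with K = 2 possible feature values.
        theta[k] = (Xk.sum(axis=0) + r) / (len(Xk) + 2 * r)
    return theta

def predict_nb(x, theta):
    # Inference: theta1 . x + theta0 . (1 - x) > 0, with log-odds weights
    # theta_1j = log P(x_j=1|y=1)/P(x_j=1|y=0) and theta_0j analogous.
    t1 = np.log(theta[1] / theta[0])
    t0 = np.log((1 - theta[1]) / (1 - theta[0]))
    return int(t1 @ x + t0 @ (1 - x) > 0)

# Hypothetical binary data: the first feature signals class 1.
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
y = np.array([1, 1, 0, 0])
theta = train_nb(X, y)
print(predict_nb(np.array([1, 0]), theta))  # 1
```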