Module 4: SVM, PCA, K-Means
Module-4
SVM
• The support vector machine is a generalization of a simple
and intuitive classifier called the maximal margin classifier.
A hyperplane is a subspace whose dimension is one less than that of its ambient space. If a
space is 3-dimensional then its hyperplanes are the 2-dimensional planes, while if the space
is 2-dimensional, its hyperplanes are the 1-dimensional lines.
Maximum margin classifier
What is a Hyperplane?
• In the p-dimensional setting, a hyperplane is defined by the equation
β0 + β1X1 + β2X2 + ... + βpXp = 0
• So if
β0 + β1X1 + β2X2 + ... + βpXp > 0,
then this tells us that X lies to one side of the hyperplane.
• On the other hand, if
β0 + β1X1 + β2X2 + ... + βpXp < 0,
then this tells us that X lies to the other side of the hyperplane.
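A minimal sketch (not from the slides) of this side-of-the-hyperplane test in Python; the coefficient values here are made up for illustration:

import numpy as np

# Hypothetical hyperplane coefficients: beta0 + beta . x = 0
beta0 = 1.0
beta = np.array([-2.0, 3.0])          # beta1, beta2 for a 2-D example

def side_of_hyperplane(x):
    # Return +1 if x lies on the positive side of the hyperplane, -1 otherwise.
    value = beta0 + beta @ x
    return 1 if value > 0 else -1

print(side_of_hyperplane(np.array([1.0, 1.0])))   # 1 - 2 + 3 = 2    -> +1
print(side_of_hyperplane(np.array([2.0, 0.5])))   # 1 - 4 + 1.5 = -1.5 -> -1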
Maximum margin classifier and Hyperplane
The constraint
yi (β0 + β1xi1 + β2xi2 + ... + βpxip) ≥ M   for all i = 1, ..., n
guarantees that each observation will be on the correct side of the
hyperplane, provided that M (the margin) is positive (M > 0).
• When the classes are not linearly separable we can extend the
concept of a separating hyperplane in order to develop a
hyperplane that almost separates the classes, using a so-called
soft margin.
• C bounds the sum of the slack variables εi, and so it determines the number and
severity of the violations to the margin (and to the hyperplane) that we will
tolerate (think of C as a budget for the amount that the margin can be violated
by the n observations).
• If C = 0 then there is no budget for violations to the margin, and it must be the
case that ε1 = ... = εn = 0, so the problem simply amounts to the maximal margin
hyperplane optimization problem.
• For C > 0, no more than C observations can be on the wrong side of the
hyperplane, because if an observation is on the wrong side of the hyperplane
then εi > 1, and the budget requires that Σ εi ≤ C.
• As the budget C increases, we become more tolerant of violations to the margin,
and so the margin will widen. Conversely, as C decreases, we become less
tolerant of violations to the margin and so the margin narrows.
nonnegative tuning parameter C:
• When C is small, we seek narrow margins that are rarely violated; this amounts
to a classifier that is highly fit to the data, which may have low bias but high
variance.
• On the other hand, when C is larger, the margin is wider and we allow more
violations to it; this amounts to fitting the data less hard and obtaining a
classifier that is potentially more biased but may have lower variance.
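A rough sketch (not from the slides) of fitting soft-margin linear SVMs with scikit-learn on synthetic data. Note that scikit-learn's C is a penalty on margin violations, so it behaves roughly as the inverse of the budget C described above: a small penalty tolerates many violations (wide margin), a large penalty tolerates few (narrow margin).

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic two-class data with some overlap between the classes.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for penalty in (0.01, 1.0, 100.0):   # small penalty ~ large budget for violations
    clf = SVC(kernel="linear", C=penalty).fit(X, y)
    print("penalty:", penalty, "number of support vectors:", clf.n_support_.sum())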
Property of SVM classifier
• Only observations that either lie on the margin or that violate the margin will
affect the hyperplane, and hence the classifier obtained.
• In other words, an observation that lies strictly on the correct side of the margin
does not affect the support vector classifier! Changing the position of that
observation would not change the classifier at all, provided that its position
remains on the correct side of the margin.
• Observations that lie directly on the margin, or on the wrong side of the margin for
their class, are known as support vectors. These observations do affect the support
vector classifier.
• When the tuning parameter C is large, then the margin is wide, many observations
violate the margin, and so there are many support vectors. In this case, many
observations are involved in determining the hyperplane.
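A small scikit-learn sketch (not from the slides) illustrating this property on synthetic data: refitting on only the support vectors leaves the fitted hyperplane essentially unchanged.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.2, random_state=1)
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vector indices:", clf.support_)

# Observations strictly on the correct side of the margin do not influence the
# solution, so refitting on the support vectors alone recovers (essentially)
# the same hyperplane coefficients.
clf_sv = SVC(kernel="linear", C=1.0).fit(X[clf.support_], y[clf.support_])
print(clf.coef_, clf.intercept_)
print(clf_sv.coef_, clf_sv.intercept_)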
Property of SVM classifier
• A powerful insight is that the linear SVM can be rephrased using the
inner product of any two given observations, rather than the
observations themselves.
• The equation for making a prediction for a new input using the inner product
between the input (x) and each support vector (xi) is
f(x) = β0 + Σi αi ⟨x, xi⟩,
where the sum runs over the support vectors.
• This is an equation that involves calculating the inner products of a new input
vector (x) with all support vectors in the training data.
• The coefficients β0 and αi (one αi per training observation) must be estimated
from the training data by the learning algorithm; αi is nonzero only for the
support vectors.
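A sketch (not from the slides) of this inner-product form using scikit-learn: the fitted dual coefficients (αi·yi), support vectors, and intercept (β0) reproduce the classifier's decision function.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=80, centers=2, random_state=2)
clf = SVC(kernel="linear").fit(X, y)

x_new = X[0]
# dual_coef_ holds alpha_i * y_i for each support vector; intercept_ is beta0.
f = clf.intercept_[0] + np.sum(clf.dual_coef_[0] * (clf.support_vectors_ @ x_new))
print(f, clf.decision_function([x_new])[0])   # the two values agree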
SVM classifier with non-linear decision boundary
Linear Kernel in SVM:
• The inner product or dot-product is called the kernel and can be re-written
as:
K(x, xi) = sum(x * xi)
• The kernel defines the similarity or a distance measure between new data
and the support vectors. The dot product is the similarity measure used for
linear SVM or a linear kernel because the distance is a linear combination of
the inputs.
• Other kernels can be used that transform the input space into higher
dimensions such as a Polynomial Kernel and a Radial Kernel. This is called
the Kernel Trick.
• It can be desirable to use more complex kernels, as they allow the boundaries
that separate the classes to be curved or even more complex. This in turn can
lead to more accurate classifiers.
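A minimal Python sketch of the linear kernel (not from the slides):

import numpy as np

def linear_kernel(x, xi):
    # K(x, xi) = sum(x * xi), i.e. the ordinary dot product.
    return np.sum(x * xi)

print(linear_kernel(np.array([1.0, 2.0]), np.array([3.0, 4.0])))   # 1*3 + 2*4 = 11.0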
SVM classifier with non-linear decision boundary
Polynomial Kernel in SVM:
• Instead of the dot-product, we can use a polynomial kernel, for example:
K(x, xi) = (1 + sum(x * xi))^d
• Where the degree of the polynomial (d) must be specified by hand to the
learning algorithm.
• When d=1 this is the same as the linear kernel. The polynomial kernel allows
for curved lines in the input space.
• Using such a kernel with d > 1 instead of the standard linear kernel in the
support vector classifier algorithm leads to a much more flexible decision
boundary.
• Note that in this case the (non-linear) function has the form
f(x) = β0 + Σi αi K(x, xi),
where the sum runs over the support vectors.
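The polynomial kernel as a Python function (a sketch, not from the slides; d is chosen by hand):

import numpy as np

def polynomial_kernel(x, xi, d=3):
    # K(x, xi) = (1 + sum(x * xi))^d, where d is the degree of the polynomial.
    return (1.0 + np.sum(x * xi)) ** d

print(polynomial_kernel(np.array([1.0, 2.0]), np.array([3.0, 4.0]), d=2))   # (1 + 11)^2 = 144.0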
SVM classifier with non-linear decision boundary
Radial Kernel in SVM:
• We can also have a more complex radial kernel. For example:
K(x, xi) = exp(-gamma * sum((x - xi)^2))
• The radial kernel is very local and can create complex regions within the feature
space, like closed polygons in two-dimensional space.
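The radial kernel as a Python function (a sketch, not from the slides; gamma must be chosen by the user):

import numpy as np

def radial_kernel(x, xi, gamma=1.0):
    # K(x, xi) = exp(-gamma * sum((x - xi)^2)).
    return np.exp(-gamma * np.sum((x - xi) ** 2))

print(radial_kernel(np.array([1.0, 2.0]), np.array([3.0, 4.0]), gamma=0.1))   # exp(-0.8) ≈ 0.449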
SVM classifier with non-linear decision boundary
Radial Kernel in SVM:
Left: An SVM with a polynomial kernel of degree 3 is applied to the non-
linear data, resulting in a far more appropriate decision rule. The fit is
a substantial improvement over the linear SVM classifier.
Right: An SVM with a radial kernel is applied. In this example, either kernel is
capable of capturing the decision boundary. It also does a good job in
separating the two classes.
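A sketch (not from the slides) of fitting both kernels on non-linear synthetic data with scikit-learn; coef0=1 makes the polynomial kernel match the (1 + sum(x * xi))^d form above.

from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-moons: not separable by a straight line.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

poly = SVC(kernel="poly", degree=3, coef0=1.0).fit(X, y)
rbf = SVC(kernel="rbf", gamma=1.0).fit(X, y)
print("polynomial kernel training accuracy:", poly.score(X, y))
print("radial kernel training accuracy:", rbf.score(X, y))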
Advantages of SVM:
• SVM works relatively well when there is a clear margin of separation
between classes.
• SVM is more effective in high dimensional spaces.
• SVM is effective in cases where the number of dimensions is greater than
the number of samples.
• SVM is relatively memory efficient
Disadvantages of SVM:
• SVM algorithm is not suitable for large data sets.
• SVM does not perform very well when the data set is noisy, i.e. when the
target classes overlap.
• In cases where the number of features for each data point greatly exceeds the
number of training data samples, the SVM may underperform.
• As the support vector classifier works by placing data points above and below
the classifying hyperplane, there is no direct probabilistic interpretation of
the classification.
SVMs with More than Two Classes
• One-Versus-All Classification: fit K separate SVMs, each time comparing one of
the K classes to the remaining K − 1 classes, and assign a test observation to
the class for which the fitted decision function is largest.
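A sketch (not from the slides) of one-versus-all with scikit-learn; SVC on its own uses one-versus-one for multiclass problems, so the OneVsRestClassifier wrapper is used here.

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)              # 3 classes -> 3 one-versus-all SVMs
ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
print(len(ova.estimators_), "binary classifiers fit")
print(ova.predict(X[:5]))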
PCA (Principal Component Analysis)
• Transform a large set of variables into a smaller one that still contains most
of the information in the large set.
• The idea is to reduce the number of variables of a data set, while preserving
as much information as possible.
• PCA represents the input vector as a sum of orthonormal basis functions and
it exploits the possible correlation between the variables.
• Projection of a vector x onto a unit basis vector u: proj_u(x) = (xᵀu) u,
where the scalar xᵀu is the length of the projection.
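A tiny numpy sketch of this projection (not from the slides):

import numpy as np

x = np.array([3.0, 4.0])
u = np.array([1.0, 0.0])     # a unit basis vector

length = x @ u               # scalar length of the projection of x onto u
proj = length * u            # the projected vector
print(length, proj)          # 3.0 [3. 0.]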
Quick Recap..
Covariance
• Variance and covariance are measures of the "spread" of a set of
points around their center of mass (mean).
• Variance: measure of the deviation from the mean for points in one
dimension.
• Covariance: measure of how much two dimensions vary from their means with
respect to each other, i.e. whether two variables tend to vary together.
Are they correlated?
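A short numpy sketch (not from the slides) computing variance and covariance for two correlated variables:

import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 0.8 * a + rng.normal(scale=0.5, size=200)   # b varies together with a

print(np.var(a), np.var(b))   # variance of each variable
print(np.cov(a, b))           # 2x2 covariance matrix; the off-diagonal entries
                              # show how a and b vary together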
PCA
• Properties
– It can be viewed as a rotation of the existing axes to new
positions in the space defined by the original variables
– New axes are orthogonal and represent the directions with
maximum variability
PCA
• PCA is performed by finding the eigenvalues and eigenvectors
of the covariance matrix.
[Figure: the principal components PC 1 and PC 2 drawn as rotated axes in the plane of Original Variable A and Original Variable B]
• Writing the transformed data as Y = PᵀX, where the columns of P are the new
orthonormal basis vectors:
Cov(Y) = (1/(n-1)) Y Yᵀ = (1/(n-1)) (PᵀX)(PᵀX)ᵀ = Pᵀ Cov(X) P
• We want Cov(Y) to be diagonal. Why??? Because only then is the transformed
data Y completely decorrelated!!
• Hence the columns of P should diagonalize the Cov(X) matrix, i.e. they are the
eigenvectors of Cov(X).
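A numpy sketch (not from the slides) of PCA via the eigendecomposition of the covariance matrix, checking that the transformed data is decorrelated:

import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated 2-D data, centered so the covariance computation is simple.
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.5], [1.5, 1.0]], size=500)
X = X - X.mean(axis=0)                        # rows are observations (n x p)

cov = np.cov(X, rowvar=False)                 # p x p covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)        # columns of eigvecs are eigenvectors
P = eigvecs[:, np.argsort(eigvals)[::-1]]     # sort directions by decreasing variance

Y = X @ P                                     # project onto the principal components
print(np.round(np.cov(Y, rowvar=False), 3))   # ~diagonal: Y is decorrelated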
Unsupervised learning
• E.g., Clustering
[Cartoon: "I can see the pattern"]
K-Means Clustering
The idea behind K-means clustering is that a good clustering is one for
which the within-cluster variation is as small as possible.
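A sketch (not from the slides) of K-means with scikit-learn; inertia_ is the total within-cluster sum of squared distances to the cluster centres, which (up to a constant factor) is the within-cluster variation being minimized.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("first ten cluster labels:", km.labels_[:10])
print("within-cluster sum of squares:", km.inertia_)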
K-Means Clustering
• https://www.youtube.com/watch?v=_aWzGGNrcic