Module 4: SVM, PCA, K-Means


Support vector machine (SVM)

Module-4
SVM
• The support vector machine is a generalization of a simple
and intuitive classifier called the maximal margin classifier.

• Though the traditional maximum margin classifier is elegant and simple, this classifier unfortunately cannot be applied to most data sets, since it requires that the classes be separable by a linear boundary.

• The SVM classifier is an extension of the maximal margin classifier that can be applied in a broader range of cases.
Maximum margin classifier
• The maximal margin classifier is the optimal separating hyperplane when two classes are linearly separable (the hyperplane with maximum margin).

A hyperplane is a subspace whose dimension is one less than that of its ambient space. If a
space is 3-dimensional then its hyperplanes are the 2-dimensional planes, while if the space
is 2-dimensional, its hyperplanes are the 1-dimensional lines.
Maximum margin classifier
What is a Hyperplane?

• In a p-dimensional space, a hyperplane is a flat affine subspace of dimension p − 1.

• For instance, in two dimensions, a hyperplane is a flat 1D subspace, or a line.

• In three dimensions, a hyperplane is a flat 2D subspace—that is, a plane.

• In p > 3 dimensions, it can be hard to visualize a hyperplane, but it is a (p − 1)-dimensional flat subspace.
Maximum margin classifier and Hyperplane

• In two dimensions, a hyperplane is defined by the equation

β0 + β1X1 + β2X2 = 0

So a point X = (X1, X2) lies on the hyperplane if it satisfies the above equation. Note, this is simply the equation of a line, since indeed in two dimensions a hyperplane is a line.

• The equation can be easily extended to the p-dimensional setting:

β0 + β1X1 + β2X2 + · · · + βpXp = 0

defines a p-dimensional hyperplane.


Maximum margin classifier and Hyperplane

• In the p-dimensional setting:

β0 + β1X1 + β2X2 + · · · + βpXp = 0

defines a p-dimensional hyperplane. So if a point X = (X1, X2, . . . , Xp) in p-dimensional space (i.e. a vector of length p) satisfies the equation, then X lies on the hyperplane.

• What if X does not satisfy the above equation?
i.e. if β0 + β1X1 + β2X2 + · · · + βpXp ≠ 0
Maximum margin classifier and Hyperplane

What if X does not satisfy the above equation?

• So if β0 + β1X1 + β2X2 + · · · + βpXp > 0,
then this tells us that X lies on one side of the hyperplane.
• On the other hand, if β0 + β1X1 + β2X2 + · · · + βpXp < 0,
then this tells us that X lies on the other side of the hyperplane.
Maximum margin classifier and Hyperplane

• So we can think of the hyperplane as dividing p-dimensional space into two halves.

• One can easily determine on which side of the hyperplane a point lies by simply calculating the sign of the LHS of the hyperplane equation.
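A minimal sketch in Python of this sign test (the coefficient values below are hypothetical, chosen only for illustration):

import numpy as np

# Hypothetical hyperplane f(X) = beta0 + beta1*X1 + beta2*X2
beta0 = -1.0
beta = np.array([2.0, 3.0])

def side_of_hyperplane(x):
    # Returns +1.0 if x lies on the positive side of the hyperplane,
    # -1.0 if on the negative side, 0.0 if exactly on it.
    return np.sign(beta0 + beta @ x)

print(side_of_hyperplane(np.array([1.0, 1.0])))   # -1 + 2 + 3 = 4  -> 1.0
print(side_of_hyperplane(np.array([0.0, 0.0])))   # -1              -> -1.0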
Maximum margin classifier and Hyperplane
Maximum margin classifier and Hyperplane

• Now suppose that we have an n × p data matrix X that consists of n training observations in p-dimensional space, and these observations fall into two classes, i.e., y1, . . . , yn ∈ {−1, 1}, where −1 represents one class and 1 the other class.

• We also have a test observation, which is a p × 1 vector of observed features.

• Our goal is to develop a classifier based on the training data that will correctly classify the test observation using its feature measurements, based on the concept of a separating hyperplane.
Maximum margin classifier and Hyperplane
• Suppose that it is possible to construct a hyperplane that separates the training observations perfectly according to their class labels.
• Examples of three such separating hyperplanes are shown below.

There are two classes of observations, shown in blue and in purple, each of which has measurements on two variables. Three separating hyperplanes, out of many possible, are shown in black.
Maximum margin classifier and Hyperplane
• We can label the observations from the blue class as yi = 1 and those from the purple class as yi = −1. Then a separating hyperplane has the property that

β0 + β1xi1 + · · · + βpxip > 0 if yi = 1, and
β0 + β1xi1 + · · · + βpxip < 0 if yi = −1;

equivalently, yi(β0 + β1xi1 + · · · + βpxip) > 0 for all i = 1, . . . , n.
Maximum margin classifier and Hyperplane
If a separating hyperplane exists, we can use it to construct a very natural
classifier: a test observation is assigned a class depending on which side of
the hyperplane it is located. The figure below shows an example of such a
classifier.

• A separating hyperplane is shown in black.


• The blue and purple grid indicates the decision rule made by a classifier based on this separating hyperplane:
A test observation that falls in the blue portion of the grid will be assigned to the blue class, and a test observation that falls into the purple portion of the grid will be assigned to the purple class.
Maximum margin classifier and Hyperplane
• In general, if our data can be perfectly separated using a hyperplane, then there
will in fact exist an infinite number of such hyperplanes, because a given
separating hyperplane can usually be shifted a tiny bit up or down, or rotated,
without coming into contact with any of the observations.

• In order to construct a classifier based upon a separating hyperplane, we must have a reasonable way to decide which of the infinite possible separating hyperplanes to use.

Three possible separating hyperplanes are shown.
Maximum margin classifier and Hyperplane

• A natural choice is the maximal margin hyperplane (also known as the optimal separating hyperplane), which is the separating hyperplane that is farthest from the training observations.

• That is, we can compute the (perpendicular) distance from each training observation to a given separating hyperplane; the smallest such distance is the minimal distance from the observations to the hyperplane, and is known as the margin.

• The maximal margin hyperplane is the separating hyperplane for which the margin is largest.
Maximum margin classifier and Hyperplane

• We can then classify a test observation based on which side of the maximal margin hyperplane it lies. This is known as the maximal margin classifier.

• We hope that a classifier that has a large margin on the training data will also have a large margin on the test data, and hence will classify the test observations correctly.

• Although the maximal margin classifier is often successful, it can also lead to overfitting when p is large.
Maximum margin classifier and Hyperplane
• In the figure we see that there are three training observations that are equidistant from the maximal margin hyperplane and lie along the dashed lines indicating the width of the margin.
• These three observations are known as “support vectors”, since they are vectors in p-dimensional space and they “support” the maximal margin hyperplane in the sense that if these points were moved slightly, then the maximal margin hyperplane would move as well.

3 support vectors (2 blue and 1 pink) that lie on the margin
Maximum margin classifier and Hyperplane
Maximum margin classifier and Hyperplane

Here, how many support vectors are there?


Maximum margin classifier and Hyperplane

• The maximal margin hyperplane depends directly on the support vectors, but not on the other observations.

• A movement of any of the other observations would not affect the separating hyperplane, provided that the observation’s movement does not cause it to cross the boundary set by the margin.
Construction of Maximum margin classifier
• Consider the task of constructing the maximal margin hyperplane
based on a set of n training observations and associated class labels
y1, . . . , yn ∈ {−1, 1}.

• The maximal margin hyperplane is the solution to the optimization problem:

maximize M over β0, β1, . . . , βp, M
subject to  Σj βj² = 1,
            yi(β0 + β1xi1 + β2xi2 + · · · + βpxip) ≥ M  for all i = 1, . . . , n.
Construction of Maximum margin classifier

The constraint
yi(β0 + β1xi1 + β2xi2 + · · · + βpxip) ≥ M
guarantees that each observation will be on the correct side of the hyperplane, provided that M is positive (M > 0).

• The constraints ensure that each observation is on the correct side of the hyperplane and at least a distance M from the hyperplane.

• Hence, M represents the margin of our hyperplane, and the optimization problem chooses β0, β1, . . . , βp to maximize M.

• This is exactly the definition of the maximal margin hyperplane!


Construction of Maximum margin classifier

The Non-separable Case:

• The maximal margin classifier is a very natural way to perform classification, if a separating hyperplane exists.

• However, in many cases the data are not linearly separable, so no separating hyperplane exists, and there is no maximal margin classifier. In such cases, the optimization problem (as defined earlier) has no solution with M > 0.
Construction of Maximum margin classifier
The Non-separable Case:
Construction of Maximum margin classifier

The Non-separable Case:

• When the classes are not linearly separable we can extend the
concept of a separating hyperplane in order to develop a
hyperplane that almost separates the classes, using a so-called
soft margin.

• The generalization of the maximal margin classifier to the non-separable case is known as the support vector classifier or SVM.
Support Vector Machine (SVM) classifier
A classifier based on a separating hyperplane that perfectly classifies all of the training observations can be overly sensitive to individual observations.
Support Vector Machine (SVM) classifier

• The fact that the maximal margin hyperplane is extremely sensitive to a change in a single observation suggests that it may have overfit the training data.

• In this case, we might be willing to consider a classifier based on a hyperplane that does not perfectly separate the two classes, in the interest of (i) greater robustness to individual observations, and (ii) better classification of most of the training observations.

• That is, it could be worthwhile to misclassify a few training observations in order to do a better job in classifying the remaining observations.
Support Vector Machine (SVM) classifier

• The support vector classifier, sometimes called a soft margin classifier, does exactly this.

• Rather than seeking the largest possible margin so that every observation is not only on the correct side of the hyperplane but also on the correct side of the margin, we instead allow some observations to be on the incorrect side of the margin, or even the incorrect side of the hyperplane.

• The margin is soft because it can be violated by some of the training observations.
Support Vector Machine (SVM) classifier

• So an observation can be not only on the wrong side of the margin, but also on the wrong side of the hyperplane.

• Observations on the wrong side of the hyperplane correspond to training observations that are misclassified by the SVM classifier.

• The support vector classifier classifies a test observation depending on which side of a hyperplane it lies.

• The hyperplane is chosen to correctly separate most of the training observations into the two classes, but may misclassify a few observations.
Support Vector Machine (SVM) classifier

• The SVM is the solution to the optimization problem

maximize M over β0, . . . , βp, ε1, . . . , εn, M
subject to  Σj βj² = 1,
            yi(β0 + β1xi1 + · · · + βpxip) ≥ M(1 − εi),
            εi ≥ 0,  Σi εi ≤ C

• where C is a nonnegative tuning parameter.
• M is the width of the margin; we seek to make this quantity as large as possible.
• ε1, . . . , εn are slack variables that allow individual observations to be on the wrong side of the margin or the hyperplane.
The slack variable εi tells us where the i’th observation is located, relative to the hyperplane and relative to the margin: (i) If εi = 0 then the i’th observation is on the correct side of the margin. (ii) If εi > 0 then the i’th observation is on the wrong side of the margin, and we say that the i’th observation has violated the margin. (iii) If εi > 1 then it is on the wrong side of the hyperplane.
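As a rough illustration (a sketch on synthetic data, not part of the slides), we can fit a soft-margin SVM in scikit-learn and compute each training point’s slack in the equivalent scaled formulation, where the slack is max(0, 1 − yi·f(xi)); note that scikit-learn’s C parameter penalizes violations, so it behaves inversely to the budget C used in the formulation above.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(1, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # sklearn's C: larger = fewer violations allowed
f = clf.decision_function(X)                  # signed score f(x_i) for each training point
slack = np.maximum(0, 1 - y * f)              # slack > 0: margin violated; slack > 1: misclassified

print("margin violations:", int(np.sum(slack > 0)))
print("misclassified:    ", int(np.sum(slack > 1)))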
Support Vector Machine (SVM) classifier

• Once the SVM is trained, we classify a test observation as before, by simply determining on which side of the hyperplane it lies.

• That is, we classify the test observation x* based on the sign of
f(x*) = β0 + β1x*1 + β2x*2 + · · · + βpx*p

Nonnegative tuning parameter C:

• C bounds the sum of the εi’s, and so it determines the number and severity of the violations to the margin (and to the hyperplane) that we will tolerate (think of C as a budget for the amount that the margin can be violated by the n observations).
• If C = 0 then there is no budget for violations to the margin, and it must be the case that ε1 = · · · = εn = 0, and the problem simply amounts to the maximal margin hyperplane optimization problem.
• For C > 0, no more than C observations can be on the wrong side of the hyperplane, because if an observation is on the wrong side of the hyperplane then εi > 1.
• As the budget C increases, we become more tolerant of violations to the margin, and so the margin will widen. Conversely, as C decreases, we become less tolerant of violations to the margin and so the margin narrows.
Nonnegative tuning parameter C:

• In practice, C is treated as a tuning parameter that is generally chosen via cross-validation.

• C controls the bias-variance trade-off of the statistical learning technique.

• When C is small, we seek narrow margins that are rarely violated; this amounts
to a classifier that is highly fit to the data, which may have low bias but high
variance.

• On the other hand, when C is larger, the margin is wider and we allow more
violations to it; this amounts to fitting the data less hard and obtaining a
classifier that is potentially more biased but may have lower variance.
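A minimal sketch (synthetic data, hypothetical parameter grid) of choosing C by cross-validation with scikit-learn; again, scikit-learn’s C is a penalty on violations, so small scikit-learn values of C correspond to a large violation budget in the formulation above.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100]},  # candidate values
                    cv=5)                                       # 5-fold cross-validation
grid.fit(X, y)
print("best C:", grid.best_params_["C"], " cv accuracy:", round(grid.best_score_, 3))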
Property of SVM classifier

• Only observations that either lie on the margin or that violate the margin will
affect the hyperplane, and hence the classifier obtained.

• In other words, an observation that lies strictly on the correct side of the margin
does not affect the support vector classifier! Changing the position of that
observation would not change the classifier at all, provided that its position
remains on the correct side of the margin.

• Observations that lie directly on the margin, or on the wrong side of the margin for
their class, are known as support vectors. These observations do affect the support
vector classifier.

• When the tuning parameter C is large, then the margin is wide, many observations
violate the margin, and so there are many support vectors. In this case, many
observations are involved in determining the hyperplane.
Property of SVM classifier

• When the tuning parameter C is large, then the margin is wide, many observations violate the margin, and so there are many support vectors. In this case, many observations are involved in determining the hyperplane. This classifier will have low variance (since many observations are support vectors) but high bias (underfitting problem).

• In contrast, if C is small, then there will be fewer support vectors and hence the resulting classifier will have low bias but high variance (overfitting problem).
A support vector classifier was fit
using four different values of the
tuning parameter C.

The largest value of C was used in the top left panel.
The smaller values were used in the top right, bottom left, and bottom right panels.

When C is large, then there is a high tolerance for observations being on the wrong side of the margin, and so the margin will be large.

As C decreases, the tolerance for observations being on the wrong side of the margin decreases, and the margin narrows.
Property of SVM classifier

• The fact that the support vector classifier’s decision rule is based only on a potentially small subset of the training observations (the support vectors) means that it is quite robust to the behavior of observations that are far away from the hyperplane.

• This property is distinct from some of the other classification methods that we have seen before, where the decision boundary depends on all observations.
SVM classifier with non-linear decision boundary

• How does the SVM work if the classes are not linearly separable?
SVM classifier with non-linear decision boundary

• The support vector classifier is a natural approach for classification in the two-class setting, if the boundary between the two classes is linear.

• However, in practice we often face non-linear class boundaries.
SVM classifier with non-linear decision boundary

• We have seen that the performance of linear regression can suffer when there is a non-linear relationship between the predictors and the outcome.

• In that case, we consider enlarging the feature space using functions of the predictors, such as quadratic and cubic terms, in order to address this non-linearity.

• In the case of the support vector classifier, we could address the problem of possibly non-linear boundaries between classes in a similar way, by enlarging the feature space using quadratic, cubic, and even higher-order polynomial functions of the predictors.
SVM classifier with non-linear decision boundary

Thus rather than fitting a support vector classifier using p features
X1, X2, . . . , Xp,
we could instead fit a support vector classifier using 2p features
X1, X1², X2, X2², . . . , Xp, Xp².

Now the optimization problem becomes

maximize M over β0, β11, β12, . . . , βp1, βp2, ε1, . . . , εn, M
subject to  yi( β0 + Σj βj1 xij + Σj βj2 xij² ) ≥ M(1 − εi),
            Σi εi ≤ C,  εi ≥ 0,  Σj Σk βjk² = 1.
SVM classifier with non-linear decision boundary

Kernel function in SVM:

• The SVM algorithm is implemented in practice using a kernel. The kernel function transforms the training data so that a non-linear decision surface in the original space corresponds to a linear one in a higher-dimensional space.

• A powerful insight is that the linear SVM can be rephrased using the
inner product of any two given observations, rather than the
observations themselves.

• The inner product between two vectors is the sum of the multiplication of each pair of input values. For example, the inner product of the vectors [2, 3] and [5, 6] is 2*5 + 3*6 = 28.
SVM classifier with non-linear decision boundary
Kernel function in SVM:

• The equation for making a prediction for a new input using the inner product between the input (x) and each support vector (xi) is calculated as follows:

f(x) = B0 + sum(ai * (x · xi))

• This is an equation that involves calculating the inner products of a new input vector (x) with all support vectors in the training data.

• The coefficients B0 and ai (one for each support vector) must be estimated from the training data by the learning algorithm.
SVM classifier with non-linear decision boundary
Linear Kernel in SVM:

• The inner product or dot product is called the kernel and can be re-written as:
K(x, xi) = sum(x * xi)
• The kernel defines the similarity or a distance measure between new data
and the support vectors. The dot product is the similarity measure used for
linear SVM or a linear kernel because the distance is a linear combination of
the inputs.
• Other kernels can be used that transform the input space into higher
dimensions such as a Polynomial Kernel and a Radial Kernel. This is called
the Kernel Trick.
• It is desirable to use more complex kernels, as they allow for class boundaries that are curved or even more complex. This in turn can lead to more accurate classifiers.
SVM classifier with non-linear decision boundary
Polynomial Kernel in SVM:
• Instead of the dot-product, we can use a polynomial kernel, for example:
K(x,xi) = 1 + sum(x * xi)^d

• Where the degree of the polynomial (d) must be specified by hand to the
learning algorithm.
• When d=1 this is the same as the linear kernel. The polynomial kernel allows
for curved lines in the input space.
• Using such a kernel with d > 1 instead of the standard linear kernel in the
support vector classifier algorithm leads to a much more flexible decision
boundary.
• Note that in this case the (non-linear) function has the form
f(x) = B0 + sum(ai * K(x, xi))
SVM classifier with non-linear decision boundary
Radial Kernel in SVM:
• We can also have a more complex radial kernel. For example:
K(x, xi) = exp(-gamma * sum((x - xi)^2))

• Where gamma (a positive value) is a parameter that must be specified to the learning algorithm. A good default value for gamma is 0.1, and gamma is often chosen in the range 0 < γ < 1.

• The radial kernel is very local and can create complex regions within the feature
space, like closed polygons in two-dimensional space.
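A minimal sketch of the three kernels named above, written as plain NumPy functions on two feature vectors (the values of d and gamma are just example choices):

import numpy as np

def linear_kernel(x, xi):
    return np.sum(x * xi)                       # K(x, xi) = sum(x * xi)

def polynomial_kernel(x, xi, d=3):
    return (1 + np.sum(x * xi)) ** d            # K(x, xi) = (1 + sum(x * xi))^d

def rbf_kernel(x, xi, gamma=0.1):
    return np.exp(-gamma * np.sum((x - xi) ** 2))

x, xi = np.array([2.0, 3.0]), np.array([5.0, 6.0])
print(linear_kernel(x, xi))       # 28.0, matching the inner-product example above
print(polynomial_kernel(x, xi))   # (1 + 28)^3 = 24389.0
print(rbf_kernel(x, xi))

In scikit-learn the corresponding classifiers would be, roughly, SVC(kernel="linear"), SVC(kernel="poly", degree=3, coef0=1) and SVC(kernel="rbf", gamma=0.1).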
SVM classifier with non-linear decision boundary
Radial Kernel in SVM:
Left: An SVM with a polynomial kernel of degree 3 is applied to the non-
linear data, resulting in a far more appropriate decision rule. The fit is
a substantial improvement over the linear SVM classifier.

Right: An SVM with a radial kernel is applied. In this example, either kernel is
capable of capturing the decision boundary. It also does a good job in
separating the two classes.
Advantages of SVM:
• SVM works relatively well when there is a clear margin of separation
between classes.
• SVM is more effective in high dimensional spaces.
• SVM is effective in cases where the number of dimensions is greater than
the number of samples.
• SVM is relatively memory efficient

Disadvantages of SVM:
• SVM algorithm is not suitable for large data sets.
• SVM does not perform very well when the data set has more noise i.e.
target classes are overlapping.
• In cases where the number of features for each data point exceeds the
number of training data samples, the SVM will underperform.
• As the support vector classifier works by putting data points above and below the classifying hyperplane, there is no direct probabilistic interpretation of the classification.
SVMs with More than Two Classes

1. One-Versus-One Classification (pairwise classification)

• Suppose that we would like to perform classification using SVMs, and there are K > 2 classes. A one-versus-one or all-pairs approach constructs K(K − 1)/2 SVMs, each of which compares a pair of classes.

• For example, one such SVM might compare the kth class, coded as +1, to the k′th class, coded as −1.
SVMs with More than Two Classes

2. One-Versus-All Classification

• The one-versus-all approach is an alternative procedure for applying SVMs in the case of K > 2 classes.

• We fit K SVMs, each time comparing one of the K classes (coded as +1) to the remaining K − 1 classes (coded as −1).
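A minimal sketch of the two strategies with scikit-learn on the iris data (K = 3 classes); SVC itself uses one-versus-one internally, and OneVsRestClassifier implements one-versus-all.

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# One-versus-one: K(K - 1)/2 = 3 pairwise SVMs built internally by SVC
ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)

# One-versus-all: K = 3 SVMs, each comparing one class to the remaining classes
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

print(ovo.predict(X[:5]))
print(ovr.predict(X[:5]))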
Principal Component
Analysis (PCA)
PCA
• PCA is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets.

• It transforms a large set of variables into a smaller one that still contains most of the information in the large set.

• The idea is to reduce the number of variables of a data set, while preserving as much information as possible.

• PCA represents the input vector as a sum of orthonormal basis functions and it exploits the possible correlation between the variables.

• The output variables of this transformation are uncorrelated.


Quick Recap..

• Orthogonal set of vectors
• Orthonormal set of vectors
• Projection of vectors
– Gram-Schmidt algorithm
Quick Recap..
• Two vectors are orthogonal if they are perpendicular to each other, i.e., if the dot product of the two vectors is zero.

• Dividing each orthogonal vector by its norm yields an orthonormal set.

• Projection of a vector v onto u:
proj_u(v) = ((v · u) / (u · u)) u
Quick Recap..
Covariance
• Variance and covariance are measures of the “spread” of a set of points around their center of mass (mean).

• Variance: a measure of the deviation from the mean for points in one dimension.

• Covariance: shows the extent to which two variables vary together; the higher the magnitude, the stronger the dependency. It ranges from −∞ to +∞.

Covariance indicates the direction of the linear relationship between two random variables. Correlation (which ranges from −1 to +1) measures both the strength and direction of the linear relationship between two variables.
Covariance
Covariance
• A positive value of covariance indicates that both dimensions increase or decrease together.

• A negative value indicates that while one increases the other decreases, or vice-versa.

• If the covariance is zero, the two variables are uncorrelated (they show no linear dependence), e.g. heights of students vs. the marks obtained in a subject.
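A small numerical sketch (made-up numbers) of variance, covariance and correlation with NumPy:

import numpy as np

heights = np.array([150.0, 160.0, 170.0, 180.0, 190.0])   # variable 1
marks   = np.array([60.0, 55.0, 70.0, 65.0, 75.0])        # variable 2

print(np.var(heights, ddof=1))            # sample variance of one variable
print(np.cov(heights, marks)[0, 1])       # covariance between the two variables
print(np.corrcoef(heights, marks)[0, 1])  # correlation, always between -1 and +1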
PCA
• Principal components analysis (PCA) is a technique that can be
used to simplify a dataset.

• It is a linear transformation that chooses a new coordinate system for the data set such that the greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component).

• The second greatest variance lies on the second axis, and so on.

• PCA can be used for reducing dimensionality by eliminating the later principal components.
PRINCIPAL COMPONENT?

Are they correlated?
PCA

• We define new dimensions (variables) which:
– are linear combinations of the original ones
– are uncorrelated with one another (orthogonal in the original dimension space)
– capture as much of the original variance in the data as possible

• These are called Principal Components.
PCA

• Given a set of points, how do we know if they can be compressed like in the previous toy example?

• The answer is to look into the correlation between the points.

• The tool for doing this is called PCA.
PCA
• Principle
– Linear projection method to reduce the number of parameters
– Transform a set of correlated variables into a new set of
uncorrelated variables
– Map the data into a space of lower dimensionality
– Form of unsupervised learning

• Properties
– It can be viewed as a rotation of the existing axes to new
positions in the space defined by original variables
– New axes are orthogonal and represent the directions with
maximum variability
PCA
• PCA is performed by finding the eigenvalues and eigenvectors
of the covariance matrix.

• We find that the eigenvectors with the largest eigenvalues correspond to the dimensions that have the strongest correlation in the dataset.
This is the principal component.

• PCA is a useful statistical technique that has found application in:
– fields such as face recognition and image compression
– finding patterns in data of high dimension.
What are the new axes?
• Orthogonal directions of greatest variance in data
• Projections along PC1 discriminate the data most along any one axis

[Figure: data plotted against Original Variable A and Original Variable B, with the orthogonal directions PC 1 and PC 2 overlaid along the directions of greatest variance]

• The first principal component is the direction of greatest variability (covariance) in the data.
• The second is the next orthogonal (uncorrelated) direction of greatest variability.
• And so on …
PCA
• What are principal components?

– 1st Principal component: the most important direction, i.e. the direction of maximum variance in the input space
– 2nd Principal component: the 2nd most important direction, i.e. the direction of second-largest variance in the input space
– 3rd Principal component: …………
– 4th Principal component: ………….

• How many principal components are possible?
– As many as the dimensions of the input space
PCA
• Are all principal components equally important ??

• No. Only those principal components that contribute a significant fraction of the total energy are considered important. Energy along a direction is proportional to the variance along that direction.

• The less important principal components can be ignored, leading to a reduction in dimensionality.

• What does it say about the distribution if all dimensions are equally important??
– Isotropic
How to compute ??
– Given N-dimensional Data x (Say, M-points X1, X2,…….XM)

– Find the covariance matrix of X (zero mean):

• Cx = Cov(X) = E[X Xᵀ]

• Now find the eigenvalues (λ) of Cx

• Sort the eigenvalues

• The eigenvector vi that corresponds to the largest eigenvalue λi is the first principal component
• The eigenvector vj that corresponds to the j’th-largest eigenvalue λj is the j’th principal component
Steps: PCA
• Arrange the data points in a matrix X
• Find Cov(X)
• Find the eigenvalues and eigenvectors of Cov(X)
• Sort the eigenvalues in descending order
• Start building the matrix P. The first column of P is the eigenvector that corresponds to the largest eigenvalue.
• The second column of P is the eigenvector that corresponds to the second-largest eigenvalue.
• Pᵀ is the transform that completely decorrelates the data X
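A minimal sketch of these steps with NumPy (synthetic data; here observations are rows, so the transform Y = Xc·P corresponds to the slides’ y = Pᵀx written for column vectors):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = 2 * X[:, 0] + 0.1 * rng.normal(size=100)   # make two columns correlated

Xc = X - X.mean(axis=0)                   # zero-mean the data
C = np.cov(Xc, rowvar=False)              # covariance matrix Cov(X)

eigvals, eigvecs = np.linalg.eigh(C)      # eigh: the covariance matrix is symmetric
order = np.argsort(eigvals)[::-1]         # sort eigenvalues in descending order
eigvals, P = eigvals[order], eigvecs[:, order]   # columns of P = principal components

Y = Xc @ P                                # transformed (decorrelated) data
print(eigvals / eigvals.sum())            # % of variance carried by each component

k = 2                                     # dimension reduction: keep the first k components
Y_reduced = Xc @ P[:, :k]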
Dimension reduction
• For dimensionality reduction:
– Choose only the significant eigenvalues. Use only the corresponding eigenvectors to build the matrix P.

– This will lead to dimension reduction in the transformed data:

y = Pᵀ x,  with dimensions (K×1) = (K×N)(N×1)


• The eigenvectors of the covariance matrix are actually the directions of the axes where there is the most variance (most information), and these are what we call the Principal Components.

• Eigenvalues are simply the coefficients attached to the eigenvectors, which give the amount of variance carried in each Principal Component.

• By ranking the eigenvectors in order of their eigenvalues, highest to lowest, we get the principal components in order of significance.

• After having the principal components, in order to compute the percentage of variance (information) accounted for by each component, we divide the eigenvalue of each component by the sum of all eigenvalues.

• % of variance = (eigenvalue of the component) / (sum of all eigenvalues)
• We are looking for a transform that:
– represents the data along each of the principal components
– produces transformed data that is completely de-correlated
• i.e. Cov(Y) = a diagonal matrix

• How do we compute that transform ???

Say the transformed data Y is given as Y = Pᵀ X.
We want Cov(Y) to be diagonal:

Cov(Y) = E[Y Yᵀ]
       = E[(Pᵀ X)(Pᵀ X)ᵀ]
       = Pᵀ E[X Xᵀ] P
       = Pᵀ Cov(X) P

Why ??? Because only then is the transformed data Y completely decorrelated !!
• Hence columns of P should diagonalize Cov(X) matrix

• But is Cov(X) diagonalizable ??
– Yes, since Cov(X) is symmetric
• All symmetric matrices are diagonalizable
– Covariance matrices are positive semi-definite
• All eigenvalues are non-negative

• P should be formed by linearly independent eigenvectors of Cov(X)

• The eigenvectors (columns of P) are guaranteed to be orthogonal for distinct eigenvalues
K-Means Clustering
Unsupervised learning
• No class label is associated with the data.

• We experience a dataset containing only features and then learn useful properties of the structure of the dataset.

• Since there is no associated response, we cannot fit a linear regression model; there is no response variable to predict. This situation is called “unsupervised” because we lack a response variable (target class) that can supervise the analysis.

• E.g. clustering
Unsupervised learning

I can see the pattern
Unsupervised learning

• Clustering
K-Means Clustering

• K-means clustering is a simple and elegant approach for partitioning the dataset into K distinct, non-overlapping clusters.

• To perform K-means clustering, we must first specify the desired number of clusters K; then the K-means algorithm will assign each observation to exactly one of the K clusters.
K-Means Clustering
K-Means Clustering
• Let C1, . . . , CK denote sets containing the indices of the observations in each cluster. These sets satisfy two properties:

1. C1 ∪ C2 ∪ · · · ∪ CK = {1, . . . , n}; each observation belongs to at least one of the K clusters.
2. Ck ∩ Ck′ = ∅ for all k ≠ k′; the clusters are non-overlapping.

For instance, if the i’th observation is in the kth cluster, then i ∈ Ck.

The idea behind K-means clustering is that a good clustering is one for
which the within-cluster variation is as small as possible.
K-Means Clustering

• The within-cluster variation W(Ck) for cluster Ck is a measure of the amount by which the observations within a cluster differ from each other.

• Hence we want to solve the problem

minimize over C1, . . . , CK the sum  Σk=1..K W(Ck)

Objective: To partition the observations into K clusters such that the total within-cluster variation, summed over all K clusters, is as small as possible.
K-Means Clustering
K-Means Clustering
K-means Clustering
• Initialize: choose K initial cluster centroids (e.g. at random).
• Iterate:
1. Calculate the distance from each object to the cluster centroids.
2. Assign each object to the closest cluster.
3. Recalculate the new centroids.
4. Go to step 1.

• Stop based on convergence criteria:
– No change in clusters
– Max iterations reached
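A minimal sketch of this loop on synthetic 2-D data (K, the seed and the iteration limit are example choices); sklearn.cluster.KMeans provides the same algorithm ready-made.

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
K, max_iter = 2, 100

centroids = X[rng.choice(len(X), K, replace=False)]        # initial centroids
for _ in range(max_iter):
    # Steps 1-2: distance of every object to every centroid, assign to the closest
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 3: recalculate the centroids as cluster means
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    if np.allclose(new_centroids, centroids):               # stop: no change in clusters
        break
    centroids = new_centroids

print(centroids)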
K-Mean clustering - video

• https://www.youtube.com/watch?v=_aWzGGNrcic
