Unit 2 ML

Machine Learning Techniques [KCS-055]

Module-2
Topics Covered (Unit 2)
 REGRESSION: Linear Regression and Logistic
Regression.
 BAYESIAN LEARNING - Bayes theorem, Concept
learning, Bayes Optimal Classifier, Naïve Bayes classifier,
Bayesian belief networks, EM algorithm.
 SUPPORT VECTOR MACHINE: Introduction, Types
of support vector kernel – (Linear Kernel, Polynomial
Kernel, and Gaussian Kernel), Hyperplane – (Decision
surface), Properties of SVM, and Issues in SVM.
Overview
• Dynamic Programming
• Monte Carlo methods
• ANN
• Applications of these approaches
Classification Vs Regression
Regression
 Regression is a statistical method used in finance,
investing, and other disciplines that attempts to determine
the strength and character of the relationship between one
dependent variable (usually denoted by Y) and a series of
other variables (known as independent variables).
Applications
 Predicting stock market prices.
 Forecasting the amount of precipitation in a region.
 Projecting the total sales of a company, etc.
Regression: Types (Basic)
 The two basic types of regression are simple linear
regression and multiple linear regression, although there
are non-linear regression methods for more complicated
data and analysis. Simple linear regression uses one
independent variable to explain or predict the outcome
of the dependent variable Y, while multiple linear
regression uses two or more independent variables to
predict the outcome.
Equations/General Form of Regression
 The general form of each type of regression is:
• Simple linear regression: Y = a + bX + u
• Multiple linear regression: Y = a + b1X1 + b2X2 + b3X3 + ... +
btXt + u
Where:
• Y = the variable that you are trying to predict (dependent
variable).
• X = the variable that you are using to predict Y
(independent variable).
• a = the intercept.
• b = the slope.
• u = the regression residual.
Important Point: Regression
 Regression takes a group of random variables, thought to
be predicting Y, and tries to find a mathematical
relationship between them. This relationship is typically
in the form of a straight line (linear regression) that best
approximates all the individual data points. In multiple
regression, the separate variables are differentiated by
using subscripts.
 The core idea is to obtain a line that best fits the data. The best-fit line is the one for which the total prediction error (over all data points) is as small as possible, where error is the distance from a point to the regression line.
Preliminaries
 Let D be a data set that contains N observations:
D = {(xi, yi) | i = 1, 2, …, N}
 Regression is the task of learning a target function f that maps each attribute set x into a continuous-valued output y.
Preliminaries
 The goal of regression is to find a target function that can fit the input data with minimum error.
 The error function for a regression task can be expressed as the sum of squared errors:
Error = Σi (yi − f(xi))²
Linear Regression
 Linear Regression is a machine learning algorithm based on supervised learning that performs a regression task.
 Regression models a target prediction value based on independent variables. It is mostly used for finding out the relationship between variables and for forecasting.
 Different regression models differ based on the kind of relationship between the dependent and independent variables that they consider, and on the number of independent variables used.
Representation
Simple Linear Regression
 Simple linear regression fits a straight line ŷ = b0 + b1x to the data, where b0 is the intercept and b1 is the slope.
Least Square Method
 The "least squares" method is a form of mathematical
regression analysis used to determine the line of best
fit for a set of data, providing a visual demonstration of
the relationship between the data points. Each point of
data represents the relationship between a known
independent variable and an unknown dependent
variable.
What Does the Least Squares Method Tell You?
 The least squares method provides the overall rationale for the placement of the line of best fit among the data points being studied. The most common application of this method, sometimes referred to as "linear" or "ordinary" least squares, aims to create a straight line that minimizes the sum of the squares of the errors generated by the results of the associated equations, i.e., the squared residuals between the observed values and the values predicted by the model.
What Does the Least Squares Method Tell You?
(Continued)

 This method of regression analysis begins with a set of data points to be plotted on an x- and y-axis graph. An analyst using the least squares method will generate a line of best fit that explains the potential relationship between independent and dependent variables.
 In regression analysis, dependent variables are illustrated
on the vertical y-axis, while independent variables are
illustrated on the horizontal x-axis. These designations
will form the equation for the line of best fit, which is
determined from the least squares method.
Important Point: Least Square
 In contrast to a linear problem, a non-linear least squares problem has no closed-form solution and is generally solved by iteration. The discovery of the least squares method is attributed to Carl Friedrich Gauss, who developed it in 1795.
The Line of Best Fit Equation

 The line of best fit determined from the least squares method has an equation that tells the story of the relationship between the data points. Line of best fit equations may be determined by computer software models, which include a summary of outputs for analysis, where the coefficients and summary outputs explain the dependence of the variables being tested.
Least Squares Regression Line

 If the data shows a linear relationship between two variables, the line that best fits this linear relationship is known as a least squares regression line, which minimizes the vertical distance from the data points to the regression line. The term "least squares" is used because it is the smallest sum of squared errors, which is also called the "variance".
Least Square Method
 For simple linear regression Y = a + bX, the least squares estimates of the coefficients are:
b = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²,   a = ȳ − b·x̄
Normal Equations
 The coefficients equivalently satisfy the normal equations:
Σy = N·a + b·Σx
Σxy = a·Σx + b·Σx²
Example
 Given X-values and Y-values as follows:
X=[1,2,3,4,5] and Y=[3,4,6,5,6]
Predict the value of Y when X=8?
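A minimal Python sketch of this worked example, using the least squares formulas above:

```python
# Least squares fit for X = [1,2,3,4,5], Y = [3,4,6,5,6]
X = [1, 2, 3, 4, 5]
Y = [3, 4, 6, 5, 6]
N = len(X)

x_bar = sum(X) / N                      # mean of X = 3.0
y_bar = sum(Y) / N                      # mean of Y = 4.8

# Slope b and intercept a from the least squares formulas
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) \
    / sum((x - x_bar) ** 2 for x in X)  # b = 0.7
a = y_bar - b * x_bar                   # a = 2.7

print(f"Y = {a} + {b}X")                  # Y = 2.7 + 0.7X
print("Prediction at X = 8:", a + b * 8)  # 8.3
```

So the predicted value of Y at X = 8 is 8.3.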
Logistic Regression
 Name is somewhat misleading. Really a technique for
classification, not regression.
 “Regression” comes from fact that we fit a linear model
to the feature space.
 Involves a more probabilistic view of classification.
 Logistic regression is basically a supervised classification algorithm. In a classification problem, the target variable (or output), y, can take only discrete values for a given set of features (or inputs), X.
Important Point: Logistic
Regression
 Logistic regression is used to find the probability of
event=Success and event=Failure. We should use
logistic regression when the dependent variable is
binary (0/ 1, True/ False, Yes/ No) in nature.
Representation: Logistic Regression
Logistic function
Consider different ways of expressing probability:
• Probability of success p, where 0 ≤ p ≤ 1
• Odds = p / (1 − p), the ratio of the probability of success to the probability of failure
• Log odds (logit) = ln(p / (1 − p))
From probability to log odds (and back again):
z = ln(p / (1 − p)),   p = 1 / (1 + e^(−z))
Standard logistic function:

f(x) = 1 / (1 + e^(−x))
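A small sketch of the logistic function and its inverse (the logit), showing the round trip from probability to log odds and back:

```python
import math

def logistic(x):
    """Standard logistic function: maps log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Log odds (logit): maps a probability to log odds."""
    return math.log(p / (1.0 - p))

p = 0.8
z = logit(p)          # ln(0.8 / 0.2) ≈ 1.386
print(logistic(z))    # back to the probability: 0.8
```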
Logistic regression
 Scenario:
 – A multidimensional feature space (features can be
categorical or continuous).
 – Outcome is discrete, not continuous.
We’ll focus on case of two classes/categories.
 – It seems plausible that a linear decision boundary
(hyperplane) will give good predictive accuracy.
So why not always use probability?
 The problem is that probability and odds have different properties that give odds some advantages in statistics. For example, in logistic regression the odds ratio represents the constant effect of a predictor X on the likelihood that one outcome will occur.
 The key phrase here is constant effect. In regression
models, we often want a measure of the unique effect of
each X on Y. If we try to express the effect of X on the
likelihood of a categorical Y having a specific value
through probability, the effect is not constant.
So why not always use probability?
 What that means is there is no way to express in one
number how X affects Y in terms of probability. The
effect of X on the probability of Y has different values
depending on the value of X.
 So while we would love to use probabilities because
they’re intuitive, you’re just not going to be able to
describe that effect in a single number. So if you need
to communicate that effect to a research audience,
you’re going to have to wrap your head around odds
ratios.
What about Probabilities
 What you can do, and many people do, is to use the logistic
regression model to calculate predicted probabilities at specific
values of a key predictor, usually when holding all other predictors
constant.
 This is a great approach to use together with odds ratios. The
odds ratio is a single summary score of the effect, and the
probabilities are more intuitive.
 Presenting probabilities without the corresponding odds ratios
can be problematic, though.
 First, when X, the predictor, is categorical, the effect of X can be effectively communicated through a difference or ratio of probabilities. The probability that a person has a relapse in an intervention condition compared to the control condition makes a lot of sense.
What about Probabilities (Continued)
 But the p-value for that effect is not the p-value for the
differences in probabilities.
 If you present a table of probabilities at different values of X,
most research audiences will, at least in their minds, make
those difference comparisons between the
probabilities. They do this because they’ve been trained to
do this in linear models.
 These differences in probabilities don't line up with the p-values in logistic regression models, though. And this can get quite confusing.
 Second, when X, the predictor is continuous, the odds ratio is
constant across values of X. But probabilities aren’t.
Using a logistic regression model
 Can interpret prediction from a logistic regression model
as:
 A probability of class membership
 A class assignment, by applying threshold to probability
*threshold represents decision boundary in feature
space
Training a logistic regression model
 Need to optimize β so the model gives the best possible
reproduction of training set labels
 Usually done by numerical approximation of maximum
likelihood
 On really large datasets, may use stochastic gradient
descent
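An illustrative sketch with scikit-learn (the toy data here is hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, SGDClassifier

# Hypothetical toy data: two features, binary class labels
X = np.array([[0.5, 1.2], [1.1, 0.8], [2.3, 2.9], [3.0, 2.5]])
y = np.array([0, 0, 1, 1])

# beta is optimized by numerical maximum likelihood (L-BFGS by default)
clf = LogisticRegression().fit(X, y)
print(clf.coef_, clf.intercept_)

# Probability of class membership, then class assignment via a threshold
print(clf.predict_proba([[1.5, 1.5]]))
print(clf.predict([[1.5, 1.5]]))

# On really large datasets, stochastic gradient descent can be used instead
# (the loss name is "log" in older scikit-learn versions)
sgd = SGDClassifier(loss="log_loss").fit(X, y)
```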
Logistic regression in one
dimension
Pros and Cons
Advantages:
 – Makes no assumptions about distributions of classes in feature
space
 Easily extended to multiple classes (multinomial regression)
 Natural probabilistic view of class predictions
 Quick to train
 Very fast at classifying unknown records
 Good accuracy for many simple data sets
 Resistant to overfitting
 Can interpret model coefficients as indicators of feature importance
Disadvantages:
 Linear decision boundary
Difference between Linear & Logistic Regression

Linear Regression | Logistic Regression
Predicts a continuous numeric output | Predicts a probability / discrete class label
Fits a straight line to the data | Fits an S-shaped (sigmoid) curve
Coefficients estimated by least squares | Coefficients estimated by maximum likelihood

Bayesian Learning
The Famous Coin Flip Experiment
 When we flip a coin, there are two possible outcomes — heads or tails. Of course, there is a third rare possibility where the coin balances on its edge without falling onto either side, which we assume is not a possible outcome of the coin flip for our discussion. We conduct a series of coin flips and record our observations, i.e. the number of heads (or tails) observed for a certain number of coin flips. In this experiment, we are trying to determine the fairness of the coin, using the number of heads (or tails) that we observe.
Frequentist Statistics
 Let's think about how we can determine the fairness of the coin using our observations in the above-mentioned experiment. Once we have conducted a sufficient number of coin flip trials, we can determine the frequency or the probability of observing heads (or tails). If we observe heads and tails with equal frequency, or the probability of observing heads (or tails) is 0.5, then it can be established that the coin is a fair coin. Failing that, it is a biased coin. Let's denote p as the probability of observing heads. Consequently, as the amount by which p deviates from 0.5 indicates how biased the coin is, p can be considered the degree-of-fairness of the coin.
Frequentist Statistics
 Testing whether a hypothesis is true or false by calculating
the probability of an event in a prolonged experiment is
known as frequentist statistics. As such, determining the
fairness of a coin by using the probability of observing the
heads is an example of frequentist statistics (a.k.a.
frequentist approach).
 Let us now further investigate the coin flip example using
the frequentist approach. Since we have not intentionally
altered the coin, it is reasonable to assume that we are
using an unbiased coin for the experiment. When we flip
the coin 10 times, we observe the heads 6 times. Therefore,
the p is 0.6 (note that p is the number of heads observed
over the number of total coin flips).
Frequentist Statistics
 Hence, according to frequentist statistics, the coin is a biased coin — which opposes our assumption of a fair coin. Perhaps one of your friends who is more skeptical than you extends this experiment to 100 trials using the same coin. Then she observes heads 55 times, which results in a different p of 0.55. Even though the new value for p does not change our previous conclusion (i.e. that the coin is biased), this observation raises several questions:
Questions:
• How confident are we of p being 0.6?
• How confident are we of p being 0.55?
• Which of these values is the accurate estimation of p?
• Will p continue to change when we further increase the number of coin flip trials?
Conclusion from Frequentist Statistics
 We cannot find out the exact answers to the first three questions using frequentist statistics. We may assume that the true value of p is closer to 0.55 than to 0.6, because the former is computed using observations from a considerably larger number of trials than the latter. Yet there is no way of confirming that hypothesis. However, if we further increase the number of trials, we may get a different probability from both of the above values for observing heads, and eventually we may even discover that the coin is a fair coin.
Table — Coin flip experiment results when increasing the number of trials

Number of coin flips | Number of heads | Probability of observing heads (p)
10 | 6 | 0.6
50 | 29 | 0.58
100 | 55 | 0.55
200 | 94 | 0.47
500 | 245 | 0.49
Table 1 presents some of the possible outcomes of a hypothetical coin flip experiment when we are increasing the number of trials. The fairness (p) of the coin changes when increasing the number of coin flips in this experiment. Our confidence in the estimated p may also increase when increasing the number of coin flips, yet frequentist statistics does not facilitate any indication of the confidence of the estimated p value. We can attempt to understand the importance of such a confidence measure by studying the following cases:
Cases
• An experiment with an infinite number of trials
guarantees p with absolute accuracy (100% confidence).
Yet, it is not practical to conduct an experiment with an
infinite number of trials and we should stop the experiment
after a sufficiently large number of trials. However, deciding
the value of this sufficient number of trials is a challenge
when using frequentist statistics.
• If we can determine the confidence of the estimated p value or the inferred conclusion in a situation where the number of trials is limited, this will allow us to decide whether to accept the conclusion or to extend the experiment with more trials until it achieves sufficient confidence.
Conclusion
 Moreover, we may have valuable insights or prior beliefs (for example, coins are usually fair and the coin used is not made biased intentionally, therefore p ≈ 0.5) that describe the value of p. Embedding that information can significantly improve the accuracy of the final conclusion. Such beliefs play a significant role in shaping the outcome of a hypothesis test, especially when we have limited data. However, with frequentist statistics, it is not possible to incorporate such beliefs or past experience to increase the accuracy of the hypothesis test.
Some Terms to Understand
 Before delving into Bayesian learning, it is essential to understand the
definition of some terminologies used.
• Random variable (Stochastic variable) — In statistics, the random
variable is a variable whose possible values are a result of a random
event. Therefore, each possible value of a random variable has some
probability attached to it to represent the likelihood of those values.
• Probability distribution — The function that defines the probability of
different outcomes/values of a random variable. The continuous
probability distributions are described using probability density
functions whereas discrete probability distributions can be represented
using probability mass functions.
• Conditional probability — This is a measure of probability P(A|B) of an
event A given that another event B has occurred.
• Joint probability distribution — The probability distribution defined over two or more random variables together, i.e. the probability of each combination of their values.
Classification problem
• Training data: examples of the form (d, h(d))
– where d are the data objects to classify (inputs)
– and h(d) is the correct class info for d, h(d) ∈ {1, …, K}
• Goal: given dnew, provide h(dnew)
A word about the Bayesian framework
• Allows us to combine observed data and prior knowledge
• Provides practical learning algorithms
• It is a generative (model based) approach, which offers a useful conceptual framework
– This means that any kind of objects (e.g. time series, trees, etc.) can be classified, based on a probabilistic model specification
Bayesian Classification: Why?
 Probabilistic learning: Calculate explicit probabilities for
hypothesis, among the most practical approaches to certain types
of learning problems
 Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct. Prior
knowledge can be combined with observed data.
 Probabilistic prediction: Predict multiple hypotheses, weighted by
their probabilities
 Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision making
against which other methods can be measured
Basic Formulas for Probabilities
• Product Rule: probability P(A,B) of a conjunction of two events A and B:
P(A,B) = P(A|B)·P(B) = P(B|A)·P(A)
• Sum Rule: probability of a disjunction of two events A and B:
P(A ∨ B) = P(A) + P(B) − P(A,B)
• Theorem of Total Probability: if events A1, …, An are mutually exclusive with Σ_{i=1}^{n} P(Ai) = 1, then:
P(B) = Σ_{i=1}^{n} P(B|Ai)·P(Ai)
Some Notations
•D: the training data

•H: the set of all hypotheses

•h: a hypothesis h ∈ H

•P(h): the prior probability of h: the initial probability that hypothesis h holds,
before we have observed the training data

•P(D): the prior probability that training data D will be observed

•P(D | h): the probability of observing data D given some world in which h holds

•P(x | y): the probability of x occurring given that y has been observed

•P(h | D): the posterior probability of h given D: the probability that h holds given
the observed training data D
2. Bayes' Rule
Understanding Bayes' rule
d = data, h = hypothesis

P(h|d) = P(d|h)·P(h) / P(d)

Proof. Just rearrange: P(h|d)·P(d) = P(d|h)·P(h), i.e. P(d,h) = P(d,h), the same joint probability on both sides.

This means: the posterior probability of h given d equals the probability of observing data d given some world in which h holds, times the prior probability of h, all over the prior probability of d.
Probabilities – auxiliary slide
for memory refreshing
• Have two dice h1 and h2
• The probability of rolling an i given die h1 is denoted
P(i|h1). This is a conditional probability
• Pick a die at random with probability P(hj), j=1 or 2. The probability for picking die
hj and rolling an i with it is called joint probability and is
P(i, hj)=P(hj)P(i| hj).
• For any events X and Y, P(X,Y)=P(X|Y)P(Y)

• If we know P(X,Y), then the so-called marginal probability P(X) can be computed as P(X) = Σ_Y P(X,Y)
• Probabilities sum to 1. Conditional probabilities sum to 1 provided that their
conditions are the same.
Choosing Hypotheses
• Maximum Likelihood (ML) hypothesis:
hML = argmax_{h∈H} P(d|h)
• Generally we want the most probable hypothesis given the training data. This is the maximum a posteriori (MAP) hypothesis:
hMAP = argmax_{h∈H} P(h|d) = argmax_{h∈H} P(d|h)·P(h) / P(d) = argmax_{h∈H} P(d|h)·P(h)
– Useful observation: it does not depend on the denominator P(d)
Notice in the final step above we dropped the term P(d) because it is a constant independent of h.
Maximum a Posteriori (MAP)
We can use MAP to determine the valid hypothesis from a set of hypotheses. According to MAP, the hypothesis that has the maximum posterior probability is considered as the valid hypothesis. Therefore, we can express the hypothesis θMAP that is concluded using MAP as follows:

θMAP = argmax_θ P(θi|X) = argmax_θ [ P(X|θi)·P(θi) / P(X) ]

The argmax_θ operator estimates the event or hypothesis θi that maximizes the posterior probability P(θi|X). Let us apply MAP to the above example in order to determine the true hypothesis:
Figure 1 — P(X|θ) and P(X|¬θ) when changing the P(θ) = p
Figure 1 illustrates how the posterior probabilities of
possible hypotheses change with the value of prior
probability. Unlike frequentist statistics, where our belief
or past experience had no influence on the concluded
hypothesis, Bayesian learning is capable of incorporating
our belief to improve the accuracy of predictions. Assuming
that we have fairly good programmers and therefore the
probability of observing a bug is P(θ) = 0.4 , then we find
the θMAP:
However, P(X) is independent of θ, and thus P(X) is the same for all the events or hypotheses. Therefore, we can simplify the θMAP estimation, without the denominator of each posterior computation, as shown below:

θMAP = argmax_θ P(X|θi)·P(θi)

MAP estimation algorithms do not compute the posterior probability of each hypothesis to decide which is the most probable hypothesis. Assuming that our hypothesis space is continuous (i.e. fairness of the coin encoded as probability of observing heads, coefficient of a regression model, etc.), where endless possible hypotheses are present even in the smallest range that the human mind can think of, or even for a discrete hypothesis space with a large number of possible outcomes for an event, we do not need to find the posterior of each hypothesis in order to decide which is the most probable hypothesis. Therefore, the practical implementation of MAP estimation algorithms uses approximation techniques, which are capable of finding the most probable hypothesis without computing posteriors, or by computing only some of them.
Using the Bayesian theorem, we can now
incorporate our belief as the prior probability,
which was not possible when we
used frequentist statistics. However, we still
have the problem of deciding a sufficiently
large number of trials or attaching a
confidence to the concluded hypothesis. This is
because the above example was solely
designed to introduce the Bayesian theorem
and each of its terms. Let us now gain a better
understanding of Bayesian learning to learn
about the full potential of Bayes' theorem.
An example: Does the patient have cancer or not?
• A patient takes a lab test and the result comes back positive. It is known that the test returns a correct positive result in only 98% of the cases and a correct negative result in only 97% of the cases. Furthermore, only 0.008 of the entire population has this disease.

1. What is the probability that this patient has cancer?
2. What is the probability that he does not have cancer?
3. What is the diagnosis?

Two alternative hypotheses: (1) that the patient has a particular form of cancer and (2) that the patient does not.

The available data is from a particular laboratory test with two possible outcomes: + (positive) and − (negative).

P(cancer) = 0.008,  P(¬cancer) = 0.992
P(+ | cancer) = 0.98,  P(− | cancer) = 0.02
P(+ | ¬cancer) = 0.03,  P(− | ¬cancer) = 0.97

P(cancer | +) = P(+ | cancer)·P(cancer) / P(+)
P(¬cancer | +) = P(+ | ¬cancer)·P(¬cancer) / P(+)

Suppose we now observe a new patient for whom the lab test returns a positive
result. Should we diagnose the patient as having cancer or not? The maximum a
posteriori hypothesis can be found using hMAP

While the posterior probability of cancer is significantly higher than its prior probability, the most probable hypothesis is still that the patient does not have cancer, i.e. hMAP = ¬cancer.
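A small sketch of this computation:

```python
# Posterior computation for the cancer diagnosis example
p_cancer, p_not = 0.008, 0.992
p_pos_cancer, p_pos_not = 0.98, 0.03

# Numerators of Bayes' rule (unnormalized posteriors)
num_cancer = p_pos_cancer * p_cancer   # 0.98 * 0.008 ≈ 0.0078
num_not = p_pos_not * p_not            # 0.03 * 0.992 ≈ 0.0298

# Normalize by P(+) = sum of the numerators
p_pos = num_cancer + num_not
print("P(cancer | +):", num_cancer / p_pos)   # ≈ 0.21
print("P(¬cancer | +):", num_not / p_pos)     # ≈ 0.79
# hMAP = ¬cancer, since 0.0298 > 0.0078
```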
3. Bayes' Theorem and Concept Learning

The Big Idea

"Since Bayes theorem provides a principled way to calculate the posterior probability of each hypothesis given the training data, we can use it as the basis for a straightforward learning algorithm that calculates the probability for each possible hypothesis, then outputs the most probable."

Bayes Theorem and Concept Learning
• Brute-Force Bayes Concept Learning
• MAP Hypotheses and Consistent Learners
What Is Concept Learning?
In terms of machine learning, "concept learning" can be defined as:

"The problem of searching through a predefined space of potential hypotheses for the hypothesis that best fits the training examples."
— Tom Mitchell

 Much of human learning involves acquiring general concepts from past experiences. For example, humans identify different vehicles among all the vehicles based on specific sets of features defined over a large set of features. This special set of features differentiates the subset of cars in a set of vehicles. This set of features that differentiates cars can be called a concept.
Similarly, machines can learn from concepts to identify whether an object belongs to a specific category by processing past/training data to find a hypothesis that best fits the training examples.
Target concept:

The set of items/objects over which the concept is defined is called the set of
instances and denoted by X. The concept or function to be learned is called the
target concept and denoted by c. It can be seen as a boolean valued function defined
over X and can be represented as c: X -> {0, 1}.
If we have a set of training examples with specific features of the target concept C, the problem faced by the learner is to estimate C from the training data. H is used to denote the set of all possible hypotheses that the learner may consider regarding the identity of the target concept. The goal of the learner is to find a hypothesis h ∈ H that can identify all the objects in X, so that h(x) = c(x) for all x in X.
An algorithm that supports concept learning requires:
1.Training data (past experiences to train our models)
2.Target concept (hypothesis to identify data objects)
3.Actual data objects (for testing the models)

Inductive Learning
Hypothesis
As we discussed earlier, the ultimate goal of concept learning is to identify a hypothesis h identical to the target concept c over the data set X, with the only available information about c being its value over X. Our algorithm can only guarantee that it best fits the training data. In other words:

"Any hypothesis found approximate the


target function well over a sufficiently
large set of training examples will also
approximate the target function well
over other unobserved examples."
For example, whether a person goes to a movie is based on four binary features
with two values (true or false):
1.Has money
2.Has free time
3.It’s a holiday
4.Has pending work

With the training data, we have with two data objects as positive samples and
one as negative:
1.x1: <true, true, false, false> : +ve
2.x2: <true, false, false, true> : +ve
3.x3:<true, false, false, true> : -ve
Hypothesis Notations
Each of the data objects represents a concept and hypotheses. Considering a
hypothesis <true, true, false, false> is more specific because it can cover
only one sample. Generally, we can add some notations into this hypothesis.
We have the following notations:
1.ⵁ (represents a hypothesis that rejects all)
2.< ? , ? , ? , ? > (accepts all)
3.<true, false, ? , ? > (accepts some)

The hypothesis ⵁ will reject all the data samples. The hypothesis <? , ? , ? , ?
> will accept all the data samples. The ? notation indicates that the values of
this specific feature do not affect the result.
The total number of possible hypotheses is (3 × 3 × 3 × 3) + 1 = 82, since each feature can be either true, false, or ?, plus the one hypothesis that rejects all (ⵁ).
General to Specific
Many machine learning algorithms rely on the concept of general-to-
specific ordering of hypothesis.
1.h1 = < true, true, ?, ? >
2.h2 = < true, ? , ? , ? >
Any instance classified by h1 will also be classified by h2. We can say
that h2 is more general than h1. Using this concept, we can find a
general hypothesis that can be defined over the entire dataset X.
To find a single hypothesis defined on X, we can use the concept of the more-general-than partial ordering. One way to do this is to start with the most specific hypothesis from H and generalize it each time it fails to classify an observed positive training data object as positive.
General to Specific

1.The first step in the Find-S algorithm is to start with the most specific hypothesis,
which can be denoted by h <- <ⵁ, ⵁ, ⵁ, ⵁ>.
2.This step involves picking up next training sample and applying Step 3 on the
sample.
3.The next step involves observing the data sample. If the sample is negative, the
hypothesis remains unchanged and we pick the next training sample by processing
Step 2 again. Otherwise, we process Step 4.
4.If the sample is positive and we find that our initial hypothesis is too specific
because it does not cover the current training sample, then we need to update our
current hypothesis. This can be done by the pairwise conjunction
(logical and operation) of the current hypothesis and training sample.
For example, if the next training sample is <true, true, false, false> and the current hypothesis is <ⵁ, ⵁ, ⵁ, ⵁ>, then we can directly replace our existing hypothesis with the new one.
General to Specific

If the next positive training sample is <true, true, false, true> and the current hypothesis is <true, true, false, false>, then we can perform a pairwise conjunction. With the current hypothesis and the next training sample, we can find a new hypothesis by putting ? in the place where the result of the conjunction is false:
<true, true, false, true> ∧ <true, true, false, false> = <true, true, false, ?>
Now, we can replace our existing hypothesis with the new one: h <- <true, true, false, ?>
5. This step involves repetition of Step 2 until we have more training samples.
6. Once there are no training samples, the current hypothesis is the one we
wanted to find. We can use the final hypothesis to classify the real objects.
Limitations of the Find-S
Algorithm
The Find-S algorithm for concept learning is one of the most basic algorithms of machine learning, though it has some limitations and disadvantages:
1.There's no way to determine if the only final hypothesis (found by Find-S) is
consistent with the data or there are more hypotheses that are consistent
with data.
2.Inconsistent sets of training examples can mislead the Find-S algorithm, as
it ignores negative data samples. An algorithm that can detect inconsistency
of training data is better.
3.A good concept learning algorithm should be able to backtrack the choice of
hypothesis found so that the resulting hypothesis can be improved over time.
Unfortunately, Find-S provides no such method.
More Terminology
Brute-Force Bayes Concept Learning

Brute-Force MAP learning algorithm:
1. For each hypothesis h in H, calculate the posterior probability P(h|D) = P(D|h)·P(h) / P(D)
2. Output the hypothesis hMAP with the highest posterior probability.

Constraining Our Example
Assume the training data D is noise-free, the target concept c is contained in H, and we have no a priori reason to prefer any hypothesis over any other.

These constraints imply:
P(h) = 1/|H| for all h in H
P(D|h) = 1 if h is consistent with D, and 0 otherwise

If h is consistent with D:
P(h|D) = (1 · 1/|H|) / P(D) = 1/|VS(H,D)|
where VS(H,D) is the version space: the subset of hypotheses from H that are consistent with D.

The derivation of P(D): P(D) = Σ_{h∈H} P(D|h)·P(h) = |VS(H,D)| / |H|

Therefore: P(h|D) = 1/|VS(H,D)| if h is consistent with D, and 0 otherwise.
Evolution of posterior probabilities

Evolution of posterior probabilities P(h|D) with increasing training data:
(a) Uniform priors assign equal probability to each hypothesis. As training data increases, first to D1 (b), then to D1 ∧ D2 (c), the posterior probability of inconsistent hypotheses becomes zero, while the posterior probabilities increase for hypotheses remaining in the version space.
MAP Hypotheses and Consistent Learners
A consistent learner is a learning algorithm that outputs a hypothesis that commits zero errors over the training examples.

Every consistent learner outputs a MAP hypothesis if we assume:
• a uniform prior probability distribution over H
• deterministic, noise-free training data

Example: Find-S outputs the maximally specific consistent hypothesis, which is a MAP hypothesis.
MAP Hypotheses and Consistent Learners
Characterizing the Behavior of Learning Algorithms
4.Maximum Likelihood and
Least-Squared Error Hypotheses
• In statistics, maximum likelihood estimation (MLE) is a method
of estimating the parameters of a probability
distribution by maximizing a likelihood function, so that under the
assumed statistical model the observed data is most probable. The point in
the parameter space that maximizes the likelihood function is called the
maximum likelihood estimate. The logic of maximum likelihood is both
intuitive and flexible, and as such the method has become a dominant means
of statistical inference.
• If the likelihood function is differentiable, the derivative test for determining
maxima can be applied. In some cases, the first-order conditions of the
likelihood function can be solved explicitly; for instance, the ordinary least
squares estimator maximizes the likelihood of the linear regression model.
Under most circumstances, however, numerical methods will be necessary to
find the maximum of the likelihood function.
• From the vantage point of Bayesian inference, MLE is a special case
of maximum a posteriori estimation (MAP) that assumes a uniform prior
distribution of the parameters. In frequentist inference, MLE is a special case
of an extremum estimator, with the objective function being the likelihood.
Maximum Likelihood and
Least-Squared Error Hypotheses
Premise: "...under certain assumptions any learning algorithm
that minimizes the squared error between the output hypothesis
predictions and the training data will output a maximum
likelihood hypothesis."

•neural networks do this


•so do other curve fitting methods
Terminology
Assume each training example is of the form (xi, di), where di = f(xi) + ei and the error ei is drawn independently from a Normal distribution with zero mean.

We wish to show that:
hML = argmax_{h∈H} p(D|h) = argmin_{h∈H} Σ_{i=1}^{m} (di − h(xi))²

Therefore, the maximum likelihood hypothesis is the one that minimizes the sum of squared errors over the training data.
5. Maximum Likelihood Hypotheses for Predicting Probabilities
Outputs and Inputs: here the target function is probabilistic, with training values di ∈ {0, 1} observed for inputs xi.

Therefore, the maximum likelihood hypothesis maximizes the cross-entropy criterion:
hML = argmax_{h∈H} Σ_{i=1}^{m} [ di ln h(xi) + (1 − di) ln(1 − h(xi)) ]

Furthermore, see Table 5.3, Textbook 1: Machine Learning by Tom Mitchell.

Gradient Search to Maximize Likelihood in a Neural Net

What we want: a weight-training rule for neural network learning that seeks to maximize the likelihood criterion above, denoted G(h,D), using gradient ascent.

Gradient ascent: wjk ← wjk + Δwjk, where Δwjk = η Σ_{i=1}^{m} (di − h(xi))·xijk
6. Minimum Description Length Principle

What is the Minimum Description Length Principle?
• A Bayesian perspective on Occam's razor
• Motivated by interpreting the definition of hMAP in the light of basic concepts from information theory.

Consider this:
hMAP = argmax_h P(D|h)·P(h) = argmin_h [ −log2 P(D|h) − log2 P(h) ]

Terminology used: LC(x) denotes the description length of message x under encoding C; −log2 P(h) is the description length of h under the optimal code for the hypothesis space, and −log2 P(D|h) is the description length of D given h under its optimal code.

Minimum Description Length Principle: choose hMDL where
hMDL = argmin_{h∈H} [ LC1(h) + LC2(D|h) ]
Naïve Bayes Classifier
Review: Prior and Posterior Probabilities
 P(A) and P(B) are called prior probabilities
 P(A|B), P(B|A) are called posterior probabilities

Example 8.6: Prior versus Posterior Probabilities
 This table shows that the event Y has two outcomes, namely A and B, which depend on another event X with various outcomes like x1, x2 and x3.

X | Y
x1 | A
x2 | A
x3 | B
x3 | A
x2 | B
x1 | A
x1 | B
x3 | B
x2 | B
x2 | A

 Case 1: Suppose we don't have any information about the event Y. Then, from the given sample space, we can calculate P(Y = A) = 5/10 = 0.5
 Case 2: Now, suppose we want to calculate P(X = x2 | Y = A) = 2/5 = 0.4

The latter is the conditional or posterior probability, whereas the former is the prior probability.
Naïve Bayesian Classifier
 Suppose Y is a class variable and X = (X1, X2, ….., Xn) is a set of attributes, with each instance of X labelled with a value of Y.

INPUT (X) | CLASS (Y)
… | …
x1, x2, …, xn | yi
… | …

 The classification problem can then be expressed as the class-conditional probability
P(Y = yi | X1 = x1 AND X2 = x2 AND ….. Xn = xn)
Naïve Bayesian Classifier
 The Naïve Bayesian classifier calculates this posterior probability using Bayes' theorem, as follows.
 From Bayes' theorem on conditional probability, we have

P(Y|X) = P(X|Y)·P(Y) / P(X)
       = P(X|Y)·P(Y) / [ P(X|Y = y1)·P(Y = y1) + ⋯ + P(X|Y = yk)·P(Y = yk) ]

where P(X) = Σ_{i=1}^{k} P(X|Y = yi)·P(Y = yi)

Note:
▪ P(X) is called the evidence (also the total probability) and it is a constant.
▪ The probability P(Y|X) (also called the class-conditional probability) is therefore proportional to P(X|Y)·P(Y).
▪ Thus, P(Y|X) can be taken as a measure of Y given X:
P(Y|X) ∝ P(X|Y)·P(Y)
Naïve Bayesian Classifier
 Suppose, for a given instance of X (say x = (X1 = x1) and ….. (Xn = xn)),
 there are two class-conditional probabilities, namely P(Y = yi | X = x) and P(Y = yj | X = x).
 If P(Y = yi | X = x) > P(Y = yj | X = x), then we say that yi is stronger than yj for the instance X = x.
 The strongest yi is the classification for the instance X = x.
Naïve Bayesian Classifier
 Example: With reference to the Air Traffic Dataset mentioned earlier, let us tabulate all the posterior and prior probabilities as shown below.

Attribute | On Time | Late | Very Late | Cancelled
Day = Weekday | 9/14 = 0.64 | 1/2 = 0.5 | 3/3 = 1 | 0/1 = 0
Day = Saturday | 2/14 = 0.14 | 1/2 = 0.5 | 0/3 = 0 | 1/1 = 1
Day = Sunday | 1/14 = 0.07 | 0/2 = 0 | 0/3 = 0 | 0/1 = 0
Day = Holiday | 2/14 = 0.14 | 0/2 = 0 | 0/3 = 0 | 0/1 = 0
Season = Spring | 4/14 = 0.29 | 0/2 = 0 | 0/3 = 0 | 0/1 = 0
Season = Summer | 6/14 = 0.43 | 0/2 = 0 | 0/3 = 0 | 0/1 = 0
Season = Autumn | 2/14 = 0.14 | 0/2 = 0 | 1/3 = 0.33 | 0/1 = 0
Season = Winter | 2/14 = 0.14 | 2/2 = 1 | 2/3 = 0.67 | 0/1 = 0
Naïve Bayesian Classifier (Continued)

Attribute | On Time | Late | Very Late | Cancelled
Fog = None | 5/14 = 0.36 | 0/2 = 0 | 0/3 = 0 | 0/1 = 0
Fog = High | 4/14 = 0.29 | 1/2 = 0.5 | 1/3 = 0.33 | 1/1 = 1
Fog = Normal | 5/14 = 0.36 | 1/2 = 0.5 | 2/3 = 0.67 | 0/1 = 0
Rain = None | 5/14 = 0.36 | 1/2 = 0.5 | 1/3 = 0.33 | 0/1 = 0
Rain = Slight | 8/14 = 0.57 | 0/2 = 0 | 0/3 = 0 | 0/1 = 0
Rain = Heavy | 1/14 = 0.07 | 1/2 = 0.5 | 2/3 = 0.67 | 1/1 = 1
Prior Probability | 14/20 = 0.70 | 2/20 = 0.10 | 3/20 = 0.15 | 1/20 = 0.05
Naïve Bayesian Classifier
Instance to classify:

Day = Weekday | Season = Winter | Fog = High | Rain = Heavy | Class = ???

Case 1: Class = On Time: 0.70 × 0.64 × 0.14 × 0.29 × 0.07 = 0.0013
Case 2: Class = Late: 0.10 × 0.50 × 1.0 × 0.50 × 0.50 = 0.0125
Case 3: Class = Very Late: 0.15 × 1.0 × 0.67 × 0.33 × 0.67 = 0.0222
Case 4: Class = Cancelled: 0.05 × 0.0 × 0.0 × 1.0 × 1.0 = 0.0000

Case 3 is the strongest; hence the correct classification is Very Late.
Naïve Bayesian Classifier
 Algorithm: Naïve Bayesian Classification

Example. ‘Play Tennis’ data
Day | Outlook | Temperature | Humidity | Wind | PlayTennis
Day1 | Sunny | Hot | High | Weak | No
Day2 | Sunny | Hot | High | Strong | No
Day3 | Overcast | Hot | High | Weak | Yes
Day4 | Rain | Mild | High | Weak | Yes
Day5 | Rain | Cool | Normal | Weak | Yes
Day6 | Rain | Cool | Normal | Strong | No
Day7 | Overcast | Cool | Normal | Strong | Yes
Day8 | Sunny | Mild | High | Weak | No
Day9 | Sunny | Cool | Normal | Weak | Yes
Day10 | Rain | Mild | Normal | Weak | Yes
Day11 | Sunny | Mild | Normal | Strong | Yes
Day12 | Overcast | Mild | High | Strong | Yes
Day13 | Overcast | Hot | Normal | Weak | Yes
Day14 | Rain | Mild | High | Strong | No
Based on the examples in the table, classify the following datum x, given as (attribute, value) pairs:
x = (Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong)
• That means: Play tennis or not?

hNB = argmax_{h∈{yes,no}} P(h) Π_t P(a_t | h)
    = argmax_{h∈{yes,no}} P(h)·P(Outlook = sunny | h)·P(Temp = cool | h)·P(Humidity = high | h)·P(Wind = strong | h)

• Working:
P(PlayTennis = yes) = 9/14 = 0.64
P(PlayTennis = no) = 5/14 = 0.36
P(Wind = strong | PlayTennis = yes) = 3/9 = 0.33
P(Wind = strong | PlayTennis = no) = 3/5 = 0.60
etc.
P(yes)·P(sunny|yes)·P(cool|yes)·P(high|yes)·P(strong|yes) = 0.0053
P(no)·P(sunny|no)·P(cool|no)·P(high|no)·P(strong|no) = 0.0206
 Answer: PlayTennis(x) = no
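A minimal Python sketch that reproduces this working:

```python
# Naïve Bayes scores for x = (Sunny, Cool, High, Strong),
# with all counts read off the Play Tennis table above
p_yes, p_no = 9/14, 5/14

like_yes = (2/9) * (3/9) * (3/9) * (3/9)   # P(sunny,cool,high,strong | yes)
like_no = (3/5) * (1/5) * (4/5) * (3/5)    # P(sunny,cool,high,strong | no)

score_yes = p_yes * like_yes               # ≈ 0.0053
score_no = p_no * like_no                  # ≈ 0.0206

print(round(score_yes, 4), round(score_no, 4))
print("PlayTennis(x) =", "yes" if score_yes > score_no else "no")   # no
```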
Naïve Bayesian Classifier
Pros and Cons
 The Naïve Bayes approach is a very popular one, which often works well.
 However, it has a number of potential problems:
 It relies on all attributes being categorical.
 If the training data is small, the probability estimates are poor.
Naïve Bayesian Classifier
Approach to overcome the limitations in Naïve Bayesian Classification
 Estimating the posterior probabilities for continuous attributes
 In real-life situations, all attributes are not necessarily categorical; in fact, there is often a mix of both categorical and continuous attributes.
 In the following, we discuss the schemes to deal with continuous attributes in a Bayesian classifier.
1. We can discretize each continuous attribute and then replace the continuous values with their corresponding discrete intervals.
2. We can assume a certain form of probability distribution for the continuous variable and estimate the parameters of the distribution using the training data. A Gaussian distribution is usually chosen to represent the posterior probabilities for continuous attributes. A general form of Gaussian distribution will look like

P(x : μ, σ²) = (1 / √(2πσ²)) · e^(−(x−μ)² / (2σ²))

where μ and σ² denote the mean and variance, respectively.
Naïve Bayesian Classifier
For each class Ci, the posterior probability for attribute Aj (a numeric attribute) can be calculated following the Gaussian normal distribution as follows:

P(Aj = aj | Ci) = (1 / √(2πσij²)) · e^(−(aj − μij)² / (2σij²))

Here, the parameter μij can be calculated based on the sample mean of the attribute values of Aj for the training records that belong to the class Ci. Similarly, σij² can be estimated from the variance of such training records.
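A sketch of this estimate in Python (the attribute values below are hypothetical):

```python
import math

def gaussian_likelihood(x, mu, var):
    """P(Aj = x | Ci) under a Gaussian with class-conditional mean and variance."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical numeric attribute values of Aj for training records of class Ci
values = [66.0, 70.0, 68.0, 72.0, 69.0]
mu = sum(values) / len(values)                                # sample mean
var = sum((v - mu) ** 2 for v in values) / (len(values) - 1)  # sample variance

print(gaussian_likelihood(71.0, mu, var))
```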
Naïve Bayesian Classifier
M-estimate of Conditional Probability
 The M-estimate deals with a potential problem of the Naïve Bayesian Classifier when the training data size is too small.
 If the conditional probability for one of the attributes is zero, then the overall class-conditional probability for the class vanishes.
 In other words, if the training data do not cover many of the attribute values, then we may not be able to classify some of the test records.
 This problem can be addressed by using the M-estimate approach.
M-estimate Approach
 The M-estimate approach can be stated as follows:

P(Aj = aj | Ci) = (nci + m·p) / (n + m)

where n = total number of training instances from class Ci,
nci = number of training examples from class Ci that take the value Aj = aj,
m = a parameter known as the equivalent sample size, and
p = a user-specified prior estimate of the probability.

Note: If n = 0, that is, if there is no training set available, then P(aj|Ci) = p; this default value is used in the absence of sample data.
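A small sketch (the choices of m and p here are illustrative):

```python
def m_estimate(n_ci, n, m=3, p=1/3):
    """M-estimate of P(Aj = aj | Ci).

    n_ci: number of class-Ci examples with Aj = aj
    n:    total number of class-Ci examples
    m:    equivalent sample size (user chosen)
    p:    user-specified prior estimate of the probability
    """
    return (n_ci + m * p) / (n + m)

# A zero raw count no longer wipes out the class-conditional product:
print(m_estimate(0, 14))   # ≈ 0.059 instead of 0
```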
A Practice Example
Example 1:
Class:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'

Data instance:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age | income | student | credit_rating | buys_computer
<=30 | high | no | fair | no
<=30 | high | no | excellent | no
31…40 | high | no | fair | yes
>40 | medium | no | fair | yes
>40 | low | yes | fair | yes
>40 | low | yes | excellent | no
31…40 | low | yes | excellent | yes
<=30 | medium | no | fair | no
<=30 | low | yes | fair | yes
>40 | medium | yes | fair | yes
<=30 | medium | yes | excellent | yes
31…40 | medium | no | excellent | yes
31…40 | high | yes | fair | yes
>40 | medium | no | excellent | no
A Practice Example
 P(Ci): P(buys_computer = "yes") = 9/14 = 0.643
P(buys_computer = "no") = 5/14 = 0.357

 Compute P(X|Ci) for each class:
P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4

 X = (age <= 30, income = medium, student = yes, credit_rating = fair)

P(X|Ci): P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019

P(X|Ci)·P(Ci): P(X | buys_computer = "yes") · P(buys_computer = "yes") = 0.028
P(X | buys_computer = "no") · P(buys_computer = "no") = 0.007

Therefore, X belongs to class ("buys_computer = yes")
Bayesian belief network
 Bayesian belief networks assume conditional independence only between subsets of the attributes.
 Conditional independence:
 For discrete-valued random variables X, Y, Z:
 X is conditionally independent of Y given Z if
P(X | Y, Z) = P(X | Z)
Bayesian belief network
Representation
 A Bayesian network represents the joint probability distribution of a set of variables
 Each variable is represented by a node
 Conditional independence assumptions are indicated by a directed acyclic graph
 Each variable is conditionally independent of its nondescendants in the network, given its immediate predecessors
Bayesian belief network
The joint probabilities are calculated as

P(Y1, Y2, …, Yn) = Π_{i=1}^{n} P(Yi | Parents(Yi))

The values P(Yi | Parents(Yi)) are stored in conditional probability tables associated with the nodes Yi.

Example:
P(Campfire = True | Storm = True, BusTourGroup = True) = 0.4
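A toy sketch of this factorization, assuming Storm and BusTourGroup are the parents of Campfire; apart from the 0.4 entry quoted above, all numbers are made-up placeholders:

```python
# Per-node (conditional) probability tables
p_storm = {True: 0.2, False: 0.8}   # placeholder prior
p_bus = {True: 0.5, False: 0.5}     # placeholder prior
p_campfire_true = {                 # P(Campfire=True | Storm, BusTourGroup)
    (True, True): 0.4,              # entry given in the example above
    (True, False): 0.1,             # placeholders
    (False, True): 0.8,
    (False, False): 0.2,
}

def joint(storm, bus, campfire):
    """P(Storm, Bus, Campfire) = product over nodes of P(Yi | Parents(Yi))."""
    p_c = p_campfire_true[(storm, bus)]
    if not campfire:
        p_c = 1.0 - p_c
    return p_storm[storm] * p_bus[bus] * p_c

print(joint(True, True, True))   # 0.2 * 0.5 * 0.4 = 0.04
```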
Bayesian belief network
Inference
 We wish to infer the probability distribution for some
variable given observed values for (a subset of) the other
variables
 Exact (and sometimes approximate) inference of
probabilities for an arbitrary BN is NP-hard
 There are numerous methods for probabilistic inference in
BN (for instance, Monte Carlo), which have been shown
to be useful in many cases
Bayesian belief network
Learning Bayesian Belief Networks
Task: Devising effective algorithms for learning BBN
from training data
 Focus of much current research interest
 For given network structure, gradient ascent can be
used to learn the entries of conditional probability tables
 Learning the structure of BBN is much more difficult,
although there are successful approaches for some
particular problems
EM Algorithm
"A widely used approach to learning in the presence of unobserved variables. The EM algorithm can be used even for variables whose value is never directly observed, provided the general form of the probability distribution governing these variables is known.

The EM algorithm has been used to train Bayesian belief networks as well as radial basis function networks.

The EM algorithm is also the basis for many unsupervised clustering algorithms and it is the basis for the widely used Baum-Welch forward-backward algorithm for learning Partially Observable Markov Models."
EM Algorithm
Estimating Means of k Gaussians
The easiest way to introduce the EM algorithm is via an example. Consider a problem in
which the data D is a set of instances generated by a probability distribution that is a
mixture of k distinct Normal distributions. This problem setting is illustrated in Figure 6.4
for the case where k = 2 and where the instances are the points shown along the x axis.
Each instance is generated using a two-step process. First, one of the k Normal
distributions is selected at random. Second, a single random instance xi is generated
according to this selected distribution.
This process is repeated to generate a set of data points as shown in the figure.
EM Algorithm
We would like to find a maximum likelihood hypothesis for these means; that is, a hypothesis h that maximizes p(D | h). In this case, the hypothesis is the vector of means, h = <μ1, …, μk>.
EM Algorithm
 Now, in the first step (Estimation), calculate the expected value E[zij] of each hidden variable zij (the probability that instance xi was generated by the j-th Normal distribution), assuming the current hypothesis h = <μ1, μ2> holds:

E[zij] = p(x = xi | μ = μj) / Σ_{n=1}^{2} p(x = xi | μ = μn) = e^(−(xi−μj)²/2σ²) / Σ_{n=1}^{2} e^(−(xi−μn)²/2σ²)

 In the second step (Maximization), calculate a new maximum likelihood hypothesis, replacing each μj by the expectation-weighted sample mean:

μj ← Σ_{i=1}^{m} E[zij]·xi / Σ_{i=1}^{m} E[zij]
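A compact sketch of these two steps for k = 2 (the data and the fixed variance are illustrative):

```python
import math
import random

def em_two_means(data, sigma2=1.0, iters=50):
    """EM sketch estimating the means of a mixture of two Gaussians."""
    mu = random.sample(data, 2)                # initial guesses for the means
    for _ in range(iters):
        # E-step: expected memberships E[z_ij] under the current means
        resp = []
        for x in data:
            w = [math.exp(-(x - m) ** 2 / (2 * sigma2)) for m in mu]
            s = sum(w)
            resp.append([wi / s for wi in w])
        # M-step: each mean becomes the responsibility-weighted sample mean
        for j in range(2):
            den = sum(r[j] for r in resp)
            mu[j] = sum(r[j] * x for r, x in zip(resp, data)) / den
    return mu

data = [1.2, 0.8, 1.1, 4.9, 5.3, 5.1]   # hypothetical mixture sample
print(em_two_means(data))                # means near ~1.0 and ~5.1
```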
EM Algorithm
General Statement of EM Algorithm

Let X = {x1, …, xm} denote the observed data in a set of m independently drawn instances, let Z = {z1, …, zm} denote the unobserved data in these same instances, and let Y = X ∪ Z denote the full data.
We use h to denote the current hypothesized values of the parameters ϴ, and h'
to denote the revised hypothesis that is estimated on each iteration of the EM
algorithm.
The EM algorithm searches for the maximum likelihood hypothesis h' by seeking
the h' that maximizes E[ln P(Y |h')].
The EM algorithm uses its current hypothesis h in place of the actual
parameters ϴ to estimate the distribution governing Y. Let us define a function
Q(h' |h) that gives E[ln P(Y |h')] as a function of h', under the
assumption that ϴ = h and given the observed portion X of the full data Y.
EM Algorithm
In its general form, the EM algorithm repeats the following two steps until convergence:

When the function Q is continuous, the EM algorithm converges to a stationary


point of the likelihood function P(Y | h’).
When this likelihood function has a single maximum, EM will converge to this
global maximum likelihood estimate for h'. Otherwise, it is guaranteed only to
converge to a local maximum.
Support Vector Machines (SVM)
History of SVM
 SVM is related to statistical learning theory
 SVM was first introduced in 1992
 SVM becomes popular because of its success in
handwritten digit recognition
 1.1% test error rate for SVM. This is the same as the error
rates of a carefully constructed neural network.
 SVM is now regarded as an important example of “kernel
methods”, one of the key area in machine learning

Introduction
 What is a classification problem?
 How can it be thought of as a prediction problem?
 Support Vector Machine (SVM) as a classification technique:
 Received considerable attention
 SVM has its roots in statistical learning theory
 Has shown promising results in many practical applications
 For example:
 Handwritten digit recognition
 Text categorization
 Regression
Introduction
 SVM works very well with
 High-dimensional data,
 Avoids the curse of dimensionality problem.
 Another unique aspect of this approach is that
 it represents the decision boundary using a subset of the
training examples, known as the support vectors.
 Goal of the SVM
 To find the optimal separating hyperplane
 which maximizes the margin of training data.
Hyperplane (Decision boundary)
 Decided by the maximum margin
Maximum Margin Hyperplanes
Maximum Margin
 (Figure: points labelled +1 and −1 separated by a maximum-margin hyperplane.)
 Support vectors are those data points that the margin pushes up against.
Rationale for Maximum Margin
Linear SVM: Separable Case
 x – input vector (x1, x2)
 w (or wᵀ) – normal vector to the hyperplane
 b – scale value (bias)
Estimate the Margin
 (Figure: points labelled +1 and −1, and a point x at distance d(x) from the hyperplane w·x + b = 0.)
 x – vector, w – normal vector, b – scale value
 What is the distance expression for a point x to the line w·x + b = 0?

d(x) = |x·w + b| / ||w||₂ = |x·w + b| / √(Σ_{i=1}^{d} wᵢ²)
Large-margin Decision Boundary for a linear classifier
 The decision boundary should be as far away from the data of both classes as possible
 We should maximize the margin, m
 The distance between the origin and the line wᵀx = −b is b/||w||
 (Figure: Class 1 and Class 2 separated by a boundary with margin m.)
Learning a Linear SVM Model

Learning a Linear SVM Model: Example
Linear SVM: Nonseparable Case
 When it is not possible to separate the training data linearly,
 SVM can still construct a linear decision boundary,
 even in situations where the classes are not linearly separable.
Non-Linear SVMs
A non-linear transformation maps a dataset into a higher-dimensional space (mapping to a higher dimension); this is the foundation of a non-linear system. The graph below shows a non-linear dataset on which a linear kernel cannot be used, whereas a Gaussian kernel can.
In geometry, a hyperplane is a subspace whose dimension is one less than that of its ambient space. If space is 3-dimensional then its hyperplanes are the 2-dimensional planes, while if space is 2-dimensional, its hyperplanes are the 1-dimensional lines. This notion can be used in any general space in which the concept of the dimension of a subspace is defined.

 The mapping function takes 2D to 3D, and the projection returns the 3D to 2D.
Non-linear SVMs
◼ Datasets that are linearly separable with some noise work out great.
◼ But what are we going to do if the dataset is just too hard?
◼ How about… mapping the data to a higher-dimensional space (e.g. x → (x, x²))?
Extension to Non-linear Decision Boundary
 So far, we have only considered large-margin classifiers with a linear decision boundary
 How to generalize it to become nonlinear?
 Key idea: transform xi to a higher dimensional space to "make life easier"
 Input space: the space where the points xi are located
 Feature space: the space of f(xi) after transformation
Non-linear SVMs: Feature spaces
◼ General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)

(Figure: points in the input space mapped by f(.) into the feature space.)

Transforming the Data
 Note: the feature space is of higher dimension than the input space in practice.
 Computation in the feature space can be costly because it is high dimensional
 The feature space is typically infinite-dimensional!
 The kernel trick comes to the rescue
Types of Kernels
 SVM algorithms use a set of mathematical functions that are defined as the kernel. The function of a kernel is to take data as input and transform it into the required form. Different SVM algorithms use different types of kernel functions. These functions can be of different types:
 for example linear, nonlinear, polynomial, radial basis function (RBF), and sigmoid.
 Kernel functions exist for sequence data, graphs, text, and images, as well as vectors. The most used type of kernel function is RBF, because it has a localized and finite response along the entire x-axis. The kernel functions return the inner product between two points in a suitable feature space, thus defining a notion of similarity, with little computational cost even in very high-dimensional spaces.
Polynomial kernel
 It is popular in image processing. Equation is:
K(x, y) = (xᵀy + 1)^d
where d is the degree of the polynomial.

Gaussian kernel
 It is a general-purpose kernel; used when there is no prior knowledge about the data. Equation is:
K(x, y) = exp(−||x − y||² / (2σ²))

Gaussian radial basis function (RBF)
 It is a general-purpose kernel; used when there is no prior knowledge about the data. Equation is:
K(x, y) = exp(−γ ||x − y||²)
Sometimes parametrized using γ = 1/(2σ²).

Laplace RBF kernel
 It is a general-purpose kernel; used when there is no prior knowledge about the data. Equation is:
K(x, y) = exp(−||x − y|| / σ)

Hyperbolic tangent kernel
 We can use it in neural networks. Equation is:
K(x, y) = tanh(k·xᵀy + c), for some (not every) k > 0 and c < 0.

Sigmoid kernel
 We can use it as the proxy for neural networks. Equation is:
K(x, y) = tanh(α·xᵀy + c)

Bessel function of the first kind kernel
 We can use it to remove the cross term in mathematical functions. Equation is:
K(x, y) = Jᵥ₊₁(σ||x − y||) / ||x − y||^(−n(v+1))
where J is the Bessel function of the first kind.

ANOVA radial basis kernel
 We can use it in regression problems. Equation is:
K(x, y) = Σ_{k=1}^{n} ( exp(−σ(xᵏ − yᵏ)²) )^d

Linear splines kernel in one dimension
 It is useful when dealing with large sparse data vectors. It is often used in text categorization. The splines kernel also performs well in regression problems. Equation is:
K(x, y) = 1 + xy + xy·min(x, y) − ((x + y)/2)·min(x, y)² + min(x, y)³/3
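A small sketch of the three kernel types named in the syllabus (linear, polynomial, and Gaussian):

```python
import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)

def polynomial_kernel(x, y, d=2, c=1.0):
    return (np.dot(x, y) + c) ** d

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
y = np.array([2.0, 0.5])
print(linear_kernel(x, y), polynomial_kernel(x, y), gaussian_kernel(x, y))
```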
The Kernel Trick
 Recall the SVM optimization problem: the data points only appear as inner products xᵢᵀxⱼ
 As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly
 Many common geometric operations (angles, distances) can be expressed by inner products
 Define the kernel function K by K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ)
The kernel trick
 For many mappings f from a low-D space to a high-D space, there is a simple operation on two vectors in the low-D space that can be used to compute the scalar product of their two images in the high-D space:

K(xa, xb) = f(xa) · f(xb)

(letting the kernel do the work, instead of doing the scalar product in the obvious way)
An Example for f(.) and K(.,.)
 Suppose f(.) is given as follows:
f([x₁, x₂]) = (1, √2·x₁, √2·x₂, x₁², x₂², √2·x₁x₂)
 An inner product in the feature space is
f(x)·f(y) = (1 + x₁y₁ + x₂y₂)²
 So, if we define the kernel function as K(x, y) = (1 + x₁y₁ + x₂y₂)², there is no need to carry out f(.) explicitly
 This use of a kernel function to avoid carrying out f(.) explicitly is known as the kernel trick
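A quick numeric check of this identity:

```python
import math

def phi(x):
    """Explicit degree-2 feature map f(.) for 2-D input."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return [1, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]

def kernel(x, y):
    """The same inner product computed directly in the input space."""
    return (1 + x[0] * y[0] + x[1] * y[1]) ** 2

x, y = (1.0, 2.0), (3.0, 0.5)
explicit = sum(a * b for a, b in zip(phi(x), phi(y)))
print(explicit, kernel(x, y))   # both print 25.0
```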
More on Kernel Functions
 Not all similarity measures can be used as kernel functions, however
 The kernel function needs to satisfy Mercer's condition, i.e., the function is positive semi-definite
 This implies that the n by n kernel matrix, in which the (i,j)-th entry is K(xi, xj), is always positive semi-definite
 This also means that the optimization problem can be solved in polynomial time!
◼ Mercer's theorem:
Every semi-positive definite symmetric function is a kernel
◼ Semi-positive definite symmetric functions correspond to a semi-positive definite symmetric Gram matrix:

K = [ K(x1,x1)  K(x1,x2)  K(x1,x3)  …  K(x1,xN)
      K(x2,x1)  K(x2,x2)  K(x2,x3)  …  K(x2,xN)
      …         …         …         …  …
      K(xN,x1)  K(xN,x2)  K(xN,x3)  …  K(xN,xN) ]
Examples of Kernel Functions
 Linear: K(x, y) = xᵀy
 Polynomial kernel with degree d: K(x, y) = (xᵀy + 1)^d
 Radial (Gaussian) basis function kernel with width σ: K(x, y) = exp(−||x − y||² / (2σ²))
 Closely related to radial basis function neural networks
 The feature space is infinite-dimensional
 Sigmoid with parameters k and θ: K(x, y) = tanh(k·xᵀy + θ)
 It does not satisfy the Mercer condition for all k and θ
Example
 Suppose we have 5 one-dimensional data points:
 x1=1, x2=2, x3=4, x4=5, x5=6, with 1, 2, 6 as class 1 and 4, 5 as class 2, so y1=1, y2=1, y3=−1, y4=−1, y5=1
 We use the polynomial kernel of degree 2:
 K(x, y) = (xy + 1)²
 C is set to 100
 We first find ai (i = 1, …, 5) by maximizing
Σᵢ aᵢ − ½ Σᵢ Σⱼ aᵢaⱼyᵢyⱼK(xᵢ, xⱼ), subject to 0 ≤ aᵢ ≤ 100 and Σᵢ aᵢyᵢ = 0
Example
 By using a QP solver, we get a1=0, a2=2.5, a3=0, a4=7.333, a5=4.833
 Note that the constraints are indeed satisfied
 The support vectors are {x2=2, x4=5, x5=6}
 The discriminant function is
f(z) = 2.5(1)(2z+1)² + 7.333(−1)(5z+1)² + 4.833(1)(6z+1)² + b = 0.6667z² − 5.333z + b
 b is recovered by solving f(2)=1, or by f(5)=−1, or by f(6)=1, as x2 and x5 lie on the line f(z)=1 and x4 lies on the line f(z)=−1
 All three give b = 9, so f(z) = 0.6667z² − 5.333z + 9
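A sketch reproducing this example with scikit-learn's SVC; with gamma = 1 and coef0 = 1, the degree-2 polynomial kernel is exactly K(x, y) = (xy + 1)²:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1])

clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=100).fit(X, y)

print("Support vectors:", clf.support_vectors_.ravel())  # expect 2, 5, 6
print("Dual coefficients (a_i * y_i):", clf.dual_coef_)
print("f(6):", clf.decision_function([[6.0]]))  # 0.6667*36 - 5.333*6 + 9 ≈ 1
```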
Characteristics of SVM
SVM has many desirable qualities that make it one of the most widely used classification algorithms. Following is a summary of the general characteristics of SVM:

 Flexibility in choosing a similarity function
 Sparseness of solution when dealing with large data sets
- only support vectors are used to specify the separating hyperplane
 Ability to handle large feature spaces
- complexity does not depend on the dimensionality of the feature space
 Overfitting can be controlled by the soft margin approach
 Nice math property: a simple convex optimization problem which is guaranteed to converge to a single global solution
 Feature selection
 SVM can be applied to categorical data,
 by introducing dummy variables for each categorical attribute value present in the data.
 For example, if Marital Status has three values (Single, Married, Divorced), we can introduce a binary variable for each of the attribute values.
Characteristics of SVM
 The SVM learning problem can be formulated as a convex optimization problem,
 in which efficient algorithms are available to find the global minimum of the objective function.
 Other classification methods, such as rule-based classifiers and artificial neural networks,
 employ a greedy strategy to search the hypothesis space.
 Such methods tend to find only locally optimal solutions.

 SVM performs capacity control,
 by maximizing the margin of the decision boundary.
 Nevertheless, the user must still provide other parameters, such as the type of kernel function to use and the cost parameter C for introducing each slack variable.
SVM Applications
 SVM has been used successfully in many real-world
problems
- text (and hypertext) categorization
- image classification
- bioinformatics (Protein classification,
Cancer classification)
- hand-written character recognition
Strengths and Weaknesses of SVM
