Supervised Learning
By: Shedriko
Preliminary
Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples; each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal)
Preliminary
X = space of input values, Y = space of output values. The goal is to learn a function h : X → Y so that h(x) is a good predictor for the corresponding value of y
Regression:
the target variable is continuous, such as in the housing price prediction problem
Classification:
the target variable takes on a small number of discrete values, such as whether the dwelling is a house or an apartment
A. Linear Regression
Preliminary
For linear regression, the hypothesis takes the form
h_θ(x) = θ_0 + θ_1 x_1 + … + θ_n x_n = θ^T x (with the convention x_0 = 1)
Problem
How do we learn the parameters θ?
Make h(x) close to y, i.e. make h(x(i)) close to y(i) for each training example
This is formalized by the cost function
J(θ) = (1/2) Σ_{i=1..m} (h_θ(x(i)) − y(i))^2
LMS algorithm
Gradient descent repeatedly performs the update
θ_j := θ_j − α ∂J(θ)/∂θ_j
where α is the learning rate. For a single training example, working out the derivative gives the LMS (least mean squares) update rule, also known as the Widrow-Hoff rule:
θ_j := θ_j + α (y(i) − h_θ(x(i))) x_j(i)
Initialized at (48, 30), batch gradient descent on the housing data gives
θ_0 = 71.27
θ_1 = 0.1345
or, with a second input feature added,
θ_0 = 89.60
θ_1 = 0.1392
θ_2 = −8.738
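The batch gradient descent procedure above can be sketched in a few lines of NumPy. The dataset below is a synthetic stand-in (the slides' housing data is not reproduced here), and the learning rate and iteration count are illustrative choices:

```python
import numpy as np

# Synthetic stand-in for the housing data (the slides' dataset is not
# reproduced here): living area vs. price, roughly linear plus noise.
rng = np.random.default_rng(0)
area = rng.uniform(1000, 3000, size=50)
price = 90.0 + 0.14 * area + rng.normal(0, 20, size=50)

# Design matrix with an intercept column (convention x_0 = 1);
# the feature is rescaled so gradient descent is well conditioned.
X = np.column_stack([np.ones_like(area), area / 1000.0])
y = price

theta = np.zeros(2)   # initial parameters
alpha = 0.1           # learning rate (illustrative choice)

# Batch gradient descent: theta_j := theta_j + alpha * sum_i (y(i) - h(x(i))) * x_j(i)
# (the sum is averaged over m here, which just rescales alpha).
for _ in range(5000):
    residual = y - X @ theta            # y(i) - h_theta(x(i)) for all i
    theta += alpha * (X.T @ residual) / len(y)

print(theta)  # converges to the least-squares fit
```

Rescaling the living-area feature keeps the two gradient components on comparable scales; without it, a single learning rate either diverges on one coordinate or crawls on the other.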
Normal equations
Gradient descent is iterative; alternatively, we can minimize J(θ) in closed form. For a function f mapping from m-by-n matrices to the real numbers, we define the derivative of f with respect to A to be the matrix of partial derivatives ∂f/∂A_ij
Suppose the design matrix X, which contains the training examples' input vectors as its rows, is given by
X = [ (x(1))^T ; (x(2))^T ; … ; (x(m))^T ]
And let y be the m-dimensional vector containing all the target values of the training set:
y = [ y(1) ; y(2) ; … ; y(m) ]
Since h_θ(x(i)) = (x(i))^T θ, we can verify that the i-th entry of Xθ − y is h_θ(x(i)) − y(i), thus
J(θ) = (1/2)(Xθ − y)^T (Xθ − y) = (1/2) Σ_{i=1..m} (h_θ(x(i)) − y(i))^2
Hence, taking the gradient and setting it to zero,
∇_θ J(θ) = X^T X θ − X^T y = 0
gives the normal equations X^T X θ = X^T y, whose solution is
θ = (X^T X)^(−1) X^T y
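A minimal sketch of the closed-form solution, on synthetic data (the design matrix and targets here are made up for illustration):

```python
import numpy as np

# Synthetic training set (illustrative; any design matrix with full
# column rank works).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])  # x_0 = 1
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0, 0.1, size=30)

# Normal equations: solve X^T X theta = X^T y.
# np.linalg.solve is preferable to forming (X^T X)^(-1) explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ y)

# At the minimizer the gradient X^T X theta - X^T y vanishes.
gradient = X.T @ X @ theta - X.T @ y
print(theta, np.abs(gradient).max())
```

Solving the linear system rather than inverting X^T X is the standard numerically safer choice; both implement the same formula.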
Probabilistic Interpretation
Assume:
y(i) = θ^T x(i) + ε(i)
where ε(i) is an error term that captures unmodeled effects or random noise
Further assume that the ε(i) are distributed IID (independently and identically distributed) according to a Gaussian distribution (normal distribution), i.e.
ε(i) ~ N(0, σ^2)
The density of ε(i) is given by
p(ε(i)) = (1/(√(2π) σ)) exp(−(ε(i))^2 / (2σ^2))
This implies that
p(y(i) | x(i); θ) = (1/(√(2π) σ)) exp(−(y(i) − θ^T x(i))^2 / (2σ^2))
The notation p(y(i) | x(i); θ) indicates that this is the distribution of y(i) given x(i) and parameterized by θ
Probabilistic Interpretation
Given X (the design matrix, containing all the x(i)) and θ, the probability of the data is given by p(y | X; θ). When we wish to view this explicitly as a function of θ, we call it the likelihood function:
L(θ) = L(θ; X, y) = p(y | X; θ)
By the independence assumption, this can also be written
L(θ) = Π_{i=1..m} p(y(i) | x(i); θ) = Π_{i=1..m} (1/(√(2π) σ)) exp(−(y(i) − θ^T x(i))^2 / (2σ^2))
Probabilistic Interpretation
The derivations will be a bit simpler if we instead maximize the log likelihood ℓ(θ):
ℓ(θ) = log L(θ) = m log(1/(√(2π) σ)) − (1/σ^2) · (1/2) Σ_{i=1..m} (y(i) − θ^T x(i))^2
Maximizing ℓ(θ) (the same as minimizing its negative) therefore amounts to minimizing
(1/2) Σ_{i=1..m} (y(i) − θ^T x(i))^2
(which we recognize as J(θ), the original least-squares cost function)
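The equivalence can be checked numerically. The sketch below (synthetic data, arbitrary σ) verifies that ℓ(θ) = m log(1/(√(2π) σ)) − J(θ)/σ^2 holds for several values of θ, so the two objectives differ only by quantities independent of θ:

```python
import numpy as np

# Check numerically that l(theta) = m*log(1/(sqrt(2*pi)*sigma)) - J(theta)/sigma^2,
# so maximizing l is the same as minimizing J. Data and sigma are arbitrary.
rng = np.random.default_rng(2)
m, sigma = 20, 1.5
X = np.column_stack([np.ones(m), rng.normal(size=m)])
y = X @ np.array([0.5, 2.0]) + rng.normal(0, sigma, size=m)

def J(theta):
    r = y - X @ theta
    return 0.5 * r @ r                       # least-squares cost

def log_likelihood(theta):
    r = y - X @ theta                        # Gaussian log density, summed
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - r**2 / (2 * sigma**2))

const = m * np.log(1.0 / (np.sqrt(2 * np.pi) * sigma))
for theta in [np.zeros(2), np.array([1.0, -1.0]), np.array([0.5, 2.0])]:
    assert np.isclose(log_likelihood(theta), const - J(theta) / sigma**2)
print("identity holds")
```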
LWR algorithm
In the original linear regression algorithm, to make a prediction at a query point x (i.e. to evaluate h(x)), we would:
1. fit θ to minimize Σ_i (y(i) − θ^T x(i))^2
2. output θ^T x
In contrast, the locally weighted linear regression (LWR) algorithm does the following:
1. fit θ to minimize Σ_i w(i) (y(i) − θ^T x(i))^2
2. output θ^T x
A fairly standard choice for the weights is
w(i) = exp(−(x(i) − x)^2 / (2τ^2))
so that
if |x(i) − x| is small, then w(i) is close to 1
if |x(i) − x| is large, then w(i) is small
The parameter τ is called the bandwidth
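A minimal LWR sketch, assuming a one-dimensional input with an intercept term and the Gaussian weights above (the data and the bandwidth τ are illustrative):

```python
import numpy as np

def lwr_predict(X, y, x_query, tau):
    """Fit theta to minimize sum_i w(i) (y(i) - theta^T x(i))^2, output theta^T x."""
    # Gaussian weights: close points get weight near 1, far points near 0.
    d2 = np.sum((X - x_query) ** 2, axis=1)
    w = np.exp(-d2 / (2 * tau**2))
    # Weighted normal equations: (X^T W X) theta = X^T W y.
    WX = X * w[:, None]
    theta = np.linalg.solve(X.T @ WX, WX.T @ y)
    return x_query @ theta

# Noisy nonlinear data that a single global line fits poorly.
rng = np.random.default_rng(3)
x = np.sort(rng.uniform(-3, 3, 80))
y = np.sin(x) + rng.normal(0, 0.1, size=80)
X = np.column_stack([np.ones_like(x), x])  # intercept + one feature

# Prediction at x = 0.5 with bandwidth tau = 0.5 (illustrative values).
pred = lwr_predict(X, y, np.array([1.0, 0.5]), tau=0.5)
print(pred)
```

Note that the fit is redone for every query point, which is why LWR is called a non-parametric method: the whole training set must be kept around to make predictions.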
B. Classification and Logistic Regression
Preliminary
Focus on the binary classification problem, in which y can take on only two values, 0 (zero) and 1 (one)
For example: to build a spam classifier for email, x(i) may be some features of a piece of email, and y may be 1 if the email is spam and 0 otherwise
0 (zero): the negative class (−)
1 (one): the positive class (+)
Given x(i), the corresponding y(i) is called the label for the training example
Logistic regression
To make h_θ(x) take values in [0, 1], we will choose
h_θ(x) = g(θ^T x) = 1 / (1 + e^(−θ^T x))
where g(z) = 1 / (1 + e^(−z)) is called the logistic function or the sigmoid function
Useful property of the derivative of the sigmoid function:
g′(z) = g(z)(1 − g(z))
Assume that
P(y = 1 | x; θ) = h_θ(x)
P(y = 0 | x; θ) = 1 − h_θ(x)
Note: it can be written more compactly as
p(y | x; θ) = (h_θ(x))^y (1 − h_θ(x))^(1−y)
Assuming that the m training examples were generated independently, we can write down the likelihood of the parameters as
L(θ) = Π_{i=1..m} p(y(i) | x(i); θ) = Π_{i=1..m} (h_θ(x(i)))^(y(i)) (1 − h_θ(x(i)))^(1−y(i))
As before, it is easier to maximize the log likelihood, which is given by
ℓ(θ) = log L(θ) = Σ_{i=1..m} y(i) log h_θ(x(i)) + (1 − y(i)) log(1 − h_θ(x(i)))
Gradient ascent (similar to the derivation in linear regression) performs θ := θ + α ∇_θ ℓ(θ). By using the property g′(z) = g(z)(1 − g(z)), it gives the update rule
θ_j := θ_j + α (y(i) − h_θ(x(i))) x_j(i)
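The update rule can be sketched as batch gradient ascent in NumPy (the data below are synthetic, and the learning rate and iteration count are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic binary labels drawn from a logistic model (illustrative).
rng = np.random.default_rng(4)
m = 100
X = np.column_stack([np.ones(m), rng.normal(size=(m, 2))])
y = (rng.uniform(size=m) < sigmoid(X @ np.array([-0.5, 2.0, -1.0]))).astype(float)

# Batch gradient ascent on the log likelihood:
# theta_j := theta_j + alpha * sum_i (y(i) - h(x(i))) * x_j(i), averaged over m.
theta = np.zeros(3)
alpha = 0.1
for _ in range(2000):
    h = sigmoid(X @ theta)
    theta += alpha * (X.T @ (y - h)) / m

accuracy = np.mean((sigmoid(X @ theta) > 0.5) == (y == 1.0))
print(theta, accuracy)
```

The update has the same form as the LMS rule for linear regression, but with h_θ(x) now being the sigmoid of θ^T x rather than θ^T x itself.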
If we then keep the same update rule
θ_j := θ_j + α (y(i) − h_θ(x(i))) x_j(i)
as before, but with the modified definition of g forcing the output to be exactly 0 or 1:
g(z) = 1 if z ≥ 0, and 0 otherwise
then we have the perceptron learning algorithm
Newton's method
Suppose we have some function f : ℝ → ℝ, and we wish to find a value of θ so that f(θ) = 0. Newton's method performs the update
θ := θ − f(θ) / f′(θ)
So by letting f(θ) = ℓ′(θ), to maximize ℓ(θ) we obtain the update
θ := θ − ℓ′(θ) / ℓ″(θ)
(for vector-valued θ, this generalizes to θ := θ − H^(−1) ∇_θ ℓ(θ), where H is the Hessian of ℓ)
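Applied to the logistic regression log likelihood, where ∇ℓ = X^T (y − h) and the Hessian is −X^T S X with S = diag(h(1 − h)) (a standard result, not derived in the slides), the vector form of Newton's method can be sketched as:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic binary labels from a logistic model (illustrative).
rng = np.random.default_rng(5)
m = 200
X = np.column_stack([np.ones(m), rng.normal(size=(m, 2))])
y = (rng.uniform(size=m) < sigmoid(X @ np.array([0.5, 1.0, -1.5]))).astype(float)

# Newton's method for maximizing l(theta):
# theta := theta - H^(-1) grad, grad = X^T (y - h), H = -X^T S X, S = diag(h(1-h)).
theta = np.zeros(3)
for _ in range(10):
    h = sigmoid(X @ theta)
    grad = X.T @ (y - h)
    H = -(X.T * (h * (1 - h))) @ X   # Hessian of the log likelihood
    theta -= np.linalg.solve(H, grad)

print(theta)
```

Each iteration is more expensive than a gradient step (it solves a linear system in the Hessian), but far fewer iterations are typically needed.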
C. Generalized Linear Models
Preliminary
The exponential family: a class of distributions that can be written in the form
p(y; η) = b(y) exp(η^T T(y) − a(η))
where η is the natural parameter, T(y) is the sufficient statistic, and a(η) is the log partition function
Bernoulli distribution (Bernoulli(φ)):
specifies a distribution over y ∈ {0, 1}, so that
p(y = 1; φ) = φ and p(y = 0; φ) = 1 − φ
Writing p(y; φ) = φ^y (1 − φ)^(1−y) = exp(y log(φ/(1−φ)) + log(1−φ)) shows it is in the exponential family, with natural parameter η = log(φ/(1−φ)), i.e. φ = 1/(1 + e^(−η)) in terms of η
Gaussian distribution:
with σ^2 = 1, p(y; μ) = (1/√(2π)) exp(−(y − μ)^2 / 2) = (1/√(2π)) exp(−y^2/2) · exp(μy − μ^2/2) is also in the exponential family, with η = μ and T(y) = y
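The Bernoulli rewriting can be checked numerically; the sketch below verifies p(y; φ) = exp(ηy − a(η)) with a(η) = log(1 + e^η), and that inverting η recovers φ:

```python
import numpy as np

# Verify p(y; phi) = phi^y (1-phi)^(1-y) = exp(eta*y - a(eta)) with
# eta = log(phi/(1-phi)), a(eta) = log(1 + e^eta), b(y) = 1.
for phi in [0.1, 0.5, 0.9]:
    eta = np.log(phi / (1 - phi))        # natural parameter
    a = np.log(1 + np.exp(eta))          # log partition function
    for y in (0, 1):
        direct = phi**y * (1 - phi) ** (1 - y)
        exp_family = np.exp(eta * y - a)
        assert np.isclose(direct, exp_family)
    # Inverting eta recovers phi via the logistic (sigmoid) function.
    assert np.isclose(1 / (1 + np.exp(-eta)), phi)
print("Bernoulli is in the exponential family")
```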
Constructing GLMs
To derive a GLM for predicting some random variable y as a function of x in a classification or regression problem, we make three assumptions about the conditional distribution of y given x and about our model:
1. y | x; θ ~ ExponentialFamily(η), i.e. given x and θ, the distribution of y follows some exponential family distribution with parameter η
2. Our goal is to predict the expected value of T(y) given x. In most of our examples we will have T(y) = y, so this means we would like the prediction h(x) output by our learned hypothesis h to satisfy h(x) = E[y | x]
(Remember, this was the case for logistic regression, where we had
h_θ(x) = E[y | x; θ]
= 0 · p(y = 0 | x; θ) + 1 · p(y = 1 | x; θ)
= p(y = 1 | x; θ))
3. The natural parameter η and the inputs x are related linearly: η = θ^T x
(if η is vector-valued, η_i = θ_i^T x)
Ordinary Least Squares
Consider the setting where the target variable y (also called the response variable in GLM terminology) is continuous, and we model the conditional distribution of y given x as Gaussian: N(μ, σ^2)
We had η = μ for the Gaussian, so we have
h_θ(x) = E[y | x; θ] = μ = η = θ^T x
Logistic Regression
Given that y is binary-valued, it seems natural to choose the Bernoulli family of distributions: y | x; θ ~ Bernoulli(φ). From the exponential family formulation we had φ = 1/(1 + e^(−η)), then
h_θ(x) = E[y | x; θ] = φ = 1/(1 + e^(−η)) = 1/(1 + e^(−θ^T x))
Softmax Regression
For example: we first classify email into two classes (spam or not-spam), which is binary classification; we may then want to classify it into three classes (spam, personal, and work-related mail)
Here y ∈ {1, 2, …, k}, and thus we model it as a multinomial distribution
For notational convenience, we parameterize the multinomial with φ_1, …, φ_(k−1), where φ_i = p(y = i; φ) and φ_k = 1 − Σ_{i=1..k−1} φ_i
T(y) is not y now: T(y) is a (k − 1)-dimensional vector, rather than a real number, with T(y)_i = 1{y = i}
We also use the indicator notation 1{·}, which takes the value 1 if its argument is true and 0 otherwise, e.g.
1{2 = 3} = 0
1{3 = 5 − 2} = 1
The multinomial is a member of the exponential family:
p(y; φ) = φ_1^(1{y=1}) φ_2^(1{y=2}) ⋯ φ_k^(1{y=k}) = b(y) exp(η^T T(y) − a(η))
where
η_i = log(φ_i / φ_k) (for i = 1, …, k − 1), a(η) = −log(φ_k), b(y) = 1
We have defined the link function η_i = log(φ_i / φ_k). To invert the link function and derive the response function, we therefore have that
e^(η_i) = φ_i / φ_k
φ_k e^(η_i) = φ_i
φ_k Σ_{i=1..k} e^(η_i) = Σ_{i=1..k} φ_i = 1, so φ_k = 1 / Σ_{i=1..k} e^(η_i)
This can be substituted back into φ_i = φ_k e^(η_i) to give the response function
φ_i = e^(η_i) / Σ_{j=1..k} e^(η_j)
This function mapping from the η's to the φ's is called the softmax function
To complete our model, we use:
1. assumption 3, given earlier, that the η_i's are linearly related to the x's
2. so we have η_i = θ_i^T x (for i = 1, …, k − 1), where θ_1, …, θ_(k−1) ∈ ℝ^(n+1) are the parameters of our model
3. we also define θ_k = 0, so that η_k = θ_k^T x = 0, as given previously
Hence the conditional distribution of y given x is
p(y = i | x; θ) = φ_i = e^(η_i) / Σ_{j=1..k} e^(η_j) = e^(θ_i^T x) / Σ_{j=1..k} e^(θ_j^T x)
Our hypothesis therefore outputs the estimated probability that y = i for each class:
h_θ(x) = E[T(y) | x; θ] = [φ_1, φ_2, …, φ_(k−1)]^T
with p(y = k | x; θ) = 1 − Σ_{i=1..k−1} φ_i. This model, which applies to classification problems where y ∈ {1, …, k}, is called softmax regression; it is a generalization of logistic regression
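The softmax response function can be sketched directly. The parameter values below are arbitrary, and subtracting the maximum before exponentiating is a standard numerical-stability trick (it cancels in the ratio):

```python
import numpy as np

def softmax_probs(Theta, x):
    """p(y = i | x; theta) = exp(theta_i^T x) / sum_j exp(theta_j^T x)."""
    eta = Theta @ x
    eta = eta - eta.max()   # numerical stability; cancels in the ratio
    e = np.exp(eta)
    return e / e.sum()

# Three classes, intercept plus two features; the last class's parameter
# vector is fixed to zero, matching the convention theta_k = 0.
Theta = np.array([
    [1.0, 2.0, -1.0],    # theta_1
    [0.5, -1.0, 0.5],    # theta_2
    [0.0, 0.0, 0.0],     # theta_3 = 0 (reference class)
])
x = np.array([1.0, 0.3, -0.2])  # x_0 = 1 intercept
p = softmax_probs(Theta, x)
print(p, p.sum())  # entries in (0, 1) summing to 1
```

With k = 2 and θ_2 = 0, this reduces exactly to the logistic regression hypothesis from earlier.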
Thank you.