
Supervised Learning

By : Shedriko

Outline & Content


Preliminary
Linear Regression
Classification and Logistic Regression
Generalized Linear Models

Preliminary
Supervised learning
the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples; each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal)

Preliminary

x^(i) = input, y^(i) = output; the superscript (i) is an index into the training set (not exponentiation)

(x^(i), y^(i))
a training example
{(x^(i), y^(i)); i = 1, 2, ..., m}
the training set

Preliminary
X = space of input values, Y = space of output values
Regression :
the target variable is continuous, such as in the housing price prediction problem

Classification :
the target variable takes on only a small number of discrete values, such
as whether a dwelling is a house or an apartment

A. Linear Regression

Outline & Content


Preliminary
Problem
LMS (least mean square) algorithm
Normal equations
Probabilistic interpretation
LWR (locally weighted linear regression)

Preliminary

θ : the weights (parameters), parameterizing the space of linear functions mapping from X to Y
Hypothesis h_θ(x) ≈ y :
h_θ(x) = θ_0 + θ_1 x_1 + ... + θ_n x_n = θ^T x   (with x_0 = 1)

Problem
How do we learn the parameters θ?
Make h(x) close to y, i.e. make h_θ(x^(i)) close to y^(i)
For each value of θ, measure this with the cost function
J(θ) = (1/2) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))²

LMS algorithm

Minimize J(θ) using the gradient descent algorithm
θ_j := θ_j - α (∂/∂θ_j) J(θ)
α : learning rate
θ_j : starts from some initial guess and is repeatedly updated
Consider first the case of only one training example

LMS algorithm

For a single training example, the update rule is
θ_j := θ_j + α (y^(i) - h_θ(x^(i))) x_j^(i)

Repeating the update for i = 1 to m (batch gradient descent), it gives
θ_j := θ_j + α Σ_{i=1}^{m} (y^(i) - h_θ(x^(i))) x_j^(i)
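
As an illustration (not part of the original slides), here is a minimal NumPy sketch of the batch update above; the function and parameter names (batch_gradient_descent, alpha, num_iters) are made up for this example, and X is assumed to already include a column of ones for the intercept term.

import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Minimize J(theta) = 1/2 * sum((X @ theta - y)**2) by batch gradient descent."""
    theta = np.zeros(X.shape[1])                # initial guess
    for _ in range(num_iters):
        errors = y - X @ theta                  # y^(i) - h_theta(x^(i)) for every example
        theta = theta + alpha * (X.T @ errors)  # theta_j += alpha * sum_i errors_i * x_j^(i)
    return theta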

LMS algorithm
Example (housing data), with the parameters initialized at (48, 30)

Batch gradient descent gives
θ_0 = 71.27
θ_1 = 0.1345
Or, fitting with an additional input feature,
θ_0 = 89.60
θ_1 = 0.1392
θ_2 = -8.738

The normal equations

Matrix derivatives

For a function f : R^(m×n) → R mapping from m-by-n matrices to real numbers, we define the derivative of f with respect to A to be:
∇_A f(A) = [ ∂f/∂A_11  ⋯  ∂f/∂A_1n ; ⋮ ; ∂f/∂A_m1  ⋯  ∂f/∂A_mn ]

Thus, the gradient ∇_A f(A) is itself an m-by-n matrix, whose (i,j) element is ∂f/∂A_ij

The normal equations

Example :

Suppose A = [ A_11  A_12 ; A_21  A_22 ] is a 2-by-2 matrix, and the function f : R^(2×2) → R is given by
f(A) = (3/2) A_11 + 5 A_12² + A_21 A_22

Here, A_ij denotes the (i,j) entry of the matrix A,

we then have
∇_A f(A) = [ 3/2   10 A_12 ; A_22   A_21 ]

The normal equations

Trace

The trace (written tr), for an n-by-n (square) matrix A, is defined to be the sum of its diagonal entries :
trA = Σ_{i=1}^{n} A_ii

If a is a real number (i.e. a 1-by-1 matrix), then tr a = a

A property of the trace for two matrices (such that AB is square):
trAB = trBA

We also have, e.g.
trABC = trCAB = trBCA
trA = trA^T ,  tr(A + B) = trA + trB ,  tr aA = a trA

The normal equations

We then have :
∇_A trAB = B^T                              (1)
∇_{A^T} f(A) = (∇_A f(A))^T                 (2)
∇_A trABA^T C = CAB + C^T AB^T              (3)

And
∇_A |A| = |A| (A^(-1))^T                    (4)

Equation (4) applies only to non-singular square matrices A, where |A| denotes the determinant of A

The normal equations

Least squares revisited

Define the design matrix X to be the m-by-n matrix (m-by-(n+1) if we include the intercept term) that contains the training examples' input values in its rows:
X = [ (x^(1))^T ; (x^(2))^T ; ... ; (x^(m))^T ]

Also, let y⃗ be the m-dimensional vector containing all the target values from the training set:
y⃗ = [ y^(1) ; y^(2) ; ... ; y^(m) ]

The normal equations

Since h_θ(x^(i)) = (x^(i))^T θ, we can verify that
Xθ - y⃗ = [ h_θ(x^(1)) - y^(1) ; ... ; h_θ(x^(m)) - y^(m) ]

For a vector z, we have that z^T z = Σ_i z_i² , thus
(1/2)(Xθ - y⃗)^T (Xθ - y⃗) = (1/2) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))² = J(θ)

The normal equations

To minimize J, we take its derivatives with respect to θ. Combining equations (2) and (3), we find that
∇_{A^T} trABA^T C = B^T A^T C^T + BA^T C        (5)

Hence
∇_θ J(θ) = ∇_θ (1/2)(Xθ - y⃗)^T (Xθ - y⃗) = X^T Xθ - X^T y⃗

The normal equations

To minimize J, we set its derivatives to zero, and obtain the
normal equations
X^T Xθ = X^T y⃗

Thus, the value of θ that minimizes J(θ) is given in closed form by
θ = (X^T X)^(-1) X^T y⃗
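
As a quick sketch (an addition to these notes, not from the slides), the closed-form solution can be computed with NumPy; np.linalg.solve is used instead of an explicit matrix inverse for numerical stability, and the helper name normal_equation is illustrative.

import numpy as np

def normal_equation(X, y):
    """Least-squares fit via the normal equations: solve X^T X theta = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Usage sketch: theta = normal_equation(X, y); predictions = X @ theta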

Probabilistic Interpretation
Assume : y^(i) = θ^T x^(i) + ε^(i) ,
where ε^(i) is an error term
Further assume the ε^(i) are independently and identically
distributed (IID) according to a Gaussian distribution
(normal distribution), i.e.
ε^(i) ~ N(0, σ²)
The density of ε^(i) is given by
p(ε^(i)) = (1/(√(2π) σ)) exp( -(ε^(i))² / (2σ²) )
This implies that
p(y^(i) | x^(i); θ) = (1/(√(2π) σ)) exp( -(y^(i) - θ^T x^(i))² / (2σ²) )

The notation p(y^(i) | x^(i); θ) indicates that this is the
distribution of y^(i) given x^(i) and parameterized by θ

Probabilistic Interpretation
Given X (the design matrix, which contains all of the x^(i)), the probability of the data is given by p(y⃗ | X; θ)
Viewed as a function of θ, this is the likelihood function
L(θ) = L(θ; X, y⃗) = p(y⃗ | X; θ)
By the independence assumption on the ε^(i), this can also be written
L(θ) = ∏_{i=1}^{m} p(y^(i) | x^(i); θ) = ∏_{i=1}^{m} (1/(√(2π) σ)) exp( -(y^(i) - θ^T x^(i))² / (2σ²) )

The principle of maximum likelihood says we should choose θ so as to make the data as high probability as possible, i.e. we maximize L(θ)

Probabilistic Interpretation
The derivations will be a bit simpler if we instead
maximize the log likelihood ℓ(θ) :
ℓ(θ) = log L(θ) = m log(1/(√(2π) σ)) - (1/σ²) · (1/2) Σ_{i=1}^{m} (y^(i) - θ^T x^(i))²

Hence, maximizing ℓ(θ) is the same as minimizing
(1/2) Σ_{i=1}^{m} (y^(i) - θ^T x^(i))² , which we recognize as J(θ), the original least-squares cost function
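
To make this equivalence concrete, the following small NumPy check (an illustrative addition; the synthetic data, noise level and helper names are assumptions of this example) verifies that the least-squares solution attains at least as high a Gaussian log likelihood as another candidate θ.

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])   # design matrix with intercept column
theta_true = np.array([2.0, -1.0])
sigma = 0.5
y = X @ theta_true + rng.normal(scale=sigma, size=200)      # y = theta^T x + epsilon

def log_likelihood(theta):
    resid = y - X @ theta
    return -0.5 * len(y) * np.log(2 * np.pi * sigma**2) - np.sum(resid**2) / (2 * sigma**2)

theta_ls = np.linalg.solve(X.T @ X, X.T @ y)                    # least-squares fit
assert log_likelihood(theta_ls) >= log_likelihood(theta_true)   # maximizing l(theta) == minimizing J(theta)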

LWR algorithm
In the original linear regression algorithm, to make
a prediction at a query point x (i.e. to evaluate h(x)),
we would :
1. fit θ to minimize Σ_i (y^(i) - θ^T x^(i))²
2. output θ^T x
In contrast, the locally weighted linear regression
algorithm does the following :
1. fit θ to minimize Σ_i w^(i) (y^(i) - θ^T x^(i))²
2. output θ^T x

LWR algorithm

The w^(i) are non-negative valued weights; a fairly standard choice is
w^(i) = exp( -(x^(i) - x)² / (2τ²) )

If |x^(i) - x| is small, then w^(i) is close to 1 (one)
If |x^(i) - x| is large, then w^(i) is small
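
A minimal sketch of this procedure (names such as lwr_predict and tau are illustrative, not from the slides), assuming X is the design matrix with an intercept column of ones and x_query is a single query row in the same format:

import numpy as np

def lwr_predict(X, y, x_query, tau=0.8):
    """Locally weighted linear regression prediction at a single query point."""
    diffs = X - x_query                                   # x^(i) - x for every example
    w = np.exp(-np.sum(diffs**2, axis=1) / (2 * tau**2))  # non-negative weights
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)     # fit theta to the weighted problem
    return x_query @ theta                                # output theta^T x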

B. Classification and Logistic Regression

Outline & Content


Preliminary
Logistic regression
Digression : The perceptron learning algorithm
Another algorithm for maximizing ℓ(θ)

Preliminary
Focus on the binary classification problem, in which y can take on only two values, 0 (zero) and 1 (one)
For example : to build a spam classifier for email,
x^(i) may be some features of a piece of email, and
y may be 1 if it is spam and 0 otherwise
0 (zero) : negative class (-)
1 (one) : positive class (+)
Given x^(i), the corresponding y^(i) is called the label for the training example

Logistic regression
To keep h_θ(x) from taking values larger than 1 or smaller than 0, we will choose
h_θ(x) = g(θ^T x) = 1 / (1 + e^(-θ^T x))
where
g(z) = 1 / (1 + e^(-z))
is called the logistic function or sigmoid function

g(z) tends towards 1 as z → ∞
g(z) tends towards 0 as z → -∞

Logistic regression
A useful property of the derivative of the sigmoid function:
g′(z) = g(z)(1 - g(z))

Assume that
P(y = 1 | x; θ) = h_θ(x)
P(y = 0 | x; θ) = 1 - h_θ(x)
Note : this can be written more compactly as
p(y | x; θ) = (h_θ(x))^y (1 - h_θ(x))^(1-y)
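
The derivative identity can be checked numerically; this short snippet (an illustrative addition, not from the slides) compares a central finite difference of g with g(z)(1 - g(z)) at a few points.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in (-2.0, 0.0, 1.5):
    numeric = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6   # central difference approximation
    analytic = sigmoid(z) * (1 - sigmoid(z))                   # g'(z) = g(z)(1 - g(z))
    assert np.isclose(numeric, analytic, atol=1e-6)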

Logistic regression
Assuming that the m training examples were generated independently, we can write down the likelihood of the parameters as
L(θ) = ∏_{i=1}^{m} (h_θ(x^(i)))^(y^(i)) (1 - h_θ(x^(i)))^(1-y^(i))

As before, it is easier to maximize the log likelihood
ℓ(θ) = log L(θ) = Σ_{i=1}^{m} [ y^(i) log h_θ(x^(i)) + (1 - y^(i)) log(1 - h_θ(x^(i))) ]

Logistic regression

To maximize ℓ(θ), we use gradient ascent (similar to the derivation in linear regression):
θ := θ + α ∇_θ ℓ(θ)

By using g′(z) = g(z)(1 - g(z)), working with one training example gives
∂ℓ(θ)/∂θ_j = (y - h_θ(x)) x_j
and hence the stochastic gradient ascent rule
θ_j := θ_j + α (y^(i) - h_θ(x^(i))) x_j^(i)
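
As a sketch of the full-batch version of this ascent rule (function and parameter names such as logistic_gradient_ascent and alpha are assumptions of this example; X is assumed to include an intercept column):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_ascent(X, y, alpha=0.1, num_iters=1000):
    """Maximize the log likelihood l(theta) by batch gradient ascent."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        errors = y - sigmoid(X @ theta)         # y^(i) - h_theta(x^(i))
        theta = theta + alpha * (X.T @ errors)  # theta := theta + alpha * grad l(theta)
    return theta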

Digression: The perceptron algorithm

We digress by modifying the logistic regression method to force it to output values that are exactly either 0 or 1
To do so, it seems natural to change the definition of g to be the threshold function:
g(z) = 1 if z ≥ 0 ;  g(z) = 0 if z < 0

If we then let h_θ(x) = g(θ^T x) as before, but with this modified definition of g, and use the update rule
θ_j := θ_j + α (y^(i) - h_θ(x^(i))) x_j^(i)
then we have the perceptron learning algorithm
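
A small sketch of that algorithm (illustrative names, not from the slides), which differs from the logistic regression code above only in the choice of g:

import numpy as np

def perceptron_train(X, y, alpha=1.0, num_epochs=10):
    """Perceptron learning: same update rule, but g is the threshold function."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_epochs):
        for x_i, y_i in zip(X, y):                  # one training example at a time
            h = 1.0 if x_i @ theta >= 0 else 0.0    # g(theta^T x) with the threshold g
            theta = theta + alpha * (y_i - h) * x_i
    return theta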

Another algorithm for maximizing ℓ(θ)

Suppose f : R → R, and we wish to find a value of θ so that f(θ) = 0
Here θ is a real number; Newton's method performs the update
θ := θ - f(θ) / f′(θ)

Another algorithm for maximizing ℓ(θ)

The leftmost figure : the function f together with the line y = 0;
the value of θ for which f(θ) = 0 is about 1.3
The middle figure : one step of Newton's method (following the tangent line down to y = 0) gives θ ≈ 2.8
The rightmost figure : the next iteration gives θ ≈ 1.8
After a few more iterations, we rapidly approach θ = 1.3
What if we want to use Newton's method to maximize some function ℓ?
The maxima of ℓ correspond to points where its first derivative ℓ′(θ) is zero

Another algorithm for maximizing ℓ(θ)

So, by letting f(θ) = ℓ′(θ), to maximize ℓ we obtain the update
θ := θ - ℓ′(θ) / ℓ″(θ)

In our setting θ is vector-valued; the Newton-Raphson method (the generalization of
Newton's method to this multidimensional setting)
is given by
θ := θ - H^(-1) ∇_θ ℓ(θ)
∇_θ ℓ(θ) is the vector of partial derivatives of ℓ(θ) with respect to the θ_i's
H is an n-by-n matrix called the Hessian, whose entries are given by
H_ij = ∂²ℓ(θ) / (∂θ_i ∂θ_j)
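
For logistic regression, the gradient and Hessian of ℓ(θ) have the closed forms ∇ℓ = X^T (y - h) and H = -X^T diag(h(1-h)) X; the sketch below (an illustrative addition, assuming NumPy and an intercept column in X) applies the Newton-Raphson update with them.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, num_iters=10):
    """Maximize the logistic regression log likelihood with Newton-Raphson."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)                       # gradient of l(theta)
        H = -(X.T * (h * (1 - h))) @ X             # Hessian of l(theta)
        theta = theta - np.linalg.solve(H, grad)   # theta := theta - H^{-1} grad
    return theta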

Another algorithm for maximizing ℓ(θ)

When Newton's method is applied to maximize the logistic regression log likelihood function ℓ(θ), the resulting method is also called Fisher scoring

C. Generalized Linear Models

Outline & Content


Preliminary
The exponential family
Constructing GLMs

Preliminary

In the regression example, we had y | x; θ ~ N(μ, σ²)

In the classification one, we had y | x; θ ~ Bernoulli(φ)

In this section, we will see that both of these methods are special cases of a broader family of models, called Generalized Linear Models (GLMs)

Other models in the GLM family can be derived and applied to other classification and regression problems

The exponential family

A class of distributions is in the exponential family if
it can be written in the form
p(y; η) = b(y) exp( η^T T(y) - a(η) )          (6)

η : the natural parameter (also called the canonical parameter)
T(y) : the sufficient statistic (for the distributions we consider, often T(y) = y)
a(η) : the log partition function
e^(-a(η)) essentially plays the role of a normalization constant, making sure p(y; η) sums/integrates over y to 1 (one)
A fixed choice of T, a and b defines a family (or set)
of distributions that is parameterized by η; as we vary η, we then
get different distributions within this family

The exponential family

The Bernoulli distribution with mean φ, written Bernoulli(φ),
specifies a distribution over y ∈ {0, 1}, so that
p(y = 1; φ) = φ ;  p(y = 0; φ) = 1 - φ

There is a choice of T, a and b so that equation (6)
becomes exactly the class of Bernoulli distributions

The exponential family

We write the Bernoulli distribution as
p(y; φ) = φ^y (1 - φ)^(1-y)
        = exp( y log φ + (1 - y) log(1 - φ) )
        = exp( ( log(φ/(1 - φ)) ) y + log(1 - φ) )

The natural parameter is given by η = log( φ / (1 - φ) )

If we invert this definition by solving for φ in terms of η, we obtain φ = 1 / (1 + e^(-η))
To complete the formulation of the Bernoulli distribution as an exponential family distribution, we have :
T(y) = y
a(η) = -log(1 - φ) = log(1 + e^η)
b(y) = 1
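
This identity is easy to sanity-check numerically; the snippet below (an illustrative addition, with made-up helper names) confirms that the exponential family form with these choices of T, a and b reproduces the usual Bernoulli probabilities.

import numpy as np

def bernoulli_pmf(y, phi):
    return phi**y * (1 - phi)**(1 - y)            # phi^y (1 - phi)^(1 - y)

def bernoulli_exp_family(y, phi):
    eta = np.log(phi / (1 - phi))                 # natural parameter
    a = np.log(1 + np.exp(eta))                   # log partition function
    return 1.0 * np.exp(eta * y - a)              # b(y) exp(eta * T(y) - a(eta)), with b(y) = 1

for y in (0, 1):
    assert np.isclose(bernoulli_pmf(y, 0.3), bernoulli_exp_family(y, 0.3))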

The exponential family

The Gaussian distribution: when deriving linear regression, the value of σ² had no effect on our final choice of θ and h_θ(x), so we can set σ² = 1 to simplify
p(y; μ) = (1/√(2π)) exp( -(1/2)(y - μ)² ) = (1/√(2π)) exp(-y²/2) · exp( μy - μ²/2 )

Thus, the Gaussian is in the exponential family, with
η = μ ;  T(y) = y ;  a(η) = μ²/2 = η²/2 ;  b(y) = (1/√(2π)) exp(-y²/2)

Constructing GLMs
To derive a GLM for predicting some random variable y
as a function of x in a classification or regression problem, we make three assumptions about the conditional distribution of y given x and about our model :
1. y | x; θ ~ ExponentialFamily(η), i.e. given x and θ, the distribution of y follows some exponential family distribution, with parameter η
2. Given x, our goal is to predict the expected value of T(y); in
most of our examples we will have T(y) = y, so this means we would like the prediction h(x) output by our
learned hypothesis h to satisfy h(x) = E[y | x]

Constructing GLMs
(Remember, this assumption is satisfied by the choices of h_θ(x) we had for both logistic regression and linear regression; for instance, in logistic regression
h_θ(x) = P(y = 1 | x; θ)
       = 0 · P(y = 0 | x; θ) + 1 · P(y = 1 | x; θ)
       = E[y | x; θ] )
3. The natural parameter η and the inputs x are related
linearly: η = θ^T x
(or, if η is vector-valued, η_i = θ_i^T x)

Constructing GLMs
Ordinary Least Squares
Consider the case where the target variable y (also called the response variable in GLM terminology) is continuous, and we model the conditional distribution of y given x as a Gaussian N(μ, σ²)

We let the ExponentialFamily(η) distribution above be the Gaussian distribution

We had μ = η, so we have
h_θ(x) = E[y | x; θ] = μ = η = θ^T x

Constructing GLMs
Logistic Regression
Given that y is binary-valued, it seems natural to choose the Bernoulli family of distributions, for which we had
φ = 1 / (1 + e^(-η))

Furthermore, note that if y | x; θ ~ Bernoulli(φ), then
E[y | x; θ] = φ

So, following a derivation similar to the one for ordinary least squares, we get
h_θ(x) = E[y | x; θ] = φ = 1 / (1 + e^(-η)) = 1 / (1 + e^(-θ^T x))

Constructing GLMs
Softmax Regression
For example : instead of classifying email into two classes (spam or not-spam), which is binary classification, we may want to classify it into three classes (spam, personal, and work-related mail)
The response variable y can then take on any of k values, y ∈ {1, 2, ..., k}; thus we model it as distributed according to a multinomial distribution
For notational convenience, we parameterize the multinomial with φ_1, ..., φ_(k-1), where φ_k = 1 - Σ_{i=1}^{k-1} φ_i
T(y) is not y here: now T(y) is a (k-1)-dimensional vector, rather than a real number, with (T(y))_i = 1{y = i}
We also use the indicator notation 1{·}, e.g. : 1{2 = 3} = 0,
1{3 = 5 - 2} = 1, where 0 = false, 1 = true

Constructing GLMs

We can now show that the multinomial is a member of the exponential family, by writing p(y; φ) in the form of equation (6):
p(y; φ) = b(y) exp( η^T T(y) - a(η) )

Constructing GLMs
where
η = [ log(φ_1/φ_k), log(φ_2/φ_k), ..., log(φ_(k-1)/φ_k) ]^T
a(η) = -log(φ_k)
b(y) = 1

The link function is given (for i = 1, ..., k) by
η_i = log( φ_i / φ_k )

Constructing GLMs

For convenience, we have also defined η_k = log(φ_k/φ_k) = 0. To invert the link
function and derive the response function, we therefore have that
e^(η_i) = φ_i / φ_k
φ_k e^(η_i) = φ_i                                    (7)
φ_k Σ_{i=1}^{k} e^(η_i) = Σ_{i=1}^{k} φ_i = 1

This implies φ_k = 1 / Σ_{i=1}^{k} e^(η_i), which can be substituted back into equation (7) to give the response function
φ_i = e^(η_i) / Σ_{j=1}^{k} e^(η_j)

This function mapping from the η's to the φ's is called the softmax function

Constructing GLMs
To complete our model, we use :
1. assumption 3, given earlier, that the η_i's are linearly related to the x's
2. so we have η_i = θ_i^T x (for i = 1, ..., k-1), where θ_1, ..., θ_(k-1) are the parameters of our model
3. we also define θ_k = 0, so that η_k = θ_k^T x = 0, as given previously
Hence, the conditional distribution of y given x is
p(y = i | x; θ) = φ_i = e^(θ_i^T x) / Σ_{j=1}^{k} e^(θ_j^T x)

Constructing GLMs

This model, which applies to classification problems where y ∈ {1, ..., k}, is called softmax regression
Our hypothesis will output
h_θ(x) = E[ T(y) | x; θ ] = [ φ_1, φ_2, ..., φ_(k-1) ]^T
i.e. the estimated probability p(y = i | x; θ) for each value of i
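
A compact sketch of this hypothesis (names like softmax and softmax_hypothesis are illustrative; Theta is assumed to stack one parameter vector theta_i per class as rows, with the last row fixed at zero so that eta_k = 0):

import numpy as np

def softmax(eta):
    """phi_i = exp(eta_i) / sum_j exp(eta_j)."""
    z = np.exp(eta - np.max(eta))     # subtract the max for numerical stability
    return z / np.sum(z)

def softmax_hypothesis(x, Theta):
    """Vector of estimated probabilities p(y = i | x; theta) for i = 1, ..., k."""
    return softmax(Theta @ x)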

Constructing GLMs

If we have a training set of m examples {(x^(i), y^(i)); i = 1, ..., m} and would like to learn the parameters θ_i of this model,
we would begin by writing down the log-likelihood
ℓ(θ) = Σ_{i=1}^{m} log p(y^(i) | x^(i); θ) = Σ_{i=1}^{m} log ∏_{l=1}^{k} ( e^(θ_l^T x^(i)) / Σ_{j=1}^{k} e^(θ_j^T x^(i)) )^(1{y^(i) = l})
We can then obtain the maximum likelihood estimate of the parameters by maximizing ℓ(θ)
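
As an illustrative sketch of evaluating that log-likelihood (not from the slides; labels are assumed to be 0-indexed here, i.e. y^(i) in {0, ..., k-1}, and Theta stacks the theta_i as rows):

import numpy as np

def softmax_log_likelihood(X, y, Theta):
    """l(Theta) = sum_i log p(y^(i) | x^(i); Theta)."""
    eta = X @ Theta.T                                   # m-by-k matrix of eta_i values
    eta = eta - np.max(eta, axis=1, keepdims=True)      # stabilize before exponentiating
    log_probs = eta - np.log(np.sum(np.exp(eta), axis=1, keepdims=True))  # log of the softmax
    return np.sum(log_probs[np.arange(len(y)), y])      # sum of each example's label log-prob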

Thank you.
