
Supervised Learning

By : Shedriko

Outline & Content


Preliminary
Linear Regression
Classification and Logistic Regression
Generalized Linear Models

Preliminary
Supervised learning
the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples; each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal)

Preliminary

x^(i) = input, y^(i) = output; the superscript (i) is an index into the training set (not exponentiation)

(x^(i), y^(i))
a training example
{(x^(i), y^(i)); i = 1, 2, ..., m}
the training set

Preliminary
X = space of input values, Y = space of output values
Regression :
the target variable is continuous, such as in the housing price prediction problem

Classification :
the target variable takes on only a small number of discrete values, such
as whether a dwelling is a house or an apartment

A. Linear Regression

Outline & Content


Preliminary
Problem
LMS (least mean square) algorithm
Normal equations
Probabilistic interpretation
LWR (locally weighted linear regression)

Preliminary

θ : the weights (parameters), parameterizing the space of linear functions mapping from X to Y
Hypothesis h_θ(x) ≈ y :
h_θ(x) = θ_0 + θ_1 x_1 + ... + θ_n x_n = θ^T x   (with x_0 = 1)

Problem
How do we learn the parameters θ?
Make h(x) close to y, i.e. make h_θ(x^(i)) close to y^(i)
For each value of θ, measure this with the cost function
J(θ) = (1/2) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))²

LMS algorithm

Minimize J(θ) using the gradient descent algorithm
θ_j := θ_j - α (∂/∂θ_j) J(θ)
α : learning rate
θ_j : starts from some initial guess and is repeatedly updated
Consider first the case of only one training example

LMS algorithm

For a single training example, the update rule is
θ_j := θ_j + α (y^(i) - h_θ(x^(i))) x_j^(i)

Repeating the update for i = 1 to m (batch gradient descent), it gives
θ_j := θ_j + α Σ_{i=1}^{m} (y^(i) - h_θ(x^(i))) x_j^(i)
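
As an illustration (not part of the original slides), here is a minimal NumPy sketch of the batch update above; the function and parameter names (batch_gradient_descent, alpha, num_iters) are made up for this example, and X is assumed to already include a column of ones for the intercept term.

import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Minimize J(theta) = 1/2 * sum((X @ theta - y)**2) by batch gradient descent."""
    theta = np.zeros(X.shape[1])                # initial guess
    for _ in range(num_iters):
        errors = y - X @ theta                  # y^(i) - h_theta(x^(i)) for every example
        theta = theta + alpha * (X.T @ errors)  # theta_j += alpha * sum_i errors_i * x_j^(i)
    return theta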

LMS algorithm
Example (housing data), with the parameters initialized at (48, 30)

Batch gradient descent gives
θ_0 = 71.27
θ_1 = 0.1345
Or, fitting with an additional input feature,
θ_0 = 89.60
θ_1 = 0.1392
θ_2 = -8.738

The normal equations

Matrix derivatives

For a function f : R^(m×n) → R mapping from m-by-n matrices to real numbers, we define the derivative of f with respect to A to be:
∇_A f(A) = [ ∂f/∂A_11  ⋯  ∂f/∂A_1n ; ⋮ ; ∂f/∂A_m1  ⋯  ∂f/∂A_mn ]

Thus, the gradient ∇_A f(A) is itself an m-by-n matrix, whose (i,j) element is ∂f/∂A_ij

The normal equations

Example :

Suppose A = [ A_11  A_12 ; A_21  A_22 ] is a 2-by-2 matrix, and the function f : R^(2×2) → R is given by
f(A) = (3/2) A_11 + 5 A_12² + A_21 A_22

Here, A_ij denotes the (i,j) entry of the matrix A,

we then have
∇_A f(A) = [ 3/2   10 A_12 ; A_22   A_21 ]

The normal equations

Trace

The trace (written tr), for an n-by-n (square) matrix A, is defined to be the sum of its diagonal entries :
trA = Σ_{i=1}^{n} A_ii

If a is a real number (i.e. a 1-by-1 matrix), then tr a = a

A property of the trace for two matrices (such that AB is square):
trAB = trBA

We also have, e.g.
trABC = trCAB = trBCA
trA = trA^T ,  tr(A + B) = trA + trB ,  tr aA = a trA

The normal equations

We then have :
∇_A trAB = B^T                              (1)
∇_{A^T} f(A) = (∇_A f(A))^T                 (2)
∇_A trABA^T C = CAB + C^T AB^T              (3)

And
∇_A |A| = |A| (A^(-1))^T                    (4)

Equation (4) applies only to non-singular square matrices A, where |A| denotes the determinant of A

The normal equations

Least squares revisited

Define the design matrix X to be the m-by-n matrix (m-by-(n+1) if we include the intercept term) that contains the training examples' input values in its rows:
X = [ (x^(1))^T ; (x^(2))^T ; ... ; (x^(m))^T ]

Also, let y⃗ be the m-dimensional vector containing all the target values from the training set:
y⃗ = [ y^(1) ; y^(2) ; ... ; y^(m) ]

The normal equations

Since h_θ(x^(i)) = (x^(i))^T θ, we can verify that
Xθ - y⃗ = [ h_θ(x^(1)) - y^(1) ; ... ; h_θ(x^(m)) - y^(m) ]

For a vector z, we have that z^T z = Σ_i z_i² , thus
(1/2)(Xθ - y⃗)^T (Xθ - y⃗) = (1/2) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))² = J(θ)

The normal equations

To minimize J, we take its derivatives with respect to θ. Combining equations (2) and (3), we find that
∇_{A^T} trABA^T C = B^T A^T C^T + BA^T C        (5)

Hence
∇_θ J(θ) = ∇_θ (1/2)(Xθ - y⃗)^T (Xθ - y⃗) = X^T Xθ - X^T y⃗

The normal equations

To minimize J, we set its derivatives to zero, and obtain the
normal equations
X^T Xθ = X^T y⃗

Thus, the value of θ that minimizes J(θ) is given in closed form by
θ = (X^T X)^(-1) X^T y⃗
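
As a quick sketch (an addition to these notes, not from the slides), the closed-form solution can be computed with NumPy; np.linalg.solve is used instead of an explicit matrix inverse for numerical stability, and the helper name normal_equation is illustrative.

import numpy as np

def normal_equation(X, y):
    """Least-squares fit via the normal equations: solve X^T X theta = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Usage sketch: theta = normal_equation(X, y); predictions = X @ theta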

Probabilistic Interpretation
Assume : y^(i) = θ^T x^(i) + ε^(i) ,
where ε^(i) is an error term
Further assume the ε^(i) are independently and identically
distributed (IID) according to a Gaussian distribution
(normal distribution), i.e.
ε^(i) ~ N(0, σ²)
The density of ε^(i) is given by
p(ε^(i)) = (1/(√(2π) σ)) exp( -(ε^(i))² / (2σ²) )
This implies that
p(y^(i) | x^(i); θ) = (1/(√(2π) σ)) exp( -(y^(i) - θ^T x^(i))² / (2σ²) )

The notation p(y^(i) | x^(i); θ) indicates that this is the
distribution of y^(i) given x^(i) and parameterized by θ

Probabilistic Interpretation
Given X (the design matrix, which contains all of the x^(i)), the probability of the data is given by p(y⃗ | X; θ)
Viewed as a function of θ, this is the likelihood function
L(θ) = L(θ; X, y⃗) = p(y⃗ | X; θ)
By the independence assumption on the ε^(i), this can also be written
L(θ) = ∏_{i=1}^{m} p(y^(i) | x^(i); θ) = ∏_{i=1}^{m} (1/(√(2π) σ)) exp( -(y^(i) - θ^T x^(i))² / (2σ²) )

The principle of maximum likelihood says we should choose θ so as to make the data as high probability as possible, i.e. we maximize L(θ)

Probabilistic Interpretation
The derivations will be a bit simpler if we instead
maximize the log likelihood ℓ(θ) :
ℓ(θ) = log L(θ) = m log(1/(√(2π) σ)) - (1/σ²) · (1/2) Σ_{i=1}^{m} (y^(i) - θ^T x^(i))²

Hence, maximizing ℓ(θ) is the same as minimizing
(1/2) Σ_{i=1}^{m} (y^(i) - θ^T x^(i))² , which we recognize as J(θ), the original least-squares cost function
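
To make this equivalence concrete, the following small NumPy check (an illustrative addition; the synthetic data, noise level and helper names are assumptions of this example) verifies that the least-squares solution attains at least as high a Gaussian log likelihood as another candidate θ.

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])   # design matrix with intercept column
theta_true = np.array([2.0, -1.0])
sigma = 0.5
y = X @ theta_true + rng.normal(scale=sigma, size=200)      # y = theta^T x + epsilon

def log_likelihood(theta):
    resid = y - X @ theta
    return -0.5 * len(y) * np.log(2 * np.pi * sigma**2) - np.sum(resid**2) / (2 * sigma**2)

theta_ls = np.linalg.solve(X.T @ X, X.T @ y)                    # least-squares fit
assert log_likelihood(theta_ls) >= log_likelihood(theta_true)   # maximizing l(theta) == minimizing J(theta)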

LWR algorithm
In the original linear regression algorithm, to make
a prediction at a query point x (i.e. to evaluate h(x)),
we would :
1. fit θ to minimize Σ_i (y^(i) - θ^T x^(i))²
2. output θ^T x
In contrast, the locally weighted linear regression
algorithm does the following :
1. fit θ to minimize Σ_i w^(i) (y^(i) - θ^T x^(i))²
2. output θ^T x

LWR algorithm

The w^(i) are non-negative valued weights; a fairly standard choice is
w^(i) = exp( -(x^(i) - x)² / (2τ²) )

If |x^(i) - x| is small, then w^(i) is close to 1 (one)
If |x^(i) - x| is large, then w^(i) is small
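
A minimal sketch of this procedure (names such as lwr_predict and tau are illustrative, not from the slides), assuming X is the design matrix with an intercept column of ones and x_query is a single query row in the same format:

import numpy as np

def lwr_predict(X, y, x_query, tau=0.8):
    """Locally weighted linear regression prediction at a single query point."""
    diffs = X - x_query                                   # x^(i) - x for every example
    w = np.exp(-np.sum(diffs**2, axis=1) / (2 * tau**2))  # non-negative weights
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)     # fit theta to the weighted problem
    return x_query @ theta                                # output theta^T x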

B. Classification and Logistic Regression

Outline & Content


Preliminary
Logistic regression
Digression : The perceptron learning algorithm
Another algorithm for maximizing ℓ(θ)

Preliminary
Focus on the binary classification problem, in which y can take on only two values, 0 (zero) and 1 (one)
For example : to build a spam classifier for email,
x^(i) may be some features of a piece of email, and
y may be 1 if it is spam and 0 otherwise
0 (zero) : negative class (-)
1 (one) : positive class (+)
Given x^(i), the corresponding y^(i) is called the label for the training example

Logistic regression
To keep h_θ(x) from taking values larger than 1 or smaller than 0, we will choose
h_θ(x) = g(θ^T x) = 1 / (1 + e^(-θ^T x))
where
g(z) = 1 / (1 + e^(-z))
is called the logistic function or sigmoid function

g(z) tends towards 1 as z → ∞
g(z) tends towards 0 as z → -∞

Logistic regression
A useful property of the derivative of the sigmoid function:
g′(z) = g(z)(1 - g(z))

Assume that
P(y = 1 | x; θ) = h_θ(x)
P(y = 0 | x; θ) = 1 - h_θ(x)
Note : this can be written more compactly as
p(y | x; θ) = (h_θ(x))^y (1 - h_θ(x))^(1-y)
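
The derivative identity can be checked numerically; this short snippet (an illustrative addition, not from the slides) compares a central finite difference of g with g(z)(1 - g(z)) at a few points.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in (-2.0, 0.0, 1.5):
    numeric = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6   # central difference approximation
    analytic = sigmoid(z) * (1 - sigmoid(z))                   # g'(z) = g(z)(1 - g(z))
    assert np.isclose(numeric, analytic, atol=1e-6)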

Logistic regression
Assuming that the m training examples were generated independently, we can write down the likelihood of the parameters as
L(θ) = ∏_{i=1}^{m} (h_θ(x^(i)))^(y^(i)) (1 - h_θ(x^(i)))^(1-y^(i))

As before, it is easier to maximize the log likelihood
ℓ(θ) = log L(θ) = Σ_{i=1}^{m} [ y^(i) log h_θ(x^(i)) + (1 - y^(i)) log(1 - h_θ(x^(i))) ]

Logistic regression

To maximize ℓ(θ), we use gradient ascent (similar to the derivation in linear regression):
θ := θ + α ∇_θ ℓ(θ)

By using g′(z) = g(z)(1 - g(z)), working with one training example gives
∂ℓ(θ)/∂θ_j = (y - h_θ(x)) x_j
and hence the stochastic gradient ascent rule
θ_j := θ_j + α (y^(i) - h_θ(x^(i))) x_j^(i)
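
As a sketch of the full-batch version of this ascent rule (function and parameter names such as logistic_gradient_ascent and alpha are assumptions of this example; X is assumed to include an intercept column):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_ascent(X, y, alpha=0.1, num_iters=1000):
    """Maximize the log likelihood l(theta) by batch gradient ascent."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        errors = y - sigmoid(X @ theta)         # y^(i) - h_theta(x^(i))
        theta = theta + alpha * (X.T @ errors)  # theta := theta + alpha * grad l(theta)
    return theta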

Digression: The perceptron algorithm

We digress by modifying the logistic regression method to force it to output values that are exactly either 0 or 1
To do so, it seems natural to change the definition of g to be the threshold function:
g(z) = 1 if z ≥ 0 ;  g(z) = 0 if z < 0

If we then let h_θ(x) = g(θ^T x) as before, but with this modified definition of g, and use the update rule
θ_j := θ_j + α (y^(i) - h_θ(x^(i))) x_j^(i)
then we have the perceptron learning algorithm
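
A small sketch of that algorithm (illustrative names, not from the slides), which differs from the logistic regression code above only in the choice of g:

import numpy as np

def perceptron_train(X, y, alpha=1.0, num_epochs=10):
    """Perceptron learning: same update rule, but g is the threshold function."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_epochs):
        for x_i, y_i in zip(X, y):                  # one training example at a time
            h = 1.0 if x_i @ theta >= 0 else 0.0    # g(theta^T x) with the threshold g
            theta = theta + alpha * (y_i - h) * x_i
    return theta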

Another algorithm for maximizing ℓ(θ)

Suppose f : R → R, and we wish to find a value of θ so that f(θ) = 0
Here θ is a real number; Newton's method performs the update
θ := θ - f(θ) / f′(θ)

Another algorithm for maximizing ℓ(θ)

The leftmost figure : the function f together with the line y = 0;
the value of θ for which f(θ) = 0 is about 1.3
The middle figure : one step of Newton's method (following the tangent line down to y = 0) gives θ ≈ 2.8
The rightmost figure : the next iteration gives θ ≈ 1.8
After a few more iterations, we rapidly approach θ = 1.3
What if we want to use Newton's method to maximize some function ℓ?
The maxima of ℓ correspond to points where its first derivative ℓ′(θ) is zero

Another algorithm for maximizing ℓ(θ)

So, by letting f(θ) = ℓ′(θ), to maximize ℓ we obtain the update
θ := θ - ℓ′(θ) / ℓ″(θ)

In our setting θ is vector-valued; the Newton-Raphson method (the generalization of
Newton's method to this multidimensional setting)
is given by
θ := θ - H^(-1) ∇_θ ℓ(θ)
∇_θ ℓ(θ) is the vector of partial derivatives of ℓ(θ) with respect to the θ_i's
H is an n-by-n matrix called the Hessian, whose entries are given by
H_ij = ∂²ℓ(θ) / (∂θ_i ∂θ_j)
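
For logistic regression, the gradient and Hessian of ℓ(θ) have the closed forms ∇ℓ = X^T (y - h) and H = -X^T diag(h(1-h)) X; the sketch below (an illustrative addition, assuming NumPy and an intercept column in X) applies the Newton-Raphson update with them.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, num_iters=10):
    """Maximize the logistic regression log likelihood with Newton-Raphson."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)                       # gradient of l(theta)
        H = -(X.T * (h * (1 - h))) @ X             # Hessian of l(theta)
        theta = theta - np.linalg.solve(H, grad)   # theta := theta - H^{-1} grad
    return theta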

Another algorithm for maximizing ℓ(θ)

When Newton's method is applied to maximize the logistic regression log likelihood function ℓ(θ), the resulting method is also called Fisher scoring

C. Generalized Linear Models

Outline & Content


Preliminary
The exponential family
Constructing GLMs

Preliminary

In the regression example, we had y | x; θ ~ N(μ, σ²)

In the classification one, we had y | x; θ ~ Bernoulli(φ)

In this section, we will see that both of these methods are special cases of a broader family of models, called Generalized Linear Models (GLMs)

Other models in the GLM family can be derived and applied to other classification and regression problems

The exponential family

A class of distributions is in the exponential family if
it can be written in the form
p(y; η) = b(y) exp( η^T T(y) - a(η) )          (6)

η : the natural parameter (also called the canonical parameter)
T(y) : the sufficient statistic (for the distributions we consider, often T(y) = y)
a(η) : the log partition function
e^(-a(η)) essentially plays the role of a normalization constant, making sure p(y; η) sums/integrates over y to 1 (one)
A fixed choice of T, a and b defines a family (or set)
of distributions that is parameterized by η; as we vary η, we then
get different distributions within this family

The exponential family

The Bernoulli distribution with mean φ, written Bernoulli(φ),
specifies a distribution over y ∈ {0, 1}, so that
p(y = 1; φ) = φ ;  p(y = 0; φ) = 1 - φ

There is a choice of T, a and b so that equation (6)
becomes exactly the class of Bernoulli distributions

The exponential family

We write the Bernoulli distribution as
p(y; φ) = φ^y (1 - φ)^(1-y)
        = exp( y log φ + (1 - y) log(1 - φ) )
        = exp( ( log(φ/(1 - φ)) ) y + log(1 - φ) )

The natural parameter is given by η = log( φ / (1 - φ) )

If we invert this definition by solving for φ in terms of η, we obtain φ = 1 / (1 + e^(-η))
To complete the formulation of the Bernoulli distribution as an exponential family distribution, we have :
T(y) = y
a(η) = -log(1 - φ) = log(1 + e^η)
b(y) = 1
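
This identity is easy to sanity-check numerically; the snippet below (an illustrative addition, with made-up helper names) confirms that the exponential family form with these choices of T, a and b reproduces the usual Bernoulli probabilities.

import numpy as np

def bernoulli_pmf(y, phi):
    return phi**y * (1 - phi)**(1 - y)            # phi^y (1 - phi)^(1 - y)

def bernoulli_exp_family(y, phi):
    eta = np.log(phi / (1 - phi))                 # natural parameter
    a = np.log(1 + np.exp(eta))                   # log partition function
    return 1.0 * np.exp(eta * y - a)              # b(y) exp(eta * T(y) - a(eta)), with b(y) = 1

for y in (0, 1):
    assert np.isclose(bernoulli_pmf(y, 0.3), bernoulli_exp_family(y, 0.3))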

The exponential family

The Gaussian distribution: when deriving linear regression, the value of σ² had no effect on our final choice of θ and h_θ(x), so we can set σ² = 1 to simplify
p(y; μ) = (1/√(2π)) exp( -(1/2)(y - μ)² ) = (1/√(2π)) exp(-y²/2) · exp( μy - μ²/2 )

Thus, the Gaussian is in the exponential family, with
η = μ ;  T(y) = y ;  a(η) = μ²/2 = η²/2 ;  b(y) = (1/√(2π)) exp(-y²/2)

Constructing GLMs
To derive a GLM for predicting some random variable y
as a function of x in a classification or regression problem, we make three assumptions about the conditional distribution of y given x and about our model :
1. y | x; θ ~ ExponentialFamily(η), i.e. given x and θ, the distribution of y follows some exponential family distribution, with parameter η
2. Given x, our goal is to predict the expected value of T(y); in
most of our examples we will have T(y) = y, so this means we would like the prediction h(x) output by our
learned hypothesis h to satisfy h(x) = E[y | x]

Constructing GLMs
(Remember, this assumption is satisfied by the choices of h_θ(x) we had for both logistic regression and linear regression; for instance, in logistic regression
h_θ(x) = P(y = 1 | x; θ)
       = 0 · P(y = 0 | x; θ) + 1 · P(y = 1 | x; θ)
       = E[y | x; θ] )
3. The natural parameter η and the inputs x are related
linearly: η = θ^T x
(or, if η is vector-valued, η_i = θ_i^T x)

Constructing GLMs
Ordinary Least Squares
Consider the case where the target variable y (also called the response variable in GLM terminology) is continuous, and we model the conditional distribution of y given x as a Gaussian N(μ, σ²)

We let the ExponentialFamily(η) distribution above be the Gaussian distribution

We had μ = η, so we have
h_θ(x) = E[y | x; θ] = μ = η = θ^T x

Constructing GLMs
Logistic Regression
Given that y is binary-valued, it seems natural to choose the Bernoulli family of distributions, for which we had
φ = 1 / (1 + e^(-η))

Furthermore, note that if y | x; θ ~ Bernoulli(φ), then
E[y | x; θ] = φ

So, following a derivation similar to the one for ordinary least squares, we get
h_θ(x) = E[y | x; θ] = φ = 1 / (1 + e^(-η)) = 1 / (1 + e^(-θ^T x))

Constructing GLMs
Softmax Regression
For example : instead of classifying email into two classes (spam or not-spam), which is binary classification, we may want to classify it into three classes (spam, personal, and work-related mail)
The response variable y can then take on any of k values, y ∈ {1, 2, ..., k}; thus we model it as distributed according to a multinomial distribution
For notational convenience, we parameterize the multinomial with φ_1, ..., φ_(k-1), where φ_k = 1 - Σ_{i=1}^{k-1} φ_i
T(y) is not y here: now T(y) is a (k-1)-dimensional vector, rather than a real number, with (T(y))_i = 1{y = i}
We also use the indicator notation 1{·}, e.g. : 1{2 = 3} = 0,
1{3 = 5 - 2} = 1, where 0 = false, 1 = true

Constructing GLMs

We can now show that the multinomial is a member of the exponential family, by writing p(y; φ) in the form of equation (6):
p(y; φ) = b(y) exp( η^T T(y) - a(η) )

Constructing GLMs
where
η = [ log(φ_1/φ_k), log(φ_2/φ_k), ..., log(φ_(k-1)/φ_k) ]^T
a(η) = -log(φ_k)
b(y) = 1

The link function is given (for i = 1, ..., k) by
η_i = log( φ_i / φ_k )

Constructing GLMs

For convenience, we have also defined η_k = log(φ_k/φ_k) = 0. To invert the link
function and derive the response function, we therefore have that
e^(η_i) = φ_i / φ_k
φ_k e^(η_i) = φ_i                                    (7)
φ_k Σ_{i=1}^{k} e^(η_i) = Σ_{i=1}^{k} φ_i = 1

This implies φ_k = 1 / Σ_{i=1}^{k} e^(η_i), which can be substituted back into equation (7) to give the response function
φ_i = e^(η_i) / Σ_{j=1}^{k} e^(η_j)

This function mapping from the η's to the φ's is called the softmax function

Constructing GLMs
To complete our model, we use :
1. assumption 3, given earlier, that the η_i's are linearly related to the x's
2. so we have η_i = θ_i^T x (for i = 1, ..., k-1), where θ_1, ..., θ_(k-1) are the parameters of our model
3. we also define θ_k = 0, so that η_k = θ_k^T x = 0, as given previously
Hence, the conditional distribution of y given x is
p(y = i | x; θ) = φ_i = e^(θ_i^T x) / Σ_{j=1}^{k} e^(θ_j^T x)

Constructing GLMs

This model, which applies to classification problems where y ∈ {1, ..., k}, is called softmax regression
Our hypothesis will output
h_θ(x) = E[ T(y) | x; θ ] = [ φ_1, φ_2, ..., φ_(k-1) ]^T
i.e. the estimated probability p(y = i | x; θ) for each value of i
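
A compact sketch of this hypothesis (names like softmax and softmax_hypothesis are illustrative; Theta is assumed to stack one parameter vector theta_i per class as rows, with the last row fixed at zero so that eta_k = 0):

import numpy as np

def softmax(eta):
    """phi_i = exp(eta_i) / sum_j exp(eta_j)."""
    z = np.exp(eta - np.max(eta))     # subtract the max for numerical stability
    return z / np.sum(z)

def softmax_hypothesis(x, Theta):
    """Vector of estimated probabilities p(y = i | x; theta) for i = 1, ..., k."""
    return softmax(Theta @ x)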

Constructing GLMs

If we have a training set of m examples {(x^(i), y^(i)); i = 1, ..., m} and would like to learn the parameters θ_i of this model,
we would begin by writing down the log-likelihood
ℓ(θ) = Σ_{i=1}^{m} log p(y^(i) | x^(i); θ) = Σ_{i=1}^{m} log ∏_{l=1}^{k} ( e^(θ_l^T x^(i)) / Σ_{j=1}^{k} e^(θ_j^T x^(i)) )^(1{y^(i) = l})
We can then obtain the maximum likelihood estimate of the parameters by maximizing ℓ(θ)
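
As an illustrative sketch of evaluating that log-likelihood (not from the slides; labels are assumed to be 0-indexed here, i.e. y^(i) in {0, ..., k-1}, and Theta stacks the theta_i as rows):

import numpy as np

def softmax_log_likelihood(X, y, Theta):
    """l(Theta) = sum_i log p(y^(i) | x^(i); Theta)."""
    eta = X @ Theta.T                                   # m-by-k matrix of eta_i values
    eta = eta - np.max(eta, axis=1, keepdims=True)      # stabilize before exponentiating
    log_probs = eta - np.log(np.sum(np.exp(eta), axis=1, keepdims=True))  # log of the softmax
    return np.sum(log_probs[np.arange(len(y)), y])      # sum of each example's label log-prob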

Thank you.
