
Regression and Classification with Linear Models

CMPSCI 383
Nov 15, 2011

Today's topics
  Learning from Examples: brief review
  Univariate Linear Regression
    Batch gradient descent
    Stochastic gradient descent
  Multivariate Linear Regression
  Regularization
  Linear Classifiers
    Perceptron learning rule
  Logistic Regression

Learning from Examples (supervised learning)

Important issues
  Generalization
  Overfitting
  Cross-validation
    Holdout cross-validation
    K-fold cross-validation
    Leave-one-out cross-validation
  Model selection

Recall Notation

Training set: (x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)

where each y_j was generated by an unknown function y = f(x).

Goal: discover a function h (the hypothesis) that best approximates the true function f.

Loss Functions

Suppose the true prediction for input x is f(x) = y, but the hypothesis gives h(x) = \hat{y}.

L(x, y, \hat{y}) = Utility(result of using y given input x)
                 - Utility(result of using \hat{y} given input x)

Simplified version: L(y, \hat{y})

Absolute value loss: L_1(y, \hat{y}) = |y - \hat{y}|
Squared error loss:  L_2(y, \hat{y}) = (y - \hat{y})^2
0/1 loss:            L_{0/1}(y, \hat{y}) = 0 if y = \hat{y}, else 1

Generalization loss: expected loss over all possible examples
Empirical loss: average loss over available examples
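As a small, self-contained illustration (not part of the slides), these losses translate directly into Python; the helper names below are made up for this sketch:

```python
def l1_loss(y, y_hat):
    # Absolute value loss: |y - y_hat|
    return abs(y - y_hat)

def l2_loss(y, y_hat):
    # Squared error loss: (y - y_hat)^2
    return (y - y_hat) ** 2

def l01_loss(y, y_hat):
    # 0/1 loss: 0 if the prediction is exact, else 1
    return 0 if y == y_hat else 1

def empirical_loss(loss_fn, ys, y_hats):
    # Empirical loss: average loss over the available examples
    return sum(loss_fn(y, yh) for y, yh in zip(ys, y_hats)) / len(ys)
```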

Univariate Linear Regression

Univariate Linear Regression contd.

Weight vector: w = [w_0, w_1]

h_w(x) = w_1 x + w_0

Find the weight vector that minimizes empirical loss, e.g., L_2:

Loss(h_w) = \sum_{j=1}^{N} L_2(y_j, h_w(x_j)) = \sum_{j=1}^{N} (y_j - h_w(x_j))^2 = \sum_{j=1}^{N} (y_j - (w_1 x_j + w_0))^2

i.e., find w* such that

w^* = \operatorname{argmin}_w Loss(h_w)

Weight Space

Finding w*

Find weights such that:

\frac{\partial}{\partial w_0} \sum_{j=1}^{N} (y_j - (w_1 x_j + w_0))^2 = 0   and   \frac{\partial}{\partial w_1} \sum_{j=1}^{N} (y_j - (w_1 x_j + w_0))^2 = 0
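Setting both partial derivatives to zero and solving gives the standard closed-form solution for the univariate case (added here for reference; the slide leaves it implicit):

w_1 = \frac{N \sum_j x_j y_j - (\sum_j x_j)(\sum_j y_j)}{N \sum_j x_j^2 - (\sum_j x_j)^2}, \qquad w_0 = \frac{\sum_j y_j - w_1 \sum_j x_j}{N}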

Gradient Descent

w_i \leftarrow w_i - \alpha \frac{\partial}{\partial w_i} Loss(w)

where \alpha is the step size or learning rate.

Gradient Descent contd.

For one training example (x, y):

w_0 \leftarrow w_0 + \alpha (y - h_w(x))   and   w_1 \leftarrow w_1 + \alpha (y - h_w(x)) x

For N training examples:

w_0 \leftarrow w_0 + \alpha \sum_j (y_j - h_w(x_j))   and   w_1 \leftarrow w_1 + \alpha \sum_j (y_j - h_w(x_j)) x_j

This is batch gradient descent.

Stochastic gradient descent: take a step for one training example at a time, as in the sketch below.
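A minimal Python sketch of both update schemes, assuming a tiny synthetic dataset and a hand-picked learning rate alpha (both are illustrative choices, not values from the lecture):

```python
import random

def predict(w, x):
    # h_w(x) = w1 * x + w0
    return w[1] * x + w[0]

def batch_gd(xs, ys, alpha=0.01, epochs=1000):
    """Batch gradient descent: accumulate the error over all N examples, then step."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        err = [y - predict(w, x) for x, y in zip(xs, ys)]
        w[0] += alpha * sum(err)
        w[1] += alpha * sum(e * x for e, x in zip(err, xs))
    return w

def stochastic_gd(xs, ys, alpha=0.01, epochs=100):
    """Stochastic gradient descent: one step per training example."""
    w = [0.0, 0.0]
    data = list(zip(xs, ys))
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            e = y - predict(w, x)
            w[0] += alpha * e
            w[1] += alpha * e * x
    return w

# Toy data generated from y = 2x + 1 (an illustrative assumption)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
print(batch_gd(xs, ys))       # weights should approach [1.0, 2.0]
print(stochastic_gd(xs, ys))
```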

The Multivariate case

h_{sw}(x_j) = w_0 + w_1 x_{j,1} + \cdots + w_n x_{j,n} = w_0 + \sum_i w_i x_{j,i}

Augmented vectors: add a feature to each x by tacking on a 1: x_{j,0} = 1

Then:

h_{sw}(x_j) = w \cdot x_j = w^T x_j = \sum_i w_i x_{j,i}

And the batch gradient descent update becomes:

w_i \leftarrow w_i + \alpha \sum_j (y_j - h_w(x_j)) x_{j,i}
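A short NumPy sketch of this update; the augmented column of 1s and the learning rate are assumptions made for illustration:

```python
import numpy as np

def batch_gd_multivariate(X, y, alpha=0.01, epochs=1000):
    """Batch gradient descent for multivariate linear regression.
    X is an (N, n) data matrix; a column of 1s is prepended so that
    w[0] plays the role of the intercept (the augmented-vector trick)."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # x_{j,0} = 1
    w = np.zeros(X_aug.shape[1])
    for _ in range(epochs):
        errors = y - X_aug @ w          # y_j - h_w(x_j) for every example
        w += alpha * X_aug.T @ errors   # w_i += alpha * sum_j error_j * x_{j,i}
    return w
```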

The Multivariate case contd.

Or, solving analytically: let y be the vector of outputs for the training examples, and let X be the data matrix, in which each row is an input vector. Then

y = X w

Solving this for w^*:

w^* = (X^T X)^{-1} X^T y

where (X^T X)^{-1} X^T is the pseudoinverse of X.
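The analytic solution translates almost directly into NumPy; this sketch assumes X already carries the augmented column of 1s:

```python
import numpy as np

def fit_normal_equations(X, y):
    """Closed-form least-squares fit: w* = (X^T X)^{-1} X^T y."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y
    # In practice, np.linalg.lstsq(X, y, rcond=None)[0] is more
    # numerically stable than forming X^T X explicitly.
```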

Regularization

Cost(h) = EmpLoss(h) + \lambda Complexity(h)

Complexity(h_w) = L_q(w) = \sum_i |w_i|^q
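A sketch of the regularized cost for a linear hypothesis; lambda and q are illustrative hyperparameters, not values given on the slides:

```python
import numpy as np

def regularized_cost(w, X, y, lam=0.1, q=2):
    """EmpLoss (squared error) plus lambda times the L_q complexity penalty."""
    emp_loss = np.sum((y - X @ w) ** 2)       # empirical L2 loss
    complexity = np.sum(np.abs(w) ** q)       # L_q(w) = sum_i |w_i|^q
    return emp_loss + lam * complexity
```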

L1 vs. L2 Regularization

Linear Classification: hard thresholds

Linear Classification: hard thresholds contd.

Decision boundary:
  In the linear case: a linear separator, a hyperplane.

Linearly separable:
  Data is linearly separable if the classes can be separated by a linear separator.

Classification hypothesis:
  h_w(x) = Threshold(w \cdot x), where Threshold(z) = 1 if z \geq 0 and 0 otherwise

Perceptron Learning Rule

For a single sample (x, y):

w_i \leftarrow w_i + \alpha (y - h_w(x)) x_i

If the output is correct, i.e., y = h_w(x), then the weights don't change.
If y = 1 but h_w(x) = 0, then w_i is increased when x_i is positive and decreased when x_i is negative.
If y = 0 but h_w(x) = 1, then w_i is decreased when x_i is positive and increased when x_i is negative.

Perceptron Convergence Theorem: For any data set that's linearly separable and any training procedure that continues to present each training example, the learning rule is guaranteed to find a solution in a finite number of steps.
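A compact Python sketch of the rule; the OR dataset and the learning rate are hypothetical choices made for illustration:

```python
import numpy as np

def threshold(z):
    # Hard threshold: 1 if z >= 0, else 0
    return 1 if z >= 0 else 0

def train_perceptron(X, y, alpha=0.1, epochs=100):
    """Perceptron learning rule on an augmented data matrix X
    (first column all 1s, so w[0] acts as the bias weight)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_j, y_j in zip(X, y):
            h = threshold(w @ x_j)
            # Weights change only when the prediction is wrong.
            w += alpha * (y_j - h) * x_j
    return w

# Hypothetical toy problem: learn the OR function (linearly separable).
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])  # leading 1 = bias
y = np.array([0, 1, 1, 1])
w = train_perceptron(X, y)
print([threshold(w @ x) for x in X])   # expected: [0, 1, 1, 1]
```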

Perceptron Performance

Linear Classification with Logistic Regression

An important function: the logistic (sigmoid) function, plotted on the slide.

Logistic Regression

h_w(x) = Logistic(w \cdot x) = \frac{1}{1 + e^{-w \cdot x}}

For a single sample (x, y) and the L_2 loss function:

w_i \leftarrow w_i + \alpha (y - h_w(x)) h_w(x) (1 - h_w(x)) x_i

The factor h_w(x)(1 - h_w(x)) is the derivative of the logistic function.
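The corresponding update in Python, again on an augmented data matrix; alpha and the epoch count are illustrative:

```python
import numpy as np

def logistic(z):
    # Logistic(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, alpha=0.5, epochs=1000):
    """Gradient descent on the L2 loss with a logistic hypothesis.
    X is an augmented data matrix (leading column of 1s)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_j, y_j in zip(X, y):
            h = logistic(w @ x_j)
            # The update includes h*(1-h), the derivative of the logistic function.
            w += alpha * (y_j - h) * h * (1 - h) * x_j
    return w
```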

Logistic Regression Performance

(separable case)

Summary
  Learning from Examples: brief review
    Loss functions
    Generalization
    Overfitting
    Cross-validation
    Regularization
  Univariate Linear Regression
    Batch gradient descent
    Stochastic gradient descent
  Multivariate Linear Regression
  Regularization
  Linear Classifiers
    Perceptron learning rule
  Logistic Regression

Next Class

Artificial Neural Networks, Nonparametric Models, & Support Vector Machines
Secs. 18.7–18.9
