
Regression Methods

Gradient descent and regularization
Regression Methods – Main Topics
 Regression problem: target variable is continuous
 Linear regression with one input variable
 Gradient descent
 Multivariate linear regression
 Feature scaling
 Polynomial features
 Regularization to handle overfitting
Linear regression with one input variable (simple regression)
We want to predict the house price (in $10k) as a linear function of its living area (in 1k sq. ft.) in King County:
• Here we have 100 data points as plotted in the figure. We want to learn a linear function from this dataset, mapping from “area” to “price”  hypothesis representation?
• So, what is the optimal linear function (a line drawn in the figure) that best fits the given data?  loss function definition
• How do we find the optimal hypothesis?  gradient descent or normal equation
Example: House price as a linear function of house living area

Living area (1k sq. ft.)   House price ($10k)
1.18                       22.19
2.57                       53.8
0.77                       18
1.96                       60.4
1.68                       51
5.42                       122.5
1.715                      25.75
1.06                       29.185
1.78                       22.95
Which line best fits the training data?
How should we define the “best” line?
Hypothesis representation: linear function with one input variable
 Here we have one input variable x, and one output variable y (which is also called the “target” variable)
 The function we try to learn is a linear function h: X → Y (here X, Y are the sets of all possible values of the input/output variable respectively)
 h_θ(x) = θ_0 + θ_1·x, where θ = (θ_0, θ_1)^T is a column vector of parameters to be learned, and the function h_θ is called a hypothesis (parameterized by θ)
 If we “expand” the input data with a “dummy” zero-th feature x_0 = 1 and rename the original input variable as feature x_1, then we can re-write the input as x = (x_0, x_1)^T. This way, the hypothesis can be represented as h_θ(x) = θ^T·x  we try to learn θ, such that h_θ(x) should be “close” to the target value y for each (x, y) pair in the training data. θ^T = (θ_0, θ_1) is the transpose of θ.
Linear regression with one input variable (simple regression)
Problem definition:
Given training data D = {(x^(i), y^(i)): i = 1, …, m}
 Assume the target variable y is a linear function of the input variable of the form y ≈ θ_0 + θ_1·x_1 (for some parameters θ_0, θ_1)
 Assume each input vector x^(i) = (x_0^(i), x_1^(i))^T with x_0^(i) = 1 for all i
 Find: the value for the parameter vector θ = (θ_0, θ_1)^T (equivalently hypothesis h_θ) that optimizes/minimizes the loss/cost function:
J(θ) = (1/2m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i))^2
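As a concrete check of this definition, the loss can be computed directly; a minimal Python sketch (not from the slides), reusing a few (area, price) points from the earlier table:

```python
import numpy as np

def loss(theta, X, y):
    """Loss J(theta) = (1/2m) * sum_i (h_theta(x^(i)) - y^(i))^2.

    X is an (m, 2) matrix whose first column is the dummy feature
    x_0 = 1; theta is the length-2 parameter vector; y holds targets.
    """
    m = len(y)
    residuals = X @ theta - y          # h_theta(x^(i)) - y^(i) for all i
    return (residuals @ residuals) / (2 * m)

# Three (area, price) points from the table, with the dummy x_0 column.
X = np.array([[1.0, 1.18], [1.0, 2.57], [1.0, 0.77]])
y = np.array([22.19, 53.8, 18.0])
J = loss(np.array([0.0, 20.0]), X, y)  # loss for the line price = 20*area
```

A better-fitting θ gives a smaller J; the all-zero hypothesis scores far worse than the line price = 20·area on these points.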
Hypothesis, loss function and optimization
 There is a one-to-one correspondence between the parameter vector θ and the hypothesis h_θ
 The loss/cost function J(θ) is a function of the parameters θ. As we saw earlier, different lines for the housing data correspond to different θ values, and thus the corresponding loss values differ.
 The goal of linear regression is to find the optimal hypothesis (the best θ) that minimizes J(θ).
Linear Regression is an Optimization Problem
 Optimization problem: select the best solution among all feasible solutions
 In simple Linear Regression, there are infinitely many hypotheses h_θ, i.e., infinitely many feasible values for the parameter vector θ = (θ_0, θ_1)^T. LR wants to find the optimal θ* minimizing the objective function J(θ)  same as finding the best hypothesis h_θ*
The loss function for linear regression is convex
 The loss function for linear regression is convex:
J(θ) = (1/2m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i))^2
     = (1/2m) Σ_{i=1..m} (θ_0 + θ_1·x_1^(i) − y^(i))^2
 A convex function has a unique minimal point  there is a unique vector θ* (the global minimum) such that J(θ*) ≤ J(θ) for all possible values of θ
 This means the surface plot for J(θ) is bowl-shaped with one lowest point  makes the gradient descent search for the best θ easy.
Example: predicting house price from house area in square feet

Here the best hypothesis is shown by the blue line.
Solving LR by gradient descent

Main idea behind gradient descent for optimization:
 Loss J is a function of the parameters θ = (θ_0, θ_1)^T
 Update the parameters in the direction that reduces J the most – the steepest descent
 The direction is given by the negative (minus) of the gradient: the negative of the partial derivatives of J with respect to θ_0, θ_1:
∇J(θ) = (∂J/∂θ_0, ∂J/∂θ_1)^T
(The superscript “T” above denotes “transpose”)
The (column) vector ∇J(θ) is called the gradient of J with respect to θ (that is why the name gradient descent)
Solving LR by gradient descent

 Gradient of the loss function:
∇J(θ) = (∂J/∂θ_0, ∂J/∂θ_1)^T (this is the gradient of J)
∂J/∂θ_0 = (1/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) (EQ. 1)
∂J/∂θ_1 = (1/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i))·x_1^(i) (EQ. 2)
 Note that we can actually merge (EQ. 1) and (EQ. 2) into one formula (as x_0^(i) = 1 for all i):
∂J/∂θ_j = (1/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i))·x_j^(i) for j = 0, 1
 Define Δθ_j = −α·∂J/∂θ_j, the update to θ_j; we have
Δθ_j = −α·(1/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i))·x_j^(i) for j = 0, 1
Gradient descent algorithm for LR

 Initialize θ_0 and θ_1 to be 0, choose a learning rate α
 Then loop till convergence/termination
 Compute Δθ_j for j = 0, 1
 Update θ_j := θ_j + Δθ_j for j = 0, 1
Note: θ_0, θ_1 must be updated AFTER both Δθ_0, Δθ_1 have been computed – otherwise the second Δθ_j may use an altered θ in its calculation!
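The loop above can be sketched in Python; a minimal illustration (the toy data, α = 0.1, and the fixed iteration count are assumptions, not from the slides):

```python
import numpy as np

def gradient_descent_1var(x, y, alpha=0.1, iters=5000):
    """Simple-regression gradient descent: h(x) = theta0 + theta1 * x.

    Both parameters are updated simultaneously: the deltas are computed
    from the *current* (theta0, theta1) before either one is changed.
    """
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        errors = theta0 + theta1 * x - y          # h(x^(i)) - y^(i)
        delta0 = -alpha * errors.mean()            # uses x_0 = 1
        delta1 = -alpha * (errors * x).mean()
        theta0 += delta0                           # apply AFTER both deltas
        theta1 += delta1
    return theta0, theta1

# A few (area, price) points from the earlier table.
x = np.array([1.18, 2.57, 0.77, 1.96, 1.68])
y = np.array([22.19, 53.8, 18.0, 60.4, 51.0])
theta0, theta1 = gradient_descent_1var(x, y)
```

Computing both deltas before applying either one is exactly the simultaneous-update caveat from the note above.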
Solving LR by gradient descent - in vector format

 Initialize θ = (0, 0)^T, choose a learning rate α
 Then loop till convergence/termination
 Compute ∇J(θ) = (∂J/∂θ_0, ∂J/∂θ_1)^T
 Here ∂J/∂θ_0 and ∂J/∂θ_1 are computed using EQ. 1 and EQ. 2
 Update θ: θ := θ − α·∇J(θ)
Note that this gradient descent algorithm can be easily generalized to do gradient descent for multi-variate linear regression
Intuition about gradient descent
Here we have two parameters θ_0, θ_1. The gradient descent direction at point (2, 2) is indicated by the arrow on the “floor” of the surface plot of J(θ), which would lead to a reduction of the loss J. Note that the gradient at (2, 2) has both components positive (namely, increasing θ_0 or θ_1 would increase the loss). It is clear that the descent direction −∇J(θ) has both components negative.
Intuition about gradient descent

Plot of loss/cost J(θ_1) with only one parameter θ_1.
At θ_1 > θ_1*, the derivative dJ/dθ_1 > 0  increasing θ_1 will increase J
 So we need to decrease θ_1
The update of θ_1: Δθ_1 = −α·dJ/dθ_1 < 0
On the other hand, at θ_1 < θ_1* we have dJ/dθ_1 < 0
 We need to increase θ_1 to reduce J
Therefore Δθ_1 = −α·dJ/dθ_1 > 0
At the minimum θ_1 = θ_1*, dJ/dθ_1 = 0
Learning rate choices

 The learning rate α defines the step size in the gradient descent
 If α is too small, convergence would be rather slow
 If α is too big, gradient descent may not converge
 How do we choose the “right” value for α? Typically by empirical method through grid search:
 Choose a list of candidate values for α (for example: 0.001, 0.005, 0.01, 0.05, 0.1, 0.2)
 Run gradient descent for each candidate value in the list for n iterations (say n = 20 or 50), and plot the loss curve as a function of the number of iterations
 Select the best α value that gives the “right” curve
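The grid search above can be sketched as follows; a minimal illustration (the toy data and the 20-iteration budget are assumptions, not from the slides):

```python
import numpy as np

def loss_curve(x, y, alpha, iters=20):
    """Run gradient descent for `iters` steps; record J after each step."""
    m = len(y)
    theta0 = theta1 = 0.0
    curve = []
    for _ in range(iters):
        errors = theta0 + theta1 * x - y
        # Simultaneous update via tuple assignment.
        theta0, theta1 = (theta0 - alpha * errors.mean(),
                          theta1 - alpha * (errors * x).mean())
        curve.append(((theta0 + theta1 * x - y) ** 2).sum() / (2 * m))
    return curve

x = np.array([1.18, 2.57, 0.77, 1.96, 1.68])
y = np.array([22.19, 53.8, 18.0, 60.4, 51.0])
# One loss curve per candidate alpha from the list on the slide.
curves = {a: loss_curve(x, y, a) for a in [0.001, 0.005, 0.01, 0.05, 0.1, 0.2]}
```

Plotting each curve against the iteration number shows the pattern the slides describe: a well-chosen α drops quickly and levels off, a tiny α barely moves.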
Effect of learning rate choices – just right value for α

Learning rate 0.2

Learning rate 0.1


Effect of learning rate choices – too small a value of α makes convergence slow

Learning rate 0.01

A learning rate that is too low makes the step size very small, and convergence takes too long
Effect of learning rate choices – too big a value of α leads to divergence

Learning rate = 0.4

A too-big learning rate causes gradient descent to overshoot the optimal value and move farther away – divergence!
Loss function surface map
Multivariate linear regression
 Data D = {(x^(i), y^(i)): i = 1, …, m}
 We have n input variables x_1, …, x_n
 Hypothesis h_θ(x) = θ^T·x is parameterized by θ = (θ_0, θ_1, …, θ_n)^T
 x = (x_0, x_1, …, x_n)^T (with x_0 = 1 for all data points)
 (Note: m is the number of training examples, n is the number of input variables)
 The loss function is the same as in the one-variable case: J(θ) = (1/2m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i))^2
Gradient Descent

 The problem can be solved by gradient descent in the same fashion:
 Initialize θ = (0, …, 0)^T, choose a learning rate α
 Then loop till convergence/termination
 Compute ∇J(θ)
 Update θ: θ := θ − α·∇J(θ)
 In particular:
θ_j := θ_j − α·(1/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i))·x_j^(i) for each j = 0, 1, …, n
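The vectorized update above can be sketched directly in NumPy; a minimal illustration (the synthetic data, α = 0.1, and the iteration count are assumptions, not from the slides):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=2000):
    """Vectorized gradient descent for multivariate linear regression.

    X: (m, n+1) design matrix whose first column is the dummy x_0 = 1.
    One step: theta := theta - alpha * (1/m) * X^T (X theta - y),
    which updates every theta_j simultaneously.
    """
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m
        theta -= alpha * grad
    return theta

# Toy data generated from y = 2 + 3*x1 - x2 exactly, so gradient
# descent should recover theta = (2, 3, -1).
rng = np.random.default_rng(0)
x12 = rng.uniform(-1, 1, size=(50, 2))
X = np.column_stack([np.ones(50), x12])
y = 2 + 3 * x12[:, 0] - x12[:, 1]
theta = gradient_descent(X, y)
```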
Data in matrix form

X = [x^(1), x^(2), …, x^(m)]^T (an m × (n+1) matrix), Y = (y^(1), …, y^(m))^T, θ = (θ_0, …, θ_n)^T
Each row of matrix X represents a data point.

J(θ) = (1/2m)·(Xθ − Y)^T(Xθ − Y) = (1/2m)·‖Xθ − Y‖^2
Multivariate linear regression

 Note that the gradient is an (n+1)-by-1 vector:
∇J(θ) = (∂J/∂θ_0, ∂J/∂θ_1, …, ∂J/∂θ_n)^T
Each ∂J/∂θ_j for j = 0, 1, …, n is computed with the formula
∂J/∂θ_j = (1/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i))·x_j^(i)
where h_θ(x^(i)) = θ^T·x^(i) (the dot product of the parameter vector θ with the i-th input vector x^(i)).
Feature scaling/normalization
 Be cautious: different features may have rather different value ranges – scaling features may be a good idea to make gradient descent work well.
 Example: predicting house price from num_br, living area, and lot size
 Num_br: mean = 3.4
 Living area in sf: mean = 2088
 Lot size in sf: mean = 11732
 House price: mean = 519149
Sample data for house price
(multiple input variables)
num_bedroom Living_area lot_size House_Price
3 1180 5650 221900
3 2570 7242 538000
2 770 10000 180000
4 1960 5000 604000
3 1680 8080 510000
4 5420 101930 1230000
3 1715 6819 257500
3 1060 9711 291850
3 1780 7470 229500
3 1890 6560 323000
3 3560 9796 662500
2 1160 6000 468000
3 1430 19901 310000
3 1370 9680 400000
5 1810 4850 530000
Feature standardization

 Standardization: for feature x_j with mean μ_j and sample standard deviation σ_j,
set x_j := (x_j − μ_j) / σ_j
This will make the transformed feature approximately mean zero and variance 1.
More ways to scale features
 Normalize by range (this will make the new feature range [0, 1]):
set x_j := (x_j − min_j) / (max_j − min_j)
 Scale to make the new feature approximately in the interval [-0.5, 0.5]:
set x_j := (x_j − μ_j) / (max_j − min_j)
 Note the μ_j, σ_j, max_j, min_j are estimated from the training data.
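Standardization with training-set statistics can be sketched as follows; a minimal illustration (the helper name `standardize` and the rows taken from the sample table are assumptions, not from the slides):

```python
import numpy as np

def standardize(X_train):
    """Z-score scaling: (x_j - mu_j) / sigma_j, statistics from the
    TRAINING data only.

    Returns the scaled training matrix plus a function that applies the
    same (mu, sigma) transform to any new data point.
    """
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0, ddof=1)   # sample standard deviation
    scale = lambda X: (X - mu) / sigma
    return scale(X_train), scale

# num_bedroom, living_area, lot_size rows from the sample table.
X = np.array([[3, 1180, 5650],
              [3, 2570, 7242],
              [2, 770, 10000],
              [4, 1960, 5000]], dtype=float)
X_scaled, scale = standardize(X)
# Apply the SAME training-set statistics to a new house before predicting.
x_new_scaled = scale(np.array([3.0, 1500.0, 8000.0]))
```

Returning the `scale` closure captures the point of the next slide: new data must be transformed with the training-set μ and σ, not its own.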
Need to scale new data point for prediction
 Do NOT scale the dummy variable x_0
 The dependent variable y does not need to be scaled either
 For a new data point, we need to transform the data using the mean, standard deviation, max, and min for each variable (obtained from the training data) before applying the learned θ to predict y
Solving LR by closed form solutions (normal equation)
 X = [x^(1), x^(2), …, x^(m)]^T (an m × (n+1) matrix), Y = (y^(1), …, y^(m))^T, θ = (θ_0, …, θ_n)^T

J(θ) = (1/2m)·‖Xθ − Y‖^2
Normal equation
 In matrix format:
∇J(θ) = (1/m)·X^T(Xθ − Y); setting the eq. to 0: X^T·X·θ = X^T·Y

Solving the above matrix equation (assuming X^T·X is not singular), we get
θ* = (X^T·X)^(−1)·X^T·Y    EQ. (3)
 This is the optimal solution to the minimization problem
 The above equation (3) is often called the normal equation
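EQ. (3) is one line of NumPy; a minimal sketch (the noiseless toy line is an assumption, not from the slides):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form LR solution: theta* = (X^T X)^(-1) X^T y.

    np.linalg.solve is used instead of forming the inverse explicitly:
    it solves the linear system (X^T X) theta = X^T y directly, which
    is both cheaper and numerically safer.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Recover the line y = 5 + 2*x from noiseless data.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])   # dummy x_0 = 1 column
y = 5 + 2 * x
theta = normal_equation(X, y)
```

No learning rate and no iterations, exactly the trade-off the next slide discusses.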
Gradient descent or normal equation for LR?
 When n, the number of features (input variables), is not too big, it is a good idea to use the normal equation for LR, which is a direct way to get the optimal parameter values – no need to select a learning rate, no iterations needed.
 When n is very big (e.g., 200,000), it would be computationally too expensive to use the normal equation, and gradient descent is better. Using the normal equation for LR requires computing (X^T·X)^(−1), which has computational complexity O(n^3). This is too costly if n, the number of features, is very big.
 The gradient descent method has the advantage that it can be generalized to handle ML problems where the x-y relationship is not linear (or no closed form solution is available)
Polynomial features
 In many cases, the response variable y may depend on some power x^j of the input variable x (here x^j denotes the j-th power of x, not the j-th data point).
 In general, y may be a polynomial function of degree k (> 1) of multiple independent vars
 This can still be handled by linear regression – by introducing new polynomial features (e.g., x, x^2, …, x^k)
 y is no longer a linear function of x
 But y is a linear function of the parameters θ
 Need to do feature scaling/normalization for gradient descent to work well
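The feature-expansion trick can be sketched as follows; a minimal illustration (the helper name `poly_design_matrix` and the noiseless quadratic data are assumptions, not from the slides):

```python
import numpy as np

def poly_design_matrix(x, degree):
    """Build columns [1, x, x^2, ..., x^degree] so that fitting a
    polynomial becomes ordinary linear regression in the new features."""
    return np.column_stack([x ** j for j in range(degree + 1)])

# Fit y = 1 - 2x + x^2 exactly via the normal equation on the
# expanded features: y is nonlinear in x but linear in theta.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1 - 2 * x + x ** 2
X = poly_design_matrix(x, degree=2)
theta = np.linalg.solve(X.T @ X, X.T @ y)
```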
Underfit, Overfit and Just Right

 Intuitively, a model/hypothesis h underfits a dataset D if the hypothesis performs poorly on data D. This could be because the features describing h are limited and not sufficient to capture the underlying structures in D.
 A hypothesis h overfits a dataset D means intuitively that h fits D "too well" but fails to generalize well on new data. The features may contain many irrelevant ones, and the hypothesis space is too huge – there is NOT enough data to narrow down the search for a good model.
 The "just right" hypothesis would perform well on both D and separate validation data.
Underfit, Overfit and Just Right
 When using polynomial features for linear regression, underfitting could happen if the data pattern is quite complex and the degree of the polynomial is too low.
 Similarly, overfitting could happen if the degree of the polynomial used for linear regression is too high.
 This figure shows 3 cases of trying to use polynomials of degree 1, 2, and 4 to do linear regression. Apparently the left case is an underfit with large errors, the right one overfits the data, and the middle one is "just right".
Overfitting handling
 Reduce the number of features
 Manually select features (e.g., select the degree of polynomials)
 Use a validation dataset to choose the most relevant ones
 Regularization
 Keep all features, but constrain the weights on features to be small in magnitude
 Features with very small weights practically will not matter, so regularization effectively selects features automatically
Regularization
 When we have too many input variables (or using a polynomial of high degree)
 the model is too complex, with a risk of overfitting
 To handle it: add a regularization term to the loss:
J(θ) = (1/2m)·[ Σ_{i=1..m} (h_θ(x^(i)) − y^(i))^2 + λ·Σ_{j=1..n} θ_j^2 ]
The gradient would also be changed:
∂J/∂θ_j = (1/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i))·x_j^(i) + (λ/m)·θ_j for j = 1, …, n
Note we do NOT regularize the intercept θ_0
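The modified gradient can be sketched as follows; a minimal illustration (the helper name, the λ and α values, and the tiny data are assumptions, not from the slides):

```python
import numpy as np

def ridge_gradient(theta, X, y, lam):
    """Gradient of the regularized loss:
    (1/m) * X^T (X theta - y) + (lam/m) * theta,
    with the intercept theta_0 left unregularized (its penalty
    component is zeroed out)."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m
    penalty = (lam / m) * theta
    penalty[0] = 0.0               # do NOT regularize theta_0
    return grad + penalty

# One gradient-descent step on tiny made-up data.
X = np.array([[1.0, 2.0], [1.0, 4.0]])
y = np.array([3.0, 6.0])
theta = np.array([1.0, 1.0])
theta = theta - 0.1 * ridge_gradient(theta, X, y, lam=2.0)
```

Only the θ_j components for j ≥ 1 pick up the extra (λ/m)·θ_j term; the intercept's gradient is unchanged.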
Choice of the regularization parameter λ

 If λ is too big, then each of the θ_j (j ≥ 1) would be very small, and the model may underfit – loss would be high on both training and validation data
 If λ is too small (near 0), then regularization takes no effect, and the model may overfit – loss would be low on training data but high on validation data
 Plot the loss on training/validation data as a function of λ to choose its value
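The train/validation sweep over λ can be sketched as follows; a toy illustration (the synthetic data, the closed-form ridge solver, and the candidate λ list are all assumptions, not from the slides):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form regularized LR: solve (X^T X + lam * I') theta = X^T y,
    where I' is the identity with entry [0, 0] zeroed so that the
    intercept theta_0 is not penalized."""
    reg = np.eye(X.shape[1])
    reg[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * reg, X.T @ y)

def half_mse(theta, X, y):
    r = X @ theta - y
    return (r @ r) / (2 * len(y))

# Synthetic data: a noisy line, deliberately over-modeled with degree-6
# polynomial features so regularization has something to do.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 80)
y = 1 + 2 * x + 0.1 * rng.normal(size=80)
X = np.column_stack([x ** j for j in range(7)])   # [1, x, ..., x^6]
X_tr, y_tr, X_va, y_va = X[:60], y[:60], X[60:], y[60:]

# Validation loss for each candidate lambda; pick the minimizer.
losses = {lam: half_mse(ridge_fit(X_tr, y_tr, lam), X_va, y_va)
          for lam in [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]}
best_lam = min(losses, key=losses.get)
```

Plotting `losses` against λ reproduces the curve the slide describes: high validation loss at both extremes, with a "just right" λ in between.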
