
Departamento de Eletrónica, Telecomunicações e Informática

LECTURE 2: LINEAR REGRESSION

Petia Georgieva
(petia@ua.pt)
LINEAR REGRESSION - Outline
1. Univariate linear regression
- Cost (loss) function - Mean Squared Error (MSE)
- Cost function convergence
- Gradient descent algorithm

2. Multivariate linear regression
- Overfitting problem

3. Regularization => way to deal with overfitting
Supervised Learning -
CLASSIFICATION vs REGRESSION
Classification - the label is an integer number.
(e.g. 0, 1 for binary classification)

Regression - the label is a real number.

Examples of regression problems:

• Weather forecast

• Predicting wind velocity from temperature, humidity, air pressure

• Time series prediction of stock market indices

• Predicting sales amounts of a new product based on advertising expenditure
Standard Notations in this course

m – number of training examples; (x^(i), y^(i)) – i-th training example
x – input (feature) vector; y – output (target, label)
θ – model parameters; hθ(x) – hypothesis (model); J(θ) – cost function; α – learning rate
Supervised Learning –
univariate regression
Problem: Learning to predict the housing prices (output, predicted variable) as a
function of the living area (input, feature, predictor)

observation – prediction = Error (“residual”)

Mean Square Error (MSE)

Linear Model (hypothesis) => hθ(x) = θ0 + θ1x

Cost (loss) function (Mean Square Error) =>
J(θ) = (1/2m) Σ_{i=1..m} (hθ(x^(i)) − y^(i))²

m – number of training examples

Goal => minimize J(θ) with respect to the parameters θ

Gradient descent algorithm =>
θj := θj − α ∂J(θ)/∂θj

iterative algorithm; at each iteration all parameters (theta)
are updated simultaneously

alpha – learning rate > 0
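As a concrete illustration, here is a minimal NumPy sketch of the hypothesis and MSE cost defined above (the function names, and the assumption that X is an m×(n+1) design matrix with a leading column of ones, are ours rather than from the slides):

```python
import numpy as np

def hypothesis(theta, X):
    """Linear model h_theta(x) = theta^T x, applied to all rows of the design matrix X."""
    return X @ theta

def mse_cost(theta, X, y):
    """Cost J(theta) = (1/2m) * sum over i of (h_theta(x^(i)) - y^(i))^2."""
    m = len(y)
    errors = hypothesis(theta, X) - y
    return (errors @ errors) / (2 * m)
```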
Linear Regression
(computing the gradient)
Cost function => J(θ) = (1/2m) Σ_{i=1..m} (hθ(x^(i)) − y^(i))²

Cost function gradients (vector with partial derivatives of J with
respect to each parameter), for one example (m=1) =>
∂J(θ)/∂θj = (hθ(x) − y) xj

Cost function gradients, for m examples =>
∂J(θ)/∂θj = (1/m) Σ_{i=1..m} (hθ(x^(i)) − y^(i)) xj^(i)
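In vectorized form the gradient over all m examples is a single matrix expression; a minimal sketch under the same assumptions as above:

```python
import numpy as np

def mse_gradient(theta, X, y):
    """Gradient of J(theta): (1/m) * X^T (X theta - y), one partial derivative per parameter."""
    m = len(y)
    return X.T @ (X @ theta - y) / m
```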
Linear Regression – iterative gradient
descent algorithm (summary)
Initialize model parameters (e.g. θ = 0)
Repeat until J converges {

Compute Linear Regression Model => hθ(x^(i)) = θᵀx^(i)

Compute cost function => J(θ) = (1/2m) Σ_{i=1..m} (hθ(x^(i)) − y^(i))²

Compute cost function gradients => ∂J(θ)/∂θj = (1/m) Σ_{i=1..m} (hθ(x^(i)) − y^(i)) xj^(i)

Update parameters => θj := θj − α ∂J(θ)/∂θj (simultaneously for all j)

}
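Putting the steps together, a minimal batch gradient descent sketch in NumPy (the function name, default learning rate and iteration count are illustrative choices, not taken from the slides):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=100):
    """Batch gradient descent for linear regression.

    X: (m, n+1) design matrix with a leading column of ones (x0 = 1)
    y: (m,) vector of targets
    Returns the learned parameters and the cost J at every iteration.
    """
    m, n = X.shape
    theta = np.zeros(n)                                  # initialize parameters (theta = 0)
    cost_history = []
    for _ in range(n_iters):
        errors = X @ theta - y                           # h_theta(x^(i)) - y^(i) for all examples
        cost_history.append((errors @ errors) / (2 * m)) # J(theta)
        grad = X.T @ errors / m                          # cost function gradients
        theta = theta - alpha * grad                     # simultaneous update of all parameters
    return theta, cost_history
```

Monitoring the returned cost history is how convergence (and the effect of the learning rate) is checked in the slides that follow.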
Batch / mini-batch / stochastic
gradient descent for parameter update

Batch learning (classical approach):
update parameters after all training examples have been processed;
repeat for several iterations until convergence.

Mini-batch learning (if big training data):
divide the training data into small batches; update parameters after
each mini-batch has been processed; repeat until convergence.

Stochastic (incremental) learning (large-scale ML problems):
update parameters after every single training example has been
processed.

Stochastic Gradient Descent (SGD): the true gradient is
approximated by the gradient at a single example. Adaptive
learning rate.
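A sketch of the mini-batch variant; with batch_size = 1 it becomes stochastic (incremental) learning, and with batch_size = m it reduces to classical batch learning. The shuffling scheme, batch size and learning rate below are illustrative assumptions:

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.01, batch_size=32, n_epochs=50, seed=0):
    """Mini-batch gradient descent for linear regression (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        order = rng.permutation(m)                        # shuffle the examples every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ theta - yb) / len(idx)    # gradient on this mini-batch only
            theta = theta - alpha * grad                  # update after every mini-batch
    return theta
```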
Cost function convergence

Linear Regression (LR): starting from different initial values of the
parameters, the cost function J should always converge (possibly to a
local minimum!) if LR works properly.
Lin Reg Cost function – local minimum
Suppose θ1 is at a local optimum, as shown in the figure.

What will one step of Gradient Descent do?

1) Leave θ1 unchanged
2) Change θ1 in a random direction
3) Decrease θ1
4) Move θ1 in the direction of the global minimum of J
Cost function convergence –
changing the learning rate (α), 100 iterations

[Figure: cost J vs. number of iterations (0–100) for α = 0.01, 0.03, 0.1, 0.3]

If α too small: slow convergence of the cost function J (the
Gradient Descent optimization can be slow).
Cost function convergence –
changing the learning rate (α), 400 iterations

[Figure: cost J vs. number of iterations (0–400) for α = 0.01, 0.03, 0.1, 0.3]
LinReg Cost function convergence –
learning rate variation (α)

[Figure: cost J vs. number of iterations (0–20) for α = 0.01, 0.03, 0.1, 1.4]

If α too large: the cost function J may not decrease at each
iteration and may fail to converge. It may even diverge!
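One way to reproduce this kind of comparison is to run gradient descent with several learning rates and record J; a self-contained sketch on made-up, feature-scaled housing data:

```python
import numpy as np

def cost_history(X, y, alpha, n_iters=100):
    """Run batch gradient descent and return J(theta) at every iteration."""
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(n_iters):
        errors = X @ theta - y
        history.append((errors @ errors) / (2 * len(y)))
        theta = theta - alpha * X.T @ errors / len(y)
    return history

# Made-up data: price as a roughly linear function of area, plus noise
rng = np.random.default_rng(0)
area = rng.uniform(50, 200, size=100)
price = 1000 + 50 * area + rng.normal(0, 500, size=100)
X = np.column_stack([np.ones_like(area), (area - area.mean()) / area.std()])  # x0 = 1, scaled x1

for alpha in (0.01, 0.03, 0.1, 0.3):
    J = cost_history(X, price, alpha)
    print(f"alpha = {alpha:>4}: J after 100 iterations = {J[-1]:.1f}")
```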
Univariate Regression

Given the house area, what is the most likely house price?
If the univariate linear regression model is not sufficiently good,
add more input features (e.g. number of bedrooms).
Multivariate Regression

Problem: Learning to predict the housing price as a function of living area &
number of bedrooms.

x0 = 1
hθ(x) = θ0 + θ1x1 + θ2x2 = [θ0 θ1 θ2] [x0 x1 x2]ᵀ = θᵀx
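A small NumPy sketch of the vectorized multivariate hypothesis; the feature and parameter values below are made up purely for illustration:

```python
import numpy as np

# Two features: x1 = living area, x2 = number of bedrooms (made-up values)
area = np.array([104.0, 153.0, 210.0])
bedrooms = np.array([2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(area), area, bedrooms])   # x0 = 1 for every example

theta = np.array([50.0, 0.8, 10.0])   # [theta0, theta1, theta2], illustrative values
predictions = X @ theta               # h_theta(x) = theta^T x, computed for all examples at once
print(predictions)
```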
Polynomial Regression
If the univariate linear regression model is not a good model, try a
polynomial model.
The univariate (x1 = size) housing price problem is transformed into a
multivariate (still linear!) regression model with x = [x1 = size, x2 = size^2, x3 = size^3].
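A sketch of this feature expansion; the sizes are made up, and feature scaling is included because size^3 is orders of magnitude larger than size:

```python
import numpy as np

size = np.array([104.0, 153.0, 210.0, 250.0])                  # made-up house sizes
X_poly = np.column_stack([size, size**2, size**3])             # x1 = size, x2 = size^2, x3 = size^3
X_poly = (X_poly - X_poly.mean(axis=0)) / X_poly.std(axis=0)   # scale each feature column
X_poly = np.column_stack([np.ones(len(size)), X_poly])         # add the intercept column x0 = 1
# X_poly can now be fed to the same linear regression / gradient descent machinery as before
```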
Overfitting problem
Overfitting: if we have too many features (e.g. a high-order polynomial
model), the learned hypothesis may fit the training set very well but
fail to generalize to new examples (predict prices on new examples).

underfit (1st-order polynomial model):    hθ(x) = θ0 + θ1x
just right (3rd-order polynomial model):  hθ(x) = θ0 + θ1x + θ2x² + θ3x³
overfit (higher-order polynomial model):  hθ(x) = θ0 + θ1x + θ2x² + .... + θ16x¹⁶
Overfitting problem
Overfitting: if we have too many features (x1, ..., x100), the learned
model may fit the training data very well but fail to generalize to new
examples.

hθ(x) = θ0 + θ1x1 + θ2x2 + .... + θnxn = θᵀx
How to deal with overfitting problem ?

1. Reduce number of features.


― Manually select which features to keep.
― Algorithm to select the best model complexity.

2. Regularization (add extra term in cost function)


Regularization methods shrink model parameters θ towards zero to
prevent overfitting by reducing the variance of the model.

2.1 Ridge Regression (L2 norm)

― Reduces the magnitude of θ (but never makes them exactly 0) => keeps all features.
― Works well when all features contribute a bit to the output y.

2.2 Lasso Regression (L1 norm)

― May shrink some of the elements of the vector θ to exactly 0.
― Eliminates some of the features => serves as feature selection.
Regularized Linear Regression
(cost function)

Unregularized cost function =>
J(θ) = (1/2m) Σ_{i=1..m} (hθ(x^(i)) − y^(i))²

Regularized cost function (Ridge Regression: add an extra regularization term) =>
J(θ) = (1/2m) [ Σ_{i=1..m} (hθ(x^(i)) − y^(i))² + λ Σ_{j=1..n} θj² ]

(don't regularize θ0)
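A sketch of the ridge-regularized cost, following the convention above that the intercept θ0 is not regularized (the function and argument names are ours):

```python
import numpy as np

def ridge_cost(theta, X, y, lam):
    """Ridge cost: J = (1/2m) * [ sum of squared errors + lam * sum(theta_j^2 for j >= 1) ]."""
    m = len(y)
    errors = X @ theta - y
    return (errors @ errors + lam * np.sum(theta[1:] ** 2)) / (2 * m)
```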
Regularized Linear Regression
(cost function gradient)

Unregularized cost function gradients =>
∂J(θ)/∂θj = (1/m) Σ_{i=1..m} (hθ(x^(i)) − y^(i)) xj^(i)

Regularized cost function gradients =>
∂J(θ)/∂θ0 = (1/m) Σ_{i=1..m} (hθ(x^(i)) − y^(i)) x0^(i)
∂J(θ)/∂θj = (1/m) Σ_{i=1..m} (hθ(x^(i)) − y^(i)) xj^(i) + (λ/m) θj ,  j = 1, ..., n
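The matching gradient sketch; again the intercept θ0 receives no penalty term:

```python
import numpy as np

def ridge_gradient(theta, X, y, lam):
    """Gradient of the ridge cost: (1/m) X^T (X theta - y), plus (lam/m) * theta_j for j >= 1."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m
    grad[1:] += (lam / m) * theta[1:]
    return grad
```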
Regularized Linear Regression
What if lambda (λ) is set to an extremely large value?

- Algorithm fails to eliminate overfitting.
- Algorithm results in under-fitting (fails to fit even the training data well).
- Gradient descent will fail to converge.
Regularization: Lasso Regression
J(θ) = (1/m) Σ_{i=1..m} [ −y^(i) log(hθ(x^(i))) − (1 − y^(i)) log(1 − hθ(x^(i))) ] + (λ/2m) Σ_{j=1..n} |θj|

Ridge Regression shrinks θ towards zero, but never exactly to zero => all
features are included in the model, no matter how small the coefficients are.

Lasso Regression is able to shrink coefficients to exactly zero =>
it reduces the number of features. This makes Lasso Regression useful
in high-dimensional problems.

Lasso Regression involves absolute values (not differentiable) =>
optimization is harder => suitable algorithms are available in the sklearn
Python library.
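A minimal sketch using the Ridge and Lasso estimators from the sklearn library mentioned above; the data is made up, and sklearn's alpha parameter plays the role of λ in these slides:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Made-up data: 100 examples, 10 features, only the first 3 actually influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] - 1.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all coefficients, none become exactly 0
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: may set some coefficients exactly to 0

print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))   # irrelevant features -> 0 (feature selection)
```

With the L1 penalty the coefficients of the irrelevant features are driven to exactly zero, while the L2 penalty only shrinks them, which illustrates the feature-selection behaviour described above.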
