Linear Regression (Annotated)

Linear Regression

Fundamentals of Data Science


29 October, 3, 5, 10 November 2020
Prof. Fabio Galasso
[Figure slide: ALVINN ’97, an autonomous driving demo shown as a motivating example.]
Outline

• Linear regression
One or multiple variables
Cost function

• Gradient descent, incl. batch/stochastic GD

• Normal equation

• MSE and Correlation

• Locally-Weighted Regression

• Probabilistic Interpretation of Least Squares


Linear Regression
[Figure: housing prices (Portland, OR) — scatter plot of Price (in 1000s of dollars, 0–500) versus Size (feet², 0–3000).]

Supervised learning: the training data gives the “right answer” (the price) for each example.
Regression problem: predict a real-valued output.
Training set of housing prices (Portland, OR):

Size in feet² (x)    Price ($) in 1000’s (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …
Notation:
m = Number of training examples
x’s = “input” variable / features
y’s = “output” variable / “target” variable
The learning pipeline: Training Set → Learning Algorithm → hypothesis h.
Given the size of a house, h outputs the estimated price: Size of house → h → Estimated price.
h maps from x’s to y’s.

How do we represent h? For linear regression with one variable (univariate linear regression):
h_θ(x) = θ_0 + θ_1 x
Multiple features (variables)

With one feature:

Size (feet²)    Price ($1000)
2104            460
1416            232
1534            315
852             178
…               …

With multiple features:

Size (feet²)   Number of bedrooms   Number of floors   Age of home (years)   Price ($1000)
2104           5                    1                  45                    460
1416           3                    2                  40                    232
1534           3                    2                  30                    315
852            2                    1                  36                    178
…              …                    …                  …                     …
Notation:
n = number of features
x^(i) = input (features) of the i-th training example
x_j^(i) = value of feature j in the i-th training example

Hypothesis:
With one variable: h_θ(x) = θ_0 + θ_1 x
With multiple variables: h_θ(x) = θ_0 x_0 + θ_1 x_1 + … + θ_n x_n
For convenience of notation, define x_0 = 1, so that h_θ(x) = θᵀx.

Multivariate linear regression.
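As an illustrative sketch (not part of the original slides), the hypothesis h_θ(x) = θᵀx with the x_0 = 1 convention can be evaluated for all examples at once via a design matrix; the feature values are the ones from the table above, while the θ values are arbitrary.

```python
import numpy as np

# Features from the table above: size, bedrooms, floors, age (one row per example)
X_raw = np.array([
    [2104., 5., 1., 45.],
    [1416., 3., 2., 40.],
    [1534., 3., 2., 30.],
    [ 852., 2., 1., 36.],
])

# Prepend x_0 = 1 to every example so that h_theta(x) = theta^T x includes the intercept
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# An arbitrary (illustrative) parameter vector theta in R^(n+1)
theta = np.array([80.0, 0.1, 10.0, 5.0, -1.0])

# Hypothesis for all examples at once: h = X @ theta
h = X @ theta
print(h)   # one predicted price (in $1000s) per training example
```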


Linear Regression: Cost Function

Training set:

Size in feet² (x)    Price ($) in 1000’s (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …

Hypothesis: h_θ(x) = θ_0 + θ_1 x
θ_i’s: parameters
How do we choose the θ_i’s?
[Figure: three example lines h_θ(x) = θ_0 + θ_1 x over the same points, for three different choices of θ_0 and θ_1.]

Idea: choose θ_0, θ_1 so that h_θ(x) is close to y for our training examples (x, y).
Simplified setting (θ_0 = 0):
Hypothesis: h_θ(x) = θ_1 x
Parameter: θ_1
Cost function: J(θ_1) = ½ Σ_{i=1}^m (h_θ(x^(i)) − y^(i))²
Goal: minimize_{θ_1} J(θ_1)
[Figures: for several values of θ_1, the left plot shows h_θ(x) as a function of x (for fixed θ_1) together with the training points; the right plot shows the corresponding value of J(θ_1) as a function of the parameter θ_1. The line that passes through the data corresponds to the minimum of J(θ_1).]
Back to the full setting:
Hypothesis: h_θ(x) = θ_0 + θ_1 x
Parameters: θ_0, θ_1
Cost function: J(θ_0, θ_1) = ½ Σ_{i=1}^m (h_θ(x^(i)) − y^(i))²
Goal: minimize_{θ_0, θ_1} J(θ_0, θ_1)
[Figures: for fixed θ_0, θ_1, the left plots show h_θ(x) over the housing data (Price ($) in 1000’s, 0–500, versus Size in feet², 0–3000); the right plots show J(θ_0, θ_1) as a function of the parameters, as a surface and as contour plots. Each line on the left corresponds to one point on the contours on the right.]
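A minimal sketch (assuming NumPy) of evaluating the cost function above on the housing table, using the ½ Σ (h_θ(x^(i)) − y^(i))² convention of these slides; the parameter values compared are arbitrary.

```python
import numpy as np

x = np.array([2104., 1416., 1534., 852.])   # size in feet^2
y = np.array([ 460.,  232.,  315., 178.])   # price in $1000s

def cost(theta0, theta1, x, y):
    """J(theta0, theta1) = 1/2 * sum_i (h_theta(x_i) - y_i)^2."""
    h = theta0 + theta1 * x                 # hypothesis h_theta(x) = theta0 + theta1 * x
    return 0.5 * np.sum((h - y) ** 2)

# The parameter choice with the lower J fits the training points better
print(cost(0.0, 0.20, x, y))
print(cost(0.0, 0.25, x, y))
```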
Outline

• Linear regression
One or multiple variables
Cost function

• Gradient descent, incl. batch/stochastic GD

• Normal equation

• MSE and Correlation

• Locally-Weighted Regression

• Probabilistic Interpretation of Least Squares


Gradient Descent

Have some function J(θ_0, θ_1)
Want min_{θ_0, θ_1} J(θ_0, θ_1)

Outline:
• Start with some θ_0, θ_1
• Keep changing θ_0, θ_1 to reduce J(θ_0, θ_1) until we hopefully end up at a minimum

[Figures: two surface plots of J(θ_0, θ_1); starting the descent from different initial points can lead to different local minima.]
Gradient descent algorithm:
repeat until convergence {
    θ_j := θ_j − α ∂/∂θ_j J(θ_0, θ_1)    (for j = 0 and j = 1)
}

Correct (simultaneous update):
temp0 := θ_0 − α ∂/∂θ_0 J(θ_0, θ_1)
temp1 := θ_1 − α ∂/∂θ_1 J(θ_0, θ_1)
θ_0 := temp0
θ_1 := temp1

Incorrect: updating θ_0 first and then using the new θ_0 when computing the update for θ_1.
Gradient descent algorithm: θ_j := θ_j − α ∂/∂θ_j J(θ)

If α is too small, gradient descent can be slow.
If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.

At a local optimum the derivative is zero, so the update leaves the current value of θ unchanged.

Gradient descent can converge to a local minimum even with the learning rate α fixed: as we approach a local minimum, the gradient shrinks and gradient descent automatically takes smaller steps. So there is no need to decrease α over time.
Gradient Descent for Linear Regression

Apply the gradient descent algorithm to the linear regression model, h_θ(x) = θ_0 + θ_1 x with cost J(θ_0, θ_1) = ½ Σ_i (h_θ(x^(i)) − y^(i))²:

repeat until convergence {
    θ_0 := θ_0 − α Σ_{i=1}^m (h_θ(x^(i)) − y^(i))
    θ_1 := θ_1 − α Σ_{i=1}^m (h_θ(x^(i)) − y^(i)) · x^(i)
}
update θ_0 and θ_1 simultaneously.

[Figures: for linear regression the cost surface J(θ_0, θ_1) is a convex bowl, so gradient descent reaches the global minimum.]
[Figures: successive gradient descent steps traced on the contour plot of J(θ_0, θ_1), with the corresponding fitted line h_θ(x) over the housing data shown at each step.]
“Batch” Gradient Descent

“Batch”: each step of gradient descent uses all the training examples.
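A hedged sketch of batch gradient descent for univariate linear regression on the housing data: every step sums over all m examples, matching the update rules above. The rescaling of x, the learning rate, and the iteration count are arbitrary choices made so that this small example converges.

```python
import numpy as np

# Housing data from the slides; x is rescaled to thousands of feet^2 for numerical stability
x = np.array([2104., 1416., 1534., 852.]) / 1000.0
y = np.array([ 460.,  232.,  315., 178.])

theta0, theta1 = 0.0, 0.0
alpha = 0.05                     # learning rate (arbitrary choice)

for _ in range(5000):            # "batch": every step sums over all m examples
    h = theta0 + theta1 * x
    grad0 = np.sum(h - y)        # dJ/dtheta0, with J = 1/2 * sum_i (h - y)^2
    grad1 = np.sum((h - y) * x)  # dJ/dtheta1
    # simultaneous update of both parameters
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(theta0, theta1)            # fitted intercept and slope (price per 1000 feet^2)
```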
Gradient Descent

• The gradient is computed by summing over all samples in the dataset

• This may be very slow

• Can we do something more efficient?

Stochastic Gradient Descent (SGD)

• For large training sets, evaluating the gradient over all samples may be expensive
• Stochastic or “online” gradient descent approximates the true gradient by the gradient at a single example

Pseudocode:
- Choose an initial vector of parameters θ^(0) and learning rate α
- Repeat until convergence:
  • Randomly shuffle the examples in the training set
  • For k = 1, 2, …, m, do:
      θ^(i+1) = θ^(i) − α ∇J^(k)(θ^(i)), where J^(k) is the cost on the single example k

• Normally preferable: mini-batch gradient descent
Consider a mini-batch at each step
This normally results in smoother convergence
This normally is faster, thanks to vectorization libraries
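A sketch (assuming NumPy) of the mini-batch variant described above; setting batch_size = 1 recovers plain SGD. The synthetic data, batch size, learning rate, and epoch count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic data: y ≈ 4 + 3x plus noise (not from the slides)
m = 200
X = np.hstack([np.ones((m, 1)), rng.uniform(0, 2, (m, 1))])   # x_0 = 1 column + one feature
y = 4 + 3 * X[:, 1] + rng.normal(0, 0.5, m)

theta = np.zeros(2)
alpha, batch_size, epochs = 0.05, 16, 50     # batch_size = 1 would give plain SGD

for _ in range(epochs):
    perm = rng.permutation(m)                # randomly shuffle the examples
    for start in range(0, m, batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ theta - yb)      # gradient of the cost on this mini-batch
        theta -= alpha * grad / len(idx)     # averaging keeps the step comparable across batch sizes
    # (a full implementation would also check a convergence criterion here)

print(theta)                                 # should approach [4, 3]
```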

Gradient Descent for Multiple Variables

Hypothesis: h_θ(x) = θᵀx = θ_0 x_0 + θ_1 x_1 + … + θ_n x_n (with x_0 = 1)
Parameters: θ = (θ_0, θ_1, …, θ_n)
Cost function: J(θ) = ½ Σ_{i=1}^m (h_θ(x^(i)) − y^(i))²

Gradient descent:
Repeat {
    θ_j := θ_j − α ∂/∂θ_j J(θ)
}   (simultaneously update θ_j for every j = 0, …, n)

Previously (n = 1):
Repeat {
    θ_0 := θ_0 − α Σ_{i=1}^m (h_θ(x^(i)) − y^(i))
    θ_1 := θ_1 − α Σ_{i=1}^m (h_θ(x^(i)) − y^(i)) x^(i)
}   (simultaneously update θ_0, θ_1)

New algorithm (n ≥ 1):
Repeat {
    θ_j := θ_j − α Σ_{i=1}^m (h_θ(x^(i)) − y^(i)) x_j^(i)
}   (simultaneously update θ_j for j = 0, …, n)
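The same update can be written in vectorized form, θ := θ − α Xᵀ(Xθ − y), which performs the simultaneous update of all θ_j at once. A sketch, under the (hypothetical) assumption that the caller passes a design matrix of already-scaled features:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.001, iters=1000):
    """Vectorized batch gradient descent for J(theta) = 1/2 * ||X @ theta - y||^2.

    X is the m x (n+1) design matrix with a leading column of ones; the vector
    update below is equivalent to simultaneously updating every theta_j.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ theta - y)   # j-th entry: sum_i (h(x^(i)) - y^(i)) * x_j^(i)
        theta = theta - alpha * grad
    return theta

# Hypothetical usage (X_scaled: design matrix of mean-normalized features):
# theta = batch_gradient_descent(X_scaled, y, alpha=0.001, iters=5000)
```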
Feature Scaling
Idea: make sure features are on a similar scale.
E.g. x_1 = size (0–2000 feet²), x_2 = number of bedrooms (1–5).
Feature Scaling
Get every feature into approximately a −1 ≤ x_j ≤ 1 range.

Mean normalization:
Replace x_j with x_j − μ_j to make features have approximately zero mean (do not apply to x_0 = 1).
E.g. x_j := (x_j − μ_j) / s_j, where μ_j is the mean of feature j and s_j its range (max − min) or standard deviation.
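A short sketch of mean normalization and range scaling as described above; dividing by the standard deviation instead of the range is an equally common choice.

```python
import numpy as np

def scale_features(X):
    """Mean-normalize and range-scale every column of X (do not apply to the x_0 = 1 column)."""
    mu = X.mean(axis=0)
    s = X.max(axis=0) - X.min(axis=0)   # range; X.std(axis=0) is a common alternative
    return (X - mu) / s, mu, s

X_raw = np.array([[2104., 5.], [1416., 3.], [1534., 3.], [852., 2.]])  # size, bedrooms
X_scaled, mu, s = scale_features(X_raw)
print(X_scaled)                          # each feature now has zero mean and roughly unit range

# The same mu and s must be applied to new inputs at prediction time:
# x_new_scaled = (x_new - mu) / s
```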
Gradient Descent: Learning Rate

Gradient descent: θ_j := θ_j − α ∂/∂θ_j J(θ)
- “Debugging”: how to make sure gradient descent is working correctly.
- How to choose the learning rate α.

Making sure gradient descent is working correctly:
Plot J(θ) against the number of iterations; J(θ) should decrease after every iteration.
Example automatic convergence test: declare convergence if J(θ) decreases by less than a small threshold ε (e.g. 10⁻³) in one iteration.
[Figure: J(θ) versus number of iterations (0–400), decreasing and flattening out.]

If gradient descent is not working (J(θ) increases or oscillates with the number of iterations), use a smaller α.
- For sufficiently small α, J(θ) should decrease on every iteration.
- But if α is too small, gradient descent can be slow to converge.

Summary:
- If α is too small: slow convergence.
- If α is too large: J(θ) may not decrease on every iteration; it may not converge.

To choose α, try a sequence of values spaced by roughly a factor of 3 (e.g. …, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …) and pick the largest one for which J(θ) still decreases steadily.
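A sketch of this “debugging” recipe: run gradient descent with several learning rates, record J(θ) per iteration, and check that it decreases. The synthetic data and the particular rates tried are illustrative.

```python
import numpy as np

# Illustrative standardized design matrix and targets
rng = np.random.default_rng(1)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 100)

def gd_history(alpha, iters=200):
    """Run batch gradient descent and return J(theta) at every iteration."""
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(iters):
        residual = X @ theta - y
        history.append(0.5 * residual @ residual)   # J(theta) at this iteration
        theta -= alpha * X.T @ residual
    return history

for alpha in [0.0001, 0.0003, 0.001, 0.003, 0.01]:  # values spaced roughly x3 apart
    hist = gd_history(alpha)
    decreasing = all(b <= a + 1e-12 for a, b in zip(hist, hist[1:]))
    print(f"alpha={alpha}: final J={hist[-1]:.3f}, decreasing on every iteration: {decreasing}")
```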
Linear Regression:
Features and Polynomial regression
Housing prices prediction: instead of a straight line, we can fit a polynomial in the size, e.g. h_θ(x) = θ_0 + θ_1 x + θ_2 x² (or adding higher-order terms).
[Figure: polynomial fits of Price (y) versus Size (x).]

Choice of features: we are free to design new features from x, e.g.
θ_0 + θ_1 x
θ_0 + θ_1 x + θ_2 x²
θ_0 + θ_1 x + θ_2 x² + θ_3 log(x)
[Figure: the corresponding fits of Price (y) versus Size (x).]
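A sketch of building such engineered features and fitting them with ordinary least squares (here via np.linalg.lstsq); the feature map mirrors the θ_0 + θ_1 x + θ_2 x² + θ_3 log(x) example above, while the data points are illustrative.

```python
import numpy as np

# Illustrative house sizes and prices
size = np.array([ 852., 1416., 1534., 2104., 3000.])
price = np.array([178.,  232.,  315.,  460.,  540.])

# Engineered features: x, x^2, log(x) -- the model is still *linear* in the parameters
X = np.column_stack([np.ones_like(size), size, size**2, np.log(size)])

theta, *_ = np.linalg.lstsq(X, price, rcond=None)   # least-squares fit
print(theta)

# Predict the price of a hypothetical 2000 feet^2 house using the same feature map
x_new = np.array([1.0, 2000.0, 2000.0**2, np.log(2000.0)])
print(x_new @ theta)
```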
Dangers of (Polynomial) Regression

Overfitting and underfitting: a model with too few or too simple features underfits the data, while a very high-degree polynomial can pass through every training point yet generalize poorly to new inputs (overfitting).

Outline

• Linear regression
One or multiple variables
Cost function

• Gradient descent, incl. batch/stochastic GD

• Normal equation

• MSE and Correlation

• Locally-Weighted Regression

• Probabilistic Interpretation of Least Squares


Normal Equation

Gradient descent: Θ := Θ − α ∇_Θ J, where ∇_Θ J = (∂J/∂Θ_0, …, ∂J/∂Θ_n)ᵀ ∈ ℝ^{n+1}.

Normal equation: what about solving for Θ analytically?
Useful notation

• For f: ℝ^{m×n} ↦ ℝ, i.e. f(A) ∈ ℝ for A ∈ ℝ^{m×n}, define the matrix derivative ∇_A f(A) as the m×n matrix with entries ∂f/∂A_{ij}.

• Trace: if A ∈ ℝ^{n×n}, tr A = Σ_{i=1}^n A_{ii}.
Facts

• Some facts of matrix derivatives (without proof):

tr AB = tr BA
tr ABC = tr CAB = tr BCA
For f(A) = tr BA:   ∇_A tr AB = Bᵀ
tr A = tr Aᵀ
If a ∈ ℝ: tr a = a
∇_A tr ABAᵀC = CAB + CᵀABᵀ
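These identities are easy to sanity-check numerically; a small sketch with random matrices (the two gradient identities could be checked with finite differences, omitted here for brevity).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 3))

# tr(AB) = tr(BA)
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))

# tr(ABC) = tr(CAB) = tr(BCA): the trace is invariant under cyclic permutations
D, E, F = (rng.normal(size=(3, 3)) for _ in range(3))
print(np.isclose(np.trace(D @ E @ F), np.trace(F @ D @ E)),
      np.isclose(np.trace(D @ E @ F), np.trace(E @ F @ D)))

# tr(A) = tr(A^T) for square matrices
print(np.isclose(np.trace(D), np.trace(D.T)))
```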

m examples; n features.

Cost function in matrix form: let X ∈ ℝ^{m×(n+1)} be the design matrix whose rows are the (x^(i))ᵀ and y ∈ ℝ^m the vector of targets. Then

XΘ − y = (h(x^(1)) − y^(1), …, h(x^(m)) − y^(m))ᵀ, where Θᵀx^(i) = Σ_{j=0}^n Θ_j x_j^(i).

Recall zᵀz = Σ_i z_i². Therefore

½ (XΘ − y)ᵀ(XΘ − y) = ½ Σ_{i=1}^m (h(x^(i)) − y^(i))² = J(Θ).
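A quick numeric check, with random illustrative data, that the matrix form ½ (XΘ − y)ᵀ(XΘ − y) equals the sum-over-examples form of J(Θ):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 6, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])   # design matrix with x_0 = 1
y = rng.normal(size=m)
theta = rng.normal(size=n + 1)

residual = X @ theta - y
J_matrix = 0.5 * residual @ residual                              # 1/2 (X theta - y)^T (X theta - y)
J_sum = 0.5 * sum((theta @ X[i] - y[i]) ** 2 for i in range(m))   # 1/2 sum_i (h(x^(i)) - y^(i))^2
print(np.isclose(J_matrix, J_sum))                                # True
```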
Solve for Θ analytically.

Intuition: if Θ were 1-D, J(Θ) would be a quadratic function of a single number, and its minimum is attained when dJ/dΘ = 0. For a vector Θ we analogously set ∇_Θ J(Θ) = 0.

Expanding J(Θ):
J(Θ) = ½ (XΘ − y)ᵀ(XΘ − y) = ½ (ΘᵀXᵀXΘ − ΘᵀXᵀy − yᵀXΘ + yᵀy)

Recall ∇_A tr ABAᵀC = CAB + CᵀABᵀ and ∇_A tr AB = Bᵀ. Taking the gradient with respect to Θ and setting it to zero:
∇_Θ J(Θ) = XᵀXΘ − Xᵀy = 0

Normal equation:
XᵀXΘ = Xᵀy   ⟹   Θ = (XᵀX)⁻¹ Xᵀy
Examples:

x_0   Size (feet²)   Number of bedrooms   Number of floors   Age of home (years)   Price ($1000)
1     2104           5                    1                  45                    460
1     1416           3                    2                  40                    232
1     1534           3                    2                  30                    315
1     852            2                    1                  36                    178
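A sketch solving the normal equation on exactly this table. Note that with only m = 4 examples and n + 1 = 5 columns, XᵀX is singular here (see the non-invertibility discussion below), so the sketch uses np.linalg.lstsq, which returns a least-squares/pseudo-inverse solution, instead of explicitly inverting XᵀX.

```python
import numpy as np

# Design matrix from the table above: columns are x_0 = 1, size, bedrooms, floors, age
X = np.array([
    [1., 2104., 5., 1., 45.],
    [1., 1416., 3., 2., 40.],
    [1., 1534., 3., 2., 30.],
    [1.,  852., 2., 1., 36.],
])
y = np.array([460., 232., 315., 178.])

# Normal equation: X^T X Theta = X^T y. With m = 4 < n + 1 = 5 the system is
# underdetermined (X^T X singular), so we use lstsq rather than np.linalg.inv.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)
print(X @ theta)   # reproduces the four listed prices (the fit is exact here)
```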
m training examples, n features.

Gradient Descent:
• Need to choose α.
• Needs many iterations.
• Works well even when n is large.

Normal Equation:
• No need to choose α.
• No need to iterate.
• Need to compute (XᵀX)⁻¹ (roughly O(n³)).
• Slow if n is very large.
Normal equation: Θ = (XᵀX)⁻¹ Xᵀy

- What if XᵀX is non-invertible (singular / degenerate)?

Possible causes:
• Redundant features (linearly dependent), e.g. x_1 = size in feet² and x_2 = size in m².
• Too many features (e.g. m ≤ n).

- Remedy: delete some features, or use regularization.
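A small sketch of the redundant-feature case: including the size both in feet² and (proportionally) in m² makes the columns linearly dependent, so XᵀX is singular; the pseudo-inverse (or dropping a feature, or regularization) still yields a usable solution. The data values are illustrative.

```python
import numpy as np

size_ft2 = np.array([2104., 1416., 1534., 852., 1940., 1100.])
size_m2 = size_ft2 * 0.0929                # proportional to size_ft2 -> redundant feature
y = np.array([460., 232., 315., 178., 430., 210.])

X = np.column_stack([np.ones_like(size_ft2), size_ft2, size_m2])
print(np.linalg.matrix_rank(X))            # 2 < 3 columns: X^T X is singular (non-invertible)

# np.linalg.inv(X.T @ X) would be meaningless here. The pseudo-inverse gives the
# minimum-norm least-squares solution; alternatives are deleting one of the
# redundant features or adding a regularization term lambda * I to X^T X.
theta = np.linalg.pinv(X) @ y
print(theta)
print(X @ theta)                           # fitted prices
```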
Outline

• Linear regression
One or multiple variables
Cost function

• Gradient descent, incl. batch/stochastic GD

• Normal equation

• MSE and Correlation

• Locally-Weighted Regression

• Probabilistic Interpretation of Least Squares


Linear Regression and Correlation
Summary So Far
▪ Given: Set of known (x,y) points
▪ Find: function f(x)=ax+b that “best fits” the
known points, i.e., f(x) is close to y
▪ Use function to predict y values for new x’s
➢ Also can be used to test correlation

Correlation and Causation

Correlation – Values track each other


• Height and Shoe Size
• Grades and Entrance Exam Scores
Causation – One value directly influences another
• Education Level → Starting Salary
• Temperature → Cold Drink Sales

Correlation and Causation (from Overview)

Correlation – Values track each other


• Height and Shoe Size
• Grades and Entrance Exam Scores
Find: function f(x)=ax+b that “best fits” the
known points, i.e., f(x) is close to y

The better the function


fits the points, the more
correlated x and y are

Regression and Correlation
The better the function fits the points,
the more correlated x and y are
▪ Linear functions only
▪ Correlation – Values track each other
Positively – when one goes up the other goes up
▪ Also negative correlation
When one goes up the other goes down
• Latitude versus temperature
• Car weight versus gas mileage
• Class absences versus final grade

Calculating Simple Linear Regression
Method of least squares
▪ Given a point and a line, the error for the point
is its vertical distance d from the line, and the
squared error is d 2
▪ Given a set of points and a line, the sum of
squared error (SSE) is the sum of the squared
errors for all the points
▪ Goal: Given a set of points, find the line that
minimizes the SSE

Calculating Simple Linear Regression
Method of least squares

[Figure: five data points and a candidate line, with vertical distances d1, …, d5 from each point to the line.]

SSE = d1² + d2² + d3² + d4² + d5²

Calculating Simple Linear Regression
Method of least squares

Goal: find the line that minimizes the SSE = d1² + d2² + d3² + d4² + d5².

Ways to compute it:
- Gradient descent
- Normal equation
- Software packages, e.g. NumPy polyfit
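Since the slide points to software packages, here is a minimal np.polyfit sketch (degree 1 is simple linear regression, i.e. the line minimizing the SSE); the points are illustrative.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

a, b = np.polyfit(x, y, deg=1)        # least-squares fit of y = a*x + b
print(a, b)

y_hat = a * x + b
sse = np.sum((y - y_hat) ** 2)        # sum of squared (vertical) errors, as in the figure
print(sse)
```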

Measuring Correlation
More help from software packages…
Pearson’s Product Moment Correlation (PPMC)
• “Pearson coefficient”, “correlation coefficient”
• Value r between −1 and 1
  1: maximum positive correlation
  0: no correlation
  −1: maximum negative correlation
• Swapping the x and y axes yields the same value

Coefficient of determination
• r², R², “R squared”
• “The better the function fits the points, the more correlated x and y are”
• Measures the fit of any line/curve to a set of points
• Usually between 0 and 1
• For simple linear regression, R² = Pearson²
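A small sketch computing the Pearson coefficient with np.corrcoef and checking that, for a simple linear fit, R² equals the squared Pearson coefficient; the data are illustrative.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.8, 4.1, 4.5, 5.9, 6.2])

r = np.corrcoef(x, y)[0, 1]                 # Pearson correlation coefficient
print(r)
print(np.corrcoef(y, x)[0, 1])              # swapping x and y gives the same value

# Coefficient of determination of the least-squares line
a, b = np.polyfit(x, y, deg=1)
y_hat = a * x + b
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(r_squared, r ** 2)                    # equal (up to rounding) for simple linear regression
```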
Correlation Game
http://aionet.eu/corguess (*)
Try to get:
Right answers ≥ 10, Guesses ≤ Right answers × 2
Anti-cheating: Pictures = Right answers + 1
(*) Improved version of “Wilderdom correlation guessing game” thanks to
Poland participant Marcin Piotrowski

Other correlation games:


http://guessthecorrelation.com/
http://www.rossmanchance.com/applets/GuessCorrelation.html
http://www.istics.net/Correlations/

Outline

• Linear regression
One or multiple variables
Cost function

• Gradient descent, incl. batch/stochastic GD

• Normal equation

• MSE and Correlation

• Locally-Weighted Regression

• Probabilistic Interpretation of Least Squares


Locally-Weighted Regression

Recap: m examples; n features. y^(i) ∈ ℝ, x_0 = 1.

h_Θ(x) = Σ_{j=0}^n Θ_j x_j = Θᵀx        J(Θ) = ½ Σ_{i=1}^m (h_Θ(x^(i)) − y^(i))²

Recap: choice of features, e.g.
Θ_0 + Θ_1 x_1
Θ_0 + Θ_1 x + Θ_2 x²
Θ_0 + Θ_1 x + Θ_2 x² + Θ_3 log(x)
[Figure: the corresponding fits of Price (y) versus Size (x).]
Recap: dangers of polynomial regression
Overfitting and Underfitting
Locally-weighted regression

• “Parametric learning algorithm”: a fixed set of parameters (e.g. the Θ of linear regression).
• “Non-parametric learning algorithm”: the amount of parameters/state grows with the data (more data to keep in memory).

• Locally-weighted regression, also named Loess or Lowess.
[Figure: example dataset, y versus x.]

Locally-weighted regression
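The formulas of this slide are not preserved in the export. As a hedged sketch of the standard locally-weighted (Loess/Lowess-style) formulation: for each query point x, fit Θ to minimize Σ_i w^(i) (y^(i) − Θᵀx^(i))², with Gaussian weights w^(i) = exp(−(x^(i) − x)² / (2τ²)), where the bandwidth τ controls how local the fit is. A minimal version under those assumptions:

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Locally-weighted linear regression prediction at a single query point.

    X: m x (n+1) design matrix (with x_0 = 1), y: targets, tau: bandwidth.
    Solves the weighted normal equation X^T W X theta = X^T W y for this query.
    """
    dists = np.sum((X - x_query) ** 2, axis=1)
    w = np.exp(-dists / (2 * tau ** 2))            # Gaussian weights, larger near x_query
    W = np.diag(w)
    theta = np.linalg.pinv(X.T @ W @ X) @ (X.T @ W @ y)
    return x_query @ theta

# Illustrative nonlinear data that a single global line would fit poorly
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, 100))
y = np.sin(x) + rng.normal(0, 0.1, 100)
X = np.column_stack([np.ones_like(x), x])

print(lwr_predict(np.array([1.0, 1.5]), X, y))     # close to sin(1.5) ≈ 1.0
```

Because a new weighted fit is solved for every query point, all training data must be kept available at prediction time, which is exactly the non-parametric behavior described above.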

Outline

• Linear regression
One or multiple variables
Cost function

• Gradient descent, incl. batch/stochastic GD

• Normal equation

• MSE and Correlation

• Locally-Weighted Regression

• Probabilistic Interpretation of Least Squares


Probabilistic Interpretation of Least Squares
Why Least Squares

• Assume y^(i) = Θᵀx^(i) + ε^(i)

Why Least Squares

• Likelihood: L(Θ) = P(y⃗ | X; Θ)
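The derivation itself is not preserved in the export; as a sketch of the standard argument, assume the ε^(i) are i.i.d. Gaussian with zero mean and variance σ². Then maximizing the (log-)likelihood is equivalent to minimizing the least-squares cost:

```latex
\begin{align*}
  p\bigl(y^{(i)} \mid x^{(i)}; \Theta\bigr)
    &= \frac{1}{\sqrt{2\pi}\,\sigma}
       \exp\!\Bigl(-\tfrac{(y^{(i)} - \Theta^{T} x^{(i)})^{2}}{2\sigma^{2}}\Bigr) \\
  \ell(\Theta) = \log L(\Theta)
    &= \sum_{i=1}^{m} \log p\bigl(y^{(i)} \mid x^{(i)}; \Theta\bigr)
     = m \log\frac{1}{\sqrt{2\pi}\,\sigma}
       - \frac{1}{\sigma^{2}}\,\frac{1}{2}\sum_{i=1}^{m}
         \bigl(y^{(i)} - \Theta^{T} x^{(i)}\bigr)^{2} \\
  \arg\max_{\Theta}\, \ell(\Theta)
    &= \arg\min_{\Theta}\, \frac{1}{2}\sum_{i=1}^{m}
       \bigl(y^{(i)} - \Theta^{T} x^{(i)}\bigr)^{2}
     = \arg\min_{\Theta}\, J(\Theta).
\end{align*}
```

Under the Gaussian-noise assumption, least squares is therefore maximum-likelihood estimation of Θ (and the value of σ does not affect the arg max).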

References

• Chapter 3.1 in [Bishop, 2006. Pattern Recognition and Machine Learning]

Thank you

Acknowledgements: slides and material from Andrew Ng, Eric Xing, Matthew R. Gormley, Jessica Wu
