Lecture 5 - Least Squares and Linear Regression

EE2211 Introduction to Machine Learning
Lecture 5
Semester 2, 2021/2022

Helen Juan Zhou


helen.zhou@nus.edu.sg

Electrical and Computer Engineering Department


National University of Singapore

Acknowledgement:
EE2211 development team
(Thomas, Kar-Ann, Helen, Vincent, Chen Khong, Robby, and Haizhou)

© Copyright EE, NUS. All Rights Reserved.


Course Contents
• Introduction and Preliminaries (Haizhou)
– Introduction
– Data Engineering
– Introduction to Probability and Statistics
• Fundamental Machine Learning Algorithms I (Helen)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Thomas)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Haizhou)
– Performance Issues
– K-means Clustering
– Neural Networks

Least Squares and Linear Regression
Module II Contents
• Notations, Vectors, Matrices (introduced in L3)
• Operations on Vectors and Matrices
• Systems of Linear Equations
• Set and Functions
• Derivative and Gradient
• Least Squares, Linear Regression
• Linear Regression with Multiple Outputs
• Linear Regression for Classification
• Ridge Regression
• Polynomial Regression

Recap: Linear and Affine Functions

Linear Functions
A function $f: \mathbb{R}^d \to \mathbb{R}$ is linear if it satisfies the following two properties:

• Homogeneity: $f(\alpha x) = \alpha f(x)$ (scaling)

• Additivity: $f(x + y) = f(x) + f(y)$ (adding)

Inner product function
$f(x) = w^T x = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d$

Affine function
$f(x) = w^T x + b$, where the scalar $b$ is called the offset (or bias)
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p31)
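To make these definitions concrete, here is a small numpy sketch (an added illustration, not from the original slides) of the inner product and affine functions, with the homogeneity property checked numerically:

import numpy as np

w = np.array([1.0, -2.0, 0.5])   # weight vector
b = 3.0                          # offset (bias)

def f_linear(x):
    return w @ x                 # inner product w^T x

def f_affine(x):
    return w @ x + b             # affine function w^T x + b

x = np.array([2.0, 1.0, 4.0])
alpha = 5.0
print(f_linear(alpha * x), alpha * f_linear(x))   # homogeneity: both print 10.0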

Functions: Maximum and Minimum
A local and a global minimum of a function

• $f(x)$ has a local minimum at $x = c$ if $f(x) \geq f(c)$ for every $x$ in some open interval around $x = c$

• $f(x)$ has a global minimum at $x = c$ if $f(x) \geq f(c)$ for all $x$ in the domain of $f$

[Figure: a function over the interval $a < x \leq b$, showing a local minimum and the global minimum]
Note: An interval is a set of real numbers with the property that any number that lies
between two numbers in the set is also included in the set.
An open interval does not include its endpoints and is denoted using parentheses. E.g.
(0, 1) means “all numbers greater than 0 and less than 1”.
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p6-7 of chp2).

Functions: Maximum and Minimum
Max and Arg Max
• Given a set of values $A = \{x_1, x_2, \ldots, x_n\}$,
• The operator $\max_{x \in A} f(x)$ returns the highest value $f(x)$ over all elements of the set $A$
• The operator $\arg\max_{x \in A} f(x)$ returns the element of the set $A$ that maximizes $f(x)$
• When the set is implicit or infinite, we can write $\max_x f(x)$ or $\arg\max_x f(x)$

E.g. $f(x) = 3x$, $x \in [0, 1]$  →  $\max_x f(x) = 3$ and $\arg\max_x f(x) = 1$

Min and Arg Min operate in a similar manner

Note: arg max returns a value from the domain of the function and max returns
from the range (codomain) of the function.
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p6-7 of chp2).
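A short numpy illustration of the distinction (an added sketch, not from the slides): for a finite set of candidate values, np.max returns the highest f value while np.argmax points to the maximizing element itself.

import numpy as np

A = np.array([0.0, 0.25, 0.5, 0.75, 1.0])   # a finite set of candidate x values
f = 3 * A                                   # f(x) = 3x evaluated on the set

print(np.max(f))          # 3.0  -> max f(x)
print(A[np.argmax(f)])    # 1.0  -> arg max f(x), the maximizing element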

Derivative and Gradient
• The derivative f′ of a function f is a function that describes
how fast f grows (or decreases)
– If the derivative is a constant value, e.g. 5 or −3
• The function ! grows (or decreases) constantly at any point x of its domain
– When the derivative !′ is a function
• If !′ is positive at some x, then the function ! grows at this point
• If !′ is negative at some x, then the function ! decreases at this point
• The derivative of zero at x means that the function’s slope at x is horizontal (e.g.
maximum or minimum points)

• The process of finding a derivative is called differentiation


• Gradient is the generalization of derivative for functions that
take several inputs (or one input in the form of a vector or
some other complex structure).

Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p8 of chp2).

Derivative and Gradient
The gradient of a function is a vector of partial derivatives
Differentiation of a scalar function w.r.t. a vector
If $f(x)$ is a scalar function of $d$ variables, where $x$ is a $d \times 1$ vector,
then differentiation of $f(x)$ w.r.t. $x$ results in a $d \times 1$ vector

$\dfrac{\partial f(x)}{\partial x} = \begin{bmatrix} \partial f / \partial x_1 \\ \vdots \\ \partial f / \partial x_d \end{bmatrix}$

This is referred to as the gradient of $f(x)$ and is often written as $\nabla_x f$.

E.g. $f(x) = a x_1 + b x_2$,  $\nabla_x f = \begin{bmatrix} a \\ b \end{bmatrix}$

Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Appendix)
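As a quick numerical check (an added sketch, not part of the slides), the finite-difference gradient of f(x) = a·x₁ + b·x₂ comes out as approximately [a, b]ᵀ:

import numpy as np

a, b = 2.0, -3.0
f = lambda x: a * x[0] + b * x[1]    # scalar function of a 2-D vector

def numerical_gradient(f, x, eps=1e-6):
    # central finite differences, one coordinate at a time
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

x0 = np.array([1.0, 4.0])
print(numerical_gradient(f, x0))     # approximately [ 2. -3.] = [a, b]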
Derivative and Gradient
Partial Derivatives
Differentiation of a vector function w.r.t. a vector
If $f(x)$ is a vector function of size $h \times 1$ and $x$ is a $d \times 1$ vector,
then differentiation of $f(x)$ w.r.t. $x$ results in an $h \times d$ matrix

$\dfrac{\partial f(x)}{\partial x} = \begin{bmatrix} \partial f_1/\partial x_1 & \cdots & \partial f_1/\partial x_d \\ \vdots & \ddots & \vdots \\ \partial f_h/\partial x_1 & \cdots & \partial f_h/\partial x_d \end{bmatrix}$

This matrix is referred to as the Jacobian of $f(x)$
Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Appendix)

Derivative and Gradient

Some Vector-Matrix Differentiation Formulae

$\dfrac{\partial (Ax)}{\partial x} = A$

$\dfrac{\partial (b^T x)}{\partial x} = b \qquad \dfrac{\partial (y^T A x)}{\partial x} = A^T y$

$\dfrac{\partial (x^T A x)}{\partial x} = (A + A^T)\,x$

Derivations: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf

Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Appendix)
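As a sanity check on the last formula (an added sketch, not from the slides), a finite-difference gradient of the quadratic form xᵀAx matches (A + Aᵀ)x:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
x = rng.standard_normal(3)

f = lambda v: v @ A @ v                    # scalar quadratic form x^T A x

eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])   # central finite differences

print(np.allclose(grad_fd, (A + A.T) @ x, atol=1e-5))   # True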

Linear Regression

• Linear regression is a popular regression learning


algorithm that learns a model which is a linear
combination of features of the input example.

$Xw = y$,  $X \in \mathbb{R}^{m \times d}$, $w \in \mathbb{R}^{d \times 1}$, $y \in \mathbb{R}^{m \times 1}$

$X = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m,1} & x_{m,2} & \cdots & x_{m,d} \end{bmatrix}$,  $w = \begin{bmatrix} w_1 \\ \vdots \\ w_d \end{bmatrix}$,  $y = \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}$

Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p3 of chp3).

Linear Regression
Problem Statement: To predict the unknown $y$ for a given $x$ (testing)

• We have a collection of labeled examples (training) $\{(x_i, y_i)\}_{i=1}^{m}$
  – $m$ is the size of the collection
  – $x_i$ is the d-dimensional feature vector of example $i = 1, \ldots, m$ (input)
  – $y_i$ is a real-valued target (1-D output scalar)
  – Note:
    • when $y_i$ is continuous valued, it is a regression problem
    • when $y_i$ is discrete valued, it is a classification problem

• We want to build a model $f_{w,b}(x)$ as a linear combination of features of example $x$: $f_{w,b}(x) = x^T w + b$,
  where $w$ is a d-dimensional vector of parameters and $b$ is a real number.
• The notation $f_{w,b}$ means that the model $f$ is parametrized by two values: $w$ and $b$
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (chp.14)

Linear Regression

Learning objective function

• To find the optimal values $w^*$ and $b^*$ that minimize the following expression:

  $\dfrac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x_i) - y_i \right)^2$

• In mathematics, the expression we minimize or maximize is called an objective function, or, simply, an objective

• $(f_{w,b}(x_i) - y_i)^2$ is called the loss function: a measure of the difference between $f_{w,b}(x_i)$ and $y_i$, or a penalty for misclassification of example $i$.
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (chp3.1.2)
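In code, this objective is a one-liner; a minimal numpy sketch (added illustration with made-up data, not from the slides):

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])   # m = 3 examples, d = 2 features
y = np.array([5.0, 4.0, 7.0])                        # targets
w = np.array([1.5, 0.5])                             # candidate parameters
b = 0.2

def objective(w, b, X, y):
    residuals = X @ w + b - y          # f_{w,b}(x_i) - y_i for all i
    return np.mean(residuals ** 2)     # average squared-error loss

print(objective(w, b, X, y))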

Linear Regression
Learning objective function (using simplified notation hereon)
• To find the optimal values $w^*$ that minimize the following expression:

  $\sum_{i=1}^{m} \left( f_w(x_i) - y_i \right)^2$

  with $f_w(x_i) = x_i^T w$,
  where we define $w = [b, w_1, \ldots, w_d]^T = [w_0, w_1, \ldots, w_d]^T$ (the offset is absorbed into $w$),
  and $x_i = [1, x_{i,1}, \ldots, x_{i,d}]^T = [x_{i,0}, x_{i,1}, \ldots, x_{i,d}]^T$, for $i = 1, \ldots, m$

• This particular choice of loss function is called the squared error loss

Note: The normalization factor $\frac{1}{m}$ can be omitted as it does not affect the optimization.

Linear Regression    $\sum_{i=1}^{m} \left( f_{w,b}(x_i) - y_i \right)^2$
• All model-based learning algorithms have a loss function
• What we do to find the best model is to minimize the
objective known as the cost function
• Cost function is a sum of loss functions over training set
plus possibly some model complexity penalty (regularization)

• In linear regression, the cost function is given by the average


loss, also called the empirical risk because we do not have
all the data (e.g. testing data)
– The average of all penalties is obtained by applying the
model to the training data
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (chp3.1.2)

Linear Regression
Learning (Training)
• Consider the set of feature vectors $x_i$ and target outputs $y_i$ indexed by $i = 1, \ldots, m$. A linear model $f_w(x) = x^T w$ can be stacked over all samples as

  $f_w(X) = Xw = \begin{bmatrix} x_1^T w \\ \vdots \\ x_m^T w \end{bmatrix}$,  with the learning target vector $y = \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}$,

  where $x_i^T w = [1, x_{i,1}, \ldots, x_{i,d}]\,[b, w_1, \ldots, w_d]^T$.

Note: The bias/offset term is responsible for translating the line/plane/hyperplane
away from the origin.

Linear Regression

Least Squares Regression

In vector-matrix notation, the minimization of the objective function can be written compactly using $e = Xw - y$:

$J(w) = e^T e$
$\quad\;\; = (Xw - y)^T (Xw - y)$
$\quad\;\; = (w^T X^T - y^T)(Xw - y)$
$\quad\;\; = w^T X^T X w - w^T X^T y - y^T X w + y^T y$
$\quad\;\; = w^T X^T X w - 2 y^T X w + y^T y.$

Note: when $f_w(X) = Xw$, then $\sum_{i=1}^{m} (f_w(x_i) - y_i)^2 = (Xw - y)^T (Xw - y)$.

Linear Regression

Differentiating $J(w)$ with respect to $w$ and setting the result to zero:

$\dfrac{\partial}{\partial w} J(w) = 0$

$\dfrac{\partial}{\partial w} \left( w^T X^T X w - 2 y^T X w + y^T y \right) = 0$

$\Rightarrow \; 2 X^T X w - 2 X^T y = 0$
$\Rightarrow \; X^T X w = X^T y$
$\Rightarrow$ Any minimizer $\hat{w}$ of $J(w)$ must satisfy $X^T X \hat{w} - X^T y = 0$.

If $X^T X$ is invertible (the least squares solution of an over-determined system, here with the offset absorbed into $w$), then

Learning/training:  $\hat{w} = (X^T X)^{-1} X^T y$

Prediction/testing:  $f_{\hat{w}}(X_{test}) = X_{test}\, \hat{w}$
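A minimal numpy sketch of these two steps (an added illustration, using the data of Example 1 on the next slide; np.linalg.solve is used here instead of an explicit inverse, which is numerically preferable but equivalent for this small system):

import numpy as np

# Training data with a leading column of ones (offset absorbed into w).
X = np.array([[1.0, -9], [1, -7], [1, -5], [1, 1], [1, 5], [1, 9]])
y = np.array([-6.0, -6, -4, -1, 1, 4])

# Learning/training: solve the normal equations X^T X w = X^T y.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)                       # approximately [-1.4375, 0.5625]

# Prediction/testing on a new sample x = -1.
X_test = np.array([[1.0, -1]])
print(X_test @ w_hat)              # approximately [-2.0]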

Linear Regression
(one input, one output)
Example 1   Training set $\{(x_i, y_i)\}_{i=1}^{m}$:
  {x = −9} → {y = −6}
  {x = −7} → {y = −6}
  {x = −5} → {y = −4}
  {x =  1} → {y = −1}
  {x =  5} → {y =  1}
  {x =  9} → {y =  4}

$Xw = y$:
$\begin{bmatrix} 1 & -9 \\ 1 & -7 \\ 1 & -5 \\ 1 & 1 \\ 1 & 5 \\ 1 & 9 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} = \begin{bmatrix} -6 \\ -6 \\ -4 \\ -1 \\ 1 \\ 4 \end{bmatrix}$

This set of linear equations has no exact solution (over-determined system).
However, $X^T X$ is invertible, so the least squares approximation is

$\hat{w} = X^{\dagger} y = (X^T X)^{-1} X^T y = \begin{bmatrix} 6 & -6 \\ -6 & 262 \end{bmatrix}^{-1} \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 \\ -9 & -7 & -5 & 1 & 5 & 9 \end{bmatrix} \begin{bmatrix} -6 \\ -6 \\ -4 \\ -1 \\ 1 \\ 4 \end{bmatrix} = \begin{bmatrix} -1.4375 \\ 0.5625 \end{bmatrix}$   (offset $-1.4375$, slope $0.5625$)
Linear Regression

Python demo 1:

import numpy as np
from numpy.linalg import inv
from sklearn.metrics import mean_squared_error

X = np.array([[1, -9], [1, -7], [1, -5], [1, 1], [1, 5], [1, 9]])
y = np.array([[-6], [-6], [-4], [-1], [1], [4]])

# learn model
w = inv(X.T @ X) @ X.T @ y
print(w)                          # [[-1.4375], [0.5625]]

X_new = np.array([[1, -1]])
y_new = X_new @ w                 # prediction for x = -1

# compute mean squared error against the ground-truth test targets
# (Y_test holds the true targets of the test samples; its values are not shown on the slide)
MSE = mean_squared_error(Y_test, y_new)
print(MSE)

$\hat{y} = X \hat{w}$, with $\hat{w} = \begin{bmatrix} -1.4375 \\ 0.5625 \end{bmatrix}$, i.e. $y = -1.4375 + 0.5625\,x$

Prediction:
Test set {x = −1} → {y = ?}

$\hat{y} = \begin{bmatrix} 1 & -1 \end{bmatrix} \begin{bmatrix} -1.4375 \\ 0.5625 \end{bmatrix} = -2$

[Plot: Linear Regression on one-dimensional samples]
Linear Regression
(multiple inputs (a vector), one output)
Example 2   $\{(x_i, y_i)\}_{i=1}^{m}$

Training set:
{x₁ = 1, x₂ = 1, x₃ = 1} → {y = 1}
{x₁ = 1, x₂ = −1, x₃ = 1} → {y = 0}
{x₁ = 1, x₂ = 1, x₃ = 3} → {y = 2}
{x₁ = 1, x₂ = 1, x₃ = 0} → {y = −1}

$Xw = y$:
$\begin{bmatrix} 1 & 1 & 1 \\ 1 & -1 & 1 \\ 1 & 1 & 3 \\ 1 & 1 & 0 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ w_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 2 \\ -1 \end{bmatrix}$

This set of linear equations has no exact solution.
However, $X^T X$ is invertible, so the least squares approximation is

$\hat{w} = X^{\dagger} y = (X^T X)^{-1} X^T y = \begin{bmatrix} 4 & 2 & 5 \\ 2 & 4 & 3 \\ 5 & 3 & 11 \end{bmatrix}^{-1} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & 1 \\ 1 & 1 & 3 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \\ 2 \\ -1 \end{bmatrix} = \begin{bmatrix} -0.7500 \\ 0.1786 \\ 0.9286 \end{bmatrix}$

Linear Regression
Prediction:
Test set (two new samples):
{x₁ = 1, x₂ = 6, x₃ = 8} → {y = ?}
{x₁ = 1, x₂ = 0, x₃ = −1} → {y = ?}

$\hat{y}_{new} = f_{\hat{w}}(X_{test}) = X_{test}\, \hat{w}$

$\hat{y}_{new} = \begin{bmatrix} 1 & 6 & 8 \\ 1 & 0 & -1 \end{bmatrix} \begin{bmatrix} -0.7500 \\ 0.1786 \\ 0.9286 \end{bmatrix} = \begin{bmatrix} 7.7500 \\ -1.6786 \end{bmatrix}$
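Example 2 can be reproduced with a few lines of numpy (an added sketch in the style of the earlier demo, not an official course demo):

import numpy as np
from numpy.linalg import inv

X = np.array([[1, 1, 1], [1, -1, 1], [1, 1, 3], [1, 1, 0]])
y = np.array([[1], [0], [2], [-1]])

w = inv(X.T @ X) @ X.T @ y          # least squares estimate
print(w.ravel())                    # approximately [-0.75, 0.1786, 0.9286]

X_test = np.array([[1, 6, 8], [1, 0, -1]])
print((X_test @ w).ravel())         # approximately [7.75, -1.6786]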

Linear Regression
Learning of Vectored Functions (Multiple Outputs)
For one sample: a linear model $f_W(x) = x^T W$ (a vector-valued function)
For m samples: $F_W(X) = XW = \hat{Y}$

$X = \begin{bmatrix} x_1^T \\ \vdots \\ x_m^T \end{bmatrix} = \begin{bmatrix} 1 & x_{1,1} & \cdots & x_{1,d} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{m,1} & \cdots & x_{m,d} \end{bmatrix}$,   $W = \begin{bmatrix} w_{0,1} & \cdots & w_{0,h} \\ w_{1,1} & \cdots & w_{1,h} \\ \vdots & \ddots & \vdots \\ w_{d,1} & \cdots & w_{d,h} \end{bmatrix}$

$Y = \begin{bmatrix} y_{1,1} & \cdots & y_{1,h} \\ \vdots & \ddots & \vdots \\ y_{m,1} & \cdots & y_{m,h} \end{bmatrix}$   (row $i$ is the $h$-dimensional output of sample $i$)

$X \in \mathbb{R}^{m \times (d+1)}$, $W \in \mathbb{R}^{(d+1) \times h}$, $Y \in \mathbb{R}^{m \times h}$,
where $h$ is the number of outputs, $d$ the number of features, and $m$ the number of samples; the first row of $W$ holds the offsets.
Linear Regression
Objective: $\sum_{i=1}^{m} \left\| f_W(x_i) - y_i \right\|^2$

Least Squares Regression of Multiple Outputs

In matrix notation, the sum of squared errors cost function can be written compactly using $E = XW - Y$:

$J(W) = \mathrm{trace}(E^T E) = \mathrm{trace}\left[ (XW - Y)^T (XW - Y) \right]$

If $X^T X$ is invertible, then (same form as before, with the vector $w$ replaced by the matrix $W$)

Learning/training:  $\hat{W} = (X^T X)^{-1} X^T Y$

Prediction/testing:  $F_{\hat{W}}(X_{test}) = X_{test}\, \hat{W}$

Ref: Hastie, Tibshirani, Friedman, “The Elements of Statistical Learning”, (2nd ed., 12th printing) 2017 (chp.3.2.4)
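To make the shapes concrete, a small synthetic sketch (an added illustration, not from the slides): with h outputs, Ŵ holds one column of weights per output, and the prediction is an m × h matrix.

import numpy as np

rng = np.random.default_rng(0)
m, d, h = 6, 2, 3                                                # samples, features, outputs
X = np.hstack([np.ones((m, 1)), rng.standard_normal((m, d))])    # m x (d+1), bias column first
Y = rng.standard_normal((m, h))                                  # m x h targets

W_hat = np.linalg.solve(X.T @ X, X.T @ Y)          # (d+1) x h
Y_pred = X @ W_hat                                 # m x h
print(W_hat.shape, Y_pred.shape)                   # (3, 3) (6, 3)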

Linear Regression

Least Squares Regression of Multiple Outputs

$J(W) = \mathrm{trace}(E^T E)$

$\quad\;\; = \mathrm{trace}\left( \begin{bmatrix} e_1^T \\ \vdots \\ e_h^T \end{bmatrix} \begin{bmatrix} e_1 & e_2 & \cdots & e_h \end{bmatrix} \right)$

$\quad\;\; = \mathrm{trace} \begin{bmatrix} e_1^T e_1 & e_1^T e_2 & \cdots & e_1^T e_h \\ e_2^T e_1 & e_2^T e_2 & \cdots & e_2^T e_h \\ \vdots & \vdots & \ddots & \vdots \\ e_h^T e_1 & e_h^T e_2 & \cdots & e_h^T e_h \end{bmatrix} = \sum_{k=1}^{h} e_k^T e_k$

i.e. the sum of squared errors for each output, where $e_k$ is the $k$-th column of $E = XW - Y$.
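A quick numerical confirmation of this identity (an added sketch): trace(EᵀE) equals the squared errors summed column by column.

import numpy as np

rng = np.random.default_rng(1)
E = rng.standard_normal((5, 3))            # a 5 x 3 error matrix (m samples, h outputs)

lhs = np.trace(E.T @ E)
rhs = sum(E[:, k] @ E[:, k] for k in range(E.shape[1]))   # sum_k e_k^T e_k
print(np.isclose(lhs, rhs))                # True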

Linear Regression of multiple outputs
Example 3
Training set $\{(x_i, y_i)\}_{i=1}^{m}$:
{x₁ = 1, x₂ = 1, x₃ = 1} → {y₁ = 1, y₂ = 0}
{x₁ = 1, x₂ = −1, x₃ = 1} → {y₁ = 0, y₂ = 1}
{x₁ = 1, x₂ = 1, x₃ = 3} → {y₁ = 2, y₂ = −1}
{x₁ = 1, x₂ = 1, x₃ = 0} → {y₁ = −1, y₂ = 3}

$XW = Y$ (the first column of $X$ is the bias):
$\begin{bmatrix} 1 & 1 & 1 \\ 1 & -1 & 1 \\ 1 & 1 & 3 \\ 1 & 1 & 0 \end{bmatrix} \begin{bmatrix} w_{1,1} & w_{1,2} \\ w_{2,1} & w_{2,2} \\ w_{3,1} & w_{3,2} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 2 & -1 \\ -1 & 3 \end{bmatrix}$

This set of linear equations has NO exact solution.
However, $X^T X$ is invertible, so the least squares approximation is

$\hat{W} = X^{\dagger} Y = (X^T X)^{-1} X^T Y = \begin{bmatrix} 4 & 2 & 5 \\ 2 & 4 & 3 \\ 5 & 3 & 11 \end{bmatrix}^{-1} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & 1 \\ 1 & 1 & 3 & 0 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 2 & -1 \\ -1 & 3 \end{bmatrix} = \begin{bmatrix} -0.75 & 2.25 \\ 0.1786 & 0.0357 \\ 0.9286 & -1.2143 \end{bmatrix}$

Linear Regression of multiple outputs
Example 3
Prediction:
Test set: two new samples
{x₁ = 1, x₂ = 6, x₃ = 8} → {y₁ = ?, y₂ = ?}
{x₁ = 1, x₂ = 0, x₃ = −1} → {y₁ = ?, y₂ = ?}

$\hat{Y}_{new} = X_{test}\, \hat{W}$
$\qquad\;\; = \begin{bmatrix} 1 & 6 & 8 \\ 1 & 0 & -1 \end{bmatrix} \begin{bmatrix} -0.75 & 2.25 \\ 0.1786 & 0.0357 \\ 0.9286 & -1.2143 \end{bmatrix}$
$\qquad\;\; = \begin{bmatrix} 7.75 & -7.25 \\ -1.6786 & 3.4643 \end{bmatrix}$

Python demo 2
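The slide only references "Python demo 2"; based on the code fragments visible in the annotations, a plausible sketch (my reconstruction, not the official demo) is:

import numpy as np
from numpy.linalg import inv

X = np.array([[1, 1, 1], [1, -1, 1], [1, 1, 3], [1, 1, 0]])
Y = np.array([[1, 0], [0, 1], [2, -1], [-1, 3]])

W = inv(X.T @ X) @ X.T @ Y
print(W)             # approximately [[-0.75, 2.25], [0.1786, 0.0357], [0.9286, -1.2143]]

X_new = np.array([[1, 6, 8], [1, 0, -1]])
print(X_new @ W)     # approximately [[7.75, -7.25], [-1.6786, 3.4643]]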

Linear Regression of multiple outputs
Example 4
The values of the feature x and their corresponding values of the multiple-output target y are shown in the table below.
Based on least squares regression, what are the values of W?
Based on the learned mapping, when x = 2, what is the value of y?

x:  [3]     [4]       [10]     [6]       [7]
y:  [0, 5]  [1.5, 4]  [-3, 8]  [-4, 10]  [1, 6]

$\hat{W} = X^{\dagger} Y = (X^T X)^{-1} X^T Y = \begin{bmatrix} 1.9 & 3.6 \\ -0.4667 & 0.5 \end{bmatrix}$

Prediction: $\hat{y}_{new} = X_{test}\, \hat{W} = \begin{bmatrix} 1 & 2 \end{bmatrix} \hat{W} = \begin{bmatrix} 0.9667 & 4.6 \end{bmatrix}$

Python demo 3:

import numpy as np
from numpy.linalg import inv
import matplotlib.pyplot as plt

X = np.array([[1, 3], [1, 4], [1, 10], [1, 6], [1, 7]])
Y = np.array([[0, 5], [1.5, 4], [-3, 8], [-4, 10], [1, 6]])

W = inv(X.T @ X) @ X.T @ Y
print(W)                      # approximately [[1.9, 3.6], [-0.4667, 0.5]]

X_test = np.array([[1, 2]])
Y_test = X_test @ W           # approximately [[0.9667, 4.6]]

# plot both outputs against the feature x, together with the predictions at x = 2
plt.plot(X[:, 1], Y[:, 0], 'o', label='y1')
plt.plot(X[:, 1], Y[:, 1], 'x', label='y2')
plt.plot(X_test[:, 1], Y_test[:, 0], '^')
plt.plot(X_test[:, 1], Y_test[:, 1], 's')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()

Summary
• Notations, Vectors, Matrices                      Midterm (1 hour, L1 to L5)
• Operations on Vectors and Matrices
  • Dot-product, matrix inverse
• Systems of Linear Equations ($Xw = y$)
  • Matrix-vector notation, linear dependency, invertibility
  • Even-, over-, under-determined linear systems
• Functions, Derivative and Gradient
  • Inner product, linear/affine functions
  • Maximum and minimum, partial derivatives, gradient
• Least Squares, Linear Regression
  • Objective function, loss function
  • Least squares solution, training/learning and testing/prediction
  • Linear regression with multiple outputs
    Learning/training: $\hat{w} = (X_{train}^T X_{train})^{-1} X_{train}^T y_{train}$
    Prediction/testing: $\hat{y}_{test} = X_{test}\, \hat{w}$
• Classification
• Ridge Regression
• Polynomial Regression

Python packages: numpy, pandas, matplotlib.pyplot, numpy.linalg, sklearn.metrics (for mean_squared_error), and numpy.linalg.pinv
