Lecture 5 - Least Squares and Linear Regression

EE2211 Introduction to Machine Learning
Lecture 5
Semester 2, 2021/2022

Helen Juan Zhou


helen.zhou@nus.edu.sg

Electrical and Computer Engineering Department


National University of Singapore

Acknowledgement:
EE2211 development team
(Thomas, Kar-Ann, Helen, Vincent, Chen Khong, Robby, and Haizhou)

© Copyright EE, NUS. All Rights Reserved.


Course Contents
• Introduction and Preliminaries (Haizhou)
– Introduction
– Data Engineering
– Introduction to Probability and Statistics
• Fundamental Machine Learning Algorithms I (Helen)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Thomas)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Haizhou)
– Performance Issues
– K-means Clustering
– Neural Networks

Least Squares and Linear Regression
Module II Contents
• Notations, Vectors, Matrices (introduced in L3)
• Operations on Vectors and Matrices
• Systems of Linear Equations
• Set and Functions
• Derivative and Gradient
• Least Squares, Linear Regression
• Linear Regression with Multiple Outputs
• Linear Regression for Classification
• Ridge Regression
• Polynomial Regression

Recap: Linear and Affine Functions

Linear Functions
A function $f: \mathbb{R}^d \to \mathbb{R}$ is linear if it satisfies the following two properties:

• Homogeneity: $f(\alpha x) = \alpha f(x)$ (scaling)

• Additivity: $f(x + y) = f(x) + f(y)$ (adding)

Inner product function
$f(x) = w^T x = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d$

Affine function
$f(x) = w^T x + b$, where the scalar $b$ is called the offset (or bias)
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p31)
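To make these definitions concrete, here is a small numpy sketch (an added illustration, not from the original slides) of the inner product and affine functions, with the homogeneity property checked numerically:

import numpy as np

w = np.array([1.0, -2.0, 0.5])   # weight vector
b = 3.0                          # offset (bias)

def f_linear(x):
    return w @ x                 # inner product w^T x

def f_affine(x):
    return w @ x + b             # affine function w^T x + b

x = np.array([2.0, 1.0, 4.0])
alpha = 5.0
print(f_linear(alpha * x), alpha * f_linear(x))   # homogeneity: both print 10.0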

Functions: Maximum and Minimum
A local and a global minimum of a function

• $f(x)$ has a local minimum at $x = c$ if $f(x) \geq f(c)$ for every $x$ in some open interval around $x = c$

• $f(x)$ has a global minimum at $x = c$ if $f(x) \geq f(c)$ for all $x$ in the domain of $f$

[Figure: a function over the interval $a < x \leq b$, showing a local minimum and the global minimum]
Note: An interval is a set of real numbers with the property that any number that lies
between two numbers in the set is also included in the set.
An open interval does not include its endpoints and is denoted using parentheses. E.g.
(0, 1) means “all numbers greater than 0 and less than 1”.
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p6-7 of chp2).

Functions: Maximum and Minimum
Max and Arg Max
• Given a set of values $A = \{x_1, x_2, \ldots, x_n\}$,
• The operator $\max_{x \in A} f(x)$ returns the highest value $f(x)$ over all elements of the set $A$
• The operator $\arg\max_{x \in A} f(x)$ returns the element of the set $A$ that maximizes $f(x)$
• When the set is implicit or infinite, we can write $\max_x f(x)$ or $\arg\max_x f(x)$

E.g. $f(x) = 3x$, $x \in [0, 1]$  →  $\max_x f(x) = 3$ and $\arg\max_x f(x) = 1$

Min and Arg Min operate in a similar manner

Note: arg max returns a value from the domain of the function and max returns
from the range (codomain) of the function.
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p6-7 of chp2).
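A short numpy illustration of the distinction (an added sketch, not from the slides): for a finite set of candidate values, np.max returns the highest f value while np.argmax points to the maximizing element itself.

import numpy as np

A = np.array([0.0, 0.25, 0.5, 0.75, 1.0])   # a finite set of candidate x values
f = 3 * A                                   # f(x) = 3x evaluated on the set

print(np.max(f))          # 3.0  -> max f(x)
print(A[np.argmax(f)])    # 1.0  -> arg max f(x), the maximizing element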

Derivative and Gradient
• The derivative f′ of a function f is a function that describes
how fast f grows (or decreases)
– If the derivative is a constant value, e.g. 5 or −3
• The function ! grows (or decreases) constantly at any point x of its domain
– When the derivative !′ is a function
• If !′ is positive at some x, then the function ! grows at this point
• If !′ is negative at some x, then the function ! decreases at this point
• The derivative of zero at x means that the function’s slope at x is horizontal (e.g.
maximum or minimum points)

• The process of finding a derivative is called differentiation


• Gradient is the generalization of derivative for functions that
take several inputs (or one input in the form of a vector or
some other complex structure).

Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p8 of chp2).

Derivative and Gradient
The gradient of a function is a vector of partial derivatives
Differentiation of a scalar function w.r.t. a vector
If $f(x)$ is a scalar function of $d$ variables, where $x$ is a $d \times 1$ vector,
then differentiation of $f(x)$ w.r.t. $x$ results in a $d \times 1$ vector

$\dfrac{\partial f(x)}{\partial x} = \begin{bmatrix} \partial f / \partial x_1 \\ \vdots \\ \partial f / \partial x_d \end{bmatrix}$

This is referred to as the gradient of $f(x)$ and is often written as $\nabla_x f$.

E.g. $f(x) = a x_1 + b x_2$,  $\nabla_x f = \begin{bmatrix} a \\ b \end{bmatrix}$

Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Appendix)
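As a quick numerical check (an added sketch, not part of the slides), the finite-difference gradient of f(x) = a·x₁ + b·x₂ comes out as approximately [a, b]ᵀ:

import numpy as np

a, b = 2.0, -3.0
f = lambda x: a * x[0] + b * x[1]    # scalar function of a 2-D vector

def numerical_gradient(f, x, eps=1e-6):
    # central finite differences, one coordinate at a time
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

x0 = np.array([1.0, 4.0])
print(numerical_gradient(f, x0))     # approximately [ 2. -3.] = [a, b]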
Derivative and Gradient
Partial Derivatives
Differentiation of a vector function w.r.t. a vector
If $f(x)$ is a vector function of size $h \times 1$ and $x$ is a $d \times 1$ vector,
then differentiation of $f(x)$ w.r.t. $x$ results in an $h \times d$ matrix

$\dfrac{\partial f(x)}{\partial x} = \begin{bmatrix} \partial f_1/\partial x_1 & \cdots & \partial f_1/\partial x_d \\ \vdots & \ddots & \vdots \\ \partial f_h/\partial x_1 & \cdots & \partial f_h/\partial x_d \end{bmatrix}$

This matrix is referred to as the Jacobian of $f(x)$
Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Appendix)

Derivative and Gradient

Some Vector-Matrix Differentiation Formulae

$\dfrac{\partial (Ax)}{\partial x} = A$

$\dfrac{\partial (b^T x)}{\partial x} = b \qquad \dfrac{\partial (y^T A x)}{\partial x} = A^T y$

$\dfrac{\partial (x^T A x)}{\partial x} = (A + A^T)\,x$

Derivations: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf

Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Appendix)
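As a sanity check on the last formula (an added sketch, not from the slides), a finite-difference gradient of the quadratic form xᵀAx matches (A + Aᵀ)x:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
x = rng.standard_normal(3)

f = lambda v: v @ A @ v                    # scalar quadratic form x^T A x

eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])   # central finite differences

print(np.allclose(grad_fd, (A + A.T) @ x, atol=1e-5))   # True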

Linear Regression

• Linear regression is a popular regression learning


algorithm that learns a model which is a linear
combination of features of the input example.

$Xw = y$,  $X \in \mathbb{R}^{m \times d}$, $w \in \mathbb{R}^{d \times 1}$, $y \in \mathbb{R}^{m \times 1}$

$X = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m,1} & x_{m,2} & \cdots & x_{m,d} \end{bmatrix}$,  $w = \begin{bmatrix} w_1 \\ \vdots \\ w_d \end{bmatrix}$,  $y = \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}$

Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p3 of chp3).

Linear Regression
Problem Statement: To predict the unknown $y$ for a given $x$ (testing)

• We have a collection of labeled examples (training) $\{(x_i, y_i)\}_{i=1}^{m}$
  – $m$ is the size of the collection
  – $x_i$ is the d-dimensional feature vector of example $i = 1, \ldots, m$ (input)
  – $y_i$ is a real-valued target (1-D output scalar)
  – Note:
    • when $y_i$ is continuous valued, it is a regression problem
    • when $y_i$ is discrete valued, it is a classification problem

• We want to build a model $f_{w,b}(x)$ as a linear combination of features of example $x$: $f_{w,b}(x) = x^T w + b$,
  where $w$ is a d-dimensional vector of parameters and $b$ is a real number.
• The notation $f_{w,b}$ means that the model $f$ is parametrized by two values: $w$ and $b$
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (chp.14)

Linear Regression

Learning objective function

• To find the optimal values $w^*$ and $b^*$ that minimize the following expression:

  $\dfrac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x_i) - y_i \right)^2$

• In mathematics, the expression we minimize or maximize is called an objective function, or, simply, an objective

• $(f_{w,b}(x_i) - y_i)^2$ is called the loss function: a measure of the difference between $f_{w,b}(x_i)$ and $y_i$, or a penalty for misclassification of example $i$.
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (chp3.1.2)
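In code, this objective is a one-liner; a minimal numpy sketch (added illustration with made-up data, not from the slides):

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])   # m = 3 examples, d = 2 features
y = np.array([5.0, 4.0, 7.0])                        # targets
w = np.array([1.5, 0.5])                             # candidate parameters
b = 0.2

def objective(w, b, X, y):
    residuals = X @ w + b - y          # f_{w,b}(x_i) - y_i for all i
    return np.mean(residuals ** 2)     # average squared-error loss

print(objective(w, b, X, y))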

Linear Regression
Learning objective function (using simplified notation hereon)
• To find the optimal values $w^*$ that minimize the following expression:

  $\sum_{i=1}^{m} \left( f_w(x_i) - y_i \right)^2$

  with $f_w(x_i) = x_i^T w$,
  where we define $w = [b, w_1, \ldots, w_d]^T = [w_0, w_1, \ldots, w_d]^T$ (the offset is absorbed into $w$),
  and $x_i = [1, x_{i,1}, \ldots, x_{i,d}]^T = [x_{i,0}, x_{i,1}, \ldots, x_{i,d}]^T$, for $i = 1, \ldots, m$

• This particular choice of loss function is called the squared error loss

Note: The normalization factor $\frac{1}{m}$ can be omitted as it does not affect the optimization.

Linear Regression    $\sum_{i=1}^{m} \left( f_{w,b}(x_i) - y_i \right)^2$
• All model-based learning algorithms have a loss function
• What we do to find the best model is to minimize the
objective known as the cost function
• Cost function is a sum of loss functions over training set
plus possibly some model complexity penalty (regularization)

• In linear regression, the cost function is given by the average


loss, also called the empirical risk because we do not have
all the data (e.g. testing data)
– The average of all penalties is obtained by applying the
model to the training data
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (chp3.1.2)

Linear Regression
Learning (Training)
• Consider the set of feature vectors $x_i$ and target outputs $y_i$ indexed by $i = 1, \ldots, m$. A linear model $f_w(x) = x^T w$ can be stacked over all samples as

  $f_w(X) = Xw = \begin{bmatrix} x_1^T w \\ \vdots \\ x_m^T w \end{bmatrix}$,  with the learning target vector $y = \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}$,

  where $x_i^T w = [1, x_{i,1}, \ldots, x_{i,d}]\,[b, w_1, \ldots, w_d]^T$.

Note: The bias/offset term is responsible for translating the line/plane/hyperplane
away from the origin.

Linear Regression

Least Squares Regression

In vector-matrix notation, the minimization of the objective function can be written compactly using $e = Xw - y$:

$J(w) = e^T e$
$\quad\;\; = (Xw - y)^T (Xw - y)$
$\quad\;\; = (w^T X^T - y^T)(Xw - y)$
$\quad\;\; = w^T X^T X w - w^T X^T y - y^T X w + y^T y$
$\quad\;\; = w^T X^T X w - 2 y^T X w + y^T y.$

Note: when $f_w(X) = Xw$, then $\sum_{i=1}^{m} (f_w(x_i) - y_i)^2 = (Xw - y)^T (Xw - y)$.

Linear Regression

Differentiating $J(w)$ with respect to $w$ and setting the result to zero:

$\dfrac{\partial}{\partial w} J(w) = 0$

$\dfrac{\partial}{\partial w} \left( w^T X^T X w - 2 y^T X w + y^T y \right) = 0$

$\Rightarrow \; 2 X^T X w - 2 X^T y = 0$
$\Rightarrow \; X^T X w = X^T y$
$\Rightarrow$ Any minimizer $\hat{w}$ of $J(w)$ must satisfy $X^T X \hat{w} - X^T y = 0$.

If $X^T X$ is invertible (the least squares solution of an over-determined system, here with the offset absorbed into $w$), then

Learning/training:  $\hat{w} = (X^T X)^{-1} X^T y$

Prediction/testing:  $f_{\hat{w}}(X_{test}) = X_{test}\, \hat{w}$
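A minimal numpy sketch of these two steps (an added illustration, using the data of Example 1 on the next slide; np.linalg.solve is used here instead of an explicit inverse, which is numerically preferable but equivalent for this small system):

import numpy as np

# Training data with a leading column of ones (offset absorbed into w).
X = np.array([[1.0, -9], [1, -7], [1, -5], [1, 1], [1, 5], [1, 9]])
y = np.array([-6.0, -6, -4, -1, 1, 4])

# Learning/training: solve the normal equations X^T X w = X^T y.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)                       # approximately [-1.4375, 0.5625]

# Prediction/testing on a new sample x = -1.
X_test = np.array([[1.0, -1]])
print(X_test @ w_hat)              # approximately [-2.0]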

Linear Regression
(one input, one output)
Example 1   Training set $\{(x_i, y_i)\}_{i=1}^{m}$:
  {x = −9} → {y = −6}
  {x = −7} → {y = −6}
  {x = −5} → {y = −4}
  {x =  1} → {y = −1}
  {x =  5} → {y =  1}
  {x =  9} → {y =  4}

$Xw = y$:
$\begin{bmatrix} 1 & -9 \\ 1 & -7 \\ 1 & -5 \\ 1 & 1 \\ 1 & 5 \\ 1 & 9 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} = \begin{bmatrix} -6 \\ -6 \\ -4 \\ -1 \\ 1 \\ 4 \end{bmatrix}$

This set of linear equations has no exact solution (over-determined system).
However, $X^T X$ is invertible, so the least squares approximation is

$\hat{w} = X^{\dagger} y = (X^T X)^{-1} X^T y = \begin{bmatrix} 6 & -6 \\ -6 & 262 \end{bmatrix}^{-1} \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 \\ -9 & -7 & -5 & 1 & 5 & 9 \end{bmatrix} \begin{bmatrix} -6 \\ -6 \\ -4 \\ -1 \\ 1 \\ 4 \end{bmatrix} = \begin{bmatrix} -1.4375 \\ 0.5625 \end{bmatrix}$   (offset $-1.4375$, slope $0.5625$)
Linear Regression

Python demo 1:

import numpy as np
from numpy.linalg import inv
from sklearn.metrics import mean_squared_error

X = np.array([[1, -9], [1, -7], [1, -5], [1, 1], [1, 5], [1, 9]])
y = np.array([[-6], [-6], [-4], [-1], [1], [4]])

# learn model
w = inv(X.T @ X) @ X.T @ y
print(w)                          # [[-1.4375], [0.5625]]

X_new = np.array([[1, -1]])
y_new = X_new @ w                 # prediction for x = -1

# compute mean squared error against the ground-truth test targets
# (Y_test holds the true targets of the test samples; its values are not shown on the slide)
MSE = mean_squared_error(Y_test, y_new)
print(MSE)

$\hat{y} = X \hat{w}$, with $\hat{w} = \begin{bmatrix} -1.4375 \\ 0.5625 \end{bmatrix}$, i.e. $y = -1.4375 + 0.5625\,x$

Prediction:
Test set {x = −1} → {y = ?}

$\hat{y} = \begin{bmatrix} 1 & -1 \end{bmatrix} \begin{bmatrix} -1.4375 \\ 0.5625 \end{bmatrix} = -2$

[Plot: Linear Regression on one-dimensional samples]
Linear Regression
(multiple inputs (a vector), one output)
Example 2   $\{(x_i, y_i)\}_{i=1}^{m}$

Training set:
{x₁ = 1, x₂ = 1, x₃ = 1} → {y = 1}
{x₁ = 1, x₂ = −1, x₃ = 1} → {y = 0}
{x₁ = 1, x₂ = 1, x₃ = 3} → {y = 2}
{x₁ = 1, x₂ = 1, x₃ = 0} → {y = −1}

$Xw = y$:
$\begin{bmatrix} 1 & 1 & 1 \\ 1 & -1 & 1 \\ 1 & 1 & 3 \\ 1 & 1 & 0 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ w_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 2 \\ -1 \end{bmatrix}$

This set of linear equations has no exact solution.
However, $X^T X$ is invertible, so the least squares approximation is

$\hat{w} = X^{\dagger} y = (X^T X)^{-1} X^T y = \begin{bmatrix} 4 & 2 & 5 \\ 2 & 4 & 3 \\ 5 & 3 & 11 \end{bmatrix}^{-1} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & 1 \\ 1 & 1 & 3 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \\ 2 \\ -1 \end{bmatrix} = \begin{bmatrix} -0.7500 \\ 0.1786 \\ 0.9286 \end{bmatrix}$

Linear Regression
Prediction:
Test set (two new samples):
{x₁ = 1, x₂ = 6, x₃ = 8} → {y = ?}
{x₁ = 1, x₂ = 0, x₃ = −1} → {y = ?}

$\hat{y}_{new} = f_{\hat{w}}(X_{test}) = X_{test}\, \hat{w}$

$\hat{y}_{new} = \begin{bmatrix} 1 & 6 & 8 \\ 1 & 0 & -1 \end{bmatrix} \begin{bmatrix} -0.7500 \\ 0.1786 \\ 0.9286 \end{bmatrix} = \begin{bmatrix} 7.7500 \\ -1.6786 \end{bmatrix}$
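Example 2 can be reproduced with a few lines of numpy (an added sketch in the style of the earlier demo, not an official course demo):

import numpy as np
from numpy.linalg import inv

X = np.array([[1, 1, 1], [1, -1, 1], [1, 1, 3], [1, 1, 0]])
y = np.array([[1], [0], [2], [-1]])

w = inv(X.T @ X) @ X.T @ y          # least squares estimate
print(w.ravel())                    # approximately [-0.75, 0.1786, 0.9286]

X_test = np.array([[1, 6, 8], [1, 0, -1]])
print((X_test @ w).ravel())         # approximately [7.75, -1.6786]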

Linear Regression
Learning of Vectored Functions (Multiple Outputs)
For one sample: a linear model $f_W(x) = x^T W$ (a vector-valued function)
For m samples: $F_W(X) = XW = \hat{Y}$

$X = \begin{bmatrix} x_1^T \\ \vdots \\ x_m^T \end{bmatrix} = \begin{bmatrix} 1 & x_{1,1} & \cdots & x_{1,d} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{m,1} & \cdots & x_{m,d} \end{bmatrix}$,   $W = \begin{bmatrix} w_{0,1} & \cdots & w_{0,h} \\ w_{1,1} & \cdots & w_{1,h} \\ \vdots & \ddots & \vdots \\ w_{d,1} & \cdots & w_{d,h} \end{bmatrix}$

$Y = \begin{bmatrix} y_{1,1} & \cdots & y_{1,h} \\ \vdots & \ddots & \vdots \\ y_{m,1} & \cdots & y_{m,h} \end{bmatrix}$   (row $i$ is the $h$-dimensional output of sample $i$)

$X \in \mathbb{R}^{m \times (d+1)}$, $W \in \mathbb{R}^{(d+1) \times h}$, $Y \in \mathbb{R}^{m \times h}$,
where $h$ is the number of outputs, $d$ the number of features, and $m$ the number of samples; the first row of $W$ holds the offsets.
Linear Regression
Objective: $\sum_{i=1}^{m} \left\| f_W(x_i) - y_i \right\|^2$

Least Squares Regression of Multiple Outputs

In matrix notation, the sum of squared errors cost function can be written compactly using $E = XW - Y$:

$J(W) = \mathrm{trace}(E^T E) = \mathrm{trace}\left[ (XW - Y)^T (XW - Y) \right]$

If $X^T X$ is invertible, then (same form as before, with the vector $w$ replaced by the matrix $W$)

Learning/training:  $\hat{W} = (X^T X)^{-1} X^T Y$

Prediction/testing:  $F_{\hat{W}}(X_{test}) = X_{test}\, \hat{W}$

Ref: Hastie, Tibshirani, Friedman, “The Elements of Statistical Learning”, (2nd ed., 12th printing) 2017 (chp.3.2.4)
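To make the shapes concrete, a small synthetic sketch (an added illustration, not from the slides): with h outputs, Ŵ holds one column of weights per output, and the prediction is an m × h matrix.

import numpy as np

rng = np.random.default_rng(0)
m, d, h = 6, 2, 3                                                # samples, features, outputs
X = np.hstack([np.ones((m, 1)), rng.standard_normal((m, d))])    # m x (d+1), bias column first
Y = rng.standard_normal((m, h))                                  # m x h targets

W_hat = np.linalg.solve(X.T @ X, X.T @ Y)          # (d+1) x h
Y_pred = X @ W_hat                                 # m x h
print(W_hat.shape, Y_pred.shape)                   # (3, 3) (6, 3)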

Linear Regression

Least Squares Regression of Multiple Outputs

$J(W) = \mathrm{trace}(E^T E)$

$\quad\;\; = \mathrm{trace}\left( \begin{bmatrix} e_1^T \\ \vdots \\ e_h^T \end{bmatrix} \begin{bmatrix} e_1 & e_2 & \cdots & e_h \end{bmatrix} \right)$

$\quad\;\; = \mathrm{trace} \begin{bmatrix} e_1^T e_1 & e_1^T e_2 & \cdots & e_1^T e_h \\ e_2^T e_1 & e_2^T e_2 & \cdots & e_2^T e_h \\ \vdots & \vdots & \ddots & \vdots \\ e_h^T e_1 & e_h^T e_2 & \cdots & e_h^T e_h \end{bmatrix} = \sum_{k=1}^{h} e_k^T e_k$

i.e. the sum of squared errors for each output, where $e_k$ is the $k$-th column of $E = XW - Y$.
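A quick numerical confirmation of this identity (an added sketch): trace(EᵀE) equals the squared errors summed column by column.

import numpy as np

rng = np.random.default_rng(1)
E = rng.standard_normal((5, 3))            # a 5 x 3 error matrix (m samples, h outputs)

lhs = np.trace(E.T @ E)
rhs = sum(E[:, k] @ E[:, k] for k in range(E.shape[1]))   # sum_k e_k^T e_k
print(np.isclose(lhs, rhs))                # True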

Linear Regression of multiple outputs
Example 3
Training set $\{(x_i, y_i)\}_{i=1}^{m}$:
{x₁ = 1, x₂ = 1, x₃ = 1} → {y₁ = 1, y₂ = 0}
{x₁ = 1, x₂ = −1, x₃ = 1} → {y₁ = 0, y₂ = 1}
{x₁ = 1, x₂ = 1, x₃ = 3} → {y₁ = 2, y₂ = −1}
{x₁ = 1, x₂ = 1, x₃ = 0} → {y₁ = −1, y₂ = 3}

$XW = Y$ (the first column of $X$ is the bias):
$\begin{bmatrix} 1 & 1 & 1 \\ 1 & -1 & 1 \\ 1 & 1 & 3 \\ 1 & 1 & 0 \end{bmatrix} \begin{bmatrix} w_{1,1} & w_{1,2} \\ w_{2,1} & w_{2,2} \\ w_{3,1} & w_{3,2} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 2 & -1 \\ -1 & 3 \end{bmatrix}$

This set of linear equations has NO exact solution.
However, $X^T X$ is invertible, so the least squares approximation is

$\hat{W} = X^{\dagger} Y = (X^T X)^{-1} X^T Y = \begin{bmatrix} 4 & 2 & 5 \\ 2 & 4 & 3 \\ 5 & 3 & 11 \end{bmatrix}^{-1} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & 1 \\ 1 & 1 & 3 & 0 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 2 & -1 \\ -1 & 3 \end{bmatrix} = \begin{bmatrix} -0.75 & 2.25 \\ 0.1786 & 0.0357 \\ 0.9286 & -1.2143 \end{bmatrix}$

Linear Regression of multiple outputs
Example 3
Prediction:
Test set: two new samples
{x₁ = 1, x₂ = 6, x₃ = 8} → {y₁ = ?, y₂ = ?}
{x₁ = 1, x₂ = 0, x₃ = −1} → {y₁ = ?, y₂ = ?}

$\hat{Y}_{new} = X_{test}\, \hat{W}$
$\qquad\;\; = \begin{bmatrix} 1 & 6 & 8 \\ 1 & 0 & -1 \end{bmatrix} \begin{bmatrix} -0.75 & 2.25 \\ 0.1786 & 0.0357 \\ 0.9286 & -1.2143 \end{bmatrix}$
$\qquad\;\; = \begin{bmatrix} 7.75 & -7.25 \\ -1.6786 & 3.4643 \end{bmatrix}$

Python demo 2
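The slide only references "Python demo 2"; based on the code fragments visible in the annotations, a plausible sketch (my reconstruction, not the official demo) is:

import numpy as np
from numpy.linalg import inv

X = np.array([[1, 1, 1], [1, -1, 1], [1, 1, 3], [1, 1, 0]])
Y = np.array([[1, 0], [0, 1], [2, -1], [-1, 3]])

W = inv(X.T @ X) @ X.T @ Y
print(W)             # approximately [[-0.75, 2.25], [0.1786, 0.0357], [0.9286, -1.2143]]

X_new = np.array([[1, 6, 8], [1, 0, -1]])
print(X_new @ W)     # approximately [[7.75, -7.25], [-1.6786, 3.4643]]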

Linear Regression of multiple outputs
Example 4
The values of the feature x and their corresponding values of the multiple-output target y are shown in the table below.
Based on least squares regression, what are the values of W?
Based on the learned mapping, when x = 2, what is the value of y?

x:  [3]     [4]       [10]     [6]       [7]
y:  [0, 5]  [1.5, 4]  [-3, 8]  [-4, 10]  [1, 6]

$\hat{W} = X^{\dagger} Y = (X^T X)^{-1} X^T Y = \begin{bmatrix} 1.9 & 3.6 \\ -0.4667 & 0.5 \end{bmatrix}$

Prediction: $\hat{y}_{new} = X_{test}\, \hat{W} = \begin{bmatrix} 1 & 2 \end{bmatrix} \hat{W} = \begin{bmatrix} 0.9667 & 4.6 \end{bmatrix}$

Python demo 3:

import numpy as np
from numpy.linalg import inv
import matplotlib.pyplot as plt

X = np.array([[1, 3], [1, 4], [1, 10], [1, 6], [1, 7]])
Y = np.array([[0, 5], [1.5, 4], [-3, 8], [-4, 10], [1, 6]])

W = inv(X.T @ X) @ X.T @ Y
print(W)                      # approximately [[1.9, 3.6], [-0.4667, 0.5]]

X_test = np.array([[1, 2]])
Y_test = X_test @ W           # approximately [[0.9667, 4.6]]

# plot both outputs against the feature x, together with the predictions at x = 2
plt.plot(X[:, 1], Y[:, 0], 'o', label='y1')
plt.plot(X[:, 1], Y[:, 1], 'x', label='y2')
plt.plot(X_test[:, 1], Y_test[:, 0], '^')
plt.plot(X_test[:, 1], Y_test[:, 1], 's')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()

Summary
• Notations, Vectors, Matrices                      Midterm (1 hour, L1 to L5)
• Operations on Vectors and Matrices
  • Dot-product, matrix inverse
• Systems of Linear Equations ($Xw = y$)
  • Matrix-vector notation, linear dependency, invertibility
  • Even-, over-, under-determined linear systems
• Functions, Derivative and Gradient
  • Inner product, linear/affine functions
  • Maximum and minimum, partial derivatives, gradient
• Least Squares, Linear Regression
  • Objective function, loss function
  • Least squares solution, training/learning and testing/prediction
  • Linear regression with multiple outputs
    Learning/training: $\hat{w} = (X_{train}^T X_{train})^{-1} X_{train}^T y_{train}$
    Prediction/testing: $\hat{y}_{test} = X_{test}\, \hat{w}$
• Classification
• Ridge Regression
• Polynomial Regression

Python packages: numpy, pandas, matplotlib.pyplot, numpy.linalg, sklearn.metrics (for mean_squared_error), and numpy.linalg.pinv
