Machine Learning: Linear Regression
Agenda
- Supervised Learning
- Linear Regression
On Learning and Different Types
What is Learning?
Problem 1
    Input: some cat images C = {C1, C2, ..., Cm} and dog images D = {D1, D2, ..., Dn}
    Objective: identify a new image X as cat or dog
    Approach: ?

Problem 2
    Input: an array of numbers a = [a1, a2, ..., an]
    Objective: sort a in ascending order
    Approach: follow a fixed recipe that works in the same way for all arrays a
Classification of Learning Approaches
Classification: Example
(Figure: classification example. Image taken from Erickson et al., Machine Learning for Medical Imaging, RadioGraphics, 2017.)
Who supervises “Learning”?
Supervised Learning
- Linear regression
- Polynomial regression
- etc.
Supervised Learning - Classification
(Figure: classification example. Image source: https://www.hact.org.uk)

Examples:
- Classifying images based on the objects being depicted
- Classifying market condition as favorable or unfavorable
- Classifying pixels based on membership to object/background, for segmentation
- Predicting the next word based on a sequence of observed words
Supervised Learning - Classification (contd...)

- Logistic regression
- Random forests
- Neural networks
- etc.
Supervised Learning Setup
Supervised Learning Setup: Notation
Supervised Learning Setup: Dimension
- Why?
Some Foundational Aspects of Machine Learning
On Statistical Approach to Machine Learning
Loss Function
We need a guiding mechanism that tells us how good our predictions are for a given input. A loss function ℓ(y, ŷ) assigns a cost to predicting ŷ when the true output is y. For example, the squared loss ℓ(y, ŷ) = (y − ŷ)² is the choice used for linear regression below.
Learning as an optimization
Objective: given a loss function ℓ, the aim is to find f such that the expected loss

    L(f) = E_{(x,y)} [ ℓ(y, f(x)) ]

is minimum.
Diversion: Probability Basics
Empirical Risk

The expected loss cannot be computed without knowing the data distribution, so we approximate it by the average loss over the N training samples:

    L_emp(f) = (1/N) Σ_{n=1}^{N} ℓ(y_n, f(x_n))

i.e.

    f* = argmin_f (1/N) Σ_{n=1}^{N} ℓ(y_n, f(x_n))

- f needs to be simple.
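A minimal numpy sketch of the empirical risk under the squared loss (the data and the candidate model here are toy examples of my own, not from the slides):

    import numpy as np

    def empirical_risk(f, x, y):
        """Average squared loss of model f over the N training points:
        (1/N) * sum_n (y_n - f(x_n))^2."""
        return np.mean((y - f(x)) ** 2)

    # Toy data scattered around y = 2x, and the candidate model f(x) = 2x.
    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.array([0.1, 2.2, 3.9, 6.1])
    print(empirical_risk(lambda x: 2.0 * x, x, y))  # small value: good fit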
Generalizing Capacity (contd...)

(Figure: two curves fit to the same data points. The blue curve has better generalization capacity; the orange curve overfits the data.)
Learning as an Optimization
Linear Regression
Linear Regression: One-dimensional Case

Problem: find a straight line that best fits a given set of points.
Linear Regression: One-dimensional Case (contd...)

- Given {(x_n, y_n)}_{n=1}^{N} with x_n ∈ R and y_n ∈ R, find a straight line that best fits this set of points.
Linear Regression: One-dimensional Case (contd...)

    F = {f_w(x) = wx : w ∈ R}

- F is parametrized by w.
Linear Regression: One-dimensional Case (contd...)

Using the squared loss, the total error of f_w on the data is

    ℓ(f_w) = Σ_{n=1}^{N} (y_n − f_w(x_n))²
Linear Regression: One-dimensional Case (contd...)

Setting the derivative of ℓ(f_w) with respect to w to zero:

    dℓ/dw = Σ_{n=1}^{N} 2(y_n − w x_n)(−x_n) = 0

    ⟹ Σ_{n=1}^{N} (w x_n² − x_n y_n) = 0

    ⟹ w Σ_{n=1}^{N} x_n² = Σ_{n=1}^{N} x_n y_n

    ⟹ w = (Σ_{n=1}^{N} x_n y_n) / (Σ_{n=1}^{N} x_n²)
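As a sanity check, the derived estimator takes two lines of numpy (the data below is a synthetic example of mine, scattered near the line y = 2x):

    import numpy as np

    # Toy 1-D data roughly following y = 2x.
    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.1, 3.9, 6.2, 8.0])

    # Closed-form solution derived above: w = (sum x_n y_n) / (sum x_n^2).
    w = np.sum(x * y) / np.sum(x ** 2)
    print(w)  # close to 2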
Linear Regression (General Formulation)

    y = f_w(x) = b + Σ_{j=1}^{m} w_j φ_j(x)

- w_j: model parameters
- φ_j: basis functions (change the representation of x)

or, in vector form,

    y = b + wᵀφ(x),  where  wᵀ = [w_1, ..., w_m],  φᵀ = [φ_1, ..., φ_m]
Linear Regression (contd...)

- Model: y = f_w(x) = b + Σ_{j=1}^{m} w_j φ_j(x)
- With the identity basis φ(x) = x, the model is y = b + wᵀx.

Stacking all N training samples, we get

    Y = XW + b
Linear Regression (contd...)

- We have Y = XW + b:

    [y_1]   [x_11 x_12 ... x_1d] [w_1]   [b]
    [ : ] = [  :    :       :  ] [ : ] + [:]
    [y_N]   [x_N1 x_N2 ... x_Nd] [w_d]   [b]

- Absorb the bias b into the weights by prepending a column of ones to X:

    [y_1]   [1 x_11 x_12 ... x_1d] [ b ]
    [ : ] = [1 x_21 x_22 ... x_2d] [w_1]
    [ : ]   [:   :    :       :  ] [ : ]
    [y_N]   [1 x_N1 x_N2 ... x_Nd] [w_d]

    (N×1)        (N×(d+1))       ((d+1)×1)

    ⟹ Y = XW
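A short numpy sketch of the bias-absorption step (shapes only; the random data and the variable names are mine, not from the slides):

    import numpy as np

    N, d = 5, 3
    X = np.random.randn(N, d)                # N x d data matrix
    X_aug = np.hstack([np.ones((N, 1)), X])  # N x (d+1): column of ones prepended
    # Now Y = X_aug @ W with W = [b, w_1, ..., w_d]^T, so the bias b is
    # learned as just another weight.
    print(X_aug.shape)  # (5, 4)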
Linear Regression (contd...)

Final solution:

    w = (Σ_{n=1}^{N} x_n x_nᵀ)⁻¹ (Σ_{n=1}^{N} y_n x_n)
      = (XᵀX)⁻¹ XᵀY
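A minimal sketch of this solution on synthetic data (assumptions: X already contains a column of ones if a bias is wanted; np.linalg.solve is used instead of forming the inverse explicitly, which is numerically preferable):

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 100, 3
    X = rng.normal(size=(N, d))
    true_w = np.array([1.0, -2.0, 0.5])
    Y = X @ true_w + 0.1 * rng.normal(size=N)  # noisy linear data

    # Normal equations: (X^T X) W = X^T Y.
    W = np.linalg.solve(X.T @ X, X.T @ Y)
    print(W)  # approximately [1.0, -2.0, 0.5]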
Some Remarks
- Regularization.
Ridge Regression (Linear Regression with Regularization)
The ridge objective adds a penalty on the size of the weights:

    L(w) = Σ_{n=1}^{N} (y_n − wᵀx_n)² + λ||w||²

- Here ||w||² = wᵀw.
- λ is a hyperparameter that controls the amount of regularization.
- Solution: set the gradient to zero,

    ∂L(w)/∂w = Σ_{n=1}^{N} 2(y_n − wᵀx_n)(−x_n) + 2λw = 0
Ridge Regression (contd...)

    ⟹ λw = Σ_{n=1}^{N} x_n (y_n − x_nᵀw)

    ⟹ λw = Σ_{n=1}^{N} x_n y_n − Σ_{n=1}^{N} x_n x_nᵀ w

    ⟹ λW = XᵀY − XᵀXW
    ⟹ λW + XᵀXW = XᵀY
    ⟹ (λI_d + XᵀX)W = XᵀY
    ⟹ W = (XᵀX + λI_d)⁻¹ XᵀY

Note: XᵀX is a d × d matrix.
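The ridge solution is one more line of numpy on the same kind of synthetic data (lam is a hypothetical choice of the hyperparameter λ):

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 100, 3
    X = rng.normal(size=(N, d))
    Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

    lam = 0.1  # regularization strength (hyperparameter)
    # W = (X^T X + lambda I_d)^{-1} X^T Y, solved without an explicit inverse.
    W_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
    print(W_ridge)  # slightly shrunk toward zero relative to least squares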
On Regularization

Ridge Regression Solution
Gradient Descent Solution for Least Squares
    W* = (XᵀX)⁻¹ XᵀY
    W*_reg = (XᵀX + λI_d)⁻¹ XᵀY

Both closed-form solutions require solving with a d × d matrix, which is costly for large d; gradient descent offers an iterative alternative.
Gradient Descent Procedure
Procedure: starting from an initial guess w^(0), repeat

    w^(t) = w^(t−1) − η · (∂L/∂w)|_{w = w^(t−1)}

where η > 0 is the learning rate.
Gradient Descent Procedure (contd...)
We have

    ∂L/∂w = −2 Σ_{n=1}^{N} x_n (y_n − x_nᵀw)

Procedure: plug this gradient into the update rule above and iterate until convergence.
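A minimal gradient descent sketch for least squares (the step size eta and the iteration count are hypothetical choices that happen to work for this toy problem):

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 100, 3
    X = rng.normal(size=(N, d))
    Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

    w = np.zeros(d)   # w^(0): start from zero
    eta = 0.001       # learning rate
    for _ in range(500):
        grad = -2.0 * X.T @ (Y - X @ w)  # dL/dw at the current iterate
        w = w - eta * grad               # w^(t) = w^(t-1) - eta * dL/dw
    print(w)  # approaches the closed-form solution (X^T X)^{-1} X^T Y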
On Convexity
Convex Functions: f is convex if f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) for all x, y and all α ∈ [0, 1]. The least-squares and ridge objectives are convex, so gradient descent (with a suitable step size) converges to a global minimum.
On ℓ1 Regularizer

The ℓ1 regularizer is R(w) = ||w||_1 = Σ_{j=1}^{d} |w_j|.
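Unlike ridge, ℓ1-regularized least squares (the lasso) has no closed-form solution. One standard way to solve it, not covered on these slides, is proximal gradient descent (ISTA), where each step soft-thresholds the coordinates; the sketch below uses hypothetical choices of λ and step size on toy data:

    import numpy as np

    def soft_threshold(z, t):
        """Proximal operator of t * ||.||_1."""
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    rng = np.random.default_rng(0)
    N, d = 100, 5
    X = rng.normal(size=(N, d))
    Y = X @ np.array([1.0, 0.0, 0.0, -2.0, 0.0]) + 0.1 * rng.normal(size=N)

    lam, eta = 5.0, 0.001
    w = np.zeros(d)
    for _ in range(1000):
        grad = -2.0 * X.T @ (Y - X @ w)        # gradient of the squared loss
        w = soft_threshold(w - eta * grad, eta * lam)
    print(w)  # l1 tends to drive some coordinates exactly to zero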