
Applied Machine Learning Systems 1
Professor Miguel Rodrigues
University College London

Instructors
• Prof Miguel Rodrigues
• Prof Yannis Andreopoulos
Course Outline
• Introduction
  • Historical perspective

• Supervised learning
  • Linear regression in one variable and multiple variables
  • Polynomial regression in multiple variables
  • Regularization
    • Ridge regression
    • Least Absolute Shrinkage and Selection Operator (LASSO)
  • Model selection / cross-validation
  • Logistic regression, softmax regression (classification methods)
  • Gradient descent, stochastic gradient descent, proximal gradient method
  • Boosting, decision trees & random forests, KNNs, recommender systems
Course Outline
• Unsupervised learning
  • Data clustering
    • k-means clustering algorithm
    • Hierarchical clustering algorithm
  • Dimensionality reduction
  • Feature selection / learning
  • Density estimation

• Kernel learning
  • Support vector machines
  • Kernels

• Introduction to neural networks
  • Representation
  • Learning
  • Convolutional neural networks and variants
Timeline (6 lectures and 2 labs)

• Lecture 01 / Tue 08th October, 2-6pm
  • Overview of machine learning including the supervised learning, unsupervised learning, and reinforcement learning paradigms; Linear Regression; Polynomial Regression; Regularisation; Ridge Regression; Least Absolute Shrinkage and Selection Operator (LASSO); Cross-Validation

• Lecture 02 / Tue 22nd October, 2-6pm
  • Logistic Regression; Softmax Regression; Gradient Descent; Stochastic Gradient Descent; Proximal Gradient Method

• Lecture 03 / Tue 29th October, 2-6pm
  • Boosting; Decision Trees; Random Forests; KNN; Recommender Systems
Timeline

• Lecture 04 / Tue 12th November, 2-6pm
  • Introduction to Clustering; K-Means Clustering Algorithm; Hierarchical Clustering Algorithms; Spectral Clustering; Dimensionality Reduction (PCA, LDA, CS, other); Feature Selection-Learning; Density Estimation

• Lecture 05 / Tue 19th November, 2-6pm
  • Kernel learning: creating non-linear algorithms by “kernelization”; Support Vector Machines for classification and regression

• Lecture 06 / Tue 20th November, 2-6pm
  • Neural networks: Representation; Learning; Convolutional Neural Networks and variants
Timeline

• Lab 1 / Tue 15th October, 2-6pm
  • This 3-hour lab will cover topics: a) supervised learning I, b) supervised learning II, c) supervised learning III, d) unsupervised learning (Jupyter notebooks, Python)

• Lab 2 / Tue 03rd December, 2-6pm
  • This 3-hour lab will cover topics: a) kernel learning, b) neural networks (prepares for the assignment)
Assessment
• Mini-project based assessment (100%) involving
  • A programming assignment (computer vision challenge)
  • A report of the assignment

• Mini-project timelines
  • Hand-out: Friday, 25th October, 2019
  • Hand-in: Monday, 06th January, 2020 (after Christmas)
Bibliography
• K. P. Murphy. Machine Learning: A Probabilistic Perspective. Cambridge, MA: MIT Press, 2012.

• S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning. New York, NY: Cambridge University Press, 2014.

• I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. Cambridge, MA: MIT Press, 2017.
1. Introduction
A Brief History of AI
Artificial Intelligence has been capturing our imagination for decades, making its way into pop culture through mainstream Hollywood movies (1968, 1977, 1984, 1999, 2001, 2015).


An (Actual) Brief History of AI

Genesis of AI
• 1948: First computer program
• 1950: The Turing test: “Can machines think?”
• 1956: The birth of AI – John McCarthy, AAAI
• 1957: First neural network by Frank Rosenblatt
• 1975–80: AI hype leads to the first AI winter

Resurgence of AI
• 1980: The rise of expert systems
• 1982: Revival of connectionism, backpropagation
• 1987–93: Second AI winter
• 1990s: Age of knowledge representation
• 1997: IBM Deep Blue beats Kasparov in chess

Current AI
• 2006: Start of the deep learning era; Netflix RecSys challenge
• 2009: GPUs used to train deep learning models
• 2011: IBM Watson wins Jeopardy
• 2014: DeepMind funded; Facebook develops DeepFace
• 2015: Google’s TensorFlow; Amazon’s ML platform; Microsoft’s ML Toolkit; OpenAI and ethics
• 2016: AlphaGo beats Lee Sedol in Go

An (Actual) Brief History of AI

[Timeline figure, 2011–2017: the shift from Artificial Intelligence to Machine Learning, with milestones including Siri, Watson, Eugene, Alexa, Tay and AlphaGo.]
Artificial Intelligence
Current AI: Hype or the Start of a Revolution?

• Perception: computer vision; speech understanding; language understanding; knowledge representation
• Learning: supervised learning; unsupervised learning; reinforcement learning
• Interaction: planning; search; reasoning; games; common sense
Artificial Intelligence: Ingredients

• Artificial Narrow Intelligence (Current AI): AI currently out-performs humans in some specific, narrow tasks.
• Artificial General Intelligence (Future AI): performing broad tasks and reasoning comparable to a human.
• Artificial Super Intelligence (Singularity): powerful superintelligence that would, qualitatively, far surpass all human intelligence.
Artificial Intelligence: Approaches / Tools

• Logic Based Approaches: these tools enable solving AI problems such as knowledge representation.
• Probabilistic/Statistical Based Approaches: these tools – notably, machine learning methods – are leading to state-of-the-art results in various prediction problems arising in artificial intelligence challenges.
• Optimization Based Approaches: these tools enable solving many AI problems by searching intelligently over many solutions.
• Statistical Learning Based Approaches: these approaches allow one to tackle AI problems where an agent operates within an environment subject to uncertainty.
Artificial Intelligence: What is different now?

• Massive Amounts of Data (Big Data): massive amounts of data deriving from (1) the internet, (2) social media, (3) the internet of things and other sources have become crucial to power artificial intelligence.
• Massive Computational Power (GPUs/CPUs): GPUs have also been playing a major role in fuelling the artificial intelligence revolution.
• State-of-the-Art Algorithms (Deep Learning): deep learning – involving the use of neural networks consisting of various layers – provides state-of-the-art machine learning approaches fuelling the artificial intelligence revolution.
Warm-Up

Input: an image. Output: a one-hot label over the classes {Cat, Dog, Bird, Horse}, produced by a prediction rule.

• Example 1 (cat image) → Cat 1, Dog 0, Bird 0, Horse 0
• Example 2 (dog image) → Cat 0, Dog 1, Bird 0, Horse 0
• Example 3 (horse image) → Cat 0, Dog 0, Bird 0, Horse 1
A Rule Based Approach

Concept: the rules are pre-specified (hand-crafted), not learnt from data.

A new input (e.g. the raw data 10010010) is passed through a pre-specified classification rule:

    if condition 1 satisfied
    ...
    if condition n satisfied
        output cat
    else
        output dog
    end

The output is the predicted label (e.g. Cat 1, Dog 0, Bird 0, Horse 0).
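As a minimal Python sketch of this idea (the feature names and conditions below are illustrative assumptions, not taken from the slides):

```python
# A hand-crafted rule: the conditions are pre-specified by a person,
# not learnt from data. The feature names are illustrative assumptions.
def rule_based_classifier(features):
    if features.get("has_whiskers") and features.get("ear_shape") == "pointed":
        return "cat"
    else:
        return "dog"

# Applying the pre-specified rule to a new input
print(rule_based_classifier({"has_whiskers": True, "ear_shape": "pointed"}))  # -> cat
```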
The Learning Based Approach

Concept: the classification rule is learnt automatically from labelled examples.

• Training examples (images, each labelled with a one-hot vector over {Cat, Dog, Bird, Horse}) are fed to a learning algorithm.
• The learning algorithm produces a classification rule.
• The learnt classification rule is then applied to a new input to produce a new output (e.g. Cat 1, Dog 0, Bird 0, Horse 0).

What is machine learning?

“Field of study that gives computers the ability to learn without being explicitly programmed.”
Arthur Samuel (1959)

Rules are learnt automatically based on data/experience.
Machine Learning Paradigms

• Supervised Learning: the machine learning task of inferring a function describing the relationship between independent and dependent variables in a dataset. It relies on labelled examples.
• Unsupervised Learning: the machine learning task of inferring a function describing the structure of a dataset. It relies on unlabelled data.
• Reinforcement Learning: concerned with how software agents ought to take actions in an environment so as to maximize some cumulative reward.
Supervised Learning

Concept
The task of inferring a function describing the relationship between independent and dependent variables in a dataset (based on labelled examples). Labelled examples (e.g. images with one-hot labels over {Cat, Dog, Bird, Horse}) are fed to a learning algorithm, which outputs a prediction rule mapping inputs to outputs.

Applications
Supervised learning can be used to solve various tasks:
• Classification tasks
• Regression tasks
• Forecasting tasks
• Recommender systems
• Anomaly detection tasks
Variant: Semi-Supervised Learning

Only some of the training examples carry labels; the learning algorithm uses both the labelled and the unlabelled examples to produce the prediction rule mapping inputs to outputs.
Examples: Supervised Learning

Classification
Predict the class of an item given various examples.
Example: classification of tumors as benign or malignant based on appearance (feature 1) and tumor size (feature 2).

Forecasting
Predict future values given past ones (time-series data).
Example: prediction of tomorrow’s stock value based on the previous stock values.

Regression
Predict the value of a variable given other variables.
Example: prediction of a car’s price (output variable) based on the car’s mileage (input variable).
Unsupervised Learning

Concept
The task of inferring a function describing the structure of a dataset (based on unlabelled data). For example, a learning algorithm given unlabelled images may group them into clusters (cluster 1, cluster 2, cluster 3).

Examples
Examples of unsupervised learning tasks include:
• Clustering tasks
• Dimensionality reduction tasks
• Recommender systems
• Anomaly detection tasks
Reinforcement Learning

Concept
Reinforcement learning represents an agent’s attempt to approximate the environment’s function, in order to determine the agent’s actions on the black-box environment that maximize the agent’s rewards. The agent acts on the environment and receives observations and rewards in return.

Applications
Robotics, industrial automation, education and training, health and medicine, media and advertising, finance.
The Machine Learning Landscape

Note: there are also now state-of-the-art approaches such as deep learning and deep reinforcement learning.

https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

Machine Learning: Real-World Examples
• Related news articles are automatically clustered (e.g. news aggregation services).
The Machine Learning Process

• Problem Definition
• Data Collection
• Data Preparation: pre-processing, feature extraction, feature normalization
• Model Selection / Model Evaluation
• Learning: model fitting via a training algorithm
• Model Usage: prediction / testing with the learnt rule
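As a rough sketch of how these stages can map onto code (the synthetic dataset, the use of scikit-learn and the choice of a linear model are all illustrative assumptions, not part of the slides):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Data collection (synthetic stand-in for a real dataset)
rng = np.random.default_rng(0)
X = rng.uniform(0, 150_000, size=(100, 1))                 # e.g. mileage
y = 18_000 - 0.1 * X[:, 0] + rng.normal(0, 500, size=100)  # noisy prices

# Data preparation: train/test split and feature normalization inside a pipeline
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = make_pipeline(StandardScaler(), LinearRegression())

# Learning: model fitting on the training data
model.fit(X_train, y_train)

# Model usage: prediction and testing on held-out data
print("Test R^2:", model.score(X_test, y_test))
```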
2. Supervised Learning:
Simple Linear Regression
Linear Regression

Example
• Prediction of second-hand car prices based on features such as mileage

Training set (training samples (xᵢ, yᵢ)):
  Input Variable (Mileage, km)   Output Variable (Cost, £)
  25,000                         16,000
  105,000                        11,500
  120,000                        6,000
  140,000                        3,000
  45,000                         13,500

Testing set (test samples):
  Input Variable (Mileage, km)   Output Variable (Cost, £)
  50,000                         ?
  90,000                         ?

• It is a supervised learning problem because one has access to input-output examples
• It is a regression problem because the output variable is continuous-valued
• Input variables are also known as features or regressors
• The output variable is also known as the response or prediction
Linear Regression: Approach

Process
• One is given access to training data – consisting of various feature-response pairs – and testing data – consisting of feature points with unknown responses.

• One is also given a hypothesis (or model) class containing a series of hypotheses (or models) that potentially explain the relationship between the features and responses.

• The learning algorithm selects a hypothesis (or model) from the hypothesis (or model) class that fits the training data.

• The selected hypothesis can then be used on the testing data to determine the responses associated with the new features.
Linear Regression: Model

Linear Model
In linear regression, the relationship between the input and output variables is expressed as follows:

    y = α + βx

where α and β are the model parameters.

• h(·) is the hypothesis: y = h(x)
• h(·) is a linear function

Examples
• y = α + βx (intercept α, slope β)
• y = βx with α = 0, β > 0 (a line through the origin with positive slope)
• y = βx with α = 0, β < 0 (a line through the origin with negative slope)
Linear Regression: Cost Function

How does the learning algorithm select the linear hypothesis / model? We need a cost function…

[Figures: training points (xᵢ, yᵢ) (true values), the predicted values (xᵢ, ŷᵢ) = (xᵢ, α + βxᵢ) on candidate lines, and the residuals rᵢ between true and predicted values.]
Linear Regression: Cost Function

How does the learning algorithm select the linear hypothesis / model? We need a cost function…

Cost Function – Sum-of-Squared Residuals
[Figure: true values (xᵢ, yᵢ), predicted values (xᵢ, ŷᵢ) and the residuals rᵢ between them.]

We can now define two quantities that can be used to assess the model fit to the data:

(1) The individual residual measures how much the predicted output value deviates from the true output value given the input value:

    rᵢ = yᵢ − ŷᵢ = yᵢ − (α + βxᵢ)

(2) The sum of squared residuals measures how well overall the model fits the data:

    SSR(α, β) = (1/N) Σᵢ₌₁ᴺ rᵢ² = (1/N) Σᵢ₌₁ᴺ (yᵢ − (α + βxᵢ))²
Linear Regression: Learning Algorithm

Learning Algorithm
We should select the linear regression model parameters that minimize the sum-of-squared residuals as follows:

    (α*, β*) = argmin over α, β of SSR(α, β) = argmin over α, β of (1/N) Σᵢ₌₁ᴺ (yᵢ − (α + βxᵢ))²

The minimizer of the sum of squares is found by setting the first derivatives with respect to α and β to zero. This leads to the optimal model parameters given by:

    α* = ȳ − β*·x̄        β* = Σᵢ₌₁ᴺ xᵢ(yᵢ − ȳ) / Σᵢ₌₁ᴺ xᵢ(xᵢ − x̄)

where x̄ = (1/N)·Σᵢ₌₁ᴺ xᵢ and ȳ = (1/N)·Σᵢ₌₁ᴺ yᵢ.

New Predictions
The new response ŷ associated with a new data point x is now given by the fitted linear model:

    ŷ = α* + β*·x
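A short sketch of this closed-form solution in Python/NumPy, using the car-price training set from the earlier slides; the predicted prices for the two test mileages are only as good as the linear model itself:

```python
import numpy as np

# Training set from the car-price example
x = np.array([25_000, 105_000, 120_000, 140_000, 45_000], dtype=float)
y = np.array([16_000, 11_500, 6_000, 3_000, 13_500], dtype=float)

x_bar, y_bar = x.mean(), y.mean()

# Closed-form estimates obtained by setting the derivatives of the SSR to zero
beta_star = np.sum(x * (y - y_bar)) / np.sum(x * (x - x_bar))
alpha_star = y_bar - beta_star * x_bar

# New predictions for the test mileages (50,000 km and 90,000 km)
x_new = np.array([50_000, 90_000], dtype=float)
y_hat = alpha_star + beta_star * x_new
print(alpha_star, beta_star, y_hat)
```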
3. Supervised Learning:
Multivariate Linear Regression
Multivariate Linear Regression

Example
• Prediction of second-hand car prices based on features such as mileage, # doors, # cylinders
• It is a supervised learning problem because one has access to input-output examples
• It is a regression problem because the output variable is continuous-valued
• Input variables are also known as features or regressors
• The output variable is also known as the response or prediction

Training set (training samples (xᵢ,₁, xᵢ,₂, xᵢ,₃, yᵢ)):
  Mileage (km)   # doors   # cylinders   Cost (£)
  25,000         3         6             16,000
  105,000        5         12            11,500
  120,000        5         8             6,000
  140,000        3         8             3,000
  45,000         3         12            13,500

Testing set (test samples; cost unknown):
  Mileage (km)   Cost (£)
  50,000         ?
  90,000         ?
Multivariate Linear Regression

Process
• One is given access to training data – consisting of various feature-response pairs – and testing data – consisting of feature points with unknown responses.

• One is also given a hypothesis (or model) class containing a series of hypotheses (or models) that potentially explain the relationship between the features and responses.

• The learning algorithm selects a hypothesis (or model) from the hypothesis (or model) class that fits the training data.

• The selected hypothesis can then be used on the testing data to determine the responses associated with the new features.
Multivariate Linear Regression: Model

Multivariate Linear Model
The relationship between the input and output variables can be expressed as follows:

    yᵢ = β₀ + β₁xᵢ,₁ + ⋯ + βₚxᵢ,ₚ    (i = 1, …, N)

where βⱼ, j = 0, …, p are the model parameters.

The relationship can also be expressed in matrix form, which makes it much easier to work with:

    y = Xβ

where y is the N×1 vector of responses, β = (β₀, β₁, …, βₚ)ᵀ is the vector of model parameters (the weight of each feature), and X is the N×(p+1) design matrix whose i-th row is (1, xᵢ,₁, …, xᵢ,ₚ), i.e. the feature values of the i-th training sample preceded by a 1.

Geometrical Interpretation
[Figure: linear least squares fitting with X ∈ ℝ² – the fitted model is a plane over the feature axes X1 and X2.]
Multivariate Linear Regression: Model

Multivariate Linear Model
For the car-price training set, the matrix form y = Xβ reads (first and last training samples shown):

    [16,000; …; 13,500] = [[1, 25,000, 3, 6]; …; [1, 45,000, 3, 12]] · [β₀; β₁; β₂; β₃]

Each row of X contains a 1 (for the intercept β₀) followed by the feature values of one training sample; the coefficients β₀, β₁, β₂, β₃ weight the contribution of each feature to the predicted output.

Geometrical Interpretation
[Figure: the fitted plane over the feature axes X1 and X2. Different parameter vectors β give different planes that explain the data differently – which one to pick is determined by the cost function.]
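A small sketch of how the design matrix X can be assembled in NumPy from the multivariate training set (the column of ones absorbs the intercept β₀):

```python
import numpy as np

# Feature values from the multivariate car example: mileage, # doors, # cylinders
features = np.array([
    [25_000, 3, 6],
    [105_000, 5, 12],
    [120_000, 5, 8],
    [140_000, 3, 8],
    [45_000, 3, 12],
], dtype=float)
y = np.array([16_000, 11_500, 6_000, 3_000, 13_500], dtype=float)

# Design matrix X: prepend a column of ones so that y = X @ beta
# with beta = (beta_0, beta_1, ..., beta_p)
X = np.hstack([np.ones((features.shape[0], 1)), features])
print(X.shape)  # (5, 4): N = 5 samples, p + 1 = 4 parameters
```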
Multivariate Linear Regression: Cost Function

How does the learning algorithm select the linear hypothesis / model? We need a cost function…

[Figure: linear least squares fitting with X ∈ ℝ²; true response values vs. values predicted by the fitted plane.]

Cost Function – Sum-of-Squared Residuals
Note that the true response values and the predicted response values are given by:

    y = (y₁, …, y_N)ᵀ   and   ŷ = (ŷ₁, …, ŷ_N)ᵀ = Xβ

Note also that the sum-of-squared residuals is given by:

    SSR(β) = (y − ŷ)ᵀ(y − ŷ) = (y − Xβ)ᵀ(y − Xβ)
Multiv. Linear Regression: Learning Algorithm

Learning Algorithm
We should select the linear regression model parameters that minimize the sum-of-squared residuals as follows:

    β* = argmin over β of SSR(β) = argmin over β of (y − Xβ)ᵀ(y − Xβ)

Setting the gradient ∇β SSR(β) = 0 leads to the optimal model parameters given by:

    β* = (XᵀX)⁻¹Xᵀy = X†y

where X† is the pseudo-inverse of the (potentially rectangular) matrix X.

New Predictions
The new response ŷ associated with a new data point x (a row vector with a leading 1) is now given by:

    ŷ = x·(XᵀX)⁻¹Xᵀy = x·β*
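A minimal sketch of this closed-form solution using NumPy's pseudo-inverse; X and y are the design matrix and response vector built on the previous slides, and the # doors / # cylinders values of the new data point are illustrative assumptions:

```python
import numpy as np

# Design matrix (leading column of ones for the intercept) and responses
X = np.array([
    [1, 25_000, 3, 6],
    [1, 105_000, 5, 12],
    [1, 120_000, 5, 8],
    [1, 140_000, 3, 8],
    [1, 45_000, 3, 12],
], dtype=float)
y = np.array([16_000, 11_500, 6_000, 3_000, 13_500], dtype=float)

# beta* = (X^T X)^{-1} X^T y = X^+ y; np.linalg.pinv computes the pseudo-inverse,
# which is numerically safer than inverting X^T X explicitly.
beta_star = np.linalg.pinv(X) @ y

# Prediction for a new data point x (row vector with a leading 1);
# the doors/cylinders values here are made up for illustration.
x_new = np.array([1, 50_000, 3, 6], dtype=float)
print(x_new @ beta_star)
```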
4. Supervised Learning:
Polynomial Regression
Polynomial Regression

Simple Linear Regression vs. Polynomial Regression
• First-order model: y = β₀ + β₁x
• Second-order model: y = β₀ + β₁x + β₂x²
• Third-order model: y = β₀ + β₁x + β₂x² + β₃x³
• Square-root model: y = β₀ + β₁√x
Polynomial Regression: How to Apply?

Example: Second-order Polynomial Regression Model

    y = β₀ + β₁x + β₂x²   →   y = β₀ + β₁x₁ + β₂x₂   with x₁ = x, x₂ = x²

Renaming the powers of x as new features converts the polynomial model back into a (multivariate) linear model, so we can simply apply multivariate linear regression.

The features will have quite different scales (x vs. x²)! We will see later the importance of scaling them.
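A short sketch of this renaming trick on synthetic data (the data-generating model and noise level are illustrative assumptions):

```python
import numpy as np

# Synthetic data from a second-order model plus noise (illustrative)
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=50)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0, 0.3, size=50)

# Rename x1 = x, x2 = x^2 and reuse the multivariate least-squares solution
X = np.column_stack([np.ones_like(x), x, x**2])
beta_star = np.linalg.pinv(X) @ y
print(beta_star)   # roughly [1.0, 2.0, -0.5]

# Note how different the scales of x and x^2 can be for large inputs -
# this is why feature scaling (discussed later) matters.
```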
The Problem of Overfitting

Does a higher-order model produce better results compared to a lower-order one?

First-Order Model: y = β₀ + β₁x
Third-Order Model: y = β₀ + β₁x + β₂x² + β₃x³
Ninth-Order Model: y = β₀ + β₁x + β₂x² + ⋯ + β₉x⁹

[Figures: a true model (a sine-like wave) with noisy observations, together with the fitted model in each case.]

• The first-order model under-fits: it is too simple to explain the true model.
• The third-order model strikes a good balance between under-fitting and over-fitting.
• The ninth-order model over-fits: it becomes too accustomed to the training data and fits the noise, so its predictions fail to explain the true model.
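A hedged reconstruction of the experiment sketched above, fitting polynomials of order 1, 3 and 9 to a noisy sine wave (the sample size and noise level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.size)  # true model + noise

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, deg=degree)      # least-squares polynomial fit
    y_hat = np.polyval(coeffs, x)
    print(f"degree {degree}: training MSE = {np.mean((y - y_hat) ** 2):.4f}")

# The degree-9 polynomial drives the training error towards zero by fitting
# the noise itself, which is exactly the over-fitting behaviour shown above.
```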
Bias-Variance Tradeoff

The use of a small hypothesis space – such as one consisting of linear models – can lead to underfitting to the training data. It is also known as a high-bias setting because we are “biasing” the learning algorithm – due to possible preconceptions about the data – to choose a very constrained model.

Conversely, the use of a large hypothesis space – such as one consisting of high-order polynomials – can lead to overfitting to the training data. It is typically known as a high-variance setting.

Both a high-bias and a high-variance setting can lead to poor prediction performance in the presence of new data.

There is typically a bias-variance tradeoff in the sense that one cannot achieve both low bias and low variance simultaneously.
The Problem of Overfitting

Does a higher-order model produce better results compared to a lower-order one?

• First-Order Model: under-fitting (high bias)
• Third-Order Model: good fit
• Ninth-Order Model: over-fitting (high variance)
How to Deal with Overfitting?

Consider, for example, the model y = β₀ + β₁x + β₂x² + β₃x³.

• Feature Selection
  o This can involve manual selection of relevant features (omitting, say, the features with the least significance for the output).
  o It can be problematic because such a feature selection procedure can also lead to loss of valuable information for prediction purposes.

• Regularization
  o This involves automatic reduction of the magnitude of some of the model parameters.
  o It can work well in problems involving various features, with each one contributing a bit to the prediction.
5. Regularization
Regularization

Regularization techniques can address over-fitting by penalizing certain large parameter values.

Model: y = β₀* + β₁*x + β₂*x² + β₃*x³ + β₄*x⁴

Without Regularization – parameter selection:

    β* = argmin over β of (y − Xβ)ᵀ(y − Xβ)

With Regularization – parameter selection with penalization (here β₃ and β₄ are multiplied by very large numbers, so the selected values tend to be small):

    β* = argmin over β of (y − Xβ)ᵀ(y − Xβ) + 10⁴·β₃² + 10⁴·β₄²
Regularization

We do not want manual tuning of the importance of the model parameters.

Without Regularization

    (α*, β*) = argmin over α, β of SSR(α, β) = argmin over α, β of (1/N) Σᵢ₌₁ᴺ (yᵢ − (α + βxᵢ))²

With Regularization

    (α*, β*) = argmin over α, β of (1/N) Σᵢ₌₁ᴺ (yᵢ − (α + βxᵢ))² + λ Σⱼ βⱼ²

where λ is the regularization parameter (a hyper-parameter). The penalization factor penalizes the parameters, forcing them to be small.
Regularization: Ridge Regression

Simple Linear Regression – parameter estimates:

    β* = argmin over β of (y − Xβ)ᵀ(y − Xβ)            ⇒   β* = (XᵀX)⁻¹Xᵀy

Ridge Regression – parameter estimates (SSR term plus regularization term; the solution is again obtained by setting the gradient with respect to β to zero):

    β* = argmin over β of (y − Xβ)ᵀ(y − Xβ) + λ·βᵀβ    ⇒   β* = (XᵀX + λ·I)⁻¹Xᵀy

where I is the identity matrix.

Predictions:

    without regularization:  ŷ = x·(XᵀX)⁻¹Xᵀy
    with ridge regression:   ŷ = x·(XᵀX + λ·I)⁻¹Xᵀy

NB: the matrix XᵀX + λ·I is always invertible, whereas the matrix XᵀX may not be invertible (e.g. if the number of features is higher than the number of examples / samples).
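A minimal sketch of the ridge closed form in NumPy, on illustrative data with more features than samples so that XᵀX alone would not be invertible:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: beta* = (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Illustrative data: 20 samples, 50 features, so X^T X is singular
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
beta_true = np.zeros(50)
beta_true[:5] = [3.0, -2.0, 1.5, 0.5, 4.0]
y = X @ beta_true + rng.normal(0, 0.1, size=20)

beta_ridge = ridge_fit(X, y, lam=1.0)   # well-defined despite p > N
print(beta_ridge[:5])
```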
Regularization: Ridge Regression

What is the role of the regularization parameter λ?

Parameter estimates:  β* = (XᵀX + λ·I)⁻¹Xᵀy

• Very small λ ⇒ the optimal parameters for ridge regression correspond to those of simple linear regression.
• Very large λ ⇒ the optimal parameters for ridge regression become approximately equal to zero.
• A reasonable λ sits between these two extremes.

We need a disciplined way to select the regularization parameter ⟹ Cross-Validation
Regularization: LASSO – Least Absolute Shrinkage and Selection Operator

Ridge Regression (shrinks the β coefficients) – parameter estimates and predictions:

    β* = argmin over β of (y − Xβ)ᵀ(y − Xβ) + λ·‖β‖₂²   ⇒   β* = (XᵀX + λ·I)⁻¹Xᵀy

    ŷ = x·(XᵀX + λ·I)⁻¹Xᵀy

LASSO (sets some β coefficients exactly to zero, performing feature selection) – parameter estimates:

    β* = argmin over β of (y − Xβ)ᵀ(y − Xβ) + λ·‖β‖₁

where λ is the regularization hyper-parameter, ‖β‖₂ is the L2 norm of β and ‖β‖₁ is the L1 norm of β.

• The key difference between LASSO and ridge regression is that the LASSO uses an L1 instead of an L2 penalty.
• This forces some of the coefficients to be set exactly equal to zero, provided that the hyper-parameter λ is sufficiently large.
• Thus, the LASSO performs variable/feature selection and shrinkage at the same time.
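A short illustrative comparison of the two penalties, assuming scikit-learn is available (its `Lasso` and `Ridge` estimators solve the problems above, with `alpha` playing the role of λ up to scaling); the data are synthetic with only a few truly relevant features:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic sparse problem: only the first 3 of 20 features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [4.0, -3.0, 2.0]
y = X @ beta_true + rng.normal(0, 0.5, size=100)

lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty: sets many coefficients to zero
ridge = Ridge(alpha=0.5).fit(X, y)   # L2 penalty: shrinks but rarely zeroes

print("non-zero LASSO coefficients:", int(np.sum(lasso.coef_ != 0)))
print("non-zero ridge coefficients:", int(np.sum(ridge.coef_ != 0)))
```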
Regularization: LASSO – Least Absolute Shrinkage and Selection Operator

[Figure: the LASSO coefficient path – the β coefficients plotted as a function of λ; as λ increases, more and more coefficients are set to zero (feature selection).]

• When λ = 0, the LASSO solution reduces to the simple linear regression solution.
• When λ → ∞, the LASSO solution is such that the coefficient estimates approach zero.
LASSO vs Ridge Regression

Ridge Regression
The ridge regression problem can be re-expressed as follows:

    β* = argmin over β of (y − Xβ)ᵀ(y − Xβ)   subject to   ‖β‖₂ ≤ s

where the value of s depends on the value of the hyper-parameter λ.

LASSO
The LASSO problem can be re-expressed as follows:

    β* = argmin over β of (y − Xβ)ᵀ(y − Xβ)   subject to   ‖β‖₁ ≤ s

where the value of s depends on the value of the hyper-parameter λ.

[Figure: contours of the SSR around the simple linear regression solution, together with the L2 constraint region (ridge) and the diamond-shaped L1 constraint region (LASSO).]

The LASSO and ridge regression coefficient estimates are given by the first point at which the ellipse contacts the constraint region. Because the L1 constraint region has corners on the axes, this first contact often occurs where some coefficients are exactly zero – the L1 constraint promotes sparsity, which is why the LASSO acts as a feature selector, whereas the ridge solution is generally not sparse.
LASSO vs Ridge Regression

• Advantage: LASSO has an advantage over ridge regression, in that it produces simpler and more interpretable models that involve only a subset of the predictors.

• Similarity: LASSO leads to qualitatively similar behaviour to ridge regression, in that as λ increases, the variance decreases and the bias increases.

• Advantage: LASSO can generate more accurate predictions compared to ridge regression.

• Cross-validation can be used in order to determine which approach is better on a particular data set.

But how do we select the regularization term λ? Cross-validation!
Training Data

DataSet

Learning Phase
• Training Set (~60%): optimize the parameters based on the training set
• Validation Set (~20%): selection of the model (for example, the degree of the polynomial or the regularization term)

Test Phase
• Test Set (~20%): final test
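A minimal sketch of such a 60/20/20 split in NumPy (the random permutation and the exact rounding of the split sizes are implementation choices, not prescribed by the slides):

```python
import numpy as np

def split_dataset(X, y, seed=0):
    """Random 60/20/20 split into training, validation and test sets."""
    N = X.shape[0]
    idx = np.random.default_rng(seed).permutation(N)
    n_train, n_val = int(0.6 * N), int(0.2 * N)
    train, val, test = np.split(idx, [n_train, n_train + n_val])
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])

# Usage on illustrative data with 50 examples and 2 features
X = np.arange(100, dtype=float).reshape(50, 2)
y = np.arange(50, dtype=float)
train, val, test = split_dataset(X, y)
print(train[0].shape, val[0].shape, test[0].shape)  # (30, 2) (10, 2) (10, 2)
```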
6. Cross-Validation
Cross Validation: K-fold CV

Cross-validation is used to select the best regularization parameter λ – or any other hyper-parameter, such as a neural network hyper-parameter.

Procedure
• Select a grid of potential values for the hyper-parameter value λ ∈ {λ₁, λ₂, …, λ_C}
• Partition the entire dataset into a training set and a testing set
• Partition the entire training set T into K subsets T1, T2, …, TK
• Calculate the error errorₖ on subset Tₖ based on a model learnt on the training set T except subset Tₖ
• Calculate the overall error corresponding to the average of the individual errors errorₖ (k = 1, …, K)
• Choose the hyper-parameter value λ* leading to the smallest overall error
• The model associated with such optimal hyper-parameter is then trained on the entire training set and it is tested on the test set
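A hedged sketch of this procedure for selecting the ridge regularization parameter λ; the data, the grid of candidate values and the use of mean-squared error are illustrative assumptions:

```python
import numpy as np

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def kfold_cv_error(X, y, lam, K=5):
    """Average validation error of ridge regression over K folds."""
    folds = np.array_split(np.arange(X.shape[0]), K)
    errors = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        beta = ridge_fit(X[train], y[train], lam)               # learn on T \ T_k
        errors.append(np.mean((y[val] - X[val] @ beta) ** 2))   # error_k on T_k
    return np.mean(errors)                                      # overall error

# Illustrative data and grid of candidate hyper-parameter values
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
y = X @ rng.normal(size=10) + rng.normal(0, 0.5, size=60)

grid = [0.01, 0.1, 1.0, 10.0]
cv_errors = {lam: kfold_cv_error(X, y, lam) for lam in grid}
best_lam = min(cv_errors, key=cv_errors.get)
print(cv_errors, "-> best lambda:", best_lam)
# The model would then be re-trained with best_lam on the entire training set
# and evaluated once on the held-out test set.
```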
Cross Validation: K-fold CV

[Figures: the K-fold procedure illustrated trial by trial – in each trial a different subset Tₖ is held out for validation while the model is trained on the remaining folds.]