Lecture 1: Introduction
Learning Systems 1
Professor Miguel Rodrigues
University College London
Instructors
Course Outline
• Introduction
• Historical perspective
• Supervised learning
• Linear regression in one variable and multiple variables
• Polynomial regression in multiple variables
• Regularization
• Ridge regression and LASSO
• Kernel learning
• Support vector machines
• Kernels
• Introduction to neural networks
• Representation
• Learning
• Convolutional neural networks and variants
Timeline (6 lectures and 2 labs)
• Mini-project timelines
• Hand-out: Friday, 25th October, 2019
• Hand-in: Monday, 06th January, 2020 (after Christmas)
Bibliography
• K. P. Murphy, Machine Learning: A Probabilistic Perspective, Cambridge, MA: MIT Press, 2012.
Historical Perspective
• 1948: First computer program
• 1950: The Turing test: “Can machines think?”
• 1956: The birth of AI, John McCarthy, AAAI
• 1957: First neural network by Frank Rosenblatt
• 1975–80: AI hype leads to the first AI winter
• 1980: The rise of expert systems
• 1982: Revival of connectionism, backpropagation
• 1987–93: Second AI winter
• 1990s: Age of knowledge representation
• 1997: IBM Deep Blue beats Kasparov in chess
• 2014: DeepMind funded; Facebook develops DeepFace
• 2015: Google’s TensorFlow; Amazon’s ML platform; Microsoft’s ML Toolkit; OpenAI and ethics
• 2016: AlphaGo beats Lee Sedol in Go
[Figure: Machine Learning shown as a subset of Artificial Intelligence, with well-known systems as examples: Tay, AlphaGo, Eugene, Alexa, Siri, Watson.]
• Search
• Reasoning
• Games
• Knowledge representation
• Speech understanding
• Language understanding
• Unsupervised learning
• Reinforcement learning
• Common sense
Artificial Intelligence: Ingredients
• Artificial Narrow Intelligence
• Artificial General Intelligence
• Artificial Super Intelligence
Artificial Intelligence: Approaches / Tools
• Logic Based Approaches
• Probabilistic/Statistical Based Approaches
• Optimization Based Approaches
These approaches and tools enable solving many AI problems by searching intelligently over many candidate solutions, and allow one to tackle AI problems where an agent operates within an environment subject to uncertainty.
Artificial Intelligence: What is different now?
• Massive amounts of data
• Massive computational power
• State-of-the-art algorithms
Warm-Up
[Figure: input images are passed through a prediction rule; each output is a one-hot vector over the classes {Cat, Dog, Bird, Horse}, e.g. a cat image gives Cat = 1, a dog image gives Dog = 1, and a horse image gives Horse = 1, with all other entries 0.]
A Rule Based Approach
Concept
Rules are pre-specified. A hand-written classification rule maps input data (e.g. a pixel string such as 10010010…) onto an output label (e.g. Cat = 1, Dog = 0, Bird = 0, Horse = 0):

    if condition 1 satisfied
        …
        if condition n satisfied
            output cat
        else
            output dog
    end
The Learning Based Approach
Concept
A learning algorithm is given labelled example images (here a dog, a horse, a cat and a bird, each labelled with a one-hot vector over {Cat, Dog, Bird, Horse}) and infers the classification rule from them. The learnt rule is then applied to a new input to produce the corresponding output (e.g. Cat = 1, Dog = 0, Bird = 0, Horse = 0). A minimal code sketch of this idea follows.
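To make the contrast with the rule-based approach concrete, here is a minimal, hypothetical sketch in which the classification rule is learnt from labelled examples with scikit-learn rather than hand-coded; the toy features and labels are invented for illustration and are not the lecture's images.

```python
# Minimal sketch: learning a classification rule from labelled examples.
# The features and labels below are invented toy data, not the lecture's images.
from sklearn.tree import DecisionTreeClassifier

# Each row is one example: [has_whiskers, barks, has_feathers, has_hooves]
X_train = [[1, 0, 0, 0],   # cat
           [0, 1, 0, 0],   # dog
           [0, 0, 1, 0],   # bird
           [0, 0, 0, 1]]   # horse
y_train = ["cat", "dog", "bird", "horse"]

# The learning algorithm infers the classification rule from the examples
rule = DecisionTreeClassifier().fit(X_train, y_train)

# The learnt rule is then applied to a new input
new_input = [[1, 0, 0, 0]]
print(rule.predict(new_input))   # -> ['cat']
```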
What is machine learning?
Unsupervised Learning
The machine learning task of inferring a function describing the structure of a dataset. It relies on unlabelled data.
Supervised Learning
Concept
The task of inferring a function describing the relationship between independent and dependent variables in a dataset, based on labelled examples. In the example figure, six example images (each labelled with a one-hot vector over {Cat, Dog, Bird, Horse}) are fed to a learning algorithm, which outputs a prediction rule mapping inputs to outputs.
Applications
Supervised learning can be used to solve various tasks:
• Classification tasks
• Regression tasks
• Forecasting tasks
• Recommender systems
• Anomaly detection tasks
Variant: Semi-Supervised Learning
[Figure: as in supervised learning, a learning algorithm infers a prediction rule mapping inputs to outputs from the example set, but only part of the examples carry labels.]
Examples: Supervised Learning
Classification
Predict the class of an item given various examples.
Example: classification of tumours as benign or malignant based on appearance (feature 1) and tumour size (feature 2).
Forecasting
Predict future values given past ones.
Example: prediction of tomorrow’s stock value based on the previous stock values (time-series data).
Regression
Predict the value of a variable given other variables.
Examples: Unsupervised Learning
Examples of unsupervised learning tasks include:
• Clustering tasks (e.g. grouping the data into cluster 1, cluster 2, cluster 3)
• Dimensionality reduction tasks
• Recommender systems
• Anomaly detection tasks
Reinforcement Learning
Concept
Reinforcement learning represents an agent’s attempt to approximate the environment’s function, in order to determine the agent’s actions on the black-box environment that maximize the agent’s rewards. The agent receives observations and rewards from the environment and sends actions back to it.
Applications
Robotics, industrial automation, education and training, health and medicine, media and advertising, finance.
The Machine Learning Landscape
https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
Machine Learning: Real-World Examples
[Example: related news stories are clustered automatically.]
The Machine Learning Process
• Problem Definition
• Data Collection
• Data Preparation: pre-processing, feature extraction, feature normalization
• Model Selection and Model Evaluation
• Learning: model fitting via a training algorithm
• Model Usage: the learnt prediction rule is applied for testing and for making predictions
A minimal code sketch of these stages follows.
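A minimal, hypothetical sketch of these stages with scikit-learn; the dataset and the model choice are illustrative assumptions, not part of the lecture.

```python
# Minimal sketch of the ML process: data preparation -> model fitting -> evaluation -> usage.
# The dataset (diabetes) and the model (linear regression) are illustrative choices.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)                               # data collection
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), LinearRegression())         # preparation + model selection
model.fit(X_train, y_train)                                         # learning / model fitting
print("R^2 on held-out data:", model.score(X_test, y_test))         # model evaluation
print("Prediction for a new input:", model.predict(X_test[:1]))     # model usage
```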
2. Supervised Learning:
Simple Linear Regression
Linear Regression
Example
• Prediction of second-hand car prices based on features such as mileage.

Training set (input: mileage in km; output: cost in £):
  25,000 → 16,000
  105,000 → 11,500
  120,000 → 6,000
  140,000 → 3,000
  45,000 → 13,500

Testing set (input: mileage in km; output: cost in £):
  50,000 → ?
  90,000 → ?

Each training sample is an input-output pair $(x_i, y_i)$; for each test sample only the mileage is given and the cost must be predicted.
Linear Regression
• It is a supervised learning problem because one has access to input-output examples $(x_i, y_i)$.
• It is a regression problem because the output variable is continuous-valued.

Linear Regression: Model
• $y = h(x)$, where $h(\cdot)$ is the hypothesis.
• $h(\cdot)$ is a linear function, $h(x) = \alpha + \beta x$; its slope may be positive ($\beta > 0$) or negative ($\beta < 0$).
Linear Regression: Cost Function
How does the learning algorithm select the linear hypothesis / model? We need a cost function…

[Figure: training points $(x_i, y_i)$ (true values) together with the fitted line; the predicted values are $(x_i, \hat{y}_i) = (x_i, \alpha + \beta x_i)$ and the vertical gaps are the residuals $e_i$.]

Cost Function – Sum-of-Squared Residuals
We can now define two quantities that can be used to assess the model fit to the data:

(1) The individual residual measures how much the predicted output value deviates from the true output value given the input value:
$$e_i = y_i - \hat{y}_i = y_i - (\alpha + \beta x_i)$$

(2) The sum of squared residuals measures how well overall the model fits the data:
$$\mathrm{SSR}(\alpha, \beta) = \frac{1}{N}\sum_{i=1}^{N} e_i^2 = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - (\alpha + \beta x_i)\right)^2$$
Linear Regression: Learning Algorithm
We should select the linear regression model parameters that minimize the sum-of-squared residuals as follows:
$$(\alpha^*, \beta^*) = \arg\min_{\alpha, \beta} \mathrm{SSR}(\alpha, \beta) = \arg\min_{\alpha, \beta} \frac{1}{N}\sum_{i=1}^{N}\left(y_i - (\alpha + \beta x_i)\right)^2$$

This leads to the optimal model parameters given by:
$$\beta^* = \frac{\sum_{i=1}^{N} x_i \,(y_i - \bar{y})}{\sum_{i=1}^{N} x_i \,(x_i - \bar{x})}, \qquad \alpha^* = \bar{y} - \beta^* \,\bar{x}$$

New Predictions
The new response $\hat{y}$ associated with a new data point $x$ is now given by the new linear model:
$$\hat{y} = \alpha^* + \beta^* x$$
A short numerical sketch on the car data follows.
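A minimal sketch applying these closed-form expressions to the car data from the example; the data come from the slide, while the code structure is an illustrative assumption.

```python
import numpy as np

# Training data from the car example: mileage (km) and cost (GBP)
x = np.array([25_000, 105_000, 120_000, 140_000, 45_000], dtype=float)
y = np.array([16_000, 11_500, 6_000, 3_000, 13_500], dtype=float)

# Closed-form least-squares estimates for the model y = alpha + beta * x
beta = np.sum(x * (y - y.mean())) / np.sum(x * (x - x.mean()))
alpha = y.mean() - beta * x.mean()
print(f"alpha* = {alpha:.1f}, beta* = {beta:.5f}")

# Sum-of-squared residuals of the fitted model on the training data
ssr = np.mean((y - (alpha + beta * x)) ** 2)
print(f"SSR(alpha*, beta*) = {ssr:.1f}")

# Predict the costs for the testing-set mileages
x_new = np.array([50_000, 90_000], dtype=float)
print("predicted costs:", alpha + beta * x_new)
```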
Multivariate Linear Regression: Model
The relationship between the input and output variables can be expressed as follows:
$$y_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} \qquad (i = 1, \ldots, N)$$
where $\beta_j$, $j = 0, \ldots, p$, are the model parameters.

The relationship between the input and output variables can also be expressed in matrix form:
$$\mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} 1 & x_{1,1} & \cdots & x_{1,p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N,1} & \cdots & x_{N,p} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \vdots \\ \beta_p \end{bmatrix} = \mathbf{X}\boldsymbol{\beta}$$
Geometrical Interpretation
[Figure: linear least squares fitting with $X \in \mathbb{R}^2$ (inputs X1 and X2); the model defines a plane over the input space, and the fitted plane minimizes the sum of squared vertical distances to the data points. (Figure 3.1, “3.2 Linear Regression Models and Least Squares”, p. 45.)]
Multivariate Linear Regression: Model
For the second-hand car example, with additional input features beyond mileage, the matrix form $\mathbf{y} = \mathbf{X}\boldsymbol{\beta}$ becomes (only the first and last training samples are shown):
$$\begin{bmatrix} 16{,}000 \\ \vdots \\ 13{,}500 \end{bmatrix} = \begin{bmatrix} 1 & 25{,}000 & 3 & 6 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 45{,}000 & 3 & 12 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \vdots \\ \beta_p \end{bmatrix}$$
Multivariate Linear Regression: Cost Function
How does the learning algorithm select the linear hypothesis / model? We need a cost function…

Cost Function – Sum-of-Squared Residuals
$$\mathrm{SSR}(\boldsymbol{\beta}) = \frac{1}{N}\,(\mathbf{y} - \hat{\mathbf{y}})^\top(\mathbf{y} - \hat{\mathbf{y}}) = \frac{1}{N}\,(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$

[Figure: the true output values sit above and below the fitted plane; the SSR aggregates the squared vertical errors. (Figure 3.1, “3.2 Linear Regression Models and Least Squares”, p. 45.)]
Multivariate Linear Regression: Learning Algorithm
Learning Algorithm
We should select the linear regression model parameters that minimize the sum-of-squared residuals as follows:
$$\boldsymbol{\beta}^* = \arg\min_{\boldsymbol{\beta}} \mathrm{SSR}(\boldsymbol{\beta}) = \arg\min_{\boldsymbol{\beta}} \frac{1}{N}\,(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$
This leads to the optimal model parameters:
$$\boldsymbol{\beta}^* = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$$

New Predictions
The new response $\hat{y}$ associated with a new data point $\mathbf{x}$ is now given by:
$$\hat{y} = \mathbf{x}^\top\boldsymbol{\beta}^* = \mathbf{x}^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$$
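A minimal sketch of this closed-form solution in code. Only the first and last rows of the multi-feature car data appear on the slide, so the middle rows and the new data point below are assumed values for illustration.

```python
import numpy as np

# Design matrix X: a column of ones plus the feature columns.
# Only the first and last rows appear on the slide; the middle rows are placeholders.
X = np.array([[1, 25_000, 3, 6],
              [1, 105_000, 5, 9],   # assumed values for illustration
              [1, 120_000, 7, 10],  # assumed values for illustration
              [1, 140_000, 8, 11],  # assumed values for illustration
              [1, 45_000, 3, 12]], dtype=float)
y = np.array([16_000, 11_500, 6_000, 3_000, 13_500], dtype=float)

# Normal equations: beta* = (X^T X)^(-1) X^T y
beta = np.linalg.solve(X.T @ X, X.T @ y)
print("beta* =", beta)

# Prediction for a new data point x (including the leading 1 for the intercept)
x_new = np.array([1, 50_000, 4, 7], dtype=float)  # assumed feature values
print("predicted cost:", x_new @ beta)
```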
Polynomial Regression
A polynomial model in one variable, $y = \beta_0 + \beta_1 x + \beta_2 x^2$, can be rewritten as a linear model in two variables by renaming the features $x_1 = x$ and $x_2 = x^2$, giving $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$. We can then apply multiple linear regression to the converted problem exactly as before.

Note that the new features $x_1$ and $x_2$ live on very different scales; we will see later the importance of scaling them!
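A short sketch of this feature-renaming trick, reusing the same least-squares machinery; the toy data are invented for illustration.

```python
import numpy as np

# Toy one-dimensional data (invented for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.2, 5.1, 10.3, 17.2])   # roughly quadratic in x

# Rename features: x1 = x, x2 = x^2, then fit a *linear* model in (x1, x2)
X = np.column_stack([np.ones_like(x), x, x ** 2])
beta = np.linalg.solve(X.T @ X, X.T @ y)
print("beta* =", beta)   # approximately recovers y ≈ 1 + 0·x + 1·x^2

# Note: the scale difference between x and x^2 grows quickly with x,
# which is why feature scaling matters for larger-degree polynomials.
```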
The Problem of Overfitting
Does a higher-order model produce better results compared to a lower-order one?

[Figure: three models fitted to noisy samples of a sine wave. An under-fitting model is too constrained to explain the data; a good-fitting model strikes the best balance and captures the true sine wave; an over-fitting model (e.g. a 9th-order polynomial) fits the noise in the training data and fails to predict the true wave accurately.]
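A minimal sketch reproducing this kind of comparison on synthetic sine-wave data; the data and the degree choices are illustrative assumptions, not the lecture's figure.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 2 * np.pi, 15))
y_train = np.sin(x_train) + 0.2 * rng.standard_normal(x_train.size)   # noisy sine samples
x_test = np.linspace(0, 2 * np.pi, 100)
y_test = np.sin(x_test)                                               # the true wave

for degree in (1, 3, 9):   # intended as under-fitting, good fit, over-fitting
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train.reshape(-1, 1), y_train)
    test_error = np.mean((model.predict(x_test.reshape(-1, 1)) - y_test) ** 2)
    print(f"degree {degree}: test MSE = {test_error:.3f}")
```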
Bias-Variance Tradeoff
The use of a small hypothesis space – such as one consisting of linear models – can lead to underfitting to the training data. It is also known as a high-bias setting because we are “biasing” the learning algorithm – due to possible preconceptions about the data – to choose a very constrained model.

Conversely, the use of a large hypothesis space – such as one consisting of high-order polynomials – can lead to overfitting to the training data. It is typically known as a high-variance setting.

Both a high-bias and a high-variance setting can lead to poor prediction performance in the presence of new data.

There is typically a bias-variance tradeoff in the sense that one cannot achieve both low bias and low variance simultaneously.
The Problem of Overfitting
Does a higher-order polynomial model (e.g. $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4$) produce better results compared to a lower-order one? One way to keep such a model in check is:
• Regularization
  o This involves automatic reduction of the magnitude of some of the model parameters.
  o It can work well in problems involving various features, with each one contributing a bit to the prediction.
5. Regularization

Regularization
Regularization techniques can address over-fitting by penalizing certain large parameter values.

Without regularization:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4$$
Parameter selection:
$$\boldsymbol{\beta}^* = \arg\min_{\boldsymbol{\beta}} \frac{1}{N}\,(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$

With regularization:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4$$
Parameter selection with penalization – the high-order coefficients are multiplied by very large numbers, so the minimization now selects small values for them:
$$\boldsymbol{\beta}^* = \arg\min_{\boldsymbol{\beta}} \frac{1}{N}\,(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + 10^4\beta_3^2 + 10^4\beta_4^2$$
Regularization
We do not want manual tuning of the importance of the model parameters. Instead, a single regularization factor penalizes all the parameters, forcing them to be small.

Without regularization:
$$(\alpha^*, \beta^*) = \arg\min_{\alpha, \beta} \mathrm{SSR}(\alpha, \beta) = \arg\min_{\alpha, \beta} \frac{1}{N}\sum_{i=1}^{N}\left(y_i - (\alpha + \beta x_i)\right)^2$$

With regularization ($\lambda$ is the regularization factor):
$$(\alpha^*, \beta^*) = \arg\min_{\alpha, \beta} \frac{1}{N}\sum_{i=1}^{N}\left(y_i - (\alpha + \beta x_i)\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2$$

In matrix form:
$$\boldsymbol{\beta}^* = \arg\min_{\boldsymbol{\beta}} \frac{1}{N}\,(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \quad\text{vs.}\quad \boldsymbol{\beta}^* = \arg\min_{\boldsymbol{\beta}} \frac{1}{N}\,(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda\,\boldsymbol{\beta}^\top\boldsymbol{\beta}$$

Setting the derivative of the (regularized) SSR with respect to $\boldsymbol{\beta}$ to zero gives the optimal parameters:
$$\boldsymbol{\beta}^* = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} \quad\text{vs.}\quad \boldsymbol{\beta}^* = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y} \qquad (\mathbf{I}\ \text{is the identity matrix})$$

and the corresponding predictions:
$$\hat{y} = \mathbf{x}^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} \quad\text{vs.}\quad \hat{y} = \mathbf{x}^\top(\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}$$

NB: the matrix $\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I}$ is always invertible, whereas the matrix $\mathbf{X}^\top\mathbf{X}$ may not be invertible (e.g. if the number of features is higher than the number of examples / samples).
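A minimal sketch of the ridge closed form next to the unregularized one; the synthetic data and the choice λ = 1.0 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 20, 3
X = np.column_stack([np.ones(N), rng.standard_normal((N, p))])  # design matrix with intercept
beta_true = np.array([1.0, 2.0, -3.0, 0.5])
y = X @ beta_true + 0.1 * rng.standard_normal(N)

lam = 1.0                                   # regularization factor (arbitrary for this sketch)
I = np.eye(X.shape[1])

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)              # (X^T X)^(-1) X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)  # (X^T X + lambda I)^(-1) X^T y
print("OLS:  ", beta_ols)
print("Ridge:", beta_ridge)                 # shrunk towards zero relative to OLS
```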
Regularization: Ridge Regression
Parameter estimates:
$$\boldsymbol{\beta}^* = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}$$
• Very small $\lambda$ ⇒ the optimal parameters for ridge regression correspond to those of simple linear regression.
• Very large $\lambda$ ⇒ the optimal parameters for ridge regression become approximately equal to zero.

$\lambda$ is a hyper-parameter. Ridge regression penalizes the squared L2 norm of $\boldsymbol{\beta}$, whereas the LASSO penalizes the L1 norm of $\boldsymbol{\beta}$:
$$\boldsymbol{\beta}^*_{\text{ridge}} = \arg\min_{\boldsymbol{\beta}} \frac{1}{N}\,(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda\,\|\boldsymbol{\beta}\|_2^2 \qquad \boldsymbol{\beta}^*_{\text{LASSO}} = \arg\min_{\boldsymbol{\beta}} \frac{1}{N}\,(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda\,\|\boldsymbol{\beta}\|_1$$
The LASSO tends to drive some coefficients exactly to zero.
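A quick numerical check of the two λ limits above, reusing the ridge closed form; the synthetic data and the λ values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(30), rng.standard_normal((30, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.standard_normal(30)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
for lam in (1e-6, 1.0, 1e6):               # very small, moderate, very large lambda
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
    print(f"lambda = {lam:g}:", np.round(beta_ridge, 4))
print("OLS solution:  ", np.round(beta_ols, 4))
# lambda -> 0 recovers the OLS solution; lambda -> infinity drives the parameters towards 0.
```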
LASSO vs Ridge Regression
The ridge regression problem can be re-expressed in constrained form as follows:
$$\boldsymbol{\beta}^* = \arg\min_{\boldsymbol{\beta}} \frac{1}{N}\,(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \quad \text{subject to} \quad \|\boldsymbol{\beta}\|_2^2 \leq s$$
and likewise the LASSO with the constraint $\|\boldsymbol{\beta}\|_1 \leq s$, where the value of $s$ depends on the value of the hyper-parameter $\lambda$.

[Figure: contours of the SSR around the simple linear regression solution, together with the circular (ridge) and diamond-shaped (LASSO) constraint regions. The LASSO and ridge regression coefficient estimates are given by the first point at which the ellipse contacts the constraint region. Because the LASSO region has corners on the axes, contact often occurs where some coefficients are exactly zero; sparsity is thus promoted and the LASSO acts as a feature selector, whereas the ridge solution is generally not sparse.]
LASSO vs Ridge Regression
Advantage:
• LASSO has an advantage over ridge regression, in that it produces simpler and more interpretable models that involve only a subset of the predictors (it identifies the relevant features more clearly).
Similarity:
• LASSO leads to qualitatively similar behavior to ridge regression, in that as λ increases, the variance decreases and the bias increases.
• Cross-validation can be used in order to determine which approach is better on a particular data set.
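A small, hypothetical comparison of the two penalties with scikit-learn; the synthetic data are invented, and the `alpha` arguments of `Ridge` and `Lasso` play the role of λ.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
beta_true = np.zeros(10)
beta_true[:3] = [3.0, -2.0, 1.5]             # only 3 of the 10 predictors matter
y = X @ beta_true + 0.1 * rng.standard_normal(50)

ridge = Ridge(alpha=1.0).fit(X, y)           # L2 penalty: shrinks coefficients
lasso = Lasso(alpha=0.1).fit(X, y)           # L1 penalty: sets some coefficients to zero
print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))
print("lasso zeros:", int(np.sum(lasso.coef_ == 0)), "of 10")
```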
But how do we select the regularization term λ? Cross-validation!
72
Training Data
DataSet
Learning Phase
Final test
Test Set ~20%
6. Cross-Validation
Cross Validation: K-fold CV
Procedure
• Partition the entire training set T into K subsets T1, T2, …, TK.
• Calculate the error errork on subset Tk based on a model learnt on the training set T except subset Tk.
• Calculate the overall error corresponding to the average of the individual errors errork (k = 1, …, K).
• Choose the hyper-parameter value (e.g. the regularization term λ) leading to the smallest overall error.
• The model associated with such an optimal hyper-parameter value λ* is then retained.

[Figure: in each trial, one of the K folds is held out for testing while the remaining folds are used for training; the curve of the overall CV error against λ is used to pick λ*.]
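A minimal sketch of this procedure for selecting the ridge regularization term λ with scikit-learn; the synthetic data and the candidate λ grid are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + 0.3 * rng.standard_normal(100)

folds = KFold(n_splits=5, shuffle=True, random_state=0)   # K = 5 subsets T1..T5
candidate_lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]

overall_errors = []
for lam in candidate_lambdas:
    # For each fold Tk, fit on T \ Tk and measure the error on Tk; then average.
    mse = -cross_val_score(Ridge(alpha=lam), X, y,
                           cv=folds, scoring="neg_mean_squared_error").mean()
    overall_errors.append(mse)

best = candidate_lambdas[int(np.argmin(overall_errors))]
print("overall CV errors:", np.round(overall_errors, 4))
print("chosen lambda* =", best)
```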