Lecture 1: Introduction
Learning Systems 1
Professor Miguel Rodrigues
University College London
Instructors
Course Outline
• Introduction
• Historical perspective
• Supervised learning
• Linear regression in one variable and multiple variables
• Polynomial regression in multiple variables
• Regularization
• Ridge regression and LASSO
• Kernel learning
• Support vector machines
• Kernels
• Introduction to neural networks
• Representation
• Learning
• Convolutional neural networks and variants
Timeline (6 lectures and 2 labs)
• Mini-project timelines
• Hand-out: Friday, 25th October, 2019
• Hand-in: Monday, 06th January, 2020 (after Christmas)
Bibliography
• K. P. Murphy, Machine Learning: A Probabilistic Perspective, Cambridge, MA: MIT Press, 2012.
Historical Perspective
• 1948: First computer program
• 1950: The Turing test: “Can machines think?”
• 1956: The birth of AI, John McCarthy, AAAI
• 1957: First neural network by Frank Rosenblatt
• 1975–80: AI hype leads to the first AI winter
• 1980: The rise of expert systems
• 1982: Revival of connectionism, backpropagation
• 1987–93: Second AI winter
• 1990s: Age of knowledge representation
• 1997: IBM Deep Blue beats Kasparov in chess
• 2014: DeepMind funded; Facebook develops DeepFace
• 2015: Google’s TensorFlow; Amazon’s ML platform; Microsoft’s ML Toolkit; OpenAI and ethics
• 2016: AlphaGo beats Lee Sedol in Go
[Figure: Machine Learning shown as a subset of Artificial Intelligence, with well-known systems as examples: Tay, AlphaGo, Eugene, Alexa, Siri, Watson.]
• Search
• Reasoning
• Games
• Knowledge representation
• Speech understanding
• Language understanding
• Unsupervised learning
• Reinforcement learning
• Common sense
Artificial Intelligence: Ingredients
• Artificial Narrow Intelligence
• Artificial General Intelligence
• Artificial Super Intelligence
Artificial Intelligence: Approaches / Tools
• Logic Based Approaches
• Probabilistic/Statistical Based Approaches
• Optimization Based Approaches
These approaches and tools enable solving many AI problems by searching intelligently over many candidate solutions, and allow one to tackle AI problems where an agent operates within an environment subject to uncertainty.
Artificial Intelligence: What is different now?
• Massive amounts of data
• Massive computational power
• State-of-the-art algorithms
Warm-Up
[Figure: input images are passed through a prediction rule; each output is a one-hot vector over the classes {Cat, Dog, Bird, Horse}, e.g. a cat image gives Cat = 1, a dog image gives Dog = 1, and a horse image gives Horse = 1, with all other entries 0.]
A Rule Based Approach
Concept
Rules are pre-specified. A hand-written classification rule maps input data (e.g. a pixel string such as 10010010…) onto an output label (e.g. Cat = 1, Dog = 0, Bird = 0, Horse = 0):

    if condition 1 satisfied
        …
        if condition n satisfied
            output cat
        else
            output dog
    end
The Learning Based Approach
Concept
A learning algorithm is given labelled example images (here a dog, a horse, a cat and a bird, each labelled with a one-hot vector over {Cat, Dog, Bird, Horse}) and infers the classification rule from them. The learnt rule is then applied to a new input to produce the corresponding output (e.g. Cat = 1, Dog = 0, Bird = 0, Horse = 0). A minimal code sketch of this idea follows.
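To make the contrast with the rule-based approach concrete, here is a minimal, hypothetical sketch in which the classification rule is learnt from labelled examples with scikit-learn rather than hand-coded; the toy features and labels are invented for illustration and are not the lecture's images.

```python
# Minimal sketch: learning a classification rule from labelled examples.
# The features and labels below are invented toy data, not the lecture's images.
from sklearn.tree import DecisionTreeClassifier

# Each row is one example: [has_whiskers, barks, has_feathers, has_hooves]
X_train = [[1, 0, 0, 0],   # cat
           [0, 1, 0, 0],   # dog
           [0, 0, 1, 0],   # bird
           [0, 0, 0, 1]]   # horse
y_train = ["cat", "dog", "bird", "horse"]

# The learning algorithm infers the classification rule from the examples
rule = DecisionTreeClassifier().fit(X_train, y_train)

# The learnt rule is then applied to a new input
new_input = [[1, 0, 0, 0]]
print(rule.predict(new_input))   # -> ['cat']
```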
What is machine learning?
Unsupervised Learning
The machine learning task of inferring a function describing the structure of a dataset. It relies on unlabelled data.
Supervised Learning
Concept
The task of inferring a function describing the relationship between independent and dependent variables in a dataset, based on labelled examples. In the example figure, six example images (each labelled with a one-hot vector over {Cat, Dog, Bird, Horse}) are fed to a learning algorithm, which outputs a prediction rule mapping inputs to outputs.
Applications
Supervised learning can be used to solve various tasks:
• Classification tasks
• Regression tasks
• Forecasting tasks
• Recommender systems
• Anomaly detection tasks
Variant: Semi-Supervised Learning
[Figure: as in supervised learning, a learning algorithm infers a prediction rule mapping inputs to outputs from the example set, but only part of the examples carry labels.]
Examples: Supervised Learning
Classification
Predict the class of an item given various examples.
Example: classification of tumours as benign or malignant based on appearance (feature 1) and tumour size (feature 2).
Forecasting
Predict future values given past ones.
Example: prediction of tomorrow’s stock value based on the previous stock values (time-series data).
Regression
Predict the value of a variable given other variables.
Examples: Unsupervised Learning
Examples of unsupervised learning tasks include:
• Clustering tasks (e.g. grouping the data into cluster 1, cluster 2, cluster 3)
• Dimensionality reduction tasks
• Recommender systems
• Anomaly detection tasks
Reinforcement Learning
Concept
Reinforcement learning represents an agent’s attempt to approximate the environment’s function, in order to determine the agent’s actions on the black-box environment that maximize the agent’s rewards. The agent receives observations and rewards from the environment and sends actions back to it.
Applications
Robotics, industrial automation, education and training, health and medicine, media and advertising, finance.
The Machine Learning Landscape
https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
Machine Learning: Real-World Examples
[Example: related news stories are clustered automatically.]
The Machine Learning Process
• Problem Definition
• Data Collection
• Data Preparation: pre-processing, feature extraction, feature normalization
• Model Selection and Model Evaluation
• Learning: model fitting via a training algorithm
• Model Usage: the learnt prediction rule is applied for testing and for making predictions
A minimal code sketch of these stages follows.
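A minimal, hypothetical sketch of these stages with scikit-learn; the dataset and the model choice are illustrative assumptions, not part of the lecture.

```python
# Minimal sketch of the ML process: data preparation -> model fitting -> evaluation -> usage.
# The dataset (diabetes) and the model (linear regression) are illustrative choices.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)                               # data collection
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), LinearRegression())         # preparation + model selection
model.fit(X_train, y_train)                                         # learning / model fitting
print("R^2 on held-out data:", model.score(X_test, y_test))         # model evaluation
print("Prediction for a new input:", model.predict(X_test[:1]))     # model usage
```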
2. Supervised Learning:
Simple Linear Regression
Linear Regression
Example
• Prediction of second-hand car prices based on features such as mileage.

Training set (input: mileage in km; output: cost in £):
  25,000 → 16,000
  105,000 → 11,500
  120,000 → 6,000
  140,000 → 3,000
  45,000 → 13,500

Testing set (input: mileage in km; output: cost in £):
  50,000 → ?
  90,000 → ?

Each training sample is an input-output pair $(x_i, y_i)$; for each test sample only the mileage is given and the cost must be predicted.
Linear Regression
• It is a supervised learning problem because one has access to input-output examples $(x_i, y_i)$.
• It is a regression problem because the output variable is continuous-valued.

Linear Regression: Model
• $y = h(x)$, where $h(\cdot)$ is the hypothesis.
• $h(\cdot)$ is a linear function, $h(x) = \alpha + \beta x$; its slope may be positive ($\beta > 0$) or negative ($\beta < 0$).
Linear Regression: Cost Function
How does the learning algorithm select the linear hypothesis / model? We need a cost function…

[Figure: training points $(x_i, y_i)$ (true values) together with the fitted line; the predicted values are $(x_i, \hat{y}_i) = (x_i, \alpha + \beta x_i)$ and the vertical gaps are the residuals $e_i$.]

Cost Function – Sum-of-Squared Residuals
We can now define two quantities that can be used to assess the model fit to the data:

(1) The individual residual measures how much the predicted output value deviates from the true output value given the input value:
$$e_i = y_i - \hat{y}_i = y_i - (\alpha + \beta x_i)$$

(2) The sum of squared residuals measures how well overall the model fits the data:
$$\mathrm{SSR}(\alpha, \beta) = \frac{1}{N}\sum_{i=1}^{N} e_i^2 = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - (\alpha + \beta x_i)\right)^2$$
Linear Regression: Learning Algorithm
We should select the linear regression model parameters that minimize the sum-of-squared residuals as follows:
$$(\alpha^*, \beta^*) = \arg\min_{\alpha, \beta} \mathrm{SSR}(\alpha, \beta) = \arg\min_{\alpha, \beta} \frac{1}{N}\sum_{i=1}^{N}\left(y_i - (\alpha + \beta x_i)\right)^2$$

This leads to the optimal model parameters given by:
$$\beta^* = \frac{\sum_{i=1}^{N} x_i \,(y_i - \bar{y})}{\sum_{i=1}^{N} x_i \,(x_i - \bar{x})}, \qquad \alpha^* = \bar{y} - \beta^* \,\bar{x}$$

New Predictions
The new response $\hat{y}$ associated with a new data point $x$ is now given by the new linear model:
$$\hat{y} = \alpha^* + \beta^* x$$
A short numerical sketch on the car data follows.
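A minimal sketch applying these closed-form expressions to the car data from the example; the data come from the slide, while the code structure is an illustrative assumption.

```python
import numpy as np

# Training data from the car example: mileage (km) and cost (GBP)
x = np.array([25_000, 105_000, 120_000, 140_000, 45_000], dtype=float)
y = np.array([16_000, 11_500, 6_000, 3_000, 13_500], dtype=float)

# Closed-form least-squares estimates for the model y = alpha + beta * x
beta = np.sum(x * (y - y.mean())) / np.sum(x * (x - x.mean()))
alpha = y.mean() - beta * x.mean()
print(f"alpha* = {alpha:.1f}, beta* = {beta:.5f}")

# Sum-of-squared residuals of the fitted model on the training data
ssr = np.mean((y - (alpha + beta * x)) ** 2)
print(f"SSR(alpha*, beta*) = {ssr:.1f}")

# Predict the costs for the testing-set mileages
x_new = np.array([50_000, 90_000], dtype=float)
print("predicted costs:", alpha + beta * x_new)
```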
Multivariate Linear Regression: Model
The relationship between the input and output variables can be expressed as follows:
$$y_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} \qquad (i = 1, \ldots, N)$$
where $\beta_j$, $j = 0, \ldots, p$, are the model parameters.

The relationship between the input and output variables can also be expressed in matrix form:
$$\mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} 1 & x_{1,1} & \cdots & x_{1,p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N,1} & \cdots & x_{N,p} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \vdots \\ \beta_p \end{bmatrix} = \mathbf{X}\boldsymbol{\beta}$$
Geometrical Interpretation
[Figure: linear least squares fitting with $X \in \mathbb{R}^2$ (inputs X1 and X2); the model defines a plane over the input space, and the fitted plane minimizes the sum of squared vertical distances to the data points. (Figure 3.1, “3.2 Linear Regression Models and Least Squares”, p. 45.)]
Multivariate Linear Regression: Model
For the second-hand car example, with additional input features beyond mileage, the matrix form $\mathbf{y} = \mathbf{X}\boldsymbol{\beta}$ becomes (only the first and last training samples are shown):
$$\begin{bmatrix} 16{,}000 \\ \vdots \\ 13{,}500 \end{bmatrix} = \begin{bmatrix} 1 & 25{,}000 & 3 & 6 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 45{,}000 & 3 & 12 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \vdots \\ \beta_p \end{bmatrix}$$
Multivariate Linear Regression: Cost Function
How does the learning algorithm select the linear hypothesis / model? We need a cost function…

Cost Function – Sum-of-Squared Residuals
$$\mathrm{SSR}(\boldsymbol{\beta}) = \frac{1}{N}\,(\mathbf{y} - \hat{\mathbf{y}})^\top(\mathbf{y} - \hat{\mathbf{y}}) = \frac{1}{N}\,(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$

[Figure: the true output values sit above and below the fitted plane; the SSR aggregates the squared vertical errors. (Figure 3.1, “3.2 Linear Regression Models and Least Squares”, p. 45.)]
Multivariate Linear Regression: Learning Algorithm
Learning Algorithm
We should select the linear regression model parameters that minimize the sum-of-squared residuals as follows:
$$\boldsymbol{\beta}^* = \arg\min_{\boldsymbol{\beta}} \mathrm{SSR}(\boldsymbol{\beta}) = \arg\min_{\boldsymbol{\beta}} \frac{1}{N}\,(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$
This leads to the optimal model parameters:
$$\boldsymbol{\beta}^* = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$$

New Predictions
The new response $\hat{y}$ associated with a new data point $\mathbf{x}$ is now given by:
$$\hat{y} = \mathbf{x}^\top\boldsymbol{\beta}^* = \mathbf{x}^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$$
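A minimal sketch of this closed-form solution in code. Only the first and last rows of the multi-feature car data appear on the slide, so the middle rows and the new data point below are assumed values for illustration.

```python
import numpy as np

# Design matrix X: a column of ones plus the feature columns.
# Only the first and last rows appear on the slide; the middle rows are placeholders.
X = np.array([[1, 25_000, 3, 6],
              [1, 105_000, 5, 9],   # assumed values for illustration
              [1, 120_000, 7, 10],  # assumed values for illustration
              [1, 140_000, 8, 11],  # assumed values for illustration
              [1, 45_000, 3, 12]], dtype=float)
y = np.array([16_000, 11_500, 6_000, 3_000, 13_500], dtype=float)

# Normal equations: beta* = (X^T X)^(-1) X^T y
beta = np.linalg.solve(X.T @ X, X.T @ y)
print("beta* =", beta)

# Prediction for a new data point x (including the leading 1 for the intercept)
x_new = np.array([1, 50_000, 4, 7], dtype=float)  # assumed feature values
print("predicted cost:", x_new @ beta)
```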
Polynomial Regression
A polynomial model in one variable, $y = \beta_0 + \beta_1 x + \beta_2 x^2$, can be rewritten as a linear model in two variables by renaming the features $x_1 = x$ and $x_2 = x^2$, giving $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$. We can then apply multiple linear regression to the converted problem exactly as before.

Note that the new features $x_1$ and $x_2$ live on very different scales; we will see later the importance of scaling them!
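A short sketch of this feature-renaming trick, reusing the same least-squares machinery; the toy data are invented for illustration.

```python
import numpy as np

# Toy one-dimensional data (invented for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.2, 5.1, 10.3, 17.2])   # roughly quadratic in x

# Rename features: x1 = x, x2 = x^2, then fit a *linear* model in (x1, x2)
X = np.column_stack([np.ones_like(x), x, x ** 2])
beta = np.linalg.solve(X.T @ X, X.T @ y)
print("beta* =", beta)   # approximately recovers y ≈ 1 + 0·x + 1·x^2

# Note: the scale difference between x and x^2 grows quickly with x,
# which is why feature scaling matters for larger-degree polynomials.
```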
The Problem of Overfitting
Does a higher-order model produce better results compared to a lower-order one?

[Figure: three models fitted to noisy samples of a sine wave. An under-fitting model is too constrained to explain the data; a good-fitting model strikes the best balance and captures the true sine wave; an over-fitting model (e.g. a 9th-order polynomial) fits the noise in the training data and fails to predict the true wave accurately.]
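A minimal sketch reproducing this kind of comparison on synthetic sine-wave data; the data and the degree choices are illustrative assumptions, not the lecture's figure.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 2 * np.pi, 15))
y_train = np.sin(x_train) + 0.2 * rng.standard_normal(x_train.size)   # noisy sine samples
x_test = np.linspace(0, 2 * np.pi, 100)
y_test = np.sin(x_test)                                               # the true wave

for degree in (1, 3, 9):   # intended as under-fitting, good fit, over-fitting
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train.reshape(-1, 1), y_train)
    test_error = np.mean((model.predict(x_test.reshape(-1, 1)) - y_test) ** 2)
    print(f"degree {degree}: test MSE = {test_error:.3f}")
```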
Bias-Variance Tradeoff
The use of a small hypothesis space – such as one consisting of linear models – can lead to underfitting to the training data. It is also known as a high-bias setting because we are “biasing” the learning algorithm – due to possible preconceptions about the data – to choose a very constrained model.

Conversely, the use of a large hypothesis space – such as one consisting of high-order polynomials – can lead to overfitting to the training data. It is typically known as a high-variance setting.

Both a high-bias and a high-variance setting can lead to poor prediction performance in the presence of new data.

There is typically a bias-variance tradeoff in the sense that one cannot achieve both low bias and low variance simultaneously.
The Problem of Overfitting
Does a higher-order polynomial model (e.g. $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4$) produce better results compared to a lower-order one? One way to keep such a model in check is:
• Regularization
  o This involves automatic reduction of the magnitude of some of the model parameters.
  o It can work well in problems involving various features, with each one contributing a bit to the prediction.
5. Regularization

Regularization
Regularization techniques can address over-fitting by penalizing certain large parameter values.

Without regularization:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4$$
Parameter selection:
$$\boldsymbol{\beta}^* = \arg\min_{\boldsymbol{\beta}} \frac{1}{N}\,(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$

With regularization:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4$$
Parameter selection with penalization – the high-order coefficients are multiplied by very large numbers, so the minimization now selects small values for them:
$$\boldsymbol{\beta}^* = \arg\min_{\boldsymbol{\beta}} \frac{1}{N}\,(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + 10^4\beta_3^2 + 10^4\beta_4^2$$
Regularization
We do not want manual tuning of the importance of the model parameters. Instead, a single regularization factor penalizes all the parameters, forcing them to be small.

Without regularization:
$$(\alpha^*, \beta^*) = \arg\min_{\alpha, \beta} \mathrm{SSR}(\alpha, \beta) = \arg\min_{\alpha, \beta} \frac{1}{N}\sum_{i=1}^{N}\left(y_i - (\alpha + \beta x_i)\right)^2$$

With regularization ($\lambda$ is the regularization factor):
$$(\alpha^*, \beta^*) = \arg\min_{\alpha, \beta} \frac{1}{N}\sum_{i=1}^{N}\left(y_i - (\alpha + \beta x_i)\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2$$

In matrix form:
$$\boldsymbol{\beta}^* = \arg\min_{\boldsymbol{\beta}} \frac{1}{N}\,(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \quad\text{vs.}\quad \boldsymbol{\beta}^* = \arg\min_{\boldsymbol{\beta}} \frac{1}{N}\,(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda\,\boldsymbol{\beta}^\top\boldsymbol{\beta}$$

Setting the derivative of the (regularized) SSR with respect to $\boldsymbol{\beta}$ to zero gives the optimal parameters:
$$\boldsymbol{\beta}^* = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} \quad\text{vs.}\quad \boldsymbol{\beta}^* = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y} \qquad (\mathbf{I}\ \text{is the identity matrix})$$

and the corresponding predictions:
$$\hat{y} = \mathbf{x}^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} \quad\text{vs.}\quad \hat{y} = \mathbf{x}^\top(\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}$$

NB: the matrix $\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I}$ is always invertible, whereas the matrix $\mathbf{X}^\top\mathbf{X}$ may not be invertible (e.g. if the number of features is higher than the number of examples / samples).
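A minimal sketch of the ridge closed form next to the unregularized one; the synthetic data and the choice λ = 1.0 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 20, 3
X = np.column_stack([np.ones(N), rng.standard_normal((N, p))])  # design matrix with intercept
beta_true = np.array([1.0, 2.0, -3.0, 0.5])
y = X @ beta_true + 0.1 * rng.standard_normal(N)

lam = 1.0                                   # regularization factor (arbitrary for this sketch)
I = np.eye(X.shape[1])

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)              # (X^T X)^(-1) X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)  # (X^T X + lambda I)^(-1) X^T y
print("OLS:  ", beta_ols)
print("Ridge:", beta_ridge)                 # shrunk towards zero relative to OLS
```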
Regularization: Ridge Regression
Parameter estimates:
$$\boldsymbol{\beta}^* = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}$$
• Very small $\lambda$ ⇒ the optimal parameters for ridge regression correspond to those of simple linear regression.
• Very large $\lambda$ ⇒ the optimal parameters for ridge regression become approximately equal to zero.

$\lambda$ is a hyper-parameter. Ridge regression penalizes the squared L2 norm of $\boldsymbol{\beta}$, whereas the LASSO penalizes the L1 norm of $\boldsymbol{\beta}$:
$$\boldsymbol{\beta}^*_{\text{ridge}} = \arg\min_{\boldsymbol{\beta}} \frac{1}{N}\,(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda\,\|\boldsymbol{\beta}\|_2^2 \qquad \boldsymbol{\beta}^*_{\text{LASSO}} = \arg\min_{\boldsymbol{\beta}} \frac{1}{N}\,(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda\,\|\boldsymbol{\beta}\|_1$$
The LASSO tends to drive some coefficients exactly to zero.
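A quick numerical check of the two λ limits above, reusing the ridge closed form; the synthetic data and the λ values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(30), rng.standard_normal((30, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.standard_normal(30)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
for lam in (1e-6, 1.0, 1e6):               # very small, moderate, very large lambda
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
    print(f"lambda = {lam:g}:", np.round(beta_ridge, 4))
print("OLS solution:  ", np.round(beta_ols, 4))
# lambda -> 0 recovers the OLS solution; lambda -> infinity drives the parameters towards 0.
```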
LASSO vs Ridge Regression
The ridge regression problem can be re-expressed in constrained form as follows:
$$\boldsymbol{\beta}^* = \arg\min_{\boldsymbol{\beta}} \frac{1}{N}\,(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \quad \text{subject to} \quad \|\boldsymbol{\beta}\|_2^2 \leq s$$
and likewise the LASSO with the constraint $\|\boldsymbol{\beta}\|_1 \leq s$, where the value of $s$ depends on the value of the hyper-parameter $\lambda$.

[Figure: contours of the SSR around the simple linear regression solution, together with the circular (ridge) and diamond-shaped (LASSO) constraint regions. The LASSO and ridge regression coefficient estimates are given by the first point at which the ellipse contacts the constraint region. Because the LASSO region has corners on the axes, contact often occurs where some coefficients are exactly zero; sparsity is thus promoted and the LASSO acts as a feature selector, whereas the ridge solution is generally not sparse.]
LASSO vs Ridge Regression
Advantage:
• LASSO has an advantage over ridge regression, in that it produces simpler and more interpretable models that involve only a subset of the predictors (it identifies the relevant features more clearly).
Similarity:
• LASSO leads to qualitatively similar behavior to ridge regression, in that as λ increases, the variance decreases and the bias increases.
• Cross-validation can be used in order to determine which approach is better on a particular data set.
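A small, hypothetical comparison of the two penalties with scikit-learn; the synthetic data are invented, and the `alpha` arguments of `Ridge` and `Lasso` play the role of λ.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
beta_true = np.zeros(10)
beta_true[:3] = [3.0, -2.0, 1.5]             # only 3 of the 10 predictors matter
y = X @ beta_true + 0.1 * rng.standard_normal(50)

ridge = Ridge(alpha=1.0).fit(X, y)           # L2 penalty: shrinks coefficients
lasso = Lasso(alpha=0.1).fit(X, y)           # L1 penalty: sets some coefficients to zero
print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))
print("lasso zeros:", int(np.sum(lasso.coef_ == 0)), "of 10")
```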
But how do we select the regularization term λ? Cross-validation!
72
Training Data
DataSet
Learning Phase
Final test
Test Set ~20%
6. Cross-Validation
Cross Validation: K-fold CV
Procedure
• Partition the entire training set T into K subsets T1, T2, …, TK.
• Calculate the error errork on subset Tk based on a model learnt on the training set T except subset Tk.
• Calculate the overall error corresponding to the average of the individual errors errork (k = 1, …, K).
• Choose the hyper-parameter value (e.g. the regularization term λ) leading to the smallest overall error.
• The model associated with such an optimal hyper-parameter value λ* is then retained.

[Figure: in each trial, one of the K folds is held out for testing while the remaining folds are used for training; the curve of the overall CV error against λ is used to pick λ*.]
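A minimal sketch of this procedure for selecting the ridge regularization term λ with scikit-learn; the synthetic data and the candidate λ grid are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + 0.3 * rng.standard_normal(100)

folds = KFold(n_splits=5, shuffle=True, random_state=0)   # K = 5 subsets T1..T5
candidate_lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]

overall_errors = []
for lam in candidate_lambdas:
    # For each fold Tk, fit on T \ Tk and measure the error on Tk; then average.
    mse = -cross_val_score(Ridge(alpha=lam), X, y,
                           cv=folds, scoring="neg_mean_squared_error").mean()
    overall_errors.append(mse)

best = candidate_lambdas[int(np.argmin(overall_errors))]
print("overall CV errors:", np.round(overall_errors, 4))
print("chosen lambda* =", best)
```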