
Attendance Code: xxxxx

● Last ML lectures:
  ● Unsupervised learning: clustering and the k-means algorithm
  ● Supervised learning:
    ● Linear regression using SGD
    ● Classification using the perceptron algorithm

● This lecture: neural networks and deep learning
What we have accomplished so far
● Supervised Learning
● Labeled training data: each data point i has features x1, x2, …, xD and an output label y

  Data point (i) | Feature 1 (x1) | Feature 2 (x2) | … | Feature D (xD) | Output (y)

[Figure: pictorial representation — features x1, x2, …, xD are weighted by w1, w2, …, wD and combined to produce the output ŷ]

● Linear models for regression/classification:

  f(w, x) = ŷ = f(w1·x1 + w2·x2 + … + wD·xD)

  f(w, x) = ŷ = f( Σ j=1..D  wj·xj )

  ŷ = f(wᵀx),  where w is the parameter vector and x is the feature vector
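As a minimal illustration of ŷ = f(wᵀx), the sketch below (NumPy, with made-up weights and features) computes the weighted combination and applies a chosen f — the identity for regression or sign for classification; none of the numbers come from the lecture.

```python
import numpy as np

# Linear model: y_hat = f(w^T x), i.e. f applied to the sum of w_j * x_j over the D features.
def linear_model(w, x, f=lambda s: s):
    """Weighted combination of the features followed by an (optional) nonlinearity f."""
    return f(np.dot(w, x))

# Made-up example with D = 3 features.
w = np.array([0.5, -1.2, 2.0])    # parameter vector w
x = np.array([1.0, 0.3, -0.7])    # feature vector x

print(linear_model(w, x))             # regression: f is the identity
print(linear_model(w, x, f=np.sign))  # classification: f = sign
```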
Classification Intuition: Housing Market
● Data on 3 features: square footage, # bedrooms, house age
● Classification label: location (UK vs. non-UK house)

[Figure: a constant input 1 and features x1, x2, x3, with weights w1, w2, w3, w4, feeding a single output ŷ]

f(w, x) = ŷ = sign(w1·1 + w2·x1 + w3·x2 + w4·x3)

ŷ = sign(wᵀx)
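A rough sketch of the classifier above, assuming NumPy; the constant 1 is prepended to the feature vector so that w1 acts as the bias, and the weights and house values are invented for illustration rather than learned from data.

```python
import numpy as np

def classify_house(w, features):
    """ŷ = sign(wᵀx) with x = [1, x1, x2, x3]; returns +1 for one class (say UK), -1 otherwise."""
    x = np.concatenate(([1.0], features))   # prepend the constant bias input
    return np.sign(w @ x)

# Made-up weights and one made-up house: [square footage, # bedrooms, house age].
w = np.array([-2.0, 0.001, 0.5, 0.05])
house = np.array([900.0, 2.0, 80.0])
print(classify_house(w, house))   # prints 1.0 or -1.0
```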
How can we extend this to 3 classes?
● Data on 3 features: square footage, # bedrooms, house age
● Classification label: location (UK vs. non-UK house)
● Extra data labeled as Dubai houses

● How would you build a classification model to classify between the UK, Dubai, and other locations using the same features?

[Figure: the same single-neuron classifier — constant input 1 and features x1, x2, x3 with weights w1, w2, w3, w4 feeding the output ŷ]

f(w, x) = ŷ = sign(w1·1 + w2·x1 + w3·x2 + w4·x3)

ŷ = sign(wᵀx)
Neural networks: single layer perceptron
● The simple linear perceptron, together with its loss function, can be viewed as a weighted linear combination followed by a nonlinear activation function

[Figure: inputs x1, x2, …, xD with weights w1, w2, …, wD feeding a single output y, where y = max(0, wᵀx)]

● Inspired by a biological neuron, which has axons, dendrites and a nucleus, and fires when a certain threshold is crossed
● The perceptron is very limited, and can only model linear decision boundaries
● More complex nonlinear boundaries are required in practical ML
Multi-layer perceptron
[Figure: multi-layer perceptron diagram]
Activation nonlinearities
[Figure: plots of common activation functions and their derivatives]

● A wide range of activation functions is in use (logistic, tanh, softplus, ReLU): the only requirement is that they must be nonlinear, and they should ideally be differentiable (almost everywhere)
● The ReLU corresponds to the perceptron loss, and the sigmoid to the logistic regression loss
● The ReLU is the most widely used activation; it is exactly zero for half of its input range (so many outputs will be zero)
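The plotted functions are not reproduced here, but the definitions are standard; below is a small sketch of the four activations and their derivatives (conventions such as taking the ReLU derivative to be 0 at the origin are choices made for the example).

```python
import numpy as np

# Common activations and their derivatives (differentiable almost everywhere).
def relu(s):          return np.maximum(0.0, s)
def relu_grad(s):     return (s > 0).astype(float)     # derivative taken as 0 at s = 0

def sigmoid(s):       return 1.0 / (1.0 + np.exp(-s))  # logistic function
def sigmoid_grad(s):  return sigmoid(s) * (1.0 - sigmoid(s))

def tanh_grad(s):     return 1.0 - np.tanh(s) ** 2     # np.tanh is the activation itself

def softplus(s):      return np.log1p(np.exp(s))       # smooth approximation to ReLU
def softplus_grad(s): return sigmoid(s)                # derivative of softplus is the sigmoid

s = np.linspace(-3.0, 3.0, 7)
print(relu(s), relu_grad(s))
print(sigmoid(s), sigmoid_grad(s))
```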
Neural networks: deep learning
● Extend the perceptron to two or more layers of weighted combinations (linear layers) with nonlinear activations connecting them

[Figure: inputs x1, …, xD connected by weights w1,1, …, wD,M to hidden neurons z1, …, zM, which are connected by weights v1, …, vM to the output y, with z = max(0, Wᵀx) and y = max(0, vᵀz)]

● Intermediate nodes are known as hidden neurons, whose outputs z are fed into the output layer, which produces the final output y
● Modern deep learning algorithms usually have multiple hidden neurons, in multiple additional (hidden) layers
● This greatly extends the complexity of the decision boundaries (piecewise linear boundaries)
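A minimal sketch of the forward pass written above, z = max(0, Wᵀx) followed by y = max(0, vᵀz); the layer sizes and random weights are placeholders, not values from the lecture.

```python
import numpy as np

def forward(W, v, x):
    """Two-layer network with ReLU activations: z = max(0, Wᵀx), y = max(0, vᵀz)."""
    z = np.maximum(0.0, W.T @ x)   # M hidden activations
    y = np.maximum(0.0, v @ z)     # single output
    return y, z

rng = np.random.default_rng(0)
D, M = 4, 3                        # 4 input features, 3 hidden neurons (arbitrary sizes)
W = rng.normal(size=(D, M))        # input-to-hidden weights
v = rng.normal(size=M)             # hidden-to-output weights
x = rng.normal(size=D)             # one input example
print(forward(W, v, x))
```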
Neural networks: deep learning
● Example of a multilayer neural network

[Figure: a constant input 1 and features x1, x2 feed hidden neurons z1, z2; a constant 1 and z1, z2 feed the output y]

Hidden layer:
z1 = f(w1,1·1 + w2,1·x1 + w3,1·x2)
z2 = f(w1,2·1 + w2,2·x1 + w3,2·x2)

Wᵀ = [ w1,1  w2,1  w3,1
       w1,2  w2,2  w3,2 ]

z = f(Wᵀx)

Output layer:
y = f(v1·1 + v2·z1 + v3·z2)
y = f(vᵀz)
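The equations above can be checked mechanically; the sketch below mirrors them for two features and two hidden units, with an explicit constant 1 for the bias terms and a generic activation f (tanh is used purely as a placeholder, and all numbers are made up).

```python
import numpy as np

def layer(weights_T, inputs, f):
    """One layer: prepend the constant 1, then apply f(Wᵀ · [1, inputs])."""
    x_tilde = np.concatenate(([1.0], inputs))
    return f(weights_T @ x_tilde)

f = np.tanh                                 # placeholder choice of activation

# Wᵀ is 2x3: row k holds [w_{1,k}, w_{2,k}, w_{3,k}] for hidden unit z_k.
W_T = np.array([[0.1,  0.4, -0.3],
                [0.2, -0.5,  0.6]])
v = np.array([0.05, 1.0, -1.0])             # [v1, v2, v3] for the output neuron

x = np.array([0.7, -1.2])                   # features x1, x2 (made-up values)
z = layer(W_T, x, f)                        # z1 = f(w11 + w21*x1 + w31*x2), z2 likewise
y = layer(v.reshape(1, -1), z, f)[0]        # y  = f(v1 + v2*z1 + v3*z2)
print(z, y)
```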
Neural networks: weights consideration
● Fully connected networks (every node in each layer connected to every node in the previous layer) mean rapid growth in the number of weights
● Weight-sharing, forcing certain connections between nodes to have the same weight, is sensible for certain special applications
● A widely used example (particularly suited to ordered data such as images or time series) is convolutional sharing, shown in the figure and sketched in code below

[Figure: inputs x1, …, xD; each hidden neuron zk is connected to two adjacent inputs with the same shared weights w1, w2; hidden neurons z1, …, zM connect to the output y with weights v1, …, vM]

z1 = max(0, wᵀ[x1 x2]ᵀ)
z2 = max(0, wᵀ[x2 x3]ᵀ)
…
zM = max(0, wᵀ[xD−1 xD]ᵀ)

y = max(0, vᵀz)
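A rough sketch of the convolutional sharing above: every hidden neuron applies the same pair of weights (w1, w2) to a window of two adjacent inputs. Shapes, weights and inputs are illustrative only.

```python
import numpy as np

def conv_layer(w, v, x):
    """Shared weights: z_k = max(0, w1*x_k + w2*x_{k+1}), then y = max(0, vᵀz)."""
    windows = np.stack([x[:-1], x[1:]], axis=1)   # all pairs of adjacent inputs
    z = np.maximum(0.0, windows @ w)              # the same w is reused for every window
    y = np.maximum(0.0, v @ z)
    return y, z

w = np.array([1.0, -1.0])                 # the single shared weight pair (w1, w2)
x = np.array([0.2, 0.9, 0.4, 0.4, 1.5])   # D = 5 inputs (made-up values), so M = 4
v = np.ones(len(x) - 1)                   # hidden-to-output weights v1..vM
print(conv_layer(w, v, x))
```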
Deep neural logic networks: example
● With the sign activation function (similar to a step activation), logical neural networks have simple weights
● Use these to implement the basic logical functions "and", "or" and "not", encoding True as +1 and False as −1
● Any complex logical function can be implemented by composing these basic neurons together

True = +1, False = −1

[Figure: three single-neuron networks — "and" (bias weight −1, input weights +1, +1), "or" (bias weight +1, input weights +1, +1) and "not" (bias weight 0, input weight −1)]

y = x1 ∧ x2 (and):   fand(x1, x2) = sign(x1 + x2 − 1)
y = x1 ∨ x2 (or):    for(x1, x2) = sign(x1 + x2 + 1)
y = ¬x1 (not):       fnot(x) = sign(−x)
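These three neurons translate directly into code; a small sketch using the ±1 encoding and the exact weights from the slide (the function names are ours, not the lecture's):

```python
import numpy as np

# Single sign-activation neurons for the basic gates (True = +1, False = -1).
def f_and(x1, x2): return np.sign(x1 + x2 - 1)   # bias weight -1, input weights +1, +1
def f_or(x1, x2):  return np.sign(x1 + x2 + 1)   # bias weight +1, input weights +1, +1
def f_not(x):      return np.sign(-x)            # bias weight 0, input weight -1

for a in (+1, -1):
    for b in (+1, -1):
        print(a, b, f_and(a, b), f_or(a, b), f_not(a))
```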
Deep neural logic networks: XNOR
● The exclusive not-or ("xnor") function, y = (u ∧ v) ∨ (¬u ∧ ¬v), can be constructed by composing the basic logical neural networks
● In this implementation we need two hidden layers — z1, z2 and z3, z4 — to compute the intermediate terms in the expression
● A simple example of a function which cannot be computed using a single-layer linear neural network

[Figure: inputs u, v; first hidden layer z1 = ¬v, z2 = ¬u; second hidden layer z3 = ¬u ∧ ¬v, z4 = u ∧ v; output y = z3 ∨ z4]
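Composing the basic gate neurons reproduces the xnor network above; a sketch that reuses the and/or/not neurons from the previous slide (helper names are hypothetical):

```python
import numpy as np

def f_and(x1, x2): return np.sign(x1 + x2 - 1)
def f_or(x1, x2):  return np.sign(x1 + x2 + 1)
def f_not(x):      return np.sign(-x)

def f_xnor(u, v):
    """y = (u AND v) OR (NOT u AND NOT v), built from two hidden layers of logic neurons."""
    z1, z2 = f_not(v), f_not(u)    # first hidden layer
    z3 = f_and(z1, z2)             # second hidden layer: NOT u AND NOT v
    z4 = f_and(u, v)               #                      u AND v
    return f_or(z3, z4)            # output layer

for u in (+1, -1):
    for v in (+1, -1):
        print(u, v, f_xnor(u, v))  # +1 exactly when u == v
```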
XOR
[Figure: XOR example]
To recap
● We learned the basic concepts of a neural network
● Extended the concept to multiple (hidden) layers: deep learning
● Problem of the number of weights & weight sharing: convolutional neural networks

● Next: How to optimize the weights of a neural network


● Pre-Reading: Lecture Notes, Section 14

Further Reading
● PRML, Section 5.1
● H&T, Section 11.3
