Lecture2 Slides 1
● Last ML lectures:
  ● Unsupervised learning: clustering and the k-means algorithm
  ● Supervised learning:
    ● Linear regression using SGD
    ● Classification using the perceptron algorithm
f(w, x) = ŷ = f(Σ_{j=1}^{D} w_j x_j)
What we have accomplished so far
● Supervised Learning
● Labeled training data: a table with one row per data point (i) and columns Feature 1 (x₁), Feature 2 (x₂), …, Feature D (x_D), and Output (y)
● Pictorial representation: [figure: features x₁, x₂, …, x_D weighted by w₁, w₂, …, w_D into the output ŷ]
● Linear models for regression/classification:
f(w, x) = ŷ = f(w₁x₁ + w₂x₂ + ⋯ + w_D x_D)
f(w, x) = ŷ = f(Σ_{j=1}^{D} w_j x_j)
● In vector form: f(w, x) = ŷ = f(wᵀx), where w is the parameters vector and x is the features vector
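The linear model above can be sketched in a few lines of numpy. This is a minimal illustration of ŷ = f(wᵀx); the weight and feature values below are hypothetical, and f is taken as the identity (as in regression).

```python
import numpy as np

def linear_model(w, x, f=lambda a: a):
    """Weighted linear combination of features, passed through f."""
    return f(np.dot(w, x))

w = np.array([0.5, -1.0, 2.0])   # parameter vector (hypothetical values)
x = np.array([1.0, 2.0, 3.0])    # feature vector (hypothetical values)
y_hat = linear_model(w, x)       # 0.5*1 + (-1)*2 + 2*3 = 4.5
```

Swapping the identity f for a nonlinearity such as sign turns the same expression into a classifier, as the following slides do.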
Classification Intuition: Housing Market
● Data on 3 features: square footage, # bedrooms, house age
● Classification label: location (UK vs. non-UK house)
[figure: bias input 1 and features x₁, x₂, x₃ weighted by w₁, w₂, w₃, w₄ into the output ŷ]
f(w, x) = ŷ = sign(w₁·1 + w₂x₁ + w₃x₂ + w₄x₃)
ŷ = sign(wᵀx)
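A minimal numpy sketch of this classifier, with the bias folded in as a constant first feature (x₀ = 1) as on the slide. The weight values, feature values, and the +1/−1 label meaning are all hypothetical, purely for illustration.

```python
import numpy as np

def classify(w, features):
    x = np.concatenate(([1.0], features))  # prepend the bias input 1
    return np.sign(np.dot(w, x))           # say +1 = UK, -1 = non-UK

w = np.array([-2.0, 0.001, 0.5, -0.05])    # [w1 (bias), w2, w3, w4], made up
house = np.array([1500.0, 3.0, 10.0])      # sq. footage, # bedrooms, age
label = classify(w, house)                 # -2 + 1.5 + 1.5 - 0.5 = 0.5 -> +1
```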
How can we extend this to 3 classes?
● Data on 3 features: square footage, # bedrooms, house age
● Classification label: location (UK vs. non-UK house)
● Can we predict ŷ for other locations (more than two classes) using the same features?
[figure: the same network, with bias input 1 and features x₁, x₂, x₃ into a single sign output ŷ]
f(w, x) = ŷ = sign(w₁·1 + w₂x₁ + w₃x₂ + w₄x₃)
ŷ = sign(wᵀx)
Neural networks: single layer perceptron
● Can view the simple linear perceptron, with its loss function, in the form of a weighted linear combination with a nonlinear activation function
● Inspired by a biological neuron, which has axons, dendrites and a nucleus, and which fires when a certain threshold is crossed
● The perceptron is very limited, and can only model linear decision boundaries
● More complex nonlinear boundaries are required in practical ML
[figure: inputs x₁, x₂, …, x_D weighted by w₁, w₂, …, w_D into the output y = max(0, wᵀx)]
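The single neuron on the slide, y = max(0, wᵀx), can be sketched directly; the weight and input values below are made up to show one active and one clipped output.

```python
import numpy as np

def perceptron(w, x):
    """Weighted linear combination followed by the ReLU nonlinearity."""
    return np.maximum(0.0, np.dot(w, x))

w = np.array([1.0, -2.0, 0.5])                 # hypothetical weights
y1 = perceptron(w, np.array([2.0, 1.0, 2.0]))  # 2 - 2 + 1 = 1.0
y2 = perceptron(w, np.array([0.0, 1.0, 0.0]))  # max(0, -2) = 0.0
```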
Multi-layer perceptron
[figure-only slide]
Activation nonlinearities
[figure: common activation functions and their derivatives]
● Wide range of activation functions in use (logistic, tanh, softplus, ReLU): the only criteria are that they must be nonlinear and should ideally be differentiable (almost everywhere)
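The four activations named on the slide, with their derivatives, can be written out directly in numpy (the ReLU derivative at 0 is taken as 0 by convention, since ReLU is only differentiable almost everywhere):

```python
import numpy as np

# Activation functions
def logistic(a):  return 1.0 / (1.0 + np.exp(-a))
def tanh(a):      return np.tanh(a)
def softplus(a):  return np.log1p(np.exp(a))   # log(1 + e^a), numerically stable
def relu(a):      return np.maximum(0.0, a)

# Derivatives (defined almost everywhere, as required)
def d_logistic(a): s = logistic(a); return s * (1.0 - s)
def d_tanh(a):     return 1.0 - np.tanh(a) ** 2
def d_softplus(a): return logistic(a)           # softplus' is the logistic
def d_relu(a):     return (np.asarray(a) > 0).astype(float)
```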
Neural networks: deep learning
● Extend the perceptron to two or more layers of weighted combinations
(linear layers) with nonlinear activations connecting them
● Intermediate nodes are known as hidden neurons, whose output z is fed into the output layer, which produces the final output y
● Modern deep learning algorithms usually have multiple hidden neurons, in multiple additional (hidden) layers
● Greatly extends the complexity of decision boundaries (piecewise-linear boundaries)
[figure: inputs x₁, …, x_D connected to hidden neurons z₁, …, z_M by weights w_{j,m}, and hidden neurons connected to the output y by weights v₁, …, v_M]
z = max(0, Wᵀx), y = max(0, vᵀz)
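The two-layer forward pass z = max(0, Wᵀx), y = max(0, vᵀz) is a couple of matrix products. A minimal sketch, with random placeholder weights just to show the shapes and data flow:

```python
import numpy as np

def forward(W, v, x):
    """Two-layer network: W is D x M (input-to-hidden), v is length M."""
    z = np.maximum(0.0, W.T @ x)   # hidden activations, length M
    y = np.maximum(0.0, v @ z)     # scalar output
    return y

rng = np.random.default_rng(0)
D, M = 4, 3                        # illustrative sizes
W = rng.standard_normal((D, M))    # placeholder weights
v = rng.standard_normal(M)
x = rng.standard_normal(D)         # one input vector
y = forward(W, v, x)               # non-negative scalar, since ReLU clips at 0
```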
Neural networks: deep learning
● Example of a multilayer neural network:
[figure: bias input 1 and inputs x₁, x₂ connected to hidden neurons z₁, z₂, which feed the output y]
z₁ = f(w₁,₁·1 + w₂,₁x₁ + w₃,₁x₂)
z₂ = f(w₁,₂·1 + w₂,₂x₁ + w₃,₂x₂)
i.e. z = f(Wᵀx), with Wᵀ = [ w₁,₁ w₂,₁ w₃,₁ ; w₁,₂ w₂,₂ w₃,₂ ]
y = f(v₁·1 + v₂z₁ + v₃z₂)
y = f(vᵀz)
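This worked example can be checked numerically. The weight values below are hypothetical, and f is taken as ReLU purely for illustration; the bias is written as a constant input of 1, as in the equations.

```python
import numpy as np

f = lambda a: np.maximum(0.0, a)    # illustrative choice of activation

# W^T has one row per hidden neuron: [w_{1,m}, w_{2,m}, w_{3,m}]
WT = np.array([[0.1, 0.2, 0.3],     # weights into z1 (hypothetical)
               [0.4, 0.5, 0.6]])    # weights into z2 (hypothetical)
v = np.array([0.5, -1.0, 1.0])      # [v1 (bias), v2, v3] (hypothetical)

x1, x2 = 2.0, 1.0
x = np.array([1.0, x1, x2])         # prepend the bias input 1
z = f(WT @ x)                       # z = [0.8, 2.0]
y = f(v @ np.concatenate(([1.0], z)))   # 0.5 - 0.8 + 2.0 = 1.7
```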
Neural networks: weights consideration
● Fully connected networks (every node in each layer connected to every
node in the previous layer) means rapid growth in the number of weights
● Weight-sharing, forcing certain connections between nodes to have the
same weight, is sensible for certain special applications
● Widely-used example (particularly suited to ordered data: images or time
series) is convolutional sharing
[figure: convolutional weight-sharing network, with the same weights w₁, w₂ on every window of adjacent inputs]
z₁ = max(0, wᵀ[x₁ x₂]ᵀ)
z₂ = max(0, wᵀ[x₂ x₃]ᵀ)
…
z_M = max(0, wᵀ[x_{D-1} x_D]ᵀ)
y = max(0, vᵀz)
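The convolutional sharing above amounts to sliding the same two weights w = [w₁, w₂] along the input, so the hidden layer needs 2 parameters instead of D·M. A small numpy sketch with made-up values:

```python
import numpy as np

def conv_hidden(w, x):
    """z_m = max(0, w^T [x_m, x_{m+1}]^T), same w for every window."""
    return np.array([np.maximum(0.0, w @ x[m:m + 2])
                     for m in range(len(x) - 1)])

w = np.array([1.0, -1.0])            # the shared weights (hypothetical)
x = np.array([3.0, 1.0, 4.0, 1.0])   # D = 4 inputs (hypothetical)
z = conv_hidden(w, x)                # windows: 3-1=2, 1-4=-3->0, 4-1=3
```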
Deep neural logic networks: example
● With the sign activation function (similar to a step activation), logical neural networks have simple weights
● Use these to implement the basic logical functions "and", "or" and "not", encoding True as +1 and False as −1, e.g. f_and(x₁, x₂) = sign(x₁ + x₂ − 1)
● Any complex logical function can be implemented by composing these basic neurons together
● Example: the exclusive not-or ("xnor") function y = (u ∧ v) ∨ (¬u ∧ ¬v), constructed using the basic logical neural networks
● In this implementation, we need two hidden layers, z₁ = ¬v, z₂ = ¬u and then z₃ = ¬u ∧ ¬v, z₄ = u ∧ v, to compute the intermediate terms in the expression
● Xnor is an example of a simple function which cannot be computed using a single-layer linear neural network
[figure: two-hidden-layer sign network computing xnor from inputs u, v]
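The xnor construction can be written out as the composition of sign neurons, with True = +1 and False = −1 as above: the first hidden layer computes the negations, the second the "and" terms, and the output neuron the "or".

```python
import numpy as np

def xnor(u, v):
    """Two-hidden-layer sign network for y = (u and v) or (not u and not v)."""
    z1 = np.sign(-v)               # z1 = not v
    z2 = np.sign(-u)               # z2 = not u
    z3 = np.sign(z1 + z2 - 1)      # z3 = (not u) and (not v)
    z4 = np.sign(u + v - 1)        # z4 = u and v
    return np.sign(z3 + z4 + 1)    # y  = z3 or z4

# Truth table: output is +1 exactly when u == v
for u in (+1, -1):
    for v in (+1, -1):
        print(u, v, int(xnor(u, v)))
```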
XOR
[figure-only slide]
To recap
● We learned the basic concepts of a neural network
● Extended the concept to multiple (hidden) layers: deep learning
● The problem of the growing number of weights, and weight sharing: convolutional neural networks
Further Reading
● PRML, Section 5.1
● H&T, Section 11.3