MLP_1122_20240509_ch10_DeepNN


Deep Feedforward Networks

Sources:
Ch. 6, "Deep Learning" textbook
by Goodfellow et al.
Ch 10, Introduction to Artificial Neural Networks
Deep Neural Network
l Deep learning -> Deep neural network
l Deep feedforward networks, also often called
feedforward neural networks, or multilayer perceptrons
(MLPs), are the quintessential deep learning models.
• The goal of a feedforward network is to approximate some
function f∗.
• For example, for a classifier, y = f∗ (x) maps an input x to a
category y.
• A feedforward network defines a mapping y = f (x; θ) and learns
the value of the parameters θ that result in the best function
approximation.
• There are no feedback connections in which outputs of the
model are fed back into itself.
• When feedforward neural networks are extended to include
feedback connections, they are called recurrent neural
networks.

2
Biological Neuron and Perceptrons

Frank Rosenblatt, 1957

[Figure: a biological neuron (left) and an artificial neuron, the Perceptron (right) - a linear classifier]
Activation Functions
l We can think of the layer as consisting of
many units that act in parallel, each
representing a vector-to-scalar function.
l Each unit resembles a neuron in the sense
that it receives input from many other units
and computes its own activation value.

4
Activation Functions
l Nonlinearity of neural network
l Binary step function

l Sigmoid function
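The formulas were shown on the slide as images; as a quick illustration (not the lecture's own code), here is a minimal NumPy sketch of the binary step and sigmoid activations, with made-up sample inputs:

```python
import numpy as np

def binary_step(z):
    # Outputs 1 where the input is non-negative, 0 otherwise
    return np.where(z >= 0, 1.0, 0.0)

def sigmoid(z):
    # Smooth, differentiable squashing function with values in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 2.0])
print(binary_step(z))  # [0. 1. 1.]
print(sigmoid(z))      # approximately [0.119 0.5 0.881]
```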

5
Receptive Fields of Lateral Geniculate
and Primary Visual Cortex
http://www.geog.ucsb.edu/~kclarke
7
Human Cortical Visual Regions: V1, V2, V3, V4, V5 (MT)

http://raymond.rodriguez1.free.fr/Documents/Organisme-A/Vision
Hubel/Wiesel Architecture and Multi-layer Perceptrons

[Figure: Hubel and Wiesel's architecture (left) and a multi-layer perceptron (right), with its depth and width labeled - a non-linear classifier]
Perceptron
l A Perceptron is simply composed of a single layer of linear
threshold units (LTUs), with each neuron connected to all the inputs.
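As a rough sketch of a single LTU (not code from the textbook), the unit below computes a weighted sum of its inputs and applies a threshold; the weights are chosen by hand for illustration so the unit behaves as a logical AND gate:

```python
import numpy as np

def ltu(x, w, b):
    # Linear threshold unit: weighted sum followed by a step function
    return 1.0 if np.dot(w, x) + b >= 0 else 0.0

# Hand-picked weights that make the LTU compute AND of two binary inputs
w, b = np.array([1.0, 1.0]), -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, ltu(np.array(x, dtype=float), w, b))
```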

10
Multi-layer Perceptrons (MLP)
l When an ANN has two or more hidden layers, it is called a
deep neural network (DNN).

11
Multi-layer Perceptrons
Backpropagation MLP
l Steps of the backpropagation algorithm for each training
instance:
• Feed the training instance to the network to compute the output
of every neuron in each consecutive layer (this is the forward
pass, just like when making predictions).
• Measure the network’s output error (i.e., the difference between
the desired output and the actual output of the network).
• Compute how much each neuron in the last hidden layer
contributed to each output neuron’s error.
• Proceed to measure how much of these error contributions
came from each neuron in the previous hidden layer.
• Repeat the above two steps until the algorithm reaches the
input layer. This reverse pass efficiently measures the error
gradient across all the connection weights in the network by
propagating the error gradient backward in the network (hence
the name of the algorithm).

13
Backpropagation MLP
l Steps of the backpropagation algorithm for each
training instance:
1. first makes a prediction (forward pass)
2. measures the error
3. goes through each layer in reverse to measure the
error contribution from each connection (reverse
pass)
4. finally slightly tweaks the connection weights to
reduce the error (Gradient Descent step).
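As a hedged illustration of these four steps (not the textbook's code), here is a minimal NumPy sketch for one training instance, assuming a tiny 2-3-1 network with sigmoid activations, squared-error loss, and an arbitrary learning rate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Tiny network: 2 inputs -> 3 hidden units -> 1 output, all sigmoid
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
x, y = np.array([1.0, 0.0]), np.array([1.0])
lr = 0.5

# 1. Forward pass: compute every neuron's output
h = sigmoid(W1 @ x + b1)
y_hat = sigmoid(W2 @ h + b2)

# 2. Measure the output error (squared-error loss)
loss = 0.5 * np.sum((y_hat - y) ** 2)
print(loss)

# 3. Reverse pass: propagate the error gradient backward, layer by layer
delta_out = (y_hat - y) * y_hat * (1 - y_hat)   # dL/dz at the output layer
delta_hid = (W2.T @ delta_out) * h * (1 - h)    # dL/dz at the hidden layer

# 4. Gradient Descent step: slightly tweak the weights to reduce the error
W2 -= lr * np.outer(delta_out, h); b2 -= lr * delta_out
W1 -= lr * np.outer(delta_hid, x); b1 -= lr * delta_hid
```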

14
Reverse-mode autodiff
Backpropagation algorithm: Gradient Descent using reverse-
mode autodiff (implemented in TensorFlow)
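For example, TensorFlow exposes reverse-mode autodiff through tf.GradientTape: the forward pass is recorded on a "tape" and replayed backward to get gradients of the cost with respect to the parameters. The tiny linear model and data below are illustrative only:

```python
import tensorflow as tf

w = tf.Variable([[1.0], [2.0]])
b = tf.Variable(0.5)
x = tf.constant([[3.0, 4.0]])
y = tf.constant([[10.0]])

with tf.GradientTape() as tape:
    y_hat = x @ w + b                      # forward pass (recorded on the tape)
    cost = tf.reduce_mean((y_hat - y) ** 2)

grads = tape.gradient(cost, [w, b])        # reverse pass: gradients of the cost
print(grads)
```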

15
6.1 A simple example: learning XOR
l Data:
l Target function:
l Linear model:
l MSE loss function:

l Linear model is NOT able to represent XOR function.

16
A simple example: learning XOR
l Let there be nonlinearity!
l ReLU: rectified linear unit:
l ReLU is applied element-wise to h:

17
A simple example: learning XOR
l Use one hidden layer containing two hidden
units to learn Φ.

18
A simple example: learning XOR
l Complete neural network model:

l Obtain model parameters after training:

l Run the network:
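The sketch below runs the network with the closed-form parameters given for this example in the Deep Learning textbook (W with all entries 1, c = [0, -1], w = [1, -2], b = 0); it reproduces XOR exactly on the four input points:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Parameters from the textbook's closed-form XOR solution
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])   # input -> hidden weights
c = np.array([0.0, -1.0])    # hidden biases
w = np.array([1.0, -2.0])    # hidden -> output weights
b = 0.0

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

h = relu(X @ W + c)          # hidden layer: two ReLU units
y = h @ w + b                # linear output layer
print(y)                     # [0. 1. 1. 0.] -- exactly XOR
```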

19
Major components for ANN
l Architectures
• Layer/neuron numbers,
feedforward/backpropagation
l Cost functions
• MSE, cross-entropy
l Algorithms for updating parameters
• Gradient descent
l Activation functions for output/hidden
layers

21
Activation functions

22
A modern MLP (including ReLU and
softmax) for classification

23
Output units
l Linear Units for Gaussian Output Distributions
• Maximizing the log-likelihood is then equivalent to
minimizing the mean squared error.

l Sigmoid Units for Bernoulli Output Distributions


• Used to predict the value of a binary variable y, in
classification problems with two classes

24
Output units
l Softmax Units for Multinoulli Output Distributions
• Used as the output of a classifier for n classes

• Maximize log-likelihood: log softmax(z)_i = z_i − log Σ_j exp(z_j)
(z_i: the larger the better; log Σ_j exp(z_j): the smaller the better)
• An output saturates to 1 when the corresponding
input zi is maximal and much greater than all other inputs.
• An output can also saturate to 0 when zi is not
maximal and the maximum is much greater.
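A small NumPy illustration of softmax and the log-likelihood term above (the logits are made up for the example):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result sums to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])     # logits for 3 classes
p = softmax(z)
print(p)                           # approximately [0.659 0.242 0.099]

# Log-likelihood of the correct class i: z_i - log(sum_j exp(z_j))
i = 0
log_likelihood = z[i] - np.log(np.sum(np.exp(z)))
print(log_likelihood)              # approximately -0.417
```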

25
Output units
• Softmax is a way to create a form of competition
between the units that participate in it.
• From a neuroscience point of view, lateral inhibition is
believed to exist between nearby neurons, that is,
winner-take-all.

26
Hidden Units
l Design of hidden units is an extremely active
area of research.
l ReLUs are an excellent default choice.
l Although not differentiable at all points, they are still
fine to use with gradient-based learning
algorithms.
• Use left or right derivative, instead.
l Hidden units compute:
• An affine transformation
• An element-wise nonlinear function g(z)
27
Generalizations of ReLUs: Maxout units
l Maxout units:

• It becomes a ReLU when k = 2 and w1 = b1 = 0


• It can learn a piecewise linear, convex activation
function with up to k pieces.
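A minimal sketch of a single maxout unit, assuming its k affine pieces are stored as separate weight vectors; fixing one piece to zero with k = 2 recovers a ReLU, as the slide notes:

```python
import numpy as np

def maxout(x, Ws, bs):
    # One maxout unit: the maximum over k affine functions of the input
    return max(W @ x + b for W, b in zip(Ws, bs))

x = np.array([1.5, -0.5])

# k = 2 with the first affine piece fixed to zero gives max(0, w2.x + b2), i.e. a ReLU
Ws = [np.zeros(2), np.array([1.0, 2.0])]
bs = [0.0, 0.5]
print(maxout(x, Ws, bs))    # max(0, 1.5 - 1.0 + 0.5) = 1.0
```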

28
Other hidden units
l Many other types of hidden units are possible, but are
used less frequently. In general, a wide variety of
differentiable functions perform perfectly well.
• Radial basis function

• Softplus:

• Hard tanh

29
Architecture Design
l Architecture: overall structure of the network
• How many units it should have
• How these units should be connected to each other
• How to choose the depth and width of each layer
l Deeper networks often:
• Use far fewer units per layer and
far fewer parameters
• Generalize to the test set
• Are harder to optimize

30
Implementing MLPs with Keras
l Keras is a high-level Deep Learning API that allows you
to easily build, train, evaluate and execute all sorts of
neural networks.
l Its documentation (or specification) is available at
https://keras.io.

Used in this textbook


31
Implementing MLPs with Keras

1. Using Keras to load the dataset
2. Creating the model using the sequential API
3. Compiling the model
4. Training and evaluating the model
5. Using the model to make predictions

32
Practice: Building an Image Classifier
Using the Sequential API
1. Using Keras to load the dataset
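A minimal sketch of this step, assuming the Fashion MNIST dataset used in the textbook's image-classifier example and a 5,000-image validation split:

```python
import tensorflow as tf

# Load the Fashion MNIST dataset bundled with Keras
(X_train_full, y_train_full), (X_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()

# Scale pixel values to the 0-1 range and hold out the first 5,000 images for validation
X_valid, X_train = X_train_full[:5000] / 255.0, X_train_full[5000:] / 255.0
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
X_test = X_test / 255.0
```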

33
2. Creating the model using the
sequential API
Method 1: adding layers one by one

Method 2: creating the Sequential model
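A sketch of both methods, assuming a Flatten layer, two hidden Dense layers, and a softmax output as in the textbook's example (the layer sizes are illustrative):

```python
import tensorflow as tf

# Method 1: start with an empty Sequential model and add layers one by one
model = tf.keras.Sequential()
model.add(tf.keras.layers.Flatten(input_shape=[28, 28]))
model.add(tf.keras.layers.Dense(300, activation="relu"))
model.add(tf.keras.layers.Dense(100, activation="relu"))
model.add(tf.keras.layers.Dense(10, activation="softmax"))

# Method 2: pass the list of layers directly to the Sequential constructor
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```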

34
2. Creating the model using the
sequential API

35
2. Creating the model using the
sequential API
You can get a model's list of
layers via the layers
attribute, or use the
get_layer() method to
access a layer by name.

All the parameters of a layer can be accessed
using its get_weights()
and set_weights()
methods.

We'll discuss initializers further in Chapter 11, and the full list is at
https://keras.io/api/layers/initializers.
36
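For example, continuing the Sequential model sketched above (the layer index and shapes depend on that architecture):

```python
# Inspect the model's layers and one layer's parameters
print(model.layers)                       # list of Layer objects
hidden1 = model.layers[1]
print(hidden1.name)                       # e.g. "dense"
same_layer = model.get_layer(hidden1.name)  # look a layer up by name

weights, biases = hidden1.get_weights()   # connection weights and bias vector
print(weights.shape, biases.shape)        # (784, 300) (300,) for this architecture
hidden1.set_weights([weights, biases])    # write parameters back
```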
3. Compile the model
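A minimal compile call, assuming integer class labels (hence sparse categorical cross-entropy) and plain SGD:

```python
# Compile: specify the loss, the optimizer, and any extra metrics to track
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd",
              metrics=["accuracy"])
```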

37
4. Training and evaluating the model

The fit() method returns a History object containing the training parameters
(history.params), the list of epochs it went through (history.epoch), and most
importantly a dictionary (history.history) containing the loss and extra metrics it
measured at the end of each epoch on the training set and on the validation
set (if any).
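A sketch of the training call, assuming the training/validation splits created earlier and 30 epochs:

```python
# Train the model, monitoring performance on the validation set
history = model.fit(X_train, y_train, epochs=30,
                    validation_data=(X_valid, y_valid))

print(history.params)          # training parameters
print(history.epoch)           # list of epoch indices
print(history.history.keys())  # per-epoch loss and metrics, e.g. loss, accuracy, val_loss, val_accuracy
```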

38
4. Training and evaluating the model

39
4. Training and evaluating the model
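A sketch of this step, assuming the History object returned by fit() and the held-out test set from the loading step; it plots the learning curves and then evaluates on the test set:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Plot the learning curves stored in the History object
pd.DataFrame(history.history).plot(figsize=(8, 5))
plt.grid(True)
plt.show()

# Evaluate the final model on the test set
test_loss, test_acc = model.evaluate(X_test, y_test)
```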

40
5. Using the model to make predictions

Use the argmax() method to get the highest-probability class index for each instance.
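A sketch of this step, assuming the trained model and the test set from the previous slides:

```python
X_new = X_test[:3]                 # a few instances to classify
y_proba = model.predict(X_new)     # one probability per class per instance
print(y_proba.round(2))

y_pred = y_proba.argmax(axis=-1)   # index of the highest-probability class
print(y_pred)
```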

41
More examples using API
l Building an Image Classifier Using the
Sequential API
l Building a Regression MLP Using the
Sequential API
l Building Complex Models Using the Functional
API
l Using the Subclassing API to Build Dynamic
Models
l Saving and Restoring a Model

42
Better Generalization with Greater Depth
l Empirical results showing that deeper networks
generalize better when used to transcribe multi-digit
numbers from photographs of addresses.

43
Large, Shallow Models Overfit More
l Deeper models tend to perform better.

44
Back-Propagation Algorithm
l When we use a feedforward neural network to
accept an input x and produce an output ŷ,
information flows forward through the network.
• Forward propagation: the inputs x provide the initial
information, which propagates up through the hidden units at
each layer and finally produces ŷ.
• During training, forward propagation can continue onward until
it produces a scalar cost J(θ).
l Back-propagation algorithm: allows the
information from the cost to flow backward
through the network in order to compute the
gradient,
• while another algorithm, such as stochastic gradient descent, is
used to perform learning using this gradient.

45
Back-Propagation
l During inference:
l During training:
l Backpropagation: compute gradients
l Stochastic gradient descent is used to perform the learning using these gradients.
46
Recap
l ANN/DNN
• perceptron, MLP
• numbers of layers/neurons
• feedforward/backpropagation
l Cost functions
• MSE, cross-entropy
l Algorithms for updating parameters
• Backpropagation
• Gradient descent
l Activation functions for output/hidden layers
• Classification, regression
• Logit, Softmax, ReLU, Maxout

47
