
CSCE-421 Machine Learning

16. Neural Networks

Instructor: Guni Sharon


1
Announcements
• Midterm on Tuesday, November-23 (in class)
• Covering all topics up to the exam date excluding today
• Written exam
• One theoretical question and four multiple-choice/multiple-answer questions
• We will have a preparation class (Nov-18)
• Due:
• Quiz 5: decision trees and bagging, due Thursday Nov-18
• Assignment (P4): Decision trees, due Thursday Nov-25

2
Theoretical question examples
1. What is the expected error rate (binary loss) for the Bayes optimal binary classifier?
2. Show that as n → ∞, the 1-NN error for binary classification is no more than twice the error of the Bayes optimal classifier
3. Show that the Perceptron algorithm will converge in a finite number of steps assuming the data is linearly separable
4. Proof from Written Assignment 2: generative models
5. Prove Bayes' Rule
6. Prove that the Newton step direction in Newton's method is a closed form solution for a quadratic Taylor approximation
7. Prove the closed form solution for OLS with mean squared error loss
8. Show that maximizing the SVM margin is equivalent to minimizing the L2 norm of the w vector
9. Show that the expected test (squared) error of a regression model is a function of variance, bias², and noise
10. Show that the RBF kernel is a valid kernel function
11. Prove the optimal step size (alpha) in AdaBoost
3
Fitting a parametrized function
• So far, we only considered fitting linear functions
• This is easy -> can be computed in closed form
• Ordinary Least Squares regression, Ridge regression
• Highly biased: correlations are rarely linear
• Linear separability is usually possible once we
add transformed features (dimensions)
• A hyperplane in the transformed space can fit non-linear patterns in the lower (original) dimensionality

4
Fitting to non-linear data
• Before: bending the feature space and fitting a hyperplane (kernelized
machines)
• Today: fitting a general non-linear function to the original feature
dimensions
• How can we define a “general non-linear function”?
• We must commit to some structure. Won’t that limit our hypothesis space?
• How can we fit a “general non-linear function”?
• Both a closed form solution and an approximate solution (GD) require us to
compute the gradient of the loss w.r.t. the function’s parameters
• How can we compute a gradient for a “general non-linear function”?

5
Function approximator
• We would like to approximate a function y = f(x), where x is the feature vector and y is the label (value) vector
• What about classification problems?
• y (as a value) can represent a distribution over labels, e.g., {0.8('yes'), 0.2('no')}
• A neural network is a function approximator with many tunable
parameters that can approximate any continuous
function with arbitrary accuracy
• Universal Function Approximation Theorem

6
Perceptrons
• A single artificial neuron (a linear function)
• Tunable parameters are the weights, w = (w1, …, wd)
• Output is ŷ = w1x1 + w2x2 + … + wdxd, or in vector notation ŷ = wᵀx
• Goal: set w such that ŷ ≈ y
• Approach: minimize the difference (loss) between ŷ and y
• A 0/1 (threshold) difference: not differentiable, no closed form solution
• Squared difference: ordinary least squares regression, closed form solution
[Figure: a single neuron computing a weighted sum of inputs with weights w1, w2, w3]
7
Gradient descent for Perceptrons
• Squared loss function: L(w) = ½ Σᵢ (wᵀxᵢ − yᵢ)²
• GD: w ← w − α ∇w L(w)
• With a linear approximator (single perceptron): ∇w L(w) = Σᵢ (wᵀxᵢ − yᵢ) xᵢ
• Very easy to train!
• What are the limitations of perceptrons as function approximators?
• Only able to capture linear relations between state features and output
• The approximated linear function must be 0 for x = 0 (it passes through the origin)
8
Add bias
• A simple perceptron (ŷ = wᵀx) can capture only linear functions that pass through the origin
• Solution: generalize by adding a bias term: ŷ = wᵀx + b
• How do we train the bias term?
• Gradient descent, as before: b ← b − α ∂L/∂b (see the sketch below)
• With a linear approximator (perceptron): ∂L/∂b = Σᵢ (wᵀxᵢ + b − yᵢ)

9
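A minimal NumPy sketch (not from the original slides) of training a single perceptron with a bias term by gradient descent on the squared loss; the toy data and learning rate below are made up for illustration:

import numpy as np

# Toy data: y = 2*x1 - x2 + 3 plus a little noise (made-up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2 * X[:, 0] - X[:, 1] + 3 + 0.01 * rng.normal(size=100)

w = np.zeros(2)   # weights
b = 0.0           # bias term
alpha = 0.1       # learning rate

for _ in range(500):
    y_hat = X @ w + b             # forward pass: w^T x + b
    err = y_hat - y               # prediction error
    grad_w = X.T @ err / len(y)   # dL/dw for the mean squared loss
    grad_b = err.mean()           # dL/db
    w -= alpha * grad_w           # gradient descent step
    b -= alpha * grad_b

print(w, b)  # should approach [2, -1] and 3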
Beyond linearity: connecting perceptrons
• What if we connect several perceptrons as a network?
• x is the input, z^(l) is the output (activation) vector for layer l, W^(l) is the weight matrix for layer l (dimensions |z^(l)| × |z^(l−1)|)
• What is the activation for layer 1? z^(1) = W^(1) x, and similarly z^(2) = W^(2) z^(1), …
• What is the activation of the output layer? z^(L) = W^(L) ⋯ W^(2) W^(1) x
• Matrix product is associative, i.e., (AB)C = A(BC), so the whole network collapses to a single linear map
• Not more powerful than a single perceptron
[Figure: feed-forward network with inputs x1, …, x5, hidden units z1, z2, and output z3]
10
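A tiny NumPy check (my addition) that stacking purely linear layers collapses to a single linear layer:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=5)          # input x1..x5
W1 = rng.normal(size=(2, 5))    # layer 1 weights
W2 = rng.normal(size=(3, 2))    # layer 2 weights

z1 = W1 @ x                     # layer-1 activation
z2 = W2 @ z1                    # network output

W_single = W2 @ W1              # collapse the two layers: (W2 W1) x
print(np.allclose(z2, W_single @ x))  # True: no extra power without nonlinearity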
Introducing non-linearity
• In order to represent general functions, the output of a neuron needs to be nonlinear w.r.t. the activation z = wᵀx + b
• We need an Activation Function g(z)
• Activation function job description:
• Well defined derivative (should be differentiable)
• Informative derivative values (must know how to improve the objective)
• Monotonic (since local optima = suboptimal solutions)
• Fast to compute (intense working environment)
• Covers the required output range
• Let's look at some candidates
[Figure: a neuron with weights w1, w2, w3 followed by an activation function]
11
Common Activation Functions
• Criteria for each candidate: well defined derivative, substantial derivative values, monotonic, fast to compute, covers the required output range
[Figures: candidate activation functions evaluated against these criteria]
[source: MIT 6.S191 introtodeeplearning.com]
12-15
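For concreteness, a NumPy sketch (my addition; it assumes the usual candidates are sigmoid, tanh, and ReLU) of the activations and their derivatives:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1 - s)            # g(z)(1 - g(z)); vanishes for large |z|

def tanh(z):
    return np.tanh(z)

def d_tanh(z):
    return 1 - np.tanh(z) ** 2    # zero-centered output, but still saturates

def relu(z):
    return np.maximum(0, z)

def d_relu(z):
    return (z > 0).astype(float)  # cheap, non-saturating for z > 0, but 0 for z < 0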
Example: Sigmoid neuron
• Assume a single neuron with a sigmoid activation function: ŷ = g(z) = 1 / (1 + e^(−z))
• Where z = wᵀx + b
• How can we compute ∂L/∂w or ∂L/∂b?
• The chain rule! ∂L/∂w = (∂L/∂ŷ)(∂ŷ/∂z)(∂z/∂w)
• Given a set of labeled observations, perform a GD step
• Derivative of the squared error loss L = ½(ŷ − y)²: ∂L/∂ŷ = ŷ − y
• Derivative of the sigmoid: ∂g(z)/∂z = g(z)(1 − g(z))
[Figure: a sigmoid neuron with weights w1, w2, w3 and bias b]
16-17
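A small NumPy sketch (my addition, using made-up data) of one gradient descent step for a single sigmoid neuron under the squared loss, following the chain rule above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One labeled observation (made-up values)
x = np.array([0.5, -1.0, 2.0])
y = 1.0

w = np.zeros(3)
b = 0.0
alpha = 0.5

# Forward pass
z = w @ x + b
y_hat = sigmoid(z)

# Chain rule: dL/dw = (dL/dy_hat) * (dy_hat/dz) * (dz/dw)
dL_dyhat = y_hat - y              # squared loss L = 0.5*(y_hat - y)^2
dyhat_dz = y_hat * (1 - y_hat)    # sigmoid derivative g(z)(1 - g(z))
dz_dw = x                         # dz/dw = x, dz/db = 1

grad_w = dL_dyhat * dyhat_dz * dz_dw
grad_b = dL_dyhat * dyhat_dz

# GD step
w -= alpha * grad_w
b -= alpha * grad_b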
Connecting neurons
• What if we connect several neurons into a network?
• x is the input, z^(l) is the activation vector for layer l, W^(l) is the weight matrix for layer l (dimensions |z^(l)| × |z^(l−1)|), and b^(l) is the bias vector
• What is the output of layer 1? z^(1) = g(W^(1) x + b^(1)), and similarly for deeper layers
• What is the output of the network? z^(L) = g(W^(L) ⋯ g(W^(2) g(W^(1) x + b^(1)) + b^(2)) ⋯ + b^(L))
• Can represent any continuous function!
[Figure: feed-forward network with inputs x1, …, x5, hidden units z1, z2, and output z3, each unit with its own bias b]
18
Universal Function Approximation Theorem
• Formally: for any continuous function f(x) on a compact domain and any ε > 0, there is a one-hidden-layer network f̂ with enough hidden units (and a suitable activation) such that |f̂(x) − f(x)| < ε for all x in the domain
• In words: given any continuous function f(x), if a 2-layer neural network has enough hidden units, then there is a choice of weights that allows it to closely approximate f(x)

Cybenko (1989) "Approximations by Superpositions of Sigmoidal Functions"
Hornik (1991) "Approximation Capabilities of Multilayer Feedforward Networks"
Leshno and Schocken (1991) "Multilayer Feedforward Networks with Non-Polynomial Activation Functions Can Approximate Any Function"
19
Defining a neural net with Numpy

20
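A minimal sketch (my own, not the code from the original slide) of a two-layer NumPy network forward pass; the layer sizes and names are arbitrary assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Layer sizes (illustrative): 5 inputs, 3 hidden units, 1 output
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 5)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)

def forward(x):
    """Forward pass through a 2-layer network with sigmoid activations."""
    z1 = sigmoid(W1 @ x + b1)       # hidden layer activation
    y_hat = sigmoid(W2 @ z1 + b2)   # network output
    return y_hat

print(forward(np.ones(5)))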
Gradient descent for a full network
• Step 1: compute the loss gradient with respect to the output activation
• For g = sigmoid and loss = squared: ∂L/∂z^(L) = z^(L) − y, and the gradient w.r.t. the output layer's weighted input is δ^(L) = (z^(L) − y) ⊙ z^(L) ⊙ (1 − z^(L))
• Step 2: compute the loss gradient with respect to W^(L) and b^(L):
∂L/∂W^(L) = δ^(L) (z^(L−1))ᵀ,  ∂L/∂b^(L) = δ^(L)
• Update W^(L) and b^(L) such that the loss is reduced
[Figure: feed-forward network with inputs x1, …, x5, hidden units z1, z2, output z3, weights W and biases b]
21-25
Gradient descent for a full network (continued)
• Next: compute the loss gradient with respect to the previous layer's activation, z^(L−1)
• By the chain rule: δ^(L−1) = ((W^(L))ᵀ δ^(L)) ⊙ z^(L−1) ⊙ (1 − z^(L−1)) for sigmoid activations; we know δ^(L), since we computed it in the previous step
• In general: repeat the same computation layer by layer, from the output back to the input
[Figure: feed-forward network with inputs x1, …, x5, hidden units z1, z2, output z3, weights W and biases b]
26-29
Back propagation
• In general: compute the loss gradient with respect to each layer's activation, moving backwards: δ^(l) = ((W^(l+1))ᵀ δ^(l+1)) ⊙ z^(l) ⊙ (1 − z^(l))
• Compute the loss gradient with respect to W^(l) and b^(l): ∂L/∂W^(l) = δ^(l) (z^(l−1))ᵀ,  ∂L/∂b^(l) = δ^(l)
• Later: update all W and b such that the loss is reduced (see the sketch below)
[Figure: feed-forward network with inputs x1, …, x5, hidden units z1, z2, output z3, weights W and biases b]
30
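A compact NumPy sketch (my addition, under the same assumptions as above: sigmoid activations and squared loss) of backpropagation through a network with one hidden layer:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=5)            # single input with 5 features
y = np.array([1.0])               # target

W1, b1 = rng.normal(size=(2, 5)) * 0.1, np.zeros(2)
W2, b2 = rng.normal(size=(1, 2)) * 0.1, np.zeros(1)
alpha = 0.1

# Forward pass
z1 = sigmoid(W1 @ x + b1)         # hidden activation
z2 = sigmoid(W2 @ z1 + b2)        # output activation

# Backward pass (squared loss L = 0.5 * ||z2 - y||^2)
delta2 = (z2 - y) * z2 * (1 - z2)          # gradient w.r.t. output layer's weighted input
grad_W2 = np.outer(delta2, z1)             # dL/dW2 = delta2 z1^T
grad_b2 = delta2

delta1 = (W2.T @ delta2) * z1 * (1 - z1)   # propagate the error to the hidden layer
grad_W1 = np.outer(delta1, x)              # dL/dW1 = delta1 x^T
grad_b1 = delta1

# Gradient descent update
W2 -= alpha * grad_W2; b2 -= alpha * grad_b2
W1 -= alpha * grad_W1; b1 -= alpha * grad_b1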
Numerical example
• One training sample; network = [input, hidden layer, output]; the hidden layer uses a ReLU activation function
• Init: weights initialized to uniform values (uniform values are a bad idea in general)
• Compute the squared loss, compute the gradient of the loss w.r.t. each weight, and update the weights in order to reduce the loss
• After the update, the new loss is lower
[The numeric values appear as figures on slides 31-34 and are not reproduced here]
31-34
Train with batch Gradient Descent
• Repeat: w ← w − α ∇w Σᵢ L(w; xᵢ, yᵢ), where the sum runs over the entire training set
• α: learning rate, a step size parameter that needs to be chosen carefully
• How? Try multiple choices
• Crude rule of thumb: each update should change the weights by about 0.1 – 1 % per iteration
35
Train with Stochastic Gradient Descent
• Observation: once the gradient on one training example has been computed, we might as well incorporate it before computing the next one
• Repeat: pick a random example i, then update w ← w − α ∇w L(w; xᵢ, yᵢ)
36
Train with mini-batch Stochastic Gradient Descent
• Observation: the gradient over a small set of training examples (= mini-batch) can be computed in parallel, so we might as well do that instead of using a single example
• Repeat: pick a random mini-batch B, then update w ← w − α ∇w Σ_{i∈B} L(w; xᵢ, yᵢ) (see the sketch below)
37
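A minimal NumPy sketch (my addition; a linear model and squared loss are assumed for brevity) of the mini-batch SGD loop:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1., -2., 0.5, 0., 3.]) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
alpha, batch_size = 0.05, 32

for epoch in range(20):
    idx = rng.permutation(len(y))                 # shuffle once per epoch
    for start in range(0, len(y), batch_size):
        batch = idx[start:start + batch_size]     # pick a random mini-batch
        Xb, yb = X[batch], y[batch]
        grad = Xb.T @ (Xb @ w - yb) / len(batch)  # mean squared-loss gradient on the batch
        w -= alpha * grad                         # SGD step on the mini-batch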
Can and should be parallelized
38
Parallel training of a mini-batch
• m training samples (each with d features), stacked as a matrix X
• Parallel forward pass: for each layer, compute the activations for the entire mini-batch simultaneously
• Conventional notation used in deep learning allows the addition of a matrix and a vector, yielding another matrix: the vector (e.g., the bias) is added to every row
• Parallel backprop: same idea; the quantity being propagated is now a gradient vector and the layer derivative is a Jacobian matrix
39
Demo
• https://playground.tensorflow.org

40
Output vector
• Neural nets can easily be generalized to produce an output vector
• For instance, in classification problems, it is generally more efficient to assign an output per class instead of a single output representing the class
• ŷ1 = probability of class 1
• ŷ2 = probability of class 2
• ŷ3 = probability of class 3
• Instead of a single output ŷ = class number
• Actually, there is an efficient activation function for producing such distributions
• Softmax activation (more on this later)
41
Common loss functions
• Regression Loss Functions
• When required to fit a real number
• Absolute Error Loss
• Squared Error Loss
• Huber Loss
• Classification Loss Functions
• When required to predict one of many classes
• Cross Entropy Loss (usually over a Softmax activation for the output layer)
• Distribution approximation
• When required to fit a distribution
• Kullback-Leibler (KL) Divergence
42
Squared Error Loss
• Also known as L2 loss: the square of the difference between the actual and the predicted values, L = (y − ŷ)²
• The squared error loss penalizes the model for making large errors by squaring them
• The change in loss grows with the prediction error, i.e., the gradient grows with |y − ŷ|
• This property makes the squared error loss less robust to outliers. Therefore, it should not be used if our data is prone to many outliers
43
Absolute Error Loss
• Also known as the L1 loss: L = |y − ŷ|
• More robust to outliers as compared to the squared error loss
• Not differentiable at zero error (ŷ = y)
• The derivative is not continuous around zero, which is problematic for convergence
44
Huber Loss
• Quadratic for small errors and linear otherwise (similarly for its gradient):
L_δ(e) = ½e² if |e| ≤ δ, and δ(|e| − ½δ) otherwise, where e = y − ŷ
• Requires tuning of the delta hyper-parameter
• More robust to outliers as compared to the squared error loss
• Differentiable and continuous
45
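A short NumPy sketch (my addition) of the three regression losses side by side:

import numpy as np

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

def absolute_loss(y, y_hat):
    return np.abs(y - y_hat)

def huber_loss(y, y_hat, delta=1.0):
    e = np.abs(y - y_hat)
    # quadratic near zero, linear for large errors
    return np.where(e <= delta, 0.5 * e ** 2, delta * (e - 0.5 * delta))

errors = np.array([0.1, 1.0, 10.0])
print(squared_loss(0, errors))   # [0.01, 1, 100] -- the outlier dominates
print(absolute_loss(0, errors))  # [0.1, 1, 10]
print(huber_loss(0, errors))     # [0.005, 0.5, 9.5] -- linear growth for large errors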
Cross Entropy Loss
• Also called log-loss or negative log-likelihood loss: L = −Σ_c y_c log p_c, where y is the (one-hot) target and p the predicted distribution
• Penalizes large errors in the predicted distribution
• Goes in tandem with the softmax function
46
Softmax function
• Input: a set of per-class scores z1, …, zk
• Output: a set of probabilities, one per possible assignment: softmax(z)_i = e^{z_i} / Σ_j e^{z_j}
• For example, the top-scoring class may receive probability 0.95, the next 0.05, and the rest < 0.01 each (see the sketch below)
47
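A NumPy sketch (my addition) of the softmax function; subtracting the maximum score is a standard numerical-stability trick that does not change the output:

import numpy as np

def softmax(z):
    z = z - np.max(z)      # stability: exp of large scores would overflow
    e = np.exp(z)
    return e / e.sum()

scores = np.array([5.0, 2.0, -1.0])
print(softmax(scores))      # ~[0.95, 0.047, 0.002]; sums to 1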
Example
[Figure: worked softmax example; credit: Miranda, Lester James]
48
Softmax derivative
• The partial derivative relates a given output probability p_i to a given input score z_j: ∂p_i/∂z_j
• Jacobian = the k × k matrix of all such partial derivatives
49
Softmax derivative
• Case i = j: ∂p_i/∂z_j = p_i(1 − p_i)
• Case i ≠ j: ∂p_i/∂z_j = −p_i p_j
50
Softmax derivative
• Combining both cases: ∂p_i/∂z_j = p_i(δ_ij − p_j), where δ_ij = 1 if i = j and 0 otherwise
51
Gradient descent with Softmax
• y: target probability per class, z: ANN-predicted score per class, p = softmax(z): predicted probabilities per class
• The prediction loss is computed post-softmax: L(y, p)
• Train with gradient descent using the chain rule: ∂L/∂z = (∂L/∂p)(∂p/∂z), i.e., the derivative of the loss function (a value per class) times the softmax derivative, then backpropagate ∂L/∂z (a value per class) through the network
52
Simplifying the Softmax derivative
• Log loss: L = −Σ_c y_c log p_c
• In practice, the softmax function is commonly used in tandem with the log loss
53
Softmax derivative with log-likelihood
• Combining the log loss with the softmax derivative: ∂L/∂z_j = Σ_i (∂L/∂p_i)(∂p_i/∂z_j) = p_j − y_j
• The gradient for each score is not dependent on the full Jacobian: it is simply (predicted probability − target probability), as checked below
54
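A small NumPy check (my addition) that the gradient of the cross-entropy loss with respect to the scores reduces to p − y:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, 0.5, -1.0])   # ANN scores per class
y = np.array([0.0, 1.0, 0.0])    # one-hot target

analytic = softmax(z) - y        # claimed gradient: p - y

# Finite-difference check of dL/dz_j
eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * np.eye(3)[j], y) -
     cross_entropy(z - eps * np.eye(3)[j], y)) / (2 * eps)
    for j in range(3)
])
print(np.allclose(analytic, numeric, atol=1e-5))  # True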
Kullback-Leibler (KL) Divergence
• Measures the similarity between two distributions
• A KL divergence of zero indicates that the distributions are identical
• Relative entropy of P with respect to Q: D_KL(P ∥ Q) = Σ_x P(x) log(P(x)/Q(x))
• Helpful for comparing two policies in RL
55
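A one-function NumPy sketch (my addition) of the discrete KL divergence:

import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as arrays that sum to 1."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                   # terms with p(x) = 0 contribute 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0: identical distributions
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))  # > 0: the distributions differ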
Overfitting
• We can now define a neural network with non-linear activation and set an
appropriate loss function
• We know how to compute the gradient of the loss with respect to the
tunable parameters
• Can we perform successive gradient descent steps until loss = 0?
• The Universal Function Approximation Theorem tells us that it is possible to reach loss = 0
• However, the parameter assignment space is nonconvex with respect to the loss
1. We might converge on a local optimum
2. In any case, reaching loss = 0 is probably not a good idea. Doing so usually implies memorizing the training data, in which case predictions on new observations are usually inaccurate

56
Variance reduction
1. Early stopping
2. Penalize complex model
• Add model complexity penalty to the loss function
• L1 regularization
• L2 regularization
3. Enforce model simplicity
• Small network
• Dropout

57
Early stopping
• Usually, more training = higher accuracy on training set
• At some point the model starts memorizing the training examples
• In order to avoid such overfitting
• Keep track of test set accuracy
• Stop training when it declines

58
L1 & L2 regularization
• Assumes that a neural network with smaller absolute weight values corresponds to a simpler model
• Therefore, it will also reduce overfitting
• Add a model complexity penalty to the loss function
• L1: L + (λ/n) Σ_w |w|,  L2: L + (λ/2n) Σ_w w²
• λ is the regularization parameter, n is the size of the training set, and the sum runs over all weights in the network
• How will adding these penalties affect the weight gradient? (see the sketch below)
59
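A sketch (my addition, using the λ/n convention above) of how each penalty changes the weight gradient:

import numpy as np

def regularized_grad(grad_loss, w, lam, n, kind="l2"):
    """Gradient of (loss + penalty) w.r.t. the weights.

    L2 adds (lam/n) * w       -> weights decay toward zero proportionally.
    L1 adds (lam/n) * sign(w) -> constant pull toward zero, encouraging sparsity.
    """
    if kind == "l2":
        return grad_loss + (lam / n) * w
    return grad_loss + (lam / n) * np.sign(w)

# Illustration: zero loss gradient, so only the penalty term remains
w = np.array([0.5, -2.0, 0.0])
g = np.zeros(3)
print(regularized_grad(g, w, lam=1.0, n=10, kind="l2"))  # [ 0.05 -0.2   0. ]
print(regularized_grad(g, w, lam=1.0, n=10, kind="l1"))  # [ 0.1  -0.1   0. ]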
Dropout
• Common regularization technique (these days)
• At every iteration, randomly (and temporarily) remove some nodes along with all of their incoming and outgoing connections
• Train a mini-batch of observations on the thinned network
• Each iteration trains a different subset of the network
• The dropout probability is a hyper-parameter
60
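A minimal NumPy sketch (my addition) of applying a dropout mask during training; this is the common "inverted dropout" variant, which rescales by 1/(1 − p) so nothing changes at test time:

import numpy as np

def dropout(activations, p_drop, training=True, rng=np.random.default_rng()):
    """Randomly zero each activation with probability p_drop during training."""
    if not training:
        return activations                       # use the full network at test time
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)   # rescale so the expected value is unchanged

h = np.array([0.2, 1.5, -0.7, 0.9])
print(dropout(h, p_drop=0.5))  # roughly half the entries are zeroed (the rest are doubled)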
Adaptive learning rates
• SGD has trouble navigating ravines, i.e., areas where the surface
curves much more steeply in one dimension than in another, which
are common around local optima
• In these scenarios, SGD oscillates across the slopes of the ravine while
only making hesitant progress along the bottom towards the local
optimum

61
Momentum
• Keep an exponentially weighted average of the gradient over successive iterations, e.g., v ← γv + (1 − γ)∇w L(w)
• Update the model using SGD in the direction of the average: w ← w − αv
62
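A sketch (my addition, under the averaging convention written above) of one SGD-with-momentum update:

import numpy as np

def momentum_step(w, v, grad, alpha=0.01, gamma=0.9):
    """One SGD-with-momentum update; v is the running average of gradients."""
    v = gamma * v + (1 - gamma) * grad   # exponentially weighted average
    w = w - alpha * v                    # step in the averaged direction
    return w, v

w, v = np.zeros(3), np.zeros(3)
w, v = momentum_step(w, v, grad=np.array([1.0, -2.0, 0.5]))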
Adaptive Moment Estimation (ADAM)
• Keeps an exponentially decaying average of past gradients, as in vanilla
momentum
• Keeps an exponentially decaying average of past squared gradients
• Computes adaptive learning rates for each parameter
• Invariant to the magnitude of the gradient, which helps a lot when going through areas with
tiny gradients
• Commonly used these days

63
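A compact sketch (my addition) of the standard Adam update rule with its usual default hyper-parameters, bias correction included:

import numpy as np

def adam_step(w, m, v, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v track decaying averages of grad and grad**2."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (squared gradients)
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return w, m, v

w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 101):
    grad = np.array([1.0, -2.0, 0.5])           # placeholder gradient
    w, m, v = adam_step(w, m, v, grad, t)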
Conv layers
• Fully connected layers ignore the spatial structure present in some inputs (e.g., images)
• We would like to take advantage of the spatial structure in such cases
• CNNs use three basic ideas: local receptive fields, shared weights, and pooling

64
Image credit: towardsdatascience.com
Local receptive fields
• Instead of connecting all inputs to every hidden neuron, connect each hidden neuron to only a small, spatially contiguous subset of the input neurons
• That region in the input image is called the local receptive field for the hidden neuron
• You can think of that particular hidden neuron as learning to analyze its particular local receptive field

65
credit: http://neuralnetworksanddeeplearning.com/
Local receptive fields
• Start with a local receptive field in the top-left corner
• Then slide the local receptive field over by one pixel
• If we have a 28×28 input image and 5×5 local receptive fields, then there will be 24×24 neurons in the hidden layer

66
credit: http://neuralnetworksanddeeplearning.com/
Shared weights and biases
• Use the same weights and bias for each hidden neuron in the conv
layer
• This means that all the neurons in the first hidden layer detect exactly
the same feature
• What if we want to identify more than one feature?
• Affiliate each local receptive field with n hidden neurons (channels)
• Each feature has its own shared weights and biases across all local fields

[Figure: a conv layer with 3 channels (feature maps)]

67
credit: http://neuralnetworksanddeeplearning.com/
Pooling layers
• Pooling layers are usually used immediately after convolutional layers
• Creates a condensed feature map
• E.g., max-pooling outputs the maximum activation in an input region

68
credit: http://neuralnetworksanddeeplearning.com/
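To make the local-receptive-field, shared-weight, and pooling ideas concrete, a naive NumPy sketch (my addition; real libraries use optimized routines):

import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (really cross-correlation) with a single shared kernel."""
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))   # e.g., 28x28 image, 5x5 kernel -> 24x24 map
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # each output neuron sees only a k x k local receptive field,
            # and every neuron reuses the same weights (the kernel)
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Max pooling: keep the maximum activation in each size x size region."""
    H, W = feature_map.shape
    H2, W2 = H // size, W // size
    trimmed = feature_map[:H2 * size, :W2 * size]
    return trimmed.reshape(H2, size, W2, size).max(axis=(1, 3))

image = np.random.default_rng(0).random((28, 28))
kernel = np.random.default_rng(1).random((5, 5))
fmap = conv2d(image, kernel)    # shape (24, 24)
print(max_pool(fmap).shape)     # (12, 12)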
Pooling layers
• We apply pooling to each feature map separately
• Finally, don’t forget to add at least one fully connected layer

69
credit: http://neuralnetworksanddeeplearning.com/
Keras (Python library)
Create a CNN with 5 hidden layers:

import keras
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), strides=(1, 1),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adam(),
              metrics=['accuracy'])

1. 32 channels (hidden features) over 3×3 local receptive fields that slide 1 pixel from each other in each direction
2. A second conv layer connected to the first one; further extraction of local features
3. Pooling layer with dropout regularization
4. A fully connected layer with dropout regularization
5. Finally, a softmax output layer

The loss function is the cross-entropy loss, the SGD optimizer is Adam, and the evaluation metric is total accuracy on the test set.
70
Keras: train
model.fit(x_train, y_train,
batch_size=batch_size,
epochs=epochs,
validation_data=(x_test, y_test))

71
Pytorch
import torch
import torch.nn as nn

class ConvNet(nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.drop_out = nn.Dropout(0.5)
        self.fc1 = nn.Linear(7 * 7 * 64, 1000)
        self.fc2 = nn.Linear(1000, 10)

Conv1: 1 input channel (the image), 32 output channels (unique local features), 5×5 conv filters, padding=2 so that the output stays 28×28, ReLU activation, then a max pool layer over 2×2 kernels.
Conv2: similar to conv1, with 32 input channels and 64 output channels, each a 7×7 feature map after pooling.
Drop_out: dropout regularization; removes every value in a specific layer with probability 0.5 when applied.
Fully connected (fc1): 7*7*64 input size and 1000 output size (fc2 is similar).

72
Pytorch: connect the layers
def forward(self, x):
    out = self.layer1(x)
    out = self.layer2(out)
    out = out.reshape(out.size(0), -1)
    out = self.drop_out(out)
    out = self.fc1(out)
    out = self.fc2(out)
    return out

Define the data flow along the network: conv1 -> max pool 1 -> conv2 -> max pool 2 -> flatten 64*7*7 into a single vector -> drop some values (to avoid overfitting) -> 2 fully connected layers.

73
Pytorch: train
model = ConvNet()

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Run the forward pass
outputs = model(images)
loss = criterion(outputs, labels)

# Backprop and perform Adam optimisation
optimizer.zero_grad()
loss.backward()
optimizer.step()

The CrossEntropyLoss function combines a softmax activation with the log loss function. Set the Adam optimizer over all tunable parameters. Get the outputs for a minibatch of images and compute the loss. Next, init all gradients to zero, compute new gradients (backpropagate), and update the parameters based on the new gradients.

74
What did we learn?
• ANNs are nonlinear functions with many tunable parameters (weights and biases)
• Using the backpropagation algorithm, we can efficiently compute the partial derivatives of the ANN's loss w.r.t. the tunable parameters
• Using (stochastic) gradient descent, we update the parameters such that the loss is reduced (in expectation)
• ANNs can approximate any continuous function to arbitrary precision
• Choose an appropriate loss function
• Avoid overfitting! Regularize
• Hyper-parameter tuning: the network's structure (depth, width, convolution layers, pooling layers, fully connected layers), momentum, activation functions, regularization technique
75
What next?
• Lecture: Derivative free optimization
• Assignments:
• Programming assignment 5 + Competition: Deep Learning (due Monday, Dec
13)
• Quiz:
• Quiz 6: ANN (due Tuesday, Dec 14)

76
Course evaluation
• Your (constructive) feedback is appreciated https://tamu.aefis.net/
• I will be providing this course again Spring semester
• Please let me know regarding:
1. Lectures – what can help keep students engaged during lectures? Especially when discussing
mathematical proofs.
2. Expectations – how can I better set students’ expectations during the first lecture?
3. Covered material – any topics that you expected to be covered but weren't? Were there any
topics that you felt to be superfluous?
4. Assignments – were the assignments meaningful? Is there any way to make them more
meaningful?
5. Quizzes – were the quizzes meaningful? Is there any way to make them more meaningful?
6. Exam – was the exam fair?
7. Website – was the course website helpful? Should anything be added?
77
