
Omar Arif

omar.arif@seecs.edu.pk
National University of Sciences and Technology
 The Perceptron – Basic building block of neural networks

 Neural Networks – Stacking perceptrons to build neural networks

 Loss Minimization – Gradient descent

 Implementing ANNs – How to use libraries to implement neural networks

 Training ANNs
Building Block of Neural Network
Bias term: $x_0 = 1$, so $w_0$ acts as the bias

$\boldsymbol{x} = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix}, \quad \mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}$

$h_{\mathbf{w}}(\boldsymbol{x}) = g(\mathbf{w}^T \boldsymbol{x})$

where $g$ is the non-linearity (activation function)


 The activation function allows us to introduce non-linearity into the network

 Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$

 Rectified Linear Unit: $\mathrm{relu}(z) = \max(z, 0)$

 Softplus: $\mathrm{softplus}(z) = \log(1 + e^{z})$

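As a quick sketch of these three activation functions (assuming PyTorch, which the later slides also use):

import torch

z = torch.linspace(-3.0, 3.0, steps=7)

sigmoid  = 1 / (1 + torch.exp(-z))        # sigma(z) = 1 / (1 + e^(-z))
relu     = torch.clamp(z, min=0)          # relu(z) = max(z, 0)
softplus = torch.log(1 + torch.exp(z))    # softplus(z) = log(1 + e^z)

# Built-ins that compute the same values:
# torch.sigmoid(z), torch.relu(z), torch.nn.functional.softplus(z)
print(sigmoid, relu, softplus)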

A single perceptron can realize the OR function:

x1  x2  h (OR)
0   0   0
0   1   1
1   0   1
1   1   1

and the AND function:

x1  x2  h (AND)
0   0   0
0   1   0
1   0   0
1   1   1
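A single perceptron $h_{\mathbf{w}}(\boldsymbol{x}) = \sigma(\mathbf{w}^T \boldsymbol{x})$ with suitable weights reproduces both tables above; the weight values below are one illustrative choice, not taken from the slides:

import torch

def perceptron(x, w):
    # h_w(x) = sigmoid(w^T x), where x[0] = 1 is the bias input
    return torch.sigmoid(w @ x)

w_or  = torch.tensor([-10.0, 20.0, 20.0])   # illustrative weights for OR
w_and = torch.tensor([-30.0, 20.0, 20.0])   # illustrative weights for AND

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = torch.tensor([1.0, x1, x2])          # prepend the bias term x0 = 1
    print(x1, x2,
          round(perceptron(x, w_or).item()),
          round(perceptron(x, w_and).item()))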
 Non-Linear Decision Boundary

x1  x2  h (XOR)
0   0   0
0   1   1
1   0   1
1   1   0

 Using the basic perceptron, we cannot approximate a non-linear function such as XOR
 Feature Engineering: use higher-order features such as $x^2$ or $x^3$ to obtain a non-linear function
 Problem: we don't know which features to choose
 We would like to automate this and let the algorithm choose the features

 Neural networks allow one to automatically learn representations/features for a linear classifier that are geared towards the desired task, rather than specifying them all by hand
Building neural networks by stacking perceptrons

x1  x2  h1 (AND)  h2 (NOR)  h (OR of h1, h2)
0   0   0         1         1
0   1   0         0         0
1   0   0         0         0
1   1   1         0         1

Stacking the units this way, the output $h = \mathrm{OR}(\mathrm{AND}(x_1, x_2), \mathrm{NOR}(x_1, x_2))$ computes XNOR, a non-linear function of the inputs.
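A minimal sketch of this stacking: the two hidden units compute AND and NOR of the inputs, and the output unit computes OR of the hidden units, giving XNOR. The weights are illustrative choices, not from the slides:

import torch

def unit(x, w):
    # one perceptron: sigmoid(w^T x), with x[0] = 1 as the bias input
    return torch.sigmoid(w @ x)

w_and = torch.tensor([-30.0,  20.0,  20.0])   # hidden unit 1: AND(x1, x2)
w_nor = torch.tensor([ 10.0, -20.0, -20.0])   # hidden unit 2: NOR(x1, x2)
w_or  = torch.tensor([-10.0,  20.0,  20.0])   # output unit:   OR(h1, h2)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x  = torch.tensor([1.0, x1, x2])
    h1 = unit(x, w_and).item()
    h2 = unit(x, w_nor).item()
    h  = unit(torch.tensor([1.0, h1, h2]), w_or)  # OR(AND, NOR) = XNOR(x1, x2)
    print(x1, x2, round(h.item()))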
Loss function tells us how good our neural network is
 Optimization problem
Training data: $D_{train} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$

$\min_{\boldsymbol{w}} J(\boldsymbol{w}, D_{train})$

$J(\boldsymbol{w}, D_{train}) = \frac{1}{m} \sum_{(x,y) \in D_{train}} Loss(x, y, \boldsymbol{w})$

 Goal: compute the gradient $\nabla_{\boldsymbol{w}} J(\boldsymbol{w}, D_{train})$
 Mean squared error loss

$Loss(x, y, \boldsymbol{w}) = (h_{\boldsymbol{w}}(x) - y)^2$

 Binary Cross Entropy Loss (Logistic Loss)

$Loss(x, y, \boldsymbol{w}) = -y \log(h(x)) - (1 - y) \log(1 - h(x))$
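A small sketch of computing these two losses in PyTorch; the output and label values below are made up for illustration:

import torch
import torch.nn.functional as F

h = torch.tensor([0.9, 0.2, 0.7])   # network outputs h_w(x) for three examples (made up)
y = torch.tensor([1.0, 0.0, 1.0])   # true labels

mse = ((h - y) ** 2).mean()                                    # mean squared error
bce = (-y * torch.log(h) - (1 - y) * torch.log(1 - h)).mean()  # binary cross entropy

# Built-in equivalents: F.mse_loss(h, y) and F.binary_cross_entropy(h, y)
print(mse.item(), bce.item())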


 Forward Pass: compute the output of the network

 Backward Pass: compute gradients

See Backpropagation_examples.pdf
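A minimal sketch of a forward and backward pass using PyTorch autograd; the tiny network and data are placeholders, not the worked example in Backpropagation_examples.pdf:

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 3), nn.ReLU(), nn.Linear(3, 1), nn.Sigmoid())
x = torch.tensor([[0.0, 1.0]])
y = torch.tensor([[1.0]])

out  = net(x)                     # forward pass: compute the network output
loss = ((out - y) ** 2).mean()    # and the loss

loss.backward()                   # backward pass: gradients of the loss w.r.t. all weights
print(net[0].weight.grad)         # gradient for the first layer's weights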
Batch Gradient Descent
 Initialize weights randomly
 Loop
 Compute the gradient $\frac{\partial J(\boldsymbol{w})}{\partial \boldsymbol{w}}$
 Update $\boldsymbol{w} := \boldsymbol{w} - \alpha \frac{\partial J(\boldsymbol{w})}{\partial \boldsymbol{w}}$

Stochastic Gradient Descent
 Initialize weights randomly
 Loop
 For each data point $(x, y)$ in $D_{train}$
 Compute the gradient $\frac{\partial Loss(x, y, \boldsymbol{w})}{\partial \boldsymbol{w}}$
 Update $\boldsymbol{w} := \boldsymbol{w} - \alpha \frac{\partial Loss(x, y, \boldsymbol{w})}{\partial \boldsymbol{w}}$
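A hedged sketch of both loops on a toy linear regression problem; the data and model are invented for illustration:

import torch

# toy data: y is roughly 2*x1 - x2
X = torch.randn(100, 2)
y = X @ torch.tensor([2.0, -1.0]) + 0.01 * torch.randn(100)
alpha = 0.1

def J(w, X, y):                       # mean squared loss over the given data
    return ((X @ w - y) ** 2).mean()

# Batch gradient descent: one update per pass over all of D_train
w = torch.zeros(2, requires_grad=True)
for epoch in range(50):
    loss = J(w, X, y)
    loss.backward()                   # dJ(w)/dw over the whole training set
    with torch.no_grad():
        w -= alpha * w.grad
    w.grad.zero_()

# Stochastic gradient descent: one update per training example
w = torch.zeros(2, requires_grad=True)
for epoch in range(5):
    for i in range(len(X)):
        loss = J(w, X[i:i+1], y[i:i+1])
        loss.backward()               # dLoss(x_i, y_i, w)/dw
        with torch.no_grad():
            w -= alpha * w.grad
        w.grad.zero_()
print(w)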
Example datasets: CIFAR10, MNIST, Fashion-MNIST
 The softmax function takes as input a vector of $k$ real numbers and normalizes it into a probability distribution

$\mathrm{softmax}(y_i) = \frac{e^{y_i}}{\sum_{j=1}^{k} e^{y_j}}$

$\mathrm{softmax}(y) = \frac{1}{\sum_{j=1}^{k} e^{y_j}} \begin{bmatrix} e^{y_1} \\ \vdots \\ e^{y_k} \end{bmatrix} = \begin{bmatrix} p(y = 1 \mid x, w) \\ \vdots \\ p(y = k \mid x, w) \end{bmatrix}$

 Negative log-likelihood loss for an example whose true class is $c$: $-\log(\mathrm{softmax}(y)_c)$
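A small sketch of the softmax and the negative log-likelihood loss for a single example with k = 3 classes; the scores are made up:

import torch
import torch.nn.functional as F

scores = torch.tensor([2.0, 0.5, -1.0])   # raw network outputs y_1 .. y_k (made up)
target = torch.tensor(0)                  # index c of the true class

probs = torch.exp(scores) / torch.exp(scores).sum()   # softmax: a probability distribution
nll   = -torch.log(probs[target])                     # -log p(y = c | x, w)

# Built-in equivalent: log_softmax followed by nll_loss (expects a batch dimension)
nll_builtin = F.nll_loss(F.log_softmax(scores, dim=0).unsqueeze(0), target.unsqueeze(0))
print(probs, nll.item(), nll_builtin.item())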
The activation function of all neurons in the hidden layer is ReLU
The output neurons implement LogSoftmax
For the complete code, see cifar10linear.ipynb
See mnist_classification.ipynb
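The notebooks themselves are not reproduced here; a minimal sketch of such a network for 28x28 MNIST images might look like the following, where the hidden-layer size of 128 is an assumption:

import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Flatten(),              # 28x28 image -> 784-dimensional vector
    nn.Linear(784, 128),
    nn.ReLU(),                 # hidden-layer activation: ReLU
    nn.Linear(128, 10),
    nn.LogSoftmax(dim=1),      # output: log-probabilities over the 10 classes
)
criterion = nn.NLLLoss()       # negative log-likelihood loss, pairs with LogSoftmax
optimizer = optim.SGD(model.parameters(), lr=0.01)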
Fashion-MNIST labels
● 0 T-shirt/top
● 1 Trouser
● 2 Pullover
● 3 Dress
● 4 Coat
● 5 Sandal
● 6 Shirt
● 7 Sneaker
● 8 Bag
● 9 Ankle boot

Deadline: Submit the .ipynb file by 16th Feb, midnight


Mini-batch gradient descent
Learning Rate
Avoiding overfitting
 Batch gradient descent
Batch gradient descent computes the gradient of the cost function w.r.t. the parameters $w$ for the entire training dataset.

 Stochastic gradient descent
Stochastic gradient descent (SGD), in contrast, performs a parameter update for each training example $(x^{(i)}, y^{(i)})$.

 Mini-batch gradient descent
Mini-batch gradient descent performs an update for every mini-batch of $n_{batchsize}$ training examples.
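A hedged sketch of a mini-batch training loop using a PyTorch DataLoader; the dataset, model, and batch size are placeholders:

import torch
from torch.utils.data import DataLoader, TensorDataset

# placeholder data: 1000 examples, 20 features, 10 classes
X = torch.randn(1000, 20)
y = torch.randint(0, 10, (1000,))
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = torch.nn.Linear(20, 10)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(5):
    for xb, yb in loader:                 # one parameter update per mini-batch
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()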
 How to choose the learning rate?

$\boldsymbol{w} := \boldsymbol{w} - \alpha \frac{\partial J}{\partial \boldsymbol{w}}$

 A small learning rate converges slowly, while a large learning rate overshoots

 The loss landscape of neural nets is non-convex
 Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations

$v_t = \gamma v_{t-1} + \alpha \nabla_{\boldsymbol{w}} J$
$\boldsymbol{w} := \boldsymbol{w} - v_t$
import torch.optim as optim  # h below is the model (an nn.Module)
optimizer = optim.SGD(h.parameters(), lr=0.001, momentum=0.9)

http://ruder.io/optimizing-gradient-descent/
Learning rate is not fixed
 Adam (Adaptive Moment Estimation)
torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999))

 Adagrad – adapts the learning rate for each weight
torch.optim.Adagrad(params, lr=0.01)

https://pytorch.org/docs/stable/optim.html
1. L2 Weight Regularization

$J(w) = Loss(x, y, w) + \lambda \sum w^2$

torch.optim.SGD(params, lr=<>, momentum=0, weight_decay=0)
Set weight_decay to $\lambda$
2. Dropout:
 Randomly select neurons and remove them along with their incoming and outgoing connections
 Forces the network to use all neurons

torch.nn.functional.dropout(input, p=0.5)
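Inside a model, dropout is often added as a layer instead; a short sketch with arbitrary layer sizes:

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # active during model.train(), disabled during model.eval()
    nn.Linear(128, 10),
)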
3. Early Stopping:
 Stop training before the network starts to overfit
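A hedged sketch of early stopping driven by a held-out validation loss; train_one_epoch and validation_loss are assumed helper functions, not defined in the slides:

import torch

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    train_one_epoch(model, optimizer)          # assumed helper: one pass over D_train
    val = validation_loss(model)               # assumed helper: loss on held-out data
    if val < best_val:
        best_val, bad_epochs = val, 0
        torch.save(model.state_dict(), "best.pt")   # remember the best weights so far
    else:
        bad_epochs += 1
        if bad_epochs >= patience:             # stop before the network overfits
            break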
 The Perceptron – Basic building block of neural networks

 Neural Networks – Stacking perceptrons to build neural networks

 Loss Minimization – Gradient descent

 Implementing ANNs – How to use libraries to implement neural networks

 Training ANNs
