
Omar Arif

omar.arif@seecs.edu.pk
National University of Sciences and Technology
 The Perceptron – Basic building block of neural networks

 Neural Networks – Stacking perceptrons to build neural networks

 Loss Minimization – Gradient descent

 Implementing ANNs – How to use libraries to implement neural networks

 Training ANNs
Building Block of Neural Network
Bias term: $x_0 = 1$, so $w_0$ acts as the bias

$\boldsymbol{x} = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix}, \quad \mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}$

$h_{\mathbf{w}}(\boldsymbol{x}) = g(\mathbf{w}^T \boldsymbol{x})$

where $g$ is the non-linearity (activation function)


 The activation function allows us to introduce non-linearity into the network

 Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$

 Rectified Linear Unit: $\mathrm{relu}(z) = \max(z, 0)$

 Softplus: $\mathrm{softplus}(z) = \log(1 + e^{z})$

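As a quick sketch of these three activation functions (assuming PyTorch, which the later slides also use):

import torch

z = torch.linspace(-3.0, 3.0, steps=7)

sigmoid  = 1 / (1 + torch.exp(-z))        # sigma(z) = 1 / (1 + e^(-z))
relu     = torch.clamp(z, min=0)          # relu(z) = max(z, 0)
softplus = torch.log(1 + torch.exp(z))    # softplus(z) = log(1 + e^z)

# Built-ins that compute the same values:
# torch.sigmoid(z), torch.relu(z), torch.nn.functional.softplus(z)
print(sigmoid, relu, softplus)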

A single perceptron can realize the OR function:

x1  x2  h (OR)
0   0   0
0   1   1
1   0   1
1   1   1

and the AND function:

x1  x2  h (AND)
0   0   0
0   1   0
1   0   0
1   1   1
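A single perceptron $h_{\mathbf{w}}(\boldsymbol{x}) = \sigma(\mathbf{w}^T \boldsymbol{x})$ with suitable weights reproduces both tables above; the weight values below are one illustrative choice, not taken from the slides:

import torch

def perceptron(x, w):
    # h_w(x) = sigmoid(w^T x), where x[0] = 1 is the bias input
    return torch.sigmoid(w @ x)

w_or  = torch.tensor([-10.0, 20.0, 20.0])   # illustrative weights for OR
w_and = torch.tensor([-30.0, 20.0, 20.0])   # illustrative weights for AND

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = torch.tensor([1.0, x1, x2])          # prepend the bias term x0 = 1
    print(x1, x2,
          round(perceptron(x, w_or).item()),
          round(perceptron(x, w_and).item()))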
 Non-Linear Decision Boundary

x1  x2  h (XOR)
0   0   0
0   1   1
1   0   1
1   1   0

 Using the basic perceptron, we cannot approximate a non-linear function such as XOR
 Feature Engineering: use higher-order features such as $x^2$ or $x^3$ to obtain a non-linear function
 Problem: we don't know which features to choose
 We would like to automate this and let the algorithm choose the features

 Neural networks allow one to automatically learn representations/features for a linear classifier that are geared towards the desired task, rather than specifying them all by hand
Building neural networks by stacking perceptrons

x1  x2  h1 (AND)  h2 (NOR)  h (OR of h1, h2)
0   0   0         1         1
0   1   0         0         0
1   0   0         0         0
1   1   1         0         1

Stacking the units this way, the output $h = \mathrm{OR}(\mathrm{AND}(x_1, x_2), \mathrm{NOR}(x_1, x_2))$ computes XNOR, a non-linear function of the inputs.
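A minimal sketch of this stacking: the two hidden units compute AND and NOR of the inputs, and the output unit computes OR of the hidden units, giving XNOR. The weights are illustrative choices, not from the slides:

import torch

def unit(x, w):
    # one perceptron: sigmoid(w^T x), with x[0] = 1 as the bias input
    return torch.sigmoid(w @ x)

w_and = torch.tensor([-30.0,  20.0,  20.0])   # hidden unit 1: AND(x1, x2)
w_nor = torch.tensor([ 10.0, -20.0, -20.0])   # hidden unit 2: NOR(x1, x2)
w_or  = torch.tensor([-10.0,  20.0,  20.0])   # output unit:   OR(h1, h2)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x  = torch.tensor([1.0, x1, x2])
    h1 = unit(x, w_and).item()
    h2 = unit(x, w_nor).item()
    h  = unit(torch.tensor([1.0, h1, h2]), w_or)  # OR(AND, NOR) = XNOR(x1, x2)
    print(x1, x2, round(h.item()))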
Loss function tells us how good our neural network is
 Optimization problem
Training data: $D_{train} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$

$\min_{\boldsymbol{w}} J(\boldsymbol{w}, D_{train})$

$J(\boldsymbol{w}, D_{train}) = \frac{1}{m} \sum_{(x,y) \in D_{train}} Loss(x, y, \boldsymbol{w})$

 Goal: compute the gradient $\nabla_{\boldsymbol{w}} J(\boldsymbol{w}, D_{train})$
 Mean squared error loss

$Loss(x, y, \boldsymbol{w}) = (h_{\boldsymbol{w}}(x) - y)^2$

 Binary Cross Entropy Loss (Logistic Loss)

$Loss(x, y, \boldsymbol{w}) = -y \log(h(x)) - (1 - y) \log(1 - h(x))$
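A small sketch of computing these two losses in PyTorch; the output and label values below are made up for illustration:

import torch
import torch.nn.functional as F

h = torch.tensor([0.9, 0.2, 0.7])   # network outputs h_w(x) for three examples (made up)
y = torch.tensor([1.0, 0.0, 1.0])   # true labels

mse = ((h - y) ** 2).mean()                                    # mean squared error
bce = (-y * torch.log(h) - (1 - y) * torch.log(1 - h)).mean()  # binary cross entropy

# Built-in equivalents: F.mse_loss(h, y) and F.binary_cross_entropy(h, y)
print(mse.item(), bce.item())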


 Forward Pass: compute the output of the network

 Backward Pass: compute gradients

See Backpropagation_examples.pdf
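A minimal sketch of a forward and backward pass using PyTorch autograd; the tiny network and data are placeholders, not the worked example in Backpropagation_examples.pdf:

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 3), nn.ReLU(), nn.Linear(3, 1), nn.Sigmoid())
x = torch.tensor([[0.0, 1.0]])
y = torch.tensor([[1.0]])

out  = net(x)                     # forward pass: compute the network output
loss = ((out - y) ** 2).mean()    # and the loss

loss.backward()                   # backward pass: gradients of the loss w.r.t. all weights
print(net[0].weight.grad)         # gradient for the first layer's weights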
Batch Gradient Descent
 Initialize weights randomly
 Loop
 Compute the gradient $\frac{\partial J(\boldsymbol{w})}{\partial \boldsymbol{w}}$
 Update $\boldsymbol{w} := \boldsymbol{w} - \alpha \frac{\partial J(\boldsymbol{w})}{\partial \boldsymbol{w}}$

Stochastic Gradient Descent
 Initialize weights randomly
 Loop
 For each data point $(x, y)$ in $D_{train}$
 Compute the gradient $\frac{\partial Loss(x, y, \boldsymbol{w})}{\partial \boldsymbol{w}}$
 Update $\boldsymbol{w} := \boldsymbol{w} - \alpha \frac{\partial Loss(x, y, \boldsymbol{w})}{\partial \boldsymbol{w}}$
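A hedged sketch of both loops on a toy linear regression problem; the data and model are invented for illustration:

import torch

# toy data: y is roughly 2*x1 - x2
X = torch.randn(100, 2)
y = X @ torch.tensor([2.0, -1.0]) + 0.01 * torch.randn(100)
alpha = 0.1

def J(w, X, y):                       # mean squared loss over the given data
    return ((X @ w - y) ** 2).mean()

# Batch gradient descent: one update per pass over all of D_train
w = torch.zeros(2, requires_grad=True)
for epoch in range(50):
    loss = J(w, X, y)
    loss.backward()                   # dJ(w)/dw over the whole training set
    with torch.no_grad():
        w -= alpha * w.grad
    w.grad.zero_()

# Stochastic gradient descent: one update per training example
w = torch.zeros(2, requires_grad=True)
for epoch in range(5):
    for i in range(len(X)):
        loss = J(w, X[i:i+1], y[i:i+1])
        loss.backward()               # dLoss(x_i, y_i, w)/dw
        with torch.no_grad():
            w -= alpha * w.grad
        w.grad.zero_()
print(w)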
Example datasets: CIFAR10, MNIST, Fashion-MNIST
 The softmax function takes as input a vector of $k$ real numbers and normalizes it into a probability distribution

$\mathrm{softmax}(y_i) = \frac{e^{y_i}}{\sum_{j=1}^{k} e^{y_j}}$

$\mathrm{softmax}(y) = \frac{1}{\sum_{j=1}^{k} e^{y_j}} \begin{bmatrix} e^{y_1} \\ \vdots \\ e^{y_k} \end{bmatrix} = \begin{bmatrix} p(y = 1 \mid x, w) \\ \vdots \\ p(y = k \mid x, w) \end{bmatrix}$

 Negative log-likelihood loss for an example whose true class is $c$: $-\log(\mathrm{softmax}(y)_c)$
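A small sketch of the softmax and the negative log-likelihood loss for a single example with k = 3 classes; the scores are made up:

import torch
import torch.nn.functional as F

scores = torch.tensor([2.0, 0.5, -1.0])   # raw network outputs y_1 .. y_k (made up)
target = torch.tensor(0)                  # index c of the true class

probs = torch.exp(scores) / torch.exp(scores).sum()   # softmax: a probability distribution
nll   = -torch.log(probs[target])                     # -log p(y = c | x, w)

# Built-in equivalent: log_softmax followed by nll_loss (expects a batch dimension)
nll_builtin = F.nll_loss(F.log_softmax(scores, dim=0).unsqueeze(0), target.unsqueeze(0))
print(probs, nll.item(), nll_builtin.item())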
The activation function of all neurons in the hidden layer is ReLU
The output neurons implement LogSoftmax
For the complete code, see cifar10linear.ipynb
See mnist_classification.ipynb
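The notebooks themselves are not reproduced here; a minimal sketch of such a network for 28x28 MNIST images might look like the following, where the hidden-layer size of 128 is an assumption:

import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Flatten(),              # 28x28 image -> 784-dimensional vector
    nn.Linear(784, 128),
    nn.ReLU(),                 # hidden-layer activation: ReLU
    nn.Linear(128, 10),
    nn.LogSoftmax(dim=1),      # output: log-probabilities over the 10 classes
)
criterion = nn.NLLLoss()       # negative log-likelihood loss, pairs with LogSoftmax
optimizer = optim.SGD(model.parameters(), lr=0.01)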
Fashion-MNIST labels
● 0 T-shirt/top
● 1 Trouser
● 2 Pullover
● 3 Dress
● 4 Coat
● 5 Sandal
● 6 Shirt
● 7 Sneaker
● 8 Bag
● 9 Ankle boot

Deadline: Submit the .ipynb file by 16th Feb, midnight


Mini-batch gradient descent
Learning Rate
Avoiding overfitting
 Batch gradient descent
Batch gradient descent computes the gradient of the cost function w.r.t. the parameters $w$ for the entire training dataset.

 Stochastic gradient descent
Stochastic gradient descent (SGD), in contrast, performs a parameter update for each training example $(x^{(i)}, y^{(i)})$.

 Mini-batch gradient descent
Mini-batch gradient descent performs an update for every mini-batch of $n_{batchsize}$ training examples.
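A hedged sketch of a mini-batch training loop using a PyTorch DataLoader; the dataset, model, and batch size are placeholders:

import torch
from torch.utils.data import DataLoader, TensorDataset

# placeholder data: 1000 examples, 20 features, 10 classes
X = torch.randn(1000, 20)
y = torch.randint(0, 10, (1000,))
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = torch.nn.Linear(20, 10)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(5):
    for xb, yb in loader:                 # one parameter update per mini-batch
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()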
 How to choose the learning rate?

$\boldsymbol{w} := \boldsymbol{w} - \alpha \frac{\partial J}{\partial \boldsymbol{w}}$

 A small learning rate converges slowly, while a large learning rate overshoots

 The loss landscape of neural nets is non-convex
 Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations

$v_t = \gamma v_{t-1} + \alpha \nabla_{\boldsymbol{w}} J$
$\boldsymbol{w} := \boldsymbol{w} - v_t$
import torch.optim as optim  # h below is the model (an nn.Module)
optimizer = optim.SGD(h.parameters(), lr=0.001, momentum=0.9)

http://ruder.io/optimizing-gradient-descent/
Learning rate is not fixed
 Adam (Adaptive Moment Estimation)
torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999))

 Adagrad – adapts the learning rate for each weight
torch.optim.Adagrad(params, lr=0.01)

https://pytorch.org/docs/stable/optim.html
1. L2 Weight Regularization

$J(w) = Loss(x, y, w) + \lambda \sum w^2$

torch.optim.SGD(params, lr=<>, momentum=0, weight_decay=0)
Set weight_decay to $\lambda$
2. Dropout:
 Randomly select neurons and remove them along with their incoming and outgoing connections
 Forces the network to use all neurons

torch.nn.functional.dropout(input, p=0.5)
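Inside a model, dropout is often added as a layer instead; a short sketch with arbitrary layer sizes:

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # active during model.train(), disabled during model.eval()
    nn.Linear(128, 10),
)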
3. Early Stopping:
 Stop training before the network starts to overfit
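A hedged sketch of early stopping driven by a held-out validation loss; train_one_epoch and validation_loss are assumed helper functions, not defined in the slides:

import torch

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    train_one_epoch(model, optimizer)          # assumed helper: one pass over D_train
    val = validation_loss(model)               # assumed helper: loss on held-out data
    if val < best_val:
        best_val, bad_epochs = val, 0
        torch.save(model.state_dict(), "best.pt")   # remember the best weights so far
    else:
        bad_epochs += 1
        if bad_epochs >= patience:             # stop before the network overfits
            break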
 The Perceptron – Basic building block of neural networks

 Neural Networks – Stacking perceptrons to build neural networks

 Loss Minimization – Gradient descent

 Implementing ANNs – How to use libraries to implement neural networks

 Training ANNs
