
U18AII4202 Neural Networks and Deep Learning

Dr. S. Sangeetha
Associate Professor
Department of Artificial Intelligence and Data Science
Kumaraguru College of Technology
Course Outcomes

CO 1: Understand different methodologies to create applications using deep nets.

CO 2: Design test procedures to assess the efficacy of the developed model.

CO 3: Identify and apply appropriate deep learning models for analyzing data for a variety of problems.

CO 4: Implement different deep learning algorithms.

Syllabus

CONVOLUTIONAL NEURAL NETWORKS 10 Hours

Architectural Overview, Motivation, Layers, Filters, Parameter Sharing, Regularization, Popular
CNN Architectures: ResNet, AlexNet - Applications

RECURRENT AND RECURSIVE NETS 12 Hours

Recurrent Neural Networks, Bidirectional RNNs, Encoder-Decoder Sequence-to-Sequence
Architectures - BPTT for Training RNNs, Long Short-Term Memory Networks, Computer Vision -
Speech Recognition - Natural Language Processing, Case Studies in Classification, Regression and
Deep Networks

DEEP LEARNING ARCHITECTURES 12 Hours

Machine Learning and Deep Learning, Representation Learning, Width and Depth of Neural
Networks, Learning Algorithms: Capacity - Overfitting - Underfitting - Bayesian Classification -
Activation Functions: ReLU, LReLU, EReLU, Unsupervised Training of Neural Networks, Restricted
and Deep Boltzmann Machines, Autoencoders

ADVANCED NEURAL NETWORKS 11 Hours

Deep Feedforward Networks: Gradient-Based Learning - Hidden Units - Architectural Design - Back-
Propagation Algorithms - Regularization for Deep Learning: Dataset Augmentation - Noise Robustness
- Semi-Supervised Learning - Multitask Learning - Deep Belief Networks - Generative Adversarial
Networks using Keras and MXNet
Books

Text Books:
1. Ian Goodfellow, Yoshua Bengio, and Aaron Courville, "Deep Learning", First Edition, MIT Press, 2016.
2. Nikhil Buduma and Nicholas Locascio, "Fundamentals of Deep Learning", First Edition, O'Reilly, 2017.

Reference Books:
1. Josh Patterson and Adam Gibson, "Deep Learning: A Practitioner's Approach", O'Reilly Media, 2017.
2. Laura Graesser and Wah Loon Keng, "Foundations of Deep Reinforcement Learning: Theory and Practice in Python", Addison-Wesley Professional, 2020.
3. Jon Krohn, Grant Beyleveld, and Aglaé Bassens, "Deep Learning Illustrated: A Visual, Interactive Guide to Artificial Intelligence", 1st Edition, Addison-Wesley Professional, 2019.
Why Deep Learning?

Pillars of Deep Learning

1. Big data
2. GPUs and TPUs
3. Algorithmic advancements

Why Deep Learning? A Short Story
ML vs DL
Neural Network
Neuron

$\hat{y} = \mathrm{activation}(w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b)$
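A minimal NumPy sketch of this computation (the weights, inputs, bias, and the sigmoid choice below are illustrative, not taken from the slides):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, activation=sigmoid):
    # weighted sum of inputs plus bias, passed through the activation
    return activation(np.dot(w, x) + b)

# example values (hypothetical)
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.6])
b = 0.2
print(neuron(x, w, b))  # y_hat for this single neuron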
Model Building
- Architecture
- Loss function
- Optimization
- Metrics

Architecture
- Number of layers
- Number of units per layer
- Activation function (linear vs. nonlinear)
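As a hedged sketch of how these choices map onto Keras (the layer sizes, loss, and optimizer here are illustrative, not prescribed by the slides):

import tensorflow as tf

# Architecture: number of layers, units per layer, activation functions
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(4,)),
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# Loss function, optimization algorithm, and metrics
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])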
Activation functions

sigmoid: $f(x) = \frac{1}{1 + e^{-x}}$
tanh:    $f(x) = \frac{2}{1 + e^{-2x}} - 1$
relu:    $f(x) = \max(0, x)$
linear:  $f(x) = mx$
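A small NumPy sketch of the four activations listed above (the test input is arbitrary):

import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def tanh(x):    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0  # equivalent to np.tanh(x)
def relu(x):    return np.maximum(0.0, x)
def linear(x, m=1.0): return m * x

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, relu, linear):
    print(f.__name__, f(x))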
Loss Function

$J(w, b) = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}$
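For example, the mean squared error above can be computed directly (the values below are made up):

import numpy as np

def mse(y_true, y_pred):
    # J(w, b) = sum((y_i - y_hat_i)^2) / n
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])
print(mse(y_true, y_pred))  # 0.375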
Gradient Descent (learning rate: optimal)
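A minimal gradient-descent sketch for the MSE loss of a single linear neuron (the toy data, learning rate, and iteration count are illustrative assumptions):

import numpy as np

# toy data: y = 2x + 1 with a little noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 2 * x + 1 + 0.05 * rng.normal(size=50)

w, b, lr = 0.0, 0.0, 0.1
for step in range(200):
    y_hat = w * x + b
    # gradients of J = mean((y - y_hat)^2) with respect to w and b
    dw = -2 * np.mean((y - y_hat) * x)
    db = -2 * np.mean(y - y_hat)
    w -= lr * dw
    b -= lr * db

print(w, b)  # should approach 2 and 1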
Neural Network

Single Neuron Solution (figure)
Four Neurons Solution (figure)
Hidden Layer Solution (figure)
Convolutional Neural Networks
1. Introduction
2. The Convolution Operation
3. Motivation
4. Pooling
5. Convolution and Pooling as an Infinitely Strong Prior
6. Variants of the Basic Convolution Function

Introduction

Multi-Layer Perceptron (MLP) for an image recognition task

Use the intensity of each pixel as a 1D input vector.

There are several problems with this approach:
- Requires an extremely large number of parameters
- No invariance to shifting or scaling
- Limited capability to capture spatial information
- Unable to deal with variable-size input data

A 28x28 image gives 784 input neurons and a large number of parameters.

Image source : https://en.wikipedia.org/wiki/Artificial_neural_network
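To make the parameter-count problem concrete, a rough back-of-the-envelope sketch (the hidden-layer width of 100 is a hypothetical choice):

# Fully connected layer from a flattened 28x28 image to 100 hidden units
inputs = 28 * 28            # 784 pixel intensities
hidden_units = 100          # hypothetical hidden-layer width
params = inputs * hidden_units + hidden_units  # weights + biases
print(params)               # 78500 parameters for a single small hidden layer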

Convolutional neural network (CNN)
A CNN consists of:
- Convolutional layers
- Pooling layers
- Fully connected layers
A CNN learns the kernels of its convolutional layers.
- To make computation easier, CNNs use cross-correlation instead of convolution.

Image source : https://www.kdnuggets.com/2015/11/understanding-convolutional-neural-networks-nlp.htm
ILSVRC
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
- One of the biggest computer vision contests
- ~1M images, 1000 labels

Convolutional Neural Network
- AlexNet (Krizhevsky et al.) won ILSVRC 2012
- University of Toronto: error rate of 16.4%
- It also showed that CNNs can be accelerated with GPUs.

Image source : https://www.researchgate.net/figure/289928157_fig7_Figure-1-An-illustration-of-the-weights-in-the-AlexNet-model-Note-that-after-every
The Convolution Operation
Convolution
• Convolution is a mathematical operation that combines two
functions to produce a third function. In the context of signal
processing and neural networks, convolution is often used to
process data through filters or kernels.
• In convolutional network terminology, the first argument to
the convolution is often referred to as the input and the
second argument as the kernel. The output is sometimes
referred to as the feature map.
Convolution in the discrete case
Discrete 1D convolution on a computer

Input  x (indices 0..6):    3   7   5   6   4   2   1
Kernel w (indices -2..2):   0.3   0.5   0.7   0.6   0.4

$s(i) = (x * w)(i) = \sum_{m} x(i - m)\, w(m)$

Building one output element term by term (the kernel is flipped over the input):

s(2) = 4(0.3) + 6(0.5) + 5(0.7) + 7(0.6) + 3(0.4)
     = 1.2 -> 4.2 -> 7.7 -> 11.9 -> 13.1 (running sum), so s(2) = 13.1
Cross-correlation
Discrete 1D cross-correlation

Input:  3   7   5   6   4   2   1
Kernel: 0.3   0.5   0.7   0.6   0.4

Output: 16.1
Convolution vs. Cross-correlation

Convolution:        $s(i) = (x * w)(i) = \sum_{m} x(i - m)\, w(m)$
Cross-correlation:  $s(i) = (x \star w)(i) = \sum_{m} x(i + m)\, w(m)$
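The difference is easy to check in NumPy with the input and kernel from the previous slides; convolution flips the kernel, cross-correlation does not:

import numpy as np

x = np.array([3, 7, 5, 6, 4, 2, 1], dtype=float)
w = np.array([0.3, 0.5, 0.7, 0.6, 0.4])

print(np.convolve(x, w, mode='valid'))   # convolution (kernel flipped):    [13.1 12.6  9.7]
print(np.correlate(x, w, mode='valid'))  # cross-correlation (no flipping): [13.1 12.0  8.9]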
Convolution
In terms of deep learning, an (image) convolution is an element-
wise multiplication of two matrices followed by a sum.
1.Take two matrices (which both have the same dimensions).
2.Multiply them, element-by-element (i.e., not the dot product,
just a simple multiplication).
3.Sum the elements together.

Nearly all machine learning and deep learning libraries use the
simplified cross-correlation function.
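A minimal sketch of the element-wise multiply-and-sum recipe above (the example matrices are arbitrary):

import numpy as np

def conv_step(region, kernel):
    # element-wise multiply the two equally sized matrices, then sum
    return np.sum(region * kernel)

region = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
kernel = np.array([[ 1, 0, -1],
                   [ 1, 0, -1],
                   [ 1, 0, -1]])
print(conv_step(region, kernel))  # (1+4+7) - (3+6+9) = -6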
2D Convolution
A Convolutional Neural Network (CNN) learns appropriate kernel values.
Many neural network libraries implement a related function called
cross-correlation, which is the same as convolution but without flipping
the kernel.

source : https://cambridgespark.com/content/tutorials/convolutional-neural-networks-with-keras/index.html
Convolution
Computer Vision Problem: edge detection (vertical edges, horizontal edges)

Vertical edge detection examples

    10 10 10  0  0  0
    10 10 10  0  0  0       1  0 -1        0 30 30  0
    10 10 10  0  0  0   *   1  0 -1   =    0 30 30  0
    10 10 10  0  0  0       1  0 -1        0 30 30  0
    10 10 10  0  0  0                      0 30 30  0
    10 10 10  0  0  0

     0  0  0 10 10 10
     0  0  0 10 10 10       1  0 -1        0 -30 -30  0
     0  0  0 10 10 10   *   1  0 -1   =    0 -30 -30  0
     0  0  0 10 10 10       1  0 -1        0 -30 -30  0
     0  0  0 10 10 10                      0 -30 -30  0
     0  0  0 10 10 10

(Andrew Ng, deeplearning.ai)
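To check the 4x4 output above, a small NumPy cross-correlation sketch (no library convolution function, so the indexing is explicit):

import numpy as np

def correlate2d_valid(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.array([[10, 10, 10, 0, 0, 0]] * 6)   # left half bright, right half dark
kernel = np.array([[1, 0, -1]] * 3)             # vertical edge detector
print(correlate2d_valid(image, kernel))         # columns: 0, 30, 30, 0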
Vertical and Horizontal Edge Detection

    1  0 -1        1  1  1
    1  0 -1        0  0  0
    1  0 -1       -1 -1 -1
    Vertical       Horizontal

    10 10 10  0  0  0
    10 10 10  0  0  0       1  1  1        0   0   0   0
    10 10 10  0  0  0   *   0  0  0   =   30  10 -10 -30
     0  0  0 10 10 10      -1 -1 -1       30  10 -10 -30
     0  0  0 10 10 10                      0   0   0   0
     0  0  0 10 10 10
Learning to detect edges

Rather than hand-designing the 3x3 filter

    1  0 -1
    1  0 -1
    1  0 -1

treat its nine entries w1, ..., w9 as parameters to be learned:

    3 0 1 2 7 4
    1 5 8 9 3 1          w1 w2 w3
    2 7 2 5 1 3     *    w4 w5 w6
    0 1 3 1 7 8          w7 w8 w9
    4 2 1 6 2 8
    2 4 5 2 3 9

Motivation

Benefits of including convolution layers in deep learning

Convolution leverages three important ideas:

- Sparse interactions
- Parameter sharing
- Equivariant representations
Sparse interaction
• Traditional neural network layers use matrix
multiplication by a matrix of parameters with a separate
parameter describing the interaction between each input
unit and each output unit.
• This means every output unit interacts with every input
unit.
• Convolutional networks, however, typically have sparse interactions
(also referred to as sparse connectivity or sparse weights). This is
accomplished by making the kernel smaller than the input.
• For example, when processing an image, the input
image might have thousands or millions of pixels, but we
can detect small, meaningful features such as edges
with kernels that occupy only tens or hundreds of pixels.
Sparse interaction (traditional neural net)
Neurons of the input layer and hidden layer are "fully connected".
Complexity is O(n * m)

  Output: n units
  Weights: n * m
  Input: m units
Sparse interaction (convolution)
Making the kernel smaller than the input causes sparsity.
Complexity is O(n * k), where k is the kernel size.
Fewer parameters need to be stored and computed.

  Output: n units
  Weights: k per output unit
  Input: m units
Sparse interaction (comparison)
Output neurons influenced by a single input neuron
(figure: sparse/convolutional connectivity vs. fully connected)

Sparse interaction (comparison)
Input neurons that influence a single output neuron
(figure: sparse/convolutional connectivity vs. fully connected)
Sparse interaction (indirectly fully influenced)

Even though direct connections are sparse, a unit in a deeper layer is
indirectly influenced by almost all of the input neurons if the network
is deep enough.

(figure: Input -> Conv1 output -> Conv2 output)
Why convolutions

    10 10 10  0  0  0
    10 10 10  0  0  0       1  0 -1        0 30 30  0
    10 10 10  0  0  0   *   1  0 -1   =    0 30 30  0
    10 10 10  0  0  0       1  0 -1        0 30 30  0
    10 10 10  0  0  0                      0 30 30  0
    10 10 10  0  0  0

Parameter sharing: a feature detector (such as a vertical edge detector)
that is useful in one part of the image is probably useful in another
part of the image.

Sparsity of connections: in each layer, each output value depends only
on a small number of inputs.
Parameter sharing
The kernel is shared across every input position.
Parameter sharing has some effects:
- Extracts the same features at every location
- Reduces memory usage
Parameter Sharing
Parameter sharing refers to using the same parameter for
more than one function in a model.
In a traditional neural net, each element of the weight
matrix is used exactly once when computing the output of a
layer.
It is multiplied by one element of the input and then never
revisited.
A network with parameter sharing is said to have tied weights, because
the value of the weight applied to one input is tied to the value of a
weight applied elsewhere.
The parameter sharing used by the convolution operation
means that rather than learning a separate set of
parameters for every location, we learn only one set.
Equivariance to Translation

• To say a function is equivariant means that if the input changes, the
output changes in the same way.

• Specifically, a function f(x) is equivariant to a function g if
f(g(x)) = g(f(x)).
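A quick NumPy check of this property, using circular cross-correlation and a circular shift so the boundaries line up (the signal and kernel are arbitrary):

import numpy as np

def circ_correlate(x, w):
    # circular 1D cross-correlation: out[i] = sum_j x[(i + j) mod n] * w[j]
    n, k = len(x), len(w)
    return np.array([sum(x[(i + j) % n] * w[j] for j in range(k)) for i in range(n)])

def shift(x, s):
    # circular shift by s positions
    return np.roll(x, s)

x = np.array([0., 1., 3., 2., 0., 0., 1.])
w = np.array([1., 0., -1.])

lhs = circ_correlate(shift(x, 2), w)   # f(g(x)): shift first, then filter
rhs = shift(circ_correlate(x, w), 2)   # g(f(x)): filter first, then shift
print(np.allclose(lhs, rhs))           # True: the filtering is equivariant to translation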
Padding

With an n x n image, an f x f filter, and padding p, the output size is:

$(n + 2p - f + 1) \times (n + 2p - f + 1)$
Valid and Same convolutions

"Valid": no padding.

"Same": pad so that the output size is the same as the input size:

$p = \frac{f - 1}{2}$
Strided convolutions

(figure: a 7x7 input convolved with a 3x3 filter using stride 2 gives a 3x3 output)
Summary of convolutions

n x n image, f x f filter, padding p, stride s

Output size: $\left\lfloor \frac{n + 2p - f}{s} + 1 \right\rfloor \times \left\lfloor \frac{n + 2p - f}{s} + 1 \right\rfloor$
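The same formula as a small helper function (integer floor division handles the non-integer case):

def conv_output_size(n, f, p=0, s=1):
    # floor((n + 2p - f) / s) + 1
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))            # 4  (valid convolution of a 6x6 image with a 3x3 filter)
print(conv_output_size(6, 3, p=1))       # 6  ("same" padding)
print(conv_output_size(7, 3, p=0, s=2))  # 3  (the strided example above)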
Convolutions over volumes

Convolutions on RGB images
(figure: a 6x6x3 RGB image convolved with a 3x3x3 filter gives a 4x4 output)

Multiple filters
(figure: a 6x6x3 image convolved with two 3x3x3 filters gives two 4x4 feature
maps, stacked into a 4x4x2 output)
One layer of a convolutional network

Example of a layer
(figure: a 6x6x3 input convolved with two 3x3x3 filters forms one layer of a
convolutional network)
Number of parameters in one layer
If you have 10 filters that are 3 x 3 x 3 in one layer of a neural network,
how many parameters does that layer have?
Solution
Each filter has 3*3*3 weights + 1 bias = 28 parameters.
28*10 = 280 parameters
Find number of parameters
tf.keras.layers.Conv2D(16, (3,3), activation='relu',
input_shape=(150, 150, 3))
Solution
3*3*3+1=28
28*16=448
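One way to double-check these counts is to build the layer and ask Keras directly (a quick sketch):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, (3, 3), activation='relu', input_shape=(150, 150, 3)),
])
model.summary()  # reports 448 trainable parameters: (3*3*3 + 1) * 16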
Tensorflow code with stride and padding
tf.keras.layers.Conv2D(16, (3,3), strides=(2, 2), padding='same',
activation='relu')
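For reference, a quick way to see the effect of 'same' padding with stride 2 on the spatial size (the 150x150x3 input shape is carried over from the previous example as an assumption):

import tensorflow as tf

layer = tf.keras.layers.Conv2D(16, (3, 3), strides=(2, 2), padding='same', activation='relu')
# 'same' padding with stride 2 gives ceil(150 / 2) = 75 in each spatial dimension
print(layer(tf.zeros((1, 150, 150, 3))).shape)  # (1, 75, 75, 16)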
A simple convolution network example

Example ConvNet (figure)
Types of layer in a convolutional network:

- Convolution
- Pooling
- Fully connected
Translation Invariance

Pooling layers
Pooling
• Pooling helps make the representation approximately invariant to small
translations of the input.
• Invariance to translation means that if we translate the input by a small
amount, the values of most of the pooled outputs do not change.
• Invariance to local translation can be a very useful property if we care
more about whether some feature is present than exactly where it is.
• The use of pooling can be viewed as adding an infinitely strong prior that
the function the layer learns must be invariant to small translations.
Pooling layer: Max pooling

    1 3 2 1
    2 9 1 1
    1 3 2 3
    5 6 1 2

Deep Learning | Pooling and Fully Connected layers (2020) (youtube.com)

Pooling layer: Max pooling

    1 3 2 1 3
    2 9 1 1 5
    1 3 2 3 2
    8 3 5 1 0
    5 6 1 2 9
Pooling layer: Average pooling

    1 3 2 1
    2 9 1 1
    1 4 2 3
    5 6 1 2
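A small NumPy sketch of 2x2 max and average pooling with stride 2, applied to the first 4x4 matrix above (the window size and stride are the usual defaults, assumed here rather than stated on the slides):

import numpy as np

def pool2d(x, size=2, stride=2, op=np.max):
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = x[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = op(window)
    return out

x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]])
print(pool2d(x, op=np.max))   # [[9. 2.] [6. 3.]]
print(pool2d(x, op=np.mean))  # [[3.75 1.25] [3.75 2.  ]]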
Regularization

L1 Regularization
L2 Regularization
Dropout
L1 Regularization
tf.keras.layers.Conv2D(128, (3, 3), activation='relu', use_bias=True,
    kernel_regularizer=tf.keras.regularizers.l1(0.01))

L2 Regularization
tf.keras.layers.Conv2D(128, (3, 3), activation='relu', use_bias=True,
    kernel_regularizer=tf.keras.regularizers.l2(0.01))
Dropout
tf.keras.layers.Conv2D(128, (3,3), activation = 'relu', use_bias=True ),
tf.keras.layers.Dropout(0.2)

The Dropout layer is a mask that nullifies the contribution of some neurons
towards the next layer and leaves unmodified all others. We can apply a
Dropout layer to the input vector, in which case it nullifies some of its features;
but we can also apply it to a hidden layer, in which case it nullifies some hidden
neurons.
CNN Architectures

ResNet
AlexNet
AlexNet
AlexNet famously won the ImageNet LSVRC-2012 competition by a large margin
(15.3% vs. 26.2% (second place) error rates).

Major highlights
1. Used ReLU instead of tanh to add non-linearity.
2. Used dropout, rather than other regularization methods, to deal with overfitting.
3. Overlapping pooling was used to reduce the size of the network.

AlexNet solves the problem of image classification with a subset of the
ImageNet dataset, with roughly 1.2 million training images, 50,000 validation
images, and 150,000 testing images. The input is an image belonging to one of
1000 different classes, and the output is a vector of 1000 numbers.
Saturating and non-saturating activation functions
• A saturating activation function squeezes the input. Saturating
nonlinearities are the more traditional functions used in CNNs,
e.g. sigmoid and tanh.
• A non-saturating function like ReLU is faster to train.

Why is ReLU faster?
1. Sparse activation: ReLU and its variants introduce sparsity in the
network activations. During the forward pass, ReLU sets all negative
values to zero. This sparsity encourages more efficient computation
during the forward and backward passes, as only a subset of neurons
is activated.
2. Avoiding the vanishing gradient: saturating activation functions like
sigmoid and tanh suffer from the vanishing gradient problem, where
gradients become very small in deep networks, hindering effective
learning. ReLU mitigates this issue by having a constant gradient for
positive inputs, so gradients do not vanish there.
3. Computational efficiency: ReLU is a simple threshold at zero, which is
cheaper to compute than the exponentials required by sigmoid and tanh.
4. Effective representation learning: ReLU encourages the network to
learn more effective representations of the data.
Overlap Pooling
• Overlap pooling is a technique used in convolutional neural
networks (CNNs) for down-sampling feature maps.
• Traditional pooling operations, such as max pooling or
average pooling, divide the input feature map into non-
overlapping regions and apply the pooling operation
independently to each region.
• However, in overlap pooling, instead of using non-overlapping
regions, the pooling operation is applied with overlapping
regions.
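In Keras terms, overlap pooling simply means a pooling window larger than the stride; AlexNet used 3x3 windows with stride 2. The layers below are a sketch of that contrast (the layer names are illustrative):

import tensorflow as tf

# Non-overlapping pooling: window size equals the stride
non_overlap = tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2))

# Overlap pooling (AlexNet-style): 3x3 windows moved with stride 2, so adjacent windows overlap
overlap = tf.keras.layers.MaxPooling2D(pool_size=(3, 3), strides=(2, 2))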
AlexNet
• The input to AlexNet is an RGB image of size 256*256.
• The network is trained on the raw RGB pixel values, so if an input image is
grayscale, it is converted to an RGB image.
AlexNet Architecture
AlexNet
• The first convolutional layer uses a kernel of size 11x11 (stride 4) and
applies 96 filters to the input image.
• The second convolutional layer uses a kernel of size 5x5 and applies 256
filters.
• The third, fourth, and fifth convolutional layers use 3x3 kernels and apply
384, 384, and 256 filters respectively.
• The output of these convolutional layers is passed through max-pooling
layers that reduce the spatial dimensions of the feature maps.
AlexNet
•The output of the pooling layers is then passed through
three fully connected layers, with 4096, 4096, and 1000
neurons respectively. The last fully connected layer is used
for classification, and produces a probability distribution
over the 1000 ImageNet classes.
•AlexNet was trained on the ImageNet dataset, which
consists of 1.2 million images with 1000 classes, and was
able to achieve high recognition accuracy.
•The AlexNet architecture was the first to show that CNNs
could significantly outperform traditional machine learning
methods in image recognition tasks, and was an important
step in the development of deeper architectures like
VGGNet, GoogLeNet, and ResNet.
ResNet
Motivation: vanishing/exploding gradients. These cause the gradient to become
0 or too large, so as the number of layers increases, the training and test
error rates also increase.
ResNet, proposed in 2015 by researchers at Microsoft Research, introduced a
new architecture called the Residual Network.
Residual Network

• Residual Network: In order to solve the


problem of the vanishing/exploding
gradient, this architecture introduced the
concept called Residual Blocks. In this
network, we use a technique called skip
connections.
• The skip connection connects
activations of a layer to further layers
by skipping some layers in between.
This forms a residual block.
• Resnets are made by stacking these
residual blocks together.
Residual block

$a^{[l]} \rightarrow a^{[l+1]} \rightarrow a^{[l+2]}$

Main path:
$z^{[l+1]} = W^{[l+1]} a^{[l]} + b^{[l+1]}$,   $a^{[l+1]} = g(z^{[l+1]})$
$z^{[l+2]} = W^{[l+2]} a^{[l+1]} + b^{[l+2]}$,   $a^{[l+2]} = g(z^{[l+2]})$

With the skip connection, the block instead outputs:
$a^{[l+2]} = g(z^{[l+2]} + a^{[l]})$

[He et al., 2015. Deep Residual Learning for Image Recognition]
Skip Connection
• The advantage of adding this type of skip connection is
that if any layer hurt the performance of architecture
then it will be skipped by regularization.
• So, this results in training a very deep neural network
without the problems caused by vanishing/exploding
gradient.
• The authors of the paper experimented on 100-1000
layers of the CIFAR-10 dataset.
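A minimal Keras sketch of an identity residual block (the filter count, kernel size, and input shape are illustrative; batch normalization is omitted for brevity):

import tensorflow as tf

def residual_block(x, filters=64):
    # main path: two convolutions
    shortcut = x
    y = tf.keras.layers.Conv2D(filters, (3, 3), padding='same', activation='relu')(x)
    y = tf.keras.layers.Conv2D(filters, (3, 3), padding='same')(y)
    # skip connection: add the block's input back before the final nonlinearity
    y = tf.keras.layers.Add()([y, shortcut])
    return tf.keras.layers.Activation('relu')(y)

inputs = tf.keras.Input(shape=(32, 32, 64))   # channel count must match `filters` for the identity skip
outputs = residual_block(inputs)
model = tf.keras.Model(inputs, outputs)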
ResNet-34 Architecture
Inspired by VGG-19

https://youtu.be/o_3mboe1jYI?si=ceyId1ui7Lq0w1eF
ResNet Hand Calculation
Let's consider a simple feedforward neural network with
three layers: an input layer, a hidden layer, and an output
layer. For simplicity, we'll assume each layer has only one
neuron.

1. Input Layer: x1 = 3
2. Hidden Layer: h1 = f(w1 . x1 + b1)
3. Output Layer: y1 = g(w2 . h1 + b2)

Let's define f(x) = x and g(x) = x as identity activation


functions for simplicity.

Now, let's add a skip connection from the input layer to the
output layer. The output of the network becomes:
ResNet Hand Calculation

y1 = g(w2 . h1 + b2 + x1)

For numerical values:

w1 = 0.5
w2 = 2
b1 = 1
b2 = 0
ResNet Hand Calculation
With these values, let's calculate the output y1 :

1. Calculate the hidden layer output:


h1 = f(w1 . x1 + b1) = f(0.5 * 3 + 1) = f(2.5) = 2.5

2. Calculate the output with skip connection:


y1 = g(w2 . h1 + b2 + x1) = g(2 * 2.5 + 0 + 3) = g(8) = 8

So, with the skip connection, the output y1 is 8. Without the


skip connection, it would have been just g(5) = 5. This
demonstrates how skip connections can help in
propagating the input information directly to the deeper
layers, aiding in better gradient flow and potentially
preventing the vanishing gradient problem.
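The same toy calculation in a few lines of Python (identity activations and the values given above):

def f(x): return x  # identity activation for the hidden layer
def g(x): return x  # identity activation for the output layer

x1, w1, w2, b1, b2 = 3, 0.5, 2, 1, 0

h1 = f(w1 * x1 + b1)            # 2.5
y_plain = g(w2 * h1 + b2)       # 5: without the skip connection
y_skip = g(w2 * h1 + b2 + x1)   # 8: with the skip connection adding x1
print(h1, y_plain, y_skip)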
