CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION
Artificial Neural Networks (ANNs) are computing techniques inspired by the operation of biological nervous systems such as the human brain. A crucial component of an ANN is its large number of interconnected computational nodes (neurons), which collaborate to learn from inputs and optimise the final output.
Inputs, usually in the form of multidimensional vectors, are loaded into the input layer and then distributed to the hidden layers. During the learning process, each hidden layer considers the decisions of the previous layers and judges whether stochastic changes to itself make the output worse or better. Deep learning is the term used to describe the stacking of many hidden layers.
An ANN is made up of many layers of neurons. Data is fed into the first layer, and
output is produced by the final layer. One or more hidden layers sit in between, processing
the input data using weights and biases that are changed during training to increase the
network's accuracy.
Two important learning paradigms for tasks requiring image processing are supervised learning and unsupervised learning. Supervised learning refers to learning from pre-classified inputs: each training example consists of a set (vector) of input values together with a predefined output value.
By comparing the computed outputs of the training samples against their known output values, this training method tries to reduce the overall classification error of the model. Unlike supervised learning, unsupervised learning has no labels in the training set; network success is usually judged by whether the network can reduce a suitable cost function.
The most significant difference between CNNs and traditional ANNs is that CNNs are designed primarily for image pattern recognition. This makes it possible to add image-specific design components, increases the network's suitability for image-related tasks, and reduces the number of parameters needed to set up the model.
A major problem with traditional ANN models is that they often struggle with the computational complexity required to process images. The MNIST database of handwritten digits is one of the most widely used standard machine learning datasets, and its relatively small 28×28 image size makes it suitable for most ANN variants. A neuron in the first hidden layer then has 784 weights (28×28×1; note that MNIST contains grayscale values only), which is manageable for many types of artificial neural networks.
For a larger 64×64 colour image input, the number of weights for a single first-layer neuron increases significantly to 12,288 (64×64×3). The drawback of employing such a model is that handling this input scale requires a much larger network than the one used to identify the grayscale MNIST digits.
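The weight counts quoted above follow directly from the image dimensions. A minimal sketch of the arithmetic (the function name is illustrative, not from the text):

```python
# Weights feeding ONE neuron in the first fully connected hidden layer.
# These shapes follow the MNIST and 64x64 colour examples above.
def weights_per_neuron(height, width, channels):
    return height * width * channels

mnist = weights_per_neuron(28, 28, 1)    # grayscale MNIST digit
colour = weights_per_neuron(64, 64, 3)   # small colour image

print(mnist)   # 784
print(colour)  # 12288
```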
Although many other types of neural networks are used in deep learning, CNNs are an especially effective network design for object recognition and identification. This makes them ideal for computer vision tasks and applications that require accurate object recognition, such as self-driving cars and facial recognition systems.
CNNs can process both time-series and image data. They are particularly useful for image tasks such as pattern recognition, object classification, and image identification, analysing images for patterns using linear-algebra operations such as matrix multiplication. CNNs can also classify audio and extract salient information from it.
CNNs have been developed for a variety of tasks, including image recognition and
analysis, but they also have many other applications, such as image classification, natural
language processing, drug discovery, and risk assessment. CNNs are useful for depth
estimation in autonomous vehicles. Applications include speech processing for virtual
assistants, facial recognition for social media, retail, healthcare, automotive and law
enforcement.
MAX POOLING: Max pooling is the most widely used kind of pooling procedure. Patches are extracted from the input feature map, and the highest value within each patch is kept while all other values are discarded. Max pooling with a 2×2 filter of stride 2 is often employed in real-world settings. This results in a two-fold downsampling of the feature map's height and width at that level. The depth dimension of the feature map does not change, in contrast to height and width.
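The 2×2 stride-2 case described above can be sketched directly. A minimal pure-Python illustration on a single 4×4 feature map (the function name is my own):

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2-D feature map (list of lists):
    keep the largest value in each patch, discarding the rest."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, h - 1, 2):
        row = []
        for j in range(0, w - 1, 2):
            patch = [feature_map[i][j], feature_map[i][j + 1],
                     feature_map[i + 1][j], feature_map[i + 1][j + 1]]
            row.append(max(patch))
        pooled.append(row)
    return pooled

fmap = [[1, 3, 2, 4],
        [5, 6, 1, 0],
        [7, 2, 9, 8],
        [3, 4, 6, 5]]
# A 4x4 map is downsampled two-fold to 2x2.
print(max_pool_2x2(fmap))  # [[6, 4], [7, 9]]
```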
The filters grow in complexity with each subsequent layer, searching for features that uniquely represent the input. The output of each layer serves as input to the following layers, which work on a partially detected representation, also known as a convolved feature map. In the FC layer, the last layer, the CNN identifies the image or object it represents.
Convolution is the application of various filters to the input images. Each filter does
its job by triggering a specific part of the image, after which it sends its output to filters in
other layers. As each layer is capable of distinguishing between different features, this
process is repeated for dozens, hundreds, or thousands of layers. Eventually, after several
layers of processing all the image inputs, the CNN can detect all the objects.
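The filtering step described above can be sketched for a single small filter. A minimal pure-Python example (the function and filter are illustrative; deep-learning libraries implement "convolution" as the cross-correlation shown here):

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image and take the weighted sum
    at each position ('valid' positions only, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            s = sum(image[i + u][j + v] * kernel[u][v]
                    for u in range(kh) for v in range(kw))
            row.append(s)
        out.append(row)
    return out

# A vertical-edge filter triggers where intensity changes left to right.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
edge = [[-1, 1],
        [-1, 1]]
print(convolve2d(image, edge))  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The large responses in the middle column mark the vertical edge, exactly the "specific part of the image" a filter is said to trigger on.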
Overfitting problems can develop over time, when the NN picks up too much detail from the training data. It may also learn noise in the data, which harms performance on the test data set. In the end, such NNs are unable to distinguish the objects themselves from incidental characteristics or patterns present in the data collection. CNNs, on the other hand, mitigate this by utilising parameter sharing.
Traditional neural networks can be used for image and video processing tasks, but
they are not as effective as CNNs because they cannot take advantage of the spatial
structure of the image. CNNs have revolutionized the field of computer vision, delivering
state-of-the-art results on a variety of tasks such as image classification, object detection.
Input signals reach the processing elements via connections and connection weights.
The information stored in a neuron is essentially the weighted connections of the neuron.
An ANN has the following key characteristics:
A learning process for acquiring knowledge.
The ability to model systems with unknown input/output relationships.
A large number of interconnected processing elements, called neurons, that perform all operations.
The ability to learn, remember, and generalize from data through appropriate mappings and weight adjustments.
Convolutional Neural Network (CNN) : This is the most basic type and consists of convolutional, pooling and fully connected layers. It is widely used for image classification and recognition applications.
Deep Residual Network (ResNet) : ResNet is a variant of CNN that uses residual (skip) connections to pass data between some network layers. This makes it possible to build deeper networks with better performance while avoiding the vanishing gradient problem.
Inception Network : An example of a CNN that uses many parallel convolutional layers
with different filter sizes and pooling techniques is the Inception network. This allows the
network to capture features at different scales, improving network performance.
Siamese Network : A type of CNN that uses two identical sub-networks to process two
separate inputs and generate a similarity score is called a Siamese network. They are
commonly used for tasks such as face recognition and image matching.
These are just a few of the many different kinds of CNNs created for different
purposes. The exact task and type of data you enter will determine which network
architecture to use.
CHAPTER 3
CNN ARCHITECTURE
3.1 INTRODUCTION TO CNN ARCHITECTURE
CNN design exploits the fact that the input consists mostly of visual data. For this reason, the architecture is arranged to best meet the needs of handling that type of data.
A major difference is that a CNN's layers consist of neurons arranged in the three spatial dimensions of the input: height, width, and depth. Here depth refers to the third dimension of an activation volume, not to the total number of layers in the network, as it would in an ANN. Unlike in standard ANNs, the neurons in each layer connect to only a small region of the previous layer.
In practice, this means that for the earlier example the input volume has dimensions 64×64×3 (height, width, depth), and the final output layer has dimensions 1×1×n (where n is the number of possible classes). The full set of class scores is thus compressed and reduced along the depth dimension.
A CNN consists of three types of layers: convolutional layers, pooling layers and fully connected layers. When these layers are stacked, a CNN architecture is formed.
There are four main areas in which the basic functionality of an exemplary CNN can be
decomposed.
1. Similar to other kinds of ANNs, the input layer records the image's pixel values.
2. The convolutional layer computes the scalar product between the weights of each neuron and the local region of the input volume that the neuron is connected to, producing that neuron's output. The rectified linear unit (ReLU) is then used to apply an element-by-element activation function (alternatives include the sigmoid) to the output of the preceding layer's activation.
3. The pooling layer only uses downsampling along the spatial dimension of the input, thus
reducing the number of parameters in this activation.
4. Next, a fully connected layer attempts to produce class scores from the activations, to be utilised for classification, performing the same role as in a conventional ANN. In order to boost performance, it is also suggested to apply ReLU between these layers.
Utilising convolution and downsampling methods, this straightforward transformation
methodology enables CNNs to alter the initial input layer by layer in order to obtain class
values for classification and regression applications.
ACTIVATION FUNCTION : Often the last fully connected layer uses a different activation function from the previous layers. The appropriate activation function must be selected for each task. For multiclass classification problems, the softmax function is used: it maps the raw output values of the fully connected layer to probabilities of the intended classes. Each value ranges from 0 to 1, and they all add up to 1.
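The two softmax properties just stated, values in [0, 1] that sum to 1, are easy to verify. A minimal sketch (the max-subtraction is a standard numerical-stability trick, not mentioned in the text):

```python
import math

def softmax(scores):
    """Map raw class scores to probabilities in [0, 1] that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # highest score gets highest probability
print(round(sum(probs), 6))          # 1.0
```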
DROPOUT LAYERS : A dropout layer is a mask that cancels the contribution of some neurons to the next layer while leaving everything else intact. Applied to the input vector, it cancels some of its properties; applied to a hidden layer, it removes some hidden neurons. Dropout layers are essential for training a CNN because they prevent overfitting on the training data. Without them, the first batches of training data influence learning disproportionately, and characteristics that appear only in later samples or batches would be missed.
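The masking behaviour described above can be sketched in a few lines. This shows "inverted dropout", a common variant; the rescaling of surviving activations is an assumption of this variant, not something stated in the text:

```python
import random

def dropout(activations, rate, training=True):
    """Zero each activation with probability `rate` during training,
    rescaling survivors so the expected total is unchanged (inverted dropout).
    At inference time the mask is disabled and inputs pass through intact."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if random.random() < keep else 0.0 for a in activations]

random.seed(0)
print(dropout([0.5, 1.2, 0.8, 0.3], rate=0.5))  # some entries zeroed
print(dropout([0.5, 1.2, 0.8, 0.3], rate=0.5, training=False))  # unchanged
```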
CHAPTER 4
TRAINING PROCESS
The training process mainly includes the following steps:
Parameter Initialization
Optimizer Selection
Regularization of CNN
DATA AUGMENTATION
The training data set is artificially enlarged through the practice of "data augmentation". Here, we purposefully derive one or more new data samples (new versions) from existing samples, which are subsequently employed in the training process (in the training data set alone). In certain circumstances data augmentation is crucial, since the majority of complex real-world scenarios (like medical records) only have access to small training datasets. In actual use, increasing the number of training samples may strengthen the CNN model. There are several methods for augmenting data, such as scaling, translating, adjusting contrast, rotating, mirroring, and cropping. These methods may be used alone or together to generate several new versions from a single data sample. Data augmentation may also act as a regulariser for CNN models by helping avoid overfitting, which is another argument for its usage.
The easiest way to do this is to initialize all weights to zero. However, this turns out to be a mistake: setting every layer's weights to zero causes all neurons in the network to give the same output and the same gradient in backpropagation, so all weights receive identical updates. The network learns nothing useful from this and there are no differences between neurons. To create such differences between neurons, we do not initialize all the weights with the same value.
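Breaking this symmetry usually means drawing small random values. A minimal sketch (the 0.01 scale and Gaussian draw are common conventions, assumed here rather than taken from the text):

```python
import random

def init_weights(n_inputs, n_neurons, scale=0.01):
    """Small random values break the symmetry that all-zero weights cause:
    each neuron starts with a different weight vector, so gradients differ."""
    return [[random.gauss(0.0, scale) for _ in range(n_inputs)]
            for _ in range(n_neurons)]

random.seed(1)
w = init_weights(n_inputs=4, n_neurons=3)
# With zeros every neuron would compute the same output; random rows differ.
print(w[0] != w[1])  # True
```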
Fig. 4.5: The effect of different learning rate (LR) values on the training process.
2) Mini-Batch Gradient Descent: The training instances are divided into several separate, non-overlapping small batches; each mini-batch can be thought of as gathering a small sample of the data. The parameters are updated after calculating the gradient for each mini-batch. This combines the advantages of batch gradient descent and stochastic gradient descent: more consistent convergence together with improved memory and computational efficiency. The effectiveness of CNN training models is further enhanced by several modifications to gradient-based (mostly SGD) learning algorithms, which are described in the next section.
3) Stochastic Gradient Descent: In contrast to batch gradient descent, here the parameters are updated separately for each training example. It is recommended to randomly reshuffle the training data before each training epoch. Compared to batch gradient descent, SGD converges faster, and it is beneficial for larger training data sets because it uses less memory and runs more quickly. However, the frequent updates make very erratic progress toward the solution, resulting in unpredictable convergence behaviour.
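The shuffle-split-step cycle above can be sketched on a toy one-parameter problem. A minimal illustration, fitting y = w·x with mini-batch SGD (the function, data, and hyperparameters are illustrative assumptions):

```python
import random

def sgd_fit(xs, ys, lr=0.01, batch_size=2, epochs=200, seed=0):
    """Fit y = w*x by mini-batch SGD: reshuffle each epoch, split into
    non-overlapping batches, and update after each batch's gradient."""
    rng = random.Random(seed)
    w = 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)                       # reshuffle before each epoch
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # gradient of mean squared error 0.5*(w*x - y)^2 w.r.t. w
            grad = sum((w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w

w = sgd_fit([1, 2, 3, 4], [2, 4, 6, 8])  # true slope is 2
print(round(w, 2))  # 2.0
```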
4.4.4 MOMENTUM
In neural network training, a method known as momentum adds gradients learned in earlier training steps, weighted by a variable called the momentum coefficient, to the update of the objective function in order to increase training speed and accuracy. A fundamental flaw of gradient-based learning algorithms is that they often get trapped at local minima rather than global minima; this frequently occurs when the problem's solution space is non-convex (or flat). The momentum factor's value should remain between 0 and 1, and it increases the step size of the weight update towards the minimum. With large momentum coefficients the model converges faster and local minima can be escaped; with very small momentum coefficients it converges more slowly. However, using high values for both the learning rate and the momentum factor can cause the update to jump over and miss the global minimum. If the direction of the gradient changes continuously during training, a higher momentum factor value smooths out the weight changes. The momentum factor is a hyperparameter.
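The accumulation of past gradients can be sketched in one update rule. A minimal illustration of classical momentum (names and the beta = 0.9 choice are conventional assumptions):

```python
def momentum_step(w, grad, velocity, lr=0.1, beta=0.9):
    """Classical momentum: accumulate past gradients into a velocity,
    weighted by the momentum coefficient beta, then step along it."""
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity

# Repeated gradients in the same direction build up speed:
w, v = 1.0, 0.0
for _ in range(3):
    w, v = momentum_step(w, grad=1.0, velocity=v)
print(round(w, 3))  # 0.439  (steps grew: 0.1, then 0.19, then 0.271)
```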
4.4.5 ADAGRAD
Adagrad, an adaptive learning rate method, updates each network parameter differently depending on how important it is to the task. Infrequently updated parameters receive larger updates (higher effective learning rates) and frequently updated parameters receive smaller updates (lower effective learning rates). For each training epoch (t), the learning rate of each parameter (here wij) is divided by the square root of the sum of the squared prior gradients for that parameter. Large neural networks can be trained with small training data using Adagrad, which is particularly effective at dealing with small gradients. The update process can be described mathematically as:
w_ij^t = w_ij^(t-1) - η / (√(Σ_{m=1}^{t} (δ_ij^m)²) + ε) · δ_ij^t

where w_ij^t is the weight of parameter w_ij at the current t-th training epoch, w_ij^(t-1) is its weight at the previous (t-1)-th training epoch, δ_ij^t is the local gradient of parameter w_ij at the t-th epoch, δ_ij^(t-1) is its local gradient at the (t-1)-th epoch, η is the learning rate, and ε is a very small value that avoids division by zero.
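The per-parameter update can be traced numerically for a single weight. A minimal sketch (names are illustrative), showing the effective step shrinking as the squared-gradient sum grows:

```python
import math

def adagrad_step(w, grad, grad_sq_sum, lr=0.1, eps=1e-8):
    """One Adagrad update: divide the learning rate by the square root
    of the running sum of squared past gradients for this parameter."""
    grad_sq_sum += grad ** 2
    w -= lr / (math.sqrt(grad_sq_sum) + eps) * grad
    return w, grad_sq_sum

w, s = 1.0, 0.0
for step in range(1, 4):
    w, s = adagrad_step(w, grad=1.0, grad_sq_sum=s)
    print(step, round(w, 4))  # each step is smaller than the last
```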
4.4.6 ADADELTA
AdaDelta may be thought of as an extension of AdaGrad. AdaGrad has the drawback that when a network is trained over a large number of training epochs (t), the sum of squares of all previous gradients (Σ_{m=1}^{t} (δ_ij^m)²) grows significantly, driving the learning rate towards essentially nil. Instead of utilising all the prior gradients, as AdaGrad does, the adaptive delta technique (AdaDelta) divides the learning rate of each parameter w_ij by the sum of squares of only the k most recent gradients to overcome this issue. The updating procedure for each training epoch t may be mathematically described as follows:

w_ij^t = w_ij^(t-1) - η / (√(Σ_{m=t-k+1}^{t} (δ_ij^m)²) + ε) · δ_ij^t

where w_ij^t is the parameter's weight at the current t-th training epoch, w_ij^(t-1) is its weight at the previous (t-1)-th epoch, δ_ij^t is the local gradient of parameter w_ij, and ε is a very small value that avoids division by zero.
4.4.7 RMSPROP
As mentioned in the preceding section, Root Mean Square Propagation (RMSProp) was also created to address the issue of Adagrad's fast declining learning rate. It was created by Geoffrey Hinton's team and uses an exponentially decaying moving average of past squared gradients, E[δ²], to solve Adagrad's issue. The updating procedure may be mathematically described as follows:

E[δ²]_t = γ E[δ²]_(t-1) + (1 - γ) (δ_t)²
w_t = w_(t-1) - η / √(E[δ²]_t + ε) · δ_t

Here the decay rate γ tunes the effective learning rate; Hinton recommends setting γ to 0.9, and a default initial learning rate such as 0.001 works well.
E[δ]_t is the estimate of the first moment (mean) of the gradient, and E[δ²]_t is the estimate of the non-centred variance or second moment; these are the two quantities tracked by the adaptive moment estimation (Adam) optimiser. Because both estimates are initialised to zero at the start of training, they may still lean towards zero after many iterations, particularly when (1 - β1) and (1 - β2) are extremely tiny. To solve this problem, bias-adjusted estimates are generated. The final formulae for these estimators are:

m̂_t = m_t / (1 - β1^t)
v̂_t = v_t / (1 - β2^t)

Adam is more memory efficient and requires less processing power than the other optimisers.
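The bias-corrected moment estimates discussed above can be traced for a single parameter. A minimal sketch of one Adam-style update, assuming the conventional default decay rates (function and variable names are my own):

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update using bias-corrected first and second moments."""
    m = b1 * m + (1 - b1) * grad           # first moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2      # second moment (uncentred variance)
    m_hat = m / (1 - b1 ** t)              # bias correction: m, v start at zero
    v_hat = v / (1 - b2 ** t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):                      # t starts at 1 for the correction
    w, m, v = adam_step(w, grad=0.5, m=m, v=v, t=t)
print(round(w, 4))  # 0.997
```

With a constant gradient, the corrected estimates m̂ and v̂ immediately match the true moments, so each step has the same size even though the raw m and v are still warming up from zero.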
CHAPTER 5
ADVANCEMENT IN CNN
5.1 ADVANCEMENT IN CNN ARCHITECTURE
In recent times, CNNs have been used in every field, achieving astonishing results across a variety of domains.
IMAGE CLASSIFICATION
OBJECT DETECTION
IMAGE SEGMENTATION
1) IMAGE CLASSIFICATION : The CNN model is used to categorise the input image into one of the preselected target classes, on the assumption that the image contains only one object. The following is a list of some of the most significant CNN architectures (models) created for image classification.
LeNet – 5
AlexNet
ZFNet
VGGNet
GoogLeNet
ResNet
(iv) DECONVNET : The network takes 13 convolutional layers and 2 fully connected layers from the VGG16 network, and these 15 layers are employed in the deconvolution network in hierarchically reverse order. While the convolution network employs convolution and pooling layers to extract the feature map, the deconvolution network uses deconvolution and unpooling to return the activations to their original size.
(iii) PANET : This model is based on the Feature Pyramid Network (FPN) and Mask R-CNN. Improving the flow of information across the network is a fundamental goal of PANet. To improve low-layer feature propagation, the authors combined an FPN-based feature extractor with a new, improved bottom-up path. A RoIAlign pooling layer subsamples the feature map to extract proposals from all levels of features, and an adaptive feature pooling layer processes the feature map of each stage in a fully connected layer. The network then merges all the outputs together, and the output of the feature pooling layer feeds three branches: bounding box, object class, and binary pixel-mask prediction.
(iii) TensorMask : In this approach, dense sliding windows are used in place of bounding-box detection of objects. The fundamental concept of the TensorMask architecture is to represent image masks over a set of dense sliding windows by organising them in a high-dimensional tensor. These models have two heads: one predicts object categories, and the other constructs masks in the sliding windows.
When an object's data passes through multiple levels of the CNN, the CNN also
acquires the properties of the object in subsequent rounds. This eliminates the need
for manual feature extraction (feature engineering).
The most common use of CNNs is image analysis, but they can also be used to solve
other classification and data analysis problems. As a result, it can be used in a variety
of contexts to obtain accurate results, including critical processes such as facial
recognition, image classification, road/traffic sign recognition, galaxy classification,
medical image interpretation, and diagnostic/analysis.
The way CNNs perceive images also reveals a lot about their design and execution.
Another interesting example of how artificial neural networks can improve the world is
drug discovery with convolutional neural networks.
As technology advances, CPUs and GPUs become more affordable and faster, allowing us to create larger and more efficient algorithms. Neural networks can then process more data, or process it faster, recognizing patterns from 10,000 samples instead of 1,000.
As researchers develop new designs and strategies to train CNNs more efficiently,
they become more accurate and powerful. Attention mechanisms, capsule networks, and
generative adversarial networks (GANs) are examples of recent developments that show
promise for improving the performance of CNNs.
CNNs are increasingly being used in real-time robotics, augmented reality, and self-
driving cars. As computing power continues to improve, CNNs are becoming increasingly
effective at processing large amounts of data in real time.
Custom CNN architectures are becoming increasingly popular for specialized tasks
such as medical image analysis, remote sensing, and industrial inspection. Performance can
be improved by optimizing these networks for specific types of data and tasks.
Transfer learning is gaining popularity as a strategy for reducing the amount of data required for training and improving performance on small datasets. The method involves pre-training a CNN on large datasets and then fine-tuning it for specific applications.
CNNs are becoming increasingly important for activities such as image and speech
recognition on mobile and Internet of Things (IoT) devices as edge computing becomes
more prevalent, where data is processed on local devices rather than in the cloud. Overall,
CNNs are expected to continue to play an important role in many applications such as
computer vision, natural language processing, and robotics. As research progresses, we can
expect even more powerful and sophisticated CNN architectures and techniques to emerge
in the future.
CHAPTER 6
CONCLUSION
Convolutional Neural Networks (CNNs) are a powerful class of neural networks that
excel at tasks that require image and video recognition. They have completely revolutionized the field of computer vision, as they excel at various tasks such as object identification, image classification, and segmentation. The ability of CNNs to automatically learn and extract
useful features from images and videos is one of the fundamental features of CNNs. This
capability enables CNNs to perform complex tasks that were previously difficult or
impossible with traditional computer vision approaches. Convolutional layers, pooling
layers, and nonlinear activation functions are used to achieve this. CNNs are deployed in a variety of sectors including healthcare, automotive, and retail. They are also being used in an increasing number of real-time applications such as robotics, augmented reality, and self-driving cars. As CNN research progresses, we can expect to see even more powerful and sophisticated designs and techniques in the future, such as attention mechanisms, capsule networks, and generative adversarial networks (GANs). Overall, CNNs have
revolutionized computer vision and will continue to be an important component in many
fields and applications.