CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION
Artificial Neural Networks (ANNs) are computing techniques inspired by the operation of biological nervous systems such as the human brain. A crucial component of an ANN is its large number of interconnected computational nodes (neurons), which collaborate to learn from inputs and optimise the final output.
Inputs, usually in the form of multidimensional vectors, are loaded into the input layer and then distributed to the hidden layers. During the learning process, each hidden layer considers the decisions of the previous layers and judges whether stochastic changes to itself make the output worse or better. Deep learning is the term used to describe the stacking of many hidden layers.
An ANN is made up of many layers of neurons. Data is fed into the first layer, and
output is produced by the final layer. One or more hidden layers sit in between, processing
the input data using weights and biases that are changed during training to increase the
network's accuracy.
Two important learning paradigms for tasks requiring image processing are supervised learning and unsupervised learning. Supervised learning refers to learning from pre-classified inputs: each training example consists of a set (vector) of input values together with a predefined output value.
By comparing the computed outputs of the training samples against their known output values, this training method tries to reduce the overall classification error of the model. Unlike supervised learning, unsupervised learning has no labels in the training set; network success is usually judged by whether the network can reduce a suitable cost function.
The most significant difference between CNNs and traditional ANNs is that CNNs are designed primarily for image pattern recognition. This makes it possible to add image-specific design components, increases the network's suitability for image-related tasks, and reduces the number of parameters needed to set up the model.
A major problem with traditional ANN models is that they often struggle with the computational complexity required to process images. The MNIST database of handwritten digits is one of the most widely used standard machine learning datasets, and its relatively small 28×28 image size makes it suitable for most ANN variants. A neuron in the first hidden layer then has 784 weights (28×28×1; note that MNIST contains grayscale values only), which is manageable for many types of artificial neural networks.
For a larger 64×64 colour image input, the number of weights for a single first-layer neuron increases significantly to 12,288 (64×64×3). The drawback of employing such a model is that handling this input scale requires a much larger network than the one used to identify the grayscale MNIST digits.
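The weight counts quoted above follow directly from the image dimensions. A minimal sketch of the arithmetic (the function name is illustrative, not from the text):

```python
# Weights feeding ONE neuron in the first fully connected hidden layer.
# These shapes follow the MNIST and 64x64 colour examples above.
def weights_per_neuron(height, width, channels):
    return height * width * channels

mnist = weights_per_neuron(28, 28, 1)    # grayscale MNIST digit
colour = weights_per_neuron(64, 64, 3)   # small colour image

print(mnist)   # 784
print(colour)  # 12288
```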
Although many other types of neural networks are used in deep learning, CNNs are an especially effective network design for object recognition and identification. This makes them ideal for computer vision tasks and applications that require accurate object recognition, such as self-driving cars and facial recognition systems.
CNNs can process both time-series and image data. They are particularly useful for image tasks such as pattern recognition, object classification, and image identification, analysing images for patterns using linear-algebra operations such as matrix multiplication. CNNs can also classify audio and extract salient information from it.
CNNs have been developed for a variety of tasks, including image recognition and
analysis, but they also have many other applications, such as image classification, natural
language processing, drug discovery, and risk assessment. CNNs are useful for depth
estimation in autonomous vehicles. Applications include speech processing for virtual
assistants, facial recognition for social media, retail, healthcare, automotive and law
enforcement.
MAX POOLING: Max pooling is the most widely used kind of pooling procedure. Patches are extracted from the input feature map, and the highest value within each patch is kept while all other values are discarded. Max pooling with a 2×2 filter of stride 2 is often employed in real-world settings. This results in a two-fold downsampling of the feature map's height and width at that level. The depth dimension of the feature map does not change, in contrast to height and width.
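The 2×2 stride-2 case described above can be sketched directly. A minimal pure-Python illustration on a single 4×4 feature map (the function name is my own):

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2-D feature map (list of lists):
    keep the largest value in each patch, discarding the rest."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, h - 1, 2):
        row = []
        for j in range(0, w - 1, 2):
            patch = [feature_map[i][j], feature_map[i][j + 1],
                     feature_map[i + 1][j], feature_map[i + 1][j + 1]]
            row.append(max(patch))
        pooled.append(row)
    return pooled

fmap = [[1, 3, 2, 4],
        [5, 6, 1, 0],
        [7, 2, 9, 8],
        [3, 4, 6, 5]]
# A 4x4 map is downsampled two-fold to 2x2.
print(max_pool_2x2(fmap))  # [[6, 4], [7, 9]]
```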
The filters grow in complexity with each subsequent layer, searching for features that uniquely represent the input. The output of each layer serves as input to the following layers, which work on a partially detected representation, also known as a convolved feature map. In the FC layer, the last layer, the CNN identifies the image or object it represents.
Convolution is the application of various filters to the input images. Each filter does
its job by triggering a specific part of the image, after which it sends its output to filters in
other layers. As each layer is capable of distinguishing between different features, this
process is repeated for dozens, hundreds, or thousands of layers. Eventually, after several
layers of processing all the image inputs, the CNN can detect all the objects.
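The filtering step described above can be sketched for a single small filter. A minimal pure-Python example (the function and filter are illustrative; deep-learning libraries implement "convolution" as the cross-correlation shown here):

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image and take the weighted sum
    at each position ('valid' positions only, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            s = sum(image[i + u][j + v] * kernel[u][v]
                    for u in range(kh) for v in range(kw))
            row.append(s)
        out.append(row)
    return out

# A vertical-edge filter triggers where intensity changes left to right.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
edge = [[-1, 1],
        [-1, 1]]
print(convolve2d(image, edge))  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The large responses in the middle column mark the vertical edge, exactly the "specific part of the image" a filter is said to trigger on.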
Overfitting problems can develop over time, when the NN picks up too much detail from the training data. It may also learn noise in the data, which harms performance on the test data set. In the end, such NNs are unable to distinguish the objects themselves from incidental characteristics or patterns present in the data collection. CNNs, on the other hand, mitigate this by utilising parameter sharing.
Traditional neural networks can be used for image and video processing tasks, but
they are not as effective as CNNs because they cannot take advantage of the spatial
structure of the image. CNNs have revolutionized the field of computer vision, delivering
state-of-the-art results on a variety of tasks such as image classification, object detection.
Input signals reach the processing elements via connections and connection weights.
The information stored in a neuron is essentially the weighted connections of the neuron.
An ANN has the following key characteristics:
A learning process for acquiring knowledge.
The ability to model systems with unknown input/output relationships.
A large number of interconnected processing elements, called neurons, that perform all operations.
The ability to learn, remember, and generalize from data through appropriate mappings and weight adjustments.
Convolutional Neural Network (CNN) : This is the most basic type and consists of convolutional, pooling and fully connected layers. It is widely used for image classification and recognition applications.
Deep Residual Network (ResNet) : ResNet is a variant of CNN that uses residual (skip) connections to pass data between some network layers. This makes it possible to build deeper networks with better performance while avoiding the vanishing gradient problem.
Inception Network : An example of a CNN that uses many parallel convolutional layers
with different filter sizes and pooling techniques is the Inception network. This allows the
network to capture features at different scales, improving network performance.
Siamese Network : A type of CNN that uses two identical sub-networks to process two
separate inputs and generate a similarity score is called a Siamese network. They are
commonly used for tasks such as face recognition and image matching.
These are just a few of the many different kinds of CNNs created for different
purposes. The exact task and type of data you enter will determine which network
architecture to use.
CHAPTER 3
CNN ARCHITECTURE
3.1 INTRODUCTION TO CNN ARCHITECTURE
CNN design exploits the fact that the input consists mostly of visual data. For this reason, the architecture is arranged to best meet the needs of handling that type of data.
A major difference is that a CNN's layers consist of neurons arranged in the three spatial dimensions of the input: height, width, and depth. Here depth refers to the third dimension of an activation volume, not to the total number of layers in the network, as it would in an ANN. Unlike in standard ANNs, the neurons in each layer connect to only a small region of the previous layer.
In practice, this means that for the earlier example the input volume has dimensions 64×64×3 (height, width, depth), and the final output layer has dimensions 1×1×n (where n is the number of possible classes). The full set of class scores is thus compressed and reduced along the depth dimension.
A CNN consists of three types of layers: convolutional layers, pooling layers and fully connected layers. When these layers are stacked, a CNN architecture is formed.
There are four main areas in which the basic functionality of an exemplary CNN can be
decomposed.
1. Similar to other kinds of ANNs, the input layer records the image's pixel values.
2. The convolutional layer computes the scalar product between the weights of each neuron and the local region of the input volume that the neuron is connected to, producing that neuron's output. The rectified linear unit (ReLU) is then used to apply an element-by-element activation function (alternatives include the sigmoid) to the output of the preceding layer's activation.
3. The pooling layer only uses downsampling along the spatial dimension of the input, thus
reducing the number of parameters in this activation.
4. Next, a fully connected layer attempts to produce class scores from the activations, to be utilised for classification, performing the same role as in a conventional ANN. In order to boost performance, it is also suggested to apply ReLU between these layers.
Utilising convolution and downsampling methods, this straightforward transformation
methodology enables CNNs to alter the initial input layer by layer in order to obtain class
values for classification and regression applications.
ACTIVATION FUNCTION : Often the last fully connected layer uses a different activation function from the previous layers. The appropriate activation function must be selected for each task. For multiclass classification problems, the softmax function is used: it maps the raw output values of the fully connected layer to probabilities of the intended classes. Each value ranges from 0 to 1, and they all add up to 1.
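The two softmax properties just stated, values in [0, 1] that sum to 1, are easy to verify. A minimal sketch (the max-subtraction is a standard numerical-stability trick, not mentioned in the text):

```python
import math

def softmax(scores):
    """Map raw class scores to probabilities in [0, 1] that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # highest score gets highest probability
print(round(sum(probs), 6))          # 1.0
```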
DROPOUT LAYERS : A dropout layer is a mask that cancels the contribution of some neurons to the next layer while leaving everything else intact. Applied to the input vector, it cancels some of its properties; applied to a hidden layer, it removes some hidden neurons. Dropout layers are essential for training a CNN because they prevent overfitting on the training data. Without them, the first batches of training data influence learning disproportionately, and characteristics that appear only in later samples or batches would be missed.
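The masking behaviour described above can be sketched in a few lines. This shows "inverted dropout", a common variant; the rescaling of surviving activations is an assumption of this variant, not something stated in the text:

```python
import random

def dropout(activations, rate, training=True):
    """Zero each activation with probability `rate` during training,
    rescaling survivors so the expected total is unchanged (inverted dropout).
    At inference time the mask is disabled and inputs pass through intact."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if random.random() < keep else 0.0 for a in activations]

random.seed(0)
print(dropout([0.5, 1.2, 0.8, 0.3], rate=0.5))  # some entries zeroed
print(dropout([0.5, 1.2, 0.8, 0.3], rate=0.5, training=False))  # unchanged
```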
CHAPTER 4
TRAINING PROCESS
The training process mainly includes the following steps:
Parameter Initialization
Optimizer Selection
Regularization of CNN
DATA AUGMENTATION
The training data set is artificially enlarged through the practice of "data augmentation". Here, we purposefully derive one or more new data samples (new versions) from existing samples, which are subsequently employed in the training process (in the training data set alone). In certain circumstances data augmentation is crucial, since the majority of complex real-world scenarios (like medical records) only have access to small training datasets. In actual use, increasing the number of training samples may strengthen the CNN model. There are several methods for augmenting data, such as scaling, translating, adjusting contrast, rotating, mirroring, and cropping. These methods may be used alone or together to generate several new versions from a single data sample. Data augmentation may also act as a regulariser for CNN models by helping avoid overfitting, which is another argument for its usage.
The easiest way to do this is to initialize all weights to zero. However, this turns out to be a mistake: setting every layer's weights to zero causes all neurons in the network to give the same output and the same gradient in backpropagation, so all weights receive identical updates. The network learns nothing useful from this and there are no differences between neurons. To create such differences between neurons, we do not initialize all the weights with the same value.
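Breaking this symmetry usually means drawing small random values. A minimal sketch (the 0.01 scale and Gaussian draw are common conventions, assumed here rather than taken from the text):

```python
import random

def init_weights(n_inputs, n_neurons, scale=0.01):
    """Small random values break the symmetry that all-zero weights cause:
    each neuron starts with a different weight vector, so gradients differ."""
    return [[random.gauss(0.0, scale) for _ in range(n_inputs)]
            for _ in range(n_neurons)]

random.seed(1)
w = init_weights(n_inputs=4, n_neurons=3)
# With zeros every neuron would compute the same output; random rows differ.
print(w[0] != w[1])  # True
```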
Fig. 4.5: The effect of different learning rate (LR) values on the training process.
2) Mini-Batch Gradient Descent: The training instances are divided into several separate, non-overlapping small batches; each mini-batch can be thought of as gathering a small sample of the data. The parameters are updated after calculating the gradient for each mini-batch. This combines the advantages of batch gradient descent and stochastic gradient descent: more consistent convergence together with improved memory and computational efficiency. The effectiveness of CNN training models is further enhanced by several modifications to gradient-based (mostly SGD) learning algorithms, which are described in the next section.
3) Stochastic Gradient Descent: In contrast to batch gradient descent, here the parameters are updated separately for each training example. It is recommended to randomly reshuffle the training data before each training epoch. Compared to batch gradient descent, SGD converges faster, and it is beneficial for larger training data sets because it uses less memory and runs more quickly. However, the frequent updates make very erratic progress toward the solution, resulting in unpredictable convergence behaviour.
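The shuffle-split-step cycle above can be sketched on a toy one-parameter problem. A minimal illustration, fitting y = w·x with mini-batch SGD (the function, data, and hyperparameters are illustrative assumptions):

```python
import random

def sgd_fit(xs, ys, lr=0.01, batch_size=2, epochs=200, seed=0):
    """Fit y = w*x by mini-batch SGD: reshuffle each epoch, split into
    non-overlapping batches, and update after each batch's gradient."""
    rng = random.Random(seed)
    w = 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)                       # reshuffle before each epoch
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # gradient of mean squared error 0.5*(w*x - y)^2 w.r.t. w
            grad = sum((w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w

w = sgd_fit([1, 2, 3, 4], [2, 4, 6, 8])  # true slope is 2
print(round(w, 2))  # 2.0
```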
4.4.4 MOMENTUM
In neural network training, a method known as momentum adds gradients learned in earlier training steps, weighted by a variable called the momentum coefficient, to the update of the objective function in order to increase training speed and accuracy. A fundamental flaw of gradient-based learning algorithms is that they often get trapped at local minima rather than global minima; this frequently occurs when the problem's solution space is non-convex (or flat). The momentum factor's value should remain between 0 and 1, and it increases the step size of the weight update towards the minimum. With large momentum coefficients the model converges faster and local minima can be escaped; with very small momentum coefficients it converges more slowly. However, using high values for both the learning rate and the momentum factor can cause the update to jump over and miss the global minimum. If the direction of the gradient changes continuously during training, a higher momentum factor value smooths out the weight changes. The momentum factor is a hyperparameter.
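The accumulation of past gradients can be sketched in one update rule. A minimal illustration of classical momentum (names and the beta = 0.9 choice are conventional assumptions):

```python
def momentum_step(w, grad, velocity, lr=0.1, beta=0.9):
    """Classical momentum: accumulate past gradients into a velocity,
    weighted by the momentum coefficient beta, then step along it."""
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity

# Repeated gradients in the same direction build up speed:
w, v = 1.0, 0.0
for _ in range(3):
    w, v = momentum_step(w, grad=1.0, velocity=v)
print(round(w, 3))  # 0.439  (steps grew: 0.1, then 0.19, then 0.271)
```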
4.4.5 ADAGRAD
Adagrad, an adaptive learning rate method, updates each network parameter differently depending on how important it is to the task. Infrequently updated parameters receive larger updates (higher effective learning rates) and frequently updated parameters receive smaller updates (lower effective learning rates). For each training epoch (t), the learning rate of each parameter (here wij) is divided by the square root of the sum of the squared prior gradients for that parameter. Large neural networks can be trained with small training data using Adagrad, which is particularly effective at dealing with small gradients. The update process can be described mathematically as:
w_ij^t = w_ij^(t-1) - η / (√(Σ_{m=1}^{t} (δ_ij^m)²) + ε) · δ_ij^t

where w_ij^t is the weight of parameter w_ij at the current t-th training epoch, w_ij^(t-1) is its weight at the previous (t-1)-th training epoch, δ_ij^t is the local gradient of parameter w_ij at the t-th epoch, δ_ij^(t-1) is its local gradient at the (t-1)-th epoch, η is the learning rate, and ε is a very small value that avoids division by zero.
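The per-parameter update can be traced numerically for a single weight. A minimal sketch (names are illustrative), showing the effective step shrinking as the squared-gradient sum grows:

```python
import math

def adagrad_step(w, grad, grad_sq_sum, lr=0.1, eps=1e-8):
    """One Adagrad update: divide the learning rate by the square root
    of the running sum of squared past gradients for this parameter."""
    grad_sq_sum += grad ** 2
    w -= lr / (math.sqrt(grad_sq_sum) + eps) * grad
    return w, grad_sq_sum

w, s = 1.0, 0.0
for step in range(1, 4):
    w, s = adagrad_step(w, grad=1.0, grad_sq_sum=s)
    print(step, round(w, 4))  # each step is smaller than the last
```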
4.4.6 ADADELTA
AdaDelta may be thought of as an extension of AdaGrad. AdaGrad has the drawback that when a network is trained over a large number of training epochs (t), the sum of squares of all previous gradients (Σ_{m=1}^{t} (δ_ij^m)²) grows significantly, driving the learning rate towards essentially nil. Instead of utilising all the prior gradients, as AdaGrad does, the adaptive delta technique (AdaDelta) divides the learning rate of each parameter w_ij by the sum of squares of only the k most recent gradients to overcome this issue. The updating procedure for each training epoch t may be mathematically described as follows:

w_ij^t = w_ij^(t-1) - η / (√(Σ_{m=t-k+1}^{t} (δ_ij^m)²) + ε) · δ_ij^t

where w_ij^t is the parameter's weight at the current t-th training epoch, w_ij^(t-1) is its weight at the previous (t-1)-th epoch, δ_ij^t is the local gradient of parameter w_ij, and ε is a very small value that avoids division by zero.
4.4.7 RMSPROP
As mentioned in the preceding section, Root Mean Square Propagation (RMSProp) was also created to address the issue of Adagrad's fast declining learning rate. It was created by Geoffrey Hinton's team and uses an exponentially decaying moving average of past squared gradients, E[δ²], to solve Adagrad's issue. The updating procedure may be mathematically described as follows:

E[δ²]_t = γ E[δ²]_(t-1) + (1 - γ) (δ_t)²
w_t = w_(t-1) - η / √(E[δ²]_t + ε) · δ_t

Here the decay rate γ tunes the effective learning rate; Hinton recommends setting γ to 0.9, and a default initial learning rate such as 0.001 works well.
E[δ]_t is the estimate of the first moment (mean) of the gradient, and E[δ²]_t is the estimate of the non-centred variance or second moment; these are the two quantities tracked by the adaptive moment estimation (Adam) optimiser. Because both estimates are initialised to zero at the start of training, they may still lean towards zero after many iterations, particularly when (1 - β1) and (1 - β2) are extremely tiny. To solve this problem, bias-adjusted estimates are generated. The final formulae for these estimators are:

m̂_t = m_t / (1 - β1^t)
v̂_t = v_t / (1 - β2^t)

Adam is more memory efficient and requires less processing power than the other optimisers.
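The bias-corrected moment estimates discussed above can be traced for a single parameter. A minimal sketch of one Adam-style update, assuming the conventional default decay rates (function and variable names are my own):

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update using bias-corrected first and second moments."""
    m = b1 * m + (1 - b1) * grad           # first moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2      # second moment (uncentred variance)
    m_hat = m / (1 - b1 ** t)              # bias correction: m, v start at zero
    v_hat = v / (1 - b2 ** t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):                      # t starts at 1 for the correction
    w, m, v = adam_step(w, grad=0.5, m=m, v=v, t=t)
print(round(w, 4))  # 0.997
```

With a constant gradient, the corrected estimates m̂ and v̂ immediately match the true moments, so each step has the same size even though the raw m and v are still warming up from zero.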
CHAPTER 5
ADVANCEMENT IN CNN
5.1 ADVANCEMENT IN CNN ARCHITECTURE
In recent times, CNNs have been used in every field, achieving astonishing results across a variety of domains.
IMAGE CLASSIFICATION
OBJECT DETECTION
IMAGE SEGMENTATION
1) IMAGE CLASSIFICATION : The CNN model is used to categorise the input image into one of the preselected target classes, on the assumption that the image contains only one object. The following is a list of some of the most significant CNN architectures (models) created for image classification.
LeNet – 5
AlexNet
ZFNet
VGGNet
GoogLeNet
ResNet
(iv) DECONVNET : The network takes 13 convolutional layers and 2 fully connected layers from the VGG16 network, and these 15 layers are employed in the deconvolution network in hierarchically reverse order. While the convolution network employs convolution and pooling layers to extract the feature map, the deconvolution network uses deconvolution and unpooling to return the activations to their original size.
(iii) PANET : This model is based on the Feature Pyramid Network (FPN) and Mask R-CNN. Improving the flow of information across the network is a fundamental goal of PANet. To improve low-layer feature propagation, the authors combined an FPN-based feature extractor with a new, improved bottom-up path. A RoIAlign pooling layer subsamples the feature map to extract proposals from all levels of features, and an adaptive feature pooling layer processes the feature map of each stage in a fully connected layer. The network then merges all the outputs together, and the output of the feature pooling layer feeds three branches: bounding box, object class, and binary pixel-mask prediction.
(iii) TensorMask : In this approach, dense sliding windows are used in place of bounding-box detection of objects. The fundamental concept of the TensorMask architecture is to represent image masks over a set of dense sliding windows by organising them in a high-dimensional tensor. These models have two heads: one predicts object categories, and the other constructs masks in the sliding windows.
When an object's data passes through multiple levels of the CNN, the CNN also
acquires the properties of the object in subsequent rounds. This eliminates the need
for manual feature extraction (feature engineering).
The most common use of CNNs is image analysis, but they can also be used to solve
other classification and data analysis problems. As a result, it can be used in a variety
of contexts to obtain accurate results, including critical processes such as facial
recognition, image classification, road/traffic sign recognition, galaxy classification,
medical image interpretation, and diagnostic/analysis.
The way CNNs perceive images also reveals a lot about their design and execution.
Another interesting example of how artificial neural networks can improve the world is
drug discovery with convolutional neural networks.
As technology advances, CPUs and GPUs become more affordable and faster, allowing us to create larger and more efficient algorithms. Neural networks can then process more data, or process it faster, recognizing patterns from 10,000 samples instead of 1,000.
As researchers develop new designs and strategies to train CNNs more efficiently,
they become more accurate and powerful. Attention mechanisms, capsule networks, and
generative adversarial networks (GANs) are examples of recent developments that show
promise for improving the performance of CNNs.
CNNs are increasingly being used in real-time robotics, augmented reality, and self-
driving cars. As computing power continues to improve, CNNs are becoming increasingly
effective at processing large amounts of data in real time.
Custom CNN architectures are becoming increasingly popular for specialized tasks
such as medical image analysis, remote sensing, and industrial inspection. Performance can
be improved by optimizing these networks for specific types of data and tasks.
Transfer learning is gaining popularity as a strategy for reducing the amount of data required for training and improving performance on small datasets. The method involves pre-training a CNN on large datasets and then fine-tuning it for specific applications.
CNNs are becoming increasingly important for activities such as image and speech
recognition on mobile and Internet of Things (IoT) devices as edge computing becomes
more prevalent, where data is processed on local devices rather than in the cloud. Overall,
CNNs are expected to continue to play an important role in many applications such as
computer vision, natural language processing, and robotics. As research progresses, we can
expect even more powerful and sophisticated CNN architectures and techniques to emerge
in the future.
CHAPTER 6
CONCLUSION
Convolutional Neural Networks (CNNs) are a powerful class of neural networks that
excel at tasks that require image and video recognition. They have completely revolutionized the field of computer vision, as they excel at various tasks such as object identification, image classification, and segmentation. The ability of CNNs to automatically learn and extract
useful features from images and videos is one of the fundamental features of CNNs. This
capability enables CNNs to perform complex tasks that were previously difficult or
impossible with traditional computer vision approaches. Convolutional layers, pooling
layers, and nonlinear activation functions are used to achieve this. CNNs are deployed in a variety of sectors including healthcare, automotive, and retail. They are also being used in an increasing number of real-time applications such as robotics, augmented reality, and self-driving cars. As CNN research progresses, we can expect to see even more powerful and sophisticated designs and techniques in the future, such as attention mechanisms, capsule networks, and generative adversarial networks (GANs). Overall, CNNs have
revolutionized computer vision and will continue to be an important component in many
fields and applications.