
UNIT – IV

Convolutional Networks
A Convolutional Neural Network (CNN) is a type of deep learning neural network
architecture commonly used in computer vision. Computer vision is a field of artificial
intelligence that enables a computer to understand and interpret images and other visual data.
Artificial neural networks perform well on many machine learning tasks and are applied to
data such as images, audio, and text. Different types of neural networks suit different
purposes: for predicting a sequence of words we use recurrent neural networks (more
precisely, an LSTM), while for image classification we use convolutional neural networks.
This unit introduces the basic building blocks of a CNN.
In a regular Neural Network there are three types of layers:

1. Input Layer: This is the layer through which we give input to our model. The number of
neurons in this layer equals the total number of features in our data (the number of pixels
in the case of an image).
2. Hidden Layer: The output of the input layer is fed into the hidden layer. There can
be many hidden layers depending on the model and the size of the data. Each hidden layer
can have a different number of neurons; the number is a design choice.
The output of each layer is computed by multiplying the output of the previous layer by
the learnable weight matrix of that layer, adding the learnable biases, and then applying
an activation function, which makes the network nonlinear.
3. Output Layer: The output of the last hidden layer is fed into a logistic function such
as sigmoid or softmax, which converts the score of each class into a probability for that
class.
Feeding the data into the model and computing the output of each layer as described above
is called the feedforward pass. We then calculate the error using an error function; common
choices are cross-entropy and squared error. The error function measures how well the
network is performing. After that, we propagate the error backwards through the model by
calculating derivatives. This step is called backpropagation and is used to minimize the loss.
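
The feedforward pass and the error calculation described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the layer sizes, the ReLU hidden activation, and the names `forward` and `cross_entropy` are assumptions made for the example, not part of the notes.

```python
import numpy as np

def relu(z):
    # Element-wise nonlinearity used in the hidden layer (an assumed choice).
    return np.maximum(0.0, z)

def softmax(z):
    # Subtract the row maximum for numerical stability before exponentiating.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, params):
    """x: (batch, features); params: list of (W, b) pairs, one per layer."""
    h = x
    for W, b in params[:-1]:
        h = relu(h @ W + b)           # matrix multiply, add bias, activate
    W_out, b_out = params[-1]
    return softmax(h @ W_out + b_out)  # class probabilities

def cross_entropy(probs, labels):
    # Mean negative log-probability assigned to the correct class.
    n = probs.shape[0]
    return -np.log(probs[np.arange(n), labels] + 1e-12).mean()

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                      # 4 samples, 8 features
params = [
    (rng.normal(scale=0.1, size=(8, 16)), np.zeros(16)),  # hidden layer
    (rng.normal(scale=0.1, size=(16, 3)), np.zeros(3)),   # output layer, 3 classes
]
labels = np.array([0, 2, 1, 0])
probs = forward(x, params)
loss = cross_entropy(probs, labels)
```

Backpropagation would then compute the derivative of `loss` with respect to every `W` and `b` and adjust them to reduce the loss.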
Convolutional Neural Network
A Convolutional Neural Network (CNN) is an extension of the artificial neural network
(ANN) that is predominantly used to extract features from grid-like data, for example
visual datasets such as images or videos, where spatial patterns play an important role.

CNN architecture

A convolutional neural network consists of multiple layers: the input layer, convolutional
layers, pooling layers, and fully connected layers.

Simple CNN architecture

The convolutional layer applies filters to the input image to extract features, the pooling
layer downsamples the feature maps to reduce computation, and the fully connected layer makes
the final prediction. The network learns the filters through backpropagation and gradient
descent.

How Convolutional Layers Work

Convolutional neural networks, or ConvNets, are neural networks that share their parameters
across spatial positions. Imagine you have an image. It can be represented as a cuboid with
a width and height (the spatial dimensions of the image) and a depth (the channels, as
images generally have red, green, and blue channels).

Now imagine taking a small patch of this image and running a small neural network, called a
filter or kernel, on it, with, say, K outputs arranged along the depth axis. Now slide that
neural network across the whole image. As a result, we get another image-like volume with a
different width, height, and depth. Instead of just R, G, and B channels we now have more
channels but a smaller width and height. This operation is called convolution. If the patch
size is the same as that of the image, it reduces to a regular neural network. Because the
patch is small, we have fewer weights.

Image source: Deep Learning Udacity

Now let's talk about the mathematics involved in the convolution process.

• Convolution layers consist of a set of learnable filters (or kernels) with a small width
and height and the same depth as the input volume (3 if the input layer is an RGB image).
• For example, suppose we run a convolution on an image with dimensions 34x34x3. The
possible filter sizes are a×a×3, where 'a' can be something like 3, 5, or 7 but must be
small compared to the image dimensions.
• During the forward pass, we slide each filter across the whole input volume step by step.
The step size is called the stride (typically 1 or 2, and sometimes 3 or 4 for
high-dimensional images). At each step we compute the dot product between the kernel
weights and the corresponding patch of the input volume.
• As we slide the filters we get a 2-D output for each filter. Stacking these outputs
together gives an output volume whose depth equals the number of filters. The
network learns all the filters.
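
The sliding dot product described in the bullets above can be written as a deliberately naive loop. This is a sketch under stated assumptions: no padding, a square a×a×C filter, and a hypothetical helper name `conv2d`; real libraries use much faster implementations.

```python
import numpy as np

def conv2d(volume, filters, stride=1):
    """Naive convolution layer forward pass (no padding).

    volume:  (H, W, C) input volume
    filters: (K, a, a, C) bank of K filters with the same depth C as the input
    returns: (H_out, W_out, K) output volume, one 2-D map per filter
    """
    H, W, C = volume.shape
    K, a, _, C_f = filters.shape
    assert C == C_f, "filter depth must match input depth"
    H_out = (H - a) // stride + 1
    W_out = (W - a) // stride + 1
    out = np.zeros((H_out, W_out, K))
    for k in range(K):
        for i in range(H_out):
            for j in range(W_out):
                r, c = i * stride, j * stride
                patch = volume[r:r + a, c:c + a, :]
                # Dot product between the kernel weights and the input patch.
                out[i, j, k] = np.sum(patch * filters[k])
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(34, 34, 3))      # the 34x34x3 example from the text
filters = rng.normal(size=(4, 3, 3, 3))   # four 3x3x3 filters (assumed count)
out_s1 = conv2d(image, filters, stride=1)
out_s2 = conv2d(image, filters, stride=2)
print(out_s1.shape, out_s2.shape)
```

With stride 1 the spatial size shrinks from 34 to 32, with stride 2 it shrinks to 16, and in both cases the depth equals the number of filters.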
Layers used to build ConvNets
A complete convolutional neural network architecture is also known as a ConvNet. A ConvNet
is a sequence of layers, and every layer transforms one volume into another through a
differentiable function.
Types of layers: Let's take an example by running a ConvNet on an image of dimension 32 x 32
x 3.
• Input Layer: This is the layer through which we give input to our model. In a CNN, the
input is generally an image or a sequence of images. This layer holds the raw input image
with width 32, height 32, and depth 3.
• Convolutional Layer: This layer extracts features from the input. It applies a set of
learnable filters, known as kernels, to the input image. The filters are small matrices,
usually of shape 2×2, 3×3, or 5×5. Each filter slides over the input image and computes the
dot product between the kernel weights and the corresponding input image patch. The outputs
of this layer are referred to as feature maps. If we use a total of 12 filters for this
layer (with stride 1 and padding that preserves the spatial size), we get an output volume
of dimension 32 x 32 x 12.
• Activation Layer: Activation layers add nonlinearity to the network by applying an
element-wise activation function to the output of the convolution layer. Some common
activation functions are ReLU: max(0, x), Tanh, and Leaky ReLU. The volume size remains
unchanged, so the output volume has dimensions 32 x 32 x 12.
• Pooling Layer: This layer is periodically inserted in the ConvNet. Its main function is to
reduce the size of the volume, which speeds up computation, reduces memory use, and also
helps to limit overfitting. Two common types of pooling layers are max pooling and average
pooling. If we use a max pool with 2 x 2 filters and stride 2, the resulting volume has
dimension 16x16x12.

Image source: cs231n.stanford.edu

• Flattening: After the convolution and pooling layers, the resulting feature maps are
flattened into a one-dimensional vector so they can be passed into a fully connected layer
for classification or regression.
• Fully Connected Layers: These take the input from the previous layer and compute the final
classification or regression output.
• Output Layer: For classification tasks, the output of the fully connected layers is fed
into a logistic function such as sigmoid or softmax, which converts the score of each
class into a probability for that class.
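
To make the volume sizes in the list above concrete, the following sketch traces the shapes from the convolution output onwards. The random array standing in for the 32 x 32 x 12 convolution output, the 10-class fully connected layer, and the helper name `max_pool2x2` are assumptions for illustration only.

```python
import numpy as np

def relu(x):
    # Element-wise activation: the volume shape is unchanged.
    return np.maximum(0.0, x)

def max_pool2x2(volume):
    """Max pooling with a 2x2 window and stride 2 on an (H, W, C) volume."""
    H, W, C = volume.shape
    assert H % 2 == 0 and W % 2 == 0
    # Group pixels into non-overlapping 2x2 blocks and take the block maximum.
    return volume.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

rng = np.random.default_rng(0)
feature_maps = rng.normal(size=(32, 32, 12))  # stand-in for the conv layer output
activated = relu(feature_maps)                # 32 x 32 x 12
pooled = max_pool2x2(activated)               # 16 x 16 x 12
flat = pooled.reshape(-1)                     # 16 * 16 * 12 = 3072 values

# Fully connected layer mapping the flattened vector to 10 class scores
# (10 classes is an assumed number).
W_fc = rng.normal(scale=0.01, size=(flat.size, 10))
b_fc = np.zeros(10)
scores = flat @ W_fc + b_fc
print(activated.shape, pooled.shape, flat.shape, scores.shape)
```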

Advantages of Convolutional Neural Networks (CNNs):

1. Good at detecting patterns and features in images, videos, and audio signals.
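2. Robust to small translations of the input; robustness to rotation and scaling is more
limited and usually has to be learned, for example through data augmentation.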
3. End-to-end training, no need for manual feature extraction.
4. Can handle large amounts of data and achieve high accuracy.

Disadvantages of Convolutional Neural Networks (CNNs):

1. Computationally expensive to train and require a lot of memory.
2. Can be prone to overfitting if not enough data or proper regularization is used.
3. Require large amounts of labeled data.
4. Interpretability is limited; it is hard to understand what the network has learned.

Motivation for Using Convolution Networks:

Convolution uses three important ideas:

• Sparse interactions
• Parameter sharing
• Equivariant representations

Sparse interaction (or sparse weights) is implemented by using kernels, or feature
detectors, that are smaller than the input image.

If the input image is 256 by 256 pixels, meaningful features such as edges may occupy only a
small subset of those pixels. Smaller feature detectors let us identify such edges easily
because they focus on local features. A further advantage is that fewer parameters need to be
stored, which improves statistical efficiency, and computing the output requires fewer operations.

Parameter sharing controls the number of parameters (weights) used in a CNN. In a
traditional neural network, each weight is used exactly once when computing the output of a
layer. In a CNN we assume that if a feature detector is useful at one spatial position, it is
also useful at other spatial positions, so the same kernel weights are reused everywhere.
Sharing parameters across positions reduces the number of parameters to be learnt and the
memory needed to store them.
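
A quick count shows how much sparse interactions and parameter sharing save. The sketch below compares a fully connected layer with a convolutional layer for an assumed 32 x 32 x 3 input and 32 x 32 x 12 output using 3×3 filters; the sizes and function names are illustrative choices, not figures from the notes.

```python
def dense_parameter_count(in_shape, out_shape):
    """Every output unit has its own weight for every input unit, plus a bias."""
    n_in = in_shape[0] * in_shape[1] * in_shape[2]
    n_out = out_shape[0] * out_shape[1] * out_shape[2]
    return n_in * n_out + n_out

def conv_parameter_count(kernel_size, in_channels, num_filters):
    """Each filter has kernel_size * kernel_size * in_channels weights and one bias,
    and the same filter is reused at every spatial position."""
    return (kernel_size * kernel_size * in_channels + 1) * num_filters

dense_params = dense_parameter_count((32, 32, 3), (32, 32, 12))
conv_params = conv_parameter_count(kernel_size=3, in_channels=3, num_filters=12)
print(dense_params, conv_params)
```

The convolutional layer needs 336 parameters, while the fully connected layer producing the same output volume needs tens of millions.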

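Equivariant representation: The goal is for object detection to be invariant to changes such
as illumination and position, while the internal representation is equivariant to translation:
if an object shifts in the input, its representation in the feature map shifts by the same
amount.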
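
Translation equivariance can be checked numerically on a one-dimensional signal: convolving a shifted input gives the shifted version of the original output. The signal, the kernel, and the helper `shift_right` are made-up values for this sketch; the trailing zeros in the signal ensure that nothing wraps around when shifting.

```python
import numpy as np

def shift_right(v, n):
    # np.roll wraps around, so this is a pure shift only when the last n entries are zero.
    return np.roll(v, n)

signal = np.array([0.0, 0.0, 1.0, 3.0, 2.0, 0.0, 0.0, 0.0])
kernel = np.array([1.0, -1.0])  # simple difference (edge-like) detector

response = np.convolve(signal, kernel, mode="full")
response_of_shifted = np.convolve(shift_right(signal, 1), kernel, mode="full")

# Shifting the input by one position shifts the feature map by one position.
equivariant = np.array_equal(response_of_shifted, shift_right(response, 1))
print(equivariant)
```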
Convolution and Pooling as an Infinitely Strong Prior:
Priors can be considered weak or strong depending on how concentrated the probability
density in the prior is.
• A weak prior is a prior distribution with high entropy, such as a Gaussian distribution
with high variance. Such a prior allows the data to move the parameters more or less
freely.
• A strong prior has very low entropy, such as a Gaussian distribution with low variance.
Such a prior plays a more active role in determining where the parameters end up.
An infinitely strong prior places zero probability on some parameters and says that these
parameter values are completely forbidden, regardless of how much support the data gives to
those values.
We can imagine a convolutional net as being similar to a fully connected net, but with
an infinitely strong prior over its weights. This infinitely strong prior says that the weights for
one hidden unit must be identical to the weights of its neighbor, but shifted in space. The prior
also says that the weights must be zero, except for in the small, spatially contiguous receptive
field assigned to that hidden unit.
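
This view can be made concrete in one dimension: a convolution is a fully connected layer whose weight matrix is forced to be zero outside a small window and to repeat the same values, shifted, in every row. The kernel and input values below are arbitrary, and the comparison uses cross-correlation (the sliding dot product used in most deep learning libraries), which is an assumption of this sketch.

```python
import numpy as np

kernel = np.array([1.0, -2.0, 1.0])
x = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])

n_out = x.size - kernel.size + 1          # one output per valid window position
W = np.zeros((n_out, x.size))             # weight matrix of the equivalent dense layer
for i in range(n_out):
    # Row i holds the same kernel weights, shifted by i; everything else stays zero.
    W[i, i:i + kernel.size] = kernel

dense_output = W @ x
conv_output = np.correlate(x, kernel, mode="valid")

print(W)
print(np.allclose(dense_output, conv_output))
```

Of the 24 entries in `W`, only 12 are allowed to be nonzero, and those 12 are copies of just 3 free parameters.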
Overall, we can think of the use of convolution as introducing an infinitely strong prior
probability distribution over the parameters of a layer. This prior says that the function the
layer should learn contains only local interactions and is equivariant to translation. Likewise,
the use of pooling is an infinitely strong prior that each unit should be invariant to small
translations. One key insight is that convolution and pooling can cause underfitting. Like any
prior, convolution and pooling are only useful when the assumptions made by the prior are
reasonably accurate. If a task relies on preserving precise spatial information, then using
pooling on all features can increase the training error.
Some convolutional network architectures (Szegedy et al., 2014a) are designed to use
pooling on some channels but not on other channels, in order to get both highly invariant
features and features that will not underfit when the translation invariance prior is incorrect.
When a task involves incorporating information from very distant locations in the input, then
the prior imposed by convolution may be inappropriate.
Another key insight from this view is that we should only compare convolutional
models to other convolutional models in benchmarks of statistical learning performance.
Models that do not use convolution would be able to learn even if we permuted all of the pixels
in the image. For many image datasets, there are separate benchmarks for models that are
permutation invariant and must discover the concept of topology via learning, and models that
have the knowledge of spatial relationships hard-coded into them by their designer.

Variants of the Basic Convolution Function

There are several variants and extensions of the basic convolution function, each designed to
address specific challenges or enhance the capabilities of CNNs. Here are some common
variants:

Stride: The stride parameter determines how far the filter is moved after each convolution
operation. Increasing the stride reduces the spatial dimensions of the output feature map. It is
useful for downsampling and reducing computational complexity.

Dilated Convolution: Also known as atrous convolution, dilated convolution introduces gaps
between the elements of the filter, allowing it to have a broader receptive field without
increasing the number of parameters. This can capture larger patterns in the input.

Transposed Convolution (Deconvolution): This operation is used for upsampling the spatial
resolution of the input. It can be viewed as inserting gaps between the pixels of the input,
filling them with zeros, and then convolving. Transposed convolution is often used in the
decoder part of a neural network for tasks like image segmentation.

Depthwise Separable Convolution: It decomposes the standard convolution into two separate
operations: a depthwise convolution and a pointwise convolution. This reduces the number of
parameters and computations, making it more computationally efficient.

Grouped Convolution: In grouped convolution, the input channels are divided into groups,
and each group is convolved with a subset of filters. This can help to reduce the computational
cost and memory requirements, particularly in large networks.

Fractionally Strided Convolution: This is another term for transposed convolution or
deconvolution. It is used to increase the spatial resolution of the input.

1x1 Convolution: Convolution with a 1x1 filter is used to combine information across
channels without considering spatially neighboring pixels. It is often used to increase or
decrease the number of channels in a feature map.

Separable Convolution: In many libraries this term refers to the depthwise separable
convolution described above: a depthwise convolution followed by a 1x1 convolution that
mixes information between channels. It is more parameter-efficient than a standard
convolution.

Gated Convolution: Gated convolutions introduce gating mechanisms to control the flow of
information through the network. Gating is typically implemented using a sigmoid function to
modulate the activation values.
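
The effect of stride, dilation, and transposed convolution on the spatial size can be summarised with the standard output-size arithmetic. The helper names and the example sizes below are assumptions for illustration; the formulas follow the convention used by common deep learning libraries for one spatial dimension.

```python
def conv_output_size(n, kernel, stride=1, padding=0, dilation=1):
    """Output length of a convolution along one spatial dimension."""
    effective_kernel = dilation * (kernel - 1) + 1  # span covered by a dilated kernel
    return (n + 2 * padding - effective_kernel) // stride + 1

def transposed_conv_output_size(n, kernel, stride=1, padding=0, dilation=1):
    """Output length of a transposed (fractionally strided) convolution."""
    return (n - 1) * stride - 2 * padding + dilation * (kernel - 1) + 1

same_size = conv_output_size(32, kernel=3, stride=1, padding=1)    # size preserved
strided = conv_output_size(32, kernel=3, stride=2, padding=1)      # downsampled
dilated = conv_output_size(32, kernel=3, dilation=2)               # 3 taps spanning 5 pixels
upsampled = transposed_conv_output_size(16, kernel=2, stride=2)    # upsampled
print(same_size, strided, dilated, upsampled)
```

A dilated 3-tap kernel with dilation 2 covers the same span as a 5-tap kernel while still having only three weights per channel.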

These variants offer flexibility in designing CNN architectures, allowing researchers and
practitioners to adapt the network architecture to the specific requirements of the task at hand.
Different variants may be suitable for different tasks, and their effectiveness can depend on
factors such as the dataset and the overall network architecture.
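
As a rough illustration of the savings mentioned for the depthwise separable and grouped variants, the sketch below counts weights (biases ignored) for one layer. The kernel size, channel counts, and group count are assumed example values.

```python
def standard_conv_weights(k, c_in, c_out):
    # Every output channel has a k x k filter over all input channels.
    return k * k * c_in * c_out

def depthwise_separable_weights(k, c_in, c_out):
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution that mixes channels
    return depthwise + pointwise

def grouped_conv_weights(k, c_in, c_out, groups):
    # Each output channel only sees the c_in / groups channels of its own group.
    assert c_in % groups == 0 and c_out % groups == 0
    return k * k * (c_in // groups) * c_out

standard = standard_conv_weights(3, 64, 128)
separable = depthwise_separable_weights(3, 64, 128)
grouped = grouped_conv_weights(3, 64, 128, groups=4)
print(standard, separable, grouped)
```

For these sizes the standard layer has 73,728 weights, the grouped layer 18,432, and the depthwise separable layer 8,768.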

Structured Outputs

Convolutional networks can be trained to output high-dimensional structured output rather than
just a classification score. A good example is the task of image segmentation, where each pixel
needs to be associated with an object class. Here the output is the same size (spatially) as the
input. The model outputs a tensor S where S[i,j,k] is the probability that pixel (j,k) belongs to
class i.

To produce an output map of the same size as the input map, only same-padded convolutions can
be stacked. Alternatively, a coarser segmentation map can be obtained by allowing the output map
to shrink spatially.
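
The tensor S can be obtained from per-pixel class scores by applying a softmax over the class axis. The following is a minimal sketch: the random scores stand in for the output of a same-padded convolutional network, and the sizes (3 classes, a 4 x 5 image) are assumed for illustration.

```python
import numpy as np

def pixelwise_softmax(logits):
    """logits: (num_classes, H, W) scores; returns S with the same shape,
    where S[i, j, k] is the probability that pixel (j, k) belongs to class i."""
    shifted = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4, 5))   # stand-in for network output
S = pixelwise_softmax(logits)
label_map = S.argmax(axis=0)          # most probable class for every pixel
print(S.shape, label_map.shape)
```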
The output of the first labelling stage can be refined successively by another convolutional
model. If the models use tied parameters, this gives rise to a type of recursive model.

(H¹, H², H³ share parameters)

Recursive refinement of the segmentation map

The output can be further processed under the assumption that contiguous regions of pixels will
tend to belong to the same label. Graphical models can describe this relationship.
Alternatively, CNNs can learn to optimize an approximation of the graphical model's training
objective.

Another model that has gained popularity for segmentation tasks (especially in the medical
imaging community) is the U-Net. Its up-convolution can be implemented as a direct upsampling
by repetition followed by a convolution with same padding.

Data Types

The data used with a convolutional network usually consist of several channels, each channel
being the observation of a different quantity at some point in space or time.
One advantage of convolutional networks is that they can also process inputs with varying
spatial extents. When the output is allowed to vary in size accordingly, no extra design change
is needed. If, however, the output must have a fixed size, as in classification, a pooling stage
with kernel size proportional to the input size needs to be used.
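
One common way to realise a pooling stage whose window scales with the input is global average pooling, where the window covers the whole feature map. This is offered as an illustrative sketch of that idea; the feature-map sizes and the name `global_average_pool` are assumptions.

```python
import numpy as np

def global_average_pool(feature_maps):
    """Average each channel over all spatial positions: (H, W, C) -> (C,)."""
    return feature_maps.mean(axis=(0, 1))

rng = np.random.default_rng(0)
small = rng.normal(size=(8, 8, 12))     # feature maps from a small input image
large = rng.normal(size=(20, 31, 12))   # feature maps from a larger input image

pooled_small = global_average_pool(small)
pooled_large = global_average_pool(large)
# Both vectors have length 12, so the same classifier layer can follow either one.
print(pooled_small.shape, pooled_large.shape)
```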

Different data types based on the number of spatial dimensions and channels

Efficient Convolution Algorithms:


In some problem settings, performing convolution as pointwise multiplication in the frequency
domain can provide a speed-up compared to direct computation. This follows from a
property of convolution: convolution in the source domain is multiplication in the frequency
domain, F(f * g) = F(f) · F(g), where F is the Fourier transform.
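
The property can be verified numerically with NumPy's FFT routines. The sketch below assumes one-dimensional real signals and zero-pads both to the full output length so that the circular convolution computed by the FFT matches the ordinary linear convolution; the function name and the test values are illustrative.

```python
import numpy as np

def fft_convolve(f, g):
    """Linear convolution of two 1-D real signals via the frequency domain."""
    n = len(f) + len(g) - 1          # length of the full linear convolution
    F = np.fft.rfft(f, n)            # transform of f, zero-padded to length n
    G = np.fft.rfft(g, n)            # transform of g, zero-padded to length n
    return np.fft.irfft(F * G, n)    # pointwise product, then inverse transform

f = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
g = np.array([0.5, -1.0, 0.25])

direct = np.convolve(f, g)           # direct computation in the source domain
via_fft = fft_convolve(f, g)
print(np.allclose(direct, via_fft))
```

Whether the frequency-domain route is actually faster depends on the signal and kernel sizes; for small kernels direct computation is often cheaper.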

When a d-dimensional kernel can be written as the outer product of d vectors, the kernel is
said to be separable. The corresponding convolution is more efficient when implemented
as d one-dimensional convolutions rather than a direct d-dimensional convolution. Note, however,
that it is not always possible to express a kernel as an outer product of lower-dimensional
kernels. This is not to be confused with depthwise separable convolution.
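
For d = 2, the equivalence can be checked directly. The sketch below uses a small smoothing kernel built as an outer product (an assumed example) and compares a direct two-dimensional sliding dot product with two one-dimensional passes; the helper names are hypothetical.

```python
import numpy as np

def correlate2d_valid(img, K):
    """Direct 2-D sliding dot product (no padding)."""
    kh, kw = K.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * K)
    return out

def correlate_separable(img, u, v):
    """Same result for K = outer(u, v), using two 1-D passes."""
    # Pass 1: filter every row with v (horizontal direction).
    rows = np.apply_along_axis(lambda r: np.correlate(r, v, mode="valid"), 1, img)
    # Pass 2: filter every column of the intermediate result with u (vertical direction).
    return np.apply_along_axis(lambda c: np.correlate(c, u, mode="valid"), 0, rows)

u = np.array([1.0, 2.0, 1.0]) / 4.0
v = np.array([1.0, 2.0, 1.0]) / 4.0
K = np.outer(u, v)                       # separable 3x3 smoothing kernel

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8))
direct = correlate2d_valid(img, K)
separable = correlate_separable(img, u, v)
print(np.linalg.matrix_rank(K), np.allclose(direct, separable))
```

A two-dimensional kernel is separable exactly when it has rank 1. The direct version needs 9 multiplications per output value here, while the two passes need 3 + 3.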
Random and Unsupervised Features:

To reduce the computational cost of training the CNN, we can use features not learned by
supervised training.

1. Random initialization has been shown to create filters that are frequency selective and
translation invariant. This can be used to inexpensively select the model architecture.
Randomly initialize several CNN architectures and just train the last classification layer.
Once a winner is determined, that model can be fully trained in a supervised manner.

2. Hand designed kernels may be used; e.g. to detect edges at different orientations and
intensities.

3. Unsupervised training of kernels may be performed; e.g. applying k-means clustering to
image patches and using the centroids as convolutional kernels. Unsupervised pre-training
may offer a regularization effect (not well established). It may also allow for
training of larger CNNs because of reduced computation cost.
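
As an illustration of the hand-designed option in the list above, the sketch below applies Sobel-style edge kernels (a common hand-designed choice, used here as an assumed example) to a tiny synthetic image containing a single vertical edge. No training is involved.

```python
import numpy as np

def correlate2d_valid(img, K):
    """Sliding dot product of kernel K over img (no padding)."""
    kh, kw = K.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * K)
    return out

# Hand-designed kernels: one responds to vertical edges, one to horizontal edges.
vertical_edge_kernel = np.array([[-1.0, 0.0, 1.0],
                                 [-2.0, 0.0, 2.0],
                                 [-1.0, 0.0, 1.0]])
horizontal_edge_kernel = vertical_edge_kernel.T

# Synthetic 6x6 image: dark left half, bright right half (one vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

vertical_response = correlate2d_valid(image, vertical_edge_kernel)
horizontal_response = correlate2d_valid(image, horizontal_edge_kernel)
print(vertical_response)
print(horizontal_response)
```

The vertical-edge kernel responds only in the windows that straddle the edge, and the horizontal-edge kernel gives zero everywhere because the image has no horizontal edges.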

Another approach to CNN training is greedy layer-wise pretraining, most notably used
in the convolutional deep belief network. For example, in the case of multi-layer perceptrons,
starting with the first layer, each layer is trained in isolation. Once the first layer is
trained, its output is stored and used as input for training the next layer, and so on.

The Neuro-scientific Basis for Convolutional Networks:

The history of convolutional networks begins with neuroscientific experiments long before the
relevant computational models were developed.
Neurophysiologists David Hubel and Torsten Wiesel observed how neurons in the cat's brain
responded to images projected in precise locations on a screen in front of the cat.

“Their great discovery was that neurons in the early visual system responded most strongly to
very specific patterns of light, such as precisely oriented bars, but responded hardly at all to other
patterns”

The neurons in the early visual cortex are organized in a hierarchical fashion: the first
cells connected to the cat's retinas are responsible for detecting simple patterns like edges
and bars, followed by later layers that respond to more complex patterns by combining the
earlier neuronal activities.

A convolutional neural network may learn to detect edges from raw pixels in the first layer,
then use the edges to detect simple shapes in the second layer, and then use these shapes to
detect higher-level features, such as facial shapes, in higher layers.
Filters in a Convolutional Neural network

The visual cortex is the part of the cerebral cortex that processes visual information.
V1 is the first area of the brain that begins to perform significantly advanced processing
of visual input.

A convolutional network layer is designed to capture three properties of V1:

1. V1 is arranged in a spatial map. It actually has a two-dimensional structure mirroring the
structure of the image in the retina. Convolutional networks capture this property by having
their features defined in terms of two-dimensional maps.

2. V1 contains many simple cells. A simple cell's activity can be characterized by a linear
function of the image in a small, spatially localized receptive field. The detector units of a
convolutional network are designed to emulate these properties of simple cells.

3. V1 also contains many complex cells. These cells respond to features that
are similar to those detected by simple cells, but complex cells are invariant to small shifts in
the position of the feature. This inspires the pooling units of convolutional networks.
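
The complex-cell behaviour that pooling imitates can be seen on a toy one-dimensional example. The detector responses below are made-up values, and non-overlapping max pooling with a window of 3 is an assumed configuration for this sketch.

```python
import numpy as np

def max_pool1d(x, width):
    """Non-overlapping max pooling over windows of the given width."""
    assert x.size % width == 0
    return x.reshape(-1, width).max(axis=1)

# Output of a hypothetical detector unit layer: one strong response at position 1.
responses = np.array([0.0, 5.0, 0.0, 0.0, 0.0, 0.0])
shifted_by_one = np.roll(responses, 1)   # feature moved to position 2 (same window)
shifted_by_two = np.roll(responses, 2)   # feature moved to position 3 (next window)

pooled = max_pool1d(responses, 3)
pooled_shift1 = max_pool1d(shifted_by_one, 3)
pooled_shift2 = max_pool1d(shifted_by_two, 3)
print(pooled, pooled_shift1, pooled_shift2)
```

The pooled output is unchanged by the one-position shift but changes once the feature crosses into the next pooling window, which is why pooling gives invariance only to small shifts.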
There are many differences between convolutional networks and the mammalian vision
system. Some of these differences are:

1. The human eye is mostly very low resolution, except for a tiny patch called the fovea. Most
convolutional networks receive large full resolution photographs as input.

2. The human visual system is integrated with many other senses, such as
hearing, and factors like our moods and thoughts. Convolutional networks
so far are purely visual.

3. Even simple brain areas like V1 are heavily impacted by feedback from higher levels.
Feedback has been explored extensively in neural network models but has not yet been
shown to offer a compelling improvement.

Convolutional Networks and the History of Deep Learning:

Convolutional networks have played an important role in the history of deep learning. They
are a key example of a successful application of insights obtained by studying the brain to
machine learning applications. They were also some of the first deep models to perform
well, long before arbitrary deep models were considered viable. Convolutional networks
were also some of the first neural networks to solve important commercial applications and
remain at the forefront of commercial applications of deep learning today. For example, in
the 1990s, the neural network research group at AT&T developed a convolutional network
for reading checks (LeCun et al., 1998b). By the end of the 1990s, this system deployed by
NEC was reading over 10% of all the checks in the US. Later, several OCR and handwriting
recognition systems based on convolutional nets were deployed by Microsoft (Simard et al.,
2003). See Chapter 12 for more details on such applications and more modern applications
of convolutional networks. See LeCun et al. (2010) for a more in-depth history of
convolutional networks up to 2010.

Convolutional networks were also used to win many contests. The current intensity of
commercial interest in deep learning began when Krizhevsky et al. (2012) won the ImageNet
object recognition challenge, but convolutional networks had been used to win other machine
learning and computer vision contests with less impact for years earlier.

Convolutional nets were some of the first working deep networks trained with
back-propagation. It is not entirely clear why convolutional networks
succeeded when general back-propagation networks were considered to have failed. It may
simply be that convolutional networks were more computationally efficient than fully
connected networks, so it was easier to run multiple experiments with them and tune their
implementation and hyperparameters. Larger networks also seem to be easier to train. With
modern hardware, large fully connected networks appear to perform reasonably on many
tasks, even when using datasets that were available and activation functions that were
popular during the times when fully connected networks were believed not to work well. It
may be that the primary barriers to the success of neural networks were psychological
(practitioners did not expect neural networks to work, so they did not make a serious effort
to use neural networks). Whatever the case, it is fortunate that convolutional networks
performed well decades ago. In many ways, they carried the torch for the rest of deep
learning and paved the way to the acceptance of neural networks in general.

Convolutional networks provide a way to specialize neural networks to work with data that
has a clear grid-structured topology and to scale such models to very large size. This
approach has been the most successful on a two-dimensional, image topology. To process
one-dimensional, sequential data, we turn next to another powerful specialization of the
neural networks framework: recurrent neural networks.
