Convolutional Neural Networks, Explained

A Convolutional Neural Network, also known as CNN or ConvNet, is a class of neural networks that specializes in processing data that has a grid-like topology, such as an image. A digital image is a binary representation of visual data. It contains a series of pixels arranged in a grid-like fashion, with pixel values that denote how bright and what color each pixel should be.

The human brain processes a huge amount of information the second we see an image. Each neuron works in its own receptive field and is connected to other neurons in such a way that together they cover the entire visual field. Just as each neuron responds to stimuli only in the restricted region of the visual field called the receptive field in the biological vision system, each neuron in a CNN processes data only in its receptive field as well. The layers are arranged so that they detect simpler patterns first (lines, curves, etc.) and more complex patterns (faces, objects, etc.) further along. By using a CNN, one can give sight to computers.

Convolutional Neural Network Architecture

A CNN typically has three layers: a convolutional layer, a pooling layer, and a fully connected layer.

Convolution Layer

The convolution layer is the core building block of the CNN. It carries the main portion of the network's computational load.

This layer performs a dot product between two matrices, where one matrix is the set of learnable parameters otherwise known as a kernel, and the other matrix is the restricted portion of the receptive field. The kernel is spatially smaller than the image but is more in-depth. This means that, if the image is composed of three (RGB) channels, the kernel height and width will be spatially small, but the depth extends up to all three channels.

During the forward pass, the kernel slides across the height and width of the image, producing the image representation of that receptive region. This produces a two-dimensional representation of the image known as an activation map that gives the response of the kernel at each spatial position of the image. The sliding size of the kernel is called a stride.

If we have an input of size W x W x D and Dout kernels with a spatial size of F, stride S, and amount of padding P, then the size of the output volume can be determined by the following formula:

Wout = (W - F + 2P) / S + 1

This will yield an output volume of size Wout x Wout x Dout.

(Figure: illustration of the convolution operation, showing the kernel dimensions and the resulting activation map.)
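To make the formula concrete, the following PyTorch sketch (not part of the original article; the layer sizes are illustrative) applies a 5 x 5 convolution with stride 1 and padding 2 to a random three-channel image and checks that the activation-map size matches the formula.

```python
import torch
import torch.nn as nn

W, F, S, P = 28, 5, 1, 2        # input width/height, kernel size, stride, padding
D_in, D_out = 3, 16             # RGB input channels, number of kernels

conv = nn.Conv2d(D_in, D_out, kernel_size=F, stride=S, padding=P)
x = torch.randn(1, D_in, W, W)  # one W x W RGB image

out = conv(x)
print(out.shape)                 # torch.Size([1, 16, 28, 28])
print((W - F + 2 * P) // S + 1)  # 28, matching Wout = (W - F + 2P) / S + 1
```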
Motivation behind Convolution

Convolution leverages three important ideas that motivated computer vision researchers: sparse interaction, parameter sharing, and equivariant representation. Let's describe each one of them in detail.

Trivial neural network layers use matrix multiplication by a matrix of parameters describing the interaction between the input and output units. This means that every output unit interacts with every input unit. Convolutional neural networks, however, have sparse interaction. This is achieved by making the kernel smaller than the input: e.g., an image can have millions or thousands of pixels, but while processing it with the kernel we can detect meaningful information that spans only tens or hundreds of pixels. This means that we need to store fewer parameters, which not only reduces the memory requirements of the model but also improves its statistical efficiency.

If computing one feature at a spatial point (x1, y1) is useful, then it should also be useful at some other spatial point, say (x2, y2). In other words, for a single two-dimensional slice, i.e., for creating one activation map, neurons are constrained to use the same set of weights. In a traditional neural network, each element of the weight matrix is used once and then never revisited, while a convolutional network has shared parameters: the weights applied to one input are the same as the weights applied elsewhere.

Due to parameter sharing, the layers of a convolutional neural network have a property of equivariance to translation. It means that if we change the input in a certain way, the output will change in the same way.

Pooling Layer

The pooling layer replaces the output of the network at certain locations by deriving a summary statistic of the nearby outputs. This helps in reducing the spatial size of the representation, which decreases the required amount of computation and weights. The pooling operation is processed on every slice of the representation individually.

There are several pooling functions, such as the average of the rectangular neighborhood, the L2 norm of the rectangular neighborhood, and a weighted average based on the distance from the central pixel. However, the most popular process is max pooling, which reports the maximum output from the neighborhood.

(Figure: max pooling with a 2 x 2 filter and stride 2 applied to a 4 x 4 input.)

If we have an activation map of size W x W x D, a pooling kernel of spatial size F, and stride S, then the size of the output volume can be determined by the following formula:

Wout = (W - F) / S + 1

This will yield an output volume of size Wout x Wout x D. In all cases, pooling provides some translation invariance, which means that an object would be recognizable regardless of where it appears in the frame.

Fully Connected Layer

Neurons in this layer have full connectivity with all neurons in the preceding and succeeding layer, as seen in a regular fully connected neural network. This is why it can be computed as usual by a matrix multiplication followed by a bias offset.

The FC layer helps to map the representation between the input and the output.

Non-Linearity Layers

Since convolution is a linear operation and images are far from linear, non-linearity layers are often placed directly after the convolutional layer to introduce non-linearity to the activation map.

There are several types of non-linear operations, the popular ones being:

1. Sigmoid

The sigmoid non-linearity has the mathematical form σ(x) = 1 / (1 + e^(-x)). It takes a real-valued number and "squashes" it into a range between 0 and 1. However, a very undesirable property of sigmoid is that when the activation is at either tail, the gradient becomes almost zero. If the local gradient becomes very small, then during backpropagation it effectively "kills" the gradient. Also, if the data coming into the neuron is always positive, the gradients on the weights will become either all positive or all negative, resulting in a zig-zag dynamic of gradient updates for the weights.

2. Tanh

Tanh squashes a real-valued number into the range [-1, 1]. Like sigmoid, its activations saturate, but, unlike the sigmoid neurons, its output is zero-centered.

3. ReLU

The Rectified Linear Unit (ReLU) has become very popular in the last few years. It computes the function f(x) = max(0, x). In other words, the activation is simply thresholded at zero.

In comparison to sigmoid and tanh, ReLU is more reliable and accelerates convergence by six times. Unfortunately, a con is that ReLU can be fragile during training: a large gradient flowing through it can update the weights in such a way that the neuron will never be updated again. However, we can work around this by setting a proper learning rate.
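As a small illustration of the pooling and non-linearity layers, the sketch below (not from the original article; the tensor values are made up) applies the three activations and a 2 x 2 max pool to a toy activation map in PyTorch.

```python
import torch
import torch.nn.functional as F

# A toy activation map of shape (batch, channels, height, width) = (1, 1, 4, 4).
x = torch.tensor([[[[-2.0,  1.0,  3.0, -1.0],
                    [ 0.5, -0.5,  2.0,  4.0],
                    [-3.0,  6.0, -1.0,  8.0],
                    [ 1.0,  2.0,  7.0, -4.0]]]])

print(torch.sigmoid(x))  # squashes every value into (0, 1)
print(torch.tanh(x))     # squashes every value into (-1, 1), zero-centered
print(F.relu(x))         # thresholds negative values at zero

# Max pooling with a 2 x 2 kernel and stride 2:
# Wout = (W - F) / S + 1 = (4 - 2) / 2 + 1 = 2.
print(F.max_pool2d(x, kernel_size=2, stride=2))  # a 2 x 2 map of neighborhood maxima
```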
Designing a Convolutional Neural Network

Now that we understand the various components, we can build a convolutional neural network. We will be using Fashion-MNIST, which is a dataset of Zalando's article images consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28 x 28 grayscale image, associated with a label from 10 classes. The dataset can be downloaded here.

Our convolutional neural network has the following architecture:

[INPUT] -> [CONV 1] -> [BATCH NORM] -> [ReLU] -> [POOL 1] -> [CONV 2] -> [BATCH NORM] -> [ReLU] -> [POOL 2] -> [FC LAYER] -> [RESULT]

For both conv layers, we will use a kernel of spatial size 5 x 5 with stride 1 and padding 2. For both pooling layers, we will use a max pool operation with kernel size 2, stride 2, and zero padding.

Conv 1
- Input size (W1 x H1 x D1) = 28 x 28 x 1
- Requires four hyperparameters:
  - Number of kernels, K = 16
  - Spatial extent of each one, F = 5
  - Stride, S = 1
  - Amount of zero padding, P = 2
- Output volume (W2 x H2 x D2):
  - W2 = (28 - 5 + 2(2)) / 1 + 1 = 28
  - H2 = (28 - 5 + 2(2)) / 1 + 1 = 28
  - D2 = K = 16
- Output of Conv 1 (W2 x H2 x D2) = 28 x 28 x 16

Pool 1
- Input size (W2 x H2 x D2) = 28 x 28 x 16
- Requires two hyperparameters:
  - Spatial extent of each one, F = 2
  - Stride, S = 2
- Output volume (W3 x H3 x D2):
  - W3 = (28 - 2) / 2 + 1 = 14
  - H3 = (28 - 2) / 2 + 1 = 14
- Output of Pool 1 (W3 x H3 x D2) = 14 x 14 x 16

Conv 2
- Input size (W3 x H3 x D2) = 14 x 14 x 16
- Requires four hyperparameters:
  - Number of kernels, K = 32
  - Spatial extent of each one, F = 5
  - Stride, S = 1
  - Amount of zero padding, P = 2
- Output volume (W4 x H4 x D3):
  - W4 = (14 - 5 + 2(2)) / 1 + 1 = 14
  - H4 = (14 - 5 + 2(2)) / 1 + 1 = 14
  - D3 = K = 32
- Output of Conv 2 (W4 x H4 x D3) = 14 x 14 x 32

Pool 2
- Input size (W4 x H4 x D3) = 14 x 14 x 32
- Requires two hyperparameters:
  - Spatial extent of each one, F = 2
  - Stride, S = 2
- Output volume (W5 x H5 x D3):
  - W5 = (14 - 2) / 2 + 1 = 7
  - H5 = (14 - 2) / 2 + 1 = 7
- Output of Pool 2 (W5 x H5 x D3) = 7 x 7 x 32

Fully Connected Layer
- Input size (W5 x H5 x D3) = 7 x 7 x 32
- Output size = number of classes = 10

Code snippet for defining the convnet:

```python
import torch.nn as nn

class ConvNet(nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()

        # Constraints for layer 1
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5,
                               stride=1, padding=2)
        self.batch1 = nn.BatchNorm2d(16)
        self.relu1 = nn.ReLU()
        # Default stride of MaxPool2d is equal to the kernel size
        self.pool1 = nn.MaxPool2d(kernel_size=2)

        # Constraints for layer 2
        self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=5,
                               stride=1, padding=2)
        self.batch2 = nn.BatchNorm2d(32)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(kernel_size=2)

        # Fully connected layer: 32 feature maps of size 7 x 7 mapped to 10 classes
        self.fc = nn.Linear(32 * 7 * 7, 10)

    # Defining the network flow
    def forward(self, x):
        # Conv 1
        out = self.conv1(x)
        out = self.batch1(out)
        out = self.relu1(out)
        # Max pool 1
        out = self.pool1(out)

        # Conv 2
        out = self.conv2(out)
        out = self.batch2(out)
        out = self.relu2(out)
        # Max pool 2
        out = self.pool2(out)

        # Flatten and apply the fully connected layer
        out = out.view(out.size(0), -1)
        out = self.fc(out)
        return out
```

We have also used batch normalization in our network, which saves us from improper initialization of the weight matrices by explicitly forcing the network to take on activations with a unit Gaussian distribution. The code for the above-defined network is available here.
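The network above is trained with cross-entropy loss and the Adam optimizer, as described next. The snippet below is a minimal sketch of such a training loop, not taken from the article's repository; it assumes torchvision's FashionMNIST dataset and an illustrative batch size and epoch count.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Load Fashion-MNIST as 28 x 28 grayscale tensors.
train_set = datasets.FashionMNIST(root="data", train=True, download=True,
                                  transform=transforms.ToTensor())
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)

model = ConvNet()                 # the network defined above
criterion = nn.CrossEntropyLoss() # cross-entropy loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(5):            # epoch count is an assumption
    for images, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(images)   # (batch, 10) class scores
        loss = criterion(outputs, labels)
        loss.backward()           # backpropagation
        optimizer.step()          # parameter update
```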
We have trained the model using cross-entropy as our loss function and the Adam optimizer with a learning rate of 0.001. After training the model, we achieved 90% accuracy on the test dataset.

Applications

Below are some applications of convolutional neural networks used today:

1. Object detection: With CNNs, we now have sophisticated models like R-CNN, Fast R-CNN, and Faster R-CNN that are the predominant pipeline for many object detection models deployed in autonomous vehicles, facial detection, and more.

2. Semantic segmentation: In 2015, a group of researchers from Hong Kong developed a CNN-based Deep Parsing Network to incorporate rich information into an image segmentation model. Researchers from UC Berkeley also built fully convolutional networks that improved upon state-of-the-art semantic segmentation.

3. Image captioning: CNNs are used with recurrent neural networks to write captions for images and videos. This can be used for many applications such as activity recognition or describing videos and images for the visually impaired. It has been heavily deployed by YouTube to make sense of the huge number of videos uploaded to the platform on a regular basis.

References

1. Deep Learning, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016.
2. Stanford University's course CS231n: Convolutional Neural Networks for Visual Recognition, by Prof. Fei-Fei Li, Justin Johnson, and Serena Yeung.
3. https://datascience.stackexchange.com/questions/14340/difference-of-activation-functions-in-neural-networks-in-general
4. https://www.codementor.io/james_aka_yale/convolutional-neural-networks-the-biologically-inspired-model
5. https://searchenterpriseai.techtarget.com/definition/convolutional-neural-network
