Decode - Neural Networks and Deep Learning


SUBJECT CODE : CS864PE
JNTUH R16, IV-II (CSE/IT) B.Tech., Professional Elective - VI

NEURAL NETWORKS & DEEP LEARNING

Iresh A. Dhotre
M.E. (Information Technology)
Ex-Faculty, Sinhgad College of Engineering, Pune.

FEATURES
- Written by popular authors of text books of Technical Publications
- Covers entire syllabus in question-answer format
- Exact answers and solutions
- Important points to remember
- Fill in the blanks with answers for mid term exam
- MCQs with answers for mid term exam
- Short answered questions
- Solved model question paper [R-16 pattern]

DECODE - A Guide For Engineering Students

NEURAL NETWORKS & DEEP LEARNING
SUBJECT CODE : CS864PE
IV-II B.Tech., [CSE/IT] Professional Elective - VI

Copyright with Technical Publications. All publishing rights (printed and ebook version) reserved with Technical Publications. No part of this book should be reproduced in any form (electronic, mechanical, photocopy, or any information storage and retrieval system) without prior permission in writing from Technical Publications, Pune.

Published by :
Technical Publications, Amit Residency, Office No. 1, 412, Shaniwar Peth, Pune - 411030, M.S. INDIA
Ph. : +91-020-24495496/97
Email : sales@technicalpublications.org
Website : www.technicalpublications.org

First Edition : 2020

ISBN 978-93-89750-67-6



SYLLABUS
Neural Networks and Deep Learning (CS864PE)

UNIT - I
Artificial Neural Networks : Introduction, Basic models of ANN, Important terminologies, Supervised Learning Networks, Perceptron Networks, Adaptive Linear Neuron, Back-propagation Networks, Associative Memory Networks, Training Algorithms for pattern association, BAM and Hopfield Networks. (Chapter 1)

UNIT - II
Unsupervised Learning Networks : Introduction, Fixed Weight Competitive Nets, Maxnet, Hamming Network, Kohonen Self-Organizing Feature Maps, Learning Vector Quantization, Counter Propagation Networks, Adaptive Resonance Theory Networks, Special Networks - Introduction to various networks. (Chapter 2)

UNIT - III
Introduction to Deep Learning : Historical Trends in Deep Learning, Deep Feed-forward Networks, Gradient-Based Learning, Hidden Units, Architecture Design, Back-Propagation and Other Differentiation Algorithms. (Chapter 3)

UNIT - IV
Regularization for Deep Learning : Parameter Norm Penalties, Norm Penalties as Constrained Optimization, Regularization and Under-Constrained Problems, Dataset Augmentation, Noise Robustness, Semi-Supervised Learning, Multi-task Learning, Early Stopping, Parameter Tying and Parameter Sharing, Sparse Representations, Bagging and Other Ensemble Methods, Dropout, Adversarial Training, Tangent Distance, Tangent Prop and Manifold Tangent Classifier. (Chapter 4)

UNIT - V
Optimization for Training Deep Models : Challenges in Neural Network Optimization, Basic Algorithms, Parameter Initialization Strategies, Algorithms with Adaptive Learning Rates, Approximate Second-Order Methods, Optimization Strategies and Meta-Algorithms.
Applications : Large-Scale Deep Learning, Computer Vision, Speech Recognition, Natural Language Processing. (Chapter 5)

TABLE OF CONTENTS

Unit I
Chapter 1 : Artificial Neural Networks (1-1) to (1-24)
1.1 Introduction .... 1-1
1.2 Supervised Learning Networks .... 1-5
1.3 Back-propagation Networks .... 1-10
1.4 Associative Memory Networks .... 1-13
1.5 Hopfield Networks .... 1-18
Fill in the Blanks with Answers for Mid Term Exam .... 1-22
Multiple Choice Questions with Answers for Mid Term Exam .... 1-22

Unit II
Chapter 2 : Unsupervised Learning Network (2-1) to (2-12)
2.1 Introduction .... 2-1
2.2 Kohonen Self-Organizing Feature Maps .... 2-4
2.3 Learning Vector Quantization
2.4 Counter Propagation Networks .... 2-7
2.5 Adaptive Resonance Theory .... 2-8
2.6 Special Networks .... 2-10
Fill in the Blanks with Answers for Mid Term Exam .... 2-11
Multiple Choice Questions with Answers for Mid Term Exam .... 2-11

Unit III
Chapter 3 : Deep Learning (3-1) to (3-19)
3.1 Introduction to Deep Learning .... 3-1
3.2 Deep Feedforward Networks .... 3-7
3.3 Gradient-Based Learning .... 3-11
3.4 Architecture Design .... 3-14
3.5 Back-Propagation and Other Differentiation Algorithms .... 3-14
Fill in the Blanks with Answers for Mid Term Exam
Multiple Choice Questions with Answers for Mid Term Exam

Unit IV
Chapter 4 : Regularization for Deep Learning (4-1) to (4-14)
4.1 Parameter Norm Penalties .... 4-1
4.2 Norm Penalties as Constrained Optimization .... 4-4
4.3 Dataset Augmentation and Noise Robustness .... 4-5
4.4 Multi-task Learning and Early Stopping .... 4-6
4.5 Parameter Typing and Parameter Sharing .... 4-7
4.6 Bagging and Other Ensemble Methods .... 4-9
4.7 Adversarial Training
4.8 Tangent Distance
Fill in the Blanks with Answers for Mid Term Exam .... 4-13
Multiple Choice Questions with Answers for Mid Term Exam

Unit V
Chapter 5 : Optimization for Train Deep Models (5-1) to (5-6)
5.1 Challenges in Neural Network Optimization .... 5-1
5.2 Parameter Initialization Strategies and Algorithms with Adaptive Learning Rates .... 5-3
5.3 Approximate Second-Order Methods .... 5-4
5.4 Optimization Strategies and Meta-Algorithms
Applications
Fill in the Blanks with Answers for Mid Term Exam .... 5-6
Multiple Choice Questions with Answers for Mid Term Exam

Solved Model Question Paper (M-1) to (M-2)
UNIT - I : Artificial Neural Networks

1.1 : Introduction

Q.1 What is artificial neural network?
Ans. : An (artificial) neural network consists of units, connections and weights. Inputs and outputs are numeric.

Q.2 Define the term neural network.
Ans. : A neural network is a system composed of many simple processing elements operating in parallel, whose function is determined by network structure, connection strengths, and the processing performed at computing elements or nodes.

Q.3 Where are neural networks applicable?
Ans. :
a. In signature analysis : as a mechanism for comparing signatures made with those stored.
b. In process control : there are clearly applications to be made here; most processes cannot be determined as computable algorithms.
c. In monitoring : networks have been used to monitor the state of aircraft engines.

Q.4 List advantages of neural networks.
Ans. : The advantages of neural networks are due to their adaptive and generalization ability.
a) Neural networks are adaptive methods that can learn without any prior assumption about the underlying data.
b) Neural networks, namely the feed-forward multilayer perceptron and the radial basis function network, have been proven to be universal functional approximators.
c) Neural networks are non-linear models with good generalization ability.

Q.5 Classify different types of neurons. [JNTU : May-17, Marks 3]
Ans. : Three major neuron groups make up this classification : multipolar, bipolar, and unipolar.
- Unipolar neurons have a single short process that emerges from the cell body and divides T-like into proximal and distal branches.
- Bipolar neurons have two processes, an axon and a dendrite, that extend from opposite ends of the soma.
- Multipolar neurons, the most common type, have one axon and two or more dendrites.

Q.6 Explain useful properties and capabilities of neural network.
Ans. :
1. Nonlinearity : An artificial neuron can be linear or nonlinear. A neural network, made up of an interconnection of nonlinear neurons, is itself nonlinear.
2. Adaptivity : Neural networks have a built-in capability to adapt their synaptic weights to changes in the surrounding environment.
3. Contextual information : Knowledge is represented by the very structure and activation state of a neural network.
4. Evidential response : In the context of pattern classification, a neural network can be designed to provide information not only about which particular pattern to select, but also about the confidence in the decision made.
5. Uniformity of analysis and design : Neural networks enjoy universality as information processors.
6. VLSI implementability : The massively parallel nature of a neural network makes it potentially fast for the computation of certain tasks.


Q.7 With the help of a neat diagram, explain the analogy of a biological neuron. [JNTU : May-17, Marks 5]
Ans. : Artificial neural systems are inspired by biological neural systems. The elementary building block of biological neural systems is the neuron. Fig. Q.7.1 shows a biological neuron.
- The single cell neuron consists of the cell body or soma, the dendrites and the axon. The dendrites receive signals from the axons of other neurons. The small space between the axon of one neuron and the dendrite of another is the synapse. The afferent dendrites conduct impulses toward the soma. The efferent axon conducts impulses away from the soma.

Basic components of biological neurons :
1. The majority of neurons encode their activations or outputs as a series of brief electrical pulses.
2. The neuron's cell body (soma) processes the incoming activations and converts them into output activations.
3. The neuron's nucleus contains the genetic material in the form of DNA. This exists in most types of cells, not just neurons.
4. Dendrites are fibres which emanate from the cell body and provide the receptive zones that receive activation from other neurons.
5. Axons are fibres acting as transmission lines that send activation to other neurons.
6. The junctions that allow signal transmission between the axons and dendrites are called synapses. The process of transmission is by diffusion of chemicals called neurotransmitters across the synaptic cleft.

[Fig. Q.7.1 : Schematic of a biological neuron - soma with nucleus, dendrites, axon hillock, axon and terminal buttons.]

Q.8 List the three basic elements of the neural model.
Ans. : Basic elements of the neural model are synapses or connecting links, an adder and an activation function.

Q.9 Define dendrites, soma, axon and synapses.
Ans. :
- Dendrites : They are tree-like branches, responsible for receiving information from the other neurons the neuron is connected to. In another sense, we can say that they are like the ears of the neuron.
- Soma : It is the cell body of the neuron and is responsible for processing the information received from the dendrites.
- Axon : It is just like a cable through which neurons send the information.
- Synapses : It is the connection between the axon and other neuron dendrites.

Q.10 What are the characteristics and applications of ANN?
Ans. : Characteristics of artificial neural networks :
1. Large number of very simple neuron-like processing elements.
2. Large number of weighted connections between the elements.
3. Distributed representation of knowledge over the connections.


4. Knowledge is acquired by the network through a learning process.

Applications of ANN :
1. Controlling the movements of a robot based on self-perception and other information;
2. Deciding the category of potential food items in an artificial world;
3. Recognizing a visual object;
4. Predicting where a moving object goes, when a robot wants to catch it.

Q.11 List out the strengths and weaknesses of artificial neural network.
Ans. : Strengths :
1. The greatest power of neural networks is that, endowed with a finite number of hidden units, they can approximate any continuous function to any desired degree of accuracy. This is commonly referred to as the universal approximation property.
2. No prior knowledge of the data-generating process is needed for implementing a NN.
3. The problem of model misspecification does not occur.
4. In the case of a NN, since no specifications are used, the network merely learns the hidden relationship in the data.
5. Adaptive learning : an ability to learn how to do tasks based on the data given for training or initial experience.
6. Self-organisation : an ANN can create its own organisation or representation of the information it receives during learning time.

Weaknesses :
1. The addition of too many hidden units incites the problem of over-fitting the data.
2. The construction of the NN model can be a time-consuming process.

Q.12 Difference between digital computer and neural network.
Ans. :
1. Digital computer : Deductive reasoning - we apply known rules to input data to produce output. Neural network : Inductive reasoning - given input and output data (training examples), we construct the rules.
2. Digital computer : Computation is centralized, synchronous and serial. Neural network : Computation is collective, asynchronous and parallel.
3. Digital computer : Memory is packetted, literally stored and location addressable. Neural network : Memory is distributed, internalized and content addressable.
4. Digital computer : Not fault tolerant - one transistor goes and it no longer works. Neural network : Fault tolerant - redundancy and sharing of responsibilities.
5. Digital computer : Fast - measured in millionths of a second. Neural network : Slow - measured in thousandths of a second.
6. Digital computer : Exact. Neural network : Inexact.
7. Digital computer : Static connectivity. Neural network : Dynamic connectivity.
8. Digital computer : Applicable if well defined rules with precise input data exist. Neural network : Applicable if rules are unknown or complicated, or if data is noisy or partial.

Q.13 Explain with diagram representation of neural network.
Ans. : Fig. Q.13.1 shows the neural network representation.
[Fig. Q.13.1 : Artificial neural network - inputs feed an input layer, which connects through a hidden layer to an output layer.]
- A neural network consists of a large number of simple elements (neurons) connected between them in a system. The whole system is able to solve complex tasks and to learn, like a natural brain.


- For the user, a NN is a black box with an input vector (source data) and an output vector (result).
- A neural network is usually structured into an input layer of neurons, one or more hidden layers and one output layer.
- Neurons belonging to adjacent layers are usually fully connected, and the various types and architectures are identified both by the different topologies adopted for the connections and by the choice of activation function.
- The values of the functions associated with the connections are called "weights".
- The whole game of using NNs is in the fact that, in order for the network to yield appropriate outputs for given inputs, the weights must be set to suitable values. The way this is obtained allows a further distinction among modes of operation.
- A neural network is a processing device, either an algorithm or actual hardware, whose design was motivated by the design and functioning of human brains and components thereof.
- Most neural networks have some sort of "training" rule whereby the weights of connections are adjusted on the basis of presented patterns. In other words, neural networks "learn" from examples, just like children learn to recognize dogs from examples of dogs, and exhibit some structural capability for generalization.
- Neural networks normally have great potential for parallelism, since the computations of the components are independent of each other.
- Neural networks are a different paradigm for computing :
1. Von Neumann machines are based on the processing/memory abstraction of human information processing.
2. Neural networks are based on the parallel architecture of animal brains.
- Neural networks are a form of multiprocessor computer system, with :
a. Simple processing elements
b. A high degree of interconnection
c. Simple scalar messages
d. Adaptive interaction between elements.

Q.14 What are appropriate problems for neural network learning?
Ans. : The backpropagation algorithm is the most commonly used ANN learning technique. It is appropriate for problems with the following characteristics :
1. Instances are represented by many attribute-value pairs.
2. The target function output may be discrete-valued, real-valued, or a vector of several real- or discrete-valued attributes.
3. The training examples may contain errors. Long training times are acceptable.
4. Fast evaluation of the learned target function may be required.
5. The ability for humans to understand the learned target function is not important.

Q.15 Explain (i) Integrate and Fire Neuron model (ii) Spiking Neuron model. [JNTU : May-17, Marks 5]
Ans. : (i) Integrate and Fire Neuron
- The integrate-and-fire neuron model is one of the most widely used models for analysing the behavior of neural systems.
- It describes the membrane potential of a neuron in terms of the synaptic inputs and the injected current that it receives.
- An action potential (spike) is generated when the membrane potential reaches a threshold, but the actual changes associated with the membrane voltage and conductance driving the action potential do not form part of the model.
- Fig. Q.15.1 shows the integrate-and-fire neuron model.
- The basic circuit of an integrate-and-fire model consists of a capacitor C in parallel with a resistor R driven by a current I(t).
- The driving current can be split into two components, I(t) = I_R + I_C.
- The first component is the resistive current I_R, which passes through the linear resistor R. It can be calculated from Ohm's law as I_R = u/R, where u is the voltage across the resistor.
- The second component I_C charges the capacitor C. From the definition of the capacity as C = q/u (where q is the charge and u the voltage), we find a capacitive current I_C = C du/dt.
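Combining the two components gives the membrane equation C du/dt = -u/R + I(t), integrated until the threshold is reached. The short Python sketch below simulates this behaviour with simple Euler steps; all parameter values (current, resistance, capacitance, threshold, reset value) are illustrative choices and do not come from the text:

# Minimal leaky integrate-and-fire simulation (illustrative parameters).
# Membrane equation: C du/dt = -u/R + I(t); a spike is recorded and u is
# reset whenever the membrane potential crosses the threshold from below.
def simulate_lif(I=2.0, R=1.0, C=10.0, threshold=1.0, u_reset=0.0,
                 dt=0.1, steps=1000):
    u = u_reset
    spike_times = []
    for step in range(steps):
        du = (-u / R + I) * dt / C      # Euler step of C du/dt = -u/R + I
        u += du
        if u >= threshold:              # threshold crossing defines a firing time
            spike_times.append(step * dt)
            u = u_reset                 # reset after the spike
    return spike_times

print(simulate_lif())                   # prints the firing times of the model neuron

With a constant supra-threshold current the sketch produces a regular spike train, which is the qualitative behaviour the model is meant to capture.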

[Fig. Q.15.1 : Integrate-and-fire neuron model - the spike train from neuron j arrives through the synapse at the soma, where an RC circuit integrates the input current.]

- The synaptic inputs to the neuron are considered to be stochastic and are described as a temporally homogeneous Poisson process.
- Methods and results for both current synapses and conductance synapses are examined in the diffusion approximation, where the individual contributions to the postsynaptic potential are small.

(ii) Spiking Neuron model
- Detailed conductance-based neuron models can reproduce electrophysiological measurements to a high degree of accuracy, but because of their intrinsic complexity these models are difficult to analyse.
- For this reason, simple phenomenological spiking neuron models are highly popular for studies of neural coding, memory, and network dynamics.
- A simple spiking neural model can carry out computations over the input spike trains under several different modes. Thus, spiking neurons compute when the input is encoded in temporal patterns, firing rates, firing rates and temporal correlations, and space-rate codes.
- An essential feature of spiking neurons is that they can act as coincidence detectors for the incoming pulses, by detecting whether they arrive at almost the same time.
- Spikes are generated whenever the membrane potential u crosses some threshold v from below. The moment of threshold crossing defines the firing time t(f) :
  t(f) : u(t(f)) = v  and  du(t)/dt | t = t(f) > 0

1.2 : Supervised Learning Networks

Q.16 Define supervised learning.
Ans. : Supervised learning is learning in which the network is trained by providing it with input and matching output patterns. These input-output pairs are usually provided by an external teacher. A supervised learning algorithm analyzes the training data and produces an inferred function, which is called a classifier or a regression function.


Q.17Explainsupervised learning. to the


model during the learning process. The
construction of a proper training,
validationand
Ans.:Supervisedlearning:Supervised is
learning testset iscrucial.
These methods are usually fast
the machine learningtask of
inferringa function from and accurate.
supervised trainingdata.The training data consistof
a set of training
examples. The taskof the supervised
Have to be able to generalize give the correct
resultswhen new data are givenin inputwithout
learneris to predictthe output behavior of a system
for any set of inputvalues,afteran initial training knowing a priorithe target.

phase. Supervisedlearningisthe machineleamingtask of


inferringa functionfrom supervisedtraining data.
Supervised learningin which the network is
trained by providingit with inputand matching The trainingdata consist of a set of training
examples. Insupervisedlearning,each example isa
output patterns. These input-outputpairs are
pair consistingof an input objectand a desired
usuallyprovidedby an externalteacher.
Human outputvalue.
learningisbased on the pastexperiences.
A
computer does not have experiences.
A supervised learning algorithm analyzes the
trainingdata and produces an inferredfunction,
A computer system learns from data, which which iscalleda classifier
or a regressionfunction.
representsome of
"pastexperiences" an application shows
domain. Fig.Q.17.1 supervisedlearningprocess.
The learnedmodel helps the system to perform task
To learn a targetfunctionthat can be used to betteras compared tono learning.
predictthe values of a discreteclassattribute,
e.g.,
approve or not-approved,and high-riskor low risk.
Each inputvector requiresa correspondingtarget

or inductive learning.
Classification
:
The task is commonly called Supervised learning, vector.

Pair (InputVector,Target Vector)


Training
Trainingdata includesboth the input and the

desiredresults.For
some
examples the correct
Fig.
Q.17.2shows inputvector.
are known
results(targets) and are givenin input

data

H
Training Leaming
algorithm

Training

Fig.Q.17.1
Model

Supervisedlearningprocess
Test
data

Testing
Accuracy

Input
Neural
netwonk

7 Actual
output

Error
signal
Desired generate
output

Fig.Q.17.2Input vector


- Supervised learning denotes a method in which some input vectors are collected and presented to the network. The output computed by the network is observed and the deviation from the expected answer is measured. The weights are corrected according to the magnitude of the error, in the way defined by the learning algorithm.
- Supervised learning is further divided into methods which use reinforcement or error correction. The perceptron learning algorithm is an example of supervised learning with reinforcement.
- In order to solve a given problem of supervised learning, the following steps are performed :
1. Find out the type of training examples.
2. Collect a training set.
3. Determine the input feature representation of the learned function.
4. Determine the structure of the learned function and the corresponding learning algorithm.
5. Complete the design and then run the learning algorithm on the collected training set.
6. Evaluate the accuracy of the learned function. After parameter adjustment and learning, the performance of the resulting function should be measured on a test set that is separate from the training set.

Q.18 What is perceptron?
Ans. : An arrangement of one input layer of McCulloch-Pitts neurons feeding forward to one output layer of McCulloch-Pitts neurons is known as a perceptron.
- The perceptron is a feed-forward network with one output neuron that learns a separating hyperplane in a pattern space.
- The "n" linear Fx neurons feed forward to one threshold output Fy neuron. The perceptron separates linearly separable sets of patterns.

Q.19 Discuss the representable power of a perceptron. [JNTU : Dec.-17, Marks 5]
Ans. : We can view the perceptron as representing a hyperplane decision surface in the n-dimensional space of instances.
- The perceptron outputs a 1 for instances lying on one side of the hyperplane and outputs a -1 for instances lying on the other side.
- Some sets of positive and negative examples cannot be separated by any hyperplane. Those that can be separated are called linearly separable sets of examples.
- A single perceptron can be used to represent many Boolean functions. For example, if we assume Boolean values of 1 (true) and -1 (false), then one way to use a two-input perceptron to implement the AND function is to set the weights appropriately.
- Consider two-input patterns (X1, X2) being classified into two classes as shown in Fig. Q.19.1. Each point with either symbol x or o represents a pattern with a set of values (X1, X2). Each pattern is classified into one of two classes. Notice that these classes can be separated by a single line L. They are known as linearly separable patterns.
[Fig. Q.19.1 : Linearly separable patterns - x : Class I (y = 1), o : Class II (y = -1), separated by the line L.]
- Linear separability refers to the fact that classes of patterns with n-dimensional vector x = (x1, x2, ..., xn) can be separated with a single decision surface. In the case above, the line L represents the decision surface.
- If two classes of patterns can be separated by a decision boundary, represented by a linear equation, then they are said to be linearly separable. The simple network can correctly classify any such patterns.
- The decision boundary (i.e., W, b or q) of linearly separable classes can be determined either by some learning procedures or by solving linear equation systems based on representative patterns of each class.
- If such a decision boundary does not exist, then the two classes are said to be linearly inseparable. Linearly inseparable problems cannot be solved by the simple network; a more sophisticated architecture is needed.
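The perceptron learning procedure referred to above can be written out directly. The sketch below is an illustration only (the bipolar AND data, learning rate and stopping rule are assumptions, not taken from the text); it trains a two-input perceptron on a linearly separable problem:

# Perceptron learning rule on the bipolar AND problem (illustrative sketch).
# Weights are changed only when the perceptron misclassifies a pattern.
def sign(a):
    return 1 if a >= 0 else -1

patterns = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]
w = [0.0, 0.0]      # weights w1, w2
b = 0.0             # bias
eta = 0.1           # learning rate

for epoch in range(20):
    errors = 0
    for (x1, x2), t in patterns:
        y = sign(w[0] * x1 + w[1] * x2 + b)
        if y != t:                       # update only on misclassification
            w[0] += eta * t * x1
            w[1] += eta * t * x2
            b += eta * t
            errors += 1
    if errors == 0:                      # AND is linearly separable, so training stops
        break

print("weights:", w, "bias:", b)

Because the classes are linearly separable, the loop terminates with a weight vector and bias that define a separating line of the kind shown in Fig. Q.19.1.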


Q.20 Discuss the decision surface of perceptron. [JNTU : Dec.-16, Marks 5]
Ans. : A single perceptron can be used to represent many Boolean functions. For example, if we assume Boolean values of 1 (true) and -1 (false), then one way to use a two-input perceptron is to implement the AND function.
- A perceptron can represent all of the primitive Boolean functions AND, OR, NAND (negated AND) and NOR (negated OR). Unfortunately, however, some Boolean functions cannot be represented by a single perceptron, such as the XOR function, whose value is 1 if and only if x1 is not equal to x2.
- The decision surface represented by a two-input perceptron :
a) A set of training examples and the decision surface of a perceptron that classifies them correctly.
b) A set of training examples that is not linearly separable (i.e., that cannot be correctly classified by any straight line). X1 and X2 are the perceptron inputs. Positive examples are indicated by "+", negative by "-".

1. Logical AND function (bipolar patterns) : (1, 1) -> 1; (1, -1) -> -1; (-1, 1) -> -1; (-1, -1) -> -1.
Decision boundary : w1 = 1, w2 = 1, b = -1, i.e., -1 + x1 + x2 = 0.
[Fig. Q.20.1 : Decision boundary of the AND function - x : Class I (y = 1), o : Class II (y = -1).]

2. Logical OR function (bipolar patterns) : (1, 1) -> 1; (1, -1) -> 1; (-1, 1) -> 1; (-1, -1) -> -1.
Decision boundary : w1 = 1, w2 = 1, b = 1, i.e., 1 + x1 + x2 = 0.
[Fig. Q.20.2 : Decision boundary of the OR function - x : Class I (y = 1), o : Class II (y = -1).]

Q.21 What is an Adaline? Explain in detail.
Ans. :
- Adaline, which stands for Adaptive Linear Neuron, is a network having a single linear unit.
- The basic structure of Adaline is similar to the perceptron, having an extra feedback loop with the help of which the actual output is compared with the desired/target output. After comparison, on the basis of the training algorithm, the weights and bias will be updated.
- Fig. Q.21.1 shows the Adaline.
- If the input conductances are denoted by w_i, where i = 0, 1, 2, ..., n, and the input and output signals by x_i and y, respectively, then the output of the central block is defined to be
  y = sum over i of w_i x_i + theta,  where theta = w_0.
- In a simple physical implementation, this device consists of a set of controllable resistors connected to a circuit which can sum up the currents caused by the input voltage signals.
An up thrustforknowledge DECODE
TECHNICALPUBLICATIONS
NeuralNetuworks and
Deep Leaning
1-9 Neural Networks
Artificial

Level

Output

Error Quantizer

Gains Summer

inpur
pattern
switches -1 +1
Referencee
SWitch

Fig.Q.21.1

the summer is also followed by a quantizerwhich To decrease Ep by gradient descent,the update


outputs either+1 of -1,dependingon the polarity
of the sum.
formula forWi on the p"input-output
patternis

The problem is to determine the coefficients Ap Wi ntp-°p)xi


w
where i
= 0, 1,. n, insuch a way thatthe input Q.22 What do you mean by linearseparability
?
outputresponse is correctfor a large number of Ans.
chosen signalsets.
arbitrarily
Consider two-input patterns (X,X2) being8
If an exact mapping is not possiblethe average intotwo classesas shown in Fig.Q.2.1.
classified
errormust be minimized, forinstance,in the sense Each pointwitheithersymbol of x or 0 representsa
of leastsquares.
patternwith a set of values (X1,X2)
An adaptiveoperationmeans that there existsa Each pattern isclassifiedintoone of two classes.
mechanism by which the wi can be adjusted, Noticethat these classescan be separated with a
to attainthe correct
usuallyiteratively values. singlelineL.They are known as linearlyseparable
patterns.
For thep"input-output
pattern,the errormeasure
Adalinecanbe expressedas,
of a single-output

Ep p-op
Where
tpTarget
output
*
Op
Actualoutput of the Adaline O
The derivation
of
Ep with respect to each weight

Wiis
OEp
dWi -2(tp-°p)X
Fig.Q.22.1
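This gradient-descent update can be turned into a short training loop. The sketch below is illustrative (the bipolar data, learning rate and number of epochs are assumptions, not taken from the text); it applies the update Delta w_i = eta (t_p - o_p) x_i to a single Adaline with a linear output:

# Delta-rule (LMS) training of a single Adaline unit (illustrative sketch).
# The weights move against the gradient of the squared error (t_p - o_p)^2.
patterns = [((1.0, 1.0), 1.0), ((1.0, -1.0), -1.0),
            ((-1.0, 1.0), -1.0), ((-1.0, -1.0), -1.0)]
w = [0.0, 0.0]   # weights w1, w2
b = 0.0          # bias (w0)
eta = 0.05       # learning rate

for epoch in range(50):
    for (x1, x2), t in patterns:
        o = w[0] * x1 + w[1] * x2 + b        # linear output of the Adaline
        err = t - o
        w[0] += eta * err * x1               # delta rule: eta * (t - o) * x_i
        w[1] += eta * err * x2
        b += eta * err

print("weights:", w, "bias:", b)             # approaches the least-squares solution

Unlike the perceptron rule, the weights are adjusted on every presentation, so the solution minimizes the average squared error rather than just separating the classes.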


Q.22 What do you mean by linear separability?
Ans. :
- Consider two-input patterns (X1, X2) being classified into two classes as shown in Fig. Q.22.1. Each point with either symbol x or o represents a pattern with a set of values (X1, X2). Each pattern is classified into one of two classes. Notice that these classes can be separated with a single line L. They are known as linearly separable patterns.
[Fig. Q.22.1 : Two linearly separable classes of patterns separated by the line L.]
- Linear separability refers to the fact that classes of patterns with n-dimensional vector x = (x1, x2, ..., xn) can be separated with a single decision surface. In the case above, the line L represents the decision surface.
- If two classes of patterns can be separated by a decision boundary, represented by a linear equation, then they are said to be linearly separable. The simple network can correctly classify any such patterns.
- The decision boundary (i.e., W, b or q) of linearly separable classes can be determined either by some learning procedures or by solving linear equation systems based on representative patterns of each class.
- If such a decision boundary does not exist, then the two classes are said to be linearly inseparable. Linearly inseparable problems cannot be solved by the simple network; a more sophisticated architecture is needed.

1.3 : Back-propagation Networks

Q.23 What is back propagation neural network?
Ans. : Backpropagation is a training method used for a multilayer neural network. It is also called the generalized delta rule. It is a gradient descent method which minimizes the total squared error of the output computed by the net.

Q.24 List the training stages of a neural network by back propagation.
Ans. : The training of a neural network by back propagation takes place in three stages :
1. Feedforward of the input pattern.
2. Calculation and back propagation of the associated error.
3. Adjustment of the weights.

Q.25 Explain in brief architecture of multilayer feed-forward neural network.
Ans. : A multilayer feed-forward neural network is a network consisting of multiple layers of units, all of which are adaptive.
- The network is not allowed to have cycles from later layers back to earlier layers, hence the name "feed-forward".
- Let us consider a network with a single complete hidden layer, i.e., the network consists of some input nodes, some output nodes, and a set of hidden nodes. Every hidden node takes inputs from each of the input nodes, and feeds into each of the output nodes.
- Fig. Q.25.1 shows a multilayer feed-forward neural network.
[Fig. Q.25.1 : Multilayer feed-forward neural network - signals feed forward from the input layer through the hidden layer to the output layer, while errors back-propagate.]
- This structure is called multilayer because it has a layer of processing units (i.e., the hidden units) in addition to the output units.
- These networks are called feedforward because the output from one layer of neurons feeds forward into the next layer of neurons. There are never any backward connections, and connections never skip a layer.
- Each connection between nodes has a weight associated with it. In addition, there is a special weight (called w0) that feeds into every node at the hidden layer and a special weight (called z0) that feeds into every node at the output layer. These weights are called the bias, and set the thresholding values for the nodes. Initially, all of the weights are set to some small random values near zero.
TECHNICAL PUBLICATIONS An up thrust


forknowledge OECODE
NeuralNetworks and Deep Learning 1-11 NeuralNetworks
Artificial

- Every node in the hidden layer and in the output layer processes its weighted input to produce an output. This can be done slightly differently at the hidden layer compared to the output layer.
- Input units : The input data you provide your network comes through the input units. No processing takes place in an input unit; it simply feeds data into the system.
- Hidden units : The connections coming out of an input unit have weights associated with them. A weight going to hidden unit z_h from input unit x_i would be labeled w_hi. The bias input node (x_0) is connected to all the hidden units, with weights w_h0. Each hidden node calculates the weighted sum of its inputs and applies a thresholding function to determine the output of the hidden node. The weighted sum of the inputs for hidden node z_h is calculated as the sum over j (starting from j = 0) of w_hj x_j.
- The thresholding function applied at the hidden node is typically either a step function or a sigmoid function. The sigmoid function is sometimes called the "squashing" function, because it squashes its input (i.e., a) to a value between 0 and 1. In multi-layer feed-forward neural networks, the sigmoid activation function, defined by g(a) = 1 / (1 + e^(-a)), is normally used.
- The output layer : Functionally just like the hidden layers. Outputs are passed on to the world outside the neural network.

Q.26 How does the network learn?
Ans. : The training samples are passed through the network and the output obtained from the network is compared with the actual output.
- This error is used to change the weights of the neurons such that the error decreases gradually.
- This is done using the backpropagation algorithm, also called backprop. Iteratively passing batches of data through the network and updating the weights, so that the error is decreased, is known as Stochastic Gradient Descent (SGD).
- The amount by which the weights are changed is determined by a parameter called the learning rate.

Q.27 Why multilayer perceptron neural network?
Ans. : Neural networks, with their remarkable ability to derive meaning from complicated or imprecise data, can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques.
- A trained neural network can be thought of as an "expert" in the category of information it has been given to analyse.
- This expert can then be used to provide projections given new situations of interest and answer "what if" questions.
- Other advantages include :
1. Adaptive learning : an ability to learn how to do tasks based on the data given for training or initial experience.
2. One of the preferred techniques for gesture recognition.
3. MLPs / neural networks do not make any assumption regarding the underlying probability density functions or other probabilistic information about the pattern classes under consideration, in comparison to other probability-based models.
4. They yield the required decision function directly via training.
5. A two-layer backpropagation network with sufficient hidden nodes has been proven to be a universal approximator.

Q.28 Explain backpropagation learning rule.
Ans. : The net input of a node is defined as the weighted sum of the incoming signals plus a bias term. Fig. Q.28.1 shows the backpropagation MLP for node j. The net input and output of node j are as follows :
  net_j = sum over i of x_i w_ij + w_j,   x_j = f(net_j) = 1 / (1 + exp(-net_j))
[Fig. Q.28.1 : Backpropagation MLP for node j - the inputs x_i are weighted, summed with the bias, and passed through the sigmoid activation.]

Here x_i is the output of node i located in any one of the previous layers, w_ij is the weight associated with the link connecting nodes i and j, and w_j is the bias of node j.
- The internal parameters associated with each node j are the weights w_ij. So changing the weights of the node will change the behaviour of the whole backpropagation MLP.
- Fig. Q.28.2 shows a two-layer backpropagation MLP.
[Fig. Q.28.2 : Two-layer backpropagation MLP - input signals flow forward through the input, hidden and output layers, while error corrections back-propagate.]
- The above backpropagation MLP will be referred to as a 3-4-3 network, corresponding to the number of nodes in each layer.
- The backward error propagation is also known as backpropagation (BP) or the Generalized Delta Rule (GDR). A squared error measure for the p-th input-output pair is defined as
  E_p = sum over k of (d_k - x_k)^2
  where d_k is the desired output for node k and x_k is the actual output for node k when the input part of the p-th data pair is presented.

Q.29 Write the characteristics and applications of error back propagation algorithm.
Ans. : Characteristics :
- It is an algorithm for supervised learning of artificial neural networks using gradient descent.
- It learns weights for a multilayer network, given a fixed set of units and interconnections.
- In multilayer networks the error surface can have multiple minima, but in practice backpropagation has produced excellent results in many real-world applications.
- The algorithm is for two layers of sigmoid units and does stochastic gradient descent.
- It uses gradient descent to minimize the squared error between the network outputs and the target values for these outputs.

Applications :
- The fast development of artificial satellite technology has increased the importance of three-dimensional (3D) positioning and, therefore, satellite geodesy.
- Particularly, the Global Positioning System (GPS) provides more practical, rapid, precise and continuous positioning results anywhere on the Earth in geodetic applications when compared to the traditional terrestrial positioning methods.
- Due to the increasing use of GPS positioning techniques, great attention has been paid to the precise determination of local/regional geoids, aiming at replacing geometric leveling with GPS measurements.
- Therefore, the BPANN method is easily programmable with a decreased or increased number of reference points when generating a local GPS; it performs flexible modelling.
- Also, the BPANN method is open to updating, which could be accepted as an important advantage. Thus, it is believed that the BPANN method is more convenient for generating local GPS when compared to other methods.
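To make the backpropagation procedure of Q.28 concrete, here is a compact two-layer sketch using the sigmoid activation and the squared error E_p = sum over k of (d_k - x_k)^2. The XOR data set, the 2-3-1 network size, the learning rate and the number of epochs are illustrative assumptions, not taken from the text:

import math, random

# Two-layer backpropagation sketch (sigmoid hidden and output units).
random.seed(0)

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

n_in, n_hid, n_out = 2, 3, 1
# each weight vector carries the bias as its last entry
W1 = [[random.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hid)]
W2 = [[random.uniform(-0.5, 0.5) for _ in range(n_hid + 1)] for _ in range(n_out)]
data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
eta = 0.5

for epoch in range(5000):
    for x, d in data:
        xi = x + [1.0]                                    # input plus bias
        h = [sigmoid(sum(w * v for w, v in zip(row, xi))) for row in W1]
        hi = h + [1.0]                                    # hidden plus bias
        o = [sigmoid(sum(w * v for w, v in zip(row, hi))) for row in W2]
        # output deltas: error times the derivative of the sigmoid
        delta_o = [(d[k] - o[k]) * o[k] * (1 - o[k]) for k in range(n_out)]
        # hidden deltas: back-propagate the output deltas through W2
        delta_h = [h[j] * (1 - h[j]) * sum(delta_o[k] * W2[k][j] for k in range(n_out))
                   for j in range(n_hid)]
        for k in range(n_out):                            # weight adjustment, output layer
            for j in range(n_hid + 1):
                W2[k][j] += eta * delta_o[k] * hi[j]
        for j in range(n_hid):                            # weight adjustment, hidden layer
            for i in range(n_in + 1):
                W1[j][i] += eta * delta_h[j] * xi[i]

for x, d in data:                                         # feed-forward check after training
    xi = x + [1.0]
    h = [sigmoid(sum(w * v for w, v in zip(row, xi))) for row in W1] + [1.0]
    o = [sigmoid(sum(w * v for w, v in zip(row, h))) for row in W2]
    print(x, "->", round(o[0], 2), "target", d[0])

The two stages listed in Q.24 and Q.35 are visible in the loop: a feed-forward pass to compute the outputs, followed by a backward pass that propagates the deltas and adjusts the weights.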


Q.30 List out merits and demerits of EBP.
Ans. : Merits / Strengths :
1. Computing time is reduced if the weights chosen are small at the beginning.
2. It minimizes the error.
3. Batch update of weights exists, which provides a smoothing effect on the weight correction.
4. Simple method and easy to implement.
5. Finds a minimum of the error function in weight space.
6. Standard method and generally works well.

Demerits :
1. Training may sometimes cause temporal instability to the system.
2. For complex problems, it takes a lot of time.
3. Selection of the number of hidden nodes in the network is a problem.
4. Backpropagation learning does not require normalization of input vectors; however, normalization could improve performance.
5. It can get stuck in local minima, resulting in sub-optimal solutions.
6. Slow and inefficient.

1.4 : Associative Memory Networks

Q.31 What is meant by associative memory? [JNTU : May-17, Marks 2]
Ans. : An associative memory can be considered as a memory unit whose stored data can be identified for access by the content of the data itself rather than by an address or memory location. Associative memory is often referred to as Content Addressable Memory (CAM).

Q.32 Define autoassociative memory.
Ans. : This is a single layer neural network in which the input training vector and the output target vectors are the same. The weights are determined so that the network stores a set of patterns. The process of training is called storing the vectors. If vector "t" is the same as "s", the net is auto-associative. The representation can be bipolar or binary.

Q.33 Describe auto-associative memory.
Ans. :
- An auto-associative net remembers one or more patterns. For an auto-associative net, the training input and target output vectors are identical.
- Auto-associative memories are content based memories which can recall a stored sequence when they are presented with a fragment or a noisy version of it.
- They are very effective in de-noising the input or removing interference from the input, which makes them a promising first step in solving the cocktail party problem.
- The simplest version of auto-associative memory is the linear associator, which is a 2-layer feed-forward fully connected neural network where the output is constructed in a single feed-forward computation. Fig. Q.33.1 illustrates its basic connectivity.
[Fig. Q.33.1 : Linear associator - every input is connected to every output through the weight matrix W (weights w11 ... wnn).]
- All inputs are connected to all outputs via the connection weight matrix W, where w_ij denotes the strength of the unidirectional connection from the i-th input to the j-th output.
- Since x = y in auto-associative memories, we have x = W^T x. Therefore, all stored sequences must be eigenvectors of matrix W. Assuming all stored sequences are orthogonal to each other, we can represent the weight matrix as W = X X^T, where X is the orthonormal matrix of all stored sequences.
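A minimal sketch of storing and recalling patterns in such a linear auto-associator is shown below. It is an illustration under stated assumptions (bipolar patterns chosen to be orthogonal, recall by applying the sign function to W x); the vectors are invented for the example and do not come from the text:

# Linear auto-associative memory sketch (illustrative bipolar patterns).
# Weights are the sum of outer products W = sum_m x_m x_m^T; recall applies
# W to a (possibly noisy) probe and takes the sign of each component.
def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]

def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def sign_vec(v):
    return [1 if a >= 0 else -1 for a in v]

patterns = [[1, -1, 1, -1, 1, -1], [1, 1, -1, -1, 1, 1]]   # stored vectors (orthogonal)
n = len(patterns[0])
W = [[0] * n for _ in range(n)]
for p in patterns:
    O = outer(p, p)
    for i in range(n):
        for j in range(n):
            W[i][j] += O[i][j]

probe = [1, -1, 1, -1, 1, 1]          # noisy version of the first pattern
print(sign_vec(matvec(W, probe)))     # recalls [1, -1, 1, -1, 1, -1]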


- Not only does it remember patterns exactly, but in addition, shown some degraded version of a pattern, it returns the original uncorrupted version of the pattern.
- Either the Hebb rule, the Delta rule or the Extended Delta rule can be used for training.
- It is often the case that for auto-associative nets, the diagonal weights (those which connect an input component to the corresponding output component) are set to 0.

Q.34 What is Hebbian learning? [JNTU : May-17, Marks 2]
Ans. : The Hebb rule is the simplest and most common method of determining weights for an associative memory neural net. It can be used with patterns that are represented as either binary or bipolar vectors.

Q.35 Explain delta learning rule for multiperceptron layer.
Ans. : An important generalization of the perceptron training algorithm was presented by Widrow and Hoff as the least mean square learning procedure, also known as the delta rule.
- The learning rule was applied to the "adaptive linear element", also named Adaline.
- The perceptron learning rule uses the output of the threshold function for learning. The delta rule uses the net output without further mapping into the output values -1 or +1.
- Fig. Q.35.1 shows the Adaline.
[Fig. Q.35.1 : Adaline - input pattern switches feed gains (weights) into a summer; the summer is followed by a quantizer producing -1 or +1; the error against the reference switch drives the weight adjustment.]
- If the input conductances are denoted by w_i, where i = 0, 1, 2, ..., n, and the input and output signals by x_i and y, respectively, then the output of the central block is defined to be y = sum over i of w_i x_i + theta, where theta = w_0.
- In a simple physical implementation, this device consists of a set of controllable resistors connected to a circuit which can sum up currents caused by the input voltage signals. Usually the central block, the summer, is also followed by a quantizer which outputs either -1 or +1, depending on the polarity of the sum.
- The problem is to determine the coefficients w_i, where i = 0, 1, ..., n, in such a way that the input-output response is correct for a large number of arbitrarily chosen signal sets.
- If an exact mapping is not possible, the average error must be minimized, for instance, in the sense of least squares.
- An adaptive operation means that there exists a mechanism by which the w_i can be adjusted, usually iteratively, to attain the correct values.
- For the Adaline, Widrow introduced the delta rule to adjust the weights.


- For the p-th input-output pattern, the error measure of a single-output Adaline can be expressed as E_p = (t_p - o_p)^2, where t_p = target output and o_p = actual output of the Adaline.
- The derivative of E_p with respect to each weight w_i is dE_p/dw_i = -2 (t_p - o_p) x_i.
- To decrease E_p by gradient descent, the update formula for w_i on the p-th input-output pattern is Delta_p w_i = eta (t_p - o_p) x_i.
- Since the delta rule tries to minimize squared errors, it is also referred to as the least mean square learning procedure or the Widrow-Hoff learning rule.
- Features of the delta rule are as follows :
1. Simplicity.
2. Distributed learning : learning is not reliant on central control of the network.
3. Online learning : weights are updated after the presentation of each pattern.
- Rules for the feedforward multilayer perceptron : The training algorithm is called the Error Back Propagation (EBP) training algorithm. If a submitted pattern provides an output far from the desired value, the weights and thresholds are adjusted so that the current mean square classification error is reduced.
- The training is repeated for all patterns until the training set provides an acceptable overall error. Usually the mapping error is computed over the full training set.
- The error back propagation algorithm works in two stages :
1. The trained network operates feed-forward to obtain the output of the network.
2. The weight adjustments propagate backward from the output layer through the hidden layer toward the input layer.

Q.36 What is Bidirectional Associative Memory (BAM)?
Ans. :
- The Hopfield network represents an auto-associative type of memory. It can retrieve a corrupted or incomplete memory but cannot associate this memory with another different memory.
- Human memory is essentially associative. One thing may remind us of another, and that of another, and so on. We use a chain of mental associations to recover a lost memory. If we forget where we left an umbrella, we try to recall where we last had it, what we were doing, and who we were talking to. We attempt to establish a chain of associations, and thereby to restore a lost memory.
- Bidirectional associative memory (BAM), first proposed by Bart Kosko, is a hetero-associative network. It associates patterns from one set, set A, to patterns from another set, set B, and vice versa.
- Like a Hopfield network, the BAM can generalize and also produce correct outputs despite corrupted or incomplete inputs.
- Fig. Q.36.1 shows the BAM operation (Fig. Q.36.1 shown on next page).
- The basic idea behind the BAM is to store pattern pairs so that when the n-dimensional vector X from set A is presented as input, the BAM recalls the m-dimensional vector Y from set B, but when Y is presented as input, the BAM recalls X.
- To develop the BAM, we need to create a correlation matrix for each pattern pair we want to store. The correlation matrix is the matrix product of the input vector X and the transpose of the output vector Y. The BAM weight matrix is the sum of all correlation matrices, that is
  W = sum over m = 1 to M of X_m Y_m^T
  where M is the number of pattern pairs to be stored in the BAM.
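The construction W = sum of X_m Y_m^T and the two-way recall can be sketched in a few lines. The pattern pairs below are illustrative assumptions, not taken from the text:

# Bidirectional associative memory sketch (illustrative bipolar pattern pairs).
# W = sum_m X_m Y_m^T; recall Y from X via sign(W^T x), and X from Y via sign(W y).
def sign_vec(v):
    return [1 if a >= 0 else -1 for a in v]

pairs = [([1, -1, 1, -1], [1, 1, -1]),
         ([1, 1, -1, -1], [-1, 1, 1])]
n, m = len(pairs[0][0]), len(pairs[0][1])

W = [[0] * m for _ in range(n)]            # n x m weight matrix
for X, Y in pairs:
    for i in range(n):
        for j in range(m):
            W[i][j] += X[i] * Y[j]

def recall_Y(x):                            # forward direction: X-layer -> Y-layer
    return sign_vec([sum(W[i][j] * x[i] for i in range(n)) for j in range(m)])

def recall_X(y):                            # backward direction: Y-layer -> X-layer
    return sign_vec([sum(W[i][j] * y[j] for j in range(m)) for i in range(n)])

print(recall_Y([1, -1, 1, -1]))   # -> [1, 1, -1]
print(recall_X([-1, 1, 1]))       # -> [1, 1, -1, -1]

Presenting either member of a stored pair recalls the other, which is exactly the forward and backward operation shown in Fig. Q.36.1.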


[Fig. Q.36.1 : BAM operation - (a) forward direction : the input layer x1(p) ... xn(p) recalls y1(p) ... ym(p) at the output layer; (b) backward direction : the output layer recalls x1(p+1) ... xn(p+1) at the input layer.]

Q.37 Explain stability and storage capacity of the BAM.
Ans. :
- The BAM is unconditionally stable. This means that any set of associations can be learned without risk of instability.
- The maximum number of associations to be stored in the BAM should not exceed the number of neurons in the smaller layer.
- The more serious problem with the BAM is incorrect convergence. The BAM may not always produce the closest association. In fact, a stable association may be only slightly related to the initial input vector.

Q.38 List the problems of BAM network.
Ans. :
1. Storage capacity of the BAM : the maximum number of associations to be stored in the BAM should not exceed the number of neurons in the smaller layer.
2. Incorrect convergence : the BAM may not always produce the closest association.

Q.39 What is content addressable memory?
Ans. :
- A content-addressable memory is a type of memory that allows for the recall of data based on the degree of similarity between the input pattern and the patterns stored in memory.
- It refers to a memory organization in which the memory is accessed by its content, as opposed to an explicit address like in the traditional computer memory system.
- Therefore, this type of memory allows the recall of information based on partial knowledge of its contents.

Q.40 What are the Delta Rules for Pattern Association?
Ans. :
- When the input vectors are linearly independent, the Delta Rule produces exact solutions.
- Whether the input vectors are linearly independent or not, the Delta Rule produces a least squares solution, i.e., it optimizes for the lowest sum of least squared errors.


Q.41 What is continuous BAM?
Ans. : A continuous BAM transforms input smoothly and continuously into output in the range [0, 1], using the logistic sigmoid function as the activation function for all units.

Q.42 Explain Hebbian rule with example.
Ans. :
- In 1949, Donald Hebb proposed one of the key ideas in biological learning, commonly known as Hebb's Law. Hebb's Law states that if neuron i is near enough to excite neuron j and repeatedly participates in its activation, the synaptic connection between these two neurons is strengthened and neuron j becomes more sensitive to stimuli from neuron i.
- Hebb's Law can be represented in the form of two rules :
1. If two neurons on either side of a connection are activated synchronously, then the weight of that connection is increased.
2. If two neurons on either side of a connection are activated asynchronously, then the weight of that connection is decreased.
- Hebb's Law provides the basis for learning without a teacher. Learning here is a local phenomenon occurring without feedback from the environment.
- Using Hebb's Law we can express the adjustment applied to the weight w_ij at iteration p in the following form :
  Delta w_ij(p) = F[y_j(p), x_i(p)]
- As a special case, we can represent Hebb's Law as follows :
  Delta w_ij(p) = alpha y_j(p) x_i(p)
  where alpha is the learning rate parameter. This equation is referred to as the activity product rule.
- Hebbian learning implies that weights can only increase. To resolve this problem, we might impose a limit on the growth of synaptic weights. It can be done by introducing a non-linear forgetting factor into Hebb's Law :
  Delta w_ij(p) = alpha y_j(p) x_i(p) - phi y_j(p) w_ij(p)
  where phi is the forgetting factor.
- The forgetting factor usually falls in the interval between 0 and 1, typically between 0.01 and 0.1, to allow only a little "forgetting" while limiting the weight growth.

Hebbian learning algorithm :
Step 1 : Initialisation. Set initial synaptic weights and thresholds to small random values, say in an interval [0, 1].
Step 2 : Activation. Compute the neuron output at iteration p :
  y_j(p) = sum over i = 1 to n of x_i(p) w_ij(p) - theta_j
  where n is the number of neuron inputs, and theta_j is the threshold value of neuron j.
Step 3 : Learning. Update the weights in the network :
  w_ij(p + 1) = w_ij(p) + Delta w_ij(p)
  where Delta w_ij(p) is the weight correction at iteration p, determined by the generalised activity product rule :
  Delta w_ij(p) = alpha y_j(p) x_i(p) - phi y_j(p) w_ij(p)
Step 4 : Iteration. Increase iteration p by one, go back to Step 2.

Hebbian learning example :
To illustrate Hebbian learning, consider a fully connected feed-forward network with a single layer of five computation neurons. Each neuron is represented by a McCulloch and Pitts model with the sign activation function. The network is trained on a set of input vectors (shown as column vectors in the original).

Initial and final weight matrices : In the original, the initial 5 x 5 weight matrix connects each output unit only to its own input (unit weights on the diagonal), while the final weight matrix obtained after Hebbian training shows grown weights (entries around 2.0204, 1.0200 and 0.9996) between the units that were repeatedly activated together, with the remaining weights essentially unchanged.

A test input vector, or probe, is then defined and presented to the trained network; the resulting output recalls the stored pattern associated with that probe (the numerical matrices and the probe computation are shown in the original).
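The four algorithm steps can be sketched directly in code. The network below uses five inputs and five outputs and the generalised activity product rule with a forgetting factor; the training vectors, threshold, learning rate and forgetting factor are illustrative assumptions and are not the exact numbers of the worked example above:

# Hebbian learning with a forgetting factor (illustrative sketch).
# Weight update: dw_ij = alpha * y_j * x_i - phi * y_j * w_ij
def step(a, theta=0.5):
    return 1 if a >= theta else 0

n = 5
# one output neuron per input, initially connected one-to-one (weight 1)
W = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
alpha, phi = 0.1, 0.02                      # learning rate and forgetting factor
training = [[0, 0, 1, 0, 0], [0, 1, 0, 0, 1], [0, 0, 1, 0, 1]]

for epoch in range(50):
    for x in training:
        y = [step(sum(W[i][j] * x[i] for i in range(n))) for j in range(n)]
        for i in range(n):
            for j in range(n):
                W[i][j] += alpha * y[j] * x[i] - phi * y[j] * W[i][j]

probe = [0, 0, 1, 0, 0]
print([step(sum(W[i][j] * probe[i] for i in range(n))) for j in range(n)])
# input 3 now also activates output 5, because the two were repeatedly active together

As in the worked example, the weights between jointly activated units grow while the forgetting term keeps them bounded, so a partial probe recalls the full learned association.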

1.5 : Hopfield Networks

Q.43 What is Discrete Hopfield Network?
Ans. : A Hopfield network which operates in a discrete-time fashion; in other words, the input and output patterns are discrete vectors, which can be either binary (0, 1) or bipolar (+1, -1) in nature. The network has symmetrical weights with no self-connections, i.e., w_ij = w_ji and w_ii = 0.

Q.44 What is Continuous Hopfield Network?
Ans. : In comparison with the Discrete Hopfield network, the continuous network has time as a continuous variable. It is also used in auto-association and optimization problems such as the travelling salesman problem.

Q.45 Explain what is meant by capacity of Hopfield network.
Ans. : The network capacity of the Hopfield network model is determined by the neuron amounts and connections within a given network. Therefore, the number of memories that are able to be stored is dependent on the neurons and connections.
Furthermore, it was shown that the recall accuracy ratio between vectors and nodes is 0.138. Therefore, it is evident that many mistakes will occur if one tries to store a large number of vectors.


- When the Hopfield model does not recall the right pattern, it is possible that an intrusion has taken place, since semantically related items tend to confuse the individual, and recollection of the wrong pattern occurs.
- Therefore, if we store p patterns in a Hopfield network with a large number N of nodes, then the probability of error, i.e., the probability that C > 1, is
  P_error = P(C > 1) = (1 / (sqrt(2 pi) sigma)) * integral from 1 to infinity of exp(-x^2 / (2 sigma^2)) dx = (1/2) [1 - erf(1 / (sqrt(2) sigma))]
  where sigma^2 = p/N and the error function erf is given by
  erf(x) = (2 / sqrt(pi)) * integral from 0 to x of exp(-s^2) ds.
- Therefore, given N and p we can find out the probability of error P_error for a single neuron of a stored pattern.

Q.46 Explain with neat diagram the architecture of Hopfield neural network.
Ans. :
- The Hopfield model is a single-layered recurrent network. Like the associative memory, it is usually initialized with appropriate weights instead of being trained.
- The Hopfield Neural Network (HNN) is a model of auto-associative memory. It is a single layer neural network with feedbacks. Fig. Q.46.1 shows a Hopfield network of three units. The Hopfield network is created by supplying input data vectors, or pattern vectors, corresponding to the different classes. These patterns are called class patterns.
[Fig. Q.46.1 : Hopfield network of three units - each unit is connected to every other unit (weights w12, w23, ...) but not to itself.]
- The Hopfield model consists of a single layer of processing elements where each unit is connected to every other unit in the network other than itself.
- The output of each neuron is a binary number in {-1, 1}. The output vector is the state vector. Starting from an initial state (given as the input vector), the state of the network changes from one state to another like an automaton. If the state converges, the point to which it converges is called the attractor.
- In its simplest form, the output function is the sign function, which yields 1 for arguments >= 0 and -1 otherwise.
- The connection weight matrix W of this type of network is square and symmetric. The units in the Hopfield model act as both input and output units.
- A Hopfield network consists of "n" totally coupled units. Each unit is connected to all other units except itself. The network is symmetric because the weight w_ij for the connection between unit i and unit j is equal to the weight w_ji of the connection from unit j to unit i. The absence of a connection from each unit to itself avoids a permanent feedback of its own state value.
- Hopfield networks are typically used for classification problems with binary pattern vectors.
- The Hopfield model is classified into two categories : 1. Discrete Hopfield Model, 2. Continuous Hopfield Model.
- In both the discrete and continuous Hopfield networks, the weights are trained in a one-shot fashion and not trained incrementally as was done in the case of the Perceptron and MLP.
- In the discrete Hopfield model, the units use a slightly modified bipolar output function where the states of the units, i.e., the outputs of the units, remain the same if the current state is equal to some threshold value.
- The continuous Hopfield model is just a generalization of the discrete case. Here, the units use a continuous output function such as the sigmoid or hyperbolic tangent function. In the continuous Hopfield model, each unit has an associated capacitor C_i and resistance r_i that model the capacitance and resistance of a real neuron's cell membrane, respectively.

Q.47 Give the limitation and application of Hopfield networks. [JNTU : June-13, Marks 7]
Ans. : Limitations of Hopfield networks :
- Severely limited in the number of patterns p that can be stored reliably in N nodes.
- Generally, the maximum number of patterns must be below 0.15 N for reliable performance; an avalanche in error occurs above 0.138 N.
- Too many patterns will result in spurious outputs, i.e., outputs not corresponding to any stored pattern.
- Storing similar patterns can cause errors in output.

Applications :
- Common applications are those where pattern recognition is useful, and Hopfield networks have been used for image detection and recognition, enhancement of X-ray images, medical image restoration, etc.
- The Hopfield network is commonly used for auto-association and optimization tasks.

Q.48 Why Hopfield neural network is an associative memory?
Ans. :
- Starting from any initial state, the HNN will change its state until the energy function approaches the minimum.
- The minimum point is called the attractor. Patterns can be stored in the network in the form of attractors.
- The initial state is given as the input, and the state after convergence is the output.

Q.49 Explain applications of the Hopfield network.
Ans. :
1. The Hopfield network can be used as an effective interface between analog and digital devices, where the input signals to the network are analog and the output signals are discrete values.
2. Associative memory is a major application of the Hopfield network.
3. The energy minimization ability of the Hopfield network is used to solve optimization problems.
4. The various applications of Hopfield networks are as given below : the Hopfield network remembers cues from the past and does not complicate the training procedure; the Hopfield network can be used for converting analog signals into the digital format, and for associative memory.

Q.50 What is associative memory?
Ans. : Associative memory neural nets are single layer nets in which the weights are determined in such a way that the net can store a set of pattern associations.

Q.51 Explain relation between BAM and Hopfield nets.
Ans. : The discrete Hopfield net and the BAM net are closely related.
- The Hopfield net can be viewed as an auto-associative BAM with the X-layer and Y-layer treated as a single layer and the diagonal of the symmetric weight matrix set to zero.
- The BAM can be viewed as a special case of a Hopfield net which contains all of the X-layer and Y-layer neurons, but with no interconnections between two X-layer neurons or between two Y-layer neurons.
- X-layer neurons must update their activations before any of the Y-layer neurons update theirs; then all Y-field neurons update before the next round of X-layer updates.
- The update of the neurons within the X-layer or within the Y-layer can be done at the same time, because a change in the activation of an X-layer neuron does not affect the net input to any other X-layer unit, and similarly for the Y-layer units.

Q.52 Explain Hopfield model and apply it to traveling salesman problem. [JNTU : May-17, Marks 5]
Ans. : Figure Q.52.1 shows a TSP defined over a transportation network. The artificial neural network encoding that problem is also shown.


In the transportation network, the five vertices stand for cities and the links are labeled or weighted by the inter-city distances dij.
A feasible solution to that problem is the tour Montreal-Boston-NY-LA-Toronto-Montreal, as shown by the bold arcs.
In the TSP context, the weights are derived in part from the inter-city distances. They are chosen to penalize infeasible tours and, among the feasible tours, to favor the shorter ones.
The weight on the connection between two units that represent a visit to the same position in the tour by two different cities should be inhibitory (a negative weight), because two cities cannot occupy the same exact position.
The first unit to be activated will inhibit the other unit via that connection, so as to prevent an infeasible solution from occurring.

[Fig. Q.52.1 : TSP defined over a transportation network of five cities (City 1 - City 5)]

[Fig. Q.52.2 : Artificial neural network encoding of the five-city TSP]


Fill in the Blanks for Mid Term Exam

Q.1 The basic processing elements of neural networks are called ______.
Q.2 The ______ function used for a back propagation neural network can be either a bipolar sigmoid or a binary sigmoid.
Q.3 Backpropagation is a training method used for ______ neural network.
Q.4 The ______ is a kind of a single layer artificial network with only one neuron.
Q.5 The process of weight adaptation is called ______.
Q.6 One successful method for finding high accuracy hypotheses is a technique called ______. [JNTU : Aug.-16, Feb.-17]
Q.7 ______ learning methods provide a robust approach to approximating real-valued, discrete-valued, and vector-valued target functions. [JNTU : Aug.-16, Feb.-17]
Q.8 Perceptrons can represent all of the primitive ______ functions. [JNTU : Aug.-16, Feb.-17]
Q.9 Theoretical results have been developed that characterize the fundamental relationship among the number of ______ examples observed. [JNTU : Feb.-17]
Q.10 A ______ is an information-processing unit that is fundamental to the operation of a neural network.
Q.11 An ______ is a single-layer network in which the weights are determined in such a way that the network can store a set of pattern associations.
Q.12 Auto-associative memories are ______ memories which can recall a stored sequence.
Q.13 ______ are elementary functional and structural units that mediate the interaction between neurons. [JNTU : Feb.-17]
Q.14 In neural network terminology a threshold function is commonly known as ______. [JNTU : Feb.-17]
Q.15 An error correction learning algorithm which is applied to a multilayer feed forward network leads to ______ type of problem. [JNTU : Feb.-17]
Multiple Choice Questions for Mid Term Exam

Q.1 Training Perceptron is based on ______.
a) supervised learning technique  b) unsupervised learning  c) reinforced learning  d) stochastic learning
Q.2 What is perceptron in Neural network ?
a) It is an auto-associative neural network
b) It is a double layer auto-associative neural network
c) It is a single layer feed-forward neural network with pre-processing
d) It is a neural network that contains feedback
Q.3 Application of Neural Network includes ______.
a) Pattern Recognition  b) Classification  c) Clustering  d) All of these
Q.4 Neural Networks are complex ______ with many parameters.
a) linear functions  b) nonlinear functions  c) discrete functions  d) exponential functions
Q.5 What is backpropagation ?
a) It is another name given to the curvy function in the perceptron
b) It is the transmission of error back through the network to adjust the inputs
c) It is the transmission of error back through the network to allow weights to be adjusted so that the network can learn
d) None of the above
Q.6 A possible neuron specification to solve the AND problem requires a minimum of ______.
a) single neuron  b) two neurons  c) three neurons  d) four neurons
Q.7 A perceptron takes a vector of real-valued inputs, calculates a linear combination of these inputs, then outputs ______. [JNTU : Aug.-16, Feb.-17]
a) -1 or 1  b) 0 or 1  c) -1 or 0  d) none
Q.8 The ______ algorithm computes the version space containing all hypotheses from H that are consistent with an observed sequence of training examples. [JNTU : Feb.-17]
a) inductive Hypothesis  b) artificial Neural Network  c) candidate Elimination  d) none
Q.9 If the training examples are not linearly separable, the delta rule converges toward a ______ approximation to the target concept. [JNTU : Feb.-17]
a) Over fit  b) Under fit  c) Doesn't fit  d) Best fit
Q.10 What is shape of dendrites like ?
a) oval  b) round  c) tree  d) rectangular
Q.11 What is the objective of BAM ?
a) to store pattern pairs
b) to recall pattern pairs
c) to store a set of pattern pairs and they can be recalled by giving either of pattern as input
d) none of the mentioned
Q.12 Hetero-associative memory is also known as ______ ?
a) unidirectional memory  b) bidirectional memory  c) multidirectional associative memory  d) temporal associative memory
Q.13 What are the general tasks that are performed with backpropagation algorithm ?
a) pattern mapping  b) function approximation  c) prediction  d) all of the above
Q.14 Adaline stands for ______.
a) adaptive Linear Neuron  b) address Linear Neuron  c) adaptive Linear Network  d) adaptive Neural Neuron
Q.15 BAM stands for ______.
a) Bidirectional Adaptive memory  b) Backpropagation associative memory  c) Bidirectional associative memory  d) Bidirectional associative machine
Q.16 The Hopfield network consists of a set of neurons forming a multiple loop ______ system.
a) unidirectional  b) parallel  c) feedback  d) feedforward
Q.17 A ______ Hopfield net can be used to determine whether an input vector is a known vector or an unknown vector.
a) binary  b) discrete  c) autoassociative  d) All
Q.18 Neural networks are complex ______ with many parameters.
a) linear functions  b) nonlinear functions  c) discrete functions  d) exponential functions
Q.19 The network that involves backward links from output to the input and hidden layers is called as ______.
a) self organizing maps  b) perceptrons  c) recurrent neural network  d) multilayered perceptron
Q.20 A Neuro software is ______.
a) a software used to analyze neurons
b) it is powerful and easy neural network
c) designed to aid experts in real world
d) it is software used by Neuro surgeon
Q.21 A neuron can send at a time how many signals ?
a) Multiple  b) One  c) None  d) 1010
Q.22 ______ belongs to the class of single layer feed forward or recurrent network architecture depending on its association capability.
a) Fuzzy systems  b) Associative memory  c) Pattern classification  d) Genetic algorithms
Q.23 The BAM (Bidirectional Associative Memory) significantly belongs to ______.
a) Auto associative  b) Fuzzy associative  c) Genetic associative  d) Hetero associative

Answer Key for Fill in the Blanks
Q.1 artificial neurons    Q.2 activation    Q.3 multilayer    Q.4 perceptron
Q.5 learning    Q.6 rule post pruning    Q.7 Neural network    Q.8 boolean
Q.9 training    Q.10 neuron    Q.11 associative network    Q.12 content based
Q.13 central nervous system    Q.14 activation function    Q.15 supervised learning

Answer Key for Multiple Choice Questions
Q.1 a    Q.2 c    Q.3 d    Q.4 b
Q.5 c    Q.6 a    Q.7 a    Q.8 c
Q.9 d    Q.10 c    Q.11 c    Q.12 b
Q.13 d    Q.14 a    Q.15 c    Q.16 c
Q.17 a    Q.18 b    Q.19 c    Q.20 b
Q.21 a    Q.22 b    Q.23 d

END...
UNIT - II
Unsupervised Learning Network

2.1 : Introduction

Q.1 What is unsupervised learning ?
Ans. : In unsupervised learning, the network adapts purely in response to its inputs. Such networks can learn to pick out structure in their input.

Q.2 Explain unsupervised learning.
Ans. : Unsupervised learning : The model is not provided with the correct results during the training. It can be used to cluster the input data in classes on the basis of their statistical properties only.
Cluster significance and labeling : The labeling can be carried out even if the labels are only available for a small number of objects representative of the desired classes. All similar input patterns are grouped together as clusters.
If a matching pattern is not found, a new cluster is formed. There is no error feedback.
An external teacher is not used and learning is based upon only local information. It is also referred to as self-organization.
These methods are called unsupervised because they do not need a teacher or supervisor to label a set of training examples. Only the original data is required to start the analysis.
In contrast to supervised learning, unsupervised or self-organized learning does not require an external teacher. During the training session, the neural network receives a number of different input patterns, discovers significant features in these patterns and learns how to classify input data into appropriate categories.
Unsupervised learning algorithms aim to learn rapidly and can be used in real-time. Unsupervised learning is frequently employed for data clustering, feature extraction, etc.
Another mode of learning, called recording learning by Zurada, is typically employed for associative memory networks. An associative memory network is designed by recording several ideal patterns into the network's stable states.

Q.3 What is semi-supervised learning ?
Ans. : Semi-supervised learning : Semi-supervised learning uses both labeled and unlabeled data to improve supervised learning. The goal is to learn a predictor that predicts future test data better than the predictor learned from the labeled training data alone.
Semi-supervised learning is motivated by its practical value in learning faster, better and cheaper.
In many real world applications, it is relatively easy to acquire a large amount of unlabeled data x. For example, documents can be crawled from the Web, images can be obtained from surveillance cameras, and speech can be collected from broadcast. However, their corresponding labels y for the prediction task, such as sentiment orientation, intrusion detection, and phonetic transcript, often require slow human annotation and expensive laboratory experiments.
In many practical learning domains, there is a large supply of unlabeled data but limited labeled data, which can be expensive to generate. For example : text processing, video-indexing, bioinformatics, etc.
Semi-supervised learning makes use of both labeled and unlabeled data for training, typically a small amount of labeled data with a large amount of unlabeled data. When unlabeled data is used in conjunction with a small amount of labeled data, it can produce considerable improvement in learning accuracy.
Semi-supervised learning sometimes enables predictive model testing at reduced cost.
Semi-supervised classification : Training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier.
Semi-supervised clustering : Uses a small amount of labeled data to aid and bias the clustering of unlabeled data.

Q.4 Explain difference between supervised and unsupervised learning.
Ans. :

Sr. No. | Supervised Learning | Unsupervised Learning
1. | Desired output is given. | Desired output is not given.
2. | It is not possible to learn larger and more complex models than with supervised learning. | It is possible to learn larger and more complex models with unsupervised learning.
3. | Uses training data to infer the model. | No training data is used.
4. | Every input pattern that is used to train the network is associated with an output pattern. | The target output is not presented to the network.
5. | Tries to predict a function from labeled data. | Tries to detect interesting relations in data.
6. | Supervised learning requires that the target variable is well defined and that a sufficient number of its values are given. | For unsupervised learning, typically either the target variable is unknown or has only been recorded for too small a number of cases.
7. | Example : Find a face in an image. | Example : Optical character recognition.
8. | We can test our model. | We can not test our model.
9. | Supervised learning is also called classification. | Unsupervised learning is also called clustering.

Q.5 What is winner-take-all learning network ?
Ans. : Most unsupervised neural networks rely on algorithms to compute and compare distances, determine the winner node with the highest level of activation, and use these to adapt network weights.
Among all competing nodes, only one will win and all others will lose. We mainly deal with single-winner WTA, but multiple-winner WTA is possible.
The easiest way to realize WTA is to have an external, central arbitrator decide the winner by comparing the current outputs of the competitors. This is biologically unsound.

Q.6 List the types of fixed weight competitive nets.
Ans. : Fixed weight competitive nets are Maxnet, Mexican Hat and Hamming Net.


Q.7 What is hamming net ?
Ans. : Hamming net finds the similarities between the input pattern and the weight vectors of all neurons.

Q.8 What is hamming distance ?
Ans. : The number of different bits in two binary or bipolar vectors X1 and X2 is called the hamming distance between the vectors and is denoted by H[X1, X2].
Q.9 Explain hamming net with diagram.
Ans. : Hamming net is a single layer neural network.
The Hamming network is a neural network model that is specifically designed to address pattern recognition, with inputs presented to the network in bipolar form.

[Fig. Q.9.1 : Hamming network]

In the process, the Hamming network uses the hamming distance as a similarity indicator between two vectors, and Maxnet serves as a subnet to determine the unit that has the biggest net input.
It is the neural network implementation of the hamming distance based nearest neighbor classifier.
The inputs are binary numbers (0, 1) or (-1, 1). We consider only the bipolar case here. The outputs are the similarities between the input pattern and the weight vectors of the neurons.
The number of inputs is n, which is the dimensionality of the pattern space. The number of outputs is P, which is the number of patterns to store.
The Hamming net uses MAXNET as a subnet to find the unit with the largest net input.
Input layer neurons are programmed to identify a fixed number of patterns; the number of neurons in this layer matches the number of those patterns (M neurons - M patterns).
Outputs of these neurons realise the function which "measures" the similarity of an input signal to a given pattern.
The output layer is responsible for choosing the pattern which is the most similar to the testing signal. In this layer, the neuron with the strongest response stops the other neurons' responses.

Q.10 Define maxnet.
Ans. : Maxnet is one of the artificial neural networks based on competition. Generally, it is used for pattern identification.
Maxnet can be used by another model of artificial neural network, such as the Hamming network, to get the neuron with the biggest input.
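The Hamming-layer scoring and the Maxnet competition described in Q.9 and Q.10 can be sketched together in a few lines. This is a hand-written NumPy illustration with made-up exemplar patterns, not the book's code :

```python
import numpy as np

# Stored bipolar exemplar patterns (one per row) - illustrative values only.
exemplars = np.array([[ 1, -1,  1, -1],
                      [-1, -1,  1,  1]])
n = exemplars.shape[1]

def hamming_layer(x):
    """Net input of each output unit = n - (hamming distance to its exemplar).
    With weights W = exemplars/2 and bias n/2, this is simply W @ x + n/2."""
    return (exemplars / 2.0) @ x + n / 2.0

def maxnet(a, eps=0.1, max_iter=100):
    """Competitive subnet: mutual inhibition until a single unit stays positive."""
    a = a.astype(float).copy()
    for _ in range(max_iter):
        a = np.maximum(0.0, a - eps * (a.sum() - a))  # each unit inhibited by the others
        if (a > 0).sum() <= 1:
            break
    return a

x = np.array([1, -1, -1, -1])        # input pattern
scores = hamming_layer(x)            # similarity to each stored exemplar
print(np.argmax(maxnet(scores)))     # index of the closest stored pattern
```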

Q.11 Write about Mexican Hat network.
Ans. : The Mexican Hat network is a more general contrast enhancing subnet than the MAXNET. Three types of links can be found in such a network :
1. Each neuron is connected with excitatory (positively weighted) links to a number of "cooperative neighbors," neurons that are in close proximity.
2. Each neuron is also connected with inhibitory links (with negative weights) to a number of "competitive neighbors," neurons that are somewhat further away.
3. There may also be a number of neurons, further away still, to which the neuron is not connected.
All of these connections are within a particular layer of a neural net, so, as in the case of MAXNET, the neurons receive an external signal in addition to these interconnection signals.

2.2 : Kohonen Self-Organizing Feature Maps

Q.12 What are the goals of competitive learning ?
Ans. :
1. Learn to form classes/clusters of examples/sample patterns according to similarities of these examples.
2. Patterns in a cluster would have similar features.
3. No prior knowledge as to what features are important for classification, and how many classes there are.

Q.13 Define self-organizing map.
Ans. : The self-organizing map is one of the most popular neural network models. It belongs to the category of competitive learning networks. The self-organizing map is based on unsupervised learning, which means that no human intervention is needed during the learning and that little needs to be known about the characteristics of the input data.

Q.14 What is the principal goal of the self-organizing map ?
Ans. : The principal goal of the Self-Organizing Map (SOM) is to transform an incoming signal pattern of arbitrary dimension into a one or two-dimensional discrete map, and to perform this transformation adaptively in a topologically ordered fashion.

Q.15 Explain components of self organization.
Ans. : The self-organization process involves four major components :
1. Initialization : All the connection weights are initialized with small random values.
2. Competition : For each input pattern, the neurons compute their respective values of a discriminant function which provides the basis for competition. The particular neuron with the smallest value of the discriminant function is declared the winner.
3. Cooperation : The winning neuron determines the spatial location of a topological neighbourhood of excited neurons, thereby providing the basis for cooperation among neighbouring neurons.
4. Adaptation : The excited neurons decrease their individual values of the discriminant function in relation to the input pattern through suitable adjustment of the associated connection weights, such that the response of the winning neuron to the subsequent application of a similar input pattern is enhanced.

Q.16 List the stages of the SOM algorithm.
Ans. :
1. Initialization - Choose random values for the initial weight vectors wj.
2. Sampling - Draw a sample training input vector x from the input space.
3. Matching - Find the winning neuron I(x) with weight vector closest to the input vector.
4. Updating - Apply the weight update equation
   Δwji = η(t) Tj,I(x)(t) (xi - wji)
5. Continuation - Keep returning to step 2 until the feature map stops changing.

Q.17 Explain the essential ingredients and parameters of the SOM algorithm.
Ans. : The essential ingredients and parameters of the SOM algorithm are as follows :
1. Continuous input space of activation patterns that are generated in accordance with a certain probability distribution.
2. Topology of the network in the form of a lattice of neurons, which defines a discrete output space.
3. Time-varying neighborhood function h i,I(x)(n) that is defined around a winning neuron I(x).
4. Learning-rate parameter that starts at an initial value and then decreases gradually with time, but never goes to zero.

Q.18 What is self-organizing feature maps ? List and explain its components.
Ans. : In competitive learning, neurons compete among themselves to be activated. While in Hebbian learning several output neurons can be activated simultaneously, in competitive learning only a single output neuron is active at any time. The output neuron that wins the "competition" is called the winner-takes-all neuron.
Such competition can be implemented by having lateral inhibition connections between the neurons. The result is that the neurons are forced to organise themselves. For obvious reasons, such a network is called a Self-Organizing Map (SOM).
The Self-Organizing Map is based on unsupervised learning, which means that no human intervention is needed during the learning and that little needs to be known about the characteristics of the input data.
It provides a topology preserving mapping from the high dimensional space to map units. Map units, or neurons, usually form a two-dimensional lattice, and thus the mapping is a mapping from high dimensional space onto a plane.
The property of topology preserving means that the mapping preserves the relative distance between the points. Points that are near each other in the input space are mapped to nearby map units in the SOM. The SOM can thus serve as a cluster analyzing tool for high-dimensional data. Also, the SOM has the capability to generalize.
Generalization capability means that the network can recognize or characterize inputs it has never encountered before. A new input is assimilated with the map unit it is mapped to. Fig. Q.18.1 shows organization of the mapping.
Here points x in the input space map to points I(x) in the output space. Each point I in the output space maps back to a corresponding point w(I) in the input space.
Components of Self Organization : Refer Q.15.

[Fig. Q.18.1 : Organization of the mapping - a continuous, high dimensional input space is mapped onto a discrete, low dimensional output space (feature map)]

The SOM can be used for clustering data without knowing the class memberships of the input data. The SOM can be used to detect features inherent to the problem and has thus also been called the Self-Organizing Feature Map.

Q.19 Write short note on Kohonen self organizing network.
Ans. : Kohonen self organizing networks, also called Kohonen feature maps or topology preserving maps, are used as a competition based network paradigm for data clustering.
The Kohonen model provides a topological mapping. It places a fixed number of input patterns from the input layer into a higher-dimensional output or Kohonen layer.
Training in the Kohonen network begins with the winner's neighbourhood of a fairly large size. Then, as training proceeds, the neighbourhood size gradually decreases.
Fig. Q.19.1 shows a simple Kohonen self organizing network with 2 inputs and 49 outputs.
The learning feature map is similar to that of competitive learning networks. A similarity measure is selected and the winning unit is considered to be the one with the largest activation. For Kohonen feature maps, all the weights in a neighborhood around the winning unit are also updated. The neighborhood's size generally decreases slowly with each iteration.
Steps to train a Kohonen self organizing network (for an n-dimensional input space and m output neurons) are as follows :
1. Choose a random weight vector wi for neuron i, i = 1, ..., m.
2. Choose a random input x.
3. Determine the winner neuron k : ||wk - x|| = min i ||wi - x|| (Euclidean distance).
4. Update the weight vectors of all neurons i in the neighborhood of neuron k : wi = wi + η Φ(i, k) (x - wi), so that wi is shifted towards x.
5. If the convergence criterion is met, STOP. Otherwise, narrow the neighborhood function and the learning parameter η and go to step 2.

Competitive learning in the Kohonen network :
To illustrate competitive learning, consider the Kohonen network with 100 neurons arranged in the form of a two-dimensional lattice with 10 rows and 10 columns. The network is required to classify two-dimensional input vectors - each neuron in the network should respond only to the input vectors occurring in its region.
The network is trained with 1000 two-dimensional input vectors generated randomly in a square region in the interval between -1 and +1. The learning rate parameter α is equal to 0.1.

[Fig. Q.19.1 : Simple Kohonen self organizing network - (a) input units connected to a lattice of output units, (b) shrinking neighbourhoods NB(t = 0), NB(t = 1), NB(t = 2)]
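The training procedure above can be written out compactly. The following is a minimal NumPy sketch; the lattice size, learning-rate schedule and Gaussian neighbourhood function are illustrative assumptions, not the book's values :

```python
import numpy as np

rng = np.random.default_rng(0)
grid = 10                                      # 10 x 10 output lattice
W = rng.uniform(-1, 1, size=(grid, grid, 2))   # one 2-D weight vector per unit
coords = np.stack(np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij"), axis=-1)

X = rng.uniform(-1, 1, size=(1000, 2))         # 2-D training vectors in [-1, 1]

eta, sigma = 0.1, 3.0                          # learning rate and neighbourhood radius
for t in range(2000):
    x = X[rng.integers(len(X))]                # step 2 : pick a random input
    d = np.linalg.norm(W - x, axis=-1)         # step 3 : distance to every unit
    k = np.unravel_index(np.argmin(d), d.shape)            # winning unit
    g = np.exp(-np.sum((coords - np.array(k)) ** 2, axis=-1) / (2 * sigma ** 2))
    W += eta * g[..., None] * (x - W)          # step 4 : neighbourhood update
    eta *= 0.999                               # step 5 : shrink rate and radius
    sigma = max(0.5, sigma * 0.999)
```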
2.3 : Learning Vector Quantization

Q.20 Write short note on learning vector quantization.
Ans. : Learning Vector Quantization (LVQ) is an adaptive data classification method. It is based on training data with desired class information.
LVQ uses unsupervised data clustering techniques to preprocess the data set and obtain cluster centers.
Fig. Q.20.1 shows the network representation of LVQ.

[Fig. Q.20.1 : LVQ network - input units connected to output (cluster) units through weights such as w11 and w26]

Here the input dimension is 2 and the input space is divided into six clusters. The first two clusters belong to class 1, while the other four clusters belong to class 2.
The LVQ learning algorithm involves two steps :
1. An unsupervised learning data clustering method is used to locate several cluster centers without using the class information.
2. The class information is used to fine tune the cluster centers to minimize the number of misclassified cases.
The number of clusters can either be specified a priori or determined via a cluster technique capable of adaptively adding new clusters when necessary. Once the clusters are obtained, their classes must be labeled before moving to the second step. Such labeling is achieved by a voting method.
Learning method :
The weight vector (w) that is closest to the input vector (x) must be found. If x belongs to the same class, we move w towards x; otherwise we move w away from the input vector x.
Step 1 : Initialize the cluster centers by a clustering method.
Step 2 : Label each cluster by the voting method.
Step 3 : Randomly select a training input vector x and find k such that ||x - wk|| is a minimum.
Step 4 : If x and wk belong to the same class, update wk by
         Δwk = η (x - wk)
         Otherwise update wk by
         Δwk = - η (x - wk)
Step 5 : If the maximum number of iterations is reached, stop. Otherwise return to step 3.
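The update rule in Steps 3 and 4 can be expressed directly. Below is a minimal LVQ1-style sketch in NumPy; the prototypes, class labels and learning rate are illustrative assumptions :

```python
import numpy as np

# Cluster centers (prototypes) and their class labels - illustrative values.
W = np.array([[0.2, 0.8], [0.8, 0.2], [0.5, 0.5]], dtype=float)
cls = np.array([0, 1, 1])

def lvq_step(x, y, W, cls, eta=0.05):
    """One LVQ update: attract the nearest prototype if its class matches,
    otherwise repel it."""
    k = np.argmin(np.linalg.norm(W - x, axis=1))   # Step 3 : nearest prototype
    if cls[k] == y:                                # Step 4 : same class
        W[k] += eta * (x - W[k])
    else:                                          # different class
        W[k] -= eta * (x - W[k])
    return W

x, y = np.array([0.25, 0.75]), 0
W = lvq_step(x, y, W, cls)
```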

2.4 : Counter Propagation Networks

Q.21 What is counter-propagation network ?
Ans. : Counter-propagation networks are multilayer networks based on a combination of input, clustering and output layers. They can be used to compress data, to approximate functions or to associate patterns.

Q.22 How are counter propagation nets trained ?
Ans. : Counter-propagation nets are trained in two stages :
1. First stage : The input vectors are clustered. The clusters that are formed may be based on either the dot product metric or the Euclidean norm metric.
2. Second stage : The weights from the cluster units to the output units are adapted to produce the desired response.

Q.23 List the possible drawbacks of counter-propagation networks.
Ans. : Training a counter-propagation network has the same difficulty associated with training a Kohonen network.
Counter-propagation networks tend to be larger than backpropagation networks. If a certain number of mappings are to be learned, the middle layer must have that many neurons.

Q.24 Explain full Counter-Propagation Network (CPN).
Ans. : The full CPN allows us to produce a correct output even when it is given an input vector that is partially incomplete or incorrect.
Full counter-propagation was developed to provide an efficient method of representing a large number of vector pairs x : y by adaptively constructing a lookup table.
It produces an approximation x* : y* based on input of an x vector only, or input of a y vector only, or input of an x : y pair, possibly with some distorted or missing elements in either or both vectors.
In the first phase, the training vector pairs are used to form clusters using either the dot product or the Euclidean distance. If the dot product is used, normalization is a must.
This phase of training is called instar-modeled training. The active units here are the units in the x-input, z-cluster and y-input layers. The winning unit uses the standard Kohonen learning rule for its weight updation.
During the second phase, the weights are adjusted between the cluster units and the output units. In this phase, only the winning unit J remains active in the cluster layer.
The weights from the winning cluster unit J to the output units are adjusted, so that the vector of activations of the units in the y output layer, y*, is an approximation of input vector y; and x* is an approximation of input vector x.
The architecture of CPN resembles an instar and outstar model. The model which connects the input layers to the hidden layer is called the Instar model and the model which connects the hidden layer to the output layer is called the Outstar model.
The weights are updated in both the Instar (in the first phase) and the Outstar model (in the second phase). The network is a fully interconnected network.

Q.25 How does forward-only counter-propagation differ from full counter-propagation nets ?
Ans. : Unlike full counter-propagation, in forward-only counter-propagation only the x vectors are used to form the clusters on the Kohonen units during the first stage of training.
The original presentation of forward-only counter-propagation used the Euclidean distance between the input vector and the weight vector for the Kohonen unit.

Q.26 What is forward only counter-propagation ?
Ans. : It is a simplified version of the full counter-propagation.
It is intended to approximate a y = f(x) function that is not necessarily invertible.
It may be used if the mapping from x to y is well defined, but the mapping from y to x is not.

2.5 : Adaptive Resonance Theory

Q.27 Define plasticity.
Ans. : The ability of a net to respond to (learn) a new pattern equally well at any stage of learning is called plasticity.

Q.28 List the components of ART1.
Ans. : The components are as follows :
1. The short term memory layer (F1).
2. The recognition layer (F2) : It contains the long term memory of the system.
3. Vigilance Parameter (ρ) : A parameter that controls the generality of the memory. Larger ρ means more detailed memories, smaller ρ produces more general memories.

Q.29 Draw and explain the architecture of ART. What is its use and types ?
Ans. : Gail Carpenter and Stephen Grossberg (Boston University) developed the Adaptive Resonance learning model to address the question of how a system can retain its previously learned knowledge while incorporating new information.
Adaptive resonance architectures are artificial neural networks that are capable of stable categorization of an arbitrary sequence of unlabeled input patterns in real time. These architectures are capable of continuous training with non-stationary inputs.


Some models of Adaptive Resonance Theory are :
1. ART1 - Discrete input.
2. ART2 - Continuous input.
3. ARTMAP - Using two input vectors, transforms the unsupervised ART model into a supervised one.
Various others : Fuzzy ART, Fuzzy ARTMAP (FARTMAP), etc.
The primary intuition behind the ART model is that object identification and recognition generally occur as a result of the interaction of 'top-down' observer expectations with 'bottom-up' sensory information.
The basic ART system is an unsupervised learning model. It typically consists of a comparison field and a recognition field composed of neurons, a vigilance parameter, and a reset module. However, ART networks are able to grow additional neurons if a new input cannot be categorized appropriately with the existing neurons.
ART networks tackle the stability-plasticity dilemma :
1. Plasticity : They can always adapt to unknown inputs if the given input cannot be classified by existing clusters.
2. Stability : Existing clusters are not deleted by the introduction of new inputs.
3. Problem : Clusters are of fixed size, depending on ρ.
Fig. Q.29.1 shows the ART-1 network.

[Fig. Q.29.1 : ART 1 network - an input layer (F1) fully connected to an output layer (F2)]

ART-1 networks receive binary input vectors. Bottom-up weights are used to determine output-layer candidates that may best match the current input.
Top-down weights represent the "prototype" for the cluster defined by each output neuron. A close match between input and prototype is necessary for categorizing the input.
Finding this match can require multiple signal exchanges between the two layers in both directions until "resonance" is established or a new neuron is added.
The basic ART model, ART1, is comprised of the following components :
1. The short term memory layer (F1) : Short term memory.
2. The recognition layer (F2) : Contains the long term memory of the system.
3. Vigilance Parameter (ρ) : A parameter that controls the generality of the memory. Larger ρ means more detailed memories, smaller ρ produces more general memories.
Types of ART :

Type | Remarks
ART 1 | It is the simplest variety of ART networks, accepting only binary inputs.
ART 2 | Extends network capabilities to support continuous inputs.
ART 3 | ART 3 builds on ART-2 by simulating rudimentary neurotransmitter regulation of synaptic activity, incorporating simulated sodium (Na+) and calcium (Ca2+) ion concentrations into the system's equations, which results in a more physiologically realistic means of partially inhibiting categories that trigger mismatch resets.
Fuzzy ART | Fuzzy ART implements fuzzy logic into ART's pattern recognition, thus enhancing generalizability.
ARTMAP | Also known as Predictive ART; it combines two slightly modified ART-1 or ART-2 units into a supervised learning structure, where the first unit takes the input data and the second unit takes the correct output data, then used to make the minimum possible adjustment of the vigilance parameter in the first unit in order to make the correct classification.
Fuzzy ARTMAP | Fuzzy ARTMAP is merely ARTMAP using fuzzy ART units, resulting in a corresponding increase in efficiency.
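A toy sketch of the vigilance test and resonance idea described above is given below. It is a hand-written, simplified NumPy illustration; the vigilance value, choice function and fast-learning update are assumptions for the example rather than the full ART1 algorithm :

```python
import numpy as np

def art1_present(x, prototypes, rho=0.7):
    """Simplified ART1-style step for one binary input vector x.
    Pick the best-matching prototype; accept it only if the match ratio
    |x AND w| / |x| passes the vigilance test, else add a new category."""
    x = np.asarray(x, dtype=int)
    best, best_score = None, -1.0
    for j, w in enumerate(prototypes):
        score = np.logical_and(x, w).sum() / (0.5 + w.sum())   # choice function
        if score > best_score:
            best, best_score = j, score
    if best is not None:
        w = prototypes[best]
        match = np.logical_and(x, w).sum() / max(x.sum(), 1)
        if match >= rho:                                       # resonance
            prototypes[best] = np.logical_and(x, w).astype(int)  # fast learning
            return best
    prototypes.append(x.copy())                                # mismatch: new category
    return len(prototypes) - 1

prototypes = []
for p in ([1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1]):
    print(art1_present(p, prototypes, rho=0.7))
```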

Q.30 List the advantages of adaptive resonance theory.
Ans. : It exhibits stability and is not disturbed by the wide variety of inputs provided to its network.
It can be integrated and used with various other techniques to give better results.
It can be used in various fields such as mobile robot control, face recognition, land cover classification, target recognition, medical diagnosis, signature verification, clustering web users, etc.
It has advantages over competitive learning : competitive learning lacks the capability to add new clusters when deemed necessary and does not guarantee stability in forming clusters.

2.6 : Special Networks

Q.31 Describe Bayesian belief network.
Ans. : A Bayesian belief network describes the probability distribution governing a set of variables by specifying a set of conditional independence assumptions along with a set of conditional probabilities.

Q.32 What is radial basis function network ?
Ans. : A Radial Basis Function (RBF) network is an artificial neural network that uses radial basis functions as activation functions. The output of the network is a linear combination of radial basis functions of the inputs and neuron parameters.
RBF networks form a special class of neural networks, which consist of three layers :
The input layer is used only to connect the network to its environment.
The hidden layer contains a number of nodes, which apply a nonlinear transformation to the input variables using a radial basis function, such as the Gaussian function, the thin plate spline function, etc.
The output layer is linear and serves as a summation unit.

Q.33 List the features of Radial Basis Function (RBF) networks.
Ans. : Features are as follows :
1. They are two-layer feed-forward networks.
2. The hidden nodes implement a set of radial basis functions (e.g. Gaussian functions).
3. The output nodes implement linear summation functions as in an MLP.
4. The network training is divided into two stages : first the weights from the input to hidden layer are determined, and then the weights from the hidden to output layer.
5. The training/learning is very fast.
6. The networks are very good at interpolation.

Q.34 Compare RBF network with multilayer perceptron.
Ans. :

Sr. No. | RBF networks | Multilayer perceptrons
1. | An RBFN has a single hidden layer. | An MLP may have one or more hidden layers.
2. | The hidden layer is nonlinear and the output layer is linear. | Hidden and output layers used as a pattern classifier are usually nonlinear.
3. | The argument of the activation function of each hidden unit computes the Euclidean norm between the input vector and the centre of that unit. | The activation function of each hidden unit computes the inner product of the input vector and the synaptic weight vector of that unit.
4. | RBF networks use exponentially decaying localized nonlinearities to construct local approximations to nonlinear input-output mappings. | MLPs construct global approximations to nonlinear input-output mappings.
5. | Computation nodes in the hidden layer of an RBF network are quite different and serve a different purpose from those in the output layer of the network. | Computation nodes of an MLP, located in a hidden or an output layer, share a common neuronal model.
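The two-stage training of an RBF network described in Q.33 can be sketched compactly. The NumPy illustration below fits a 1-D sine curve; the random choice of Gaussian centres and the least-squares fit of the output weights are assumptions made for the example :

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))          # 1-D inputs
y = np.sin(X[:, 0])                            # target function to interpolate

# Stage 1 : fix the hidden layer (centres and width of the Gaussian basis).
centres = X[rng.choice(len(X), 10, replace=False)]
sigma = 1.0
Phi = np.exp(-np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=-1) ** 2
             / (2 * sigma ** 2))               # design matrix, shape (200, 10)

# Stage 2 : fit the linear output weights by least squares.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Prediction for a new point.
x_new = np.array([[0.5]])
phi = np.exp(-np.linalg.norm(x_new[:, None, :] - centres[None, :, :], axis=-1) ** 2
             / (2 * sigma ** 2))
print(phi @ w)      # approximately sin(0.5)
```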


Fill in the Blanks for Mid Term Exam

Q.1 The SOM algorithm is a ______ quantization algorithm, which provides a good approximation to the input space.
Q.2 ______ networks are a type of artificial neural network constructed from spatially localized kernel functions.
Q.3 An output neuron that wins the competition is called a ______ neuron.
Q.4 Hamming net uses ______ as a subnet to find the unit with the largest net input.
Q.5 In a ______ map, the neurons are placed at the nodes of a lattice that is usually one or two dimensional.
Q.6 A specific competitive net that performs Winner Take All (WTA) competition is the ______.
Q.7 The ______ algorithm is based on unsupervised, competitive learning.
Q.8 Counter propagation network is a ______ winner-take-all competitive learning network.
Q.9 The weight vector for an output unit is often referred to as a ______ vector for the class that the unit represents.
Q.10 ______ nets are designed to allow the user to control the degree of similarity of patterns placed on the same cluster.
Q.11 The ability of a net to respond to (learn) a new pattern equally well at any stage of learning is called ______.
Q.12 Counter-propagation networks are ______ networks based on a combination of input, clustering and output layers.
Q.13 In the Perceptron convergence theorem, the adaption of the weight vector is done by considering which appropriate parameter : ______ ?
Q.14 ______ is a specific technique implemented for improving the efficiency of a multi-layer feed forward network.
Q.15 ______ is the most popular technique in which we start with a large multilayer perceptron with an adequate performance for the problem and then eliminate certain synaptic weights in an orderly fashion.
Q.16 A HAP model which comes under auto correlated associative memory is abbreviated ______.

Multiple Choice Questions for Mid Term Exam

Q.1 Kohonen network is trained in an ______ mode.
a) supervised  b) unsupervised  c) semi-supervised  d) All of these
Q.2 What is an activation value ?
a) weighted sum of inputs  b) threshold value  c) main input to neuron  d) none of the mentioned
Q.3 What is Hebb's rule of learning ?
a) the system learns from its past mistakes
b) the system recalls previous reference inputs and respective ideal outputs
c) the strength of neural connection gets modified accordingly
d) none of the mentioned
Q.4 Why can't we design a perfect neural network ?
a) full operation of biological neurons is still not known
b) number of neurons is not precisely known
c) number of interconnections is very large and is very complex
d) all of these
Q.5 What was the main point of difference between the Adaline and perceptron model ?
a) weights are compared with output
b) sensory units result is compared with output
c) analog activation value is compared with output
d) all of the mentioned
Q.6 Heteroassociative memory can be an example of which type of network ?
a) group of instars  b) group of outstars  c) either group of instars or outstars  d) both group of instars or outstars
Q.7 The ability to learn how to do tasks based on the data given for training or initial experience is known as ______.
a) self-organization  b) adaptive learning  c) fault tolerance  d) robustness
Q.8 The sigmoid activation function is generally identified in ______ shape.
a) S  b) Z  c) A  d) U
Q.9 In general the "error based learning algorithm" comes under ______.
a) reinforced  b) supervised  c) unsupervised  d) private

Answer Key for Fill in the Blanks
Q.1 vector    Q.2 Radial basis function    Q.3 winner-takes-all    Q.4 Maxnet
Q.5 self-organizing    Q.6 Maxnet    Q.7 SOM    Q.8 unsupervised
Q.9 reference    Q.10 Adaptive resonance theory    Q.11 plasticity    Q.12 multilayer
Q.13 linearly separable    Q.14 back propagation    Q.15 supervised    Q.16 Hopfield

Answer Key for Multiple Choice Questions
Q.1 b    Q.2 a    Q.3 c    Q.4 d
Q.5 c    Q.6 c    Q.7 b    Q.8 a
Q.9 b

END...

UNIT - III
Deep Learning

3.1 : Introduction to Deep Learning

Q.1 Define deep learning.
Ans. : Deep learning is a machine learning technique that teaches computers to do what comes naturally to humans : learn by example. Deep learning uses layers of algorithms to process data, understand human speech, and visually recognize objects.

Q.2 Explain deep learning. What are the challenges in deep learning ?
Ans. : Deep learning is a new area of machine learning research, which has been introduced with the objective of moving machine learning closer to one of its original goals.
Deep learning is about learning multiple levels of representation and abstraction that help to make sense of data such as images, sound and text.
'Deep learning' means using a neural network with several layers of nodes between input and output. It is generally better than other methods on image, speech and certain other types of data because the series of layers between input and output do feature identification and processing in a series of stages, just as our brains seem to.
Deep learning emphasizes the network architecture of today's most successful machine learning approaches. These methods are based on "deep" multi-layer neural networks with many hidden layers.
Challenges in deep learning :
They need to find and process massive datasets for training.
One of the reasons deep learning works so well is the large number of interconnected neurons, or free parameters, that allow for capturing subtle nuances and variations in data.
Due to the sheer number of layers, nodes, and connections, it is difficult to understand how deep learning networks arrive at insights.
Deep-learning networks are highly susceptible to the butterfly effect - small variations in the input data can lead to drastically different results, making them inherently unstable.

Q.3 List the applications of deep learning.
Ans. : Applications :
1. Colorization of black and white images.
2. Adding sounds to silent movies.
3. Automatic machine translation.
4. Object classification in photographs.
5. Automatic handwriting generation.
6. Character text generation.
7. Image caption generation.
8. Automatic game playing.

Q.4 What is vanishing gradient problem ?
Ans. : The vanishing gradient problem occurs when we try to train a neural network model using gradient based optimization techniques.
When we do back-propagation, i.e. moving backward in the network and calculating gradients of loss (error) with respect to the weights, the gradients tend to get smaller and smaller as we keep on moving backward in the network.
This means that the neurons in the earlier layers learn very slowly as compared to the neurons in the later layers in the hierarchy. The earlier layers in the network are the slowest to train.
Gradient based methods learn a parameter's value by understanding how a small change in the parameter's value will affect the network's output.
If a change in the parameter's value causes a very small change in the network's output, the network just can't learn the parameter effectively, which is a problem.
This is exactly what happens in the vanishing gradient problem - the gradients of the network's output with respect to the parameters in the early layers become extremely small. That is a fancy way of saying that even a large change in the value of parameters for the early layers doesn't have a big effect on the output.
The vanishing gradient problem depends on the choice of the activation function. Many common activation functions 'squash' their input into a very small output range in a very non-linear fashion. For example, sigmoid maps the real number line onto a "small" range of [0, 1].
As a result, there are large regions of the input space which are mapped to an extremely small range. In these regions of the input space, even a large change in the input will produce a small change in the output - hence the gradient is small.
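A tiny numerical illustration of this effect (an assumption-laden sketch, not from the book) : repeatedly multiplying by the sigmoid derivative, which is at most 0.25, shrinks the backpropagated gradient roughly exponentially with depth.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The derivative of the sigmoid is at most 0.25, so a chain of n sigmoid layers
# scales the backpropagated gradient by roughly (0.25 * |w|)^n.
z = 0.0                       # pre-activation at which we evaluate each layer
grad = 1.0                    # gradient arriving at the output layer
w = 1.0                       # assume unit weights for the illustration
for layer in range(10):
    grad *= sigmoid(z) * (1 - sigmoid(z)) * w   # chain rule through one layer
    print(f"after layer {layer + 1}: gradient = {grad:.2e}")
# After 10 sigmoid layers the gradient has shrunk to about 1e-6,
# which is why the earliest layers learn very slowly.
```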
Q.5 Why is deep learning useful ?
Ans. : Manually designed features are often over-specified, incomplete, and take a long time to design and validate.
Learned features are easy to adapt and fast to learn.
Deep learning provides a very flexible, (almost?) universal, learnable framework for representing world, visual and linguistic information.
It can learn both unsupervised and supervised.
It enables effective end-to-end joint system learning.
It utilizes large amounts of training data.

Q.6 What's the difference between machine learning and deep learning ?
Ans. : Deep learning is a specialized form of machine learning. A machine learning workflow starts with relevant features being manually extracted from images. The features are then used to create a model that categorizes the objects in the image.
With a deep learning workflow, relevant features are automatically extracted from images. In addition, deep learning performs "end-to-end learning", where a network is given raw data and a task to perform, such as classification, and it learns how to do this automatically.

Q.7 Explain limitations of deep learning.
Ans. :
1. Data dependency : In general, deep learning algorithms require vast amounts of training data to perform their tasks accurately.
2. Lack of generalization : Deep-learning algorithms are good at performing focused tasks but poor at generalizing their knowledge.
3. Algorithmic bias : Deep-learning algorithms are only as good as the data they're trained on.
4. Explainability : Neural networks develop their behavior in extremely complicated ways.

Q.8 What is convolutional neural network ? List its limitations.
Ans. : In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep, feed-forward artificial neural networks, most commonly applied to analyzing visual imagery.
CNNs use a variation of multilayer perceptrons designed to require minimal preprocessing. They are also known as shift invariant or space invariant artificial neural networks (SIANN), based on their shared-weights architecture and translation invariance characteristics.
Convolutional networks were inspired by biological processes in that the connectivity pattern between neurons resembles the organization of the animal visual cortex.
Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field.
A fully-connected structure does not scale to large images. The explicit assumption here is that the inputs are images.
This assumption allows us to encode certain properties into the architecture. These then make the forward function more efficient to implement and vastly reduce the amount of parameters in the network.
A CNN is a neural network with some convolutional layers. A convolutional layer has a number of filters that perform the convolution operation.
CNNs run a small window over the input image at both training and testing time; the weights of the network that looks through this window can learn from various features of the input data regardless of their absolute position within the input.

[Fig. Q.8.1 : A convolutional network - kernels applied between inputs and outputs]

GoogLeNet was one of the first models that introduced the idea that CNN layers did not always have to be stacked up sequentially.
A CNN compresses a fully connected network in three ways :
1. Reducing the number of connections.
2. Sharing weights on the edges.
3. Max pooling, which further reduces the complexity.
The CNN has 3 key properties :
1. Locality : Allows more robustness against non-white noise where some bands are cleaner than the others.
2. Weight sharing : Can also improve model robustness and reduce overfitting, as each weight is learned from multiple frequency bands in the input instead of just from one single location.
3. Pooling : The same feature values computed at different locations are pooled together and represented by one value.
The limitations of convolutional neural networks :
1. They take fixed length vectors as input and produce fixed length vectors as output.
2. They allow only a fixed amount of computational steps.

Q.9 Explain architecture of Convolution Neural Networks (ConvNet).
Ans. : The architecture of a CNN is designed to take advantage of the 2D structure of an input image.
A CNN consists of a number of convolutional and sub-sampling layers optionally followed by fully connected layers.
The input to a convolutional layer is an "m x m x r" image, where m is the height and width of the image and r is the number of channels.
Fig. Q.9.1 shows the architecture of a convolution neural network. The CNN architecture is made up of several layers that implement feature extraction, and then classification.
The image is divided into receptive fields that feed into a convolutional layer, which then extracts features from the input image.
The next step is pooling, which reduces the dimensionality of the extracted features (through down-sampling) while retaining the most important information (typically through max pooling).

[Fig. Q.9.1 : Architecture of a convolution neural network - Input -> Convolution 1 -> Pool 1 -> Convolution 2 -> Pool 2 -> Hidden -> Output (feature extraction followed by classification)]
Another convolution and pooling step is then performed that feeds into a fully connected multilayer perceptron. The final output layer of this network is a set of nodes that identify features of the image. The network is trained using back-propagation.
An input image is passed to the first convolutional layer. The convolved output is obtained as an activation map. The filters applied in the convolution layer extract relevant features from the input image to pass further.
Each filter gives a different feature to aid the correct class prediction. In case we need to retain the size of the image, we use same padding (zero padding); otherwise valid padding is used, since it helps to reduce the number of features.
Pooling layers are then added to further reduce the number of parameters.
Several convolution and pooling layers are added before the prediction is made. Convolutional layers help in extracting features. As we go deeper in the network, more specific features are extracted as compared to a shallow network, where the extracted features are more generic.
The output layer in a CNN, as mentioned previously, is a fully connected layer, where the input from the other layers is flattened and transformed into the number of classes desired by the network.
The output is then generated through the output layer and is compared with the desired output for error generation.
A loss function is defined in the fully connected output layer to compute the mean square loss. The gradient of the error is then calculated.
The error is then backpropagated to update the filter (weights) and bias values. One training cycle is completed in a single forward and backward pass.

Q.10 Write short note on Pooling layer.
Ans. : Pooling layers were developed to reduce the number of parameters needed to describe layers deeper in the network.
Pooling also reduces the number of computations required for training the network, or for simply running it forward during a classification task.
Fig. Q.10.1 shows a pooling layer.

[Fig. Q.10.1 : Pooling layer - a 224 x 224 x 64 input is down-sampled to 112 x 112 x 64]

Pooling layers provide some limited amount of translational and rotational invariance. However, they will not account for large translational or rotational perturbations, such as a face flipped by 180 degrees.
Perturbations like these could only be detected if images of faces rotated by 180 degrees were present in the original training set.
Accounting for all possible orientations of an object would require more filters, and therefore more weight parameters would be required.

[Fig. Q.10.2 : Max pooling - a single depth slice pooled with 2 x 2 filters and stride 2]
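The 2 x 2, stride-2 max pooling illustrated in Fig. Q.10.2 can be written in a few lines. The following is a minimal NumPy sketch; the sample feature-map values are made up :

```python
import numpy as np

def max_pool_2x2(fmap):
    """2 x 2 max pooling with stride 2 on a single feature map (H and W even)."""
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 1, 2, 4],
                 [5, 6, 7, 8],
                 [3, 2, 1, 0],
                 [1, 2, 3, 4]])
print(max_pool_2x2(fmap))
# [[6 8]
#  [3 4]]
```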

The function of pooling is to progressively reduce the spatial size of the representation, which reduces the amount of parameters and computation in the network. The pooling layer operates on each feature map independently.
The most common approach used in pooling is max pooling.
The pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs. Max pooling reports the maximum output within a rectangular neighborhood. Average pooling reports the average output.
Pooling helps make the representation approximately invariant to small input translations.

Q.11 Write short note on convolutional layer.
Ans. : The convolutional layer is the core building block of a convolutional network; it does most of the computational heavy lifting.
Consider a 32 x 32 x 3 image (32 width, 32 height, 3 depth); the convolutional layer preserves this spatial structure.
A 5 x 5 x 3 filter is convolved with the 32 x 32 x 3 image, i.e. the filter is slid over the image spatially, computing dot products. Filters always extend the full depth of the input volume.
Each position gives one number : the result of taking a dot product between the filter and a small 5 x 5 x 3 chunk of the image.
Convolving (sliding) the filter over all spatial locations produces a 28 x 28 activation map.
For example, if we had six 5 x 5 filters, we would get 6 separate activation maps in the convolution layer.
We stack these up to get a "new image" of size 28 x 28 x 6.
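The arithmetic above (a 5 x 5 x 3 filter slid over a 32 x 32 x 3 image gives a 28 x 28 activation map, and six filters give 28 x 28 x 6) can be checked with a small NumPy sketch; the random image and filters are placeholders :

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))     # H x W x depth
filters = rng.standard_normal((6, 5, 5, 3))  # six 5 x 5 x 3 filters

def convolve(image, filters):
    """Valid convolution (stride 1, no padding): output is (H-4) x (W-4) x num_filters."""
    H, W, _ = image.shape
    n, k, _, _ = filters.shape
    out = np.zeros((H - k + 1, W - k + 1, n))
    for f in range(n):
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                chunk = image[i:i + k, j:j + k, :]         # 5 x 5 x 3 chunk
                out[i, j, f] = np.sum(chunk * filters[f])  # one dot product
    return out

print(convolve(image, filters).shape)   # (28, 28, 6)
```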

Q.12 Explain applications of convolutional neural network.
Ans. : Computer vision : CNNs are employed to identify the hierarchy or conceptual structure of an image. Instead of feeding each image into the neural network as one grid of numbers, the image is broken down into overlapping image tiles that are each fed into a small neural network.
Speech recognition : CNNs have been used recently in speech recognition and have given better results than deep neural networks. The robustness of a CNN is enhanced when pooling is done at a local frequency region, and over-fitting is avoided by using fewer parameters to extract low-level features.
CNNs use relatively little pre-processing compared to other image classification algorithms. This means that the network learns the filters that in traditional algorithms were hand-engineered. This independence from prior knowledge and human effort in feature design is a major advantage.
They have applications in image and video recognition, recommender systems, image classification, medical image analysis, and natural language processing.
Video analysis : Compared to image data domains, there is relatively little work on applying CNNs to video classification. Video is more complex than images since it has another (temporal) dimension. However, some extensions of CNNs into the video domain have been explored. One approach is to treat space and time as equivalent dimensions of the input and perform convolutions in both time and space.
Drug discovery : CNNs have been used in drug discovery. Predicting the interaction between molecules and biological proteins can identify potential treatments.
Natural language processing : CNNs have also been explored for natural language processing. CNN models are effective for various NLP problems and have achieved excellent results in semantic parsing, search query retrieval, sentence modeling, classification, prediction and other traditional NLP tasks.

3.2 : Deep Feedforward Networks

Q.13 Why are models called feedforward ?
Ans. : Models are called feedforward because information flows through the function being evaluated from x, through the intermediate computations used to define f, and finally to the output y. There are no feedback connections in which outputs of the model are fed back into itself.

Q.14 Why are feedforward neural networks called networks ?
Ans. : Feedforward neural networks are called networks because they are typically represented by composing together many different functions. The model is associated with a directed acyclic graph describing how the functions are composed together.

Q.15 What is recurrent neural networks ?
Ans. : When feedforward neural networks are extended to include feedback connections, they are called recurrent neural networks.

Q.16 Write short note on Deep feedforward networks.
Ans. : Deep feedforward networks are also called feedforward neural networks or multilayer perceptrons.
Feedforward neural networks are called networks because they are typically represented by composing together many different functions. The model is associated with a directed acyclic graph describing how the functions are composed together.
For example, consider three functions f1, f2 and f3 connected in a chain, to form f(x) = f3(f2(f1(x))).
These chain structures are the most commonly used structures of neural networks. In this case, f1 is called the first layer of the network, f2 is called the second layer, and so on.
The overall length of the chain gives the depth of the model. It is from this terminology that the name "deep learning" arises.
The final layer of a feedforward network is called the output layer. During neural network training, we drive f(x) to match f*(x).
The training data provides us with noisy, approximate examples of f*(x) evaluated at different training points.
The training examples specify directly what the output layer must do at each point x; it must produce a value that is close to y. The behavior of the other layers is not directly specified by the training data.
The learning algorithm must decide how to use those layers to produce the desired output, but the training data does not say what each individual layer should do. Instead, the learning algorithm must decide how to use these layers to best implement an approximation of f*.
Because the training data does not show the desired output for each of these layers, these layers are called hidden layers.
Finally, these networks are called neural because they are loosely inspired by neuroscience. Each hidden layer of the network is typically vector-valued. The dimensionality of these hidden layers determines the width of the model.

Q.17 What is the use of a hidden layer ?
Ans. : Feedforward networks have introduced the concept of a hidden layer, and this requires us to choose the activation functions that will be used to compute the hidden layer values.

Q.18 Explain learning concept of XOR.
Ans. : The XOR problem is a pattern recognition problem in neural networks.

• The XOR function is an operation on two binary values, x1 and x2. When exactly one of these binary values is equal to 1, the XOR function returns 1. Otherwise, it returns 0.
• The XOR function provides the target function y = f*(x) that we want to learn. Our model provides a function y = f(x; θ) and our learning algorithm will adapt the parameters θ to make f as similar as possible to f*.
• Neural networks can be used to classify Boolean functions depending on their desired outputs.
• The XOR problem is not linearly separable. We cannot use a single layer perceptron to construct a straight line to partition the two dimensional input space into two regions, each containing only data points of the same class.
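As an illustration (added here, not from the text), the sketch below evaluates a small one-hidden-layer ReLU network on the four XOR inputs. The weight values are the well-known hand-constructed XOR solution from the deep learning literature; they are assumptions chosen to reproduce XOR exactly, not parameters learned by any procedure described in this book.

import numpy as np

# A hidden layer of ReLU units can represent XOR exactly with these
# hand-picked (not learned) parameters.
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])      # input -> hidden weights
c = np.array([0.0, -1.0])       # hidden biases
w = np.array([1.0, -2.0])       # hidden -> output weights
b = 0.0                         # output bias

def xor_net(x):
    h = np.maximum(0.0, W.T @ x + c)   # hidden layer: ReLU(W^T x + c)
    return w @ h + b                   # linear output unit

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", xor_net(np.array(x, dtype=float)))
# (0, 0) -> 0.0, (0, 1) -> 1.0, (1, 0) -> 1.0, (1, 1) -> 0.0

A single-layer perceptron cannot produce this mapping, which is why the hidden layer is needed.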

Q.19 List the types of activation function.
Ans. : Types of activation functions are : Sigmoid or Logistic, Tanh and ReLU.

Q.20 Define activation function. Explain the purpose of activation function in multilayer neural networks. Give any two activation functions.
Ans. : An activation function f performs a mathematical operation on the signal output. The activation functions are chosen depending upon the type of problem to be solved by the network. There are a number of common activation functions in use with neural networks. Fig. Q.20.1 shows the position of the activation function. The unit step function is one of the activation functions.

Fig. Q.20.1 : Position of the activation function (inputs x_i weighted by w_i, summed to the net input, then passed through the activation/transfer function with a threshold to give the output)

• The cell body itself is considered to have two functions. The first function is integration of all weighted stimuli, symbolized by the summation sign. The second function is the activation, which transforms the sum of weighted stimuli to an output value which is sent out through connection y.
• Typically the same activation function is used for all neurons in any particular layer. In a multi-layer network, if the neurons have linear activation functions the capabilities are no better than a single layer network with a linear activation function. Hence in most cases nonlinear activation functions are used.
1. Linear activation function : The linear activation function will only produce positive numbers over the entire real number range. The linear activation function value is 0 if the argument is less than a lower boundary, increases linearly from 0 to +1 for arguments equal to or larger than the lower boundary and less than an upper boundary, and is +1 for all arguments equal to or greater than the given upper boundary.


2. Sigmoid activation function : The sigmoid function will only produce positive numbers between 0 and 1. The sigmoid activation function is most used for training data that is also between 0 and 1. It is one of the most used activation functions. A sigmoid function produces a curve with an "S" shape. Logistic and hyperbolic tangent functions are commonly used sigmoid functions. The sigmoid functions are extensively used in back propagation neural networks because they reduce the burden of complication involved during the training phase.

sig(x) = 1 / (1 + e^(-x))

Fig. Q.20.2 : (a) Linear function, (b) Sigmoid function, (c) Binary sigmoid function

3. Binary sigmoid : The logistic function, which is a sigmoid function between 0 and 1, is used in neural networks as an activation function where the output values are either binary or vary from 0 to 1. It is also called binary sigmoid or logistic sigmoid.

f(x) = 1 / (1 + e^(-x))

4. Bipolar sigmoid : A logistic sigmoid function can be scaled to have any range of values which may be appropriate for a problem. The most common range is from -1 to 1. This is called bipolar sigmoid.


f(x) = (1 - exp(-x)) / (1 + exp(-x))

• The bipolar sigmoid is also closely related to the hyperbolic tangent function :

tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x)) = (1 - exp(-2x)) / (1 + exp(-2x))
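A short NumPy check (added for illustration, not in the original text) of the relations above between the logistic sigmoid, the bipolar sigmoid and tanh:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bipolar_sigmoid(x):
    return (1.0 - np.exp(-x)) / (1.0 + np.exp(-x))

x = np.linspace(-4, 4, 9)
# The bipolar sigmoid is a rescaled logistic sigmoid ...
print(np.allclose(bipolar_sigmoid(x), 2 * sigmoid(x) - 1))   # True
# ... and tanh is the bipolar sigmoid evaluated at 2x.
print(np.allclose(np.tanh(x), bipolar_sigmoid(2 * x)))        # True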

Q.21 Write short note on Tanh and ReLU neurons.
Ans. : Tanh is also like the logistic sigmoid but better. The range of the tanh function is from -1 to 1. Tanh is also sigmoidal (S-shaped).
• Fig. Q.21.1 shows tanh v/s logistic sigmoid. The tanh neuron is simply a scaled sigmoid neuron.
• Problems of the sigmoid resolved by tanh :
1. The output is not zero centered.
2. Small gradient of the sigmoid function.

Fig. Q.21.1 : tanh v/s logistic sigmoid

• ReLU (Rectified Linear Unit) is the most used activation function in the world right now, since it is used in almost all convolutional neural networks and deep learning.
• Fig. Q.21.2 shows ReLU v/s logistic sigmoid. As you can see, the ReLU is half rectified (from the bottom) : R(z) = max(0, z), so f(z) is zero when z is less than zero and f(z) is equal to z when z is above or equal to zero.

Fig. Q.21.2 : ReLU v/s logistic sigmoid


• Compared to tanh/sigmoid neurons that involve expensive operations (exponentials, etc.), ReLU can be implemented by simply thresholding a matrix of activations at zero.

Function : Sigmoid neurons
 - Advantages : 1. Output in range (0, 1).
 - Disadvantages : 1. Saturated, 2. Not zero centered, 3. Small gradient, 4. Vanishing gradient.
Function : Tanh neurons
 - Advantages : 1. Zero centered, 2. Output in range (-1, 1).
 - Disadvantages : 1. Saturated.
Function : ReLU
 - Advantages : 1. Computational efficiency, 2. Accelerated convergence.
 - Disadvantages : 1. Dead neurons, 2. Not zero centered.

Q.22 List the pros and cons of ReLU.
Ans. : Pros :
1. Faster to compute.
2. It helps to reduce the vanishing gradient problem.
3. Sparse activation, i.e. more robust to noise.
4. Efficient to optimize; converges much faster than sigmoid or tanh.
Cons :
1. A very negative neuron cannot recover due to zero gradient.
2. Non zero centered output.
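A minimal sketch (added for illustration) of the zero-gradient behaviour behind the "dead neuron" point above; the sample inputs are arbitrary.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Gradient is 1 for positive inputs and exactly 0 for negative inputs,
    # which is why strongly negative ("dead") units stop learning.
    return (z > 0).astype(float)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))       # [0.  0.  0.  0.5 3. ]
print(relu_grad(z))  # [0. 0. 0. 1. 1.]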

3.3 : Gradient-Based Learning

Q.23 What is gradient descent ?
Ans. : Gradient descent is a first-order optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point.

Q.24 What is cost function ?
Ans. : The cost function is defined as

C(w, b) = (1 / 2N) Σ_i ||a(x_i) - y_i||²

where the x_i are the input vectors and the y_i are their corresponding labels; both are given. So the cost function changes its value depending on the weights w and biases b. It is a quadratic cost function.
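A small NumPy sketch (added here, not from the text) of the quadratic cost above; the output and label values are illustrative assumptions.

import numpy as np

def quadratic_cost(a, y):
    """Quadratic cost C = 1/(2N) * sum of squared differences."""
    n = len(y)
    return np.sum((a - y) ** 2) / (2 * n)

# Toy example: network outputs a(x_i) versus labels y_i.
a = np.array([0.8, 0.2, 0.6])
y = np.array([1.0, 0.0, 1.0])
print(quadratic_cost(a, y))   # (0.04 + 0.04 + 0.16) / 6 = 0.04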

Q.25 List the components of hidden units.
Ans. : Hidden units are composed of input vectors (x), the computation of an affine transformation (z) and an element-wise nonlinear function g(z).


Q.26 What is sigmoid unit ?
Ans. : A sigmoid unit is used when predicting the value of a binary variable. That means it can predict only two cases. That also means that the only probability distribution which the sigmoid unit is able to predict is the Bernoulli distribution.

Q.27 What is use of softmax unit ?
Ans. : Softmax functions are most often used as the output of a classifier, to represent the probability distribution over n different classes. More rarely, softmax functions can be used inside the model itself, if we wish the model to choose between one of n different options for some internal variable.

Q.28 What is difference between linear unit and rectified linear unit ?
Ans. : The only difference between a linear unit and a rectified linear unit is that a rectified linear unit outputs zero across half its domain. This makes the derivatives through a rectified linear unit remain large whenever the unit is active. The gradients are not only large but also consistent.

Q.29 Write short note on softmax unit.
Ans. : The softmax function squashes the outputs of each unit to be between 0 and 1, just like a sigmoid function. But it also divides each output such that the total sum of the outputs is equal to 1.
• The output of the softmax function is equivalent to a categorical probability distribution; it tells you the probability that any of the classes is true.
• Mathematically the softmax function is shown below, where z is a vector of the inputs to the output layer and j indexes the output units, so j = 1, 2, ..., K :

softmax(z)_j = e^(z_j) / Σ_(k=1..K) e^(z_k)

• Each training image is labeled with the true digit and the goal of the network is to predict the correct label. So, if the input is an image of the digit 4, the output unit corresponding to 4 would be activated, and so on for the rest of the units.
• The softmax function highlights the largest values and suppresses the other values.
• The softmax can be used for any number of classes. It is also used for hundreds and thousands of classes, for example in object recognition problems where there are hundreds of different possible objects.
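A small NumPy sketch (added for illustration) of the softmax described above; subtracting the maximum before exponentiating is a common implementation detail assumed here, not something the text specifies.

import numpy as np

def softmax(z):
    """Numerically stable softmax over a vector of output-layer inputs z."""
    shifted = z - np.max(z)          # subtracting the max avoids overflow in exp
    e = np.exp(shifted)
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p)         # approx [0.659 0.242 0.099] - the largest input dominates
print(p.sum())   # 1.0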

Q.30 Explain gradient descent algorithm. List the limitations of gradient descent.
Ans. : Goal : Solving nonlinear minimization problems through derivative information.
• First and second derivatives of the objective function or the constraints play an important role in optimization. The first order derivatives are called the gradient and the second order derivatives are called the Hessian matrix.
• Derivative based optimization is also called nonlinear optimization. It is capable of determining search directions according to an objective function's derivative information.
• Derivative based optimization methods are used for :
1. Optimization of nonlinear neuro-fuzzy models
2. Neural network learning
3. Regression analysis in nonlinear models

• Basic descent methods are as follows :
1. Steepest descent
2. Newton-Raphson method

Gradient Descent :
• Gradient descent is a first-order optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point.
• Gradient descent is popular for very large-scale optimization problems because it is easy to implement, can handle black box functions, and each iteration is cheap.
• Given a differentiable scalar field f(x) and an initial guess x_1, gradient descent iteratively moves the guess toward lower values of f by taking steps in the direction of the negative gradient -∇f(x).
• Locally, the negated gradient is the steepest descent direction, i.e., the direction in which x would need to move in order to decrease f the fastest. The algorithm typically converges to a local minimum, but may rarely reach a saddle point, or not move at all if x_k lies at a local maximum.
• The gradient gives the slope of the curve at that x and its direction points to an increase in the function. So we change x in the opposite direction to lower the function value :

x_(k+1) = x_k - λ ∇f(x_k)

The λ > 0 is a small number that forces the algorithm to make small jumps.

Limitations of Gradient Descent :
• Gradient descent is relatively slow close to the minimum : technically, its asymptotic rate of convergence is inferior to many other methods.
• For poorly conditioned convex problems, gradient descent increasingly 'zigzags' as the gradients point nearly orthogonally to the shortest direction to a minimum point.

Steepest Descent :
• Steepest descent is also known as the gradient method. This method is based on a first order Taylor series approximation of the objective function. This method is also called the saddle point method. Fig. Q.30.1 shows the steepest descent method.
• The steepest descent is the simplest of the gradient methods. The choice of direction is where f decreases most quickly, which is the direction opposite to ∇f(x). The search starts at an arbitrary point x_0 and then goes down the gradient, until it reaches close to the solution.
• The method of steepest descent is the discrete analogue of gradient descent, but the best move is computed using a local minimization rather than computing a gradient. It is typically able to converge in few steps but it is unable to escape local minima or plateaus in the objective function.
• The gradient is everywhere perpendicular to the contour lines. After each line minimization the new gradient is always orthogonal to the previous step direction. Consequently, the iterates tend to zig-zag down the valley in a very inefficient manner.
• The method of steepest descent is simple, easy to apply, and each iteration is fast. It is also very stable; if minimum points exist, the method is guaranteed to locate them, although it may require an infinite number of iterations.

Fig. Q.30.1 : Steepest descent method
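A minimal sketch (added here, not from the text) of the update rule x_(k+1) = x_k - λ ∇f(x_k) from Q.30; the objective f and the step size are illustrative assumptions.

import numpy as np

def gradient_descent(grad_f, x0, lam=0.1, steps=100):
    """Iterate x_{k+1} = x_k - lam * grad_f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lam * grad_f(x)
    return x

# Minimize f(x, y) = (x - 3)^2 + 2 * (y + 1)^2, whose gradient is coded below.
grad_f = lambda v: np.array([2 * (v[0] - 3), 4 * (v[1] + 1)])
print(gradient_descent(grad_f, x0=[0.0, 0.0]))   # converges towards [3, -1]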
Q.31 What are differences between gradient descent and stochastic gradient descent ?
Ans. :
1. In standard gradient descent, the error is summed over all examples before updating weights, whereas in stochastic gradient descent weights are updated upon examining each training example.
2. Summing over multiple examples in standard gradient descent requires more computation per weight update step. On the other hand, because it uses the true gradient, standard gradient descent is often used with a larger step size per weight update than stochastic gradient descent.

3.4 : Architecture Design

Q.32 What is universal approximation theorem ?
Ans. : A single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units.

Q.33 What do you mean by architecture design ?
Ans. : Architecture refers to the overall structure of the network.
• Most neural networks are organized into groups of units called layers. These layers are arranged in a chain structure, with each layer being a function of the layer that preceded it.
• In this structure, the first layer is given by h^(1) = g^(1)( W^(1)ᵀ x + b^(1) ), and the second layer is given by h^(2) = g^(2)( W^(2)ᵀ h^(1) + b^(2) ).

3.5 : Back-Propagation and Other Differentiation Algorithms

Q.34 What is forward propagation ?
Ans. : When we use a feedforward neural network to accept an input x and produce an output ŷ, information flows forward through the network. The input x provides the initial information that then propagates up to the hidden units at each layer and finally produces ŷ. This is called forward propagation.

Q.35 What is backpropagation ?
Ans. : Backpropagation allows information to flow backwards from the cost to compute the gradient.
• It is an algorithm for supervised learning of artificial neural networks using gradient descent. Given an artificial neural network and an error function, the method calculates the gradient of the error function with respect to the neural network's weights.
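A minimal NumPy sketch (added for illustration) of the layer-chain forward propagation described in Q.33 and Q.34; the layer sizes, the parameter dictionary and the choice of tanh as g are assumptions.

import numpy as np

def forward(x, params, g=np.tanh):
    """Forward propagation through a two-layer network:
    h1 = g(W1^T x + b1), h2 = g(W2^T h1 + b2), y = w_out^T h2 + b_out."""
    h1 = g(params["W1"].T @ x + params["b1"])
    h2 = g(params["W2"].T @ h1 + params["b2"])
    return params["w_out"] @ h2 + params["b_out"]

rng = np.random.default_rng(0)
params = {                      # illustrative random parameters (3-4-4-1 network)
    "W1": rng.normal(size=(3, 4)), "b1": np.zeros(4),
    "W2": rng.normal(size=(4, 4)), "b2": np.zeros(4),
    "w_out": rng.normal(size=4),   "b_out": 0.0,
}
print(forward(np.array([0.5, -1.0, 2.0]), params))   # a single scalar output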

Q.36 Explain backpropagation learning rule.
Ans. : The net input of a node is defined as the weighted sum of the incoming signals plus a bias term. Fig. Q.36.1 shows the backpropagation MLP for node j. The net input and output of node j are as follows :

x̄_j = Σ_i w_ij x_i + w_j,    x_j = 1 / (1 + exp(-x̄_j))

where x_i is the output of node i located in any one of the previous layers.

Fig. Q.36.1 : Backpropagation MLP for node j

• w_ij is the weight associated with the link connecting nodes i and j, and w_j is the bias of node j.
• The internal parameters associated with each node j are the weights w_ij, so changing the weights of the node will change the behaviour of the whole backpropagation MLP.
• Fig. Q.36.2 shows a two layer backpropagation MLP : input signals pass through the input layer, a hidden layer and the output layer, with back-propagation of error correction.

Fig. Q.36.2 : Back-propagation of error correction

• The above backpropagation MLP is referred to as a 3-4-3 network, corresponding to the number of nodes in each layer.
• The backward error propagation is also known as Backpropagation (BP) or the Generalized Delta Rule (GDR). A squared error measure for the p-th input-output pair is defined as

E_p = Σ_k (d_k - x_k)²

where d_k is the desired output for node k and x_k is the actual output for node k when the input part of the p-th data pair is presented.

Q.37 Write the characteristics and applications of error back propagation algorithm.
Ans. : Characteristics :
• It is an algorithm for supervised learning of artificial neural networks using gradient descent.
• It learns weights for a multilayer network, given a fixed set of units and interconnections.
• In multilayer networks the error surface can have multiple minima, but in practice backpropagation has produced excellent results in many real-world applications.
• The algorithm is for two layers of sigmoid units and does stochastic gradient descent.
• It uses gradient descent to minimize the squared error between the network outputs and the target values for these outputs.
Applications :
• The fast development of artificial satellite technology has increased the importance of three dimensional (3D) positioning and, therefore, of satellite geodesy.
• Particularly, the Global Positioning System (GPS) provides more practical, rapid, precise and continuous positioning results anywhere on the Earth in geodetic applications when compared to the traditional terrestrial positioning methods.
• Due to the increasing use of GPS positioning techniques, great attention has been paid to the precise determination of local/regional geoids, aiming at replacing geometric leveling with GPS measurements.
• The BPANN method is easily programmable with a decreased or increased number of reference points when generating a local GPS, and it performs flexible modelling.
• Also, the BPANN method is open to updating, which could be accepted as an important advantage. Thus, it is believed that the BPANN method is more convenient for generating local GPS when compared to other methods.

Q.38 List out merits and demerits of EBP.
Ans. : Merits / Strengths :
1. Computing time is reduced if the weights chosen are small at the beginning.
2. It minimizes the error.
3. Batch update of weights exists, which provides a smoothing effect on the weight correction.
4. Simple method and easy to implement.
5. Finds a minimum of the error function in weight space.
6. Standard method and generally works well.
Demerits :
1. Training may sometimes cause temporal instability to the system.
2. For complex problems, it takes a lot of time.
3. Selection of the number of hidden nodes in the network is a problem.
4. Backpropagation learning does not require normalization of input vectors; however, normalization could improve performance.
5. It can get stuck in local minima resulting in sub-optimal solutions.
6. Slow and inefficient.

Q.39 Discuss performance issue of EBP.
Ans. : Computational efficiency is the main aspect of back-propagation. The number of operations to compute derivatives of the error function scales with the total number W of weights and biases.
• A single evaluation of the error function for a single input requires O(W) operations for large W.
• Derivatives can be computed using the method of finite differences :

∂E_n/∂w_ji = ( E_n(w_ji + ε) - E_n(w_ji) ) / ε + O(ε),    where ε << 1

• Accuracy can be improved by making ε smaller until round-off problems arise.
• Accuracy can be improved further by using central differences :

∂E_n/∂w_ji = ( E_n(w_ji + ε) - E_n(w_ji - ε) ) / (2ε) + O(ε²)

• Each such derivative costs O(W) operations, so evaluating all W derivatives this way costs O(W²), which is why backpropagation's O(W) gradient computation is preferred.
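A small NumPy sketch (added here, not from the text) of the gradient check implied by Q.39: an analytic gradient is compared with the central-difference estimate. The error function E is an illustrative quadratic, and the data values and ε are assumptions.

import numpy as np

# E(w) = 0.5 * (w^T x - d)^2 for a single training case (illustrative model).
x = np.array([1.0, -2.0, 0.5])
d = 0.7
E = lambda w: 0.5 * (w @ x - d) ** 2
grad_E = lambda w: (w @ x - d) * x          # analytic gradient of E

w = np.array([0.3, 0.1, -0.4])
eps = 1e-5
numeric = np.zeros_like(w)
for j in range(len(w)):
    e_j = np.zeros_like(w)
    e_j[j] = eps
    numeric[j] = (E(w + e_j) - E(w - e_j)) / (2 * eps)   # central differences

print(np.max(np.abs(numeric - grad_E(w))))   # round-off level: the gradients agree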


• Generalization means performance on new input patterns, i.e. input patterns which were not among the training patterns on which the network was trained.
• If you train for long you can often get the sum-squared error on the training data very low, but by over-fitting to the training data you get a network which performs very well on the training data, but not as well as it could on unseen data.
• By stopping training earlier, one hopes that the network will have learned the broad rules of the problem, but not bent itself into the shape of some of the more idiosyncratic (perhaps even noisy) training patterns.
Q.40 Why does overfitting tend to occur during later iterations, but not during earlier iterations ?
Ans. : Consider that network weights are initialized to small random values. With weights of nearly identical value, only very smooth decision surfaces are describable.
• As training proceeds, some weights begin to grow in order to reduce the error over the training data, and the complexity of the learned decision surface increases.
• Thus, the effective complexity of the hypotheses that can be reached by backpropagation increases with the number of weight-tuning iterations.
• Given enough weight-tuning iterations, backpropagation will often be able to create overly complex decision surfaces that fit noise in the training data or unrepresentative characteristics of the particular training sample.
• This overfitting problem is analogous to the overfitting problem in decision tree learning.

Q.41 Why is backpropagation also called as generalized delta rule ? [JNTU : May-17, Marks 5]
Ans. : The generalized delta rule is a mathematically derived formula used to determine how to update a neural network during a (backpropagation) training step.
• A neural network learns a function that maps an input to an output based on given example pairs of inputs and outputs. A set number of input and output pairs are presented repeatedly, in random order, during the training.
• The generalized delta rule is used repeatedly during training to modify the weights between node connections. Before training, the network has connection weights initialized with small, random numbers. The purpose of the weight modifications is to reduce the overall network error, which means to reduce the difference between the actual and expected output.
• The backpropagation algorithm trains a given feed-forward multilayer neural network for a given set of input patterns with known classifications. When each entry of the sample set is presented to the network, the network examines its output response to the sample input pattern.
• The output response is then compared to the known and desired output and the error value is calculated. Based on the error, the connection weights are adjusted.
• The backpropagation algorithm is based on the Widrow-Hoff delta learning rule, in which the weight adjustment is done through the mean square error of the output response to the sample input.
• The set of these sample patterns is repeatedly presented to the network until the error value is minimized.


Fill in the Blanks for Mid Term Exam

Q.1 ______ feedforward networks are also often called feedforward neural networks or multilayer perceptrons.
Q.2 When feedforward neural networks are extended to include feedback connections, they are called ______ neural networks.
Q.3 A single layer with one hidden unit is also called ______.
Q.4 Gradient descent is an iterative ______ algorithm.
Q.5 ReLU is one of the most popular functions which is used as activation function in the ______ of a deep neural network.
Q.6 Training data does not show the desired output for each of these layers; these layers are called ______.
Q.7 ______ gradient descent applied to non-convex loss functions has no such convergence guarantee, and is sensitive to the values of the initial parameters.
Q.8 The iterative gradient-based optimization algorithms are used to train ______.
Q.9 Rectified linear units are an excellent default choice of ______ unit.
Q.10 The ______ algorithm looks for the minimum of the error function in weight space using the method of gradient descent.
Q.11 Algebraic expressions and computational graphs both operate on ______, or variables that do not have specific values.
Q.12 The first order derivatives are called the gradient and the second order derivatives are called the ______.

Multiple Choice Questions for Mid Term Exam

Q.1 ReLU stands for ______
 a) Rectified Link Unit  b) Rectified Large Unit  c) Rectified Layer Unit  d) Rectified Linear Unit
Q.2 The iterative gradient-based optimization algorithms are used to train ______
 a) feedforward networks  b) backpropagation  c) activation function  d) hidden layer
Q.3 Hidden units are composed of ______, computing an affine transformation and an element-wise nonlinear function.
 a) output vector  b) input vector  c) activation function  d) all of these
Q.4 The backpropagation algorithm is used to find a local minimum of the ______.
 a) neural network  b) activation function  c) error function  d) none of these
Q.5 The ______ function highlights the largest values and suppresses other values.
 a) Error  b) activation  c) backpropagation  d) softmax
Q.6 In back propagation algorithms, a ______ determines the size of the weight adjustments made at each iteration and influences the rate of convergence.

Answer Key for Fill in the Blanks
Q.1 Deep  Q.2 recurrent  Q.3 perceptron  Q.4 optimization  Q.5 hidden layer  Q.6 hidden layers  Q.7 Stochastic  Q.8 feedforward networks  Q.9 hidden  Q.10 backpropagation  Q.11 symbols  Q.12 Hessian matrix

Answer Key for Multiple Choice Questions
Q.1 d  Q.2 a  Q.3 b  Q.4 c  Q.5 b  Q.6 a
END.



UNIT IV

Regularization for Deep Learning

4.1 : Parameter Norm Penalties

Q.1 What is regularization ?
Ans. : Regularization refers to a set of different techniques that lower the complexity of a neural network model during training, and thus prevent overfitting.

Q.2 Explain overfitting. What are the reasons for overfitting ?
Ans. : Training error can be reduced by making the hypothesis more sensitive to training data, but this may lead to overfitting and poor generalization.
• Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship.
• Overfitting is when a classifier fits the training data too tightly. Such a classifier works well on the training data but not on independent test data. It is a general problem that plagues all machine learning methods.
• Because of overfitting, we get low error on training data and high error on test data.
• Overfitting occurs when a model begins to memorize training data rather than learning to generalize from the trend.
• The more difficult a criterion is to predict, the more noise exists in past information that needs to be ignored. The problem is determining which part to ignore.
• Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.
• We can determine whether a predictive model is underfitting or overfitting the training data by looking at the prediction error on the training data and the evaluation data. Fig. Q.2.1 shows overfitting.

Fig. Q.2.1 : Overfitting (balanced fit v/s overfitting)

Reasons for overfitting :
1. Noisy data
2. Training set is too small
3. Large number of features
Methods to avoid overfitting :
1. Limit the number of hidden nodes.
2. Stop training early to avoid a perfect explanation of the training set.
3. Apply weight decay to limit the size of the weights, and thus of the function class implemented by the network.

Q.3 Explain underfitting. What are the reasons for underfitting ? How do we know if we are underfitting or overfitting ?
Ans. : Underfitting : If we put too few variables in the model, leaving out variables that could help explain the response, we are underfitting.
Consequences :
1. The fitted model is not good for prediction of new data - the prediction is biased.
2. Regression coefficients are biased.
3. The estimate of error variance is too large.
Underfitting examples :
1. The learning time may have been prohibitively large, and the learning stage was prematurely terminated.
2. The learner did not use a sufficient number of iterations.
3. The learner tries to fit a straight line to a training set whose examples exhibit a quadratic nature.
To prevent under-fitting we need to make sure that :
1. The network has enough hidden units to represent the required mappings.
2. The network is trained for long enough that the error/cost function is sufficiently minimized.
How do we know if we are underfitting or overfitting ?
1. If by increasing capacity we decrease generalization error, then we are underfitting; otherwise we are overfitting.
2. If the error representing the training set is relatively large and the generalization error is large, then we are underfitting.
3. If the error in representing the training set is relatively small and the generalization error is large, then we are overfitting.
4. There are many features but a relatively small training set.

Q.4 Explain parameter regularization.
Ans. : α is the hyperparameter whose value is optimized for better results. L2 regularization is also known as weight decay, as it forces the weights to decay towards zero.
• The L2 norm penalty on the weights is

Ω(w) = (1/2) ||w||_2^2 = (1/2) w^T w

• The update rule of gradient descent using the L2 norm penalty is

w ← (1 - εα) w - ε ∇_w J(w)

The weights multiplicatively shrink by a constant factor at each step; L2 regularization drives the weights closer to the origin by adding a regularization term.
• The L2 regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors. Due to multiplicative interactions between weights and inputs, this has the appealing property of encouraging the network to use all of its inputs a little rather than some of its inputs a lot.
• The L2 regularization causes the learning algorithm to "perceive" the input X as having higher variance, which makes it shrink the weights on features whose covariance with the output target is low compared to this added variance.
• In L2 regularization, the regularization term is the sum of squares of all feature weights. It forces the weights to be small but does not make them zero, and gives a non-sparse solution.

Q.5 What is L1 norm parameter regularization ?
Ans. : A regression model that uses the L1 regularization technique is called Lasso Regression. Lasso Regression adds the "absolute value of magnitude" of the coefficients as a penalty term to the loss function.
• Formally, L1 regularization on the model parameter w is defined as

Ω(w) = ||w||_1 = Σ_i |w_i|

that is, as the sum of absolute values of the individual parameters.
• In L1, we have : Cost function = Loss + (λ / 2m) Σ |w|
• In this, we penalize the absolute value of the weights. Unlike L2, the weights may be reduced to zero. Hence, it is very useful when we are trying to compress our model. Otherwise, we usually prefer L2 over it.
• In L1 regularization, the weights shrink by a constant amount toward 0. In L2 regularization, the weights shrink by an amount which is proportional to w. So when a particular weight has a large magnitude |w|, L1 regularization shrinks the weight much less than L2 does. By contrast, when |w| is small, L1 regularization shrinks the weight much more than L2 regularization. The net result is that L1 regularization tends to concentrate the weight of

the network in a relatively small number of high-importance connections, while the other weights are driven toward zero.

Q.6 What is difference between L1 and L2 regularization ?
Ans. :
1. L1 calculates the sum of the absolute values of the weights; L2 calculates the sum of the squared values of the weights.
2. L1 has a sparse solution; L2 has a non-sparse solution.
3. L1 has built-in feature selection; L2 has no feature selection.
4. L1 is robust to outliers; L2 is not robust to outliers.
5. L1 regularization generates models that are simple and interpretable but cannot learn complex patterns; L2 is able to learn complex data patterns.
6. L1 has multiple solutions; L2 has one solution.
7. L1 is equivalent to MAP Bayesian estimation with a Laplace prior; L2 is equivalent to MAP Bayesian estimation with a Gaussian prior.
8. A regression model that uses the L1 regularization technique is called Lasso Regression; a model which uses L2 is called Ridge Regression.

Q.7 Explain applications of convolutional neural network.
Ans. : Computer vision : CNNs are employed to identify the hierarchy or conceptual structure of an image. Instead of feeding each image into the neural network as one grid of numbers, the image is broken down into overlapping image tiles that are each fed into a small neural network.
• Speech recognition : CNNs have been used recently in speech recognition and have given better results over deep neural networks. Robustness of CNN is enhanced when pooling is done at a local frequency region, and over-fitting is avoided by using fewer parameters to extract low-level features.
• CNNs use relatively little pre-processing compared to other image classification algorithms. This means that the network learns the filters that in traditional algorithms were hand-engineered. This independence from prior knowledge and human effort in feature design is a major advantage.
• They have applications in image and video recognition, recommender systems, image classification, medical image analysis, and natural language processing.
• Video analysis : Compared to image data domains, there is relatively little work on applying CNNs to video classification. Video is more complex than images since it has another (temporal) dimension. However, some extensions of CNNs into the video domain have been explored. One approach is to treat space and time as equivalent dimensions of the input and perform convolutions in both time and space.
• Drug discovery : CNNs have been used in drug discovery. Predicting the interaction between molecules and biological proteins can identify potential treatments.
• Natural language processing : CNNs have also been explored for natural language processing. CNN models are effective for various NLP problems and achieved excellent results in semantic parsing, search query retrieval, sentence modeling, classification, prediction and other traditional NLP tasks.

Q.8 Explain overfitting. What are the reasons for overfitting ?
Ans. : Training error can be reduced by making the hypothesis more sensitive to training data, but this may lead to overfitting and poor generalization.
• Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship.
• Overfitting is when a classifier fits the training data too tightly. Such a classifier works well on the training data but not on independent test data. It is a general problem that plagues all machine learning methods.
• Because of overfitting, we get low error on training data and high error on test data.

• Overfitting occurs when a model begins to memorize training data rather than learning to generalize from the trend.
• The more difficult a criterion is to predict, the more noise exists in past information that needs to be ignored. The problem is determining which part to ignore.
• Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.
• We can determine whether a predictive model is underfitting or overfitting the training data by looking at the prediction error on the training data and the evaluation data. Fig. Q.8.1 shows overfitting.

Fig. Q.8.1 : Overfitting (balanced fit v/s overfitting)

Reasons for overfitting :
1. Noisy data
2. Training set is too small
3. Large number of features
Methods to avoid overfitting :
1. Limit the number of hidden nodes.
2. Stop training early to avoid a perfect explanation of the training set.
3. Apply weight decay to limit the size of the weights, and thus of the function class implemented by the network.

4.2 : Norm Penalties as Constrained Optimization

Q.9 What is goal of penalty functions ?
Ans. : The goal of penalty functions is to convert constrained problems into unconstrained problems by introducing an artificial penalty for violating the constraint.
• In the exact penalty function methods, the original constrained optimization problem is replaced by an unconstrained problem, in which the objective function is the sum of a certain "merit" function and a penalty term which reflects the constraint set.
• The penalty term is obtained by multiplying a suitable function, which represents the constraints, by a positive parameter c called the penalty parameter.
• A given penalty parameter c is called an exact penalty parameter when every solution of the given extremum problem can be found by solving the unconstrained optimization problem with the penalty function associated with c.

Q.10 Explain norm penalties as constrained optimization.
Ans. : If Ω is the L2 norm, then the weights are constrained to lie in an L2 ball. If Ω is the L1 norm, then the weights are constrained to lie in a region of limited L1 norm.
• Usually we do not know the size of the constraint region that we impose by using weight decay with coefficient α*, because the value of α* does not directly tell us the value of k.
• In principle, one can solve for k, but the relationship between k and α* depends on the form of J.
• While we do not know the exact size of the constraint region, we can control it roughly by increasing or decreasing α in order to grow or shrink the constraint region. A larger α will result in a smaller constraint region; a smaller α will result in a larger constraint region.

Q.11 What are the reasons to use explicit constraints rather than penalties ?
Ans. : One reason to use explicit constraints and reprojection rather than enforcing constraints with penalties is that penalties can cause nonconvex optimization procedures to get stuck in local minima corresponding to small θ.
• When training neural networks, this usually manifests as neural networks that train with several "dead units." These are units that do not contribute much to the behavior of the function learned by the network because the weights going into or out of them are all very small.
• When training with a penalty on the norm of the weights, these configurations can be locally optimal, even if it is possible to significantly reduce J by making the weights large.

4.3 : Dataset Augmentation and Noise Robustness

Q.12 What is dataset augmentation ?
Ans. : Data augmentation is the process of increasing the amount and diversity of data. We do not collect new data, rather we transform the already present data.
• Data augmentation is an integral process in deep learning, as in deep learning we need large amounts of data and in some cases it is not feasible to collect thousands or millions of images, so data augmentation comes to the rescue.
• It helps us to increase the size of the dataset and introduce variability in the dataset.
• Data augmentation involves the process of creating new data points by manipulating the original data. For example, for images, this can be done by rotating, resizing, cropping and more.
• This process increases the diversity of the data available for training models in deep learning without having to actually collect new data. This then, generally speaking, improves the performance of deep learning models.
• Random Erasing is a data augmentation method used for training convolutional neural networks that involves randomly erasing a rectangular region in an image. Images with occlusions are then generated. This makes a model robust to occlusion and reduces the chances of overfitting.
• Injecting noise in the input to a neural network can also be seen as a form of data augmentation. For many classification and even some regression tasks, the task should still be possible to solve even if small random noise is added to the input.
• One way to improve the robustness of neural networks is simply to train them with random noise applied to their inputs. Input noise injection is part of some unsupervised learning algorithms, such as the denoising autoencoder.
• Noise injection also works when the noise is applied to the hidden units, which can be seen as doing dataset augmentation at multiple levels of abstraction.

Q.13 What is dropout ? How does it solve the problem of overfitting ?
Ans. : Dropout is a technique developed for preventing overfitting of a network to the particular variations of the training set.
• Deep neural networks with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks.
• Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem.
• The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much.
• During training, dropout samples from an exponential number of different "thinned" networks.
• At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single un-thinned network that has smaller weights.
Fig. Q.13.1 : (a) Standard neural net, (b) After applying dropout

• Fig. Q.13.1 shows a standard neural network and a dropout neural network. The term "dropout" refers to dropping out units (hidden and visible) in a neural network.
• With limited training data, however, many of these complicated relationships will be the result of sampling noise, so they will exist in the training set but not in real test data even if it is drawn from the same distribution.
• This leads to overfitting, and many methods have been developed for reducing it. These include stopping the training as soon as performance on a validation set starts to get worse, introducing weight penalties of various kinds such as L1 and L2 regularization, and soft weight sharing.

4.4 : Multi-task Learning and Early Stopping

Q.14 What is multi-task learning ?
Ans. : Multi-task learning is a sub-field of machine learning that aims to solve multiple different tasks at the same time, by taking advantage of the similarities between different tasks.
• Multitask learning is a way to improve generalization by pooling the examples arising out of several tasks. Fig. Q.14.1 shows multi-tasking.

Fig. Q.14.1 : Constrained layers shared by Task A, Task B and Task C

Differentsupervised task shared the same input It is probably the most commonly used form of
and some intermediate-level
representation. indeep learning.
regularization

We can view multi-tasklearning as a form of Early stopping is an unobtrusive form of


inductivetransfer. uctir transter can help regularization,in thatit requiresalmost no change
improve a model by introducingan inductivebias, in the underlyingtraining procedure,the objective
which causes a model to prefer some hypotheses function,or the set of allowableparameter values.
over others.
Early stoppingmay be used either alone or in
For instance,a common form bias is ?1
ofinductive withother regularization
conjunction strategies.
which leads
regularization, to a preference for
Early stoppingrequires a validation
set, which
sparsesolutions. data isnot fed to the model
means some training

Multi-taskinglearningare of two types:hard or soft Earlystoppingis alsousefulbecauseitreduces the


parameter sharing of hidden layers. cost of thetraining
computational procedure.
Hard parameter:Itis generallyappliedby sharing
the hidden layersbetween alltasks,whilekeeping

outputlayers.Hard parameter
several task-specific 4.5:Paranmeter Typingand Parameter
sharinggreatlyreducesthe riskof overfitting. Sharing
Insoft parameter sharingon the other hand,each
task has itsown model with its own parameters. Q.16 What is a parametric machine learning
algorithm and nonparametric machine learning
The distancebetween the parameters of the model
is then regularized in order to encourage the algorithm?
Ans.:Parametricmachinelearning
algorithm:
parameters to be similar.
Parametricmodel is one thatcan be parameterized
a.15Discussabout earlystopping by a finite
number of parameters.
Ans. :
Learningmodel thatsummarizes data with a set of

Early stoppingis a techniquethat is very often parameters of fixed size is called a parametric
used when trainingneural networks, as well as model.
with some other iterative machine learning No matter how much data you throw at a
algorithms. parametric model, it won't change itsmind about
how many parametersitneeds.
Early stopping is a technique for controlling
overfittingin machine learning models, especially Given the parameters,futurepredictions (x) are
neural networks, by stoppingtraining before the independentof D) =
the observed data (D):P(xl6,
P(xl6) therefore6 capture everythingthere is to
weights have converged.
know about the data.
This means we can obtaina model with better
validation set error by returning
to the parameter
So the complexity of the model is bounded even if
the amount of data is unbounded.Thismakes them
setting at the point in time with the lowest
not very flexible.
validationseterror.
Some more examples of parametric machine
Every time the erroron the validation
set improves,
learning algorithms include: Logisticregression,
we storea copy the model parameters. lineardiscriminantanalysis,
perceptron,naive bayes
When the training
algorithm terminates,we return and simple neuralnetworks.
theseparameters,ratherthan the latestparameters. Benefits of Parametric Machine Learning
The algorithm terminateswhen no parameters have Algorithms:
improved over the best recorded validation error 1.Simpler These methods are easier too

for some pre-specified This


number of iterations. understand and results.
interpret
strategyis known as earlystopping.

An up thrustforknowledge DECODE
TTECHNICAL PUBLICATIONS
NeuralNetworks and Deep Learning 4-8 forDeep Learning
Regularization

2. Speed:Parametricmodels are very fastto learn Benefits of nonparametric machine learning


from data. algorithms
3. Less data : They do not require as much 1.Flexibility: of a large number
Capable fitting
training data and can work well even if the fit of functional
forms.
to the data is not perfect.
2. Power:No assumptions (orweak assumptions)
Limitations of parametriC machine learning about theunderlying
function.
algorithms: 3. Performance:Can resultinhigherperformance
1.Constrained By choosinga functional form models for prediction.
these methods are highly constrained to the Limitationsof nonparametric machine learning
specifiedform. algorithms
2. Limitedcomplexity:The methods are more
1.More data to
data : Requirea lotmore training
suitedtosimpler problems.
estimatethe mapping function.
3. Poor Fit :In practice
the methods are unlikely
2. Slower:A lot slower to trainas they often
to match theunderlying
mapping function. have farmore parameters to train.
Nonparametricmodel :
3. of a risk to overfit the
Overfiting:
More
A nonparametric model is one which cannot be trainingdata and it is harder to explainwhy
parametrizedby a fixednumber of parameters. predictions made.
specific are

Non-parametric models assume that the data Q.17 Differenc betweenn parametric and
distributioncannot be definedin terms of such a non-parametric modeling.
finiteset of parameters. But they can often be Ans.:
definedby assuming an infinite dimensionale.
Usually we think of e as a function. Parametricmodeling Non-parametric
modeling|
The amount of informationthat0 can capture about Parametric: Dataaredrawn Nonparametric:Dataare
the data D can grow as the amount of data grows. from a probability drawn from a certain
Thismakes them more flexible. of specificform| unspecified
distribution probability
up to unknown parameters.distribution.
Nonparametric methods seek tobest fitthe training
Ituses singleglobal model. There is no singleglobal
data in constructing
the mapPping function,whilst
model; localmodels are
maintainingsome ability to generalizeto unseen estimated as they are
data.As such,they are able to fita largenumber of needed.
functional
forms.
Flexible
discovery. Easy idea.
An easy to understand nonparametric model is the
Example: Logistic Example: k-nearest
k-nearest neighbors algorithm that makes regression,linear neighbors,decision trees
based
predictions on the k most sinmilar training discriminantanalysis, likecartand support
vector
patternsfor a new data instance. perceptron,naivebayesand machines.
simple neural networks.
The method does not assume anythingabout the
form of the mapping function
otherthan patterns Benefit: Requires less data,|Benefit: Flexibility,
power
Simpler and speed. and performance.
that are close are likelyhave a similar output
variable. :Poor fitand
Limitation Limitation:
More data
limited
complexity slower and
required,
Examples of nonparametric machine learning
overfitting
algorithmsare:
It requiresless data. It more data.
requires
1.k-Nearest Neighbors
2. DecisionTrees like
CART and C4.5 Q.18 What is K-Nearest NeighbourMethods ?
3. SupportVectorMachines

An up thrustforknowledge
TECHNICALPUBLICATIONS OECODE)
NeuralNetworks and Deep Learning 4-9 Regularization
forDeep Learning

Ans.: The K-nearest neighbor(KNN) is a classical 3. The number of neighbors used to classifythe
method and requiresno training
classification effort, new example.
critically
depends on the qualityof the distance
measures among examples. 4.6:Bagging and OtherEnsemble|
Methods
The KNN classifieruses Mahalanobis distance
A sample is classified
function. accordingto the
K Q.22 Explainensemble learningmethod.
majority vote of itsnearest samples in
training
the feature space. Distance of a sample to its Ans.: The ideaof ensemble learningis to employ
learnersand combinetheir Ifwe
neighborsisdefinedusinga distancefunction multiple predictions.
have a committee of M models with uncorrelated
Q.19 Listout the steps that need to be caried
errors,simply by averagingthem the average errorof
out duringthe KNN algorithm.
a model can be reduced by a factorof M.
Ans.:Steps are as follows:
a. Divide the data intotraining
and test data. Unfortunately,the key assumption that the errors
due to the individualmodels are uncorrelatedis
b. Selecta value K.
in practice,the errors are typically
unrealisticC
C. Determine which distancefunction
is tobe used. so the reductionin overallerror
highlycorrelated,
d. Choose a sample from the testdata thatneeds to is generallysmall.
and compute the distance to itsn
be classified
Ensemble modeling is the process of runningtwo
training
samples
or more relatedbut different
analyticalmodels and
and take thek-nearest
e. Sort the distancesobtained
then synthesizingthe resultsintoa singlescore or
data samples.
spread in order to improve the accuracy of
Assign the test classto the classbased on the and data mining applications.
predictive
analytics
majorityvote of itsK neighbors.
Ensemble of classifiers
isa set of classifiers
whose
Q.20 What are the advantages and disadvantages individual
decisions combined in some to
way
of KNN ?
new examples.
classify
Ans.:Advantages
1. The KNN algorithmis very easy to implement. Ensemble methods combine several decisiontrees
to produce betterpredictive
classifiers
2 Nearly optimalinthe largesample limit. performance
The main
than a single decision tree classifier.
which can
3. Uses localinformation, yieldhighly
behind the ensemble model is that a
principle
adaptivebehavior.
group of weaklearnerscome together to form a
4. Lends itself very easily to parallel
strong learner,thus increasingthe accuracy of the
implementations.
model.
Disadvantages
1. Thereare two approaches for combiningmodels :
Large storagerequirements.
votingand stacking.
intensiverecall.
2. Computationally
3. Highlysusceptible
to the curse of dimensionality.
In voting no learningtakesplace at the meta level
when combiningclassifiersby a votingscheme.
Q.21 Which are the performance factors that Label that is most often assigned to a particular
influenceKNN algorithm
?
instanceis chosen as the correctprediction when
Ans.: The performance of the KNN algorithmis
usingvoting.
influenced
by threemain factors:
1. The distancefunctionor distancemetric used to Stackingis concerned with combining multiple

2
determine the nearestneighbors.
The decision
rule used
from the K-nearest
to derive a classification

neighbors.
,
classifiers
generatedby different
y
by a featurevectorSi
learningalgorithms
on a singledatasetS,which is composed
"(j,ti)
TECHNICAL An up thrust
forknowiedge
PUBLICATIONS DECODE
Neural Networks and Deep Learning 10 forDeep Learning
Regularization

The stacking process can be broken into two 1. Variancereduction: If the trainingsets are
phases it willalways helps to
completely independent,
1.Generate a set of base-levelclassifiers
Where C = 4(S)
C,.,
CN average an ensemble because this will reduce
variancewithoutaffecting
bias (e.g,bagging)and
2. Train a meta-level classifier to combine the to individual
reduce sensitivity data points.
outputsof the base-levelclassifiers 2. Bias reduction:For simple models, average of
Fig.
Q.22.1shows stackingframe. models has much greater capacitythan single
model Averaging models can reduce bias
The trainingset for the meta-level classifier
is
and control

.
a leave-one-out
cross validation by increasingcapacity,
substantially
generated through
varianceby Citting
one component at a time.
process.
vi 1,.n
and vk=1,
N :Ck Q.23 What is bagging? Explainbaggingsteps.List
itsadvantages and disadvantages.
LyS s) Ans.: Bagging is alsocalledBootstrapaggregating.
The learned
are then used to generate
classifiers Bagging and boostingare meta-algorithmsthat pool
for
predictions s:9 = C,*i) decisions from multiple classifiers. It creates
The meta-leveldatasetconsistsof examples of the ensembles by repeatedly randomly resampling the
form data.
( »yi) where the featuresare the training
predictions of the base-levelclassifiers
and the class method of ensemble
Baggingwas the first
effective
is the correctclassof the example inhand. learning and is one of the simplest methods of
methods work? arching. The meta-algorithm,which isa specialcase
Why do ensemble
of the model averaging,was originallydesigned for
Based on one of two basicobservations: and is usuallyaPpliedto decision
classification tree
models, but itcan be used with any type of model
forclassification
or regression.

Training
set

Hypotheses -Training
observations

on
redictions
observations
training

Meta leaner

Metalearner - Lest observation

prediction P
Final

Fig.Q.22.1Stackingframe

TECHNICAL PUBLICATIONS An up thrust


forknowledge DECODE
Neural
Netuworks and Deep Learning 4-11 forDeep Learning
Regularization

Q.23 What is bagging ? Explain bagging steps. List its advantages and disadvantages.
Ans.: Bagging is also called Bootstrap aggregating. Bagging and boosting are meta-algorithms that pool decisions from multiple classifiers. Bagging creates ensembles by repeatedly randomly resampling the training data.
• Bagging was the first effective method of ensemble learning and is one of the simplest methods of arching. The meta-algorithm, which is a special case of model averaging, was originally designed for classification and is usually applied to decision tree models, but it can be used with any type of model for classification or regression.
• Ensemble classifiers such as bagging, boosting and model averaging are known to have improved accuracy and robustness over a single model. Although unsupervised models, such as clustering, do not directly generate a label prediction for each individual, they provide useful constraints for the joint prediction of a set of related objects.
• For a given training set of size n, bagging creates m samples of size n by drawing n examples from the original data with replacement. Each bootstrap sample will on average contain 63.2 % of the unique training examples; the rest are replicates. Bagging combines the m resulting models using a simple majority vote.
• In particular, on each round the base learner is trained on what is often called a "bootstrap replicate" of the original training set. Suppose the training set consists of n examples. Then a bootstrap replicate is a new training set that also consists of n examples, and which is formed by repeatedly selecting, uniformly at random and with replacement, n examples from the original training set. This means that the same example may appear multiple times in the bootstrap replicate, or it may not appear at all.
• Bagging also decreases error by decreasing the variance in the results due to unstable learners, i.e. algorithms (like decision trees) whose output can change dramatically when the training data is slightly changed.
Pseudocode :
1. Given training data (x1, y1), ..., (xm, ym).
2. For t = 1, ..., T :
   a. Form a bootstrap replicate dataset St by selecting m random examples from the training set with replacement.
   b. Let ht be the result of training the base learning algorithm on St.
3. Output the combined classifier : H(x) = majority(h1(x), ..., hT(x)).
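A minimal Python sketch of this pseudocode (the toy data and the decision-tree base learner are assumptions for illustration; in practice a ready-made implementation such as sklearn.ensemble.BaggingClassifier would normally be used) :

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
m, T = len(X), 25
rng = np.random.default_rng(0)

models = []
for t in range(T):
    # Bootstrap replicate: m examples drawn with replacement
    idx = rng.integers(0, m, size=m)
    models.append(DecisionTreeClassifier(random_state=t).fit(X[idx], y[idx]))

# Combined classifier: majority vote over the T models
votes = np.stack([clf.predict(X[:5]) for clf in models])          # shape (T, 5)
H = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print(H)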

Bagging Steps :
1. Suppose there are N observations and M features in the training dataset. A sample from the training dataset is taken randomly with replacement.
2. A subset of the M features is selected randomly, and whichever feature gives the best split is used to split the node iteratively.
3. The tree is grown to the largest possible size.
4. The above steps are repeated n times, and the prediction is given based on the aggregation of the predictions from the n trees.
Advantages of Bagging :
1. Reduces over-fitting of the model.
2. Handles higher dimensionality data very well.
3. Maintains accuracy for missing data.
Disadvantages of Bagging :
1. Since the final prediction is based on the mean of the predictions from the subset trees, it won't give precise values for classification and regression models.


Q.24 Explain boosting steps. List advantages and disadvantages of boosting.
Ans.: Boosting Steps :
1. Draw a random subset of training samples d1 without replacement from the training set D to train a weak learner C1.
2. Draw a second random training subset d2 without replacement from the training set and add 50 percent of the samples that were previously falsely classified/misclassified, to train a weak learner C2.
3. Find the training samples d3 in the training set D on which C1 and C2 disagree, to train a third weak learner C3.
4. Combine all the weak learners via majority voting.
Advantages of Boosting :
1. Supports different loss functions.
2. Works well with interactions.
Disadvantages of Boosting :
1. Prone to over-fitting.
2. Requires careful tuning of different hyper-parameters.

4.7 : Adversarial Training

Q.25 What is adversarial training ? Explain.
Ans.: Adversarial machine learning is a technique employed in the field of machine learning which attempts to fool models through malicious input. This technique can be applied for a variety of reasons, the most common being to attack or cause a malfunction in standard machine learning models.
• Adversarial training helps to illustrate the power of using a large function family in combination with aggressive regularization.
• Purely linear models, like logistic regression, are not able to resist adversarial examples because they are forced to be linear. Neural networks are able to represent functions that can range from nearly linear to nearly locally constant and thus have the flexibility to capture linear trends in the training data while still learning to resist local perturbation.
• Adversarial examples also provide a means of accomplishing semi-supervised learning. Adversarial examples generated using not the true label but a label provided by a trained model are called virtual adversarial examples.

4.8 : Tangent Distance

Q.26 Write a short note on tangent propagation.
Ans.: The tangent prop algorithm trains a neural net classifier with an extra penalty to make each output f(x) of the neural net locally invariant to known factors of variation.
• As with the tangent distance algorithm, the tangent vectors are derived a priori, usually from formal knowledge of the effect of transformations such as translation, rotation and scaling in images.
• The tangent distance algorithm is a popular method that estimates the manifold distance by employing a linear approximation of the transformation manifolds.
• Tangent prop has been used for supervised learning and reinforcement learning.
• Tangent propagation is closely related to dataset augmentation. Tangent propagation is also related to double backprop and adversarial training.
• Double backprop regularizes the Jacobian to be small, while adversarial training finds inputs near the original inputs and trains the model to produce the same output on these as on the original inputs.
• Tangent propagation and dataset augmentation using manually specified transformations both require that the model be invariant to certain specified directions of change in the input.
• The manifold tangent classifier eliminates the need to know the tangent vectors a priori.
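A minimal NumPy sketch of the tangent prop penalty (an illustrative assumption, not the book's code) for a single logistic unit f(x) = sigmoid(w · x) : the penalty is the squared directional derivative of the output along a known tangent vector v, approximated here with a central finite difference.

import numpy as np

def f(x, w):
    # Tiny "network": a single logistic unit
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

def tangent_prop_penalty(x, w, v, h=1e-4):
    # Directional derivative of f along the tangent direction v,
    # approximated with a central finite difference, then squared.
    d = (f(x + h * v, w) - f(x - h * v, w)) / (2 * h)
    return d ** 2

w = np.array([0.5, -1.2, 0.3])
x = np.array([1.0, 2.0, -1.0])
v = np.array([0.0, 1.0, 0.0])     # assumed tangent direction for this example

data_loss = -np.log(f(x, w))      # cross-entropy for label y = 1
lam = 0.1                         # regularization strength
total_loss = data_loss + lam * tangent_prop_penalty(x, w, v)
print(total_loss)

# In practice the penalty is summed over training points and tangent vectors,
# and the whole regularized objective is minimized with gradient-based training.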


Fill in the Blanks for Mid Term Exam
Q.1 In machine learning, ______ is a way to prevent over-fitting.
Q.2 Regularization reduces over-fitting by adding a penalty to the ______ function.
Q.3 ______ simplifies a machine learning problem by choosing which subset of the available features should be used.
Q.4 Underfitting occurs when the model is not able to obtain a sufficiently low error value on the ______.
Q.5 Overfitting occurs when the gap between the training error and ______ is too large.
Q.6 The sparsity property induced by ______ regularization has been used extensively as a feature selection mechanism.
Q.7 Each penalty is a product between a coefficient, called a ______ multiplier, and a function representing whether the constraint is satisfied.
Q.8 Injecting noise in the input to a neural network can also be seen as a form of ______.
Q.9 ______ can be interpreted as a way of regularizing a neural network by adding noise to its hidden units.
Q.10 Early stopping requires a ______, which means some training data is not fed to the model.
Q.11 Bagging means ______.
Q.12 The technique called ______ constructs an ensemble with higher capacity than the individual models.
Q.13 Adversarial examples generated using not the true label but a label provided by a trained model are called ______ examples.
Q.14 A related technique, ______ distance, is used to build invariance properties into distance-based methods such as nearest-neighbor classifiers.
Q.15 Tangent propagation is closely related to ______.
Multiple Choice Questions for Mid Term Exam
Q.1 L regularization is also known as ______.
a) Ridge regression   b) Lasso regression   c) Linear regression   d) All of these
Q.2 In L regularization, the regularization term is the ______ of all feature weights.
a) Absolute value   b) Sum of squares   c) Square   d) Sum
Q.3 Input noise injection is part of some ______ learning algorithms, such as the denoising autoencoder.
a) Semi-supervised   b) Supervised   c) Unsupervised   d) Deep


Q.4 The sum of the absolute values of the weights is called ______.
Q.5 Dropout boosting trains the entire ensemble to jointly maximize the log-likelihood on the ______.
a) Training set   b) Testing set   c) Training set and testing set   d) None

Answer Key for Fill in the Blanks
Q.1 regularization   Q.2 loss   Q.3 Feature selection   Q.4 training set   Q.5 test error
Q.6 L1   Q.7 Karush-Kuhn-Tucker   Q.8 data augmentation   Q.9 Dropout   Q.10 validation set
Q.11 bootstrap aggregating   Q.12 boosting   Q.13 virtual adversarial   Q.14 Tangent   Q.15 dataset augmentation

Answer Key for Multiple Choice Questions
Q.2 b   Q.3 c   Q.4 b   Q.5 b

END.



UNIT - V
Optimization for Training Deep Models

5.1 : Challenges in Neural Network Optimization

Q.1 What is convex optimization ?
Ans.: Convex optimization involves a function in which there is only one optimum, corresponding to the global optimum (maximum or minimum). There is no concept of local optima for convex optimization problems, making them relatively easy to solve.

Q.2 What is non-convex optimization ?
Ans.: Non-convex optimization involves a function which has multiple optima, only one of which is the global optimum. Depending on the loss surface, it can be very difficult to locate the global optimum.

Q.3 What is empirical risk minimization ?
Ans.: Given a training set S and a function space H, empirical risk minimization is the class of algorithms that look at S and select fS as
fS = arg min f∈H IS[f],
the function that minimizes the average training error on S. The training process based on minimizing this average training error is known as empirical risk minimization.

Q.4 Discuss maximum-likelihood estimation.
Ans.: Maximum likelihood, also called the maximum likelihood method, is the procedure of finding the value of one or more parameters for a given statistic which makes the known likelihood distribution a maximum.
• Maximum likelihood estimation is a totally analytic maximization procedure. It applies to every form of censored or multi-censored data, and it is even possible to use the technique across several stress cells and estimate acceleration model parameters at the same time as life distribution parameters.
• Moreover, MLEs and likelihood functions generally have very desirable large sample properties :
1. They become unbiased minimum variance estimators as the sample size increases.
2. They have approximate normal distributions and approximate sample variances that can be calculated and used to generate confidence bounds.
3. Likelihood functions can be used to test hypotheses about models and parameters.
• The likelihood of the observed data, given the model parameters θ, is the conditional probability that the model M with parameters θ produces the observations x[1], ..., x[n] :
L(θ) = Pr(x[1], ..., x[n] | θ, M)
In MLE we seek the model parameters θ that maximize the likelihood.
Likelihood Method :
• Observations are outcomes of random experiments. The outcome is represented by a random variable (x). The distribution of possible outcomes is given by a probability distribution.
• The same observations can be generated by different models, and different observations may be generated by the same model. The probability model predicts an outcome and associates a probability with each outcome.
• Suppose the entire result of an experiment is a collection of numbers (x), and suppose the joint pdf for the data x is a function that depends on a set of parameters θ, written f(x ; θ). Then the likelihood function is given as :
L(θ) = f(x ; θ)

• Consider n independent observations of x : x1, x2, x3, ..., xn−1, xn, where x follows f(x ; θ). The joint pdf for the whole data sample is :
f(x1, x2, ..., xn ; θ) = Π (i = 1 to n) f(xi ; θ)
• The likelihood function for the above is :
L(θ) = Π (i = 1 to n) f(xi ; θ)
• For example, for a normal model the maximum likelihood estimator of the mean is given by the following formula :
μ̂ = (x1 + x2 + x3 + ... + xn−1 + xn) / n
• The sample standard deviation is given by :
s = sqrt( Σ (i = 1 to n) (xi − μ̂)² / (n − 1) )

Q.5 Which factors are considered for minibatch sizes ?
Ans.: The following factors are considered :
1. Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.
2. Multicore architectures are usually underutilized by extremely small batches.
3. If all examples in the batch are to be processed in parallel, then the amount of memory scales with the batch size. For many hardware setups this is the limiting factor in batch size.
4. Some kinds of hardware achieve better runtime with specific sizes of arrays.
5. Small batches can offer a regularizing effect, perhaps due to the noise they add to the learning process. Generalization error is often best for a batch size of 1.

Q.6 List the challenges in optimization.
Ans.: Challenges are ill-conditioning, local minima, plateaus, saddle points and other flat regions, cliffs and exploding gradients, long-term dependencies, inexact gradients, and poor correspondence between local and global structure.

Q.7 Explain the ill-conditioning problem of neural network optimization.
Ans.: The ill-conditioning problem is generally believed to be present in neural network training problems. Ill-conditioning can manifest by causing SGD to get "stuck" in the sense that even very small steps increase the cost function.
• Ill-conditioning of the Hessian matrix is a prominent problem in most numerical optimization problems, convex or otherwise.
• In multiple dimensions, there is a different second derivative for each direction at a single point. The condition number of the Hessian at this point measures how much the second derivatives differ from each other.
• When the Hessian has a large condition number, gradient descent performs poorly. This is because in one direction the derivative increases rapidly, while in another direction it increases slowly.
• Gradient descent is unaware of this change in the derivative, so it does not know that it needs to explore preferentially in the direction where the derivative remains negative for longer.

Q.8 Define weight space symmetry.
Ans.: If we have m layers with n units each, then there are (n!)^m ways of arranging the hidden units. This kind of non-identifiability is known as weight space symmetry.

Q.9 What is a saddle point ?
Ans.: A saddle point is a point where the Hessian matrix has both positive and negative eigenvalues.
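A small NumPy illustration (assumed for illustration, not from the text) connecting Q.7 and Q.9 : for a quadratic f(x) = 0.5 xᵀAx the Hessian is A, so its eigenvalues reveal both the condition number and whether the critical point is a minimum or a saddle point.

import numpy as np

# For f(x) = 0.5 * x^T A x the Hessian is simply A (constant everywhere).
H_ill = np.array([[100.0, 0.0],
                  [0.0,   1.0]])    # positive definite but ill-conditioned
H_saddle = np.array([[2.0, 0.0],
                     [0.0, -3.0]])  # indefinite: the origin is a saddle point

for name, H in [("ill-conditioned", H_ill), ("indefinite", H_saddle)]:
    eig = np.linalg.eigvalsh(H)                 # eigenvalues of the symmetric Hessian
    cond = abs(eig).max() / abs(eig).min()      # condition number
    kind = "minimum" if (eig > 0).all() else "saddle point"
    print(name, "eigenvalues:", eig, "condition number:", cond, "->", kind)

# A large condition number (here 100) means gradient descent must use a step
# size small enough for the steep direction, so progress along the flat
# direction is very slow - the behaviour described in Q.7.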


Q.10 What do you mean by exploding gradients ?
Ans.: Large updates to weights during training can cause a numerical overflow or underflow, often referred to as "exploding gradients".
• A common and relatively easy solution to the exploding gradients problem is to change the derivative of the error before propagating it backward through the network and using it to update the weights.
• Two approaches include rescaling the gradients given a chosen vector norm and clipping gradient values that exceed a preferred range. Together, these methods are referred to as "gradient clipping".

Q.11 What is stochastic gradient descent ?
Ans.: Much of machine learning can be written as an optimization problem. Example loss functions : logistic regression, linear regression, principal component analysis, neural network loss.
• A very efficient way to train logistic models is with Stochastic Gradient Descent (SGD).
• One challenge with training on power-law data (i.e. most data) is that the terms in the gradient can have very different strengths.
• The idea behind stochastic gradient descent is iterating a weight update based on the gradient of the loss function :
w(k+1) = w(k) − γ ∇L(w(k))
• Logistic regression is designed as a binary classifier (output say {0, 1}) but actually outputs the probability that the input instance is in the "1" class. A logistic classifier has the form :
p(X) = 1 / (1 + exp(−X β))
where X = (x1, ..., xn) is a vector of features.
• Stochastic gradient descent has some serious limitations, however, especially if the gradients vary widely in magnitude : some coefficients change very fast, others very slowly. This happens for text, user activity and social media data (and other power-law data), because gradient magnitudes scale with feature frequency, i.e. over several orders of magnitude. It is not possible to set a single learning rate that trains the frequent and infrequent features at the same time.

Q.12 What is momentum ?
Ans.: Momentum methods in the context of machine learning refer to a group of tricks and techniques designed to speed up the convergence of first order optimization methods like gradient descent.

5.2 : Parameter Initialization Strategies and Algorithms with Adaptive Learning Rates

Q.13 Write a short note on parameter initialization strategies.
Ans.: Training algorithms for deep learning models are usually iterative and thus require the user to specify some initial point from which to begin the iterations.
• The initial point can determine whether the algorithm converges at all, with some initial points being so unstable that the algorithm encounters numerical difficulties and fails altogether.
• When learning does converge, the initial point can determine how quickly learning converges and whether it converges to a point with high or low cost.
• Points of comparable cost can have wildly varying generalization error, and the initial point can affect the generalization as well.
• If two hidden units with the same activation function are connected to the same inputs, then these units must have different initial parameters. If they have the same initial parameters, then a deterministic learning algorithm applied to a deterministic cost and model will constantly update both of these units in the same way.

An example of stochastic gradient descent with perceptron loss (the example referred to in Q.11) :
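A minimal sketch, assuming toy data and illustrative hyperparameter values, using scikit-learn's SGDClassifier with the perceptron loss :

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Toy binary classification data, purely for illustration
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# SGD with the perceptron loss; eta0 is the (constant) learning rate gamma
clf = SGDClassifier(loss="perceptron", learning_rate="constant",
                    eta0=0.01, max_iter=1000, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))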

Q.14 Explain the following : a) AdaGrad b) Adam
Ans.: a) AdaGrad
• The AdaGrad algorithm is just a variant of preconditioned stochastic gradient descent.
• AdaGrad individually adapts the learning rates of all model parameters by scaling them inversely proportional to the square root of the sum of all the historical squared values of the gradient.
• The parameters with the largest partial derivative of the loss have a correspondingly rapid decrease in their learning rate, while parameters with small partial derivatives have a relatively small decrease in their learning rate. The net effect is greater progress in the more gently sloped directions of parameter space.
• AdaGrad is designed to converge rapidly when applied to a convex function.
• One of AdaGrad's main benefits is that it eliminates the need to manually tune the learning rate.
• AdaGrad's main weakness is its accumulation of the squared gradients in the denominator : since every added term is positive, the accumulated sum keeps growing during training. This in turn causes the learning rate to shrink and eventually become infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge.
• AdaGrad shrinks the learning rate according to the entire history of the squared gradient and may have made the learning rate too small before arriving at such a convex structure.
b) Adam
• Adaptive Moment Estimation (Adam) is another method that computes adaptive learning rates for each parameter. It keeps an exponentially decaying average of past gradients.
• Adam is generally regarded as being fairly robust to the choice of hyperparameters, though the learning rate sometimes needs to be changed from the suggested default.
• In Adam, momentum is incorporated directly as an estimate of the first-order moment (with exponential weighting) of the gradient.
• The weight update for Adam is given by :
Compute gradient : g ← ∇θ (1/m) Σi L(f(x(i) ; θ), y(i)),  t ← t + 1
Update biased first moment estimate : s ← ρ1 s + (1 − ρ1) g
Update biased second moment estimate : r ← ρ2 r + (1 − ρ2) g ⊙ g
Correct bias in first moment : ŝ = s / (1 − ρ1^t)
Correct bias in second moment : r̂ = r / (1 − ρ2^t)
Compute update : Δθ = −ε ŝ / (√r̂ + δ)   (operations applied element-wise)
Apply update : θ ← θ + Δθ
• Since s and r are initialized as zeros, they are biased during the initial steps of training; the correction terms for both moments account for their initialization near the origin.

5.3 : Approximate Second-Order Methods

Q.15 Explain Newton's method.
Ans.: In contrast to first order gradient methods, second order methods make use of second derivatives to improve optimization.
• The most widely used second order method is Newton's method. It is described in more detail here, emphasizing neural network training.
• It is based on a Taylor series expansion to approximate J(θ) near some point θ0, ignoring derivatives of higher order :
J(θ) ≈ J(θ0) + (θ − θ0)ᵀ ∇θ J(θ0) + (1/2)(θ − θ0)ᵀ H (θ − θ0)
where H is the Hessian of J with respect to θ, evaluated at θ0.
• Solving for the critical point of this function, we obtain the Newton parameter update rule :
θ* = θ0 − H⁻¹ ∇θ J(θ0)
• Thus for a quadratic function (with positive definite H), by rescaling the gradient by H⁻¹, Newton's method jumps directly to the minimum.
• If the objective function is convex but not quadratic (there are higher-order terms), this update can be iterated, yielding the training algorithm given next.

Q.16 What is the conjugate gradient method ?
Ans.: The conjugate gradient method is the most prominent iterative method for solving sparse systems of linear equations. It is a method to efficiently avoid the calculation of the inverse Hessian by iteratively descending conjugate directions.
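A tiny NumPy illustration of the Q.15 update (the quadratic objective below is an assumed toy example) : on J(θ) = 0.5 θᵀAθ − bᵀθ with positive definite A, a single Newton step from any starting point lands exactly on the minimizer.

import numpy as np

A = np.array([[3.0, 0.5],
              [0.5, 2.0]])        # positive definite Hessian of the quadratic
b = np.array([1.0, -1.0])

def grad(theta):
    return A @ theta - b          # gradient of J(theta) = 0.5 theta^T A theta - b^T theta

theta0 = np.array([10.0, -7.0])   # arbitrary starting point
H = A                             # for a quadratic, the Hessian is constant

# Newton update: theta* = theta0 - H^{-1} grad(theta0)
theta_star = theta0 - np.linalg.solve(H, grad(theta0))

print(theta_star)                            # equals the exact minimizer A^{-1} b
print(np.linalg.solve(A, b))
print(np.allclose(grad(theta_star), 0.0))    # gradient vanishes after one step

# The conjugate gradient method of Q.16 reaches the same point without ever
# forming H^{-1}, by successive line searches along conjugate directions.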


5.4 : Optimization Strategies and Meta-Algorithms; Applications

Q.17 What is batch normalization ?
Ans.: Batch normalization is one of the most exciting innovations in deep learning; it has significantly stabilized the learning process and allowed faster convergence rates.
• Most deep learning networks are compositions of many layers or functions, and the gradient with respect to one layer is taken considering the other layers to be constant. However, in practice all the layers are updated simultaneously, and this can lead to unexpected results.
• For example, let ŷ = x w1 w2 ... w10. Here ŷ is a linear function of x but not a linear function of the weights. Suppose the gradient is given by g and we now intend to reduce ŷ by 0.1. Using a first-order Taylor series approximation, taking a step of −ε g would reduce ŷ by ε gᵀg.
• Because the updates to one layer are so strongly dependent on the other layers, choosing an appropriate learning rate is tough.
• Batch normalization takes care of this problem by using an efficient reparameterization of almost any deep network. Given a matrix of activations H, the normalization is given by :
H' = (H − μ) / σ
where μ is the vector of per-unit means, σ is the vector of per-unit standard deviations, and the subtraction and division are broadcast over the rows of H.

Q.18 What is coordinate descent ?
Ans.: In some cases, it may be possible to solve an optimization problem quickly by breaking it into separate pieces. If we minimize f(x) with respect to a single variable xj, then minimize it with respect to another variable xi, and so on, repeatedly cycling through all variables, we are guaranteed to arrive at a (local) minimum. This practice is known as coordinate descent.

Q.19 Explain how deep learning helps in the speech recognition problem.
Ans.: Speech recognition (also known as Automatic Speech Recognition (ASR), or computer speech recognition) is the process of converting a speech signal to a sequence of words, by means of an algorithm implemented as a computer program.
• Automatic Speech Recognition (ASR) is the task of producing a text transcription of the audio signal from a speaker.
• Large Vocabulary Speech Recognition (LVSR) involves a large or open vocabulary, and we will also take it to imply a speaker-independent recognizer. The most general LVSR systems may also need to operate on spontaneous as well as read speech, with audio data from an uncontrolled recording environment corrupted by both unstructured and structured noise.
• Traditional LVSR recognizers are based on Hidden Markov Models (HMMs) with Gaussian Mixture Model (GMM) emission distributions and n-gram language models, and use beam search for decoding. GMM-HMM acoustic models are trained with the Expectation Maximization (EM) algorithm, using a procedure that trains progressively more sophisticated models, each starting from an initial alignment produced by the previous model.
• In speech recognition, the main goal of the feature extraction step is to compute a parsimonious sequence of feature vectors providing a compact representation of the given input signal. The feature extraction is usually performed in three stages :
1. The first stage is called speech analysis or the acoustic front end. It performs some kind of spectro-temporal analysis of the signal and generates raw features describing the envelope of the power spectrum of short speech intervals.
2. The second stage compiles an extended feature vector composed of static and dynamic features.
3. Finally, the last stage transforms these extended feature vectors into more compact and robust vectors that are then supplied to the recognizer.


In speech recognitiona supervised pattern Q.5 In contrastto standard optimization,


training
system is trained with labeled
classification algorithmsdo not haltat butwhen
earlystoppinghaltcriterion
issatisfied.
examples; that is,each inputpattern has a class
labelassociatedwithit. Q.6 Gradient Method is the most
method for solvingsparse
prominentiterative
Pattern classifiers
can also be trained in an
systems of linearequations.
unsupervised fashion. For example in a technique
known as vectorquantization, some representation MultipleChoiceQuestions forMid Term Exam
of the inputdata is clusteredby finding implicit
Q.1 Adam standsfor
groupings in the data.
.Theresultingtableof clustercentersis known as a a AdaptiveMoment Estimation
codebook, which can be used to index new vectors
the clustercenterthat is closestto the
b AdaptiveDeep Moment
by finding
new vectors.
cAdaptive
Moment Learning
of these
Once a featureselectionor classification
dNone
procedure
finds a proper representation, a classifier
can be Q.2 A saddle pointis a pointwhere the

designed using a number of possibleaPproaches. has bothpositive


and negativeeigenvalues.
Q.20 What isnaturallanguageprocessing?
a diagonalmatrix
Ans.:Natural language processing is a form of
artificial that bsquarematrix
intelligence helps computers read and
respond by simulating the human abilitytoo cHessianmatrix
understand everyday language. dscalar
matrix
Many organizationsuse NLP techniquesto optimize Which of the followingis challenges of
Q.3
customer support, improve the efficiencyof text
the information optimization?
analyticsby easily finding they
need, and enhance socialmedia monitoring. aIll-conditioning bLocal
minima
For example, banks might implemernt NLP CPlateaus dAll of thesee
algorithms to optimizecustomer support; a large
consumer products brand might combine natural Q.1 Polyak averaging consists of averaging
language processing and semantic analysis to several points in the trajectorythrough
improve their knowledge management strategies parameter space visited by an
and socialmedia monitoring.
algorithm.
in the Blanksfor Mid Term Exam
Fill a non-optimization boptimization
Q1 Optimizationalgorithms that use the entire eboth (a)and (b) d None
set are calledbatchor deterministic inthe Blanks
Answer Key for Fill
gradientmethods.

Q.2 Optimizationalgorithms that use only a


o1 training 9.2 single
Q3 Q.4
example at a time are sometimes overftting Surrogate
called stochastiC and sometimes online Q.5 localminima Q.6 Conjugate
methods.
ChoiceQuestions
Answer Key for Multiple
Q.3 Empirical risk minimizationis prone to

Q.4 A acts
oss function a proxy to
as
Q1 Q.2 C Q.3
dQ4b
riskwhile "nice"
enough to be
empirical
optimized
being
efficiently
END...
SOLVED MODEL QUESTION PAPER
IV-II B.Tech., (CSE/IT)
Neural Networks and Deep Learning
(R18 Pattern)

Time : 3 Hours]                                   [Maximum Marks : 75

Note : This question paper contains two parts A and B. Part A is compulsory and carries 25 marks. Answer all questions in Part A. Part B consists of 5 Units. Answer any one full question from each unit. Each question carries 10 marks and may have a, b, c as sub questions.
PART - A (25 Marks)
Q.1 a) What is an artificial neural network ? (Refer Q.1 of Chapter 1) [2]
b) Define auto associative memory. (Refer Q.32 of Chapter 1)
c) What is a counter-propagation network ? (Refer Q.21 of Chapter 2) [2]
d) What is a winner-take-all learning network ? (Refer Q.5 of Chapter 2) [3]
e) List the applications of deep learning. (Refer Q.3 of Chapter 3) [2]
f) What is forward propagation ? (Refer Q.34 of Chapter 3) [3]
g) What is regularization ? (Refer Q.1 of Chapter 4)
h) What is the difference between L1 and L2 regularization ? (Refer Q.6 of Chapter 4)
i) What is convex optimization ? (Refer Q.1 of Chapter 5) [2]
j) What is natural language processing ? (Refer Q.20 of Chapter 5) [3]


PART - B (50 Marks)
Q.2 a) What is Bidirectional Associative Memory (BAM) ? (Refer Q.36 of Chapter 1) [5]
b) What is an Adaline ? Explain in detail. (Refer Q.21 of Chapter 1) [5]
OR
Q.3 a) Describe auto-associative memory. (Refer Q.33 of Chapter 1) [5]
b) Explain useful properties and capabilities of neural networks. (Refer Q.6 of Chapter 1) [5]
Q.4 a) Explain the Hamming net with a diagram. (Refer Q.9 of Chapter 2) [5]
b) Draw and explain the architecture of ART. What are its uses and types ? (Refer Q.29 of Chapter 2) [5]
OR
Q.5 a) Write a short note on learning vector quantization. (Refer Q.20 of Chapter 2) [5]
b) Explain unsupervised learning. (Refer Q.2 of Chapter 2) [5]
Q.6 a) Explain the architecture of Convolutional Neural Networks (ConvNet). (Refer Q.9 of Chapter 3) [5]
b) Define activation function. Explain the purpose of activation functions in multilayer neural networks. Give any two activation functions. (Refer Q.20 of Chapter 3) [5]
OR
Q.7 a) What is the vanishing gradient problem ? (Refer Q.4 of Chapter 3) [5]

b) List the limitations of the gradient descent algorithm. Explain gradient descent. (Refer Q.30 of Chapter 3) [5]
Q.8 a) What are the reasons for overfitting ? Explain overfitting. (Refer Q.2 of Chapter 4) [5]
b) Explain applications of convolutional neural networks. (Refer Q.7 of Chapter 4) [5]
OR
Q.9 a) What is dropout ? How does it solve the problem of overfitting ? (Refer Q.13 of Chapter 4) [5]
b) What is a parametric machine learning algorithm and a nonparametric machine learning algorithm ? (Refer Q.16 of Chapter 4) [5]
Q.10 a) Explain how deep learning helps in the speech recognition problem. (Refer Q.19 of Chapter 5) [5]
b) What is batch normalization ? (Refer Q.17 of Chapter 5) [5]
OR
Q.11 a) Which factors are considered for minibatch sizes ? (Refer Q.5 of Chapter 5) [5]
b) What is stochastic gradient descent ? (Refer Q.11 of Chapter 5) [5]
END...

