Professional Documents
Culture Documents
Decode - Neural Networks and Deep Learning
Decode - Neural Networks and Deep Learning
Decode - Neural Networks and Deep Learning
JNTUH R16
IV-ll(CSE/IT)
B.Tech.,
Elective-
Professional VI
NEURAL NETWORKS
&DEEP LEARNING
IreshA. Dhotre
M.E. (InformationTechnology)
SinhgodCollegeof Engineering,
Ex-Faculty,
Pune.
FEATURES
Writtenby PopularAuthors of TextBooks of TechnicalPublications
DECODE
A GuideForEngineering
Students
DECODE
A GuideFor EngineeringStudents
IV-JI
B.Tech., [CSE/IT]ProfessionalElective VI
with TechnicalPublications
Copyright
Allpublishingrights(printed and ebook version)reserved with TechnicalPublications.No partof this book
should be reproduced in any form, Electronic,Mechanical, Photocopy or any informationstorage and
retrievalsystem withoutpriorpermissionin writing,from TechnicalPublications,Pune.
Publishedby:
Residency, Ofice
No.1,
TECHNCALT Amit -
412,Shaniwar Peth,
PUBLICATIONSPune 411030,M.S.INDIAPh.:
+91-020-24495496/97,
: sales@technicalpublications.org
Website:www.technicalpublications.org
p-ut fowldgeEmail
FirstEdition:
2020
ISBN 978-93-89750-67-6
90789389"75067 JNTUH 1
UNIT-II
Unsupervised Learning Network: Introduction. FixedWeight CompetitiveNets, Maxnet. Hamming
Network.Kohonen Self-Organizing FeatureMaps. LearningVectorQuantization. Counter Propagation
(Chapter-9
UNIT III
Introduction
to Deep Learning:
Historical Deep Feed forwardnetworks.
Trends inDeep learning.
Gradient-Based Hidden Units,Architecture
learning. and Other Differentiation
Design.Back-Propagation
Algorithms.
(Chapter
-3)
UNIT IV
Regularizationfor Deep Learning Parameter norm Penalties,Norm Penalties as Constrained
Optimization, and
Regularization Under-Constrained
Problems, Dataset Noise
Augmentation. Robustness.
Semi-Supervised Multi-tasklearning.
learning. EarlyStopping. ParameterTyping and ParameterSharing.
UNIT- V
for TrainDeep Models:ChallengesinNeuralNetwork Optimization.,
Optimization Basic Algorithms,
ParameterInitialization
Strategies. with AdaptiveLearningRates,ApproximateSecond-Order
Algorithms
Methods. Optimization and
Strategies Meta-Algorithms.
Applications
Large-Scale Deep Learning. Natural Language
Speech Recognition.
Computer Vision.
Processing.(Chapter 5)
tü)
.
TABLE OF CONTENTS
Unit I Fill
...
intheBlanks withAnswers
1or Mid lerm Exam. -18
Chapter 1
1.1
1.2
1.3
Introduction ..
..
Artificial
NeuralNetworks
SupervisedLearningNetworks
(1-1)
to (1-24)
1-5
I-10
i
1
Multiple ChoiceQuestionswithAnswers
for Mid Term Exam. *************** 18
Unit
Chapter-4 Regularization
V
forDeep Learning
.3-1
Networks..
Back-propagation (4-1)to (4-14)
...1
1.4 Associative
1.5 Hopfield
Fill
Memory
Networks
Networks...
intheBlanks withAnswers
.1-13
-18
4.1
4.2
ParameterNorm
Norm Penalties
Penalties... 4-1|
Optimization.4-4
as Constrained
DatasetAugmentationand Noise Robustness.4-5
...
4.3
forMid 'TermExam.... ********** 1-22
ChoiceQuestionswithAnswers
Multiple
4.4 and EarlyStopping.. 4-6
learning
Multi-task
for 1-22
Mid TermExam 45 ParameterTyping
.4-7
and ParameterSharing...
Unit- II 4.6
Baggingandother Ensemble Methods...
.4-9
4-
2.1
Introduction....
2.2 Kohonen Self-Organizing
..
Chapter-2 UnsupervisedLearningNetwork
...2-1
(2-1) to (2 12)
FeatureMaps....2-4
2.3 LeaningVectorQuantization.
.
4.7
4.8
Fill
for
Training...
Adversarial
Tangent Distance..
4-13
12
- 12
2.4 Networks
Counter Propagation 7 for MidTerm Exam... . 4* 1D
..
forMid Term Exam 5.1 . 5-1
Multiple
forMid Term Exam
Unit
..
ChoiceQuestionswithAnswers
III
. 2-11
5.2
5.3
ChallengesinNeuralNetwork Optimization.
ParameterInitialization
withAdaptiveLearning Rates..
and Algorithms
Strategies
ApproximateSecond-OrderMethods
5 -3
5-4
.**. APplicanons
..
3.1 Introduction
toDeep Learning.. 3*i
.5-6
FillintheBlanks withAnswers
3.2
Deep FeedforwardNetworks.....
3-7 for
MidTerm Exam. 5-6
3.3
Learning..
Gradient-Based 3- IT ChoiceQuestionswithAnswers
Multiple
3.4 **********
14 -
forMid Term Exam
Design.
Architecture
SolvedModel QuestionPaper (M-1)to (M -2)
3.5 and
Back-Propagation
Other Differentiation
Algorithms. 3-14
(iv
1
Q.1 What
Ans.:An
1.1:
Introduction
is artificial
Artificial
neural network?
neuralnetwork consists of units,
(artificial)
connections and weights. Inputs and outputsare
numeric.
UNIT
NeuralNetworks
1
Q.5 Classifydifferent
Ans.:
types of neurons.
[INTU:May-17, Marks
Three major neuron groups make up this
classification:
multipolar, and unipolar
bipolar,
3]
a) Neural networks are adaptive methods thatcan a neural network can be designed
classification,
learn without any prior assumption of the to provideinformation not only about which
data. but also about the
particularpattern to select,
underlying
confidenceinthe decisionmade
b) Neural network, namely the feed forward
5. Uniformityof Analysis and Design : Neural
multilayerperception and radialbasis function
networks enjoy universality as information
network have been proven to be universal
processors
functional
approximations.
6. VLSIImplement-ability:
The massivelyparallel
c)Neuralnetworks are non-linearmodel with good nature of a neural network makes it potentially
generalization
ability. fastfor the computationof certain
tasks.
(- 1)
NeuralNetworks and Deep NeuralNetworks
Artificial
Learning
Axon hillock
Soma AXOn
Nucleus buttons
Dendrite Teminal
Fig.Q.7.1
Schematicof biological
neuron
TTECHNICALPUBLICATIONS An up thrust
forknowledge ECODE
NeuralNetworks and Deep Learning 1-3 NeuralNetworks
Artificial
Weakness:
information
Input
Input layer
Input
-Q
1. The addition
of too many hiddenunitsincites
the
Output layer
problem of over fitting
the data. Input Hiddenlayer
2. of the NN model can
The construction be a time
Fig.Q.13.1Artificial
neural network
consuming proceSs.
Neural Networks consists of many number of
simple elements (neurons)connected between them
insystem. Whole system isable to solve of complex
tasksand tolearnfor itlikea naturalbrain.
:
Q.14 What is appropriateproblems for neural
data)and Output vector(result). network?
A Neural Network is usually structuredinto an Ans. The backpropagation
algorithm is the most
inputlayer of neurons, one or more hidden layers commonly used ANN learning technique.It is
and one outputlayer. appropriate for problems with the following
characteristics:
Neurons belongingto adjecentlayers are usually
1.Instances are represented by many
fully connected and the various types and attribute-value
pairs.
are identified
architectures both by the different 2. The target function output may De
topologiesadopted for the connectionsas well by
discrete-valued,real-valued,or a vector of
the choiceof activation
function. severalreal-or discrete-valued
attributes.
The values of the functions
associatedwith the 3. The training
examples may containerrors.Long
connectionsarecalled"weights". times
training are acceptable.
4. Fast evaluationof the learned targetfunction
The whole game of usingNNs is inthe factthat,in
may be required.
order for the network to yield appropriateoutputs
5. The abilityfor humans to understand the
for giveninputs, the weight must be set to suitable
learnedtargetfunction
isnot important.
values.The way thisis obtained allows a further
distinctionamong modes of operations. Q.15Explain()Integrated
and Fire Neuron model
SpikingNeuron model.
(ii)
A neural network is a processingdevice,
eitheran
K[JNTU:May-17, Marks 5]
algorithm or actual hardware, whose design was
motivatedby the design and functioning
of human
Ans.:i) Integrated
and Fire Neuron
brainsand components thereof. .Theintegrate-and-fire
neuron model is one of the
most widely used models for analysing the
Most neural networks have some sortof "training
behavior of neural systems.
rule whereby the weights of connections are
from neuronj
axon Soma
synapse
t-1
SOmid
(t-
Fig.Q.15.1
The synapticinputsto the neuron are consideredto An essentialfeatureof the spikingneurons isthat
be stochasticand are described as a temporally can act as coincidencedetectors for the
they
homogeneous Poisson process. if they
incoming pulses, by detecting arrive in
Methods and resultsfor both current synapses and almost the same time
Detailedconductance-based neuron
of
models can
measurements
reproduce electrophysiological
but because of
to a
eir
t;u(t))- V dt
and du()
k=()
>0
An up thrustforknowledge DECODE
TECHNICALPUBLICATIONS
NeuralNetworks and Deep Learning 1-6 Neural Networks
Artificial
or inductive learning.
Classification
:
The task is commonly called Supervised learning, vector.
desiredresults.For
some
examples the correct
Fig.
Q.17.2shows inputvector.
are known
results(targets) and are givenin input
data
H
Training Leaming
algorithm
Training
Fig.Q.17.1
Model
Supervisedlearningprocess
Test
data
Testing
Accuracy
Input
Neural
netwonk
7 Actual
output
Error
signal
Desired generate
output
Fig.Q.17.2Input vector
TECHNICALPUBLICATIONS An up thrust
forknowiedge OECODE
Neural Netuworks and Deep Learning NeuralNetworks
Artificial
An up DECOD
TECHNICAL PUBLICATIONS thrustforknowledge
Neural Networks and Deep Learning 1-8 Neural Networks
Artificial
W21
way to use a two-input perceptronto implement the
b-1
AND function.
Perceptroncan represent all of
AND, OR, NAND
Boolean functions
NOR (1OR). Unfortunately,however, some
the primitive
(1AND), and
Boolean
9
1+X1+X2 0
functionscannot be represented by a single
perceptron,such as the XOR function whose value
1
is ifand only if xj X2
#
1-1 -1 90 Fig.
Q.21.1
shows
adaline
-1+X1+2 =0
If
i-
theinputconductancesare denoted by Wi,where
0, 1, 2,.n,and inputand outputsignalsby
x
and y, respectively,
then the outputof the central
O blockis defined
tobe
y w Xi+0
i =1
Where Wo
X:Class I (y=1)
In a simple physical implementation this device
consistsof a set of controllable connected
resistors
I
O: Class (y=-1)
to a circuit
which can sum up currentscaused by
Fig.Q.20.1 the inputvoltagesignals.Usually the centralblock
An up thrustforknowledge DECODE
TECHNICALPUBLICATIONS
NeuralNetuworks and
Deep Leaning
1-9 Neural Networks
Artificial
Level
Output
Error Quantizer
Gains Summer
inpur
pattern
switches -1 +1
Referencee
SWitch
Fig.Q.21.1
Ep p-op
Where
tpTarget
output
*
Op
Actualoutput of the Adaline O
The derivation
of
Ep with respect to each weight
Wiis
OEp
dWi -2(tp-°p)X
Fig.Q.22.1
separableclassescan be determined eitherby some input nodes, and feeds intoeach of the output
nodes.
learningprocedures or by solvinglinearequation
systems based on representativepatterns of each Fig.
Q.25.1shows multilayerfeed forward neural
classes networkk.
If such a decision
boundary does not exist,then the
tow classes
are said to be linearlyinseparable.
Feed forward
Linearly inseparableproblems cannot be solved by
the simple network,more sophisticatedarchitecture
isneeded. Y
X4
1.3Back-propagationNetworks 5ack propogation
Every node in the hidden layer and in the output Q.27 Why multilayerperceptron neural network ?
layer processes its weighted inputto produce an Ans.: Neural networks, with their remarkable
output. This can be done slightlydifferently
at the
abilityto derive meaning from complicated or
hiddenlayer,compared to the output layer.
to extractpatternsand
imprecisedata, can be
Inputunits:The input data you provideyour detecttrends thatare too complex to be noticedby
network comes through the input units. No eitherhumans or other computer techniques.
processingtakes place in an inputunit,it simply A trainedneural network can be thought of as an
feedsdata intothe system.
"expert"in the category of information
ithas been
Hiddenunits: The connectionscoming out of an givento analyse.
input unit have weights associatedwith them. A Thisexpertcan then be used to provideprojections
weight goingto hiddenunitzh from inputunitx new situations
o f interestarnd answer "what
would be labeledwhi Thebiasinputnode ( X) is given
connected to all the hidden units, with weightswho ifquestions.
Each hiddennode calculatesthe weightedsum of
Other advantagesinclude:
its inputsand applies a thresholdingfunctionto 1.Adaptive learning: tolearn how to
An ability
do tasksbased on the data givenfor training
or
determine the output of the hidden node. The
initial
experience.
weightedsum of the inputsfor hidden node zh is
as 2. One of the preferred techniques for gesture
calculated
recognition.
3. MLP/Neural networks do not make any
j=0 assumption regardingthe underlying probability
The density functions or other probabilistic
thresholding function appliedat the hidden
about the pattern classes under
node is typically eithera step function or a sigmoid information
considerationin comparison to other probability
function.The sigmoidfunction is sometimes called based models.
the "squashing" function, because it squashes its
to a value between 0 and
4. They yieldthe required decision function
input(i.e.,a) via training.
directly
Inmulti-layerfeed forward neural networks, the 5. A two layer backpropagation network with
sigmoid activationfunction,definedby hiddennodes has been proven to be a
sufficient
is normally used. universalapproximator
g) +ex
Q.28 Explainbackpropagationlearningrule.
The outputlayer:Functionally just
likethe hidden
Ans.: The net input of a node is defined as the
layers.Outputsare passed on to the world outside
theneural network. weighted sum of the incomingsignalsplus a bias
term. Fig.Q.23.1shows the backpropagationMLP for
Q.26 How does the network learn? node j.The net inputand output of node j is as
Ans.:The training
samples are passed through the follows
network and the outputobtainedfrom the network is
Where x1 is the outputof node ilocatedinany one Q.29 Write the characteristics
and applications
of
of the previouslayers, erorback propagation algorithm.
is the weight associated with the link Ans.:
W Characteristics
correspondingnodes iand j. Itis an algorithm for supervised learning of
W isthebias of node j. artificial
neuralnetworks usinggradientdescent.
applications.
put signais The algorithm is for two layersof sigmoidunits
4
and does stochastic
gradientdescent.
O- Applications:
The fast development
has
of
increasedthe
artificialsatellite
geodesy.
Input Output
layer layer the
Particularly, Global Positioning
System (GPS)
Hidde
provides morepractical,rapid, precise and
laveen
layer
continuouspositioning
results anywhere on the
oferrorcorrection
Back-propagation Earth in geodetic applicationswhen compared to
the traditional
terrestrial methods.
positioning
Fig.Q.28.2 Due to the increasinguse of GPS positioning
has been paidto the
techniques,a great attention
The MLP willreferto as a
above back propagation
precise determination of local/regionalgeoids,
3-4-3 network, corresponding to the number of
aimingat replacingthegeometric levelingwithGPS
nodes ineach layer. measurements.
The backward errorpropagation alsoknown as the BPANN method
Therefore, is easilyprogrammable
Backpropagation(BP)or the GeneralizedDeltaRule with decreased and increasednumber of reference
(GDR). A squared error measure for the p pointswhen generatinga localGPS, itperforms a
pairis definedas
input-output flexible
modelling.
DECO
TECHNICALPUBLICATIONS- An up thrustforknowledge
NeuralNetworks and Deep Learning 1-13 Neural Networks
Artificial
Q.30 List out merits and demerits of EBP. Q.33 Describe auto-associative
memory.
Ans.:Merits/Strength: Ans.
1. Computing time is reduced if weight chosen are An auto-associative
net remembers one or more
small at thebeginning patterns.For an auto-associative
net, the trainin8
2. Itminimizethe error inputand targetoutput vectorsare identical.
3. Batch update of weight exist,which provide Auto-associativememories are content based
on the weightcorrection.
smoothingeffects memories which can recalla storedsequence when
4. Simple method and easy for implementation. they are presented with a fragment or a noisy
versionof it.
inweight space
5. Minimum of theerrorfunction
6. Standard method and generallywork well They in de-noisingthe inputor
are very effective
removing interference
from the inputwhich makes
Demerits
them a promisingfirststep in solving the cocktail
1. Training
may sometime cause temporal instability
party problem.
to the system.
The simplestversion of auto-associative
memory is
2. For complex problem, ittakeslotof times.
linear associatorwhich is a 2-layer feed-forward
3. Selectionof number of hidden node in the
fullyconnectedneuralnetworkwhere the outputis
network is problem. constructedin a singlefeed-forward computation.
Backpropagation learning does not require The figurebelow illustrates
itsbasicconnectivity.
normalization ofinput vectors; howeve,
normalizationcouldimprove performance W11
5. It can get stuck in local minima resultingin
sub-optimalsolutions.
Wi
6. Slow and inefficient.
Wn
1.4:Associative
Memory Networks
Na Wri
Q.31 What is meant by associative
memory ?
Wnn
[INTU: May-17, Marks 2]
Ans.:An associative
memory can be consideredas a Fig.2.33.1
memory unitwhose storeddata can be identified
for Allinputsare connected to all outputs via the
ratherthanby
accessby the contentof the data itself connection weight matrix W where W denotes thee
an address or memory Associativememory
location. strength of unidirectionalconnectionf rom the in
TECHNICAL PUBLICATIONS
An up thrustforknowledge DECODE
Neural Netuworks and Deep Learning 1-14 NeuralNetworks
Artificial
the inputvoltagesignals.
Usually the centralblock
Ans.:Hebb rule is the simplest and most common
the summer isalso followed by a quantizerwhich
method of determiningweights for an associative
outputseither of +1dependingon the polarity
-1,
memory neuralnet.It canbe used with patternsare
of the sum.
vectors.
representedas eitherbinaryor bipolar
The problem is to determine the coefficients
Wi
Q.35 Explain delta
layer.
learning rule for
where i0,1...n,
in such a way thatthe input
multiperceptron is correct for a large number of
outputresponse
Ans.: An 8eneralization of the
important chosen signalsets.
arbitrarily
perceptron trainingalgorithm was presented by
Widrow and Hoff as the leastmean square learning If an exact mapping is not possiblethe average
errormust be minimized,forinstance,inthe sense
procedure alsoknown as the deltarule.
The learning rule was appliedto the "adaptive of leastsquares.
OLeve
W3
Error Quantizer
Gains
Input Summer
pattern
Switches
-1 +1
Reference
Switch
Adaline
Fig.Q.35.1
Forthe p"input-output
pattern,the errormeasure Q.36 What is Bidirectional
AssociativeMemory
Adalinecanbe expressed as,
of a single-output (BAM)?
Ans.
Where
Ep p-p)2 The Hopfield network represents an
tp Target output
Actual the Adaline type of memory. Itcan retrievea
auto-associative
op outputof corrupted or incomplete memory but cannot
associate this memory with another different
The derivation
of
Ep with respect to each weight memory.
Wi is
Human memory is essentiallyassociative.One
dp 2(tp-Op thing may remind us of another, and that of
another, and so on. We use a chain of mental
To decrease Ep by gradient descent, the update to recovera lostmemory.
associations
formula for Wi on the pth input-output
patternis If we forgetwhere we leftan umbrella,we try to
Ap Wi n(tp- Op) Xi recallwhere we lasthad it,
what we were doing,
and who we were talkingto. We attempt to
The deltaruletriestominimizesquared errors,itis
establisha chainof associations,
and thereby to
also referredto as the leastmean square learning
restorea lostmemory
procedureor Widrow-Hofflearning
rule.
Bidirectionalassociative memory (BAM), first
Featuresof thedeltaruleare as follows:
proposed by Bart Kosko, is a hetero-associative
1. Simplicity network. Itassociatespatternsfrom one set,set A,
2. Distributedlearning:learningis not relianton to patternsfrom another set,setB,and viceversa.
centralcontrolof the network. Likea Hopfieldnetwork,the BAM can generalize
weights are updated
3. Online learning: after and also producecorrectoutputsdespitecorrupted
presentationof each pattern. or incompleteinputs.
The training is repeated for all patternsuntilthe To develop the BAM, we need to create a
matrix foreach patternpair we want to
correlation
trainingset providean acceptable overallerror.
store.
Usually the mapping error is computed over the
fulltraining
set. The correlationmatrix is the matrix productof the
input vector X, and the transpose of the output
algorithmis workingintwo
Errorback propagation
vectorY. The BAM weight matrix isthe sum of all
stages matrices,thatis
correlation
1.The trained network operates feed-forward to
obtainoutputof the network W Xm
m=1
Ym
Xq P)
xz{P)-
Y4P)
xy(p+1)
xz(p+1)
1
2
-
-
y2P)
x,(P+
YmP)
nput Output
XP+1) nInput Output
layer layer layer ayer
Adaline
Fig.Q.36.1
An up thrustforknowledge
DECODE
TECHNICAL PUBLICATIONS
Neural Networks
and Deep Learning 1-17 NeuralNetworks
Artificial
Hebb'sLaw without
provides the basisforlearning
Step 4:1Iteration
Increaseiteration
p by one, go back to Step 2.
teacher.Learninghere is a local phenomenon
Hebbian learningexample
withoutfeedbackfrom the environment.
occurring
we can express the adjustment
To Hebbian learning,
illustrate consider a fully
Using Hebb's Law
connected feed forward network with a singlelayer
appliedto the weight wj at iteration P in the
of five computation neurons. Each neuron is
form:
following
Awj (P) F lyi(P),
x,(p)]
representedby a McCullochand Pittsmodel with
As a specialcase,we can representHebb's Law as the sign activation The network is trained
function.
follows: on the following
setofinputvectors
intoHebb's Law
factor X4
AW (p) ayi(p)xi(p)-oyi(p)
wj (P)
where isthe forgetting
factor.
An up thrust
forknowiedge OFCODE
TECHNICAL PUBLICATIONS
1-18 NeuralNetworks
Artificial
NeuralNetuworks and Deep Leaning
and finalweight matrices
Initial
10
20 00
1 0 0
o
0 20 2.0204 0
2.0204
100 0 1.0200
00 o 1
o 40 0 0 0.9996 0
2.0204 2.0204
A testinputvector,or probe,isdefinedas
X 0
When thisprobe
ispresented to the network, we obtain:
0 20204
0
0
10200
0
0
0
0.4940
2.0204|||002661
0-00907-0
0
0 0 09996 0
0094780
0 20204 0 0
20204]}|1] 00737]
1.5:Hopfield
Networks
Ans. Network capacity Hopfieldnetwork model isdetermined by neuron amounts and connections
of the
withina givennetwork. Therefore,the number of memories thatare able to be stored is dependent on neurons
and connections.
itwas
Furthermore, shown that the recallaccuracy between vectorsand nodes was 0.138. Therefore,itis
evidentthatmany mistakeswilloccurifone triesto storea largenumber of vectors.
up thrust
forknowledge
TTECHNICALPUBLICATIONS
An DECODE)
Neural Networks and Deep Learning 1-19 Artificial
NeuralNetworks
When the Hopfieldmodel does not recallthe right Hopfieldmodel consists of a single layer of
,
pattern,it is possiblethat an intrusion has taken processingelements where each unitisconnected to
place, since semantically related items tend to every other unitinthe networkother than itself.
and recollection
confuse the individual, of the The outputof each neuron isa binary number in
wrong patternoccurs. -1,1).The output vector is the state vector.
Thereforeif we store p patterns in a Hopfield Startingfrom an initial
state(givenas the input
vector),the stateof the network changes from one
network with a largenumber of N nodes, then the
to another like an automaton. If the state
of error,ie.,
probability the probability
that >1
C converges, the point to which itconverges iscalled
is the attractor.
ProrP(C>1) exp(-/2a2)dx
In itssimplestform, the outputfunctionisthe sign
which yields1 for arguments 2 0 and -1
function,
otherwise.
-erfao) The connectionweight matrix W of this type of
erf(
where the errorfunction
2
erfis givenby
p))
: A
network is square and symmetric.The unitsin the
Hopfieldmodel act as both inputand outputunits.
Hopfieldnetwork consistsof "n"totallycoupled
units.Each unit is connected to all other units
The networkis symmetric because the
except itself.
erf(x)
ep-s)ds weight wij between unit i and
for the connection
unit j is equal to the weight wj of the connection
Therefore,givenN and p we can findout the
from unitj to uniti.The absence of a connection
probability from each unit to itselfavoids a permanent
feedbackof itsown statevalue.
Perror of errorfor a singleneuron of a storedpattern.
Hopfield networks are typically used for
Q.46 Explainwith neat diagram the architecture of
problems with binarypatternvectors.
classification
Hopfield neural network.
Hopfield model isclassified categories
intotwo
Ans.: The Hopfield model is a single-layered 1.DiscreteHopfieldModel
recurrentnetwork.Likethe associative memory, itis
usuallyinitialized
with appropriateweights insteadof 2. ContinuousHopfieldModel
PUBLICATIONS An up thrust
forknowledge DECODE
TECHNICAL
NeuralNetuworks and Deep Learning NeuralNetworks
Artificial
1-20
Q.47 Give the limitationand applicationof 3. The energy minimization of the Hopfield
ability
Hopfieldnetworks. NTU:June-13,
Marks 7] network isused to solveoptimization
problems.
Ans.:Limitations Networks
of Hopfield 4. The various applicationsof Hopfieldnetworks
are as givenbelow. Hopfield
network remembers
Severelylimitedin the number of patternsp that
cues from the past and does not complicatethe
inN nodes.
can be storedreliably
trainingprocedure.
the maximum number
Generally, of patternsmust
The Hopfieldnetwork can be used for converting
be below 0.15Nfor reliable
performance;
analog signals into the digitalformat, for
Avalanche inerroroccursabove 0.138N. associative
memory.
Too many patternswillresultinspurious outputs; Q.50 What isassociative
memory ?
i.e.,
outputs not corresponding to any stored
Ans.:Associativememory neural nets are single
pattern.
layer nets in which the weights are determined in
similarpatternscan causeerrorsinoutput.
Storing such a way that net can store a set of pattern
Application: associations.
Common applicationsare those where pattern
is useful,and Hopfieldnetworks have Q.51Explainrelation
between BAM and Hopfield
recognition
been used for image detection and recognition, nets.
Ans.: DiscreteHopfieldnet and the BAM net :
enhancement of X-Ray images, medical image
etc.
restoration, closely related.
The Hopfield net can be viewed as an
The Hopfield network is commonly used for
BAM
auto-associative with the X-layerand Y-layer
and optimization
auto-association tasks.
treatedas a singlelayer and the diagonalof the
Q.48 Why Hopfield neural network is an symmetric weight matrix set to zero.
associative
memory ? BAM can be viewed as a specialcase of a Hopfield
Ans.
net which containsall of the X-layer and Y-layer
from
Starting state,the HNN willchange
any initial neurons,but withno interconnections between tw0
itsstateuntil
the energy function approaches to the X-layerneurons or between two Y-layerneurons.
minimum.
X-layerneurons must be update theiractivations
The minimum pointiscalledthe attractor. before any of the Y-layer neurons update theirs;
then all Y fieldneuron update before the next
Patternscan be stored in the network in the form
round of X-layerupdates.
of attractors.
In the transportationnetwork, the five vertices The weight on the connection between the units
stand for citiesand the links are labeled or thatrepresenta visit
to cities.
City3
City2
City4
City5
City1
Fig.Q.52.1
Schematic of blological
neuron
City 4
City3
City1
City1,
City5
City5
City2
Fig.Q.52.2Schematic of biological
neuron
An up
TECHNICAL PUBLICATIONS thrustfor knowledge OECODE
NeuralNetworks and Deep Learning 1-22 Neural Networks
Artificial
functions. NTU
:Aug.-16,Feb.-17] cClustering dAllof these
Q.9 Theoreticalresultshave been developed that Q.4 Neural Networks are complex
characterize the fundamental relationship with many parameters.
among the number of examples
linearFunctions
observed. [INTU: Feb.-17] a
Q.10 A is an information-processing
unit bnonlinear
Functions
that is fundamental to the operation of a discreteFunctions
neuralnetwork.
What is backpropagation
?
way that the network can store a set of a Itis anothername givento the curvy
patternassociations. function
inthe perceptron
Q.12 Auto-associative memories aree
INTU:
Feb. 17]
Q.6 A to solve the
possibleneuron specification
Q.15 An error correctionlearningalgorithm which
AND problem requires a minimum of
is applied to a multilayer feed forward
networkleadsto - . type of problem
cthree
neuror d four neurons Q.13 What are the generaltasks that are performed
with backpropagationalgorithm?
Q.7 A perceptron takes a vector of real-valued
a linearcombination
inputs,calculates of these a patternmapping
inputs,then outputs. bfunction
approximation
NTU: Aug.-16,Feb.-17]
cprediction
a - 1 or b 0 or 1
dalloftheabove
c-1
or0 d none
Q.14 Adalinewhichstands for .
Q.8 The_ algorithmcomputes the versionn
space containing allhypotheses from that H a adaptiveLinearNeuron
cDoesn't
fit dBestfit aunidirectional b parallel
b to recall
patternpairs Q.18 Neural networks are complex with
aunidirectional
memory Q.19 The network thatinolvesbackward linksfrom
bbidirectional
memory output to the input and hidden layers is
calledas
c multidirectional
associative
memory
dtemporalassociative
memory aselforganizing
maps
b perceptrons
T TECHNICAL
PUBLICATIONS
An up thrust
forknowledge DECODE
Neural Networks and Deep Leaning 1-24 Neural Networks
Artificial
network inthe Blanks
Answer Key for Fill
recurrerntneural
d multilayeredperceptron Q.1 artificial
neurons
Q.2 activation
Q.21 A neuron can send at a time how many Q.13 centralnervous Q.14 activation
function
signals? system
Q.6
cQ.3
aQ.7ada Q4
Q.8 c
dGenetic
algorithms.
Q.17
0.21 a
a Q.18 a
Q.22 b
Q.19
Q.23
C
d
Q.20 b
Q.23 The BAM AssociativeMemory)
(Bidirectional
significantly
belongs to END...
a Auto associative
b Fuzzy associative
Genetic associative
d Heteroassociative.
is unsupervised learning?
Network
Learning
2.1
Introduction
Externalteacher isnot used and is based upon only local information.Itis also referredto as
self-organization.
They are calledunsupervised because they do not need a teacher or super-visorto labela set of
2-1)
NeuralNetworks and Deep Learning 2-2 UnsupervisedLearningNetwork
Everyinputpatternthat is used to trainthe network The target output isnot presentedto the network.
isassociatedwith an output pattern.
5.
Try to detectinteresting indata.
relations
Trying a functionfrom labeleddata
topredict
6.
target variable
Supervised requiresthatthe
learning is For unsupervised learningtypically
eitherthe
target
well defined and that a sufficient
number of its variable isunknown or has only been recorded fortoo
valuesaregiven. small a number of cases.
learningisalsocalled
Supervised classification learningisalsocalled
Unsupervised clustering.
Ans. : Most unsupervised neuralnetworks relyon algorithmsto compute and compare distances, determinethe
winnernode with the highestlevelof activationand use theseto adapt network weights
Among allompeting nodes, only one willwin and allotherswilllose.We mainly deal with single
winnerWTA, but multiple
winners WTA are
possible
way to realizeWTA: have
Easiest an external, to decide the winnerby comparing
centralarbitrator
Thisisbiologically
thecurrentoutputs of the competitors. unsound.
TTECHNICALPUBLICATIONS An up thrust
forknowledge DECODE
NeuralNetuworks and Dep Learning UusupervisedLearningNetwork
No
N22)
P p (N2
Wpn pp 1
Flg.Q.9.1
Inthe process,hamming network use the hamming distanceas a similarity indicatorbetween two
vectors,and Maxnet servesas a subnetto determinethe unitthathas the biggestnet input
It is the neural network implementationof the hamming distancebased nearestneighbor classifier.
8iven pattern.
The outputlayeris responsiblefor choosing the pattern,whichisthe most similarto testingsignal.In
thislayer,the neuron withthe strongestresponse stops the otherneurons responses.
Q.10 Definemaxnet.
Ans. :Maxnet is one of the artificial
neural networks based on competition.
Generally,it is used for pattern
identification.
Maxnet can be used by another model of artificial
neuralnetworksuch as hamming network to get
neuronwiththebiggest
input.
TTECHNICALPUBLICATIONs.An up thrustforknowledge DECODE)
NeuralNetworks and Deep Learning 2-4 UnsupervisedLearningNetwork
Q.11Write about Maxican Hat network. discretemap, and to perform this transformation
2. Topologyof the network in the form of a lattice Brainis a system that can learnby
self-organizing
of neurons, which defines a discrete output itself
by changing(adding, removing,strengthening)
space; the interconnections
between neurons.Formation of
that is defined
around a winningneuron i(x); planartopology.
value and then decreasesgraduallywith time , the classmemberships of the inputdata.The SOM
Feature
Continuous
highdimensiona map
inputspace
W)
Discrete
low dimentional
Output space
Fig.Q.18.1Organization
of the mapping
An up thrust
forknowledge DECODE
TTECHNICALPUBLICATIONS
Neural Networks and Deep Learning 2-6 UnsupervisedLearningNetwork
4 i
Update allweight vectorsof allneurons inthe neighborhood
of neuron
k: wi (wiisshiftedtowards x)
+n9%,k)(x-w,)
5. If convergence criterionmet, STOP. Otherwise, narrow neighborhoodfunctionand learning
parameter n and go to (2).
a) (b)
An up thrust
forknowledge DECODE
TTECHNICALPUBLICATIONS
NeuralNetworks and Deep Learning 2-7 UnsupervisedLearningNetwork
LVQ
W11
AW- n(X-W)
Step 5:If the maximum number of iterationsis
reached,stop.Otherwisereturnto step 3.
2.4:
CounterPropagation
Networks
Q.21 What is counter-propagation
network?
Ans.:Counter-propagation networks are multilayer
networks based on a combination of input,clustering8
and outputlayers.Itcan be used to compress data,to
Inputunits
W26 approximate functions
or to associate
patterns.
Output units
:
Q.22 How does counter propagation nets are
Fig.Q.20.1LVQ trained?
Here input dimensionis 2 and the inputspace is Ans. nets are trainedin two
Counter-propagation
dividedinto six clusters.The firsttwo clusters stages
belong to class 1,whileother four clustersbelong 1. Firststage:The inputvectorsare clustered.The
to class2. clustersthatare formed may be based on either
.THE LVQ learning the dot productmetric or the Euclideannorm
algorithminvolvestwo steps:
metric.
1. An unsupervised learning data clustering
method is used to locateseveralclustercenters
2 Second stage :The weights from the clusterunits
withoutusingtheclassinformation. to the output unitsare adapted to produce the
2. The class information
is used to finetune the desiredresponse.
clustercenters to minimize the number of
misclassified
cases. Q.23 List the possible drawback of
networks.
counter-propagation
The number of clustercan eitherbe specifieda
or determinedviaa clustertechnique Ans.: Traininga counter-propagationnetwork has
priori capable
the same associated with traininga
of adaptivelyadding new clusterswhen necessary. difficulty
Once the clustersare obtained,
theirclassesmust be Kohonen network.
labeledbefore moving to second step.Such labeling Counter-propagationnetworks tend to be larger
isa achievedby votingmethod. than backpropagation
networks. Ifa certainnumber
Learningmethod of mappings are to be learned,the middle layer
The weight vector (w) that is closestto the input must have thatmany number of neurons.
vector(x) must be found.Ifx belongs to the same
Q.24 Explain full Counter-Propagation
Network
class,we move w towards x; otherwise we move w (CPN).
away from the inputvectorX.
An up thrust DECODE
TECHNICAL PUBLICATIONS forknowledge
Neural Networks and
Deep Learning
2-8 LearningNetwork
Insupervised
In
missing elements ineitheror both vectors. It may be used if the mapping from x to y is well
firstphase, the training
vectorpairsare used to but the mapping fromy tox is not.
defined,
form custersusingeitherdot product or Euclidean
distance.Ifdot productis used, normalizationis a 2.5 AdaptiveResonanceTheory
must.
parameter that
controlsthe generalityof the memory. Larger p
activationof unitsin the y output layer,y', is
means more detailed memories, smaller P
of
approximation input vector y; and x* is an
produces more generalmemories.
approximation of inputvectorx.
The of CPN
architecture resembles an instarand of ART.
Q.29 Draw and explainthe architecture
outstarmodel. What isitsuse and types ?
Ans.:GailCarpenter and Stephen Grossberg (Boston
The model which connects the inputlayersto the
hidden layer is calledInstarmodel and the model University)developed the Adaptive Resonance
which connectsthehiddenlayerto the outputlayer learning model. How can a system retain its
identification
and recognitiongenerallyoccur following
components :
object
as a resultof the interaction
of 'top-down'observer 1.The short term memory layer : F1 Short term
expectationswith bottom-up'
sensory information. memory.
2. The recognitionlayer: F2 Containsthe long
The basic ART system isan unsupervised learning
term memory of the system.
model. It typicallyconsistsof a comparison fieldd
and a recognition fieldcomposed of neurons, a 3. VigilanceParameter P-
A parameter that
controlsthe the p
vigilance parameter,and a resetmodule. However, of
generality memory. Larger
ART networks are able to grow additionalneurons means more detailed memories, smaller p
ifa new inputcannot be categorizedappropriately produces more generalmemories.
with the existing
neurons. Types of ART
ART networks tackle the stability-plasticity
Remarks
Type
dilemma
:
1.PlasticityThey can always adapt to unknown
inputsif the giveninputcannot be classified by
ART 1 Itisthe simplest variety
of ART networks,
acceptingonly binary inputs.
clusters.
existing ART 2 Extends network capabilities
to support
3. Problem
on p.
:
the introduction
of rnew inputs.
Clustersare of fixedsize,depending
ART 3 ART 3 buildson ART-2 by simulating
rudimentary neurotransmitter regulation
of synapticactivity by incorporating
simulatedsodium (Na+) and calcium
Fig.
Q.29.1shows ART-1Network. (Ca2+) ion concentrationsintothe system's
equations,which resultsina more
physiologicallyrealistic
means of partially
inhibitingcategoriesthattriggermismatch
resets.
Dutput layer
Input layer
ARTMAPIt isalsoknown as Predictive
ART,
combines two sightly modifiedART-1 or
ART-2 into
units a supervised learning
structure where the firstunittakes the
output-layercandidates that may best match the Fuzzy Fuzzy ARTMAP ismerelyARTMAP using
Currentinput. ARTMAP fuzzy ART units,resultingina
corresponding increaseinefficiency.
Q.30 Listthe advantages of adaptiveresonance 2 The hidden nodes implement a set of radialbasis
theory. functions
(e.g.Gaussian functions).
Q.31 DescribeBayesian beliefnetwork. An RBFN has a single MLP may have one or
hidden layer. more hiddenlayers.
Ans.:
Bayesian belief network describes the
Hidden layeris Hidden and output
governinga set of variables
2
distribution
probability nonlinearand output layerusedas patterrn
a
by specifying set of conditional
independence layerislinear. dlassifierareusually
TECHNICAL An up thrust
forknowledge DECODE
PUBLICATIONS
NeuralNetuworks and Deep Learning 2-11 LearningNetwork
nsupervised
Q2 networks are a type of artificial Q.16 A HAP model which comes under auto
neural network constructed from spatially correlatedassociative
memory is abbreviated
kernelfunctions.
localized
two dimensional.
Q.2 What isan activation
value ?
Q.6 A specific competitive net that performs
Winner Take All (WTA) competition is the a weightedsum of inputs
bthreshold
value
PUBLICATIONS- An up thrust
forknowledge
TECHNICAL CODE
Neural Networks and Deep Learning 2-12 UnsuperoisedLearningNetwork
Q7
self-organizing
SOM
Q.6
Q.8 unsupervised
either group of instarsor outstars
Q.9 reference Q.10 Adaptiveresonance
d bothgroup of instarsor outstars theory
Q.7 The ability
to learn how to do tasks on the Q.11 Q.12 multilayer
plasticity
or initial
data givenfor training experienceis
know as Q.13 linearly
separableQ.14 back propagation
aS bZ Q.9
CA U
END...
3
3.1Introduction
toDeep Learning
Deep Learning
Due number of layers,nodes, and
to the sheer
itis difficult
connections, to understand how deep
Q.1 Definedeep learning. learningnetworks arriveat insights.
Ans.:Deep learningis a machinelearningtechnique networks are highlysusceptibleto
Deep-learning
that teachescomputers to do what comes naturallyto the butterflyeffect-small variationsin the input
humans: learnby example. Deep learninguses layers data can lead to drastically
different
results,
making
of algorithms to process data, understand human them inherentlyunstable.
speech,and visuallyrecognizeobjects.
Q.3 Listthe application
of deep learning.
Q.2 Explain deep learming. What are the
challengesindeep learning? Ans.:
Applications:
1. Colorization
of blackand whiteimages.
Ans.: Deep Learning is a new area of machine
learning research,which has been introduced with 2. Adding sounds to silent
movies.
(3-1)
NeuralNetworks and Deep Learning 3-2 Deep Learning
For example, sigmoidmaps the real number line 4. ExplainabilityNeural networks develop their
onto a "small"range of [0, 1]. behaviorinextremelycomplicatedways.
As a result,there are large regions of the input Q.8 What is convolutional
neural network ? List
space which are mapped to an extremely smal itslimitation.
range. In deep learning,a convolutional
neural
Ans.:
Inthese regions of the inputspace, even a large network (CNN, or ConvNet) is a class of deep,
CNNs run a small window over the inputimage The imitations of the convolutional neural
at both training
and testingtime,the weights of the networks:
networkthat looks throughthiswindow can learn 1.Take fixedlengthvectorsas inputand produce
from variousfeaturesof the inputdata regardless fixedlengthvectorsas output.
withinthe input.
of theirabsoluteposition
2.Allow fixedamount of computational
steps.
a.9 Explain architecture
of ConvolutionNeural
Networks (ConvNet).
of a CNN is designed to
Ans.: The architecture
take advantage of the 2D structure
of an inputimage.
Featureextraction Classification
--------
K
m
Input Convolution Pool Convolution2 Pool2 Hidden Output
Architecture
Fig.Q.9.1 of convolutionneural network
An up thrust
forknowledge DECODE
TECHNICALPUBLICATIONS
NeuralNetwvorks and Deep Learning 3-4 Deep Learning
Another convolutionand pooling step is then A loss functionis definedin the fullyconnected
performed that feeds into a fully connected outputlayer to compute the mean square loss.The
multilayerperceptron.The finaloutputlayerof this gradientof erroristhen calculated.
network is a set of nodes that identify
featuresof
The error is then backpropagated to update the
the image. You train the network by using
(weights)and biasvalues. One training
filter cycle
back-propagation. is completed in a single forward and backward
An inputimage is passed to the firstconvolutional pass.
layer.The convoluted output is obtainedas an
Q.10 Write short noteon Poolinglayer.
activationmap. The filters applied in the
Ans.: Poolinglayerswere developed to reduce the
convoutionlayerextractrelevantfeaturesfrom the
to pass further.
number of parameters needed to describe layers
inputimage
deeper in the network.
Each filtershallgive a differentfeatureto aidthe
Poolingalso reduces the number of computations
In case we need to retain
correctclass prediction.
required for the network,or for simply
training
the size of the image, we use same padding (zero
otherwise valid runningitforwardduringa classification
task.
padding), padding is used since it
helps to reduce the number of features. Fig.
Q.10.1
shows poolinglayer.
224 224 64
Poolinglayersare thenadded to furtherreduce the
112 112 64
number of parameters.
Poo
Severalconvolution
and poolinglayersare added
is made. Convolutional
before the prediction layer
in
help extracting features.
Single
depth slice
and stride2
Fig.Q.10.2Max pooling
parameters and computationin the network. Pooling layer operates on each feature map
independently.
The most common approach used inpoolingismax pooling.
The poolingfunctionreplacesthe outputof the net at a certainlocation
with a summary statistic«
Max poolingreportsthe maximum
the nearby outputs. outputwithina rectangularneighborhood.
Average poolingreportsthe average output.
helps make
Pooling approximatelyinvariant
the representation to smallinputtranslations.
32 x 32 x 3 image ispreserve
spatial structure.
32 height
32 width
3 depth
Convolvethe filter
with the 32 32 3 image
"slideover
imagei.e. the image
computing dot
spatially,
products". 5 5 3fiter
Filtersalways extendthe full
52
depth of the input volume.
32
5
323
5
image
3filter Activation
map
28
Convolve (slide)
overall
spatiallocations
28
Convolutionlayer
s2
of convolutional
a.12 Explainapplications neural network.
Drug discovery:
CNNs
discovery. Predicting
have been used in drug
the interaction between
For example:consider3 functionsf,f2, and
Q.15What isrecurrentneural networks ? output for each of these layers,these layers are
calledhidden layers.
Ans.:When feedforward neural networks are
extended to includefeedback connections,they are these networks
Finally, are called neural because
calledrecurrentneuralnetworks. they are loosely inspiredby neuroscience.Each
hidden layer of the network is typically
Q.16 Write short note on Deep feedforward The dimensionality
vector-valued. of these hidden
networks. determinesthe width of the model.
layers
feedforward networks is also called
Ans.: Deep Q.17What is use of hidden layer?
feedforward neural networks or multilayer
Ans.:Feedforward networks have introducedthe
perceptrons.
Feedforwardneural networks are called networks concept of a hidden layer,and this requires us to
choose the activation
functionsthat willbe used too
because they are typically represented by
compute the hiddenlayervalues.
composing togethermany differentfunctions. The
model is associatedwith a directedacyclicgraph
conceptof XOR.
Q.18 Explainlearning
describinghow the functions are composed
together. Ans.: XOR problem is a pattern recognition
problem inneuralnetwork.
TECHNICAL An up thrust
forknowledge DECODE
PUBLICATIONS
Neural Netuworks and Deep Leaning 3-8 Deep Learning
returns1.Otherwise,
valuesisequal to 1,the XOR function itreturns0.
Q.20 Defineactivation
function.Explainthe purpose of activation
functionin multilayerneural networks.
Giveany two activation
functions.
Ans. : An activationfunction The activation
on the signaloutput.
f performs a mathematical operation functions
are chosen dependingupon the type of problem to be solved by thenetwork. There are a number of common
activationfunctions
inuse with neuralnetworks.
Fig.
Q.20.1shows of activation
position function.
Unit step function
isone of theactivation
function.
Inputs Weights
X1 (W
Activation
Tunction
Net input
2 Wa (net)
Activation
Transfer
function
Threshold
function
Q.20.1Positionof activation
Fig.
sig()14et
0.9
0.0
0.6
0.5
0.4
.3
02
0.1
-6
Fig.Q.20.2(c)Binarysigmoid function
f (*)
f)-1t-
The bipolar
sigmoidisalsocloselyrelatedto the hyperbolictangent function.
tan n exp )-
exp(x)+exp(-x)
exp(-x) 1-exp
(-2x)
1+ exp (-2x)
sigmoidal
(s shaped).
Fig.
Q.21.l
shows tanh v/s Logistic
Sigmoid.
Tanh neuron is simply a scaled
sigmoidneuron. Sigmoid I
Problems resolvedby Tanh ). Tanh
Fig.Q.21.1:
tanh vis
Sigmoid
Logistic
Fig.
Q21.2shows RelLU v/s Logistic
Sigmoid.
As you can see,the ReLU ishalfrectified(from bottom). f(z) iszero when z islessthanzero and f(z)
is equal to z when z isabove or equal to zero.
Sigmoid RelLU
.0 10
R(2)= max(0,2)
.8 (2)1e
0.6
0.4
.2
000 10 -10 10
3.Smallgradient
4. Vanishing gradient
Tanh 1.Zero centered, 1.Saturated
Neurons
2. Outputinrange(-1,1)
2. Accelerated
convergence 2.Not zerocentered
4. tooptimize,
Efficient converges much fasterthan sigmoidor tanh.
Cons
1. Very negativeneuron cannot recoverdue tozero gradient.
2. Non zero centeredoutput.
3.3: Learning
Gradient-Based
C(w,b) N lat,)-yi
where x are the input both are determined.So the cost
ors and y are there correspondinglabels,
function value dependingon the weights w and biasesb.Itisquadraticcost function.
chagnes it's
layer.And again,
jindexes theoutputunits,
so j- 1,2, K.
e
a
Each training
image islabeledwiththetruedigitand the goal of thenetworkis to predict
the correct
label.So, ifthe inputis an image of the digit4, the outputunit corresponding to 4 would be
and so on for the restof the units.
activated,
The softmax function the largestvaluesand suppress other values.
highlights
.The softmax can be used for any number of classes.
It'salso used for hundreds and thousands of
for
classes, example in objectrecognition
problems where there are hundreds of different
possible
objects.
Listthe limitation
Q.30 Explaingradientdescent algorithm. of gradient descent.
3.Regressionanalysisinnonlinear
models
An up thrust
forknowiledge (DECODE
TTECHNICALPUBLICATIONS
Neural Networks and Deep Learning 3-13 Deep Learning
GradientDescent
Gradient
descentis a first-order algorithm.To finda
optimization local minimum of a function
using
to the negative of the gradientof the function
gradientdescent,one takessteps proportional at the
Currentpoint.
Xk1*-AVf
(K3)
The A>0 isa small number thatforcesthe algorithmto make small jumps.
Limitations Descent:
of Gradient
.Gradientdescent is relativelyslow close to the minimum: technically,
its asymptotic rate of
convergence is inferior
to many othermethods.
SteepestDescent:
Steepestdescent isalso known as gradientmethod
.Thismethod isbased on first
order Taylorseriesapproximation function.
of objective Thismethod is
alsocalledsaddle pointmethod. Fig.Q.30.1shows steepestdescentmethod.
The steepestdescent is the simplest of the gradient methods. The choice of directionis where f
decreasesmost quickly,which is inthe directionopposite to Vf The
(x). search starts
at an arbitrary
point x0 and then go down the gradient,
untilreach closeto the solution.
The method of steepestdescent is the discreteanalogue of gradientdescent,but the best move is
able
computed using a localminimization rather than computing a gradient.It is typically
function.
converge infew stepsbut itisunable toescape localminima or plateausinthe objective
The gradientiseverywhere perpendicular
to thecontourlines.Aftereach lineminimization
the new
Steepestdescentmethod
Fig.Q.30.1
Q.31 What are differences between gradient Most neural networks are organized intogroups of
descent and stochastic
gradient descent ?
unitscalledlayers.These layersare arranged in a
Ans.:
1. In standard 8radient descent, the erroris
chainstructure, with each layerbeinga function of
the layerthatpreceded it.
summed over all examples before updating
whereas in stochasticgradient descent
In thisstructure,
the first
layeris givenby
weights,
weights are updated upon examining each h-g(wOTx+b))
training
example.
The second layerisgivenby
.
inputs provides the initial
x information that then
Q.32 What isuniversalapproximation theeorem ?
propagates up to the hiddenunitsat each layerand
Ans.:A singlehidden layer neural network with a finally produces This is called forward
linearoutput unit can approximate any continuous propagation.
function well,givenenough hiddenunits.
arbitrary
Q.35 What isbackpropagation
?
Q.33 What do you mean architecture design ? Ans.: Backpropagation allows information
to flow
Ans.: Architecturerefersto the overallstructureof backwards from cost to compute the gradient.
the network. It is an algorithm for supervised learning of
artificial
neuralnetworks usinggradient
descent
An up thrustforknowiedge DECODE
TECHNICAL PUBLICATIONS
Neural Networks and Deep Learning 15 Deep Learning
Givenan artificial
neural network and an errorfunction, the method calculates
the gradientof the
w
errorfunction ith respectto the neuralnetwork'sweights.
a.36 Explainbackpropagation
learningrule.
Ans.: The net inputof a node is definedas the weightedsum of the incomingsignalsplus a bias ternm.
Fig.Q.36.1shows the backpropagationMLP for node j.The net inputand output of node j is as follows
X i+Wy+W
Where x
1+exp(-X;)
correspondingnodes iand j.
Fig.
Q.36.2shows two layerbackpropagation
MLP.
Input
signals
O 0-
m O-Y
Input Output
layer layer
Hidden
layer
of
Back-propagationerrorcorrection
Fig.Q.36.2
.The above back propagationMLP willreferto
nodes ineach layer.
as a 3-4-3 network,corresponding to the number .
.Thebackward error propagation also known as the Backpropagation (BP) or the GeneralizedDelta
Rule (GDR).A squared errormeasure for the ph input-outputpairis definedas
Ep2d-x
TECHNICAL PUBLICATIONSAnup thrustforknowledge OECODE)
NeuralNetworks and Deep Learning 16 Deep Learning
Where dg is the desiredoutputfor node k and x is Q.38 Listout merits and demerits of EBP.
for node k when the inputpart of Ans.:
Merits/Strength:
the
actualoutput
the ph data pairis presented. 1. Computing time is reduced ifweight chosen are
small at the
Q.37 Write the characteristics
and applications
of beginning
errorback propagationalgorithm. 2. Itminimizethe error.
:
Ans.:Characteristics 3. Batch update of weight exist,which provide
Itis an algorithm for supervised learningof on the weight correction.
smoothingeffects
artificial
neural networks usinggradientdescent 4. Simplemethod and easy for implementation.
Learns weights for a multilayernetwork, given a 5. Minimum of theerrorfunction inweight space
fixedsetof unitsand interconnections. 6. Standard method and generallywork well.
Inmultilayernetworks the error surfacecan have Demerits:
multipleminima, but in practice backpropagation 1. Trainingmay sometime cause temporal instability
has produced excellentresultsin many real-world to the system.
applications. 2. For complex problem, ittakes lotof times.
The algorithm is for two layers of sigmoidunits 3. Selection of number of hidden node in the
and does stochastic
gradientdescent. networkis problem.
It uses gradient descent to minimizethe squashed
4 Backpropagation learning does not requir
error between the network outputsand the target normalizationof input vectors; however,
values for theseoutputs. normalization
couldimproveperformance
Applications: 5. It can get stuck in local minima resultingin
The fast development of artificial
satellite sub-optimalsolutions.
technology has increasedthe importance three
of 6. Slow and inefficient.
and therefore,
dimensional (3D)positioning satellite
geodesy. Q.39 Discussperformance issue of EBP.
is main
Ans.: Computationalefficiency aspect of
the
Particularly, Global Positioning System (GPS)
back-propagation.Number of operationsto compute
provides more practical, rapid, precise and
derivativesof errorfunction
scaleswith totalnumber
continuouspositioning results anywhere on the
Earth in geodetic applications
when compared to W of weightsand biases.
the traditional
terrestrial methods. Singleevaluation of error functionfor a single
positioning
Due tothe increasing use of GPS positioning input requires O(W) operations for large W.
Compute derivatives using method of finite
techniques,a great attentionhas been paidto the differences.
precise determination of local/regionalgeoids,
aimingat replacingthe geometriclevelingwithGPS oEn
dW
E +)E,
(Wp).o9
measurements. ji
.Therefore,BPANN method is easilyprogrammable where E <« 1
with decreased and increasednumber of reference
Accuracycan be improved by making & smaller
pointswhen generatinga localGPS, it performs a untilround-offproblems arise.
flexible
modelling.
Accuracy can be improved by using central
Also,BPANN method is open to updatingwhich differences.
couldbe accepted as an importantadvantage.Thus,
it is believed that BPANN method is more
OEn E,(Wji+)-E(Wji-+O(E2)
26
dwji
convenient for generating local GPS when
compared toother methods.
Thisis O(w).
An up thrust
forknowledge (DECODE
TECHNICALPUBLICATIONS
NeuralNetworks and 3-17 Deep Learning
Deep Learning
Generalization
means performance on inputpatterns,i.e.
inputpatternswhich were not among the
patternson which the network was trained.
Ifyou trainfor long you can often get the sum-squared error very low, by over-fitting
too the
data you get a network which performs very well on the training
training data,but not as well as it
couldon unseen data.
By stoppingtrainingearlier,one hopes that the network willhave learned the broad rules of the
Ans.: Consider that network weightsare initializedto small random values.With weights of nearly identical
value,only very smooth decisionsurfacesare describable.
As trainingproceeds,some weights beginto grow inorder to reduce the errorover the training
data,
and thecomplexity
of the learneddecision
surfaceincreases.
training
sample.
Thisoverfitting
problemisanalogous to the overfitting
problem indecision
treelearning.
weightsare adjusted.
Based on the error,the connection
calculated.
The backpropagation algorithm is based on Widrow-Hoff delta learningrule in which the weight
adjustment is done throughmean square errorof the outputresponse to the sample input.
The set of these sample patternsare repeatedly presented to the network untilthe error value
minimized.
Fill in the Blanks with Answers for Mid Term Exam
Q.2 When feedforward neural networks are extended to include feedback connections, they are called ______ neural networks.
Q.3 A single-layer network with one hidden unit is also called a ______.
Q.4 Gradient descent is an iterative ______ algorithm.
Q.5 ______ is one of the most popular functions which is used as an activation function in deep neural networks.
Q.6 Training data does not show the desired output for each of these layers; these layers are called ______ layers.

Multiple Choice Questions with Answers for Mid Term Exam
Q.2 Deep feedforward networks are also often called ______.
a) multilayer perceptrons b) backpropagation c) activation function d) hidden layer
Q.3 Hidden units are composed of ______ computing an affine transformation followed by an element-wise nonlinear function.
a) output vector b) input vector c) activation function d) all of these
Q.4 The backpropagation algorithm is used to find a local minimum of the ______.
Q.5 ReLU stands for :
a) Rectified Link Unit b) Rectified Large Unit c) Rectified Layer Unit d) Rectified Linear Unit
Answers (Fill in the Blanks) : Q.3 perceptron, Q.4 optimization
Answer Key (Multiple Choice) : Q.2 a, Q.3 b, Q.5 b, Q.6 a
END.
Chapter 4 : Regularization for Deep Learning

4.1 : Parameter Norm Penalties

Q.2 Explain overfitting. What are the reasons for overfitting ?
Ans.:
Fig. Q.2.1 : A balanced fit versus overfitting
Reasons for overfitting :
2. The learner did not use a sufficient number of training iterations.
3. The learner tries to fit a straight line to a training set whose examples exhibit a quadratic nature.
To prevent under-fitting, we need to make sure that :
1. The network has enough hidden units to represent the required mappings.
2. The network is trained for long enough that the error/cost function is sufficiently minimized.

Q.3 How do we know if we are underfitting or overfitting ?
Ans.:
1. If by increasing capacity we decrease generalization error, then we are underfitting; otherwise we are overfitting.
2. If the error on the training set is relatively large and the generalization error is also large, then we are underfitting the training set.

Q.4 Explain L² parameter regularization.
Ans.: The regularization coefficient is a hyperparameter whose value is optimized for better results.
L² regularization is also known as weight decay, as it forces the weights to decay towards zero. With L² regularization the gradient update becomes

w ← (1 − εα) w − ε ∇w J(w)

so at each step the weights first multiplicatively shrink by a constant factor and are then updated by an amount which is proportional to the gradient.
The L² regularization causes the learning algorithm to "perceive" the input X as having higher variance, which makes it shrink the weights on features whose covariance with the output target is low compared to this added variance, encouraging the network to use all of its inputs a little rather than some of its inputs a lot.
In L² regularization, the regularization term is the sum of squares of all feature weights. It forces the weights to be small but does not make them zero, and it gives a non-sparse solution.
A regression model that uses the L¹ regularization technique is called Lasso regression. Lasso regression adds the "absolute value of magnitude" of the coefficients as a penalty term to the loss function :

Ω(θ) = ‖w‖₁ = Σi |wi|

that is, the penalty is the sum of the absolute values of the individual parameters.
In L¹ regularization we therefore have :

Cost function = Loss + (λ / 2m) ‖w‖₁

In this, we penalize the absolute value of the weights. Unlike L², the weights may be reduced to zero, yielding sparse solutions.
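The contrast between the two penalties is easy to see numerically. A small illustration (not from the book) using scikit-learn's Ridge (L²) and Lasso (L¹) regressors; the data and penalty strengths are arbitrary :

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.RandomState(0)
    X = rng.randn(100, 10)
    y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(100)  # only 2 features matter

    ridge = Ridge(alpha=1.0).fit(X, y)   # L2 : weights shrink but stay non-zero
    lasso = Lasso(alpha=0.1).fit(X, y)   # L1 : irrelevant weights driven to zero

    print("Ridge :", np.round(ridge.coef_, 3))
    print("Lasso :", np.round(lasso.coef_, 3))
    print("Zero weights under L1 :", int(np.sum(lasso.coef_ == 0)))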
Q.7 Explain applications of convolutional neural network.
Ans.: Computer vision : CNNs are employed to identify the hierarchy or conceptual structure of an image. Instead of feeding each image into the neural network as one grid of numbers, the image is broken down into overlapping image tiles that are each fed into a small neural network.
Speech recognition : CNNs have been used recently in speech recognition and have given better results than deep neural networks. The robustness of a CNN is enhanced when pooling is done over a local frequency region, and over-fitting is avoided by using fewer parameters to extract low-level features.
CNNs use relatively little pre-processing compared to other image classification algorithms.

Q.8 Explain overfitting. What are the reasons for overfitting ?
Ans.: Training error can be reduced by making the hypothesis more sensitive to the training data, but this may lead to overfitting and poor generalization.
Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship.
Overfitting is when a classifier fits the training data too tightly. Such a classifier works well on the training data but not on independent test data. It is a general problem that plagues all machine learning methods.
Because of overfitting, we see low error on the training data and high error on the test data.
Overfitting occurs when a model begins to memorize the training data rather than learning to generalize from the trend. Fig. Q.8.1 shows overfitting.
Fig. Q.8.1 : Overfitting
Reasons for overfitting :
1. Noisy data. 2. The training set is too small. 3. A large number of features.
Methods to avoid overfitting :
1. Limit the number of hidden nodes.
2. Stop training early to avoid a perfect explanation of the training set.
3. Apply weight decay to limit the size of the weights, and thus of the function class implemented by the network.

4.2 : Norm Penalties as Constrained Optimization

The penalty term is obtained by multiplying the norm penalty by a coefficient α. Usually we do not know the size of the constraint region that we impose by using weight decay with coefficient α*, because the value of α* does not directly tell us the value of k.
In principle, one can solve for k, but the relationship between k and α* depends on the form of J.
While we do not know the exact size of the constraint region, we can control it roughly by increasing or decreasing α in order to grow or shrink the constraint region. A larger α will result in a smaller constraint region; a smaller α will result in a larger constraint region.
Q.11 What are the reasons to use explicit constraints rather than penalties ?
Ans.: A reason to use explicit constraints and reprojection rather than enforcing constraints with penalties is that penalties can cause nonconvex optimization procedures to get stuck in local minima corresponding to small θ.
When training neural networks, this usually manifests as neural networks that train with several "dead units." These are units that do not contribute much to the behavior of the function learned by the network, because the weights going into or out of them are all very small.
When training with a penalty on the norm of the weights, these configurations can be locally optimal, even if it is possible to significantly reduce J by making the weights large.
4.3 : Dataset Augmentation and Noise Robustness

Q.12 What is dataset augmentation ?
Ans.: Data augmentation is the process of increasing the amount and diversity of data. We do not collect new data; rather, we transform the already present data.
Data augmentation is an integral process in deep learning, as we need large amounts of data, and in some cases it is not feasible to collect thousands or millions of images, so data augmentation comes to the rescue.
It helps us to increase the size of the dataset and introduce variability in the dataset.
Data augmentation involves creating new data points by manipulating the original data. For example, for images, this can be done by rotating, resizing, cropping and more. This process increases the diversity of the data available for training models in deep learning without having to actually collect new data.
Random Erasing is a data augmentation method used for training convolutional neural networks that involves randomly erasing a rectangular region in an image. Images with occlusions are then generated. This makes a model robust to occlusion and reduces the chances of overfitting.
Injecting noise in the input to a neural network can also be seen as a form of data augmentation. For many classification and even some regression tasks, the task should still be possible to solve even if small random noise is added to the input.
One way to improve the robustness of neural networks is simply to train them with random noise applied to their inputs. Input noise injection is part of some unsupervised learning algorithms, such as the denoising autoencoder.
Noise injection also works when the noise is applied to the hidden units, which can be seen as doing dataset augmentation at multiple levels of abstraction.
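As a concrete sketch of these two augmentations (random erasing and input-noise injection), assuming images stored as numpy arrays; the function names and parameter values are illustrative :

    import numpy as np

    def random_erase(img, frac=0.2, rng=np.random):
        """Blank out a random rectangle covering about frac of each side."""
        h, w = img.shape[:2]
        eh, ew = max(1, int(h * frac)), max(1, int(w * frac))
        top = rng.randint(0, h - eh + 1)
        left = rng.randint(0, w - ew + 1)
        out = img.copy()
        out[top:top + eh, left:left + ew] = rng.randint(0, 256)  # random fill
        return out

    def add_input_noise(x, sigma=0.05, rng=np.random):
        """Noise injection : add small Gaussian noise to the inputs."""
        return x + sigma * rng.randn(*x.shape)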
Q.13 What is dropout ? How does it solve the problem of overfitting ?
Ans.: Dropout was a technique developed for preventing overfitting of a network to the particular variations of the training set.
Deep neural networks with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks.
Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem.
The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much.
During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks.
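A minimal sketch of one layer of (inverted) dropout, assuming numpy arrays; rescaling by the keep probability at training time means the full network can be used unchanged at test time :

    import numpy as np

    def dropout(activations, keep_prob=0.5, training=True, rng=np.random):
        """Randomly zero units during training; identity at test time."""
        if not training:
            return activations
        mask = rng.rand(*activations.shape) < keep_prob
        # Inverted dropout : rescale so the expected activation is unchanged
        return activations * mask / keep_prob

    h = np.random.randn(4, 8)            # a batch of hidden activations
    h_train = dropout(h, keep_prob=0.8)  # one "thinned" network sample
    h_test = dropout(h, training=False)  # full network at test time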
Fig. Q.13.1 : (a) Standard neural net (b) After applying dropout
Fig. Q.13.1 shows a standard neural network and the corresponding dropout neural network.

4.4 : Multi-task Learning and Early Stopping

Fig. Q.14.1 : Constrained layers
Different supervised tasks share the same input and some intermediate-level representation, with several task-specific output layers. Hard parameter sharing greatly reduces the risk of overfitting.
In soft parameter sharing, on the other hand, each task has its own model with its own parameters. The distance between the parameters of the models is then regularized in order to encourage the parameters to be similar.

Q.15 Discuss early stopping.
Ans.: Early stopping is a technique that is very often used when training neural networks, as well as with some other iterative machine learning algorithms. It is probably the most commonly used form of regularization in deep learning.
Early stopping controls overfitting in machine learning models, especially neural networks, by stopping training before the weights have converged.
This means we can obtain a model with better validation set error by returning to the parameter setting at the point in time with the lowest validation set error.
Every time the error on the validation set improves, we store a copy of the model parameters. When the training algorithm terminates, we return these parameters rather than the latest parameters.
The algorithm terminates when no parameters have improved over the best recorded validation error for some pre-specified number of iterations.
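In outline (a sketch, assuming hypothetical train_one_epoch and validation_error helpers and a model whose parameters can be copied) :

    import copy

    def train_with_early_stopping(model, patience=10, max_epochs=1000):
        """Keep the parameters with the lowest validation error seen so far;
        stop when no improvement has been recorded for `patience` epochs."""
        best_error = float("inf")
        best_params = copy.deepcopy(model.parameters)
        epochs_since_improvement = 0
        for epoch in range(max_epochs):
            train_one_epoch(model)           # hypothetical helper
            error = validation_error(model)  # hypothetical helper
            if error < best_error:
                best_error = error
                best_params = copy.deepcopy(model.parameters)
                epochs_since_improvement = 0
            else:
                epochs_since_improvement += 1
                if epochs_since_improvement >= patience:
                    break
        model.parameters = best_params       # return to the best point
        return model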
4.5 : Parameter Typing and Parameter Sharing

Q.16 What is a parametric machine learning algorithm and a nonparametric machine learning algorithm ?
Ans.: Parametric machine learning algorithm :
A parametric model is one that can be parameterized by a finite number of parameters. A learning model that summarizes data with a set of parameters of fixed size is called a parametric model.
No matter how much data you throw at a parametric model, it won't change its mind about how many parameters it needs.
Given the parameters, future predictions x are independent of the observed data D : P(x | θ, D) = P(x | θ); therefore θ captures everything there is to know about the data.
So the complexity of the model is bounded even if the amount of data is unbounded. This makes them not very flexible.
Examples of parametric machine learning algorithms include : logistic regression, linear discriminant analysis, perceptron, naive Bayes and simple neural networks.
Benefits of parametric machine learning algorithms :
1. Simpler : These methods are easier to understand and their results are easier to interpret.
Non-parametric models assume that the data distribution cannot be defined in terms of such a finite set of parameters. But they can often be defined by assuming an infinite-dimensional θ. Usually we think of θ as a function.
The amount of information that θ can capture about the data D can grow as the amount of data grows. This makes them more flexible.
Nonparametric methods seek to best fit the training data in constructing the mapping function, whilst maintaining some ability to generalize to unseen data. As such, they are able to fit a large number of functional forms.
An easy to understand nonparametric model is the k-nearest neighbors algorithm, which makes predictions based on the k most similar training patterns for a new data instance. The method does not assume anything about the form of the mapping function other than that patterns that are close are likely to have a similar output variable.
Examples of nonparametric machine learning algorithms are :
1. k-Nearest Neighbors
2. Decision Trees like CART and C4.5
3. Support Vector Machines

Q.17 Difference between parametric and non-parametric modeling.
Ans.:

Parametric modeling | Non-parametric modeling
Data are drawn from a probability distribution of specific form up to unknown parameters. | Data are drawn from a certain unspecified probability distribution.
It uses a single global model. | There is no single global model; local models are estimated as they are needed.
Easy idea. | Flexible discovery.
Example : logistic regression, linear discriminant analysis, perceptron, naive Bayes and simple neural networks. | Example : k-nearest neighbors, decision trees like CART and support vector machines.
Benefit : requires less data; simpler and faster. | Benefit : flexibility, power and performance.
Limitation : poor fit and limited complexity. | Limitation : more data required, slower, and prone to overfitting.
It requires less data. | It requires more data.

Q.18 What is the K-Nearest Neighbour method ?
Ans.: The K-nearest neighbor (KNN) is a classical classification method that requires no training effort, but it critically depends on the quality of the distance measures among examples.
The KNN classifier uses the Mahalanobis distance function. A sample is classified according to the majority vote of its K nearest samples in the training feature space. The distance of a sample to its neighbors is defined using a distance function.

Q.19 List out the steps that need to be carried out during the KNN algorithm.
Ans.: The steps are as follows :
a. Divide the data into training and test data.
b. Select a value K.
c. Determine which distance function is to be used.
d. Choose a sample from the test data that needs to be classified and compute the distance to its n training samples.
e. Sort the distances obtained and take the k nearest data samples.
f. Assign the test sample to the class based on the majority vote of its K neighbors.

Q.20 What are the advantages and disadvantages of KNN ?
Ans.: Advantages :
1. The KNN algorithm is very easy to implement.
2. Nearly optimal in the large sample limit.
3. Uses local information, which can yield highly adaptive behavior.
4. Lends itself very easily to parallel implementations.
Disadvantages :
1. Large storage requirements.
2. Computationally intensive recall.
3. Highly susceptible to the curse of dimensionality.

Q.21 Which are the performance factors that influence the KNN algorithm ?
Ans.: The performance of the KNN algorithm is influenced by three main factors :
1. The distance function or distance metric used to determine the nearest neighbors.
2. The decision rule used to derive a classification from the K nearest neighbors.
3. The number of neighbors used to classify the new example.
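A minimal sketch of the steps in Q.19 using plain numpy; the book mentions the Mahalanobis distance, but Euclidean distance is used here purely to keep the example short :

    import numpy as np
    from collections import Counter

    def knn_classify(x, X_train, y_train, k=3):
        """Classify x by majority vote of its k nearest training samples."""
        distances = np.linalg.norm(X_train - x, axis=1)  # step d : distances
        nearest = np.argsort(distances)[:k]              # step e : k nearest
        votes = Counter(y_train[i] for i in nearest)     # step f : majority vote
        return votes.most_common(1)[0][0]

    X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
    y_train = np.array([0, 0, 0, 1, 1, 1])
    print(knn_classify(np.array([4.5, 5.0]), X_train, y_train, k=3))  # -> 1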
4.6 : Bagging and Other Ensemble Methods

Q.22 Explain ensemble learning method.
Ans.: The idea of ensemble learning is to employ multiple learners and combine their predictions. If we have a committee of M models with uncorrelated errors, simply by averaging them the average error of a model can be reduced by a factor of M.
Unfortunately, the key assumption that the errors due to the individual models are uncorrelated is unrealistic; in practice, the errors are typically highly correlated, so the reduction in overall error is generally small.
Ensemble modeling is the process of running two or more related but different analytical models and then synthesizing the results into a single score or spread, in order to improve the accuracy of predictive analytics and data mining applications.
An ensemble of classifiers is a set of classifiers whose individual decisions are combined in some way to classify new examples.
Ensemble methods combine several decision tree classifiers to produce better predictive performance than a single decision tree classifier. The main principle behind the ensemble model is that a group of weak learners come together to form a strong learner, thus increasing the accuracy of the model.
There are two approaches for combining models : voting and stacking.
In voting, no learning takes place at the meta level when combining classifiers by a voting scheme. The label that is most often assigned to a particular instance is chosen as the correct prediction when using voting.
Stacking is concerned with combining multiple classifiers generated by different learning algorithms on a single dataset S, which is composed of examples si = (xi, yi), i.e. pairs of feature vectors and their classifications.
The stacking process can be broken into two phases :
1. Generate a set of base-level classifiers C1, ..., CN, where Ci = Li(S) is the result of applying learning algorithm Li to the dataset S.
2. Train a meta-level classifier to combine the outputs of the base-level classifiers.
Fig. Q.22.1 shows the stacking frame.
The training set for the meta-level classifier is generated through a leave-one-out cross validation process :

for all i = 1, ..., n and all k = 1, ..., N :  Cki = Lk(S − si)

The learned classifiers are then used to generate predictions for si :  ŷik = Cki(xi).
The meta-level dataset consists of examples of the form ((ŷi1, ..., ŷiN), yi), where the features are the predictions of the base-level classifiers and the class is the correct class of the example in hand.

Why do ensemble methods work ?
They are based on one of two basic observations :
1. Variance reduction : If the training sets are completely independent, it will always help to average an ensemble, because this will reduce variance without affecting bias (e.g., bagging) and reduce sensitivity to individual data points.
2. Bias reduction : For simple models, the average of models has much greater capacity than a single model. Averaging models can reduce bias substantially by increasing capacity, and control variance by fitting one component at a time.

Q.23 What is bagging ? Explain bagging steps. List its advantages and disadvantages.
Ans.: Bagging is also called Bootstrap aggregating. Bagging and boosting are meta-algorithms that pool decisions from multiple classifiers. It creates ensembles by repeatedly randomly resampling the training data.
Bagging was the first effective method of ensemble learning and is one of the simplest methods of arching. The meta-algorithm, which is a special case of model averaging, was originally designed for classification and is usually applied to decision tree models, but it can be used with any type of model for classification or regression.
Fig. Q.22.1 : Stacking frame — hypotheses trained on the training set make predictions on the training observations, and a meta learner combines them into the final prediction.
Ensemble classifiers such as bagging, boosting and model averaging are known to have improved accuracy.
Bagging steps :
1. Suppose there are N observations and M features in the training dataset.
2. For each round t :
a. A sample St is taken randomly with replacement from the training dataset.
b. Let ht be the result of training the base learning algorithm on St.
3. Output the combined classifier : H(x) = majority(h1(x), ..., hT(x)).
Advantages of bagging :
1. Reduces over-fitting of the model.
2. Handles higher dimensionality data very well.
Disadvantages of bagging :
1. Since the final prediction is based on the mean of predictions from the subset trees, it won't give precise values for the classification and regression model.
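For illustration (not the book's code), scikit-learn's BaggingClassifier implements exactly this bootstrap-and-vote scheme over decision trees :

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    single_tree = DecisionTreeClassifier(random_state=0)
    bagged = BaggingClassifier(single_tree, n_estimators=50, random_state=0)

    # Bagging typically reduces variance relative to a single deep tree
    print("tree   :", cross_val_score(single_tree, X, y, cv=5).mean())
    print("bagged :", cross_val_score(bagged, X, y, cv=5).mean())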
Boosting steps :
1. Draw a random training subset d1 without replacement from the training set D to train a weak learner C1.
2. Draw a second random training subset d2 without replacement from the training set and add 50 percent of the samples that were previously falsely classified/misclassified, to train a weak learner C2.
3. Find the training samples d3 in the training set D on which C1 and C2 disagree, to train a third weak learner C3.
4. Combine all the weak learners via majority voting.
Advantages of boosting :
1. Supports different loss functions.
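The book outlines the original three-learner boosting scheme above; as a runnable stand-in, scikit-learn's AdaBoostClassifier implements the closely related adaptive-boosting variant :

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    # AdaBoost re-weights misclassified samples each round and combines
    # the weak learners by weighted majority voting.
    booster = AdaBoostClassifier(n_estimators=50, random_state=0)
    print("boosted accuracy :", cross_val_score(booster, X, y, cv=5).mean())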
4.7 : Adversarial Training

Adversarial examples also provide a means of accomplishing semi-supervised learning. Adversarial examples generated using not the true label but a label provided by a trained model are called virtual adversarial examples.

4.8 : Tangent Distance

Q.26 Write a short note on tangent propagation.
Ans.: The tangent prop algorithm trains a neural net classifier with an extra penalty to make each output f(x) of the neural net locally invariant to known factors of variation. The invariance is enforced using a local linear approximation of the transformation manifolds.
Fill in the Blanks for Mid Term Exam
Q.1 In machine learning, ______ is a way to prevent over-fitting.
Q.9 ______ can be interpreted as a way of regularizing a neural network by adding noise to its hidden units.
Q.10 Early stopping requires a ______, which means some data is not fed to the model during training.
Q.11 Bagging means ______.
Q.12 The technique called ______ constructs an ensemble with higher capacity than the individual models.
Q.15 Tangent propagation is closely related to ______.
Multiple Choice Questions for Mid Term Exam
Q.1 L² regularization is also known as ______.
a) Ridge regression b) Lasso regression c) Linear regression d) All of these
Q.2 In L² regularization, the regularization term is the ______ of all feature weights.
a) absolute value b) sum of squares c) square d) sum
Q.3 Input noise injection is part of some ______ learning algorithms, such as the denoising autoencoder.
a) semi-supervised b) supervised c) unsupervised d) deep

Answer Key for Multiple Choice Questions
Q.1 a, Q.2 b, Q.3 c, Q.4 b, Q.5 b
END.
Chapter 5 : Optimization for Train Deep Models

5.1 : Challenges in Neural Network Optimization
Moreover, MLEs and likelihood functions generally have very desirable large-sample properties.
The likelihood surface may have multiple optima, only one of which is the global optimum. Depending on the loss surface, it can be very difficult to locate the global optimum.

Q.3 What is empirical risk minimization ?
Ans.: Given a training set S and a function space, empirical risk minimization selects the function that minimizes the average loss (the empirical risk) over S.

The likelihood can be read as the conditional probability that the model M, with parameters θ, produces the observations x[1], ..., x[n] :

L(θ) = Pr(x[1], ..., x[n] | θ, M)

In MLE we seek the model parameters for which the likelihood attains a maximum.
Suppose we have a collection of numbers (x), and suppose the joint pdf for the data x is a function f(x ; θ) that depends on a set of parameters θ. Then the likelihood function is given as :

L(θ) = Πi f(xi ; θ)

Maximum likelihood estimation is a totally analytic maximization procedure. It applies to every form of censored or multi-censored data, and it is even possible to use the technique across several stress cells and estimate acceleration model parameters at the same time as life distribution parameters.
The maximum likelihood estimator is given by the following formula :

θ̂(x1, x2, ..., xn) = arg maxθ L(θ) = arg maxθ Πi f(xi ; θ)

The sample standard deviation is given by

s = √[ Σi (xi − x̄)² / (n − 1) ]

Ill-conditioning :
Ill-conditioning can manifest by causing SGD to get "stuck", in the sense that even very small steps increase the cost function.
Ill-conditioning of the Hessian matrix is a prominent problem in most numerical optimization problems, convex or otherwise.
In multiple dimensions, there is a different second derivative for each direction at a single point. The condition number of the Hessian at this point measures how much the second derivatives differ from each other.
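A small numeric check (an illustration, not the book's code) : for a Gaussian model the MLE of the mean is the sample mean, while the sample standard deviation above uses the n − 1 denominator :

    import numpy as np

    rng = np.random.RandomState(0)
    x = rng.normal(loc=5.0, scale=2.0, size=1000)

    mu_hat = x.mean()           # MLE of the Gaussian mean
    sigma_hat = x.std(ddof=0)   # MLE of sigma (divides by n)
    s = x.std(ddof=1)           # sample standard deviation (divides by n-1)

    # Log-likelihood of the data under the fitted Gaussian
    log_lik = -0.5 * np.sum(np.log(2 * np.pi * sigma_hat**2)
                            + ((x - mu_hat) / sigma_hat) ** 2)
    print(mu_hat, sigma_hat, s, log_lik)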
Two approaches include rescaling the gradients given a chosen vector norm, and clipping gradient values that exceed a preferred range. Together, these methods are referred to as "gradient clipping". They belong to a group of tricks and techniques designed to speed up the convergence of first-order optimization methods like gradient descent.
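Both variants in a minimal numpy sketch (the threshold values are illustrative) :

    import numpy as np

    def clip_by_norm(grad, max_norm=5.0):
        """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
        norm = np.linalg.norm(grad)
        return grad * (max_norm / norm) if norm > max_norm else grad

    def clip_by_value(grad, limit=1.0):
        """Clip each component into the preferred range [-limit, limit]."""
        return np.clip(grad, -limit, limit)

    g = np.array([3.0, -4.0, 12.0])
    print(clip_by_norm(g), clip_by_value(g))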
Q.11 What is stochastic gradient descent ?
Ans.: Much of machine learning can be written as an optimization problem. Example loss functions : logistic regression, linear regression, principal component analysis, neural network loss.
A very efficient way to train logistic models is with Stochastic Gradient Descent (SGD).
The idea behind stochastic gradient descent is iterating a weight update based on the gradient of the loss function :

w(k+1) = w(k) − γ ∇L(w(k))

Logistic regression is designed as a binary classifier (output, say, {0, 1}) but actually outputs the probability that the input instance is in the "1" class. A logistic classifier has the form :

p(X) = 1 / (1 + exp(−X β))

where X = (x1, ..., xn) is a vector of features.
One challenge with training on power-law data (i.e. most data) is that the terms in the gradient can have very different strengths. Stochastic gradient has some serious limitations, however, especially if the gradients vary widely in magnitude : some coefficients change very fast, others very slowly.
This happens for text, user and social activity media data (and other power-law data), because gradient magnitudes scale with feature frequency, i.e. over several orders of magnitude.
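The extract's import of SGDClassifier fits here; a minimal usage sketch around it (the dataset and hyperparameters are illustrative; the loss is named "log" in older scikit-learn versions) :

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.RandomState(0)
    X = rng.randn(200, 5)
    y = (X[:, 0] + X[:, 1] > 0).astype(int)  # a simple binary target

    # loss="log_loss" gives logistic regression trained by SGD
    clf = SGDClassifier(loss="log_loss", learning_rate="constant",
                        eta0=0.01, max_iter=1000, tol=1e-3)
    clf.fit(X, y)
    print(clf.score(X, y))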
5.2 : Parameter Initialization Strategies and Algorithms with Adaptive Learning Rates

Q.13 Write a short note on parameter initialization strategies.
Ans.: Training algorithms for deep learning models are usually iterative and thus require the user to specify some initial point from which to begin the iterations.
The initial point can determine whether the algorithm converges at all, with some initial points being so unstable that the algorithm encounters numerical difficulties and fails altogether.
When learning does converge, the initial point can determine how quickly learning converges and whether it converges to a point with high or low cost. Points of comparable cost can have wildly varying generalization error, and the initial point can affect the generalization as well.
If two hidden units with the same activation function are connected to the same inputs, then these units must have different initial parameters. If they have the same initial parameters, then a deterministic learning algorithm applied to a deterministic cost and model will constantly update both of these units in the same way.

Q.14 Explain the following : a) AdaGrad b) Adam
Ans.: a) AdaGrad :
The AdaGrad algorithm is just a variant of gradient descent that shrinks the learning rate according to the
entire history of the squared gradient, and may have made the learning rate too small before arriving at such a convex structure.
b) Adam :
Adaptive Moment Estimation (Adam) is another method that computes adaptive learning rates for each parameter. It keeps an exponentially decaying average of past gradients.
Adam is generally regarded as being fairly robust to the choice of hyperparameters, though the learning rate sometimes needs to be changed from the suggested default.
Each iteration sets t ← t + 1, computes the gradient g ← ∇θ Σi L(f(x(i) ; θ), y(i)) on the current minibatch, updates the decaying moment estimates, and applies the parameter update.

Newton's method : It is described in more detail here, emphasizing neural network training. It is based on a Taylor series expansion to approximate J(θ) near some point θ0, ignoring derivatives of higher order :

J(θ) ≈ J(θ0) + (θ − θ0)ᵀ ∇θ J(θ0) + ½ (θ − θ0)ᵀ H (θ − θ0)

where H is the Hessian of J with respect to θ, evaluated at θ0. Solving for the critical point of this function, we obtain the Newton parameter update rule :

θ* = θ0 − H⁻¹ ∇θ J(θ0)

If there are higher-order terms, this update can be iterated, yielding the training algorithm given next.
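The Adam update in a minimal numpy sketch (the hyperparameter defaults follow the original paper; this is an illustration, not the book's listing) :

    import numpy as np

    def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
        """One Adam update; m and v are decaying first/second moment averages."""
        m = beta1 * m + (1 - beta1) * grad       # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * grad ** 2  # second moment (uncentered variance)
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v

    theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
    for t in range(1, 201):                      # minimize ||theta - 1||^2
        grad = 2 * (theta - 1.0)
        theta, m, v = adam_step(theta, grad, m, v, t)
    print(theta)  # close to [1, 1, 1]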
5.4 : Optimization Strategies and Meta-Algorithms; Applications

Q.17 What is batch normalization ?
Ans.: Batch normalization is one of the most exciting innovations in deep learning; it has significantly stabilized the learning process and allowed faster convergence rates.
Most deep learning networks are compositions of many layers or functions, and the gradient with respect to one layer is taken considering the other layers to be constant. However, in practice all the layers are updated simultaneously, and this can lead to unexpected results.
For example, let ŷ = x w1 w2 ... w10. Here ŷ is a linear function of x, but not a linear function of the weights. Suppose the gradient is given by g, and we now intend to reduce ŷ by 0.1. Using a first-order Taylor series approximation, a small step against g is expected to achieve this; but the higher-order effects of updating all ten weights simultaneously can change ŷ by a very different amount.
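A minimal batch-normalization forward pass over one mini-batch (training-time statistics only; gamma and beta are the learned scale and shift) :

    import numpy as np

    def batch_norm_forward(x, gamma, beta, eps=1e-5):
        """Normalize each feature over the mini-batch, then scale and shift."""
        mu = x.mean(axis=0)   # per-feature batch mean
        var = x.var(axis=0)   # per-feature batch variance
        x_hat = (x - mu) / np.sqrt(var + eps)
        return gamma * x_hat + beta

    x = np.random.randn(32, 4) * 10 + 3  # a badly scaled batch of activations
    out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
    print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 and ~1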
Q.19 Explain how deep learning helps in the speech recognition problem.
Ans.: Speech recognition (also known as Automatic Speech Recognition (ASR), or computer speech recognition) is the process of converting a speech signal to a sequence of words, by means of an algorithm implemented as a computer program.
Automatic Speech Recognition (ASR) is the task of producing a text transcription of the audio signal from a speaker.
Large Vocabulary Speech Recognition (LVSR) involves a large or open vocabulary, and we will also take it to imply a speaker-independent recognizer.
The most general LVSR systems may also need to operate on spontaneous as well as read speech, with audio data from an uncontrolled recording environment corrupted by both unstructured and structured noise.
Traditional LVSR recognizers are based on Hidden Markov Models (HMMs) with Gaussian Mixture Models (GMMs). The feature extraction is usually performed in three stages.
Q.4 A ______ acts as a proxy to the empirical risk while being "nice" enough to be optimized efficiently.

Answer Key for Multiple Choice Questions
Q.2 c, Q.3 d, Q.4 b
END...
SOLVED MODEL QUESTION PAPER
Neural Networks and Deep Learning
B.Tech., IV-II (CSE/IT) (R18 Pattern)

Note : This question paper contains two parts A and B. Part A is compulsory, which carries 25 marks. Answer all questions in Part A. Part B consists of 5 Units. Answer any one full question from each unit. Each question carries 10 marks and may have a, b, c as sub-questions.

PART A (25 Marks)

Q.1 a) What is artificial neural network ? (Refer Q.1 of Chapter 1) [2]
b) Define auto associative memory. (Refer Q.32 of Chapter 1) [2]
d) What is winner-take-all learning network ? (Refer Q.5 of Chapter 2) [3]
e) List the applications of deep learning. (Refer Q.3 of Chapter 3) [2]
f) What is forward propagation ? (Refer Q.34 of Chapter 3) [3]
g) What is regularization ? (Refer Q.1 of Chapter 4) [2]
h) What is the difference between L¹ and L² regularization ? (Refer Q.6 of Chapter 4) [3]

a) Explain any two activation functions. (Refer Q.20 of Chapter 3) [5]
OR
b) List the limitations of the gradient descent algorithm. Explain gradient descent. (Refer Q.30 of Chapter 3)
Q.8 a) Explain overfitting. What are the reasons for overfitting ? (Refer Q.2 of Chapter 4)
b) Explain applications of convolutional neural network. (Refer Q.7 of Chapter 4)
OR