Module 1 Notes


Neural Network Module 1

Module 1
1.1 BIOLOGICAL NEURONS
a. Structure of a Biological Neuron
• The information-processing cells of the brain are the neurons.
• Neurons come in various shapes and sizes: some are as small as a red blood cell; others are half as
large as the entire organism.
• As shown in Fig. 1.1, each neuron has a soma or cell body which contains the cell’s nucleus
and other vital components called organelles which perform specialized tasks.
• There are also its main communication links:
o A set of dendrites which form a tree-like structure that spreads out from the cell. The
neuron receives its input electrical signals along these.
o A single axon which is a tubular extension from the cell soma that carries an electrical
signal away from the soma to another neuron for processing.
• The dendrites and axon together are sometimes called processes of the cell.
• A dendritic tree typically starts out as a narrow extension from the soma, and then forms a very
dense structure by repeatedly branching.
• In fact, dendritic trees are so dense that they account for most of the cell surface.
• Membranes that form dendrites are similar to the membranes of the soma and are basically
extensions of the cell body.
• Dendrites may also emerge from several different regions of the soma.
• A neuron has only one axon which may repeatedly branch to form an axonal tree.
• An axon carries the electrical signal called an action potential to other neurons for processing.
• Axons can terminate in local regions or may travel larger distances.
• They usually terminate on the dendrites of other cells or on muscles.
• Axons can branch, delivering their signals to a large number of recipients, some close and
others farther away.

Fig 1.1 A Neuron


b. Glial Cells and Membranes
• Glial cells, which account for about nine-tenths of the cells in the brain, come in three varieties, each
with a specialized role.
• The star shaped astroglia surround the neurons and isolate them from the smallest blood vessels of
the brain called capillaries.

Department of CSE, AJIET Mangalore 1



• These cells form an interface with the capillary walls called the blood-brain barrier, across which
they selectively absorb nutrients from the blood and transfer them
to the neurons.
• They also provide physical support and electrical isolation for neurons.
• Isolation minimizes inter-neuron crosstalk.
• Microglia are small cells that move continuously between neurons and glia to clean up debris.
• Oligodendroglia send out membranous processes that wrap themselves tightly around axons,
forming a dense layer of spiralling membranes called a myelin sheath.
• The plasma membrane that encloses neurons is a two-layered structure about 90Å thick.

• As shown In Fig, 1.2, the two layers of the plasma membrane consist of phospholipid molecules.
• Each molecule has a hydrophilic (water loving) head, and a hydrophobic (water avoiding) tail made of
a hydrocarbon chain.
• Figure 1.2 shows how these molecules self-organize into a stable surface—hydrophilic heads face
towards the surrounding watery medium, and hydrophobic tails are isolated from the water, facing one
another on the inside.

Fig 1.2 Structure of the cell membrane

• A variety of proteins are embedded in the cell membrane.


• The principal function of these proteins is to regulate the transport of ions through pores in the
membrane and move ions across the membrane from one side to the other. These are therefore called
ion channels.
• Ion channels have three extremely important properties:
o they allow the passage of ions at a high rate;
o they are very selective to a specific ion;
o they can open or close depending upon a voltage or chemical signal.

c. Resting Potential and Nernst Equation

• Due to a difference in ion concentrations inside and outside the cell, there is a difference in electrical
potential across the membrane which is called the resting membrane potential.
• This potential difference, as observed through recordings from an electrode placed inside a cell, is
about -65 mV with respect to the external medium.
• The table below shows the concentrations of the most important ions, sodium, potassium and
chloride, to which the membrane is semipermeable.


• Because of the difference in concentrations, ions will tend to diffuse into and out of the cell depending
upon the concentration gradient.
• For example, K+, which is in excess within the cell, will diffuse out.
• As it does so, it will leave behind unbalanced negative charges within the cell, making the interior
increasingly negative and the exterior increasingly positive.
• This sets up an electric field which opposes the movement of the ions.
• A stage comes when the electrical force due to the imbalance of charge neutrality balances the diffusive
force due to the concentration gradient.
• At this point, there is no net movement of the ions across the membrane, and the ion is said to be in
equilibrium. This potential is called the equilibrium potential of the ion.
• The ionic equilibrium potential across the neuron's membrane can be calculated from the ion
concentrations inside and outside the cell using the Nernst equation:

E_X = (RT / zF) ln([X]_out / [X]_in)

where R is the universal gas constant, T the absolute temperature, F Faraday's constant, z the valence
of ion X, and [X]_out and [X]_in the ion concentrations outside and inside the cell.

• The two forces keep the ion X in dynamic equilibrium.


o Concentration gradient: ions tend to diffuse inward (outward) due to a
greater (lesser) concentration of ions on the outside.
o Electrostatic gradient: ions tend to be pushed out of or into the cell depending
upon the nature of the electric field that builds up across the membrane due to the imbalance of
charge neutrality.
• The equilibrium membrane potentials for the three ions can be calculated in this way; they come to
approximately E_Na = +62 mV, E_K = -80 mV and E_Cl = -65 mV.
• At the resting neuron potential of -65 mV, chloride ions are in equilibrium.
• However, at -65 mV sodium and potassium ions cannot be in equilibrium.
• At this resting potential sodium ions should be pouring into the cell along the concentration and
electrostatic gradients, and following the same reasoning, potassium ions should be leaking out.
• An inward leakage of sodium ions would tend to push the resting potential towards +62 mV, and an
outward leakage of potassium ions would tend to push the resting potential towards -80 mV.
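The Nernst calculation above can be sketched in a few lines of Python. The ion concentrations used here are assumed, typical textbook values for a mammalian neuron (the source's own concentration table is not reproduced); with them the function recovers the equilibrium potentials quoted in this section.

```python
import math

def nernst(c_out, c_in, z, T=310.0):
    """Nernst equilibrium potential in millivolts.

    c_out, c_in -- ion concentrations outside/inside the cell (mM)
    z           -- valence of the ion
    T           -- absolute temperature (K); 310 K is body temperature
    """
    R = 8.314     # universal gas constant, J/(mol*K)
    F = 96485.0   # Faraday constant, C/mol
    return 1000.0 * (R * T) / (z * F) * math.log(c_out / c_in)

# Assumed concentrations (mM): (outside, inside, valence)
ions = {"K": (5.0, 100.0, +1), "Na": (150.0, 15.0, +1), "Cl": (150.0, 13.0, -1)}
for name, (c_out, c_in, z) in ions.items():
    print(f"E_{name} = {nernst(c_out, c_in, z):+.0f} mV")
```

Running this prints E_K = -80 mV, E_Na = +62 mV and E_Cl = -65 mV, matching the values used in the discussion.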


• Now, if the membrane were equally permeable to both sodium and potassium ions, one could expect
the resting membrane potential to be the average of 𝐸𝑘 and 𝐸𝑁𝑎 .
• However, the cell membrane is almost forty times more permeable to potassium than it is to sodium,
which accounts for the resting potential being closer to 𝐸𝑘 than to 𝐸𝑁𝑎 .
• This balance is maintained with the help of membrane embedded proteins that exchange potassium
and sodium ions by expending metabolic energy.
• These microscopic sodium-potassium pumps actively pump sodium out of the cell and potassium into
the cell in order to maintain a high sodium ion concentration outside the cell and a high potassium
concentration within the cell.

d. Graded Potentials Spread in Space and Time

• The maintenance of the resting potential is important for the functioning of a neuron since it provides
a baseline upon which the neuron decides whether or not to transmit information.
• External signals impinge on the neuron at synapses, which presynaptic neurons make at dendritic
spines; these signals create disturbances in the cell potential called graded potentials.
• When an impinging signal decreases the internal potential below the resting potential then the neuron
is said to be hyperpolarized.
• On the other hand, if the external disturbance pushes the potential up in a positive direction with respect
to the resting potential, then the neuron is said to be hypopolarized or depolarized.

• Impinging action potentials at the dendrites of a neuron create tiny electrical disturbances, typically of
the order of a few millivolts.
• These disturbances are conducted along the neuron’s processes in the same way as an electrical charge
is conducted along a resistive conductor.
• Since the dendrite has both a resistance and a capacitance, the disturbance decays exponentially
towards zero in space.
• This decay occurs within a few millimetres along the membrane.
• For a membrane space constant of 1 mm, the voltage 1 mm away from the origin of the disturbance is
e^(-1) ≈ 0.368 times the original disturbance.
• The space constant is different on every part of the neuron since it depends upon the shape of the
membrane as well as the number of the ion channels and the nature of the membrane proteins.
• Since the amount of fluid inside the membrane of the dendrite is responsible for conduction of
electricity, the space constant depends on the diameter of the dendrite.
• Dendrites with a larger diameter have more fluid and therefore offer less resistance and conduct better.
• Consequently, they have large space constants.
• The converse is true of thinner dendrites which have smaller space constants.
• As the graded potential varies in space, so does it also vary in time.
• What happens is that when an external stimulus disturbs the ionic balance of ions in a resting
membrane, it takes a finite time for the diffusion and pumping of ions to restore the resting potential.
• Graded potentials are thus sustained for some time before equilibrium concentrations are gradually
restored.
• Just as a graded potential decays in space, so does it decay back towards its resting value in time. There
is thus a time constant which quantifies the rate at which the decay to the resting value takes place.
• For a typical neuron the time constant is about 4—10 ms.
• For a 4 ms time constant this means that in 4 ms the membrane potential returns to 36.8 per cent of its
deviation from the resting level.
• Thus, when a stimulus is applied to a dendrite, the membrane potential is restored to the resting value
in a small fraction of a second.
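The spatial and temporal decay described above can be combined into a toy formula, assuming independent exponential decay with a space constant and a time constant (the values below are the typical figures quoted in the text):

```python
import math

def graded_potential(v0, distance_mm, t_ms, space_const_mm=1.0, tau_ms=4.0):
    """Amplitude of a graded potential after decaying over distance and time.

    v0 is the initial disturbance (mV); the decay is exponential in both
    space (space constant lambda) and time (time constant tau).
    """
    return v0 * math.exp(-distance_mm / space_const_mm) * math.exp(-t_ms / tau_ms)

# One space constant away, measured immediately: 36.8% of the disturbance
print(graded_potential(10.0, 1.0, 0.0))
# One time constant (4 ms) later at the origin: again 36.8%
print(graded_potential(10.0, 0.0, 4.0))
```

One space constant away, or one time constant later, the disturbance is down to e^(-1) ≈ 36.8% of its original size, as stated above.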


e. Action Potential

• Neurons constantly receive inputs from other neurons along their dendrites at points of contact called
synapses.
• These inputs take the form of small electrical disturbances that are called postsynaptic potentials
(PSPs).
• For a neuron with hundreds and thousands of synapses, these tiny PSPs occur asynchronously in space
and time at various points along different dendrites of the neuron.

• Each PSP is like a small increasing or decreasing voltage pulse that decays in space and time.
• The cell soma receives each of these small disturbances which are superimposed upon each other, and
therefore the soma potential reflects a temporal integration of these potentials.
• In other words, the soma potential varies in time — sometimes increasing and sometimes decreasing
back towards the resting value.
• Measurements therefore yield a continuous time-varying potential waveform.

• At the point where the axon of the neuron meets the cell body, the axon expands into a structure called
the axon hillock. Here, ion channels exist in considerably high density.
• These are highly sensitive to small perturbations in the integrated soma potential.
• These ion channels constantly monitor the soma potential in such a way that when the cell potential
exceeds a threshold value of about -40 mV, the neuron fires an action potential that is transmitted down
its axon towards a synaptic terminal.
• At the axon hillock, specialized electrically gated sodium and potassium ion channels are to be found.
When we say electrically gated, we mean that the proteins that regulate these channels are sensitive to
deviations in the membrane potential from its resting value of -65 mV.
• The membrane potential at the axon hillock mirrors the soma potential.
• Now, as long as the cell is hyperpolarized, or even only slightly depolarized, nothing unusual happens.
However, when the depolarization is sufficient to push the membrane potential above about
-40 mV, the specialized ionic channels of the axonal membrane open up, permitting the free passage of
specific ions into and out of the axon.
• First, the sodium channels open. This naturally creates a large influx of sodium ions which move in
under their electrostatic and concentration gradients.
• This influx of sodium ions further depolarizes the cell at that point, leading to an increase in the number
of channels that open.
• Depolarization thus dramatically increases the overall sodium conductance, so much so that within less
than a millisecond the local potential of the axon turns positive with respect to the outside as it tries to
approach the sodium equilibrium potential of +62 mV.
• However, it never gets there because in a fraction of a millisecond the sodium channels begin to close,
regardless of the polarization state of the axon membrane.

• Depolarization also causes potassium channels to open, albeit somewhat more slowly than the
sodium channels.
• Note that the membrane is always somewhat leaky in regard to potassium ions, and depolarization
reduces the membrane resistance to potassium ions even further.
• Potassium channels begin to open around the same time that sodium channels begin to close.
• The opening of potassium channels causes a large efflux of these ions, once again along the
electrostatic and concentration gradients.
• This efflux causes the membrane potential to drop sharply back towards the potassium equilibrium
potential.


• As the local axon potential becomes hyperpolarized, potassium channels begin to close, and there is a
small undershoot towards the -80 mV potassium equilibrium potential due to leakage of extra potassium
from within the axon.
• The closing of the potassium channels signals the end of the action potential, as sodium-potassium
pumps subsequently restore the ion concentrations to their equilibrium levels.

Fig 1.3 Action Potential

f. Refractory Period

• No matter how strong the applied external stimulus is made, an action potential cannot be regenerated
during the action potential itself, that is, until the overshoot phase of the action potential is complete and
the sodium channels close.
• This period of inexcitability lasts for some 1-2 ms and is called the absolute refractory period.
• In the next few milliseconds after the completion of the overshoot phase of the action potential, and during
the undershoot when the potassium channels are closing, it is more difficult than usual to excite the
axon since the membrane potential is still hyperpolarized.
• The threshold depolarization required to fire an action potential increases beyond the normal -40 mV.
This period is called the relative refractory period, because the axon is excitable but requires a larger
stimulus than normal.
• The absolute and relative refractory periods naturally impose a biological limit on the maximum
frequency of firing of action potentials.
• No neuron can respond at rates much over 1000 spikes/sec (1000 Hz), and even this is possible only
for a fraction of a second with very strong stimuli.

g. Propagation of Action Potential

• Any disturbance in the equilibrium potential is conducted immediately to the adjacent region of the
membrane in the form of a graded potential.
• As the action potential rises towards its peak at the axon hillock, the potential along a considerable length
of axonal membrane adjacent to the axon hillock is also pushed above the threshold.
• This causes that region's sodium channels to open, thereby initiating the entire sequence of events that
generates an action potential.
• The channels are opened by membrane potential disturbances created by ionic flows of the nearby
action potential.
• This has the effect of short circuiting the graded potentials which would otherwise have spread out
exponentially in space from the axon hillock.
• The action potential caused by the opening of these channels in turn causes the opening of channels farther
down the axon, and the action potential is thus physically propagated along the axon.
• As the action potential passes down a patch of axon, sodium—potassium pumps in that region begin
to restore the equilibrium ion concentrations.


1.2 Neuron Activation

Figure 1.4 shows the j-th artificial neuron that receives input signals s_i from possibly n different sources. These
signals traverse from neuron i to neuron j along pathways or connections with an efficacy or weight denoted
by w_ij.
As shown in Fig. 1.4, the internal activation x_j is a linear weighted summation of the impinging signals,
modified by an internal threshold or bias θ_j:

x_j = Σ_{i=1}^{n} w_ij s_i + θ_j      (1)

• Positive weights model excitatory synapses, while negative weights model inhibitory synapses.
• Impinging signals 𝑠𝑖 represent mathematical abstractions of action potentials.
• The threshold θ_j models the internal firing threshold of a neuron, and the activation x_j models the
internal aggregated cell potential.
• The activation of the neuron is subsequently transformed through a signal function S(.) to generate the
output signal 𝑠𝑗 = 𝑆(𝑥𝑗 ) of the neuron.
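Equation (1) and the signal function S(.) can be sketched directly. The weights, inputs and bias below are hypothetical, chosen only to illustrate the computation; the binary threshold of Section 1.3.1 is used as the signal function.

```python
def neuron_activation(weights, signals, bias):
    """Inner activation x_j = sum_i w_ij * s_i + theta_j, per Eq. (1)."""
    return sum(w * s for w, s in zip(weights, signals)) + bias

def binary_threshold(x):
    """One choice of signal function S(.): fires 1 for x >= 0, else 0."""
    return 1 if x >= 0 else 0

# Hypothetical weights, inputs and bias for illustration
w = [0.5, -0.3, 0.8]
s = [1, 1, 0]
theta = -0.1
x_j = neuron_activation(w, s, theta)   # 0.5 - 0.3 + 0.0 - 0.1 = 0.1
print(binary_threshold(x_j))           # positive activation -> signal 1
```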

1.3 Neuron Signal Function


1.3.1 Binary Threshold Signal Function
• Neurons with binary signal functions are referred to as binary neurons or threshold logic neurons
(TLNs).
• A threshold logic neuron computes its activation x_j = s^T w_j and then generates a +1 or a 0 signal
depending upon whether this activation is non-negative or negative:

s_j = 1 if x_j ≥ 0, and s_j = 0 if x_j < 0


• As Fig. 1.5 shows, net positive activations translate to a +1 signal value, and net negative activations
translate to a 0 signal value. The threshold logic neuron is thus a two-state machine: s_j = S(x_j) ∈ {0, 1}.

Fig 1.5 Binary threshold neuron signal function with different values of neuronal bias as indicated

• To get around the discontinuity at 0, it can be assumed that the neuron updates its signal at discrete
instants of time k by sampling its instantaneous activation x_j^k.
• In discrete time, the binary threshold signal takes the form:

S(x_j^{k+1}) = 1 if x_j^{k+1} > 0;  S(x_j^k) if x_j^{k+1} = 0;  0 if x_j^{k+1} < 0

• Here the updated signal value S(x_j^{k+1}) at time instant k+1 is generated from the neuron activation
x_j^{k+1}, sampled at time instant k+1.
• The ambiguous activation value of zero leaves the neuron signal unchanged at its previous value.
• The behavioural response of the threshold logic neuron, a two-state machine, can be
extended to the bipolar case where the signals are s_j ∈ {-1, 1} instead of {0, 1}.
• The resulting function is the signum function, sign(x), commonly encountered in communication theory.
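A minimal sketch of the discrete-time update rule and its bipolar (signum) variant, assuming the zero-activation convention stated above (the signal is held at its previous value):

```python
def tln_update(x_new, s_prev):
    """Discrete-time binary threshold update: an activation of exactly
    zero leaves the signal unchanged at its previous value."""
    if x_new > 0:
        return 1
    if x_new < 0:
        return 0
    return s_prev  # ambiguous zero activation

def signum(x):
    """Bipolar counterpart: signals in {-1, +1}."""
    return 1 if x >= 0 else -1

s = 1                       # suppose the neuron last fired a 1
print(tln_update(0.0, s))   # zero activation keeps the signal at 1
print(tln_update(-2.0, s))  # negative activation switches it to 0
```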


1.3.2 Interpretation of Bias

• The activation can be rewritten as x_j = q_j + w_0j,
• where q_j = Σ_{i=1}^{n} w_ij s_i is the activation that aggregates from external sources and w_0j is the bias
internal to the neuron.
• Let 𝑞𝑗 be the net external input to the neuron or simply the activation of the neuron when the bias is
set to zero.
• From the point of view of activation 𝑥𝑗 , the signal function is always centred at 0 and 𝑠𝑗 = 1 if 𝑥𝑗 ≥ 0
and 0 otherwise.
• The range of activation values that yields a zero signal is (-∞, 0), and the range that yields a signal
value of 1 is [0, ∞).

• Viewed as a function of the net external input q_j, the signal function shifts either left or right depending
on the value of the internal bias.
• In the figure above, the signal function is shifted left by 3, where w_0j = +3.

• In other words, the bias alters the neuron response so that it now fires a signal 0 for net external inputs in the
range (-∞, -3), and a signal 1 for external inputs in the range [-3, ∞).
• The value of 𝑤0𝑗 biases the 0/1 firing ranges of the neuron for a given value of 𝑞𝑗 .
• By allowing the signal function to shift on the 𝑞𝑗 axis, the bias weight gives the neuron an extra degree
of freedom which extends the learning capabilities of the neuron.
• From the point of view of net activation 𝑥𝑗 , the signal is +1 if 𝑥𝑗 = 𝑞𝑗 + 𝑤0𝑗 ≥ 0 or 𝑞𝑗 ≥ − 𝑤0𝑗 and
is 0 if 𝑞𝑗 < − 𝑤0𝑗 .
• The neuron thus compares the net external input q_j = Σ_{i=1}^{n} w_ij s_i with -w_0j: if q_j is greater than or
equal to the negative of the bias, it fires +1; otherwise it fires 0.
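The shifted firing ranges can be checked numerically. This sketch assumes the bias value w_0j = +3 used in the example above:

```python
def tln_signal(q, w0):
    """Threshold logic with bias: fires 1 iff the net external input q
    is at least the negative of the bias (x = q + w0 >= 0)."""
    return 1 if q + w0 >= 0 else 0

# With bias w0 = +3 the firing boundary shifts to q = -3
print([tln_signal(q, 3) for q in (-4, -3, -2)])  # [0, 1, 1]
```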

1.3.3 Linear Threshold (Ramp) Signal Function


• The binary threshold signal function model has been generalized in various ways.
• The simplest is the linear signal function, s_j = x_j.


• Such a signal function is unbounded.


• The bounded version of the linear signal function is the linear threshold function shown in the figure

• For a piecewise linear threshold signal function:

s_j = 0 for x_j < 0;  s_j = α_j x_j for 0 ≤ x_j < x_m;  s_j = 1 for x_j ≥ x_m

• where α_j = 1/x_m is the slope parameter of the function.


• The signal function can be shifted along the 𝑞𝑗 axis depending on the value of the bias.
• The signal function of such a neuron can be written in the more compact form
s_j = min(1, max(0, α_j x_j)).
• The extra subscript j in the signal function makes it specific to the j-th neuron, emphasizing that each
neuron may have its own individual signal function, by maintaining a distinct value of the
slope parameter.


1.3.4 Sigmoidal Signal Function

• The sigmoid function is by far the most frequently used signal function in neural networks.
• A common example is the logistic signal function:

s_j = 1 / (1 + e^(-λ_j x_j))

• where λ_j > 0 is the gain factor.

• As 𝜆𝑗 → ∞ the smooth logistic function approaches the non-smooth binary threshold function.
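A sketch of the logistic signal function; increasing the gain factor visibly squeezes it towards the binary threshold function:

```python
import math

def logistic(x, gain=1.0):
    """Logistic signal function s = 1 / (1 + exp(-gain * x))."""
    return 1.0 / (1.0 + math.exp(-gain * x))

print(logistic(0.0))            # 0.5 at zero activation
print(logistic(1.0, gain=50))   # close to 1: large gain approaches the step
print(logistic(-1.0, gain=50))  # close to 0
```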

1.3.5 Gaussian Signal Function


• The Gaussian signal function is given by:

s_j = exp( -(x_j - c_j)² / (2σ_j²) )

• where σ_j is the Gaussian spread factor and c_j is the centre.


• Varying the spread makes the function sharper or more diffuse.


• Changing the centre shifts the function to the right or left along the activation axis.
• This function is also an example of a non-monotonic signal function—it first increases from zero to
its peak value of 1 at the centre and then decreases smoothly towards 0.

• Activations close to the centre of the function elicit strong signal responses.
• Those far away from the centre elicit weak responses.
• There is thus a receptive field of the neuron within which it responds to inputs by generating strong
signals.
• These neurons can therefore be used to recognize specific ranges of inputs, by selectively tuning their
receptive fields to be centred around those regions of activity using a suitable learning algorithm.
• Like its sigmoidal counterpart, the Gaussian function takes as input an infinite range of activations,
(-∞, ∞), and transforms them smoothly into values in the finite range (0, 1].
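A sketch of the Gaussian signal function; the centre and spread values below are hypothetical illustration choices:

```python
import math

def gaussian_signal(x, center=0.0, spread=1.0):
    """Gaussian signal function: peaks at 1 when the activation equals
    the centre c, and decays smoothly towards 0 away from it."""
    return math.exp(-((x - center) ** 2) / (2.0 * spread ** 2))

print(gaussian_signal(2.0, center=2.0))  # 1.0: activation at the centre
print(gaussian_signal(5.0, center=2.0))  # weak response outside the receptive field
```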

1.3.6 Stochastic Neurons

• All the neuron signal functions introduced thus far are deterministic in the sense that the signal value
is completely determined by the instantaneous activation value that the neuron acquires, in conjunction
with the neuron signal function. There is no randomness.
• However, synaptic transmission in biological neurons is essentially a noisy process brought about
through random fluctuations resulting from the release of neurotransmitters, and numerous other
uncertain causes.
• Such randomness generated through synaptic noise should be taken into account within a
mathematically tractable framework.
• Stochasticity can be introduced into the neuron by assuming that its activation-to-signal update is no
longer deterministic.


• For a binary or bipolar stochastic neuron the signal is assumed to be two-state, but the neuron switches
between these states depending on a probabilistic function of its activation, P(x_j).
• The signal state s_j of neuron j switches in accordance with the probabilistic rule:

s_j = +1 with probability P(x_j), and s_j = -1 with probability 1 - P(x_j)

• For the function P(x_j) to qualify as a probability, it must satisfy the limiting values
P(x_j) → 1 as x_j → +∞, and P(x_j) → 0 as x_j → -∞.
• The function P(𝑥𝑗 ) must increase monotonically with 𝑥𝑗 between these limits.
• The common choice for P(x_j) is the sigmoid-shaped function:

P(x_j) = 1 / (1 + e^(-x_j / T))

• where T is the pseudo-temperature that controls the noise level.


• It is the parameter that controls the fluctuations in signal state that in turn represent the effects of
synaptic noise.
• In the noiseless limit as T→0, the function transitions to a step, and the probabilistic firing rule
becomes purely deterministic.

• Simulation of such a probabilistic function simply requires the generation of a random number drawn
from a uniform distribution in the interval [0, 1].
• If the number generated is less than or equal to P(𝑥𝑗 ) then the neuron switches to the +1 state; otherwise
it switches to the -1 state.
• If T = 0, P(𝑥𝑗 ≥ 0)=1, and P(𝑥𝑗 < 0) = 0.
• In such a case, a neuron switches deterministically, and the signal function reverts to the binary
threshold case.
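The probabilistic firing rule can be simulated exactly as described: draw a uniform random number and compare it with P(x_j). The pseudo-temperature and sample count below are arbitrary illustration choices:

```python
import math
import random

def fire_probability(x, T):
    """P(x) = 1 / (1 + exp(-x/T)); the pseudo-temperature T sets the noise."""
    if T == 0:  # noiseless limit: deterministic threshold rule
        return 1.0 if x >= 0 else 0.0
    return 1.0 / (1.0 + math.exp(-x / T))

def stochastic_neuron(x, T, rng):
    """Switch to +1 with probability P(x), otherwise to -1."""
    return 1 if rng.random() <= fire_probability(x, T) else -1

rng = random.Random(0)  # seeded for reproducibility
states = [stochastic_neuron(0.0, T=1.0, rng=rng) for _ in range(10000)]
print(states.count(1) / len(states))  # near 0.5: zero activation is maximally noisy
print(stochastic_neuron(2.0, T=0, rng=rng))  # T = 0: deterministic, fires +1
```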


Summary of Signal Functions

1.4 ARCHITECTURES: FEEDFORWARD AND FEEDBACK

• A neural network N can be viewed as a weighted directed graph in which artificial neurons are nodes
and directed weighted edges represent connections between neurons.
• Local groups of neurons can be connected either in
o a feedforward architecture, in which the network has no loops; or
o a feedback (recurrent) architecture, in which loops occur in the network because of feedback
connections.

1.4.1 Neural Networks Generate Mappings

• The neurons in an artificial neural network are generally organized into layers.
• Inputs to the network are presented to the input layer; outputs are generated as signals of the output
layer.
• Since the numeric activations and signals comprise vectors, an input vector applied
to the input layer generates an output signal vector across the output layer.
• The signals may pass through one or more intermediate or hidden layers which transform the signals
depending on the neuron signal functions.
• The network performs a transformation, linear or non-linear, from the input space to the output space.


• For example, if a 5-neuron input layer feeds signals to a 4-neuron output layer, an input vector from a five-
dimensional space is transformed or mapped to a signal vector in a four-dimensional space.
• Neural networks thus generate mappings from one space to another or sometimes from the input space
back to itself.

• Multi-layered networks that associate vectors from one space to vectors of another space are called
heteroassociators.
• At any instant of time, they might map or associate two different patterns with one another, one as
input and the other as output.
• Mathematically, this can be written as f: ℝⁿ → ℝᵖ.
• The neural network N takes a vector in ℝ𝑛 and maps it to a point in ℝ𝑝 .
• When neurons in a single layer (L_X) connect back onto themselves, the resulting network is called an
autoassociator, since it associates a single pattern in ℝⁿ with itself.
• In such a case, the neural network implements a mapping f: ℝ𝑛 → ℝ𝑛 .
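A single feedforward layer mapping ℝ⁵ to ℝ⁴ can be sketched as below. The weight matrix and input are hypothetical, and the logistic function of Section 1.3.4 serves as the signal function:

```python
import math

def layer_forward(weights, signals, biases):
    """One feedforward layer: output neuron j computes
    S(sum_i w[j][i] * s[i] + theta[j]) with a logistic signal function."""
    out = []
    for w_row, b in zip(weights, biases):
        x = sum(w * s for w, s in zip(w_row, signals)) + b
        out.append(1.0 / (1.0 + math.exp(-x)))
    return out

# Hypothetical 5 -> 4 mapping: a 4x5 weight matrix takes a vector in R^5
# to a signal vector in R^4
W = [[0.1 * (i - j) for i in range(5)] for j in range(4)]
theta = [0.0] * 4
x_in = [1.0, 0.0, -1.0, 0.5, 0.2]
y = layer_forward(W, x_in, theta)
print(len(y))  # 4: the output lives in a four-dimensional space
```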

1.4.2 Activation and Signal State Space

• For a p-dimensional field of neurons, the activation state space is the entire ℝᵖ.
• However, the signal state space is the Cartesian cross space [0, 1]ᵖ if the neurons have continuous
signal functions in the interval [0, 1], or [-1, 1]ᵖ if the neurons have continuous signal functions in the
interval [-1, 1].
• For the case when the neuron signal functions are binary threshold, the signal state space is {0, 1}ᵖ,
or {-1, 1}ᵖ when the neuron signal functions are bipolar threshold.

1.4.3 Feedforward vs Feedback: Static vs Dynamic

• In a multilayer perceptron, neurons are organized into different layers that have unidirectional
connections between them.
• Such feedforward networks are static in the sense that the output depends only on the present input.
• Mathematically, S = f(X),
where S = (s_1, …, s_p)ᵀ ∈ ℝᵖ is the output signal vector generated in response to an input vector
X ∈ ℝⁿ.
• Feedforward neural networks are memoryless in the sense that their response to an input is independent
of the previous network state.
• Although such networks possess no dynamics, they are capable of demonstrating powerful properties
such as universal function approximation and find widespread applications in pattern classification.

• On the other hand, recurrent or feedback networks are examples of non-linear dynamical systems.
• When a new input pattern is presented, the neuron outputs are computed as usual, but because these
outputs are fed back as inputs to the system, the activations of neurons get subsequently modified,
leading the network to a new state.
• The next state of the network is thus a function of the current input and the present state of the network,
very much like a finite state machine.

• In recurrent neural networks, network activations and signals are in a flux of change until (and unless)
they settle down to a steady state.
• These changing quantities are readily modelled by first-order nonlinear differential equations of the
form Ẋ = F(X),
where X = (x_1, …, x_n)ᵀ ∈ ℝⁿ is a vector of activations of neurons in the network.


• Each neuron instantaneously generates an output signal (assuming deterministic Signal function) as a
function of the activation.
• A trajectory in the state space ℝ𝑛 then describes the time evolution of network activations.
• Feedback neural networks represent complex non-linear dynamical systems.
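The settling behaviour can be illustrated with a small discrete-time sketch (a crude stand-in for the differential-equation model): the state is updated repeatedly until it stops changing. The weights are hypothetical and deliberately weak so that a steady state exists:

```python
import math

def recurrent_step(state, W, x_in):
    """One discrete-time update of a recurrent net: the next state is a
    function of the current input and the present state, like a finite
    state machine."""
    n = len(state)
    return [math.tanh(sum(W[i][j] * state[j] for j in range(n)) + x_in[i])
            for i in range(n)]

# Hypothetical 2-neuron network with weak feedback weights
W = [[0.0, 0.3], [-0.2, 0.0]]
x_in = [0.5, -0.1]
s = [0.0, 0.0]
for _ in range(100):
    s_next = recurrent_step(s, W, x_in)
    if max(abs(a - b) for a, b in zip(s_next, s)) < 1e-9:
        break  # reached a steady state
    s = s_next
print(s)  # further updates leave this state (numerically) unchanged
```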

1.5 CONVEX SETS, CONVEX HULLS AND LINEAR SEPARABILITY


Consider two pattern sets X₀ and X₁ sampled from two classes C₀ and C₁. Many subsets of ℝⁿ
contain the pattern sets X₀ and X₁.
Definition 1.5.1: A set S ⊂ ℝⁿ is convex iff λX + (1-λ)Y ∈ S for all X, Y ∈ S and all 0 ≤ λ ≤ 1.

Equivalently, a set S is convex if it contains all points on all line segments with end points in S.

Definition 1.5.2: The convex hull C(Xᵢ) of a pattern set Xᵢ is the smallest convex set in ℝⁿ which contains
the set Xᵢ. Equivalently, consider every convex set S_α such that Xᵢ ⊂ S_α ⊂ ℝⁿ, α ∈ I, where I is an index set.
Then the convex hull of Xᵢ is the intersection of all these sets: C(Xᵢ) = ∩_{α∈I} S_α.

The intersection is taken over all convex subsets of ℝⁿ that contain the pattern set in question; this
intersection yields the smallest convex set in ℝⁿ that contains the pattern set.

Definition 1.5.3: Two pattern sets Xᵢ and Xⱼ are said to be linearly separable if their convex hulls are disjoint,
that is, C(Xᵢ) ∩ C(Xⱼ) = ∅.
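Definition 1.5.1 can be spot-checked in code for a half-space, which is exactly the decision region of a TLN. The particular w, b, X and Y below are arbitrary illustration choices:

```python
def in_half_space(p, w, b):
    """Membership in S = {p : w.p + b >= 0}, the decision region of a TLN."""
    return sum(wi * pi for wi, pi in zip(w, p)) + b >= 0

def convex_combination(p, q, lam):
    """The point lam*p + (1 - lam)*q on the segment joining p and q."""
    return [lam * a + (1 - lam) * c for a, c in zip(p, q)]

# Per Definition 1.5.1: for X, Y in the half-space, every point
# lam*X + (1-lam)*Y must also lie in it. Spot-check on a grid of lambdas.
w, b = [1.0, -2.0], 0.5
X, Y = [3.0, 1.0], [0.0, 0.0]  # both satisfy w.p + b >= 0
ok = all(in_half_space(convex_combination(X, Y, k / 10), w, b)
         for k in range(11))
print(ok)  # True: a half-space is convex
```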

1.6 NON-LINEARLY SEPARABLE PROBLEMS

In general, a neuron with n inputs has a discriminant function representing an (n-1)-dimensional
hyperplane in ℝⁿ, which slices the entire ℝⁿ into two parts:
• one where the neuron output would be a +1, and
• the other where the output would be a 0.
The non-linear element of this model is represented by a hard limiting step function at its output.

Linear separability requires that the patterns to be classified be sufficiently separated from one another
to ensure that the decision surfaces consist only of hyperplanes. For example, the two convex hulls C₁, C₂ of the two
pattern sets in Fig. (a) shown below are separable. If they are moved closer, a point comes when the convex hulls
overlap (Fig. (b)) and no straight line can separate the two pattern sets. The pattern sets become linearly non-
separable, but may be non-linearly separable, as shown in Fig. (b). In short, a single TLN can no longer classify
the two classes effectively, and errors in classification will result when a binary threshold neuron classifier is
used to separate the patterns.

1.6.1 XOR is Not Linearly Separable

The Boolean XOR function f_⊕ : B² → {0, 1}, where B = {0, 1}, is an example of a non-linearly separable problem.


This is shown geometrically in Fig. (a) for the two-dimensional case. As before, the black dots indicate the points
mapped to 1; the white dots (open circles) indicate points mapped to 0. What are the convex hulls of X₀ and
X₁? These are simply the two straight lines that join the black dots and the white dots respectively, as indicated in the
figure by C₀ and C₁. Note that the convex hulls C₀ and C₁ are not disjoint: XOR is not linearly separable.

Another way to look at it is shown in Fig. (b). Notice that there is no single straight line able to
separate 𝑋0 = 𝑓⊕−1 (0) from 𝑋1 = 𝑓⊕−1 (1), that is, to separate the black dots from the white. Arrows perpendicular to
the two candidate separating lines show possible orientations, each of which results in one point being
misclassified.

Fig (a) Fig(b)
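The non-separability of XOR can be illustrated (though not proved) numerically: a brute-force scan over a grid of candidate weights finds many TLNs that realize OR but none that realize XOR. The grid range and step size are arbitrary choices made for this sketch.

```python
# Hedged numerical illustration (not a proof): scan a grid of candidate
# weights (w0, w1, w2) and count how many realize OR versus XOR with a
# binary threshold neuron.
import itertools

def fires(w0, w1, w2, x1, x2):
    return 1 if w1 * x1 + w2 * x2 + w0 >= 0 else 0

def realizes(truth_table, w0, w1, w2):
    return all(fires(w0, w1, w2, x1, x2) == d
               for (x1, x2), d in truth_table.items())

OR  = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

grid = [i / 2 for i in range(-6, 7)]            # -3.0 .. 3.0 in steps of 0.5
weights = list(itertools.product(grid, repeat=3))
print(sum(realizes(OR, *w) for w in weights))   # many solutions exist
print(sum(realizes(XOR, *w) for w in weights))  # 0 : none on this grid
```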

Theorem 1.6.1: No single threshold logic neuron can solve the logical XOR classification problem 𝑥1 ⊕ 𝑥2
.
Proof:
Assume a TLN with weights 𝑤0, 𝑤1 and 𝑤2 and inputs 𝑥1 and 𝑥2. For the XOR function, 𝑥1 ⊕ 𝑥2 =
1 if 𝑤1 𝑥1 + 𝑤2 𝑥2 + 𝑤0 ≥ 0. ------------------------(1)
Since the XOR operation is symmetric, 𝑥2 ⊕ 𝑥1 = 1 if 𝑤1 𝑥2 + 𝑤2 𝑥1 + 𝑤0 ≥ 0. -------------------(2)
Conditions (1) and (2) together imply that

((𝑤1 + 𝑤2)/2) 𝑥1 + ((𝑤1 + 𝑤2)/2) 𝑥2 + 𝑤0 ≥ 0

This implies that 𝑤𝑥 + 𝑤0 ≥ 0, where 𝑤 = (𝑤1 + 𝑤2)/2 and 𝑥 = 𝑥1 + 𝑥2.
𝑤𝑥 + 𝑤0 is a first-degree polynomial in 𝑥 which must be negative for 𝑥 = 0, non-negative for 𝑥 = 1,
and negative again for 𝑥 = 2. This is impossible, since a first-degree polynomial is monotonic and therefore
cannot change sign more than once. We thus conclude that no such weights exist, and therefore there is no
single TLN that can solve the XOR problem.
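The key step of the proof, that a first-degree polynomial cannot be negative at x = 0 and x = 2 yet non-negative at x = 1, can be spot-checked numerically: f(1) is exactly the average of f(0) and f(2) for any linear f. The random sampling below is purely illustrative.

```python
# Illustrative check of the proof's key fact: for f(x) = w*x + w0,
# f(1) is the average of f(0) and f(2), so it cannot be non-negative
# when both f(0) and f(2) are negative. Sampled weights are arbitrary.
import random

random.seed(0)
for _ in range(1000):
    w, w0 = random.uniform(-5, 5), random.uniform(-5, 5)
    f = lambda x: w * x + w0
    assert abs(f(1) - (f(0) + f(2)) / 2) < 1e-9   # midpoint identity
    if f(0) < 0 and f(2) < 0:
        assert f(1) < 0   # so XOR's required sign pattern is impossible
print("midpoint identity verified")
```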

1.6.2 Solving the XOR Problem

Since the desired central region of 1's in the geometry is created by two straight lines, a solution to the
problem emerges if each separating line is first implemented as the discriminant function of a TLN, and the
logical intersection of their outputs is then taken.

Figures (a) and (b) above provide a three-dimensional perspective of what we can construct with a single
binary neuron. Our points of interest, 𝐵2, lie on the 𝑥1, 𝑥2 plane, and the z-axis plots the neuronal output or
signal response for values of 𝑥1 and 𝑥2 in the range [-0.5, 1.5]. The signal response changes abruptly from 0
to 1 at the separating line that represents the neuron's discriminant function. In Fig. (a), which portrays the
response surface for a neuron designed to solve the logical OR function, the neuron fires a 0 to the left of the
separating line, and a 1 to the right of it. The region of response 0 contains the point (0,0) of 𝐵2; the region of
response 1 contains the points (0,1), (1,0) and (1,1) of 𝐵2, thus solving the OR problem. In Fig. (b), the neuron
fires a 1 to the left of the line, and a 0 to the right of the line. This is the response surface of a neuron which
solves the logical NAND function, since the output, s, is 0 for (1,1) and 1 for (0,0), (0,1) and (1,0). Single
binary threshold logic neurons can do no better than generate functions such as these. However, combining the
results of Fig. (a) and (b) using a logical AND operation on the surfaces (which can be implemented by a third
binary threshold logic neuron) results in the response surface portrayed in Fig. (c), which solves the XOR
classification problem. In Boolean terms too, XOR is generated by taking the logical AND of the OR and
NAND functions: 𝑥1 ⊕ 𝑥2 = (𝑥1 + 𝑥2)·(𝑥1𝑥2)′ = 𝑥1𝑥2′ + 𝑥1′𝑥2, where ′ denotes logical complement.

Figure (a) above shows the architecture of a neural network that can implement the XOR logic. This is a two-
layer architecture where Layer 1 implements the logical OR and NAND functions (neurons 1,2 respectively)
and Layer 2 takes the logical AND (neuron 3) of the resulting outputs of Layer 1 to generate the XOR function.
Figure (b) shows the more common notation for portraying such a multilayer network. Here, external inputs
go to linear input neurons (1 and 2) in the input layer which simply transmit these inputs without any
modification. The middle layer (comprising neurons 3 and 4) is now isolated from the inputs and is called the
hidden layer. The output neuron 5 belongs to the output layer. The neurons in Fig. (b) have been renumbered
for assignment of proper labels to the weights.
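A minimal sketch of this two-layer XOR network, with OR and NAND neurons feeding an AND neuron. The weight and bias values below are illustrative choices that realize the three Boolean functions; they are not prescribed by the text.

```python
# Sketch of the two-layer XOR network described above: Layer 1 computes
# OR and NAND of the inputs, Layer 2 takes the AND of their outputs.

def tln(inputs, weights, bias):
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias >= 0 else 0

def xor_net(x1, x2):
    h_or = tln((x1, x2), (1, 1), -0.5)        # neuron 1: logical OR
    h_nand = tln((x1, x2), (-1, -1), 1.5)     # neuron 2: logical NAND
    return tln((h_or, h_nand), (1, 1), -1.5)  # neuron 3: logical AND

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), xor_net(x1, x2))   # outputs 0, 1, 1, 0
```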

1.7 MULTILAYER NETWORKS

The threshold logic neurons can be connected into multiple layers in a feedforward fashion. This
dramatically increases the computational capability of the system as a whole. In general, it is not uncommon
to have more than one hidden layer when solving a classification problem. Such a situation arises when one
has a number of disjoint regions in the pattern space that are associated with a single class. To understand this
better, the following observations are made for binary threshold neurons with two hidden layers apart from
the input and the output layer.
1. Each neuron in the first hidden layer forms a hyperplane in the input pattern space.
2. A neuron in the second hidden layer can form a hyper-region from the outputs of the first layer neurons
by performing an AND operation on the hyperplanes. These neurons can thus approximate the
boundaries between pattern classes.
3. The output layer neurons can then combine disjoint pattern classes into decision regions made by the
neurons in the second hidden layer by performing logical OR operations.


In general, each additional neuron in any one of the hidden layers increases the number of
connections in the system. This increases the number of degrees of freedom, which ultimately allows the
machine to learn how to solve a complex decision-making problem. It should be clear that no number of
neurons in a single-hidden-layer network can separate meshed regions. However, two hidden layers suffice to
form arbitrarily complex decision regions.

Theorem 1.7.1: No more than three layers in binary threshold logic feedforward networks are required to
form arbitrarily complex decision regions.

Proof: By construction.
Consider the n-dimensional case: X ∈ ℝ𝑛 . Partition the desired decision regions into small hypercubes. Each
hypercube requires 2n neurons in the first layer (one for each side of the hypercube). One neuron in the second
layer takes the logical AND of the outputs from the first layer neurons. Outputs of second layer neurons will
be high only for points within the hypercube. Hypercubes are assigned to the proper decision regions by
connecting the outputs of second layer neurons to third layer neurons corresponding to the decision region
that the hypercubes represent by taking the logical OR of appropriate second layer outputs.
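The construction can be sketched for n = 2: four first-layer neurons define the sides of a square, and a second-layer neuron ANDs their outputs so that it fires only for points inside the square. The particular square used here is an arbitrary illustrative choice.

```python
# Sketch of the constructive proof for n = 2: 2n = 4 first-layer neurons
# (one per side of a square), one second-layer neuron taking their AND.

def tln(act):
    return 1 if act >= 0 else 0

def inside_square(x1, x2):
    sides = [
        tln(x1 - 0.0),   # right of the left side
        tln(1.0 - x1),   # left of the right side
        tln(x2 - 0.0),   # above the bottom side
        tln(1.0 - x2),   # below the top side
    ]
    # second-layer neuron: AND of the four outputs (fires iff all four fire)
    return tln(sum(sides) - 4)

print(inside_square(0.5, 0.5))   # 1 : inside the unit square
print(inside_square(2.0, 0.5))   # 0 : outside
```

A third-layer neuron would then OR together several such square detectors to assemble a disjoint decision region.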

1.8 Learning Algorithms

Learning algorithms define an architecture-dependent procedure to encode pattern information into
weights to generate these internal models. Learning proceeds by modifying connection strengths.

Most learning is data driven. The data might take the form of a set of input-output patterns derived from
a (possibly unknown) probability distribution. Here the output pattern might specify a desired system response
for a given input pattern, and the issue of learning would involve approximating the unknown function as
described by the given data. Alternatively, the data might comprise patterns that naturally cluster into some
number of unknown classes and the learning problem might involve generating a suitable classification of the

samples. The nature of the learning problem allows us to demarcate learning algorithms into two categories:
supervised and unsupervised.

1.8.1 Supervised Learning

The data available for learning comprises a set of discrete samples T = {(𝑋𝑘 , 𝐷𝑘 )}𝑄𝑘=1 drawn from the
pattern space, where each sample relates an input vector 𝑋𝑘 ∈ ℝ𝑛 to an output vector 𝐷𝑘 ∈ ℝ𝑝 . The set of
samples describes the behaviour of an unknown function f: ℝ𝑛 → ℝ𝑝 which is to be characterized.

Figure above portrays the structure of a supervised learning system. When an input 𝑋𝑘 is presented to
the system, it generates a response 𝑆𝑘 . Supervised learning employs the teaching input 𝐷𝑘 , the associant of
𝑋𝑘 , to reduce the error (𝐷𝑘 − 𝑆𝑘 ) in the response of the system. Input-output sample pairs are employed to
train the network through a form of error correction learning or gradient descent weight adaptation. These
procedures are based on global error measures derived from the difference between the desired (𝐷𝑘 ) and actual
(𝑆𝑘 ) output of the network. The implementation of learning in a supervised system is usually in the form of
difference equations that are designed to work with such global information.
Supervised learning encodes a behaviouristic pattern into the network by attempting to approximate
the function that underlies the data set. The system should generate an output 𝐷𝑘 , in response to an input 𝑋𝑘 ,
and we say that the system has learnt the underlying map if a stimulus 𝑋′𝑘 close to 𝑋𝑘 , elicits a response 𝑆′𝑘
which is sufficiently close to 𝐷𝑘 .
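The supervised loop described above can be sketched with a single linear neuron trained by an LMS-style update. The target function, learning rate, and sample set are assumptions made purely for this illustration.

```python
# Minimal sketch of the supervised loop: present X_k, compare the response
# S_k with the teaching input D_k, and adapt the weights to shrink the
# error (D_k - S_k). The target map y = 2*x + 1 is an arbitrary choice.

samples = [(x, 2 * x + 1) for x in (0.0, 0.5, 1.0, 1.5, 2.0)]  # (X_k, D_k)
w, b, eta = 0.0, 0.0, 0.1

for _ in range(200):                  # epochs
    for x, d in samples:              # iterations
        s = w * x + b                 # system response S_k
        err = d - s                   # error (D_k - S_k)
        w += eta * err * x            # gradient-descent style update
        b += eta * err

print(round(w, 2), round(b, 2))       # close to 2.0 and 1.0
```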

1.8.2 Unsupervised Learning

One of the ways of looking at the learning problem is to simply provide the system with an input 𝑋𝑘 ,
and allow it to self-organize its parameters (which are the weights of the network) to generate internal
prototypes of sample vectors.
An unsupervised learning system attempts to represent the entire data set by employing a small number
of prototypical vectors, enough to allow the system to retain a desired level of discrimination between samples.
As new samples continuously buffet the system, the prototypes will be in a state of constant flux. Note that
there is no teaching input. This kind of learning is often called adaptive vector quantization and is essentially
unsupervised in nature.

Given a set of data samples {𝑋𝑖 }, 𝑋𝑖 ∈ ℝ𝑛 , it is possible to identify well-defined "clusters", where
each cluster defines a class of vectors which are similar in some broad sense. Clusters help establish a
classification structure within a data set that has no categories defined in advance. Classes are derived from
clusters by appropriate labelling. The goal of pattern classification is to assign an input pattern to one of a
finite number of classes. Each cluster or class can be characterized by a prototype vector which is a "code
book" vector that best represents members of the cluster.

In unsupervised systems, the weights of the neural network undergo a process of self-organization,
without the intervention of a desired behavioural stimulus, to create clusters of patterns represented by codebook
vectors. Learning in an unsupervised system is often driven by a complex competitive-cooperative process
where individual neurons compete and cooperate with each other to update their weights based on the present
input. Only winning neurons or clusters of neurons learn. Unsupervised learning systems embody complicated
time dynamics. Learning in such systems is usually implemented in the form of differential equations that are
designed to work with information available only locally to a synapse.
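A minimal sketch of this competitive, winner-take-all prototype update (adaptive vector quantization). The one-dimensional data, the number of prototypes, and the learning rate are all illustrative assumptions.

```python
# Sketch of unsupervised prototype learning: the winning prototype, the
# one nearest the present sample, is nudged toward it; only the winner learns.
import random

random.seed(1)
# two obvious 1-D clusters, around 0 and around 10
data = [random.gauss(0, 0.5) for _ in range(50)] + \
       [random.gauss(10, 0.5) for _ in range(50)]
protos = [2.0, 7.0]                 # initial codebook vectors
eta = 0.1

for _ in range(20):                 # passes over the data
    random.shuffle(data)
    for x in data:
        i = min(range(len(protos)), key=lambda j: abs(x - protos[j]))
        protos[i] += eta * (x - protos[i])   # only the winner learns

print([round(p, 1) for p in protos])  # near the cluster centres 0 and 10
```

With a constant learning rate the prototypes never freeze: they keep tracking recent samples, which is the "constant flux" described above.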

1.9 Error Correction and Gradient Descent Rules

Two kinds of supervised learning laws have evolved in the literature that employ such a philosophy:
1. Error correction rules that alter the weights of a network using a linear error measure to reduce the
error in the output generated in response to the present input pattern.
2. Gradient rules that alter the weights of a network during each pattern presentation by employing
gradient information with the objective of reducing the mean squared error (usually averaged over all
training patterns).

1.10 Learning Objective for TLNs

Observe that a TLN is actually a linear neuron whose output is directed into a unit step or signum function.
The neuron is adaptive when its weights are allowed to change in accordance with a well-defined learning
law. Commonly used adaptive algorithms for such threshold logic neurons are the Perceptron learning
algorithm and the least-mean-square (LMS) algorithm.

The neuron input and weight vectors are assumed to be in augmented form to accommodate the neuron
bias as an adjustable weight:

𝑋𝑘 = (1, 𝑥1 , … , 𝑥𝑛 )𝑇 , 𝑊𝑘 = (𝑤0 , 𝑤1 , … , 𝑤𝑛 )𝑇

where the iteration index k has been introduced to explicitly indicate the discrete-time nature of these
vectors: 𝑋𝑘 is the pattern presented at iteration k; 𝑊𝑘 is the neuron weight vector at iteration k. Also note that
𝑥0 = 1 always, and therefore does not carry an iteration index. The neuronal activation 𝑦𝑘 = 𝑋𝑘𝑇 𝑊𝑘
determines whether the neuron fires a 1 or a 0 signal in accordance with the binary threshold neuronal signal
function: S(𝑦𝑘 ) = 1 if 𝑦𝑘 ≥ 0, and S(𝑦𝑘 ) = 0 if 𝑦𝑘 < 0.

To design the weights of a TLN to correctly classify a given set of patterns:

Given a training set T = {(𝑋𝑘 , 𝑑𝑘 )}𝑄𝑘=1 , 𝑋𝑘 ∈ ℝ𝑛 , 𝑑𝑘 ∈ {0, 1}, each pattern 𝑋𝑘 is tagged to one
of two classes 𝐶0 and 𝐶1 , denoted by the desired output 𝑑𝑘 being 0 or 1 respectively. Note that since this is a
two-class classification problem, the training patterns can be divided into two subsets 𝑋0 , 𝑋1 , respectively
comprising the patterns that belong to 𝐶0 and 𝐶1 . We will refer explicitly to the input patterns X = 𝑋0 ∪ 𝑋1
as the training set, rather than using the symbol T, which includes the desired values as well. The two classes
will be identified by the two possible signal states of the TLN: 𝐶0 patterns by a signal S(𝑦𝑘 ) = 0, and 𝐶1
patterns by a signal S(𝑦𝑘 ) = 1.

1.11 Perceptron Learning Algorithm

1.11.1 Design in Weight Space


Consider the situation shown in the figure below, which denotes the current weight vector at iteration k
by 𝑊𝑘 and the pattern presented to the TLN by 𝑋𝑘 . Assume 𝑋𝑘 ∈ 𝑋1 and that 𝑊𝑘𝑇 𝑋𝑘 is erroneously non-positive.
In order to classify 𝑋𝑘 correctly, the weight vector must be shifted to some position 𝑊𝑘+1 where the inner product is
positive. The smallest perturbation in 𝑊𝑘 that produces the desired change is 𝚫, the perpendicular distance from
𝑊𝑘 onto the pattern hyperplane.

The simplest way to change the weights is therefore in direct proportion to the misclassified pattern 𝑋𝑘 . This can be
implemented by adding a fraction of the pattern to the weight vector 𝑊𝑘 if the inner product 𝑊𝑘𝑇 𝑋𝑘 should be increased.
Alternatively, if 𝑋𝑘 ∈ 𝑋0 and 𝑊𝑘𝑇 𝑋𝑘 is erroneously non-negative, a fraction of the pattern is subtracted from the
weight 𝑊𝑘 in order to reduce this inner product. Formally then,

𝑊𝑘+1 = 𝑊𝑘 + 𝜂𝑘 𝑋𝑘 if 𝑋𝑘 ∈ 𝑋1 and 𝑊𝑘𝑇 𝑋𝑘 ≤ 0
𝑊𝑘+1 = 𝑊𝑘 − 𝜂𝑘 𝑋𝑘 if 𝑋𝑘 ∈ 𝑋0 and 𝑊𝑘𝑇 𝑋𝑘 ≥ 0

where k is an iteration index, and 𝜂𝑘 is the learning rate that scales 𝑋𝑘 before it is added to or subtracted from 𝑊𝑘 .
This weight update procedure is referred to as the Perceptron learning law and is applicable only to binary threshold neurons.
The value of 𝜂𝑘 does not vary with the iteration index. Typically, 1 ≤ 𝜂 ≤ 2, and it is common to simply
choose 𝜂 = 1. The choice of 𝜂 does not affect the stability of the perceptron learning algorithm. For initial
weight vectors that are non-zero, a large value of 𝜂 might cause the learning algorithm to converge to a solution
with a large norm: the weight perturbations are then large, and may cause the weight vector to overshoot the
solution region a number of times before it is able to land within it.
For the multi-pattern case, each pattern is presented sequentially, and the neuron's weights are updated
in accordance with the above equation in each iteration in which a misclassification occurs. For each such
weight update, the weight vector moves in weight space in a direction perpendicular to the pattern hyperplane
corresponding to the current input pattern. A single pattern presentation is referred to as an iteration, and
one complete pass through the set of patterns is referred to as an epoch.
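The Perceptron learning law can be sketched as follows, using augmented patterns (x0 = 1) and the logical AND function as an illustrative linearly separable training set.

```python
# Sketch of the Perceptron learning law in augmented form: x0 = 1 carries
# the bias, and a misclassified pattern is added to or subtracted from the
# weight vector depending on its class.

def fires(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

# training set T = {(X_k, d_k)}: augmented patterns for logical AND
T = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]
w, eta = [0.0, 0.0, 0.0], 1.0

changed = True
while changed:                       # one pass over T = one epoch
    changed = False
    for x, d in T:                   # one presentation = one iteration
        s = fires(w, x)
        if s != d:                   # update only on misclassification
            sign = 1 if d == 1 else -1
            w = [wi + sign * eta * xi for wi, xi in zip(w, x)]
            changed = True

print(w, all(fires(w, x) == d for x, d in T))  # a solution vector, True
```

Because AND is linearly separable, the loop is guaranteed to terminate (this is the content of the convergence theorem of Section 1.12).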

1.11.2 A simplification with pattern set adjustment

Consider the training set X = 𝑋0 ∪ 𝑋1 . Vectors of class 𝐶0 are identified with neuronal signal 0, which requires
that each vector 𝑋𝑘 ∈ 𝑋0 generate a negative inner product 𝑋𝑘𝑇 𝑊𝑠 with a solution weight vector 𝑊𝑠 .
Similarly, vectors of class 𝐶1 are identified with neuronal signal 1, which requires that each vector 𝑋𝑘 ∈ 𝑋1
generate a positive inner product 𝑋𝑘𝑇 𝑊𝑠 with 𝑊𝑠 . If every vector of pattern set 𝑋0 were negated, the desired
inner product for correct classification would be positive, since

𝑋𝑘𝑇 𝑊𝑠 < 0 ⟹ (−𝑋𝑘 )𝑇 𝑊𝑠 > 0, 𝑋𝑘 ∈ 𝑋0

The advantage is that instead of having to check two inequalities separately, only one check is needed: a positive
inner product implies correct classification, and a negative or zero inner product implies incorrect classification.
Consider the set 𝑋0′ in which each vector of 𝑋0 is negated. Then, given a weight vector 𝑊𝑘 , for any 𝑋𝑘 ∈
𝑋1 ∪ 𝑋0′ , 𝑋𝑘𝑇 𝑊𝑘 > 0 implies correct classification and 𝑋𝑘𝑇 𝑊𝑘 ≤ 0 implies incorrect classification. X′ = 𝑋0′ ∪ 𝑋1
is called the adjusted training set. The assumption of linear separability guarantees the existence of a solution
weight vector 𝑊𝑠 such that

𝑋𝑘𝑇 𝑊𝑠 > 0 for all 𝑋𝑘 ∈ X′ -------------------- (1.11)

Since 𝑋𝑘 ∈ X′, a misclassification of 𝑋𝑘 will add 𝜂𝑘 𝑋𝑘 to 𝑊𝑘 . Specifically, for 𝑋𝑘 ∈ 𝑋0′ , 𝑋𝑘 is actually the
negative of the original vector, and so the addition of 𝜂𝑘 𝑋𝑘 to 𝑊𝑘 actually amounts to subtraction of the
original vector from 𝑊𝑘 . This is precisely what is required: a misclassified 𝑋0′ -class pattern needs to be
subtracted from the weight vector, since we want to decrease the inner product in this case.
The (k + 1)th weight vector depends only on the kth pattern, the learning rate, and the previous weight vector.
The weight vector 𝑊𝑘 is changed only if 𝑋𝑘𝑇 𝑊𝑘 is equal to zero or has the incorrect sign relative to the category
of 𝑋𝑘 . A sequence of pattern presentations therefore generates a sequence of weight vectors that is completely
determined by the learning rate 𝜂𝑘 , the initial weight vector and the sequence in which the patterns are
presented.
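A sketch of training on the adjusted set X′: class-C0 patterns are negated, so a single inequality governs both the misclassification test and the (now uniformly additive) update. The AND pattern set is again an illustrative choice.

```python
# Sketch of the pattern-set adjustment: negate every class-C0 pattern so
# that correct classification always means a positive inner product, and
# the update becomes uniformly W <- W + eta * X on any misclassification.

X0 = [(1, 0, 0), (1, 0, 1), (1, 1, 0)]    # class C0 (desired output 0)
X1 = [(1, 1, 1)]                          # class C1 (desired output 1)
X_adj = [tuple(-v for v in x) for x in X0] + X1   # adjusted set X'

w, eta = [0.0, 0.0, 0.0], 1.0
changed = True
while changed:
    changed = False
    for x in X_adj:
        ip = sum(wi * xi for wi, xi in zip(w, x))
        if ip <= 0:                        # the single misclassification check
            w = [wi + eta * xi for wi, xi in zip(w, x)]
            changed = True

# every adjusted pattern now yields a positive inner product
print(all(sum(wi * xi for wi, xi in zip(w, x)) > 0 for x in X_adj))  # True
```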


1.12 Perceptron Convergence Theorem


The training problem for a TLN involves searching for a solution weight vector 𝑊𝑠 . This can be done by
presenting each pattern in X′ to the network and updating the weights in accordance with the simplified perceptron
equation, that is, by generating a sequence of weight vectors 𝑊2 , …, 𝑊𝑘+1 in response to some sequence
of pattern presentations 𝑋1 , …, 𝑋𝑘 , and continuing this process until, for some index 𝑘0 , we have 𝑊𝑘0 =
𝑊𝑘0 +1 = 𝑊𝑘0 +2 = … = 𝑊𝑠 . The initial weight vector is arbitrary.

Definition 1.12.1: A training sequence, denoted 𝑆𝑥 , is an infinite sequence of patterns 𝑆𝑥 = 𝑋1 , 𝑋2 , …, 𝑋𝑘 , …
such that each 𝑋𝑘 in 𝑆𝑥 is a member of X′, and every element of X′ occurs infinitely often in 𝑆𝑥 .

Definition 1.12.2: A weight vector sequence, 𝑆𝑤 , is a sequence of weight vectors 𝑆𝑤 = 𝑊1 , 𝑊2 , …, 𝑊𝑘 , …
generated by a training sequence 𝑆𝑥 using the error correction procedure.

Definition 1.12.3: The reduced training sequence, 𝑆̂𝑥 , is the subsequence of 𝑆𝑥 generated by deleting from 𝑆𝑥
all vectors that do not cause any weight perturbations in accordance with Eq. (1.11).

Definition 1.12.4: The reduced weight vector sequence, 𝑆̂𝑤 , is the sequence of weight vectors generated from
𝑆̂𝑥 by application of the error correction rule of Eq. (1.11).
