
03 NEURAL NETWORKS I

Fall 2023 CS54118 Machine Learning


Credits
1. B1: Machine learning: an algorithmic perspective. 2nd Edition, Marsland, Stephen. CRC press,
2015
2. B2: Principles of Soft Computing. 3rd Edition. S. N. Sivanandam, S. N. Deepa. Wiley,
2018.
3. www.d.umn.edu/~alam0026/NeuralNetwork.ppt
4. www.ohio.edu/people/starzykj/network/Class/ee690/.../NeuralNets%20overview.ppt
5. https://www.staff.ncl.ac.uk/peter.andras/annintro.ppt
6. https://tmohammed.files.wordpress.com/2012/03/w1-01-introtonn.ppt
7. http://aass.oru.se/~lilien/ml/seminars/2007_02_01b-Janecek-Perceptron.pdf
8. http://www.cems.uvm.edu/~rsnapp/teaching/cs295ml/notes/perceptron.pdf
9. http://www.atmos.washington.edu/~dennis/MatrixCalculus.pdf
10. https://en.wikipedia.org/wiki/Matrix_calculus
11. https://data-flair.training/blogs/learning-rules-in-neural-network/
Assignment
Read:
B1: Chapter 3.
B2: Chapter 2, 3.

Problems:
B1: 3.1, 3.2, 3.3
B2: Chapter 2, 3: Solved Problems
Neural Networks
 Inspired by how the human brain does analysis
 Neuron – the processing unit of the human brain
 A neuron collects signals from others through a host of fine structures called
dendrites.
 A neuron sends out spikes of electrical activity through a long, thin strand
known as an axon, which splits into thousands of branches.
 At the end of each branch, a structure called a synapse converts the
activity from the axon into electrical effects that inhibit or excite activity in
the connected neurons.
 An estimated 10^11 neurons are present in a human brain.
 Each neuron is connected to thousands of other neurons.
 About 10^14 synapses exist in a human brain.
 Input signals collected through dendrites affect the electrical potential
inside the neuron body – called membrane potential.
 Spiking of neuron happens when this membrane potential crosses a certain
threshold value.
 After firing, the neuron must wait for some time to recover its energy (the
refractory period) before it can fire again.
 Each neuron can be seen as a separate processor doing a simple task:
whether to fire or not to fire.
 The brain is a massively parallel supercomputer with 10^11 processing elements
and dense interconnection.
 Learning in the brain is based on the principle of plasticity:
 Modifying the strength of synaptic connections between neurons, and creating
new connections.
McCulloch and Pitts Neuron Model

 Set of weighted inputs 𝒙𝒊 , 𝒘𝒊 that correspond to the synapses


 Adder that sums the input signals (equivalent to the membrane of the cell
that collects electrical charge)
 Activation function (initially a threshold function) that decides whether
the neuron fires (‘spikes’) for the current inputs
Analogy
 𝑥𝑖 = 1 if the connected input neuron fired, 0 if it did not; an
intermediate value (e.g., 0.5) can be taken as something in between.
 𝑤𝑖 denotes the strength of the synaptic connection.
 The input signal is proportional to the strength of the synaptic weight, so we
compute

ℎ = ∑_{𝑖=1}^{𝑚} 𝑤𝑖 𝑥𝑖
Analogy
 𝜃 is the threshold (“membrane threshold”)

 A simple model, which has limitations

 Incapable of emulating all the behaviors of real biological neurons


 A network of such neurons (Neural Network) can model whatever a computer
can do
 Neurons will be updated sequentially (based on a clock)
 Weights can be positive (excitatory connections) or negative (inhibitory
connections)
 Inputs can also be negative or positive
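The McCulloch and Pitts model above can be sketched in a few lines of Python. This is a minimal illustration; the weights and threshold below are made-up values, not taken from the text.

```python
# Minimal McCulloch-Pitts neuron: an adder followed by a threshold
# activation function. Weights and threshold here are illustrative.

def mcp_neuron(x, w, theta):
    """Return 1 (fire) if the weighted input sum exceeds theta, else 0."""
    h = sum(wi * xi for wi, xi in zip(w, x))  # adder: h = sum of w_i * x_i
    return 1 if h > theta else 0              # threshold activation

# With two excitatory weights and theta = 1.5 the neuron acts like AND:
print(mcp_neuron([1, 1], [1.0, 1.0], 1.5))  # -> 1 (fires)
print(mcp_neuron([1, 0], [1.0, 1.0], 1.5))  # -> 0 (does not fire)
```

Inhibitory connections correspond to negative entries in `w`.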
How does a Neuron learn?

 Inputs cannot change

 Only the weights and threshold function can change

 Learning in a neural network:

How to change the weights and threshold functions of the neurons so that the
neural network gives the correct output
The Perceptron

 A layer of McCulloch and Pitts neurons connected to the inputs by
weighted connections (the adder is not explicitly shown)
 There can be m inputs and n outputs
 𝑚 ≠ 𝑛 or 𝑚 = 𝑛
 𝑤𝑖𝑗 represents the weight given to the signal value from the 𝑖-th input to the
𝑗-th neuron, 1 ≤ 𝑖 ≤ 𝑚 and 1 ≤ 𝑗 ≤ 𝑛
Learning Rules in Neural Networks
 Perceptron learning rule
 Hebbian Learning Rule
 Delta learning rule or Widrow-Hoff rule
Perceptron Learning Rule
 Supervised Learning Approach
 The modification to the synaptic weight of a node is equal to the
product of the error and the input:
𝑤𝑖𝑗 ← 𝑤𝑖𝑗 + 𝜂(𝑡𝑗 − 𝑦𝑗 ) ∙ 𝑥𝑖
or, 𝑤𝑖𝑗 ← 𝑤𝑖𝑗 − 𝜂(𝑦𝑗 − 𝑡𝑗 ) ∙ 𝑥𝑖
where
𝑦𝑗 : actual output at 𝑗𝑡ℎ neuron
𝑡𝑗 : target output corresponding to 𝑗𝑡ℎ neuron
𝜂: learning rate
 Input 𝑥𝑖 , target 𝑡𝑗 and adder output 𝑦𝑗 are beyond our control
 𝑤𝑖𝑗 and 𝜂 are what we can change
 High value of 𝜂: learning will be too fast (dramatic) and the system will
never stabilize
 Low value of 𝜂: learning will be too slow – system will have to see the input
too many times before it learns, but it will be resistant to noise.
 Ideally 0.1 < 𝜂 < 0.4
[Figure: error vs. weights – a high value of 𝜂 overshoots and oscillates, a low value of 𝜂 descends slowly]
Bias Input
 What if all inputs are zero and we want one or more neurons to fire?
 Solution:
 Introduce a non-zero (say −1) “bias” input indexed at 0
 Introduce weights 𝑤0𝑗 : the weight of the bias input to the 𝑗-th neuron
 Intuitively, bias also represents the bias in the human mind
Perceptron Algorithm
Simulating OR output

Take 𝑤0 = −0.05, 𝑤1 = −0.02, 𝑤2 = 0.02, 𝜂 = 0.25, with bias input 𝑥0 = −1.

OR truth table:
𝑥1 𝑥2 𝑡
0 0 0
0 1 1
1 0 1
1 1 1

Let us iterate, using 𝑤𝑖 ← 𝑤𝑖 − 𝜂(𝑦 − 𝑡) × 𝑥𝑖

For 𝑥1 = 0, 𝑥2 = 0 (𝑡 = 0):
∑ 𝑤𝑖 𝑥𝑖 = 𝑤0 𝑥0 + 𝑤1 𝑥1 + 𝑤2 𝑥2 = (−.05)(−1) + (−.02)(0) + (.02)(0)
= .05 > 0; 𝑦 = 1; 𝑦 ≠ 𝑡 ⇒ update required
𝑤0 ← 𝑤0 − 𝜂(𝑦 − 𝑡) × 𝑥0 = −.05 − .25(1 − 0)(−1) = .20
𝑤1 ← 𝑤1 − 𝜂(𝑦 − 𝑡) × 𝑥1 = −.02 − .25(1 − 0)(0) = −.02 (no update)
𝑤2 ← 𝑤2 − 𝜂(𝑦 − 𝑡) × 𝑥2 = .02 − .25(1 − 0)(0) = .02 (no update)
For 𝑥1 = 0, 𝑥2 = 1 (𝑡 = 1), with 𝑤0 = .2, 𝑤1 = −.02, 𝑤2 = .02:
∑ 𝑤𝑖 𝑥𝑖 = (.2)(−1) + (−.02)(0) + (.02)(1)
= −.18 < 0; 𝑦 = 0; 𝑦 ≠ 𝑡 ⇒ update required
𝑤0 ← .2 − .25(0 − 1)(−1) = −.05
𝑤1 ← −.02 − .25(0 − 1)(0) = −.02 (no update)
𝑤2 ← .02 − .25(0 − 1)(1) = .27
For 𝑥1 = 1, 𝑥2 = 0 (𝑡 = 1), with 𝑤0 = −.05, 𝑤1 = −.02, 𝑤2 = .27:
∑ 𝑤𝑖 𝑥𝑖 = (−.05)(−1) + (−.02)(1) + (.27)(0)
= .03 > 0; 𝑦 = 1; 𝑦 = 𝑡 ⇒ no update required
For 𝑥1 = 1, 𝑥2 = 1 (𝑡 = 1):
∑ 𝑤𝑖 𝑥𝑖 = (−.05)(−1) + (−.02)(1) + (.27)(1)
= .3 > 0; 𝑦 = 1; 𝑦 = 𝑡 ⇒ no update required
 This completes one epoch!
 We will repeat the same process till an epoch has no weight updates.
 Then all input samples are correctly classified.
 Left as an exercise to complete the training!
Algorithmic complexity?
 T: #iterations
 m: #inputs

 n: #outputs

 k: #samples

 Ο(𝑇𝑚𝑛𝑘)
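The iteration worked through above can be condensed into a short training loop. This is a sketch that follows the slides' update rule 𝑤𝑖 ← 𝑤𝑖 − 𝜂(𝑦 − 𝑡)𝑥𝑖 with bias input 𝑥0 = −1 and the same initial weights; the epoch cap is an arbitrary safeguard.

```python
# Perceptron training on the OR data, single neuron, bias input x0 = -1.
# Initial weights and eta match the worked example in the slides.

def train_perceptron(samples, w, eta=0.25, max_epochs=20):
    for _ in range(max_epochs):
        updated = False
        for x, t in samples:
            xb = [-1] + list(x)                       # prepend bias input x0 = -1
            y = 1 if sum(wi * xi for wi, xi in zip(w, xb)) > 0 else 0
            if y != t:                                # misclassified: apply update
                w = [wi - eta * (y - t) * xi for wi, xi in zip(w, xb)]
                updated = True
        if not updated:                               # an epoch with no updates: done
            break
    return w

or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train_perceptron(or_data, [-0.05, -0.02, 0.02])
# After training, every OR sample is classified correctly.
```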
B1 vs. B2:
 B1: 𝑤𝑛𝑒𝑤 ← 𝑤𝑜𝑙𝑑 − 𝜂(𝑦 − 𝑡)𝑥
 Assuming binary data, using the Perceptron Rule
 B2: 𝑤𝑛𝑒𝑤 ← 𝑤𝑜𝑙𝑑 + 𝛼𝑡𝑥
 Assuming bipolar data, using the Perceptron Rule
Hebb’s Rule
 Donald Hebb in 1949
 Changes in the strength of synaptic connections are proportional to the
correlation in the firing of the two connecting neurons.
 If two neighboring neurons activate and deactivate at the same time, then the
weight connecting these neurons should increase.
 For neurons operating in opposite phases, the weight between them should
decrease.
 If there is no signal correlation, the weight should not change / the connection
should die away.
Δ𝑤𝑖𝑗 = 𝑥𝑖 × 𝑦𝑗 ; 𝑥𝑖 , 𝑦𝑗 ∈ {−1, 1}
Generally, the activation function is the linear identity function, 𝑡𝑗 = 𝑓(𝑦𝑗 ) = 𝑦𝑗
 At the start, values of all weights are set to zero
 Unsupervised learning rule
 Target values are not used
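A sketch of the Hebbian update for a single output neuron with bipolar values; the two training pairs below are made up for illustration.

```python
# Hebbian learning sketch: delta w_i = x_i * y, bipolar values in {-1, +1},
# weights starting at zero as in the slides. No targets are used.

def hebb_train(samples):
    """samples: list of (inputs, output) pairs with bipolar entries."""
    n = len(samples[0][0])
    w = [0.0] * n                                    # all weights start at zero
    for x, y in samples:
        w = [wi + xi * y for wi, xi in zip(w, x)]    # correlated firing strengthens
    return w

# The second input is perfectly correlated with the output, the first is not:
print(hebb_train([((1, 1), 1), ((1, -1), -1)]))  # -> [0.0, 2.0]
```

The correlated input ends up with a large positive weight, while the uncorrelated one stays at zero.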
Delta learning rule
 Similar to perceptron rule, but
 Based on minimization of LMS (Least Mean Square) error using Gradient
Descent Technique
 Works for differentiable activation functions (e.g., linear) vs. the step
function in perceptron rule
 Perceptron rule is guaranteed to converge if the data is linearly separable,
but the gradient-descent approach continues forever, converging only
asymptotically to the solution (tries to minimize error in case of inseparable
data).
 We will stick to the perceptron rule from now on, and will discuss Gradient
Descent later
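For contrast, here is a sketch of the delta rule with a linear (identity) activation, descending the squared error one sample at a time. The data (targets following y = 2x) and the learning rate are illustrative choices.

```python
# Delta-rule sketch: a linear neuron trained by gradient descent on the
# squared error, w_i <- w_i - eta*(y - t)*x_i, where y is the linear
# (differentiable) activation rather than a step function.

def delta_rule(samples, w, eta=0.1, epochs=100):
    for _ in range(epochs):
        for x, t in samples:
            y = sum(wi * xi for wi, xi in zip(w, x))          # linear activation
            w = [wi - eta * (y - t) * xi for wi, xi in zip(w, x)]
    return w

# Learn y = 2*x from noiseless samples; the weight converges toward 2.0,
# only asymptotically, as the text notes.
w = delta_rule([((1,), 2), ((2,), 4), ((3,), 6)], [0.0])
print(round(w[0], 3))  # -> 2.0
```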
Batch Mode Learning
 Let the input dataset have 𝑠 samples, each sample with 𝑚 inputs (i.e., 𝑚
dimensions), and let there be 𝑛 neurons
 The input dataset 𝑥 is an (𝑠 × 𝑚) matrix
 𝑦 and 𝑡 are each an (𝑠 × 𝑛) matrix
 𝑤 is an (𝑚 × 𝑛) matrix
Algorithm
 For 𝑃 iterations do:
 Predict 𝑦 for all 𝑠 input samples
 Update 𝑤 for the combined effect of all 𝑠 input samples, i.e.,
𝑤𝑖𝑗 ← 𝑤𝑖𝑗 − 𝜂 ∑_{𝑘=1}^{𝑠} 𝑥𝑘𝑖 (𝑦𝑘𝑗 − 𝑡𝑘𝑗 ) = 𝑤𝑖𝑗 − 𝜂 ∑_{𝑘=1}^{𝑠} 𝑥ᵀ𝑖𝑘 (𝑦𝑘𝑗 − 𝑡𝑘𝑗 )
In matrix form: 𝑤 ← 𝑤 − 𝜂𝑥ᵀ(𝑦 − 𝑡)
 Batch mode seems to be often better (than updating at every sample)

[Figure: error vs. weights under batch updates]
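A batch-mode sketch in NumPy for the OR data, applying 𝑤 ← 𝑤 − 𝜂𝑥ᵀ(𝑦 − 𝑡) once per epoch; the initial weights match the worked example, and the epoch cap is arbitrary.

```python
import numpy as np

# Batch-mode perceptron on OR: predict for all s samples at once, then
# apply one combined update per epoch. Bias input x0 = -1 per the slides.

x = np.array([[-1, 0, 0], [-1, 0, 1], [-1, 1, 0], [-1, 1, 1]], dtype=float)  # (s, m)
t = np.array([[0.0], [1.0], [1.0], [1.0]])                                   # (s, n)
w = np.array([[-0.05], [-0.02], [0.02]])                                     # (m, n)
eta = 0.25

for _ in range(20):                          # epochs
    y = (x @ w > 0).astype(float)            # predictions for all samples
    if np.array_equal(y, t):                 # everything classified: stop
        break
    w = w - eta * x.T @ (y - t)              # one accumulated update per epoch
```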
Take 𝑤0 = −0.05, 𝑤1 = −0.02, 𝑤2 = 0.02, 𝜂 = 0.25, with bias input 𝑥0 = −1.

Let us iterate in batch mode, accumulating
𝑤𝑖 ← 𝑤𝑖 − 𝜂 ∑_{𝑘=1}^{4} 𝑥𝑘𝑖 (𝑦𝑘 − 𝑡𝑘 )

OR truth table:
𝑥1 𝑥2 𝑡
0 0 0
0 1 1
1 0 1
1 1 1

For 𝑥1 = 0, 𝑥2 = 0 (𝑡 = 0):
∑ 𝑤𝑖 𝑥𝑖 = (−.05)(−1) + (−.02)(0) + (.02)(0)
= .05 > 0; 𝑦 = 1; 𝑦 ≠ 𝑡 ⇒ update needs to be accumulated:
𝑤0 ← −.05 − .25(1 − 0)(−1)
𝑤1 ← −.02 − .25(1 − 0)(0)
𝑤2 ← .02 − .25(1 − 0)(0)
Updates accumulated, but not applied.
For 𝑥1 = 0, 𝑥2 = 1 (𝑡 = 1):
∑ 𝑤𝑖 𝑥𝑖 = (−.05)(−1) + (−.02)(0) + (.02)(1)
= .07 > 0; 𝑦 = 1; 𝑦 = 𝑡 ⇒ no update accumulation required.
The accumulated expressions stay as they are:
𝑤0 ← −.05 − .25(1 − 0)(−1) + 0
𝑤1 ← −.02 − .25(1 − 0)(0) + 0
𝑤2 ← .02 − .25(1 − 0)(0) + 0
For 𝑥1 = 1, 𝑥2 = 0 (𝑡 = 1):
∑ 𝑤𝑖 𝑥𝑖 = (−.05)(−1) + (−.02)(1) + (.02)(0)
= .03 > 0; 𝑦 = 1; 𝑦 = 𝑡 ⇒ no update accumulation required.
The accumulated expressions stay as they are:
𝑤0 ← −.05 − .25(1 − 0)(−1) + 0 + 0
𝑤1 ← −.02 − .25(1 − 0)(0) + 0 + 0
𝑤2 ← .02 − .25(1 − 0)(0) + 0 + 0
For 𝑥1 = 1, 𝑥2 = 1 (𝑡 = 1):
∑ 𝑤𝑖 𝑥𝑖 = (−.05)(−1) + (−.02)(1) + (.02)(1)
= .05 > 0; 𝑦 = 1; 𝑦 = 𝑡 ⇒ no update accumulation required.
The current epoch is over; the accumulated updates can be applied:
𝑤0 = .2
𝑤1 = −.02
𝑤2 = .02
 For epoch 2, the updated weights from the previous epoch (𝑤0 = .2,
𝑤1 = −.02, 𝑤2 = .02) are considered
 We will repeat the same process till, in an epoch, all input samples are
correctly classified
 Left as an exercise to complete the training!
Decision Boundary for OR function

 The perceptron tries to find a straight line (in 2D; a plane in 3D, and a
hyperplane in higher dimensions) – called the decision boundary.
 What is the decision boundary and how is it a line? (for the 2D case)

Activation value (say):
∑_{𝑖=0}^{𝑚} 𝑤𝑖𝑗 𝑥𝑖 = 𝑥 ⋅ 𝑤𝑗
where 𝑤𝑗 is the column vector of weights corresponding to the 𝑗-th neuron.
The 𝑗-th neuron fires if 𝑥 ⋅ 𝑤𝑗 > 0 and does not fire otherwise.
So, the 𝑗-th neuron acts as a two-class classifier:
Class I: 𝑥 ⋅ 𝑤𝑗 > 0
Class II: 𝑥 ⋅ 𝑤𝑗 ≤ 0
𝑥 ⋅ 𝑤𝑗 = 0 can be considered the decision boundary for the 𝑗-th neuron.
For the 2-D OR case with 1 neuron, this becomes:
𝑥0 𝑤0 + 𝑥1 𝑤1 + 𝑥2 𝑤2 = 0
−𝑤0 + 𝑥1 𝑤1 + 𝑥2 𝑤2 = 0 (−𝑤0 corresponds to the bias, since 𝑥0 = −1)
The above is the equation of a straight line.
Class I: −𝑤0 + 𝑥1 𝑤1 + 𝑥2 𝑤2 > 0
Class II: −𝑤0 + 𝑥1 𝑤1 + 𝑥2 𝑤2 ≤ 0
Another Perspective
 Let 𝑥(1) = (−1, 𝑥1(1), 𝑥2(1)) and 𝑥(2) = (−1, 𝑥1(2), 𝑥2(2)) be two points on the
decision boundary. Then
𝑥(1) ⋅ 𝑤𝑗 = 0 and 𝑥(2) ⋅ 𝑤𝑗 = 0, i.e.,
(𝑥(1) − 𝑥(2)) ⋅ 𝑤𝑗 = 0
That is, the vector 𝑤𝑗 is perpendicular to 𝑥(1) − 𝑥(2), and this holds
for any two points 𝑥(1) and 𝑥(2) on the decision boundary.
Hence, the decision boundary is a line and 𝑤𝑗 is a vector perpendicular to it.
The decision boundary is a line in the 2D case, a plane in the 3D case, and a
hyperplane in higher dimensions.
Take 𝑤0 = 0.01, 𝑤1 = 0.02, 𝑤2 = 0.02 (bias input 𝑥0 = −1).
The decision boundary is:
−𝑤0 + 𝑤1 𝑥1 + 𝑤2 𝑥2 = 0
⇒ −.01 + .02𝑥1 + .02𝑥2 = 0
 𝑥1 -axis intercept: (.5, 0)
 𝑥2 -axis intercept: (0, .5)

[Plot: the boundary line through (.5, 0) and (0, .5), separating input (0, 0) from the other three OR inputs]
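The intercepts and the classification can be checked numerically; this sketch uses the weights from the slide and the Class I/II rule above.

```python
# Decision-boundary check for the slide's weights: the line
# -w0 + w1*x1 + w2*x2 = 0 with w0 = 0.01, w1 = w2 = 0.02 (bias x0 = -1).

w0, w1, w2 = 0.01, 0.02, 0.02

def fires(x1, x2):
    """Class I (fire) iff -w0 + w1*x1 + w2*x2 > 0."""
    return -w0 + w1 * x1 + w2 * x2 > 0

# Axis intercepts of the boundary: x1 = w0/w1 when x2 = 0, and vice versa.
print(w0 / w1, w0 / w2)                                     # -> 0.5 0.5
# All four OR inputs are classified correctly by this boundary:
print([fires(x1, x2) for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# -> [False, True, True, True]
```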
Convergence Theorem
 If the data is linearly separable, the fixed-increment perceptron
algorithm terminates after a finite number of weight updates.
 Proof taken from the slides by Prof. Robert Snapp, Department of
Computer Science, University of Vermont, Vermont, USA as part of his
course CS 295: Machine Learning
Proof of Convergence Theorem
 Consider a single neuron.
 Let 𝑤 represent the weight vector.
 Let 𝑥𝑖 represent the 𝑖-th sample vector.
 Let 𝑡𝑖 represent the target label of the 𝑖-th sample, 𝑡𝑖 ∈ {0, 1}.
 Let the activation function be:
𝑦𝑖 = 1 if 𝑥𝑖 𝑤 > 0, and 𝑦𝑖 = 0 if 𝑥𝑖 𝑤 ≤ 0
 Update rule
𝑤 = 𝑤 − 𝜂𝑥𝑖𝑇 (𝑦𝑖 − 𝑡𝑖 )
 Let
𝑙𝑖 = −1 if 𝑡𝑖 = 0, and 𝑙𝑖 = +1 if 𝑡𝑖 = 1
 Then, the update rule becomes
𝑤 ← 𝑤 − 𝜂𝑥𝑖ᵀ(𝑦𝑖 − 𝑡𝑖 )
⇒ 𝑤 += 𝜂𝑥𝑖ᵀ𝑙𝑖
 Let 𝑤* represent a solution that separates the given data.
 Let 𝑥̂𝑖 = 𝑥𝑖 𝑙𝑖
 Then,
𝑥̂𝑖 𝑤* > 0, ∀𝑖
 And, the weight update becomes
𝑤 += 𝜂𝑥̂𝑖ᵀ
 Let 𝑤(𝑘) represent the weight vector after the 𝑘-th update.
 Let 𝑥̂(𝑘) represent the input sample that triggered the 𝑘-th update.
 Thus,
𝑤(1) = 𝑤(0) + 𝜂𝑥̂ᵀ(1)
𝑤(2) = 𝑤(1) + 𝜂𝑥̂ᵀ(2)
⋮
𝑤(𝑘) = 𝑤(𝑘 − 1) + 𝜂𝑥̂ᵀ(𝑘)
 We shall prove
𝐴𝑘² ≤ ‖𝑤(𝑘) − 𝑤(0)‖² ≤ 𝐵𝑘
for constants 𝐴 and 𝐵.
Thus, the network must converge after no more than 𝑘max = 𝐵/𝐴 updates.
Cauchy-Schwarz Inequality
Let 𝑎, 𝑏 ∈ ℝⁿ. Then
‖𝑎‖² ‖𝑏‖² ≥ (𝑎ᵀ𝑏)²
Proof: Lower Bound
𝑤(1) = 𝑤(0) + 𝜂𝑥̂ᵀ(1)
𝑤(2) = 𝑤(1) + 𝜂𝑥̂ᵀ(2)
⋮
𝑤(𝑘) = 𝑤(𝑘 − 1) + 𝜂𝑥̂ᵀ(𝑘)
Adding the above 𝑘 equations yields
𝑤(𝑘) = 𝑤(0) + 𝜂(𝑥̂ᵀ(1) + 𝑥̂ᵀ(2) + ⋯ + 𝑥̂ᵀ(𝑘))
𝑤(𝑘) − 𝑤(0) = 𝜂(𝑥̂ᵀ(1) + 𝑥̂ᵀ(2) + ⋯ + 𝑥̂ᵀ(𝑘))
Multiplying both sides by the solution 𝑤*ᵀ:
𝑤*ᵀ(𝑤(𝑘) − 𝑤(0)) = 𝜂𝑤*ᵀ(𝑥̂ᵀ(1) + 𝑥̂ᵀ(2) + ⋯ + 𝑥̂ᵀ(𝑘))
Let
𝑎 = min over all 𝑥̂ of 𝑤*ᵀ𝑥̂ᵀ > 0
Thus,
𝑤*ᵀ(𝑤(𝑘) − 𝑤(0)) ≥ 𝜂𝑎𝑘 > 0
Squaring both sides and applying the Cauchy-Schwarz inequality yields
‖𝑤*‖² ‖𝑤(𝑘) − 𝑤(0)‖² ≥ (𝑤*ᵀ(𝑤(𝑘) − 𝑤(0)))² ≥ (𝜂𝑎𝑘)²
Thus,
‖𝑤(𝑘) − 𝑤(0)‖² ≥ (𝜂𝑎/‖𝑤*‖)² 𝑘²
This gives the lower bound.
Proof: Upper Bound
𝑤(1) = 𝑤(0) + 𝜂𝑥̂ᵀ(1)
𝑤(2) = 𝑤(1) + 𝜂𝑥̂ᵀ(2)
⋮
𝑤(𝑘) = 𝑤(𝑘 − 1) + 𝜂𝑥̂ᵀ(𝑘)
 Subtracting 𝑤(0) from both sides yields
𝑤(1) − 𝑤(0) = 𝜂𝑥̂ᵀ(1)
𝑤(2) − 𝑤(0) = (𝑤(1) − 𝑤(0)) + 𝜂𝑥̂ᵀ(2)
⋮
𝑤(𝑘) − 𝑤(0) = (𝑤(𝑘 − 1) − 𝑤(0)) + 𝜂𝑥̂ᵀ(𝑘)
Taking squared norms on both sides yields
‖𝑤(1) − 𝑤(0)‖² = 𝜂²‖𝑥̂ᵀ(1)‖²
‖𝑤(2) − 𝑤(0)‖² = ‖𝑤(1) − 𝑤(0)‖² + 2𝜂(𝑤(1) − 𝑤(0))ᵀ𝑥̂ᵀ(2) + 𝜂²‖𝑥̂ᵀ(2)‖²
⋮
‖𝑤(𝑘) − 𝑤(0)‖² = ‖𝑤(𝑘 − 1) − 𝑤(0)‖² + 2𝜂(𝑤(𝑘 − 1) − 𝑤(0))ᵀ𝑥̂ᵀ(𝑘) + 𝜂²‖𝑥̂ᵀ(𝑘)‖²
 Since 𝑥̂ᵀ(1) triggered an update, it must have been misclassified by the weight
vector 𝑤(0), i.e., 𝑤(0)ᵀ𝑥̂ᵀ(1) < 0
Similarly,
𝑤(𝑗 − 1)ᵀ𝑥̂ᵀ(𝑗) < 0, for 𝑗 = 1, 2, …, 𝑘
Expanding (𝑤(𝑗 − 1) − 𝑤(0))ᵀ𝑥̂ᵀ(𝑗) and dropping the negative term
𝑤(𝑗 − 1)ᵀ𝑥̂ᵀ(𝑗) gives
‖𝑤(1) − 𝑤(0)‖² = 𝜂²‖𝑥̂ᵀ(1)‖²
‖𝑤(2) − 𝑤(0)‖² ≤ ‖𝑤(1) − 𝑤(0)‖² − 2𝜂𝑤(0)ᵀ𝑥̂ᵀ(2) + 𝜂²‖𝑥̂ᵀ(2)‖²
⋮
‖𝑤(𝑘) − 𝑤(0)‖² ≤ ‖𝑤(𝑘 − 1) − 𝑤(0)‖² − 2𝜂𝑤(0)ᵀ𝑥̂ᵀ(𝑘) + 𝜂²‖𝑥̂ᵀ(𝑘)‖²
Summing the 𝑘 inequalities yields
‖𝑤(𝑘) − 𝑤(0)‖² ≤ 𝜂²(‖𝑥̂ᵀ(1)‖² + ‖𝑥̂ᵀ(2)‖² + ⋯ + ‖𝑥̂ᵀ(𝑘)‖²)
− 2𝜂𝑤(0)ᵀ(𝑥̂ᵀ(2) + ⋯ + 𝑥̂ᵀ(𝑘))
Define
𝑀 = max over all 𝑥̂ of ‖𝑥̂ᵀ‖²
𝜇 = 2 min over all 𝑥̂ of 𝑤(0)ᵀ𝑥̂ᵀ < 0 (misclassifications)
The inequality above becomes
‖𝑤(𝑘) − 𝑤(0)‖² ≤ (𝜂²𝑀 − 𝜂𝜇)𝑘
 Hence, we have shown
𝐴𝑘² ≤ ‖𝑤(𝑘) − 𝑤(0)‖² ≤ 𝐵𝑘
with 𝐴 = (𝜂𝑎/‖𝑤*‖)² and 𝐵 = 𝜂²𝑀 − 𝜂𝜇
Thus,
𝑘max = 𝐵/𝐴 = (𝜂𝑀 − 𝜇)‖𝑤*‖² / (𝜂𝑎²)
LINEAR SEPARABILITY
 A straight line decision boundary may not always exist
 Linearly separable cases – when a straight (linear) decision boundary
is possible
Multiple Neurons May Help!
XOR Function – Linearly Inseparable
XOR – separable in 3D

Added Dimension
 It is always possible to separate out two classes with a linear function,
provided that you project the data into the correct set of dimensions.
 Kernel classifiers – basis of Support Vector Machines
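A quick illustration of the projection idea: adding the product feature x3 = x1·x2 makes XOR linearly separable in 3D. The separating weights below are hand-picked for illustration, not learned.

```python
# XOR is not linearly separable in 2D, but with the added dimension
# x3 = x1*x2 a single plane separates the two classes.

xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def classify_3d(x1, x2):
    x3 = x1 * x2                                # the added (product) dimension
    # hand-picked plane: x1 + x2 - 2*x3 - 0.5 = 0
    return 1 if x1 + x2 - 2 * x3 - 0.5 > 0 else 0

print(all(classify_3d(x1, x2) == t for (x1, x2), t in xor_data))  # -> True
```

This is the same trick, in miniature, that kernel classifiers exploit.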
Data Normalization/Standardization
 Scaling input data to lie between (−1, +1)
 Additionally, giving it zero mean and unit variance is a little better, as it does
not allow outliers to dominate as much:
𝑥 ← (𝑥 − 𝜇)/𝜎
 Partitioning data based on range into integral values
 Choosing a subset of features can improve accuracy
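A standardization sketch in NumPy, applying 𝑥 ← (𝑥 − 𝜇)/𝜎 per feature column; the data matrix below is made up for illustration.

```python
import numpy as np

# Standardize each feature (column) to zero mean and unit variance.

data = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # 3 samples, 2 features
mu = data.mean(axis=0)                    # per-column mean
sigma = data.std(axis=0)                  # per-column standard deviation
standardized = (data - mu) / sigma        # x <- (x - mu) / sigma

print(standardized.mean(axis=0))          # ~ [0. 0.]
print(standardized.std(axis=0))           # -> [1. 1.]
```

Note the two features start on very different scales; after standardization neither dominates.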
LINEAR REGRESSION
 Classification: find a line that separates out the classes
 Regression: fit a line to data
 Classification as an instance of Regression:
1. Fit a line to target data
2. Do regression for each class separately, i.e., fit a line to the data points of
each class separately
 In linear regression, we are computing lines (in 2D) that can predict
target values closely, i.e., 𝑦 = 𝛽1 𝑥 + 𝛽0
 General form:
𝑦 = ∑_{𝑖=0}^{𝑀} 𝛽𝑖 𝑥𝑖
where 𝑀 is the number of dimensions of an input vector, and
𝛽 = (𝛽0 , 𝛽1 , … , 𝛽𝑀 ) defines a line in 2-D, a plane in 3-D, and a hyperplane
in higher dimensions.
Linear regression in two and three dimensions
 How do we define the line/plane/hyperplane that best fits the data?
 Minimize the distance between the line and the data points.
 Least-squares optimization: minimize
∑_{𝑛=1}^{𝑁} (𝑡𝑛 − ∑_{𝑗=0}^{𝑀} 𝛽𝑗 𝑋𝑛𝑗 )²
where
𝑁: #data points
𝑀: #dimensions of an input vector

In matrix form, the above can be written as


𝑡 − 𝑋𝛽 𝑇 (𝑡 − 𝑋𝛽)
Where,
𝑡 is an (𝑁 × 1) vector containing target values
𝑋 is an (𝑁 × 𝑀) matrix denoting input values (including bias)
𝑋𝑖𝑗 : denotes value of 𝑗𝑡ℎ dimension of 𝑖 𝑡ℎ input vector
𝛽 is an (𝑀 × 1) vector defining the hyperplane.
To minimize the least-squares error:
𝑑((𝑡 − 𝑋𝛽)ᵀ(𝑡 − 𝑋𝛽))/𝑑𝛽 = 0
𝑑((𝑡ᵀ − 𝛽ᵀ𝑋ᵀ)(𝑡 − 𝑋𝛽))/𝑑𝛽 = 0
𝑑(𝑡ᵀ𝑡)/𝑑𝛽 − 𝑑(𝑡ᵀ𝑋𝛽)/𝑑𝛽 − 𝑑(𝛽ᵀ𝑋ᵀ𝑡)/𝑑𝛽 + 𝑑(𝛽ᵀ𝑋ᵀ𝑋𝛽)/𝑑𝛽 = 0
0 − 𝑡ᵀ𝑋 − 𝑡ᵀ𝑋 + 𝛽ᵀ(𝑋ᵀ𝑋 + (𝑋ᵀ𝑋)ᵀ) = −2𝑡ᵀ𝑋 + 2𝛽ᵀ𝑋ᵀ𝑋 = 0
𝛽ᵀ𝑋ᵀ𝑋 − 𝑡ᵀ𝑋 = 0
𝑋ᵀ(𝑋𝛽 − 𝑡) = 0
Hence, 𝛽 = (𝑋ᵀ𝑋)⁻¹𝑋ᵀ𝑡 [assuming (𝑋ᵀ𝑋)⁻¹ exists]
 The following links may be helpful in finding matrix calculus identities
used in the previous proof:
 https://en.wikipedia.org/wiki/Matrix_calculus

 https://en.wikipedia.org/wiki/Matrix_calculus#Vector-by-vector
 http://www.math.nyu.edu/~neylon/linalgfall04/project1/dj/proptranspose.htm
 Fill in the details of the proof (left as a homework assignment)
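The closed form 𝛽 = (𝑋ᵀ𝑋)⁻¹𝑋ᵀ𝑡 can be checked numerically on the OR targets; a bias column of −1 is assumed here, matching the perceptron convention used earlier (the fitted values do not depend on that sign choice).

```python
import numpy as np

# Least-squares fit via the normal equations, beta = (X^T X)^{-1} X^T t.
# X holds a bias column (x0 = -1) plus the inputs x1, x2; t holds OR targets.

X = np.array([[-1, 0, 0], [-1, 0, 1], [-1, 1, 0], [-1, 1, 1]], dtype=float)
t = np.array([0.0, 1.0, 1.0, 1.0])

beta = np.linalg.solve(X.T @ X, X.T @ t)   # solve X^T X beta = X^T t
y = X @ beta                               # fitted values
print(np.round(y, 2))                      # -> [0.25 0.75 0.75 1.25]
```

These fitted values match the OR column of the table on the next slide.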
Linear Regression for OR, AND, and XOR

Inputs      OR           AND           XOR
x1  x2    t    y       t     y       t    y
0   0     0   0.25     0   -0.25     0   0.5
0   1     1   0.75     0    0.25     1   0.5
1   0     1   0.75     0    0.25     1   0.5
1   1     1   1.25     1    0.75     0   0.5
[Plot: t vs. y for OR – y = 0.25, 0.75, 0.75, 1.25]
[Plot: t vs. y for AND – y = −0.25, 0.25, 0.25, 0.75]
[Plot: t vs. y for XOR – y = 0.5 for every input]
Miscellaneous Topics
Adaline: Adaptive Linear Neuron
A single linear unit that uses the input to the activation function (the activation
potential) for calculating error, rather than the output of the activation function.
Update Rule:
𝑤𝑖 ← 𝑤𝑖 − 𝜂(𝑦𝑖𝑛 − 𝑡) ∙ 𝑥𝑖
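A sketch of Adaline training on bipolar OR; the data, the bias convention x0 = +1, the learning rate, and the epoch count are illustrative choices. Note the error uses the activation potential y_in, not the thresholded output.

```python
# Adaline sketch: error is computed on the activation potential
# y_in = sum_i w_i*x_i (before thresholding), w_i <- w_i - eta*(y_in - t)*x_i.

def adaline_epoch(samples, w, eta=0.1):
    for x, t in samples:
        y_in = sum(wi * xi for wi, xi in zip(w, x))   # activation potential
        w = [wi - eta * (y_in - t) * xi for wi, xi in zip(w, x)]
    return w

# Bipolar OR with bias input x0 = +1:
data = [((1, -1, -1), -1), ((1, -1, 1), 1), ((1, 1, -1), 1), ((1, 1, 1), 1)]
w = [0.0, 0.0, 0.0]
for _ in range(50):
    w = adaline_epoch(data, w)
# Thresholding y_in (sign) now reproduces the bipolar OR targets.
```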
Madaline: Multiple Adaptive Linear Neurons
 Many Adalines in parallel with a single output unit
 Output is based on a selection rule (e.g., max, AND)
 The 𝑣𝑖 s are fixed, positive, and possess a common value
 Training is like Adaline:
1. Let 𝑧𝑗 = 𝑓(𝑧in𝑗 ) denote the output of the 𝑗-th Adaline unit
2. If the final output does not match the target:
𝑤𝑖𝑗 ← 𝑤𝑖𝑗 − 𝜂(𝑧𝑗 − 𝑡) ∙ 𝑥𝑖
ANNs Based on Connections
 Single-layer feed-forward network
 Multilayer feed-forward network
 Single node with its own feedback
 Single-layer recurrent network
 Multilayer recurrent network.
Single Layer Feed-Forward Network
Multi Layer Feed-Forward Network
 It may or may not be fully connected
Single Node with Own Feedback
 Lateral Feedback: feedback to the same layer
 Recurrent Networks: feedback networks with closed loops
Single Layer Recurrent Neural Network
Multi Layer Recurrent Neural Network
