03 NeuralNetworksI
Problems:
B1: 3.1, 3.2, 3.3
B2: Chapter 2, 3: Solved Problems
Neural Networks
Inspired by how the human brain processes information
Neuron – the processing unit of the human brain
Neuron collects signals from others through a host of fine structures called
dendrites.
Neuron sends out spikes of electrical activity through a long, thin strand
known as an axon, which splits into thousands of branches.
At the end of each branch, a structure called a synapse converts the
activity from the axon into electrical effects that inhibit or excite activity in
the connected neurons.
An estimated 10^11 neurons are present in a human brain.
Each neuron is connected to thousands of other neurons.
About 10^14 synapses exist in a human brain.
Input signals collected through dendrites affect the electrical potential
inside the neuron body – called membrane potential.
Spiking of neuron happens when this membrane potential crosses a certain
threshold value.
After firing, the neuron must wait for some time to recover its energy (the
refractory period) before it can fire again.
Each neuron can be seen as a separate processor doing a simple task:
deciding whether to fire or not to fire.
The brain is a massively parallel supercomputer with 10^11 processing elements
and dense interconnection.
Learning in the brain happens on the principle of plasticity:
Modifying the strength of synaptic connections between neurons, and creating
new connections.
McCulloch and Pitts Neuron Model
h = Σ_{i=1}^{n} w_i x_i (n: number of inputs)
Analogy
𝜃 is the threshold (“membrane threshold”)
How to change weights and threshold functions of the neurons so that the
neural network gives correct output
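The McCulloch-Pitts computation can be sketched in a few lines of Python (a minimal illustration using the fire-if-h ≥ θ convention; the AND weights and threshold below are assumed example values, not from the slides):

```python
def mcculloch_pitts(x, w, theta):
    """McCulloch-Pitts unit: compute h = sum_i w_i * x_i and fire iff h >= theta."""
    h = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if h >= theta else 0

# With unit weights and theta = 2, the unit behaves as a two-input AND gate.
outputs = [mcculloch_pitts((a, b), (1, 1), 2) for a in (0, 1) for b in (0, 1)]
print(outputs)  # -> [0, 0, 0, 1]
```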
The Perceptron
Weighted Connections
Learning Rules in Neural Networks
Perceptron learning rule
Hebbian Learning Rule
Delta learning rule or Widrow-Hoff rule
Perceptron Learning Rule
Supervised Learning Approach
The modification in the synaptic weight of a node is equal to the
product of the error and the input.
w_ij ← w_ij + η (t_j − y_j) ∙ x_i
or, w_ij ← w_ij − η (y_j − t_j) ∙ x_i
where
y_j: actual output at the j-th neuron
t_j: target output corresponding to the j-th neuron
𝜂: learning rate
Input x_i, target t_j and adder output y_j are beyond our control
w_ij and η are what we can change
High value of 𝜂: learning will be too fast (dramatic) and the system will
never stabilize
Low value of 𝜂: learning will be too slow – system will have to see the input
too many times before it learns, but it will be resistant to noise.
Ideally 0.1 < 𝜂 < 0.4
[Figure: error vs. weights for a high value of η and for a low value of η]
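The two equivalent forms of the perceptron update rule can be sketched as small helpers (hypothetical function names; the sample weights and inputs are chosen for illustration):

```python
def update(w, x, t, y, eta):
    """Perceptron rule, first form: w_i <- w_i + eta * (t - y) * x_i."""
    return [wi + eta * (t - y) * xi for wi, xi in zip(w, x)]

def update_alt(w, x, t, y, eta):
    """Second form: w_i <- w_i - eta * (y - t) * x_i (identical effect)."""
    return [wi - eta * (y - t) * xi for wi, xi in zip(w, x)]

w, x = [0.2, -0.02, 0.02], [-1, 0, 1]
assert update(w, x, t=1, y=0, eta=0.25) == update_alt(w, x, t=1, y=0, eta=0.25)
```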
Bias Input
What if all inputs are zero and we want one or more neurons to fire?
Solution:
Introduce a non-zero (say -1) “bias” input indexed at 0
Introduce weights w_0i: weight of the bias input to the i-th neuron.
Intuitively, the bias also represents the bias in the human mind
Perceptron Algorithm
Simulating OR output
[Figure: perceptron with bias input x0 = −1 and inputs x1, x2; initial weights w0 = .2, w1 = −.02, w2 = .02; η = 0.25]

Inputs OR:
x1 x2 | t
 0  0 | 0
 0  1 | 1
 1  0 | 1
 1  1 | 1

For x1 = 0, x2 = 1 (t = 1):
Σ w_i x_i = w0 x0 + w1 x1 + w2 x2
          = (.2)(−1) + (−.02)(0) + (.02)(1)
          = −.18 < 0; y = 0; y ≠ t ⇒ update required

w0 ← w0 − η (y − t) × x0 = .2 − .25 (0 − 1)(−1) = −.05
w1 ← w1 − η (y − t) × x1 = −.02 − .25 (0 − 1)(0) = −.02 (no update)
w2 ← w2 − η (y − t) × x2 = .02 − .25 (0 − 1)(1) = .27

For x1 = 1, x2 = 0, apply w_i ← w_i − η (y − t) × x_i with the updated weights (now w2 = .27).
For x1 = 1, x2 = 1, apply w_i ← w_i − η (y − t) × x_i again.
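The hand computation above can be replayed in Python (same initial weights, η = 0.25, input (x1, x2) = (0, 1)):

```python
eta = 0.25
w = [0.2, -0.02, 0.02]   # w0 (bias weight), w1, w2
x = [-1, 0, 1]           # x0 = -1 (bias), x1 = 0, x2 = 1
t = 1                    # OR target for (0, 1)

h = sum(wi * xi for wi, xi in zip(w, x))   # (.2)(-1) + (-.02)(0) + (.02)(1) = -.18
y = 1 if h > 0 else 0                      # h < 0, so y = 0 and y != t: update needed
w = [wi - eta * (y - t) * xi for wi, xi in zip(w, x)]
print([round(wi, 2) for wi in w])          # -> [-0.05, -0.02, 0.27]
```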
n: #outputs
k: #samples
Complexity: Ο(Tmnk)
1. B1: Stephen Marsland, Machine Learning: An Algorithmic Perspective, 2nd Edition, CRC Press, 2015.
2. B2: S. N. Sivanandam, S. N. Deepa, Principles of Soft Computing, 3rd Edition, Wiley, 2018.
B1 vs. B2:
B1: w_new ← w_old − η (y − t) x
Algorithm
For P iterations do:
  Predict y for all s input samples
  Update w for the combined effect of all s input samples, i.e.,
  w_ij ← w_ij − η Σ_{k=1}^{s} x_ki (y_kj − t_kj) = w_ij − η Σ_{k=1}^{s} (xᵀ)_ik (y_kj − t_kj)
  In matrix form: w ← w − η xᵀ (y − t)
Batch mode seems to be often better (than updating at every sample)
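A single batch update w ← w − η xᵀ(y − t) for one output neuron can be sketched in plain Python (variable names and the OR data are illustrative):

```python
eta = 0.25
X = [[-1, 0, 0], [-1, 0, 1], [-1, 1, 0], [-1, 1, 1]]   # rows: [x0 = -1, x1, x2]
T = [0, 1, 1, 1]                                       # OR targets
w = [0.2, -0.02, 0.02]

# Predict y for all s samples with the current weights.
Y = [1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0 for x in X]

# Combined update over the whole batch: w_i <- w_i - eta * sum_k x_ki (y_k - t_k)
for i in range(len(w)):
    w[i] -= eta * sum(X[k][i] * (Y[k] - T[k]) for k in range(len(X)))

print([round(wi, 2) for wi in w])   # -> [-0.55, 0.48, 0.52]
```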
[Figure: perceptron with bias input and inputs x1, x2 for the OR function, alongside the OR truth table]
Decision Boundary for OR function
Σ_{i=0} w_ij x_i = x ⋅ w_j
where w_j is the column vector corresponding to the j-th neuron.
The j-th neuron fires if x ⋅ w_j > 0 and does not fire otherwise.
So, the j-th neuron acts as a two-class classifier:
Class I: x ⋅ w_j > 0
Class II: x ⋅ w_j ≤ 0
x ⋅ w_j = 0 can be considered as the decision boundary for the j-th neuron
For 2-D OR case with 1 neuron, this becomes:
𝑥0 𝑤0 + 𝑥1 𝑤1 + 𝑥2 𝑤2 = 0
−𝑤0 + 𝑥1 𝑤1 + 𝑥2 𝑤2 = 0 (−𝑤0 corresponds to bias)
The above is the equation for a straight line.
Class I:−𝑤0 + 𝑥1 𝑤1 + 𝑥2 𝑤2 > 0
Class II:−𝑤0 + 𝑥1 𝑤1 + 𝑥2 𝑤2 ≤ 0
Another Perspective
Let x^(1) = {−1, x1^(1), x2^(1)} and x^(2) = {−1, x1^(2), x2^(2)} be two points on the
decision boundary.
Then x^(1) ⋅ w_j = 0 and x^(2) ⋅ w_j = 0, i.e.,
(x^(1) − x^(2)) ⋅ w_j = 0
That is, the vector w_j is perpendicular to the line x^(1) − x^(2), and this holds
for any two points x^(1) and x^(2) on the decision boundary.
Hence, decision boundary is a line and 𝑤𝑗 is a vector perpendicular to it.
Decision boundary is a line in 2D case, plane in 3D case and hyperplane
in higher dimensions.
Decision boundary is:
−w0 + w1 x1 + w2 x2 = 0 (with bias input x0 = −1)
⇒ −.01 + .02 x1 + .02 x2 = 0
[Figure: OR data points in the (x1, x2) plane with this decision boundary and the weight vector w_j perpendicular to it]
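The boundary −.01 + .02x1 + .02x2 = 0 can be checked against the OR truth table (a quick verification sketch; the function name is made up):

```python
def fires(x1, x2, w0=0.01, w1=0.02, w2=0.02):
    """Neuron output: fire iff -w0 + w1*x1 + w2*x2 > 0 (bias input x0 = -1)."""
    return 1 if -w0 + w1 * x1 + w2 * x2 > 0 else 0

or_table = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}
assert all(fires(x1, x2) == t for (x1, x2), t in or_table.items())
```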
Convergence Theorem
If the data is linearly separable, the fixed-increment perceptron
algorithm terminates after a finite number of weight updates.
Proof taken from the slides by Prof. Robert Snapp, Department of
Computer Science, University of Vermont, USA, as part of his course
CS 295: Machine Learning
Proof of Convergence Theorem
Consider a single neuron.
Let 𝑤 represent the weight vector.
Let x_i represent the i-th sample vector.
Let t_i represent the target label of the i-th sample, t_i ∈ {0, 1}.
Let the activation function be:
y_i = 1 if x_i w > 0, and y_i = 0 if x_i w ≤ 0
Update rule
w ← w − η x_iᵀ (y_i − t_i)
Let
l_i = −1 if t_i = 0, and l_i = +1 if t_i = 1
Then, the update rule becomes
w ← w − η x_iᵀ (y_i − t_i) ⇒ w ← w + η x_iᵀ l_i
Let w* represent the solution that separates the given data.
Let x̂_i = x_i l_i
Then,
x̂_i w* > 0, ∀i
and the weight update becomes
w ← w + η x̂_iᵀ
Let w(k) represent the weight vector after the k-th update.
Let x̂(k) represent the input sample that triggered the k-th update.
Thus,
w(1) = w(0) + η x̂ᵀ(1)
w(2) = w(1) + η x̂ᵀ(2)
⋮
w(k) = w(k−1) + η x̂ᵀ(k)
We shall prove
A k² ≤ ‖w(k) − w(0)‖² ≤ B k
for constants A and B.
Thus, the network must converge after no more than k_max = B/A updates.
Cauchy-Schwarz Inequality
Let a, b ∈ ℝⁿ. Then
‖a‖² ‖b‖² ≥ (aᵀ b)²
w(1) = w(0) + η x̂ᵀ(1)
w(2) = w(1) + η x̂ᵀ(2)
⋮
w(k) = w(k−1) + η x̂ᵀ(k)
Adding the above k equations yields
w(k) = w(0) + η (x̂ᵀ(1) + x̂ᵀ(2) + ⋯ + x̂ᵀ(k))
w(k) − w(0) = η (x̂ᵀ(1) + x̂ᵀ(2) + ⋯ + x̂ᵀ(k))
Multiplying both sides with the solution w*ᵀ:
w*ᵀ (w(k) − w(0)) = η w*ᵀ (x̂ᵀ(1) + x̂ᵀ(2) + ⋯ + x̂ᵀ(k))
Let
a = min_{x̂} w*ᵀ x̂ᵀ > 0
Thus,
w*ᵀ (w(k) − w(0)) ≥ ηak > 0
Squaring both sides and applying the Cauchy-Schwarz inequality yields
‖w*ᵀ‖² ‖w(k) − w(0)‖² ≥ (w*ᵀ (w(k) − w(0)))² ≥ (ηak)²
Thus,
‖w(k) − w(0)‖² ≥ (ηa / ‖w*ᵀ‖)² k²
This gives the lower bound.
Proof: Upper Bound
w(1) = w(0) + η x̂ᵀ(1)
w(2) = w(1) + η x̂ᵀ(2)
⋮
w(k) = w(k−1) + η x̂ᵀ(k)
Subtracting w(0) from both sides yields
w(1) − w(0) = η x̂ᵀ(1)
w(2) − w(0) = (w(1) − w(0)) + η x̂ᵀ(2)
⋮
w(k) − w(0) = (w(k−1) − w(0)) + η x̂ᵀ(k)
Squaring both sides yields
‖w(1) − w(0)‖² = η² ‖x̂ᵀ(1)‖²
‖w(2) − w(0)‖² = ‖w(1) − w(0)‖² + 2η (w(1) − w(0))ᵀ x̂ᵀ(2) + η² ‖x̂ᵀ(2)‖²
⋮
‖w(k) − w(0)‖² = ‖w(k−1) − w(0)‖² + 2η (w(k−1) − w(0))ᵀ x̂ᵀ(k) + η² ‖x̂ᵀ(k)‖²
Since x̂(k) was misclassified by w(k−1), we have w(k−1)ᵀ x̂ᵀ(k) ≤ 0, so
‖w(k) − w(0)‖² ≤ ‖w(k−1) − w(0)‖² − 2η w(0)ᵀ x̂ᵀ(k) + η² ‖x̂ᵀ(k)‖²
Summing over the k updates,
‖w(k) − w(0)‖² ≤ η² (‖x̂ᵀ(1)‖² + ‖x̂ᵀ(2)‖² + ⋯ + ‖x̂ᵀ(k)‖²)
− 2η w(0)ᵀ (x̂ᵀ(2) + ⋯ + x̂ᵀ(k))
Define
M = max_{x̂} ‖x̂ᵀ‖²
μ = 2 min_{x̂} w(0)ᵀ x̂ᵀ (w(0)ᵀ x̂ < 0 for misclassifications)
The bound above becomes
‖w(k) − w(0)‖² ≤ (η²M − ημ) k
Hence, we have shown
A k² ≤ ‖w(k) − w(0)‖² ≤ B k
with
A = (ηa / ‖w*ᵀ‖)² and B = η²M − ημ
Thus,
k_max = B/A = (ηM − μ) ‖w*ᵀ‖² / (η a²)
LINEAR SEPARABILITY
A straight line decision boundary may not always exist
Linearly separable cases – when a straight (linear) decision boundary
is possible
Multiple Neurons May Help!
XOR Function – Linearly Inseparable
XOR – separable in 3D
Added Dimension
It is always possible to separate out two classes with a linear function,
provided that you project the data into the correct set of dimensions.
Kernel classifiers – basis of Support Vector Machines
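For XOR, one such projection is the added feature x3 = x1·x2; with hand-picked (illustrative, not from the slides) plane coefficients, the two classes separate:

```python
def xor_via_3d(x1, x2):
    """Project (x1, x2) to (x1, x2, x1*x2); the plane -0.5 + x1 + x2 - 2*x3 = 0
    (hand-picked coefficients) then separates the XOR classes."""
    x3 = x1 * x2
    return 1 if -0.5 + x1 + x2 - 2 * x3 > 0 else 0

print([xor_via_3d(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # -> [0, 1, 1, 0]
```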
Data Normalization/Standardization
Scaling input data to lie between (-1,+1)
Additionally with zero mean and unit variance – little better as it does not
allow outliers to dominate as much
𝑥 = (𝑥 − 𝜇)/𝜎
Partitioning data based on range to integral values
Choosing a subset of features can improve accuracy
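The zero-mean, unit-variance scaling x = (x − μ)/σ above can be sketched as (the sample values are made up):

```python
def standardize(values):
    """Rescale a feature column to zero mean and unit variance: x <- (x - mu) / sigma."""
    n = len(values)
    mu = sum(values) / n
    sigma = (sum((v - mu) ** 2 for v in values) / n) ** 0.5   # population std dev
    return [(v - mu) / sigma for v in values]

z = standardize([2.0, 4.0, 6.0, 8.0])
# z now has mean 0 and variance 1
```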
LINEAR REGRESSION
Classification: find a line that separates out the classes
Regression: fit a line to data
Classification as an instance of Regression
1. Fit a line to the target data
2. Do regression for each class separately, i.e., fit a line to the data points of
each class separately
In linear regression, we are computing lines (in 2D) that can predict
target values closely, i.e., 𝑦 = 𝛽1 𝑥 + 𝛽0
General form:
y = Σ_{i=0}^{M} β_i x_i
where M is the #dimensions of an input vector, and
β = (β0, β1, …, βM) defines a line in 2-D, a plane in 3-D and a hyperplane
in higher dimensions.
Linear regression in two and three dimensions
How do we define the line/plane/hyperplane that best fits the data?
Minimize the distance between the line and the data points.
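For the 2-D case y = β1 x + β0, minimizing the summed squared distances gives the familiar closed form; a sketch (the sample points are made up):

```python
def fit_line(xs, ys):
    """Least-squares fit of y = b1*x + b0: b1 = cov(x, y) / var(x), then b0 from the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b1 * mx, b1

b0, b1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])   # points lying on y = 2x + 1
print(b0, b1)   # -> 1.0 2.0
```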
Least-squares Optimization
where
N: #data points
M: #dimensions of the input vector
https://en.wikipedia.org/wiki/Matrix_calculus#Vector-by-vector
http://www.math.nyu.edu/~neylon/linalgfall04/project1/dj/proptranspose.htm
Fill in the details in the proof (left as homework assignment)
Linear Regression for OR, AND, and XOR

Inputs OR
x1 x2 | t | y
 0  0 | 0 | 0.25
 0  1 | 1 | 0.75
 1  0 | 1 | 0.75
 1  1 | 1 | 1.25

Inputs AND
x1 x2 | t | y
 0  0 | 0 | −0.25
 0  1 | 0 | 0.25
 1  0 | 0 | 0.25
 1  1 | 1 | 0.75

Inputs XOR
x1 x2 | t | y
 0  0 | 0 | 0.5
 0  1 | 1 | 0.5
 1  0 | 1 | 0.5
 1  1 | 0 | 0.5
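The y columns come from solving the least-squares normal equations. A small sketch (a generic Gaussian-elimination solver; the function name is illustrative) reproduces the XOR fit, where every prediction collapses to 0.5:

```python
def lstsq(X, t):
    """Solve the normal equations (X^T X) beta = X^T t by Gaussian elimination."""
    m = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(m)] for i in range(m)]   # X^T X
    b = [sum(r[i] * ti for r, ti in zip(X, t)) for i in range(m)]             # X^T t
    for c in range(m):                            # forward elimination, partial pivoting
        p = max(range(c, m), key=lambda r: abs(A[r][c]))
        A[c], A[p], b[c], b[p] = A[p], A[c], b[p], b[c]
        for r in range(c + 1, m):
            f = A[r][c] / A[c][c]
            A[r] = [arj - f * acj for arj, acj in zip(A[r], A[c])]
            b[r] -= f * b[c]
    beta = [0.0] * m
    for c in reversed(range(m)):                  # back substitution
        beta[c] = (b[c] - sum(A[c][j] * beta[j] for j in range(c + 1, m))) / A[c][c]
    return beta

X = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]  # columns: constant 1, x1, x2
beta = lstsq(X, [0, 1, 1, 0])                     # XOR targets
preds = [sum(bi * xi for bi, xi in zip(beta, row)) for row in X]
print([round(p, 2) for p in preds])               # -> [0.5, 0.5, 0.5, 0.5]
```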
Miscellaneous Topics
Adaline: Adaptive Linear Neuron
A single linear unit that uses the input to the activation function (the activation
potential) for calculating the error, rather than the output of the activation function
Update Rule
w_i ← w_i − η (y_in − t) ∙ x_i
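The delta update above uses the activation potential y_in = Σ w_i x_i directly in the error term; a minimal sketch (the sample numbers are illustrative):

```python
def adaline_update(w, x, t, eta):
    """Delta rule: w_i <- w_i - eta * (y_in - t) * x_i, where y_in = sum_i w_i * x_i
    is the raw activation potential (no thresholding before computing the error)."""
    y_in = sum(wi * xi for wi, xi in zip(w, x))
    return [wi - eta * (y_in - t) * xi for wi, xi in zip(w, x)]

w = adaline_update([0.2, -0.02, 0.02], [-1, 0, 1], t=1, eta=0.25)
print([round(wi, 3) for wi in w])   # -> [-0.095, -0.02, 0.315]
```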
Madaline: Multiple adaptive linear neurons
Many Adalines in parallel with a single output unit
Output is based on a selection rule (e.g., max, AND)