
Simple Neural Networks and Neural Language Models
Units in Neural Networks
This is in your brain
(Image: a neuron. By BruceBlaus - Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=28761830)
Neural Network Unit
This is not in your brain

(Diagram of a single unit)
Output value y = a
Non-linear transform σ applied to z
Weighted sum ∑ producing z
Weights w1, w2, w3 and bias b
Input layer x1, x2, x3, plus the dummy +1 node for the bias
Neural unit

At its heart, a neural unit is taking a weighted sum of its inputs, with one additional term in the sum called a bias term. Given a set of inputs x1...xn, a unit has a set of corresponding weights w1...wn and a bias b, so the weighted sum z can be represented as:

z = b + Σi wi xi

It's often more convenient to express this weighted sum using vector notation; recall from linear algebra that a vector is, at heart, just a list or array of numbers. Thus we'll talk about z in terms of a weight vector w, a scalar bias b, and an input vector x, and we'll replace the sum with the convenient dot product:

z = w · x + b        (7.2)

As defined in Eq. 7.2, z is just a real valued number. Instead of using z, a linear function of x, as the output, neural units apply a non-linear function f to z. We will refer to the output of this function as the activation value for the unit, a. Since we are just modeling a single unit, the activation for the node is in fact the final output of the network, which we'll generally call y. So the value y is defined as:

y = a = f(z)

Non-Linear Activation Functions
We've already seen the sigmoid for logistic regression. We'll discuss three popular non-linear functions f() below (the sigmoid, the tanh, and the rectified linear unit or ReLU), but it's pedagogically convenient to start with the sigmoid, since we saw it in Chapter 5:

Sigmoid:   y = σ(z) = 1 / (1 + e^(−z))        (7.3)

The sigmoid (Fig. 7.1) has a number of advantages; it maps the output into the range (0, 1), which is useful in squashing outliers toward 0 or 1. And it's differentiable, which as we saw in Section ?? will be handy for learning.
Final function the unit is computing

Substituting Eq. 7.2 into Eq. 7.3 gives us the output of a neural unit:

y = σ(w · x + b) = 1 / (1 + exp(−(w · x + b)))

Fig. 7.2 shows a final schematic of a basic neural unit. In this example the unit takes 3 input values x1, x2, and x3, and computes a weighted sum, multiplying each value by a weight (w1, w2, and w3, respectively), adds them to a bias term b, and then passes the resulting sum through a sigmoid function to result in a number between 0 and 1.
Final unit again

(Diagram of the unit)
Output value y = a
Non-linear activation function σ applied to z
Weighted sum ∑
Weights w1, w2, w3 and bias b
Input layer x1, x2, x3, plus the dummy +1 node for the bias
An example

Suppose a unit has:
w = [0.2, 0.3, 0.9]
b = 0.5

What would this unit do with the following input vector:
x = [0.5, 0.6, 0.1]

The resulting output y would be:

y = σ(w · x + b) = 1 / (1 + e^(−(w·x+b))) = 1 / (1 + e^(−(0.5∗0.2 + 0.6∗0.3 + 0.1∗0.9 + 0.5))) = 1 / (1 + e^(−0.87)) = 0.70

In practice, the sigmoid is not commonly used as an activation function.
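A minimal sketch (assuming NumPy) of this example computation; the weights, bias, and input are the ones shown above:

import numpy as np

def sigmoid(z):
    # squash a real value into the range (0, 1)
    return 1 / (1 + np.exp(-z))

w = np.array([0.2, 0.3, 0.9])    # weights from the example
b = 0.5                          # bias from the example
x = np.array([0.5, 0.6, 0.1])    # input vector from the example

z = w @ x + b                    # weighted sum: 0.87
y = sigmoid(z)                   # activation: ~0.70
print(z, y)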
Non-Linear Activation Functions besides sigmoid

A function that is very similar to the sigmoid but almost always better is the tanh function shown in Fig. 7.3a, a variant of the sigmoid that ranges from -1 to +1:

y = (e^z − e^(−z)) / (e^z + e^(−z))        (7.5)

The simplest activation function, and perhaps the most commonly used, is the rectified linear unit, also called the ReLU, shown in Fig. 7.3b. It's just the same as z when z is positive, and 0 otherwise:

Most common:   y = max(z, 0)               (7.6)

(Fig. 7.3: a) tanh, b) ReLU, the Rectified Linear Unit)
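A minimal sketch (assuming NumPy) of the three activation functions discussed here:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))   # range (0, 1), Eq. 7.3

def tanh(z):
    return np.tanh(z)             # range (-1, +1), Eq. 7.5

def relu(z):
    return np.maximum(z, 0)       # Eq. 7.6: z if positive, else 0

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), tanh(z), relu(z))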
Simple Neural Networks and Neural Language Models
Units in Neural Networks

Simple Neural Networks and Neural Language Models
The XOR problem
The XOR problem

Minsky and Papert (1969): can neural units compute simple functions of their input?

One of the most clever demonstrations of the need for multi-layer networks was the proof by Minsky and Papert (1969) that a single neural unit cannot compute some very simple functions of its input. Consider the task of computing elementary logical functions of two inputs, like AND, OR, and XOR. As a reminder, here are the truth tables for those functions:

AND OR XOR
x1 x2 y x1 x2 y x1 x2 y
0 0 0 0 0 0 0 0 0
0 1 0 0 1 1 0 1 1
1 0 0 1 0 1 1 0 1
1 1 1 1 1 1 1 1 0

This example was first shown for the perceptron, which is a very simple neural unit that has a binary output and does not have a non-linear activation function.
Perceptrons

A very simple neural unit:
• Binary output (0 or 1)
• No non-linear activation function

The output of the perceptron is 0 or 1, and is computed as follows (using the same weights w and bias b as in Eq. 7.2):

y = 0  if w · x + b ≤ 0
    1  if w · x + b > 0        (7.7)
Easy to build AND or OR with perceptrons

It's very easy to build a perceptron that can compute the logical AND and OR functions of its binary inputs; Fig. 7.4 shows the necessary weights:

(a) AND:  w1 = 1, w2 = 1, bias = -1
(b) OR:   w1 = 1, w2 = 1, bias = 0

Figure 7.4: The weights w and bias b for perceptrons computing logical functions. The inputs are shown as x1 and x2, and the bias as a special node with value +1 which is multiplied with the bias weight b.

Early in the history of neural networks it was realized that the power of neural networks, as with the real neurons that inspired them, comes from combining these units into larger networks.
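A minimal sketch (assuming NumPy) of a perceptron with the Fig. 7.4 weights computing AND and OR over its binary inputs:

import numpy as np

def perceptron(w, b, x):
    # binary threshold unit: 1 if w·x + b > 0, else 0  (Eq. 7.7)
    return int(np.dot(w, x) + b > 0)

w_and, b_and = np.array([1, 1]), -1   # weights/bias from Fig. 7.4a
w_or,  b_or  = np.array([1, 1]),  0   # weights/bias from Fig. 7.4b

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "AND:", perceptron(w_and, b_and, np.array(x)),
             "OR:",  perceptron(w_or,  b_or,  np.array(x)))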
Not possible to capture XOR with perceptrons

Pause the lecture and try for yourself!


Why? Perceptrons are linear classifiers
The perceptron equation, given x1 and x2, is the equation of a line:
w1x1 + w2x2 + b = 0

(in standard linear format: x2 = (−w1/w2)x1 + (−b/w2) )

This line acts as a decision boundary


• 0 if input is on one side of the line
• 1 if on the other side of the line
Decision boundaries

(Figure: decision boundaries in the x1–x2 plane)
a) x1 AND x2    b) x1 OR x2    c) x1 XOR x2 (?)

XOR is not a linearly separable function!


Solution to the XOR problem

XOR can't be calculated by a single perceptron.
XOR can be calculated by a layered network of units (here with ReLU activations).

XOR
x1 x2 y
0  0  0
0  1  1
1  0  1
1  1  0

Network: hidden ReLU units h1 and h2, each with weights 1, 1 from x1 and x2, and biases 0 and -1 (via the +1 node); output y1 with weight 1 from h1, weight -2 from h2, and bias 0.
The hidden representation h

(Same XOR network as above: hidden ReLU units h1, h2 feeding output y1.)

a) The original x space    b) The new (linearly separable) h space

(With learning: hidden layers will learn to form useful representations)
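A minimal sketch (assuming NumPy) of this XOR network with the weights shown above; it also prints the hidden representation h for each input:

import numpy as np

W = np.array([[1, 1],    # weights into h1
              [1, 1]])   # weights into h2
b = np.array([0, -1])    # biases for h1, h2
U = np.array([1, -2])    # weights from h1, h2 into y1
b_out = 0

def relu(z):
    return np.maximum(z, 0)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = relu(W @ np.array(x) + b)      # hidden representation
    y = U @ h + b_out                  # output unit
    print(x, "h =", h, "y =", y)       # y reproduces XOR: 0, 1, 1, 0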


Simple Neural Networks and Neural Language Models
The XOR problem

Simple Neural Networks and Neural Language Models
Feedforward Neural Networks

Feedforward Neural Networks
Can also be called multi-layer perceptrons (or MLPs) for historical reasons.

(Figure: a feedforward network. Input layer x1 x2 … x_n0 plus a +1 bias node; hidden layer h1 h2 h3 … h_n1 with weights W and bias b; output layer y1 y2 … y_n2.)
Binary Logistic Regression as a 1-layer Network
(we don't count the input layer in counting layers!)

Output layer (σ node):   y = σ(w·x + b)    (y is a scalar)
w is a vector of weights w1 … wn; b is a scalar bias
Input layer: x1 … xn, +1   (vector x)
Multinomial Logistic Regression as a 1-layer Network
Fully connected single layer network

Output layer (softmax nodes):   y = softmax(Wx + b)    (y is a vector)
W is a matrix; b is a vector
Input layer: x1 … xn, +1   (scalars)
Reminder: softmax

The multinomial logistic classifier uses a generalization of the sigmoid, called the softmax function, to compute the probability p(y = c|x). The softmax takes a vector z = [z1, z2, ..., zk] of k arbitrary values and maps them to a probability distribution, with each value in the range (0, 1), and all the values summing to 1. Like the sigmoid, it is an exponential function.

For a vector z of dimensionality k, the softmax is defined as:

softmax(z_i) = exp(z_i) / Σ_{j=1}^{k} exp(z_j)    for 1 ≤ i ≤ k        (5.30)

The denominator Σ_{i=1}^{k} exp(z_i) is used to normalize all the values into probabilities.

Example: the softmax of the input vector

z = [0.6, 1.1, −1.5, 1.2, 3.2, −1.1]

is (rounded):

softmax(z) = [0.055, 0.090, 0.006, 0.099, 0.74, 0.010]
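A minimal sketch (assuming NumPy) of the softmax applied to the example vector above:

import numpy as np

def softmax(z):
    # subtract the max for numerical stability; doesn't change the result
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([0.6, 1.1, -1.5, 1.2, 3.2, -1.1])
print(np.round(softmax(z), 3))   # matches the rounded values above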
Two-Layer Network with scalar output

Output layer (σ node):   y = σ(z),  z = U h    (y is a scalar)
Hidden units (σ node; could be ReLU or tanh)
Weights W and bias b
Input layer: x1 … xn, +1   (vector)
Two-Layer Network with scalar output (index notation)

Same network; W_ji is the weight connecting input unit i to hidden unit j, and b is the bias vector.
Two-Layer Network with softmax output

Output layer (softmax nodes):   y = softmax(z),  z = U h    (y is a vector)
Hidden units (σ node; could be ReLU or tanh)
Weights W and bias b
Input layer: x1 … xn, +1   (vector)
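A minimal sketch (assuming NumPy; the sizes and weights here are made up, not trained) of the two-layer forward pass with a softmax output:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 3, 5       # hypothetical sizes
x = rng.normal(size=n_in)             # input vector
W = rng.normal(size=(n_hidden, n_in))
b = rng.normal(size=n_hidden)
U = rng.normal(size=(n_out, n_hidden))

h = sigmoid(W @ x + b)     # hidden layer (could be ReLU or tanh instead)
z = U @ h                  # output scores
y = softmax(z)             # distribution over the n_out classes
print(y, y.sum())          # probabilities summing to 1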
Multi-layer Notation

y = a[2]
a[2] = g[2](z[2])            (g[2]: sigmoid or softmax)
z[2] = W[2] a[1] + b[2]
a[1] = g[1](z[1])            (g[1]: ReLU)
z[1] = W[1] a[0] + b[1]
a[0] = x   (the input layer: x1 … xn, +1)
Multi-Layer Notation (single unit recap): inputs x1, x2, x3, +1; weights w1, w2, w3 and bias b; weighted sum z; activation σ giving a; output y.
Replacing the bias unit

Let's switch to a notation without the bias unit. Just a notational change:
1. Add a dummy node a0 = 1 to each layer
2. Its weight w0 will be the bias
3. So input layer a[0]_0 = 1,
   and a[1]_0 = 1, a[2]_0 = 1, …

Instead of:

h = σ(Wx + b)                                        (7.12)

we'll use:

h = σ(Wx)                                            (7.13)

But now, instead of our vector x having n0 values, x = x1, …, x_n0, it will have n0+1 values, with a new 0th dummy value x0 = 1: x = x0, x1, …, x_n0. And instead of computing each hj as

h_j = σ( Σ_{i=1}^{n0} W_ji x_i + b_j ),

we'll instead use:

h_j = σ( Σ_{i=0}^{n0} W_ji x_i )                     (7.14)

where the weight W_j0 now plays the role of the bias b_j.
Replacing the bias unit

Instead of: a network with inputs x1 x2 … x_n0 plus a +1 bias node, weights W and bias b into the hidden layer h1 h2 h3 … h_n1, and U into the outputs y1 y2 … y_n2.

We'll do this: a network with inputs x0 = 1, x1, x2, …, x_n0 and weights W (no separate b) into h1 h2 h3 … h_n1, and U into y1 y2 … y_n2.
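A minimal sketch (assuming NumPy, with made-up toy numbers) showing that folding the bias into a 0th column of W with a dummy x0 = 1 computes exactly the same hidden layer:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

W = np.array([[0.2, 0.3, 0.9],
              [0.5, -0.1, 0.4]])      # hypothetical 2x3 weight matrix
b = np.array([0.5, -0.2])             # hypothetical biases
x = np.array([0.5, 0.6, 0.1])

h_with_bias = sigmoid(W @ x + b)      # h = σ(Wx + b), Eq. 7.12

W0 = np.hstack([b[:, None], W])       # bias becomes column 0 of W
x0 = np.concatenate([[1.0], x])       # dummy node x0 = 1
h_dummy = sigmoid(W0 @ x0)            # h = σ(Wx), Eq. 7.13

print(np.allclose(h_with_bias, h_dummy))   # True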
Simple Neural Networks and Neural Language Models
Feedforward Neural Networks

Simple Neural Networks and Neural Language Models
Applying feedforward networks to NLP tasks
Use cases for feedforward networks

Let's consider 2 (simplified) sample tasks:
1. Text classification
2. Language modeling

State-of-the-art systems use more powerful neural architectures, but simple models are useful to consider!
Classification: Sentiment Analysis

We could do exactly what we did with logistic regression:
• Input layer: binary features, as before
• Output layer: 0 or 1 (a σ node)

(Figure: inputs x1 … xn, weights W and U, σ output node.)
Sentiment Features

We'll represent each input observation (a review document doc) by the 6 features x1...x6 shown in the following table; Fig. 5.2 shows the features in a sample mini test document.

Var   Definition                                    Value in Fig. 5.2
x1    count(positive lexicon words ∈ doc)           3
x2    count(negative lexicon words ∈ doc)           2
x3    1 if "no" ∈ doc, 0 otherwise                  1
x4    count(1st and 2nd person pronouns ∈ doc)      3
x5    1 if "!" ∈ doc, 0 otherwise                   0
x6    log(word count of doc)                        ln(66)
Feedforward nets for simple classification

(Figure: logistic regression vs. a 2-layer feedforward network, both taking the hand-built features f1 f2 … fn as input x1 … xn; the feedforward network adds a hidden layer and weights U before the σ output.)

Just adding a hidden layer to logistic regression:
• allows the network to use non-linear interactions between features
• which may (or may not) improve performance.
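A minimal sketch (assuming NumPy; the weights here are made up, not trained) of a 2-layer feedforward classifier over the 6 hand-built sentiment features from the table above:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Feature vector x1..x6 for the sample document
x = np.array([3, 2, 1, 3, 0, np.log(66)])

rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(4, 6))   # hypothetical input->hidden weights
b = np.zeros(4)
U = rng.normal(scale=0.1, size=4)        # hypothetical hidden->output weights

h = np.maximum(W @ x + b, 0)             # ReLU hidden layer
y = sigmoid(U @ h)                       # estimate of P(positive sentiment | x)
print(y)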
Even better: representation learning

The real power of deep learning comes from the ability to learn features from the data.
Instead of using hand-built human-engineered features for classification,
use learned representations like embeddings!

(Figure: the same network, but with embeddings e1 e2 … en as the input instead of hand-built features.)
Neural Net Classification with embeddings as input features!

p(positive sentiment | The dessert is…)

Output layer (sigmoid): ŷ
U
Hidden layer: h1 h2 h3 … h_dh   (dh⨉1)
W: dh⨉3d
Projection layer (concatenated embeddings): 3d⨉1
E: embedding for word 534, embedding for word 23864, embedding for word 7
Input words w1 w2 w3: "The dessert is …"
Issue: texts come in different sizes

The architecture above assumes a fixed input length (3 word embeddings)! Kind of unrealistic: real texts come in different sizes.

(Figure: the same network, with its fixed 3-word projection layer over "The dessert is".)

Some simple solutions (more sophisticated solutions later)

1. Make the input the length of the longest review
   • If shorter, pad with zero embeddings
   • Truncate if you get longer reviews at test time
2. Create a single "sentence embedding" (the same dimensionality as a word) to represent all the words (pooling; sketched below)
   • Take the mean of all the word embeddings
   • Take the element-wise max of all the word embeddings: for each dimension, pick the max value from all words
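A minimal sketch (assuming NumPy, with random stand-in embeddings) of the two pooling options in solution 2:

import numpy as np

rng = np.random.default_rng(2)
d = 8                                            # embedding dimensionality (made up)
words = ["the", "dessert", "is", "delicious"]
embeddings = rng.normal(size=(len(words), d))    # one row per word

sentence_mean = embeddings.mean(axis=0)   # mean of all word embeddings
sentence_max = embeddings.max(axis=0)     # element-wise max over words

print(sentence_mean.shape, sentence_max.shape)   # both (d,): same size as one word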
Reminder: Multiclass Outputs

What if you have more than two output classes?
◦ Add more output units (one for each class)
◦ And use a "softmax layer"

(Figure: inputs x1 … xn, weights W, softmax output layer.)
Neural Language Models (LMs)
Language Modeling: Calculating the probability of the
next word in a sequence given some history.
• We've seen N-gram based LMs
• But neural network LMs far outperform n-gram
language models
State-of-the-art neural LMs are based on more
powerful neural network technology like Transformers
But simple feedforward LMs can do almost as well!
Simple feedforward Neural Language Models

Task: predict next word w_t given prior words w_{t-1}, w_{t-2}, w_{t-3}, …

Problem: Now we're dealing with sequences of arbitrary length.
Solution: Sliding windows (of fixed length)
Neural Language Model
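A minimal sketch (assuming NumPy; vocabulary size, dimensions, and all weights are made up and untrained) of the forward pass of a fixed-window feedforward LM: look up embeddings for the 3 previous words, concatenate them, pass them through a hidden layer, and take a softmax over the vocabulary:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(3)
V, d, dh = 30000, 50, 64                # vocab size, embedding dim, hidden dim (made up)
E = rng.normal(scale=0.1, size=(V, d))  # embedding matrix
W = rng.normal(scale=0.1, size=(dh, 3 * d))
b = np.zeros(dh)
U = rng.normal(scale=0.1, size=(V, dh))

context = [534, 23864, 7]                       # word ids of the 3 previous words
e = np.concatenate([E[i] for i in context])     # projection layer: 3d values
h = np.maximum(W @ e + b, 0)                    # ReLU hidden layer
y = softmax(U @ h)                              # distribution over the next word
print(y.shape, y.argmax())                      # (V,), id of the most likely next word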
Why Neural LMs work better than N-gram LMs

Training data:
We've seen: I have to make sure that the cat gets fed.
Never seen: dog gets fed
Test data:
I forgot to make sure that the dog gets ___
N-gram LM can't predict "fed"!
Neural LM can use similarity of "cat" and "dog"
embeddings to generalize and predict “fed” after dog
Simple Neural Networks and Neural Language Models
Applying feedforward networks to NLP tasks

Simple Neural Networks and Neural Language Models
Training Neural Nets: Overview
Intuition: training a 2-layer Network

Actual answer: y
Loss function: L(ŷ, y)
System output: ŷ
Forward pass: training instance x1 … xn → W → hidden layer → U → ŷ
Backward pass: from the loss back through U and W to update the weights
Intuition: Training a 2-layer network

For every training tuple (x, y):
◦ Run forward computation to find our estimate ŷ
◦ Run backward computation to update weights:
  ◦ For every output node:
    ◦ Compute loss L between true y and the estimated ŷ
    ◦ For every weight w from hidden layer to the output layer: update the weight
  ◦ For every hidden node:
    ◦ Assess how much blame it deserves for the current answer
    ◦ For every weight w from input layer to the hidden layer: update the weight
Reminder: Loss Function for binary logistic regression

A measure of how far off the current answer is from the right answer: the cross-entropy loss.

log p(y|x) = log [ ŷ^y (1−ŷ)^(1−y) ] = y log ŷ + (1−y) log(1−ŷ)        (5.10)

Eq. 5.10 describes a log likelihood that should be maximized. In order to turn this into a loss function (something that we need to minimize), we'll just flip the sign on Eq. 5.10. The result is the cross-entropy loss L_CE:

L_CE(ŷ, y) = −log p(y|x) = −[ y log ŷ + (1−y) log(1−ŷ) ]               (5.11)

Finally, we can plug in the definition of ŷ = σ(w·x+b):

L_CE(ŷ, y) = −[ y log σ(w·x+b) + (1−y) log(1−σ(w·x+b)) ]              (5.12)

Let's see if this loss function does the right thing for our example from Fig. 5.2: we want the loss to be smaller if the model's estimate is close to correct, and bigger if the model is confused.
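A minimal sketch (assuming NumPy) of the cross-entropy loss of Eq. 5.11, showing that it is small when the estimate is close to the true label and large when the model is confused:

import numpy as np

def cross_entropy(y_hat, y):
    # L_CE(y_hat, y) = -[y log y_hat + (1-y) log(1-y_hat)]
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(cross_entropy(0.9, 1), cross_entropy(0.1, 1))   # ~0.105 vs ~2.303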
Reminder: gradient descent for weight updates

Use the derivative of the loss function with respect to the weights, (∂/∂w) L(f(x; w), y), to tell us how to adjust the weights for each training item:
◦ Move them in the opposite direction of the gradient:

w^(t+1) = w^t − η (d/dw) L(f(x; w), y)

◦ For logistic regression, the cross-entropy loss is:

L_CE(ŷ, y) = −[ y log σ(w·x+b) + (1−y) log(1−σ(w·x+b)) ]        (5.17)

It turns out that the derivative of this loss for one observation vector x is (the interested reader can see Section 5.8 for the derivation):

∂L_CE(ŷ, y) / ∂w_j = [σ(w·x+b) − y] x_j                          (5.18)
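A minimal sketch (assuming NumPy, with the sentiment feature vector from earlier as toy data) of one stochastic gradient descent step for logistic regression using Eq. 5.18:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([3, 2, 1, 3, 0, np.log(66)])   # feature vector from the sentiment example
y = 1                                        # gold label
w = np.zeros(6)
b = 0.0
eta = 0.1                                    # learning rate

y_hat = sigmoid(w @ x + b)
grad_w = (y_hat - y) * x                     # Eq. 5.18, for every weight w_j
grad_b = (y_hat - y)                         # bias gradient (its "input" is 1)
w = w - eta * grad_w                         # move opposite the gradient
b = b - eta * grad_b
print(w, b)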
Where did that derivative come from?

Using the chain rule! For a composite function f(x) = u(v(x)), the derivative of f(x) is the derivative of u(x) with respect to v(x) times the derivative of v(x) with respect to x:

df/dx = du/dv · dv/dx

The chain rule extends to more than two functions: for f(x) = u(v(w(x))),

df/dx = du/dv · dv/dw · dw/dx

Intuition (see the text for details): for a single unit (inputs x1 x2 x3 +1, weights w1 w2 w3 and bias b, weighted sum ∑ giving z, activation σ giving a, then the loss L):

∂L/∂w = ∂L/∂a · ∂a/∂z · ∂z/∂w
      = (derivative of the loss) · (derivative of the activation σ) · (derivative of the weighted sum)
How can I find that gradient for every weight in
the network?
These derivatives on the prior slide only give the
updates for one weight layer: the last one!
What about deeper networks?
• Lots of layers, different activation functions?
Solution in the next lecture:
• Even more use of the chain rule!!
• Computation graphs and backward differentiation!
Simple Neural Networks and Neural Language Models
Training Neural Nets: Overview

Simple Neural Networks and Neural Language Models
Computation Graphs and Backward Differentiation
Why Computation Graphs
For training, we need the derivative of the loss with
respect to each weight in every layer of the network
• But the loss is computed only at the very end of the
network!
Solution: error backpropagation (Rumelhart, Hinton, Williams, 1986)
• Backprop is a special case of backward differentiation
• Which relies on computation graphs.

Computation Graphs

A computation graph represents the process of computing a mathematical expression.
Example: L(a, b, c) = c(a + 2b)

Computations:
d = 2b
e = a + d
L = c·e

Forward pass with input values a = 3, b = 1, c = −2:  d = 2, e = 5, L = −10.
Backwards differentiation in computation graphs

The importance of the computation graph comes from the backward pass.
This is used to compute the derivatives that we'll need for the weight update.
Example

Figure 7.10: Computation graph for the function L(a, b, c) = c(a + 2b), with values for input nodes a = 3, b = 1, c = −2, showing the forward pass computation of L.

We want the derivative of the output function L with respect to each of the input variables, i.e., ∂L/∂a, ∂L/∂b, and ∂L/∂c. The derivative ∂L/∂a tells us how much a small change in a affects L.

Backwards differentiation makes use of the chain rule in calculus.

The chain rule

Computing the derivative of a composite function:

f(x) = u(v(x))        df/dx = du/dv · dv/dx

The chain rule extends to more than two functions. If computing the derivative of a composite function f(x) = u(v(w(x))), the derivative of f(x) is:

f(x) = u(v(w(x)))     df/dx = du/dv · dv/dw · dw/dx

Example

Let's now compute the 3 derivatives we need. Since in the computation graph L = ce, we can directly compute the derivative ∂L/∂c:

∂L/∂c = e

For the other two, we'll need to use the chain rule:

∂L/∂a = (∂L/∂e)(∂e/∂a)
∂L/∂b = (∂L/∂e)(∂e/∂d)(∂d/∂b)        (7.26)
Example

∂L/∂a = (∂L/∂e)(∂e/∂a)
∂L/∂b = (∂L/∂e)(∂e/∂d)(∂d/∂b)

Eq. 7.26 thus requires five intermediate derivatives: ∂L/∂e, ∂L/∂c, ∂e/∂a, ∂e/∂d, and ∂d/∂b, which are as follows (making use of the fact that the derivative of a sum is the sum of the derivatives):

L = ce:      ∂L/∂e = c,    ∂L/∂c = e
e = a + d:   ∂e/∂a = 1,    ∂e/∂d = 1
d = 2b:      ∂d/∂b = 2
Example

Forward pass values: a = 3, b = 1, c = −2, so d = 2b = 2, e = d + a = 5, L = ce = −10.

In the backward pass, we compute each of these partials along each edge of the graph from right to left, multiplying the necessary partials to result in the final derivative we need. Thus we begin by annotating the final node with ∂L/∂L = 1. Moving to the left, we then compute ∂L/∂c and ∂L/∂e, and so on, until we have annotated the graph all the way to the input variables. The forward pass conveniently already will have computed the values of the forward intermediate variables we need (like d and e).
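A minimal sketch (plain Python) of the forward and backward passes for L(a, b, c) = c(a + 2b) with a = 3, b = 1, c = −2:

# Forward pass
a, b, c = 3.0, 1.0, -2.0
d = 2 * b          # d = 2
e = a + d          # e = 5
L = c * e          # L = -10

# Backward pass: multiply partials from right to left (chain rule)
dL_dL = 1.0
dL_dc = e                    # L = ce  ->  dL/dc = e = 5
dL_de = c                    # L = ce  ->  dL/de = c = -2
dL_da = dL_de * 1            # e = a + d  ->  de/da = 1, so dL/da = -2
dL_dd = dL_de * 1            # de/dd = 1
dL_db = dL_dd * 2            # d = 2b  ->  dd/db = 2, so dL/db = -4

print(dL_da, dL_db, dL_dc)   # -2.0 -4.0 5.0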
Backward differentiation on a two-layer network

Fig. 7.12 shows a sample computation graph for a 2-layer neural network with n0 = 2, n1 = 2, and n2 = 1, assuming binary classification and hence using a sigmoid output unit for simplicity. The function that the computation graph is computing is:

z[1] = W[1] x + b[1]
a[1] = ReLU(z[1])
z[2] = W[2] a[1] + b[2]
a[2] = σ(z[2])
ŷ = a[2]                                        (7.27)
Backward differentiation on a 2-layer network

The weights that need updating (those for which we need to know the partial derivative of the loss function) are shown in orange. In order to do the backward pass, we'll need to know the derivatives of all the functions in the graph. We already saw in Section ?? the derivative of the sigmoid σ:

dσ(z)/dz = σ(z)(1 − σ(z))                        (7.28)

We'll also need the derivatives of each of the other activation functions. The derivative of tanh is:

d tanh(z)/dz = 1 − tanh²(z)                      (7.29)

The derivative of the ReLU is:

d ReLU(z)/dz = 0 for z < 0,  1 for z ≥ 0         (7.30)
Starting off the backward pass

(I'll write a for a[2] and z for z[2])

L(ŷ, y) = −[ y log ŷ + (1−y) log(1−ŷ) ]
L(a, y) = −[ y log a + (1−y) log(1−a) ]

∂L/∂z = (∂L/∂a)(∂a/∂z)

∂L/∂a = −[ y (∂ log a / ∂a) + (1−y) (∂ log(1−a) / ∂a) ]
      = −[ y (1/a) + (1−y) (1/(1−a)) (−1) ]
      = −y/a + (1−y)/(1−a)

∂L/∂z = (∂L/∂a)(∂a/∂z) = [ −y/a + (1−y)/(1−a) ] · a(1−a) = a − y
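A minimal sketch (assuming NumPy, with made-up untrained weights) of the backward pass through this 2-layer network, starting from ∂L/∂z[2] = a[2] − y as derived above and using the ReLU derivative of Eq. 7.30:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(4)
x = np.array([0.5, 1.0])                          # n0 = 2 inputs
y = 1.0                                           # gold label
W1 = rng.normal(size=(2, 2)); b1 = np.zeros(2)    # hidden layer (n1 = 2)
W2 = rng.normal(size=(1, 2)); b2 = np.zeros(1)    # output layer (n2 = 1)

# Forward pass (Eq. 7.27)
z1 = W1 @ x + b1
a1 = np.maximum(z1, 0)                   # ReLU
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)                         # y_hat

# Backward pass
dL_dz2 = a2 - y                          # derived above: a - y
dL_dW2 = np.outer(dL_dz2, a1)            # since dz2/dW2 = a1
dL_db2 = dL_dz2
dL_da1 = W2.T @ dL_dz2
dL_dz1 = dL_da1 * (z1 > 0)               # ReLU derivative (Eq. 7.30)
dL_dW1 = np.outer(dL_dz1, x)
dL_db1 = dL_dz1
print(dL_dW2, dL_dW1)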
Summary
For training, we need the derivative of the loss with respect to
weights in early layers of the network
• But loss is computed only at the very end of the network!
Solution: backward differentiation
Given a computation graph and the derivatives of all the
functions in it we can automatically compute the derivative of
the loss with respect to these early weights.

Simple Neural Networks and Neural Language Models
Computation Graphs and Backward Differentiation
