Info 159/259
Lecture 4: Text classification 2 (Jan 27, 2022)
Binary logistic regression:

P(y = 1 \mid x, \beta) = \frac{\exp(x \cdot \beta)}{1 + \exp(x \cdot \beta)}

Multiclass logistic regression:

P(Y = y \mid X = x; \beta) = \frac{\exp(x \cdot \beta_y)}{\sum_{y' \in \mathcal{Y}} \exp(x \cdot \beta_{y'})}
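As a quick illustration, here is a minimal numpy sketch of both forms (the feature values and weights are invented for the example):

```python
import numpy as np

def sigmoid(z):
    # P(y = 1 | x, beta) for binary logistic regression
    return np.exp(z) / (1 + np.exp(z))

def softmax(scores):
    # P(Y = y | X = x; beta) for multiclass logistic regression
    exp_scores = np.exp(scores - scores.max())  # shift for numerical stability
    return exp_scores / exp_scores.sum()

x = np.array([1.0, 1.0, 0.0])          # feature vector
beta = np.array([-0.5, -1.7, 0.3])     # one weight per feature
print(sigmoid(x @ beta))               # binary: P(y = 1)

B = np.random.randn(3, 4)              # multiclass: one weight column per class
print(softmax(x @ B))                  # distribution over 4 classes
```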
• We can represent text through features that are not just the identities of individual words, but any feature that is scoped over the entirety of the input, e.g.:
  • the review begins with "I like"
  • at least 5 mentions of positive affectual verbs (like, love, etc.)
Logistic regression
• We want to find the value of β that leads to the highest value of the
conditional log likelihood:
\ell(\beta) = \sum_{i=1}^{N} \log P(y_i \mid x_i, \beta)
L2 regularization
\ell(\beta) = \sum_{i=1}^{N} \log P(y_i \mid x_i, \beta) - \eta \sum_{j=1}^{F} \beta_j^2
• We can do this by changing the function we're trying to optimize, adding a penalty for values of β that are large.
• This is equivalent to saying that each β element is drawn from a Normal distribution centered on 0.
• η controls how much of a penalty to pay for coefficients that are far from 0 (tune it on development data).
L1 regularization
\ell(\beta) = \sum_{i=1}^{N} \log P(y_i \mid x_i, \beta) - \eta \sum_{j=1}^{F} |\beta_j|
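A hedged sketch of both penalized objectives (the data, η, and function names are illustrative, not a reference implementation from the course):

```python
import numpy as np

def log_likelihood(X, y, beta):
    # conditional log likelihood for binary logistic regression
    p = 1 / (1 + np.exp(-X @ beta))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def l2_objective(X, y, beta, eta):
    # L2: penalize squared magnitude (a Gaussian prior centered on 0)
    return log_likelihood(X, y, beta) - eta * np.sum(beta ** 2)

def l1_objective(X, y, beta, eta):
    # L1: penalize absolute magnitude (encourages exact zeros / sparsity)
    return log_likelihood(X, y, beta) - eta * np.sum(np.abs(beta))
```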
https://www.forbes.com/sites/kevinmurnane/2016/04/01/what-is-deep-learning-and-how-is-it-useful
History of NLP
• Foundational insights, 1940s/1950s
[Plot: proportion by year, 2001-2019 (y axis from 0 to 0.7)]
• Syntactic parsing [Chen and Manning 2014, Dyer et al. 2015, Andor et al. 2016]
• Dialogue agents [Sordoni et al. 2015, Vinyals and Le 2015, Ji et al. 2016]
SGD
Logistic regression
P(\hat{y} = 1) = \frac{1}{1 + \exp\left(-\sum_{i=1}^{F} x_i \beta_i\right)}

[Diagram: inputs x1, x2, x3 with weights β1, β2, β3 feeding output y]

feature   x     β
not       1    -0.5
bad       1    -1.7
movie     0     0.3
Feedforward neural network
[Diagram: feedforward network with inputs x1, x2, x3, hidden units h1, h2, and output y]

*For simplicity, we're leaving out the bias term, but assume most layers have them as well.
[Diagram: the same network with input-to-hidden weights W (W1,1 ... W3,2) labeling each x→h edge and hidden-to-output weights V (V1, V2) labeling each h→y edge]
h_j = \sigma\left(\sum_{i=1}^{F} x_i W_{i,j}\right), \qquad \hat{y} = \sigma\left(\sum_{j} h_j V_j\right)
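A minimal numpy sketch of this forward pass (the shapes and values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# forward pass for the one-hidden-layer network above
# (3 inputs, 2 hidden units, 1 output)
x = np.array([1.0, 1.0, 0.0])   # input features
W = np.random.randn(3, 2)       # input-to-hidden weights
V = np.random.randn(2)          # hidden-to-output weights

h = sigmoid(x @ W)              # h_j = sigma(sum_i x_i W_ij)
y_hat = sigmoid(h @ V)          # y_hat = sigma(sum_j h_j V_j)
print(y_hat)
```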
Activation functions
\sigma(z) = \frac{1}{1 + \exp(-z)}

[Plot: sigmoid function, y from 0 to 1 over x ∈ [-10, 10]]
Logistic regression
P(\hat{y} = 1) = \frac{1}{1 + \exp\left(-\sum_{i=1}^{F} x_i \beta_i\right)} = \sigma\left(\sum_{i=1}^{F} x_i \beta_i\right)

[Diagram: inputs x1, x2, x3 with weights β1, β2, β3 feeding output y]

[Plot: an activation ranging from -1 to 1 over x ∈ [-10, 10] (tanh)]
Activation functions
ReLU(z) = max(0, z)
[Plot: ReLU function and its derivative over x ∈ [-10, 10]; Goldberg, p. 46]
• Sigmoid is useful for the final layer to scale output between 0 and 1, but it is not often used in intermediate layers.
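For reference, a small numpy sketch of the three activation functions discussed here:

```python
import numpy as np

# common activation functions (illustrative definitions)
def sigmoid(z):
    return 1 / (1 + np.exp(-z))   # squashes to (0, 1); useful for a final layer

def tanh(z):
    return np.tanh(z)             # squashes to (-1, 1)

def relu(z):
    return np.maximum(0, z)       # a common choice for intermediate layers

z = np.linspace(-10, 10, 5)
print(sigmoid(z), tanh(z), relu(z), sep="\n")
```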
[Diagram: network with input-to-hidden weights W and hidden-to-output weights V]

h_1 = \sigma\left(\sum_{i} x_i W_{i,1}\right), \qquad h_2 = \sigma\left(\sum_{i} x_i W_{i,2}\right)

\hat{y} = \sigma\left[V_1 h_1 + V_2 h_2\right]
[Diagram: the same network, with the hidden units expanded in terms of the inputs]

\hat{y} = \sigma\left[V_1\, \sigma\left(\sum_i x_i W_{i,1}\right) + V_2\, \sigma\left(\sum_i x_i W_{i,2}\right)\right]

We can express ŷ as a function only of the input x and the weights W and V.
Chain rule

\frac{\partial \log \sigma(hV)}{\partial V} = \frac{\partial \log \sigma(hV)}{\partial \sigma(hV)} \times \frac{\partial \sigma(hV)}{\partial V}

= \frac{1}{\sigma(hV)} \times \sigma(hV)\,(1 - \sigma(hV))\,h

= (1 - \sigma(hV))\,h = (y - \hat{y})\,h \quad \text{(for } y = 1\text{)}
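A sketch that checks the analytic gradient (y − ŷ)h against a numerical estimate (toy values, assuming y = 1 and a log-likelihood objective):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

h = np.array([0.2, 0.9])
V = np.array([0.5, -0.3])
y = 1.0

y_hat = sigmoid(h @ V)
analytic = (y - y_hat) * h            # d log P / d V from the chain rule

# central-difference numerical gradient for comparison
eps = 1e-6
numeric = np.zeros_like(V)
for j in range(len(V)):
    V_plus = V.copy(); V_plus[j] += eps
    V_minus = V.copy(); V_minus[j] -= eps
    numeric[j] = (np.log(sigmoid(h @ V_plus)) - np.log(sigmoid(h @ V_minus))) / (2 * eps)

print(analytic, numeric)              # the two should match closely
```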
Neural networks
• Articulate model structure and use the chain rule to derive parameter
updates.
Neural network structures
[Diagrams: variants of the network structure: a single output y; multiple outputs (y = 1, y = 0); outputs labeled happy/sad; and hand-engineered input features such as dobj_hits and nsubj_says]
• Dropout: when training on an ⟨x, y⟩ pair, randomly remove some nodes and weights (see the sketch after the diagram below).
[Diagram: the network with several nodes and their weights dropped out]
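A hedged sketch of one common variant (inverted dropout, which rescales at training time; the rate p and the values are illustrative):

```python
import numpy as np

def dropout(h, p=0.5, training=True):
    if not training:
        return h                       # no dropout at test time
    mask = (np.random.rand(*h.shape) > p)
    return h * mask / (1 - p)          # rescale so expected activation is unchanged

h = np.array([0.2, 0.9, 0.4, 0.7])
print(dropout(h))                      # some units zeroed at random
```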
Densely connected layer

[Diagram: inputs x1-x7 fully connected through weights W to hidden units h1, h2]

h = \sigma(xW)
Convolutional networks

• With convolutional networks, the same operation (i.e., the same set of parameters) is applied to different regions of the input.
2D Convolution
0 0 0 0 0
0 1 1 1 0
0 1 1 1 0
0 1 1 1 0
0 0 0 0 0
blurring (e.g., convolving an image with an averaging kernel)
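A minimal sketch of this 2D convolution, assuming a 3×3 averaging kernel (one plausible blurring kernel; the slide does not specify it):

```python
import numpy as np

image = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
], dtype=float)
kernel = np.full((3, 3), 1 / 9)        # the same parameters applied everywhere

# slide the kernel over every 3x3 region of the input
out = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)
print(out)                             # a blurred version of the input
```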
1D Convolution

[Example: sliding a three-value averaging window over a sequence; surviving outputs include 1⅓, 1, 2, 1⅔, 2]
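A corresponding 1D sketch, averaging each window of three adjacent values (the input sequence is invented):

```python
import numpy as np

seq = np.array([3.0, 1.0, 0.0, 2.0, 1.5, 2.0, 0.0])
window = 3

# the same averaging operation applied at every position
out = np.array([seq[i:i+window].mean() for i in range(len(seq) - window + 1)])
print(out)
```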
Convolutional networks

[Diagram: tokens x1 = I, x2 = hated, x3 = it, x4 = I, x5 = really; each hidden unit spans a window of three tokens with shared parameters W1, W2, W3, e.g., h2 = f(hated, it, I) and h3 = f(it, I, really)]
• Every token is a V-dimensional vector (the size of the vocab) with a single 1 (a one-hot indicator vector), e.g.:

word        indicator
a           0
aa          0
aal         0
aalii       0
aam         0
aardwolf    0
aba         0
[Diagram: one-hot inputs x1, x2, x3 combined with position-specific weights W1, W2, W3 to produce h1]

h_1 = f(x_1 W_1 + x_2 W_2 + x_3 W_3)
[Diagram: the same computation; since each x is one-hot, the product with each W reduces to selecting a single weight]

h_1 = f\left(W_{1, x_1^{id}} + W_{2, x_2^{id}} + W_{3, x_3^{id}}\right)

(where x_n^{id} specifies the location of the 1 in the vector, i.e., the vocabulary id)
[Diagram: the same convolution with dense, real-valued input vectors x1, x2, x3 in place of one-hot vectors]

For dense input vectors (e.g., embeddings), the full dot product is computed:

h_1 = f(x_1 \cdot W_1 + x_2 \cdot W_2 + x_3 \cdot W_3)
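A small sketch of the one-hot lookup trick versus the full dot product for dense inputs (all shapes and values are illustrative):

```python
import numpy as np

V_SIZE, N_FILTERS = 10, 4              # toy vocabulary size and filter count

W1 = np.random.randn(V_SIZE, N_FILTERS)  # weights for one window position
x1_onehot = np.zeros(V_SIZE)
x1_onehot[3] = 1.0                     # vocabulary id 3

# with one-hot inputs, the matrix product reduces to a row lookup:
assert np.allclose(x1_onehot @ W1, W1[3])

# with dense inputs (e.g., embeddings), the full dot product is needed:
x1_dense = np.random.randn(V_SIZE)
print(x1_dense @ W1)
```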
Pooling

• Down-samples a layer by selecting a single point from some set

[Diagram: a column of values (7, 3, 7, 1, 1, 9, 5, 5, 3, 7, 3) down-sampled to a shorter column]
Global pooling

[Diagram: four filters Wa, Wb, Wc, Wd applied over tokens x1-x7; the values a filter produces at different positions (e.g., 10, 2, -1) are pooled to a single value (10)]

h_1 = f(xW) \in \mathbb{R}^4
Convolutional networks

• With max pooling, we select a single number for each filter over all tokens (e.g., with 100 filters, the output of the max pooling stage is a 100-dimensional vector); see the sketch below.
• If we specify multiple filters, we can also scope each filter over different window sizes.
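A sketch of global max pooling over token positions (the shapes are illustrative):

```python
import numpy as np

# global max pooling over time: one number per filter, regardless of length
# (7 tokens, 100 filters)
conv_out = np.random.randn(7, 100)     # filter responses at each position
pooled = conv_out.max(axis=0)          # -> 100-dimensional vector
print(pooled.shape)                    # (100,)
```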
Zhang and Wallace 2016, "A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification"
CNN as important ngram detector

• Higher-order ngrams are much more informative than just unigrams (e.g., the phrase "i don't like this movie" carries more signal than the bag of unigrams ["I", "don't", "like", "this", "movie"]).
• We can think of a CNN as providing a mechanism for detecting important (sequential) ngrams without the burden of creating them as unique features.

Unique ngrams (1-4) in the Cornell movie review dataset:

ngram      unique types
unigrams   50,921
bigrams    451,220
trigrams   910,694
4-grams    1,074,921
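A sketch of how one might count unique ngram types, illustrating this feature explosion (the two toy documents stand in for the full dataset):

```python
from collections import Counter

# count unique ngrams of each order in a corpus of tokenized documents
docs = [["i", "don't", "like", "this", "movie"],
        ["i", "love", "this", "movie"]]

for n in range(1, 5):
    ngrams = Counter(
        tuple(doc[i:i+n]) for doc in docs for i in range(len(doc) - n + 1)
    )
    print(f"{n}-grams: {len(ngrams)} unique types")
```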