Natural Language Processing with Deep Learning
CS224N/Ling284

Christopher Manning
Lecture 16: Coreference Resolution
CuuDuongThanCong.com https://fb.com/tailieudientucntt
Announcements
• We plan to get HW5 grades back tomorrow before the add/drop
deadline

• Final project milestone is due this coming Tuesday

Lecture Plan:
Lecture 16: Coreference Resolution
1. What is Coreference Resolution? (15 mins)
2. Applications of coreference resolution (5 mins)
3. Mention Detection (5 mins)
4. Some Linguistics: Types of Reference (5 mins)
Four Kinds of Coreference Resolution Models
5. Rule-based (Hobbs Algorithm) (10 mins)
6. Mention-pair models (10 mins)
7. Mention ranking models (15 mins)
• Including the current state-of-the-art coreference system!
8. Mention clustering model (5 mins – only partial coverage)
9. Evaluation and current results (10 mins)
1. What is Coreference Resolution?
• Identify all mentions that refer to the same real world entity

Barack Obama nominated Hillary Rodham Clinton as his


secretary of state on Monday. He chose her because she
had foreign affairs experience as a former First Lady.

A couple of years later, Vanaja met Akhila at the local park.
Akhila’s son Prajwal was just two months younger than her son
Akash, and they went to the same school. For the pre-school
play, Prajwal was chosen for the lead role of the naughty child
Lord Krishna. Akash was to be a tree. She resigned herself to
make Akash the best tree that anybody had ever seen. She
bought him a brown T-shirt and brown trousers to represent the
tree trunk. Then she made a large cardboard cutout of a tree’s
foliage, with a circular opening in the middle for Akash’s face.
She attached red balls to it to represent fruits. It truly was the
nicest tree. From The Star by Shruthi Rao, with some shortening.
Applications
• Full text understanding
• information extraction, question answering, summarization, …
• “He was born in 1961” (Who?)

Applications
• Full text understanding
• Machine translation
• languages have different features for gender, number,
dropped pronouns, etc.

Applications
• Full text understanding
• Machine translation
• Dialogue Systems
“Book tickets to see James Bond”
“Spectre is playing near you at 2:00 and 3:00 today. How many
tickets would you like?”
“Two tickets for the showing at three”

Coreference Resolution in Two Steps

1. Detect the mentions (easy)
   “[I] voted for [Nader] because [he] was most aligned with [[my] values],” [she] said
   • mentions can be nested!

2. Cluster the mentions (hard)
   “[I] voted for [Nader] because [he] was most aligned with [[my] values],” [she] said
3. Mention Detection
• Mention: span of text referring to some entity
• Three kinds of mentions:

1. Pronouns
• I, your, it, she, him, etc.

2. Named entities
• People, places, etc.

3. Noun phrases
• “a dog,” “the big fluffy cat stuck in the tree”

Mention Detection
• Span of text referring to some entity
• For detection: use other NLP systems

1. Pronouns
• Use a part-of-speech tagger

2. Named entities
• Use a NER system (like hw3)

3. Noun phrases
• Use a parser (especially a constituency parser – next week!)

Mention Detection: Not so Simple
• Marking all pronouns, named entities, and NPs as mentions
over-generates mentions
• Are these mentions?
• It is sunny
• Every student
• No student
• The best donut in the world
• 100 miles

How to deal with these bad mentions?
• Could train a classifier to filter out spurious mentions

• Much more common: keep all mentions as “candidate mentions”
• After your coreference system is done running, discard all singleton mentions (i.e., ones that have not been marked as coreferent with anything else)

Can we avoid a pipelined system?
• We could instead train a classifier specifically for mention detection instead of using a POS tagger, NER system, and parser.
• Or even jointly do mention detection and coreference resolution end-to-end instead of in two steps
• Will cover later in this lecture!

4. On to Coreference! First, some linguistics
• Coreference is when two mentions refer to the same entity in
the world
• Barack Obama traveled to … Obama

• A related linguistic concept is anaphora: when a term (anaphor) refers to another term (antecedent)
• the interpretation of the anaphor is in some way determined by the interpretation of the antecedent
• Barack Obama said he would sign the bill.
  (antecedent: Barack Obama; anaphor: he)

Anaphora vs Coreference
• Coreference with named entities: in the text, “Barack Obama” and “Obama” both refer to the same entity in the world
• Anaphora: in the text, the anaphor “he” refers back to its antecedent “Barack Obama”
Not all anaphoric relations are coreferential
• Not all noun phrases have reference

• Every dancer twisted her knee.
• No dancer twisted her knee.

• There are three NPs in each of these sentences; because the first one is non-referential, the other two aren’t either.
Anaphora vs. Coreference
• Not all anaphoric relations are coreferential

  We went to see a concert last night. The tickets were really expensive.

• This is referred to as bridging anaphora.

[Diagram: “coreference” and “anaphora” shown as overlapping categories; “Barack Obama … Obama” is coreference, pronominal anaphora lies in the overlap, and bridging anaphora is anaphora without coreference]
Anaphora vs. Cataphora

• Usually the antecedent comes before the anaphor (e.g., a pronoun), but not always
Cataphora

“From the corner of the divan of Persian saddle-bags on which he was lying, smoking, as was his custom, innumerable cigarettes, Lord Henry Wotton could just catch the gleam of the honey-sweet and honey-coloured blossoms of a laburnum…”
(Oscar Wilde – The Picture of Dorian Gray)
Four Kinds of Coreference Models

• Rule-based (pronominal anaphora resolution)
• Mention Pair
• Mention Ranking
• Clustering
5. Traditional pronominal anaphora resolution:
Hobbs’ naive algorithm
1. Begin at the NP immediately dominating the pronoun
2. Go up tree to first NP or S. Call this X, and the path p.
3. Traverse all branches below X to the left of p, left-to-right,
breadth-first. Propose as antecedent any NP that has a NP or S
between it and X
4. If X is the highest S in the sentence, traverse the parse trees of
the previous sentences in the order of recency. Traverse each
tree left-to-right, breadth first. When an NP is encountered,
propose as antecedent. If X not the highest node, go to step 5.

Hobbs’ naive algorithm (1976)
5. From node X, go up the tree to the first NP or S. Call it X, and
the path p.
6. If X is an NP and the path p to X came from a non-head phrase
of X (a specifier or adjunct, such as a possessive, PP, apposition, or
relative clause), propose X as antecedent
(The original said “did not pass through the N’ that X immediately
dominates”, but the Penn Treebank grammar lacks N’ nodes….)
7. Traverse all branches below X to the left of the path, in a left-
to-right, breadth first manner. Propose any NP encountered as
the antecedent
8. If X is an S node, traverse all branches of X to the right of the
path but do not go below any NP or S encountered. Propose
any NP as the antecedent.
9. Go to step 4
Until deep learning, Hobbs’ algorithm was still often used as a feature in ML systems!
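The left-of-path search in steps 2–3 (and 7) can be sketched in code. This is an illustrative simplification, not the lecture’s implementation: the `Tree` class and the example sentence are made up, and the step-3 filter requiring an intervening NP or S between the candidate and X is omitted for brevity.

```python
# A minimal sketch of Hobbs' left-of-path search (steps 1-3),
# using a toy constituency tree. Illustrative only: the Tree class
# is invented, and step 3's "NP or S between it and X" filter is omitted.
from collections import deque

class Tree:
    def __init__(self, label, children=(), word=None):
        self.label, self.children, self.word = label, list(children), word

def leftof_path_bfs(x, path_child):
    """Breadth-first, left-to-right search of branches below X that lie
    to the left of the path p; propose NPs as candidate antecedents."""
    candidates = []
    queue = deque(x.children[: x.children.index(path_child)])
    while queue:
        node = queue.popleft()
        if node.label == "NP":
            candidates.append(node)
        queue.extend(node.children)
    return candidates

# "John said he ...": S -> NP(John) VP(said S'(NP(he) ...))
john = Tree("NP", word="John")
he = Tree("NP", word="he")
vp = Tree("VP", [Tree("S", [he])])
s = Tree("S", [john, vp])

# Steps 1-2: from the pronoun's NP, go up to the first S (= s);
# the path p came up through vp, so we search left of vp.
print([n.word for n in leftof_path_bfs(s, vp)])  # ['John']
```

The breadth-first order matters: it encodes Hobbs’ preference for structurally higher (and hence more salient) NPs over deeply embedded ones.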
Hobbs Algorithm Example

Knowledge-based Pronominal Coreference
• She poured water from the pitcher into the cup until it was full.
• She poured water from the pitcher into the cup until it was empty.
• The city council refused the women a permit because they feared violence.
• The city council refused the women a permit because they advocated violence.
• Winograd (1972)
• These are called Winograd Schema
• Recently proposed as an alternative to the Turing test
• See: Hector J. Levesque “On our best behaviour” IJCAI 2013
  http://www.cs.toronto.edu/~hector/Papers/ijcai-13-paper.pdf
• http://commonsensereasoning.org/winograd.html
• If you’ve fully solved coreference, arguably you’ve solved AI
Hobbs’ algorithm: commentary

“… the naïve approach is quite good. Computationally speaking, it will be a long time before a semantically based algorithm is sophisticated enough to perform as well, and these results set a very high standard for any other approach to aim for.

“Yet there is every reason to pursue a semantically based approach. The naïve algorithm does not work. Any one can think of examples where it fails. In these cases it not only fails; it gives no indication that it has failed and offers no help in finding the real antecedent.”
— Hobbs (1978), Lingua, p. 345
6. Coreference Models: Mention Pair

“I voted for Nader because he was most aligned with my values,” she said.

I Nader he my she

Coreference Cluster 1
Coreference Cluster 2

Coreference Models: Mention Pair

• Train a binary classifier that assigns every pair of mentions a probability of being coreferent: p(m_i, m_j)
• e.g., for “she” look at all candidate antecedents (previously occurring mentions) and decide which are coreferent with it

“I voted for Nader because he was most aligned with my values,” she said.

I   Nader   he   my   she
(coreferent with she?)
Coreference Models: Mention Pair

• Train a binary classifier that assigns every pair of mentions a probability of being coreferent: p(m_i, m_j)
• e.g., for “she” look at all candidate antecedents (previously occurring mentions) and decide which are coreferent with it

“I voted for Nader because he was most aligned with my values,” she said.

I   Nader   he   my   she

Positive examples: want p(m_i, m_j) to be near 1
Coreference Models: Mention Pair

• Train a binary classifier that assigns every pair of mentions a probability of being coreferent: p(m_i, m_j)
• e.g., for “she” look at all candidate antecedents (previously occurring mentions) and decide which are coreferent with it

“I voted for Nader because he was most aligned with my values,” she said.

I   Nader   he   my   she

Negative examples: want p(m_i, m_j) to be near 0
Mention Pair Training
• N mentions in a document
• y_ij = 1 if mentions m_i and m_j are coreferent, −1 otherwise
• Just train with regular cross-entropy loss (looks a bit different because it is binary classification):

  J = −Σ_{i=2}^{N} Σ_{j=1}^{i−1} [ 1(y_ij = 1) log p(m_j, m_i) + 1(y_ij = −1) log(1 − p(m_j, m_i)) ]

• Iterate through mentions; iterate through candidate antecedents (previously occurring mentions); coreferent mention pairs should get high probability, others should get low probability
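The loss above can be sketched in a few lines. This is an illustrative numpy version, not the lecture’s code; the example probabilities are made up.

```python
# A minimal sketch of the mention-pair training loss: binary
# cross-entropy summed over all mention pairs (m_j, m_i) with j < i.
import numpy as np

def mention_pair_loss(probs, labels):
    """probs[i][j]: model probability that m_i and m_j corefer (j < i);
    labels[i][j]: 1 if coreferent, -1 otherwise."""
    loss = 0.0
    for i in range(1, len(labels)):           # iterate through mentions
        for j in range(i):                    # iterate through antecedents
            p = probs[i][j]
            # push p toward 1 for coreferent pairs, toward 0 otherwise
            loss -= np.log(p) if labels[i][j] == 1 else np.log(1.0 - p)
    return loss

# mentions: I, Nader, he -> only (Nader, he) is coreferent
probs = [[], [0.1], [0.2, 0.8]]    # hypothetical model outputs
labels = [[], [-1], [-1, 1]]
print(mention_pair_loss(probs, labels))
```

Note the quadratic number of pairs: every mention is scored against every earlier mention, which is exactly the weakness the mention-ranking slides address next.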
Mention Pair Test Time
• Coreference resolution is a clustering task, but we are only
scoring pairs of mentions… what to do?

I Nader he my she

Mention Pair Test Time
• Coreference resolution is a clustering task, but we are only scoring pairs of mentions… what to do?
• Pick some threshold (e.g., 0.5) and add coreference links between mention pairs where p(m_i, m_j) is above the threshold

“I voted for Nader because he was most aligned with my values,” she said.

I   Nader   he   my   she
Mention Pair Test Time
• Coreference resolution is a clustering task, but we are only scoring pairs of mentions… what to do?
• Pick some threshold (e.g., 0.5) and add coreference links between mention pairs where p(m_i, m_j) is above the threshold
• Take the transitive closure to get the clustering

“I voted for Nader because he was most aligned with my values,” she said.

I   Nader   he   my   she

Even though the model did not predict this coreference link, I and my are coreferent due to transitivity
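Thresholding plus transitive closure is just connected components, which union-find computes cheaply. A sketch with hypothetical pairwise probabilities (the mention names follow the running example; the numbers are invented):

```python
# A sketch of clustering by transitive closure: union-find over
# mention pairs whose coreference probability exceeds a threshold.
def cluster_mentions(mentions, pair_prob, threshold=0.5):
    parent = {m: m for m in mentions}
    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]   # path compression
            m = parent[m]
        return m
    for i, mi in enumerate(mentions):
        for mj in mentions[:i]:
            if pair_prob[(mj, mi)] > threshold:
                parent[find(mi)] = find(mj)  # merge the two clusters
    clusters = {}
    for m in mentions:
        clusters.setdefault(find(m), []).append(m)
    return list(clusters.values())

mentions = ["I", "Nader", "he", "my", "she"]
# hypothetical pairwise probabilities; only three links exceed 0.5
probs = {(a, b): 0.1 for i, b in enumerate(mentions) for a in mentions[:i]}
probs[("Nader", "he")] = 0.9
probs[("I", "my")] = 0.8
probs[("my", "she")] = 0.7   # I, my, she then merge transitively
print(cluster_mentions(mentions, probs))
# → [['I', 'my', 'she'], ['Nader', 'he']]
```

Note that transitivity is what makes a single spurious link so dangerous: one bad pair above threshold merges two entire clusters, as the next slide illustrates.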
Mention Pair Test Time
• Coreference resolution is a clustering task, but we are only scoring pairs of mentions… what to do?
• Pick some threshold (e.g., 0.5) and add coreference links between mention pairs where p(m_i, m_j) is above the threshold
• Take the transitive closure to get the clustering

“I voted for Nader because he was most aligned with my values,” she said.

I   Nader   he   my   she

Adding one extra (incorrect) link would merge everything into one big coreference cluster!
Mention Pair Models: Disadvantage
• Suppose we have a long document with the following mentions
• Ralph Nader … he … his … him … <several paragraphs> … voted for Nader because he …

[Diagram: short-range links among he, his, him, Nader, he are relatively easy; the long-distance link back to Ralph Nader is almost impossible]
Mention Pair Models: Disadvantage
• Suppose we have a long document with the following mentions
• Ralph Nader … he … his … him … <several paragraphs> … voted for Nader because he …

• Many mentions only have one clear antecedent
• But we are asking the model to predict all of them
• Solution: instead train the model to predict only one antecedent for each mention
• More linguistically plausible
7. Coreference Models: Mention Ranking

• Assign each mention its highest scoring candidate antecedent according to the model
• Dummy NA mention allows model to decline linking the current mention to anything (“singleton” or “first” mention)

NA   I   Nader   he   my   she
(best antecedent for she?)
Coreference Models: Mention Ranking

• Assign each mention its highest scoring candidate antecedent according to the model
• Dummy NA mention allows model to decline linking the current mention to anything (“singleton” or “first” mention)

NA   I   Nader   he   my   she

Positive examples: model has to assign a high probability to either one (but not necessarily both)
Coreference Models: Mention Ranking

• Assign each mention its highest scoring candidate antecedent according to the model
• Dummy NA mention allows model to decline linking the current mention to anything (“singleton” or “first” mention)

NA   I   Nader   he   my   she
(best antecedent for she?)

p(NA, she) = 0.1
p(I, she) = 0.5
p(Nader, she) = 0.1
p(he, she) = 0.1
p(my, she) = 0.2

Apply a softmax over the scores for candidate antecedents so probabilities sum to 1
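The softmax-then-argmax step on this slide can be sketched directly. A small numpy illustration (the raw scores are invented; only their relative order matters):

```python
# A sketch of mention ranking at test time: softmax over antecedent
# scores for the current mention, then link only the argmax.
import numpy as np

def rank_antecedents(scores):
    """scores: dict candidate -> raw score s(candidate, mention).
    Returns (softmax probabilities, best antecedent)."""
    cands = list(scores)
    s = np.array([scores[c] for c in cands])
    e = np.exp(s - s.max())            # numerically stable softmax
    probs = dict(zip(cands, e / e.sum()))
    return probs, max(probs, key=probs.get)

# hypothetical raw scores for "she"; NA = decline to link
probs, best = rank_antecedents(
    {"NA": 0.0, "I": 1.6, "Nader": 0.0, "he": 0.0, "my": 0.7})
print(best)  # I
```

If NA wins the argmax, the mention starts its own (possibly singleton) cluster, which is exactly the role of the dummy antecedent on the slide.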
Coreference Models: Mention Ranking

• Assign each mention its highest scoring candidate antecedent according to the model
• Dummy NA mention allows model to decline linking the current mention to anything (“singleton” or “first” mention)

NA   I   Nader   he   my   she

p(NA, she) = 0.1
p(I, she) = 0.5
p(Nader, she) = 0.1
p(he, she) = 0.1
p(my, she) = 0.2

Only add the highest scoring coreference link. Apply a softmax over the scores for candidate antecedents so probabilities sum to 1.
Coreference Models: Training
• We want the current mention m_i to be linked to any one of the candidate antecedents it’s coreferent with.
• Mathematically, we want to maximize this probability:

  Σ_{j=1}^{i−1} 1(y_ij = 1) p(m_j, m_i)

  (iterate through candidate antecedents, i.e., previously occurring mentions; for the ones that are coreferent, we want the model to assign a high probability)

• The model could produce 0.9 probability for one of the correct antecedents and low probability for everything else, and the sum will still be large
• Turning this into a loss function (iterate over all the mentions in the document; usual trick of taking the negative log to go from likelihood to loss):

  J = −Σ_{i=2}^{N} log ( Σ_{j=1}^{i−1} 1(y_ij = 1) p(m_j, m_i) )
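The marginalized loss above can be sketched in code. An illustrative numpy version (the probability tables are invented):

```python
# A sketch of the mention-ranking training loss: for each mention,
# sum the probability mass assigned to ANY of its gold antecedents,
# then take the negative log of that marginal.
import numpy as np

def ranking_loss(probs, gold):
    """probs[i]: probabilities over candidate antecedents j = 0..i-1
    (remaining mass sits on the dummy NA antecedent);
    gold[i]: set of gold antecedent indices for mention i."""
    J = 0.0
    for i in range(1, len(probs)):
        marginal = sum(probs[i][j] for j in gold[i])
        J -= np.log(marginal)   # one good antecedent suffices
    return J

# three mentions; mention 1's gold antecedent is 0, mention 2's is 1
probs = [None, [0.9], [0.1, 0.8]]
gold = [None, {0}, {1}]
print(ranking_loss(probs, gold))
```

Because the log is taken of the sum, the model is free to put all its mass on a single correct antecedent, matching the bullet about 0.9 on one antecedent still yielding a small loss.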
Mention Ranking Models: Test Time

• Pretty much the same as mention-pair model except each mention is assigned only one antecedent

NA   I   Nader   he   my   she
How do we compute the probabilities?

A. Non-neural statistical classifier

B. Simple neural network

C. More advanced model using LSTMs, attention

A. Non-Neural Coref Model: Features

• Person/Number/Gender agreement
  • Jack gave Mary a gift. She was excited.
• Semantic compatibility
  • … the mining conglomerate … the company …
• Certain syntactic constraints
  • John bought him a new car. [him cannot be John]
• More recently mentioned entities are preferred as referents
  • John went to a movie. Jack went as well. He was not busy.
• Grammatical role: prefer entities in the subject position
  • John went to a movie with Jack. He was not busy.
• Parallelism
  • John went with Jack to a movie. Joe went with him to a bar.
• …
B. Neural Coref Model
• Standard feed-forward neural network
• Input layer: word embeddings and a few categorical features

Score: s = W4 h3 + b4
Hidden layer: h3 = ReLU(W3 h2 + b3)
Hidden layer: h2 = ReLU(W2 h1 + b2)
Hidden layer: h1 = ReLU(W1 h0 + b1)
Input layer h0: [candidate antecedent embeddings; candidate antecedent features; mention embeddings; mention features; additional features]
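The three-hidden-layer scorer above can be sketched in numpy. All dimensions and the random inputs here are made up for illustration; the real model’s input is the concatenation of the embedding and feature vectors listed on the slide.

```python
# A sketch of the feed-forward mention-pair scorer: three ReLU
# hidden layers over the concatenated input h0, then a linear score.
import numpy as np

rng = np.random.default_rng(0)
def layer(n_in, n_out):
    return rng.normal(0, 0.1, (n_out, n_in)), np.zeros(n_out)

d = 20  # toy size of the concatenated input h0 (real models are larger)
(W1, b1), (W2, b2), (W3, b3) = layer(d, 16), layer(16, 16), layer(16, 16)
W4, b4 = layer(16, 1)

def score(h0):
    h1 = np.maximum(0, W1 @ h0 + b1)
    h2 = np.maximum(0, W2 @ h1 + b2)
    h3 = np.maximum(0, W3 @ h2 + b3)
    return (W4 @ h3 + b4)[0]   # scalar score s

# h0 = [antecedent emb; antecedent feats; mention emb; mention feats; extras]
h0 = rng.normal(size=d)
print(score(h0))
```

The scalar s for each (antecedent, mention) pair is what the softmax of the mention-ranking slides is applied over.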
Neural Coref Model: Inputs
• Embeddings
  • Previous two words, first word, last word, head word, … of each mention
  • The head word is the “most important” word in the mention – you can find it using a parser. e.g., The fluffy cat stuck in the tree

• Still need some other features:
  • Distance
  • Document genre
  • Speaker information
C. End-to-end Model
• Current state-of-the-art model for coreference resolution
(Kenton Lee et al. from UW, EMNLP 2017)
• Mention ranking model
• Improvements over simple feed-forward NN
• Use an LSTM
• Use attention
• Do mention detection and coreference end-to-end
• No mention detection step!
• Instead consider every span of text (up to a certain length) as a
candidate mention
• a span is just a contiguous sequence of words

End-to-end Model
• First embed the words in the document using a word embedding matrix and a character-level CNN

[Figure 1 from Lee et al. 2017: word & character embeddings → bidirectional LSTM → span head (attention) → span representation g → mention score s_m, computed for candidate spans such as “General Electric”, “the Postal Service”, “the company”. Low-scoring spans are pruned so that only a manageable number of spans is considered for coreference decisions.]
End-to-end Model
• Then run a bidirectional LSTM over the document
End-to-end Model
• Next, represent each span of text i going from START(i) to END(i) as a vector
• Every span gets its own vector representation: General, General Electric, General Electric said, …, Electric, Electric said, …
• Span representation:

  g_i = [x*_{START(i)}, x*_{END(i)}, x̂_i, φ(i)]

• x*_{START(i)} and x*_{END(i)}: the BiLSTM hidden states for the span’s start and end
• x̂_i: an attention-based representation of the words in the span (e.g., x̂ for “the postal service”):

  x̂_i = Σ_{t=START(i)}^{END(i)} a_{i,t} · x_t

  where x̂_i is a weighted sum of the word vectors in span i; the weights a_{i,t} are automatically learned and correlate strongly with traditional definitions of head words
• φ(i): additional features
End-to-end Model

• The span head x̂_i is an attention-weighted average of the word
  embeddings in the span. Syntactic heads are typically included as
  features in previous systems (Durrett and Klein, 2013; Clark and
  Manning, 2016b,a); instead of relying on syntactic parses, this model
  learns a task-specific notion of headedness using an attention mechanism
  (Bahdanau et al., 2014) over the words in each span:

  • Attention scores (a dot product of a weight vector and a transformed
    hidden state):

        α_t = w_α · FFNN_α(x*_t)

  • Attention distribution (just a softmax over the attention scores for
    the span):

        a_{i,t} = exp(α_t) / Σ_{k=START(i)}^{END(i)} exp(α_k)

  • Final representation (an attention-weighted sum of the word
    embeddings):

        x̂_i = Σ_{t=START(i)}^{END(i)} a_{i,t} · x_t

  The weights a_{i,t} are learned automatically and correlate strongly
  with traditional definitions of head words.
End-to-end Model

• Why include all these different terms in the span representation?

      g_i = [ x*_START(i), x*_END(i), x̂_i, φ(i) ]

  • The boundary representations x*_START(i) and x*_END(i) represent the
    context to the left and right of the span
  • The soft head word vector x̂_i represents the span itself
  • The feature vector φ(i) represents other information not in the text,
    e.g. the size of the span
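As a concrete illustration, the span representation g_i can be sketched in a few lines of NumPy. This is a toy sketch, not the authors' implementation: the word embeddings x, BiLSTM states x*, and attention weight vector w_α are random stand-ins, and FFNN_α is reduced to a plain dot product.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 10, 8                      # document length, hidden size (toy values)
x = rng.normal(size=(T, d))       # word embeddings x_t (stand-in)
x_star = rng.normal(size=(T, d))  # BiLSTM states x*_t (stand-in)
w_alpha = rng.normal(size=d)      # attention weights (FFNN_alpha reduced to a dot product)

def span_representation(start, end, phi):
    """g_i = [x*_START(i), x*_END(i), x̂_i, φ(i)] for span [start, end] inclusive."""
    alpha = x_star[start:end + 1] @ w_alpha   # attention scores over the span's words
    a = np.exp(alpha - alpha.max())
    a /= a.sum()                              # softmax -> attention distribution a_{i,t}
    x_hat = a @ x[start:end + 1]              # attention-weighted head word vector
    return np.concatenate([x_star[start], x_star[end], x_hat, phi])

g = span_representation(2, 4, phi=np.array([3.0]))  # φ(i): e.g. a span-width feature
print(g.shape)  # (25,) = d + d + d + 1
```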
End-to-end Model

• Lastly, score every pair of spans to decide if they are coreferent.
  There are three factors in this pairwise coreference score: (1) whether
  span i is a mention, (2) whether span j is a mention, and (3) whether j
  is an antecedent of i:

      s(i, j) = 0                                  if j = ε
      s(i, j) = s_m(i) + s_m(j) + s_a(i, j)        if j ≠ ε

  "Are spans i and j coreferent mentions?" =
  "Is i a mention?" + "Is j a mention?" + "Do they look coreferent?"

  Here s_m(i) is a unary score for span i being a mention, and s_a(i, j)
  is a pairwise score for span j being an antecedent of span i.
End-to-end Model

• The scoring functions take the span representations as input and are
  computed via standard feed-forward neural networks:

      s_m(i) = w_m · FFNN_m(g_i)
      s_a(i, j) = w_a · FFNN_a([ g_i, g_j, g_i ∘ g_j, φ(i, j) ])

  where · denotes the dot product and ∘ denotes element-wise
  multiplication.
  • The antecedent score includes multiplicative interactions between the
    representations (g_i ∘ g_j) and, again, some extra features φ(i, j)
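A minimal sketch of these scoring functions, with tiny one-hidden-layer networks standing in for the FFNNs; the dimensions and weights here are arbitrary stand-ins, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
d_g = 16                                   # size of a span representation g_i (illustrative)

def make_ffnn(dim_in, dim_hidden):
    """A tiny scorer of the form w · FFNN(v), with one ReLU hidden layer."""
    W1 = 0.1 * rng.normal(size=(dim_hidden, dim_in))
    w2 = 0.1 * rng.normal(size=dim_hidden)
    return lambda v: float(w2 @ np.maximum(W1 @ v, 0.0))

s_m = make_ffnn(d_g, 32)                   # mention score s_m(i) = w_m · FFNN_m(g_i)
s_a = make_ffnn(3 * d_g + 1, 32)           # antecedent score s_a(i, j)

def pair_score(g_i, g_j, phi_ij):
    """s(i, j) = s_m(i) + s_m(j) + s_a(i, j); the dummy antecedent scores 0."""
    if g_j is None:                        # j = epsilon (no antecedent)
        return 0.0
    # g_i * g_j adds the multiplicative interactions between representations
    feats = np.concatenate([g_i, g_j, g_i * g_j, phi_ij])
    return s_m(g_i) + s_m(g_j) + s_a(feats)

g_i, g_j = rng.normal(size=d_g), rng.normal(size=d_g)
score = pair_score(g_i, g_j, np.array([1.0]))   # φ(i, j): e.g. a distance feature
```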
End-to-end Model
• Intractable to score every pair of spans
  • O(T^2) spans of text in a document (T is the number of words)
  • O(T^4) runtime!
  • So we have to do lots of pruning to make this work (only consider the
    few spans that are likely to be mentions)

• Attention learns which words are important in a mention (a bit like
  head words)
1. (A fire in a Bangladeshi garment factory) has left at least 37 people
   dead and 100 hospitalized. Most of the deceased were killed in the
   crush as workers tried to flee (the blaze) in the four-story building.
   A fire in (a Bangladeshi garment factory) has left at least 37 people
   dead and 100 hospitalized. Most of the deceased were killed in the
   crush as workers tried to flee the blaze in (the four-story building).
2. We are looking for (a region of central Italy bordering the Adriatic
   Sea). (The area) is mostly mountainous and includes Mt. Corno, the
   highest peak of the Apennines. (It) also includes a lot of sheep, good
   clean-living, healthy sheep, and an Italian entrepreneur has an idea
   about how to make a little money of them.
3. (The flight attendants) have until 6:00 today to ratify labor
   concessions. (The pilots') union and ground
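The span pruning described above can be sketched as follows. The maximum span width and the spans-per-word ratio λ (the paper text uses λ = 0.4) are illustrative settings, and the mention scores are random stand-ins for s_m.

```python
import numpy as np

rng = np.random.default_rng(2)
T, max_width, lam = 50, 10, 0.4   # doc length; width cap and λ are illustrative

# Enumerate candidate spans only up to the maximum width,
# instead of all O(T^2) spans
spans = [(s, e) for s in range(T) for e in range(s, min(s + max_width, T))]
scores = rng.normal(size=len(spans))          # stand-in mention scores s_m

# Keep only the λT highest-scoring spans, so the pairwise
# coreference scoring runs over a manageable set
k = int(lam * T)
keep = [spans[i] for i in np.argsort(scores)[-k:]]
print(len(spans), "candidate spans pruned to", len(keep))
```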
8. Last Coreference Approach: Clustering-Based

• Coreference is a clustering task, so let's use a clustering algorithm!
  • In particular, we will use agglomerative clustering

• Start with each mention in its own singleton cluster
• Merge a pair of clusters at each step
  • Use a model to score which cluster merges are good
Coreference Models: Clustering-Based

Coreference with Agglomerative Clustering

Google recently … the company announced Google Plus ... the product features ...

  Cluster 1: {Google}   Cluster 2: {the company}   Cluster 3: {Google Plus}   Cluster 4: {the product}

      s(c1, c2) = 5   ✔ merge

  Cluster 1: {Google, the company}   Cluster 2: {Google Plus}   Cluster 3: {the product}

      s(c2, c3) = 4   ✔ merge

  Cluster 1: {Google, the company}   Cluster 2: {Google Plus, the product}

      s(c1, c2) = -3   ✖ do not merge
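The merge loop on this slide can be sketched as a greedy agglomerative procedure. The scorer below is a toy oracle standing in for the learned cluster-pair model; it simply rewards merges that keep gold-coreferent mentions together.

```python
def agglomerative_coref(mentions, score_merge, threshold=0.0):
    """Greedy agglomerative clustering: repeatedly merge the best-scoring
    cluster pair until no merge scores above the threshold."""
    clusters = [[m] for m in mentions]          # start: singleton clusters
    while len(clusters) > 1:
        best_score, best_pair = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = score_merge(clusters[i], clusters[j])
                if best_score is None or s > best_score:
                    best_score, best_pair = s, (i, j)
        if best_score <= threshold:             # no merge looks good: stop
            break
        i, j = best_pair
        clusters[i] += clusters.pop(j)          # merge the pair
    return clusters

# Toy oracle scorer standing in for the learned cluster-pair model
gold = {"Google": 0, "the company": 0, "Google Plus": 1, "the product": 1}
def toy_score(c1, c2):
    return 1.0 if all(gold[a] == gold[b] for a in c1 for b in c2) else -1.0

print(agglomerative_coref(
    ["Google", "the company", "Google Plus", "the product"], toy_score))
# → [['Google', 'the company'], ['Google Plus', 'the product']]
```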
The Advantage of Clustering: Exploiting Entity-Level Information

• A mention-pair decision is difficult:

      Google  ?  Google Plus                  coreferent?

• A cluster-pair decision is easier:

      Cluster 1: {Google, the company}  ?  Cluster 2: {Google Plus, the product}          coreferent?
Clustering Model Architecture
From Clark & Manning, 2016

• Should we merge clusters c1 = {Google, the company} and
  c2 = {Google Plus, the product}?

  Mention Pairs → Mention-Pair Representations → Cluster-Pair
  Representation → Score s(MERGE[c1, c2])

  The mention pairs across the two clusters:
    (Google, Google Plus)
    (Google, the product)
    (the company, Google Plus)
    (the company, the product)
Clustering Model Architecture
Producing Cluster-Pair Representations

• First, produce a vector for each pair of mentions across the two
  clusters using the mention-pair encoder
  • e.g., the output of the hidden layer in the feedforward neural
    network model
Clustering Model Architecture
Producing Cluster-Pair Representations

• Then apply a pooling operation over the matrix of mention-pair
  representations R_m(c1, c2) to get a cluster-pair representation
  r_c(c1, c2) (using max and average pooling)
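A minimal sketch of this pooling step, assuming a 2×2 grid of mention-pair vectors; the representations and the scoring weight vector are random stand-ins for the learned model.

```python
import numpy as np

rng = np.random.default_rng(3)
# Mention-pair representations R_m(c1, c2): one row per cross-cluster
# mention pair (here 2 x 2 = 4 pairs), dimension 6 (both stand-ins)
R_m = rng.normal(size=(4, 6))

# Cluster-pair representation r_c(c1, c2):
# concatenated max pooling and average pooling over the pairs
r_c = np.concatenate([R_m.max(axis=0), R_m.mean(axis=0)])
print(r_c.shape)  # (12,)

# Score the candidate merge as a dot product with a weight vector (stand-in)
w = rng.normal(size=r_c.shape[0])
merge_score = float(w @ r_c)
```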
Clustering Model Architecture
Producing Cluster-Pair Representations

• Score the candidate cluster merge by taking the dot product of the
  cluster-pair representation with a weight vector
Clustering Model: Training

• The current candidate cluster merges depend on the previous merges the
  model has already made
  • So we can't use regular supervised learning
• Instead, use something like reinforcement learning to train the model
  • Reward for each merge: the change in a coreference evaluation metric
9. Coreference Evaluation
• Many different metrics: MUC, CEAF, LEA, B-CUBED, BLANC
• Often report the average over a few different metrics

  (Figure: mentions grouped into Gold Cluster 1 and Gold Cluster 2,
  overlaid with System Cluster 1 and System Cluster 2)
Coreference Evaluation
• An example: B-cubed
  • For each mention, compute a precision and a recall
  • Then average the individual Ps and Rs

  (Figure: Gold Cluster 1 and Gold Cluster 2 overlaid with System
  Clusters 1 and 2. Per-mention scores: P = 4/5, R = 4/6 for the four
  Gold-1 mentions in System 1; P = 1/5, R = 1/3 for the Gold-2 mention in
  System 1; P = 2/4, R = 2/6 for the two Gold-1 mentions in System 2;
  P = 2/4, R = 2/3 for the two Gold-2 mentions in System 2)

      P = [4(4/5) + 1(1/5) + 2(2/4) + 2(2/4)] / 9 = 0.6
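The B-cubed computation on this slide can be reproduced directly. The mention ids below are arbitrary labels chosen to match the slide's cluster sizes (gold clusters of sizes 6 and 3, split by the system into clusters of sizes 5 and 4).

```python
def b_cubed(gold_clusters, sys_clusters):
    """B-cubed: per-mention precision and recall, averaged over all mentions."""
    gold_of = {m: c for c in gold_clusters for m in c}
    sys_of = {m: c for c in sys_clusters for m in c}
    mentions = list(gold_of)
    p = r = 0.0
    for m in mentions:
        overlap = len(set(sys_of[m]) & set(gold_of[m]))
        p += overlap / len(sys_of[m])   # precision of m's system cluster
        r += overlap / len(gold_of[m])  # recall of m's gold cluster
    return p / len(mentions), r / len(mentions)

# Mention ids are arbitrary; the cluster sizes match the slide's example
gold = [[1, 2, 3, 4, 5, 6], [7, 8, 9]]
system = [[1, 2, 3, 4, 7], [5, 6, 8, 9]]
p, r = b_cubed(gold, system)
print(round(p, 2))  # 0.6, matching the slide
```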
Coreference Evaluation

  (Figure: two extreme systems: one clustering with 100% precision but
  33% recall, and one with 50% precision but 100% recall)
System Performance
• OntoNotes dataset: ~3,000 documents labeled by humans
  • English and Chinese data
• Report an F1 score averaged over 3 coreference metrics
System Performance

Model                                           English   Chinese
Lee et al. (2010)                               ~55       ~50      Rule-based system; used to be state-of-the-art!
Chen & Ng (2012) [CoNLL 2012 Chinese winner]    54.5      57.6     Non-neural machine learning model
Fernandes (2012) [CoNLL 2012 English winner]    60.7      51.6     Non-neural machine learning model
Wiseman et al. (2015)                           63.3      —        Neural mention ranker
Clark & Manning (2016)                          65.4      63.7     Neural clustering model
Lee et al. (2017)                               67.2      —        End-to-end neural mention ranker
Where do neural scoring models help?

• Especially with NPs and named entities that have no string matching.
  • Neural vs. non-neural scores: 18.9 F1 vs. 10.7 F1 on this mention
    type, compared to 68.7 vs. 66.1 F1 overall
  • These kinds of coreference are hard and the scores are still low!

Example wins:

  Anaphor                              Antecedent
  the country's leftist rebels         the guerillas
  the company                          the New York firm
  216 sailors from the ``USS Cole''    the crew
  the gun                              the rifle
Conclusion
• Coreference is a useful, challenging, and linguistically interesting
  task
  • There are many different kinds of coreference resolution systems
• Systems are getting better rapidly, largely due to better neural models
  • But overall, results are still not amazing
• Try out a coreference system yourself!
  • http://corenlp.run/ (ask for coref in Annotations)
  • https://huggingface.co/coref/
