Natural Language Processing with Deep Learning
CS224N/Ling284

Christopher Manning
Lecture 16: Coreference Resolution
CuuDuongThanCong.com https://fb.com/tailieudientucntt
Announcements
• We plan to get HW5 grades back tomorrow before the add/drop
deadline

• Final project milestone is due this coming Tuesday

Lecture Plan:
Lecture 16: Coreference Resolution
1. What is Coreference Resolution? (15 mins)
2. Applications of coreference resolution (5 mins)
3. Mention Detection (5 mins)
4. Some Linguistics: Types of Reference (5 mins)
Four Kinds of Coreference Resolution Models
5. Rule-based (Hobbs Algorithm) (10 mins)
6. Mention-pair models (10 mins)
7. Mention ranking models (15 mins)
• Including the current state-of-the-art coreference system!
8. Mention clustering model (5 mins – only partial coverage)
9. Evaluation and current results (10 mins)
1. What is Coreference Resolution?
• Identify all mentions that refer to the same real world entity

Barack Obama nominated Hillary Rodham Clinton as his


secretary of state on Monday. He chose her because she
had foreign affairs experience as a former First Lady.

A couple of years later, Vanaja met Akhila at the local park.
Akhila’s son Prajwal was just two months younger than her son
Akash, and they went to the same school. For the pre-school
play, Prajwal was chosen for the lead role of the naughty child
Lord Krishna. Akash was to be a tree. She resigned herself to
make Akash the best tree that anybody had ever seen. She
bought him a brown T-shirt and brown trousers to represent the
tree trunk. Then she made a large cardboard cutout of a tree’s
foliage, with a circular opening in the middle for Akash’s face.
She attached red balls to it to represent fruits. It truly was the
nicest tree. From The Star by Shruthi Rao, with some shortening.
Applications
• Full text understanding
• information extraction, question answering, summarization, …
• “He was born in 1961” (Who?)

Applications
• Full text understanding
• Machine translation
• languages have different features for gender, number,
dropped pronouns, etc.

Applications
• Full text understanding
• Machine translation
• Dialogue Systems
“Book tickets to see James Bond”
“Spectre is playing near you at 2:00 and 3:00 today. How many
tickets would you like?”
“Two tickets for the showing at three”

Coreference Resolution in Two Steps

1. Detect the mentions (easy)
   “[I] voted for [Nader] because [he] was most aligned with [[my] values],” [she] said
   • mentions can be nested!

2. Cluster the mentions (hard)
   “[I] voted for [Nader] because [he] was most aligned with [[my] values],” [she] said
3. Mention Detection
• Mention: span of text referring to some entity
• Three kinds of mentions:

1. Pronouns
• I, your, it, she, him, etc.

2. Named entities
• People, places, etc.

3. Noun phrases
• “a dog,” “the big fluffy cat stuck in the tree”

Mention Detection
• Span of text referring to some entity
• For detection: use other NLP systems

1. Pronouns
• Use a part-of-speech tagger

2. Named entities
• Use a NER system (like hw3)

3. Noun phrases
• Use a parser (especially a constituency parser – next week!)

Mention Detection: Not so Simple
• Marking all pronouns, named entities, and NPs as mentions
over-generates mentions
• Are these mentions?
• It is sunny
• Every student
• No student
• The best donut in the world
• 100 miles

How to deal with these bad mentions?
• Could train a classifier to filter out spurious mentions

• Much more common: keep all mentions as “candidate mentions”
• After your coreference system is done running, discard all singleton mentions (i.e., ones that have not been marked as coreferent with anything else)

Can we avoid a pipelined system?
• We could instead train a classifier specifically for mention detection instead of using a POS tagger, NER system, and parser.
• Or even jointly do mention detection and coreference resolution end-to-end instead of in two steps
• Will cover later in this lecture!

4. On to Coreference! First, some linguistics
• Coreference is when two mentions refer to the same entity in
the world
• Barack Obama traveled to … Obama

• A related linguistic concept is anaphora: when a term (anaphor) refers to another term (antecedent)
• the interpretation of the anaphor is in some way determined by the interpretation of the antecedent
• Barack Obama said he would sign the bill.
  (antecedent: Barack Obama; anaphor: he)

Anaphora vs Coreference
• Coreference with named entities: in the text, “Barack Obama” and “Obama” both refer to the same entity in the world
• Anaphora: in the text, the anaphor “he” refers back to its antecedent “Barack Obama”
Not all anaphoric relations are coreferential
• Not all noun phrases have reference

• Every dancer twisted her knee.
• No dancer twisted her knee.

• There are three NPs in each of these sentences; because the first one is non-referential, the other two aren’t either.
Anaphora vs. Coreference
• Not all anaphoric relations are coreferential

  We went to see a concert last night. The tickets were really expensive.

• This is referred to as bridging anaphora.

[Diagram: “coreference” and “anaphora” shown as overlapping categories; “Barack Obama … Obama” is coreference, pronominal anaphora lies in the overlap, and bridging anaphora is anaphora without coreference]
Anaphora vs. Cataphora

• Usually the antecedent comes before the anaphor (e.g., a pronoun), but not always
Cataphora

“From the corner of the divan of Persian saddle-bags on which he was lying, smoking, as was his custom, innumerable cigarettes, Lord Henry Wotton could just catch the gleam of the honey-sweet and honey-coloured blossoms of a laburnum…”
(Oscar Wilde – The Picture of Dorian Gray)
Four Kinds of Coreference Models

• Rule-based (pronominal anaphora resolution)
• Mention Pair
• Mention Ranking
• Clustering
5. Traditional pronominal anaphora resolution:
Hobbs’ naive algorithm
1. Begin at the NP immediately dominating the pronoun
2. Go up tree to first NP or S. Call this X, and the path p.
3. Traverse all branches below X to the left of p, left-to-right,
breadth-first. Propose as antecedent any NP that has a NP or S
between it and X
4. If X is the highest S in the sentence, traverse the parse trees of
the previous sentences in the order of recency. Traverse each
tree left-to-right, breadth first. When an NP is encountered,
propose as antecedent. If X not the highest node, go to step 5.

Hobbs’ naive algorithm (1976)
5. From node X, go up the tree to the first NP or S. Call it X, and
the path p.
6. If X is an NP and the path p to X came from a non-head phrase
of X (a specifier or adjunct, such as a possessive, PP, apposition, or
relative clause), propose X as antecedent
(The original said “did not pass through the N’ that X immediately
dominates”, but the Penn Treebank grammar lacks N’ nodes….)
7. Traverse all branches below X to the left of the path, in a left-
to-right, breadth first manner. Propose any NP encountered as
the antecedent
8. If X is an S node, traverse all branches of X to the right of the
path but do not go below any NP or S encountered. Propose
any NP as the antecedent.
9. Go to step 4
Until deep learning, Hobbs’ algorithm was still often used as a feature in ML systems!
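The left-of-path search in steps 2–3 (and 7) can be sketched in code. This is an illustrative simplification, not the lecture’s implementation: the `Tree` class and the example sentence are made up, and the step-3 filter requiring an intervening NP or S between the candidate and X is omitted for brevity.

```python
# A minimal sketch of Hobbs' left-of-path search (steps 1-3),
# using a toy constituency tree. Illustrative only: the Tree class
# is invented, and step 3's "NP or S between it and X" filter is omitted.
from collections import deque

class Tree:
    def __init__(self, label, children=(), word=None):
        self.label, self.children, self.word = label, list(children), word

def leftof_path_bfs(x, path_child):
    """Breadth-first, left-to-right search of branches below X that lie
    to the left of the path p; propose NPs as candidate antecedents."""
    candidates = []
    queue = deque(x.children[: x.children.index(path_child)])
    while queue:
        node = queue.popleft()
        if node.label == "NP":
            candidates.append(node)
        queue.extend(node.children)
    return candidates

# "John said he ...": S -> NP(John) VP(said S'(NP(he) ...))
john = Tree("NP", word="John")
he = Tree("NP", word="he")
vp = Tree("VP", [Tree("S", [he])])
s = Tree("S", [john, vp])

# Steps 1-2: from the pronoun's NP, go up to the first S (= s);
# the path p came up through vp, so we search left of vp.
print([n.word for n in leftof_path_bfs(s, vp)])  # ['John']
```

The breadth-first order matters: it encodes Hobbs’ preference for structurally higher (and hence more salient) NPs over deeply embedded ones.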
Hobbs Algorithm Example

Knowledge-based Pronominal Coreference
• She poured water from the pitcher into the cup until it was full.
• She poured water from the pitcher into the cup until it was empty.
• The city council refused the women a permit because they feared violence.
• The city council refused the women a permit because they advocated violence.
• Winograd (1972)
• These are called Winograd Schema
• Recently proposed as an alternative to the Turing test
• See: Hector J. Levesque “On our best behaviour” IJCAI 2013
  http://www.cs.toronto.edu/~hector/Papers/ijcai-13-paper.pdf
• http://commonsensereasoning.org/winograd.html
• If you’ve fully solved coreference, arguably you’ve solved AI
Hobbs’ algorithm: commentary

“… the naïve approach is quite good. Computationally speaking, it will be a long time before a semantically based algorithm is sophisticated enough to perform as well, and these results set a very high standard for any other approach to aim for.

“Yet there is every reason to pursue a semantically based approach. The naïve algorithm does not work. Any one can think of examples where it fails. In these cases it not only fails; it gives no indication that it has failed and offers no help in finding the real antecedent.”
— Hobbs (1978), Lingua, p. 345
6. Coreference Models: Mention Pair

“I voted for Nader because he was most aligned with my values,” she said.

I Nader he my she

Coreference Cluster 1
Coreference Cluster 2

Coreference Models: Mention Pair

• Train a binary classifier that assigns every pair of mentions a probability of being coreferent: p(m_i, m_j)
• e.g., for “she” look at all candidate antecedents (previously occurring mentions) and decide which are coreferent with it

“I voted for Nader because he was most aligned with my values,” she said.

I   Nader   he   my   she
(coreferent with she?)
Coreference Models: Mention Pair

• Train a binary classifier that assigns every pair of mentions a probability of being coreferent: p(m_i, m_j)
• e.g., for “she” look at all candidate antecedents (previously occurring mentions) and decide which are coreferent with it

“I voted for Nader because he was most aligned with my values,” she said.

I   Nader   he   my   she

Positive examples: want p(m_i, m_j) to be near 1
Coreference Models: Mention Pair

• Train a binary classifier that assigns every pair of mentions a probability of being coreferent: p(m_i, m_j)
• e.g., for “she” look at all candidate antecedents (previously occurring mentions) and decide which are coreferent with it

“I voted for Nader because he was most aligned with my values,” she said.

I   Nader   he   my   she

Negative examples: want p(m_i, m_j) to be near 0
Mention Pair Training
• N mentions in a document
• y_ij = 1 if mentions m_i and m_j are coreferent, −1 otherwise
• Just train with regular cross-entropy loss (looks a bit different because it is binary classification):

  J = −Σ_{i=2}^{N} Σ_{j=1}^{i−1} [ 1(y_ij = 1) log p(m_j, m_i) + 1(y_ij = −1) log(1 − p(m_j, m_i)) ]

• Iterate through mentions; iterate through candidate antecedents (previously occurring mentions); coreferent mention pairs should get high probability, others should get low probability
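The loss above can be sketched in a few lines. This is an illustrative numpy version, not the lecture’s code; the example probabilities are made up.

```python
# A minimal sketch of the mention-pair training loss: binary
# cross-entropy summed over all mention pairs (m_j, m_i) with j < i.
import numpy as np

def mention_pair_loss(probs, labels):
    """probs[i][j]: model probability that m_i and m_j corefer (j < i);
    labels[i][j]: 1 if coreferent, -1 otherwise."""
    loss = 0.0
    for i in range(1, len(labels)):           # iterate through mentions
        for j in range(i):                    # iterate through antecedents
            p = probs[i][j]
            # push p toward 1 for coreferent pairs, toward 0 otherwise
            loss -= np.log(p) if labels[i][j] == 1 else np.log(1.0 - p)
    return loss

# mentions: I, Nader, he -> only (Nader, he) is coreferent
probs = [[], [0.1], [0.2, 0.8]]    # hypothetical model outputs
labels = [[], [-1], [-1, 1]]
print(mention_pair_loss(probs, labels))
```

Note the quadratic number of pairs: every mention is scored against every earlier mention, which is exactly the weakness the mention-ranking slides address next.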
Mention Pair Test Time
• Coreference resolution is a clustering task, but we are only
scoring pairs of mentions… what to do?

I Nader he my she

Mention Pair Test Time
• Coreference resolution is a clustering task, but we are only scoring pairs of mentions… what to do?
• Pick some threshold (e.g., 0.5) and add coreference links between mention pairs where p(m_i, m_j) is above the threshold

“I voted for Nader because he was most aligned with my values,” she said.

I   Nader   he   my   she
Mention Pair Test Time
• Coreference resolution is a clustering task, but we are only scoring pairs of mentions… what to do?
• Pick some threshold (e.g., 0.5) and add coreference links between mention pairs where p(m_i, m_j) is above the threshold
• Take the transitive closure to get the clustering

“I voted for Nader because he was most aligned with my values,” she said.

I   Nader   he   my   she

Even though the model did not predict this coreference link, I and my are coreferent due to transitivity
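Thresholding plus transitive closure is just connected components, which union-find computes cheaply. A sketch with hypothetical pairwise probabilities (the mention names follow the running example; the numbers are invented):

```python
# A sketch of clustering by transitive closure: union-find over
# mention pairs whose coreference probability exceeds a threshold.
def cluster_mentions(mentions, pair_prob, threshold=0.5):
    parent = {m: m for m in mentions}
    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]   # path compression
            m = parent[m]
        return m
    for i, mi in enumerate(mentions):
        for mj in mentions[:i]:
            if pair_prob[(mj, mi)] > threshold:
                parent[find(mi)] = find(mj)  # merge the two clusters
    clusters = {}
    for m in mentions:
        clusters.setdefault(find(m), []).append(m)
    return list(clusters.values())

mentions = ["I", "Nader", "he", "my", "she"]
# hypothetical pairwise probabilities; only three links exceed 0.5
probs = {(a, b): 0.1 for i, b in enumerate(mentions) for a in mentions[:i]}
probs[("Nader", "he")] = 0.9
probs[("I", "my")] = 0.8
probs[("my", "she")] = 0.7   # I, my, she then merge transitively
print(cluster_mentions(mentions, probs))
# → [['I', 'my', 'she'], ['Nader', 'he']]
```

Note that transitivity is what makes a single spurious link so dangerous: one bad pair above threshold merges two entire clusters, as the next slide illustrates.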
Mention Pair Test Time
• Coreference resolution is a clustering task, but we are only scoring pairs of mentions… what to do?
• Pick some threshold (e.g., 0.5) and add coreference links between mention pairs where p(m_i, m_j) is above the threshold
• Take the transitive closure to get the clustering

“I voted for Nader because he was most aligned with my values,” she said.

I   Nader   he   my   she

Adding one extra (incorrect) link would merge everything into one big coreference cluster!
Mention Pair Models: Disadvantage
• Suppose we have a long document with the following mentions
• Ralph Nader … he … his … him … <several paragraphs> … voted for Nader because he …

[Diagram: short-range links among he, his, him, Nader, he are relatively easy; the long-distance link back to Ralph Nader is almost impossible]
Mention Pair Models: Disadvantage
• Suppose we have a long document with the following mentions
• Ralph Nader … he … his … him … <several paragraphs> … voted for Nader because he …

• Many mentions only have one clear antecedent
• But we are asking the model to predict all of them
• Solution: instead train the model to predict only one antecedent for each mention
• More linguistically plausible
7. Coreference Models: Mention Ranking

• Assign each mention its highest scoring candidate antecedent according to the model
• Dummy NA mention allows model to decline linking the current mention to anything (“singleton” or “first” mention)

NA   I   Nader   he   my   she
(best antecedent for she?)
Coreference Models: Mention Ranking

• Assign each mention its highest scoring candidate antecedent according to the model
• Dummy NA mention allows model to decline linking the current mention to anything (“singleton” or “first” mention)

NA   I   Nader   he   my   she

Positive examples: model has to assign a high probability to either one (but not necessarily both)
Coreference Models: Mention Ranking

• Assign each mention its highest scoring candidate antecedent according to the model
• Dummy NA mention allows model to decline linking the current mention to anything (“singleton” or “first” mention)

NA   I   Nader   he   my   she
(best antecedent for she?)

p(NA, she) = 0.1
p(I, she) = 0.5
p(Nader, she) = 0.1
p(he, she) = 0.1
p(my, she) = 0.2

Apply a softmax over the scores for candidate antecedents so probabilities sum to 1
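The softmax-then-argmax step on this slide can be sketched directly. A small numpy illustration (the raw scores are invented; only their relative order matters):

```python
# A sketch of mention ranking at test time: softmax over antecedent
# scores for the current mention, then link only the argmax.
import numpy as np

def rank_antecedents(scores):
    """scores: dict candidate -> raw score s(candidate, mention).
    Returns (softmax probabilities, best antecedent)."""
    cands = list(scores)
    s = np.array([scores[c] for c in cands])
    e = np.exp(s - s.max())            # numerically stable softmax
    probs = dict(zip(cands, e / e.sum()))
    return probs, max(probs, key=probs.get)

# hypothetical raw scores for "she"; NA = decline to link
probs, best = rank_antecedents(
    {"NA": 0.0, "I": 1.6, "Nader": 0.0, "he": 0.0, "my": 0.7})
print(best)  # I
```

If NA wins the argmax, the mention starts its own (possibly singleton) cluster, which is exactly the role of the dummy antecedent on the slide.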
Coreference Models: Mention Ranking

• Assign each mention its highest scoring candidate antecedent according to the model
• Dummy NA mention allows model to decline linking the current mention to anything (“singleton” or “first” mention)

NA   I   Nader   he   my   she

p(NA, she) = 0.1
p(I, she) = 0.5
p(Nader, she) = 0.1
p(he, she) = 0.1
p(my, she) = 0.2

Only add the highest scoring coreference link. Apply a softmax over the scores for candidate antecedents so probabilities sum to 1.
Coreference Models: Training
• We want the current mention m_i to be linked to any one of the candidate antecedents it’s coreferent with.
• Mathematically, we want to maximize this probability:

  Σ_{j=1}^{i−1} 1(y_ij = 1) p(m_j, m_i)

  (iterate through candidate antecedents, i.e., previously occurring mentions; for the ones that are coreferent, we want the model to assign a high probability)

• The model could produce 0.9 probability for one of the correct antecedents and low probability for everything else, and the sum will still be large
• Turning this into a loss function (iterate over all the mentions in the document; usual trick of taking the negative log to go from likelihood to loss):

  J = −Σ_{i=2}^{N} log ( Σ_{j=1}^{i−1} 1(y_ij = 1) p(m_j, m_i) )
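The marginalized loss above can be sketched in code. An illustrative numpy version (the probability tables are invented):

```python
# A sketch of the mention-ranking training loss: for each mention,
# sum the probability mass assigned to ANY of its gold antecedents,
# then take the negative log of that marginal.
import numpy as np

def ranking_loss(probs, gold):
    """probs[i]: probabilities over candidate antecedents j = 0..i-1
    (remaining mass sits on the dummy NA antecedent);
    gold[i]: set of gold antecedent indices for mention i."""
    J = 0.0
    for i in range(1, len(probs)):
        marginal = sum(probs[i][j] for j in gold[i])
        J -= np.log(marginal)   # one good antecedent suffices
    return J

# three mentions; mention 1's gold antecedent is 0, mention 2's is 1
probs = [None, [0.9], [0.1, 0.8]]
gold = [None, {0}, {1}]
print(ranking_loss(probs, gold))
```

Because the log is taken of the sum, the model is free to put all its mass on a single correct antecedent, matching the bullet about 0.9 on one antecedent still yielding a small loss.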
Mention Ranking Models: Test Time

• Pretty much the same as mention-pair model except each mention is assigned only one antecedent

NA   I   Nader   he   my   she
How do we compute the probabilities?

A. Non-neural statistical classifier

B. Simple neural network

C. More advanced model using LSTMs, attention

A. Non-Neural Coref Model: Features

• Person/Number/Gender agreement
  • Jack gave Mary a gift. She was excited.
• Semantic compatibility
  • … the mining conglomerate … the company …
• Certain syntactic constraints
  • John bought him a new car. [him cannot be John]
• More recently mentioned entities are preferred as referents
  • John went to a movie. Jack went as well. He was not busy.
• Grammatical role: prefer entities in the subject position
  • John went to a movie with Jack. He was not busy.
• Parallelism
  • John went with Jack to a movie. Joe went with him to a bar.
• …
B. Neural Coref Model
• Standard feed-forward neural network
• Input layer: word embeddings and a few categorical features

Score: s = W4 h3 + b4
Hidden layer: h3 = ReLU(W3 h2 + b3)
Hidden layer: h2 = ReLU(W2 h1 + b2)
Hidden layer: h1 = ReLU(W1 h0 + b1)
Input layer h0: [candidate antecedent embeddings; candidate antecedent features; mention embeddings; mention features; additional features]
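The three-hidden-layer scorer above can be sketched in numpy. All dimensions and the random inputs here are made up for illustration; the real model’s input is the concatenation of the embedding and feature vectors listed on the slide.

```python
# A sketch of the feed-forward mention-pair scorer: three ReLU
# hidden layers over the concatenated input h0, then a linear score.
import numpy as np

rng = np.random.default_rng(0)
def layer(n_in, n_out):
    return rng.normal(0, 0.1, (n_out, n_in)), np.zeros(n_out)

d = 20  # toy size of the concatenated input h0 (real models are larger)
(W1, b1), (W2, b2), (W3, b3) = layer(d, 16), layer(16, 16), layer(16, 16)
W4, b4 = layer(16, 1)

def score(h0):
    h1 = np.maximum(0, W1 @ h0 + b1)
    h2 = np.maximum(0, W2 @ h1 + b2)
    h3 = np.maximum(0, W3 @ h2 + b3)
    return (W4 @ h3 + b4)[0]   # scalar score s

# h0 = [antecedent emb; antecedent feats; mention emb; mention feats; extras]
h0 = rng.normal(size=d)
print(score(h0))
```

The scalar s for each (antecedent, mention) pair is what the softmax of the mention-ranking slides is applied over.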
Neural Coref Model: Inputs
• Embeddings
  • Previous two words, first word, last word, head word, … of each mention
  • The head word is the “most important” word in the mention – you can find it using a parser. e.g., The fluffy cat stuck in the tree

• Still need some other features:
  • Distance
  • Document genre
  • Speaker information
C. End-to-end Model
• Current state-of-the-art model for coreference resolution
(Kenton Lee et al. from UW, EMNLP 2017)
• Mention ranking model
• Improvements over simple feed-forward NN
• Use an LSTM
• Use attention
• Do mention detection and coreference end-to-end
• No mention detection step!
• Instead consider every span of text (up to a certain length) as a
candidate mention
• a span is just a contiguous sequence of words

End-to-end Model
• First embed the words in the document using a word embedding matrix and a character-level CNN

[Figure 1 from Lee et al. 2017: word & character embeddings → bidirectional LSTM → span head (attention) → span representation g → mention score s_m, computed for candidate spans such as “General Electric”, “the Postal Service”, “the company”. Low-scoring spans are pruned so that only a manageable number of spans is considered for coreference decisions.]
End-to-end Model
• Then run a bidirectional LSTM over the document
End-to-end Model
• Next, represent each span of text i going from START(i) to END(i) as a vector
• Every span gets its own vector representation: General, General Electric, General Electric said, …, Electric, Electric said, …
• Span representation:

  g_i = [x*_{START(i)}, x*_{END(i)}, x̂_i, φ(i)]

• x*_{START(i)} and x*_{END(i)}: the BiLSTM hidden states for the span’s start and end
• x̂_i: an attention-based representation of the words in the span (e.g., x̂ for “the postal service”):

  x̂_i = Σ_{t=START(i)}^{END(i)} a_{i,t} · x_t

  where x̂_i is a weighted sum of the word vectors in span i; the weights a_{i,t} are automatically learned and correlate strongly with traditional definitions of head words
• φ(i): additional features
End-to-end Model

• The span head x̂_i is an attention-weighted average of the word
  embeddings in the span. Syntactic heads are typically included as
  features in previous systems (Durrett and Klein, 2013; Clark and
  Manning, 2016b,a); instead of relying on syntactic parses, this model
  learns a task-specific notion of headedness using an attention mechanism
  (Bahdanau et al., 2014) over the words in each span:

  • Attention scores (a dot product of a weight vector and a transformed
    hidden state):

        α_t = w_α · FFNN_α(x*_t)

  • Attention distribution (just a softmax over the attention scores for
    the span):

        a_{i,t} = exp(α_t) / Σ_{k=START(i)}^{END(i)} exp(α_k)

  • Final representation (an attention-weighted sum of the word
    embeddings):

        x̂_i = Σ_{t=START(i)}^{END(i)} a_{i,t} · x_t

  The weights a_{i,t} are learned automatically and correlate strongly
  with traditional definitions of head words.
End-to-end Model

• Why include all these different terms in the span representation?

      g_i = [ x*_START(i), x*_END(i), x̂_i, φ(i) ]

  • The boundary representations x*_START(i) and x*_END(i) represent the
    context to the left and right of the span
  • The soft head word vector x̂_i represents the span itself
  • The feature vector φ(i) represents other information not in the text,
    e.g. the size of the span
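As a concrete illustration, the span representation g_i can be sketched in a few lines of NumPy. This is a toy sketch, not the authors' implementation: the word embeddings x, BiLSTM states x*, and attention weight vector w_α are random stand-ins, and FFNN_α is reduced to a plain dot product.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 10, 8                      # document length, hidden size (toy values)
x = rng.normal(size=(T, d))       # word embeddings x_t (stand-in)
x_star = rng.normal(size=(T, d))  # BiLSTM states x*_t (stand-in)
w_alpha = rng.normal(size=d)      # attention weights (FFNN_alpha reduced to a dot product)

def span_representation(start, end, phi):
    """g_i = [x*_START(i), x*_END(i), x̂_i, φ(i)] for span [start, end] inclusive."""
    alpha = x_star[start:end + 1] @ w_alpha   # attention scores over the span's words
    a = np.exp(alpha - alpha.max())
    a /= a.sum()                              # softmax -> attention distribution a_{i,t}
    x_hat = a @ x[start:end + 1]              # attention-weighted head word vector
    return np.concatenate([x_star[start], x_star[end], x_hat, phi])

g = span_representation(2, 4, phi=np.array([3.0]))  # φ(i): e.g. a span-width feature
print(g.shape)  # (25,) = d + d + d + 1
```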
End-to-end Model

• Lastly, score every pair of spans to decide if they are coreferent.
  There are three factors in this pairwise coreference score: (1) whether
  span i is a mention, (2) whether span j is a mention, and (3) whether j
  is an antecedent of i:

      s(i, j) = 0                                  if j = ε
      s(i, j) = s_m(i) + s_m(j) + s_a(i, j)        if j ≠ ε

  "Are spans i and j coreferent mentions?" =
  "Is i a mention?" + "Is j a mention?" + "Do they look coreferent?"

  Here s_m(i) is a unary score for span i being a mention, and s_a(i, j)
  is a pairwise score for span j being an antecedent of span i.
End-to-end Model

• The scoring functions take the span representations as input and are
  computed via standard feed-forward neural networks:

      s_m(i) = w_m · FFNN_m(g_i)
      s_a(i, j) = w_a · FFNN_a([ g_i, g_j, g_i ∘ g_j, φ(i, j) ])

  where · denotes the dot product and ∘ denotes element-wise
  multiplication.
  • The antecedent score includes multiplicative interactions between the
    representations (g_i ∘ g_j) and, again, some extra features φ(i, j)
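A minimal sketch of these scoring functions, with tiny one-hidden-layer networks standing in for the FFNNs; the dimensions and weights here are arbitrary stand-ins, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
d_g = 16                                   # size of a span representation g_i (illustrative)

def make_ffnn(dim_in, dim_hidden):
    """A tiny scorer of the form w · FFNN(v), with one ReLU hidden layer."""
    W1 = 0.1 * rng.normal(size=(dim_hidden, dim_in))
    w2 = 0.1 * rng.normal(size=dim_hidden)
    return lambda v: float(w2 @ np.maximum(W1 @ v, 0.0))

s_m = make_ffnn(d_g, 32)                   # mention score s_m(i) = w_m · FFNN_m(g_i)
s_a = make_ffnn(3 * d_g + 1, 32)           # antecedent score s_a(i, j)

def pair_score(g_i, g_j, phi_ij):
    """s(i, j) = s_m(i) + s_m(j) + s_a(i, j); the dummy antecedent scores 0."""
    if g_j is None:                        # j = epsilon (no antecedent)
        return 0.0
    # g_i * g_j adds the multiplicative interactions between representations
    feats = np.concatenate([g_i, g_j, g_i * g_j, phi_ij])
    return s_m(g_i) + s_m(g_j) + s_a(feats)

g_i, g_j = rng.normal(size=d_g), rng.normal(size=d_g)
score = pair_score(g_i, g_j, np.array([1.0]))   # φ(i, j): e.g. a distance feature
```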
End-to-end Model
• Intractable to score every pair of spans
  • O(T^2) spans of text in a document (T is the number of words)
  • O(T^4) runtime!
  • So we have to do lots of pruning to make this work (only consider the
    few spans that are likely to be mentions)

• Attention learns which words are important in a mention (a bit like
  head words)
1. (A fire in a Bangladeshi garment factory) has left at least 37 people
   dead and 100 hospitalized. Most of the deceased were killed in the
   crush as workers tried to flee (the blaze) in the four-story building.
   A fire in (a Bangladeshi garment factory) has left at least 37 people
   dead and 100 hospitalized. Most of the deceased were killed in the
   crush as workers tried to flee the blaze in (the four-story building).
2. We are looking for (a region of central Italy bordering the Adriatic
   Sea). (The area) is mostly mountainous and includes Mt. Corno, the
   highest peak of the Apennines. (It) also includes a lot of sheep, good
   clean-living, healthy sheep, and an Italian entrepreneur has an idea
   about how to make a little money of them.
3. (The flight attendants) have until 6:00 today to ratify labor
   concessions. (The pilots') union and ground
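The span pruning described above can be sketched as follows. The maximum span width and the spans-per-word ratio λ (the paper text uses λ = 0.4) are illustrative settings, and the mention scores are random stand-ins for s_m.

```python
import numpy as np

rng = np.random.default_rng(2)
T, max_width, lam = 50, 10, 0.4   # doc length; width cap and λ are illustrative

# Enumerate candidate spans only up to the maximum width,
# instead of all O(T^2) spans
spans = [(s, e) for s in range(T) for e in range(s, min(s + max_width, T))]
scores = rng.normal(size=len(spans))          # stand-in mention scores s_m

# Keep only the λT highest-scoring spans, so the pairwise
# coreference scoring runs over a manageable set
k = int(lam * T)
keep = [spans[i] for i in np.argsort(scores)[-k:]]
print(len(spans), "candidate spans pruned to", len(keep))
```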
8. Last Coreference Approach: Clustering-Based

• Coreference is a clustering task, so let's use a clustering algorithm!
  • In particular, we will use agglomerative clustering

• Start with each mention in its own singleton cluster
• Merge a pair of clusters at each step
  • Use a model to score which cluster merges are good
Coreference Models: Clustering-Based

Coreference with Agglomerative Clustering

Google recently … the company announced Google Plus ... the product features ...

  Cluster 1: {Google}   Cluster 2: {the company}   Cluster 3: {Google Plus}   Cluster 4: {the product}

      s(c1, c2) = 5   ✔ merge

  Cluster 1: {Google, the company}   Cluster 2: {Google Plus}   Cluster 3: {the product}

      s(c2, c3) = 4   ✔ merge

  Cluster 1: {Google, the company}   Cluster 2: {Google Plus, the product}

      s(c1, c2) = -3   ✖ do not merge
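The merge loop on this slide can be sketched as a greedy agglomerative procedure. The scorer below is a toy oracle standing in for the learned cluster-pair model; it simply rewards merges that keep gold-coreferent mentions together.

```python
def agglomerative_coref(mentions, score_merge, threshold=0.0):
    """Greedy agglomerative clustering: repeatedly merge the best-scoring
    cluster pair until no merge scores above the threshold."""
    clusters = [[m] for m in mentions]          # start: singleton clusters
    while len(clusters) > 1:
        best_score, best_pair = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = score_merge(clusters[i], clusters[j])
                if best_score is None or s > best_score:
                    best_score, best_pair = s, (i, j)
        if best_score <= threshold:             # no merge looks good: stop
            break
        i, j = best_pair
        clusters[i] += clusters.pop(j)          # merge the pair
    return clusters

# Toy oracle scorer standing in for the learned cluster-pair model
gold = {"Google": 0, "the company": 0, "Google Plus": 1, "the product": 1}
def toy_score(c1, c2):
    return 1.0 if all(gold[a] == gold[b] for a in c1 for b in c2) else -1.0

print(agglomerative_coref(
    ["Google", "the company", "Google Plus", "the product"], toy_score))
# → [['Google', 'the company'], ['Google Plus', 'the product']]
```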
The Advantage of Clustering: Exploiting Entity-Level Information

• A mention-pair decision is difficult:

      Google  ?  Google Plus                  coreferent?

• A cluster-pair decision is easier:

      Cluster 1: {Google, the company}  ?  Cluster 2: {Google Plus, the product}          coreferent?
Clustering Model Architecture
From Clark & Manning, 2016

• Should we merge clusters c1 = {Google, the company} and
  c2 = {Google Plus, the product}?

  Mention Pairs → Mention-Pair Representations → Cluster-Pair
  Representation → Score s(MERGE[c1, c2])

  The mention pairs across the two clusters:
    (Google, Google Plus)
    (Google, the product)
    (the company, Google Plus)
    (the company, the product)
Clustering Model Architecture
Producing Cluster-Pair Representations

• First, produce a vector for each pair of mentions across the two
  clusters using the mention-pair encoder
  • e.g., the output of the hidden layer in the feedforward neural
    network model
Clustering Model Architecture
Producing Cluster-Pair Representations

• Then apply a pooling operation over the matrix of mention-pair
  representations R_m(c1, c2) to get a cluster-pair representation
  r_c(c1, c2) (using max and average pooling)
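A minimal sketch of this pooling step, assuming a 2×2 grid of mention-pair vectors; the representations and the scoring weight vector are random stand-ins for the learned model.

```python
import numpy as np

rng = np.random.default_rng(3)
# Mention-pair representations R_m(c1, c2): one row per cross-cluster
# mention pair (here 2 x 2 = 4 pairs), dimension 6 (both stand-ins)
R_m = rng.normal(size=(4, 6))

# Cluster-pair representation r_c(c1, c2):
# concatenated max pooling and average pooling over the pairs
r_c = np.concatenate([R_m.max(axis=0), R_m.mean(axis=0)])
print(r_c.shape)  # (12,)

# Score the candidate merge as a dot product with a weight vector (stand-in)
w = rng.normal(size=r_c.shape[0])
merge_score = float(w @ r_c)
```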
Clustering Model Architecture
Producing Cluster-Pair Representations

• Score the candidate cluster merge by taking the dot product of the
  cluster-pair representation with a weight vector
Clustering Model: Training

• The current candidate cluster merges depend on the previous merges the
  model has already made
  • So we can't use regular supervised learning
• Instead, use something like reinforcement learning to train the model
  • Reward for each merge: the change in a coreference evaluation metric
9. Coreference Evaluation
• Many different metrics: MUC, CEAF, LEA, B-CUBED, BLANC
• Often report the average over a few different metrics

  (Figure: mentions grouped into Gold Cluster 1 and Gold Cluster 2,
  overlaid with System Cluster 1 and System Cluster 2)
Coreference Evaluation
• An example: B-cubed
  • For each mention, compute a precision and a recall
  • Then average the individual Ps and Rs

  (Figure: Gold Cluster 1 and Gold Cluster 2 overlaid with System
  Clusters 1 and 2. Per-mention scores: P = 4/5, R = 4/6 for the four
  Gold-1 mentions in System 1; P = 1/5, R = 1/3 for the Gold-2 mention in
  System 1; P = 2/4, R = 2/6 for the two Gold-1 mentions in System 2;
  P = 2/4, R = 2/3 for the two Gold-2 mentions in System 2)

      P = [4(4/5) + 1(1/5) + 2(2/4) + 2(2/4)] / 9 = 0.6
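The B-cubed computation on this slide can be reproduced directly. The mention ids below are arbitrary labels chosen to match the slide's cluster sizes (gold clusters of sizes 6 and 3, split by the system into clusters of sizes 5 and 4).

```python
def b_cubed(gold_clusters, sys_clusters):
    """B-cubed: per-mention precision and recall, averaged over all mentions."""
    gold_of = {m: c for c in gold_clusters for m in c}
    sys_of = {m: c for c in sys_clusters for m in c}
    mentions = list(gold_of)
    p = r = 0.0
    for m in mentions:
        overlap = len(set(sys_of[m]) & set(gold_of[m]))
        p += overlap / len(sys_of[m])   # precision of m's system cluster
        r += overlap / len(gold_of[m])  # recall of m's gold cluster
    return p / len(mentions), r / len(mentions)

# Mention ids are arbitrary; the cluster sizes match the slide's example
gold = [[1, 2, 3, 4, 5, 6], [7, 8, 9]]
system = [[1, 2, 3, 4, 7], [5, 6, 8, 9]]
p, r = b_cubed(gold, system)
print(round(p, 2))  # 0.6, matching the slide
```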
Coreference Evaluation

  (Figure: two extreme systems: one clustering with 100% precision but
  33% recall, and one with 50% precision but 100% recall)
System Performance
• OntoNotes dataset: ~3,000 documents labeled by humans
  • English and Chinese data
• Report an F1 score averaged over 3 coreference metrics
System Performance

Model                                           English   Chinese
Lee et al. (2010)                               ~55       ~50      Rule-based system; used to be state-of-the-art!
Chen & Ng (2012) [CoNLL 2012 Chinese winner]    54.5      57.6     Non-neural machine learning model
Fernandes (2012) [CoNLL 2012 English winner]    60.7      51.6     Non-neural machine learning model
Wiseman et al. (2015)                           63.3      —        Neural mention ranker
Clark & Manning (2016)                          65.4      63.7     Neural clustering model
Lee et al. (2017)                               67.2      —        End-to-end neural mention ranker
Where do neural scoring models help?

• Especially with NPs and named entities that have no string matching.
  • Neural vs. non-neural scores: 18.9 F1 vs. 10.7 F1 on this mention
    type, compared to 68.7 vs. 66.1 F1 overall
  • These kinds of coreference are hard and the scores are still low!

Example wins:

  Anaphor                              Antecedent
  the country's leftist rebels         the guerillas
  the company                          the New York firm
  216 sailors from the ``USS Cole''    the crew
  the gun                              the rifle
Conclusion
• Coreference is a useful, challenging, and linguistically interesting
  task
  • There are many different kinds of coreference resolution systems
• Systems are getting better rapidly, largely due to better neural models
  • But overall, results are still not amazing
• Try out a coreference system yourself!
  • http://corenlp.run/ (ask for coref in Annotations)
  • https://huggingface.co/coref/
