
INFORMATION EXTRACTION FROM TEXT

TRU CAO
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND JOHN VON NEUMANN INSTITUTE
OUTLINE
• Named entity recognition and relation extraction

• Practical applications

• Rule-based methods

• Machine learning methods

• State of the art

NAMED ENTITY RECOGNITION
• Named entity:
• Entity that is referred to by a proper name.

• Example: Barack Obama, Ministry of Education, Saigon.

• Named entity recognition (NER):

• To recognize named entities in text and determine their categories.

• Main categories: PERSON, ORGANIZATION, LOCATION.

• Sub-categories are defined in an ontology of discourse.

RELATION EXTRACTION
• To extract the relation between named entities.
• Relations are defined in an ontology of discourse.
• Example:
• Input text:
• “In 1998, Larry Page and Sergey Brin founded Google Inc.”

• Extracted relations:
• FounderOf(Larry Page, Google Inc.)
• FounderOf(Sergey Brin, Google Inc.)
• FoundedIn(Google Inc., 1998)
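As a toy illustration of how such relation triples could be produced by a single hand-written pattern (this is not the method of the systems described later; the regular expression and function name below are hypothetical), in Python:

import re

# A toy pattern for sentences of the form "In <YEAR>, <NAME> and <NAME> founded <ORG>."
# (illustrative only; real systems combine NER with many such rules or learned models)
PATTERN = re.compile(
    r"In (?P<year>\d{4}), (?P<p1>[A-Z][\w. ]+?) and (?P<p2>[A-Z][\w. ]+?) founded (?P<org>[A-Z][\w. ]+?)\."
)

def extract_relations(text):
    """Return (relation, argument1, argument2) triples found by the pattern."""
    triples = []
    for m in PATTERN.finditer(text):
        triples.append(("FounderOf", m.group("p1"), m.group("org")))
        triples.append(("FounderOf", m.group("p2"), m.group("org")))
        triples.append(("FoundedIn", m.group("org"), m.group("year")))
    return triples

print(extract_relations("In 1998, Larry Page and Sergey Brin founded Google Inc."))
# [('FounderOf', 'Larry Page', 'Google Inc'), ('FounderOf', 'Sergey Brin', 'Google Inc'),
#  ('FoundedIn', 'Google Inc', '1998')]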

PRACTICAL APPLICATIONS
• General applications:
• Semantic analysis and text understanding
• Knowledge discovery from text

• Specific domain applications:


• Electronic medical records: to recognize problems, tests, and treatments, and to discover the relations among them (e.g., medication relations).
• Online shopping: to survey product prices on the web.
• Semantic search: “Find web pages about universities”.

Example (Vietnamese clinical text): “Bệnh_nhân siêu_âm phát_hiện sỏi thận, phẫu_thuật nội_soi nhưng không khỏi, hiện_tại đau nhiều vùng hông lưng.” (The patient had an ultrasound that detected kidney stones and underwent laparoscopic surgery, but was not cured; currently there is severe pain in the flank and lower back region.)

RULE-BASED METHODS
• 2004-2006: Vietnamese Semantic Web
• A national key project funded by the Ministry of Science and Technology.
• To recognize and annotate named entities in Vietnamese
web pages.
• To build up a knowledge base and to manage web pages
annotated with popular Vietnamese named entities (VN-KIM).

RULE-BASED METHODS
• 2007-2009: Information Extraction and Integration on the Vietnamese Semantic Web (VNUHCM)
[Architecture diagram: plug-ins (S-Search, S-Editor) on top of the VN-KIM Front-End; APIs connecting to Semantic LUCENE, VN-KIM IE, and Fuzzy SESAME; underneath, the Semantic Web, the Semantic Annotation Repository, and the Knowledge Base.]
A RULE-BASED NER METHOD
• Nguyen, V.T.T. & Cao, T.H. (2007), VN-KIM IE:
Automatic Extraction of Vietnamese Named-
Entities on the Web. Journal of New Generation
Computing, 25 (3), 277-292.

A RULE-BASED NER METHOD
• Ontology and knowledge base:
• VN-KIM KB: 370 classes and 115 properties, with over
120,000 entities (about 60% are in Vietnam and the rest in
the world).
• Lexical resource: words surrounding entity proper names.
• Examples: “Chủ tịch Hồ Chí Minh” (Chairman Ho Chi Minh), “Thành phố Hồ Chí Minh” (Ho Chi Minh City).
• Each entity class has a corresponding lexical resource.

A RULE-BASED NER METHOD
• System architecture

A RULE-BASED NER METHOD
• VN Hash Gazetteer:
• To match a name against entity aliases and abbreviations.
• To generate temporary annotations.
• Examples: “Bộ Giáo dục - Đào tạo” (Ministry of Education and Training) and its abbreviation “Bộ GD&ĐT”, … (see the sketch below).
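A rough sketch of such a lookup in Python (the alias table, entity identifier, and function name are hypothetical, only to illustrate mapping aliases and abbreviations to the same entity and producing temporary annotations):

# Hypothetical alias table mapping surface forms to (entity id, class).
ALIASES = {
    "Bộ Giáo dục - Đào tạo": ("Ministry_of_Education_and_Training", "Organization"),
    "Bộ GD&ĐT": ("Ministry_of_Education_and_Training", "Organization"),
}

def gazetteer_annotate(text):
    """Return temporary annotations (start, end, surface form, entity id, class)."""
    annotations = []
    for alias, (entity_id, entity_class) in ALIASES.items():
        start = text.find(alias)
        while start != -1:
            annotations.append((start, start + len(alias), alias, entity_id, entity_class))
            start = text.find(alias, start + 1)
    return annotations

print(gazetteer_annotate("Bộ GD&ĐT vừa công bố quy chế mới."))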

A RULE-BASED NER METHOD
• Pattern Matching

A RULE-BASED NER METHOD
• Removal of misclassified words in capitals:
• An entity name often comes with an initial uppercase character, but the reverse is not always true.
• Example: “Bình minh trên đỉnh Hàm Rồng” (“Sunrise on Ham Rong peak”), where “Bình minh” is capitalized only because it begins the sentence.

A RULE-BASED NER METHOD
• Recognition of overlapping entities:
• Sharing a common text segment.
• Example: “Giám đốc công ty FPT Trương Gia Bình” (Director of the FPT company, Trương Gia Bình).

A RULE-BASED NER METHOD
• Lexical resource-based recognition:
• Lexical resource words provide contextual and structural information for recognizing NEs that are not yet present in the knowledge base.
• Example: “Ca sĩ Minh Vương” (Singer Minh Vương), as sketched below.
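A minimal sketch of this idea in Python, assuming a tiny, hypothetical set of cue words per class (not the actual VN-KIM lexical resources):

import re

# Hypothetical cue words that typically precede proper names of each class.
LEXICAL_CUES = {
    "Person": ["Ca sĩ", "Chủ tịch", "Giám đốc"],
    "Organization": ["Công ty", "Trường đại học"],
}

def recognize_by_cue(text):
    """Guess the class of a capitalized name from the cue word right before it."""
    results = []
    for entity_class, cues in LEXICAL_CUES.items():
        for cue in cues:
            # cue followed by one or more capitalized words (simplified capitalization check)
            for m in re.finditer(re.escape(cue) + r"\s+((?:[A-ZĐ]\w*\s*)+)", text):
                results.append((m.group(1).strip(), entity_class))
    return results

print(recognize_by_cue("Ca sĩ Minh Vương biểu diễn tại Công ty Kinh Đô."))
# [('Minh Vương', 'Person'), ('Kinh Đô', 'Organization')]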

A RULE-BASED NER METHOD
• Context-based recognition:
• Contextual words such as conjunctions can also help to
identify entity classes.
• Example: “Công ty Kinh Đô và Thăng Long” (the Kinh Đô and Thăng Long companies), where the conjunction “và” (“and”) suggests that “Thăng Long” is also a company.

A RULE-BASED NER METHOD
• Removal of inconsistent annotations:
• After the previous steps, an NE can be associated with two or more annotations that are inconsistent with each other.
• Example: “Trường đại học Tôn Đức Thắng” (Ton Duc Thang University).

A RULE-BASED NER METHOD
• Performance evaluation:

RULE-BASED RELATION
EXTRACTION
• Cao, T.H. & Cao, T.D. & Tran, T.L. (2008), A Robust
Ontology-Based Method for Translating Natural
Language Queries to Conceptual Graphs. In Proc.
of the 3rd Asian Semantic Web Conference,
Springer-Verlag, 479-492.

RULE-BASED RELATION
EXTRACTION
• Conceptual graphs:
• A bipartite graph in which concept vertices alternate with (conceptual) relation vertices.
• Example: “Cognac is produced in a province in France”.
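A minimal sketch of this bipartite structure in Python; the particular concept and relation labels below are an illustrative reading of the example sentence, not the graph used in the cited work:

# Concept vertices: (type, referent); "*" marks an unspecified (generic) referent.
concepts = {
    "c1": ("Drink", "Cognac"),
    "c2": ("Produce", "*"),
    "c3": ("Province", "*"),
    "c4": ("Country", "France"),
}
# Relation vertices: (relation type, list of concept arguments).
relations = {
    "r1": ("object", ["c2", "c1"]),    # the producing has Cognac as its object
    "r2": ("location", ["c2", "c3"]),  # the producing takes place in a province
    "r3": ("in", ["c3", "c4"]),        # the province is in France
}

# Bipartite check: every edge connects a relation vertex to a concept vertex.
for rel_id, (_, args) in relations.items():
    assert all(arg in concepts for arg in args), rel_id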

RULE-BASED RELATION
EXTRACTION
• A syntax-free method:
[Diagram: a chain of entities alternating with relations, E1, R1, E2, R2, E3, R3, …]

RULE-BASED RELATION
EXTRACTION
• A syntax-free method:
• Taking entities as anchors:
“What county is Modesto, California in?” (the entities in the query are marked)
• Recognizing relations between entities:
“What county is Modesto, California in?” (the relations between those entities are to be recognized)
RULE-BASED RELATION
EXTRACTION
• A syntax-free method (9 steps):
• Recognizing specified entities
• Recognizing unspecified entities
• Extracting relational phrases
• Determining the type of queried entities
• Unifying identical entities
• Discovering implicit relations
• Determining the types of relations
• Removing improper relations
• Constructing the final conceptual graph
RULE-BASED RELATION
EXTRACTION
• A syntax-free method (steps 1-4):
• Recognizing specified entities
• What is the capital of Mongolia?
• Recognizing unspecified entities
• How many counties are in Indiana?
• Extracting relational phrases
• What state is Niagara Falls located in?
• Determining the type of queried entities
• What is WWE short for?

RULE-BASED RELATION
EXTRACTION
• A syntax-free method (steps 5-9):
• Unifying identical entities
• Who is the president of Bolivia?
• Discovering implicit relations
• What county is Modesto, California in?
• Determining the types of relations
• When was Microsoft established?
• Removing improper relations
• What city in Florida is Sea World in?
• Constructing the final conceptual graph

RULE-BASED RELATION
EXTRACTION
• Performance evaluation:
• R-error: due to GATE’s performance.
• O-error: due to lack of entity types, relation types, NEs in
KIM ontology and knowledge base.
• Q-error: due to expressiveness of simple conceptual graphs.
• M-error: due to the proposed algorithm itself.

RULE-BASED RELATION
EXTRACTION
• Performance evaluation:
Query Type | Number of Queries | Correct CGs    | R-errors | O-errors     | Q-errors      | M-errors
What       | 173               | 120            | 0        | 23           | 28            | 2
Which      | 15                | 9              | 0        | 2            | 4             | 0
Where      | 13                | 9              | 0        | 2            | 0             | 2
Who        | 57                | 36             | 0        | 9            | 11            | 1
When       | 13                | 10             | 0        | 2            | 1             | 0
How        | 56                | 4              | 0        | 2            | 50            | 0
Other      | 118               | 81             | 0        | 19           | 18            | 0
Total      | 445               | 269 (60.45%)   | 0 (0%)   | 59 (13.26%)  | 112 (25.17%)  | 5 (1.12%)

RULE-BASED METHODS FOR NER
• Advantages:
• The rules are transparent (humans can understand them).
• No training corpus is required.
• Effective if the rules are well defined.

• Disadvantages:
• The labor cost is high for manually specifying the rules.
• Coverage of the rules is limited.
• It is difficult to extend.

MACHINE LEARNING FOR NER
• Class labels = {PER, ORG, LOC, OTHER}.
• Word sequence example: “Facebook CEO Zuckerberg visited Vietnam”.
• Corresponding label sequence: ORG OTHER PER OTHER LOC


• NER: to find the most probable label sequence for
a given word sequence.
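In code, this sequence-labeling view of the example above is simply a pairing of the word sequence with a label sequence of the same length (a minimal sketch in Python):

# One label per word, drawn from {PER, ORG, LOC, OTHER}.
words  = ["Facebook", "CEO", "Zuckerberg", "visited", "Vietnam"]
labels = ["ORG", "OTHER", "PER", "OTHER", "LOC"]

# The NER task: given `words`, find the most probable `labels`
# (e.g., with an HMM, as developed in the following slides).
for word, label in zip(words, labels):
    print(word, label)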

HIDDEN MARKOV MODELS
• Introduction
• Example
• Independence assumptions
• Forward algorithm
• Viterbi algorithm
• Training
• Application to NER

HIDDEN MARKOV MODELS
• One of the most popular graphical models.
• Dynamic extension of Bayesian networks.
• Sequential extension of Naïve Bayes classifier.

HIDDEN MARKOV MODELS
• Example:
• Your possible looking prior to the exam = {tired, hungover,
scared, fine}.
• Your possible activity last night = {TV, pub, party, study}.
• Given a sequence of observations of your looking, guess
what you did in previous nights.

HIDDEN MARKOV MODELS
• Example:
• Your possible looking prior to the exam = {tired, hungover,
scared, fine}.
• Your possible activity last night = {TV, pub, party, study}.
• Given a sequence of observations of your looking, guess
what you did in previous nights.

Day:                       Fri    Sat       Sun    Mon
Your activity last night:  ?      ?         ?      ?
Your looking today:        fine   hungover  tired  scared
HIDDEN MARKOV MODELS
• Example:
• Your possible looking prior to the exam = {tired, hungover,
scared, fine}.
• Your possible activity last night = {TV, pub, party, study}.
• Given a sequence of observations of your looking, guess
what you did in previous nights.
• A model:
• Your looking depends on what you did in the night before.
• Your activity in a night depends on what you did in some
previous nights.

HIDDEN MARKOV MODELS
• A finite set of possible observations.
• A finite set of possible hidden states.
• To predict the most probable sequence of
underlying states {y1, y2, …, yT} for a given
sequence of observations {x1, x2, …, xT}.
[Diagram: the state sequence y1, …, yt-1, yt, …, yT, where each state transits to the next state and emits an observation; below it, the observation sequence x1, …, xt-1, xt, …, xT.]
HIDDEN MARKOV MODELS
[Diagram (Marsland, S. (2009), Machine Learning: An Algorithmic Perspective): the four hidden states Party, Pub, TV, and Study, with transition probabilities on the arrows between them and an emission (observation) probability table for each state:]

State   Tired  Hungover  Scared  Fine
Party   0.3    0.4       0.2     0.1
TV      0.2    0.1       0.2     0.5
Pub     0.4    0.2       0.1     0.3
Study   0.3    0.05      0.3     0.35
HIDDEN MARKOV MODELS
• Normalization: for each state yt, the outgoing transition probabilities satisfy Σy p(y | yt) = 1 and the emission (observation) probabilities satisfy Σx p(x | yt) = 1 (see the diagram and table above).
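As a quick sanity check, the emission probabilities from the table above can be written down and verified to sum to 1 for each state (a minimal sketch in Python):

# Emission (observation) probabilities p(x | y), one distribution per hidden state.
emission = {
    "Party": {"Tired": 0.3, "Hungover": 0.4,  "Scared": 0.2, "Fine": 0.1},
    "TV":    {"Tired": 0.2, "Hungover": 0.1,  "Scared": 0.2, "Fine": 0.5},
    "Pub":   {"Tired": 0.4, "Hungover": 0.2,  "Scared": 0.1, "Fine": 0.3},
    "Study": {"Tired": 0.3, "Hungover": 0.05, "Scared": 0.3, "Fine": 0.35},
}

for state, dist in emission.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9, state  # checks Σx p(x | y) = 1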
HIDDEN MARKOV MODELS
• HMM conditional independence assumptions:
• The state at time t depends only on the state at time t – 1:
p(yt | yt-1, Z) = p(yt | yt-1), where Z stands for any other variables (earlier states and observations).
• The observation at time t depends only on the state at time t:
p(xt | yt, Z) = p(xt | yt)

HIDDEN MARKOV MODELS
• Generative model:
• Joint distributions: p(Y, X)
Example: binary A, B, C
p(A, B, C), p(-A, B, C), …., p(-A, -B, -C)
• It can generate any distribution on Y and X.
Example: p(A | -B, C) = p(A, -B, C)/p(-B, C)
p(-B, C) = p(A, -B, C) + p(-A, -B, C)

HIDDEN MARKOV MODELS
• Generative model:
• Joint distributions: p(Y, X)
Example: binary A, B, C
p(A, B, C), p(-A, B, C), …., p(-A, -B, -C)
• It can generate any distribution on Y and X.
Example: p(A | -B, C) = p(A, -B, C)/p(-B, C)
p(-B, C) = p(A, -B, C) + p(-A, -B, C)

• Discriminative model:
• Conditional distributions: p(Y | X)
• It discriminates Y given X.

HIDDEN MARKOV MODELS
• HMM is a generative model:
• Joint distributions: one can prove that
p(Y, X) = p(y1, y2,…, yT, x1, x2,…, xT) = Πt=1,T p(xt | yt).p(yt | yt-1)
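A minimal sketch of this factorization in Python, assuming the HMM parameters are given as nested dictionaries (the two-state toy model at the end is purely hypothetical, only to exercise the function):

def joint_probability(states, observations, init_p, trans_p, emit_p):
    """p(Y, X) = p(y1).p(x1 | y1) * product over t >= 2 of p(yt | yt-1).p(xt | yt)."""
    assert len(states) == len(observations)
    prob = init_p[states[0]] * emit_p[states[0]][observations[0]]
    for t in range(1, len(states)):
        prob *= trans_p[states[t - 1]][states[t]] * emit_p[states[t]][observations[t]]
    return prob

# Hypothetical toy parameters (two states A/B, two observations x/y).
init_p  = {"A": 0.6, "B": 0.4}
trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit_p  = {"A": {"x": 0.5, "y": 0.5}, "B": {"x": 0.1, "y": 0.9}}
print(joint_probability(["A", "B"], ["x", "y"], init_p, trans_p, emit_p))  # 0.6*0.5*0.3*0.9 = 0.081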

HIDDEN MARKOV MODELS
• HMM is a generative model:
• Joint distributions: one can prove that
p(Y, X) = p(y1, y2,…, yT, x1, x2,…, xT) = Πt=1,T p(xt | yt).p(yt | yt-1)

Values are given in HMM

HIDDEN MARKOV MODELS
• HMM is a generative model:
• Joint distributions:
p(Y, X) = p(y1, y2,…, yT, x1, x2,…, xT) = Πt=1,T p(xt | yt).p(yt | yt-1)
Proof:
p(y1, y2,…, yT, x1, x2,…, xT)
= p(xT | y1, y2,…, yT, x1, x2,…, xT-1).p(y1, y2,…, yT, x1, x2,…, xT-1)
= p(xT | yT).p(y1, y2,…, yT, x1, x2,…, xT-1)
= p(xT | yT).p(yT | y1, …, yT-1, x1, …, xT-1).p(y1, …, yT-1, x1, …, xT-1)
= p(xT | yT).p(yT | yT-1).p(y1, …, yT-1, x1, …, xT-1)
=…

HIDDEN MARKOV MODELS
• HMM is a generative model:
• Joint distributions:
p(Y, X) = p(y1, y2,…, yT, x1, x2,…, xT) = Πt=1,T p(xt | yt).p(yt | yt-1)
p(y1 | y0) = p(y1)

HIDDEN MARKOV MODELS
• HMM is a generative model:
• Joint distributions (transition and emission factors):
p(Y, X) = p(y1, y2,…, yT, x1, x2,…, xT) = Πt=1,T p(yt | yt-1).p(xt | yt)
p(y1 | y0) = p(y1)
• It can generate any distribution on Y and X.

HIDDEN MARKOV MODELS
• HMM is a generative model:
• Joint distributions:
p(Y, X) = p(y1, y2,…, yT, x1, x2,…, xT) = Πt=1,T p(yt | yt-1).p(xt | yt)
p(y1 | y0) = p(y1)
• It can generate any distribution on Y and X.

• In contrast to a discriminative model (e.g., CRF):


• Conditional distributions: p(Y | X)
• It discriminates Y given X.

HIDDEN MARKOV MODELS
• Forward algorithm:
• To compute the joint probability of the state at time t being yt
and the sequence of observations in the first t steps being
{x1, x2, …, xt}:
αt(yt) = p(yt, x1, x2, …, xt)

HIDDEN MARKOV MODELS
• Forward algorithm:
• To compute the joint probability of the state at time t being yt
and the sequence of observations in the first t steps being
{x1, x2, …, xt}:
αt(yt) = p(yt, x1, x2, …, xt)
• Bayes’ theorem gives:
p(yt | x1, x2, …, xt)
= p(yt, x1, x2, …, xt)/p(x1, x2, …, xt)
= αt(yt)/p(x1, x2, …, xt)

HIDDEN MARKOV MODELS
• Forward algorithm:
• To compute the joint probability of the state at time t being yt
and the sequence of observations in the first t steps being
{x1, x2, …, xt}:
αt(yt) = p(yt, x1, x2, …, xt)

• Bayes’ theorem gives:


p(yt | x1, x2, …, xt)
= p(yt, x1, x2, …, xt)/p(x1, x2, …, xt)
= αt(yt)/p(x1, x2, …, xt)

• The higher αt(yt) is, the more likely yt is, given the same {x1, x2, …, xt}.

HIDDEN MARKOV MODELS
• Forward algorithm:
• To compute the joint probability of the state at time t being yt
and the sequence of observations in the first t steps being
{x1, x2, …, xt}:
αt(yt) = p(yt, x1, x2, …, xt)
• The higher αt(yt) is, the more likely yt is, given the same {x1, x2, …, xt}.
[Diagram: days Fri, Sat, Sun, Mon; your looking today is observed as fine, hungover, tired, scared; one night's activity is marked “?”.]
HIDDEN MARKOV MODELS
• “Naïve” computation:
αt(yt)
= p(yt, x1, x2, …, xt)
= Σy1,y2, .., yt-1 p(y1, y2, …, yt-1, yt, x1, x2, …, xt)

like
p(A) = p(A, B) + p(A, -B)
p(A) = p(A, B, C) + p(A, -B, C) + p(A, B, -C) + p(A, -B, -C)

HIDDEN MARKOV MODELS
• Forward algorithm:
αt(yt)
= p(yt, x1, x2, …, xt)
= Σyt-1p(yt, yt-1, x1, x2, …, xt)

= Σyt-1p(xt | yt, yt-1, x1, x2, …, xt-1).p(yt, yt-1, x1, x2, …, xt-1)

= Σyt-1p(xt | yt).p(yt | yt-1, x1, x2, …, xt-1).p(yt-1, x1, x2, …, xt-1)

= Σyt-1p(xt | yt).p(yt | yt-1).p(yt-1, x1, x2, …, xt-1)

= p(xt | yt) Σyt-1 p(yt | yt-1).αt-1(yt-1)
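The recursion above maps directly onto code; a minimal sketch in Python, assuming the same dictionary-based parameters as in the earlier joint-probability sketch:

def forward(observations, states, init_p, trans_p, emit_p):
    """Return the list of alpha_t distributions; alpha[t-1][y] = p(yt = y, x1, ..., xt)."""
    # Base case: alpha_1(y1) = p(x1 | y1).p(y1)
    alpha = [{y: init_p[y] * emit_p[y][observations[0]] for y in states}]
    # Recursion: alpha_t(yt) = p(xt | yt) * sum over yt-1 of p(yt | yt-1).alpha_(t-1)(yt-1)
    for x in observations[1:]:
        prev = alpha[-1]
        alpha.append({
            y: emit_p[y][x] * sum(trans_p[y_prev][y] * prev[y_prev] for y_prev in states)
            for y in states
        })
    return alpha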


HIDDEN MARKOV MODELS
• Forward algorithm, with base case α1(y1) = p(y1, x1) = p(x1 | y1).p(y1):
αt(yt)
= p(yt, x1, x2, …, xt)
= Σyt-1p(yt, yt-1, x1, x2, …, xt)

= Σyt-1p(xt | yt, yt-1, x1, x2, …, xt-1).p(yt, yt-1, x1, x2, …, xt-1)

= Σyt-1p(xt | yt).p(yt | yt-1, x1, x2, …, xt-1).p(yt-1, x1, x2, …, xt-1)

= Σyt-1p(xt | yt).p(yt | yt-1).p(yt-1, x1, x2, …, xt-1)

= p(xt | yt) Σyt-1 p(yt | yt-1).αt-1(yt-1)


HIDDEN MARKOV MODELS
• Forward algorithm:
α1(y1) = p(y1, x1) = p(x1| y1). p(y1)
αt(yt) = p(xt | yt) Σyt-1 p(yt | yt-1).αt-1(yt-1)

HIDDEN MARKOV MODELS
• Forward algorithm’s complexity:
αt(yt) = p(xt | yt) Σyt-1 p(yt | yt-1).αt-1(yt-1)
Each of the T steps computes N values αt(yt), each requiring a sum over N previous states, so the total cost is O(T.N2), where N is the number of possible states.

HIDDEN MARKOV MODELS
• Viterbi algorithm:
• To compute the most probable sequence of states {y1, y2, …, yT}
given a sequence of observations {x1, x2, …, xT}:
Y* = argmaxY p(Y | X) = argmaxY [p(Y, X)/p(X)] = argmaxY p(Y, X)

HIDDEN MARKOV MODELS
• Viterbi algorithm:
• To compute the most probable sequence of states {y1, y2, …, yT}
given a sequence of observations {x1, x2, …, xT}:
Y* = argmaxY p(Y | X) = argmaxY p(Y, X)

Day:            Fri    Sat       Sun    Mon
Your activity:  ?      ?         ?      ?
Your looking:   fine   hungover  tired  scared
HIDDEN MARKOV MODELS
• Exhaustive search complexity: |Y|^T (!)
maxy1:T p(y1, y2, …, yT, x1, x2, …, xT)

HIDDEN MARKOV MODELS
• Viterbi algorithm (using maxa,b f(a, b) = maxa (maxb f(a, b))):
maxy1:T p(y1, y2, …, yT, x1, x2, …, xT)
= maxyT maxy1:T-1 p(y1, y2, …, yT-1, yT, x1, x2, …, xT)    (best for each yT)
= maxyT maxy1:T-1 (p(xT | yT).p(yT | yT-1).p(y1, …, yT-1, x1, …, xT-1))
= maxyT maxyT-1 maxy1:T-2 (p(xT | yT).p(yT | yT-1).p(y1, …, yT-1, x1, …, xT-1))
= maxyT maxyT-1 (p(xT | yT).p(yT | yT-1).maxy1:T-2 p(y1,…, yT-2, yT-1, x1,…, xT-2, xT-1))    (best for each yT-1)
HIDDEN MARKOV MODELS
• Viterbi algorithm:
maxy1:T p(y1, y2, …, yT, x1, x2, …, xT)
= maxyT maxy1:T-1 p(y1, y2, …, yT, x1, x2, …, xT)    (best for each yT)
= maxyT maxyT-1 (p(xT | yT).p(yT | yT-1).maxy1:T-2 p(y1,…, yT-2, yT-1, x1,…, xT-2, xT-1))    (best for each yT-1)

• Dynamic programming:
• Solving, storing, and reusing solutions of the sub-problems for the
current problem.

HIDDEN MARKOV MODELS
• Viterbi algorithm:
maxy1:T p(y1, y2, …, yT, x1, x2, …, xT)
= maxyT maxy1:T-1 p(y1, y2, …, yT, x1, x2, …, xT)    (best for each yT)
= maxyT maxyT-1 (p(xT | yT).p(yT | yT-1).maxy1:T-2 p(y1,…, yT-2, yT-1, x1,…, xT-2, xT-1))    (best for each yT-1)

• Dynamic programming:
• Compute:
maxy1 p(y1, x1) = maxy1 p(x1 | y1).p(y1)
• For each t from 2 to T, and for each state yt, compute:
argmaxy1:t-1 p(y1, y2, …, yt, x1, x2, …, xt)
= argmaxyt-1 (p(xt | yt).p(yt | yt-1).maxy1:t-2p(y1,…, yt-2, yt-1, x1,…, xt-2, xt-1))
• Select:
argmaxyT maxy1:T-1 p(y1, y2, …, yT, x1, x2, …, xT)
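A minimal sketch of this dynamic program in Python, with back pointers so that the best state sequence can be read off at the end (same dictionary-based parameters as the earlier sketches):

def viterbi(observations, states, init_p, trans_p, emit_p):
    """Return (most probable state sequence Y*, max over Y of p(Y, X))."""
    # delta[t][y] = max over y1..y(t-1) of p(y1, ..., y(t-1), yt = y, x1, ..., xt)
    delta = [{y: init_p[y] * emit_p[y][observations[0]] for y in states}]
    back = [{}]  # back[t][y] = best predecessor of state y at step t
    for x in observations[1:]:
        prev, cur, ptr = delta[-1], {}, {}
        for y in states:
            best_prev = max(states, key=lambda yp: prev[yp] * trans_p[yp][y])
            cur[y] = emit_p[y][x] * prev[best_prev] * trans_p[best_prev][y]
            ptr[y] = best_prev
        delta.append(cur)
        back.append(ptr)
    # Backtrack from the best final state.
    last = max(states, key=lambda y: delta[-1][y])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), delta[-1][last]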
HIDDEN MARKOV MODELS
• Could the results from the forward algorithm be used for the Viterbi algorithm?

HIDDEN MARKOV MODELS
• Where does an HMM come from?

HIDDEN MARKOV MODELS
• Training HMMs:
• Topology is designed beforehand.
• Parameters to be learned: emission and transition probabilities.
• Supervised or unsupervised training.

HIDDEN MARKOV MODELS
• Supervised learning:
• Training data: paired sequences of states and observations
(y1, y2, …, yT, x1, x2, …, xT)

[Training data table (not reproduced): eight paired sequences of hidden states y, with values F and B, and observations x, with values H and T.]

HIDDEN MARKOV MODELS
• Supervised learning:
• Training data: paired sequences of states and observations
(y1, y2, …, yT, x1, x2, …, xT)


• p(y1 = F) = ?

HIDDEN MARKOV MODELS
• Supervised learning:
• Training data: paired sequences of states and observations
(y1, y2, …, yT, x1, x2, …, xT)


• p(y1 = F) = the prior probability of the first hidden state in a sequence being F = 4/8

HIDDEN MARKOV MODELS
• Supervised learning:
• Training data: paired sequences of states and observations
(y1, y2, …, yT, x1, x2, …, xT)


• p(y1 = F) = 4/8
• p(y = F | y* = B) = ?

HIDDEN MARKOV MODELS
• Supervised learning:
• Training data: paired sequences of states and observations
(y1, y2, …, yT, x1, x2, …, xT)


• p(y1 = F) = 4/8
• p(y = F | y* = B) = number of (B, F) transitions / number of all transitions out of B = 10/11

HIDDEN MARKOV MODELS
• Supervised learning:
• Training data: paired sequences of states and observations
(y1, y2, …, yT, x1, x2, …, xT)


• p(y1 = F) = 4/8
• p(y = F | y* = B) = 10/11
• p(x = H | y = F) = ?

HIDDEN MARKOV MODELS
• Supervised learning:
• Training data: paired sequences of states and observations
(y1, y2, …, yT, x1, x2, …, xT)


• p(y1 = F) = 4/8
• p(y = F | y* = B) = 10/11
• p(x = H | y = F) = number of positions where state F emits H / number of all positions with state F = 17/36
HIDDEN MARKOV MODELS
• Supervised learning:
• Training data: paired sequences of states and observations
(y1, y2, …, yT, x1, x2, …, xT)
• p(y1 = y) = number of sequences starting with y / number of all sequences
• p(yt = y | yt-1 = y*) = number of (y*, y) transitions / number of all transitions out of y*
• p(xt = x | yt = y) = number of (y, x) emissions / number of all emissions from y
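These counting formulas translate directly into code; a minimal sketch in Python (the toy training pairs at the end are hypothetical, in the spirit of the F/B states and H/T observations above):

from collections import Counter

def train_hmm(sequences):
    """Estimate initial, transition, and emission probabilities by counting.

    `sequences` is a list of (state sequence, observation sequence) pairs.
    """
    start, trans, emit = Counter(), Counter(), Counter()
    trans_from, emit_from = Counter(), Counter()
    for states, observations in sequences:
        start[states[0]] += 1
        for t, (y, x) in enumerate(zip(states, observations)):
            emit[(y, x)] += 1
            emit_from[y] += 1
            if t > 0:
                trans[(states[t - 1], y)] += 1
                trans_from[states[t - 1]] += 1
    n = len(sequences)
    init_p = {y: c / n for y, c in start.items()}
    trans_p = {(y_prev, y): c / trans_from[y_prev] for (y_prev, y), c in trans.items()}
    emit_p = {(y, x): c / emit_from[y] for (y, x), c in emit.items()}
    return init_p, trans_p, emit_p

# Hypothetical toy data: two paired sequences of states (F/B) and observations (H/T).
data = [(["F", "F", "B"], ["H", "T", "H"]), (["B", "F", "F"], ["T", "H", "H"])]
print(train_hmm(data))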


HIDDEN MARKOV MODELS
• Supervised learning example:

[Diagram: a two-state HMM with hidden states F and B and observations H and T; the parameters to estimate are marked with “?”: the initial probabilities p(F) and p(B), the transition probabilities p(F | F), p(B | F), p(F | B), p(B | B), and the emission probabilities p(H | F), p(T | F), p(H | B), p(T | B).]
