Constructing Biological Knowledge Bases by Extracting Information From Text Sources
Figure 1: An illustration of the IE task. On the left are sentences from MEDLINE abstracts. On the right are instances of
the subcellular-localization relation that we might extract from these sentences.
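The mapping that Figure 1 illustrates begins with a simple candidate-generation step: any sentence containing both a protein-lexicon word and a location-lexicon word yields candidate relation instances. The sketch below is a minimal illustration of that step only (the lexicons and sentence are made up for the example); as described in the text, a statistical classifier must still filter these candidates.

```python
def candidate_instances(sentence_words, protein_lexicon, location_lexicon):
    """Candidate subcellular-localization(x, y) pairs: every protein-lexicon
    word x paired with every location-lexicon word y in the sentence."""
    proteins = [w for w in sentence_words if w in protein_lexicon]
    locations = [w for w in sentence_words if w in location_lexicon]
    return [(x, y) for x in proteins for y in locations]

print(candidate_instances(
    ["the", "PRP20", "protein", "was", "localized", "in", "the", "nucleus"],
    protein_lexicon={"PRP20", "secretin"},
    location_lexicon={"nucleus", "mitochondrion"}))  # prints [('PRP20', 'nucleus')]
```

Note that a sentence mentioning several proteins and several locations yields the full cross product of pairs, which is exactly the limitation discussed below.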
task of extracting instances of a binary relation, r(X, Y). This approach assumes that for the variables of the relation, X and Y, we are given semantic lexicons, L(X) and L(Y), of the possible words that could be used in instances of r. For example, the second constant of each instance of the relation subcellular-localization, described in the previous section, is in the semantic class Subcellular-Structure. Our semantic lexicon for this class consists of words like nucleus, mitochondrion,² etc. Given such lexicons, the first step in this approach is to identify the instances in a document that could possibly express the relation. In the work reported here, we make the assumption that these instances consist of individual sentences. Thus, we can frame the information-extraction task as one of sentence classification. We extract a relation instance r(x, y) from the sentence if (i) the sentence contains a word x ∈ L(X) and a word y ∈ L(Y), and (ii) the sentence is classified as a positive instance by a statistical model. Otherwise, we consider the sentence to be a negative instance and we do not extract anything from it. We can learn the statistical model used for classification from labeled positive and negative instances (i.e. sentences that describe instances of the relation, and sentences that do not).

As stated above, we make the assumption that instances consist of individual sentences. It would be possible, however, to define instances to be larger chunks of text (e.g. paragraphs) or smaller ones (e.g. sentence clauses) instead. One limitation of this approach is that it forces us to assign only one class label to each instance. Consider, for example, a sentence that mentions multiple proteins and multiple subcellular locations. The sentence may specify that only some of these proteins are found in only some of the locations. However, we can only classify the sentence as being a member of the positive class, in which case we extract all protein/location pairs as instances of the target relation, or we classify the sentence as a negative instance, in which case we extract no relation instances from the sentence. This limitation provides an argument for setting up the task so that instances are relatively small.

² Our lexicons also include adjectives and the plural forms of nouns.

In order to learn models for classifying sentences, we use a statistical text-classification method. Specifically, we use a Naive Bayes classifier with a bag-of-words representation (Mitchell 1997). This approach involves representing each document (i.e. sentence) as a bag of words. The key assumption made by the bag-of-words representation is that the position of a word in a document does not matter (e.g. encountering the word protein at the beginning of a document is the same as encountering it at the end).

Given a document d of n words (w_1, w_2, ..., w_n), Naive Bayes estimates the probability that the document belongs to each possible class c_j ∈ C as follows:

    Pr(c_j | d) = Pr(c_j) Pr(d | c_j) / Pr(d) = Pr(c_j) ∏_{i=1}^{n} Pr(w_i | c_j) / Pr(d).    (1)

In addition to the position-independence assumption implicit in the bag-of-words representation, Naive Bayes makes the assumption that the occurrence of a given word in a document is independent of all other words in the document. Clearly, this assumption does not hold in real text documents. However, in practice, Naive Bayes classifiers often perform quite well (Domingos & Pazzani 1997; Lewis & Ringuette 1994).

The prior probability of the document, Pr(d), does not need to be estimated directly. Instead we can get the denominator by normalizing over all of the classes. The conditional probability, Pr(w_i | c_j), of seeing word w_i given class c_j is estimated from the training data. In order to make these estimates robust with respect to infrequently encountered words, we use Laplace estimates:

    Pr(w_i | c_j) = (N(w_i, c_j) + 1) / (N(c_j) + T),    (2)

where N(w_i, c_j) is the number of times word w_i appears in training set examples from class c_j, N(c_j) is the total number of words in the training set for class c_j, and T is the total number of unique words in the training set.

Before applying Naive Bayes to our documents, we first preprocess them by stemming words. Stemming refers to the process of heuristically reducing words to their root form (Porter 1980). For example, the words localize, localized and localization would be stemmed to the root local. The motivation for this step is to make commonalities in related sentences more apparent to the learner.
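The estimates in equations (1) and (2) can be sketched in a few lines of code. This is a minimal illustration under our own assumptions (the toy "sentences" and function names are not from the paper); the posterior is computed in log space and normalized over the classes, which mirrors the observation that Pr(d) is obtained by normalization rather than estimated directly.

```python
import math
from collections import Counter

def train_naive_bayes(docs):
    """docs: list of (word_list, label) pairs. Returns class priors and a
    Laplace-smoothed likelihood Pr(w | c), as in equation (2)."""
    classes = {label for _, label in docs}
    priors = {c: sum(1 for _, l in docs if l == c) / len(docs) for c in classes}
    counts = {c: Counter() for c in classes}                # N(w, c)
    for words, label in docs:
        counts[label].update(words)
    totals = {c: sum(counts[c].values()) for c in classes}  # N(c)
    T = len({w for words, _ in docs for w in words})        # unique training words
    def likelihood(w, c):
        return (counts[c][w] + 1) / (totals[c] + T)
    return priors, likelihood

def posterior(words, priors, likelihood):
    """Pr(c | d) from equation (1), normalized over the classes."""
    log_score = {c: math.log(priors[c]) + sum(math.log(likelihood(w, c)) for w in words)
                 for c in priors}
    m = max(log_score.values())
    unnorm = {c: math.exp(s - m) for c, s in log_score.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

# Toy example: two stemmed sentences represented as bags of words.
docs = [(["protein", "local", "nucleu"], "pos"),
        (["protein", "bind", "receptor"], "neg")]
priors, likelihood = train_naive_bayes(docs)
print(posterior(["local", "nucleu"], priors, likelihood))
```

With these two toy documents, the held-out bag ["local", "nucleu"] receives a posterior of 0.8 for the positive class.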
To evaluate our approach, we assembled a corpus of abstracts from the MEDLINE database. This corpus, consisting of 2,889 abstracts, was collected by querying on the names of six proteins and then downloading the first 500 articles returned for each query protein, discarding entries that did not include an abstract. We selected the six proteins for their diversity and for their relevance to the research of one of our collaborators. The six proteins/polypeptides are: serotonin (a neurotransmitter), secretin (a hormone), NMDA receptor (a receptor), collagen (a structural protein), trypsinogen (an enzyme), and calcium channel (an ion channel).

We created a labeled data set for our IE experiments as follows. One of us (Kumlien), who is trained in medicine and clinical chemistry, hand-annotated each abstract in the corpus with instances of the target relation subcellular-localization. To determine if an abstract should be annotated with a given instance, subcellular-localization(x, y), the abstract had to clearly indicate that protein x is found in location y. To aid in this labeling process, we wrote software that searched the abstracts for words from the location lexicon, and suggested candidate instances based on search hits. This labeling process resulted in a total of thirty-three instances of the subcellular-localization relation. Individual instances were found in from one to thirty different abstracts. For example, the fact that calcium channels are found in the sarcoplasmic reticulum was indicated in eight different abstracts.

The goal of the information-extraction task is to correctly identify the instances of the target relation that are represented in the corpus, without predicting spurious instances. Furthermore, although each instance of the target relation, such as subcellular-localization(calcium-channels, sarcoplasmic-reticulum), may be represented multiple times in the corpus, we consider the information-extraction method to be correct as long as it extracts this instance from one of its occurrences. We estimate the accuracy of our learned sentence classifiers using leave-one-out cross validation. Thus, for every sentence in the data set, we induce a classifier using the other sentences as training data, and then treat the held-out sentence as a test case. We compare our learned information extractors against a baseline method that we refer to as the sentence co-occurrence predictor. This method predicts that a relation holds if a protein and a sub-cellular location occur in the same sentence.

We consider using our learned Naive Bayes models in two ways. In the first method, we use them as classifiers: given an instance, the model either classifies it as positive and returns an extracted relation instance, or the model classifies it as negative and extracts nothing. To use Naive Bayes for classification, we simply return the most probable class. In the second method, the model returns its estimated posterior probability that the instance is positive. With this method, we do not strictly accept or reject sentences.

For each method, we rank its predictions by a confidence measure. For a given relation instance, r(x, y), we first collect the set of sentences that would assert this relation if classified into the positive class (i.e. those sentences that contain both the term x and the term y). For the sentence co-occurrence predictor, we rank a predicted relation instance by the size of this set. When we use the Naive Bayes models as classifiers, we rank a predicted relation instance by the number of sentences in this set that are classified as belonging to the positive class. In the second method, where we use the probabilities produced by Naive Bayes, we estimate the posterior probability that each sentence is in the positive class and combine the class probabilities using the noisy-or function (Pearl 1988):

    confidence = 1 − ∏_{k=1}^{N} [1 − Pr(c = pos | s_k)].

Here, Pr(c = pos | s_k) is the probability estimated by Naive Bayes for the kth element of our set of sentences. This combination function assumes that each sentence in the set provides independent evidence for the truth of the asserted relation.

Since we have a way to rank the predictions produced by each of our methods, we can see how the accuracy of their predictions varies with confidence. Figure 2 plots precision versus recall for the three methods on the task of extracting instances of the subcellular-localization relation. Precision and recall are defined as follows:

    precision = (# correct positive predictions) / (# positive predictions),

    recall = (# correct positive predictions) / (# positive instances).

Figure 2 illustrates several interesting results. The most significant result is that both versions of the Naive Bayes predictor generally achieve higher levels of precision than the sentence co-occurrence predictor. For example, at 25% recall, the precision of the baseline predictor is 44%, whereas for the Naive Bayes classifiers it is 70%, and for the Naive Bayes models using noisy-or combination it is 62%. This result indicates that the learning algorithm has captured some of the statistical regularities that arise in how authors describe the subcellular localization of a protein. None of the methods is able to achieve 100% recall since some positive relation instances are not represented by individual sentences. In the limit, the recall of the Naive Bayes classifiers is not as high as it is for the baseline predictor because the former incorrectly classifies as negative some sentences representing positive instances. Since the Naive Bayes models with noisy-or do not reject any sentences in this way, their recall is the same as the baseline method. Their precision is lower than that of the Naive Bayes classifier.
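The three confidence measures above can be sketched together. This is a minimal illustration, not the paper's implementation; the per-sentence probabilities are assumed to come from a Naive Bayes model like the one described earlier, and the 0.5 threshold for the classifier variant is our own assumption about "most probable class" with two classes.

```python
from math import prod

def noisy_or(sentence_probs):
    """Noisy-or combination: confidence = 1 - prod_k (1 - Pr(c = pos | s_k))."""
    return 1.0 - prod(1.0 - p for p in sentence_probs)

def rank_candidates(candidates, method="noisy-or"):
    """candidates maps each candidate instance r(x, y) to the list of
    positive-class probabilities of the sentences containing both x and y.
    Returns the candidates ranked by the chosen confidence measure."""
    if method == "co-occurrence":        # size of the supporting sentence set
        score = len
    elif method == "classifier":         # count of sentences classified positive
        score = lambda probs: sum(1 for p in probs if p > 0.5)
    else:                                # noisy-or over the probabilities
        score = noisy_or
    return sorted(candidates, key=lambda r: score(candidates[r]), reverse=True)

cands = {("calcium-channel", "sarcoplasmic-reticulum"): [0.9, 0.6],
         ("secretin", "nucleus"): [0.2]}
print(rank_candidates(cands))  # the two-sentence instance ranks first
```

On this toy input all three measures agree; they diverge when an instance is supported by many weakly positive sentences, which noisy-or rewards but the classifier count does not.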
Each of these YPD entries includes a reference (i.e., the PubMed entry for the reference) to the article that established the subcellular localization fact. Thus, each of these entries along with its reference could be used as a weakly labeled instance for learning our subcellular-localization relation.

Figure 3: Precision vs. recall for the Naive Bayes model trained on the YPD data set. (The plot compares the sentence co-occurrence predictor, the Naive Bayes classifier, and the Naive Bayes model with noisy-or.)

As Figure 3 shows, the Naive Bayes model trained on the YPD data achieves better precision at comparable levels of recall than the sentence co-occurrence classifier. These two results support our hypothesis. It should be emphasized that the result of this experiment was not a foregone conclusion. Although the YPD data set contains many more positive instances than our hand-labeled data set, this data set represents a very different distribution of text than our test set. The YPD data set has a particular focus on the localization of yeast proteins. The test set, in contrast, does not concentrate on protein localization and barely mentions yeast. We argue that the result of this experiment is very significant because it indicates that effective information-extraction routines can be learned without an expensive hand-coding or hand-labeling process.

One way to obtain insight into our learned text classifiers is to ask which words contribute most highly to the quantity Pr(pos | d) (i.e. the predicted probability that a document d belongs to the positive class). To measure this, we calculate

    log [ Pr(w_i | pos) / Pr(w_i | neg) ]    (3)

for each word w_i in the vocabulary of the model learned from the YPD data set. Figure 4 shows the twenty stemmed words, excluding words that refer to specific subcellular locations, that have the greatest value of this log-odds ratio. The vocabulary for this learned model includes more than 2500 stemmed words. As the table illustrates, many of the highly weighted words are intuitively natural predictors of sentences that describe subcellular-localization facts. The words in this set include local, insid, immunofluoresc, immunoloc, accumul, and microscopi. Some of the highly weighted words, however, are not closely associated with the concept of subcellular localization. Instead, their relatively large weights simply reflect the fact that it is difficult to reliably estimate such probabilities from limited training data.

    local           0.00571
    pmr             0.00306
    dpap            0.00259
    insid           0.00209
    indirect        0.00191
    galactosidas    0.00190
    immunofluoresc  0.00182
    secretion       0.00181
    mcm             0.00157
    mannosidas      0.00157
    sla             0.00156
    gdpase          0.00156
    bafilomycin     0.00154
    marker          0.00141
    presequ         0.00125
    immunoloc       0.00125
    snc             0.00121
    stain           0.00115
    accumul         0.00114
    microscopi      0.00112

Figure 4: The twenty stemmed words (aside from words referring to specific subcellular locations) weighted most highly by the YPD-trained text classifier. The weights represent the log-odds ratio of the words given the positive class.

Extraction via Relational Learning

The primary limitation of the statistical classification approach to IE presented in the preceding sections is that it does not represent the linguistic structure of the text being analyzed. In deciding whether a given sentence encodes an instance of the target relation or not, the statistical text classifiers consider only what words occur in the sentence, not their relationships to one another. Surely, however, the grammatical structure of the sentence is important for our task.

To learn information extractors that are able to represent grammatical structure, we have begun exploring an approach that involves parsing sentences, and learning relational rules in terms of these parses. Our approach uses a sentence analyzer called Sundance (Riloff 1998) that assigns part-of-speech tags to words, and then builds a shallow parse tree that segments sentences into clauses and noun, verb, or prepositional phrases. Figure 5 shows the parse tree built by Sundance for one sentence in our corpus. The numbers shown in brackets next to the root and each phrase in the tree are identifiers that we can use to refer to a particular sentence in the corpus or to a particular phrase in a sentence.

Given these parses, we learn information-extraction rules using a relational learning algorithm that is similar to Foil (Quinlan 1990). The appeal of using a relational method for this task is that it can naturally represent relationships among sentence constituents in learned rules, and it can represent an arbitrary amount of context around the parts of the sentence to be extracted.
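One way to picture the shallow parses such an analyzer produces is as a sequence of typed, numbered phrases, each optionally containing constituent phrases. The sketch below is our own hypothetical rendering of that data structure (the Phrase class and the exact word groupings are illustrative assumptions), using the Figure 5 sentence.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Phrase:
    ident: int            # bracketed identifier, as in Figure 5
    ptype: str            # noun-phrase, verb-phrase, or prepositional-phrase
    words: List[str]
    constituents: List["Phrase"] = field(default_factory=list)

# The Figure 5 sentence as a flat sequence of top-level phrases.
sentence_25 = [
    Phrase(0, "prepositional-phrase", ["By", "immunofluorescence", "microscopy"],
           [Phrase(1, "noun-phrase", ["immunofluorescence", "microscopy"])]),
    Phrase(2, "noun-phrase", ["the", "PRP20", "protein"]),
    Phrase(3, "verb-phrase", ["was", "localized"]),
    Phrase(4, "prepositional-phrase", ["in", "the", "nucleus"],
           [Phrase(5, "noun-phrase", ["the", "nucleus"])]),
]
```

From such a structure, the ground facts of Figure 6 (phrase types, phrase order, constituent phrases) can be read off directly.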
sentence [25]
  prep-phrase [0]: By immunofluorescence microscopy
    noun-phrase [1]: immunofluorescence microscopy
  noun-phrase [2]: the PRP20 protein
  verb-phrase [3]: was localized
  prep-phrase [4]: in the nucleus
    noun-phrase [5]: the nucleus

Figure 5: A parse tree produced by Sundance for one sentence in our YPD corpus.

phrase-type(phrase-0, prepositional-phrase).
phrase-type(phrase-1, noun-phrase).
phrase-type(phrase-2, noun-phrase).
phrase-type(phrase-3, verb-phrase).
phrase-type(phrase-4, prepositional-phrase).
phrase-type(phrase-5, noun-phrase).
next-phrase(phrase-0, phrase-2).
next-phrase(phrase-2, phrase-3).
next-phrase(phrase-3, phrase-4).
constituent-phrase(phrase-0, phrase-1).
constituent-phrase(phrase-4, phrase-5).
subject-verb(phrase-2, phrase-3).
localization-sentence(sentence-25, phrase-2, phrase-5).

Figure 6: Our relational representation of the parse shown in Figure 5.

The objective of the learning algorithm is to learn a definition for the predicate:

localization-sentence(Sentence-ID, Phrase-ID, Phrase-ID).

Each instance of this relation consists of (i) an identifier corresponding to the sentence represented by the instance, (ii) an identifier representing the phrase in the sentence that contains an entry in the protein lexicon, and (iii) an identifier representing the phrase in the sentence that contains an entry in the subcellular location lexicon. Thus, the learning task is to recognize pairs of phrases that correspond to positive instances of the target relation. The models learned by the relational learner consist of logical rules constructed from the following background relations:

phrase-type(Phrase-ID, Phrase-Type): This relation allows a particular phrase to be characterized as a noun phrase, verb phrase, or prepositional phrase.

next-phrase(Phrase-ID, Phrase-ID): This relation specifies the order of phrases in a sentence. Each instance of the relation indicates the successor of one particular phrase.

constituent-phrase(Phrase-ID, Phrase-ID): This relation indicates cases where one phrase is a constituent of another phrase. For example, in Figure 5, the first prepositional phrase in the sentence has a constituent noun phrase.

subject-verb(Phrase-ID, Phrase-ID), verb-direct-object(Phrase-ID, Phrase-ID): These relations enable the learner to link subject noun phrases to their corresponding verb phrases, and verb phrases to their corresponding direct object phrases.

same-clause(Phrase-ID, Phrase-ID): This relation links phrases that occur in the same sentence clause.

Training and test examples are described by instances of these relations. For example, Figure 6 shows the instances of the background and target relations that represent the parse tree shown in Figure 5. The constants used to represent the sentence and its phrases in Figure 6 correspond to the identifiers shown in brackets in Figure 5.

This set of background relations enables the learner to characterize the relations among phrases in sentences. Additionally, we also allow the learner to characterize the words in sentences and phrases. One approach to doing this would be to include another background relation whose instances linked individual words to the phrases and sentences in which they occur. We have investigated this approach and found that the learned rules often have low precision and/or recall because they are too dependent on the presence of particular words. The approach we use instead allows the learning algorithm to use Naive Bayes classifiers to characterize the words in sentences and phrases.

Figure 7 shows a rule learned by our relational method. The rule is satisfied when all of the literals to the right of the ":-" are satisfied. The first two literals specify that the rule is looking for sentences in which the phrase referencing the subcellular location follows the phrase referencing the protein, and there is one phrase separating them. The next literal specifies that the sentence must satisfy (i.e. be classified as positive by) a particular Naive Bayes classifier. The fourth literal indicates that the phrase referencing the protein must satisfy a Naive Bayes classifier. The two final literals specify a similar condition for the phrase referencing the subcellular location. The bottom part of Figure 7 shows the stemmed words that are weighted most highly by each of the Naive Bayes classifiers.

Although the Naive Bayes predicates used in the rule shown in Figure 7 appear to overlap somewhat, their differences are noticeable. For example, whereas the predicate that is applied to the Protein-Phrase highly weights the words protein, gene and product, the predicates that are applied to the Location-Phrase focus on subcellular locations and prepositions such as in, to and with.
localization-sentence(Sentence, Protein-Phrase, Location-Phrase) :-
    next-phrase(Protein-Phrase, Phrase-1),
    next-phrase(Phrase-1, Location-Phrase),
    sentence-naive-bayes-1(Sentence),
    phrase-naive-bayes-1(Protein-Phrase),
    phrase-naive-bayes-2(Location-Phrase),
    phrase-naive-bayes-3(Location-Phrase).

sentence-naive-bayes-1: nucleu, mannosidas, bifunct, local, galactosidas, nuclei, immunofluoresc, ...
phrase-naive-bayes-1: protein, beta, galactosidas, gene, alpha, mannosidas, bifunct, product, ...
phrase-naive-bayes-2: nucleu, nuclei, mitochondria, vacuol, plasma, insid, membran, atpas, ...
phrase-naive-bayes-3: the, nucleu, in, mitochondria, membran, nuclei, to, vacuol, yeast, with, ...

Figure 7: Top: a rule learned by our relational method. This rule includes four Naive Bayes predicates. Bottom: the most highly weighted words (using the log-odds ratio) in each of the Naive Bayes predicates.
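To make the rule's semantics concrete, the sketch below stores background relations as ground tuples (in the style of Figure 6) and checks a Figure 7-style rule body against them, with the Naive Bayes predicates stubbed out as callables. The representation and helper names are our own illustration, not the paper's system; to keep the example small it checks the simple successor chain phrase-2 → phrase-3 → phrase-4, whereas the actual target fact in Figure 6 refers to the constituent noun phrase (phrase-5).

```python
# Ground background facts, as (relation, arg1, arg2) tuples.
facts = {
    ("phrase-type", "phrase-2", "noun-phrase"),
    ("phrase-type", "phrase-3", "verb-phrase"),
    ("phrase-type", "phrase-4", "prepositional-phrase"),
    ("next-phrase", "phrase-0", "phrase-2"),
    ("next-phrase", "phrase-2", "phrase-3"),
    ("next-phrase", "phrase-3", "phrase-4"),
    ("subject-verb", "phrase-2", "phrase-3"),
}

def next_phrase(a, b):
    return ("next-phrase", a, b) in facts

def rule_matches(sentence, protein_phrase, location_phrase,
                 sent_nb, prot_nb, loc_nb_2, loc_nb_3):
    """Body of a Figure 7-style rule: the location phrase follows the protein
    phrase with exactly one phrase between them, and the sentence and phrases
    satisfy the (stubbed) Naive Bayes predicates."""
    middles = [m for (rel, a, m) in facts
               if rel == "next-phrase" and a == protein_phrase]
    return (any(next_phrase(m, location_phrase) for m in middles)
            and sent_nb(sentence)
            and prot_nb(protein_phrase)
            and loc_nb_2(location_phrase)
            and loc_nb_3(location_phrase))

always = lambda _: True   # stand-in for a trained Naive Bayes predicate
print(rule_matches("sentence-25", "phrase-2", "phrase-4",
                   always, always, always, always))  # prints True
```

In the full system each stub would be a trained Naive Bayes classifier applied to the words of the sentence or phrase, returning whether the positive class is most probable.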
is initialized with these literals, the learning algorithm uses a hill-climbing search to add additional literals. The algorithm can either add a literal expressed using