Appears in Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology (ISMB-99).

Constructing Biological Knowledge Bases by Extracting Information from Text Sources
Mark Craven and Johan Kumlien
School of Computer Science
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, Pennsylvania, 15213-3891, U.S.A.
mark.craven@cs.cmu.edu johan.kumlien@cs.cmu.edu

Abstract

Recently, there has been much effort in making databases for molecular biology more accessible and interoperable. However, information in text form, such as MEDLINE records, remains a greatly underutilized source of biological information. We have begun a research effort aimed at automatically mapping information from text sources into structured representations, such as knowledge bases. Our approach to this task is to use machine-learning methods to induce routines for extracting facts from text. We describe two learning methods that we have applied to this task, a statistical text classification method and a relational learning method, and our initial experiments in learning such information-extraction routines. We also present an approach to decreasing the cost of learning information-extraction routines by learning from "weakly" labeled training data.

Copyright © 1999, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Introduction

The science of molecular biology has been greatly affected by the proliferation of the Internet in recent years. There are now hundreds of on-line databases characterizing biological information such as sequences, structures, molecular interactions and expression patterns. Moreover, there are servers that perform such tasks as identifying genes in DNA sequences (e.g. GRAIL, Xu et al., 1996) and predicting protein secondary structures (e.g. PredictProtein, Rost, 1996). And there are systems that integrate information from various sources (e.g. The Genome Channel, Genome Annotation Consortium, 1999), provide interoperability among distributed databases (e.g. Entrez, National Center for Biotechnology Information, 1999) and support knowledge-based reasoning (e.g. EcoCyc, Karp et al., 1997). Another rich source of on-line information is the scientific literature. The MEDLINE database, for example, provides bibliographic information and abstracts for more than nine million articles that have been published in biomedical journals. A fundamental limitation of MEDLINE and similar sources, however, is that the information they contain is not represented in a structured format, but instead in natural language text. The goal of our research is to develop methods that can inexpensively and accurately map information in scientific text sources, such as MEDLINE, into a structured representation, such as a knowledge base or a database. Toward this end, we have developed novel methods for automatically extracting key facts from scientific texts.

Current systems for accessing MEDLINE (e.g. PubMed, National Library of Medicine, 1999a) accept keyword-based queries to text sources and return documents that are (hopefully) relevant to the query. Our goal, in contrast, is to support the kinds of arbitrarily complex queries that current database systems handle, and to return actual answers rather than relevant documents. The system we are developing is motivated by several different types of tasks that we believe would greatly benefit from the ability to extract structured information from text:

- Database construction and updating. Our system could be used to help construct and update databases and knowledge bases by extracting fields from text. For example, we are currently working with a team that is developing a knowledge base of protein localization patterns (Boland, Markey, & Murphy 1996). We are using our system to assist in developing an ontology of localization patterns and to populate the database with text-extracted facts describing the localization patterns of individual proteins. In a similar vein, our system could be used to update databases that track particular classes of mutation studies (Lathrop et al. 1998), and to provide automatic genome annotation for a system such as The Genome Channel (Genome Annotation Consortium 1999) or EcoCyc (Karp et al. 1997).

- Summarization. Another promising application of our system is to provide structured summaries of what is known about particular biological objects. For example, we are working with scientists who are studying the genetic basis of diseases by identifying expressed sequence tags that are differentially expressed in tissues in various states. Frequently, these scientists do time-consuming MEDLINE searches to determine if some candidate gene product is likely to be related to the disease of interest.
When performing these searches, the scientists typically are trying to answer such questions as: In what types of tissues, cells and subcellular locations is the protein known to be expressed? Is the protein known to be associated with any diseases? Is the protein known to interact with any pharmacological agents? We plan to partially automate the task of extracting answers to these questions from text.

- Discovery. An especially compelling application of our system is its potential application to scientific discovery. The articles in MEDLINE describe a vast web of relationships among the genes, proteins, pathways, tissues and diseases of various systems and organisms of interest. Moreover, each article describes only a small piece of this web. The work of Swanson et al. (Swanson & Smalheiser 1997) has demonstrated that significant but previously unknown relationships among entities (e.g. magnesium and migraine headaches) can be discovered by automatically eliciting this information from the literature. Swanson's algorithm detects relationships among objects simply by considering the statistics of word co-occurrences in article titles. We conjecture that such relationships can be detected more accurately by our method of analyzing sentences in the article's abstract or text. Moreover, whereas Swanson's algorithm posits only that some relation holds between a pair of objects, our system is designed to state what the specific relation is.

One conceivable approach to devising a system to solve tasks such as these would be to perform full natural language understanding of the text. This undertaking, however, is well beyond the capabilities of current natural language systems. Our approach is to treat the task as one of information extraction. Information extraction (IE) involves a limited form of natural language processing in which the system tries only to extract predefined classes of facts from the text. A key aspect of our approach is that we use machine-learning algorithms to induce our information extractors.

In the following section, we describe the information-extraction task in more detail. We then describe a statistical text-classification approach to learning information extractors, and present an empirical evaluation of this method. A key limitation of using machine-learning methods to induce information-extraction methods is that the process of labeling training examples is expensive. The fourth section of the paper presents an approach to learning information extractors that exploits existing databases to automatically label training examples. The promise of this approach is that it can greatly reduce the cost of assembling sets of labeled training data. We then present a second approach to learning information extractors that exploits more linguistic knowledge than our initial approach. Finally, we discuss related work, the contributions and limitations of our work, and some directions we are pursuing in our current research.

The Information Extraction Task

The general information extraction task can be formulated as follows:

Given: (i) a set of classes of interest and relations among these classes, and (ii) a corpus of documents to be processed.

Do: extract from the documents instances of the classes and relations that are described in the documents.

This limited form of natural language understanding has been the focus of much research over the past decade (Cowie & Lehnert 1996; Cardie 1997). Most of the work in this community has involved hand-coding extraction routines. However, in recent years there have been several research efforts investigating the application of machine learning methods to inducing information extractors (Riloff 1996; Soderland 1996; Califf 1998; Freitag 1998; Soderland 1999). Machine learning methods offer a promising alternative to hand-coding IE routines because they can greatly reduce the amount of time and effort required to develop such methods.

In the applications we are addressing, we are primarily interested in extracting instances of relations among objects. In particular, we want to learn extractors for the following:[1]

- subcellular-localization(Protein, Subcellular-Structure): the instances of this relation represent proteins and the subcellular structures in which they are found.

- cell-localization(Protein, Cell-Type): the cell types in which a given protein is found.

- tissue-localization(Protein, Tissue): the tissue types in which a given protein is found.

- associated-diseases(Protein, Disease): the diseases with which a given protein is known to have some association.

- drug-interactions(Protein, Pharmacologic-Agent): the pharmacologic agents with which a given protein is known to interact.

In our initial experiments we are focusing on the subcellular-localization relation. As an example of the IE task, Figure 1 shows several sentences and the instances of the subcellular-localization relation that we would like to extract from them.

[1] We use the following notation to describe relations: constants, such as the names of specific relations and the objects they characterize, start with lowercase letters; the names of variables begin with uppercase letters.

Extraction via Text Classification

Our first approach to learning information extractors uses a statistical text classification method.
Figure 1: An illustration of the IE task. On the left are sentences from MEDLINE abstracts. On the right are instances of the subcellular-localization relation that we might extract from these sentences.

Sentence: "Immunoprecipitation of biotinylated type XIII collagen from surface-labeled HT-1080 cells, subcellular fractionation, and immunofluorescence staining were used to demonstrate that type XIII collagen molecules are indeed located in the plasma membranes of these cells."
Extracted instance: subcellular-localization(collagen, plasma-membranes)

Sentence: "HSP47 is a collagen-binding stress protein and is thought to be a collagen-specific molecular chaperone, which plays a pivotal role during the biosynthesis and secretion of collagen molecules in the endoplasmic reticulum."
Extracted instance: subcellular-localization(collagen, endoplasmic-reticulum)

Without loss of generality, assume that we are addressing the task of extracting instances of a binary relation, r(X, Y). This approach assumes that for the variables of the relation, X and Y, we are given semantic lexicons, L(X) and L(Y), of the possible words that could be used in instances of r. For example, the second constant of each instance of the relation subcellular-localization, described in the previous section, is in the semantic class Subcellular-Structure. Our semantic lexicon for this class consists of words like nucleus, mitochondrion,[2] etc. Given such lexicons, the first step in this approach is to identify the instances in a document that could possibly express the relation. In the work reported here, we make the assumption that these instances consist of individual sentences. Thus, we can frame the information-extraction task as one of sentence classification. We extract a relation instance r(x, y) from the sentence if (i) the sentence contains a word x ∈ L(X) and a word y ∈ L(Y), and (ii) the sentence is classified as a positive instance by a statistical model. Otherwise, we consider the sentence to be a negative instance and we do not extract anything from it. We can learn the statistical model used for classification from labeled positive and negative instances (i.e. sentences that describe instances of the relation, and sentences that do not).

[2] Our lexicons also include adjectives and the plural forms of nouns.

As stated above, we make the assumption that instances consist of individual sentences. It would be possible, however, to define instances to be larger chunks of text (e.g. paragraphs) or smaller ones (e.g. sentence clauses) instead. One limitation of this approach is that it forces us to assign only one class label to each instance. Consider, for example, a sentence that mentions multiple proteins and multiple subcellular locations. The sentence may specify that only some of these proteins are found in only some of the locations. However, we can only classify the sentence as being a member of the positive class, in which case we extract all protein/location pairs as instances of the target relation, or we classify the sentence as a negative instance, in which case we extract no relation instances from the sentence. This limitation provides an argument for setting up the task so that instances are relatively small.

In order to learn models for classifying sentences, we use a statistical text-classification method. Specifically, we use a Naive Bayes classifier with a bag-of-words representation (Mitchell 1997). This approach involves representing each document (i.e. sentence) as a bag of words. The key assumption made by the bag-of-words representation is that the position of a word in a document does not matter (e.g. encountering the word protein at the beginning of a document is the same as encountering it at the end).

Given a document d of n words (w1, w2, ..., wn), Naive Bayes estimates the probability that the document belongs to each possible class cj ∈ C as follows:

    Pr(cj | d) = Pr(cj) Pr(d | cj) / Pr(d) ≈ Pr(cj) ∏_{i=1..n} Pr(wi | cj) / Pr(d).    (1)

In addition to the position-independence assumption implicit in the bag-of-words representation, Naive Bayes makes the assumption that the occurrence of a given word in a document is independent of all other words in the document. Clearly, this assumption does not hold in real text documents. However, in practice, Naive Bayes classifiers often perform quite well (Domingos & Pazzani 1997; Lewis & Ringuette 1994).

The prior probability of the document, Pr(d), does not need to be estimated directly. Instead we can get the denominator by normalizing over all of the classes. The conditional probability, Pr(wi | cj), of seeing word wi given class cj is estimated from the training data. In order to make these estimates robust with respect to infrequently encountered words, we use Laplace estimates:

    Pr(wi | cj) = (N(wi, cj) + 1) / (N(cj) + T),    (2)

where N(wi, cj) is the number of times word wi appears in training set examples from class cj, N(cj) is the total number of words in the training set for class cj, and T is the total number of unique words in the training set.

Before applying Naive Bayes to our documents, we first preprocess them by stemming words. Stemming refers to the process of heuristically reducing words to their root form (Porter 1980). For example, the words localize, localized and localization would be stemmed to the root local. The motivation for this step is to make commonalities in related sentences more apparent to the learner.
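The learning component described in this section, a bag-of-words Naive Bayes classifier with the Laplace estimates of Equation 2, can be sketched in a few lines. This is an illustrative reimplementation, not the authors' code, and the toy training sentences below are invented:

```python
import math
from collections import Counter

class NaiveBayesText:
    """Bag-of-words Naive Bayes with Laplace smoothing (Equations 1 and 2)."""

    def fit(self, docs, labels):
        # docs: list of token lists (e.g. stemmed sentences); labels: class names
        self.classes = sorted(set(labels))
        self.priors = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}   # N(w, c)
        for words, c in zip(docs, labels):
            self.counts[c].update(words)
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}  # N(c)
        self.vocab_size = len({w for words in docs for w in words})            # T

    def log_posterior(self, c, words):
        # log Pr(c) + sum_i log Pr(w_i | c), with Laplace estimates (Eq. 2);
        # the normalizing Pr(d) is omitted since it is the same for all classes
        lp = math.log(self.priors[c])
        for w in words:
            lp += math.log((self.counts[c][w] + 1) /
                           (self.totals[c] + self.vocab_size))
        return lp

    def predict(self, words):
        return max(self.classes, key=lambda c: self.log_posterior(c, words))
```

For example, fitting on two invented sentences, `[["protein", "in", "nucleus"], ["gene", "was", "cloned"]]` with labels `["pos", "neg"]`, yields a model whose `predict(["protein", "nucleus"])` returns the positive class.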
To evaluate our approach, we assembled a corpus of abstracts from the MEDLINE database. This corpus, consisting of 2,889 abstracts, was collected by querying on the names of six proteins and then downloading the first 500 articles returned for each query protein, discarding entries that did not include an abstract. We selected the six proteins for their diversity and for their relevance to the research of one of our collaborators. The six proteins/polypeptides are: serotonin (a neurotransmitter), secretin (a hormone), NMDA receptor (a receptor), collagen (a structural protein), trypsinogen (an enzyme), and calcium channel (an ion channel).

We created a labeled data set for our IE experiments as follows. One of us (Kumlien), who is trained in medicine and clinical chemistry, hand-annotated each abstract in the corpus with instances of the target relation subcellular-localization. To determine if an abstract should be annotated with a given instance, subcellular-localization(x, y), the abstract had to clearly indicate that protein x is found in location y. To aid in this labeling process, we wrote software that searched the abstracts for words from the location lexicon, and suggested candidate instances based on search hits. This labeling process resulted in a total of thirty-three instances of the subcellular-localization relation. Individual instances were found in from one to thirty different abstracts. For example, the fact that calcium channels are found in the sarcoplasmic reticulum was indicated in eight different abstracts.

The goal of the information-extraction task is to correctly identify the instances of the target relation that are represented in the corpus, without predicting spurious instances. Furthermore, although each instance of the target relation, such as subcellular-localization(calcium-channels, sarcoplasmic-reticulum), may be represented multiple times in the corpus, we consider the information-extraction method to be correct as long as it extracts this instance from one of its occurrences. We estimate the accuracy of our learned sentence classifiers using leave-one-out cross validation. Thus, for every sentence in the data set, we induce a classifier using the other sentences as training data, and then treat the held-out sentence as a test case. We compare our learned information extractors against a baseline method that we refer to as the sentence co-occurrence predictor. This method predicts that a relation holds if a protein and a sub-cellular location occur in the same sentence.

We consider using our learned Naive Bayes models in two ways. In the first method, we use them as classifiers: given an instance, the model either classifies it as positive and returns an extracted relation instance, or the model classifies it as negative and extracts nothing. To use Naive Bayes for classification, we simply return the most probable class. In the second method, the model returns its estimated posterior probability that the instance is positive. With this method, we do not strictly accept or reject sentences.

For each method, we rank its predictions by a confidence measure. For a given relation instance, r(x, y), we first collect the set of sentences that would assert this relation if classified into the positive class (i.e. those sentences that contain both the term x and the term y). For the sentence co-occurrence predictor, we rank a predicted relation instance by the size of this set. When we use the Naive Bayes models as classifiers, we rank a predicted relation instance by the number of sentences in this set that are classified as belonging to the positive class. In the second method, where we use the probabilities produced by Naive Bayes, we estimate the posterior probability that each sentence is in the positive class and combine the class probabilities using the noisy-or function (Pearl 1988):

    confidence = 1 − ∏_{k=1..N} [1 − Pr(c = pos | sk)].

Here, Pr(c = pos | sk) is the probability estimated by Naive Bayes for the kth element of our set of sentences. This combination function assumes that each sentence in the set provides independent evidence for the truth of the asserted relation.

Since we have a way to rank the predictions produced by each of our methods, we can see how the accuracy of their predictions varies with confidence. Figure 2 plots precision versus recall for the three methods on the task of extracting instances of the subcellular-localization relation. Precision and recall are defined as follows:

    precision = (# correct positive predictions) / (# positive predictions),

    recall = (# correct positive predictions) / (# positive instances).

Figure 2 illustrates several interesting results. The most significant result is that both versions of the Naive Bayes predictor generally achieve higher levels of precision than the sentence co-occurrence predictor. For example, at 25% recall, the precision of the baseline predictor is 44%, whereas for the Naive Bayes classifiers it is 70%, and for the Naive Bayes models using noisy-or combination it is 62%. This result indicates that the learning algorithm has captured some of the statistical regularities that arise in how authors describe the subcellular localization of a protein. None of the methods is able to achieve 100% recall since some positive relation instances are not represented by individual sentences. In the limit, the recall of the Naive Bayes classifiers is not as high as it is for the baseline predictor because the former incorrectly classifies as negative some sentences representing positive instances. Since the Naive Bayes models with noisy-or do not reject any sentences in this way, their recall is the same as the baseline method.
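The noisy-or confidence measure and the ranking of predicted relation instances can be sketched as follows. This is an illustrative sketch; in the system described here, the per-sentence probabilities would come from the Naive Bayes model, and the values below are invented:

```python
def noisy_or(sentence_probs):
    """Combine per-sentence estimates Pr(c = pos | s_k) into one confidence
    for a relation instance, assuming each sentence provides independent
    evidence (Pearl 1988)."""
    prob_all_wrong = 1.0
    for p in sentence_probs:
        prob_all_wrong *= (1.0 - p)   # probability that every sentence is wrong
    return 1.0 - prob_all_wrong

def rank_instances(instance_probs):
    """Rank predicted relation instances, most confident first.
    instance_probs maps each instance to its list of per-sentence
    positive-class probabilities."""
    return sorted(instance_probs,
                  key=lambda r: noisy_or(instance_probs[r]),
                  reverse=True)
```

For instance, two sentences each judged positive with probability 0.5 combine to a confidence of 1 − 0.5 × 0.5 = 0.75, so accumulating weak evidence across sentences raises the ranking of an instance.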
Their precision is lower than the Naive Bayes classifier, however, indicating that even when Naive Bayes makes accurate classifications, it often does not estimate probabilities well (Domingos & Pazzani 1997). An interesting possibility would be to combine these predictors to get the high precision of the Naive Bayes classifiers along with the high recall of the Naive Bayes models using noisy-or. Provost and Fawcett (1998) have developed a method especially well suited to this type of combination.

[Figure 2 plots precision vs. recall curves for the sentence co-occurrence predictor, the Naive Bayes classifier, and the Naive Bayes model with noisy-or; the plot itself is not reproduced here.]
Figure 2: Precision vs. recall for the co-occurrence predictor and the Naive Bayes models.

Exploiting Existing Databases for Training Data

We have argued that machine learning offers a promising alternative to hand-coding information extraction routines because the hand-coding process has proven to be so time-consuming. A limitation of the machine-learning approach, however, is that providing labeled training data to the learner is itself quite time-consuming and tedious. In fact, labeling the corpus used in the previous section required approximately 35 hours of an expert's time. In this section, we present an approach to learning information extractors that relies on existing databases to provide something akin to labeled training instances.

Our approach is motivated by the observation that, for many IE tasks, there are existing information sources (knowledge bases, databases, or even simple lists or tables) that can be coupled with documents to provide what we term "weakly" labeled training examples. We call this form of training data weakly labeled because each instance consists not of a precisely marked document, but instead of a fact to be extracted along with a document that may assert the fact. To make this concept more concrete, consider the Yeast Protein Database (YPD) (Hodges, Payne, & Garrels 1998), which includes a subcellular localization field for many proteins. Moreover, in some cases the entry for this field has a reference (and a hyperlink to the PubMed entry for the reference) to the article that established the subcellular localization fact. Thus, each of these entries along with its reference could be used as a weakly labeled instance for learning our subcellular-localization information extractors.

In this section we evaluate the utility of learning from weakly labeled training instances. From the YPD Web site, we collected 1,213 instances of the subcellular-localization relation that are asserted in the YPD database, and from PubMed we collected the abstracts from 924 articles that are pointed to by these entries in YPD. For many of the relation instances, the associated abstracts do not say anything about the subcellular localization of the reference protein, and thus they are not helpful to us. However, if we select the relation instances for which an associated abstract contains a sentence that mentions both the protein and a subcellular location, then we wind up with 336 relation instances described in 633 sentences. This data set contains significantly more relation instances than the one we obtained via hand-labeling, and it was acquired by a completely automated process.

As in the previous section, we treat individual sentences as instances to be processed by a Naive Bayes text classifier. Moreover, we make the assumption that every one of the 633 sentences mentioned above represents a positive training example for our text classifier. In other words, we assume that if we know that the relation subcellular-localization(x, y) holds, then any sentence in the abstract(s) associated with subcellular-localization(x, y) that references both x and y is effectively stating that x is located in y. Of course this assumption is not always valid in practice. We take the remaining sentences in the YPD corpus as negative training examples.

The hypothesis that we consider in this section is that it is possible to learn accurate information-extraction routines using weakly labeled training data, such as that we gathered from YPD. To test this hypothesis we train a Naive Bayes model using the YPD data as a training set, and then we evaluate it using our hand-labeled corpus as a test set. We train our statistical text classifier in the same manner as described in the previous section.

Figure 3 shows the precision vs. recall curves for this experiment. As a baseline, the figure also shows the precision/recall curve for the sentence co-occurrence predictor described in the previous section. Recall that the co-occurrence predictor does not use a training set in any way; it simply makes its predictions by noting co-occurrence statistics in the test set. Therefore, it is an appropriate baseline no matter what training set we use.

From this figure we can see that the curve for the Naive Bayes model learned from the YPD data is comparable to the curve for the models learned from the hand-labeled data. Whereas the Naive Bayes classifiers from the previous section achieved 69% precision at 30% recall, the Naive Bayes classifier trained on the YPD data reaches 77% precision at 30% recall.
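The weak-labeling step described above, keeping as positive examples the sentences that mention both arguments of a known relation instance, can be sketched as follows. This is a simplified illustration; exact token-level matching of protein and location names is an assumption here, and the matching actually used for the YPD corpus may differ:

```python
def weakly_label(known_instances, abstract_sentences):
    """known_instances: set of (protein, location) pairs taken from a database.
    abstract_sentences: maps each (protein, location) pair to the sentences of
    its referenced abstract(s), each sentence a list of lowercase tokens.
    Returns (positives, negatives): sentences that do / do not mention both
    arguments of their associated relation instance."""
    positives, negatives = [], []
    for protein, location in known_instances:
        for sentence in abstract_sentences.get((protein, location), []):
            if protein in sentence and location in sentence:
                positives.append(sentence)   # weakly labeled positive example
            else:
                negatives.append(sentence)
    return positives, negatives
```

No sentence is ever marked by hand: the database entry supplies the fact, and the co-mention heuristic decides which sentences are treated as stating it.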
Moreover, the YPD model achieves better precision at comparable levels of recall than the sentence co-occurrence classifier.

[Figure 3 plots precision vs. recall curves for the sentence co-occurrence predictor, the Naive Bayes classifier, and the Naive Bayes model with noisy-or; the plot itself is not reproduced here.]
Figure 3: Precision vs. recall for the Naive Bayes model trained on the YPD data set.

These two results support our hypothesis. It should be emphasized that the result of this experiment was not a foregone conclusion. Although the YPD data set contains many more positive instances than our hand-labeled data set, this data set represents a very different distribution of text than our test set. The YPD data set has a particular focus on the localization of yeast proteins. The test set, in contrast, does not concentrate on protein localization and barely mentions yeast. We argue that the result of this experiment is a very significant one because it indicates that effective information-extraction routines can be learned without an expensive hand-coding or hand-labeling process.

One way to obtain insight into our learned text classifiers is to ask which words contribute most highly to the quantity Pr(pos | d) (i.e. the predicted probability that a document d belongs to the positive class). To measure this, we calculate

    log[ Pr(wi | pos) / Pr(wi | neg) ]    (3)

for each word wi in the vocabulary of the model learned from the YPD data set. Figure 4 shows the twenty stemmed words, excluding words that refer to specific subcellular locations, that have the greatest value of this log-odds ratio. The vocabulary for this learned model includes more than 2500 stemmed words. As the table illustrates, many of the highly weighted words are intuitively natural predictors of sentences that describe subcellular-localization facts. The words in this set include local, insid, immunofluoresc, immunoloc, accumul, and microscopi. Some of the highly weighted words, however, are not closely associated with the concept of subcellular localization. Instead, their relatively large weights simply reflect the fact that it is difficult to reliably estimate such probabilities from limited training data.

    stemmed word      log[Pr(wi | pos) / Pr(wi | neg)]
    local             0.00571
    pmr               0.00306
    dpap              0.00259
    insid             0.00209
    indirect          0.00191
    galactosidas      0.00190
    immunofluoresc    0.00182
    secretion         0.00181
    mcm               0.00157
    mannosidas        0.00157
    sla               0.00156
    gdpase            0.00156
    bafilomycin       0.00154
    marker            0.00141
    presequ           0.00125
    immunoloc         0.00125
    snc               0.00121
    stain             0.00115
    accumul           0.00114
    microscopi        0.00112

Figure 4: The twenty stemmed words (aside from words referring to specific subcellular locations) weighted most highly by the YPD-trained text classifier. The weights represent the log-odds ratio of the words given the positive class.

Extraction via Relational Learning

The primary limitation of the statistical classification approach to IE presented in the preceding sections is that it does not represent the linguistic structure of the text being analyzed. In deciding whether a given sentence encodes an instance of the target relation or not, the statistical text classifiers consider only what words occur in the sentence, not their relationships to one another. Surely, however, the grammatical structure of the sentence is important for our task.

To learn information extractors that are able to represent grammatical structure, we have begun exploring an approach that involves parsing sentences and learning relational rules in terms of these parses. Our approach uses a sentence analyzer called Sundance (Riloff 1998) that assigns part-of-speech tags to words, and then builds a shallow parse tree that segments sentences into clauses and noun, verb, or prepositional phrases. Figure 5 shows the parse tree built by Sundance for one sentence in our corpus. The numbers shown in brackets next to the root and each phrase in the tree are identifiers that we can use to refer to a particular sentence in the corpus or to a particular phrase in a sentence.

Given these parses, we learn information-extraction rules using a relational learning algorithm that is similar to Foil (Quinlan 1990).
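Concretely, a shallow parse can be encoded as ground facts over a small set of background relations, and a candidate extraction rule is then a conjunctive test over those facts. A sketch in Python, with tuples standing in for the logical atoms of Figure 6; the rule below is invented purely for illustration and is not one learned by the system:

```python
# Ground facts for part of the parse of sentence 25 (cf. Figures 5 and 6),
# encoded as (relation, arg1, arg2) tuples.
FACTS = {
    ("phrase-type", "phrase-0", "prepositional-phrase"),
    ("phrase-type", "phrase-2", "noun-phrase"),
    ("phrase-type", "phrase-3", "verb-phrase"),
    ("phrase-type", "phrase-4", "prepositional-phrase"),
    ("phrase-type", "phrase-5", "noun-phrase"),
    ("next-phrase", "phrase-0", "phrase-2"),
    ("next-phrase", "phrase-2", "phrase-3"),
    ("next-phrase", "phrase-3", "phrase-4"),
    ("constituent-phrase", "phrase-4", "phrase-5"),
}

def rule_matches(facts, protein_phrase, location_phrase):
    """A hypothetical Foil-style rule: the protein phrase is a noun phrase,
    and the location phrase is a noun phrase that is a constituent of some
    prepositional phrase."""
    return (("phrase-type", protein_phrase, "noun-phrase") in facts and
            ("phrase-type", location_phrase, "noun-phrase") in facts and
            any(("constituent-phrase", pp, location_phrase) in facts and
                ("phrase-type", pp, "prepositional-phrase") in facts
                for pp in {f[1] for f in facts}))
```

Under this encoding, the phrase pair ([2], [5]) of the example sentence satisfies the rule, while pairs involving the verb phrase do not.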
The appeal of using a relational method for this task is that it can naturally represent relationships among sentence constituents in learned rules, and it can represent an arbitrary amount of context around the parts of the sentence to be extracted.

[Figure 5 shows the shallow parse of sentence [25], "By immunofluorescence microscopy the PRP20 protein was localized in the nucleus": a prepositional phrase [0] containing noun phrase [1], followed by noun phrase [2], verb phrase [3], and a prepositional phrase [4] containing noun phrase [5].]
Figure 5: A parse tree produced by Sundance for one sentence in our YPD corpus.

The objective of the learning algorithm is to learn a definition for the predicate:

    localization-sentence(Sentence-ID, Phrase-ID, Phrase-ID).

Each instance of this relation consists of (i) an identifier corresponding to the sentence represented by the instance, (ii) an identifier representing the phrase in the sentence that contains an entry in the protein lexicon, and (iii) an identifier representing the phrase in the sentence that contains an entry in the subcellular location lexicon. Thus, the learning task is to recognize pairs of phrases that correspond to positive instances of the target relation. The models learned by the relational learner consist of logical rules constructed from the following background relations:

- phrase-type(Phrase-ID, Phrase-Type): This relation allows a particular phrase to be characterized as a noun phrase, verb phrase, or prepositional phrase.

- next-phrase(Phrase-ID, Phrase-ID): This relation specifies the order of phrases in a sentence. Each instance of the relation indicates the successor of one particular phrase.

- constituent-phrase(Phrase-ID, Phrase-ID): This relation indicates cases where one phrase is a constituent of another phrase.

    phrase-type(phrase-0, prepositional-phrase).
    phrase-type(phrase-1, noun-phrase).
    phrase-type(phrase-2, noun-phrase).
    phrase-type(phrase-3, verb-phrase).
    phrase-type(phrase-4, prepositional-phrase).
    phrase-type(phrase-5, noun-phrase).
    next-phrase(phrase-0, phrase-2).
    next-phrase(phrase-2, phrase-3).
    next-phrase(phrase-3, phrase-4).
    constituent-phrase(phrase-0, phrase-1).
    constituent-phrase(phrase-4, phrase-5).
    subject-verb(phrase-2, phrase-3).
    localization-sentence(sentence-25, phrase-2, phrase-5).

Figure 6: Our relational representation of the parse shown in Figure 5.

This set of background relations enables the learner to characterize the relations among phrases in sentences. Additionally, we also allow the learner to characterize the words in sentences and phrases. One approach to doing this would be to include another background relation whose instances linked individual words to the phrases and sentences in which they occur. We have investigated this approach and found that the learned rules often have low precision and/or recall because they are too dependent on the presence of particular words. The approach we use instead allows the learning algorithm to use Naive Bayes classifiers to characterize the words in sentences and phrases.

Figure 7 shows a rule learned by our relational method. The rule is satisfied when all of the literals to the right of the ":-" are satisfied. The first two literals specify that the rule is looking for sentences in
of another phrase. For example, in Figure 5, the rst which the phrase referencing the subcellular location
prepositional phrase in the sentence has a constituent follows the phrase referencing the protein, and there
noun phrase. is one phrase separating them. The next literal spec-
i es that the sentence must satisfy (i.e. be classi ed
 subject-verb(Phrase-ID, Phrase-ID), as positive by) a particular Naive Bayes classi er. The
verb-direct-object(Phrase-ID,Phrase-ID): These rela- fourth literal indicates that the phrase referencing the
tions enable the learner to link subject noun phrases protein must satisfy a Naive Bayes classi er. The two
to their corresponding verb phrases, and verb phrases nal literals specify a similar condition for the phrase
to their corresponding direct object phrases. referencing the subcellular location. The bottom part
 same-clause(Phrase-ID, Phrase-ID): This relation links of Figure 7 shows the stemmed words that are weighted
phrases that occur in the same sentence clause. most highly by each of the naive Bayes classi ers.
Although the Naive Bayes predicates used in the rule
Training and test examples are described by instances shown in Figure 7 appear to overlap somewhat, their
of these relations. For example, Figure 6 shows the in- di erences are noticeable. For example, whereas the
stances of the background and target relations that rep- predicate that is applied to the Protein-Phrase highly
resent the parse tree shown in Figure 5. The constants weights the words protein, gene and product, the pred-
used to represent the sentence and its phrases in Fig- icates that are applied to the Location-Phrase focus on
ure 6 correspond to the identi ers shown in brackets in subcellular locations and prepositions such as in, to and
Figure 5. with.
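The word lists shown beneath the rule in Figure 7 are ranked by the log-odds ratio. As an illustrative sketch of how such a ranking can be computed (the toy counts and the add-one smoothing choice are ours, not taken from our corpus):

```python
import math

# Toy counts of stemmed words in positive (localization) sentences
# versus negative sentences; the words come from Figure 7, but the
# counts are invented for illustration.
pos = {"nucleu": 30, "local": 25, "the": 100}
neg = {"nucleu": 2, "local": 5, "the": 400}
vocab = set(pos) | set(neg)

def log_odds(word, k=1.0):
    """log( P(word | pos) / P(word | neg) ) with add-k smoothing."""
    p = (pos.get(word, 0) + k) / (sum(pos.values()) + k * len(vocab))
    q = (neg.get(word, 0) + k) / (sum(neg.values()) + k * len(vocab))
    return math.log(p / q)

# Words most indicative of localization sentences come first.
ranking = sorted(vocab, key=log_odds, reverse=True)
```

Under these toy counts, content words such as "nucleu" rank above function words such as "the", mirroring the rankings shown in Figure 7.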
localization-sentence(Sentence, Protein-Phrase, Location-Phrase) :-
next-phrase(Protein-Phrase, Phrase-1),
next-phrase(Phrase-1, Location-Phrase),
sentence-naive-bayes-1(Sentence),
phrase-naive-bayes-1(Protein-Phrase),
phrase-naive-bayes-2(Location-Phrase),
phrase-naive-bayes-3(Location-Phrase).
sentence-naive-bayes-1: nucleu, mannosidas, bifunct, local, galactosidas, nuclei, immunofluoresc, . . .
phrase-naive-bayes-1: protein, beta, galactosidas, gene, alpha, mannosidas, bifunct, product, . . .
phrase-naive-bayes-2: nucleu, nuclei, mitochondria, vacuol, plasma, insid, membran, atpas, . . .
phrase-naive-bayes-3: the, nucleu, in, mitochondria, membran, nuclei, to, vacuol, yeast, with, . . .
Figure 7: Top: a rule learned by our relational method. This rule includes four Naive Bayes predicates. Bottom: the most
highly weighted words (using the log-odds ratio) in each of the Naive Bayes predicates.
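To make the structural conditions of the rule above concrete, the following sketch encodes the next-phrase facts of Figure 6 and tests the rule's first two literals; the Naive Bayes literals are deliberately omitted here, and the encoding as Python sets is ours, not our actual learner:

```python
# next-phrase facts from Figure 6, encoded as a set of pairs.
next_phrase = {
    ("phrase-0", "phrase-2"),
    ("phrase-2", "phrase-3"),
    ("phrase-3", "phrase-4"),
}

def structural_match(protein_phrase, location_phrase):
    """First two literals of the Figure 7 rule: the location phrase
    follows the protein phrase with exactly one phrase between them.
    (The rule's Naive Bayes literals are not modeled in this sketch.)"""
    return any(
        (protein_phrase, mid) in next_phrase
        and (mid, location_phrase) in next_phrase
        for mid in {b for (_, b) in next_phrase}
    )
```

For the parse in Figure 6, structural_match("phrase-2", "phrase-4") holds: the prepositional phrase containing the location noun phrase follows the protein phrase, with the verb phrase between them.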

Using a procedure similar to relational pathfinding (Richards & Mooney 1992), our learning algorithm initializes each rule by trying to find the combination of next-phrase, constituent-phrase, subject-verb, verb-direct-object, and same-clause literals that link the phrases of the most uncovered positive instances. After the rule is initialized with these literals, the learning algorithm uses a hill-climbing search to add additional literals. The algorithm can either add a literal expressed using one of the background relations, or it can invent a new Naive Bayes classifier to characterize one of the phrases in the sentence or the sentence itself. This method for inventing Naive Bayes classifiers in the context of relational learning is described in detail elsewhere (Slattery & Craven 1998).

To evaluate our relational IE approach, we learned a set of rules using the YPD data set as a training set, and tested the rules on the hand-labeled data set. Our relational algorithm learned a total of 26 rules covering the positive instances in the training set.

[Figure 8: Precision vs. recall for the relational classifier trained on the YPD data set. The plot compares the relational classifier against the Naive Bayes classifier and the sentence co-occurrence baseline, with recall on the x-axis and precision on the y-axis (0%-100%).]
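The confidence scores behind these precision-recall tradeoffs are m-estimates of rule accuracy, defined in the paragraphs that follow. As an illustrative sketch of that computation (the rule counts and prior below are invented for the example):

```python
def m_estimate_accuracy(n_correct, n_covered, prior, m=5.0):
    """m-estimate of a rule's accuracy (Cestnik 1990):
    (n_c + m*p) / (n + m), where p is a prior estimate of the
    rule's accuracy and m is the equivalent sample size."""
    return (n_correct + m * prior) / (n_covered + m)

# Rank rules, here just (n_correct, n_covered) pairs, by descending
# estimated accuracy; p is set to the fraction of positive training
# instances, as in our experiments. These particular counts are toys.
rules = [(9, 10), (3, 3), (40, 60)]
p = 0.25  # illustrative class prior
ranked = sorted(rules, key=lambda r: m_estimate_accuracy(r[0], r[1], p),
                reverse=True)
```

Note how the prior pulls the (3, 3) rule, which covers only three instances, below rules whose accuracy is estimated from more data.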
Figure 8 shows the precision vs. recall curve for the learned relational rules. The confidence measure for a given example is the estimated accuracy of the first rule that the example satisfies. We estimate the accuracy of each of our learned rules by calculating an m-estimate (Cestnik 1990) of the rule's accuracy over the training examples. The m-estimate of a rule's accuracy is defined as follows:

    m-estimate accuracy = (n_c + m * p) / (n + m)

where n_c is the number of instances correctly classified by the rule, n is the total number of instances classified by the rule, p is a prior estimate of the rule's accuracy, and m is a constant called the equivalent sample size, which determines how heavily p is weighted relative to the observed data. In our experiments, we set m = 5 and we set p to the proportion of instances in the training set that belong to the target class. We then use these m-estimates to sort the rules in order of descending estimated accuracy.

For comparison, Figure 8 also shows the precision vs. recall curves for the YPD-trained Naive Bayes classifier discussed in the previous section, and for the sentence co-occurrence baseline. As this figure illustrates, although the recall of the relational rule set is rather low (21%), the precision is quite high (92%). In fact, this precision value is considerably higher than the precision of the Naive Bayes classifier at the corresponding level of recall. This result indicates the value of representing grammatical structure when learning information extractors. We believe that the recall level of our relational learner can be improved by tuning the set of background relations it employs, and we are investigating this issue in our current research.

Related Work

Several other research groups have addressed the task of information extraction from biomedical texts. Our research differs considerably, however, in the type of knowledge we are trying to extract and in our approach to the problem.
A number of groups have developed systems for extracting keywords from text sources. Andrade and Valencia (1997) describe a method for extracting keywords characterizing functional characteristics of protein families. This approach identifies words that are used much more frequently in the literature for a given family than in the literature associated with other families. In similar work, Ohta et al. (1997) extract keywords using an information-theoretic measure to identify those words that carry the most information about a given document. Weeber and Vos (1998) have developed a system for extracting information about adverse drug reactions from medical abstracts. Their system isolates words that occur near the phrase "side effect" and then uses statistical techniques to identify words that possibly describe adverse drug reactions. In all of these research efforts, the information-extraction task is to identify and extract informative words related to some topic. In our work, on the other hand, we are focusing on extracting instances of specific target relations.

Fukuda et al. (1998) consider the task of recognizing protein names in biological articles. Their system uses both orthographic and part-of-speech features to recognize and extract protein names. Whereas the task we are addressing is to extract relation instances, Fukuda et al. are concerned with extracting instances of a class, namely proteins.

The prior research most similar to ours is that of Leek (1997). His work investigated using hidden Markov models (HMMs) to extract facts from text fields in the OMIM (On-Line Mendelian Inheritance in Man) database. The task addressed by Leek, like our task, involved extracting instances of a binary relation pertaining to location. His location relation, however, referred to the positions of genes on chromosomes. The principal difference between Leek's approach and our approach is that his HMMs involved a fair amount of domain-specific human engineering.

Discussion and Conclusions

One may ask whether the learned classifiers we described in this paper are accurate enough to be of use. We argue that, for many tasks, they are. As discussed in the Introduction, two of the motivating applications for our work are (i) providing structured summaries of particular biological objects, and (ii) supporting discovery by eliciting connections among biological objects. As demonstrated by the work of Swanson et al. (Swanson & Smalheiser 1997), even word co-occurrence predictors can be quite useful for these tasks. Therefore, any method that can provide a boost in predictive power over these baselines is of practical value. For tasks such as automatic genome annotation, where the predictions made by the information extractors would be put directly into a database, the standard for accuracy is higher. For this type of task, we believe that extraction routines like those described in this paper can be of value either by (i) making only high-confidence predictions, thereby sacrificing recall for precision, or (ii) operating in a semi-automated mode in which a person reviews some of the predictions made by the information extractors.

Perhaps the most significant contribution of our work is the approach to using "weakly" labeled training data. Most previous work in learning information extractors has relied on training examples consisting of documents precisely marked with the facts that should be extracted along with their locations within the document. Our approach involves (i) identifying existing databases that contain instances of the target relation, (ii) associating these instances with documents so that they may be used as training data, and (iii) dividing the documents into training instances and weakly labeling these instances (e.g., by assuming that all sentences that mention a protein and a subcellular location represent instances of the subcellular-localization relation). We believe that this approach has great promise because it vastly reduces the time and effort involved in assembling training sets for inducing information extractors.

Currently, we are investigating modifying the learning process to take into account the nature of weakly labeled training data. Specifically, we are developing objective functions that are biased toward covering at least one sentence per positive instance instead of equally weighting all sentences labeled as positive.

We have numerous other plans to extend the work presented here. First, we are currently using our learned information extractors to help populate a protein-localization knowledge base being developed at Carnegie Mellon University. Second, we plan to learn information-extraction routines for all of the relations mentioned in the Introduction. Third, we plan to investigate ways in which existing sources of domain knowledge, such as the Unified Medical Language System (National Library of Medicine 1999b), can be leveraged to learn more accurate extraction routines. Fourth, we plan to address the task of extracting instances that are not represented by individual sentences. Fifth, we plan to extend our relation-extraction methods so that they can take into account factors that may qualify a fact, such as its temporal or spatial scope.

In summary, we believe that the work presented herein represents a significant step toward making textual sources of biological knowledge as accessible and interoperable as structured databases.

References

Andrade, M. A., and Valencia, A. 1997. Automatic annotation for biological sequences by extraction of keywords from MEDLINE abstracts. In Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology, 25-32. Halkidiki, Greece: AAAI Press.

Boland, M. V.; Markey, M. K.; and Murphy, R. F. 1996. Automated classification of protein localization patterns. Molecular Biology of the Cell 8(346a).
Califf, M. E. 1998. Relational Learning Techniques for Natural Language Extraction. Ph.D. Dissertation, Computer Science Department, University of Texas, Austin, TX. AI Technical Report 98-276.

Cardie, C. 1997. Empirical methods in information extraction. AI Magazine 18(4):65-80.

Cestnik, B. 1990. Estimating probabilities: A crucial task in machine learning. In Proceedings of the Ninth European Conference on Artificial Intelligence, 147-150. Stockholm, Sweden: Pitman.

Cowie, J., and Lehnert, W. 1996. Information extraction. Communications of the ACM 39(1):80-91.

Domingos, P., and Pazzani, M. 1997. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning 29:103-130.

Freitag, D. 1998. Multistrategy learning for information extraction. In Proceedings of the Fifteenth International Conference on Machine Learning, 161-169. Madison, WI: Morgan Kaufmann.

Fukuda, K.; Tsunoda, T.; Tamura, A.; and Takagi, T. 1998. Toward information extraction: Identifying protein names from biological papers. In Pacific Symposium on Biocomputing, 707-718.

Genome Annotation Consortium. 1999. The genome channel. http://compbio.ornl.gov/tools/channel/.

Hodges, P. E.; Payne, W. E.; and Garrels, J. I. 1998. Yeast protein database (YPD): A database for the complete proteome of Saccharomyces cerevisiae. Nucleic Acids Research 26:68-72.

Karp, P.; Riley, M.; Paley, S.; and Pellegrini-Toole, A. 1997. EcoCyc: Electronic encyclopedia of E. coli genes and metabolism. Nucleic Acids Research 25(1).

Lathrop, R. H.; Steffen, N. R.; Raphael, M. P.; Deeds-Rubin, S.; Pazzani, M. J.; Cimoch, P.; See, D. M.; and Tilles, J. G. 1998. Knowledge-based avoidance of drug-resistant HIV mutants. In Proceedings of the Tenth Conference on Innovative Applications of Artificial Intelligence. Madison, WI: AAAI Press.

Leek, T. 1997. Information extraction using hidden Markov models. Master's thesis, Department of Computer Science and Engineering, University of California, San Diego, CA.

Lewis, D. D., and Ringuette, M. 1994. A comparison of two learning algorithms for text categorization. In Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval, 81-93.

Mitchell, T. M. 1997. Machine Learning. New York: McGraw-Hill.

National Center for Biotechnology Information. 1999. Entrez. http://www.ncbi.nlm.nih.gov/Entrez/.

National Library of Medicine. 1999a. Pubmed. http://www.ncbi.nlm.nih.gov/PubMed/.

National Library of Medicine. 1999b. Unified medical language system. http://www.nlm.nih.gov/research/umls/umlsmain.html.

Ohta, Y.; Yamamoto, Y.; Okazaki, T.; Uchiyama, I.; and Takagi, T. 1997. Automatic construction of knowledge base from biological papers. In Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology, 218-225. Halkidiki, Greece: AAAI Press.

Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann.

Porter, M. F. 1980. An algorithm for suffix stripping. Program 14(3):127-130.

Provost, F., and Fawcett, T. 1998. Robust classification systems for imprecise environments. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, 706-713. Madison, WI: AAAI Press.

Quinlan, J. R. 1990. Learning logical definitions from relations. Machine Learning 5:239-266.

Richards, B. L., and Mooney, R. J. 1992. Learning relations by pathfinding. In Proceedings of the Tenth National Conference on Artificial Intelligence, 50-55. San Jose, CA: AAAI/MIT Press.

Riloff, E. 1996. An empirical study of automated dictionary construction for information extraction in three domains. Artificial Intelligence 85:101-134.

Riloff, E. 1998. The Sundance sentence analyzer. http://www.cs.utah.edu/projects/nlp/.

Rost, B. 1996. PHD: Predicting one-dimensional protein structure by profile based neural networks. Methods in Enzymology 266:525-539.

Slattery, S., and Craven, M. 1998. Combining statistical and relational methods for learning in hypertext domains. In Proceedings of the Eighth International Conference on Inductive Logic Programming. Springer Verlag.

Soderland, S. 1996. Learning Text Analysis Rules for Domain-specific Natural Language Processing. Ph.D. Dissertation, University of Massachusetts. Department of Computer Science Technical Report 96-087.

Soderland, S. 1999. Learning information extraction rules for semi-structured and free text. Machine Learning.

Swanson, D. R., and Smalheiser, N. R. 1997. An interactive system for finding complementary literatures: a stimulus to scientific discovery. Artificial Intelligence 91:183-203.

Weeber, M., and Vos, R. 1998. Extracting expert medical knowledge from texts. In Working Notes of the Intelligent Data Analysis in Medicine and Pharmacology Workshop, 23-28.

Xu, Y.; Mural, R. J.; Einstein, J. R.; Shah, M. B.; and Uberbacher, E. C. 1996. GRAIL: A multi-agent neural network system for gene identification. Proceedings of the IEEE 84(10):1544-1552.
