Download as ps, pdf, or txt
Download as ps, pdf, or txt
You are on page 1of 8

Using Information Extraction to Aid

the Discovery of Prediction Rules from Text

Un Yong Nahm Raymond J. Mooney


Department of Computer Sciences Department of Computer Sciences
University of Texas, Austin, TX 78712-1188 University of Texas, Austin, TX 78712-1188
pebronia@cs.utexas.edu mooney@cs.utexas.edu

ABSTRACT Text Data Mining


Text mining and Information Extra tion (IE) are both topi s
of signi ant re ent interest. Text mining on erns apply-
ing data mining, a.k.a. knowledge dis overy from databases Information
(KDD) te hniques to unstru tured text. Information extra - Text Extraction DB KDD Rule Base
tion (IE) is a form of shallow text understanding that lo ates
spe i pie es of data in natural language do uments, trans-
forming unstru tured text into a stru tured database. This
paper des ribes a system alled Dis oTEX, that ombines
IE and KDD methods to perform a text mining task, dis-
overing predi tion rules from natural-language orpora. An Figure 1: Overview of IE-based Text Mining Frame-
initial version of Dis oTEX is onstru ted by integrating work
an IE module based on Rapier and a rule-learning module,
Ripper. We present en ouraging results on applying these
te hniques to a orpus of omputer job postings from an Although onstru ting an IE system is a diÆ ult task, there
Internet newsgroup. has been signi ant re ent progress in using ma hine learn-
ing methods to help automate the onstru tion of IE systems
[5, 3℄. By manually annotating a small number of do uments
1. INTRODUCTION with the information to be extra ted, a fairly a urate IE
The problem of text mining, i.e. dis overing useful knowl- system an be indu ed from this labeled orpus and then
edge from unstru tured text, is attra ting in reasing atten- applied to a large body of raw text to onstru t a large
tion [14, 18℄. This paper suggests a new framework for text database for mining. In this way, a small amount of labeled
mining based on the integration of Information Extra tion training data for an IE learning system an be automati ally
(IE) and traditional Knowledge Dis overy from Databases transformed into a large database of stru tured information
(KDD). Traditional data mining assumes that the infor- ready to be mined with traditional KDD methods. For ex-
mation to be \mined" is already in the form of a rela- ample, the IE learning system Rapier [4℄ has been used to
tional database. Unfortunately, for many appli ations, ele - indu e an IE system that transforms newsgroup job post-
troni information is only available in the form of unstru - ings into a relational database. By applying standard rule
tured natural-language do uments rather than stru tured indu tion methods to a database of 5,000 jobs automati ally
databases. Information Extra tion, a task that has attra ted extra ted from the newsgroup, we have dis overed interest-
in reasing attention sin e the start of the Message Under- ing relationships su h as \If a omputer-related job requires
standing Conferen es (MUCs) [11, 12℄, addresses the prob- knowledge of Java and graphi s then it also requires knowl-
lem of transforming a orpus of textual do uments into a edge of PhotoShop."
more stru tured database. This suggests an obvious role
that IE an play in text mining when ombined with stan- However, the a ura y of urrent IE systems, whether built
dard KDD methods, as illustrated in Figure 1. The IE manually or indu ed from data, is limited. Therefore, an au-
module lo ates spe i pie es of data in raw text, and the tomati ally extra ted database will inevitably ontain signif-
resulting database is provided to the KDD module for min- i ant numbers of errors. An important question is whether
ing. the knowledge dis overed from this \noisy" database is sig-
ni antly less reliable than knowledge dis overed from a
leaner traditional database. In this paper we present ex-
periments demonstrating that knowledge dis overed from an
automati ally extra ted database is lose in a ura y to that
dis overed from a manually onstru ted database, demon-
strating that ombining IE and KDD is a viable approa h
to text mining.
The remainder of the paper is organized as follows. Se -
tion 2 des ribes a system alled Dis oTEX (DISCOvery
Sample Job Posting
ing a database using an IE system, we ltered out irrelevant
Leading Internet Provider using utting edge web te h- do uments from the newsgroup using a trained text atego-
nology in Austin is a epting appli ations for a Senior rizer. First, 1,000 postings were olle ted and lassi ed by
Software Developer. The andidate must have 5 years of a human expert as relevant or irrelevant. Next, a bag-of-
software development, whi h in ludes oding in C/C++ words Naive-Bayes text ategorizer [21, 19℄ was trained on
and experien e with databases (Ora le, Sybase, Informix, this data to identify relevant do uments (using the Rainbow
et .). A BS degree or higher in Computer S ien e or related pa kage [20℄). The resulting ategorizer has an a ura y of
eld are required. PERL and JAVASCRIPT programming over 99% and is used to lter irrelevant do uments from the
experien e will be a de nite plus! This position will require original postings.
developing appli ations under Windows95/98 and NT,
meeting with ustomers to de ne requirements, and the Rapier [4℄, a ma hine-learning system for indu ing rules
design, development and implementation of e- ommer e, for extra ting information from natural-language texts, is
internet/intranet appli ations with emphasis on ba k-end used to onstru t an IE module for Dis oTEX. Rapier is a
web site development using C++, Java and RDBMS. bottom-up relational rule learner for a quiring IE rules from
Salary: $70-85K plus outstanding bene ts(Medi al, Dental, a orpus of labeled training examples. It learns patterns
Vision, Sto k Options); Lo ation: Austin(South); Type of des ribing onstraints on slot llers and their surrounding
Position: Full Time ontext using a spe i -to-general sear h. Constraints in
patterns an spe ify the spe i words, part-of-spee h, or
Filled Job Template
semanti lasses of tokens. It has been demonstrated that
title: Senior Software Developer Rapier performs fairly well on realisti appli ations su h as
salary: $70-85K USENET job postings and seminar announ ements [2, 4℄.
ity: Austin
language: Perl, C, Javas ript, Java, C++ In the experiments in this paper, Rapier was trained on
platform: NT, Windows only 60 labeled do uments, at whi h point its a ura y at
appli ation: Ora le, Informix, Sybase extra ting information is somewhat limited; extra tion pre-
area: RDBMS, Internet, Intranet, E- ommer e ision (per entage of extra ted slot llers that are orre t) is
required years of experien e: 5 about 91.9% and extra tion re all (per entage of all of the
required degree: BS orre t llers extra ted) is about 52.4% . We purposely
trained Rapier on a relatively small orpus in order to
Figure 2: Sample Text and Filled Template
demonstrate that labeling only a relatively small number
of do uments an result in a learned extra tor apable of
building a database from whi h a urate knowledge an be
from Text EXtra tion) that ombines IE and KDD te h- dis overed.
nologies to dis over predi tion rules. Se tion 3 presents and
dis usses experimental results obtained on a orpus of job
postings from the newsgroup austin.jobs. Se tion 4 re- 2.2 Rule Induction
views some related work, se tion 5 dis usses dire tions for After onstru ting an IE system that extra ts the desired
future resear h, and se tion 6 presents our on lusions. set of slots for a given appli ation, a database is onstru ted
from a orpus of texts by applying the extra tor to ea h
do ument to reate a olle tion of stru tured re ords. Stan-
2. THE DISCOTEX SYSTEM dard KDD te hniques an then be applied to the resulting
2.1 Information Extraction database to dis over interesting relationships. Spe i ally,
The goal of an IE system is to lo ate spe i data in natural- we indu e rules for predi ting ea h pie e of information in
language text. The data to be extra ted is typi ally given by ea h database eld given all other information in a re ord.
a template whi h spe i es a list of slots to be lled with sub- Standard lassi ation rule-learning methods an be em-
strings taken from the do ument. IE is useful for a variety ployed for this task.
of appli ations, parti ularly given the re ent proliferation of
Internet and web do uments. Re ent appli ations in lude In order to dis over predi tion rules, we treat ea h slot-value
ourse homepages [17℄, apartment rental ads [26℄, and job pair in the extra ted database as a distin t binary feature,
announ ements [7, 4℄. su h as graphi s2area, and learn rules for predi ting ea h
feature from all other features. Similar slot llers are rst
In this paper, we onsider the task of extra ting a database ollapsed into a pre-determined standard term. For exam-
from postings to the USENET newsgroup, austin.jobs. ple, \Windows 95" is a popular ller for the platform slot,
Figure 2 shows a sample message from the newsgroup and but it often appears as \Win 95", \Win95", `MS Win 95",
the lled omputer-s ien e job template where several slots and so on, and \DBA" in the title slot is an abbreviation
may have multiple llers. For example, slots su h as lan- for \DataBase Administrator". These terms are ollapsed
guages, platforms, appli ations, and areas usually have more to unique slot values before predi tion rules are mined from
than one ller, while slots related to the job's title or lo a- the data. A small domain-dependent synonym di tionary
tion have only one ller. is used to identify su h similar terms. Trivial ases su h as
\Databases" ! \Database" and \Client/Server" ! \Client-
Sin e austin.jobs is not a moderated newsgroup, not all Server" are handled by manually ontrived synonym- he king
posted do uments are relevant to our task. Some of them rules.
are resumes posted by job-seekers, advertisements, or non-
omputer-s ien e job postings. Therefore, before onstru t- We have applied C4.5rules [24℄ to indu e rules from the
 Ora le2appli ation ^ QA Partner2appli ation !
SQL2language IE Training
DiscoTEX
System
 C++2language ^ C2language ^ CORBA2appli ation Examples
! Windows2platform 1 KDD
 HTML2language ^ WindowsNT2platform ^ A tive
IE

Server Pages2appli ation ! Database2area 2 Rule 6


Rule
 :(UNIX2platform) ^ :(Windows2platform) ^
7
Induction Induction
IE Prediction
Games2area ! 3D2area
Database
Rule Base Rule Base
 Java2language ^ A tiveX2area ^ Graphi s2area ! Extraction 5 Prediction 8

Web2area
3

 Visual Basi 2language ^ OLE2area !


:(UNIX2platform)
4

Examples
 3D2area ^ Games2area ^ :(E-Commer e2area) !
:(SQL2language)
slot−
Figure 3: Sample Mined Rules for Computer- Database values

S ien e Job Postings


INPUT OUTPUT

resulting binary data by learning de ision trees and trans- Figure 4: The System Ar hite ture
lating them into pruned rules. Ripper [8℄ was also applied to
learn predi tion rules. Ripper runs signi antly faster sin e
it has an ability to handle set-valued features [9℄ to avoid the is signi antly less.
step of expli itly translating slot llers into a large number
of binary features.
Dis overed knowledge des ribing the relationships between 3. EXPERIMENTAL RESULTS
slot values is written in a form of produ tion rules. If
there is a tenden y for Web to appear in the area slot when 3.1 Experimental Methodology
Sho kWave appears the in appli ations slot, this is repre- Dis overed knowledge is only useful and informative if it
sented by the produ tion rule, Sho kWave2appli ation ! is a urate. Dis overing uke orrelations in data is not
Web2area. Rules an also predi t the absen e of a ller in produ tive, and therefore it is important to measure the a -
a slot. Sample rules mined from a database of 600 jobs ura y of dis overed knowledge on independent test data.
extra ted from the USENET newsgroup austin.jobs are The primary question we address in the experiments in this
shown in Figure 3. se tion is whether knowledge dis overed from automati ally
extra ted data (whi h may be quite noisy) is relatively re-
liable ompared to knowledge dis overed from a manually
2.3 System Architecture onstru ted database.
The overall ar hite ture of Dis oTEX is shown in Figure
4. First, do uments annotated by the user are provided to To test the overall system, 600 omputer-s ien e job post-
Rapier as training data. IE rules indu ed from this training ings to the newsgroup austin.jobs were olle ted and man-
set are stored in the IE rule base and subsequently used by ually annotated with orre t extra tion templates. 10-fold
the extra tion module. The learned IE system then takes ross validation was used to generate training and test sets.
unlabeled texts and transforms them into a database of slot- Rules were indu ed for predi ting the llers of the languages,
values, whi h is provided to the KDD omponent (i.e. C4.5 platforms, appli ations, and areas slots, sin e these are
or Ripper) as a training set for onstru ting a knowledge usually lled with multiple dis rete-valued llers and have
base of predi tion rules. The training data for KDD an obvious potential relationships between their values. The
in lude the user-labeled do uments used for training IE, as total number of slot-values used in the experiment is 476:
well as a larger IE-labeled set automati ally extra ted from 48 slot-values are for languages slot, 59 for platforms, 159
raw text. Dis oTEX also in ludes a apability for improv- for appli ations, and 210 for areas.
ing the re all of the learned IE system by proposing addi-
tional slot llers based on learned predi tion rules. More The lassi ation a ura y for predi ting absen e or pres-
details on this aspe t of the system an be found in [22℄. en e of slot llers is not a parti ularly informative perfor-
man e metri sin e high a ura y an be a hieved by sim-
In order to test the a ura y of the dis overed rules, they ply assuming every slot ller is absent. For instan e, with
are used to predi t the information in a disjoint database of 60 user-labeled examples, Dis oTEX gives a lassi ation
user-labeled examples. For ea h test job, ea h possible slot- a ura y of 92.7% while the all-absent strategy has an a -
value is predi ted to be present or absent given information ura y of 92.5%. This is be ause the set of potential slot
on all of its other slot-values. Average performan e a ross all llers is very large and not xed in advan e, and only a
features and all test examples is then omputed. The rules small fra tion of possible llers is present in any given ex-
produ ed by Ripper and C4.5rules were found to be of ample. Therefore, we evaluate the performan e of Dis o-
similar a ura y, and the experiments in this paper employ TEX using the IE performan e metri s of pre ision, re all,
Ripper sin e its omputational time and spa e omplexity and F-measure with regard to predi ting slot llers. These
Present Absent ross-validation is performed on the remaining 540 examples
Predi ted To Be Present mp (n m)  p in order to evaluate data mining. In order to learly illus-
Predi ted To Be Absent m  (1 p) (n m)  (1 p) trate the impa t of mining automati ally extra ted data, the
same set of training examples was provided to both KDD
Table 1: The expe ted out ome for random guessing systems. The only di eren e between them is the training
data for the rule-miner of Dis oTEX is automati ally ex-
tra ted by Rapier after being trained on a disjoint set of
60 user-labeled examples. Both systems are tested on user-
metri s are de ned as follows: labeled data to identify the quality of the rules produ ed.
#of P resentSlotV aluesCorre tlyP redi ted Figures 5, 6 and 7 show the learning urves for pre ision,
pre ision = re all, and F-measure, respe tively.
#of SlotV aluesP redi tedT oBeP resent
(1)
Even with a small amount of user-labeled data, the results
#of P resentSlotV aluesCorre tlyP redi ted indi ate that Dis oTEX a hieves a performan e fairly om-
re all = (2) parable to the rule-miner trained on a manually onstru ted
#of P resentSlotV alues database, while random-guessing does quite poorly. Figure
F-measure is the harmoni mean of pre ision and re all and 6 indi ates that Dis oTEX does relatively worse with the
is omputed as follows (when the same weight is given to rst 60 training examples, but qui kly improves with 60 ad-
pre ision and re all): ditional examples. The results also show that the pre ision
2  P re ision  Re all of Dis oTEX seems to start leveling o a bit sooner, this
F measure = (3) is presumably due to the fa t that extra tion errors put a
P re ision + Re all somewhat lower eiling on the performan e it an eventually
a hieve.
In order to obtain non-trivial bounds on pre ision and re all,
a simple random guessing method is used as a ben hmark. Figure 8 presents F-measures for Dis oTEX's performan e
This approa h guesses a slot-value based on its frequen y on individual slots. Not surprisingly, the Programming Lan-
of o urren e in the training data. For instan e, if \Java" guages slot with the least number of possible values shows
o urs as a programming language in 29% of jobs in the the best performan e, and the Area slot with as many as
training data, then this approa h guesses that it o urs 29% 210 values does poorly. More interesting is that fa t that
of the time for the test data. Instead of simulating this di erent slots show quite di erent learning rates.
method, we analyti ally al ulated its expe ted pre ision
and re all for ea h slot-value. The expe ted out ome for Figures 9, 10 and 11 show the learning urves for pre ision,
this strategy for a given slot-value is summarized in Table 1, re all, and F-measure under the \more natural" s enario in
where p is the per entage of times the slot-value appears the whi h the training set provided to Rapier, onsisting of 60
training examples, n is the total number of the test examples user-labeled examples, is also provided to the rule-miner as
and m is the number of times the slot-value o urs in the a part of its training set. In this ase, both approa hes start
test data. Using the information in the table, the pre ision with the same 60 user-labeled examples, whi h have already
and the re all for random-guessing is determined as follows: been used to train Dis oTEX's IE system. However, as
mp Dis oTEX pro eeds to dis over knowledge from data it au-
pre ision = = m=n (4) tomati ally extra ts from raw text, it fairly losely tra ks the
(m  p) + ((n m )  p)
performan e of a system trained on additional data labori-
mp
ously extra ted by a human expert. Sin e in this ase Dis-
re all = =p (5) oTEX has the advantage of a small set of relatively noise-
(m  p) + (m  (1 p)) free data to start with, its performan e is even somewhat
Therefore, the ben hmark pre ision for a slot-value is its loser to that a hieved by mining a hand-built database.
probability of o urren e as estimated from the test data
and the re all is its probability of o urren e as estimated All of the results presented above employed 60 labeled ex-
from the training data. The only di eren e between the two amples to train the IE system. In a followup experiment, we
is due to sampling error. examined the e e t of in reasing the number of IE training
examples to obtain a more a urate extra tor. We varied
3.2 Results the number of training examples given to Rapier (trying
Be ause of the two di erent training phases used in Dis- 60, 120, 180, 240, and 300 examples), always using 240 ex-
oTEX , there is a question of whether or not the train- amples in the database to be mined by Ripper. The size of
ing set for IE should also be used to train the rule-miner. the test set is 60 as in the previous experiment. Figure 12
In realisti situations, there is no reason not to use the IE shows the preforman e results. In reasing the number of IE
training data for mining sin e the human e ort has already training examples improves the a ura y of the mined rules
been expended to orre tly extra t the data in this text. a bit, further approa hing the a ura y of Ripper trained on
However, to learly illustrate the di eren e between mining user-labeled data. However, a ura y improves slowly with
human-labeled and IE-labeled data, we rst show a ompar- additional IE training data. This result indi ates that if
ison with a disjoint IE training set. In this experiment, the the training set for data mining to be automati ally labeled
IE training data are thrown away on e they have been used by an IE module is large enough (240 in this experiment),
Dis otex is able to a hieve a fairly good performan e with
to train Rapier, sin e the extra tor is unlikely to make the
normal number of extra tion errors on this data. Ten-fold only a small amount e ort devoted to labeling IE training
80
RIPPER trained on User-Labeled Data
DiscoTEX
Random-Guessing 40
RIPPER trained on User-Labeled Data
70 DiscoTEX
Random-Guessing

35
60

30
Precision (%)

50

Recall (%)
40 25

30
20

20
15

10
50 100 150 200 250 300 350 400 450 500
Number of Training Examples 10
50 100 150 200 250 300 350 400 450 500
Number of Training Examples
Figure 5: Pre ision with disjoint IE training set

Figure 6: Re all with disjoint IE training set


examples.

4. RELATED RESEARCH
There has been relatively little resear h exploring the in-
tegration of IE and KDD. KDT (Knowledge Dis overy in 50
RIPPER trained on User-Labeled Data
Textual Databases) alludes to the use of IE in text mining DiscoTEX
[14℄; however, it uses texts manually tagged with a limited 45
Random-Guessing

number of xed ategory labels. KDT does not a tually


use automated text ategorization or IE. Similarly, FACT
[15℄, whi h dis overs asso iations amongst keywords from 40
text do uments, does not automati ally label the do uments
with keywords. Another approa h similar to Dis oTEX is
a method used to dis over semanti relationships between 35
terms in a olle tion of do uments [16℄. In this work, a
F-Measure (%)

natural-language parser is used in pla e of information ex-


tra tion sin e the extra tion pro ess is one of nding syn- 30
ta ti or semanti relations between terms.
25
5. FUTURE WORK
Although our preliminary results with job postings are en-
ouraging, a fuller evaluation will apply Dis oTEX to larger 20
job orpora and to other realisti domains. For instan e,
we plan to use Dis oTEX to dis over knowledge from a
database extra ted from business news arti les. Mining a 15
database extra ted from do tors' English notes and re ords
about medi al patients is another promising appli ation area.
10
50 100 150 200 250 300 350 400 450 500
One step in Dis oTEX that is urrently performed man- Number of Training Examples
ually is ollapsing similar slot- llers in the extra ted data
into a anoni al form, e.g. mapping \NT," \Windows NT,"
and \Mi rosoft Windows NT" all to a unique term. In many Figure 7: F-measure with disjoint IE training set
ases, su h ollapsing ould be automated by lustering slot
llers using a distan e metri based on textual similarity,
60 40
Programming Languages RIPPER trained on User-Labeled Data
Platforms DiscoTEX
Applications Random-Guessing
55 Areas

35
50

45

30
F-Measure (%)

Recall (%)
40

35
25

30

25
20

20

15 15
50 100 150 200 250 300 350 400 450 500 50 100 150 200 250 300 350 400 450 500 550
Number of Training Examples Number of Training Examples

Figure 8: F-measure for Dis oTEX by slots Figure 10: Re all with reused IE training set

70 50
RIPPER trained on User-Labeled Data RIPPER trained on User-Labeled Data
DiscoTEX DiscoTEX
Random-Guessing Random-Guessing
45
60

40
50
F-Measure (%)
Precision (%)

35

40

30

30
25

20
20

10 15
50 100 150 200 250 300 350 400 450 500 550 50 100 150 200 250 300 350 400 450 500 550
Number of Training Examples Number of Training Examples

Figure 9: Pre ision with reused IE training set Figure 11: F-measure with reused IE training set
The pro edure for sele ting slots to be used in rule mining
Precision(Discotex)
Precision(Ripper without IE) also needs to be automated. In the urrent experiments,
70 Recall(Discotex) we manually hose ve slots from the omputer-s ien e job
Recall(Ripper without IE)
F-Measure(Discotex) template. For example, title slots for job postings is not
F-Measure(Ripper without IE) used be ause it has many possible values and is diÆ ult
60
Precision,Recall,F-Measure(Random)
to predi t. By identifying and quantifying the orrelations
between slot values, this de ision ould be automated.

50 6. CONCLUSIONS
There is a growing interest in the general topi of text min-
ing [18℄; however, there are few working systems or detailed
experimental evaluations. By utilizing existing IE and KDD
(%)

40 te hnology, text-mining systems an be developed relatively


rapidly and evaluated on existing IE orpora. In this paper,
we presented an approa h to using an automati ally learned
30 IE system to extra t a stru tured databases from a text or-
pus, and then mining this database with traditional KDD
tools. Our preliminary experimental results demonstrate
that the knowledge dis overed from su h an automati ally
20
extra ted database is lose in a ura y to the knowledge
dis overed from a manually onstru ted database.
10 Text mining is a relatively new resear h area at the interse -
50 100 150 200 250 300 tion of data mining, natural-language pro essing, ma hine
Number of IE Training Examples learning, and information retrieval. By appropriately inte-
grating te hniques from ea h of these dis iplines, useful new
methods for dis overing knowledge from large text orpora
Figure 12: Performan es with varied sizes of IE
an be developed. In parti ular, we believe that the growing
training set
intera tion between omputational linguisti s and ma hine
learning [6℄ is riti al to the development of e e tive text-
mining systems.
su h as hara ter edit distan e [25℄, or semanti similarity
as determined by ontext-based word lustering [23℄. Acknowledgements
This resear h was supported by the National S ien e Foun-
Instead of requiring or reating anoni al slot- llers that dation under grant IRI-9704943.
must mat h exa tly, an alternative approa h is to allow par-
tial mat hing of slot- llers during the dis overy pro ess. The
need for soft mat hing of text strings in dis overed rules is 7. REFERENCES
an important aspe t of text mining that requires hanges [1℄ R. Baeza-Yates and B. Ribeiro-Neto. Modern
Information Retrieval. ACM Press, New York, 1999.
to existing rule indu tion methods. Sin e text strings in
traditional databases also ontain typos, misspellings, and [2℄ M. E. Cali . Relational Learning Te hniques for
non-standardized variations, this is also an important as- Natural Language Information Extra tion. PhD thesis,
pe t of traditional KDD that has not been adequately ad- Department of Computer S ien es, University of
dressed. We are urrently exploring the dis overy of rules Texas, Austin, TX, August 1998. Also appears as
that allow soft mat hing of slot- llers by adapting the RISE Arti ial Intelligen e Laboratory Te hni al Report AI
approa h to unifying rule-based and instan e-based learning 98-276 (see http://www. s.utexas.edu/users/ai-lab).
methods [13℄. Like Whirl [10℄, our approa h uses a TFIDF
text-similarity metri from information retrieval [1℄ to nd [3℄ M. E. Cali , editor. Papers from the AAAI-99
examples that are lose but not exa t mat hes to the ondi- Workshop on Ma hine Learning for Information
tions of a rule. We have applied a preliminary version of this Extra tion, Orlando, FL, 1999. AAAI Press.
system to a database of s ien e-book des riptions extra ted
from Amazon. om and dis overed rules su h as: \If a ti- [4℄ M. E. Cali and R. J. Mooney. Relational learning of
tle of a s ien e-book has a substring of Gender, then Men, pattern-mat h rules for information extra tion. In
Women, and Di eren es are found in the review for that Pro eedings of the Sixteenth National Conferen e on
book." and \If a synopsis of a book ontains History and Arti ial Intelligen e, pages 328{334, Orlando, FL,
Life, then title of that book in ludes a substring, Origin." July 1999.
[5℄ C. Cardie. Empiri al methods in information
Currently, we only onsider dis rete-valued slots. However, extra tion. AI Magazine, 18(4):65{79, 1997.
real-valued slots, su h as \required years of experien e" or
\salary" ould also be provided to the rule miner as addi- [6℄ C. Cardie and R. J. Mooney. Ma hine learning and
tional input features when predi ting other slots. Predi ting natural language (introdu tion to spe ial issue on
su h ontinuous values using regression methods instead of natural language learning). Ma hine Learning, 34:5{9,
ategorization te hniques is another area for future resear h. 1999.
[7℄ J. Y. Chai, A. W. Biermann, and C. I. Guinn. Two [21℄ T. Mit hell. Ma hine Learning. M Graw-Hill, New
dimensional generalization in information extra tion. York, NY, 1997.
In Pro eedings of the Sixteenth National Conferen e
on Arti ial Intelligen e, pages 431{438, Orlando, FL, [22℄ U. Y. Nahm and R. J. Mooney. A mutually bene ial
July 1999. integration of data mining and information extra tion.
In Pro eedings of the Seventeenth National Conferen e
[8℄ W. W. Cohen. Fast e e tive rule indu tion. In on Arti ial Intelligen e, Austin, TX, July 2000.
Pro eedings of the Twelfth International Conferen e
on Ma hine Learning, pages 115{123, 1995. [23℄ F. Pereira, N. Tishby, and L. Lee. Distributional
lustering of English words. In Pro eedings of the 31st
[9℄ W. W. Cohen. Learning trees and rules with Annual Meeting of the Asso iation for Computational
set-valued features. In Pro eedings of the Thirteenth Linguisti s, pages 183{190, Columbus, Ohio, 1993.
National Conferen e on Arti ial Intelligen e, pages
709{716, Portland, OR, August 1996. [24℄ J. R. Quinlan. C4.5: Programs for Ma hine Learning.
Morgan Kaufmann, San Mateo,CA, 1993.
[10℄ W. W. Cohen and H. Hirsh. Joins that generalize:
Text lassi ation using WHIRL. In Pro eedings of the [25℄ E. S. Ristad and P. N. Yianilos. Learning string edit
Fourth International Conferen e on Knowledge distan e. IEEE Transa tions on Pattern Analysis and
Dis overy and Data Mining, pages 169{173, New Ma hine Intelligen e, 20(5), 1998.
York, NY, 1998.
[26℄ S. Soderland. Learning information extra tion rules
[11℄ DARPA, editor. Pro eedings of the Fifth DARPA for semi-stru tured and free text. Ma hine Learning,
Message Understanding Evaluation and Conferen e, 34:233{272, 1999.
San Mateo, CA, 1993. Morgan Kaufman.
[12℄ DARPA, editor. Pro eedings of the 6th Message
Understanding Conferen e, San Mateo, CA, 1995.
Morgan Kaufman.
[13℄ P. Domingos. Unifying instan e-based and rule-based
indu tion. Ma hine Learning, 24:141{168, 1996.
[14℄ R. Feldman and I. Dagan. Knowledge dis overy in
textual databases (KDT). In Pro eedings of the First
International Conferen e on Knowledge Dis overy and
Data Mining, Montreal, 1995.
[15℄ R. Feldman and H. Hirsh. Mining asso iations in text
in the presen e of ba kground knowledge. In
Pro eedings of the Se ond International Conferen e on
Knowledge Dis overy and Data Mining, pages
343{346, Portland, OR, August 1996.
[16℄ M. Finkelstein-Landau and E. Morin. Extra ting
semanti relationships between terms: Supervised vs.
unsupervised methods. In Pro eedings of International
Workshop on Ontologi al Engineering on the Global
Information Infrastru ture, pages 71{80, Dagstuhl
Castle, Germany, May 1999.
[17℄ D. Freitag. Information extra tion from HTML:
Appli ation of a general learning approa h. In
Pro eedings of the Fifteenth National Conferen e on
Arti ial Intelligen e, pages 517{523, 1998.
[18℄ M. Hearst. Untangling text data mining. In
Pro eedings of the 37th Annual Meeting of the
Asso iation for Computational Linguisti s, pages
3{10, University of Maryland, 1999.
[19℄ A. M Callum and K. Nigam. A omparison of event
models for naive Bayes text lassi ation. In Papers
from the AAAI 1998 Workshop on Text
Categorization, pages 41{48, Madison, WI, 1998.
[20℄ A. K. M Callum. Bow: A toolkit for statisti al
language modeling, text retrieval, lass ation and
lustering, 1996.
http://www. s. mu.edu/~m allum/bow.

You might also like