Nahm TM IE
Web ∈ area
3D ∈ area ∧ Games ∈ area ∧ ¬(E-Commerce ∈ area) → ¬(SQL ∈ language)

Figure 3: Sample Mined Rules for Computer-Science Jobs
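As an illustration only, rules of the kind shown in Figure 3 can be represented and applied to a slot-value record as in the following Python sketch. The names `Rule`, `holds`, and `predict`, and the dictionary layout of a record, are assumptions made for this example; they are not part of DiscoTEX, Ripper, or C4.5rules.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set, Tuple

# A record maps each slot name to the set of fillers extracted for it.
Record = Dict[str, Set[str]]
# A literal is (filler, slot, present): ("Web", "area", True) means Web ∈ area,
# and ("SQL", "language", False) means ¬(SQL ∈ language).
Literal = Tuple[str, str, bool]


def holds(literal: Literal, record: Record) -> bool:
    """Check whether a single literal is true of the record."""
    filler, slot, present = literal
    return (filler in record.get(slot, set())) == present


@dataclass(frozen=True)
class Rule:
    antecedents: FrozenSet[Literal]
    consequent: Literal

    def fires(self, record: Record) -> bool:
        """A rule fires when every antecedent literal holds."""
        return all(holds(lit, record) for lit in self.antecedents)


def predict(rules: List[Rule], record: Record) -> List[Literal]:
    """Collect the consequents of all rules that fire on the record."""
    return [rule.consequent for rule in rules if rule.fires(record)]


# ShockWave ∈ application → Web ∈ area
shockwave_rule = Rule(
    frozenset({("ShockWave", "application", True)}),
    ("Web", "area", True),
)
# 3D ∈ area ∧ Games ∈ area ∧ ¬(E-Commerce ∈ area) → ¬(SQL ∈ language)
negation_rule = Rule(
    frozenset({
        ("3D", "area", True),
        ("Games", "area", True),
        ("E-Commerce", "area", False),
    }),
    ("SQL", "language", False),
)

# Hypothetical record used only to exercise the sketch.
job = {"application": {"ShockWave"}, "area": {"Games"}}
print(predict([shockwave_rule, negation_rule], job))
```

Keeping the presence flag inside each literal lets one structure express both rules that predict a filler and rules that predict its absence.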
resulting binary data by learning decision trees and translating them into pruned rules. Ripper [8] was also applied to learn prediction rules. Ripper runs significantly faster since it can handle set-valued features [9], which avoids the step of explicitly translating slot fillers into a large number of binary features.

Discovered knowledge describing the relationships between slot values is written in the form of production rules. If there is a tendency for Web to appear in the area slot when ShockWave appears in the applications slot, this is represented by the production rule ShockWave ∈ application → Web ∈ area. Rules can also predict the absence of a filler in a slot. Sample rules mined from a database of 600 jobs extracted from the USENET newsgroup austin.jobs are shown in Figure 3.

2.3 System Architecture
The overall architecture of DiscoTEX is shown in Figure 4. First, documents annotated by the user are provided to Rapier as training data. IE rules induced from this training set are stored in the IE rule base and subsequently used by the extraction module. The learned IE system then takes unlabeled texts and transforms them into a database of slot-values, which is provided to the KDD component (i.e., C4.5 or Ripper) as a training set for constructing a knowledge base of prediction rules. The training data for KDD can include the user-labeled documents used for training IE, as well as a larger IE-labeled set automatically extracted from raw text. DiscoTEX also includes a capability for improving the recall of the learned IE system by proposing additional slot fillers based on learned prediction rules. More details on this aspect of the system can be found in [22].

Figure 4: The System Architecture

In order to test the accuracy of the discovered rules, they are used to predict the information in a disjoint database of user-labeled examples. For each test job, each possible slot-value is predicted to be present or absent given information on all of its other slot-values. Average performance across all features and all test examples is then computed. The rules produced by Ripper and C4.5rules were found to be of similar accuracy, and the experiments in this paper employ Ripper since its computational time and space complexity is significantly less.

3. EXPERIMENTAL RESULTS

3.1 Experimental Methodology
Discovered knowledge is only useful and informative if it is accurate. Discovering fluke correlations in data is not productive, and therefore it is important to measure the accuracy of discovered knowledge on independent test data. The primary question we address in the experiments in this section is whether knowledge discovered from automatically extracted data (which may be quite noisy) is relatively reliable compared to knowledge discovered from a manually constructed database.

To test the overall system, 600 computer-science job postings to the newsgroup austin.jobs were collected and manually annotated with correct extraction templates. 10-fold cross-validation was used to generate training and test sets. Rules were induced for predicting the fillers of the languages, platforms, applications, and areas slots, since these are usually filled with multiple discrete-valued fillers and have obvious potential relationships between their values. The total number of slot-values used in the experiment is 476: 48 slot-values are for the languages slot, 59 for platforms, 159 for applications, and 210 for areas.

The classification accuracy for predicting absence or presence of slot fillers is not a particularly informative performance metric, since high accuracy can be achieved by simply assuming every slot filler is absent. For instance, with 60 user-labeled examples, DiscoTEX gives a classification accuracy of 92.7% while the all-absent strategy has an accuracy of 92.5%. This is because the set of potential slot fillers is very large and not fixed in advance, and only a small fraction of possible fillers is present in any given example. Therefore, we evaluate the performance of DiscoTEX using the IE performance metrics of precision, recall, and F-measure with regard to predicting slot fillers. These
metrics are defined as follows:

\[
\text{precision} = \frac{\#\,\text{of PresentSlotValuesCorrectlyPredicted}}{\#\,\text{of SlotValuesPredictedToBePresent}} \tag{1}
\]

\[
\text{recall} = \frac{\#\,\text{of PresentSlotValuesCorrectlyPredicted}}{\#\,\text{of PresentSlotValues}} \tag{2}
\]

F-measure is the harmonic mean of precision and recall and is computed as follows (when the same weight is given to precision and recall):

\[
\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}
\]

In order to obtain non-trivial bounds on precision and recall, a simple random guessing method is used as a benchmark. This approach guesses a slot-value based on its frequency of occurrence in the training data. For instance, if "Java" occurs as a programming language in 29% of jobs in the training data, then this approach guesses that it occurs 29% of the time for the test data. Instead of simulating this method, we analytically calculated its expected precision and recall for each slot-value. The expected outcome for this strategy for a given slot-value is summarized in Table 1, where p is the percentage of times the slot-value appears in the training examples, n is the total number of the test examples and m is the number of times the slot-value occurs in the test data.

                           Present      Absent
Predicted To Be Present    m p          (n - m) p
Predicted To Be Absent     m (1 - p)    (n - m) (1 - p)

Table 1: The expected outcome for random guessing

Using the information in the table, the precision and the recall for random-guessing are determined as follows:

\[
\text{precision} = \frac{m\,p}{(m\,p) + ((n - m)\,p)} = m/n \tag{4}
\]

\[
\text{recall} = \frac{m\,p}{(m\,p) + (m\,(1 - p))} = p \tag{5}
\]

Therefore, the benchmark precision for a slot-value is its probability of occurrence as estimated from the test data and the recall is its probability of occurrence as estimated from the training data. The only difference between the two is due to sampling error.

3.2 Results
Because of the two different training phases used in DiscoTEX, there is a question of whether or not the training set for IE should also be used to train the rule-miner. In realistic situations, there is no reason not to use the IE training data for mining since the human effort has already been expended to correctly extract the data in this text. However, to clearly illustrate the difference between mining human-labeled and IE-labeled data, we first show a comparison with a disjoint IE training set. In this experiment, the IE training data are thrown away once they have been used to train Rapier, since the extractor is unlikely to make the normal number of extraction errors on this data. Ten-fold cross-validation is performed on the remaining 540 examples in order to evaluate data mining. In order to clearly illustrate the impact of mining automatically extracted data, the same set of training examples was provided to both KDD systems. The only difference between them is that the training data for the rule-miner of DiscoTEX is automatically extracted by Rapier after being trained on a disjoint set of 60 user-labeled examples. Both systems are tested on user-labeled data to identify the quality of the rules produced. Figures 5, 6 and 7 show the learning curves for precision, recall, and F-measure, respectively.

Even with a small amount of user-labeled data, the results indicate that DiscoTEX achieves a performance fairly comparable to the rule-miner trained on a manually constructed database, while random-guessing does quite poorly. Figure 6 indicates that DiscoTEX does relatively worse with the first 60 training examples, but quickly improves with 60 additional examples. The results also show that the precision of DiscoTEX seems to start leveling off a bit sooner; this is presumably due to the fact that extraction errors put a somewhat lower ceiling on the performance it can eventually achieve.

Figure 8 presents F-measures for DiscoTEX's performance on individual slots. Not surprisingly, the Programming Languages slot, with the fewest possible values, shows the best performance, and the Area slot, with as many as 210 values, does poorly. More interesting is the fact that different slots show quite different learning rates.

Figures 9, 10 and 11 show the learning curves for precision, recall, and F-measure under the "more natural" scenario in which the training set provided to Rapier, consisting of 60 user-labeled examples, is also provided to the rule-miner as a part of its training set. In this case, both approaches start with the same 60 user-labeled examples, which have already been used to train DiscoTEX's IE system. However, as DiscoTEX proceeds to discover knowledge from data it automatically extracts from raw text, it fairly closely tracks the performance of a system trained on additional data laboriously extracted by a human expert. Since in this case DiscoTEX has the advantage of a small set of relatively noise-free data to start with, its performance is even somewhat closer to that achieved by mining a hand-built database.

All of the results presented above employed 60 labeled examples to train the IE system. In a followup experiment, we examined the effect of increasing the number of IE training examples to obtain a more accurate extractor. We varied the number of training examples given to Rapier (trying 60, 120, 180, 240, and 300 examples), always using 240 examples in the database to be mined by Ripper. The size of the test set is 60 as in the previous experiment. Figure 12 shows the performance results. Increasing the number of IE training examples improves the accuracy of the mined rules a bit, further approaching the accuracy of Ripper trained on user-labeled data. However, accuracy improves slowly with additional IE training data. This result indicates that if the training set for data mining to be automatically labeled by an IE module is large enough (240 in this experiment), DiscoTEX is able to achieve a fairly good performance with only a small amount of effort devoted to labeling IE training examples.
Figure 5: Precision with disjoint IE training set
(Precision (%) versus Number of Training Examples; curves: RIPPER trained on User-Labeled Data, DiscoTEX, Random-Guessing. An adjacent plot shows Recall (%) versus Number of Training Examples for the same three curves.)
4. RELATED RESEARCH
There has been relatively little research exploring the integration of IE and KDD. KDT (Knowledge Discovery in Textual Databases) alludes to the use of IE in text mining [14]; however, it uses texts manually tagged with a limited
Figure 8: F-measure for DiscoTEX by slots
Figure 9: Precision with reused IE training set
Figure 10: Recall with reused IE training set
Figure 11: F-measure with reused IE training set
(Each plot shows the metric (%) versus Number of Training Examples; Figures 9 to 11 compare RIPPER trained on User-Labeled Data, DiscoTEX, and Random-Guessing.)
(Plot legend: Precision, Recall, and F-Measure (%) for Discotex and for Ripper without IE; Precision, Recall, F-Measure for Random.)

The procedure for selecting slots to be used in rule mining also needs to be automated. In the current experiments, we manually chose five slots from the computer-science job template. For example, the title slot for job postings is not used because it has many possible values and is difficult to predict. By identifying and quantifying the correlations between slot values, this decision could be automated.
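One possible way to quantify the correlation between two slot-values is sketched below. The text does not specify a measure; the phi coefficient used here, the function name `phi`, and the record layout are assumptions chosen purely for illustration.

```python
import math
from typing import Dict, List, Set, Tuple

Record = Dict[str, Set[str]]
SlotValue = Tuple[str, str]  # (slot, filler)


def phi(records: List[Record], a: SlotValue, b: SlotValue) -> float:
    """Phi coefficient between the presence of slot-values a and b.

    Assumption: returns 0.0 when either slot-value is constant across records.
    """
    n11 = n10 = n01 = n00 = 0
    for record in records:
        has_a = a[1] in record.get(a[0], set())
        has_b = b[1] in record.get(b[0], set())
        if has_a and has_b:
            n11 += 1
        elif has_a:
            n10 += 1
        elif has_b:
            n01 += 1
        else:
            n00 += 1
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom > 0 else 0.0


# Hypothetical records in which ShockWave and Web always co-occur.
records = [
    {"application": {"ShockWave"}, "area": {"Web"}},
    {"application": {"ShockWave"}, "area": {"Web"}},
    {"application": {"Excel"}, "area": {"Finance"}},
    {"application": {"Excel"}, "area": {"Finance"}},
]
print(phi(records, ("application", "ShockWave"), ("area", "Web")))
```

A slot whose values show consistently weak correlation with the values of other slots would be a candidate for exclusion from rule mining, although any threshold for that decision would have to be chosen empirically.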
6. CONCLUSIONS
There is a growing interest in the general topic of text mining [18]; however, there are few working systems or detailed experimental evaluations. By utilizing existing IE and KDD