BaocaoChuyende RuttrichThongtin Version1.0
1.1 Objectives and scope of the report
According to Jim Cowie and Yorick Wilks [2], IE is the name given to the process of selectively structuring and combining data that is found, explicitly stated, in one or more text documents.
According to Line Eikvil [1], the core component of an information extraction system is a set of rules and patterns used to identify the relevant information to be extracted.
According to Dr. Alexander Yates of the University of Washington [3], information extraction is the process of retrieving structured information from unstructured text.
According to the information extraction experts at GATE (footnote 1), information extraction systems analyze text in order to extract the required information in predefined forms, such as events, entities, and relationships.
Footnote 1: http://gate.ac.uk/ie/
Information extraction does not require the system to understand the content of a text document, but the system must be able to analyze the document and locate the relevant information it is looking for. Information extraction techniques can be applied to any document collection from which we need to pull out the key, essential information and the related facts. Domain-specific text repositories on the Internet are a typical example: the information may reside in many different places and in many different formats. Surveys and applications in a given domain benefit greatly when the relevant domain information is extracted, integrated into a unified form, and represented in a structured way. The information on the Internet is then transferred into a structured database that serves various analysis and mining applications.
Current research on information extraction from text focuses on:
Terminology extraction: finding the key relevant terms that express the semantics, content, and topic of a document or a collection of documents.
Named entity recognition: extracting named entities centers on methods for identifying objects and entities such as person names, company names, organization names, place names, and locations.
Relationship extraction: determining the relationships between the entities recognized in a document, for example the location of an organization or company, or the workplace of a person. Take the text "James Gosling joined Sun Microsystems, located in Silicon Valley, in 1984". The question is how information extraction methods and techniques can identify the entities, their types, and the relationships between them, as follows:
PERSON works-for ORGANIZATION: two entities are recognized, James Gosling and Sun Microsystems. The relationship between these two entities is "works for".
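The PERSON works-for ORGANIZATION example above can be sketched with a single hand-written pattern. This is a minimal illustration, not the method used in the report: the regular expression, the trigger words "joined" and "works for", and the function name `extract_works_for` are assumptions, and runs of capitalized words stand in for a real named-entity recognizer.

```python
import re

# Assumption: a run of capitalized words approximates a named entity.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Assumption: "joined" and "works for" are the only trigger phrases considered.
WORKS_FOR = re.compile(rf"(?P<person>{NAME}) (?:joined|works for) (?P<org>{NAME})")


def extract_works_for(text):
    """Return (PERSON, 'works_for', ORGANIZATION) triples found in text."""
    return [
        (m.group("person"), "works_for", m.group("org"))
        for m in WORKS_FOR.finditer(text)
    ]


sentence = "James Gosling joined Sun Microsystems, located in Silicon Valley, in 1984."
triples = extract_works_for(sentence)
print(triples)
```

A pattern like this is brittle by design; it only shows how a rule ties two recognized entities to a relation label.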
Figure 1: Information extraction supporting job search (source: reference [4])
Another application is extracting and filtering relevant information to improve information retrieval [4]. For example, in Figure 2 below, a user who wants to find jobs in the baking trade (baker) types the string "baker job opening" into Google. Google returns a great deal of irrelevant information, such as job postings from the Mt. Baker school and from the company Baker Hostetler. These results have nothing to do with the job being sought, that of a baker. The system should instead return links to pages or companies that are hiring bakers. In this case, the task of IE is to extract the links that are relevant to the user's search need.
Figure 2: Job search based on a search engine (source: reference [4])
IE is also applied to finding answers for question answering (QA) systems based on the results returned by a search engine. A recent line of research develops QA systems by analyzing the results returned by search engines in order to find the exact answer to the input question. For example, when a user asks "Which city is the capital of Vietnam?", the search engines return a large number of results, and the system must extract the answer the user expects, namely "Ha Noi" or "Ha Noi City". This is one form of applying IE techniques in QA (Figure 3).
http://dblp.uni-trier.de/
Automatic learning approach
Based on statistical machine learning.
Requires a large amount of well-labeled training data.
http://en.wikipedia.org/wiki/DARPA
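Recall (R): the fraction of the correct information that is actually extracted, that is, what percentage of the correct answers the system finds. It is the ratio of the number of correct answers found to the total number of correct answers that exist.
Precision (P): a measure of how reliable the extracted information is. It is the ratio of the number of correct answers found to the total number of answers found.
$R = \frac{tp}{tp + tn}$
$P = \frac{tp}{tp + fp}$
where
tp: the number of correct results that are found
tn: the number of correct results that are not found (usually called false negatives)
fp: the number of results found that are not correct
With $\beta = 1$, the F-measure is: $F = \frac{2 P R}{P + R}$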
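As a small worked example of these measures, the sketch below computes P, R, and F from the three counts. The function name and the sample counts are assumptions made for illustration; the parameter `tn` follows the document's own symbol for correct results that were not found.

```python
def precision_recall_f1(tp, fp, tn):
    """Compute precision, recall and F (beta = 1) from the three counts.

    tp: correct results that were found
    fp: results that were found but are not correct
    tn: correct results that were not found (the document's symbol)
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + tn) if tp + tn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 correct results found, 2 wrong results, 4 missed.
p, r, f = precision_recall_f1(tp=8, fp=2, tn=4)
print(f"P={p:.4f} R={r:.4f} F={f:.4f}")
```

The guards against division by zero are a design choice for the case where nothing is found or nothing is expected.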
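[Figure: keyphrase extraction workflow. Box labels: document collection; domain dictionary; candidate extraction; candidate phrases; pre-labeled keyphrases; feature computation; "Training?" decision with "Yes" and "No" branches; build the model using Naive Bayes; model; probability computation; keyphrases.]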
Candidate phrases:
programming
demonstration
method
programming by demonstration
demonstration method
programming by demonstration method
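One way to obtain a candidate list like the one above is to enumerate word n-grams and drop those that begin or end with a stopword, the filter used by KEA-style extractors [17]. The sketch below rests on assumptions: the surviving text states neither the rule nor the input phrase, so the input "programming by demonstration method", the tiny stopword list, and the maximum length of 4 are illustrative choices that happen to reproduce the six candidates.

```python
# Assumption: a tiny illustrative stopword list, not the one used in the report.
STOPWORDS = {"by", "of", "the", "a", "an", "and", "in", "for", "to"}


def candidate_phrases(text, max_len=4):
    """Return n-grams (1..max_len words) that neither start nor end with a stopword."""
    words = text.lower().split()
    candidates = []
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue  # e.g. "by", "programming by", "by demonstration"
            phrase = " ".join(gram)
            if phrase not in candidates:
                candidates.append(phrase)
    return candidates


cands = candidate_phrases("programming by demonstration method")
print(cands)
```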
http://en.wikipedia.org/wiki/Named_entity_recognition
http://nlp.stanford.edu/ner/index.shtml
http://gate.ac.uk/ie/annie.html
http://dublincore.org/
... the relationships between documents through metadata. For example, once the metadata of a paper has been identified, we can tell which documents cite that paper and how often. On that basis we can assign each paper a score, which is very helpful for ranking papers in search results. In addition, the metadata of documents in a given domain can help enrich a domain ontology. For example, the metadata of computer science publications can be used to enrich an ontology of computer science (Computer Science Ontology - CSOnt).
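The citation-based score mentioned above can be illustrated with a simple count of incoming citations. This is only a sketch under stated assumptions: the record layout (`title`, `references`), the sample papers, and the use of a raw citation count as the score are hypothetical and are not specified in the report.

```python
from collections import Counter

# Hypothetical metadata records: each paper lists the titles it cites.
papers = [
    {"title": "Paper A", "references": ["Paper B", "Paper C"]},
    {"title": "Paper B", "references": ["Paper C"]},
    {"title": "Paper C", "references": []},
]


def citation_counts(records):
    """Count how many times each known paper is cited by the other records."""
    counts = Counter({rec["title"]: 0 for rec in records})
    for rec in records:
        for cited in rec["references"]:
            if cited in counts:  # ignore citations to papers outside the collection
                counts[cited] += 1
    return counts


# Rank by citation count (descending), breaking ties by title.
ranking = sorted(citation_counts(papers).items(), key=lambda kv: (-kv[1], kv[0]))
print(ranking)
```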
In this chapter we present an approach to extracting metadata from scientific papers based on layout structure information and on rules built from patterns. We also build an automatic metadata extraction tool that can be used together with digital library software.
Section 3.2 presents the basic concepts of metadata. Section 3.3 introduces the Dublin Core Metadata standard, which is gradually being adopted in digital libraries and is replacing earlier standards. Section 3.4 reviews research related to automatic metadata extraction for organizing digital data. Section 3.5 presents our approach, the architecture of the extraction system, and the rules defined with JAPE Grammar and the ANNIE plug-in of GATE. Section 3.6 presents the experimental results of the proposed method and of the tool we built.
3.2 The Concept of Metadata
Metadata is used to describe information resources. The term "meta" comes from a Greek word denoting something of a more fundamental or higher nature. The most general definition, widely used in the information technology community, is: metadata is data about other data, or, more briefly, data about data.
In specific contexts, experts have offered different views of metadata:
According to Chris Taylor, director of library information access services at the University of Queensland (footnote 8), metadata is structured data used to describe the characteristics of a resource. A metadata record consists of a number of predefined elements that describe the properties and information of the resource. Each element can have one or more values.
According to Dr. Warwick Cathro of the National Library of Australia (footnote 9), a metadata element describes an information resource or supports access to an information resource.
In short, metadata can be understood as information used to describe information resources.
3.3 The Dublin Core Metadata Standard
Dublin Core Metadata (footnote 10) is a well-known metadata standard that is widely used among digital library researchers and practitioners. It was first proposed in 1995 by the Dublin Core Metadata Initiative. "Dublin" refers to Dublin, Ohio, USA, where the OCLC/NCSA Metadata Workshop was held in 1995. "Core" means a list of core elements used to describe resources (metadata elements), which can be extended further.
According to [35], in September 2001 the Dublin Core metadata element set was issued as a US standard, "The Dublin Core Metadata Element Set", ANSI/NISO Z39.85-2001.
Dublin Core Metadata consists of 15 basic elements [35], described in the table below.
The basic elements of the Dublin Core Metadata standard
No. | Element | Description
1 | Title | The title of the document.
2 | Creator | The author of the document, including both personal and corporate authors.
3 | Subject | The topic the document addresses, used to classify the document. May be expressed ... classification).
4 | Description | A summary or description of the document's content. May include an abstract, ...
5 | Publisher |
6 | Contributor |
7 | Date |
8 | Type |
9 | Format | ... software, ...)
10 | Identifier | Information identifying the document, the sources that refer to it, or a character string that locates the resource: URL (Uniform Resource Locator, beginning with http://), URN (Uniform Resource Name), ISBN (International Standard Book Number), ISSN (International Standard Serial Number), SICI (Serial Item & Contribution Identifier), ...
11 | Source | Information about the provenance of the document: a reference to the source from which the described document was extracted or created. The source may also be: ...
12 | Language |
13 | Relation |
14 | Coverage |
15 | Rights |
Footnote 8: http://www.library.uq.edu.au/iad/ctmeta4.html
Footnote 9: http://www.nla.gov.au/nla/staffpaper/cathro3.html
Footnote 10: http://dublincore.org/
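To make the element set concrete, the sketch below serializes a small record as XML using the Dublin Core element namespace `http://purl.org/dc/elements/1.1/`, in which element names are written in lowercase. The wrapper element `metadata`, the function name, and the sample values are assumptions for illustration; the report does not prescribe a serialization.

```python
import xml.etree.ElementTree as ET

# The Dublin Core element set namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def to_dublin_core(record):
    """Serialize a dict of Dublin Core elements (str or list of str) to XML text."""
    root = ET.Element("metadata")  # assumption: an arbitrary wrapper element
    for element, values in record.items():
        if isinstance(values, str):
            values = [values]  # elements may repeat, so normalize to a list
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical record with three of the fifteen elements.
record = {
    "title": "An Example Paper Title",
    "creator": ["First Author", "Second Author"],
    "language": "en",
}
xml_text = to_dublin_core(record)
print(xml_text)
```

Repeating the `dc:creator` element once per author reflects the point made earlier that an element can have one or more values.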
http://www.greenstone.org/
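// Annotates the "Abstract" heading: one or more line breaks, the keyword
// (optionally fused with an em dash, \u2014), then an optional period.
// The "Rule:" line is missing from the source; its name is assumed from the RHS.
Rule: AbstractKeyword
Priority: x
(
  ({SpaceToken.kind=="control"})+
  (
    {Token.string=="Abstract\u2014"} | {Token.string=="ABSTRACT\u2014"} |
    {Token.string=="Abstract"} | {Token.string=="ABSTRACT"}
  )
  ({Token.string=="."})?
):abstract_Keyword
-->
:abstract_Keyword.AbstractKeyword = {rule = "AbstractKeyword"}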
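// Marks the point where a new entry starts in the reference list.
// The rule header and its first lines are missing from the source: the name below is
// assumed from the RHS, and the parentheses were rebalanced (one opening parenthesis
// restored, one surplus closing parenthesis dropped).
Rule: ReferencesBreak
(
  (
    // "[" ... "]", for example [12]
    (
      {Token.string=="["}
      ({Token} | {SpaceToken.kind=="space"})+
      {Token.string=="]"}
    ):referenceBreak_1
    |
    // "(" one- or two-digit number ")", for example (3)
    (
      {Token.string=="("}
      {Token.kind=="number", Token.length < 3}
      {Token.string==")"}
    ):referenceBreak_2
    |
    // one- or two-digit number followed by a period, for example 4.
    (
      {Token.kind=="number", Token.length < 3}
      {Token.string=="."}
    ):referenceBreak_3
  )
  |
  // unnumbered list: a line break that follows the heading, a period, a number or
  // a year, and is itself followed by a person name
  (
    (
      {Token.string=="References"} | {Token.string=="REFERENCES"} |
      {Token.string=="."} | {Token.kind=="number"} | {Lookup.majorType=="year"}
    )
    (({SpaceToken.kind=="control"})+):referenceBreak_4
    ({Person} | {Lookup.majorType=="person_first"})
  )
)
-->
:referenceBreak_1.ReferenceBreak_1 = {rule = "ReferencesBreak"},
:referenceBreak_2.ReferenceBreak_2 = {rule = "ReferencesBreak"},
:referenceBreak_3.ReferenceBreak_3 = {rule = "ReferencesBreak"},
:referenceBreak_4.ReferenceBreak_4 = {rule = "ReferencesBreak"}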
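For readers without a GATE installation, the three numbered-marker patterns in the rule above can be approximated with a regular expression. This is a rough sketch under stated assumptions, not a translation of the JAPE rule: it works on raw lines rather than on GATE annotations, it ignores the name-based fourth pattern, and the function name `split_references` and the sample text are hypothetical.

```python
import re

# Approximation of referenceBreak_1..3: a marker at the start of a line.
MARKER = re.compile(
    r"""^\s*(?:
        \[[^\]\n]+\]     # [12] or any bracketed label
      | \(\d{1,2}\)      # (3): number of fewer than three digits in parentheses
      | \d{1,2}\.        # 4.: number of fewer than three digits plus a period
    )""",
    re.VERBOSE | re.MULTILINE,
)


def split_references(block):
    """Split a reference section into entries, one per detected marker.

    Text before the first marker (such as a "References" heading) is dropped.
    """
    starts = [m.start() for m in MARKER.finditer(block)]
    if not starts:
        return [block.strip()] if block.strip() else []
    ends = starts[1:] + [len(block)]
    # Collapse internal line breaks so each entry becomes a single line.
    return [" ".join(block[s:e].split()) for s, e in zip(starts, ends)]


sample = "[1] A. Author. First title,\n    2003.\n[2] B. Author. Second title, 2004.\n"
entries = split_references(sample)
print(entries)
```

Limiting the numeric markers to one or two digits mirrors the `Token.length < 3` constraint, which keeps a continuation line that starts with a year such as "2003." from being read as a new entry.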
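// Annotates a line that contains e-mail addresses, including grouped forms such as
// {user1, user2}@domain. The rule header is missing from the source (name assumed from
// the RHS); the outermost opening parenthesis was restored to balance the pattern.
Rule: LineEmailAnnotation
(
  (
    {Token.string=="{"}
    ( {Token} ({SpaceToken.kind=="space"})? )+
    ({SpaceToken.kind=="control"})?
  )?
  ( {Token} ({SpaceToken.kind=="space"})? )+
  ( {Token.string=="@"} | {Address.kind=="email"} | {Token.string=="}"} )
  ({SpaceToken.kind=="space"})?
  ( {Token} ({SpaceToken.kind=="space"})? )+
):lineEmailAnnotation
-->
:lineEmailAnnotation.LineEmailAnnotation = {rule = "LineEmailAnnotation"}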
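The rule above only annotates the line that holds the e-mail addresses. As a possible follow-up step, the sketch below expands the grouped form `{user1, user2}@domain` that the rule allows for into individual addresses. Everything here is an assumption for illustration: the report does not describe this step, and the regular expressions, the function name `emails_from_line`, and the example addresses are hypothetical.

```python
import re

# Grouped form: {alice, bob}@example.edu
GROUPED = re.compile(r"\{([^}]+)\}\s*@\s*([\w.-]+)")
# Plain form: carol@example.org
PLAIN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def emails_from_line(line):
    """Return the individual addresses on an e-mail line, expanding grouped forms."""
    found = []
    for m in GROUPED.finditer(line):
        domain = m.group(2).rstrip(".")  # drop a trailing sentence period
        for user in re.split(r"[,;\s]+", m.group(1).strip()):
            if user:
                found.append(f"{user}@{domain}")
    # Remove the grouped parts so they are not matched again as plain addresses.
    rest = GROUPED.sub(" ", line)
    found.extend(PLAIN.findall(rest))
    return found


addresses = emails_from_line("{alice, bob}@cs.example.edu, carol@example.org")
print(addresses)
```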
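// Annotates an affiliation line. Only the tail of this rule survives in the source:
// the header and the leading pattern elements are missing. The name is assumed from the
// RHS, and opening parentheses were restored so that the surviving part is balanced.
Rule: LineAffiliationAnnotation
(
  (
    // a number of three or more digits followed by a space
    ( {Token.kind=="number", Token.length>=3} {SpaceToken.kind=="space"} )
    |
    // a number range such as 12-14
    (
      {Token.kind=="number"}
      ({SpaceToken.kind=="space"})?
      {Token.kind=="punctuation", Token.subkind=="dashpunct"}
      ({SpaceToken.kind=="space"})?
      {Token.kind=="number"}
    )
  )
  ({SpaceToken.kind=="space"})?
  ( {Token} ({SpaceToken.kind=="space"})? )*
):lineAffiliationAnnotation
-->
:lineAffiliationAnnotation.LineAffiliationAnnotation = {rule = "LineAffiliationAnnotation"}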
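where
$R = \frac{tp}{tp + tn}$; $P = \frac{tp}{tp + fp}$; $F = \frac{2 P R}{P + R}$
tp: the number of correct results that are found
tn: the number of correct results that are not found
fp: the number of results found that are not correct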
The experimental results were measured on several main metadata attributes following the Dublin Core Metadata standard, and are shown in the table below:
Metadata | Precision (%) | Recall (%) | F-Measure (%)
Title | 100.00 | 100.00 | 100.00
Authors | 92.72 | 89.47 | 91.07
Affiliation | 95.83 | 92.00 | 93.87
Email | 100.00 | 100.00 | 100.00
Abstract | 96.55 | 93.33 | 94.92
References | 97.44 | 88.05 | 92.51
[12] Niraj Kumar and Kannan Srinathan, Automatic Keyphrase Extraction from Scientific Documents Using N-gram Filtration Technique, Proceedings of the Eighth ACM Symposium on Document Engineering (Information extraction in documents), 2008, pages 199-208.
[13] Jiabing Wang et al., Ensemble Learning for Keyphrases Extraction from Scientific Document, in Advances in Neural Networks - ISNN 2006, Springer Berlin/Heidelberg, 2006, pages 1267-1272.
[14] Yi-fang Brook Wu, Quanzhi Li, Razvan Stefan Bot, Xin Chen, Domain-specific Keyphrase Extraction, CIKM'05, October 31-November 5, 2005, Bremen, Germany, ACM, 2005.
[15] P.D. Turney, Learning algorithms for keyphrase extraction, Information Retrieval,
vol. 2, no. 4, pp. 303- 336, 2000.
[16] P.D. Turney, Learning to Extract Keyphrases from Text. National Research Council,
Institute for Information Technology, Technical Report ERB-1057, 1999.
[17] I.H. Witten, G.W. Paynter, E. Frank, C. Gutwin and C.G. Nevill-Manning. KEA:
Practical automatic Keyphrase Extraction. The proceedings of Digital Libraries '99: The
Fourth ACM Conference on Digital Libraries, pp. 254-255, 1999.
[18] Web link for KEA5.0 source code: http://www.nzdl.org./Kea/download.html
[19] Teuvo Kohonen, et al. Self-Organizing Maps, Third edition, Springer, 2002.
[20] A. Rauber, D. Merkl, and M. Dittenbach: The Growing Hierarchical Self-Organizing
Map: Exploratory Analysis of High-Dimensional Data in: IEEE Transactions on Neural
Networks, Vol. 13, No 6, pp. 1331-1341, IEEE, November 2002.
[21] Michael Dittenbach, Andreas Rauber, Dieter Merkl, Uncovering Hierarchical Structure in Data Using the Growing Hierarchical Self-Organizing Map, Institute of Software Technology, Vienna University of Technology, Vienna, Austria, 24 July 2002.
[22] Hoang Kiem, Huynh Ngoc Tin. Organization, management and knowledge discovery from the English, Vietnamese text collection. Proceedings of JCIS 2003 (7th Joint Conference on Information Sciences, September 2003, North Carolina, USA), pages 1613-1616.
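[23] Do Phuc, Hoang Kiem. Extracting main ideas from Vietnamese text to support content summarization (in Vietnamese). Journal of research and development works on telecommunications and information technology, no. 13, 2004.
[24] Dong Thi Bich Thuy, Ho Bao Quoc. Applying natural language processing in an information retrieval system for Vietnamese text (in Vietnamese). University of Science, 2003.
[25] Huynh Ngoc Tin. Content management and knowledge discovery on Vietnamese text documents (in Vietnamese). Master's thesis, University of Science, Vietnam National University Ho Chi Minh City, 2003.
[26] Nguyen Tuan Dang. Mining Vietnamese text data with SOM (Self-Organizing Map) (in Vietnamese). Master's thesis, Faculty of Information Technology, University of Science, Vietnam National University Ho Chi Minh City, 2002.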
[27] Dinh Dien, Hoang Kiem, Nguyen Van Toan. Vietnamese Word Segmentation. Proceedings of NLPRS 2001, Tokyo, Japan, 27-30 November 2001, pages 749-756.
[28] Scott Miller, Heidi Fox, et al. A Novel use of statistical parsing to extract
information from Text, In 6th Applied Natural Language Processing Conference, 2000.
[29] Zhou GuoDong, Su Jian, et al. Exploring Various Knowledge in Relation Extraction. Proceedings of the 43rd Annual Meeting of the ACL, pages 427-434, Association for Computational Linguistics, 2005.
[30] Dmitry Zelenko, Chinatsu Aone, Anthony Richardella. Kernel Methods for Relation
Extraction. Journal of Machine Learning Research 3, pages 1083-1106, 2003.
[31] Razvan C. Bunescu, Raymond J. Mooney. Subsequence Kernels for Relation
Extraction. In Advances in Neural Information Processing Systems, 2006.
[32] Brill, E. Transformation-based error-driven learning and natural
language
processing:
case
study
in
part-of-speech
tagging.
http://www.nlv.gov.vn/nlv/index.php/en/2008060697/DUBLIN-CORE/XML-
Metadata-va-Dublin-Core-Metadata.html
[36] H. Han, C.L. Giles, E. Manavoglu, H. Zha, Z. Zhang, E.A. Fox, Automatic document metadata extraction using support vector machines, in: Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries, International Conference on Digital Libraries, pages 37-48. IEEE Computer Society Press, Washington, DC, 2003.
[37] K. Nakagawa, A. Nomura, and M. Suzuki, Extraction of Logical Structure from Articles in Mathematics, MKM, LNCS 3119, pages 276-289, Springer Berlin Heidelberg, 2004.
[38] F. Peng, A. McCallum, Accurate Information Extraction from Research Papers using Conditional Random Fields, Information Processing and Management: an International Journal, pages 963-979, 2006.
[39] H. Alani, S. Kim, D. E. Millard, M. J. Weal, P. H. Lewis, W. Hall and N. R. Shadbolt, Automatic Extraction of Knowledge from Web Documents, in: 2nd International Semantic Web Conference - Workshop on Human Language Technology for the Semantic Web and Web Services, October 20-23, Sanibel Island, Florida, USA, 2003.
[40] J. Greenberg, K. Spurgin, A. Crystal, Final Report for the Automatic Metadata Generation Applications (AMeGA) Project, UNC School of Information and Library Science. http://ils.unc.edu/mrc/amega/, 2005. Last visited 30/04/2010.
[41] P. Flynn, L. Zhou, K. Maly, S. Zeil, and M. Zubair, Automated Template-Based Metadata Extraction Architecture, ICADL 2007, LNCS 4822, pages 327-336, Springer-Verlag Berlin Heidelberg, 2007.
[42] S. Marinai, Metadata Extraction from PDF Papers for Digital Library Ingest, 10th
International Conference on Document Analysis and Recognition. ICDAR-IEEE, pages
251-255, 2009.