
CHAPTER 4: THE DOCUMENT CLASSIFICATION PROBLEM


Automatic text classification is a sequence of consecutive processing steps. Given a new input text, the system determines which of the predefined classes the new document belongs to. The new document must therefore have the features that are necessary for, and suited to, the classification system extracted from it. This chapter presents the basic concepts and the processing stages of a classification system.
4.1 The concept of classification
4.1.1 Definition
Document classification can be defined simply as follows:
Document classification is the assignment of category labels to a new document, based on the similarity between that document and the labeled documents in the training set [24].
Given:
$D$: the space of sample documents, $D = (d_1, d_2, \ldots, d_s)$.
$C$: the set of defined document categories, $C = (c_1, c_2, \ldots, c_{|C|})$.
Each pair $(d_i, c_j) \in D \times C$ takes a Boolean value (T, F): the value T corresponds to the case where document $d_i$ belongs to category $c_j$, and the value F to the case where document $d_i$ does not belong to category $c_j$.
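The definition above therefore amounts to constructing a function $\Phi$:
$$\Phi : D \times C \to \{T, F\} \qquad (4.1)$$
The true or false value depends on the choice of a threshold $\sigma$, and the function $\Phi$ is obtained from the scoring function $\Phi_{b\text{-}val}(d)$. The choice of the threshold $\sigma$ and of the function $\Phi$ is presented below for three specific cases.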
4.1.2 Classification cases
Depending on how a document $d_j \in D$ relates to a class $c_i \in C$, classification is divided into three types:
Binary classification (binary case TC): each document belongs to exactly one of two predefined classes.
$$\Phi_b : X \to \{true, false\} \qquad (4.2)$$
Multi-class classification (multi-class case TC): each document $d_j$ belongs to exactly one class $c_i$.
$$\Phi_{mc} : X \to C \qquad (4.3)$$
Multi-label classification (multi-label case TC): a document $d_j$ may belong to several classes $c_i$.
$$\Phi_{ml} : X \to 2^{C} \qquad (4.4)$$

Figure 4.1: Types of document classification
Example: a document about walking as a sport may fall under the athletics class within the larger Olympic branch, and may also fall under the rehabilitation class within the larger medicine branch.
In both the multi-class and the multi-label approaches, the binary case is normally still used as the basic step. From this basic step the system obtains scores measuring how strongly the document depends on each class under consideration. These scores are ranked in descending order. A system that keeps only the largest value is a multi-class system; a system that applies an acceptance threshold obtains a set of accepted classes and is therefore a multi-label system. Specifically, the following function must be defined:

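$$\Phi_{b\text{-}val} : X \to \mathbb{R} \qquad (4.5)$$
The result is true if $\Phi_{b\text{-}val}(x) > \sigma$, where $\sigma \in \mathbb{R}$ is called the threshold. The three classification types above can therefore be described by equations (4.6) to (4.8).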
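[Figure 4.1 content: for a document, binary classification answers Yes or No; multi-class classification selects exactly one class among $C_1, \ldots, C_i, \ldots, C_j$; multi-label classification selects several of those classes.]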

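$$\Phi_b(x) = \begin{cases} true & \text{if } \Phi_{b\text{-}val}(x) > \sigma \\ false & \text{otherwise} \end{cases} \qquad (4.6)$$
$$\Phi_{mc}(x) = \arg\max_{c_i} \{ \Phi_{b\text{-}val}^{c_i}(x),\ c_i \in C \} \qquad (4.7)$$
$$\Phi_{ml}(x) = \{ c_i \in C : \Phi_{b\text{-}val}^{c_i}(x) > \sigma \} \qquad (4.8)$$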
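The three decision rules in (4.6) to (4.8) can be illustrated with a minimal Python sketch. It assumes the per-class scores have already been computed by some scoring function; the function names, class labels, and score values below are hypothetical and serve only as an illustration.

```python
def classify_binary(score, threshold):
    """Equation (4.6): accept when the score exceeds the threshold."""
    return score > threshold


def classify_multiclass(scores):
    """Equation (4.7): keep only the class with the highest score."""
    return max(scores, key=scores.get)


def classify_multilabel(scores, threshold):
    """Equation (4.8): keep every class whose score exceeds the threshold."""
    return {label for label, score in scores.items() if score > threshold}


# Hypothetical scores for one document (illustrative values only).
scores = {"athletics": 0.72, "rehabilitation": 0.55, "finance": 0.10}
```

With these scores, the multi-class rule keeps only the top class, while the multi-label rule with a threshold of 0.5 keeps both classes that pass it, which mirrors the walking example above.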
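The goal of this thesis is to propose a classification algorithm modeled on the theory of universal gravitation. The relationship between a new input document and the classification hierarchy is determined by finding the stable equilibrium position of a point mass (see Section 2.9). The thesis adopts the multi-class approach to the classification problem.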
4.2 The automatic document classification problem
From the definition of document classification in 4.1.1 we obtain a simple definition of the automatic document classification problem: automatic document classification is the automatic assignment of category labels to a new document, based on the similarity between that document and the labeled documents in the training set.
Many algorithms and methods for automatic document classification have been proposed, such as algorithms based on the vector model (centroid vector, Rocchio, ...), the k-NN algorithm, and probability-based algorithms such as Naive Bayes. Each algorithm or method takes a different approach to solving the problem.
4.2.1 Approaches
4.2.1.1 Global and local approaches
When considering the classification stage, there are two ways to classify:
Local approach: the classification problem is split into many local sub-problems. Features are selected independently for each node of the hierarchy tree, and a separate classifier is built for each node using only its local features. The number of classifiers then equals the number of nodes in the tree and can reach the hundreds. This approach is nevertheless advantageous in the classification stage and when building feature vectors.
Global approach: a single classifier is built over the global scope. Each node is characterized over the entire feature space, independently of the number of features of its parent node. Classification under this approach is carried out independently at each node. This approach has several advantages over the local approach:
- The global classifier contains information about all classes and therefore avoids errors at higher levels.
- The features chosen to characterize a node do not depend on the parent node, so the classifier becomes more reliable.
The global approach also makes it possible to build classifiers that take into account the relationships between parent and child nodes, as well as between sibling nodes, in the classification tree.
4.2.1.2 The machine learning approach
Research on automatic document classification began in the 1960s, but solutions only made significant progress in the 1990s. In the late 1980s the prevailing approach was knowledge engineering (KE) [14]. This technique simply consists of manually defining rules, synthesized from expert knowledge, that assign a document to the corresponding class. In the 1990s this approach saw little further development and was replaced by the machine learning approach, a sequence of steps that builds a classifier automatically by learning from a set of already classified documents. In other words, the field moved from the traditional rule-based approach (manual construction of grammar rules, inference rules, and knowledge bases) to the corpus-based approach (learning the rules) [3], [14]. The process repeatedly builds an automatic classifier for class $c_i$ from the features of the documents labeled as belonging to $c_i$ or to its complement $\bar{c}_i$; this is also called supervised learning. Instead of building the classifiers themselves, as in the KE approach, one builds a procedure that produces the classifiers. This construction is fully automatic and relies on the available corpus.
The reasons for this shift are:
- The rapid growth and spread of corpora on the Internet, together with the habit of using digitized documents to store and transmit information.
- Great advances in computer hardware, processing software, and automatic text processing, which make it possible to store and process a large volume of computation on corpora quickly and accurately.
- Recent research and successful experimental results in machine learning.
With the machine learning approach, automatic document classification achieves accuracy comparable to that of human classification experts, saves considerable effort, and requires less knowledge and fewer domain experts to build the classifiers.
Types of learning
There are many machine learning approaches; in general they can be grouped into three types [3],[14]:
1. Symbolic learning: the type best suited to symbol-level language processing problems. It includes decision trees, decision lists, transformation-based learning, linear separators, instance-based learning, and inductive logic programming.
2. Probabilistic learning (stochastic, statistical, or probabilistic): the model is described as a probabilistic network that captures the probabilistic dependencies among events. Each node of the graph is a distribution, and the joint distribution of the data of interest is computed from the independent distributions. Approaches that yield such networks include Naive Bayes, Maximum Entropy, the Hidden Markov Model, Expectation Maximization (EM), and log-linear models.
3. Subsymbolic learning: for example neural networks and genetic algorithms. This type suits learning on low-level language data such as speech recognition.
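
Figure 4.2: Model of automatic document classification (training documents and test documents → preprocessing → feature extraction → training and classification → result)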
4.2.2 Preprocessing
4.2.2.1 Concept
Training documents and documents to be classified are usually raw documents. Text data comes in different formats depending on the structure of the storage file, such as pdf, doc, xml, or txt. For data with a complex representation, the file contains many structural characters describing the text, control characters, and so on. A document classification system does not need these structural characters; in this context they are called text noise, that is, characters that carry no meaning for the classification problem. To reduce the dimensionality of the feature space and save system resources, this text noise must be removed. Different kinds of input text call for different noise-handling solutions. This thesis does not attempt to handle noise in every format and focuses only on the web page format, i.e. HTML files [3],[14].
4.2.2.2 Preprocessing HTML data
A web page is organized as a text file (HTML or HTM format). The control characters that support the display of the page content are embedded directly in the content of the HTML file. A step is therefore needed to remove this kind of noise, called control noise. The algorithm for removing control noise is fairly simple, because the control characters are all defined by the HTML standard. In some cases the control characters can be exploited to assign weights to the features extracted later. For example, keywords inside the tags <B>, <I>, and <Strong> carry more meaning than keywords that are not inside any special tag. In addition, identifying which of the page types below a web page belongs to (following the Dublin Core classification [6],[12]) is very useful for the preprocessing module when it selects a suitable algorithm.
Topic page: a web page that contains specific content about some topic.
Hub page: a web page that does not present specific content itself but is a collection of links to web pages related to its topic. The page http://www.yahoo.com is a concrete example.
Multimedia page: the main content of the page is presented as images, audio, and so on instead of text. In pages of this type, each item of content (image, audio) usually comes with accompanying information that briefly describes it.
Content noise
The information in the content of a web page usually comes with a large amount of content noise, which is very different from traditional text data. This noise can be divided into two types:
Global noise: this noise is large in volume and is typically a replica of a web page. It not only disrupts crawling and ranking by search engines but also consumes system resources when the replica is stored.
Local noise: content noise inside the web page, typically advertisements, site copyright notices, article copyright notices, and so on. This noise makes it difficult for programs to collect the main content of the page.
The process of removing local noise is called web page cleaning, and the process of removing global noise is called replica removal. In practice the output of web page cleaning is used as the input of replica removal. Another important task in this preprocessing stage is extracting the metadata of the raw input web page.

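Figure 4.3: Web page preprocessing (web page → web page cleaning, which removes local noise → replica removal, which removes global noise → filtered plain-content input, with metadata extracted alongside)
Figure 4.3 shows that after the web page cleaning step, all web page content has been cleaned of local noise, and that after the replica removal step all replica web pages have been removed. At that point the data can serve as input to the other processing modules.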
4.2.2.3 Web page cleaning
This step removes the local noise of a web page; it is also the process of extracting the topical content of the page. The topical content and the related links are extracted from the markup tree defined by the HTML standard.
Identifying the page type: the control tags make it possible to determine the main differences in terms, hyperlinks, and multimedia content. In practice, content of differing importance is presented or placed at positions of differing importance on the page. Identifying the page type therefore lets the system attend only to the important positions on the page instead of the whole page.
Cleaning the page: for a given web page, the markup tree is determined first. Based on the markup tree, the topical content is extracted through a series of steps that depend on the characteristics of the page type.
For topic pages: the content of a topic page is presented as passages of text with few images and few links to other web pages. A heuristic can therefore prune the content that contains links to other websites or other multimedia content. The result is a subtree containing only the topical content of the page.
For hub pages: in this type of page the topical content is often unclear. Some pages have no topical content at all, or have several topics. In this case the only option is to extract the important content from the links. We focus only on links that belong to nodes at the central position of the page.
For multimedia pages: the images and audio of a multimedia page usually come with accompanying content, and that content is the topical content.
4.2.2.4 Replica removal
Replica detection relies on an algorithm that computes the similarity of web pages pair by pair. For a collection of web pages, however, the computational complexity is $O(n^2)$, which consumes a lot of computing resources. A fingerprint is therefore first created for every web page from its content. The fingerprint must always be smaller than the original page. The similarity algorithm is then applied pair by pair by comparing fingerprint values.
The algorithm below creates a fingerprint from keywords rather than from sentences or paragraphs. It uses two parameters: percentage is the percentage of words selected to build the fingerprint; interval is a unit that adjusts the number of words to select.
Step 1: Represent the web page as a feature vector (the number of words in this vector is called Sum1).
Step 2: Sort the words by their tf measure in descending order; words with the same tf are sorted alphabetically.
Step 3: Select the first Sum2 words (Sum2 = Sum1 * percentage) from the list obtained in Step 2.
Step 4: Sort the selected keywords alphabetically.
Step 5: Select the first Sum3 words (Sum3 = (Sum2 DIV Interval) * Interval) from the list obtained in Step 4.
Step 6: Apply the MD5 function to the words obtained in Step 5; the result is the fingerprint.
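The six steps can be sketched in Python as follows. This is only an illustration under several assumptions that the text does not fix: the page is already tokenized, Sum1 is taken to be the number of distinct words, the selected words are joined with spaces before hashing, and the default parameter values are arbitrary. The function name `fingerprint` is hypothetical.

```python
import hashlib
from collections import Counter


def fingerprint(tokens, percentage=0.5, interval=5):
    """Keyword-based MD5 fingerprint following Steps 1-6 (illustrative)."""
    tf = Counter(tokens)                              # Step 1: term-frequency vector
    sum1 = len(tf)                                    # assumption: distinct words
    ranked = sorted(tf, key=lambda t: (-tf[t], t))    # Step 2: tf descending, then A-Z
    sum2 = int(sum1 * percentage)                     # Step 3: keep the top Sum2 words
    selected = sorted(ranked[:sum2])                  # Step 4: alphabetical order
    sum3 = (sum2 // interval) * interval              # Step 5: round down to a multiple
    kept = selected[:sum3]
    # Step 6: hash the kept words (joining with spaces is an assumption)
    return hashlib.md5(" ".join(kept).encode("utf-8")).hexdigest()
```

Two pages that share their most frequent words but differ in a few rare ones receive the same fingerprint, so comparing fingerprints stands in for comparing the full pages.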
After preprocessing, the extracted page content contains no control characters and no unrelated content such as advertisements or multimedia information; it is plain text. This data is the input of the subsequent processing modules, such as tokenization, stopword removal, and stemming.
4.2.3 Document representation
4.2.3.1 Vector space
The vector approach is used widely in natural language processing and text classification. Even in algorithms of other kinds, parts of the vector model theory are still used, for example for weighting text features and representing text. The space model constructed in this thesis relates to keyword weights, that is, to the masses of point masses (see 5.2.2). The remaining steps of the thesis therefore concentrate on selected parts of the vector model theory.
4.2.3.2 Building the text vector
The preceding steps yield a set of keywords. Several steps then remove keywords that contribute little to representing the text, such as stopword removal and stemming, in order to reduce noise and the dimensionality of the space. After these steps we have a set of terms, also called features, which serve as the dimensions of the vector space.
A document can be represented as follows:

Figure 4.4: A two-dimensional vector space; each dimension corresponds to one word.
Example: Figure 4.4 shows three documents and one query in a two-dimensional vector space (car, insurance). The documents $d_1$, $d_2$, $d_3$ are represented by the values $d_1(0.13, 0.99)$, $d_2(0.8, 0.6)$, $d_3(0.99, 0.13)$, and the query by $q(0.71, 0.71)$. With this representation we no longer need the information in the raw data and work only with the vector values.
Mathematically, the model can be stated as follows. In the $n$-dimensional vector space $W^n(w_1, w_2, \ldots, w_n)$, each $w_i$ corresponds to one feature and therefore to one dimension of the space. A document $d_i$ is then defined as $d_i = (TF(w_1), TF(w_2), \ldots, TF(w_n))$, where $TF(w_j)$ is the frequency of word $w_j$ in document $d_i$. Because document length usually should not have much influence on the intended use, documents are represented in normalized form [15],[20]:
$$\vec{d_i} = \left( \frac{TF(w_1)}{\sqrt{\sum_{k=1}^{n} TF^2(w_k)}},\ \frac{TF(w_2)}{\sqrt{\sum_{k=1}^{n} TF^2(w_k)}},\ \ldots,\ \frac{TF(w_n)}{\sqrt{\sum_{k=1}^{n} TF^2(w_k)}} \right) \qquad (4.9)$$

Figure 4.5: Documents represented in the vector space
We treat each dimension of the vector space as a feature. For a vector in the space, we must determine the weight of each feature. For example, the values 0.13 and 0.99 are the weights of the features of vector $d_1$ in Figure 4.4. A simple and common method is to count the occurrences of the feature in the document and use that count as the weight.
When documents are very long, counting words is no longer effective, and some approaches use formula (4.9). Another widely used approach is based on the following concepts:
Term frequency ($tf_{ij}$): the number of occurrences of word $w_i$ in document $d_j$.
Document frequency ($df_i$): the number of documents in the collection that contain word $w_i$.
Collection frequency ($cf_i$): the number of occurrences of word $w_i$ in the collection.
With these concepts, the information obtained is the number of occurrences of a feature in a document. The more often a feature occurs in the text, the more important its role in that text.
The weight is computed with the following formula [8]:
$$Weight(i,j) = \begin{cases} (1 + \log(tf_{ij})) \log\dfrac{N}{df_i} & \text{if } tf_{ij} \ge 1 \\ 0 & \text{if } tf_{ij} = 0 \end{cases} \qquad (4.10)$$
where $N$ is the number of documents. Formula (4.10) exhibits two special cases: for a feature that does not occur, $tf_{ij} = 0$ and therefore $Weight(i,j) = 0$; for a feature that occurs in every document, $df_i = N$, so $\log(N/df_i) = 0$ and again $Weight(i,j) = 0$.
With this way of building the vector space model and computing feature weights, we can determine which words are meaningless for classification and which are meaningful. On that basis the dimensionality of the vector space can be reduced.
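The length normalization in (4.9) and the weighting in (4.10) can be sketched in a few lines of Python. The base of the logarithm is not stated in the text; the natural logarithm is assumed here, and the function names are hypothetical.

```python
import math


def normalize(tf_vector):
    """Equation (4.9): divide each TF value by the Euclidean length of the vector."""
    norm = math.sqrt(sum(v * v for v in tf_vector))
    return [v / norm for v in tf_vector] if norm else list(tf_vector)


def weight(tf, df, n_docs):
    """Equation (4.10): (1 + log tf) * log(N / df) when tf >= 1, otherwise 0."""
    if tf == 0:
        return 0.0
    # Natural logarithm is an assumption; another base only rescales the weights.
    return (1.0 + math.log(tf)) * math.log(n_docs / df)
```

Calling `weight` with `tf = 0`, or with `df` equal to `n_docs`, returns 0, which matches the two special cases discussed for (4.10).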
4.2.3.3 Centroid vector
The starting point is again the idea of building a feature vector for a document. Here we build a vector for a class of documents, called the feature vector of the class or the centroid vector [17],[19].
As when building the feature vector of a single document, we must extract the features, which are terms. A class, however, usually contains a set of different documents. Extracting every feature from all of these documents would consume many resources and much processing time. Features are therefore usually extracted from summaries, citations, additional descriptions, or the information in META tags, and also from the structure of the markup tree itself, for example from the tags HEAD, TITLE, I, B, H1, H2, and H3. The markup tree structure can also be used to select passages likely to carry the topic of the document, and summary content such as the abstract is among the content that yields the most valuable features.
Selecting the content from which features are extracted in this deliberate way saves a considerable amount of system resources and processing time. Once the features (terms) have been extracted in this way, what remains is to apply, unchanged, the method for building the feature vector of a document.
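$$\vec{d_j} = (w_{1j}, w_{2j}, \ldots, w_{nj}) \in D \qquad (4.11)$$
where $n$ is the number of terms (the dimension of the feature space), $\vec{d_j}$ is the feature vector of the $j$-th document in the training collection $D$, and $w_{ij}$ is the weight of term $i$ in document $j$.
By the definition of the centroid vector, the centroid of class $c_i$ is computed as:
$$\vec{c_i} = \sum_{\vec{d_j} \in c_i} \vec{d_j} = \sum_{j=1}^{|c_i|} \vec{d_j} \qquad (4.12)$$
where $|c_i|$ is the number of documents in class $c_i$.
For simplicity, and to avoid the effect of a class $c_i$ having many labeled documents, the following averaged formula is normally used:
$$\vec{c_i} = \frac{1}{|c_i|} \sum_{j=1}^{|c_i|} \vec{d_j} \qquad (4.13)$$
The weight $w_{c_i l}$ of the $l$-th term in the centroid vector of class $c_i$ is computed from the weights $w_{lj}$ of that term in each document $j$ of the class:
$$w_{c_i l} = \frac{1}{|c_i|} \sum_{j=1}^{|c_i|} w_{lj} \qquad (4.14)$$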
4.2.4 Feature selection
There are many feature selection algorithms for the classification problem. The common goal of all approaches is to build a model that selects the best features for representing documents, such that the number of features in the feature space is neither so large that it hurts processing speed nor so small that it hurts the accuracy of the classification algorithm.
In the vector approach to building a classification model, each document is represented by a vector in the feature space. This representation is also known as bag of words (BOW), meaning that there is no order or structure among the features. Under this representation each category is likewise characterized by a vector. To extract the features on which the feature vectors are built, each training document must pass through several steps before the feature selection algorithm is applied: preprocessing, tokenization, stopword removal, and stemming.
4.2.4.1 Tokenization
A token is a sequence of alphabetic characters, or a sequence of digits (a number containing a dot used as a decimal point counts as one token), or a single non-alphabetic character (such as a punctuation mark, a quotation mark, or an extended character). The tokenizer splits the input text (a sequence of characters) into discrete tokens, which are used as input to the part-of-speech lookup component. [3]
Given an input text, the tokenizer first uses the whitespace in the text to split it into substrings. Each substring may be a single token as defined above, or a combination of several tokens. The substrings therefore cannot be used directly: we must check whether each one is a token. If it is, it is passed to the next step; if not, it must be split further into tokens before being passed on. For example, "that's" is not one token but consists of the two tokens "that" and "is", because a contraction has been used. In other cases the input is malformed, as in "drive.Specify", which must be split into the three tokens "drive", "." and "specify". The case "e.g" cannot be split, because it is itself a token; a dictionary is needed to store such special cases.
The algorithm for splitting text into tokens is as follows [3]:

Step 1: Split off one substring from the input text using whitespace.
Step 2: If no substring remains, stop.
Step 3: Check whether the substring exists in the dictionary. If it does, it is one token and the algorithm goes to Step 5; otherwise it continues with Step 4.
Step 4: Split the substring into n substrings (n >= 1) at the punctuation marks it contains. If one of the resulting substrings exists in the dictionary, we obtain n tokens; otherwise we obtain 1 token.
Step 5: Go back to Step 1.
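The five steps above can be sketched in Python as below. This is a simplified illustration, not the tokenizer of [3]: the dictionary is passed in as a plain set of lowercase strings, punctuation is taken to be any character that is neither a word character nor whitespace, and contractions such as "that's" are not expanded. The function name and the dictionary contents are hypothetical.

```python
import re


def tokenize(text, dictionary):
    """Whitespace split followed by a dictionary check and punctuation split."""
    tokens = []
    for chunk in text.split():                      # Steps 1, 2 and 5: next substring
        if chunk.lower() in dictionary:             # Step 3: known special case
            tokens.append(chunk)
            continue
        # Step 4: split at punctuation, keeping the punctuation marks as pieces
        pieces = [p for p in re.split(r"([^\w\s])", chunk) if p]
        if any(p.lower() in dictionary for p in pieces):
            tokens.extend(pieces)                   # n tokens
        else:
            tokens.append(chunk)                    # 1 token
    return tokens
```

With a dictionary containing "e.g", "drive" and "specify", the input "e.g drive.Specify" keeps "e.g" whole and splits "drive.Specify" into three tokens, as in the examples discussed above.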
4.2.4.2 Stopword removal
This process saves system resources and reduces the number of features in the vector space. Stopwords carry little meaning for representation, retrieval, or other techniques related to natural language processing. Stopword removal is no longer a major concern, because computing power and memory keep improving. The stopword list consists of classes of words with little meaning, such as articles (a, the, ...), conjunctions (and, but, ...), interjections (oh, but, ...), prepositions (in, over, ...), and so on. Stopword removal is simply a comparison of each word against the predefined stopword list: if the word is in the list it is removed, otherwise it is kept. It can be described as follows:
Stopword removal algorithm
Input: $\{T_i\}$, the set of terms in document $d_i \in D$; $S$, the stopword list
Output: $\{T_i\}$ such that $t_j \notin S$ for every remaining term
Algorithm:
Foreach $t_j \in T_i$
    If $t_j \in S$ then $T_i \leftarrow T_i \setminus \{t_j\}$
Return $T_i$
(The stopword list S is given in Appendix I of this thesis.)
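The removal loop translates almost directly into Python. The sketch below assumes the terms are given as a list of strings and compares them case-insensitively; the small stopword set is a hypothetical sample, not the list from Appendix I.

```python
def remove_stopwords(terms, stopwords):
    """Keep only the terms that do not appear in the stopword list."""
    return [t for t in terms if t.lower() not in stopwords]


# Hypothetical sample of stopwords for illustration.
SAMPLE_STOPWORDS = {"a", "the", "and", "but", "in", "over", "oh"}
```

Storing the list as a set makes each membership test constant time on average, so the whole pass is linear in the number of terms.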
4.2.4.3 Stemming
Stemming is a particularly useful process in computational natural language processing. In a collection of texts, each text is described by words. Under the representations described above (see 4.2.3), each text is characterized by a vector in the feature space. In language, however, words that share the same root usually have similar meanings, for example: computer, computerize, computerization, computerise, ... To improve the processing speed of the system, such groups of similar words must be reduced to a single word. In the example above, the word compute is enough to represent the whole group. This is done simply by removing suffixes such as -er, -ise, -ion, -r, ... Moreover, stemming reduces the number of features in the vector space and thus the size and computational complexity of the system. [16]
The suffix-stripping (stemming) algorithm was introduced by M. F. Porter in 1980 [16].
4.2.4.4 Feature selection
In document classification the number of terms in the sample set is usually very large, and because of the learning algorithm we cannot use all of these terms to represent a document. In addition, irrelevant or redundant terms affect both accuracy and processing speed. After stemming we have a set of terms. This set can be used directly as the feature set of the space, but in many cases an optimal subset of features is constructed to represent documents and document categories. Feature selection is not a simple process. With too few features, the characterization of documents and categories is not accurate enough; with too many, noise is easily introduced and the computing resources required increase [17],[19].
Feature selection algorithms fall into two types: wrapper and filter [13]. A wrapper algorithm takes the execution of the learning algorithm into account while the optimal subset is being chosen. It measures the quality of a set of attributes by the performance of the learning algorithm, so the result depends on that algorithm. The main problem of this approach is the computing resources it requires. In contrast, a filter algorithm considers the weight of each individual term and uses a ranking formula to select terms, regardless of the learning algorithm. It depends on the document set and not on any learning algorithm. In practice, filter algorithms are used more often for document classification.
The input of the filter algorithm is the term space obtained from the document set after stopword removal and stemming. The following algorithm is based on the vector model and tf-idf theory. Formula (4.14) gives the weight $w_{c_i l}$ of the $l$-th term in class $c_i \in C$. Over the whole training set, the weight is computed by the function $CW(l)$:
$$CW(l) = \sum_{i=1}^{|C|} P(c_i)\, w_{c_i l} \qquad (4.15)$$
where $|C|$ is the number of classes in $C$ and $P(c_i)$ is the probability of class $c_i$ in the training collection.
Formula (4.15) yields a table of terms with their weights in the collection. The weights are sorted in descending order and a threshold is chosen. Terms whose weight is greater than the threshold are selected; the others are discarded. This gives a procedure for choosing a subset of terms from the set of all terms in the document collection.
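The filter step based on (4.15) can be sketched as follows. The sketch assumes that the class centroids from (4.14) are available as dictionaries mapping terms to weights and that the class probabilities are given; the names `class_weights` and `select_features`, and the sample values in the usage note, are hypothetical.

```python
def class_weights(centroids, priors):
    """Equation (4.15): CW(l) = sum over classes of P(c_i) * w_{c_i l}."""
    terms = {t for centroid in centroids.values() for t in centroid}
    return {
        t: sum(priors[c] * centroids[c].get(t, 0.0) for c in centroids)
        for t in terms
    }


def select_features(centroids, priors, threshold):
    """Return the terms whose CW exceeds the threshold, highest weight first."""
    cw = class_weights(centroids, priors)
    kept = [t for t, w in cw.items() if w > threshold]
    return sorted(kept, key=lambda t: -cw[t])
```

For instance, with two equally likely classes whose centroids give "match" a weight of 0.8, "virus" 0.9, and "the" 0.1 in both, a threshold of 0.2 keeps "virus" and "match" and discards "the".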
The feature space of a document classification problem is usually very large, especially for general-purpose classification. For classification by specialty or topic, the feature space is smaller. The size of the feature space and the feature selection threshold therefore differ from one problem to another.
4.2.5 Training and classification
To classify text automatically, we need a sample data set that has already been classified. From this sample set we extract the characteristic information. Then, depending on the algorithm, we build the classifiers (Figure 4.8). Based on the classifiers, the automatic classification system decides which category a new input document belongs to. In general a document classification system combines two stages: the training stage and the classification stage [8],[14].
Training stage:
For each document category $c_i$ (economics, informatics, ...), the user collects the documents related to that category, $d_i = (d_i^1, d_i^2, \ldots, d_i^k)$. All sample documents $d_i$ of category $c_i$ go through preprocessing and feature extraction, producing the features $f_i = (f_i^1, f_i^2, \ldots, f_i^s)$ of the category. The classifiers are built from the features $f_i$. The result of this stage is a classifier [8]. Depending on the algorithm, the research direction, and the way the model is built, different classifiers are obtained.
Classification stage: the document to be classified is processed through several steps (preprocessing, feature extraction, characterization). It is then assigned to the corresponding category by the classifier built in the training stage.

Figure 4.6: Model for building a document classifier (training set documents → learning → classifier)
4.2.5.1 Training
This is the most important and most complex stage of a classification system. In this stage we build the classification model for each class. The resulting model determines whether a new document belongs to the class or not. A classifier must therefore be built for each node (class) $c_i$. The classifier for class $c_i$ is built automatically by an iterative process, based on the features obtained from the documents assigned to class $c_i$ and from the documents not assigned to class $c_i$. These features make it possible to decide whether a new document belongs to class $c_i$. To build the classifiers for $C$, a general collection $D$ of labeled documents is required, that is, documents for which the value of the function $\Phi(d_j, c_i)$ is known for $(d_j, c_i) \in D \times C$.
In the learning stage we do not use the whole collection $D$ but only part of it. The collection is divided into three smaller independent collections: the training set Tr, the validation set Va, and the test set Te. The training set Tr is the collection of documents used during learning to build the classifier. The validation set Va is the collection of documents used to tune the classifiers, for example to choose a parameter $p$ on which the classifier depends so as to obtain the best result when estimated on the validation data. The test set Te is the data used to evaluate the final result of the classifier.
There are several ways of learning a classifier, such as the binary classifier of formula (4.2), $\Phi : D \times C \to \{T, F\}$. In other cases the classification function takes real values in $[0, 1]$ (the CSV function, categorization status value), $CSV : D \times C \to [0, 1]$. With this approach the system must determine a threshold $\sigma$, after which the CSV function becomes $\Phi$.
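A minimal sketch of the three-way split into Tr, Va, and Te is given below. The text does not state the proportions; the 60/20/20 split, the random shuffling, and the fixed seed are assumptions made only for illustration, and the function name is hypothetical.

```python
import random


def split_collection(docs, train=0.6, validation=0.2, seed=0):
    """Shuffle the collection and cut it into Tr, Va and Te (illustrative ratios)."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)       # fixed seed for reproducibility
    n_tr = round(len(shuffled) * train)
    n_va = round(len(shuffled) * validation)
    tr = shuffled[:n_tr]
    va = shuffled[n_tr:n_tr + n_va]
    te = shuffled[n_tr + n_va:]                 # whatever remains is the test set
    return tr, va, te
```

The three parts are disjoint and together cover the whole collection, which keeps the test documents unseen during both training and parameter tuning.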
4.2.5.2 Some classification methods
Many classifiers have been built and applied in practice. Some classification methods are presented below.
The Rocchio classification model
For each class $c_i$ we build the feature vector of the class. This feature vector (the centroid vector) has the same number of dimensions as the document representation. The dimensions of the centroid vector are weighted according to the theory presented in Section 4.2.3.2. For this case Rocchio defines the weight as follows:
$$w_{ki} = \beta \sum_{d_j \in POS_i} \frac{w_{kj}}{|POS_i|} - \gamma \sum_{d_j \in NEG_i} \frac{w_{kj}}{|NEG_i|} \qquad (4.16)$$
where
$k$: the index of the term
$w_{ki}$: the weight of term $k$ for class $i$
$d_j$: the $j$-th document (in $POS_i$ or $NEG_i$)
$POS_i$: the set of documents that belong to class $i$
$NEG_i$: the set of documents that do not belong to class $i$
$\beta, \gamma$: control parameters that define the relative importance of POS and NEG.
Example: consider a training set of 10 documents in which documents 1 to 4 are labeled as belonging to the class Medicine and documents 5 to 10 do not belong to it. Then $|POS_{med}| = 4$ and $|NEG_{med}| = 6$.
The term nuclear is distributed over the training set as follows:
w_nuclear_doc1 = 0.5
w_nuclear_doc2 = w_nuclear_doc3 = w_nuclear_doc5 = ... = w_nuclear_doc9 = 0
w_nuclear_doc4 = 0.5
w_nuclear_doc10 = 0.5
The control parameters $\beta$ and $\gamma$ are 2 and 1 respectively. According to the formula, the weight of the term for the class Medicine is:
w_nuclear_medicine = 2 * (0.5 + 0.5)/4 - 1 * 0.5/6 = 0.5 - 0.08 = 0.42
The other terms of the vector space are handled in the same way, which gives the centroid vector of the class Medicine. The centroid vectors of the other classes are built likewise. At the end of the process we have the Rocchio classifier based on the vector model.
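The arithmetic of the Medicine example can be checked with a short Python sketch of (4.16). It takes the weights that the example does not list explicitly (documents 2, 3 and 5 to 9) to be 0, as the computed result implies; the function name is hypothetical.

```python
def rocchio_weight(pos_weights, neg_weights, beta, gamma):
    """Equation (4.16) for one term of one class."""
    positive = beta * sum(pos_weights) / len(pos_weights)
    negative = gamma * sum(neg_weights) / len(neg_weights)
    return positive - negative


# Weights of the term "nuclear" in the example (unlisted weights assumed to be 0).
pos = [0.5, 0.0, 0.0, 0.5]              # documents 1-4, labeled Medicine
neg = [0.0, 0.0, 0.0, 0.0, 0.0, 0.5]    # documents 5-10, not Medicine
w_nuclear_medicine = rocchio_weight(pos, neg, beta=2, gamma=1)
```

The result is 0.5 - 0.5/6, which rounds to the value 0.42 given in the example.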
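The Rocchio training algorithm can be described simply as follows:
Input:
{C}: the classes
{D}: the training set
$\beta, \gamma$: the control parameters
Output:
$\{\vec{c}\}$: the set of feature vectors of the classes {C}
Algorithm:
Foreach $c_i \in C$ do
{
    $POS_i \leftarrow (d : d \in c_i,\ d \in D)$
    $NEG_i \leftarrow (d : d \notin c_i,\ d \in D)$
    Compute $\vec{c_i} = (w_{1i}, w_{2i}, \ldots, w_{ni})$ with $w_{ki}$ computed according to (4.16)
}
Return $\{\vec{c_1}, \vec{c_2}, \ldots, \vec{c}_{|C|}\}$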

Naive Bayes (NB)
The NB classifier is built on the Naive Bayes method, a probability-based classification method that is widely applied in machine learning. The NB approach uses the conditional probabilities between words and topics to predict the topic probability of a document to be classified.
Let $\Pr(C_j \mid d')$ be the probability that document $d'$ belongs to class $C_j$. By Bayes' rule, document $d'$ is assigned to the class $C_j$ with the highest probability $\Pr(C_j \mid d')$:
$$H_{Bayes}(d') = \arg\max_{C_j \in C} \frac{\Pr(C_j) \prod_{i=1}^{|d'|} \Pr(w_i \mid C_j)}{\sum_{C' \in C} \Pr(C') \prod_{i=1}^{|d'|} \Pr(w_i \mid C')} \qquad (4.17)$$
$$= \arg\max_{C_j \in C} \frac{\Pr(C_j) \prod_{w \in F} \Pr(w \mid C_j)^{TF(w, d')}}{\sum_{C' \in C} \Pr(C') \prod_{w \in F} \Pr(w \mid C')^{TF(w, d')}} \qquad (4.18)$$
where:
$TF(w_i, d')$ is the number of occurrences of word $w_i$ in document $d'$
$|d'|$ is the number of words in $d'$
$w_i$ is a word in the feature space $F$, whose dimension is $|F|$
$\Pr(C_j)$ is computed from the proportion of documents of each class in the training data: $\Pr(C_j) = \dfrac{|C_j|}{\sum_{C' \in C} |C'|}$
$\Pr(w_i \mid C_j)$ is computed with the Laplace estimator:
$$\Pr(w_i \mid C_j) = \frac{1 + TF(w_i, C_j)}{|F| + \sum_{w' \in F} TF(w', C_j)} \qquad (4.19)$$

NB is a linear classification algorithm suited to classifying text over many topics. Its advantages are that it is simple to implement, fast, easy to update with new training data, and highly independent of the training set, so that several different training sets can be combined.
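A compact sketch of the NB classifier with the Laplace estimator of (4.19) is shown below. It works with logarithms of probabilities, which leaves the arg max unchanged, and it drops the denominator of (4.17) because it is the same for every class. Documents are assumed to be lists of tokens, words outside the feature space are ignored, and all names and sample data are hypothetical.

```python
import math
from collections import Counter


def train_nb(labeled_docs):
    """labeled_docs: list of (tokens, label). Returns priors, term counts, vocabulary."""
    doc_counts = Counter(label for _, label in labeled_docs)
    term_counts = {label: Counter() for label in doc_counts}
    for tokens, label in labeled_docs:
        term_counts[label].update(tokens)
    vocab = {t for counts in term_counts.values() for t in counts}
    total_docs = sum(doc_counts.values())
    priors = {label: n / total_docs for label, n in doc_counts.items()}
    return priors, term_counts, vocab


def classify_nb(tokens, priors, term_counts, vocab):
    """Return the class maximizing log Pr(C) + sum of log Pr(w | C)."""
    best_label, best_score = None, -math.inf
    for label, prior in priors.items():
        # Denominator of (4.19): |F| plus the total term count of the class
        denom = len(vocab) + sum(term_counts[label].values())
        score = math.log(prior)
        for t in tokens:
            if t in vocab:  # words outside the feature space F are skipped
                score += math.log((1 + term_counts[label][t]) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Because every count is incremented by one, a word that never occurred in a class still receives a small non-zero probability, so a single unseen word cannot force the class probability to zero.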

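The centroid-based vector method
This is a simple classification method that is easy to implement and fast, because its complexity is linear, $O(n)$. The idea is that each class in the training data is represented by a centroid vector. The class of an arbitrary document is determined by finding the centroid vector closest to the vector representing the document. The class of the document is the class represented by that centroid vector, and the distance is measured with the cosine measure.
The centroid vector of class $i$ is computed as:
$$\vec{C_i} = \frac{1}{|\{i\}|} \sum_{\vec{d_j} \in \{i\}} \vec{d_j} \qquad (4.20)$$
The distance measure between vector $\vec{x}$ and vector $\vec{C_i}$ is:
$$\cos(\vec{x}, \vec{C_i}) = \frac{\vec{x} \cdot \vec{C_i}}{\|\vec{x}\| \, \|\vec{C_i}\|} \qquad (4.21)$$
where:
$\vec{x}$ is the vector of the document to be classified
$\{i\}$ is the set of documents belonging to topic $C_i$
The topic of vector $\vec{x}$ is the topic $C_x$ that maximizes the cosine: $C_x = \arg\max_{C_i} \cos(\vec{x}, \vec{C_i})$.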
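The method can be sketched in Python with plain lists as vectors. The centroid follows (4.20) and the similarity follows (4.21); the class labels and training vectors in the usage note are hypothetical and only loosely inspired by the two-dimensional (car, insurance) example.

```python
import math


def cosine(u, v):
    """Equation (4.21): cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Equation (4.20): component-wise mean of the vectors of one class."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def classify_centroid(x, centroids):
    """Return the label whose centroid has the highest cosine with x."""
    return max(centroids, key=lambda label: cosine(x, centroids[label]))
```

For example, with a class "car" trained on (0.99, 0.13) and (0.8, 0.6) and a class "insurance" trained on (0.13, 0.99), a document vector such as (0.9, 0.2) is assigned to "car". Training costs one pass over the documents, and classification costs one cosine per class.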

The SVM method
SVM is a very effective classification method introduced by Vapnik. The idea is that a training set is given in a vector space in which each document is regarded as a point. The method finds the best decision hyperplane $h$ that separates the points in this space into two distinct classes, called class + (plus) and class - (minus). The quality of the hyperplane is determined by the distance (called the margin) from the nearest data point of each class to the hyperplane. The larger the margin, the better the separation of the points into two classes, and hence the better the classification result. The goal of the SVM algorithm is to find the largest margin in order to obtain a good classification result.
SVM is in essence an optimization problem: the goal is to find a space $H$ and a decision hyperplane $h$ on $H$ such that the classification error is minimal, in which case the classification result is best.

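Figure 4.7: Example of the SVM method
The equation of the hyperplane containing the vector $\vec{d_i}$ in the space is:
$$\vec{d_i} \cdot \vec{w} + b = 0 \qquad (4.22)$$
$$h(\vec{d_i}) = \mathrm{sign}(\vec{d_i} \cdot \vec{w} + b) = \begin{cases} +1, & \vec{d_i} \cdot \vec{w} + b > 0 \\ -1, & \vec{d_i} \cdot \vec{w} + b < 0 \end{cases} \qquad (4.23)$$
Thus $h(\vec{d_i})$ expresses the assignment of the vector $\vec{d_i}$ to one of the two classes. Let $y_i$ take the value +1 or -1: when $y_i = +1$ the document corresponding to $\vec{d_i}$ belongs to class +, and otherwise it belongs to class -. To obtain the hyperplane $h$ we solve the following problem:
Find $\min \|\vec{w}\|$ over $\vec{w}$ and $b$ subject to:
$$\forall i \in \{1, \ldots, n\} : \ y_i (\vec{d_i} \cdot \vec{w} + b) \ge 1 \qquad (4.24)$$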
We see that the SVM decision hyperplane depends only on the support vectors, whose distance to the decision hyperplane is $1/\|\vec{w}\|$. If the other points are removed, the algorithm still gives the same result as before. This property distinguishes SVM from other algorithms such as kNN, LLSF, Nnet, and NB, in which all the data in the training set is used to optimize the result.
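Once a hyperplane is known, applying the decision function (4.23) and computing the margin $1/\|\vec{w}\|$ is straightforward. The sketch below does not train an SVM; it assumes that $\vec{w}$ and $b$ have already been found by some solver, and the numeric values in the usage note are hypothetical.

```python
import math


def decide(d, w, b):
    """Equation (4.23): +1 when d . w + b > 0, otherwise -1.

    A point lying exactly on the hyperplane is mapped to -1 here by convention.
    """
    return 1 if sum(wi * di for wi, di in zip(w, d)) + b > 0 else -1


def margin(w):
    """Distance from the support vectors to the hyperplane, 1 / ||w||."""
    return 1.0 / math.sqrt(sum(wi * wi for wi in w))
```

With the hypothetical hyperplane `w = [3, 4]`, `b = -5`, the point `[2, 1]` falls on the + side, the origin on the - side, and the margin is 0.2.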

KNearest Neighbor (kNN)
kNN l phng php truyn thng kh ni ting theo hng tip cn thng k
c nghin cu trong nhiu nm qua. kNN c nh gi l mt trong nhng
phng php tt nht c s dng t nhng thi k u trong nghin cu v phn
loi vn bn .
tng ca phng php ny l khi cn phn loi mt vn bn mi, thut
ton s xc nh khong cch (c th p dng cc cng thc v khong cch nh
Euclide, Cosine, Manhattan, ) ca tt c cc vn bn trong tp hun luyn n
vn bn ny tm ra k vn bn gn nht, gi l k nearest neighbor k lng ging
gn nht, sau dng cc khong cch ny nh trng s cho tt c cc ch . Khi
, trng s ca mt ch chnh l tng tt c cc khong cch trn ca cc vn
bn trong k lng ging c cng ch , ch no khng xut hin trong k lng
ging s c trng s bng 0. Sau cc ch s c sp xp theo gi tr trng s
gim dn v cc ch c trng s cao s c chn lm ch ca vn bn cn
phn loi.

The weight of topic $c_j$ for document $x$ is computed as follows:
$$W(x, c_j) = \sum_{d_i \in \{kNN\}} \operatorname{sim}(x, d_i) \cdot y(d_i, c_j) - b_j \tag{4.25}$$
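Where:
- $y(d_i, c_j) \in \{0, 1\}$, with:
  y = 0: document $d_i$ does not belong to topic $c_j$
  y = 1: document $d_i$ belongs to topic $c_j$
- $\operatorname{sim}(x, d_i)$: the similarity between the document to be classified $x$ and the
document $d_i$. The cosine measure can be used to compute it:
$$\operatorname{sim}(x, d_i) = \cos(x, d_i) = \frac{x \cdot d_i}{\lVert x \rVert \, \lVert d_i \rVert} \tag{4.26}$$
- $b_j$ is the classification threshold of topic $c_j$, learned automatically using a
validation set of documents selected from the training set.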
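The topic weighting in (4.25) with the cosine measure in (4.26) can be sketched as follows. This is an illustration under stated assumptions, not the thesis's implementation: the function names `cosine` and `knn_topic_weights`, the toy vectors, and the topic labels are hypothetical, and the thresholds $b_j$ default to 0 because learning them from a validation set is not shown.

```python
import math

def cosine(x, d):
    """Cosine similarity between two equal-length vectors (0.0 for a zero vector)."""
    num = sum(a * b for a, b in zip(x, d))
    nx = math.sqrt(sum(a * a for a in x))
    nd = math.sqrt(sum(b * b for b in d))
    return num / (nx * nd) if nx and nd else 0.0

def knn_topic_weights(x, training, k, thresholds=None):
    """Return {topic: W(x, topic)} over the k training documents most similar to x.

    training   -- list of (vector, set_of_topics) pairs
    thresholds -- optional {topic: b_j}; missing topics use 0.0 (an assumption)
    """
    thresholds = thresholds or {}
    neighbors = sorted(training, key=lambda item: cosine(x, item[0]), reverse=True)[:k]
    weights = {}
    for vector, topics in neighbors:
        s = cosine(x, vector)
        for topic in topics:  # y(d_i, c_j) = 1 only for the topics of d_i
            weights[topic] = weights.get(topic, 0.0) + s
    return {t: w - thresholds.get(t, 0.0) for t, w in weights.items()}

# Hypothetical toy training set: 2-dimensional feature vectors with topic labels.
training = [
    ([1.0, 0.0], {"sports"}),
    ([0.9, 0.1], {"sports"}),
    ([0.0, 1.0], {"medicine"}),
    ([0.1, 0.9], {"medicine"}),
    ([0.7, 0.7], {"sports", "medicine"}),
]
x = [1.0, 0.2]

weights = knn_topic_weights(x, training, k=3)
best_topic = max(weights, key=weights.get)
```

Topics that do not occur among the k neighbors are simply absent from `weights`, which corresponds to a weight of 0 in the description above.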
To choose the best value of k for classification, the algorithm needs to be run
experimentally with many different values of k. In general, larger values of k
make the algorithm more stable and reduce errors caused by noisy neighbors.
The classification algorithms above all share the requirement that documents be
represented as feature vectors.
Three important factors affect text classification results:
i, A standard and sufficiently large training data set is needed for the
algorithm to learn the classification. With such a data set, training goes well
and the trained classifier gives good results.
ii, Most of the methods above use the vector model to represent documents, so the
method used to extract features from documents plays an important role in
representing documents as vectors.
iii, The algorithm used for classification must have a reasonable processing
time, which includes both learning time and document classification time. In
addition, the algorithm should be incremental (incremental function): when new
documents are added to the data set, it classifies only the new documents rather
than reclassifying the whole collection. The algorithm should also be able to
reduce the effect of noise when classifying documents.
4.2.6 Validation
After the training phase of the classification model, we obtain a set of
classifiers. In many cases these classifiers already meet certain requirements or
criteria. To further improve the effectiveness of the classification function, a
validation phase is needed to select or fine-tune the relevant parameters. This
phase applies the validation set Va (see 4.2.5.1) to the previously built
classifiers [8],[10].
Training efficiency (usually measured as the average time needed to build the
classifier for class $c_i$ from the data collection D), classification efficiency
(usually measured as the average time the classifier needs to classify one
document into class $c_i$), and the effectiveness of the classifier (usually
measured by its accuracy) are the standard measures of the success of learning
algorithms. In classification applications all three are important, and the
balance among them must be considered. For example, in an application that
interacts with users, a classifier with poor classification efficiency is
unsuitable. Conversely, in classification applications where the number of classes
runs into the thousands, a classifier with poor training efficiency is also
unsuitable.
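Effectiveness is measured by a combination of precision and recall. For each
class $c_i$, these measures ($Pr_i$ and $Re_i$) are computed from the following values.
For the set of documents considered for assignment to class $c_i$, the counts are
given in the following table:
Table 4.1: Outcomes of assigning documents to class $c_i$ compared with their labels
                              | Label is c_i | Label is not c_i
System assigns c_i            | TP_i         | FP_i
System does not assign c_i    | FN_i         | TN_i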
These can be described as follows:
$TP_i$: the documents that the system assigns to class $c_i$ and that are in fact
labeled $c_i$.
$FP_i$: the documents that the system assigns to class $c_i$ but that are in fact not
labeled as belonging to class $c_i$.
$TN_i$: the documents that neither the system nor the labels place in class $c_i$.
$FN_i$: the documents that the system does not assign to class $c_i$ but that in fact
belong to class $c_i$.
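Table 4.2: Measures over the whole classification space
                            | Label is C                          | Label is not C
System assigns C            | $TP = \sum_{i=1}^{|C|} TP_i$        | $FP = \sum_{i=1}^{|C|} FP_i$
System does not assign C    | $FN = \sum_{i=1}^{|C|} FN_i$        | $TN = \sum_{i=1}^{|C|} TN_i$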
We therefore have:
$$Pr_i = \frac{TP_i}{TP_i + FP_i} \tag{4.27}$$
$$Re_i = \frac{TP_i}{TP_i + FN_i} \tag{4.28}$$
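To estimate precision and recall over the whole classification space, two
approaches are used, where TP, FP, and FN are the totals over all classes
(Table 4.2):
Micro Average:
$$Pr^{\mu} = \frac{TP}{TP + FP} \tag{4.29}$$
$$Re^{\mu} = \frac{TP}{TP + FN} \tag{4.30}$$
Macro Average:
$$Pr^{M} = \frac{\sum_{i=1}^{|C|} Pr_i}{|C|} \tag{4.31}$$
$$Re^{M} = \frac{\sum_{i=1}^{|C|} Re_i}{|C|} \tag{4.32}$$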
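The difference between the two averaging schemes is easiest to see on a small example. The sketch below uses made-up per-class counts (the class names `c1`, `c2` and all numbers are hypothetical); it computes per-class precision and recall, then the micro average from the summed counts and the macro average from the per-class values.

```python
def precision(tp, fp):
    """TP / (TP + FP), defined as 0.0 when nothing was assigned to the class."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """TP / (TP + FN), defined as 0.0 when the class has no labelled documents."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical per-class counts: one frequent class and one rare class.
counts = {
    "c1": {"TP": 8, "FP": 2, "FN": 2},
    "c2": {"TP": 1, "FP": 1, "FN": 3},
}

# Per-class precision and recall.
per_class = {
    c: (precision(v["TP"], v["FP"]), recall(v["TP"], v["FN"]))
    for c, v in counts.items()
}

# Micro average: sum the counts over all classes first, then compute the ratios.
TP = sum(v["TP"] for v in counts.values())
FP = sum(v["FP"] for v in counts.values())
FN = sum(v["FN"] for v in counts.values())
micro_pr, micro_re = precision(TP, FP), recall(TP, FN)

# Macro average: compute the ratios per class first, then take their mean.
macro_pr = sum(p for p, _ in per_class.values()) / len(per_class)
macro_re = sum(r for _, r in per_class.values()) / len(per_class)
```

With these counts the micro averages are pulled toward the frequent class `c1`, while the macro averages give the rare class `c2` equal weight, so the macro values come out lower.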
The results returned by the system at this stage can vary widely, because
different training sets and different validation data sets are used.
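4.2.7 Document classification
Once the classifiers have been built, they are tested according to the
development model and the proposed algorithm. In particular, the
classifier-building phase yields an approximation of a function
$D \times C \to [0, 1]$, which is used to determine whether a new document belongs
to the class under consideration. In other words, this is simply the step of
testing a new document $d_{new}$ to see which class $c_i$ it belongs to.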
Figure 4.8: Diagram of classifying a new document $d_{new}$ into class $c_i$
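Figure 4.8 describes the process of classifying a new document into class $c_i$.
After preprocessing, features are extracted from the document $d_{new}$; based on
these features, $d_{new}$ is characterized in the characterization step. The
characterized representation of $d_{new}$ is then used as the input to the
classifier for class $c_i$ (classifier $c_i$).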
Chapter 5 of the thesis presents in detail the classification algorithm for a
newly submitted document. The algorithm first selects the set of nodes (each
representing a class) that are relevant to the new document, and the traversal is
carried out only on this set of relevant nodes. At each node visited, the
classifier of node $c_i$ computes the relevance between the new document and node
$c_i$. By comparing the relevance scores produced by the classifiers, the system
selects the node $c_j$ that best matches the submitted document $d_{new}$.
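The flow from preprocessing to the final choice of class can be sketched in a few lines. This is only an illustration of the stages named above, not the traversal algorithm of Chapter 5: the vocabulary, the term-frequency characterization, and the per-class scoring functions are all hypothetical stand-ins for trained classifiers.

```python
def preprocess(text):
    """Lowercase and split on whitespace (a stand-in for real preprocessing)."""
    return text.lower().split()

def extract_features(tokens, vocabulary):
    """Keep only the tokens that belong to the feature vocabulary."""
    return [t for t in tokens if t in vocabulary]

def characterize(features, vocabulary):
    """Turn the extracted features into a term-frequency vector."""
    return [features.count(term) for term in vocabulary]

def make_classifier(indices):
    """Hypothetical classifier: share of the vector mass on the given term indices."""
    def score(vector):
        total = sum(vector)
        return sum(vector[i] for i in indices) / total if total else 0.0
    return score

def classify(text, vocabulary, classifiers):
    """Run every class's classifier on the document and pick the best score."""
    vector = characterize(extract_features(preprocess(text), vocabulary), vocabulary)
    scores = {label: clf(vector) for label, clf in classifiers.items()}
    return max(scores, key=scores.get), scores

# Hypothetical vocabulary and classes, echoing the walking example of section 4.1.2.
vocabulary = ["walking", "race", "therapy", "recovery"]
classifiers = {
    "athletics": make_classifier([0, 1]),
    "rehabilitation": make_classifier([2, 3]),
}

best, scores = classify("Walking race results and walking tips", vocabulary, classifiers)
```

Each scoring function returns a value in [0, 1], matching the range of the approximated classification function, and the class with the highest score is selected.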

[Figure 4.8 diagram: New document -> Preprocessing -> Feature extraction -> Characterization -> Classification (classifier $C_i$) -> Result]
