Data Mining Essay: Using Weka for Classification on a Dataset


Name: Nguyn Th Phng

Class: Information Systems 6

Using Weka for Classification on the SpamBase Dataset


Contents
Using Weka for Classification on the SpamBase Dataset
1. Introduction to the SpamBase Dataset
1.1. The concepts of email and spam mail
1.2. Introduction to the SpamBase dataset
2. Classification practice in Weka
2.1. Data preprocessing
2.1.1. Loading the data
2.1.2. Filtering attributes
2.2. Classification with the Naive Bayes algorithm
2.2.1. The Naive Bayes algorithm
2.2.2. Classification in Weka
2.2.3. Remarks

1. Introduction to the SpamBase Dataset


1.1. The concepts of email and spam mail
Electronic mail, or email (sometimes rendered imprecisely in Vietnamese as "điện thư"), is a system for sending and receiving messages over computer networks.
Email is a very fast means of communication. A message can be sent in encrypted or plain form and is carried over computer networks, in particular the Internet. It can be delivered from one source machine to one or many destination machines at the same time.
Spam mail, also called junk mail, is email distributed widely and in large volume without any request from the recipients.
The number of Internet users has exploded, and with it the opportunities for advertising, so spam has grown rapidly. Spam may be harmless in itself, but it can fill a user's mailbox every day, it is a frequent annoyance, and it can even lure gullible users into revealing credit card numbers and other personal information.
Because the volume of spam grows every year, distinguishing spam from legitimate email has become necessary to avoid these problems.
1.2. Introduction to the SpamBase dataset
The SpamBase dataset was compiled from spam emails collected from a postmaster and from individuals who had received spam. It has 58 attributes; the last one, class, takes the value 0 or 1 and indicates whether the email is spam.
The first 48 attributes, named word_freq_WORD, give the percentage of words in the email that match WORD. For example, word_freq_address is the percentage of words in the email that match the word "address".
The next 6 attributes, named char_freq_CHAR, give the percentage of characters in the email that match CHAR.
The next 3 attributes describe runs of capital letters:
capital_run_length_average: the average length of uninterrupted sequences of capital letters.
capital_run_length_longest: the length of the longest uninterrupted sequence of capital letters.
capital_run_length_total: the total number of capital letters in the email.
The final attribute, class, takes the value 1 for spam and 0 for non-spam.
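As a rough illustration of how these three kinds of attributes could be computed from raw text, here is a minimal Python sketch. The tokenization rule (a word is a run of letters or digits, compared case-insensitively) and the function names are assumptions made for illustration; they are not necessarily the procedure used to build SpamBase.

```python
import re


def word_freq(text, word):
    """Percentage of words in `text` that match `word` (case-insensitive)."""
    # Assumption: a "word" is a run of letters or digits.
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    if not tokens:
        return 0.0
    return 100.0 * tokens.count(word.lower()) / len(tokens)


def char_freq(text, char):
    """Percentage of characters in `text` that are equal to `char`."""
    if not text:
        return 0.0
    return 100.0 * text.count(char) / len(text)


def capital_runs(text):
    """Return (average, longest, total) length of uninterrupted capital runs."""
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        return 0.0, 0, 0
    return sum(runs) / len(runs), max(runs), sum(runs)


sample = "FREE money!!! Call NOW for free credit"
print(word_freq(sample, "free"))   # share of words equal to "free"
print(char_freq(sample, "!"))      # share of characters equal to "!"
print(capital_runs(sample))        # (average, longest, total)
```

In the sample string the capital runs are "FREE", "C" and "NOW", so the longest run has length 4 and the total is 8.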

List of the 58 attributes:


1.word_freq_make
2.word_freq_address
3.word_freq_all
4.word_freq_3d
5.word_freq_our
6.word_freq_over
7.word_freq_remove
8.word_freq_internet
9.word_freq_order
10.word_freq_mail
11.word_freq_receive
12.word_freq_will
13.word_freq_people
14.word_freq_report
15.word_freq_addresses
16.word_freq_free
17.word_freq_business
18.word_freq_email
19.word_freq_you
20.word_freq_credit
21.word_freq_your
22.word_freq_font
23.word_freq_000
24.word_freq_money
25.word_freq_hp
26.word_freq_hpl
27.word_freq_george
28.word_freq_650
29.word_freq_lab
30.word_freq_labs
31.word_freq_telnet
32.word_freq_857
33.word_freq_data

34.word_freq_415
35.word_freq_85
36.word_freq_technology
37.word_freq_1999
38.word_freq_parts
39.word_freq_pm
40.word_freq_direct
41.word_freq_cs
42.word_freq_meeting
43.word_freq_original
44.word_freq_project
45.word_freq_re
46.word_freq_edu
47.word_freq_table
48.word_freq_conference
49.char_freq_;
50.char_freq_(
51.char_freq_[
52.char_freq_!
53.char_freq_$
54.char_freq_#
55.capital_run_length_average
56.capital_run_length_longest
57.capital_run_length_total
58.class

2. Classification practice in Weka


After installing Weka, open it and choose Explorer.

2.1. Data preprocessing


2.1.1. Loading the data
Weka's standard data file format is ARFF (Attribute-Relation File Format). However, many DBMSs and spreadsheet programs store data as .csv (comma-separated values) files, and conveniently Weka can also read .csv files.
In this exercise we load Spambase.arff.
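To make the two formats concrete, the following minimal Python sketch turns CSV text into ARFF text. The helper name csv_to_arff and the toy two-column data are hypothetical. The sketch assumes every column is numeric unless it is listed as nominal, and it does not quote attribute names that contain special characters (such as char_freq_;), which a real converter would need to do.

```python
import csv
import io


def csv_to_arff(csv_text, relation, nominal=None):
    """Convert CSV text (with a header row) into ARFF text.

    `nominal` maps a column name to its list of allowed values;
    every other column is declared numeric (an assumption of this sketch).
    """
    nominal = nominal or {}
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]

    lines = ["@relation " + relation, ""]
    for name in header:
        if name in nominal:
            lines.append("@attribute %s {%s}" % (name, ",".join(nominal[name])))
        else:
            lines.append("@attribute %s numeric" % name)
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines)


# Toy data, not taken from SpamBase.
toy_csv = "word_freq_free,class\n0.32,1\n0.00,0\n"
arff_text = csv_to_arff(toy_csv, "toy_spam", nominal={"class": ["0", "1"]})
print(arff_text)
```

The header section declares one @attribute line per column, and the rows after @data keep the same comma-separated layout as the CSV file.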

Once the data is loaded, the left panel lists the attributes in the data file, and the right panel shows summary statistics for the attribute selected on the left.

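2.1.2. Filtering attributes


In this preprocessing step we remove erroneous data, attributes with too many distinct values (such as an ID field), and abnormal values.
Weka provides filters for this purpose. Here we apply the Discretize filter, which converts numeric attributes into nominal ones by grouping their values into bins.
In the Weka interface, click Choose, select filters/unsupervised/attribute/Discretize, then click Apply.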
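The idea behind unsupervised discretization can be sketched in a few lines of Python. This sketch uses equal-width binning with 10 bins as an assumed setting for illustration; the function name is hypothetical and this is not Weka's own code.

```python
def equal_width_bins(values, bins=10):
    """Map each numeric value to a bin index in 0..bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything falls into a single bin.
        return [0] * len(values)
    width = (hi - lo) / bins
    # The maximum value would land in bin `bins`, so clamp it into the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


freqs = [0.0, 0.5, 1.0, 9.9, 10.0]  # made-up attribute values
print(equal_width_bins(freqs))      # bin index for each value
```

After this step each attribute takes one of a small number of bin labels instead of an arbitrary real value.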

To view the filtered result for all attributes, choose Visualize All.

After preprocessing, we move on to classification.


2.2. Classification with the Naive Bayes algorithm
2.2.1. The Naive Bayes algorithm
PROBLEM REQUIREMENTS
The goal is to block spam by classifying each incoming email as spam or non-spam. The classifier should be as accurate as possible, but above all it must avoid labeling a non-spam email as spam, because that mistake can have more serious consequences than a low spam-detection rate. The system should therefore recognize as many spam emails as possible while minimizing the number of non-spam emails mistaken for spam.

APPROACH


The idea is to build a classifier that assigns a class to a new sample after being trained on existing labeled samples. Here each sample is an email, and the set of classes an email can belong to is y = {spam, non-spam}.
When a new email arrives, we use some of its characteristics, or attributes, to improve the chance of classifying it correctly. Such characteristics include the subject, the body, and whether the email has attachments. The more information of this kind is available, the higher the probability of a correct classification, although this also depends on the size of the training set.
The probability is computed with the Naive Bayes formula and then compared with a threshold t that separates spam from non-spam: if the probability is greater than t, the email is spam; otherwise it is non-spam. Classification can go wrong in two ways: a non-spam email is marked as spam, or a spam email is let through. The first error is more serious, so each non-spam email is treated as if it were λ non-spam emails. Misclassifying one non-spam email as spam then counts as λ errors, and classifying it correctly counts as λ successes. The threshold t depends on this weight λ.

THEORETICAL BASIS
Conditional probability
The conditional probability of event A given that event B has occurred is a non-negative number, denoted $P(A \mid B)$, which expresses how likely A is when B has occurred.
$$P(A \mid B) = \frac{P(AB)}{P(B)}$$
It follows that
$$P(A \mid B)\,P(B) = P(B \mid A)\,P(A) = P(AB)$$
Law of total probability
Suppose $B_1, B_2, \dots, B_n$ form a complete group of events (mutually exclusive and exhaustive), and let A be an event that can occur only together with one of $B_1, B_2, \dots, B_n$. Then
$$P(A) = \sum_{i=1}^{n} P(B_i)\,P(A \mid B_i)$$
Bayes' formula
From the formulas above we obtain Bayes' formula:

$$P(B_k \mid A) = \frac{P(AB_k)}{P(A)} = \frac{P(B_k)\,P(A \mid B_k)}{\sum_{i=1}^{n} P(B_i)\,P(A \mid B_i)}$$
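A small numeric example may help here. The Python sketch below evaluates the law of total probability and Bayes' formula for two classes; the probabilities are made-up values chosen only for illustration, and the function names are hypothetical.

```python
def total_probability(priors, likelihoods):
    """P(A) = sum_i P(B_i) * P(A | B_i) over a complete group of events B_i."""
    return sum(priors[b] * likelihoods[b] for b in priors)


def bayes_posterior(priors, likelihoods, k):
    """P(B_k | A) = P(B_k) * P(A | B_k) / P(A)."""
    return priors[k] * likelihoods[k] / total_probability(priors, likelihoods)


# Made-up numbers: B is the email class, A is "the email contains the word 'free'".
priors = {"spam": 0.4, "non-spam": 0.6}
likelihoods = {"spam": 0.5, "non-spam": 0.05}

p_a = total_probability(priors, likelihoods)
p_spam_given_a = bayes_posterior(priors, likelihoods, "spam")
print(p_a, p_spam_given_a)
```

With these numbers P(A) = 0.4 · 0.5 + 0.6 · 0.05 = 0.23, so the posterior probability of spam is 0.2 / 0.23, roughly 0.87, even though the prior was only 0.4.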


The Naive Bayes classification method
Bayesian classification is a classification method that uses probabilities learned from training data. It suits problems in which the class of a test sample must be predicted from information in an initial training set.
Assume each email is represented by a feature vector $x = (x_1, x_2, \dots, x_n)$, where $x_1, x_2, \dots, x_n$ are the values of the attributes $X_1, X_2, \dots, X_n$ in the feature vector space X.
From Bayes' formula and the law of total probability, the probability that an email with feature vector x belongs to class c is
$$P(C=c \mid X=x) = \frac{P(C=c)\,P(X=x \mid C=c)}{\sum_{k} P(C=k)\,P(X=x \mid C=k)}$$
where C is the class of the email under consideration and $c, k \in \{\text{spam}, \text{non-spam}\}$.
The probability $P(C=c)$ is easy to estimate from the training set. In practice, $P(X=x \mid C=c)$ is very hard to estimate directly. Assuming that the attributes $X_1, X_2, \dots, X_n$ are independent of one another given the class, it can be computed as
$$P(X=x \mid C=c) = \prod_{i=1}^{n} P(X_i=x_i \mid C=c)$$
The probability that an email belongs to class c (in particular, that it is spam) then becomes
$$P(C=c \mid X=x) = \frac{P(C=c)\,\prod_{i=1}^{n} P(X_i=x_i \mid C=c)}{\sum_{k} P(C=k)\,\prod_{i=1}^{n} P(X_i=x_i \mid C=k)}$$
This probability is compared with a threshold t: if it is greater than t, the email is classified as spam; otherwise it is classified as non-spam.
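The posterior and the threshold test can be sketched in Python as follows. This is a minimal sketch assuming binary attributes; the priors, the per-attribute probabilities, the threshold value and all names are made up for illustration.

```python
from math import prod  # available in Python 3.8+


def naive_bayes_posterior(x, priors, cond, target="spam"):
    """P(C=target | X=x) under the naive independence assumption.

    cond[c][i] is P(X_i = 1 | C = c) for binary attribute i.
    """
    def joint(c):
        # P(C=c) * prod_i P(X_i = x_i | C=c)
        return priors[c] * prod(
            p if xi == 1 else 1.0 - p for xi, p in zip(x, cond[c])
        )

    return joint(target) / sum(joint(c) for c in priors)


# Made-up model parameters for three binary attributes.
priors = {"spam": 0.4, "non-spam": 0.6}
cond = {
    "spam": [0.8, 0.6, 0.1],
    "non-spam": [0.1, 0.2, 0.5],
}
x = (1, 1, 0)   # attributes 1 and 2 present, attribute 3 absent
t = 0.5         # example threshold

p_spam = naive_bayes_posterior(x, priors, cond)
label = "spam" if p_spam > t else "non-spam"
print(p_spam, label)
```

For real data with many attributes, the product of small probabilities can underflow, so implementations commonly sum logarithms instead; the direct product is kept here to mirror the formula.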
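There are two kinds of error in email classification: mistaking a spam email for non-spam, and mistaking a non-spam email for spam. The second is clearly more serious, because users can tolerate a spam email slipping through the filter but cannot accept an important legitimate email being blocked.
Call these two errors S->N and N->S respectively. To limit the second kind, we assume that an N->S error costs λ times as much as an S->N error, so an email is classified as spam when

$$\frac{P(C=\text{spam} \mid X=x)}{P(C=\text{non-spam} \mid X=x)} > \lambda$$


On the other hand, $P(C=\text{spam} \mid X=x) = 1 - P(C=\text{non-spam} \mid X=x)$, and the decision rule is $P(C=\text{spam} \mid X=x) > t$. Writing $p = P(C=\text{spam} \mid X=x)$, the condition $p/(1-p) > \lambda$ is equivalent to $p > \lambda/(\lambda+1)$.
The threshold t therefore depends on λ:
$$t = \frac{\lambda}{\lambda + 1}$$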
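The relation between the cost ratio and the threshold is easy to check numerically. In the Python sketch below, the cost ratios 1, 9 and 999 are example values chosen for illustration, and the function names are hypothetical.

```python
def threshold_from_cost(lam):
    """Threshold t = lam / (lam + 1) for a cost ratio lam > 0."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    return lam / (lam + 1.0)


def is_spam(p_spam, lam):
    """Equivalent to testing p_spam / (1 - p_spam) > lam."""
    return p_spam > threshold_from_cost(lam)


for lam in (1, 9, 999):
    print(lam, threshold_from_cost(lam))

print(is_spam(0.95, 9), is_spam(0.95, 999))
```

A larger cost ratio pushes the threshold toward 1, so an email with spam probability 0.95 is blocked when the ratio is 9 but let through when the ratio is 999.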

IMPLEMENTATION METHOD


To evaluate an email, we must convert it into a vector $x = (x_1, x_2, \dots, x_n)$, where $x_1, x_2, \dots, x_n$ are the values of the attributes $X_1, X_2, \dots, X_n$ in the feature vector space X. Each attribute corresponds to a single token. In the simplest approach we build a dictionary of tokens; for each dictionary token, the attribute value is 1 if the token appears in the email and 0 otherwise. In practice, however, the training set is usually not such a dictionary. Instead, it consists of two corpora: the spam corpus contains a list of emails previously identified as spam, and the non-spam corpus likewise contains legitimate emails.
If attribute values remain 0 or 1, it becomes very hard to judge whether an email is spam. This is especially true for long emails: with 0/1 values, a token that appears 100 times counts the same as a token that appears only once.
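The loss of information in the 0/1 representation can be seen in a short Python sketch. The dictionary and the sample email are hypothetical, and whitespace tokenization is assumed for simplicity.

```python
from collections import Counter


def binary_vector(tokens, dictionary):
    """1 if the dictionary token occurs in the email, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in dictionary]


def count_vector(tokens, dictionary):
    """Number of occurrences of each dictionary token in the email."""
    counts = Counter(tokens)
    return [counts[word] for word in dictionary]


dictionary = ["free", "money", "meeting"]  # hypothetical token dictionary
email = "free free free money".split()

print(binary_vector(email, dictionary))  # presence only
print(count_vector(email, dictionary))   # repetition is preserved
```

In the binary vector, "free" and "money" look identical even though "free" occurs three times as often.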
To address this, the attribute value is replaced by the spam probability of the token, that is, the probability that an email containing only that token is spam. There are several ways to compute it. One is to use the number of times the token occurs in each training corpus. For example, if a token w occurs s times in the spam corpus and n times in the non-spam corpus, and the two corpora contain $N_s$ and $N_n$ emails respectively, the spam probability of w is
$$P(X=w \mid C=\text{spam}) = \frac{s/N_s}{s/N_s + n/N_n}$$
The drawback of this method is that a token occurring 100 times across 100 different emails gets the same spam probability as a token occurring 100 times in a single email.

Instead of counting occurrences of the token in each corpus, we can count the number of emails that contain it. If $n_s$ spam emails and $n_n$ non-spam emails contain w, the spam probability of w is
$$P(X=w \mid C=\text{spam}) = \frac{n_s/N_s}{n_s/N_s + n_n/N_n}$$
The drawback of this method is that a token appearing once in an email gets the same spam probability as a token appearing 100 times in an email.
We therefore use a third method that combines the two:
$$P(X=w \mid C=\text{spam}) = \frac{(n_s \cdot s)/N_s}{(n_s \cdot s)/N_s + (n_n \cdot n)/N_n}$$
For tokens that appear in only one of the two corpora, we cannot conclude that a token seen only in the spam corpus will never appear in non-spam email, or vice versa. A reasonable approach is to assign such tokens a fixed value: a token seen only in the spam corpus gets a spam probability N close to 1 (for example 0.9999), and a token seen only in the non-spam corpus gets a spam probability M close to 0 (for example 0.0001).
The spam probability of a token, based on both its number of occurrences and the number of emails containing it, is therefore
$$P = \max\left(M,\ \min\left(N,\ \frac{(n_s \cdot s)/N_s}{(n_s \cdot s)/N_s + (n_n \cdot n)/N_n}\right)\right)$$
where
$n_s$: number of emails in the spam corpus that contain the token
$n_n$: number of emails in the non-spam corpus that contain the token
$s$: number of occurrences of the token in the spam corpus
$n$: number of occurrences of the token in the non-spam corpus
$N_s$: total number of emails in the spam corpus
$N_n$: total number of emails in the non-spam corpus
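The final formula translates almost directly into Python. In the sketch below, the function name and the counts in the examples are made up for illustration. Returning 0.5 for a token that occurs in neither corpus is an assumption of this sketch, since the text does not cover that case.

```python
def token_spam_probability(s, n, ns, nn, Ns, Nn, M=0.0001, N=0.9999):
    """Clamped spam probability of a token.

    s, n   : occurrences of the token in the spam / non-spam corpus
    ns, nn : emails containing the token in the spam / non-spam corpus
    Ns, Nn : total emails in the spam / non-spam corpus
    M, N   : lower and upper bounds for the result
    """
    spam_part = (ns * s) / Ns
    ham_part = (nn * n) / Nn
    if spam_part + ham_part == 0:
        # Token unseen in both corpora: the formula is undefined, so this
        # sketch returns a neutral 0.5 (an assumption, not from the text).
        return 0.5
    raw = spam_part / (spam_part + ham_part)
    return max(M, min(N, raw))


# Made-up corpora with 100 spam emails and 200 non-spam emails.
print(token_spam_probability(s=30, n=5, ns=20, nn=4, Ns=100, Nn=200))  # mixed token
print(token_spam_probability(s=3, n=0, ns=3, nn=0, Ns=100, Nn=200))    # spam-only token
print(token_spam_probability(s=0, n=5, ns=0, nn=5, Ns=100, Nn=200))    # non-spam-only token
```

The clamping means a token seen in only one corpus never receives a probability of exactly 0 or 1, so a single token cannot decide the classification on its own.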
2.2.2. Classification in Weka
In the Weka interface, open the Classify tab, click Choose, and select bayes/NaiveBayes.
Leave Cross-validation at its default of 10 folds.
Click More options to configure the output, and tick Output predictions to display the individual predictions as well.

Click Start.
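For readers unfamiliar with 10-fold cross-validation, the following Python sketch shows how the sample indices can be split so that every sample is used for testing exactly once. It is a plain shuffled split with a hypothetical function name and an arbitrary seed; it does not balance the classes within each fold and is not Weka's implementation.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold j takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[j::k] for j in range(k)]
    for j in range(k):
        test = folds[j]
        train = [i for m, fold in enumerate(folds) if m != j for i in fold]
        yield train, test


# 4601 is the total number of samples reported in the results (3890 + 711).
splits = list(k_fold_indices(4601, k=10))
print(len(splits), [len(test) for _, test in splits])
```

The classifier is trained on nine folds and evaluated on the remaining one, and the reported accuracy aggregates the predictions over all ten test folds.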

Results:

The summary statistics show that 3890 samples (84.5468%) are classified correctly and 711 samples (15.4532%) are classified incorrectly.

The classifier output also includes the confusion matrix, which shows that:

Of the 1813 samples in class 1, 1433 are classified correctly and 380 are misclassified as class 0.

Of the 2788 samples in class 0, 2457 are classified correctly and 331 are misclassified as class 1.
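The summary percentages can be recomputed from the four confusion-matrix counts. The Python sketch below uses the counts 1433, 380, 2457 and 331 from the text and assumes that class 1 denotes spam and class 0 non-spam. The false positive rate and spam recall are extra metrics added for illustration; they are not part of the reported output.

```python
# Confusion-matrix counts from the text (assumption: class 1 = spam, class 0 = non-spam).
tp = 1433  # class 1 classified as class 1
fn = 380   # class 1 misclassified as class 0
tn = 2457  # class 0 classified as class 0
fp = 331   # class 0 misclassified as class 1

total = tp + fn + tn + fp
accuracy = (tp + tn) / total
error_rate = (fp + fn) / total
false_positive_rate = fp / (fp + tn)   # legitimate emails wrongly flagged as spam
spam_recall = tp / (tp + fn)           # spam emails that were caught

print(total, round(100 * accuracy, 4), round(100 * error_rate, 4))
print(round(100 * false_positive_rate, 2), round(100 * spam_recall, 2))
```

The counts sum to 4601 samples, and (1433 + 2457) / 4601 reproduces the reported 84.5468% accuracy. The false positive rate is the figure most relevant to the requirement of not blocking legitimate email.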

The output also lists the detailed prediction for each sample produced during the run.

For example, samples #7, #13, #14, ... are misclassified: they actually belong to class 0 but are predicted as class 1.

2.2.3. Remarks
Compared with another algorithm such as C4.5 (decision tree classification, available in Weka as J48), Naive Bayes classification on the SpamBase dataset is much faster and also slightly more accurate. Specifically:

With Naive Bayes, 3890 samples (84.5468%) are classified correctly and 711 samples (15.4532%) incorrectly. With J48, 3868 samples (84.0687%) are classified correctly and 15.9313% incorrectly. Naive Bayes also takes less time to classify than J48.
