K46 Nguyen Thi Thuy Linh Thesis


VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF TECHNOLOGY

Nguyen Thi Thuy Linh

A STUDY OF DECISION-TREE-BASED DATA CLASSIFICATION ALGORITHMS

UNDERGRADUATE GRADUATION THESIS, FULL-TIME PROGRAM

Major: Information Technology

HANOI - 2005

Supervisor: Dr. Nguyen Hai Chau

ABSTRACT
Data classification is one of the main research directions in data mining. The technology has had, has, and will continue to have many applications in commerce, banking, healthcare, education, and other fields. Among the classification models proposed so far, the decision tree is regarded as a powerful, popular tool that is particularly well suited to data mining applications. The classification algorithm is the central element of a classification model.

This thesis studies decision-tree-based data classification. It then focuses on analyzing, evaluating, and comparing two algorithms that represent two different application scopes, C4.5 and SPRINT. Each has its own strategy for selecting the attribute on which to grow the tree, its own way of storing and partitioning data, and several other distinguishing features. C4.5 is the most widely used algorithm for classifying small and medium-sized data sets, while SPRINT is a representative algorithm for extremely large data sets.

The thesis ran the C4.5 classification model on a real data set, obtained classification results of high practical value, and evaluated the performance of the C4.5 model. Based on the theoretical study and the experiments, it proposes several improvements to the C4.5 classification model and takes steps toward implementing SPRINT.


ACKNOWLEDGEMENTS
Throughout my studies and the completion of this thesis, I was fortunate to receive guidance from my teachers and the care and encouragement of my family and friends.

I sincerely thank the teachers of the University of Technology, who gave me invaluable knowledge as well as methods for studying and doing scientific research.

My deepest thanks go to Dr. Nguyen Hai Chau, who guided and advised me with great dedication throughout the work on this thesis.

I am also deeply grateful to Dr. Ha Quang Thuy, who provided favorable conditions and research directions.

I thank Doan Son, PhD candidate at JAIST, for providing materials and valuable advice.

I also thank the teachers of the Department of Information Systems, Faculty of Information Technology, for helping me obtain a favorable experimental environment.

To the members of the seminar group "Data Mining and Parallel Computing", my sincere thanks for their contributions and for the valuable knowledge I gained while taking part in scientific research.

Finally, I thank my family, my friends, and class K46CA, who were always at my side with encouragement.

Hanoi, June 2005
Student

Nguyen Thi Thuy Linh


TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF CHARTS AND FIGURES
GLOSSARY
INTRODUCTION
Chapter 1. OVERVIEW OF DECISION-TREE-BASED DATA CLASSIFICATION
  1.1. Overview of data classification in data mining
    1.1.1. Data classification
    1.1.2. Issues related to data classification
    1.1.3. Methods for evaluating the accuracy of a classification model
  1.2. Decision trees applied to data classification
    1.2.1. Definition
    1.2.2. Issues in data mining with decision trees
    1.2.3. Assessment of decision trees in data mining
    1.2.4. Building a decision tree
  1.3. Decision tree construction algorithms
    1.3.1. General idea
    1.3.2. Current state of research on the algorithms
    1.3.3. Parallelizing sequential decision-tree-based classification algorithms
Chapter 2. C4.5 AND SPRINT
  2.1. General introduction
  2.2. The C4.5 algorithm
    2.2.1. C4.5 uses Gain-entropy as the measure for selecting the "best" attribute
    2.2.2. C4.5 has its own mechanism for handling missing values
    2.2.3. Avoiding overfitting
    2.2.4. Converting a decision tree into rules
    2.2.5. C4.5 is an efficient algorithm for small and medium-sized data sets
  2.3. The SPRINT algorithm
    2.3.1. Data structures in SPRINT
    2.3.2. SPRINT uses the Gini index as the measure for finding the "best" split point
    2.3.3. Performing the split
    2.3.4. SPRINT is an efficient algorithm for data sets that are too large for other algorithms
  2.4. Comparison of C4.5 and SPRINT
Chapter 3. EXPERIMENTAL RESULTS
  3.1. Experimental environment
  3.2. Structure of the C4.5 release 8 classification model
    3.2.1. The four main programs of the C4.5 classification model
    3.2.2. Data structures used in C4.5
  3.3. Experimental results
    3.3.1. Some representative classification results
    3.3.2. Performance charts
  3.4. Proposed improvements to the C4.5 classification model
CONCLUSION
REFERENCES

LIST OF CHARTS AND FIGURES

Figures
Figure 1 - The data classification process: (a) building the classification model
Figure 2 - The data classification process: (b1) estimating the accuracy of the model
Figure 3 - The data classification process: (b2) classifying new data
Figure 4 - Estimating the accuracy of a classification model with the holdout method
Figure 5 - Example of a decision tree
Figure 6 - Pseudocode of the decision-tree-based data classification algorithm
Figure 7 - Diagram of decision tree construction with the synchronous method
Figure 8 - Diagram of decision tree construction with the partitioned method
Figure 9 - Diagram of decision tree construction with the hybrid method
Figure 10 - Pseudocode of the C4.5 algorithm
Figure 11 - Pseudocode of the SPRINT algorithm
Figure 12 - Data structures in SLIQ
Figure 13 - Structure of attribute lists in SPRINT: the list of a continuous attribute is sorted as soon as it is created
Figure 14 - Evaluating split points for a continuous attribute
Figure 15 - Evaluating split points for a categorical attribute
Figure 16 - Splitting the attribute lists of a node
Figure 17 - Structure of the hash table used to split data in SPRINT (following the example of the preceding figures)
Figure 18 - File defining the data structure used in the experiments
Figure 19 - File containing the data to be classified
Figure 20 - Form of the decision tree generated from the experimental data set
Figure 21 - Evaluation of the newly generated decision tree on the training and test data sets
Figure 22 - Some rules extracted from the 19-attribute data set, classified by the user's interface setting type (WEB_SETTING_ID)
Figure 23 - Some rules extracted from the 8-attribute data set, classified by phone manufacturer ID (PRODUCTER_ID)
Figure 24 - Some rules generated from the 8-attribute data set, classified by the phone service used by the customer (MOBILE_SERVICE_ID)
Figure 25 - Evaluation of the rule set on the training data set

Tables
Table 1 - Training data set with class attribute buys_computer
Table 2 - Time to build the decision tree and the production rule set as a function of training set size, 2 attributes
Table 3 - Time to build the decision tree and the production rule set as a function of training set size, 7 attributes
Table 4 - Time to build the decision tree and the production rule set as a function of training set size, 18 attributes
Table 5 - Decision tree generation time as a function of the number of attributes
Table 6 - Decision tree construction time with categorical and with continuous attributes
Table 7 - Decision tree generation time as a function of the number of class values

Charts
Chart 1 - Execution time of the SPRINT and SLIQ classification models by training set size
Chart 2 - Time to build the decision tree and the production rule set as a function of training set size, 2 attributes
Chart 3 - Time to build the decision tree and the production rule set as a function of training set size, 7 attributes
Chart 4 - Time to build the decision tree and the production rule set as a function of training set size, 18 attributes
Chart 5 - Dependence of decision tree generation time on the number of attributes
Chart 6 - Decision tree construction time from continuous attributes compared with categorical attributes
Chart 7 - Decision tree generation time as a function of the number of class values

GLOSSARY

No.  English                 Vietnamese term used in the thesis
1    training data           dữ liệu đào tạo
2    test data               dữ liệu kiểm tra
3    Pruning decision tree   Cắt, tỉa cây quyết định
4    Over fitting data       Quá vừa dữ liệu
5    Noise                   Dữ liệu lỗi
6    Missing value           Giá trị thiếu
7    Data tuple              Phần tử dữ liệu
8    Case                    Case (understood as a data tuple: one set of values for the attributes in the data set)

A study of decision-tree-based data classification algorithms

INTRODUCTION
In the course of their activities, people generate large amounts of business data. The accumulated data sets keep growing and may contain much hidden information in the form of undiscovered regularities. This creates a need to extract from such data sets rules for classifying data or for predicting future data trends. The intelligent business rules obtained in this way effectively support both practical activities and scientific research. Data classification and prediction technology emerged to meet this need.

Data classification technology has developed, is developing, and will continue to develop strongly, driven by the human demand for knowledge. In recent years it has attracted researchers from many fields, such as machine learning, expert systems, and statistics. It is also applied in many practical domains: commerce, banking, marketing, market research, insurance, healthcare, education, and so on. Many classification techniques have been proposed, including decision tree classification, Bayesian classifiers, K-nearest neighbor classifiers, neural networks, and statistical analysis. Among these, the decision tree is regarded as a powerful, popular tool that is particularly well suited to data mining [5][7]. In a classification model, the classification algorithm is the key element. It is therefore necessary to build algorithms that are accurate and fast, and that scale to ever larger data sets.

This thesis surveys data classification technology in general and decision-tree-based data classification in particular. It then focuses on two algorithms that represent two different application scopes, C4.5 and SPRINT. Analyzing and evaluating these algorithms has both scientific value and practical significance. Studying them helps us absorb, and possibly develop further, the ideas and techniques of an advanced technology that has been and remains a challenge for data mining researchers. On that basis, data classification models can be implemented and tested in practice, with a view to applying them in Vietnam, starting with customer market analysis and research.

Undergraduate thesis, Nguyen Thi Thuy Linh, K46CA


The thesis also ran the C4.5 classification model on a real data set from the Vietnam Posts and Telecommunications Corporation. This provided experience with the techniques needed to deploy and apply a data classification model in practice. The experiments produced promising classification results with high reliability and considerable application potential. The performance of the classification model was evaluated as well. On that basis, the thesis proposes improvements that increase the performance of the C4.5 classification model and add conveniences for the user.

The thesis has three main chapters.

Chapter 1 moves from an overview of data classification technology to decision-tree-based data classification. It assesses the decision tree as a tool and surveys research on decision-tree-based classification algorithms: the underlying ideas, the state of research, and current directions of development.

Chapter 2 focuses on two algorithms that represent two different application scopes, C4.5 and SPRINT. Each has its own strategy for choosing the criterion by which data is split and its own way of storing and partitioning data. Because of these characteristics, C4.5 is the most widely used algorithm for small and medium-sized data sets, whereas SPRINT is the algorithm of choice for extremely large data sets.

Chapter 3 presents the experiments with the C4.5 classification model on the real data set from the Vietnam Posts and Telecommunications Corporation and reports the results. From these, the thesis proposes improvements to the C4.5 classification model.


Chapter 1. OVERVIEW OF DECISION-TREE-BASED DATA CLASSIFICATION

1.1. Overview of data classification in data mining
1.1.1. Data classification
Data classification is today one of the main research directions in data mining. In practice, people need to extract intelligent business decisions from databases that contain much hidden information. Classification and prediction are two forms of data analysis that extract a model describing important data classes or predicting future data trends. Classification predicts categorical labels or discrete values, that is, it works with data objects whose set of possible values is known in advance. Prediction, in contrast, builds models from continuous-valued functions.

For example, a weather classification model can state whether tomorrow will be rainy or sunny based on the humidity, wind strength, temperature, and other measurements of today and the preceding days. Using rules about customers' purchasing trends in a supermarket, sales staff can make sound decisions about the quantity and range of goods to display. A prediction model can predict how much potential customers will spend, based on information about their income and occupation.

In recent years, data classification has attracted researchers from many fields, such as machine learning, expert systems, and statistics. The technology is also applied in many domains: commerce, banking, marketing, market research, insurance, healthcare, education, and so on. Most of the earlier algorithms keep the data resident in memory and usually handle small amounts of data. Some later algorithms use disk-resident techniques, which significantly improves their scalability to large data sets of up to billions of records.

The data classification process consists of two steps [14].

Step one (learning)
The learning step builds a model that describes a predetermined set of data classes or concepts. Its input is a structured data set described by attributes and made up of sets of values of those attributes. Each set of values is called a data tuple; it may also be called a sample, an example, an object, a record, or a case. This thesis uses these terms interchangeably. Each data tuple in the data set is assumed to belong to a predefined class, where the class is the value of one attribute chosen as the class label attribute. The output of this step is usually a set of classification rules in the form of if-then rules, a decision tree, logical formulas, or a neural network. The process is illustrated in Figure 1.
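As a minimal sketch of the input and output of the learning step, the eight training tuples and the if-then rule shown in Figure 1 can be written in Python. The names `TRAINING_DATA` and `classify_risk` are illustrative assumptions, not part of the thesis, and the rule's implicit "otherwise Low" branch is also an assumption. `Risk` plays the role of the class label attribute.

```python
# Training tuples from Figure 1: (Age, Car Type, Risk).
# Risk is the class label attribute; Age and Car Type are predictor attributes.
TRAINING_DATA = [
    (20, "Combi", "High"),
    (18, "Sports", "High"),
    (40, "Sports", "High"),
    (50, "Family", "Low"),
    (35, "Minivan", "Low"),
    (30, "Combi", "High"),
    (32, "Family", "Low"),
    (40, "Combi", "Low"),
]


def classify_risk(age, car_type):
    """If-then rule from Figure 1; the 'otherwise Low' branch is an assumption."""
    if age < 31 or car_type == "Sports":
        return "High"
    return "Low"


# Count how many training tuples the rule reproduces.
matches = sum(
    1
    for age, car_type, risk in TRAINING_DATA
    if classify_risk(age, car_type) == risk
)
print(f"{matches} of {len(TRAINING_DATA)} training tuples match the rule")
```

On these eight tuples the rule reproduces every class label, which is what one expects from a model derived from exactly this training data.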

(a) Training data -> Classification algorithm -> Classifier (model)

Training data:

Age   Car Type   Risk
20    Combi      High
18    Sports     High
40    Sports     High
50    Family     Low
35    Minivan    Low
30    Combi      High
32    Family     Low
40    Combi      Low

Classifier (model):

if age < 31 or Car Type = Sports then Risk = High

Figure 1 - The data classification process: (a) building the classification model

Step two (classification)
The second step uses the model built in the first step to classify new data. First, the predictive accuracy of the newly created model is estimated. Holdout is a simple technique for this estimate. It uses a test data set of samples that already carry class labels. These samples are chosen randomly and independently of the samples in the training data set. The accuracy of the model on a given test data set is the percentage of test samples that the model classifies correctly, that is, in agreement with their actual labels.

If the accuracy were estimated on the training data set, the result would be overly optimistic, because a model always tends to overfit its data. Overfitting is the phenomenon in which the classification results match the training data almost perfectly because the model, while being built from the training data, has incorporated peculiarities of that particular data set. An independent test data set is therefore required. If the accuracy of the model is acceptable, the model is used to classify future data, or data for which the value of the class attribute is unknown.

(b1) Test data -> Classifier (model)

Age   Car Type   Risk (actual)   Risk (predicted)
27    Sports     High            High
34    Family     Low             Low
66    Family     High            Low
44    Sports     High            High

Figure 2 - The data classification process: (b1) estimating the accuracy of the model

(b2) New data

Classifier (model)

Age   Car Type   Risk   Risk (predicted)
27    Sports     ?      High
34    Minivan    ?      Low
55    Family     ?      Low
34    Sports     ?      High

Figure 3 - The data classification process: (b2) classifying new data

In a classification model, the classification algorithm plays the central role and determines the success of the model. The key to the data classification problem is therefore to find a classification algorithm that is fast, efficient, accurate, and scalable. Of these properties, scalability has received particular attention and development effort [14]. The classification techniques used in recent years include:

- Decision tree classification

- Bayesian classifier
- K-nearest neighbor classifier
- Neural networks
- Statistical analysis
- Genetic algorithms
- Rough set approach

1.1.2. Issues related to data classification
1.1.2.1. Preparing data for classification
Preprocessing the data is an indispensable part of classification and largely determines whether a classification model can be applied at all. Preprocessing improves the accuracy, efficiency, and scalability of the classification model. It consists of the following tasks.

Data cleaning
Data cleaning deals with noise and missing values in the initial data set. Noise consists of random errors or invalid values of variables in the data set; it can be handled with smoothing techniques. Missing values are attribute cells that have no value. A value may be missing because of a human error during data entry, or because in that particular case the attribute has no value or is not important. A missing value can be replaced by the most common value of the attribute, or by the most probable value according to statistics. Although most classification algorithms have mechanisms for handling missing values and noise, this preprocessing step can reduce confusion during learning (the construction of the classification model).

Relevance analysis
Many attributes in a data set may be entirely unnecessary for, or unrelated to, a particular classification problem. For example, the day of the week is irrelevant to an application that analyzes the risk of bank loans, so that attribute is redundant. Relevance analysis removes unnecessary and redundant attributes from the learning process, because such attributes slow learning down, complicate it, and mislead it, which can result in an unusable classification model.

Data transformation
Generalizing the data to a higher conceptual level is sometimes necessary during preprocessing. This is especially useful for continuous (numeric) attributes. For example, the numeric values of a customer's income can be generalized into the discrete ranges low, medium, and high. Similarly, a categorical attribute such as street address can be generalized to city. Generalization condenses the original training data, so the learning process needs fewer input/output operations.

1.1.2.2. Comparing classification models
Each application requires a suitable classification model. The choice rests on comparing classification models against the following criteria:

- Predictive accuracy: the ability of the model to predict correctly the class label of new or previously unseen data.
- Speed: the computational cost of generating and using the model.
- Robustness: the ability of the model to make correct predictions from noisy data or data with missing values.
- Scalability: the ability of the learned model to run efficiently on large amounts of data.
- Interpretability: the degree to which the results produced by the learned model can be understood.


- Simplicity: relates to the size of the decision tree or the compactness of the rules.

Among these criteria, the scalability of classification models receives particular emphasis and development effort, especially for decision trees [14].

1.1.3. Methods for evaluating the accuracy of a classification model
Estimating the accuracy of a classifier is important because it indicates how accurately the classifier will label future data. Accuracy also makes it possible to compare different classification models. This thesis covers two common evaluation methods, holdout and k-fold cross-validation. Both rely on random partitions of the initial data set.

In the holdout method, the given data is randomly partitioned into two parts, a training data set and a test data set. Typically two thirds of the data goes to the training set and the remainder to the test set [14].

[Figure 4 diagram: the data is split into a training set, from which the classifier is derived, and a test set, on which its accuracy is estimated.]

Figure 4 - Estimating the accuracy of a classification model with the holdout method

In k-fold cross-validation, the initial data set is randomly divided into $k$ subsets (folds) of approximately equal size, $S_1, S_2, \ldots, S_k$. Learning and testing are carried out $k$ times. In iteration $i$, $S_i$ is the test data set and the remaining subsets together form the training data set. That is, training is first performed on $S_2, S_3, \ldots, S_k$ and testing on $S_1$; then training is performed on $S_1, S_3, S_4, \ldots, S_k$ and testing on $S_2$; and so on. The accuracy is the total number of correct classifications over the $k$ iterations divided by the total number of samples in the initial data set.
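The two evaluation methods can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure used in the thesis experiments: the function names, the fixed random seed, and the representation of a sample as a `(features, label)` pair are all assumptions. The rule from Figure 1 and the four test tuples from Figure 2 are used as sample input.

```python
import random


def accuracy(classify, rows):
    """Fraction of (features, label) rows that classify() labels correctly."""
    return sum(1 for x, y in rows if classify(x) == y) / len(rows)


def holdout_split(rows, train_fraction=2 / 3, seed=0):
    """Randomly split rows into a training part and a test part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def k_fold_accuracy(rows, k, train, seed=0):
    """Correct predictions summed over k folds, divided by the number of rows."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k folds of near-equal size
    correct = 0
    for i, test_fold in enumerate(folds):
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        classify = train(training)  # learn on the other k-1 folds
        correct += sum(1 for x, y in test_fold if classify(x) == y)
    return correct / len(rows)


def figure1_rule(x):
    """Classifier from Figure 1; x is an (age, car_type) pair."""
    age, car_type = x
    return "High" if age < 31 or car_type == "Sports" else "Low"


# Test tuples of Figure 2 with their actual Risk labels.
FIGURE2_TEST = [
    ((27, "Sports"), "High"),
    ((34, "Family"), "Low"),
    ((66, "Family"), "High"),
    ((44, "Sports"), "High"),
]

print(accuracy(figure1_rule, FIGURE2_TEST))  # 3 of 4 correct -> 0.75
```

`k_fold_accuracy` takes the learning procedure as the `train` argument, a function from training rows to a classifier, so any learner can be plugged in. Passing `lambda training: figure1_rule` reuses the fixed rule and simply re-scores it fold by fold.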


1.2. Decision trees applied to data classification

1.2.1. Definition
In recent years, scientists from many fields have proposed a variety of data classification models, such as neural networks, linear and quadratic statistical models, decision trees, and genetic models. Among them, the decision tree is valued as a powerful, popular tool that is particularly well suited to data mining in general and to data classification in particular [7]. Its advantages include relatively fast construction, simplicity, and ease of understanding. Moreover, trees can easily be converted into SQL statements that can be used to access databases efficiently. Finally, decision-tree-based classification achieves accuracy similar to, and sometimes better than, that of other classification methods [10].

A decision tree is a flowchart-like tree structure, as shown in the following figure:
Age
|-- Age <= 27.5: Risk = High
`-- Age > 27.5: Car type
    |-- Car type in {sport}: Risk = High
    `-- Car type in {family, truck}: Risk = Low

Figure 5 - Example of a decision tree

In a decision tree:
- Root: the topmost node of the tree.
- Internal node: represents a test on a single attribute (rectangle).
- Branch: represents an outcome of the test at an internal node (arrow).
- Leaf node: represents a class or a class distribution (circle).
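To make the roles of root, internal node, branch, and leaf concrete, the tree of Figure 5 can be encoded as a small data structure and walked from the root to a leaf. The class names `Node` and `Leaf`, the branch labels, and the dictionary layout of a case are assumptions made for this sketch only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class Leaf:
    label: str  # class assigned to every case that reaches this leaf


@dataclass
class Node:
    test: Callable[[dict], str]  # maps a case to the name of one branch
    branches: Dict[str, Union["Node", Leaf]]


def age_test(case):
    return "Age <= 27.5" if case["age"] <= 27.5 else "Age > 27.5"


def car_type_test(case):
    if case["car_type"] in {"sport"}:
        return "Car type in {sport}"
    if case["car_type"] in {"family", "truck"}:
        return "Car type in {family, truck}"
    raise ValueError(f"car type not covered by Figure 5: {case['car_type']!r}")


# The tree of Figure 5: the root tests Age, its right child tests Car type.
FIGURE5 = Node(
    test=age_test,
    branches={
        "Age <= 27.5": Leaf("High"),
        "Age > 27.5": Node(
            test=car_type_test,
            branches={
                "Car type in {sport}": Leaf("High"),
                "Car type in {family, truck}": Leaf("Low"),
            },
        ),
    },
)


def classify(tree, case):
    """Follow the single root-to-leaf path selected by the case's values."""
    path = []
    while isinstance(tree, Node):
        outcome = tree.test(case)
        path.append(outcome)
        tree = tree.branches[outcome]
    return tree.label, path


label, path = classify(FIGURE5, {"age": 40, "car_type": "truck"})
print("if " + " and ".join(path) + " then Risk = " + label)
```

Because `classify` also returns the path it followed, each prediction comes with the if-then rule that produced it. This is the property that makes decision trees easy to explain.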


To classify an unknown data sample, the attribute values of the sample are tested against the decision tree. Each sample corresponds to one path from the root to a leaf, and the leaf gives the predicted class of that sample.

1.2.2. Issues in data mining with decision trees
The issues specific to learning or classifying data with decision trees include: determining how deep to grow the tree, handling continuous attributes, choosing a suitable attribute selection measure, using training data with missing attribute values, using attributes with different costs, and improving computational performance. The following sections discuss the main issues that decision-tree-based classification algorithms have addressed.

1.2.2.1. Avoiding overfitting
What is overfitting? It is the phenomenon in which the decision tree captures characteristics peculiar to the training data set. If the training data itself is used to test the model, the accuracy is very high, whereas on other, future data the same tree does not reach that accuracy. Overfitting is a significant difficulty for decision tree learning and for other learning methods, especially when the training data set contains too few examples or contains noise.

There are two approaches to avoiding overfitting in decision trees:

- Stop growing the tree earlier than usual, before it reaches the point at which it classifies the training data perfectly. The challenge with this approach is to estimate correctly when to stop.
- Allow the tree to overfit the data, then prune it.

Although the first approach seems more direct, trees produced by the second approach have proved more successful in practice. Pruning also helps the tree generalize and improves the accuracy of the classification model. Whichever approach is taken, the crucial question is which criterion to use to determine the right size of the final tree.


1.2.2.2. Handling continuous attributes
Handling continuous attributes in a decision tree is considerably less simple than handling categorical attributes. A categorical attribute has a domain that is known in advance and consists of discrete values. For example, vehicle type is a categorical attribute with the domain {truck, bus, car, taxi}. The data is partitioned by testing whether the value of the selected categorical attribute in a given example belongs to a set of values of that attribute: $\mathrm{value}(A) \in X$ with $X \subset \mathrm{domain}(A)$. This is a simple logical test that consumes few computing resources.

For a continuous (numeric) attribute, in contrast, the domain is not known in advance. During tree growth a binary test of the form $\mathrm{value}(A) \le \theta$ is therefore used, where $\theta$ is a threshold constant determined in turn from each distinct value, or from each pair of adjacent values in sorted order, of the continuous attribute under consideration in the training data. If the continuous attribute $A$ has $d$ distinct values in the training data, $d-1$ tests $\mathrm{value}(A) \le \theta_i$, $i = 1, \ldots, d-1$, are needed to find the best threshold $\theta_{best}$ for that attribute. How $\theta$ is determined and which criterion selects the best $\theta$ depend on the strategy of each algorithm [13][1]. In C4.5, $\theta_i$ is the average of two adjacent values in the sorted sequence of values.

Other issues, such as rule set generation and the handling of missing values, are presented in detail in the section on C4.5.

1.2.3. Assessment of decision trees in data mining
1.2.3.1. Strengths of decision trees
Decision trees have five main strengths [5]:

- Ability to generate comprehensible rules. Decision trees can generate rules that translate into English or into SQL statements. This is the outstanding advantage of the technique. Even when a large data set makes the tree large and complex, following any single path through the tree remains easy. The explanation for any particular classification or prediction is therefore relatively transparent.


- Ability to perform in rule-oriented domains. This may sound obvious, but rule induction in general and decision trees in particular are an excellent choice for domains in which there really are rules. Many domains, from genetics to industrial processes, contain underlying rules that are hidden because they are quite complex and obscured by noisy data. Decision trees are a natural choice when we suspect that such underlying rules exist.
- Ease of computation at classification time. Although a decision tree can take many forms, in practice the algorithms that build decision trees usually produce trees with a low branching factor and simple tests at each node. Typical tests are numeric comparisons, set membership, and simple conjunctions. On a computer these tests translate into Boolean and integer operations, which are fast and inexpensive. This is an important advantage because, in a commercial environment, predictive models are often used to classify millions or even billions of records.
- Ability to handle both continuous and categorical attributes. Decision trees handle continuous and categorical attributes equally well, although continuous attributes require more computing resources. Categorical attributes, which have caused problems for neural networks and statistical techniques, are easy to handle with the splitting criteria of a decision tree: each branch corresponds to the part of the data set that has one value of the attribute chosen for expansion at that node. Continuous attributes are also easy to split, by choosing a number, called the threshold, from the sorted values of the attribute. Once the best threshold has been chosen, the data set is split by the binary test on that threshold.
- Clear indication of the best attributes. Decision tree construction algorithms select the attribute that best splits the training data, starting at the root node of the tree. This shows which attributes are most important for prediction or classification.
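The threshold test on a continuous attribute, $\mathrm{value}(A) \le \theta$ with candidate thresholds taken as the midpoints of adjacent distinct values as in C4.5, can be sketched as follows. The scoring function `misclassified` is only a stand-in chosen for brevity and is an assumption of this sketch. It is not the Gain-entropy measure of C4.5 or the Gini index of SPRINT, which Chapter 2 of the thesis covers. The sample input is the Age and Risk columns of Figure 1.

```python
from collections import Counter


def candidate_thresholds(values):
    """Midpoints of adjacent distinct values: d distinct values give d-1 thresholds."""
    distinct = sorted(set(values))
    return [(low + high) / 2 for low, high in zip(distinct, distinct[1:])]


def misclassified(labels):
    """Number of labels outside the majority class (stand-in split criterion)."""
    if not labels:
        return 0
    return len(labels) - max(Counter(labels).values())


def best_threshold(pairs):
    """pairs holds (value, label); return the threshold with the fewest errors."""

    def errors(theta):
        left = [label for value, label in pairs if value <= theta]
        right = [label for value, label in pairs if value > theta]
        return misclassified(left) + misclassified(right)

    return min(candidate_thresholds([value for value, _ in pairs]), key=errors)


# Age and Risk columns of the training data in Figure 1.
AGE_RISK = [
    (20, "High"), (18, "High"), (40, "High"), (50, "Low"),
    (35, "Low"), (30, "High"), (32, "Low"), (40, "Low"),
]

print(candidate_thresholds([age for age, _ in AGE_RISK]))
# [19.0, 25.0, 31.0, 33.5, 37.5, 45.0]
print(best_threshold(AGE_RISK))  # 31.0
```

With seven distinct ages there are six candidate thresholds. Under the stand-in criterion the split at 31.0 leaves a single misclassified tuple, which agrees with the `age < 31` condition of the rule in Figure 1.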


1.2.3.2. Weaknesses of decision trees
Despite these strengths, decision trees also have weaknesses. They are not well suited to problems whose goal is to predict the value of a continuous attribute such as income, blood pressure, or a bank interest rate. They also have difficulty with continuous time-series data unless considerable effort is put into representing the data as continuous patterns.

Error-prone with too many classes
Some decision trees handle only binary classes such as yes/no or accept/reject. Others can assign records to an arbitrary number of classes but become error-prone when the number of training examples per class is small. This happens all the more quickly in trees with many levels or many branches per node.

Computationally expensive to train
This may seem to contradict the advantage stated above, but growing a decision tree is computationally expensive, because a decision tree has many internal nodes before the final leaves are reached. At each node, a measure (the splitting criterion) must be computed for each attribute, and for a continuous attribute the data must additionally be sorted by the values of that attribute. Only then can the attribute to expand, and the corresponding best split, be chosen. Some algorithms use weighted combinations of attributes to grow the tree. Pruning is also expensive, because many candidate subtrees must be generated and compared.

1.2.4. Building a decision tree
Building a decision tree involves two phases.

- Phase one grows the decision tree. Growth starts at the root, continues along each branch, and proceeds inductively in a divide-and-conquer fashion until every leaf of the tree has a class label.
- Phase two prunes branches from the decision tree. Its purpose is to simplify and generalize the tree, and thereby to increase its accuracy, by removing the dependence on statistical noise in the training data and on variations that may be peculiar to the training data. This phase accesses only the data held in the tree grown in the first phase. Experiments show that it consumes few computing resources: for most algorithms it takes under 1% of the total time needed to build the classification model [7][1].

We therefore concentrate on the tree-growing phase. Its framework is as follows:
1) Chn thuc tnh tt nht bng mt o nh trc 2) Pht trin cy bng vic thm cc nhnh tng ng vi tng gi tr ca thuc tnh chn 3) Sp xp, phn chia tp d liu o to ti node con 4) Nu cc v d c phn lp r rng th dng. Ngc li: lp li bc 1 ti bc 4 cho tng node con

1.3. Decision tree construction algorithms


1.3.1. General idea

Most decision-tree-based classification algorithms follow this pseudocode:

    MakeTree(Training Data T)
    {
        Partition(T)
    }

    Partition(Data S)
    {
        if (all points in S are in the same class) then
            return
        for each attribute A do
            evaluate splits on attribute A;
        use best split found to partition S into S1, S2, ..., Sk
        Partition(S1)
        Partition(S2)
        ...
        Partition(Sk)
    }

Figure 6 - Pseudocode of decision-tree-based classification algorithms


Classification algorithms such as C4.5 (Quinlan, 1993), CDP (Agrawal et al., 1993), SLIQ (Mehta et al., 1996) and SPRINT (Shafer et al., 1996) all use Hunt's method as their guiding idea. Hunt and his colleagues devised this method in the late 1950s and early 1960s. Hunt's method can be described inductively as follows [1]:

Suppose a decision tree is built from a set T of training cases, and the classes are denoted C = {C1, C2, ..., Ck}.

Case 1: T contains cases that all belong to a single class Cj. The decision tree for T is a leaf identifying class Cj.

Case 2: T contains cases that belong to several different classes in C. A test is chosen on a single attribute with outcomes {O1, O2, ..., On}. In many applications n is chosen to be 2, which produces a binary decision tree. T is partitioned into subsets T1, T2, ..., Tn, where Ti contains all cases in T that have outcome Oi of the chosen test. The decision tree for T consists of a node representing the chosen test and one branch for each possible outcome of that test. The same tree-building procedure is applied recursively to each subset of the training data.

Case 3: T contains no cases. The decision tree for T is a leaf, but the class associated with the leaf must be determined from information other than T. For example, C4.5 uses the most frequent class at the parent of this node.
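The three cases of Hunt's method can be sketched in a few lines of Python. This is only an illustrative sketch, not code from any of the cited systems: the function name `hunt`, the dictionary-based tree, the `domains` argument (all possible outcomes of each attribute) and the toy data are assumptions, and the test in case 2 is simply "take the next attribute" rather than a real selection measure.

```python
from collections import Counter

def hunt(cases, attributes, domains, fallback=None):
    """Build a tree from (features, label) pairs following Hunt's three cases."""
    if not cases:
        # Case 3: T is empty, so the leaf class comes from outside T
        # (here: the most frequent class at the parent, as C4.5 does).
        return {"leaf": fallback}
    labels = [label for _, label in cases]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        # Case 1: a single class (or, as an assumption, no attribute left to test).
        return {"leaf": majority}
    # Case 2: choose a test. Placeholder choice: the first remaining attribute.
    attr, rest = attributes[0], attributes[1:]
    node = {"test": attr, "children": {}}
    for outcome in domains[attr]:
        subset = [(f, c) for f, c in cases if f[attr] == outcome]
        node["children"][outcome] = hunt(subset, rest, domains, majority)
    return node

# Hypothetical toy data, not taken from the thesis.
cases = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rain", "windy": "no"}, "play"),
    ({"outlook": "rain", "windy": "yes"}, "play"),
]
domains = {"outlook": ["sunny", "rain", "overcast"], "windy": ["no", "yes"]}
tree = hunt(cases, ["outlook", "windy"], domains)
```

In this toy run no training case has outlook = overcast, so that branch falls under case 3 and receives the parent's most frequent class.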

1.3.2. Current state of research on the algorithms

All decision-tree-based classification algorithms follow Hunt's method described above. Two major questions must always be answered by such an algorithm:

1. How is the best attribute to expand at each node determined?
2. How is the data stored, and how is it partitioned according to the corresponding tests?

Different algorithms answer these two questions differently, and this is what distinguishes one algorithm from another.

There are three kinds of criteria, or indices, for determining the best attribute to expand at each node:

Gini-index (Breiman et al., 1984 [1]): this criterion selects the attribute that minimizes the impurity of each split. The algorithms that use it are CART, SLIQ and SPRINT.

Information gain (Quinlan, 1993 [1]): unlike the Gini-index, this criterion uses entropy to measure the impurity of a split and selects the attribute that maximizes the reduction in entropy. The algorithms that use it are ID3 and C4.5.

χ² contingency-table statistic: χ² measures the correlation between each attribute and the class label, and the attribute with the highest correlation is selected. CHAID is the algorithm that uses this criterion.

The computation of the Gini-index and information gain criteria is presented in detail with the two algorithms C4.5 and SPRINT in chapter 2.

Computing these indices sometimes requires scanning all or part of the training data set. For this reason the earlier algorithms require the entire training data set to be memory-resident while the decision tree is grown. This limits the scalability of those algorithms, because memory size is limited while training data sets keep growing, sometimes to millions or billions of records in commercial applications. A new approach to storing and accessing the data was clearly needed, and in 1996 SLIQ (Mehta) and SPRINT (Shafer) were introduced to remove this limitation. Both algorithms keep the data disk-resident and pre-sort the training data set once. These new features significantly improve performance and scalability compared with other algorithms.

Several later algorithms build on SPRINT with additional improvements, such as PUBLIC (1998) [11], which integrates the building and pruning phases, ScalParC (1998), which improves SPRINT's data-splitting step with a different use of the hash table, and an algorithm proposed by researchers at the University of Minnesota (USA) together with IBM, which reduces I/O cost as well as global communication cost in the parallel setting compared with SPRINT [2]. Among these algorithms SPRINT is regarded as a breakthrough and is worth studying and developing further.


1.3.3. Parallelizing sequential decision-tree-based classification algorithms

Parallelization is the current research trend for decision-tree-based classification algorithms. Parallelizing the sequential algorithms is a natural consequence of practical demands for ever higher performance and accuracy, together with the rapid growth in the size of the data to be mined. A classification model running on a parallel computing system offers high performance and can mine larger data sets, which in turn increases the reliability of the classification rules. Sequential algorithms that require memory-resident data cannot meet the needs of terabyte-scale data sets with billions of records. Building efficient parallel algorithms on top of the existing sequential ones is therefore a challenge for researchers.

There are three strategies for parallelizing the sequential algorithms:

Synchronous tree construction approach

In this approach all processors take part in building the decision tree concurrently, by sending and receiving the class distribution information of their local data. Figure 7 illustrates how the processors cooperate in this approach.

The advantage of this approach is that it does not require any movement of the training data. However, the algorithm incurs high communication cost and load imbalance. For each node of the decision tree, after collecting the class distribution information, all processors must synchronize and update that information. For nodes at shallow depth the communication cost is relatively small compared with the computation, because the number of training items processed at each node is still large. But as the tree grows deeper, communication takes up most of the processing time. Another problem of this approach is load imbalance, which is caused by the way the data is initially stored and distributed among the processors.
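The exchange of class distribution information in the synchronous approach can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: each "processor" is modelled as a list of local records, the helper names `local_histogram` and `global_histogram` are hypothetical, and the summation loop stands in for whatever collective communication a real parallel system would use.

```python
from collections import Counter

def local_histogram(shard, attr):
    """One processor: class counts per attribute value over its local records only."""
    hist = Counter()
    for record in shard:
        hist[(record[attr], record["class"])] += 1
    return hist

def global_histogram(shards, attr):
    """Stand-in for the exchange step: sum the local histograms of all processors."""
    total = Counter()
    for shard in shards:
        total.update(local_histogram(shard, attr))
    return total

# Hypothetical records split across two processors; no record ever moves.
shards = [
    [{"car": "family", "class": "high"}, {"car": "sport", "class": "high"}],
    [{"car": "family", "class": "low"}, {"car": "truck", "class": "low"}],
]
merged = global_histogram(shards, "car")
```

Only the small histograms are exchanged, which is why no training data has to move, but the exchange has to be repeated for every node of the tree, which is where the communication cost comes from.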


Figure 7 - Decision tree construction with the synchronous approach

Partitioned tree construction approach

When the decision tree is built with the partitioned approach, different processors work on different parts of the tree. If more than one processor cooperates in expanding a node, those processors are partitioned among the children of that node when the children are expanded. The approach considers the case in which a group of processors Pn cooperates in expanding node n. At the beginning, all processors work together to expand the root node of the classification tree. At the end, the complete classification tree is obtained by combining the subtrees built by the individual processors. Figure 8 illustrates how the processors cooperate in this approach.

The advantage of this approach is that once a single processor is solely responsible for a node, it can grow the corresponding subtree of the global tree independently, without any communication cost. The approach also has some drawbacks, however:

First, it requires data movement after each node expansion, until every processor holds all the data it needs to grow an entire subtree. This leads to high communication cost in the upper part of the classification tree.

Second, load balance is hard to achieve. Nodes are assigned to processors based on the number of cases in the child nodes, but the number of cases associated with a node does not necessarily correspond to the amount of work needed to grow the subtree rooted at that node.

Figure 8 - Decision tree construction with the partitioned approach

Hybrid approach

The hybrid approach combines the advantages of the two approaches above. Synchronous tree construction incurs high communication cost as the frontier of the tree becomes wider. Partitioned tree construction, on the other hand, incurs the cost of load balancing after every step. On this basis, the hybrid approach keeps using the first approach as long as the communication cost incurred by it is not too high. Once this cost exceeds a given threshold, the processors working on the nodes at the current frontier of the classification tree are split into 2 parts (assuming that the number of processors is a power of 2). The criterion used to trigger the partitioning of the current set of processors is:

Σ(Communication cost) ≥ Moving cost + Load balancing



The operation of the hybrid approach is illustrated in Figure 9.

Figure 9 - Decision tree construction with the hybrid approach


Chapter 2.

C4.5 AND SPRINT

2.1. General introduction


This section gives a general introduction to the history of the two algorithms C4.5 and SPRINT.

C4.5 descends from a line of decision-tree learning algorithms based on the research of Hunt and his colleagues in the late 1950s and early 1960s (Hunt 1962). The first version was ID3 (Quinlan, 1979), a simple system that initially contained about 600 lines of Pascal, followed by C4 (Quinlan 1987). In 1993 J. Ross Quinlan built on these results to develop C4.5, with 9000 lines of C that fit on one floppy disk. Although C4.5 has a successor, C5.0, a commercial system from RuleQuest Research, much discussion and research still focuses on C4.5 because its source code is available [13].

In 1996 three authors from the IBM Almaden Research Center, John Shafer, Rakesh Agrawal and Manish Mehta, proposed a new algorithm named SPRINT (Scalable PaRallelizable INduction of decision Trees). SPRINT removes all memory restrictions, is fast, and is scalable. The algorithm is designed to be easily parallelized, allowing many processors to work together to build a single, consistent classification model [7]. SPRINT has since been commercialized and integrated into IBM's data mining tools.

Among decision-tree-based classification algorithms, C4.5 and SPRINT are representative of two different application ranges. C4.5 is an efficient algorithm and the most widely used one in classification applications with small amounts of data, on the order of a few hundred thousand records. SPRINT is well suited to applications with huge amounts of data, from a few million to billions of records.

2.2. The C4.5 algorithm


Because of its characteristics, C4.5 is an efficient and popular decision-tree-based classification algorithm for mining small databases. C4.5 keeps the data memory-resident and re-sorts the data at every node while the decision tree is grown, and these two features make it suitable only for small databases. C4.5 also includes a technique for re-expressing the decision tree as an ordered list of if-then rules (an easily understood form of classification rule). This technique reduces the size of the rule set and simplifies the rules while keeping their accuracy comparable to that of the corresponding branches of the decision tree.

The tree-growing idea of C4.5 is Hunt's method described above, applied with a depth-first strategy. The pseudocode of C4.5 is:

    FormTree(T)
    (1) ComputeClassFrequency(T);
    (2) if OneClass or FewCases
            return a leaf;
        Create a decision node N;
    (3) ForEach Attribute A
            ComputeGain(A);
    (4) N.test = AttributeWithBestGain;
    (5) if N.test is continuous
            find Threshold;
    (6) ForEach T' in the splitting of T
    (7)     if T' is Empty
                Child of N is a leaf
            else
    (8)         Child of N = FormTree(T');
    (9) ComputeErrors of N;
        return N

Figure 10 - Pseudocode of the C4.5 algorithm

In this report we concentrate on the features that distinguish C4.5 from other algorithms: the mechanism for choosing the attribute to test at each node, the handling of missing values, the avoidance of overfitting, accuracy estimation, and the tree pruning mechanism.

2.2.1. C4.5 uses gain-entropy as the measure for selecting the "best" attribute

Most machine learning systems try to produce a tree that is as small as possible, because smaller trees are easier to understand and tend to achieve higher predictive accuracy.

Since the smallest possible decision tree cannot be guaranteed, C4.5 relies on a greedy heuristic: at each node it chooses the split for which the attribute selection measure reaches its maximum.

C4.5 uses two measures: information gain and gain ratio.

RF(Cj, S) denotes the relative frequency of the cases in S that belong to class Cj:

$$ RF(C_j, S) = \frac{|S_j|}{|S|} $$

where |Sj| is the number of cases whose class is Cj and |S| is the size of the training data set.

The information needed to identify the class of a case, I(S), where S is the set whose class distribution is being considered, is computed as:

$$ I(S) = -\sum_{j=1}^{k} RF(C_j, S)\,\log_2 RF(C_j, S) $$

After S has been partitioned into subsets S1, S2, ..., St by a test B, the information gain is computed as:

$$ G(S, B) = I(S) - \sum_{i=1}^{t} \frac{|S_i|}{|S|}\, I(S_i) $$

The test B for which G(S, B) is largest is selected. There is a problem with using G(S, B), however: it favours tests with a large number of outcomes. For example, G(S, B) is maximal for a test in which every Si contains a single case. The gain ratio criterion solves this problem by taking into account the potential information of the partition itself:

$$ P(S, B) = -\sum_{i=1}^{t} \frac{|S_i|}{|S|}\,\log_2 \frac{|S_i|}{|S|} $$

The test B with the largest gain ratio = G(S, B) / P(S, B) is selected.

In the C4.5 release 8 classification model, either information gain or gain ratio can be used to determine the best attribute. Gain ratio is the default option.
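The difference between the two measures can be checked numerically. The following sketch is not taken from the C4.5 sources: the function names, the 16-case class distribution and the two candidate tests are assumptions chosen only to illustrate the bias described above.

```python
from collections import Counter
from math import log2

def info(labels):
    """I(S): entropy of the class distribution of S, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(labels, subsets):
    """G(S, B): I(S) minus the weighted information of the subsets produced by B."""
    n = len(labels)
    return info(labels) - sum(len(s) / n * info(s) for s in subsets if s)

def split_info(labels, subsets):
    """P(S, B): potential information of the partition itself."""
    n = len(labels)
    return -sum(len(s) / n * log2(len(s) / n) for s in subsets if s)

def gain_ratio(labels, subsets):
    p = split_info(labels, subsets)
    return gain(labels, subsets) / p if p > 0 else 0.0

# Hypothetical set S with 8 "yes" and 8 "no" cases.
S = ["yes"] * 8 + ["no"] * 8
# A reasonable binary test: each side is almost pure.
binary_test = [["yes"] * 7 + ["no"], ["yes"] + ["no"] * 7]
# An identifier-like test: every outcome holds exactly one case.
id_like_test = [[label] for label in S]

for name, subsets in (("binary", binary_test), ("id-like", id_like_test)):
    print(name, round(gain(S, subsets), 3), round(gain_ratio(S, subsets), 3))
```

Information gain ranks the identifier-like test first, because each single-case subset is pure, while gain ratio divides by the large potential information of a 16-way partition and ranks the binary test first.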

Example of computing information gain

For a discrete attribute

Table 1 - Training data set with buys_computer as the class attribute

In this data set, s1 is the set of records whose class value is yes and s2 is the set of records whose class value is no. Then:

I(S) = I(s1, s2) = I(9, 5) = -9/14 * log2(9/14) - 5/14 * log2(5/14) = 0.940

G(S, A) is computed for each attribute A in turn:

A = age. The attribute age has been discretized into the values <30, 30-40 and >40.

For age = "<30": I(S1) = I(s11, s21) = -2/5 * log2(2/5) - 3/5 * log2(3/5) = 0.971
For age = "30-40": I(S2) = I(s12, s22) = 0
For age = ">40": I(S3) = I(s13, s23) = 0.971

Σ |Si| / |S| * I(Si) = 5/14 * I(S1) + 4/14 * I(S2) + 5/14 * I(S3) = 0.694

Gain(S, age) = I(s1, s2) - Σ |Si| / |S| * I(Si) = 0.246

The same computation for the other attributes gives:

A = income: Gain(S, income) = 0.029
A = student: Gain(S, student) = 0.151
A = credit_rating: Gain(S, credit_rating) = 0.048

The attribute age has the largest information gain, so age is chosen as the attribute to expand at the node under consideration.
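The numbers of the age example can be reproduced from the class counts alone. In the sketch below the (yes, no) counts per age group are inferred from the fractions used in the text (9 yes and 5 no in total, group sizes 5, 4 and 5); which side of each 2/3 split is "yes" is an assumption and does not affect the result.

```python
from math import log2

def info(counts):
    """Entropy in bits of a class distribution given as a tuple of counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

# (yes, no) counts of buys_computer for each value of age (orientation assumed).
age_groups = {"<30": (2, 3), "30-40": (4, 0), ">40": (3, 2)}

totals = [sum(column) for column in zip(*age_groups.values())]  # [9, 5]
n = sum(totals)
i_s = info(totals)
weighted = sum(sum(counts) / n * info(counts) for counts in age_groups.values())
gain_age = i_s - weighted

print(f"I(S) = {i_s:.3f}, weighted I = {weighted:.3f}, gain = {gain_age:.4f}")
```

Without intermediate rounding the gain is about 0.2467; the value 0.246 in the example comes from subtracting the rounded figures 0.940 and 0.694.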


For a continuous attribute

Handling a continuous attribute requires more computing resources than handling a discrete attribute. It consists of the following steps:

1. Quicksort is used to sort the cases of the training data set in increasing or decreasing order of the values of the continuous attribute V under consideration. This gives the set of values V = {v1, v2, ..., vm}.
2. The data set is split into two subsets by a threshold θi = (vi + vi+1)/2 lying between two adjacent values vi and vi+1. The test used to split the data is a binary test of the form V <= θi or V > θi. Applying the test yields two subsets: V1 = {v1, v2, ..., vi} and V2 = {vi+1, vi+2, ..., vm}.
3. The (m-1) possible thresholds θi for the m values of attribute V are examined by computing the information gain or gain ratio for each of them. The threshold with the largest information gain or gain ratio is chosen as the split threshold of that attribute.

Searching for the threshold (linearly, as above) and sorting the training set by the continuous attribute under consideration sometimes becomes a bottleneck because of the computing resources it consumes.

2.2.2. C4.5 has its own mechanism for handling missing values

Missing attribute values are common in real data. They may be caused by errors when records are entered into the database, or the attribute value may have been judged unnecessary for a particular case.

While the tree is built from the training data set S, let B be a test based on attribute Aa with outcomes b1, b2, ..., bt. Let S0 be the subset of cases in S whose value of Aa is unknown, and let Si denote the cases with outcome bi of test B. The information gain of test B is then reduced, because nothing is learned from the cases in S0:

$$ G(S, B) = \frac{|S - S_0|}{|S|}\; G(S - S_0, B) $$

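Correspondingly, P(S, B) also changes, with S0 treated as an additional outcome of the test:

$$ P(S, B) = -\frac{|S_0|}{|S|}\,\log_2 \frac{|S_0|}{|S|} - \sum_{i=1}^{t} \frac{|S_i|}{|S|}\,\log_2 \frac{|S_i|}{|S|} $$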


These two changes reduce the value of tests on attributes with a high proportion of missing values.

If test B is selected, C4.5 does not create a separate branch of the decision tree for S0. Instead, the algorithm distributes the cases in S0 among the subsets Si, the subsets whose value of the tested attribute is known, with weights |Si| / |S - S0|.

2.2.3. Avoiding overfitting

Overfitting is a significant difficulty for decision tree learning and for other learning methods. It arises as follows: if there are no conflicting cases (cases whose values agree on every attribute but whose classes differ), the decision tree will classify all cases in the training data set correctly. The training data, however, sometimes contains peculiarities of its own, so when the decision tree is applied to other data sets its accuracy is no longer as high.

There are several ways to avoid overfitting in decision trees:

Stop growing the tree earlier than usual, before it reaches the point where it classifies the training data perfectly. The challenge with this approach is to estimate precisely when to stop growing the tree.

Allow the tree to overfit the data, and then prune it.

Although the first approach seems more intuitive, trees produced by the second approach have been shown experimentally to be more successful in practice, because it allows potential interactions among attributes to be explored before deciding which results are worth keeping. C4.5 uses the second technique to avoid overfitting.

2.2.4. Converting decision trees to rules

Converting a decision tree into if-then production rules yields classification rules that are easy to understand and easy to apply. Classification models that express concepts as production rules have proved useful in many different fields that demand both accuracy and comprehensibility of the classification model. Producing a set of production rules as output is therefore a sensible choice. However, the computing resources needed to generate a rule set from a large training data set with many erroneous values are extremely large [12]. This claim is confirmed by the experimental results on the C4.5 classification model.

The conversion from a decision tree to rules consists of 4 steps:

Pruning: the initial rules are the paths from the root to the leaves of the decision tree. A decision tree with l leaves gives a production rule set with l initial rules. Each condition in a rule is examined and removed if removing it does not affect the accuracy of the rule. The pruned rules are then added to an intermediate rule set, provided they do not duplicate rules already in it.

Selection: the pruned rules are grouped by class value, forming subsets that contain the rules for each class. There are k rule subsets if the training set has k class values. Each of these subsets is examined in order to select a subset of rules that optimizes the predictive accuracy for the class associated with it.

Ordering: the k rule sets produced in the previous step are ordered by error frequency. A default class is created by identifying the cases in the training set that are not covered by any current rule and choosing the most frequent class among those cases.

Evaluation: the rule set is re-evaluated on the whole training set in order to determine whether any rule reduces the classification accuracy. If so, that rule is removed, and the evaluation is repeated until no further improvement is possible.

2.2.5. C4.5 is an efficient algorithm for small and medium-sized data sets

C4.5 generates decision trees efficiently and rigorously by using information gain as the measure for selecting the best attribute. Its mechanisms for handling erroneous and missing values and for preventing overfitting, together with its tree pruning mechanism, account for the strength of C4.5. In addition, the C4.5 classification model includes the conversion from decision trees to if-then rules, which increases the accuracy and comprehensibility of the classification results. This is a very valuable facility for users.


2.3. The SPRINT algorithm


Today the data to be mined may contain millions of records and from about 10 to 10000 attributes. Terabytes of data (100 M records * 2000 fields * 5 bytes) need to be mined. Earlier algorithms cannot meet this need. SPRINT was introduced in response, as an improvement of the SLIQ algorithm (Mehta, 1996). Both SLIQ and SPRINT include improvements that increase scalability:

Good handling of both continuous and discrete attributes.

Both algorithms pre-sort the data once and keep data that is too large to fit in main memory disk-resident. Because sorting disk-resident data is expensive [3], with pre-sorting the data used for growing the tree only needs to be sorted once. After each split at a node, the order of the records in each list is preserved, so no re-sorting is needed, unlike in CART and C4.5 [13][12]. This reduces the computing resources needed when the data is kept disk-resident.

Both algorithms use data structures that make building the decision tree easier. The data structures of SLIQ and SPRINT differ, however, which leads to different scalability and different potential for parallelization.

The pseudocode of SPRINT is as follows:

    SPRINT algorithm:
    Partition(Data S)
    {
        if (all points in S are of the same class) then
            return;
        for each attribute A do
            evaluate splits on attribute A;
        Use best split found to partition S into S1 and S2
        Partition(S1);
        Partition(S2);
    }
    Initial call: Partition(Training Data)

Figure 11 - Pseudocode of the SPRINT algorithm


2.3.1. Data structures in SPRINT

The technique of splitting the data into separate attribute lists was first proposed by SLIQ (Supervised Learning In Quest). The data used in SLIQ consists of several disk-resident attribute lists (one list per attribute) and a single list holding the class values, which is resident in main memory. The lists are linked to one another by the value of rid (the index of the record, numbered in database order) present in every list. SLIQ thus divides the data into two kinds of structures [14][9], shown in Figure 12.

Figure 12 - Data structures in SLIQ

Attribute list: disk-resident. The list consists of the attribute field and the rid (record identifier).

Class list: holds the class value of each record in the database. The list consists of the fields rid, class attribute, and node (a link to the corresponding node of the decision tree). Because each entry carries a pointer to its node in the decision tree, splitting the data only requires changing the value of this pointer field; the data itself does not have to be physically divided among the nodes. The class list is kept resident in main memory because it is accessed and modified frequently, both while the tree is built and while it is pruned. The size of the class list is proportional to the number of input records. When the class list does not fit in memory, the performance of SLIQ degrades. This is the limitation of SLIQ: its use of a memory-resident data structure restricts its scalability.


SPRINT uses disk-resident attribute lists

SPRINT overcomes the limitation of SLIQ by not using a memory-resident class list. It uses only one kind of list, the attribute list, structured as follows:

Training set
    RID   Age   Car Type   Risk
    0     23    family     high
    1     17    sport      high
    2     43    sport      high
    3     68    family     low
    4     32    truck      low
    5     20    family     high

Attribute list for Age
    Age   Risk   RID
    17    high   1
    20    high   5
    23    high   0
    32    low    4
    43    high   2
    68    low    3

Attribute list for Car Type
    Car Type   Risk   RID
    family     high   0
    sport      high   1
    sport      high   2
    family     low    3
    truck      low    4
    family     high   5

Figure 13 - Structure of attribute lists in SPRINT. The list of a continuous attribute is sorted as soon as it is created

Attribute lists

SPRINT creates an attribute list for each attribute of the data set. The list consists of the attribute value, the class label (the class attribute), and the record index rid (numbered from the original data set). The list of a continuous attribute is sorted by attribute value as soon as it is created. If the whole data set does not fit in memory, all attribute lists are stored on disk. This storage scheme is what allows SPRINT to remove all memory restrictions and to be applied to real databases whose number of records may reach billions.

The initial attribute lists created from the training data set are attached to the root of the decision tree. As the tree grows and nodes are split into new child nodes, the attribute lists belonging to a node are split correspondingly and attached to the child nodes. When a list is split, the order of the records in it is preserved, so the resulting sublists never need to be re-sorted. This is an advantage of SPRINT over earlier algorithms.

Histograms


SPRINT uses histograms to tabulate the class distribution of the records in each attribute list, and uses them to evaluate split points for that list. Continuous and discrete attributes have two different forms of histogram.

Histograms for continuous attributes

SPRINT uses 2 histograms: Cbelow and Cabove. Cbelow holds the class distribution of the records that have already been processed, and Cabove holds the class distribution of the records of the attribute list that have not yet been processed. Figure 14 illustrates the use of histograms for a continuous attribute.

Histograms for discrete attributes

A discrete attribute also has a histogram attached to each node. In this case, however, SPRINT uses only one histogram, the count matrix, which holds the class distribution for each value of the attribute under consideration.

The attribute lists are processed one at a time, so instead of requiring the attribute lists to be in memory, SPRINT only needs memory for the set of histograms described above while the tree is grown.

2.3.2. SPRINT uses the Gini-index as the measure for finding the best split point

SPRINT is one of the algorithms that use the Gini-index to find the best attribute to serve as the test attribute at each node of the tree. The index was proposed by Breiman in 1984 and is computed as follows.

First define:

$$ gini(S) = 1 - \sum_{j=1}^{n} p_j^2 $$

where S is a training data set with n classes and pj is the relative frequency of class j in S (the number of records in S whose class is j divided by the total number of records in S).

For a binary split, that is, when S is divided into S1 and S2 (SPRINT uses only binary splits), the index of the split is given by:

$$ gini_{split}(S) = \frac{n_1}{n}\, gini(S_1) + \frac{n_2}{n}\, gini(S_2) $$

where n, n1 and n2 are the sizes of S, S1 and S2 respectively.


The advantage of this index is that it is computed only from the distribution of class values in each part of the split, without any computation on the values of the attribute under consideration.

To find the split point for a node, each attribute list of the node is scanned and the splits based on each attribute attached to the node are evaluated. The attribute chosen for the split is the one with the smallest ginisplit(S).

The point emphasized here is that, unlike information gain, this index is computed without reading the contents of the data; only the histograms describing the distribution of records over class values are needed. This is the basis for keeping the data disk-resident.

The histograms of continuous and discrete attribute lists are described below.

For a continuous attribute

For a continuous attribute, the candidate split values are the values lying between each pair of adjacent values of the attribute. To find the split point of an attribute at a given node, the histograms are initialized with Cbelow set to 0 and Cabove set to the class distribution of all records at the node. The two histograms are updated each time a record is read. As the cursor advances, the gini-index is computed for the split point lying between the value just read and the value about to be read. When the whole attribute list has been read (Cabove is 0 in every column), the gini-index of every candidate split point has been computed. The lowest gini-index, and with it the split point of the continuous attribute at this node, can then be selected. The gini-index is computed entirely from the histograms. When the best split point has been found, the result is saved and the histograms attached to this attribute list are re-initialized before the next attribute is processed.

Figure 14 - Evaluating split points for a continuous attribute
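The single scan with Cbelow and Cabove can be sketched as follows, using the six records of Figure 13. This is a minimal sketch and not the SPRINT implementation: the function names are hypothetical, the attribute list is an in-memory Python list rather than a disk-resident file, and skipping split points between equal values is an assumption.

```python
from collections import Counter

def gini(hist):
    """gini(S) = 1 - sum of squared class frequencies, from a class histogram."""
    n = sum(hist.values())
    return 1.0 - sum((c / n) ** 2 for c in hist.values()) if n else 0.0

def best_continuous_split(attribute_list):
    """attribute_list: (value, class_label, rid) tuples, already sorted by value."""
    n = len(attribute_list)
    c_below = Counter()                                            # records read so far
    c_above = Counter(label for _, label, _ in attribute_list)     # records not yet read
    best = (float("inf"), None)
    for i in range(n - 1):
        value, label, _ = attribute_list[i]
        c_below[label] += 1
        c_above[label] -= 1
        next_value = attribute_list[i + 1][0]
        if next_value == value:
            continue  # no split point between two equal values
        n_below = i + 1
        split = n_below / n * gini(c_below) + (n - n_below) / n * gini(c_above)
        if split < best[0]:
            best = (split, (value + next_value) / 2)
    return best

# Training set of Figure 13 in rid order: (Age, Risk).
training = [(23, "high"), (17, "high"), (43, "high"),
            (68, "low"), (32, "low"), (20, "high")]
# Pre-sort once: build the Age attribute list as (value, class, rid).
age_list = sorted((age, risk, rid) for rid, (age, risk) in enumerate(training))
best_gini, threshold = best_continuous_split(age_list)
```

Only the two small histograms are kept between records, which is the property that lets the attribute list itself stay on disk.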



For a discrete attribute

For a discrete attribute, the best split point is also computed from the histogram of the attribute list. First the whole attribute list is scanned to obtain the class counts for each value of the discrete attribute; the result is stored in the count matrix. Then every possible subset of the values of the attribute is considered as a split point and the corresponding gini-index is computed. All the information needed to compute the gini-index of any subset is contained in the count matrix. The memory allocated to the count matrix is released once the best split point of the attribute has been found.
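The count matrix and the enumeration of value subsets can be sketched as follows for the Car Type list of Figure 13. This is an illustrative sketch only: the function names are hypothetical, and the enumeration keeps the first value on the right-hand side so that each two-way partition is evaluated once, which is a design choice of this sketch rather than something stated in the text.

```python
from collections import Counter, defaultdict
from itertools import combinations

def gini(hist):
    n = sum(hist.values())
    return 1.0 - sum((c / n) ** 2 for c in hist.values()) if n else 0.0

def best_categorical_split(attribute_list):
    """attribute_list: (value, class_label, rid) tuples for one discrete attribute."""
    # One scan builds the count matrix: attribute value -> class histogram.
    count_matrix = defaultdict(Counter)
    for value, label, _ in attribute_list:
        count_matrix[value][label] += 1
    values = sorted(count_matrix)
    total = sum(count_matrix.values(), Counter())
    n = sum(total.values())
    best = (float("inf"), None)
    # Every subset of values[1:] defines one partition: subset versus the rest.
    for size in range(1, len(values)):
        for subset in combinations(values[1:], size):
            left = sum((count_matrix[v] for v in subset), Counter())
            right = total - left
            n_left = sum(left.values())
            split = n_left / n * gini(left) + (n - n_left) / n * gini(right)
            if split < best[0]:
                best = (split, subset)
    return best

car_list = [("family", "high", 0), ("sport", "high", 1), ("sport", "high", 2),
            ("family", "low", 3), ("truck", "low", 4), ("family", "high", 5)]
best_gini, best_subset = best_categorical_split(car_list)
```

After the single scan, every candidate subset is evaluated from the count matrix alone, without touching the attribute list again.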

Figure 15 - Evaluating split points for a discrete attribute

Example of computing the Gini-index

For the training data set shown in Figure 13, the Gini-index used to find the best split point is computed as follows:

1. For the continuous attribute Age, the split is evaluated for each of the following comparisons in turn: Age<=17, Age<=20, Age<=23, Age<=32, Age<=43, Age<=68.

    Tuple count   High   Low
    Age<=17       1      0
    Age>17        3      2

G(Age<=17) = 1 - (1^2 + 0^2) = 0
G(Age>17) = 1 - ((3/5)^2 + (2/5)^2) = 1 - 13/25 = 12/25
GSPLIT = (1/6) * 0 + (5/6) * (12/25) = 2/5

    Tuple count   High   Low
    Age<=20       2      0
    Age>20        2      2

G(Age<=20) = 1 - (1^2 + 0^2) = 0
G(Age>20) = 1 - ((1/2)^2 + (1/2)^2) = 1/2
GSPLIT = (2/6) * 0 + (4/6) * (1/2) = 1/3

    Tuple count   High   Low
    Age<=23       3      0
    Age>23        1      2

G(Age<=23) = 1 - (1^2 + 0^2) = 0
G(Age>23) = 1 - ((1/3)^2 + (2/3)^2) = 1 - (1/9) - (4/9) = 4/9
GSPLIT = (3/6) * 0 + (3/6) * (4/9) = 2/9

The remaining tests Age<=32, Age<=43 and Age<=68 are computed in the same way.

Comparing the GSPLIT values found for each split of the attributes, the GSPLIT for Age<=23 is the smallest. The split is therefore made on attribute Age with split value (23+32) / 2 = 27.5. The result is the decision tree shown in Figure 5, section 1.2.1.

2.3.3. Performing the split

Once the best split point of a node has been found by comparing the gini-index of the attributes at that node, the split is performed by creating the child nodes and dividing the attribute lists of the parent node among them.

Figure 16 - Splitting the attribute lists of a node
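The split of a node's attribute lists can be sketched as follows for the data of Figure 13 and the split Age < 27.5. This is a minimal in-memory sketch under stated assumptions: the function name and the "L"/"R" labels are hypothetical, a Python dict plays the role of the hash table from rid to child node, and the case of a hash table too large for memory is not handled.

```python
def split_attribute_lists(attribute_lists, split_attr, threshold):
    """attribute_lists: name -> list of (value, class_label, rid) in list order."""
    rid_to_child = {}  # the hash table: rid -> "L" or "R"
    children = {side: {name: [] for name in attribute_lists} for side in ("L", "R")}
    # 1) Split the list of the splitting attribute by applying the test,
    #    recording for every rid the child it was sent to.
    for entry in attribute_lists[split_attr]:
        value, _, rid = entry
        side = "L" if value < threshold else "R"
        rid_to_child[rid] = side
        children[side][split_attr].append(entry)
    # 2) Split the other lists by probing the hash table with each rid.
    for name, entries in attribute_lists.items():
        if name == split_attr:
            continue
        for entry in entries:
            children[rid_to_child[entry[2]]][name].append(entry)
    return children, rid_to_child

lists = {
    "Age": [(17, "high", 1), (20, "high", 5), (23, "high", 0),
            (32, "low", 4), (43, "high", 2), (68, "low", 3)],
    "Car Type": [("family", "high", 0), ("sport", "high", 1), ("sport", "high", 2),
                 ("family", "low", 3), ("truck", "low", 4), ("family", "high", 5)],
}
children, table = split_attribute_lists(lists, "Age", 27.5)
```

Because entries are appended in the order in which they are read, each child list keeps the order of its parent list, so the Age lists of the children remain sorted without any re-sorting.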


For the attribute chosen as the splitting attribute at the node (Age in the figure), dividing its attribute list among the child nodes is fairly simple. If it is a continuous attribute, the list is simply cut at the split point into 2 parts, which are assigned to the 2 corresponding child nodes. If it is a discrete attribute, the whole list must be scanned and the test applied to move each record to one of the 2 new lists of the 2 child nodes.

Matters are not so simple for the remaining attributes at the node (Car Type, for example). There is no test on these attributes, so the records cannot be divided by checking the attribute value. Instead a special field of the attribute lists is used: the rids. This is the field that links the records across the attribute lists. Specifically, while the list of the splitting attribute (Age) is being divided, the rid of each record is inserted into a hash table that marks the child node to which the corresponding records (those with the same rid) in the other attribute lists must be sent. The structure of the hash table is as follows:

Hash table

    Rids         1   2   3   4   5   6
    Child node   L   R   R   R   L   L

Figure 17 - Structure of the hash table used to split data in SPRINT (following the example of the previous figures)

When the list of the splitting attribute has been divided, the hash table is complete. The lists of the remaining attributes are divided among the child nodes according to the hash table, by reading the rid of each record and the corresponding Child node entry of the hash table.

If the hash table is too large for memory, the split is carried out in several steps. The hash table is divided into parts that each fit in memory, and the attribute lists are divided according to each part of the hash table in turn. The process is repeated until the whole hash table has been processed.

2.3.4. SPRINT is an efficient algorithm for data sets that are too large for other algorithms

SPRINT was not designed to outperform SLIQ [9] on data sets for which the class list fits in memory. The goal of the algorithm is to handle data sets that are too large for other algorithms and still produce an efficient classification model from them. Furthermore, SPRINT is designed to be easily parallelized. Parallelizing SPRINT is indeed quite natural and efficient with data-parallel processing. SPRINT achieves uniform data placement and workload balance by distributing the attribute lists evenly over the N processors of a shared-nothing machine [7]. Parallelizing SPRINT in particular, and decision-tree-based classification models in general, on shared-memory multiprocessor (SMP) systems, also called shared-everything systems, is studied in [10].

Besides its strengths, SPRINT also has weaknesses. The first is the hash table used to split the data, whose size is proportional to the number of data objects attached to the current node (the number of records in an attribute list). The hash table must be kept in memory while the data is split, and when it is too large the split has to be carried out in several steps. In addition, the algorithm incurs heavy I/O cost. Parallelizing the algorithm also incurs high global communication cost, because the information about the Gini-index of each attribute list must be synchronized.

The three authors of SPRINT report experimental results comparing the SPRINT classification model with SLIQ [7], shown in the chart below.

Chart 1 - Execution time of the SPRINT and SLIQ classification models as a function of training data set size

The chart shows that for small data sets (<1 million cases) the execution time of SPRINT is larger than that of SLIQ. This can be explained by the fact that the class list used by SLIQ still fits in memory. For large data sets (>1 million cases), however, SLIQ can no longer operate, whereas SPRINT still handles data sets of more than 2.5 million cases without difficulty. The reason is that SPRINT keeps its data entirely disk-resident.

2.4. Comparison of C4.5 and SPRINT

| Criterion | C4.5 | SPRINT |
|---|---|---|
| Attribute selection measure | Gain-entropy. Tends to split the data into groups of classes holding comparable amounts of data. Example with class A 40, class B 30, class C 20, class D 10: the test "if age < 65" sends class A 40 and class D 10 to the yes branch and class B 30 and class C 20 to the no branch. | Gini-index. Tends to isolate the largest class from the other classes. Example with the same distribution: the test "if age < 40" sends class A 40 to the yes branch and class B 30, class C 20 and class D 10 to the no branch. |
| Data storage | Memory-resident -> suited to mining small databases (hundreds of thousands of records) | Disk-resident -> suited to mining extremely large data sets that other algorithms cannot handle (hundreds of millions to billions of records) |
| Data sorting | Re-sorts the data set at every node | Pre-sorts once. As the tree grows, the attribute lists are partitioned but their original order is preserved, so no re-sorting is needed. |
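The two example splits of the class distribution A 40, B 30, C 20, D 10 can be checked numerically. The sketch below computes the size-weighted Gini index and the size-weighted entropy of each split; the helper names are assumptions for illustration, and plain entropy is used rather than the gain ratio that C4.5 applies. On these counts the split that isolates class A has the lower Gini index, while the balanced split has the lower entropy.

```python
from math import log2


def gini(counts):
    """Gini index of one partition, given its per-class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def entropy(counts):
    """Entropy (in bits) of one partition, given its per-class counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)


def weighted(measure, partitions):
    """Size-weighted average of an impurity measure over the partitions of a split."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * measure(p) for p in partitions)


# "age < 40": class A alone versus classes B, C, D.
isolate_largest = [[40], [30, 20, 10]]
# "age < 65": classes A and D versus classes B and C (50 cases on each side).
balanced = [[40, 10], [30, 20]]

for name, split in (("isolate largest", isolate_largest), ("balanced", balanced)):
    print(name, round(weighted(gini, split), 4), round(weighted(entropy, split), 4))
```

A lower value means a purer split, so each measure prefers the split for which it reports the smaller number.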


Chapter 3.

EXPERIMENTAL RESULTS

The author used the open-source C4.5 release8 classifier written by J. Ross Quinlan, available at http://www.cse.unsw.edu.au/~quinlan/, to analyze and evaluate the C4.5 classification model in terms of its classification results and the factors that affect its performance.

3.1. Experimental environment

The C4.5 source code was installed and run on server 10.10.0.10 of the College of Technology. The server configuration is as follows: Intel Xeon 2.4GHz, with 2 physical processors that can operate as 4 logical processors using hyper-threading technology, cache size 512KB, and 1GB of main memory.

The experimental data set contains information about mobile phone customers who registered to use a web portal. Its fields include personal information such as name, gender, date of birth, the region where the phone was registered, the type of phone used and its version, and the number and duration of visits to the web portal to use services such as sending messages, logos or ringtones. About 120000 records are used for training and about 60000 records are used as the test data set.

3.2. Structure of the C4.5 release8 classifier

3.2.1. The C4.5 classifier has 4 main programs:

- The decision tree generator (c4.5)
- The production rule generator (c4.5rules)
- The program that applies a decision tree to classify new data (consult)
- The program that applies a production rule set to classify new data (consultr)

In addition, C4.5 comes with 2 utilities that support the experiments:

- A csh shell script for the cross-validation technique of estimating classifier accuracy ('xval.sh')
- Two accompanying helper programs ('xval-prep' and 'average')

More details about the C4.5 classifier can be found at:

http://www2.cs.uregina.ca/~hamilton/courses/831/notes/ml/dtrees/c4.5/tutorial.html

3.2.2. Data structures used in C4.5

Each data set used with C4.5 consists of 3 files:

3.2.2.1. Filestem.names: defines the data set

Figure 18 - File defining the structure of the data used in the experiments

Description:

- The top line defines the class values of the chosen class attribute (in Figure 18 this is the attribute MOBILE_PRODUCTER_ID).
- The following lines list the attributes together with their sets of values in the data set. Continuous attributes are declared with the keyword continuous.
- Comments follow the | character.

3.2.2.2. Filestem.data: contains the training data

Figure 19 - File containing the data to be classified

Filestem.data is structured as follows: each line corresponds to one record (case) in the database. Each line holds one value per attribute, in the order in which the attributes are defined in filestem.names. Values are separated by commas. A missing value is represented by the ? character.

3.2.2.3. Filestem.test: contains the test data

This file holds the data used to test the classifier built from the training data, and has the same structure as filestem.data.
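The filestem.data layout described above (one case per line, comma-separated values, ? for a missing value) can be read with a short helper. This is a sketch under stated assumptions: the function name, the attribute names and the sample lines are hypothetical, the class value is assumed to be the last field of each line, and optional details of the real format such as trailing periods or | comments are not handled.

```python
import csv
from io import StringIO


def read_c45_data(text, attribute_names):
    """Parse filestem.data-style text into (attribute dict, class label) pairs.

    Assumes one case per line, comma-separated values in the order given by
    attribute_names, the class value in the last field, and '?' for missing.
    """
    cases = []
    for row in csv.reader(StringIO(text), skipinitialspace=True):
        if not row:  # skip blank lines
            continue
        values = [None if v.strip() == "?" else v.strip() for v in row]
        *attrs, label = values
        cases.append((dict(zip(attribute_names, attrs)), label))
    return cases


# Hypothetical two-attribute sample; the real data set has different fields.
sample = "M, 1982, 1\nF, ?, 4\n"
cases = read_c45_data(sample, ["SEX", "BIRTH_YEAR"])
```

Representing a missing value as None keeps it distinct from any legitimate attribute value, so later code can decide how to treat it.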

3.3. Experimental results

3.3.1. Some typical classification results

3.3.1.1. Decision tree

Command to build the decision tree:

$ ./C4.5 -f ../Data/Classes/10-5/class -u >> ../Data/Classes/10-5/class.dt

Options:

- -f: specifies the data set to classify
- -u: the generated tree is also evaluated on the test data set
- -v verb: level of detail of the output [0..3], default 0
- -t trials: selects iterative mode, with trials as the number of trial trees. Iterative mode builds several trial trees, each starting from a randomly chosen subset of the data. The default is batch mode, in which the whole data set is used to build a single decision tree.


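The internal nodes of the decision tree are tests on the value of the attribute chosen for expansion at that node. The leaves of the decision tree have the format Class_value (N/E) or Class_value (N), where N is the total number of cases that reach the leaf and E is the number of those cases that belong to a different class (in the training data set).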
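The leaf annotation can be turned into a per-leaf training error rate E/N. The snippet below is a small sketch: the regular expression, the function name and the sample leaf strings are assumptions for illustration and are not taken from the C4.5 sources.

```python
import re

# Matches "<class> (N/E)" or "<class> (N)"; this pattern is an assumption.
LEAF = re.compile(r"^(?P<label>.+?)\s*\((?P<n>[\d.]+)(?:/(?P<e>[\d.]+))?\)\s*$")


def parse_leaf(text):
    """Return (class label, N, E, training error rate) for one leaf string."""
    match = LEAF.match(text.strip())
    if match is None:
        raise ValueError(f"not a leaf annotation: {text!r}")
    n = float(match.group("n"))
    e = float(match.group("e") or 0.0)  # "(N)" means no misclassified cases
    return match.group("label"), n, e, (e / n if n else 0.0)


# Hypothetical leaves, not copied from Figure 20.
leaf_with_errors = parse_leaf("1 (58.0/2.0)")
pure_leaf = parse_leaf("4 (12.0)")
```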

Figure 20 - Form of the decision tree generated from the experimental data set


Figure 21 - Evaluation of the generated decision tree on the training data set and the test data set

After the decision tree has been built, its accuracy is re-evaluated on the training data it was learned from and, if the user requests it, on a test data set that is independent of the training data. The evaluations are performed on the tree both before and after pruning. C4.5 also accepts a parameter that controls the degree of pruning; the default is 25%.

3.3.1.2. Typical production rules

Command to generate production rules once the decision tree exists:

$ ./C4.5rules -f ../Data/Classes/10-5/class -u >> ../Data/Classes/10-5/class.r

The options -f, -v and -u have the same meaning as in the command that builds the decision tree.

Each generated rule has 3 parts:

- The classification condition
- The class value (-> class)
- [ ]: the predicted accuracy of the rule. This value is estimated on the training set and on the test set (if the -u option is given when generating rules).

Figure 22 - Some rules extracted from the 19-attribute data set, classified by the user's interface setting (WEB_SETTING_ID)

Rules about customers' interface preferences help with design work and with creating interfaces suited to each type of customer. For example, Rule 233 in Figure 22 shows that if a customer registered for the service in Ha Noi, has an occupation in the Other group and was born in 1982, then the interface mode that customer uses has code 1. This conclusion has an accuracy of 96.6%.


Figure 23 - Some rules extracted from the 8-attribute data set, classified by phone manufacturer number (PRODUCTER_ID)

From the results in Figure 23, Rule 1021 lets us conclude that if a customer works in a Supervisory job and was born between 1969 and 1973, then the phone that customer uses has manufacturer number 1 (a SAMSUNG phone). The accuracy of this conclusion is 91.7%. Rules like this help marketing staff identify the mobile phone market for each type of customer and, from there, devise sound product development strategies.


Figure 24 - Some rules generated from the 8-attribute data set, classified by the phone service the customer uses (MOBILE_SERVICE_ID)

For example, from Rule 661: if the customer is male (F), works in Engineering, uses an Ericsson phone (MOBILE_PRODUCTER_ID = 4) and registered in 2004, then the service that customer uses is sending logos (MOBILE_SERVICE_ID = 2). The accuracy of this rule is 79.4%. From rules like this we can compile statistics on, and predict, the service usage trends of each type of customer, and from there devise effective customer service development strategies.


Figure 25 - Evaluation of the rule set on the training data set

After it is generated, the rule set is re-evaluated on the training data or, optionally, on the test data set. Some of the main fields are:

- Rule: the rule number
- Size: the size of the rule (the number of comparison conditions in its condition part)
- Used: the number of cases in the training set to which the rule applies. This field indicates how general the rule is.
- Wrong: the number of misclassified cases -> error percentage

Conclusion

From the experiments we found that data preprocessing plays a very important role. In this phase it is necessary to determine exactly what information needs to be extracted from the database and, from that, to choose a suitable class attribute. After that, selecting the relevant attributes is very important: it determines whether the classification model is correct, whether it is meaningful in practice, and whether it can be applied to future data.

3.3.2. Performance charts

The parameters that affect the performance of the classification model are [6]:

- The number of records in the training data set (N)
- The number of attributes (A)
- The number of discrete values of each attribute (the branching factor) (V)
- The number of classes (C)

The cost of building the decision tree is the sum of the costs of building each node:

$T = \sum_i t_{node}(i)$

The cost of node i is the sum of the costs of the individual tasks:

$t_{node}(i) = t_{single}(i) + t_{freq}(i) + t_{info}(i) + t_{div}(i)$


Where:

- $t_{single}(i)$ is the cost of checking whether all the cases in the training data at the node belong to the same class.
- $t_{div}(i)$ is the cost of partitioning the data set on the chosen attribute.

The attribute with the largest Information gain in the current data set is selected by computing the Information gain of every attribute. The cost of this process consists of the time to compute the frequency distribution over the class values for each attribute ($t_{freq}(i)$) and the time to compute the Information gain from that distribution ($t_{info}(i)$).

The dependence of these costs on the performance parameters described above can be expressed as follows:

$t_{freq}(i) = k_1 A_i N_i$
$t_{info}(i) = k_2 C A_i V$
$t_{div}(i) = k_3 A_i$
$t_{single}(i) = k_4 N_i$

Here each $k_j$ is a constant whose value depends on the specific application. The number of records ($N_i$) and the number of attributes ($A_i$) at each node depend on the depth of the node and on the data set itself. Determining the exact cost of building the decision tree (T) is very difficult, because it requires knowing the exact shape of the decision tree, which cannot be determined before run time. For this reason T is simplified by using average values together with assumptions about the shape of the tree, and by solving the recurrence equations for each individual component of the model [6].
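The cost model can be evaluated directly once a tree shape is assumed. The following sketch sums the four per-node terms exactly as written above; the class and function names, the unit constants k_j = 1 and the three-node example tree are assumptions chosen only to make the arithmetic visible.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    n_records: int     # N_i: records reaching node i
    n_attributes: int  # A_i: attributes still available at node i


def node_cost(node, n_classes, branching, k=(1.0, 1.0, 1.0, 1.0)):
    """t_node(i) = t_single(i) + t_freq(i) + t_info(i) + t_div(i)."""
    k1, k2, k3, k4 = k
    t_freq = k1 * node.n_attributes * node.n_records        # k1 * A_i * N_i
    t_info = k2 * n_classes * node.n_attributes * branching  # k2 * C * A_i * V
    t_div = k3 * node.n_attributes                           # k3 * A_i
    t_single = k4 * node.n_records                           # k4 * N_i
    return t_single + t_freq + t_info + t_div


def tree_cost(nodes, n_classes, branching, k=(1.0, 1.0, 1.0, 1.0)):
    """T = sum over all nodes of t_node(i)."""
    return sum(node_cost(n, n_classes, branching, k) for n in nodes)


# Hypothetical tree: a root with 1000 records and 4 attributes,
# and two children with 500 records and 3 attributes each.
example_tree = [NodeStats(1000, 4), NodeStats(500, 3), NodeStats(500, 3)]
total = tree_cost(example_tree, n_classes=2, branching=2)
```

With real constants the k_j would have to be calibrated from measurements, which is why the text falls back on average values and assumptions about the tree shape.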

The following experimental results evaluate how performance parameters such as training set size, number of attributes, continuous attributes and number of class values affect the C4.5 classifier:


3.3.2.1. Execution time as a function of training set size

The experiments were carried out on several data sets that differ in size, number of attributes and class attribute. The result tables and charts below show the dependence under study.

Experiment with the 2-attribute data set

Table 2 - Time to build the decision tree and the production rule set as a function of training set size, 2 attributes

| Data set size (cases) | 29000 | 60000 | 66000 | 131000 | 262000 |
|---|---|---|---|---|---|
| Decision Tree, build time (s) | 0.15 | 0.46 | 0.47 | 1.17 | 2.2 |
| Production Rules, build time (s) | 3.21 | 6.82 | 8.85 | 20.51 | 37.94 |

Chart 2 - Time to build the decision tree and the production rule set as a function of training set size, 2 attributes (x-axis: training set size in cases; y-axis: time in seconds; series: Decision Tree, Production Rules, trend line of Production Rules)


Experiment with the 7-attribute data set

Table 3 - Time to build the decision tree and the production rule set as a function of training set size, 7 attributes

| Data set size (cases) | 1000 | 10000 | 15000 | 20000 | 25000 | 30000 | 36000 |
|---|---|---|---|---|---|---|---|
| Decision Tree, build time (s) | 0.03 | 0.46 | 1.90 | 2.79 | 5.70 | 8.31 | 13.34 |
| Production Rules, build time (s) | 0.13 | 107.1 | 276.2 | 709.9 | 1211.0 | 2504.8 | 5999.5 |

Chart 3 - Time to build the decision tree and the production rule set as a function of training set size, 7 attributes (x-axis: training set size in cases; y-axis: time in seconds; series: Decision Tree, Production Rules, trend line of Production Rules)
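One hedged way to quantify how fast the times in Table 3 grow is to fit a straight line on log-log axes: if time behaves like c * n^p, the slope of that line estimates the exponent p. The sketch below uses the measurements from Table 3; the function name and the choice of an ordinary least-squares fit are assumptions, not the trend-line method used to draw the charts.

```python
from math import log

# Measurements from Table 3 (7-attribute data set).
sizes = [1000, 10000, 15000, 20000, 25000, 30000, 36000]
tree_times = [0.03, 0.46, 1.90, 2.79, 5.70, 8.31, 13.34]
rule_times = [0.13, 107.1, 276.2, 709.9, 1211.0, 2504.8, 5999.5]


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x): the exponent p in y ~ c * x**p."""
    lx = [log(x) for x in xs]
    ly = [log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    return sxy / sxx


tree_exponent = loglog_slope(sizes, tree_times)
rule_exponent = loglog_slope(sizes, rule_times)
print(f"decision tree exponent: {tree_exponent:.2f}")
print(f"production rules exponent: {rule_exponent:.2f}")
```

On this data the fitted exponent for rule generation lies between 2.5 and 3.5 and is clearly larger than the one for tree construction, which is consistent with a polynomial dependence. Seven points from a single data set are too few to treat the exponent as more than indicative.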


Experiment with the 18-attribute data set

Table 4 - Time to build the decision tree and the production rule set as a function of training set size, 18 attributes

| Data set size (cases) | 4000 | 6000 | 8500 | 10000 | 12000 | 15000 | 17500 | 20000 | 25000 |
|---|---|---|---|---|---|---|---|---|---|
| Decision Tree, build time (s) | 0.45 | 0.64 | 1.32 | 1.77 | 2.37 | 1.8 | 2.68 | 2.98 | 5.24 |
| Production Rules, build time (s) | 43.6 | 90.77 | 304.07 | 531.34 | 838.88 | 968.24 | 1584.63 | 2927.56 | 4617.23 |

Chart 4 - Time to build the decision tree and the production rule set as a function of training set size, 18 attributes (x-axis: training set size in cases; y-axis: time in seconds; series: Decision Tree, Production Rules, trend line of Production Rules)

The dependence of execution time on training set size was evaluated on data sets with different numbers of attributes. The following conclusions can be drawn:

- The larger the data set, the longer it takes to generate both the decision tree and the production rule set. Based on the trend lines added to the charts for the rule-generation time, we predict that this dependence is described by a polynomial function.
- The charts show that generating production rules from the finished decision tree consumes many times more computing resources than generating the decision tree itself. The experiments show that for data sets of around a hundred thousand records, generating the production rules takes quite long (typically > 5 hours). This is one of the reasons why C4.5 cannot be applied to large data sets.
- The more attributes the training data set has, the larger the gap in execution time between the two processes.

3.3.2.2. Performance of C4.5 as a function of the number of attributes

To evaluate this dependence, experiments were run on 3 data sets with 2, 4 and 8 discrete attributes and the same class attribute.

Table 5 - Decision tree generation time as a function of the number of attributes

| Data set size (cases) | 2 attributes (s) | 4 attributes (s) | 8 attributes (s) |
|---|---|---|---|
| 3000 | 0.01 | 0.12 | 0.14 |
| 6000 | 0.02 | 0.18 | 0.3 |
| 16000 | 0.05 | 0.82 | 3.56 |
| 23000 | 0.1 | 2.18 | 9.99 |
| 32000 | 0.18 | 3.32 | 23.40 |
| 40500 | 0.25 | 5.58 | 33.36 |
| 55500 | 0.39 | 11.83 | 47.62 |
| 65500 | 0.47 | 16.79 | 80 |
| 96600 | 0.89 | 33.49 | 106.61 |
| 131000 | 1.17 | 71.52 | 185 |

Chart 5 - Decision tree generation time as a function of the number of attributes (x-axis: training set size in cases; y-axis: time in seconds; series: 2 attributes, 4 attributes, 8 attributes)

The time C4.5 needs to build a decision tree depends on the number of attributes through the terms tfreq, tinfo and tdiv. The more attributes there are, the longer it takes to compute and select the best attribute to test at each node, and so the decision tree generation time increases. C4.5 is therefore limited in the number of attributes the training data set can have [2]. This is one point of difference from SPRINT.

3.3.2.3. Performance of C4.5 with continuous attributes

Table 6 - Decision tree build time with discrete attributes and with continuous attributes

| Data set size (cases) | 3 discrete attributes + 1 continuous attribute (s) | 4 continuous attributes (s) |
|---|---|---|
| 3000 | 0.12 | 0.24 |
| 6000 | 0.18 | 0.66 |
| 16000 | 0.92 | 3.02 |
| 22000 | 2.18 | 5.01 |
| 31000 | 3.32 | 11.56 |
| 40000 | 5.74 | 16.99 |
| 55000 | 11.83 | 30.37 |
| 65000 | 16.79 | 38.16 |
| 96000 | 33.47 | 70.38 |
| 131000 | 61.52 | 125.21 |

Chart 6 - Comparison of decision tree build time for continuous attributes and for discrete attributes (x-axis: training set size in cases; y-axis: time in seconds; series: 3 categorical attributes + 1 continuous attribute, 4 continuous attributes)

As analyzed for the C4.5 algorithm, and for decision-tree classification algorithms in general, handling continuous attributes takes more computing resources than handling discrete attributes. A data set with many continuous attributes therefore takes noticeably longer to produce a decision tree than a data set with mostly discrete attributes.

3.3.2.4. Performance of C4.5 with many class values

Table 7 - Decision tree generation time as a function of the number of class values

| Data set size (cases) | 4 classes (s) | 28 classes (s) |
|---|---|---|
| 3000 | 0.04 | 0.12 |
| 6000 | 0.07 | 0.18 |
| 16000 | 0.22 | 0.82 |
| 23000 | 0.35 | 2.18 |
| 31000 | 0.61 | 3.32 |
| 40000 | 0.97 | 5.74 |
| 55000 | 1.8 | 11.83 |
| 65500 | 2.36 | 16.79 |
| 96600 | 3.68 | 33.49 |
| 131000 | 4.72 | 61.51 |

Chart 7 - Decision tree generation time as a function of the number of class values (x-axis: training set size in cases; y-axis: time in seconds; series: 4 classes, 28 classes)

The more class values there are, the longer it takes to compute the Information gain of each attribute (tinfo). Decision tree generation therefore takes longer.

3.4. Some proposed improvements to the C4.5 classifier

The study of the C4.5 classifier and its comparison with SPRINT revealed the strengths and weaknesses of the algorithm. Based on these and on the experiments, we propose the following improvements to C4.5:

1. Production rule generation is a feature of C4.5 that other algorithms lack. With large databases the generated rule set is very long; for example, a training set of about 30000 cases with 8 attributes can produce up to 3000 rules. Reading the rule set and extracting useful information from it is therefore difficult. Given this, we propose adding to C4.5 a module that selects the best rules, meaning rules with acceptable accuracy (the accuracy level can be chosen by the user) and high generality (rules that apply to many cases in the experimental data set).

2. Production rule generation is a new and useful feature of C4.5 compared with other classification algorithms. However, generating production rules consumes far more computing resources than generating the decision tree. The rule generation phase should therefore be parallelized to improve the performance of C4.5.

3. C4.5 is limited in the number of attributes the training data set can have, and the accuracy of the decision trees and rules it produces is in general not yet high. Effort should be focused on methods that improve classifier accuracy, such as bagging and boosting.

4. C4.5 handles continuous attributes more slowly than discrete attributes. The explanation is that for a continuous attribute with n sorted values, the algorithm has to measure the quality of the split at (n-1) thresholds, each lying between 2 adjacent values in the sorted sequence. Only then can it find the best threshold to test on that attribute. The more values a continuous attribute has in the training data set, the more computing resources are spent on it. Several improvements to the handling of continuous attributes have been proposed [3][8]; this is one of the research directions currently being pursued in this work.

5. We propose applying to C4.5 a pre-sorting mechanism, sorting once and using class distribution histograms as SPRINT does. This would be a step towards a disk-resident data storage mechanism. If achieved, it would increase both the performance and the scalability of the C4.5 classifier.
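The threshold scan described in item 4 can be sketched as follows. This is a simplified illustration, not the C4.5 release8 code: it scores each candidate threshold with plain information gain and places the threshold at the midpoint between adjacent distinct values, whereas C4.5 itself uses the gain ratio and further refinements [8]. The function names and the sample data are assumptions.

```python
from collections import Counter
from math import log2


def entropy(counts):
    """Entropy (in bits) of a collection of per-class counts."""
    counts = [c for c in counts if c > 0]
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts)


def best_threshold(values, labels):
    """Scan the (at most n-1) thresholds between adjacent sorted values."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total = Counter(label for _, label in pairs)
    base = entropy(total.values())
    left = Counter()  # class counts of the cases at or below the threshold
    best_gain, best_t = -1.0, None
    for i in range(n - 1):
        value, label = pairs[i]
        left[label] += 1
        next_value = pairs[i + 1][0]
        if value == next_value:  # no threshold between equal values
            continue
        right = total - left
        n_left = i + 1
        info = (n_left / n) * entropy(left.values()) \
            + ((n - n_left) / n) * entropy(right.values())
        gain = base - info
        if gain > best_gain:
            best_gain, best_t = gain, (value + next_value) / 2
    return best_t, best_gain


# Hypothetical continuous attribute with six cases and two classes.
ages = [17, 20, 23, 32, 43, 68]
risk = ["High", "High", "High", "Low", "High", "Low"]
threshold, gain = best_threshold(ages, risk)
```

The running class counts make each threshold cheap to score once the values are sorted, so the dominant cost is the sort itself. That is why a one-time pre-sort, as in item 5, is attractive.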


CONCLUSION

Within the scope of this graduation thesis, we studied, analyzed and evaluated decision-tree-based data classification algorithms, represented by the two algorithms C4.5 and SPRINT. C4.5 and SPRINT store data differently and build the decision tree using different measures. As a result, the two algorithms apply to databases of different sizes.

C4.5 addresses the full range of problems in the classification process: selecting the best attribute, storing and partitioning the data, handling missing values, avoiding overfitting, pruning the tree, and so on. For these reasons C4.5 has become the most popular algorithm for small and medium-sized applications. The C4.5 classifier was deployed and run experimentally, and its performance was evaluated. This produced many results of practical significance, as well as results that suggest directions for further research.

SPRINT is an algorithm optimized for extremely large databases. Its advantages are that the idea behind the algorithm is quite simple, that it is highly scalable, and that it is very easy to parallelize. Implementing and deploying SPRINT is therefore of scientific interest, can be applied in practice, and would bring many practical benefits.


REFERENCES

[1] Anurag Srivastava, Eui-Hong Han, Vipin Kumar, Vineet Singh. Parallel Formulations of Decision-Tree Classification Algorithm. Kluwer Academic Publisher, 1999.
[2] Anurag Srivastava, Vineet Singh, Eui-Hong (Sam) Han, Vipin Kumar. An Efficient, Scalable, Parallel Classifier for Data mining.
[3] Girija J. Narlikar. A Parallel, Multithreaded Decision Tree Builder. CMU-CS-98-184. reports-archive.adm.cs.cmu.edu/ anon/1998/CMU-CS-98-184.pdf
[4] Henrique Andrade, Tahsin Kurc, Alan Sussman, Joel Saltz. Decision Tree Construction for Data Mining on Cluster of Shared-Memory Multiprocessors. http://citeseer.csail.mit.edu/178359.html
[5] Ho Tu Bao. Chapter 3: Data mining with Decision Tree. http://www.netnam.vn/unescocourse/knowlegde/knowlegd.htm
[6] John Darlington, Moustafa M. Ghanem, Yike Guo, Hing Wing To. Performance Model for Co-ordinating Parallel Data Classification.
[7] John Shafer, Rakesh Agrawal, Manish Mehta. SPRINT - A Scalable Parallel Classifier for Data mining. In Proceedings of the 22nd International Conference on Very Large Data Bases, India, 1996. www.almaden.ibm.com/software/quest/Publications/papers/vldb96_sprint.pdf
[8] J. R. Quinlan. Improved Use of Continuous Attributes in C4.5. Journal of Artificial Intelligence Research 4 (1996) 77-90.
[9] Manish Mehta, Rakesh Agrawal, Jorma Rissanen. SLIQ: A Fast Scalable Classifier for Data mining. IBM Almaden Research Center, 1996.
[10] Mohammed J. Zaki, Ching-Tien Ho, Rakesh Agrawal. Parallel Classification for Data Mining on Shared-Memory Multiprocessors. IBM Almaden Research Center, San Jose, CA 95120.
[11] Rajeev Rastogi, Kyuseok Shim (Bell Laboratories). PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning, 1998. www.vldb.org/conf/1998/p404.pdf
[12] Richard Kufrin. Generating C4.5 Production Rules in Parallel. In Proceedings of the Fourteenth National Conference on Artificial Intelligence, Providence RI, 1997.
[13] Ron Kohavi, J. Ross Quinlan. Decision Tree Discovery, 1999.
[14] Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques, Chapter 7 - Classification and Prediction. The Morgan Kaufmann Series in Data Management Systems (Series Editor: Jim Gray). Morgan Kaufmann Publishers, August 2000.
