
MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF TECHNOLOGY (TRƯỜNG ĐẠI HỌC BÁCH KHOA HÀ NỘI)

----------------------------------------------

MASTER OF SCIENCE THESIS
MAJOR: INFORMATION TECHNOLOGY (2004-2006)

RESEARCH ON AND APPLICATION OF SOME DATA MINING TECHNIQUES TO THE DATABASE OF THE VIETNAMESE TAX SECTOR

NGUYỄN THU TRÀ

Hanoi, 2006

TABLE OF CONTENTS

LIST OF SYMBOLS AND ABBREVIATIONS
LIST OF TABLES
LIST OF FIGURES
INTRODUCTION

CHAPTER 1. DATA MINING
1.1. Overview of data mining
1.1.1 Data
1.1.2 Data preprocessing
1.1.3 Data mining models
1.2. Basic data mining functions
1.2.1 Classification
1.2.2 Regression
1.2.3 Clustering
1.2.4 Association rule mining

CHAPTER 2. SOME DATA MINING ALGORITHMS
2.1. Association rule mining algorithms
2.1.1 The Apriori algorithm
2.1.2 The AprioriTid algorithm
2.1.3 The AprioriHybrid algorithm
2.2. Improving the efficiency of the Apriori algorithm
2.2.2 The FP-tree method
2.2.3 The PHP algorithm
2.2.4 The PCY algorithm
2.2.5 The multistage PCY algorithm
2.3. Classification by decision tree learning
2.3.1 Definitions
2.3.2 The ID3 algorithm
2.3.3 Extensions in C4.5

CHAPTER 3. APPLYING DATA MINING TO THE TAX SECTOR DATABASE
3.1. The tax sector database
3.2. Choosing a mining tool
3.2.1 Tool selection
3.2.2 Oracle Data Mining (ODM)
3.2.3 DBMS_DATA_MINING
3.3. Information-mining goals of the tax sector
3.4. Association rule mining experiments
3.5. Classification by decision tree learning
3.5.1 Classifying taxpayers (ĐTNT) by comparing ratios across years
3.5.2 Classifying taxpayers (ĐTNT) using the figures of a single year

CHAPTER 4. CONCLUSION
DIRECTIONS FOR FURTHER RESEARCH
REFERENCES
APPENDIX

LIST OF SYMBOLS AND ABBREVIATIONS

Symbol / abbreviation: Meaning
Association Rules: Association rules
Candidate itemset: An itemset in the set Ck that is used to generate the large itemsets
Ck: The set of candidate k-itemsets at stage k
Confidence: The confidence of an association rule, = support(X ∪ Y)/support(X); it reflects how likely a transaction that supports X is to support Y as well
CSDL: Database (cơ sở dữ liệu)
DM: Data mining
DW: Data warehouse
ĐTNT: Taxpayer (đối tượng nộp thuế); refers to the individuals or organizations that pay tax
Frequent/large itemset: An itemset whose support is >= the minimum support threshold
ID: Identifier
Item: An element of an itemset
Itemset: A set of items
k-itemset: An itemset of length k
Lk: The set of large itemsets at stage k
ODM: Oracle Data Mining, a data mining tool
TID: Unique Transaction Identifier
Transaction: Transaction (giao dịch)

LIST OF TABLES
Table 1.1: A simple database of training examples
Table 1.2 A simple transaction database model
Table 2.1 Transaction database T
Table 2.2 Table of data mining products

LIST OF FIGURES
Figure 1.1 The knowledge discovery process
Figure 1.2 Single-record and multi-record formats
Figure 1.3: A simple decision tree with tests on attributes X and Y
Figure 1.4: Classifying a new sample with a decision tree model
Figure 1.5 Final decision tree for database T given in Table 1.1
Figure 1.6 Decision tree for database T (Table 1.1) in pseudo-code form
Figure 1.7 Linear regression
Figure 1.8 Clustering with the k-means method (points marked + are centroids)
Figure 1.9 Agglomerative or divisive partitioning
Figure 1.10 First iteration of the Apriori algorithm for database DB
Figure 1.11 Second iteration of the Apriori algorithm for database DB
Figure 1.12 Third iteration of the Apriori algorithm for database DB
Figure 2.1 The Apriori algorithm
Figure 2.2 The AprioriTid algorithm
Figure 2.3 Example
Figure 2.4: Execution time of each pass of Apriori and AprioriTid
Figure 2.5: An example of a concept hierarchy for mining multilevel frequent itemsets
Figure 2.6: FP-tree for database T in Table 2.1
Figure 2.7 The PHP algorithm
Figure 2.8 Memory with the 2 passes of the PCY algorithm
Figure 2.9 Memory use for multistage hash tables
Figure 3.1 Effort required for each stage of data mining
Figure 3.2 Steps of association rule mining on the tax sector database
Figure 3.3 A branch of the business-sector hierarchy tree
Figure 3.4 Rules mined by ODM (rule length = 2)
Figure 3.5 Rules mined by ODM (rule length = 3)
Figure 3.6 Decision tree built with ODM, ratio analysis problem
Figure 3.7 Decision tree built with See5, ratio analysis problem
Figure 3.8 Decision tree built with ODM, problem using one year's figures
Figure 3.9 Decision tree built with See5, within-year analysis problem

INTRODUCTION

The rapid growth of the Internet, intranets and data warehouses, together with fast progress in storage technology, has allowed businesses and organizations to collect and own enormous volumes of information. Millions of databases are used in business administration, government administration, scientific data management and many other applications. With the strong support of database management systems, these databases keep growing quickly. The statement "The growth of databases creates the need for new techniques and tools that automatically and intelligently transform data into useful information and knowledge" [10] has become the opening problem statement of many articles on mining information and knowledge from large databases.

The author works in the tax sector, where information technology has been applied to tax administration since around 1986. The database of information related to the areas of tax administration is large and certainly hides much valuable information. As a first step toward applying data mining techniques to the tax sector database, this thesis studies data mining techniques and carries out experimental mining on that database.

The ability to extract useful knowledge hidden in data, and to act on that knowledge, is becoming increasingly important in today's competitive world. The whole process of applying computation-based methodologies, including new techniques, to discover knowledge from data is called data mining. [9]

Data mining is the search for new, valuable and nontrivial information in a large volume of data. It is a cooperative effort of humans and computers. The best results are obtained by balancing the knowledge of human experts in describing problems and goals with the search capability of computers.

The two main goals of data mining are prediction and description. Prediction uses some variables or fields in the data set to predict future or unknown values of the variables of interest. Description focuses on finding patterns that describe the data and that humans can understand and interpret. Data mining activities can therefore be placed in one of two categories:
- Predictive data mining, which produces a model of the system described by the given data set, or
- Descriptive data mining, which produces new, nontrivial information based on the available data set.

Some of the main data mining functions are:
- Concept description: characterization and discrimination. Finding generalized and summarized characteristics, and contrasting characteristics, in the data.
- Association: examining correlation and causal relationships.
- Classification and prediction: determining a model that describes the distinct classes and using it to predict the future.
- Cluster analysis: class labels are unknown; the data is grouped into new classes on the principle of maximizing similarity within a class and minimizing similarity between different classes.
- Outlier analysis: useful for detecting errors and analyzing rare events.
- Trend and evolution analysis.

Data mining is one of the fastest-growing fields in the computer industry. From a small area of interest within computer science and statistics, it has quickly expanded into a field of its own. One of the greatest strengths of data mining is the wide range of methodologies and techniques that can be applied to a variety of problems and domains.

In business, data mining can be used to discover new purchasing trends, to plan investment strategies, and to detect improper expenditures in accounting systems. It can help improve marketing campaigns so that they deliver more support and attention to customers. Data mining techniques can be applied to business process reengineering problems, where the goal is to understand the interactions and relationships within business practices and business organizations.

Many law enforcement and special investigation units, whose task is to uncover dishonest actions and detect crime trends, have also used data mining successfully. Data mining techniques can likewise be used in intelligence organizations, which keep many large data sources related to national security activities and issues.

With the aim of studying some data mining methods and trying them out on the tax sector database, the thesis is organized as follows:
- Chapter 1, Data mining: surveys the data mining functions.
- Chapter 2, Some data mining algorithms: studies two kinds of mining. Association rule mining is a common technique in unsupervised learning. Classification by decision tree learning is a supervised learning technique.
- Chapter 3, Applying data mining to the tax sector database: experiments with association rule mining and classification on the tax sector database.
- Chapter 4, Conclusion and the results achieved.

The thesis closes with some directions for further research.

I sincerely thank Assoc. Prof. Dr. Nguyễn Ngọc Bình for his guidance and valuable advice, and I sincerely thank the teachers of Hanoi University of Technology for the knowledge that helped me complete this thesis.

CHAPTER 1. DATA MINING

1.1. Overview of data mining


Data mining has its roots in several separate disciplines, the two most important being statistics and machine learning. Statistics originates in mathematics and therefore emphasizes mathematical rigor: it wants to establish on mathematical grounds that something is sound before testing it in practice. Machine learning, in contrast, has its origins largely in computing practice. This leads to a practical orientation, a willingness to test something to see how well it performs without waiting for a formal proof. [9]

Data mining can be defined as follows: data mining is the process of discovering various models, summaries and derived values from a given data set. [9]

Alternatively, data mining is the exploration and analysis of large quantities of data in order to discover patterns that are valid, novel, useful and understandable [14]. Valid means the patterns hold in general; novel means the pattern was not known beforehand; useful means appropriate actions can be based on the pattern; understandable means the patterns can be interpreted and fully comprehended.

Human analytical skills are inadequate because of the size and dimensionality of the data and because the data grows very fast. In addition, technology has responded strongly in terms of data collection, storage, computing power, software and professional expertise. There is also a competitive environment in which companies compete on service and not only on price (banks, telephone companies, hotels, rental companies and so on), captured by the saying "The secret of success is to know something that nobody else knows" (Aristotle Onassis [14]). All of these are the factors that drive the development of data mining.

The knowledge discovery process:

First, distinguish between the terms "model" and "pattern" as used in data mining. A model is a "large-scale" structure, possibly summarizing relationships over many cases (sometimes all cases), whereas a pattern is a local structure, satisfied by a few cases or within a small region of the data space. In data mining, a pattern is simply a local model.

The knowledge discovery process proceeds in the following steps:
1. Define the business problem: first study the domain of the business application, the relevant knowledge and the goals of the application.
2. Data mining
- Data selection: identify the target data sets and the relevant fields.
- Data cleaning: remove noise, preprocess. This work can take up to 60% of the effort.
- Data reduction and transformation: find useful features, reduce the number of dimensions or variables, find invariant representations.
- Choose the data mining function: summarization, classification, regression, association, clustering.
- Choose the mining algorithm.
- Perform the data mining: search for the patterns of interest.
- Evaluate the patterns and present the knowledge.
3. Apply the discovered knowledge
4. Evaluate and measure
5. Deploy and integrate into the business processes

Figure 1.1 The knowledge discovery process

1.1.1 Data
Because there are many kinds of data and the databases used by applications differ, users always hope for a data mining system that can handle every kind of data. In practice the available databases are usually relational, and data mining systems mine knowledge from relational data efficiently. For application databases that contain complex data types, such as hypertext and multimedia data, temporal and spatial data, or legacy data, separate data mining systems usually have to be built for the specific data types.

The data to be mined may be structured or unstructured. Each data record is regarded as a case or an example (case/example).

Two kinds of attributes are distinguished: categorical and numerical. Categorical attributes take values from a small number of separate categories or classes with no implicit order among them. If there are only 2 values, for example yes and no, or male and female, the attribute is binary. If there are more than 2 values, for example small, medium, large, very large, the attribute is multiclass.

Numerical attributes take continuous values, for example annual income or age. In theory annual income or age can be any value from 0 to infinity, although in practice the values that occur fall within a realistic range. Numerical attributes can be transformed into categorical ones: for example, annual income can be divided into the categories low, medium and high.

Unstructured data to which data mining algorithms can be applied is usually text data.

The tabular format of the data can be one of two kinds:
- Single-record format (also called nontransactional), which is the ordinary relational table.
- Multi-record format (also called transactional), which is used for data with many attributes.

In single-record (nontransactional) format, each record is stored as one row of the table. Single-record data does not require a key that uniquely identifies each record. A key is needed, however, to associate cases with their results in supervised learning.

In multi-record (transactional) format, each case is stored as several records in a table with the columns: sequence identifier, attribute name, value.

Figure 1.2 Single-record and multi-record formats
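The relationship between the two formats described above can be sketched in a few lines of Python. This is only an illustration: the column names `case_id`, `age` and `income` and the helper `to_multi_record` are hypothetical and do not come from the thesis or from any tool it uses.

```python
from typing import Dict, List, Tuple

def to_multi_record(rows: List[Dict[str, object]], key: str) -> List[Tuple[object, str, object]]:
    """Turn single-record rows into (case id, attribute name, value) triples."""
    triples = []
    for row in rows:
        case_id = row[key]
        for name, value in row.items():
            if name == key or value is None:
                # The key identifies the case; a missing value yields no record.
                continue
            triples.append((case_id, name, value))
    return triples

# Hypothetical single-record table: one row per case.
rows = [
    {"case_id": 1, "age": 34, "income": "high"},
    {"case_id": 2, "age": 51, "income": None},
]
triples = to_multi_record(rows, key="case_id")
for t in triples:
    print(t)
```

Each case now occupies as many records as it has non-missing attributes, which is why the multi-record form suits sparse data with many attributes.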

1.1.2 Data preprocessing
The selected data must go through a preprocessing step before knowledge discovery can begin. Collecting and preprocessing data is a very complex step. Running a DM algorithm on the entire database would be cumbersome and inefficient. During data mining it is often necessary to link or integrate data from many different sources. Existing systems were designed for different purposes and different users; when data from these systems is gathered for mining, redundancy is very common, and conflicts may also occur that cause data loss, inconsistent data or inaccurate data. Selecting and cleaning the data is therefore clearly essential.

If the input to the mining process is data in a DW, this is very convenient, because that data has already been cleaned, is consistent and is subject-oriented. Even so, some further preprocessing is often still needed to bring the data into the required form.

Ordinary processing includes transformation, gathering data from many sources into a common store, and processing to ensure data consistency (removing duplicates, unifying notation, converting to a uniform format such as currency units and dates). Beyond these, some special processing deserves attention in the preprocessing step:

Missing data: data mining usually does not require the user to treat missing values in any special way; the mining algorithm ignores them. In some cases, however, care is needed to make sure the algorithm distinguishes a meaningful value ("0") from an empty value (see [11]).

Outliers: an outlier is a value that lies far outside the usual range of the data set, a value that deviates significantly from the norm. The presence of outliers can have a considerable effect on data mining models. Outliers affect data mining during data preprocessing, whether that preprocessing is carried out by the user or automatically while the model is built.

Binning: some data mining algorithms can benefit from binning of both numerical and categorical data. The Naive Bayes, Adaptive Bayes Network, Clustering, Attribute Importance and Association Rules algorithms can benefit from binning. Binning means grouping related values together, which reduces the number of distinct values of an attribute. Fewer distinct values give a more compact model that is faster to build, but can also cost some accuracy [11] (for methods of computing bin boundaries, see [11]).
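As an illustration of binning, the sketch below groups a numerical attribute into equal-width bins. Equal-width boundaries are an assumption made here for simplicity; the thesis refers to [11] for the boundary methods actually used, and the function name and sample incomes are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Return, for each value, the index of its equal-width bin (0 .. n_bins-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    indices = []
    for v in values:
        if width == 0:
            idx = 0  # all values identical: a single bin
        else:
            # The maximum value would fall just past the last bin, so clamp it.
            idx = min(int((v - lo) / width), n_bins - 1)
        indices.append(idx)
    return indices

# Hypothetical annual incomes (arbitrary units) mapped to three categories.
incomes = [12, 18, 25, 40, 55, 90]
labels = ["low", "medium", "high"]
binned = [labels[i] for i in equal_width_bins(incomes, 3)]
print(binned)
```

Six distinct numerical values become three categorical ones, which shows both the compaction and the loss of detail mentioned above.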

1.1.3 Data mining models
A data mining model is a description of one particular aspect of a data set. It produces output values for a set of input values. Examples: linear regression models, classification models, clustering models.

A data mining model can be described at 2 levels:
- Function level: describes the model in terms of its intended use. Examples: classification, clustering.
- Representation level: the specific representation of a model. Examples: log-linear models, classification trees, the nearest-neighbor method.

Data mining models are based on 2 kinds of learning: supervised and unsupervised (sometimes referred to as directed and undirected learning) [11].

Supervised learning functions are used to predict values. Unsupervised learning functions are used to find the intrinsic structure, relations or similarities in the data content, with no classes or labels assigned a priori. Examples of unsupervised learning algorithms include k-means clustering and Apriori association rules. An example of a supervised learning algorithm is Naive Bayes for classification.

Correspondingly there are 2 kinds of data mining models:

Predictive models (supervised learning):
- Classification: group items into distinct classes and predict which class an item will belong to.
- Regression: approximate a function and predict continuous values.
- Attribute importance: determine which attributes matter most in the prediction results.

Descriptive models (unsupervised learning):
- Clustering: find natural groups in the data.
- Association models: market basket analysis.
- Feature extraction: create new attributes (features) as combinations of the original attributes.

1.2. Basic data mining functions

1.2.1 Classification
In a classification problem we have historical data (labeled examples, each known to belong to some class) and new data that is not yet labeled. Each labeled example consists of several predictor attributes and one target attribute (the dependent variable). The value of the target attribute is the class label. Unlabeled examples consist of the predictor attributes only. The goal of classification is to build a model from the historical data that accurately predicts the label (class) of unlabeled examples. [11]

The classification task begins with build data (training data) whose target values (class labels) are known. Different classification algorithms use different techniques to find relationships between the values of the predictor attributes and the values of the target attribute in the training data. These relationships are summarized in a model, which is then applied to new cases with unknown target values in order to predict those values. A classification model can also be run on test data (evaluation data) to compare the predicted values with the known answers. This technique is called testing the model; it measures the model's predictive accuracy. Applying a classification model to new data is called applying the model, and the data is called apply data or scoring data. Applying the model to data is often called "scoring the data".

Classification is used in customer segmentation, credit analysis and many other applications. For example, a credit card company wants to predict which customers will fail to make their payments on time. Each customer corresponds to one case; the data for each case may include attributes describing the customer's spending habits, income, demographic attributes and so on. These are the predictor attributes.

The target attribute indicates whether or not the customer has defaulted (failed to pay on time); there are thus two possible classes, corresponding to default and no default. The training data is used to build a model that predicts later, new cases (predicting whether a new customer is likely to repay the debt).

Costs: in a classification problem it may be necessary to specify the cost involved in making a wrong decision. This is important and necessary when the costs of different misclassifications differ greatly. Take, for example, the problem of predicting whether a person will respond to a promotional mailing. The target has 2 categories: YES (the customer responds) and NO (the customer does not respond). Suppose a positive response to the promotion generates $500 and it costs $5 to send the mailing. Then:
- If the model predicts YES and the actual value is YES, the cost of misclassification is $0.
- If the model predicts YES and the actual value is NO, the cost of misclassification is $5.
- If the model predicts NO and the actual value is YES, the cost of misclassification is $500.
- If the model predicts NO and the actual value is NO, the cost is $0.

In the cost matrix the row index corresponds to the actual values and the column index to the predicted values. For each actual-predicted pair, the matrix entry gives the cost of that misclassification.

Some algorithms, such as Adaptive Bayes Network, optimize against the cost matrix directly, modifying the model so as to produce minimum-cost solutions. Other algorithms, such as Naive Bayes (which predicts probabilities), use the cost matrix while scoring real data to arrive at the least-cost solution.

1.2.1.1 Classification: a two-step process
Step 1. Build the model (learning)
Build the model by analyzing the training data set with a classification algorithm, and express the model as classification rules, a decision tree, mathematical formulas, a neural network and so on. This step is also regarded as the step that creates the classifier.

Step 2. Use the model (classification)
Apply the model to a test data set whose classes are known, in order to check and assess the accuracy of the model. If the accuracy is acceptable, the model is used to classify new data.

There are thus 3 data sets with the same structure and the same predictor attributes: the training set and the test set, whose classes are known, and the new set, whose classes are not yet determined.
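Step 2 can weigh errors by cost instead of merely counting them. The sketch below evaluates predictions on a small test set using the cost matrix of the mailing example ($5 for a wasted mailing, $500 for a missed responder). The test labels and the function name `evaluate` are hypothetical, made up for illustration; only the four cost values come from the text.

```python
# Cost matrix indexed as COST[actual][predicted]:
# rows are actual values, columns are predicted values, as in the text.
COST = {
    "YES": {"YES": 0, "NO": 500},
    "NO": {"YES": 5, "NO": 0},
}

def evaluate(actual, predicted):
    """Return (accuracy, total misclassification cost) on a labeled test set."""
    pairs = list(zip(actual, predicted))
    correct = sum(a == p for a, p in pairs)
    total_cost = sum(COST[a][p] for a, p in pairs)
    return correct / len(pairs), total_cost

# Hypothetical test set: known classes and the model's predictions.
actual = ["YES", "NO", "NO", "YES", "NO"]
predicted = ["YES", "YES", "NO", "NO", "NO"]
accuracy, cost = evaluate(actual, predicted)
print(accuracy, cost)
```

Two of the five predictions are wrong, yet they contribute very differently to the total: the single missed responder accounts for almost all of the cost, which is the situation in which a cost matrix matters.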

1.2.1.2 Classification by decision tree learning
Decision trees
A particularly effective way to create classifiers from data is to generate a decision tree. The decision tree representation is the most widely used logic-based method [9]. A decision tree consists of nodes at which attributes are tested. The outgoing branches of a node correspond to all the possible outcomes of the test at that node. For example, a simple decision tree for classifying samples with 2 input attributes X and Y is given in Figure 1.3. All samples with feature values X>1 and Y=B belong to Class2, while samples with X<1 all belong to Class1, whatever value Y takes.

Figure 1.3: A simple decision tree with tests on attributes X and Y

The most important part of the algorithm is the process of generating a decision tree from a set of training samples. As a result, the algorithm produces a classifier in the form of a decision tree, a structure with 2 kinds of nodes: a leaf node, which indicates one class, or a decision node, which specifies a test to be carried out on a single attribute value, with one branch and subtree for each possible outcome of the test.

A decision tree can be used to classify a new sample by starting at the root of the tree and moving through it until a leaf is reached. At each decision node that is not a leaf, the outcome of the test at that node is determined and the walk moves to the root of the selected subtree. For example, if the classification model is given by the decision tree in Figure 1.4.1 and the sample to classify is the one in Figure 1.4.2, the algorithm follows the path through nodes A, C and F (a leaf node), where it makes the final classification decision: CLASS2.
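The two node kinds and the root-to-leaf walk can be sketched as follows. The tree below mirrors only what the text says about Figure 1.3 (X<1 gives Class1; X>1 with Y=B gives Class2). The outcomes for Y=A and Y=C, and the handling of X equal to 1, are assumptions made here because the figure itself is not available. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Leaf:
    label: str                       # a leaf node indicates one class

@dataclass
class Decision:
    test: Callable[[dict], str]      # maps a sample to the name of an outcome
    children: Dict[str, object]      # one subtree (Leaf or Decision) per outcome

def classify(node, sample):
    """Walk from the root to a leaf, following the outcome of each test."""
    while isinstance(node, Decision):
        outcome = node.test(sample)
        node = node.children[outcome]
    return node.label

tree = Decision(
    test=lambda s: "gt1" if s["X"] > 1 else "le1",   # X == 1 sent left: assumption
    children={
        "le1": Leaf("Class1"),
        "gt1": Decision(
            test=lambda s: s["Y"],
            children={
                "A": Leaf("Class1"),   # assumed, not stated in the text
                "B": Leaf("Class2"),   # stated in the text
                "C": Leaf("Class1"),   # assumed, not stated in the text
            },
        ),
    },
)

print(classify(tree, {"X": 2.0, "Y": "B"}))
print(classify(tree, {"X": 0.5, "Y": "B"}))
```

Keeping the test as a function on each decision node lets the same walk handle both categorical tests (one branch per value) and threshold tests on numerical attributes.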

Figure 1.4: Classifying a new sample with a decision tree model

The tree-growing algorithm for generating decision trees based on univariate splits is ID3, with C4.5 as its extended version. Suppose the task is to select a test with n outcomes (n values of a given feature) that partitions the training set T into subsets T1, T2, ..., Tn. The information available for guidance is the distribution of the classes in T and in its subsets Ti. If S is any set of samples, let freq(Ci, S) denote the number of samples in S that belong to class Ci, and let |S| denote the number of samples in S. The original ID3 algorithm uses a criterion called gain to select the attribute to be tested, based on a concept from information theory: entropy. The following relation gives the entropy of a set S with k classes, where p_i = freq(Ci, S) / |S|:

$$\mathrm{Info}(S) = -\sum_{i=1}^{k} p_i \log_2 p_i = -\sum_{i=1}^{k} \frac{\mathrm{freq}(C_i, S)}{|S|} \log_2 \frac{\mathrm{freq}(C_i, S)}{|S|}$$

Now consider the set T after it has been partitioned according to the n outcomes of a test on attribute X. The expected information requirement is the weighted sum of the entropies over the subsets:

$$\mathrm{Info}_x(T) = \sum_{i=1}^{n} \frac{|T_i|}{|T|} \, \mathrm{Info}(T_i)$$

The information gain measure: an attribute has high information gain if knowing its value brings the classification close to its goal. In the example of Figure 1.3, knowing that X<1 immediately determines the class, Class1. The gain of attribute X is measured as the average reduction in entropy of the set T once the value of X is known:

$$\mathrm{Gain}(X) = \mathrm{Info}(T) - \mathrm{Info}_x(T)$$

An example illustrating the use of these measures when building a decision tree: suppose database T has 14 cases (examples), each described by 3 input attributes and belonging to one of 2 given classes, CLASS1 or CLASS2. The database is given in Table 1.1.

9 samples belong to CLASS1 and 5 samples to CLASS2, so the entropy before splitting is:

$$\mathrm{Info}(T) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} = 0.940 \text{ bits}$$

After Attribute1 is used to divide the initial set of samples T into 3 subsets (test x1 selects one of the 3 values A, B or C), the resulting information is given below, with the convention 0 log2 0 = 0:

$$\mathrm{Info}_{x1}(T) = \tfrac{5}{14}\left(-\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5}\right) + \tfrac{4}{14}\left(-\tfrac{4}{4}\log_2\tfrac{4}{4} - \tfrac{0}{4}\log_2\tfrac{0}{4}\right) + \tfrac{5}{14}\left(-\tfrac{3}{5}\log_2\tfrac{3}{5} - \tfrac{2}{5}\log_2\tfrac{2}{5}\right) = 0.694 \text{ bits}$$

Table 1.1: A simple database of training examples

Database T:
Attribute1   Attribute2   Attribute3   Attribute4 (class)
A            70           True         CLASS1
A            90           True         CLASS2
A            85           False        CLASS2
A            95           False        CLASS2
A            70           False        CLASS1
B            90           True         CLASS1
B            78           False        CLASS1
B            65           True         CLASS1
B            75           False        CLASS1
C            80           True         CLASS2
C            70           True         CLASS2
C            80           False        CLASS1
C            80           False        CLASS1
C            96           False        CLASS1

The information gained by test x1 is:

$$\mathrm{Gain}(x1) = 0.940 - 0.694 = 0.246 \text{ bits}$$

If the test and split are based on Attribute3 (test x2 selects one of the 2 values True or False), a similar computation gives new results:

$$\mathrm{Info}_{x2}(T) = \tfrac{6}{14}\left(-\tfrac{3}{6}\log_2\tfrac{3}{6} - \tfrac{3}{6}\log_2\tfrac{3}{6}\right) + \tfrac{8}{14}\left(-\tfrac{6}{8}\log_2\tfrac{6}{8} - \tfrac{2}{8}\log_2\tfrac{2}{8}\right) = 0.892 \text{ bits}$$

and the corresponding gain is

$$\mathrm{Gain}(x2) = 0.940 - 0.892 = 0.048 \text{ bits}$$

Based on the gain criterion, the decision tree algorithm selects test x1 as the initial test for splitting database T, because its gain is higher. To find the optimal test, the test on Attribute2, a numerical feature with continuous values, must also be analyzed. The standard test for categorical attributes was explained above. The procedure for setting up tests on attributes with numerical values is described next.

Tests on continuous attributes are harder to formulate, because they involve an arbitrary threshold that splits all values into 2 intervals. There is an algorithm for computing the optimal threshold value Z. The training samples are first sorted on the values of the attribute Y under consideration. There is only a finite number of these values, so denote them in sorted order as {v1, v2, ..., vm}. Any threshold value lying between vi and vi+1 has the same effect: it divides the cases into those whose value of Y lies in {v1, v2, ..., vi} and those whose value lies in {vi+1, vi+2, ..., vm}. There are only m-1 possible splits on Y, and all of them must be examined systematically to obtain an optimal split. The threshold is usually chosen as the midpoint of each interval, (vi + vi+1)/2. The example below instead uses the lower endpoint vi of each interval as the candidate threshold.

An example of this threshold search: for database T, analyze the possible splits on Attribute2. After sorting, the set of values of Attribute2 is {65, 70, 75, 78, 80, 85, 90, 95, 96} and the set of potential threshold values Z is {65, 70, 75, 78, 80, 85, 90, 95}. The optimal Z (the one with the highest information gain) must be selected. In this example the optimal value is Z = 80, and the information gain for the corresponding test x3 (Attribute2 ≤ 80 or Attribute2 > 80) is computed as follows:

$$\mathrm{Info}_{x3}(T) = \tfrac{9}{14}\left(-\tfrac{7}{9}\log_2\tfrac{7}{9} - \tfrac{2}{9}\log_2\tfrac{2}{9}\right) + \tfrac{5}{14}\left(-\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5}\right) = 0.838 \text{ bits}$$

$$\mathrm{Gain}(x3) = 0.940 - 0.838 = 0.102 \text{ bits}$$

Comparing the information gain of the 3 attributes in the example, Attribute1 still gives the highest gain, 0.246 bits, and so this attribute is selected for the first split in building the decision tree. The root node tests the values of Attribute1, and 3 branches are created, one for each attribute value. This initial tree, with the corresponding subsets of samples in the child nodes, is shown in Figure 1.5.
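The gain calculations for the three candidate tests can be checked with a short Python sketch. The data is Table 1.1; the function names `info`, `info_x` and `gain` are chosen here to match the notation of the text and are not taken from any library.

```python
from collections import Counter
from math import log2

# Table 1.1 as (Attribute1, Attribute2, Attribute3, class) tuples.
T = [
    ("A", 70, True, "CLASS1"), ("A", 90, True, "CLASS2"),
    ("A", 85, False, "CLASS2"), ("A", 95, False, "CLASS2"),
    ("A", 70, False, "CLASS1"), ("B", 90, True, "CLASS1"),
    ("B", 78, False, "CLASS1"), ("B", 65, True, "CLASS1"),
    ("B", 75, False, "CLASS1"), ("C", 80, True, "CLASS2"),
    ("C", 70, True, "CLASS2"), ("C", 80, False, "CLASS1"),
    ("C", 80, False, "CLASS1"), ("C", 96, False, "CLASS1"),
]

def info(samples):
    """Entropy Info(S), in bits, of the class labels (last field of each sample)."""
    counts = Counter(s[-1] for s in samples)
    n = len(samples)
    # Only classes that actually occur are counted, so log2(0) never arises.
    return -sum(c / n * log2(c / n) for c in counts.values())

def info_x(samples, outcome):
    """Weighted entropy Info_x(T) after splitting by outcome(sample)."""
    groups = {}
    for s in samples:
        groups.setdefault(outcome(s), []).append(s)
    n = len(samples)
    return sum(len(g) / n * info(g) for g in groups.values())

def gain(samples, outcome):
    return info(samples) - info_x(samples, outcome)

g1 = gain(T, lambda s: s[0])         # test x1: Attribute1 in {A, B, C}
g2 = gain(T, lambda s: s[2])         # test x2: Attribute3 in {True, False}
g3 = gain(T, lambda s: s[1] <= 80)   # test x3: Attribute2 <= 80 or > 80
print(round(info(T), 3), round(g1, 3), round(g2, 3), round(g3, 3))
```

Attribute1 has the largest gain, so a gain-based tree builder would place it at the root, in agreement with the worked example.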

Figure 1.5 Initial decision tree and the subsets of cases for the database in Table 1.1

After the initial split, each child node holds some of the samples from the database, and the whole process of selecting and optimizing a test is repeated for every child node. Because the child node for test x1: Attribute1 = B contains 4 cases, all of them in CLASS1, this node becomes a leaf node and no additional tests are needed for this branch of the tree.

For the child node Attribute1 = A, there are 5 cases in the subset T1, and tests on the remaining attributes can be carried out. An optimal test (with maximum information gain) is test x4 with 2 alternatives: Attribute2 ≤ 70 or Attribute2 > 70.

$$\mathrm{Info}(T_1) = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} = 0.971 \text{ bits}$$

Using Attribute2 to divide T1 into 2 subsets (test x4 selects one of the 2 intervals), the resulting information is given by:

$$\mathrm{Info}_{x4}(T_1) = \tfrac{2}{5}\left(-\tfrac{2}{2}\log_2\tfrac{2}{2} - \tfrac{0}{2}\log_2\tfrac{0}{2}\right) + \tfrac{3}{5}\left(-\tfrac{0}{3}\log_2\tfrac{0}{3} - \tfrac{3}{3}\log_2\tfrac{3}{3}\right) = 0 \text{ bits}$$

The gain obtained by this test is maximal:

$$\mathrm{Gain}(x4) = 0.971 - 0 = 0.971 \text{ bits}$$

and the 2 branches create final leaf nodes, because the subset of cases in each branch belongs to a single class.

A similar computation is carried out for the third child of the root node. For the subset T3 of database T, the optimal test x5 selected is the test on the values of Attribute3. The branches of the tree, Attribute3 = True and Attribute3 = False, create homogeneous subsets of cases that belong to the same class. The final decision tree for database T is shown in Figure 1.5.
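The threshold search for a numerical attribute can be sketched on the subset T1 (the five cases with Attribute1 = A). Following the worked example, each candidate threshold is the lower endpoint of an interval between consecutive distinct values. Using midpoints instead would produce the same partitions. The function names are hypothetical.

```python
from collections import Counter
from math import log2

# Subset T1 of Table 1.1: (Attribute2, class) for the cases with Attribute1 = A.
T1 = [(70, "CLASS1"), (90, "CLASS2"), (85, "CLASS2"), (95, "CLASS2"), (70, "CLASS1")]

def info(labels):
    """Entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_threshold(samples):
    """Return (Z, gain) for the split 'value <= Z' with the highest gain."""
    n = len(samples)
    base = info([c for _, c in samples])
    values = sorted({v for v, _ in samples})
    best = None
    # The largest value is skipped: it would leave the '> Z' side empty.
    for z in values[:-1]:
        left = [c for v, c in samples if v <= z]
        right = [c for v, c in samples if v > z]
        split_info = len(left) / n * info(left) + len(right) / n * info(right)
        g = base - split_info
        if best is None or g > best[1]:
            best = (z, g)
    return best

z, g = best_threshold(T1)
print(z, round(g, 3))
```

With m distinct values there are only m-1 candidate splits, so the exhaustive scan stays cheap even though the attribute is continuous.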


Figure 1.5 Final decision tree for database T given in Table 1.1

Optionally, a decision tree can also be represented as executable code (or pseudocode) with if-then constructs for branching into a tree structure. The final decision tree from the example above is given in pseudocode in Figure 1.6.

Figure 1.6 Decision tree in pseudocode form for database T (Table 1.1)

1.2.1.3 Bayesian classification

Bayesian classification is a statistical classification method that predicts class membership probabilities. It offers high accuracy and speed when applied to large databases. Naive Bayes is a simple Bayesian classification method. It assumes that the effect of an attribute value on a given class is independent of the values of the other attributes; this assumption is called class conditional independence.

Bayes' theorem

Let X be a data sample whose class is unknown, and let H be the hypothesis that X belongs to class C. The classification problem is to determine P(H|X), the probability that hypothesis H holds given the sample X. This is the posterior probability of H conditioned on X. Bayes' formula is:

$P(H|X) = P(X|H)\,P(H) / P(X)$

where P(X|H) is the posterior probability of X conditioned on H, P(H) is the prior probability of H, and P(X) is the prior probability of X.

Naive Bayes classification

1. Each data sample is represented by a vector $X = (x_1, \ldots, x_n)$ describing n measurements on the n attributes $A_1, \ldots, A_n$.

2. Suppose there are m classes $C_1, \ldots, C_m$. Given a sample X of unknown class, the classifier predicts that X belongs to the class $C_i$ with the highest posterior probability conditioned on X, that is,

$X \in C_i \iff P(C_i|X) > P(C_j|X) \quad \text{for } 1 \le j \le m,\ j \ne i$ (1.1)

By Bayes' formula, $P(C_i|X) = P(X|C_i)\,P(C_i) / P(X)$. The class $C_i$ that maximizes this probability is called the maximum a posteriori hypothesis.

3. Because P(X) is the same for all classes, it suffices to maximize

$P(X|C_i)\,P(C_i)$ (1.2)

If the prior probabilities are unknown and are assumed equal, $P(C_1) = P(C_2) = \ldots = P(C_m)$, it suffices to find the $C_i$ that maximizes $P(X|C_i)$.

4. If the data has many attributes, computing $P(X|C_i)$ can be very expensive. Under the assumption of class conditional independence of the attributes, it can be computed as

$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)$

where $P(x_k|C_i)$ is computed as follows: if $A_k$ is a categorical attribute, then $P(x_k|C_i) = s_{ik}/s_i$, where $s_{ik}$ is the number of training samples of class $C_i$ that have value $x_k$ for $A_k$, and $s_i$ is the number of training samples belonging to class $C_i$.

5. To classify a sample X of unknown class, compute $P(X|C_i)\,P(C_i)$ for each class $C_i$. X is assigned to class $C_i$ if and only if

$P(X|C_i)\,P(C_i) = \max_j P(X|C_j)\,P(C_j)$
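Steps 1 to 5 can be sketched in a few lines of Python for categorical attributes. This is a minimal illustration under stated assumptions: the toy data set (two attributes, two classes) and all function names are hypothetical, and no smoothing is applied, so an attribute value never seen with a class gives that class a probability of zero.

```python
from collections import Counter, defaultdict

def train(samples, labels):
    """Collect the counts s_i (per class) and s_ik (per class, attribute, value)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    for x, c in zip(samples, labels):
        for k, value in enumerate(x):
            value_counts[(c, k)][value] += 1
    return class_counts, value_counts

def classify(x, class_counts, value_counts):
    """Return the class C_i that maximizes P(X|C_i) * P(C_i)."""
    n = sum(class_counts.values())

    def score(c):
        p = class_counts[c] / n                                 # P(C_i)
        for k, value in enumerate(x):
            p *= value_counts[(c, k)][value] / class_counts[c]  # s_ik / s_i
        return p

    return max(class_counts, key=score)

# Hypothetical training data: (outlook, windy) -> class label
samples = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"),
           ("rain", "yes"), ("sunny", "no")]
labels = ["yes", "no", "yes", "no", "yes"]

class_counts, value_counts = train(samples, labels)
prediction = classify(("rain", "no"), class_counts, value_counts)
print(prediction)
```

For the query ("rain", "no"), the class "yes" scores 3/5 · 1/3 · 3/3 = 0.2 and the class "no" scores 2/5 · 1/2 · 0/2 = 0, so "yes" is predicted.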

1.2.2 Regression

Regression is a statistical data analysis technique for building predictive models for targets with continuous values. The technique automatically determines a mathematical formula that minimizes some measure of the error between the prediction of the regression model and the actual data.

The simplest form of a regression model contains one dependent variable (also called the output variable, endogenous variable, or Y variable) and a single independent variable (also called the factor, exogenous variable, or X variable). Examples are the dependence of blood pressure Y on age X, or the dependence of weight Y on daily food intake. This dependence is called the regression of Y on X.

Regression produces predictive models. The difference between regression and classification is that regression deals with target attributes that are numeric or continuous, whereas classification deals with target attributes that are discrete or categorical. In other words, if the target attribute contains continuous values, a regression technique is required. If the target attribute contains categorical values (character strings or discrete integers), a classification technique is used.

The most common form of regression is linear regression, in which the line that best fits the data is computed, that is, the line that minimizes the average distance of all the points from the line. This line becomes the predictive model when the value of the dependent variable is unknown; its value is predicted by the point on the line that corresponds to the values of the independent variables for that record.

Figure 1.7 Linear regression

Some concepts:

The random variables $X_1, \ldots, X_k$ (the predictor variables) and Y (the dependent variable).
$X_i$ has domain $\mathrm{dom}(X_i)$, and Y has domain $\mathrm{dom}(Y)$.
P is a probability distribution on $\mathrm{dom}(X_1) \times \ldots \times \mathrm{dom}(X_k) \times \mathrm{dom}(Y)$.
The training database D is a random sample from P.
A predictor is a function $d: \mathrm{dom}(X_1) \times \ldots \times \mathrm{dom}(X_k) \to \mathrm{dom}(Y)$.

If Y is numeric, the problem is a regression problem. Y is called the dependent variable, and d is called the regression function.

Let r be a random record drawn from P. The mean squared error rate of d is defined as:

$RT(d, P) = E\left[(r.Y - d(r.X_1, \ldots, r.X_k))^2\right]$

Problem definition: given a data set D that is a random sample from the probability distribution P, find a regression function d such that RT(d, P) is minimal.
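For a single predictor variable, the regression function d can be fitted by ordinary least squares, which minimizes the mean squared error on the training sample D (the empirical counterpart of RT(d, P)). The sketch below is only an illustration: the age and blood-pressure numbers are invented, and the function names are assumptions.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_squared_error(d, xs, ys):
    """Empirical mean of (y - d(x))^2 over the sample."""
    return sum((y - d(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical sample: age X and blood pressure Y
ages = [20, 30, 40, 50, 60]
pressures = [116, 119, 126, 129, 135]

slope, intercept = fit_line(ages, pressures)

def d(x):
    """Fitted regression function."""
    return intercept + slope * x

mse_fit = mean_squared_error(d, ages, pressures)
mse_other = mean_squared_error(lambda x: 100 + 0.6 * x, ages, pressures)
print(slope, intercept, mse_fit, mse_other)
```

Any other line, such as the arbitrary one used for `mse_other`, gives a mean squared error on this sample that is at least as large as that of the fitted line.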

The SVM algorithm for regression

Support Vector Machine (SVM) builds both classification and regression models. SVM is a classification and regression prediction tool that uses machine learning theory to maximize predictive accuracy while automatically avoiding over-fitting to the data. Neural networks and radial basis functions (RBFs), two popular mining techniques, can be viewed as special cases of SVMs.

SVMs perform well in real-world applications such as text classification, handwritten character recognition, and image classification. Their introduction in the early 1990s led to an explosion of applications and in-depth theoretical analysis that established SVM, together with neural networks, as one of the standard tools for machine learning and data mining. [11]

There is no upper limit on the number of attributes or on the target cardinality for SVMs. Details on data preparation and on the choice of settings for SVM are given in [13].


1.2.3 Clustering


Clustering is a useful technique for exploring data. It is particularly useful when there are many cases and no obvious natural groupings. Here, clustering algorithms can be used to find whatever natural groupings may exist.

Cluster analysis identifies the clusters in the data. A cluster is a collection of data objects that are similar in some sense to one another and dissimilar to other objects. A good clustering method produces high-quality clusters, ensuring that the similarity between different clusters is low and the similarity within a cluster is high; in other words, the members of a cluster are more like each other than they are like the members of another cluster.

Clustering can also serve as an effective data preprocessing step that identifies homogeneous groups on which to build predictive models. Clustering models differ from predictive models in that the outcome of the process is not guided by a known result, that is, there is no target attribute. Predictive models predict the values of a target attribute, and an error rate between the target and predicted values can be computed to guide model building. Clustering models uncover natural groupings in the data. The model can then be used to assign cluster labels (cluster IDs) to data points.

For example, for a data set with 2 attributes, AGE and HEIGHT, the following rule describes most of the data assigned to cluster 10:

If AGE >= 25 and AGE <= 40 and HEIGHT >= 5.0ft and HEIGHT <= 5.5ft then CLUSTER = 10

The input to a cluster analysis can be described as an ordered pair (X, s) or (X, d), where X is a set of descriptions of the samples, and s and d are measures of similarity and dissimilarity (distance) between samples, respectively. The output of the clustering system is a partition $\pi = \{G_1, G_2, \ldots, G_N\}$, where each $G_k$, k = 1, ..., N, is a crisp subset of X such that:

$G_1 \cup G_2 \cup \ldots \cup G_N = X$, and $G_i \cap G_j = \emptyset$ for $i \ne j$

The members $G_1, G_2, \ldots, G_N$ of $\pi$ are called clusters. Each cluster can be described by some characteristics. In discovery-based clustering, both the clusters (distinct sets of points in X) and their descriptions or characteristics are generated as the result of a clustering procedure.

Most clustering algorithms are based on the following 2 approaches:

1. Iterative square-error partitional clustering (partitioning)
2. Hierarchical clustering

1.2.3.1 Partitioning methods

Given a database of N objects, a partitioning method constructs K partitions (K <= N), where each partition represents a cluster. That is, the data is divided into K groups that satisfy:

Each group must contain at least 1 object.
Each object belongs to exactly 1 group.

The simplest technique is to divide the N objects into K subsets as an initial partition, and then iterate the partitioning step by moving objects from one group to another, evaluating each move by how close the objects in a group are to one another. In the worst case, this process may require examining every possible division of the N objects into K subsets. Applications can use one of two algorithms: K-means, in which each cluster is represented by the mean value of the objects in the cluster, or K-medoid, in which each cluster is represented by one of the objects located near the center of the cluster.

Centroid-based technique: the k-means algorithm

The K-means algorithm partitions the data into K clusters (K is a parameter) as follows:

Randomly select K objects; each is taken as the initial mean, or center, of a cluster.
Assign each remaining object to the cluster whose center is nearest, based on the distance between the object and the center.
Recompute the mean of each cluster.
Repeat the steps above until convergence is reached.

The squared-error criterion is:

$E = \sum_{i=1}^{K} \sum_{p \in C_i} |p - m_i|^2$

where E is the sum of the squared errors of all objects in the database, p is the point in space representing an object, and $m_i$ is the mean of cluster $C_i$. This criterion leads to K clusters that are compact and well separated. The algorithm converges toward K partitions for which the squared-error function is minimal (LMS).

The algorithm can be applied only when the mean of a cluster can be defined. It therefore cannot be applied to nominal (character) data.

Figure 1.8 Clustering with the k-means method (points marked + are the cluster centers)
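The k-means steps and the squared-error criterion E can be sketched as follows. This is an illustrative sketch, not the implementation used in the thesis: the sample points are invented, and the optional `initial` argument is an assumption added so that the run is reproducible; when it is omitted, K objects are chosen at random as in the description above.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance |p - q|^2."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(cluster):
    """Mean point of a non-empty cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def kmeans(points, k, initial=None, max_iter=100, seed=0):
    centers = list(initial) if initial else random.Random(seed).sample(points, k)
    for _ in range(max_iter):
        # Assignment step: each object joins the cluster with the nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: recompute the mean of each cluster (keep empty ones as is).
        new_centers = [centroid(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:   # convergence: the means no longer move
            break
        centers = new_centers
    return centers, clusters

def squared_error(centers, clusters):
    """Criterion E: sum of |p - m_i|^2 over all clusters C_i and points p in C_i."""
    return sum(sq_dist(p, m) for m, c in zip(centers, clusters) for p in c)

# Hypothetical 2-D data with two well-separated groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, k=2, initial=[(1, 1), (8, 8)])
E = squared_error(centers, clusters)
print(centers, E)
```

Because the update step averages coordinates, the sketch makes visible why the method needs numeric attributes: `centroid` is undefined for nominal values.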

Extended variants of K-means: K-medoid extends partitioning to attributes with nominal values by using a technique based on frequency of occurrence.

1.2.3.2 Hierarchical methods

Hierarchical methods work by grouping the data objects into a tree. Hierarchical methods can be classified as agglomerative or divisive:

Agglomerative hierarchical clustering: this bottom-up strategy starts by placing each object in its own cluster and then merges clusters into larger and larger clusters, until all objects are in a single cluster or some termination condition is satisfied.

Divisive hierarchical clustering: this top-down strategy is the reverse of the bottom-up strategy. It starts with all objects in a single cluster and then subdivides it into smaller and smaller clusters, until each cluster contains a single object or some termination condition is satisfied.

In these algorithms the termination condition must be specified.

Figure 1.9 Agglomerative and divisive hierarchical clustering


1.2.4 Association rule mining

Association rules are one of the major techniques of data mining and are perhaps the most common form of local-pattern discovery in unsupervised learning systems.

1.2.4.1 Market basket analysis

A market basket is a collection of items purchased by a customer in a single transaction. Retailers accumulate transactions by recording their business activity (sales) over a long period of time. A common analysis performed on a transaction database is to find the sets of items that appear together in many transactions. Knowledge of these patterns can be used to improve the placement of items in a store, or to rearrange the layout of mail-order catalog pages and Web pages. An itemset containing i items is called an i-itemset. The percentage of transactions that contain an itemset is called the support of the itemset. For an itemset to be of interest, its support must be higher than a minimum value specified by the user. Such itemsets are called frequent.

Finding the frequent itemsets is not simple. First, the number of customer transactions can be very large and usually does not fit in the memory of a computer. Second, the potential number of frequent itemsets is exponential in the number of distinct items, although the actual number of frequent itemsets can be much smaller. The algorithms are therefore required to be scalable (their complexity should grow linearly, not exponentially, with the number of transactions) and to examine as few infrequent itemsets as possible.

Problem description:

From a database of sales transactions, find the important associations among the items: the presence of some items in a transaction implies the presence of other items in the same transaction. Let $I = \{i_1, i_2, \ldots, i_m\}$ be the set of items. DB is a set of transactions, where each transaction T is a set of items such that $T \subseteq I$. The quantity of each item bought in a transaction is not considered here, meaning that each item is a binary variable indicating whether or not the item was bought. Each transaction is associated with an identifier, called the transaction identifier or TID. An example of such a transaction database is given in Table 1.2.

Table 1.2 Model of a simple transaction database

Database DB:
TID    Items
001    A C D
002    B C E
003    A B C E
004    B E

Let X be a set of items. A transaction T is said to contain X if and only if $X \subseteq T$. An association rule is an implication of the form $X \Rightarrow Y$, where $X \subset I$, $Y \subset I$, and $X \cap Y = \emptyset$. The rule $X \Rightarrow Y$ holds in the transaction set DB with confidence c if c% of the transactions in DB that contain X also contain Y. The rule $X \Rightarrow Y$ has support s in the transaction set DB if s% of the transactions in DB contain $X \cup Y$. Confidence expresses the strength of the implication, and support expresses the frequency of the patterns occurring in the rule. Usually only rules with reasonably large support are of interest. Rules with high confidence and strong support are referred to as strong rules. The task of association rule mining is to discover the strong association rules in large databases. The problem can be decomposed into 2 phases:

1. Discover the large itemsets, that is, the sets of items whose transaction support is above a predetermined minimum threshold.
2. Use the large itemsets to generate the association rules for the database whose confidence c is above a predetermined minimum threshold.

The overall performance of association rule mining is determined mainly by the first step. Once the large itemsets have been identified, the corresponding association rules can be derived in a straightforward manner. Efficient counting of the large itemsets is the focus of most mining algorithms.

1.2.4.2 The Apriori algorithm

The Apriori algorithm computes the frequent itemsets in a database over several iterations. Iteration i computes all frequent i-itemsets (itemsets with i elements). Each iteration has 2 steps: b1) candidate generation, and b2) candidate counting and selection.

In the first phase of the first iteration, the generated set of candidate itemsets contains all 1-itemsets (that is, all items in the database). In the counting phase, the algorithm counts their support by scanning the entire database. Finally, only the 1-itemsets with s above the required threshold are selected as frequent. Thus, after the first iteration, all frequent 1-itemsets are known.

The algorithm continues with iteration 2. How are the 2-itemset candidates generated? In principle, all pairs of items are candidates. Based on its knowledge of the infrequent itemsets obtained in previous iterations, the Apriori algorithm reduces the set of candidate itemsets by pruning the candidates that cannot be frequent. The pruning is based on the observation that if an itemset is frequent, then all of its subsets are frequent as well. Therefore, before entering the candidate counting step, the algorithm discards every candidate itemset that has an infrequent subset.

Consider the database in Table 1.2. Assume that the minimum support is s = 50%, so an itemset is frequent if it is contained in at least 50% of the transactions. In each iteration, the Apriori algorithm constructs a candidate set of large itemsets, counts the number of occurrences of each candidate, and then determines the large itemsets based on the predetermined minimum support s = 50%.

In the first step of the first iteration, all single items are candidates. Apriori simply scans all transactions in database DB and generates the list of candidates. In the next step, the algorithm counts the occurrences of each candidate and, based on the threshold s, selects the frequent itemsets. All of these steps are shown in Figure 1.10. Five 1-itemsets are generated in $C_1$, and only four of them are selected as large in $L_1$ (those with s ≥ 50%).
1. Generate phase    2. Count phase                3. Select phase
1-itemsets C1        1-itemsets  Count  s (%)      Large 1-itemsets L1  Count  s (%)
{A}                  {A}         2      50         {A}                  2      50
{C}                  {C}         3      75         {C}                  3      75
{D}                  {D}         1      25         {B}                  3      75
{B}                  {B}         3      75         {E}                  3      75
{E}                  {E}         3      75

Figure 1.10 First iteration of the Apriori algorithm for database DB

To discover the set of large 2-itemsets, and because any subset of a large itemset must also have minimum support, the Apriori algorithm uses $L_1 * L_1$ to generate the candidates. The operation * is defined in general as

$L_k * L_k = \{X \cup Y \text{ where } X, Y \in L_k,\ |X \cap Y| = k - 1\}$

For k = 1 the operation is a simple concatenation. Therefore $C_2$ consists of the $|L_1| \cdot (|L_1| - 1)/2$ 2-itemsets generated by the operation, which serve as the candidates in iteration 2. In our example, this number is 4 · 3/2 = 6. Scanning database DB with this list, the algorithm counts the support of each candidate and finally selects the large 2-itemsets $L_2$ with s ≥ 50%. All of these steps and the corresponding results of iteration 2 are given in Figure 1.11.
1. Generate phase    2. Count phase                3. Select phase
2-itemsets C2        2-itemsets  Count  s (%)      Large 2-itemsets L2  Count  s (%)
{A, B}               {A, B}      1      25         {A, C}               2      50
{A, C}               {A, C}      2      50         {B, C}               2      50
{A, E}               {A, E}      1      25         {B, E}               3      75
{B, C}               {B, C}      2      50         {C, E}               2      50
{B, E}               {B, E}      3      75
{C, E}               {C, E}      2      50

Figure 1.11 Second iteration of the Apriori algorithm for database DB

The set of candidate itemsets $C_3$ is generated from $L_2$ using the previously defined operation $L_2 * L_2$. In practice, two large 2-itemsets of $L_2$ with the same first item, such as {B, C} and {B, E}, are identified first. Apriori then checks whether the 2-itemset {C, E}, which consists of the second items of {B, C} and {B, E}, is a large 2-itemset or not. Because {C, E} is a large itemset, all subsets of {B, C, E} are large, and {B, C, E} therefore becomes a candidate 3-itemset. There is no other candidate 3-itemset from $L_2$ in database DB. Apriori then scans all transactions and discovers the large 3-itemsets $L_3$, as shown in Figure 1.12.
1. Generate phase    2. Count phase                3. Select phase
3-itemsets C3        3-itemsets  Count  s (%)      Large 3-itemsets L3  Count  s (%)
{B, C, E}            {B, C, E}   2      50         {B, C, E}            2      50

Figure 1.12 Third iteration of the Apriori algorithm for database DB
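The three iterations shown in Figures 1.10 to 1.12 can be reproduced with a compact Python sketch of the level-wise procedure. It is a simplified illustration, not the thesis implementation: the function name `apriori` and the dictionary layout of the result are assumptions, and support is counted by a plain scan of an in-memory list.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {k: {itemset: support fraction}} for all large k-itemsets."""
    n = len(transactions)

    def select_large(candidates):
        # Count phase followed by select phase.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}

    items = set().union(*transactions)
    level = select_large({frozenset([i]) for i in items})
    result, k = {}, 1
    while level:
        result[k] = level
        k += 1
        # Generate phase: L_{k-1} * L_{k-1}, unions that have exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be large.
        candidates = {c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = select_large(candidates)
    return result

# Database DB from Table 1.2, minimum support s = 50%
DB = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"), frozenset("BE")]
large_itemsets = apriori(DB, 0.5)
for k, level in large_itemsets.items():
    print(k, sorted("".join(sorted(s)) for s in level))
```

On DB the sketch stops after the third level, because no candidate 4-itemset can be formed from the single large 3-itemset.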

In the example, because no candidate 4-itemset can be formed from $L_3$, Apriori ends the iterative process.

Apriori counts not only the support of all frequent itemsets, but also the support of those infrequent candidate itemsets that could not be eliminated during the pruning phase. The set of all candidate itemsets that are infrequent but whose support is counted by Apriori is called the negative border. Thus, an itemset is in the negative border if it is infrequent but all of its subsets are frequent. In the example, Figures 1.10 and 1.11 show that the negative border consists of the itemsets {D}, {A, B}, and {A, E}. The negative border is especially important for some improvements of the Apriori algorithm, such as increasing the efficiency of generating large itemsets or deriving negative association rules.

1.2.4.3 From frequent itemsets to association rules

The second phase of association rule mining is based on all frequent i-itemsets found in the first phase using Apriori or a similar algorithm, and it is relatively simple and straightforward. For a rule of the form $\{x_1, x_2, x_3\} \Rightarrow x_4$, both the itemset $\{x_1, x_2, x_3, x_4\}$ and the itemset $\{x_1, x_2, x_3\}$ must be frequent. The confidence of the rule is then $c = s(x_1, x_2, x_3, x_4) / s(x_1, x_2, x_3)$. Strong association rules are the rules whose confidence c is greater than a given threshold a.

For database DB in Table 1.2, to check whether the association rule $\{B, C\} \Rightarrow E$ is a strong rule, first take the corresponding supports from $L_2$ and $L_3$:

s(B, C) = 2, s(B, C, E) = 2

and use these supports to compute the confidence of the rule:

$c(\{B, C\} \Rightarrow E) = s(B, C, E) / s(B, C) = 2/2 = 1$ (or 100%)

Whatever threshold is chosen for strong association rules (for example, $c_T$ = 0.8 or 80%), this rule passes because its confidence is maximal: if a transaction contains items B and C, it also contains item E. Other rules are also possible for database DB, such as $A \Rightarrow C$, because $c(A \Rightarrow C) = s(A, C)/s(A) = 1$ and both itemsets {A} and {A, C} are frequent according to the Apriori algorithm. In this phase, therefore, it is necessary to analyze systematically all association rules that can be generated from the frequent itemsets and to select the strong association rules whose confidence is above the given threshold.

However, not all strong association rules that are discovered (that is, rules that pass the requirements on support s and confidence c) are meaningful to use. For example, consider the following case of mining the results of a survey in a school with 5000 students. A retailer of breakfast cereal surveys the activities that the students engage in every morning. The data shows that 60% of the students (3000 students) play basketball, 75% of the students (3750 students) eat cereal, and 40% (2000 students) both play basketball and eat cereal. Suppose that a data mining program for association rules is run with the following settings: the minimum support is 2000 (s = 0.4) and the minimum confidence is 60% (c = 0.6). The following association rule is produced: "(play basketball) ⇒ (eat cereal)", because this rule meets the minimum student support and its confidence, c = 2000/3000 = 0.66, is larger than the threshold value. However, this association rule is misleading, because the overall percentage of students who eat cereal is 75%, which is larger than 66%. That is, playing basketball and eating cereal are in fact not positively associated; they are negatively correlated. Without fully understanding this aspect, one may make wrong business decisions, or wrong scientific or technical decisions, from the association rules obtained.

To filter out such misleading associations, an association rule $A \Rightarrow B$ can be defined as interesting if its confidence exceeds a certain measure. The simple argument used in the example above suggests that the right heuristic for measuring association should be:

$s(A, B) / s(A) - s(B) > d$

or alternatively:

$s(A, B) - s(A) \cdot s(B) > k$

where d or k are suitable constants. The expressions above essentially represent tests of statistical independence. Clearly, the statistical dependence among the analyzed itemsets has to be taken into account when deciding on the usefulness of association rules. In our simple example with the students, the discovered association rule fails this test:

$s(A, B) - s(A) \cdot s(B) = 0.4 - 0.6 \cdot 0.75 = -0.05 < 0$

Therefore, despite the high values of the parameters s and c, the rule is not interesting. In this case, it is misleading information.
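The support, confidence, and independence checks for the student example can be written out as a small Python sketch. The function names are assumptions; in particular `leverage` is a label chosen here for the quantity s(A, B) - s(A) · s(B), and the threshold k = 0 is assumed only for illustration.

```python
def confidence(s_ab, s_a):
    """c(A => B) = s(A, B) / s(A)."""
    return s_ab / s_a

def leverage(s_ab, s_a, s_b):
    """s(A, B) - s(A) * s(B); zero when A and B are statistically independent."""
    return s_ab - s_a * s_b

# Survey figures from the text, as fractions of the 5000 students
s_basketball, s_cereal, s_both = 0.60, 0.75, 0.40
min_support, min_confidence = 0.4, 0.6

c = confidence(s_both, s_basketball)
k_value = leverage(s_both, s_basketball, s_cereal)

is_strong = s_both >= min_support and c >= min_confidence
is_interesting = k_value > 0          # assumed threshold k = 0

print(f"confidence={c:.2f} strong={is_strong} "
      f"leverage={k_value:.2f} interesting={is_interesting}")
```

The rule passes both the support and confidence thresholds, yet the negative value of s(A, B) - s(A) · s(B) marks it as not interesting, which matches the conclusion above.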


CHAPTER 2. SOME DATA MINING ALGORITHMS

2.1. Association rule mining algorithms

2.1.1 The Apriori algorithm
This algorithm proceeds level by level:

1. Let s be the support threshold. The first pass finds the items that appear in at least a fraction s of the baskets. This gives the set $L_1$ of frequent items.
2. The candidates $C_2$ for the second pass are the pairs of items in $L_1$. The pairs in $C_2$ whose count is >= s are the frequent pairs $L_2$.
3. The candidate triples $C_3$ are the sets {A, B, C} such that {A, B}, {A, C}, and {B, C} are all in $L_2$. In the third pass, the candidate triples in $C_3$ whose number of occurrences is >= s are the frequent triples $L_3$.
4. The process continues until the sets become empty. $L_i$ is the set of frequent itemsets of size i; $C_{i+1}$ is the set of itemsets of size i+1 such that every subset of size i is in $L_i$.

Figure 2.1 The Apriori algorithm

2.1.1.1 Apriori candidate generation

The apriori-gen function takes as its argument $L_{k-1}$, the set of all large (k-1)-itemsets. It returns a superset of the set of all large k-itemsets. The function works as follows. First, in the join step, $L_{k-1}$ is joined with $L_{k-1}$:

    insert into C_k
    select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1};

Next, in the prune step, all itemsets $c \in C_k$ for which some (k-1)-subset of c is not in $L_{k-1}$ are deleted:

    forall itemsets c ∈ C_k do
        forall (k-1)-subsets s of c do
            if (s ∉ L_{k-1}) then
                delete c from C_k;

Example: let $L_3$ be { {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4} }. After the join step, $C_4$ is { {1 2 3 4}, {1 3 4 5} }. The prune step deletes the itemset {1 3 4 5} because the itemset {1 4 5} is not in $L_3$. Only {1 2 3 4} then remains in $C_4$.

Correctness: it must be shown that $C_k \supseteq L_k$. Clearly, any subset of a large itemset must also have minimum support. Hence, if each itemset in $L_{k-1}$ were extended with all possible items, and all itemsets whose (k-1)-subsets are not in $L_{k-1}$ were then deleted, a superset of the itemsets in $L_k$ would remain. The join is equivalent to extending $L_{k-1}$ with each item in the database and then deleting those itemsets for which the (k-1)-itemset obtained by deleting the (k-1)th item is not in $L_{k-1}$. The condition p.item_{k-1} < q.item_{k-1} simply ensures that no duplicates are generated. Thus, after the join step, $C_k \supseteq L_k$. By similar reasoning, the prune step, which deletes from $C_k$ all itemsets whose (k-1)-subsets are not in $L_{k-1}$, does not delete any itemset that could be in $L_k$.

2.1.1.2 The Subset function

The candidate itemsets $C_k$ are stored in a hash-tree. A node of the hash-tree contains either a list of itemsets (a leaf node) or a hash table (an interior node). In an interior node, each bucket of the hash table (the class of entries with the same hash value) points to another node. The root of the hash-tree is defined to be at depth 1. An interior node at depth d points to nodes at depth d+1. The itemsets are stored in the leaves. When an itemset c is added, we start from the root and go down the tree until a leaf is reached. At an interior node at depth d, the branch to follow is decided by applying a hash function to the d-th item of the itemset. All nodes are initially created as leaf nodes. When the number of itemsets in a leaf node exceeds a specified threshold, the leaf node is converted into an interior node.

Starting from the root node, the subset function finds all the candidates contained in a transaction t as follows. If we are at a leaf, we find which of the itemsets in the leaf are contained in t and add references to them to the answer set. If we are at an interior node and have reached it by hashing the item i, we hash on each item that comes after i in t and recursively apply this procedure to the node in the corresponding bucket. For the root node, we hash on every item in t.

To see why the subset function returns the desired set of references, consider what happens at the root node. For any itemset c contained in transaction t, the first item of c must be in t. At the root, by hashing on every item in t, we ensure that we ignore only the itemsets that start with an item not in t. Similar arguments apply at lower depths. Because the items in any itemset are ordered, if we reach the current node by hashing the item i, we only need to consider the items in t that occur after i.
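The join and prune steps of apriori-gen described in Section 2.1.1.1 can be sketched in Python, with itemsets represented as sorted tuples so that the condition on the last item mirrors the SQL-style join. The function name follows the text; the tuple representation is an assumption made for this sketch.

```python
from itertools import combinations

def apriori_gen(large_prev):
    """large_prev: set of sorted tuples, all of length k-1. Returns C_k."""
    large_prev = set(large_prev)
    # Join step: p and q agree on their first k-2 items and
    # p.item_{k-1} < q.item_{k-1}, which also prevents duplicates.
    joined = {p + (q[-1],)
              for p in large_prev for q in large_prev
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    # Prune step: delete c if some (k-1)-subset of c is not in L_{k-1}.
    return {c for c in joined
            if all(s in large_prev for s in combinations(c, len(c) - 1))}

# The example from the text
L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
C4 = apriori_gen(L3)
print(C4)
```

The join produces (1, 2, 3, 4) and (1, 3, 4, 5); the prune step removes the latter because (1, 4, 5) is not in L3, as in the worked example.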

2.1.2 The AprioriTid algorithm


The AprioriTid algorithm, shown in Figure 2.2, also uses the apriori-gen function to determine the candidate itemsets before a pass begins. The interesting feature of this algorithm is that the database D is not used for counting support after the first pass. The set $\bar{C}_k$ is used for this purpose instead. Each member of the set $\bar{C}_k$ is of the form <TID, {$X_k$}>, where each $X_k$ is a potentially large k-itemset present in the transaction with identifier TID. For k = 1, $\bar{C}_1$ corresponds to the database D, although conceptually each item i is replaced by the itemset {i}. For k > 1, $\bar{C}_k$ is generated by the algorithm (step 10). The member of $\bar{C}_k$ corresponding to transaction t is <t.TID, {$c \in C_k$ | c contained in t}>. If a transaction does not contain any candidate k-itemset, then $\bar{C}_k$ has no entry for this transaction. Thus, the number of entries in $\bar{C}_k$ may be smaller than the number of transactions in the database, especially for large values of k. In addition, for large values of k, each entry may be smaller than the corresponding transaction, because very few candidates may be contained in the transaction. However, for small values of k, each entry may be larger than the corresponding transaction, because an entry in $\bar{C}_k$ includes all candidate k-itemsets contained in the transaction.


Figure 2.2 The AprioriTid algorithm

Example: consider the database in Figure 2.3 and assume that the minimum support is 2 transactions. Calling apriori-gen with $L_1$ at step 4 gives the candidate itemsets $C_2$. In steps 6 through 10, the support of the candidates in $C_2$ is counted by iterating over the entries in $\bar{C}_1$, and $\bar{C}_2$ is generated. The first entry in $\bar{C}_1$ is { {1} {3} {4} }, corresponding to transaction 100. The set $C_t$ at step 7 corresponding to this entry t is { {1 3} }, because {1 3} is a member of $C_2$ and both ({1 3} - {1}) and ({1 3} - {3}) are members of t.set-of-itemsets. Calling apriori-gen with $L_2$ gives $C_3$. Making a pass over $\bar{C}_2$ with $C_3$ generates $\bar{C}_3$. Note that there is no entry in $\bar{C}_3$ for the transactions with TIDs 100 and 400, since they do not contain any of the itemsets in $C_3$. The candidate {2 3 5} in $C_3$ turns out to be large and is the only member of $L_3$. When $C_4$ is generated using $L_3$, the result is empty and the algorithm terminates.


Figure 2.3 Example
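The counting scheme of AprioriTid can be sketched in Python as follows. Several points are assumptions made for this illustration: the contents of the example database are not reproduced in the text here (only transaction 100 = {1, 3, 4}, the TIDs 100 and 400, and the result {2 3 5} are mentioned), so the four transactions below are an assumed reconstruction; the helper names are hypothetical; and a candidate is accepted for a transaction when all of its (k-1)-subsets are in the transaction's entry, a simplification of the two-subset test used in step 7.

```python
from itertools import combinations

def apriori_gen(large_prev):
    """Join and prune on a non-empty set of frozensets of size k-1."""
    k = len(next(iter(large_prev))) + 1
    joined = {a | b for a in large_prev for b in large_prev if len(a | b) == k}
    return {c for c in joined
            if all(frozenset(s) in large_prev for s in combinations(c, k - 1))}

def apriori_tid(database, min_count):
    """Return [L1, L2, ...]; the raw database is read only to build C̄_1."""
    c_bar = {tid: {frozenset([i]) for i in items} for tid, items in database.items()}
    counts = {}
    for itemsets in c_bar.values():
        for s in itemsets:
            counts[s] = counts.get(s, 0) + 1
    large = {s for s, n in counts.items() if n >= min_count}
    levels = []
    while large:
        levels.append(large)
        candidates = apriori_gen(large)
        k = len(levels) + 1
        counts, next_c_bar = dict.fromkeys(candidates, 0), {}
        for tid, itemsets in c_bar.items():      # pass over C̄_{k-1}, not the database
            c_t = {c for c in candidates
                   if all(frozenset(s) in itemsets for s in combinations(c, k - 1))}
            for c in c_t:
                counts[c] += 1
            if c_t:                              # transactions without candidates drop out
                next_c_bar[tid] = c_t
        c_bar = next_c_bar
        large = {c for c, n in counts.items() if n >= min_count}
    return levels

# Assumed example database; minimum support = 2 transactions
database = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
levels = apriori_tid(database, min_count=2)
print([sorted(sorted(s) for s in level) for level in levels])
```

With this assumed database the third level contains only {2 3 5}, and the entries for TIDs 100 and 400 disappear when the third set of entries is built, as described in the example.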

2.1.3 The AprioriHybrid algorithm


It is not necessary to use the same algorithm in all passes over the data. Figure 2.4 shows the execution times of Apriori and AprioriTid for the different passes over the data set T10.I4.D100K, of size 4.4 MB (T10.I4.D100K has T = 10, I = 4, D = 100K, where T is the average size of the transactions, I is the average size of the large itemsets, and D is the number of transactions). In the earlier passes, Apriori does better than AprioriTid. However, AprioriTid beats Apriori in the later passes. The reason is as follows: Apriori and AprioriTid use the same candidate generation procedure and therefore count the same itemsets. In the later passes, the number of candidate itemsets decreases. However, Apriori still examines every transaction in the database. In contrast, rather than scanning the entire database, AprioriTid scans $\bar{C}_k$ to obtain the support counts, and the size of $\bar{C}_k$ has become smaller than the size of the database. When the set $\bar{C}_k$ fits in memory, not even the cost of writing it to disk is incurred.

Figure 2.4: Per-pass execution times of Apriori and AprioriTid (T10.I4.D100K, minimum support = 0.75%)

Based on these observations, a hybrid algorithm was designed, called AprioriHybrid: it uses Apriori in the initial passes and switches to AprioriTid when it expects that the set $\bar{C}_k$ at the end of the pass will fit in memory. The following heuristic is used to estimate whether $\bar{C}_k$ would fit in memory in the next pass. At the end of the current pass, the counts of the candidates in $C_k$ are known. From these, the size that $\bar{C}_k$ would have had, if it had been generated, is estimated. This size, in words, is

$\sum_{\text{candidates } c \in C_k} \mathrm{support}(c) + \text{number of transactions}$

If $\bar{C}_k$ in this pass is small enough to fit in memory, and there are fewer large candidates in the current pass than in the previous pass, the algorithm switches to AprioriTid. The second condition is added to avoid switching when $\bar{C}_k$ in the current pass fits in memory but $\bar{C}_k$ in the next pass may not.

Switching from Apriori to AprioriTid does involve a cost. Assume that the decision to switch from Apriori to AprioriTid is made at the end of the k-th pass. In the (k+1)-th pass, after finding the candidate itemsets contained in a transaction, their IDs also have to be added to $\bar{C}_{k+1}$. Thus, an extra cost is incurred in this pass compared with simply running Apriori. Only in the (k+2)-th pass does AprioriTid actually start running. Thus, if there are no large (k+1)-itemsets, or no (k+2)-candidates, the cost of switching is incurred without any gain from using AprioriTid.

Comparing the performance of AprioriHybrid with that of Apriori and AprioriTid on the same data sets: AprioriHybrid performs better than Apriori in most cases. The exception is the case in which the switch occurs in the last pass; AprioriHybrid then incurs the cost of switching without realizing any benefit. In general, the advantage of AprioriHybrid over Apriori depends on how the size of the set $\bar{C}_k$ declines in the later passes. If $\bar{C}_k$ remains large until nearly the end and then drops abruptly, not much is gained by using AprioriHybrid, since AprioriTid can be used only for a short period of time after the switch (as with the data set T20.I6.D100K). Conversely, if the size of $\bar{C}_k$ declines gradually, AprioriTid can be used for a while after the switch, and a significant improvement in execution time can be obtained.
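The switching heuristic can be expressed as a small Python sketch. The function names, the `memory_words` parameter, and the memory budget of 100 words are hypothetical; the candidate counts are those of $C_2$ and $C_3$ for database DB in Figures 1.11 and 1.12, reused here only as sample input.

```python
def estimated_c_bar_size(candidate_supports, n_transactions):
    """Estimated size of C̄_k in words: sum of support(c) plus number of transactions."""
    return sum(candidate_supports.values()) + n_transactions

def should_switch(candidate_supports, n_transactions, memory_words,
                  n_large_now, n_large_prev):
    """Switch to AprioriTid if C̄_k would fit in memory and the number of
    large candidates is lower than in the previous pass."""
    fits = estimated_c_bar_size(candidate_supports, n_transactions) <= memory_words
    return fits and n_large_now < n_large_prev

# Support counts of the candidates in pass 2 and pass 3 for database DB
c2_counts = {"AB": 1, "AC": 2, "AE": 1, "BC": 2, "BE": 3, "CE": 2}
c3_counts = {"BCE": 2}

size_pass2 = estimated_c_bar_size(c2_counts, n_transactions=4)
switch_after_pass2 = should_switch(c2_counts, 4, memory_words=100,
                                   n_large_now=4, n_large_prev=4)
switch_after_pass3 = should_switch(c3_counts, 4, memory_words=100,
                                   n_large_now=1, n_large_prev=4)
print(size_pass2, switch_after_pass2, switch_after_pass3)
```

After pass 2 the estimate fits in the assumed budget, but the number of large itemsets has not decreased (4 in both passes), so the second condition blocks the switch; after pass 3 both conditions hold.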


2.2. Improving the efficiency of the Apriori algorithm


Because the amount of data processed in mining frequent itemsets tends to be very large, it is important to devise efficient algorithms for mining such data. The basic Apriori algorithm scans the database several times, depending on the size of the largest frequent itemset. Several refinements have been proposed that focus on reducing the number of database scans, the number of candidate itemsets counted in each scan, or both.

Partition-based Apriori is an algorithm that requires only 2 scans of the transaction database. The database is divided into disjoint partitions, each small enough to fit in the available memory. In the first scan, the algorithm reads each partition and computes the locally frequent itemsets in each partition. In the second scan, the algorithm counts the support of all locally frequent itemsets against the entire database. If an itemset is frequent with respect to the entire database, it must be frequent in at least one partition. This is the heuristic used in the algorithm. Therefore, the second scan over the database counts a superset of all potentially frequent itemsets.

Sampling: when the size of the database is very large, sampling becomes an attractive approach to data mining. A sampling-based algorithm typically requires 2 scans of the database. The algorithm first takes a sample from the database and generates a set of candidate itemsets that are very likely to be frequent in the entire database. In a subsequent scan over the database, the algorithm counts the exact support of these itemsets and the support of their negative border. If no itemset in the negative border is frequent, then the algorithm has discovered all frequent itemsets. Otherwise, some superset of an itemset in the negative border may be frequent, but its support has not yet been counted. The algorithm then generates and counts all such potentially frequent itemsets in subsequent scans over the database.
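The two-scan idea of the partition-based approach described above can be sketched in Python. To keep the sketch self-contained, the locally frequent itemsets of each partition are found by brute-force enumeration, which stands in for running Apriori on the partition; the function names and the equal-sized split are assumptions.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_frac):
    """Brute-force miner, a stand-in for Apriori on data that fits in memory."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = set()
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            itemset = frozenset(combo)
            if sum(itemset <= t for t in transactions) >= min_frac * n:
                result.add(itemset)
    return result

def partition_mine(db, min_frac, n_parts):
    size = -(-len(db) // n_parts)                      # ceiling division
    parts = [db[i:i + size] for i in range(0, len(db), size)]
    # Scan 1: locally frequent itemsets of every partition form the candidates.
    candidates = set().union(*(frequent_itemsets(p, min_frac) for p in parts))
    # Scan 2: count the candidates against the whole database.
    return {c for c in candidates
            if sum(c <= t for t in db) >= min_frac * len(db)}

# Database DB from Table 1.2, split into 2 partitions, minimum support 50%
DB = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"), frozenset("BE")]
found = partition_mine(DB, 0.5, n_parts=2)
print(sorted("".join(sorted(s)) for s in found))
```

Because every globally frequent itemset is locally frequent in at least one partition, the result of the two scans coincides with mining the whole database directly.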

Incremental updating: finding the frequent itemsets in large databases is very costly, so incremental updating techniques need to be developed to maintain the discovered frequent itemsets (and the corresponding association rules) with the aim of avoiding mining the whole database again. Updates on the database may not only invalidate some existing frequent itemsets but also turn some new itemsets into frequent ones. The problem of maintaining previously discovered frequent itemsets in large and dynamic databases is not simple. The idea is to reuse the information of the old frequent itemsets and to integrate the support information of the new frequent itemsets, in order to substantially reduce the number of candidates to be re-examined.

Generalized association rule mining: in many applications, associations among data items often occur at a relatively high concept level. For example, a hierarchy of food components is shown in Figure 2.5, where M (milk) and B (bread), as concepts in the hierarchy, may have several sub-components. The lowest-level components in the hierarchy are types of milk and bread. It may be hard to find regularities at the primitive concept level, such as between chocolate milk and wheat bread. But it is easy to find regularities at a high concept level, such as: more than 80% of the customers who buy milk also buy bread.

Figure 2.5: An example of a concept hierarchy for mining multiple-level frequent itemsets

Therefore, mining frequent itemsets at a generalized abstraction level or at multiple concept levels is very important.

Because the amount of data processed in association rule mining tends to be very large, it is important to devise efficient algorithms for mining such data. This section presents several algorithms that improve on Apriori.

2.2.2 The FP-tree method


Frequent pattern growth (FP-growth) is an efficient method for mining frequent itemsets in large databases. The algorithm mines frequent itemsets without the time-consuming candidate generation process that is essential to Apriori. When the database is large, FP-growth first performs a database projection of the frequent items; it then switches to main-memory mining by building a compact data structure called the FP-tree. [9]

To explain the algorithm, we use the transaction database in Table 2.1 and choose a minimum support threshold of 3.

Table 2.1 Transaction database T

First, a scan of database T derives the list L of frequent items, those occurring 3 or more times in the database. These items are (with their supports):

L = {(f, 4), (c, 4), (a, 3), (b, 3), (m, 3), (p, 3)}

The items are listed in descending order of frequency. This order is important because every path of the FP-tree follows it.

Second, the root of the tree, labeled ROOT, is created. Database T is scanned a second time. The scan of the first transaction leads to the construction of the first branch of the FP-tree: {(f, 1), (c, 1), (a, 1), (m, 1), (p, 1)}. Only the items that are in the frequent-item list L are selected. The indices of the nodes in the branch (all 1) represent the cumulative number of samples at that node in the tree, and after the first sample they are naturally all 1. The order of the nodes is not the order in the sample but the order in the frequent-item list L. The second transaction shares the items f, c, and a, so it shares the prefix {f, c, a} with the previous branch and extends to the new branch {(f, 2), (c, 2), (a, 2), (b, 1), (m, 1)}, increasing the indices of the common prefix by 1. The new intermediate version of the FP-tree, after two samples from the database, is given in Figure 2.6.1. The remaining transactions can be inserted similarly, and the final FP-tree is shown in Figure 2.6.2.

Figure 2.6: FP-tree for database T in Table 2.1
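The tree construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: Table 2.1 is not reproduced in the text, so the five transactions below are an assumed data set chosen to give exactly the supports listed in L; the class and function names are hypothetical, and the item order is taken directly from L rather than recomputed.

```python
from collections import Counter

# Assumed transactions (one character per item); they reproduce the list L.
TRANSACTIONS = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]
L_ORDER = ["f", "c", "a", "b", "m", "p"]  # order of list L from the text


class Node:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, order):
    """Insert each transaction, restricted to frequent items and sorted as in L."""
    rank = {item: r for r, item in enumerate(order)}
    root = Node(None)
    header = {item: [] for item in order}  # item -> its nodes (the node-links)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1  # shared prefixes accumulate counts
            node = child
    return root, header


def path_to_root(node):
    """Return the (item, count) pairs from the root down to this node."""
    path = []
    while node.item is not None:
        path.append((node.item, node.count))
        node = node.parent
    return path[::-1]


supports = Counter(i for t in TRANSACTIONS for i in t)
root, header = build_fp_tree(TRANSACTIONS, L_ORDER)
p_paths = [path_to_root(n) for n in header["p"]]
```

With this data, `p_paths` holds the two paths through item p that the mining step follows via the node-links.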

To facilitate tree traversal, an item header table is built, in which each item in list L points to its nodes in the FP-tree through node-links. All f nodes are linked in one list, all c nodes in another, and so on. For simplicity, only the list for the b nodes is shown in Figure 2.6.2. Using this compact tree structure, the FP-growth algorithm mines the complete set of frequent itemsets.

According to the list L of frequent items, the complete set of frequent itemsets can be divided into non-overlapping subsets (six in this example): 1) frequent itemsets having item p (the end of list L); 2) itemsets having item m but not p; 3) frequent itemsets with b and without both m and p; ...; 6) large itemsets with only f. This classification is valid for our example, but the same principles can be applied to other databases and other lists L.

Based on the node-links, we collect all the transactions in which p participates by starting from p's entry in the header table and following p's node-links. In this example, two paths are selected in the FP-tree: {(f, 4), (c, 3), (a, 3), (m, 2), (p, 2)} and {(c, 1), (b, 1), (p, 1)}, in which the samples with frequent item p are {(f, 2), (c, 2), (a, 2), (m, 2), (p, 2)} and {(c, 1), (b, 1), (p, 1)}. The given threshold (3) is satisfied only by the frequent itemset {(c, 3), (p, 3)}, or simply {c, p}. All other itemsets with p are below the threshold.

The next subset of frequent itemsets consists of those with m and without p. The FP-tree yields the paths {(f, 4), (c, 3), (a, 3), (m, 2)} and {(f, 4), (c, 3), (a, 3), (b, 1), (m, 1)}, or the corresponding accumulated samples {(f, 2), (c, 2), (a, 2), (m, 2)} and {(f, 1), (c, 1), (a, 1), (b, 1), (m, 1)}. Analysis of the samples discovers the frequent itemset {(f, 3), (c, 3), (a, 3), (m, 3)} or, simply, {f, c, a, m}.

Repeating the same process for subsets 3 to 6 in this example, further frequent itemsets are mined. These are the itemsets {f, c, a} and {f, c}, but they are subsets of the frequent itemset {f, c, a, m}. Therefore, the final answer of the FP-growth method is the set of frequent itemsets {{c, p}, {f, c, a, m}}.

Experiments show that the FP-growth algorithm is faster than the Apriori algorithm. Several optimization techniques have been added to FP-growth, and as a result there are a number of different versions for mining sequences and patterns with constraints.
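The mining step for items p and m can be checked with a short sketch. It is a simplification, not the full FP-growth recursion: instead of following node-links in the tree, it collects the same prefix paths directly from the L-ordered transactions. The transactions are an assumed data set consistent with the supports in L (Table 2.1 is not reproduced in the text), and the function names are hypothetical.

```python
from collections import Counter

ORDER = ["f", "c", "a", "b", "m", "p"]  # list L from the text
RANK = {item: r for r, item in enumerate(ORDER)}
# Assumed transactions, one character per item.
TRANSACTIONS = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]


def ordered(transaction):
    """Keep only frequent items and sort them in the order of list L."""
    return sorted((i for i in transaction if i in RANK), key=RANK.get)


def conditional_pattern_base(item):
    """Prefix paths that precede `item`, with the number of samples on each."""
    base = Counter()
    for t in TRANSACTIONS:
        items = ordered(t)
        if item in items:
            prefix = tuple(items[:items.index(item)])
            if prefix:
                base[prefix] += 1
    return base


def frequent_with(item, min_support=3):
    """Items that stay frequent inside the conditional pattern base of `item`."""
    counts = Counter()
    for prefix, n in conditional_pattern_base(item).items():
        for i in prefix:
            counts[i] += n
    return {i: c for i, c in counts.items() if c >= min_support}


with_p = frequent_with("p")  # only c reaches the threshold -> {c, p}
with_m = frequent_with("m")  # f, c, a reach the threshold -> {f, c, a, m}
```

For p the base contains (f, c, a, m) twice and (c, b) once, so only c reaches the threshold of 3; for m the items f, c, and a all reach it, which matches the two itemsets reported in the text.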

2.2.3 The PHP algorithm


(Perfect hashing and pruning algorithm)

In the DHP algorithm, if we can define a hash table large enough that all distinct itemsets are mapped to different locations in the table, then the entries of the hash table give the actual count of each itemset in the database. In that case there are no false positives, and as a result the extra processing for counting the occurrences of each itemset is eliminated.

The amount of data that has to be scanned during large itemset discovery is a performance-related issue. Reducing the number of transactions to be scanned and trimming the number of items in each transaction improves data mining efficiency in later stages.

This algorithm uses perfect hashing for the hash table generated at each pass and reduces the size of the database by pruning transactions that do not contain any frequent item. Hence the algorithm is called Perfect Hashing and Pruning (PHP). The algorithm is as follows.

In the first pass, a hash table with size equal to the number of distinct items in the database is created. Each distinct item in the database is mapped to a different location in the hash table, and this method is called perfect hashing. The add method of the hash table adds a new entry if an entry for item x does not yet exist in the table and initializes its count to 1; otherwise it increments the count of x in the table by 1. After the first pass, the hash table contains the exact number of occurrences of each item in the database. With a single pass over the hash table, which resides in memory, the algorithm easily generates the frequent 1-itemsets. After that operation, the prune method of the hash table removes all entries whose support is less than the minimum support.

In subsequent passes, the algorithm prunes the database by discarding transactions that have no items from the frequent itemsets, and also trims non-frequent items from the transactions. At the same time, it generates the candidate k-itemsets. This process continues until no new Fk is found. The algorithm is shown in Figure 2.7.

This algorithm is better than DHP because, once the hash table has been formed, it does not need to count the occurrences of the candidate k-itemsets as DHP does. The algorithm is also better than Apriori because at each iteration the size of the database is reduced, which gives high performance when the database is very large and the number of frequent itemsets is relatively small.


Figure 2.7 The PHP algorithm

Furthermore, after each iteration, the database Dk contains transactions with frequent items only. The algorithm forms all k-item subsets of each transaction and inserts into the hash table all those whose (k-1)-subsets are large. For this reason the algorithm does not miss any frequent itemset. Because the algorithm prunes while inserting the candidate k-itemsets into Hk, the hash table stays small and fits in memory.
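The passes described above can be sketched as follows. This is only an illustration under assumptions, not the procedure of Figure 2.7: a Python `Counter` stands in for the perfect hash table (every key has its own slot, so the counts are exact), and the function name and sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def php(transactions, min_count):
    """Return {itemset: count} for all frequent itemsets, pruning as it goes."""
    db = [frozenset(t) for t in transactions]

    # Pass 1: "add" every item to the table, then "prune" infrequent entries.
    table = Counter()
    for t in db:
        table.update(t)
    frequent = {frozenset([i]): c for i, c in table.items() if c >= min_count}
    result = dict(frequent)

    k = 2
    while frequent:
        keep = set().union(*frequent)  # items still in a frequent (k-1)-itemset
        table = Counter()
        pruned_db = []
        for t in db:
            t = t & keep               # trim items that are no longer frequent
            if len(t) < k:             # too short to hold any k-itemset: drop
                continue
            pruned_db.append(t)
            for combo in combinations(sorted(t), k):
                # Insert a k-subset only if all of its (k-1)-subsets are large.
                if all(frozenset(s) in frequent for s in combinations(combo, k - 1)):
                    table[frozenset(combo)] += 1
        frequent = {s: c for s, c in table.items() if c >= min_count}
        result.update(frequent)
        db = pruned_db
        k += 1
    return result


# Hypothetical data, minimum support count 3.
DATA = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}, {"a"}]
itemsets = php(DATA, min_count=3)
```

Because the table holds exact counts, no separate counting scan is needed for the candidates, which is the advantage over DHP claimed above.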

2.2.4 The PCY algorithm


Park, Chen, and Yu proposed using a hash table to determine, during the first pass (while L1 is being determined), many pairs that cannot possibly be frequent. This takes advantage of the fact that main memory is usually much larger than the number of items. The use of main memory during the two passes that find L2 is shown in Figure 2.8.

Figure 2.8 Memory usage in the two passes of the PCY algorithm

Assume that the data is stored as a flat file, with records consisting of a basket ID and a list of its items.

Pass 1:
(a) Count the occurrences of all items.
(b) For each basket, consisting of items {i1, ..., ik}, hash every pair to a bucket of the hash table and increment the count of that bucket by 1.
(c) At the end of the pass, determine L1, the items with counts of at least s.
(d) Also at the end of the pass, determine the buckets with support of at least s.
* Key point: a pair (i, j) cannot be frequent unless it hashes to a frequent bucket, so pairs that hash to other buckets need not be candidates in C2. Replace the hash table by a bitmap, with one bit per bucket: 1 if the bucket is frequent, 0 if it is not.

Pass 2:
(a) Main memory holds a list of all frequent items, i.e. L1.
(b) Main memory also holds the bitmap summarizing the results of the hashing from pass 1.
* Key point: the buckets need 16 or 32 bits for a count, but each is compressed to 1 bit. Thus, even if the hash table occupied almost all of main memory in pass 1, its bitmap occupies no more than 1/16 of main memory in pass 2.
(c) Finally, main memory also holds a table with all the candidate pairs and their counts. A pair (i, j) can be a candidate in C2 only if all of the following are true:
   i) i is in L1.
   ii) j is in L1.
   iii) (i, j) hashes to a frequent bucket.
The last condition is what distinguishes PCY from plain a-priori and reduces the memory requirements in pass 2.
(d) During pass 2, we consider each basket and each pair of its items, making the test outlined above. If a pair meets all the conditions, add 1 to its count in memory, or create an entry for it if one does not yet exist.

When does PCY beat Apriori? When there are too many pairs of items from L1 to fit a table of candidate pairs and their counts in main memory, yet the number of frequent buckets in PCY is small enough to reduce C2 to a size that fits in memory (even with 1/16 of memory given over to the bitmap).

When will most of the buckets be infrequent in PCY? When there are few frequent pairs, and most pairs are so infrequent that even when the counts of all the pairs hashing to a given bucket are added together, they are still unlikely to reach s.
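The two passes listed above can be sketched as follows. This is a minimal illustration under assumptions: items are small integers, the hash function and bucket count are arbitrary illustrative choices, and a Python list of booleans stands in for the bitmap.

```python
from collections import Counter
from itertools import combinations


def pcy_frequent_pairs(baskets, s, n_buckets=11):
    """Return {(i, j): count} for pairs with count >= s, using two passes."""
    def bucket(i, j):
        return (31 * i + j) % n_buckets  # illustrative hash function

    baskets = [sorted(set(b)) for b in baskets]

    # Pass 1: count single items and hash every pair to a bucket.
    item_counts = Counter()
    bucket_counts = [0] * n_buckets
    for b in baskets:
        item_counts.update(b)
        for i, j in combinations(b, 2):
            bucket_counts[bucket(i, j)] += 1
    l1 = {i for i, c in item_counts.items() if c >= s}
    bitmap = [c >= s for c in bucket_counts]  # one bit per bucket

    # Pass 2: count only pairs that satisfy conditions i), ii) and iii).
    pair_counts = Counter()
    for b in baskets:
        for i, j in combinations(b, 2):
            if i in l1 and j in l1 and bitmap[bucket(i, j)]:
                pair_counts[(i, j)] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}


# Hypothetical baskets, support threshold s = 3.
BASKETS = [[1, 2, 3], [1, 2], [1, 3], [2, 3], [1, 2, 3], [1]]
pairs = pcy_frequent_pairs(BASKETS, s=3)
```

The result does not depend on the choice of hash function: a frequent pair always lands in a frequent bucket, so hashing can only shrink the candidate table, never lose a frequent pair.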

2.2.5 The multistage PCY algorithm


Instead of checking the candidates in pass 2, we run another hash table (with a different hash function!) in pass 2, but we hash only those pairs that meet the PCY test; that is, both items are in L1 and the pair hashed to a frequent bucket in pass 1. In the third pass, we keep the bitmaps from both hash tables and treat a pair as a candidate in C2 only if:
(a) Both items are in L1.
(b) The pair hashed to a frequent bucket in pass 1.
(c) The pair also hashed to a frequent bucket in pass 2.
Figure 2.9 illustrates the memory usage. This scheme can be extended to more passes, but there is a limit, because eventually memory becomes full of bitmaps and no candidates can be counted.

Figure 2.9 Memory usage for multistage hash tables

When are multiple hash tables useful? When most buckets on the first pass of PCY have counts far below the threshold s. Then the counts in the buckets can be doubled and most buckets will still be below the threshold.

When is multistage useful? When the number of frequent buckets on the first pass is high (for example 50%), but not all buckets are frequent. Then a second hashing, with some of the pairs ignored, may reduce the number of frequent buckets significantly.
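The three-pass multistage scheme can be sketched as follows. As before, this is only an illustration: integer items, the two hash functions, and the bucket counts are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def multistage_frequent_pairs(baskets, s, n1=11, n2=13):
    """Return {(i, j): count} for frequent pairs using two hash stages."""
    def h1(i, j):
        return (31 * i + j) % n1      # hash function of pass 1 (illustrative)

    def h2(i, j):
        return (17 * i + 7 * j) % n2  # a different hash function for pass 2

    baskets = [sorted(set(b)) for b in baskets]

    # Pass 1: item counts and the first hash table.
    item_counts, table1 = Counter(), [0] * n1
    for b in baskets:
        item_counts.update(b)
        for i, j in combinations(b, 2):
            table1[h1(i, j)] += 1
    l1 = {i for i, c in item_counts.items() if c >= s}
    bitmap1 = [c >= s for c in table1]

    def passes_pcy(i, j):
        return i in l1 and j in l1 and bitmap1[h1(i, j)]

    # Pass 2: hash only the pairs that meet the PCY test.
    table2 = [0] * n2
    for b in baskets:
        for i, j in combinations(b, 2):
            if passes_pcy(i, j):
                table2[h2(i, j)] += 1
    bitmap2 = [c >= s for c in table2]

    # Pass 3: count pairs that satisfy conditions (a), (b) and (c).
    counts = Counter()
    for b in baskets:
        for i, j in combinations(b, 2):
            if passes_pcy(i, j) and bitmap2[h2(i, j)]:
                counts[(i, j)] += 1
    return {p: c for p, c in counts.items() if c >= s}


BASKETS = [[1, 2, 3], [1, 2], [1, 3], [2, 3], [1, 2, 3], [1]]  # hypothetical
pairs = multistage_frequent_pairs(BASKETS, s=3)
```

The second table sees fewer pairs than the first, so its buckets are less crowded and fewer of them become frequent by accident, which is where the additional filtering comes from.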


2.3. Classification by decision tree learning


ID3 and C4.5 are algorithms introduced by Quinlan for inducing classification models, also called decision trees, from data. A decision tree is important not because it summarizes what we know from the training set, but because we expect it to classify new cases correctly. Thus, when building classification models we need both training data to build the model and test data to evaluate how good the decision tree is.

We are given a set of records. Each record has the same structure, consisting of a number of attribute/value pairs. One of these attributes represents the category of the record. The problem is to determine a decision tree that, on the basis of answers to questions about the non-category attributes, correctly predicts the value of the category attribute. Usually the category attribute takes only the values {true, false}, or {success, failure}, or something similar. In any case, one of its values means failure.

The non-category attributes may be discrete or continuous. ID3 does not directly deal with continuous attributes.

The basic ideas behind ID3 are the following.

In the decision tree, each internal node corresponds to a non-category attribute and each arc to a possible value of that attribute. A leaf of the tree specifies the expected value of the category attribute for the records described by the path from the root to that leaf. [This defines what a decision tree is.]

In the decision tree, each node should be associated with the non-category attribute that is most informative among the attributes not yet considered in the path from the root. [This establishes what a good decision tree is.]

Entropy is used to measure how informative a node is. [This defines what "good" means.]

C4.5 is an extension of ID3 that accounts for missing values, continuous attribute value ranges, pruning of decision trees, rule derivation, and so on.

2.3.1 Definitions
If there are n equally probable messages, then the probability p of each message is 1/n and the information conveyed by a message is -log2(p) = log2(n). That is, if there are 16 messages, then log2(16) = 4 and we need 4 bits to identify each message.

In general, if we are given a probability distribution P = (p1, p2, .., pn), then the information conveyed by this distribution, also called the entropy of P, is:

I(P) = -(p1*log2(p1) + p2*log2(p2) + .. + pn*log2(pn))

If a set T of records is partitioned into disjoint classes C1, C2, .., Ck on the basis of the value of the category attribute, then the information needed to identify the class of an element of T is Info(T) = I(P), where P is the probability distribution of the partition (C1, C2, .., Ck):

P = (|C1|/|T|, |C2|/|T|, ..., |Ck|/|T|)

If we first partition T on the basis of the value of a non-category attribute X into sets T1, T2, .., Tn, then the information needed to identify the class of an element of T becomes the weighted average of the information needed to identify the class of an element of Ti, that is, the weighted average of Info(Ti):
$$\mathrm{Info}(X,T) = \mathrm{Info}_X(T) = \sum_{i=1}^{n} \frac{|T_i|}{|T|} \cdot \mathrm{Info}(T_i)$$

The quantity Gain(X,T) is defined as

Gain(X,T) = Info(T) - Info(X,T)

This represents the difference between the information needed to identify an element of T and the information needed to identify an element of T after the value of attribute X has been obtained; that is, the gain in information due to attribute X. We can use the notion of gain to rank attributes and to build decision trees in which each node holds the attribute with the greatest gain among the attributes not yet considered in the path from the root.
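The definitions of I(P), Info(T), Info(X,T), and Gain(X,T) translate almost directly into code. The sketch below is illustrative only; the function names and the tiny data set are assumptions made for the example.

```python
from collections import Counter
from math import log2


def entropy(probs):
    """I(P) = -(p1*log2(p1) + ... + pn*log2(pn)); zero terms are skipped."""
    return -sum(p * log2(p) for p in probs if p > 0)


def info(labels):
    """Info(T): entropy of the class distribution of the records."""
    n = len(labels)
    return entropy(c / n for c in Counter(labels).values())


def info_x(values, labels):
    """Info(X,T): weighted average of Info(Ti) over the partition induced by X."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return sum(len(g) / n * info(g) for g in groups.values())


def gain(values, labels):
    """Gain(X,T) = Info(T) - Info(X,T)."""
    return info(labels) - info_x(values, labels)


# Sixteen equally probable messages need 4 bits each.
bits = entropy([1 / 16] * 16)

# Hypothetical records: attribute x1 separates the classes, x2 does not.
labels = ["yes", "yes", "no", "no"]
x1 = ["a", "a", "b", "b"]
x2 = ["a", "b", "a", "b"]
gain_x1 = gain(x1, labels)
gain_x2 = gain(x2, labels)
```

Here x1 removes all uncertainty about the class (gain of 1 bit), while x2 removes none, so x1 would be ranked first when choosing the attribute for a node.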

2.3.2 The ID3 algorithm


The ID3 algorithm is used to build a decision tree, given a set of non-category attributes C1, C2, .., Cn, the category attribute C, and a training set T of records.

function ID3 (R: a set of non-category attributes,
              C: the category attribute,
              S: a training set) returns a decision tree;
begin
   If S is empty, return a single node with value Failure;
   If S consists of records all with the same value for the category attribute, return a single node with that value;
   If R is empty, then return a single node with the value that is most frequent among the values of the category attribute found in the records of S; [note that there will then be errors, that is, records that will be improperly classified];
   Let D be the attribute with the largest Gain(D,S) among the attributes in R;
   Let {dj | j=1,2, .., m} be the values of attribute D;
   Let {Sj | j=1,2, .., m} be the subsets of S consisting respectively of the records with value dj for attribute D;
   Return a tree with root labeled D and arcs labeled d1, d2, .., dm going respectively to the trees
      ID3(R-{D}, C, S1), ID3(R-{D}, C, S2), .., ID3(R-{D}, C, Sm);
end ID3;

Using gain ratios

The notion of gain tends to favor attributes that have a large number of values. For example, if an attribute D has a distinct value for each record, then Info(D,T) is 0, so Gain(D,T) is maximal. To compensate for this, the following ratio is used instead of Gain:

GainRatio(D,T) = Gain(D,T) / SplitInfo(D,T)

where SplitInfo(D,T) is the information due to the split of T on the basis of the value of attribute D:

SplitInfo(D,T) = I(|T1|/|T|, |T2|/|T|, .., |Tm|/|T|)

where {T1, T2, .. Tm} is the partition of T induced by the values of D.
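The ID3 procedure and the gain ratio above can be sketched as follows. This is a minimal illustration, not a full implementation: records are assumed to be Python dictionaries, the tree is returned as nested dictionaries, and the small weather-style data set (including the `id` attribute used to show the bias of plain gain) is hypothetical.

```python
from collections import Counter
from math import log2


def info(rows, attr):
    """Entropy of the distribution of `attr` values over the rows."""
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in Counter(r[attr] for r in rows).values())


def split(rows, attr):
    """Partition the rows by their value of `attr`."""
    parts = {}
    for r in rows:
        parts.setdefault(r[attr], []).append(r)
    return parts


def gain(rows, attr, target):
    n = len(rows)
    remainder = sum(len(p) / n * info(p, target) for p in split(rows, attr).values())
    return info(rows, target) - remainder


def gain_ratio(rows, attr, target):
    split_info = info(rows, attr)  # SplitInfo: entropy of the partition sizes
    return gain(rows, attr, target) / split_info if split_info > 0 else 0.0


def id3(attrs, target, rows):
    """Return a leaf value or {attribute: {value: subtree}}."""
    if not rows:
        return "Failure"
    labels = Counter(r[target] for r in rows)
    if len(labels) == 1 or not attrs:
        return labels.most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a, target))
    rest = [a for a in attrs if a != best]
    return {best: {v: id3(rest, target, part) for v, part in split(rows, best).items()}}


ROWS = [
    {"id": 1, "outlook": "sunny", "windy": "no", "play": "no"},
    {"id": 2, "outlook": "sunny", "windy": "yes", "play": "no"},
    {"id": 3, "outlook": "rain", "windy": "no", "play": "yes"},
    {"id": 4, "outlook": "rain", "windy": "yes", "play": "no"},
    {"id": 5, "outlook": "overcast", "windy": "no", "play": "yes"},
    {"id": 6, "outlook": "overcast", "windy": "yes", "play": "yes"},
]
tree = id3(["outlook", "windy"], "play", ROWS)
```

On this data the record identifier `id` has the largest plain gain (every record gets its own branch), but its gain ratio is lower than that of `outlook`, which illustrates why the ratio is preferred.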

2.3.3 Extensions in C4.5
C4.5 extends the original ID3 algorithm in several respects.

In building a decision tree, training sets containing records with unknown attribute values are handled by evaluating the gain, or the gain ratio, for an attribute using only the records in which that attribute is defined.

In using a decision tree, records that have unknown attribute values can be classified by estimating the probability of each of the possible results.

Attributes with continuous ranges are handled as follows. Say that attribute Ci has a continuous range. We examine the values of this attribute in the training set. Say they are, in increasing order, A1, A2, .., Am. Then for each value Aj, j=1,2,..m, we partition the records into those that have Ci values up to and including Aj, and those that have values greater than Aj. For each of these partitions we compute the gain, or gain ratio, and choose the partition that maximizes the gain.

Pruning decision trees: The decision tree is built using the training set, and because of the way it is built, it deals correctly with most of the records of the training set. In fact, in order to do so, it may become quite complex, with paths that are even very long. Pruning of the decision tree is done by replacing a whole subtree by a leaf node. The replacement takes place if a decision rule establishes that the expected error rate in the subtree is greater than in the single leaf. For example, if the simple decision tree
          Color
         /     \
     red/       \blue
       /         \
   Success     Failure

is built from one red Success record and two blue Failure records, and in the test set we then find three red Failures and one blue Success, we might consider replacing this subtree by a single Failure leaf. After the replacement, only 2 of the 7 records (training and test together) are misclassified instead of 4.
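The threshold search for a continuous attribute described earlier in this section can be sketched as follows. This is a minimal illustration using plain gain; the function names and the sample values are assumptions.

```python
from collections import Counter
from math import log2


def info(labels):
    """Entropy of the class distribution of the given labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Try each observed value Aj as a cut point and keep the one with most gain."""
    base = info(labels)
    n = len(values)
    best_cut, best_gain = None, -1.0
    # The largest value is skipped: nothing would lie on the "greater" side.
    for a in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= a]
        right = [y for v, y in zip(values, labels) if v > a]
        g = base - (len(left) / n * info(left) + len(right) / n * info(right))
        if g > best_gain:
            best_cut, best_gain = a, g
    return best_cut, best_gain


# Hypothetical continuous attribute with two well-separated classes.
VALUES = [1, 2, 3, 10, 11, 12]
LABELS = ["a", "a", "a", "b", "b", "b"]
cut, cut_gain = best_threshold(VALUES, LABELS)
```

The chosen cut turns the continuous attribute into a binary test (value up to the cut, or greater), which can then compete with the discrete attributes in the usual gain comparison.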


CHAPTER 3. APPLYING DATA MINING TO THE TAX SECTOR DATABASE

3.1. The tax sector database


Having applied information technology to tax administration since 1986, the tax sector has by now built a large-scale information technology system that meets the requirements of tax administration in the new period. Starting from applications developed on standalone machines, the whole sector now has a database distributed across 64 Tax Departments nationwide. The computer network connects the entire sector for the exchange of information and data, from the General Department of Taxation to the 64 Tax Departments and nearly 700 district-level Tax Branches. The application systems support taxpayer registration and tax code issuance, and the tax collection management system automates the important processing steps of the management procedure, such as managing amounts receivable, amounts collected, and tax debt, calculating tax and debt, and consolidating accounting and tax statistics reports.

Holding a store of information on taxation, the tax sector database plays an important role not only within the sector but is also valuable to the country as a whole. Part of the information in the tax sector database, namely the information on taxpaying organizations and individuals, will contribute to the national database of the finance sector.

Previously, the tax sector database was used only for daily operations, reports, and statistics. In recent years, the first years of the Tax Reform period, the database has begun to serve information analysis in part.

In the period of administrative tax reform, the tax sector has been gradually implementing the mechanism of self-declaration, self-calculation, and self-payment of tax. The central task is to rebuild the entire tax administration process on the basis of new functions, to specify the responsibilities of the tax authority and of the taxpayer, and to simplify and clarify the procedures and paperwork for tax declaration and payment. With taxpayers given the autonomy and the responsibility to determine and pay their own tax, the tax authority will concentrate on two major tasks: communication, guidance, and support services for taxpayers; and inspection and audit.

In this new period, information is therefore of great value. Well-organized exploitation of information will contribute substantially to supporting inspection and audit, to preventing tax evasion, and to ensuring fairness among taxpayers in their obligation to contribute to the state budget. Correct analysis and forecasting also help tax inspection identify the right taxpayers to inspect and limit misconduct in inspection work.

This work studies data mining theory and applies data mining to the tax sector database in the hope of taking a first step toward discovering interesting results from the store of tax information. The mining results within the scope of this thesis may not yet be of practical significance, but it is hoped that they will be a first step toward the project "Building an information analysis system to support tax administration and inspection".

3.2. Choosing the mining tool


3.2.1 Choosing the tool
Many products support knowledge discovery from databases. The table below lists a number of data mining products from different vendors and the features of each product (http://www.xore.com/prodtable.html).

Table 2.2 Data mining products

Feature columns of the table: NN, Tree, Naive Bayes, k-Means, k-NN, Stats, Pred, Time Series, Clust, Assoc, Win32, UNIX, Par, API, SQL Ext, SDK.
[The per-product feature marks (Y) are not reproduced here.]

Company                            Product
Angoss International Ltd.          KnowledgeSEEKER, KnowledgeSTUDIO
Business Objects                   BusinessMiner
Cognos Incorporated                4Thought, Scenario
Fair, Isaac/HNC Software           DataBase Mining Marksman
Informix/RedBrick Software Inc.    Red Brick Data Mine
International Business Machines    Intelligent Miner
Accrue Software                    Decision Series
NeuralWare                         NeuralSIM
Oracle Corp.                       Darwin
Salford Systems                    CART
SAS Institute                      Enterprise Miner
SPSS, Inc.                         Answer Tree, Clementine, Neural Connection
Unica Technology                   Pattern Recognition Workbench, Model 1

The tax sector uses an Oracle database. Choosing a data mining tool from Oracle is therefore a natural choice. For data mining with Oracle products, the options are:

1. Darwin: A data mining application designed to handle many gigabytes of data and to provide answers to complex problems such as data classification, prediction, and forecasting. Darwin helps transform large volumes of data into business intelligence. It helps find meaningful patterns and associations throughout the data, patterns that allow customer behavior to be better understood and predicted.

2. Oracle Data Mining (ODM): Designed for programmers, systems analysts, project managers, and anyone interested in developing database applications that use data mining to discover hidden patterns and use that knowledge to make predictions. ODM is a data mining tool embedded in the Oracle database. The data never leaves the database: all data preparation, model building, and model application activities stay within the database. This allows Oracle to provide a platform on which data analysts and application developers can integrate data mining seamlessly with database applications.

Darwin is a data mining product that runs only on Unix. The tax sector currently still uses the Windows operating system and has not purchased a Darwin license. The Oracle database components in use in the tax sector are all licensed from the vendor, and ODM is included in the Oracle database. ODM is therefore the data mining tool chosen in this thesis.


3.2.2 Oracle Data Mining (ODM)


Oracle Data Mining (ODM) provides both PL/SQL and Java application programming interfaces for creating supervised and unsupervised data mining models. The two APIs are fully interoperable, so a model can be created with one API and then modified or used with the other. The Java API is Oracle's implementation of the JDM 1.0 standard (JSR-73), with extensions that follow the extension framework of the standard. With the PL/SQL API, packages can be used to build a mining model, test the model, and apply the model to data to obtain predictive and descriptive information.

The Oracle Data Mining APIs support both predictive and descriptive mining functions. Predictive functions, known as supervised learning, use training data to predict a target value. Descriptive functions, known as unsupervised learning, identify relationships intrinsic to the data. Each mining function specifies a class of problems that can be solved, and each function can be implemented with one or more algorithms. The APIs also provide basic data transformation utilities for preparing data for mining.

Oracle Data Mining provides:

1. The following predictive functions:

Function:    Classification
Description: A classification model uses historical data to predict new discrete or categorical data.
Algorithms:  Naive Bayes, Adaptive Bayes Network, Support Vector Machine, Decision Tree

Function:    Anomaly Detection
Description: An anomaly detection model predicts whether or not a data point is typical for a given distribution. The PL/SQL and Java APIs support anomaly detection through the classification mining function, using the SVM algorithm with no target.
Algorithms:  One-Class Support Vector Machine (SVM)

Function:    Regression
Description: A regression model uses historical data to predict new numerical, continuous data.
Algorithms:  Support Vector Machine

Function:    Attribute Importance
Description: An attribute importance model determines the relative importance of an attribute in predicting a given outcome.
Algorithms:  Minimum Description Length

2. The following descriptive functions:

Function:    Clustering
Description: A clustering model identifies natural groupings in the data set.
Algorithms:  Enhanced k-means, Orthogonal Clustering (O-Cluster, an Oracle proprietary algorithm)

Function:    Association Rules
Description: An association model identifies relationships and the probability of their occurrence in the data set.
Algorithms:  Apriori

Function:    Feature Extraction
Description: A feature extraction model creates an optimized data set on which to base a model.
Algorithms:  Non-Negative Matrix Factorization

3.2.3 DBMS_DATA_MINING
The development methodology for data mining with the DBMS_DATA_MINING interface is divided into two phases.

The first phase covers the analysis and design of the application's data, and consists of the following two steps:
1. Analyze the problem and choose the mining function and the mining algorithm.
2. Analyze the data to be used for building mining models (build data), testing predictive models (test data), and applying the model to new data (scoring data).

The second phase covers the development of the mining application using the DBMS_DATA_MINING and DBMS_DATA_MINING_TRANSFORM packages:
3. Prepare the build, test, and scoring data using the DBMS_DATA_MINING_TRANSFORM package, third-party tools, or SQL or PL/SQL scripts written directly, in a form suitable for the chosen function and algorithm. It is important that the three data sets be prepared in the same way if mining is to produce meaningful results.
4. Prepare a settings table that overrides the default settings of the algorithm and of the mining function. This step is optional.
5. Build a mining model for the given training data set.
6. For predictive models (classification and regression), test the model for accuracy and measure its performance. This amounts to applying the model to the test data.
7. Retrieve the model signature to determine the mining attributes that the model will use when it is applied. This information helps ensure that the scoring data is suitable for the given model. This step is optional.
8. Apply a classification, regression, clustering, or feature extraction model to new data to generate predictions and/or descriptive summaries and patterns about the data.
9. Retrieve the model details to understand why the model scored the data in a particular way. This step is optional.
10. Repeat steps 3 to 9 until satisfactory results are obtained.

3.3. Information exploitation objectives of the tax sector


In most organizations that currently apply information technology to management, applications stop at the level of routine operational systems whose function is to support data entry and to produce output reports. Applications that strongly support analysis and decision making are still few. However, with the current trend of development, applications that mine the knowledge hidden in databases will certainly be much needed.

The tax sector is currently in the first years of administrative tax reform. Under this strategy the direction of tax administration will change substantially, concentrating on two main tasks:
Communication, support, and the provision of services to taxpayers.
Tax inspection and audit.

Good data mining supports taxpayer communication and support: analysis of the data can produce results that help direct this work, identifying which form of communication should be applied to which taxpayers in order to be effective.

For tax inspection and audit, data mining is of even greater significance. In the past, inspection relied mainly on the experience of inspection officers, who examined the figures in a taxpayer's financial statements, compared the enterprise's figures across years, and compared its figures for the year with the general development of its industry in order to detect suspicious points needing verification. Today the number of enterprises keeps growing, and at some point inspection officers will no longer be able to examine each case and each specific figure of every taxpayer. Supporting tools are therefore much needed.

Another issue, of concern not only to the tax sector, is limiting the inconvenience caused to enterprises that are subject to tax inspection. To that end, the suspect taxpayers that must be inspected need to be identified with high certainty.

Although no data mining application exists yet, on the basis of information learned from tax administrations in other countries, Vietnam's tax sector has also begun to move in this direction. The tax sector has begun to consider requiring enterprises to provide the relevant financial statements as a basis for examining and analyzing taxpayers, such as the Balance Sheet, the Income Statement, and the direct/indirect Cash Flow Statement. From these statements, combined with tax administration figures (the tax each taxpayer owes, the amount paid, the amount still owed, and so on), analysis indicators are determined. The current application only goes as far as producing reports that list the analysis indicators (each indicator analyzed separately), which inspection officers then review to make decisions. What inspection officers want is an application that analyzes automatically on the basis of many indicators, so that when a taxpayer's figures are entered, the answer is a score assessing the degree of that taxpayer's violation.

From the above, it is clear that many kinds of data mining could be applied to meet these requirements and improve the effectiveness of tax administration. Within the scope of this thesis, however, two mining functions were chosen for experimental mining on the tax sector database:
Association rule mining: in the hope that the knowledge discovered can help taxpayer communication and support.
Classification: based on a number of analysis indicators, classify taxpayers and predict the likelihood that a taxpayer is in violation, in support of tax inspection.

3.4. Association rule mining experiment


Tax administration data is organized in a distributed manner across the 64 Tax Departments. At the General Department of Taxation, data is centralized to a certain extent, depending on the type of information. For example, taxpayer information is held fairly completely at the General Department (except for historical data; the General Department stores full information only as of the present), whereas for tax collection management only aggregate figures are held at the General Department, and the detailed data is managed at the Tax Departments.

Data mining work in general can be summarized as four main tasks: defining the objectives and selecting the data, preparing the data, mining the data, and analyzing the results and managing the knowledge. Of these four tasks, data preparation takes the most effort, as illustrated in Figure 3.1.

The effort spent on preparing data for mining on a real operational database is much greater than on synthetic data.

Figure 3.1 Effort required for each phase of data mining

Mining association rules with ODM consists of the following main steps: preparing the data, building the model (the step that determines the frequent itemsets), and retrieving the mined rules. The association rule mining experiments on the tax sector database carried out in this thesis all follow the process below:


Figure 3.2 Steps of association rule mining on the tax sector database

If the parameters of the association rule mining model are set too high for the data, no rules are obtained. In that case the model parameters are adjusted. If changing the parameters is still ineffective, it may be necessary to go back to the data preprocessing step. Failing to remove items that are overly common in the data set can also lead to undesired mining results. Alternatively, the handling of missing data may need to be reconsidered. It may also be necessary to reconsider whether the right data was selected for mining.

The association rule mining experiment was carried out following the steps above, and the final results are given below. The corresponding code is presented in the appendix.

As noted in section 3.3, association rule mining is well suited to discovering knowledge that serves taxpayer communication and support. The rules discovered can help communication and support officers identify the groups of taxpayers for which suitable forms of communication should be devised. Below is an experimental mining run that looks for the relationship between line of business, enterprise size (by revenue), tax payable, and late tax payment.

Defining the mining task: The aim is to determine which groups of taxpayers communication should focus on in order to raise awareness of strict compliance with tax obligations. The problem draws on information that may be related to late tax payment, namely: line of business, enterprise size (by revenue), and tax payable.

Selecting the data:
Information from the taxpayer's income statement: revenue and tax payable.
Data on the taxpayers' lines of business: ID; tax code; industry code; a field indicating whether the record is historical or current. The industry code consists of 5 characters (for example, L7221: Renting of agricultural machinery and equipment). The industry hierarchy is encoded in the code itself. An example branch of the hierarchy is shown in Figure 3.3.


Figure 3.3 A branch of the industry hierarchy

Late tax payment status: taken from the late-payment penalty information in the tax administration information system. Here only whether the taxpayer paid late (1) or not (0) is used.

Data preprocessing:
For line of business, rules are hard to discover at the lowest level, so mining is performed at a higher concept level. The industry value is therefore transformed: each taxpayer's line of business is taken as the first 3 characters of the industry code.
Enterprise size is classified according to the taxpayer's average monthly revenue (averaged over one year) and divided into the levels: very small (0 to 100,000,000), small (100,000,000 to 500,000,000), medium (500,000,000 to 1,000,000,000), large (1,000,000,000 to 5,000,000,000), and very large (over 5,000,000,000).
Average monthly tax payable is likewise grouped into the brackets 5 million, 10 million, 20 million, 30 million, 50 million, 100 million, 500 million, 1 billion, and 5 billion.
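The preprocessing rules above can be sketched as a few small functions. This is an illustration only, not the code from the appendix: the function names, the labels MEDIUM, LARGE, and OVER 5000, the treatment of boundary values (a value equal to a bound falls into the higher class), and the sample taxpayer are all assumptions.

```python
from bisect import bisect_right

REVENUE_BOUNDS = [100_000_000, 500_000_000, 1_000_000_000, 5_000_000_000]
SIZE_LABELS = ["VERY SMALL", "SMALL", "MEDIUM", "LARGE", "VERY LARGE"]
TAX_BRACKETS_MILLION = [5, 10, 20, 30, 50, 100, 500, 1000, 5000]


def industry_group(industry_code):
    """Generalize a 5-character industry code to its first 3 characters."""
    return industry_code[:3]


def size_class(avg_monthly_revenue):
    """Map average monthly revenue to one of the five size classes."""
    return SIZE_LABELS[bisect_right(REVENUE_BOUNDS, avg_monthly_revenue)]


def tax_bracket(avg_monthly_tax):
    """Label the bracket by its upper bound in millions, e.g. '5' for under 5 million."""
    bounds = [b * 1_000_000 for b in TAX_BRACKETS_MILLION]
    idx = bisect_right(bounds, avg_monthly_tax)
    if idx < len(TAX_BRACKETS_MILLION):
        return str(TAX_BRACKETS_MILLION[idx])
    return "OVER 5000"


def to_items(tin, industry_code, revenue, tax, paid_late):
    """One (tax code, item, 1) row per attribute, ready to be unioned."""
    return [
        (tin, industry_group(industry_code), 1),
        (tin, size_class(revenue), 1),
        (tin, tax_bracket(tax), 1),
        (tin, "1" if paid_late else "0", 1),
    ]


# Hypothetical taxpayer: industry G5110, 80 million revenue, 3 million tax, on time.
rows = to_items("0100000001", "G5110", 80_000_000, 3_000_000, False)
```

Each taxpayer becomes a small transaction of four items, which is the shape that association rule mining expects.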

Transforming the data into the form required for mining: The data is first put into the form

(tax code, industry, 1
 UNION tax code, revenue, 1
 UNION tax code, tax payable, 1
 UNION tax code, late payment, 1)

and then converted to nested table form:
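CREATE VIEW TR_dondoc_AR AS
SELECT TIN,
       -- collect each taxpayer's (item, flag) rows into one nested-table column
       CAST(COLLECT(DM_Nested_Numerical(SUBSTRB(nganhsx, 1, 10), has_it))
            AS DM_Nested_Numericals) tinnganhsx
FROM tr_dondoc
GROUP BY TIN;  -- one row per tax identification number (TIN)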

Setting the model parameters:
Minimum support threshold: 0.1
Minimum confidence threshold: 0.1
Maximum rule length: 2

Building the model and producing the results (each item of an itemset is listed on its own row, so a 2-item set appears as two rows with the same support):

Item          Support     Number of items
G51           0.246914    1
SMALL         0.248677    1
VERY SMALL    0.301587    1
1             0.313933    1
0             0.686067    1
5             0.740741    1
0             0.227513    2
VERY SMALL    0.227513    2
1             0.229277    2
5             0.229277    2
5             0.292769    2
VERY SMALL    0.292769    2
0             0.511464    2
5             0.511464    2
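To make the role of the support threshold and the maximum length concrete, the following is a minimal level-wise search in the spirit of Apriori. It is a sketch for small inputs, not the algorithm ODM runs internally; the function name and the toy transactions are assumptions made for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_length):
    """Return {itemset: support} for itemsets of up to max_length items.

    A candidate of size k is only counted if all of its (k-1)-subsets
    were frequent at the previous level.
    """
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    result = {}
    previous = None
    for k in range(1, max_length + 1):
        level = {}
        for candidate in combinations(items, k):
            if previous is not None and any(
                frozenset(sub) not in previous
                for sub in combinations(candidate, k - 1)
            ):
                continue  # pruned: some subset is infrequent
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count / n >= min_support:
                level[frozenset(candidate)] = count / n
        if not level:
            break
        result.update(level)
        previous = level
    return result


# Toy transactions (illustrative only, not thesis data).
toy = [{"0", "5", "VERY SMALL"}, {"0", "5"}, {"1", "5"}, {"0", "SMALL"}]
found = frequent_itemsets(toy, min_support=0.5, max_length=2)
```

With the toy data, only the items `0` and `5` and the pair of both reach the 0.5 threshold; every other pair is pruned without being counted.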

The mined rules:

Figure 3.4 Rules mined with ODM (rule length = 2)

RULE               CONFIDENCE (%)   SUPPORT (%)
VERY SMALL => 5    97.07603         29.276896
G51 => 5           89.28571         22.045855
VERY LARGE => 0    84.05797         10.229277
SMALL => 5         77.30496         19.223986
VERY SMALL => 0    75.4386          22.751324
0 => 5             74.550125        51.146385
1 => 5             73.03371         22.92769
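The confidence column can be checked against the itemset supports reported earlier, since the confidence of a rule A => B is support(A and B) divided by support(A). A short sketch follows; the function name is an assumption and the inputs are the rounded supports from the itemset listing.

```python
def confidence(rule_support, antecedent_support):
    """Confidence of A => B as support(A and B) / support(A)."""
    return rule_support / antecedent_support


# VERY SMALL => 5: supports of {5, VERY SMALL} and {VERY SMALL}.
conf_very_small_5 = confidence(0.292769, 0.301587)
# 0 => 5: supports of {0, 5} and {0}.
conf_0_5 = confidence(0.511464, 0.686067)
```

Both values agree with the reported confidences (about 97.08% and 74.55%) up to the rounding of the inputs.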

Remarks: All the rules mined above have high confidence.
1. VERY SMALL => 5: Of the very small enterprises, 97% have tax payable below 5 million/month.
2. G51 => 5: In the industry "Wholesale and commission trade (except motor vehicles and motorcycles)", 89% have tax payable below 5 million/month.
3. VERY LARGE => 0: Of the very large taxpayers, 84% do not pay tax late.
4. SMALL => 5: Of the small taxpayers, 77% pay less than 5 million/month in tax.
5. VERY SMALL => 0: Of the very small taxpayers, 75% fulfil their tax obligations and do not pay late.
6. 0 => 5: Among the taxpayers who do not pay late, 74% have tax payable below 5 million/month.
7. 1 => 5: Among the taxpayers who pay late, 73% have tax payable below 5 million/month.

Some implications drawn from these rules:
Late payment does occur among taxpayers whose tax payable is below 5 million/month. In absolute numbers, however, the taxpayers in this group who comply with their payment obligations far outnumber those who pay late (rules 6 and 7). In addition, the amounts involved are usually small, so total collections from these taxpayers are not large. Communication for this group should therefore use public, low-cost channels.
Very large taxpayers that strictly comply with their tax obligations are of great benefit to the State (rule 3). Mechanisms and policies for timely commendation of these taxpayers are therefore needed.

Mining further rules with rule length = 3

Setting the model parameters:
Minimum support threshold: 0.1
Minimum confidence threshold: 0.1
Maximum rule length: 3

Building the model and producing the results:

Item          Support     Number of items
G51           0.246914    1
SMALL         0.248677    1
VERY SMALL    0.301587    1
1             0.313933    1
0             0.686067    1
5             0.740741    1
0             0.227513    2
VERY SMALL    0.227513    2
1             0.229277    2
5             0.229277    2
5             0.292769    2
VERY SMALL    0.292769    2
0             0.511464    2
5             0.511464    2

The mined rules:

Figure 3.5 Rules mined with ODM (rule length = 3)

RULE                       CONFIDENCE (%)   SUPPORT (%)
0 AND VERY SMALL => 5      99.22481         22.574955
VERY SMALL => 5            97.07603         29.276896
0 AND G51 => 5             90.81633         15.696649
G51 => 5                   89.28571         22.045855
VERY LARGE => 0            84.05797         10.229277
0 AND SMALL => 5           81.17647         12.1693125
SMALL => 5                 77.30496         19.223986
5 AND VERY SMALL => 0      77.10844         22.574955
VERY SMALL => 0            75.4386          22.751324
0 => 5                     74.550125        51.146385
1 => 5                     73.03371         22.92769
5 AND G51 => 0             71.2             15.696649

Remarks: All the rules mined above have high confidence. The rules of length 2 were already mined and interpreted in the previous step, so only the rules longer than 2 are discussed here.
1. 0 AND VERY SMALL => 5: Among taxpayers that do not pay late and are very small, 99% have tax payable below 5 million/month.
2. 0 AND G51 => 5: Among taxpayers that comply with their tax obligations and belong to the industry "Wholesale and commission trade (except motor vehicles and motorcycles)", 90% have monthly tax payable below 5 million.
3. 0 AND SMALL => 5: Among taxpayers that do not pay late and are small, 81% have tax payable below 5 million/month.
4. 5 AND VERY SMALL => 0: Among taxpayers with tax payable below 5 million/month that are very small, 77% pay on time.
5. 5 AND G51 => 0: 71% of the taxpayers with tax payable below 5 million/month that operate in "Wholesale and commission trade (except motor vehicles and motorcycles)" fulfil their tax payment obligations.

Some implications of these rules: Small and very small taxpayers with tax payable below 5 million/month, especially those in "Wholesale and commission trade (except motor vehicles and motorcycles)", do not require much attention in collection enforcement, because taxpayers in this group usually comply strictly with their payment obligations.

3.5. Classification by decision tree learning


In classification by decision tree learning, once the problem has been defined and the data selected, the next step is to create a training data set used to build the model and a test set used to evaluate the model's accuracy. A model that reaches acceptable accuracy is then applied to new data. Classification with ODM goes through the following main steps:
Prepare 3 data sets (identify the classification attribute; the 3 data sets must have the same structure).
Set the parameters: choose the algorithm and define the cost matrix.
Build the model from the parameters that were set. In addition, specify which cost matrix to use, the key attribute that uniquely identifies a record, the target attribute (the class attribute), and the training data set.
Test on the test data set: apply the classification model to the test data and compare the predictions with the target attribute to evaluate accuracy. Classification can be run here with or without the cost matrix.
Finally, use the model if its accuracy is acceptable: apply it to unclassified data and produce predictions.

Applied to the tax database, classification can be used to:
Predict taxpayers likely to owe tax, in support of collection enforcement.
Predict taxpayers suspected of violations or fraud, in support of tax inspection.

The indicators usually taken as the basis for analysis in tax inspection include:
Ratios reflecting solvency, profitability and efficiency, asset structure and capital structure, and ratios related to tax declaration.
Enterprise size: by revenue, by capital, by fixed assets.
Risk determined by: enterprise size, type of enterprise, level of compliance in tax payment, production and business efficiency, and the enterprise's tax declaration record.

There are many ways to analyse these indicators. The ratios of an enterprise can be computed and compared with the same enterprise over different periods, or with the standard ratios of its industry. Ratios of enterprises in the same economic sector can be examined over several years together with the industry average for each year. The revenue and costs of each enterprise can be compared across years and against the industry's average revenue and costs.

In practice, the more indicators are combined in the analysis and the more accurate the collected data, the more reliable the conclusions. Sharing information across agencies is also very important, for example obtaining industry statistics from the General Statistics Office. Since the purpose here is experimental, the mining tasks in this thesis should be regarded as illustrations of what data mining can do, to be developed later with a full analysis of all indicators.

3.5.1 Classifying taxpayers by comparing ratios across years


Defining the mining task
Following the approach of analysing a taxpayer's ratios over the years and comparing them with the overall ratios of the industry, the task is: based on each taxpayer's profitability ratio over two years and on the profitability ratio of its industry, decide whether the taxpayer needs to be reviewed.

Profitability ratio = (Net profit + Interest expense) / Net revenue

Selecting the data
The figures are taken from the taxpayers' income statements (reports on business results), with the fields: tax identification number, report type, year, report item, amount. The taxpayer's industry code is taken from the industry data.

Data preprocessing
Extract the items needed to compute the profitability ratio, using the data for the two years 2004 and 2005 for comparison.
Compute the average profitability ratio of each industry for 2004 and 2005.
To experiment with both the Oracle mining tool and See5, a small part of the data is selected, restricted to a few industries: K70 (science and technology activities), D26 (manufacture of products from minerals), I60 (road transport), D22 (publishing, printing and reproduction of recorded media), C14 (coal mining and other mining), C10 (mining of hard coal, lignite and peat), and J65 (financial intermediation, except insurance and pension funding).

The data for building the decision tree are:
Tax identification number (TIN)
Industry, at the 3-character level only (NGANHSX)
Difference in the taxpayer's profitability ratio between the 2 years (SoTSSinhLoi)
Difference in the industry's profitability ratio (SoTS)
Class field indicating whether the taxpayer needs to be reviewed (XEMXET)

Setting the parameters and defining the cost matrix:
Cost matrix: the rows are the actual classes (review, no review) and the columns are the predicted classes (review = 1, no review = 0).
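The two derived quantities used in this task, the profitability ratio and its change between the two years, can be sketched as follows. The function names are hypothetical, and the sign convention (2005 minus 2004, so that a negative value means a decline) is an assumption chosen to match the negative threshold reported for the ODM tree.

```python
def profitability_ratio(net_profit, interest_expense, net_revenue):
    """(net profit + interest expense) / net revenue, as defined in the text."""
    if net_revenue == 0:
        return None  # the ratio is undefined without revenue
    return (net_profit + interest_expense) / net_revenue


def year_over_year_change(ratio_2004, ratio_2005):
    """Change in the ratio from 2004 to 2005; negative means a decline."""
    return ratio_2005 - ratio_2004
```

The same change can be computed for the industry average, which gives the second input attribute of the decision tree.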

Choose the decision tree algorithm.
Build the model: this is the step that constructs the decision tree.
Test and evaluate the model: apply it to the test data and evaluate accuracy with and without the cost matrix.

Run on the tax data, the results are as follows: accuracy is the same with and without the cost matrix, namely 80%. The decision tree is shown below.

Figure 3.6 Decision tree built with ODM: ratio analysis task

Remarks: The result shows that the selected industries share a common threshold for classification. If a taxpayer's profitability ratio falls by a certain amount compared with the previous year, the taxpayer has to be reviewed. Here the threshold is -0.00166: for the industries considered, if a taxpayer's profitability ratio in 2005 is more than 0.00166 lower than the same taxpayer's ratio in 2004, the taxpayer is classified as needing review. In practice, a taxpayer whose profitability ratio falls by some amount while the industry as a whole is growing, with profitability ratios rising every year, does need to be reviewed.

Applying the same data to See5 gives the following result: the error rate is reported as 8% and the accuracy as 82% (the two figures as given do not sum to 100%), which is higher than with ODM. The decision tree is shown below.

Figure 3.7 Decision tree built with See5: ratio analysis task

The demo tool builds a more detailed tree, and its accuracy is also higher. For a tool that mines large data sets, however, the complexity of the tree has to be balanced against accuracy. The decision tree produced by See5 can be stated as follows:
If the taxpayer's profitability ratio fell by no more than 0.0029 compared with the previous year, no review is needed yet.
If it fell by more than 0.0029, the change in the industry's profitability ratio is examined next. If the industry's ratio fell by less than 0.0108 compared with the previous year, the taxpayer does not need to be reviewed; if it fell by more than 0.0108, the taxpayer needs to be reviewed.
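Read as code, the See5 tree described above is a two-level test. The sketch below is one possible reading: the function and parameter names are hypothetical, declines are passed as positive numbers, and the treatment of values exactly on a threshold is an assumption, since the text does not specify it.

```python
def needs_review(taxpayer_drop, industry_drop,
                 taxpayer_threshold=0.0029, industry_threshold=0.0108):
    """Apply the two-level rule read off the See5 tree.

    Both arguments are declines in profitability ratio versus the previous
    year, expressed as positive numbers (a convention assumed here).
    """
    if taxpayer_drop <= taxpayer_threshold:
        return False  # small decline: no review yet
    # Larger decline: the industry-level change decides.
    return industry_drop > industry_threshold
```

A taxpayer is flagged only when both its own decline and the industry's decline exceed their thresholds, which is how the text states the second branch.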

3.5.2 Classifying taxpayers using one year's data


Defining the mining task
Compare a taxpayer's figures for one year with the corresponding industry averages. The indicators considered, taken from the taxpayer's income statement, are:

Profitability ratio = (Net operating profit + Interest expense) / Net revenue
Total revenue = Net revenue from sales of goods and services + Revenue from financial activities + Other income
Cost = Financial expenses + Selling expenses + General and administrative expenses + Other expenses

Selecting the data
The figures are taken from the taxpayers' income statements. The taxpayer's industry code is taken from the industry data.

Data preprocessing
Extract the items needed to compute the profitability ratio, total revenue and cost for each year.
Compute the industry averages: average profitability ratio, average revenue and average cost of the industry for each year.
The experiment is again also run on See5, so a small part of the data is selected, restricted to the same industries as in the previous task (K70, D26, I60, D22, C14, C10, J65).

The data for building the decision tree are:
Tax identification number (TIN)
Industry, at the 3-character level only (NGANHSX)
Profitability ratio (TS)
Total revenue (DT)
Cost (CP)
Class field indicating whether the taxpayer needs to be reviewed (XEMXET)
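The indicators above and their comparison with the industry averages can be sketched as follows. The function names and dictionary keys are hypothetical; the direction of each difference follows the query that creates `tr_So1Nganh` in the appendix.

```python
def total_revenue(net_sales, financial_income, other_income):
    """Total revenue as the sum of the three components listed in the text."""
    return net_sales + financial_income + other_income


def total_cost(financial, selling, administrative, other):
    """Cost as the sum of the four components listed in the text."""
    return financial + selling + administrative + other


def relative_to_industry(taxpayer, industry):
    """Differences against the industry averages for one year.

    The signs follow the appendix query: industry minus taxpayer for the
    ratio (TS) and revenue (DT), taxpayer minus industry for cost (CP).
    """
    return {
        "TS": industry["ts"] - taxpayer["ts"],
        "DT": industry["dt"] - taxpayer["dt"],
        "CP": taxpayer["cp"] - industry["cp"],
    }
```

With these signs, a positive TS means the taxpayer's profitability ratio is below the industry average, which is the direction the thresholds in the results refer to.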

The data are held in three views corresponding to the 3 data sets for building, testing and applying to new data: tr_So1Nganh_Build_v, tr_So1Nganh_Test_v, tr_So1Nganh_Apply_v.

Setting the parameters and defining the cost matrix:

Cost matrix:

Cost                     Predicted: review (1)   Predicted: no review (0)
Review (actual) (1)      0                       5
No review (actual) (0)   1                       0
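One common way to use such a cost matrix is to predict the class with the lowest expected cost. The sketch below illustrates that rule with the costs from the cost table in the appendix; how ODM applies the matrix internally is not described in the text, so this is an assumption, and the function name is hypothetical.

```python
# COST[actual][predicted], with 1 = review and 0 = no review.
COST = {0: {0: 0, 1: 1}, 1: {0: 5, 1: 0}}


def min_cost_class(p_review):
    """Pick the prediction with the lowest expected cost.

    p_review is an estimated probability that the true class is 1.
    """
    expected = {
        predicted: (1 - p_review) * COST[0][predicted]
        + p_review * COST[1][predicted]
        for predicted in (0, 1)
    }
    return min(expected, key=expected.get)
```

Because missing a taxpayer that should be reviewed costs 5 while an unnecessary review costs 1, this rule already predicts "review" when the estimated probability is well below one half.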

Choose the decision tree algorithm.
Build the model: construct the decision tree from tr_So1Nganh_Build_v.
Test and evaluate the model on tr_So1Nganh_Test_v.

Results:
Applied to the test data without the cost matrix, accuracy is 80%, with the following outcome:

Actual value   Predicted value   Count
0              0                 20
1              0                 5

Applied to the test data with the cost matrix, accuracy is 96%, with the following outcome:

Actual value   Predicted value   Count
0              0                 19
0              1                 1
1              1                 5
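The two accuracy figures, and the effect of the cost matrix, can be reproduced from the counts in the two result tables. The function name below is hypothetical; the costs are those of the cost table in the appendix.

```python
# (actual, predicted) -> cost, as in the cost table of the appendix.
COST = {(0, 0): 0, (0, 1): 1, (1, 0): 5, (1, 1): 0}


def evaluate(confusion):
    """Return (accuracy, total_cost) for {(actual, predicted): count}."""
    total = sum(confusion.values())
    correct = sum(n for (actual, predicted), n in confusion.items()
                  if actual == predicted)
    cost = sum(COST[pair] * n for pair, n in confusion.items())
    return correct / total, cost


without_cost_matrix = {(0, 0): 20, (1, 0): 5}
with_cost_matrix = {(0, 0): 19, (0, 1): 1, (1, 1): 5}
```

Without the cost matrix the 5 taxpayers that should be reviewed are all missed, for a total cost of 25; with it they are all found at the price of one unnecessary review, for a total cost of 1.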

The decision tree is shown below.

Figure 3.8 Decision tree built with ODM: one-year data task

Remarks: Based on the data, ODM selects TS (profitability ratio) as the only test attribute used to build the decision tree. The result shows that the industries considered share a common threshold for classification. If the difference between a taxpayer's profitability ratio and the overall ratio of its industry is less than 0.00939, the taxpayer does not need to be reviewed. Otherwise the taxpayer has to be reviewed.

Applying the same data to See5 gives the following result: the error rate is reported as 1.3% and the accuracy as 89.7% (again the two figures as given do not sum to 100%), still higher than with ODM. The decision tree is shown below.

Figure 3.9 Decision tree built with See5: one-year analysis task

Remarks: As in the previous task, the tree built by See5 is more detailed. The algorithm aims to fit the training sample as closely as possible, so the resulting tree is more complex. The decision tree produced by See5 can be stated as follows:
If the taxpayer's profitability ratio is less than 0.0084 below the overall ratio of the industry, no review is needed yet. If it is more than 0.0081 below the overall ratio, the examination continues.
The next test is on the industry. If the industry is one of K70, D22, I65, or D36, no review is needed. Industry C14 has to be reviewed. If the industry is I60, cost (CP) is examined next. If the industry is C10, the profitability ratio relative to the industry (TS) is examined next.

In practice, combining many indicators with accurate industry statistics, together with actual inspection results at taxpayers or the sound judgement of experienced inspectors, will make it possible to build a more complete classification model. A highly accurate model will help improve the effectiveness of tax administration.

CHAPTER 4. CONCLUSION
With the topic "Research and application of some data mining techniques on the database of the Vietnamese tax sector", this thesis is a first step in studying data mining problems and the issues that must be considered when mining data, so that the techniques can later be applied in practice. Within the scope of the thesis it was not possible to experiment with and apply many mining techniques. The thesis is limited mainly to applying association rule mining and classification to the tax database. Although the mining results do not yet carry much practical significance, they give a first indication of the value of applying mining techniques to discover knowledge in the database.

Results achieved in the thesis:
1. Studied the basic functions and techniques of data mining and the cases in which they apply.
2. Because time did not allow an in-depth study of all data mining techniques, the thesis concentrated on association rule mining and on mining by decision tree learning. It covers the algorithms, a comparison of their performance, the issues to consider when improving association rule mining algorithms, and newer algorithms that maintain performance.
3. Experimentally applied several data mining tasks to the tax database, which yielded initial experience in mining knowledge from real data:
a) Data preparation is very important and time-consuming. Real data always have problems that must be handled, such as missing values, and the database may even lack important information needed for mining altogether.
b) Working with analysis experts is very important for correctly identifying the predictor attributes, for stating the requirements on the target attribute, and for determining important value thresholds.

FUTURE RESEARCH DIRECTIONS
1. Study the basic theory of data mining more broadly and deeply so that it can be applied in practice more accurately.
2. Test and evaluate the algorithms more thoroughly on large data sets.
3. Mine data in the data warehouse with multidimensional, multilevel association rules.
4. Approaches to data correction.
6. Study tools that support graphical display of results (graphs, charts).
7. Advocate launching a project to build an information analysis system serving tax administration, debt collection enforcement, and inspection and audit. The project will involve close cooperation with business analysis experts in the steps of preparing data for mining and evaluating results.


APPENDIX
Selected code from the data mining work on the tax database:

Association rule mining:
1. Preparing the data
drop table tr_dondoc;
create table tr_dondoc as
  (select a.tin, a.nganhsx, a.tongDT/12 DT, PhaiNop/12 PN, 0 nopcham
   from tr_tysuat a
   where nam = 2005);
-- 567 recs
update tr_dondoc a
  set nopcham = 1
  where exists (select tin
                from tr_nopcham b
                where b.tin = a.tin
                  and to_char(b.ngay_bdau, 'rrrr') = '2005');
commit;
-- 178 recs
-- Export, then import into schema SH

drop table tr_dondoc1;
create table tr_dondoc1 as
  (select tin, nganhsx,
          decode(sign(dt - 100000000), -1, 'VERY SMALL',
          decode(sign(dt - 500000000), -1, 'SMALL',
          decode(sign(dt - 1000000000), -1, 'MEDIUM',
          decode(sign(dt - 5000000000), -1, 'LARGE',
                 'VERY LARGE')))) DT,
          decode(sign(round(PN/1000000) - 5), -1, '5',
          decode(sign(round(PN/1000000) - 10), -1, '10',
          decode(sign(round(PN/1000000) - 20), -1, '20',
          decode(sign(round(PN/1000000) - 30), -1, '30',
          decode(sign(round(PN/1000000) - 50), -1, '50',
          decode(sign(round(PN/1000000) - 100), -1, '100',
          decode(sign(round(PN/1000000) - 500), -1, '500',
          decode(sign(round(PN/1000000) - 1000), -1, '1000',
          decode(sign(round(PN/1000000) - 5000), -1, '5000',
                 '>5000'))))))))) PN,
          nopcham
   from tr_dondoc);

2. Converting to the format required for association rule mining

drop table tr_dondoc2;
create table tr_dondoc2 as
  (select tin, nganhsx, 1 has_it from tr_dondoc1
   union
   select tin, dt, 1 has_it from tr_dondoc1
   union
   select tin, to_char(pn) pn, 1 has_it from tr_dondoc1
   union
   select tin, to_char(nopcham) nopcham, 1 has_it from tr_dondoc1);

GRANT SELECT ON TR_dondoc2 TO DMUSER;

DROP VIEW TR_dondoc;
CREATE VIEW TR_dondoc AS
  SELECT * FROM sh.tr_dondoc2;

DROP VIEW TR_dondoc_AR;
CREATE VIEW TR_dondoc_AR AS
  SELECT TIN,
         CAST(COLLECT(DM_Nested_Numerical(SUBSTRB(nganhsx, 1, 10), has_it))
              AS DM_Nested_Numericals) tinnganhsx
  FROM tr_dondoc
  GROUP BY TIN;

3. Setting the parameters

BEGIN
  EXECUTE IMMEDIATE 'DROP TABLE ar_dondoc_settings';
EXCEPTION
  WHEN OTHERS THEN NULL;
END;
/
set echo off
CREATE TABLE ar_dondoc_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(30));
set echo on

BEGIN
  INSERT INTO ar_dondoc_settings VALUES
    (dbms_data_mining.asso_min_support, 0.1);
  INSERT INTO ar_dondoc_settings VALUES
    (dbms_data_mining.asso_min_confidence, 0.1);
  INSERT INTO ar_dondoc_settings VALUES
    (dbms_data_mining.asso_max_rule_length, 2);
  COMMIT;
END;
/

4. Building the model

BEGIN
  DBMS_DATA_MINING.DROP_MODEL('AR_dondoc_nghe');
EXCEPTION
  WHEN OTHERS THEN NULL;
END;
/
BEGIN
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'AR_dondoc_nghe',
    mining_function     => DBMS_DATA_MINING.ASSOCIATION,
    data_table_name     => 'TR_dondoc_AR',
    case_id_column_name => 'TIN',
    settings_table_name => 'ar_dondoc_settings');
END;
/

5. Retrieving the mining results

List of frequent itemsets:

SELECT item, support, number_of_items
FROM (SELECT I.column_value AS item, F.support, F.number_of_items
      FROM TABLE(DBMS_DATA_MINING.GET_FREQUENT_ITEMSETS(
                   'AR_dondoc_nghe', 10)) F,
           TABLE(F.items) I
      ORDER BY number_of_items, support, column_value);

List of rules:

SELECT ROUND(rule_support, 4) support,
       ROUND(rule_confidence, 4) confidence,
       antecedent, consequent
FROM TABLE(DBMS_DATA_MINING.GET_ASSOCIATION_RULES('AR_dondoc_nghe', 10))
ORDER BY confidence DESC, support DESC;

Classification and prediction with decision trees:
1. Preparing the data

create table tr_sinhloi as
  (select a.tin, a.nganhsx, sotssinhloi, SoTS, 0 xemxet
   from tr_so_1DT a, SoNganh b
   where a.nganhsx = b.nganhsx);

create table tr_So1Nganh as
  (select a.tin, a.nganhsx, a.nam,
          (b.ts_nganh - a.tssinhloi) ts,
          (b.DTnganh - a.TongDT) DT,
          (a.ChiPhi - b.ChiPhiNganh) CP,
          0 xemxet
   from tr_tysuat a, tr_nganh2004 b
   where a.nam = 2004 and a.nganhsx = b.nganhsx
   union
   select a.tin, a.nganhsx, a.nam,
          (b.ts_nganh - a.tssinhloi) ts,
          (b.DTnganh - a.TongDT) DT,
          (a.ChiPhi - b.ChiPhiNganh) CP,
          0 xemxet
   from tr_tysuat a, tr_nganh2005 b
   where a.nam = 2005 and a.nganhsx = b.nganhsx);

2. Creating the cost matrix

DROP TABLE dt_sh_NOP_cost;
CREATE TABLE dt_sh_NOP_cost (
  actual_target_value    NUMBER,
  predicted_target_value NUMBER,
  cost                   NUMBER);

INSERT INTO dt_sh_NOP_cost VALUES (0, 0, 0);
INSERT INTO dt_sh_NOP_cost VALUES (0, 1, 1);
INSERT INTO dt_sh_NOP_cost VALUES (1, 0, 5);
INSERT INTO dt_sh_NOP_cost VALUES (1, 1, 0);
COMMIT;

3. Setting the parameters

DROP TABLE dt_sh_BTC_settings;
CREATE TABLE dt_sh_BTC_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(30));

BEGIN
  -- Populate settings table
  INSERT INTO dt_sh_BTC_settings VALUES
    (dbms_data_mining.algo_name, dbms_data_mining.algo_decision_tree);
  INSERT INTO dt_sh_BTC_settings VALUES
    (dbms_data_mining.clas_cost_table_name, 'dt_sh_NOP_cost');
  COMMIT;
END;
/

4. Building the model

BEGIN
  DBMS_DATA_MINING.DROP_MODEL('DT_SH_Clas_TS1DT');
EXCEPTION
  WHEN OTHERS THEN NULL;
END;
/

BEGIN
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'DT_SH_Clas_TS1DT',
    mining_function     => dbms_data_mining.classification,
    data_table_name     => 'tr_so_1DT_v',
    case_id_column_name => 'tin',
    target_column_name  => 'xemxet',
    settings_table_name => 'dt_sh_BTC_settings');
END;
/

THESIS ABSTRACT
Data mining is becoming ever more important and urgent, especially for organizations that hold very large volumes of data. The tax sector's data warehouse has been accumulated over many years, and discovering the knowledge hidden in these data will certainly be of considerable help to tax administration. The main purpose of the thesis is to study data mining functions and to test their applicability to the tax database.

After surveying the basic functions of data mining, the thesis concentrates on association rule mining and on classification by decision tree learning. It examines recent efficient algorithms and identifies the main issues to be addressed in each technique, such as handling missing data, pruning to reduce size, and reducing the number of database scans.

Oracle Data Mining (ODM) is chosen as the tool for mining knowledge from the tax database. The association rule experiments show the relationships between a taxpayer's business industry, enterprise size, average revenue and tax payable on the one hand, and compliance with tax payment obligations on the other. Decision tree classification is then applied for classification and prediction on the tax database: taxpayers are classified on several analysis indicators (industry, profitability ratio, total revenue, cost, tax payable) to produce a classification on the target attribute, answering the question of whether a taxpayer is suspected of tax violations. This knowledge supports tax inspection.

The knowledge mined in these experiments certainly still has many shortcomings, and comments from teachers and tax experts are most welcome. It is hoped that the mining will be brought to completion in a tax data mining project serving inspection work, where the conditions for success come together: close cooperation between technology and business experts, whose valuable experience provides the basis for knowledge discovery.
