1. Introduction to Clustering Techniques in Data Mining
Clustering is a very important technique in data mining and belongs to the class of Unsupervised Learning methods in Machine Learning. There are many different definitions of the technique, but in essence clustering can be understood as a process that groups the given objects into clusters so that objects in the same cluster are similar to one another, while objects in different clusters are dissimilar. The purpose of clustering is to uncover the intrinsic grouping of the data. Every clustering algorithm produces clusters. However, no single criterion is regarded as the best for evaluating the quality of a cluster analysis; the choice depends on the purpose of the clustering, such as data reduction, natural clusters, useful clusters, or outlier detection.
Clustering can be applied in many fields, for example:
Marketing: identifying groups of customers (potential customers, valuable customers, classifying and predicting customer behaviour, ...) who use the company's products or services, helping the company devise a more effective business strategy;
Biology: grouping animals and plants based on their attributes;
Libraries: tracking readers and books, and predicting readers' needs;
Insurance, Finance: grouping users of insurance and financial services, predicting customer trends, and detecting financial fraud (identifying frauds);
WWW: document classification; clustering web users (clustering weblog); ...
Clustering techniques are classified as follows (see figure)
2. The K-Means Algorithm
K-Means is a very important and widely used clustering algorithm. Its main idea is to partition the given objects into K clusters (K is the number of clusters, a positive integer fixed in advance) so that the sum of squared distances between the objects and the centroid of their cluster is minimized. The K-Means algorithm is described as follows
The K-Means algorithm proceeds through the following main steps:
1. Randomly choose K centroids for the K clusters. Each cluster is represented by its centroid.
2. Compute the distance from each object to the K centroids (Euclidean distance is commonly used).
3. Assign each object to the nearest group.
4. Recompute the centroid of each group.
5. Repeat from step 2 until no object changes group.
Illustrative example of the K-Means algorithm: suppose we have 4 medicines A, B, C, D, each described by 2 features X and Y as follows. Our goal is to group these medicines into 2 groups (K=2) based on their features.
Step 1. Initialize the centroids of the 2 groups. Suppose we choose A as the centroid of the first group (coordinates c1(1,1)) and B as the centroid of the second group (coordinates c2(2,1)).
Step 2. Compute the distance from each object to the centroid of each group (Euclidean distance)
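Each column of the distance matrix (D) corresponds to one object (the first column to object A, the second column to object B, ...). The first row of the distance matrix holds the distances from the objects to the centroid of the first group (c1), and the second row holds the distances from the objects to the centroid of the second group (c2). For example, the distance from medicine C=(4,3) to centroid c1(1,1) is 3.61 and to centroid c2(2,1) is 2.83, computed as follows:
d(C, c1) = sqrt((4-1)^2 + (3-1)^2) = sqrt(13) ≈ 3.61
d(C, c2) = sqrt((4-2)^2 + (3-1)^2) = sqrt(8) ≈ 2.83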
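The two distances quoted for medicine C can be checked with a few lines of Python. This is only a minimal sketch: the helper name `euclidean` is made up for illustration, and only the coordinates stated in the text (C and the two initial centroids) are used.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

C = (4, 3)    # medicine C, as given in the text
c1 = (1, 1)   # initial centroid of group 1 (medicine A)
c2 = (2, 1)   # initial centroid of group 2 (medicine B)

d_c1 = euclidean(C, c1)   # sqrt(3^2 + 2^2) = sqrt(13)
d_c2 = euclidean(C, c2)   # sqrt(2^2 + 2^2) = sqrt(8)
print(round(d_c1, 2), round(d_c2, 2))
```

Repeating the same call for every object against every centroid fills in one column of the distance matrix at a time.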
Step 3. Assign each object to the nearest group
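After the first iteration, group 1 contains a single object, A, and group 2 contains the remaining objects B, C, D.
Step 4. Recompute the coordinates of the group centroids from the coordinates of the objects in each group. Group 1 contains only object A, so its centroid is unchanged, c1(1,1). The centroid of group 2 is computed as follows: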
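The centroid update is simply the coordinate-wise mean of the group's members. The sketch below assumes D = (5, 4): the coordinates of D are not stated anywhere in this text, so that value is an assumption made purely for illustration. The helper name `centroid` is likewise hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

A = (1, 1)   # given in the text (initial centroid c1)
B = (2, 1)   # given in the text (initial centroid c2)
C = (4, 3)   # given in the text
D = (5, 4)   # ASSUMPTION: D's coordinates are not stated in the text

c1 = centroid([A])          # group 1 holds only A, so c1 stays at (1, 1)
c2 = centroid([B, C, D])    # group 2 holds B, C and D
print(c1, c2)
```

Under that assumption the new centroid of group 2 is (11/3, 8/3), roughly (3.67, 2.67); a different D would of course give a different value.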
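Step 5. Recompute the distance from each object to the new centroids
Step 6. Assign each object to a group
Step 7. Recompute the centroids of the new groups
Step 8. Recompute the distance from each object to the new centroids
Step 9. Assign each object to a group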
We see that G2 = G1 (no object changes group), so the algorithm stops and the clustering result is as follows:
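The whole walk-through can be condensed into a short Python sketch of the five main steps listed earlier. It is an illustration rather than a reference implementation: the function name `kmeans` and its parameters are made up here, the initial centroids are A and B as in Step 1, and D = (5, 4) is an assumption because the coordinates of D are not stated in the text.

```python
import math

def kmeans(points, initial_centroids, max_iter=100):
    """Plain K-Means. Returns (assignment, centroids).

    points            -- list of coordinate tuples
    initial_centroids -- list of K starting centroids chosen by the caller
    """
    centroids = list(initial_centroids)   # copy so the caller's list is untouched
    assignment = None
    for _ in range(max_iter):
        # Assign every object to the index of its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Stop as soon as no object changes group.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its current members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:   # an empty group keeps its previous centroid
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids

# A, B, C come from the text; D = (5, 4) is an ASSUMPTION for illustration.
medicines = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
labels, centres = kmeans(list(medicines.values()), [medicines["A"], medicines["B"]])
for name, label in zip(medicines, labels):
    print(name, "-> group", label + 1)
print("centroids:", centres)
```

Under these assumptions the loop stops on its third pass, with A and B in one group and C and D in the other. Starting from different initial centroids can lead to a different partition, which is why the choice in Step 1 matters.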
K-Means has the advantage of being simple, easy to understand, and easy to implement. However, it has some limitations: its effectiveness depends on the choice of the number of groups K (which must be fixed in advance), and the cost of the repeated distance computations becomes high when the number of clusters K and the data set are large.
3. Deploying a Clustering Application with the WeKa Software
In this example, I will show how to build a KnowledgeFlow that applies clustering based on the K-Means algorithm in the Data Mining Software WeKa. The data used for clustering in this example is a bank's customer classification data (data file bank.arff). bank.arff contains 11 attributes and 600 customers (instances). The structure and data distribution of bank.arff are shown below. You can download the bank.arff file here:
Our task is to use the K-Means algorithm to partition the customers into K groups (K=5 in this example) based on their similarity across the 11 attributes. We build a KnowledgeFlow in WeKa as follows:
Set the parameters of the K-Means algorithm, such as the number of clusters (K=5 in this example), the distance function (Euclidean distance in this example), ...
The detailed clustering results are as follows:
PS. The next topic is SOM (Self-Organizing Maps) in Clustering Techniques. Please send all comments to chucnv@ud.edu.vn.