
VIETNAM NATIONAL UNIVERSITY, HO CHI MINH CITY
UNIVERSITY OF SCIENCE
FACULTY OF INFORMATION TECHNOLOGY

PHAM QUYNH NGA 0512030
HOANG TRONG NGHIA 0512031
HO TRAN NHAT THUY 0512046

COURSE PROJECT
DATA MINING AND APPLICATIONS

TOPIC: DATA PREPROCESSING
BASED ON:
Data Mining: Concepts and Techniques, Jiawei Han

Ho Chi Minh City, 01/2008
Table of contents

Table of contents
List of figures
Project summary
Chapter 2 Data preprocessing
Section 2.1. Why preprocess data?
2.1.1 Real-world data is dirty
2.1.2 Why is data dirty?
2.1.3 Why is data preprocessing important?
2.1.4 The main tasks in data preprocessing
Section 2.2. Data summarization
2.2.1 Measuring central tendency
2.2.2 Measuring the dispersion of data
2.2.3 Types of graphs
Section 2.3. Data cleaning
2.3.1 Missing data
2.3.2 Noisy data
2.3.3 The data cleaning process
Section 2.4. Data integration and transformation
2.4.1 Data integration
2.4.2 Data transformation
Section 2.5. Data reduction
2.5.1 Attribute subset selection
2.5.2 Dimensionality reduction
2.5.2.1 Wavelet Transform
2.5.2.2 Principal Component Analysis
2.5.3 Reducing the size of the data set
2.5.3.1 Regression
2.5.3.2 Log-Linear
2.5.3.3 Gaussian Mixture Models
2.5.3.4 K-Means Clustering
2.5.3.5 Fuzzy C-Means Clustering
2.5.3.6 Hierarchical Clustering
Section 2.6. Summary
Section 2.7. Solutions to selected exercises

List of figures
Figure 2.1 The main tasks in data preprocessing
Figure 2.2 Symmetric data
Figure 2.3 Positively skewed data
Figure 2.4 Negatively skewed data
Figure 2.5 Boxplot of unit price data for items sold at AllElectronics
Figure 2.6 Histogram
Figure 2.7 Quantile plot
Figure 2.8 Q-q plot
Figure 2.9 Scatter plots where a correlation exists
Figure 2.10 Scatter plots where no correlation exists
Figure 2.11 Scatter plot
Figure 2.12 Local regression (loess) curve
Figure 2.13 Illustration of the regression technique
Figure 2.14 Illustration of the clustering technique
Figure 2.15 Forward Selection
Figure 2.16 Backward Elimination
Figure 2.17 Decision Tree Induction
Figure 2.18 Plot of the two-dimensional data samples after normalization
Figure 2.19 Mean-adjusted data with eigenvectors
Figure 2.20 Data Table
Figure 2.21 Mixture of Gaussian Distribution
Figure 2.22 Pseudocode of the EM algorithm
Figure 2.23 Flowchart of the Fuzzy C-Means Clustering algorithm
Figure 2.24 Table of distances between data samples
Figure 2.25 Diagram of the clustering tree
Figure 2.26 Clustering by removing the longest link
Figure 2.27 Plot of the correlation between the two variables Age and %M
Figure 2.28 Boxplot of the data

Project summary
This project presents the issues surrounding data preprocessing: why it is needed, why it matters, and which preprocessing techniques are required. Data preprocessing is laborious and costs considerable time and effort, but it cannot be skipped, because real-world data is usually of low quality and this strongly affects the data mining process. Doing the preprocessing stage well improves both the speed and the quality of data mining.
Chapter 2 Data preprocessing
Section 2.1. Why preprocess data?
2.1.1 Real-world data is dirty:
- Incomplete: attribute values are missing, or important attributes are missing altogether.
Example: occupation = ""
- Noisy: the data contains errors or values outside the valid domain.
Example: salary = "-10"
- Inconsistent: values or naming conventions contradict each other.
Example: Age = "42" together with Birthday = "03/07/1997", or data formats that differ, such as a rank attribute recorded sometimes as A, B, C and sometimes as the numbers 1, 2, 3.
2.1.2 Why is data dirty?
- Incomplete data can arise for several reasons:
o Some important attributes are not provided. For example, in a sales transaction customers may decline to give their personal information for privacy reasons, and a person without a driver's license cannot supply a driver's license number when one is requested.
o Some data is not recorded simply because it was not considered important at the time of entry. In other words, the data is viewed differently at entry time and at analysis time.
o Human, software, or hardware problems.
o Data that is inconsistent with previously recorded data may be deleted, which leads to data loss.
- Noisy data can arise for several reasons:
o The data collection instruments used are faulty.
o Human or computer errors occur when the data is recorded.
o Errors occur during data transmission.
o Technology limitations, such as a limited buffer size during data transmission and reception.
o Incorrect data can also result from inconsistencies in naming conventions or data formats. For example, if the date attribute is formatted as mm/dd/yyyy, the value 20/11/2007 in some tuple may be treated as noise.
- Inconsistent data can result from:
o Data collected from many different sources.
o Attributes represented by different names in different databases. For example, the customer identification attribute may be customer_id in one database but cust_id in another.
o Functional dependency violations.
2.1.3 Why is data preprocessing important?
- Data cleaning fills in missing values, smooths noisy data, identifies and removes out-of-domain values, and resolves inconsistencies.
- If users believe the data is dirty, they will not trust any mining result derived from it.
- Dirty data can also confuse the mining process and produce unreliable results. A large amount of redundant data can slow down and disrupt the knowledge discovery process.
- Adding a data cleaning step therefore helps us avoid unnecessary redundant data during analysis.
- Data cleaning is an important step in knowledge discovery, because without quality data there are no quality mining results. Quality decisions must be based on quality data. For example, duplicate or missing data can produce incorrect statistics.
- Data enrichment, data cleaning, and data encoding play an important role in building a data warehouse.
Criteria that define data quality:
- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Value added
- Interpretability
- Usability
2.1.4 The main tasks in data preprocessing:
- Data cleaning: fill in missing values, smooth noisy data, identify or remove out-of-domain values, and resolve inconsistencies.
- Data integration: combine multiple databases, data cubes, or files.
- Data transformation: normalize and aggregate data.
- Data reduction: reduce the size of the data while producing the same or similar analytical results. One form of data reduction is data discretization, which is very useful for automatically generating concept hierarchies from numeric data.

Figure 2.1 The main tasks in data preprocessing

Section 2.2. Data summarization
For data preprocessing to succeed, you need an overall picture of your data. Descriptive data summarization techniques can be used to identify the typical properties of the data and to highlight which data values should be treated as noise or as outliers. We therefore introduce the basic concepts of data summarization before turning to the specific tasks of data preprocessing.
For many preprocessing tasks, users need to know the characteristics of the data with respect to both its central tendency and its dispersion. Measures of central tendency include the mean, median, mode, and midrange; measures of dispersion include quartiles, the interquartile range (IQR), and the variance. These descriptive statistics help in understanding the distribution of the data. In particular, we introduce the notions of distributive measure, algebraic measure, and holistic measure. Knowing which kind of measure we are dealing with helps us choose an efficient way to compute it.
2.2.1 Measuring central tendency
In this section we look at different ways to measure the central tendency of data. The most common and most effective measure of the center of a data set is the mean. Let $x_1, x_2, \ldots, x_N$ be a set of N values, for example the values of an attribute such as salary. The mean of this set of values is:

$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N} \tag{2.1}$$

This measure corresponds to the built-in aggregate function average (avg() in SQL), which relational database systems provide.
A distributive measure is a measure that can be computed for a data set by partitioning the data into smaller subsets, computing the measure for each subset, and then merging the results to obtain the value for the original (entire) data set. sum() and count() are distributive measures because they can be computed in this way. Other examples are max() and min(). An algebraic measure is a measure that can be computed by applying an algebraic function to one or more distributive measures. Hence average (or mean()) is an algebraic measure, because it can be computed as sum() / count(). When data cubes are computed, sum() and count() are typically saved during precomputation, so deriving the average for a data cube is straightforward.
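The distinction between distributive and algebraic measures can be sketched in a few lines of Python. This is a minimal illustration, not code from the source: the function names `partial_sum_count` and `merge_mean` and the sample partitions are assumptions made for the example.

```python
from typing import Iterable, List, Tuple


def partial_sum_count(values: Iterable[float]) -> Tuple[float, int]:
    """Distributive step: summarize one partition as (sum, count)."""
    values = list(values)
    return sum(values), len(values)


def merge_mean(partials: List[Tuple[float, int]]) -> float:
    """Algebraic step: combine per-partition (sum, count) pairs into the mean."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    if count == 0:
        raise ValueError("the mean of an empty data set is undefined")
    return total / count


# Hypothetical partitions of one salary column (illustrative values only).
partitions = [[30, 36, 47], [50, 52], [56, 60, 63, 70]]
partials = [partial_sum_count(p) for p in partitions]
overall_mean = merge_mean(partials)

# The merged result equals the mean computed over all values at once.
all_values = [v for p in partitions for v in p]
direct_mean = sum(all_values) / len(all_values)
```

Only the small (sum, count) pairs need to be kept per partition, which is why an average can be derived cheaply from precomputed aggregates.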
Sometimes each value $x_i$ in a set is associated with a weight $w_i$, for $i = 1, \ldots, N$. The weights reflect the significance, importance, or frequency of occurrence attached to their respective values. In this case we can compute:

$$\bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_N x_N}{w_1 + w_2 + \cdots + w_N} \tag{2.2}$$

This measure is called the weighted arithmetic mean or weighted average. Note that the weighted average is another example of an algebraic measure.
Although the mean is the single most useful quantity for describing a data set, it is not always the best way to measure the center of the data. A major problem with the mean is its sensitivity to extreme values (outliers). Even a small number of extreme values can distort the mean. For example, the mean salary at a company may be pushed up substantially by a few highly paid managers. Similarly, the mean score of a class in an exam can be pulled down by a few very low scores. To offset the effect of extreme values, we can instead use the trimmed mean, which is the mean computed after cutting off the extreme values. For example, we can sort the salary data and remove the top 2% and the bottom 2% of the values before computing the mean. We should avoid trimming too large a portion of the data (such as 20%) at both ends, because this can lose valuable information. Example: suppose the age attribute has the values 3, 13, 15, 16, 19, 20, 21, 25, 40. To compute the trimmed mean, we first cut off some values at both ends; here we cut one value from each end, so the remaining values are 13, 15, 16, 19, 20, 21, 25. We then compute the mean of these remaining values.
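The weighted mean of equation (2.2) and the trimmed mean from the age example can be sketched as follows. The helper names `weighted_mean` and `trimmed_mean` are assumptions for this illustration, and trimming is expressed here as a count `k` of values removed from each end rather than as a percentage.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and the k largest values."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("trimming would remove every value")
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Age values from the example in the text.
ages = [3, 13, 15, 16, 19, 20, 21, 25, 40]
plain = sum(ages) / len(ages)      # influenced by the extremes 3 and 40
trimmed = trimmed_mean(ages, 1)    # mean of 13, 15, 16, 19, 20, 21, 25
```

For this data the plain mean is 172/9 and the trimmed mean is 129/7, so removing one value from each end shifts the estimate only slightly; the effect grows with more extreme outliers.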
For skewed data, a better measure of the center of the data is the median. Suppose the given data set has N distinct values sorted in increasing order. If N is odd, the median is the middle value of the ordered set. Otherwise, if N is even, the median is the average of the two middle values.
Example: consider again the age attribute from the previous example.
- If the set of age values is 13, 15, 16, 19, 20, 21, 25, the set has 7 values, so the median is the value of the middle element: median = 19.
- If the set of age values is 13, 15, 16, 18, 20, 21, 25, 30, the set has 8 values, so the median is the average of the two middle elements: median = (18 + 20)/2 = 19.
A holistic measure is a measure that must be computed on the entire data set as a whole. It cannot be computed by partitioning the data into subsets and merging the values obtained for each subset. The median is an example of a holistic measure. Holistic measures are more expensive to compute than distributive measures.
However, we can easily approximate the median of a data set. Suppose the data is grouped into intervals according to the values $x_i$, and the frequency (the number of data values) of each interval is known. For example, people may be grouped by annual salary into intervals such as 10-20K, 20-30K, and so on. Call the interval that contains the median the median interval. We can approximate the median of the entire data set with the formula:

$$\text{median} \approx L_1 + \left( \frac{N/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}} \right) \times \text{width} \tag{2.3}$$

where $L_1$ is the lower boundary of the median interval, N is the number of values in the entire data set, $\left(\sum \text{freq}\right)_l$ is the sum of the frequencies of all intervals below the median interval, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and width is the width of the median interval.
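Example: suppose the age attribute is grouped into intervals with the following frequencies:

Age       Frequency
1-5       200
5-15      450
15-20     300
20-50     1500
50-80     700

The total number of values is N = 200 + 450 + 300 + 1500 + 700 = 3150, so N/2 = 1575. The cumulative frequencies are 200, 650, 950, 2450, and 3150; the running total first reaches 1575 in the interval 20-50, so that is the median interval. Therefore $L_1 = 20$ (the lower boundary of the interval 20-50), width = 50 - 20 = 30, $\text{freq}_{\text{median}} = 1500$, and $\left(\sum \text{freq}\right)_l = 200 + 450 + 300 = 950$ (the three intervals below 20-50). The approximate median is:

$$\text{median} \approx 20 + \frac{3150/2 - 950}{1500} \times 30 = 32.5$$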
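Formula (2.3) can be sketched as a short function that locates the median interval from the cumulative frequencies, that is, the first interval in which the running total reaches N/2. The function name `approximate_median` and the tuple layout `(lower, upper, freq)` are assumptions made for this illustration.

```python
def approximate_median(intervals):
    """Approximate the median of grouped data with formula (2.3).

    intervals: list of (lower, upper, freq) tuples sorted by lower bound.
    """
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("no data values")
    half = n / 2
    cumulative = 0  # total frequency of the intervals below the current one
    for lower, upper, freq in intervals:
        if cumulative + freq >= half:
            # This is the median interval: interpolate linearly inside it.
            return lower + (half - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("unreachable for a non-empty frequency table")


# The five age intervals and frequencies from the example table.
age_groups = [(1, 5, 200), (5, 15, 450), (15, 20, 300),
              (20, 50, 1500), (50, 80, 700)]
median_estimate = approximate_median(age_groups)
```

The estimate always lies inside the median interval, which gives a quick sanity check on any hand computation.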
Another measure of central tendency is the mode. The mode of a data set is the value that occurs most frequently. The greatest frequency may correspond to several different values, in which case there is more than one mode. Data sets with one, two, or three modes are called unimodal, bimodal, and trimodal, respectively. In general, a data set with two or more modes is called multimodal. In the extreme case where each value occurs only once, there is no mode.
For unimodal frequency curves that are moderately skewed (asymmetric), we have the following empirical relation:
mean - mode = 3 x (mean - median) (2.4)
This means that the mode of a moderately skewed unimodal frequency curve can easily be computed if the mean and median are known.
In a unimodal frequency curve with a perfectly symmetric data distribution, the mean, median, and mode are all at the same central value, as in Figure 2.2. However, most real-world data is not symmetric. It may be positively skewed, where the mode occurs at a value smaller than the median (Figure 2.3), or negatively skewed, where the mode occurs at a value larger than the median (Figure 2.4).
The midrange can also be used to assess the central tendency of a data set. It is the average of the largest and smallest values in the set.
Example: consider again the age values 13, 15, 16, 18, 20, 21, 25.
Midrange = (13 + 25)/2 = 19.
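The mode and midrange definitions can be sketched with the standard library. The function names `modes` and `midrange` are assumptions for this illustration; `modes` returns a list so that bimodal and multimodal data are handled, and it returns an empty list when every value occurs only once, following the convention stated in the text.

```python
from collections import Counter


def modes(values):
    """Return all values with the highest frequency, or [] if none repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == top)


def midrange(values):
    """Average of the smallest and the largest value."""
    return (min(values) + max(values)) / 2


ages = [13, 15, 16, 18, 20, 21, 25]
age_midrange = midrange(ages)           # (13 + 25) / 2
age_modes = modes(ages)                 # no repeated value
bimodal = modes([1, 2, 2, 3, 3, 4])     # two values tie for the top count
```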

Figure 2.2 Symmetric data

Figure 2.3 Positively skewed data

Figure 2.4 Negatively skewed data
2.2.2 Measuring the dispersion of data
The degree to which numeric data tends to spread is called the dispersion, or variance, of the data. The most common measures of dispersion are the range, the five-number summary (based on quartiles), the interquartile range, and the standard deviation. Boxplots are charts that can be drawn from the five-number summary and are a useful tool for identifying outliers.
Range, Quartiles, Outliers and Boxplots
Let $x_1, x_2, \ldots, x_N$ be a set of values for some attribute. The range of the set is the difference between the largest value (max()) and the smallest value (min()). Assume that the data is sorted in increasing numeric order.
The k-th percentile of a sorted data set is the value $x_i$ such that k% of the data is less than or equal to $x_i$. The median is the 50th percentile.
Example: if 30% of the values of the age attribute are less than or equal to 20, then the 30th percentile is 20.
The most commonly used percentiles other than the median are the quartiles. The first quartile, $Q_1$, is the 25th percentile; the third quartile, $Q_3$, is the 75th percentile. The quartiles, together with the median, give some indication of the center, spread, and shape of a distribution. The distance between the first and third quartiles is a simple measure of spread. This distance is called the interquartile range (IQR) and is defined as:

$$IQR = Q_3 - Q_1 \tag{2.5}$$

By the same reasoning as in our analysis of the median in Section 2.2.1, $Q_1$ and $Q_3$ are holistic measures.
A common rule of thumb for identifying suspected outliers is to single out values that fall at least 1.5 × IQR above the third quartile or below the first quartile. That is, a value is considered an outlier if it is greater than $Q_3 + 1.5 \times IQR$ or less than $Q_1 - 1.5 \times IQR$.
Example:
- If $Q_1 = 60$ and $Q_3 = 100$, then $IQR = Q_3 - Q_1 = 40$.
- Consider the value 175: because $175 > Q_3 + 1.5 \times IQR = 160$, the value 175 is an outlier.
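The 1.5 × IQR rule can be sketched as a small filter. The function name `iqr_outliers` and the sample values are assumptions for this illustration; the quartiles are passed in directly so that the example reproduces the numbers used in the text.

```python
def iqr_outliers(values, q1, q3, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [v for v in values if v < low_fence or v > high_fence]


# Quartiles from the example in the text: Q1 = 60, Q3 = 100, so IQR = 40.
# The fences are therefore 60 - 60 = 0 and 100 + 60 = 160.
sample = [40, 60, 80, 100, 175]   # illustrative values
flagged = iqr_outliers(sample, q1=60, q3=100)
```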
Because $Q_1$, the median, and $Q_3$ contain no information about the endpoints of the data, a fuller summary of the shape of a distribution can be obtained by also providing the lowest and highest data values. This is the five-number summary. The five-number summary of a distribution consists of the median, the first and third quartiles, and the smallest and largest values, written in the order: Minimum, $Q_1$, Median, $Q_3$, Maximum.
Boxplots are a popular way of visualizing a distribution. A boxplot incorporates the five-number summary as follows:

Figure 2.5 Boxplot of unit price data for items sold at AllElectronics
The ends of the box are at the quartiles $Q_1$ and $Q_3$, so the length of the box is the IQR.
The median is marked by a horizontal line inside the box.
Two lines outside the box (called whiskers) extend to the smallest (minimum) and largest (maximum) values.
To deal with outliers in a boxplot, the whiskers are extended to the extreme low and high values if and only if those values lie within 1.5 × IQR of the quartiles. Otherwise, the whiskers end at the most extreme values that lie within 1.5 × IQR of the quartiles, and the remaining cases are plotted individually. Boxplots can be used to compare several sets of compatible data.
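The quantities a boxplot displays can be sketched as follows. This is an illustration under stated assumptions: quartile conventions differ between tools, and this sketch uses `statistics.quantiles` with `method="inclusive"`, which is only one of several reasonable choices. The function name `boxplot_stats` and the sample prices are hypothetical.

```python
import statistics


def boxplot_stats(values, k=1.5):
    """Five-number summary plus whisker ends and individually plotted outliers."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    return {
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ordered[-1],
        "whisker_low": inside[0],     # most extreme value within the low fence
        "whisker_high": inside[-1],   # most extreme value within the high fence
        "outliers": [v for v in ordered if v < low_fence or v > high_fence],
    }


# Hypothetical unit prices (illustrative values only).
unit_prices = [40, 60, 70, 80, 90, 100, 120, 172, 202]
stats = boxplot_stats(unit_prices)
```

With these values the upper whisker stops at 172 rather than at the maximum, because 202 lies beyond $Q_3 + 1.5 \times IQR$ and is reported separately.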
Example: Figure 2.5 shows boxplots of unit price data for several items sold at the branches of AllElectronics during a given time period.
[Figure 2.5: one boxplot each for Branch 1 and Branch 2; vertical axis: unit price ($), from 20 to 200]
For branch 1, we see that the median is $80, Q1 is $60, and Q3 is $100. Note the two values outside the 1.5 × IQR range, 172 and 202, which are plotted individually. They are outliers.
Variance and standard deviation
The variance of N values $x_1, x_2, \ldots, x_N$ is

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2 = \frac{1}{N} \left[ \sum_{i=1}^{N} x_i^2 - \frac{1}{N} \left( \sum_{i=1}^{N} x_i \right)^2 \right] \tag{2.6}$$

where $\bar{x}$ is the mean defined in equation (2.1). The standard deviation $\sigma$ is the square root of the variance $\sigma^2$.
The basic properties of the standard deviation are:
- $\sigma$ measures spread about the mean, so it should be used only when the mean is chosen as the measure of center.
- $\sigma = 0$ only when there is no spread, that is, when all values are the same. Otherwise $\sigma > 0$.
The variance and standard deviation are algebraic measures, because they can be computed from distributive measures. That is, N (count() in SQL), $\sum x_i$ (the sum() of $x_i$), and $\sum x_i^2$ (the sum() of $x_i^2$) can be computed in any partition and then merged to feed equation (2.6). Computing the variance and standard deviation is therefore scalable in large databases.
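The claim that the variance can be assembled from per-partition count, sum, and sum of squares can be sketched as follows, using the right-hand form of equation (2.6). The function names `partial_stats` and `merge_variance` and the sample partitions are assumptions for this illustration.

```python
import math


def partial_stats(values):
    """Distributive step: (count, sum, sum of squares) for one partition."""
    values = list(values)
    return len(values), sum(values), sum(v * v for v in values)


def merge_variance(partials):
    """Algebraic step: population variance from merged partition statistics."""
    n = sum(p[0] for p in partials)
    s = sum(p[1] for p in partials)
    ss = sum(p[2] for p in partials)
    if n == 0:
        raise ValueError("the variance of an empty data set is undefined")
    # Right-hand side of (2.6): (1/N) * [sum(x^2) - (1/N) * (sum(x))^2]
    return (ss - s * s / n) / n


partitions = [[4, 8], [9, 15]]          # illustrative values
variance = merge_variance([partial_stats(p) for p in partitions])
std_dev = math.sqrt(variance)
```

One caveat worth keeping in mind: the sum-of-squares form can lose precision through cancellation when the values are large relative to their spread, so it is shown here for its mergeability rather than as the numerically safest formula.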
2.2.3 Types of graphs
Besides bar charts, pie charts, and line graphs, which are widely used to present graphical or statistical data, several other kinds of graphs are used to display summaries and distributions of data. These include histograms, quantile plots, q-q plots, scatter plots, and loess curves. Such graphs are very helpful for inspecting data visually.
A histogram, also called a frequency histogram, is a visual method for summarizing the distribution of a given attribute. A histogram for an attribute A partitions the data of A into disjoint subsets called buckets. Typically, the buckets have equal width (uniform). Each bucket is represented by a rectangle whose height equals the count or relative frequency of the values in the bucket. If the values of A are categorical, for example car_model_name, then one rectangle is drawn for each distinct value of A, and the resulting graph is usually called a bar chart. If the values of A are numeric, the term histogram is preferred. The partitioning rules used to construct histograms for numeric attributes are presented in Section 2.5.4. In a histogram whose buckets have equal width, each bucket represents an equal-width range of the numeric attribute A.

Unit price ($)    Count of items sold
40                275
43                300
47                250
...               ...
74                360
75                515
78                540
...               ...
115               320
117               270
120               350

Table 2.1
Figure 2.6 shows a histogram for the data in Table 2.1, where the buckets are defined by equal-width ranges representing $20 increments and the frequency is the count of items sold. Histograms have existed for at least a century and are widely used. However, they are less effective than quantile plots, q-q plots, and boxplots for comparing groups of univariate observations.

Figure 2.6 Histogram
A quantile plot is a simple and effective way to get an overview of the distribution of univariate data. First, it displays all of the data for the given attribute. Second, it adds quantile information. The technique used in this step differs slightly from the percentile computation in Section 2.2.2. Let $x_i$, for $i = 1$ to $N$, be the data sorted in increasing order, so that $x_1$ is the smallest observation and $x_N$ is the largest. Each observation $x_i$ is paired with a percentage $f_i$, which indicates that approximately $100 f_i\%$ of the data is less than or equal to $x_i$. We say "approximately" because there may be no value with exactly a fraction $f_i$ of the data at or below it. Note that the 0.25 quantile corresponds to quartile Q1, the 0.50 quantile is the median, and the 0.75 quantile is quartile Q3.
Let

$$f_i = \frac{i - 0.5}{N}$$

These numbers increase in equal steps of 1/N, ranging from 1/(2N) (slightly above 0) to 1 - 1/(2N) (slightly below 1). On a quantile plot, $x_i$ is plotted against $f_i$. This lets us compare different distributions based on their quantiles. For example, given the quantile plots of sales data for two different time periods, we can compare their Q1, median, and Q3 values, as well as other $f_i$ values. Figure 2.7 shows a quantile plot for the unit price data of Table 2.1.

Figure 2.7 Quantile plot
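The points of a quantile plot can be computed in one pass over the sorted data. This sketch assumes the pairing $f_i = (i - 0.5)/N$, which matches the stated range from 1/(2N) to 1 - 1/(2N) in steps of 1/N; the function name `quantile_plot_points` and the sample values are assumptions for this illustration.

```python
def quantile_plot_points(values):
    """Return (f_i, x_i) pairs with f_i = (i - 0.5) / N for sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


# A few unit prices (illustrative subset, deliberately unsorted).
points_q = quantile_plot_points([47, 40, 43, 75])
# Each pair can be passed to any plotting tool with f_i on the horizontal axis.
```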
Quantile-quantile plot (q-q plot)
A q-q plot graphs the quantiles of one univariate distribution against the corresponding quantiles of another. It is a powerful visualization tool that lets the user see whether there is a shift in going from one distribution to the other.
Suppose we have two sets of observations for the variable unit price, taken from two different branches. Let $x_1, x_2, \ldots, x_N$ be the data from the first branch and $y_1, y_2, \ldots, y_M$ the data from the second branch, each sorted in increasing order. If M = N (both data sets have the same number of points), we simply plot $y_i$ against $x_i$, where $y_i$ and $x_i$ are both the $(i - 0.5)/N$ quantiles of their respective data sets. If M < N (the second branch has fewer observations than the first), there are only M points on the q-q plot. Here $y_i$ is the $(i - 0.5)/M$ quantile of the y data, and it is plotted against the $(i - 0.5)/M$ quantile of the x data. This computation typically involves interpolation.
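The pairing of quantiles for a q-q plot, including the interpolation needed when the two samples differ in size, can be sketched as follows. Several choices here are assumptions of this illustration rather than something the text prescribes: the helper names `quantile_at` and `qq_points`, the use of linear interpolation, and the clamping of quantile levels that fall outside the range covered by a sample.

```python
def quantile_at(ordered, f):
    """Interpolate the f quantile, placing ordered[j] at level (j + 0.5) / n."""
    n = len(ordered)
    pos = f * n - 0.5                  # fractional 0-based index
    if pos <= 0:
        return ordered[0]              # clamp below the first level
    if pos >= n - 1:
        return ordered[-1]             # clamp above the last level
    lo = int(pos)
    frac = pos - lo
    return ordered[lo] + frac * (ordered[lo + 1] - ordered[lo])


def qq_points(x, y):
    """Return (x quantile, y quantile) pairs at the smaller sample's levels."""
    xs, ys = sorted(x), sorted(y)
    m = min(len(xs), len(ys))
    levels = [(i - 0.5) / m for i in range(1, m + 1)]
    return [(quantile_at(xs, f), quantile_at(ys, f)) for f in levels]


# Hypothetical unit prices for two branches (N = 4, M = 2).
branch1 = [40, 44, 50, 60]
branch2 = [42, 58]
points = qq_points(branch1, branch2)
```

Here the smaller sample contributes its own sorted values, while the larger sample is interpolated at the levels 0.25 and 0.75.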
Figure 2.8 shows a q-q plot for the unit price data of items sold at two different branches of AllElectronics during a given time period. Each point corresponds to the same quantile in both data sets and shows the unit price of items sold at branch 1 versus branch 2 at that quantile.

Figure 2.8 Q-q plot
(Note that, to aid comparison, we also draw a straight line representing the case in which, at every quantile, the unit price at the two branches is the same. In addition, the darker points correspond to the data for Q1, the median, and Q3.)
As an example, the lowest point in the left corner of this figure corresponds to the 0.03 quantile. We see that at this quantile the unit price of items sold at branch 1 is slightly lower than at branch 2. In other words, 3% of the items sold at branch 1 cost $40 or less, while 3% of the items sold at branch 2 cost $42 or less. At the highest quantile, the unit price of items sold at branch 2 is slightly lower than at branch 1. In general, there is a shift in the distribution of branch 1 with respect to branch 2, in that the unit prices of items sold at branch 1 tend to be lower than those at branch 2.
A scatter plot is one of the most effective graphical methods for determining whether there appears to be a relationship, pattern, or trend between two numeric attributes. To construct a scatter plot, each pair of values is treated as a pair of coordinates and plotted as a point in the plane. Figure 2.11 shows a scatter plot for the data set in Table 2.1. The scatter plot is a useful method for getting a first look at bivariate data, for seeing clusters of points and outliers, and for exploring possible correlations. In Figure 2.9 we see examples of positive and negative correlation between two attributes in two different data sets. Figure 2.10 shows three cases in which there is no correlation between the two attributes in the given data sets.

Figure 2.9 Scatter plots where a correlation exists

Figure 2.10 Scatter plots where no correlation exists
When more than two attributes must be examined, a scatter-plot matrix is used, which is an extension of the scatter plot. Given n attributes, a scatter-plot matrix is an n × n grid of scatter plots that visualizes each attribute against every other attribute. The scatter-plot matrix becomes less effective as the number of dimensions under study grows.

Figure 2.11 Scatter plot
A loess curve (local regression curve) is another useful graphical tool. A smooth curve is added to a scatter plot to give a better view of the pattern of dependence. The word loess is short for "local regression". Figure 2.12 shows a loess curve for the data set in Table 2.1.

Figure 2.12 Local regression (loess) curve
Fitting a loess curve requires two parameters: alpha, a smoothing parameter, and lambda, the degree of the polynomials that are fitted by the regression. Alpha can be any positive number (typical values are between 1/4 and 1), while lambda can be 1 or 2. The goal in choosing alpha is to produce a fit that is as smooth as possible without unduly distorting the underlying pattern in the data. The larger alpha is, the smoother the curve, but the worse the fit. If alpha is very small, the curve tracks the underlying pattern closely, but it may overfit the data.
If the underlying pattern of the data has a "gentle" curvature with no local maxima or minima, then locally linear fitting is usually sufficient (lambda = 1). However, if there are local maxima or minima, locally quadratic fitting is needed (lambda = 2).
In summary, descriptive data summaries provide valuable insight into the overall behavior of the data. By helping to identify noise and outliers, they are especially useful for data cleaning.
Section 2.3. Data cleaning
Real-world data tends to be incomplete, noisy, and inconsistent. Data cleaning fills in missing values, smooths noisy values, identifies outliers, and resolves inconsistencies in the data. In this section we study the basic methods of data cleaning.
2.3.1 Missing data
Imagine that you need to analyze AllElectronics sales and customer data. You notice that many tuples have no recorded value for several attributes, such as customer income. How can you fill in the missing values for such attributes? Consider the following methods:
1. Ignore the tuple: This method is usually applied when the class label is missing (assuming the mining task involves classification). It is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the number of missing values per attribute is large.
2. Fill in the missing value manually: In general, this approach is time-consuming and may not be feasible for a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing values with the same constant, such as the label "Unknown" or −∞. If missing values are replaced by "Unknown", the mining program may mistakenly conclude that this is an interesting value, because it occurs so frequently in the data. Although this method is simple, it is not reliable.
Example: missing values are replaced with the constant "Unknown"

Before         After
ID   Age       ID   Age
01   10        01   10
02             02   Unknown
03   22        03   22
...            ...
20   20        20   20
21             21   Unknown
...            ...
48   45        48   45
49   20        49   20
50             50   Unknown
4. Use the attribute mean: For example, if the mean of the age attribute is 35, this value is used to replace the missing values of the age attribute.

Before         After
ID   Age       ID   Age
01   10        01   10
02             02   35
03   22        03   22
...            ...
20   20        20   20
21             21   35
...            ...
48   45        48   45
49   20        49   20
50             50   35

5. Use the mean of all tuples belonging to the same class as the given tuple: Missing values are replaced by the mean of the values that belong to the same class as the tuple with the missing value. This method is similar to method 4, except that method 4 uses the mean over all tuples, whereas this method uses the mean over the tuples in the same class as the tuple under consideration.
Example: if customers are grouped by age into 1-20, 20-30, 30-50, and so on, and the mean income in the class 20-30 is $45000, then the missing income value of the tuple with ID 03 is replaced by $45000 (because tuple 03 belongs to the class 20-30).

Before                   After
ID   Age   Income        ID   Age   Income
01   25    $50000        01   25    $50000
02   40    $40000        02   40    $40000
03   30                  03   30    $45000

6. Use the most probable value: This value may be determined with regression, with inference-based tools using a Bayesian formalism, or with decision tree induction. For example, using the other attributes in your data set, you can build a decision tree to predict the missing values of the income attribute. Decision trees, regression, and Bayesian methods are described in detail in Chapter 6.
The filled-in values may not be correct. Nevertheless, method 6 is a popular strategy. Compared with the other methods, it uses the most information from the available data to predict the missing values. By using the values of the other attributes to estimate the missing values of income, there is a greater chance that the relationships between income and the other attributes are preserved.
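Methods 4 and 5 can be sketched as two small functions over a list of records. Everything named here is an assumption made for the illustration: the record layout (dictionaries with `None` marking a missing value), the helper names `fill_with_mean` and `fill_with_class_mean`, the `age_bracket` class function, and the sample customers.

```python
def fill_with_mean(rows, attr):
    """Method 4: replace missing values of attr with the overall mean."""
    known = [r[attr] for r in rows if r[attr] is not None]
    mean = sum(known) / len(known)
    return [dict(r, **{attr: mean}) if r[attr] is None else dict(r)
            for r in rows]


def fill_with_class_mean(rows, attr, class_of):
    """Method 5: replace missing values with the mean of the same class."""
    totals = {}  # class label -> (sum of known values, count of known values)
    for r in rows:
        if r[attr] is not None:
            s, c = totals.get(class_of(r), (0.0, 0))
            totals[class_of(r)] = (s + r[attr], c + 1)
    filled = []
    for r in rows:
        new = dict(r)
        if new[attr] is None:
            s, c = totals[class_of(r)]  # KeyError if the class has no known value
            new[attr] = s / c
        filled.append(new)
    return filled


def age_bracket(row):
    """Hypothetical class function: customers aged 20 to 29 form one class."""
    return "20-30" if 20 <= row["age"] < 30 else "other"


customers = [
    {"id": 1, "age": 25, "income": 50000},
    {"id": 2, "age": 40, "income": 40000},
    {"id": 3, "age": 28, "income": None},
    {"id": 4, "age": 22, "income": 40000},
]
by_mean = fill_with_mean(customers, "income")
by_class = fill_with_class_mean(customers, "income", age_bracket)
```

Both functions return new records and leave the input untouched, which makes it easy to compare the two strategies on the same data.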
It is important to note that, in some cases, a missing value may not imply an error in the data. For example, when applying for a credit card, applicants may be asked to supply their driver's license number. Applicants who do not have a driver's license will naturally leave this field blank. Forms should allow special values such as "not applicable". Software routines can also be used to detect other null values, such as "don't know", "?", or "none". An attribute should have one or more rules covering the null condition. The rules may specify whether null values are allowed, or how such values should be handled or transformed. A field may also be left blank if it is to be supplied in a later step of the process. Therefore, good design of databases and of data entry procedures should minimize the number of missing or erroneous values from the very first step.
2.3.2 D liu b nhiu (noisy)
Mt s k thut lm mn d liu:
1. Binning: Binning methods smooth a sorted data value by consulting its neighborhood, that is, the values around it. The sorted values are distributed into a number of buckets, or bins. Because binning methods consult the neighborhood of values, they perform local smoothing. Consider the following example:
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
In this example, the data for price are first sorted and then partitioned into equal-frequency bins of size 4 (each bin contains four values).
- In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, 9, and 15 in Bin 1 is 9, so each original value in this bin is replaced by 9. The means of Bin 2 (22.75) and Bin 3 (29.25) are rounded to 23 and 29.
- Similarly, in smoothing by bin medians, each value in a bin is replaced by the bin median.
- In smoothing by bin boundaries, the minimum and maximum values in a bin are taken as the bin boundaries. Each value in the bin is replaced by the closest boundary value. For example, the boundaries of Bin 1 are 4 and 15; the value 8 is closer to 4 than to 15, so it is replaced by 4.
Binning is discussed in more detail in Section 2.6.
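The binning example above can be reproduced with a short Python sketch. The function names are illustrative, not taken from the source; the sketch assumes the number of values divides evenly into the bins, rounds bin means to the nearest integer as the example does, and sends ties to the lower boundary.

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split an already sorted list into n_bins bins of equal size."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean, rounded to an integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        # Assumption: a value exactly in the middle goes to the lower boundary.
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

With the price data from the text, `by_means` and `by_boundaries` match the two smoothed listings shown in the example.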
2. Regression: Data can be smoothed by fitting the data to a function. Linear regression involves finding the best line to fit two attributes (or variables), so that one attribute can be used to predict the other. Outliers can then be identified visually as the values that lie far from the fitted line. Multiple linear regression is an extension of linear regression in which more than two attributes are involved and the data are fit to a multidimensional surface. Regression is described in more detail in Section 2.5.4 and in Chapter 6.

Figure 2.13 Illustration of the regression technique
3. Clustering: Outliers may be detected by clustering. Values that are similar according to some criterion are organized into groups, or clusters. Intuitively, values that fall outside the set of clusters may be considered outliers (Figure 2.14). This technique is presented in Chapter 7.
Figure 2.14 Illustration of the clustering technique
Many data smoothing methods are also data reduction methods. For example, the binning techniques described above reduce the number of distinct values per attribute. Concept hierarchies are a form of data discretization that can also be used for data smoothing. For example, a concept hierarchy for the attribute price may map price values into inexpensive, moderately-priced, and expensive, thereby reducing the number of values the mining process has to handle. Some classification methods, such as neural networks, have built-in data smoothing mechanisms. Classification is presented in Chapter 6.
2.3.3 The data cleaning process
The first step in the data cleaning process is discrepancy detection. Discrepancies can be caused by several factors, including poorly designed data entry forms (with many optional fields), human error in data entry, deliberate errors (for example, respondents not wanting to divulge personal information), and outdated data. Discrepancies may also arise from inconsistent data representations or the inconsistent use of codes. Errors in the devices that record data, and system errors, are further sources of discrepancies. Errors can also occur when the data are used for purposes other than originally intended.
Discrepancy detection: as a starting point, use any knowledge you already have about the properties of the data. For example, what are the domain and data type of each attribute? What are the acceptable values for each attribute? Do all values fall within the expected range? The descriptive data summaries of Section 2.2 are useful for grasping the central tendency of the data and identifying outliers. For example, values that lie more than two standard deviations away from the mean of an attribute may be flagged as potential outliers. In this step, you may write your own scripts or use some of the tools discussed later. From this, you may find noise, outliers, and unusual values that need investigation.
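The two-standard-deviation check mentioned above can be sketched in a few lines of Python. The function name and the sample values are invented for illustration, and the population standard deviation is an assumption (the text does not say which variant is meant).

```python
import statistics


def flag_outliers(values, k=2.0):
    """Return the values lying more than k standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    return [v for v in values if abs(v - mean) > k * sd]


# Invented sample: nine similar readings and one extreme value.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
suspects = flag_outliers(readings)
```

Flagged values are only candidates for inspection; whether they are genuine errors still has to be checked against domain knowledge.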
When analyzing the data, be on the lookout for the inconsistent use of codes and any inconsistent data representations (such as 2004/12/25 and 25/12/2004 for a date attribute). Field overloading is another source of errors. It typically results when designers squeeze a new attribute into unused (bit) portions of attributes that are already defined (for example, using an unused bit of an attribute whose value range uses only 31 out of 32 bits).
The data should also be examined with regard to unique rules, consecutive rules, and null rules. A unique rule says that each value of an attribute must be different from all other values of that attribute. A consecutive rule says that there can be no missing values between the lowest and highest values of the attribute, and that all values must also be unique. A null rule specifies the use of blanks, question marks, special characters, or other strings that indicate the null condition, and how such values should be handled. As mentioned in Section 2.3.1, reasons for missing values may include (1) the person asked to provide a value for the attribute refuses or finds it not applicable (for example, the driver's license number is left blank because the person has no license); (2) the data entry person does not know the correct value; or (3) the value is to be provided in a later step of the process. The null rule should also specify how to record the null condition, for example, store 0 for numerical attributes, a blank for character attributes, or any other convention that may be in use (entries such as "don't know" or "?" should be converted to blank).
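As a rough illustration of the three rules, the sketch below normalizes null markers and checks the unique and consecutive rules on a list of values. The marker list and function names are assumptions made for this example, and `None` stands in for the "blank" convention described above.

```python
# Assumed set of strings that should be treated as null markers.
NULL_MARKERS = {"", "?", "none", "don't know", "not applicable"}


def normalize_null(value):
    """Null rule: map the assumed null markers to None, keep other values."""
    if value is None:
        return None
    if isinstance(value, str) and value.strip().lower() in NULL_MARKERS:
        return None
    return value


def satisfies_unique_rule(values):
    """Unique rule: no non-null value appears more than once."""
    present = [v for v in values if v is not None]
    return len(present) == len(set(present))


def satisfies_consecutive_rule(values):
    """Consecutive rule: unique integers with no gap between min and max."""
    present = [v for v in values if v is not None]
    if not present:
        return True
    return (satisfies_unique_rule(present)
            and len(present) == max(present) - min(present) + 1)
```

For example, `satisfies_consecutive_rule([3, 1, 2])` holds, while `[1, 2, 4]` fails because the value 3 is missing.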
A number of commercial tools can help with the discrepancy detection step. Data scrubbing tools use simple domain knowledge (such as knowledge of postal addresses, and spell-checking) to detect and correct errors. These tools rely on parsing and fuzzy matching techniques when cleaning data from multiple sources. Data auditing tools find discrepancies by analyzing the data to discover correlations and by clustering to detect outliers. These tools may also use the data summarization techniques described in Section 2.2.
Some data inconsistencies can be corrected manually. For example, errors made at data entry may be corrected by performing a paper trace. Most errors, however, require data transformation, which is the second step of the data cleaning process. Once discrepancies have been found, we need to define and apply transformations to correct them.
Many commercial tools can help with the data transformation step. Data migration tools allow simple transformations to be specified, such as replacing the string "gender" with "sex". ETL (extraction/transformation/loading) tools allow users to specify transformations through a graphical user interface (GUI). These tools typically support only a restricted set of transformations, so we often choose to write custom scripts for this step.
The two steps of discrepancy detection and data transformation are iterated. This process, however, is error-prone and time-consuming. Some transformations may introduce new discrepancies. Some nested discrepancies may only be detected after others have been fixed. For example, a typo such as "20004" in a year field may only surface once all date values have been converted to a uniform format. Transformations are often done as a batch process while the user waits without feedback. Only after the transformation is complete can the user go back and check that no new anomalies have been created. Typically, many iterations are required before the user is satisfied. Any tuples that cannot be handled automatically by a transformation are written to a file without any explanation of the reason for the failure. As a result, the whole data cleaning process also suffers from a lack of interactivity.
New approaches to data cleaning emphasize increased interactivity. Potter's Wheel, for example, is a publicly available data cleaning tool (see http://control.cs.berkeley.edu/abc) that integrates discrepancy detection and data transformation. Users gradually build a series of transformations by composing and debugging individual transformations, one step at a time, on a spreadsheet-like interface. The transformations can be specified graphically or by providing examples. Results are shown immediately on the screen. The user can undo a previous transformation, so that transformations that introduced additional errors can be erased. The tool automatically checks for discrepancies in the background on the latest transformed view of the data. Users can gradually develop and refine transformations as discrepancies are found, leading to more effective and efficient data cleaning.
Another approach to increasing interactivity in data cleaning is the development of declarative languages for specifying data transformation operators. Such work focuses on defining powerful extensions to SQL and algorithms that enable users to express data cleaning specifications efficiently.
As we learn more about the data, it is important to keep updating the metadata to reflect this knowledge. This will help speed up data cleaning on future versions of the same data store.
Section 2.4. Data integration and transformation
Data mining often requires data integration, that is, the merging of data from multiple sources. The data may also need to be transformed into forms appropriate for mining. This section describes both processes: data integration and data transformation.
2.4.1 Data integration
Data integration combines data from multiple sources into a single coherent data store. These sources may include databases, data cubes, or files.
There are a number of issues to consider during data integration, among them schema integration and object matching. In practice, many equivalent entities carry different names in different data sources. For example:
ID and MASO
Customer_ID and cust_number
Determining that such entities are equivalent is called the entity identification problem. The metadata of each attribute (name, meaning, data type, range of values, and null values) can help to resolve part of this problem. Such metadata can also help in transforming the data. For example, the attribute pay_type may take the values "H" or "S" in one database and 1 or 2 in another. Hence, this step also relates to data cleaning, as discussed in the previous section.
Redundancy is another important issue. An attribute (such as total annual revenue) may be redundant if it can be derived from other attributes. Inconsistencies in attribute (or dimension) naming can also cause redundancies in the resulting data set.
Redundant attributes can be detected by correlation analysis. Given two attributes, such analysis measures how strongly they are correlated (that is, how strongly one attribute depends on the other), based on the available data. For numerical attributes, we can evaluate the correlation between two attributes A and B by computing the correlation coefficient, also known as Pearson's product moment coefficient, named after its inventor, Karl Pearson. The coefficient is computed as follows:
\[ r_{A,B} = \frac{\sum_{i=1}^{N} (a_i - \bar{A})(b_i - \bar{B})}{N \sigma_A \sigma_B} = \frac{\sum_{i=1}^{N} (a_i b_i) - N \bar{A} \bar{B}}{N \sigma_A \sigma_B} \]
where:
- $N$: the number of records
- $a_i$, $b_i$: the values of attributes A and B in record $i$
- $\bar{A}$, $\bar{B}$: the mean values of A and B
- $\sigma_A$, $\sigma_B$: the standard deviations of A and B
- $\sum (a_i b_i)$: the sum of the AB cross-product (that is, for each record, the value of A is multiplied by the value of B in the same record).
Note that $-1 \le r_{A,B} \le +1$.
The possible cases are:
- $r_{A,B} > 0$: A and B are positively correlated, meaning that A increases as B increases. The larger the value, the stronger the correlation.
- $r_{A,B} = 0$: A and B are uncorrelated (there is no linear relationship between them).
- $r_{A,B} < 0$: A and B are negatively correlated (A increases as B decreases).
Hence, if the absolute value of $r_{A,B}$ is large, A or B may be removed as redundant. A scatter plot can also be used to view the correlation between attributes.
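The correlation coefficient can be computed directly from its cross-product form. The following is a minimal sketch in plain Python; the function name is illustrative, and the population standard deviation (dividing by N) is assumed so that it matches the N in the denominator of the formula.

```python
import math


def pearson_r(a, b):
    """Pearson correlation via (sum(a_i * b_i) - N * mean_a * mean_b) / (N * sd_a * sd_b)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Population standard deviations (divide by N).
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    cross = sum(x * y for x, y in zip(a, b))
    return (cross - n * mean_a * mean_b) / (n * sd_a * sd_b)


# Invented data: b is an exact multiple of a, so the correlation is +1.
r_pos = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_neg = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
```

A value near +1 or -1, as in this toy data, would suggest that one of the two attributes is a candidate for removal.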
Note that correlation does not imply causality. That is, if two attributes are correlated, this does not mean that one causes the other. For example, when analyzing a demographic database, we may find that the attributes "number of hospitals" and "number of car thefts" in a region are correlated. This does not mean that one causes the other. Both are in fact linked to a third attribute, namely population.
For categorical (discrete) data, a correlation relationship between two attributes A and B can be discovered by a chi-square ($\chi^2$) test. Suppose A has $c$ distinct values ($a_1, a_2, \ldots, a_c$) and B has $r$ distinct values ($b_1, b_2, \ldots, b_r$). The data tuples described by A and B can be shown as a contingency table, with $c$ columns and $r$ rows. Let $(A_i, B_j)$ denote the event that $(A = a_i, B = b_j)$. Each $(A_i, B_j)$ has its own cell in the table. The chi-square value is computed as follows:
\[ \chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} \qquad (2.9) \]
where $o_{ij}$ is the observed frequency (that is, the actual count) of the event $(A_i, B_j)$ and $e_{ij}$ is the expected frequency of $(A_i, B_j)$, computed as:
\[ e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{N} \]
where $N$ is the number of records, $count(A = a_i)$ is the number of tuples with $A = a_i$, and $count(B = b_j)$ is defined similarly. The sum in Equation (2.9) is computed over all $r \times c$ cells. Note that the cells that contribute the most to the chi-square value are those whose actual count ($o_{ij}$) is very different from the expected count ($e_{ij}$).
The chi-square statistic tests the hypothesis that A and B are independent. The test is based on a significance level, with $(r-1) \times (c-1)$ degrees of freedom. The use of this statistic is illustrated in the example below. If the hypothesis can be rejected, we can say that A and B are statistically related.
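A worked example:
Consider the following 2 x 2 table. We test whether gender and preferred reading are related.
              Male        Female       Total
Fiction       250 (90)    200 (360)    450
Non-fiction   50 (210)    1000 (840)   1050
Total         300         1200         1500
Suppose a group of 1,500 people takes part in a survey. The gender of each person is recorded, along with his or her preferred type of reading material. We thus have two attributes for each person, gender and preferred reading. The observed frequency of each event is recorded in the table above, and the numbers in parentheses are the expected frequencies. For example, the expected frequency for (male, fiction) is:
\[ e_{11} = \frac{count(male) \times count(fiction)}{N} = \frac{300 \times 450}{1500} = 90 \]
Note that in any row, the sum of the expected frequencies equals the total observed frequency for that row. Likewise, the sum of the expected frequencies in any column equals the total observed frequency for that column.
The chi-square value is:
\[ \chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 284.44 + 121.90 + 71.11 + 30.48 = 507.93 \]
For a 2 x 2 table, the degrees of freedom are (2-1) x (2-1) = 1. For 1 degree of freedom, the $\chi^2$ value needed to reject the hypothesis at the 0.001 significance level is 10.828. Since the value computed above is larger than this, we can reject the hypothesis that gender and preferred reading are independent. We conclude that the two attributes are strongly correlated for the given data set.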
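The arithmetic of this example can be checked with a short sketch. The function name and the layout of the `observed` list (rows are fiction and non-fiction, columns are male and female) are choices made for this illustration; the critical value 10.828 is the one quoted in the text.

```python
def chi_square(observed):
    """Return (chi-square value, expected-frequency table) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # e_ij = (row total * column total) / N
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))
    return chi2, expected


# Rows: fiction, non-fiction. Columns: male, female.
observed = [[250, 200], [50, 1000]]
chi2, expected = chi_square(observed)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
reject_independence = chi2 > 10.828  # critical value quoted in the text
```

Running this reproduces the expected frequencies in parentheses in the table and a chi-square value of about 507.94; the text's 507.93 comes from summing the four terms after rounding each one.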
In addition to detecting redundancies between attributes, redundancy should also be detected at the tuple (record) level, where two or more identical tuples exist for the same data entry. The use of denormalized tables (often done to improve performance by avoiding joins) is another source of data redundancy. Inconsistencies often arise between the various duplicates, due to inaccurate data entry or to updates that reach some but not all occurrences of the data. For example, if a purchase order database contains the purchaser's name and address as attributes, instead of a key referencing a purchaser database, discrepancies can occur: the same purchaser's name may appear with different addresses within the purchase order database.
Another important issue in data integration is the detection and resolution of data value conflicts. For the same real-world entity, attribute values from different sources may differ. This may be due to differences in representation. For example, weight may be measured in different units. For a hotel chain, the price of rooms in different cities may differ not only in currency but also in the services and taxes included. An attribute in one system may be recorded at a lower level of abstraction than the "same" attribute in another. For example, total_sales in one database may refer to the total sales of one branch of All_Electronics, while an attribute of the same name in another database may refer to the total sales of all All_Electronics stores in a given region.
When matching attributes from one database to another during integration, special attention must be paid to the structure of the data. This is to ensure that any attribute functional dependencies and referential (foreign key) constraints in the source system are preserved in the target system. For example, in one system a discount may be applied to the whole order, whereas in another system it is applied to each individual line item within the order. If this is not caught before integration, items in the target system may be discounted incorrectly.
The semantic and structural heterogeneity of data poses a great challenge in data integration. Careful integration helps to reduce or avoid redundancies and inconsistencies in the resulting data set. This in turn helps to improve the accuracy and speed of the subsequent mining process.
2.4.2 Data transformation
In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data transformation can involve the following:
- Smoothing: removing noise from the data. Techniques include binning, regression, and clustering.
- Aggregation: summarizing the data and constructing data cubes. For example, daily sales data may be aggregated into monthly or annual totals.
- Generalization: low-level or raw data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes such as street can be generalized to higher-level concepts such as city or country. Similarly, values of numerical attributes such as age may be mapped to higher-level concepts such as youth, middle-aged, and senior.
- Normalization: the values of numerical attributes are scaled to fall within a small range, such as -1 to 1, or 0 to 1.
- Attribute construction: new attributes are constructed and added to the attribute set to help the mining process.
Smoothing is a form of data cleaning and was presented in the previous section. Aggregation and generalization are forms of data reduction and are discussed later. In this section, we concentrate on normalization and attribute construction.
An attribute is normalized by scaling its values so that they fall within a small specified range, such as 0.0 to 1.0. Normalization is useful for classification algorithms involving neural networks, and for classification and clustering based on distance measurements, such as nearest neighbor. If the neural network backpropagation algorithm is used for classification mining, normalizing the values of each attribute in the training data helps speed up the learning phase. For distance-based methods, normalization helps prevent attributes with large ranges (for example, income) from outweighing attributes with smaller ranges. There are many methods for data normalization. We study three: min-max normalization, z-score normalization, and normalization by decimal scaling.
Min-max normalization performs a linear transformation on the original data. Suppose that $min_A$ and $max_A$ are the minimum and maximum values of attribute A. Min-max normalization maps a value $v$ of A to $v'$ in the range $[new\_min_A, new\_max_A]$ by computing:
\[ v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A \]
Min-max normalization preserves the relationships among the original data values. It will run into an "out-of-bounds" error if future input data fall outside the original value range of A.
Example: Suppose that the minimum and maximum values of the attribute income are 12,000 and 98,000. We would like to map income to the range [0.0, 1.0]. By min-max normalization, a value of 73,600 is transformed to:
\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]
Z-score normalization
In z-score normalization, the values of an attribute A are normalized based on the mean and standard deviation of A. A value $v$ of A is normalized to $v'$ by computing:
\[ v' = \frac{v - \mu_A}{\sigma_A} \]
where:
$\mu_A$: the mean of A
$\sigma_A$: the standard deviation of A
This method is useful when the actual minimum and maximum of A are unknown, or when there are outliers that dominate min-max normalization.
Example: Suppose that the mean and standard deviation of the attribute income are 54,000 and 16,000. With z-score normalization, a value of 73,600 is transformed to:
\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
Normalization by decimal scaling
This method normalizes a value $v$ by moving its decimal point; the number of places moved depends on the maximum absolute value of A. A value $v$ is normalized to $v'$ by computing:
\[ v' = \frac{v}{10^{j}} \]
where $j$ is the smallest integer such that $\max(|v'|) < 1$.
Example: Suppose that the values of A range from -986 to 917. The maximum absolute value of A is 986. Hence $j = 3$ (because $10^2 < 986 < 10^3$). The value -986 is normalized to -0.986 and 917 is normalized to 0.917.
Note that normalization can change the original data somewhat, especially the latter two methods. It is also necessary to save the normalization parameters (such as the mean and standard deviation if z-score normalization is used) so that data added later can be normalized in the same way.
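The three normalization methods and their worked examples can be checked with the sketch below. The function names are illustrative, and the decimal-scaling helper assumes a non-negative exponent j, which the text does not state explicitly.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v into [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean_a, std_a):
    """Z-score normalization of v."""
    return (v - mean_a) / std_a


def decimal_scaling_exponent(values):
    """Smallest integer j >= 0 (assumed) such that max(|v|) / 10**j < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j


income_min_max = min_max(73600, 12000, 98000)   # about 0.716
income_z = z_score(73600, 54000, 16000)         # about 1.225
j = decimal_scaling_exponent([-986, 917])       # 3
scaled = [v / 10 ** j for v in (-986, 917)]     # [-0.986, 0.917]
```

In practice the parameters (minimum, maximum, mean, standard deviation, or j) would be stored alongside the data so that later records are normalized with the same values.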
In attribute construction, new attributes are built from the given attributes and added to the data in order to improve accuracy and the understanding of structure in high-dimensional data. For example, we may add the attribute area based on the attributes width and height. Combining attributes in this way can reveal missing information about the relationships between attributes that can be useful for knowledge discovery.
Section 2.5. Data reduction
The need for data reduction arises when you have to analyze a huge database containing terabytes of data. Obviously you cannot analyze it by browsing the tables yourself, so the analysis must be carried out by computer in order to obtain results in a reasonable time. For databases this large, however, even a computer may need an impractically long time if it analyzes the raw data directly.
For this reason, a number of data reduction techniques have been proposed that obtain a reduced representation of the raw data while closely maintaining the integrity of the data. As a result, data analysis models improve in both speed and quality. The main data reduction techniques are:
- Attribute subset selection: removes the parts of the data that are irrelevant or weakly relevant to the purpose of the analysis, in order to speed up the analysis and reduce noise (caused by redundant, irrelevant attributes), thereby improving the quality of the analysis.
- Dimensionality reduction: projects the data points onto a space with fewer dimensions in order to simplify the data without degrading the quality of the analysis results.
- Numerosity reduction: instead of storing the data, only a parametric model of the data is kept. The parameters are estimated from the original raw data set and are then stored in place of the full data.
2.5.1 Attribute subset selection
In practice, the data to be analyzed may contain hundreds of attributes, many of which are irrelevant or contribute little to the purpose of the analysis. For example, a CD store wants to classify its regular customers (whose data are stored in a database) into two groups, according to whether or not they are likely to buy a new product, in order to decide on its business strategy. The store analyzes the data of each customer to perform the classification. These data include the customer's personal information and purchase history at the store. Each data sample may therefore contain hundreds of attributes, many of which contribute nothing to the analysis, such as address or telephone number.
The problem is how to remove the unnecessary attributes or, in other words, how to choose the smallest subset of attributes such that, when the analysis is run on this subset, the resulting distribution of the data over the classes deviates as little as possible from the distribution obtained on the raw data set (according to some predefined criterion). This is an important problem, because when the analysis concentrates on a well-chosen reduced set of attributes, the computational cost drops considerably and the discovered patterns are simpler and easier to understand. At the same time, accuracy is maintained, since inaccurate results caused by noise (mostly from data that are irrelevant to the purpose of the analysis) are avoided.
If each data sample has $n$ attributes, there are $2^n$ possible subsets of attributes, so an exhaustive search is clearly not feasible. The usual approaches are therefore heuristic methods. These mainly apply the greedy principle: they make the locally best decision at each step in order to prune the search space and approach the optimal solution. Experiments show that such methods are quite effective and often produce near-optimal solutions (approximations of the optimal solution).
The locally best decision at each step usually consists of determining, by means of statistical methods and criteria, which attributes strongly influence the distribution of the data and which have a negligible influence (assuming that the attributes are statistically independent of one another).
The heuristic methods mentioned above are usually combinations of the following basic methods:
- Stepwise forward selection: The procedure starts with an empty set of attributes. At each step, the best of the attributes not yet in the reduced set is determined by some criterion and added to the set.

Figure 2.15 Forward Selection

- Stepwise backward elimination: The procedure starts with the full set of attributes. At each step, the worst remaining attribute is determined and removed.

Figure 2.16 Backward Elimination

- Combination of stepwise forward selection and stepwise backward elimination: At each step, the best attribute is selected and added to the reduced set, and the worst attribute is removed from among the remaining attributes.
- Decision tree induction: Common decision tree algorithms such as ID3, C4.5, and CART are applied. Each internal node of the tree holds a test on the best attribute, and each leaf node corresponds to a class. The data are pushed from the top down, and the attributes that appear in the tree form the reduced set.

Figure 2.17 Decision Tree Induction

The stopping condition of these methods can be customized to the nature of each problem; usually a threshold is used to end the selection process. For example, forward selection may stop when the importance measure of the best remaining attribute falls below the threshold, and backward elimination may stop when the importance measure of the worst remaining attribute exceeds the threshold.
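Stepwise forward selection with a threshold-based stopping condition can be sketched as a greedy loop. Everything named here is hypothetical: the `score` function stands for whatever quality measure is chosen (the text leaves it open), and the attribute names and relevance weights are invented purely for illustration.

```python
def forward_selection(attributes, score, threshold):
    """Greedy stepwise forward selection.

    score(subset) is a caller-supplied quality measure; the loop stops when
    the best remaining attribute improves the score by less than threshold.
    """
    selected = []
    remaining = list(attributes)
    current = score(selected)
    while remaining:
        best = max(remaining, key=lambda a: score(selected + [a]))
        gain = score(selected + [best]) - current
        if gain < threshold:
            break
        selected.append(best)
        remaining.remove(best)
        current += gain
    return selected


# Toy relevance weights, invented for this example.
weights = {"purchase_history": 0.5, "age": 0.3, "address": 0.01, "phone": 0.0}


def toy_score(subset):
    """Hypothetical score: the sum of the toy weights of the chosen attributes."""
    return sum(weights[a] for a in subset)


chosen = forward_selection(weights, toy_score, threshold=0.05)
```

Backward elimination would follow the same pattern in reverse, starting from the full set and removing the attribute whose removal hurts the score least.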
2.5.2 Dimensionality reduction
To reduce the number of dimensions of (multidimensional) data, many data encoding and transformation methods have been studied with the aim of compressing the data and speeding up processing without affecting the analysis results. This leads to the notions of "lossless" and "lossy" reduction. If the original data can be reconstructed completely from the transformed data, the transformation is called lossless; otherwise it is lossy. In practice, lossless transformations offer only limited reduction of the data size, so attention focuses on lossy transformations that lose little information and do not significantly affect the analysis results.
The two main dimensionality reduction techniques are Principal Component Analysis and the Discrete Wavelet Transform.
2.5.2.1 Wavelet Transform
The Discrete Wavelet Transform is a popular method that is widely used in many variants. The most common wavelet transforms include Haar-2, Daubechies-4, and Daubechies-6. Each of them is characterized by its own transformation.
The main idea of the method is to regard a data sample with $n$ components as an $n$-dimensional vector $\{x_1, x_2, \ldots, x_n\}$. The goal of the transform is to produce a vector of the same length $\{y_1, y_2, \ldots, y_n\}$, whose components are the wavelet coefficients. To reduce the dimensionality of the data, a threshold is chosen and the coefficients below this threshold are set to 0. Applying the transform again to $\{y_1, y_2, \ldots, y_n\}$ (after the coefficients below the threshold have been set to 0) yields the reduced data. Most wavelet transforms are quite complex, so for ease of understanding we consider only the simplest one, Haar-2. The examples are also given for this transform.
The Haar-2 transform is characterized by the matrix:
\[ H_2 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \]
Each data sample is represented as a vector $\{x_0, x_1, \ldots, x_{2n}, x_{2n+1}\}$. If the number of components of $x$ is not a power of 2, zeros are appended until it is. The vector is then transformed through the following steps:
Step 1: Group adjacent components in pairs: $(x_0, x_1), \ldots, (x_{2n}, x_{2n+1})$.
Step 2: Multiply each vector $(x_{2i}, x_{2i+1})$ on the right by $H_2$ to obtain $(x_{2i} + x_{2i+1},\ x_{2i} - x_{2i+1})$.
Step 3: Rearrange the results as $(s_0, s_1, \ldots, s_n, t_0, t_1, \ldots, t_n)$, where:
o $s_i = x_{2i} + x_{2i+1}$
o $t_i = x_{2i} - x_{2i+1}$
Step 4: Keep $\{t\}$ unchanged and repeat the process on $\{s_0, s_1, \ldots, s_n\}$ while more than one sum remains. In essence, the differences $t_i$ are kept and the sums $s_i$ are transformed further.
When the process ends, we obtain the corresponding sequence of wavelet coefficients. Then, depending on the threshold, some coefficients are set to 0. To obtain the reduced data, we invert the transform above on the coefficient sequence $\{s_0, t_0, s_1, t_1, \ldots\}$.
Example: Given the data sample $\{a_0, a_1, a_2, a_3\}$.
Transform step:
o $(a_0, a_1)\,H_2 = (a_0 + a_1,\ a_0 - a_1)$
o $(a_2, a_3)\,H_2 = (a_2 + a_3,\ a_2 - a_3)$
This yields $\{a_0 + a_1,\ a_0 - a_1,\ a_2 + a_3,\ a_2 - a_3\}$.
Rearranging so that the sums come first and the differences are set aside gives $\{a_0 + a_1,\ a_2 + a_3,\ a_0 - a_1,\ a_2 - a_3\}$. (No further transform step is applied in this example, since the remaining part has length 2.)
To rebuild the original sequence, we interleave the coefficients again as $\{a_0 + a_1,\ a_0 - a_1,\ a_2 + a_3,\ a_2 - a_3\}$ and apply the transform step to this sequence. This gives $\{2a_0, 2a_1, 2a_2, 2a_3\}$, that is, the original sequence up to a factor of 2. If some coefficients have been set to 0, for example $a_0 - a_1 \approx 0$, we instead obtain $\{a_0 + a_1,\ a_0 + a_1,\ 2a_2,\ 2a_3\}$. The first two components of this sequence are equal, so only one of them needs to be stored, and the sequence can be shortened to $\{a_0 + a_1,\ 2a_2,\ 2a_3\}$ to reduce the dimensionality of the data. When analyzing the data, we then work only with the reduced data, which speeds up processing while keeping the loss of information under control.
Transform and reconstruction diagram:
Transform:
S -> [ Transform Step ] -> A_1 B_1
A_1 -> [ Transform Step ] -> A_2 B_2
...
A_n -> [ Transform Step ] -> A_{n+1} B_{n+1}
-> [ Truncated ] -> A_{n+1} B_{n+1} ... B_1
Reconstruct:
A_{n+1} B_{n+1} ... B_1
[ Merge A_{n+1} B_{n+1} ] -> [ Transform Step ] -> A_n
[ Merge A_n B_n ] -> [ Transform Step ] -> A_{n-1}
...
[ Merge A_1 B_1 ] -> [ Transform Step ] -> S
Note: if the Truncated step is skipped during the transform (the coefficient vector is kept unchanged), the reconstruction returns S (differing only in magnitude, with the direction unchanged; in practice S is normalized by scaling it to magnitude 1).
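The transform, truncation, and reconstruction stages can be sketched as follows for the unnormalized Haar scheme described above (sums are transformed further, differences are kept). The function names are illustrative. One deliberate difference from the text: the inverse divides by 2 at each level, so it returns the original values directly instead of a scaled copy.

```python
def haar_transform(x):
    """Unnormalized Haar transform of a list whose length is a power of 2.

    Returns [overall sum, coarsest differences, ..., finest differences].
    """
    sums = list(x)
    details = []
    while len(sums) > 1:
        pairs = list(zip(sums[0::2], sums[1::2]))
        diffs = [a - b for a, b in pairs]   # t_i, kept as coefficients
        sums = [a + b for a, b in pairs]    # s_i, transformed again
        details = diffs + details
    return sums + details


def truncate(coeffs, threshold):
    """Set every coefficient whose magnitude is below threshold to 0."""
    return [c if abs(c) >= threshold else 0 for c in coeffs]


def haar_inverse(coeffs):
    """Invert haar_transform, halving at each level to undo the doubling."""
    sums = coeffs[:1]
    pos = 1
    while pos < len(coeffs):
        diffs = coeffs[pos:pos + len(sums)]
        pos += len(sums)
        merged = []
        for s, t in zip(sums, diffs):
            merged += [(s + t) / 2, (s - t) / 2]
        sums = merged
    return sums


sample = [4, 6, 10, 12]                                    # invented data
coeffs = haar_transform(sample)                            # [32, -12, -2, -2]
exact = haar_inverse(coeffs)                               # original values
approx = haar_inverse(truncate(coeffs, threshold=3))       # pairwise averages
```

With the small differences truncated, each pair of neighbouring values is replaced by its average, so only two of the four coefficients need to be stored.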
2.5.2.2 Principal Component Analysis
Principal Component Analysis (PCA) is an effective method for analyzing multidimensional data. For such data, it is rarely possible to plot the samples in order to analyze patterns visually, so PCA was developed to analyze the similarities and differences between data samples. A notable feature of PCA is that it provides a way to reduce the number of dimensions of the data or, more precisely, to project multidimensional data onto a space with fewer dimensions without losing too much information.
Suppose we have a data set of M samples with N dimensions. For large N, analyzing the data set becomes very difficult if we work directly on the raw data. The main idea is to concentrate on the principal components of the data samples (each sample has N components, one per dimension) in order to bring out the similarities and differences between samples and to provide an effective measure of difference (for the purpose of classifying the data).
So how do we know which components are the principal ones, that is, those that mainly influence the distribution of the data samples? To make this clear, we go through the following analysis steps in turn:
Normalize the data: in this step we use two-dimensional data samples (x, y) to illustrate the main idea, and the later steps are illustrated with the same two-dimensional samples.
For each component (dimension), we compute its sample mean and then subtract that mean from every value.
Example:



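Figure 2.18 Plot of the two-dimensional data samples after normalization
Compute the covariance matrix:
The covariance matrix is defined as follows:
\[ C_{ij} = \frac{1}{M-1} \sum_{k=1}^{M} \left( X_i^{(k)} - MeanX_i \right) \left( X_j^{(k)} - MeanX_j \right) \]
where $X_i^{(k)}$ denotes the $i$-th component of the $k$-th data sample.
For our example, the covariance matrix is as follows:

Determine the eigenvectors and the corresponding eigenvalues:
Before going into the details, we briefly recall eigenvectors and eigenvalues. Both concepts are defined for square N x N matrices. A symmetric N x N matrix, such as a covariance matrix, has N eigenvectors and N corresponding eigenvalues.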

Example:

Figure 1 gives an example of an eigenvector and of a vector that is not an eigenvector. (3,2) is an eigenvector because, when the square matrix is multiplied on the left of this vector, the result is another vector with the same direction that differs only in magnitude. This is the defining property of an eigenvector. Eigenvectors also have some other notable properties:
- For a symmetric matrix, such as a covariance matrix, eigenvectors belonging to different eigenvalues are orthogonal to each other.
- The eigenvalue stays the same no matter how the magnitude of the eigenvector is changed.

The example above shows that if the vector (3,2) is scaled up and then multiplied on the left by the square matrix, the corresponding eigenvalue is still 4.
In data analysis models, for convenience, the eigenvectors are usually normalized to have length 1.


Here, the square matrix under consideration is the N x N covariance matrix just computed. For our illustrative example:

The eigenvectors are arranged by column (each column is one eigenvector).
Choose the principal components and form the feature vector:


Figure 2.19 Mean-adjusted data with eigenvectors
Looking at the plot of the example data above, the two dashed lines represent the two eigenvectors. The plot shows that the data are mainly distributed around the line running from the lower left to the upper right, so the corresponding eigenvector is the principal component of the data (each eigenvector can be regarded as representing one component). In general, it can be shown that the eigenvalue measures how strongly a component influences the distribution structure of the data: the larger the eigenvalue, the greater the influence. When analyzing data, eigenvectors whose eigenvalues are much smaller than the others can be dropped, on the view that doing so loses very little information, does not significantly affect the analysis result, and speeds up the analysis considerably.
A feature vector is also defined, built from the eigenvectors as follows:
\[ FeatureVector = (eig_1\ \ eig_2\ \ \ldots\ \ eig_n) \]
For our illustrative example, the feature vector is:

We can also drop the unimportant eigenvector to reduce the dimension of the feature vector:

Derive the sample features:
This is the last and also the easiest step of PCA. We have the following formula:
\[ FinalData = RowFeatureVector \times RowDataAdjust \]
Here, FinalData is the matrix of data features. RowFeatureVector is the transpose of FeatureVector (rows become columns and columns become rows, so that each eigenvector lies on one row). Similarly, RowDataAdjust holds each normalized data sample in one column, and FinalData holds the features of each data sample in one column. When analyzing the data samples, only these features (the corresponding columns of FinalData) are considered.
Note: from FinalData we can also recover the original data to a degree that depends on which eigenvectors were dropped and how many. The formula is:
\[ RowDataAdjust = RowFeatureVector^{T} \times FinalData \]
after which the sample means subtracted during normalization are added back.
If all eigenvectors are kept, the original data is recovered 100%. In summary, the main idea of PCA is to discard the components that matter little to the distribution structure of the data samples, on the assumption that discarding them loses little information and has a negligible effect on the result of the data analysis.
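For two-dimensional data, the PCA steps above (mean adjustment, covariance matrix, eigen-decomposition, projection) fit into a short pure-Python sketch, because the eigenvalues of a symmetric 2 x 2 matrix have a closed form. The function name and the sample points are invented for this illustration; for higher-dimensional data a linear algebra library would normally be used instead.

```python
import math


def pca_2d(points):
    """Project 2-D samples onto their first principal component."""
    m = len(points)
    mean_x = sum(p[0] for p in points) / m
    mean_y = sum(p[1] for p in points) / m
    adjusted = [(x - mean_x, y - mean_y) for x, y in points]

    # Covariance matrix entries, dividing by M - 1 as in the text.
    cxx = sum(x * x for x, _ in adjusted) / (m - 1)
    cyy = sum(y * y for _, y in adjusted) / (m - 1)
    cxy = sum(x * y for x, y in adjusted) / (m - 1)

    # Largest eigenvalue of the symmetric matrix [[cxx, cxy], [cxy, cyy]].
    half_trace = (cxx + cyy) / 2
    radius = math.sqrt(((cxx - cyy) / 2) ** 2 + cxy ** 2)
    top_eigenvalue = half_trace + radius

    # Matching eigenvector, scaled to unit length.
    if cxy != 0:
        vx, vy = cxy, top_eigenvalue - cxx
    else:
        vx, vy = (1.0, 0.0) if cxx >= cyy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm

    # FinalData restricted to the principal component: one number per sample.
    final_data = [x * vx + y * vy for x, y in adjusted]
    return top_eigenvalue, (vx, vy), final_data


# Invented samples lying exactly on the line y = 2x.
eigenvalue, direction, final_data = pca_2d([(1, 2), (2, 4), (3, 6)])
```

Because the toy points lie on a single line, the second eigenvalue is 0 and dropping the second eigenvector loses no information at all.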
2.5.3 Numerosity reduction
Can we reduce the size of the data set by choosing an equivalent but more compact representation? To answer this question, several quite effective methods have been studied and proposed. These methods are either parametric or non-parametric. In parametric methods, a model is used to estimate the data points; in other words, the data are modeled, and only the parameters are stored instead of the full data. In non-parametric methods, the data are grouped according to criteria of similarity and difference. The data samples in a group are represented by a common compact representation (for example, the centroid of the samples, if they are regarded as vectors in a multidimensional space). In this way, only the compact representation of each group is stored instead of the full data.
Common methods include:
Parametric: Log-Linear Models, Gaussian Mixture Models, Regression.
Non-parametric: K-Means Clustering, Hierarchical Clustering, Fuzzy C-Means Clustering.
2.5.3.1 Regression
Regression is a common kind of model for approximating given data in parametric form. In the simple case of linear regression, the data are modeled to fit a straight line. For example, a random variable Y can be modeled as a linear function of a random variable X as follows:
Y = WX + B
where W and B are the regression coefficients. In the context of data mining, X and Y are attributes of a data sample, where Y is the dependent attribute to be predicted and X is a known attribute (stored in the database). The coefficients W and B can be solved for by the method of least squares, which minimizes the error between the actual data and the estimated linear function. Extending the single-variable model above gives multiple linear regression, in which $Y = F(X_1, X_2, \ldots, X_N)$ with F a linear function of the variables $X_1, X_2, \ldots, X_N$.
Linear regression models help reduce the size of the data set before analysis. The idea is that once F has been estimated, the attributes $Y = F(X_1, X_2, \dots, X_N)$ need not be analyzed, because they are determined by $X_1, X_2, \dots, X_N$ and the coefficients $a_0, a_1, a_2, \dots$ of F. A data sample $\{X_1, X_2, \dots, X_N, Y\}$ is therefore reduced to $\{X_1, X_2, \dots, X_N\}$ when it enters the analysis.
To understand the regression model better, consider the following example of single-variable linear regression:

Figure 2.20 Data Table
The table above is a database table in which each data sample has two attributes, X and Y. The regression function to estimate has the form Y = WX + B; our task is to estimate the most suitable W and B from the table.
From the table we have:
$\sum X = 311$
$\sum Y = 18.6$
$\sum XY = 1159.7$
$\sum X^2 = 19359$
$N = 5$ (number of samples)
To minimize the mean squared error (MSE) on the given table, it can be shown that:
$W = \dfrac{N \sum XY - \sum X \sum Y}{N \sum X^2 - (\sum X)^2}$
$B = \dfrac{\sum Y - W \sum X}{N}$
Thus W ≈ 0.19 and, using this rounded value, B = -8.098. The regression equation is Y = 0.19X - 8.098. To estimate Y for a sample for which only X = 64 is known, substitute into the regression equation: Y = 0.19(64) - 8.098 ≈ 4.06.
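The arithmetic above can be checked with a short Python sketch. It uses only the sums quoted in the text, since the raw table of Figure 2.20 is not reproduced here; the variable and function names are illustrative.

```python
# Summary statistics quoted in the text for the table of Figure 2.20.
N = 5
sum_x, sum_y = 311.0, 18.6
sum_xy, sum_x2 = 1159.7, 19359.0

# Least-squares slope and intercept for Y = W*X + B.
W = (N * sum_xy - sum_x * sum_y) / (N * sum_x2 - sum_x ** 2)
B = (sum_y - W * sum_x) / N

# The text rounds W to two decimals before computing B, which gives -8.098.
W_rounded = round(W, 2)
B_text = (sum_y - W_rounded * sum_x) / N


def predict(x, w=W_rounded, b=B_text):
    """Estimate Y for a given X with the fitted line."""
    return w * x + b


print(f"unrounded: W = {W:.4f}, B = {B:.4f}")
print(f"rounded:   W = {W_rounded}, B = {B_text:.3f}, Y(64) = {predict(64):.2f}")
```

Without the intermediate rounding of W the intercept comes out slightly different (about -7.96), so the rounded figures in the text should be read as approximations.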
2.5.3.2 Log-Linear
The log-linear model is designed to approximate discrete multidimensional probability distributions of the data. Given a data set of M samples with N attributes, we can view the samples as points in an N-dimensional space. The log-linear model estimates the probability of these points from a subset of their discrete attributes, which allows the data to be reduced from a high-dimensional space to a lower-dimensional one and recovered when needed. The idea of the model is to express the relationship between the attributes after taking the natural logarithm:
$\ln(y) = a_0 + a_1 x_1 + \dots + a_N x_N$
where:
- y is the dependent attribute with respect to $\{x_i\}$;
- $\{x_i\}$ are the independent attributes;
- $\{a_i\}$ are the log-linear parameters.
The parameters $\{a_i\}$ are estimated by fitting the available data samples, as in the regression model. The model is in essence a statistical probability model: statistics of the sample data set are used to determine whether a logarithmic relationship exists between the attributes. If the statistics indicate that such a relationship exists, the model fits the data samples to the formula above to estimate the parameters $\{a_i\}$.
Example: consider the following table:
Y X
35 9.48
40 9.83
50 10.43
55 10.68
70 11.32
75 11.51

From the table it is easy to see that X and Y are related. First, Y clearly increases as X increases; looking more closely, when X increases slowly Y increases slowly, and when X increases quickly Y increases quickly, so X and Y are quite closely related. This is of course only a visual observation; in practice one would compute the covariance matrix to determine how strongly X and Y are related. In this example the covariance matrix leads to the same conclusion. We now apply the log-linear model to estimate a log-linear function expressing Y in terms of X:
$\ln(y) = a + bx$    (1)
We have:
Ln(Y)   X
3.55    9.48
3.68    9.83
3.91    10.43
4.00    10.68
4.24    11.32
4.31    11.51

The problem now reduces to estimating the constants a and b that fit the data samples to (1). Setting z = ln(y), (1) becomes z = a + bx, so the problem reduces to fitting the data points to a straight line, and we can apply the method used for the regression model (with N = 6):
$b = \dfrac{N \sum xz - \sum x \sum z}{N \sum x^2 - (\sum x)^2} \approx 0.375$
$a = \dfrac{\sum z - b \sum x}{N} \approx 0.0$
We thus obtain the log-linear function ln(y) = 0.375x. To estimate Y for a new data sample for which only X is known, apply this formula:
X = 9.5
$\ln(Y) = 0.375 \times 9.5 = 3.5625$
$Y = e^{3.5625} \approx 35.3$
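As a rough check of this example, the following Python sketch fits z = ln(y) against x by ordinary least squares, using the six (Y, X) pairs from the table. The helper names `fit_line` and `predict_y` are illustrative. Because it uses unrounded logarithms, the coefficients differ slightly from the rounded values in the text.

```python
import math

# The (Y, X) pairs from the example table.
y = [35, 40, 50, 55, 70, 75]
x = [9.48, 9.83, 10.43, 10.68, 11.32, 11.51]


def fit_line(xs, zs):
    """Ordinary least squares for z = a + b*x; returns (a, b)."""
    n = len(xs)
    sx, sz = sum(xs), sum(zs)
    sxz = sum(xi * zi for xi, zi in zip(xs, zs))
    sxx = sum(xi * xi for xi in xs)
    b = (n * sxz - sx * sz) / (n * sxx - sx ** 2)
    a = (sz - b * sx) / n
    return a, b


z = [math.log(v) for v in y]  # z = ln(y)
a, b = fit_line(x, z)


def predict_y(x_new):
    """Invert the log transform: y = exp(a + b*x)."""
    return math.exp(a + b * x_new)


print(f"a = {a:.3f}, b = {b:.3f}, y(9.5) = {predict_y(9.5):.1f}")
```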
2.5.3.3 Gaussian Mixture Models
The Gaussian Mixture Model is in essence a clustering method, but unlike other clustering methods it clusters by means of a parametric model. The data points are assigned to groups whose Gaussian distributions have the same form and differ only in the values of their parameters.

Figure 2.21 Mixture of Gaussian Distribution
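Parameter estimation: the data samples are regarded as points generated at random by the mixture model, and the parameters are estimated so that the probability that the mixture model generated the data points is as high as possible. Put simply, we maximize the likelihood:
$P(\text{data} \mid \mu_1, \mu_2, \dots, \mu_k)$
where $\mu_1, \mu_2, \dots, \mu_k$ are the group means of the corresponding Gaussian distributions. If the data points are assumed to be statistically independent, the expression to maximize is:
$P(\text{data} \mid \{\mu_i\}) = \prod_{x \in \text{data}} \sum_{i=1}^{k} P(v_i)\, P(x \mid v_i, \{\mu_i\})$
Here data is the set of data points x, $P(v_i)$ is the probability that the i-th distribution is chosen to generate a data point, and
$P(x \mid v_i, \{\mu_i\}) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu_i)^2 / (2\sigma^2)}$
is the probability density of the data point x being generated by the i-th distribution. The problem is to estimate $\{\mu_i\}$ so that the likelihood is maximal (assuming that all distributions have the same variance $\sigma^2$).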
We can regard $L = P(\text{data} \mid \{\mu_i\})$ as a function of several variables; maximizing $P(\text{data} \mid \{\mu_i\})$ then amounts to solving the partial-derivative equations for the corresponding $\{\mu_i\}$:
$\dfrac{\partial L}{\partial \mu_i} = 0$
However, such equations are very hard to solve and usually cannot be solved directly, because we usually lack complete information about a data sample x: for example, we do not know exactly which group x belongs to, only that it belongs to one of m groups. For this reason a more flexible approach was proposed: the EM (Expectation Maximization) algorithm.
The algorithm generates a random set of parameters and then adjusts it through a cycle of two steps that is repeated until a stopping condition is met. The two steps are Expectation and Maximization. In the Expectation step, the algorithm estimates the unknown information from the known information (the current parameter set together with the $P(v_i)$, which are known in advance) by computing its expected value according to probability theory. In the Maximization step, the algorithm uses the information estimated in the Expectation step to recompute the current parameter set.

Figure 2.22 Pseudocode of the EM algorithm
To understand the essence of the EM algorithm, consider the following example. Suppose $x_k$ is the score of a student in a class, with the probability distribution:
$x_1 = 30$, $P(x_1) = 1/2$
$x_2 = 18$, $P(x_2) = \mu$
$x_3 = 0$, $P(x_3) = 2\mu$
$x_4 = 23$, $P(x_4) = 1/2 - 3\mu$
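Case 1: a survey tells us that:
- $x_1$: a students
- $x_2$: b students
- $x_3$: c students
- $x_4$: d students
We therefore need to estimate $\mu$ so that the probability model above fits the survey data:
$P(a, b, c, d \mid \mu) = K \left(\tfrac{1}{2}\right)^{a} \mu^{b} (2\mu)^{c} \left(\tfrac{1}{2} - 3\mu\right)^{d}$
where K is a constant that does not depend on $\mu$. Let $L = P(a, b, c, d \mid \mu)$. Since maximizing L is equivalent to maximizing ln(L), set P = ln(L) and solve the derivative equation with respect to $\mu$:
$\dfrac{\partial P}{\partial \mu} = 0$
$\dfrac{b}{\mu} + \dfrac{c}{\mu} - \dfrac{3d}{1/2 - 3\mu} = 0$
$\mu = \dfrac{b + c}{6(b + c + d)}$
In this case the information is complete, so the direct solution is fairly straightforward.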
Case 2: a survey tells us that:
- h students scored $x_1$ or $x_2$;
- c students scored $x_3$;
- d students scored $x_4$.
In this case a and b are unknown; we only know that a + b = h. To solve the problem with this missing information, we use the EM algorithm to estimate the expected values of a and b and then maximize the likelihood.
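A minimal Python sketch of EM for Case 2 follows. The E-step expression is not given in the surviving text; it is derived here from the model (the h students are split between $x_1$ and $x_2$ in proportion to 1/2 and $\mu$), and the M-step reuses the closed-form solution of Case 1. The function name, the starting value of mu and the counts h = 20, c = 10, d = 10 are hypothetical.

```python
def em_grades(h, c, d, mu=0.05, tol=1e-10, max_iter=1000):
    """EM for P(x1)=1/2, P(x2)=mu, P(x3)=2*mu, P(x4)=1/2-3*mu,
    when only h = a + b, c and d are observed."""
    for _ in range(max_iter):
        # E-step: expected split of the h students between x1 and x2.
        b = h * mu / (0.5 + mu)
        a = h - b
        # M-step: closed-form maximiser from Case 1, using the expected b.
        new_mu = (b + c) / (6 * (b + c + d))
        if abs(new_mu - mu) < tol:  # stop when mu no longer changes
            return new_mu, a, b
        mu = new_mu
    return mu, a, b


# Hypothetical survey counts, chosen only for illustration.
mu, a, b = em_grades(h=20, c=10, d=10)
print(f"mu = {mu:.4f}, expected a = {a:.3f}, expected b = {b:.3f}")
```

Each pass alternates between guessing the hidden counts from the current mu and re-estimating mu from those guesses, which is the two-step cycle described above.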

The EM algorithm is very effective for handling incomplete data samples. A Gaussian mixture model combined with the EM algorithm is a fairly effective tool for clustering data in order to reduce the size of the data set before analysis.
2.5.3.4 K-Means Clustering
This is a simple non-parametric clustering algorithm. Its main idea is simply to treat each data sample as a point in an N-dimensional space and to group points that are close to each other.
Initially the algorithm generates K random points and treats them as the centroids of K groups. Each point is then assigned to the group whose centroid is nearest to it. Finally, the group centroids are recomputed from the data points belonging to each group. This process is repeated until the movement of the centroids falls below some threshold (convergence).
Here "nearest" is determined by some distance measure such as the Euclidean or Minkowski distance. The convergence threshold is set according to the nature of the data.
In general this algorithm is easy to implement and easy to understand, and it is applicable in practice, but it also has several limitations. First, K must be known in advance. In addition, if the data set contains many outliers, the algorithm does not handle them well.
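The procedure described above can be sketched in Python as follows (Python 3.8+ for `math.dist`). This is a minimal illustration rather than a reference implementation: it uses the Euclidean distance, seeds the centroids from K data points when no start is given (an assumption; the text only says K random points), and keeps the old centroid when a group becomes empty. The sample points are hypothetical.

```python
import math
import random


def kmeans(points, k, init=None, tol=1e-6, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: without an explicit start, K distinct data points are the seeds.
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins the group with the nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: recompute each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(col) / len(members)
                                           for col in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # empty group: keep centroid
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:  # convergence: the centroids barely moved
            break
    return centroids, labels


# Hypothetical data: two well-separated groups of four points each.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, labels = kmeans(data, k=2, init=[(0, 0), (1, 1)])
print(sorted(centroids), labels)
```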
2.5.3.5 Fuzzy C-Means Clustering
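Fuzzy C-Means is a non-parametric clustering algorithm based on probability: a data sample may belong to several groups, with different probabilities. The objective function to optimize has the form:
$J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \, \lVert x_i - c_j \rVert^2$
where:
- $u_{ij}$ is the degree of membership of data sample $x_i$ in group j;
- $c_j$ is the center of group j;
- $\lVert \cdot \rVert$ is some norm measuring the distance between objects (Euclidean, Manhattan, Minkowski, ...).
The general scheme of the algorithm is as follows:

Figure 2.23 Flowchart of the Fuzzy C-Means Clustering algorithm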
In essence the algorithm estimates the values $u_{ij}$ that optimize the objective function $J_m$ above. Once the $u_{ij}$ have been estimated, the grouping of the data set is determined (we know the probability that sample i belongs to group j). Solving for the $u_{ij}$ directly is very hard because the centers $c_j$ are unknown, so, as with the mixture of Gaussians, a more flexible approach is needed: the $u_{ij}$ are first generated at random and then adjusted gradually.
The algorithm consists of two steps:
- Use the current $u_{ij}$ to estimate the $c_j$.
- Use the newly estimated $c_j$ to recompute the $u_{ij}$.
The two steps are repeated until the change in the $u_{ij}$ falls below a given threshold.
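The two-step loop can be sketched in Python for one-dimensional data. The update formulas are the standard Fuzzy C-Means updates, which are an assumption here because the flowchart of Figure 2.23 is not reproduced: $c_j = \sum_i u_{ij}^m x_i / \sum_i u_{ij}^m$ and $u_{ij} = 1 / \sum_k (d_{ij}/d_{ik})^{2/(m-1)}$. The function name, the fuzziness exponent m = 2 and the sample data are illustrative.

```python
import random


def fuzzy_c_means(xs, c, m=2.0, init_u=None, tol=1e-6, max_iter=200, seed=0):
    """Fuzzy C-Means on 1-D data; returns (centers, memberships u[i][j])."""
    rng = random.Random(seed)
    if init_u is None:
        # Random initial memberships, normalised so that each row sums to 1.
        init_u = []
        for _ in xs:
            row = [rng.random() + 1e-9 for _ in range(c)]
            init_u.append([v / sum(row) for v in row])
    u = [list(row) for row in init_u]
    centers = [0.0] * c
    for _ in range(max_iter):
        # Step 1: estimate the centers c_j from the current u_ij.
        centers = [sum(u[i][j] ** m * x for i, x in enumerate(xs)) /
                   sum(u[i][j] ** m for i in range(len(xs)))
                   for j in range(c)]
        # Step 2: recompute u_ij from the distances to the new centers.
        new_u = []
        for x in xs:
            d = [max(abs(x - cj), 1e-12) for cj in centers]  # avoid division by 0
            new_u.append([1.0 / sum((d[j] / d[k]) ** (2.0 / (m - 1.0))
                                    for k in range(c))
                          for j in range(c)])
        change = max(abs(new_u[i][j] - u[i][j])
                     for i in range(len(xs)) for j in range(c))
        u = new_u
        if change < tol:  # stop when the memberships barely move
            break
    return centers, u


# Hypothetical data with two obvious groups, and a slightly biased start.
data = [0.8, 1.0, 1.2, 9.8, 10.0, 10.2]
start = [[0.6, 0.4]] * 3 + [[0.4, 0.6]] * 3
centers, u = fuzzy_c_means(data, c=2, init_u=start)
print([round(v, 2) for v in centers])
```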
2.5.3.6 Hierarchical Clustering
The main idea of this algorithm is to build a hierarchical clustering model: a clustering tree is built from the initial data set. Initially each data sample forms its own group, so N data samples give N groups. At each step, as long as more than one group remains, the two closest groups (according to some criterion) are chosen and merged by creating a link between them, and the length of the link is recorded as the distance between the two groups. The number of groups thus decreases by 1 at each step, and after N - 1 steps we obtain a hierarchical clustering tree.
Example: we have 6 data samples and the following table of distances between them:

Figure 2.24 Table of distances between the data samples
Initially each group contains a single data sample, so the distance between any two groups is the distance between the two corresponding samples. After some steps, when groups contain more than one sample, the distance between two groups is taken to be the average of the distances between the samples of the two groups. With this rule, applying the algorithm above yields the following hierarchy tree:

Figure 2.25 Clustering tree
Once the hierarchy tree has been built, the data can be grouped into K groups for any K from 1 to N. This is the strength of the algorithm: the hierarchy tree shows clearly how similar or dissimilar the data samples are, which helps us decide how many groups to use (when K is not known). To split the data into K groups, the algorithm simply cuts the K - 1 longest links. In the example above, to split the samples into 2 groups we just cut the longest link:

Figure 2.26 Grouping by removing the longest link
In general, hierarchical clustering has certain advantages, but it also has the drawback that, unlike other clustering algorithms, it cannot revise what it has already done. K-Means can adjust the group centers, the Gaussian Mixture Model can adjust the parameters of its distributions through the EM algorithm, and Fuzzy C-Means can adjust the membership degrees, whereas hierarchical clustering cannot undo earlier steps: for example, a link created in an earlier step cannot be adjusted or removed in later steps. This makes hierarchical clustering rigid and inflexible, which can hurt the accuracy of the grouping.
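The merging procedure with average distances can be sketched in Python as below. The distance table of Figure 2.24 is not reproduced in the text, so the example uses five hypothetical points on a line. The function names are illustrative, and `cut` relies on the fact that with average distances the links are created in order of non-decreasing length, so the K - 1 longest links are the last K - 1 merges.

```python
def average_linkage(dist):
    """Agglomerative clustering on a symmetric distance matrix.
    Returns the N-1 merges as (members_a, members_b, link_length)."""
    clusters = [[i] for i in range(len(dist))]
    merges = []

    def avg(a, b):
        # Mean of all pairwise distances between the two groups.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Pick the closest pair of groups under the average-distance rule.
        length, p, q = min((avg(clusters[p], clusters[q]), p, q)
                           for p in range(len(clusters))
                           for q in range(p + 1, len(clusters)))
        merges.append((clusters[p], clusters[q], length))
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return merges


def cut(merges, n_points, k):
    """Replay all but the last K-1 merges, i.e. drop the K-1 longest links."""
    groups = {(i,): [i] for i in range(n_points)}
    for a, b, _ in merges[:n_points - k]:
        merged = groups.pop(tuple(a)) + groups.pop(tuple(b))
        groups[tuple(merged)] = merged
    return sorted(groups.values())


# Hypothetical samples on a line; distances are absolute differences.
pts = [0, 1, 5, 6, 20]
dist = [[abs(a - b) for b in pts] for a in pts]
merges = average_linkage(dist)
print([length for _, _, length in merges])
print(cut(merges, len(pts), 2))
```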
Section 2.6. Summary
Data preprocessing is an extremely important stage for both storing and mining data, because real-world data is often incomplete, noisy and sometimes inconsistent. Data preprocessing comprises the following main stages:
- Data cleaning.
- Data integration.
- Data transformation.
- Data reduction.
Data cleaning attempts to fill in missing values, smooth the data by removing noise, identify outliers, and correct or remove inconsistencies.
Data integration combines data from multiple sources into a single coherent data store. Metadata, correlation analysis and conflict detection are common techniques that contribute to smooth data integration.
Data transformation converts the data into standard forms suitable for knowledge discovery. For example, attribute values may be normalized to fall in the range 0..1.
Data reduction can be carried out with several techniques: attribute subset selection, dimensionality reduction and data set size reduction, which represent the data in reduced form while minimizing the loss of information.
Section 2.7. Solutions to selected exercises

Exercise 1
Use the two methods a) and b) described below to normalize the following data:
Input data
200
300
400
600
1000
a. Min-max normalization, with min = 0 and max = 1
b. Z-score normalization
Solution:
Mean = 500
Standard deviation = 316.2278
Normalization results:
Input data   Min-max   Z-score
200          0         -0.948683196
300          0.125     -0.632455464
400          0.25      -0.316227732
600          0.5       0.316227732
1000         1         1.58113866
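The table can be reproduced with a few lines of Python. This sketch assumes the sample standard deviation (divisor n - 1), which is what yields 316.2278 for this data; the function names are illustrative.

```python
import statistics

data = [200, 300, 400, 600, 1000]


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale the values linearly to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Centre on the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / stdev for v in values]


for raw, mm, z in zip(data, min_max(data), z_score(data)):
    print(f"{raw:5d}  {mm:6.3f}  {z:10.6f}")
```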

Exercise 2
Suppose a hospital recorded the age and body fat percentage of 18 randomly selected adults, with the following results:
(The results and plots were computed and drawn with Excel)
Recorded data
Age   %Fat
23    9.5
23    26.5
27    7.8
27    17.8
39    31.4
41    25.9
47    27.4
49    27.2
50    31.2
52    34.6
54    42.5
54    28.8
56    33.4
57    30.2
58    34.1
58    32.9
60    41.2
61    35.7

a. Compute the mean, median and standard deviation.
b. Normalize the data using z-score normalization.
c. Draw a scatter plot.
d. Compute the Pearson correlation coefficient.
Solution to Exercise 2
a. Mean, median and standard deviation:
                     Age          %Fat
Mean                 46.4444444   28.78333333
Median               51           30.7
Standard deviation   13.2186242   9.254394822

b. Z-score normalized data:
Age   %Fat   Age (z-score)   %Fat (z-score)
23    9.5    -1.773258893    -2.083747569
23    26.5   -1.773258893    -0.246704128
27    7.8    -1.470654986    -2.267451913
27    17.8   -1.470654986    -1.186838124
39    31.4   -0.562843266    0.282796628
41    25.9   -0.411541313    -0.311540955
47    27.4   0.042364547     -0.149448887
49    27.2   0.1936665       -0.171061163
50    31.2   0.269317477     0.261184353
52    34.6   0.42061943      0.628593041
54    42.5   0.571921384     1.482277934
54    28.8   0.571921384     0.001837043
56    33.4   0.723223337     0.498919386
57    30.2   0.798874313     0.153122974
58    34.1   0.87452529      0.574562351
58    32.9   0.87452529      0.444888697
60    41.2   1.025827243     1.341798141
61    35.7   1.10147822      0.747460558

c. Scatter plot:

Figure 2.27 Scatter plot of the correlation between the two variables Age and %Fat (x-axis: Age, 0 to 70; y-axis: %Fat, 0 to 45)

d. Pearson correlation coefficient = 0.817619. The two variables age and body fat percentage are therefore fairly strongly correlated (the coefficient is close to 1).
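The figures in parts a and d can be recomputed with the Python standard library. The text obtained them with Excel; this sketch assumes the sample standard deviation and defines an illustrative `pearson` helper.

```python
import math
import statistics

age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2,
       34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) *
                           sum((y - my) ** 2 for y in ys))


for name, col in (("Age", age), ("%Fat", fat)):
    print(name, statistics.mean(col), statistics.median(col),
          statistics.stdev(col))
print("Pearson r =", pearson(age, fat))
```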
Exercise 3
Data quality can be described in terms of accuracy, completeness and consistency. Name two other criteria for assessing data quality.
Solution
Data quality can also be described through inherent quality and pragmatic quality. Data with inherent quality facilitates the analysis process, and data with pragmatic quality helps reduce noise during analysis. These are two of the most important criteria for assessing data quality.
Exercise 4
In practice, data samples (tuples) often have missing values in some fields. Describe several methods for handling this problem.
Solution:
Methods for handling missing field values in some data samples (tuples):
- Ignore the tuple: this method is not effective when the number of missing attribute values is small compared with the size of the tuple.
- Fill in the missing values manually: this approach is highly effective (because it is done directly by a person) but very time-consuming, and it is not feasible at all for large databases.
- Use a fixed constant to fill in the missing values (Unknown/Null): this approach can be risky for data analysis, because all missing values are filled with the same constant and so look alike; the analysis program may therefore wrongly conclude that there is a relationship (which does not in fact exist) between the data samples through the Unknown/Null attribute values.
- Use the mean computed over the complete samples: this is a fairly feasible approach and a simple idea for smoothing the data.
- Use the most probable (most likely) value: this is a more sophisticated form of interpolation for smoothing the data.
Exercise 5
Give the value ranges of the following normalization methods:
a. Min-max normalization.
b. Z-score normalization.
c. Decimal scaling normalization.
Solution
a. Value range: [new_min..new_max]
b. Assume:
- Minimum value: min.
- Maximum value: max.
- Mean: mean.
- Standard deviation: stdev (standard deviation).
Value range: [(min-mean)/stdev..(max-mean)/stdev]
c. Value range: (-1..1), endpoints excluded
Exercise 6
Give three measures of the dispersion of data and discuss them.
Solution:
Three measures of the dispersion of data:
- Range: the difference between the largest and smallest values of the data. It only shows the extremes of the data and does not show how the data varies inside the range.
- Standard deviation:
$s = \sqrt{\dfrac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
It shows how the data samples vary around the center of the data (the mean).
- Coefficient of Variation:
$CV = \dfrac{s}{\bar{x}} \times 100\%$
It shows the degree of dispersion of the samples over the whole data set, as the percentage ratio between the standard deviation and the sample mean.
Exercise 7
The values of the attribute age for the data tuples, sorted in increasing order, are: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
a. Compute the mean and the median.
$\text{mean} = \dfrac{1}{N} \sum_{i=1}^{N} x_i = \dfrac{809}{27} \approx 29.96$
There are N = 27 values; since N is odd, the median is the middle (14th) value: median = 25.
b. mode = 25 and 35 (bimodal).
c. midrange = (13 + 70)/2 = 41.5
d. Approximately: $Q_1 = 20$ (about 25% of the data values are less than or equal to 20) and $Q_3 = 35$ (about 75% of the data values are less than or equal to 35).
e. Five-number summary of the data:
Minimum = 13, $Q_1$ = 20, Median = 25, $Q_3$ = 35, Maximum = 70
f. Boxplot of the data:

Figure 2.28 Boxplot of the data
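These summary statistics can be checked with Python's `statistics` module (Python 3.8+ for `multimode` and `quantiles`). Quartile conventions differ between tools; this sketch uses the module's default "exclusive" method, which gives $Q_1 = 20$ and $Q_3 = 35$ for this data.

```python
import statistics

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

mean = statistics.mean(ages)
median = statistics.median(ages)          # middle (14th) of the 27 values
modes = statistics.multimode(ages)        # every value with the top frequency
midrange = (min(ages) + max(ages)) / 2
q1, q2, q3 = statistics.quantiles(ages, n=4)

five_number_summary = (min(ages), q1, median, q3, max(ages))
print(round(mean, 2), median, modes, midrange)
print(five_number_summary)
```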
Exercise 8
Use the data given in exercise 2.4 (the age data of the previous exercise).
a. Partition the data into bins of frequency (depth) 3:
Bin 1: 13, 15, 16
Bin 2: 16, 19, 20
Bin 3: 20, 21, 22
Bin 4: 22, 25, 25
Bin 5: 25, 25, 30
Bin 6: 33, 33, 35
Bin 7: 35, 35, 35
Bin 8: 36, 40, 45
Bin 9: 46, 52, 70
Smoothing by bin means (rounded to the nearest integer):
Bin 1: 15, 15, 15
Bin 2: 18, 18, 18
Bin 3: 21, 21, 21
Bin 4: 24, 24, 24
Bin 5: 27, 27, 27
Bin 6: 34, 34, 34
Bin 7: 35, 35, 35
Bin 8: 40, 40, 40
Bin 9: 56, 56, 56
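The binning and smoothing in part a can be reproduced with the following Python sketch; the function names are illustrative and the bin means are rounded to the nearest integer, as in the solution.

```python
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]


def equal_frequency_bins(sorted_values, depth):
    """Split sorted values into consecutive bins holding `depth` values each."""
    return [sorted_values[i:i + depth]
            for i in range(0, len(sorted_values), depth)]


def smooth_by_bin_means(bins):
    """Replace every value in a bin by the (rounded) mean of that bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


bins = equal_frequency_bins(ages, depth=3)
smoothed = smooth_by_bin_means(bins)
for i, (raw, flat) in enumerate(zip(bins, smoothed), start=1):
    print(f"Bin {i}: {raw} -> {flat}")
```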
b. Using the values $Q_1$ and $Q_3$ computed in exercise 2.4, compute $IQR = Q_3 - Q_1$; an element is an outlier if its value is greater than $Q_3 + 1.5 \times IQR$ or less than $Q_1 - 1.5 \times IQR$.
Outliers can also be detected by regression or clustering.
c. Other methods for smoothing data are regression and clustering.
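The IQR rule of part b can be sketched in Python as follows, assuming the quartiles $Q_1 = 20$ and $Q_3 = 35$ of the age data; the function name is illustrative.

```python
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]


def iqr_outliers(values, q1, q3, k=1.5):
    """Return the values outside [q1 - k*IQR, q3 + k*IQR]."""
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# IQR = 15, so the fences are -2.5 and 57.5.
outliers = iqr_outliers(ages, q1=20, q3=35)
print(outliers)
```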
Exercise 9
Issues to consider during data integration:
- Determining which entities from different sources are equivalent, known as the entity identification problem. The metadata of each attribute (name, meaning, data type, value range, null values) can help resolve this problem to some extent.
- Data redundancy is another important issue. An attribute (such as total annual revenue) may be redundant if it can be derived from other attributes. Redundant attributes can be detected by correlation analysis.
- Another important issue in data integration is the detection and resolution of data value conflicts. For the same entity, attribute values from different data sources may differ, for example because of differences in representation.
- Special attention must be paid to the structure of the data, to ensure that all functional dependencies among attributes and all foreign key constraints in the source systems are preserved in the target system.
Exercise 10
Given 12 sorted values of the attribute price: 5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215
Partition them into 3 bins by the following methods:
a. Equal-frequency partitioning:
Bin 1: 5, 10, 11, 13
Bin 2: 15, 35, 50, 55
Bin 3: 72, 92, 204, 215
b. Equal-width partitioning:
Width of each interval: (215 - 5) / 3 = 70, giving the intervals [5, 75), [75, 145) and [145, 215]
Bin 1: 5, 10, 11, 13, 15, 35, 50, 55, 72
Bin 2: 92
Bin 3: 204, 215
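Both partitioning schemes can be sketched in Python. The equal-width version assumes half-open intervals starting at the minimum, with the maximum placed in the last bin; under that convention 72 falls in the first interval, [5, 75). The function names are illustrative.

```python
prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]


def equal_frequency(sorted_values, k):
    """Split sorted values into k bins with the same number of values."""
    depth = len(sorted_values) // k
    return [sorted_values[i * depth:(i + 1) * depth] for i in range(k)]


def equal_width(values, k):
    """Split values into k intervals of equal width between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        # Interval index for [lo + i*width, lo + (i+1)*width);
        # the maximum value is placed in the last bin.
        i = min(int((v - lo) / width), k - 1)
        bins[i].append(v)
    return bins


freq_bins = equal_frequency(prices, 3)
width_bins = equal_width(prices, 3)
print(freq_bins)
print(width_bins)
```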
