Chapter 2: Data Preprocessing Issues

Faculty of Computer Science & Engineering
Ho Chi Minh City University of Technology
Semester 2, 2012-2013
Graduate Program in Computer Science
Electronic course notes
Compiled by: Dr. Võ Thị Ngọc Châu
(chauvtn@cse.hcmut.edu.vn)
Contents
Chapter 1: Overview of data mining
Chapter 2: Data preprocessing issues
Chapter 3: Data regression
Chapter 4: Data classification
Chapter 5: Data clustering
Chapter 6: Association rules
Chapter 7: Data mining and database technology
Chapter 8: Data mining applications
Chapter 9: Research topics in data mining
Chapter 10: Review
Chapter 2: Data Preprocessing Issues
2.1. Overview of the data preprocessing stage
2.2. Descriptive data summarization
2.3. Data cleaning
2.4. Data integration
2.5. Data transformation
2.6. Data reduction
2.7. Data discretization
2.8. Concept hierarchy generation
2.9. Summary
Scenario: an educational data mining problem

Student ID  Course ID  Academic year  Semester  Midterm score  Final score
50503660    001001     2005           1         6              5.5
50503660    004010     2005           1         NULL           8
50503660    004009     2005           1         NULL           7
50503660    006004     2005           1         3.5            13
50503660    007005     2005           1         NULL           4
50501879    007005     2005           1         5              10
50501879    006001     2005           1         4              13

How should NULL be interpreted?
Domain of the scores: [0, 1]; [0, 10]; {weak, poor, average, fairly good, good, very good, excellent}
Should all students be considered in the educational data mining problem?
Should all courses be considered in the educational data mining problem?
Besides course scores, which other characteristics of students could be considered in the educational data mining problem?
* Task: predicting whether regular (full-time) undergraduate students will graduate on time
2.1. Overview of the data preprocessing stage
[Figure: the knowledge discovery process. Labels: Data Sources, Data Cleaning, Data Integration, Data Warehouse, Selection/Transformation, Task-relevant Data, Data Mining, Patterns, Pattern Evaluation/Presentation.]
The data preprocessing stage
  The process of handling raw/original data in order to improve the quality of the data and, as a result, the quality of the mining results.
Raw/original data
  Structured, semi-structured, or unstructured
  Brought in from data sources in file processing systems and/or database systems
Data quality: accuracy, currency, completeness, consistency
Data quality
  Accuracy: the recorded values match the true values.
  Currency/timeliness: the recorded values are not out of date.
  Completeness: all values for a variable/attribute are recorded.
  Consistency: all data values are represented in the same way in all cases.
Data preprocessing techniques
  Data cleaning/cleansing: remove noise and correct data inconsistencies
  Data integration: merge data from multiple sources into one data warehouse
  Data transformation: data normalization
  Data reduction: reduce the size of the data, either the number of elements (through data aggregation or data clustering) or the number of dimensions/attributes (by removing redundant features)
Data preprocessing techniques
Data cleaning/cleansing
  Data summarization: identify the general characteristics of the data and the presence of noise or outliers
  Handling missing data
  Handling noisy data
Data integration
  Schema integration and object matching
  Redundancy
  Detection and resolution of data value conflicts
Data transformation
  Smoothing
  Aggregation
  Generalization
  Normalization
  Attribute/feature construction
Data reduction
  Data cube aggregation
  Attribute subset selection
  Dimensionality reduction
  Numerosity reduction
  Concept hierarchy generation and discretization
2.2. Descriptive data summarization
Student exam score data

Student  Exam score
1        25
2        25
3        40
4        45
5        50
6        60
7        60
8        60
9        65
10       80
11       85
12       85

What are the distribution and trend characteristics of the data?
What other special characteristics does the data have?
Identify the typical properties of the data in terms of central tendency and dispersion
  Measures of central tendency: mean, median, mode, midrange
  Measures of dispersion: quartiles, interquartile range (IQR), variance
Highlight the data values that should be treated as noise or outliers, and provide an overview of the data
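Measures of central tendency
  Mean: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$
  Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i}$
  Median: the middle value of the sorted data $x_1 \le x_2 \le \dots \le x_N$
  \[
  \mathrm{Median} =
  \begin{cases}
  x_{\lceil N/2 \rceil} & \text{if } N \text{ is odd} \\
  (x_{N/2} + x_{N/2+1})/2 & \text{if } N \text{ is even}
  \end{cases}
  \]
  Mode: the value that occurs most frequently in the data set
  Midrange: the average of the largest and smallest values in the data set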
Student exam score data (the same 12 scores): 25, 25, 40, 45, 50, 60, 60, 60, 65, 80, 85, 85

Exam score  Number of students
25          2
30          0
35          0
40          1
45          1
50          1
55          0
60          3
65          1
70          0
75          0
80          1
85          2

Mean = 56.67
Median = 60
Mode = 60
Midrange = 55
Measures of dispersion
  Quartiles
    The first quartile (Q1): the 25th percentile
    The second quartile (Q2): the 50th percentile (median)
    The third quartile (Q3): the 75th percentile
  Interquartile range: IQR = Q3 - Q1
  Outliers (the most extreme observations): values lying at least 1.5 x IQR above Q3 or below Q1
  Variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2$, with standard deviation $\sigma = \sqrt{\sigma^2}$
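Student exam score data with deviations from the mean (56.67)

Student  Exam score  Deviation
1        25          -31.67
2        25          -31.67
3        40          -16.67
4        45          -11.67
5        50          -6.67
6        60          3.33
7        60          3.33
8        60          3.33
9        65          8.33
10       80          23.33
11       85          28.33
12       85          28.33

Q1 = 42.5
Q2 = median = 60
Q3 = 72.5
IQR = Q3 - Q1 = 30
Outliers = ?
Variance: $\sigma^2$ = (sum of squared deviations) / N = 4716.67 / 12 ≈ 393.06
Standard deviation: $\sigma$ ≈ 19.83
(The extracted slide text shows 4310.56 and 65.65 here; those values do not match the deviations listed above, so the population variance and standard deviation recomputed from the table are given instead.)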
[Figure: boxplot with Q1, Q2, and Q3 marked.]
A descriptive summary of the data distribution consists of five important values: the median, Q1, Q3, the largest value, and the smallest value (in order: Minimum, Q1, Median, Q3, Maximum).
Student exam score data (the same 12 scores): 25, 25, 40, 45, 50, 60, 60, 60, 65, 80, 85, 85
Mean = 56.67 < Mode = Median = 60
Result: negatively skewed data

Five-number summary (Minimum, Q1, Median, Q3, Maximum):
25, 42.5, 60, 72.5, 85
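The summary measures for the 12 exam scores can be checked with a few lines of Python. This is a minimal sketch using only the standard library. The helper name `quartiles` is an assumption, and so is the median-of-halves quartile convention, chosen because it reproduces Q1 = 42.5 and Q3 = 72.5 from the slides (other conventions give slightly different quartiles). The variance is the population variance (dividing by N).

```python
from statistics import mean, median, multimode, pvariance

# The 12 exam scores from the slides.
scores = [25, 25, 40, 45, 50, 60, 60, 60, 65, 80, 85, 85]


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention.

    Assumes at least two values. For an odd count the middle value
    is left out of both halves.
    """
    s = sorted(values)
    half = len(s) // 2
    return median(s[:half]), median(s), median(s[len(s) - half:])


q1, q2, q3 = quartiles(scores)
iqr = q3 - q1

# Values beyond 1.5 x IQR from the quartiles are flagged as outliers.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in scores if x < low_fence or x > high_fence]

summary = {
    "mean": mean(scores),
    "median": median(scores),
    "mode": multimode(scores),
    "midrange": (min(scores) + max(scores)) / 2,
    "five_number": (min(scores), q1, q2, q3, max(scores)),
    "iqr": iqr,
    "outliers": outliers,
    "variance": pvariance(scores),
}

for name, value in summary.items():
    print(name, value)
```

With these fences (-2.5 and 117.5) no score is flagged, which is one way to answer the "Outliers = ?" question for this data set.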
2.3. Data cleaning
Handling missing data
Identifying outliers and reducing noisy data
Handling inconsistent data
Handling missing data
Definition of missing data
  Data that is not available when it needs to be used
Causes of missing data
  Objective (the value did not exist at data entry time, incidents, ...)
  Subjective (human factors)
Solutions for missing data
  Ignore the missing data
  Handle it manually (non-automatic or semi-automatic)
  Use replacement values (automatic): a global constant, the most common value, the global mean, a local mean, a predicted value, ...
  Prevent missing data: good database design and data entry procedures (data constraints)
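Two of the automatic replacement strategies listed above, the global mean and a local mean, can be sketched on the midterm column of the education scenario table. The function names are hypothetical, `None` stands in for NULL, and "local" is assumed here to mean "per student"; a per-course mean would be an equally valid reading.

```python
from statistics import mean

# (student_id, course_id, midterm_score); None stands for NULL.
records = [
    ("50503660", "001001", 6.0),
    ("50503660", "004010", None),
    ("50503660", "004009", None),
    ("50503660", "006004", 3.5),
    ("50503660", "007005", None),
    ("50501879", "007005", 5.0),
    ("50501879", "006001", 4.0),
]


def fill_global_mean(rows):
    """Replace each missing score with the mean of all known scores."""
    known = [score for _, _, score in rows if score is not None]
    global_mean = mean(known)
    return [(s, c, global_mean if score is None else score)
            for s, c, score in rows]


def fill_local_mean(rows):
    """Replace each missing score with the mean of the same student's
    known scores. Assumes every student has at least one known score."""
    by_student = {}
    for student, _, score in rows:
        if score is not None:
            by_student.setdefault(student, []).append(score)
    return [(s, c, mean(by_student[s]) if score is None else score)
            for s, c, score in rows]


global_filled = fill_global_mean(records)
local_filled = fill_local_mean(records)
print(global_filled[1])
print(local_filled[1])
```

Whether either replacement is appropriate depends on what NULL means in the scenario (for example "no midterm was held" versus "the student was absent"), which is exactly the question the scenario slide raises.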
Identifying outliers and reducing noisy data
Definitions
  Outliers: data (objects) that do not follow the general characteristics/behavior of the data set.
  Noisy data: outliers that are rejected/discarded as exceptions.
Causes
  Objective (data collection instruments, transmission errors, technology limitations, ...)
  Subjective (human factors)
Identifying outliers and reducing noisy data
Outlier detection solutions
  Statistical distribution-based
  Distance-based
  Density-based
  Deviation-based
Noise reduction solutions (data smoothing)
  Binning
  Regression
  Cluster analysis
Outlier detection: statistical distribution-based
  Block procedure: either all suspect objects are treated as outliers, or none of them are.
  Consecutive/sequential procedure: the least suspect object is tested first; if it is an outlier, all more extreme objects are outliers as well; otherwise the next suspect object is tested.
  Assume that the data set follows a given distribution model F (normal distribution, Poisson distribution, ...). Whether an object $o_i$ is an outlier with respect to this model is decided with a discordancy test that examines two hypotheses:
    Working hypothesis: under F, if the significance probability $SP(v_i) = \mathrm{Prob}(T > v_i)$ of the statistic value $v_i$ of $o_i$ is small, then $o_i$ is considered discordant and the working hypothesis is rejected. Otherwise all objects follow F.
    Alternative hypothesis: determines the probability that the working hypothesis is rejected when $o_i$ really is an outlier. In that case, $o_i$ follows another distribution model G.
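The working-hypothesis step can be illustrated with a deliberately simplified sketch. It assumes that F is a normal distribution whose mean and standard deviation are estimated from the data itself, takes the test statistic T to be the absolute z-score, and uses a hypothetical threshold `alpha`. The function name and the sample values are assumptions, and this is not a full discordancy test such as those in the statistics literature.

```python
from math import erfc, sqrt
from statistics import mean, pstdev


def discordant_objects(values, alpha=0.01):
    """Flag values whose significance probability under a fitted normal
    model is below alpha (a simplified discordancy check)."""
    mu, sigma = mean(values), pstdev(values)
    flagged = []
    for x in values:
        v = abs(x - mu) / sigma      # statistic value v_i for object o_i
        sp = erfc(v / sqrt(2))       # SP = Prob(|Z| > v_i) for standard normal Z
        if sp < alpha:               # small SP: reject the working hypothesis
            flagged.append(x)
    return flagged


# Hypothetical measurements with one extreme value.
measurements = [10, 11, 9, 10, 12, 10, 11, 9, 10, 30]
suspects = discordant_objects(measurements)
print(suspects)
```

Because the extreme value itself inflates the estimated standard deviation, this kind of check can miss outliers in small samples, which is one reason the slides also list distance-based, density-based, and deviation-based approaches.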
Outlier detection: distance-based
  Consider the distances from the other objects to the suspect object $o_i$: if at least a fraction pct of the objects lie farther than a distance dmin from $o_i$, then $o_i$ is an outlier.
  Outliers are the objects that do not have enough neighbors within the region defined by a given distance.
  Choosing the values of pct and dmin requires trial and error.
  pct: the fraction of objects that are not neighbors of the outlier
  dmin: the minimum distance used to define the neighborhood of each object
  [Figure: object $o_i$ is an outlier with pct = 0.8 and dmin = 1.]
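The distance-based definition translates almost directly into code. The sketch below is a naive O(n²) version with a hypothetical function name and hypothetical 2-D points. It assumes Euclidean distance and computes the fraction over the other objects (excluding the object itself), which is one possible reading of the definition.

```python
from math import dist


def distance_based_outliers(points, pct, dmin):
    """Return the points for which at least a fraction pct of the other
    points lie farther away than dmin (Euclidean distance)."""
    outliers = []
    for i, p in enumerate(points):
        others = [q for j, q in enumerate(points) if j != i]
        far = sum(1 for q in others if dist(p, q) > dmin)
        if far / len(others) >= pct:
            outliers.append(p)
    return outliers


# Hypothetical data: a tight cluster plus one isolated point.
points = [(0, 0), (0.5, 0), (0, 0.5), (0.5, 0.5), (0.25, 0.25), (5, 5)]
found = distance_based_outliers(points, pct=0.8, dmin=1.0)
print(found)
```

Changing pct or dmin changes which objects are flagged, which is why the slides note that the two parameters are usually tuned by trial and error.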
Outlier detection: density-based
  Based on the density of the neighborhood of each object
  The degree of outlierness is measured by the LOF (local outlier factor) of each object p
  This degree depends on how isolated the object is with respect to its neighborhood
    k-distance of p
    k-distance neighborhood of p
    Reachability distance of p with respect to o
    Local reachability density of p
  The higher LOF(p) is, the more p is considered a local outlier.
  [Figure: $o_1$ and $o_2$ are density-based outliers.]
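The four quantities listed above combine into LOF as in the following sketch. It is a naive O(n²) illustration with a hypothetical function name and hypothetical grid data, assuming Euclidean distance and the usual textbook definitions of k-distance, reachability distance, and local reachability density; it is not an optimized or library implementation.

```python
from math import dist


def lof_scores(points, k):
    """Compute the local outlier factor of every point (naive version).

    Assumes len(points) > k and that no point has k or more exact
    duplicates (otherwise a reachability sum could be zero).
    """
    n = len(points)

    def k_neighborhood(i):
        ranked = sorted((dist(points[i], points[j]), j)
                        for j in range(n) if j != i)
        k_dist = ranked[k - 1][0]                      # k-distance of point i
        members = [j for d, j in ranked if d <= k_dist]  # k-distance neighborhood
        return k_dist, members

    info = [k_neighborhood(i) for i in range(n)]

    def reach_dist(i, o):
        # Reachability distance of point i with respect to point o.
        return max(info[o][0], dist(points[i], points[o]))

    def lrd(i):
        # Local reachability density: inverse of the mean reachability distance.
        members = info[i][1]
        return len(members) / sum(reach_dist(i, o) for o in members)

    lrds = [lrd(i) for i in range(n)]
    return [sum(lrds[o] for o in info[i][1]) / (len(info[i][1]) * lrds[i])
            for i in range(n)]


# Hypothetical data: a 3 x 3 grid plus one isolated point.
points = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
scores = lof_scores(points, k=3)
print(scores)
```

Points inside the grid get scores close to 1, while the isolated point gets a much larger score, matching the rule that a higher LOF(p) marks p as more of a local outlier.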
Outlier detection: deviation-based
  Based on examining the main characteristics of the objects in a group
  Outliers are the objects that deviate from the other objects with respect to these main characteristics
  Sequential exception technique
    Simulates the way humans distinguish unusual objects from a series of similar objects; exploits the implicit redundancy of the data
    Determining the exception set: for each subset in a sequence of subsets built from the original data set, if removing the subset from the data set reduces the dissimilarity among the objects, the smoothing factor of the subset is computed. The subset with the largest smoothing factor is the exception set.
    Exception set: the set of outliers, that is, the smallest subset whose removal gives the greatest reduction in dissimilarity among the objects of the remaining set
Noise reduction: binning (by bin means, bin medians, bin boundaries)
  Operates on sorted data
  Distribute the data into bins (buckets)
    Equal-frequency: every bin holds the same number of elements
    Equal-width: every bin spans the same range of values
  Bin boundaries: the minimum and maximum values of each bin
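Smoothing by bin means and by bin boundaries can be sketched as follows. The function names are hypothetical, the nine values are illustrative data that is not taken from the slides, and the equal-frequency split assumes the number of values is divisible by the number of bins.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size.
    Assumes len(values) is divisible by n_bins."""
    s = sorted(values)
    size = len(s) // n_bins
    return [s[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value by the mean of its bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]   # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


# Illustrative data (not from the slides).
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
print(bins)
print(by_means)
print(by_boundaries)
```

Smoothing by bin medians follows the same pattern with the median in place of the mean.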
Noise reduction: regression
[Figure: regression line y = x + 1 on x-y axes, with X1 and Y1 marked.]
Noise reduction: cluster analysis
Handling inconsistent data
Definition of inconsistent data
  Data recorded differently for the same object/entity
    Day/month/year format: 2004/12/25 and 25/12/2004
    Course name: KPDL, Khai phá dữ liệu, Data mining
  Data recorded in a way that does not reflect the correct semantics of the objects/entities
    Foreign key constraints
Causes
  Inconsistent naming conventions or data codes
  Inconsistent formats of the input fields
  Faulty data recording devices or systems
Solutions
  Use metadata, data constraints, and checks by the data analyst to identify inconsistencies
  Correct inconsistent data manually
  Apply automatic data transformation/normalization solutions
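An automatic normalization step for the two kinds of inconsistency named above (date formats and course names) might look like the sketch below. The function names, the list of accepted date formats, the alias table, and the choice of "Data mining" as the canonical course name are all assumptions made for illustration.

```python
from datetime import datetime

# Assumed set of date formats that occur in the sources.
KNOWN_DATE_FORMATS = ("%Y/%m/%d", "%d/%m/%Y")

# Assumed alias table mapping every spelling to one canonical name.
COURSE_ALIASES = {
    "kpdl": "Data mining",
    "khai phá dữ liệu": "Data mining",
    "data mining": "Data mining",
}


def normalize_date(text):
    """Parse a date in any known format and return it as YYYY-MM-DD."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue   # try the next format
    raise ValueError(f"unrecognized date format: {text!r}")


def normalize_course(name):
    """Map a course name to its canonical form, or return it unchanged
    if no alias is known (so it can be reviewed manually)."""
    return COURSE_ALIASES.get(name.strip().lower(), name)


print(normalize_date("2004/12/25"), normalize_date("25/12/2004"))
print(normalize_course("KPDL"), normalize_course("Data mining"))
```

Values that match no known format or alias are left for manual correction, in line with the manual step listed among the solutions.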
2.4. Data integration
Data integration: the process of merging data from different sources into one data store that is ready for the data mining process
  Entity identification problem
    Schema integration
    Object matching
  Redundancy
  Data value conflicts
It deals with the structure and the semantic heterogeneity of the data
It helps reduce and avoid data redundancy and inconsistency, which improves the accuracy and speed of the data mining process
Entity identification problem
  Entities (objects/entities/attributes) come from multiple data sources.
  Two or more different entities may describe the same real-world entity.
    At the schema level:
      customer_id in source S1 and cust_number in source S2
    At the instance level:
      "R & D" in source S1 and "Research & Development" in source S2
      "Male" and "Female" in source S1 and "Nam" and "Nữ" in source S2
  Metadata plays an important role in resolving these cases
Redundancy
  Symptom: the value of an attribute can be derived/computed from one or more other attributes; data duplication.
  Causes: poor data organization, inconsistent naming of dimensions/attributes.
  Detecting redundancy: correlation analysis
    Based on the available data, check how well attribute B can be derived from attribute A.
    For numerical attributes, evaluate the correlation between the two attributes with the correlation coefficient (also known as Pearson's product-moment coefficient).
    For categorical/discrete attributes, evaluate the correlation between the two attributes with the chi-square ($\chi^2$) test.
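Correlation analysis between two numerical attributes A and B
  \[ r_{A,B} = \frac{\sum_{i=1}^{N} (a_i - \bar{A})(b_i - \bar{B})}{N \sigma_A \sigma_B} \]
  where N is the number of objects, $\bar{A}$ and $\bar{B}$ are the means, and $\sigma_A$ and $\sigma_B$ are the standard deviations of A and B.
  $r_{A,B} \in [-1, 1]$
  $r_{A,B} > 0$: A and B are positively correlated; the values of A increase as the values of B increase. The larger $r_{A,B}$ is, the stronger the correlation, and A or B can be removed as redundant.
  $r_{A,B} = 0$: A and B are uncorrelated (no linear relationship).
  $r_{A,B} < 0$: A and B are negatively correlated; the values of one attribute increase as the values of the other decrease.
[Figure: five scatter plots of attributes A and B.]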
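The correlation coefficient for two numerical attributes can be computed as in this sketch. The function name and the sample columns are hypothetical, and population standard deviations (dividing by N) are assumed, matching the N in the denominator of the formula.

```python
from statistics import mean, pstdev


def pearson_r(a, b):
    """Pearson correlation coefficient of two equally long numeric columns.
    Assumes neither column is constant (non-zero standard deviation)."""
    n = len(a)
    mean_a, mean_b = mean(a), mean(b)
    co_deviation = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return co_deviation / (n * pstdev(a) * pstdev(b))


# Hypothetical attribute columns.
a = [1, 2, 3, 4, 5]
b_doubled = [2, 4, 6, 8, 10]    # rises with a
b_reversed = [10, 8, 6, 4, 2]   # falls as a rises

r_positive = pearson_r(a, b_doubled)
r_negative = pearson_r(a, b_reversed)
print(r_positive, r_negative)
```

A coefficient close to 1 or -1, as for these two columns, suggests that one attribute can be derived from the other and is a candidate for removal.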
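Correlation analysis between two discrete attributes A and B
  A has c distinct values $a_1, a_2, \dots, a_c$.
  B has r distinct values $b_1, b_2, \dots, b_r$.
  \[ \chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}, \qquad e_{ij} = \frac{\mathrm{count}(A = a_i) \times \mathrm{count}(B = b_j)}{N} \]
  $o_{ij}$: the number of objects (tuples) whose value of A is $a_i$ and whose value of B is $b_j$.
  $e_{ij}$: the expected frequency of $(a_i, b_j)$.
  $\mathrm{count}(A = a_i)$: the number of objects whose value of A is $a_i$.
  $\mathrm{count}(B = b_j)$: the number of objects whose value of B is $b_j$.
  N: the total number of objects.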
Correlation analysis between two discrete attributes A and B
  The chi-square statistical test checks the hypothesis that A and B are independent, at a given significance level and number of degrees of freedom.
  If the hypothesis is rejected, A and B are statistically related.
  Degrees of freedom: (r-1)*(c-1)
  Look up the chi-square distribution table to obtain the critical value of $\chi^2$.
  If the computed value is greater than or equal to the table value, attributes A and B are correlated (the independence hypothesis is rejected).
Correlation analysis between two discrete attributes A and B: example
  Suppose 1500 people are surveyed on two attributes, gender and preferred_reading.
  Question: are gender and preferred_reading correlated?
  The $\chi^2$ test checks the hypothesis that gender and preferred_reading are independent.

  Observed counts (expected counts in parentheses):
               male       female       total
  fiction      250 (90)   200 (360)    450
  non_fiction  50 (210)   1000 (840)   1050
  total        300        1200         1500

  $o_{11}$ = 250; $o_{12}$ = 200; $o_{21}$ = 50; $o_{22}$ = 1000
  $e_{11}$ = (count(male)*count(fiction))/N = (300*450)/1500 = 90
  $e_{12}$ = (count(female)*count(fiction))/N = (1200*450)/1500 = 360
  $e_{21}$ = (count(male)*count(non_fiction))/N = (300*1050)/1500 = 210
  $e_{22}$ = (count(female)*count(non_fiction))/N = (1200*1050)/1500 = 840
  $\chi^2$ = (250-90)²/90 + (50-210)²/210 + (200-360)²/360 + (1000-840)²/840 = 284.44 + 121.90 + 71.11 + 30.48 = 507.93
  Degrees of freedom = (2-1)*(2-1) = 1; significance level = 0.001
  Table lookup: critical $\chi^2$ = 10.828, far smaller than the $\chi^2$ computed from the data set (507.93)
  Result: the independence hypothesis is rejected; gender and preferred_reading are correlated.
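The arithmetic of the survey example can be reproduced with a short script. The function name and the dictionary layout are assumptions, every cell of the contingency table is assumed to be present, and the critical value 10.828 is the table value quoted in the example (1 degree of freedom, significance level 0.001) rather than something the code looks up.

```python
def chi_square(observed):
    """Compute the chi-square statistic and degrees of freedom for a
    contingency table given as {(row_value, column_value): count}."""
    n = sum(observed.values())
    row_totals, col_totals = {}, {}
    for (row, col), count in observed.items():
        row_totals[row] = row_totals.get(row, 0) + count
        col_totals[col] = col_totals.get(col, 0) + count

    statistic = 0.0
    for (row, col), count in observed.items():
        expected = row_totals[row] * col_totals[col] / n   # e_ij
        statistic += (count - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Observed counts from the survey example.
observed = {
    ("fiction", "male"): 250,
    ("fiction", "female"): 200,
    ("non_fiction", "male"): 50,
    ("non_fiction", "female"): 1000,
}

statistic, dof = chi_square(observed)
CRITICAL_VALUE = 10.828   # table value quoted in the example
reject_independence = statistic >= CRITICAL_VALUE
print(statistic, dof, reject_independence)
```

The computed statistic is about 507.93, so the independence hypothesis is rejected, as in the worked example.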
Data value conflicts
  For the same real-world entity, attribute values coming from different data sources may differ in representation, scaling, and encoding.
    Representation: "2004/12/25" versus "25/12/2004".
    Scaling: the attribute weight in different measurement systems with different units; the attribute price in different monetary systems with different currencies.
    Encoding: "yes" and "no" versus "1" and "0".
2.5. Data transformation
Data transformation: the process of transforming or consolidating data into forms appropriate for the data mining process
  Smoothing
  Aggregation
  Generalization
  Normalization
  Attribute/feature construction
Smoothing
  Binning methods (bin means, bin medians, bin boundaries)
  Regression
  Clustering techniques (outlier analysis)
  Data discretization methods (concept hierarchies)
  Result: noise is removed from or reduced in the data.
Aggregation
  Data aggregation/summarization operations
  Convert data at one level of detail into data at a less detailed level
  Support data analysis at multiple time granularities
  Result: data reduction
Generalization
  Convert low-level/primitive/raw data into higher-level concepts by means of concept hierarchies
  Result: data reduction
Normalization
  The attribute values are converted into a specific, predefined range.
  Min-max normalization
    Old value: $v \in [min_A, max_A]$
    New value: $v' \in [new\_min_A, new\_max_A]$
    \[ v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A \]
    Example: normalizing scores from the range 0-4.0 to the range 0-10.0.
    What are the characteristics of min-max normalization?
  Z-score normalization
    Old value: v, with the mean $\bar{A}$ and standard deviation $\sigma_A$ of attribute A
    New value: $v' = \frac{v - \bar{A}}{\sigma_A}$
    What are the characteristics of z-score normalization?
  Normalization by decimal scaling
    Old value: v
    New value: $v' = \frac{v}{10^j}$, where j is the smallest integer such that $\max(|v'|) < 1$
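The three normalization methods can be compared side by side in a short sketch. The function names, the sample values for decimal scaling, and the use of the population standard deviation for the z-score are assumptions; the 0-4.0 to 0-10.0 conversion follows the example on the slide.

```python
from statistics import mean, pstdev


def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean_a, std_a):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean_a) / std_a


def decimal_scaling(values):
    """Divide by the smallest power of ten that brings every absolute
    value below 1. Returns the scaled values and the exponent j."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


# A score of 3.0 on a 0-4.0 scale converted to a 0-10.0 scale.
converted = min_max(3.0, 0.0, 4.0, 0.0, 10.0)

# Z-score of the exam score 60 among the 12 scores used earlier.
scores = [25, 25, 40, 45, 50, 60, 60, 60, 65, 80, 85, 85]
z_of_60 = z_score(60, mean(scores), pstdev(scores))

# Hypothetical values ranging from -986 to 917.
scaled, j = decimal_scaling([-986, 917])
print(converted, z_of_60, scaled, j)
```

Min-max normalization needs the true minimum and maximum in advance and fails on later values outside that range, whereas the z-score needs no fixed bounds, which is one way to approach the two "characteristics" questions on the slides.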
Attribute/feature construction
  New attributes are constructed from the set of existing attributes and added to it.
  Helps check accuracy and understand the structure of high-dimensional data.
  Helps discover missing information about the relationships among data attributes.
  Result: derived attributes
2.6. Data reduction
The transformed data set preserves the integrity of the original data but is much smaller in volume.
Data reduction strategies
  Data cube aggregation
  Attribute subset selection
  Dimensionality reduction
  Numerosity reduction
  Discretization
  Concept hierarchy generation
Data reduction can be lossless or lossy
Data cube aggregation
  Kinds of data: additive, semi-additive (numerical)
  Aggregate the data with group functions: average, min, max, sum, count, ...
  The data exists at different levels of abstraction.
  The higher the level of abstraction, the more the data volume is reduced.
  [Figure: data cube "Sale" aggregated with Sum().]
Attribute subset selection
  Reduce the size of the data set by removing redundant/irrelevant attributes/dimensions/features
  Goal: find the smallest set of attributes for which the probability distribution of the data classes stays close to the original distribution obtained with all attributes
  This is an optimization problem, approached with heuristics
Dimensionality reduction
  Wavelet transforms
  Principal component analysis
  What are their characteristics and applications?
Numerosity reduction
  Techniques that reduce the data volume by using alternative representations of the data.
  Parametric methods: a model estimates the data, and the model parameters are stored instead of the actual data
    Regression
  Nonparametric methods: store reduced representations of the data
    Histogram, Clustering, Sampling
Numerosity reduction
  Nonparametric methods: sampling
    Simple random sample without replacement (SRSWOR)
    Simple random sample with replacement (SRSWR)
    Cluster sample
    Stratified sample
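The four sampling schemes can be sketched with the standard `random` module. The function names, the fixed seed, the cluster size, the data, and the rule of sampling the same fraction from every stratum are assumptions made for illustration.

```python
import random

rng = random.Random(42)          # fixed seed so the sketch is repeatable
data = list(range(1, 101))       # hypothetical data set of 100 tuples

# SRSWOR: every tuple can be drawn at most once.
srswor = rng.sample(data, 10)

# SRSWR: a drawn tuple is put back and may be drawn again.
srswr = rng.choices(data, k=10)


def cluster_sample(records, cluster_size, n_clusters, rng):
    """Group records into consecutive clusters, then draw whole clusters."""
    clusters = [records[i:i + cluster_size]
                for i in range(0, len(records), cluster_size)]
    chosen = rng.sample(clusters, n_clusters)
    return [record for cluster in chosen for record in cluster]


def stratified_sample(records, key, fraction, rng):
    """Draw the same fraction (at least one record) from every stratum."""
    strata = {}
    for record in records:
        strata.setdefault(key(record), []).append(record)
    sample = []
    for group in strata.values():
        size = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, size))
    return sample


clustered = cluster_sample(data, cluster_size=10, n_clusters=2, rng=rng)

# Hypothetical customers in two unbalanced strata.
customers = [("young", i) for i in range(80)] + [("senior", i) for i in range(20)]
stratified = stratified_sample(customers, key=lambda c: c[0], fraction=0.1, rng=rng)
print(len(srswor), len(srswr), len(clustered), len(stratified))
```

The stratified sample keeps both groups represented in proportion to their size, which a simple random sample does not guarantee when one group is small.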
2.7. Data discretization
Reduce the number of values of a continuous attribute by dividing the attribute's range into intervals
Labels are assigned to these intervals and used in place of the actual attribute values
The attribute values can be partitioned according to a hierarchy or at multiple levels of resolution (multiresolution)
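A simple unsupervised instance of this idea is equal-width discretization, sketched below on the 12 exam scores used earlier in the chapter. The function name and the labels "low", "medium", and "high" are hypothetical, and the sketch assumes the values are not all identical (so the interval width is non-zero).

```python
def equal_width_discretize(values, labels):
    """Split the range [min, max] into len(labels) equal-width intervals
    and replace every value by the label of its interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    result = []
    for v in values:
        # The maximum value falls on the upper edge, so clamp it
        # into the last interval.
        index = min(int((v - lo) / width), len(labels) - 1)
        result.append(labels[index])
    return result


scores = [25, 25, 40, 45, 50, 60, 60, 60, 65, 80, 85, 85]
levels = equal_width_discretize(scores, ["low", "medium", "high"])
print(list(zip(scores, levels)))
```

Here the intervals are [25, 45), [45, 65), and [65, 85], so twelve distinct-looking numbers collapse into three labels; the detail of the original scores is lost in exchange for a more compact representation.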
Data discretization for numeric attributes
  Concept hierarchies are used to reduce the data by collecting low-level concepts and replacing them with higher-level concepts.
  Concept hierarchies are built automatically, based on an analysis of the data distribution.
  The detail of the attribute is lost.
  The resulting data is more meaningful and easier to interpret, and requires less storage space.
Data discretization methods for numeric attributes
  Binning
  Histogram analysis
  Interval merging by $\chi^2$ analysis
  Cluster analysis
  Entropy-based discretization
  Discretization by natural/intuitive partitioning
2.8. Concept hierarchy generation
Categorical data
  Discrete data
  Domain of a categorical attribute
    A finite number of distinct values
    No ordering among the values
Result: concept hierarchies are generated for discrete data
Methods for generating concept hierarchies for categorical/discrete data
  Specify a partial ordering/total ordering of the attributes explicitly at the schema level, by users or experts
  Specify a portion of a hierarchy by explicit data grouping
  Specify a set of attributes, but not their partial ordering
  Specify only a partial set of attributes
  Generate concept hierarchies by using pre-specified semantic connections
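When a set of attributes is given without an ordering, one common heuristic is to order them by their number of distinct values, placing the attribute with the fewest distinct values at the top of the hierarchy. The sketch below illustrates this; the function name and the address records are hypothetical, and the heuristic can give a wrong ordering for some attribute sets, so the result may still need an expert's review.

```python
def hierarchy_by_distinct_values(records, attributes):
    """Order attributes from the top of the hierarchy (fewest distinct
    values) to the bottom (most distinct values)."""
    counts = {a: len({record[a] for record in records}) for a in attributes}
    ordering = sorted(attributes, key=lambda a: counts[a])
    return ordering, counts


# Hypothetical address records.
records = [
    {"street": "Ly Thuong Kiet", "city": "Ho Chi Minh City", "country": "Vietnam"},
    {"street": "Nguyen Hue", "city": "Ho Chi Minh City", "country": "Vietnam"},
    {"street": "Trang Tien", "city": "Hanoi", "country": "Vietnam"},
    {"street": "Hang Bai", "city": "Hanoi", "country": "Vietnam"},
]

ordering, counts = hierarchy_by_distinct_values(
    records, ["street", "city", "country"])
print(" < ".join(reversed(ordering)))
print(counts)
```

For these records the generated hierarchy is street < city < country, with country as the most general level.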
2.9. Summary
Real-world data: incomplete/missing, noisy, inconsistent
The data preprocessing process
  Data cleaning: handling missing data, smoothing noisy data, identifying outliers, correcting inconsistent data
  Data integration: the entity identification problem, redundancy, data value conflicts
  Data transformation: smoothing, aggregation, generalization, normalization, attribute/feature construction
  Data reduction: data cube aggregation, attribute subset selection, dimensionality reduction, discretization and concept hierarchy generation
Data discretization
  Reduces the number of values of a continuous attribute by dividing its range into labeled intervals. The labels are used in place of the actual values.
  Can be carried out top-down or bottom-up, and supervised or unsupervised.
  Produces a hierarchical/multiresolution partitioning of the attribute values, that is, a concept hierarchy for the numerical attribute
Concept hierarchy generation
  Supports data mining at multiple levels of abstraction
  For numerical attributes: binning, histogram analysis, entropy-based discretization, $\chi^2$-merging, cluster analysis, discretization by intuitive partitioning
  For categorical/discrete attributes: explicit specification by users or experts, explicit data grouping, ordering based on the number of distinct values of each attribute

Q & A