
Data cleaning

* Data cleaning is the process of:
- identifying data that are inaccurate, incomplete, or unreasonable
- correcting the errors and omissions that are detected
- improving data quality as a result
* The process includes:
- checks of format, completeness, reasonableness, and value limits
- review of the data to identify outliers (geographic, statistical, temporal, or environmental) or other errors
- assessment of the data by subject-area experts
* The process usually results in removing, documenting, and subsequently rechecking and correcting suspect records.
* Validation checks may also be run to confirm compliance with applicable standards, rules, and conventions.
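
The format, completeness, and limit checks listed above can be sketched in a few lines of Python. This is only an illustration: the record layout, the field names (`id`, `collected_on`, `temperature_c`), and the allowed temperature range are assumptions made for the example and do not come from the slides.

```python
import re

# Hypothetical rules for an illustrative record layout (not from the slides).
DATE_FORMAT = re.compile(r"^\d{4}-\d{2}-\d{2}$")
REQUIRED_FIELDS = ("id", "collected_on", "temperature_c")
TEMPERATURE_LIMITS = (-90.0, 60.0)  # assumed plausible range in degrees Celsius


def validate(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    # Completeness: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")
    # Format: dates must look like YYYY-MM-DD.
    collected = record.get("collected_on")
    if collected and not DATE_FORMAT.match(collected):
        problems.append("bad date format")
    # Limits: numeric values must fall inside the allowed range.
    temp = record.get("temperature_c")
    if isinstance(temp, (int, float)):
        low, high = TEMPERATURE_LIMITS
        if not low <= temp <= high:
            problems.append("temperature out of range")
    return problems


records = [
    {"id": 1, "collected_on": "2017-06-30", "temperature_c": 24.5},
    {"id": 2, "collected_on": "30/06/2017", "temperature_c": 240.0},
    {"id": 3, "collected_on": "2017-06-30", "temperature_c": None},
]

# Suspect records are documented with their problems so they can be rechecked.
suspect = {}
for r in records:
    found = validate(r)
    if found:
        suspect[r["id"]] = found
print(suspect)
```

Collecting the problems per record, instead of dropping the record immediately, keeps the documentation and recheck steps possible.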

June 30, 2017 1


Data cleaning

* Data quality principles should be applied at every stage of the data management process (capture, digitization, storage, analysis, presentation, and use).
* There are two keys to improving data quality: prevention and correction.
- Prevention is closely tied to the collection of data and its entry into the database.
- Even with stronger error prevention, errors still exist in large data sets (Maletic and Marcus 2000), so data validation and correction cannot be skipped.
* Why it matters:
- Data cleaning is one of the three biggest problems in data warehousing, according to Ralph Kimball.
- Data cleaning is the number one problem in data warehousing, according to a DCI survey.

Data cleaning tasks

* Handle missing values.
* Handle noisy data: identify outliers and smooth the data.
* Correct inconsistent data.
* Resolve the redundancy introduced by data integration.
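Handling missing values

* Ignore records that have missing values:
- usually done when the class label is missing (assuming a classification task)
- not effective when the proportion of missing values is large (the semi-supervised setting)
* Fill in the missing values by hand:
- tedious
- feasibility is questionable
* Fill in the missing values automatically with:
- a global constant, for example "unknown" (but does that create a new class?)
- the mean of the attribute over the existing records
- the mean of the attribute over the records of the same class: smarter
- the most probable value, inferred with, for example, Bayes' formula or a decision tree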
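
A minimal sketch of the first three automatic strategies, using only the Python standard library. The sample rows, the class labels, and the function names are hypothetical and chosen only for illustration. The most-probable-value strategy is left out because it needs a trained model.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (class label, attribute value); None marks a missing value.
rows = [
    ("yes", 30.0), ("yes", None), ("yes", 50.0),
    ("no", 10.0), ("no", None), ("no", 20.0),
]


def fill_constant(rows, constant):
    """Replace every missing value with one global constant."""
    return [(c, constant if v is None else v) for c, v in rows]


def fill_global_mean(rows):
    """Replace missing values with the mean over all known values."""
    m = mean(v for _, v in rows if v is not None)
    return [(c, m if v is None else v) for c, v in rows]


def fill_class_mean(rows):
    """Replace missing values with the mean over known values of the same class.

    Assumes every class has at least one known value.
    """
    known = defaultdict(list)
    for c, v in rows:
        if v is not None:
            known[c].append(v)
    means = {c: mean(vs) for c, vs in known.items()}
    return [(c, means[c] if v is None else v) for c, v in rows]


print(fill_constant(rows, "unknown")[1])  # ('yes', 'unknown')
print(fill_global_mean(rows)[1])          # ('yes', 27.5)
print(fill_class_mean(rows)[1])           # ('yes', 40.0)
```

In this toy data the class mean (40.0 for "yes", 15.0 for "no") stays closer to the other values of each class than the global mean of 27.5, which is the sense in which the slide calls it smarter.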



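Noisy data

* Noise:
- random error
- variance (distortion) in a measured variable
* Incorrect values can come from:
- faulty data collection instruments
- data entry problems: people or machines can make mistakes
- data transmission problems: errors in the sending, receiving, or transmitting equipment
- technology limitations: for example, software may process the data incorrectly
- inconsistent naming: the same name written in different ways
* Other data problems that require data cleaning:
- duplicate records
- incomplete data
- inconsistent data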



Handling noisy data

* Binning:
- sort the data in ascending order and partition it "equally" into bins
- smooth by bin means, bin medians, or bin boundaries
* Clustering:
- detect and remove outliers
* Combined computer and human inspection:
- detect suspicious values and have a person check them (for example, to deal with possible outliers)
* Regression:
- smooth by fitting the data to regression functions
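
As one possible illustration of smoothing by regression, the sketch below fits a straight line by ordinary least squares and replaces each observed value with the fitted one. The choice of a simple linear model, the sample data, and the name `fit_line` are assumptions made for this example. The slides do not prescribe a particular regression function.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept.

    Assumes the x values are not all identical.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical noisy measurements that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)
# Smoothing: replace each observed y with the value predicted by the fitted line.
smoothed = [slope * x + intercept for x in xs]
print(round(slope, 2), round(intercept, 2))
print([round(v, 2) for v in smoothed])
```

Values that lie far from the fitted line are natural candidates for the human check mentioned above.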



Simple discretization method: binning

* Equal-width (distance) partitioning:
- divides the value range into N intervals of equal length: a uniform grid
- if A is the smallest value and B the largest, the interval width is W = (B - A)/N
- the simplest approach, but it is sensitive to outliers
- does not handle skewed (unevenly distributed) data well
* Equal-depth (frequency) partitioning:
- divides the range into N intervals that are "equal in count", each holding approximately the same number of samples
- scales well with the size of the data
- handling categorical (class) attributes can be tricky
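
The two partitioning schemes can be compared with a short sketch. The function names are hypothetical, and the sample values are the price list from the smoothing example in this deck. Two details are assumptions of this sketch: the largest value is clamped into the last equal-width bin, and any remainder in equal-depth binning goes to the first bins.

```python
def equal_width_bins(values, n):
    """Split values into n bins of width W = (B - A) / n.

    Assumes max(values) > min(values), so the width is not zero.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    bins = [[] for _ in range(n)]
    for v in values:
        # The maximum would fall at index n, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n - 1)
        bins[idx].append(v)
    return bins


def equal_depth_bins(values, n):
    """Split sorted values into n bins holding about the same number of items."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        # The first `extra` bins take one additional item each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(prices, 3))
# [[4, 8, 9], [15, 21, 21], [24, 25, 26, 28, 29, 34]]
print(equal_depth_bins(prices, 3))
# [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
```

With W = (34 - 4)/3 = 10, the equal-width bins end up with 3, 3, and 6 values, while the equal-depth bins hold 4 values each.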



Binning methods for data smoothing

* Sorted price data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-depth bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest integer):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries (each value is replaced by the nearer bin boundary):
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
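
The worked example above can be reproduced with a short sketch. The function names are hypothetical. Two choices are assumptions of this sketch: bin means are rounded to the nearest integer, as the figures above suggest, and a value exactly halfway between the two boundaries goes to the lower one (no such tie occurs in this data).

```python
def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean, rounded to an integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the nearer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        # Ties go to the lower boundary (an assumption of this sketch).
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))
# [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))
# [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```

The exact bin means are 9, 22.75, and 29.25, which round to the 9, 23, and 29 shown above.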
