Data Mining: Concepts and Techniques
October 15,
2014
Chapter 3: Data Preprocessing
Data cleaning
Data reduction
Summary
Why Data Preprocessing?
Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Broad categories:
Data cleaning
Data integration
Data transformation
Data reduction
Data discretization
equipment malfunction
Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
technology limitation
duplicate records
incomplete data
inconsistent data
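Filling a missing value with the attribute mean of the samples in the same class can be sketched as follows. This is a minimal illustration, not the slides' own code: the function name `fill_with_class_mean`, the record layout, and the attribute names `income` and `risk` are all hypothetical, and missing values are assumed to be stored as `None`.

```python
from collections import defaultdict


def fill_with_class_mean(rows, attr, label):
    """Return copies of rows where a missing `attr` (None) is replaced by
    the mean of `attr` over the rows that share the same `label` value."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        if row[attr] is not None:
            sums[row[label]] += row[attr]
            counts[row[label]] += 1

    filled = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        cls = new_row[label]
        if new_row[attr] is None and counts[cls] > 0:
            new_row[attr] = sums[cls] / counts[cls]
        filled.append(new_row)
    return filled


# Hypothetical records: "income" is the attribute, "risk" is the class label.
records = [
    {"income": 30.0, "risk": "low"},
    {"income": 50.0, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 90.0, "risk": "high"},
    {"income": None, "risk": "high"},
]
completed = fill_with_class_mean(records, "income", "risk")
```

A class with no observed values is left unfilled in this sketch; a real pipeline would need a fallback such as the global mean.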
How to Handle Noisy Data?
Binning method:
Clustering
Regression
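As one possible reading of the binning method, the sketch below sorts the values, partitions them into equal-depth bins, and replaces each value by its bin mean. The choice of equal-depth bins and smoothing by means is an assumption (the slide only names binning), and the function name and sample values are hypothetical.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, cut them into consecutive bins of `bin_size`
    items (the last bin may be shorter), and replace every value by
    the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed


# Hypothetical sample values, three bins of four items each.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
smoothed = smooth_by_bin_means(prices, 4)
```

Here the three bins are smoothed to 9.0, 22.75, and 29.25 respectively.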
Data integration:
Schema integration
min-max normalization
z-score normalization
Attribute/feature construction
min-max normalization
z-score normalization: $v' = \frac{v - \mathrm{mean}_A}{\mathrm{stand\_dev}_A}$
normalization by decimal scaling: $v' = \frac{v}{10^j}$, where $j$ is the smallest integer such that $\max(|v'|) < 1$
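The normalization methods named above can be sketched in a few lines. This is an illustration under stated assumptions: the min-max function uses the standard textbook form (rescaling linearly to a target range `[new_min, new_max]`), the z-score uses the population standard deviation, and all function names are hypothetical.

```python
import math


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values to the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max normalization needs at least two distinct values")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        raise ValueError("z-score normalization needs a non-zero standard deviation")
    return [(v - mean) / std for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest non-negative integer such
    that every scaled absolute value is below 1. Returns (scaled, j)."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


scaled, j = decimal_scaling([-986, 917])
```

For the values -986 and 917, the loop stops at j = 3, so the scaled values are -0.986 and 0.917.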
Data Reduction Strategies
Data reduction
Dimensionality reduction
Numerosity reduction
decision-tree induction
Example of Decision Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Figure: decision tree with tests A4?, A1?, and A6?, and leaves labeled Class 1 and Class 2]
> Reduced attribute set: {A1, A4, A6}
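The reduction step in this example amounts to keeping only the attributes that the induced tree actually tests. The sketch below assumes the tree is already built and encodes it as nested tuples; the encoding, the function name, and the assignment of classes to branches are hypothetical, since the figure's branch labels did not survive.

```python
def attributes_used(tree):
    """Collect the attribute names tested at the internal nodes of a tree.

    An internal node is a tuple (attribute, yes_subtree, no_subtree);
    a leaf is a plain string holding the class label.
    """
    if isinstance(tree, str):
        return set()
    attribute, yes_branch, no_branch = tree
    return {attribute} | attributes_used(yes_branch) | attributes_used(no_branch)


# Hypothetical encoding of the example tree: A4 at the root, A1 and A6 below it.
tree = ("A4", ("A1", "Class 1", "Class 2"), ("A6", "Class 1", "Class 2"))

initial = ["A1", "A2", "A3", "A4", "A5", "A6"]
used = attributes_used(tree)
reduced = [a for a in initial if a in used]
```

Attributes that never appear in a test (A2, A3, A5 here) are treated as irrelevant and dropped.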
Data Compression
String compression
Typically lossless
Audio/video compression
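To illustrate what "lossless" means for string compression, the sketch below round-trips a byte string through Python's standard `zlib` module. The choice of zlib is an assumption made for the example, not a method named on the slide, and the input text is made up.

```python
import zlib

# A deliberately repetitive input, so the compressed form is clearly shorter.
text = ("data preprocessing " * 50).encode("utf-8")

compressed = zlib.compress(text)
restored = zlib.decompress(compressed)

# Lossless: decompression reproduces the original bytes exactly.
is_lossless = restored == text
ratio = len(compressed) / len(text)
```

Lossy audio or video codecs would not pass the `restored == text` check; they trade exact reconstruction for a smaller size.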
Method:
Parametric methods
Non-parametric methods
Linear regression: $Y = \alpha + \beta X$
Log-linear models:
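For the linear regression case, a parametric method stores only the fitted intercept and slope instead of the data points. The sketch below fits them by ordinary least squares; the function name and the sample points are hypothetical, and `alpha` and `beta` stand for the intercept and slope of the line.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = alpha + beta * x; returns (alpha, beta)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Hypothetical points lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
alpha, beta = fit_line(xs, ys)

# The two parameters now stand in for the five data points.
approximated = [alpha + beta * x for x in xs]
```

With real data the fit is approximate, so the reduction is lossy except for values that fall exactly on the line.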
A popular data reduction technique
Related to quantization problems.
[Figure: plot with vertical-axis ticks from 0 to 40 in steps of 5 and horizontal-axis ticks from 10000 to 100000 in steps of 10000]
Clustering
Stratified sampling:
[Figure: sampling from Raw Data; surviving labels: "...out replacement)" and "SRSWR"]
Sampling
[Figure: Raw Data and the resulting Cluster/Stratified Sample]
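The sampling schemes mentioned in this part of the deck can be sketched with the standard `random` module. This is an illustration under assumptions: the function names, the `fraction` parameter, the rule of taking at least one item per stratum, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def srswor(data, n, rng):
    """Simple random sample of n items without replacement."""
    return rng.sample(data, n)


def srswr(data, n, rng):
    """Simple random sample of n items with replacement."""
    return rng.choices(data, k=n)


def stratified_sample(data, key, fraction, rng):
    """Sample roughly `fraction` of each stratum (at least one item),
    where key(item) names the stratum an item belongs to."""
    strata = defaultdict(list)
    for item in data:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical skewed data: 80 "young" and 20 "senior" customers.
rng = random.Random(0)
customers = [("young", i) for i in range(80)] + [("senior", i) for i in range(20)]
sample = stratified_sample(customers, key=lambda c: c[0], fraction=0.1, rng=rng)
```

The stratified sample keeps the 80/20 proportion of the raw data (8 young, 2 senior), which a small simple random sample does not guarantee.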
Hierarchical Reduction
Hierarchical aggregation
Discretization:
Discretization
Concept hierarchies
Entropy-based discretization
Discretization
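One step of entropy-based discretization can be sketched as choosing the boundary that minimizes the class entropy of the two resulting intervals, weighted by interval size. The code below is a minimal sketch of that single step; the function names and sample data are hypothetical, and a full method would apply the split recursively with a stopping criterion.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_entropy) for the binary split of a
    numeric attribute that minimizes the size-weighted class entropy.
    Candidate thresholds are midpoints between adjacent distinct values."""
    pairs = sorted(zip(values, labels))
    best_threshold, best_entropy = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if weighted < best_entropy:
            best_threshold, best_entropy = threshold, weighted
    return best_threshold, best_entropy


# Hypothetical data: the class changes between ages 30 and 45.
ages = [21, 25, 30, 45, 52, 60]
buys = ["no", "no", "no", "yes", "yes", "yes"]
threshold, weighted_entropy = best_split(ages, buys)
```

For this data the chosen boundary is 37.5, which separates the two classes perfectly, so the weighted entropy is zero.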