Professional Documents
Culture Documents
Concepts and Techniques: Data Mining
Concepts and Techniques: Data Mining
Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
2006 Jiawei Han and Micheline Kamber, All rights reserved
March 26, 2015
Data cleaning
Data reduction
Summary
e.g., occupation=
e.g., Salary=-10
Data cleaning
Data integration
Data reduction
Data transformation
Data discretization
Data cleaning
Data reduction
Summary
10
Motivation
11
x
N
Mode
Empirical formula:
12
13
14
15
Boxplot Analysis
Boxplot
16
17
Histogram Analysis
18
Quantile Plot
19
20
Scatter plot
21
Loess Curve
22
23
24
25
Data cleaning
Data reduction
Summary
26
Data Cleaning
Importance
Data cleaning is one of the three biggest
problems in data warehousingRalph Kimball
Data cleaning is the number one problem in
data warehousingDCI survey
27
Missing Data
equipment malfunction
28
29
Noisy Data
30
Binning
first sort data and partition into (equal-frequency)
bins
then one can smooth by bin means, smooth by
bin median, smooth by bin boundaries, etc.
Regression
smooth by fitting the data into regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human
(e.g., deal with possible outliers)
31
if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B A)/N.
32
33
Regression
y
Y1
y=x+1
Y1
X1
34
Cluster Analysis
35
36
Data cleaning
Data reduction
Summary
37
Data Integration
Data integration:
Combines data from multiple sources into a
coherent store
Schema integration: e.g., A.cust-id B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data
sources, e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts
For the same real world entity, attribute values
from different sources are different
Possible reasons: different representations,
different scales, e.g., metric vs. British units
38
39
40
2 (chi-square) test
41
Not play
chess
Sum
(row)
250(90)
200(360)
450
50(210)
1000(840)
1050
Sum(col.)
300
1200
1500
42
Data Transformation
min-max normalization
z-score normalization
Attribute/feature construction
43
Data Transformation:
Normalization
73,600 54,000
1.225
v
v' j
10
March 26, 2015
44
Data cleaning
Data reduction
Summary
45
Techniques
46
Techniques
47
48
A Data Cube
Branch cuboid
Product
Base cuboid
Cor1
Cor2
Cam1
Cam2
Lex1
Lex2
All
Branch
Dammam
10
11
12
10
47
Jeddah
11
48
Riyadh
12
44
All
33
29
26
17
23
11
139
Product cuboid
March 26, 2015
Aggregate cell
Base cell
Apex Cuboid
49
50
A1?
Class 1
>
March 26, 2015
Class 2
Class 1
Class 2
51
52
Data Compression
String compression
There are extensive theories and well-tuned
algorithms
Typically lossless
But only limited manipulation is possible without
expansion
Audio/video compression
Typically lossy compression, with progressive
refinement
Sometimes small fragments of signal can be
reconstructed without reconstructing the whole
Time sequence is not audio
Typically short and vary slowly with time
Data Mining: Concepts and
Techniques
53
Data Compression
Compressed
Data
Original Data
lossless
Original Data
Approximated
March 26, 2015
y
s
s
lo
54
Dimensionality Reduction:
Wavelet Transformation
Method:
55
Low Pass
Low Pass
Low Pass
High Pass
High Pass
High Pass
56
Dimensionality Reduction:
Principal Component Analysis
(PCA)
57
Principal Component
Analysis
X2
Y1
Y2
X1
58
Numerosity Reduction
59
60
Linear regression: Y = w X + b
Two regression coefficients, w and b, specify the
line and are to be estimated by using the data
at hand
Using the least squares criterion to the known
values of Y1, Y2, , X1, X2, .
Multiple regression: Y = b0 + b1 X1 + b2 X2.
Many nonlinear functions can be transformed
into the above
Log-linear models:
The multi-way table of joint probabilities is
approximated by a product of lower-order tables
Partitioning rules:
40
35
20
15
10
90000
80000
70000
60000
50000
40000
100000
30000
20000
10000
62
Can have hierarchical clustering and be stored in multidimensional index tree structures
63
64
R
O
W
SRS le random
t
p
u
o
m
i
h
t
s
i
(
w
e
l
samp ment)
ce
a
l
p
e
r
SRSW
R
Raw Data
March 26, 2015
65
Raw Data
Cluster/Stratified Sample
66
Data cleaning
Data reduction
Summary
67
Discretization
Discretization:
68
Discretization
69
70
Entropy-Based Discretization
71
ChiMerge [Kerber AAAI 1992, See also Liu et al. DMKD 2002]
72
Segmentation by Natural
Partitioning
73
Step 1:
Step 2:
-$351
-$159
Min
msd=1,000
profit
Low=-$1,000
$1,838
High(i.e, 95%-0 tile)
$4,700
Max
High=$2,000
(-$1,000 - $2,000)
Step 3:
(-$1,000 - 0)
($1,000 - $2,000)
(0 -$ 1,000)
(-$400 -$5,000)
Step 4:
(-$400 - 0)
(-$400 -$300)
(-$300 -$200)
(-$200 -$100)
(-$100 0)
(0 - $1,000)
(0 $200)
($1,000 $1,200)
($200 $400)
($1,200 $1,400)
($1,400 $1,600)
($400 $600)
($600 $800)
($800 $1,000)
($2,000 $3,000)
($3,000 $4,000)
($4,000 $5,000)
74
75
15 distinct values
country
province_or_ state
city
street
March 26, 2015
76
Data cleaning
Data reduction
Summary
77
Summary
Discretization
78
References
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons,
2003
H.V. Jagadish et al., Special Issue on Data Reduction Techniques. Bulletin of the Technical
Committee on Data Engineering, 20(4), December 1997
E. Rahm and H. H. Do. Data Cleaning: Problems and Current Approaches. IEEE Bulletin of
the Technical Committee on Data Engineering. Vol.23, No.4
V. Raman and J. Hellerstein. Potters Wheel: An Interactive Framework for Data Cleaning and
Transformation, VLDB2001
R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE
Trans. Knowledge and Data Engineering,
7:623-640,
1995
Data Mining:
Concepts and
Techniques
79
80