— Chapter 2 —
Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber, All rights reserved
January 20, 2018 Data Mining: Concepts and Techniques 1
Chapter 2: Data Preprocessing
Multi-Dimensional Measure of Data Quality
n Completeness
n Consistency
n Timeliness
n Believability
n Value added
n Interpretability
n Accessibility
n Broad categories:
n Intrinsic, contextual, representational, and accessibility
n Data cleaning
n Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
n Data integration
n Integration of multiple databases, data cubes, or files
n Data transformation
n Normalization and aggregation
n Data reduction
n Obtains reduced representation in volume but produces the same
or similar analytical results
n Data discretization
n Part of data reduction but with particular importance, especially
for numerical data
Mining Data Descriptive Characteristics
n Motivation
n To better understand the data: central tendency, variation
and spread
n Data dispersion characteristics
n median, max, min, quantiles, outliers, variance, etc.
n Numerical dimensions correspond to sorted intervals
n Data dispersion: analyzed with multiple granularities of
precision
n Boxplot or quantile analysis on sorted intervals
n Dispersion analysis on computed measures
n Folding measures into numerical dimensions
n Boxplot or quantile analysis on the transformed cube
January 20, 2018 Data Mining: Concepts and Techniques 10
Measuring the Central Tendency
n Mean (algebraic measure), sample vs. population:
      x̄ = (1/n) Σ_{i=1}^{n} x_i          µ = (Σ x) / N
n Weighted arithmetic mean:
      x̄ = (Σ_{i=1}^{n} w_i x_i) / (Σ_{i=1}^{n} w_i)
n Trimmed mean: chopping extreme values before averaging
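The three central-tendency measures above can be sketched in a few lines of plain Python; the 10% trim fraction in the trimmed mean is an illustrative choice, not fixed by the text.

```python
def mean(xs):
    # Algebraic mean: x̄ = (1/n) * Σ x_i
    return sum(xs) / len(xs)

def weighted_mean(xs, ws):
    # Weighted arithmetic mean: x̄ = Σ w_i x_i / Σ w_i
    return sum(w * x for w, x in zip(ws, xs)) / sum(ws)

def trimmed_mean(xs, p=0.1):
    # Trimmed mean: chop off the lowest and highest fraction p of the sorted
    # values, then average the rest.  p = 0.1 is an assumed example value.
    xs = sorted(xs)
    k = int(len(xs) * p)
    return mean(xs[k:len(xs) - k])
```

A single extreme value (e.g. 1000 among single-digit values) shifts `mean` badly but leaves `trimmed_mean` almost unchanged, which is exactly why trimming is used.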
Data Cleaning
n Importance
n “Data cleaning is one of the three biggest problems
in data warehousing”—Ralph Kimball
n “Data cleaning is the number one problem in data
warehousing”—DCI survey
n Data cleaning tasks
n Fill in missing values
n Identify outliers and smooth out noisy data
n Correct inconsistent data
n Resolve redundancy caused by data integration
Missing Data
n Common causes: technology limitations, incomplete data, inconsistent data
n Clustering: detect and remove outliers (one way to handle noisy data)
[Figure: regression — data points in the (x, y) plane fitted by the line y = x + 1]
n Correlation coefficient (Pearson's product-moment coefficient):

      r_{A,B} = Σ (A − Ā)(B − B̄) / ((n − 1) σ_A σ_B) = (Σ (A·B) − n·Ā·B̄) / ((n − 1) σ_A σ_B)

  where n is the number of tuples, Ā and B̄ are the respective means, and σ_A, σ_B the respective standard deviations of A and B
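A minimal sketch of the correlation coefficient in plain Python, using the sample standard deviation (the n − 1 divisor matching the formula above):

```python
import math

def pearson_r(a, b):
    # r_{A,B} = Σ (a_i − ā)(b_i − b̄) / ((n − 1) σ_a σ_b)
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a) / (n - 1))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b) / (n - 1))
    return cov / ((n - 1) * sa * sb)
```

Perfectly linearly related attributes give r = 1 (or −1 for a negative relationship); values near 0 indicate no linear correlation.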
n Χ² (chi-square) test:

      χ² = Σ (Observed − Expected)² / Expected
n The larger the Χ2 value, the more likely the variables are
related
n The cells that contribute the most to the Χ2 value are
those whose actual count is very different from the
expected count
n Correlation does not imply causality
n # of hospitals and # of car thefts in a city are correlated
n Both are causally linked to a third variable: population
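The χ² statistic above can be computed directly from a contingency table of observed counts; this is a minimal sketch with expected counts derived from the row and column totals (the example counts in the test are invented for illustration):

```python
def chi_square(observed):
    # observed: 2-D list of counts.  The expected count for cell (i, j) is
    # row_total_i * col_total_j / grand_total, and
    # χ² = Σ (Observed − Expected)² / Expected over all cells.
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(col) for col in zip(*observed)]
    total = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_tot[i] * col_tot[j] / total
            chi2 += (o - e) ** 2 / e
    return chi2
```

A table whose cells all match their expected counts gives χ² = 0; the further the observed counts deviate, the larger χ² grows, and the more likely the two variables are related.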
n z-score normalization: v' = (v − µ) / σ
n Ex. Let µ = 54,000 and σ = 16,000. Then v = 73,600 maps to (73,600 − 54,000) / 16,000 = 1.225
n Normalization by decimal scaling:

      v' = v / 10^j,  where j is the smallest integer such that Max(|v'|) < 1
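Both normalization methods above are one-liners plus a small loop; this sketch reproduces the worked example (µ = 54,000, σ = 16,000, v = 73,600):

```python
def z_score(v, mu, sigma):
    # z-score (zero-mean) normalization: v' = (v − µ) / σ
    return (v - mu) / sigma

def decimal_scaling(values):
    # v' = v / 10^j, with j the smallest integer such that max(|v'|) < 1
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]
```

For instance, `z_score(73_600, 54_000, 16_000)` returns 1.225, matching the example, and values like 986 and −917 are decimal-scaled by j = 3 to 0.986 and −0.917.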
Chapter 2: Data Preprocessing
n Data reduction: obtain a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the
same) analytical results
n Data reduction strategies
n Data cube aggregation
n Data compression
n Heuristic methods (due to exponential # of choices):
n Step-wise forward selection
n Decision-tree induction
[Figure: decision tree whose internal nodes test attributes A4, A1, and A6]
n String compression
n Typically lossless, but only limited manipulation is possible without
expansion
n Audio/video compression
n Typically lossy compression, with progressive
refinement
n Sometimes small fragments of the signal can be
reconstructed without reconstructing the whole
[Figure: lossless vs. lossy compression — original data, compressed data, and an approximation of the original data]
n Each input data vector can be expressed as a combination of the principal
component vectors
n The principal components are sorted in order of decreasing
“significance” or strength
n Since the components are sorted, the size of the data can be
reduced by eliminating the weak components (those with low
variance)
[Figure: principal components Y1 and Y2 of data plotted in the original axes X1 and X2]
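A minimal NumPy sketch of the PCA idea above: center the data, eigendecompose the covariance matrix, sort the components by decreasing eigenvalue ("significance"), and keep only the k strongest. The function name and data are illustrative, not from the text.

```python
import numpy as np

def pca_reduce(X, k):
    # Center the data, eigendecompose the covariance matrix, keep the k
    # components with the largest eigenvalues, and project onto them.
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # sort by decreasing "significance"
    components = eigvecs[:, order[:k]]
    return Xc @ components
```

For points that lie on a line in 2-D, a single principal component captures all of the variance, so reducing to k = 1 loses nothing.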
n Linear regression: Y = w X + b
n Two regression coefficients, w and b, specify the line
and are to be estimated by using the data at hand
n Estimated by applying the least-squares criterion to the known
values Y1, Y2, … and X1, X2, …
n Multiple regression: Y = b0 + b1 X1 + b2 X2.
n Many nonlinear functions can be transformed into the
above
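The least-squares estimates for the simple linear model Y = wX + b have a closed form; this sketch implements it in plain Python. The sample points in the test are chosen to recover the line y = x + 1 from the earlier regression figure.

```python
def fit_line(xs, ys):
    # Least-squares estimates of w and b in Y = w·X + b:
    #   w = Σ (x − x̄)(y − ȳ) / Σ (x − x̄)²,   b = ȳ − w·x̄
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b = my - w * mx
    return w, b
```

Once fitted, only the two coefficients (w, b) need to be stored instead of the raw data, which is what makes regression a data-reduction method.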
n Log-linear models:
n The multi-way table of joint probabilities is
approximated by a product of lower-order tables
n Probability: p(a, b, c, d) = α_ab · β_ac · χ_ad · δ_bcd
Data Reduction Method (2): Histograms
[Figure: histogram over the value range 20,000–100,000]
n MaxDiff: set a bucket boundary between each pair of adjacent values for the
pairs with the β−1 largest differences
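The MaxDiff rule above can be sketched directly: sort the values, find the β−1 largest gaps between adjacent values, and cut the data into buckets at those gaps. The function name and example data are illustrative assumptions.

```python
def maxdiff_buckets(values, beta):
    # MaxDiff histogram: place bucket boundaries inside the β−1 largest gaps
    # between adjacent sorted values, yielding β buckets.
    vs = sorted(values)
    gaps = sorted(range(len(vs) - 1),
                  key=lambda i: vs[i + 1] - vs[i],
                  reverse=True)[:beta - 1]
    buckets, start = [], 0
    for i in sorted(gaps):
        buckets.append(vs[start:i + 1])
        start = i + 1
    buckets.append(vs[start:])
    return buckets
```

For β = 3 the two widest gaps become the boundaries, so tightly grouped runs of values end up in the same bucket.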
Data Reduction Method (3): Clustering
n Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only
n Can be very effective if data is clustered but not if data is
“smeared”
n Can have hierarchical clustering and be stored in multi-dimensional
index tree structures
n There are many choices of clustering definitions and clustering
algorithms
n Cluster analysis will be studied in depth in Chapter 7
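Storing only a cluster representation, as described above, can be sketched as follows; given the points of one cluster, keep just its centroid and diameter (maximum pairwise distance). The helper name and points are illustrative.

```python
import math

def cluster_summary(cluster):
    # Reduced representation of one cluster: (centroid, diameter)
    # instead of the raw points.
    dim = len(cluster[0])
    centroid = tuple(sum(p[d] for p in cluster) / len(cluster)
                     for d in range(dim))
    diameter = max((math.dist(p, q) for p in cluster for q in cluster),
                   default=0.0)
    return centroid, diameter
```

This works well when the data is genuinely clustered; if the data is "smeared", the diameter is large and the centroid summarizes little.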
[Figure: raw data reduced by SRSWOR (simple random sample without replacement) and SRSWR (simple random sample with replacement)]
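The two sampling schemes, SRSWOR and SRSWR, map directly onto Python's standard library:

```python
import random

def srswor(data, n):
    # Simple random sample WITHOUT replacement: each item appears at most once.
    return random.sample(data, n)

def srswr(data, n):
    # Simple random sample WITH replacement: items may be drawn repeatedly.
    return [random.choice(data) for _ in range(n)]
```

With replacement, duplicates are possible even for small samples; without replacement, the sample is always a set of distinct items from the raw data.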
Sampling: Cluster or Stratified Sampling
[Figure residue: stepwise interval partitioning with ranges (−$1,000 – $2,000) and (−$400 – $5,000); the rest of the figure is not recoverable]