— Chapter 3 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Chapter 3: Data Preprocessing
Data Quality: Why Preprocess the Data?
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
Chapter 3: Data Preprocessing
Exercise 1

Find the missing values using Mean, Class Average, and Imputation/Nearest-neighbor.

Age   Car Accidents   Class (Gender)
21    3               F
?     4               F
20    3               F
22    ?               M
19    5               M
18    ?               M
?     7               M

Mean
  Age: ___
  Car Accidents: ___

Class Average
  Age:            Female: ___   Male: ___
  Car Accidents:  Female: ___   Male: ___
Exercise 1 - Sol

Mean
  Age: (21+20+22+19+18)/5 = 20
  Car Accidents: (3+4+3+5+7)/5 = 4.4

Age   Car Accidents   Class (Gender)
21    3               F
20    4               F
20    3               F
22    4.4             M
19    5               M
18    4.4             M
20    7               M

Class Average
  Age:            Female: (21+20)/2 = 20.5    Male: (22+19+18)/3 = 19.7
  Car Accidents:  Female: (3+4+3)/3 = 3.33    Male: (5+7)/2 = 6

Age   Car Accidents   Class (Gender)
21    3               F
20.5  4               F
20    3               F
22    6               M
19    5               M
18    6               M
19.7  7               M
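A small pandas sketch of the two strategies just worked out by hand, mean imputation and class-average (per-gender) imputation. The DataFrame mirrors the exercise table; the column names and the choice of pandas are illustrative assumptions, not part of the original exercise.

# Hypothetical sketch: mean vs. class-average imputation with pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":       [21, np.nan, 20, 22, 19, 18, np.nan],
    "accidents": [3, 4, 3, np.nan, 5, np.nan, 7],
    "gender":    ["F", "F", "F", "M", "M", "M", "M"],
})

# (a) Mean imputation: fill each missing value with the overall column mean.
mean_imputed = df.copy()
mean_imputed[["age", "accidents"]] = mean_imputed[["age", "accidents"]].fillna(
    df[["age", "accidents"]].mean()          # age -> 20, accidents -> 4.4
)

# (b) Class-average imputation: fill with the mean of the same gender group.
class_imputed = df.copy()
for col in ["age", "accidents"]:
    class_imputed[col] = class_imputed[col].fillna(
        df.groupby("gender")[col].transform("mean")
    )

print(mean_imputed)
print(class_imputed)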
Exercise 1

Find the missing values using Imputation/Nearest-neighbor.

Row   Age   Car Accidents   Class (Gender)
A     21    3               F
B     ?     4               F
C     20    3               F
D     22    ?               M
E     19    5               M
F     18    ?               M
G     ?     7               M
Exercise 1 - Sol

Find the missing values using Imputation/Nearest-neighbor: for each row with a missing value, take the value from the most similar (nearest) row.

Row   Age   Car Accidents   Class (Gender)
A     21    3               F
B     ?     4               F
C     20    3               F
D     22    ?               M
E     19    5               M
F     18    ?               M
G     ?     7               M
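The slides do not prescribe a particular tool for nearest-neighbor imputation; one possible realization is scikit-learn's KNNImputer, sketched below on the same rows A-G. Encoding gender as 0/1 is an assumption, needed because the imputer accepts only numeric input.

# Sketch of nearest-neighbor imputation with scikit-learn's KNNImputer.
import numpy as np
from sklearn.impute import KNNImputer

# Rows A-G as [age, car_accidents, gender], with gender encoded F=0, M=1.
X = np.array([
    [21,     3,      0],   # A
    [np.nan, 4,      0],   # B
    [20,     3,      0],   # C
    [22,     np.nan, 1],   # D
    [19,     5,      1],   # E
    [18,     np.nan, 1],   # F
    [np.nan, 7,      1],   # G
])

imputer = KNNImputer(n_neighbors=1)    # copy each missing value from the closest row
X_filled = imputer.fit_transform(X)
print(X_filled)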
Noisy Data
Incorrect attribute values may arise from, e.g., technology limitations; other common data problems are incomplete data and inconsistent data.
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
Clustering
detect and remove outliers
Data Cleaning as a Process
Data discrepancy detection
Use metadata (e.g., domain, range, dependency, distribution)
Reading Assignment
Read example 3.1
Chapter 3: Data Preprocessing
The larger the χ² value, the more likely the variables are related.
The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.
Correlation does not imply causality:
  # of hospitals and # of car thefts in a city are correlated
  Both are causally linked to a third variable: population
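For reference, the χ² (chi-square) statistic used in the example and exercises below compares the observed count o_ij with the expected count e_ij of every cell in the contingency table of two attributes A and B:

\chi^{2} = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(o_{ij}-e_{ij})^{2}}{e_{ij}},
\qquad
e_{ij} = \frac{\mathrm{count}(A=a_{i})\times\mathrm{count}(B=b_{j})}{n}

where n is the total number of tuples, and r and c are the numbers of distinct values of A and B.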
Chi-Square Calculation: An Example

Are gender and preferred reading correlated?
χ² = 507.93, which is far above the threshold of 10.827 needed to reject the independence hypothesis (0.001 significance level, 1 degree of freedom), so gender and preferred reading are strongly correlated.
Exercise 2

Out of 5000 people, 3000 have an iPhone; of these, 2000 have installed WhatsApp. There are 2000 people who don't have an iPhone; of these, 1750 have installed WhatsApp. Are owning an iPhone and having WhatsApp installed correlated?

Correlation (χ²) = ___
Exercise 2 - Sol

              iPhone        No iPhone     Total
WhatsApp      2000 (2250)   1750 (1500)   3750
No WhatsApp   1000 (750)    250 (500)     1250
Total         3000          2000          5000

Correlation (χ²) = (2000−2250)²/2250 + (1750−1500)²/1500 + (1000−750)²/750 + (250−500)²/500 ≈ 277.8
This far exceeds the rejection threshold, so owning an iPhone and having WhatsApp installed are correlated.
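The Exercise 2 figures can be double-checked programmatically; below is a minimal sketch with SciPy's chi2_contingency (correction=False disables the Yates continuity correction so the result matches the hand calculation).

# Verifying the Exercise 2 chi-square value with SciPy.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([
    [2000, 1750],   # WhatsApp:    iPhone, no iPhone
    [1000,  250],   # No WhatsApp: iPhone, no iPhone
])

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2)        # ~277.8
print(expected)    # [[2250. 1500.]
                   #  [ 750.  500.]]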
Exercise 3

Find the correlation between color and gender, and between color and favorite food (alpha = .05).

ID   Gender   Fav. Color   Fav. Food
1    M        Blue         Cake
2    M        Blue         Pasta
3    F        Red          Pasta
4    M        Red          Cake
5    F        Red          Cake
6    F        Pink         Pasta
7    F        Red          Cake
8    M        Blue         Pasta
9    F        Red          Pasta
10   M        Blue         Cake
Exercise 3

Color and gender
  Null hypothesis H0: color and gender are independent.

          Blue    Red     Pink    Total
  Male    4 ( )   1 ( )   0 ( )   5
  Female  0 ( )   4 ( )   1 ( )   5
  Total   4       5       1       10

Color and food
  Null hypothesis H0: color and food are independent.
Exercise 3 - Sol

Color and gender (expected counts in parentheses)

          Blue    Red      Pink     Total
  Male    4 (2)   1 (2.5)  0 (0.5)  5
  Female  0 (2)   4 (2.5)  1 (0.5)  5
  Total   4       5        1        10

Correlation (color, gender)
  = (4−2)²/2 + (1−2.5)²/2.5 + (0−2)²/2 + (4−2.5)²/2.5 + (0−0.5)²/0.5 + (1−0.5)²/0.5
  = 6.8 -> strongly correlated

Color and food
  Null hypothesis H0: color and food are independent; apply the same procedure to the color/food contingency table.
Exercise 3 - Sol
• Since 6.8 exceeds the critical value 5.991 (χ² distribution with 2 degrees of freedom at alpha = .05), we can reject the hypothesis that color and gender are independent.
• We conclude that color and gender are (strongly) correlated for the given group of people.
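For Exercise 3 the contingency table and the test can likewise be produced from the raw records. The sketch below uses pandas and SciPy; the library choice and column names are assumptions of convenience rather than anything prescribed by the slides.

# Building the color/gender contingency table of Exercise 3 and testing it.
import pandas as pd
from scipy.stats import chi2_contingency

records = pd.DataFrame({
    "gender": ["M", "M", "F", "M", "F", "F", "F", "M", "F", "M"],
    "color":  ["Blue", "Blue", "Red", "Red", "Red", "Pink", "Red", "Blue", "Red", "Blue"],
})

table = pd.crosstab(records["gender"], records["color"])
chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(table)
print(chi2, dof, p)   # chi2 = 6.8 with dof = 2; reject independence since p < 0.05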
[Figure: scatter plots illustrating correlation values ranging from –1 to 1.]
Exercise 4

Given the following data, find the correlation between Age and Car Accidents using Pearson correlation.

          Age    Car Accidents
          23     3
          21     4
          20     3
          22     2
          19     5
          18     6
          17     7
Average   20.0   4.3
StdDev    2.16   1.7
Exercise 4 - Sol

Find the correlation between Age and Car Accidents using Pearson correlation.

  Age − mean        Accidents − mean
  23 − 20 =  3      3 − 4.3 = −1.3
  21 − 20 =  1      4 − 4.3 = −0.3
  20 − 20 =  0      3 − 4.3 = −1.3
  22 − 20 =  2      2 − 4.3 = −2.3
  19 − 20 = −1      5 − 4.3 =  0.7
  18 − 20 = −2      6 − 4.3 =  1.7
  17 − 20 = −3      7 − 4.3 =  2.7

Correl(age, acc) = (3·(−1.3) + 1·(−0.3) + 0·(−1.3) + 2·(−2.3) − 1·0.7 − 2·1.7 − 3·2.7) / (7 × 2.16 × 1.8) = −0.77

-> −0.77 shows that there is a strong negative correlation between the attributes age and # of car accidents.
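A quick way to check Exercise 4 is NumPy's corrcoef, sketched below on the same seven records.

# Pearson correlation between Age and Car Accidents (Exercise 4).
import numpy as np

age       = np.array([23, 21, 20, 22, 19, 18, 17])
accidents = np.array([3, 4, 3, 2, 5, 6, 7])

r = np.corrcoef(age, accidents)[0, 1]
print(r)   # ~ -0.90: a strong negative correlation

Note that corrcoef divides the sum of cross-deviations by sqrt(Σ(a−Ā)²·Σ(b−B̄)²), which gives r ≈ −0.90 here; the hand calculation above divides by n while using sample standard deviations, which is why it lands at −0.77. Either way, the correlation is strongly negative.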
Reading Assignment
Read example 3.2
Chapter 3: Data Preprocessing
Numerosity reduction
Sampling
Data Reduction 1: Dimensionality Reduction
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distance between points, which are critical to clustering and outlier
analysis, become less meaningful
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Principal Component Analysis
Supervised and nonlinear techniques (e.g., feature selection)
Principal Component Analysis (PCA)
Find a projection that captures the largest amount of variation in data
The original data are projected onto a much smaller space, resulting
in dimensionality reduction. We find the eigenvectors of the
covariance matrix, and these eigenvectors define the new space
[Figure: 2-D data plotted in the (x1, x2) plane with the principal component directions overlaid.]
Principal Component Analysis (Steps)
Given N data vectors in n dimensions, find k ≤ n orthogonal vectors
(principal components) that can best be used to represent the data
Normalize input data: Each attribute falls within the same range
Compute k orthonormal (unit) vectors, i.e., principal components
Each input data (vector) is a linear combination of the k principal
component vectors
The principal components are sorted in order of decreasing
“significance” or strength
Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance (i.e., using the strongest principal components, it is
possible to reconstruct a good approximation of the original data)
Works for numeric data only
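A minimal NumPy sketch of the steps listed above (normalize the data, compute the covariance matrix, take its eigenvectors, keep the k strongest components). The toy data and the choice k = 2 are assumptions for illustration.

# PCA via eigen-decomposition of the covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # toy data: N = 100 vectors, n = 3 dimensions
X = (X - X.mean(axis=0)) / X.std(axis=0)  # normalize each attribute

cov = np.cov(X, rowvar=False)             # n x n covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: symmetric matrix, ascending eigenvalues

order = np.argsort(eigvals)[::-1]         # sort components by decreasing variance
k = 2
components = eigvecs[:, order[:k]]        # the k strongest principal components

X_reduced = X @ components                # project the data onto the new k-dim space
print(X_reduced.shape)                    # (100, 2)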
Attribute Subset Selection
Another way to reduce dimensionality of data
Redundant attributes
Duplicate much or all of the information contained in
one or more other attributes
E.g., purchase price of a product and the amount of
sales tax paid
Irrelevant attributes
Contain no information that is useful for the data
mining task at hand
E.g., students' ID is often irrelevant to the task of
predicting students' GPA
Heuristic Search in Attribute Selection
■ There are 2^d possible attribute combinations of d attributes.
Attribute Subset Selection
Stepwise forward selection:
– The procedure starts with an empty set of attributes as the reduced set.
– First: The best single-feature is picked.
– Next: At each subsequent iteration or step, the best of the remaining
original attributes is added to the set.
Ex:
Initial attribute set: {A1, A2, A3, A4, A5, A6}
{A1} -> {A1, A4} -> {A1, A4, A6}
Reduced attribute set: {A1, A4, A6}
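One concrete way to run stepwise forward selection is scikit-learn's SequentialFeatureSelector; the estimator, dataset, and stopping point below are illustrative assumptions (setting direction="backward" gives the stepwise backward elimination described on the next slide).

# Stepwise forward selection with a simple classifier as the evaluation heuristic.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,     # stop once two attributes have been added
    direction="forward",        # "backward" -> stepwise backward elimination
)
selector.fit(X, y)
print(selector.get_support())   # boolean mask of the selected attributes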
Attribute Subset Selection
Stepwise backward elimination:
– The procedure starts with the full set of attributes.
– At each step, it removes the worst attribute remaining in the set.
Ex:
Initial attribute set:
{A1,A2,A3,A4,A5,A6}
{A1,A3,A4,A6}
{A1,A4,A5,A6}
Reduced attribute set:
{A1,A4,A6}
Attribute Subset Selection
Combining forward selection and backward elimination:
– The stepwise forward selection and backward elimination methods
can be combined
– At each step, the procedure selects the best attribute and removes
the worst from among the remaining attributes.
Attribute Subset Selection
Decision tree induction:
– Decision tree algorithms, such as ID3, C4.5,
and CART, were originally intended for
classification.
– Decision tree induction constructs a flowchart-like structure where each
internal (nonleaf) node denotes a test on an attribute, each branch
corresponds to an outcome of the test, and each external (leaf) node
denotes a class prediction.
– At each node, the algorithm chooses the
“best” attribute to partition the data into
individual classes.
– When decision tree induction is used for
attribute subset selection, a tree is constructed
from the given data.
– All attributes that do not appear in the
tree are assumed to be irrelevant.
Numerosity Reduction
■ Reduce data volume by choosing alternative, smaller
forms of data representation
■ Methods for storing reduced representations of the data include
histograms, clustering, and sampling.
Sampling
Types of Sampling
■ Cluster sampling:
■ If the tuples in D are grouped into M mutually disjoint
“clusters,” then a simple random sample of s clusters can
be obtained, where s < M.
Types of Sampling
■ Stratified sampling:
■ Partition the data set into strata and draw a simple random sample from each
stratum (e.g., approximately the same percentage of the data from each), so
that skewed data are better represented.
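A pandas sketch of the sampling variants discussed on these slides: simple random sampling without replacement (SRSWOR), with replacement (SRSWR), and stratified sampling. The 'stratum' column and the sample sizes are assumptions for illustration.

# Simple random sampling and stratified sampling with pandas.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
data = pd.DataFrame({
    "value":   rng.normal(size=1000),
    "stratum": rng.choice(["youth", "adult", "senior"], size=1000, p=[0.2, 0.5, 0.3]),
})

srswor = data.sample(n=100, replace=False, random_state=0)   # SRSWOR
srswr  = data.sample(n=100, replace=True,  random_state=0)   # SRSWR

# Stratified sampling: draw the same fraction (10%) from every stratum.
stratified = data.groupby("stratum", group_keys=False).sample(frac=0.10, random_state=0)

print(srswor.shape, srswr.shape)
print(stratified["stratum"].value_counts())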
Sampling: With or without Replacement
[Figure: raw data sampled by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement).]
Sampling: Cluster or Stratified Sampling
Reading Assignment
Read example 3.3
Chapter 3: Data Preprocessing
Exercise 5 – Min-Max (Sol)
  v' = (v − min)/(max − min) · (new_max − new_min) + new_min, with min = 20, max = 35, new range [0, 1]:
  20 -> ((20 − 20)/(35 − 20)) · (1 − 0) + 0 = 0
  25 -> (25 − 20)/(35 − 20) = 5/15 = 0.33
  30 -> (30 − 20)/(35 − 20) = 10/15 = 0.67

Exercise 5 – Z-Score
  v' = (v − μ)/σ

Exercise 5 – Decimal Scaling
  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
  30 -> 30/100 = 0.3 (j = 2)
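A NumPy sketch of the three normalization methods from Exercise 5. The value set [20, 25, 30, 35] is inferred from the worked numbers above and is otherwise an assumption.

# Min-max, z-score, and decimal-scaling normalization.
import numpy as np

v = np.array([20.0, 25.0, 30.0, 35.0])

# Min-max to [0, 1]: (v - min) / (max - min)
min_max = (v - v.min()) / (v.max() - v.min())     # [0.0, 0.33, 0.67, 1.0]

# Z-score: (v - mean) / std
z_score = (v - v.mean()) / v.std()

# Decimal scaling: v / 10^j with the smallest integer j such that max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(v).max())))       # j = 2 for these values
decimal_scaled = v / 10**j                        # [0.20, 0.25, 0.30, 0.35]

print(min_max, z_score, decimal_scaled, sep="\n")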
Discretization
Three types of attributes
Nominal—values from an unordered set, e.g., color, profession
Ordinal—values from an ordered set, e.g., military or academic rank
Numeric—real numbers, e.g., integer or real numbers
Discretization: Divide the range of a continuous attribute into intervals
Interval labels can then be used to replace actual data values
Reduce data size by discretization
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an attribute
Prepare for further analysis, e.g., classification
Data Discretization Methods
Typical methods: All the methods can be applied recursively
Binning
Top-down split, unsupervised
Clustering analysis (unsupervised, top-down split or
bottom-up merge)
Decision-tree analysis (supervised, top-down split)
Correlation (e.g., χ²) analysis (unsupervised, bottom-up merge)
Simple Discretization: Binning
Exercise 6

Draw the bins for Age using:
• Equi-width (2 bins) [A = 17, B = 29, N = 2], where W = (B − A)/N
  • W = ___
  • Bins: ___
• Equi-width (3 bins)
  • W = ___
  • Bins: ___
• Equi-depth (3 bins)
  • Bins: ___

Age values: 23, 21, 20, 22, 19, 18, 17, 29, 18, 18, 23, 20
Exercise 6 - Sol

• Equi-width (2 bins) [A = 17, B = 29, N = 2]
  • W = (29 − 17)/2 = 6; boundaries: 17 + 6 = 23, 23 + 6 = 29
  • Bins: 17 to 23, >23 to 29
• Equi-width (3 bins)
  • W = (29 − 17)/3 = 4
  • Bins: 17 to 21 (8 values), >21 to 25 (3 values), >25 to 29 (1 value)
• Equi-depth (3 bins)
  • Sort the numbers: 17, 18, 18, 18, 19, 20, 20, 21, 22, 23, 23, 29
  • Divide into three equal-size groups: 17, 18, 18, 18 | 19, 20, 20, 21 | 22, 23, 23, 29
  • Bins: 17-18, 19-21, 22-29

Age   Equi-width (2 bins)   Equi-width (3 bins)   Equi-depth (3 bins)
23    17-23                 >21 to 25             22-29
21    17-23                 17 to 21              19-21
20    17-23                 17 to 21              19-21
22    17-23                 >21 to 25             22-29
19    17-23                 17 to 21              19-21
18    17-23                 17 to 21              17-18
17    17-23                 17 to 21              17-18
29    >23 to 29             >25 to 29             22-29
18    17-23                 17 to 21              17-18
18    17-23                 17 to 21              17-18
23    17-23                 >21 to 25             22-29
20    17-23                 17 to 21              19-21
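The Exercise 6 bins can also be produced with pandas: pd.cut gives equal-width bins and pd.qcut gives (approximately) equal-depth bins. Note that qcut places its boundaries at quantiles, so its cut points may differ slightly from the hand-made split above.

# Equal-width and equal-depth binning of the Exercise 6 ages.
import pandas as pd

ages = pd.Series([23, 21, 20, 22, 19, 18, 17, 29, 18, 18, 23, 20])

equi_width_2 = pd.cut(ages, bins=2)    # two equal-width bins (width 6)
equi_width_3 = pd.cut(ages, bins=3)    # three equal-width bins (width 4)
equi_depth_3 = pd.qcut(ages, q=3)      # three bins with roughly equal counts

print(equi_width_2.value_counts().sort_index())
print(equi_width_3.value_counts().sort_index())
print(equi_depth_3.value_counts().sort_index())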
Concept Hierarchy Generation
Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in a data
warehouse
Concept hierarchies facilitate drilling and rolling in data warehouses to
view data in multiple granularity
Concept hierarchy formation: Recursively reduce the data by collecting
and replacing low level concepts (such as numeric values for age) by
higher level concepts (such as youth, adult, or senior)
Concept hierarchies can be explicitly specified by domain experts
and/or data warehouse designers
Concept hierarchy can be automatically formed for both numeric and
nominal data. For numeric data, use discretization methods shown.
Concept Hierarchy Generation
for Nominal Data
Specification of a partial/total ordering of attributes
explicitly at the schema level by users or experts
street < city < state < country
Specification of a hierarchy for a set of values by explicit
data grouping
{Urbana, Champaign, Chicago} < Illinois
Specification of only a partial set of attributes
E.g., only street < city, not others
Automatic generation of hierarchies (or attribute levels) by
the analysis of the number of distinct values
E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in
the data set
The attribute with the most distinct values is placed at the lowest level of the hierarchy
Summary
  Data integration: remove redundancies, detect inconsistencies
  Data reduction: dimensionality reduction, numerosity reduction, data compression
Reading Assignment
Read examples 3.4 to 3.8