
Data Mining: Concepts and Techniques
Chapter 3: Data Preprocessing

- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary
Why Data Preprocessing?

- Data in the real world is dirty
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data
  - A data warehouse needs consistent integration of quality data
Multi-Dimensional Measure of Data Quality

- A well-accepted multidimensional view:
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Believability
  - Value added
  - Interpretability
  - Accessibility
- Broad categories: intrinsic, contextual, representational, and accessibility.
Major Tasks in Data Preprocessing

- Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration
  - Integration of multiple databases, data cubes, or files
- Data transformation
  - Normalization and aggregation
- Data reduction
  - Obtains a reduced representation in volume that produces the same or similar analytical results
- Data discretization
  - Part of data reduction, but of particular importance, especially for numerical data
Forms of Data Preprocessing (figure)
Data Cleaning

- Data cleaning tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data
Missing Data

- Data is not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to
  - equipment malfunction
  - inconsistency with other recorded data, leading to deletion
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - failure to register the history or changes of the data
- Missing data may need to be inferred.
How to Handle Missing Data?

- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
- Fill in the missing value manually: tedious + infeasible?
- Use a global constant to fill in the missing value: e.g., "unknown", a new class?!
- Use the attribute mean to fill in the missing value
- Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
- Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree
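A minimal sketch of the constant, mean, and per-class mean strategies, assuming a pandas DataFrame with hypothetical columns income (numeric, with gaps) and class (the class label):

```python
# Sketch: filling missing values (hypothetical "income"/"class" columns).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [40_000, np.nan, 55_000, np.nan, 72_000, 38_000],
    "class":  ["low", "low", "high", "high", "high", "low"],
})

# Global constant: treat "missing" as its own value.
filled_const = df["income"].fillna(-1)

# Attribute mean over all samples.
filled_mean = df["income"].fillna(df["income"].mean())

# Attribute mean per class: usually a better estimate.
filled_by_class = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)
print(filled_by_class.tolist())
```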
Noisy Data

- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
- Other data problems which require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data
How to Handle Noisy Data?

- Binning method:
  - first sort the data and partition it into (equi-depth) bins
  - then smooth by bin means, by bin medians, by bin boundaries, etc.
- Clustering
  - detect and remove outliers
- Combined computer and human inspection
  - detect suspicious values and have a human check them
- Regression
  - smooth by fitting the data to regression functions
Simple Discretization Methods: Binning

- Equal-width (distance) partitioning:
  - Divides the range into N intervals of equal size: a uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N.
  - The most straightforward approach
  - But outliers may dominate the presentation
  - Skewed data is not handled well.
- Equal-depth (frequency) partitioning:
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky.
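A minimal numpy sketch of both partitionings, using the sorted price data from the example on the next slide:

```python
# Sketch: equal-width vs. equal-depth partitioning into N = 3 bins.
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
N = 3

# Equal-width: interval width W = (B - A) / N.
A, B = prices.min(), prices.max()
edges = A + (B - A) / N * np.arange(N + 1)   # [ 4., 14., 24., 34.]

# Equal-depth: each bin holds ~len(prices) / N samples.
depth_bins = np.array_split(np.sort(prices), N)
print(edges)
print([b.tolist() for b in depth_bins])
```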

Binning Methods for Data Smoothing

- Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
- Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
- Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
- Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
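A minimal sketch reproducing the smoothing above:

```python
# Sketch: smoothing the equi-depth bins by means and by boundaries.
import numpy as np

bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

# Each value is replaced by its bin's (rounded) mean.
by_means = [[int(round(np.mean(b)))] * len(b) for b in bins]

# Each value is replaced by the closer of the two bin boundaries.
def to_boundaries(b):
    lo, hi = min(b), max(b)
    return [lo if v - lo <= hi - v else hi for v in b]

by_bounds = [to_boundaries(b) for b in bins]
print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```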
Cluster Analysis (figure: data points grouped into clusters)
Regression (figure: data points on x/y axes fitted by the regression line y = x + 1; the value Y1 at X1 is smoothed onto the line)
Data Integration

- Data integration:
  - combines data from multiple sources into a coherent store
- Schema integration
  - integrate metadata from different sources
  - Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id and B.cust-#
- Detecting and resolving data value conflicts
  - for the same real-world entity, attribute values from different sources are different
  - possible reasons: different representations, different scales, e.g., metric vs. British units
Handling Redundant Data in Data Integration

- Redundant data occur often when integrating multiple databases
  - The same attribute may have different names in different databases
  - One attribute may be a "derived" attribute in another table, e.g., annual revenue
- Redundant data may be detected by correlation analysis
- Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
Data Transformation

- Smoothing: remove noise from the data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scale values to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction
  - New attributes constructed from the given ones
Data Transformation: Normalization

- min-max normalization:
  $v' = \frac{v - \min_A}{\max_A - \min_A}(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$
- z-score normalization:
  $v' = \frac{v - \mathit{mean}_A}{\mathit{stand\_dev}_A}$
- normalization by decimal scaling:
  $v' = \frac{v}{10^j}$, where $j$ is the smallest integer such that $\max(|v'|) < 1$
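A minimal numpy sketch of all three, with the target range for min-max assumed to be [0, 1]:

```python
# Sketch: min-max, z-score, and decimal-scaling normalization.
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# min-max to [new_min, new_max]
new_min, new_max = 0.0, 1.0
minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# z-score
zscore = (v - v.mean()) / v.std()

# decimal scaling: smallest j with max(|v / 10^j|) < 1
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
decimal = v / 10**j
print(minmax, zscore, decimal, sep="\n")
```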
Data Reduction Strategies

- A warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set
- Data reduction
  - Obtains a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results
- Data reduction strategies
  - Data cube aggregation
  - Dimensionality reduction
  - Numerosity reduction
  - Discretization and concept hierarchy generation
Data Cu"e )ggregation

The loest le#el of a data cu"e

the aggregated data for an indi#idual entity of


interest

e.g.$ a customer in a phone calling data arehouse.

Multiple le#els of aggregation in data cu"es

0urther reduce the size of data to deal ith

=eference appropriate le#els

9se the smallest representation hich is enough to


sol#e the tas!

'ueries regarding aggregated information should "e


ansered using data cu"e$ hen possi"le
Dimensionality Reduction

- Feature selection (i.e., attribute subset selection):
  - Select a minimum set of features such that the probability distribution of the different classes given the values for those features is as close as possible to the original distribution given the values of all features
  - Reduces the number of patterns, which makes them easier to understand
- Heuristic methods (needed due to the exponential number of choices; a sketch of the first follows this list):
  - step-wise forward selection
  - step-wise backward elimination
  - combining forward selection and backward elimination
  - decision-tree induction
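A minimal sketch of step-wise forward selection, assuming a caller-supplied evaluate(features) scoring function (e.g., cross-validated accuracy); the function name and signature are illustrative:

```python
# Sketch: greedy step-wise forward selection of k attributes.
def forward_selection(all_features, evaluate, k):
    selected = []
    while len(selected) < k:
        best_f, best_score = None, float("-inf")
        for f in all_features:
            if f in selected:
                continue
            score = evaluate(selected + [f])   # score the candidate subset
            if score > best_score:
                best_f, best_score = f, score
        selected.append(best_f)                # keep the best attribute
    return selected

# Toy usage: the score simply rewards A4, then A1, then A6.
weights = {"A1": 2, "A2": 0, "A3": 0, "A4": 3, "A5": 0, "A6": 1}
print(forward_selection(list(weights), lambda fs: sum(weights[f] for f in fs), 3))
# -> ['A4', 'A1', 'A6']
```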
Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}
(figure: the induced tree splits on A4?, then on A1? and A6?, with leaves Class 1 and Class 2)
=> Reduced attribute set: {A1, A4, A6}
Data Compression

- String compression
  - There are extensive theories and well-tuned algorithms
  - Typically lossless
  - But only limited manipulation is possible without expansion
- Audio/video compression
  - Typically lossy compression, with progressive refinement
  - Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
- Time sequences (not audio)
  - Typically short and varying slowly with time
Data Compression (figure: lossless compression maps the original data to compressed data and back exactly; lossy compression recovers only an approximation of the original data)
Wavelet Transforms

- Discrete wavelet transform (DWT): linear signal processing
- Compressed approximation: store only a small fraction of the strongest wavelet coefficients
- Similar to the discrete Fourier transform (DFT), but better lossy compression, localized in space
- Method:
  - The length, L, must be an integer power of 2 (pad with 0s when necessary)
  - Each transform has 2 functions: smoothing and difference
  - They apply to pairs of data points, resulting in two sets of data of length L/2
  - The two functions are applied recursively until the desired length is reached
- Examples: Haar-2, Daubechies-4
Principal Component Analysis

- Given N data vectors from k dimensions, find c <= k orthogonal vectors that can best be used to represent the data
- The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)
- Each data vector is a linear combination of the c principal component vectors
- Works for numeric data only
- Used when the number of dimensions is large
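A minimal numpy sketch, computing the c principal components via SVD of the centered data:

```python
# Sketch: project N x k data onto its top c principal components.
import numpy as np

def pca_reduce(X, c):
    Xc = X - X.mean(axis=0)                     # center each dimension
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:c].T                        # N x c reduced data

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                   # N = 100 vectors, k = 5
print(pca_reduce(X, 2).shape)                   # (100, 2)
```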
(figure: principal components Y1 and Y2 of data plotted on axes X1 and X2)
Numerosity Reduction

- Parametric methods
  - Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
  - Log-linear models: obtain the value at a point in m-D space as a product over appropriate marginal subspaces
- Non-parametric methods
  - Do not assume models
  - Major families: histograms, clustering, sampling
Regression and Log-Linear Models

- Linear regression: data are modeled to fit a straight line
  - Often uses the least-squares method to fit the line
- Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
- Log-linear model: approximates discrete multidimensional probability distributions
Regression Analysis and Log-Linear Models

- Linear regression: $Y = \alpha + \beta X$
  - The two parameters, $\alpha$ and $\beta$, specify the line and are estimated using the data at hand,
  - by applying the least-squares criterion to the known values of $Y_1, Y_2, \ldots$ and $X_1, X_2, \ldots$
- Multiple regression: $Y = b_0 + b_1 X_1 + b_2 X_2$
  - Many nonlinear functions can be transformed into the above
- Log-linear models:
  - The multi-way table of joint probabilities is approximated by a product of lower-order tables
  - Probability: $p(a, b, c, d) = \alpha_{ab}\,\beta_{ac}\,\chi_{ad}\,\delta_{bcd}$
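A minimal numpy sketch estimating alpha and beta for the simple linear case by least squares:

```python
# Sketch: least-squares estimates for Y = alpha + beta * X.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([1.9, 4.1, 6.0, 8.2, 9.9])

beta = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
alpha = Y.mean() - beta * X.mean()
print(alpha, beta)   # close to 0 and 2 for this toy data
```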
Histograms

- A popular data reduction technique
- Divide the data into buckets and store the average (sum) for each bucket
- Can be constructed optimally in one dimension using dynamic programming
- Related to quantization problems
(figure: example equal-width histogram over buckets 10,000 to 100,000, with bucket counts up to 40)
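A minimal numpy sketch of the reduction: keep only the bucket edges plus a count (or sum) per bucket:

```python
# Sketch: an equal-width histogram as a reduced data representation.
import numpy as np

rng = np.random.default_rng(1)
values = rng.integers(10_000, 100_000, size=1_000)

counts, edges = np.histogram(values, bins=10)               # counts per bucket
sums, _ = np.histogram(values, bins=edges, weights=values)  # sums per bucket
print(edges.astype(int))
print(counts, sums.astype(int))
```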
Clustering

- Partition the data set into clusters; then one can store the cluster representation only
- Can be very effective if the data is clustered, but not if the data is "smeared"
- Can use hierarchical clustering and be stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms, further detailed in Chapter 8
Sampling

- Allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data
- Choose a representative subset of the data
  - Simple random sampling may have very poor performance in the presence of skew
- Develop adaptive sampling methods
  - Stratified sampling:
    - Approximate the percentage of each class (or subpopulation of interest) in the overall database
    - Used in conjunction with skewed data
- Sampling may not reduce database I/Os (page at a time)
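A minimal pandas sketch contrasting simple random sampling with stratified sampling on a skewed class column (column names are illustrative):

```python
# Sketch: SRSWOR vs. stratified sampling on skewed data.
import pandas as pd

df = pd.DataFrame({
    "class": ["a"] * 90 + ["b"] * 10,   # skewed: 90% vs. 10%
    "value": range(100),
})

# Simple random sample without replacement (SRSWOR).
srs = df.sample(frac=0.2, replace=False, random_state=0)

# Stratified: preserve each class's share of the data.
strat = df.groupby("class", group_keys=False).apply(
    lambda g: g.sample(frac=0.2, random_state=0)
)
print(srs["class"].value_counts().to_dict())
print(strat["class"].value_counts().to_dict())
```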
Sampling (figure: SRSWOR, simple random sampling without replacement, and SRSWR drawn from the raw data)
Sampling (figure: raw data reduced to a cluster/stratified sample)
Hierarchical Reduction

- Use a multi-resolution structure with different degrees of reduction
- Hierarchical clustering is often performed but tends to define partitions of data sets rather than "clusters"
- Parametric methods are usually not amenable to hierarchical representation
- Hierarchical aggregation
  - An index tree hierarchically divides a data set into partitions by the value ranges of some attributes
  - Each partition can be considered as a bucket
  - Thus an index tree with aggregates stored at each node is a hierarchical histogram
Discretization

- Three types of attributes:
  - Nominal: values from an unordered set
  - Ordinal: values from an ordered set
  - Continuous: real numbers
- Discretization:
  - divide the range of a continuous attribute into intervals
  - Some classification algorithms only accept categorical attributes
  - Reduce data size by discretization
  - Prepare for further analysis
Discretization and Concept Hierarchy

- Discretization
  - reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals; interval labels can then be used to replace actual data values
- Concept hierarchies
  - reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) by higher-level concepts (such as young, middle-aged, or senior)
Discretization and Concept Hierarchy Generation for Numeric Data

- Binning (see the sections before)
- Histogram analysis (see the sections before)
- Clustering analysis (see the sections before)
- Entropy-based discretization
- Segmentation by natural partitioning
Entropy-Based Discretization

- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
  $E(S,T) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2)$
- The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
- The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g., $\mathrm{Ent}(S) - E(T,S) > \delta$
- Experiments show that it may reduce data size and improve classification accuracy
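A minimal sketch of the boundary search, scanning midpoints between consecutive sorted values:

```python
# Sketch: pick the boundary T minimizing E(S, T).
import math
from collections import Counter

def ent(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_boundary(values, labels):
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2      # candidate boundary
        s1 = [l for v, l in pairs if v <= t]
        s2 = [l for v, l in pairs if v > t]
        e = (len(s1) * ent(s1) + len(s2) * ent(s2)) / len(pairs)
        if e < best[1]:
            best = (t, e)
    return best

print(best_boundary([1, 2, 3, 10, 11, 12], list("aaabbb")))  # (6.5, 0.0)
```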
Segmentation "y natural
partitioning
3(B(F rule can "e used to segment numeric data into
relati#ely uniform$ :natural; inter#als.
A ,f an inter#al co#ers 3$ H$ O or D distinct #alues at
the most signi1cant digit$ partition the range into
3 equi(idth inter#als
A ,f it co#ers G$ B$ or C distinct #alues at the most
signi1cant digit$ partition the range into B inter#als
A ,f it co#ers E$ F$ or EM distinct #alues at the most
signi1cant digit$ partition the range into F inter#als
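A minimal sketch of one level of the rule, assuming the range endpoints are already rounded to the most significant digit:

```python
# Sketch: one level of the 3-4-5 rule on a rounded [low, high) range.
def intervals_345(low, high, msd):
    distinct = round((high - low) / msd)   # distinct msd values covered
    if distinct in (3, 6, 7, 9):
        n = 3
    elif distinct in (2, 4, 8):
        n = 4
    elif distinct in (1, 5, 10):
        n = 5
    else:
        raise ValueError("range not covered by the 3-4-5 rule")
    width = (high - low) / n
    return [(low + i * width, low + (i + 1) * width) for i in range(n)]

print(intervals_345(-1_000, 2_000, msd=1_000))
# -> [(-1000.0, 0.0), (0.0, 1000.0), (1000.0, 2000.0)]
```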
Example of the 3-4-5 Rule

- Step 1: profile the profit data: Min = -$351, Low (5th percentile) = -$159, High (95th percentile) = $1,838, Max = $4,700
- Step 2: most significant digit msd = 1,000; round to Low' = -$1,000 and High' = $2,000
- Step 3: (-$1,000 - $2,000) covers 3 distinct values at the msd, so partition into 3 intervals: (-$1,000 - $0], ($0 - $1,000], ($1,000 - $2,000]
- Step 4: adjust to the actual Min/Max, giving (-$400 - $5,000), and recurse on each interval:
  - (-$400 - $0]: 4 intervals: (-$400 - -$300], (-$300 - -$200], (-$200 - -$100], (-$100 - $0]
  - ($0 - $1,000]: 5 intervals: ($0 - $200], ($200 - $400], ($400 - $600], ($600 - $800], ($800 - $1,000]
  - ($1,000 - $2,000]: 5 intervals: ($1,000 - $1,200], ($1,200 - $1,400], ($1,400 - $1,600], ($1,600 - $1,800], ($1,800 - $2,000]
  - ($2,000 - $5,000]: 3 intervals: ($2,000 - $3,000], ($3,000 - $4,000], ($4,000 - $5,000]
Concept Hierarchy Generation for Categorical Data

- Specification of a partial ordering of attributes explicitly at the schema level by users or experts
- Specification of a portion of a hierarchy by explicit data grouping
- Specification of a set of attributes, but not of their partial ordering
- Specification of only a partial set of attributes
Specification of a Set of Attributes

A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy:
- country: 15 distinct values
- province_or_state: 365 distinct values
- city: 3,567 distinct values
- street: 674,339 distinct values
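A minimal pandas sketch of the heuristic: order the attributes by their distinct-value counts, fewest at the top:

```python
# Sketch: derive a concept-hierarchy ordering from distinct-value counts.
import pandas as pd

df = pd.DataFrame({
    "country":           ["US", "US", "CA", "CA"],
    "province_or_state": ["NY", "CA", "ON", "BC"],
    "city":              ["NYC", "LA", "Toronto", "Vancouver"],
    "street":            ["5th Ave", "Sunset Blvd", "King St", "Main St"],
})

hierarchy = df.nunique().sort_values().index.tolist()
print(hierarchy)   # top (fewest distinct values) to bottom (most)
```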
Summary

- Data preparation is a big issue for both warehousing and mining
- Data preparation includes
  - Data cleaning and data integration
  - Data reduction and feature selection
  - Discretization
- A lot of methods have been developed, but data preparation is still an active area of research