
Data Mining: Concepts and Techniques
Chapter 3: Data Preprocessing

- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary
Why Data Preprocessing?

- Data in the real world is dirty
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data
  - A data warehouse needs consistent integration of quality data
Multi-Dimensional Measure of Data Quality

- A well-accepted multidimensional view:
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Believability
  - Value added
  - Interpretability
  - Accessibility
- Broad categories: intrinsic, contextual, representational, and accessibility.
Major Tasks in Data Preprocessing

- Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration
  - Integration of multiple databases, data cubes, or files
- Data transformation
  - Normalization and aggregation
- Data reduction
  - Obtains a reduced representation in volume that produces the same or similar analytical results
- Data discretization
  - Part of data reduction, but of particular importance, especially for numerical data
Forms of Data Preprocessing (figure)
Data Cleaning

- Data cleaning tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data
Missing Data

- Data is not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to
  - equipment malfunction
  - inconsistency with other recorded data, leading to deletion
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - failure to register the history or changes of the data
- Missing data may need to be inferred.
How to Handle Missing Data?

- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
- Fill in the missing value manually: tedious + infeasible?
- Use a global constant to fill in the missing value: e.g., "unknown", a new class?!
- Use the attribute mean to fill in the missing value
- Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
- Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree
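A minimal sketch of the constant, mean, and per-class mean strategies, assuming a pandas DataFrame with hypothetical columns income (numeric, with gaps) and class (the class label):

```python
# Sketch: filling missing values (hypothetical "income"/"class" columns).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [40_000, np.nan, 55_000, np.nan, 72_000, 38_000],
    "class":  ["low", "low", "high", "high", "high", "low"],
})

# Global constant: treat "missing" as its own value.
filled_const = df["income"].fillna(-1)

# Attribute mean over all samples.
filled_mean = df["income"].fillna(df["income"].mean())

# Attribute mean per class: usually a better estimate.
filled_by_class = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)
print(filled_by_class.tolist())
```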
Noisy Data

- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
- Other data problems which require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data
How to Handle Noisy Data?

- Binning method:
  - first sort the data and partition it into (equi-depth) bins
  - then smooth by bin means, by bin medians, by bin boundaries, etc.
- Clustering
  - detect and remove outliers
- Combined computer and human inspection
  - detect suspicious values and have a human check them
- Regression
  - smooth by fitting the data to regression functions
Simple Discretization Methods: Binning

- Equal-width (distance) partitioning:
  - Divides the range into N intervals of equal size: a uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N.
  - The most straightforward approach
  - But outliers may dominate the presentation
  - Skewed data is not handled well.
- Equal-depth (frequency) partitioning:
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky.
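A minimal numpy sketch of both partitionings, using the sorted price data from the example on the next slide:

```python
# Sketch: equal-width vs. equal-depth partitioning into N = 3 bins.
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
N = 3

# Equal-width: interval width W = (B - A) / N.
A, B = prices.min(), prices.max()
edges = A + (B - A) / N * np.arange(N + 1)   # [ 4., 14., 24., 34.]

# Equal-depth: each bin holds ~len(prices) / N samples.
depth_bins = np.array_split(np.sort(prices), N)
print(edges)
print([b.tolist() for b in depth_bins])
```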

Binning Methods for Data Smoothing

- Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
- Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
- Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
- Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
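A minimal sketch reproducing the smoothing above:

```python
# Sketch: smoothing the equi-depth bins by means and by boundaries.
import numpy as np

bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

# Each value is replaced by its bin's (rounded) mean.
by_means = [[int(round(np.mean(b)))] * len(b) for b in bins]

# Each value is replaced by the closer of the two bin boundaries.
def to_boundaries(b):
    lo, hi = min(b), max(b)
    return [lo if v - lo <= hi - v else hi for v in b]

by_bounds = [to_boundaries(b) for b in bins]
print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```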
Cluster Analysis (figure: data points grouped into clusters)
Regression (figure: data points on x/y axes fitted by the regression line y = x + 1; the value Y1 at X1 is smoothed onto the line)
Data Integration

- Data integration:
  - combines data from multiple sources into a coherent store
- Schema integration
  - integrate metadata from different sources
  - Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id and B.cust-#
- Detecting and resolving data value conflicts
  - for the same real-world entity, attribute values from different sources are different
  - possible reasons: different representations, different scales, e.g., metric vs. British units
Handling Redundant Data in Data Integration

- Redundant data occur often when integrating multiple databases
  - The same attribute may have different names in different databases
  - One attribute may be a "derived" attribute in another table, e.g., annual revenue
- Redundant data may be detected by correlation analysis
- Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
Data Transformation

- Smoothing: remove noise from the data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scale values to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction
  - New attributes constructed from the given ones
Data Transformation: Normalization

- min-max normalization:
  $v' = \frac{v - \min_A}{\max_A - \min_A}(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$
- z-score normalization:
  $v' = \frac{v - \mathit{mean}_A}{\mathit{stand\_dev}_A}$
- normalization by decimal scaling:
  $v' = \frac{v}{10^j}$, where $j$ is the smallest integer such that $\max(|v'|) < 1$
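A minimal numpy sketch of all three, with the target range for min-max assumed to be [0, 1]:

```python
# Sketch: min-max, z-score, and decimal-scaling normalization.
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# min-max to [new_min, new_max]
new_min, new_max = 0.0, 1.0
minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# z-score
zscore = (v - v.mean()) / v.std()

# decimal scaling: smallest j with max(|v / 10^j|) < 1
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
decimal = v / 10**j
print(minmax, zscore, decimal, sep="\n")
```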
Data Reduction Strategies

- A warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set
- Data reduction
  - Obtains a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results
- Data reduction strategies
  - Data cube aggregation
  - Dimensionality reduction
  - Numerosity reduction
  - Discretization and concept hierarchy generation
Data Cu"e )ggregation

The loest le#el of a data cu"e

the aggregated data for an indi#idual entity of


interest

e.g.$ a customer in a phone calling data arehouse.

Multiple le#els of aggregation in data cu"es

0urther reduce the size of data to deal ith

=eference appropriate le#els

9se the smallest representation hich is enough to


sol#e the tas!

'ueries regarding aggregated information should "e


ansered using data cu"e$ hen possi"le
Dimensionality Reduction

- Feature selection (i.e., attribute subset selection):
  - Select a minimum set of features such that the probability distribution of the different classes given the values for those features is as close as possible to the original distribution given the values of all features
  - Reduces the number of patterns, which makes them easier to understand
- Heuristic methods (needed due to the exponential number of choices; a sketch of the first follows this list):
  - step-wise forward selection
  - step-wise backward elimination
  - combining forward selection and backward elimination
  - decision-tree induction
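A minimal sketch of step-wise forward selection, assuming a caller-supplied evaluate(features) scoring function (e.g., cross-validated accuracy); the function name and signature are illustrative:

```python
# Sketch: greedy step-wise forward selection of k attributes.
def forward_selection(all_features, evaluate, k):
    selected = []
    while len(selected) < k:
        best_f, best_score = None, float("-inf")
        for f in all_features:
            if f in selected:
                continue
            score = evaluate(selected + [f])   # score the candidate subset
            if score > best_score:
                best_f, best_score = f, score
        selected.append(best_f)                # keep the best attribute
    return selected

# Toy usage: the score simply rewards A4, then A1, then A6.
weights = {"A1": 2, "A2": 0, "A3": 0, "A4": 3, "A5": 0, "A6": 1}
print(forward_selection(list(weights), lambda fs: sum(weights[f] for f in fs), 3))
# -> ['A4', 'A1', 'A6']
```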
Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}
(figure: the induced tree splits on A4?, then on A1? and A6?, with leaves Class 1 and Class 2)
=> Reduced attribute set: {A1, A4, A6}
Data Compression

- String compression
  - There are extensive theories and well-tuned algorithms
  - Typically lossless
  - But only limited manipulation is possible without expansion
- Audio/video compression
  - Typically lossy compression, with progressive refinement
  - Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
- Time sequences (not audio)
  - Typically short and varying slowly with time
Data Compression (figure: lossless compression maps the original data to compressed data and back exactly; lossy compression recovers only an approximation of the original data)
Wavelet Transforms

- Discrete wavelet transform (DWT): linear signal processing
- Compressed approximation: store only a small fraction of the strongest wavelet coefficients
- Similar to the discrete Fourier transform (DFT), but better lossy compression, localized in space
- Method:
  - The length, L, must be an integer power of 2 (pad with 0s when necessary)
  - Each transform has 2 functions: smoothing and difference
  - They apply to pairs of data points, resulting in two sets of data of length L/2
  - The two functions are applied recursively until the desired length is reached
- Examples: Haar-2, Daubechies-4
Principal Component Analysis

- Given N data vectors from k dimensions, find c <= k orthogonal vectors that can best be used to represent the data
- The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)
- Each data vector is a linear combination of the c principal component vectors
- Works for numeric data only
- Used when the number of dimensions is large
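A minimal numpy sketch, computing the c principal components via SVD of the centered data:

```python
# Sketch: project N x k data onto its top c principal components.
import numpy as np

def pca_reduce(X, c):
    Xc = X - X.mean(axis=0)                     # center each dimension
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:c].T                        # N x c reduced data

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                   # N = 100 vectors, k = 5
print(pca_reduce(X, 2).shape)                   # (100, 2)
```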
(figure: principal components Y1 and Y2 of data plotted on axes X1 and X2)
Numerosity Reduction

- Parametric methods
  - Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
  - Log-linear models: obtain the value at a point in m-D space as a product over appropriate marginal subspaces
- Non-parametric methods
  - Do not assume models
  - Major families: histograms, clustering, sampling
Regression and Log-Linear Models

- Linear regression: data are modeled to fit a straight line
  - Often uses the least-squares method to fit the line
- Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
- Log-linear model: approximates discrete multidimensional probability distributions
Regression Analysis and Log-Linear Models

- Linear regression: $Y = \alpha + \beta X$
  - The two parameters, $\alpha$ and $\beta$, specify the line and are estimated using the data at hand,
  - by applying the least-squares criterion to the known values of $Y_1, Y_2, \ldots$ and $X_1, X_2, \ldots$
- Multiple regression: $Y = b_0 + b_1 X_1 + b_2 X_2$
  - Many nonlinear functions can be transformed into the above
- Log-linear models:
  - The multi-way table of joint probabilities is approximated by a product of lower-order tables
  - Probability: $p(a, b, c, d) = \alpha_{ab}\,\beta_{ac}\,\chi_{ad}\,\delta_{bcd}$
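A minimal numpy sketch estimating alpha and beta for the simple linear case by least squares:

```python
# Sketch: least-squares estimates for Y = alpha + beta * X.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([1.9, 4.1, 6.0, 8.2, 9.9])

beta = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
alpha = Y.mean() - beta * X.mean()
print(alpha, beta)   # close to 0 and 2 for this toy data
```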
Histograms

- A popular data reduction technique
- Divide the data into buckets and store the average (sum) for each bucket
- Can be constructed optimally in one dimension using dynamic programming
- Related to quantization problems
(figure: example equal-width histogram over buckets 10,000 to 100,000, with bucket counts up to 40)
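A minimal numpy sketch of the reduction: keep only the bucket edges plus a count (or sum) per bucket:

```python
# Sketch: an equal-width histogram as a reduced data representation.
import numpy as np

rng = np.random.default_rng(1)
values = rng.integers(10_000, 100_000, size=1_000)

counts, edges = np.histogram(values, bins=10)               # counts per bucket
sums, _ = np.histogram(values, bins=edges, weights=values)  # sums per bucket
print(edges.astype(int))
print(counts, sums.astype(int))
```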
Clustering

- Partition the data set into clusters; then one can store the cluster representation only
- Can be very effective if the data is clustered, but not if the data is "smeared"
- Can use hierarchical clustering and be stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms, further detailed in Chapter 8
Sampling

- Allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data
- Choose a representative subset of the data
  - Simple random sampling may have very poor performance in the presence of skew
- Develop adaptive sampling methods
  - Stratified sampling:
    - Approximate the percentage of each class (or subpopulation of interest) in the overall database
    - Used in conjunction with skewed data
- Sampling may not reduce database I/Os (page at a time)
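A minimal pandas sketch contrasting simple random sampling with stratified sampling on a skewed class column (column names are illustrative):

```python
# Sketch: SRSWOR vs. stratified sampling on skewed data.
import pandas as pd

df = pd.DataFrame({
    "class": ["a"] * 90 + ["b"] * 10,   # skewed: 90% vs. 10%
    "value": range(100),
})

# Simple random sample without replacement (SRSWOR).
srs = df.sample(frac=0.2, replace=False, random_state=0)

# Stratified: preserve each class's share of the data.
strat = df.groupby("class", group_keys=False).apply(
    lambda g: g.sample(frac=0.2, random_state=0)
)
print(srs["class"].value_counts().to_dict())
print(strat["class"].value_counts().to_dict())
```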
Sampling (figure: SRSWOR, simple random sampling without replacement, and SRSWR drawn from the raw data)
Sampling (figure: raw data reduced to a cluster/stratified sample)
Hierarchical Reduction

- Use a multi-resolution structure with different degrees of reduction
- Hierarchical clustering is often performed but tends to define partitions of data sets rather than "clusters"
- Parametric methods are usually not amenable to hierarchical representation
- Hierarchical aggregation
  - An index tree hierarchically divides a data set into partitions by the value ranges of some attributes
  - Each partition can be considered as a bucket
  - Thus an index tree with aggregates stored at each node is a hierarchical histogram
Discretization

- Three types of attributes:
  - Nominal: values from an unordered set
  - Ordinal: values from an ordered set
  - Continuous: real numbers
- Discretization:
  - divide the range of a continuous attribute into intervals
  - Some classification algorithms only accept categorical attributes
  - Reduce data size by discretization
  - Prepare for further analysis
Discretization and Concept Hierarchy

- Discretization
  - reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals; interval labels can then be used to replace actual data values
- Concept hierarchies
  - reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) by higher-level concepts (such as young, middle-aged, or senior)
Discretization and Concept Hierarchy Generation for Numeric Data

- Binning (see the sections before)
- Histogram analysis (see the sections before)
- Clustering analysis (see the sections before)
- Entropy-based discretization
- Segmentation by natural partitioning
Entropy-Based Discretization

- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
  $E(S,T) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2)$
- The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
- The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g., $\mathrm{Ent}(S) - E(T,S) > \delta$
- Experiments show that it may reduce data size and improve classification accuracy
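A minimal sketch of the boundary search, scanning midpoints between consecutive sorted values:

```python
# Sketch: pick the boundary T minimizing E(S, T).
import math
from collections import Counter

def ent(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_boundary(values, labels):
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2      # candidate boundary
        s1 = [l for v, l in pairs if v <= t]
        s2 = [l for v, l in pairs if v > t]
        e = (len(s1) * ent(s1) + len(s2) * ent(s2)) / len(pairs)
        if e < best[1]:
            best = (t, e)
    return best

print(best_boundary([1, 2, 3, 10, 11, 12], list("aaabbb")))  # (6.5, 0.0)
```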
Segmentation "y natural
partitioning
3(B(F rule can "e used to segment numeric data into
relati#ely uniform$ :natural; inter#als.
A ,f an inter#al co#ers 3$ H$ O or D distinct #alues at
the most signi1cant digit$ partition the range into
3 equi(idth inter#als
A ,f it co#ers G$ B$ or C distinct #alues at the most
signi1cant digit$ partition the range into B inter#als
A ,f it co#ers E$ F$ or EM distinct #alues at the most
signi1cant digit$ partition the range into F inter#als
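A minimal sketch of one level of the rule, assuming the range endpoints are already rounded to the most significant digit:

```python
# Sketch: one level of the 3-4-5 rule on a rounded [low, high) range.
def intervals_345(low, high, msd):
    distinct = round((high - low) / msd)   # distinct msd values covered
    if distinct in (3, 6, 7, 9):
        n = 3
    elif distinct in (2, 4, 8):
        n = 4
    elif distinct in (1, 5, 10):
        n = 5
    else:
        raise ValueError("range not covered by the 3-4-5 rule")
    width = (high - low) / n
    return [(low + i * width, low + (i + 1) * width) for i in range(n)]

print(intervals_345(-1_000, 2_000, msd=1_000))
# -> [(-1000.0, 0.0), (0.0, 1000.0), (1000.0, 2000.0)]
```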
Example of the 3-4-5 Rule

- Step 1: profile the profit data: Min = -$351, Low (5th percentile) = -$159, High (95th percentile) = $1,838, Max = $4,700
- Step 2: most significant digit msd = 1,000; round to Low' = -$1,000 and High' = $2,000
- Step 3: (-$1,000 - $2,000) covers 3 distinct values at the msd, so partition into 3 intervals: (-$1,000 - $0], ($0 - $1,000], ($1,000 - $2,000]
- Step 4: adjust to the actual Min/Max, giving (-$400 - $5,000), and recurse on each interval:
  - (-$400 - $0]: 4 intervals: (-$400 - -$300], (-$300 - -$200], (-$200 - -$100], (-$100 - $0]
  - ($0 - $1,000]: 5 intervals: ($0 - $200], ($200 - $400], ($400 - $600], ($600 - $800], ($800 - $1,000]
  - ($1,000 - $2,000]: 5 intervals: ($1,000 - $1,200], ($1,200 - $1,400], ($1,400 - $1,600], ($1,600 - $1,800], ($1,800 - $2,000]
  - ($2,000 - $5,000]: 3 intervals: ($2,000 - $3,000], ($3,000 - $4,000], ($4,000 - $5,000]
Concept Hierarchy Generation for Categorical Data

- Specification of a partial ordering of attributes explicitly at the schema level by users or experts
- Specification of a portion of a hierarchy by explicit data grouping
- Specification of a set of attributes, but not of their partial ordering
- Specification of only a partial set of attributes
Specification of a Set of Attributes

A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy:
- country: 15 distinct values
- province_or_state: 365 distinct values
- city: 3,567 distinct values
- street: 674,339 distinct values
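A minimal pandas sketch of the heuristic: order the attributes by their distinct-value counts, fewest at the top:

```python
# Sketch: derive a concept-hierarchy ordering from distinct-value counts.
import pandas as pd

df = pd.DataFrame({
    "country":           ["US", "US", "CA", "CA"],
    "province_or_state": ["NY", "CA", "ON", "BC"],
    "city":              ["NYC", "LA", "Toronto", "Vancouver"],
    "street":            ["5th Ave", "Sunset Blvd", "King St", "Main St"],
})

hierarchy = df.nunique().sort_values().index.tolist()
print(hierarchy)   # top (fewest distinct values) to bottom (most)
```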
Summary

- Data preparation is a big issue for both warehousing and mining
- Data preparation includes
  - Data cleaning and data integration
  - Data reduction and feature selection
  - Discretization
- A lot of methods have been developed, but data preparation is still an active area of research