
Data Preprocessing

■ Data cleaning
■ Data integration and transformation
■ Data reduction
■ Concept Hierarchy Generation
■ Discretization
August 26, 2020 Data Mining: Data Preprocessing 1
Data Reduction
■ Data reduction
⚪ Obtains a reduced representation of the data
set that is much smaller in volume yet
produces the same (or almost the same)
analytical results
■ Data reduction strategies
⚪ Dimensionality reduction
⚪ Numerosity reduction
⚪ Concept hierarchy generation
⚪ Discretization
Dimensionality Reduction

■ Feature selection (i.e., attribute subset selection)
⚪ Reduce the number of attributes appearing in the discovered patterns, making them easier to understand
⚪ Remove features with missing values
⚪ Remove features with low variance
⚪ Remove highly correlated features
⚪ Univariate feature selection
■ Heuristic methods:
⚪ Decision-Tree induction
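
The two filter-style bullets above (low variance, highly correlated features) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the thresholds, and the toy columns A1 to A4 are hypothetical and do not come from the slides.

```python
from math import sqrt
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def filter_features(columns, min_variance=0.01, max_abs_corr=0.95):
    """Drop low-variance columns, then one of each highly correlated pair.

    `columns` maps a feature name to its list of values (assumed layout).
    Both thresholds are illustrative, not recommendations.
    """
    # Step 1: keep features whose population variance exceeds the threshold.
    kept = [name for name, vals in columns.items() if pvariance(vals) > min_variance]
    # Step 2: greedily keep a feature only if it is not too strongly
    # correlated with a feature that was already selected.
    selected = []
    for name in kept:
        if all(abs(pearson(columns[name], columns[s])) <= max_abs_corr for s in selected):
            selected.append(name)
    return selected


data = {
    "A1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "A2": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated with A1
    "A3": [7.0, 7.0, 7.0, 7.0, 7.0],   # zero variance
    "A4": [5.0, 1.0, 4.0, 2.0, 3.0],
}
print(filter_features(data))
```

The greedy pass keeps whichever member of a correlated pair comes first, so the result depends on column order.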
Example of Decision Tree Induction

Initial attribute set:

{A1, A2, A3, A4, A5, A6}

A4?
  A1?
    Class 1
    Class 2
  A6?
    Class 1
    Class 2

Reduced attribute set: {A1, A4, A6}

Data Compression
■ String compression
⚪ There are extensive theories and well-tuned algorithms
⚪ Typically lossless
⚪ But only limited manipulation is possible without expansion
■ Audio/video compression
⚪ Typically lossy compression, with progressive refinement
⚪ Sometimes small fragments of signal can be
reconstructed without reconstructing the whole


[Figure: data compression diagram (original data and its reconstruction)]


Numerosity Reduction

■ Parametric methods
⚪ Assume the data fits some model, estimate
model parameters, store only the parameters,
and discard the data (except possible outliers)
⚪ Example: Regression

■ Non-parametric methods
⚪ Do not assume models
⚪ Major families: histograms, clustering, sampling
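
As a rough sketch of the parametric idea above, the snippet below fits a straight line by ordinary least squares and keeps only the two parameters plus any points that deviate strongly. The function names, the `outlier_tol` parameter, and the sample points are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = w * x + b; returns (w, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


def reduce_with_regression(xs, ys, outlier_tol):
    """Store only the model parameters and the points the model fits badly."""
    w, b = fit_line(xs, ys)
    outliers = [(x, y) for x, y in zip(xs, ys) if abs(y - (w * x + b)) > outlier_tol]
    return {"w": w, "b": b, "outliers": outliers}


xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]  # these points lie exactly on y = 2x + 1
model = reduce_with_regression(xs, ys, outlier_tol=0.5)
print(model)
```

Five (x, y) pairs are replaced by two numbers here; on real data the saving depends on how well the assumed model fits.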

Histograms

■ A popular data reduction technique

■ Divide data into buckets
and store the average or
sum for each bucket (bin).


Types of Buckets:
■ Singleton: Each bucket represents one
price-value/frequency pair.
■ Equi-width: The width of each bucket range is uniform.
■ Equi-depth: The buckets are created so that,
roughly, the frequency of each bucket is constant.
■ MaxDiff: A bucket boundary is established
between each pair of adjacent values for the pairs having the β - 1
largest differences, where β is user specified.
■ Etc.

[Figures: a histogram using singleton buckets and a histogram using equi-width buckets]



The following data are a list of prices of commonly sold items at
AllElectronics (rounded to the nearest dollar). The numbers have been sorted:
1, 1, 4, 4, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15,
15, 18, 18, 18, 18, 18, 19, 19, 19, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22,
22, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30

How are the buckets determined and the attribute values partitioned?
a. Equi-width (Equal Width) Histogram, Number of buckets = 5
b. Equi-depth (Equal Depth) Histogram, Number of buckets = 5
c. MaxDiff Histogram, β = 3
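
One possible way to work through the three cases is sketched below. The helper names are hypothetical, and two conventions are assumptions rather than something the slides fix: equi-width buckets use width (max - min) / k, and MaxDiff ties are broken in favour of the earlier pair. For this price list five adjacent pairs share the largest difference (3), so the MaxDiff result depends on that tie-breaking choice.

```python
from collections import Counter

prices = [1, 1, 4, 4, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 19, 19, 19,
          20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22,
          25, 25, 25, 25, 25, 28, 28, 30, 30, 30]


def equi_width(values, k):
    """k buckets of equal width; returns (lower, upper, count) per bucket."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        idx = min(int((v - lo) / width), k - 1)  # clamp the maximum into the last bucket
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]


def equi_depth(values, k):
    """k buckets with roughly equal counts; equal values may straddle a cut."""
    s = sorted(values)
    n = len(s)
    cuts = [i * n // k for i in range(k + 1)]
    return [(s[a], s[b - 1], b - a) for a, b in zip(cuts, cuts[1:])]


def max_diff(values, beta):
    """Cut between the beta - 1 adjacent distinct values that differ most."""
    distinct = sorted(set(values))
    freq = Counter(values)
    gaps = [(distinct[i + 1] - distinct[i], i) for i in range(len(distinct) - 1)]
    # Largest gaps first; ties go to the earlier pair (an assumption).
    largest = sorted(gaps, key=lambda g: (-g[0], g[1]))[: beta - 1]
    cut_after = sorted(i for _, i in largest)
    buckets, start = [], 0
    for i in cut_after + [len(distinct) - 1]:
        group = distinct[start : i + 1]
        buckets.append((group[0], group[-1], sum(freq[v] for v in group)))
        start = i + 1
    return buckets


print([c for _, _, c in equi_width(prices, 5)])
print(equi_depth(prices, 5))
print(max_diff(prices, 3))
```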


Sampling

■ Allow a mining algorithm to run in complexity
that is potentially sub-linear to the size of the data
■ Choose a representative subset of the data
⚪ Simple random sampling may have very poor performance in the presence of skew
■ Develop adaptive sampling methods
⚪ Stratified Sampling
■ Approximate the percentage of each class (or
subpopulation of interest) in the overall database
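
The sampling schemes named above can be sketched with the standard `random` module. The function names, the 10% fraction, and the toy records are assumptions; the stratified version simply draws the same fraction from every stratum so that class proportions are approximately preserved.

```python
import random
from collections import defaultdict


def srswor(data, n, rng):
    """Simple random sample WITHOUT replacement: no item is drawn twice."""
    return rng.sample(data, n)


def srswr(data, n, rng):
    """Simple random sample WITH replacement: an item may be drawn repeatedly."""
    return [rng.choice(data) for _ in range(n)]


def stratified_sample(records, key, fraction, rng):
    """Draw the same fraction from each stratum defined by key(record)."""
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    sample = []
    for group in strata.values():
        size = max(1, round(len(group) * fraction))  # keep at least one per stratum
        sample.extend(rng.sample(group, size))
    return sample


rng = random.Random(42)  # fixed seed so the sketch is repeatable
raw = list(range(1, 101))
without_repl = srswor(raw, 8, rng)
with_repl = srswr(raw, 8, rng)

# Skewed toy data: 80 "young" records and 20 "senior" records.
records = [("young", i) for i in range(80)] + [("senior", i) for i in range(20)]
strat = stratified_sample(records, key=lambda r: r[0], fraction=0.1, rng=rng)
print(without_repl, with_repl, strat, sep="\n")
```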
Simple Random Sampling

[Figure: raw data sampled by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement)]
Stratified Sampling



Clustering
■ Partition the data set into clusters, and store only the
cluster representation
■ Can be very effective if the data is clustered, but not if
the data is “smeared”
■ Can have hierarchical clustering and be stored in
multi-dimensional index tree structures
■ There are many choices of clustering definitions and
clustering algorithms.
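
A minimal sketch of storing only a cluster representation is shown below. It assumes the cluster labels have already been produced by some clustering algorithm, and it picks centroid, radius, and member count as the stored summary; that choice of summary and the function name are illustrative assumptions.

```python
from collections import defaultdict
from math import dist


def cluster_summaries(points, labels):
    """Replace each cluster's points with a centroid, a radius, and a count."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    summaries = {}
    for label, members in groups.items():
        dims = len(members[0])
        centroid = tuple(sum(m[d] for m in members) / len(members) for d in range(dims))
        radius = max(dist(centroid, m) for m in members)  # farthest member from the centroid
        summaries[label] = {"centroid": centroid, "radius": radius, "count": len(members)}
    return summaries


points = [(0, 0), (2, 0), (10, 10), (10, 12)]
labels = [0, 0, 1, 1]  # assumed output of a clustering step
summary = cluster_summaries(points, labels)
print(summary)
```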
Cluster Sampling
[Figure: raw data and stratified sample]


Hierarchical Reduction
■ Hierarchical clustering is often performed
■ Parametric methods are usually not
amenable to hierarchical representation
■ Hierarchical aggregation
⚪ An index tree hierarchically divides a data set
into partitions by value range of some attributes.
⚪ Each partition can be considered as a bucket.
Hierarchical Reduction

[Figure: A concept hierarchy for the attribute price.]


Data Preprocessing

■ Data cleaning
■ Data integration and transformation
■ Data reduction
■ Concept Hierarchy Generation
■ Discretization


Concept hierarchy and discretization
■ Concept hierarchies
⚪ reduce the data by collecting and replacing low
level concepts (such as numeric values for the
attribute age) by higher level concepts (such as
young, middle-aged, or senior).
■ Discretization
⚪ reduce the number of values for a given
continuous attribute by dividing the range of the
attribute into intervals. Interval labels can then be
used to replace actual data values.
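
Both ideas amount to replacing a numeric value with a label. A small sketch for the age example follows; the cut points 30 and 60 are assumptions chosen for illustration, since the slides do not say where young, middle-aged, and senior begin.

```python
from bisect import bisect_right


def discretize(value, cut_points, labels):
    """Map a numeric value to the label of the interval it falls into.

    cut_points must be sorted; labels needs len(cut_points) + 1 entries.
    A value equal to a cut point goes to the interval on its right.
    """
    return labels[bisect_right(cut_points, value)]


age_cuts = [30, 60]  # assumed boundaries, not taken from the slides
age_labels = ["young", "middle-aged", "senior"]

ages = [25, 30, 45, 60, 72]
print([discretize(a, age_cuts, age_labels) for a in ages])
```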
Discretization and concept hierarchy
generation for numeric data
■ Binning

■ Histogram analysis

■ Clustering analysis

■ Concept hierarchy generation

■ Entropy-based discretization

■ Segmentation by natural partitioning

Specification of a set of attributes
Concept hierarchy can be automatically generated
based on the number of distinct values per attribute in
the given attribute set. The attribute with the most
distinct values is placed at the lowest level of the hierarchy.
country 15 distinct values

Province or state 65 distinct values

city 3567 distinct values

street 674,339 distinct values
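
The ordering heuristic above can be sketched in a few lines, using the distinct-value counts listed on the slide. The function names and the `province_or_state` key spelling are illustrative assumptions.

```python
def order_hierarchy(distinct_counts):
    """Order attributes from the top level (fewest distinct values) to the lowest."""
    return sorted(distinct_counts, key=distinct_counts.get)


def count_distinct(records, attributes):
    """Count distinct values per attribute from a list of dict records."""
    return {a: len({r[a] for r in records}) for a in attributes}


counts = {"street": 674339, "country": 15, "city": 3567, "province_or_state": 65}
hierarchy = order_hierarchy(counts)
print(" < ".join(reversed(hierarchy)))
```

The heuristic is purely count-based, so an attribute set where a higher-level concept happens to have more distinct values than a lower-level one would be ordered incorrectly and would need manual adjustment.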

Concept hierarchy generation for
categorical data (Example)
A concept hierarchy for the location.


Concept hierarchy generation for
categorical data (Example)
A concept hierarchy for the location, based on language.



■ Four types of attributes:

⚪ Nominal
⚪ Ordinal
⚪ Interval
⚪ Ratio
■ Discretization:
▪ divide the range of a continuous attribute into intervals
■ Some classification algorithms only accept categorical attributes
⚪ Reduce data size
⚪ Prepare for further analysis


Four types of attributes:


Attribute Types
■ Nominal:
⚪ This scale is made up of the list of possible values that a variable
may take.
⚪ The order of these values has no meaning.
■ Ordinal:
⚪ This scale describes a variable whose values are ordered.
⚪ The difference between the values does not describe the magnitude
of the actual difference.
■ Interval:
⚪ Scales that describe values where the interval between the values
has meaning.
■ Ratio:
⚪ Scales that describe variables where the same difference between
values has the same meaning as in interval, but where a doubling,
tripling, etc. of the values implies a doubling, tripling, etc. of the
measurement.


Types of Attributes
■ Binary: true/false, yes/no, +/-, etc.
■ Nominal: ID number, eye color, zip codes, etc.
■ Ordinal: rankings (e.g., taste of potato chips
on a scale from 1-10), grades, height in {tall,
medium, short}, etc.
■ Interval: calendar dates, temperatures in
Celsius or Fahrenheit, etc. (e.g.
Fahrenheit scale: 5°F, 10°F, 15°F. 10°F is not
twice as hot as 5°F.)
■ Ratio: age, bank account balance, tax amount, etc. (e.g.
bank balance: $5, $10, $15. $10 is twice as much as $5.)

Concept hierarchy generation for
categorical data

■ Specification of a partial ordering of
attributes explicitly at the schema level
by users or experts
■ Specification of a portion of a hierarchy
by explicit data grouping
■ Specification of only a partial set of attributes
Methods for splitting the records
■ Depends on attribute types
⚪ Binary: true / false, yes/no, +/-, etc.
⚪ Nominal: ID number, eye color, zip codes, etc.
⚪ Ordinal: rankings (e.g., taste of potato chips on a
scale from 1-10), grades, height in {tall, medium,
short}, etc.
⚪ Continuous/Ratio: calendar dates, temperatures
in Celsius or Fahrenheit, age, etc.
■ Depends on number of ways to split
⚪ 2-way split (Binary split)
⚪ Multi-way split
26-Aug-20 Data Mining: Classification 31
Splitting based on Nominal Attributes

■ Each partition has a subset of values signifying it
⚪ Multi-way split: Use as many partitions as distinct values.
⚪ Binary split: Divides values into two subsets. Need to find
optimal partitioning.
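
To illustrate why a binary split on a nominal attribute needs a search, the sketch below enumerates every way to divide k distinct values into two non-empty subsets (there are 2^(k-1) - 1 of them). The function name and the eye-color values are illustrative assumptions; choosing the best candidate would additionally require a split-quality measure.

```python
from itertools import combinations


def binary_splits_nominal(values):
    """All ways to split distinct nominal values into two non-empty subsets."""
    distinct = sorted(set(values))
    first, rest = distinct[0], distinct[1:]
    splits = []
    # Pin `first` to the left side so each partition is listed only once,
    # and always leave at least one value for the right side.
    for r in range(len(rest)):
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(distinct) - left
            splits.append((left, right))
    return splits


eye_colors = ["blue", "brown", "green"]
for left, right in binary_splits_nominal(eye_colors):
    print(sorted(left), "|", sorted(right))
```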
Splitting based on Ordinal Attributes
■ Multi-way split:
⚪ Use as many partitions as distinct values
■ Binary split:
⚪ Divides values into two subsets
⚪ Need to find optimal partitioning
⚪ Preserve order property among attribute values
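
For an ordinal attribute, preserving the order means a binary split can only cut the ordered list at one position, which leaves k - 1 candidates. A brief sketch, with a hypothetical function name and the height values used as an example:

```python
def binary_splits_ordinal(ordered_values):
    """Order-preserving binary splits: cut the ordered list at each position."""
    return [(ordered_values[:i], ordered_values[i:]) for i in range(1, len(ordered_values))]


heights = ["short", "medium", "tall"]  # listed from lowest to highest
for low_side, high_side in binary_splits_ordinal(heights):
    print(low_side, "|", high_side)
```

A grouping such as {short, tall} versus {medium} never appears, because it would break the order of the values.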
Splitting based on Continuous Attributes
■ Different ways of handling
⚪ Discretization to form an ordinal categorical attribute
■ Static – discretize once at the beginning
■ Dynamic – ranges can be found by equal interval bucketing,
equal frequency bucketing (percentiles), clustering, etc.
⚪ Binary Decision: (A < v) or (A ≥ v)
■ Consider all possible splits and find the best cut
■ Can be more compute intensive
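
The binary-decision approach can be sketched as an exhaustive search over candidate cut points. The slides do not say how a cut is scored, so this sketch assumes weighted Gini impurity as the measure; the function names and the sample values are also assumptions.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_cut(values, labels):
    """Try a cut v between each pair of adjacent distinct values of A.

    Records with A < v go left and records with A >= v go right.
    Returns (v, weighted impurity) for the cut with the lowest impurity.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no cut point between equal values
        v = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (v, score)
    return best


values = [60, 70, 75, 85, 90, 95]
classes = ["No", "No", "No", "Yes", "Yes", "Yes"]
print(best_cut(values, classes))
```

Recomputing the class counts for every candidate makes this quadratic in the number of records, which is one reason the approach can be compute intensive.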


Segmentation by natural partitioning

The 3-4-5 rule can be used to segment numeric data into
relatively uniform, “natural” intervals.
* If an interval covers 3, 6, 7 or 9 distinct values at the
most significant digit, partition the range into three
intervals: equal-width for 3, 6, and 9, and
in the grouping 2-3-2 for 7
* If it covers 2, 4, or 8 distinct values at the most
significant digit, partition the range into four equal-width intervals
* If it covers 1, 5, or 10 distinct values at the most
significant digit, partition the range into five equal-width intervals
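
The three cases of the rule can be sketched as a single partitioning step. The function name and signature are assumptions: `low` and `high` are taken to be already rounded to multiples of `msd`, the place value of the most significant digit, and each returned pair (l, r) stands for the interval (l, r].

```python
def partition_345(low, high, msd):
    """Split (low, high] into 3, 4, or 5 intervals following the 3-4-5 rule."""
    n = round((high - low) / msd)  # distinct values at the most significant digit
    if n == 7:
        # Special case: three intervals in the grouping 2-3-2.
        edges = [low, low + 2 * msd, low + 5 * msd, high]
    else:
        if n in (3, 6, 9):
            parts = 3
        elif n in (2, 4, 8):
            parts = 4
        elif n in (1, 5, 10):
            parts = 5
        else:
            raise ValueError(f"3-4-5 rule does not cover {n} distinct values")
        step = (high - low) / parts
        edges = [low + i * step for i in range(parts + 1)]
    return list(zip(edges[:-1], edges[1:]))


print(partition_345(-1_000_000, 2_000_000, 1_000_000))  # 3 values -> 3 intervals
print(partition_345(0, 1_000_000, 100_000))             # 10 values -> 5 intervals
print(partition_345(0, 7, 1))                           # 7 values -> 2-3-2 grouping
```

Applying the function again to each returned interval, with a smaller `msd`, gives the next lower level of the hierarchy.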
Example of 3-4-5 rule
■ Suppose that profits at different branches of AllElectronics for the
year 1997 cover a wide range, from -$351,976.00 to $4,700,896.50.
■ A user wishes to have a concept hierarchy for profit automatically generated.
⚪ For improved readability, we use the notation (l…r] to represent
the interval (l, r]. For example, (-$1,000,000…$0] denotes the
range from -$1,000,000 (exclusive) to $0 (inclusive).
■ Suppose that the data within the 5%-tile and 95%-tile are
between -$159,876 and $1,838,761. The results of applying the
3-4-5 rule are shown in the next slide.


Example of 3-4-5 rule (5 Steps)
(all values in thousands of dollars)

Step 1: Min = -$351, Low (i.e., 5%-tile) = -$159, High (i.e., 95%-tile) = $1,838, Max = $4,700

Step 2: msd = 1,000, Low = -$1,000, High = $2,000

Step 3: (-$1,000 - $2,000) is split into (-$1,000 - 0), (0 - $1,000), ($1,000 - $2,000)

Step 4: (-$400 - $5,000) is split into (-$400 - 0), (0 - $1,000), ($1,000 - $2,000), ($2,000 - $5,000)

Step 5:
(-$400 - 0): (-$400 - -$300), (-$300 - -$200), (-$200 - -$100), (-$100 - 0)
(0 - $1,000): (0 - $200), ($200 - $400), ($400 - $600), ($600 - $800), ($800 - $1,000)
($1,000 - $2,000): ($1,000 - $1,200), ($1,200 - $1,400), ($1,400 - $1,600), ($1,600 - $1,800), ($1,800 - $2,000)
($2,000 - $5,000): ($2,000 - $3,000), ($3,000 - $4,000), ($4,000 - $5,000)
Example of 3-4-5 rule
Step 1: Based on the information, the minimum and maximum values are: MIN
= -$351,976.00, and MAX = $4,700,896.50. The low (5%-tile) and
high (95%-tile) values to be considered for the top or first level of
segmentation are: LOW = -$159,876.00 and HIGH = $1,838,761.00

Step 1 (in thousands of dollars): Min = -$351, Low (i.e., 5%-tile) = -$159, High (i.e., 95%-tile) = $1,838, Max = $4,700
Step 2: Given LOW and HIGH, the most significant digit is at the million dollar
digit position (i.e., msd = 1,000,000). Rounding LOW down to the
million dollar digit, we get LOW = -$1,000,000 and rounding HIGH
up to the million dollar digit, we get HIGH = +$2,000,000.
Step 2: msd=1,000 Low = -$1,000 High = $2,000
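
The rounding in Step 2 can be checked with a short sketch. The helper names are hypothetical, and taking the larger of the two place values as msd is an assumption that matches the million-dollar position used in the example.

```python
import math


def msd_unit(value):
    """Place value of the most significant digit, e.g. 1,838,761 -> 1,000,000."""
    return 10 ** math.floor(math.log10(abs(value)))


def round_down(value, unit):
    """Round toward negative infinity to a multiple of unit."""
    return math.floor(value / unit) * unit


def round_up(value, unit):
    """Round toward positive infinity to a multiple of unit."""
    return math.ceil(value / unit) * unit


LOW, HIGH = -159_876, 1_838_761
msd = max(msd_unit(LOW), msd_unit(HIGH))
low_rounded = round_down(LOW, msd)
high_rounded = round_up(HIGH, msd)
print(msd, low_rounded, high_rounded)
```

The same two helpers apply when MIN and MAX are rounded at their own most significant digit positions.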
Example of 3-4-5 rule

Step 3: Since this interval ranges over 3 distinct values at the most significant
digit, i.e., (2,000,000 - (-1,000,000)) / 1,000,000 = 3, the segment is
partitioned into 3 equi-width subsegments according to the 3-4-5 rule:
(-$1,000,000…$0], ($0…$1,000,000], and ($1,000,000…$2,000,000].
This represents the top tier of the hierarchy.

Step 3: (-$1,000 - $2,000]

(-$1,000 – 0] (0 - $ 1,000] ($1,000 - $2,000]

Example of 3-4-5 rule
Step 4: We now examine the MIN and MAX values to see how they “fit" into the
first level partitions.
• Since the first interval, (-$1,000,000…$0] covers the MIN value, i.e.,
LOW < MIN, we can adjust the left boundary of this interval to make
the interval smaller. The most significant digit of MIN is at the hundred thousand digit position (msd = 100,000).
Rounding MIN down to this position, we get MIN = -$400,000.
Therefore, the first interval is redefined as (-$400,000…$0].
• Since the last interval, ($1,000,000…$2,000,000] does not cover the
MAX value, i.e., MAX > HIGH, we need to create a new interval to
cover it. Rounding up MAX at its most significant digit position, the new
interval is ($2,000,000…$5,000,000].
Hence, the topmost level of the hierarchy contains four partitions,
(-$400,000…$0], ($0…$1,000,000], ($1,000,000…$2,000,000], and ($2,000,000…$5,000,000].

Step 4 (in thousands of dollars): (-$400 - $5,000]

(-$400 - $0] ($0 - $1,000] ($1,000 - $2,000] ($2,000 - $5,000]
Example of 3-4-5 rule
Step 5: Recursively, each interval can be further partitioned according to the 3-4-5 rule to form the next lower
level of the hierarchy:
• The first interval (-$400,000…$0] is partitioned into 4 sub-intervals: (-$400,000…-$300,000],
(-$300,000…-$200,000], (-$200,000…-$100,000], and (-$100,000…$0].
• The second interval, ($0…$1,000,000], is partitioned into 5 sub-intervals: ($0…$200,000],
($200,000…$400,000], ($400,000…$600,000], ($600,000…$800,000], and ($800,000…$1,000,000].
• The third interval, ($1,000,000…$2,000,000], is partitioned into 5 sub-intervals: ($1,000,000…
$1,200,000], ($1,200,000…$1,400,000], ($1,400,000…$1,600,000], ($1,600,000…$1,800,000], and
($1,800,000…$2,000,000].
• The last interval, ($2,000,000…$5,000,000], is partitioned into 3 sub-intervals: ($2,000,000…
$3,000,000], ($3,000,000…$4,000,000], and ($4,000,000…$5,000,000].

Step 5 (in thousands of dollars): (-$400 - $5,000]

(-$400 - $0]: (-$400 - -$300], (-$300 - -$200], (-$200 - -$100], (-$100 - $0]
($0 - $1,000]: ($0 - $200], ($200 - $400], ($400 - $600], ($600 - $800], ($800 - $1,000]
($1,000 - $2,000]: ($1,000 - $1,200], ($1,200 - $1,400], ($1,400 - $1,600], ($1,600 - $1,800], ($1,800 - $2,000]
($2,000 - $5,000]: ($2,000 - $3,000], ($3,000 - $4,000], ($4,000 - $5,000]

Similarly, the 3-4-5 rule can be carried on iteratively at deeper levels, as necessary.
