
Data Preprocessing

Why Data Preprocessing?


Data in the real world is dirty
incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
e.g. occupation=“ ”
noisy: containing errors or outliers, e.g. Salary=“-10”
inconsistent: containing discrepancies in codes or names
e.g. Age=“42”, Birthday=“03/07/1997”
e.g. rating was “1, 2, 3”, now rating is “A, B, C”
e.g. discrepancy between duplicate records
Why Is Data Dirty?
 Incomplete data may come from
“Not applicable” data value when collected
Different considerations between the time when the data was collected
and when it is analyzed.
Human/hardware/software problems
 Noisy data (incorrect values) may come from
Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission
 Inconsistent data may come from
Different data sources
Functional dependency violation (e.g., modify some linked data)
 Duplicate records also need data cleaning
Why is Data Preprocessing important?
No quality data, no quality mining results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or
even misleading statistics.
Data warehouse needs consistent integration of quality data

Data extraction, cleaning, and transformation comprise the
majority of the work of building a data warehouse
Measures for Data Quality: A Multidimensional View
Accuracy: correct or wrong, accurate or not
Completeness: not recorded, unavailable, …
Consistency: some modified but some not, dangling, …
Timeliness: timely update?
Believability: how trustworthy are the data; are they correct?
Interpretability: how easily can the data be understood?
Value added
Accessibility
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains a reduced representation of the data, much smaller in
volume, that produces the same or similar analytical results
Data discretization
Part of data reduction but with particular importance,
especially for numerical data
Data Cleaning
Importance
“Data cleaning is the number one problem in data
warehousing”

Data cleaning tasks – this routine attempts to


Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Missing Data
Data is not always available
E.g., many tuples have no recorded values for several
attributes, such as customer income in sales data

Missing data may be due to


Equipment malfunction
Inconsistent with other recorded data and thus deleted
Data not entered due to misunderstanding
Certain data may not be considered important at the time
of entry
No registered history or changes of the data
Missing data may need to be inferred.
How to Handle Missing Data
1. Ignore the tuple
2. Fill in missing values manually: tedious + infeasible
3. Fill in automatically with a global constant.
4. Fill in with the attribute mean
5. Fill in with the attribute mean for all samples
belonging to the same class as the given tuple
6. Fill in with the most probable value determined with
regression, inference-based such as Bayesian
formula, decision tree.
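A minimal pandas sketch of options 3–5 above; the data, column names, and sentinel value are hypothetical, not taken from the slides:

```python
import pandas as pd
import numpy as np

# Hypothetical sales data with missing customer income values
df = pd.DataFrame({
    "income": [30000, np.nan, 52000, np.nan, 61000, 45000],
    "class":  ["budget", "budget", "premium", "premium", "premium", "budget"],
})

# Option 3: fill with a global constant (a sentinel such as 0 or "unknown")
filled_const = df["income"].fillna(0)

# Option 4: fill with the attribute mean
filled_mean = df["income"].fillna(df["income"].mean())

# Option 5: fill with the mean of samples belonging to the same class
filled_class_mean = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)

print(filled_class_mean)
```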
Noisy Data
Noise: random error or variance in a measured
variable.

Incorrect attribute values may be due to


faulty data collection instruments
data entry problems
data transmission problems

Other data problems that require data cleaning


duplicate records
incomplete data
inconsistent data
How to Handle Noisy Data
 Binning method:
first sort data and partition into (equi-depth) bins
then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
 Clustering
Similar values are organized into groups (clusters).
Values that fall outside of clusters are considered outliers.
 Combined computer and human inspection
detect suspicious values and check by human (e.g., deal with
possible outliers)
 Regression
Data can be smoothed by fitting it to a function, e.g. using
linear or multiple linear regression.
Discretization Method: Binning
 Equal-width (distance) partitioning
Divides the range into N intervals of equal size: uniform grid
if A and B are the lowest and highest values of the attribute,
the width of intervals will be: W = (B –A)/N.
The most straightforward, but outliers may dominate
presentation
Skewed data is not handled well

 Equal-depth (frequency) partitioning


Divides the range into N intervals, each containing
approximately the same number of samples
Good data scaling
Managing categorical attributes can be tricky
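A small pandas sketch of the two partitioning schemes; the price values are the ones reused in the worked example on the next slide, and the choice of N = 3 is illustrative:

```python
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-width: N = 3 intervals of width W = (B - A) / N = (34 - 4) / 3 = 10
equal_width = pd.cut(prices, bins=3)

# Equal-depth (frequency): N = 3 intervals, each holding ~3 of the 9 samples
equal_depth = pd.qcut(prices, q=3)

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())
```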
Binning Methods for Data Smoothing
Sorted data for price : 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equi-depth) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
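A plain-Python sketch that reproduces the smoothing above; the tie-breaking rule for boundary smoothing (ties go to the lower boundary) is an assumption, since the slide's example contains no ties:

```python
# Equi-depth bins from the sorted price data above
bins = [[4, 8, 15], [21, 21, 24], [25, 28, 34]]

# Smoothing by bin means: every value in a bin is replaced by the bin mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
# -> [[9, 9, 9], [22, 22, 22], [29, 29, 29]]

# Smoothing by bin boundaries: each value is replaced by the closer of the
# bin's min and max (ties go to the lower boundary here)
by_boundaries = [
    [b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
    for b in bins
]
# -> [[4, 4, 15], [21, 21, 24], [25, 25, 34]]

print(by_means)
print(by_boundaries)
```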
Data Integration
 Data integration:
Combines data from multiple sources (data cubes, multiple
databases, or flat files)
 Issues during data integration
Schema integration
Integrate metadata (about the data) from different sources
Entity identification problem: identify real world entities from
multiple data sources
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different
sources are different, e.g., different scales, metric vs. British
units
Removing duplicates and redundant data
An attribute may be derivable from another table (e.g., annual revenue)
Inconsistencies in attribute naming
Data Transformation
Smoothing: remove noise from data (binning, clustering,
regression)
Normalization: scaled to fall within a small, specified
range such as –1.0 to 1.0 or 0.0 to 1.0
Attribute/feature construction
New attributes constructed / added from the given ones
Aggregation: summarization or aggregation operations
apply to data
Generalization: concept hierarchy climbing
Low-level/primitive/raw data are replaced by higher-level
concepts
Data Transformation: Normalization
Useful for classification algorithms involving
Neural networks
Distance measurements (nearest neighbor)
Backpropagation algorithm (NN) – normalizing helps to
speed up the learning phase
Distance-based methods – normalization prevents attributes
with an initially large range (e.g., income) from outweighing
attributes with initially smaller ranges (e.g., binary attributes)
Data Transformation: Normalization
Min-max normalization: to [new_min_A, new_max_A]
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
Ex. Let income range from 12,000 to 98,000 be normalized to [0.0, 1.0].
Then 73,600 is mapped to
(73,600 - 12,000) / (98,000 - 12,000) * (1.0 - 0) + 0 = 0.716
Z-score normalization (μ: mean, σ: standard deviation):
v' = (v - μ_A) / σ_A
Ex. Let μ = 54,000, σ = 16,000. Then (73,600 - 54,000) / 16,000 = 1.225
Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
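A minimal Python sketch of the three normalization methods; the decimal-scaling example value (986, giving j = 3) is hypothetical and not from the slide:

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Z-score normalization."""
    return (v - mean_a) / std_a

def decimal_scaling(v, j):
    """Decimal scaling; j is the smallest integer such that max(|v'|) < 1."""
    return v / 10 ** j

print(min_max(73_600, 12_000, 98_000))   # ~0.716
print(z_score(73_600, 54_000, 16_000))   # 1.225
print(decimal_scaling(986, 3))           # 0.986 (values up to 986 -> j = 3)
```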
Data Reduction Strategies
 The full data set may be too big to work with – analysis may
take too long, or be impractical or infeasible
 Data reduction techniques
 Obtain a reduced representation of the data set that is much smaller in
volume but yet produce the same (or almost the same) analytical
results
 Data reduction strategies
 Data cube aggregation – apply aggregation operations (data cube)
 Dimensionality reduction—remove unimportant attributes
 Data compression – encoding mechanism used to reduce data size
 Numerosity reduction – data replaced or estimated by alternative,
smaller representations: parametric models (store the model
parameters instead of the actual data) or non-parametric methods
(clustering, sampling, histograms)
 Discretization and concept hierarchy generation – replaced by ranges
or higher conceptual levels
Dimensionality Reduction
Problem: Feature selection (i.e., attribute subset
selection):
Select a minimum set of attributes (features) that is
sufficient for the data mining task.
Best/worst attributes are determined using tests of
statistical significance or measures such as information gain
(as used when building decision trees for classification)
Solution: Heuristic methods (due to the exponential number of
choices – 2^d subsets of d attributes):
step-wise forward selection
step-wise backward elimination
combining forward selection and backward elimination
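A sketch of step-wise forward selection. The slide's criterion is a statistical-significance test or information gain; this sketch substitutes cross-validated accuracy of a decision tree as the selection criterion, a common practical stand-in (dataset and stopping rule are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]

selected, remaining = [], list(range(n_features))
best_score = 0.0

# Greedy step-wise forward selection: add the attribute that most improves
# cross-validated accuracy, and stop when no attribute helps.
while remaining:
    scores = {
        f: cross_val_score(DecisionTreeClassifier(random_state=0),
                           X[:, selected + [f]], y, cv=5).mean()
        for f in remaining
    }
    f_best = max(scores, key=scores.get)
    if scores[f_best] <= best_score:
        break
    best_score = scores[f_best]
    selected.append(f_best)
    remaining.remove(f_best)

print("selected feature indices:", selected, "cv accuracy:", round(best_score, 3))
```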
Data Compression
String compression
There are extensive theories and well-tuned algorithms
Typically lossless
But only limited manipulation is possible without
expansion
Audio/video, image compression
Typically lossy compression, with progressive
refinement
Sometimes small fragments of signal can be
reconstructed without reconstructing the whole
Time sequences are not audio: they are typically short and
vary slowly with time
Data Compression

[Diagram: lossless compression transforms the original data into compressed data and restores it exactly; lossy compression reconstructs only an approximation of the original data.]
Numerosity Reduction
Reduce the data volume by choosing alternative ‘smaller’
forms of data representation
Two types:
Parametric – a model is used to estimate the data; only the
model parameters are stored instead of the actual data
regression
log-linear models
Nonparametric –storing reduced representation of the
data
Histograms
Clustering
Sampling
Histograms
A popular data reduction technique
Divide data into buckets and store average (sum) for each
bucket
Use binning to approximate data distributions
Buckets span the horizontal axis; the height (or area) of a
bucket gives the average frequency of the values it represents
A bucket representing a single attribute-value/frequency pair
is a singleton bucket
More often, each bucket represents a continuous range of the
given attribute
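A small numpy sketch of an equal-width histogram used as a reduced representation, where only the bucket edges and counts are stored (the price values and the choice of 3 buckets are illustrative):

```python
import numpy as np

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-width histogram with 3 buckets: keep only edges and counts
counts, edges = np.histogram(prices, bins=3)
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    # the last bucket also includes its right edge
    print(f"bucket [{lo:.0f}, {hi:.0f}]: {c} values")
```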
Clustering
Partition data set into clusters, and one can store cluster
representation only.
Can be very effective if data is clustered but not if data
is “smeared”/ spread.
Can have hierarchical clustering and be stored in multi-
dimensional index tree structures.
There are many choices of clustering definitions and
clustering algorithms
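A minimal scikit-learn sketch of clustering as numerosity reduction: 300 synthetic points are replaced by 3 cluster centroids (the data and the value of k are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 300 two-dimensional points drawn around three centers
data = rng.normal(loc=[[0, 0]] * 100 + [[5, 5]] * 100 + [[0, 5]] * 100,
                  scale=0.5)

# Store only the cluster centroids as the reduced representation
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
print("reduced representation (centroids):")
print(km.cluster_centers_)
```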
Sampling
Sampling: obtaining a small sample s to represent the
whole data set N
Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
Choose a representative subset of the data
Simple random sampling may have very poor
performance in the presence of skew
Adaptive sampling methods, e.g. stratified sampling, cope
better with skew
Types of Sampling
Simple random sampling
There is an equal probability of selecting any particular
item
Sampling without replacement
Once an object is selected, it is removed from the
population
Sampling with replacement
A selected object is not removed from the population
Stratified sampling:
Partition the data set, and draw samples from each
partition (proportionally, i.e., approximately the same
percentage of the data)
Used in conjunction with skewed data
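A pandas sketch of the three sampling schemes on a skewed, hypothetical data set (column names and sizes are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "income": range(1000),
    "region": ["urban"] * 900 + ["rural"] * 100,   # skewed: rural is rare
})

# Simple random sample without replacement (SRSWOR)
srswor = df.sample(n=50, replace=False, random_state=0)

# Simple random sample with replacement (SRSWR)
srswr = df.sample(n=50, replace=True, random_state=0)

# Stratified sample: ~5% from each partition, so rural rows are not lost
stratified = df.groupby("region", group_keys=False).apply(
    lambda g: g.sample(frac=0.05, random_state=0)
)
print(stratified["region"].value_counts())
```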
Discretization and Concept Hierarchy
Discretization
reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace
actual data values

Concept hierarchies
reduce the data by collecting and replacing low level
concepts (such as numeric values for the attribute age) by
higher level concepts (such as young, middle-aged, or
senior)
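A minimal pandas sketch of building that concept hierarchy for age; the interval boundaries (30 and 60) are an assumption, since the slide does not fix them:

```python
import pandas as pd

ages = pd.Series([13, 22, 35, 41, 58, 63, 72])

# Replace raw age values with higher-level concepts (a simple concept hierarchy)
age_concepts = pd.cut(
    ages,
    bins=[0, 30, 60, 120],
    labels=["young", "middle-aged", "senior"],
)
print(age_concepts.tolist())
```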
Discretization
Three types of attributes:
Nominal — values from an unordered set
Ordinal — values from an ordered set
Continuous — real numbers
Discretization:
divide the range of a continuous attribute into intervals
because some data mining algorithms only accept
categorical attributes.
Some techniques:
Binning methods – equal-width, equal-frequency
Histogram
Entropy-based methods
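A sketch of the core step of entropy-based discretization: choosing the cut point that minimizes the weighted class-label entropy of the two resulting intervals. The values and labels are made up, and a full method would apply this step recursively with a stopping criterion:

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_split(values, labels):
    """Find the cut point that minimizes the weighted class-label entropy
    of the two resulting intervals (one step of entropy-based discretization)."""
    pairs = sorted(zip(values, labels))
    best_cut, best_e = None, float("inf")
    for i in range(1, len(pairs)):
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= cut]
        right = [l for v, l in pairs if v > cut]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if e < best_e:
            best_cut, best_e = cut, e
    return best_cut, best_e

# Hypothetical attribute values with binary class labels
values = [4, 8, 15, 21, 24, 25, 28, 34]
labels = ["no", "no", "no", "yes", "yes", "yes", "yes", "yes"]
print(best_split(values, labels))   # cut at 18.0 separates the classes cleanly
```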
References
J. Han and M. Kamber. Data Mining: Concepts and
Techniques. Morgan Kaufmann, 2000.
T. Dasu and T. Johnson. Exploratory Data Mining and
Data Cleaning. John Wiley & Sons, 2003
V. Raman and J. Hellerstein. Potter's Wheel: An Interactive
Framework for Data Cleaning and Transformation. VLDB 2001.
H.V. Jagadish et al., Special Issue on Data Reduction
Techniques. Bulletin of the Technical Committee on
Data Engineering, 20(4), December 1997
