Data Mining Methods: Data Pre-Processing: Prof. Dr. Christina Andersson

Contents


1 Data Preprocessing in the Data Mining Process

2 Issues in Data Preprocessing

The data mining/KDD process

Approximate share of project effort per phase:
Understanding customer: 10%-20%
Understanding data: 20%-30%
Prepare data: 40%-70%
Build models: 10%-20%
Evaluate models: 10%-20%
Take action: 10%-20%
(http://www.crisp-dm.org)

Why data preprocessing?


Real-world data is dirty


Low data quality is in any case a huge problem in data mining
Garbage in, garbage out
Different methods, different requirements

Issues in data preprocessing

Data cleaning
  Fill in missing values, smooth noisy data, identify outliers, resolve inconsistencies
Data transformation
  Normalization, aggregation, removal of skewness
Variable construction
  Derivation of new variables based on existing variables
Data integration
  Integration of multiple data sources
Data reduction
  Reduction of number of records, variables and variable levels

Data cleaning tasks

Fill in (impute) missing values
Detect and correct inconsistent data
Smooth noisy data
Identify / remove outliers

Causes of missing data

Misunderstanding during data entry
Variable not considered important
Inconsistent with other data and then deleted
Missing because it is sensitive personal information
Depending on the analysis method (data mining technique), missing data must often be imputed!

Handling of missing data

Ignore records with missing data
Do nothing and use a method that can handle records with missing data
Impute (fill in) missing values (= guess!)

How to impute values?

= What is a good guess?
Fill in values manually
Fill in with a global constant (e.g. unknown)
Fill in with variable mean or median
Fill in with class mean or median
Fill in with most likely value (use e.g. regression,
decision tree to predict this value)
Use other variables to predict value
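
The simpler strategies in this list (global constant, variable mean or median, class mean or median) can be sketched as follows. This is only a minimal illustration: the function names, the use of None as the missing-value marker and the income/segment data are assumptions made for the example, not part of the lecture.

```python
from statistics import mean, median


def impute(values, strategy="mean", constant=None):
    """Return a copy of `values` in which None entries are filled in."""
    observed = [v for v in values if v is not None]
    if strategy == "constant":
        fill = constant
    elif strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


def impute_by_class(values, classes, strategy="mean"):
    """Fill each missing value with the mean/median of its own class."""
    result = list(values)
    for label in set(classes):
        idx = [i for i, c in enumerate(classes) if c == label]
        filled = impute([values[i] for i in idx], strategy)
        for i, v in zip(idx, filled):
            result[i] = v
    return result


# Made-up example data: None marks a missing value.
incomes = [30, None, 50, 100, None, 40]
segment = ["a", "a", "a", "b", "b", "b"]

print(impute(incomes, "constant", constant="unknown"))
print(impute(incomes, "mean"))
print(impute(incomes, "median"))
print(impute_by_class(incomes, segment, "mean"))
```

The class-based variant usually gives a better guess than the global mean when the classes differ strongly, but every imputed value remains a guess.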

Result of the imputation


Corresponding variable without missing values


(Indicator variable)

Inconsistent data

Data entry errors, data integration problems
Data entry verification is important (both format and values entered)
Correct with external reference data

Handling of outliers and noisy data

Noise: Random error in a measurement
Definition of outliers
Cause of outliers
What to do with outliers?
Handle noisy data through binning, clustering,
regression, manual inspection, ...
Methods more or less sensitive to noise and
outliers

Handling noisy data

Binning
  Sort data and partition into bins
  Smooth by bin means, bin median, bin boundaries, ...
Regression
  Fit a regression function to smooth
Clustering
  Detect and remove outliers
Combined computer and human inspection
  Use computer to detect suspicious values and check by human inspection

Binning: Equal-width

Divide the range into N intervals of equal width

$$\text{Width} = \frac{\text{Max} - \text{Min}}{N}$$
Advantage:
Simple
Disadvantage:
Outliers can dominate
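
A minimal sketch of equal-width binning, assuming numeric input with Max > Min. The function name and the price list (with one artificial outlier) are made up for illustration.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width.

    Returns the bin width and a list of bin indices (0 .. n_bins - 1).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    indices = []
    for v in values:
        index = int((v - lo) / width) if width > 0 else 0
        # The maximum would land in bin n_bins, so clamp it into the last bin.
        indices.append(min(index, n_bins - 1))
    return width, indices


# Made-up data: nine ordinary prices and one outlier.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 204]
width, bins = equal_width_bins(prices, 4)
print(width)  # bin width
print(bins)   # bin index per value
```

With this data the single outlier stretches the range so much that all other values fall into the first bin, which is the disadvantage named above.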

Binning: Equal-depth

Divides the data into N intervals, each containing
approx. the same number of observations
Advantage:
Skewed data handled well
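
Equal-depth binning, followed by smoothing by bin means, might look like the sketch below. The function names and the price list are illustrative assumptions; ties between equal values are not treated specially here.

```python
from statistics import mean


def equal_depth_bins(values, n_bins):
    """Sort the values and cut them into n_bins bins of (almost) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    # Integer cut points, so bin sizes differ by at most one observation.
    cuts = [i * n // n_bins for i in range(n_bins + 1)]
    return [ordered[cuts[i]:cuts[i + 1]] for i in range(n_bins)]


def smooth_by_bin_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[mean(b)] * len(b) for b in bins]


# Made-up example data.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_depth_bins(prices, 3)
print(bins)
print(smooth_by_bin_means(bins))
```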

Smoothing by regression

Perform a regression analysis
Replace observed values by predicted values
Disadvantage:
Appropriate model with fulfilled assumptions
required
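
As a sketch, smoothing with a simple linear regression (one predictor, ordinary least squares) could be written as follows. The choice of a straight-line model, the function names and the data are assumptions for illustration; in practice the model has to fit the data, as noted above.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace each observed y by the value predicted from the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [intercept + slope * x for x in xs]


# Made-up noisy measurements around a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
print(fit_line(xs, ys))
print(smooth_by_regression(xs, ys))
```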

Data transformation

Aggregation (summarization)
Generalization
  Remove variable with higher level equivalent
Normalization
  To make different variables comparable
  Min-max
  Z-score
  Decimal scaling (move decimal point)
Removal of skewness

Note: Save parameters in a meta-data repository!

Normalization

Min-max-normalization (e.g. 0 to 1):
$$x' = \frac{x - \min}{\max - \min}$$
z-score-normalization:
$$x' = \frac{x - \text{mean}}{\text{standard deviation}}$$
Normalization by decimal scaling (move decimal point):
$$x' = \frac{x}{10^{j}}$$
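
The three normalizations can be sketched in a few lines. Two assumptions are made here that the slide does not state: the z-score uses the population standard deviation, and j in decimal scaling is taken as the smallest integer for which all scaled absolute values are below 1. Function names and data are illustrative.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale linearly to [new_min, new_max]; assumes max > min."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


def z_score(values):
    """Centre on the mean and divide by the (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer giving max |x'| < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


# Made-up example data.
values = [200, 300, 400, 600, 1000]
print(min_max(values))
print(z_score(values))
print(decimal_scaling(values))
```

The quantities min, max, mean, standard deviation and j are the parameters that would have to be saved in order to transform new data in the same way.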

Variable construction

New variables constructed from given variables
Pattern may only exist for derived variables
Ex.:
Change of profit for consecutive years
Variable volume constructed using height, depth
and width
Construction based on mathematical or logical
operations
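
The two examples can be written down directly. The function names and numbers are made up for illustration.

```python
def volume(height, depth, width):
    """Derived variable built from three given variables."""
    return height * depth * width


def profit_changes(profits):
    """Change of profit between consecutive years (later minus earlier)."""
    return [later - earlier for earlier, later in zip(profits, profits[1:])]


print(volume(2, 3, 4))
print(profit_changes([100, 120, 90]))
```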

Data reduction and discretization

Improve efficiency:
  Runtime of many data mining algorithms is at least linear w.r.t. number of observations (records) and number of variables
Improved quality:
  Removal of noisy variables may improve the quality of discovered patterns

Conclusion:
Advantageous to reduce number of observations and variables!

Note: Reduced data should produce similar results to the original data

Discretization

To enable the application of data mining methods that require discrete variable values

Three types of variables:
  Nominal (categorical) - values from an unordered set
  Ordinal (categorical) - values from an ordered set
  Continuous (numerical) - real numbers

Motivation:
  Some data mining algorithms only accept categorical variables
  May improve the identification and understanding of patterns

Discretization: Tasks

Reduce the number of values for a given
continuous variable by partitioning the range of
the variable into intervals
Interval labels replace actual variable values

Discretization: Methods


Binning
Cluster analysis
Entropy-based discretization

Entropy-based discretization

For classification tasks
Given a data set $S$
If $S$ is partitioned into two intervals $S_1$ and $S_2$ using boundary $T$, the entropy $E$ after partitioning is

$$E(S, T) = \frac{|S_1|}{|S|} E(S_1) + \frac{|S_2|}{|S|} E(S_2)$$

Recursive partitioning continues as long as the reduction in entropy exceeds a threshold, e.g.

$$E(S) - E(S, T) > \delta$$
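
A minimal sketch of this procedure for one continuous variable and a class label. Several choices are assumptions of this sketch and not given on the slide: entropy is measured in bits, candidate boundaries are the midpoints between neighbouring distinct values, and recursion stops once the entropy reduction is no larger than delta. The age data is made up.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def split_entropy(values, labels, boundary):
    """E(S, T): size-weighted entropy of the two intervals split at boundary."""
    left = [l for v, l in zip(values, labels) if v <= boundary]
    right = [l for v, l in zip(values, labels) if v > boundary]
    n = len(labels)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)


def best_boundary(values, labels):
    """Boundary T (a midpoint between distinct values) minimising E(S, T)."""
    ordered = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    return min(candidates, key=lambda t: split_entropy(values, labels, t))


def discretize(values, labels, delta=0.1):
    """Recursively split while E(S) - E(S, T) > delta; return the boundaries."""
    if len(set(values)) < 2:
        return []
    t = best_boundary(values, labels)
    if entropy(labels) - split_entropy(values, labels, t) <= delta:
        return []
    left = [(v, l) for v, l in zip(values, labels) if v <= t]
    right = [(v, l) for v, l in zip(values, labels) if v > t]
    return discretize(*zip(*left), delta) + [t] + discretize(*zip(*right), delta)


# Made-up example: class changes between age 30 and 35.
ages = [20, 25, 30, 35, 40, 45]
buys = ["no", "no", "no", "yes", "yes", "yes"]
print(discretize(ages, buys))
```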

Variable Reduction

= Dimensionality reduction
Remove redundancy!
Remove irrelevancy!

Data reduction

Select a sub-set of the available variables with a
similar probability distribution compared to
original data - but how?
Problem:
$2^d$ possible subsets of a set of d variables
Need heuristic variable selection methods

Data reduction

But how?
Find correlated variables, e.g. age and date of birth, and make sure that they don’t occur at the same time in the model
Greedy bottom-up variable selection:
Use forward selection (find and select best
variables)
The best single variable is selected first
Then the second best variable is selected,
conditioned on the selection of the first one
Greedy top-down variable elimination
Use backward elimination (find and eliminate
worst variables, repeatedly)
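
Greedy forward selection can be sketched generically, with the quality measure passed in as a function. Everything concrete below is hypothetical: the scoring function `toy_score`, its numbers and the variable names only stand in for a real model-quality measure (e.g. validation accuracy), and the rule to stop when the score no longer improves is one possible stopping criterion.

```python
def forward_selection(candidates, score, max_vars):
    """Greedy bottom-up selection: repeatedly add the variable that
    improves score(subset) the most; stop when nothing improves it."""
    selected = []
    best_score = score(selected)
    while len(selected) < max_vars:
        remaining = [v for v in candidates if v not in selected]
        if not remaining:
            break
        best_var = max(remaining, key=lambda v: score(selected + [v]))
        new_score = score(selected + [best_var])
        if new_score <= best_score:
            break
        selected.append(best_var)
        best_score = new_score
    return selected


def toy_score(subset):
    """Hypothetical quality measure: usefulness points per variable,
    minus a penalty when two redundant variables occur together."""
    usefulness = {"age": 5, "date_of_birth": 5, "income": 3, "shoe_size": 0}
    total = sum(usefulness[v] for v in subset)
    if "age" in subset and "date_of_birth" in subset:
        total -= 5  # date of birth adds nothing once age is in the model
    return total


variables = ["age", "date_of_birth", "income", "shoe_size"]
print(forward_selection(variables, toy_score, max_vars=3))
```

Backward elimination works the same way in reverse: start from all variables and repeatedly drop the one whose removal hurts the score least.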

Data reduction: Principal component analysis

Task
Given p original variables, find c ≤ p orthogonal
vectors that can "best" represent the data
Data representation by projection onto the c
resulting vectors
Resulting vectors are called principal components
Each principal component is a linear combination
of the original variables

Starting Point

Consider the linear combinations

$$
\begin{aligned}
PC_1 &= a_{11} X_1 + a_{12} X_2 + \cdots + a_{1p} X_p \\
PC_2 &= a_{21} X_1 + a_{22} X_2 + \cdots + a_{2p} X_p \\
&\;\;\vdots \\
PC_p &= a_{p1} X_1 + a_{p2} X_2 + \cdots + a_{pp} X_p
\end{aligned}
$$

Each of these components can be seen as a linear regression, predicting $PC_i$ from $X_1, X_2, \ldots, X_p$.

Principal component analysis

Problem:
In what way is the representation of the data "best"?
The first principal component (= pc1) explains the largest proportion of the variance of the original variables
The second principal component (= pc2) explains the largest proportion of the remaining variance of the original variables (the part not explained by pc1)
And so forth ....
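
To make "largest proportion of the variance" concrete, the sketch below computes the first principal component of a small data set. It is not a full PCA: power iteration on the sample covariance matrix is used here as a simple stand-in for an eigendecomposition, the variables are mean-centred but not standardized, and the function names and data are assumptions for illustration.

```python
from math import sqrt


def covariance_matrix(rows):
    """Sample covariance matrix of mean-centred data (rows = observations)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def first_principal_component(cov, iterations=200):
    """Power iteration: direction of maximum variance and that variance.

    Fails if the start vector happens to be orthogonal to the first
    component (the norm then becomes zero); fine for a sketch.
    """
    p = len(cov)
    vec = [1.0 / (i + 1) for i in range(p)]  # arbitrary start vector
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    variance = sum(vec[i] * cov[i][j] * vec[j]
                   for i in range(p) for j in range(p))
    return vec, variance


# Made-up data lying exactly on the line X2 = X1.
rows = [(1, 1), (2, 2), (3, 3)]
cov = covariance_matrix(rows)
weights, variance = first_principal_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
print(weights)                    # coefficients a11, a12 of pc1
print(variance / total_variance)  # proportion of variance explained by pc1
```

Because the example points lie on one line, pc1 alone explains all of the variance; with real data the proportion is below 1 and the remaining components share the rest.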

Principal component analysis

Problem:
How many principal components should be selected?
Scree plot of eigenvalues
Cut-off for proportion of explained variance
Eigenvalue > 1
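
Two of these rules are easy to state in code. The eigenvalues below are made up, and the 90% cut-off is only an example value; note that the "eigenvalue > 1" rule is normally applied when the components are computed from standardized variables (correlation matrix), which is assumed here.

```python
def components_by_explained_variance(eigenvalues, cutoff=0.9):
    """Smallest number of components whose cumulative share of the
    total variance reaches the cut-off."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= cutoff:
            return count
    return len(ordered)


def components_by_eigenvalue_rule(eigenvalues):
    """Number of components with an eigenvalue greater than 1."""
    return sum(1 for value in eigenvalues if value > 1)


# Hypothetical eigenvalues for four standardized variables (sum = 4).
eigenvalues = [2.5, 1.2, 0.2, 0.1]
print(components_by_explained_variance(eigenvalues, cutoff=0.9))
print(components_by_eigenvalue_rule(eigenvalues))
```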

Principal component analysis

Properties
Principal components are uncorrelated
Principal components are the directions of maximum variance of the original data
Principal components are linear combinations of the original variables - how to interpret them?
Works for numeric data only
What about predictive power for Y? Y is not involved in the construction of the principal components!

Data integration: Purpose

Objective:
Combine data from multiple sources into one coherent database

Metadata integration
  Integrate metadata from different sources
  Variable identification problem:
  The "same" variable may have different names in different databases

Instance integration
  Integrate instances from different sources
  For the same real world entity, variable values from different sources may be different
  Reasons: Different representation, different scales, ...
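
Both problems (different names, different scales) can be illustrated with a small sketch. All names here are hypothetical: the two sources, the variable names, the name mapping and the inch-to-centimetre converter are made up for the example.

```python
def integrate(records, name_map, converters):
    """Rename variables to a common schema and convert their values.

    name_map:   source variable name -> unified variable name
    converters: source variable name -> function converting the value
    """
    unified = []
    for record in records:
        row = {}
        for name, value in record.items():
            target = name_map.get(name, name)
            convert = converters.get(name, lambda v: v)
            row[target] = convert(value)
        unified.append(row)
    return unified


# Hypothetical sources describing customers with different names and scales.
source_a = [{"cust_id": 1, "height_cm": 180}]
source_b = [{"customer_number": 2, "height_in": 70}]

name_map = {"customer_number": "cust_id", "height_in": "height_cm"}
converters = {"height_in": lambda inches: round(inches * 2.54, 1)}

combined = integrate(source_a + source_b, name_map, converters)
print(combined)
```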
