
Data Mining

Steps of Data Mining


Data Preprocessing
Data
• Data
– Usually thought of as large datasets
• With a huge number of rows and columns
– Not always the case
– Could be in many different forms
• Structured Tables,
• Images,
• Audio or video files, etc.
Data Preprocessing
• Data Preprocessing
– A step in which the data is transformed into a state that a
machine can easily parse
– Can be divided into four categories
• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction
Data Cleaning
• First step of data pre-processing
• A process of preparing raw data for analysis by
– Handling missing values
– Handling noise
– Detecting outliers, and
– Correcting inconsistencies
Data Cleaning
• If the data contains missing values for some of its attributes, then
– They can be handled using one of the following ways:
• Ignore the tuple
• Fill the missing value manually
• Use a global constant for the missing value
• Use the mean of the attribute values to fill the missing value
• Use the most probable value to fill the missing value
Handling Missing Value
Ignore the Tuple

• This choice is selected when
– The class label is missing
– The tuple has several attributes with missing values
• Not very effective, since the remaining values in the tuple are also discarded
Handling Missing Value
Filling the missing value manually
• The missing values are filled in by hand
– Time-consuming
– Not feasible for large datasets with many missing values
Handling Missing Value
Use the mean of the attribute values

• This method works by replacing the missing value for a particular
attribute with the average (mean) value of that attribute
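A minimal sketch of mean imputation, assuming pandas and a made-up table with an "Age" attribute (neither appears in the slides):

```python
import pandas as pd

# Hypothetical records; the "Age" attribute has one missing value
df = pd.DataFrame({"ID": [1, 2, 3, 4],
                   "Age": [25.0, None, 31.0, 28.0]})

# Replace the missing value with the mean of the observed Age values
df["Age"] = df["Age"].fillna(df["Age"].mean())

print(df)   # the missing Age becomes (25 + 31 + 28) / 3 = 28.0
```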
Handling Missing Value
Use a global constant for the missing value
• This method works by replacing the missing values of an attribute
with the same constant for all records
• May cause problems because
– The mining process may think that the constant forms an important
concept, since many records share this common value
Handling Missing Value
Use the most probable value

• This method is used with techniques like
– Regression
– Decision tree induction
– Bayesian formalism, etc.
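One hedged way to sketch the "most probable value" idea is to train a model on the records where the attribute is present and let it predict the missing entry. The library (scikit-learn), the decision-tree choice, and the two-column data below are illustrative assumptions, not a prescription from the slides:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: column 0 = income, column 1 = age; one age is missing
data = np.array([[40.0, 25.0],
                 [55.0, 32.0],
                 [60.0, 35.0],
                 [50.0, np.nan]])

known = ~np.isnan(data[:, 1])                      # rows with age present
tree = DecisionTreeRegressor().fit(data[known, :1], data[known, 1])

# Fill the gap with the value the model considers most probable
data[~known, 1] = tree.predict(data[~known, :1])
print(data)
```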
Noise
• Noise
– A random error or variance in measured data
– Deviates from the normal values
• Reasons for noisy data
– Faulty data collection instruments
– Data entry problems
– Limitations of technology

How to Handle Noisy Data?


Handling Noise
• Methods for handling noise
– Binning
– Regression
– Clustering
Handling Noise
Binning
• This method works by
– Smoothing stored data values by consulting their neighborhood
– First, all the values are sorted
– The sorted values are then divided into bins or buckets
• In smoothing by bin boundaries
– The min and max values of each bin are taken as the bin
boundaries
– Each value is replaced by the closest bin boundary
• The larger the bin width, the greater the effect of smoothing
Handling Noise
Binning Example
Stored Data: 21, 15, 24, 34, 25, 8, 4, 28, 21

Sorted Data: 4, 8, 15, 21, 21, 24, 25, 28, 34

Bin 1: 4, 8, 15     Mean = 9    Min = 4    Max = 15
Bin 2: 21, 21, 24   Mean = 22   Min = 21   Max = 24
Bin 3: 25, 28, 34   Mean = 29   Min = 25   Max = 34
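A small sketch in plain Python that reproduces the example above, showing both smoothing by bin means and smoothing by bin boundaries (equal-depth bins of three values each):

```python
data = [21, 15, 24, 34, 25, 8, 4, 28, 21]
values = sorted(data)                          # 4, 8, 15, 21, 21, 24, 25, 28, 34
bins = [values[i:i + 3] for i in range(0, len(values), 3)]

# Smoothing by bin means: every value in a bin becomes the bin mean
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value becomes the closest of min/max
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```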
Handling Noise
Regression Analysis
• A way to find trends in data
– By fitting the data to a regression function
• Simple Linear regression
– The relationships between variables can be described with a
straight line
– Involves finding the best line to fit two variables so that one
variable can be used to predict the other
• Example: Y = b0 + b1x
• Multiple Linear Regression
– An extension of linear regression
– More than two variables are involved and the data are fit to a
multidimensional surface
• Example: Y = b0 + b1x1 + b2x2 + … + bnxn
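A minimal sketch of simple linear regression for trend-finding, assuming scikit-learn and a small made-up set of noisy points:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noisy observations that roughly follow y = 2x + 1
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

model = LinearRegression().fit(x, y)        # estimates b0 (intercept) and b1 (slope)
print(model.intercept_, model.coef_[0])     # b0 ≈ 1.15, b1 ≈ 1.95 for these points
print(model.predict([[6.0]]))               # smoothed / predicted value at x = 6
```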
Detecting Outliers
Clustering
• Outliers may be detected by clustering, where
– Similar values are grouped together into groups called clusters
– Values that fall outside of the set of clusters may be
considered outliers
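A sketch of the clustering idea, assuming scikit-learn's k-means and a simple distance-based cutoff (the threshold rule is an illustrative choice, not something the slides specify):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two tight hypothetical clusters plus one isolated point
X = np.vstack([rng.normal([1, 1], 0.1, size=(20, 2)),
               rng.normal([8, 8], 0.1, size=(20, 2)),
               [[20.0, 0.0]]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Points unusually far from their own cluster center are flagged as outliers
threshold = dist.mean() + 3 * dist.std()
print(X[dist > threshold])        # expected: the isolated point near [20, 0]
```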
Data Integration
• A process where multiple heterogeneous data sources such as
databases, data cubes, or files are combined for analysis
• Can help to improve the accuracy and speed of the data mining
process
• Different databases may have different naming conventions for
variables
– Which can cause redundancies in the integrated data
• Additional Data Cleaning can be performed
– To remove the redundancies and inconsistencies that arise due to
data integration
– Without affecting the reliability of data
• Data Integration can be performed
– Using Data Migration Tools
• such as Oracle Data Service Integrator, Microsoft SQL, etc.
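As a hedged illustration, the sketch below integrates two made-up sources with pandas, reconciling their different naming conventions (cust_id vs. customer_id are invented column names) before merging and de-duplicating:

```python
import pandas as pd

# Hypothetical records from two sources with different naming conventions
sales = pd.DataFrame({"cust_id": [1, 2, 3], "amount": [250, 120, 340]})
crm = pd.DataFrame({"customer_id": [1, 2, 4], "city": ["Dhaka", "Delhi", "Oslo"]})

# Resolve the naming mismatch, then combine the sources on the shared key
crm = crm.rename(columns={"customer_id": "cust_id"})
combined = pd.merge(sales, crm, on="cust_id", how="outer")

# Drop exact duplicate rows that integration may introduce
combined = combined.drop_duplicates()
print(combined)
```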
Data Transformation
• A process where
– Data is transformed into a form suitable for the data mining
process
– Data is consolidated so that the mining process is more efficient
and the patterns are easier to understand
• Involves data mapping and code generation
Data Transformation
• Strategies for data transformation
– Smoothing
• Removing noise from data using clustering, regression, etc.
– Aggregation
• Combining two or more attributes into a single attribute
– Normalization
• Scaling of data to fall within a smaller range
– Discretization
• Raw values of numeric data are replaced by intervals
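A short sketch of two of these strategies, min-max normalization and discretization into intervals, on made-up values (numpy and pandas are assumptions; the slides name no library):

```python
import numpy as np
import pandas as pd

values = np.array([4.0, 8.0, 15.0, 21.0, 24.0, 28.0, 34.0])

# Normalization: min-max scaling into the range [0, 1]
scaled = (values - values.min()) / (values.max() - values.min())
print(scaled.round(2))            # 4 -> 0.0, 34 -> 1.0

# Discretization: replace raw values by intervals (3 equal-width bins)
intervals = pd.cut(values, bins=3)
print(intervals)                  # e.g. (3.97, 14.0], (14.0, 24.0], (24.0, 34.0]
```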
Data Reduction
• This technique
– Obtains a representation of the data that is much smaller in
volume while maintaining its integrity
– Can be performed using methods such as Naive Bayes, decision trees,
neural networks, etc.
• Some strategies of data reduction
• Dimensionality Reduction
– Reducing the number of attributes in the dataset
• Numerosity Reduction
– Replacing the original data volume by a smaller form of data
representation
• Data Compression
– Compressed representation of the original data
Data Reduction
Dimensionality Reduction
• Reduces the volume of the original data
– By eliminating attributes from the data set under consideration
• Different techniques
– Wavelet Transform
– Principal Component Analysis (PCA)
– Attribute Subset Selection
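A minimal PCA sketch with scikit-learn on a synthetic 4-attribute dataset in which two attributes are near-linear combinations of the other two (the data and matrix below are invented for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))                      # 2 underlying factors
extra = base @ np.array([[1.0, 0.5], [0.3, 2.0]])     # 2 derived attributes
X = np.hstack([base, extra + 0.05 * rng.normal(size=(100, 2))])

pca = PCA(n_components=2)                 # keep only 2 principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                    # (100, 2): 4 attributes reduced to 2
print(pca.explained_variance_ratio_)      # almost all variance is retained
```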
Data Reduction
Numerosity Reduction
• Reduces the volume of the original data and represents it in a
much smaller form
• Two types
– Parametric
• Involves storing only the parameters of a model fitted to the
data, instead of the original data
• Methods: Regression and log-linear models
– Non-Parametric
• Used for storing reduced representations of the data
• Methods: Histogram, Clustering, Sampling, etc.
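A sketch of the non-parametric side, replacing 10,000 made-up values by a histogram and by a random sample (the bin count and sample size are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=10_000)    # hypothetical raw data

# Histogram: keep 20 bin counts and 21 bin edges instead of 10,000 values
counts, edges = np.histogram(values, bins=20)

# Sampling: keep a small random sample that stands in for the original data
sample = rng.choice(values, size=200, replace=False)

print(counts.sum(), len(edges))                           # 10000 values, 21 edges
print(round(sample.mean(), 2), round(values.mean(), 2))   # sample mean ≈ data mean
```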
Data Reduction
Data Compression
• A technique where
– Data transformations are applied to the original data to obtain a
compressed representation of it
• Lossless data reduction
– If the original data can be reconstructed exactly from the compressed
data, without losing any information
• Lossy data reduction
– If the original data cannot be reconstructed exactly from the
compressed data, only an approximation of it
• Dimensionality and numerosity reduction can also be considered forms
of data compression
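A tiny sketch of the lossless case using Python's standard zlib module; the compressed bytes reconstruct the original exactly:

```python
import zlib

original = b"data mining, data cleaning, data integration, " * 100

compressed = zlib.compress(original)      # lossless compression
restored = zlib.decompress(compressed)    # exact reconstruction

print(len(original), len(compressed))     # the compressed form is much smaller
print(restored == original)               # True: no information is lost
```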
