Data preprocessing before classification
Presented By:
Outline
• Collecting data
• Preparing data
• Data preprocessing
Collecting data
Collecting data
• Collecting “example patterns”
– Inputs (vectors of independent variables)
– Outputs (vectors of dependent variables)
• More data is better
• Begin with an elementary set of data
Collecting data
• Choose an appropriate sampling rate for time-series data.
• Make sure the data measurement units are consistent.
• Keep non-essential variables out of the input vector.
• Make sure no major structural (systemic) changes have occurred during collection.
Collecting data
• How much data is enough?
– Train and test using a subset of the data
– If performance does not improve when the full data set is used, the data is sufficient (see the sketch below)
– There are statistical validation methods (Ch. 11)
• Using simulated data
– When it is difficult to collect (sufficient) data
– Simulated data should be:
• Realistic
• Representative
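A minimal sketch of the subset-comparison idea, assuming scikit-learn and a k-nearest-neighbor classifier (illustrative choices; the slides do not prescribe a toolkit or model):

    # Train on growing subsets; if accuracy plateaus early, the data is likely enough.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for frac in (0.25, 0.5, 1.0):
        n = int(frac * len(X_train))  # size of the training subset
        model = KNeighborsClassifier().fit(X_train[:n], y_train[:n])
        print(f"{frac:.0%} of data -> test accuracy {model.score(X_test, y_test):.3f}")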
Preparing data
Preparing data
• Handling
– Missing data
– Categorical data
– Inconsistent data and outliers
Missing data
• Discard incomplete example patterns
• Manually enter a reasonable, probable, or expected value
• Use a statistic generated from the example patterns for that variable
– Mean, mode
• Encode missing values explicitly by creating new indicator variables
• Generate a predictive model to predict each of the missing values (the simpler options are sketched below)
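A minimal sketch of the simpler options, assuming pandas (an illustrative choice, not named in the slides):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                       "color": ["red", "blue", None, "blue"]})

    # Option 1: discard incomplete example patterns
    complete = df.dropna()

    # Option 2: encode missingness explicitly with an indicator variable
    df["age_missing"] = df["age"].isna().astype(int)

    # Option 3: fill with a statistic -- mean for numeric, mode for categorical
    df["age"] = df["age"].fillna(df["age"].mean())
    df["color"] = df["color"].fillna(df["color"].mode()[0])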
Categorical data
• Ordinal:
– Convert to a numerical representation in a straightforward manner
– “Low”, “medium”, “high” => 0, 1, 2
• Nominal:
– “One-of-n” representation
– Encode the input variable as n different binary inputs when there are n distinct categories (see the sketch below)
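Both encodings in a minimal pandas sketch (pandas and the toy column names are illustrative assumptions):

    import pandas as pd

    df = pd.DataFrame({"risk": ["low", "high", "medium"],
                       "color": ["red", "blue", "green"]})

    # Ordinal: map ordered categories straight to integers
    df["risk"] = df["risk"].map({"low": 0, "medium": 1, "high": 2})

    # Nominal: one-of-n encoding -- one binary input per distinct category
    df = pd.get_dummies(df, columns=["color"])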
Further processing of “one-of-n”
• When n is too large, reduce the number of inputs in the new encoding:
– Manually
– PCA-based reduction
• Reduce the one-of-n representation to a one-of-m representation where m is less than n (see the sketch below)
– Eigenvalue-based reduction
– Output-variable-based reduction
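A minimal sketch of PCA-based reduction of a one-of-n block, assuming scikit-learn (here n = 4 and m = 3 are arbitrary illustrative values):

    import pandas as pd
    from sklearn.decomposition import PCA

    # n = 4 binary columns from a one-of-n encoding
    onehot = pd.get_dummies(pd.Series(["a", "b", "c", "d", "a", "c"]))

    # Project onto m = 3 components; the m new inputs replace the n binary ones
    reduced = PCA(n_components=3).fit_transform(onehot)
    print(reduced.shape)  # (6, 3)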
Inconsistent data and outliers
• Removing erroneous data
• Identifying inconsistent data
– Thresholding, filtering
• Outliers
– Data points that lie outside the normal region of interest in the input space, which may be
• Unusual situations that are “correct”
• Misleading or incorrect measurements
Outliers
• Ways to spot outliers
– Plots: box plot, histogram, …
– Number of standard deviations (S.D.) from the mean
• Handling outliers (see the sketch below)
– Remove them
• Assumption: the region of the input space where the outliers reside is not of concern
– “Winsorize” them
• Convert the values of outliers into the upper or lower threshold values
• Outliers can always be reintroduced into a satisfactory model to study the changes in the model’s performance
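Both the S.D. test and the two handling options in a minimal NumPy sketch (the 2-S.D. cutoff is an arbitrary illustrative choice):

    import numpy as np

    x = np.array([9.8, 10.1, 10.4, 9.9, 10.0, 25.0])  # 25.0 is suspect

    # Spot outliers by their number of standard deviations from the mean
    z = (x - x.mean()) / x.std()
    is_outlier = np.abs(z) > 2.0

    # Option 1: remove them
    x_removed = x[~is_outlier]

    # Option 2: winsorize them -- clip to the threshold values
    lo, hi = x.mean() - 2.0 * x.std(), x.mean() + 2.0 * x.std()
    x_winsorized = np.clip(x, lo, hi)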
Ben Shabad
Data preprocessing
Reasons to preprocess data
• Reducing noise
• Enhancing the signal
• Reducing input space
• Feature extraction
• Normalizing data
• Modifying prior probabilities (specific to classification)
Reducing noise
• Averaging data values
• Thresholding data
– Converts numeric data into a categorical form
– E.g. a grey-scale image => a two-tone (monochrome) image (see the sketch below)
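Both ideas in a minimal NumPy sketch (window size and threshold are arbitrary illustrative values):

    import numpy as np

    # Averaging: a 3-point moving average smooths a noisy 1-D signal
    signal = np.array([1.0, 1.2, 5.0, 1.1, 0.9])
    smoothed = np.convolve(signal, np.ones(3) / 3, mode="same")

    # Thresholding: grey-scale pixels become a two-tone image
    gray = np.array([[12, 200, 90],
                     [180, 30, 240]], dtype=np.uint8)
    mono = (gray > 128).astype(np.uint8)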
Reducing input space
• Principal component analysis (PCA)
– Identifies an m-dimensional subspace of the n-dimensional input space
– The original n variables are reduced to m variables that are mutually orthogonal (independent)
• Eliminating correlated input variables
– Identify highly correlated input variables by
• Statistical correlation tests
• Visual inspection of graphed data variables
• Seeing whether a data variable can be modeled using one or more of the others (see the sketch below)
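A minimal sketch of both approaches, assuming pandas and scikit-learn (the toy columns are made up; column "b" nearly duplicates "a"):

    import pandas as pd
    from sklearn.decomposition import PCA

    df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                       "b": [1.1, 2.1, 2.9, 4.2],   # nearly duplicates "a"
                       "c": [4.0, 1.0, 3.0, 2.0]})

    # Statistical correlation test: "a" and "b" correlate near 1.0 -> drop one
    print(df.corr())

    # Or let PCA find an m-dimensional orthogonal subspace (m = 2 here)
    reduced = PCA(n_components=2).fit_transform(df)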
Reducing input space
• Combining non-correlated input variables
• Sensitivity analysis
– If variations of a particular input variable cause large changes in the estimation model output, the variable is very significant
– Sensitivity analysis prunes input variables based on information provided by both input and output data (see the sketch below)
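A minimal sketch of perturbation-based sensitivity analysis; the stand-in model and step size are assumptions for illustration:

    import numpy as np

    def model(x):                       # stand-in estimation model
        return 3.0 * x[0] + 0.01 * x[1]

    x0 = np.array([1.0, 1.0])
    for i in range(len(x0)):
        x = x0.copy()
        x[i] += 0.1                     # perturb one input at a time
        print(f"input {i}: output change {model(x) - model(x0):+.4f}")
    # Large output changes mark significant inputs; tiny ones are pruning candidates.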
Normalizing data
• Not “transform to a normal distribution”
• For models that perform better with normalized inputs
– Non-parametric algorithms implicitly assume that distances in different directions carry the same weight (e.g. k-nearest neighbor, “KNN”)
– Backpropagation (BP) and multi-layer perceptron (MLP) models often perform better if all inputs and outputs are normalized
• Avoiding numerical problems
Types of normalization
• Min-max normalization
– Preserves all relationships of the data values exactly
– Would compress the normal range if extreme values or outliers exist
• Z-score normalization
• Sigmoidal normalization (all three are sketched below)
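All three in a minimal NumPy sketch; sigmoidal normalization is taken here as a logistic squash of the z-score (a common formulation, assumed rather than stated in the slides):

    import numpy as np

    x = np.array([2.0, 4.0, 6.0, 100.0])   # 100.0 is an extreme value

    # Min-max: preserves relationships exactly, but the outlier
    # compresses the normal range into a narrow band near 0
    min_max = (x - x.min()) / (x.max() - x.min())

    # Z-score: centre on the mean, scale by the standard deviation
    z_score = (x - x.mean()) / x.std()

    # Sigmoidal: squash the z-scores smoothly into (0, 1)
    sigmoidal = 1.0 / (1.0 + np.exp(-z_score))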
Other considerations
• Preprocess according to the characteristics of the specific classifier being used for modeling
– E.g. CHAID uses categorical data directly
• Input variables produce the best modeling accuracy when they exhibit a uniform or Gaussian distribution
• Add expert knowledge when preprocessing data
Get prepared and then go!
