INSY 662 – Fall 2023

Data Mining and Visualization

Week 1: Data Pre-processing


August 31, 2023
Elizabeth Han
Why Do We Preprocess Data?
▪ Raw data are often incomplete and noisy
▪ They usually contain:
– Obsolete fields
– Missing values
– Outliers
– Data in a form not suitable for data mining
– Erroneous values
– Irrelevant data

2
Data Pre-Processing
▪ Minimize GIGO (Garbage In, Garbage Out)
– IF garbage input minimized
THEN garbage outputs minimized

▪ For data mining purposes, raw data must undergo data cleaning and data transformation
▪ Data preparation is ~70% of the effort in the data mining process

3
Data Cleaning

▪ Inconsistent formatting or labeling
– Not all countries use the same zip code format
e.g., 90210 (U.S.) vs. J2S7K7 (Canada)
– Truncation of leading zeros in numeric fields
e.g., 6269 vs. 06269 (New England states)

4
Data Cleaning

▪ Missing data
– Pose problems for data analysis methods
– More common in massive datasets with a large number of fields
– Dropping is the naïve approach
▪ Drop columns with missing values
→ What if all columns contain missing values?
▪ Drop rows with missing values
→ What if values are not missing at random?

5
Data Cleaning

▪ Missing data
1. Replace with user-defined constant
2. Replace with mean, median, or mode
3. Replace with random values from underlying
distribution
4. Create a model to predict the values
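
The four options above can be sketched in Python with pandas and scikit-learn. This is a minimal illustration, not course code; the columns `age` and `income` and the constant -1 are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data with missing values (column names invented for illustration)
df = pd.DataFrame({"age": [23, 35, np.nan, 41, 29],
                   "income": [48000, np.nan, 61000, 75000, np.nan]})

# 1. Replace with a user-defined constant
df["income_const"] = df["income"].fillna(-1)

# 2. Replace with the mean (median or mode work the same way)
df["income_mean"] = df["income"].fillna(df["income"].mean())

# 3. Replace with random draws from the observed (empirical) distribution
observed = df["income"].dropna()
mask = df["income"].isna()
df["income_rand"] = df["income"]
df.loc[mask, "income_rand"] = np.random.choice(observed, size=mask.sum())

# 4. Predict the missing values from other fields (here, a simple regression on age)
known = df.dropna(subset=["age", "income"])
model = LinearRegression().fit(known[["age"]], known["income"])
to_fill = df["income"].isna() & df["age"].notna()
df["income_pred"] = df["income"]
df.loc[to_fill, "income_pred"] = model.predict(df.loc[to_fill, ["age"]])
```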

6
Data Cleaning

▪ Outliers

– Should we always remove all outliers?

7
Data Cleaning

▪ Create an index field
– To track the sort order of the records in the database
– Data mining data gets partitioned at least once (and sometimes several times)
– It is helpful to have an index field so that the original sort order may be recreated
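
A minimal pandas sketch of this idea, assuming the data sit in a DataFrame; the column name `record_id` is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 30, 20]})   # stand-in data

# Add an index field capturing the original sort order
df["record_id"] = range(len(df))

# ... later, after shuffling, partitioning, or sorting ...
df = df.sample(frac=1, random_state=42)      # simulate a reshuffle

# The original sort order can be recreated at any time
df = df.sort_values("record_id")
```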

8
Data Cleaning

▪ Remove unary (or nearly unary) variables
– Variables that take on only a single value
– Sometimes a variable can be very nearly unary
e.g., Suppose that 99.95% of the players in a field hockey league are female, with the remaining 0.05% male
– While it may be useful to investigate the male players, some algorithms will tend to treat the variable as essentially unary
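
One way to flag unary or nearly unary variables with pandas; the 0.995 cutoff is an illustrative threshold, not a fixed rule.

```python
import pandas as pd

def nearly_unary_columns(df: pd.DataFrame, threshold: float = 0.995):
    """Return columns whose most frequent value covers at least `threshold` of the rows."""
    flagged = []
    for col in df.columns:
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged

# Example: 'sex' is nearly unary (99.95% female), 'goals' is not
players = pd.DataFrame({"sex": ["F"] * 1999 + ["M"],
                        "goals": list(range(2000))})
print(nearly_unary_columns(players))   # ['sex']
```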

9
Data Cleaning

▪ Removing variables with ≥90% missing values
– But should we always remove them?
e.g., the variable "donation" from survey data
– If most people do not donate, the data will contain many missing values

▪ Recommendation
– Create a dummy variable
(1 = record w/o missing value; 0 = record w/ missing value)
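
A sketch of the recommendation, assuming a pandas DataFrame with a `donation` column as in the example; the values are invented.

```python
import numpy as np
import pandas as pd

survey = pd.DataFrame({"donation": [50, np.nan, np.nan, 20, np.nan]})

# 1 = record without a missing value, 0 = record with a missing value
survey["donation_flag"] = survey["donation"].notna().astype(int)
```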

10
Data Cleaning

▪ Removing strongly correlated variables
– In statistics, they lead to the issue of multicollinearity
– In data mining and predictive analytics, they may cause double-counting of a particular aspect of the analysis, and at worst lead to instability of the model results

▪ Recommendation
– Remove the variables from the model
– Apply dimension reduction techniques, such as principal components analysis (PCA)
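
Both recommendations can be sketched with pandas and scikit-learn; the 0.9 correlation cutoff and the choice of two components are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({"x1": x1,
                   "x2": 0.98 * x1 + rng.normal(scale=0.05, size=200),  # strongly correlated with x1
                   "x3": rng.normal(size=200)})

# Option 1: drop one variable from each highly correlated pair (|r| > 0.9 here)
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = df.drop(columns=to_drop)

# Option 2: replace the original variables with principal components
scores = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(df))
```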
11
Data Cleaning

▪ Removing duplicates
– May occur after merging datasets
– Lead to an overweighting of the data values in
those records

But are they really duplicates?

▪ Recommendation
– Weigh the likelihood that the duplicates truly
represent different records against the likelihood
that the duplicates are indeed just duplicated
records
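
A pandas sketch: inspect the candidate duplicates before dropping anything, since they may represent genuinely different records.

```python
import pandas as pd

merged = pd.DataFrame({"name": ["Ann", "Ann", "Bob"],
                       "city": ["Montreal", "Montreal", "Toronto"]})

# Inspect the candidate duplicates before dropping anything
print(merged[merged.duplicated(keep=False)])

# Drop rows only once they are judged to be true duplicates
deduped = merged.drop_duplicates()
```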

12
Data Transformation

▪ Adjust the scale of variables
– Variables tend to have different ranges
e.g., two fields in a baseball player data set:
– Batting average: [0.0, 0.400]
– Number of home runs: [0, 70]
– Will influence the prediction process of some data mining algorithms
– By standardizing numeric field values, we can ensure that the impact of variables on the model is similar
13
Data Transformation

▪ Adjust the scale of variables
1. Min-max scaling
– Results in [0, 1]
– Sensitive to extreme values

X_mm = (X − min(X)) / (max(X) − min(X))
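
A minimal sketch of min-max scaling, by hand and with scikit-learn's MinMaxScaler; the batting-average values are invented.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

X = pd.DataFrame({"batting_avg": [0.210, 0.275, 0.310, 0.400]})  # invented values

# By hand: (X - min(X)) / (max(X) - min(X))
X["mm_manual"] = (X["batting_avg"] - X["batting_avg"].min()) / (
    X["batting_avg"].max() - X["batting_avg"].min()
)

# Same result with scikit-learn
X["mm_sklearn"] = MinMaxScaler().fit_transform(X[["batting_avg"]]).ravel()
```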

14
Data Transformation

▪ Adjust the scale of variables
2. Decimal scaling
– Reduce the magnitude using a factor of 10
– Results in [−1, 1]

X_ds = X / 10^d

where d represents the number of digits in the data value with the largest absolute value
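
A sketch of decimal scaling; there is no built-in for it in pandas or scikit-learn, so it is computed by hand here on invented values.

```python
import numpy as np
import pandas as pd

x = pd.Series([6, -42, 315, 70])  # invented values

# d = number of digits of the value with the largest absolute value (315 -> d = 3)
d = len(str(int(np.abs(x).max())))

x_ds = x / 10 ** d  # all scaled values now fall within [-1, 1]
```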

15
Data Transformation

▪ Adjust the scale of variables
3. Z-score standardization
– Rescales values to mean = 0 and SD = 1 (as in a standard normal distribution)

X_zs = (X − mean(X)) / SD(X)
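
A sketch of z-score standardization, by hand and with scikit-learn's StandardScaler; note that pandas' .std() uses the sample SD (ddof = 1) while StandardScaler uses the population SD, so the two versions differ slightly.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({"home_runs": [5, 12, 30, 48, 70]})  # invented values

# By hand: (X - mean(X)) / SD(X); pandas .std() uses the sample SD (ddof = 1)
X["zs_manual"] = (X["home_runs"] - X["home_runs"].mean()) / X["home_runs"].std()

# scikit-learn uses the population SD (ddof = 0), so the values differ slightly
X["zs_sklearn"] = StandardScaler().fit_transform(X[["home_runs"]]).ravel()
```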

16
Data Transformation

▪ Adjust the scale of variables
4. Log transformation
– To account for skewness
– ln(x); √x; 1/√x

Skewness(X) = 3 × (mean(X) − median(X)) / SD(X)
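
A sketch that computes the skewness statistic above and applies the candidate transformations; the toy values are invented and assumed positive (required for ln and square-root transforms).

```python
import numpy as np
import pandas as pd

x = pd.Series([1, 2, 2, 3, 4, 5, 8, 15, 40, 120], dtype=float)  # right-skewed toy data

# Skewness(X) = 3 * (mean(X) - median(X)) / SD(X)
skewness = 3 * (x.mean() - x.median()) / x.std()

# Candidate transformations to reduce right skew
x_log = np.log(x)
x_sqrt = np.sqrt(x)
x_inv_sqrt = 1 / np.sqrt(x)
```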

17
Data Transformation

▪ Dummy variables (a.k.a. flag or indicator)
– A variable taking only the values 0 or 1
– Create k−1 dummies for a categorical predictor with k possible values, and use the unassigned category as the reference category

e.g., For a variable "region": {north, east, south, west}, the dummy variables will be:
– dummy_north = 1 if region = north, 0 otherwise
– dummy_east = 1 if region = east, 0 otherwise
– dummy_south = 1 if region = south, 0 otherwise
(west serves as the reference category)
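
A pandas sketch using get_dummies with drop_first=True to keep k−1 dummies; note that pandas drops the first category in alphabetical order ("east" here), so the reference category may differ from the slide's example.

```python
import pandas as pd

df = pd.DataFrame({"region": ["north", "east", "south", "west", "north"]})

# k - 1 dummies; pandas drops the first category alphabetically ('east') as the reference
dummies = pd.get_dummies(df["region"], prefix="dummy", drop_first=True)
df = pd.concat([df, dummies], axis=1)
```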

18
Data Transformation
▪ Binning of numeric variables
– Partitioning numeric values into bins
– Equal width binning: create k categories with
equal width
– Equal frequency binning: create k categories,
each with the same number of records
– Binning by clustering: use clustering algorithm

e.g., X = {1,1,1,1,1,2,2,11,11,12,12,44} & k = 3
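
The three binning strategies sketched in Python for the example X and k = 3; using k-means is one possible clustering choice, not the only one.

```python
import pandas as pd
from sklearn.cluster import KMeans

x = pd.Series([1, 1, 1, 1, 1, 2, 2, 11, 11, 12, 12, 44])
k = 3

# Equal-width binning: k intervals of equal width over [min(x), max(x)]
equal_width = pd.cut(x, bins=k, labels=False)

# Equal-frequency binning: k bins with roughly the same number of records
# (duplicates="drop" handles repeated quantile edges in this small example)
equal_freq = pd.qcut(x, q=k, labels=False, duplicates="drop")

# Binning by clustering: let a clustering algorithm (here k-means) group the values
by_cluster = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(x.to_frame())
```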

19
Data Transformation

▪ Transforming categorical to numerical
– Most of the time, this should be avoided
– Except when the categorical variable is clearly ordered
e.g., a variable "survey_response" with values such as "never", "sometimes", "usually", "always"
– Should "never" be "0" rather than "1"? Is "always" closer to "usually" than "usually" is to "sometimes"?
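
When the ordering genuinely is clear, an explicit mapping keeps the coding decision visible; the numeric codes below are an assumption, which is precisely the judgment call the questions above point to.

```python
import pandas as pd

df = pd.DataFrame({"survey_response": ["never", "sometimes", "usually", "always", "usually"]})

# Explicit, documented ordering; the spacing between codes is a modelling assumption
order = {"never": 0, "sometimes": 1, "usually": 2, "always": 3}
df["survey_response_num"] = df["survey_response"].map(order)
```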
20
Data Transformation

▪ Reclassifying categorical variables
– Sometimes, there may be too many categories
e.g., the 50 states in the U.S.

▪ Recommendation
– Reclassify as a variable “region” with five field values
{Northeast, Southeast, North Central, Southwest,
West}
– Reclassify as a variable “economic_level” with three
field values
{the richer states, the midrange states, the poorer
states}
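
A sketch of such a reclassification via an explicit mapping; only a few states are shown and the region assignments are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"state": ["NY", "GA", "OH", "TX", "CA"]})

# Collapse many categories into a handful (partial, illustrative mapping)
state_to_region = {"NY": "Northeast", "GA": "Southeast", "OH": "North Central",
                   "TX": "Southwest", "CA": "West"}
df["region"] = df["state"].map(state_to_region)
```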
21
