INSY 662 – Fall 2023

Data Mining and Visualization

Week 1: Data Pre-processing


August 31, 2023
Elizabeth Han
Why Do We Preprocess Data?
▪ Raw data are often incomplete and noisy
▪ They usually contain:
– Obsolete fields
– Missing values
– Outliers
– Data in a form not suitable for data mining
– Erroneous values
– Irrelevant data

2
Data Pre-Processing
▪ Minimize GIGO (Garbage In, Garbage Out)
– IF garbage input minimized
THEN garbage outputs minimized

▪ For data mining purposes, raw data must undergo data cleaning and data transformation
▪ Data preparation is ~70% of the effort in the data mining process

3
Data Cleaning

▪ Inconsistent formatting or labeling
– Not all countries use the same zip code format
e.g., 90210 (U.S.) vs. J2S7K7 (Canada)
– Truncation of leading zeros in numeric fields
e.g., 6269 vs. 06269 (New England states)

4
Data Cleaning

▪ Missing data
– Pose problems for data analysis methods
– More common in massive datasets with a large number of fields
– Dropping is the naïve approach
▪ Drop columns with missing values
→ What if all columns contain missing values?
▪ Drop rows with missing values
→ What if values are not missing at random?

5
Data Cleaning

▪ Missing data
1. Replace with user-defined constant
2. Replace with mean, median, or mode
3. Replace with random values from underlying
distribution
4. Create a model to predict the values
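
The four options above can be sketched in Python with pandas and scikit-learn. This is a minimal illustration, not course code; the columns `age` and `income` and the constant -1 are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data with missing values (column names invented for illustration)
df = pd.DataFrame({"age": [23, 35, np.nan, 41, 29],
                   "income": [48000, np.nan, 61000, 75000, np.nan]})

# 1. Replace with a user-defined constant
df["income_const"] = df["income"].fillna(-1)

# 2. Replace with the mean (median or mode work the same way)
df["income_mean"] = df["income"].fillna(df["income"].mean())

# 3. Replace with random draws from the observed (empirical) distribution
observed = df["income"].dropna()
mask = df["income"].isna()
df["income_rand"] = df["income"]
df.loc[mask, "income_rand"] = np.random.choice(observed, size=mask.sum())

# 4. Predict the missing values from other fields (here, a simple regression on age)
known = df.dropna(subset=["age", "income"])
model = LinearRegression().fit(known[["age"]], known["income"])
to_fill = df["income"].isna() & df["age"].notna()
df["income_pred"] = df["income"]
df.loc[to_fill, "income_pred"] = model.predict(df.loc[to_fill, ["age"]])
```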

6
Data Cleaning

▪ Outliers

– Should we always remove all outliers?

7
Data Cleaning

▪ Create an index field
– To track the sort order of the records in the database
– Data mining data gets partitioned at least once (and sometimes several times)
– It is helpful to have an index field so that the original sort order may be recreated
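
A minimal pandas sketch of this idea, assuming the data sit in a DataFrame; the column name `record_id` is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 30, 20]})   # stand-in data

# Add an index field capturing the original sort order
df["record_id"] = range(len(df))

# ... later, after shuffling, partitioning, or sorting ...
df = df.sample(frac=1, random_state=42)      # simulate a reshuffle

# The original sort order can be recreated at any time
df = df.sort_values("record_id")
```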

8
Data Cleaning

▪ Remove unary (or nearly unary) variables
– Variables that take on only a single value
– Sometimes a variable can be very nearly unary
e.g., Suppose that 99.95% of the players in a field hockey league are female, with the remaining 0.05% male
– While it may be useful to investigate the male players, some algorithms will tend to treat the variable as essentially unary
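
One way to flag unary or nearly unary variables with pandas; the 0.995 cutoff is an illustrative threshold, not a fixed rule.

```python
import pandas as pd

def nearly_unary_columns(df: pd.DataFrame, threshold: float = 0.995):
    """Return columns whose most frequent value covers at least `threshold` of the rows."""
    flagged = []
    for col in df.columns:
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged

# Example: 'sex' is nearly unary (99.95% female), 'goals' is not
players = pd.DataFrame({"sex": ["F"] * 1999 + ["M"],
                        "goals": list(range(2000))})
print(nearly_unary_columns(players))   # ['sex']
```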

9
Data Cleaning

▪ Removing variables with ≥90% missing values
– But should we always remove them?
e.g., the variable "donation" from survey data
– If most people do not donate, the data will contain many missing values

▪ Recommendation
– Create a dummy variable
(1 = record w/o missing value; 0 = record w/ missing value)
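
A sketch of the recommendation, assuming a pandas DataFrame with a `donation` column as in the example; the values are invented.

```python
import numpy as np
import pandas as pd

survey = pd.DataFrame({"donation": [50, np.nan, np.nan, 20, np.nan]})

# 1 = record without a missing value, 0 = record with a missing value
survey["donation_flag"] = survey["donation"].notna().astype(int)
```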

10
Data Cleaning

▪ Removing strongly correlated variables
– In statistics, they lead to the issue of multicollinearity
– In data mining and predictive analytics, they may cause double-counting of a particular aspect of the analysis, and at worst lead to instability of the model results

▪ Recommendation
– Remove the variables from the model
– Apply dimension reduction techniques, such as principal components analysis (PCA)
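
Both recommendations can be sketched with pandas and scikit-learn; the 0.9 correlation cutoff and the choice of two components are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({"x1": x1,
                   "x2": 0.98 * x1 + rng.normal(scale=0.05, size=200),  # strongly correlated with x1
                   "x3": rng.normal(size=200)})

# Option 1: drop one variable from each highly correlated pair (|r| > 0.9 here)
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = df.drop(columns=to_drop)

# Option 2: replace the original variables with principal components
scores = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(df))
```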
11
Data Cleaning

▪ Removing duplicates
– May occur after merging datasets
– Lead to an overweighting of the data values in
those records

But are they really duplicates?

▪ Recommendation
– Weigh the likelihood that the duplicates truly
represent different records against the likelihood
that the duplicates are indeed just duplicated
records
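
A pandas sketch: inspect the candidate duplicates before dropping anything, since they may represent genuinely different records.

```python
import pandas as pd

merged = pd.DataFrame({"name": ["Ann", "Ann", "Bob"],
                       "city": ["Montreal", "Montreal", "Toronto"]})

# Inspect the candidate duplicates before dropping anything
print(merged[merged.duplicated(keep=False)])

# Drop rows only once they are judged to be true duplicates
deduped = merged.drop_duplicates()
```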

12
Data Transformation

▪ Adjust the scale of variables
– Variables tend to have different ranges
e.g., two fields in a baseball player data set:
– Batting average: [0.0, 0.400]
– Number of home runs: [0, 70]
– Will influence the prediction process of some data mining algorithms
– By standardizing numeric field values, we can ensure that the impact of variables on the model is similar
13
Data Transformation

▪ Adjust the scale of variables
1. Min-max scaling
– Results in [0, 1]
– Sensitive to extreme values

X_mm = (X − min(X)) / (max(X) − min(X))
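
A minimal sketch of min-max scaling, by hand and with scikit-learn's MinMaxScaler; the batting-average values are invented.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

X = pd.DataFrame({"batting_avg": [0.210, 0.275, 0.310, 0.400]})  # invented values

# By hand: (X - min(X)) / (max(X) - min(X))
X["mm_manual"] = (X["batting_avg"] - X["batting_avg"].min()) / (
    X["batting_avg"].max() - X["batting_avg"].min()
)

# Same result with scikit-learn
X["mm_sklearn"] = MinMaxScaler().fit_transform(X[["batting_avg"]]).ravel()
```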

14
Data Transformation

▪ Adjust the scale of variables
2. Decimal scaling
– Reduce the magnitude using a factor of 10
– Results in [−1, 1]

X_ds = X / 10^d

where d represents the number of digits in the data value with the largest absolute value
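
A sketch of decimal scaling; there is no built-in for it in pandas or scikit-learn, so it is computed by hand here on invented values.

```python
import numpy as np
import pandas as pd

x = pd.Series([6, -42, 315, 70])  # invented values

# d = number of digits of the value with the largest absolute value (315 -> d = 3)
d = len(str(int(np.abs(x).max())))

x_ds = x / 10 ** d  # all scaled values now fall within [-1, 1]
```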

15
Data Transformation

▪ Adjust the scale of variables
3. Z-score standardization
– Rescales values to mean = 0 and SD = 1 (as in a standard normal distribution)

X_zs = (X − mean(X)) / SD(X)
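
A sketch of z-score standardization, by hand and with scikit-learn's StandardScaler; note that pandas' .std() uses the sample SD (ddof = 1) while StandardScaler uses the population SD, so the two versions differ slightly.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({"home_runs": [5, 12, 30, 48, 70]})  # invented values

# By hand: (X - mean(X)) / SD(X); pandas .std() uses the sample SD (ddof = 1)
X["zs_manual"] = (X["home_runs"] - X["home_runs"].mean()) / X["home_runs"].std()

# scikit-learn uses the population SD (ddof = 0), so the values differ slightly
X["zs_sklearn"] = StandardScaler().fit_transform(X[["home_runs"]]).ravel()
```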

16
Data Transformation

▪ Adjust the scale of variables
4. Log transformation
– To account for skewness
– ln(x); √x; 1/√x

Skewness(X) = 3 × (mean(X) − median(X)) / SD(X)
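
A sketch that computes the skewness statistic above and applies the candidate transformations; the toy values are invented and assumed positive (required for ln and square-root transforms).

```python
import numpy as np
import pandas as pd

x = pd.Series([1, 2, 2, 3, 4, 5, 8, 15, 40, 120], dtype=float)  # right-skewed toy data

# Skewness(X) = 3 * (mean(X) - median(X)) / SD(X)
skewness = 3 * (x.mean() - x.median()) / x.std()

# Candidate transformations to reduce right skew
x_log = np.log(x)
x_sqrt = np.sqrt(x)
x_inv_sqrt = 1 / np.sqrt(x)
```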

17
Data Transformation

▪ Dummy variables (a.k.a. flag or indicator)
– A variable taking only the values 0 or 1
– Create k−1 dummies for a categorical predictor with k possible values, and use the unassigned category as the reference category

e.g., For a variable "region": {north, east, south, west}, the dummy variables will be:
– dummy_north = 1 if region = north, 0 otherwise
– dummy_east = 1 if region = east, 0 otherwise
– dummy_south = 1 if region = south, 0 otherwise
(west serves as the reference category)
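
A pandas sketch using get_dummies with drop_first=True to keep k−1 dummies; note that pandas drops the first category in alphabetical order ("east" here), so the reference category may differ from the slide's example.

```python
import pandas as pd

df = pd.DataFrame({"region": ["north", "east", "south", "west", "north"]})

# k - 1 dummies; pandas drops the first category alphabetically ('east') as the reference
dummies = pd.get_dummies(df["region"], prefix="dummy", drop_first=True)
df = pd.concat([df, dummies], axis=1)
```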

18
Data Transformation
▪ Binning of numeric variables
– Partitioning numeric values into bins
– Equal width binning: create k categories with
equal width
– Equal frequency binning: create k categories,
each with the same number of records
– Binning by clustering: use clustering algorithm

e.g., X = {1,1,1,1,1,2,2,11,11,12,12,44} & k = 3
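
The three binning strategies sketched in Python for the example X and k = 3; using k-means is one possible clustering choice, not the only one.

```python
import pandas as pd
from sklearn.cluster import KMeans

x = pd.Series([1, 1, 1, 1, 1, 2, 2, 11, 11, 12, 12, 44])
k = 3

# Equal-width binning: k intervals of equal width over [min(x), max(x)]
equal_width = pd.cut(x, bins=k, labels=False)

# Equal-frequency binning: k bins with roughly the same number of records
# (duplicates="drop" handles repeated quantile edges in this small example)
equal_freq = pd.qcut(x, q=k, labels=False, duplicates="drop")

# Binning by clustering: let a clustering algorithm (here k-means) group the values
by_cluster = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(x.to_frame())
```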

19
Data Transformation

▪ Transforming categorical to numerical
– Most of the time, this should be avoided
– Except when the categorical variable is clearly ordered
e.g., a variable "survey_response" with values such as "never", "sometimes", "usually", "always"
– Should "never" be "0" rather than "1"? Is "always" closer to "usually" than "usually" is to "sometimes"?
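
When the ordering genuinely is clear, an explicit mapping keeps the coding decision visible; the numeric codes below are an assumption, which is precisely the judgment call the questions above point to.

```python
import pandas as pd

df = pd.DataFrame({"survey_response": ["never", "sometimes", "usually", "always", "usually"]})

# Explicit, documented ordering; the spacing between codes is a modelling assumption
order = {"never": 0, "sometimes": 1, "usually": 2, "always": 3}
df["survey_response_num"] = df["survey_response"].map(order)
```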
20
Data Transformation

▪ Reclassifying categorical variables
– Sometimes, there may be too many categories
e.g., the 50 states in the U.S.

▪ Recommendation
– Reclassify as a variable “region” with five field values
{Northeast, Southeast, North Central, Southwest,
West}
– Reclassify as a variable “economic_level” with three
field values
{the richer states, the midrange states, the poorer
states}
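
A sketch of such a reclassification via an explicit mapping; only a few states are shown and the region assignments are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"state": ["NY", "GA", "OH", "TX", "CA"]})

# Collapse many categories into a handful (partial, illustrative mapping)
state_to_region = {"NY": "Northeast", "GA": "Southeast", "OH": "North Central",
                   "TX": "Southwest", "CA": "West"}
df["region"] = df["state"].map(state_to_region)
```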
21
