Insy662 - f23 - Week 1
Data Pre-Processing
▪ Minimize GIGO (Garbage In, Garbage Out)
– IF garbage inputs are minimized,
THEN garbage outputs are minimized
Data Cleaning
▪ Missing data
– Poses problems for data analysis methods
– More common in massive datasets with a large
number of fields
– Dropping is the naïve approach
▪ Drop columns with missing values
→ What if all columns contain missing values?
▪ Drop rows with missing values
→ What if the values are not missing at random?
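A minimal sketch of the two dropping strategies above, assuming a small hypothetical table stored as a list of dictionaries in which `None` marks a missing value. The field names and values are invented for illustration.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "income": 52000, "city": "Montreal"},
    {"age": None, "income": 61000, "city": "Toronto"},
    {"age": 45, "income": None, "city": "Ottawa"},
    {"age": 29, "income": 48000, "city": "Quebec"},
]

def drop_rows_with_missing(rows):
    """Keep only rows in which every field is present."""
    return [row for row in rows if all(v is not None for v in row.values())]

def drop_columns_with_missing(rows):
    """Keep only columns that have no missing value in any row."""
    complete = [k for k in rows[0] if all(row[k] is not None for row in rows)]
    return [{k: row[k] for k in complete} for row in rows]

rows_kept = drop_rows_with_missing(records)
cols_kept = drop_columns_with_missing(records)
print(len(rows_kept))      # number of complete rows
print(list(cols_kept[0]))  # names of complete columns
```

In this toy table, dropping rows discards half of the records and dropping columns discards two of the three fields, which is why dropping is only a starting point.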
▪ Missing data: replacement options
1. Replace with user-defined constant
2. Replace with mean, median, or mode
3. Replace with random values from underlying
distribution
4. Create a model to predict the values
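Options 1 to 3 can be sketched as follows, assuming a hypothetical numeric field in which `None` marks a missing value. For option 3, resampling from the observed values stands in for the "underlying distribution"; that is an assumption of this sketch. Option 4 is omitted because it depends on the model chosen.

```python
import random
import statistics

# Hypothetical field with missing values marked as None.
values = [12.0, None, 15.0, 15.0, None, 20.0, 18.0]
observed = [v for v in values if v is not None]

def fill(data, replacement):
    """Replace each None with a fixed value, or with replacement() if it is callable."""
    filled = []
    for v in data:
        if v is not None:
            filled.append(v)
        elif callable(replacement):
            filled.append(replacement())
        else:
            filled.append(replacement)
    return filled

# 1. User-defined constant
with_constant = fill(values, 0.0)

# 2. Mean, median, or mode of the observed values
with_mean = fill(values, statistics.mean(observed))
with_median = fill(values, statistics.median(observed))
with_mode = fill(values, statistics.mode(observed))

# 3. Random draws from the observed values (assumed stand-in for the distribution)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
with_random = fill(values, lambda: rng.choice(observed))
```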
▪ Outliers
▪ Recommendation
– Create a dummy variable
(1 = record without a missing value; 0 = record with a
missing value)
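A minimal sketch of such a dummy variable, assuming a hypothetical field `income` in which `None` marks a missing value:

```python
# Hypothetical field with missing values marked as None.
income = [52000, None, 61000, None, 48000]

# 1 = record without a missing value; 0 = record with a missing value
income_present = [0 if v is None else 1 for v in income]
print(income_present)  # [1, 0, 1, 0, 1]
```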
▪ Recommendation
– Remove the variables from the model
– Apply dimension reduction techniques, such as
principal components analysis (PCA)
▪ Removing duplicates
– Duplicates may occur after merging datasets
– They lead to an overweighting of the data values in
those records
▪ Recommendation
– Weigh the likelihood that the duplicates truly
represent different records against the likelihood
that the duplicates are indeed just duplicated
records
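One way to support that judgment is to list the repeated records for review before removing anything. The sketch below assumes a hypothetical merged dataset in which each record is a tuple of field values.

```python
from collections import Counter

# Hypothetical merged dataset; each record is a tuple of field values.
merged = [
    ("Ann", "Lee", "1990-04-12"),
    ("Bob", "Roy", "1985-11-03"),
    ("Ann", "Lee", "1990-04-12"),
]

counts = Counter(merged)
# Records that appear more than once are candidates for review, not automatic removal.
duplicate_candidates = [rec for rec, n in counts.items() if n > 1]

# If they are judged to be true duplicates, keep the first occurrence of each record.
deduplicated = list(dict.fromkeys(merged))
```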
Data Transformation
$$X_{mm} = \frac{X - \min(X)}{\max(X) - \min(X)}$$
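This formula is commonly called min-max normalization; it rescales a variable to the range [0, 1]. A minimal sketch, with hypothetical input values:

```python
def min_max_normalize(xs):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        raise ValueError("min-max normalization is undefined when all values are equal")
    return [(x - lo) / (hi - lo) for x in xs]

x = [10, 20, 25, 40]
x_mm = min_max_normalize(x)
print(x_mm)  # the minimum maps to 0.0 and the maximum to 1.0
```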
$$X_{ds} = \frac{X}{10^{d}}$$
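This is decimal scaling. The sketch below assumes that d is the number of digits in the value with the largest absolute value, which is the usual definition but is not stated above. The input values are hypothetical.

```python
def decimal_scale(xs):
    """Divide by 10**d, where d is the digit count of the largest absolute value."""
    largest = max(abs(x) for x in xs)
    d = len(str(int(largest)))  # digits in the integer part (assumed definition of d)
    return [x / 10 ** d for x in xs]

x = [-250, 47, 986]
x_ds = decimal_scale(x)
print(x_ds)  # every scaled value lies strictly between -1 and 1
```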
$$X_{zs} = \frac{X - \operatorname{mean}(X)}{\operatorname{SD}(X)}$$
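This is z-score standardization. The sketch below assumes the sample standard deviation (n - 1 denominator), since the formula does not say which version of SD is meant. The input values are hypothetical.

```python
import statistics

def z_score_standardize(xs):
    """Center on the mean and scale by the standard deviation."""
    mean = statistics.mean(xs)
    sd = statistics.stdev(xs)  # sample SD (n - 1 denominator); an assumption
    return [(x - mean) / sd for x in xs]

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
x_zs = z_score_standardize(x)
```

The standardized values have mean 0 and standard deviation 1, so variables measured in different units become comparable.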
$$\operatorname{Skewness}(X) = \frac{3\,(\operatorname{mean}(X) - \operatorname{median}(X))}{\operatorname{SD}(X)}$$
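A minimal sketch of this skewness statistic, again assuming the sample standard deviation. Both example lists are hypothetical.

```python
import statistics

def skewness(xs):
    """3 * (mean - median) / SD, using the sample SD (an assumption)."""
    return 3 * (statistics.mean(xs) - statistics.median(xs)) / statistics.stdev(xs)

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]  # one large value pulls the mean above the median
symmetric = [1, 2, 3, 4, 5]               # mean equals median

print(skewness(right_skewed))  # positive
print(skewness(symmetric))     # zero
```

A positive value indicates right skew (mean greater than median), and a negative value indicates left skew.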
▪ Binning of numeric variables
– Partitioning numeric values into bins
– Equal width binning: create k categories with
equal width
– Equal frequency binning: create k categories,
each with the same number of records
– Binning by clustering: use a clustering algorithm to
determine the bins
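The first two binning methods can be sketched as follows, with hypothetical data and k = 3. Bin labels are indices 0 to k - 1. Binning by clustering is omitted because it depends on the clustering algorithm chosen.

```python
def equal_width_bins(xs, k):
    """Assign each value a bin index 0..k-1 using k intervals of equal width."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / k
    if width == 0:
        return [0 for _ in xs]
    # The maximum value falls on the upper edge, so it is clamped into the last bin.
    return [min(int((x - lo) / width), k - 1) for x in xs]

def equal_frequency_bins(xs, k):
    """Assign bin indices so that each bin holds about len(xs) / k records."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    bins = [0] * len(xs)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(xs)
    return bins

x = [1, 1, 1, 1, 1, 2, 2, 11, 11, 12, 12, 44]
width_bins = equal_width_bins(x, 3)
freq_bins = equal_frequency_bins(x, 3)
print(width_bins)
print(freq_bins)
```

With these values, equal width binning puts eleven records in the first bin and leaves the middle bin empty, while equal frequency binning splits the five identical values of 1 across two bins. Both outcomes show how the choice of method shapes the resulting categories.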
▪ Recommendation
– Reclassify as a variable “region” with five field values:
{Northeast, Southeast, North Central, Southwest, West}
– Reclassify as a variable “economic_level” with three field values:
{the richer states, the midrange states, the poorer states}
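A minimal sketch of the first reclassification, assuming the variable being reclassified is a US state code. The mapping is partial and illustrative; the assignment of states to regions is an assumption, not a definition taken from the slides.

```python
# Illustrative, partial mapping; the assignment of states to regions is an assumption.
STATE_TO_REGION = {
    "NY": "Northeast",
    "FL": "Southeast",
    "IL": "North Central",
    "TX": "Southwest",
    "CA": "West",
}

def reclassify(states, mapping, default="Unknown"):
    """Map each state code to its coarser category."""
    return [mapping.get(s, default) for s in states]

state = ["CA", "NY", "TX", "CA", "FL"]
region = reclassify(state, STATE_TO_REGION)
print(region)
```

The same pattern applies to “economic_level”, with a mapping from each state to one of the three levels.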