
Data Preprocessing

[Figure: the knowledge discovery process. Stages shown: Databases, Data Cleaning, Data Integration, Data Warehouse, Selection and Transformation, Task-relevant Data, Data Mining, Pattern Evaluation.]
January 21, 2024 Data Mining: Data Preprocessing 1
Data Preprocessing

 Why preprocess the data?


 Data cleaning
 Data integration and transformation
 Data reduction and selection
 Discretization
 Hierarchy generation
Why Data Preprocessing?
 Data in the real world is dirty
 incomplete: lacking attribute
values, lacking certain attributes of
interest, or containing only
aggregate data
 noisy: containing errors or outliers
 inconsistent: containing
discrepancies in codes or names
Dirty Data

 No quality data, no quality mining results!


- Quality decisions must be based on quality data
- Data mining needs consistent integration of quality data
Multi-Dimensional Measure of Data Quality

 A well-accepted multidimensional measure:


 Accuracy
 Completeness
 Consistency
 Timeliness
 Believability
 Value added
 Interpretability
 Accessibility
 Broad categories:
 Intrinsic, Contextual, Representational.
Forms of data preprocessing



Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data transformation
 Normalization and aggregation
 Data reduction
 Obtains reduced representation in volume but produces
the same or similar analytical results
 Data discretization
 Part of data reduction but with particular importance,
especially for numerical data
Data Cleaning

 Data cleaning tasks


 Fill in missing values
 Identify outliers and smooth out noisy data
 Correct inconsistent data



Missing Data
 Data is not always available
 E.g., many tuples have no recorded value for
several attributes, such as customer income in
sales data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus
deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the
time of entry
How to Handle Missing Data?
 Ignore the tuple: usually done when class label is missing
 Fill in the missing value manually: tedious and often infeasible
 Use a global constant to fill in the missing value: e.g., “unknown” (which may effectively become a new class)
 Replace the missing value with a value drawn at random from the observed distribution of the variable
 Use the attribute mean to fill in the missing value
 Use the attribute mean for all samples belonging to the same
class to fill in the missing value: smarter
 Use the most probable value to fill in the missing value:
inference-based such as Bayesian formula or decision tree
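
Three of the strategies above (global constant, attribute mean, and class-wise attribute mean) can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: the records, the field names `cls` and `income`, and the helper function names are all hypothetical, and the class-wise version assumes every class has at least one observed value.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sales tuples; None marks a missing income value.
records = [
    {"cls": "A", "income": 30.0},
    {"cls": "A", "income": None},
    {"cls": "A", "income": 50.0},
    {"cls": "B", "income": 80.0},
    {"cls": "B", "income": None},
    {"cls": "B", "income": 100.0},
]

def fill_with_constant(rows, key, constant="unknown"):
    """Replace every missing value with one global constant."""
    return [{**r, key: constant if r[key] is None else r[key]} for r in rows]

def fill_with_mean(rows, key):
    """Replace every missing value with the mean of the observed values."""
    overall = mean(r[key] for r in rows if r[key] is not None)
    return [{**r, key: overall if r[key] is None else r[key]} for r in rows]

def fill_with_class_mean(rows, key, class_key):
    """Replace a missing value with the mean of its own class."""
    groups = defaultdict(list)
    for r in rows:
        if r[key] is not None:
            groups[r[class_key]].append(r[key])
    means = {c: mean(vals) for c, vals in groups.items()}
    return [{**r, key: means[r[class_key]] if r[key] is None else r[key]}
            for r in rows]

by_constant = fill_with_constant(records, "income")
by_mean = fill_with_mean(records, "income")
by_class = fill_with_class_mean(records, "income", "cls")
```

On this toy data the overall mean fills both gaps with the same value, while the class-wise mean fills each gap with a value typical of its own class, which is why the slide calls it the smarter choice.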


Noisy Data
 Noise: random error or variance in a measured
variable
 Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitations
 inconsistency in naming conventions
 Other data problems that require data cleaning
 duplicate records
 incomplete data
 inconsistent data
How to Handle Noisy Data?
 Graphical Methods For Identifying Outliers



How to Handle Noisy Data?
 Binning method:
 first sort data and partition into (equal-width or
equal-depth) bins
 then one can smooth by bin mean, smooth by
bin median, smooth by bin boundaries, etc.
 Combined computer-human inspection:
 detect suspicious values automatically and have a human check them
 Clustering:
 detect and remove outliers
 Regression:
 smooth by fitting the data to regression functions



Binning
 Equal-width (distance) partitioning:
 It divides the range into N intervals of equal size:
uniform grid
 if A and B are the lowest and highest values of the
attribute, the width of intervals will be: W = (B-A)/N.
 The most straightforward approach
 But outliers may dominate the presentation
 Skewed data is not handled well
 Equal-depth (frequency) partitioning:
 It divides the range into N intervals, each containing
approximately same number of samples
 Good data scaling
 Managing categorical attributes can be tricky.

 E.g. Data for price (in dollars):


29, 8, 9, 15, 21, 24, 4, 25, 28, 21, 34, 26
Binning Methods
* Sorted data for price (in dollars):
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equal-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Partition into (equal-width) bins ???:
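
Both partitioning schemes can be sketched in a few lines of Python, including the equal-width case left open above. This is only a sketch under stated assumptions: N = 3 bins, equal-width intervals that are closed on the left with the maximum value placed in the last bin, and a non-constant attribute (so W is not zero). The function names are illustrative.

```python
prices = [29, 8, 9, 15, 21, 24, 4, 25, 28, 21, 34, 26]

def equal_depth_bins(values, n_bins):
    """Sort the values and split them into bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

def equal_width_bins(values, n_bins):
    """Split the range [A, B] into n_bins intervals of width W = (B - A) / N."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # Assumption: the maximum value goes into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        bins[idx].append(v)
    return bins

depth_bins = equal_depth_bins(prices, 3)
width_bins = equal_width_bins(prices, 3)
```

With A = 4, B = 34 and N = 3 the width is W = 10, so under these assumptions the equal-width bins cover [4, 14), [14, 24) and [24, 34] and hold 3, 3 and 6 values respectively, whereas the equal-depth bins hold 4 values each.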



Data Binning
The cut-off values, with the names of the bins, for the variables:
 Pregnant: low (1, 2), medium (3, 4, 5), high (> 6)
 Plasma–Glucose: low (< 90), medium (90–150), high (> 150)
 DiastolicBP: normal (< 80), normal-to-high (80–90), high (> 90)
 BMI: low (< 25), normal (25–30), obese (30–35), severely obese (> 35)
 DPF: low (< 0.4), medium (0.4–0.8), high (> 0.8)
 Age: 20–39, 40–59, 60 plus
 Diabetes Class: yes (1), no (0)
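
Cut-off binning of this kind amounts to a lookup against a sorted list of boundaries. The sketch below uses the BMI cut-offs as an example. The slide does not say which bin a value exactly on a boundary (25, 30 or 35) belongs to, so the code assumes it goes to the higher bin; the names `BMI_CUTS`, `BMI_LABELS` and `bin_value` are hypothetical.

```python
from bisect import bisect_right

# Cut-offs and bin names taken from the BMI row above.
BMI_CUTS = [25, 30, 35]
BMI_LABELS = ["low", "normal", "obese", "severely obese"]

def bin_value(value, cuts, labels):
    """Return the label of the interval that contains value.

    Assumption: a value equal to a cut-off falls into the higher bin.
    """
    return labels[bisect_right(cuts, value)]

labels = [bin_value(v, BMI_CUTS, BMI_LABELS) for v in (22.0, 27.5, 33.1, 41.0)]
```

The same function works for the other numeric variables by passing their own cut-offs and labels.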



Binning with Cluster Analysis



Data Smoothing
 Remove noise from data
 Example: Binning and Smoothing
 Partition into (equal-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
 Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
 Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
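
The two smoothing rules in this example can be reproduced with a short Python sketch. It assumes that bin means are rounded to the nearest integer (as the figures above suggest) and that a value equally close to both boundaries goes to the lower one; neither convention is stated on the slide.

```python
bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

def smooth_by_means(bins):
    """Replace every value in a bin with the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        # Assumption: ties go to the lower boundary.
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

The exact bin means are 9, 22.75 and 29.25, so the values 23 and 29 shown above are rounded.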
Smoothing with Regression
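[Figure: data points with the fitted regression line y = x + 1; horizontal axis X (Salary), vertical axis Y (Age). At X1, the observed value Y1 is smoothed to the value Y1’ on the line.]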



Simple Linear Regression
 A straight line can be described using the formula:

y = a + bx
 where a is the point of intersection with the y-axis and b is the
slope of the line.



Simple Linear Regression
Least Squares Method
The formulas for calculating the slope b and the intercept a are:
$$b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

$$a = \bar{y} - b\,\bar{x}$$

Where: $x_i$ and $y_i$ are the individual values of the descriptor variable and the response, $\bar{x}$ is the mean of the descriptor variable x, and $\bar{y}$ is the mean of the response variable y.
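
The least squares formulas translate directly into code. The sketch below is illustrative only: the sample points are hypothetical and chosen to lie exactly on the line y = x + 1, and the function assumes the x values are not all identical (otherwise the denominator is zero).

```python
def least_squares(xs, ys):
    """Return (a, b) for the line y = a + b*x fitted by least squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical points lying on y = x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 4.0, 5.0]
a, b = least_squares(xs, ys)

# Smoothing by regression: replace each observed y with the fitted value.
smoothed = [a + b * x for x in xs]
```

For noisy data the fitted values differ from the observations, and replacing each observation with its fitted value is what smoothing by regression means.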
Simple Linear Regression (Example)

Formula:
Monthly Sales = 23.2064 + 0.00259 * Income



Data Integration
 Data integration:
 combines data from multiple sources into a
coherent store
 Schema integration:
 integrate metadata from different sources
 Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
 Detecting and resolving data value conflicts:
 for the same real world entity, attribute values from
different sources are different
 possible reasons: different representations,
different scales, e.g., metric vs. British units
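
A small Python sketch can make the two problems concrete: matching the same entity under different key names, and reconciling values recorded in different units. Everything here is hypothetical, including the records, the `height_cm` and `height_in` fields, and the tolerance used to decide that two values disagree.

```python
CM_PER_INCH = 2.54

# Hypothetical sources: A keys customers by "cust-id" and stores centimetres,
# B keys them by "cust-#" and stores inches.
source_a = [{"cust-id": 1, "height_cm": 180.0}, {"cust-id": 2, "height_cm": 165.1}]
source_b = [{"cust-#": 1, "height_in": 70.0}, {"cust-#": 2, "height_in": 65.0}]

def integrate(a_rows, b_rows):
    """Collect all height values per customer, converted to centimetres."""
    merged = {}
    for row in a_rows:
        merged.setdefault(row["cust-id"], []).append(row["height_cm"])
    for row in b_rows:
        value_cm = round(row["height_in"] * CM_PER_INCH, 1)
        merged.setdefault(row["cust-#"], []).append(value_cm)
    return merged

def find_conflicts(merged, tolerance=0.5):
    """Return customers whose values still differ after unit conversion."""
    return {cid: vals for cid, vals in merged.items()
            if max(vals) - min(vals) > tolerance}

merged = integrate(source_a, source_b)
conflicts = find_conflicts(merged)
```

Customer 2 agrees across both sources once the units are aligned, while customer 1 remains a genuine data value conflict that has to be resolved some other way.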
Data Transformation
 Smoothing: remove noise from data
 Aggregation: summarization, data cube
construction
 Generalization: concept hierarchy
 Normalization: scaled to fall within a small,
specified range
 min-max normalization
 z-score normalization
 normalization by decimal scaling
 Attribute/feature construction
 New attributes constructed from the given ones


Data Transformation: Normalization

 min-max normalization

$$v' = \frac{v - \mathrm{min}_A}{\mathrm{max}_A - \mathrm{min}_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A$$

 z-score normalization

$$v' = \frac{v - \mathrm{mean}_A}{\mathrm{stand\_dev}_A}$$

 normalization by decimal scaling

$$v' = \frac{v}{10^{j}}$$

Where j is the smallest integer such that Max(|v'|) < 1
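
The three normalization formulas can be sketched as follows. The data values are hypothetical. The slide does not say whether the standard deviation is the population or the sample version, so the code assumes the population one (`statistics.pstdev`); it also restricts j to non-negative integers and assumes the attribute is not constant.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def z_score(values):
    """Centre on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest j >= 0 giving max(|v'|) < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

data = [200, 300, 400, 600, 1000]  # hypothetical attribute values
mm = min_max(data)
zs = z_score(data)
ds, j = decimal_scaling(data)
```

For this data the minimum maps to 0 and the maximum to 1 under min-max, the z-scores sum to zero, and decimal scaling needs j = 4 because 1000 / 10^3 = 1 is not strictly less than 1.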
Exercise:

 Normalize the data above using min-max, z-score, and decimal scaling.