DMiningKuliah 2A DPreparation

Data Preprocessing
[Figure: knowledge discovery process. Stages shown: Databases, Data Cleaning, Data Integration, Data Warehouse, Selection and Transformation, Task-relevant Data, Data Mining, Pattern Evaluation]
January 21, 2024 Data Mining: Data Preprocessing 1
Data Preprocessing
 Why preprocess the data?

 Data cleaning
 Data integration and transformation
 Data reduction and selection
 Discretization
 Hierarchy generation
Why Data Preprocessing?
 Data in the real world is dirty
 incomplete: lacking attribute
values, lacking certain attributes of
interest, or containing only
aggregate data
 noisy: containing errors or outliers
 inconsistent: containing
discrepancies in codes or names
Dirty Data
 No quality data, no quality mining results!

- Quality decisions must be based on quality data
- Data mining needs consistent integration of quality data
Multi-Dimensional Measure of Data Quality
 A well-accepted multidimensional Measure:

 Accuracy
 Completeness
 Consistency
 Timeliness
 Believability
 Value added
 Interpretability
 Accessibility
 Broad categories:
 Intrinsic, Contextual, Representational.
Forms of data preprocessing

Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data transformation
 Normalization and aggregation
 Data reduction
 Obtains reduced representation in volume but produces
the same or similar analytical results
 Data discretization
 Part of data reduction but with particular importance,
especially for numerical data
Data Cleaning
 Data cleaning tasks

 Fill in missing values
 Identify outliers and smooth out noisy data
 Correct inconsistent data

Missing Data
 Data is not always available
 E.g., many tuples have no recorded value for
several attributes, such as customer income in
sales data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus
deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the
time of entry
How to Handle Missing Data?
 Ignore the tuple: usually done when class label is missing
 Fill in the missing value manually: tedious + infeasible?
 Use a global constant to fill in the missing value: e.g.,
“unknown”, a new class?!
 Replace the missing values with a value generated at random
from the variable distribution observed.
 Use the attribute mean to fill in the missing value
 Use the attribute mean for all samples belonging to the same
class to fill in the missing value: smarter
 Use the most probable value to fill in the missing value:
inference-based such as Bayesian formula or decision tree
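
Two of the fill-in strategies above (attribute mean and per-class attribute mean) can be sketched in a few lines of Python. This is only an illustration: the records, the `income` and `label` field names, and the helper function names are hypothetical and do not come from the slides.

```python
from statistics import mean

# Hypothetical sales records; None marks a missing income value.
records = [
    {"income": 30.0, "label": "low"},
    {"income": None, "label": "low"},
    {"income": 50.0, "label": "low"},
    {"income": 90.0, "label": "high"},
    {"income": None, "label": "high"},
    {"income": 110.0, "label": "high"},
]


def fill_with_attribute_mean(rows, attr):
    """Replace missing values with the mean of all observed values."""
    observed = [r[attr] for r in rows if r[attr] is not None]
    fill = mean(observed)
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in rows]


def fill_with_class_mean(rows, attr, class_attr):
    """Replace missing values with the mean observed within the same class."""
    by_class = {}
    for r in rows:
        if r[attr] is not None:
            by_class.setdefault(r[class_attr], []).append(r[attr])
    class_means = {c: mean(vals) for c, vals in by_class.items()}
    return [
        {**r, attr: class_means[r[class_attr]] if r[attr] is None else r[attr]}
        for r in rows
    ]


global_filled = fill_with_attribute_mean(records, "income")
class_filled = fill_with_class_mean(records, "income", "label")
print([r["income"] for r in global_filled])
print([r["income"] for r in class_filled])
```

The sketch assumes every class has at least one observed value; a class with none would need a fallback such as the overall attribute mean.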
Noisy Data
 Noise: random error or variance in a measured
variable
 Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitation
 inconsistency in naming convention
 Other data problems that require data cleaning
 duplicate records
 incomplete data
 inconsistent data
How to Handle Noisy Data?
 Graphical Methods For Identifying Outliers

 Binning method:
 first sort data and partition into (equal-width or
equal-depth) bins
 then one can smooth by bin mean, smooth by
bin median, smooth by bin boundaries, etc.
 Combined computer-human inspection:
 detect suspicious values and check by human
 Clustering:
 detect and remove outliers
 Regression:
 smooth by fitting the data into regression
functions

Binning
 Equal-width (distance) partitioning:
 It divides the range into N intervals of equal size:
uniform grid
 if A and B are the lowest and highest values of the
attribute, the width of intervals will be: W = (B-A)/N.
 The most straightforward
 But outliers may dominate presentation
 Skewed data is not handled well.
 Equal-depth (frequency) partitioning:
 It divides the range into N intervals, each containing
approximately same number of samples
 Good data scaling
 Managing categorical attributes can be tricky.
 E.g. Data for price (in dollars):

29, 8, 9, 15, 21, 24, 4, 25, 28, 21, 34, 26
Binning Methods
* Sorted data for price (in dollars):
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equal-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Partition into (equal-width) bins ???:
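
The two partitioning schemes can be sketched in Python on the price data. The function names are hypothetical, and the equal-width version assumes half-open intervals with the maximum value placed in the last bin, a convention the slides do not specify. With N = 3 the width is W = (34 - 4) / 3 = 10.

```python
prices = [29, 8, 9, 15, 21, 24, 4, 25, 28, 21, 34, 26]


def equal_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    data = sorted(values)
    size, extra = divmod(len(data), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(data[start:end])
        start = end
    return bins


def equal_width_bins(values, n_bins):
    """Split the range [A, B] into n_bins intervals of width W = (B - A) / N."""
    data = sorted(values)
    low, high = data[0], data[-1]
    width = (high - low) / n_bins  # assumes the values are not all equal
    bins = [[] for _ in range(n_bins)]
    for v in data:
        # The maximum value would fall just past the last interval, so clamp it.
        index = min(int((v - low) / width), n_bins - 1)
        bins[index].append(v)
    return bins


depth_bins = equal_depth_bins(prices, 3)
width_bins = equal_width_bins(prices, 3)
print(depth_bins)
print(width_bins)
```

Under these assumptions the equal-width bins come out uneven (3, 3 and 6 values), which illustrates why skewed data is not handled well by this scheme.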

Data Binning
The cut-off values with the names of the bins for the variables:
 Pregnant: low (1, 2), medium (3, 4, 5), high (> 6)
 Plasma-Glucose: low (< 90), medium (90–150), high (> 150)
 DiastolicBP: normal (< 80), normal-to-high (80–90), high (> 90)
 BMI: low (< 25), normal (25–30), obese (30–35), severely obese (> 35)
 DPF: low (< 0.4), medium (0.4–0.8), high (> 0.8)
 Age: 20–39, 40–59, 60 plus
 Diabetes Class: yes (1), no (0)
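
Applying such cut-offs amounts to a chain of comparisons. The sketch below covers only Plasma-Glucose and Age, whose boundaries are unambiguous in the list; the function names and sample values are hypothetical, and rejecting ages below 20 is an assumption, since the list defines no bin for them.

```python
def bin_plasma_glucose(value):
    """Name the Plasma-Glucose bin: low (< 90), medium (90 to 150), high (> 150)."""
    if value < 90:
        return "low"
    if value <= 150:
        return "medium"
    return "high"


def bin_age(age):
    """Name the Age bin: 20 to 39, 40 to 59, or 60 plus."""
    if age < 20:
        # Assumption: no bin is listed below 20, so treat it as an error.
        raise ValueError("no bin defined for ages below 20")
    if age < 40:
        return "20-39"
    if age < 60:
        return "40-59"
    return "60 plus"


# Hypothetical sample values, chosen to hit each boundary.
glucose_bins = [bin_plasma_glucose(v) for v in (89, 90, 150, 151)]
age_bins = [bin_age(a) for a in (20, 39, 40, 59, 60)]
print(glucose_bins)
print(age_bins)
```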

Binning with Cluster Analysis

Data Smoothing
 Remove noise from data
 Example: Binning and Smoothing
 Partition into (equal-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
 Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
 Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
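
The smoothing example above can be reproduced with a short Python sketch. The function names are hypothetical. Bin means are rounded to the nearest integer, as the example does (22.75 becomes 23, 29.25 becomes 29), and sending ties to the lower boundary is an assumption, since no tie occurs in this data.

```python
bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]


def smooth_by_means(bins):
    """Replace every value with the rounded mean of its bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        # Ties go to the lower boundary (an assumption, see the note above).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
print(by_means)
print(by_boundaries)
```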
Smoothing with Regression
[Figure: regression line y = x + 1 on axes X (Salary) and Y (Age); at X1 the observed value Y1 is replaced by the fitted value Y1']

Simple Linear Regression
 A straight line can be described using the formula:
y = a + bx
 where a is the point of intersection with the y-axis and b is the
slope of the line.

Simple Linear Regression
Least Squares Method
The formulas for calculating the slope b and the intercept a are:
$$ b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} $$
$$ a = \bar{y} - b\bar{x} $$
Where: $x_i$ and $y_i$ are the individual values for the descriptor variable (x) and
the response (y).
$\bar{x}$ is the mean of the descriptor variable x and $\bar{y}$ is the mean of the
response variable y.
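
A minimal Python sketch of the least squares formulas follows. The function name and the sample points are hypothetical; the points are chosen to lie exactly on y = x + 1 so the result is easy to check by hand. Smoothing by regression then replaces each observed y with its fitted value a + bx.

```python
def least_squares_fit(xs, ys):
    """Return (a, b) for the line y = a + b*x that minimises squared error."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Numerator and denominator of the slope formula.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)  # assumes the xs are not all equal
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b


# Hypothetical points lying exactly on y = x + 1.
xs = [1, 2, 3, 4]
ys = [2, 3, 4, 5]
a, b = least_squares_fit(xs, ys)

# Smoothing by regression: replace each observed y with the fitted value.
fitted = [a + b * x for x in xs]
print(a, b)
print(fitted)
```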
Simple Linear Regression (Example)
Formula: Monthly Sales = 23.2064 + 0.00259 * Income

Data Integration
 Data integration:
 combines data from multiple sources into a
coherent store
 Schema integration:
 integrate metadata from different sources
 Entity identification problem: identify real world
entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
 Detecting and resolving data value conflicts:
 for the same real world entity, attribute values from
different sources are different
 possible reasons: different representations,
different scales, e.g., metric vs. British units
Data Transformation
 Smoothing: remove noise from data
 Aggregation: summarization, data cube
construction
 Generalization: concept hierarchy
 Normalization: scaled to fall within a small,
specified range
 min-max normalization
 z-score normalization
 normalization by decimal scaling
 Attribute/feature construction
 New attributes constructed from the given ones
Data Transformation: Normalization
 min-max normalization
$$ v' = \frac{v - \mathrm{min}_A}{\mathrm{max}_A - \mathrm{min}_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A $$
 z-score normalization
$$ v' = \frac{v - \mathrm{mean}_A}{\mathrm{stand\_dev}_A} $$
 normalization by decimal scaling
$$ v' = \frac{v}{10^{j}} $$
where j is the smallest integer such that Max(|v'|) < 1
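
The three normalizations can be sketched in Python as follows, reusing the price data from the binning example. The function names are hypothetical. Two choices here are assumptions the slides leave open: the z-score uses the population standard deviation, and the search for j starts at 0, so j is never negative.

```python
from statistics import mean, pstdev

prices = [29, 8, 9, 15, 21, 24, 4, 25, 28, 21, 34, 26]


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly from [min, max] to [new_min, new_max]."""
    lo, hi = min(values), max(values)  # assumes the values are not all equal
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Centre on the mean and divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest j >= 0 giving max(|v'|) < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


mm = min_max(prices)
zs = z_score(prices)
ds = decimal_scaling(prices)
print(mm)
print(zs)
print(ds)
```

For the price data the largest absolute value is 34, so j = 2 and every value is divided by 100.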
Exercise:
 Normalize the data above using min-max, z-score, and decimal scaling.
