2 DMiningKuliah 2A DPreparation

Data Preprocessing
Pattern Evaluation
Data Mining
Task-relevant
Data
Selection and
Transformation
Data
Data Warehouse
Cleaning
Data Integration
Databases
August 19, 2020 Data Mining: Data Preprocessing 1
Data Preprocessing
■ Why preprocess the data?

■ Data cleaning
■ Data integration and transformation
■ Data reduction and selection
■ Discretization
■ Hierarchy generation
Why Data Preprocessing?
■ Data in the real world is dirty

⚪ incomplete: lacking attribute
values, lacking certain attributes of
interest, or containing only
aggregate data
⚪ noisy: containing errors or outliers
⚪ inconsistent: containing
discrepancies in codes or names
Dirty Data
■ No quality data, no quality mining results!

- Quality decisions must be based on quality data
- Data mining needs consistent integration of quality data
Multi-Dimensional Measure of Data Quality
■ A well-accepted multidimensional Measure:

⚪ Accuracy
⚪ Completeness
⚪ Consistency
⚪ Timeliness
⚪ Believability
⚪ Value added
⚪ Interpretability
⚪ Accessibility
■ Broad categories:
⚪ Intrinsic, Contextual, Representational.
Forms of data preprocessing

Major Tasks in Data Preprocessing
■ Data cleaning
⚪ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
■ Data integration
⚪ Integration of multiple databases, data cubes, or files
■ Data transformation
⚪ Normalization and aggregation
■ Data reduction
⚪ Obtains reduced representation in volume but produces
the same or similar analytical results
■ Data discretization
⚪ Part of data reduction but with particular importance,
especially for numerical data
Data Cleaning
■ Data cleaning tasks

⚪ Fill in missing values
⚪ Identify outliers and smooth out noisy data
⚪ Correct inconsistent data

Missing Data
■ Data is not always available
⚪ E.g., many tuples have no recorded value for
several attributes, such as customer income in
sales data
■ Missing data may be due to
⚪ equipment malfunction
⚪ inconsistent with other recorded data and thus
deleted
⚪ data not entered due to misunderstanding
⚪ certain data may not be considered important at the
time of entry
How to Handle Missing Data?
■ Ignore the tuple: usually done when class label is missing
■ Fill in the missing value manually: tedious + infeasible?
■ Use a global constant to fill in the missing value: e.g.,
“unknown”, a new class?!
■ Replace the missing values with a value generated at random
from the variable distribution observed.
■ Use the attribute mean to fill in the missing value
■ Use the attribute mean for all samples belonging to the same
class to fill in the missing value: smarter
■ Use the most probable value to fill in the missing value:
inference-based such as Bayesian formula or decision tree

Dirty Data

Noisy Data
■ Noise: random error or variance in a measured
variable
■ Incorrect attribute values may due to
⚪ faulty data collection instruments
⚪ data entry problems
⚪ data transmission problems
⚪ technology limitation
⚪ inconsistency in naming convention
■ Other data problems which requires data cleaning
⚪ duplicate records
⚪ incomplete data
⚪ inconsistent data
How to Handle Noisy Data?
■ Graphical Methods For Identifying Outliers

■ Graphical Methods For Identifying Outliers

■ Binning method:
⚪ first sort data and partition into (equal-width or
equal-depth) bins
⚪ then one can smooth by bin mean, smooth by
bin median, smooth by bin boundaries, etc.
■ Combined computer-human inspection:
⚪ detect suspicious values and check by human
■ Clustering:
⚪ detect and remove outliers
■ Regression:
⚪ smooth by fitting the data into regression
functions

Binning
■ Equal-width (distance) partitioning:
⚪ It divides the range into N intervals of equal size:
uniform grid
⚪ if A and B are the lowest and highest values of the
attribute, the width of intervals will be: W = (B-A)/N.
⚪ The most straightforward
⚪ But outliers may dominate presentation
⚪ Skewed data is not handled well.
■ Equal-depth (frequency) partitioning:
⚪ It divides the range into N intervals, each containing
approximately same number of samples
⚪ Good data scaling
⚪ Managing categorical attributes can be tricky.
■ E.g. Data for price (in dollars):

29, 8, 9, 15, 21, 24, 4, 25, 28, 21, 34, 26
Binning Methods
* Sorted data for price (in dollars):
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equal-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Partition into (equal-width) bins ???:

Data Binning

Data Binning
The cut-off values with the names of the bins for the
variables:
■ Pregnant: low (1,2), medium (3,4,5), high ( > 6)
■ Plasma–Glucose: low (< 90), medium (90 – 150),
high (> 150)
■ DiastolicBP : normal ( < 80), normal-to-high (80 –
90), high ( > 90)
■ BMI: low (< 25), normal (25–30 ), obese (30–35 ),
severely obese ( > 35)
■ DPF: low (< 0.4), med ium (0.4–0 .8), high (> 0.8)
■ Age : 20–39, 40–59, 60 plus
■ Diabetes Class: yes (1), no (0)
Data Binning

Binning with Cluster Analysis

Data Smoothing
■ Remove noise from data
■ Example: Binning and Smoothing
■ Partition into (equal-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
■ Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
■ Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
Smoothing with Regression
Y (Age)
Y1
Y1’
y=x+1
X1 (Salary) X

Simple Linear Regression
■ A straight line can be described using the formula:
y = a + bx
■ where a is the point of intersection with the y-axis and b is the
slope of the line.

Simple Linear Regression
Least Squares Method
The formula for calculating slope b and intercept a are:
Where: xi and yi are the individual values for the descriptor variable (xi) and
the response (yi).
x is the mean of the descriptor variable x and y is the mean of the response
variable y.
Simple Linear
Regression
(Example)
Formula:
Monthly Sales = 23.2064 +
0.00259 * Income

Data Integration
■ Data integration:
⚪ combines data from multiple sources into a
coherent store
■ Schema integration:
⚪ integrate metadata from different sources
⚪ Entity identification problem: identify real world
entities from multiple data sources, e.g., A.cust-id ≡
B.cust-#
■ Detecting and resolving data value conflicts:
⚪ for the same real world entity, attribute values from
different sources are different
⚪ possible reasons: different representations,
different scales, e.g., metric vs. British units
Handling Redundant Data in Data
Integration
■ Redundant data occur often when integration of

multiple databases
⚪ The same attribute may have different names in different
databases
⚪ One attribute may be a “derived” attribute in another
table, e.g., annual revenue
■ Can be detected by correlational analysis
■ Careful integration of the data from multiple sources
may help reduce / avoid redundancies and
inconsistencies and improve mining speed and quality

Data Transformation
■ Smoothing: remove noise from data
■ Aggregation: summarization, data cube
construction
■ Generalization: concept hierarchy
■ Normalization: scaled to fall within a small,
specified range
⚪ min-max normalization
⚪ z-score normalization
⚪ normalization by decimal scaling
■ Attribute/feature construction
⚪ New attributes constructed from the given ones
Data Transformation: Normalization

Data Transformation: Normalization
■ min-max normalization
■ z-score normalization
■ normalization by decimal scaling
Where j is the smallest integer such that Max(| |) < 1

Latihan:
■ Lakukan Normalisasi untuk data diatas

dengan Min-Max , Z-Score dan
decimal scalling.

2 DMiningKuliah 2A DPreparation

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

2 DMiningKuliah 2A DPreparation

Uploaded by

Copyright:

Available Formats

Data Preprocessing

■ Why preprocess the data?

■ Data in the real world is dirty

■ No quality data, no quality mining results!

■ A well-accepted multidimensional Measure:

August 19, 2020 Data Mining: Data Preprocessing 6

■ Data cleaning tasks

August 19, 2020 Data Mining: Data Preprocessing 8

August 19, 2020 Data Mining: Data Preprocessing 10

August 19, 2020 Data Mining: Data Preprocessing 11

August 19, 2020 Data Mining: Data Preprocessing 13

August 19, 2020 Data Mining: Data Preprocessing 14

August 19, 2020 Data Mining: Data Preprocessing 15

■ E.g. Data for price (in dollars):

August 19, 2020 Data Mining: Data Preprocessing 17

August 19, 2020 Data Mining: Data Preprocessing 18

August 19, 2020 Data Mining: Data Preprocessing 20

August 19, 2020 Data Mining: Data Preprocessing 21

August 19, 2020 Data Mining: Data Preprocessing 23

August 19, 2020 Data Mining: Data Preprocessing 24

August 19, 2020 Data Mining: Data Preprocessing 26

■ Redundant data occur often when integration of

August 19, 2020 Data Mining: Data Preprocessing 28

August 19, 2020 Data Mining: Data Preprocessing 30

■ normalization by decimal scaling

Where j is the smallest integer such that Max(| |) < 1

August 19, 2020 Data Mining: Data Preprocessing 31

■ Lakukan Normalisasi untuk data diatas

You might also like