
Data Preprocessing

[Figure: knowledge discovery process diagram; surviving stage labels: Data Integration, Data Warehouse, Selection and (label truncated), Data Mining, Pattern Evaluation]
August 19, 2020 Data Mining: Data Preprocessing 1
Data Preprocessing

■ Why preprocess the data?

■ Data cleaning
■ Data integration and transformation
■ Data reduction and selection
■ Discretization
■ Hierarchy generation
Why Data Preprocessing?

■ Data in the real world is dirty

⚪ incomplete: lacking attribute
values, lacking certain attributes of
interest, or containing only
aggregate data
⚪ noisy: containing errors or outliers
⚪ inconsistent: containing
discrepancies in codes or names
Dirty Data

■ No quality data, no quality mining results!

- Quality decisions must be based on quality data
- Data mining needs consistent integration of quality data
Multi-Dimensional Measure of Data Quality

■ A well-accepted multidimensional measure:

⚪ Accuracy
⚪ Completeness
⚪ Consistency
⚪ Timeliness
⚪ Believability
⚪ Value added
⚪ Interpretability
⚪ Accessibility
■ Broad categories:
⚪ Intrinsic, Contextual, Representational.
Forms of data preprocessing


Major Tasks in Data Preprocessing
■ Data cleaning
⚪ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
■ Data integration
⚪ Integration of multiple databases, data cubes, or files
■ Data transformation
⚪ Normalization and aggregation
■ Data reduction
⚪ Obtains reduced representation in volume but produces
the same or similar analytical results
■ Data discretization
⚪ Part of data reduction but with particular importance,
especially for numerical data
Data Cleaning

■ Data cleaning tasks

⚪ Fill in missing values
⚪ Identify outliers and smooth out noisy data
⚪ Correct inconsistent data


Missing Data
■ Data is not always available
⚪ E.g., many tuples have no recorded value for
several attributes, such as customer income in
sales data
■ Missing data may be due to
⚪ equipment malfunction
⚪ data inconsistent with other recorded data and thus deleted
⚪ data not entered due to misunderstanding
⚪ certain data not being considered important at the time of entry
How to Handle Missing Data?
■ Ignore the tuple: usually done when class label is missing
■ Fill in the missing value manually: tedious + infeasible?
■ Use a global constant to fill in the missing value: e.g.,
“unknown”, a new class?!
■ Replace the missing values with a value generated at random
from the variable distribution observed.
■ Use the attribute mean to fill in the missing value
■ Use the attribute mean for all samples belonging to the same
class to fill in the missing value: smarter
■ Use the most probable value to fill in the missing value:
inference-based such as Bayesian formula or decision tree
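Two of the strategies above, filling with the attribute mean and filling with the mean of the same class, can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function names, the income figures, and the class labels are all hypothetical, and missing values are assumed to be encoded as `None`.

```python
from statistics import mean

def fill_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    attribute_mean = mean(observed)
    return [attribute_mean if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Replace None entries with the mean of observed values in the same class."""
    observed_by_class = {}
    for v, c in zip(values, labels):
        if v is not None:
            observed_by_class.setdefault(c, []).append(v)
    class_means = {c: mean(vs) for c, vs in observed_by_class.items()}
    return [class_means[c] if v is None else v for v, c in zip(values, labels)]

# Hypothetical customer incomes (in thousands) with two missing entries.
income = [30, None, 50, 80, None, 100]
label = ["low", "low", "low", "high", "high", "high"]

print(fill_with_mean(income))               # every gap gets the overall mean
print(fill_with_class_mean(income, label))  # each gap gets its own class mean
```

The class-conditional version keeps the imputed values closer to what the class label suggests, which is why the slide calls it the smarter choice.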


Noisy Data
■ Noise: random error or variance in a measured variable
■ Incorrect attribute values may be due to
⚪ faulty data collection instruments
⚪ data entry problems
⚪ data transmission problems
⚪ technology limitation
⚪ inconsistency in naming convention
■ Other data problems that require data cleaning
⚪ duplicate records
⚪ incomplete data
⚪ inconsistent data
How to Handle Noisy Data?
■ Graphical Methods For Identifying Outliers


How to Handle Noisy Data?
■ Binning method:
⚪ first sort data and partition into (equal-width or
equal-depth) bins
⚪ then one can smooth by bin mean, smooth by
bin median, smooth by bin boundaries, etc.
■ Combined computer-human inspection:
⚪ detect suspicious values and check by human
■ Clustering:
⚪ detect and remove outliers
■ Regression:
⚪ smooth by fitting the data to a regression function


■ Equal-width (distance) partitioning:
⚪ It divides the range into N intervals of equal size:
uniform grid
⚪ if A and B are the lowest and highest values of the
attribute, the width of intervals will be: W = (B-A)/N.
⚪ The most straightforward
⚪ But outliers may dominate presentation
⚪ Skewed data is not handled well.
■ Equal-depth (frequency) partitioning:
⚪ It divides the range into N intervals, each containing
approximately same number of samples
⚪ Good data scaling
⚪ Managing categorical attributes can be tricky.

■ E.g. Data for price (in dollars):

29, 8, 9, 15, 21, 24, 4, 25, 28, 21, 34, 26
Binning Methods
* Sorted data for price (in dollars):
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equal-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Partition into (equal-width) bins ???:
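One possible answer to the equal-width question, shown next to the equal-depth partition for comparison. The sketch assumes N = 3 bins, half-open intervals (a value on an interior boundary goes to the upper bin), and that the values are not all identical; the function names are illustrative only.

```python
def equal_width_bins(values, n_bins):
    """Partition values into n_bins intervals of equal width W = (B - A) / N."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins  # assumes high > low
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # The maximum value would index one past the end, so clamp it into the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        bins[index].append(v)
    return bins

def equal_depth_bins(values, n_bins):
    """Partition sorted values into n_bins bins with (nearly) equal sample counts."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

prices = [29, 8, 9, 15, 21, 24, 4, 25, 28, 21, 34, 26]
print(equal_width_bins(prices, 3))  # W = (34 - 4) / 3 = 10
print(equal_depth_bins(prices, 3))  # four values per bin
```

With W = 10 the intervals are [4, 14), [14, 24), and [24, 34], so the bins hold 3, 3, and 6 values. The uneven counts illustrate why equal-width partitioning handles skewed data poorly.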


Data Binning
The cut-off values with the names of the bins for the
■ Pregnant: low (1, 2), medium (3, 4, 5), high (> 6)
■ Plasma–Glucose: low (< 90), medium (90–150), high (> 150)
■ DiastolicBP: normal (< 80), normal-to-high (80–90), high (> 90)
■ BMI: low (< 25), normal (25–30), obese (30–35), severely obese (> 35)
■ DPF: low (< 0.4), medium (0.4–0.8), high (> 0.8)
■ Age: 20–39, 40–59, 60 plus
■ Diabetes Class: yes (1), no (0)
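Cut-off binning of this kind can be sketched with a sorted list of cut-offs and a binary search. The example uses the BMI cut-offs listed above. The slide does not say which bin a value exactly on a cut-off belongs to, so assigning it to the upper bin is an assumption, and the helper name is hypothetical.

```python
from bisect import bisect_right

def label_by_cutoffs(value, cutoffs, labels):
    """Return the label of the interval that value falls in.

    cutoffs must be sorted ascending; labels needs len(cutoffs) + 1 entries.
    A value equal to a cut-off goes to the upper interval (an assumption).
    """
    return labels[bisect_right(cutoffs, value)]

bmi_cutoffs = [25, 30, 35]
bmi_labels = ["low", "normal", "obese", "severely obese"]

for bmi in [22.0, 27.5, 33.0, 40.0]:
    print(bmi, label_by_cutoffs(bmi, bmi_cutoffs, bmi_labels))
```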

Binning with Cluster Analysis


Data Smoothing
■ Remove noise from data
■ Example: Binning and Smoothing
■ Partition into (equal-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
■ Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
■ Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
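The two smoothing rules in this example can be sketched as below. Rounding the bin mean to the nearest integer matches the figures shown above; sending ties to the lower boundary is an assumption, since no tie occurs in this data. The function names are illustrative.

```python
def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean, rounded to the nearest integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        # Ties go to the lower boundary (an assumption; this data has no tie).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

The exact bin means are 9, 22.75, and 29.25, which round to the 9, 23, and 29 shown above.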
Smoothing with Regression
[Figure: regression line plot; surviving axis labels: Y (Age), X1 (Salary), X]


Simple Linear Regression
■ A straight line can be described using the formula:

y = a + bx
■ where a is the point of intersection with the y-axis and b is the
slope of the line.


Simple Linear Regression
Least Squares Method
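The formulas for calculating the slope b and the intercept a are:

$$ b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} $$

$$ a = \bar{y} - b\,\bar{x} $$

where x_i and y_i are the individual values of the descriptor variable and the response,
\bar{x} is the mean of the descriptor variable x, and \bar{y} is the mean of the response
variable y.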
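The least-squares estimates of the slope b and the intercept a of y = a + bx can be computed directly from the means of x and y. The sketch below uses the standard closed-form estimates; the function name and the sample points are hypothetical, chosen to lie exactly on y = 1 + 2x so the result is easy to check.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line y = a + b*x that minimizes squared error."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    b = s_xy / s_xx          # slope
    a = y_mean - b * x_mean  # intercept
    return a, b

# Hypothetical points lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
a, b = least_squares(xs, ys)
print(f"y = {a:.2f} + {b:.2f}x")
```

To smooth noisy data, each observed y would then be replaced by the fitted value a + b*x.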
Simple Linear Regression

Monthly Sales = 23.2064 + 0.00259 * Income


Data Integration
■ Data integration:
⚪ combines data from multiple sources into a
coherent store
■ Schema integration:
⚪ integrate metadata from different sources
⚪ Entity identification problem: identify real world
entities from multiple data sources, e.g., A.cust-id ≡
■ Detecting and resolving data value conflicts:
⚪ for the same real world entity, attribute values from
different sources are different
⚪ possible reasons: different representations,
different scales, e.g., metric vs. British units
Handling Redundant Data in Data Integration

■ Redundant data often occur when integrating multiple databases
⚪ The same attribute may have different names in different databases
⚪ One attribute may be a “derived” attribute in another
table, e.g., annual revenue
■ Can be detected by correlation analysis
■ Careful integration of the data from multiple sources
may help reduce / avoid redundancies and
inconsistencies and improve mining speed and quality
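Correlation analysis for spotting a redundant attribute can be sketched with the Pearson correlation coefficient, one common choice (the slide does not name a specific measure). The attribute names and figures below are hypothetical: the annual revenue is derived from the monthly revenue, so the two are perfectly correlated.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    x_dev = sqrt(sum((x - x_mean) ** 2 for x in xs))
    y_dev = sqrt(sum((y - y_mean) ** 2 for y in ys))
    return cov / (x_dev * y_dev)

# Hypothetical attributes coming from two different source tables.
monthly_revenue = [10, 12, 9, 15, 11]
annual_revenue = [12 * m for m in monthly_revenue]  # derived attribute
unrelated = [3, 9, 4, 1, 7]

print(pearson(monthly_revenue, annual_revenue))  # close to 1: likely redundant
print(pearson(monthly_revenue, unrelated))       # far from +/-1: keep both
```

A coefficient near +1 or -1 suggests that one attribute can be dropped or derived from the other; what counts as "near" is a judgment call for the analyst.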


Data Transformation
■ Smoothing: remove noise from data
■ Aggregation: summarization, data cube
■ Generalization: concept hierarchy
■ Normalization: scaled to fall within a small,
specified range
⚪ min-max normalization
⚪ z-score normalization
⚪ normalization by decimal scaling
■ Attribute/feature construction
⚪ New attributes constructed from the given ones

Data Transformation: Normalization

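■ min-max normalization

$$ v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A $$

■ z-score normalization

$$ v' = \frac{v - mean_A}{stand\_dev_A} $$

■ normalization by decimal scaling

$$ v' = \frac{v}{10^j} $$

where j is the smallest integer such that Max(|v'|) < 1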
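The three normalization methods can be sketched as below, under a few stated assumptions: the z-score uses the population standard deviation (the slide does not say population or sample), decimal scaling searches only non-negative j, and min-max assumes the values are not all equal. The function names and the sample values are hypothetical.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values from [min, max] to [new_min, new_max]."""
    low, high = min(values), max(values)  # assumes high > low
    return [(v - low) / (high - low) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Center on the mean and scale by the population standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest non-negative integer giving max |v'| < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

data = [200, 300, 400, 600, 1000]  # hypothetical attribute values
print(min_max(data))          # rescaled to [0, 1]
print(z_score(data))          # mean 0, standard deviation 1
print(decimal_scaling(data))  # j = 4 here, because 1000 / 10**3 is not below 1
```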



■ Normalize the data above using min-max, z-score, and decimal scaling.
