Download as pdf or txt
Download as pdf or txt
You are on page 1of 32

Data Preprocessing

Pattern Evaluation

Data Mining
Task-relevant
Data
Selection and
Transformation

Data
Data Warehouse
Cleaning

Data Integration
Databases
August 19, 2020 Data Mining: Data Preprocessing 1
Data Preprocessing

■ Why preprocess the data?


■ Data cleaning
■ Data integration and transformation
■ Data reduction and selection
■ Discretization
■ Hierarchy generation
August 19, 2020 Data Mining: Data Preprocessing 2
Why Data Preprocessing?

■ Data in the real world is dirty


⚪ incomplete: lacking attribute
values, lacking certain attributes of
interest, or containing only
aggregate data
⚪ noisy: containing errors or outliers
⚪ inconsistent: containing
discrepancies in codes or names
August 19, 2020 Data Mining: Data Preprocessing 3
Dirty Data

■ No quality data, no quality mining results!


- Quality decisions must be based on quality data
- Data mining needs consistent integration of quality data
August 19, 2020 Data Mining: Data Preprocessing 4
Multi-Dimensional Measure of Data Quality

■ A well-accepted multidimensional Measure:


⚪ Accuracy
⚪ Completeness
⚪ Consistency
⚪ Timeliness
⚪ Believability
⚪ Value added
⚪ Interpretability
⚪ Accessibility
■ Broad categories:
⚪ Intrinsic, Contextual, Representational.
August 19, 2020 Data Mining: Data Preprocessing 5
Forms of data preprocessing

August 19, 2020 Data Mining: Data Preprocessing 6


Major Tasks in Data Preprocessing
■ Data cleaning
⚪ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
■ Data integration
⚪ Integration of multiple databases, data cubes, or files
■ Data transformation
⚪ Normalization and aggregation
■ Data reduction
⚪ Obtains reduced representation in volume but produces
the same or similar analytical results
■ Data discretization
⚪ Part of data reduction but with particular importance,
especially for numerical data
August 19, 2020 Data Mining: Data Preprocessing 7
Data Cleaning

■ Data cleaning tasks


⚪ Fill in missing values
⚪ Identify outliers and smooth out noisy data
⚪ Correct inconsistent data

August 19, 2020 Data Mining: Data Preprocessing 8


Missing Data
■ Data is not always available
⚪ E.g., many tuples have no recorded value for
several attributes, such as customer income in
sales data
■ Missing data may be due to
⚪ equipment malfunction
⚪ inconsistent with other recorded data and thus
deleted
⚪ data not entered due to misunderstanding
⚪ certain data may not be considered important at the
time of entry
August 19, 2020 Data Mining: Data Preprocessing 9
How to Handle Missing Data?
■ Ignore the tuple: usually done when class label is missing
■ Fill in the missing value manually: tedious + infeasible?
■ Use a global constant to fill in the missing value: e.g.,
“unknown”, a new class?!
■ Replace the missing values with a value generated at random
from the variable distribution observed.
■ Use the attribute mean to fill in the missing value
■ Use the attribute mean for all samples belonging to the same
class to fill in the missing value: smarter
■ Use the most probable value to fill in the missing value:
inference-based such as Bayesian formula or decision tree

August 19, 2020 Data Mining: Data Preprocessing 10


Dirty Data

August 19, 2020 Data Mining: Data Preprocessing 11


Noisy Data
■ Noise: random error or variance in a measured
variable
■ Incorrect attribute values may due to
⚪ faulty data collection instruments
⚪ data entry problems
⚪ data transmission problems
⚪ technology limitation
⚪ inconsistency in naming convention
■ Other data problems which requires data cleaning
⚪ duplicate records
⚪ incomplete data
⚪ inconsistent data
August 19, 2020 Data Mining: Data Preprocessing 12
How to Handle Noisy Data?
■ Graphical Methods For Identifying Outliers

August 19, 2020 Data Mining: Data Preprocessing 13


How to Handle Noisy Data?
■ Graphical Methods For Identifying Outliers

August 19, 2020 Data Mining: Data Preprocessing 14


How to Handle Noisy Data?
■ Binning method:
⚪ first sort data and partition into (equal-width or
equal-depth) bins
⚪ then one can smooth by bin mean, smooth by
bin median, smooth by bin boundaries, etc.
■ Combined computer-human inspection:
⚪ detect suspicious values and check by human
■ Clustering:
⚪ detect and remove outliers
■ Regression:
⚪ smooth by fitting the data into regression
functions

August 19, 2020 Data Mining: Data Preprocessing 15


Binning
■ Equal-width (distance) partitioning:
⚪ It divides the range into N intervals of equal size:
uniform grid
⚪ if A and B are the lowest and highest values of the
attribute, the width of intervals will be: W = (B-A)/N.
⚪ The most straightforward
⚪ But outliers may dominate presentation
⚪ Skewed data is not handled well.
■ Equal-depth (frequency) partitioning:
⚪ It divides the range into N intervals, each containing
approximately same number of samples
⚪ Good data scaling
⚪ Managing categorical attributes can be tricky.

■ E.g. Data for price (in dollars):


29, 8, 9, 15, 21, 24, 4, 25, 28, 21, 34, 26
August 19, 2020 Data Mining: Data Preprocessing 16
Binning Methods
* Sorted data for price (in dollars):
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equal-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Partition into (equal-width) bins ???:

August 19, 2020 Data Mining: Data Preprocessing 17


Data Binning

August 19, 2020 Data Mining: Data Preprocessing 18


Data Binning
The cut-off values with the names of the bins for the
variables:
■ Pregnant: low (1,2), medium (3,4,5), high ( > 6)
■ Plasma–Glucose: low (< 90), medium (90 – 150),
high (> 150)
■ DiastolicBP : normal ( < 80), normal-to-high (80 –
90), high ( > 90)
■ BMI: low (< 25), normal (25–30 ), obese (30–35 ),
severely obese ( > 35)
■ DPF: low (< 0.4), med ium (0.4–0 .8), high (> 0.8)
■ Age : 20–39, 40–59, 60 plus
■ Diabetes Class: yes (1), no (0)
August 19, 2020 Data Mining: Data Preprocessing 19
Data Binning

August 19, 2020 Data Mining: Data Preprocessing 20


Binning with Cluster Analysis

August 19, 2020 Data Mining: Data Preprocessing 21


Data Smoothing
■ Remove noise from data
■ Example: Binning and Smoothing
■ Partition into (equal-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
■ Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
■ Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
August 19, 2020 Data Mining: Data Preprocessing 22
Smoothing with Regression
Y (Age)

Y1

Y1’
y=x+1

X1 (Salary) X

August 19, 2020 Data Mining: Data Preprocessing 23


Simple Linear Regression
■ A straight line can be described using the formula:

y = a + bx
■ where a is the point of intersection with the y-axis and b is the
slope of the line.

August 19, 2020 Data Mining: Data Preprocessing 24


Simple Linear Regression
Least Squares Method
The formula for calculating slope b and intercept a are:

Where: xi and yi are the individual values for the descriptor variable (xi) and
the response (yi).
x is the mean of the descriptor variable x and y is the mean of the response
variable y.
August 19, 2020 Data Mining: Data Preprocessing 25
Simple Linear
Regression

(Example)

Formula:
Monthly Sales = 23.2064 +
0.00259 * Income

August 19, 2020 Data Mining: Data Preprocessing 26


Data Integration
■ Data integration:
⚪ combines data from multiple sources into a
coherent store
■ Schema integration:
⚪ integrate metadata from different sources
⚪ Entity identification problem: identify real world
entities from multiple data sources, e.g., A.cust-id ≡
B.cust-#
■ Detecting and resolving data value conflicts:
⚪ for the same real world entity, attribute values from
different sources are different
⚪ possible reasons: different representations,
different scales, e.g., metric vs. British units
August 19, 2020 Data Mining: Data Preprocessing 27
Handling Redundant Data in Data
Integration

■ Redundant data occur often when integration of


multiple databases
⚪ The same attribute may have different names in different
databases
⚪ One attribute may be a “derived” attribute in another
table, e.g., annual revenue
■ Can be detected by correlational analysis
■ Careful integration of the data from multiple sources
may help reduce / avoid redundancies and
inconsistencies and improve mining speed and quality

August 19, 2020 Data Mining: Data Preprocessing 28


Data Transformation
■ Smoothing: remove noise from data
■ Aggregation: summarization, data cube
construction
■ Generalization: concept hierarchy
■ Normalization: scaled to fall within a small,
specified range
⚪ min-max normalization
⚪ z-score normalization
⚪ normalization by decimal scaling
■ Attribute/feature construction
⚪ New attributes constructed from the given ones
August 19, 2020 Data Mining: Data Preprocessing 29
Data Transformation: Normalization

August 19, 2020 Data Mining: Data Preprocessing 30


Data Transformation: Normalization

■ min-max normalization

■ z-score normalization

■ normalization by decimal scaling

Where j is the smallest integer such that Max(| |) < 1

August 19, 2020 Data Mining: Data Preprocessing 31


Latihan:

■ Lakukan Normalisasi untuk data diatas


dengan Min-Max , Z-Score dan
decimal scalling.
August 19, 2020 Data Mining: Data Preprocessing 32

You might also like