Data Preprocessing
DATA MINING: Data Preprocessing
Prof. Sherica Lavinia Menezes
Asst. Professor
Computer Engineering Department
Goa College of Engineering
AGGREGATION
SAMPLING
FEATURE CREATION
DISCRETIZATION AND BINARIZATION
VARIABLE TRANSFORMATION
21-09-2020
AGENDA
01 AGGREGATION
02 SAMPLING
03 DIMENSIONALITY REDUCTION
04 FEATURE SUBSET SELECTION
LEARNING OBJECTIVES
01 Explain Aggregation
02 Appreciate different sampling techniques
03 Discuss Feature Subset Selection
04 Differentiate between Dimensionality
AGGREGATION
EXAMPLE OF AGGREGATION
Replace all transactions of a single store with a single storewide transaction
Qualitative attributes (e.g. item): either omitted or summarized as the set of items sold
Quantitative attributes (e.g. price): aggregated by taking the sum
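The storewide aggregation described above can be sketched in a few lines of Python. This is only an illustration: the record layout `(store, item, price)`, the sample rows, and the helper name `aggregate_by_store` are assumptions, not part of the slides.

```python
from collections import defaultdict

# Hypothetical transaction records: (store, item, price).
transactions = [
    ("Store A", "watch", 25.0),
    ("Store A", "battery", 5.0),
    ("Store B", "shoes", 75.0),
    ("Store A", "watch", 25.0),
    ("Store B", "socks", 10.0),
]


def aggregate_by_store(rows):
    """Collapse all transactions of a store into one storewide record."""
    totals = defaultdict(float)  # quantitative attribute: aggregated by sum
    items = defaultdict(set)     # qualitative attribute: set of items sold
    for store, item, price in rows:
        totals[store] += price
        items[store].add(item)
    return {
        store: {"total_price": totals[store], "items": sorted(items[store])}
        for store in totals
    }


storewide = aggregate_by_store(transactions)
for store, record in storewide.items():
    print(store, record)
```

Each store ends up with exactly one record, so the data set shrinks from one row per transaction to one row per store.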
SAMPLING
Sampling
Sampling Approaches
Example
Sampling Approaches
Stratified Sampling: Starts with prespecified groups of objects from which samples are drawn
Example 1. I want to understand the differences between legitimate and fraudulent credit card transactions. 0.1% of transactions are fraudulent. What happens if I select 1000 transactions at random?
I get 1 fraudulent transaction (in expectation), which is not enough to draw any conclusions. Solution: sample 1000 legitimate and 1000 fraudulent transactions, i.e. draw the same number from each group.
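A minimal sketch of this example, scaled down so it runs quickly: the synthetic population has 100,000 transactions with 0.1% fraud, and 100 records are drawn per group instead of 1000. The function name `stratified_sample`, the `(id, is_fraud)` tuple layout, and the seeds are assumptions made for illustration.

```python
import random


def stratified_sample(records, label_of, per_group, seed=0):
    """Draw up to `per_group` records from each prespecified group."""
    rng = random.Random(seed)
    groups = {}
    for record in records:
        groups.setdefault(label_of(record), []).append(record)
    sample = []
    for members in groups.values():
        k = min(per_group, len(members))  # a group may be smaller than requested
        sample.extend(rng.sample(members, k))
    return sample


# Synthetic population of (id, is_fraud): every 1000th transaction is fraudulent.
population = [(i, i % 1000 == 0) for i in range(100_000)]

# Simple random sampling: only about one fraudulent record is expected.
simple = random.Random(42).sample(population, 1000)
simple_fraud = sum(is_fraud for _, is_fraud in simple)

# Stratified sampling: the same number of records from each group.
stratified = stratified_sample(population, label_of=lambda r: r[1], per_group=100)
strat_fraud = sum(is_fraud for _, is_fraud in stratified)

print("fraudulent in simple random sample:", simple_fraud)
print("fraudulent in stratified sample:", strat_fraud, "of", len(stratified))
```

Because the groups are sampled separately, the rare class is guaranteed to be represented regardless of how small its share of the population is.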
Sampling Process
Example of Various Sample Sizes
DIMENSIONALITY REDUCTION
Data mining (DM) algorithms often work better when the dimensionality is lower
Dimensionality Reduction
The term is reserved for those techniques that reduce dimensionality by creating new attributes that are a combination of the old attributes
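To make "new attributes that are a combination of old attributes" concrete, the sketch below reduces two attributes to one. It uses the first principal component (PCA), which the slides do not name; it is chosen here only as one well-known technique of this kind. The function name and the sample points are assumptions.

```python
import math


def first_principal_component(points):
    """Reduce 2-D points to 1-D along the direction of greatest variance."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Entries of the 2x2 covariance matrix.
    sxx = sum(x * x for x, _ in centered) / n
    syy = sum(y * y for _, y in centered) / n
    sxy = sum(x * y for x, y in centered) / n
    # Angle of the leading eigenvector of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    wx, wy = math.cos(theta), math.sin(theta)
    # The new attribute is a weighted combination of the two old attributes.
    projected = [wx * x + wy * y for x, y in centered]
    return (wx, wy), projected


# Hypothetical data lying exactly on the line y = 2x.
points = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
weights, reduced = first_principal_component(points)
print("weights:", weights)
print("reduced data:", reduced)
```

For these points the single new attribute keeps all of the variance of the original two, because the data already lies along one direction.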
3 Approaches to Feature Subset Selection (FSS)
EMBEDDED: The algorithm decides which attributes to use and which to ignore
FILTER: Features are selected before the DM algorithm is run, using an approach that is independent of the DM algorithm
WRAPPER: The DM algorithm is used as a black box to find the best subset of attributes
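As an illustration of the filter approach only, the sketch below scores each attribute before any data mining algorithm is run and keeps the highest-scoring ones. Scoring by absolute Pearson correlation with the target is an assumption chosen for simplicity, as are the attribute names, the data, and the function names.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def filter_select(columns, target, top_k):
    """Rank attributes by |correlation with the target| and keep the best top_k."""
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k], scores


# Hypothetical attributes and a binary target.
columns = {
    "income": [20, 35, 50, 65, 80, 95],
    "age": [25, 32, 47, 51, 62, 70],
    "shoe_size": [9, 7, 10, 8, 9, 7],
}
target = [0, 0, 0, 1, 1, 1]

selected, scores = filter_select(columns, target, top_k=2)
print("selected attributes:", selected)
```

A wrapper approach would instead train the data mining algorithm on candidate subsets and compare its performance, which is more expensive but tailored to that algorithm.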
[Figure: feature subset selection process, with the steps Attributes, Search Strategy, Subset of Attributes, Stopping Criterion, and Validation Procedure]
FEATURE CREATION
A new set of attributes is created from the original attributes that captures the important information more effectively
Feature Extraction
Mapping the data to a new space
Feature Construction
Feature Extraction
Creation of a new set of features from the raw data
[Figure: plot with a Frequency axis]
Feature Construction
Constructing features to be in a form that is best
suited for the respective data mining algorithm
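A small sketch of feature construction, assuming a hypothetical data set of objects described by mass and volume: neither attribute alone characterizes the material, but the constructed attribute density = mass / volume does. The records and the helper name `add_density` are illustrative assumptions.

```python
# Hypothetical objects described by mass (g) and volume (cm^3).
objects = [
    {"name": "sample_1", "mass": 193.0, "volume": 10.0},
    {"name": "sample_2", "mass": 38.6, "volume": 2.0},
    {"name": "sample_3", "mass": 27.0, "volume": 10.0},
]


def add_density(rows):
    """Return copies of the rows with a constructed 'density' attribute."""
    return [{**row, "density": row["mass"] / row["volume"]} for row in rows]


for row in add_density(objects):
    print(row["name"], round(row["density"], 2))
```

The first two objects differ in both mass and volume yet share the same density, so the constructed feature exposes a similarity that the original attributes hide.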
Binarization
If there are m categorical values, uniquely assign each original value to an integer in the interval [0, m-1].
If the attribute is ordinal, order needs to be maintained.
Convert each integer to a binary number and represent it using n = ⌈log2(m)⌉ binary attributes.
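The two steps can be sketched as follows, assuming n = ⌈log2(m)⌉ binary attributes are used for m categorical values. The ordinal attribute with the values awful, poor, OK, good, great and the function name `binarize` are illustrative assumptions.

```python
def binarize(values, ordered_categories):
    """Map categories to integers in [0, m-1], then to n binary attributes."""
    m = len(ordered_categories)
    # (m - 1).bit_length() equals ceil(log2(m)) for m >= 2; use at least 1 bit.
    n = max(1, (m - 1).bit_length())
    # Listing the categories in order preserves the order of an ordinal attribute.
    to_int = {category: i for i, category in enumerate(ordered_categories)}
    rows = []
    for value in values:
        code = to_int[value]
        bits = [(code >> shift) & 1 for shift in range(n - 1, -1, -1)]
        rows.append((value, code, bits))
    return rows


quality = ["awful", "poor", "OK", "good", "great"]  # m = 5, so n = 3
for value, code, bits in binarize(quality, quality):
    print(f"{value:>5} -> {code} -> {bits}")
```

With five categories, three binary attributes suffice, and each category is identified by the combination of all three.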
Example of Binarization
Issues in Binarization
Can create unintended relationships among
transformed attributes
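One common way to avoid such unintended relationships is to introduce one binary attribute per categorical value, so that each value is encoded by exactly one attribute. The sketch below illustrates that idea; the function name `one_hot` and the example categories are assumptions.

```python
def one_hot(values, categories):
    """Encode each value with one binary attribute per category."""
    return [[1 if value == category else 0 for category in categories]
            for value in values]


quality = ["awful", "poor", "OK", "good", "great"]
for value, row in zip(quality, one_hot(quality, quality)):
    print(f"{value:>5} -> {row}")
```

The cost is a larger number of attributes: m binary attributes instead of the ⌈log2(m)⌉ needed by the compact binary coding.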
Unsupervised Discretization
Class information is not used
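Two widely used unsupervised schemes are equal-width and equal-frequency binning; the slides do not name them at this point, so they are shown here as an assumption about what such methods look like. Function names and the sample data are illustrative.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:  # all values identical: a single bin
        return [0] * len(values)
    # The maximum value is clamped into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign each value to one of k intervals holding about the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)  # tied values may land in different bins
    return bins


# Hypothetical data with one outlier.
data = [1, 2, 3, 4, 5, 6, 7, 100]
width_bins = equal_width_bins(data, 4)
freq_bins = equal_frequency_bins(data, 4)
print("equal width:    ", width_bins)
print("equal frequency:", freq_bins)
```

The outlier stretches the equal-width intervals so that almost all values fall into the first bin, while equal-frequency binning spreads the values evenly across bins.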
The data consists of four groups of points and two outliers. The data is one-dimensional, but a random y component is added to reduce overlap.
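Supervised Discretization
Class information is available
Entropy of the ith interval:
$$e_i = -\sum_{j=1}^{k} p_{ij} \log_2 p_{ij}$$
where
$k$ = number of classes
$m_i$ = number of values in the ith interval
$m_{ij}$ = number of values of class j in interval i
$p_{ij} = m_{ij} / m_i$ = fraction of values of class j in interval i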
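The interval entropy can be computed directly from the per-class counts, taking p_ij = m_ij / m_i and including the leading minus sign of the usual entropy definition (both stated here as assumptions, since the extracted formula is incomplete). The function name and the example counts are illustrative.

```python
import math


def interval_entropy(class_counts):
    """Entropy e_i of one interval, given its per-class counts m_ij."""
    m_i = sum(class_counts)
    entropy = 0.0
    for m_ij in class_counts:
        if m_ij > 0:  # 0 * log2(0) is taken to be 0
            p_ij = m_ij / m_i
            entropy -= p_ij * math.log2(p_ij)
    return entropy


print("pure interval:        ", interval_entropy([10, 0]))
print("evenly mixed interval:", interval_entropy([5, 5]))
print("mostly one class:     ", interval_entropy([9, 1]))
```

An interval containing values of only one class has entropy 0, so a supervised method prefers split points that produce intervals with low entropy.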
VARIABLE TRANSFORMATION
THANKS