Session 2: Data Processing

Lecturer: Dr. Le Hoang Son


Vietnam National University (VNU)
lehoangson@hus.edu.vn
sonlh@vnu.edu.vn
Contents
1 Introduction

2 Data Summarization

3 Data Cleaning

4 Data Integration

5 Data Reduction

6 Data Discretization

7 Discussion & Exercises

1. Introduction
 Real-world databases are highly susceptible to:
 Incomplete: lacking attribute values or certain attributes of interest, or containing only aggregate data
 Noisy: containing errors or outlier values that deviate from the expected
 Inconsistent: containing discrepancies in the data

 Low-quality data will lead to low-quality mining results

Reasons …
 Incomplete:
 Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions
 Data that were inconsistent with other recorded data may have been deleted
 The recording of the history or of modifications to the data may have been overlooked

 Noisy:
 Data collection instruments may be faulty; human or computer errors may occur at data entry; errors can also occur in data transmission
 There may be technology limitations, such as a limited buffer size
 Inconsistent naming conventions and duplicate tuples

 Inconsistent:
 When data come from multiple sources, attributes representing a given concept may have different names in different databases, causing inconsistencies and redundancies

Overview of data processing methods

Data processing methods
 Descriptive data summarization: study general characteristics of the data and identify the presence of noise or outliers

 Data cleaning: remove noise and correct inconsistencies in the data

 Data integration: merge data from multiple sources into a coherent data store, such as a data warehouse

 Data transformation: normalize data to a new domain

 Data reduction: reduce the data size by aggregation, attribute subset selection, dimensionality reduction, and numerosity reduction (smaller alternative representations)

 Data discretization: automatic generation of concept hierarchies from numerical data

2. Data Summarization
 Identify the typical properties of your data and highlight which data values should be treated as noise or outliers

 Data characteristics (distribution): central tendency and dispersion of the data
 Central tendency: mean, median, mode, and midrange
 Data dispersion: quartiles, interquartile range (IQR), and variance

 Notions of distributive, algebraic, and holistic measures

Central tendency
 Weighted average

 Median: if N is odd, the median is the middle value of the ordered set; otherwise it is the average of the middle two values

 Mode: the value that occurs most frequently in the set (unimodal, bimodal, trimodal, multimodal). For moderately skewed unimodal distributions, the empirical relation mean − mode ≈ 3 × (mean − median) holds

Formulae
 Mean: x̄ = (x₁ + x₂ + … + x_N) / N
 Weighted average: x̄ = (w₁x₁ + w₂x₂ + … + w_N x_N) / (w₁ + w₂ + … + w_N)

Example 1
 Calculate Weighted Average, Median and Mode of IRIS dataset
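A minimal sketch in Python (pandas + scikit-learn), assuming scikit-learn's bundled copy of IRIS; its column names (e.g. 'sepal width (cm)') and the uniform weights are assumptions, since the lecture does not fix them:

```python
# Central tendency of each IRIS attribute: weighted average, median, mode.
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).frame.drop(columns="target")

for col in iris.columns:
    x = iris[col]
    w = pd.Series(1.0, index=x.index)       # placeholder weights; with equal
    weighted_avg = (w * x).sum() / w.sum()  # weights this is just the mean
    print(col, weighted_avg, x.median(), x.mode().tolist())
```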

Range, Quartiles, Boxplots
 Range of the set is the difference between the largest and smallest values

 The kth percentile of a set of data in numerical order is the value xi having the property that k percent of the data entries lie at or below xi
 Q2: 50th percentile (median)
 Q1: 25th percentile
 Q3: 75th percentile

 Interquartile range: IQR = Q3 − Q1

 Maximum, minimum, outliers

Example 2
 Draw Boxplots for all attributes in IRIS
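A minimal sketch assuming matplotlib and scikit-learn's copy of IRIS; pandas' box plot marks the median, the Q1–Q3 box, whiskers at 1.5 × IQR, and outliers beyond them:

```python
# Boxplots of all four numeric IRIS attributes.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).frame.drop(columns="target")
iris.plot(kind="box")   # one box (Q1, median, Q3, whiskers, outliers) per attribute
plt.show()
```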

Variance, Standard Deviation, Histogram

 Variance of N observations x1, x2, …, xN:

σ² = (1/N) Σᵢ (xᵢ − x̄)²

where x̄ is the mean value of the observations

 Standard deviation σ of the observations is the square root of the variance σ²

 Histogram

Example 3
 Calculate Variance, Standard Deviation and Draw Histogram of IRIS
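A minimal sketch; ddof=0 matches the 1/N variance above, whereas pandas defaults to the 1/(N−1) sample variance:

```python
# Variance, standard deviation, and histograms of the IRIS attributes.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).frame.drop(columns="target")
print(iris.var(ddof=0))   # population variance, 1/N as in the formula above
print(iris.std(ddof=0))   # its square root
iris.hist()               # one histogram per attribute
plt.show()
```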

Quantile-quantile plot and scatter plot
 A quantile-quantile plot (q-q plot) graphs the quantiles of one univariate distribution against the corresponding quantiles of another

 It is a powerful visualization tool for viewing whether there is a shift in going from one distribution to another

 A scatter plot is one of the most effective graphical methods for determining whether there appears to be a relationship, pattern, or trend between two numerical attributes

Example 4
 Draw Q-Q Plot of ‘Sepal_Width’
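A minimal sketch comparing sepal width against a normal distribution (the choice of reference distribution is an assumption; scikit-learn's column name stands in for 'Sepal_Width'):

```python
# Q-Q plot of sepal width against theoretical normal quantiles.
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).frame
stats.probplot(iris["sepal width (cm)"], dist="norm", plot=plt)
plt.show()
```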

3. Data Cleaning
 Fill in missing values, smooth out noise while identifying outliers,
and correct inconsistencies in the data

Handle missing values
1. Ignore the tuple

2. Fill in the missing value manually

3. Use a global constant to fill in the missing value (e.g., UNKNOWN)

4. Use the attribute mean to fill in the missing value

5. Use the attribute mean of all samples belonging to the same class as the given tuple (methods 4 and 5 are sketched below)

6. Use the most probable value to fill in the missing value: regression, inference-based tools using a Bayesian formalism, or decision tree induction
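A minimal sketch of methods 4 and 5, with a few NaNs injected into IRIS purely for demonstration:

```python
# Mean imputation: globally (method 4) and per class (method 5).
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).frame
iris.iloc[::20, 0] = np.nan                # artificial missing values

col = iris.columns[0]
filled_global = iris[col].fillna(iris[col].mean())            # method 4
filled_by_class = iris.groupby("target")[col].transform(
    lambda s: s.fillna(s.mean()))                             # method 5
```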

Handle noisy data
 Noise is a random error or
variance in a measured variable

1. Binning
2. Regression
3. Clustering

Example 5
 Apply binning to the sequence (N = 3):

5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215
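A minimal sketch reading N = 3 as the bin depth (equal-frequency bins of three values, then smoothing by bin means); if N instead means the number of bins, use a depth of 4:

```python
# Equal-depth binning with smoothing by bin means.
data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
depth = 3                                   # assumed meaning of N = 3
bins = [data[i:i + depth] for i in range(0, len(data), depth)]
smoothed = [[sum(b) / len(b)] * len(b) for b in bins]
print(bins)      # [[5, 10, 11], [13, 15, 35], [50, 55, 72], [92, 204, 215]]
print(smoothed)  # each value replaced by its bin mean
```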

4. Data Integration
 Combines data from multiple sources (multiple databases, data cubes, or flat files) into a coherent data store

 Issues: schema integration, object matching, redundancy, inconsistency
 Do Customer_ID in one database and Cust_number in another refer to the same attribute?
 Can Wage_customers in one database be computed from Wage_month and Total_customers in other databases?

 Redundancies can be detected by correlation analysis (measuring how strongly one attribute implies the other)

Pearson correlation (numeric)
 For two numeric attributes A and B with N tuples:

r_A,B = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / (N σ_A σ_B)

where Ā and B̄ are the means and σ_A and σ_B the standard deviations of A and B. If r_A,B > 0, A and B are positively correlated; if r_A,B = 0, there is no linear correlation; if r_A,B < 0, they are negatively correlated

Example 6
 Calculate Pearson correlation between pairs of attributes (except Class)
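A minimal sketch using pandas' built-in Pearson correlation:

```python
# Pairwise Pearson correlations between the numeric IRIS attributes.
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).frame.drop(columns="target")
print(iris.corr(method="pearson"))   # 4x4 correlation matrix
```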

Chi-squared correlation (categorical)
 For two categorical attributes, the χ² (chi-square) statistic over their contingency table is

χ² = Σᵢ Σⱼ (oᵢⱼ − eᵢⱼ)² / eᵢⱼ

where oᵢⱼ is the observed frequency of the joint event (Aᵢ, Bⱼ) and eᵢⱼ is the expected frequency under independence

Example 7
 Calculate Chi-squared correlation
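A minimal sketch; the 2×2 contingency table is invented for illustration (e.g., gender vs. preferred reading):

```python
# Chi-square test of independence between two categorical attributes.
from scipy.stats import chi2_contingency

observed = [[250, 200],        # rows: categories of one attribute
            [50, 1000]]        # columns: categories of the other
chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)           # large chi2 / tiny p => correlated
```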

Data Transformation
 Data are transformed or consolidated into forms appropriate for mining
 Smoothing: remove noise from the data using binning, regression, or clustering

 Aggregation: data may be aggregated so as to compute monthly and annual total amounts, for example

 Generalization: low-level data (e.g., age) are replaced by higher-level concepts (youth, middle-aged, senior)

 Normalization: attribute data are scaled to fall within a small range

 Attribute construction: new attributes are constructed and added from the given set of attributes

Example 8
 Aggregate the data by class to summarize each attribute per class. Calculate a new attribute: total_length
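A minimal sketch; defining total_length as sepal length plus petal length is an assumption about the intended construction:

```python
# Aggregation by class plus attribute construction on IRIS.
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).frame
print(iris.groupby("target").mean())      # one summary row per class

iris["total_length"] = (iris["sepal length (cm)"]
                        + iris["petal length (cm)"])   # constructed attribute
```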

Normalization
 Min-max: v′ = (v − min_A) / (max_A − min_A) · (new_max_A − new_min_A) + new_min_A

 Z-score: v′ = (v − Ā) / σ_A

 Decimal scaling: v′ = v / 10ʲ, where j is the smallest integer such that max(|v′|) < 1

Example 9
 Normalize each attribute by 3 methods
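A minimal sketch of all three methods over the numeric IRIS attributes (min-max here targets the range [0, 1]):

```python
# Min-max, z-score, and decimal-scaling normalization.
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).frame.drop(columns="target")

min_max = (iris - iris.min()) / (iris.max() - iris.min())   # into [0, 1]
z_score = (iris - iris.mean()) / iris.std(ddof=0)           # mean 0, std 1
j = np.ceil(np.log10(iris.abs().max()))                     # digits per attribute
decimal = iris / 10 ** j                                    # |values| < 1
```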

5. Data Reduction
1. Data cube aggregation: aggregation operations are applied to the data in the construction of a data cube

2. Attribute subset selection: irrelevant, weakly relevant, or redundant attributes or dimensions are detected and removed

3. Dimensionality reduction: encoding mechanisms (PCA, wavelets) are used to reduce the data set size

4. Numerosity reduction: data are replaced or estimated by alternative, smaller data representations

5. Discretization and concept hierarchy generation: raw data values for attributes are replaced by ranges or higher conceptual levels

Data cube aggregation

 Data cubes provide fast access to precomputed, summarized data, thereby benefiting on-line analytical processing as well as data mining

Attribute subset selection
1. Stepwise forward selection: the procedure starts with an empty set of attributes; at each subsequent iteration or step, the best of the remaining original attributes is added to the set (a sketch follows this list)

2. Stepwise backward elimination: the procedure starts with the full set of attributes and, at each step, removes the worst attribute remaining in the set

3. Combination of forward selection and backward elimination: at each step the procedure selects the best attribute and removes the worst

4. Decision tree induction: ID3, C4.5, and CART select splits by information gain; the attributes appearing in the resulting tree form the reduced subset
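A minimal sketch of stepwise forward selection via scikit-learn's SequentialFeatureSelector; the pairing with a decision tree and the choice of two attributes are assumptions:

```python
# Greedy forward selection of 2 of the 4 IRIS attributes.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
selector = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=0),
    n_features_to_select=2,     # stopping point of the greedy search
    direction="forward",        # "backward" gives stepwise elimination
)
selector.fit(X, y)
print(selector.get_support())   # boolean mask of the kept attributes
```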

Attribute subset selection (2)

Dimensionality reduction
 Lossy dimensionality reduction: wavelet transforms and principal components analysis (PCA)

 The discrete wavelet transform (DWT) is a linear signal processing technique that transforms a data vector X into a vector X′ of wavelet coefficients

 Principal components analysis (PCA) searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k < n. The original data are thus projected onto a much smaller space, resulting in dimensionality reduction
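A minimal PCA sketch projecting the four IRIS attributes onto k = 2 components:

```python
# PCA: project n = 4 dimensions onto k = 2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)          # shape (150, 2)
print(pca.explained_variance_ratio_)      # variance kept per component
```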

Dimensionality reduction (2)

Numerosity reduction
 Reduce the data volume by choosing alternative, smaller forms of data representation
 Parametric methods: estimate discrete multidimensional probability distributions (log-linear models)

 Non-parametric methods: store reduced representations of the data (histograms, clustering, and sampling)
• Histogram: partitions the data distribution of an attribute A into disjoint subsets, or buckets. If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets

• Clustering: partition the objects into groups, or clusters, so that objects within a cluster are “similar” to one another and “dissimilar” to objects in other clusters

• Sampling: allows a large data set to be represented by a much smaller random sample (or subset) of the data (a sketch follows this list)
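A minimal sampling sketch: a simple random sample without replacement of 20 of the 150 IRIS tuples:

```python
# Numerosity reduction by simple random sampling without replacement.
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).frame
sample = iris.sample(n=20, replace=False, random_state=0)
print(sample.shape)    # (20, 5): the reduced representation
```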

Parametric methods

Non-Parametric methods

6. Data Discretization
 Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values. Common techniques (a sketch follows under Binning):
 Binning
 Histogram analysis
 Entropy-based discretization
 ChiMerge
 Clustering

Binning
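A minimal sketch of discretization by binning on one IRIS attribute; the three-bin split and the interval labels are assumptions:

```python
# Equal-width (cut) and equal-frequency (qcut) discretization.
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).frame
equal_width = pd.cut(iris["petal length (cm)"], bins=3,
                     labels=["short", "medium", "long"])   # interval labels
equal_freq = pd.qcut(iris["petal length (cm)"], q=3)       # ~50 values per bin
print(equal_width.value_counts())
```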

Discussion & Exercises
 Data preprocessing is an important issue for both data warehousing and data mining, as real-world data tend to be incomplete, noisy, and inconsistent

 Descriptive data summarization provides the analytical foundation for data preprocessing

 Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data

 Data integration combines data from multiple sources to form a coherent data store

 Data transformation converts the data into forms appropriate for mining

 Data reduction obtains a reduced representation of the data while minimizing the loss of information content

 Data discretization generates higher-level concept hierarchies from raw attribute values

Questions

Exercises

Exercises (2)

Exercises (3)

Main reference