Session 2-Data Processing
Contents
1 Introduction
2 Data Summarization
3 Data Cleaning
4 Data Integration
5 Data Reduction
6 Data Discretization
1. Introduction
Real-world databases are highly susceptible to noisy, missing, and inconsistent data:
Incomplete: lacking attribute values or certain attributes of interest, or containing only aggregate data
Noisy: containing errors or outlier values that deviate from the expected
Inconsistent: containing discrepancies in the data
Reasons
Incomplete:
Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions
Data that were inconsistent with other recorded data may have been deleted
The recording of the data history or of modifications to the data may have been overlooked
Noisy:
Data collection instruments may be faulty; human or computer errors may occur at data entry; errors can also occur in data transmission
There may be technology limitations, such as limited buffer size
Inconsistent naming conventions or duplicate tuples
Inconsistent:
When data come from multiple sources, attributes representing a given concept may have different names in different databases, causing inconsistencies and redundancies
Overview of data processing methods
Data processing methods
Descriptive data summarization: studies the general characteristics of the data and identifies the presence of noise or outliers
2. Data Summarization
Identify the typical properties of your data and highlight which data values
should be treated as noise or outliers
Central tendency
Weighted average
Median: if N is odd, the median is the middle value of the ordered set; otherwise it is the average of the middle two values
Mode: the value that occurs most frequently in the set (unimodal, bimodal, trimodal). For unimodal, moderately skewed data the empirical relation holds: mean - mode ≈ 3 * (mean - median)
Formulae
Example 1
Calculate Weighted Average, Median and Mode of IRIS dataset
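A minimal sketch of Example 1 using only Python's standard statistics module; the values and weights below are a small hypothetical sample, not the actual IRIS data, which the exercise would load from a file:

```python
import statistics

# Hypothetical sample of Sepal_Length values (cm); illustrative weights
values = [5.1, 4.9, 4.7, 5.1, 5.0, 5.4, 5.1]
weights = [1, 1, 1, 2, 1, 1, 1]

# Weighted average: sum(w_i * x_i) / sum(w_i)
weighted_avg = sum(w * x for w, x in zip(weights, values)) / sum(weights)

median = statistics.median(values)  # middle value of the ordered set
mode = statistics.mode(values)      # most frequent value
```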
Range, Quartiles, Boxplots
The range of the set is the difference between the largest and smallest values
Example 2
Draw Boxplots for all attributes in IRIS
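The quantities a boxplot visualizes can be computed directly; a sketch on a hypothetical sample, using the default 'exclusive' quantile method of the standard library:

```python
import statistics

# Hypothetical attribute values (sorted for readability)
data = [4, 8, 15, 21, 21, 24, 25, 28, 34]

# Quartiles via statistics.quantiles (default 'exclusive' method)
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1  # interquartile range: the box height of a boxplot

# Points beyond 1.5 * IQR from the quartiles are suspected outliers
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]
```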
Variance, Standard Deviation, Histogram
Example 3
Calculate Variance, Standard Deviation and Draw Histogram of IRIS
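A sketch of Example 3's quantities on a hypothetical sample; the 0.25 cm bucket width for the histogram counts is an arbitrary choice:

```python
import statistics
from collections import Counter

# Hypothetical Sepal_Length sample (cm)
data = [5.1, 4.9, 5.0, 5.4, 5.1, 4.6, 5.0, 4.4]

var = statistics.pvariance(data)  # population variance
std = statistics.pstdev(data)     # population standard deviation

# Histogram counts over equal-width 0.25 cm buckets starting at 4.0
buckets = Counter(round((x - 4.0) // 0.25) for x in data)
```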
Quantile-quantile plot and scatter plot
A quantile-quantile plot (q-q plot) graphs the quantiles of one univariate distribution against the corresponding quantiles of another
Example 4
Draw Q-Q Plot of ‘Sepal_Width’
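A q-q plot can be sketched numerically by pairing the deciles of two samples; the samples below are hypothetical stand-ins for two Sepal_Width subsets:

```python
import statistics

# Two hypothetical univariate samples
sample_a = [2.9, 3.0, 3.1, 3.2, 3.4, 3.5, 3.6, 3.8]
sample_b = [2.8, 3.0, 3.2, 3.3, 3.3, 3.6, 3.7, 3.9]

# Pair corresponding deciles; plotting these (x, y) points gives the
# q-q plot, with points near y = x when the distributions agree
qa = statistics.quantiles(sample_a, n=10)
qb = statistics.quantiles(sample_b, n=10)
points = list(zip(qa, qb))
```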
3. Data Cleaning
Fill in missing values, smooth out noise while identifying outliers,
and correct inconsistencies in the data
Handle missing values
1. Ignore the tuple
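Ignoring the tuple is only one option; a commonly used alternative is to fill a missing value with the attribute mean. A minimal sketch, with None marking missing values in a hypothetical column:

```python
# Hypothetical attribute column; None marks a missing value
raw = [5.1, None, 4.7, 5.0, None, 5.4]

# Fill each missing entry with the mean of the known values
known = [x for x in raw if x is not None]
attr_mean = sum(known) / len(known)
filled = [attr_mean if x is None else x for x in raw]
```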
Handle noisy data
Noise is a random error or variance in a measured variable. Common smoothing techniques:
1. Binning
2. Regression
3. Clustering
Example 5
Apply Binning for the sequence: (N=3)
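Equal-frequency binning with smoothing by bin means can be sketched as follows; the sequence here is illustrative only, since the exercise's own sequence is not reproduced above:

```python
# Illustrative sequence (the exercise's own sequence is not shown)
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
N = 3  # equal-frequency bins of N values each

bins = [data[i:i + N] for i in range(0, len(data), N)]
# Smoothing by bin means: every value is replaced by its bin's mean
smoothed = [[round(sum(b) / len(b), 2)] * len(b) for b in bins]
```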
4. Data Integration
Combines data from multiple sources (multiple databases, data
cubes or flat files) into a coherent data store
Pearson correlation (numeric)
Example 6
Calculate Pearson correlation between pairs of attributes (except Class)
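Pearson's r for Example 6 can be computed from its definition; the two attribute samples below are hypothetical:

```python
import math

def pearson(xs, ys):
    # r = sum((x - mx)(y - my)) / (sqrt(sum((x - mx)^2)) * sqrt(sum((y - my)^2)))
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical attribute pair; |r| near 1 flags a redundant attribute
a = [5.1, 4.9, 4.7, 4.6, 5.0]
b = [3.5, 3.0, 3.2, 3.1, 3.6]
r = pearson(a, b)
```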
Chi-squared correlation (categorical)
Example 7
Calculate Chi-squared correlation
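The chi-squared statistic of Example 7 follows from observed and expected counts; the 2x2 contingency table below uses hypothetical counts:

```python
# Hypothetical 2x2 contingency table of two categorical attributes
observed = [[10, 20],
            [20, 10]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# chi2 = sum over cells of (observed - expected)^2 / expected,
# where expected = row_total * col_total / grand_total
chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / total
        chi2 += (o - e) ** 2 / e
```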
Data Transformation
Data are transformed or consolidated into forms appropriate for mining
Smoothing: remove noise from the data using binning, regression, or clustering
Example 8
Aggregate the data by class to show the quantity of each attribute per class.
Compute a new attribute: total_length
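Example 8 can be sketched with plain Python grouping; the rows and the attribute names sepal_length/petal_length are illustrative stand-ins for the IRIS columns:

```python
from collections import defaultdict

# Hypothetical rows: (class, sepal_length, petal_length)
rows = [
    ("setosa", 5.1, 1.4),
    ("setosa", 4.9, 1.5),
    ("virginica", 6.3, 6.0),
]

# New attribute total_length = sepal_length + petal_length,
# then aggregate (mean) per class
by_class = defaultdict(list)
for cls, sl, pl in rows:
    by_class[cls].append(sl + pl)

class_mean_total = {cls: round(sum(v) / len(v), 2)
                    for cls, v in by_class.items()}
```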
Normalization
Min-Max
Z-score
Decimal scaling
Example 9
Normalize each attribute using the three methods above
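A sketch of the three normalizations on a hypothetical attribute; for decimal scaling, j is the smallest power of ten that brings every scaled magnitude below 1:

```python
import math
import statistics

v = [200, 300, 400, 600, 900]  # hypothetical attribute values

# 1. Min-max normalization to [0, 1]
lo, hi = min(v), max(v)
min_max = [(x - lo) / (hi - lo) for x in v]

# 2. Z-score normalization: (x - mean) / standard deviation
mu, sigma = statistics.mean(v), statistics.pstdev(v)
z_score = [(x - mu) / sigma for x in v]

# 3. Decimal scaling: divide by 10^j, the smallest power of ten
#    that makes every |value| less than 1
j = math.floor(math.log10(max(abs(x) for x in v))) + 1
decimal = [x / 10 ** j for x in v]
```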
5. Data Reduction
1. Data cube aggregation: aggregation operations are applied to
the data in the construction of a data cube
Data cube aggregation
Attribute subset selection
1. Stepwise forward selection: the procedure starts with an empty set of attributes; at each subsequent iteration or step, the best of the remaining original attributes is added to the set
Attribute subset selection (2)
Dimensionality reduction
Lossy dimensionality reduction: wavelet transforms and principal
components analysis (PCA)
Dimensionality reduction (2)
Numerosity reduction
Reduce the data volume by choosing alternative, smaller forms of data representation
Parametric methods: a model is fit to the data so that, typically, only the model parameters need to be stored (e.g., regression and log-linear models, the latter estimating discrete multidimensional probability distributions)
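As a parametric-method sketch: fit a least-squares line and keep only its two parameters in place of the raw points (the data below are hypothetical):

```python
# Hypothetical (x, y) points, roughly y = 2x; after fitting, only the
# parameters a and b need to be stored, not the points themselves
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# Least-squares slope and intercept for y ~ a * x + b
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
    / sum((x - mx) ** 2 for x in xs)
b = my - a * mx
```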
Parametric methods
Non-Parametric methods
6. Data Discretization
Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values
Binning
Histogram analysis
Entropy-based discretization
ChiMerge
Clustering
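Equal-width binning, the simplest of the techniques above, can be sketched as follows (k = 2 intervals over a hypothetical sample; the interval index then serves as the label replacing each value):

```python
# Equal-width discretization: split [min, max] into k intervals and
# replace each value with its interval index
def discretize(values, k):
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # clamp so the maximum value falls in the last interval
    return [min(int((x - lo) / width), k - 1) for x in values]

bins = discretize([4.0, 5.0, 6.0, 7.9], k=2)
```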
Binning
Discussion & Exercises
Data preprocessing is an important issue for both data warehousing and data mining, as real-world data tend to be incomplete, noisy, and inconsistent
Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data
Data integration combines data from multiple sources to form a coherent data store
Data transformation converts the data into forms appropriate for mining
Data reduction obtains a reduced representation of the data while minimizing the loss of information content
Questions
Exercises
Exercises (2)
Exercises (3)
Main reference