Session 2: Data Processing

Lecturer: Dr. Le Hoang Son


Vietnam National University (VNU)
lehoangson@hus.edu.vn
sonlh@vnu.edu.vn
Contents
1 Introduction

2 Data Summarization

3 Data Cleaning

4 Data Integration

5 Data Reduction

6 Data Discretization

7 Discussion & Exercises

1. Introduction
 Real-world databases are highly susceptible to:
 Incomplete: lacking attribute values or certain attributes of interest, or containing only aggregate data
 Noisy: containing errors or outlier values that deviate from the expected
 Inconsistent: containing discrepancies in the data

 Low-quality data will lead to low-quality mining results

Reasons …
 Incomplete:
 Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions
 Data that were inconsistent with other recorded data may have been deleted
 The recording of the history or of modifications to the data may have been overlooked

 Noisy:
 Data collection instruments may be faulty; human or computer errors may occur at data entry; errors can also occur in data transmission
 There may be technology limitations, such as a limited buffer size
 Inconsistent naming conventions and duplicate tuples

 Inconsistent:
 When data come from multiple sources, attributes representing a given concept may have different names in different databases, causing inconsistencies and redundancies

Overview of data processing methods

Data processing methods
 Descriptive data summarization: study general characteristics of the data and identify the presence of noise or outliers

 Data cleaning: remove noise and correct inconsistencies in the data

 Data integration: merge data from multiple sources into a coherent data store, such as a data warehouse

 Data transformation: normalize data to a new domain

 Data reduction: reduce the data size by aggregation, attribute subset selection, dimensionality reduction, and numerosity reduction (smaller alternative representations)

 Data discretization: automatic generation of concept hierarchies from numerical data

2. Data Summarization
 Identify the typical properties of your data and highlight which data values should be treated as noise or outliers

 Data characteristics (distribution): central tendency and dispersion of the data
 Central tendency: mean, median, mode, and midrange
 Data dispersion: quartiles, interquartile range (IQR), and variance

 Notions of distributive, algebraic, and holistic measures

Central tendency
 Weighted average

 Median: if N is odd, the median is the middle value of the ordered set; otherwise it is the average of the middle two values

 Mode: the value that occurs most frequently in the set (unimodal, bimodal, trimodal, multimodal). For moderately skewed unimodal distributions, the empirical relation mean − mode ≈ 3 × (mean − median) holds

Formulae
 Mean: x̄ = (x₁ + x₂ + … + x_N) / N
 Weighted average: x̄ = (w₁x₁ + w₂x₂ + … + w_N x_N) / (w₁ + w₂ + … + w_N)

Example 1
 Calculate Weighted Average, Median and Mode of IRIS dataset
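A minimal sketch in Python (pandas + scikit-learn), assuming scikit-learn's bundled copy of IRIS; its column names (e.g. 'sepal width (cm)') and the uniform weights are assumptions, since the lecture does not fix them:

```python
# Central tendency of each IRIS attribute: weighted average, median, mode.
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).frame.drop(columns="target")

for col in iris.columns:
    x = iris[col]
    w = pd.Series(1.0, index=x.index)       # placeholder weights; with equal
    weighted_avg = (w * x).sum() / w.sum()  # weights this is just the mean
    print(col, weighted_avg, x.median(), x.mode().tolist())
```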

Range, Quartiles, Boxplots
 Range of the set is the difference between the largest and smallest values

 The kth percentile of a set of data in numerical order is the value xi having the property that k percent of the data entries lie at or below xi
 Q2: 50th percentile (median)
 Q1: 25th percentile
 Q3: 75th percentile

 Interquartile range: IQR = Q3 − Q1

 Maximum, minimum, outliers

Example 2
 Draw Boxplots for all attributes in IRIS
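A minimal sketch assuming matplotlib and scikit-learn's copy of IRIS; pandas' box plot marks the median, the Q1–Q3 box, whiskers at 1.5 × IQR, and outliers beyond them:

```python
# Boxplots of all four numeric IRIS attributes.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).frame.drop(columns="target")
iris.plot(kind="box")   # one box (Q1, median, Q3, whiskers, outliers) per attribute
plt.show()
```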

Variance, Standard Deviation, Histogram

 Variance of N observations x1, x2, …, xN:

σ² = (1/N) Σᵢ (xᵢ − x̄)²

where x̄ is the mean value of the observations

 Standard deviation σ of the observations is the square root of the variance σ²

 Histogram

Example 3
 Calculate Variance, Standard Deviation and Draw Histogram of IRIS
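A minimal sketch; ddof=0 matches the 1/N variance above, whereas pandas defaults to the 1/(N−1) sample variance:

```python
# Variance, standard deviation, and histograms of the IRIS attributes.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).frame.drop(columns="target")
print(iris.var(ddof=0))   # population variance, 1/N as in the formula above
print(iris.std(ddof=0))   # its square root
iris.hist()               # one histogram per attribute
plt.show()
```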

Quantile-quantile plot and scatter plot
 A quantile-quantile plot (q-q plot) graphs the quantiles of one univariate distribution against the corresponding quantiles of another

 It is a powerful visualization tool for viewing whether there is a shift in going from one distribution to another

 A scatter plot is one of the most effective graphical methods for determining whether there appears to be a relationship, pattern, or trend between two numerical attributes

Example 4
 Draw Q-Q Plot of ‘Sepal_Width’
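A minimal sketch comparing sepal width against a normal distribution (the choice of reference distribution is an assumption; scikit-learn's column name stands in for 'Sepal_Width'):

```python
# Q-Q plot of sepal width against theoretical normal quantiles.
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).frame
stats.probplot(iris["sepal width (cm)"], dist="norm", plot=plt)
plt.show()
```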

3. Data Cleaning
 Fill in missing values, smooth out noise while identifying outliers,
and correct inconsistencies in the data

Handle missing values
1. Ignore the tuple

2. Fill in the missing value manually

3. Use a global constant to fill in the missing value (e.g., UNKNOWN)

4. Use the attribute mean to fill in the missing value

5. Use the attribute mean of all samples belonging to the same class as the given tuple (methods 4 and 5 are sketched below)

6. Use the most probable value to fill in the missing value: regression, inference-based tools using a Bayesian formalism, or decision tree induction
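A minimal sketch of methods 4 and 5, with a few NaNs injected into IRIS purely for demonstration:

```python
# Mean imputation: globally (method 4) and per class (method 5).
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).frame
iris.iloc[::20, 0] = np.nan                # artificial missing values

col = iris.columns[0]
filled_global = iris[col].fillna(iris[col].mean())            # method 4
filled_by_class = iris.groupby("target")[col].transform(
    lambda s: s.fillna(s.mean()))                             # method 5
```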

Handle noisy data
 Noise is a random error or
variance in a measured variable

1. Binning
2. Regression
3. Clustering

Example 5
 Apply binning to the sequence (N = 3):

5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215
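A minimal sketch reading N = 3 as the bin depth (equal-frequency bins of three values, then smoothing by bin means); if N instead means the number of bins, use a depth of 4:

```python
# Equal-depth binning with smoothing by bin means.
data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
depth = 3                                   # assumed meaning of N = 3
bins = [data[i:i + depth] for i in range(0, len(data), depth)]
smoothed = [[sum(b) / len(b)] * len(b) for b in bins]
print(bins)      # [[5, 10, 11], [13, 15, 35], [50, 55, 72], [92, 204, 215]]
print(smoothed)  # each value replaced by its bin mean
```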

4. Data Integration
 Combines data from multiple sources (multiple databases, data cubes, or flat files) into a coherent data store

 Issues: schema integration, object matching, redundancy, inconsistency
 Do Customer_ID in one database and Cust_number in another refer to the same attribute?
 Can Wage_customers in one database be computed from Wage_month and Total_customers in other databases?

 Redundancies can be detected by correlation analysis (measuring how strongly one attribute implies the other)

Pearson correlation (numeric)
 For two numeric attributes A and B with N tuples:

r_A,B = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / (N σ_A σ_B)

where Ā and B̄ are the means and σ_A and σ_B the standard deviations of A and B. If r_A,B > 0, A and B are positively correlated; if r_A,B = 0, there is no linear correlation; if r_A,B < 0, they are negatively correlated

Example 6
 Calculate Pearson correlation between pairs of attributes (except Class)
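A minimal sketch using pandas' built-in Pearson correlation:

```python
# Pairwise Pearson correlations between the numeric IRIS attributes.
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).frame.drop(columns="target")
print(iris.corr(method="pearson"))   # 4x4 correlation matrix
```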

Chi-squared correlation (categorical)
 For two categorical attributes, the χ² (chi-square) statistic over their contingency table is

χ² = Σᵢ Σⱼ (oᵢⱼ − eᵢⱼ)² / eᵢⱼ

where oᵢⱼ is the observed frequency of the joint event (Aᵢ, Bⱼ) and eᵢⱼ is the expected frequency under independence

Example 7
 Calculate Chi-squared correlation
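A minimal sketch; the 2×2 contingency table is invented for illustration (e.g., gender vs. preferred reading):

```python
# Chi-square test of independence between two categorical attributes.
from scipy.stats import chi2_contingency

observed = [[250, 200],        # rows: categories of one attribute
            [50, 1000]]        # columns: categories of the other
chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)           # large chi2 / tiny p => correlated
```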

Data Transformation
 Data are transformed or consolidated into forms appropriate for mining
 Smoothing: remove noise from the data using binning, regression, or clustering

 Aggregation: data may be aggregated so as to compute monthly and annual total amounts, for example

 Generalization: low-level data (e.g., age) are replaced by higher-level concepts (youth, middle-aged, senior)

 Normalization: attribute data are scaled to fall within a small range

 Attribute construction: new attributes are constructed and added from the given set of attributes

Example 8
 Aggregate the data by class to summarize each attribute per class. Calculate a new attribute: total_length
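A minimal sketch; defining total_length as sepal length plus petal length is an assumption about the intended construction:

```python
# Aggregation by class plus attribute construction on IRIS.
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).frame
print(iris.groupby("target").mean())      # one summary row per class

iris["total_length"] = (iris["sepal length (cm)"]
                        + iris["petal length (cm)"])   # constructed attribute
```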

Normalization
 Min-max: v′ = (v − min_A) / (max_A − min_A) · (new_max_A − new_min_A) + new_min_A

 Z-score: v′ = (v − Ā) / σ_A

 Decimal scaling: v′ = v / 10ʲ, where j is the smallest integer such that max(|v′|) < 1

Example 9
 Normalize each attribute by 3 methods
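A minimal sketch of all three methods over the numeric IRIS attributes (min-max here targets the range [0, 1]):

```python
# Min-max, z-score, and decimal-scaling normalization.
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).frame.drop(columns="target")

min_max = (iris - iris.min()) / (iris.max() - iris.min())   # into [0, 1]
z_score = (iris - iris.mean()) / iris.std(ddof=0)           # mean 0, std 1
j = np.ceil(np.log10(iris.abs().max()))                     # digits per attribute
decimal = iris / 10 ** j                                    # |values| < 1
```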

5. Data Reduction
1. Data cube aggregation: aggregation operations are applied to the data in the construction of a data cube

2. Attribute subset selection: irrelevant, weakly relevant, or redundant attributes or dimensions are detected and removed

3. Dimensionality reduction: encoding mechanisms (PCA, wavelets) are used to reduce the data set size

4. Numerosity reduction: data are replaced or estimated by alternative, smaller data representations

5. Discretization and concept hierarchy generation: raw data values for attributes are replaced by ranges or higher conceptual levels

Data cube aggregation

 Data cubes provide fast access to precomputed, summarized data, thereby benefiting on-line analytical processing as well as data mining

Attribute subset selection
1. Stepwise forward selection: the procedure starts with an empty set of attributes; at each subsequent iteration or step, the best of the remaining original attributes is added to the set (a sketch follows this list)

2. Stepwise backward elimination: the procedure starts with the full set of attributes and, at each step, removes the worst attribute remaining in the set

3. Combination of forward selection and backward elimination: at each step the procedure selects the best attribute and removes the worst

4. Decision tree induction: ID3, C4.5, and CART select splits by information gain; the attributes appearing in the resulting tree form the reduced subset
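A minimal sketch of stepwise forward selection via scikit-learn's SequentialFeatureSelector; the pairing with a decision tree and the choice of two attributes are assumptions:

```python
# Greedy forward selection of 2 of the 4 IRIS attributes.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
selector = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=0),
    n_features_to_select=2,     # stopping point of the greedy search
    direction="forward",        # "backward" gives stepwise elimination
)
selector.fit(X, y)
print(selector.get_support())   # boolean mask of the kept attributes
```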

Attribute subset selection (2)

Dimensionality reduction
 Lossy dimensionality reduction: wavelet transforms and principal components analysis (PCA)

 The discrete wavelet transform (DWT) is a linear signal processing technique that transforms a data vector X into a vector X′ of wavelet coefficients

 Principal components analysis (PCA) searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k < n. The original data are thus projected onto a much smaller space, resulting in dimensionality reduction
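A minimal PCA sketch projecting the four IRIS attributes onto k = 2 components:

```python
# PCA: project n = 4 dimensions onto k = 2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)          # shape (150, 2)
print(pca.explained_variance_ratio_)      # variance kept per component
```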

Dimensionality reduction (2)

Numerosity reduction
 Reduce the data volume by choosing alternative, smaller forms of data representation
 Parametric methods: estimate discrete multidimensional probability distributions (log-linear models)

 Non-parametric methods: store reduced representations of the data (histograms, clustering, and sampling)
• Histogram: partitions the data distribution of an attribute A into disjoint subsets, or buckets. If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets

• Clustering: partition the objects into groups, or clusters, so that objects within a cluster are “similar” to one another and “dissimilar” to objects in other clusters

• Sampling: allows a large data set to be represented by a much smaller random sample (or subset) of the data (a sketch follows this list)
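A minimal sampling sketch: a simple random sample without replacement of 20 of the 150 IRIS tuples:

```python
# Numerosity reduction by simple random sampling without replacement.
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).frame
sample = iris.sample(n=20, replace=False, random_state=0)
print(sample.shape)    # (20, 5): the reduced representation
```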

Parametric methods

Non-Parametric methods

6. Data Discretization
 Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values. Common techniques (a sketch follows under Binning):
 Binning
 Histogram analysis
 Entropy-based discretization
 ChiMerge
 Clustering

Binning
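A minimal sketch of discretization by binning on one IRIS attribute; the three-bin split and the interval labels are assumptions:

```python
# Equal-width (cut) and equal-frequency (qcut) discretization.
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).frame
equal_width = pd.cut(iris["petal length (cm)"], bins=3,
                     labels=["short", "medium", "long"])   # interval labels
equal_freq = pd.qcut(iris["petal length (cm)"], q=3)       # ~50 values per bin
print(equal_width.value_counts())
```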

Discussion & Exercises
 Data preprocessing is an important issue for both data warehousing and data mining, as real-world data tend to be incomplete, noisy, and inconsistent

 Descriptive data summarization provides the analytical foundation for data preprocessing

 Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data

 Data integration combines data from multiple sources to form a coherent data store

 Data transformation converts the data into forms appropriate for mining

 Data reduction obtains a reduced representation of the data while minimizing the loss of information content

 Data discretization generates higher-level concept hierarchies from raw attribute values

Questions

Exercises

Exercises (2)

Exercises (3)

Main reference