Data Pre-Processing
Data Pre-processing Phase
Example
A person's weight
The number of pizzas I can eat before fainting
Types of Attributes
Spatial Data (geospatial data) is data about the locations and shapes
of geographic features and the relationships between them, usually
stored as co-ordinates, links and nodes.
Data mining of spatial data may uncover patterns describing the
characteristics of houses located near a specified kind of location,
the climate of mountainous areas located at various altitudes, etc.
Example: patterns from mining Japanese earthquakes, 1961-1994
2. Multimedia data
Image
Video
3. Time-series data
Example:
In this database
5. Graph data
(a). Data Matrix is a table representing a data set, with m rows, one
for each object, and n columns, one for each attribute.
It applies when all data objects have the same fixed set of numeric
attributes, so that each object can be represented as a point in a
multi-dimensional space, where each dimension represents a distinct attribute.
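A minimal sketch of the data matrix idea, assuming a tiny hypothetical data set (the attribute names and values below are illustrative and do not come from the slides). Each inner list is one object, and each position in it is one attribute:

```python
# Hypothetical data set: m = 3 objects (rows), n = 2 numeric attributes (columns).
attributes = ["age", "income"]  # illustrative attribute names
data_matrix = [
    [25, 28000.0],  # object 1
    [40, 52000.0],  # object 2
    [33, 41000.0],  # object 3
]

m = len(data_matrix)     # number of objects (rows)
n = len(data_matrix[0])  # number of attributes (columns)

# A data matrix requires every object to have the same fixed set of attributes.
if any(len(row) != n for row in data_matrix):
    raise ValueError("all rows must have the same number of attributes")


def column(matrix, j):
    """Return attribute j for all objects, i.e. one dimension of the space."""
    return [row[j] for row in matrix]


print(m, n)
print(column(data_matrix, attributes.index("age")))
```

Each row can be read as a point in n-dimensional space, which is what makes distance-based methods applicable to this representation.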
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
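The document table above stores, for each document, how often each term occurs. A minimal sketch of how such term vectors could be built, assuming simple whitespace tokenization and a small hypothetical vocabulary and pair of documents (none of these come from the slides):

```python
from collections import Counter


def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())  # naive whitespace tokenization
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and documents, for illustration only.
vocabulary = ["team", "coach", "game"]
documents = [
    "The coach praised the team after the game",
    "Game after game the team lost",
]

# One row per document, one column per term.
doc_term_matrix = [term_vector(doc, vocabulary) for doc in documents]

for row in doc_term_matrix:
    print(row)
```

A real pipeline would normally also strip punctuation and handle word forms; that is left out here to keep the sketch short.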
6. Record Data
Missing values are values that are not available in the data.
Data with missing values is called incomplete data.
Causes of missing values include:
1. Equipment malfunction
2. Data deleted because it was found to be inconsistent with other
recorded data during data integration.
3. Data not entered at the time of entry.
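Before missing values can be handled they have to be located. A minimal sketch, assuming records are stored as dictionaries and a missing value is encoded as None (both the encoding and the sample records are assumptions made for illustration):

```python
# Hypothetical records; None marks a value that is not available.
records = [
    {"id": 1, "age": 34, "income": 28000},
    {"id": 2, "age": None, "income": 31000},
    {"id": 3, "age": 51, "income": None},
]


def missing_counts(rows):
    """Return the number of missing values per attribute."""
    counts = {}
    for row in rows:
        for attribute, value in row.items():
            counts[attribute] = counts.get(attribute, 0) + int(value is None)
    return counts


def is_incomplete(rows):
    """A data set is incomplete if any attribute value is missing."""
    return any(value is None for row in rows for value in row.values())


print(missing_counts(records))
print(is_incomplete(records))
```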
2. Noisy Data
Example:
An insurance company offers car insurance. A field identifying
"auto_type" seems innocent enough, but it turns out that the
labels entered into the system, "Merc", "Mercedes", "M-Benz",
and "Mrcds", all represent the same manufacturer.
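One common way to resolve inconsistent labels like these is a lookup table that maps every known variant to a single canonical value. The sketch below assumes "Mercedes-Benz" as the canonical label and a hand-written mapping; both are illustrative choices, not something the slides prescribe:

```python
# Hypothetical mapping from observed variants (lowercased) to one canonical label.
CANONICAL_AUTO_TYPE = {
    "merc": "Mercedes-Benz",
    "mercedes": "Mercedes-Benz",
    "m-benz": "Mercedes-Benz",
    "mrcds": "Mercedes-Benz",
}


def normalize_auto_type(label):
    """Map a raw auto_type label to its canonical form when a mapping is known."""
    cleaned = label.strip()
    # Unknown labels are returned unchanged (only trimmed) rather than guessed.
    return CANONICAL_AUTO_TYPE.get(cleaned.lower(), cleaned)


raw_labels = ["Merc", "Mercedes", "M-Benz", "Mrcds", "Toyota"]
print([normalize_auto_type(label) for label in raw_labels])
```

Leaving unknown labels untouched keeps the cleaning step from silently inventing values; they can be reviewed and added to the mapping later.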
Causes of Noise and Inconsistencies
(3) Use the attribute mean to fill in the missing value: For example,
suppose that the average income of AllElectronics customers is $28,000.
Use this value to replace the missing value for income.
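Filling with the attribute mean can be sketched as follows. The income values are hypothetical and were chosen so that the mean of the known values is 28,000, matching the example; None is assumed to mark a missing value:

```python
def fill_with_mean(values):
    """Replace missing entries (None) with the mean of the known entries."""
    known = [v for v in values if v is not None]
    if not known:
        # Nothing to average over, so the column is returned unchanged.
        return list(values)
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


# Hypothetical income column with two missing entries.
incomes = [20000, None, 36000, 28000, None]
filled = fill_with_mean(incomes)
print(filled)
```

Note that the mean is computed from the known values only, so the filled column keeps the same mean as before.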
(a) Filling Missing values
Example of Binning:
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equidepth) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
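The binning example above can be reproduced with a short sketch. It assumes the values are already sorted and that their count divides evenly by the number of bins, as in the price data; the function names are illustrative:

```python
def equidepth_bins(sorted_values, n_bins):
    """Split sorted values into n_bins bins holding the same number of values."""
    if len(sorted_values) % n_bins != 0:
        raise ValueError("this sketch assumes the values divide evenly into bins")
    size = len(sorted_values) // n_bins
    return [sorted_values[i:i + size] for i in range(0, len(sorted_values), size)]


def smooth_by_bin_means(bins):
    """Replace every value in a bin with the (rounded) mean of that bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # sorted price data from the example
bins = equidepth_bins(prices, 3)
smoothed = smooth_by_bin_means(bins)

for i, b in enumerate(smoothed, start=1):
    print(f"Bin {i}: {b}")
```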
(b) Smoothing Noisy Data
[Figure: regression of salary (y) on age (x) with fitted line y = x + 1; points X1 and Y1 are marked on the axes]
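Smoothing by regression fits a function to the data and replaces each noisy value with the value predicted by that function. A minimal sketch using ordinary least squares for a straight line; the (x, y) pairs are hypothetical and were chosen so that the fitted line is y = x + 1, the line shown in the figure:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical noisy observations (arbitrary units), not taken from the slides.
xs = [1, 2, 3, 4, 5]
ys = [2.5, 2.0, 5.0, 4.0, 6.5]

slope, intercept = fit_line(xs, ys)

# Smoothing: replace each observed y with the value on the fitted line.
smoothed_ys = [slope * x + intercept for x in xs]
print(slope, intercept)
print(smoothed_ys)
```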
Noisy data Example
Sampling methods