Professional Documents
Culture Documents
TTDS Lecture 2
TTDS Lecture 2
TTDS Lecture 2
timeout
season
coach
game
score
team
ball
lost
pla
wi
n
y
Document data: text documents: term-frequency
vector
Transaction data
Document 1 3 0 5 0 2 6 0 2 0 2
Graph and network
Document 2 0 7 0 2 1 0 0 3 0 0
World Wide Web
Social or information networks Document 3 0 1 0 0 1 2 2 0 3 0
Molecular Structures
Ordered
TID Items
Video data: sequence of images
1 Bread, Coke, Milk
Temporal data: time-series
2 Beer, Bread
Sequential Data: transaction sequences
Genetic sequence data
3 Beer, Coke, Diaper, Milk
Spatial, image and multimedia 4 Beer, Bread, Diaper, Milk
Spatial data: maps 5 Coke, Diaper, Milk
Image data:
Video data:
Data Objects
Data Result
Data Mining
Preprocessing Post-processing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
Data Cleaning
Ignore the tuple: usually done when class label is missing (assuming the
tasks in classification)—not effective when the percentage of missing
values per attribute varies considerably.
Fill in the missing value manually: tedious + infeasible?
Use a global constant to fill in the missing value: e.g., “unknown”, a new
class?!
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the same class to fill
in the missing value: smarter
Use the most probable value to fill in the missing value: inference-based
How to Handle Missing Data?
Use routines designed to detect inconsistencies and manually correct them. E.g.,
the routine may use the check global constraints (age>10) or functional
dependencies
Data integration:
combines data from multiple sources into a coherent store
Schema integration
integrate metadata from different sources
metadata: data about the data (i.e., data descriptors)
Entity identification problem: identify real world entities from
multiple data sources, e.g., A.cust-id B.cust-#
Detecting and resolving data value conflicts
for the same real world entity, attribute values from different
sources are different (e.g., J.D.Smith and Jonh Smith may refer
to the same person)
possible reasons: different representations, different scales, e.g.,
metric vs. British units (inches vs. cm)
Handling Redundant Data in Data Integration
i 1 (ai A)(bi B)
n n
(ai bi ) n AB
rA, B i 1
(n 1) A B (n 1) A B
Scatter plots
showing the
similarity from
–1 to 1.
Correlation (viewed as linear relationship)
Correlation coefficient:
Suppose two stocks A and B have the following values in one week: (2, 5),
(3, 8), (5, 10), (4, 11), (6, 14).
Question: If the stocks are affected by the same industry trends, will
their prices rise or fall together?
Parametric methods
Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the data
(except possible outliers)
Log-linear models: obtain value at a point in m-D space as the
product on appropriate marginal subspaces
Non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling
Discretization
Three types of attributes:
Nominal — values from an unordered set
Discretization:
divide the range of a continuous attribute into intervals
why?
Equi-width
binning:
0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80
Equi-width
binning: 22-31 62-80
0-22
38-44 48-55
32-38 44-48 55-62
Discretization by Classification &
Correlation Analysis
Classification (e.g., decision tree analysis)
Visualization
The human eye is a powerful analytical tool
If we visualize the data properly, we can discover patterns
Visualization is the way to present the data so that patterns can be seen
E.g., histograms and plots are a form of visualization
There are multiple techniques (a field on its own)
Scatter Plot Array of Iris Attributes
Contour Plot Example: SST Sea surface
temperature Dec, 1998
Celsius