Professional Documents
Culture Documents
Principles and Algorithms: Data Mining
Principles and Algorithms: Data Mining
09/11/2012
09/11/2012
city 1 2 3 4 1 . . .
region
. . .
8
Multidimensional Analysis
Strategy Generalize the planbase in different directions Look for sequential patterns in the generalized plans Derive high-level plans A multi-D model for the planbase
09/11/2012
Multidimensional Generalization
Multi-Dimensional generalization of the planbase
Plan# 1 2
. . .
Loc_Seq ALB - JFK - ORD - LAX - SAN SPI - ORD - JFK - SYR
. . .
09/11/2012
11
09/11/2012
12
09/11/2012
13
Fir
Oak
(2,0)
(4,0) x
09/11/2012
20
09/11/2012
22
Query Processing
Efficient algorithms to answer spatial queries Common Strategy: filter and refine Filter: Query Region overlaps with MBRs (minimum bounding rectangles) of B, C, D Refine: Query Region overlaps with B, C
C D
C D
Data Object
REFINE
09/11/2012
23
y-axis
R2
y-axis
S3 S2 R4 S1 x-axis (a) R1
T (T.xl, T.yl)
R3
x-axis (b)
R4
S2
S1
R1 (c)
S3
R2
R3
09/11/2012
24
09/11/2012
25
A B C C i d e f g h i j
Measures
numerical (e.g. monthly revenue of a region) distributive (e.g. count, sum) algebraic (e.g. average) holistic (e.g. median, rank) spatial collection of spatial pointers (e.g. pointers to all regions with temperature of 25-30 degrees in July)
29
Spatial-to-Spatial Generalization
Generalize detailed geographic points into clustered regions, such as businesses, residential, industrial, or agricultural areas, according to land usage Requires the merging of a set of geographic areas by spatial operations
09/11/2012
Output
A map that reveals patterns: merged (similar) regions
Goals
Interactive analysis (drill-down, slice, dice, pivot, roll-up) Fast response time Minimizing storage space used
Challenge
A merged region may contain hundreds of primitive regions (polygons)
09/11/2012 Data Mining: Principles and Algorithms 31
Fact table
32
09/11/2012
33
[7%, 85%]
2) What kinds of objects are typically located close to golf courses?
09/11/2012
35
09/11/2012
36
09/11/2012
37
Spatial Autocorrelation
Spatial data tends to be highly self-correlated Example: Neighborhood, Temperature Items in a traditional data are independent of each other, whereas properties of locations in a map are often auto-correlated. First law of geography: Everything is related to everything, but nearby things are
09/11/2012
38
09/11/2012
39
Spatial Classification
Methods in classification Decision-tree classification, Nave-Bayesian classifier + boosting, neural network, logistic regression, etc. Association-based multi-dimensional classification Example: classifying house value based on proximity to lakes, highways, mountains, etc. Assuming learning samples are independent of each other Spatial auto-correlation violates this assumption! Popular spatial classification methods Spatial auto-regression (SAR) Markov random field (MRF)
09/11/2012 Data Mining: Principles and Algorithms 40
Spatial Auto-Regression
Linear Regression Y=X + Spatial autoregressive regression (SAR) Y = WY + X + W: neighborhood matrix. models strength of spatial dependencies error vector The estimates of and can be derived using maximum likelihood theory or Bayesian statistics
09/11/2012 Data Mining: Principles and Algorithms 41
42
09/11/2012
43
09/11/2012
45
Constraints-Based Clustering
Constraints on individual objects
Simple selection of relevant objects before clustering
C2 C1 River
C3
The upper left and lower right quadrants indicate a spatial outlier Computation method Fit a linear regression line Select points (e.g. P, Q, S) which are from the regression line greater than specified residual error
09/11/2012 Data Mining: Principles and Algorithms 50
Movement Motifs
Prototypical movement of object
Right-turn, U-turn
Can be either defined by an expert or discovered automatically from data Defined in our framework Extracted in movement paths Path becomes a set of motif expressions
09/11/2012
54
09/11/2012
56
Feature Clustering
Rough, fast micro-clustering method based on BIRCH (SIGMOD96) A micro-cluster is represented by a CF Vector: CF = (n, LS, SS) Centroid and radius can be calculated from CF vector CF Additive Theorem allows two CF Vectors to be combined quickly and losslessly CF Tree is a hierarchy of CF Vectors A parent CF Vector holds information for all descendent CF Vectors Leaf CF Vector corresponds to a set of actual points
09/11/2012 Data Mining: Principles and Algorithms 60
09/11/2012
61
09/11/2012
62
Experiments
Synthetic Data Generated at motif-expression level Abnormal paths are injected with abnormal motif-expressions Classifiers SVM using nave feature space SVM using extracted feature spaces of varying refinement levels
09/11/2012
63
Experiment
09/11/2012
64
Experiment (2)
09/11/2012
65
09/11/2012
67
Wavelet Analysis
Wavelet-based signature Use the dominant wavelet coefficients of an image as its signature Wavelets capture shape, texture, and location information in a single unified framework Improved efficiency and reduced the need for providing multiple search primitives May fail to identify images containing similar objects that are in different locations.
09/11/2012 Data Mining: Principles and Algorithms 71
Wavelet-based signature with region-based granularity Define regions by clustering signatures of windows of varying sizes within the image Signature of a region is the centroid of the cluster Similarity is defined in terms of the fraction of the area of the two images covered by matching pairs of regions from two images
09/11/2012 Data Mining: Principles and Algorithms 72
Layout descriptor: contains a color layout vector and an edge layout vector
09/11/2012 Data Mining: Principles and Algorithms 73
09/11/2012
74
09/11/2012
75
Search for blue sky and green meadows Search for blue sky
(top layout grid is blue) (top layout grid is blue and bottom is green)
09/11/2012
76
Cross Tab
JPEG GIF RED WHITE BLUE
By Colour
Group By
Colour
RED WHITE BLUE
By Format Sum
Measurement
Sum
Format of image Duration Colors Textures Keywords Size Width Height Internet domain of image Internet domain of parent pages Image popularity
77
09/11/2012
09/11/2012
78
Classification in MultiMediaMiner
09/11/2012
79
09/11/2012
81
09/11/2012
82
Summary
Mining object data needs feature/attribute-based generalization methods Spatial, spatiotemporal and multimedia data mining is one of important research frontiers in data mining with broad applications Spatial data warehousing, OLAP and mining facilitates multidimensional spatial analysis and finding spatial associations, classifications and trends Multimedia data mining needs content-based retrieval and similarity search integrated with mining methods
09/11/2012 Data Mining: Principles and Algorithms 84
09/11/2012
86