Data Streams
Data streams—continuous, ordered, changing, fast, and huge in volume
Traditional DBMS—data stored in finite, persistent data sets
Characteristics
Huge volumes of continuous data, possibly infinite
Fast changing, requiring fast, real-time response
Data streams capture many of today's data processing needs nicely
Random access is expensive—single-scan algorithms (each element can be examined only once)
Store only a summary of the data seen so far
Most stream data are low-level or multi-dimensional in nature, and need multi-level, multi-dimensional processing
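As a concrete illustration of the single-scan, summary-only constraints above, the sketch below (class and method names are ours, not from the slides) keeps a constant-size summary—count, mean, and variance—while seeing each element exactly once, using Welford's online algorithm:

```python
class StreamSummary:
    """Constant-space, single-pass summary of a numeric stream
    (Welford's online algorithm for mean and population variance)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Each element is seen exactly once; nothing is stored.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # population variance of everything seen so far
        return self.m2 / self.n if self.n else 0.0

s = StreamSummary()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    s.update(x)
print(s.n, s.mean, s.variance)  # 8 5.0 4.0
```

The summary occupies three numbers regardless of how many elements arrive—the defining property of a stream synopsis.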
June 13, 2024 Data Mining: Concepts and Techniques 9
Stream Data Applications
Telecommunication calling records
Business: credit card transaction flows
Network monitoring and traffic engineering
Financial market: stock exchange
Engineering & industrial processes: power supply &
manufacturing
Sensor, monitoring & surveillance: video streams, RFIDs
Security monitoring
Web logs and Web page click streams
Massive data sets (even when saved, random access is too expensive)
DBMS versus DSMS
DBMS                                  DSMS
Persistent relations                  Transient streams
One-time queries                      Continuous queries
Random access                         Sequential access
“Unbounded” disk store                Bounded main memory
Only current state matters            Historical data is important
Relatively low update rate            Possibly multi-GB arrival rate
No real-time services                 Real-time requirements
Data at any granularity               Data at fine granularity
Assume precise data                   Data stale/imprecise
Access plan determined by query       Unpredictable/variable data
processor and physical DB design      arrival and characteristics
(Ack.: from Motwani’s PODS tutorial slides)
Mining Data Streams
[Figure: multiple input streams feed a Stream Query Processor with scratch space (main memory and/or disk); the processor emits continuous query results]
Challenges of Stream Data Processing
Methodology
Synopses (trade-off between accuracy and storage)
Histograms
Sliding windows
Multi-resolution model
Sketches
Randomized algorithms
Sketches and frequency moments: F_k = Σ_{i=1..v} m_i^k, where v is the universe or domain size and m_i is the frequency of value i in the sequence
Given N elements and v values, sketches can approximate F0, F1, F2 in O(log v + log N) space
Randomized algorithms
Monte Carlo algorithm: bound on running time, but may not return a correct result
Chebyshev’s inequality:
Let X be a random variable with mean μ and standard deviation σ; then
P(|X − μ| > k) ≤ σ² / k²
Chernoff bound:
Let X be the sum of independent Poisson trials X1, …, Xn, and δ in (0, 1]; then
P[X < (1 − δ)μ] < e^(−μδ²/4)
The probability decreases exponentially as we move away from the mean
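Chebyshev's inequality P(|X − μ| > k) ≤ σ²/k² can be checked empirically. This small sketch (illustrative parameter values) simulates a binomial variable and compares the observed tail frequency with the bound:

```python
import random

def chebyshev_demo(num_samples=5000, n=100, p=0.5, k=10.0, seed=7):
    """Empirical check of P(|X - mu| > k) <= sigma^2 / k^2 for
    X ~ Binomial(n, p), where mu = n*p and sigma^2 = n*p*(1-p)."""
    rng = random.Random(seed)
    mu = n * p                      # 50
    var = n * p * (1 - p)           # 25
    bound = var / (k * k)           # Chebyshev bound: 0.25
    hits = 0
    for _ in range(num_samples):
        x = sum(1 for _ in range(n) if rng.random() < p)
        if abs(x - mu) > k:
            hits += 1
    return hits / num_samples, bound

freq, bound = chebyshev_demo()
print(freq, bound)  # observed tail frequency vs. the Chebyshev bound
```

The observed frequency is far below the bound—Chebyshev is loose, which is exactly why the exponentially decaying Chernoff bound is preferred when its conditions hold.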
Approximate Query Answering in Streams
Sliding windows
Query over sliding windows of recent stream data only
An approximation, but often more desirable in applications
Batched processing, sampling and synopses
Batched processing: useful if updates are fast but computation is slow
Compute periodically, so results are not very timely
Often needs less precision for operations such as grouping and sorting
Mining: patterns are hidden and more general than querying
[Figure: natural tilted time frame along the time axis—4 qtrs (of an hour), 24 hours, 31 days, 12 months—granularity coarsens as data ages]
Logarithmic tilted time frame:
Example: minimal granularity 1 minute, then windows of 1, 2, 4, 8, 16, 32, … minutes
At most 3 snapshots are kept at each granularity level
Given a snapshot number N, if N mod 2^d = 0, insert the snapshot into frame number d
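The logarithmic tilted time frame can be sketched in a few lines. In this illustration (class and variable names are ours), snapshot N goes to frame d, the largest d with N mod 2^d = 0, and each frame keeps only its most recent snapshots—so recent time is stored at fine granularity and old time at coarse granularity:

```python
def frame_number(n):
    """Largest d with n mod 2^d == 0: the frame snapshot n belongs to."""
    d = 0
    while n % (2 ** (d + 1)) == 0:
        d += 1
    return d

class LogTiltedFrames:
    """Logarithmic tilted time frame: each frame d keeps at most
    `capacity` most recent snapshots (the slides use capacity 3)."""

    def __init__(self, capacity=3):
        self.capacity = capacity
        self.frames = {}  # frame number -> snapshot ids, most recent last

    def insert(self, n):
        d = frame_number(n)
        frame = self.frames.setdefault(d, [])
        frame.append(n)
        if len(frame) > self.capacity:
            frame.pop(0)  # evict the oldest snapshot in this frame

t = LogTiltedFrames(capacity=3)
for n in range(1, 33):
    t.insert(n)
for d in sorted(t.frames):
    print(d, t.frames[d])
```

After 32 snapshots, only O(capacity · log N) snapshots survive: frame 0 holds recent odd-numbered snapshots, while higher frames hold a thinning trail of older ones.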
[Figure: lattice of cube cells between (A1, *, C1) and (A2, B2, C2)—e.g., (A1, B1, C1), (A1, *, C2), (A2, B1, C2)—showing cells at different abstraction levels]
An H-Tree Cubing Structure
[Figure: an H-tree with a root and an observation layer; leaf paths for Chicago, Urbana, and Springfield carry cell counts that are summarized and aggregated upward toward the root]
The itemset ( ) is deleted.
That is why we choose a large number of buckets: more itemsets can be deleted
Pruning Itemsets – Apriori Rule
[Figure: per-bucket counts of an itemset (1, 2, 2, 1) being combined with a new batch of counts]
Weakness:
Space bound is not good
The output is based on all previous data, but often only recent data matters
Hoeffding bound: for a random variable r with range R and n independent observations, with probability 1 − δ the true mean of r differs from the sample mean by at most
ε = √( R² ln(1/δ) / (2n) )
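The Hoeffding bound ε = √(R² ln(1/δ) / (2n)) is a one-line computation; this small sketch (function name and parameter values are illustrative) shows how ε shrinks as more examples arrive:

```python
import math

def hoeffding_epsilon(R, delta, n):
    """Hoeffding bound: with probability 1 - delta, the true mean of a
    random variable with range R differs from the mean of n independent
    observations by at most this epsilon."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

# e.g., information gain with 2 classes has range R = log2(2) = 1
for n in (100, 1000, 10000):
    print(n, round(hoeffding_epsilon(R=1.0, delta=1e-7, n=n), 4))
```

Note that the bound depends only on R, δ, and n—not on the distribution of r—which is what makes it usable on a stream.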
Hoeffding Tree Algorithm
Hoeffding Tree Input
S: sequence of examples
X: attributes
G( ): evaluation function
δ: desired accuracy (ε below is derived from δ via the Hoeffding bound)
Hoeffding Tree Algorithm
for each example in S
retrieve G(Xa) and G(Xb) // the two attributes with the highest G(Xi)
if ( G(Xa) – G(Xb) > ε )
split on Xa
recurse to next node
break
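The split test in the algorithm above can be made concrete. In this sketch (function name and the gain values are illustrative, not from the slides), a node splits only once the observed gap between the two best attributes exceeds the Hoeffding ε:

```python
import math

def should_split(gain_best, gain_second, R, delta, n):
    """Hoeffding test used by the algorithm above: split when the
    observed gap G(Xa) - G(Xb) exceeds epsilon, so that with
    probability 1 - delta, Xa is truly the best attribute."""
    epsilon = math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))
    return (gain_best - gain_second) > epsilon

# With few examples a gap of 0.05 is not yet statistically significant...
print(should_split(0.30, 0.25, R=1.0, delta=1e-6, n=100))
# ...but the very same gap becomes decisive as n grows.
print(should_split(0.30, 0.25, R=1.0, delta=1e-6, n=5000))
```

This is why the tree grows incrementally: each leaf simply accumulates examples until its gap clears ε.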
Decision-Tree Induction with Data Streams
[Figure: as the data stream arrives, the tree grows—the root test "Packets > 10" (yes/no) first acquires a child test "Protocol = http"; later, the yes branch gains a further test "Bytes > 60K" above "Protocol = http"]
Strengths
Scales better than traditional methods
Incremental
Weakness
Could spend a lot of time resolving ties between attributes
VFDT (Very Fast Decision Tree) refinements:
G computed only every n_min examples
Deactivates certain leaves to save memory
CVFDT (Concept-adapting VFDT):
Increments counts with each new example
Uses a sliding window over the stream to adapt to concept drift
[Figure: hierarchy with data points at the bottom, level-i medians above them, and level-(i+1) medians at the top]
Method:
Maintain at most m level-i medians
On seeing m of them, generate O(k) level-(i+1) medians, each of weight equal to the sum of the weights of the intermediate medians assigned to it
Drawbacks:
Low quality for evolving data streams (registers only k centers)
Limited functionality in discovering and exploring clusters over different portions of the stream over time
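The hierarchical method above can be sketched briefly. All names here are ours, and `k_medians` is a deliberately naive stand-in (random centers) for the proper bicriteria clustering subroutine the real algorithm uses; the point is the level structure and the weight bookkeeping:

```python
import random

def k_medians(points, k, rng):
    """Toy weighted k-medians on 1-D points [(value, weight), ...]:
    pick k centers and assign each point's weight to its nearest one.
    (Placeholder for a real clustering subroutine.)"""
    centers = [p[0] for p in rng.sample(points, k)]
    weights = [0.0] * k
    for value, w in points:
        j = min(range(k), key=lambda i: abs(value - centers[i]))
        weights[j] += w
    return list(zip(centers, weights))

class StreamKMedianHierarchy:
    """Keep at most m level-i medians; when m accumulate, replace them
    with O(k) level-(i+1) medians whose weights sum the weights of the
    intermediate medians assigned to them."""

    def __init__(self, m=10, k=2, seed=0):
        self.m, self.k = m, k
        self.rng = random.Random(seed)
        self.levels = [[]]  # levels[i] holds level-i medians (value, weight)

    def insert(self, value):
        self._add(0, (value, 1.0))  # a raw data point is a weight-1 median

    def _add(self, level, median):
        if level == len(self.levels):
            self.levels.append([])
        self.levels[level].append(median)
        if len(self.levels[level]) == self.m:
            merged = k_medians(self.levels[level], self.k, self.rng)
            self.levels[level] = []       # m medians collapse into O(k)
            for med in merged:
                self._add(level + 1, med)

data_rng = random.Random(42)
h = StreamKMedianHierarchy(m=10, k=2)
for _ in range(100):
    h.insert(data_rng.choice([0.0, 10.0]) + data_rng.random())
total_weight = sum(w for level in h.levels for _, w in level)
print(total_weight)  # total weight is conserved across merges
```

Because each merge transfers all of its input weight to the surviving medians, the summary always accounts for every point seen—while the drawback noted above remains: only a handful of centers survive, so evolving streams are summarized coarsely.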