Incremental LOF Algorithm


Incremental Local Outlier Factor Algorithm

Dragoljub Pokrajac, AMRC, DSU
Longin Jan Latecki, Temple University
Aleksandar Lazarevic, UTC

University of Delaware, 10/23/2006
Static LOF Algorithm for Outlier Detection (Breunig et al., 2000)
• Main idea: estimate the local density of the data
• An outlier is a data point whose local density significantly decreases in comparison to its neighborhood
• We measure:
  – the local density by the local reachability density (lrd)
  – density variation by the LOF factor
Example: Two-Dimensional Data

[Figure: two-dimensional data set with points labeled "outlier" and "not outlier"]
Local Reachability Density (lrd)

[Figure: lrd illustration with k = MinPts = 4]
LOF Factor

[Figure: data points colored by LOF factor (scale from 0.87 to 6.05); the points with the highest LOF values are the identified outliers]
Reachability Distance

$$\mathrm{reach\text{-}dist}_k(O, p) = \begin{cases} d(O, p), & O \text{ is not among the } k \text{ nearest neighbors of } p \\ k\text{-distance}(p), & O \text{ is among the } k \text{ nearest neighbors of } p \end{cases} \qquad (1)$$

[Figure: k-neighborhood of p (k = 3), the corresponding k-distance(p), and the reachability distances from p to particular points]

Local Reachability Density (lrd)
• For each point O, the local reachability density is defined as the inverse of the average reachability distance from O to its k neighbors:

$$\mathrm{lrd}(O) = \frac{1}{\sum_{p \in kNN(O)} \mathrm{reach\text{-}dist}_k(O, p) \, / \, k} \qquad (2)$$

• For given coordinates of the neighbors:
  – the reachability distances will be the smallest possible if O is among the k nearest neighbors of each of its k neighbors
• Heuristically, this means the lrd is large if the neighborhood of O is "compact"
Local Outlier Factor (LOF)
• Heuristically, a point is an outlier if:
  – its neighboring points lie in more "compact" (dense) regions than the region of the point itself
  – we measure "compactness" by lrd
  – hence, we may detect outliers by comparing the average lrd in the neighborhood of a point with the lrd at the point itself
• To quantify outliers, we use the ratio LOF(O), defined as:

$$\mathrm{LOF}(O) = \frac{\frac{1}{k} \sum_{p \in kNN(O)} \mathrm{lrd}(p)}{\mathrm{lrd}(O)} \qquad (3)$$
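To make Eqs. (1)–(3) concrete, the following is a minimal brute-force sketch in Python (an illustration only, not the authors' implementation; the function and variable names are made up, and Eq. (1) is written in its equivalent max form):

```python
import numpy as np

def lof_scores(X, k):
    """Brute-force LOF per Eqs. (1)-(3); O(N^2), for illustration only."""
    N = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    knn = np.argsort(d, axis=1)[:, 1:k + 1]                      # k nearest neighbors (self excluded)
    k_distance = d[np.arange(N), knn[:, -1]]                     # distance to the k-th neighbor

    # Eq. (1): reach-dist_k(O, p) = max(k-distance(p), d(O, p))
    def reach_dist(o, p):
        return max(k_distance[p], d[o, p])

    # Eq. (2): lrd(O) = 1 / (average reach-dist from O to its k neighbors)
    lrd = np.array([1.0 / (sum(reach_dist(o, p) for p in knn[o]) / k) for o in range(N)])

    # Eq. (3): LOF(O) = (mean lrd of O's neighbors) / lrd(O)
    return np.array([lrd[knn[o]].mean() / lrd[o] for o in range(N)])

# Usage: the isolated point at (5, 5) receives by far the largest LOF score.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.], [5., 5.]])
print(lof_scores(X, k=2))
```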
Properties of the LOF Algorithm
• Computational complexity is O(N log N) with a properly chosen indexing structure
• The algorithm works statically, i.e., it does not accommodate incrementally arriving data
Incremental LOF Algorithm

• Goal: update the parameters of the LOF algorithm incrementally, with each new record arriving in the database
• Retain the computational complexity of the static LOF algorithm
Incremental LOF: Insertion (k = 2)

Flow of an insertion, using Eqs. (1)–(3):
1. Insert the new point into the dataset and compute the reach-dist from the new point to its k neighbors
2. Compute the LOF of the new point
3. Update k-distance for the points that have the new point in their k-neighborhood
4. Update reach-dist for the points in the k-neighborhoods of the points whose k-distance was updated
5. Update lrd for all points whose k-neighborhood changed and for the points whose reach-dist to one of their existing k neighbors changed
6. Update LOF for the points whose lrd was updated and for the points where the lrd of one of their k neighbors was updated
k-Nearest Neighbor Query

[Figure: four points — Sue, Tom, Beverly, Allen]

Sue and Beverly are the 2 nearest neighbors of Tom (k = 2)
Inverse k-Nearest Neighbor Query

Sue is among the 2 nearest neighbors of Tom
Sue is among the 2 nearest neighbors of Allen
⇒ Allen and Tom are inverse 2-nearest neighbors of Sue

[Figure: the same four points — Sue, Tom, Beverly, Allen]
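As a small illustration of the two queries, here is a brute-force Python sketch (the helper functions and the point coordinates are made up; they are chosen only to reproduce the relationships on the slide):

```python
import numpy as np

def knn(points, query_idx, k):
    """Indices of the k nearest neighbors of points[query_idx] (brute force)."""
    dist = np.linalg.norm(points - points[query_idx], axis=1)
    return [i for i in np.argsort(dist) if i != query_idx][:k]

def inverse_knn(points, query_idx, k):
    """Indices of points that have points[query_idx] among their k nearest neighbors."""
    return [i for i in range(len(points))
            if i != query_idx and query_idx in knn(points, i, k)]

# Toy layout mirroring the slide (coordinates invented for illustration).
names = ["Sue", "Tom", "Beverly", "Allen"]
pts = np.array([[0.0, 0.0], [2.0, -1.5], [4.0, 0.0], [1.0, 2.5]])

print([names[i] for i in knn(pts, names.index("Tom"), k=2)])          # ['Sue', 'Beverly']
print([names[i] for i in inverse_knn(pts, names.index("Sue"), k=2)])  # ['Tom', 'Allen']
```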
Incremental LOF_insertion(Dataset S)
• Given: set S = {p1, …, pN}, pi ∈ R^D, to be inserted into the database
• For each data point pc in data set S:
    insert(pc)
    compute kNN(pc)
    ∀ pj ∈ kNN(pc):
        compute reach-dist_k(pc, pj) using Eq. (1)
    // Update the neighbors of pc
    S_update_k_distance = inverse kNN(pc)
    ∀ pj ∈ S_update_k_distance:
        update k-distance(pj)
    S_update_lrd = S_update_k_distance
    ∀ pj ∈ S_update_k_distance, ∀ pi ∈ kNN(pj) \ {pc}:
        reach-dist_k(pi, pj) = k-distance(pj)
        if pj ∈ kNN(pi): S_update_lrd = S_update_lrd ∪ {pi}
    S_update_LOF = S_update_lrd
    ∀ pm ∈ S_update_lrd:
        update lrd(pm) using Eq. (2)
        S_update_LOF = S_update_LOF ∪ inverse kNN(pm)
    ∀ pl ∈ S_update_LOF:
        update LOF(pl) using Eq. (3)
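The pseudocode above can be mirrored in a short brute-force Python sketch (purely illustrative, not the authors' code: it recomputes neighborhoods exhaustively instead of using an indexing structure, and the class and method names are invented):

```python
import numpy as np

class IncrementalLOFSketch:
    """Illustrative brute-force sketch of incremental LOF insertion."""

    def __init__(self, k):
        self.k = k
        self.pts = []      # inserted points
        self.lrd = {}      # point index -> lrd, Eq. (2)
        self.lof = {}      # point index -> LOF, Eq. (3)

    def _knn(self, i):
        d = [np.linalg.norm(self.pts[i] - q) for q in self.pts]
        return [j for j in sorted(range(len(self.pts)), key=d.__getitem__) if j != i][: self.k]

    def _inverse_knn(self, i):
        return [j for j in range(len(self.pts)) if j != i and i in self._knn(j)]

    def _k_distance(self, i):
        return np.linalg.norm(self.pts[i] - self.pts[self._knn(i)[-1]])

    def _update_lrd(self, i):
        # Eq. (2), with Eq. (1) written in its equivalent max form
        rd = [max(self._k_distance(j), np.linalg.norm(self.pts[i] - self.pts[j]))
              for j in self._knn(i)]
        self.lrd[i] = 1.0 / (sum(rd) / self.k)

    def _update_lof(self, i):
        self.lof[i] = np.mean([self.lrd[j] for j in self._knn(i)]) / self.lrd[i]

    def insert(self, p):
        self.pts.append(np.asarray(p, dtype=float))
        c = len(self.pts) - 1
        if len(self.pts) <= self.k:                    # not enough points yet
            return None
        s_update_k_dist = self._inverse_knn(c)         # k-distance of these points changed
        s_update_lrd = set(s_update_k_dist)
        for j in s_update_k_dist:                      # reach-dist to j changed for ...
            for i in self._knn(j):                     # ... j's current neighbors
                if i != c and j in self._knn(i):
                    s_update_lrd.add(i)
        s_update_lof = set(s_update_lrd)
        for m in s_update_lrd:
            self._update_lrd(m)
            s_update_lof.update(self._inverse_knn(m))
        self._update_lrd(c)                            # lrd and LOF of the new point itself
        for l in s_update_lof | {c}:
            self._update_lof(l)
        return self.lof[c]
```

Repeated calls to insert() then return the LOF score of each arriving record once at least k + 1 points have been seen.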
IncrementalLOF_deletion(Dataset S, S_delete)
• Given: dataset S = {p1, …, pN}, pi ∈ R^D,
  set S_delete = {p1, …, pM}, pi ∈ R^D, to be deleted
• S_update_k_distance = ∅
• ∀ pc ∈ S_delete:
    S_update_k_distance = S_update_k_distance ∪ inverse kNN(pc)
    delete(pc)   // we can delete pc after finding all its inverse nearest neighbors
• S_update_k_distance = S_update_k_distance \ S_delete   // points from S_delete may still be
                                                          // present when computing inverse nearest neighbors
• ∀ pj ∈ S_update_k_distance:
    update k-distance(pj)
• S_update_lrd = S_update_k_distance
• ∀ pj ∈ S_update_k_distance, ∀ pi ∈ (k−1)NN(pj):
    reach-dist_k(pi, pj) = k-distance(pj)
    if pj ∈ kNN(pi) then S_update_lrd = S_update_lrd ∪ {pi}
• S_update_LOF = S_update_lrd
• ∀ pm ∈ S_update_lrd:
    update lrd(pm) using Eq. (2)
    S_update_LOF = S_update_LOF ∪ inverse kNN(pm)
• ∀ pl ∈ S_update_LOF:
    update LOF(pl) using Eq. (3)
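For the deletion side, the key bookkeeping is collecting the points whose k-distance is affected. Below is a minimal sketch of that first stage, under the same brute-force assumptions as the insertion sketch (function names are illustrative):

```python
import numpy as np

def knn(points, i, k):
    """Indices of the k nearest neighbors of points[i] (brute force)."""
    d = np.linalg.norm(points - points[i], axis=1)
    return [j for j in np.argsort(d) if j != i][:k]

def k_distance_update_set(points, delete_idx, k):
    """First stage of IncrementalLOF_deletion: union of the inverse k-NN of every point
    to be deleted, with the deleted points themselves removed at the end (they may still
    be present while the inverse nearest neighbors are being computed)."""
    delete_idx = set(delete_idx)
    s_update_k_distance = set()
    for c in delete_idx:
        s_update_k_distance |= {i for i in range(len(points))
                                if i != c and c in knn(points, i, k)}
    return s_update_k_distance - delete_idx

# Usage: deleting the point at index 4 only affects its close companion at index 5.
pts = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6]], dtype=float)
print(k_distance_update_set(pts, delete_idx=[4], k=2))   # {5}
```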
Performance of the Incremental LOF Algorithm
• The number of points where reach-dist, lrd and LOF have to be updated is:
  – limited
  – independent of N
• E.g., for insertion:
  – |S_update_k_distance| ≤ F = O(k·2^D)
  – |S_update_lrd| ≤ k·|S_update_k_distance| = O(k²·2^D)
  – |S_update_LOF| ≤ (1 + F)·|S_update_lrd| = O(k³·2^{2D})
• The computational cost of kNN and inverse-kNN queries is O(log N) with a proper data structure
• Therefore: when inserting/deleting N points, incremental LOF performs in O(N log N)
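As a rough numerical illustration (treating the O(·) bounds above as exact counts, an assumption made here only for concreteness), the update-set sizes for, say, k = 5 and D = 2 are small constants that do not involve N:

```python
# Illustrative only: plug k = 5, D = 2 into the bounds above, treated as exact counts.
k, D = 5, 2
F = k * 2**D                      # |S_update_k_distance| <= F              -> 20
lrd_bound = k * F                 # |S_update_lrd| <= k * F                 -> 100
lof_bound = (1 + F) * lrd_bound   # |S_update_LOF| <= (1 + F) * lrd_bound   -> 2100
print(F, lrd_bound, lof_bound)    # none of these numbers depend on N
```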
Simulation Evaluation
• Synthetic data sets with N ∈ {100, 200, …, 5000} records generated
• Dimensionality varied: D ∈ {2, 3, 4, 5, 10}
• For each simulation experiment:
  • the neighborhood size was varied over k ∈ {5, 10, 15, 20}
  • the number of k-distance, reach-dist, lrd and LOF updates was measured
• Experiments repeated 100 times
• Average number of LOF updates reported
Simulation Evaluation

[Figure: number of LOF updates vs. number N of points in the database, k = 5, for dimensions D = 2, 3, 4, 5, 10]


Simulation Evaluation

[Figure: number of LOF updates vs. number N of points in the database, k = 20, for dimensions D = 2, 3, 4, 5, 10]


Simulation Evaluation

[Figure: number of LOF updates vs. neighborhood size k, for dimensions D = 2, 3, 4, 5, 10]


Simulation Evaluation

[Figure: number of LOF updates vs. neighborhood size k, for dimensions D = 2, 3, 4, 5, 10 (note that the abscissa scale is quadratic)]
Application of Incremental LOF

Identifying Distribution with Small Variance


Application of Incremental LOF

Identifying Outbreak of Novel Distribution


Application of Incremental LOF

Outliers in Trajectories
Trajectory Representation
Application of Incremental LOF
• Detection of unusual events in video sequences

[Figure: LOF over video frames for dynamic (incremental) LOF vs. static LOF, with identified transitions marked]
Work in Progress
• Incremental LOF algorithm with “fading
memory”
– Old examples have gradually less and less
influence on the outlier computation
• Incrementalization of other outlier detection
algorithms
• Indexing structures for efficient
implementation of kNN, inverse-kNN
• Computation of tighter bounds for F
Thanks a lot!
• David Mount, CS, Univ. of Maryland, College Park
• Janko Milutinovic, DSU
• Kam Kong, DSU
• Dragana Jankovic, DSU
• Brian Tjaden, Wellesley College, USA
• DOD/DOA for supporting this research
• Dejan Milic, University of Nis
• Nenad Milosevic, University of Nis
• Aleksandra Vidovic, University of Nis
• Zoran Peric, University of Nis
• Jelena Nikolic, University of Nis
• Marko Petrovic, University of Nis
• Aleksandra Jovanovic, University of Nis

For further questions, please contact me at:


dpokraja@desu.edu
dragoljub.pokrajac@comcast.net
