
Anomaly Detection with Extreme Value Theory

A. Siffer, P-A Fouque, A. Termier and C. Largouet


April 26, 2017
Contents

Context

Providing better thresholds

Finding anomalies in streams

Application to intrusion detection

A more general framework

Context

General motivations

⊸ Massive usage of the Internet
  • More and more vulnerabilities
  • More and more threats
⊸ Growing awareness of sensitive data and infrastructures

⇒ Network security: a major concern

A Solution

⊸ IDS (Intrusion Detection System)
  • Monitor traffic
  • Detect attacks
⊸ Current methods: rule-based
  • Work fine on common and well-known attacks
  • Cannot detect new attacks
⊸ Emerging methods: anomaly-based
  • Use the network data to estimate a normal behavior
  • Apply algorithms to detect abnormal events (→ attacks)

Overview

⊸ Basic scheme: data → Algorithm → alerts
⊸ Many "standard" algorithms have been tested
⊸ Complex pipelines are emerging (ensemble/hybrid techniques)

Inherent problem

⊸ Algorithms are not magic
  • They give some information about the data (scores)
  • But the decision often relies on a human choice:
    if score > threshold then trigger alert
⊸ The thresholds are often hard-set, based on
  • Expertise
  • Fine-tuning
  • Distribution assumptions
⊸ Our idea: provide a dynamic threshold with a probabilistic meaning

Providing better thresholds

My problem

[Histogram: frequency of the daily payment by credit card (€), from 0 to 200]

⊸ How to set zq such that P(X > zq) < q ?

Solution 1: empirical approach

[Same histogram, with the empirical quantile zq splitting the mass into 1 − q below and q above]

⊸ Drawbacks: the estimate is stuck inside the observed interval (no extrapolation beyond the data) and its tail resolution is poor

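As a concrete illustration (not on the slides), here is a minimal sketch of the empirical approach in Python; the payment data are simulated, so the numbers are only illustrative.

```python
import numpy as np

# Minimal sketch of the empirical approach: take z_q as the (1 - q)-quantile
# of the observed data. The "payments" sample is simulated for illustration.
rng = np.random.default_rng(0)
payments = rng.gamma(shape=3.0, scale=20.0, size=10_000)

q = 1e-3
z_q = np.quantile(payments, 1 - q)
print(f"empirical z_q = {z_q:.1f}")

# Drawbacks illustrated: z_q can never exceed payments.max(), and once
# q < 1 / len(payments) the estimate collapses onto the sample maximum.
```
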
Solution 2: Standard Model

[Same histogram, with a standard parametric distribution fitted to the data and zq read from its tail]

⊸ Drawbacks: manual step, distribution assumption

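A matching sketch for the standard-model approach, here assuming a Gaussian fit (that distributional choice is exactly the manual step the slide warns about); again the data are simulated.

```python
import numpy as np
from scipy import stats

# Sketch of the "standard model" approach: fit a Gaussian, read z_q from its tail.
rng = np.random.default_rng(0)
payments = rng.gamma(shape=3.0, scale=20.0, size=10_000)  # simulated stand-in data

q = 1e-3
mu, sigma = payments.mean(), payments.std(ddof=1)
z_q = stats.norm.ppf(1 - q, loc=mu, scale=sigma)
print(f"Gaussian-model z_q = {z_q:.1f}")

# If the true tail is heavier than Gaussian, this z_q is systematically too low,
# so the intended guarantee P(X > z_q) < q silently breaks.
```
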
Realities

[Four histograms of the same kind of feature, with very different shapes and ranges]

⊸ Different clients and/or temporal drift

Results

  Properties               Empirical quantile   Standard model
  statistical guarantees   Yes                  Yes
  easy to adapt            Yes                  No
  high resolution          No                   Yes

Inspection of extreme events

[Same histogram of daily payments by credit card (€), annotated "Probability estimation?" over the tail region]

Extreme Value Theory

⊸ Main result (Fisher-Tippett-Gnedenko, 1928)

  The extreme values of any distribution have nearly the same
  distribution (called the Extreme Value Distribution)

[Plot of the tail function P(X > x) for a heavy tail, an exponential tail and a bounded tail]

An impressive analogy

⊸ Let X1, X2, ..., Xn be a sequence of i.i.d. random variables, and define

  S_n = \sum_{i=1}^{n} X_i  \qquad  M_n = \max_{1 \le i \le n} X_i

⊸ Central Limit Theorem

  \frac{S_n - n\mu}{\sqrt{n}} \;\xrightarrow{d}\; \mathcal{N}(0, \sigma^2)

⊸ FTG Theorem

  \frac{M_n - a_n}{b_n} \;\xrightarrow{d}\; \mathrm{EVD}(\gamma)

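A quick numerical illustration of the FTG theorem (not from the slides): for Exp(1) variables the normalizing constants are known (a_n = log n, b_n = 1) and the limit is the Gumbel law, the γ = 0 member of the EVD family.

```python
import numpy as np

# Block maxima of Exp(1) samples, centered by a_n = log(n): their tail probabilities
# should be close to those of the Gumbel distribution (EVD with gamma = 0).
rng = np.random.default_rng(0)
n, n_blocks = 1_000, 5_000
maxima = rng.exponential(size=(n_blocks, n)).max(axis=1) - np.log(n)

for x in (0.0, 1.0, 2.0):
    empirical = (maxima > x).mean()
    gumbel = 1.0 - np.exp(-np.exp(-x))   # Gumbel survival function
    print(f"x = {x:.0f}: empirical {empirical:.3f} vs Gumbel {gumbel:.3f}")
```
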
A more practical result

⊸ Second theorem of EVT (Pickands-Balkema-de Haan, 1974)

  The excesses over a high threshold follow a Generalized Pareto
  Distribution (with parameters γ, σ)

⊸ What does it imply?
  • we have a model for extreme events
  • we can compute zq for q as small as desired

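For reference (not shown on the slide), a standard parametrisation of the GPD tail that matches the γ, σ notation above; for γ < 0 the support is restricted to 0 ≤ y ≤ −σ/γ:

```latex
\bar{F}_{\gamma,\sigma}(y) \;=\; \mathbb{P}(Y > y) \;=\;
\begin{cases}
  \left(1 + \dfrac{\gamma y}{\sigma}\right)^{-1/\gamma}, & \gamma \neq 0,\\[1.5ex]
  \exp\!\left(-\dfrac{y}{\sigma}\right), & \gamma = 0,
\end{cases}
\qquad y \ge 0 .
```
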
How to use EVT

⊸ Get some data X1, X2, ..., Xn
⊸ Set a high threshold t and retrieve the excesses Yj = Xkj − t whenever Xkj > t
⊸ Fit a GPD to the Yj (→ find the parameters γ, σ)
⊸ Compute zq such that P(X > zq) < q

  In short: X1, X2, ..., Xn and q  →  EVT  →  zq

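Putting the recipe together, a minimal Peaks-Over-Threshold sketch in Python. It relies on scipy's generic maximum-likelihood GPD fit (the authors' estimation procedure may differ) and on the standard POT quantile estimate z_q = t + (σ/γ)((q·n/N_t)^(−γ) − 1); the function name and the 98% initial threshold are illustrative choices.

```python
import numpy as np
from scipy.stats import genpareto

def pot_threshold(data, q=1e-3, t=None):
    """Estimate z_q such that P(X > z_q) < q, via Peaks-Over-Threshold."""
    data = np.asarray(data)
    n = len(data)
    if t is None:
        t = np.quantile(data, 0.98)             # "high" initial threshold (a common choice)
    excesses = data[data > t] - t               # Y_j = X_kj - t
    n_t = len(excesses)
    gamma, _, sigma = genpareto.fit(excesses, floc=0)   # fit GPD(gamma, sigma) to the excesses
    if abs(gamma) > 1e-9:
        return t + (sigma / gamma) * ((q * n / n_t) ** (-gamma) - 1.0)
    return t - sigma * np.log(q * n / n_t)      # gamma ~ 0: exponential tail

rng = np.random.default_rng(0)
print(pot_threshold(rng.standard_normal(200_000), q=1e-3))   # close to the N(0,1) value 3.09
```
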
Finding anomalies in streams

Streaming Peaks-Over-Threshold (SPOT) algorithm

⊸ Calibration: from an initial batch X1, ..., Xn, set a high threshold t, fit the GPD to the excesses and compute zq
⊸ Stream: for each new value Xi (i > n)
  • if Xi > zq → trigger an alarm
  • else if Xi > t → update the model (add the excess, recompute zq)
  • otherwise → drop the value

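A simplified, illustrative version of the streaming loop above (a sketch, not the authors' reference implementation: it naively refits the GPD at every update, where a real implementation would do so incrementally). Typical use: calibrate on the initial batch, then call step() on each streamed value.

```python
import numpy as np
from scipy.stats import genpareto

class SimpleSPOT:
    """Streaming Peaks-Over-Threshold, illustrative sketch (class and method names are hypothetical)."""

    def __init__(self, q=1e-3):
        self.q = q

    def calibrate(self, batch):
        batch = np.asarray(batch)
        self.n = len(batch)
        self.t = np.quantile(batch, 0.98)               # initial high threshold
        self.excesses = list(batch[batch > self.t] - self.t)
        self._update_zq()

    def _update_zq(self):
        gamma, _, sigma = genpareto.fit(self.excesses, floc=0)
        n_t = len(self.excesses)
        if abs(gamma) > 1e-9:
            self.z_q = self.t + (sigma / gamma) * ((self.q * self.n / n_t) ** (-gamma) - 1.0)
        else:
            self.z_q = self.t - sigma * np.log(self.q * self.n / n_t)

    def step(self, x):
        """Process one streamed value; return True if it is flagged as anomalous."""
        if x > self.z_q:
            return True                                  # trigger alarm (the model is not updated)
        self.n += 1
        if x > self.t:                                   # extreme but normal: update the model
            self.excesses.append(x - self.t)
            self._update_zq()
        return False
```
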
Can we trust that threshold zq ?

⊸ An example with ground truth: a Gaussian White Noise
  • 40 streams of 200 000 i.i.d. variables drawn from N(0, 1)
  • q = 10−3 ⇒ theoretical threshold zth ≃ 3.09
⊸ Averaged relative error

[Plot: averaged relative error of zq (roughly 0.01 to 0.04) against the number of observations, from 0 to 200 000]

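The ground-truth value quoted above is easy to verify; a one-line sanity check of the target threshold:

```python
from scipy.stats import norm

# For q = 1e-3 and X ~ N(0, 1), the theoretical threshold is the (1 - q)-quantile;
# the relative error plotted above is then |z_q - z_th| / z_th.
q = 1e-3
print(norm.ppf(1 - q))   # ~ 3.09, as on the slide
```
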
Application to intrusion detection

About the data

⊸ Lack of relevant public datasets to test the algorithms...
⊸ KDD99? See [McHugh 2000] and [Mahoney & Chan 2003]
⊸ We rather use MAWI
  • 15 min a day of real traffic (.pcap file)
  • Anomaly patterns given by MAWILab [Fontugne et al. 2010], with a taxonomy [Mazel et al. 2014]
⊸ Preprocessing step: raw .pcap → NetFlow format (only metadata)

An example to detect network SYN scans

⊸ The ratio of SYN packets is a relevant feature to detect network scans [Fernandes & Owezarski 2009]

[Time series: ratio of SYN packets in 50 ms time windows, over about 800 s of traffic]

⊸ Goal: find the peaks

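A hedged sketch of how such a feature could be computed with pandas, assuming a packet-level DataFrame with a 'timestamp' column and a boolean 'is_syn' flag (these column names are hypothetical, not the authors' preprocessing code). The resulting series is what SPOT thresholds on the next slide.

```python
import pandas as pd

def syn_ratio_series(packets: pd.DataFrame, window: str = "50ms") -> pd.Series:
    """Fraction of SYN packets in each fixed-size time window."""
    grouped = packets.set_index("timestamp").resample(window)
    return grouped["is_syn"].mean().fillna(0.0)   # SYN packets / all packets per window
```
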
SPOT results

⊸ Parameters: q = 10−4, n = 2000 (initial batch taken from the previous day's record)

[The same SYN-ratio time series, showing the alarms raised by SPOT on the peaks]

Do we really flag scan attacks?

⊸ The main parameter q: a False Positive regulator

[Curve of True Positive rate (%) against False Positive rate (%) as q varies from 8·10−6 to 6·10−3]

⊸ 86% of the scan flows detected with less than 4% of False Positives

A more general framework

SPOT Specifications

⊸ A single main parameter q
  • With a probabilistic meaning → P(X > zq) < q
  • False Positive regulator
⊸ Stream capable
  • Incremental learning
  • Fast (∼ 1000 values/s)
  • Low memory usage (only the excesses)

Other things?

⊸ SPOT
  • performs dynamic thresholding without any distribution assumption
  • uses it to detect network anomalies
⊸ But it could be adapted to
  • compute upper and lower thresholds
  • other fields
  • drifting contexts (with an additional parameter) → DSPOT (see the sketch below)

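A hedged sketch of one way to handle drifting contexts in the spirit of DSPOT: run SPOT on the deviations from a local moving average, the window size d being assumed to play the role of the additional parameter mentioned above (SimpleSPOT refers to the earlier sketch, and `spot` is assumed to be calibrated on such relative values).

```python
from collections import deque
import numpy as np

def drift_aware_stream(stream, spot, d=10):
    """Yield (value, is_anomaly) pairs, thresholding each value relative to a local mean."""
    window = deque(maxlen=d)
    for x in stream:
        if len(window) < d:
            window.append(x)                      # warm-up of the local model
            continue
        local_mean = float(np.mean(window))
        is_anomaly = spot.step(x - local_mean)    # SPOT on the relative value
        if not is_anomaly:
            window.append(x)                      # only normal values update the local model
        yield x, is_anomaly
```
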
A recent example

⊸ Thursday, February 9, 2017
  • 9:00 am: explosion at the Flamanville nuclear plant
  • 11:00 am: official declaration of the incident by EDF
⊸ What about the EDF stock price?

EDF stock price (€)

[Intraday EDF stock price on February 9, 2017, from about 09:02 to 17:14, with prices between €9.05 and €9.40]

Conclusion

⊸ Context: a great deal of work has been done to develop anomaly detection algorithms
⊸ Problem: decision thresholds rely on either distribution assumptions or expertise
⊸ Our solution: building a dynamic threshold with a probabilistic meaning
  • applied here to the detection of network anomalies
  • but a general tool to monitor online time series in a blind way
⊸ Future: adapt the method to higher dimensions
