
Anomaly Detection with Extreme Value Theory

A. Siffer, P-A Fouque, A. Termier and C. Largouet


April 26, 2017
Contents

Context

Providing better thresholds

Finding anomalies in streams

Application to intrusion detection

A more general framework

Context

General motivations

⊸ Massive usage of the Internet
  • More and more vulnerabilities
  • More and more threats
⊸ Growing awareness of sensitive data and infrastructures

⇒ Network security: a major concern

A Solution

⊸ IDS (Intrusion Detection System)
  • Monitor traffic
  • Detect attacks
⊸ Current methods: rule-based
  • Work fine on common and well-known attacks
  • Cannot detect new attacks
⊸ Emerging methods: anomaly-based
  • Use the network data to estimate a normal behavior
  • Apply algorithms to detect abnormal events (→ attacks)

Overview

⊸ Basic scheme: data → Algorithm → alerts
⊸ Many "standard" algorithms have been tested
⊸ Complex pipelines are emerging (ensemble/hybrid techniques)

Inherent problem

⊸ Algorithms are not magic
  • They give some information about the data (scores)
  • But the decision often relies on a human choice:
    if score > threshold then trigger alert
⊸ The thresholds are often hard-set, based on
  • Expertise
  • Fine-tuning
  • Distribution assumptions
⊸ Our idea: provide a dynamic threshold with a probabilistic meaning

Providing better thresholds

My problem

[Histogram: frequency of the daily payment by credit card (€), from 0 to 200]

⊸ How to set zq such that P(X > zq) < q ?

Solution 1: empirical approach

[Same histogram, with the empirical quantile zq splitting the mass into 1 − q below and q above]

⊸ Drawbacks: the estimate is stuck inside the observed interval (no extrapolation beyond the data) and its tail resolution is poor

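As a concrete illustration (not on the slides), here is a minimal sketch of the empirical approach in Python; the payment data are simulated, so the numbers are only illustrative.

```python
import numpy as np

# Minimal sketch of the empirical approach: take z_q as the (1 - q)-quantile
# of the observed data. The "payments" sample is simulated for illustration.
rng = np.random.default_rng(0)
payments = rng.gamma(shape=3.0, scale=20.0, size=10_000)

q = 1e-3
z_q = np.quantile(payments, 1 - q)
print(f"empirical z_q = {z_q:.1f}")

# Drawbacks illustrated: z_q can never exceed payments.max(), and once
# q < 1 / len(payments) the estimate collapses onto the sample maximum.
```
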
Solution 2: Standard Model

[Same histogram, with a standard parametric distribution fitted to the data and zq read from its tail]

⊸ Drawbacks: manual step, distribution assumption

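A matching sketch for the standard-model approach, here assuming a Gaussian fit (that distributional choice is exactly the manual step the slide warns about); again the data are simulated.

```python
import numpy as np
from scipy import stats

# Sketch of the "standard model" approach: fit a Gaussian, read z_q from its tail.
rng = np.random.default_rng(0)
payments = rng.gamma(shape=3.0, scale=20.0, size=10_000)  # simulated stand-in data

q = 1e-3
mu, sigma = payments.mean(), payments.std(ddof=1)
z_q = stats.norm.ppf(1 - q, loc=mu, scale=sigma)
print(f"Gaussian-model z_q = {z_q:.1f}")

# If the true tail is heavier than Gaussian, this z_q is systematically too low,
# so the intended guarantee P(X > z_q) < q silently breaks.
```
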
Realities

[Four histograms of the same kind of feature, with very different shapes and ranges]

⊸ Different clients and/or temporal drift

Results

  Properties               Empirical quantile   Standard model
  statistical guarantees   Yes                  Yes
  easy to adapt            Yes                  No
  high resolution          No                   Yes

Inspection of extreme events

[Same histogram of daily payments by credit card (€), annotated "Probability estimation?" over the tail region]

Extreme Value Theory

⊸ Main result (Fisher-Tippett-Gnedenko, 1928)

  The extreme values of any distribution have nearly the same
  distribution (called the Extreme Value Distribution)

[Plot of the tail function P(X > x) for a heavy tail, an exponential tail and a bounded tail]

An impressive analogy

⊸ Let X1, X2, ..., Xn be a sequence of i.i.d. random variables, and define

  S_n = \sum_{i=1}^{n} X_i  \qquad  M_n = \max_{1 \le i \le n} X_i

⊸ Central Limit Theorem

  \frac{S_n - n\mu}{\sqrt{n}} \;\xrightarrow{d}\; \mathcal{N}(0, \sigma^2)

⊸ FTG Theorem

  \frac{M_n - a_n}{b_n} \;\xrightarrow{d}\; \mathrm{EVD}(\gamma)

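A quick numerical illustration of the FTG theorem (not from the slides): for Exp(1) variables the normalizing constants are known (a_n = log n, b_n = 1) and the limit is the Gumbel law, the γ = 0 member of the EVD family.

```python
import numpy as np

# Block maxima of Exp(1) samples, centered by a_n = log(n): their tail probabilities
# should be close to those of the Gumbel distribution (EVD with gamma = 0).
rng = np.random.default_rng(0)
n, n_blocks = 1_000, 5_000
maxima = rng.exponential(size=(n_blocks, n)).max(axis=1) - np.log(n)

for x in (0.0, 1.0, 2.0):
    empirical = (maxima > x).mean()
    gumbel = 1.0 - np.exp(-np.exp(-x))   # Gumbel survival function
    print(f"x = {x:.0f}: empirical {empirical:.3f} vs Gumbel {gumbel:.3f}")
```
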
A more practical result

⊸ Second theorem of EVT (Pickands-Balkema-de Haan, 1974)

  The excesses over a high threshold follow a Generalized Pareto
  Distribution (with parameters γ, σ)

⊸ What does it imply?
  • we have a model for extreme events
  • we can compute zq for q as small as desired

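For reference (not shown on the slide), a standard parametrisation of the GPD tail that matches the γ, σ notation above; for γ < 0 the support is restricted to 0 ≤ y ≤ −σ/γ:

```latex
\bar{F}_{\gamma,\sigma}(y) \;=\; \mathbb{P}(Y > y) \;=\;
\begin{cases}
  \left(1 + \dfrac{\gamma y}{\sigma}\right)^{-1/\gamma}, & \gamma \neq 0,\\[1.5ex]
  \exp\!\left(-\dfrac{y}{\sigma}\right), & \gamma = 0,
\end{cases}
\qquad y \ge 0 .
```
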
How to use EVT

⊸ Get some data X1, X2, ..., Xn
⊸ Set a high threshold t and retrieve the excesses Yj = Xkj − t whenever Xkj > t
⊸ Fit a GPD to the Yj (→ find the parameters γ, σ)
⊸ Compute zq such that P(X > zq) < q

  In short: X1, X2, ..., Xn and q  →  EVT  →  zq

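Putting the recipe together, a minimal Peaks-Over-Threshold sketch in Python. It relies on scipy's generic maximum-likelihood GPD fit (the authors' estimation procedure may differ) and on the standard POT quantile estimate z_q = t + (σ/γ)((q·n/N_t)^(−γ) − 1); the function name and the 98% initial threshold are illustrative choices.

```python
import numpy as np
from scipy.stats import genpareto

def pot_threshold(data, q=1e-3, t=None):
    """Estimate z_q such that P(X > z_q) < q, via Peaks-Over-Threshold."""
    data = np.asarray(data)
    n = len(data)
    if t is None:
        t = np.quantile(data, 0.98)             # "high" initial threshold (a common choice)
    excesses = data[data > t] - t               # Y_j = X_kj - t
    n_t = len(excesses)
    gamma, _, sigma = genpareto.fit(excesses, floc=0)   # fit GPD(gamma, sigma) to the excesses
    if abs(gamma) > 1e-9:
        return t + (sigma / gamma) * ((q * n / n_t) ** (-gamma) - 1.0)
    return t - sigma * np.log(q * n / n_t)      # gamma ~ 0: exponential tail

rng = np.random.default_rng(0)
print(pot_threshold(rng.standard_normal(200_000), q=1e-3))   # close to the N(0,1) value 3.09
```
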
Finding anomalies in streams

Streaming Peaks-Over-Threshold (SPOT) algorithm

⊸ Calibration: from an initial batch X1, ..., Xn, set a high threshold t, fit the GPD to the excesses and compute zq
⊸ Stream: for each new value Xi (i > n)
  • if Xi > zq → trigger an alarm
  • else if Xi > t → update the model (add the excess, recompute zq)
  • otherwise → drop the value

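A simplified, illustrative version of the streaming loop above (a sketch, not the authors' reference implementation: it naively refits the GPD at every update, where a real implementation would do so incrementally). Typical use: calibrate on the initial batch, then call step() on each streamed value.

```python
import numpy as np
from scipy.stats import genpareto

class SimpleSPOT:
    """Streaming Peaks-Over-Threshold, illustrative sketch (class and method names are hypothetical)."""

    def __init__(self, q=1e-3):
        self.q = q

    def calibrate(self, batch):
        batch = np.asarray(batch)
        self.n = len(batch)
        self.t = np.quantile(batch, 0.98)               # initial high threshold
        self.excesses = list(batch[batch > self.t] - self.t)
        self._update_zq()

    def _update_zq(self):
        gamma, _, sigma = genpareto.fit(self.excesses, floc=0)
        n_t = len(self.excesses)
        if abs(gamma) > 1e-9:
            self.z_q = self.t + (sigma / gamma) * ((self.q * self.n / n_t) ** (-gamma) - 1.0)
        else:
            self.z_q = self.t - sigma * np.log(self.q * self.n / n_t)

    def step(self, x):
        """Process one streamed value; return True if it is flagged as anomalous."""
        if x > self.z_q:
            return True                                  # trigger alarm (the model is not updated)
        self.n += 1
        if x > self.t:                                   # extreme but normal: update the model
            self.excesses.append(x - self.t)
            self._update_zq()
        return False
```
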
Can we trust that threshold zq ?

⊸ An example with ground truth: a Gaussian White Noise
  • 40 streams of 200 000 i.i.d. variables drawn from N(0, 1)
  • q = 10−3 ⇒ theoretical threshold zth ≃ 3.09
⊸ Averaged relative error

[Plot: averaged relative error of zq (roughly 0.01 to 0.04) against the number of observations, from 0 to 200 000]

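The ground-truth value quoted above is easy to verify; a one-line sanity check of the target threshold:

```python
from scipy.stats import norm

# For q = 1e-3 and X ~ N(0, 1), the theoretical threshold is the (1 - q)-quantile;
# the relative error plotted above is then |z_q - z_th| / z_th.
q = 1e-3
print(norm.ppf(1 - q))   # ~ 3.09, as on the slide
```
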
Application to intrusion detection

About the data

⊸ Lack of relevant public datasets to test the algorithms...
⊸ KDD99? See [McHugh 2000] and [Mahoney & Chan 2003]
⊸ We rather use MAWI
  • 15 min a day of real traffic (.pcap file)
  • Anomaly patterns given by MAWILab [Fontugne et al. 2010], with a taxonomy [Mazel et al. 2014]
⊸ Preprocessing step: raw .pcap → NetFlow format (only metadata)

An example to detect network SYN scans

⊸ The ratio of SYN packets is a relevant feature to detect network scans [Fernandes & Owezarski 2009]

[Time series: ratio of SYN packets in 50 ms time windows, over about 800 s of traffic]

⊸ Goal: find the peaks

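A hedged sketch of how such a feature could be computed with pandas, assuming a packet-level DataFrame with a 'timestamp' column and a boolean 'is_syn' flag (these column names are hypothetical, not the authors' preprocessing code). The resulting series is what SPOT thresholds on the next slide.

```python
import pandas as pd

def syn_ratio_series(packets: pd.DataFrame, window: str = "50ms") -> pd.Series:
    """Fraction of SYN packets in each fixed-size time window."""
    grouped = packets.set_index("timestamp").resample(window)
    return grouped["is_syn"].mean().fillna(0.0)   # SYN packets / all packets per window
```
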
SPOT results

⊸ Parameters: q = 10−4, n = 2000 (initial batch taken from the previous day's record)

[The same SYN-ratio time series, showing the alarms raised by SPOT on the peaks]

Do we really flag scan attacks?

⊸ The main parameter q: a False Positive regulator

[Curve of True Positive rate (%) against False Positive rate (%) as q varies from 8·10−6 to 6·10−3]

⊸ 86% of the scan flows detected with less than 4% of False Positives

A more general framework

SPOT Specifications

⊸ A single main parameter q
  • With a probabilistic meaning → P(X > zq) < q
  • False Positive regulator
⊸ Stream capable
  • Incremental learning
  • Fast (∼ 1000 values/s)
  • Low memory usage (only the excesses)

Other things?

⊸ SPOT
  • performs dynamic thresholding without any distribution assumption
  • uses it to detect network anomalies
⊸ But it could be adapted to
  • compute upper and lower thresholds
  • other fields
  • drifting contexts (with an additional parameter) → DSPOT (see the sketch below)

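A hedged sketch of one way to handle drifting contexts in the spirit of DSPOT: run SPOT on the deviations from a local moving average, the window size d being assumed to play the role of the additional parameter mentioned above (SimpleSPOT refers to the earlier sketch, and `spot` is assumed to be calibrated on such relative values).

```python
from collections import deque
import numpy as np

def drift_aware_stream(stream, spot, d=10):
    """Yield (value, is_anomaly) pairs, thresholding each value relative to a local mean."""
    window = deque(maxlen=d)
    for x in stream:
        if len(window) < d:
            window.append(x)                      # warm-up of the local model
            continue
        local_mean = float(np.mean(window))
        is_anomaly = spot.step(x - local_mean)    # SPOT on the relative value
        if not is_anomaly:
            window.append(x)                      # only normal values update the local model
        yield x, is_anomaly
```
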
A recent example

⊸ Thursday, February 9, 2017
  • 9:00 am: explosion at the Flamanville nuclear plant
  • 11:00 am: official declaration of the incident by EDF
⊸ What about the EDF stock price?

EDF stock price (€)

[Intraday EDF stock price on February 9, 2017, from about 09:02 to 17:14, with prices between €9.05 and €9.40]

Conclusion

⊸ Context: a great deal of work has been done to develop anomaly detection algorithms
⊸ Problem: decision thresholds rely on either distribution assumptions or expertise
⊸ Our solution: building a dynamic threshold with a probabilistic meaning
  • applied here to the detection of network anomalies
  • but a general tool to monitor online time series in a blind way
⊸ Future: adapt the method to higher dimensions
