Machine Learning Strategies For Time Series Prediction
Machine Learning Summer School
(Hammamet, 2013)
Gianluca Bontempi
According to MathSciNet
2. Adaptive real-time machine learning for credit card fraud detection (2012-2013).
st = g(t) + ϕt t = 1, . . . , T
[Figure: decomposition of the observed series into trend, seasonal and random components]
$$p(\varphi_2 \mid \varphi_1) = \frac{p(\varphi_1, \varphi_2)}{p(\varphi_1)}$$
p(ϕ1 , . . . , ϕT )
$$\hat\rho(k) = \frac{\hat\gamma(k)}{\hat\gamma(0)}$$
• It follows that this process has constant mean and variance.
[Figure: realization of a white noise process (top) and its sample autocorrelation function for lags 0–30 (bottom)]
ϕt = ϕt−1 + wt
∇ϕt = ϕt − ϕt−1
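A minimal sketch of these two definitions (Python/NumPy is an assumption; the slides themselves contain no code): simulate a random walk by accumulating shocks, then check that the differencing operator ∇ recovers the shocks exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=500)     # white noise shocks w_t
phi = np.cumsum(w)           # random walk: phi_t = phi_{t-1} + w_t
diff = np.diff(phi)          # differencing: nabla phi_t = phi_t - phi_{t-1}
# diff coincides with the shocks w_t (apart from the first term)
```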
[Figure: realization of a random walk over 500 time steps]
ϕt = α1 ϕt−1 + · · · + αn ϕt−n + wt
• This means that the next value is a linear weighted sum of the past n
values plus a random shock.
• Finite memory filter.
• If w is a normal variable, ϕt will be normal too.
• Note that this is like a linear regression model where ϕ is regressed not
on independent variables but on its past values (hence the prefix “auto”).
• The stationarity properties depend on the values αi , i = 1, . . . , n.
ϕt = αϕt−1 + wt
Then
$$E[\varphi_t] = 0, \qquad \text{Var}[\varphi_t] = \sigma_w^2 \,(1 + \alpha^2 + \alpha^4 + \ldots)$$
If |α| < 1 the variance is finite and equals
$$\text{Var}[\varphi_t] = \sigma_\varphi^2 = \frac{\sigma_w^2}{1 - \alpha^2}$$
and the autocorrelation is
$$\rho(k) = \alpha^k, \qquad k = 0, 1, 2, \ldots$$
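These moments can be checked by simulation; a sketch assuming NumPy, with α = 0.6 and σw = 1 chosen only for illustration:

```python
import numpy as np

alpha, sigma_w = 0.6, 1.0
rng = np.random.default_rng(42)
T = 200_000

# simulate the AR(1) recursion phi_t = alpha * phi_{t-1} + w_t
phi = np.empty(T)
phi[0] = 0.0
w = rng.normal(scale=sigma_w, size=T)
for t in range(1, T):
    phi[t] = alpha * phi[t - 1] + w[t]

var_theory = sigma_w**2 / (1 - alpha**2)      # stationary variance
var_sample = phi.var()                        # should be close to var_theory
rho1 = np.corrcoef(phi[:-1], phi[1:])[0, 1]   # lag-1 autocorrelation, close to alpha
```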
Machine Learning Strategies for Prediction – p. 31/128
General order AR(n) process
• It has been shown that a necessary and sufficient condition for stationarity is that the complex roots of the equation
$$\phi(z) = 1 - \alpha_1 z - \cdots - \alpha_n z^n = 0$$
lie outside the unit circle, i.e. |z| > 1.
[Figure: realization of an AR process (top), its sample ACF (middle) and partial ACF (bottom), for lags 0–30]
ϕt = α1 ϕt−1 + · · · + αn ϕt−n + wt
$$\hat\alpha = \arg\min_{\alpha} \sum_{t=n+1}^{T} \left[\varphi_t - \alpha_1 \varphi_{t-1} - \cdots - \alpha_n \varphi_{t-n}\right]^2$$
$$\hat\alpha = \arg\min_{a} \sum_{i=1}^{N} (y_i - x_i^T a)^2 = \arg\min_{a} (Y - Xa)^T (Y - Xa)$$
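The least-squares fit above can be sketched as follows (assuming Python/NumPy; `fit_ar` is a hypothetical helper name, not from the slides). The lagged values are stacked into a regressor matrix X and the normal equations are solved by `lstsq`:

```python
import numpy as np

def fit_ar(series, n):
    """Least-squares estimate of AR(n) coefficients: build the matrix X
    whose row for time t is (phi_{t-1}, ..., phi_{t-n}) and minimise
    ||Y - X a||^2 with Y = (phi_n, ..., phi_{T-1})."""
    T = len(series)
    X = np.column_stack([series[n - j - 1 : T - j - 1] for j in range(n)])
    Y = series[n:]
    a, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return a

# recover the coefficients of a simulated stationary AR(2) process
rng = np.random.default_rng(1)
alpha = np.array([0.5, -0.3])
T = 20_000
phi = np.zeros(T)
w = rng.normal(scale=0.5, size=T)
for t in range(2, T):
    phi[t] = alpha @ phi[t - 2 : t][::-1] + w[t]

a_hat = fit_ar(phi, 2)   # close to (0.5, -0.3)
```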
• linear methods interpret all the structure in a time series through linear
correlation
• deterministic linear dynamics can only lead to simple exponential or
periodically oscillating behavior, so all irregular behavior is attributed to
external noise, while deterministic nonlinear equations can produce
very irregular data;
• in real problems it is extremely unlikely that the variables are linked by a
linear relation.
In practice, the form of the relation is often unknown and only a limited
amount of samples is available.
[Diagram: training dataset → model → prediction]
[Figure: scatter plot of y versus x for the training dataset]
NOTA BENE: this is NOT a time series! Here y = ϕt and x = ϕt−1 : the horizontal axis
does not represent time but the past value of the series.
Model degree 1
[Figure: fit of the degree-1 model; training error = 2]
fˆ(x) = α0 + α1 x
[Figure: fit of the degree-3 model]
fˆ(x) = α0 + α1 x + · · · + α3 x3
[Figure: fit of the degree-18 model]
fˆ(x) = α0 + α1 x + · · · + α18 x¹⁸
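The three fits can be reproduced in a few lines (assuming NumPy; the cubic-plus-noise target below is invented for the example, not the data of the slides). The point is that the training error can only decrease as the degree grows, which is exactly why it cannot be used for model selection:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=30)
y = x**3 - 2 * x + rng.normal(scale=1.0, size=30)   # hypothetical target

train_err = {}
for degree in (1, 3, 18):
    coeffs = np.polyfit(x, y, degree)               # least-squares polynomial fit
    resid = y - np.polyval(coeffs, x)
    train_err[degree] = np.mean(resid**2)           # empirical (training) MSE
# train_err[1] >= train_err[3] >= train_err[18]: higher degree, lower training error
```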
where the intrinsic noise term reflects the target alone, the bias reflects
the target’s relation with the learning algorithm and the variance term
reflects the learning algorithm alone.
• This result is purely theoretical since these quantities cannot be
measured on the basis of a finite amount of data.
• However, this result provides insight into what makes a learning
process accurate.
[Figure: bias and variance as functions of model complexity]
2. within the family fˆ(x, α), to estimate on the basis of the training set DN
the parameter αN which best approximates f (parametric identification).
In order to accomplish that, a learning procedure is made of two nested
loops:
1. an external structural identification loop which goes through different
model structures
2. an inner parametric identification loop which searches for the best
parameter vector within the family structure.
$$\alpha_N = \alpha(D_N) = \arg\min_{\alpha \in \Lambda} \widehat{\text{MISE}}_{\text{emp}}(\alpha)$$
$$\widehat{\text{MISE}}_{\text{emp}}(\alpha) = \frac{\sum_{i=1}^{N} \left(y_i - \hat{f}(x_i, \alpha)\right)^2}{N}$$
with p = n + 1,
• the Generalized Cross-Validation (GCV)
$$\text{GCV} = \widehat{\text{MISE}}_{\text{emp}}(\alpha_N)\, \frac{1}{\left(1 - \frac{p}{N}\right)^2}$$
$$\text{AIC} = \frac{p}{N} - \frac{1}{N} L(\alpha_N)$$
$$C_p = \frac{\widehat{\text{MISE}}_{\text{emp}}(\alpha_N)}{\hat\sigma_w^2} + 2p - N$$
where $\hat\sigma_w^2$ is an estimate of the variance of the noise,
• the Predicted Squared Error (PSE)
$$\widehat{\text{MISE}}_{CV} = \frac{1}{N}\sum_{i=1}^{N} \left(y_i - \hat{y}_i^{-k(i)}\right)^2 = \frac{1}{N}\sum_{i=1}^{N} \left(y_i - \hat{f}(x_i, \alpha^{-k(i)})\right)^2$$
where $\hat{y}_i^{-k(i)}$ denotes the fitted value for the $i$th observation returned by the
model estimated with the $k(i)$th part of the data removed.
[Diagram: dataset split into a 90% training part and a 10% validation part]
$$\widehat{\text{MISE}}_{LOO} = \frac{1}{N}\sum_{i=1}^{N} \left(y_i - \hat{y}_i^{-i}\right)^2 = \frac{1}{N}\sum_{i=1}^{N} \left(y_i - \hat{f}(x_i, \alpha^{-i})\right)^2$$
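The leave-one-out estimate can be sketched directly from its definition (assuming NumPy and polynomial models; `mise_loo` is a hypothetical helper name): refit the model N times, each time with one point held out, and average the squared test errors.

```python
import numpy as np

def mise_loo(x, y, degree):
    """Explicit leave-one-out estimate of the generalization error:
    for each i, fit on all points except i and test on point i."""
    errors = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i
        coeffs = np.polyfit(x[mask], y[mask], degree)
        errors.append((y[i] - np.polyval(coeffs, x[i])) ** 2)
    return np.mean(errors)

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=40)
y = 1 + 2 * x + rng.normal(scale=0.5, size=40)   # linear target, noise variance 0.25
loo1 = mise_loo(x, y, degree=1)                  # close to the noise variance
```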
[Diagram: realizations of the stochastic process form the training set; the parametric identification loop returns candidate models (αN¹, GN¹), (αN², GN²), …, (αNˢ, GNˢ); validation and structural identification compare them and model selection outputs the learned model]
A model with complexity s̃ is trained on the whole dataset DN and used for
future predictions.
(b) $e_j = y_j - \hat{f}(x_j, \alpha_{N-1}^{s})$
• $\widehat{\text{MISE}}_{LOO}(s) = \frac{1}{N}\sum_{j=1}^{N} e_j^2$
2. Model selection: $\tilde{s} = \arg\min_{s=1,\ldots,S} \widehat{\text{MISE}}_{LOO}(s)$
3. Final parametric identification:
$\alpha_N^{\tilde{s}} = \arg\min_{\alpha \in \Lambda_{\tilde{s}}} \sum_{i=1}^{N} (y_i - \hat{f}(x_i, \alpha))^2$
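The two nested loops can be sketched as follows (assuming NumPy, with the polynomial degree playing the role of the structure s; `select_degree` is a hypothetical name). The outer loop scores each structure by leave-one-out, the inner loop is the least-squares fit:

```python
import numpy as np

def select_degree(x, y, degrees):
    """Structural identification by leave-one-out: score each candidate
    degree s with MISE_LOO(s), keep the minimiser, then refit on all data."""
    def loo(s):
        errs = []
        for i in range(len(x)):                      # inner parametric loop, N times
            mask = np.arange(len(x)) != i
            c = np.polyfit(x[mask], y[mask], s)
            errs.append((y[i] - np.polyval(c, x[i])) ** 2)
        return np.mean(errs)
    scores = {s: loo(s) for s in degrees}            # outer structural loop
    s_best = min(scores, key=scores.get)
    # final parametric identification on the whole dataset
    return s_best, np.polyfit(x, y, s_best)

rng = np.random.default_rng(7)
x = rng.uniform(-2, 2, size=60)
y = x**3 - x + rng.normal(scale=0.3, size=60)        # cubic target
s_best, coeffs = select_degree(x, y, degrees=range(1, 8))
```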
[Figure: fraction of points falling within a neighborhood of relative radius r, for input dimension n = 1, 2, 3, …, 100]
The size of the neighborhood on which we can estimate local features of the
output (e.g. E[y|x]) increases with dimension n, making the estimation
coarser and coarser.
[Figure: bias and variance versus 1/bandwidth, spanning the underfitting and overfitting regimes]
[Diagram: left — a single parametric identification on the N samples followed by the PRESS statistic; right — leave-one-out: N times, put the j-th sample aside, perform parametric identification on the remaining N−1 samples and test on the j-th sample]
The leave-one-out error can be computed in two equivalent ways: the slowest
way (on the right), which repeats the training and test procedure N times, and
the fastest way (on the left), which performs the parametric identification only
once and then computes the PRESS statistic.
$$H = X(X^T X)^{-1} X^T \qquad e_j^{loo} = \frac{e_j}{1 - H_{jj}}$$
Note that PRESS is not an approximation of the loo error but simply a faster
way of computing it.
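A small numerical check of this identity (assuming NumPy; the linear model and data are invented for the example): the residuals of a single fit, rescaled by 1 − Hjj, coincide with the residuals obtained by explicitly refitting N times.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 25
X = np.column_stack([np.ones(N), rng.normal(size=N)])   # intercept + one regressor
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.3, size=N)

# PRESS: one fit on all N samples, then rescale the residuals
H = X @ np.linalg.inv(X.T @ X) @ X.T                    # hat matrix
a = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ a
e_press = e / (1 - np.diag(H))

# explicit leave-one-out for comparison: N refits
e_loo = np.empty(N)
for j in range(N):
    mask = np.arange(N) != j
    a_j = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    e_loo[j] = y[j] - X[j] @ a_j
# e_press and e_loo agree to machine precision
```

For linear models the equality is exact, which is why PRESS is not an approximation of the leave-one-out error but a faster way of computing it.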
$$\hat{y}_q = \frac{\sum_{i=1}^{b} \zeta_i \hat{y}_q(k_i)}{\sum_{i=1}^{b} \zeta_i},$$
where the weights are the inverse of the leave-one-out mean square errors: $\zeta_i = 1/\widehat{\text{MISE}}_{LOO}(k_i)$.
[Diagram: one-step predictor — the delayed values ϕt−1 , ϕt−2 , ϕt−3 , obtained through unit-delay operators z⁻¹, feed the approximator f, which outputs ϕt ]
The approximator fˆ returns the prediction of the value of the time series at
time t + 1 as a function of the n previous values (the rectangular box
containing z −1 represents a unit delay operator, i.e., ϕt−1 = z −1 ϕt ).
[Figure: the query pattern over t̄−6, …, t̄−1 matched against the window t̄−16, …, t̄−11 of the series]
We want to predict at time t̄ − 1 the next value of the series y with order n = 6.
The pattern {yt̄−16 , yt̄−15 , . . . , yt̄−11 } is the most similar to the pattern
{yt̄−6 , yt̄−5 , . . . , yt̄−1 }. Then the prediction ŷt̄ = yt̄−10 is returned.
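This nearest-neighbour scheme can be sketched in a few lines (assuming NumPy; `nn_predict` is a hypothetical helper, demonstrated on a noiseless periodic series where the matched continuation is exact):

```python
import numpy as np

def nn_predict(y, n):
    """One-nearest-neighbour forecast: find the past window of length n
    most similar (squared Euclidean distance) to the last n values,
    and return the value that followed that window."""
    query = y[-n:]
    best_d, best_next = np.inf, None
    for t in range(n, len(y)):
        d = np.sum((y[t - n : t] - query) ** 2)
        if d < best_d:
            best_d, best_next = d, y[t]
    return best_next

# noiseless sine of period 20: the best-matching window repeats exactly,
# so the returned continuation equals the true next value (here ~0)
y = np.sin(2 * np.pi * np.arange(100) / 20)
pred = nn_predict(y, n=6)
```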
[Diagram: iterated predictor — the previous predictions, delayed through unit-delay operators z⁻¹, are fed back as inputs ϕ̂t−1 , ϕ̂t−2 , ϕ̂t−3 to the approximator f ]
The approximator fˆ returns the prediction of the value of the time series at
time t + 1 by iterating the predictions obtained in the previous steps (the
rectangular box containing z −1 represents a unit delay operator, i.e.,
ϕ̂t−1 = z −1 ϕ̂t ).
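The iterated strategy can be sketched as follows (assuming NumPy; `iterate_forecast` is a hypothetical helper, shown on a noiseless AR(2) recursion, where feeding predictions back reproduces the series exactly):

```python
import numpy as np

def iterate_forecast(model, history, n, h):
    """Iterated strategy: apply the one-step model h times, feeding each
    prediction back into the lag window in place of an observed value."""
    window = list(history[-n:])          # last n values, oldest first
    preds = []
    for _ in range(h):
        nxt = model(np.array(window))
        preds.append(nxt)
        window = window[1:] + [nxt]      # shift the window, append the prediction
    return preds

# toy one-step model: the exact recursion of a noiseless AR(2)
alpha = np.array([0.5, 0.3])
model = lambda w: alpha @ w[::-1][:2]    # 0.5*phi_{t-1} + 0.3*phi_{t-2}

phi = [1.0, 0.5]
for t in range(2, 30):
    phi.append(0.5 * phi[-1] + 0.3 * phi[-2])

preds = iterate_forecast(model, phi[:20], n=2, h=5)   # equals phi[20:25]
```

With a noisy series the iterated predictions would instead accumulate error over the horizon, which is the usual argument for comparing this strategy with direct ones.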
[Figure: the A chaotic time series]
The A chaotic time series has a training set of 1000 values: the task is to
predict the continuation for 100 steps, starting from different points.
One-step assessment criterion
[Figure: two 100-step continuations of the series compared with the observed values]
[Diagram: dependencies between ϕt−1 , ϕt , ϕt+1 , ϕt+2 , ϕt+3 under three multi-step prediction strategies]
This quantity is smaller than one if the predictor performs better than the
naivest predictor, i.e. the average µ̂.
Other measures rely on relative or percentage errors
$$pe_{t+h} = 100\,\frac{\varphi_{t+h} - \hat\varphi_{t+h}}{\varphi_{t+h}}$$
like the Mean Absolute Percentage Error
$$\text{MAPE} = \frac{\sum_{h=1}^{H} |pe_{t+h}|}{H}$$
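A minimal MAPE computation (assuming NumPy; the numbers below are invented for the example):

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error over the forecast horizon."""
    pe = 100.0 * (actual - predicted) / actual
    return np.mean(np.abs(pe))

actual = np.array([100.0, 110.0, 120.0])
predicted = np.array([90.0, 110.0, 126.0])
# absolute percentage errors are 10%, 0%, 5% -> MAPE = 5%
```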
Applications in my lab
• Side-channel attack
[Figure: temperature (°C) recorded by a wireless sensor versus time (hours) over 20 hours]
[Diagram: the metric trades off communication costs, model error and model complexity]
$$AR(p): \quad \hat{s}_i[t] = \sum_{j=1}^{p} \theta_j s_i[t-j]$$
$$D(Q_j, T) = \frac{1}{N-n+1} \sum_{t=n}^{N} \left( \hat{f}_{Q_j}\big(T(t-1), T(t-2), \ldots, T(t-n-1)\big) - T(t) \right)^2$$
Highly recommended!