
Time series

1
Taxonomy of Time Series Forecasting
Problems
▪ Endogenous: Input variables that are influenced by other variables in the system and on which the output variable
depends. Exogenous: Input variables that are not influenced by other variables in the system and on which the output
variable depends.
▪ Unstructured: No obvious systematic time-dependent pattern in a time series variable. Structured: Systematic time-
dependent patterns in a time series variable (e.g. trend and/or seasonality).
▪ Univariate: One variable measured over time. Multivariate: Multiple variables measured over time.
▪ One-step: Forecast the next time step. Multi-step: Forecast more than one future time step.
▪ Static: A forecast model is fit once and used to make predictions. Dynamic: A forecast model is fit on newly available data
prior to each prediction.
▪ Contiguous: Observations are made at uniform intervals over time. Discontiguous: Observations are not made at uniform intervals over time.

2
Problem definition
▪ Inputs vs. Outputs: What are the inputs and outputs for a forecast?
▪ Endogenous vs. Exogenous: What are the endogenous and exogenous variables?
▪ Unstructured vs. Structured: Are the time series variables unstructured or structured?
▪ Regression vs. Classification: Are you working on a regression or classification predictive
modeling problem? What are some alternate ways to frame your time series forecasting
problem?
▪ Univariate vs. Multivariate: Are you working on a univariate or multivariate time series
problem?
▪ Single-step vs. Multi-step: Do you require a single-step or a multi-step forecast?
▪ Static vs. Dynamic: Do you require a static or a dynamically updated model?
▪ Contiguous vs. Discontiguous: Are your observations contiguous or discontiguous?

3
Time series analysis: most important
approach
• Before machine learning and deep learning era, people were creating mathematical models and approaches for time
series and signals analysis. Here is a summary of the most important of them:

• Time domain analysis: this is all about “looking” at how the time series evolves over time. It can include analysis of the widths and
heights of the time steps, statistical features and other “visual” characteristics.
• Frequency domain analysis: a lot of signals are better represented not by how they change over time, but by what
frequency amplitudes they contain and how those change. Fourier analysis and wavelets are what you go with.
• Nearest neighbors analysis: sometimes we just need to compare two signals or measure a distance between them,
and we can’t do this with regular metrics like Euclidean distance, because signals can be of different lengths and the notion of
similarity is a bit different as well. A great example of such a metric for time series is dynamic time warping.
• (S)AR(I)MA(X) models: the very popular family of mathematical models based on linear self-dependence inside of
time series (autocorrelation) that is able to explain future fluctuations.
• Decomposition: another important approach for prediction is decomposing time series into logical parts that can be
summed or multiplied to obtain the initial time series: trend part, seasonal part, and residuals.
• Nonlinear dynamics: we always forget about differential equations (ordinary, partial, stochastic and others) as a tool
for modeling dynamical systems that are in fact signals or time series. It’s rather unconventional today, but features
from DEs can be very useful for…
• Machine learning: all of the approaches above can provide features for any machine learning model we have. But in 2018 we
don’t want to rely on human-biased mathematical models and features. We want this to be done for us with AI, which
today is deep learning.

https://alexrachnog.medium.com/deep-learning-the-final-frontier-for-signal-processing-and-time-series-analysis-734307167ad6   4
 Autoregressive neural nets: wave net
• What if we really want to avoid the unnecessary difficulties related to recurrent
neural networks? Is there a way to somehow “emulate” dependence on the
last N time steps, with N quite large? This is where WaveNet and
similar architectures come into the game. More generally we can call
them autoregressive feedforward models that model the last N steps using
dilated convolutions.
• The trend of switching from recurrent neural networks to
(autoregressive) feedforward models has touched not only speech recognition
and time series analysis, but NLP as well.

https://alexrachnog.medium.com/deep-learning-the-final-frontier-for-signal-processing-and-time-series-analysis-734307167ad6   5
Random walk: #1
▪ What question are we trying to answer? How do you know your time series problem is
predictable?
▪ Common mistake: thinking that a random walk is a list of random numbers; this is not the case at all.
▪ Purely random: a random walk is different from a list of random numbers because the next
value in the sequence is a modification of the previous value in the sequence. Given the
way that the random walk is constructed, we would expect a strong autocorrelation with
the previous observation.
▪ Stationary: Therefore we can expect a random walk to be non-stationary. In fact, all
random walk processes are non-stationary. Note that not all non-stationary time series are
random walks.

6
Random walk: #2
▪ A random walk is unpredictable; it cannot reasonably be predicted.

▪ Given the way that the random walk is constructed, we can expect that the best prediction we
could make would be to use the observation at the previous time step as what will happen in the
next time step.

▪ Simply because we know that the next time step will be a function of the prior time step.

▪ This is often called the NAIVE forecast, or the PERSISTENCE model.

▪ The random walk hypothesis is a theory that stock market prices are a random walk and cannot
be predicted.

7
Backtesting (aka hindcasting): #1
▪ Goal? Make accurate predictions about the future.
▪ What is backtesting or hindcasting? In the field of time series forecasting it is the evaluation of machine
learning models on time series data.
▪ What is an important difference? Train-test splits and k-fold cross-validation do not work in the case of
time series data. This is because they assume that there is no relationship between the
observations, that each observation is independent. This is not true of time series data, where the time
dimension of observations means that we cannot randomly split them into groups. Instead, we must split
the data up and respect the temporal order in which values were observed.
▪ Why? This is because they ignore the temporal components inherent in the problem.
▪ What is the alternative? There are three methods (see next slide)

https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/ 8
Backtesting (aka hindcasting): #2
▪ Train-test split: It is useful when you have a large amount of data so that both
training and tests sets are representative of the original problem.
▪ Multiple Train-Test Splits: we can repeat the process of splitting the time series
into train and test sets multiple times. Using multiple train-test splits will result in
more models being trained, and in turn, a more accurate estimate of the
performance of the models on unseen data. A limitation of the train-test split
approach is that the trained models remain fixed as they are evaluated on each
evaluation in the test set. This may not be realistic as models can be retrained as
new daily or monthly observations are made available.
▪ Walk Forward Validation: it is called like this because this methodology involves
moving along the time series one-time step at a time. Additionally, because a
sliding or expanding window is used to train a model, this method is also referred
to as Rolling Window Analysis or a Rolling Forecast. You can see that many more
models are created. This has the benefit again of providing a much more robust
estimation of how the chosen modelling method and parameters will perform in
practice. This improved estimate comes at the computational cost of creating so
many models. This is not expensive if the modelling method is simple or the dataset is
small, but could be an issue at scale. Walk-forward validation is the gold standard
of model evaluation. It is the k-fold cross-validation for time series. Essentially a
model may be updated each time step new data is received.

https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/ 9
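A minimal walk-forward validation sketch in Python, assuming a univariate pandas Series and using the persistence forecast as a stand-in model (a real model would be re-fit on the expanding history at every step):

```python
import numpy as np
import pandas as pd

def walk_forward_validation(series: pd.Series, n_test: int) -> float:
    """Expanding-window walk-forward validation scored with RMSE.

    A persistence forecast stands in for the model; a real model would be
    re-fit on `history` at every step before predicting.
    """
    values = series.to_numpy(dtype=float)
    errors = []
    for i in range(len(values) - n_test, len(values)):
        history = values[:i]            # everything observed up to time i
        prediction = history[-1]        # persistence: repeat the last value
        errors.append(values[i] - prediction)
    return float(np.sqrt(np.mean(np.square(errors))))

# Toy example
s = pd.Series([10.0, 12.0, 13.0, 12.0, 15.0, 16.0, 18.0, 17.0, 19.0, 21.0])
print(walk_forward_validation(s, n_test=3))
```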
Time series analysis vs forecasting

▪ Time series analysis: This field of study seeks the why behind a time series dataset. This
often involves making assumptions about the form of the data and decomposing the time
series into constituent components. The quality of a descriptive model is determined by
how well it describes all available data and the interpretation it provides to better inform
the problem domain.

▪ Time series forecasting: The skill of a time series forecasting model is determined by its
performance at predicting the future. This is often at the expense of being able to explain
why a specific prediction was made, confidence intervals and even better understanding
the underlying causes behind the problem.

10
Time series decomposition

▪ The most useful of these is the decomposition of a time series into 4 constituent parts:

▪ Level. The baseline value for the series if it were a straight line.
▪ Trend. The optional and often linear increasing or decreasing behaviour of the series
over time.
▪ Seasonality. The optional repeating patterns or cycles of behaviour over time.
▪ Noise. The optional variability in the observations that cannot be explained by the
model.

▪ All time series have a level, most have noise, and the trend and seasonality are optional.

11
Sliding window
Sliding window is the way to restructure a time series dataset as a supervised learning problem. A difficulty
with the sliding window approach is how large to make the window for your problem. Perhaps a good starting point
is to perform a sensitivity analysis and try a suite of different window widths to in turn create a suite of different
views of your dataset and see which results in better performing models. There will be a point of diminishing
returns.

Imagine we have a time series as follows. We can restructure this time series dataset as a supervised learning problem by using the value at the previous time step to predict the value at the next time step. 12
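A small sketch of this restructuring with pandas shift(); the window width of 3 is an arbitrary illustration:

```python
import pandas as pd

series = pd.Series([10, 12, 13, 12, 15, 16, 18])

window = 3  # number of lagged values used as inputs
cols = {f"t-{k}": series.shift(k) for k in range(window, 0, -1)}
cols["t"] = series                      # the value to predict
frame = pd.DataFrame(cols).dropna()     # drop rows without a full history

print(frame)  # columns t-3, t-2, t-1 are inputs; column t is the target
```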
What are lags in a time series?

▪ Time series modelling assumes a relationship between an observation and the previous
observation.

▪ Previous observations in a time series are called lags, with the observation at the previous
time step called lag=1, the observation at two time steps ago called lag=2, and so on.

13
Autocorrelation: #1

▪ We can quantify the strength and type of relationship between observations and their lags.

▪ In statistics, this is called correlation, and when calculated against lag values in time series,
it is called autocorrelation (self-correlation).

▪ A correlation value calculated between two groups of numbers, such as observations and
their lag=1 values, results in a number between -1 and 1. The sign of this number indicates
a negative or positive correlation respectively.

https://www.displayr.com/autocorrelation/ 14
Autocorrelation: #2

▪ When you have a series of numbers, and there is a pattern such that values in the series
can be predicted based on preceding values in the series, the series of numbers is said to
exhibit autocorrelation.

▪ The autocorrelation function can be used to answer the following questions.


▪ Was this sample data set generated from a random process?
▪ Would a non-linear or time series model be a more appropriate model for these data
than a simple constant plus error model?

https://www.itl.nist.gov/div898/handbook/eda/section3/eda35c.htm 15
Autocorrelations: #3 [correlogram, ACF plot]
We can see that for the Minimum Daily Temperatures dataset we see cycles of strong negative and positive correlation. This captures the relationship of an observation with past observations in the same and opposite seasons or times of year. Sine waves like those seen in this example are a strong sign of seasonality in the dataset.

Dotted lines are provided to indicate that any correlation values above those lines are statistically significant (meaningful).

The plot also includes solid and dashed lines that indicate the 95% and 99% confidence intervals for the correlation values. Correlation values above these lines are more significant than those below the line, providing a threshold or cut-off for selecting more relevant lag values.

16
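A hedged sketch of producing such a correlogram with statsmodels; the CSV file name and layout for the Minimum Daily Temperatures data are assumptions:

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

# Assumed file name and layout: one value column indexed by date
series = pd.read_csv("daily-min-temperatures.csv",
                     index_col=0, parse_dates=True).iloc[:, 0]

plot_acf(series, lags=365)   # correlogram with significance bands
plt.show()
```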
Autocorrelations: #4 [correlogram, ACF plot]
▪ We can see in this plot that at lag 0, the correlation is 1, as the data is correlated with itself.

▪ At a lag of 1, the correlation is shown as being around 0.5 (this is different to the correlation computed above, as the correlogram uses a slightly different formula).

▪ We can also see that we have negative correlations when the points are 3, 4, and 5 apart.

https://www.displayr.com/autocorrelation/ 17
Autocorrelations: #5
Positive first-order autocorrelation: “first order” indicates that observations that are one apart are correlated, and “positive” means that the correlation between the observations is positive. The plot appears as a smooth, snake-like curve.

With negative first-order autocorrelation, the points form a zigzag pattern if connected, as shown on the right.

https://www.displayr.com/autocorrelation/ 18
Autocorrelations: #6 [Testing for
autocorrelation]
▪ Sampling error alone means that we will typically see some autocorrelation in any data set,
so a statistical test is required to rule out the possibility that sampling error is causing the
autocorrelation.

▪ The standard test for this is the Durbin-Watson test.

▪ This test only explicitly tests first order correlation, but in practice it tends to detect most
common forms of autocorrelation as most forms of autocorrelation exhibit some degree of
first order correlation.

https://www.displayr.com/autocorrelation/ 19
Autocorrelations: #7

▪ When autocorrelation is detected in the residuals from a model, it suggests that the model
is misspecified (i.e., in some sense wrong).
▪ A common cause is that some key variable or variables are missing from the model. Where the data
has been collected across space or time, and the model does not explicitly account for this,
autocorrelation is likely.
▪ For example, if a weather model is wrong in one suburb, it will likely be wrong in the same
way in a neighboring suburb. The fix is to either include the missing variables, or explicitly
model the autocorrelation (e.g., using an ARIMA model).
▪ The existence of autocorrelation means that computed standard errors, and consequently
p-values, are misleading.
▪ But for time series we have no issue if the data are autocorrelated, right?

https://www.displayr.com/autocorrelation/ 20
PACF = partial autocorrelation function
▪ It can be thought of as the correlation between two points that are separated by some
number of periods n, but with the effect of the intervening correlations removed.

▪ Example: If T1 is directly correlated with T2 and T2 is directly correlated with T3, it would
appear that T1 is correlated with T3.

▪ PACF will remove the intervening correlation with T2.

▪ This allows us to look at the pure and direct correlation (if present), eliminating any
confounding effect.

https://towardsdatascience.com/120-data-scientist-interview-questions-and-answers-you-should-know-in-2021-b2faf7de8f3e 21
How to read a PACF
Just like regular correlations, they range between -1 and +1. The first value is 1 because it represents the correlation of x_t with itself.

The second partial correlation takes a value of approximately 0.96, and indicates that x_t and x_(t-1) are highly correlated. This might be expected, as it is reasonable to assume there is some relationship between today’s COE price and the price yesterday.

The third partial autocorrelation is moderately negative, and the remaining values are rather small.

The figure shows the PACF for 40 lags alongside an approximate 95% statistical confidence interval (dotted gray line). It appears observations past one time lag have little association with the current COE price.

Lewis, N. D. "Deep Time Series Forecasting with Python." Create Space Independent Publishing Platform (2016).
22
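A minimal sketch of plotting a PACF like the one discussed above with statsmodels; the synthetic random-walk series is only a stand-in for the COE price data:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_pacf

# Synthetic random-walk-like series as a stand-in for the COE price data
rng = np.random.default_rng(0)
series = pd.Series(np.cumsum(rng.normal(size=500)) + 100)

plot_pacf(series, lags=40)   # PACF with an approximate 95% confidence band
plt.show()
```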
Methods to smooth the time series
▪ Moving average smoothing
▪ Exponential smoothing
▪ Double smoothing – Holt method
▪ Triple smoothing – Holt-Winters method

▪ How are they used? These can be used to smooth the original time series for
spotting trends.

23
Moving average smoothing: #1
▪ Moving averages are a simple and common type of smoothing used in time series analysis and time
series forecasting.
▪ Smoothing is a technique applied to time series to remove the fine-grained variation between time steps.

▪ The hope of smoothing is to remove noise and better expose the signal of the underlying causal
processes.

▪ There are two main types of moving average that are used:
▪ Centred: This method requires knowledge of future values, and as such is used in time series
analysis to better understand the dataset. A centred moving average can be used as a general method
to remove trend and seasonal components from a time series, a method that we often cannot use
when forecasting.
▪ Trailing Moving Average: A trailing moving average uses only historical observations.

24
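A short sketch of both types of moving average with pandas rolling(); the window of 7 observations is arbitrary:

```python
import pandas as pd

series = pd.Series(range(1, 31), dtype=float)  # toy data

trailing = series.rolling(window=7).mean()               # uses only past values
centred = series.rolling(window=7, center=True).mean()   # uses past and future values

print(pd.DataFrame({"trailing": trailing, "centred": centred}).tail(10))
```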
Moving average smoothing: #2

25
Exponential smoothing (Simple Exponential Smoothing)

▪ Weight all available observations while exponentially decreasing the weights as we move further back in time.

▪ They can only forecast the current level.

26
Double smoothing – Holt method
▪ The idea is exponential smoothing applied to both level and trend. The basic idea
is saying if our time series has a trend, we can incorporate that information to do
better than just estimating the current level and using that to forecast the future
observations.
▪ Once the trend is estimated to be positive, all future predictions can only go up
from the last value in the time series. On the other hand, if the trend is estimated
to be negative, all future predictions can only go down.
▪ This property makes this method unsuitable for predicting very far out into the
future as well.

27
Triple Exponential Smoothing - Holt-Winters
Method
▪ The idea behind triple exponential smoothing (a.k.a Holt-Winters Method) is to apply
exponential smoothing to a third component - seasonality, 𝑆.
▪ This means we should not be using this method if our time series is not expected to have
seasonality.

28
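A hedged sketch of all three variants using statsmodels; the additive components and monthly seasonality (seasonal_periods=12) are illustrative assumptions for the toy series:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import SimpleExpSmoothing, Holt, ExponentialSmoothing

# Toy monthly series with trend and seasonality
idx = pd.date_range("2015-01-01", periods=72, freq="MS")
y = pd.Series(np.linspace(50, 120, 72)
              + 10 * np.sin(np.arange(72) * 2 * np.pi / 12), index=idx)

ses = SimpleExpSmoothing(y).fit()                          # level only
holt = Holt(y).fit()                                       # level + trend
hw = ExponentialSmoothing(y, trend="add", seasonal="add",
                          seasonal_periods=12).fit()       # level + trend + seasonality

print(hw.forecast(12))  # 12-step-ahead Holt-Winters forecast
```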
Stationary time series
▪ A stationary time series is one whose properties do not depend on the time at which the
series is observed. In general, a stationary time series will have no predictable patterns in the
long-term.
▪ Time series with trends, or with seasonality, are not stationary — the trend and
seasonality will affect the value of the time series at different times.
▪ On the other hand, a white noise series is stationary — it does not matter when you
observe it, it should look much the same at any point in time.
▪ A stationary time series does not show obvious trends (long-term
increasing or decreasing movement) or seasonality (consistent periodic structure).

▪ Confusing cases — a time series with cyclic behaviour (but with no trend or seasonality) is
stationary. This is because the cycles are not of a fixed length, so before we observe the series
we cannot be sure where the peaks and troughs of the cycles will be.
https://otexts.com/fpp2/stationarity.html 29
Which time series is stationary? #1

https://otexts.com/fpp2/stationarity.html
[Figure: example time series from fpp2; the annotations mark series with trend and changing levels, seasonal series, and one series with seasonality, trend, changing levels and a change in variance.]
30
Which time series is stationary? #2

https://otexts.com/fpp2/stationarity.html
▪ At first glance, the strong cycles in series (g)
might appear to make it non-stationary.
▪ But these cycles are aperiodic — they are
caused when the lynx population becomes
too large for the available feed, so that they
stop breeding and the population falls to
low numbers, then the regeneration of their
food sources allows the population to grow
again, and so on.
▪ In the long-term, the timing of these cycles
is not predictable.
▪ Hence the series is stationary.

31
White noise: #1

▪ Main conclusion: If a time series is white noise, it is a sequence of random numbers and
cannot be predicted.

▪ Rigorous definition: A time series may be white noise. A time series is white noise if the
variables are independent and identically distributed with a mean of zero. This means that
all variables have the same variance (sigma^2) and each value has a zero correlation with all
other values in the series. If the variables in the series are drawn from a Gaussian
distribution, the series is called Gaussian white noise.

32
White noise: #2

▪ Your time series is not white noise if any of the following conditions are true:

▪ Does your series have a non-zero mean?


▪ Does the variance change over time?
▪ Do values correlate with lag values?

33
Systematic vs. non-systematic: #1
▪ Break a time series down into systematic and unsystematic components.

▪ Systematic: Components of the time series that have consistency or recurrence and
can be described and modelled. These are:
▪ Level: The average value in the series.
▪ Trend: The increasing or decreasing value in the series.
▪ Seasonality: The repeating short-term cycle in the series.

▪ Non-Systematic: Components of the time series that cannot be directly modelled. This is:
▪ Noise: The random variation in the series.

▪ All series have a level and noise. The trend and seasonality components are optional.

34
Systematic vs. non-systematic: #2

▪ It is helpful to think of the components as combining either additively or multiplicatively.

▪ An additive model: y(t) = Level + Trend + Seasonality + Noise. An additive model is linear, where changes over time are consistently made by the same amount.

▪ A multiplicative model: y(t) = Level × Trend × Seasonality × Noise. A multiplicative model is nonlinear, such as quadratic or exponential.

35
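A minimal sketch of an additive decomposition with statsmodels' seasonal_decompose on a toy monthly series (the period of 12 is an assumption of that toy data):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Toy monthly series: trend + seasonality + noise
idx = pd.date_range("2018-01-01", periods=48, freq="MS")
y = pd.Series(np.arange(48) * 0.5
              + 5 * np.sin(np.arange(48) * 2 * np.pi / 12)
              + np.random.default_rng(1).normal(scale=0.5, size=48), index=idx)

result = seasonal_decompose(y, model="additive", period=12)
print(result.trend.dropna().head())
print(result.seasonal.head())
print(result.resid.dropna().head())
# For a multiplicative series, use model="multiplicative" instead.
```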
Trends in time series
▪ Identifying and understanding trend information can aid in improving model performance;
below are a few reasons:

▪ Faster Modelling: Perhaps the knowledge of a trend or lack of a trend can suggest
methods and make model selection and evaluation more efficient.
▪ Simpler Problem: Perhaps we can correct or remove the trend to simplify modelling
and improve model performance.
▪ More Data: Perhaps we can use trend information, directly or as a summary, to provide
additional information to the model and improve model performance.

36
Removing a Trend [detrending]

▪ A time series with a trend is called non-stationary. An identified trend can be modelled.

▪ Once modelled, it can be removed from the time series dataset.

▪ This is called detrending the time series.

▪ If a dataset does not have a trend or we successfully remove the trend, the dataset is said
to be trend stationary.

37
Stationary: #1

▪ The observations in a stationary time series are not dependent on time.

▪ Time series are stationary if they do not have trend or seasonal effects.

▪ Summary statistics calculated on the time series are consistent over time, like the mean or
the variance of the observations. When a time series is stationary, it can be easier to model.

▪ Statistical modelling methods assume or require the time series to be stationary to be
effective.

38
Stationary: #2
▪ Stationary Process: A process that generates a stationary series of observations.

▪ Stationary Model: A model that describes a stationary series of observations.

▪ Trend Stationary: A time series that does not exhibit a trend.

▪ Seasonal Stationary: A time series that does not exhibit seasonality.

▪ Strictly Stationary: A mathematical definition of a stationary process, specifically that the
joint distribution of observations is invariant to time shift.

39
Stationary: #3
▪ Statistical tests make strong assumptions about your data. They can only be used to inform
the degree to which a null hypothesis can be rejected (or fail to be rejected).

▪ The result must be interpreted for a given problem to be meaningful.

▪ Nevertheless, they can provide a quick check and confirmatory evidence that your time
series is stationary or non-stationary.

▪ The Augmented Dickey-Fuller test is a type of statistical test called a unit root test. The
intuition behind a unit root test is that it determines how strongly a time series is defined
by a trend.

40
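A hedged sketch of running the Augmented Dickey-Fuller test with statsmodels on a toy random walk:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

# Toy example: a random walk (expected to be non-stationary)
rng = np.random.default_rng(42)
series = np.cumsum(rng.normal(size=500))

stat, pvalue, usedlag, nobs, critical_values, icbest = adfuller(series)
print(f"ADF statistic: {stat:.3f}, p-value: {pvalue:.3f}")
# Large p-value (> 0.05): fail to reject the unit-root null -> likely non-stationary
```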
Should you make your time series stationary?
▪ Stationary time series = mean and variance is constant over time

▪ Generally, yes. If you have clear trend and seasonality in your time series, then model
these components, remove them from observations, then train models on the residuals.

▪ Statistical time series methods and even modern machine learning methods will benefit
from the clearer signal in the data.

▪ But this is not always the best available option.

41
Why do we need to make the time series
stationary? #1
▪ Stationarity is important because, in its absence, a model describing the data will vary in
accuracy at different time points.
▪ As such, stationarity is required for sample statistics such as means, variances, and
correlations to accurately describe the data at all time points of interest.

https://stats.stackexchange.com/questions/19715/why-does-a-time-series-have-to-be-stationary 42
Why do we need to make the
time series stationary? #2
▪ The mean and variance of any given
segment of time would do a good job
representing the whole stationary time
series.

▪ The mean and variance of any given
segment of time would do a poor job
representing the whole non-stationary
time series.

https://stats.stackexchange.com/questions/19715/why-does-a-time-series-have-to-be-stationary 43
So what do we do with this non-stationary time
series?
▪ Whether a time series is stationary or not doesn't affect which method should be used for
predicting certain properties.

▪ If the time series isn't stationary, it simply isn't possible to predict into the future.

https://www.researchgate.net/post/Is-it-necessary-to-make-time-series-data-stationary-before-applying-tr
ee-based-ML-methods-ie-Random-Forest-or-Xgboost-etc 44
Estimating error in the time series forecast

Error measure and formula:

Forecast Error (or Residual Forecast Error): error(t) = observed(t) − predicted(t)

Mean Forecast Error (or Forecast Bias): mean(error(t))

Mean Absolute* Error: mean(|error(t)|)

Mean Squared Error: mean(error(t)^2)

Root Mean Squared Error: sqrt(mean(error(t)^2))

* Using the absolute value here means positive and
negative errors are treated the same way 45
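A small sketch computing these measures directly with NumPy, mirroring the formulas above; the observed and predicted values are made up:

```python
import numpy as np

observed = np.array([12.0, 13.0, 14.5, 15.0, 16.5])
predicted = np.array([11.5, 13.5, 14.0, 15.5, 16.0])

errors = observed - predicted        # forecast errors (residuals)
bias = errors.mean()                 # mean forecast error (bias)
mae = np.abs(errors).mean()          # mean absolute error
mse = np.square(errors).mean()       # mean squared error
rmse = np.sqrt(mse)                  # root mean squared error

print(f"bias={bias:.3f} mae={mae:.3f} mse={mse:.3f} rmse={rmse:.3f}")
```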
Forecast Performance Baseline
▪ A baseline in forecast performance provides a point of comparison. It is a point of
reference for all other modelling techniques on your problem. If a model achieves
performance at or below the baseline, the technique should be fixed or abandoned.

▪ The simplest forecast that we can make is to forecast that what happened in the previous
time step will be the same as what will happen in the next time step. This is called the naive
forecast or the persistence forecast model.

46
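A minimal sketch of the persistence (naive) baseline on a toy series, scored with RMSE:

```python
import numpy as np
import pandas as pd

series = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119], dtype=float)

# Persistence forecast: the prediction for t+1 is the observation at t
predictions = series.shift(1)[1:]
actuals = series[1:]

rmse = np.sqrt(np.mean(np.square(actuals.values - predictions.values)))
print(f"Persistence baseline RMSE: {rmse:.3f}")
```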
ARIMA: #1
▪ It is a forecasting algorithm based on the idea that the information in the past values of the time series
can alone be used to predict the future values. However, you’ll have to use your own judgment if that
makes sense at all!
▪ It is a generalization of the simpler AutoRegressive Moving Average:
▪ AR: Autoregression. A model that uses the dependent relationship between an observation and
some number of lagged observations. A pure Auto Regressive (AR only) model is one where a value
depends only on its own lags.
▪ I: Integrated. The use of differencing of raw observations (i.e. subtracting an observation from an
observation at the previous time step) in order to make the time series stationary.
▪ MA: Moving Average. A model that uses the dependency between an observation and residual
errors from a moving average model applied to lagged observations. A pure Moving Average (MA
only) model is one where Yt depends only on the lagged forecast errors.
▪ So what does ARIMA model do? An ARIMA model is one where the time series was differenced at least
once to make it stationary and you combine the AR and the MA terms.

47
ARIMA: #2

▪ The parameters of the ARIMA model are defined as follows:

▪ p: The number of lag observations included in the model, also called the lag order.
▪ d: The number of times that the raw observations are differenced, also called the
degree of differencing.
▪ q: The size of the moving average window, also called the order of moving average.

▪ An extension to ARIMA that supports the direct modelling of the seasonal component of
the series is called SARIMA.

48
ARIMA: #3
▪ The first step to build an ARIMA model is to make the time series stationary.
▪ Why? Because the term ‘Auto Regressive’ in ARIMA means it is a linear regression model
that uses its own lags as predictors. Linear regression models work best when the
predictors are not correlated and are independent of each other.
▪ So how to make a series stationary? The most common approach is to difference it.
That is, subtract the previous value from the current value. Sometimes, depending on
the complexity of the series, more than one differencing may be needed. The value of d,
therefore, is the minimum number of differencing needed to make the series stationary.
▪ What if the time series is already stationary? Then d = 0.
▪ ‘p’ is the order of the ‘Auto Regressive’ (AR) term. It refers to the number of lags of Y to be
used as predictors.
▪ ‘q’ is the order of the ‘Moving Average’ (MA) term. It refers to the number of lagged forecast
errors that should go into the ARIMA Model.

https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/ 49
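A hedged sketch of fitting an ARIMA model with statsmodels; the order (1, 1, 1) is an arbitrary illustration, not a recommendation for any particular dataset:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Toy trending series
rng = np.random.default_rng(0)
y = pd.Series(np.cumsum(rng.normal(loc=0.3, size=200)) + 50)

model = ARIMA(y, order=(1, 1, 1))   # p=1 AR term, d=1 difference, q=1 MA term
fitted = model.fit()

print(fitted.summary().tables[1])   # coefficient table
print(fitted.forecast(steps=7))     # 7-step-ahead forecast
```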
How to find the order of differencing (d) in
ARIMA model?
▪ You need differencing only if the series is non-stationary. So, first, I am going to check if the
series is stationary using the Augmented Dickey Fuller test
▪ The right order of differencing is the minimum differencing required to get a near-
stationary series which roams around a defined mean and the ACF plot reaches to zero
fairly quick.
▪ If the autocorrelations are positive for a large number of lags (10 or more), then the
series needs further differencing.
▪ If the lag 1 autocorrelation itself is too negative, then the series is probably over-
differenced.
▪ If undecided, then go with the order that gives the least standard deviation in the
differenced series.

https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/ 50
How to handle if a time series is slightly
under or over differenced?
▪ It may so happen that your series is slightly under-differenced, and differencing it one more
time makes it slightly over-differenced.

▪ How to handle this case?

▪ If the series is slightly under-differenced, add one or more additional AR terms


▪ If the series is slightly over-differenced, add an additional MA term.

https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/ 51
ARIMA & ARMA
▪ You don't require stationarity to run an ARIMA model, since if the I() order is >0, it's
explicitly nonstationary. The ARIMA (AutoRegressive Integrated Moving Average) model is one
model for non-stationarity. It assumes that the data becomes stationary after differencing.
▪ Stationarity is an assumption of ARMA, however.
▪ Look at the link below to know more.

https://stats.stackexchange.com/questions/19715/why-does-a-time-series-have-to-be-stationary
52
ARIMA vs. XGBoost for time series

▪ ARIMA regressions are used in classical statistical approaches, when the goal is not just
prediction, but also understanding of how different explanatory variables relate to the
dependent variable and to each other. ARIMA models are designed specifically for time series
data.

▪ XGBoost models are used in pure ML approaches, where we exclusively care about the quality
of the prediction. XGBoost regressors can be used for time series forecasting, even though they
are not specifically meant for long-term forecasts. But they can work.

https://datascience.stackexchange.com/questions/60678/xgboost-vs-arima-for-time-series-analysis
53
Box-Jenkins Method (ARMA & ARIMA)
▪ The approach starts with the assumption that the process that generated the time series can be
approximated using an ARMA model if it is stationary or an ARIMA model if it is non-stationary.

▪ Box and Jenkins refer to the process as stochastic model building; it is an iterative approach that consists of the
following 3 steps:
▪ Identification. Use the data and all related information to help select a sub-class of model that may
best summarize the data.
▪ Estimation. Use the data to train the parameters of the model (i.e. the coefficients).
▪ Diagnostic Checking. Evaluate the fitted model in the context of the available data and check for
areas where the model may be improved.

▪ It is an iterative process, so that as new information is gained during diagnostics, you can circle back to
step 1 and incorporate that into new model classes.

54
Exponential smoothing
▪ Exponential smoothing is a time series forecasting method for univariate data.

▪ Time series methods like the Box-Jenkins ARIMA family of methods develop a model where
the prediction is a weighted linear sum of recent past observations or lags.

▪ Exponential smoothing forecasting methods are similar in that a prediction is a weighted


sum of past observations, but the model explicitly uses an exponentially decreasing weight
for past observations.

▪ Specifically, past observations are weighted with a geometrically decreasing ratio.

55
Double & triple Exponential Smoothing

▪ Double Exponential Smoothing is an extension to Exponential Smoothing that explicitly
adds support for trends in the univariate time series. In addition to the alpha parameter for
controlling the smoothing factor for the level, an additional smoothing factor is added to
control the decay of the influence of the change in trend, called beta (b or β).

▪ Triple Exponential Smoothing is an extension of Exponential Smoothing that explicitly adds
support for seasonality to the univariate time series. This method is sometimes called Holt-
Winters Exponential Smoothing, named for two contributors to the method.

56
Correlation and autocorrelation
▪ Statistical correlation summarizes the strength of the relationship between two variables.
We can assume the distribution of each variable fits a Gaussian (bell curve) distribution. If
this is the case, we can use the Pearson’s correlation coefficient to summarize the
correlation between the variables. The Pearson’s correlation coefficient is a number
between -1 and 1 that describes a negative or positive correlation respectively. A value of
zero indicates no correlation.

▪ We can calculate the correlation for time series observations with observations with
previous time steps, called lags. Because the correlation of the time series observations is
calculated with values of the same series at previous times, this is called a serial
correlation, or an autocorrelation.

57
Partial autocorrelation
▪ A partial autocorrelation is a summary of the relationship between an observation in a
time series with observations at prior time steps with the relationships of intervening
observations removed.

▪ The partial autocorrelation at lag k is the correlation that results after removing the effect
of any correlations due to the terms at shorter lags.

▪ The autocorrelation for an observation and an observation at a prior time step is comprised
of both the direct correlation and indirect correlations. These indirect correlations are a
linear function of the correlation of the observation, with observations at intervening time
steps. It is these indirect correlations that the partial autocorrelation function seeks to
remove.

58
On the difference btw autocorrelation and
partial autocorrelation
▪ We know that the autocorrelation function describes the autocorrelation between an
observation and another observation at a prior time step that includes direct and indirect
dependence information. This means we would expect the ACF for the AR(k) time series to
be strong to a lag of k and the inertia of that relationship would carry on to subsequent lag
values, trailing off at some point as the effect was weakened.

▪ We know that the partial autocorrelation function only describes the direct relationship
between an observation and its lag. This would suggest that there would be no correlation
for lag values beyond k.

59
On the difference btw autocorrelation and
partial autocorrelation & moving average
▪ Consider a time series that was generated by a moving average (MA) process with a lag of
k. Remember that the moving average process is an autoregression model of the time
series of residual errors from prior predictions. Another way to think about the moving
average model is that it corrects future forecasts based on errors made on recent forecasts.
We would expect the ACF for the MA(k) process to show a strong correlation with recent
values up to the lag of k, then a sharp decline to low or no correlation. By definition, this is
how the process was generated.

▪ For the PACF, we would expect the plot to show a strong relationship to the lag and a
trailing off of correlation from the lag onwards. Again, this is exactly the expectation of the
ACF and PACF plots for an MA(k) process.

60
Used methods in time series analysis

▪ Convolutional and Long Short-Term Memory Neural Networks,

▪ Multilayer Perceptron models for univariate, multivariate and multi-step time series
forecasting problems

▪ Convolutional Neural Network models for univariate, multivariate and multi-step time
series forecasting problems

▪ Long Short-Term Memory Neural Network models for univariate, multi- variate and multi-
step time series forecasting problems

61
Time Series Forecasting Methods: classical vs. ML
▪ 8 classical methods:
▪ Naive 2, which is actually a random walk model adjusted for season.
▪ Simple Exponential Smoothing.
▪ Holt.
▪ Damped exponential smoothing.
▪ Average of SES, Holt, and Damped.
▪ Theta method.
▪ ARIMA, automatic.
▪ ETS, automatic.

▪ 10 machine learning methods:
▪ Multilayer Perceptron (MLP).
▪ Bayesian Neural Network (BNN).
▪ Radial Basis Functions (RBF).
▪ Generalized Regression Neural Networks (GRNN), also called kernel regression.
▪ k-Nearest Neighbour regression (KNN).
▪ CART regression trees (CART).
▪ Support Vector Regression (SVR).
▪ Gaussian Processes (GP).
▪ An additional two modern neural networks:
▪ Recurrent Neural Network (RNN).
▪ Long Short-Term Memory (LSTM).

Spyros Makridakis, et al., "Statistical and Machine Learning forecasting
methods: Concerns and ways forward", concludes that on certain
univariate test cases classical methods are better.
62
Time series and the danger of cross-correlation
▪ Identifying strong correlations or relationships between time series. If you have 1,000
metrics (time series), you can compute 499,500 = 1,000*999/2 correlations (with k metrics or
variables, the number of cross-correlations is m = k*(k-1)/2).
▪ If you include cross-correlations with time lags then we are dealing with many, many
millions of correlations. Out of all these correlations, a few will be extremely high just by
chance.
▪ However, a spectral analysis of normalized time series (instead of correlation
analysis) provide a much more robust mechanism to identify true relationships.

▪ The article below is very interesting, as it shows how to compute how likely you are
to get a correlation that is pure chance!

https://www.analyticbridge.datasciencecentral.com/profiles/blogs/the-curse-of-big-data 63
Time series, random
walk and R2 metric: #1
▪ The time series is the result of a random walk, which is
purely stochastic and from which, by definition, it is impossible to
learn anything. It is simply a random process.
▪ We used an LSTM model, and the R2 score on
the right is fairly good, giving us the
impression that we have actually learned something.
▪ Time series data tend to be correlated in time, and
exhibit a significant autocorrelation. In this case, that
means that the index at time "t+1" is quite likely close
to the index at time "t". As illustrated in the above
figure to the right, what the model is actually doing is
that when predicting the value at time "t+1", it simply
uses the value at time "t" as its prediction (often
referred to as the persistence model).

https://www.linkedin.com/pulse/how-use-machine-learning-time-series-forecasting-vegard-flovik-phd/ 64
Time series, random walk and R2 metric: #2
▪ Plotting the cross-correlation between the
predicted and real value (below figure), we see a
clear peak at a time lag of 1 day, indicating that the
model simply uses the previous value as the
prediction for the future.
▪ Measures such as R2 (and others) can be very
misleading, and one can easily be fooled into
being overly confident in the model accuracy.
▪ What can we do? We can render the TS
approximately stationary (i.e., "stationarised").
One such basic transformation is to time-
difference the data, as illustrated in the below
figure.

65
https://www.linkedin.com/pulse/how-use-machine-learning-time-series-forecasting-vegard-flovik-phd/
Time series, random walk and R2 metric: #3
▪ Rather than considering the index directly, we are
calculating the difference between consecutive
time steps.
▪ Defining the model to predict the difference is a
much stronger test of the model's predictive
power. In that case, it cannot simply exploit the fact that the
data has a strong autocorrelation, and use the
value at time "t" as the prediction for "t+1".
▪ This figure indicates that the model is not able to
predict future changes based on historical events,
which is the expected result in this case, since the
data is generated using a completely stochastic
random walk process.

66
https://www.linkedin.com/pulse/how-use-machine-learning-time-series-forecasting-vegard-flovik-phd/
How to check if your time series is a random walk?
▪ Ask these 3 questions:
▪ The time series shows a strong temporal dependence (autocorrelation) that decays
linearly or in a similar pattern.
▪ The time series is non-stationary and making it stationary shows no obviously learnable
structure in the data.
▪ The persistence model (using the observation at the previous time step as what will
happen in the next time step) provides the best source of reliable predictions.

▪ This last point is key for time series forecasting. Baseline forecasts with the persistence
model quickly indicate whether you can do significantly better. If you can’t, you’re probably
dealing with a random walk (or close to it). The human mind is hardwired to look for
patterns everywhere and we must be vigilant that we are not fooling ourselves and wasting
time by developing elaborate models for random walk processes.

67
https://www.linkedin.com/pulse/how-use-machine-learning-time-series-forecasting-vegard-flovik-phd/
The Granger causality test: #1

▪ What is it used for? The Granger causality test is a statistical hypothesis test for


determining whether one time series is useful in forecasting another.
▪ Definition? A time series X is said to Granger-cause Y if it can be shown, usually through a
series of t-tests and F-tests on lagged values of X (and with lagged values of Y also
included), that those X values provide statistically significant information about future
values of Y.
▪ Granger defined the causality relationship based on two principles:
▪ The cause happens prior to its effect.
▪ The cause has unique information about the future values of its effect.

https://www.linkedin.com/pulse/how-use-machine-learning-time-series-forecasting-vegard-flovik-phd-1f/ 68
The Granger causality test: #2
▪ When time series X Granger-causes time series Y (as
illustrated below), the patterns in X are approximately
repeated in Y after some time lag (two examples are
indicated with arrows). Thus, past values of X can be
used for the prediction of future values of Y.

▪ The original definition of Granger causality does not


account for latent confounding effects and does not
capture instantaneous and non-linear causal
relationships.
▪ As such, performing a Granger causality test cannot
give you a definitive answer whether there exists a
causal relationship between your input variables and
the target you are trying to predict.
▪ Still, it can definitely be worth looking into, and
provides additional information compared to relying
purely on the (possibly spurious) correlation between
them.

https://www.linkedin.com/pulse/how-use-machine-learning-time-series-forecasting-vegard-flovik-phd-1f/ 69
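A hedged sketch using statsmodels' grangercausalitytests, which tests whether the second column Granger-causes the first; the toy data is constructed so that x leads y by two steps:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = np.roll(x, 2) + rng.normal(scale=0.1, size=300)  # y repeats x after a lag of 2

# Column order matters: the test checks whether column 2 ("x") Granger-causes column 1 ("y")
data = pd.DataFrame({"y": y, "x": x}).iloc[2:]        # drop the wrapped-around rows
results = grangercausalitytests(data, maxlag=4)       # prints F-test p-values per lag
```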
Autoregression for times series

▪ Autoregression is a time series model that uses observations from previous time steps as
input to a regression equation to predict the value at the next time step.

▪ It is a very simple idea that can result in accurate forecasts on a range of time series
problems.

70
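A minimal sketch of an autoregressive model with statsmodels' AutoReg; the lag order of 5 is arbitrary:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.ar_model import AutoReg

rng = np.random.default_rng(1)
y = pd.Series(np.cumsum(rng.normal(size=200)))   # toy series

model = AutoReg(y, lags=5)        # regression on the 5 most recent observations
fitted = model.fit()
print(fitted.params)              # intercept and lag coefficients
print(fitted.predict(start=len(y), end=len(y) + 6))  # 7-step out-of-sample forecast
```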
Persistence model

▪ When predicting the value at time "t+1", it simply uses the value at time "t" as its
prediction (often referred to as the persistence model).

▪ It can be used to provide a BASELINE of performance for the problem that we can use for
COMPARISON with an autoregression model.

71
Can you use t-statistics in time series?
▪ No, you can’t.

▪ In time series, we cannot assume that the observations are independent!

▪ This will often affect the distribution of the
t-statistic, and invalidate the usual inferences.

http://people.stern.nyu.edu/churvich/Forecasting/Handouts/CourantTalk2.pdf 72
Random walk
▪ Stock prices follow a random walk, as long as markets are efficient. If the price change were
predictable, investors would quickly figure this out, thereby removing the predictability.
▪ In an efficient market, the best forecast of the future price is the current price, and the best
forecast of the future return is zero.
▪ Since the variance of a random walk is infinite, it makes no sense to talk about the
correlation between stock prices (assuming that the prices follow a random walk, or simply
assuming that prices have an infinite variance).
▪ It can be shown that if we take two random walks that are completely independent of each
other, there is a very high probability of finding a (spuriously) high correlation coefficient
between them. (This may explain the bond yield example). This underscores the futility of
looking at correlations between two price series.

http://people.stern.nyu.edu/churvich/Forecasting/Handouts/CourantTalk2.pdf 73
Unit root tests
▪ To try to determine whether our price data came from a random walk, we can test whether
the true slope is 1.
▪ Issue: the t-statistic for this hypothesis does not have an approximately standard normal
distribution, even if we really have a random walk.
▪ Fortunately, the distribution of this t -statistic has been determined (Dickey and Fuller), and
tables are available. The result is a unit root test. In the unit root test, we test the null
hypothesis that the series is a random walk against the alternative hypothesis that it is an
AR(1) with ρ < 1. Note that under the alternative hypothesis, the series is stationary, and
therefore mean reverting, while under the null hypothesis it is nonstationary.

http://people.stern.nyu.edu/churvich/Forecasting/Handouts/CourantTalk2.pdf 74
Cointegration – nonstationary series #1
▪ Cointegration Suppose we have two nonstationary series {xt } and {yt }, both
(approximately) random walks. How do we measure their tendency to move together?
Correlation is meaningless here.

▪ Both series wander all over the place, since they are nonstationary.

▪ Instead of looking at how they wander from a particular point (such as zero), let’s look at
how they wander from each other. Maybe the "spread” {yt −xt } is stationary.

▪ Then even though both series wander all over the place separately, they are tied to each
other in that the spread between them is mean reverting. So we can make bets on the
reversion of this spread.

http://people.stern.nyu.edu/churvich/Forecasting/Handouts/CourantTalk2.pdf 75
Cointegration – nonstationary series #2
▪ More generally, maybe there is a β such that the linear combination {yt − βxt } is stationary.
If so, then we say that {xt } and {yt } are cointegrated

▪ A simple approach to cointegration is first to do unit root tests on {xt } and {yt } separately.
Next, estimate β by an (ordinary) regression of {yt } on {xt }, and finally do a unit root test
on the residuals {yt −βˆ xt }.

▪ If the tests indicate that {xt } and {yt } are nonstationary, but {yt −βˆ xt } is stationary, then
we declare that {xt} and {yt} are cointegrated, with cointegrating parameter β.

http://people.stern.nyu.edu/churvich/Forecasting/Handouts/CourantTalk2.pdf 76
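A hedged sketch of this two-step idea using statsmodels' coint, which wraps the regression and the unit-root test on the residuals; the toy series are constructed to be cointegrated:

```python
import numpy as np
from statsmodels.tsa.stattools import coint

rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=500))            # a random walk
y = 2.0 * x + rng.normal(scale=1.0, size=500)  # wanders with x, so the spread is stationary

score, pvalue, crit_values = coint(y, x)       # Engle-Granger cointegration test
print(f"p-value: {pvalue:.4f}")                # small p-value suggests cointegration
```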
Time series cross validation
▪ The only minor difference, compared with
standard supervised learning methods is
the way to perform cross validation.
▪ Because time series data have this
temporal structure, one cannot randomly
mix values in a fold while preserving this
structure.
▪ With randomization, all the time
dependencies between observations will
be lost, hence the cross validation method
that we'll be using will be based on a
rolling window approach.

77
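A minimal sketch of this rolling-window style of cross validation with scikit-learn's TimeSeriesSplit, which always keeps training indices before test indices:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)   # toy feature matrix ordered in time
y = np.arange(20)                  # toy target

tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # Training data always comes strictly before the test data
    print(f"fold {fold}: train={train_idx[0]}..{train_idx[-1]} "
          f"test={test_idx[0]}..{test_idx[-1]}")
```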
Time series: stationary and ergodicity
▪ In a time-series, we observe a single run of a stochastic process rather than repeated runs
of the stochastic process. We observe 1 long experiment rather than multiple, independent
experiments.
▪ We need stationarity and ergodicity so that observing a long run of a stochastic process is
similar to observing many independent runs of a stochastic process. That is to say, we
observe multiple observations over time rather than multiple draws! Without these properties, the law of
large numbers may not converge to anything at all!
▪ For multiple observations over time to accomplish a similar task as multiple draws from the
sample space, we need stationarity and ergodicity.

https://stats.stackexchange.com/questions/19715/why-does-a-time-series-have-to-be-stationary 78
3 ways to rephrase a time series
▪ Given a time series describing the temperature:
▪ Regression framing – predict tomorrow's minimum temperature given the minimum
temperature of the day before
▪ Classification framings - Given the minimum temperature the day before,
predict the temperature as either cold, moderate, or hot
▪ Time horizon framings – predict the minimum temperature for the next 7
days

79
How to make a time series stationary?
▪ Differencing can help stabilise the mean of a time series by removing
changes in the level of a time series, and therefore eliminating (or
reducing) trend and seasonality.
▪ Logarithms can help to stabilise the variance of a time series.

https://otexts.com/fpp2/stationarity.html 80
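A small sketch of both transformations with pandas and NumPy; the toy values are made up, and combining the two (log first, then difference) is a common pattern:

```python
import numpy as np
import pandas as pd

series = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119], dtype=float)

log_series = np.log(series)                    # stabilise the variance
first_diff = series.diff(1).dropna()           # remove changes in level (trend)
log_diff = np.log(series).diff(1).dropna()     # the two are often combined

print(first_diff.head())
print(log_diff.head())
```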
Second-order differencing
▪ Occasionally the differenced data will not appear to be stationary and it may be necessary
to difference the data a second time to obtain a stationary series.
▪ Then, we would model the “change in the changes” of the original data.
▪ In practice, it is almost never necessary to go beyond second-order differences.

https://otexts.com/fpp2/stationarity.html 81
Seasonal differences
▪ A seasonal difference is the difference between an observation and the previous
observation from the same season.
▪ These are also called “lag-m differences,” as we subtract the observation after a lag of m
periods.
▪ To distinguish seasonal differences from ordinary differences, we sometimes refer to
ordinary differences as “first differences,” meaning differences at lag 1.

https://otexts.com/fpp2/stationarity.html 82
https://otexts.com/fpp2/stationarity.html

Example of making time series stationary

Logs and seasonal differences of the A10 (antidiabetic) sales data: the logarithms stabilise the variance, while the seasonal differences remove the seasonality and trend. The data are first transformed using logarithms (second panel), then seasonal differences are calculated (third panel). The data still seem somewhat non-stationary, and so a further lot of first differences are computed (bottom panel). 83
Does the order of differencing matter?

▪ When both seasonal and first differences are applied, it makes no difference which is done
first—the result will be the same. However, if the data have a strong seasonal pattern, we
recommend that seasonal differencing be done first, because the resulting series will
sometimes be stationary and there will be no need for a further first difference. If first
differencing is done first, there will still be seasonality present.

▪ It is important that if differencing is used, the differences are interpretable. First differences
are the change between one observation and the next. Seasonal differences are the change
between one year to the next. Other lags are unlikely to make much interpretable sense
and should be avoided.

https://otexts.com/fpp2/stationarity.html 84
Degree of subjectivity in selecting which
differences to apply
▪ There is a degree of subjectivity in selecting which differences to apply.
▪ At the end of the day someone will have to answer the question: are the data now sufficiently
stationary? If not, an extra round of differencing can be applied.
▪ There are always some choices to be made in the modelling process, and different analysts
may make different choices.

https://otexts.com/fpp2/stationarity.html 85
Can we determine when differencing is
needed?
▪ One way to determine more objectively whether differencing is required is to use a unit
root test.
▪ These are statistical hypothesis tests of stationarity that are designed for determining
whether differencing is required.
▪ A number of unit root tests are available. One of them is the Kwiatkowski-Phillips-Schmidt-
Shin (KPSS) test. In this test, the null hypothesis is that the data are stationary, and we look
for evidence that the null hypothesis is false. Consequently, small p-values (e.g., less than
0.05) suggest that differencing is required.

https://otexts.com/fpp2/stationarity.html 86
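A hedged sketch of the KPSS test with statsmodels; note that its null hypothesis is stationarity, the opposite of the ADF test:

```python
import numpy as np
from statsmodels.tsa.stattools import kpss

rng = np.random.default_rng(7)
series = np.cumsum(rng.normal(size=500))       # random walk: expected non-stationary

stat, pvalue, lags, crit_values = kpss(series, regression="c", nlags="auto")
print(f"KPSS statistic: {stat:.3f}, p-value: {pvalue:.3f}")
# Small p-value (< 0.05): reject the stationarity null -> differencing is likely required
```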
Why considering RNNs for time series?

▪ Unlike the deep neural network, RNNs contain hidden states which are distributed across
time.

▪ This allows them to efficiently store a lot of information about the past.

▪ As with a regular deep neural network, the non-linear dynamics allow them to update
their hidden state in complicated ways.

Lewis, N. D. "Deep Time Series Forecasting with Python." Create Space Independent Publishing Platform (2016). 87
Argument for using ANNs in time series analysis
▪ What is the issue? The presence of trend and seasonal variation can be hard to estimate
and/or remove. The chief difficulty is that the underlying dynamics generating the data are
unknown.
▪ Traditional statistics approach: require the specification of an assumed time-series model,
such as auto-regressive models, Linear Dynamical Systems, or Hidden Markov Model which
require skills.
▪ ANNs approach: The great thing about neural networks is that you do not need to specify
the exact nature of the relationship (linear, non-linear, seasonality, trend) that exists
between the input and output. The hidden layers of a deep neural network (DNN) remove
the need to prespecify the nature of the data generating mechanism. This is because they
can approximate extremely complex decision functions.

Lewis, N. D. "Deep Time Series Forecasting with Python." Create Space Independent Publishing Platform (2016). 88
Clustering for time series
• Apart from classification and regression using RNNs / CNNs /
Autoregressive models we’re also interested in clustering time series into
meaningful groups. We can do this using a combination of time-series-specific
distances (like the above-mentioned DTW) and metric-based clustering
algorithms like K-Means, but this is a rather slow and suboptimal approach. We
would like to have something that can work with signals of different
lengths, but is much more efficient.
• Of course, we can ask neural networks to provide us with an embedding
space where we will perform clustering, for instance, with autoencoders. 

https://alexrachnog.medium.com/deep-learning-the-final-frontier-for-signal-processing-and-time-se
ries-analysis-734307167ad6
 
89
Nyquist-Shannon sampling theorem

90
Prophet (FaceBook)
● Open-source for univariate (one variable) time series forecasting
● Prophet implements what they refer to as an additive time series forecasting model, and the implementation supports trends, seasonality, and
holidays.
● It is designed to be easy and completely automatic, e.g. point it at a time series and get a forecast.
● Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily
seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is
robust to missing data and shifts in the trend, and typically handles outliers well.*
● Prophet decomposes time series data into trend, seasonality and holiday effects. Trend models non-periodic changes in the time series data.
Seasonality is caused by periodic changes such as daily, weekly, or yearly seasonality. Holiday effects occur on irregular
schedules over a day or a period of days. The error term is what is not explained by the model.
● Accurate and fast - Prophet is accurate and fast. It is used in many applications across Facebook for producing reliable forecasts for planning and
goal setting.
● Fully automatic - Prophet is fully automatic. We will get a reasonable forecast on messy data with no manual effort.
● Tunable forecasts - Prophet produces adjustable forecasts. It includes many possibilities for users to tweak and adjust forecasts. We can use
human-interpretable parameters to improve the forecast by adding our domain knowledge.
● Available in R or Python - We can implement the Prophet procedure in R or Python.
● Handles seasonal variations well - Prophet accommodates seasonality with multiple periods.
● Robust to outliers - It is robust to outliers. It handles outliers by removing them.
● Robust to missing data - Prophet is resilient to missing data.
● Prophet follows the sklearn model API.
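A minimal usage sketch, assuming the prophet package is installed; Prophet expects a DataFrame with columns ds (dates) and y (values), and the data here is synthetic:

```python
import numpy as np
import pandas as pd
from prophet import Prophet

# Synthetic daily data with weekly seasonality and a mild trend
dates = pd.date_range("2020-01-01", periods=730, freq="D")
values = (np.arange(730) * 0.05
          + 5 * np.sin(np.arange(730) * 2 * np.pi / 7)
          + np.random.default_rng(0).normal(size=730))
df = pd.DataFrame({"ds": dates, "y": values})

m = Prophet()                                   # fully automatic defaults
m.fit(df)
future = m.make_future_dataframe(periods=30)    # extend 30 days past the history
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```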
