Module 3: Time Series Analysis

Time series analysis is used for non-stationary data—things that are constantly fluctuating over
time or are affected by time. Industries like finance, retail, and economics frequently use time
series analysis because currency and sales are always changing. Stock market analysis is an
excellent example of time series analysis in action, especially with automated trading algorithms.
Likewise, time series analysis is ideal for forecasting weather changes, helping meteorologists
predict everything from tomorrow’s weather report to future years of climate change. Examples
of time series analysis in action include:

● Weather data
● Rainfall measurements
● Temperature readings
● Heart rate monitoring (EKG)
● Brain monitoring (EEG)
● Quarterly sales
● Stock prices
● Automated stock trading
● Industry forecasts
● Interest rates
Time Series Analysis Types

Because time series analysis includes many categories and variations of data, analysts sometimes
must build complex models. However, no model can account for every source of variance, and a
model fitted to one sample cannot be generalized to every other. Models that are too complex, or
that try to do too many things, can fail to fit the data well. Underfit and overfit models alike fail
to distinguish random error from true relationships, skewing the analysis and producing incorrect
forecasts.

Models of time series analysis include:

● Classification: Identifies and assigns categories to the data.


● Curve fitting: Plots the data along a curve to study the relationships of variables within
the data.
● Descriptive analysis: Identifies patterns in time series data, like trends, cycles, or
seasonal variation.
● Explanative analysis: Attempts to understand the data and the relationships within it, as
well as cause and effect.
● Exploratory analysis: Highlights the main characteristics of the time series data, usually
in a visual format.
● Forecasting: Predicts future data based on historical trends, using past observations as a
model to project the scenarios that could occur at future plot points.
● Intervention analysis: Studies how an event can change the data.
● Segmentation: Splits the data into segments to show the underlying properties of the
source information.

Data classification

Further, time series data can be classified into two main categories:

● Stock time series data measures attributes at a single point in time, like a static snapshot
of the information as it was.
● Flow time series data measures the activity of the attributes over a certain period, where
each observation is part of a larger whole and makes up a portion of the results.
Data variations

In time series data, variations can occur sporadically throughout the data:

● Functional analysis can pick out the patterns and relationships within the data to identify
notable events.
● Trend analysis means determining consistent movement in a certain direction. There are
two types of trends: deterministic, where we can find the underlying cause, and
stochastic, which is random and unexplainable.
● Seasonal variation describes events that occur at specific and regular intervals during the
course of a year. Serial dependence occurs when data points close together in time tend to
be related.

Time series analysis and forecasting models must define the types of data relevant to answering
the business question. Once analysts have chosen the relevant data they want to analyze, they
choose what types of analysis and techniques are the best fit.

Important Considerations for Time Series Analysis


While time series data is data collected over time, there are different types of data that describe
how and when that time data was recorded. For example:

● Time series data is data that is recorded over consistent intervals of time.
● Cross-sectional data consists of several variables recorded at the same time.
● Pooled data is a combination of both time series data and cross-sectional data.
Time Series Analysis Models and Techniques

Just as there are many types and models, there are also a variety of methods to study data. Here
are the three most common.

● Box-Jenkins ARIMA models: These univariate models are used to better understand a
single time-dependent variable, such as temperature over time, and to predict future data
points of variables. These models work on the assumption that the data is stationary.
Analysts have to account for and remove as many differences and seasonalities in past
data points as they can. Thankfully, the ARIMA model includes terms to account for
moving averages, seasonal difference operators, and autoregressive terms within the
model.
● Box-Jenkins Multivariate Models: Multivariate models are used to analyze more than
one time-dependent variable, such as temperature and humidity, over time.
● Holt-Winters Method: The Holt-Winters method is an exponential smoothing technique
designed to predict outcomes when the data points include seasonality (see the sketch
below).
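To make this concrete, here is a minimal sketch of Holt-Winters smoothing with statsmodels'
ExponentialSmoothing class; the file name, column name, and 12-period (monthly) seasonality are
assumptions for illustration, not part of the original text.

import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Hypothetical monthly series with yearly seasonality (assumed data).
data = pd.read_csv('monthly_sales.csv', index_col=0, parse_dates=True)['value']

# Additive trend and seasonality; seasonal_periods=12 assumes monthly data.
model = ExponentialSmoothing(data, trend='add', seasonal='add', seasonal_periods=12)
fit = model.fit()

# Forecast the next 12 periods.
print(fit.forecast(12))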

What is Time Series Analysis?

A time series is a sequence of data points recorded in successive order over a given period of
time.

Objectives:

● To understand how the time series works and what factors affect a given variable at different
points in time.
● To derive insights into the features of the given dataset that change over time.
● To support predicting future values of the time series variable.
● Assumption: The single underlying assumption is stationarity, meaning that shifting the origin
of time does not affect the statistical properties of the process.
How to analyze Time Series?

● Collecting and cleaning the data
● Preparing visualizations of time vs. the key feature
● Observing the stationarity of the series
● Developing charts to understand its nature
● Model building – AR, MA, ARMA, and ARIMA
● Extracting insights from predictions

Components of Time Series Analysis

● Trend
● Seasonality
● Cyclical
● Irregularity/random

● Trend: Long-term movement with no fixed interval; any divergence within the given dataset over
a continuous timeline. A trend can be positive, negative, or null.
● Seasonality: Regular shifts at fixed intervals within the dataset over a continuous timeline;
the pattern may resemble a bell curve or sawtooth.
● Cyclical: Movement with no fixed interval, with uncertainty in both movement and pattern.
● Irregularity: Unexpected situations, events, or scenarios causing spikes over a short time span.
Data Types of Time Series

While discussing time series data types and their influence, there are two major types:

● Stationary
● Non-Stationary

6.1 Stationary: A dataset should follow the rules of thumb below, having no trend, seasonality,
cyclical, or irregular components:

● The MEAN should be constant throughout the data during the analysis.
● The VARIANCE should be constant with respect to the time frame.
● The COVARIANCE between two observations should depend only on the lag between them, not on
time itself.

6.2 Non-Stationary: This is just the opposite of stationary.

Methods to check Stationarity

During the TSA model preparation workflow, we must assess whether the given dataset is stationary
or not, using statistical tests and plots.
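One common plot-based check (a sketch, using the df_temperature series that appears in the
examples later in this module) is to overlay rolling statistics: if the rolling mean or standard
deviation drifts over time, the series is likely non-stationary.

import matplotlib.pyplot as plt

# Rolling statistics over an assumed 12-observation window.
rolling_mean = df_temperature.rolling(window=12).mean()
rolling_std = df_temperature.rolling(window=12).std()

plt.plot(df_temperature, label='Original')
plt.plot(rolling_mean, label='Rolling mean')
plt.plot(rolling_std, label='Rolling std')
plt.legend()
plt.show()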

Statistical tests: There are two tests available to check whether the dataset is stationary:

● Augmented Dickey-Fuller (ADF) Test


● Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test
Augmented Dickey-Fuller (ADF) Test or Unit Root Test: The ADF test is the most popular
statistical test, with the following hypotheses:

● Null Hypothesis (H0): The series is non-stationary.
● Alternate Hypothesis (HA): The series is stationary.
o p-value > 0.05: fail to reject H0 (the series is non-stationary)
o p-value <= 0.05: reject H0 (the series is stationary)

Kwiatkowski–Phillips–Schmidt–Shin (KPSS): This test checks the null hypothesis (H0) that the
time series is stationary around a deterministic trend, against the alternative of a unit root.
Note that its hypotheses are the reverse of the ADF test's. Since TSA requires stationary data
for its further analysis, we have to make sure that the dataset is stationary.
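A minimal sketch of the KPSS test with statsmodels, assuming df_temperature is the pandas Series
used in the later examples:

from statsmodels.tsa.stattools import kpss

# KPSS: the null hypothesis is stationarity (the reverse of ADF).
stat, p_value, n_lags, critical_values = kpss(df_temperature, regression='c', nlags='auto')
print('KPSS Statistic: %f' % stat)
print('p-value: %f' % p_value)
# p-value < 0.05 => reject H0 => the series is non-stationary.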

Converting Non-Stationary into Stationary

How do we convert a non-stationary series into a stationary one for effective time series
modeling? There are three major methods available for this conversion:

● Detrending
● Differencing
● Transformation

8.1 Detrending: This involves removing the trend effects from the given dataset, showing only
the differences in values from the trend. It allows cyclical patterns to be identified.

8.2 Differencing: This is a simple transformation of the series into a new time series, used to
remove the series' dependence on time and stabilize its mean, so trend and seasonality are
reduced during this transformation:

Y't = Yt – Yt-1

where Yt is the value at time t and Yt-1 is the value at the previous time step.
(Figure: detrending and differencing extractions.)


8.3 Transformation: This includes three different methods: power transform, square root, and log
transform. The most commonly used is the log transform.
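A minimal sketch combining a log transform with differencing, assuming the df_temperature series
contains only positive values:

import numpy as np

# The log transform stabilizes the variance (requires positive values).
df_log = np.log(df_temperature)

# Differencing the logged series then stabilizes the mean.
df_log_diff = df_log.diff().dropna()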

ACF & PACF

● Auto-Correlation Function (ACF)


● Partial Auto-Correlation Function (PACF)

Auto-Correlation Function (ACF): The ACF indicates how similar a value within a given time series
is to its previous values. In other words, it measures the degree of similarity between a given
time series and a lagged version of that time series at the different intervals we observed.
Python's statsmodels library calculates autocorrelation. This is used to identify trends in the
given dataset and the influence of previously observed values on the currently observed values.

Partial Auto-Correlation Function (PACF): The PACF is similar to the ACF but a little more
challenging to interpret. It shows the correlation of the sequence with itself at a given number
of time units of lag, where only the direct effect is shown and all intermediary effects are
removed from the given time series.

Auto-correlation and Partial Auto-Correlation

from statsmodels.graphics.tsaplots import plot_acf
import matplotlib.pyplot as plt

plot_acf(df_temperature)  # ACF over the default number of lags
plt.show()
plot_acf(df_temperature, lags=30)  # ACF over the first 30 lags
plt.show()

Observation: The previous temperature influences the current temperature, but the significance
of that influence decreases (with slight oscillations) as the lag increases over regular time
intervals.
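The PACF can be plotted the same way; a minimal sketch with the same assumed df_temperature
series:

from statsmodels.graphics.tsaplots import plot_pacf
import matplotlib.pyplot as plt

plot_pacf(df_temperature, lags=30)  # direct effect of each lag only
plt.show()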


10.4 Interpret ACF and PACF plots

ACF                     PACF                    Best-fit model
Declines gradually      Drops instantly         Auto-Regressive (AR) model
Drops instantly         Declines gradually      Moving Average (MA) model
Declines gradually      Declines gradually      ARMA model
Drops instantly         Drops instantly         No model (white noise)

Remember that both ACF and PACF require stationary time series for analysis.
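To see these signatures in practice, here is a sketch that simulates an AR(1) process with
statsmodels and plots its ACF and PACF; the 0.7 coefficient and 500-observation sample size are
arbitrary assumptions. The ACF should decline gradually while the PACF drops after lag 1,
matching the first row of the table above.

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# AR(1) with coefficient 0.7; statsmodels expects the AR polynomial
# [1, -0.7] and the MA polynomial [1].
ar_poly, ma_poly = np.array([1, -0.7]), np.array([1])
sample = ArmaProcess(ar_poly, ma_poly).generate_sample(nsample=500)

plot_acf(sample, lags=30)   # declines gradually
plot_pacf(sample, lags=30)  # drops after lag 1
plt.show()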

Auto-Regressive model

This is a simple model that predicts future performance based on past performance. It is mainly
used for forecasting when there is some correlation between values in a given time series and
the values that precede and succeed them.

An AR model is a linear regression model that uses lagged variables as input. The linear
regression model can be built easily with the scikit-learn library by specifying the inputs to
use. The statsmodels library provides autoregression model-specific functions where you have to
specify an appropriate lag value and train the model. The AutoReg class provides the results in
a few simple steps:

● Create the model with AutoReg().
● Call fit() to train it on our dataset.
● This returns an AutoRegResults object.
● Once fit, make a prediction by calling the predict() function.

The equation for the AR model (compare with Y = mX + c):

Yt = C + b1 Yt-1 + b2 Yt-2 + … + bp Yt-p + Ert

Key parameters:

● p = the number of past values (lags) used
● Yt = the value at time t, a function of the past values
● Ert = the error term at time t
● C = the intercept

Let's check whether the given dataset or time series is random or not:

from matplotlib import pyplot
from pandas.plotting import lag_plot

# A lag plot shows each value against its predecessor.
lag_plot(df_temperature)
pyplot.show()
Observation: Yes, looks random and scattered.

Implementation of Auto-Regressive model

# import libraries
from matplotlib import pyplot
from statsmodels.tsa.ar_model import AutoReg
from sklearn.metrics import mean_squared_error
from math import sqrt
# load csv as dataset
#series = read_csv('daily-min-temperatures.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
# split dataset into a training set and a 7-observation test set
X = df_temperature.values
train, test = X[1:len(X)-7], X[len(X)-7:]
# train the autoregression model with 20 lags
model = AutoReg(train, lags=20)
model_fit = model.fit()
print('Coefficients: %s' % model_fit.params)
# make out-of-sample predictions for the test window
predictions = model_fit.predict(start=len(train), end=len(train)+len(test)-1, dynamic=False)
for i in range(len(predictions)):
    print('predicted=%f, expected=%f' % (predictions[i], test[i]))
rmse = sqrt(mean_squared_error(test, predictions))
print('Test RMSE: %.3f' % rmse)
# plot expected vs. predicted values
pyplot.plot(test)
pyplot.plot(predictions, color='red')
pyplot.show()
OUTPUT

predicted=15.893972, expected=16.275000
predicted=15.917959, expected=16.600000
predicted=15.812741, expected=16.475000
predicted=15.787555, expected=16.375000
predicted=16.023780, expected=16.283333
predicted=15.940271, expected=16.525000
predicted=15.831538, expected=16.758333
Test RMSE: 0.617
Observation: Expected (blue) against predicted (red). The forecast tracks closely through the
4th day, with the deviation growing by the 6th day.

Based on the frequency, a Time Series can be classified into the following categories:

1. Yearly (For example, Annual Budget)

2. Quarterly (For example, Expenses)

3. Monthly (For example, Air Traffic)

4. Weekly (For example, Sale Quantity)

5. Daily (For instance, Weather)

6. Hourly (For example, Stocks Price)

7. Minutes wise (For example, Inbound Calls in a Call Centre)

8. Seconds wise (For example, Web Traffic)

Time series forecasting is widely used in manufacturing companies, as it drives the primary
business planning, procurement, and production activities. Any forecast errors will ripple
through the supply chain or the wider business framework.
Time series forecasting can be broadly classified into two categories:

1. Univariate Time Series Forecasting: forecasting that uses only the former values of the
time series in order to guess the forthcoming values.

2. Multi-Variate Time Series Forecasting: forecasting that uses predictors other than the
series itself, also known as exogenous variables, in order to forecast, as sketched below.
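As a sketch of the multivariate case, statsmodels' SARIMAX accepts exogenous regressors through
its exog parameter; the temperature/humidity pairing, column names, and future_humidity frame
here are assumptions for illustration:

from statsmodels.tsa.statespace.sarimax import SARIMAX

# Forecast temperature using humidity as an exogenous regressor
# (a df with 'temperature' and 'humidity' columns is assumed).
model = SARIMAX(df['temperature'], exog=df[['humidity']], order=(1, 1, 1))
results = model.fit(disp=False)

# Forecasting requires future values of the exogenous variable;
# future_humidity is an assumed 7-row frame of upcoming humidity values.
forecast = results.forecast(steps=7, exog=future_humidity)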

ARIMA, short for 'Auto Regressive Integrated Moving Average', is a class of models that explains
a given time series based on its own previous values, i.e., its lags and the lagged errors in
forecasting, so that the resulting equation can be used to forecast future values.

There are three terms characterizing an ARIMA model:

p, q, and d

where,

○ p = the order of the AR term

○ q = the order of the MA term

○ d = the number of differences required to make the time series stationary

If a time series has seasonal patterns, we have to add seasonal terms, and it becomes SARIMA,
short for 'Seasonal ARIMA', as sketched below.
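A minimal sketch of a seasonal model via statsmodels' SARIMAX; the non-seasonal and seasonal
orders and the 12-period season are assumptions for illustration:

from statsmodels.tsa.statespace.sarimax import SARIMAX

# SARIMA(p,d,q)x(P,D,Q,s): seasonal_order adds seasonal AR/MA terms and
# seasonal differencing with period s (12 assumed for monthly data).
model = SARIMAX(df_temperature, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
results = model.fit(disp=False)
print(results.summary())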

The 'p' is the order of the 'AR' (Auto-Regressive) term, i.e., the number of lags of Y to be
used as predictors. Meanwhile, 'q' is the order of the 'MA' (Moving Average) term, i.e., the
number of lagged forecast errors that should be used in the ARIMA model.
A pure AR (Auto-Regressive only) model is a model that relies only on its own lags. Hence, we
can also conclude that it is a function of the lags of Yt:

Yt = α + β1 Yt-1 + β2 Yt-2 + … + βp Yt-p + εt

where Yt-1 is lag 1 of the series, β1 is the coefficient of lag 1, and α is the intercept term
calculated by the model.

Similarly, a pure MA (Moving Average only) model is a model where Yt relies only on the lagged
forecast errors:

Yt = α + εt + φ1 εt-1 + φ2 εt-2 + … + φq εt-q

where the error terms εt, εt-1, … are the errors of the autoregressive models of the
corresponding lags.

Thus, we have covered the Auto-Regressive (AR) and Moving Average (MA) models, respectively.

The equation of an ARIMA model: an ARIMA model is one where the time series has been differenced
at least once to make it stationary and the Auto-Regressive (AR) and Moving Average (MA) terms
are combined, giving the following equation:

Yt = α + β1 Yt-1 + … + βp Yt-p + εt + φ1 εt-1 + … + φq εt-q

ARIMA model in words:

Forecasted Yt = Constant + Linear Combination of Lags of Y (up to p lags) + Linear Combination
of Lagged Forecast Errors (up to q lags)

ARMA and ARIMA

ARMA: This is a combination of the Auto-Regressive and Moving Average models for forecasting.
The model describes a weakly stationary stochastic process in terms of two polynomials, one for
the Auto-Regressive part and the second for the Moving Average part.

ARMA is best for predicting stationary series, so ARIMA was introduced, since it supports
non-stationary series as well as stationary ones.

● AR ==> uses the past values to predict the future
● MA ==> uses the past error terms in the given series to predict the future
● I ==> uses differencing of observations to make the data stationary

AR + I + MA = ARIMA
Understand the Signature of ARIMA

● p ==> lag order => the number of lag observations
● d ==> degree of differencing => the number of times the raw observations are differenced
● q ==> order of the moving average => the size of the moving average window

Implementation steps for ARIMA

Step 1: Plot the time series data

Step 2: Difference to make the mean stationary by removing the trend

Step 3: Apply a log transform to stabilize the variance

Step 4: Difference the log-transformed series to make it stationary in both mean and variance

Step 5: Plot the ACF & PACF and identify the potential AR and MA terms

Step 6: Identify the best-fit ARIMA model

Step 7: Forecast/predict values using the best-fit ARIMA model

Step 8: Plot the ACF & PACF of the ARIMA model's residuals and ensure no more information is
left.

Implementation of ARIMA

We have already discussed steps 1-5; let's focus on the rest here.

# note: in statsmodels >= 0.13 this module was removed;
# use statsmodels.tsa.arima.model.ARIMA instead
from statsmodels.tsa.arima_model import ARIMA

model = ARIMA(df_temperature, order=(0, 1, 1))
results_ARIMA = model.fit()
results_ARIMA.summary()

results_ARIMA.forecast(3)[0]

Output:

array([16.47648941, 16.48621826, 16.49594711])

results_ARIMA.plot_predict(start=200)
plt.show()
(Figure: ARIMA process flow.)
Finding the order of differencing 'd' in the ARIMA Model
The primary purpose of differencing in the ARIMA model is to make the Time Series stationary.

from statsmodels.tsa.stattools import adfuller
from numpy import log
import pandas as pd

mydata = pd.read_csv('mydataset.csv', names=['value'], header=0)
res = adfuller(mydata.value.dropna())
print('Augmented Dickey-Fuller Statistic: %f' % res[0])
print('p-value: %f' % res[1])

Output:

Augmented Dickey-Fuller Statistic: -2.464240
p-value: 0.124419

It is necessary to check whether the series is stationary or not. If it is not, we have to
difference it; otherwise, d is zero.

The Augmented Dickey-Fuller (ADF) test's null hypothesis is that the time series is not
stationary. Thus, if the ADF test's p-value is less than the significance level (0.05), we
reject the null hypothesis and infer that the time series is stationary. As we can observe here,
the p-value is greater than the significance level, so we difference the series and check the
autocorrelation plot, as shown below.

import numpy as np, pandas as pd
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt

plt.rcParams.update({'figure.figsize': (9,7), 'figure.dpi': 120})

# Importing data
df = pd.read_csv('mydataset.csv', names=['value'], header=0)

# The genuine series
fig, axes = plt.subplots(3, 2, sharex=True)
axes[0, 0].plot(df.value); axes[0, 0].set_title('The Genuine Series')
plot_acf(df.value, ax=axes[0, 1])

# Order of differencing: first
axes[1, 0].plot(df.value.diff()); axes[1, 0].set_title('Order of Differencing: First')
plot_acf(df.value.diff().dropna(), ax=axes[1, 1])

# Order of differencing: second
axes[2, 0].plot(df.value.diff().diff()); axes[2, 0].set_title('Order of Differencing: Second')
plot_acf(df.value.diff().diff().dropna(), ax=axes[2, 1])

plt.show()

Finding the order of the Auto-Regressive (AR) term (p)


The partial autocorrelation at lag k of a series is the coefficient of that lag in the
autoregression equation of Y.

import numpy as np, pandas as pd
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt

plt.rcParams.update({'figure.figsize': (9,3), 'figure.dpi': 120})

# importing data
df = pd.read_csv('mydataset.csv', names=['value'], header=0)

fig, axes = plt.subplots(1, 2, sharex=True)
axes[0].plot(df.value.diff()); axes[0].set_title('Order of Differencing: First')
axes[1].set(ylim=(0, 5))
plot_pacf(df.value.diff().dropna(), ax=axes[1])
plt.show()

Output:

Explanation:

We can observe that the PACF at lag 1 is significantly above the line of significance. Lag 2
also appears substantial, just managing to cross the significance limit (the blue region).
However, we will be conservative and tentatively fix p as 1.

Finding the Order of the Moving Average (MA) term (q)

Use the ACF plot to find the number of Moving Average (MA) terms. A Moving Average (MA) term is,
technically, the error of the lagged forecast.

import numpy as np, pandas as pd
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt

plt.rcParams.update({'figure.figsize': (9,3), 'figure.dpi': 120})

# importing data
mydata = pd.read_csv('mydataset.csv', names=['value'], header=0)

fig, axes = plt.subplots(1, 2, sharex=True)
axes[0].plot(mydata.value.diff()); axes[0].set_title('Order of Differencing: First')
axes[1].set(ylim=(0, 1.2))
plot_acf(mydata.value.diff().dropna(), ax=axes[1])
plt.show()

Output:

Explanation:

In the above example, we imported the required libraries, modules, and dataset, then plotted the
graphs representing the first-order differencing and its autocorrelation. We can observe that a
couple of lags are well above the line of significance, so let us tentatively fix q as 2. In
case of any doubt, we can also use the simpler model that adequately explains Y.

Building the ARIMA Model

Once we have determined the values of p, q, and d, we will try creating the ARIMA model. The
implementation of the ARIMA() module is shown below:

Example:

import numpy as np, pandas as pd
# note: in statsmodels >= 0.13 use statsmodels.tsa.arima.model.ARIMA
from statsmodels.tsa.arima_model import ARIMA

# importing data
mydata = pd.read_csv('mydataset.csv', names=['value'], header=0)

# Creating the ARIMA model of order (1, 1, 2)
mymodel = ARIMA(mydata.value, order=(1, 1, 2))
modelfit = mymodel.fit(disp=0)
print(modelfit.summary())

                             ARIMA Model Results
==============================================================================
Dep. Variable:                D.value   No. Observations:                   99
Model:                 ARIMA(1, 1, 2)   Log Likelihood                -253.790
Method:                       css-mle   S.D. of innovations              3.119
Date:                Thu, 15 Apr 2021   AIC                            517.579
Time:                        21:10:37   BIC                            530.555
Sample:                             1   HQIC                           522.829
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
const             1.1202      1.290      0.868      0.385      -1.409       3.649
ar.L1.D.value     0.6351      0.257      2.469      0.014       0.131       1.139
ma.L1.D.value     0.5287      0.355      1.489      0.136      -0.167       1.224
ma.L2.D.value    -0.0010      0.321     -0.003      0.998      -0.631       0.629
                                    Roots
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            1.5746           +0.0000j            1.5746            0.0000
MA.1           -1.8850           +0.0000j            1.8850            0.5000
MA.2          545.5472           +0.0000j          545.5472            0.0000
-----------------------------------------------------------------------------

Explanation:

In the above example, we imported the ARIMA module from statsmodels and created an ARIMA model
of order (1, 1, 2). We then printed the model's summary, which reveals a lot of detail. The
middle table is the table of coefficients, where the 'coef' values act as the weights of the
corresponding terms.

We can also notice that the MA2 term's coefficient is close to zero and that its P-value in the
'P>|z|' column is highly insignificant; the P-value should ideally be less than 0.05 for the
corresponding term to be significant.
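Since the MA2 term contributes nothing, a reasonable follow-up (a sketch, not from the original
text) is to drop it and refit a simpler ARIMA(1, 1, 1), then compare the information criteria:

# Refit without the insignificant MA2 term (hypothetical follow-up).
mymodel2 = ARIMA(mydata.value, order=(1, 1, 1))
modelfit2 = mymodel2.fit(disp=0)
print(modelfit2.summary())

# Prefer the candidate with the lower AIC/BIC.
print('AIC:', modelfit2.aic, 'BIC:', modelfit2.bic)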

What are the AIC and BIC?

Akaike's Information Criterion (AIC), which is useful for selecting predictors in regression, is
also useful for determining the order of an ARIMA model. It can be written as

AIC = -2 log(L) + 2(p + q + k + 1)

where L is the likelihood of the data, k = 1 if c ≠ 0, and k = 0 if c = 0. The Bayesian
Information Criterion (BIC) is similar, BIC = -2 log(L) + log(T)(p + q + k + 1) for T
observations, but it penalizes extra parameters more heavily; in both cases, lower values
indicate a better model.
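As a sketch of how the AIC is used in practice, we can fit a small grid of candidate orders and
keep the one with the lowest AIC; the search ranges and the mydata dataset are assumptions
carried over from the earlier examples:

import itertools

# Small grid of candidate (p, d, q) orders (assumed ranges).
best_aic, best_order = float('inf'), None
for p, d, q in itertools.product(range(3), range(2), range(3)):
    try:
        fit = ARIMA(mydata.value, order=(p, d, q)).fit(disp=0)
        if fit.aic < best_aic:
            best_aic, best_order = fit.aic, (p, d, q)
    except Exception:
        continue  # some orders fail to converge

print('Best order by AIC:', best_order, 'AIC:', best_aic)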
