
Time Series Forecasting Assignment
Australian Monthly Gas Production

By: Pranav Viswanathan
CONTENTS

1. Project Objective
2. Assumptions
3. EDA (Exploratory Data Analysis)
   3.1 Data Discovery
   3.2 Top and Bottom of Dataset
   3.3 Class of Dataset
   3.4 Start and End of Series
   3.5 Cycle
   3.6 Structure
   3.7 Summary
   3.8 Outliers
   3.9 Missing Values
4. Creating a Time Series from 1970
5. Plots
6. Periodicity
7. Decomposition of the Time Series
8. Checking for Stationarity
9. ACF and PACF (checking stationarity and autocorrelation)
10. Auto ARIMA
11. ARIMA
12. Final Model
13. Accuracy
14. Next Steps in Model Refinement


1. Project Objective
This project analyzes the Australian monthly gas production dataset "gas" from the "forecast" package.
Monthly gas production for Australia between 1956 and 1995 is available in the forecast library, already in time series format.
The objective is to read the data and carry out various analyses on it by reading, plotting, observing, and conducting the applicable tests.
Model building and a 12-month forecast are also expected in this project, using ARIMA and auto.arima models.
We must arrive at the best model for our prediction by comparing the performance measures of the models.
The dataset looks like:

Variable   | Description                              | Type
Year       | Year of the gas production               | Continuous
Month      | Month of the gas production (12 months)  | Categorical
Production | Production of gas                        | Continuous, numeric

2. Assumptions
A few assumptions are made:
• The sample size is adequate to perform the techniques applicable to a time series dataset.
• All the necessary packages are installed in R.
• The dataset file used for this project is available in the forecast package.

3. EDA (Exploratory Data Analysis)


3.1 Data Discovery
The data is obtained from the forecast library; the dataset is already in time series format.

Fig 1: Reading the dataset


3.2 Top and Bottom of Dataset

Fig 2: Head and tail of the dataset

3.3 Class of Dataset

Fig 3: Class of the dataset

3.4 Start and End of Series


Fig 4a: Start of the TS

Fig 4b: End of the TS

3.5 Cycle
The given time series is univariate: it has only one variable.

Fig 5: Cycle of the TS

The series starts in January 1956 (month 1) and ends in August 1995 (month 8).

The frequency of the data is 12, which implies that this is a monthly series.
The cycle output indicates that all monthly values are available from January 1956 to August 1995; the dataset does not have any missing values.

3.6 Structure

Fig 6: Structure of the TS

3.7 Summary

Fig 7: Summary of the TS

From the summary output, the gap between the median and the maximum value might be read as skewness in the data. However, for a time series such an interpretation should be made with caution, because a time series may carry two additional components, trend and seasonality, and this gap could simply be due to those components.
3.8 Outliers

Fig 8: Outliers of the TS

3.9 Missing values

Fig 9: Missing Values of the TS

INSIGHTS FROM EDA:
• The data is obtained from the 'forecast' package.
• Explorations such as class, start and end of the series were carried out.
• str() and summary() output was inspected.
• Missing values were checked and found to be zero.
• Some outliers are present, due to extreme values of production.
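The exploration steps above can be sketched in R. Since the gas series lives in the forecast package, this sketch uses the built-in AirPassengers monthly series as a stand-in so it runs with base R alone; with forecast installed, `gas` can be substituted directly.

```r
# EDA sketch on a monthly ts object (AirPassengers stands in for gas)
ts_data <- AirPassengers

head(ts_data); tail(ts_data)   # top and bottom of the series
class(ts_data)                 # "ts" - already a time series object
start(ts_data); end(ts_data)   # first and last period of the series
frequency(ts_data)             # 12 => monthly data
cycle(ts_data)                 # month index (1-12) of every observation
str(ts_data)                   # structure of the ts object
summary(ts_data)               # five-number summary plus mean
sum(is.na(ts_data))            # count of missing values
```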

4. Creating a Time Series from 1970

From the given time series, we create a new time series starting in January 1970, for better forecasting accuracy, using the window() function.

TS_data <- window(ts_data, start = c(1970, 1), frequency = 12)

Fig 10: TS from 1970
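The windowing step is runnable with base R alone; AirPassengers (1949-1960) stands in for the gas series here, truncated at 1955 rather than 1970.

```r
# Restrict a ts object to a later start date with window()
ts_data <- AirPassengers                        # monthly series, 1949-1960
TS_data <- window(ts_data, start = c(1955, 1))  # keep 1955 onwards
start(TS_data)      # now begins at 1955, month 1
frequency(TS_data)  # unchanged: still 12 observations per year
```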

Fig 10 shows two patterns:

• An overall positive trend. There is a clear increasing trend. The sudden drop at the start of each year needs to be investigated to find what causes this effect at the end of the calendar year.

• A zig-zagging seasonal pattern. There is also a strong seasonal pattern whose size increases as the level of the series increases. Any forecast of this series would need to capture the seasonal pattern, and the fact that the trend is changing slowly.

Now we plot the same series with a linear regression line:

Fig 11: TS from 1970


5. Plots
Quarterly plot:

Fig 12: Quarterly plot

The quarterly plot shows a clear indication of seasonality.


Yearly plot:

Fig 13: Yearly plot

Seasonal plot:

Fig 14: Seasonal plot


A seasonal plot allows the underlying seasonal pattern to be seen more clearly, and is especially useful for identifying years in which the pattern changes. In the initial years, seasonality was not prevalent. Over the years, however, seasonality became visible from May to October, with July showing the peak value across all years. The series has clear annual seasonality.

Month plot:
The horizontal lines indicate the means for each month. This form
of plot enables the underlying seasonal pattern to be seen clearly
and shows the changes in seasonality over time. It is especially
useful in identifying changes within seasons.

Fig 15: Month plot


Fig 16: Boxplot

The box plot also shows some seasonality, and it indicates that there are no outliers in the dataset.
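The horizontal lines in the month plot are per-month means, which can be computed directly with tapply(); AirPassengers again stands in for the gas series.

```r
# Per-month means - the horizontal reference lines in monthplot()
monthly_means <- tapply(AirPassengers, cycle(AirPassengers), mean)
names(monthly_means) <- month.abb
round(monthly_means, 1)                 # mean level of each calendar month
month.abb[which.max(monthly_means)]     # the month with the highest mean

monthplot(AirPassengers)                # the month plot itself
boxplot(AirPassengers ~ cycle(AirPassengers),
        names = month.abb)              # month-wise box plot
```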

6. Periodicity
A time series object is an ordered sequence of values (data points) of a variable at equally spaced time intervals; it lives in the time domain.
There are a couple of methods to detect the periodicity of a time series object.
The periodogram shows the "power" at each possible frequency, and we can clearly see spikes between 0 and 0.1: the power is high close to frequency 0, decays, and then spikes again near frequency 0.08.

Fig 17: Periodogram

Taking the frequency at 0.08, the period is 1/0.08 ≈ 12, i.e. the 12-month seasonal cycle.
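The periodogram reading can be reproduced with base R's spec.pgram(); AirPassengers stands in for the gas series, differenced first so the trend does not swamp the low frequencies (the series is passed as a plain numeric vector so frequencies are reported in cycles per observation, matching the 0-0.5 axis described above).

```r
# Raw periodogram; frequencies are in cycles per observation (0 to 0.5)
s <- spec.pgram(as.numeric(diff(AirPassengers)), plot = FALSE)

# Look at the low-frequency spikes (below 0.1), as in Fig 17
band <- s$freq < 0.1
peak <- s$freq[band][which.max(s$spec[band])]
peak          # close to 1/12 = 0.083 for annually seasonal monthly data
round(1 / peak)   # the implied period in months, typically ~12 here
```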

7. Decomposition of the Time Series

A time series decomposition is a procedure that transforms a time series into multiple component time series. The original series is typically decomposed into three sub-series:

Seasonal: patterns that repeat with a fixed period. Trend: the underlying trend of the metric. Random (also called "noise", "irregular" or "remainder"): the residual of the time series after the seasonal and trend series have been removed. Besides these three components there is also a cyclic component, which plays out over a long period of time.

Additive or multiplicative decomposition?

To get a successful decomposition, it is important to choose between the additive and the multiplicative model. To choose the right model we need to look at the time series:
• The additive model is useful when the seasonal variation is relatively constant over time.
• The multiplicative model is useful when the seasonal variation increases over time.

How to visually differentiate an additive and a multiplicative model: in an additive model, the amplitude of both the seasonal and irregular variations does not vary as the level of the trend rises or falls.

Decomposition is a tool with which we can separate the different components in a time series, so that the trend, seasonality, and random noise can be examined individually.

As the seasonality pattern does not increase with time in our series, the series is assumed to be additive.

Fig 18: Decomposition

The decomposition above clearly shows that gas production is trending upward, with clear annual seasonality.
STL is a very versatile and robust method for decomposing time series; STL is an acronym for "Seasonal and Trend decomposition using Loess". It performs an additive decomposition, and the four panels show the original data, the seasonal component, the trend component, and the remainder.

Fig 19: STL decomposition

If the focus is on figuring out whether the general trend of production is up, we deseasonalise, and can largely ignore the seasonal component. However, if you need to forecast production in the next month, then you need to take into account both the secular trend and the seasonality.

As the series is additive, the trend and random components are added together to deseasonalise the series (equivalently, the seasonal component is subtracted from the original). The deseasonalised and original datasets are then plotted together to study the trend.
Fig 20: Comparison of original and deseasonalised data

The plot above shows the original series in red and the de-seasoned production in blue; we can see an increasing trend in production.
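The additive identity and the deseasonalising step can be checked directly in base R. AirPassengers stands in for the gas series here (AirPassengers is arguably multiplicative, but the mechanics are identical).

```r
# Additive decomposition: observed = trend + seasonal + random
dec <- decompose(AirPassengers, type = "additive")

# Deseasonalise: drop the seasonal component
# (equivalently trend + random, up to the NA ends of the trend filter)
deseasoned <- AirPassengers - dec$seasonal

# Original (red) vs de-seasoned (blue), as in Fig 20
ts.plot(AirPassengers, deseasoned, col = c("red", "blue"))
```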

8. Checking for Stationarity

Statistical tests make strong assumptions about your data. They can only be used to inform the degree to which a null hypothesis can be rejected or not. The result must be interpreted for a given problem to be meaningful. Nevertheless, they provide a quick check and confirmatory evidence as to whether your time series is stationary or non-stationary.

Null hypothesis (H0): the time series has a unit root, meaning it is non-stationary; it has some time-dependent structure.
Alternative hypothesis (H1): the time series does not have a unit root, meaning it is stationary; it does not have time-dependent structure.

p-value > 0.05: fail to reject the null hypothesis (H0); the data has a unit root and is non-stationary.
p-value <= 0.05: reject the null hypothesis (H0); the data does not have a unit root and is stationary.

Fig 21: Stationarity test

The null hypothesis is rejected, hence the gas data is stationary.
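The screenshot uses an ADF-style test (tseries::adf.test in R). A base-R sketch of the same hypothesis test is possible with PP.test (Phillips-Perron), a related unit-root test that shares the null hypothesis above; AirPassengers stands in for gas.

```r
# Unit-root check with base R's Phillips-Perron test
# H0: the series has a unit root (non-stationary)
pp <- PP.test(as.numeric(AirPassengers))
pp$p.value
# p-value <= 0.05 would reject H0 (stationary);
# p-value >  0.05 means we fail to reject the unit-root null
```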

9. ACF and PACF (checking stationarity and autocorrelation)
The function acf() computes an estimate of the autocorrelation function of a (possibly multivariate) time series.
The function pacf() computes an estimate of the partial autocorrelation function of a (possibly multivariate) time series.

Fig 22: ACF

ACF plots display the correlation between the series and its lags. Most of the lags are significant, as their spikes extend beyond the two blue confidence lines. Looking for spikes at specific lags of the differenced series, the highest spike is at 0.5.

Fig 23: PACF

The partial autocorrelation plot also shows that many lags are significant. PACF plots are useful when determining the order of the AR(p) model.

As multiple lags are significant, it is not possible to tentatively identify the number of AR and/or MA terms that are needed.
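The ACF/PACF estimates behind the plots can be inspected numerically with plot = FALSE; AirPassengers again stands in for the gas series, and the "blue lines" are approximated by the usual ±2/√n band.

```r
# Numeric ACF and PACF values (the same estimates the plots draw)
a <- acf(AirPassengers,  lag.max = 24, plot = FALSE)
p <- pacf(AirPassengers, lag.max = 24, plot = FALSE)

a$acf[1]                                 # lag-0 autocorrelation is always 1
conf <- 2 / sqrt(length(AirPassengers))  # approximate 95% band ("blue lines")
sum(abs(a$acf[-1]) > conf)               # how many lags are significant
```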

10. Auto ARIMA
Exponential smoothing methods are useful for making forecasts, and make no assumptions about the correlations between successive values of the time series.

While exponential smoothing methods do not make any assumptions about correlations between successive values, in some cases you can build a better predictive model by taking those correlations into account.

Autoregressive Integrated Moving Average (ARIMA) models include an explicit statistical model for the irregular component of a time series, one that allows for non-zero autocorrelations in the irregular component. ARIMA models are defined for stationary time series.

As the series has seasonality, seasonality is set to TRUE in the auto.arima function.

Fig 24: Auto ARIMA

The Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are closely related, and each can be interpreted as an estimate of how much information would be lost if a given model were chosen. When comparing models, we want to minimize AIC and BIC.

ARIMA(2,1,1)(0,1,2)[12] is a seasonal ARIMA; [12] is the number of periods in a season, i.e. months in a year in this case.
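The report's auto.arima() comes from the forecast package; the selection it performs can be sketched with base R's arima(), which can fit the same ARIMA(2,1,1)(0,1,2)[12] specification and expose the AIC used for comparison. AirPassengers stands in for gas, so the chosen orders here are illustrative, not the report's fitted model.

```r
# Fit the seasonal specification reported by auto.arima, plus a simpler
# rival, and compare by AIC (smaller is better)
fit1 <- arima(AirPassengers, order = c(2, 1, 1),
              seasonal = list(order = c(0, 1, 2), period = 12))
fit2 <- arima(AirPassengers, order = c(0, 1, 1),
              seasonal = list(order = c(0, 1, 1), period = 12))
c(AIC(fit1), AIC(fit2))   # prefer the model with the smaller AIC
```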
11. ARIMA

Let's fit the auto.arima output using the Arima function.

Fig 25: ARIMA
Fig 26: ARIMA plot

Check whether the residuals are independent before using the model for forecasting.

Box-Ljung test

To check whether the residuals are independent:

H0: residuals are independent
Ha: residuals are not independent

Fig 27: Box-Ljung test

Conclusion: do not reject H0; the residuals are independent.
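The residual check can be sketched with base R's Box.test(); the model below is fitted on AirPassengers as a stand-in for gas, so its orders are illustrative.

```r
# Ljung-Box test on the residuals of a fitted seasonal ARIMA
fit <- arima(AirPassengers, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
bt <- Box.test(residuals(fit), lag = 12, type = "Ljung-Box",
               fitdf = 2)   # fitdf = number of estimated MA coefficients
bt$p.value
# p > 0.05: do not reject H0 - the residuals look independent,
# so the model may be used for forecasting
```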

Now that the model is valid, let's check the model performance on the train dataset.

Fig 28: MAPE

Let's forecast the holdout sample using the above model. The period is taken as 20 because we have 20 periods in the holdout sample.

Fig 29: Forecast for 20 periods and its MAPE

Fig 30: Actual vs forecast

From the plot above it can be seen that there is a difference between the actual values and the forecast, hence there is scope to further improve the model.
Some tips to improve the model are provided in section 14 of this document.

12. Final Model

In time series, model creation is a two-step process.
In the first step, the data is divided into a train and a holdout sample. A model is prepared using the train dataset and validated on the holdout sample.
Once the model is finalised, the final model is prepared on the complete dataset. The output of this model is used for a real forecast, i.e. a forecast for an unknown period.
We now forecast 12 months into the unknown period.

For this time series forecasting problem, we observed both trend and seasonality in the data. The trend is increasing, along with high variation in seasonality, and the data also has a few outliers.
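The two-step process above can be sketched end to end in base R: split, fit on the train set, forecast the holdout, and score with MAPE. AirPassengers stands in for gas (holding out the final 12 months rather than 20), and the model orders are illustrative.

```r
# Step 1: train/holdout split - hold out the final 12 months
train   <- window(AirPassengers, end   = c(1959, 12))
holdout <- window(AirPassengers, start = c(1960, 1))

# Fit on the train set only
fit <- arima(train, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

# Forecast the holdout period and score with MAPE
fc   <- predict(fit, n.ahead = 12)$pred
mape <- mean(abs(holdout - fc) / holdout) * 100
round(mape, 2)   # mean absolute percentage error on the holdout

# Step 2 (final model): refit on the complete dataset and forecast
# the genuinely unknown next 12 months
final_fit <- arima(AirPassengers, order = c(0, 1, 1),
                   seasonal = list(order = c(0, 1, 1), period = 12))
predict(final_fit, n.ahead = 12)$pred
```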
13. Accuracy
After comparing several ARIMA models with and without seasonality (manual and automatic), we compare some of the models fitted so far using a test set consisting of the last data of 1995, and then also apply them to the original gas data.
The models chosen manually and with auto.arima() are both in the top four models based on their AIC values. When AIC and MAPE are the criteria for choosing a model, ensure that the order of differencing is the same across the candidates. However, when comparing models using a test set, it does not matter how the forecasts were produced: the comparisons are always valid.
In the tables above, we compared seasonal = FALSE against seasonal = TRUE and found that the results are better with seasonal = TRUE for both the manual and the auto.arima models.
None of the models considered here passes all of the residual tests. In practice, we would normally use the best model we could find, even if it did not pass all the tests.
14. Next Steps in Model Refinement
Now that we have built a robust model, can it be refined further? The following steps are recommended:
• Log-transform the original dataset, as log-transformed data is expected to give better results.
• As is clearly visible in the final model, the ARIMA equation changes when the entire dataset is considered, which indicates that the later periods carry seasonality and trend that should be analysed further. Hence, rather than considering the dataset from 1970, a refined model should probably be built considering the period from 1980.
