
Time Series Forecasting Assignment
Australian Monthly Gas Production

By: Pranav Viswanathan
CONTENTS

1. Project Objective
2. Assumptions
3. EDA (Exploratory Data Analysis)
   3.1 Data Discovery
   3.2 Top and Bottom of Dataset
   3.3 Class of Dataset
   3.4 Start and End of Series
   3.5 Cycle
   3.6 Structure
   3.7 Summary
   3.8 Outliers
   3.9 Missing Values
4. Creating a Time Series from 1970
5. Plots
6. Periodicity
7. Decomposition of the Time Series
8. Checking for Stationarity
9. ACF and PACF (checking stationarity and autocorrelation)
10. Auto ARIMA
11. ARIMA
12. Final Model
13. Accuracy
14. Next Steps in Model Refinement


1. Project Objective
This project analyzes the Australian monthly gas production dataset "gas" from the "forecast" package.
Monthly gas production for Australia between 1956 and 1995 is available in the forecast library, already in time series format.
The objective is to read the data and carry out various analyses on it by reading, plotting, observing, and conducting the applicable tests.
Model building and a 12-month forecast are also expected in this project, using ARIMA and auto.arima models.
We must arrive at the best model for our prediction by comparing the performance measures of the models.
The dataset looks like:

Variable   | Description                              | Type
Year       | Year of the gas production               | Continuous
Month      | Month of the gas production (12 months)  | Categorical
Production | Production of gas                        | Continuous, numeric

2. Assumptions
A few assumptions are made:
• The sample size is adequate to perform the techniques applicable to a time series dataset.
• All the necessary packages are installed in R.
• The dataset file used for this project is available in the forecast package.

3. EDA (Exploratory Data Analysis)


3.1 Data Discovery
The data is obtained from the forecast library; the dataset is already in time series format.

Fig 1: Reading the dataset


3.2 Top and Bottom of Dataset

Fig 2: Head and tail of the dataset

3.3 Class of Dataset

Fig 3: Class of the dataset

3.4 Start and End of Series


Fig 4a: Start of the TS

Fig 4b: End of the TS

3.5 Cycle
The given time series is univariate: it has only one variable.

Fig 5: Cycle of the TS

The series starts in January 1956 (month 1) and ends in August 1995 (month 8).

The frequency of the data is 12, which implies that this is a monthly series.
The cycle output indicates that all monthly values are available from January 1956 to August 1995; the dataset does not have any missing values.

3.6 Structure

Fig 6: Structure of the TS

3.7 Summary

Fig 7: Summary of the TS

From the summary output, the gap between the median and the maximum value might be read as skewness in the data. However, for a time series such an interpretation should be made with caution, because a time series may carry two additional components, trend and seasonality, and this gap could simply be due to those components.
3.8 Outliers

Fig 8: Outliers of the TS

3.9 Missing values

Fig 9: Missing Values of the TS

INSIGHTS FROM EDA:
• The data is obtained from the 'forecast' package.
• Explorations such as class, start and end of the series were carried out.
• str() and summary() output was inspected.
• Missing values were checked and found to be zero.
• Some outliers are present, due to extreme values of production.
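The exploration steps above can be sketched in R. Since the gas series lives in the forecast package, this sketch uses the built-in AirPassengers monthly series as a stand-in so it runs with base R alone; with forecast installed, `gas` can be substituted directly.

```r
# EDA sketch on a monthly ts object (AirPassengers stands in for gas)
ts_data <- AirPassengers

head(ts_data); tail(ts_data)   # top and bottom of the series
class(ts_data)                 # "ts" - already a time series object
start(ts_data); end(ts_data)   # first and last period of the series
frequency(ts_data)             # 12 => monthly data
cycle(ts_data)                 # month index (1-12) of every observation
str(ts_data)                   # structure of the ts object
summary(ts_data)               # five-number summary plus mean
sum(is.na(ts_data))            # count of missing values
```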

4. Creating a Time Series from 1970

From the given time series, we create a new time series starting in January 1970, for better forecasting accuracy, using the window() function.

TS_data <- window(ts_data, start = c(1970, 1), frequency = 12)

Fig 10: TS from 1970
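The windowing step is runnable with base R alone; AirPassengers (1949-1960) stands in for the gas series here, truncated at 1955 rather than 1970.

```r
# Restrict a ts object to a later start date with window()
ts_data <- AirPassengers                        # monthly series, 1949-1960
TS_data <- window(ts_data, start = c(1955, 1))  # keep 1955 onwards
start(TS_data)      # now begins at 1955, month 1
frequency(TS_data)  # unchanged: still 12 observations per year
```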

Fig 10 shows two patterns:

• An overall positive trend. There is a clear increasing trend. The sudden drop at the start of each year needs to be investigated to find what causes this effect at the end of the calendar year.

• A zig-zagging seasonal pattern. There is also a strong seasonal pattern whose size increases as the level of the series increases. Any forecast of this series would need to capture the seasonal pattern, and the fact that the trend is changing slowly.

Now we plot the same series with a linear regression line:

Fig 11: TS from 1970


5. Plots
Quarterly plot:

Fig 12: Quarterly plot

The quarterly plot shows a clear indication of seasonality.


Yearly plot:

Fig 13: Yearly plot

Seasonal plot:

Fig 14: Seasonal plot


A seasonal plot allows the underlying seasonal pattern to be seen more clearly, and is especially useful for identifying years in which the pattern changes. In the initial years, seasonality was not prevalent. Over the years, however, seasonality became visible from May to October, with July showing the peak value across all years. The series has clear annual seasonality.

Month plot:
The horizontal lines indicate the means for each month. This form
of plot enables the underlying seasonal pattern to be seen clearly
and shows the changes in seasonality over time. It is especially
useful in identifying changes within seasons.

Fig 15: Month plot


Fig 16: Boxplot

The box plot also shows some seasonality, and it indicates that there are no outliers in the dataset.
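The horizontal lines in the month plot are per-month means, which can be computed directly with tapply(); AirPassengers again stands in for the gas series.

```r
# Per-month means - the horizontal reference lines in monthplot()
monthly_means <- tapply(AirPassengers, cycle(AirPassengers), mean)
names(monthly_means) <- month.abb
round(monthly_means, 1)                 # mean level of each calendar month
month.abb[which.max(monthly_means)]     # the month with the highest mean

monthplot(AirPassengers)                # the month plot itself
boxplot(AirPassengers ~ cycle(AirPassengers),
        names = month.abb)              # month-wise box plot
```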

6. Periodicity
A time series object is an ordered sequence of values (data points) of a variable at equally spaced time intervals; it lives in the time domain.
There are a couple of methods to detect the periodicity of a time series object.
The periodogram shows the "power" at each possible frequency, and we can clearly see spikes between 0 and 0.1: the power is high close to frequency 0, decays, and then spikes again near frequency 0.08.

Fig 17: Periodogram

Taking the frequency at 0.08, the period is 1/0.08 ≈ 12, i.e. the 12-month seasonal cycle.
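The periodogram reading can be reproduced with base R's spec.pgram(); AirPassengers stands in for the gas series, differenced first so the trend does not swamp the low frequencies (the series is passed as a plain numeric vector so frequencies are reported in cycles per observation, matching the 0-0.5 axis described above).

```r
# Raw periodogram; frequencies are in cycles per observation (0 to 0.5)
s <- spec.pgram(as.numeric(diff(AirPassengers)), plot = FALSE)

# Look at the low-frequency spikes (below 0.1), as in Fig 17
band <- s$freq < 0.1
peak <- s$freq[band][which.max(s$spec[band])]
peak          # close to 1/12 = 0.083 for annually seasonal monthly data
round(1 / peak)   # the implied period in months, typically ~12 here
```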

7. Decomposition of the Time Series

A time series decomposition is a procedure that transforms a time series into multiple component time series. The original series is typically decomposed into three sub-series:

Seasonal: patterns that repeat with a fixed period. Trend: the underlying trend of the metric. Random (also called "noise", "irregular" or "remainder"): the residual of the time series after the seasonal and trend series have been removed. Besides these three components there is also a cyclic component, which plays out over a long period of time.

Additive or multiplicative decomposition?

To get a successful decomposition, it is important to choose between the additive and the multiplicative model. To choose the right model we need to look at the time series:
• The additive model is useful when the seasonal variation is relatively constant over time.
• The multiplicative model is useful when the seasonal variation increases over time.

How to visually differentiate an additive and a multiplicative model: in an additive model, the amplitude of both the seasonal and irregular variations does not vary as the level of the trend rises or falls.

Decomposition is a tool with which we can separate the different components in a time series, so that the trend, seasonality, and random noise can be examined individually.

As the seasonality pattern does not increase with time in our series, the series is assumed to be additive.

Fig 18: Decomposition

The decomposition above clearly shows that gas production is trending upward, with clear annual seasonality.
STL is a very versatile and robust method for decomposing time series; STL is an acronym for "Seasonal and Trend decomposition using Loess". It performs an additive decomposition, and the four panels show the original data, the seasonal component, the trend component, and the remainder.

Fig 19: STL decomposition

If the focus is on figuring out whether the general trend of production is up, we deseasonalise, and can largely ignore the seasonal component. However, if you need to forecast production in the next month, then you need to take into account both the secular trend and the seasonality.

As the series is additive, the trend and random components are added together to deseasonalise the series (equivalently, the seasonal component is subtracted from the original). The deseasonalised and original datasets are then plotted together to study the trend.
Fig 20: Comparison of original and deseasonalised data

The plot above shows the original series in red and the de-seasoned production in blue; we can see an increasing trend in production.
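The additive identity and the deseasonalising step can be checked directly in base R. AirPassengers stands in for the gas series here (AirPassengers is arguably multiplicative, but the mechanics are identical).

```r
# Additive decomposition: observed = trend + seasonal + random
dec <- decompose(AirPassengers, type = "additive")

# Deseasonalise: drop the seasonal component
# (equivalently trend + random, up to the NA ends of the trend filter)
deseasoned <- AirPassengers - dec$seasonal

# Original (red) vs de-seasoned (blue), as in Fig 20
ts.plot(AirPassengers, deseasoned, col = c("red", "blue"))
```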

8. Checking for Stationarity

Statistical tests make strong assumptions about your data. They can only be used to inform the degree to which a null hypothesis can be rejected or not. The result must be interpreted for a given problem to be meaningful. Nevertheless, they provide a quick check and confirmatory evidence as to whether your time series is stationary or non-stationary.

Null hypothesis (H0): the time series has a unit root, meaning it is non-stationary; it has some time-dependent structure.
Alternative hypothesis (H1): the time series does not have a unit root, meaning it is stationary; it does not have time-dependent structure.

p-value > 0.05: fail to reject the null hypothesis (H0); the data has a unit root and is non-stationary.
p-value <= 0.05: reject the null hypothesis (H0); the data does not have a unit root and is stationary.

Fig 21: Stationarity test

The null hypothesis is rejected, hence the gas data is stationary.
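The screenshot uses an ADF-style test (tseries::adf.test in R). A base-R sketch of the same hypothesis test is possible with PP.test (Phillips-Perron), a related unit-root test that shares the null hypothesis above; AirPassengers stands in for gas.

```r
# Unit-root check with base R's Phillips-Perron test
# H0: the series has a unit root (non-stationary)
pp <- PP.test(as.numeric(AirPassengers))
pp$p.value
# p-value <= 0.05 would reject H0 (stationary);
# p-value >  0.05 means we fail to reject the unit-root null
```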

9. ACF and PACF (checking stationarity and autocorrelation)
The function acf() computes an estimate of the autocorrelation function of a (possibly multivariate) time series.
The function pacf() computes an estimate of the partial autocorrelation function of a (possibly multivariate) time series.

Fig 22: ACF

ACF plots display the correlation between the series and its lags. Most of the lags are significant, as their spikes extend beyond the two blue confidence lines. Looking for spikes at specific lags of the differenced series, the highest spike is at 0.5.

Fig 23: PACF

The partial autocorrelation plot also shows that many lags are significant. PACF plots are useful when determining the order of the AR(p) model.

As multiple lags are significant, it is not possible to tentatively identify the number of AR and/or MA terms that are needed.
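The ACF/PACF estimates behind the plots can be inspected numerically with plot = FALSE; AirPassengers again stands in for the gas series, and the "blue lines" are approximated by the usual ±2/√n band.

```r
# Numeric ACF and PACF values (the same estimates the plots draw)
a <- acf(AirPassengers,  lag.max = 24, plot = FALSE)
p <- pacf(AirPassengers, lag.max = 24, plot = FALSE)

a$acf[1]                                 # lag-0 autocorrelation is always 1
conf <- 2 / sqrt(length(AirPassengers))  # approximate 95% band ("blue lines")
sum(abs(a$acf[-1]) > conf)               # how many lags are significant
```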

10. Auto ARIMA
Exponential smoothing methods are useful for making forecasts, and make no assumptions about the correlations between successive values of the time series.

While exponential smoothing methods do not make any assumptions about correlations between successive values, in some cases you can build a better predictive model by taking those correlations into account.

Autoregressive Integrated Moving Average (ARIMA) models include an explicit statistical model for the irregular component of a time series, one that allows for non-zero autocorrelations in the irregular component. ARIMA models are defined for stationary time series.

As the series has seasonality, seasonality is set to TRUE in the auto.arima function.

Fig 24: Auto ARIMA

The Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are closely related, and each can be interpreted as an estimate of how much information would be lost if a given model were chosen. When comparing models, we want to minimize AIC and BIC.

ARIMA(2,1,1)(0,1,2)[12] is a seasonal ARIMA; [12] is the number of periods in a season, i.e. months in a year in this case.
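The report's auto.arima() comes from the forecast package; the selection it performs can be sketched with base R's arima(), which can fit the same ARIMA(2,1,1)(0,1,2)[12] specification and expose the AIC used for comparison. AirPassengers stands in for gas, so the chosen orders here are illustrative, not the report's fitted model.

```r
# Fit the seasonal specification reported by auto.arima, plus a simpler
# rival, and compare by AIC (smaller is better)
fit1 <- arima(AirPassengers, order = c(2, 1, 1),
              seasonal = list(order = c(0, 1, 2), period = 12))
fit2 <- arima(AirPassengers, order = c(0, 1, 1),
              seasonal = list(order = c(0, 1, 1), period = 12))
c(AIC(fit1), AIC(fit2))   # prefer the model with the smaller AIC
```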
11. ARIMA

Let's fit the auto.arima output using the Arima function.

Fig 25: ARIMA
Fig 26: ARIMA plot

Check whether the residuals are independent before using the model for forecasting.

Box-Ljung test

To check whether the residuals are independent:

H0: residuals are independent
Ha: residuals are not independent

Fig 27: Box-Ljung test

Conclusion: do not reject H0; the residuals are independent.
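The residual check can be sketched with base R's Box.test(); the model below is fitted on AirPassengers as a stand-in for gas, so its orders are illustrative.

```r
# Ljung-Box test on the residuals of a fitted seasonal ARIMA
fit <- arima(AirPassengers, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
bt <- Box.test(residuals(fit), lag = 12, type = "Ljung-Box",
               fitdf = 2)   # fitdf = number of estimated MA coefficients
bt$p.value
# p > 0.05: do not reject H0 - the residuals look independent,
# so the model may be used for forecasting
```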

Now that the model is valid, let's check the model performance on the train dataset.

Fig 28: MAPE

Let's forecast the holdout sample using the above model. The period is taken as 20 because we have 20 periods in the holdout sample.

Fig 29: Forecast for 20 periods and its MAPE

Fig 30: Actual vs forecast

From the plot above it can be seen that there is a difference between the actual values and the forecast, hence there is scope to further improve the model.
Some tips to improve the model are provided in section 14 of this document.

12. Final Model

In time series, model creation is a two-step process.
In the first step, the data is divided into a train and a holdout sample. A model is prepared using the train dataset and validated on the holdout sample.
Once the model is finalised, the final model is prepared on the complete dataset. The output of this model is used for a real forecast, i.e. a forecast for an unknown period.
We now forecast 12 months into the unknown period.

For this time series forecasting problem, we observed both trend and seasonality in the data. The trend is increasing, along with high variation in seasonality, and the data also has a few outliers.
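The two-step process above can be sketched end to end in base R: split, fit on the train set, forecast the holdout, and score with MAPE. AirPassengers stands in for gas (holding out the final 12 months rather than 20), and the model orders are illustrative.

```r
# Step 1: train/holdout split - hold out the final 12 months
train   <- window(AirPassengers, end   = c(1959, 12))
holdout <- window(AirPassengers, start = c(1960, 1))

# Fit on the train set only
fit <- arima(train, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

# Forecast the holdout period and score with MAPE
fc   <- predict(fit, n.ahead = 12)$pred
mape <- mean(abs(holdout - fc) / holdout) * 100
round(mape, 2)   # mean absolute percentage error on the holdout

# Step 2 (final model): refit on the complete dataset and forecast
# the genuinely unknown next 12 months
final_fit <- arima(AirPassengers, order = c(0, 1, 1),
                   seasonal = list(order = c(0, 1, 1), period = 12))
predict(final_fit, n.ahead = 12)$pred
```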
13. Accuracy
After comparing several ARIMA models with and without seasonality (manual and automatic), we compare some of the models fitted so far using a test set consisting of the last data of 1995, and then also apply them to the original gas data.
The models chosen manually and with auto.arima() are both in the top four models based on their AIC values. When AIC and MAPE are the criteria for choosing a model, ensure that the order of differencing is the same across the candidates. However, when comparing models using a test set, it does not matter how the forecasts were produced: the comparisons are always valid.
In the tables above, we compared seasonal = FALSE against seasonal = TRUE and found that the results are better with seasonal = TRUE for both the manual and the auto.arima models.
None of the models considered here passes all of the residual tests. In practice, we would normally use the best model we could find, even if it did not pass all the tests.
14. Next Steps in Model Refinement
Now that we have built a robust model, can it be refined further? The following steps are recommended:
• Log-transform the original dataset, as log-transformed data is expected to give better results.
• As is clearly visible in the final model, the ARIMA equation changes when the entire dataset is considered, which indicates that the later periods carry seasonality and trend that should be analysed further. Hence, rather than considering the dataset from 1970, a refined model should probably be built considering the period from 1980.
