Gas Production
BY:
Pranav Viswanathan
CONTENTS
TOPIC
1.Project Objective
2.Assumptions
3.EDA (Exploratory Data Analysis)
3.1 Data Discovery
3.2 Top and Bottom of Dataset
3.3 Class of Dataset
3.4 Start and End of Series
3.5 cycle
3.6 Structure
3.7 Summary
3.8 Outliers
3.9 Missing values
4.Creating Time series from 1970
5.Plots
6.Periodicity
7.Decomposition of time series
9.ACF and PACF (to check stationarity and autocorrelation)
10.Auto arima
11.Arima
12.Final Model
13.Accuracy
2.Assumptions
A few assumptions are made:
• The sample size is adequate for the techniques applicable
to a time series dataset.
• All the necessary packages are installed in R.
3.5 cycle
The given time series is univariate, i.e. it has only one variable.
The series starts in January 1956 (1st month) and ends in
August 1995 (8th month).
3.6 Structure
3.7 Summary
INSIGHTS OF EDA:
The data is obtained from the ‘forecast’ package.
Basic exploration of the class, start and end of the
series was carried out.
The output of str() and summary() was reviewed.
Missing values were checked and none were found.
Some outliers are present due to extreme values of
the production.
Seasonal plot:
Month plot:
The horizontal lines indicate the means for each month. This form
of plot enables the underlying seasonal pattern to be seen clearly
and shows the changes in seasonality over time. It is especially
useful in identifying changes within seasons.
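The horizontal lines in a month plot are per-month averages. A minimal sketch of that calculation follows, written in Python rather than the R used for the report; the function name `monthly_means` and the toy series are assumptions made for illustration and are not the gas data.

```python
from statistics import mean

def monthly_means(values, start_month=1):
    """Average the observations that fall in each calendar month (1..12)."""
    buckets = {m: [] for m in range(1, 13)}
    for i, v in enumerate(values):
        month = (start_month - 1 + i) % 12 + 1  # calendar month of observation i
        buckets[month].append(v)
    return {m: mean(vs) for m, vs in buckets.items() if vs}

# Hypothetical two-year monthly series: rising trend plus a mid-year bump
toy = [100 + i + (20 if i % 12 in (5, 6, 7) else 0) for i in range(24)]
means = monthly_means(toy)
```

Months whose mean sits well above the others (here June to August in the toy series) correspond to the seasonal peaks visible in the plot.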
The box plot also shows some seasonality and indicates that there are
no outliers in the data set.
6.Periodicity
A time series object is an ordered sequence of values (data
points) of a variable at equally spaced time intervals. It is in the
time domain.
There are a couple of methods to detect the periodicity of a time
series object.
The periodogram shows the “power” of each possible frequency.
We can clearly see spikes between 0 and 0.1: the power is high at
frequencies close to 0, then decreases, and then spikes again at a
frequency of about 0.07.
Fig 17: periodogram
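To make the idea of "power at each frequency" concrete, here is a small Python sketch of a raw periodogram computed with a direct discrete Fourier transform. It is an illustration only and not the R routine that produced Fig 17. The toy sine series is hypothetical, and the mean is removed first (an assumption of this sketch), so the trend-driven spike near frequency 0 described above would look different here.

```python
import cmath
import math

def periodogram(x):
    """Return (frequency, power) pairs for frequencies k/n, k = 1..n//2."""
    n = len(x)
    mu = sum(x) / n
    centred = [v - mu for v in x]  # remove the mean before transforming
    out = []
    for k in range(1, n // 2 + 1):
        coef = sum(c * cmath.exp(-2j * math.pi * k * t / n)
                   for t, c in enumerate(centred))
        out.append((k / n, abs(coef) ** 2 / n))
    return out

# Hypothetical series with a 12-step cycle (not the gas data)
toy = [math.sin(2 * math.pi * t / 12) for t in range(120)]
peak_freq, peak_power = max(periodogram(toy), key=lambda fp: fp[1])
period = 1 / peak_freq  # dominant cycle length in observations
```

Frequencies here are in cycles per observation, so a peak at frequency f corresponds to a cycle of about 1/f observations.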
p-value > 0.05: Retain the null hypothesis (H0), the data has a
unit root and is non-stationary.
p-value <= 0.05: Reject the null hypothesis (H0), the data does
not have a unit root and is stationary.
ACF plots display the correlation between the series and its lags.
Most of the lines are significant, as they extend beyond the two blue
lines; the 2nd line is significant, followed by a couple more. Look for
spikes at specific lag points of the differenced series; the highest
spike is at 0.5.
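The check described above, whether a lag extends beyond the blue lines, can be sketched as follows. This is a Python illustration, not the report's R code. It assumes the usual approximate 95% bounds of ±1.96/sqrt(n); the helper names and the toy series are hypothetical.

```python
import math

def acf(x, max_lag):
    """Sample autocorrelations r_1..r_max_lag."""
    n = len(x)
    mu = sum(x) / n
    dev = [v - mu for v in x]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(1, max_lag + 1)]

def significant_lags(x, max_lag):
    """Lags whose autocorrelation falls outside +/- 1.96 / sqrt(n)."""
    bound = 1.96 / math.sqrt(len(x))
    return [k for k, r in enumerate(acf(x, max_lag), start=1) if abs(r) > bound]

# Hypothetical series: linear trend plus a 12-step seasonal wave
toy = [t + 10 * math.sin(2 * math.pi * t / 12) for t in range(60)]
lags = significant_lags(toy, 12)
```

A trending series such as the toy one typically shows many significant lags, which is the pattern the ACF discussion above associates with non-stationary data.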
Fig 23:PACF
The partial autocorrelation plot also shows that all lags are significant.
PACF plots are useful when determining the order of the AR(p)
model.
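As a sketch of what a PACF plot displays, the partial autocorrelations can be derived from the sample autocorrelations with the Durbin-Levinson recursion. The Python below is an illustration under assumptions: the report's figures come from R, and the simulated AR(1) series (coefficient 0.8, fixed random seed) is hypothetical.

```python
import random

def acf(x, max_lag):
    """Sample autocorrelations r_1..r_max_lag."""
    n = len(x)
    mu = sum(x) / n
    dev = [v - mu for v in x]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(1, max_lag + 1)]

def pacf(x, max_lag):
    """Partial autocorrelations at lags 1..max_lag (Durbin-Levinson)."""
    r = [1.0] + acf(x, max_lag)  # r[k] is the autocorrelation at lag k
    prev, out = [], []           # prev holds phi_{k-1,1} .. phi_{k-1,k-1}
    for k in range(1, max_lag + 1):
        num = r[k] - sum(prev[j - 1] * r[k - j] for j in range(1, k))
        den = 1.0 - sum(prev[j - 1] * r[j] for j in range(1, k))
        phi_kk = num / den
        prev = [prev[j - 1] - phi_kk * prev[k - j - 1]
                for j in range(1, k)] + [phi_kk]
        out.append(phi_kk)
    return out

# Hypothetical AR(1) series, x_t = 0.8 * x_{t-1} + noise (not the gas data)
rng = random.Random(42)
toy = [0.0]
for _ in range(299):
    toy.append(0.8 * toy[-1] + rng.gauss(0, 1))
partial = pacf(toy, 5)
```

For an AR(p) process the partial autocorrelations are expected to become small after lag p, which is why the plot helps in choosing p.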
10.Auto arima
Exponential smoothing methods are useful for making forecasts,
and make no assumptions about the correlations between
successive values of the time series.
Fig 25:Arima
Fig 26:Arima plot
Check if the residuals are independent before using the model for
forecasting.
Box-Ljung Test
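The statistic behind the Box-Ljung test can be sketched in Python as follows. This is an illustration, not the R call used in the report. It uses a closed-form chi-square tail that holds only for an even number of degrees of freedom, and it takes the degrees of freedom equal to the number of lags without subtracting fitted parameters; both are simplifying assumptions. The residuals are hypothetical.

```python
import math

def ljung_box_q(residuals, lags):
    """Q = n (n + 2) * sum_{k=1..lags} r_k^2 / (n - k)."""
    n = len(residuals)
    mu = sum(residuals) / n
    dev = [v - mu for v in residuals]
    denom = sum(d * d for d in dev)
    total = 0.0
    for k in range(1, lags + 1):
        r_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        total += r_k ** 2 / (n - k)
    return n * (n + 2) * total

def chi2_sf_even(x, df):
    """P(X > x) for a chi-square variable with even df (closed form)."""
    if df <= 0 or df % 2:
        raise ValueError("this sketch supports positive even df only")
    half = x / 2
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

# Hypothetical residuals that alternate in sign, i.e. clearly not independent
resid = [1.0, -1.0] * 50
q = ljung_box_q(resid, lags=2)
p_value = chi2_sf_even(q, df=2)
```

A small p-value, as for these toy residuals, suggests the residuals are autocorrelated; a large one gives no evidence against independence.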
Now that the model is valid, let’s check its performance on the
training dataset.
Fig 28:Mapea
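For reference, the mean absolute percentage error (MAPE) can be computed as in the Python sketch below. The function name and the numbers are hypothetical; they are not results from the gas model.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    errors = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100 * sum(errors) / len(errors)

# Hypothetical actual values and forecasts
score = mape([100, 200], [110, 180])
```

Each forecast in the toy example is off by 10% of the actual value, so the score is about 10.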
Let’s forecast the holdout sample using the above model. The
forecast horizon is set to 20 because the holdout sample contains
20 periods.
From the above plot it can be seen that the forecast differs from the actual
values, hence there is scope to further improve the model.
Some tips to improve the model are provided in section 14 of this document.
12.Final Model
For this time series forecasting problem, we observed both trend and
seasonality in the data. The trend is increasing, with high variation in
the seasonality. The data also had a few outliers.
13.Accuracy
Several ARIMA models, with and without seasonality (manual and auto),
were compared. We compare some of the models fitted so far using a test
set consisting of the final observations, ending in 1995, and then also
apply them to the original gas data.
The models chosen manually and with auto.arima() are both in the top
four models based on their AIC values. When the AIC value and MAPE are
the criteria for choosing a model, ensure that the order of differencing
is the same across the models being compared. However, when comparing
models using a test set, it does not matter how the forecasts were
produced; the comparisons are always valid.
In the above tables, we can find various models. We compared seasonal =
FALSE and seasonal = TRUE and found that the results are better with
seasonal = TRUE for both the manual and auto.arima() models.
None of the models considered here pass all of the residual tests. In
practice, we would normally use the best model we could find, even if it
did not pass all the tests.
14.Next step in Model Refining
We have now built a robust model. Can this model be refined
further?
The following steps are recommended to refine the model further:
• Apply a log transformation to the original dataset, as
log-transformed data is expected to give better results.
• As is clearly visible in the final model, the ARIMA model
equation is different when the entire dataset is considered,
which indicates that the later periods have seasonality and
trend that should be analysed further. Hence, rather than
considering the dataset from 1970, a refined model should
probably be built considering the period from 1980.
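The log transformation recommended above can be sketched as follows. This Python illustration uses a hypothetical series whose swings grow in proportion to its level; it is not the gas data, and whether the transformation actually improves the gas model would need to be checked.

```python
import math

def log_transform(series):
    """Natural log of each value; requires strictly positive data."""
    if any(v <= 0 for v in series):
        raise ValueError("log transform needs strictly positive values")
    return [math.log(v) for v in series]

def back_transform(log_values):
    """Return values on the log scale to the original scale."""
    return [math.exp(v) for v in log_values]

# Hypothetical series: swings of +/-20% around a level that rises tenfold
toy = [80.0, 120.0, 800.0, 1200.0]
logged = log_transform(toy)

raw_swings = (toy[1] - toy[0], toy[3] - toy[2])              # grows with the level
log_swings = (logged[1] - logged[0], logged[3] - logged[2])  # stays constant
restored = back_transform(logged)
```

On the log scale the swing has the same size at both levels, which is why the transformation is often tried when seasonal variation grows with the trend. Forecasts made on the log scale need to be back-transformed before they are compared with the actual values.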