TimeSeries Report
For this assignment, sales data for two types of wine sold in the 20th century are to
be analysed. Both series come from the same company but cover different wines. As an
analyst at ABC Estate Wines, you are tasked with analysing and forecasting Wine Sales
in the 20th century.
Solution:
Reading the data: Sparkling
Top 5
Sparkling
YearMonth
1980-01-01 1686
1980-02-01 1591
1980-03-01 2304
1980-04-01 1712
1980-05-01 1471
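The loading step can be sketched as follows. The report does not show the actual file name or path, so an inline CSV with the first few rows stands in for it here; the key points are `parse_dates` and setting `YearMonth` as the index so the data gets a DatetimeIndex.

```python
import io
import pandas as pd

# Hypothetical inline CSV standing in for the real data file (path not shown
# in the report); only the first five rows are reproduced.
csv = io.StringIO(
    "YearMonth,Sparkling\n"
    "1980-01,1686\n1980-02,1591\n1980-03,2304\n1980-04,1712\n1980-05,1471\n"
)
# parse_dates converts the YearMonth strings to timestamps; index_col makes
# the result a time-indexed frame, as in the report's df.info() output.
df = pd.read_csv(csv, parse_dates=["YearMonth"], index_col="YearMonth")
print(df.head())
```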
Basic Info
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 187 entries, 1980-01-01 to 1995-07-01
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sparkling 187 non-null int64
dtypes: int64(1)
memory usage: 2.9 KB
Plotting Time series
Sparkling
count 187.0
mean 2402.0
std 1295.0
min 1070.0
25% 1605.0
50% 1874.0
75% 2549.0
max 7242.0
Boxplots
Monthly Plot
There is a clear distinction in 'Sparkling' sales across the different months of the year. The
highest values are consistently recorded from October through December across the
years.
Grid Plot
Red represents the Train data and blue represents the Test data.
4) Building various models and comparing Accuracy Metrics
Model 1: Linear Regression
Metrics
Test RMSE
RegressionOnTime 1374.550202
NaiveModel 1496.444629
SimpleAverageModel 1368.746717
For the moving average model, we are going to calculate rolling means (moving averages) over
different window sizes. The best window can be determined by the maximum accuracy (that is, the
minimum error) on the test set. For the moving average, we compute the rolling mean over the entire data.
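The trailing-moving-average step can be sketched as below; the series values here are illustrative toy numbers, not the report's actual data.

```python
import pandas as pd

# Toy stand-in for the sales series (illustrative values only).
sales = pd.Series([1686.0, 1591, 2304, 1712, 1471, 2963, 2012, 1897])

for k in (2, 4, 6):
    # Trailing moving average over a window of k observations.
    trailing = sales.rolling(window=k).mean()
    # The first k-1 entries are NaN because the window is not yet full;
    # the final rolling mean is what gets carried forward as a forecast.
    print(k, trailing.iloc[-1])
```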
Test RMSE
RegressionOnTime 1374.550202
NaiveModel 1496.444629
SimpleAverageModel 1368.746717
2pointTrailingMovingAverage 811.178937
4pointTrailingMovingAverage 1184.213295
6pointTrailingMovingAverage 1337.200524
9pointTrailingMovingAverage 1422.653281
Before we go on to build the various Exponential Smoothing models, let us plot all the models
and compare the Time Series plots
Test RMSE
RegressionOnTime 1374.550202
NaiveModel 1496.444629
SimpleAverageModel 1368.746717
2pointTrailingMovingAverage 811.178937
4pointTrailingMovingAverage 1184.213295
6pointTrailingMovingAverage 1337.200524
9pointTrailingMovingAverage 1422.653281
Alpha=0.995,SimpleExponentialSmoothing 1362.348469
The higher the alpha value, the more weight is given to the recent observations; that is,
the model assumes what happened recently is likely to happen again.
We will run a loop with different alpha values to understand which value works best for alpha on
the test set.
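The alpha loop can be sketched as below. The report fits statsmodels' `SimpleExponentialSmoothing`; this pandas-only sketch uses `ewm(alpha=..., adjust=False)`, which applies the same exponential-smoothing recursion, on toy values rather than the actual series.

```python
import numpy as np
import pandas as pd

# Toy series (illustrative only); the report's loop compares RMSE on the
# held-out test set instead of these in-sample one-step-ahead forecasts.
train = pd.Series([100.0, 120, 90, 110, 130, 95, 105, 125])

best_alpha, best_rmse = None, np.inf
for alpha in np.arange(0.1, 1.0, 0.1):
    # adjust=False gives the simple exponential smoothing recursion:
    # s_t = alpha*x_t + (1-alpha)*s_{t-1}
    smoothed = train.ewm(alpha=alpha, adjust=False).mean()
    forecast = smoothed.shift(1)                  # one-step-ahead forecast
    rmse = np.sqrt(((train - forecast) ** 2).mean())
    if rmse < best_rmse:
        best_alpha, best_rmse = round(float(alpha), 1), rmse
print(best_alpha, round(best_rmse, 2))
```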
Check for stationarity of the whole Time Series data.
We see that at the 5% significance level the Time Series is non-stationary: the p-value is high,
so we are unable to reject the null hypothesis of non-stationarity.
Let us take a difference of order 1 and check whether the Time Series is stationary or not.
We see that at 𝛼 = 0.05 the Time Series is indeed stationary: the p-value is lower than 0.05,
so we can reject the null hypothesis that the series is non-stationary.
Differencing of order 1 therefore makes the time series stationary.
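The report's p-values come from the ADF test (statsmodels' `adfuller`); the differencing step itself can be sketched with pandas alone. A contrived trending series makes the effect obvious:

```python
import pandas as pd

# A series with a linear trend is non-stationary in the mean; taking a
# first difference removes that trend (the report then re-runs the ADF test
# on the differenced series).
s = pd.Series(range(12), dtype=float) * 3.0 + 5.0   # toy trending series
diff1 = s.diff().dropna()                           # difference of order 1
print(diff1.unique())   # the differenced series is a constant 3.0
```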
Training Data is till the end of 1990. Test Data is from the beginning of 1991 to the last time
stamp provided.
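The split described above can be sketched as below; a dummy monthly frame with the report's date range stands in for the actual data, and the observation counts (132 train, 55 test) match the 187 entries shown earlier.

```python
import pandas as pd

# Monthly index matching the report's range: 1980-01 through 1995-07.
idx = pd.date_range("1980-01-01", "1995-07-01", freq="MS")
df = pd.DataFrame({"Sparkling": range(len(idx))}, index=idx)  # dummy values

train = df.loc[:"1990-12-01"]   # training data: up to the end of 1990
test = df.loc["1991-01-01":]    # test data: 1991 to the last timestamp
print(len(train), len(test))    # 132 and 55 observations
```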
Training data (first five rows):
YearMonth
1980-01-01 1686
1980-02-01 1591
1980-03-01 2304
1980-04-01 1712
1980-05-01 1471
Training data (last five rows):
YearMonth
1990-08-01 1605
1990-09-01 2424
1990-10-01 3116
1990-11-01 4286
1990-12-01 6047
Test data (first five rows):
YearMonth
1991-01-01 1902
1991-02-01 2049
1991-03-01 1874
1991-04-01 1279
1991-05-01 1432
Test data (last five rows):
YearMonth
1995-03-01 1897
1995-04-01 1862
1995-05-01 1670
1995-06-01 1688
1995-07-01 2031
p-value: 0.6697444263523344
We see that the series is not stationary at 𝛼 = 0.05.
p-value: 0.0
We see that after taking a difference of order 1 the series has become stationary at 𝛼 = 0.05.
Automated version of an ARIMA model: Based on lowest Akaike Information
Criteria (AIC)
SARIMAX Results summary (auto ARIMA): sample 01-01-1980 to 12-01-1990;
Heteroskedasticity (H): 2.43, Skew: 0.61
RMSE
ARIMA(2,1,1) 2249.496264
The Auto-Regressive parameter 'p' of an ARIMA model comes from the significant lag
after which the PACF plot cuts off to 0. The Moving-Average parameter 'q' comes from the
significant lag after which the ACF plot cuts off to 0. Looking at the above
plots, both the PACF and ACF cut off at lag 1, so we choose (p, d, q) = (0, 1, 1).
                               SARIMAX Results
==============================================================================
Dep. Variable:              Sparkling   No. Observations:                  132
Model:                 ARIMA(0, 1, 1)   Log Likelihood               -1129.530
Date:                Sun, 25 Dec 2022   AIC                           2263.060
Time:                        23:17:34   BIC                           2268.810
Sample:                    01-01-1980   HQIC                          2265.397
                         - 12-01-1990
Covariance Type:                  opg
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ma.L1         -0.3419      0.063     -5.397      0.000      -0.466      -0.218
sigma2      1.805e+06   2.13e+05      8.463      0.000    1.39e+06    2.22e+06
==============================================================================
Ljung-Box (L1) (Q):           1.09   Jarque-Bera (JB):         33.41
Prob(Q):                      0.30   Prob(JB):                  0.00
Heteroskedasticity (H):       2.45   Skew:                     -0.89
Prob(H) (two-sided):          0.00   Kurtosis:                  4.71
==============================================================================
We get a comparatively simpler model by looking at the ACF and the PACF plots.
Note: when both the AR(p) and MA(q) orders are 0, we have to convert
the input variable to a 'float64' type, otherwise Python might throw an error when we try to
forecast.
RMSE
ARIMA(2,1,1) 2249.496264
ARIMA(0,1,1) 3141.313590
SARIMAX Results summary:
Model: SARIMAX(0, 1, 2)x(2, 0, 2, 12); No. Observations: 132; Log Likelihood: -771.561; Covariance Type: opg
Selected coefficients: ma.S.L24 = -1.5969 (std err 2.674, z -0.597, P>|z| 0.550, CI [-6.837, 3.644]);
sigma2 = 2.905e+04 (std err 9.49e+04, z 0.306, P>|z| 0.760, CI [-1.57e+05, 2.15e+05])
Heteroskedasticity (H): 1.45; Skew: 0.49
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
RMSE
ARIMA(2,1,1) 2249.496264
ARIMA(0,1,1) 3141.313590
SARIMA(0,1,2)(2,0,2,12) 526.539279
We see a substantial reduction in the RMSE value once the seasonal parameters are included as
well.
Setting the seasonality as 24 for the second iteration of the auto SARIMA model.
SARIMAX Results summary:
Model: SARIMAX(1, 1, 2)x(2, 0, 2, 24); Log Likelihood: -604.141; Covariance Type: opg
Heteroskedasticity (H): 1.06; Skew: 0.06
RMSE
ARIMA(2,1,1) 2249.496264
ARIMA(0,1,1) 3141.313590
SARIMA(0,1,2)(2,0,2,12) 526.539279
SARIMA(1,1,2)(2,0,2,24) 670.549644
We see that the RMSE value has not reduced further when the seasonality parameter
was changed to 24.
SARIMAX Results summary (model refitted on the full data):
Dep. Variable: Sparkling; No. Observations: 187; Model: SARIMAX(0, 1, 2)x(2, 0, 2, 12);
Log Likelihood: -1173.399; sample ends 07-01-1995; Covariance Type: opg
Heteroskedasticity (H): 0.94; Skew: 0.60
Predictions based on the models built and the RMSE scores from the various
models:
- The plots show a significant increase in sales up to 1988, followed by a downward slope and a recovery towards 1995.
- There is a significant increase in sales and a positive trend towards a more productive year in 1995.
- To increase sales, we need to identify the trend seen during 1989 and try to implement similar features in the wine.
- To increase sales, we could find the right marketing technique and build similar ways of connecting with customers as in 1989.
- Find improved ways of reaching customers via current and trending channels.
- Aim to improve the availability of the wine in stores early, from October onwards, to get consumers used to the brand and ease of access.
Solution:
Reading the data: Rose
Top 5
Rose
YearMonth
1980-01-01 112.0
1980-02-01 118.0
1980-03-01 129.0
1980-04-01 99.0
1980-05-01 116.0
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 187 entries, 1980-01-01 to 1995-07-01
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Rose 185 non-null float64
dtypes: float64(1)
memory usage: 2.9 KB
Rose
count 185.0
mean 90.0
std 39.0
min 28.0
25% 63.0
50% 86.0
75% 112.0
max 267.0
The basic measures of descriptive statistics tell us how the sales have varied across the years. But
these measures average over the whole data without taking the
time component into account.
Yearly Boxplot
Monthly Plot
There is a clear distinction in 'Rose' sales across the different months of the year. The
highest values are consistently recorded in December across the years.
This plot shows us the behaviour of the Time Series (Rose in this case) across various
months. The red line is the median value.
Red indicates the Train set and blue indicates the Test set.
For this linear regression, we regress the 'Rose' variable against the order of
occurrence, which requires modifying the training data before fitting.
Now that the training and test data have been modified, let us go ahead and use Linear Regression to
build the model on the training data and evaluate it on the test data.
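The "regression on time" idea can be sketched as below. The report uses a linear-regression fit (likely scikit-learn); this sketch uses `numpy.polyfit` for the same straight-line fit, and all the series values are illustrative toy numbers.

```python
import numpy as np

# Sketch of regression on time: the target is regressed on 1, 2, 3, ...
# (the order of occurrence). Toy values, not the report's 'Rose' series.
train_y = np.array([112.0, 118, 129, 99, 116, 120, 108, 131])
train_t = np.arange(1, len(train_y) + 1)         # time order 1, 2, 3, ...
slope, intercept = np.polyfit(train_t, train_y, deg=1)

# Test time indices continue the sequence where the training data ended.
test_t = np.arange(len(train_y) + 1, len(train_y) + 4)
test_y = np.array([70.0, 83, 65])                # illustrative test values
pred = intercept + slope * test_t
rmse = np.sqrt(np.mean((pred - test_y) ** 2))    # Test RMSE
print(round(rmse, 2))
```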
Accuracy Metrics
                          RMSE         Test RMSE
RegressionOnTime          NaN          51.004214
SARIMA(0,1,2)(2,0,2,12)   526.539279   NaN
SARIMA(1,1,2)(2,0,2,24)   670.549644   NaN
For this simple average method, we will forecast using the average of the training
values.
                          RMSE         Test RMSE
SARIMA(0,1,2)(2,0,2,12)   526.539279   NaN
SARIMA(1,1,2)(2,0,2,24)   670.549644   NaN
For the moving average model, we are going to calculate rolling means (moving
averages) over different window sizes; the best window is the one with the minimum
error on the test set. For the moving average, we compute the rolling mean over
the entire data.
                              RMSE   Test RMSE
2pointTrailingMovingAverage   NaN    12.546929
4pointTrailingMovingAverage   NaN    17.148954
6pointTrailingMovingAverage   NaN    18.039241
9pointTrailingMovingAverage   NaN    18.422860
Before we go on to build the various Exponential Smoothing models, let us plot all the
models and compare the Time Series plots.
Method 5: Simple Exponential Smoothing
Alpha=0.995,SimpleExponentialSmoothing NaN 33.098471
Alpha=0.3,Beta=0.3,DoubleExponentialSmoothing NaN NaN
We see that at the 5% significance level the Time Series is non-stationary: the p-value is high,
so we are unable to reject the null hypothesis of non-stationarity.
Let us take a difference of order 1 and check whether the Time Series is stationary or not.
0.0
We see that at 𝛼 = 0.05 the Time Series is indeed stationary: the p-value is lower than 0.05,
so we can reject the null hypothesis that the series is non-stationary.
Differencing of order 1 therefore makes the time series stationary.
Training Data is till the end of 1990. Test Data is from the beginning of 1991 to the last
time stamp provided.
Training data (first five rows):
YearMonth
1980-01-01 112.0
1980-02-01 118.0
1980-03-01 129.0
1980-04-01 99.0
1980-05-01 116.0
Training data (last five rows):
YearMonth
1990-08-01 70.0
1990-09-01 83.0
1990-10-01 65.0
1990-11-01 110.0
1990-12-01 132.0
Test data (first five rows):
YearMonth
1991-01-01 54.0
1991-02-01 55.0
1991-03-01 66.0
1991-04-01 65.0
1991-05-01 60.0
Test data (last five rows):
YearMonth
1995-03-01 45.0
1995-04-01 52.0
1995-05-01 28.0
1995-06-01 40.0
1995-07-01 62.0
Check for stationarity of the Training Data Time Series.
p-value: 0.0
We see that after taking a difference of order 1 the series has become stationary at 𝛼 = 0.05.
                               SARIMAX Results
==============================================================================
Date:                Sun, 25 Dec 2022   AIC                           1281.871
Time:                        23:37:18   BIC                           1296.247
Sample:                    01-01-1980   HQIC                          1287.712
                         - 12-01-1990
Covariance Type:                  opg
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1         -0.4540      0.469     -0.969      0.333      -1.372       0.464
ar.L2          0.0001      0.170      0.001      0.999      -0.334       0.334
ma.L1         -0.2541      0.459     -0.554      0.580      -1.154       0.646
ma.L2         -0.5984      0.430     -1.390      0.164      -1.442       0.245
==============================================================================
RMSE
ARIMA(2,1,1) 43.201893
The Auto-Regressive parameter 'p' of an ARIMA model comes from the significant lag
after which the PACF plot cuts off to 0. The Moving-Average parameter 'q' comes from the
significant lag after which the ACF plot cuts off to 0. Looking at the above
plots, both the PACF and ACF cut off at lag 1, so our (p, d, q) values are (1, 1, 1).
                               SARIMAX Results
==============================================================================
Dep. Variable:                   Rose   No. Observations:                  132
Model:                 ARIMA(1, 1, 1)   Log Likelihood                -637.287
Date:                Sun, 25 Dec 2022   AIC                           1280.574
Time:                        23:38:42   BIC                           1289.200
Sample:                    01-01-1980   HQIC                          1284.079
                         - 12-01-1990
Covariance Type:                  opg
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.1814      0.076      2.396      0.017       0.033       0.330
ma.L1         -0.9192      0.053    -17.362      0.000      -1.023      -0.815
sigma2       972.5964     88.768     10.957      0.000     798.614    1146.579
==============================================================================
Ljung-Box (L1) (Q):           0.00   Jarque-Bera (JB):         41.59
Prob(Q):                      0.98   Prob(JB):                  0.00
Heteroskedasticity (H):       0.35   Skew:                      0.87
Prob(H) (two-sided):          0.00   Kurtosis:                  5.14
==============================================================================
We get a comparatively simpler model by looking at the ACF and the PACF plots.
Note: when both the AR(p) and MA(q) orders are 0, we have to convert
the input variable to a 'float64' type, otherwise Python might throw an error when we try to
forecast.
Predict on the Test Set & Evaluation
RMSE
ARIMA(2,1,1) 43.201893
ARIMA(1,1,1) 40.207902
We see that there is a difference in the RMSE values of the two models, but remember that
the second model is a much simpler one.
Let us look at the ACF plot once more to understand the seasonal parameters PDQ for the
SARIMA model.
We see that there can be a seasonality of 12 as well as 24. We will run our auto SARIMA
models by setting seasonality both as 12 and 24.
SARIMAX Results summary:
No. Observations: 132; Covariance Type: opg
Selected coefficients: ma.S.L12 = 0.0767 (std err 0.133, z 0.577, P>|z| 0.564, CI [-0.184, 0.337]);
ma.S.L24 = -0.0726 (std err 0.146, z -0.498, P>|z| 0.618, CI [-0.358, 0.213]);
sigma2 = 251.3137 (std err 4.77e+04, z 0.005, P>|z| 0.996, CI [-9.32e+04, 9.37e+04])
Heteroskedasticity (H): 0.88; Skew: 0.37
RMSE
ARIMA(2,1,1) 43.201893
ARIMA(1,1,1) 40.207902
SARIMA(0,1,2)(2,0,2,12) 30.038199
We see a substantial reduction in the RMSE value once the seasonal parameters are included as
well.
Setting the seasonality as 24 for the second iteration of the auto SARIMA model.
SARIMAX Results summary:
Sample ends 12-01-1990; Covariance Type: opg
sigma2 = 177.0562 (std err 0.001, z 1.72e+05, P>|z| 0.000, CI [177.054, 177.058])
Heteroskedasticity (H): 0.67; Skew: 0.56
RMSE
ARIMA(2,1,1) 43.201893
ARIMA(1,1,1) 40.207902
SARIMA(0,1,2)(2,0,2,12) 30.038199
SARIMA(1,1,2)(2,0,2,24) 23.122807
SARIMAX Results summary (model refitted on the full data):
Model: SARIMAX(0, 1, 2)x(2, 0, 2, 12); Log Likelihood: -659.342; sample ends 07-01-1995; Covariance Type: opg
sigma2 = 230.2318 (std err 23.741, z 9.698, P>|z| 0.000, CI [183.701, 276.763])
Heteroskedasticity (H): 0.59; Skew: 0.23
Training RMSE: 28.248216985592833
Please note the above RMSE is on the training-data predictions only. You cannot calculate
RMSE for the forecast, as the actual values (which have not occurred yet) are not available to be
compared with the forecasted values.
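The point above follows directly from the RMSE formula, which needs an actual value for every prediction. A minimal sketch with illustrative toy numbers:

```python
import numpy as np

# RMSE compares predictions against observed actuals, which is why it can
# only be computed on a period whose actuals exist (train or test), never on
# a future forecast. The numbers below are illustrative, not the report's.
actual = np.array([1902.0, 2049, 1874])
predicted = np.array([1800.0, 2100, 1900])
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print(round(rmse, 2))   # 67.53
```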
Predictions based on the models built and the RMSE scores from the various
models: