
Problem Statement:

For this assignment, the data of different types of wine sales in the 20th century are to be analysed. Both datasets are from the same company but for different wines. As an analyst at ABC Estate Wines, you are tasked with analysing and forecasting Wine Sales in the 20th century.

Solution:
Reading the data Sparkling
Top 5

Sparkling

YearMonth

1980-01-01 1686

1980-02-01 1591

1980-03-01 2304

1980-04-01 1712

1980-05-01 1471

Basic Info
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 187 entries, 1980-01-01 to 1995-07-01
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sparkling 187 non-null int64
dtypes: int64(1)
memory usage: 2.9 KB
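The loading step itself isn't shown in this report; a minimal sketch of how such a frame could be built is below. The inline CSV is only a stand-in for the actual data file — its name and column layout are assumptions based on the DataFrame info above.

```python
from io import StringIO

import pandas as pd

# Stand-in for pd.read_csv("Sparkling.csv", ...); the file name and layout
# are assumptions based on the DataFrame info shown above.
csv_data = StringIO(
    "YearMonth,Sparkling\n"
    "1980-01,1686\n"
    "1980-02,1591\n"
    "1980-03,2304\n"
    "1980-04,1712\n"
    "1980-05,1471\n"
)

# Parse YearMonth as dates and use it as the DatetimeIndex
df = pd.read_csv(csv_data, parse_dates=["YearMonth"], index_col="YearMonth")

print(df.head())
df.info()
```

Setting the parsed dates as the index is what gives the `DatetimeIndex: 187 entries` layout reported above, and it is what lets the later date-based train/test slicing work.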
Plotting Time series

Basic Descriptive statistics

Sparkling

count 187.0

mean 2402.0

std 1295.0

min 1070.0

25% 1605.0

50% 1874.0

75% 2549.0

max 7242.0
Boxplots

Monthly Plot

There is a clear distinction in 'Sparkling' sales across months over the years. The highest values are recorded from October through December across various years.

Time Series Month Plot


This plot shows us the behaviour of the Time Series ('Sparkling' in this case) across various
months. The red line is the median value.

Grid Plot

Empirical Cumulative Distribution of Sales

Decomposition of Time series


Splitting into Train and Test

Red represents the Train data and blue represents the Test data.
4) Building various models and comparing Accuracy Metrics
Model 1: Linear Regression

Metrics
Test RMSE

RegressionOnTime 1374.550202

Model 2: Naïve Approach

Test RMSE

RegressionOnTime 1374.550202

NaiveModel 1496.444629

Method 3: Simple Average


For this simple average method, we will forecast by using the average of the training values.
Test RMSE

RegressionOnTime 1374.550202

NaiveModel 1496.444629

SimpleAverageModel 1368.746717

Method 4: Moving Average (MA)

For the moving average model, we are going to calculate rolling means (or moving averages) for
different intervals. The best interval can be determined by the maximum accuracy (or the
minimum error) over here.

For the Moving Average, we are going to average over the entire data.
Test RMSE

RegressionOnTime 1374.550202

NaiveModel 1496.444629

SimpleAverageModel 1368.746717

2pointTrailingMovingAverage 811.178937

4pointTrailingMovingAverage 1184.213295

6pointTrailingMovingAverage 1337.200524

9pointTrailingMovingAverage 1422.653281

Before we go on to build the various Exponential Smoothing models, let us plot all the models
and compare the Time Series plots
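The baseline models above (naive, simple average, trailing moving average) can be sketched as follows. This is a minimal illustration on a toy series; the variable names and toy values are ours, not from the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the Sparkling train/test series
train = pd.Series([1686, 1591, 2304, 1712, 1471, 1966], dtype=float)
test = pd.Series([1902, 2049, 1874], dtype=float)

def rmse(actual, predicted):
    return float(np.sqrt(np.mean((np.asarray(actual) - np.asarray(predicted)) ** 2)))

# Naive model: every test point is forecast as the last training observation
naive_forecast = np.repeat(train.iloc[-1], len(test))

# Simple average: every test point is forecast as the mean of the training data
avg_forecast = np.repeat(train.mean(), len(test))

# 2-point trailing moving average; shift(1) keeps the forecast strictly
# based on past values
full = pd.concat([train, test], ignore_index=True)
trailing_2 = full.rolling(window=2).mean().shift(1)
ma_forecast = trailing_2.iloc[len(train):]

print(rmse(test, naive_forecast), rmse(test, avg_forecast), rmse(test, ma_forecast))
```

The same `rmse` helper then scores every model against the test set, which is how the comparison tables in this report are built.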

Method 5: Simple Exponential Smoothing


Test RMSE

RegressionOnTime 1374.550202

NaiveModel 1496.444629

SimpleAverageModel 1368.746717

2pointTrailingMovingAverage 811.178937

4pointTrailingMovingAverage 1184.213295

6pointTrailingMovingAverage 1337.200524

9pointTrailingMovingAverage 1422.653281

Alpha=0.995,SimpleExponentialSmoothing 1362.348469

Setting different alpha values:

The higher the alpha value, the more weight is given to the more recent observations. That is, we assume that what happened recently will happen again.

We will run a loop with different alpha values to understand which value works best for alpha on
the test set.
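That alpha search can be sketched as below. To stay self-contained, this uses a hand-rolled simple exponential smoothing rather than the statsmodels implementation the notebook presumably used; the toy series is a stand-in for the actual data.

```python
import numpy as np

def ses_forecast(train, alpha):
    """Simple exponential smoothing: level = alpha*y + (1-alpha)*level.
    The flat forecast for the whole test period is the final level."""
    level = train[0]
    for y in train[1:]:
        level = alpha * y + (1 - alpha) * level
    return level

def rmse(actual, predicted):
    return float(np.sqrt(np.mean((np.asarray(actual) - predicted) ** 2)))

train = np.array([1686, 1591, 2304, 1712, 1471, 1966], dtype=float)
test = np.array([1902, 2049, 1874], dtype=float)

# Try a grid of alpha values and keep the one with the lowest test RMSE
results = {}
for alpha in np.arange(0.1, 1.0, 0.1):
    results[round(alpha, 1)] = rmse(test, ses_forecast(train, alpha))

best_alpha = min(results, key=results.get)
print(best_alpha, results[best_alpha])
```

Note that picking alpha on the test set, as done here and in the report, risks overfitting the test period; a validation split would be the stricter choice.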
Check for stationarity of the whole Time Series data.

Perform the Dickey-Fuller test. H0: The Time Series is non-stationary.

Results of Dickey-Fuller Test:


Test Statistic -1.360497
p-value 0.601061
#Lags Used 11.000000
Number of Observations Used 175.000000
Critical Value (1%) -3.468280
Critical Value (5%) -2.878202
Critical Value (10%) -2.575653
dtype: float64

We see that at the 5% significance level the Time Series is non-stationary, as with such a high p-value we are unable to reject the null hypothesis.

Let us take a difference of order 1 and check whether the Time Series is stationary or not.

We see that at 𝛼 = 0.05 the Time Series is indeed stationary, as the p-value is lower than 0.05, and hence we can reject the null hypothesis that the time series is non-stationary. So differencing of order 1 makes the time series stationary.

Autocorrelation & Partial Autocorrelation function


Train & Test Data Split & Plot

Training Data is till the end of 1990. Test Data is from the beginning of 1991 to the last time
stamp provided.
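Given the monthly index from 1980-01 through 1995-07 described above, the date-based split can be sketched as (the counter values stand in for the actual sales):

```python
import pandas as pd

# Monthly index matching the dataset's range; 'MS' = month start
idx = pd.date_range("1980-01-01", "1995-07-01", freq="MS")
df = pd.DataFrame({"Sparkling": range(len(idx))}, index=idx)

# Train: everything up to the end of 1990; Test: 1991 onwards
train = df.loc[:"1990-12-01"]
test = df.loc["1991-01-01":]

print(len(train), len(test))  # 132 55
```

This reproduces the 132 training observations reported in the SARIMAX summaries below and leaves 55 months for testing.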

First few rows of Training Data


Sparkling

YearMonth

1980-01-01 1686

1980-02-01 1591

1980-03-01 2304

1980-04-01 1712

1980-05-01 1471

Last few rows of Training Data


Sparkling

YearMonth

1990-08-01 1605

1990-09-01 2424

1990-10-01 3116

1990-11-01 4286

1990-12-01 6047

First few rows of Test Data


Sparkling

YearMonth

1991-01-01 1902

1991-02-01 2049

1991-03-01 1874

1991-04-01 1279

1991-05-01 1432

Last few rows of Test Data


Sparkling

YearMonth

1995-03-01 1897

1995-04-01 1862

1995-05-01 1670

1995-06-01 1688

1995-07-01 2031

Check for stationarity of the Training Data Time Series.

0.6697444263523344
We see that the series is not stationary at 𝛼 = 0.05.

0.0
We see that after taking a difference of order 1 the series has become stationary at 𝛼 = 0.05.
Automated version of an ARIMA model: Based on lowest Akaike Information Criterion (AIC)

SARIMAX Results

Dep. Variable: Sparkling No. Observations: 132


Model: ARIMA(2, 1, 2) Log Likelihood -1101.755

Date: Sun, 25 Dec 2022 AIC 2213.509

Time: 23:17:33 BIC 2227.885

Sample: 01-01-1980 HQIC 2219.351

- 12-01-1990

Covariance Type: opg

coef std err z P>|z| [0.025 0.975]

ar.L1 1.3121 0.046 28.781 0.000 1.223 1.401

ar.L2 -0.5593 0.072 -7.741 0.000 -0.701 -0.418

ma.L1 -1.9917 0.109 -18.217 0.000 -2.206 -1.777

ma.L2 0.9999 0.110 9.109 0.000 0.785 1.215

sigma2 1.099e+06 5.51e+12 1.99e-07 0.000 1.1e+06 1.1e+06

Ljung-Box (L1) (Q): 0.19 Jarque-Bera (JB): 14.46

Prob(Q): 0.67 Prob(JB): 0.00

Heteroskedasticity (H): 2.43 Skew: 0.61

Prob(H) (two-sided): 0.00 Kurtosis: 4.08

Predict on the Test Set & Evaluation - Auto Arima Model

RMSE

ARIMA(2,1,1) 2249.496264

Manual ARIMA model - Using ACF & PACF plots


Here, we have taken alpha=0.05.

The Auto-Regressive parameter in an ARIMA model is 'p', which comes from the significant lag after which the PACF plot cuts off to 0. The Moving-Average parameter in an ARIMA model is 'q', which comes from the significant lag after which the ACF plot cuts off to 0. By looking at the above plots, we can say that both the PACF and ACF plots cut off at lag 1. So, our (p, d, q) values are (0, 1, 1).

SARIMAX Results
==============================================================================
Dep. Variable:              Sparkling   No. Observations:                  132
Model:                 ARIMA(0, 1, 1)   Log Likelihood               -1129.530
Date:                Sun, 25 Dec 2022   AIC                           2263.060
Time:                        23:17:34   BIC                           2268.810
Sample:                    01-01-1980   HQIC                          2265.397
                         - 12-01-1990
Covariance Type:                  opg
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ma.L1         -0.3419      0.063     -5.397      0.000      -0.466      -0.218
sigma2      1.805e+06   2.13e+05      8.463      0.000    1.39e+06    2.22e+06
==============================================================================
Ljung-Box (L1) (Q):         1.09   Jarque-Bera (JB):        33.41
Prob(Q):                    0.30   Prob(JB):                 0.00
Heteroskedasticity (H):     2.45   Skew:                    -0.89
Prob(H) (two-sided):        0.00   Kurtosis:                 4.71
==============================================================================

We get a comparatively simpler model by looking at the ACF and the PACF plots.

Note: When we see that both the AR(p) and the MA(q) terms are of order 0, we have to convert the input variable into a 'float64' type variable, else Python might throw an error when we try to forecast.

Predict on the Test Set & Evaluation

RMSE

ARIMA(2,1,1) 2249.496264

ARIMA(0,1,1) 3141.313590

Automated version of a SARIMA model - Parameter Selection with lowest Akaike Information Criterion (AIC)

We see that there can be a seasonality of 12 as well as 24. We will run our auto SARIMA models by setting seasonality both as 12 and 24.

Auto SARIMA model - With Seasonality as 12

SARIMAX Results

Dep. Variable: y No. Observations: 132

Model: SARIMAX(0, 1, 2)x(2, 0, 2, 12) Log Likelihood -771.561

Date: Sun, 25 Dec 2022 AIC 1557.122

Time: 23:18:30 BIC 1575.632

Sample: 0 HQIC 1564.621

- 132

Covariance Type: opg

coef std err z P>|z| [0.025 0.975]

ma.L1 -0.7771 0.101 -7.678 0.000 -0.975 -0.579

ma.L2 -0.1239 0.121 -1.022 0.307 -0.361 0.114

ar.S.L12 0.6992 0.751 0.931 0.352 -0.773 2.171

ar.S.L24 0.3588 0.783 0.458 0.647 -1.175 1.893

ma.S.L12 1.6452 3.728 0.441 0.659 -5.661 8.952

ma.S.L24 -1.5969 2.674 -0.597 0.550 -6.837 3.644

sigma2 2.905e+04 9.49e+04 0.306 0.760 -1.57e+05 2.15e+05

Ljung-Box (L1) (Q): 0.03 Jarque-Bera (JB): 22.05

Prob(Q): 0.85 Prob(JB): 0.00

Heteroskedasticity (H): 1.45 Skew: 0.49

Prob(H) (two-sided): 0.28 Kurtosis: 5.04

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Prediction on the Test Set & Evaluation

RMSE

ARIMA(2,1,1) 2249.496264

ARIMA(0,1,1) 3141.313590

SARIMA(0,1,2)(2,0,2,12) 526.539279

We see a huge gain in the RMSE value by including the seasonal parameters as well.

Setting the seasonality as 24 for the second iteration of the auto SARIMA model.

SARIMAX Results

Dep. Variable: Sparkling No. Observations: 132

Model: SARIMAX(1, 1, 2)x(2, 0, 2, 24) Log Likelihood -604.141

Date: Sun, 25 Dec 2022 AIC 1224.282

Time: 23:21:36 BIC 1243.338

Sample: 01-01-1980 HQIC 1231.922


- 12-01-1990

Covariance Type: opg

coef std err z P>|z| [0.025 0.975]

ar.L1 0.3946 0.297 1.330 0.183 -0.187 0.976

ma.L1 -2.1097 0.815 -2.588 0.010 -3.707 -0.512

ma.L2 1.0108 0.830 1.218 0.223 -0.616 2.638

ar.S.L24 0.4197 0.122 3.434 0.001 0.180 0.659

ar.S.L48 0.7928 0.138 5.765 0.000 0.523 1.062

ma.S.L24 0.3180 0.728 0.437 0.662 -1.108 1.744

ma.S.L48 -0.8802 0.467 -1.884 0.060 -1.796 0.036

sigma2 7.06e+04 2.98e+09 2.37e-05 0.000 7.06e+04 7.06e+04

Ljung-Box (L1) (Q): 0.08 Jarque-Bera (JB): 37.50

Prob(Q): 0.78 Prob(JB): 0.00

Heteroskedasticity (H): 1.06 Skew: 0.06

Prob(H) (two-sided): 0.87 Kurtosis: 6.35

Predict on the Test Set & Evaluation

RMSE

ARIMA(2,1,1) 2249.496264

ARIMA(0,1,1) 3141.313590

SARIMA(0,1,2)(2,0,2,12) 526.539279

SARIMA(1,1,2)(2,0,2,24) 670.549644

We see that the RMSE value has not reduced further when the seasonality parameter was changed to 24.

Building the most optimum model on the Full Data

SARIMAX Results

Dep. Variable: Sparkling No. Observations: 187

Model: SARIMAX(0, 1, 2)x(2, 0, 2, 12) Log Likelihood -1173.399

Date: Sun, 25 Dec 2022 AIC 2360.798

Time: 23:21:38 BIC 2382.281

Sample: 01-01-1980 HQIC 2369.522

- 07-01-1995

Covariance Type: opg

coef std err z P>|z| [0.025 0.975]

ma.L1 -0.8323 0.079 -10.518 0.000 -0.987 -0.677

ma.L2 -0.1230 0.082 -1.494 0.135 -0.284 0.038

ar.S.L12 0.6500 0.670 0.970 0.332 -0.664 1.964

ar.S.L24 0.3697 0.683 0.541 0.589 -0.970 1.709

ma.S.L12 -0.2323 0.669 -0.347 0.728 -1.544 1.079

ma.S.L24 -0.2435 0.425 -0.573 0.567 -1.076 0.590

sigma2 1.467e+05 1.29e+04 11.341 0.000 1.21e+05 1.72e+05

Ljung-Box (L1) (Q): 0.02 Jarque-Bera (JB): 42.80


Prob(Q): 0.89 Prob(JB): 0.00

Heteroskedasticity (H): 0.94 Skew: 0.60

Prob(H) (two-sided): 0.83 Kurtosis: 5.24

Predictions based on the models built and the RMSE scores from the various models built:

 The plots show a significant increase during 1988, followed by a downward slope and a recovery again in 1995.
 There is a significant increase in sales and a positive trend towards a more productive year, 1995.
 To increase sales, we need to identify the trend seen during 1989 and try to implement similar features in the wine.
 To increase sales, we could find the right marketing technique / build similar ways of connecting with customers as in 1989.
 Find improved ways of reaching customers via current and trending channels.
 Aim to improve the availability of the wine in stores from October onwards, to build brand familiarity and ease of access.

Solution:
Reading the data Rose
Top 5
Rose

YearMonth

1980-01-01 112.0

1980-02-01 118.0

1980-03-01 129.0

1980-04-01 99.0

1980-05-01 116.0

Info of the dataset

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 187 entries, 1980-01-01 to 1995-07-01
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Rose 185 non-null float64
dtypes: float64(1)
memory usage: 2.9 KB
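The info above shows 185 non-null values out of 187, i.e. two missing months in 'Rose'. The report does not show how these were treated; one common option is time-based interpolation, sketched here on a toy series (the gap positions and values are ours):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("1980-01-01", periods=6, freq="MS")
rose = pd.Series([112.0, 118.0, np.nan, np.nan, 116.0, 130.0], index=idx)

# Linear interpolation weighted by the time distance between observations;
# requires a DatetimeIndex
filled = rose.interpolate(method="time")

print(filled.isna().sum())  # 0
```

Filling the gaps before modelling matters because differencing and the ARIMA/SARIMA fits below assume a complete, evenly spaced monthly series.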

Plotting the Timeseries

Basic Descriptive Statistics


Rose

count 185.0

mean 90.0

std 39.0

min 28.0

25% 63.0

50% 86.0

75% 112.0

max 267.0

The basic measures of descriptive statistics tell us how the Sales have varied across years. But,
for this measure of descriptive statistics we have averaged over the whole data without taking the
time component into account.

Boxplots - Sales across years & within months across years.

Yearly Boxplot

Monthly Plot
There is a clear distinction in 'Rose' sales across months over the years. The highest values are recorded in the month of December across various years.

Time Series Month Plot

This plot shows us the behaviour of the Time Series (Rose in this case) across various
months. The red line is the median value.

Monthly Sales Plot across years


Empirical Cumulative Distribution of Sales

Plot of Month on month % change of Rose


Decompose Time Series

Train – Test Split and Plot

The red indicates Train Set and Blue indicates Test Set

Building different models and comparing the accuracy metrics.

Model 1: Linear Regression

For this linear regression, we are going to regress the 'Rose' variable against the order of occurrence. For this, we need to modify our training data before fitting it into a linear regression.

Now that our training and test data have been modified, let us go ahead and use Linear Regression to build the model on the training data and test the model on the test data.
Accuracy Metrics

Test RMSE

RegressionOnTime 51.004214

Model 2: Naïve Approach

RMSE Test RMSE

ARIMA(2,1,1) 2249.496264 NaN

ARIMA(0,1,1) 3141.313590 NaN

SARIMA(0,1,2)(2,0,2,12) 526.539279 NaN

SARIMA(1,1,2)(2,0,2,24) 670.549644 NaN

NaiveModel NaN 24.671882

Method 3: Simple Average

For this simple average method, we will forecast by using the average of the training
values.

RMSE Test RMSE

ARIMA(2,1,1) 2249.496264 NaN

ARIMA(0,1,1) 3141.313590 NaN

SARIMA(0,1,2)(2,0,2,12) 526.539279 NaN

SARIMA(1,1,2)(2,0,2,24) 670.549644 NaN

SimpleAverageModel NaN 54.851316

Method 4: Moving Average(MA)

For the moving average model, we are going to calculate rolling means (or moving averages) for different intervals. The best interval can be determined by the maximum accuracy (or the minimum error). For the Moving Average, we are going to average over the entire data.
RMSE Test RMSE

ARIMA(2,1,1) 2249.496264 NaN

ARIMA(0,1,1) 3141.313590 NaN

SARIMA(0,1,2)(2,0,2,12) 526.539279 NaN

SARIMA(1,1,2)(2,0,2,24) 670.549644 NaN

2pointTrailingMovingAverage NaN 12.546929

4pointTrailingMovingAverage NaN 17.148954

6pointTrailingMovingAverage NaN 18.039241

9pointTrailingMovingAverage NaN 18.422860
Before we go on to build the various Exponential Smoothing models, let us plot all the
models and compare the Time Series plots.
Method 5: Simple Exponential Smoothing

RMSE Test RMSE

ARIMA(2,1,1) 2249.496264 NaN

ARIMA(0,1,1) 3141.313590 NaN

SARIMA(0,1,2)(2,0,2,12) 526.539279 NaN

SARIMA(1,1,2)(2,0,2,24) 670.549644 NaN

Alpha=0.995,SimpleExponentialSmoothing NaN 33.098471

Method 6: Double Exponential Smoothing (Holt's Model)


RMSE Test RMSE

ARIMA(2,1,1) 2249.496264 NaN

ARIMA(0,1,1) 3141.313590 NaN

SARIMA(0,1,2)(2,0,2,12) 526.539279 NaN

SARIMA(1,1,2)(2,0,2,24) 670.549644 NaN

Alpha=0.3,Beta=0.3,DoubleExponentialSmoothing NaN NaN
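The Holt's model RMSE shows as NaN above, so the fitted forecast is not recoverable from the report. For reference, a minimal hand-rolled sketch of double exponential smoothing (level plus trend), independent of the statsmodels implementation the notebook presumably used:

```python
import numpy as np

def holt_forecast(y, alpha, beta, steps):
    """Holt's linear method: maintain a level and a trend component."""
    level, trend = y[0], y[1] - y[0]
    for t in range(1, len(y)):
        prev_level = level
        level = alpha * y[t] + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    # The h-step-ahead forecast extrapolates the final trend linearly
    return np.array([level + h * trend for h in range(1, steps + 1)])

train = np.array([112.0, 118.0, 129.0, 99.0, 116.0, 105.0])
print(holt_forecast(train, alpha=0.3, beta=0.3, steps=3))
```

Unlike simple exponential smoothing, which issues a flat forecast, Holt's method produces a sloped forecast, which suits the downward-trending Rose series.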

Check for stationarity of the whole Time Series data.

Results of Dickey-Fuller Test:


Test Statistic -1.715949
p-value 0.422914
#Lags Used 13.000000
Number of Observations Used 173.000000
Critical Value (1%) -3.468726
Critical Value (5%) -2.878396
Critical Value (10%) -2.575756
dtype: float64

We see that at the 5% significance level the Time Series is non-stationary, as with such a high p-value we are unable to reject the null hypothesis.

Let us take a difference of order 1 and check whether the Time Series is stationary or not.

0.0
We see that at 𝛼 = 0.05 the Time Series is indeed stationary, as the p-value is lower than 0.05, and hence we can reject the null hypothesis that the time series is non-stationary. So differencing of order 1 makes the time series stationary.

Autocorrelation & Partial Autocorrelation function

Train & Test Data Split & Plot

Training Data is till the end of 1990. Test Data is from the beginning of 1991 to the last
time stamp provided.

First few rows of Training Data


Rose

YearMonth

1980-01-01 112.0

1980-02-01 118.0

1980-03-01 129.0

1980-04-01 99.0

1980-05-01 116.0

Last few rows of Training Data


Rose

YearMonth

1990-08-01 70.0

1990-09-01 83.0

1990-10-01 65.0

1990-11-01 110.0

1990-12-01 132.0

First few rows of Test Data


Rose

YearMonth

1991-01-01 54.0

1991-02-01 55.0

1991-03-01 66.0

1991-04-01 65.0

1991-05-01 60.0

Last few rows of Test Data


Rose

YearMonth

1995-03-01 45.0

1995-04-01 52.0

1995-05-01 28.0

1995-06-01 40.0

1995-07-01 62.0
Check for stationarity of the Training Data Time Series.

0.0
We see that after taking a difference of order 1 the series has become stationary at 𝛼 = 0.05.

Automated version of an ARIMA model: Based on lowest Akaike Information Criterion (AIC).

SARIMAX Results

Dep. Variable: Rose No. Observations: 132

Model: ARIMA(2, 1, 2) Log Likelihood -635.935

Date: Sun, 25 Dec 2022 AIC 1281.871

Time: 23:37:18 BIC 1296.247

Sample: 01-01-1980 HQIC 1287.712

- 12-01-1990

Covariance Type: opg

coef std err z P>|z| [0.025 0.975]

ar.L1 -0.4540 0.469 -0.969 0.333 -1.372 0.464

ar.L2 0.0001 0.170 0.001 0.999 -0.334 0.334

ma.L1 -0.2541 0.459 -0.554 0.580 -1.154 0.646

ma.L2 -0.5984 0.430 -1.390 0.164 -1.442 0.245

sigma2 952.1601 91.424 10.415 0.000 772.973 1131.347

Ljung-Box (L1) (Q): 0.02 Jarque-Bera (JB): 34.16

Prob(Q): 0.88 Prob(JB): 0.00


Heteroskedasticity (H): 0.37 Skew: 0.79

Prob(H) (two-sided): 0.00 Kurtosis: 4.94

Predict on the Test Set & Evaluation - Auto Arima Model

RMSE

ARIMA(2,1,1) 43.201893

Manual ARIMA model - Using ACF & PACF plots


Here, we have taken alpha=0.05.

The Auto-Regressive parameter in an ARIMA model is 'p', which comes from the significant lag after which the PACF plot cuts off to 0. The Moving-Average parameter in an ARIMA model is 'q', which comes from the significant lag after which the ACF plot cuts off to 0. By looking at the above plots, we can say that both the PACF and ACF plots cut off at lag 1. So, our (p, d, q) values are (1, 1, 1).

SARIMAX Results
==============================================================================
Dep. Variable:                   Rose   No. Observations:                  132
Model:                 ARIMA(1, 1, 1)   Log Likelihood                -637.287
Date:                Sun, 25 Dec 2022   AIC                           1280.574
Time:                        23:38:42   BIC                           1289.200
Sample:                    01-01-1980   HQIC                          1284.079
                         - 12-01-1990
Covariance Type:                  opg
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.1814      0.076      2.396      0.017       0.033       0.330
ma.L1         -0.9192      0.053    -17.362      0.000      -1.023      -0.815
sigma2       972.5964     88.768     10.957      0.000     798.614    1146.579
==============================================================================
Ljung-Box (L1) (Q):         0.00   Jarque-Bera (JB):        41.59
Prob(Q):                    0.98   Prob(JB):                 0.00
Heteroskedasticity (H):     0.35   Skew:                     0.87
Prob(H) (two-sided):        0.00   Kurtosis:                 5.14
==============================================================================

We get a comparatively simpler model by looking at the ACF and the PACF plots.

Note: When we see that both the AR(p) and the MA(q) terms are of order 0, we have to convert the input variable into a 'float64' type variable, else Python might throw an error when we try to forecast.
Predict on the Test Set & Evaluation

RMSE

ARIMA(2,1,1) 43.201893

ARIMA(1,1,1) 40.207902

We see that there is a difference in the RMSE values of the two models, but remember that the second model is a much simpler model.

Automated version of a SARIMA model - Parameter Selection with lowest Akaike Information Criterion (AIC).

Let us look at the ACF plot once more to understand the seasonal parameters PDQ for the
SARIMA model.

We see that there can be a seasonality of 12 as well as 24. We will run our auto SARIMA
models by setting seasonality both as 12 and 24.

Auto SARIMA model - With Seasonality as 12


SARIMAX Results

Dep. Variable: y No. Observations: 132


Model: SARIMAX(0, 1, 2)x(2, 0, 2, 12) Log Likelihood -436.969

Date: Sun, 25 Dec 2022 AIC 887.938

Time: 23:42:41 BIC 906.448

Sample: 0 HQIC 895.437

- 132

Covariance Type: opg

coef std err z P>|z| [0.025 0.975]

ma.L1 -0.8427 189.778 -0.004 0.996 -372.800 371.115

ma.L2 -0.1573 29.815 -0.005 0.996 -58.594 58.279

ar.S.L12 0.3467 0.079 4.375 0.000 0.191 0.502

ar.S.L24 0.3023 0.076 3.996 0.000 0.154 0.451

ma.S.L12 0.0767 0.133 0.577 0.564 -0.184 0.337

ma.S.L24 -0.0726 0.146 -0.498 0.618 -0.358 0.213

sigma2 251.3137 4.77e+04 0.005 0.996 -9.32e+04 9.37e+04

Ljung-Box (L1) (Q): 0.10 Jarque-Bera (JB): 2.33

Prob(Q): 0.75 Prob(JB): 0.31

Heteroskedasticity (H): 0.88 Skew: 0.37

Prob(H) (two-sided): 0.70 Kurtosis: 3.03

Prediction on the Test Set & Evaluation


RMSE

ARIMA(2,1,1) 43.201893

ARIMA(1,1,1) 40.207902

SARIMA(0,1,2)(2,0,2,12) 30.038199

We see a huge gain in the RMSE value by including the seasonal parameters as well.
Setting the seasonality as 24 for the second iteration of the auto SARIMA model.

SARIMAX Results

Dep. Variable: Rose No. Observations: 132

Model: SARIMAX(1, 1, 2)x(2, 0, 2, 24) Log Likelihood -337.400

Date: Sun, 25 Dec 2022 AIC 690.800

Time: 23:47:23 BIC 709.856

Sample: 01-01-1980 HQIC 698.440

- 12-01-1990

Covariance Type: opg

coef std err z P>|z| [0.025 0.975]

ar.L1 0.2062 0.344 0.600 0.548 -0.467 0.880


ma.L1 -1.1235 0.392 -2.863 0.004 -1.893 -0.354

ma.L2 0.1543 0.379 0.407 0.684 -0.589 0.897

ar.S.L24 0.8289 0.136 6.107 0.000 0.563 1.095

ar.S.L48 0.0416 0.103 0.403 0.687 -0.160 0.244

ma.S.L24 -0.8878 0.195 -4.547 0.000 -1.270 -0.505

ma.S.L48 -0.1123 0.242 -0.464 0.643 -0.587 0.362

sigma2 177.0562 1.72e+05 0.001 0.000 177.054 177.058

Ljung-Box (L1) (Q): 0.00 Jarque-Bera (JB): 4.14

Prob(Q): 0.98 Prob(JB): 0.13

Heteroskedasticity (H): 0.67 Skew: 0.56

Prob(H) (two-sided): 0.30 Kurtosis: 3.10

RMSE

ARIMA(2,1,1) 43.201893

ARIMA(1,1,1) 40.207902

SARIMA(0,1,2)(2,0,2,12) 30.038199

SARIMA(1,1,2)(2,0,2,24) 23.122807

Building the most optimum model on the Full Data.


SARIMAX Results

Dep. Variable: Rose No. Observations: 187

Model: SARIMAX(0, 1, 2)x(2, 0, 2, 12) Log Likelihood -659.342

Date: Sun, 25 Dec 2022 AIC 1332.683

Time: 23:48:30 BIC 1354.166

Sample: 01-01-1980 HQIC 1341.407

- 07-01-1995

Covariance Type: opg

coef std err z P>|z| [0.025 0.975]

ma.L1 -0.6989 0.078 -8.937 0.000 -0.852 -0.546

ma.L2 -0.2026 0.079 -2.557 0.011 -0.358 -0.047

ar.S.L12 0.4266 0.062 6.903 0.000 0.305 0.548

ar.S.L24 0.3561 0.056 6.311 0.000 0.246 0.467

ma.S.L12 -0.1150 0.097 -1.188 0.235 -0.305 0.075

ma.S.L24 -0.2191 0.113 -1.933 0.053 -0.441 0.003

sigma2 230.2318 23.741 9.698 0.000 183.701 276.763

Ljung-Box (L1) (Q): 0.09 Jarque-Bera (JB): 4.49

Prob(Q): 0.77 Prob(JB): 0.11

Heteroskedasticity (H): 0.59 Skew: 0.23

Prob(H) (two-sided): 0.06 Kurtosis: 3.68

28.248216985592833
Please note the above RMSE is on the training-data predictions only. You cannot calculate RMSE for a forecast, as the actual values (which have not occurred yet) are not available to compare against the forecasted values.

Predictions based on the models built and the RMSE scores from the various models built:

 The plots show a downward-sloping trend along with seasonality.
 Though there have been periods of significantly increased sales, there is still a decline over the years.
 To increase sales, we need to identify the trend seen during 1980 and try to implement similar features in the wine.
 To increase sales, we could find the right marketing technique / build similar ways of connecting with customers as in 1980.
 Find improved ways of reaching customers via current and trending channels.
 Aim to improve the availability of the wine in stores from October onwards, to build brand familiarity and ease of access.
