Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 27

Sparkling

Wine
Time Series
Project
Report

June’ 21
Date: 19/06/2021

0
Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited
Table of Contents

Contents

Executive Summary....................................................................................................................................................3
Introduction................................................................................................................................................................3
Data Description.........................................................................................................................................................3
Sample of the dataset:............................................................................................................................................3
Q1: Use methods of descriptive statistics to summarize data....................................................................................3
Q2: Perform appropriate Exploratory Data Analysis to understand the data and also perform decomposition ...4
Q3: Split the data into training and test. The test data should start in 1991...........................................................8
Q4: Build various exponential smoothing models on the training data and evaluate the model using RMSE on
the test data. Other models such as regression,naïve forecast models, simple average models etc. should
also be built on the training data and check the performance on the test data using RMSE...............................9
Q5: Check for the stationarity of the data on which the model is being built on using appropriate statistical
tests and also mention the hypothesis for the statistical test. If the data is found to be non-stationary, take
appropriate steps to make it stationary. Check the new data for stationarity and comment. .............................15
Q6: Build an automated version of the ARIMA/SARIMA model in which the parameters are selected using the
lowest Akaike Information Criteria (AIC) on the training data and evaluate this model on the test data using
RMSE....................................................................................................................................................................... 17
Q7: Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the training data and
evaluate this model on the test data using RMSE.................................................................................................21
Q8: Build a table with all the models built along with their corresponding parameters and the respective RMSE
values on the test data............................................................................................................................................23
Q9: Based on the model-building exercise, build the most optimum model(s) on the complete data and predict
12 months into the future with appropriate confidence intervals/bands ...............................................................24
Q10: Comment on the model thus built and report your findings and suggest the measures that the company
should be taking for future sales............................................................................................................................25
THE END!.................................................................................................................................................................. 25

1
Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited
List of Figures
Fig.1 – Time series plot
Fig.2 – Wine Sale per Year
Fig.3 – Wine Sale per Month for 8 years
Fig.4 – Wine sale trend over the years per month
Fig.5 – Sales units plot
Fig.6 – Average Sales and Sales Percentage Change
Fig.7 – Time series decomposition - Additive
Fig.8 – Time series decomposition - Multiplicative
Fig.9 – SES Test set prediction
Fig.10 – DES Test set prediction
Fig.11 – TES Test set prediction
Fig.12 – TES Test set prediction – 2 for Brute Force
Fig.13 – Regression Test set prediction
Fig.14 – Naive Test set prediction
Fig.15 – Simple Average Test set prediction
Fig.16 – Rolling Mean & Std Deviation on original series
Fig.17 – Rolling Mean & Std Deviation on differenced series
Fig.18 – ACF on Original time series
Fig.19 – ACF on Differentiated time series
Fig.20 – PACF on Original time series
Fig.21 – PACF on Differentiated time series
Fig.22 – SARIMA (1,1,2)(1,0,2,12)
Fig.23 – ACF on whole Time series
Fig.24 – PACF on whole Time series
Fig.25 – SARIMA (3,1,2)(1,0,1,12)
Fig.26 – Sparkling wine forecast
Executive Summary
For this particular assignment, the data of Sparkling wine sales in the 20th century is to be analysed.
The task is to analyze and forecast Wine Sales in the 20th century

Data Description
1. YearMonth: Year and month in format YYYY-MM. Data type - object
2. Sparkling : Sales unit of Sparkling wine for respective duration on month and year. Data type – int64

The data is having 187 rows for these 2 columns with no null values.

Q1: Read the data as an appropriate Time Series data and plot the data.
After reading the data and with above description of the data – converting it into Timeseries format by making
the YearMonth as the index.

Sample of the dataset:

Also below are the observed traits from the dataset:

Minimum Sales value - 1070.000000


Maximum Sales value - 7242.000000

Time series graphical view:


Fig.1 – Time series plot

The dataset plot shows a steady increase in wine sales over the year 1982 -1988 post which the sales have
reduced and the maximum sales/year remain in between the range 6500-6000 units similarly the lowest sales
points are 1000-2000 units/year over the entire 7 years duration.

The dataset plot also shows seasonality wherein similar pattern of sales is observed per year.

Q2: Perform appropriate Exploratory Data Analysis to understand the data and
also perform decomposition.

Wine sales per year when observed closely and plotted gives below observation:

Fig.2 – Wine Sale per Year

 There was a steady increase in sales/year for sparkling wine from 1982-1985.
 Post year 1985 – the sale unit were almost constant until 1989.
 From 1989 the sales started decreasing until 1992
 Again the trend for sales increase was observed for 1993 and 1994.
 The lowest sales of wine is observed at year 1995.
 The median lines at the box plots need to be explored more – as it shows that the sales unit across the
year were not constant.

To find out more in sales across months during the years:

Fig.3 – Wine Sale per Month for 8 years

 The above plot shows the sales/month over the years.


 Sales of wine are constantly increasing from the month of sep till dec with December having the high
sales if wine – may be that is year end celebrations or holidays.
 During the first half of the year the sales are comparatively less with June being the lest – due to
weather conditions which do favor much for wines or something else.

Plotting the monthly sales of wines over the years, sales of wine per month per year to corelate above
2 plots:

Fig.4 – Wine sale trend over the years per month


 So as evident from fig – the sales of wine during December month remain the highest, followed by Nov.
 The sales on first half of the years are comparatively very low.
 Also the wine sales in month of Dec and Nov has decreased post year 1987.

In general, the dataset sales units when plotted:

Fig.5 – Sales units plot

 Most of the sales of the wine has been in units of 3000-7000 – meaning these are units per year sales
which is observed for the duration of last 8 years
 Roughly about 40% of sales unit with last 8 years is around 1000-3500. Which also means company
should strive to increase this and keep sales unit above this – if aiming for a growth of sales.
 Sales unit lower than 3000 an year could be taken as a loss of market – due to other competitor wines
or change of consumer preference to drink wine with some other liquid product like juice, water etc.

These inferences could be summed up with below plot:


Fig.6 – Average Sales and Sales Percentage Change

 An average sale of 6000-7000 units during Dec is normal sales trend – lesser than this means
company might be losing the market.
 An average sales percentage change during first half if the year is generally negative upto even -75%
and at later half of the year it would be mostly positive with upte 100% sales increase from 3000 units
of wine/month on average.
 Also – its important to look at the -ve sales during the first half of the year (June/July months specially)
as to why the sales are less during these months. If average sale during these months could be
increased – it would lead to a significant increase of profit for the company.

Decomposing the time series:


Decomposing the time series in Additive mode:
Fig.7 – Time series decomposition - Additive

The residues seem to be NOT following the trend and are independent and random. As observed in
highlighted data points, the highest errors are observed not during the hightes peaks of trends and lowest are
not during the lowest peaks of trend.

Decomposing the time series in Multiplicative mode:

Fig.8 – Time series decomposition - Multiplicative

In this case also we observe that the residues and seasonality are independent of the trend of the time series.
Hence the time series could be taken as Additive time series.

 Errors or residues are random with range of -1000 to +1000.

Q3: Split the data into training and test. The test data should start in 1991.
The dataset is not time series; hence it is ordered in a sequential manner. Splitting the data in Train and test
with test data starting from 1991, we get 2 separate data sets with below shapes and points:

First few records Last few records

Train Dataset Shape:


132 rows, 1 column of Sparkling
with datatype Int64
Test Dataset Shape:
55 rows, 1 column of Sparkling
with datatype Int64

So the test data ends by 1995- 07 month.

Q4: Split Build various exponential smoothing models on the training data and
evaluate the model using RMSE on the test data.
4.a ) Model 1 – Simple exponential Smoothing model

 Simple exponential model captures only the smoothing level α. And not really the trend and
Seasonality. Hence only one attribute of the time series.
 Using brutes force method or multiple iterations – the best value if α found is 0.07028781460389563.
 Which when used in below equation:

Ft+1=αYt+(1−α)FtFt+1=αYt+(1−α)Ft

Where Ft+1= forecast at time (t+1)

Provides the forecast at respective time period.

 Using α, and Yt values from the train dataset, the predicted values for test dataset are calculated and
plotted along with actual test dataset values:
Fig.9 – SES Test set prediction

 As can be seen with SES (Simple Exponential Smoothing) for α = 0.07 we get straight line as the
forecasted sale for test data set duration which is not very close to actual dataset.
 The performance of the model, can be evaluated using RMSE (Root Mean square error) which is
essentially the error in prediction of the forecasted values of sales of Sparkling wine:

RMSE = 1338.0046232563645

4.b ) Model 2 – Double exponential Smoothing model

 Double exponential model captures the smoothing level α and smoothing Trend β. And not really the
Seasonality. Hence only 2 attributes of the time series Dataset .
 Using brutes force method or multiple iterations – the best value found are:

smoothing_level (α) : 0.6649999999999999


smoothing_trend (β) : 0.0001

 The seasonality component of the time series would notbe captured still but using Holt’s model(DES),
The forecast for sales of wines is calculated for test dataset and when plotted :
Fig.10 – DES Test set prediction

 As can be seen with DES (Double Exponential Smoothing) for α = 0.66 and β = 0.0001 we get straight
line as the forecasted sale for test data set duration, which is having increasing trend as well.
 The performance of the model, when evaluated using RMSE (Root Mean square error) which is
essentially the error in prediction of the forecasted values of sales of Sparkling wine:

RMSE = 5291.8798332269125

4.c ) Model 3 – Triple exponential Smoothing model (autofit)


 Triple exponential model captures the smoothing level α, smoothing Trend β and smoothing
Seasonality γ . Hence all major 3 attributes of the time series Dataset are captured.

 Using auto fit in the TES model – the best value found are:

smoothing_level (α) : 0.11103179280236858


smoothing_trend (β) : 0.04932976379995919
smoothing_seasonal (γ) : 0.3623612114487896

 Holt’s Winter model(TES), when used for calculation of the forecast for sales of wines for test dataset
and plotted :
Fig.11 – TES Test set prediction

 As can be seen with TES (Triple Exponential Smoothing) for α = 0.11 and β = 0.04 and γ = 0.36 we get
forecasted line (purple) resembling very close to the actual test dataset.
 The performance of the model, when evaluated using RMSE (Root Mean square error) which is
essentially the error in prediction of the forecasted values of sales of Sparkling wine:

RMSE = 403.0192361612048

4.d ) Model 4 – Triple exponential Smoothing model (Brute Force)

 Triple exponential model captures autofit or default fit will provide the best values for α,β, and γ.
However it would not really mean that that is the best TES model with least RMSE.
 Using multiple iterations to calculate α,β, and γ using Brute Force method and checking respective
RMSE’s, the values with least RMSE value that are picked are:

smoothing_level (α) : 0.3


smoothing_trend (β) : 0.3
smoothing_seasonal (γ) : 0.3

 This Holt’s Winter model(TES), when used for calculation of the forecast for sales of wines for test
dataset and plotted Vs the autofit TES Model (4.c) :
Fig.12 – TES Test set prediction – 2 for Brute Force

 As can be seen with TES (Triple Exponential Smoothing) for α = 0.3 and β = 0.3 and γ = 0.3 we get
forecasted line (RED) resembling the test data set more closely than the autofit TES model.
 The performance of the model, when evaluated using RMSE (Root Mean square error) which is
essentially the error in prediction of the forecasted values of sales of Sparkling wine:

RMSE = 361.397300

4.e ) Model 5 –Regressive (Linear regression)

 The same time series dataset could be used for frecasting using Linear Regression.
 For Linear Regression – the timeseries need to converted into a linear dataset.
 Then based on this linear Train dataset – based on old values the future values would be predicted.
 As the name suggest – its linear in nature and thus it would not be able to capture the seasonality part
of the timeseries well.
 When the forecasted values for test dataset are plotted agained actual test dataset:
Fig.13 – Regression Test set prediction

 The performance of the model, when evaluated using RMSE (Root Mean square error) which is
essentially the error in prediction of the forecasted values of sales of Sparkling wine:

RMSE = 1389.135

4.f ) Model 6 –Naïve forecast


 Naïve forecast essentially targets forecast based on using last periods actuals to predict future values.
 Again– the timeseries need to convert into a linear dataset.
 Then based on this Train dataset – based on old values the future values would be predicted.
 As the name suggest – this model uses the last old period value and continues to use that value as the
forecast values for all periods. No trend or seasonality or for that matter actual values more than the
last values are captured in this.
 When the forecasted values for test dataset are plotted against actual test dataset:
Fig.14 – Naive Test set prediction

 The performance of the model, when evaluated using RMSE (Root Mean square error) which is
essentially the error in prediction of the forecasted values of sales of Sparkling wine:

RMSE = 3864.279

4.g ) Model 7 – Simple Average


 Simple Average model will take the average of all the past period values and predict future values on
this average.
 Again– the timeseries need to convert into a linear dataset.
 When the forecasted values for test dataset are plotted against actual test dataset:

Fig.15 – Simple Average Test set prediction

 The performance of the model, when evaluated using RMSE (Root Mean square error) which is
essentially the error in prediction of the forecasted values of sales of Sparkling wine:

RMSE = 1275.082

Q5: Check for the stationarity of the data on which the model is being built on
using appropriate statistical tests and also mention the hypothesis for the
statistical test. If the data is found to be non-stationary, take appropriate steps to
make it stationary. Check the new data for stationarity and comment.
Using Dickey–Fuller test on entire dataset of Sparkling wine. The test works with institution that if the
timeseries is stationary and broken in period or windows – and the mean for all these windows is taken, then it
has tendency to return to a constant mean. If not – then the time series is not stationary.

With below Hypothesis H0 – Time Series is non-stationary


Alternate Hypothesis H1 – Time Series is stationary.

At significance level 5%.


Fig.16 – Rolling Mean & Std Deviation on original series

The rolling mean (red) if can be seen that this is not having constant mean.

Also if we look at other parameters:

The P value is 0.6 > 0.05 so the null hypothesis is correct which is Time series is NOT-STATIONARY.

Differentiating the time series by order 1 to make it stationary, we get below time series plot now :

Fig.17 – Rolling Mean & Std Deviation on differenced series


And in this case the rolling mean is nearly having same time-period. Looking at the other parameters we get P
value as :

As can be seen at 95% significance level, p-value happens to be very less than 0.05 and this null-hypothesis is
rejected in this case – making the series stationary.

Q6: Check Build an automated version of the ARIMA/SARIMA model in


which the parameters are selected using the lowest Akaike Information Criteria
(AIC) on the training data and evaluate this model on the test data using RMSE.
The series has a seasonal component and hence using ARIMA will not capture the seasonal aspect of the time
series while forecasting the sales values for future – instead SARIMA will be used to capture even the
seasonal aspect of the time series.

The original series Auto Corelative function (ACF) plot:

Fig.18 – ACF on Original time series

After the series differentiation the Auto Corelative function (ACF) plot:
Fig.19 – ACF on Differentiated time series

As observed in the plot the pattern repeats after every 12 Laps, and there are about 2 laps in these 12 periods
that are corrected and will have predominant effect on forecasting the futuristic values. This also means p = 2
after which the first cut off happen.

Similarly the Partial Auto Correlated function (PACF) is plotted for original series:

Fig.20 – PACF on Original time series

After the series differentiation, PACF is :


Fig.21 – PACF on Differentiated time series

Where we see that first 3 months data out of 12 months lags would be where the ACF will be observed more
and will have more predominant effect in forecasting. This also means q = 3 after which the first cut off happen.

Now using the various iterations to get the values of p, q and d and respective AIC (Akaike Information
Criteria) for a Lag at 12 months. And the lowest 5 values of these are :

So using the parameters where AIC value is lowest as per above table for the first record with below details :

p=1, d=1, q =2, P = 1, D = 0, Q = 2 and S = 12 for SARIMA model [(p,d,q)(P,D,Q)(S)]

Using these parameters for calculation of automatic SARIMA model (MODEL 7)


And the same when plotted:

Fig.22 – SARIMA (1,1,2)(1,0,2,12)


The Histogram and Normal Q-Q suggests it’s a good model with standard deviation and sample and
forecasting quantities not having much of difference.

The Model when evaluated for RMSE (Root mean square error)

RMSE = 1275.082

Q7: Check Build ARIMA/SARIMA models based on the cut-off points of ACF
and PACF on the training data and evaluate this model on the test data using
RMSE.
Again using only SARIMA as time series has a seasonality and ARIMA would not be able to capture that
aspect of the timeseries.

For manual model of SARIMA [(p,d,q)(P,D,Q)(S)]

The p,q,d values would remain the same as that of section Q6.

Seasonality is 12 months as discussed in Q6.

For getting the values of P and Q, PACF and ACF plots are rechecked again to see the cut off points after lag
12.

Fig.23 – ACF on whole Time series


Fig.24 – PACF on whole Time series

Based on above the best values at confidence interval 95 % we have is:

 α = 0.05
 p=3
 q=2
 d=1
 P=1
 Q=1
 D=0
 S = 12

Using SARIMA model file on these parameters to build MODEL – 9


Plotting above SARIMA parameters graphically:

Fig.25 – SARIMA (3,1,2)(1,0,1,12)

Histogram and Normal Q-Q present the difference between Actual data and Forecasted data.

The Model when evaluated for RMSE (Root mean square error)

RMSE = 325.090069

Q8: Check Build a table with all the models built along with their corresponding
parameters and the respective RMSE values on the test data.
The table of all the models thus far and their variants that were tried is below:
Q9: Based on the model-building exercise, build the most optimum model(s) on
the complete data and predict 12 months into the future with appropriate
confidence intervals/bands.
Based on above Q8 figure picking most optimum models to forecast the 12 months futuristic sales of Sparkling
wine which is model 9 – SARIMA manual fit.

9.a) Model 9 - SARIMA(3,1,2)(1,0,0,12).


When SARIMA (3,1,2)(1,0,0,12) is run for the entire original timeseries data:
The Model when evaluated for RMSE (Root mean square error) for entire timeseries data:

RMSE = 560.0757913891757
This model is then used to forecast next 12 months data and with confidence level 95%, below is how the
forecast for next 12 months would be:

Fig.26 – Sparkling wine forecast

Q10: Based on the model thus built and report your findings and suggest the
measures that the company should be taking for future sales.

From the data exploration, we can suggest/recommend below suggestions:


1) The Sparkling wines sales has been more or less constant – which means over the years the
product sales have neither increased drastically nor decreased.
2) While the product sales is not decreasing, the increase is also not phenomenal – which means
The product sales are not increasing with increase of consumers. This is something company
should look into – why? May be the wine needs more advertisement.
3) Also the wine sales are fairly less in initial months of the year (Jan, Feb, Mar, Apr, May, June) – this
needs to looked more – could be because the wine is costly and is worth only when purchased for
holidays or year end celebration. Or could be because the weather is not good to consume wine during
these months and then lighter version of could be manufactured.
4) During the last months of the year post July – the sale increase of wine with Dec being the highest
selling month. Since these are the selling seasons – discounts provided during these months would
also lure the consumers to buy it even more.
5) The consumers globally are increasing but the wines sales don’t follow this trend – could it be because
the market has other option of wines which are cheaper or better quality. Company needs to work on
the quality part of the wine as well.

THE END!

You might also like