Neha Arcot Time Series
Neha Arcot
Problem:
For this assignment, the sales data for different types of wine in the 20th century are to be
analyzed. Both data sets are from the same company but cover different wines. As an analyst at
ABC Estate Wines, you are tasked to analyze and forecast wine sales in the 20th century.
Data set for the Problem: Sparkling.csv and Rose.csv
1. Read the data as an appropriate Time Series data and plot the data.
Load both data sets using the pandas read_csv command and parse the data.
Figure 1
Inferences:
Though the above plot looks like a Time Series plot, notice that the X-Axis is not time. In
order to make the X-Axis as a Time Series, we need to pass the date range manually through a
command in Pandas.
Sparkling
The data set does not appear to have much trend.
Seasonality is visible and appears to be yearly.
The data set consists of 187 data points.
Rose
The data set shows a decreasing trend.
Seasonality is visible and appears to be yearly.
The data set consists of 187 data points.
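The step of passing the date range manually can be sketched as follows. This is a minimal illustration, not the notebook's code: the frame below is a stand-in for the result of read_csv, and the column name and values are assumptions. Only the 187 monthly points starting in January 1980 come from the report.

```python
import numpy as np
import pandas as pd

# Stand-in for the frame returned by pd.read_csv("Sparkling.csv");
# the column name and the values are illustrative, not the real data.
df = pd.DataFrame({"Sparkling": np.arange(187, dtype=float)})

# Build 187 monthly stamps starting in January 1980.
# Month starts are generated first and then rolled forward to month end.
month_starts = pd.date_range(start="1980-01-01", periods=len(df), freq="MS")
df["Time_Stamp"] = month_starts + pd.offsets.MonthEnd(0)

# Use the stamps as the index so that plots have time on the X-axis.
df = df.set_index("Time_Stamp")
print(df.index[0].date(), df.index[-1].date())
```

With the index in place, df.plot() draws the series against time instead of row numbers.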
2. Perform appropriate Exploratory Data Analysis to understand the data and also perform
decomposition.
Figure 2
Yearly and Monthly Box Plots for Sparkling and Rose:
Figure 3
Line Plot
Figure 4
Inferences
The basic measures of descriptive statistics tell us how the Sales have varied across years. But
remember, for this measure of descriptive statistics we have averaged over the whole data without
taking the time component into account.
Sparkling:
The data set consists of 187 data points with no duplicates.
As observed in the time series plot, the box plot shows no trend.
Outliers are present in every year except 1995.
From January to July there is no spike in sales. From July onwards, however, sales increase.
December has the highest sales in a year.
The line plot also indicates that sales in December are higher than in the rest of the year.
Rose:
The data set consists of 187 data points with no duplicates and 2 null values, for the 7th
and 8th months of 1994.
The missing values are imputed using interpolation.
As observed in the time series plot, a downward trend is noticeable here as well.
Outliers are present for the months of June, July, August, September and December.
The yearly box plots also show that sales have decreased towards the last few years.
December has the highest sales in a year.
The year 1981 appears to have the highest sales and the year 1994 the lowest.
The line plot also indicates that sales in December are higher than in the rest of the year.
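The imputation of the two missing Rose values can be illustrated with a small sketch. The numbers below are made up for the example and linear interpolation is an assumption, since the report does not state which interpolation method was used.

```python
import numpy as np
import pandas as pd

# Six consecutive months around the gap; the values are synthetic.
idx = pd.date_range("1994-05-01", periods=6, freq="MS") + pd.offsets.MonthEnd(0)
rose = pd.Series([40.0, 50.0, np.nan, np.nan, 80.0, 90.0], index=idx, name="Rose")

print(rose.isna().sum())  # number of missing values before imputation

# Linear interpolation fills each gap on a straight line between its neighbours.
rose_filled = rose.interpolate(method="linear")
print(rose_filled)
```

Here the two gaps are filled with 60 and 70, the evenly spaced values between 50 and 80.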
TimeSeries
Figure 5
Cumulative
Figure 6
Average Sales
Figure 7
Additive
Figure 8
Multiplicative
Figure 9
Inferences
The time series plot shows the behavior of sales of the different wines (Sparkling/Rose) across
the various months. The red line is the median value.
The cumulative graph tells us what percentage of data points corresponds to what level of sales.
The average and percentage graph tells us the average sales (Sparkling/Rose) and the percentage
change in sales (Sparkling/Rose) with respect to time.
Sparkling
The median value is constant from the beginning of the year and starts increasing from
July until December.
Average sale values do not show any trend.
We see that the residuals are located around 0 from the plot of the residuals in the
decomposition.
For the multiplicative series, we see that a lot of residuals are located around 1.
Rose
The Median value has been increasing from the beginning of the year. There is a gradual
increase throughout the year.
Average sale values show a downward trend.
We see that the residuals are located around 0 from the plot of the residuals in the
decomposition.
For the multiplicative series, we see that a lot of residuals are located around 1.
3. Split the data into training and test. The test data should start in 1991.
Split the data into train and test and plot the training and test data
Rose
                  Training               Testing
                  Timestamp     Rose     Timestamp     Rose
First few rows    31-01-1980    112      31-01-1991    54
                  29-02-1980    118      28-02-1991    55
                  31-03-1980    129      31-03-1991    66
                  30-04-1980    99       30-04-1991    65
                  31-05-1980    116      31-05-1991    60
Last few rows     31-08-1990    70       31-03-1995    45
                  30-09-1990    83       30-04-1995    52
                  31-10-1990    65       31-05-1995    28
                  30-11-1990    110      30-06-1995    40
                  31-12-1990    132      31-07-1995    62

Sparkling
                  Training                   Testing
                  Time_Stamp    Sparkling    Time_Stamp    Sparkling
First few rows    31-01-1980    1686         31-01-1991    1902
                  29-02-1980    1591         28-02-1991    2049
                  31-03-1980    2304         31-03-1991    1874
                  30-04-1980    1712         30-04-1991    1279
                  31-05-1980    1471         31-05-1991    1432
Last few rows     31-08-1990    1605         31-03-1995    1897
                  30-09-1990    2424         30-04-1995    1862
                  31-10-1990    3116         31-05-1995    1670
                  30-11-1990    4286         30-06-1995    1688
                  31-12-1990    6047         31-07-1995    2031
Table 1
Figure 10
Inferences
Sparkling/Rose
The training data runs until the end of 1990 and the test data starts in 1991.
The training data set has 132 data points, whereas the testing data set has 55 data points.
It is difficult to predict future observations if such an instance has not happened in the
past. With this train-test split we assume that the test period behaves like the past
years.
The time series graph shows the train-test split and its demarcation.
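The split by year can be sketched as below. The frame is synthetic (the column name Sales and its values are assumptions); only the date range and the 1991 cut-off come from the report.

```python
import numpy as np
import pandas as pd

# Synthetic monthly frame covering January 1980 to July 1995 (187 points).
idx = pd.date_range("1980-01-01", periods=187, freq="MS") + pd.offsets.MonthEnd(0)
df = pd.DataFrame({"Sales": np.arange(187, dtype=float)}, index=idx)

# Everything before 1991 is training data; 1991 onwards is test data.
train = df[df.index.year < 1991]
test = df[df.index.year >= 1991]

print(len(train), len(test))
print(train.index[-1].date(), test.index[0].date())
```

The row counts reproduce the 132 training and 55 testing points reported above.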
4. Build various exponential smoothing models on the training data and evaluate the model using
RMSE on the test data. Other models such as regression, naïve forecast models, simple average
models etc. should also be built on the training data and check the performance on the test data
using RMSE.
Rose
                  Training                       Testing
                  Timestamp     Rose    Time     Timestamp     Rose    Time
First few rows    31-01-1980    112     1        31-01-1991    54      43
                  29-02-1980    118     2        28-02-1991    55      44
                  31-03-1980    129     3        31-03-1991    66      45
                  30-04-1980    99      4        30-04-1991    65      46
                  31-05-1980    116     5        31-05-1991    60      47
Last few rows     31-08-1990    70      128      31-03-1995    45      93
                  30-09-1990    83      129      30-04-1995    52      94
                  31-10-1990    65      130      31-05-1995    28      95
                  30-11-1990    110     131      30-06-1995    40      96
                  31-12-1990    132     132      31-07-1995    62      97
Figure 11
Inferences
Sparkling
For this linear regression, we regress the 'Sparkling' sales variable against the order of
occurrence. For this we need to modify our training data before fitting the linear
regression.
We generated the numerical time instance order for both the training and test sets and
added these values to the training and test sets.
Now that the training and test data have been modified, we use Linear Regression to build
the model on the training data and test the model on the test data.
We then ran the linear regression model and the RMSE is 1275.867
Rose
For this linear regression, we regress the 'Rose' sales variable against the order of
occurrence. For this we need to modify our training data before fitting the linear
regression.
We generated the numerical time instance order for both the training and test sets and
added these values to the training and test sets.
Now that the training and test data have been modified, we use Linear Regression to build
the model on the training data and test the model on the test data.
We then ran the linear regression model and the RMSE is 51.433
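Regressing sales on the order of occurrence can be sketched as follows. This is an illustration under assumptions: the data are synthetic, NumPy's least-squares fit stands in for whichever regression library the notebook used, and the time counter simply continues from the training set into the test set. The rmse helper is a hypothetical name.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Synthetic series with a downward trend; not the wine data.
rng = np.random.default_rng(0)
n_train, n_test = 132, 55
time_all = np.arange(1, n_train + n_test + 1)  # 1, 2, 3, ... order of occurrence
sales = 200.0 - 0.8 * time_all + rng.normal(0, 5, size=time_all.size)

train_time, test_time = time_all[:n_train], time_all[n_train:]
train_y, test_y = sales[:n_train], sales[n_train:]

# Fit sales = intercept + slope * time on the training data only.
slope, intercept = np.polyfit(train_time, train_y, deg=1)

# Extrapolate the fitted line over the test period and score it.
test_pred = intercept + slope * test_time
print(round(rmse(test_y, test_pred), 3))
```

Because the model is a straight line in time, it can capture a trend but none of the yearly seasonality.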
Figure 12
Model Evaluation for Naive Forecast
Inferences
Sparkling
For this naive model, the prediction for tomorrow is the same as today's value. The
prediction for the day after tomorrow equals tomorrow's prediction, which is again today's
value, so every future prediction equals today's value.
We ran the model on the Sparkling data set. The green line shows the naive forecast plotted
over the test data set, and it is a straight line.
Test RMSE for the naive model is 3864.279
Rose
For this naive model, the prediction for tomorrow is the same as today's value. The
prediction for the day after tomorrow equals tomorrow's prediction, which is again today's
value, so every future prediction equals today's value.
We ran the model on the Rose data set. The green line shows the naive forecast plotted
over the test data set, and it is a straight line.
Test RMSE for the naive model is 79.719
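A minimal sketch of the naive forecast, using a few made-up numbers rather than the wine data: every test point is predicted with the last value of the training set.

```python
import numpy as np

train = np.array([10.0, 12.0, 14.0])  # synthetic training values
test = np.array([13.0, 15.0, 17.0])   # synthetic test values

# The naive forecast repeats the last training observation for every test point.
naive_pred = np.full(test.shape, train[-1])

naive_rmse = float(np.sqrt(np.mean((test - naive_pred) ** 2)))
print(naive_pred, naive_rmse)
```

The forecast is constant, which is why it plots as a straight horizontal line.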
Figure 13
Model Evaluation for Simple average
Inferences
Sparkling
For the simple average method, we forecast using the average of the training sales values.
We ran the model on the Sparkling data set. The green line shows the simple average
forecast, which plots as a straight line.
Test RMSE for the simple average model is 1275.082
Rose
For the simple average method, we forecast using the average of the training sales values.
We ran the model on the Rose data set. The green line shows the simple average
forecast, which plots as a straight line.
Test RMSE for the simple average model is 53.461
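The simple average forecast can be sketched in the same way, again with made-up numbers: the mean of the training values is used as the prediction for every test point.

```python
import numpy as np

train = np.array([10.0, 12.0, 14.0])  # synthetic training values
test = np.array([13.0, 15.0, 17.0])   # synthetic test values

# The forecast for every test point is the mean of the training data.
mean_pred = np.full(test.shape, train.mean())

mean_rmse = float(np.sqrt(np.mean((test - mean_pred) ** 2)))
print(mean_pred, mean_rmse)
```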
Figure 14
Model Evaluation for moving average on test data
Sparkling
Rose:
Inferences
For the moving average model, we calculate rolling means (or moving averages) for
different intervals. The best interval is the one with the maximum accuracy (or the
minimum error).
We compute the moving averages over the entire data set. The window of the moving average
needs to be selected carefully, as too big a window will result in not having any test set,
since the whole series might get averaged over.
Sparkling
We ran the moving average model on the Sparkling sales data set. The intervals are
pre-defined as 2, 4, 6 and 9.
The test RMSE values for the defined intervals are given below
o Interval 2 – 813.401
o Interval 4 – 1156.590
o Interval 6 – 1283.927
o Interval 9 – 1346.278
The lowest score for the moving average is at interval 2
Rose
We ran the moving average model on the Rose sales data set. The intervals are
pre-defined as 2, 4, 6 and 9.
The test RMSE values for the defined intervals are given below
o Interval 2 – 11.529
o Interval 4 – 14.451
o Interval 6 – 14.566
o Interval 9 – 14.728
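The comparison of intervals can be sketched as below, assuming trailing rolling means computed over the whole series and scored on the test period only. The series is synthetic. Note that a trailing mean at time t includes the observation at t itself, so these scores describe a smoothed fit rather than a true out-of-sample forecast.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series with yearly seasonality; not the wine data.
idx = pd.date_range("1980-01-01", periods=187, freq="MS") + pd.offsets.MonthEnd(0)
rng = np.random.default_rng(1)
values = 100 + 20 * np.sin(2 * np.pi * np.arange(187) / 12) + rng.normal(0, 3, 187)
sales = pd.Series(values, index=idx)

test_mask = sales.index.year >= 1991  # test period starts in 1991

scores = {}
for window in (2, 4, 6, 9):
    # Trailing mean over the last `window` observations, including the current one.
    trailing = sales.rolling(window).mean()
    err = sales[test_mask] - trailing[test_mask]
    scores[window] = float(np.sqrt(np.mean(err ** 2)))

best_window = min(scores, key=scores.get)
print(scores, best_window)
```

Shorter windows track the series more closely, which is consistent with interval 2 giving the lowest error in the results above.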
Model Evaluation for α = 0.995: Simple Exponential Smoothing
Sparkling
Rose:
Inferences
Sparkling
We ran the simple exponential smoothing forecasting model with the alpha value
α = 0.995
For α = 0.995, the Simple Exponential Smoothing forecast on the test data has an RMSE of
1316.035
Rose
We ran the simple exponential smoothing forecasting model with the alpha value
α = 0.995
For α = 0.995, the Simple Exponential Smoothing forecast on the test data has an RMSE of
36.796
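Simple exponential smoothing can be written out by hand to show what α does. This is a sketch under assumptions: the level is initialised with the first observation, which may differ from the initialisation used by the library in the notebook, and ses_forecast is a hypothetical helper name.

```python
import numpy as np

def ses_forecast(train, alpha, horizon):
    """Flat forecast from simple exponential smoothing.

    The level is initialised with the first observation (an assumption)
    and updated as level = alpha * y + (1 - alpha) * level.
    """
    level = float(train[0])
    for y in train[1:]:
        level = alpha * float(y) + (1 - alpha) * level
    # Every future step is predicted with the final level.
    return np.full(horizon, level)

train = np.array([10.0, 20.0, 30.0])  # synthetic values
print(ses_forecast(train, alpha=0.5, horizon=2))
print(ses_forecast(train, alpha=0.995, horizon=2))
```

With α close to 1 the final level is almost the last observation, so the forecast behaves very much like the naive forecast.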
Figure 17
Model Evaluation:
Sparkling Rose
Inferences
Sparkling
We ran the forecasting model with different alpha values from 0.3 to 0.9
The test RMSE scores for the Sparkling data set are given above.
The lowest RMSE is identified for the alpha value 0.3, i.e., 1935.507
Rose
We ran the forecasting model with different alpha values from 0.3 to 0.9
The test RMSE scores for the Rose data set are given above.
The lowest RMSE is identified for the alpha value 0.3, i.e., 47.504
Figure 18
Model Evaluation:
Sparkling Rose
Inferences
Sparkling
We ran the Double Exponential Smoothing model with different smoothing parameters alpha
and beta, with values from 0.3 to 0.9
The lowest test RMSE scores for the Sparkling data set are given above.
The lowest RMSE is identified for α = 0.3 and β = 0.3, i.e., 18259.110
Rose
We ran the Double Exponential Smoothing model with different smoothing parameters alpha
and beta, with values from 0.3 to 0.9
The lowest test RMSE scores for the Rose data set are given above.
The lowest RMSE is identified for α = 0.3 and β = 0.3, i.e., 265.567
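The grid search over α and β can be sketched with a hand-written version of double exponential smoothing (Holt's linear trend method). The initialisation (first value as level, first difference as trend), the synthetic data and the helper name holt_forecast are assumptions; the library used in the notebook may initialise differently and give different scores.

```python
import itertools
import numpy as np

def holt_forecast(train, alpha, beta, horizon):
    """Holt's linear trend forecast with a simple initialisation (an assumption)."""
    level = float(train[0])
    trend = float(train[1] - train[0])
    for y in train[1:]:
        prev_level = level
        level = alpha * float(y) + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    # The h-step-ahead forecast extends the last level along the last trend.
    return level + trend * np.arange(1, horizon + 1)

# Synthetic trending series split into train and test.
rng = np.random.default_rng(3)
series = 50 + 0.5 * np.arange(187) + rng.normal(0, 4, 187)
train, test = series[:132], series[132:]

grid = [i / 10 for i in range(3, 10)]  # 0.3, 0.4, ..., 0.9
scores = {}
for alpha, beta in itertools.product(grid, grid):
    pred = holt_forecast(train, alpha, beta, len(test))
    scores[(alpha, beta)] = float(np.sqrt(np.mean((test - pred) ** 2)))

best_params = min(scores, key=scores.get)
print(best_params, round(scores[best_params], 3))
```

Because the last estimated trend is extrapolated over the whole test period, a trend that is slightly off at the end of the training data can produce very large errors far into the horizon.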
Figure 19
Model Evaluation:
Sparkling
Rose
Inferences
Sparkling
We ran the Triple Exponential Smoothing model with the auto-fitted smoothing parameters
alpha = 0.111, beta = 0.0616 and gamma = 0.3948
The test RMSE for these values of α, β and γ is 469.432
Rose
We ran the Triple Exponential Smoothing model with the auto-fitted smoothing parameters
alpha = 0.064, beta = 0.053 and gamma = 0.0
The test RMSE for these values of α, β and γ is 21.137
Method 9: Triple Exponential Smoothing (Holt-Winters Model) with different tuning
parameters
Figure 20
Model Evaluation:
Sparkling Rose
Inferences
Sparkling
We ran the Triple Exponential Smoothing model with different smoothing parameters alpha,
beta and gamma, with values from 0.3 to 0.9
We see that the best model is the Triple Exponential Smoothing with multiplicative
seasonality with the parameters α = 0.3, β = 0.3 and γ = 0.3
The test RMSE for these values of α, β and γ is 392.786
The auto-fit model parameters are given in the Jupyter notebook.
Rose
We ran the Triple Exponential Smoothing model with different smoothing parameters alpha,
beta and gamma, with values from 0.3 to 0.9
We see that the best model is the Triple Exponential Smoothing with multiplicative
seasonality with the parameters α = 0.3, β = 0.4 and γ = 0.3
The test RMSE for these values of α, β and γ is 10.945
The auto-fit model parameters are given in the Jupyter notebook.
5. Check for the stationarity of the data on which the model is being built on using appropriate
statistical tests and also mention the hypothesis for the statistical test. If the data is found to be
non-stationary, take appropriate steps to make it stationary. Check the new data for stationarity
and comment.
Note: Stationarity should be checked at alpha = 0.05.
Dickey-Fuller Test:
o H0: Series is NOT Stationary
o H1: Series is Stationary
Sparkling
Figure 21
Rose
Figure 22
Inferences
Sparkling
We ran the stationarity test on the data set at alpha = 0.05 and the results are shown in the
picture above. The p-value is 0.601, which is greater than α = 0.05. Hence,
we fail to reject the null hypothesis.
We see that at the 5% significance level the time series is non-stationary.
We then take a difference of order 1 and check whether the time series is stationary.
For the differenced series the p-value is 0.00, which is below α = 0.05. Hence, we reject the null hypothesis.
We see that at α = 0.05 the differenced time series is indeed stationary.
Rose
We ran the stationarity test on the data set at alpha = 0.05 and the results are shown in the
picture above. The p-value is 0.343, which is greater than α = 0.05. Hence,
we fail to reject the null hypothesis.
We see that at the 5% significance level the time series is non-stationary.
We then take a difference of order 1 and check whether the time series is stationary.
For the differenced series the p-value is 1.810895e-12, which is below α = 0.05. Hence, we
reject the null hypothesis.
We see that at α = 0.05 the differenced time series is indeed stationary.
6. Build an automated version of the ARIMA/SARIMA model in which the parameters are
selected using the lowest Akaike Information Criteria (AIC) on the training data and evaluate
this model on the test data using RMSE.
Sparkling
Figure 23
Rose
Figure 24
Inferences
Sparkling
From the above plots, we can say that there seems to be a seasonality in the data.
Rose
From the above plots, we can say that there seems to be a seasonality in the data
Automated ARIMA
Build an Automated version of an ARIMA model for which the best parameters are selected
in accordance with the lowest Akaike Information Criteria (AIC)
Sparkling Rose
Inference
Sparkling:
We ran the automated model over the combinations of the parameters p and q in the
range 0 to 2. We kept the value of d at 1, as the series needs to be differenced once
to make it stationary.
We sorted the AIC values in ascending order to get the parameters with the minimum AIC value.
We built the ARIMA model with the lowest AIC value, and its test RMSE
is 1375.095
Rose:
We ran the automated model over the combinations of the parameters p and q in the
range 0 to 2. We kept the value of d at 1, as the series needs to be differenced once
to make it stationary.
We sorted the AIC values in ascending order to get the parameters with the minimum AIC value.
We built the ARIMA model with the lowest AIC value, and its test RMSE
is 15.6196.
Automated SARIMA
Build an automated version of a SARIMA model for which the best parameters are selected
in accordance with the lowest Akaike Information Criteria (AIC)
Figure 25
Inference:
Sparkling:
We ran the auto SARIMA model with a seasonality of 12, for which the best parameters are
selected in accordance with the lowest AIC.
We sorted the AIC values in ascending order to get the parameters with the minimum AIC value.
We built the SARIMA model with the lowest AIC value, and its test RMSE
is 629.373
Rose:
We ran the auto SARIMA model with a seasonality of 12, for which the best parameters are
selected in accordance with the lowest AIC.
We sorted the AIC values in ascending order to get the parameters with the minimum AIC value.
We built the SARIMA model with the lowest AIC value, and its test RMSE
is 31.479
Figure 26
Inference:
Sparkling/Rose:
7. Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the training
data and evaluate this model on the test data using RMSE.
Sparkling ARIMA
Figure 27
Inference
Sparkling
The ACF plot summarizes the correlation of an observation with its lag values. The x-axis shows
the lag and the y-axis shows the correlation coefficient, between -1 and 1 for negative and
positive correlation.
The PACF plot summarizes the correlation of an observation with its lag values that is not
accounted for by prior lagged observations.
RMSE for the ARIMA model is 4779.154
Seasonality is present in the data set, hence we need to look into the SARIMA model
for better results.
Sparkling SARIMA
Figure 28
Inference
Sparkling
SARIMA is an extension of ARIMA that supports direct modelling of the seasonal component
of the series.
We then built a manual SARIMA model for Sparkling sales based on the ACF and PACF plots.
RMSE for the SARIMA model is 629.373
The pictures above report two tests on the residuals: the Jarque-Bera (JB) test and the
Ljung-Box test.
The null hypothesis of the JB test is that the residuals are normally distributed.
The p-value is low, hence we reject the null hypothesis: the residuals are not normally
distributed.
The null hypothesis of the Ljung-Box test is that the residuals are independent. Since the
p-value is high, we conclude that they are independent.
Heteroskedasticity means that the variance of the residuals is not constant. In this
case the p-value is high, hence the residuals are not heteroskedastic.
ROSE ARIMA
Inference
Rose
The ACF plot summarizes the correlation of an observation with its lag values. The x-axis shows
the lag and the y-axis shows the correlation coefficient, between -1 and 1 for negative and
positive correlation.
The PACF plot summarizes the correlation of an observation with its lag values that is not
accounted for by prior lagged observations.
Based on the given plots, we chose the AR parameter p = 5, the moving average parameter
q = 2 and d = 1.
RMSE for the ARIMA model is 15.43
Seasonality is present in the data set, hence we need to look into the SARIMA model
for better results.
ROSE SARIMA:
Figure 29
Inference
Rose
SARIMA is an extension of ARIMA that supports direct modelling of the seasonal component
of the series.
We then built a manual SARIMA model for Rose sales based on the ACF and PACF plots.
RMSE for the SARIMA model is 37.87
The pictures above report two tests on the residuals: the Jarque-Bera (JB) test and the
Ljung-Box test.
The null hypothesis of the JB test is that the residuals are normally distributed.
The p-value is high, hence we fail to reject the null hypothesis: the residuals are
consistent with a normal distribution.
The null hypothesis of the Ljung-Box test is that the residuals are independent. Since the
p-value is low, we conclude that they are not independent.
Heteroskedasticity means that the variance of the residuals is not constant. In this
case the p-value is high, hence the residuals are not heteroskedastic.
8. Build a table with all the models built along with their corresponding parameters and the
respective RMSE values on the test data.
Sparkling Rose
Inference:
Sparkling/Rose:
After running the various forecasting models, the test RMSE values sorted from lowest to
highest are given in the pictures above.
The lowest RMSE for both data sets is obtained with Triple Exponential Smoothing tuned
over different smoothing levels.
Best test RMSE for the Sparkling data set is 392.786
Best test RMSE for the Rose data set is 10.945
9. Based on the model-building exercise, build the most optimum model(s) on the complete data
and predict 12 months into the future with appropriate confidence intervals/bands.
Sparkling Rose
Inference:
Sparkling/Rose:
Using the full data set, we take our best model and forecast 12 months into the future
with 95% confidence intervals to see how the predictions look. The model has to be built
on the full data for this.
RMSE on the full Sparkling data set is 539.929
RMSE on the full Rose data set is 28.884
10. Comment on the model thus built and report your findings and suggest the measures that the
company should be taking for future sales.
Sparkling
Inference
For the given Sparkling data set there is not much trend compared to previous years.
December has the highest sales in a year.
The model was built based on the trend and seasonality. The future predictions are
in line with the previous years.
Recommendations
Sparkling wine sales are seasonal.
The company should plan ahead and keep enough stock from September until December to
capitalize on the demand.
In order to increase sales, the company should plan some promotional offers from January
until June so that there are steady sales throughout the year.
Rose
Inference
Recommendations
Rose wine sales are seasonal.
Rose wines sell strongly in March, in August, and from October until
December.
The company should plan ahead and keep enough stock for March, August, and October until
December to capitalize on the demand.
In order to increase sales, the company should plan some promotional offers during the
low-sale period.