
Project – Time Series

Neha Arcot

Table of contents

1. Read the data as an appropriate Time Series data and plot the data.
2. Perform appropriate Exploratory Data Analysis to understand the data and also perform
decomposition.
3. Split the data into training and test. The test data should start in 1991.
4. Build various exponential smoothing models on the training data and evaluate the model using
RMSE on the test data. Other models such as regression, naive forecast models and simple average
models should also be built on the training data and checked for performance on the test data
using RMSE.
5. Check for the stationarity of the data on which the model is being built using appropriate
statistical tests and also mention the hypothesis for the statistical test. If the data is found to be
non-stationary, take appropriate steps to make it stationary. Check the new data for stationarity
and comment. Note: Stationarity should be checked at alpha = 0.05.
6. Build an automated version of the ARIMA/SARIMA model in which the parameters are selected
using the lowest Akaike Information Criteria (AIC) on the training data and evaluate this model on
the test data using RMSE.
7. Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the training data
and evaluate this model on the test data using RMSE.
8. Build a table (create a data frame) with all the models built along with their corresponding
parameters and the respective RMSE values on the test data.
9. Based on the model-building exercise, build the most optimum model(s) on the complete data
and predict 12 months into the future with appropriate confidence intervals/bands.
10. Comment on the model thus built and report your findings and suggest the measures that the
company should be taking for future sales.

Problem:
For this assignment, sales data for different types of wine in the 20th century are to be analyzed.
Both data sets are from the same company but cover different wines. As an analyst at ABC Estate
Wines, you are tasked with analyzing and forecasting wine sales in the 20th century.
Data set for the Problem: Sparkling.csv and Rose.csv

1. Read the data as an appropriate Time Series data and plot the data.

 Load both data sets using the pandas read_csv command and parse the data.

Figure 1
Inferences:

Though the above plot looks like a Time Series plot, notice that the X-axis is not time. To turn
the X-axis into a time axis, we need to pass the date range manually through a command in
pandas.
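A minimal sketch of attaching the date range by hand is shown below. It assumes the sales column is called Sparkling and that the series starts in January 1980 with month-end stamps, as the rows in Table 1 suggest. The helper name attach_month_end_index is hypothetical, and a small in-memory frame stands in for the CSV file.

```python
import pandas as pd

def attach_month_end_index(df: pd.DataFrame, start: str = "1980-01-01") -> pd.DataFrame:
    """Return a copy of df indexed by consecutive month-end timestamps."""
    # Month starts are generated first and then rolled forward to month end,
    # which avoids relying on the month-end frequency alias (it differs
    # between pandas versions).
    stamps = pd.date_range(start=start, periods=len(df), freq="MS") + pd.offsets.MonthEnd(0)
    out = df.copy()
    out.index = pd.DatetimeIndex(stamps, name="Time_Stamp")
    return out

# Stand-in for pd.read_csv("Sparkling.csv"); the values are the first five
# Sparkling training rows listed in Table 1.
raw = pd.DataFrame({"Sparkling": [1686, 1591, 2304, 1712, 1471]})
sparkling = attach_month_end_index(raw)
print(sparkling)
```

With the index in place, plotting the frame puts time on the X-axis.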

Sparkling
 The data set does not appear to have much trend.
 We can see seasonality, and it appears to be yearly.
 The data set consists of 187 data points.
Rose
 We can see a decreasing trend in the data set.
 We can see seasonality, and it appears to be yearly.
 The data set consists of 187 data points.

2. Perform appropriate Exploratory Data Analysis to understand the data and also perform
decomposition.

Figure 2
Yearly and monthly box plots for Sparkling and Rose:

Figure 3
Line Plot

Figure 4

Inferences

The basic measures of descriptive statistics tell us how the Sales have varied across years. But
remember, for this measure of descriptive statistics we have averaged over the whole data without
taking the time component into account.

Sparkling:
 The data set consists of 187 data points with no duplicates.
 As observed in the time series plot, the box plot shows no trend.
 Outliers are present in the data set, except for the year 1995.
 From January till July there is no spike in sales. However, from July onwards we can see an
increase in sales.
 December has the highest sales in a year.
 The line plot also indicates that sales in December are higher than in the rest of the year.

Rose:
 The data set consists of 187 data points with no duplicates and 2 null values, for the 7th and
8th months of 1994.
 The missing values are imputed using interpolation.
 As observed in the time series plot, we can notice a downward trend here as well.
 Outliers are present in the data set for the months of June, July, August, September and
December.
 The yearly box plots also show that sales have decreased towards the last few years.
 December has the highest sales in a year.
 1981 appears to have the highest sales and 1994 the lowest.
 The line plot also indicates that sales in December are higher than in the rest of the year.
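The imputation step for Rose can be sketched as follows. The numbers are illustrative only (they are not the actual Rose figures), and linear interpolation is an assumption, since the report only says that interpolation was used.

```python
import numpy as np
import pandas as pd

# Illustrative values only; the two NaNs stand in for July and August 1994.
idx = pd.date_range("1994-05-01", periods=5, freq="MS") + pd.offsets.MonthEnd(0)
rose = pd.Series([44.0, 45.0, np.nan, np.nan, 48.0], index=idx, name="Rose")

# Linear interpolation places each missing point on the straight line
# between the nearest observed neighbours (assumed method).
filled = rose.interpolate(method="linear")
print(filled)
```

Other interpolation methods would give slightly different fills; with only two consecutive missing months the choice should matter little.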

Time Series, Cumulative Distribution, Decomposed Time Series

TimeSeries

Figure 5
Cumulative

Figure 6

Average Sales

Figure 7
Additive

Figure 8
Multiplicative

Figure 9
Inferences

 The Time Series plot shows the behavior of sales of the different wines (Sparkling/Rose) across
the months. The red line is the median value.
 The Cumulative graph tells us what percentage of data points correspond to what level of sales.
 The Average and Percentage graph shows the average sales (Sparkling/Rose) and the percentage
change in sales (Sparkling/Rose) with respect to time.
Sparkling

 The median value stays roughly constant from the beginning of the year and starts increasing
from July till December.
 Average sales values do not show any trend.
 We see that the residuals are located around 0 from the plot of the residuals in the
decomposition.
 For the multiplicative series, we see that a lot of residuals are located around 1.

Rose
 The Median value has been increasing from the beginning of the year. There is a gradual
increase throughout the year.
 Average sale values show a downward trend.
 We see that the residuals are located around 0 from the plot of the residuals in the
decomposition.
 For the multiplicative series, we see that a lot of residuals are located around 1.

3. Split the data into training and test. The test data should start in 1991.

Split the data into train and test and plot the training and test data

Training and testing data, first/last few rows

Rose
                 Training              Testing
                 Timestamp     Rose    Timestamp     Rose
First few rows   31-01-1980     112    31-01-1991      54
                 29-02-1980     118    28-02-1991      55
                 31-03-1980     129    31-03-1991      66
                 30-04-1980      99    30-04-1991      65
                 31-05-1980     116    31-05-1991      60
Last few rows    31-08-1990      70    31-03-1995      45
                 30-09-1990      83    30-04-1995      52
                 31-10-1990      65    31-05-1995      28
                 30-11-1990     110    30-06-1995      40
                 31-12-1990     132    31-07-1995      62

Sparkling
                 Training                   Testing
                 Time_Stamp   Sparkling     Time_Stamp   Sparkling
First few rows   31-01-1980        1686     31-01-1991        1902
                 29-02-1980        1591     28-02-1991        2049
                 31-03-1980        2304     31-03-1991        1874
                 30-04-1980        1712     30-04-1991        1279
                 31-05-1980        1471     31-05-1991        1432
Last few rows    31-08-1990        1605     31-03-1995        1897
                 30-09-1990        2424     30-04-1995        1862
                 31-10-1990        3116     31-05-1995        1670
                 30-11-1990        4286     30-06-1995        1688
                 31-12-1990        6047     31-07-1995        2031

Table 1
Figure 10

Inferences
Sparkling/Rose
 The data has been split so that the training set runs through 1990 and the test set starts in
1991.
 The training data set has 132 data points, whereas the testing data set has 55 data points.
 It is difficult to predict future observations if such an instance has not happened in the
past. With this train-test split, we are predicting behaviour similar to that of past years.
 The time series graph shows the train/test split and its demarcation.
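A date-based split that reproduces the 132/55 counts can be sketched as below. The series holds dummy values on the same 187-month index as the data; only the index matters for the split.

```python
import numpy as np
import pandas as pd

# Placeholder series with the same shape as the data: 187 month-end points
# starting in January 1980. The values are dummies, not real sales.
idx = pd.date_range("1980-01-01", periods=187, freq="MS") + pd.offsets.MonthEnd(0)
sales = pd.Series(np.arange(187, dtype=float), index=idx, name="Sparkling")

# Label-based slicing with partial date strings keeps every row up to the
# end of 1990 in train and every row from the start of 1991 in test.
train = sales.loc[:"1990"]
test = sales.loc["1991":]
print(len(train), len(test))  # 132 55
```

Splitting by date rather than by row position keeps the cut at the 1990/1991 boundary even if rows are added to or removed from the file.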

4. Build various exponential smoothing models on the training data and evaluate the model using
RMSE on the test data. Other models such as regression, naïve forecast models, simple average
models etc. should also be built on the training data and check the performance on the test data
using RMSE.

Building different models and comparing the accuracy metrics

Model 1: Linear Regression

Sparkling
                 Training                           Testing
                 Time_Stamp   Sparkling   Time      Time_Stamp   Sparkling   Time
First few rows   31-01-1980        1686      1      31-01-1991        1902     43
                 29-02-1980        1591      2      28-02-1991        2049     44
                 31-03-1980        2304      3      31-03-1991        1874     45
                 30-04-1980        1712      4      30-04-1991        1279     46
                 31-05-1980        1471      5      31-05-1991        1432     47
Last few rows    31-08-1990        1605    128      31-03-1995        1897     93
                 30-09-1990        2424    129      30-04-1995        1862     94
                 31-10-1990        3116    130      31-05-1995        1670     95
                 30-11-1990        4286    131      30-06-1995        1688     96
                 31-12-1990        6047    132      31-07-1995        2031     97

Rose
                 Training                      Testing
                 Timestamp     Rose   Time     Timestamp     Rose   Time
First few rows   31-01-1980     112      1     31-01-1991      54     43
                 29-02-1980     118      2     28-02-1991      55     44
                 31-03-1980     129      3     31-03-1991      66     45
                 30-04-1980      99      4     30-04-1991      65     46
                 31-05-1980     116      5     31-05-1991      60     47
Last few rows    31-08-1990      70    128     31-03-1995      45     93
                 30-09-1990      83    129     30-04-1995      52     94
                 31-10-1990      65    130     31-05-1995      28     95
                 30-11-1990     110    131     30-06-1995      40     96
                 31-12-1990     132    132     31-07-1995      62     97

Figure 11

Model Evaluation for LR:

Inferences

Sparkling

 For this linear regression, we regress the ‘Sparkling’ sales variable against the order of
occurrence. For this we need to modify our training data before fitting a linear regression.
 We generate the numerical time-instance order for both the training and test sets and add
these values to each set.
 With the training and test data modified, we use linear regression to build the model on the
training data and test it on the test data.
 The linear regression model gives a test RMSE of 1275.867.

Rose
 For this linear regression, we regress the ‘Rose’ sales variable against the order of
occurrence. For this we need to modify our training data before fitting a linear regression.
 We generate the numerical time-instance order for both the training and test sets and add
these values to each set.
 With the training and test data modified, we use linear regression to build the model on the
training data and test it on the test data.
 The linear regression model gives a test RMSE of 51.433.
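The regression on time order and the RMSE metric used throughout can be sketched with NumPy as below. This is an illustration on toy numbers, not the report's code. One assumption to note: the sketch continues the test time index where the training index stops (133 onwards for 132 training points), whereas the table in this section lists test time values starting at 43.

```python
import numpy as np

def rmse(actual, predicted) -> float:
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def fit_time_trend(train_values):
    """Least-squares line through (1, y1), (2, y2), ...; returns (slope, intercept)."""
    t = np.arange(1, len(train_values) + 1)
    slope, intercept = np.polyfit(t, np.asarray(train_values, dtype=float), deg=1)
    return float(slope), float(intercept)

# Toy numbers (not the wine data): a perfectly linear series.
train = [10.0, 12.0, 14.0, 16.0]
test = [18.0, 20.0]
slope, intercept = fit_time_trend(train)

# Assumption: the test time index continues after the training index.
t_test = np.arange(len(train) + 1, len(train) + len(test) + 1)
forecast = slope * t_test + intercept
print(slope, intercept, rmse(test, forecast))
```

A straight line cannot follow the yearly seasonality, so on the wine data this model mainly captures the overall level and trend.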

Model 2: Naive Approach

Figure 12
Model Evaluation for Naive

Inferences

Sparkling

 For this naive model, we say that the prediction for tomorrow is the same as today. The
prediction for the day after tomorrow is the same as tomorrow's, and since tomorrow's
prediction equals today's value, the prediction for the day after tomorrow is also today's value.
 We ran the model on the Sparkling data set. The green line shows the naive forecast plotted
over the test data; it is a straight line.
 The test RMSE for the naive model is 3864.279.
Rose
 For this naive model, we say that the prediction for tomorrow is the same as today. The
prediction for the day after tomorrow is the same as tomorrow's, and since tomorrow's
prediction equals today's value, the prediction for the day after tomorrow is also today's value.
 We ran the model on the Rose data set. The green line shows the naive forecast plotted over
the test data; it is a straight line.
 The test RMSE for the naive model is 79.719.
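In code, the naive forecast repeats the last training value over the whole horizon. The sketch below uses the last five Sparkling training values from Table 1; the function name is hypothetical.

```python
def naive_forecast(train, horizon):
    """Repeat the last training observation `horizon` times."""
    if not train:
        raise ValueError("train must contain at least one observation")
    return [train[-1]] * horizon

# Last five Sparkling training values from Table 1 (August to December 1990).
train_tail = [1605, 2424, 3116, 4286, 6047]
print(naive_forecast(train_tail, 3))  # [6047, 6047, 6047]
```

Because the training set ends in December, the flat forecast sits at the December level for every test month, which likely contributes to the large test RMSE of this model.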

Method 3: Simple Average

Figure 13
Model Evaluation for Simple average

Inferences

Sparkling

 For the simple average method, we forecast using the average of the sales values.
 We ran the model on the Sparkling data set. The green line shows the simple average forecast,
which plots as a straight line.
 The test RMSE for the simple average model is 1275.082.

Rose
 For the simple average method, we forecast using the average of the sales values.
 We ran the model on the Rose data set. The green line shows the simple average forecast,
which plots as a straight line.
 The test RMSE for the simple average model is 53.461.

Method 4: Moving Average (MA)

Figure 14
Model Evaluation for moving average on test data

Sparkling

Rose:

Inferences

 For the moving average model, we calculate rolling means (or moving averages) for different
intervals. The best interval is the one with the maximum accuracy (or the minimum error).
 We compute the average over the entire data set. The window of the moving average needs to
be selected carefully, as too big a window will result in not having any test set, since the
whole series might get averaged over.

Sparkling

 We ran the moving average model on the Sparkling sales data set. The intervals are predefined
as 2, 4, 6 and 9.
 The test RMSE values for these intervals are given below:
o Interval 2 – 813.401
o Interval 4 – 1156.590
o Interval 6 – 1283.927
o Interval 9 – 1346.278
 The lowest RMSE among the moving averages is at interval 2.
Rose
 We ran the moving average model on the Rose sales data set. The intervals are predefined as
2, 4, 6 and 9.
 The test RMSE values for these intervals are given below:
o Interval 2 – 11.529
o Interval 4 – 14.451
o Interval 6 – 14.566
o Interval 9 – 14.728
 The lowest RMSE among the moving averages is at interval 2.
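A trailing rolling mean in pandas can be sketched as below, on a toy series. The report does not state whether the rolling mean at a given month includes that month's own value. The sketch shows both variants, because a mean that includes the current observation acts as a smoother rather than a forecast and will tend to score a lower RMSE.

```python
import pandas as pd

# Toy series (not the wine data).
sales = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])

# Trailing mean over the current and the previous point; the first value is
# NaN because a full window is not yet available.
trailing_2 = sales.rolling(window=2).mean()

# Shifting by one step makes the value at position t depend only on
# observations before t, which is what a one-step-ahead forecast needs.
one_step_forecast = trailing_2.shift(1)

print(trailing_2.tolist())
print(one_step_forecast.tolist())
```

Smaller windows track the series more closely, which is consistent with interval 2 having the lowest error in the lists above.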


Method 5: Simple Exponential Smoothing

Model Evaluation for α = 0.995: Simple Exponential Smoothing
Sparkling

Rose:

Inferences

Sparkling
 We ran the simple exponential smoothing forecasting model with α = 0.995.
 For α = 0.995, the simple exponential smoothing forecast on the test data has an RMSE of
1316.035.
Rose
 We ran the simple exponential smoothing forecasting model with α = 0.995.
 For α = 0.995, the simple exponential smoothing forecast on the test data has an RMSE of
36.796.
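The smoothing recursion behind this model is short enough to write out by hand. The sketch below is not the library implementation; initialising the level with the first observation is one common convention and is an assumption here.

```python
def ses_forecast(train, alpha, horizon):
    """Simple exponential smoothing: level_t = alpha*y_t + (1 - alpha)*level_(t-1).

    The level is initialised with the first observation (assumed convention).
    The forecast is flat at the final level.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    level = float(train[0])
    for y in train[1:]:
        level = alpha * y + (1.0 - alpha) * level
    return [level] * horizon

# Last five Sparkling training values from Table 1.
train_tail = [1605, 2424, 3116, 4286, 6047]
print(ses_forecast(train_tail, alpha=0.995, horizon=3))
```

With α close to 1 the level follows the most recent observation very closely, so the forecast is almost the same flat line as the naive model; with α = 1 the two coincide.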

Method 6: Simple Exponential Smoothing Increasing Alpha Values

 Setting different alpha values.


 Remember, the higher the alpha value, the more weight is given to the most recent
observations. That means, what happened recently will happen again.
 We run a loop over different alpha values to see which value works best on the test set.

Figure 17

Model Evaluation:

Sparkling Rose

Inferences

Sparkling
 We ran the forecasting model with alpha values from 0.3 to 0.9.
 The test RMSE scores for the Sparkling data set are given above.
 The lowest RMSE, 1935.507, is obtained for alpha = 0.3.

Rose
 We ran the forecasting model with alpha values from 0.3 to 0.9.
 The test RMSE scores for the Rose data set are given above.
 The lowest RMSE, 47.504, is obtained for alpha = 0.3.

Method 7: Double Exponential Smoothing

 Double exponential smoothing is also called Holt’s model.
 Two parameters, α (level) and β (trend), are estimated in this model.

Figure 18
Model Evaluation:

Sparkling Rose

Inferences

Sparkling
 We ran the double exponential smoothing model with the smoothing parameters alpha and beta
taking values from 0.3 to 0.9.
 The lowest test RMSE scores for the Sparkling data set are given above.
 The lowest RMSE, 18259.110, is obtained for α = 0.3 and β = 0.3.
Rose
 We ran the double exponential smoothing model with the smoothing parameters alpha and beta
taking values from 0.3 to 0.9.
 The lowest test RMSE scores for the Rose data set are given above.
 The lowest RMSE, 265.567, is obtained for α = 0.3 and β = 0.3.
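The grid search over α and β can be sketched with a hand-written version of Holt's recursions. This is an illustration on toy numbers, not the report's code. The initial level and trend (first value and first difference) are assumed conventions, and the grid 0.3 to 0.9 in steps of 0.1 is inferred from the text.

```python
import itertools
import math

def holt_forecast(train, alpha, beta, horizon):
    """Holt's linear trend method with level and trend recursions."""
    level, trend = train[0], train[1] - train[0]  # assumed initial values
    for y in train[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    # The h-step-ahead forecast extends the last level along the last trend.
    return [level + h * trend for h in range(1, horizon + 1)]

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

grid = [round(0.3 + 0.1 * i, 1) for i in range(7)]  # 0.3, 0.4, ..., 0.9

# Toy numbers (not the wine data).
train = [10.0, 12.0, 15.0, 15.0, 18.0, 21.0, 22.0, 24.0]
test = [26.0, 28.0, 30.0]

scores = {
    (a, b): rmse(test, holt_forecast(train, a, b, len(test)))
    for a, b in itertools.product(grid, grid)
}
best = min(scores, key=scores.get)
print(best, scores[best])
```

Holt's method extrapolates a straight line from the last level and trend and has no seasonal term, which is one plausible reason for the large test errors on strongly seasonal data.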

Method 8: Triple Exponential Smoothing (Holt - Winter's Model)

 Triple exponential smoothing is also called the Holt-Winters model.
 Three parameters, α (level), β (trend) and γ (seasonality), are estimated in this model.

Figure 19
Model Evaluation:

Sparkling

Rose

Inferences

Sparkling
 We ran the triple exponential smoothing model with autofit smoothing parameters alpha =
0.111, beta = 0.0616 and gamma = 0.3948.
 The test RMSE for these α, β and γ values is 469.432.

Rose
 We ran the triple exponential smoothing model with autofit smoothing parameters alpha =
0.064, beta = 0.053 and gamma = 0.0.
 The test RMSE for these α, β and γ values is 21.137.

Method 9: Triple Exponential Smoothing (Holt - Winter's Model) with different tuning
parameters

Figure 20

Model Evaluation:

Sparkling Rose

Inferences

Sparkling
 We ran the triple exponential smoothing model with the smoothing parameters alpha, beta and
gamma taking values from 0.3 to 0.9.
 We see that the best model is triple exponential smoothing with multiplicative seasonality,
with the parameters α = 0.3, β = 0.3 and γ = 0.3.
 The test RMSE for these α, β and γ values is 392.786.
 The autofit model parameters are given in the Jupyter notebook.
Rose
 We ran the triple exponential smoothing model with the smoothing parameters alpha, beta and
gamma taking values from 0.3 to 0.9.
 We see that the best model is triple exponential smoothing with multiplicative seasonality,
with the parameters α = 0.3, β = 0.4 and γ = 0.3.
 The test RMSE for these α, β and γ values is 10.945.
 The autofit model parameters are given in the Jupyter notebook.

5. Check for the stationarity of the data on which the model is being built on using appropriate
statistical tests and also mention the hypothesis for the statistical test. If the data is found to be
non-stationary, take appropriate steps to make it stationary. Check the new data for stationarity
and comment.
Note: Stationarity should be checked at alpha = 0.05.

Check for stationarity of the whole Time Series data

 Dickey-Fuller Test:
o H0: Series is NOT Stationary
o H1: Series is Stationary

Sparkling

Figure 21

Rose

Figure 22
Inferences

Sparkling
 We ran the stationarity test on the data set at alpha = 0.05; the results are shown in the
figure above. The p-value is 0.601, which is greater than alpha. Hence, we fail to reject the
null hypothesis.
 At the 5% significance level, the Time Series is non-stationary.
 We take a difference of order 1 and check whether the Time Series is stationary or not.
 After differencing, the p-value is 0.00, which is below α = 0.05. Hence, we reject the null
hypothesis.
 At α = 0.05 the differenced Time Series is indeed stationary.
Rose
 We ran the stationarity test on the data set at alpha = 0.05; the results are shown in the
figure above. The p-value is 0.343, which is greater than alpha. Hence, we fail to reject the
null hypothesis.
 At the 5% significance level, the Time Series is non-stationary.
 We take a difference of order 1 and check whether the Time Series is stationary or not.
After differencing, the p-value is 1.810895e-12, which is below α = 0.05. Hence, we reject the
null hypothesis.
 At α = 0.05 the differenced Time Series is indeed stationary.

6. Build an automated version of the ARIMA/SARIMA model in which the parameters are
selected using the lowest Akaike Information Criteria (AIC) on the training data and evaluate
this model on the test data using RMSE.

Sparkling
Figure 23

Rose

Figure 24

Inferences

Sparkling
 From the above plots, we can say that there seems to be a seasonality in the data.
Rose
 From the above plots, we can say that there seems to be a seasonality in the data

Automated ARIMA

 Build an Automated version of an ARIMA model for which the best parameters are selected
in accordance with the lowest Akaike Information Criteria (AIC)

Sparkling Rose

Inference

Sparkling:
 We run the automated model over combinations of the parameters p and q in the range 0 to 2.
We keep d at 1, as the series needs one difference to become stationary.
 We sort the AIC values in ascending order to get the parameters with the minimum AIC.
 We build the ARIMA model with the lowest AIC; its test RMSE is 1375.095.
Rose:
 We run the automated model over combinations of the parameters p and q in the range 0 to 2.
We keep d at 1, as the series needs one difference to become stationary.
 We sort the AIC values in ascending order to get the parameters with the minimum AIC.
 We build the ARIMA model with the lowest AIC; its test RMSE is 15.6196.

Automated SARIMA

 Build an automated version of a SARIMA model for which the best parameters are selected
in accordance with the lowest Akaike Information Criteria (AIC)

Figure 25

Inference:

Sparkling:
 We run the auto SARIMA model with seasonality 12, with the best parameters selected in
accordance with the lowest AIC.
 We sort the AIC values in ascending order to get the parameters with the minimum AIC.
 We build the SARIMA model with the lowest AIC; its test RMSE is 629.373.
Rose:
 We run the auto SARIMA model with seasonality 12, with the best parameters selected in
accordance with the lowest AIC.
 We sort the AIC values in ascending order to get the parameters with the minimum AIC.
 We build the SARIMA model with the lowest AIC; its test RMSE is 31.479.

Figure 26
Inference:

Sparkling/Rose:

 Looking at the model diagnostic plots, we can say that the residuals show no seasonality.
 The KDE plot of the residuals looks normally distributed.
 In the Normal Q-Q plot, the residuals follow the linear trend of samples taken from a standard
normal distribution.
 The time series residuals have low correlation with lagged versions of themselves.

7. Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the training
data and evaluate this model on the test data using RMSE.

Sparkling ARIMA

Figure 27

Inference

Sparkling
 The ACF plot summarizes the correlation of an observation with lag values. The x-axis shows
the lag and the y-axis shows the correlation coefficient between -1 and 1 for negative and
positive correlation.
 The PACF plot summarizes the correlations for an observation with lag values that is not
accounted for by prior lagged observations.
 RMSE for the ARIMA model is 4779.154
 Seasonality is present in the data set, so we need to look into the SARIMA model for better
results.

Sparkling SARIMA

Figure 28

Inference

Sparkling
 An extension to ARIMA that supports the direct modelling of the seasonal component of the
series is called SARIMA.
 We then built a manual SARIMA model for Sparkling sales based on the ACF and PACF plots.
 The RMSE for the SARIMA model is 629.373.

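Looking at the figures above, two tests have been run: the Jarque-Bera (JB) test and the
Ljung-Box test.
 JB test: the null hypothesis is that the residuals are normally distributed. The p-value is low,
so we reject the null hypothesis; the residuals are not normally distributed.
 Ljung-Box test: the null hypothesis is that the residuals are independent. Since the p-value is
high, we fail to reject it and conclude that the residuals are independent.
 Heteroskedasticity means that the variance of the residuals is not constant. In this case the
p-value is high, so we fail to reject the null hypothesis of constant variance; the residuals are
not heteroskedastic.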

ROSE ARIMA

Inference

Rose
 The ACF plot summarizes the correlation of an observation with lag values. The x-axis shows
the lag and the y-axis shows the correlation coefficient between -1 and 1 for negative and
positive correlation.
 The PACF plot summarizes the correlations for an observation with lag values that is not
accounted for by prior lagged observations.
 We chose the AR parameter p = 5, the moving average parameter q = 2 and d = 1 based on the
given plots.
 The RMSE for the ARIMA model is 15.43.
 Seasonality is present in the data set, so we need to look into the SARIMA model for better
results.

ROSE SARIMA:

Figure 29

Inference

Rose
 An extension to ARIMA that supports the direct modelling of the seasonal component of the
series is called SARIMA.
 We then built a manual SARIMA model for Rose sales based on the ACF and PACF plots.
 The RMSE for the SARIMA model is 37.87.

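Looking at the figures above, two tests have been run: the Jarque-Bera (JB) test and the
Ljung-Box test.
 JB test: the null hypothesis is that the residuals are normally distributed. The p-value is
high, so we fail to reject the null hypothesis; the residuals are consistent with a normal
distribution.
 Ljung-Box test: the null hypothesis is that the residuals are independent. Since the p-value is
low, we reject it and conclude that the residuals are not independent.
 Heteroskedasticity means that the variance of the residuals is not constant. In this case the
p-value is high, so we fail to reject the null hypothesis of constant variance; the residuals are
not heteroskedastic.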
8. Build a table with all the models built along with their corresponding parameters and the
respective RMSE values on the test data.

Sparkling Rose

Inference:

Sparkling/Rose:

 After running the various forecasting models, the test RMSE values sorted from lowest to
highest are given in the figures above.
 The lowest RMSE for both data sets is obtained with triple exponential smoothing tuned over
different smoothing levels.
 The best test RMSE for the Sparkling data set is 392.786.
 The best test RMSE for the Rose data set is 10.945.
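The comparison table for Sparkling can be rebuilt from the test RMSE values reported in the sections above. The sketch below does this with pandas. The model labels are paraphrased, and the ARIMA/SARIMA orders are left out because they are not stated in the text.

```python
import pandas as pd

# Test RMSE values for Sparkling as reported in this report.
sparkling_rmse = {
    "Linear regression": 1275.867,
    "Naive": 3864.279,
    "Simple average": 1275.082,
    "Moving average, interval 2": 813.401,
    "Moving average, interval 4": 1156.590,
    "Moving average, interval 6": 1283.927,
    "Moving average, interval 9": 1346.278,
    "Simple exponential smoothing, alpha=0.995": 1316.035,
    "Simple exponential smoothing, alpha=0.3": 1935.507,
    "Double exponential smoothing, alpha=0.3, beta=0.3": 18259.110,
    "Triple exponential smoothing, autofit": 469.432,
    "Triple exponential smoothing, alpha=0.3, beta=0.3, gamma=0.3": 392.786,
    "ARIMA, lowest AIC": 1375.095,
    "SARIMA, lowest AIC": 629.373,
    "ARIMA, manual (ACF/PACF)": 4779.154,
    "SARIMA, manual (ACF/PACF)": 629.373,
}

results = (
    pd.Series(sparkling_rmse, name="Test RMSE")
    .sort_values()
    .rename_axis("Model")
    .to_frame()
)
print(results)
```

The same construction applies to Rose with its own RMSE values.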

9. Based on the model-building exercise, build the most optimum model(s) on the complete data
and predict 12 months into the future with appropriate confidence intervals/bands.

Sparkling Rose

Inference:

Sparkling/Rose:

 Using the full data set, we take our best model and forecast 12 months into the future with
95% confidence intervals to see how the predictions look. The model has to be built on the full
data for this.
 The RMSE on the full Sparkling data set is 539.929.
 The RMSE on the full Rose data set is 28.884.
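One simple way to put a band around point forecasts is to add and subtract 1.96 times the standard deviation of the in-sample residuals. The sketch below illustrates that approximation on made-up numbers; it is an assumption, not necessarily how the bands in the figures were computed. It yields a constant-width band, whereas model-based intervals usually widen with the horizon.

```python
import numpy as np

def normal_band(point_forecast, residuals, z: float = 1.96):
    """Approximate 95% band: forecast +/- z * standard deviation of residuals."""
    point_forecast = np.asarray(point_forecast, dtype=float)
    spread = z * float(np.std(residuals, ddof=1))
    return point_forecast - spread, point_forecast + spread

# Made-up forecasts and residuals, for illustration only.
forecast = [100.0, 110.0, 120.0]
residuals = [-1.0, 1.0, -1.0, 1.0]
lower, upper = normal_band(forecast, residuals)
print(lower, upper)
```

The residuals would come from the model refitted on the full data, as the difference between actual and fitted values.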

10. Comment on the model thus built and report your findings and suggest the measures that the
company should be taking for future sales.

Sparkling

Inference

 For the Sparkling data set there is not much trend compared to previous years.
 December has the highest sales in a year.
 The model plot was built based on trend and seasonality. We see that the future predictions
are in line with the previous years.

Recommendations
 Sparkling wine sales are seasonal.
 The company should plan ahead and keep enough stock from September till December to
capitalize on the demand.
 To increase sales, the company should plan some promotional offers from January till June
so that sales are steady throughout the year.

Rose

Inference

 Rose wine sales show a decreasing trend year on year.
 December has the highest sales in a year for Rose wine as well.
 The model plot was built based on trend and seasonality. We see that the future predictions
are in line with the previous years.

Recommendations
 Rose wine sales are seasonal.
 We can see that Rose wines sell strongly in March, in August and from October till
December.
 The company should plan ahead and keep enough stock for March, August and October till
December to capitalize on the demand.
 To increase sales, the company should plan some promotional offers during the low-sales
period.
