
Phase 1: Linear Regression

1.1 Regression Model

Tourist Arrival = 621311.01 × Exchange Rate − 772068.01

The Exchange Rate showed a positive coefficient in the regression model. This
indicates that, on average, as the exchange rate rises, so will the number of tourist
arrivals. In practice, when a country's currency weakens and has a greater exchange
rate in terms of foreign currency, tourists may find it cheaper to travel since their
foreign money has more purchasing power. While the data suggests a favourable
relationship, it is important to realise that an increase in the exchange rate does not
always result in an increase in visitor arrivals. Other obscure factors or external
causes could influence both the exchange rate and visitor arrivals. For example, a
thriving global economy may encourage more individuals to travel (raising tourist
arrivals) while also influencing currency rates.

The R2 value, 0.2483 in this case (Appendix A), tells us about the model's explanatory power.
It means the Exchange Rate accounts for roughly 25% of the variation we observe in
tourist numbers. This percentage gives an idea of how much the Exchange Rate can
predict tourist movement. The fact that our R2 isn't near 1 hints that there are other
influential factors we haven't considered in this model. These might include global
events, marketing efforts for tourism, political scenarios, among others.
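
For reference, the fitted model can be reproduced in R with a single lm() call. This is a minimal sketch, assuming the monthly observations sit in a data frame called tourist_df with columns Tourist.Arrival and Exchange.Rate (hypothetical names, not taken from the appendix code):

# Simple linear regression of tourist arrivals on the exchange rate
# (tourist_df, Tourist.Arrival and Exchange.Rate are assumed names)
fit_lm <- lm(Tourist.Arrival ~ Exchange.Rate, data = tourist_df)

# Coefficients, standard errors, t-values, p-values and R-squared (cf. Appendix A)
summary(fit_lm)
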
1.2 Residuals Diagnostics

Breusch-Godfrey test on the residuals:

Q* 348.59
df 10
p-value < 2.2e-16
Model df 0
Total lags used 19

The Breusch-Godfrey test results indicate that the linear regression model may not
be adequately capturing all the temporal structures in the data. With a test statistic
(Q*) of 348.59 and 10 degrees of freedom, the p-value is significantly less than the
conventional 0.05 threshold for statistical significance (Appendix B). Such a small
p-value leads us to reject the null hypothesis, implying that there's significant
autocorrelation in the residuals up to 10 lags. This suggests that the residuals aren't
random, and the model hasn't captured all the underlying temporal structures in the
time series. Furthermore, the ACF plot reinforces the findings from the
Breusch-Godfrey test. The initial lags in the ACF plot showed significant
autocorrelation. This visual evidence further confirms the presence of autocorrelation
in the residuals.
The presence of serial correlation in the residuals, as highlighted by both the
Breusch-Godfrey test and the ACF plot, indicates that the current regression model
is not fully adequate in representing the relationship between the exchange rate and
tourist arrivals. While the model provides some insights, its residuals' structure
suggests that it may not be capturing some important temporal dynamics or other
influential factors in the data. As such, to obtain more accurate and reliable insights,
the model may require refinements or even a shift towards other modelling
approaches better suited for time series data with potential autocorrelation.
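
The diagnostics discussed above can be generated along the following lines; this is a sketch assuming the fitted model fit_lm from the earlier snippet and the forecast and lmtest packages:

library(forecast)
library(lmtest)

# ACF plot of the residuals to inspect autocorrelation visually
Acf(residuals(fit_lm))

# Breusch-Godfrey test for serial correlation up to lag 10
bgtest(fit_lm, order = 10)

# Alternatively, checkresiduals() gives the residual plot, ACF and test in one step
checkresiduals(fit_lm)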

1.3 Forecast Interval

January February

Forecast 1402521 2272356

Lower Bound 184376.6 1042637

Upper Bound 2620664 3502075

● In January 2020, when the exchange rate is expected to be 3.5 RM for every
USD, the forecasted tourist arrival in Malaysia is approximately 1.4 million.
● For February 2020, with a higher exchange rate of 4.9 RM to USD, the model
predicts a rise in tourist arrivals to approximately 2.27 million.
● The forecast interval for January is quite wide, spanning from around 184k to
2.62 million tourists. Similarly, the February interval spans from around 1.04
million to 3.5 million tourists.

Given the wide forecast intervals, while the point forecasts for January and February
2020 seem reasonable, there's a substantial degree of uncertainty associated with
these predictions. This uncertainty might arise from the model's inability to capture all
influencing factors, as indicated by the presence of autocorrelation in the residuals.
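
A sketch of how these intervals could be obtained with predict(), assuming the fitted model fit_lm from the earlier snippet and the exchange-rate assumptions quoted above:

# Assumed exchange rates for January and February 2020 (RM per USD)
new_rates <- data.frame(Exchange.Rate = c(3.5, 4.9))

# Point forecasts with 95% prediction intervals
predict(fit_lm, newdata = new_rates, interval = "prediction", level = 0.95)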

1.4 Coefficients and results of the estimated model (Appendix A).

Coefficients      Estimate    Std. Error    t value    Pr(>|t|)

Intercept          -772068        192178     -4.017    7.13e-05

Exchange.Rate       621311         56202     11.055    < 2e-16

Estimate:
● Intercept: This is the model's predicted value for "Tourist Arrival" when the
"Exchange Rate" is zero, a base value of -772,068 tourist arrivals. Since an
exchange rate of zero lies far outside the observed data, the intercept mainly
anchors the regression line rather than carrying a practical interpretation.
● Exchange Rate: For every 1-unit increase in the exchange rate (RM to USD),
we expect an average increase of 621,311 in the number of tourist arrivals,
holding everything else constant.

Standard Error:
● Intercept: 192,178 indicates the average variability or uncertainty associated
with the estimate of the intercept. The smaller the standard error, the more
reliable (or precise) the coefficient estimate.
● Exchange Rate: 56,202 gives the average variability or uncertainty associated
with the estimate of the exchange rate coefficient.

T-value:
● Intercept: -4.017 indicates that the intercept estimate lies roughly four standard
errors below zero, so it differs significantly from the null hypothesis value of
zero.
● Exchange Rate: t-value of 11.055 provides strong evidence that there's a
significant positive relationship between the exchange rate (RM to USD) and
the number of tourist arrivals. As the exchange rate increases, the number of
tourist arrivals also tends to increase, and this relationship is statistically
significant and not due to random chance.

Pr(>|t|):
● Intercept: 7.13e-05 is the p-value associated with the t-test for the intercept.
Given that it's much smaller than the common alpha level of 0.05, the
intercept is statistically significant.
● Exchange Rate: < 2e-16 is extremely small, much less than the typical 0.05
threshold, indicating that the relationship between the exchange rate and tourist
arrivals, as described by this coefficient, is statistically significant.

Residuals:

Minimum -1634548

1st Quartile -425708

Median -165448

3rd Quartile 515899

Maximum 1559931
● Min: This is the smallest (most negative) residual, indicating that in the most
extreme case, the model over-predicted the actual tourist arrival by about 1.63
million.
● 1Q: 25% of the residuals are below this value. This means that in 25% of the
cases, the model over-predicted the actual value by more than 425,708
tourists.
● Median: A negative median suggests that, on a median basis, the model
tends to over-predict tourist arrivals by 165,448.
● 3Q: 75% of the residuals are below this value. This means that in the top 25%
of cases, the model under-predicted the actual value by more than 515,899
tourists.
● Max: This is the largest residual, indicating that in the most extreme case, the
model under-predicted the actual tourist arrival by about 1.56 million.

Residual Standard Error 619600

Multiple R-squared 0.2483

Adjusted R-squared 0.2463

F-statistic 122.2

Residual Standard Error:


● The residual standard error provides an estimate of the typical
difference between the observed values of the response variable
(Tourist Arrival) and the values predicted by the model. In this case, on
average, the predicted values are about 619,600 away from the actual
values.
● The degrees of freedom (370) is calculated as the difference between
the number of observations and the number of parameters estimated
(including the intercept). It indicates the number of independent pieces
of information that went into estimating the RSE.

Multiple R-squared
● This value represents the proportion of the variance in the dependent
variable (Tourist Arrival) that's explained by the independent variable
(Exchange Rate). An R2 of 0.2483 means that the model explains
about 24.83% of the variability in the tourist arrivals. The closer this
value is to 1, the better the model fits the data.

Adjusted R-squared
● Similar to R2 the adjusted R2 provides a measure of how well the
independent variables explain the variability in the dependent variable.
However, it adjusts for the number of predictors in the model,
preventing it from artificially inflating when unnecessary predictors are
added. An adjusted R2 of 0.2463 indicates that, after adjusting for the
number of predictors, the model explains about 24.63% of the
variability in tourist arrivals.

F-statistic:
● An F-statistic of 122.2 is quite large, suggesting that the model
provides a significantly better fit to the data than a model with no
predictors.

Phase 2: Time Series Forecasting

2.1 Time Series plot of the tourist arrivals

● Seasonality: There appear to be regular peaks and troughs within each year.
This suggests that there is a seasonal pattern in the tourist arrivals data.
● Trend: There's an overall upward trend in the data, indicating that the number
of tourist arrivals has generally increased over the years. There are, however,
periods where the trend is relatively flat or slightly declining.
● Cyclic Patterns: Beyond the regular seasonal fluctuations, there seem to be
longer-term patterns where the tourist arrivals rise and fall.
2.2 Determine Seasonality

a) Plotting the time series data offers an intuitive way to detect recurring trends
or patterns that happen at consistent timeframes. When the graph displays
regular peaks, valleys, or other noticeable patterns at certain times each year,
it's a sign of seasonality. This direct visual method is often the starting point
because it vividly underscores pronounced seasonal tendencies.
b) On the other hand, the ACF plot, also known as a correlogram, illustrates how
a time series is correlated with its own previous values, which are termed
lags. When there's seasonality in the data, the ACF plot will exhibit
pronounced spikes at intervals corresponding to the season. For example, in
monthly data with an annual seasonal pattern, we’d expect to see prominent
spikes at intervals like 12, 24, or 36 months. Such a recurring pattern in the
ACF, especially a notable spike at a 12-month lag, affirms the correlation
between a value and the same value from a year prior, signalling strong
seasonality.
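
Both checks can be produced with a few lines of R; the sketch below assumes the monthly series is stored as a ts object named tourist_ts with frequency 12 (an assumed name):

library(forecast)
library(ggplot2)

# Time plot of the monthly series
autoplot(tourist_ts)

# Seasonal plot: one line per year makes recurring monthly peaks easy to spot
ggseasonplot(tourist_ts)

# Correlogram: pronounced spikes at lags 12, 24 and 36 signal annual seasonality
ggAcf(tourist_ts, lag.max = 36)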

2.3 Subset of data starting from January 2010 to January 2020

We want to subset to focus on a particular time window, in this case, January 2010
to January 2020, allowing for a more detailed and relevant analysis of recent trends,
patterns, and behaviours in tourist arrivals (Appendix C).

2.4 Data Partition

This "tourist2" dataset is split into training and test subsets. Specifically, 80% of the
data is allocated for training, covering the period from January 2010 to December
2017, while the remaining 20% is set aside for testing, encompassing the timeframe
from January 2018 to January 2020. The length is verified to contain 96 observations
for the train and 25 observations for the test using the `length()` function as shown in
appendix D.
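
The subsetting and the 80/20 split described above can be carried out with window(); a sketch, again assuming the full monthly series is tourist_ts:

# Subset January 2010 - January 2020 (cf. Appendix C)
tourist2 <- window(tourist_ts, start = c(2010, 1), end = c(2020, 1))

# Training set: Jan 2010 - Dec 2017; test set: Jan 2018 - Jan 2020 (cf. Appendix D)
train <- window(tourist2, end = c(2017, 12))
test  <- window(tourist2, start = c(2018, 1))

length(train)  # expected: 96 observations
length(test)   # expected: 25 observations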

2.5 Fit a piecewise linear regression model to the training set and include
seasonal dummies (Appendix E)

The regression model provides a comprehensive understanding of the factors


affecting tourist arrivals in Malaysia from January 2010 to December 2017. The
steady rise in tourist numbers, as indicated by the positive coefficient of the time
variable, suggests that Malaysia's appeal as a tourist destination has been growing
over this period. However, the marked decline post-January 2014, represented by
the negative coefficient of "SegmentedTime," is particularly intriguing. This shift could
be influenced by several external events or changes in Malaysia's tourism dynamics.
The model also underscores the significance of seasonality in tourism, with
December, July, and March standing out. This seasonal trend could be tied to global
vacation patterns, climate preferences like avoiding monsoon seasons or seeking
the tropical climate during winter months, or specific events and festivals that
Malaysia hosts during these months.

It's also noteworthy that the model, with its R-squared value of approximately 60%,
does a reasonably good job of capturing the core trends in the data. This suggests
that while time, the 2014 trend shift, and seasonality are essential factors, there
might be other external variables. The highly significant F-statistic further
strengthens the model's credibility, indicating that the predictors collectively play a
critical role in determining tourist patterns.
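
One way to construct the regressors and fit the model Tourist ~ Time + SegmentedTime + Month is sketched below; the knot position and variable names are illustrative rather than a copy of the code in Appendix E:

# Linear time index and segmented term with a knot at January 2014
n    <- length(train)   # 96 training months
Time <- 1:n
knot <- 49              # January 2014 is the 49th month of the training window
SegmentedTime <- pmax(Time - knot, 0)  # zero before the knot, then +1 per month

# Seasonal dummies from the month of each observation
Month <- factor(cycle(train))

train_df <- data.frame(Tourist = as.numeric(train), Time, SegmentedTime, Month)

# Piecewise linear trend plus monthly dummies (cf. Appendix E)
fit_pw <- lm(Tourist ~ Time + SegmentedTime + Month, data = train_df)
summary(fit_pw)
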

2.6 Plot of the training set data with the fitted values

Both the actual and predicted values show a growth pattern up to approximately
January 2014. From this juncture, a descending trend is observed in the predicted
values, which mirrors the segmented design of the model. This indicates that an
event around January 2014 may have influenced tourist arrivals. While the piecewise
linear regression provides a commendable approximation to the actual data,
especially in recognizing the trend shift around January 2014, there are instances
where it either falls short of or exceeds the actual figures. Clear cyclical patterns are
noticeable in the real tourist arrivals, characterised by recurring highs and lows.
While the model grasps some aspects of this cyclical behaviour, there are
discrepancies. For instance, certain peaks in the real data surpass the model's
estimates, and vice versa for the troughs. Significantly, a distinct trend alteration is
discernible around January 2014 in the actual data, which the model aptly
represents, underscoring the relevance of the segmented regression at that point.

2.7 Interpret Beta slope coefficients of the piecewise linear model.

Tourist ~ Time + SegmentedTime + Month

Time (Linear Trend Coefficient = 5,215):


● Before the knot point in January 2014, for every additional month,
there's an average increase of 5,215 in tourist arrivals. This indicates a
positive linear trend in the number of tourist arrivals in Malaysia leading
up to January 2014.

SegmentedTime (Change in Trend Post-Knot = -6,634):


● After January 2014, the trend is modified by a decrease of 6,634 tourists per
month relative to the original trend. Combining this with the positive Time
coefficient, the net trend after January 2014 is 5,215 − 6,634 = −1,419 tourists
per month. In other words, rather than continuing to rise, tourist arrivals were
declining slightly each month after January 2014, once seasonality is
accounted for.
2.8.1 Forecasts for the time period spanning the test dataset using the
piecewise linear regression model

The graph contrasts the actual tourist arrivals against their projected numbers over a
given duration. At first glance, the actual data shows a gentle climb in tourist figures,
marked by consistent highs and lows indicative of seasonal trends. In contrast, the
forecasts from the piecewise linear regression model seem to suggest a continuous
decrease, indicating an expected drop in tourism in the forthcoming period.

The actual data prominently showcases these seasonal patterns, with some months
constantly experiencing a rise or fall in tourist numbers. However, the predicted
numbers, even though they try to emulate these seasonal patterns, don't align
seamlessly with the real data's peaks and troughs.

Progressing through the timeline, a growing discrepancy between the actual and
forecasted numbers emerges, especially towards the latter part of the graph. This
deviation implies that the model, perhaps initially in sync with the real numbers,
starts to drift as time advances. It's noteworthy that in the graph's initial phases, the
model tends to overestimate the tourist counts. Yet, as we delve deeper into the
timeline, it seems to lag behind, particularly during the pronounced peaks in tourist
arrivals.
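
The test-period forecasts shown in the graph could be produced from the piecewise model roughly as follows, reusing the objects from the fitting sketch above (n, knot, Month, fit_pw):

# Regressor values for the 25 test months (Jan 2018 - Jan 2020)
h        <- length(test)
Time_new <- (n + 1):(n + h)
test_df  <- data.frame(Time = Time_new,
                       SegmentedTime = pmax(Time_new - knot, 0),
                       Month = factor(cycle(test), levels = levels(Month)))

# Point forecasts, wrapped back into a ts so they can be plotted against the actuals
fc_pw <- ts(predict(fit_pw, newdata = test_df), start = c(2018, 1), frequency = 12)
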
2.8.2 Produce forecasts for the time period spanning the test dataset using an
ensemble forecast comprising the ETS and ARIMA forecasts

The chart illustrates the actual tourist numbers alongside predictions from three
forecasting methods. Actual data exhibits a rising trend with pronounced seasonal
highs and lows. The ETS method, designed to account for errors, trends, and
seasonal shifts, initially mirrors the actual figures quite well. However, as time
progresses, it tends to fall short, especially during peak tourist seasons. On the other
hand, the ARIMA method, which combines several statistical approaches, predicts a
steady decline in tourist numbers. While it does acknowledge the drop in tourists in
certain periods, it's generally more conservative and misses the mark during high
seasons. The Ensemble approach, merging the strengths of both ETS and ARIMA,
offers a more balanced prediction, bridging the gap between the optimistic view of
ETS and the cautious stance of ARIMA. Nonetheless, even this combined approach
tends to underestimate the high points in the real data.
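
A sketch of the ensemble construction, assuming the train and test series defined earlier; the equal-weight average of the two sets of point forecasts is one common choice, not necessarily the exact weighting used in the appendix:

library(forecast)

# Fit ETS and ARIMA to the training series and forecast over the test horizon
fc_ets   <- forecast(ets(train), h = length(test))
fc_arima <- forecast(auto.arima(train), h = length(test))

# Equal-weight ensemble of the two sets of point forecasts
fc_ensemble <- (fc_ets$mean + fc_arima$mean) / 2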

2.8.3 Produce forecasts for the time period spanning the test dataset using the
bagging procedure

In our analysis, we employed the Bagging (Bootstrap Aggregating) technique to


enhance our forecasting accuracy for tourist arrivals. By setting an initial seed value
of 123, we ensured the consistency of our results. We generated ten bootstrapped
samples from our training dataset and applied the ETS model, which captures error,
trend, and seasonality, to each of these samples. Subsequently, we derived
forecasts for our test period from each of these ETS models. Our final forecast was
then computed by taking the average of the predictions from all these models. This
method capitalises on the collective insights of multiple models, aiming to provide a
forecast that's both more robust and accurate.
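
The bagging procedure described above translates into R roughly as follows (a sketch; the block-bootstrap settings are left at the forecast package defaults):

library(forecast)

set.seed(123)  # seed quoted in the report, for reproducibility

# Ten bootstrapped versions of the training series (moving-block bootstrap)
boot_series <- bld.mbb.bootstrap(train, 10)

# Fit an ETS model to each bootstrapped series and forecast the test horizon
fc_list <- lapply(boot_series, function(x) forecast(ets(x), h = length(test))$mean)

# Bagged forecast: the average of the ten sets of point forecasts
fc_bagged <- Reduce(`+`, fc_list) / length(fc_list)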

2.8.4 Produce the plot along with the point forecasts from the piecewise linear
regression model, ensemble method, and the bagging procedure.

The chart presents actual tourist arrivals, which consistently rise with evident
seasonal highs and lows. The Piecewise Linear Regression closely mirrors these
arrivals at first but starts to miss the mark, especially in the later stages, often
undervaluing the real figures. The Exponential Smoothing (ETS) method gives a fair
representation, especially in the early sections, but tends to undervalue the peaks as
time progresses. On the other hand, the ARIMA forecast is a bit cautious, suggesting
a mild decline and frequently missing the season's high points. Merging the insights
from ETS and ARIMA, the Ensemble method provides a more measured forecast,
finding a balance between the two. Meanwhile, the Bagging approach, pooling
insights from various forecasts, aligns reasonably well with the actual data early on,
but seems to slightly undervalue later figures.
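
A plot along these lines can be assembled with autoplot() and autolayer(), assuming the forecast objects created in the earlier sketches:

library(forecast)
library(ggplot2)

autoplot(tourist2) +
  autolayer(fc_pw,         series = "Piecewise regression") +
  autolayer(fc_ets$mean,   series = "ETS") +
  autolayer(fc_arima$mean, series = "ARIMA") +
  autolayer(fc_ensemble,   series = "Ensemble") +
  autolayer(fc_bagged,     series = "Bagging") +
  labs(y = "Tourist arrivals")
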
2.8.5 Determine method that produces the most accurate forecast (Appendix F)

Piecewise Linear Regression

ME 23147.049

RMSE 149539.1

MAE 117584.9

MPE 0.7676820

MAPE 5.431734

The RMSE and MAE are among the lowest of the methods compared, and the MAPE is
only slightly above the best value, although the ME indicates a modest positive bias.

ETS

ME 67505.509

RMSE 167287.4

MAE 133773.9

MPE 2.8190726

MAPE 6.103809

This model has the highest ME, RMSE, and MAE among all models. It also has the
highest MAPE, suggesting it may not be the best model in terms of percentage
errors.

ARIMA

ME 3750.373

RMSE 157146.9

MAE 116758.9

MPE -0.1219569

MAPE 5.382891
ARIMA has the smallest ME and the lowest MAE and MAPE of all the models, although
its RMSE sits in the mid-range, which suggests a few larger individual errors during the
peak months.

Ensemble

ME 35627.941

RMSE 157120.6

MAE 119557.7

MPE 1.3485578

MAPE 5.481776

The ensemble improves on ETS across every measure, but it does not beat ARIMA on
MAE or MAPE, and its RMSE is only marginally lower than ARIMA's. This indicates that
combining forecasts moderates the weaknesses of the individual models rather than
guaranteeing the best score on every metric.

Bagging

ME 12042.293

RMSE 144335.1

MAE 124681.4

MPE 0.1179655

MAPE 5.746465

This method has the lowest RMSE among all models and a small ME, indicating little
systematic bias and fewer large misses. Its MAE and MAPE are not the lowest, however;
ARIMA and the piecewise model score slightly better on those measures.
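
The error measures in the tables above can be reproduced with accuracy(); a sketch, assuming the point-forecast objects created in the earlier snippets:

library(forecast)

cols <- c("ME", "RMSE", "MAE", "MPE", "MAPE")

# One row of test-set accuracy measures per method (cf. Appendix F)
rbind(
  Piecewise = accuracy(fc_pw, test)[1, cols],
  ETS       = accuracy(fc_ets$mean, test)[1, cols],
  ARIMA     = accuracy(fc_arima$mean, test)[1, cols],
  Ensemble  = accuracy(fc_ensemble, test)[1, cols],
  Bagging   = accuracy(fc_bagged, test)[1, cols]
)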

In conclusion, based on the provided metrics, the Bagging method appears to be the
most accurate forecasting method among the ones considered: it achieves the lowest
RMSE, the measure that penalises large errors most heavily, together with a near-zero
mean error, even though ARIMA edges it out on MAE and MAPE.
Phase (3): Conclusion and policy implications

Our journey began with a deep dive into the drivers and patterns of tourist arrivals in
Malaysia. Phase (1) showed that the exchange rate has a positive but only partial
influence on arrivals, and the time series plots made it clear that tourist visits follow
predictable seasonal patterns, likely influenced by events like festivals, school breaks,
or even the weather.

In Phase (2), we ventured into the realm of forecasting. We employed a variety of


models to predict the future flow of tourists. Among all the models, the Bagging
method stood out. It didn't just make predictions; it mirrored the real-world data,
showcasing the strength of combining insights from multiple models.

Policy Implications:

● The rhythmic ebb and flow of tourist numbers suggest a need for strategic
planning. During slower months, perhaps we can introduce special
promotions to attract visitors. And when a busy season is anticipated, it's
crucial to ensure we're well-equipped to handle the crowd, guaranteeing a
seamless experience for all.
● The standout performance of the Bagging method is a testament to the power
of data-driven decision-making. Relying on such advanced techniques can
guide stakeholders, helping them anticipate and navigate challenges
effectively.
● Even with a champion like the Bagging method, it's essential to keep our
models updated. The world is ever-evolving, with factors like global events or
economic shifts potentially altering tourism patterns. Our models should be
agile enough to adapt to these changes.
● With some months more popular than others for tourists, spreading out
marketing efforts can ensure a more balanced influx of visitors throughout the
year. This can lead to better revenue management and a sustainable tourism
model.
Appendix

Appendix A
Summary of Linear Regression Model

Appendix B
Residual Diagnostics

Appendix C
Subset of data starting from January 2010 to January 2020
Appendix D
Training and Test dataset

Appendix E
Summary of Piecewise Linear Regression Model
Appendix F
Determine method that produces the most accurate forecast
