
Energy Economics 133 (2024) 107466

Contents lists available at ScienceDirect

Energy Economics
journal homepage: www.elsevier.com/locate/eneeco

How to select oil price prediction models — The effect of statistical and
financial performance metrics and sentiment scores
Christian Haas a,b,∗, Constantin Budin a, Anne d'Arcy a
a WU Vienna University of Economics and Business, Welthandelsplatz 1, Vienna, 1020, Austria
b University of Nebraska at Omaha, 6001 Dodge St, Omaha, NE, 68182, USA

ARTICLE INFO

JEL classification: C45, C49, C51, C52, C53, G17

Keywords: Oil-price prediction, Machine learning, Sentiment analysis, Performance metrics, Model selection, Forecasting models

ABSTRACT

Predicting crude oil prices is an important yet challenging forecasting problem due to various influencing quantitative and qualitative factors. To address the growing number of potential prediction models and model parameters to consider during model selection, we highlight the need to systematically compare alternative prediction models and variables while taking the specific context of their application into account. Specifically, we provide a novel perspective on oil price prediction models by comparing a variety of different forecasting models and considering both their statistical and financial performance. In addition to common statistical measures, we evaluate the potential financial impact of the predictions in a simulation of a simple trading strategy to assess their usefulness in a practical setting. We show that the ranking of different approaches depends on the selected evaluation metric and that small differences between models in one evaluation metric can translate into large differences in another metric. For instance, forecasts that are not considered statistically different can lead to substantially different financial performance when the forecasts are used in a trading strategy. Finally, we show that including qualitative information in the prediction models through sentiment analysis can yield both statistical and financial performance improvements.

∗ Corresponding author at: WU Vienna University of Economics and Business, Welthandelsplatz 1, Vienna, 1020, Austria.
E-mail address: christian.haas@wu.ac.at (C. Haas).
https://doi.org/10.1016/j.eneco.2024.107466
Received 26 September 2022; Received in revised form 27 February 2024; Accepted 4 March 2024
Available online 11 March 2024
0140-9883/© 2024 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

Predicting crude oil prices is considered a challenging problem due to the various quantitative and qualitative factors that influence the oil price and its volatility. These include politics and global crises such as rising tensions in the Middle East (Movagharnejad et al., 2011), advances in technology (Monge et al., 2017; Cheon and Urpelainen, 2012), rumors that can have a noticeable impact on markets (Spiegel et al., 2010), and sentiment (Qadan and Nama, 2018; Li et al., 2016, 2021). To tackle this challenge, various types of forecasting models have been developed, ranging from statistical, theory-based, and regression models to newer Machine Learning (ML) approaches (see e.g. Jammazi and Aloui, 2012; Li et al., 2016; Zhao et al., 2017). Research shows that ML models often (a) outperform other models in terms of prediction accuracy, (b) are able to discover new links and relationships between forecasting variables, and (c) can include both quantitative and qualitative data (Ghoddusi et al., 2019). Oil price prediction models are a frequent use case of ML models in energy economics, while more specialized approaches from the subset of Deep Learning (DL) are still comparatively underutilized (Ghoddusi et al., 2019).

With the increasing number of potential prediction models and the variety of different quantitative and qualitative variables, the decision of which model to select in a specific scenario becomes more complex. During the model selection process, models for oil price prediction are usually compared using standard statistical forecasting metrics such as the root mean squared error (RMSE), out-of-sample R², the directional accuracy, or the mean absolute percentage error (MAPE) (see e.g. Chai et al., 2018; Qin et al., 2019; Lu et al., 2020; Çepni et al., 2022). In general, these statistical metrics are then used to select the model (and the corresponding predictions) with the respective best performance. Yet, these statistical metrics do not consider the practical usefulness of the predictions in their specific application context. For instance, if these predictions are used for financial decisions such as buy/sell transactions, the statistical metrics might only offer limited insights into the expected financial performance. To address this, some of the previous prediction approaches extend their analysis by evaluating the predictions from an economic or financial impact perspective, such as the certainty equivalent return or the Sharpe ratio (e.g. Xing and Zhang, 2022).

Hence, to select an appropriate model for a specific context, we need to further investigate the relationship between statistical and financial performance.

From a model selection perspective, a model ranking based on the statistical performance of the prediction models might not coincide with a ranking based on these alternative metrics such as the financial impact of the predictions. Our article provides a novel perspective and insights into the usefulness of oil price prediction by comparing a variety of prediction models on both statistical and financial evaluation metrics. To highlight the value of a comparison between statistical and financial performance measures, we consider two underutilized areas of ML models in energy economics. First, we compare both traditional prediction models and advanced DL models to investigate if they lead to a different expected statistical and/or financial performance. Second, we include sentiment scores as additional prediction variables in the models as an example of quantified unstructured information to analyze their impact on the prediction performance (Ghoddusi et al., 2019). Overall, we analyze if the ranking of the prediction model performance depends on the evaluation metric, i.e., if we consider statistical performance or financial performance. In addition, we also investigate if the relative model performance, and thus the practical usefulness of the predictions, is substantially different between evaluation metrics.

For our analysis, we apply and compare several oil price prediction models: two DL models using Long Short-Term Memory (LSTM) Neural Networks (based on Li et al. (2016) and Zhao et al. (2019)), a vector autoregressive model (Baumeister and Kilian, 2012, 2015), and a spot price spread model (Baumeister et al., 2018). In addition to different prediction model types, we also include news sentiment as additional predictor variables in the models through quantified sentiment scores. We use several approaches for sentiment analysis: a dictionary-based approach, a modern algorithmic approach (Vader), a machine learning approach (Watson), a combination of these approaches, and no sentiment (i.e., not using sentiment scores in the respective model). In addition to these models, we also include two ensemble models and use a no-change forecast as prediction baseline, yielding a total of 20 different prediction models for the analysis. We also consider different forecast horizons to study the robustness of the results. While the main evaluation looks at a standard one-period forecast, we also consider multi-period forecasts, specifically 3- and 5-period forecasts, as robustness checks. For the evaluation metrics, we select RMSE, MAPE, and directional accuracy as statistical metrics, and use a Return-on-Investment (ROI) metric to study the financial implications of the models when their predictions are used in a specific (trading) context. The ROI metric is calculated by simulating the behavior of the models in a market environment and represents a simple trading strategy where the predictions are used for trading decisions.

Our evaluation of the various prediction models and the optional sentiment analysis shows that while the difference in statistical performance (considering RMSE, MAPE, and the directional accuracy) tends to be small between the different models, the range of financial performance using the ROI metric is significantly larger. While some prediction models are able to yield ROIs between 7% and 15% in the considered simulation, other models effectively yield no positive ROIs despite achieving similar or only slightly worse statistical performance. In addition, the analysis shows that while RMSE and ROI performance are moderately correlated for this use case, MAPE and ROI only show a weak correlation. Hence, selecting a prediction model purely based on statistical performance does not guarantee that this model will also yield the best financial performance.

We make three contributions with this article. First, we show that the choice of evaluation metrics determines the ranking of models. Selecting a prediction model based on the best statistical prediction performance does not guarantee that the selected model is also the best model in terms of financial performance in a specific application context such as the simulated trading strategy in our evaluation. This also highlights that especially for ML and DL-based models, the evaluation metric needs to reflect the application scenario in order to yield a ranking of models that adequately captures the setting. Second, our evaluation shows that even small differences in one evaluation metric can yield large differences in another metric. Specifically, small differences in statistical performance measured as RMSE or MAPE can yield substantial differences in financial metrics, highlighting yet again that the choice of evaluation metric is crucial. Third, for the specific case study of adding sentiment scores as additional variables to DL models, we show that models with sentiment can outperform other non-sentiment models on statistical forecasting metrics. Additionally, models including sentiment show substantial improvement for financial evaluation metrics such as the considered simulated trading strategy.

The article is structured as follows. In Section 2, we provide an overview of oil price prediction models and sentiment analysis. Next, in Section 3 we describe the methodology of our analysis and the considered evaluation metrics, followed by the analysis of the results and a discussion in Section 4. We conclude in Section 5.

2. Related work: Oil price prediction models

Due to their importance and potential financial impact, different types of oil price prediction models have been developed over the years. This section provides an overview of common types of prediction models, from statistical to modern DL approaches. Further, it introduces different approaches in sentiment analysis to quantify unstructured information.

2.1. Statistical oil price prediction models

Statistical oil price prediction models define a specific function that calculates the oil price for the next time period(s). Arguably the simplest model is the no-change forecast. This model always predicts no changes to the oil price:

\hat{P}_{t+h} = P_t    (1)

where P_t is the current price and \hat{P}_{t+h} is the predicted future price in h time steps.

The no-change forecast is also often called a 'naive' forecast and does not consider any information other than the current price (Alquist et al., 2013; Baumeister and Kilian, 2015; Baumeister et al., 2018). Despite its simplicity, the no-change forecast is a good benchmark in general and is most useful for short-term predictions, i.e., predictions for the next (or next few) time periods. Hence, we use the no-change forecast as benchmark for our evaluations as well.

2.1.1. Autoregressive models

Traditional Autoregressive (AR), Autoregressive Moving Average (ARMA), and Vector Autoregression (VAR) models are quite successful for oil price prediction (Park and Ratti, 2008; Alquist and Kilian, 2010; Baumeister and Kilian, 2012, 2015). Autoregressive models use information from previous time periods to create predictions for the upcoming time periods, usually by weighting previous observations in their calculations. While AR and ARMA models are popular prediction methods that perform well in many scenarios, their downside is that they typically consider univariate prediction scenarios, i.e., they only take one variable into account (the previous values of the time series, e.g., the oil price).

A VAR model allows for multivariate predictions and is specified as follows:

y_t = v + A_1 y_{t-1} + A_2 y_{t-2} + \cdots + A_n y_{t-n} + \epsilon_t    (2)

where n is the lag order (the number of time steps looking into the past), y_t = (y_t^1, y_t^2, \ldots, y_t^m) is an m-dimensional vector of m independent variables measured at time t, and \epsilon_t is the error term that accounts for random noise that affects the price movements. The A_i are m × m matrices and contain the model parameters that have to be estimated to make accurate predictions.
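To make the VAR specification in Eq. (2) concrete, the following is a minimal sketch of fitting and forecasting such a model with the statsmodels library; the file name, column layout, and lag order are illustrative assumptions rather than the exact configuration used in the paper.

    # Minimal sketch: fitting a VAR model as in Eq. (2) with statsmodels.
    # File name, column layout, and lag order are illustrative assumptions.
    import pandas as pd
    from statsmodels.tsa.api import VAR

    # df: one row per trading day, one column per variable (e.g., the WTI
    # price and a sentiment score); stationarity transformations omitted.
    df = pd.read_csv("oil_data.csv", index_col=0, parse_dates=True)

    model = VAR(df)
    results = model.fit(5)  # fit a VAR with lag order n = 5 (illustrative)

    # One-step-ahead forecast: pass the last n observations as context.
    n = results.k_ar
    forecast = results.forecast(df.values[-n:], steps=1)
    print(dict(zip(df.columns, forecast[0])))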


VAR models allow for multiple variables and model the inter-relationships between the variables. They can also offer significant insights into the inner workings of the underlying scenario. Due to this, VAR models are one of the more successful approaches for predicting oil prices using regression and show a drastic improvement of the forecast accuracy over the no-change forecast (Baumeister and Kilian, 2012, 2015). This particular approach is geared towards monthly forecasts, where the noise of daily trading might not be as relevant. The accuracy of the monthly forecasts can be improved even further if multiple approaches (e.g., a spread-based model and a futures-based model) are averaged (Baumeister and Kilian, 2015). Alquist and Kilian (2010) were able to further show the reliability of VAR models for oil price prediction using oil futures spreads. However, despite their ability to produce accurate predictions, autoregressive models do not always provide explanations why prices change.

2.1.2. Theory-based models

Theory-based models use established links between the oil price and other variables to calculate the price predictions. There are mainly three basic approaches for these theory-based models: models using futures, models using spot prices, and models using economic indicators.

Forecasting using futures. Verleger (1982) introduces one of the earliest theory-driven models by noting that the price of a barrel of oil can be expressed by the weighted sum of the prices of the products derived from it: the price of a barrel of oil at time t (P_t) is equal to the price of the products that can be made out of oil (p_{t,i}) at time t, with additional weights w_i that reflect how much oil is used for the various products. While this model is promising due to its simplicity, both Alquist and Kilian (2010) and Knetsch (2007) conclude that oil futures are not a good predictor of oil prices. Hence, we can expect this model to have a sub-par performance compared to more sophisticated models. Besides predicting the crude oil price directly, several models have been suggested to forecast the price of energy futures instead, i.e., to predict F_{t,i} (see, e.g., Manoliu and Tompaidis (2002), Lautier and Galli (2004), Date et al. (2013) and Sadik et al. (2020)). Yet, as the focus of our analysis is the crude oil price directly, we do not include these futures-focused models in our set of considered forecasting models.

Forecasting using spot prices. The futures-based prediction model can be augmented by replacing futures prices with spot prices and slightly changing the underlying equation (Baumeister et al., 2018):

\hat{P}_{t+h} = P_t \cdot e^{\alpha + \beta \left[ \log(\sum_{i=1}^{n} w_i p_{t,i}) - P_t \right]}    (3)

The parameters \alpha and \beta are estimated using linear regression with the following equation:

\log(P_{t+h}) - \log(P_t) = \alpha + \beta \left[ \log(\sum_{i=1}^{n} w_i p_{t,i}) - P_t \right]    (4)

In our case, we use a single spot spread model where heating oil is used as the single product (Baumeister et al., 2018):

\log(P_{t+h}) - \log(P_t) = \alpha + \beta \left[ \log(p_{t,\text{heating oil}}) - P_t \right]    (5)
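The estimation step in Eqs. (4)-(5) is an ordinary least-squares fit. Below is a minimal sketch, assuming two aligned numpy arrays of WTI and heating oil spot prices; the function and variable names are illustrative, not the paper's implementation (which was done in R).

    # Minimal sketch of the spot spread forecast in Eqs. (3)-(5):
    # regress h-step log price changes on the heating oil spread, then
    # apply the fitted relationship to produce a forecast. Illustrative only.
    import numpy as np

    def fit_spread_model(wti, heating_oil, h=1):
        """Estimate alpha and beta from Eq. (5) via least squares."""
        spread = np.log(heating_oil[:-h]) - wti[:-h]
        dlog_p = np.log(wti[h:]) - np.log(wti[:-h])
        beta, alpha = np.polyfit(spread, dlog_p, deg=1)
        return alpha, beta

    def spread_forecast(p_t, heating_oil_t, alpha, beta):
        """Apply Eq. (3) to get the h-step-ahead price forecast."""
        return p_t * np.exp(alpha + beta * (np.log(heating_oil_t) - p_t))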
Forecasting using economic indicators. Alquist et al. (2013) propose a model based on economic indicators. This model assumes that the oil price changes in tandem with some other economic indicator(s), for example industrial raw materials:

P_{t+h} = P_t (1 + \Delta x_t)^h    (6)

Here, \Delta x_t refers to the percent change of the considered economic indicator in the last period, or the average change of any indicator over some past periods. The model can also be adapted to predict the inflation-adjusted oil price (R_{t+h}), where E(\pi_{h|t}) is the expected inflation rate over period h:

R_{t+h} = R_t (1 + \Delta x_t - E(\pi_{h|t}))^h    (7)

2.2. Machine learning and deep learning-based models

In scenarios where autoregressive and theory-based models reach their limits, machine learning is a commonly considered alternative. Not surprisingly, oil price prediction is a common use case for ML in energy economics (Godarzi et al., 2014; Luo et al., 2018; Zhao et al., 2019; Ramyar and Kianfar, 2019; Lin et al., 2020; Ghoddusi et al., 2019).

Machine learning allows for the inclusion of information that might be related to oil prices, even if the link between this information and the oil price is hard to quantify through theoretical or regression-based equations. For example, news articles carry information about how authors feel about the market and whether they believe in some rumors. This information is linked to how the price will move, because investors can be swayed by this qualitative data (Spiegel et al., 2010; Tetlock, 2007). Yet, deriving an equation that describes this link is a particularly challenging task. Machine learning and deep learning models such as neural networks that are able to learn complex and highly non-linear equations, however, are often able to estimate these links and use them to improve the predictions.

Many previous machine learning models for oil price prediction use simple and small neural networks with no special features (Li et al., 2016; Zhao et al., 2017, 2019; Ramyar and Kianfar, 2019). More sophisticated machine learning approaches such as wavelet series, Generative Adversarial Neural Networks (GANN), Recurrent Neural Networks (RNN), Random Vector Functional-Link (RVFL) or Long Short-Term Memory Neural Networks (LSTM) provide more complex architectures and features that might be helpful for oil price prediction (Luo et al., 2018; Xu and Niu, 2022; Hu et al., 2012). Simpler networks have the advantage of being easier to train and configure. In contrast, the larger and more complex a model becomes, the harder it gets to achieve a good estimate of all the model parameters and to prevent overfitting of the model. In this article, we use LSTM neural networks as an example of such more complex architectures.

2.2.1. Wavelet series

Wavelet series are a quite popular and modern approach to crude oil price prediction using neural networks (Jammazi and Aloui, 2012; Luo et al., 2018; Lin et al., 2020). A wavelet series is a series of simple wave functions that, if added together, recreate the original data, in this case the crude oil price. Each wavelet is responsible for a certain frequency range. In the context of oil prices, this means each function represents movements on different time scales. The higher-frequency wavelets are more focused on day-to-day changes while low-frequency ones carry information on general trends. These simple functions are then fed into a neural network, which uses them for its prediction. This approach has the advantage of breaking down the information contained in the price series into smaller discrete parts, which makes it easier for the neural network to sort out connections within the data. In addition, having such a decomposition allows for easy low-pass filtering of the data (Luo et al., 2018; Lin et al., 2020). When a low-pass filter is applied to a function, all frequencies higher than some value are removed from it. This smooths the data and gets rid of high-frequency noise, which is similar to the static noise when listening to the radio. This helps the neural network training as it gets rid of most random day-to-day movements that might otherwise influence the machine learning algorithm. A minimal sketch of such a wavelet-based low-pass filter is shown below.
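The decomposition and low-pass filtering step can be sketched with the PyWavelets library; the wavelet family, decomposition level, and synthetic input are illustrative assumptions rather than the setup of the cited studies.

    # Minimal sketch: wavelet decomposition and low-pass filtering of a
    # price series with PyWavelets. Wavelet choice and level are assumptions.
    import numpy as np
    import pywt

    def lowpass_filter(prices, wavelet="db4", level=3):
        """Decompose the series, zero out the highest-frequency detail
        coefficients, and reconstruct a smoothed series."""
        coeffs = pywt.wavedec(prices, wavelet, level=level)
        coeffs[-1] = np.zeros_like(coeffs[-1])  # drop day-to-day noise
        smoothed = pywt.waverec(coeffs, wavelet)
        return smoothed[: len(prices)]  # waverec may pad by one sample

    # Synthetic demo series; the smoothed result (and/or the coefficient
    # arrays themselves) can then serve as neural network inputs.
    prices = np.cumsum(np.random.randn(512)) + 60.0
    smooth = lowpass_filter(prices)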


2.2.2. GANNs

While wavelet series are a quite straightforward approach to oil price prediction, GANN use a more indirect method. GANN consist of two competing neural networks. One is trying to create or alter data in a certain way, while the second neural network tries to determine which data is altered and which one is real. The first neural network succeeds if it can fool the second one into thinking altered data is not altered. Both neural networks get increasingly better at their task and in doing so force their opponent to also improve. In the context of oil price prediction, this means one neural network is tasked with creating predictions and the second one tries to determine which prices are predicted and which ones are real. In combination with wavelets, GANN can be used quite successfully to predict oil price movements (Luo et al., 2018).

2.2.3. LSTMs

RNN and LSTM neural networks are a more traditional approach to time series prediction and can be successfully applied to crude oil price prediction (Hu et al., 2012). The specialty of these models is the incorporation of past patterns into their prediction. While all machine learning models are able to do this, RNNs and LSTMs are especially adept at extracting trends and context from a time series (Bengio et al., 1994; Hochreiter and Schmidhuber, 1997). This makes them a promising model for any prediction task where the next data point in a series is dependent on the previous ones.

Long Short-Term Memory (LSTM) neural networks are neural networks with the ability to memorize information. When working with data where context is needed, like text, speech, or video, traditional neural networks quickly reach their limits as they struggle with connecting related information (Hochreiter, 1991). When analyzing a text, for example, being able to remember the previous words is crucial to understanding the whole text. LSTMs are a special type of Recurrent Neural Network (RNN) (Bengio et al., 1994; Hochreiter and Schmidhuber, 1997) that feeds back information into itself. Usually, their own output is used as part of the input for the next iteration. This feedback loop helps them to retain information. In theory, RNNs are capable of remembering previous information for many execution steps; in practice, however, this is not the case. When used in models, RNNs usually only have a short-term memory, because in order to achieve a long-term memory, many parameters would have to be tweaked manually until the performance of the RNN is satisfactory (Hochreiter, 1991; Bengio et al., 1994). To remedy this situation, LSTMs have been developed to also capture long-term trends (Hochreiter and Schmidhuber, 1997).

As opposed to standard RNNs, LSTMs provide a long-term memory without the need for time-consuming manual tweaking. This is achieved by introducing an additional connection between the network cells which explicitly transports previous information (see Fig. 1). In this connection, the information is barely altered as opposed to the network output y_t. The cells can remove unneeded information and add new information at every step, to memorize new important data and forget old unnecessary information, ensuring only needed pieces of information are present (Hochreiter and Schmidhuber, 1997). This additional connection is represented by C_t (the cell state at time t) in Fig. 1.

Fig. 1. Visualization of a LSTM.

This capability of remembering information has made LSTMs quite successful in any task requiring temporal coherence, such as stock price prediction. Prior research shows that LSTMs are capable of predicting future stock prices well enough to make a financial profit when using these predictions (Akita et al., 2016; Fischer and Krauss, 2018), and that they outperform traditional forecasting techniques for spot price predictions (Baughman et al., 2018). However, their performance tends to decrease in times of high market uncertainty such as the financial crisis (Fischer and Krauss, 2018). They also generally perform better on the short term, as uncertainty tends to rise for predictions further into the future (Akita et al., 2016; Fischer and Krauss, 2018).

Previous studies indicate that machine learning models generally outperform other available models for oil price prediction on pure statistical measures (Ramyar and Kianfar, 2019) and when including additional investor sentiment for the prediction of crude oil futures returns (Li et al., 2021). At the same time, neural networks are black-box models that take in information and produce accurate predictions, making it hard to interpret why certain predictions were made and which factors contributed to the prediction. Recent research in Explainable AI is working on methods of analyzing and explaining neural networks and their complexity (Olah et al., 2020).

2.3. Sentiment analysis

Sentiment analysis, also called opinion mining, analyses people's opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes (Liu, 2012, p.7). It is widely used, from political maneuvering (Ceron et al., 2014) and strategic decision making (Qiu et al., 2010; Yu et al., 2013), to predicting stock market prices (Pagolu et al., 2016).

In sentiment analysis, unstructured information provided as text (e.g., articles, reports, interviews) is converted into quantitative information that can be further analyzed. There are generally two types of quantification approaches in sentiment analysis:

1. A dictionary-based approach
2. An ML-based approach.

These types of sentiment approaches will be further described next.

2.3.1. Dictionary-based sentiment analysis

Dictionary-based approaches use manually created and curated dictionaries that define which words are considered negative and positive, respectively. They then count the number of positive and negative words in the given text and convert these counts to an overall sentiment score (Medhat et al., 2014). The general equation of this basic word count-based sentiment is:

sentiment = \frac{n_{pos} - n_{neg}}{n_{pos} + n_{neg}}    (8)

This does not, however, take into account strengtheners (e.g. "great"), weakeners (e.g. "somewhat"), and negation (e.g. "not great"). Moreover, words are only counted as fully positive or negative, while in reality some words correspond to stronger sentiment than others. This is why more sophisticated algorithms use lists of negators, strengtheners and weakeners to increase the scoring accuracy. Also, instead of assuming every negative or positive word is equally negative or positive, a positivity or negativity score can be introduced (Hutto and Gilbert, 2014; Medhat et al., 2014). The final score is still often calculated as a weighted sum of the scores of the words found in a text.

A very successful dictionary-based approach for sentiment analysis is VADER. This approach is especially attuned to social media, but is expected to work on other source material as well (Hutto and Gilbert, 2014). Not only does VADER use rules, like strengtheners, weakeners, and negation, it also is aware of context and uses a lexicon where words are not just positive or negative, but can be anything in between (Hutto and Gilbert, 2014).
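The contrast between Eq. (8) and VADER can be made concrete with a short sketch; the word lists below are tiny illustrative stand-ins for a real curated dictionary, and the vaderSentiment package provides the VADER implementation of Hutto and Gilbert (2014).

    # Minimal sketch: basic word-count sentiment per Eq. (8) vs. VADER.
    # The positive/negative word lists are illustrative stand-ins only.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    POSITIVE = {"gain", "growth", "profit", "rally", "strong"}
    NEGATIVE = {"loss", "crisis", "decline", "weak", "fear"}

    def wordcount_sentiment(text):
        """Eq. (8): (n_pos - n_neg) / (n_pos + n_neg), in [-1, 1]."""
        words = text.lower().split()
        n_pos = sum(w in POSITIVE for w in words)
        n_neg = sum(w in NEGATIVE for w in words)
        return (n_pos - n_neg) / (n_pos + n_neg) if n_pos + n_neg else 0.0

    text = "Oil prices rally on strong demand despite supply worries"
    print(wordcount_sentiment(text))

    # VADER additionally handles negation, strengtheners, and weakeners;
    # its 'compound' score is likewise normalized to [-1, 1].
    analyzer = SentimentIntensityAnalyzer()
    print(analyzer.polarity_scores(text)["compound"])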


While VADER represents a general-purpose dictionary that can be used in many contexts, specialized dictionaries can be created that take into account the idiosyncrasies of specific applications. For instance, in a financial context such as an analyst report, words might have a different positive or negative sentiment than in a non-financial text. For this reason, finance-specific dictionaries have been developed as well. Two of the most commonly used ones are Henry's finance-specific dictionary (Henry, 2008) and Loughran and McDonald's dictionary (Loughran and McDonald, 2018). Both of these dictionaries will be used in our analysis.

Due to their simplicity, dictionary-based approaches have several shortcomings, the biggest being context. One word can have different meanings in different contexts, which then could substantially impact the underlying sentiment. An example for this might be "catch". While it has a neutral or slightly positive meaning in "What a great catch", in the sentence "Where's the catch" the meaning is rather negative. The challenge of including context in sentiment analysis is hard to overcome just with dictionaries due to the ambiguity of the specific meaning (Liu, 2012, p.12).

2.3.2. ML-based sentiment analysis

As an alternative to dictionary-based sentiment analysis, ML-based approaches have emerged to address some of their shortcomings. ML-based sentiment analysis relies on a large set of labeled texts with which the models try to generate a general understanding of human sentiments, including learning to take the context of words into account. The labeled examples used to train these ML-based models have a significant impact on the resulting predictions. On the one hand, manually labeling the training texts can introduce bias and is resource-intensive. On the other hand, using pre-labeled texts such as reviews from sites like IMDB or Amazon can lead to a sentiment model that correctly predicts sentiment in a specific context (e.g., movie reviews), yet might have difficulty inferring the correct sentiment in other settings (e.g., financial texts) (Bai, 2011; He and Zhou, 2011; Ribeiro et al., 2016). Therefore, while ML-based approaches are often superior, when dealing with language that falls outside a general-purpose text, traditional algorithms can still outperform ML-based sentiment analysis (Hutto and Gilbert, 2014; Ribeiro et al., 2016).

3. Methodology

In this section, we describe the methodology of the subsequent evaluation. First, we list data sources and the process of deriving sentiment scores that are used in the prediction models. Then, we outline the different oil price prediction models and introduce the ROI simulation that measures the financial performance of the prediction models.

3.1. Data sources

We use two types of data sources for the evaluation: data about the crude oil price, and data on news sentiment related to oil prices. For the oil price itself, we use the WTI crude oil price which we downloaded from the US Energy Information Administration EIA (EIA, 2020), along with the heating oil spot prices for the spot price model. We also downloaded the inflation relative to the year 2015 from the OECD website (OECD, 2020) and apply it to remove inflation from crude oil prices.

For the news articles used in the sentiment analysis, we built a dedicated web scraper and targeted four different news sites (Reuters, CNBC, UPI, Forbes) to collect various relevant news articles. Of the 31,770 downloaded articles, we use a total of 28,704 articles for the evaluation due to differences in the formatting of some articles and extracting their information. A detailed list of the used search terms as they are used by the web scrapers can be seen in Table A.1 in the Appendix. We only count each article once, and if one article appears for more than one search term, we only count the first appearance. We adapt the used search terms from Zhao et al. (2019) due to constraints with how searching works on the considered sites.¹

The search terms described in Table A.1 (see Appendix) yield articles between Nov 18, 2017, and Feb 7, 2020 (due to time limitations in the search engines of the news outlets). We select the forecasting horizon as one day, theoretically resulting in 814 data points. However, oil price data is not available for weekends and public holidays. Therefore, we removed these days from the data set. This cleaning results in a total of 555 data points that we use in this evaluation.²

¹ Reuters, for example, only provides up to 1000 pages of search results. This means that if a search term is chosen too broadly, such as "news", the 1000-page search limit will be reached with articles spanning only a few months, as "news" is a word that should appear in most articles found on Reuters. Using the original search terms would have limited the available days for training too much.

² While the gaps in the oil price data could be filled through interpolation, we do not consider this here as it would introduce additional assumptions and potential bias into the dataset and thus the evaluation.

3.2. Prediction models

To get a realistic estimate of the prediction performance of the various models, we split the entire data set into an 80% training and 20% test set. Such an 80%–20% split between training and test set has been shown to balance getting accurate model performance estimates with the need to avoid overfitting the models (Gholamy et al., 2018). To be specific, the respective model parameters are estimated solely on the training set, and the separate test set serves to estimate the out-of-sample performance for the prediction models. This out-of-sample performance simulates the prediction performance of the models when applied to new data, and is therefore a better basis for comparing different prediction models than their performance on the training set. For time series analysis, this split has to respect the temporal order of the observations, i.e., the training data represent earlier time periods, and the test data time periods after the training period (Cerqueira et al., 2020).

The training set includes observations from Nov 18, 2017 to Aug 28, 2019, and the test set observations from Aug 29, 2019 to Feb 7, 2020. The reported metrics in Section 4 refer to the performance of the respective forecasting models on the test set.

We create two different types of models to replicate the oil price prediction approaches from Li et al. (2016) and Zhao et al. (2019). The first type of models ("absolute") predicts the absolute oil price as target variable, i.e., predicts P_{t+1}. The second type of models ("relative") predicts the change in oil price compared to the current price, i.e., P_{t+1} - P_t, and adds this predicted change to the current observed price to get a prediction for the next time period.

We train each neural network model on the training set using an additional hyperparameter optimization step where different values for layer size, dropout rates, initialization method, and optimizer type are used. Specifically, as shown in Fig. 2, we use a time series cross-validation approach to calculate the performance of a specific hyperparameter combination on multiple validation sets. This type of cross validation has the advantage of preserving the time dependency of the observations, as well as separating the hyperparameter optimization step from calculating the test set predictions (Cerqueira et al., 2020). Based on the average cross-validation performance, the best hyperparameters are used to re-train the model on the training data and to calculate the predictions for the test set.

Fig. 2. Neural network hyperparameter selection using cross validation for time series data.
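A minimal sketch of the chronological 80/20 split and the time series cross-validation idea from Fig. 2, using scikit-learn's TimeSeriesSplit; the file name, DataFrame layout, and number of folds are illustrative assumptions.

    # Minimal sketch: chronological 80/20 split plus time series
    # cross-validation folds for hyperparameter selection (cf. Fig. 2).
    # File name, DataFrame layout, and fold count are assumptions.
    import pandas as pd
    from sklearn.model_selection import TimeSeriesSplit

    df = pd.read_csv("oil_data.csv", index_col=0, parse_dates=True)
    df = df.sort_index()  # the split must respect temporal order

    cutoff = int(len(df) * 0.8)
    train, test = df.iloc[:cutoff], df.iloc[cutoff:]  # no shuffling

    # Each fold validates on observations after its own training portion,
    # so no future information leaks into the fitted model.
    for fold, (tr_idx, val_idx) in enumerate(TimeSeriesSplit(n_splits=5).split(train)):
        print(f"fold {fold}: train={len(tr_idx)} validation={len(val_idx)}")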


With the five different sentiment score approaches (no sentiment, Vader-based sentiment, finance dictionary-based, Watson-based, and a combined model using all three types of sentiment scores), we therefore consider five neural network models for absolute price prediction and for relative price prediction, respectively. In addition, we also consider the five different sentiment scores for the VAR models. For the gasoline spot price spread model based on Baumeister et al. (2018), we consider two different types: first, a "spread" model that uses the prices and data from 1990, and second, a "limited-spread" model that only uses the same training data time frame as the other models (i.e., from 2017 to 2019). Finally, with the no-change forecast model and two additional ensemble models that combine the predictions from the three best RMSE and MAPE models, respectively, this yields a total of 20 forecasting models in the evaluation.

The neural network models for oil price prediction are created using the Python libraries Keras and Tensorflow. The models use an LSTM layer and a Dense layer, where the latter has the purpose to reduce the number of outputs from the LSTM layer to one output (the predicted oil price). LSTMs are selected as they are particularly suited for time series data. To make the machine learning process more stable and faster, we normalize all data using a simple MinMax scaler. The VAR model and the gasoline spot price spread models (based on Baumeister et al. (2018)) were created in R. Similar to the neural networks, both models use the training data to estimate the corresponding parameters, and then apply the trained models on the separate test set to ensure a fair comparison.
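A minimal Keras sketch of the described LSTM-plus-Dense architecture; the window length, layer size, training settings, and placeholder data are illustrative assumptions, not the hyperparameters selected in our optimization.

    # Minimal sketch of the LSTM + Dense architecture described above.
    # Window length, layer size, and training settings are assumptions;
    # the random data exists only to make the sketch runnable.
    import numpy as np
    from sklearn.preprocessing import MinMaxScaler
    from tensorflow import keras

    def build_model(window=10, n_features=2, units=32):
        model = keras.Sequential([
            keras.Input(shape=(window, n_features)),
            keras.layers.LSTM(units),   # extracts temporal patterns
            keras.layers.Dense(1),      # single output: next oil price
        ])
        model.compile(optimizer="adam", loss="mse")
        return model

    # X: sliding windows of (price, sentiment) pairs; y: next-day price.
    scaler = MinMaxScaler()
    X = scaler.fit_transform(np.random.rand(500, 2)).reshape(50, 10, 2)
    y = np.random.rand(50)
    model = build_model()
    model.fit(X, y, epochs=10, batch_size=16, verbose=0)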
For the news articles, we use standard text cleaning steps (e.g., removing double spaces, line breaks, etc.) and process them with three different sentiment analysis methods and their respective Python interfaces. In total, we create 5 different sentiment score scenarios that we include in the various prediction models:

1. No Sentiment: This scenario only uses the WTI oil price and no sentiment scores. We use it as a baseline to check whether the sentiment scores derived from news articles have an impact on the prediction.
2. Vader: An advanced rules- and dictionary-based approach (Hutto and Gilbert, 2014) that has been previously used in oil price prediction models (Zhao et al., 2019). We use Vader to create two different scores: one where the whole article is fed into the algorithm, and a second one where each sentence is scored independently and the average over all sentences is then taken. The latter approach was added as Vader was not intended to analyze long texts (Hutto and Gilbert, 2014). Nonetheless, Vader still seems capable of analyzing longer texts, and a comparison between both scores shows a strong correlation (yet, the scores calculated over the whole text, instead of averaging sentences, are overall closer to neutral sentiment). In the evaluation, this sentiment approach uses both the averaged score over all sentences in an article as well as the overall article score.
3. Dictionary: An approach using two finance-specific dictionaries by Henry (2008) and Loughran and McDonald (2018). This model uses the two sentiment scores obtained separately from the two dictionaries. Henry's finance-specific dictionary was originally used by Li et al. (2016).
4. Watson: This model only uses the sentiment score obtained through the IBM Watson cloud API, using a pre-trained neural network.³
5. Combination: This model uses all three types of sentiment scores (Vader, Dictionary, Watson).

³ Watson Tone Analyzer, https://www.ibm.com/watson/services/tone-analyzer/. Last accessed: August 02, 2022.

The resulting sentiment scores are in the range of [−1, 1], where −1 indicates negative sentiment and +1 positive sentiment. Two examples of news article sentiment, specifically a positive and a negative sentiment, are provided in the Appendix.

3.3. Model evaluation metrics

The models are evaluated using three separate metrics. First, the Root Mean Squared Error (RMSE), a standard metric for numeric forecasts that compares the predictions \hat{y} against the actual values y:

RMSE = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}}    (9)

The RMSE is a standard evaluation metric in oil price prediction (Baumeister and Kilian, 2012; Baumeister et al., 2018; Zhao et al., 2019), making a comparison to previous literature easier, and is in general a good measure for how close, on average, the prediction is to the actual value. The RMSE is calculated for all forecasting models, and the RMSE for the no-change forecast is used as a baseline for further comparison. This no-change baseline is a good measure to analyze if a model actually improves upon the naive no-change forecast and how much we can gain from using a more sophisticated model.

As a second statistical performance measure, we also consider the Mean Absolute Percentage Error (MAPE). The MAPE is another commonly calculated performance measure for time series prediction models and measures the average percentage difference of the predictions to the actual values:

MAPE = 100 \cdot \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|    (10)

Besides RMSE and MAPE, we also include the directional accuracy (DA) (sometimes called the direction statistic, see e.g. Wu et al., 2022), which measures how often the forecast correctly predicts the direction of price change, i.e., whether the price will increase or decrease compared to the current price:

DA = \frac{1}{n} \sum_{i=1}^{n} m_i \times 100\%, where    (11)

m_i = \begin{cases} 1 & \text{if } (\hat{y}_{i+1} - \hat{y}_i) \times (y_{i+1} - y_i) > 0 \\ 0 & \text{otherwise} \end{cases}    (12)

For additional statistical comparison of the prediction errors, we also consider the Diebold–Mariano (DM) test (Diebold and Mariano, 2002). The DM test analyzes if two forecasts are statistically different in their errors, i.e., if one forecast produces significantly better prediction accuracy than the other. Specifically, if e_{1,t} = (y_t - \hat{y}_{1,t}) and e_{2,t} = (y_t - \hat{y}_{2,t}) are the prediction errors from forecasts 1 and 2 for time period t, respectively, it considers a null hypothesis that the expected value of the loss differential, \bar{d} = \frac{1}{T} \sum_t d_t, is 0, i.e., that there is no difference in the expected loss, or prediction errors, of the two forecasts: H_0 : E[\bar{d}] = 0. In this article, we use the common loss differential d_t = e_{1,t}^2 - e_{2,t}^2. The corresponding test statistic is defined as follows, where \bar{d} is the sample mean of the loss differential and 2\pi \hat{f}_d(0) is a consistent estimate of the standard deviation of \bar{d} (Diebold and Mariano, 2002; Diebold, 2015):

DM = \frac{\bar{d}}{\sqrt{\frac{2\pi \hat{f}_d(0)}{T}}}    (13)

The DM test statistic is approximately normally distributed. Rejecting the corresponding null hypothesis then indicates that there is a significant difference in the loss differential between the two forecasts, i.e., that one forecast corresponds to significantly lower prediction errors.
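A minimal numpy sketch of Eqs. (9)-(13) follows. Note that the DM variance term shown here uses the plain sample variance of the loss differential, a common simplification of the spectral estimate 2π f̂_d(0) for one-step-ahead forecasts, not the exact estimator of Diebold and Mariano (2002).

    # Minimal sketch of the evaluation metrics in Eqs. (9)-(13). The DM
    # variance term uses the plain sample variance of d_t, a common
    # simplification for one-step-ahead forecasts.
    import numpy as np
    from scipy import stats

    def rmse(y, y_hat):                       # Eq. (9)
        return np.sqrt(np.mean((y - y_hat) ** 2))

    def mape(y, y_hat):                       # Eq. (10)
        return 100 * np.mean(np.abs((y - y_hat) / y))

    def directional_accuracy(y, y_hat):       # Eqs. (11)-(12)
        correct = (np.diff(y_hat) * np.diff(y)) > 0
        return 100 * np.mean(correct)

    def dm_test(y, y_hat1, y_hat2):           # Eq. (13), squared-error loss
        d = (y - y_hat1) ** 2 - (y - y_hat2) ** 2
        dm = d.mean() / np.sqrt(d.var(ddof=1) / len(d))
        p_value = 2 * (1 - stats.norm.cdf(abs(dm)))  # two-sided, ~N(0,1)
        return dm, p_value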


However, focusing only on statistical measures such as the RMSE and MAPE metrics does not provide information about the financial value and impact of a prediction model; it only quantifies the statistical performance. A model might have a good prediction accuracy, but little financial value.

Hence, for the evaluation we also consider the financial performance of a model. Such a financial performance metric depends on the scenario and how the predictions will be used. We use a simulation where the forecasts are used to make buy or sell decisions for barrels of oil. Specifically, the achievable profit, the necessary investment, and in particular the Return on Investment (ROI) are considered as financial performance metrics and used to compare the various forecasting models.

The simulation considers the test set time horizon and makes a decision for each day in the test set. If the predicted oil price for the next period (depending on the model-specific forecast) is higher than the current price, a barrel of oil is bought at the current price. If the model predicts a price decrease (or no price change), all currently held barrels are sold at market value (the current price). The difference between average buying costs and revenue from selling the oil is the overall profit, assuming no transaction costs in the simulation. While profit is a good measure for financial performance, it needs to be considered relative to the invested amount of money. In this case, the ROI is a suitable measure as it considers profits relative to the investment. For the calculation, the investment needed over the test set period (29.08.2019 to 07.02.2020) equals the maximum amount of dollars held in oil during this period, and the ROI is calculated using the profit and this maximum amount of dollars held in oil during the test period. We also record the trading frequency as the number of buy and sell transactions to consider potential transaction cost effects. Algorithm 1 shows a brief pseudocode for the ROI simulation.

Algorithm 1: ROI Simulation of the Oil Price Forecasts

Data: Predictions ŷ from the different forecasting models
Result: ROI impact of predictions, trading frequency
begin
    profit = 0
    currentInvestment = 0
    maxInvestment = 0
    barrelsHeld = 0
    buyTransactions = 0
    sellTransactions = 0
    for time period t in test data do
        if ŷ_{t+1} > y_t then
            buy one barrel of oil at current price
            barrelsHeld += 1
            currentInvestment += y_t
            buyTransactions += 1
            if currentInvestment > maxInvestment then
                maxInvestment = currentInvestment
        else
            if barrelsHeld > 0 then
                profit += barrelsHeld * y_t - currentInvestment
                barrelsHeld = 0
                currentInvestment = 0
                sellTransactions += 1
    tradingFrequency = buyTransactions + sellTransactions
    return profit, maxInvestment, tradingFrequency

Similar to using the RMSE of the no-change forecast as baseline, the ROI of different models should be compared against a baseline ROI to realistically compare the financial performance of the different models. Since the no-change forecast would never yield a positive or negative ROI in this simulation (as the predicted price is always the same as the day before), a Monte Carlo simulation is used to simulate a random walk, i.e., to randomly predict if the oil price is going to increase or decrease in the next time period. For each time period in the test data set, a random number is drawn from a uniform distribution to decide whether the price is going to increase or decrease. Using this approach for all periods in the test data set yields buy/sell decisions and thus an ROI. This process is repeated 10,000 times and the resulting ROIs are averaged to get the final ROI baseline. As the expected value for such a random walk is the no-change forecast (i.e., increases are as likely as decreases), this is a valid and robust way to generate a ROI baseline. Table 1 summarizes the main evaluation design parameters.

Table 1
Overview of the model and design parameters in the evaluation.

Design parameter | Values
Data sources | WTI crude oil price, inflation rates, economic indicators, news articles
Prediction models | No-change forecast, VAR (Baumeister and Kilian, 2015), Spread-based (Baumeister et al., 2018), LSTM Neural Network (Li et al., 2016; Zhao et al., 2019), Ensemble Models
Sentiment scores | No sentiment, VADER (Hutto and Gilbert, 2014), Dictionary-based (Henry, 2008; Loughran and McDonald, 2018), Watson, Combined
Performance metrics | RMSE, MAPE, ROI, DA
Forecast periods | 1, 3, 5

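As a complement to Algorithm 1, the following is a minimal executable sketch of the trading simulation, assuming parallel arrays of one-step-ahead predictions and realized prices; it illustrates the logic, not the exact evaluation code.

    # Minimal executable sketch of the ROI simulation in Algorithm 1.
    # prices[t] is the realized price and preds[t] the model's forecast
    # for t+1; both are illustrative inputs. Transaction costs are ignored.
    def roi_simulation(preds, prices):
        profit = invested = max_invested = 0.0
        barrels = buys = sells = 0
        for pred_next, price in zip(preds, prices):
            if pred_next > price:       # expected increase: buy one barrel
                barrels += 1
                invested += price
                buys += 1
                max_invested = max(max_invested, invested)
            elif barrels > 0:           # expected decrease: sell all holdings
                profit += barrels * price - invested
                barrels = 0
                invested = 0.0
                sells += 1
        roi = profit / max_invested if max_invested else 0.0
        return roi, profit, max_invested, buys + sells

    # Usage: roi, profit, capital, trades = roi_simulation(preds, prices)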

4. Results

Following the previously described methodology, we first compare the model performance for one-period forecasts using standard statistical metrics. Afterwards, we also consider the financial performance of the models using the ROI metric described in Section 3. Finally, we rank the different models based on their respective performance and provide additional robustness checks by considering multi-period forecasts.

4.1. Model comparison of one-period forecasts using statistical metrics

For all models, we first calculate the RMSE, MAPE, and DA scores on the test data set as described in Section 3. We also consider the Diebold–Mariano test (Diebold and Mariano, 2002) to compare prediction errors for statistical differences.

Overall, Fig. 3 visually shows that the prediction models, including the no-change forecast, follow the WTI oil price reasonably well (for visualization purposes, we show only the best and worst prediction models based on their RMSE performance). In general, some of the models tend to lag behind the exact WTI price, a behavior which is particularly obvious at inflection points.

Fig. 3. Comparison of the best and worst forecasting models (according to RMSE) with the WTI price (black line).

As differences between the forecasting models can be hard to identify visually, Table 2 compares the RMSE and MAPE performance of the different models as well as the performance relative to the no-change forecast. The results reveal several interesting insights.

Table 2
RMSE, MAPE, and DA performance relative to the no-change forecast, ranked by RMSE. The DM significance count column indicates the number of times this method produced statistically significantly better forecasts, measured by a Diebold–Mariano test, compared to other approaches.

Model type | Sentiment | RMSE rank | RMSE | RMSE relative | DM Sig. Count | MAPE | MAPE relative | DA
Ensemble best RMSE | All | 1 | 1.182 | 0.926 | 0 | 1.443 | 1.020 | 0.67
Neural network absolute | Vader | 2 | 1.194 | 0.936 | 0 | 1.472 | 1.040 | 0.63
Neural network absolute | Watson | 3 | 1.212 | 0.950 | 3 | 1.395 | 0.986 | 0.62
Neural network absolute | All | 4 | 1.232 | 0.966 | 0 | 1.650 | 1.167 | 0.68
Ensemble best MAPE | All | 5 | 1.243 | 0.974 | 2 | 1.392 | 0.984 | 0.63
Neural network relative | All | 6 | 1.250 | 0.980 | 0 | 1.421 | 1.005 | 0.63
Neural network relative | Vader | 7 | 1.252 | 0.982 | 0 | 1.402 | 0.991 | 0.59
VAR | Vader | 8 | 1.254 | 0.983 | 0 | 1.403 | 0.992 | 0.58
VAR | Watson | 9 | 1.260 | 0.987 | 0 | 1.415 | 1.000 | 0.62
Neural network absolute | Dict | 10 | 1.261 | 0.988 | 0 | 1.483 | 1.048 | 0.60
Neural network absolute | – | 11 | 1.263 | 0.990 | 0 | 1.416 | 1.001 | 0.64
Neural network relative | Watson | 12 | 1.266 | 0.992 | 0 | 1.399 | 0.989 | 0.63
Neural network relative | – | 13 | 1.272 | 0.997 | 0 | 1.421 | 1.005 | 0.62
Neural network relative | Dict | 14 | 1.272 | 0.997 | 0 | 1.419 | 1.003 | 0.63
VAR | – | 15 | 1.275 | 0.999 | 0 | 1.401 | 0.990 | 0.62
No-change | – | 16 | 1.276 | 1.000 | 0 | 1.415 | 1.000 | 0.63
VAR | All | 17 | 1.277 | 1.001 | 0 | 1.484 | 1.049 | 0.59
VAR | Dict | 18 | 1.292 | 1.013 | 0 | 1.508 | 1.066 | 0.63
Spread | – | 19 | 1.321 | 1.035 | 0 | 1.456 | 1.029 | 0.57
Limited-spread | – | 20 | 1.321 | 1.036 | 0 | 1.478 | 1.045 | 0.57

First, looking at the RMSE performance on the test set, we see that not all models actually yield a better performance than the no-change forecast. Specifically, the spread-based models and two of the VAR models provide RMSEs slightly higher than the no-change forecast. In contrast, most of the machine learning models using an LSTM neural network architecture perform best, leading to the smallest RMSEs and the highest relative gain compared to the no-change forecast (between 1% and 7%). These results indicate that using machine learning predictions can yield significant increases in prediction accuracy measured using RMSE.

Second, when looking at the impact of sentiment scores on the predictions, we can see two interesting aspects. On the one hand, including a sentiment score in the VAR models has a mixed effect on the prediction accuracy. Two of the sentiment scores, Vader and Watson, lead to better RMSE performance than not including sentiment (essentially an autoregressive model). On the other hand, we can see a positive effect of including the sentiment scores for the neural networks. Specifically, the neural network models using Vader sentiment scores, Watson sentiment scores, and the absolute neural network model with the combination of all sentiment scores yield the lowest absolute RMSE scores, indicating that sentiment scores offer useful information for this type of forecasting model. Finally, among the different sentiment score models, the Vader-based approach yields the best RMSE results, followed by the Watson-based approach. Yet, as this is only based on a specific scenario and finance-related articles, the appropriateness of different sentiment approaches needs to be evaluated in the respective context.

Third, we also compare the prediction errors on the test set for statistical differences. Specifically, we compare each pair of forecasts on the test set, e.g., the forecasts from the neural network with all sentiment scores with the forecasts from a VAR model with all sentiment scores, and run the Diebold–Mariano (DM) tests introduced in Section 3.3 for each comparison. The DM significance count column in Table 2 counts the number of times the respective forecast leads to a lower expected loss on the test set (measured by MSE) compared to another forecast at a 0.05 significance level. We can see, e.g., that the Ensemble Best MAPE model yields a forecast that is statistically better, measured via the DM test, than 2 of the other forecasts. Similarly, the absolute neural network with Watson sentiment leads to lower prediction errors than 3 other forecasts. Given that there are 20 different forecasts, this also means that most pairwise comparisons do not lead to a statistically significantly different expected loss on the test set. In other words, most of the forecasts do not lead to a significantly different expected loss measured via MSE. While we include this statistical comparison as additional information, care must be taken when using and interpreting the DM test results. Specifically, the statistical results show a potential difference in expected loss for the given time period, but are not necessarily the best choice when comparing and deciding between different prediction models (Diebold, 2015). This also corroborates our point that while a statistical analysis can be useful, existing (or non-existing) statistical differences can still yield substantially different results on other evaluation metrics, as we will show in the subsequent section.

As a second common statistical forecasting metric, we also consider the Mean Absolute Percentage Error (MAPE) in the performance evaluation. Table 2 also provides the MAPE metric, along with a relative change compared to the no-change forecast. The results show that better RMSE performance does not necessarily correspond to better MAPE performance, as indicated by the relatively higher MAPE values for the spread and finance dictionary-based models. The Pearson correlation between RMSE and MAPE performance is ρ = 0.003, indicating no correlation. As the model hyperparameter selection focuses on selecting parameters with a good RMSE performance, this result is not fully unexpected, though. From a practical perspective, the results show that the selection of a specific statistical forecasting metric can significantly impact the evaluation of oil price prediction models.

Fourth, considering the directional accuracy DA, we can see that while two of the best RMSE models (Ensemble Best RMSE, Neural Network Absolute with a Combined Sentiment) have the highest directional accuracy, the results overall are mixed. The Pearson correlation between the RMSE score and the directional accuracy DA is ρ = −0.62, indicating that lower RMSE tends to correlate with higher directional accuracy, yet the correlation is only moderate.

Finally, to investigate if combining the predictions of different model types is able to further increase the statistical performance of the models, we build two ensemble models. The first ensemble model takes the average predicted price from the three models with the best RMSE performance: the absolute neural network with all sentiment scores, the absolute neural network with Vader sentiment, and the absolute neural network with Watson sentiment. Similarly, the second ensemble model combines and averages the predictions of the three models with the best MAPE performance. For RMSE, we can see that the ensemble of the best three models further improves the predictions to yield the lowest RMSE score of all models. Compared to the no-change forecast, the predictions lead to a reduction in RMSE of 7.4%. Similarly, the ensemble model combining the best MAPE models also (slightly) reduces the MAPE, yielding a reduction of 1.6% compared to the no-change forecast.
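A minimal sketch of this ensemble construction, assuming the individual model predictions are available as equal-length numpy arrays keyed by (illustrative) model names:

    # Minimal sketch: build an ensemble by averaging the predictions of
    # the k models with the lowest RMSE. Model names are illustrative.
    import numpy as np

    def best_k_ensemble(predictions, y_true, k=3):
        """predictions: dict mapping model name -> array of test forecasts."""
        ranked = sorted(
            predictions,
            key=lambda m: np.sqrt(np.mean((y_true - predictions[m]) ** 2)),
        )
        members = ranked[:k]
        ensemble = np.mean([predictions[m] for m in members], axis=0)
        return members, ensemble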


Table 3
Model performance ranked by ROI. No-change refers to the no-change forecast, and Random Walk to a randomized investment decision.
Model type Sentiment ROI rank RMSE MAPE Profit Invested capital ROI Number buy Number sell Trade frequency
Neural network absolute Watson 1 1.212 1.395 85.530 558.780 0.153 59 16 75
Ensemble best MAPE All 2 1.243 1.392 81.240 558.780 0.145 54 16 70
Ensemble best RMSE All 3 1.182 1.443 56.100 396.020 0.142 74 24 98
Neural network relative – 4 1.272 1.421 192.090 1531.230 0.125 101 5 106
VAR Dict 5 1.292 1.508 75.990 694.130 0.110 80 18 98
VAR Watson 6 1.260 1.415 99.280 1029.120 0.096 80 13 93
VAR Vader 7 1.254 1.403 100.450 1275.920 0.079 82 15 97
Neural network relative Vader 8 1.252 1.402 66.790 1065.070 0.063 68 12 80
Neural network absolute Vader 9 1.194 1.472 33.680 603.460 0.056 75 18 93
Spread – 10 1.321 1.456 53.110 962.850 0.055 35 7 42
Neural network relative All 11 1.250 1.421 94.990 1774.810 0.054 69 11 80
VAR – 12 1.275 1.401 78.090 1521.030 0.051 86 17 103
Neural network relative Dict 13 1.272 1.419 29.450 654.400 0.045 67 12 79
VAR All 14 1.277 1.484 18.550 644.430 0.029 72 23 95
Neural network absolute Dict 15 1.261 1.483 3.210 494.350 0.006 54 14 68
Neural network absolute All 16 1.232 1.650 2.900 694.280 0.004 70 19 89
Limited-spread – 17.5 1.321 1.478 0.000 5719.910 0.000 101 0 101
Neural network relative Watson 17.5 1.266 1.399 0.000 0.000 0.000 0 0 0
Neural network absolute – 19 1.263 1.416 −1.330 357.440 −0.004 32 13 45
No-change/Random walk – 20 1.276 1.415 −5.163 352.597 −0.015 55 27 82

Similarly, the second ensemble model combines and averages the predictions of the three models with the best MAPE performance. For RMSE, we can see that the ensemble of the best three models further improves the predictions to yield the lowest RMSE score of all models. Compared to the no-change forecast, the predictions lead to a reduction in RMSE of 7.4%. Similarly, the ensemble model combining the best MAPE models also (slightly) reduces the MAPE, yielding a reduction of 1.6% compared to the no-change forecast.

4.2. Model comparison of one-period forecasts using ROI metric

To estimate the financial performance of the different forecasting models, we apply the simulation described in Section 3 to show the potential financial impact of using different prediction models. Specifically, the forecasts from the different models are used to make buy/sell decisions for oil barrels, and the overall profit is compared to the needed investment/capital to calculate the overall ROI for the different models (refer to Algorithm 1 in Section 3.3 for a detailed description). Here, we use a random walk as a baseline. Specifically, the simulation logic would result in a zero ROI for the no-change forecast, as the predicted price of the next period is never different from the current price, resulting in no transactions. Hence, a random walk, where it is equally likely to buy or sell, is a better baseline for the comparison of the ROI performance.
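Algorithm 1 itself is specified in Section 3.3 and is not reproduced here; the sketch below shows one trading logic that is consistent with the description above (buy one barrel when the price is predicted to rise, liquidate the inventory when it is predicted to fall, no transaction costs). The exact position and capital accounting of the simulation used in the paper may differ.

    def simulate_trading(prices, forecasts):
        # prices[t] is the current spot price; forecasts[t] is the prediction,
        # made at time t, of the next period's price
        inventory, invested, proceeds = 0, 0.0, 0.0
        n_buy = n_sell = 0
        for t in range(len(prices) - 1):
            if forecasts[t] > prices[t]:        # price predicted to rise: buy one barrel
                inventory += 1
                invested += prices[t]
                n_buy += 1
            elif forecasts[t] < prices[t] and inventory > 0:  # predicted fall: sell holdings
                proceeds += inventory * prices[t]
                inventory = 0
                n_sell += 1
        proceeds += inventory * prices[-1]      # liquidate remaining barrels at the last price
        profit = proceeds - invested
        roi = profit / invested if invested else 0.0
        return profit, invested, roi, n_buy, n_sell

Under this logic, the no-change forecast never triggers a trade (forecasts[t] always equals prices[t]), which is exactly why it yields a zero ROI; a per-trade cost could be subtracted inside the loop to study the transaction-cost sensitivity discussed below.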
Table 3 shows the ROI along with the profit and invested capital for the different models. Interestingly, while the ROI ranking confirms that the neural network models generally outperform the VAR models, we see several changes in the ranking. First, all models yield higher ROIs than the random walk forecast, despite some models having a higher RMSE than the random walk on the test set. This confirms that for the selection of an appropriate forecasting model, the application of the predictions in the specific context (in this case, for buy/sell decisions as specified in the ROI simulation) needs to be considered as part of their evaluation and model selection as well. Second, the spread-based model yields an ROI of 0.055, even though it had a worse RMSE performance than the No-Change forecast. The spread-based model trained on the limited training data yielded an ROI of 0 in the simulation yet had the highest capital investment. This is driven by the fact that this model only had buy-transactions and no sell-transactions in the simulation, indicating that in the simulation, the spread-based forecast consistently predicted increasing prices compared to the current price, leading to a high number of buy-transactions. Third, similar to before, the neural network models also yield the best ROIs. Fourth, considering the relative performance of the sentiment approaches, for the neural network model, including Watson- and Vader-based sentiment improves the resulting ROI. This confirms previous findings that including Vader- and Watson-based sentiment in oil price prediction models yields better results in terms of RMSE. It also shows that including sentiment and using the resulting augmented predictions can translate into substantial financial benefits measured via ROI.

Finally, while we do not include specific transaction costs in the simulation, we also record the number of resulting buy and sell trades as well as the overall trading frequency (sum of buy and sell transactions) that each forecast would incur (see the last 3 columns in Table 3). A larger number of trades would incur higher total transaction costs, which could then lead to specific forecasts being more profitable than others. For the number of trades, we see that the random walk forecast would, on average, lead to 55 buy and 27 sell trades, i.e., a total of 82 trades. In comparison, the three highest-ranked models (according to the ROI metric) would use 75, 70, and 98 trades, respectively. So, even without making an assumption about specific transaction costs, we can see that the two highest-ranked forecasts would still lead to a higher expected ROI even with added transaction costs, since adding transaction costs would further lower the ROI of model predictions that incur a higher number of trades.

4.3. Statistical and financial performance rankings for one-period forecasts

To further compare the performance of the various approaches in terms of both statistical and financial rankings, we rank the different approaches based on their respective performance. First, we create a ranking from best (rank 1) to worst for each of the performance metrics (RMSE, MAPE) as well as the financial ROI metric. Second, we then average the ranks for the RMSE and MAPE metrics to create an 'average rank' for the statistical performance of the specific model. Finally, we create an overall rank that uses the (equally weighted) performance rank and the ROI rank to create a measure of overall performance.

Table 4 shows the resulting ranking, where the models are sorted from best to worst based on their overall rank.
Table 4
Comparison of rankings based on different evaluation metrics. Overall rank is the average of statistical performance and ROI rank.
Model type Sentiment RMSE rank MAPE rank Average rank ROI rank Overall rank
Neural network absolute Watson 3 2 2.5 1 1.75
Ensemble best MAPE All 5 1 3.0 2 2.50
Ensemble best RMSE All 1 13 7.0 3 5.00
Neural network relative Vader 7 5 6.0 8 7.00
VAR Vader 8 6 7.0 7 7.00
VAR Watson 9 7 8.0 6 7.00
Neural network relative – 13 11 12.0 4 8.00
Neural network absolute Vader 2 15 8.5 9 8.75
Neural network relative All 6 12 9.0 11 10.00
VAR – 15 4 9.5 12 10.75
VAR Dict 18 19 18.5 5 11.75
Neural network relative Dict 14 10 12.0 13 12.50
Neural network relative Watson 12 3 7.5 17.5 12.50
Spread – 19 14 16.5 10 13.25
Neural network absolute All 4 20 12.0 16 14.00
Neural network absolute Dict 10 17 13.5 15 14.25
Neural network absolute – 11 9 10.0 19 14.50
VAR All 17 18 17.5 14 15.75
No-change/Random walk – 16 8 12.0 20 16.00
Limited-spread – 20 16 18.0 17.5 17.75

Fig. 4. Comparison of the ROI and RMSE (left) and MAPE (right) performance of the various prediction models.

The overall rank is calculated as the average of the statistical performance rank and the financial (ROI) rank, where the statistical performance rank is the average of the RMSE and MAPE ranks of the models.4 This ranking confirms that, generally, the neural network approaches using different sentiment techniques are ranked highest, on average. Furthermore, while the performance ranks and the financial ROI rank are correlated (Spearman rank correlation coefficient of 𝜌𝑠 = 0.80), we see that this correlation is not perfect and that some models perform better in financial terms than in statistical performance terms. Specifically, when model selection is performed based on the statistical RMSE rank, Ensemble Best RMSE would be selected, whereas using the financial performance rank would yield the Neural Network with Watson sentiment as the suggested prediction model. This confirms our previous findings that the decision to use a specific evaluation metric, or combination of metrics, is crucial for the selection of a sentiment-based oil price prediction model, and for forecasting techniques in general.

4 We note that this equal weighting of statistical and financial performance is only one option to combine the rankings. The respective weighting will depend on the given scenario and will impact the resulting overall rank.

Fig. 4 shows the RMSE/MAPE and ROI performance of the prediction models. On the one hand, we see a moderate correlation between ROI and RMSE (Pearson correlation 𝜌 = −0.412): while, in general, better RMSE performance corresponds to better ROI performance, there are models that have relatively stronger ROI or RMSE performance compared to the other models. On the other hand, the comparison of the MAPE and ROI performance reveals a weaker correlation (Pearson correlation of 𝜌 = −0.30), showing that statistical MAPE performance and financial ROI performance are not necessarily indicative of the respective other metric.
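The rank aggregation and the reported rank correlations can be sketched as follows (pandas/scipy); the three example rows are taken from Table 3, while the full evaluation uses all twenty models.

    import pandas as pd
    from scipy.stats import pearsonr, spearmanr

    df = pd.DataFrame({
        "model": ["NN absolute Watson", "Ensemble best MAPE", "VAR Vader"],
        "rmse": [1.212, 1.243, 1.254],
        "mape": [1.395, 1.392, 1.403],
        "roi": [0.153, 0.145, 0.079],
    })

    df["rmse_rank"] = df["rmse"].rank()                # rank 1 = lowest error
    df["mape_rank"] = df["mape"].rank()
    df["roi_rank"] = df["roi"].rank(ascending=False)   # rank 1 = highest ROI
    df["stat_rank"] = (df["rmse_rank"] + df["mape_rank"]) / 2    # average statistical rank
    df["overall_rank"] = (df["stat_rank"] + df["roi_rank"]) / 2  # equal weighting, cf. footnote 4

    rho_s, _ = spearmanr(df["stat_rank"], df["roi_rank"])  # rank correlation reported above
    rho, _ = pearsonr(df["rmse"], df["roi"])               # metric-level correlation shown in Fig. 4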
4.4. Statistical and financial performance for multi-period forecasts

To further analyze the robustness of the results, we now consider multi-period forecasts. Instead of predicting the oil price for the next time period as in the previous evaluation, we consider prediction models that forecast the oil price 3 and 5 time periods in advance. To be specific, the prediction models with a 3 time period forecasting horizon aim to predict the oil price 3 time periods from the current time period, and models with a 5 time period horizon predict the oil price 5 time periods in advance. In general, predicting further into the future will make it harder for the prediction models to correctly forecast the oil price, a fact that is indicated by the higher RMSE scores for the following 3- and 5-period prediction models.
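A common way to set up such h-step-ahead targets is to shift the price series by the forecasting horizon, so that the features observed at time t are paired with the price at time t + h. A minimal sketch (pandas; the small example frame and its column names are placeholders for the actual price and sentiment data):

    import pandas as pd

    def make_horizon_dataset(df, horizon):
        # pair today's features with the oil price `horizon` periods ahead;
        # trailing rows without a future price are dropped
        target = f"price_t_plus_{horizon}"
        out = df.copy()
        out[target] = out["price"].shift(-horizon)
        return out.dropna(subset=[target])

    data = pd.DataFrame({
        "price": [60.1, 60.4, 59.8, 61.0, 61.3, 60.7, 60.2, 61.5],
        "sentiment": [0.20, -0.10, 0.00, 0.30, 0.10, -0.20, 0.05, 0.15],
    })
    three_step = make_horizon_dataset(data, horizon=3)  # training data for the 3-period models
    five_step = make_horizon_dataset(data, horizon=5)   # training data for the 5-period models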
Table 5
Model performance ranked by ROI for 3-period forecasts. No-change refers to the no-change forecast, NN-A to neural network absolute, NN-R to neural network relative, and VAR
to the vector auto-regressive models.
Model type Sentiment ROI rank RMSE DM 𝐷𝐴 MAPE Profit Invested capital ROI Number buy Number sell Trade frequency
Ens. Best MAPE All 1.5 1.658 16 0.62 2.253 132.780 558.780 0.238 74 12 86
Ens. Best RMSE All 1.5 1.658 16 0.62 2.253 132.780 558.780 0.238 74 12 86
NN-A All 3.0 1.740 1 0.62 2.399 71.730 502.280 0.143 61 13 74
NN-A Vader 4.0 1.853 1 0.68 2.628 136.610 1068.550 0.128 85 10 95
NN-R Vader 5.0 1.886 4 0.57 2.544 40.640 364.260 0.112 21 6 27
VAR Vader 6.0 1.810 12 0.57 2.446 285.240 2794.390 0.102 90 6 96
VAR Watson 7.0 1.841 6 0.53 2.497 197.430 2066.140 0.096 83 8 91
NN-R Watson 8.0 1.872 5 0.55 2.526 89.230 1154.070 0.077 76 15 91
NN-A Watson 9.0 1.715 13 0.53 2.366 56.130 910.910 0.062 77 13 90
VAR – 10.0 2.064 0 0.55 2.837 26.090 438.560 0.060 64 21 85
NN-A Dict 11.0 1.906 1 0.56 2.677 57.130 1365.680 0.042 84 9 93
VAR All 12.0 1.890 0 0.57 2.574 45.310 1312.100 0.034 74 14 88
Limited-spread – 13.0 1.949 0 0.59 2.642 28.290 986.050 0.029 54 12 66
NN-A – 14.0 1.896 1 0.58 2.574 31.440 1146.740 0.027 93 12 105
NN-R All 15.0 1.889 3 0.58 2.546 0.860 60.430 0.014 2 2 4
Spread – 16.0 1.949 0 0.59 2.636 11.230 986.050 0.011 49 12 61
VAR Dict 17.0 1.941 12 0.58 2.628 9.510 1881.620 0.005 86 10 96
NN-R Dict 18.5 1.908 0 0.58 2.574 0.000 0.000 0.000 0 0 0
NN-R – 18.5 1.907 0 0.58 2.572 0.000 0.000 0.000 0 0 0
No-change/Random walk – 20.0 1.916 0 0.58 2.584 −5.732 346.118 −0.017 54 27 81

Table 5 shows the combined RMSE, MAPE, 𝐷𝐴, and ROI performance of the models trained for a 3-period forecast. In terms of RMSE performance, the neural networks and resulting ensemble prediction models lead to significantly better predictions compared to the no-change forecast measured by RMSE. Specifically, the ensemble prediction models yield a 13% lower RMSE compared to the no-change forecast (note that both ensemble models yield the same predictions here since the three lowest-RMSE models also correspond to the three lowest-MAPE models). This is also reflected in the number of positive DM tests measuring a lower prediction error, as the results show that the ensemble models yield statistically significantly lower prediction errors in 16 out of 19 cases. Similarly, the best neural network models also yield higher directional accuracy than other models. Considering the impact of including sentiment scores, we again see that Watson- and Vader-based sentiment yield the highest improvements in RMSE and ROI performance. Considering ROI specifically, while the general range of ROI scores is similar to before (0%–14%), the ensemble models in particular yield a high ROI of over 23%. In terms of correlation between statistical and financial performance metrics, the correlation between RMSE and ROI is slightly stronger than before, with 𝜌 = −0.77. Overall, for three forecast periods, the general results and key take-aways align with the results for the one-period forecast.

Looking at the 5-period forecasts in Table 6, we again see a generally similar behavior as before. In terms of RMSE performance, in this case the VAR models with Vader and Watson sentiment, respectively, yield the lowest RMSE scores, and the resulting prediction error is also often statistically lower than the prediction error of other models (see the DM column in Table 6 for the number of positive Diebold–Mariano tests). This generally confirms that the Vader and Watson sentiment scores seem to be the most useful for improving the prediction accuracy. Finally, in terms of ROI scores, both the general range (0%–16%) and the correlation between RMSE rank and ROI rank (𝜌 = −0.355) are similar to before. Despite this correlation, the discrepancy between statistical RMSE and financial ROI performance can be particularly seen for the absolute neural network using a combination of all sentiment scores, which yields the second-highest ROI score while having a higher RMSE score than the no-change forecast.

To summarize, both multi-period forecasts confirm the robustness of the previous findings. First, RMSE and ROI performance are moderately correlated in general. Second, including sentiment scores, particularly Watson- and Vader-based sentiment, yields improved predictions. And third, prediction models which yield the best RMSE scores are not necessarily the ones yielding the highest ROI score (and vice versa).

4.5. Discussion, limitations, and future work

In the previous sections, we evaluate the performance of various forecasting models and sentiment scores for oil price prediction. Compared to a baseline no-change forecast, the RMSE values show potential improvements of 1% to 7% for the one-period forecasts and up to 11%–13% for multi-period forecasts. Considering the different sentiment scores, the fact that the Vader- and Watson-based approaches perform better than finance-specific dictionaries is surprising, since in general the language of a finance-specific news article tends to differ from everyday language. While Vader was originally designed for Facebook posts and tweets and not for technical language (Hutto and Gilbert, 2014), and Watson is also a general-purpose sentiment tool, they lead to superior results for the oil price prediction data used in this evaluation. This indicates that practitioners and researchers should consider evaluating different sentiment dictionaries and approaches when using sentiment scores for prediction tasks, and select the approach with the highest predicted performance for a given scenario.

In general, the RMSE has a moderate negative correlation with the ROI (between 𝜌 = −0.355 and 𝜌 = −0.77). While this is not an unexpected result, Table 3 shows that some forecasting models perform much better in terms of ROI as compared to RMSE. For example, while the neural network models tend to perform best both in terms of RMSE and ROI, the limited-spread model, while having a higher RMSE, leads to a good ROI outcome (albeit requiring a higher capital investment than the neural network models). Similarly, when comparing the MAPE performance with the ROI metric, we only see a weak negative correlation (𝜌 = −0.30 for one-period forecasts). This indicates that selecting an appropriate evaluation metric is crucial in ranking and selecting specific forecasting models for oil price prediction, and in forecasting in general.

Overall, the evaluation shows that both traditional forecasting accuracy metrics as well as financial performance metrics such as ROI need to be considered for the prediction of oil prices. Specifically, while better RMSE performance tends to correlate with better ROI performance for the simulation used in this evaluation, some models perform comparatively much better on one metric than on the other. In addition, the models not only yield a different ROI, but also require substantially different capital investments, which is another factor that needs to be considered in the selection of an appropriate forecasting model.

While the results provide interesting and useful insights for oil price predictions, there are several limitations that can be addressed in future work.
Table 6
Model performance ranked by ROI for 5-period forecasts. No-change refers to the no-change forecast, and Random walk to a randomized investment decision.
Model type Sentiment ROI rank RMSE DM 𝐷𝐴 MAPE Profit Invested capital ROI Number buy Number sell Trade frequency
NN-A Watson 1 2.164 2 0.64 2.996 79.200 488.090 0.162 46 10 56
NN-A All 2 2.354 0 0.58 3.078 191.790 1555.490 0.123 80 8 88
VAR Vader 3 2.097 12 0.60 2.911 428.100 3482.210 0.123 91 4 95
Ens. Best RMSE All 4 2.088 12 0.57 2.918 386.630 3163.010 0.122 85 4 89
Ens. Best MAPE All 5 1.974 18 0.59 2.699 237.650 2074.690 0.114 83 4 87
NN-A Vader 6 2.138 3 0.58 2.835 173.720 1555.490 0.112 80 7 87
NN-R – 7 2.231 1 0.57 3.155 549.920 4954.570 0.111 103 1 104
VAR Watson 8 2.125 12 0.53 2.990 274.560 2601.970 0.106 91 3 94
NN-A Dict 9 2.390 0 0.64 3.383 62.900 703.000 0.090 54 16 70
NN-R Vader 10 2.197 1 0.59 3.067 263.020 3182.900 0.083 76 5 81
VAR – 11 2.454 0 0.67 3.499 26.090 438.560 0.060 62 21 83
VAR All 12 2.199 1 0.54 3.084 76.230 1364.940 0.056 80 10 90
VAR Dict 13 2.271 12 0.59 3.177 113.320 3062.180 0.037 94 7 101
NN-R Watson 14 2.117 10 0.59 2.954 92.390 3298.410 0.028 78 3 81
Limited-spread – 15 2.325 1 0.53 3.368 21.840 989.820 0.022 36 7 43
Spread – 16 2.322 1 0.53 3.375 20.260 989.820 0.020 34 5 39
NN-R All 17 2.226 1 0.56 3.108 9.640 1495.360 0.006 73 11 84
NN-R Dict 18 2.233 1 0.58 3.146 0.000 6038.300 0.000 106 0 106
No-change/Random walk – 19 2.230 1 0.57 3.171 −3.761 351.383 −0.011 53 26 79
NN-A – 20 2.274 1 0.63 3.155 −54.690 1454.920 −0.038 70 18 88

First, the simulation used to evaluate the predictions in a trading strategy is fairly simple and can be extended. For example, despite counting the number of trade transactions, it does not consider transaction costs that can affect the financial performance of models when the predictions are used in practice (Xing and Zhang, 2022). Depending on the specific transaction costs, the performance ranking in Table 3 might change. Second, the sentiment analysis approach currently uses standard sentiment dictionaries and cleaning procedures, both of which can be augmented. On the one hand, we can further analyze the impact of different sentiment analysis approaches, e.g., how sentiment scores are calculated for a given news article and how nuanced the scores are. While we include several approaches in our evaluation, from simple dictionary-based approaches to pre-trained sentiment algorithms such as Watson, the individual usefulness of a specific sentiment approach needs to be studied on a scenario-specific basis. On the other hand, standard text cleaning procedures, such as removing excess space, unnecessary characters, line breaks, etc., can result in the sentiment scores having high day-to-day variations. In a follow-up study, alternative preparation steps, such as using low-pass filters to generate a smoother sentiment score curve that focuses on general, long-term sentiment, can be considered. These additional steps and their impact on the quality of the predictions can potentially lead to further improvements for oil price predictions or forecasting scenarios in general. Finally, the ML-based prediction models can be further investigated to identify which factors impact the predictions. While standard models such as LSTM neural networks are black-box approaches, concepts and approaches from Explainable AI can be used to shed light on how predictions are generated and which factors are most influential for general or specific predictions (Samek et al., 2019).

5. Conclusion

Due to its challenging nature, oil price prediction has been approached from a variety of angles, using different forecasting models and prediction variables. To decide which prediction model to use, the different approaches and design choices need to be carefully compared and evaluated based on how the predictions will be used in practice. In this article, we compare oil price prediction models using different statistical performance and financial performance metrics to evaluate the quality and usefulness of the resulting predictions. We use WTI oil price data from 2017 to 2020 and (automatically scraped) news articles from different news sites to train, evaluate, and compare different combinations of model types and sentiment analysis approaches. Specifically, we compare previously suggested models for oil price predictions (spot price spread models, vector autoregression models, and LSTM neural networks) and several approaches to derive sentiment scores from the news articles (general and finance dictionaries as well as pre-trained sentiment models). To highlight that the selection of the evaluation metric crucially determines which model should be selected, we compare the models on both statistical performance metrics (RMSE, MAPE, Directional Accuracy) and a financial ROI simulation that sheds light on the practical usefulness and financial impact of the predictions. We also consider both one-period and multi-period forecasts in our analysis.

The key insights and contributions of our article are threefold. First, we shift the focus from purely statistical performance measures for oil price prediction evaluation to a more financial, use-case-oriented one that takes the practical usefulness of the predictions into account. For this, we implement a simulation to calculate the ROI of the predictions from a forecasting model in addition to the RMSE, MAPE, and Directional Accuracy. Statistical measures are commonly used to compare different forecasting models and select a 'best' model for a given scenario, yet they might not be of core concern to practitioners. For example, decreasing the RMSE score by one percent is only relevant if this decrease has a noticeable impact in the application scenario and decision process in which the predictions are used. For the use case of oil price prediction, our results show that evaluating predictions using additional metrics such as ROI performance provides valuable information for the selection of different forecasting techniques. That is, rather than just focusing on standard evaluation metrics such as RMSE or MAPE, the concrete usage of the forecasts needs to be considered to get a more holistic view of the usefulness of the prediction models. Second, our results show that while there is a moderate to strong correlation between the RMSE and ROI metrics used for the statistical and financial performance analysis, respectively, some forecasting models perform substantially better on the financial metric as compared to the statistical RMSE or MAPE metrics. Third, while the difference in performance between forecasting models is typically rather small on statistical metrics, the differences in resulting financial performance can be significantly higher. This, again, highlights the need to critically evaluate the predictions based on how they will be used in practice.

Our results confirm previous research that showed that automatically derived sentiment scores, using different sentiment analysis algorithms, can provide useful additional information for the oil price prediction forecasting models (Li et al., 2016; Zhao et al., 2019). The evaluation indicates that the question of which specific dictionary or sentiment analysis approach to use seems to depend not only on the types of articles used for the prediction and their domain-dependent language, but also on the forecasting model itself. That is, trying different (automated) sentiment approaches to identify which set(s) of sentiment scores add the most predictive power to a forecasting model is an important and worthwhile endeavor. For the considered oil price prediction scenario, RMSE improvements of up to 7% and ROI increases of up to 15% (compared to a no-change or random-walk baseline for one-period forecasts) can be achieved by integrating sentiment scores into the prediction models.
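As an illustration of this integration, the snippet below scores the two example article headlines from the Appendix with VADER and applies a trailing rolling mean as one possible low-pass filter. The vaderSentiment package is one common VADER implementation; the window length and the daily averaging are illustrative choices, not the exact pipeline used for the results above.

    import pandas as pd
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()

    # headlines of the two example articles shown in the Appendix
    articles = pd.Series(
        ["Oil market is balanced, says Qatar energy minister",
         "Forties crude oil loadings slowed due to bad weather"],
        index=pd.to_datetime(["2018-10-07", "2018-02-28"]),
    )

    # VADER's compound score lies in [-1, 1]; average all articles of a day into one score
    scores = articles.map(lambda text: analyzer.polarity_scores(text)["compound"])
    daily = scores.sort_index().groupby(level=0).mean()

    # one possible low-pass filter: a trailing rolling mean over five trading days
    smoothed = daily.rolling(window=5, min_periods=1).mean()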
CRediT authorship contribution statement

Christian Haas: Writing – review & editing, Writing – original draft, Visualization, Validation, Supervision, Software, Project administration, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. Constantin Budin: Writing – original draft, Visualization, Validation, Software, Resources, Project administration, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. Anne d'Arcy: Writing – review & editing, Writing – original draft, Validation, Supervision, Resources, Project administration, Methodology, Investigation, Formal analysis, Conceptualization.

Appendix

The repository for the code used to generate the evaluation of this article is available in the following GitHub repository: https://github.com/OilPricePrediction/EnergyEconomics.

Table A.1
The search terms as used with article counts.

Site      Search term                                        Articles
cnbc      oil prices                                         1914
          saudi arabia                                       1581
          brent oil                                          822
          oil market                                         638
          petroleum                                          345
          gasoline                                           278
          fossil fuels                                       271
          diesel                                             177
          arabian oil                                        169
          kerosene                                           77
          Gazprom                                            74
          crude oil                                          57
          opec                                               38
          WTI                                                28
forbes    ‘‘saudi arabia oil’’ ‘‘oil price’’ ‘‘crude oil’’ opec wti gazprom ‘‘brent oil’’ ‘‘middle east oil’’ ‘‘oil market’’   2861
reuters   brent oil                                          5987
          oil price arabia                                   4102
          opec                                               3462
          oil market price crude                             2332
          middle east oil                                    2026
          oil market price crude rise                        968
          gazprom                                            926
          WTI                                                332
          oil market price crude loss                        174
          oil market price crude fall                        138
          oil market price crude gain                        84
upi       (arabia AND oil) OR (brent AND oil) OR (oil AND prices) OR (oil AND market) OR petroleum OR gasoline OR (crude AND oil) OR opec OR gazprom OR WTI   1909

An example of a news article that is scored as having a positive sentiment is the following Reuters article: https://www.reuters.com/article/opec-qatar-crude/oil-market-is-balanced-says-qatar-energy-minister-idUSB2N1VO00S ‘‘October 7, 2018. Oil market is balanced, says Qatar energy minister. By Reuters Staff, 1 Min Read. DUBAI, Oct 7 (Reuters) - The oil market is balanced in terms of supply and demand, Qatar’s Energy Minister Mohammed al-Sada said on Sunday. ‘‘Geopolitical changes’’ are the reason for the rise in crude prices, he added, cited by the state-run Qatar News Agency. Reporting by Maher Chmaytelli’’.

Another Reuters article provides an example of a negative sentiment news article: https://www.reuters.com/article/north-sea-oil-forties/forties-crude-oil-loadings-slowed-due-to-bad-weather-sources-idUSL8N1QI532 ‘‘Forties crude oil loadings slowed due to bad weather-sources. By Reuters Staff, 1 Min Read. LONDON, Feb 28 (Reuters) - * Loadings of Forties crude oil slowed down on Wednesday due to bad weather and overnight snowfall, two trading sources familiar with the matter said * Pilots are unable to reach the tankers as they arrive to bring them alongside to load oil * It was not immediately clear if oil flows through the Forties pipeline were affected * A spokesman for Ineos could not be reached for immediate comment * Forties is one of five North Sea crude grades used to set the dated Brent benchmark (Reporting by Julia Payne and Ahmad Ghaddar; editing by Jason Neely)’’.

References

Akita, R., Yoshihara, A., Matsubara, T., Uehara, K., 2016. Deep learning for stock prediction using numerical and textual information. In: 2016 IEEE/ACIS 15th International Conference on Computer and Information Science. ICIS, IEEE, pp. 1–6.
Alquist, R., Kilian, L., 2010. What do we learn from the price of crude oil futures? J. Appl. Econometrics 25 (4), 539–573.
Alquist, R., Kilian, L., Vigfusson, R.J., 2013. Forecasting the price of oil. In: Elliott, G., Timmermann, A. (Eds.), Handbook of Economic Forecasting. Vol. 2, North Holland, pp. 427–507.
Bai, X., 2011. Predicting consumer sentiments from online text. Decis. Support Syst. 50 (4), 732–742.
Baughman, M., Haas, C., Wolski, R., Foster, I., Chard, K., 2018. Predicting Amazon spot prices with LSTM networks. In: Proceedings of the 9th Workshop on Scientific Cloud Computing. ScienceCloud ’18, Association for Computing Machinery, New York, NY, USA, http://dx.doi.org/10.1145/3217880.3217881.
Baumeister, C., Kilian, L., 2012. Real-time forecasts of the real price of oil. J. Bus. Econom. Statist. 30 (2), 326–336.
Baumeister, C., Kilian, L., 2015. Forecasting the real price of oil in a changing world: A forecast combination approach. J. Bus. Econom. Statist. 33 (3), 338–351.
Baumeister, C., Kilian, L., Zhou, X., 2018. Are product spreads useful for forecasting oil prices? An empirical evaluation of the Verleger hypothesis. Macroecon. Dyn. 22 (3), 562–580.
Bengio, Y., Simard, P., Frasconi, P., 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 5 (2), 157–166.
Çepni, O., Gupta, R., Pienaar, D., Pierdzioch, C., 2022. Forecasting the realized variance of oil-price returns using machine learning: Is there a role for US state-level uncertainty? Energy Econ. 114, 106229.
Ceron, A., Curini, L., Iacus, S.M., Porro, G., 2014. Every tweet counts? How sentiment analysis of social media can improve our knowledge of citizens’ political preferences with an application to Italy and France. New Med. Soc. 16 (2), 340–358.
Cerqueira, V., Torgo, L., Mozetič, I., 2020. Evaluating time series forecasting models: An empirical study on performance estimation methods. Mach. Learn. 109, 1997–2028.
Chai, J., Xing, L.-M., Zhou, X.-Y., Zhang, Z.G., Li, J.-X., 2018. Forecasting the WTI crude oil price by a hybrid-refined method. Energy Econ. 71, 114–127.
Cheon, A., Urpelainen, J., 2012. Oil prices and energy technology innovation: An empirical analysis. Glob. Environ. Change 22 (2), 407–417.
Date, P., Mamon, R., Tenyakov, A., 2013. Filtering and forecasting commodity futures prices under an HMM framework. Energy Econ. 40, 1001–1013.
Diebold, F.X., 2015. Comparing predictive accuracy, twenty years later: A personal perspective on the use and abuse of Diebold–Mariano tests. J. Bus. Econom. Statist. 33 (1), 1–1.
Diebold, F.X., Mariano, R.S., 2002. Comparing predictive accuracy. J. Bus. Econ. Statist. 20 (1), 134–144.
EIA, E.I.A., 2020. Spot prices. URL: https://www.eia.gov/dnav/pet/pet_pri_spt_s1_d.htm. last accessed August 02, 2022.
Fischer, T., Krauss, C., 2018. Deep learning with long short-term memory networks for financial market predictions. European J. Oper. Res. 270 (2), 654–669.
Ghoddusi, H., Creamer, G.G., Rafizadeh, N., 2019. Machine learning in energy economics and finance: A review. Energy Econ. 81, 709–727.
Gholamy, A., Kreinovich, V., Kosheleva, O., 2018. Why 70/30 or 80/20 relation between training and testing sets: A pedagogical explanation.
Godarzi, A.A., Amiri, R.M., Talaei, A., Jamasb, T., 2014. Predicting oil price movements: A dynamic artificial neural network approach. Energy Policy 68, 371–382.
He, Y., Zhou, D., 2011. Self-training from labeled features for sentiment analysis. Inf. Process. Manage. 47 (4), 606–616.
Henry, E., 2008. Henry’s finance-specific dictionary. URL: https://github.com/sfeuerriegel/SentimentAnalysis/blob/master/data/DictionaryHE.rda. last accessed August 02, 2022.
Hochreiter, S., 1991. Untersuchungen zu dynamischen neuronalen Netzen. Diploma, Tech. Univ. München 91 (1).
Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Comput. 9 (8), 1735–1780.
Hu, J.W.-S., Hu, Y.-C., Lin, R.R.-W., 2012. Applying neural networks to prices prediction of crude oil futures. Math. Probl. Eng. 2012.
Hutto, C.J., Gilbert, E., 2014. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In: Eighth International AAAI Conference on Weblogs and Social Media.
Jammazi, R., Aloui, C., 2012. Crude oil price forecasting: Experimental evidence from wavelet decomposition and neural network modeling. Energy Econ. 34 (3), 828–841.
Knetsch, T.A., 2007. Forecasting the price of crude oil via convenience yield predictions. J. Forecast. 26 (7), 527–549.
Lautier, D., Galli, A., 2004. Simple and extended Kalman filters: an application to term structures of commodity prices. Appl. Financial Econ. 14 (13), 963–973.
Li, Y., Jiang, S., Li, X., Wang, S., 2021. The role of news sentiment in oil futures returns and volatility forecasting: data-decomposition based deep learning approach. Energy Econ. 95, 105140.
Li, J., Xu, Z., Yu, L., Tang, L., 2016. Forecasting oil price trends with sentiment of online news articles. Procedia Comput. Sci. 91, 1081–1087.
Lin, L., Jiang, Y., Xiao, H., Zhou, Z., 2020. Crude oil price forecasting based on a novel hybrid long memory GARCH-M and wavelet analysis model. Physica A 123532.
Liu, B., 2012. Sentiment Analysis and Opinion Mining. Morgan & Claypool Publishers.
Loughran, T., McDonald, B., 2018. Loughran-McDonald master dictionary. URL: https://sraf.nd.edu/textual-analysis/resources/#Master%20Dictionary. last accessed August 02, 2022.
Lu, Q., Li, Y., Chai, J., Wang, S., 2020. Crude oil price analysis and forecasting: A perspective of ‘‘new triangle’’. Energy Econ. 87, 104721.
Luo, Z., Chen, J., Cai, X.J., Tanaka, K., Takiguchi, T., Kinkyo, T., Hamori, S., 2018. Oil price forecasting using supervised GANs with continuous wavelet transform features. In: 2018 24th International Conference on Pattern Recognition. ICPR, IEEE, pp. 830–835.
Manoliu, M., Tompaidis, S., 2002. Energy futures prices: term structure models with Kalman filter estimation. Appl. Math. Finance 9 (1), 21–43.
Medhat, W., Hassan, A., Korashy, H., 2014. Sentiment analysis algorithms and applications: A survey. Ain Shams Eng. J. 5 (4), 1093–1113.
Monge, M., Gil-Alana, L.A., de Gracia, F.P., 2017. US shale oil production and WTI prices behaviour. Energy 141, 12–19.
Movagharnejad, K., Mehdizadeh, B., Banihashemi, M., Kordkheili, M.S., 2011. Forecasting the differences between various commercial oil prices in the Persian Gulf region by neural network. Energy 36 (7), 3979–3984.
OECD, 2020. OECD total inflation (2015=100). URL: https://data.oecd.org/price/inflation-cpi.htm. last accessed August 02, 2022.
Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., Carter, S., 2020. Zoom in: An introduction to circuits. Distill 5 (3), e00024–001.
Pagolu, V.S., Reddy, K.N., Panda, G., Majhi, B., 2016. Sentiment analysis of Twitter data for predicting stock market movements. In: 2016 International Conference on Signal Processing, Communication, Power and Embedded System. SCOPES, IEEE, pp. 1345–1350.
Park, J., Ratti, R.A., 2008. Oil price shocks and stock markets in the US and 13 European countries. Energy Econ. 30 (5), 2587–2608.
Qadan, M., Nama, H., 2018. Investor sentiment and the price of oil. Energy Econ. 69, 42–58.
Qin, Q., Xie, K., He, H., Li, L., Chu, X., Wei, Y.-M., Wu, T., 2019. An effective and robust decomposition-ensemble energy price forecasting paradigm with local linear prediction. Energy Econ. 83, 402–414.
Qiu, G., He, X., Zhang, F., Shi, Y., Bu, J., Chen, C., 2010. DASA: dissatisfaction-oriented advertising based on sentiment analysis. Expert Syst. Appl. 37 (9), 6182–6191.
Ramyar, S., Kianfar, F., 2019. Forecasting crude oil prices: A comparison between artificial neural networks and vector autoregressive models. Comput. Econ. 53 (2), 743–761.
Ribeiro, F.N., Araújo, M., Gonçalves, P., Gonçalves, M.A., Benevenuto, F., 2016. SentiBench - a benchmark comparison of state-of-the-practice sentiment analysis methods. EPJ Data Sci. 5 (1), 1–29.
Sadik, Z.A., Date, P.M., Mitra, G., 2020. Forecasting crude oil futures prices using global macroeconomic news sentiment. IMA J. Manag. Math. 31 (2), 191–215.
Samek, W., Montavon, G., Vedaldi, A., Hansen, L.K., Müller, K.-R., 2019. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, vol. 11700, Springer Nature.
Spiegel, U., Tavor, T., Templeman, J., 2010. The effects of rumours on financial market efficiency. Appl. Econ. Lett. 17 (15), 1461–1464.
Tetlock, P.C., 2007. Giving content to investor sentiment: The role of media in the stock market. J. Finance 62 (3), 1139–1168.
Verleger, P.K., 1982. The determinants of official OPEC crude prices. Rev. Econ. Stat. 177–183.
Wu, C., Wang, J., Hao, Y., 2022. Deterministic and uncertainty crude oil price forecasting based on outlier detection and modified multi-objective optimization algorithm. Resour. Policy 77, 102780.
Xing, L.-M., Zhang, Y.-J., 2022. Forecasting crude oil prices with shrinkage methods: Can nonconvex penalty and Huber loss help? Energy Econ. 110, 106014.
Xu, K., Niu, H., 2022. Do EEMD based decomposition-ensemble models indeed improve prediction for crude oil futures prices? Technol. Forecast. Soc. Change 184, 121967.
Yu, Y., Duan, W., Cao, Q., 2013. The impact of social and conventional media on firm equity value: A sentiment analysis approach. Decis. Support Syst. 55 (4), 919–926.
Zhao, Y., Li, J., Yu, L., 2017. A deep learning ensemble approach for crude oil price forecasting. Energy Econ. 66, 9–16.
Zhao, L., Zeng, G., Wang, W., Zhang, Z., 2019. Forecasting oil price using web-based sentiment analysis. Energies 12 (22), 4291.