
Energy 169 (2019) 160–171

Contents lists available at ScienceDirect

Energy
journal homepage: www.elsevier.com/locate/energy

Crude oil price prediction model with long short term memory deep
learning based on prior knowledge data transfer
Zhongpei Cen*, Jun Wang
Institute of Financial Mathematics and Financial Engineering, School of Science, Beijing Jiaotong University, Beijing 100044, PR China

ARTICLE INFO

Article history: Received 8 March 2018; Received in revised form 25 October 2018; Accepted 3 December 2018; Available online 7 December 2018

Keywords: Deep learning; Crude oil energy market; Long short term memory predicting model; Data transfer; Empirical predictive effect analysis; Ensemble empirical mode decomposition

ABSTRACT

Energy resources have acquired a strategic significance for the economic growth and social welfare of every country throughout history. Therefore, the prediction of crude oil price fluctuations is a significant issue. In recent years, with the development of artificial intelligence, deep learning has attracted wide attention in various industrial fields, and research on using deep learning models to fit and predict time series has developed. In an attempt to increase the accuracy of oil market price prediction, Long Short Term Memory (LSTM), a representative model of deep learning, is applied to fit crude oil prices in this paper. In the traditional application fields of the LSTM, such as natural language processing, it is a consensus that a large amount of data improves the training accuracy. In order to improve the prediction accuracy by extending the size of the training set, transfer learning provides a heuristic data extension approach. Moreover, considering that weighting each historical data point equally when training the LSTM makes it difficult to reflect the changeable behaviors of crude oil markets, a novel algorithm named data transfer with prior knowledge, which provides a more available data extension approach (three data types), is proposed. To compare the predicting performance on the initial data and the transferred data more deeply, the ensemble empirical mode decomposition is applied to decompose the time series into several intrinsic mode functions, and these intrinsic mode functions are utilized to train the models. Further, empirical research is performed to test the prediction of West Texas Intermediate and Brent crude oil prices by evaluating the predicting ability of the proposed model, and the corresponding superiority is also demonstrated.

© 2018 Elsevier Ltd. All rights reserved.

1. Introduction

As the most important non-renewable energy source in the world, crude oil plays a significant and irreplaceable role in economic society. Crude oil is an important raw material for many chemical industrial products, including fertilizers, solvents, plastics and pesticides. The prices of international crude oil rest with the world's major oil-producing areas. For example, on the New York exchange, the United States West Texas crude oil future uses "intermediate base oil (WTI)" as its base oil. The strength of US super-crude oil buyers, coupled with the influence of the New York exchange itself, has made the WTI benchmark oil future the leader among global commodity futures varieties. Oil prices are closely related to the global macroeconomic situation. Many economists propose that high oil prices have a negative impact on global economic growth, while others think that high oil prices are caused by economic growth; generally speaking, the relationship between oil prices and the global economy is very unstable.

There has been a long history of research focused on the prediction of crude oil markets and financial markets. Alvarez-Ramirez et al. [1] analyzed the auto-correlations of international crude oil prices. Chiroma et al. [2] proposed an alternative approach for the prediction of the West Texas Intermediate (WTI) crude oil price. Niu and Wang [3] investigated the statistical behaviors of long-range dependence phenomena and volatility clustering of crude oil prices. Niu and Wang [4] constructed simulative data from a financial model that accords with the real markets to a certain extent. Yu and Wang [5] modeled a discrete time series of the stock price process to compare with the real financial markets. Mordjaoui et al. [6] used a dynamic neural network for the prediction of daily power consumption. Wang and Wang [7] introduced the artificial neural network to train and forecast the fluctuations of return intervals for real data and simulative data. Liao and Wang [8] introduced an improved neural network with a stochastic time effective function to forecast the China stock index.

* Corresponding author.
E-mail address: zhongpeicen@bjtu.edu.cn (Z. Cen).

https://doi.org/10.1016/j.energy.2018.12.016
0360-5442/© 2018 Elsevier Ltd. All rights reserved.
Abbreviations

LSTM: Long Short Term Memory
NLP: Natural language processing
WTI: West Texas Intermediate crude oil
BPNN: Back-propagation neural network
AI: Artificial intelligence
RNN: Recurrent neural network
EEMD: Ensemble empirical mode decomposition
EMD: Empirical mode decomposition
IMF: Intrinsic mode function

In the traditional field of neural networks, the back-propagation neural network (BPNN) is an effective training algorithm for time series prediction [9]. To improve the prediction performance, several predicting neural network models have been put forward. Rumelhart et al. [10] described a new learning procedure, back-propagation, for networks of neurone-like units. Katijani et al. [11] demonstrated the forecast capabilities of the feed-forward neural network (FFNN) model. Takahama et al. [12] applied differential evolution with degeneration to the structural learning of neural networks. Tripathyb [13] built the radial basis function neural network model. Deep learning, as a basic theory of AI, has solved increasingly complex applications, and the predicting accuracy of AI has improved rapidly. The most outstanding advantage of deep learning models, which are composed of multiple processing layers, is the ability to deal with massive data. Devaraj et al. [14] presented an artificial neural network-based approach for static-security assessment. Krizhevsky et al. [15] used a deep convolutional neural network to classify 1.2 million high-resolution images. Kusumo et al. [16] developed kernel-based extreme learning machine (K-ELM) and artificial neural network (ANN) models. The types of deep learning models include unsupervised learning, reinforcement learning and evolutionary computation, and indirect search for short programs encoding deep and large networks [17]. Deep learning can effectively solve practical problems, including natural language processing [18], image recognition [19] and so on.

Moreover, it is a significant topic to use more mechanisms and characteristics in fitting crude oil time series. Safari and Davallou [20] developed a state-space model framework to increase the accuracy of forecasting crude oil prices. Movagharnejad et al. [21] introduced a neural network to forecast the prices of various commercial crude oils. Wang and Wang [22] established an architecture which combines the multilayer perceptron and the Elman recurrent neural network (ERNN) with a stochastic time effective function. Machine learning and deep learning theories have become a research focus in predicting time series. Li et al. [23] developed models using a traditional combination method to increase the prediction accuracy of petroleum consumption. Rahman et al. [24] developed and optimized novel deep recurrent neural network (RNN) models aimed at long term electric load prediction at 1-h resolution. The recurrent neural network (RNN) [25,26] is a class of deep learning for processing sequence data such as time series, and its most obvious feature is the self-circulatory structure. Based on the RNN model, the LSTM model, with non-fixed weights of self-circulation, was originally used for natural language processing; furthermore, the LSTM model has a great processing power for data with time order, like the RNN. With the development of machine learning and deep learning theories, the LSTM model, as a kind of deep learning model, has been widely employed in natural language processing. In recent years, the rapid progress of artificial intelligence (AI) technology has had a profound and significant impact on every field of society. Bahdanau et al. [27] showed an RNN application in translating English into French. Graves et al. [28] applied the LSTM to speech recognition. Sak et al. [29] presented a novel LSTM based on the RNN architecture which makes the use of model parameters more effective for training acoustic models for large vocabulary speech recognition. Sutskever et al. [30] used the LSTM to learn phrase and sentence representations that are sensitive to word order. Liu et al. [31] utilized the LSTM and wavelets to predict the low-frequency wind speed sub-layers. So it is a natural and promising progression to use the LSTM to train financial time series. This model is distinguished from a standard recurrent neural network in that each ordinary node in the hidden layers is replaced by a memory cell, and this special, endemic structure gives the LSTM an increasingly strong fitting ability. The LSTM consists of a number of cells, and each cell has a complex internal structure which includes three kinds of gates, an internal state and several inputs. The connection between gates and nodes carries a weight for the connection signal, which is the core of the artificial neural network, because the most important matter in the LSTM is determining the weights.

The extremely complex internal circulation structure endows the LSTM with its superiority. The LSTM based on the RNN is used to predict the fluctuation of crude oil indices in this work; specifically, the West Texas Intermediate (WTI) and Brent data are considered as input data to train the model. It is known that the traditional training process of the LSTM treats each historical data point at different time points equally. However, the dynamic investment environment makes it difficult to reflect the changeable behaviors of crude oil markets by using the early data. If all the historical data are equally applied to train the LSTM, the system of the model may not accord with the movement of crude oil markets; but much useful information will be lost by only selecting the recent data. Thus, data transfer is used to give each historical data point a weight that reflects its degree of impact on the fluctuation of the current market. In order to examine whether the different prediction results are caused by data migration or by prior knowledge, the crude oil time series will be transferred into two longer and more complex time series by using two kinds of data processing methods, and the different predictor performance will be shown. These three data processing methods are given different marks respectively: the raw data without data transfer is named Type I, which is the control group; the data after the expansion of the sliding window is named Type II; and the data after the expansion of the sliding window with decrement weight is named Type III. The ensemble empirical mode decomposition (EEMD), a noise-assisted data analysis method which is utilized in training the LSTM for the first time, decomposes the time series into parts of different frequencies for training the model. Furthermore, the combination of EEMD and data transfer shows the training results of the LSTM at different frequencies in more detail. In this paper, the data of WTI and Brent crude oil prices are selected for each trading day over a period of more than 10 years.

2. Description of long short term memory

The LSTM not only has an external loop similar to the RNN, but also an internal circulation in its memory cells. Each memory cell contains an input node and three kinds of gates with many intricate self-connected recurrent weights, ensuring that the signal can pass across many time steps without gradient explosion or vanishing gradients. The long-term memory of the LSTM is embodied in the weights between the cells and the gates in each cell; these weights change slowly through the training iterations. The short-term memory rests with the ephemeral activations of the various gates, which control whether the data is remembered by the current state in the cell.

The LSTM uses these gates to store intermediate state; in other words, the feature of the LSTM is that gated nodes are added in each layer outside the RNN structure. All elements of one LSTM cell are enumerated and represented in the diagram of Fig. 1; there are three types of gates: forget gate, input gate and output gate. These three gates can be turned off or on, and they determine whether the output of the memory state of the network (the state of the previous step) reaches the threshold in the layer, thereby adding it to the calculation of the current layer. The LSTM also has a chain-like structure, but the repeating module has a different structure from that of the RNN.

The following equations give the complete algorithm for an LSTM model, performed at each time step:

Input node g: This unit is a node that takes the activation signal from the input layer $x^{(t)}$ at the current time $t$ and from the hidden layer $h^{(t-1)}$ at the previous time step $t-1$. The aggregated, weighted input signal runs through the activation function tanh, which is a typical nonlinear function. $W_{gx}$ denotes the weights between the input node $g$ and the input layer $x$, $W_{gh}$ denotes the weights between the input node and the hidden layer $h$, and $b_g$ is the bias of the input node. The following weights and biases are defined in the same way:

$$g^{(t)} = \phi\left(W_{gx} x^{(t)} + W_{gh} h^{(t-1)} + b_g\right). \qquad (1)$$

Input gate i: The gate is a distinctive feature of the LSTM compared to other RNNs; it is a node that contains an activation function $\sigma$, a nonlinear mapping of the input layer $x^{(t)}$ at the current time $t$ and the value of the hidden layer $h^{(t-1)}$ at the previous time point into $[0,1]$. The corresponding weights are denoted by $W_{ix}$ and $W_{ih}$, and $b_i$ is the bias of the input gate. The value of the gate is multiplied with another node: if the input gate is zero, the flow from the other node is cut off; if the value of the input gate is one, all flow from the other node is passed through. Figuratively speaking, the input gate is a switch of the input node. Thus, we let

$$i^{(t)} = \sigma\left(W_{ix} x^{(t)} + W_{ih} h^{(t-1)} + b_i\right). \qquad (2)$$

Forget gate f: The forget gate provides a possibility for the LSTM to learn to forget the contents of the internal state $s$, and it plays an especially useful role in continuously running networks. With this kind of design, the gate takes the form

$$f^{(t)} = \sigma\left(W_{fx} x^{(t)} + W_{fh} h^{(t-1)} + b_f\right). \qquad (3)$$

Output gate o: A memory cell produces its final output value by multiplying the (squashed) internal state by the value of the output gate $o$. The signal of the internal state customarily first runs through the activation function tanh, which compresses the output of each memory cell into the same dynamic range. The calculation is

$$o^{(t)} = \sigma\left(W_{ox} x^{(t)} + W_{oh} h^{(t-1)} + b_o\right). \qquad (4)$$

Internal state s: The most important part of a memory cell is the node with a nonlinear activation, called the "internal state" $s$. Every internal state $s$ has a self-recurrent structure. The update for the internal state in vector notation is

$$s^{(t)} = g^{(t)} \odot i^{(t)} + s^{(t-1)} \odot f^{(t)}, \qquad (5)$$

where $\odot$ is point-wise multiplication.

Hidden layer h: The hidden layer is defined by the internal state and the output gate at the current time step $t$:

$$h^{(t)} = \phi\left(s^{(t)}\right) \odot o^{(t)}. \qquad (6)$$

There are two kinds of activation functions in the above formulas. $\phi$ is the tanh function

$$\phi(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \qquad (7)$$

and $\sigma$ is the sigmoid function

$$\sigma(x) = \mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}. \qquad (8)$$

For understanding the LSTM model, it is necessary to explain the model construction of the RNN first, since the LSTM is a variant of the RNN. Fig. 2(a) describes the structure of the RNN model: the left side of the diagram is the basic model of the RNN, and the right side is its appearance after the model is unrolled. The unrolling matches the input sample: if a batch of time series is input and the maximum length of each batch time series is ten, then the model is unrolled ten times. The LSTM is the same as the RNN in that the data are passed to the next layer and processed at the next node of the same level at the same time. The differences are that the LSTM has more hidden layers (it is "deeper"), and the data processing nodes of the LSTM are cells, whereas those of the RNN are neurons. Fig. 2(b) describes the model structure between the LSTM layers. There are two hidden layers: one batch of data is input from the bottom of the LSTM model to the top, and the batch is fed into the cells one by one, where every data point is input into an LSTM cell and the input gate accepts the input data and processes it within the cell as mentioned above. An LSTM cell returns its output as the input of the next layer, and passes the output to the next cell in the same layer as short-term memory. The final model returns a string of data which is the output of the LSTM; in this paper the output is the last element of this string. A more detailed description of the LSTM model can be found in Ref. [25] (see Fig. 3).

Fig. 1. Diagram of one cell of LSTM.
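For readers who prefer code to notation, the following is a minimal NumPy sketch of one forward step of a single LSTM cell implementing Eqs. (1)–(6). The random weight initialization and the toy dimensions (4 input features, 8 hidden units) are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x_t, h_prev, s_prev, W, b):
    """One forward step of an LSTM cell, following Eqs. (1)-(6).

    W holds the input-to-cell and hidden-to-cell weight matrices for the
    input node g, input gate i, forget gate f and output gate o.
    """
    g = np.tanh(W["gx"] @ x_t + W["gh"] @ h_prev + b["g"])   # Eq. (1): input node
    i = sigmoid(W["ix"] @ x_t + W["ih"] @ h_prev + b["i"])   # Eq. (2): input gate
    f = sigmoid(W["fx"] @ x_t + W["fh"] @ h_prev + b["f"])   # Eq. (3): forget gate
    o = sigmoid(W["ox"] @ x_t + W["oh"] @ h_prev + b["o"])   # Eq. (4): output gate
    s = g * i + s_prev * f                                   # Eq. (5): point-wise products
    h = np.tanh(s) * o                                       # Eq. (6): hidden output
    return h, s

# Toy dimensions, chosen only for illustration.
n_in, n_hid = 4, 8
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((n_hid, n_in if k.endswith("x") else n_hid)) * 0.1
     for k in ("gx", "gh", "ix", "ih", "fx", "fh", "ox", "oh")}
b = {k: np.zeros(n_hid) for k in ("g", "i", "f", "o")}

h, s = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.standard_normal((10, n_in)):  # run a length-10 sequence through the cell
    h, s = lstm_cell_step(x_t, h, s, W, b)
```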

3. Data transfer

In this part, the crude oil time series will be transferred into three longer and more complex time series by using three kinds of data processing methods, in order to compare the data transfer performance of the results with and without prior knowledge. Input data processed by the three kinds of methods will train the model, and the different predictor performance will be shown. For convenience, these three data processing methods are given different marks respectively: the raw data without data transfer is named Type I, which is the control group; the data after the expansion of the sliding window is named Type II; and the data after the expansion of the sliding window with decrement weight is named Type III. The expansion of the sliding window means that the data is expanded by a sliding window.

Fig. 2. (a) Diagram of LSTM. (b) Diagram of recurrent neural network.

3.1. Data transfer based on prior knowledge

In energy markets and financial markets, time series prediction is an orthodox topic. With the prosperity of computational simulation ability and the special preponderance of the big data era, the combination of deep learning and traditional prior experience will be a frontier project for future research. There is one important concept in the LSTM model: the batch size, which means the number of input samples for every step. The traditional training set of WTI weights each historical data point at different time points equally and faces the bottleneck of insufficient data quantity; a common practice is to extend the amount of data by using sliding windows. However, the dynamic investment environment makes it difficult to reflect the changeable behaviors of crude oil markets by using the early data. If all the historical data are equally applied to train the LSTM, the system of the model may not accord with the movement of crude oil markets; but much useful information will be lost by only selecting the recent data. Thus, a novel prior knowledge based process is introduced to transfer the inputs of the data set. In this paper, the prior knowledge refers to the timeliness of the time series, which means that the fluctuation at the current time is more affected by the recent past, while the fluctuation at a distant time point has a lesser effect on the current fluctuation. The core vision of this paper is therefore adjusting the weights of samples at different time points. Let the batch size be $n$ and the input time series be $X = \{x_1, x_2, \dots, x_N\}$, in which $N$ is the sample size. Define a kernel $K = \{k_1, k_2, \dots, k_i, \dots, k_n\}$ of the same size as the batch size $n$. Then a new series $Y_i = \{y_i, y_{i+1}, \dots, y_{i+n-1}\}$ is obtained by computing $Y = X \odot K$, in which $\odot$ means $y_i = x_i \cdot k_i$, the element-wise multiplication on the corresponding dimension. In this paper, a new kind of kernel with prior knowledge is introduced; its characteristic time strength function reduces the weight of old data and raises the influence of the latest data. In order to achieve the objective that the latest data are more important than the old data in energy market prediction, this function is obviously an increasing function of $i$. For convenience, the raw data without data transfer is named Type I, which is the control group; the data after the expansion of the sliding window without a decrement weight kernel is named Type II, where the data is expanded by a sliding window with the all-ones kernel $K = [1, 1, \dots, 1]$; the data after the expansion of the sliding window with a decrement weight kernel is named Type III.

Fig. 3. Diagrammatic sketch of data transfer.

In the present study, for the data of West Texas Intermediate (WTI) and Brent crude oil daily prices, the closing price, opening price, highest price and lowest price are used as four dimensions for each input sample. For these four dimensions, four transferred data sets are established respectively, corresponding to the extension of four sets of time series. In detail, the closing price is extracted for all samples in the data set as one time series, and this time series is transferred into the longer and more complex time series with prior knowledge. In Table 1, for predicting the WTI and Brent crude oil markets by the proposed LSTM model, the daily data of WTI cover a period of more than 10 years from January 31, 2005 to December 5, 2016, and the daily data of Brent cover from January 31, 2006 to October 17, 2017. This paper involves multiple choices of hyper parameters.
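The window expansion of Section 3.1 can be sketched as follows; this is a minimal illustration, not the authors' code. The linear ramp $k_i = i/n$ used for the Type III kernel below is an assumption that is at least consistent with the Type III samples in Table 2 (with $n = 10$, the first window gives $17.71 \times 0.1 = 1.771$, $17.72 \times 0.2 = 3.544$, and so on); the paper itself only requires the time strength function to increase with $i$.

```python
import numpy as np

def expand_with_kernel(x, n, kernel):
    """Slide a length-n window over the series x and weight each window
    element-wise by the kernel: y_i = x_i * k_i (Section 3.1)."""
    windows = np.lib.stride_tricks.sliding_window_view(x, n)
    return (windows * kernel).ravel()

x = np.array([17.71, 17.72, 17.72, 17.73, 18.44, 18.63])  # e.g. daily closing prices
n = 3                                                     # batch size / window length (paper uses n = 10)

type1 = x                                                 # Type I: raw series (control group)
type2 = expand_with_kernel(x, n, np.ones(n))              # Type II: all-ones kernel K = [1, ..., 1]
type3 = expand_with_kernel(x, n, np.arange(1, n + 1) / n) # Type III: increasing decrement-weight kernel
```

Each expanded series is roughly $n$ times longer than the original, which matches the training set sizes reported in Table 1.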

Table 1
Data selection and data transfer of daily crude oil prices for West Texas Intermediate and Brent.

Mode            Data sets               Total number  Training set  Testing set size
WTI Type I      31/01/2005–05/12/2016   2772          2000          772
WTI Type II     31/01/2005–05/12/2016   20772         20000         772
WTI Type III    31/01/2005–05/12/2016   20772         20000         772
Brent Type I    31/01/2006–27/10/2017   2697          2000          697
Brent Type II   31/01/2006–27/10/2017   20697         20000         697
Brent Type III  31/01/2006–27/10/2017   20697         20000         697

Because the neural network is a complex nonlinear mapping, over-selection of the hyper parameters will lead to good fitting performance on the training set but poor performance on the test set; this phenomenon is called over-fitting. To avoid over-fitting, cross validation is used to determine the other hyper parameters; that is, the data will be divided into three sets for selecting appropriate parameters. This can find a set of most representative hyper parameters that have the best generalization ability on the validation set. The batch size $n$ is 10, and the data transfer is made only in the training set part, so the sizes of the Type II and Type III training sets are both $n$ times the Type I training set size; the three data processings share the same test set. Table 2 lists the four kinds of information for crude oil prices (opening prices, closing prices, highest prices and lowest prices) and also displays the corresponding three kinds of data processing methods. For the data processing methods, the highest prices, the lowest prices and the opening prices are the input features, and the closing prices are the labels.

Table 2
Samples of West Texas Intermediate input daily data for Type I, Type II and Type III.

Mode      Number  Closing price  Opening price  Highest price  Lowest price
Type I    1       17.71          18.0           18.09          17.03
          2       17.72          17.71          17.9           17.59
          3       17.72          17.76          17.9           17.7
          …       …              …              …              …
          11      18.44          18.63          18.68          18.4
          …       …              …              …              …
Type II   1       17.71          18.0           18.09          17.03
          2       17.72          17.71          17.9           17.59
          3       17.72          17.76          17.9           17.7
          …       …              …              …              …
          11      17.72          17.71          17.9           17.59
          …       …              …              …              …
Type III  1       1.771          1.8            1.809          1.703
          2       3.544          3.542          3.58           3.518
          3       5.316          5.328          5.37           5.31
          …       …              …              …              …
          11      17.72          17.71          17.9           17.59
          …       …              …              …              …

3.2. Three kinds of modes trained by long short term memory

In the following, a cross-validation after data pretreatment is performed; that is, the whole training set is split into two parts, where the first part is the training set and the other part is the validation set. Concretely, the training process is divided into two steps: first, train the LSTM model with the training set and continuously adjust the hyper parameters to ensure that the model behaves well on the validation set; the training objective of the LSTM model is to minimize the global error between the predictions and the actual targets by modifying the weights. Then retrain the model with the confirmed hyper parameters on the whole training set plus validation set.

From the curves of actual closing prices and predicted data exhibited in Fig. 4(a)(c)(e) and Fig. 5(a)(c)(e), the distinctions between the predicted values and the actual data are almost negligible in the training set, as are the distances between the predicted data string and the actual data in the test set. This leads to the conclusion that the LSTM model has a powerful generalization ability to train the crude oil time series well. Furthermore, facing a sharp fall in the test set, Type III has a stronger adaptability than Type I and Type II. Fig. 4(b)(d)(f) are the linear regressions of WTI crude oil prices and predicted closing prices of Type I, Type II and Type III data by the LSTM model respectively, and Fig. 5(b)(d)(f) are the linear regressions of Brent crude oil prices and predicted closing prices of Type I, Type II and Type III data by the LSTM model respectively; from these plots, the slopes show that the predicted data and the real closing prices are close.

4. Evaluation by multiple statistical measures

In this section, we comparatively evaluate the prediction accuracy among the three kinds of data, preprocessed as mentioned above, by adopting some statistical evaluation methods. For further analyzing the predicting performance of the data transfer and sliding window model, several measures are chosen to evaluate the error and trend performance [32–35]; they show how these statistical methods evaluate the performance of time series: the mean absolute error (MAE), root mean square error (RMSE), mean absolute percent error (MAPE), symmetric mean absolute percentage error (SMAPE), Theil inequality coefficient (TIC), and correlation coefficient (CC). They are all measures of the deviation error between the predicted values and the actual data, and they reflect the global prediction error. The corresponding criteria definitions are given by the following formulas:

Mean absolute error (MAE)

$$\mathrm{MAE} = \frac{1}{N}\sum_{t=1}^{N} |d_t - y_t|. \qquad (9)$$

Root mean square error (RMSE)

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{t=1}^{N} (d_t - y_t)^2}. \qquad (10)$$

Mean absolute percent error (MAPE)

$$\mathrm{MAPE} = 100 \times \frac{1}{N}\sum_{t=1}^{N} \left|\frac{d_t - y_t}{d_t}\right|. \qquad (11)$$

Symmetric mean absolute percent error (SMAPE)

$$\mathrm{SMAPE} = 100 \times \frac{2}{N}\sum_{t=1}^{N} \frac{|d_t - y_t|}{|d_t| + |y_t|}. \qquad (12)$$

Fig. 4. (a)(c)(e) Training and testing results of WTI daily closing prices of LSTM model for Type I, Type II and Type III, respectively. (b)(d)(f) Linear regressions of WTI predicted
closing prices of Type I, Type II and Type III by LSTM, respectively.

Fig. 5. (a)(c)(e) Training and testing results of Brent daily closing prices of LSTM model for Type I, Type II and Type III, respectively. (b)(d)(f) Linear regressions of Brent predicted
closing prices of Type I, Type II and Type III by LSTM model, respectively.

Theil inequality coefficient (TIC)

$$\mathrm{TIC} = \frac{\sqrt{\frac{1}{N}\sum_{t=1}^{N} (d_t - y_t)^2}}{\sqrt{\frac{1}{N}\sum_{t=1}^{N} d_t^2} + \sqrt{\frac{1}{N}\sum_{t=1}^{N} y_t^2}}. \qquad (13)$$

Correlation coefficient (CC)

$$\mathrm{CC} = \frac{\sum_{t=1}^{N} (y_t - \bar{y})(d_t - \bar{d})}{\sqrt{\sum_{t=1}^{N} (y_t - \bar{y})^2 \sum_{t=1}^{N} (d_t - \bar{d})^2}}. \qquad (14)$$

In the above notations, $d_t$ is the real value and $y_t$ is the predicted value at time $t$ of WTI. $N$ denotes the number of evaluated data points, $\bar{d}$ is the average real value and $\bar{y}$ is the average predicted value.
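For replication purposes, Eqs. (9)–(14) transcribe directly into NumPy; this sketch is our own, with $d$ the vector of real values and $y$ the predictions, as in the text.

```python
import numpy as np

def metrics(d, y):
    """Evaluation measures of Eqs. (9)-(14); d = real values, y = predictions."""
    d, y = np.asarray(d, float), np.asarray(y, float)
    n = d.size
    mae   = np.mean(np.abs(d - y))                                         # Eq. (9)
    rmse  = np.sqrt(np.mean((d - y) ** 2))                                 # Eq. (10)
    mape  = 100 * np.mean(np.abs((d - y) / d))                             # Eq. (11)
    smape = 100 * 2 / n * np.sum(np.abs(d - y) / (np.abs(d) + np.abs(y)))  # Eq. (12)
    tic   = rmse / (np.sqrt(np.mean(d ** 2)) + np.sqrt(np.mean(y ** 2)))   # Eq. (13)
    cc    = (np.sum((y - y.mean()) * (d - d.mean()))
             / np.sqrt(np.sum((y - y.mean()) ** 2)
                       * np.sum((d - d.mean()) ** 2)))                     # Eq. (14)
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape,
            "SMAPE": smape, "TIC": tic, "CC": cc}
```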

Here the test set is set to the same size for comparing the errors of the test sets more reasonably. Smaller values of MAE, RMSE, MAPE and SMAPE show less deviation of the predicted results from the actual values and a closer distance between the time series. A TIC value closer to 0 means higher accuracy; on the contrary, a TIC value closer to 1 means larger errors and lower accuracy, while a CC value closer to 1 indicates a stronger correlation between the predicted and actual data.

Through the cross-validation [36], a set of most representative hyper parameters with the best generalization ability on the validation set can be found. The learning rate of the LSTM is chosen as 0.01, which is small enough to fit the unceasingly changing data; the number of neurons chosen here for each layer is 150, and the number of hidden layers is 2. In Tables 3 and 4, the empirical research has been made for the error evaluation at different levels of training times K. By comparing the statistical indicators between the different data preprocessing methods and different parameters, it is obvious that Type III has smaller values of MAE, RMSE, SMAPE and TIC, and larger values of CC, than Type I and Type II. It is clear that Type III has a better performance in predicting crude oil markets. In Fig. 6, we comparatively study the relative errors of the predicted data, where the relative error is given by $\tilde{e} = (y(t) - d(t))/d(t)$, so we can compare the errors at each time point in the test set. Fig. 6(a)(b)(c) are the relative errors of crude oil prices for Type I, Type II and Type III predicted by the LSTM model individually; it is clear that the relative errors of the Type III mode are lower than those of Type I and Type II, which demonstrates that the Type III mode has a better performance in predicting accuracy.

Table 4
Error evaluations for Brent data Type I, Type II and Type III of long short term memory model.

Training times  Mode  MAE     RMSE    MAPE    SMAPE    TIC     CC
1000            I     0.1819  0.4427  3.4635  3.8618   0.0137  0.987
                II    0.1787  0.2567  2.6389  2.0833   0.0112  0.992
                III   0.1122  0.1223  1.0072  1.0651   0.0089  0.995
2000            I     0.1361  0.4853  2.1224  5.1398   0.0393  0.981
                II    0.1401  0.3228  2.4574  3.0056   0.0255  0.923
                III   0.1147  0.1148  1.3279  1.1212   0.0124  0.997
5000            I     0.1712  0.6234  4.6821  13.3412  0.0645  0.952
                II    0.1031  0.3565  1.4532  8.6756   0.0606  0.920
                III   0.0718  0.2876  0.7824  0.8457   0.0156  0.994
10000           I     0.3996  0.5318  6.3018  559.85   0.0723  0.926
                II    0.1086  0.4230  5.2342  594.58   0.0772  0.981
                III   0.0521  0.0539  1.2108  0.8254   0.0411  0.997
20000           I     0.1774  0.7185  9.6424  7.6572   0.2984  0.917
                II    0.1499  0.6512  5.1840  6.2693   0.1483  0.978
                III   0.1113  0.2023  1.7721  3.5887   0.1161  0.989
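As a concrete but unofficial realization of the network described above, a stacked LSTM with the stated hyper parameters (two hidden layers of 150 neurons, learning rate 0.01, batch size 10) could be written with the Keras API as follows. The input window length, the three input features (opening, highest and lowest prices, per Section 3), the Adam optimizer and the mean squared error loss are our assumptions; the paper only states that the global prediction error is minimized.

```python
import tensorflow as tf

def build_lstm(window=10, n_features=3):
    """Stacked LSTM with the hyper parameters reported in Section 4:
    2 hidden layers, 150 neurons each, learning rate 0.01."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(window, n_features)),
        tf.keras.layers.LSTM(150, return_sequences=True),  # first hidden layer
        tf.keras.layers.LSTM(150),                         # second hidden layer
        tf.keras.layers.Dense(1),                          # predicted closing price
    ])
    # Optimizer and loss are assumptions, not stated in the paper.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
                  loss="mse")
    return model

# Typical usage, matching the paper's two-step scheme of Section 3.2:
# model = build_lstm()
# model.fit(X_train, y_train, batch_size=10, validation_data=(X_val, y_val))
```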
5. Forecasting model based on ensemble empirical mode decomposition

Empirical mode decomposition (EMD) is a method for analyzing nonlinear and non-stationary data. With no requirement of stationarity or linearity of the data, the EMD can extract all the oscillatory modes [37–39]. The EMD locally decomposes any time series into a high frequency part, composed of a series of intrinsic mode functions (IMFs), and correspondingly a low frequency part called the residual. Different IMFs represent different scales and adaptive physical bases of the original time series. In the high frequency part, each IMF must satisfy two requirements: (i) over the whole time series, the number of local extrema and the number of zero-crossings must either be equal or differ at most by one, that is, between consecutive zero-crossings there is only one extremum; (ii) at any time point, the mean value of the envelopes defined by the local maxima and the local minima separately is zero. Aimed at the insufficiency of this method, the ensemble empirical mode decomposition (EEMD), a noise-assisted data analysis method, is introduced in Refs. [40,41]; it is designed to solve the mode mixing that appears frequently in the EMD. The biggest difference in the intrinsic mode functions between the EEMD and the EMD is that the components are obtained as the mean of an ensemble of trials, where in each trial the signal is combined with a finite-amplitude white noise.

Table 3
Error evaluations for West Texas Intermediate data Type I, Type II and Type III of long short term memory model.

Training times  Mode  MAE     RMSE    MAPE     SMAPE     TIC     CC
1000            I     0.1703  0.3778  3.7098   3.8618    0.0297  0.957
                II    0.1888  0.4339  4.7001   4.0333    0.0334  0.984
                III   0.1452  0.3520  1.1673   1.4075    0.0270  0.997
2000            I     0.2361  0.4853  4.0626   5.8998    0.0393  0.981
                II    0.1445  0.3228  3.4384   5.4956    0.0255  0.923
                III   0.0872  0.2348  0.1329   1.0142    0.0182  0.998
5000            I     0.3222  0.8057  11.7844  91.1450   0.0645  0.962
                II    0.2756  0.7663  11.1581  248.0636  0.0606  0.980
                III   0.0718  0.2011  0.3423   0.8185    0.0156  0.994
10000           I     0.2886  0.7818  11.3998  559.85    0.0619  0.964
                II    0.2886  0.7830  11.4002  594.58    0.0620  0.981
                III   0.0781  0.2139  0.2918   0.8254    0.0170  0.995
20000           I     0.1911  0.4885  6.6424   12.6572   0.0384  0.937
                II    0.2032  0.4912  6.1840   12.2693   0.0383  0.980
                III   0.0761  0.2093  0.5714   0.8837    0.0163  0.996

Fig. 6. (a) Relative error of crude oil prices for Type I. (b) Relative error of crude oil prices for Type II. (c) Relative error of crude oil prices for Type III.

The EEMD dispels the negative influence of mode mixing and meanwhile preserves the physical uniqueness of the decomposition by adding white noise. The specific algorithm of the EEMD is as follows.

Step one: To obtain the noise-added signal $x_m$, we sum the time signal $x$ and a white noise time series $n_m$ in the $m$th trial:

$$x_m(t) = x(t) + n_m(t). \qquad (15)$$

Step two: Like the traditional EMD decomposing the noisy signal $x_m$ into IMFs $c_{j,m}$, the EEMD uses the same algorithm:

$$x_m(t) \rightarrow \sum_{j=1}^{p} c_{j,m}(t) + r(t), \qquad (16)$$

where $r$ is called the residue, i.e., what remains of the data $x_m$ after the $p$ IMFs $c_{j,m}$ have been extracted.

Step three: Return to Step one, and repeat Steps one and two for a predefined number $M$ of trials, using a different white noise series with the same amplitude each time.

Step four: Calculate the ensemble mean of the corresponding $c_{j,m}$ as the final IMFs, that is,

$$\mathrm{IMF}_j(t) = \frac{1}{M}\sum_{m=1}^{M} c_{j,m}(t), \qquad j = 1, 2, \dots, p. \qquad (17)$$

So each IMF is peeled from the previous residual; the front IMFs fluctuate more acutely, and the later IMFs tend toward slower fluctuation.
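Steps one to four admit a compact sketch, shown below. The `emd_decompose` argument stands for any routine mapping a one-dimensional signal to an array of IMFs (for example, the EMD of a library such as PyEMD); the trial count and noise amplitude are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def eemd(x, emd_decompose, n_trials=100, noise_amp=0.2, seed=0):
    """Ensemble EMD, Eqs. (15)-(17): average the IMFs of many
    noise-perturbed copies of the signal x.

    emd_decompose: caller-supplied routine returning an array of IMFs
    (shape: n_imfs x len(x)) for a 1-D signal.
    """
    rng = np.random.default_rng(seed)
    trials = []
    for _ in range(n_trials):                                 # Step three: M trials
        x_m = x + noise_amp * rng.standard_normal(x.size)     # Eq. (15): add white noise
        trials.append(emd_decompose(x_m))                     # Eq. (16): EMD of the noisy copy
    p = min(t.shape[0] for t in trials)                       # keep the IMF count common to all trials
    return np.mean([t[:p] for t in trials], axis=0)           # Eq. (17): ensemble mean -> final IMFs
```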
Fig. 7 shows the IMFs and the residual of the WTI index obtained by the EEMD. It can be found that the fluctuation frequency gradually decreases from IMF1 to IMF6. This means that IMF1, the first series decomposed from the WTI, retains the main structures and characteristics, and apparently has a higher frequency than the other IMFs. Each time the data are sifted, the resulting frequency is lower than the former one. As shown above, different data processing modes give the LSTM different training and generalization abilities; the introduction of the EEMD in this paper serves to analyze the predicting performance of the three data processing modes more deeply. We choose the best-performing parameters of the LSTM to predict the IMFs; the IMFs as input data are processed by the Type I, Type II and Type III modes like the real WTI data. The prediction performance on the different IMFs is useful to understand the predictive ability of the proposed model. Fig. 8(a)(d)(g)(j)(m)(p) are the IMFs of Type I, Fig. 8(b)(e)(h)(k)(n)(q) are the IMFs of Type II, and Fig. 8(c)(f)(i)(l)(o)(r) are the IMFs of Type III. To demonstrate the performance of the IMFs in the different modes more clearly, we apply MAE, RMSE, MAPE, SMAPE, TIC and CC to measure the specific errors in Table 5. It can be seen that Type III has a better fitting ability than Type I and Type II for every IMF. It is obvious that Type III has a stronger generalization ability on IMF1 and IMF2, but there is no great advantage on IMF3, IMF4, IMF5 and IMF6, because the influence of the equal-amplitude white noise is greater on IMF3 to IMF6 than on IMF1 and IMF2.

Fig. 7. Plots of IMFs and the residual of WTI index by EEMD model.

Fig. 8. Real IMFs of WTI and IMFs of long short term memory model by Type I, Type II and Type III.

6. Conclusion

Crude oil plays a significant and irreplaceable role in economic society; West Texas Intermediate crude oil and Brent crude oil are the most important crude oil market indices in the world. In the present paper, the Long Short Term Memory deep learning algorithm is applied to predict the volatility behaviors of crude oil prices, which is a new approach for predicting energy markets by using deep learning networks. Considering the characteristics of energy price time series, new data transfer modes, namely the Type I, Type II and Type III data preprocessing methods, are established with the aim of improving the predicting ability for the fluctuations of global crude oil prices. Through comparing the linear regressions of the predicted values and the real data, the experimental results display that the predicted values can approach the real data well by the LSTM model. We then adopt several evaluation methods to compare the predictive ability of the Type I, Type II and Type III data. The comparison results of MAE, RMSE, MAPE, SMAPE, TIC and CC show that the Type III predicting performance is better than that of Type I and Type II. Furthermore, the nonlinear analysis method EEMD is applied to decompose the crude oil price series into different fluctuation frequency levels, that is, the intrinsic mode functions (IMFs). The corresponding predicting performance on the IMFs shows that the proposed LSTM model can catch the main fluctuation characteristics of crude oil prices at different fluctuation frequency levels.

Table 5
Error evaluations of IMFs for Type I, Type II and Type III of long short term memory model.

Criterion  Mode      MAE     RMSE    MAPE      SMAPE     TIC     CC
IMF1       Type I    0.0427  0.1183  26.6725   165.0501  0.7287  0.0191
           Type II   0.0425  0.1177  26.0192   104.2218  0.7301  0.0227
           Type III  0.0374  0.1033  18.1133   67.0256   0.7262  0.3372
IMF2       Type I    0.0200  0.0555  137.1360  44.5535   0.2876  3.0249
           Type II   0.0197  0.0552  94.7113   3310.2    0.2875  3.0065
           Type III  0.0190  0.0530  98.4469   50.0657   0.2820  2.9796
IMF3       Type I    0.0225  0.0625  13.2329   91.6119   0.2011  5.3200
           Type II   0.0227  0.0628  14.1893   49.1674   0.2015  5.3337
           Type III  0.0325  0.0889  8.6696    49.0365   0.2939  4.6486
IMF4       Type I    0.0282  0.0838  3.9322    48.3883   0.1208  12.5563
           Type II   0.0283  0.0837  4.0123    52.0142   0.1205  12.5774
           Type III  0.0301  0.0870  3.0170    46.9555   0.1280  12.2470
IMF5       Type I    0.0234  0.0597  0.3616    37.8503   0.1219  8.8455
           Type II   0.0238  0.0601  2.8459    39.0115   0.1228  8.8408
           Type III  0.0269  0.0687  4.9237    36.0344   0.1346  9.1719
IMF6       Type I    0.1633  0.4541  6.4735    11.6506   0.0358  207.9203
           Type II   0.1217  0.2656  0.6743    1.8264    0.0211  199.5199
           Type III  0.1058  0.2374  0.8684    1.1516    0.0183  191.9695

Acknowledgment

The authors were supported by National Natural Science Foundation of China Grant No. 71271026.

References

[1] Alvarez-Ramirez J, Alvarez J, Rodriguez E. Short-term predictability of crude oil markets: a detrended fluctuation analysis approach. Energy Econ 2008;30:2645–56.
[2] Chiroma H, Abdulkareem S, Herawan T. Evolutionary Neural Network model for West Texas Intermediate crude oil price prediction. Appl Energy 2015;142:266–73.
[3] Niu H, Wang J. Volatility clustering and long memory of financial time series and financial price model. Digit Signal Process 2013;23:489–98.
[4] Niu H, Wang J. Return volatility duration analysis of NYMEX energy futures and spot. Energy 2017;140:837–49.
[5] Yu Y, Wang J. Lattice-oriented percolation system applied to volatility behavior of stock market. J Appl Stat 2012;39:785–97.
[6] Mordjaoui M, Haddad S, Medoued A, Laouafi A. Electric load forecasting by using dynamic neural network. Energy 2017;42:17655–63.
[7] Wang F, Wang J. Statistical analysis and forecasting of return interval for SSE and model by lattice percolation system and neural network. Comput Ind Eng 2012;62:198–205.
[8] Liao Z, Wang J. Forecasting model of global stock index by stochastic time effective neural network. Expert Syst Appl 2010;37:834–41.
[9] Taghavifar H, Mardani A. Applying a supervised ANN (artificial neural network) approach to the prognostication of driven wheel energy efficiency indices. Energy 2014;68:651–7.
[10] Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back propagating errors. Nature 1986;323:533–6.
[11] Katijani Y, Hipel KW, Mcleod AI. Forecasting nonlinear time series with feedforward neural networks: a case study of Canadian lynx data. J Forecast 2005;24:105–17.
[12] Takahama T, Sakai S, Hara A, Iwane N. Predicting stock price using neural networks optimized by differential evolution with degeneration. Int J Innov Comput Inf Contr 2009;5:5021–32.
[13] Tripathyb M. Power transformer differential protection using neural network principal component analysis and radial basis function neural network. Simulat Model Pract Theor 2010;18:600–11.
[14] Devaraj D, Yegnanarayana B, Ramar K. Radial basis function networks for fast contingency ranking. Electr Power Energy Syst 2002;24:387–95.
[15] Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 2012;60:1097–105.
[16] Kusumo F, Silitonga AS, Masjuki HH, Ong HC, Siswantoro J, Mahlia TMI. Optimization of transesterification process for Ceiba pentandra oil: a comparative study between kernel-based extreme learning machine and artificial neural networks. Energy 2017;134:24–34.
[17] Schmidhuber J. Deep learning in neural networks: an overview. Neural Network: Off J Int Neural Netw Soc 2015;61:85–117.
[18] Dahl GE, Yu D, Deng L, Acero A. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans Audio Speech Lang Process 2012;20:33–42.
[19] LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient based learning applied to document recognition. Proc IEEE 1998;86:2278–324.
[20] Safari A, Davallou M. Oil price forecasting using a hybrid model. Energy 2018;148:49–58.
[21] Movagharnejad K, Mehdizadeh B, Banihashemi M, Kordkheili MS. Forecasting the differences between various commercial oil prices in the Persian Gulf region by neural network. Energy 2011;36:3979–84.
[22] Wang J, Wang J. Forecasting energy market indices with recurrent neural networks: case study of crude oil price fluctuations. Energy 2016;102:365–74.
[23] Li J, Wang R, Wang J, Li Y. Analysis and forecasting of the oil consumption in China based on combination models optimized by artificial intelligence algorithms. Energy 2018;144:243–64.
[24] Rahman A, Srikumar V, Smith AD. Predicting electricity consumption for commercial and residential buildings using deep recurrent neural networks. Energy 2018;212:372–85.
[25] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997;9:1735–80.
[26] Sutskever I, Martens J, Hinton GE. Generating text with recurrent neural networks. Int Conf Mach Learn 2011;336:1310–8.
[27] Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. In: International Conference on Learning Representations; 2015. p. 3111–229.
[28] Graves A, Mohamed A, Hinton GE. Speech recognition with deep recurrent neural networks. Acoust Speech Signal Process (ICASSP) 2013;38:6645–9.
[29] Sak H, Senior A, Beaufays F. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In: Proceedings of the 15th annual conference of the international speech communication association: celebrating the diversity of spoken languages, INTERSPEECH 2014; September 2014. p. 338–42.
[30] Sutskever I, Vinyals O, Le QV. Sequence to sequence learning with neural networks. Adv Neural Inf Process Syst 2014;27:3104–22.
[31] Liu H, Mi XW, Li YF. Wind speed forecasting method based on deep learning strategy using empirical wavelet transform, long short term memory neural network and Elman neural network. Energy 2018;156:498–514.
[32] Doucoure B, Agbossou K, Cardenas A. Time series prediction using artificial wavelet neural network and multi-resolution analysis: application to wind speed data. Renew Energy 2016;92:202–11.
[33] Faruk DÖ. A neural network and ARIMA model for water quality time series prediction, vol. 23. Pergamon Press; 2010. p. 586–94.
[34] Olson D, Mossman C. Neural network forecasts of Canadian stock returns using accounting ratios. Int J Forecast 2003;19:453–65.
[35] Plumb AP, Rowe RC, York P, Brown M. Optimisation of the predictive ability of artificial neural network (ANN) models: a comparison of three ANN programs and four classes of training algorithm. Eur J Pharm Sci 2005;25:395–405.
[36] Golub GH, Heath M, Wahba G. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics 1979;21:215–23.
[37] Flandrin P, Goncalves P. Empirical mode decompositions as data-driven wavelet-like expansions. Int J Wavelets, Multiresol Inf Process 2004;2:477–96.
[38] Huang NE, Shen Z, Long SR, Wu MC, Zheng Q, Yen NC, Tung CC, Liu HH. The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis. Proceed: Math Phys Eng Sci 1998;454:903–95.
[39] Huang YX, Schmitt FG, Hermand JP, Gagne Y, Lu ZM, Liu YL. Arbitrary-order Hilbert spectral analysis for time series possessing scaling statistics: comparison study with detrended fluctuation analysis and wavelet leaders. Phys Rev E 2011;84:016208.
[40] Wu ZH, Huang NE. Ensemble empirical mode decomposition: a noise-assisted data analysis method. Adv Adapt Data Anal 2009;1:1–41.
[41] Yu L, Wang ZS, Tang L. A decomposition-ensemble model with data-characteristic-driven reconstruction for crude oil price forecasting. Appl Energy 2015;156:251–67.
