A Hybrid Deep Learning Architecture For Wind Power Prediction Based On Bi-Attention Mechanism and Crisscross Optimization

Energy 238 (2022) 121795
Anbo Meng a, Shun Chen a, Zuhong Ou a, Weifeng Ding a, Huaming Zhou b, Jingmin Fan a, Hao Yin a, *
a School of Automation, Guangdong University of Technology, Guangzhou, 510006, Guangdong, China
b Machine Patrol Operation Center, Guangdong Power Grid Co., Ltd., Guangzhou, 510160, Guangdong, China
ARTICLE INFO

Article history: Received 11 May 2021; Received in revised form 20 July 2021; Accepted 12 August 2021; Available online 16 August 2021

Keywords: Bi-attention mechanism; Ensemble empirical mode decomposition; Wind power prediction; Hybrid model; Crisscross optimization algorithm

ABSTRACT

Accurate wind power forecasting is of great significance for power system operation. In this study, a triple-stage multi-step wind power forecasting approach is proposed by applying attention-based deep residual gated recurrent unit (GRU) network combined with ensemble empirical mode decomposition (EEMD) and crisscross optimization algorithm (CSO). In the data processing stage, the EEMD is used to decompose the wind power/speed time series and a bi-attention mechanism (BA) is applied to enhance the sensitivity of model to the important information from both time and feature dimension. In the prediction stage, a series-connected deep learning model called RGRU consisting of the residual network and GRU is proposed as the forecasting model, aiming to make full use of extracting the static and dynamic coupling relationship among the input features. In the retraining-stage, a high-performance CSO algorithm is adopted to further optimize the fully-connected layer of RGRU model so as to improve the generalization ability of the model. The proposed method is validated on a wind farm located in Spain and the experimental results demonstrate that the proposed hybrid model has significant advantage over other state-of-the-art models involved in this study in terms of prediction accuracy and stability.

© 2021 Elsevier Ltd. All rights reserved.
* Corresponding author. E-mail address: 3403446@qq.com (H. Yin).
https://doi.org/10.1016/j.energy.2021.121795
0360-5442/© 2021 Elsevier Ltd. All rights reserved.

1. Introduction

With the rapid consumption of fossil fuels and their non-renewable characteristics, it is extremely urgent to increase the development of renewable energy [1]. Due to the low emission and no pollution, wind power is regarded as one of the important strategic choices for the sustainable development of electric energy. However, because of the randomness and volatility of wind power generation, its large-scale integration into the power grid will bring severe challenges to the safe operation and dispatch of the power system [2]. Therefore, highly accurate prediction of wind power plays an important role in its real-time deployment in the power system.

At present, the existing wind power prediction methods can be grouped into physical model, statistical model as well as machine learning model. The physical model mainly uses historical meteorological data of Numerical Weather Predictions (NWP) to establish mathematical models [3]. However, the modeling process of physical model is complex and the computational cost is large. The statistical models can obtain the corresponding relationship between historical weather conditions and wind power, and then predict the future wind power based on the relationship. Statistical approaches such as autoregressive (AR) [4], autoregressive moving average (ARMA) [5] and variations of ARMA [6] have been applied in related fields of wind power forecasting. However, the prediction performance of these approaches is limited by the non-linearity and non-stationarity of data. Given that machine learning has obvious advantages in dealing with nonlinear problems, some machine learning models such as artificial neural networks (ANNs) [7], support vector machine (SVM) [8] and extreme learning machine (ELM) [9] have been widely applied in the wind forecasting field.

To reduce the prediction difficulty caused by the non-stationary characteristics of wind-related series, many scholars attempted to combine the data decomposition with machine learning model. For example, Wu et al. [10] used variational mode decomposition (VMD) to decompose wind speed data to obtain a series of components, which were predicted using the least squares support
vector machine (LSSVM) model. Their experimental results show that the mean absolute percentage error (MAPE) can decrease from 15.54% to 1.03% due to the introduction of VMD. Naik et al. [11] used empirical mode decomposition (EMD) to decompose wind speed and wind power, and combined it with kernel ridge regression (KRR) to realize wind forecasting. Wang et al. [12] used ensemble empirical mode decomposition (EEMD) to decompose the original wind speed data, and an optimized BP neural network was employed to realize the prediction of each component. Their proposed EEMD-based model presented a significant improvement over the EMD-based model by 25.95% in terms of MAPE. In addition to the above decomposition methods, other decomposition techniques such as wavelet transform (WT) and wavelet packet decomposition (WPD) are also adopted in [13,14].

In addition to data decomposition technology, proper feature selection is another effective way to enhance the prediction ability of machine learning models. Kong et al. [15] presented a principal component analysis (PCA) approach to select the most appropriate input features prior to the use of their proposed reduced support vector machine (RSVM) optimized by particle swarm optimization (PSO). In [16], a feature selection method based on conditional mutual information is proposed to achieve threshold selection of input features. The above feature selection methods aim to improve the prediction accuracy of machine learning models by reducing the number of features, which is regarded as static feature selection that is executed only once before making a prediction. Due to the complex nonlinear mapping relationship between wind power and input variables, the potential importance of each input variable may not be fully evaluated by correlation analysis in multi-step ahead prediction [17]. As an advanced feature selection technique recently applied in the field of deep learning, the attention mechanism can independently assign different weights to highlight key input variables and weaken redundant ones [18]. Up to now, the attention mechanism has been widely used in image classification [19], natural language processing (NLP) [20] and speech recognition [21]. It has also attracted attention in the wind power prediction domain. For example, Peng et al. [22] developed a wind-power prediction model driven by NWP data. In their model, the attention mechanism is used to identify the importance of each element in the high-dimensional NWP dataset from encoder feature extraction. Niu et al. [23] provided an idea of embedding the attention mechanism into the end of the input encoder to perform feature selection on three types of input (i.e., wind, weather, and temporal factors), and compared it with the aforementioned static feature selection methods to verify its superiority. Their results show the superiority of the attention mechanism over the PCA in feature selection, with the MAPE value reduced by 3.63% in the 1-h ahead wind power prediction.

In essence, the expression ability and feature extraction ability of a neural network improve with the increase of network depth [24]. With the rapid development of deep learning, deep models have surpassed many traditional machine learning models in terms of feature extraction capabilities [25]. At present, deep learning has been widely applied in many fields [26] and gradually applied to the field of renewable energy forecasting [27]. For instance, Wang et al. [28] proposed a short-term wind power forecasting approach based on deep belief network (DBN) by using NWP data as input. Lin et al. [29] used the data in the Supervisory Control and Data Acquisition (SCADA) database with a sampling rate of 1-s as input, and built a five-layer feedforward neural network to realize wind power prediction. Furthermore, in view of the excellent feature extraction capabilities of deep convolutional neural networks (CNNs) and its successful application in the field of image classification [30], it has begun to be applied in the field of wind prediction. For example, Wang et al. [31] proposed a wind power forecasting approach based on WT and CNN. They decomposed the original wind power data into different frequencies by using WT, and then the CNN was utilized to predict in each frequency. Shubhi and Volker [32] used the structure of dual one-dimensional (1D) convolutional neural network to achieve prediction of dominant wind speed and direction. Among deep learning methods, the recurrent neural network (RNN) is suitable for dealing with time series problems [33]. However, RNN has a long-term dependency problem. With the further development of RNN, RNN variants such as LSTM and GRU have been developed to overcome this issue and further improve the accuracy of time series prediction [34]. Hence, Peng et al. [35] presented a novel wind speed prediction approach by combining GRU and the data filtering technology called wavelet soft threshold denoising (WSTD). Taking into account the excellent performance of CNN and RNN variants, some scholars combined them for wind prediction and obtained better prediction performance than a single model. For example, Chen et al. [36] combined the spatial feature extraction capabilities of CNN with the temporal feature mining capabilities of long short-term memory neural network (LSTM) to obtain a wind speed prediction model that considers spatio-temporal characteristics of data. Yin et al. [37] presented a dual-mode decomposition method composed of EMD and VMD to decompose the original wind power and wind speed time series. Then, a cascade model combining CNN and LSTM is utilized to realize the extraction of meteorological and temporal features of the decomposed sub-sequences. In [38], the GRU is shown to be simpler and more efficient than the LSTM unit, and its superiority has been verified. Therefore, Liu et al. [39] proposed a multi-step wind speed prediction model that combined CNN-GRU with support vector regression (SVR) and used singular spectrum analysis (SSA) for data decomposition.

Although many CNN-based models have been developed, the research on CNN architecture is still in progress. As a structural variant of CNN, the residual network (RN) based on the idea of residual learning is proposed to alleviate the issue of network degradation and gradient explosion produced by increasing the depth of the network, and it has achieved good results [40]. Up to now, RN has been gradually used in the field of wind-related prediction. For example, Andrea et al. [41] used a deep residual network to build a sea surface wind direction estimation model based on Synthetic Aperture Radar (SAR) images. Ceyhun et al. [42] proposed an improved wind power prediction model of convolution residual network, which used VMD for sequence feature extraction and converted the features into image form as the input of the model.

In this study, a novel hybrid model called EEMD-BA-RGRU-CSO is developed to make multi-step wind power prediction. The main novelty and contributions are as follows. 1) A novel bi-attention mechanism (BA) is first applied to dynamically evaluate the importance of each element in sample input (i.e., 2D tensor matrix) from both feature and temporal dimensions based on EEMD. Such attention-based feature selection strategy can greatly improve the sensitivity of the prediction model to sample inputs. 2) A series-connected deep learning wind power prediction model is proposed by using an improved residual network as the static feature extractor and a GRU for further extracting the temporal features. The proposed residual network combined with GRU, namely RGRU, has powerful ability of extracting both the static and dynamical feature so as to enhance the prediction accuracy greatly. 3) A high-performance crisscross optimization algorithm is employed to retrain the output layer of proposed RGRU model so as to further improve the generalization ability of the prediction model. The proposed EEMD-BA-RGRU-CSO is validated on a Spain wind farm
and the obtained results demonstrate its superiority over other state-of-the-art methods involved in this study.

The remaining structure of the paper is as follows. Section 2 briefly introduces the overall framework and detailed implementation of the proposed hybrid model. The parameter selection is presented in Section 3. Three experimental tests and comparative results are presented in Section 4. Finally, the conclusion is given in Section 5.

1.1. Framework and detailed implementation of the proposed hybrid prediction model

1.1.1. Framework of the proposed prediction model

The framework is illustrated in Fig. 1. It is seen that the proposed hybrid prediction approach consists of data processing, prediction, and retraining. In the data processing stage, the EEMD is utilized to decompose the original wind power and wind speed data, and then the proposed bi-attention feature selection strategy is employed to dynamically transform the original sample input to the weighted sample input. In the prediction stage, the proposed RGRU is used to extract the implicit relationship and temporal features of input to realize the prediction of wind power decomposition components. Finally, the CSO algorithm is used to retrain the fully-connected (FC) layer of RGRU model. The detailed modeling process is described in the following subsections.

Fig. 1. The overall framework of the proposed hybrid model.

1.2. Data processing stage

To alleviate the pressure of feature extraction of prediction models, a combination of data decomposition technology and novel attention mechanism is utilized in the data processing stage. The EEMD is chosen as the mode decomposition technique for wind power and wind speed time series. The bi-attention mechanism is applied to dynamically assign weights to input variables to improve the sensitivity of the model to inputs. The detailed implementation process is described as follows.

1.2.1. Data decomposition based on EEMD

It is known that a proper decomposition of the original wind power data will improve the prediction accuracy of wind power [43]. Compared with WT and VMD, the EEMD decomposition method can realize the adaptive decomposition of time series data [44]. Moreover, EEMD can effectively alleviate the phenomenon of mode mixing that occurs in the signal decomposition process of EMD. In addition, there is an obvious cubic relationship between wind speed and wind power. Taking these into consideration and ensuring the consistency of the data, the EEMD decomposition method is adopted in this study to decompose the original wind
power data and wind speed data at the same time.

The EEMD decomposition result in this study is shown in Fig. 2, and the detailed operations of the decomposition are listed as follows.

1) Add Gaussian white noise $w(t)$ to the wind power/wind speed time series $x(t)$ to generate a noise-added signal $X(t)$:

$$X(t) = x(t) + w(t) \tag{1}$$

2) The signal $X(t)$ is decomposed to obtain a series of intrinsic mode function (imf) components and a residual difference component $n(t)$:

$$X(t) = \sum_{i=1}^{l} imf_i(t) + n(t) \tag{2}$$

where $l$ is the number of imf components.

3) Repeat steps 1) and 2) $N$ times, and take the arithmetic average as the final IMF. The equation is as follows:

$$IMF_i(t) = \frac{1}{N} \sum_{j=1}^{N} imf_{ij}(t) \tag{3}$$

where $IMF_i(t)$ is the final $i$-th IMF component obtained by EEMD, $i = 1, 2, \cdots, l$.

4) After decomposition, the wind power/wind speed time series $x(t)$ can be expressed as:

$$x(t) = \sum_{k=1}^{l} IMF_k(t) + res(t) \tag{4}$$

where $res(t)$ is the final residual term.

1.2.2. Bi-attention mechanism

Wind power component, wind speed component, sine of wind direction, and cosine of wind direction of multiple time steps are set as the original input of the prediction model. According to [23], the influences of different time nodes as well as different features at the same time node on output are not equal. Considering the above, the importance of each element in the matrix may not be adequately evaluated by using a single-dimensional attention mechanism which only focuses on a specific dimension of the input matrix. Hence, a bi-attention mechanism composed of two single-dimensional attention mechanisms (SA) is proposed, aiming to obtain a new weighted input with unequal importance so as to enhance the sensitivity of the model to the input data. The structure of BA designed in this study is shown in Fig. 3.

The core idea of BA is to obtain attention matrices in different dimensions of the input matrix, and perform matrix fusion to obtain a new weighted input matrix. In this study, the original input matrix of wind power prediction is composed of two dimensions: time and feature, which can be expressed as Eq. (5).

$$X_{org} = \begin{bmatrix} x^{1}_{t-T} & x^{2}_{t-T} & \cdots & x^{F}_{t-T} \\ x^{1}_{t-T+1} & x^{2}_{t-T+1} & \cdots & x^{F}_{t-T+1} \\ \vdots & \vdots & \ddots & \vdots \\ x^{1}_{t-1} & x^{2}_{t-1} & \cdots & x^{F}_{t-1} \end{bmatrix}_{(T \times F)} \tag{5}$$

where $x_w = [x^1_w, x^2_w, \cdots, x^F_w]$, $(w = t-T, t-T+1, \cdots, t-1)$ is the feature vector at time $w$ in the matrix, and $x^k = [x^k_{t-T}, x^k_{t-T+1}, \cdots, x^k_{t-1}]$, $(k = 1, 2, \cdots, F)$ is the time vector of length $T$ corresponding to the $k$-th feature in the matrix. In this study, the number of features is 4 (i.e., $F = 4$), which are wind power component, wind speed component, sine of wind direction and cosine of wind direction, respectively. Meanwhile, $T$ is set to be 6.

For the two dimensions of the input matrix $X_{org}$, SA (i.e., feature attention mechanism (FA) and time attention mechanism (TA)) is applied to obtain the feature attention matrix $X_{feature\_att\_matrix}$ and the time attention matrix $X_{time\_att\_matrix}$, respectively. FA and TA are carried out at the same time. Combined with Fig. 3, the operation mechanism of FA is introduced with $X_{org}$ as input:

Fig. 2. EEMD decomposition results of wind power and wind speed.
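The ensemble procedure behind Fig. 2 (steps 1) to 4), Eqs. (1)-(4)) can be illustrated with a minimal sketch. It is not the authors' implementation: the inner decomposition `toy_decompose` is a hypothetical moving-average stand-in for EMD sifting, chosen only so that the example runs with NumPy alone, and the values `n_trials`, `noise_ratio`, and the test signal are assumptions. What the sketch does show is the noise injection of Eq. (1), the per-trial split of Eq. (2), the averaging of Eq. (3), and the reconstruction check of Eq. (4).

```python
import numpy as np

def moving_average(sig, width):
    """Centered moving average with edge padding; `width` must be odd."""
    pad = width // 2
    padded = np.pad(sig, pad, mode="edge")
    return np.convolve(padded, np.ones(width) / width, mode="valid")

def toy_decompose(sig, widths=(5, 21)):
    """HYPOTHETICAL stand-in for EMD sifting (not real EMD).

    Peels off detail layers with moving averages of growing width and
    returns (components, residual) whose sum equals `sig`, as in Eq. (2).
    """
    comps, current = [], sig
    for w in widths:
        smooth = moving_average(current, w)
        comps.append(current - smooth)
        current = smooth
    return np.array(comps), current

def ensemble_decompose(x, n_trials=100, noise_ratio=0.2, seed=0):
    """Average the components over noisy copies of x (Eqs. (1) and (3))."""
    rng = np.random.default_rng(seed)
    noise_std = noise_ratio * np.std(x)
    comp_sum, res_sum = 0.0, 0.0
    for _ in range(n_trials):
        noisy = x + rng.normal(0.0, noise_std, size=x.shape)  # Eq. (1)
        comps, res = toy_decompose(noisy)                      # Eq. (2)
        comp_sum = comp_sum + comps
        res_sum = res_sum + res
    return comp_sum / n_trials, res_sum / n_trials, noise_std  # Eq. (3)

# Synthetic two-tone signal standing in for a wind power series.
t = np.linspace(0.0, 1.0, 200)
x = np.sin(2 * np.pi * 3 * t) + 0.3 * np.sin(2 * np.pi * 25 * t)
imfs, res, noise_std = ensemble_decompose(x)

# Eq. (4): averaged components plus residual rebuild x up to the mean noise.
recon_error = np.max(np.abs(imfs.sum(axis=0) + res - x))
```

Because the added noise only averages out over the ensemble, the reconstruction in Eq. (4) is approximate for a finite number of trials; the leftover shrinks roughly with the square root of `n_trials`. A real study would replace `toy_decompose` with an actual EMD routine.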
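As a small illustration of the input layout in Eq. (5), the sketch below stacks the four features named in the text (power component, speed component, sine and cosine of wind direction) and cuts sliding windows of T = 6 steps. The function name `build_input_windows`, the assumption that wind direction is given in degrees, and the synthetic series are all hypothetical choices made for this example.

```python
import numpy as np

def build_input_windows(power_comp, speed_comp, direction_deg, T=6):
    """Return an array of shape (n_windows, T, 4) following Eq. (5).

    Window i holds the T consecutive steps i .. i+T-1, i.e. the rows
    t-T .. t-1 of X_org for the forecast origin t = i + T.
    Assumes the wind direction is supplied in degrees.
    """
    theta = np.deg2rad(direction_deg)
    features = np.column_stack(
        [power_comp, speed_comp, np.sin(theta), np.cos(theta)]
    )
    n_windows = features.shape[0] - T + 1
    return np.stack([features[i:i + T] for i in range(n_windows)])

# Synthetic series of 10 samples, purely for shape checking.
power = np.arange(10, dtype=float)
speed = 2.0 * np.arange(10, dtype=float)
direction = np.linspace(0.0, 270.0, 10)

windows = build_input_windows(power, speed, direction, T=6)
X_org = windows[0]  # one (T x F) sample, rows = time steps, columns = features
```

Encoding the direction as a sine/cosine pair keeps 0 and 360 degrees adjacent, which is presumably why the paper uses two direction features instead of the raw angle.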
$$\begin{cases} U_{(T \times F)} = X_{org} W_f + b_f \\[4pt] A_{(T \times F)} = \dfrac{\exp(U_{ij})}{\sum_{j=1}^{F} \exp(U_{ij})}, \quad i \in [t-T, t-1] \\[4pt] \alpha = \dfrac{\sum_{q=1}^{T} A_{qr}}{T} = [\alpha_1, \alpha_2, \cdots, \alpha_F], \quad r \in [1, F] \end{cases} \tag{6}$$

where $U$ is the unnormalized feature probability weight matrix obtained after the input matrix $X_{org}$ is fed to the neural network operation, $W_f$ is the weight matrix of the neural network, and $b_f$ is the bias vector. $A$ is the normalized feature probability weight matrix that uses the softmax activation function to normalize $U$ so that the sum of the probability weights in each row is 1. $\alpha$ is the feature attention vector of length $F$ (i.e., the output of FA). Subsequently, each row in matrix $X_{org}$ is multiplied element-wise by $\alpha$ to obtain the feature attention matrix $X_{feature\_att\_matrix}$.

$$X_{feature\_att\_matrix} = X_{org} \odot \alpha = \begin{bmatrix} \alpha_1 x^{1}_{t-T} & \alpha_2 x^{2}_{t-T} & \cdots & \alpha_F x^{F}_{t-T} \\ \alpha_1 x^{1}_{t-T+1} & \alpha_2 x^{2}_{t-T+1} & \cdots & \alpha_F x^{F}_{t-T+1} \\ \vdots & \vdots & \ddots & \vdots \\ \alpha_1 x^{1}_{t-1} & \alpha_2 x^{2}_{t-1} & \cdots & \alpha_F x^{F}_{t-1} \end{bmatrix}_{(T \times F)} \tag{7}$$

Fig. 3. The operation process of BA.

Similarly, combined with Fig. 3, the operation mechanism of TA is introduced with the transpose of $X_{org}$ as input:

$$\begin{cases} V_{(F \times T)} = X_{org}^{T} W_t + b_t \\[4pt] B_{(F \times T)} = \dfrac{\exp(V_{ij})}{\sum_{j=t-T}^{t-1} \exp(V_{ij})}, \quad i \in [1, F] \\[4pt] \beta = \dfrac{\sum_{m=1}^{F} B_{mn}}{F} = [\beta_{t-T}, \beta_{t-T+1}, \cdots, \beta_{t-1}], \quad n \in [t-T, t-1] \end{cases} \tag{8}$$

where $V$ is the unnormalized time probability weight matrix
obtained after the transpose of input matrix $X_{org}$ is fed to the neural network operation, $W_t$ is the weight matrix of the neural network, and $b_t$ is the bias vector. $B$ is the normalized time probability weight matrix that uses the softmax activation function to normalize $V$ so that the sum of the probability weights in each row is 1. $\beta$ is the time attention vector of length $T$ (i.e., the output of TA). Subsequently, each column in matrix $X_{org}$ is multiplied element-wise by the transpose of $\beta$ to obtain the time attention matrix $X_{time\_att\_matrix}$.

$$X_{time\_att\_matrix} = X_{org} \odot \beta^{T} = \begin{bmatrix} \beta_{t-T} x^{1}_{t-T} & \beta_{t-T} x^{2}_{t-T} & \cdots & \beta_{t-T} x^{F}_{t-T} \\ \beta_{t-T+1} x^{1}_{t-T+1} & \beta_{t-T+1} x^{2}_{t-T+1} & \cdots & \beta_{t-T+1} x^{F}_{t-T+1} \\ \vdots & \vdots & \ddots & \vdots \\ \beta_{t-1} x^{1}_{t-1} & \beta_{t-1} x^{2}_{t-1} & \cdots & \beta_{t-1} x^{F}_{t-1} \end{bmatrix}_{(T \times F)} \tag{9}$$

Then perform the fusion operation on the matrices $X_{feature\_att\_matrix}$ and $X_{time\_att\_matrix}$ to obtain the final attention output matrix. The fusion process is as follows:

$$X_{att\_out} = \mu \cdot X_{feature\_att\_matrix} + \lambda \cdot X_{time\_att\_matrix}, \qquad \mu + \lambda = 1 \tag{10}$$

where $X_{att\_out}$ is the attention output matrix with the output shape of $T \times F$. $\mu$ and $\lambda$ are matrix fusion factor coefficients, which add up to 1. Then, the shape of $X_{att\_out}$ is resized to $T \times F \times C$ to obtain the input $X_{in}$ of the RGRU model. $C$ is set to be 1 in this study. It is worth noting that the proposed BA is essentially a dynamic evaluation process by using neural network, so it needs to be trained together with the prediction model.

1.3. Prediction stage

In this study, a hybrid prediction model RGRU composed of the residual network and GRU is illustrated in the second part of Fig. 1. There are two advantages of combining these single prediction models: 1) to highlight the internal coupling relationship of the input by merging shallow and deep features; 2) to further extract the hidden temporal features. The detailed modeling and implementation process of each component of RGRU is described as follows.

1.3.1. The residual network

The cross-layer connection structure of the residual network increases the information transmission path so that the shallow features can be fed to the deeper layers of the network, which can improve the accuracy of the network [45]. Meanwhile, as a member of CNN architecture, the residual network also has the excellent feature representation and extraction ability of CNN. Hence, a residual network structure designed in this study is utilized to perform deeper feature extraction on the information obtained by BA. The residual network is generally formed by stacking residual units, and the detailed structure of the original residual unit is shown in Fig. 4. Assuming that the input of the residual unit in Fig. 4 is $y_m$, the corresponding output $y_{m+1}$ of the unit can be expressed as:

$$y_{m+1} = f(y_m + F(y_m)) \tag{11}$$

where $F(y_m)$ is the residual function represented by the residual unit, $f(\cdot)$ is the nonlinear activation function, and the ReLU function is generally used.

Fig. 4. Structure diagram of the original residual unit.

The residual network designed in this paper is constructed based on the residual unit in Fig. 4. The unit is designed to contain 3 convolutional layers of $1 \times 1$ kernel-size, and the activation function of each convolutional layer is uniformly set to ReLU. Compared with a large-size filter such as $3 \times 3$, the advantage of the small-size filter is that the number of calculation parameters is smaller, and it can also provide the extraction of advanced abstract features. Furthermore, the tanh function is used to replace the ReLU function as the unit output activation function, which is expressed in Eq. (12). The purposes of using this function are: 1) to map the feature map extracted by the residual unit to the range [-1, 1]; 2) to overcome the dying ReLU problems caused by zero-slope. Finally, the above residual unit is used to build the residual network in this paper, and the total number of units is 1. Fig. 5 illustrates the designed structure of residual network in this study.

$$\tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)} \tag{12}$$

Fig. 5. The designed residual network structure.

Combined with Fig. 5, in this study, the $X_{in}$ obtained before is fed to the designed residual network to get the network output $X_{res\_out}$ with the same size. Then, a sequence $Y = (y_1, y_2, \cdots, y_T)$ of length $T$ is obtained by converting the output of the residual network from 3D to 2D, where $y_t \in [-1, 1]^{(1 \times F)}$ is an $F$-dimension vector. Subsequently, the obtained $Y$ is used as the input of the cascaded GRU network.

1.3.2. Gated recurrent unit

In dealing with sequence problems, RNNs have an outstanding performance among various machine learning models. However, it is easy to get caught in the problems of gradient vanishing and explosion in the training process of RNN [46]. As one of the variants of
RNN, GRU is simpler in structure than LSTM and can overcome the above-mentioned problems. Moreover, GRU has fewer training parameters and higher computational efficiency. Hence, GRU is considered in this study to further extract the temporal features. Fig. 6 illustrates the internal structure of the GRU unit.

Fig. 6. The internal structure of GRU unit.

The reset gate and update gate are the core components of the GRU network. As shown in Fig. 6, the reset gate $r_t$ controls the fusion between the new input of the network and the previous memory. The update gate $z_t$ determines the amount of memory information retained. Here, the detailed updating method of GRU unit is introduced by using the $F$-dimension vector $y_t$ at time $t$ in the obtained $Y$ as input.

$$\begin{cases} r_t = \sigma(y_t W_{ry} + h_{t-1} W_{rh} + b_r) \\ z_t = \sigma(y_t W_{zy} + h_{t-1} W_{zh} + b_z) \\ \tilde{h}_t = \tanh(y_t W_{hy} + (r_t \odot h_{t-1}) W_{hh} + b_h) \\ h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \end{cases} \tag{13}$$

where $h_t$ is the output of the GRU unit at time $t$. $W$ represents the weight matrix, and $b$ is the bias vector. $\sigma(\cdot)$ represents the sigmoid function, and $\odot$ is the element-wise multiplication of the corresponding positions.

The GRU network built in this paper contains 2 layers with 4 and 8 neurons, respectively, and the activation function is the ReLU function. Subsequently, two FC layers are used to build a regression prediction, in which the numbers of neurons are 32 and the number of prediction steps, respectively. The regression prediction takes the output of GRU as input to obtain the final prediction value of the RGRU model.

So far, the wind power prediction model has been established. In this study, Adam optimization algorithm is utilized to train the prediction model [47].

1.4. Retraining stage

It is proven that the prediction model trained by the popular Adam optimizer can achieve rapid convergence in the early stage, but too low a learning rate in the later stage of training may affect the effective convergence of the model and cause generalization problems [48]. Inspired by the idea of using population-based optimization algorithms to optimize the parameters of shallow layer neural networks in [49], a high-performance CSO [50] is used in this study to retrain the parameters of the output layer based on the proposed model trained by Adam, aiming to enhance the generalization performance of the proposed model. In fact, the parameters of the proposed model trained by Adam are fixed in all layers except the output layer in this stage. Hence, the layers with fixed parameters are used as the feature extractor so as to transform the deep learning model into a shallow BP neural network.

It is worth noting that the input in the early stage of the retraining is consistent with that in the Adam optimizer training. However, the final input fed to the CSO retraining layer is the high-dimensional abstract features extracted by the aforementioned parameter-fixed feature extractor.

Here, suppose there is a population $Z$ of size $P$ and dimension $D$, in which $D$ represents the number of weights and biases of the output layer to be optimized by CSO and can be expressed as Eq. (14). The specific implementation process of using CSO to retrain the output layer in this study is as follows.

$$D = h_{fc} \times n_{out} + n_{out} \tag{14}$$

where $h_{fc}$ and $n_{out}$ are the numbers of neurons in the first and output layers of the designed FC layers, respectively.

Step 1: Parameters initialization.
The weights and biases are randomly generated in the initial population $Z$ according to the bound rules in Eq. (15). Each individual in population $Z$ can be represented as Eq. (16).

$$\begin{cases} w_{range} = \left( -\sqrt{\dfrac{6}{h_{fc} + n_{out}}},\ \sqrt{\dfrac{6}{h_{fc} + n_{out}}} \right) \\[8pt] b_{range} = \left( -|b_{trained}| - \Delta q,\ |b_{trained}| + \Delta q \right) \end{cases} \tag{15}$$

where $b_{trained}$ represents the bias of the output layer of the trained model. $\Delta q$ indicates the offset of the bias.

$$Z(k) = [p_{k1}, p_{k2}, \cdots, p_{kD}], \quad k \in [1, P] \tag{16}$$

where $p_{kD}$ indicates the weight or bias in the output layer.

Step 2: Fitness calculation.
The fitness value of each individual in $Z$ can be calculated by Eq. (17).

$$fitness(Z(k)) = \frac{1}{s_T} \sum_{i=1}^{s_T} (y_{true} - y_{pred})^2, \quad k \in [1, P] \tag{17}$$

where $fitness(Z(k))$ represents the fitness value corresponding to the individual $Z(k)$, and $s_T$ is the number of samples in the training set. $y_{true}$ and $y_{pred}$ are the true value and model predicted value, respectively.

Step 3: Vertical crossover.
The vertical crossover (VC) operator in CSO algorithm is the arithmetic crossover operation for different dimensions of the same individual in the population. Performing the VC operation on the dimensions $d_1$ and $d_2$ of the population individual $Z(i)$, the process is as follows:

$$S_{vc}(i, d_1) = r \cdot Z(i, d_1) + (1 - r) \cdot Z(i, d_2) \tag{18}$$

where $S_{vc}(i, d_1)$ is the result of the cross operation. $r$ is a uniformly distributed random value between 0 and 1.

After performing the VC operation on population $Z$, the population $Z$ is updated according to the following rules.

$$Z(k) = \begin{cases} S_{vc}(k), & \text{if } fitness(Z(k)) > fitness(S_{vc}(k)) \\ Z(k), & \text{if } fitness(Z(k)) < fitness(S_{vc}(k)) \end{cases}, \quad k \in [1, P] \tag{19}$$

Step 4: Horizontal crossover.
The horizontal crossover (HC) operator in CSO algorithm is an arithmetic crossover operation for the same dimension between two different individuals in the population. Select two individuals $Z(i)$ and $Z(j)$ in the population, and perform a horizontal crossover on the $d$-th dimension of the selected individuals. The process is as
follows:

$$
\begin{aligned}
S_{hc}(i,d) &= r_1 \cdot Z(i,d) + (1 - r_1) \cdot Z(j,d) + c_1 \cdot \big(Z(i,d) - Z(j,d)\big) \\
S_{hc}(j,d) &= r_2 \cdot Z(j,d) + (1 - r_2) \cdot Z(i,d) + c_2 \cdot \big(Z(j,d) - Z(i,d)\big)
\end{aligned}
\tag{20}
$$

where $Z(i,d)$ represents the d-th dimension of $Z(i)$, and $S_{hc}(i,d)$ is the result of the cross operation. $r_1$ and $r_2$ are probabilities between 0 and 1. $c_1$ and $c_2$ are random numbers between -1 and 1.

After performing the HC operation on population Z, the population Z is updated according to the following rule.

$$
Z(k) =
\begin{cases}
S_{hc}(k), & \text{if } fitness(Z(k)) > fitness(S_{hc}(k)) \\
Z(k), & \text{if } fitness(Z(k)) < fitness(S_{hc}(k))
\end{cases}
, \quad k \in [1, P]
\tag{21}
$$

Step 5: Terminal condition.
Determine whether the maximum number of iterations of the CSO algorithm has been reached. If not, repeat steps 3 and 4. Finally, the best performing individual in the population serves as the final weights and biases of the output layer.

It is worth noting that, according to [51], the vertical probability Pv and the horizontal probability Ph in the CSO algorithm are set to 0.8 and 1, respectively, in this study.

1.5. Evaluation and comparison

1.5.1. Evaluation metrics

The mean absolute error (MAE), root mean square error (RMSE) and R-squared (R2) are used to compare the performance of the proposed model with that of the other models in this paper. They are computed by:

$$
\begin{cases}
MAE = \dfrac{1}{N} \sum\limits_{t=1}^{N} \left| q_{true}(t) - q_{pred}(t) \right| \\[2ex]
RMSE = \sqrt{\dfrac{1}{N} \sum\limits_{t=1}^{N} \big( q_{true}(t) - q_{pred}(t) \big)^2}
\end{cases}
\tag{22}
$$

$$
R^2 = 1 - \frac{\sum_{t=1}^{N} \big( q_{true}(t) - q_{pred}(t) \big)^2}{\sum_{t=1}^{N} \big( q_{true}(t) - \bar{q} \big)^2}
\tag{23}
$$

where $q_{true}(t)$ and $q_{pred}(t)$ respectively represent the true and estimated values of wind power at time t, $\bar{q}$ is the mean value of $q_{true}$, and N is the number of samples in the test set.

1.5.2. Compared models

To validate the effectiveness of the proposed model, CNN, GRU, CNN-GRU and the residual network are selected as comparison models. The detailed structure information of these prediction models is listed in Table 1. In particular, nconv represents the number of convolutional layers, nfc represents the number of neurons in the FC layers before the output layer, ng1 represents the number of GRU units in the first GRU layer, and ng2 represents the number of GRU units in the second GRU layer.

2. Optimal parameters selection

For EEMD-BA-RGRU, there are two parameters that affect the prediction performance of the model: the input time step and the decomposition number of EEMD. Therefore, in this section, these two parameters are determined by two groups of experimental tests.

2.1. Selection of input time steps

To determine the first parameter, BA-RGRU is used to make multi-step ahead predictions in this experimental test. The MAE and RMSE prediction errors of BA-RGRU with various input time steps in one-step, two-step and three-step prediction are shown in Fig. 7. According to the error curves in Fig. 7, the best forecasting performance is obtained when the input time step is set to 6, whatever the prediction step. Therefore, in all the following experiments, the input time step is uniformly set to 6 for one-step, two-step and three-step wind power prediction.

2.2. Selection of decomposition number

To investigate the influence of the number of EEMD decomposition layers on the prediction result, EEMD-BA-RGRU is utilized as the prediction model. The number of decomposition layers is varied from 2 to 9 in this experimental test. Fig. 8 shows the MAE and RMSE errors of EEMD-BA-RGRU with various decomposition layers in multi-step prediction. The minimum errors are obtained when the number of decomposition layers is set to 3, whatever the prediction step. Hence, the number of decomposition layers in EEMD is set to 3 for all the following experiments.

3. Case study

To verify the effectiveness of the proposed EEMD-BA-RGRU-CSO prediction model, the following groups of comparative experiments are carried out with one-step, two-step and three-step predictions in this section. All simulation experiments are implemented with the Keras deep learning framework under Python 3.7.3. The simulation platform is an Intel Core i5-10210U processor running at 2.11 GHz with 8 GB of memory under the Windows 10 operating system.

3.1. Data description

The experimental data from a wind farm located in Spain is used to validate the effectiveness of the proposed forecasting approach EEMD-BA-RGRU-CSO in this paper. In this wind farm, the measured
Table 1
The structure information of the compared models.

| Model | Input dimension | Configuration | Output dimension |
|---|---|---|---|
| CNN | 6 × 4 × 1 | nconv: 3, kernel size: 2 × 2, nfc: 32 | Prediction step |
| GRU | 6 × 4 | ng1: 4, ng2: 8, nfc: 32 | Prediction step |
| Residual network | 6 × 4 × 1 | nconv: 3, kernel size: 1 × 1, nfc: 32 | Prediction step |
| CNN-GRU | 6 × 4 × 1 | nconv: 3, kernel size: 2 × 2, ng1: 4, ng2: 8, nfc: 32 | Prediction step |
| RGRU | 6 × 4 × 1 | nconv: 3, kernel size: 1 × 1, ng1: 4, ng2: 8, nfc: 32 | Prediction step |
| SA | 6 × 4 | Dual-layer neural network | 6 × 4 |
| BA | 6 × 4 | Two parallel dual-layer neural networks | 6 × 4 |
Fig. 7. Error curves of BA-RGRU with various input time steps in multi-step ahead prediction.
Fig. 8. Error curves of EEMD-BA-RGRU with various numbers of decomposition layers for EEMD in multi-step prediction.
wind power, wind speed and wind direction are sampled at 1-h intervals for each day. Therefore, the original dataset is composed of wind power, wind speed and wind direction sequences. The original historical data of this wind farm in 2016 are shown in Fig. 9(a). The dataset of April shown in Fig. 9(b) is used in cases 1–3, aiming at verifying the feasibility of the proposed model. To demonstrate the robustness and advantage of the proposed model over other state-of-the-art methods, the compared models are validated on four datasets of different months, namely January, April, July, and October of 2016, in case 4.

According to [55], we use the first 600 sets of sampled data in the dataset to train the proposed prediction model, and the remaining 120 sets of sampled data to test the performance of the proposed model. It is worth noting that the data needs to be normalized before being fed into the model, as expressed in Eq. (24).

$$
x^* = \frac{x - x_{min}}{x_{max} - x_{min}} (a - b) + b
\tag{24}
$$

where x and x* respectively represent the values before and after normalization. The normalized range is [a, b]. In this study, the input data is normalized to the range [-1, 1].

3.2. Case 1: the effectiveness of the proposed EEMD-BA data processing technique

3.2.1. Validity of EEMD decomposition method

The purpose of this experiment is to verify that EEMD can effectively improve the prediction accuracy of the model. The verification experiments are carried out on the original data and on the data decomposed by EEMD, with the CNN, GRU and CNN-GRU prediction models used for comparison. The multi-step prediction results of different models with and without EEMD are shown in Table 2 and Fig. 10.

From Table 2, some conclusions can be drawn. (a) The EEMD-based models have better prediction accuracy. For example, compared with GRU, the MAE errors of EEMD-GRU in 1-step, 2-step and 3-step predictions are reduced by 59.69%, 58.03% and 57.25%, respectively. Compared with CNN, the MAE errors of EEMD-CNN in 1-step, 2-step and 3-step predictions are cut by 48.11%, 47.79% and 47.36%, respectively. (b) The EEMD-CNN-GRU model performs best, with the minimum MAE and RMSE and the highest R2, which shows that the cascaded CNN-GRU has outstanding ability in extracting coupled relationships and temporal information.
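The error indices used throughout the case study are those of Eqs. (22) and (23). As a minimal sketch of how they can be computed, the following pure-Python function evaluates MAE, RMSE and R2 for a pair of series. The function name `evaluate_forecast` and the sample values are illustrative assumptions and do not come from the paper.

```python
import math

def evaluate_forecast(q_true, q_pred):
    """Return (MAE, RMSE, R2) following Eqs. (22) and (23)."""
    n = len(q_true)
    errors = [t - p for t, p in zip(q_true, q_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_true = sum(q_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in q_true)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Hypothetical wind power values in MW, for illustration only
q_true = [2.0, 4.0, 6.0, 8.0]
q_pred = [2.5, 3.5, 6.5, 7.5]
mae, rmse, r2 = evaluate_forecast(q_true, q_pred)
print(mae, rmse, r2)
```

Lower MAE and RMSE and a higher R2 indicate a better fit, which is the reading applied to every results table in this section.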
Fig. 9. Original time series of Dataset.
Table 2
Results of multi-step prediction with and without EEMD.

| Model | MAE 1-step (MW) | MAE 2-step (MW) | MAE 3-step (MW) | RMSE 1-step (MW) | RMSE 2-step (MW) | RMSE 3-step (MW) | R2 1-step | R2 2-step | R2 3-step |
|---|---|---|---|---|---|---|---|---|---|
| CNN | 0.6849 | 0.7864 | 0.9463 | 0.9133 | 1.0553 | 1.3107 | 0.8037 | 0.7383 | 0.5968 |
| GRU | 0.6606 | 0.7594 | 0.8619 | 0.8856 | 0.9776 | 1.1542 | 0.8154 | 0.7754 | 0.6874 |
| CNN-GRU | 0.6486 | 0.7506 | 0.8459 | 0.8492 | 0.9189 | 1.0898 | 0.8303 | 0.8016 | 0.7214 |
| EEMD-CNN | 0.3554 | 0.4106 | 0.4981 | 0.4356 | 0.5471 | 0.7005 | 0.9550 | 0.9291 | 0.884 |
| EEMD-GRU | 0.2663 | 0.3187 | 0.3685 | 0.3266 | 0.4201 | 0.4941 | 0.9747 | 0.9582 | 0.9423 |
| EEMD-CNN-GRU | 0.2211 | 0.3011 | 0.3536 | 0.2949 | 0.3884 | 0.4605 | 0.9794 | 0.9643 | 0.9499 |
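The percentage reductions quoted for Table 2 are relative reductions with respect to the model without EEMD. As a short worked example, the sketch below recomputes the MAE reductions of EEMD-GRU over GRU from the table values. The helper name `reduction_percent` is an illustrative assumption.

```python
def reduction_percent(baseline, improved):
    """Relative reduction of an error index, in percent."""
    return (baseline - improved) / baseline * 100.0

# 1-step, 2-step and 3-step MAE values (MW) copied from Table 2
mae_gru = [0.6606, 0.7594, 0.8619]
mae_eemd_gru = [0.2663, 0.3187, 0.3685]

reductions = [round(reduction_percent(b, i), 2)
              for b, i in zip(mae_gru, mae_eemd_gru)]
print(reductions)  # [59.69, 58.03, 57.25]
```

The same calculation applies to the other improvement percentages reported in the case study.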
Fig. 10. Bar chart of multi-step prediction with and without EEMD.
Table 3
Comparison between the BA-based model and the SA-based models.

| Model | MAE 1-step (MW) | MAE 2-step (MW) | MAE 3-step (MW) | RMSE 1-step (MW) | RMSE 2-step (MW) | RMSE 3-step (MW) | R2 1-step | R2 2-step | R2 3-step |
|---|---|---|---|---|---|---|---|---|---|
| EEMD-CNN-GRU | 0.2211 | 0.3011 | 0.3536 | 0.2949 | 0.3884 | 0.4605 | 0.9794 | 0.9643 | 0.9499 |
| EEMD-FA-CNN-GRU | 0.2149 | 0.2893 | 0.3449 | 0.2697 | 0.3803 | 0.4579 | 0.9827 | 0.9663 | 0.9504 |
| EEMD-TA-CNN-GRU | 0.2173 | 0.2943 | 0.3480 | 0.2723 | 0.3813 | 0.4601 | 0.9824 | 0.9655 | 0.9500 |
| EEMD-BA-CNN-GRU | 0.2114 | 0.2843 | 0.3286 | 0.2599 | 0.3754 | 0.4429 | 0.9840 | 0.9666 | 0.9536 |
Table 4
Comparison between the EEMD-BA-RGRU and other EEMD-BA based models.

| Model | MAE 1-step (MW) | MAE 2-step (MW) | MAE 3-step (MW) | RMSE 1-step (MW) | RMSE 2-step (MW) | RMSE 3-step (MW) | R2 1-step | R2 2-step | R2 3-step |
|---|---|---|---|---|---|---|---|---|---|
| EEMD-BA-CNN | 0.3012 | 0.3746 | 0.4774 | 0.3942 | 0.5021 | 0.6513 | 0.9631 | 0.9403 | 0.8997 |
| EEMD-BA-GRU | 0.2452 | 0.3011 | 0.3483 | 0.3043 | 0.4070 | 0.4554 | 0.9780 | 0.9607 | 0.9510 |
| EEMD-BA-CNN-GRU | 0.2114 | 0.2843 | 0.3286 | 0.2599 | 0.3754 | 0.4429 | 0.9840 | 0.9666 | 0.9536 |
| EEMD-BA-RN-ECL | 0.3900 | 0.4144 | 0.4662 | 0.5094 | 0.5488 | 0.6317 | 0.9384 | 0.9286 | 0.9057 |
| EEMD-BA-RN | 0.2623 | 0.3531 | 0.3978 | 0.3374 | 0.4946 | 0.5460 | 0.9730 | 0.9420 | 0.9295 |
| EEMD-BA-RGRU | 0.1906 | 0.2575 | 0.2980 | 0.2400 | 0.3416 | 0.3933 | 0.9863 | 0.9723 | 0.9634 |
Fig. 11. Multi-step prediction results of different BA-based models.
3.2.2. Superiority of proposed bi-attention mechanism

To verify the effectiveness of the bi-attention mechanism (BA), SA (i.e., the feature attention mechanism (FA) or the time attention mechanism (TA)) is applied to the CNN-GRU model, thus forming two SA-based prediction models, EEMD-FA-CNN-GRU and EEMD-TA-CNN-GRU, which are compared with EEMD-CNN-GRU and EEMD-BA-CNN-GRU. Their multi-step prediction results are shown in Table 3.
As shown in Table 3, it can be found that: (a) by utilizing SA, the prediction accuracy of the model can be improved. For example, compared with EEMD-CNN-GRU, the MAEs of EEMD-FA-CNN-GRU are cut by 2.80%, 3.92% and 2.46% and the MAEs of EEMD-TA-CNN-GRU are cut by 1.72%, 2.26% and 1.58% in 1-step, 2-step and 3-step predictions. (b) The FA-based model outperforms the TA-based model in terms of prediction accuracy. The MAEs of EEMD-FA-CNN-GRU in 1-step, 2-step and 3-step predictions are lower than those of EEMD-TA-CNN-GRU by 1.10%, 1.70% and 0.89%, respectively. (c) The BA-based model is superior to the SA-based models. For example, the RMSEs of 3-step prediction using EEMD-BA-CNN-GRU are down by 3.28% and 3.74%, compared with EEMD-FA-CNN-GRU and EEMD-TA-CNN-GRU. Moreover, its R2 values, the highest in Table 3, confirm the effectiveness of the BA-based prediction model.

3.3. Case 2: outstanding prediction performance of RGRU

To investigate the ability of RGRU in mining data in depth, the prediction models EEMD-BA-CNN, EEMD-BA-GRU, EEMD-BA-CNN-GRU, EEMD-BA-RN-ECL (i.e., the cross-layer connection of RN is eliminated), EEMD-BA-RN and EEMD-BA-RGRU are utilized for comparison. The comparison results are shown in Table 4 and Fig. 11.

According to the prediction results, it is clear that: (a) by applying BA for data fusion, the cascaded CNN-GRU still performs well compared with the single models. For example, compared with EEMD-BA-CNN and EEMD-BA-GRU, the MAE errors of EEMD-BA-CNN-GRU in 1-step prediction are reduced by 29.81% and 13.78%, respectively. (b) On the premise of ensuring the feature extraction ability of the network, the cross-layer connection of the residual network allows the shallow information of the network to be utilized effectively. Compared with EEMD-BA-RN-ECL, the MAE errors of EEMD-BA-RN in 1-step, 2-step, and 3-step predictions are down by 32.74%, 14.79% and 14.67%. The results show that fusing the extracted deep and shallow features can improve the prediction accuracy. (c) EEMD-BA-RGRU outperforms all other EEMD-BA based models, with the minimum MAE and RMSE and the highest R2 in 1-step, 2-step and 3-step prediction alike, which shows that the proposed RGRU has a powerful ability to extract the implicit relationship of data. In particular, the MAE and RMSE of 3-step prediction using the proposed EEMD-BA-RGRU are down by 9.31% and 11.20%, and the R2 is up by 1.02%, compared with the second best performing EEMD-BA-CNN-GRU.

3.4. Case 3: the effectiveness of EEMD-BA-RGRU optimized by CSO

This experimental test is used to investigate the effect of CSO on improving the generalization ability of the proposed EEMD-BA-RGRU. To verify the superiority of the proposed CSO method, several other popular swarm intelligence algorithms including differential evolution (DE), flower pollination algorithm (FPA) and particle swarm optimization (PSO) are used as the benchmark algorithms. The results obtained by employing different swarm intelligence algorithms are listed in Table 5. Furthermore, in order to intuitively observe the prediction effect of the proposed model, the multi-step prediction curves of different prediction models in all the above cases are depicted in Fig. 12.

As shown in Table 5, it can be observed that: (a) the performance of EEMD-BA-RGRU can be further improved to different degrees when the swarm intelligence algorithms are utilized to retrain its fully-connected layer. For example, compared with EEMD-BA-RGRU in 1-step, 2-step and 3-step prediction, the MAEs of the proposed EEMD-BA-RGRU-CSO are down by 5.04%, 3.50% and 4.90%, respectively. However, the EEMD-BA-RGRU model optimized by PSO or FPA shows no obvious improvement, especially in multi-step predictions. The possible reason is that they may fall into a local optimum when applied to optimize the FC layer with dozens of parameters. (b) Both DE and CSO improve on EEMD-BA-RGRU, but CSO has the more obvious advantage. In particular, the MAEs of EEMD-BA-RGRU-CSO in 1-step, 2-step and 3-step predictions are cut by 1.36%, 1.74% and 2.11%, and the RMSEs are down by 2.05%, 2.59% and 1.14%, compared with EEMD-BA-RGRU-DE. The above experimental results show the effectiveness of EEMD-BA-RGRU optimized by CSO.

3.5. Case 4: EEMD-BA-RGRU-CSO vs other state-of-the-art models

This case aims to comprehensively verify the effectiveness of the proposed EEMD-BA-RGRU-CSO by comparison with the persistence model and other state-of-the-art decomposition-based hybrid forecasting models including EMD-CNN-LSTM [37], CEEMDAN-FPA-BP [52], VMD-LSTM-ELM [53] and EWT-BiDLSTM [54]. The parameters of the compared models are all set according to the suggestions reported in the references. To investigate the performance of the proposed method in various periods of the year, all the prediction models involved in this case are validated on four datasets over different quarters, corresponding to January, April, July, and October of 2016. The prediction results of different methods are shown in Table 6.

As shown in Table 6, some conclusions can be made as follows:

(1) The obtained values of MAE, RMSE and R2 fluctuate over the different quarters. It is observed that the prediction in January is the most difficult for all the prediction models. It is not surprising because the wind power series in January fluctuates most strongly among the four quarters, as seen from Fig. 9(a). Although the proposed model performs worst in January, it still outperforms its competitors greatly. For example, the proposed EEMD-BA-RGRU-CSO achieved a minimum RMSE
Table 5
Results of EEMD-BA-RGRU models optimized by different swarm intelligence algorithms.

| Model | MAE 1-step (MW) | MAE 2-step (MW) | MAE 3-step (MW) | RMSE 1-step (MW) | RMSE 2-step (MW) | RMSE 3-step (MW) | R2 1-step | R2 2-step | R2 3-step |
|---|---|---|---|---|---|---|---|---|---|
| EEMD-BA-RGRU | 0.1906 | 0.2575 | 0.2980 | 0.2400 | 0.3416 | 0.3933 | 0.9863 | 0.9723 | 0.9634 |
| EEMD-BA-RGRU-PSO | 0.1906 | 0.2575 | 0.2980 | 0.2400 | 0.3416 | 0.3933 | 0.9863 | 0.9723 | 0.9634 |
| EEMD-BA-RGRU-FPA | 0.1840 | 0.2575 | 0.2980 | 0.2304 | 0.3416 | 0.3933 | 0.9874 | 0.9723 | 0.9634 |
| EEMD-BA-RGRU-DE | 0.1835 | 0.2529 | 0.2895 | 0.2292 | 0.3404 | 0.3872 | 0.9875 | 0.9725 | 0.9646 |
| EEMD-BA-RGRU-CSO | 0.1810 | 0.2485 | 0.2834 | 0.2245 | 0.3316 | 0.3828 | 0.9880 | 0.9739 | 0.9654 |
Fig. 12. Multi-step prediction curves of different models.
of 0.4992 (MW), MAE of 0.3530 (MW) as well as a top R2 of 0.9867 in 1-step prediction. Correspondingly, the improving percentages of them are 62.19%, 58.36%, and 8.10%, compared with the baseline persistence model.

(2) It is observed that the decomposition-based hybrid forecasting models do not always outperform the baseline persistence model in all quarterly datasets. For instance, the persistence model outperforms CEEMDAN-FPA-BP in spring and has some advantage over VMD-LSTM-ELM in 1-step and 2-step prediction in winter. In spite of this, the proposed model has overwhelming superiority over the persistence model. Taking 3-step prediction as an example, the RMSEs of the proposed EEMD-BA-RGRU-CSO are down by 67.51%, 67.39%, 68.09% and 60.11% in January, April, July and October, compared with the baseline persistence model.

(3) The proposed EEMD-BA-RGRU-CSO has an obvious advantage over all its competitors in the four different quarters. It is seen from Table 6 that the MAE of the proposed model is reduced by 33.35% and 30.39% in 3-step prediction, compared with the second best performing EWT-BiDLSTM prediction model in January and the second best performing VMD-LSTM-ELM in April, respectively. In July and October, the EEMD-BA-RGRU-CSO outperforms the second best performing EMD-CNN-LSTM in 2-step and 3-step predictions. Taking 2-step prediction as an example, the MAE values of July and October are cut by 23.43% and 40.28%, respectively. Meanwhile, the EEMD-BA-RGRU-CSO outperforms the second best performing EWT-BiDLSTM in 1-step prediction in July and October, with the MAE values cut by 15.17% and 38.26%.
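For reference, the CSO stage of the proposed model rests on the horizontal crossover of Eq. (20) and the selection rule of Eq. (21). The following is a minimal sketch of one such pass under stated assumptions: individuals are paired by a random permutation, every pair is crossed (consistent with Ph = 1), fitness is an error to be minimized, and the toy fitness function and the name `horizontal_crossover` are hypothetical. It is not the authors' implementation and omits the vertical crossover.

```python
import random

def horizontal_crossover(Z, fitness, rng):
    """One horizontal-crossover pass over population Z (a list of lists)."""
    P, D = len(Z), len(Z[0])
    order = list(range(P))
    rng.shuffle(order)                    # assumption: random pairing of individuals
    S = [row[:] for row in Z]             # candidate solutions S_hc
    for a in range(0, P - 1, 2):
        i, j = order[a], order[a + 1]
        for d in range(D):
            r1, r2 = rng.random(), rng.random()              # in [0, 1]
            c1, c2 = rng.uniform(-1, 1), rng.uniform(-1, 1)  # in [-1, 1]
            # Eq. (20)
            S[i][d] = r1 * Z[i][d] + (1 - r1) * Z[j][d] + c1 * (Z[i][d] - Z[j][d])
            S[j][d] = r2 * Z[j][d] + (1 - r2) * Z[i][d] + c2 * (Z[j][d] - Z[i][d])
    # Eq. (21): a candidate replaces its parent only if its fitness is lower
    return [s if fitness(z) > fitness(s) else z for z, s in zip(Z, S)]

# Toy usage: minimize the sum of squares (a stand-in for the training error)
rng = random.Random(0)
sphere = lambda z: sum(v * v for v in z)
population = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(6)]
updated = horizontal_crossover(population, sphere, rng)
```

Because of the selection rule, no individual gets worse after a pass, so repeated passes can only keep or lower the best error found so far.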
Table 6
Results of multi-step prediction based on different quarterly datasets.

| Month | Model | MAE 1-step (MW) | MAE 2-step (MW) | MAE 3-step (MW) | RMSE 1-step (MW) | RMSE 2-step (MW) | RMSE 3-step (MW) | R2 1-step | R2 2-step | R2 3-step |
|---|---|---|---|---|---|---|---|---|---|---|
| January (Winter) | Persistence model | 0.8478 | 1.3072 | 1.6432 | 1.3203 | 1.9704 | 2.3054 | 0.9068 | 0.7916 | 0.7140 |
| January (Winter) | EMD-CNN-LSTM | 0.8731 | 1.1151 | 1.2417 | 1.5336 | 1.7152 | 1.7348 | 0.8732 | 0.8409 | 0.8368 |
| January (Winter) | CEEMDAN-FPA-BP | 0.7845 | 1.1665 | 1.4390 | 1.0854 | 1.5935 | 1.8597 | 0.9365 | 0.8627 | 0.8125 |
| January (Winter) | VMD-LSTM-ELM | 0.9352 | 1.0451 | 1.1161 | 1.3692 | 1.4635 | 1.5840 | 0.8961 | 0.8810 | 0.8602 |
| January (Winter) | EWT-BiDLSTM | 0.4218 | 0.6170 | 0.7810 | 0.5848 | 0.8300 | 1.0834 | 0.9817 | 0.9631 | 0.9369 |
| January (Winter) | The proposed method | 0.3530 | 0.4551 | 0.5205 | 0.4992 | 0.6337 | 0.7491 | 0.9867 | 0.9785 | 0.9698 |
| April (Spring) | Persistence model | 0.5309 | 0.7547 | 0.8993 | 0.7155 | 0.9664 | 1.1737 | 0.8795 | 0.7806 | 0.6772 |
| April (Spring) | EMD-CNN-LSTM | 0.3776 | 0.4089 | 0.4189 | 0.4855 | 0.5622 | 0.5706 | 0.9445 | 0.9257 | 0.9237 |
| April (Spring) | CEEMDAN-FPA-BP | 0.6534 | 0.7589 | 0.9421 | 0.8430 | 1.0194 | 1.1869 | 0.8327 | 0.7559 | 0.6699 |
| April (Spring) | VMD-LSTM-ELM | 0.2563 | 0.3367 | 0.4071 | 0.3460 | 0.4391 | 0.5053 | 0.9708 | 0.9531 | 0.9380 |
| April (Spring) | EWT-BiDLSTM | 0.3306 | 0.4122 | 0.5374 | 0.4146 | 0.5797 | 0.7647 | 0.9598 | 0.9216 | 0.8639 |
| April (Spring) | The proposed method | 0.1810 | 0.2485 | 0.2834 | 0.2245 | 0.3316 | 0.3828 | 0.9880 | 0.9739 | 0.9654 |
| July (Summer) | Persistence model | 0.4526 | 0.6371 | 0.8236 | 0.6798 | 0.9812 | 1.1977 | 0.8948 | 0.7802 | 0.6717 |
| July (Summer) | EMD-CNN-LSTM | 0.2188 | 0.2783 | 0.3774 | 0.2980 | 0.3801 | 0.5093 | 0.9796 | 0.9668 | 0.9402 |
| July (Summer) | CEEMDAN-FPA-BP | 0.4467 | 0.6555 | 0.8231 | 0.6050 | 0.8414 | 1.0236 | 0.9160 | 0.8371 | 0.7585 |
| July (Summer) | VMD-LSTM-ELM | 0.5633 | 0.6235 | 0.6443 | 0.7783 | 0.8576 | 0.8912 | 0.8582 | 0.8273 | 0.8133 |
| July (Summer) | EWT-BiDLSTM | 0.1918 | 0.3438 | 0.4183 | 0.2672 | 0.4871 | 0.6324 | 0.9837 | 0.9455 | 0.9080 |
| July (Summer) | The proposed method | 0.1627 | 0.2131 | 0.2651 | 0.2363 | 0.3085 | 0.3822 | 0.9873 | 0.9783 | 0.9667 |
| October (Autumn) | Persistence model | 0.2456 | 0.3909 | 0.5348 | 0.4122 | 0.6458 | 0.8411 | 0.8351 | 0.5961 | 0.3162 |
| October (Autumn) | EMD-CNN-LSTM | 0.2293 | 0.2545 | 0.2931 | 0.3716 | 0.3902 | 0.4800 | 0.8658 | 0.8543 | 0.7789 |
| October (Autumn) | CEEMDAN-FPA-BP | 0.3555 | 0.4381 | 0.6078 | 0.5071 | 0.5786 | 0.7991 | 0.7501 | 0.6795 | 0.3872 |
| October (Autumn) | VMD-LSTM-ELM | 0.4614 | 0.4982 | 0.5138 | 0.6209 | 0.6572 | 0.6667 | 0.6172 | 0.5774 | 0.5641 |
| October (Autumn) | EWT-BiDLSTM | 0.2086 | 0.3679 | 0.4180 | 0.2772 | 0.5368 | 0.6801 | 0.9256 | 0.7252 | 0.5575 |
| October (Autumn) | The proposed method | 0.1288 | 0.1520 | 0.1778 | 0.2186 | 0.2839 | 0.3355 | 0.9538 | 0.9232 | 0.8923 |
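The persistence model listed in Table 6 is not defined in this section. Assuming the usual definition, in which the h-step-ahead forecast simply repeats the last observed value, a minimal sketch looks as follows. The function name `persistence_forecast` and the sample series are illustrative assumptions.

```python
def persistence_forecast(series, horizon):
    """Pair each forecast with its target: the value at t predicts t + horizon."""
    preds = series[:-horizon]     # last observed values
    targets = series[horizon:]    # values actually observed `horizon` steps later
    return preds, targets

def mean_absolute_error(preds, targets):
    return sum(abs(t - p) for p, t in zip(preds, targets)) / len(preds)

# Hypothetical hourly wind power values in MW, for illustration only
series = [1.0, 2.0, 4.0, 7.0, 11.0]
mae_by_step = {}
for h in (1, 2, 3):
    preds, targets = persistence_forecast(series, h)
    mae_by_step[h] = mean_absolute_error(preds, targets)
print(mae_by_step)  # {1: 2.5, 2: 5.0, 3: 7.5}
```

In this toy series the error grows with the horizon, the same pattern the persistence rows of Table 6 show from 1-step to 3-step prediction.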
The above comparisons further confirm the advantage of the proposed model over the persistence model and other decomposition-based hybrid forecasting models in addressing the multi-step wind power forecasting problem.

4. Conclusion

In order to achieve highly accurate prediction of wind power, a novel EEMD-BA-RGRU-CSO approach is proposed for multi-step ahead prediction in this paper. Firstly, from the perspective of input, a novel EEMD-BA data processing method is proposed to lessen the prediction difficulty as well as enhance the sensitivity of the prediction model to sample inputs. The input features are obtained by splicing the wind power and wind speed sub-sequences generated by EEMD with the converted wind direction data (i.e., the sine and cosine of wind direction). Subsequently, BA is utilized to extract the acquired features for the first time to obtain new weighted inputs of the model with different importance. The experimental results of case 1 show that EEMD-based prediction models have obvious improvement in prediction accuracy. Meanwhile, according to the prediction results in Tables 2 and 4, the application of BA can further improve the prediction accuracy of EEMD-based models. Secondly, an RGRU model is proposed to predict each wind power component decomposed by EEMD. The experimental results of case 2 confirm the advantage of combining the residual network and GRU in extracting the implicit relationship of data. Thirdly, the CSO algorithm is utilized to retrain the fully-connected layer of EEMD-BA-RGRU. The results of case 3 show that the prediction model optimized by CSO has better prediction performance. Eventually, the proposed EEMD-BA-RGRU-CSO is validated on four datasets of different quarters from a wind farm in Spain. The results of case 4 further confirm its superiority over other state-of-the-art decomposition-based hybrid forecasting models.

The above experimental results demonstrate that the proposed EEMD-BA-RGRU-CSO is a promising alternative for multi-step wind power prediction. In view of its excellent performance, we are planning to apply it to the wind power prediction of offshore wind farms in the near future.

Author statement

Anbo Meng: Conceptualization, Supervision, Funding acquisition. Shun Chen: Software, Data curation, Writing – original draft. Zuhong Ou: Investigation. Weifeng Ding: Visualization. Huaming Zhou: Resources. Jingmin Fan: Formal analysis. Hao Yin: Project administration, Methodology, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have
appeared to influence the work reported in this paper.

Acknowledgements

This research work was supported by the National Natural Science Foundation of China under Grant 61876040.

References

[1] Wang JZ, Wang Y, Jiang P. The study and application of a novel hybrid forecasting model - a case study of wind speed forecasting in China. Appl Energy 2015;143:472–88. https://doi.org/10.1016/j.apenergy.2015.01.038.
[2] Soman SS, Zareipour H, Malik O, Mandal P. A review of wind power and wind speed forecasting methods with different time horizons. North Am Power Symp 2010, NAPS 2010 2010. https://doi.org/10.1109/NAPS.2010.5619586.
[3] Landberg L. A mathematical look at a physical power prediction model. Wind Energy 1998;1:23–8. https://doi.org/10.1002/(sici)1099-1824(199809)1:1<23:aid-we9>3.3.co;2-0.
[4] Lydia M, Suresh Kumar S, Immanuel Selvakumar A, Edwin Prem Kumar G. Linear and non-linear autoregressive models for short-term wind speed forecasting. Energy Convers Manag 2016;112:115–24. https://doi.org/10.1016/j.enconman.2016.01.007.
[5] Han Q, Meng F, Hu T, Chu F. Non-parametric hybrid models for wind speed forecasting. Energy Convers Manag 2017;148:554–68. https://doi.org/10.1016/j.enconman.2017.06.021.
[6] Wang L, Li X, Bai Y. Short-term wind speed prediction using an extreme learning machine model with error correction. Energy Convers Manag 2018;162:239–50. https://doi.org/10.1016/j.enconman.2018.02.015.
[7] Sun W, Wang Y. Short-term wind speed forecasting based on fast ensemble empirical mode decomposition, phase space reconstruction, sample entropy and improved back-propagation neural network. Energy Convers Manag 2018;157:1–12. https://doi.org/10.1016/j.enconman.2017.11.067.
[8] Yu C, Li Y, Bao Y, Tang H, Zhai G. A novel framework for wind speed prediction based on recurrent neural networks and support vector machine. Energy Convers Manag 2018;178:137–45. https://doi.org/10.1016/j.enconman.2018.10.008.
[9] Liu H, Tian HQ, Li YF. Four wind speed multi-step forecasting models using extreme learning machines and signal decomposing algorithms. Energy Convers Manag 2015;100:16–22. https://doi.org/10.1016/j.enconman.2015.04.057.
[10] Wu Q, Lin H. Short-term wind speed forecasting based on hybrid variational mode decomposition and least squares support vector machine optimized by bat algorithm model. Sustain Times 2019;11. https://doi.org/10.3390/su11030652.
[11] Naik J, Satapathy P, Dash PK. Short-term wind speed and wind power prediction using hybrid empirical mode decomposition and kernel ridge regression. Appl Soft Comput J 2018;70:1167–88. https://doi.org/10.1016/j.asoc.2017.12.010.
[12] Wang S, Zhang N, Wu L, Wang Y. Wind speed forecasting based on the hybrid ensemble empirical mode decomposition and GA-BP neural network method. Renew Energy 2016;94:629–36. https://doi.org/10.1016/j.renene.2016.03.103.
[13] Jiajun H, Chuanjin Y, Yongle L, Huoyue X. Ultra-short term wind prediction with wavelet transform, deep belief network and ensemble learning. Energy Convers Manag 2020;205:112418. https://doi.org/10.1016/j.enconman.2019.112418.
[14] Liu H, Mi X, Li Y. An experimental investigation of three new hybrid wind speed forecasting models using multi-decomposing strategy and ELM algorithm. Renew Energy 2018;123:694–705. https://doi.org/10.1016/j.renene.2018.02.092.
[15] Kong X, Liu X, Shi R, Lee KY. Wind speed prediction using reduced support vector machines with feature selection. Neurocomputing 2015;169:449–56. https://doi.org/10.1016/j.neucom.2014.09.090.
[16] Li S, Wang P, Goel L. Wind power forecasting using neural network ensembles with feature selection. IEEE Trans Sustain Energy 2015;6:1447–56. https://doi.org/10.1109/TSTE.2015.2441747.
[17] Amjady N, Keynia F, Zareipour H. Wind power prediction by a new forecast engine composed of modified hybrid neural network and enhanced particle swarm optimization. IEEE Trans Sustain Energy 2011;2:265–76. https://doi.org/10.1109/TSTE.2011.2114680.
[18] Xiang L, Wang P, Yang X, Hu A, Su H. Fault detection of wind turbine based on SCADA data analysis using CNN and LSTM with attention mechanism. Measurement 2021;175:109094. https://doi.org/10.1016/j.measurement.2021.109094.
[19] Guo X, Yuan Y. Semi-supervised WCE image classification with adaptive aggregated attention. Med Image Anal 2020;64:101733. https://doi.org/10.1016/j.media.2020.101733.
[20] Usama M, Ahmad B, Song E, Hossain MS, Alrashoud M, Muhammad G. Attention-based sentiment analysis using convolutional and recurrent neural network. Future Generat Comput Syst 2020;113:571–8. https://doi.org/10.1016/j.future.2020.07.022.
[21] Chan W, Jaitly N, Le Q, Vinyals O. Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. ICASSP, IEEE Int Conf Acoust Speech Signal Process - Proc 2016;2016-May:4960–4. https://doi.org/10.1109/ICASSP.2016.7472621.
[22] Peng X, Wang H, Lang J, Li W, Xu Q, Zhang Z, et al. EALSTM-QR: interval wind-power prediction model based on numerical weather prediction and deep learning. Energy 2021;220:119692. https://doi.org/10.1016/j.energy.2020.119692.
[23] Niu Z, Yu Z, Tang W, Wu Q, Reformat M. Wind power forecasting using attention-based gated recurrent unit network. Energy 2020;196:117081. https://doi.org/10.1016/j.energy.2020.117081.
[24] Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, et al. Going deeper with convolutions. IEEE Comput Soc Conf Comput Vis Pattern Recogn 2015;07-12-June:1–9. https://doi.org/10.1109/CVPR.2015.7298594.
[25] Lecun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521:436–44. https://doi.org/10.1038/nature14539.
[26] Zhang X, Wang L, Su Y. Visual place recognition: a survey from deep learning perspective. Pattern Recogn 2020:107760. https://doi.org/10.1016/j.patcog.2020.107760.
[27] Wang H, Lei Z, Zhang X, Zhou B, Peng J. A review of deep learning for renewable energy forecasting. Energy Convers Manag 2019;198:111799. https://doi.org/10.1016/j.enconman.2019.111799.
[28] Wang K, Qi X, Liu H, Song J. Deep belief network based k-means cluster approach for short-term wind power forecasting. Energy 2018;165:840–52. https://doi.org/10.1016/j.energy.2018.09.118.
[29] Lin Z, Liu X. Wind power forecasting of an offshore wind turbine based on high-frequency SCADA data and deep learning neural network. Energy 2020;201:117693. https://doi.org/10.1016/j.energy.2020.117693.
[30] Huang X, Lei Q, Xie T, Zhang Y, Hu Z, Zhou Q. Deep transfer convolutional neural network and extreme learning machine for lung nodule diagnosis on CT images. Knowl Base Syst 2020;204:106230. https://doi.org/10.1016/j.knosys.2020.106230.
[31] Wang H zhi, qiang Li G, Wang G bing, Peng J chun, Jiang H, Liu Y tao. Deep learning based ensemble approach for probabilistic wind power forecasting. Appl Energy 2017;188:56–70. https://doi.org/10.1016/j.apenergy.2016.11.111.
[32] Harbola S, Coors V. One dimensional convolutional neural network architectures for wind prediction. Energy Convers Manag 2019;195:70–5. https://doi.org/10.1016/j.enconman.2019.05.007.
[33] Zhang Z, Qin H, Liu Y, Wang Y, Yao L, Li Q, et al. Long Short-Term Memory Network based on Neighborhood Gates for processing complex causality in wind speed prediction. Energy Convers Manag 2019;192:37–51. https://doi.org/10.1016/j.enconman.2019.04.006.
[34] Chen J, Zeng GQ, Zhou W, Du W, Lu K Di. Wind speed forecasting using nonlinear-learning ensemble of deep learning time series prediction and extremal optimization. Energy Convers Manag 2018;165:681–95. https://doi.org/10.1016/j.enconman.2018.03.098.
[35] Peng Z, Peng S, Fu L, Lu B, Tang J, Wang K, et al. A novel deep learning ensemble model with data denoising for short-term wind speed forecasting. Energy Convers Manag 2020;207:112524. https://doi.org/10.1016/j.enconman.2020.112524.
[36] Chen Y, Zhang S, Zhang W, Peng J, Cai Y. Multifactor spatio-temporal correlation model based on a combination of convolutional neural network and long short-term memory neural network for wind speed forecasting. Energy Convers Manag 2019;185:783–99. https://doi.org/10.1016/j.enconman.2019.02.018.
[37] Yin H, Ou Z, Huang S, Meng A. A cascaded deep learning wind power prediction approach based on a two-layer of mode decomposition. Energy 2019;189:116316. https://doi.org/10.1016/j.energy.2019.116316.
[38] Kisvari A, Lin Z, Liu X. Wind power forecasting – a data-driven method along with gated recurrent neural network. Renew Energy 2021;163:1895–909. https://doi.org/10.1016/j.renene.2020.10.119.
[39] Liu H, Mi X, Li Y, Duan Z, Xu Y. Smart wind speed deep learning based multi-step forecasting model using singular spectrum analysis, convolutional Gated Recurrent Unit network and Support Vector Regression. Renew Energy 2019;143:842–54. https://doi.org/10.1016/j.renene.2019.05.039.
[40] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. IEEE Comput Soc Conf Comput Vis Pattern Recogn 2016;2016-Decem:770–8. https://doi.org/10.1109/CVPR.2016.90.
[41] Zanchetta A, Zecchetto S. Wind direction retrieval from Sentinel-1 SAR images using ResNet. Remote Sens Environ 2021;253:112178. https://doi.org/10.1016/j.rse.2020.112178.
[42] Yildiz C, Acikgoz H, Korkmaz D, Budak U. An improved residual-based convolutional neural network for very short-term wind power forecasting. Energy Convers Manag 2021;228:113731. https://doi.org/10.1016/j.enconman.2020.113731.
[43] Wang R, Li C, Fu W, Tang G. Deep learning method based on gated recurrent unit and variational mode decomposition for short-term wind power interval prediction. IEEE Trans Neural Networks Learn Syst 2020;31:3814–27. https://doi.org/10.1109/TNNLS.2019.2946414.
[44] Chen Y, Dong Z, Wang Y, Su J, Han Z, Zhou D, et al. Short-term wind speed predicting framework based on EEMD-GA-LSTM method under large scaled wind history. Energy Convers Manag 2021;227:113559. https://doi.org/10.1016/j.enconman.2020.113559.
[45] Zhang K, Zuo W, Chen Y, Meng D, Zhang L. Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising. IEEE Trans Image Process 2017;26:3142–55. https://doi.org/10.1109/TIP.2017.2662206.
[46] Li S, Li W, Cook C, Zhu C, Gao Y. Independently recurrent neural network (IndRNN): building A longer and deeper RNN. IEEE Comput Soc Conf Comput Vis Pattern Recogn 2018:5457–66. https://doi.org/10.1109/CVPR.2018.00572.
[47] Kingma DP, Ba JL. Adam: a method for stochastic optimization. 3rd Int Conf Learn Represent ICLR 2015 - Conf Track Proc 2015:1–15.
[48] Keskar NS, Socher R. Improving generalization performance by switching from ADAM to SGD. ArXiv 2017.
[49] Yin H, Dong Z, Chen Y, Ge J, Lai LL, Vaccaro A, et al. An effective secondary decomposition approach for wind power forecasting using extreme learning machine trained by crisscross optimization. Energy Convers Manag 2017;150:108–21. https://doi.org/10.1016/j.enconman.2017.08.014.
[50] Meng AB, Chen YC, Yin H, Chen SZ. Crisscross optimization algorithm and its application. Knowl Base Syst 2014;67:218–29. https://doi.org/10.1016/j.knosys.2014.05.004.
[51] Meng A, Hu H, Yin H, Peng X, Guo Z. Crisscross optimization algorithm for large-scale dynamic economic dispatch problem with valve-point effects. Energy 2015;93:2175–90. https://doi.org/10.1016/j.energy.2015.10.112.
[52] Qu Z, Mao W, Zhang K, Zhang W, Li Z. Multi-step wind speed forecasting based on a hybrid decomposition technique and an improved back-propagation neural network. Renew Energy 2019;133:919–29. https://doi.org/10.1016/j.renene.2018.10.043.
[53] Liu H, Mi X, Li Y. Smart multi-step deep learning model for wind speed forecasting based on variational mode decomposition, singular spectrum analysis, LSTM network and ELM. Energy Convers Manag 2018;159:54–64. https://doi.org/10.1016/j.enconman.2018.01.010.
[54] Jaseena KU, Kovoor BC. Decomposition-based hybrid wind speed forecasting model using deep bidirectional LSTM networks. Energy Convers Manag 2021;234:113944. https://doi.org/10.1016/j.enconman.2021.113944.
[55] Liu H, Mi X, Li Y. Comparison of two new intelligent wind speed forecasting approaches based on wavelet packet decomposition, complete ensemble empirical mode decomposition with adaptive noise and artificial neural networks. Energy Convers Manag 2018;155:188–200. https://doi.org/10.1016/j.enconman.2017.10.085.
16

