
Renewable Energy 206 (2023) 309–323
Renewable Energy
journal homepage: www.elsevier.com/locate/renene
Unsupervised anomaly detection using graph neural networks integrated with physical-statistical feature fusion and local-global learning
Chenlong Feng a, Chao Liu a, b, *, Dongxiang Jiang a, c
a Department of Energy and Power Engineering, Tsinghua University, Beijing, 100084, China
b Key Laboratory for Thermal Science and Power Engineering of Ministry of Education, Tsinghua University, Beijing, 100084, China
c State Key Laboratory of Control and Simulation of Power System and Generation Equipment, Tsinghua University, Beijing, 100084, China
ARTICLE INFO

Keywords: Wind turbine; Anomaly detection; Feature fusion; Supervisory control and data acquisition; Graph neural networks

ABSTRACT

An efficient and feasible anomaly detection scheme that can utilize data collected by the supervisory control and data acquisition (SCADA) system is essential for wind turbines, as it could greatly increase the amount of available condition monitoring data, improve power generation efficiency, and reduce maintenance costs. It is also very challenging, because condition estimation with such massive field data is difficult to tackle, especially as the SCADA data is not labeled in most cases. This work presents an unsupervised anomaly detection framework for wind turbines incorporating physical-statistical feature fusion and graph neural networks (GNNs), realizing dimensionality reduction, temporal dependence extraction, and latent nonlinear correlation capture of high-dimensional data. Firstly, graphical modeling for SCADA data of wind turbines is presented. Secondly, the physical-statistical feature fusion is implemented via local-global mutual information maximization. Finally, anomaly detection is realized with an energy-based method to learn patterns in the updated nodes’ feature matrix. The results show that i) the features designed by physical information can reasonably represent the equipment’s state and reduce information redundancy, ii) the time-series based graph structure can effectively express the dataset structure information and extract temporal dependence, and iii) the anomaly detection model can fully use the physical-statistical information and local-global information, and outperforms the comparison methods.
1. Introduction

Wind energy plays an essential role in the development of renewable energy mostly because of its large amount of capacity and installation. The global installed capacity of wind energy reached 743 GW by the end of 2020, with a huge increase of 93 GW compared to 2019 [1]. In China, the development of renewable energy, including wind energy, is more aggressive in the era of “Carbon Peak and Carbon Neutrality” [2]. In this context, the condition monitoring and anomaly detection of the newly installed wind turbines as well as a large number of existing ones are becoming more and more important in terms of optimizing the power output of the wind energy and the maintenance of the systems.

Wind turbines are complex electro-hydraulic systems that operate continuously under irregular load patterns, intermittent durations, and inclement weather conditions [3]. Such changing and harsh operating environments tend to cause irreversible damage to wind turbines, resulting in kinds of failures [4]. The operation and maintenance (O&M) issues for wind farms have drawn extensive attention under the circumstances [5]. Statistics show that the O&M costs of onshore wind turbines account for 10%–15% of the total power generation cost, while the proportion of offshore wind turbines is about 20%–25% [6,7]. High O&M costs are extremely heavy burdens for wind farm operators. Therefore, to detect anomalies in advance, and reduce abnormal downtime and O&M costs, effective anomaly detection approaches for wind turbines are critical for both academia and industry.

Currently, supervisory control and data acquisition (SCADA) systems are installed in most wind turbines, which are principally used for performance monitoring via collecting various types of online wind turbine operating data at a certain interval [8]. These data are usually with very high dimensions (in terms of the number of variables monitored) and in long-duration (in terms of the time of the records) [9], where there is temporal dependence at each parameter [5] and strong nonlinear correlations between parameters because of the interaction and dependence between different components in a WT system [10]. The SCADA
* Corresponding author. Department of Energy and Power Engineering, Tsinghua University, Beijing, 100084, China.
E-mail address: cliu5@tsinghua.edu.cn (C. Liu).
https://doi.org/10.1016/j.renene.2023.02.053
Received 10 June 2022; Received in revised form 9 February 2023; Accepted 12 February 2023
Available online 15 February 2023
0960-1481/© 2023 Elsevier Ltd. All rights reserved.
data is readily available for condition monitoring with no need for additional data acquisition [11]. Thus, by developing data-driven anomaly detection algorithms based on SCADA data, the operating condition of wind turbines can be monitored [9].

Intensive research has been conducted on anomaly detection of wind turbines using SCADA data, among which deep learning frameworks have attracted more and more attention recently. Zhao et al. [12] proposed a deep learning method for wind turbines based on a deep auto-encoder (DAE) network and designed an adaptive threshold determined by the extreme value theory as the rule of anomaly judgment to implement early warning of fault components and deduce potential locations of a faulted component. Similarly, Chen et al. [5] designed a stacked denoising auto-encoders (SDAE) model with sliding windows and multiple noise levels for the construction of wind turbines’ normal behavior model, and then the quantification of abnormal levels is obtained by evaluating the overlap between test distribution and baseline condition. Chu et al. [9] proposed an anomaly detection framework for wind turbines based on LSTM-AE using SCADA time series data, considering the advantages of the LSTM in modeling temporal dependency hidden in multidimensional sequential data and the ability of the AE to effectively extract data features. Xiang et al. [13] cascaded a convolutional neural network (CNN) and bidirectional gated recurrent units with attention mechanism (BiGRU-AM) to extract multidirectional spatiotemporal features of SCADA data. The aforementioned detection algorithms share the same assumption for the inputs: the input nodes (parameters) are assumed to be independent of each other, and the latent structural information between data is ignored [14,15]. In other words, the correlation between input data is not considered. Therefore, the characteristic representation capacity of these methods is limited and information (that probably improves the anomaly detection performance) is not extracted and considered.

To solve the problem, a novel data structure considering data and the relationships between data, i.e., the graph, is introduced in the field of deep learning, thus generating the graph neural networks (GNNs) with graph data as input. Recent applications of GNNs in anomaly detection show that the graph structure is beneficial because of its feature extraction between the inputs (which is missed in the aforementioned deep learning approaches). Wang et al. [16] combined the powerful representation ability of GNNs with the classical one-class objective to construct the one-class graph neural network (OCGNN), a graph anomaly detection framework, which aims at mapping the training nodes into a hypersphere in the embedding space. For the rumor detection task, Bai et al. [17] proposed an ensemble graph convolutional neural net (EGCN) by designing a source-replies relation graph for each conversation and node feature of word vectors. Xie et al. [18] constructed a novel anomaly detection method for satellites based on a graph neural network and dynamic threshold (GNN-DTAN) using telemetry time series data. In the field of anomaly detection for wind turbines, Yu et al. [19] proposed a new fast deep graph convolutional network to diagnose gearbox faults. These graph-based deep learning networks can process the actual non-Euclidean data and embed the graph information of data into networks to provide more prior knowledge for anomaly detection tasks, thus improving the effect of anomaly detection [14].

In addition, a large amount of data collected by real SCADA systems is not labeled due to the high costs of labeling [20,21], which poses a huge challenge to developing anomaly detection methods based on SCADA data. Therefore, considering the advantages of GNNs and the limitations of the SCADA data, unsupervised GNNs have attracted more and more attention [22–27]. Note that the selection of objective functions to train networks is a key problem for unsupervised algorithms. The purpose of unsupervised networks is to train an encoder to learn useful representations [28], i.e., one that can be used to distinguish anomalous samples correctly or contain the most unique information of samples, and entirely by which these samples can be reconstructed. It means that unsupervised networks should naturally consider preserving as much information as possible within the imposed constraints under the circumstance of not knowing what information to discard when training, the idea of which was first proposed by Linsker in 1988, named the InfoMax Principle [29]. The implementation of this principle is supported by mutual information, which is a useful information metric used to capture nonlinear statistical dependencies between variables in information theory [30,31]. Therefore, unsupervised deep learning methods based on mutual information maximization have attracted a lot of attention in recent years [28,32–36]. The Deep InfoMax (DIM) [28] and Deep Graph InfoMax (DGI) [32], as two typical unsupervised representation learning methods based on mutual information maximization, respectively train an encoder to maximize the mutual information between the high-level global representation of an image or graph and local representations.

Motivated by the achievements of these unsupervised learning methods, we design graph nodes features and a graph adjacency matrix applicable to SCADA data for its characteristics of high dimensionality, temporal dependence, and nonlinear correlation, and apply the ideas of unsupervised graph learning of DGI in this work, thus proposing an anomaly detection framework for wind turbines based on physical-statistical feature fusion and unsupervised GNNs (i.e., DGI). Firstly, a method based on operating physical characteristics of wind turbines is proposed to extract the features representing wind turbines’ state from the perspective of the characteristic curve based, the normal behavior model based and the relative temperature difference based, thus obtaining the nodes feature matrix; Secondly, a time-series-based construction method for the graph adjacency matrix with a sliding window is designed to extract the temporal dependence of SCADA data; Thirdly, the nodes feature matrix and graph adjacency matrix are fed into the DGI network, on the one hand, to capture the nonlinear characteristics, and on the other hand, to reduce the dimensionality of SCADA data through the feature extraction idea based on physical-statistical feature fusion; Finally, the updated nodes feature matrix is sent into the anomaly detector, i.e., the Restricted Boltzmann Machine (RBM), to detect the abnormal state of wind turbines according to the designed anomaly detection rules.

The main contributions of this work are summarized as follows.

(i) To further represent and construct relationships between different parameters in the SCADA dataset, a graphical modeling method of time-series data for SCADA systems is proposed to abstract the SCADA data as the combination of a graph nodes feature matrix and a graph adjacency matrix.
(ii) To realize the dimensionality reduction, temporal dependence extraction and nonlinear correlation capture for the SCADA dataset, a physical-statistical feature fusion method based on local-global mutual information maximization is designed.
(iii) The efficacy of the DGI’s encoder using the GAT is verified and the feasibility of the anomaly detection approach using graph neural networks and an energy-based model is validated through fault cases in wind turbines.

The remainder of this paper is structured as follows. In Section 2, preliminaries of GNNs and RBM are introduced. In Section 3, the GNNs framework based on DGI-RBM is illustrated. In Section 4, field SCADA data is collected and fed into the DGI-RBM framework. The anomaly detection results are obtained and compared with the ground truth. Finally, the conclusions are summarized in Section 5.

2. Preliminaries

2.1. Graph and graph neural networks

2.1.1. Graph [37]

As typical non-Euclidean data, a graph is a language that can describe complex things, and it has the characteristics of varying local input dimensions and disordered node arrangement. Mathematically, a
graph is represented as $G = (V, E)$, where $V$ is the set of nodes, and $E$ is the set of edges. Let $v_i \in V$ denote a node and $e_{ij} = (v_i, v_j) \in E$ denote an edge pointing from $v_i$ to $v_j$. The adjacency matrix $A$ is an $N \times N$ matrix with $A_{ij} = 1$ if $e_{ij} \in E$ and $A_{ij} = 0$ if $e_{ij} \notin E$. Additionally, a node $v$ in graph $G$ can be represented by a feature vector $h_v \in \mathbb{R}^{F}$. Then the feature matrix for graph $G$ is $H \in \mathbb{R}^{N \times F}$.

2.1.2. Graph convolutional network (GCN) [37,38]

From the perspective of how to define graph convolution, graph convolution networks fall into two categories: spectral-based and spatial-based. This work only focuses on the basic principle of the spatial-based graph convolution network, which defines graph convolution by information propagation and aggregation between nodes, forming a new central node feature in this way. Taking a single-layer forward propagation graph convolution network as an example, its propagation rules are as follows [38]:

$$H^{(l+1)} = f\left(H^{(l)}, A\right) = \sigma\left(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H^{(l)} W^{(l)}\right) \tag{1}$$

where $\hat{A} = A + I$ is the adjacency matrix considering self-loop edges, $I$ is the unit matrix; the diagonal matrix $\hat{D}$ with $\hat{D}_{ii} = \sum_j \hat{A}_{ij}$ represents the degree of each node in the graph (that is, the number of edges connected to the node); $H^{(l)}$ denotes the nodes feature matrix of layer $l$; $W^{(l)}$ denotes the weight matrix of layer $l$; $\sigma(\cdot)$ is a nonlinear activation function; $H^{(l+1)} = f(H^{(l)}, A)$ is the output of layer $l$, that is, the updated nodes feature matrix.

Based on the above propagation rules, GCN can greatly improve the breadth of information received by each node as the number of propagation layers increases, obtaining more information about neighbor nodes. Note that to measure the information importance of different neighbor nodes, GCN uses $\hat{D}^{-1/2}$ to weight each node, following this rule: the larger the degree $d_i \in \hat{D}$, the smaller $1/\sqrt{d_i}$, and thus the less important the information of the neighbor node connected by this edge. GCN aggregates neighbor nodes’ information with nonparametric weights ($\alpha_{i,j} = 1/(\deg(v_i)\deg(v_j))^{1/2}$) obtained by the degree matrix, as shown in Fig. 1(a). This deficiency will be remedied in GAT.

2.1.3. Graph attention network (GAT) [38,39]

Considering the limitation of measuring the information importance of a node only by the structural information of the graph (degree matrix), an attention mechanism is introduced into the process of propagating neighbor nodes’ information, named graph attention network. In this network, the adjacency matrix is only used to define the nodes in a graph, while the weight to measure the information importance of nodes depends on the nodes’ features. The introduction of the attention mechanism transforms the weights into parameters that can be learned, prompting the network to pay more attention to those nodes with greater impact and neglect some nodes with lesser impact according to the actual expression of nodes feature in the process of information aggregation. The structure of the graph attention layer is shown in Fig. 1(b). The updated nodes feature and the attention weight can be calculated by Ref. [39]:

$$\vec{h}'_i = \sigma\left(\sum_{j \in N(i)} \alpha_{i,j} W \vec{h}_j\right) \tag{2}$$

$$\alpha_{i,j} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\vec{a}^{\top}\left[W\vec{h}_i \,\Vert\, W\vec{h}_j\right]\right)\right)}{\sum_{k \in N(i)} \exp\left(\mathrm{LeakyReLU}\left(\vec{a}^{\top}\left[W\vec{h}_i \,\Vert\, W\vec{h}_k\right]\right)\right)} \tag{3}$$

where $W$ is the parameter used to transform the dimension of the nodes feature; $\vec{a}$ is the weight vector parameter; $\Vert$ denotes the concatenation of vectors; $\alpha_{i,j}$ is the weight factor between node $i$ and node $j$ calculated with $\vec{a}$; $\mathrm{LeakyReLU}(\cdot)$ is the leaky rectified linear unit; $\sigma(\cdot)$ is the nonlinear activation function. GAT aggregates neighbor nodes’ information with parametrized weights (as shown in Eq. (3)) obtained by nodes’ features. Based on Eqs. (2) and (3), GAT can use the learned weight factor to update the nodes feature matrix.

Fig. 1. Difference of edge weight between GCN and GAT.

2.2. Restricted Boltzmann Machine (RBM) [40]

The Restricted Boltzmann Machine is a probabilistic graphical model based on energy estimation of the inputs, which can be used to extract features of input data, as shown in Fig. 2. Given state vectors $h$ and $v$, the energy function of RBM can be expressed as:

$$E(v, h) = -h^{\top} W v - a^{\top} v - b^{\top} h \tag{4}$$

where $a$ is the bias vector of the visible layer, $b$ is the bias vector of the hidden layer, and $W$ is the weight matrix connecting visible units with hidden units.

Fig. 2. RBM model.

3. Methodology

3.1. Unsupervised anomaly detection framework

An unsupervised anomaly detection framework based on Deep Graph
InfoMax - Restricted Boltzmann Machine (DGI-RBM) is presented in this work (shown in Fig. 3), in which the DGI is used to extract and aggregate the nodes’ features (with the purpose that nodes that are “close” in the input graph are also “close” in the high-dimensional feature space after merging graph structure information), and the RBM is used to evaluate the nodes’ state and detect anomalies. The steps of forming the anomaly detection framework based on DGI-RBM are as follows.

Fig. 3. The proposed unsupervised anomaly detection framework based on DGI-RBM.

Phase 1. Data preprocessing. Using the raw SCADA data is not conducive to learning the normal behavior models, as the widely existing outliers could be distracting for model training, which is usually the case with field data due to many practical issues in the data collection process. This work applies the mean root square method and the kernel density estimation method for data preprocessing, to remove the outliers. Also, the raw dataset can be roughly separated into the normal dataset and the fault dataset. Note that the normal dataset and fault dataset defined here are not strict: some of the nominal data might be mislabeled as anomalous (and vice versa), so it cannot be used in anomaly detection. In this context, the normal data sent to the learning scheme in the following phases is a mix of nominal and anomalous samples (where we assume that most of them are nominal). Therefore, the framework presented in this work is unsupervised in the sense that it does not need intensive human labeling of the SCADA data.

Phase 2. Graphical modeling I - nodes definition (Feature extraction). The nodes of the graphs are defined as the state of wind turbines at time t. To form the feature vector for the node in the graphs, three types of physics-based features reflecting the operating condition of wind turbines are presented in this work, including the characteristics-curve-based features, the normal-behavior-model-based features, and the relative-temperature-difference-based features. Details are described in Section 3.2.1.

Phase 3. Graphical modeling II - edges definition. The graph adjacency matrix is constructed following the definition in Section 3.2.2, in which the influence of past nodes and future nodes on the current node is considered, that is, a directed graph with a sliding window. The graphs (nodes and edges) are defined via Phases 2 and 3.

Phase 4. Normal behavior learning and feature fusion. The graphs constructed by the normal dataset are fed into the DGI network, and the model is learned to represent the nominal behavior of the wind turbines. Details are described in Section 3.3.

Fig. 4. Geometric relation of the reduced power in the wind speed power curve diagram.

Phase 5. Health status evaluation and anomaly detection. The outputs in Phase 4 represent the normal behavior of the system (the wind turbine in this work), and the health status of the system in normal and abnormal (fault) states can be captured with an energy-based model, the RBM, where
the normal states are widely seen and treated as low-energy (high-probability) events in the RBM, and the abnormal (low-probability) states will be detected as high-energy events. The anomaly score is defined based on the energy states of the RBM in the last step, and the alarm thresholds can be pre-defined and used to detect the anomaly state and identify the abnormal node in the time series.

3.2. Graphical modeling of time-series data for SCADA systems (Phase 2 and Phase 3)

In this work, the graphical modeling for SCADA data is primary and important for subsequent unsupervised graph learning and anomaly detection.

The normalized raw time-series data samples are firstly abstracted as a matrix $S \in \mathbb{R}^{K \times N}$, where $K$ is the number of variables in the time-series data or the number of measurements in the samples, and $N$ is the length of the time-series data or the number of sampling points in the samples. Then the graphs are defined as $G = (V, E)$, where the nodes $V$ are defined as the states of the wind turbine at time $t$, represented by several features (described in Section 3.2.1), and the edges $E$ are defined as the temporal relationships between the different moments in the data. Formally, the node feature vector is $\vec{x}_t = \mathcal{F}(S_{:,t}) = [f_1, f_2, f_3, \ldots, f_F] \in \mathbb{R}^{F}$, where $\vec{x}_t$ denotes the node feature vector at time $t$, or the feature vector of node $t$; $\mathcal{F}(\cdot)$ denotes the mapping function for computing features that can represent the status of the wind turbine; $S_{:,t}$ denotes the data in the $t$-th column of the matrix $S$, i.e., the value of all monitoring variables at time $t$; $f_i$ denotes the $i$-th characteristic parameter designed by the mapping function $\mathcal{F}(\cdot)$; and $F$ is the number of all features in a node feature vector. The graph nodes feature matrix $X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N\} \in \mathbb{R}^{N \times F}$ is obtained by merging all nodes’ feature vectors. Next, the adjacency matrix $A \in \mathbb{R}^{N \times N}$ is designed based on the graph construction idea described in Section 3.2.2.

3.2.1. Feature extraction for node representations in graphical modeling

In this study, the mapping function $\mathcal{F}$ used to design features is utilized for feature extraction. The features representing the operating state of wind turbines can be divided into three categories: the characteristic curve based, the normal behavior model based, and the relative temperature difference based. The corresponding mapping functions are $\mathcal{F}_1$, $\mathcal{F}_2$, and $\mathcal{F}_3$.

(a) The Mapping Function $\mathcal{F}_1$ Based on Characteristic Curve

In this paper, the mapping function $\mathcal{F}_1$ is designed to calculate a new parameter representing the power generation capacity of a wind turbine, named reduced power, considering the distribution characteristics of outlier points in the characteristic curve (here, the wind speed power curve) scatter diagram. The construction of this parameter mainly relies on the slope of the outlier points and the physical meaning of the wind speed power scatter diagram, the detail of which is shown in Fig. 4.

The wind speed power curve is divided into four areas in Fig. 4, but the study of the reduced power only focuses on the first three areas: startup area, maximum wind energy capture area, and constant power operation area.

The construction of the reduced power starts from the maximum wind energy capture area. In this area, the distribution of power points is approximately a straight line with a slope of $\xi_c = p_r/(v_r - v_{in})$. In other words, the slope of points in this area should fluctuate near $\xi_c$ under the normal state of power. Take point B as an example: this point deviates from the normal range of the wind speed power curve with a slope less than $\xi_c$. Based on the above analysis, the formula for calculating the reduced power in the maximum wind energy capture area can be given:

$$\xi = \frac{p}{v - v_{in}} \quad (v_{in} < v \le v_r) \tag{5}$$

As there are fewer outlier points in the startup area and most points are in the shutdown state (such as point A), the reduced power of this area always takes the normal value, that is,

$$\xi = \xi_c = \frac{p_r}{v_r - v_{in}} \quad (v \le v_{in}) \tag{6}$$

In the constant power operation area, the power in a normal state is always the rated power, so the slope calculation has nothing to do with wind speed. However, to keep the same order of magnitude and form of reduced power between this area and the above two areas, the formula for calculating reduced power in this area is given as follows:

$$\xi = \frac{p}{v_r - v_{in}} \quad (v > v_r) \tag{7}$$

In conclusion, the reduced power $\xi$ refers to the power generated by a wind turbine at unit wind speed. In practical application, we take the logarithm of the above formula and fine-tune the wind speed range, to avoid infinite values when calculating the reduced power near the cut-in wind speed. The formula of the reduced power for wind turbines is summarized as follows:
⎧ ( )
⎪
⎪ pr influence of future nodes’ features is not considered, i.e., a directed
⎪
⎪
⎪
⎪
ln
vr − vin
v ≤ vin + 0.1 graph with a sliding window. The construction flow of the adjacency
⎪
⎪
⎪
⎨ (p + 0.1) matrix is shown in Fig. 5.
ξ = F 1 (v) = ln vin + 0.1 < v ≤ vr (8)
⎪
⎪ v − vin
⎪
⎪
⎪ (p + 0.1)
⎪
⎪
⎪
⎪ 3.3. Feature fusion based on local-global mutual information
⎩ ln v > vr
vr − vin maximization (Phase 4)
3.3.1. The formulation for feature fusion

(b) The Mapping Function F 2 Based on Normal Behavior Model
Relying on the defined graph node feature matrix and graph adja
cency matrix, the unsupervised graph learning idea is used for feature
As a complex system, the components of a wind turbine are coupled
fusion in this work.
with each other, and different monitoring variables are correlated with
According to Eq. (11) [32], the nodes feature information X (graph
each other, so this correlation can be used to construct features for wind
node feature matrix) and the graph structure information A (graph ad
turbines, i.e., the mapping function F 2 .
jacency matrix) are used to aggregate neighbor nodes information and
In this paper, Pearson correlation analysis is used to calculate the
are activated by the nonlinear function, to obtain the nonlinear nodes
correlation between different monitoring variables obtained from the
feature information.
SCADA dataset. Several main monitoring variables are selected ac
( )
cording to their representation ability and their correlated monitoring → ∑
variables are also selected according to the condition of a correlation h =σ
i α Wx →
i,j j (11)
coefficient greater than 0.5. Then, a deep neural network is used to
j∈N i
establish a multivariable regression model between the main monitoring

where, αi,j represents the edge weight when aggregating neighbor nodes
variables and their correlated monitoring variables, named the normal
information, W represents the parameter matrix for transforming the
behavior model. Finally, the absolute error between the model predic
dimension of nodes feature matrix, N i represents the neighbor nodes set
tion value and the real value of each main monitoring variable is ob ∑
tained, as shown in Eq. (9). of node i obtained from adjacency matrix information, thus αi,j W→
xj
j∈N i
⃒ ⃒ represents the weighted aggregation of neighbor nodes information for
eM = F 2 (M) = ⃒Mpr edcit − Mreal ⃒ (9)
node i; σ ( ⋅) represents the nonlinear activation for weighted aggregated
→
where Mpredcit is the model prediction value of the main monitoring nodes features, and finally obtaining a node embedding vector h i
variable M, Mreal is its real value and eM is the derived characteristic summarizing a patch of the graph centered around node i rather than
parameter (named absolute error) based on the normal behavior model. just the node itself, named patch representations [32]. The above pro
→ → →
cess can be summarized as H = ℇ(X, A) = { h 1 , h 2 , …, h N } ∈ RN×F ,
′
(c) The Mapping Function F 3 Based on Relative Temperature

named the encoder, where ℇ : RN×F × RN×N →RN×F is the encoder
′
Difference
function. Noted that in this work, the difference between the encoder
In the main transmission chain of a wind turbine, the occurrence of functions used in different methods is mainly the calculation approach
failure is always accompanied by a rise in temperature. Noted that the of edge weight αi,j that is divided into the fixed weight (in the GCN
temperature at the fault will be significantly higher than the tempera encoder) and the learnable weight (in the GAT encoder).
ture at the non-fault and it relative to the environmental temperature The main target of the DGI model is to train a feature extractor based
will also have a significant rise. Therefore, two types of features based on on the local-global mutual information maximization to obtain local
relative temperature differences are designed to describe a wind tur representations capturing the global information of the graph [32].
Therefore, a readout function R : RN×F →RF is first introduced to
′ ′
bine’s state, they are temperature differences between different vari
ables and temperature differences relative to environmental summarize the patch representations calculated by the encoder into
temperature. The formula of the mapping function $F_3$ is described as follows.

$$\Delta T = F_3(M_A, M_B) = |M_A - M_B| \tag{10}$$

where $M_A$ and $M_B$ are two monitoring parameters used to calculate temperature difference, $\Delta T$ is the derived characteristic parameter (named absolute temperature error).
To sum up, all the above features will be used to form the feature vector of a node in a graph, and the feature vectors of all nodes will be put together and normalized to form the nodes feature matrix.

3.2.2. Edge formulation for graphical modeling
In our definition of the graph derived from SCADA data, the features (the node feature vector) representing the operating state of wind turbines at a certain time are defined as the nodes feature matrix, so the relationship between nodes (i.e., edge in the graph) should be defined as the relationship between the operating state of wind turbines at different times. Inspired by this idea, a time-series-based design idea for graph structure (i.e., the adjacency matrix) is proposed in this study, in which a sliding window is used to describe the influence of past nodes and future nodes on the current node. Specifically speaking, only the influence of past nodes’ features on the current node state is considered, while the

graph-level representation $\vec{s}$, as shown in Eq. (12) [32].

$$\vec{s} = \mathcal{R}(\mathcal{E}(X, A)) = \mathcal{R}(H) = \sigma\left(\frac{1}{N}\sum_{i=1}^{N}\vec{h}_i\right) \tag{12}$$

that is, the simple average vector of all the node features in the graph is denoted as the representation vector for graph global information.
Next, a discriminator $\mathcal{D}: \mathbb{R}^{F'} \times \mathbb{R}^{F'} \to \mathbb{R}$ is introduced as a proxy for maximizing the local mutual information to give a probability score for the patch-summary pair $(\vec{h}_i, \vec{s})$, as shown in Eq. (13) [32].

$$\mathcal{D}(\vec{h}_i, \vec{s}) = \sigma\left(\vec{h}_i^{T} W \vec{s}\right) \tag{13}$$

this is a bilinear scoring function used to convert scores into probabilities of $(\vec{h}_i, \vec{s})$ being a positive sample, where $W$ is a learnable scoring matrix and $\sigma(\cdot)$ is the logistic sigmoid nonlinearity function.
To successfully maximize the generation probability of the above positive sample $(\vec{h}_i, \vec{s})$, the negative sample $(\tilde{\vec{h}}_j, \tilde{\vec{s}})$ is also needed to be designed for contrastive training. Thus, a corruption function $\mathcal{C}: \mathbb{R}^{N \times F} \times \mathbb{R}^{N \times N} \to \mathbb{R}^{M \times F} \times \mathbb{R}^{M \times M}$ is introduced to obtain the negative samples from the raw samples, as shown in Eq. (14) [32]. In this work, the
Fig. 5. The time-series-based construction idea for the adjacency matrix.
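The sliding-window construction of the adjacency matrix described in Section 3.2.2 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `sliding_window_adjacency`, the convention that `A[i, j] = 1` means past node `j` influences current node `i`, and the absence of self-loops are all assumptions made here.

```python
import numpy as np

def sliding_window_adjacency(num_nodes: int, window: int) -> np.ndarray:
    """Directed 0/1 adjacency: node i is linked to the `window` nodes before it.

    A[i, j] = 1 when 1 <= i - j <= window (assumed convention), so past nodes
    influence the current node and future nodes never do.
    """
    idx = np.arange(num_nodes)
    lag = idx[:, None] - idx[None, :]          # lag[i, j] = i - j
    return ((lag >= 1) & (lag <= window)).astype(np.int8)

# Small example: 6 time steps, window of 2 steps.
A = sliding_window_adjacency(6, 2)
print(A)
```

With the settings reported later in the paper, the same call would read `sliding_window_adjacency(1008, 30)`. Nodes outside the window contribute 0, which mirrors the 0/1 simplification of the temporal attenuation effect.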
specific process of $\mathcal{C}$ is to preserve the raw adjacency matrix, i.e., $\tilde{A} = A$, and obtain the corrupted nodes feature matrix $\tilde{X}$ by a row-wise shuffling of $X$.

$$(\tilde{X}, \tilde{A}) = \mathcal{C}(X, A) \tag{14}$$

Finally, the objective function $\mathcal{L}$ (as shown in Eq. (15) [32]) is used to effectively maximize the mutual information between $\vec{h}_i$ and $\vec{s}$ based on the Jensen-Shannon divergence between the joint and the product of marginals, that is, to maximize the generation probability for positive samples and minimize the generation probability for negative samples, thereby maximizing the correlation of the obtained patch representations and the global representation.

$$\mathcal{L} = \frac{1}{N+M}\left(\sum_{i=1}^{N}\mathbb{E}_{(X,A)}\left[\log \mathcal{D}\left(\vec{h}_i, \vec{s}\right)\right] + \sum_{j=1}^{M}\mathbb{E}_{(\tilde{X},\tilde{A})}\left[\log\left(1 - \mathcal{D}\left(\tilde{\vec{h}}_j, \tilde{\vec{s}}\right)\right)\right]\right) \tag{15}$$

3.3.2. The algorithm for feature fusion
Inspired by the feature aggregation capability applied in a graph of a DGI network, an unsupervised feature extraction algorithm based on physical-statistical feature fusion is proposed in this work.
Part 1. The physical features calculation for wind turbines. The self-defined mapping function $F = [F_1, F_2, F_3]$ is used to design several physical features representing the operating state of wind turbines at each moment from three perspectives: characteristics curve based ($F_1$), normal behavior model based ($F_2$) and relative temperature difference based ($F_3$), thus obtaining the node feature vector. After normalizing the resulting matrix according to each parameter, the normalized nodes feature matrix required by the follow-up steps is obtained.
Part 2. The temporal dependence extraction for SCADA data. The sliding window is applied in the construction of the graph adjacency matrix.
On the one hand, the interrelation between the nodes out of the window and the ones in the window is not considered, which means that the influence of the node at some moment on the nodes at other moments only exists on a finite time scale. In other words, the larger the time scale, the weaker the influence. In this work, this temporal attenuation effect is simplified as 0/1, i.e., the interrelation in the window is defined as 1, while the one out of the window is defined as 0.
On the other hand, the connection between the nodes within the sliding window is directed: the past nodes influence the current node, while the future nodes do not influence the current node, which is consistent with the actual physics regulation. According to the above assumptions, the graph adjacency matrix is formed.
Part 3. The nonlinear correlation capturing in SCADA data and the statistical characteristics extraction for wind turbines. Considering the nonlinear correlation hidden in the SCADA data, we not only use the edges in the graph to connect nodes features at different moments calculated from SCADA data but also use the encoder constructed by DGI to aggregate the one-order neighbor information of nodes features and to nonlinearly activate these features based on Eq. (11). Therefore, the updated nodes feature matrix not only reduces the dimensions of the high-dimensional physical feature space for the nodes but also extracts the statistical features of SCADA data based on the local-global mutual information maximization principle. With this setup, the dimensionality reduction of high-dimensional SCADA data is realized.
The unsupervised feature extraction algorithm based on physical-statistical feature fusion is summarized in Algorithm 1.

Algorithm 1. Feature Extraction Based on Physical-Statistical Feature Fusion

3.4. Unsupervised anomaly detection (Phase 5)
The updated nodes feature matrix obtained from DGI will be sent into RBM for unsupervised training and the RBM energy of each node will be calculated through Eq. (4). These energy values are used for computing the status score of the nodes in a graph.
The normal state samples will be used to train the DGI-RBM normal behavior model for calculating the mean value $\mu$ and the standard deviation $\sigma$ of normal nodes’ status scores. Subsequently, the abnormal state sample is sent into the trained DGI-RBM model to calculate the status scores of the abnormal state sample nodes. According to the principle of $3\sigma$, the anomaly score of each node status score is obtained by Eq. (16) for anomaly detection:

$$\text{score} = \begin{cases} \dfrac{e - (\mu + 3\sigma)}{6\sigma}, & e \ge \mu + 3\sigma \\ 0, & \mu - 3\sigma < e < \mu + 3\sigma \\ \dfrac{(\mu - 3\sigma) - e}{6\sigma}, & e \le \mu - 3\sigma \end{cases} \tag{16}$$

where $e$ is the status score of each node in an abnormal sample, and $\text{score} \neq 0$ denotes anomaly.
Remark. The presented DGI-RBM framework realizes unsupervised anomaly detection in two aspects: (i) graphical modeling and feature learning via DGI. The features (the nodes in the graphs) are sent to DGI with the graph adjacency matrix (the edges in the graphs), and the local-global mutual information maximization principle is applied to reduce the dimension of the features and learn the patterns in different time steps in an unsupervised manner; (ii) the system behavior learning via RBM, where the recurring states in the system are learned as low-energy states (detected as normal), and the low-probability states are detected as anomalous. The former with DGI intends to learn the short-time behavior of the system and the latter with RBM learns the system-wide behavior over the long term and identifies the states whose patterns differ from those learned by DGI. Therefore, the integrated DGI-RBM framework could efficiently process the multivariate time-series in the SCADA system, capture the system behavior through a large number of measurements and detect the anomalies in an unsupervised way. Note that, although the DGI-RBM framework is formed with the SCADA data of wind turbines, it can be easily applied to other systems, especially distributed complex systems with different kinds of measurements.

4. Results and discussion

4.1. Dataset
With the proposed approach, the case studies are carried out using the SCADA dataset, which is collected from a wind farm in northeast China. The SCADA system installed in this wind farm mainly monitors 37 state parameters of 26 wind turbines including wind speed, active power, generator average speed, gearbox bearing temperature and so on. According to the mean correlation coefficients between the monitoring parameters of all wind turbines in this wind farm, 7 parameters with a low correlation coefficient are neglected in this study, and the remaining 30 parameters are used for subsequent analysis, as shown in Table 1.
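The parameter screening mentioned in Section 4.1 is only described in words, so the following is one possible reading of it rather than the procedure actually used. The assumptions are: Pearson correlation matrices are computed per turbine and averaged over turbines, each parameter is ranked by its mean absolute correlation with the other parameters, and the `n_drop` lowest-ranked parameters are discarded. The function name `screen_parameters` and the synthetic data are hypothetical.

```python
import numpy as np

def screen_parameters(data: np.ndarray, n_drop: int):
    """data has shape (turbines, samples, parameters).

    Returns the indices of kept and dropped parameters plus the ranking score
    (mean absolute correlation with the other parameters; an assumed criterion).
    """
    # Average the per-turbine Pearson correlation matrices over all turbines.
    corr = np.mean([np.corrcoef(t, rowvar=False) for t in data], axis=0)
    n_params = corr.shape[0]
    abs_corr = np.abs(corr)
    # Exclude the diagonal (self-correlation) from each parameter's mean.
    mean_abs = (abs_corr.sum(axis=1) - np.diag(abs_corr)) / (n_params - 1)
    order = np.argsort(mean_abs)               # weakest correlation first
    dropped = np.sort(order[:n_drop])
    kept = np.sort(order[n_drop:])
    return kept, dropped, mean_abs

# Synthetic check: parameters 0-3 share a latent driver, parameter 4 is pure noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(3, 500, 1))
data = np.concatenate(
    [latent + 0.3 * rng.normal(size=(3, 500, 4)), rng.normal(size=(3, 500, 1))],
    axis=2,
)
kept, dropped, mean_abs = screen_parameters(data, n_drop=1)
print(kept, dropped)
```

For the dataset in this study, the corresponding call would use 37 parameters with `n_drop=7`, leaving the 30 parameters of Table 1.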
Table 1
30 parameters used in this study and their symbolic representations.
No. | Parameter | Unit
M1 | NCC300 temperature | °C
M2 | NCC320 temperature | °C
M3 | Grid side inductance temperature | °C
M4 | Machine side inductance temperature | °C
M5 | Machine side semiconductor temperature | °C
M6 | Grid side semiconductor temperature | °C
M7 | Ambient temperature | °C
M8 | Filter plate temperature | °C
M9 | Gearbox bearing temperature | °C
M10 | Gearbox oil temperature | °C
M11 | Generator bearing temperature in driving direction | °C
M12 | Generator bearing temperature in non-driving direction | °C
M13 | Maximum generator winding temperature | °C
M14 | Temperature, pitch motor 1 | °C
M15 | Temperature, pitch motor 2 | °C
M16 | Temperature, pitch motor 3 | °C
M17 | Pitch motor 1 torque | Nm
M18 | Pitch motor 2 torque | Nm
M19 | Pitch motor 3 torque | Nm
M20 | Nacelle temperature | °C
M21 | Active power | kW
M22 | Nacelle battery temperature | °C
M23 | 1-s average wind speed | m/s
M24 | Maximum yaw power | kW
M25 | Generator average speed | rpm
M26 | Minimum generator speed | rpm
M27 | Maximum generator speed | rpm
M28 | Control generator average speed | rpm
M29 | Minimum control generator speed | rpm
M30 | Maximum control generator speed | rpm

4.2. Data preprocessing
The existence of abnormal data in the raw dataset will seriously affect the accuracy of normal behavior learning for features and anomaly detection. Therefore, it is necessary to perform data cleaning based on the operating characteristics of individual wind turbines. The details for data cleaning are described below:
Remove those data points that are significantly outside the normal range according to the characteristic parameters of wind turbines (such as cut-in wind speed, cut-off wind speed, rated wind speed, and rated power).
Remove fault points according to fault records provided by wind farms.
Remove outlier points in the wind speed power scatter diagram by using the bin method, Root Mean Square method, or Kernel Density Estimation method according to the standard wind speed power curve of wind turbines.
Remove outliers in the generator speed power scatter diagram by using the bin method, Root Mean Square method, or Kernel Density Estimation method according to the standard generator speed power curve of wind turbines.
Remove the abnormal data points of several parameters, including gearbox bearing temperature, gearbox oil temperature, generator bearing temperature in driving direction, generator bearing temperature in non-driving direction, maximum generator winding temperature, and nacelle temperature, according to the normal range of monitoring parameters summarized by the dataset and Ref. [41].
According to the above steps, the data cleaning for 26 wind turbines is carried out. Taking the No.13 wind turbine as an example, the cleaning results are shown in Fig. 6, where (a) shows the cleaning result based on the standard wind speed power curve and (b) shows the cleaning result based on the standard generator speed power curve. As a result, the normal data used to train the normal behavior model is obtained.

Fig. 6. The cleaning results of the No.13 wind turbine where black points denote normal points and grey points denote outliers.

4.3. Formulation of features
The reduced power is firstly calculated by Eq. (8), represented by f1.
As for the features based on the normal behavior model, the seven main monitoring variables used in this paper and their correlated monitoring variables are shown in Table 2. These seven parameters and their correlated parameters are used to calculate the features according to the method of Section 3.2.1.(b).
As for the features based on the relative temperature difference, eight features are designed, as shown in Table 3.
In conclusion, 16 features (f1 to f16) are designed to represent the operating state of wind turbines at a certain time.

Table 2
Seven main monitoring variables and their correlated monitoring variables.
No. | Main Monitoring Variable | Correlated Monitoring Variables
f2 | Active Power | M5, M9, M10, M13, M17, M18, M19, M23, M25, M26, M27, M28, M29, M30
f3 | Gearbox Bearing Temperature | M10, M13, M18, M21, M23, M25, M28
f4 | Gearbox Oil Temperature | M9, M13, M17, M18, M19, M21, M23, M25, M26, M27, M28, M29, M30
f5 | Generator Bearing Temperature in Driving Direction | M1, M2, M3, M4, M5, M6, M12, M13, M17, M18, M19
f6 | Generator Bearing Temperature in Non-driving Direction | M1, M2, M3, M4, M5, M6, M11, M13, M17, M18, M19
f7 | Maximum Generator Winding Temperature | M3, M4, M5, M6, M9, M10, M11, M12, M14, M15, M16, M17, M18, M19, M21, M23, M25, M26, M27, M28, M29, M30
f8 | Nacelle Temperature | M1, M2, M4, M7, M8, M14, M15, M16, M22

4.4. Parameters of graphs
Using the method of Section 3.2.2 and considering the SCADA dataset size, the parameters of the graph structure are set as follows:
The number of nodes in a graph is 1008, i.e., each graph mainly focuses on 7 days of data including fault data.
The sliding window size is 30, i.e., the wind turbine state at each moment will be influenced by the past 5 h of data. Both settings imply a 10-min sampling interval (7 × 24 × 6 = 1008 and 30 × 10 min = 5 h).

Table 3
Eight features based on the relative temperature difference.
No. | Parameters Description | Formula
f9 | The relative temperature difference between Generator Bearing Temperature in Driving Direction and Generator Bearing Temperature in Non-driving Direction | |M11 − M12|
f10 | The relative temperature difference between Generator Bearing Temperature in Driving Direction and Gearbox Oil Temperature | |M11 − M10|
f11 | The relative temperature difference between Generator Bearing Temperature in Non-driving Direction and Gearbox Oil Temperature | |M12 − M10|
f12 | The relative temperature difference between Nacelle Temperature and Ambient Temperature | |M20 − M7|
f13 | The relative temperature difference between Generator Bearing Temperature in Driving Direction and Ambient Temperature | |M11 − M7|
f14 | The relative temperature difference between Generator Bearing Temperature in Non-driving Direction and Ambient Temperature | |M12 − M7|
f15 | The relative temperature difference between Gearbox Oil Temperature and Ambient Temperature | |M10 − M7|
f16 | The relative temperature difference between Gearbox Bearing Temperature and Ambient Temperature | |M9 − M7|

4.5. Evaluation metrics and comparison methods
In this work, recall rate, precision rate, F1 value, and accuracy rate (as shown in Eqs. (17)–(20)) are selected to evaluate the anomaly detection performance of the proposed approach.

$$\text{Recall} = \frac{TP}{TP + FN} \tag{17}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{18}$$

$$F1 = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}} \tag{19}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{20}$$

where TP, TN, FP, FN are the numbers of true positives, true negatives, false positives, and false negatives.
To verify the validity of the proposed framework, three comparison methods are selected in this paper.
DGI(GCN)-RBM: this method is based on the DGI-RBM model with a GCN encoder to detect anomalies and it will put the nodes’ feature information and the graph structure information together to calculate the RBM energy value.
DGI(GAT)-RBM: GAT can overcome the GCN’s limitation of measuring the information importance of a node only by the structural information of the graph, and uses learnable weights instead of fixed weights to capture important information. Therefore, this study transforms the GCN encoder into a GAT encoder, hoping to achieve better anomaly detection performance.
RBM: this method is only used to analyze the influence of graph structure on anomaly detection effect.

4.6. Experiments

4.6.1. Fault samples
With the above-mentioned setups, three types of anomalies in a real wind farm are successfully detected using the proposed approach. The details are described below and the operating data of 30 monitoring variables corresponding to these anomalies are shown in Fig. 7.
Hub speed signal difference. This fault occurred in the No.6 wind turbine with fault number 3057, the description for which is “Hub rotation speed signal difference, rotation speed difference of 1 s is greater than 1000 rpm”. As shown in Fig. 7(a), the obvious difference in hub speed directly causes generator shutdown (Parameter M25), which drives the output power to zero (Parameter M21). For some commonly used monitoring parameters such as gearbox bearing temperature (Parameter M9), gearbox oil temperature (Parameter M10), and generator bearing temperature (Parameter M11/M12), the anomaly makes them trend downward. In this case, the above-mentioned phenomenon can be interpreted as follows: the hub speed change affects the friction intensity between bearings and shafts, i.e., lower hub speed, weaker friction intensity, and lower bearing temperature. A similar trend is also shown in nacelle temperature (Parameter M20) and other temperature parameters.
Grid actual power anomaly. This fault occurred in the No.6 wind turbine with fault number 3352, the description for which is “Compare the measured power and parameters of the grid, and if they exceed a certain level, a fault occurs”. As shown in Fig. 7(b), the measured power anomaly is directly shown in active power (Parameter M21), and the wind speed (Parameter M23) is also zero. Therefore, the root cause of this anomaly may be the sudden decrease in wind speed.
Anemometer communication error. This fault occurred in the No.13 wind turbine with fault number 10091, the description for which is “Anemometer communication error occurs if no data packet is received from anemometer within 5 s”. As shown in Fig. 7(c), the wind speed (Parameter M23) jumps to zero when the fault occurs and other parameters also vary to a certain extent.
The experiments are arranged as follows: the effectiveness of the proposed method for capturing nonlinear correlation in the SCADA dataset is verified in Section 4.6.2. The effects of the proposed method for extracting temporal dependence in the SCADA dataset are tested in Section 4.6.3. The effects of the proposed method for feature extraction based on physical-statistical feature fusion and dimensionality reduction are shown in Section 4.6.4. Finally, the effectiveness and accuracy of the proposed method are analyzed in Section 4.6.5.

4.6.2. Effectiveness of nonlinear correlation capture
The achievement of the nonlinear correlation capture for the SCADA dataset in this method mainly depends on the nonlinear activation of the graph deep learning method, i.e., the nonlinear activation function σ in Eq. (11). To make a better comparison for the influence of nonlinear activation function on anomaly detection, this experiment takes Fault #3057 as an example to design two graph deep learning training models with/without nonlinear activation, the results of which are shown in Fig. 8, where red represents DGI(GCN) with activation, blue represents DGI(GCN) without activation, green represents DGI(GAT) with activation and orange represents DGI(GAT) without activation. It can be found that: (i) nonlinear activation function can effectively improve the performance of anomaly detection. (ii) From the perspective of the encoder, the detection effect of the GAT encoder is better than that of the GCN encoder. (iii) The anomaly detection effect improvement by nonlinear activation function in the GAT encoder is weaker than that in the GCN encoder, which is related to the great improvement in feature aggregation capability by learnable weights in GAT. In a word, the nonlinear activation function in DGI can effectively capture the nonlinear correlation of the SCADA dataset.

4.6.3. Effect analysis of temporal dependence extraction
The temporal dependence extraction is mainly realized by the sliding window in this method. However, a large sliding window will make each node aggregate a large range of information and weaken the uniqueness of the updated node information, while a small sliding window will weaken the temporal dependence extracted by each node. Thus the
Fig. 7. The raw data of all monitoring variables in three fault samples, where blue/red represents the normal/abnormal state.
sliding window with an appropriate size is beneficial to extracting the temporal dependence of the dataset as much as possible. To discuss the influence of sliding window size on anomaly detection, this experiment takes Fault #3057 and Fault #3352 as examples to analyze the trend of anomaly detection evaluation metrics changing with sliding window size (from 1 h to 24 h), as shown in Fig. 9, where the normal range of these metrics is [0, 1] and −1 represents calculation error. It can be found that: (i) Both too large and too small sliding windows will weaken the effect of anomaly detection. (ii) The evaluation metrics of anomaly detection reach their optimum when the sliding window is 5 h.

4.6.4. Effect of DGI on feature extraction and dimensionality reduction
This experiment takes Fault #3352 as an example to show the effect of feature extraction and dimensionality reduction carried out by the proposed method, as shown in Fig. 10, where blue represents a normal state, and red represents an abnormal state. It can be found that: (i) the first dimensionality reduction of the raw monitoring data of wind turbines is carried out for the model input features from the perspective of physical characteristics, preliminarily reducing the input information redundancy, as shown in Fig. 10(a). (ii) The second dimensionality reduction is realized through statistical-based feature aggregation and nonlinear activation performed by the DGI model,
Fig. 8. The influence of nonlinear activation function on anomaly detection.
Fig. 9. The trend of anomaly detection evaluation metrics changing with sliding window size.
Fig. 10. The input features obtained from raw data and updated features extracted by the DGI-RBM model for Fault #3352.
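The updated features shown in Fig. 10 come from an encoder trained with the DGI objective of Eqs. (12)-(15). The NumPy sketch below evaluates that objective once for a single graph, under several assumptions that are not taken from the paper: `encode` is a hypothetical one-layer mean-aggregation encoder standing in for Eq. (11), the expectations in Eq. (15) are replaced by one sample, and corrupted patches are scored against the summary of the uncorrupted graph, as in the original DGI formulation [32]. All names are illustrative.

```python
import numpy as np

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    return -np.logaddexp(0.0, -z)

def encode(X, A, theta):
    # Stand-in encoder (assumption): average each node with its in-neighbours,
    # apply a linear map, then a ReLU nonlinearity.
    A_hat = A + np.eye(A.shape[0])
    A_hat = A_hat / A_hat.sum(axis=1, keepdims=True)
    return np.maximum(A_hat @ X @ theta, 0.0)

def dgi_objective(X, A, theta, W, rng):
    H = encode(X, A, theta)                          # patch representations h_i
    s = 1.0 / (1.0 + np.exp(-H.mean(axis=0)))        # Eq. (12): sigmoid of the mean
    X_tilde = X[rng.permutation(X.shape[0])]         # Eq. (14): row-wise shuffle, A kept
    H_tilde = encode(X_tilde, A, theta)
    pos_logit = H @ W @ s                            # Eq. (13) before the sigmoid
    neg_logit = H_tilde @ W @ s
    n, m = H.shape[0], H_tilde.shape[0]
    # Eq. (15): log D for positives, log(1 - D) = log_sigmoid(-logit) for negatives.
    total = log_sigmoid(pos_logit).sum() + log_sigmoid(-neg_logit).sum()
    return total / (n + m)

rng = np.random.default_rng(0)
idx = np.arange(8)
lag = idx[:, None] - idx[None, :]
A = ((lag >= 1) & (lag <= 2)).astype(float)          # sliding-window graph, window = 2
X = rng.normal(size=(8, 4))                          # 8 nodes, 4 input features
theta = 0.5 * rng.normal(size=(4, 3))                # encoder weights
W = 0.5 * rng.normal(size=(3, 3))                    # bilinear scoring matrix
value = dgi_objective(X, A, theta, W, rng)
print(value)
```

In training, `theta` and `W` would be updated by gradient ascent on this value (or descent on its negative) with an automatic-differentiation framework; that step is omitted here.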
further reducing the input information redundancy, as shown in Fig. 10(b) and (c). (iii) The feature extraction effect of the DGI(GAT) model is superior to that of the DGI(GCN).

4.6.5. Effectiveness and accuracy of graph deep learning network
This experiment takes Fault #10091 as an example to reflect RBM energy-based state evaluation result and anomaly detection result, as shown in Fig. 11, where blue represents a normal state, red represents an abnormal state, the black dotted line indicates the upper and lower limits of the anomaly detection threshold, and the green dotted line indicates the start and end of the fault. (a) shows the results of RBM energy and anomaly score based on the DGI-RBM model with GCN encoder. (b) shows the results of RBM energy and anomaly score based on the DGI-RBM model with GAT encoder. (c) shows the results of RBM energy and anomaly score based on the RBM model. It can be found that: (i) compared with the RBM model (i.e., the model without graph
Fig. 11. The anomaly detection result for Fault #10091.
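The 3σ scoring rule of Eq. (16) can be written compactly in NumPy. This is a minimal sketch: the function name `anomaly_score` and the toy status scores are assumptions, and `mu` and `sigma` are taken here as the population mean and standard deviation of status scores from normal samples.

```python
import numpy as np

def anomaly_score(e, mu, sigma):
    """Piecewise score of Eq. (16); zero inside the (mu - 3*sigma, mu + 3*sigma) band."""
    e = np.asarray(e, dtype=float)
    upper = mu + 3.0 * sigma
    lower = mu - 3.0 * sigma
    score = np.zeros_like(e)
    high = e >= upper
    low = e <= lower
    score[high] = (e[high] - upper) / (6.0 * sigma)
    score[low] = (lower - e[low]) / (6.0 * sigma)
    return score

# Toy normal-state status scores with mean 0 and standard deviation 1.
normal_scores = np.array([-1.0, 1.0, -1.0, 1.0])
mu, sigma = normal_scores.mean(), normal_scores.std()

# Status scores of a sample to be tested.
test_scores = np.array([-9.0, -3.0, 0.0, 3.0, 6.0])
scores = anomaly_score(test_scores, mu, sigma)
print(scores)          # nonzero entries are flagged as anomalous
```

A score is expressed in units of the 6σ band width, so a node whose status score lies one full band width above the upper threshold receives a score of 1.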
Table 4
The best performance evaluation results for six models in terms of Recall (Rec), Precision (Prec), F1-score (F1), and Accuracy (Acc).
Method | Fault #3057: Rec / Prec / F1 / Acc | Fault #3352: Rec / Prec / F1 / Acc | Fault #10091: Rec / Prec / F1 / Acc
DGI(GCN)-RBM | 0.84 / 0.87 / 0.85 / 0.96 | 0.74 / 0.89 / 0.81 / 0.88 | 0.38 / 0.53 / 0.44 / 0.92
DGI(GAT)-RBM | 0.90 / 0.85 / 0.88 / 0.97 | 0.72 / 0.98 / 0.83 / 0.90 | 0.87 / 1.00 / 0.93 / 0.99
RBM | 0.73 / 0.40 / 0.51 / 0.83 | 0.39 / 0.99 / 0.56 / 0.79 | 0.22 / 0.10 / 0.14 / 0.75
TadGAN [42] | 1.00 / 0.12 / 0.21 / 0.47 | 0.22 / 0.56 / 0.32 / 0.67 | 0.07 / 0.02 / 0.03 / 0.78
COUTA [43] | 0.85 / 0.60 / 0.71 / 0.96 | 0.27 / 0.68 / 0.39 / 0.74 | 0.31 / 0.82 / 0.45 / 0.97
GDN [44] | 1.00 / 0.48 / 0.65 / 0.97 | 0.23 / 0.44 / 0.30 / 0.81 | 0.93 / 0.34 / 0.50 / 0.96
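The four metrics reported in Table 4 follow Eqs. (17)-(20) and can be computed from node-level confusion counts as sketched below. The function names and example counts are hypothetical. Returning −1 when a denominator is zero is an assumption made here, chosen to match the statement that −1 marks a calculation error in Fig. 9.

```python
def _ratio(num, den):
    # Assumed convention: -1.0 flags an undefined ratio (zero denominator).
    return num / den if den > 0 else -1.0

def detection_metrics(tp, tn, fp, fn):
    """Recall, precision, F1 and accuracy as in Eqs. (17)-(20)."""
    recall = _ratio(tp, tp + fn)
    precision = _ratio(tp, tp + fp)
    if recall < 0 or precision < 0 or recall + precision == 0:
        f1 = -1.0
    else:
        f1 = 2 * recall * precision / (recall + precision)
    accuracy = _ratio(tp + tn, tp + tn + fp + fn)
    return recall, precision, f1, accuracy

# Hypothetical counts for one graph of 100 nodes.
recall, precision, f1, accuracy = detection_metrics(tp=8, tn=80, fp=2, fn=10)
print(recall, precision, f1, accuracy)
```

As a consistency check on Table 4, Eq. (19) applied to Recall = 0.87 and Precision = 1.00 gives F1 ≈ 0.93, which matches the DGI(GAT)-RBM entry for Fault #10091.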
information), DGI-RBM achieves an excellent anomaly detection performance. (ii) The modification of the DGI’s encoder (i.e., transforming GCN to GAT) effectively improves the feature extraction capability of DGI, and consequently improves the performance of the proposed anomaly detection method, which also reflects the advantages of learnable weights over fixed weights.
The best performance evaluation results for the above three models are shown in Table 4. The DGI-RBM model is stronger than the RBM model, i.e., the information contained in the sample will be more comprehensive and the detection ability of the model will be stronger if the graph structure information is embedded in the training and prediction process using DGI. Especially for the DGI-RBM model with a GAT encoder, the improvement of learnable weights enables the model to mine sample features as much as possible without the constraint of fixed weights. It is worth noting that the precision rate metric of the RBM model has a high value (0.99 for Fault #3352), which can be interpreted as follows: in the nodes features of this anomaly, the mutation between normal nodes and abnormal nodes is obvious. This mutation feature can be utilized by RBM to make the detected anomaly results almost all correct, while some false detections for normal nodes appear in the DGI detection result due to the weakening effect of the designed sliding window on mutation. Nevertheless, the DGI-RBM models achieve clearly higher F1-scores than the RBM model in all three cases.
In addition, the comparison of the proposed method with another three state-of-the-art time series anomaly detection models is also shown in Table 4. The TadGAN [42] is used to reconstruct time series and contextually assess errors to identify anomalies. The COUTA [43] is an unsupervised method based on calibrated one-class classification for time series anomaly detection. The GDN [44] learns a graph of relationships between sensors and detects deviations from these patterns to find anomalies in time series. The results show that DGI(GAT)-RBM achieves the highest F1-score among all models in the three real-world cases, which further indicates that (i) the anomaly detection methods based on graph structure information (e.g., DGI-RBM) have stronger detection capability than those relying only on time series (e.g., TadGAN, COUTA), and (ii) among graph-based methods, those introducing physical information (e.g., DGI-RBM) have stronger anomaly detection capability than those using only monitoring data (e.g., GDN).

4.7. Discussions
Four groups of experiments are conducted to analyze the performance of the proposed method from different perspectives, and the results show that: (i) the method can capture the nonlinear correlation of SCADA data by nonlinear activation function and graph learning method. (ii) The method can extract the temporal dependence hidden in SCADA data using the sliding window and optimize the extraction effect by selecting an appropriate window size. (iii) The framework can realize feature extraction and hierarchical dimensionality reduction from two perspectives, physical characteristics and statistical characteristics. (iv) The method can improve anomaly detection performance by the graph neural network based on DGI. It can be concluded that the proposed method can reduce the dimensionality and extract the latent nonlinear correlation and temporal dependence for SCADA data, to detect wind
turbine anomalies successfully. ii) the graph structure constructed by time-series based idea and a
As a new data expression with stronger representation capability, the sliding window can easily and effectively express the structure infor
graph considers the monitoring values and the relationship between mation of SCADA dataset and extract temporal dependence, iii) and the
different values in time-series data at the same time to more compre anomaly detection model constructed by DGI-RBM framework can fully
hensively delve into and express data information. Different from the use the physical-statistical information and local-global information of
conventional time-series data processing method that only focuses on dataset, which has an excellent anomaly detection performance and can
local information, the proposed method uses the graphically recon be used to research the fault propagation mechanism and analysis the
structed data and the DGI to comprehensively consider the local and root cause of fault in the future.
global information of time-series data. Therefore, this method re
constructs and expresses time-series data in a novel way (i.e., graph) CRediT authorship contribution statement
with a better outcome.
The proposed method strengthens the application of feature fusion in Chenlong Feng: Methodology, Software, Validation, Formal anal
anomaly detection, which is mainly in two aspects: firstly, the extraction ysis, Investigation, Data curation, Writing – original draft, Writing –
and dimensionality reduction of nodes features are realized from the review & editing, Visualization. Chao Liu: Conceptualization, Method
perspectives of physical and statistical characteristics respectively, ology, Validation, Formal analysis, Investigation, Resources, Writing –
effectively extracting the main features of the dataset and reducing the original draft, Writing – review & editing. Dongxiang Jiang: Supervi
information redundancy. Secondly, the local and global characteristics sion, Resources, Writing – review & editing, Project administration,
of the dataset are fused in the process of unsupervised training inspired Funding acquisition.
by the mutual information maximization principle. The fused features
strengthen the representation capability of the node embedding vector
and enable the downstream anomaly detector to catch the uniqueness of Declaration of competing interest
each node more quickly, thus improving the anomaly detection
performance. The authors declare that they have no known competing financial
In this paper, DGI is the feature extractor and RBM is the anomaly interests or personal relationships that could have appeared to influence
detector, the combination of which forms the framework for anomaly the work reported in this paper.
detection. The results of the above research show that the proposed
feature extractor based on mutual information maximization and Acknowledgements
physical-statistical feature fusion can fully delve into time-series data
feature information. Thus using this feature extractor, not only different This work was partly supported by the Huaneng Group science and
time-series data of different equipment can be processed but also a novel technology research project(HNKJ20-H50).
fault diagnosis method can be developed. Moreover, changing the DGI
encoder to GAT can give full play to the advantages of learnable weights References
and improve anomaly detection performance.
Given the superior applicability of the proposed approach in [1] Council, G. W. E, Global Wind Report 2021, Report, Global wind energy council,
handling time-series data and extracting features, future work will focus 2021.
[2] Y. Wang, C.H. Guo, X.J. Chen, L.Q. Jia, X.N. Guo, R.S. Chen, et al., Carbon peak and
on, (i) exploring the construction method for wind turbines system carbon neutrality in China: goals, implementation path and prospects, China
graph structure and realizing anomaly detection and fault location for Geology 4 (4) (2021) 27.
wind turbines components, (ii) exploring the fault propagation mecha [3] Teja S. Kandukuri, Andreas Klausen, Gunnar K. Robbersmyr, et al., A Review of
Diagnostics and Prognostics of Low-Speed Machinery towards Wind Turbine Farm-
nism between different wind turbines components and construct a
Level Health Management, Renewable & sustainable energy reviews, 2016.
complex system fault propagation network for wind turbines, (iii) [4] H. Zhao, L. Lang, Fault diagnosis of wind turbine bearing based on variational
designing the fault tracing algorithm for wind turbines and trace the mode decomposition and teager energy operator, IET Renew. Power Gener. 11 (4)
(2016) 453–460.
fault source of wind turbines anomalies.
[5] A. Jc, L.B. Jian, B. Wc, B. Yw, B. Tj, Anomaly detection for wind turbines based on
the reconstruction of condition parameters using stacked denoising autoencoders,
5. Conclusion Renew. Energy 147 (2020) 1469–1480.
[6] Q. Wei, D. Lu, A survey on wind turbine condition monitoring and Fault diagnosis -
Part II: signals and signal processing methods[J], IEEE Trans. Ind. Electron. 62 (10)
Motivated by the demand for wind turbines anomaly detection, the (2015), 1-1.
stronger representation capability of graph data, and the fast develop [7] M. Gil, O. Gomis-Bellmunt, A. Sumper, Technical and economic assessment of
ment of unsupervised graph neural networks, this work presents an offshore wind power plants based on variable frequency operation of clusters with
a single power converter, Appl. Energy 125 (jul.15) (2014) 218–229.
anomaly detection framework for wind turbines using SCADA data [8] F. Qu, J. Liu, H. Zhu, B. Zhou, Wind turbine fault detection based on expanded
based on physical-statistical feature fusion and an unsupervised GNNs linguistic terms and rules using non-singleton fuzzy logic, Appl. Energy 262 (2020).
that can reduce data dimensionality, capture parameters nonlinear [9] H. Chen, H. Liu, X. Chu, Q. Liu, D. Xue, Anomaly detection and critical scada
parameters identification for wind turbines based on lstm-ae neural network,
correlation and extract data temporal dependence. The graphical Renew. Energy 172 (1) (2021).
modeling method for SCADA data captures the correlation hidden in the [10] W. Yang, R. Court, J. Jiang, Wind turbine condition monitoring by the approach of
dataset in a graph-based manner. The designed/proposed physical fea scada data analysis, Renew. Energy 53 (9) (2013) 365–376.
[11] Y. Zhang, M. Li, Z.Y. Dong, K. Meng, Probabilistic anomaly detection approach for
tures purposely introduce the structure/mechanism information of wind data-driven wind turbine condition monitoring, CSEE J. Power Energy Syst. 5 (2)
turbines to enrich the information in the extracted features. The pro (2019) 10.
posed DGI-RBM framework fully extracts the effective features repre [12] H. Zhao, H. Liu, W. Hu, X. Yan, Anomaly detection and fault analysis of wind
turbine components based on deep learning network, Renew. Energy 127 (NOV)
senting the operating state of wind turbines in SCADA data in an (2018) 825–834.
unsupervised local-global feature learning and physical-statistical [13] L. Xiang, X. Yang, A. Hu, H. Su, P. Wang, Condition monitoring and anomaly
feature fusion manner, thus improving the performance of anomaly detection of wind turbine based on cascaded and bidirectional deep learning
networks, Appl. Energy 305 (2022).
detection for wind turbines. Finally, the validation and discussion for the
[14] H. Tong, R.C. Qiu, D. Zhang, H. Yang, Q. Ding, X. Shi, Detection and classification
proposed method are carried out based on the experimental studies on of transmission line transient faults based on graph convolutional neural network,
the real-world SCADA data, and the results show that i) the features CSEE J. Power Energy Syst. 7 (3) (2021) 16.
designed by the criterion of characteristic curves based, normal behavior [15] C. Li, L. Mo, R. Yan, fault diagnosis of rolling bearing based on WHVG and GCN,
IEEE Trans. Instrum. Meas. 70 (2021).
model based and relative temperature difference based can reasonably [16] X. Wang, B. Jin, Y. Du, P. Cui, Y. Yang, One-class graph neural networks for
represent wind turbines state and reduce the information redundancy, anomaly detection in attributed networks, Neural Comput. Appl. (3) (2020).
322
[17] N. Bai, F. Meng, X. Rui, Z. Wang, Rumour Detection Based on Graph Convolutional Neural Net, IEEE Access, 2021, (99), 1-1.
[18] L. Xie, D. Pi, X. Zhang, J. Chen, W. Yu, Graph neural network approach for anomaly detection, Measurement 180 (1) (2021), 109546.
[19] X. Yu, B. Tang, K. Zhang, Fault diagnosis of wind turbine gearbox using a novel method of fast deep graph convolutional networks, IEEE Trans. Instrum. Meas. 70 (2021) 1–14.
[20] C. Liu, S. Ghosal, Z. Jiang, S. Sarkar, An unsupervised anomaly detection approach using energy-based spatiotemporal graphical modeling, Cyber-Physical Systems 3 (1) (2017) 1–37.
[21] W. Yang, C. Liu, D. Jiang, An unsupervised spatiotemporal graphical modeling approach for wind turbine condition monitoring, Renew. Energy 127 (2018) 230–241.
[22] X. Zhang, Y. Yang, D. Zhai, T. Li, J. Chu, H. Wang, Local2global: unsupervised multi-view deep graph representation learning with nearest neighbor constraint, Knowl. Base Syst. 231 (2021), 107439.
[23] T. Lv, X. Pan, Y. Zhu, L. Li, Unsupervised medical images denoising via graph attention dual adversarial network, Appl. Intell. 51 (6) (2021) 4094–4105.
[24] D. Buterez, I. Bica, I. Tariq, H. Andrés-Terré, P. Liò, CellVGAE: an unsupervised scRNA-seq analysis workflow with graph attention networks, Bioinformatics 38 (5) (2022) 1277–1286.
[25] D. Ding, F. Xia, X. Yang, C. Tang, Joint dictionary and graph learning for unsupervised feature selection, Appl. Intell. 50 (5) (2020) 1379–1397.
[26] R. Zhang, Y. Zhang, X. Li, Unsupervised Graph Embedding via Adaptive Graph Learning, 2020 arXiv preprint arXiv:2003.04508.
[27] A. Semenov, A. Mazeev, D. Doropheev, T. Yusubaliev, Unsupervised graph anomaly detection algorithms implemented in Apache Spark, Lobachevskii J. Math. 39 (9) (2018) 1262–1269.
[28] R.D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, Y. Bengio, Learning Deep Representations by Mutual Information Estimation and Maximization, 2018 arXiv preprint arXiv:1808.06670.
[29] R. Linsker, Self-organization in a perceptual network, Computer 21 (3) (1988) 105–117.
[30] M.I. Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, A. Courville, R.D. Hjelm, Mine: Mutual Information Neural Estimation, 2018 arXiv preprint arXiv:1801.04062.
[31] J.B. Kinney, G.S. Atwal, Equitability, mutual information, and the maximal information coefficient, Proc. Natl. Acad. Sci. USA 111 (9) (2014) 3354–3359.
[32] P. Veličković, W. Fedus, W.L. Hamilton, P. Liò, Y. Bengio, R.D. Hjelm, Deep Graph Infomax, 2018 arXiv preprint arXiv:1809.10341.
[33] Y. Ren, B. Liu, C. Huang, P. Dai, L. Bo, J. Zhang, Heterogeneous Deep Graph Infomax, 2019 arXiv preprint arXiv:1911.08538.
[34] C. Park, D. Kim, J. Han, H. Yu, Unsupervised attributed multiplex network embedding, Proc. AAAI Conf. Artif. Intell. 34 (2020, April) 5371–5378, 04.
[35] F.Y. Sun, J. Hoffmann, V. Verma, J. Tang, Infograph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization, 2019 arXiv preprint arXiv:1908.01000.
[36] Z. Peng, W. Huang, M. Luo, Q. Zheng, Y. Rong, T. Xu, J. Huang, Graph Representation Learning via Graphical Mutual Information Maximization, In Proceedings of The Web Conference 2020, 2020, April, pp. 259–270.
[37] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, S.Y. Philip, A comprehensive survey on graph neural networks, IEEE Transact. Neural Networks Learn. Syst. 32 (1) (2020) 4–24.
[38] J. Zhou, G. Cui, S. Hu, Z. Zhang, C. Yang, Z. Liu, M. Sun, Graph neural networks: a review of methods and applications, AI Open 1 (2020) 57–81.
[39] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, Y. Bengio, Graph Attention Networks, 2017 arXiv preprint arXiv:1710.10903.
[40] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (5786) (2006) 504–507.
[41] M. Schlechtingen, I.F. Santos, S. Achiche, Wind turbine condition monitoring based on SCADA data using normal behavior models. Part 1: system description, Appl. Soft Comput. 13 (1) (2013) 259–270.
[42] A. Geiger, D. Liu, S. Alnegheimish, A. Cuesta-Infante, K. Veeramachaneni, TadGAN: Time Series Anomaly Detection Using Generative Adversarial Networks, In 2020 IEEE International Conference on Big Data (Big Data), 2020, December, pp. 33–43 (IEEE).
[43] H. Xu, Y. Wang, S. Jian, Q. Liao, Y. Wang, G. Pang, Calibrated One-Class Classification for Unsupervised Time Series Anomaly Detection, 2022 arXiv preprint arXiv:2207.12201.
[44] A. Deng, B. Hooi, Graph neural network-based anomaly detection in multivariate time series, Proc. AAAI Conf. Artif. Intell. 35 (5) (2021, May) 4027–4035.
