IEEE TRANSACTIONS ON SUSTAINABLE ENERGY, VOL. 6, NO. 1, JANUARY 2015

Raw Wind Data Preprocessing: A Data-Mining Approach

Le Zheng, Wei Hu, and Yong Min

Abstract: Wind energy integration research generally relies on complex sensors located at remote sites. The procedure for generating high-level synthetic information from databases containing large amounts of low-level data must therefore account for possible sensor failures and imperfect input data, because data-mining methods are highly sensitive to data quality. To address this problem, this paper presents an empirical methodology that can efficiently preprocess and filter raw wind data using only the aggregated active power output and the corresponding wind speed values at the wind farm. First, raw wind data properties are analyzed, and all the data are divided into six categories according to their attribute magnitudes from a statistical perspective. Next, the weighted distance, a novel measure of the degree of similarity between individual objects in the wind database, is incorporated into the local outlier factor (LOF) algorithm to compute the outlier factor of every object, and this outlier factor is then used to assess the category to which an object belongs. Finally, the methodology was tested successfully on data collected from a large wind farm in northwest China.
Index Terms: Data mining, data preprocessing, local outlier factor (LOF), unsupervised learning.

NOMENCLATURE

V, V(x)            Wind speed measured from the wind farm meteorological mast; wind speed value of object x.
Vci                Cut-in speed of the wind turbine.
Vr                 Rated speed of the wind turbine.
Vco                Cut-out speed of the wind turbine.
P, P(x)            Wind power value measured from the SCADA system when the speed equals V, V(x).
PA, PT             Approximate and accurate wind power value when the speed equals V.
d(x, y)            Distance between objects x and y, i.e., the degree of similarity between x and y.
ω                  Weight used in the weighted distance.
T                  Tuning parameter of the weighted distance.
Nk(x)              k-distance neighborhood of object x.
lrd(x)             Local reachability density of object x.
LOF(x)             Local outlier factor of object x.
Ncubic, Nlinear    Number of valid data points detected by the cubic and the linear formula weighted distance, respectively.
Ncommon            Number of valid data points detected by both the cubic and the linear formula weighted distances.

Manuscript received April 23, 2014; revised July 02, 2014 and August 11, 2014; accepted September 03, 2014. Date of publication September 29, 2014; date of current version December 12, 2014. This work was supported in part by the National High Technology Research and Development Program of China (2011AA05A112), in part by the National Natural Science Foundation of China under Grant 51190101, in part by the Science and Technology Projects of the State Grid Corporation of China (SGHN0000DKJS1300221), in part by Hunan Electric Power Corporation, and in part by Ningxia Electric Power Corporation. Paper no. TSTE-00173-2014.
The authors are with the State Key Lab of Power Systems, Department of Electrical Engineering, Tsinghua University, Beijing 100084, China (e-mail: zhengl07@mails.tsinghua.edu.cn; huwei@mail.tsinghua.edu.cn; minyong@mail.tsinghua.edu.cn).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TSTE.2014.2355837

I. INTRODUCTION

IN RECENT years, wind energy has become a major energy source. Wind farm power curve monitoring [1]-[3] and wind power prediction [4], [5] constitute the foundation of wind energy integration research. Because precise modeling of the wind resource is difficult and the wind turbine is highly nonlinear, researchers prefer data-mining methods over analytical methods to generate high-level synthetic information (also known as knowledge) from the low-level data collected by real-time data acquisition systems. However, because the acquisition and transmission of wind data rely on sensors located at remote sites exposed to an open, uncontrolled, and even harsh environment, there is a relatively high probability of incorrect data. In addition, unnatural operating states of a wind farm produce unnatural data. For example, wind curtailment for congestion or load-balancing purposes, or wind turbine shutdown because of mechanical faults or maintenance, results in unnatural data: the wind speed is normal, but the wind power output is abnormally low compared with the theoretical value corresponding to that wind speed. Both incorrect and unnatural data degrade the performance of data-based research, as data-mining methods are highly sensitive to data quality. Incorrect and unnatural data should therefore be detected and preprocessed before integration studies.
Different approaches have been proposed to address preprocessing problems for various types of data [6]-[10], including load data, remote terminal unit (RTU) data, geophysical data, fingerprint image data, and photovoltaic data. However, few papers specifically discuss wind data preprocessing. The preprocessing descriptions comprise only a small part of the relevant works.
Schlechtingen and Santos [11] presented a wind data preprocessing method including four steps: 1) validity check; 2) data
scaling; 3) missing data processing; and 4) lag removal. The
validity check involves a data range check that detects data
values exceeding the physical limits. Data scaling normalizes
data with the ratings. Missing data processing involves either

neglecting or approximating the missing values. Lag removal uses the cross-correlation function to identify the lag between input and output, which is useful when dealing with time-series analysis. This method, especially the validity check and missing data processing steps, has been widely used in the published literature [2]-[4].
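The first and last of these four steps are straightforward to prototype. The sketch below is only an illustration of the idea, not the implementation used in [11]: it shows a range-based validity check and a cross-correlation lag estimate in Python, with the physical limits (50 m/s, per-unit power bounds) chosen as assumptions for the example.

```python
import numpy as np

def validity_check(speed, power, v_max=50.0, p_rated=1.0):
    """Flag samples whose values stay within plausible physical limits (assumed bounds)."""
    return (speed >= 0) & (speed <= v_max) & (power >= -0.05) & (power <= 1.05 * p_rated)

def estimate_lag(x, y):
    """Estimate the sample lag between input x and output y via cross-correlation."""
    x = (x - np.mean(x)) / np.std(x)
    y = (y - np.mean(y)) / np.std(y)
    corr = np.correlate(y, x, mode="full")      # index len(x)-1 corresponds to zero lag
    return int(np.argmax(corr)) - (len(x) - 1)

# Example: a synthetic series whose output lags the input by 2 samples.
rng = np.random.default_rng(0)
x = rng.random(500)
y = np.roll(x, 2) + 0.01 * rng.random(500)
print(estimate_lag(x, y))                       # expected lag: 2
```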
However, the method proposed in [11] did not consider
unnatural data. References [12] and [13] discussed unnatural
data by classifying the raw data according to the magnitude of
the wind speed and the wind power output. A neural network
classifier was trained with the wind speed and wind power as
input and the classification result as output. Then, the neural
network was used to classify more data. As a type of supervised learning algorithm, this method can achieve relatively
high accuracy as long as the classification result is accurate,
i.e., the correct class is precisely determined for every single
data point. Liu et al. [12] classified the data points according to
artificial judgment, whereas Ding [13] classified the data points
based on the wind farm operation state records.
In real-world applications, artificial judgment is limited and inconvenient when the database is large, and wind farm operation state records are often unavailable. Thus, these data classification procedures are infeasible or unreliable, which makes supervised learning algorithms difficult to apply. The alternative is to use unsupervised algorithms. We therefore adopted an unsupervised learning approach based on the local outlier factor (LOF)-identifying algorithm introduced by Breunig et al. [14]. The LOF of every data point is computed using a novel measure of the degree of similarity among the individual data points, and invalid data points are then detected by their abnormal outlier factors.
The contribution of this paper is an empirical methodology for raw wind data preprocessing. The only information it requires is the aggregated wind power output of the wind farm collected from the supervisory control and data acquisition (SCADA) system, which is available at the dispatch center, and the corresponding wind speed measured at the wind farm site. The availability of wind farm operation state records or wind turbine fault logs (which are not recorded or stored by most wind farm operators) would help improve the accuracy of the methodology; if these data are unavailable, which is often the case, the methodology proposed in this paper has nonetheless proved adequate. The rest of this paper is organized as follows. Section II studies the properties of raw wind data and notes the possible causes of data errors. Section III proposes the preprocessing methodology, with emphasis on the formulation of the LOF algorithm and the concept of the weighted distance. Section IV discusses the uncertainty management of the proposed algorithm. Section V presents test results, and Section VI concludes the paper.
II. RAW WIND DATA PROPERTIES
All the data used in this paper are collected from a riverside wind farm in the Gansu Province of China, at a temporal resolution of 15 min. The wind farm contains 100 identical direct-driven magnet wind turbines, each rated at 1500 kW. Wind speed represents the value measured at the wind farm meteorological mast (there is only one meteorological mast at the wind farm), and wind power denotes the aggregated generation of the entire wind farm.

TABLE I. RAW DATA CLASSIFICATION

Fig. 1. Raw scatter plot of wind farm output and wind speed.

TABLE II. SAMPLE OF INVALID WIND DATA
Raw wind data can be divided into six categories according to the wind speed and power values, as shown in Table I. Fig. 1 and Table II show some possible examples of invalid data in scatter plot and numerical format, respectively. The data were collected from the period 11/21/2012 12:00 A.M. to 5/19/2013 8:00 A.M. Invalid data primarily include missing data, constant data, exceeding data, irrational data, and unnatural data. Incorrect data might be due to sensor failures or transmission errors. Unnatural data are data with low power output during high wind speed periods, which arise because not all turbines are always online. When the wind speed is higher than the cut-out speed, the wind turbine is forced to shut down by the high-speed protection protocol. Another possible reason is that the grid cannot absorb the excess wind energy, and thus the wind turbine is shut down by dispatch instructions. In addition, Fig. 1 shows a slender, approximately vertical band around the cut-out speed caused by wake effects, i.e., the turbines within a wind farm do not cut out together near the cut-out speed because the wind speed at each turbine differs from the value measured at the mast. The wake-effect data should be categorized as valid because they reflect the natural output fluctuation of the wind farm around the cut-out speed, which is important to system operators. It is, therefore, necessary to distinguish them from the unnatural data.

Fig. 2. Distribution of raw wind data: (a) period from 10/1/2010 to 1/31/2011; (b) period from 2/1/2011 to 5/31/2011; and (c) period from 6/1/2011 to 9/30/2011.

Fig. 3. Structure of the preprocessing system.

TABLE III. CONSTANT DATA PROCESSING ALGORITHM
The constant data, missing data, and exceeding data can easily be detected by the method proposed in [11]. Among the remaining data [the irrational, unnatural, and valid (IUV) data], the unnatural data and the irrational data require more attention. To study the distribution of the different data categories, the raw wind data of a complete calendar year (from 10/1/2010 to 9/30/2011) from the same wind farm have been analyzed. The raw wind data distribution histogram (Fig. 2) indicates that the number of invalid data points (including the irrational and the unnatural data) is much smaller than the number of valid data points, even in winter, when more wind curtailment results in more unnatural data [see Fig. 2(a)]. In addition, Fig. 1 illustrates that the spatial distribution of the valid data is much more compact than that of the unnatural and irrational data: within the areas of the unnatural and the irrational data, the density is considerably lower than the density in the area of the valid data. Both the unnatural and the irrational data can be considered outliers or noise compared to the valid data.

TABLE IV. MISSING DATA PROCESSING ALGORITHM
Therefore, outlier detection, which tries to identify exceptional cases that deviate substantially from the majority patterns
[15], can be used to exclude the unnatural and the irrational
data. Furthermore, from the simplicity point of view, as a type
of unsupervised learning, outlier detection can learn relationships and structure from the attributes of the data themselves
[16], so that the classification step in [12] and [13] is no longer
necessary.
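For readers who want to experiment with this idea before the detailed formulation in Section III, an off-the-shelf density-based detector can be run directly on the speed-power pairs. The sketch below uses scikit-learn's LocalOutlierFactor purely as an illustration: the paper's own method implements LOF with a customized weighted distance (Section III-C) and a MATLAB program, whereas this call uses the default Euclidean metric, and the file name and column layout are assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical input: one row per 15-min sample, columns = [wind_speed, power],
# both already normalized by their ratings as in Section III-A.
data = np.loadtxt("wind_farm_samples.csv", delimiter=",")   # assumed file and layout

lof = LocalOutlierFactor(n_neighbors=300)     # MinPts = 300, the value used in Section IV-B
lof.fit(data)
scores = -lof.negative_outlier_factor_        # LOF values: ~1 for inliers, >1 for outliers

valid = data[scores <= 1.1]                   # 1.1 is the LOF threshold reported in Section IV-B
print(f"kept {len(valid)} of {len(data)} samples")
```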
III. PREPROCESSING METHODOLOGY
A. Wind Data Preprocessing Method
Fig. 3 shows the structure of the proposed preprocessing method. The constant data processing block, the missing data processing block, and the physical range check block can easily be implemented via several if-then judgment statements, as shown in Tables III-V.

TABLE V. EXCEEDING DATA PROCESSING ALGORITHM

Regarding imputation of the invalid data, the major imputation approaches [21] fill or predict the missing values based on the nearby observed values. However, because the invalid data are often consistent over a relatively long time, there are insufficient nearby data to make a smooth imputation, which may only introduce more incorrect data into the database. Moreover, a large amount of data is available, so sufficiently interesting patterns can be obtained from the remaining data, and the patterns lost with the removal of the invalid data are limited. Therefore, no approximation is performed after removing the invalid data.
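Tables III-V are not reproduced here, but the text above describes the three blocks as simple if-then rules. The following sketch is one possible reading of those rules, with the constant-run length and the physical limits chosen as assumptions rather than taken from the tables.

```python
import numpy as np

def flag_invalid(speed, power, v_limit=50.0, p_limit=1.05, const_run=8):
    """Return a boolean mask of samples removed by the three rule-based blocks (assumed thresholds)."""
    speed = np.asarray(speed, dtype=float)
    power = np.asarray(power, dtype=float)

    missing = np.isnan(speed) | np.isnan(power)            # missing data block

    exceeding = (speed < 0) | (speed > v_limit) | \
                (power < -0.05) | (power > p_limit)        # physical range check block

    # Constant data block: the same power reading repeated over many consecutive samples.
    constant = np.zeros_like(missing)
    same = np.concatenate(([False], np.diff(power) == 0))
    run = 0
    for i, s in enumerate(same):
        run = run + 1 if s else 0
        if run >= const_run - 1:
            constant[i - run:i + 1] = True

    return missing | exceeding | constant
```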
Data scaling is performed by applying the following equation:

\bar{x} = x / x_r    (1)

where x is a variable, x_r is the rating, and \bar{x} is the normalized value of the variable. Similarly, the bar notation is used to denote the normalized value of a variable.
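Equation (1) is a one-line operation; the snippet below applies it to this farm. The rated power follows from Section II (100 turbines at 1500 kW), while using the turbine's rated wind speed as the speed rating is an assumption for illustration, since the text only says each variable is divided by "the rating".

```python
P_RATED_MW = 100 * 1.5      # 100 turbines rated at 1500 kW each (Section II)
V_RATED_MS = 12.0           # assumed rated wind speed of the turbine model

def normalize(speed_ms, power_mw):
    """Apply (1): divide each variable by its rating."""
    return speed_ms / V_RATED_MS, power_mw / P_RATED_MW
```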
For the irrational and the unnatural data processing block, the
LOF identifying algorithm is applied.
B. LOF Algorithm

The LOF algorithm was first proposed by Breunig et al. in 2000 [14]. The algorithm tries to assign to each object in the database a degree of being an outlier from a global perspective; this degree is called the LOF of the object. The key difference between the LOF algorithm and other outlier detection methods is that LOF treats being an outlier as a continuous property rather than a binary one. The formal definitions are listed below; a more detailed discussion can be found in [14].

1) Definition 1: k-distance of an object x. For any positive integer k, the k-distance of object x, denoted k-distance(x), is defined as the distance d(x, y) between x and an object y ∈ D such that for at least k objects z ∈ D\{x}, d(x, z) ≤ d(x, y) holds, and for at most k − 1 objects z ∈ D\{x}, d(x, z) < d(x, y) holds.

2) Definition 2: k-distance neighborhood of an object x. Given the k-distance of x, the k-distance neighborhood of x contains every object whose distance from x is not greater than the k-distance, as shown in (2). These objects y are called the k-nearest neighbors of x:

N_k(x) = \{ y \in D \setminus \{x\} \mid d(x, y) \le k\text{-distance}(x) \}.    (2)

3) Definition 3: Reachability distance of an object x with respect to object y. The reachability distance of object x with respect to object y is defined as

\mathrm{reach\text{-}dist}_k(x, y) = \max\{ k\text{-distance}(y), d(x, y) \}.    (3)

4) Definition 4: Local reachability density of an object x. The local reachability density of x is defined as

\mathrm{lrd}_{MinPts}(x) = \frac{|N_{MinPts}(x)|}{\sum_{y \in N_{MinPts}(x)} \mathrm{reach\text{-}dist}_{MinPts}(x, y)}.    (4)

5) Definition 5: LOF of an object x. The LOF of x is defined as

\mathrm{LOF}_{MinPts}(x) = \frac{\sum_{y \in N_{MinPts}(x)} \frac{\mathrm{lrd}_{MinPts}(y)}{\mathrm{lrd}_{MinPts}(x)}}{|N_{MinPts}(x)|}.    (5)

The parameter MinPts defines the concept of density, i.e., it specifies a minimum number of objects in the neighborhood of an object x. For most objects in a cluster, the outlier factor is approximately equal to 1; for outliers, it is larger than 1. In general, an LOF threshold can be defined, determined by trial and error to obtain the best performance; objects with outlier factors greater than this LOF threshold are regarded as outliers.
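The definitions above translate almost directly into code. The sketch below is a compact, brute-force rendering of (2)-(5) for a small data set; it is O(n^2) and is not the authors' MATLAB implementation. The `dist` argument can be any similarity measurement, including the weighted distance introduced in Section III-C.

```python
import numpy as np

def lof_scores(X, k, dist):
    """Brute-force LOF per Definitions 1-5; X is (n, d), dist(a, b) is a distance function."""
    n = len(X)
    D = np.array([[dist(X[i], X[j]) for j in range(n)] for i in range(n)])

    order = np.argsort(D, axis=1)[:, 1:]          # neighbors sorted by distance, self excluded
    k_dist = D[np.arange(n), order[:, k - 1]]     # Definition 1: k-distance(x)
    N = [order[i][D[i, order[i]] <= k_dist[i]] for i in range(n)]   # Definition 2: N_k(x)

    # Definitions 3 and 4: reachability distance and local reachability density.
    lrd = np.empty(n)
    for i in range(n):
        reach = np.maximum(k_dist[N[i]], D[i, N[i]])
        lrd[i] = len(N[i]) / np.sum(reach)

    # Definition 5: LOF is the average lrd ratio over the neighborhood.
    return np.array([np.mean(lrd[N[i]] / lrd[i]) for i in range(n)])
```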
C. Similarity Measurement
Choosing the similarity/distance measurement, or the relationship model used to describe data objects, is critical in LOF [15]. Because the hypothesis space is two-dimensional (2-D), the most commonly used similarity measurement is the Euclidean distance, which also allows proper visualization:

d(x, y) = \sqrt{(V(x) - V(y))^2 + (P(x) - P(y))^2}    (6)

where d(x, y) denotes the distance between objects x and y, and V(x) and P(x) represent the normalized wind speed and power output of object x, respectively.
However, after applying the LOF algorithm with the
Euclidean distance to the data set, the test result indicates that
the Euclidean distance measurement may fail to detect certain
unnatural data (see Section IV) because the Euclidean distance
considers the wind speed dimension and the wind power dimension equally. However, the wind power dimension has greater
impact on the result because, although the unnatural data have
correct wind speed values, their wind power values are unnatural (see Fig. 1). Therefore, the wind power dimension is the
outlier attribute in the unnatural data detection procedure.
Based on this understanding, the weighted distance is introduced to measure the similarity between objects, where the
outlier attribute is assigned a larger weight. To determine a
proper form of the weight, prior knowledge about the wind
power curve should be considered.
Many studies have reported the three-region theoretical wind turbine power curve [17]-[19], as shown in Fig. 4.

Fig. 4. Wind turbine power curve.

The scatter plot of raw wind data in Fig. 1 also shows the same shape characteristics of the power curve. Hence, the points corresponding
to the valid data are distributed near the power curve, whereas
the points corresponding to the unnatural and the irrational data
are far away. Therefore, the weight can be formulated based on
the difference between the measured and the true value of wind
power, as follows.
1) When V_ci ≤ V < V_r:

\omega = \begin{cases} 1, & |P_T - P| \le 0.1 \\ |P_T - P| / 0.1, & |P_T - P| > 0.1 \end{cases}    (7)

2) When V < V_ci or V_r ≤ V < V_co:

\omega = \begin{cases} 1, & |P_T - P| \le 0.05 \\ |P_T - P| / 0.05, & |P_T - P| > 0.05 \end{cases}    (8)

3) When V ≥ V_co:

\omega = 1    (9)

where V_ci, V_r, and V_co represent the cut-in, rated, and cut-out speed of the wind turbine, \omega is the weight, and P_T denotes the normalized true value of wind power.
An object is defined as being close to the power curve if it lies in the [P_T − 0.1, P_T + 0.1] interval when V_ci ≤ V < V_r, and in the [P_T − 0.05, P_T + 0.05] interval when V < V_ci or V ≥ V_r, based on domain experience and past studies. The weights assigned to these objects are 1, identical to the Euclidean distance. Additionally, the weights of the data whose wind speed values are larger than the cut-out speed are also set to 1, so as to retain the natural properties of the wake-effect data. The weight of the remaining data is larger than 1, in proportion to the difference between the measured and the accurate value of the wind power: the farther an object lies from the power curve, the greater its weight, and the more likely it is to be detected as an outlier. Thus, the weighted distance treats proximity to the power curve as an auxiliary indicator of validity, which is achieved by applying the following equation:

d(x, y) = \sqrt{(V(x) - V(y))^2 + \omega^T (P(x) - P(y))^2}    (10)

where the notations are identical to those in (6), \omega is the weight defined in (7)-(9), and T ≥ 0 is a tuning parameter to be determined separately.
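A direct transcription of (7)-(10), as reconstructed above, is sketched below. The per-unit thresholds 0.1 and 0.05 come from the text; how the weight of a pair (x, y) is formed from the two individual objects is not spelled out here, so taking the maximum of the two weights, like all the numeric values in the example, is an illustrative assumption.

```python
import numpy as np

def weight(v, p, p_true, v_ci, v_r, v_co):
    """Weight of one object per (7)-(9); all quantities are normalized (per-unit)."""
    dev = abs(p_true - p)
    if v >= v_co:                                # (9): keep wake-effect data neutral
        return 1.0
    band = 0.1 if v_ci <= v < v_r else 0.05      # (7) vs. (8)
    return 1.0 if dev <= band else dev / band

def weighted_distance(x, y, omega, T):
    """Weighted distance (10); x and y are (speed, power) pairs, omega is the pair's weight."""
    dv = x[0] - y[0]
    dp = x[1] - y[1]
    return np.sqrt(dv**2 + (omega**T) * dp**2)

# T = 0 (or omega = 1) recovers the Euclidean distance of (6).
x, y = (0.55, 0.40), (0.60, 0.05)                # illustrative per-unit samples
w = max(weight(*x, p_true=0.45, v_ci=0.25, v_r=1.0, v_co=2.1),
        weight(*y, p_true=0.55, v_ci=0.25, v_r=1.0, v_co=2.1))   # pairing rule assumed
print(weighted_distance(x, y, w, T=0.7))
```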
However, the accurate wind power curve is ambiguous and impossible to determine exactly. Therefore, an approximate curve has to be used in place of the accurate power curve, i.e., the true value P_T in (7) and (8) is replaced by the approximate value P_A. According to a review of the literature [1], [2], [17]-[20], there are two main types of approximation. The simplest is to represent the whole wind farm by a single equivalent wind turbine and the corresponding approximate power curve. In general, the wind turbines of a wind farm are purchased from the same manufacturer and have the same technical parameters, so the approximate power curve of the equivalent wind turbine can be established by multiplying the power curve model of a single wind turbine by the total number of wind turbines in the farm. The two most commonly and easily used wind turbine power curve models are the cubic model and the linear model, given by (11) and (12), respectively; all their parameters are provided by the manufacturer:

P_A = \begin{cases} 0, & 0 < V < V_{ci} \\ \dfrac{V^3 - V_{ci}^3}{V_r^3 - V_{ci}^3}, & V_{ci} \le V < V_r \\ 1, & V_r \le V < V_{co} \\ 0, & V_{co} \le V \end{cases}    (11)

P_A = \begin{cases} 0, & 0 < V < V_{ci} \\ \dfrac{V - V_{ci}}{V_r - V_{ci}}, & V_{ci} \le V < V_r \\ 1, & V_r \le V < V_{co} \\ 0, & V_{co} \le V \end{cases}    (12)

where P_A denotes the normalized approximate value of wind power. Other notations are identical to those in (8).
More accurate approximations can be found in [1] and [20]. According to the comparative study in [20], those approximations achieve error rates of approximately 1%, whereas the error of the cubic or the linear model is below 8%. However, such approximation methodologies are mainly based on field measurement or historical data, which are not yet available at the preprocessing stage. In other words, the cubic or the linear model is all that is available when dealing with preprocessing problems, especially for a brand-new wind farm.
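Equations (11) and (12) are simple piecewise curves; the sketch below evaluates either model for an array of wind speeds. The cut-in, rated, and cut-out values in the example are generic assumptions, not the parameters of the Gansu wind farm.

```python
import numpy as np

def approx_power(v, v_ci, v_r, v_co, model="cubic"):
    """Normalized approximate power P_A of the equivalent turbine, per (11)/(12)."""
    v = np.asarray(v, dtype=float)
    if model == "cubic":
        ramp = (v**3 - v_ci**3) / (v_r**3 - v_ci**3)
    else:
        ramp = (v - v_ci) / (v_r - v_ci)
    return np.where(v < v_ci, 0.0,
           np.where(v < v_r, ramp,
           np.where(v < v_co, 1.0, 0.0)))

# Example with assumed turbine parameters: cut-in 3 m/s, rated 12 m/s, cut-out 25 m/s.
speeds = np.array([2.0, 6.0, 15.0, 26.0])
print(approx_power(speeds, 3.0, 12.0, 25.0, "cubic"))   # approximately [0., 0.111, 1., 0.]
```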

IV. UNCERTAINTY MANAGEMENT

A. Bias-Variance Tradeoff
Another significant aspect of data-mining methods is the
uncertainty management. When evaluating the performance
of the similarity measurements, the difference between the
approximate and the true values of the wind power should
be considered. By analogy to the uncertainty management of
supervised learning, we denote by variance the amount by
which the detection result would change if we formulated the
weighted distance using a different approximate power curve.
Although different formulas of the power curve model will
result in different detection results, ideally the result should
not vary too much between approximations. If a similarity
measurement has high variance, then small changes in the
approximation can result in large changes in the detection
result, and vice versa. Hence, a similarity measurement with

low variance is more reliable, especially when the accurate power curve is unavailable.
We denote by bias the detection failure introduced by the similarity measurement. For example, the Euclidean distance measurement assumes that the wind power and wind speed dimensions have the same impact on the detection result, whereas the wind power dimension actually has a greater impact for unnatural data. Thus, no matter how much data are given, it is not possible to produce an accurate detection using the Euclidean distance; the Euclidean distance therefore results in high bias in unnatural data detection.
Generally, as the weight increases, the variance increases and the bias decreases. The tuning parameter serves to control the relative impact of the deviations on the detection result, i.e., to trade off variance against bias. When T = 0, the weight has no effect and the accuracy is limited, but no approximate-versus-true deviation is introduced into the procedure. However, as T → ∞, the impact of the weight increases, so that the detection result is increasingly constrained to the shape of the approximate curve, decreasing the bias between the detection results and the approximate power curve and increasing the variance of the model.

Fig. 5. Filtered scatter plot of wind data with the Euclidean distance.

B. Parameter Selection
The method for selecting the MinPts and the LOF_threshold
can be found in [14]. In this paper, MinPts equals 300 and
LOF_threshold is 1.1. Selecting a good value for T is critical.
However, unlike supervised learning, there are no outputs by
which to supervise the learning; hence, the most common performance evaluation methods (such as cross validation) cannot
be used. Because the task is outlier detection in a 2-D space,
the simplest way to evaluate the accuracy of the algorithm is
by visual inspection. Another way is to choose the T value
that results in the lowest bias plus variance value (denoted by
bias + variance). As a general rule, as we increase the value
of T , the bias tends initially to decrease faster than the variance increases. Consequently, the expected bias + variance
declines. However, at some point, increasing the value of T
has little impact on the bias but starts to increase the variance significantly. When this happens, the bias + variance
increases.
Fig. 6. Outlier factor distributions.

According to the definition of bias in Section IV-A, bias measures the detection performance of the algorithm, so the value of bias is defined as ∞ if the algorithm fails to detect all irrational and unnatural data and as 0 otherwise. In the same way, variance is used to assess the differences among the various power curve approximations, and its value is computed by comparing the detection results obtained with the different approximations. In this paper, we use the two models described in (11) and (12), i.e., the cubic model and the linear model. The variance is low if most of the outliers detected using the different approximation formulas coincide, and it is computed by

\mathrm{Variance} = \frac{(N_{cubic} - N_{common}) + (N_{linear} - N_{common})}{N_{common}} \times 100    (13)

where N_cubic and N_linear denote the number of valid data points detected by the cubic and linear formula-weighted distances, respectively, and N_common denotes the number of valid data points detected by both.
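Equation (13) and the bias rule are easy to script once the detection results of the two weighted distances are available as index sets. In this sketch, `run(T, model)` is a hypothetical helper returning the set of indices kept as valid for a given tuning parameter and power curve model, and `known_invalid` stands for the irrational/unnatural points identified by inspection; treating bias as infinite when either model misses any of them is one reading of the rule above.

```python
def variance_metric(valid_cubic, valid_linear):
    """Variance per (13), from the sets of indices kept by the two weighted distances."""
    n_common = len(valid_cubic & valid_linear)
    return (len(valid_cubic) - n_common + len(valid_linear) - n_common) / n_common * 100

def bias_plus_variance(valid_cubic, valid_linear, known_invalid):
    """Bias is infinite if any known irrational/unnatural point survives, 0 otherwise."""
    if (valid_cubic | valid_linear) & known_invalid:
        return float("inf")
    return variance_metric(valid_cubic, valid_linear)

# Hypothetical usage: sweep T and keep the smallest value whose bias + variance is low.
# results = {T: bias_plus_variance(run(T, "cubic"), run(T, "linear"), known_invalid)
#            for T in (0.5, 0.7, 0.8)}
```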

V. TEST RESULTS AND DISCUSSION


A. Test Results
The data described in Section II are used to test the proposed method. The data set contains 18 001 objects, of which 1902 are missing data points, 1694 are constant data points, and 594 are exceeding data points. The remaining 13 811 data points are the input of the LOF algorithm. Fig. 5 shows the filtered scatter plot of the wind data obtained with the Euclidean distance, and Fig. 6 shows the corresponding outlier factor distribution. Fig. 5 indicates that some unnatural data cannot be detected using the Euclidean distance: in certain seasons, especially winter, the grid cannot absorb the excess wind energy during valley-load periods, and wind curtailment occurs so frequently that those unnatural data have high-density neighborhoods and hence small outlier factors.

TABLE VI. RESULTS OF VARIOUS TUNING PARAMETERS

TABLE VII. CONFUSION MATRIX OF THE WEIGHTED DISTANCE ALGORITHM USING THE CUBIC APPROXIMATION MODEL, T = 0.5

TABLE VIII. CONFUSION MATRIX OF THE WEIGHTED DISTANCE ALGORITHM USING THE CUBIC APPROXIMATION MODEL, T = 0.7

TABLE IX. CONFUSION MATRIX OF THE WEIGHTED DISTANCE ALGORITHM USING THE CUBIC APPROXIMATION MODEL, T = 0.8

Fig. 7. Filtered scatter plot of wind data with the weighted distance: (a) cubic approximation model, T = 0.5; (b) linear approximation model, T = 0.5; (c) cubic approximation model, T = 0.7; and (d) linear approximation model, T = 0.7.

We then test the LOF algorithm using the weighted distance. Table VI shows the detection results for various tuning parameter values. The middle two columns indicate the number of objects filtered by the cubic and the linear approximation models, respectively; the fourth column specifies the number of common objects filtered by both models; and the last column shows the bias + variance values, as defined in Section IV-B. When the value of T is less than 0.7, both the cubic and the linear approximation models fail to detect all of the unnatural data. When the value increases above 0.7, both models detect all the unnatural data accurately. As the value increases further, the number of common objects decreases. Fig. 7 shows the filtered scatter plots of the experiments applying the weighted distance with various tuning parameter values; visual inspection verifies the descriptions above.

Good performance of a similarity measurement requires both low variance and low bias. Therefore, we choose 0.7 as the tuning parameter value to ensure detection accuracy and robustness. The value of the tuning parameter may vary with different databases, but the determination procedure will not change much.

The approach is run on a PC with an Intel i5 CPU (3.19-GHz clock) and 2 GB of RAM, and the algorithm is implemented in MATLAB. A single outlier factor computation requires approximately 1 min. Because selecting the value of parameter T requires several trials, the total computation time is approximately 10 min.
B. Performance Validation
All the irrational and the unnatural data are labeled as invalid, as opposed to the valid data. Thus, the proposed algorithm behaves more or less like a binary classifier. Analogous to any binary classifier, the algorithm can make two types of detection errors: 1) it can incorrectly assign an invalid object to the valid category, denoted nondetection, or 2) it can incorrectly assign a valid object to the invalid category, denoted false alarm. It is often of interest to determine which of these two types of errors is being made, and the confusion matrix is a convenient way to display this information. Tables VII-XII show the confusion matrices of the algorithm using the cubic and linear approximation models at various T values.
TABLE X. CONFUSION MATRIX OF THE WEIGHTED DISTANCE ALGORITHM USING THE LINEAR APPROXIMATION MODEL, T = 0.5

TABLE XI. CONFUSION MATRIX OF THE WEIGHTED DISTANCE ALGORITHM USING THE LINEAR APPROXIMATION MODEL, T = 0.7

TABLE XII. CONFUSION MATRIX OF THE WEIGHTED DISTANCE ALGORITHM USING THE LINEAR APPROXIMATION MODEL, T = 0.8

Table VII reveals that when T = 0.5, the weighted distance algorithm using the cubic approximation model detects a total of 3880 data points as invalid. Of these, 3551 are actually invalid and 329 are not, i.e., they are false alarms. Meanwhile, 224 genuinely invalid data points are not detected by the algorithm, i.e., they are nondetection errors. In the same way, we can tell how many false-alarm and nondetection errors are made by each model.
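The counts quoted above for the cubic model at T = 0.5 are enough to reproduce the usual summary metrics. The sketch below does this; the total of 13 811 LOF inputs comes from Section V-A, and the derived true-negative count is an inference from those figures rather than a number read from the tables.

```python
total = 13811                 # objects fed to the LOF algorithm (Section V-A)
detected_invalid = 3880       # flagged invalid by the cubic model, T = 0.5
true_positive = 3551          # genuinely invalid among the flagged points
false_alarm = 329             # valid points flagged invalid
nondetection = 224            # invalid points that slipped through
assert detected_invalid == true_positive + false_alarm

true_negative = total - true_positive - false_alarm - nondetection   # inferred
accuracy = (true_positive + true_negative) / total
print(f"false-alarm rate  : {false_alarm / (false_alarm + true_negative):.4f}")
print(f"nondetection rate : {nondetection / (true_positive + nondetection):.4f}")
print(f"accuracy          : {accuracy:.4f}")    # about 0.96 for this particular setting
```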
As stated in Section III-A, only the valid data detected by the algorithm are to be used in further research. False-alarm errors risk losing wind speed-power patterns, because valid data points are removed as if they were invalid, whereas nondetection errors introduce incorrect information into further research; the nondetection errors are therefore more harmful than the false alarms. When T > 0.7, the nondetection count is 0 for both the cubic and the linear approximation models. As T increases, however, more useful patterns are lost as the amount of detected valid data decreases. Hence, T = 0.7 is the optimal value, and the accuracy of the algorithm is approximately 95.45%. In addition, most false-alarm errors occur around the boundary of the valid data area, especially in the wake-effect data area, because valid data located near the boundary are more dispersed and thus have lower density. This is a common limitation of all density-based outlier detection algorithms.

Future work will aim to improve the detection accuracy for the boundary data, especially the wake-effect data.

The neural network-based method proposed in [12] reports an accuracy of 96.5%. However, as training the neural network is time-consuming, its computation time is much longer than that of the methodology proposed in this paper. Moreover, as stated in Section I, the neural network method is a type of supervised learning algorithm, and its accuracy depends on precise classification of the training set, which is inconvenient to obtain and unavailable in most situations.
VI. CONCLUSION
In this paper, raw wind data properties were analyzed, and invalid data were categorized into five types. A wind data preprocessing methodology was then proposed. Because identifying the unnatural and the irrational data is challenging, this paper treats them as outliers and uses the LOF algorithm to detect and remove them. To incorporate prior knowledge about the wind data, a new similarity measurement was designed and applied in the algorithm. Numerical experiments verified the effectiveness of the algorithm and the similarity measurement, and the performance evaluation of the algorithm was also discussed.
One of the greatest advantages of the proposed methodology
is that it is a type of unsupervised learning algorithm. Therefore,
it can detect and classify the raw data using solely the attributes
of the data themselves. It is easier and more convenient to perform in practice, especially when the operation records are not
available. However, as there is no universal data-mining algorithm that can handle all problems, this methodology has its
limitations. First, the total number of the data points should not
be too small. An empirical minimum value is approximately
1000 data points. Second, if most of the data are invalid, the accuracy cannot be guaranteed: this situation indicates that either the data acquisition and transmission system has broken down or manual interventions are frequent. In short, the wind farm is faulty, and the data acquired from it should not be used for research.
The data preprocessing method proposed in this paper can
be used for many purposes, not only wind-related applications.
The idea of weighted distance can also be used in other outlier
or cluster-detection algorithms to develop individual detection
algorithms dedicated to specific applications.
REFERENCES

[1] M. Ali, I. Ilie, J. V. Milanovic, and G. Chicco, "Wind farm model aggregation using probabilistic clustering," IEEE Trans. Power Syst., vol. 28, no. 1, pp. 309-316, Feb. 2013.
[2] M. Schlechtingen, I. F. Santos, and S. Achiche, "Using data-mining approaches for wind turbine power curve monitoring: A comparative study," IEEE Trans. Sustain. Energy, vol. 4, no. 3, pp. 671-679, Jul. 2013.
[3] S. Kelouwani and K. Agbossou, "Nonlinear model identification of wind turbine with a neural network," IEEE Trans. Energy Convers., vol. 19, no. 3, pp. 607-612, Sep. 2004.
[4] A. Kusiak and Z. J. Zhang, "Short-horizon prediction of wind power: A data-driven approach," IEEE Trans. Energy Convers., vol. 25, no. 4, pp. 1112-1122, Dec. 2010.
[5] A. Kusiak, H. Y. Zheng, and Z. Song, "Short-term prediction of wind farm power: A data mining approach," IEEE Trans. Energy Convers., vol. 24, no. 1, pp. 125-136, Mar. 2009.
[6] K. N. Filho, A. D. P. Lotufo, and C. R. Minussi, "Preprocessing data for short-term load forecasting with a general regression neural network and a moving average filter," in Proc. IEEE PowerTech Conf., Trondheim, Norway, 2011, pp. 1-7.
[7] P. Kumar, V. K. Chandna, and M. S. Thomas, "Intelligent algorithm for preprocessing multiple data at RTU," IEEE Trans. Power Syst., vol. 18, no. 4, pp. 1566-1572, Nov. 2003.
[8] G. Noriega and S. Pasupathy, "Adaptive estimation of noise covariance matrices in real-time preprocessing of geophysical data," IEEE Trans. Geosci. Remote Sens., vol. 35, no. 5, pp. 1146-1159, Sep. 1997.
[9] J. S. Bartunek, M. Nilsson, B. Sallberg, and I. Claesson, "Adaptive fingerprint image enhancement with emphasis on preprocessing of data," IEEE Trans. Image Process., vol. 22, no. 2, pp. 644-656, Feb. 2013.
[10] M. Fan, V. Vittal, G. T. Heydt, and R. Ayyanar, "Preprocessing uncertain photovoltaic data," IEEE Trans. Sustain. Energy, vol. 5, no. 1, pp. 351-352, Jan. 2014.
[11] M. Schlechtingen and I. F. Santos, "Comparative analysis of neural network and regression based condition monitoring approaches for wind turbine fault detection," Mech. Syst. Signal Process., vol. 25, no. 5, pp. 1849-1875, Jul. 2011.
[12] Z. Q. Liu, W. Z. Gao, Y. H. Wan, and E. Muljadi, "Wind power plant prediction by using neural networks," in Proc. IEEE Energy Convers. Congr. Expo., 2012, pp. 3154-3160.
[13] Z. Y. Ding, "Study of short-term prediction of wind power," M.S. thesis, Dept. Elect. Eng., South China Univ. Technol., Guangzhou, China, 2012.
[14] M. Breunig, H. P. Kriegel, R. T. Ng, and J. Sander, "LOF: Identifying density-based local outliers," in Proc. Int. Conf. Manage. Data, 2000, pp. 93-104.
[15] J. W. Han, M. Kamber, and J. Pei, "Outlier detection," in Data Mining: Concepts and Techniques, 3rd ed. San Mateo, CA, USA: Morgan Kaufmann, 2011, pp. 544-549.
[16] G. James, D. Witten, T. Hastie, and R. Tibshirani. (2014, Jan. 21). An Introduction to Statistical Learning with Applications in R (1st ed.) [Online]. Available: http://www.springer.com/series/417
[17] A. Kusiak, H. Y. Zheng, and Z. Song, "On-line monitoring of power curves," Renew. Energy, vol. 34, no. 6, pp. 1487-1493, Jun. 2009.
[18] M. Lydia, A. I. Selvakumar, S. S. Kuma, and G. E. P. Kumar, "Advanced algorithms for wind turbine power curve modeling," IEEE Trans. Sustain. Energy, vol. 4, no. 3, pp. 827-835, Jul. 2013.
[19] D. A. Spera, Wind Turbine Technology: Fundamental Concepts of Wind Turbine Engineering. New York, NY, USA: ASME, 1994.
[20] B. P. Hayes, I. S. Ilie, A. Porpodas, S. Z. Djokic, and G. Chicco, "Equivalent power curve model of a wind farm based on field measurement data," in Proc. IEEE PowerTech Conf., Trondheim, Norway, 2011, pp. 1-7.
[21] B. Efron, "Missing data, imputation, and the bootstrap," J. Amer. Stat. Assoc., vol. 89, no. 426, pp. 463-475, Jun. 1994.

Le Zheng was born in China, in 1989. He received the B.S. degree in electrical engineering from Tsinghua University, Beijing, China, in 2011. He is currently pursuing the Ph.D. degree in electrical engineering at the same university.
His research interests include power system stability and control and large-scale wind energy integration and control.

Wei Hu was born in China, in 1976. He received the B.S. and Ph.D. degrees
in electrical engineering from Tsinghua University, Beijing, China, in 1998 and
2002, respectively.
Currently, he is working as an Associate Professor with the Department
of Electrical Engineering, Tsinghua University. His research interests include
power system modeling and simulation, security analysis, and smart control.
Yong Min was born in China, in 1963. He received the B.S. and Ph.D. degrees
in electrical engineering from Tsinghua University, Beijing, China, in 1984 and
1990, respectively.
He is currently a Professor with the Department of Electrical Engineering,
Tsinghua University. His research interests include power system stability and
control.
Prof. Min is a Fellow of the IET.
