Applied Energy xxx (xxxx) xxx–xxx
Contents lists available at ScienceDirect
Applied Energy
journal homepage: www.elsevier.com/locate/apenergy
Completion of wind turbine data sets for wind integration studies applying
random forests and k-nearest neighbors
Raik Becker a,⁎, Daniela Thrän a,b
a Department of Bioenergy, Helmholtz Centre for Environmental Research GmbH – UFZ, Permoserstraße 15, 04318 Leipzig, Germany
b Bioenergy Systems Department, DBFZ Deutsches Biomasseforschungszentrum gGmbH, Torgauer Str. 116, 04347 Leipzig, Germany
HIGHLIGHTS
• Wind integration studies require complete wind turbine data sets.
• Advanced approaches are proposed for filling gaps in wind turbine data sets.
• Different types of random forests and k-nearest neighbors were tested.
• Random forests are particularly suitable for filling gaps in wind turbine data sets.
• The set of considered predictor variables influences the predictive power notably.
ARTICLE INFO

Keywords:
Wind energy
Wind turbine data
Machine learning
Random forests
Wind power integration

ABSTRACT

The importance of wind power as a renewable and cost-efficient power generation technology is growing globally. The impact of wind power on the existing power system, land use, and others over time has been widely studied. Such wind integration studies, especially when they are designed as retrospective bottom-up studies, rely on detailed wind turbine data, including the geographic locations, hub height, and dates of commission. Given the frequency of gaps present in these data sets, basic concepts have been developed to cope with missing data points. In this paper, multiple advanced algorithms were compared with respect to their ability to complete such data sets. One focus was on the selection of predictor variables to analyze the impact of different completion techniques depending on the specific gaps in the data set. A sample application using a German data set indicated that random forests are particularly well suited to the problem at hand.
⁎ Corresponding author.
E-mail addresses: raik.becker@ufz.de (R. Becker), daniela.thraen@dbfz.de (D. Thrän).
http://dx.doi.org/10.1016/j.apenergy.2017.10.044
Received 31 May 2017; Received in revised form 19 September 2017; Accepted 6 October 2017
0306-2619/ © 2017 Elsevier Ltd. All rights reserved.
Please cite this article as: Becker, R., Applied Energy (2017), http://dx.doi.org/10.1016/j.apenergy.2017.10.044

Nomenclature

CART    classification and regression tree
ctree   conditional inference tree
knn     k-nearest neighbors
mtry    number of randomly selected predictor variables considered at each split
ols     ordinary least squares
R²      coefficient of determination
rf      random forest
rfc     random forest with conditional inference trees
RMSE    root-mean-square error

1. Introduction

The relevance of wind power as a clean source of energy has grown, and continues to grow, in many countries. This development carries with it certain implications for the existing power system, human well-being, and the environment. Since the beginning of this process, many so-called wind integration studies have been performed with varying scopes to investigate the impacts of wind power. One branch of such impact analyses has focused on retrospectively modeling wind power generation over extended periods of time to obtain insights into system behaviors. This approach holds an advantage in the availability of different historical observations, which facilitates the interpretation of results. As such, many wind integration studies find common ground in the conversion of empirical wind speed data into wind power through wind-to-power curves, as a result of the limited availability of wind power generation time series with high spatial and temporal resolution [1].

Besides wind speed data (primarily sourced from reanalysis models), wind turbine data is a crucial input for wind power integration studies. Converting wind speed into wind power through the wind-to-power curve requires the definition of a few essential parameters of wind turbines, namely the rated power, hub height, and geographic location (latitude and longitude). The geographic location and hub height are needed to derive historical wind speed time series at the hub height for a specific wind turbine. These wind speed time series are then further converted into wind power generation by using a wind-to-power curve. This approach was implemented in [2–5], inter alia. Using the specific wind-to-power curve of a wind turbine requires foreknowledge of certain information regarding the manufacturer and turbine type. If this information is not readily available, the rotor diameter can be used to calculate the swept area, which is required to convert wind speed into wind power without knowing the specific power curve. This approach was demonstrated in [4].

Most wind integration studies account for a growing wind farm fleet over the years, an approach which provides insight into system behaviors that are dependent on the installed wind power capacity. Hence, the commission date, or at least the commission year, of a wind turbine is required to analyze the development over time. In addition, there exist several other applications that require detailed wind turbine data sets. The relationship between the aging of wind turbines and decreasing load factors was demonstrated in [6] for wind farms in the UK. Complete information regarding the age of wind turbines could be used to form estimations regarding load factors based on the commission date or commission year. Furthermore, detailed information about wind turbines, including their location, hub height, and rated power, is a prerequisite for performing the type of hot spot analysis conducted in [7]. Another field of application is calculating the capacity value of wind power. Such studies require wind power time series [8], which could be generated from complete wind turbine data sets.

However, complete data sets that contain all of the above parameters are quite rare. Often, such data sets have many gaps or are lacking some parameters entirely. So far, several methods have been applied to ameliorate gaps in data sets for wind integration studies.

In [4], the wind power production in Sweden between 2007 and 2012 was modeled with high spatial and temporal resolution. To this end, wind data from the MERRA reanalysis data set was combined with wind turbine data based on alternative data sources. According to the authors, the main source of error in this process was expected to be the date the wind turbine was put into operation. Missing hub heights were estimated with a second-order polynomial that used the rated power as the explanatory variable. The polynomial was fit with the complete part of the data set.

A wind power generation model for all EU member states as well as Norway, Switzerland, and the remaining countries of the Western Balkans that applies NASA’s MERRA reanalysis data set is described in [2]. This model offers hourly temporal resolution, and the highest spatial resolution is NUTS 2 [9]. The wind turbine data is based on [10]. According to [2], however, this data set includes incorrect entries as well as a significant number of gaps. Entries with missing rated power values, which account for 0.14% of all entries, were dropped. In the case of missing longitude or latitude values, the geographic location was estimated using other indicators if available, including the name of the city. To fill missing entries for the hub height, two approaches were applied: if the manufacturer and type were known, the average hub height of wind turbines of the same specific type was used, and in all other cases, the average hub height in each country was taken as an estimate. Overall, only 39% of all entries included hub heights, stressing the importance of adequate techniques for filling these gaps.

Another Europe-wide study modeling wind power time series using the MERRA reanalysis data set is elaborated in [3]. This study used [10], the same data source as [2]. However, only wind turbines with a rated power greater than 1 MW were considered. According to [3], the hub heights of 62% of wind turbines in this data set were unknown. A regression was applied to estimate these hub heights. The logarithm of the rated power and the commission date were chosen as the explanatory variables. The commission date, which was unknown for 16% of all database entries, was estimated by taking the commission date of turbines with the same rated power within a country.

A Danish study, [5], generated historical and future wind power time series and assigned wind turbines to so-called wind turbine classes based on their rated power. Each class covered a width of 250 kW and was represented by the most common turbine type, including its rated power and hub height.

The process of implementing procedures to complete substantial portions of input data sets has been a common requirement of all studies thus far. Another striking similarity between these studies is the small number of explanatory variables that have been used to estimate missing parameters. Only in [3] were two explanatory variables considered in estimating the hub height; other studies instead selected one or sometimes only utilized the average of a comparison group.

The paper at hand intends to overcome this issue by first considering more explanatory variables and, second, applying advanced techniques including machine learning algorithms to fill data gaps. This raises the question of which parameters should be considered to model the wind-to-power conversion. In [4], the rated power per rotor area was taken into account. Hence, the rotor diameter became a crucial input for their selected model. In [2], the swept area was considered as well. In contrast, manufacturers’ power curves were applied to convert wind speed to wind power in [3], making this approach independent of the rotor diameter. To encompass as many variables as possible and derive generalizable recommendations, the following variables were considered as predictor or response variables¹:

• rated power ($x_p / y_p$),
• hub height ($x_h / y_h$),
• rotor diameter ($x_d / y_d$),
• commission year ($x_y / y_y$).

¹ In the paper at hand, the input variables of a model are denoted predictor variables (x). Other common terms are explanatory variables, feature variables, and independent variables. The output of the proposed model has been named a response variable (y), otherwise defined as an explained variable or dependent variable.

The date of commission could be used in place of the commission year. In that case, it is recommended that studies transform the date of commission into a continuous variable prior to this process. This will allow for the application of the same methods that are discussed in this paper.

In some countries, the choice of a wind turbine is influenced by the geographic location. For example, sites with a lower average wind yield may require wind turbines with higher hub heights. To account for this additional information, this study tested the extent to which considering the latitude ($x_{lat} / y_{lat}$) and longitude ($x_{lon} / y_{lon}$) can influence results.

For validation purposes, a typical wind turbine data set for Germany was investigated in greater detail. This includes, among other considerations, a correlation analysis for understanding which parameters should be taken into account to estimate others. Based on this analysis, several methodologies were compared to each other with the aim of identifying the most suitable approach. As demonstrated in [2], some gaps can be filled when the manufacturer and turbine types are known. For example, certain turbine types were manufactured with only one specific hub height and rotor diameter. Given that completion utilizing such information is straightforward, this information was disregarded to concentrate on cases that do not allow for quick data set completion. In addition, only onshore wind turbines were considered in this study. Thus far, offshore wind farms have been flagship projects for many
companies. This is why offshore wind farms are commonly well-documented on specific project websites.

The remainder of this paper is organized as follows. Section 2 presents and analyzes a common wind turbine data set from a bottom-up wind integration study. Different approaches for completing wind turbine data sets are introduced in Section 3. Section 4 discusses the training and evaluation framework for application of the models. In Section 5, the approaches described in Section 3 are applied to the data set presented in Section 2, and the main results are discussed. The paper is concluded in Section 6.

2. Data

The data set used in this study was based on information provided by the 16 federal states of Germany. In each state, different regional state offices are responsible for publishing detailed wind turbine data. For example, information about wind turbines in Bavaria was provided by [11]. Wind turbines erected in Baden-Württemberg were listed in [12]. In [13], wind turbines located in Lower Saxony were listed.

This combined initial data set consisted of 26,360 entries (onshore wind turbines), which corresponded to 43.19 GW. However, many parameters were undefined, and it is the objective of this paper to propose procedures that can effectively fill these gaps. Therefore, predictive models enabling the modeling of relationships between different variables were trained. Training these models required a complete (such that all relevant parameters were given for each wind turbine) and error-free training data set. To achieve this, erroneous entries needed to be filtered and removed. Once an adequate model is identified and trained, it can be applied to fill gaps across the entire data set. In this paper, the incomplete data set that required filling is referred to as the target data set. The pre-processing of data is described in Section 2.1. Dependencies between variables are analyzed in Section 2.2.

2.1. Pre-processing of data

In the first step, entries containing unspecified geographic locations were removed, amounting to merely 51 wind turbines. These entries were not further considered because the geographic location is a key input for computing the location-dependent wind speed at a given hub height, which is one of the main applications of studies in this area. Furthermore, wind turbines erected before 1990 were disregarded (63 entries) during further analysis.

In a second step, unrealistic, unrepresentative, or clearly erroneous values were filtered. The rated power was limited to values between 250 and 4000 kW to reduce the data to representative wind turbines. A total of 1048 wind turbines had a rated power of less than 250 kW, and 104 turbines exhibited rated powers over 4000 kW. Moreover, wind turbines with a hub height of less than 30 m (98 wind turbines) were deleted from the data set. These are likely to represent demonstration models or old wind turbines that have been placed out of operation without being designated as such and, hence, can be regarded as unrepresentative. One wind turbine had a hub height of more than 200 m and was also removed from the data set. A further 479 wind turbines were deleted because their rotor diameters were shorter than 30 m or longer than 150 m. Naturally, several deleted wind turbines displayed more than one of these traits. The remainder of the data set was split into the training and target data sets.

2.1.1. Target data set

The target data set consisted of incomplete entries whose missing values needed to be replaced by estimates. The rated power was unknown for 1353 wind turbines. No hub height was given for 5106 wind turbines, and 5825 wind turbines exhibited no information on the rotor diameter. The commission year was unknown for 4582 wind turbines. In many cases, more than one parameter was missing from a single wind turbine. Overall, 8772 entries were found to be incomplete because one or more parameters were missing. In the case that all parameters except the geographic location were missing (rated power, commission year, hub height, and rotor diameter), the data entry was removed (340 entries overall). It was assumed that the parameters for these 340 wind turbines cannot be properly estimated based solely on the geographic locations. After this processing was completed, a total of 8432 wind turbines with incomplete information remained. Among those, 3411 wind turbines were missing a single parameter. There were 2968 entries in which two parameters were undefined. Three parameters were not given for 2053 wind turbines.

2.1.2. Training data set

The cleaned data set, excluding the target data set, consisted at this step of 17,476 complete entries. To further cleanse the training data set, each turbine type within a wind farm was considered only once. This avoids unwanted over-representation of certain types of wind turbine that may occur only because they have been chosen for a large wind farm. Moreover, certain estimators, for example the K-nearest neighbor regression, measure the similarity to other data points by computing a distance measure. If the longitudes and latitudes of two turbines are nearly identical, these two parameters would outweigh the others, and the algorithm would pick the neighboring turbine within the same wind farm as the response. This bias can be avoided by choosing only one representative for a wind farm if all turbines have the same characteristics. Unfortunately, information on whether a group of turbines forms a wind farm cannot be obtained directly from the data set. Instead, a minimum distance parameter between wind turbines was introduced. The minimum distance was set to 900 m. By doing so, the final data set contained only wind turbines which were at least 900 m away from each other. Once this process had been completed, 5683 wind turbines remained for which all relevant variables were known. Their geographic locations within Germany are shown in Fig. 1. This figure demonstrates that the wind turbine density is higher in Northern Germany. In addition, hub heights were unknown for all turbines in the federal state of Mecklenburg-West Pomerania (north-east of Germany).

Fig. 1. Geographic locations of all wind turbines in the final training data set.

2.2. Dependency analysis

Investigating the dependencies between variables can facilitate the selection of relevant parameters and models. Fig. 2 depicts the Pearson linear correlation coefficient between all considered variables, their scatter plots (including linear trend lines), and histograms for each variable. The histograms on the main diagonal provide insight into the distribution of each variable. The domain is shown either at the top or the bottom of Fig. 2. The rated power is given in kW, and all lengths are given in meters. The histogram of the rated power indicates that wind
turbines with a rated power of approximately 2 MW were relatively frequent in the training data set. The histogram of the commission year shows that installation numbers were high shortly after the turn of the century and in 2014. In addition, it can be seen that wind farms are concentrated in the north (cf. histogram of latitude). By further investigating the scatter plots and Pearson linear correlation coefficients in Fig. 2, two different dependency groups were identified. The correlation between hub height, commission year, rated power, and rotor diameter appears strong and positive. In contrast, the dependencies between variables describing the geographic location and others are rather weak. For some pairs, the Pearson linear correlation coefficient was, as expected, close to zero. For example, the longitude showed no correlation with the rated power or hub height. However, the hub height and latitude were negatively correlated with a value of −0.35. Though the cause is not obvious at first glance, it can be explained by the geographic conditions in Germany. The average wind yield is the highest in Northern Germany, whereas Southern Germany is primarily characterized by lower average wind speeds and more complex terrain. This is why higher hub heights have been chosen in the south (smaller latitude), leading to a negative correlation. Furthermore, latitude showed a slightly negative correlation with the commission year. This is because the first wind turbines were erected in Northern Germany before construction projects moved further south due to a lack of space in the north. The weak but visible relationship between hub height and latitude is shown in Fig. 3. In Fig. 3, it can be seen that only a few wind turbines were built below the 50th parallel in the early years of Germany’s energy transition, and these predominantly had hub heights below 90 m. In contrast, the majority of wind turbines erected after 2010 were below the 52nd parallel.

Fig. 2. Scatter plots and correlations between rated power (power), commission year (year), hub height (hubheight), rotor diameter (diameter), latitude (lat) and longitude (lon).

Fig. 3. Scatter plot of hub height versus latitude.

The strong positive dependence between variables that represent the technical evolution of wind turbines (hub height, rated power, and rotor diameter) and the commission year should be obvious in all wind turbine data sets. This is depicted in Fig. 4. The hub height and the rated power have both increased over the years in a comparable manner, with the exception of a small number of outliers.

3. Methodology

Hereafter, different approaches that can be applied to completing the presented data set (cf. Section 2) are described. The focus is placed on obvious candidates, ideally those requiring a small number of model parameters while offering high generalizability.

3.1. Multiple linear regression

Multiple linear regression has been used to solve similar problems in [3] and other past research. It was assumed in this study that there are at least two predictor variables present, and as such simple linear
regression was disregarded. Multiple linear regression models are defined by:

$$ y = X\beta + \epsilon, \tag{1} $$

in which $y$ contains the values of the response variable (regressand) and $X$ contains the values of the different predictor variables (regressors). $\epsilon$ is the error term with $\epsilon \sim N(0, 1)$. $\beta$ is the parameter vector, which can be estimated using ordinary least squares by:

$$ \beta = (X^T X)^{-1} X^T y. \tag{2} $$

More information on (multiple) linear regression can be found in [14]. Once the parameter vector has been estimated, the following equation could be used to predict the rated power with the remaining variables as predictors:

$$ \hat{y}_p = \beta_0 + \beta_1 x_h + \beta_2 x_d + \beta_3 x_y. $$

Generally, variables can be transformed prior to the estimation. Because the model remains linear in its parameters after such transformations, the efficient ordinary least squares approach can still be used.

Fig. 4. Scatter plot of hub height versus rated power.

3.2. K-nearest neighbors

KNN is a non-parametric pattern recognition technique that uses the average of the K closest observations in the training set to provide an estimate. KNN can be applied to classification [15,16] and regression problems. A distance metric is used to determine whether an observation is close in the feature space. The Euclidean distance is a common measure. Besides the distance metric, K, the number of neighbors that are considered for estimating the response variable, can be chosen freely. Small values of K can lead to over-fitting, whereas large values of K tend to over-smooth the estimate and often perform worse. One strategy for finding an optimal K value includes comparing different subsets of the training data. Given that distances in the multi-dimensional feature space are compared, data should be normalized beforehand.

KNN has been widely applied to forecasting problems such as wind speed forecasting [17], electricity price forecasting [18], or solar power forecasting [19]. Additional information about KNN can be found in [20].

3.3. Regression trees

Regression trees have their origin in so-called classification and regression tree (CART) algorithms based on [21]. When using these, predictor variables are split into non-overlapping sections, creating multi-dimensional rectangles in the feature space. A simple tree fit to the data presented in Section 2 is shown in Fig. 5. Two predictor variables, namely the hub height and commission year, were used to explain the rated power of wind turbines. In this example, the data is split into four groups. Within each group, the estimate of the response variable is simply given by the mean in that individual group, which is depicted in the terminal nodes. In Fig. 5, one group on the left-hand side consists of all wind turbines with a hub height under 78 m that were built before 2000. The average rated power in this group was 662 kW, which is simultaneously the estimate. This group contained 764 wind turbines.

Fig. 5. Exemplary regression tree, in which hub height and commission year are the predictor variables and rated power is the response variable.

Using this method, a splitting decision is made so that the reduction in the residual sum of squares is maximized [21]. Ideally, the data set will be split into a training set and testing set for this purpose. At the first node, all variables and values are considered. Once this split is made, a further variable and splitting condition is selected. The same variable may be chosen again in consecutive splits. This approach is known as recursive binary splitting.

One main advantage of this approach is that regression trees can be easily interpreted. Furthermore, regression trees can cope with large data sets and different data types. Their small computational requirements can also be seen as a merit. However, they generally provide poor predictive performance and tend to over-fit when the tree becomes too deep.

Several approaches have been found to deal with the disadvantages of CARTs while maintaining most advantages. The over-fitting issue can be limited by defining a maximum tree depth as an abort criterion. Moreover, conditional inference trees based on [22] can be applied. In contrast to regular regression trees, the splitting decision in this method is not defined by a possible reduction in the residual sum of squares. Instead, the basic idea is to test whether a split creates two subsets with statistically different means using statistical tests. The p-values of the test statistics are compared, which enables comparisons between different types of variables in different domains. The p-value is the main tuning parameter of conditional inference trees. However, the maximum tree depth can also be optimized.

In this paper, regression trees are introduced as a foundation for random forests, which are addressed hereafter. In addition, conditional inference trees are implemented as a benchmark. Further details about both tree types are elaborated upon in [20].

3.4. Random forests

The sensitivity of regression trees to the training data can be overcome by using so-called ensemble techniques. Breiman [23] introduced bootstrap aggregation (also known as bagging), which can be applied to regression trees (bagged trees). In this method, m bootstrap samples (random sampling with replacement) are taken and a regression tree is fit to each sample. A prediction is made by averaging across the predictions of all m regression trees. This approach comes along with a
significant performance improvement as compared to simple regression trees [20]. However, most bagged trees are characterized by a similar tree structure. This is called tree correlation, and should be avoided to produce better results. Random forests represent one approach that overcomes these limitations. In energy-related fields, random forests have been used to forecast electricity spot prices [24] or analyze the influential features of regional energy consumption [25], among other applications.

To fit random forests according to [26], bootstrap samples are generated as they would be in the bagging algorithm. However, instead of considering all predictor variables when fitting the tree, only a random subset of predictor variables is considered at each split. The split decision itself can be based on different criteria, as described in Section 3.3. The number of randomly selected predictor variables (known as mtry) is recommended to be one-third of the number of all predictor variables [26]. However, mtry can be varied to find the optimal value in terms of predictive power. In addition, the number of trees must be set.

4. Application

This section covers the training and testing of models (Section 4.1) and the tuning of model parameters (Section 4.2).

4.1. Training and evaluation framework

4.1.1. Data processing

Given that certain approaches (cf. Section 3.2) require normalized data, all variables were centered and scaled. After centering and scaling, the mean becomes zero and the standard deviation is one.

4.1.2. Predictors and response

There exist six variables (cf. Section 2), of which four represent technical parameters (rated power, hub height, rotor diameter, and commission year) and two define the geographic position (longitude and latitude). All technical parameters could be either predictor variables or response variables. To account for the fact that the geographic position may provide additional information (as can be expected in the case of Germany), the latitude and longitude were tested as additional predictor variables. However, the focus remained on technical parameters to achieve a maximum of generalizability for application in other countries.

Table 1 lists all the chosen test set-ups. Differentiating between these set-ups is necessary because more than one parameter can be missing for a single wind turbine. If this is the case, the available number of predictor variables is limited, which inevitably influences model accuracy. In Table 1, the first entry for each response variable (set-up 1) used all three predictor variables for a prediction and was thus applicable only to entries missing a single parameter. For the remaining set-ups, the number of predictor variables was decreased. Set-ups 5, 6, and 7 use only one predictor variable. By comparing these different set-ups, recommendations can be given for data completion strategies when more than one variable is missing.

To assess the impact of taking the geographic location into account as a predictor variable, two different set-up types were added. ab+lat considers the latitude as an additional predictor in all set-ups. For example, p7+lat uses the rotor diameter as well as the latitude to predict the rated power. In contrast, ab+geo integrates both geographic variables, the latitude and longitude.

Table 1
Test set-up, predictor variables and response variables.

Set-up   Predictor variables (Power, Year, Hub height, Rotor diameter)   Response variable
p1       × × ×                                                           Power
p2       × ×                                                             Power
p3       × ×                                                             Power
p4       × ×                                                             Power
p5       ×                                                               Power
p6       ×                                                               Power
p7       ×                                                               Power
y1       × × ×                                                           Commission year
y2       × ×                                                             Commission year
y3       × ×                                                             Commission year
y4       × ×                                                             Commission year
y5       ×                                                               Commission year
y6       ×                                                               Commission year
y7       ×                                                               Commission year
h1       × × ×                                                           Hub height
h2       × ×                                                             Hub height
h3       × ×                                                             Hub height
h4       × ×                                                             Hub height
h5       ×                                                               Hub height
h6       ×                                                               Hub height
h7       ×                                                               Hub height
d1       × × ×                                                           Rotor diameter
d2       × ×                                                             Rotor diameter
d3       × ×                                                             Rotor diameter
d4       × ×                                                             Rotor diameter
d5       ×                                                               Rotor diameter
d6       ×                                                               Rotor diameter
d7       ×                                                               Rotor diameter

4.1.3. Training and testing

To avoid over-fitting and to test how the different approaches generalize when presented with new data, 4-fold cross-validation was applied. Using this method, the data set is split into four equally sized subsets. The model is first fit with three of the subsets concatenated, and the remaining first subset is used to validate the performance. In a next step, the second subset is used for validation, while the remaining subsets are used for fitting the model. This is repeated until the model has been trained and tested four times. This yields four performance measures, e.g. the root-mean-square error (RMSE), which are then averaged to evaluate the performance of the model. To further exclude the influence of randomness, this process is repeated five times. Each model is, therefore, evaluated using 20 different training and testing sets.

This entire process is repeated for the basic set-up (four technical parameters) with 3-fold cross-validation and six repetitions to additionally test for robustness. In this case, the training data set is only two-thirds of the data set while one-third can be used for validation.

4.2. Model tuning

Certain models require parameter selection. For example, KNN requires users to define the number of neighbors to be considered. To identify the best fitting parameter for this study, a range of possibilities was tested and compared using the described 4-fold cross-validation method with five repetitions. Hence, if five different K values were tested for KNN, a KNN model would be fit 100 times (5 · 20). The R package caret [27] was utilized to perform this parameter tuning. The process of determining which parameters were to be tested is described subsequently.

For the K-nearest neighbors approach, the number of neighbors was varied between one and 15. Conditional inference trees have two tuning parameters: first, in this study, the p-value ranged between 0.2 and 0.9 in 0.1-steps; and second, the maximum tree depth was varied between 5 and 25. The number of trees for populating the random forest was set to 1500. The number of randomly chosen predictor variables (known as mtry) was set to either two or three unless there were fewer predictor variables. If there were fewer than three predictors, mtry became one.
6
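The data processing, resampling, and tuning of K described in Sections 4.1.1, 4.1.3, and 4.2 can be illustrated with a minimal sketch. It uses plain Python rather than the tooling of the study. All function names, the synthetic rotor diameter and power values, the use of the sample standard deviation, and scaling once before splitting are assumptions made for illustration only.

```python
import math
import random

def standardize(column):
    """Center to mean 0 and scale to standard deviation 1 (sample std, an assumption)."""
    mean = sum(column) / len(column)
    var = sum((v - mean) ** 2 for v in column) / (len(column) - 1)
    std = math.sqrt(var)
    return [(v - mean) / std for v in column]

def repeated_kfold(n, k=4, repeats=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation, repeated."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for i in range(k):
            train = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train, folds[i]

def knn_predict(train_X, train_y, x, k):
    """Unweighted K-nearest-neighbors regression with Euclidean distance."""
    order = sorted(range(len(train_X)), key=lambda j: math.dist(train_X[j], x))
    nearest = order[:k]
    return sum(train_y[j] for j in nearest) / len(nearest)

def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def tune_k(X, y, k_values=range(1, 16)):
    """Return the K with the lowest mean RMSE over all test folds, plus all scores."""
    splits = list(repeated_kfold(len(X)))  # 4 folds x 5 repetitions = 20 test sets
    scores = {}
    for k in k_values:
        fold_rmse = []
        for train, test in splits:
            tX = [X[j] for j in train]
            ty = [y[j] for j in train]
            preds = [knn_predict(tX, ty, X[j], k) for j in test]
            fold_rmse.append(rmse([y[j] for j in test], preds))
        scores[k] = sum(fold_rmse) / len(fold_rmse)
    return min(scores, key=scores.get), scores

# Synthetic stand-in data (hypothetical): rotor diameter in m, rated power in kW.
rng = random.Random(1)
diam = [rng.uniform(40, 120) for _ in range(80)]
power = [0.25 * d ** 2 + rng.gauss(0, 100) for d in diam]

X = [[z] for z in standardize(diam)]  # one predictor, as in a single-variable set-up
best_k, scores = tune_k(X, power)
print("selected K:", best_k)
```

The same loop structure would apply to any of the set-ups in Table 1 by changing which columns are placed in `X` and which variable is used as the response.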
5. Results

Section 5.1 focuses on the performance of the different models with regard to the selected
set-ups. If not stated otherwise, presented results were obtained using 4-fold cross-validation
with five repetitions. Insight into the ability of each model to complete sparse data sets and
generate meaningful results is given in Section 5.2.

5.1. Quantitative analysis

The following approaches (cf. Section 3) were tested and evaluated: multiple linear regression
(ols), conditional inference trees (ctree), random forests (rf), random forests with conditional
inference trees (rfc), and K-nearest neighbors (knn).
The RMSE values for all set-ups are shown in Table 2. Each column represents a different model,
and the first sub-column displays results that were obtained with one or more of the four
technical predictor variables. +lat indicates that the latitude was added to the set of
predictor variables. If the longitude was added as well, this is denoted by +geo. The best
performing model for each test set-up is written in bold. This enables easy assessment of the
different models in terms of their overall performance depending on the set-up. In addition, the
best performing set of predictor variables for each modeling approach is underlined, which can,
among other applications, be used to evaluate the impact of the latitude and/or longitude in the
set of predictors. These two assessments are described in the following sections.

5.1.1. Model performance and set of predictor variables
With regard to the overall performance, it can be seen that regular random forests (rf)
outperformed all other models. Multiple linear regression (ols) proved to be the worst
performing predictor of all set-ups. knn provided superior results to rfc and ctree more often
than not in the regular set-up (left column). Section 5.1.2 analyzes why this was no longer the
case after the latitude was added to the considerations.
Furthermore, the most relevant predictor variable can be identified by comparing the last three
set-ups for each response variable. The most influential variable when predicting the rated
power was the rotor diameter, and vice versa. In p4, p5, and p6, performance was seen to
decrease significantly when the rotor diameter was removed from the set of predictor variables.
This performance decrease also held true for d5 and d6 when the rated power was disregarded as a
predictor variable for modeling the rotor diameter. This decrease in performance could be
somewhat mitigated by taking the hub height into account. In general, this behavior was
unsurprising given that the correlation between the rated power and the rotor diameter is 0.92
(cf. Fig. 2). Besides this, a strong relationship between the hub height and rotor diameter was
evident in the data as well as observable in Table 2. In the case of the commission year, the
hub height is the least relevant predictor variable. However, it was seen that considering both
the rated power and the rotor diameter would improve results. This was also confirmed by the
data (cf. Fig. 2).
To test whether the obtained results were robust and generalizable, the RMSE was computed for all
4-fold cross-validation samples and their repetitions using the final model parameters, i.e., 20
different testing data sets (cf. Section 4.2). For p1, the standard deviation of the RMSE when
applying knn was 6.95 kW. When rf was used to compute the rated power, the standard deviation of
the RMSE was 6.07 kW. In y4, the standard deviation for knn was 0.082 years, whereas random
forests achieved a lower value of 0.079 years. Furthermore, a comparison between the results
obtained by 4-fold cross-validation with five repetitions and 3-fold cross-validation with six
repetitions was performed to obtain additional insight into the robustness of results. Therefore,
Table 2
RMSE for ols, knn, ctree, rf and rfc.
Set-up  Response variable  ols  ols+lat  ols+geo  knn  knn+lat  knn+geo  ctree  ctree+lat  ctree+geo  rf  rf+lat  rf+geo  rfc  rfc+lat  rfc+geo
p1 Power 297 287 284 168 182 188 175 172 172 146 136 134 163 157 156
p2 308 296 293 162 178 191 169 171 174 153 143 139 174 159 160
p3 297 288 287 173 182 192 174 176 175 170 148 143 208 165 162
p4 440 416 415 302 320 319 324 309 311 292 240 230 334 283 282
p5 475 472 472 470 453 439 470 458 458 470 429 387 470 449 428
p6 530 490 486 350 346 348 358 332 345 350 289 258 362 321 305
p7 308 301 298 189 193 198 191 190 186 188 174 157 191 188 175
y1 Year 3.22 3.22 3.21 2.22 2.41 2.44 2.25 2.27 2.29 2.11 1.96 1.90 2.19 2.15 2.13
y2 3.33 3.32 3.31 2.35 2.54 2.63 2.45 2.40 2.44 2.31 2.13 2.04 2.44 2.30 2.28
y3 3.34 3.32 3.32 2.27 2.46 2.50 2.32 2.33 2.36 2.25 2.07 1.95 2.38 2.22 2.18
y4 3.25 3.25 3.25 2.34 2.51 2.54 2.39 2.42 2.41 2.31 2.10 2.01 2.45 2.27 2.23
y5 3.47 3.46 3.46 2.69 2.78 2.80 2.71 2.66 2.72 2.68 2.44 2.23 2.69 2.61 2.47
y6 3.92 3.83 3.81 3.00 3.02 3.02 3.04 2.90 2.89 2.99 2.61 2.40 3.05 2.82 2.68
y7 3.47 3.42 3.42 2.70 2.71 2.63 2.71 2.66 2.64 2.70 2.52 2.25 2.70 2.62 2.45
h1 Hub height 16.1 14.4 14.4 13.7 11.5 10.6 13.8 11.5 11.3 13.3 9.9 9.0 13.6 10.9 10.5
h2 16.1 14.5 14.5 14.2 11.8 10.9 14.4 11.6 11.4 14.0 10.1 9.2 14.6 11.1 10.7
h3 16.7 14.9 14.8 13.9 11.4 10.6 14.1 11.7 11.4 13.8 10.4 9.2 14.3 11.3 10.7
h4 17.1 15.0 14.9 15.7 12.2 11.3 15.8 12.4 12.1 15.5 10.9 9.7 15.9 11.9 11.3
h5 16.7 15.2 15.1 14.9 12.1 11.2 15.0 12.1 11.8 14.9 11.1 9.7 15.0 12.2 11.0
h6 18.2 15.8 15.7 16.5 13.0 11.7 16.5 13.1 12.5 16.5 12.4 10.4 16.5 13.4 11.7
h7 18.4 17.0 16.9 18.3 15.3 14.1 18.3 15.5 15.1 18.3 14.4 12.3 18.3 15.3 14.0
d1 Rotor diameter 7.8 7.8 7.8 4.1 5.0 5.6 4.9 4.9 4.9 3.6 3.4 3.3 4.3 4.3 4.3
d2 11.6 11.3 11.4 7.1 8.3 8.5 8.1 8.1 8.1 7.1 5.8 5.6 8.4 7.1 7.1
d3 7.9 7.9 7.9 4.1 4.8 5.6 4.5 4.5 4.7 4.1 3.6 3.5 4.9 4.3 4.3
d4 8.3 8.1 8.1 5.7 5.8 6.1 5.7 5.6 5.5 5.7 4.8 4.4 6.1 5.3 5.2
d5 13.6 13.1 13.1 8.8 9.0 9.4 9.1 8.7 8.6 8.8 7.4 6.7 9.1 8.3 7.8
d6 13.3 13.3 13.3 13.0 12.6 12.0 13.0 12.8 12.6 13.0 11.9 10.5 13.0 12.5 11.7
d7 8.6 8.4 8.3 6.2 6.1 6.3 6.2 6.2 6.0 6.2 5.7 4.8 6.2 6.1 5.6
absolute differences in R2 were computed for the 4-fold and 3-fold run.
The largest absolute deviation was 0.005. For knn, ctree, rf, and rfc, the
mean absolute deviation was 0.001 while it was 0.000 for ols. This
indicates that satisfactory results in terms of robustness and general-
izability can be obtained even with relatively small training data sets.
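The robustness checks described above (spread of the RMSE over the test folds, and absolute differences in R2 between two resampling schemes) can be sketched as follows. The definition of R2 as 1 - SS_res / SS_tot is an assumption, since the text does not restate the formula here, and every number in the example is hypothetical rather than taken from the study.

```python
import statistics

def r_squared(y_true, y_pred):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot

def rmse_spread(fold_rmse):
    """Sample standard deviation of the per-fold RMSE values."""
    return statistics.stdev(fold_rmse)

def compare_runs(r2_run_a, r2_run_b):
    """Largest and mean absolute R2 difference between two resampling schemes."""
    diffs = [abs(r2_run_a[s] - r2_run_b[s]) for s in r2_run_a]
    return max(diffs), sum(diffs) / len(diffs)

# Hypothetical values for illustration only (not results from the study).
fold_rmse = [150.0, 142.0, 139.0, 153.0]
r2_4fold = {"a": 0.900, "b": 0.800, "c": 0.700}
r2_3fold = {"a": 0.899, "b": 0.802, "c": 0.700}

largest, mean_diff = compare_runs(r2_4fold, r2_3fold)
print(round(rmse_spread(fold_rmse), 2), round(largest, 3), round(mean_diff, 3))
```

Small values for both the spread and the differences would point in the same direction as the conclusion drawn in the text, namely that the results do not depend strongly on how the data are split.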
5.1.2. Influence of the geographic location

Based on the structure of Table 2, the benefit of accounting for
geographic information is identifiable by reviewing each model hor-
izontally. For example, in case of ordinary least squares (ols), adding
geographic information increased the performance of ols in almost all
cases. Furthermore, increasing the number of predictor variables has
never led to a decrease in performance.
What holds for ols, however, can be generalized across some but not all models. For rf, the
inclusion of additional information (latitude and/or longitude) improved the predictive power in
all cases. In the case of rfc, p2+geo and d2+geo were slightly worse than p2+lat and d2+lat,
respectively. For ols, y1 and y4 were better than y1+lat and y4+lat, whereas y7+lat and d2+lat
displayed a better performance than y7+geo and d2+geo. In general, taking the latitude into
account led to a distinct improvement in the results of these algorithms, whereas accounting for
the longitude led to mostly smaller and often negligible performance gains. Given the
correlations between the latitude and longitude with other variables (cf. Fig. 2), this behavior
is unsurprising. Nevertheless, the actual performance gains were seen to be somewhat dependent
on the available technical parameters. Fig. 6 illustrates this effect by depicting the
coefficient of determination R2 for modeling the rated power and the hub height with rf along the
different set-ups. The red lines show R2 when rated power was the missing parameter. Blue
depicts the R2 values for hub heights. When the rated power was missing, the rotor diameter
became the most important predictor variable. p7 was seen to outperform p4, even though p4 uses
one more predictor variable. Inclusion of the latitude and the longitude, however, was seen to
have little effect as long as the rotor diameter was included. When the rotor diameter was also
missing (p4, p5 and p6), the latitude increased the explanatory power of the set of predictor
variables. This increase was especially pronounced when the hub height was defined (p6) because
the correlation between hub height and latitude is not negligible. Regarding the hub height
results in Fig. 6, it can be seen that the rated power and the rotor diameter have the greatest
impact on the performance of the model. However, considering the latitude in addition results in
an RMSE improvement of approximately 35% over all set-ups (h1-h7). This can be explained by the
correlation between the hub height and longitude seen in Germany (cf. Fig. 2).

Fig. 6. Coefficient of determination for rf for all set-ups with rated power and hub height as response variables.

Fig. 7. Coefficient of determination for knn for all set-ups with rated power and hub height as response variables.
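The size of the hub height improvement can be checked against the rf columns of Table 2. The sketch below assumes that the quoted figure refers to rf and that "over all set-ups" means the mean of the per-set-up relative RMSE reductions for h1 to h7; both are assumptions, as is the helper name. Under these assumptions the mean reduction is about 26% for +lat and about 35% for +geo, so the quoted 35% is matched when both latitude and longitude are included.

```python
# RMSE of rf for the hub height set-ups h1..h7, copied from Table 2.
base = [13.3, 14.0, 13.8, 15.5, 14.9, 16.5, 18.3]      # technical predictors only
with_lat = [9.9, 10.1, 10.4, 10.9, 11.1, 12.4, 14.4]   # +lat
with_geo = [9.0, 9.2, 9.2, 9.7, 9.7, 10.4, 12.3]       # +geo

def mean_relative_reduction(reference, improved):
    """Mean over set-ups of (reference - improved) / reference."""
    gains = [(r - i) / r for r, i in zip(reference, improved)]
    return sum(gains) / len(gains)

lat_gain = mean_relative_reduction(base, with_lat)
geo_gain = mean_relative_reduction(base, with_geo)
print(f"+lat: {lat_gain:.0%}, +geo: {geo_gain:.0%}")
```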
Fig. 7 shows the same information as Fig. 6 for knn. Generally, the
performance of the model, depending on the set-up, is comparable.
However, differences exist when the latitude and/or longitude are
taken into account. p1-p3 and p7 performed worse when geographic
information was added. This can be explained by the characteristics of
K-nearest neighbors (cf. Section 3.2). Every variable enters the distance
computation in the feature space, regardless of its explanatory power.
Consequently, variables with limited explanatory power could lead to
arbitrary deviations due to the noise they added to the data. The
relationship between the latitude and other predictor variables was weak,
but the K-nearest neighbors approach cannot differentiate between weak and
strong dependencies. If most relevant variables are missing from a data
set, as in p4-p6, additional geographic information could increase the
performance.
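A small constructed example shows how a weakly related variable can change the neighbor chosen by unweighted K-nearest neighbors. The four turbines, their standardized values, and the function name are hypothetical, and K is fixed to one to keep the arithmetic easy to follow.

```python
import math

# Hypothetical standardized turbines: (rotor diameter z, latitude z, rated power in kW).
turbines = [
    (-1.0,  1.5,  600.0),
    (-0.9, -1.2,  660.0),
    ( 1.0,  0.1, 3000.0),
    ( 1.1, -1.4, 3050.0),
]

def nearest_power(query, features):
    """1-NN prediction using unweighted Euclidean distance over the chosen feature indices."""
    def dist(t):
        return math.sqrt(sum((t[i] - query[i]) ** 2 for i in features))
    return min(turbines, key=dist)[2]

query = (-0.2, 0.1)  # fairly small rotor, latitude close to that of a large turbine

only_diameter = nearest_power(query, features=[0])    # -> 660.0
with_latitude = nearest_power(query, features=[0, 1]) # -> 3000.0
print(only_diameter, with_latitude)
```

With the rotor diameter alone, the closest turbine has a similar rotor and a plausible rated power. Once the latitude enters the distance with equal weight, a turbine with a much larger rotor becomes the nearest neighbor simply because its latitude matches.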
In summary, the benefit of adding predictor variables depends on the chosen model type, not only
the dependences between these variables. Random forests make a split based on a pre-defined
measure, which corresponds to giving more attention to a variable, whereas the ordinary least
squares method assigns higher parameter values to more influential variables. In contrast,
K-nearest neighbors treats all predictor variables equally. This is why the variable selection
should be implemented carefully when the (unweighted) K-nearest neighbors algorithm is applied.
If K-nearest neighbors is implemented properly, it can outperform even rfc (cf. Table 2).

5.1.3. Tuning parameters
Regarding the selection of optimal tuning parameters, the number of neighbors K varied between
one and 15. For +geo, the maximum was five, which occurred once, and the minimum was two, with 15
occurrences. For +lat, values between two and 15 were selected with 15 being chosen twice in
p5+lat and h7+lat. When the latitude and longitude were excluded from the set of predictor
variables, K varied between one and four with a single exception when it reached 11 in h7+lat.
As expected, low predictive power in the set of variables led to increasing the number of
neighbors considered. For ctree, the maximum tree depth ranged from five to 19 and the average
over all set-ups was 12.65. The p-value was tuned in each set-up; 0.2 was selected a total of 79
times and 0.3 only five times. In the case of rf and rfc, mtry was either two or three unless
there were fewer variables than that available. For rf, one was selected 24 times whereas three
was selected four times. When a sufficient number of variables could be selected (+geo),
mtry = 1 was never chosen. In general, mtry should not be allowed to become as great as the
number of predictor variables to avoid over-fitting and tree correlation. However, if there is
only one predictor variable, mtry is inevitably one. In this case, the bagging algorithm is
applied rather than random forests (cf. Section 3.4). However, according to the results, this
method outperformed all other approaches. The chosen number of mtry for rfc was similar to rf.
When tuning rfc, this case occurred several times. As mtry, two was picked nine times and three
19 times. Yet, the difference in RMSE between an mtry of two or three was on average 0.68% and,
hence, can be judged as negligible.

5.2. Completion of sparse data sets

To identify proper strategies for filling data gaps when more than one parameter was missing, the
R2 can be regarded as a dimensionless performance measure. All values of R2 can be found in
Appendix A. For rf, the average R2 for predicting the rated power over all basic set-ups without
latitude and longitude was 0.875 and 0.881 when the rotor diameter needed to be estimated. For
the commission year, the average R2 was 0.809. The average R2 was 0.676 for the hub height, the
lowest value. When either the latitude or longitude were taken into account, the R2 is higher in
all cases for rf, whereas the performance gain is highest when the hub height is the response
variable. If two parameters are missing, there exist two general strategies for generating a
complete wind turbine data set: first, the two missing parameters can be estimated using two
different models, such as p2 and y2 if the rated power and year of commission are missing; and
second, one of the two could be predicted first and the results could then be used to predict the
second missing parameter, for example applying y2 first, followed by p1. The findings of this
study would indicate the use of two models in parallel is preferable, as the performance measures
are relatively close to each other for the first and second set-ups. When this approach is
applied, it is possible to avoid using an estimate rather than an observation as a predictor
variable. If this is not the intended course of action, the set-up with the higher R2 should be
applied first to improve the predictive power of the second model.
To demonstrate the completion of sparse data sets, the target data set (cf. Section 2.1) in this
study was completed applying the first proposed method and then added to the remainder of the
data. Overall, this data set had 25,908 complete entries, marginally fewer than the original
data set, because entries with completely unknown technical parameters were disregarded. Fig. 8
shows the median of the hub height (in meters) in each NUTS 3 region at the end of 2000 (5833
wind turbines). It can be seen that wind turbines were smaller at that time. In addition, larger
parts of Southern Germany had no wind power installations (white regions). Fig. 9 depicts the
same statistics at the end of 2015. This data set contains 25,901 entries, reflecting the rapid
deployment of wind power in Germany after the start of the millennium. It can be seen in Fig. 9
that wind turbines are now almost everywhere in Germany. Moreover, the hub height median has
increased significantly. It also becomes obvious that the median hub height is higher in the
south, as discussed in Section 2.2. This has two reasons: first, the south was developed later
than the north, at a time at which turbines were generally higher; and second, wind resources are
not as good in the south as in the north, and therefore southern developments require increased
hub heights to make projects profitable.
It should further be noted that the initial data set had no given hub heights in the federal
state of Mecklenburg-West Pomerania (cf. Fig. 1). This gap could be closed applying random
forests, as can be seen in Figs. 8 and 9.

Fig. 8. Median of the hub height (in meters) in each NUTS 3 region at the end of 2000.

Fig. 9. Median of the hub height (in meters) in each NUTS 3 region at the end of 2015.

6. Conclusion

Wind power integration studies have grown in popularity due to the increased relevance of wind
power to many power systems around the world. One branch of wind integration studies includes the
historical modeling of wind power infeed by converting the observed wind speeds, primarily from
reanalysis models, into wind power. This type of study is often designed as a bottom-up study
and, hence, is dependent on reliable wind turbine data. Given that wind turbine data sets are
frequently incomplete, approaches are required to manage this. Thus far, rather simple methods
with a limited number of predictor variables, such as linear regression, have been used. This
paper goes beyond that and presents advanced approaches that integrate more predictor variables,
which can facilitate filling gaps in wind turbine data sets for different applications.
With this aim, several advanced methods were tested to evaluate their applicability to the
problem at hand. The set of tested methods is comprised of multiple linear regression (ols),
K-nearest neighbors (knn), conditional inference trees (ctree), random forests with CARTs (rf),
and random forests with conditional inference trees (rfc). They all have in common that they
require a small number of parameters and offer high
generalizability when applied properly. rf outperformed all other approaches, followed by knn or,
when geographic information was considered, rfc. Given the simplicity of knn and ctree, these
methods produced relatively good results. ols was the worst performing candidate. However, it is
possible performance gains could be achieved by transforming certain variables before applying
ols, an approach which was not further tested in this study but also not suggested by the data.
The application of neural networks could feature as part of further research.
Concerning the choice of predictor variables, it can be concluded that considering additional
variables will generally lead to increased predictive power. Adding latitude data improved the
results produced by most models in the specific case, Germany, investigated in this study.
However, if the model is not capable of independently selecting variables that contribute to
improvements in performance, as in the case of K-nearest neighbors, it would be advisable to
precisely test the relationship between variables before setting up the model. Furthermore, it
was found that the correlation coefficient coincides with performance gains when the two
investigated variables are integrated as predictor and response variables, respectively.
Finally, we advise applying the presented models with caution. Although these methods can
produce reasonable results, their results are still estimates and can under no circumstances
replace correct data initially provided by official sources.
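The parallel completion strategy discussed in Section 5.2, in which each missing parameter is predicted only from observed parameters, can be sketched as follows. The field names, the three training turbines, and the helper functions are hypothetical, and a 1-nearest-neighbor lookup on unscaled values is used purely as a stand-in for the fitted models of the study (in practice the variables would be standardized and a random forest would take its place).

```python
import math

TECH = ("power", "year", "hub_height", "rotor_diameter")

def fit_nearest(train, predictors, response):
    """Stand-in model: return the response of the closest complete training entry."""
    def predict(row):
        best = min(
            train,
            key=lambda t: math.dist([t[p] for p in predictors],
                                    [row[p] for p in predictors]),
        )
        return best[response]
    return predict

def complete_parallel(row, train):
    """Fill every missing field from observed fields only (one model per gap)."""
    observed = [v for v in TECH if row[v] is not None]
    missing = [v for v in TECH if row[v] is None]
    if not observed:
        return None  # no technical parameter known: entry cannot be completed
    filled = dict(row)
    for response in missing:
        model = fit_nearest(train, observed, response)
        filled[response] = model(row)  # predictors are observations, never estimates
    return filled

# Hypothetical complete entries (power in kW, heights and diameters in m).
train = [
    {"power": 600, "year": 1998, "hub_height": 50, "rotor_diameter": 44},
    {"power": 2000, "year": 2008, "hub_height": 100, "rotor_diameter": 82},
    {"power": 3000, "year": 2014, "hub_height": 135, "rotor_diameter": 112},
]
sparse = {"power": None, "year": None, "hub_height": 98, "rotor_diameter": 80}
print(complete_parallel(sparse, train))
```

A sequential variant would instead write the first estimate into the row and treat it as a predictor for the second model, which is the situation the parallel approach avoids.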
Appendix A
See Table A.3.
Table A.3
R2 in % for ols, knn, ctree, rf and rfc.
Set-up  Response variable  ols  ols+lat  ols+geo  knn  knn+lat  knn+geo  ctree  ctree+lat  ctree+geo  rf  rf+lat  rf+geo  rfc  rfc+lat  rfc+geo
p1 Power 85.6 86.6 86.8 95.4 94.6 94.3 95.0 95.2 95.2 96.5 97.0 97.1 95.7 96.0 96.0
p2 84.5 85.7 86.0 95.7 94.9 94.1 95.3 95.2 95.1 96.2 96.6 96.8 95.0 95.9 95.8
p3 85.6 86.4 86.6 95.1 94.6 94.0 95.1 94.9 95.0 95.3 96.4 96.7 93.1 95.6 95.7
p4 68.4 71.7 71.9 85.2 83.6 83.6 82.9 84.5 84.3 86.1 90.6 91.4 82.0 87.0 87.0
p5 63.1 63.7 63.7 64.0 66.5 68.8 64.0 65.8 65.8 64.0 70.0 75.6 64.0 67.1 70.1
p6 54.2 60.8 61.4 80.0 80.6 80.5 79.1 82.0 80.7 80.0 86.4 89.2 78.5 83.3 84.8
p7 84.5 85.2 85.5 94.2 94.0 93.7 94.0 94.1 94.3 94.2 95.1 95.9 94.1 94.3 95.0
y1 Year 68.2 68.2 68.3 84.9 82.5 81.9 84.5 84.1 83.9 86.4 88.2 88.9 85.3 85.8 86.1
y2 65.9 66.1 66.4 83.1 80.4 79.0 81.6 82.3 81.7 83.6 86.1 87.2 81.6 83.7 83.9
y3 65.7 66.2 66.2 84.2 81.5 81.0 83.4 83.4 82.9 84.4 86.8 88.3 82.7 84.9 85.3
y4 67.5 67.4 67.5 83.2 80.9 80.4 82.4 82.1 82.2 83.6 86.5 87.6 81.6 84.2 84.8
y5 63.1 63.2 63.3 77.8 76.5 76.2 77.4 78.3 77.3 77.9 81.6 84.7 77.8 79.1 81.2
y6 52.7 54.9 55.3 72.4 72.3 72.2 71.7 74.2 74.3 72.6 79.0 82.2 71.4 75.7 77.9
y7 63.1 64.2 64.1 77.6 77.4 79.1 77.5 78.2 78.6 77.6 80.5 84.4 77.6 78.9 81.5
h1 Hub height 64.2 71.0 71.3 74.0 82.0 84.6 73.5 81.6 82.5 75.5 86.4 88.7 74.3 83.4 84.6
h2 64.2 70.7 70.9 72.0 80.8 83.6 71.3 81.2 82.0 72.7 85.7 88.2 70.6 82.9 84.2
h3 61.4 69.2 69.4 73.2 82.0 84.6 72.4 81.1 82.1 73.6 85.1 88.2 71.7 82.4 84.2
h4 59.6 68.7 69.1 65.9 79.6 82.6 65.5 78.5 79.6 66.5 83.4 86.8 65.0 80.3 82.2
h5 61.2 68.1 68.2 69.2 79.8 83.0 68.9 79.6 80.8 69.2 83.0 87.0 68.8 79.6 83.2
h6 54.2 65.5 65.9 62.2 76.7 81.3 61.9 76.0 78.3 62.2 78.7 85.0 62.0 75.3 81.0
h7 52.7 59.7 60.1 53.5 67.7 73.1 53.4 66.7 68.4 53.4 71.3 79.1 53.5 67.6 72.7
d1 Rotor diameter 87.2 87.3 87.4 96.5 94.9 93.5 95.0 94.9 95.0 97.2 97.6 97.7 96.1 96.2 96.2
d2 72.0 73.1 73.1 89.5 85.9 84.9 86.3 86.3 86.5 89.5 92.9 93.4 85.3 89.5 89.5
d3 86.9 87.0 87.0 96.5 95.2 93.5 95.8 95.7 95.5 96.5 97.2 97.5 94.9 96.1 96.2
d4 85.6 86.2 86.4 93.2 93.1 92.4 93.3 93.5 93.6 93.3 95.2 96.0 92.3 94.1 94.4
d5 61.2 64.2 64.3 83.8 83.3 82.0 82.5 84.3 84.7 83.9 88.5 90.8 82.5 85.8 87.5
d6 63.1 63.1 63.2 64.6 67.4 70.9 64.6 65.8 67.0 64.6 70.4 77.0 64.6 67.5 71.2
d7 84.5 85.4 85.6 92.0 92.1 91.9 91.9 92.1 92.4 92.0 93.3 95.1 91.9 92.3 93.5
References [6] Staffell I, Green R. How does wind farm performance decline with age? Renew
Energy 2014;66:775–86. http://dx.doi.org/10.1016/j.renene.2013.10.041
<http://www.sciencedirect.com/science/article/pii/S0960148113005727> .
[1] Dowds J, Hines P, Ryan T, Buchanan W, Kirby E, Apt J, et al. A review of large-scale [7] Rauner S, Eichhorn M, Thrn D. The spatial dimension of the power system: in-
wind integration studies. Renew Sustain Energy Rev 2015;49:768–94. vestigating hot spots of smart renewable power provision. Appl Energy
[2] Aparicio IG, Zucker A, Careri F, Monforti F, Huld T, Badger J. EMHIRES dataset. 2016;184:1038–50. http://dx.doi.org/10.1016/j.apenergy.2016.07.031 <http://
Part I: Wind power generation European Meteorological derived HIgh resolution www.sciencedirect.com/science/article/pii/S0306261916309710> .
RES generation time series for present and future scenarios. Tech rep. Joint [8] Keane A, Milligan M, Dent CJ, Hasche B, D’Annunzio C, Dragoon K, et al. Capacity
Research Centre (JRC); 2016. value of wind power. IEEE Trans Power Syst 2011;26(2):564–72. http://dx.doi.org/
[3] Staffell I, Pfenninger S. Using bias-corrected reanalysis to simulate current and fu- 10.1109/TPWRS.2010.2062543.
ture wind power output. Energy 2016;114:1224–39. [9] Eurostat. NUTS – Nomenclature of territorial units for statistics. < http://ec.europa.
[4] Olauson J, Bergkvist M. Modelling the Swedish wind power production using eu/eurostat/web/nuts > .
MERRA reanalysis data. Renew Energy 2015;76:717–25. [10] The Wind Power. The wind power. < http://www.thewindpower.net/ > .
[5] Andresen GB, Søndergaard AA, Greiner M. Validation of danish wind time series [11] Bayerisches Staatsministerium für Wirtschaft und Medien, Energie und
from a new global renewable energy atlas for energy system analysis. Energy Technologie. Energie-Atlas Bayern. < http://geoportal.bayern.de/energieatlas-
2015;93(Part 1):1074–88. karten/?wicket-crypt=Sq-qYtHdSGM&wicket-crypt=ZS6RSNnuWcA&theme=
10
61 > (accessed: 2017-05-31). [19] Long H, Zhang Z, Su Y. Analysis of daily solar power prediction with data-driven
[12] Landesanstalt für Umwelt, Messungen und Naturschutz Baden-Wrttemberg. Daten- approaches. Appl Energy 2014;126:29–37. http://dx.doi.org/10.1016/j.apenergy.
und Kartendienst der LUBW. < http://udo.lubw.baden-wuerttemberg.de/public/ 2014.03.084 <http://www.sciencedirect.com/science/article/pii/
pages/home/welcome.xhtml > (accessed: 2017-05-31). S0306261914003249> .
[13] Niedersächsisches Ministerium für Ernährung, Landwirtschaft und [20] Kuhn M, Johnson K. Applied predictive modeling. New York: Springer-Verlag;
Verbraucherschutz. Energieatlas Niedersachsen. < http://www.energieatlas. 2013.
niedersachsen.de/startseite/daten/da-ten-135073.html > (accessed: 2017-05-31). [21] Breiman L, Friedman J, Olshen R, Stone C. Classification and regression trees. New
[14] James G, Witten D, Hastie T, Tibshirani R. An introduction to statistical learning. York: Chapman and Hall; 1984.
New York: Springer-Verlag; 2013. [22] Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: a conditional in-
[15] Cover T, Hart P. Nearest neighbor pattern classification. IEEE Trans Inf Theor ference framework. J Comput Graph Stat 2006;15(3):651–74.
2006;13(1):21–7. [23] Breiman L. Bagging predictors. Mach Learn 1996;24(2):123–40.
[16] Dudani SA. The distance-weighted k-nearest-neighbor rule. IEEE Trans Syst, Man, [24] Ludwig N, Feuerriegel S, Neumann D. Putting big data analytics to work: feature
Cybernet 1976;SMC-6(4):325–7. http://dx.doi.org/10.1109/TSMC.1976.5408784. selection for forecasting electricity prices using the lasso and random forests. J
[17] Yesilbudak M, Sagiroglu S, Colak I. A new approach to very short term wind speed Decis Syst 2015;24(1):19–36. http://dx.doi.org/10.1080/12460125.2015.994290.
prediction using k-nearest neighbor classification. Energy Convers Manage [25] Ma J, Cheng JC. Identifying the influential features on the regional energy use in-
2013;69:77–86. http://dx.doi.org/10.1016/j.enconman.2013.01.033 <http:// tensity of residential buildings based on random forests. Appl Energy
www.sciencedirect.com/science/article/pii/S0196890413000770> . 2016;183:193–201. http://dx.doi.org/10.1016/j.apenergy.2016.08.096 <http://
[18] Lora AT, Santos JMR, Exposito AG, Ramos JLM, Santos JCR. Electricity market www.sciencedirect.com/science/article/pii/S0306261916311941> .
price forecasting based on weighted nearest neighbors techniques. IEEE Trans [26] Breiman L. Random forests. Mach Learn 2001;45(1):5–32.
Power Syst 2007;22(3):1294–301. http://dx.doi.org/10.1109/TPWRS.2007. [27] M. Kuhn. caret: Classification and regression training. R package version 6.0-76;
901670. 2017. < https://CRAN.R-project.org/package=caret > .
11
