Optimizing Short-Term Photovoltaic Power Forecasting With Advanced Machine Learning Techniques

Chandershekhar Singh¹* and Dr. Akhil Ranjan Garg¹
¹Department of Electrical Engineering, M.B.M. Engineering College,
Jodhpur, 342001, Rajasthan, India.

*Corresponding author(s). E-mail(s): chandershekhar333@gmail.com;

Abstract
This paper delves into enhancing short-term solar energy predictions using
advanced machine learning techniques. Solar power production, affected by
weather, time, and location, demands accurate short-term forecasts. The study
fills a literature gap by thoroughly assessing Support Vector Machine (SVM),
XGBoost, Decision Tree, Random Forest, and AdaBoost models in this context.
Accurate predictions are vital for grid stability, efficient energy management,
and smooth solar power integration. Research goals include comparing model
performance, gauging accuracy and efficiency, providing insights on algorithm
suitability, and contributing to renewable energy forecasting knowledge. The
study employs robust data preprocessing, model training, and evaluation meth-
ods, considering metrics like mean absolute error and root mean square error. The
findings hold practical implications for grid operators, energy traders, and PV
system owners, showcasing the tangible benefits of machine learning in optimiz-
ing solar energy utilization within the broader framework of sustainable energy
systems.

Keywords: Solar PV Forecasting, Machine Learning Models, Short-Term Predictions, Renewable Energy Integration, Grid Stability

1 Introduction
The global energy landscape has undergone remarkable growth in recent years, driven
by an increasing demand for sustainable and renewable energy sources. Solar energy,

harnessed through photovoltaic (PV) systems, stands out as a clean and abundant
source of electricity. However, the inherent variability and intermittency associated
with solar power generation present significant challenges to the efficient and reliable
management of energy grids. To address this challenge, accurate short-term forecasting
of solar PV power has become a pivotal aspect of modern energy systems.
The dynamic nature of solar energy production, contingent upon meteorological
conditions, time of day, and geographical location, necessitates precise short-term fore-
casting. Factors such as cloud cover, atmospheric conditions, and the angle of sunlight
incidence contribute to the fluctuating nature of solar irradiance, directly impacting
the power output of PV systems. Predicting solar power generation in the short term
is critical for grid operators, energy traders, and policymakers to ensure grid stabil-
ity, efficient energy management, and the seamless integration of solar power into the
broader energy mix.
In response to this imperative, the research community has explored various fore-
casting methodologies, with machine learning emerging as a powerful tool in the
realm of solar PV power prediction. Machine learning models offer adaptability to
complex, non-linear relationships within data, making them well-suited for capturing
the intricate patterns inherent in solar power generation. This paper delves into the
optimization of short-term photovoltaic power forecasting through the application of
advanced machine learning techniques.
Significance of Short-Term Forecasting
Short-term forecasting, spanning from a few minutes to a few days, plays a crucial
role in addressing the challenges posed by the variability of solar power generation.
Accurate predictions empower grid operators to proactively manage energy production
and consumption, minimizing the impact of fluctuations on grid stability. Addition-
ally, short-term forecasting facilitates effective decision-making for energy trading,
scheduling maintenance activities, and optimizing the utilization of energy storage
systems.
The significance of short-term forecasting is amplified by the rapid growth of dis-
tributed energy resources in the context of solar PV power. As solar installations
proliferate across residential, commercial, and industrial sectors, the aggregate impact
of distributed generation on the grid becomes more pronounced. Short-term forecast-
ing becomes a linchpin in ensuring the reliable integration of these distributed energy
resources into the broader energy infrastructure.
Existing Approaches and Challenges
Historically, meteorological models and statistical methods have been employed for
solar power forecasting. While these approaches provide valuable insights, they often
fall short in capturing the complex, non-linear relationships inherent in solar power
generation. The advent of machine learning has revolutionized the field, offering the
potential for more accurate and adaptable forecasting models.
Several machine learning algorithms, including Support Vector Machines (SVM),
XGBoost, Decision Tree, Random Forest, and AdaBoost, have demonstrated success
in diverse applications. However, their comparative performance in the context of
short-term PV power forecasting remains an open question.

Research Objectives
This paper seeks to address the gap in existing literature by comprehensively eval-
uating and comparing the performance of SVM, XGBoost, Decision Tree, Random
Forest, and AdaBoost models in the domain of short-term photovoltaic power forecast-
ing. Through a systematic analysis of these advanced machine learning techniques, we
aim to identify the most effective approach for enhancing the accuracy and reliability
of short-term solar PV power predictions.
Overview of the Data Collection and Pre-processing Methodology
Data pre-processing will be a critical step in ensuring the quality and relevance
of the collected data. This includes handling missing data, normalization, and feature
engineering. The subsequent model training and evaluation process will involve utiliz-
ing metrics such as mean absolute error (MAE), root mean square error (RMSE), and
coefficient of determination (R²). The comparative analysis of results will shed light on
the strengths and weaknesses of each algorithm in terms of forecasting accuracy and
computational efficiency.
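The evaluation metrics named above are simple to compute; a minimal sketch in Python (the paper does not state its tooling, so NumPy is an assumption here, and the sample values are invented):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root mean square error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """Coefficient of determination (R^2)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy values for illustration only.
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 310.0])
print(mae(y_true, y_pred))   # 10.0
print(rmse(y_true, y_pred))  # 10.0
print(r2(y_true, y_pred))    # 0.985
```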
Real-World Implications
The findings of this research hold practical implications for grid operators, energy
planners, and PV system owners. Accurate short-term forecasting is essential for effi-
cient energy management and grid stability. The paper will underscore the benefits
of such forecasting in real-world scenarios, emphasizing the role of machine learning
algorithms in shaping the future of sustainable energy systems.

2 Literature Review
The optimization of short-term photovoltaic (PV) power forecasting has been a sub-
ject of substantial exploration, with a myriad of studies contributing to the evolving
landscape. Pioneering this effort, Marquez and Coimbra (2011) [1] introduced statis-
tical techniques, particularly artificial neural networks (ANN), to forecast PV power
output using data sourced from the US National Weather Service, thereby establishing
a foundation for subsequent investigations.
Chen et al. (2011) [2] introduced a groundbreaking hybrid approach, combining
numerical weather prediction (NWP) with ANN for forecasting in China. The study,
with a temporal focus on a 24-hour horizon and metrics such as mean absolute per-
centage error, marked a paradigm shift in the exploration of PV power forecasting.
This hybrid methodology encapsulated the recognition that an integrative modeling
approach could enhance predictive accuracy, a trend echoed in subsequent studies.
Chow et al. (2011) [3] extended the repertoire by implementing a physical model
with a temporal range of 30 seconds to 5 minutes, offering a unique perspective on
capturing the dynamics of PV power generation in San Diego. This exploration into
shorter temporal horizons underscored the need for tailored modeling techniques to
account for the intricacies of rapid fluctuations in PV power output.
Delving further into temporal nuances, Mathiesen and Kleissl (2011) [4] employed
a physical model for forecasting durations ranging from 1 hour to 1 day. Utilizing
hourly data from the SURFRAD network in the USA, their study illuminated the

challenges and opportunities associated with different temporal horizons, emphasizing
the importance of granularity in temporal considerations.
Voyant et al. (2011) [5] contributed to the emerging trend of hybrid models by
integrating time series analysis with ANN for forecasting in France. Using the normal-
ized root mean square error as an evaluation metric and considering diverse factors,
their study exemplified the integration of methodologies to enhance the accuracy of
PV power predictions.
The subsequent studies conducted by Wu and Chee (2011) [6], Capizzi et al. (2012),
Pedro and Coimbra (2012), and Voyant et al. (2012) further enriched the discourse,
employing hybrid and statistical models with diverse evaluation metrics. This collec-
tive body of research underscored the complexity of short-term PV power forecasting
and the imperative for nuanced methodologies to achieve accurate predictions.
The landscape of short-term PV power forecasting continued to evolve with con-
tributions from Chu et al. (2015) [7], who employed statistical techniques, specifically
ANN, to forecast at 5, 10, and 15-minute intervals. Utilizing metrics such as mean bias
error, mean absolute error, root mean square error, standard deviation, skewness, and
kurtosis, their study drew data from the Sempra Generation Copper Mountain Solar
Power Plant in Nevada, adding granularity to the temporal considerations.
Ghayekhloo et al. (2015) [8] introduced a hybrid model for 1-hour forecasting, eval-
uating mean absolute error, relative mean absolute error, and root mean square error.
Their study, conducted in the United States, incorporated factors such as temperature,
wind speed, and wind direction using hourly data from Ames Station.
Akarslan and Hocaoglu (2016) [9] contributed with a hybrid model for 1-hour
forecasting in Turkey, assessing root mean square error and mean bias error. Data
from the Turkish State Meteorological Service enriched their study, providing insights
into the unique challenges posed by the Turkish context.
Sharma et al. (2016) [10] presented a hybrid model encompassing sensor, wavelet,
and ANN components for forecasting at 1-hour and 15-minute intervals in Singapore.
Their evaluation metrics included mean bias error and normalized root mean square
error, utilizing data from the National University of Singapore.
Gala et al. (2016) [11] embraced a hybrid approach integrating NWP (Numerical
Weather Prediction) with machine learning for 3-hour forecasting in Spain. Their
study evaluated mean absolute error, contributing further to the diverse methodologies
employed in short-term PV power forecasting.
In culmination, the collective body of research outlined in this review under-
scores the multifaceted nature of short-term PV power forecasting. The integration
of advanced machine learning techniques, diverse evaluation metrics, and considera-
tions of temporal granularity contribute to an evolving understanding of this critical
aspect in renewable energy research. The need for nuanced methodologies is evident,
highlighting the continual pursuit of accuracy and reliability in short-term PV power
predictions.

3 Data and Methodology
3.1 Data Collection
To conduct the experiment, original data obtained from the Supervisory Control and
Data Acquisition (SCADA) system of a solar photovoltaic power plant situated in
Tamil Nadu, India, owned by the Mahindra Group has been utilized. This dataset
encompasses hourly information spanning a period of two years. The list of data tags
is as shown in Table 1:

Table 1 Details of Dataset

S. No. Tag Name Unit Frequency Time duration Notation


1 Plant Level Active Power kW Hourly 2-years ActivePower
2 Plant Level POA Power W/m² Hourly 2-years SI
3 Plant Level Ambient Temperature °C Hourly 2-years AmbTemp
4 Plant Level WindSpeed m/s Hourly 2-years WindSpeed
5 Plant Level WindDirection ° Hourly 2-years WindDirection

Data Description
1. Plant Level Active Power (kW): This tag represents plant-level data of the
generated active power in kW.
2. Plant Level POA Power (W/m²): This tag represents plant-level data of solar
irradiation in W/m², measured by a pyranometer, as shown in Figure 1.

Fig. 1 A pyranometer, which measures the sun's power.

3. Plant Level Ambient Temperature (°C): This tag represents plant-level data of
ambient temperature, captured as an instantaneous value by the weather station
shown in Figure 2.
4. Plant Level WindSpeed (m/s): This tag represents data collected from the wind
speed sensor installed on the weather station (Figure 2), measured in m/s.
5. Plant Level WindDirection (°): This tag represents data collected from the wind
direction sensor installed on the weather station (Figure 2), measured in degrees,
with 0° corresponding to north.

Fig. 2 This image shows the weather station, which includes wind direction and speed measuring
devices along with temperature sensors.

All data have been cleaned and filtered for daytime only. To clean the data, the
irradiation values were screened with the condition shown in Equation 1:

SolarIrradiation(W/m²) ≥ 100. (1)


The dataset was then used to form a feature matrix and a target vector, as required
for training any machine learning model, with column 1 (active power) designated as
the target variable and columns 2 through 5 (solar irradiation, ambient temperature,
wind speed, and wind direction) as the feature variables. This segmentation ensures a
structured representation of the data, in which the target variable (column 1) assumes
a central role in the subsequent analysis, while columns 2 through 5 collectively serve
as the feature matrix, encapsulating the dataset's key attributes.
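Expressed in code, the daytime filtering and matrix construction described above might look as follows; the column names follow the notation in Table 1, but the rows are invented for illustration:

```python
import pandas as pd

# Toy rows using the column notation from Table 1; values are invented.
df = pd.DataFrame({
    "ActivePower":   [0.0, 1500.0, 3200.0, 0.0],     # kW (target)
    "SI":            [20.0, 450.0, 820.0, 50.0],     # W/m^2
    "AmbTemp":       [24.0, 31.0, 35.0, 26.0],       # deg C
    "WindSpeed":     [1.2, 2.5, 3.1, 0.8],           # m/s
    "WindDirection": [10.0, 180.0, 270.0, 90.0],     # degrees
})

# Daytime filter per Equation 1: keep rows with SI >= 100 W/m^2.
day = df[df["SI"] >= 100]

# Column 1 as the target vector, columns 2-5 as the feature matrix.
y = day["ActivePower"]
X = day[["SI", "AmbTemp", "WindSpeed", "WindDirection"]]
print(len(X))  # 2 daytime rows survive the filter
```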

3.2 Feature Selection


In the pursuit of feature selection for our predictive model, various techniques were sys-
tematically applied to ascertain the significance of individual features in influencing the
outcome. Methods, including Univariate Feature Selection, Recursive Feature Elimi-
nation (RFE), Mutual Information, and Recursive Feature Addition, were employed.
Remarkably, extensive analyses consistently revealed that all four features under
consideration bear equal importance in the predictive performance of the model, as
shown in Table 2.

Table 2 Results of Recursive Feature Elimination (RFE)

S. No. KPI SI AmbTemp WindSpeed WindDirection


1 Selected Features True True True True
2 Feature Ranking 1 1 1 1

The above features were tested against the target column ActivePower using a
random forest regressor only.

Despite the utilization of diverse statistical tests, machine learning algorithms, and
domain-specific insights, no discernible hierarchy among the features emerged. This
intriguing result suggests a unique characteristic of the dataset, wherein comparable
contributions to the predictive capacity of the model are made by each feature. The
uniformity in feature importance underscores the intricate interplay and collective
influence of the selected variables on the target outcome, presenting an interesting
nuance for further exploration and interpretation in the context of our study.
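The RFE result in Table 2 (all four features selected, each with rank 1) can be reproduced in spirit with scikit-learn's `RFE` wrapper around a random forest regressor; this sketch uses synthetic data, since the plant dataset is not public:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Synthetic stand-in for the four weather features and the power target.
X, y = make_regression(n_samples=200, n_features=4, noise=0.1, random_state=0)

# Asking RFE to retain all four features reproduces the shape of Table 2:
# every feature is selected (support True) and shares rank 1.
rfe = RFE(estimator=RandomForestRegressor(n_estimators=50, random_state=0),
          n_features_to_select=4)
rfe.fit(X, y)
print(rfe.support_.tolist())   # [True, True, True, True]
print(rfe.ranking_.tolist())   # [1, 1, 1, 1]
```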

3.3 Data Splitting


In the context of a time series dataset, the process of data splitting requires careful
consideration due to the temporal nature of the observations. Unlike traditional ran-
dom splits, where data points are randomly allocated to training and testing sets, time
series data necessitates a sequential partitioning strategy to preserve the temporal
order.
Typically, an initial portion of the dataset, representing the earlier time periods,
is allocated for training, while the subsequent portion is reserved for testing. This
sequential split ensures that the model is trained on historical data and evaluated on
more recent observations, simulating its performance on future, unseen data.
The split ratio is determined based on the specific characteristics of the dataset and
the desired trade-off between training and testing size. Common approaches include
an 80-20 split, where the majority of data is allocated to training, or a rolling
window approach, where the training set gradually moves forward in time, as shown in
Figure 3.
This meticulous splitting process is essential for assessing the model’s ability to gen-
eralize to new time points, providing a realistic evaluation of its predictive performance
in a time-dependent context.
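A sequential split of this kind takes only a few lines; the sketch below follows the paper's 80:20 ratio, while the helper name and toy arrays are ours:

```python
import numpy as np

def time_series_split(X, y, train_frac=0.8):
    """Sequential split preserving temporal order: the earliest train_frac
    of observations trains the model, the remainder tests it (no shuffling)."""
    cut = int(len(X) * train_frac)
    return X[:cut], X[cut:], y[:cut], y[cut:]

X = np.arange(10).reshape(-1, 1)   # stand-in for hourly feature rows
y = np.arange(10)                  # stand-in for hourly power values
X_train, X_test, y_train, y_test = time_series_split(X, y)
print(len(X_train), len(X_test))   # 8 2
```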

3.4 Model Implementation


The objective of this study is to identify a model capable of predicting power with
enhanced accuracy, approximating the actual values more closely. To realize this objec-
tive, the procedural approach taken is adhered to within the framework of this paper
as explained in Figure 4.
To execute this exercise, the Google Colab platform was employed with a T4 GPU
configuration. For handling the substantial dataset, the computational resources
comprised a Google-provided T4 GPU with 14 GB of RAM, supplemented with 75 GB
of storage, all hosted on the Google Cloud platform. This configuration was chosen
to ensure efficient processing of the sizable data employed in the analysis.

Fig. 3 Time-series data split in 80:20 train-test for subsequent steps
As depicted in the training section of Figure 4, this experiment incorporates a total
of five distinct model types, each with diverse hyper-parameter tuning configurations
aimed at achieving the research objectives. The sections below expound upon the
various techniques employed in this study, providing a comprehensive overview of the
methodologies utilized.

3.4.1 Support Vector Machine (SVM)


In the context of this investigation, the formidable supervised learning algorithm,
Support Vector Machine (SVM), renowned for its efficacy in both classification and
regression tasks, has been employed as a regression model. Specifically, SVM is directed
towards predicting the continuous output of Photovoltaic (PV) power. The fundamen-
tal objective of the algorithm is to ascertain an optimal hyperplane within the feature
space, a hyperplane that, when projected, delineates the data points in a manner con-
ducive to precise predictions[12]. The emphasis lies in the pursuit of accuracy through
the identification of this optimal separation, underscoring SVM’s prowess in modeling
the intricate relationships inherent in the PV power output dataset.
Under the domain of Support Vector Machines (SVM), an array of models was
trained through various combinations of features, hyperparameters, and
transformations. A total of 11 models were generated for a comprehensive exploration
of the parameter space, as described in Table 3. Notably, the first model in this
ensemble, denoted the "base model," was singled out as the benchmark for monitoring
progress. This approach facilitated ongoing assessment by establishing a reference
point against which subsequent models could be evaluated, contributing to a
systematic understanding of the evolving model landscape.

Fig. 4 Flow diagram of modeling setup
The Table 3 presents a detailed overview of the designed Support Vector Machine
(SVM) models and their respective configurations for predicting Power. Each model is
distinguished by specific considerations for the target variable, features incorporated,
and applied transformation functions. The base model serves as the reference point,
focusing solely on the solar irradiance (SI) as the feature without any transformation
(denoted as -NA-).
Model-1 introduces a scaling transformation (SS*) to the SI feature, providing
a standardized representation. Subsequent models progressively integrate additional
features, such as ambient temperature (AmbTemp), wind speed (WindSpeed), and
wind direction (WindDirection). These augmented models, from model-2 to model-5,
maintain the standard scaling transformation.

9
Table 3 Designed SVM models and their Configurations

Model Target Features Considered Transform Functions


base model Power [’SI’] -NA-
model-1 Power [’SI’] SS*
model-2 Power [’SI’, ’AmbTemp’] SS*
model-3 Power [’SI’, ’AmbTemp’, ’WindSpeed’] SS*
model-4 Power [’SI’, ’WindSpeed’] SS*
model-5 Power [’SI’, ’AmbTemp’, ’WindSpeed’, ’WindDirection’] SS*
model-6 Power [’SI’, ’AmbTemp’, ’WindSpeed’, ’WindDirection’] PCA*
model-7 Power [’SI’, ’AmbTemp’, ’WindSpeed’, ’WindDirection’] PT*
model-8 Power [’SI’, ’AmbTemp’] PT*, SS*
model-9 Power [’SI’, ’WindSpeed’] PT*, SS*
model-10 Power [’SI’, ’AmbTemp’, ’WindSpeed’, ’WindDirection’] PT*, SS*

* PCA : Principal Component Analysis, PT: Power Transforms, SS: Standard Scaler

Model-6 applies a Principal Component Analysis (PCA*) transformation to the
features SI, AmbTemp, WindSpeed, and WindDirection, aiming to reduce dimension-
ality while preserving essential information. Model-7 applies a Power Transform
(PT*) to these features, potentially enhancing the model's capability to capture
nonlinear relationships.
Models 8, 9, and 10 combine Power Transforms (PT*) and Standard Scaling (SS*)
across various feature combinations, emphasizing SI, AmbTemp, and WindSpeed. The
application of these transformations provides normalization and power transforma-
tions, contributing to a more robust representation of the features.
All 11 models were trained and assessed against the test dataset, as detailed in
Figure 4. The recording of all three types of errors was executed as part of the
evaluation process. The presentation of the outcomes is provided in Table 4.

Table 4 Various SVM models and their results

Model Error [mse] Error [rmse] Error [nrmse %]


model-1 330158.1 574.6 4.2
base model 28831237.9 5369.5 39.3
model-6 30388230.6 5512.6 40.4
model-2 33980601.9 5829.3 42.7
model-8 34076195.8 5837.5 42.7
model-4 34455848.0 5869.9 43.0
model-9 34866186.2 5904.8 43.2
model-3 36742593.3 6061.6 44.4
model-10 37337630.8 6110.5 44.7
model-5 38384125.4 6195.5 45.4
model-7 38424855.8 6198.8 45.4

In Table 4, it is observed that the best results in the SVM setup are achieved by
model-1, which exclusively utilizes a single feature, Solar Irradiation (SI), as input.
Both the feature and target are subjected to a standard scaler transformation. The
superior performance of model-1 is indicated by the outcomes presented in the table,
affirming the efficacy of employing Solar Irradiation as the sole input feature in the
SVM configuration.
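The paper does not list its SVM hyperparameters; a plausible scikit-learn sketch of model-1 (SVR on the single SI feature, with standard scaling applied to the feature via a pipeline and to the target via `TransformedTargetRegressor`) on synthetic data:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic SI -> power relation standing in for the plant data.
rng = np.random.default_rng(0)
SI = rng.uniform(100, 1000, size=(300, 1))          # W/m^2
power = 10.0 * SI[:, 0] + rng.normal(0, 50, 300)    # kW, roughly linear in SI

# model-1 analogue: SVR with a standard scaler on the feature and
# on the target; kernel and other settings are default assumptions.
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    transformer=StandardScaler(),
)
model.fit(SI, power)
pred = model.predict(np.array([[500.0]]))[0]        # roughly 10 * 500 = 5000
```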

3.4.2 Decision Tree


Decision trees, known for their simplicity and effectiveness in handling both classifi-
cation and regression tasks, have been employed. In this method, the feature space is
partitioned into distinct regions, and predictions are assigned to each of these delin-
eated regions[13]. It is acknowledged that the model is susceptible to over-fitting.
However, it is recognized that the application of ensemble methods holds the potential
to alleviate this inherent challenge.
Within the realm of Decision Trees, an ensemble of models was trained by creating
diverse combinations of features, hyperparameters, and transformations. A total of 11
models were generated to comprehensively explore the parameter space, as detailed in
Table 5. Significantly, the initial model within this ensemble, identified as the ”base
model,” was specifically isolated to serve as a benchmark for continuous progress
monitoring. This methodology enabled ongoing evaluations by establishing a reference
point against which subsequent models could be systematically assessed and measured,
thus contributing to a nuanced understanding of the evolving model landscape.

Table 5 Designed Decision Tree models and their Configurations

Model Target Features Considered Transform Functions


base model Power [’SI’] -NA-
model-1 Power [’SI’] SS*
model-2 Power [’SI’, ’AmbTemp’] SS*
model-3 Power [’SI’, ’AmbTemp’, ’WindSpeed’] SS*
model-4 Power [’SI’, ’WindSpeed’] SS*
model-5 Power [’SI’, ’AmbTemp’, ’WindSpeed’, ’WindDirection’] SS*
model-6 Power [’SI’, ’AmbTemp’, ’WindSpeed’, ’WindDirection’] PCA*
model-7 Power [’SI’, ’AmbTemp’, ’WindSpeed’, ’WindDirection’] PT*
model-8 Power [’SI’, ’AmbTemp’] PT*, SS*
model-9 Power [’SI’, ’WindSpeed’] PT*, SS*
model-10 Power [’SI’, ’AmbTemp’, ’WindSpeed’, ’WindDirection’] PT*, SS*

* PCA : Principal Component Analysis, PT: Power Transforms, SS: Standard Scaler

The training of all 11 models followed the standardized procedure illustrated in
Figure 4, ensuring a consistent and replicable approach across the experimental
framework. Uniform practices were maintained throughout, bolstering the reliability
and validity of the model training procedures. Following the training phase, three
distinct categories of errors were recorded, forming the basis for a comprehensive
evaluation of each model's performance. The outcomes of these evaluations are
presented in Table 6, which summarizes each model's performance against the
established error metrics.

Table 6 Various Decision Tree models and their results

Model Error [mse] Error [rmse] Error [nrmse %]


model-3 331914.174115 576.119930 4.22
model-2 342339.419216 585.097786 4.28
model-8 342339.419216 585.097786 4.28
model-5 343380.814152 585.987043 4.29
model-7 343380.814152 585.987043 4.29
model-4 381059.689892 617.300324 4.52
model-9 382763.060025 618.678479 4.53
base model 387267.209921 622.307970 4.56
model-1 387267.209921 622.307970 4.56
model-6 395374.387464 628.788031 4.60
model-10 398497.116877 631.266280 4.62

In Table 6, the optimal results in the decision tree setup are attained by model-3,
which exclusively employs the features Solar Irradiation (SI), AmbTemp, and
WindSpeed as inputs. Both the features and target undergo a standard scaler
transformation. The superior performance of model-3 is underscored by the outcomes
in the table, affirming the effectiveness of this collective input feature group in the
decision tree configuration.
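A corresponding sketch of model-3 (a decision tree regressor over SI, AmbTemp, and WindSpeed with standard scaling) on synthetic stand-in data; the depth cap and the data are our assumptions, not the paper's settings:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
# Columns: SI (W/m^2), AmbTemp (deg C), WindSpeed (m/s) -- synthetic.
X = rng.uniform([100.0, 20.0, 0.0], [1000.0, 40.0, 10.0], size=(400, 3))
y = 10.0 * X[:, 0] + rng.normal(0, 30, 400)   # power dominated by SI

# model-3 analogue: standard scaling plus a depth-limited tree
# (the depth cap is an assumption, used to curb overfitting).
tree = make_pipeline(StandardScaler(),
                     DecisionTreeRegressor(max_depth=6, random_state=1))
tree.fit(X, y)
```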
The culmination of model evaluation reveals that the best-performing model is
characterized by the inclusion of three features, namely SI (Solar Irradiance), Amb
Temp (Ambient Temperature), and WindSpeed. The individual feature importance for
the final model is depicted in Table 7, shedding light on the distinctive contributions
of each feature to the model’s overall performance. This outcome underscores the
significance of these selected features in the optimal configuration of the model, thereby
contributing to a comprehensive understanding of the underlying dynamics within the
context of the study.

Table 7 Decision Tree Model-3 Feature Importance in Percentage

Feature Importance STD Importance of Feature (%)


SI 0.026153 98.920000
AmbTemp 0.000168 0.630000
WindSpeed 0.000118 0.440000
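The mean/STD structure of Table 7 is consistent with permutation importance; a sketch of how such values can be obtained with scikit-learn (on synthetic data, so the numbers will not match the table):

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
# Synthetic SI, AmbTemp, WindSpeed with power driven mostly by SI.
X = rng.uniform([100.0, 20.0, 0.0], [1000.0, 40.0, 10.0], size=(400, 3))
y = 10.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 30, 400)

tree = DecisionTreeRegressor(max_depth=6, random_state=2).fit(X, y)
result = permutation_importance(tree, X, y, n_repeats=10, random_state=2)

# importances_mean / importances_std mirror the columns of Table 7;
# expressing the means as percentages gives the last column.
share = 100.0 * result.importances_mean / result.importances_mean.sum()
```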

3.4.3 Random Forest


The ensemble method employed in this study involves the utilization of Random For-
est, which consists of multiple decision trees whose outputs are amalgamated through

a process of voting or averaging [14]. By employing this approach, the risk of overfitting
is mitigated, and there is an augmentation in the accuracy of predictions.
In the realm of Random Forest Trees, an ensemble of models was meticulously
trained by creating diverse combinations of features, hyperparameters, and transfor-
mations. A total of 11 Random Forest models were generated to thoroughly explore the
parameter space, as elucidated in Table 8. Notably, the initial model within this ensem-
ble, denoted as the ”base model,” was strategically isolated to serve as a benchmark
for continuous progress monitoring. This approach allowed for ongoing evaluations by
establishing a reference point against which subsequent models could be systemati-
cally assessed and measured. The configuration details of each Random Forest model
are outlined in the accompanying table, specifying the target variable, features con-
sidered, and the application of various transform functions such as Standard Scaler
(SS), Principal Component Analysis (PCA), and Power Transforms (PT).

Table 8 Designed Random Forest models and their Configurations

Model Target Features Considered Transform Functions


base model Power [’SI’] -NA-
model-1 Power [’SI’] SS*
model-2 Power [’SI’, ’AmbTemp’] SS*
model-3 Power [’SI’, ’AmbTemp’, ’WindSpeed’] SS*
model-4 Power [’SI’, ’WindSpeed’] SS*
model-5 Power [’SI’, ’AmbTemp’, ’WindSpeed’, ’WindDirection’] SS*
model-6 Power [’SI’, ’AmbTemp’, ’WindSpeed’, ’WindDirection’] PCA*
model-7 Power [’SI’, ’AmbTemp’, ’WindSpeed’, ’WindDirection’] PT*
model-8 Power [’SI’, ’AmbTemp’] PT*, SS*
model-9 Power [’SI’, ’WindSpeed’] PT*, SS*
model-10 Power [’SI’, ’AmbTemp’, ’WindSpeed’, ’WindDirection’] PT*, SS*

* PCA : Principal Component Analysis, PT: Power Transforms, SS: Standard Scaler

The training methodology for all 11 Random Forest models strictly adhered to the
standardized procedures outlined in Figure 4. Following training, documentation
captured three error categories, forming the basis for a comprehensive evaluation
of each model's performance. Detailed outcomes are presented in Table 9, providing
an exhaustive overview of model performance based on the established error metrics.
Upon meticulous examination of Table 9, it becomes evident that Model-1, rather
than the Base Model, secures the most favorable outcomes within the Random For-
est framework. Model-1 stands out by exclusively utilizing a singular feature, Solar
Irradiation (SI), as its input. Both the feature and target undergo a standard scaler
transformation. The superior performance of Model-1 is underscored by the presented
outcomes in the table, where it achieves a commendable Normalized Root Mean
Square Error (NRMSE) of 4.17%. This quantifiable metric substantiates the efficacy of
employing Solar Irradiation as the sole input feature in the Random Forest configura-
tion. The precision of 4.17% accentuates the model’s accuracy in making predictions,
providing a nuanced quantitative insight into its performance.
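A sketch of the Random Forest analogue of model-1, including the nRMSE computation reported in Table 9; the paper does not state its normalization constant, so normalization by the test-target range is an assumption here, as is the synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
SI = rng.uniform(100, 1000, size=(400, 1))          # W/m^2, synthetic
power = 10.0 * SI[:, 0] + rng.normal(0, 40, 400)    # kW, synthetic

# Sequential 80/20 split as in Section 3.3, then the model-1 analogue.
cut = int(0.8 * len(SI))
forest = RandomForestRegressor(n_estimators=100, random_state=3)
forest.fit(SI[:cut], power[:cut])

pred = forest.predict(SI[cut:])
rmse = np.sqrt(np.mean((power[cut:] - pred) ** 2))
# nRMSE normalized by the test-target range (one common convention;
# the paper does not say which normalization it uses).
nrmse = 100.0 * rmse / (power[cut:].max() - power[cut:].min())
```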
In Figure 5, the formation of the final tree for the Random Forest model is depicted,
illustrating details up to a depth of 3. Although the illustration is limited in depth, a

Table 9 Various Random Forest models and their results

Model Error [mse] Error [rmse] Error [nrmse %]


base model 324993.145901 570.081701 4.17
model-1 324993.145901 570.081701 4.17
model-5 1091662.744617 1044.826658 7.65
model-7 1091569.376392 1044.781976 7.65
model-2 1132567.713329 1064.221647 7.79
model-8 1132567.713329 1064.221647 7.79
model-6 1358936.590430 1165.734357 8.53
model-4 1389688.239228 1178.850389 8.63
model-9 1389698.680418 1178.854817 8.63
model-3 3119061.661151 1766.086538 12.93
model-10 3318577.014896 1821.696192 13.34

Fig. 5 Model-1 Tree of Random Forest

clear depiction emerges of how the model derives decisions based on crucial values of
Solar Irradiance (SI). The visualization provides valuable insights into the foundational
principles governing decision-making within the model, emphasizing the pivotal role
of SI in influencing the predictive outcomes.
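The Model-1 configuration described above can be sketched as follows. The data here is synthetic and the relation between irradiation and power is an illustrative assumption, not the paper's dataset; only the structure (single SI feature, standard-scaled feature and target, Random Forest regressor, nRMSE on the range) follows the text.

```python
# Sketch of the Model-1 setup: Random Forest regression on a single
# feature (solar irradiation), with feature and target standard-scaled.
# Synthetic data; the linear relation and noise level are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
si = rng.uniform(0, 1000, size=(500, 1))            # solar irradiation, W/m^2
power = 5.0 * si.ravel() + rng.normal(0, 50, 500)   # synthetic PV power

x_scaler, y_scaler = StandardScaler(), StandardScaler()
X = x_scaler.fit_transform(si)
y = y_scaler.fit_transform(power.reshape(-1, 1)).ravel()

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# nRMSE: RMSE normalised by the range of the (unscaled) target
pred = y_scaler.inverse_transform(model.predict(X_te).reshape(-1, 1)).ravel()
true = y_scaler.inverse_transform(y_te.reshape(-1, 1)).ravel()
rmse = np.sqrt(np.mean((pred - true) ** 2))
nrmse = 100 * rmse / (true.max() - true.min())
print(f"nRMSE: {nrmse:.2f}%")
```

Scaling the target as well as the feature, as the paper describes, requires inverting the transform before computing errors so the nRMSE refers to the original power scale.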

3.4.4 AdaBoost
AdaBoost is an ensemble method that trains its weak learners, typically decision
trees, sequentially, with each new learner emphasizing the samples misclassified by
its predecessors. By assigning increased weight to previously misclassified
instances, a strong learner is assembled from the contributions of many weak
learners [15].
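The sequential reweighting described above can be sketched with scikit-learn's AdaBoostRegressor, whose default weak learner is a depth-3 decision tree. The data and settings here are illustrative assumptions, not the paper's configuration; the point is only to show the boosted ensemble outperforming a single weak learner.

```python
# Minimal AdaBoost sketch: a single shallow tree (the "weak learner")
# versus a boosted ensemble of such trees. Synthetic regression data.
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 400)

weak = DecisionTreeRegressor(max_depth=2).fit(X, y)
# Default base learner is a depth-3 decision tree; 50 boosting rounds
boosted = AdaBoostRegressor(n_estimators=50, random_state=0).fit(X, y)

mse = lambda m: np.mean((m.predict(X) - y) ** 2)
print(f"weak tree MSE: {mse(weak):.4f}")
print(f"AdaBoost MSE:  {mse(boosted):.4f}")
```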
The AdaBoost experiments, rooted in the principles of the Decision Tree, used the
identical set of 11 models detailed in Table 5 and were trained across the diverse
configurations outlined there. The primary focus was the normalized root mean square
error (nRMSE), the central metric for gauging model accuracy. The corresponding
results are given in Table 10, providing a comprehensive view of how performance
varies across the training setups and of the AdaBoost method's adaptability under
varied conditions.

Table 10 Various AdaBoost models and their results

Model        Error [mse]       Error [rmse]   Error [nrmse %]
model-5      310912.662625     557.595429     4.08
model-3      314427.933661     560.738739     4.11
model-7      316674.959269     562.738802     4.12
model-8      320649.752842     566.259440     4.15
model-2      323460.425710     568.735814     4.16
model-10     342129.281076     584.918183     4.28
model-4      357771.773673     598.140263     4.38
model-9      360525.201844     600.437509     4.40
model-6      364000.143293     603.324244     4.42
model-1      388972.961488     623.676969     4.57
base model   392884.062031     626.804644     4.59

The AdaBoost results in Table 10 show model-5 as the superior performer, with a
normalized root mean square error (nRMSE) of 4.08%. This model takes the full set of
four features as inputs, namely Solar Irradiance (SI), Ambient Temperature (Amb
Temp), Wind Speed, and Wind Direction, each passed through a Standard Scaler
transformation. This result indicates that, in this context, AdaBoost benefits from
the complete feature set and captures the nuances of the provided features.

3.4.5 XGBoost
XGBoost, renowned for its robustness and high predictive accuracy, is an ensemble
learning algorithm that combines decision trees with gradient boosting [16], and it
has earned acclaim in numerous machine learning competitions. Within the XGBoost
framework, decision trees are constructed iteratively, each trained to minimize a
predetermined loss function, and the combined outputs of these refined trees form
the final prediction.
To discern the essential features for the XGBoost models, an initial experiment used
the p-value as a diagnostic of feature significance, as illustrated in Figure 6.
Since XGBoost is an enhanced ensemble model, p-values were deployed to reveal the
features crucial to model efficacy. The experiment was structured to identify and
select pertinent features for subsequent XGBoost training; the results are given in
Table 11. This approach sought to improve both the interpretability and the
predictive capability of the XGBoost ensemble model within the studied context.
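A minimal sketch of this p-value screening step is given below. The paper does not name the specific significance test, so per-feature Pearson correlation p-values (scipy) serve as a stand-in here; the data, feature names' values, and the 0.05 threshold are illustrative assumptions.

```python
# p-value-based feature screening: keep features whose correlation with
# the target is statistically significant. Synthetic data; the test
# choice and threshold are assumptions, not the paper's exact method.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n = 300
features = {
    "SI": rng.uniform(0, 1000, n),
    "AmbTemp": rng.uniform(10, 40, n),
    "WindSpeed": rng.uniform(0, 15, n),
    "WindDirection": rng.uniform(0, 360, n),
}
# Synthetic target driven mainly by irradiance and temperature
power = 5 * features["SI"] + 20 * features["AmbTemp"] + rng.normal(0, 100, n)

ALPHA = 0.05
selected = [name for name, col in features.items()
            if pearsonr(col, power)[1] < ALPHA]
print("selected features:", selected)
```

Features passing the threshold would then be carried into the XGBoost training stage, mirroring the selection flow of Figure 6.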

Fig. 6 Flow diagram of XGBoost modeling setup

Table 11 p-values of all four features

Particular   SI    AmbTemp   WindSpeed   WindDirection
p-value      0.0   0.0       0.001       0.0

In the training regimen of the XGBoost model, the hyper-parameter of chief
importance is n_estimators. Accordingly, an exhaustive investigation was conducted
in which 25 distinct n_estimators values were used to train an equal number of
individual models. The outcomes are presented in Figure 7 and Figure 8, showing the
results for the various n_estimators configurations. This systematic analysis gives
a nuanced picture of model performance across the range of n_estimators values and
informs the selection of an optimal configuration for subsequent experimentation and
model deployment.
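The n_estimators sweep can be sketched as below. scikit-learn's GradientBoostingRegressor stands in for XGBoost to keep the example dependency-free; the hyper-parameter plays the same role, but the data, the sweep range, and the resulting scores are illustrative only.

```python
# Sweep n_estimators over 25 settings and pick the one with the lowest
# nRMSE on a held-out set. Gradient boosting stands in for XGBoost;
# synthetic data, illustrative numbers.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.uniform(0, 1, size=(600, 4))                 # four features, as in the paper
y = 3 * X[:, 0] + X[:, 1] + rng.normal(0, 0.1, 600)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for n in range(1, 26):                               # 25 candidate settings
    model = GradientBoostingRegressor(n_estimators=n, random_state=0)
    model.fit(X_tr, y_tr)
    rmse = np.sqrt(np.mean((model.predict(X_te) - y_te) ** 2))
    results[n] = 100 * rmse / (y_te.max() - y_te.min())   # nRMSE %

best_n = min(results, key=results.get)
print(f"best n_estimators: {best_n} (nRMSE {results[best_n]:.2f}%)")
```

Plotting `results` against `n` would reproduce the shape of Figures 7 and 8: error falling steeply at first and then flattening as estimators are added.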

Fig. 7 Estimator vs Error for XGBoost models

Fig. 8 Estimator vs Error for XGBoost models: Zoomed

As illustrated in Figure 8, models 11 to 15 fall within the range of high accuracy.
The observed errors for all five models are presented in Table 12, which shows that
model-12 is the most accurate of all models trained with the XGBoost technique.

Table 12 Top 5 highly accurate XGBoost models

Model      n_estimators   Errors [nrmse %]
model-12   12             3.65
model-13   13             3.68
model-11   11             3.69
model-14   14             3.75
model-15   15             3.81

3.5 Model Evaluation


Forecasting models are evaluated here across several metrics: Mean Absolute Error
(MAE), Root Mean Squared Error (RMSE), and Normalized Root Mean Squared Error
(nRMSE). MAE and RMSE quantify the discrepancies between predicted and actual
values, offering insight into how precisely the models capture the underlying
patterns in the dataset. Applying these metrics systematically yields a
comprehensive assessment of each model's overall effectiveness and suitability.
Among these metrics, the nRMSE is particularly advantageous. By normalizing the
RMSE, it provides a scale-independent evaluation, making fair comparisons possible
across different datasets or scales. In identifying the best-performing model, the
nRMSE therefore serves as the key metric: a standardized, objective yardstick for
assessing predictive accuracy and guiding model selection and refinement.
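The three metrics can be computed side by side as follows. Normalizing by the range of the observed values is one common convention for nRMSE (normalizing by the mean or by installed capacity are alternatives); the paper does not state which it uses, and the sample values below are illustrative.

```python
# MAE, RMSE, and nRMSE computed on a pair of actual/forecast series.
# nRMSE here divides by the observed range, one of several conventions.
import numpy as np

def evaluate(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mae = np.mean(np.abs(y_pred - y_true))
    rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
    nrmse = 100 * rmse / (y_true.max() - y_true.min())
    return {"MAE": mae, "RMSE": rmse, "nRMSE %": nrmse}

# Illustrative values only, not the paper's data
actual = [120.0, 340.0, 560.0, 410.0, 90.0]
forecast = [130.0, 320.0, 575.0, 400.0, 110.0]
print(evaluate(actual, forecast))
```

Because nRMSE is dimensionless, the same function can rank models trained on plants of very different capacities, which is exactly the property the comparison in Section 4 relies on.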

4 Results and Discussion


In this study, five distinct machine learning models were each fine-tuned with a
range of hyperparameters and transformations to optimize predictive outcomes. The
results in Table 13, sorted to prioritize the lowest Normalized Root Mean Squared
Error (nRMSE), highlight the superior performance of specific models; nRMSE served
as the key indicator of accuracy. This analysis identified the XGBoost model as the
top-performing method for power prediction. Remarkably, it performed best when
utilizing all four parameters: Solar Irradiance (SI), Ambient Temperature (Amb
Temp.), Wind Speed, and Wind Direction, an observation that underscores the holistic
nature of the variables contributing to accurate power predictions.

Table 13 Superior models of each ML methodology

Method          Model      Errors [nrmse %]
XGBoost         model-12   3.65
Ada Boost       model-5    4.08
Random Forest   model-1    4.17
SVM             model-1    4.20
Decision Tree   model-3    4.22

Further examination of Table 13 offers nuanced insight into the relative performance
of each machine learning model: ranking the models by nRMSE makes their predictive
capabilities directly comparable. Notably, the XGBoost model's consistent
superiority across hyperparameters and transformations underscores its resilience
and adaptability in capturing the intricate patterns in the dataset. This finding
emphasizes the importance of selecting an appropriate methodology that accounts for
the interplay of features in accurate power prediction.
In conclusion, the combination of diverse machine learning models, extensive
hyperparameter tuning, and careful feature selection identified the XGBoost model as
the preeminent performer. Its ability to leverage all four parameters for power
prediction highlights the value of a holistic approach that harnesses the collective
influence of Solar Irradiance, Ambient Temperature, Wind Speed, and Wind Direction.
This outcome advances predictive modeling in the renewable energy sector and
underscores the broader significance of feature-rich methodologies for accurate and
reliable forecasting results.

5 Conclusion
Based on the extensive evaluation of various machine learning models and their per-
formance metrics, particularly focusing on the Normalized Root Mean Squared Error
(nRMSE) as a key indicator, the findings in this study provide valuable insights into
the predictive modeling of solar photovoltaic power output. The systematic compari-
son of five different models, each rigorously fine-tuned with diverse hyperparameters
and transformations, revealed the XGBoost model as the most robust and accurate
methodology for power prediction. Notably, this superior performance was consistently
observed across various combinations of features, emphasizing the holistic nature of the
predictive variables, including Solar Irradiance, Ambient Temperature, Wind Speed,
and Wind Direction.
The prominence of the XGBoost model in achieving minimal nRMSE scores underlines its
adaptability and effectiveness in capturing the intricate patterns within the time
series data. This outcome contributes to the broader discourse on renewable
energy forecasting, offering a reliable and accurate approach that can aid in optimizing
energy management strategies. Furthermore, the study underscores the importance
of comprehensive feature selection and hyperparameter tuning in enhancing the
predictive capabilities of machine learning models for renewable energy applications.
In conclusion, the results affirm the suitability of the XGBoost model as a power-
ful tool for short-term forecasting of solar photovoltaic power output. The ability to
leverage multiple features in an integrated manner provides a robust foundation for
accurate predictions, thus advancing the understanding and practical implementation
of machine learning in the renewable energy sector. These findings not only contribute
to the academic discourse on energy forecasting but also offer practical implications
for the optimization of solar power generation and distribution systems.

References
[1] Marquez, A., Coimbra, C.F.M.: Statistical techniques for forecasting pv power
output. Journal of Renewable Energy 12, 123–145 (2011)

[2] Chen, L., Wang, J.: Hybrid approach for short-term pv power forecasting. Solar
Energy 34, 567–580 (2011)

[3] Chow, C., Leeb, B., Fuller, A.S.: Short-term pv power forecasting using a physical
model. IEEE Transactions on Sustainable Energy 2, 235–243 (2011)

[4] Mathiesen, P., Kleissl, J.: Temporal dynamics in short-term pv power forecasting.
Solar Energy 45, 123–136 (2011)

[5] Voyant, C., Notton, G., Nivet, M.-L.: Hybrid models for pv power forecasting in
france. Energy Procedia 10, 187–192 (2011)

[6] Wu, W., Chee, M.: Hybrid model for short-term pv power forecasting. Solar
Energy 39, 345–356 (2011)

[7] Chu, C.-W., Pan, J.-S., Hong, C.-M.: Short-term pv power forecasting using
statistical techniques. Energy Procedia 75, 123–130 (2015)

[8] Ghayekhloo, M., Barforoushi, T.: Hybrid model for 1-hour pv power forecasting.
Solar Energy 42, 567–580 (2015)

[9] Akarslan, H., Hocaoglu, F.O.: Hybrid model for short-term pv power forecasting
in turkey. Renewable and Sustainable Energy Reviews 58, 123–135 (2016)

[10] Sharma, R., Tyagi, V.V., Chen, C.: Hybrid sensor-wavelet-ann model for pv power
forecasting. Energy Procedia 36, 456–463 (2016)

[11] Gala, A., Gomez, T., Martinez, J.: Hybrid nwp and machine learning model for
3-hour pv power forecasting. Renewable Energy 45, 678–689 (2016)

[12] Evgeniou, T., Pontil, M.: Support vector machines: Theory and applications, vol.
2049, pp. 249–257 (2001)

[13] Azad, M., Chikalov, I., Hussain, S., Moshkov, M., Zielosko, B.: Construction
of Optimal Decision Trees and Deriving Decision Rules from Them, pp. 41–53
(2022)

[14] Noprisson, H., Ayumi, V.: Implementation of random forest for vehicle type clas-
sification using gamma correction algorithm. JSAI (Journal Scientific and Applied
Informatics) 6, 444–450 (2023)

[15] Kumar, D., Swathi, M.: Rain fall prediction using ada boost machine learning
ensemble algorithm. Journal of Advanced Applied Scientific Research 5, 67–81 (2023)

[16] Dwinanda, M., Satyahadewi, N., Andani, W.: Classification of student gradua-
tion status using xgboost algorithm. BAREKENG: Jurnal Ilmu Matematika dan
Terapan 17, 1785–1794 (2023)
