Machine Learning-Based Short-Term Solar Power Forecasting
Case Report
Keywords: Solar Power Forecasting, Machine Learning Algorithms, Short-Term Prediction, Regression Analysis, Classification Approach, Australian Solar Dataset.
DOI: https://doi.org/10.21203/rs.3.rs-3706776/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Laboratory of Green and Mechanical Development (LGMD), Ecole Nationale Polytechnique, B.P. 182, El-Harrach, Algiers, 16200, Algeria
Abstract
Solar energy production is an intermittent process that is affected by weather and climate conditions. This can lead to unstable and fluctuating electricity generation, which can cause financial losses and damage to the power grid. To better control power production, it is important to predict solar energy production. Big data and machine learning algorithms have yielded excellent results in this regard. This study compares the performance of two different machine learning approaches to solar energy production prediction: regression and classification. The regression approach predicts the actual power output, while the classification approach predicts whether the power output will be above or below a certain threshold. The study found that the random forest regressor algorithm performed the best in terms of accuracy, with a mean absolute error and root mean square error of 0.046 and 0.11, respectively. However, it did not predict peak power values effectively, which can lead to higher errors. The Long Short-Term Memory (LSTM) algorithm performed better in classifying peak power values. The study concluded that classification models may be better at generalizing than regression models. The proposed approach is valuable for interpreting model performance and improving prediction accuracy.
Keywords
Solar Power Forecasting; Machine Learning Algorithms; Short-Term Prediction; Regression Analysis; Classification Approach; Australian Solar Dataset.
Nomenclature
Abbreviations
ANN    Artificial Neural Network
CdTe   Cadmium Telluride Technology
CV     Cross Validation
DKASC  Desert Knowledge Australia Solar Centre
DT     Decision Trees
EDA    Exploratory Data Analysis
ET     Extra Trees
KNN    K-Nearest Neighbors
LSTM   Long Short-Term Memory
MAE    Mean Absolute Error
ML     Machine Learning
MSE    Mean Square Error
PV     Photovoltaic
RF     Random Forest
RMSE   Root Mean Square Error
SPCE   Sparse Categorical Cross Entropy
1. Introduction
In the context of the increasing global demand for sustainable energy sources, solar power generation has emerged as a crucial pillar in the transition towards cleaner and more environmentally friendly electricity production. The intermittent and weather-dependent nature of solar energy, however, presents a significant challenge in effectively integrating it into the power grid. Accurate forecasting of solar power generation plays a pivotal role in addressing this challenge, enabling better grid management, enhanced energy trading, and optimal utilization of resources.
Over the years, advancements in machine learning (ML) techniques have revolutionized various domains, including energy forecasting. These models have the potential to provide more accurate and reliable solar power generation forecasts, thus facilitating decision-making processes for energy producers, grid operators, and policymakers.
Various ML algorithms, such as artificial neural networks (ANN), random forest (RF), decision tree (DT), extreme gradient boosting (XGB), and long short-term memory (LSTM), are highlighted for their use in PV solar power output forecasting and their successful applications in various fields. Table 1 summarizes recent works on solar forecasting (reported from N. Rahimi [1]). In a similar study, a comparison ranked the algorithms for forecasting PV solar power output, from best to worst, as ANN, RF, DT, XGB, and LSTM, where the ANN model produced the best MAE, RMSE, and R², with scores of 0.4693, 0.8816, and 0.9988, respectively. The study confirmed that the XGB algorithm is able to produce better PV solar power output forecasting performance than the established algorithms (ANN, RF), but requires the use of additional advanced techniques [2]. A very recent study conducted in Lubbock, Texas, leveraged a dataset containing comprehensive 5-minute measurements of humidity, temperature, wind direction, wind speed, and solar radiation spanning from 2012 to 2022. The study highlighted the superiority of RF and LSTM models in comparison to DT, ANN, CNN, and GB. Impressively, the RF and LSTM models achieved the lowest mean squared error (MSE) rates of 2.06% and 2.23%, alongside the highest R² values of 0.977 and 0.975, respectively [3]. Two studies specifically focused on LSTM models, emphasizing their robustness and forecasting capacity. These studies explored LSTM architectures and hyperparameter tuning to achieve excellent results in predicting PV power production [4,5]. In another investigation, solar energy generation prediction was based on the UNISOLAR Solar Generation Dataset, which encompasses two years of data collected at La Trobe University, Victoria, Australia. This dataset incorporated vital weather data such as apparent temperature, air temperature, dew point temperature, wind speed, wind direction, and relative humidity. The study creatively addressed the issue of zero-skewness in solar energy generation data, employing a zero-inflated model and power-transformer scaling techniques. Results highlighted the efficacy of the RF, XGB, and ConvLSTM2D algorithms, yielding low RMSE values [6]. Furthermore, research efforts were extended to Eastern India, where the impact of weather parameters on solar PV power generation was assessed using ensemble machine learning models such as bagging, boosting, stacking, and voting. These models were validated against field data from a 10 kWp solar PV power plant, providing valuable insights for greenfield solar projects in the eastern region of India. Voting and stacking algorithms exhibited superior performance, with an RMSE of 313.07 and an R² score of 0.96, and an RMSE of 314.9 and an R² score of 0.96, respectively [7]. Lastly, hybrid models combining CNN and LSTM, as well as LSTM and Gaussian process regression (GPR), showed promise in stable power generation forecasting. These hybrid models exhibited improved accuracy, offering a reliable approach for future solar power forecasting applications [8,9]. Another study explored energy production forecasting in a 24 kW solar installation, evaluating the performance of SVR, ANN, DT, RF, generalized additive models (GAM), and XGB algorithms. Among these, the ANN emerged as the most accurate model for energy production predictions [10]. In a novel approach, a deep learning algorithm called DSE-XGB combined ANN, LSTM, and XGB to surpass the individual deep learning algorithms. This method demonstrated improved consistency and stability across various case studies, even under varying weather conditions, resulting in a significant increase in R² value of 10%-12% relative to other models [11]. Incorporating solar PV panel temperature, ambient temperature, solar flux, time of the day, and relative humidity, another study
This work focuses on three main areas. The first is the selection of meteorological variables as input variables, with a focus on their relationship with the output variable. The second is building regression and classification models, and the third is a comparison of the two approaches.
The current study proposes a new forecasting approach where, instead of looking for a discrete output value, it classifies active power into four classes. This is because the regression models showed that the evaluation metrics do not explain the accuracy well. The classification results showed which models are good at forecasting values, and it appears that only XGB in regression and LSTM in classification were able to predict peak values.
The remainder of the paper is organized as follows: Section 2 discusses data collection, while Section 3 presents an exploratory analysis of the data. Section 4 covers data preprocessing, and Section 5 describes the machine learning algorithms used in this study. Section 6 explains the evaluation metrics, and Section 7 presents the experimentation tests, results, and discussion. Finally, Section 8 presents conclusions and future work.
Table 1. Summary of recent studies for solar forecasting using machine learning, reported from N. Rahimi [1].

| Ref | Time Ahead | Input Variables | Output Variable | Forecasting Method | Error | Comparison |
|---|---|---|---|---|---|---|
| [15] | Daily | Temperature, Air Pressure, Wind | PV power | Random Forest | MAPE = 8.5% | RF > GBDT |
| [16] | Daily | Temperature, Relative Humidity, Cloud cover, Precipitation | PV power | Boosting | RMSE = 9.48% | Boosting > AR |
| [17] | Daily | PV Power | PV power | Timeseries | MSE = 16.24 | TS > SARIMA |
| [18] | Hour | Irradiation | PV power | SVR | MAE = 37.04 | SVR > NAM > SP |
| [19] | Hour | Irradiation, Temperature | PV power | Polynomial Regression | MAPE = 10.51% | - |
| [20] | Daily | Humidity, Temperature, Wind Speed | PV power | WD-ANN | RMSE = 19.663%; MAE = 10.349% | WD-ANN > ANN |
| [21] | 6 h | Solar Irradiance, Temperature, PV output | PV power | WD-BCRF | RMSE = 32.12%; MAE = 20.64% | WD-BCRF > WDSVM > RF |
| [22] | 10 min | Irradiation | PV power | ANN | RMSE = 6% | ANN > ARIMA |
The data employed for this study were provided by the Desert Knowledge Australia Solar Centre (DKASC), an actual solar technology demonstration facility situated within the Desert Knowledge Precinct in Alice Springs.

Figure 1. Location of Alice Springs, where the PV solar modules are installed.

Figure 2. Desert Knowledge Australia Solar Centre (DKASC).
Within the database, data from multiple installations were presented, and the study focused on the First Solar installation, with specifications summarised in Table 2.
The CdTe technology involves utilizing cadmium telluride as the active photovoltaic material in a thin-film array with a fixed ground-mount configuration. CdTe, which replaces traditional silicon as the active material, is deposited in a thin layer on the substrate of the solar panel. This thin-film approach consumes less photoelectric material during production, potentially leading to reduced manufacturing costs. While CdTe panels exhibit lower efficiency compared to certain silicon-based technologies, they display advantageous properties under specific solar conditions.
The histogram displayed in Figure 3 illustrates the distribution of active power values, exhibiting a right-skewed distribution: most of the mass lies at low power with a long tail towards high values. It is observed that 66% of the power values did not exceed 0.5 kW, with a few peak values not surpassing 7 kW under standard test conditions (STC), as specified in the technical datasheets. The distribution of the target value is crucial in determining the normalization method and even the selection of the algorithm. When dealing with imbalanced target variables, where one class significantly outweighs the others, bias can be introduced into models. Imbalanced datasets can lead to suboptimal performance for algorithms that assume balanced classes. In such scenarios, ensemble methods like random forest or boosting techniques like XGB can be particularly effective, as they adeptly address class imbalances.

Figure 3. Active power distribution.
Figure 4 illustrates the correlation among the variables through a heatmap. Correlation assists in attaining a more profound comprehension of the relationships between these variables. Furthermore, it makes it possible to identify redundant variables and those that have no impact on the model, and even to better interpret the model's performance.
Table 4 presents the correlation coefficients between the target variable and the input variables. It is evident that global horizontal irradiation exhibits a strong correlation, while temperature, wind speed, humidity, and rainfall show a weak correlation with power. The following sections explain how these variables influence the target variable in one way or another. Furthermore, the correlation coefficients will guide variable selection during the experimental phase. R. Ahmed et al. [24] showed that the inputs to forecasting models have a direct influence on prediction accuracy, a key factor in determining model performance. Generally, imprudent input selection can cause forecast errors that increase time delay, cost, and computational complexity.
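As a minimal sketch of this correlation-guided input screening, the snippet below ranks synthetic candidate inputs by their absolute Pearson correlation with the target. The column names, coefficients, and the 0.3 cut-off are illustrative assumptions, not values from the study:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the weather data (column names are illustrative).
rng = np.random.default_rng(0)
n = 500
ghi = rng.uniform(0, 1000, n)                  # global horizontal irradiation
temp = rng.uniform(10, 40, n)
wind = rng.uniform(0, 12, n)
power = 0.007 * ghi + rng.normal(0, 0.3, n)    # power driven mainly by irradiation

df = pd.DataFrame({"ghi": ghi, "temperature": temp,
                   "wind_speed": wind, "active_power": power})

# Rank candidate inputs by absolute Pearson correlation with the target.
corr = df.corr()["active_power"].drop("active_power").abs().sort_values(ascending=False)
selected = corr[corr > 0.3].index.tolist()     # 0.3 threshold is an assumed cut-off
print(selected)
```

With this synthetic data, only the irradiation column survives the cut-off, mirroring the paper's observation that irradiation correlates strongly with power while the other weather variables do not.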
The power production diagrams in Figure 5 depict a scatterplot considering wind speed, wind direction, and active power. They reveal that the peak values of active power are located in the high wind speed zone, exceeding 2 m/s and reaching up to 12 m/s. Moreover, they indicate the influence of wind direction, with more peak power values observed in the 45-180° zone. However, the correlation between wind speed and active power is weak.
Raza et al. [25] determined by inspection that the PV output power pattern does not exactly follow the wind speed pattern. During daytime, the PV output power is at a higher level compared to wind speed. Subsequently, this output power decreases with an increase in wind speed; however, similar observations were not made by others during the same period. Therefore, according to these researchers, the correlation between PV output and wind speed is weak.
In addition, the rise in temperature of PV cells is extremely sensitive to wind speed rather than wind direction [26]. Surface shape and structure have a clear impact on the convection cooling of a PV panel. Structured and grooved glass cover surfaces may operate at lower temperatures at higher wind speeds. However, the cooling effect is significantly higher for the flat surface at low wind speed. Studies in the USA have shown that for a 10 m/s wind speed, the operating temperature can be lowered by 3.5°C with a grooved glass cover.

Figure 5. Active power vs. wind direction diagrams: (a) wind speed > 2.5 m/s; (b) wind speed < 2.5 m/s.
Figure 6 represents the average temperature recorded on the same day over several years. It is evident that peak values of active power were recorded during the spring-summer period, when temperatures were below record levels. Moafaq [29] proved, through an experimental study and a simulation program, that increasing the operating temperature has a negative impact on the performance of the photovoltaic panel in general.
Based on the literature, module temperature is a function of environmental factors such as solar irradiance [30,31], wind speed [32], and ambient temperature, as well as some PV constructional factors such as materials and glass transmittance [33]. In addition, the efficient production of electricity strongly depends on the module temperature of a PV panel [34]. As the module temperature increases, electrical efficiency decreases, since PV modules convert only 20% of solar energy into electricity and 80% into heat [35].
Figure 6. (a) Daily average temperature and (b) daily average active power throughout the year.

Figure 7 represents the distributions of three weather variables: the temperature follows a Gaussian distribution, while rainfall and humidity are heavily skewed, which gives a first indication of their effect on model performance. When a feature is heavily skewed, the model may become overly influenced by the extreme values in the long tail. This can result in predictions that are biased towards the direction of the skewness, causing the model to perform poorly on the minority class or to underestimate certain outcomes. Skewed features can also lead to overfitting: models may perform well on the training data but poorly on unseen data because they have learned to fit the noise or extreme values in the skewed feature.
Relative humidity is an influencing factor that is responsible for the accumulation of tiny water droplets and water vapor on solar panels from the atmosphere. Water droplets can refract, reflect, or diffract sunlight away from solar cells, reducing the number of direct components of solar radiation hitting them to produce electricity [36]. Moreover, water condensation at the interface between the encapsulant and the solar cell materials increases corrosion rates, which risks encapsulant delamination [32,37,38].
Figure 7. Distribution of temperature, rainfall, and humidity.
The scatterplot in Figure 8 depicts a linear correlation between the global horizontal irradiance variable and active power. Irradiance is the energy that strikes a unit horizontal area per unit wavelength interval per unit time [42]. The output of the PV module increases as the irradiance increases [43]. The PV module can measure the irradiance based on the G-P (sun radiation vs. maximum output power) curve, as it is approximately linear.

Figure 8. Global horizontal irradiation and active power scatter plot.
Before delving into the modeling process, it was imperative to appropriately prepare the data. This involved traversing five pivotal phases to guarantee a more resilient and precise training phase.
Handling missing values: within the dataset, one of the variables exhibited instances of missing values, which could hinder proper training of the model. Given that data is collected every 5 minutes, it is highly logical to replace missing values with the preceding ones, employing the forward-fill method provided by the Pandas library.
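Replacing a gap with the preceding 5-minute reading corresponds to pandas' forward fill. A minimal sketch with invented values (not the DKASC data):

```python
import pandas as pd

# 5-minute series with two gaps (toy values).
idx = pd.date_range("2020-01-01 12:00", periods=6, freq="5min")
power = pd.Series([1.2, None, None, 2.5, None, 3.0], index=idx)

# Propagate the last valid observation forward into each gap.
filled = power.ffill()
print(filled.tolist())  # [1.2, 1.2, 1.2, 2.5, 2.5, 3.0]
```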
Detecting outliers: outliers denote data points that significantly deviate from the dataset's norm. They often constitute abnormal observations that skew the data distribution, frequently arising from inconsistent data entry or erroneous measurements. To ensure the trained model's adaptability to a valid range of test inputs, it is imperative to identify and eliminate outliers. The box plot functionality of Seaborn, serving as an effective visualization tool, was employed to highlight these outliers in the data distribution.
Meteorological context of the study site: taking into account the specific meteorological conditions of the desert region where the site is situated, Alice Springs witnesses distinct weather patterns. Extended summers are characterized by high temperatures and partially cloudy skies.
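A box plot flags points lying beyond 1.5 × IQR of the quartiles, and the same fences can be applied programmatically to drop them. A sketch with an invented wind-speed sample containing one implausible reading:

```python
import pandas as pd

# Toy wind-speed sample with one implausible reading (assumed values).
wind = pd.Series([1.0, 2.1, 2.4, 3.0, 3.3, 4.1, 45.0])

q1, q3 = wind.quantile(0.25), wind.quantile(0.75)
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the same fences a box-plot whisker uses

cleaned = wind[(wind >= lo) & (wind <= hi)]
print(cleaned.tolist())  # the 45.0 reading is dropped
```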
Data normalization: this was carried out using the min-max scaler technique, which involves rescaling data to a predetermined range, typically spanning from 0 to 1. Such normalization proves advantageous when features possess disparate units, as it preserves the interrelationships among data points. With outliers already removed in the previous phase, min-max scaling does not let extreme values disproportionately impact the model's learning process.
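A minimal illustration of min-max scaling with scikit-learn; the two columns and their ranges are assumptions for the example, standing in for features with disparate units (W/m² vs. °C):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Features on very different scales: irradiation (W/m²) and temperature (°C).
X = np.array([[0.0, 10.0],
              [500.0, 25.0],
              [1000.0, 40.0]])

scaler = MinMaxScaler()            # rescales each column independently to [0, 1]
X_scaled = scaler.fit_transform(X)
print(X_scaled)                    # relative spacing within each column is preserved
```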
Feature engineering: to enhance the dataset, three new variables ('month', 'day', 'hour') were introduced through the extraction of temporal data, considering the time-series nature of the data, which is impacted by weather fluctuations, seasonal variations, electricity demand shifts, and sensor failures. A categorical variable was also added to the dataset, classifying the active power numerical variable; this new variable is employed as the output for the classification task.
Data encoding: the encoding of this new categorical variable was carried out using the label encoder function from the scikit-learn library. LabelEncoder is a utility class designed to assist in the normalization of labels, ensuring that they encompass values ranging from 0 to n-1.
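The temporal-feature extraction and label encoding described above can be sketched as follows. The class names follow the four categories used later in the paper ("Null", "Normal", "High", "Peak"), but the numeric thresholds here are hypothetical, as the paper does not state them at this point:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2020-06-01 08:00", "2020-06-01 12:00",
                                 "2020-12-15 13:00"]),
    "active_power": [0.0, 6.8, 2.1],
})

# Temporal features extracted from the timestamp, as in the study.
df["month"] = df["timestamp"].dt.month
df["day"] = df["timestamp"].dt.day
df["hour"] = df["timestamp"].dt.hour

# Illustrative power classes; the thresholds below are assumptions.
def to_class(p):
    if p == 0.0:
        return "Null"
    if p < 3.0:
        return "Normal"
    if p < 6.0:
        return "High"
    return "Peak"

df["power_class"] = df["active_power"].map(to_class)
# LabelEncoder maps the sorted class names to integers 0..n-1.
df["power_label"] = LabelEncoder().fit_transform(df["power_class"])
print(df[["power_class", "power_label"]])
```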
Methods grounded in nearest neighbors are categorized as non-generalizing in the realm of machine learning, as they essentially retain and rely upon the entirety of their training dataset. Despite their apparent simplicity, nearest neighbor approaches have exhibited significant efficacy across a wide spectrum of classification and regression tasks. These encompass diverse applications, ranging from recognizing handwritten digits to interpreting satellite image scenes. Given their non-parametric nature, these methods excel particularly when dealing with classification scenarios characterized by intricate and irregular decision boundaries.
Figure 9. Explanatory representation of the K-Nearest Neighbors classifier.
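A minimal nearest-neighbors sketch on synthetic data; the dataset and k = 5 are illustrative, not the study's tuned configuration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class problem standing in for the power-class data.
X, y = make_classification(n_samples=400, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The k=5 nearest training points vote on each test point's class;
# no parametric model is fitted, the training set itself is retained.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
acc = knn.score(X_te, y_te)
print(round(acc, 3))
```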
(i) Averaging methods: these methods hinge on the creation of several estimators independently and subsequently averaging their individual predictions. The rationale behind this approach is that, on average, the amalgamated estimator tends to outperform any single base estimator due to a reduction in variance. Examples of algorithms falling into this category include the extra trees regressor/classifier and the random forest regressor/classifier [44].
Figure 10 shows a random forest schema. Random forest is an extension of the bagging method, as it utilizes both bagging and feature randomness to create an uncorrelated forest of decision trees. A random forest model is made up of multiple decision trees. Decision trees ask a series of questions to determine an answer; these questions make up the decision nodes in the tree, acting as a means to split the data. Decision trees seek to find the best split to subset the data, and they are typically trained through the classification and regression tree (CART) algorithm.

Figure 10. Random forest schema.
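The bagging-plus-feature-randomness idea can be illustrated with scikit-learn's RandomForestRegressor on a toy irradiation-to-power relationship; the coefficients and noise level are invented for the example:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy irradiation -> power relationship (illustrative, not the DKASC data).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1000, size=(300, 1))          # global horizontal irradiation
y = 0.007 * X[:, 0] + rng.normal(0, 0.2, 300)    # near-linear power response

# An averaging ensemble: each tree sees a bootstrap sample and a random
# feature subset; the forest's prediction is the mean over all trees.
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
pred = rf.predict([[500.0]])[0]
print(round(pred, 2))  # close to 0.007 * 500 = 3.5
```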
(ii) Boosting methods: boosting is a kind of ensemble learning method that trains models sequentially, with each new model trying to correct the previous one, combining several weak learners into strong learners. In contrast to averaging methods, boosting methods construct base estimators sequentially, with the primary objective of mitigating the bias of the combined estimator. The underlying motivation is to unite multiple weak models to form a potent ensemble. A notable example is the gradient tree boosting regressor, along with the XGB regressor/classifier. XGB, a highly optimized distributed gradient boosting library, is renowned for its efficiency, adaptability, and portability. It implements machine learning algorithms within the gradient boosting framework and offers parallel tree boosting, also known as GBDT (gradient boosting decision trees) or GBM (gradient boosting machine) [44].
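As a sketch of this sequential residual fitting, the example below uses scikit-learn's GradientBoostingRegressor as a stand-in for XGB; the synthetic data and hyperparameters are assumptions, not the study's tuned settings:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
X = rng.uniform(0, 1000, size=(300, 1))
y = 0.007 * X[:, 0] + rng.normal(0, 0.2, 300)

# Boosting fits shallow trees sequentially; each new tree is trained on
# the residual errors of the current ensemble, shrunk by the learning rate.
gb = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                               max_depth=3, random_state=0).fit(X, y)
mae = mean_absolute_error(y, gb.predict(X))
print(round(mae, 3))
```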
5.3. Artificial Neural Network
Long short-term memory (LSTM) is a type of recurrent neural network (RNN) architecture that is particularly well-suited for working with sequences of data, such as time series.
LSTM networks are comprised of various gates that carry information about the previous state. This information is written to, stored in, or read from a cell that acts as a memory. The cell decides whether to store the incoming information, and when to read, write, and erase it, via the opening and closing of its gates. The gates act based on the signals they receive, blocking or passing information according to its strength and importance, filtering with their own sets of weights [4].

Figure 11. Cell diagram of LSTM.
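The gate mechanics described above can be made concrete with a single LSTM-cell forward pass written in plain NumPy. The weights are random placeholders and the dimensions are arbitrary; this is a didactic sketch of the standard cell equations, not the study's trained network:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b stack the four gates (i, f, g, o)."""
    z = W @ x + U @ h_prev + b
    n = h_prev.size
    i = sigmoid(z[0*n:1*n])        # input gate: how much new info to write
    f = sigmoid(z[1*n:2*n])        # forget gate: how much old memory to keep
    g = np.tanh(z[2*n:3*n])        # candidate cell content
    o = sigmoid(z[3*n:4*n])        # output gate: how much memory to read out
    c = f * c_prev + i * g         # updated cell (memory) state
    h = o * np.tanh(c)             # hidden state passed to the next step
    return h, c

# Tiny cell: 2 inputs, 3 hidden units, fixed random weights.
rng = np.random.default_rng(0)
W = rng.normal(size=(12, 2)); U = rng.normal(size=(12, 3)); b = np.zeros(12)
h = np.zeros(3); c = np.zeros(3)
for x in [np.array([0.1, 0.5]), np.array([0.3, 0.2])]:   # a 2-step sequence
    h, c = lstm_cell(x, h, c, W, U, b)
print(h.shape, c.shape)
```

Because the hidden state is `o * tanh(c)`, every component of `h` stays strictly inside (-1, 1), regardless of the cell-state magnitude.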
RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_{i,\mathrm{true}} - y_{i,\mathrm{pred}}\right)^{2}}   (3)

R^{2} = 1 - \frac{\sum_{i=1}^{N}\left(y_{i,\mathrm{true}} - y_{i,\mathrm{pred}}\right)^{2}}{\sum_{i=1}^{N}\left(y_{i,\mathrm{true}} - y_{\mathrm{true,mean}}\right)^{2}}   (4)
The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an R² score of 0.0.
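Equations (3) and (4) can be checked numerically. The toy vectors below are invented; the last line confirms that a constant predictor of the mean scores exactly R² = 0:

```python
import numpy as np

y_true = np.array([0.0, 1.5, 3.0, 4.5])
y_pred = np.array([0.2, 1.4, 2.9, 4.8])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))          # Eq. (3)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot                               # Eq. (4)

# A constant model predicting the mean scores exactly R² = 0.
r2_const = 1.0 - ss_tot / ss_tot
print(round(rmse, 4), round(r2, 4), r2_const)
```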
6.2. Metrics for classification
6.2.1. Accuracy
Accuracy is the simplest and most straightforward metric. It measures the ratio of correctly predicted instances to the total instances in the dataset. Figure 12 shows the confusion matrix of a bi-variate problem; it is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one.

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}   (5)
where:
TP = True Positive
TN = True Negative
FP = False Positive
FN = False Negative

Figure 12. Confusion matrix.
\mathrm{F1\ Score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}   (8)
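Equations (5) and (8) can be verified against scikit-learn on a toy binary example (the labels are invented):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Toy binary labels (1 = power above the class threshold).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
acc = (tp + tn) / (tp + tn + fp + fn)            # Eq. (5)

prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = 2 * prec * rec / (prec + rec)               # Eq. (8)
print(acc, round(f1, 3))
```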
GridSearchCV from scikit-learn was employed, which facilitates the identification of the model with the most appropriate parameters by evaluating the performance of various parameter combinations through the cross-validation technique. This approach entails training the model across numerous potential partitions of the training dataset.
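A hedged sketch of the grid-search procedure; the estimator, parameter grid, and synthetic data are illustrative, not the study's actual search space:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)

# Every combination in param_grid is scored by 5-fold cross-validation;
# the best-scoring combination is then refit on the full training set.
grid = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": [3, 5, 9], "weights": ["uniform", "distance"]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```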
In the experimentation chapter, three successive experiments were conducted. The first aimed to determine the best combination of variables. In the second, the data were trained on various tuned classification and regression algorithms (KNN, XGB, ET, RF, and LSTM); using evaluation metrics, the best regression and classification models were determined, and the performance of both approaches was subsequently compared.
Table 5 provides technical data about the computing machine, programming environment, and the various libraries used. In contrast, Table 6 contains information about the dataset, such as time attributes and size. It is crucial to have an overview of this information to better interpret the results.
Table 8 presents the performance of the best model recorded in the initial experiment. This time, the correlation between variables was considered, particularly the strong correlation between global horizontal irradiation and global tilted irradiation, which often leads to variable redundancy. The results after training and testing showed that redundant variables had no effect on model performance, suggesting that it is better to exclude them.
Table 7. XGB regressor forecasting results obtained from the input selection phase (additional input parameters selected by evaluating the correlation value with the target feature).

| Additional Input Parameters | Dataset | MAE | MSE | RMSE | Train Score | Test Score |
|---|---|---|---|---|---|---|
| Global Horizontal Irradiation, Global Tilted Irradiation, Diffuse Horizontal Irradiation, Temperature, Wind Speed, Diffuse Tilted Irradiation, Wind Direction, Humidity, Rainfall | Dataset 1 | 0.223 | 0.244 | 0.494 | 0.996 | 0.927 |
| Global Horizontal Irradiation, Global Tilted Irradiation, Diffuse Horizontal Irradiation, Temperature, Wind Speed, Diffuse Tilted Irradiation, Wind Direction, Humidity | Dataset 1 | 0.208 | 0.167 | 0.409 | 0.997 | 0.950 |
| Global Horizontal Irradiation, Global Tilted Irradiation, Diffuse Horizontal Irradiation, Temperature, Wind Speed, Diffuse Tilted Irradiation, Wind Direction | Dataset 1 | 0.195 | 0.156 | 0.395 | 0.996 | 0.954 |
| Global Horizontal Irradiation, Global Tilted Irradiation, Diffuse Horizontal Irradiation, Temperature, Wind Speed, Diffuse Tilted Irradiation | Dataset 1 | 0.207 | 0.169 | 0.412 | 0.996 | 0.950 |
| Global Horizontal Irradiation, Global Tilted Irradiation, Diffuse Horizontal Irradiation, Temperature, Wind Speed | Dataset 1 | 0.193 | 0.146 | 0.382 | 0.996 | 0.956 |
| Global Horizontal Irradiation, Global Tilted Irradiation, Diffuse Horizontal Irradiation, Temperature | Dataset 1 | 0.200 | 0.148 | 0.385 | 0.996 | 0.956 |
| Global Horizontal Irradiation, Global Tilted Irradiation, Diffuse Horizontal Irradiation, Temperature | Dataset 2 | 0.055 | 0.014 | 0.121 | 0.989 | 0.995 |
| Global Horizontal Irradiation, Global Tilted Irradiation, Diffuse Horizontal Irradiation | Dataset 1 | 0.194 | 0.163 | 0.404 | 0.996 | 0.952 |
| Global Horizontal Irradiation, Global Tilted Irradiation, Diffuse Horizontal Irradiation | Dataset 2 | 0.064 | 0.02 | 0.144 | 0.989 | 0.993 |
| Global Horizontal Irradiation, Global Tilted Irradiation | Dataset 1 | 0.230 | 0.206 | 0.454 | 0.996 | 0.939 |
| Global Horizontal Irradiation, Global Tilted Irradiation | Dataset 2 | 0.063 | 0.018 | 0.134 | 0.988 | 0.994 |
| Global Horizontal Irradiation | Dataset 1 | 0.291 | 0.286 | 0.534 | 0.994 | 0.915 |
| Global Horizontal Irradiation | Dataset 2 | 0.068 | 0.018 | 0.134 | 0.987 | 0.994 |
Table 8. XGB regressor forecasting results obtained from the input selection phase (additional input parameters selected by evaluating the correlation value with the target feature and between input variables).

| Additional Input Parameters | Dataset | MAE | MSE | RMSE | Train Score | Test Score |
|---|---|---|---|---|---|---|
| Global Horizontal Irradiation, Diffuse Horizontal Irradiation, Temperature | Dataset 1 | 0.250 | 0.228 | 0.478 | 0.995 | 0.932 |
After identifying the best combinations, combination 1 (global horizontal irradiation / dataset 2) and combination 2 (global horizontal irradiation, diffuse horizontal irradiation, temperature / dataset 2) were retained with the XGB prediction algorithm, which exhibits shorter execution times, while the other explored algorithms, such as KNN, RF, and LSTM, proved to be significantly slower and demanded substantial storage space.
Table 9 presents the performance of each model with the optimal parameters following tuning. Model 2 achieved superior scores, with RF outperforming all other algorithms by an average margin of 0.01, except for the LSTM algorithm, which failed to yield the anticipated results. It is recommended either to refine its parameters or to increase the number of hidden layers. Furthermore, in the classification section, a more detailed interpretation of the results obtained from each model will be provided.
In the following sections, model 1 and model 2 refer to combination 1 and combination 2, respectively, trained on a specifically defined algorithm with the mentioned hyperparameters.
Figures 13, 14, and 15 depict scatter plots of active power against global horizontal irradiation, where the orange points represent the predicted values during testing and the blue points represent the actual values. These three scatter plots confirm the evaluation metrics found earlier, except for the majority of values > 5 kW, where the predicted values differ from the actual values. The LSTM algorithm's figure demonstrates that it is the best-performing model in terms of generalizing the most points; however, it does not achieve the lowest RMSE, owing to the peak values that increased the average error. The classification section addresses the issue of peak values more effectively.

Figure 14. Test values vs. RF predicted output scatter plot.

Figure 15. Test values vs. LSTM predicted output scatter plot.
Tables 11 and 12 present the performance evaluation for various combinations of variables, akin to the regression analysis. However, the results are not entirely straightforward. For instance, identical accuracies were observed for models trained on Dataset 2. Nevertheless, this assertion requires further elaboration, as explained below. Interestingly, the optimal combinations for regression remained consistent for classification as well.
Table 11. XGB classifier forecasting results obtained from the input selection phase (additional input parameters selected by evaluating the correlation value with the target feature).

| Additional Input Parameters | Dataset | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| Global Horizontal Irradiation, Global Tilted Irradiation, Diffuse Horizontal Irradiation, Temperature, Wind Speed, Diffuse Tilted Irradiation, Wind Direction, Humidity, Rainfall | Dataset 1 | 0.77 | 0.74 | 0.77 | 0.74 |
| Global Horizontal Irradiation, Global Tilted Irradiation, Diffuse Horizontal Irradiation, Temperature, Wind Speed, Diffuse Tilted Irradiation, Wind Direction, Humidity | Dataset 1 | 0.78 | 0.75 | 0.78 | 0.74 |
| Global Horizontal Irradiation, Global Tilted Irradiation, Diffuse Horizontal Irradiation, Temperature, Wind Speed, Diffuse Tilted Irradiation, Wind Direction | Dataset 1 | 0.77 | 0.74 | 0.77 | 0.74 |
| Global Horizontal Irradiation, Global Tilted Irradiation, Diffuse Horizontal Irradiation, Temperature, Wind Speed, Diffuse Tilted Irradiation | Dataset 1 | 0.76 | 0.73 | 0.76 | 0.74 |
| Global Horizontal Irradiation, Global Tilted Irradiation, Diffuse Horizontal Irradiation, Temperature, Wind Speed | Dataset 1 | 0.78 | 0.76 | 0.78 | 0.74 |
| Global Horizontal Irradiation, Global Tilted Irradiation, Diffuse Horizontal Irradiation, Temperature | Dataset 1 | 0.78 | 0.76 | 0.78 | 0.73 |
| Global Horizontal Irradiation, Global Tilted Irradiation, Diffuse Horizontal Irradiation, Temperature | Dataset 2 | 0.98 | 0.98 | 0.98 | 0.98 |
| Global Horizontal Irradiation, Global Tilted Irradiation, Diffuse Horizontal Irradiation | Dataset 1 | 0.78 | 0.76 | 0.78 | 0.73 |
| Global Horizontal Irradiation, Global Tilted Irradiation, Diffuse Horizontal Irradiation | Dataset 2 | 0.98 | 0.98 | 0.98 | 0.98 |
| Global Horizontal Irradiation, Global Tilted Irradiation | Dataset 1 | 0.72 | 0.75 | 0.72 | 0.69 |
| Global Horizontal Irradiation, Global Tilted Irradiation | Dataset 2 | 0.98 | 0.98 | 0.98 | 0.98 |
| Global Horizontal Irradiation | Dataset 1 | 0.74 | 0.74 | 0.74 | 0.70 |
| Global Horizontal Irradiation | Dataset 2 | 0.98 | 0.98 | 0.98 | 0.98 |
Table 12. XGB classifier forecasting results obtained from the input selection phase (additional input parameters selected by evaluating correlation with the target feature and between input variables).

| Additional Input Parameters | Dataset | Accuracy | Precision | Recall | F1 Score |
Table 13 displays the performance of the top-performing combinations: combination 1 (global horizontal irradiation / Dataset 2) and combination 2 (global horizontal irradiation, diffuse horizontal irradiation, temperature / Dataset 2). These combinations were trained with classification algorithms, specifically the KNN classifier, XGB classifier, ET classifier, RF classifier, and LSTM (an artificial neural network).
Table 14 presents the per-class performance of model 2. All algorithms achieved high precision and F1 scores for the "High," "Normal," and "Null" classes. For the "Peak" class, however, the XGB classifier outperformed all other algorithms for both models, classifying all values correctly, and LSTM also performed strongly. In contrast, the KNN classifier consistently failed to classify the "Peak" class, even with a very large number of neighbors.
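The per-class comparison behind Tables 13 and 14 reduces to computing precision, recall, and F1 for each power class separately. Below is a minimal, library-free sketch of that computation; the labels are toy values, not the study's predictions.

```python
# Per-class precision, recall, and F1 from predicted vs. true labels,
# computed directly from true/false positive and negative counts.

CLASSES = ["Null", "Normal", "High", "Peak"]

y_true = ["Null", "Normal", "Normal", "High", "Peak", "Peak", "Normal", "Null"]
y_pred = ["Null", "Normal", "High",   "High", "Peak", "High", "Normal", "Null"]

def per_class_scores(y_true, y_pred, labels):
    scores = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[c] = {"precision": precision, "recall": recall, "f1": f1}
    return scores

scores = per_class_scores(y_true, y_pred, CLASSES)
print(scores["Peak"])
```

A per-class breakdown like this is what exposes the "Peak" failures that aggregate accuracy hides.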
Table 15 and Figure 17 compare a regression and a classification model based on RF. Both approaches struggle to predict peak values, although the regression approach shows a slight advantage on them.
Figure 17. Confusion matrices: (a) Random Forest classification; (b) Random Forest regression.
Table 16 and Figure 18 compare a regression and a classification model based on LSTM. As mentioned earlier, LSTM struggles to capture peak values in regression, but in classification it improves markedly, surpassing all other algorithms. The confusion matrix shows that LSTM correctly predicted 70 peak values and 39 high values, a highly impressive result. It can be concluded that LSTM performs better on discrete outputs than on continuous ones.
Figure 18. Confusion matrices: (a) LSTM classification; (b) LSTM regression.
Table 17 and Figure 19 compare a regression and a classification model based on XGB. This algorithm proves highly effective, capturing the most peak values in the regression approach. Examining the confusion matrix, in regression XGB accurately predicted 80 peak values and 29 high values, which is remarkable; in the classification approach it correctly predicted 25 peak values and 84 high values, also demonstrating its proficiency. It is therefore evident that XGB performs better when trained on continuous outputs rather than discrete ones.
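The regression-vs-classification comparison in Figures 17-19 requires putting both models on the same footing: the regressor's continuous output is binned into the four classes, and a confusion matrix is built from the resulting labels. The sketch below illustrates this with toy values; the bin thresholds are assumptions for illustration, not the paper's exact edges.

```python
# Bin continuous power predictions into classes, then build a
# confusion matrix (rows = true class, columns = predicted class).

CLASSES = ["Null", "Normal", "High", "Peak"]

def to_class(p_kw):
    # Illustrative thresholds (kW), not the study's exact bin edges.
    if p_kw <= 0.05:
        return "Null"
    if p_kw <= 5.0:
        return "Normal"
    if p_kw <= 7.0:
        return "High"
    return "Peak"

def confusion(y_true, y_pred, labels):
    idx = {c: i for i, c in enumerate(labels)}
    m = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        m[idx[t]][idx[p]] += 1
    return m

true_kw = [0.0, 3.2, 6.1, 7.8, 8.2, 4.0]
pred_kw = [0.0, 3.0, 5.9, 6.5, 8.1, 4.4]   # toy regression output

cm = confusion([to_class(v) for v in true_kw],
               [to_class(v) for v in pred_kw], CLASSES)
print(cm)
```

Off-diagonal mass in the last row is exactly the peak-value under-prediction the scatter plots revealed.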
This study validates many points already reported in several research articles. Exploratory data analysis provided a better understanding of how meteorological conditions affect photovoltaic panel performance; wind speed, for example, facilitates the cooling of PV panels and increases their efficiency. Temperature peaks do not correspond to the best PV efficiency; instead, active power has a linear relationship and a strong positive correlation with global horizontal irradiation. Almost no correlation was observed with other variables such as relative humidity and precipitation, although rain proved to be a natural cleaning source for PV panels against dust and pollution. The input-selection experiment, trained on a dataset of 206,860 data points with the XGB algorithm (the fastest in execution), showed that the model improved each time a variable with low or even negative correlation with active power was removed. The mean absolute error (MAE), mean squared error (MSE), and root mean square error (RMSE) decreased from 0.223, 0.244, and 0.494 to 0.2, 0.148, and 0.385, respectively. Variables proved important when their correlation exceeded 0.38, but distribution mattered as well: the model deteriorated when temperature, which has a weak correlation but a near-normal distribution, was removed. Additionally, global tilted irradiation correlated strongly with global horizontal irradiation, yet the experiment showed it had no effect on model performance, so it was eliminated to avoid redundancy.
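The correlation screen described above can be sketched as a plain Pearson-correlation filter against the 0.38 threshold the study identified. The data below are toy values chosen for illustration, not the Alice Springs measurements.

```python
# Screen candidate inputs by absolute Pearson correlation with
# active power, keeping those above the study's 0.38 threshold.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

active_power = [0.1, 1.2, 3.5, 4.8, 4.1, 2.0]      # kW (toy values)
candidates = {
    "GHI":      [50, 300, 700, 950, 820, 400],     # strongly correlated
    "Humidity": [70, 60, 80, 65, 55, 75],          # weakly correlated
}

THRESHOLD = 0.38
selected = [name for name, values in candidates.items()
            if abs(pearson(values, active_power)) > THRESHOLD]
print(selected)
```

As the temperature case shows, the study supplemented this screen with a check on each variable's distribution, so the threshold alone is not the whole selection rule.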
The same combinations were applied to a second dataset containing 1,333,693 data points, roughly six times larger than the first, and the MAE, MSE, and RMSE values decreased significantly: for the best combination, from 0.2, 0.148, and 0.385 to 0.055, 0.014, and 0.121. Big data and machine learning are closely related: massive datasets provide the training material that machine learning models need, and the more data there is, the better the models can learn and generalize, while machine learning in turn extracts valuable information and predictive models from big data. The first experiment yielded two best combinations: global horizontal irradiation with time, which requires the fewest inputs while maintaining good performance (more variables can cause high forecast errors, leading to time delays, cost overruns, and computational complexity), and global horizontal irradiation, diffuse horizontal irradiation, and temperature, which gave the best overall performance.
After the best combinations were selected, they were trained on the other algorithms: KNN, XGB, ET, RF, and long short-term memory. All algorithms achieved a coefficient of determination R² greater than 0.95, but the most performant were RF, XGB, and ET, with MAE, MSE, and RMSE reaching 0.046, 0.013, and 0.11 for RF. Scatter plots of true versus predicted values revealed that all algorithms struggled to predict values with active power greater than 5 kW, a weakness that was impossible to discern from the evaluation metrics alone. Active power was therefore classified into well-defined categories.
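The metrics quoted throughout (MAE, MSE, RMSE, R²) are simple to compute directly, and doing so makes the RMSE = √MSE relationship used in the text explicit. The values below are toy data, not the study's predictions.

```python
# Regression evaluation metrics computed from first principles.
import math

def regression_metrics(y_true, y_pred):
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_t = sum(y_true) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1.0 - (mse * n) / ss_tot        # 1 - SSE / SS_total
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse), "R2": r2}

y_true = [0.0, 1.0, 3.0, 5.0, 4.0]       # kW (toy values)
y_pred = [0.1, 0.9, 3.2, 4.6, 4.2]

m = regression_metrics(y_true, y_pred)
print(m)
```

Because all of these metrics average over the whole test set, a handful of badly missed peaks barely moves them, which is exactly why the scatter plots were needed.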
After four power classes ("Null," "Normal," "High," "Peak") were created, the same experiments performed for regression were repeated. The algorithms KNN, XGB, ET, RF, and LSTM, however, are constructed differently when solving a classification problem.
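Turning continuous active power into the four classes is a simple binning step; the sketch below shows it with illustrative thresholds (the study's exact bin edges are defined in its methodology) and counts the resulting class distribution, which also shows why "Peak" is the hardest class: it is the rarest.

```python
# Map active power (kW) into the four class labels, then count
# the class distribution with a Counter.
from collections import Counter

def power_class(p_kw):
    # Illustrative thresholds, not the study's exact bin edges.
    if p_kw <= 0.0:
        return "Null"
    if p_kw < 5.0:
        return "Normal"
    if p_kw < 7.0:
        return "High"
    return "Peak"

readings = [0.0, 0.0, 1.2, 2.3, 3.1, 4.9, 5.5, 6.2, 7.4]  # toy sample
labels = [power_class(p) for p in readings]
counts = Counter(labels)
print(counts)
```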
In conclusion, the study demonstrated that each approach, classification and regression, has its strengths, to the point where the two complement each other. XGB proved to be an essential, fast, and high-performing algorithm in both regression and classification. LSTM worked well in classification but not in regression, and RF yielded the best results in regression, although its heavy memory and time consumption makes it less practical in real-world scenarios.
For future work, we plan a more advanced and comprehensive study of the classification approach to active-power prediction. The study recommends building a hybrid model in which the regression model uses the classification results to achieve higher accuracy and performance than either separate approach.
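The recommended hybrid can be sketched as a two-stage pipeline: the classifier predicts the power regime, and its label is appended to the regressor's inputs so the regressor can specialize per regime. Everything below is a hypothetical placeholder, two hand-written rules standing in for the paper's trained classifier and regressor; it shows the data flow, not the models.

```python
# Two-stage hybrid sketch: classify first, feed the class into regression.

CLASS_IDS = {"Null": 0, "Normal": 1, "High": 2, "Peak": 3}

def classify(features):
    # Placeholder classifier: simple GHI thresholds (hypothetical rule).
    ghi = features["GHI"]
    if ghi < 50:
        return "Null"
    if ghi < 700:
        return "Normal"
    if ghi < 900:
        return "High"
    return "Peak"

def hybrid_predict(features):
    label = classify(features)
    augmented = dict(features, power_class=CLASS_IDS[label])
    # Placeholder regressor: a class-dependent linear rule standing in
    # for an RF/XGB regressor trained on the augmented feature set.
    slope = [0.0, 0.005, 0.007, 0.008][augmented["power_class"]]
    return label, slope * features["GHI"]

label, power = hybrid_predict({"GHI": 950, "Temperature": 31.0})
print(label, power)
```

The design intent is that the class feature lets the second stage correct exactly the peak regime where the pure regressors were weakest.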
9. Declarations

Ethics approval and consent to participate: Not applicable.

Consent for publication: Not applicable.

Availability of data and materials: The datasets analysed during the current study are available in the official online hub of DKA Solar Center, https://dkasolarcentre.com.au/source/alice-springs/dka-m6-a-phase

Competing interests: The authors declare that they have no competing interests.

Funding:

Authors' contributions:
Aouidad Hichem Idris: Conceptualization; Methodology; Software; Validation; Formal analysis; Investigation; Resources; Writing - Original Draft; Writing - Review & Editing; Visualization.
Bouhelal Abdelhamid: Supervision; Methodology; Writing - Review & Editing.