Solar Energy Analysis For The USA Market
• Descriptive Statistics for Variables
• Count for Missing Values = 0
• 2 Categorical Variables
• Hour_Min = 10, Hour_Max = 15
• Month_Min = 1, Month_Max = 12
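A data audit of this kind can be sketched with the Python standard library alone. This is a minimal illustration, not the study's code: the `records` rows are hypothetical sample values, and the helper names `count_missing` and `column_range` are assumptions. Only the column names (Hour, Month, Humidity, PolyPwr) come from the dataset.

```python
# Hypothetical sample rows; only the column names come from the dataset.
records = [
    {"Hour": 10, "Month": 1, "Humidity": 81.7, "PolyPwr": 2.4},
    {"Hour": 13, "Month": 6, "Humidity": 35.2, "PolyPwr": 17.9},
    {"Hour": 15, "Month": 12, "Humidity": 60.1, "PolyPwr": 5.3},
]


def count_missing(rows):
    """Return {column: number of rows where the value is None or absent}."""
    columns = sorted({key for row in rows for key in row})
    return {col: sum(1 for row in rows if row.get(col) is None) for col in columns}


def column_range(rows, column):
    """Return (min, max) of a numeric column, ignoring missing values."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return min(values), max(values)


missing = count_missing(records)
hour_range = column_range(records, "Hour")
month_range = column_range(records, "Month")
print(missing, hour_range, month_range)
```

On the full dataset the same checks would be expected to report zero missing values and the Hour and Month ranges listed above.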
Distribution plots of Input and Output Variables
[Figure: histogram grid; panels include Date, Time, Latitude, Longitude, Altitude, Month, Hour, Humidity, AmbientTemp, PolyPwr, Wind.Speed, Visibility, Pressure, Cloud.Ceiling]
Top Features Correlation HeatMap
Cyclicity of hour and month / Cyclicity of Sine and Cosine Functions
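The sine and cosine treatment of cyclic variables can be sketched as follows. This is an illustrative version under stated assumptions: the function name `cyclical_encode` and the period of 12 for Month are choices made here, not taken from the study's code. The idea is that each value is mapped to a point on the unit circle, so the end of a cycle sits next to its start.

```python
import math


def cyclical_encode(value, period):
    """Map a periodic value onto the unit circle as a (sin, cos) pair."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


# Month is assumed to have a period of 12.
december = cyclical_encode(12, 12)
january = cyclical_encode(1, 12)
june = cyclical_encode(6, 12)

# On the circle, December is close to January and far from June,
# which a raw 1..12 integer encoding would not capture.
print(math.dist(december, january), math.dist(december, june))
```

The cosine component of the month encoding would correspond to a feature such as Cos_Month. An hour-of-day variable could be encoded the same way with a period of 24.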
• Data Collection - The Solar PV dataset was downloaded from Kaggle.
• Data Audit and Data Cleaning - The data was checked for missing and duplicate values; there
were no missing values in the Solar PV dataset. Location and Season were treated as categorical
variables, while Time and Month were encoded with sine and cosine functions to capture cyclicity.
• Exploratory Analysis - Descriptive statistics, a correlation heatmap and correlation studies on
the independent feature variables, plus hypothesis tests between the input and output variables.
• Feature Selection - Based on the feature selection methods, Ambient Temperature, Humidity,
Pressure, Cos_Month and Cloud.Ceiling are the most important feature variables. The R_Square
value is 0.6599 for Random Forest and 0.5712 for Linear Regression.
• Testing and Training Data - We divide our data into 80% training data and 20% test data.
• Modeling - We used Linear Regression, AdaBoost, Decision Trees and Random Forest
Regression for our modeling exercise and chose Random Forest as the best model for our
M/L exercise.
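The 80/20 split and the R_Square metric used in the steps above can be sketched with the standard library. This is a minimal illustration, not the study's pipeline: the data is synthetic, a one-variable least-squares line stands in for the actual models, and the helper names `split_rows`, `fit_line` and `r_squared` are assumptions.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and return (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(pairs):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Synthetic (feature, output) pairs; not taken from the Solar PV dataset.
rng = random.Random(42)
data = [(x, 0.5 * x + 3.0 + rng.gauss(0.0, 1.0)) for x in range(100)]

train, test = split_rows(data, test_fraction=0.2, seed=0)
slope, intercept = fit_line(train)           # fit on the 80% training part only
predictions = [slope * x + intercept for x, _ in test]
score = r_squared([y for _, y in test], predictions)  # score on the held-out 20%
print(len(train), len(test), round(score, 4))
```

Scoring on the held-out 20% rather than on the training rows is what makes R_Square values comparable across models. In scikit-learn, `train_test_split` and `r2_score` provide the same two steps.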
System Architecture and M/L Validation Methods
• The top 5 selected features match those reported in 2 past studies; both
references have been provided.
• The R_Squared value for the Random Forest is consistent with a study listed in the
references.
• The data used does not include irradiance, which is an important parameter for
estimating solar power output.
• Many of the Python tree-based algorithms can require more computational
capacity than a personal laptop provides.
• This analysis used data from 12 geographical locations around the USA in a
controlled environment. Irradiance data can be very time-consuming to collect
and can have significant uncertainty.
• The study did not test any deep learning techniques or neural networks; these can
be explored in a follow-up exercise on this dataset.
Recommended Datasets for Future M/L Studies in Solar / Energy Sector
• Pasion, C.; Wagner, T.; Koschnick, C.; Schuldt, S.; Williams, J. & Hallinan, K.
Machine Learning Modeling of Horizontal Photovoltaics Using Weather and
Location Data. Energies 2020, 13, 2570; doi:10.3390/en13102570.
• Scikit-learn User Guide (Python library). https://scikit-learn.org/stable/user_guide.html