
An Analysis of Power Plant Data

Spring 2016

Richard Nehrboss (rdn4vg), Matthew Shih (mjs2uq)


Overview
This dataset comes from the recorded data of a Combined Cycle Power Plant working at
full load in Turkey over a span of six years (from 2006-2011). The original purpose of the data
collection was to record hourly averages of temperature, ambient pressure, relative humidity, and
exhaust vacuum in order to predict the net hourly electrical energy output. Originally, the data
set came from a paper regarding the prediction of energy output of a combined cycle power plant
(Heysem, Pınar, and Sadık). The data used here consists of the first 500 observations from the overall
dataset, which was downloaded from the University of California-Irvine’s
Machine Learning Repository (Lichman).

Data-related Terminology
A combined cycle power plant is composed of gas turbines, steam turbines, and heat
recovery steam generators. In this type of power plant, electricity is first generated by gas and
steam turbines (combined into one cycle), and is then transferred from one turbine to another.
The gas turbine not only generates electrical power, but also hot exhaust. Steam is produced by
rerouting this exhaust through a heat recovery steam generator. This rerouted exhaust is used to
produce steam that generates additional electricity through the steam turbine. Once this
additional electricity has been generated, the total energy output can be calculated from the
steam and gas turbines. Each cycle of this process is called a load. See Figure 1 in the appendix
for a more detailed diagram of a combined cycle power plant.

As for the variables, Average Temperature refers to the average temperature, expressed in
degrees Celsius, within the power plant during a load. Exhaust Vacuum refers to the
steam/condensation, expressed in centimeters of mercury (cm Hg), given off by the turbines during a load.
Ambient Pressure refers to the pressure of the surrounding air, expressed in millibars, within the
power plant during a load. Relative humidity refers to the amount of water vapor in the air,
expressed as a percentage of the amount needed for saturation at the same temperature, during a
load. Energy Output refers to the amount of electrical power, expressed in megawatts, generated by
the power plant during a load. See Figure 2 in the appendix for the first ten observations of the
data.

Variables (note: see Figure 3 in appendix for detailed summary statistics)


1. (AT) Average Temperature (°C)
   o Continuous, ranging from 1.81 °C to 37.11 °C
   o Average temperature within the power plant during a load

2. (EV) Exhaust Vacuum (cm Hg)
   o Continuous, ranging from 25.36 cm Hg to 81.56 cm Hg
   o Steam/condensation, given off by the gas turbine, that will be rerouted for use in the steam turbine during a load

3. (AP) Ambient Pressure (millibars)
   o Continuous, ranging from 992.89 millibars to 1033.30 millibars
   o Pressure of the surrounding air within the power plant during a load

4. (RH) Relative Humidity (%)
   o Continuous, ranging from 25.56% to 100.16%
   o Amount of water vapor in the air, expressed as a percentage of the amount needed for saturation at the same temperature, during a load

5. (EO) Energy Output (MW)
   o Continuous, ranging from 420.26 MW to 495.76 MW
   o Amount of electrical power generated by the power plant during a load

Furthermore, boxplots were created to graphically display some of the information
described in the summary statistics, including the minimum, maximum, and median. We
calculated these values and placed them in Figure 3 in the appendix. Outliers were also easily
depicted by the boxplots; Ambient Pressure (AP) was the only variable to have any significant
outliers, with a low outlier of 994.2 and a high outlier of 1033.0. See Figure 4 in the appendix
for the boxplots. These plots help us decide which kinds of tests to use and provide a basic
understanding of the distributions of the variables.

Several other plots were created to help further describe the data. As seen in Figure 6 of
the appendix, four scatterplots plotting Energy Output versus Average Temperature, Energy
Output versus Exhaust Vacuum, Energy Output versus Ambient Pressure, and Energy Output
versus Relative Humidity were created. Each of these graphs displays a fitted regression line
with a 99% confidence band. This helps us identify which variables should be
used in our regression analysis to predict the dependent variable, Energy Output, and gives a
general idea of these relationships.

In addition, violin plots for each of the variables were created, as shown in Figure 5 in the
appendix. Violin plots are similar to boxplots, but they also show the probability density for the
data at different values. The Average Temperature and Ambient Pressure violin plots show a
reasonably even distribution throughout the range of values. In contrast, the Exhaust Vacuum
data is very dense at the lower values, very thin in the middle values, and then somewhat dense
at the higher values. Meanwhile the Relative Humidity violin plot is dense at higher values and
then diminishes at the lower values, and the Energy Output is the opposite: it is dense at lower
values and then diminishes at the higher values. These violin plots give a rough idea of normality
and overall distribution; further graphs, such as histograms and Normal Q-Q plots, will be
created to support these conclusions.

We also created other graphics in our exploratory analysis to get an idea of which
hypothesis tests would be appropriate to run. Before conducting any tests, we need to check
whether each quantitative variable is normally distributed in order to decide which types of
tests, parametric or nonparametric, should be used for each variable. Thus, we created both
histograms and Normal Q-Q plots for all five variables. After examining the histograms (Figure
7 in the appendix), it appears that AT and AP are relatively normally distributed. Meanwhile,
RH is skewed to the left and EV and EO are skewed to the right. These observations are further
supported by the Normal Q-Q plots shown in Figure 8 in the appendix. However, since the
sample size of n=500 is relatively large and AP, AT, EO, and RH are only moderately skewed,
the central limit theorem implies that their sample means are approximately normally
distributed, so it is acceptable to perform parametric tests on these variables. On the other hand,
EV is strongly skewed with a bimodal distribution, so it would not be appropriate to run
parametric tests on this variable despite the large sample size of 500.
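For reference, the exploratory plots described above could be produced in R along the following lines. This is only a sketch: the data frame name plant is an assumption, while the column names match the variable abbreviations used in this report.

    # Assumed: a data frame 'plant' with columns AT, EV, AP, RH, EO
    vars <- c("AT", "EV", "AP", "RH", "EO")
    par(mfrow = c(2, 5))  # histograms on the top row, Q-Q plots on the bottom row
    for (v in vars) {
      hist(plant[[v]], main = paste("Histogram of", v), xlab = v)
    }
    for (v in vars) {
      qqnorm(plant[[v]], main = paste("Normal Q-Q plot of", v))
      qqline(plant[[v]])  # reference line for perfect normality
    }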

Hypothesis Testing
1. One sample t-test
This first test will examine the variable EO, Energy Output. Although the distribution of
EO is slightly skewed (Figure 7) and the Q-Q plot (Figure 8) is not perfectly normal, the sample

size of n=500 is large enough so that we may use a one-sample t-test and thus the normality
assumption is met. A z-test cannot be used because the population standard deviation is
unknown. This power plant was designed to have a nominal energy output of 480 MW (Heysem,
Pınar, Sadık). However, after examining the right skewed data, it appears that the power plant
may be producing at less than 480 MW per load. This test will help determine if the power plant
is operating at design specification. Thus, we are testing, with μ=mean of EO:

H0: μ = 480 MW versus H1: μ < 480 MW

H0: The power plant’s mean Energy Output is at the nominal level of 480 MW versus

H1: The power plant’s mean Energy Output is lower than the nominal level of 480 MW

This will be tested at the significance level α = 0.05. We will get a t test statistic which
will follow a Student's t distribution with df=499 degrees of freedom. The p-value generated by
this test is p=2.2e−16. The p-value is significantly less than the significance level of 0.05, so we
reject the null hypothesis with 95% confidence and conclude that the power plant’s mean energy
output is less than the nominal level of 480 MW. Therefore, there is evidence that the power
plant is not operating at design specification.
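A minimal sketch of how this test could be run in R is shown below; the report's actual code is not reproduced in the source, and the data frame name plant is an assumption.

    # One-sample t-test of H0: mu = 480 vs H1: mu < 480
    t.test(plant$EO, mu = 480, alternative = "less")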

2. Paired t-test
This next test will examine the variables AP, Ambient Pressure and AT, Average
Temperature. These two variables are related via the ideal gas law, PV =nRT . In this equation, P
refers to pressure and T refers to temperature. Thus, these two variables should be proportional,
and therefore dependent. We are comparing these two variables because if the two variables have
equal effects upon the closed system, then their adjusted means should be equal. The Q-Q
plots (Figure 8) show that AP resembles a normal distribution, while AT is slightly skewed.
However, since the sample size n=500 is quite large, the adjusted sample means of AP and AT
can be assumed to follow approximately normal distributions by the central limit theorem, and
the normality assumption is met. Each instance of AP corresponds to the instance of AT during
the same load, so we can use the paired t-test. The conversion factor for a closed system is
predicated upon a number of variables but leads to 40.78977 (Convert Units) for a comparable

power plant. Therefore, we divide AP by 40.78977 and store it in a new variable AP_AT. Thus,
we are testing μ_AP_AT = mean of AP converted into degrees Celsius and μ_AT = mean of AT:

H0: μ_AP_AT = μ_AT versus H1: μ_AP_AT ≠ μ_AT

H0: The mean Ambient Pressure (after conversion) is equivalent to the mean Average
Temperature versus

H1: The mean Ambient Pressure (after conversion) is not equivalent to the mean Average
Temperature

This will be tested at the significance level α = 0.05. We will get a t test statistic which
will follow a Student's t distribution with df=499 degrees of freedom. The p-value generated by
this test is p=2.2e−16. The p-value is less than the significance level of 0.05, so we reject the null
hypothesis with 95% confidence and conclude that the mean Ambient Pressure (after conversion)
is different from the mean Average Temperature in power plant loads. Thus, there is evidence that
the two variables do not have equal effects upon the closed system.
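A rough R sketch of this paired comparison follows, using the conversion factor stated above; the data frame name plant is an assumption.

    # Convert AP with the report's factor, then pair each value with AT from the same load
    AP_AT <- plant$AP / 40.78977
    t.test(AP_AT, plant$AT, paired = TRUE, alternative = "two.sided")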

3. Levene’s Test for Equality of Variances


The variables EV, Exhaust Vacuum and AP, Ambient Pressure will be tested for equality
of variances. These two variables are independent because a combined cycle power plant uses
two different types of turbines, gas turbines and steam turbines. EV is a factor in steam turbines,
while AP is a factor in gas turbines. Thus, these two variables are independent because they do
not interact with each other during a power plant load. It is possible that these variables equally
affect their corresponding turbines; therefore, we want to test if the population variances are
equal. AP has a normal distribution, while EV has a heavily skewed, bimodal distribution, which
makes EV non-normal. These results are confirmed by the Q-Q plots (Figure 8 in the appendix).
Due to this non-normality, we will run a Levene’s test for equality of variances because it is less
sensitive to non-normal distributions than the F-test for equality of variances and thus does not
assume normality. It does assume independent variables and this condition is thus satisfied. We
must take into account the conversion factor between cmHg and millibars. Doing research yields
the conversion 1 millibar = 13.332239 cmHg (Convert Units). Therefore, we divide EV by

13.332239 and store it in a new variable EV_AP. Thus, we are testing σ²_EV_AP = variance of EV
converted into millibars and σ²_AP = variance of AP:

H0: σ²_EV_AP = σ²_AP versus H1: σ²_EV_AP ≠ σ²_AP

H0: The variance of Exhaust Vacuum is equivalent to the variance of Ambient Pressure versus

H1: The variance of Exhaust Vacuum is not equivalent to the variance of Ambient Pressure

This will be tested at the significance level α = 0.05. The test statistic will follow a
Snedecor's F distribution with df1=1 and df2=998 degrees of freedom. The test generated an F
statistic of 628.81. The p-value generated by this test is p=2.2e−16. The p-value is significantly
less than our significance level of 0.05. Therefore we reject the null hypothesis and accept the
alternative hypothesis; there is significant evidence to conclude that the variances of the adjusted
Exhaust Vacuum and the Ambient Pressure are different. Therefore, we do not believe that they
equally affect their respective turbines.
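The report does not state which implementation of Levene's test was used; one common option in R is the car package, as in the hedged sketch below (data frame name plant assumed).

    library(car)
    # Stack the converted EV values and the AP values, with a factor marking the two groups
    EV_AP  <- plant$EV / 13.332239
    values <- c(EV_AP, plant$AP)
    group  <- factor(rep(c("EV_AP", "AP"), each = nrow(plant)))
    leveneTest(values, group)   # Levene's test of equal variances across the two groups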

4. Wilcoxon Rank Sum Test


Next, we will test the variables EV, exhaust vacuum and AP, ambient pressure. These
two variables are independent because a combined cycle power plant uses two different types of
turbines, gas turbines and steam turbines in separate systems. EV is a factor in steam turbines,
while AP is a factor in gas turbines. Thus, these two variables do not interact with each other
during a power plant load. If the environments in these two systems are similar then the medians
of these two variables should be equal. AP has a normal distribution, while EV has a strongly
skewed right distribution as shown by the histograms in Figure 7. This is also reflected by the
Normal Q-Q plots (Figure 8). Since EV is not normally distributed, we will run a nonparametric
Wilcoxon Rank Sum Test rather than a t test. While the distributions are different because EV has
bimodal properties, they are still fairly similar without any strong outliers, so we are permitted to run this
test. The variables are independent so this assumption is satisfied. We must take into account the
conversion factor between cmHg and millibars. Doing research yields the conversion 1 millibar
= 13.332239 cmHg (Convert Units). Therefore, we divide EV by 13.332239 and store it in a new

variable EV_AP. Thus, we are testing M_EV_AP = median of EV converted into millibars and
M_AP = median of AP:

H0: M_EV_AP = M_AP versus H1: M_EV_AP ≠ M_AP

H0: The median of Exhaust Vacuum is equivalent to the median of Ambient Pressure (the true
distribution shift is equal to 0) versus

H1: The median of Exhaust Vacuum is not equivalent to the median of Ambient Pressure (the
true distribution shift is not equal to 0)

This will be tested at the significance level α = 0.05. The test statistic is the Wilcoxon test
statistic, W. The p-value generated by this test is p=2.2e−16. The p-value is less than the
significance level of 0.05, so we reject the null hypothesis with 95% confidence and conclude
that the median of the adjusted EV is not equal to the median of AP. Thus, the true distribution shift
is not equal to 0 and there is therefore evidence that the environments are not similar.
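As a sketch, the corresponding R call could look like the following (again assuming the data frame name plant; the report's own code is not shown in the source).

    # Wilcoxon rank-sum test comparing the converted EV values with AP
    EV_AP <- plant$EV / 13.332239
    wilcox.test(EV_AP, plant$AP, alternative = "two.sided")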

5. One-sample z-test for a proportion


Now, we will test the variable AT, Average Temperature. According to our research, a
combined cycle power plant can reach a dangerous temperature at 32ºC or higher (Heysem,
Pınar, Sadık). AT appears to be only slightly skewed (Figure 7), and with a sample size of n=500
the normal approximation for the sample proportion is assumed to hold. We want to test whether
this dangerous temperature occurs more than 1% of the
time for this particular combined cycle power plant. Greater than 1% of temperatures in danger
of overheating the gas turbine could represent a long-term risk to the power plant (Carazas and
Martha de Souza). Examining the sample loads, we found that 14 of the 500 loads exceeded
the dangerous temperature level. Thus, we will test, with p = proportion of loads where the AT
exceeded the dangerous temperature level: H0: p = 0.01 versus H1: p > 0.01

H0: Dangerous temperatures occur 1% of the time in the combined cycle power plant (no long-
term risk) versus

H1: Dangerous temperatures occur more than 1% of the time in the combined cycle power plant
(possible long-term risk)

This will be tested at the significance level α = 0.05. The hypothesized proportion is p0 = 0.01, and
the test statistic will follow a standard normal distribution. The p-value generated by this test is
p = 2.614e−5. The p-value is less than the significance level of 0.05, so we reject the null
hypothesis with 95% confidence and accept the alternative hypothesis. We conclude that
dangerous temperatures do occur more than 1% of the time in the combined cycle power plant.
This could present a long-term risk for the power plant.
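A minimal R sketch of such a proportion test is below; whether a continuity correction was applied in the report is not stated, so that choice (and the data frame name plant) is an assumption.

    # 14 of 500 loads reached 32 degrees C or higher; test H0: p = 0.01 vs H1: p > 0.01
    x <- sum(plant$AT >= 32)   # should equal 14 for this sample
    prop.test(x, n = 500, p = 0.01, alternative = "greater", correct = FALSE)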

6. Two-sample z-test for proportions

Finally, we will test the variables AP, Ambient Pressure, and RH, Relative Humidity.
These are environmental variables, and we would like to see how each differs from its typical
readings in Turkey. This would allow us to see which one is more greatly affected by the
operation of the power plant. The test could enable us to determine what kind of effects the
power plant can have on the local environment; higher AP or RH proportions than normal could
indicate potential negative environmental effects. We want to determine if these environmental
factors are unequally impacted by the power plant. Ambient pressure in Turkey typically averages 30.02
inches Hg, which converts to 1016.5 millibars (“Weather Almanac for LTBA”), and relative humidity
averages 75.4 percent (“Relative Humidity in Istanbul, Turkey”). We will
determine what percentage of the readings for each of these variables is above the norm. Thus, we
will test p_AP = proportion of loads where AP is greater than 1016.5 and p_RH = proportion of loads
where RH is greater than 75.4%. After examining the data, the sample proportions are p̂_AP = 142/500 and p̂_RH = 227/500:

H0: p_AP = p_RH versus H1: p_AP ≠ p_RH

H0: The proportion of loads where AP is greater than 1016.5 is equal to the proportion of loads
where RH is greater than 75.4% versus

H1: The proportion of loads where AP is greater than 1016.5 is not equal to the proportion of
loads where RH is greater than 75.4%

This will be tested at the significance level α = 0.05. The test statistic will follow a standard
normal distribution. The p-value generated by this test is p = 2.541e−08. The p-value is less than
the significance level of 0.05, so we reject the null hypothesis with 95% confidence and conclude

that the proportion of loads where AP is greater than 1016.5 is not equal to the proportion of
loads where RH is greater than 75.4%. Therefore, there is evidence that these environmental
conditions are not equally affected by the combined cycle power plant.
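One way this comparison could be set up in R is sketched below, using the counts quoted above; the continuity-correction choice and the data frame name plant are assumptions.

    # Counts of loads above the typical Turkish levels for AP and RH
    x_AP <- sum(plant$AP > 1016.5)   # 142 in this sample
    x_RH <- sum(plant$RH > 75.4)     # 227 in this sample
    prop.test(c(x_AP, x_RH), n = c(500, 500), alternative = "two.sided", correct = FALSE)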

Modeling
Multiple Linear Regression with No Dummy Variables
We used multiple linear regression techniques in order to predict the Energy Output from
the other covariate information. First, we split our dataset of 500 observations
into a training dataset and a testing dataset. We set a fixed random seed and used R to randomly
select 100 observations to place into the test dataset. Then we removed these selected
observations from the training dataset, which left us with 400 observations on which to train our
model; the training and testing sets are therefore disjoint. We then used
three different explanatory variables in our maximal model to predict the dependent
variable, Energy Output.
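The split could be reproduced with code along these lines; the seed value shown is an arbitrary placeholder, since the report does not give the one actually used, and the data frame name plant is assumed.

    set.seed(1)                                   # placeholder seed; the report's actual seed is not given
    test_idx <- sample(seq_len(nrow(plant)), 100)
    test  <- plant[test_idx, ]                    # 100 observations held out for testing
    train <- plant[-test_idx, ]                   # remaining 400 observations used to fit the model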

Firstly, we used Average Temperature. We included Average Temperature in our model
because the power plant machinery has temperature-sensitive properties and may run more
efficiently at certain temperatures. Thus, for some temperatures we would expect the plant to
run more efficiently and thus produce a higher level of electricity output. Furthermore, the
scatterplot we produced between these two variables demonstrated a significant linear
relationship (Figure 6).

Secondly, we used Exhaust Vacuum. We included Exhaust Vacuum because this may be
a good indicator of the relative load the plant is running at. Higher loads are likely to produce
more steam and thus could lead to a higher exhaust vacuum measurement. This could therefore
be a good predictor for Energy Output. Furthermore, the scatterplot we produced between these
two variables demonstrated a significant linear relationship (Figure 6).

Finally, we used Ambient Pressure. We included Ambient Pressure because the Power
Plant machinery is designed to operate at an optimal level of internal pressure. The closer the
plant is to this level the more efficiently it will run. Thus, the measurement of internal pressure
is likely to be a predictor for the Energy Output of the Power Plant. Furthermore, the scatterplot

we produced between these two variables demonstrated a significant linear relationship (Figure
6).

We believe that Relative Humidity is not correlated with the energy produced because the
power plant machinery is designed to operate across a wide range of humidity levels and thus
should not be affected by the level of humidity in the environment. Furthermore, the scatterplot
produced in our exploratory data analysis of Energy Output plotted against relative humidity
showed little correlation (Figure 6). Thus we will not include this variable in our maximal model.

In addition, the possible interactions between these variables generate four interaction
terms which must be included in the maximal model:

o Average Temperature* Exhaust Vacuum


o Average Temperature*Ambient Pressure
o Exhaust Vacuum*Ambient Pressure
o Average Temperature* Exhaust Vacuum*Ambient Pressure

Before we set up our model, we check several assumptions in order to ensure that it is
appropriate to employ a multiple linear regression model. Firstly, all our explanatory variables
and the response variable are quantitative and, in this case, continuous. Secondly, all variables
have nonzero variance, which is easily seen from the data. Thirdly, the explanatory variables do not
demonstrate very strong multicollinearity. In order to test this, we plotted combinations of these
variables in scatterplots to check for collinearity (Figure 9); the variables appear to have
some collinearity but not strong collinearity, so we can proceed with the regression. Finally, the
variables are expected to produce a linear model. This is a reasonable assumption because the
Power Plant uses linear and not volumetric operation. If we had found that these assumptions were
not satisfied, we would have attempted nonlinear transformations, including logarithms and
reciprocals. After we complete the model, we will further test certain assumptions on the
residuals to ensure our model correctly fits the data. Our initial maximal model is therefore:

Eo = β1·At + β2·Ev + β3·Ap + β4·At*Ev + β5·At*Ap + β6·Ap*Ev + β7·At*Ap*Ev + β0 + error

o Eo = Energy Output by the Power Plant
o At = Average Temperature
o Ev = Exhaust Vacuum
o Ap = Ambient Pressure
o βi = constant coefficients

We generated this model using the lm() command in R. We then proceeded to attempt to
simplify this into a better model by removing terms deemed not significant, that is, terms whose
slopes can be considered zero.
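A sketch of how the maximal model and the associated tests could be run in R follows; the exact calls used in the report are not shown in the source, and the train data frame is the assumed result of the split described above.

    # Maximal model: all main effects plus all two- and three-way interactions
    fit <- lm(EO ~ AT * EV * AP, data = train)
    anova(fit)     # sequential ANOVA table with F tests for the terms
    summary(fit)   # overall F test, individual coefficient t tests, R-squared, residual SE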

First, we ran an ANOVA test to see if the explanatory variables collectively have a significant
effect on the dependent variable, Energy Output.

o H0: β1 = β2 = ··· = β7 = 0 vs H1: βj ≠ 0 for at least one j = 1, ..., 7

If we reject the null hypothesis, the model is significant for at least one explanatory
variable. If we fail to reject the null, there is no evidence that any of the explanatory variables
have a significant effect. The F test statistic will tell us whether our model is, in fact, significant and
will follow an F distribution. After completing the test, the ANOVA gave us a very small p-value of
2.2e-16 for the collective significance of the model; therefore the model is significant for at least
some explanatory variables.

Because the model is indeed significant for some variables we proceeded to do a t test for
slope on each explanatory variable to see which ones. This will give us a significance level for
each variable and allow us to remove potentially nonsignificant variables. We used the
summary() command in R to automatically run these tests. The null hypothesis for the test of
significance is that the given explanatory variable does not affect the Energy Output and thus
has a β coefficient of zero. The alternative hypothesis is that the given explanatory variable does
in fact influence the Energy Output and thus has a nonzero β coefficient.

o H0: βj = 0 vs H1: βj ≠ 0, for each j = 1, ..., 7

We then examined the significance of each of the explanatory variables in the given
output. We used a significance level of 0.05. Any variables with p-values greater than this level
were deemed insignificant. Our tests on each variable and their interaction terms all generated p-
values less than our significance level. Therefore, for each term, we rejected the null hypothesis
and accepted the alternative hypothesis; there is evidence that each term should have a nonzero
coefficient in our final model. Therefore, we do not need to simplify the model further, and we
proceeded to check the assumptions on our residuals. Thus, there were no variables removed
from our maximal model to our final model. The output for these tests is in Figure 10 in the
appendix.

However, if at least some other variables had had p values above .05, we would have
continued to simplify the model via the following process. We would have first checked to
ensure that the nonsignificant term with the highest p value was not involved in a significant
interaction term. If this was the case, we would have removed this term from the model with the
update() function. We would then have run the ANOVA test again with the above hypotheses. If the
new model was indeed collectively significant, then we would have again run the t tests on each
term. Thus, we would have proceeded to remove the nonsignificant term with the largest p-value
(provided it was not involved in a significant interaction) at each step, as above, until a final,
fully significant model was reached.

There are multiple assumptions on the residuals we needed to test to ensure that our
model is accurate. In order for our model to be considered good, it must have the following
properties: homoscedasticity, independent errors, and normally distributed errors. To test for
these, we executed the following:

First we tested for constant variance by plotting the predicted values vs residuals and
residuals versus each explanatory variable. This can be seen in Figure 13. Because there is no
discernible “cone” shape in any of the plots, it appears that there is evidence of constant
variance. Thus, this condition appears to be satisfied.

Then we tested for normality of residuals by generating a QQ plot of the residuals of the
model (Figure 12). Because the residuals follow the line of perfect normality relatively closely, it
appears that they are approximately normal; thus this condition is satisfied. We then calculated the
mean of the residuals with R and arrived at 1.51e-16, which is approximately zero; thus the mean
of the residuals is effectively zero and this condition is satisfied.
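Continuing the earlier sketch, these residual checks could be carried out in R roughly as follows (fit is the assumed name of the fitted model object).

    # Residual diagnostics for the fitted model
    plot(fitted(fit), resid(fit),
         xlab = "Fitted values", ylab = "Residuals")   # check for constant variance (no cone shape)
    qqnorm(resid(fit)); qqline(resid(fit))             # check normality of the residuals
    mean(resid(fit))                                   # should be approximately zero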

Because our model satisfies these requirements, it appears to be a good model and thus
we do not have to explore other, perhaps nonlinear, alternatives.

Thus we have arrived at our final model for predicting Energy Output:

Eo = −85.31·At − 50.23·Ev − 2.11·Ap + 0.081·At*Ap + 1.763·At*Ev + 0.04861·Ap*Ev − 0.00117·At*Ap*Ev + 2674 + error

This equation gives us a mathematical model for predicting our dependent variable,
Energy Output, from our three other variables. The βi give the individual coefficients for
each variable; they tell us how an incremental change in these variables will affect the Energy
Output. The β0 term gives the y-intercept of the model, that is, what Energy Output level would be
produced if all other variables were zero. The error term accounts for the error in the predictions
versus the actual data (residuals).

Thus we see that all three initial variables were helpful in predicting Energy Output along
with all of their interaction terms. They are all linear predictors with a mix of negative and
positive coefficients; this implies that a marginal increase in a given variable could lead to
either a marginal decrease or increase in Energy Output, depending on the variable.

After creating the model, we then ran our model on the “Test” data in order to see how
well our model could predict Energy Output. The “Test” data made up 20 percent of the original
dataset, randomly selected with a set seed in order to allow for replication. Because the majority
of the values used in the test dataset lie within the range of those used in the training dataset, this
testing consists of interpolation. We generated prediction intervals with 95% confidence.
After running the predictions based upon the explanatory variables and counting the number of
actual values that fell within their prediction intervals, we arrived at the number 91. Therefore, 91/100
observed values fell within the given prediction intervals. This is very close to our confidence level of
95%, which implies that we have a good model which can accurately predict the data.
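A hedged sketch of this interpolation check in R is given below, reusing the assumed fit and test objects from the earlier sketches.

    # 95% prediction intervals for the held-out test set
    pred_int <- predict(fit, newdata = test, interval = "prediction", level = 0.95)
    covered  <- test$EO >= pred_int[, "lwr"] & test$EO <= pred_int[, "upr"]
    sum(covered)   # number of the 100 test observations inside their intervals (91 in the report)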

During our analysis of the model, we also generated other statistics in order to gain
greater insight into the data. We calculated the Multiple R squared to be 0.9435. This gives the
strength of the relationship between the predictor variables and the dependent variable, Energy
Output. We also calculated the Adjusted R squared to be equal to 0.9425. This also gives the
strength of the relationship between the predictor variables and the dependent variable but
adjusts for the number of predictors. Finally, we calculated ANOVA statistics for our model
using the anova() function in R. This gave us a residual sum of squares of 6513 on 392 degrees
of freedom and a residual standard error of 4.076 on 392 degrees of freedom.

Time Series Analysis

As described in the overview, our dataset consists of hourly data from a Combined
Cycle Power Plant working at full load in Turkey. The dataset available for download contained
the first 500 values from the overall dataset. Since our data occurs over a period of time, we can
run a Time Series analysis on the data.

The original purpose of the data collection was to record hourly averages of Temperature,
Ambient Pressure, Relative Humidity, and Exhaust Vacuum along with the Energy Output for
the given power plant. Energy Output is ultimately the most important aspect of the given power
plant and is the variable which the other variables are used to predict. Thus, the specific variable
we are choosing to analyze via Time Series is Energy Output, EO. The other variables (except
relative humidity) are factors that can affect the power plant’s energy output. Therefore, EO is
akin to the dependent variable. Furthermore, EO describes how efficiently the power plant is
working overall. We want to analyze any possible trends and seasonality over time that may
describe the power plant’s productivity. A time series plot will be helpful to analyze any possible
trends or periodic fluctuations within the data. Furthermore, we can glean information about the
mean, variance, any abrupt changes, and potential outliers from this plot.

To begin the time series analysis, we first created a time series plot that can be seen in
Figure 14 in the appendix. We used the first 450 values and saved the final 50 for possible
modeling predictions. There does not appear to be any obvious periodic fluctuations throughout
the data and there does not appear to be any visible trends in the data except a possible a slight
depression in the first third of the time series. As such, our data is most likely not seasonal
because there is no relevant periodicity in the values of the time series. Furthermore, it does not
appear that there are any strong outliers in the data. This comports with our belief that if the
Combined Cycle Power Plant is operating efficiently and correctly, there is expected to be very
little identifiable seasonality. Drastic differences in energy output should not be present if
external factors are not affecting the power plant.

In order to confirm our initial conclusions about periodicity and trend, we plotted the
Autocorrelation Function (ACF) plot and the Partial Autocorrelation Function (PACF) plot. Both
of these can be seen in Figure 15 in the appendix. The ACF represents the similarity between
observations as a function of the time lag between them. It describes how a population at one
time point (t_i) is related to a previous time point (t_{i−h}), h points apart. It begins at a lag of zero.

The PACF represents a similar correlation, but it also controls for the values of the time series at
all shorter lags, removing any linear dependence across the time series. It describes the
relationship between a time point (t_i) and the value at lag h (t_{i−h}) once we have controlled
for the correlations at all of the successive time points between the two points (Source:
Lecture). It begins at a lag of one.
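The time series object and these two plots could be generated in R roughly as follows; the data frame name plant is an assumption, and the series is treated as a plain hourly sequence.

    # Treat the first 450 Energy Output values as an hourly time series
    eo_ts <- ts(plant$EO[1:450], frequency = 1)
    plot(eo_ts, ylab = "Energy Output (MW)")   # time series plot (Figure 14)
    acf(eo_ts)    # autocorrelation function
    pacf(eo_ts)   # partial autocorrelation function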

According to the ACF plot, the correlation drops off after time lag 0.
Furthermore, the ACF values at lags greater than zero hover around 0 with little variability. Thus,
there are no significant lags, and values are not related to previous values at any given lag. We
suspect that there is no obvious periodicity or seasonality throughout our time series. Since the
ACF is essentially zero at all nonzero lags, the data could be represented by a white noise model and there are
no significant MA terms. Next, we examined the PACF plot to determine if it supports our
findings in the ACF plot. None of the PACF values appear to be significant. In addition, the
PACF does not show a consistent decay or display any significant periodic or seasonal trends.
The PACF values simply hover around the value of 0 throughout the entire PACF plot.
Therefore, it appears that there are no significant lags in the data; levels of Energy Output are not
related to previous values even when controlling for lags between points; there appear to be no
significant AR terms.

After examining the ACF and PACF plots, we will consider stationarity. A stationary
time series has statistical properties such as mean, variance, and autocorrelation that are all
constant over time. These mathematical properties are important because many other Time
Series models and techniques assume stationarity. This includes ARMA, ARIMA, and SARIMA
models, which require stationarity in order to create the model. Further, stationarity is needed in
order to properly interpret the ACF and PACF plots. If stationarity is confirmed, we would be able to
better estimate other mathematical properties, and this would further support our supposition that
the Power Plant is not undergoing any significant changes in efficiency over time. Referencing
our findings for the ACF plot, the ACF values are all incredibly small and close to 0. There is
very little variability, so we conclude that the autocorrelation is constant over time. Furthermore,
the PACF values show similar values hovering around 0 which are far from being significant.
Thus, there are no significant time lags. We have constant autocorrelation. Looking at the
original time series plot also reveals that the mean does not seem to shift (there is no

increasing/decreasing trend) and the variance seems to be relatively consistent throughout the
entire plot. Thus, we conclude that our data is stationary. If we did not have stationarity, there are
a number of methods we could have used in order to achieve it. We could have tried to
difference the series to remove a trend or we could have doubly differenced the data in order to
remove both a possible trend and seasonality. We could have also used transformations such as
the log transformation in order to achieve constant variance. Further, we could have performed
spectral analysis in order to remove other periodic variations by utilizing sine and cosine
functions. We could also have attempted to detrend via linear regression by using covariate data
in other variables to remove the trend in our primary variable. However, because our data was
stationary, we did not need to employ any of the above methods.

Our data covers a span of 500 hours or just over 20 days. Although we believe that the
data is stationary, we still want to examine possible long term trends across this time period, so
we will smooth the data. Smoothing out the short-term random fluctuations creates a clearer
representation of specific trends in the data. The two-sided moving average model is most
appropriate here because we expect the Combined Cycle Power Plant’s Energy Output to be
consistent regardless of the time. The amount of time between two data points is less relevant, so
neither Kernel smoothing nor Nearest-neighbors regression would be as apt as a two-sided
moving average. Furthermore, two-sided moving averages are adaptable to slow changes across
time, so we can analyze our dataset based on different average time periods.

Two-sided moving average smoothing works by taking a shifting window of a set width
on both sides of a given point and averaging all values within that window. This average is
then plotted, the window is shifted to the next point, and the process is repeated.
This works to smooth out short-term fluctuations in the data.
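This smoothing could be computed in R with a symmetric filter, as in the sketch below; treating the report's k as the number of points averaged in each window is an assumption, and eo_ts is the series defined earlier.

    # Two-sided moving averages; k is the window width used in the report
    ma <- function(x, k) stats::filter(x, rep(1 / k, k), sides = 2)
    plot(eo_ts, col = "grey", ylab = "Energy Output (MW)")
    lines(ma(eo_ts, 6),  col = "red")    # quarter-day smoothing
    lines(ma(eo_ts, 12), col = "green")  # half-day smoothing
    lines(ma(eo_ts, 24), col = "blue")   # daily smoothing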

After some experimentation with window size based on grouping the data into time
periods, we decided on the following window sizes:

o k = 6 for quarter-day trends


o k = 12 for half-day trends (potential AM vs PM implications)
o k = 24 for daily trends

Originally, we were considering analyzing a weekly trend with k = 168, but the data was
vastly over-smoothed due to the larger time period. Thus, we smoothed the data based on the
above three window sizes. The resulting graph is shown in Figure 16 in the appendix. The red
line represents k=6, the green line represents k=12, and the blue line represents k=24. It appears
as if the green line with k=12 strikes the right balance between under-smoothing (too little
smoothing to reveal longer-term trends) and over-smoothing (eliminating possibly interesting trends).
Based on the smoothed data, we can see that the mean Energy Output is quite constant. Plus,
there does not appear to be much variance throughout the data. The only possible trend of note is
a decreasing and then increasing pattern throughout the first 100 values of the smoothed
data; however, this is very slight. We further plotted the residuals of the moving average
smoothing in Figure 20; it appears that the residuals are stationary for all three levels of
smoothing with a relatively constant variance. Overall though, the smoothed data supports the
conclusion of stationarity that we made based on the ACF and PACF plots and the original
analysis of the time series plot.

Note, although we do not have seasonal data, we still attempted spectral analysis in the
interest of completeness. However, it was very unsuccessful in explaining the nonexistent
seasonal variation and was therefore ultimately not included in the report.

Finally, we will attempt to build a model that better describes our data. As concluded
above from the ACF and PACF plot, the data is stationary and not seasonal. Therefore, we will
build an ARIMA model rather than a SARIMA model.

First we did exploratory data analysis by examining the PACF, ACF, and time series
plots and looking for significant lags and cycles. These plots were described above. From the
time series plot, it appears that the data is stationary with a constant variance. Thus we do not
need to stabilize the variance with a log transformation or difference the data.

An ARIMA model is described by ARIMA(p,d,q)x(P,D,Q)s. The P,
D, and Q values are 0 for a non-seasonal ARIMA model, and s=−1 because there is no seasonal trend.
Therefore, the only values left to determine are p, d, and q. Since the data was already
sufficiently stationary, we did not need to difference the data and d=0. Examining the ACF plot
will indicate the q value and the PACF plot will indicate the p value. According to the ACF plot,

the ACF drops off after time lag 0, therefore we take q=0. For the PACF plot, none of the values
are significant and the values are not decaying to 0 after a fixed lag, so we also take the value of
0 for p, p=0. Therefore, we will build model 1, ARIMA(0,0,0)x(0,0,0)-1. However, in the
interest of being exhaustive, we will check a number of other models despite there being little
evidence that they would be appropriate. We are primarily doing this to confirm our expectation
that a white noise model most accurately fits our data. We will also test model 2
ARIMA(1,0,0)x(0,0,0)-1, model 3 ARIMA(0,0,1)x(0,0,0)-1, model 4 ARIMA(1,0,1)x(0,0,0)-1,
and model 5 ARIMA(1,0,2)x(0,0,0)-1 for comparison of different nearby p and q values. We
created model 6 ARIMA(2,0,0)x(0,0,0)-1, model 7 ARIMA(0,0,2)x(0,0,0)-1, model 8
ARIMA(2,0,1)x(0,0,0)-1, and model 9 ARIMA(2,0,2)x(0,0,0)-1 for further comparison of p and q
values. Finally, we tested model 10 ARIMA(6,0,0)x(0,0,0)-1, model 11 ARIMA(0,0,6)x(0,0,0)-1,
and model 12 ARIMA(6,0,6)x(0,0,0)-1 because, although none of the lags in the ACF and PACF
plots are significant, it appears that 6 is the lag closest to significance for both the ACF and the
PACF plots (although still far from significant).
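A sketch of how some of these candidate models could be fit and compared in R is shown below. The AIC and BIC values in Figure 17 appear to be on a normalized (per-observation) scale, so the absolute values returned by the base functions below may differ; only the between-model comparison matters. The series eo_ts is the one defined earlier.

    # Fit a few of the candidate ARIMA(p,0,q) models and compare information criteria
    orders <- list(c(0, 0, 0), c(1, 0, 0), c(0, 0, 1), c(2, 0, 2), c(6, 0, 6))
    for (o in orders) {
      m <- arima(eo_ts, order = o)
      cat(sprintf("ARIMA(%d,%d,%d): AIC = %.3f, BIC = %.3f\n",
                  o[1], o[2], o[3], AIC(m), BIC(m)))
    }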

After creating these models, we examined the coefficient and standard error for each part
of the diagnostic models. Of the given models, many of the AR and MA terms are nonsignificant
so we can immediately throw out those models. We also examined the AIC and the BIC for each
of the twelve models. See the table in Figure 17 in the appendix for the AIC and the BIC for the
models. The AIC and BIC are methods of assessing model fit, adjusted for the number of
estimated parameters. The AIC assumes that the true model requires an infinite number of
parameters, while the BIC assumes that the true model has a fixed set of parameters. A lower
AIC or BIC value indicates a better model. As shown, model 1 clearly has the lowest BIC value
(about 0.013 below the next lowest) and the second lowest AIC value, although it is very close to
model 12, which has the lowest overall AIC. Based on these values, we will only further consider
model 1 and model 12; none of the other models seem to be optimal. Model 12 is the most
complicated model, with twelve AR and MA parameters. However, model 1 is the simplest model by
far, which is an important consideration in determining the best model, yet its AIC value is still
very close (within about 0.012) to model 12’s AIC. In addition, model 1’s Ljung-Box statistics are all greater
than the significance level. The time series plot of the residuals for this model appears to be
stationary with no visible trend, seasonality, or instability in variance. Furthermore, the ACF of

residuals shows that there are no significant lags and the QQ plot of residuals is relatively
normal. Thus, we can conclude that model 1 is the optimal model. An ARIMA(0,0,0)x(0,0,0)-1
model represents a white noise time series. This makes sense based on the nature of our data. A
power plant’s energy output should not have a periodic or seasonal trend to it. Furthermore, if it
is working correctly the respective energy outputs for every load should be independent of the
previous energy outputs. As a result, the Energy Output level should be random throughout time
and therefore follow a white noise model.

Despite the white noise model’s random nature, we decided to create predictions
for the final 50 values in our dataset to further confirm our choice of a white noise model
to represent our data. We created upper and lower prediction limits based on the standard
error for each of the time index values. Then we plotted our predicted values. This graph can be
seen in Figure 19 in the appendix. As shown, the predicted values are the same for all of the
time points. This is because we are using a white noise model to model the data; we are
attempting to “predict” randomness. Since a white noise model provides no information beyond the
series mean, all of the predicted values are identical. All 50 values from the testing
set were within the prediction interval. Thus, we further confirm our conclusion that our data is
best represented by an unpredictable white noise model.
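This forecasting step could be reproduced roughly as follows, reusing the assumed eo_ts and plant objects; the 1.96 multiplier gives approximate 95% limits and is an assumption about how the intervals were formed.

    # Forecast the 50 held-out values from the white noise model (ARIMA(0,0,0) with a mean)
    wn_fit <- arima(eo_ts, order = c(0, 0, 0))
    fc <- predict(wn_fit, n.ahead = 50)
    upper <- fc$pred + 1.96 * fc$se    # approximate 95% prediction limits
    lower <- fc$pred - 1.96 * fc$se
    mean(plant$EO[451:500] >= lower & plant$EO[451:500] <= upper)  # coverage of the held-out values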

Conclusion

By studying and applying a number of data analysis techniques to our data set, we have
uncovered a number of statistical properties that could yield relevant information. While it is
unlikely that this study uncovered every statistical property of the given dataset, by employing a
number of hypothesis tests, multiple linear regression modeling, and time series analysis, it made
a very good start.

Works Cited

Carazas, Fernando J., & Martha de Souza, Gilberto F. (2009). Availability Analysis of Gas

Turbines Used in Power Plants. International Journal of Thermodynamics, 28-37.

Retrieved February 18, 2016.

Heysem Kaya, Pınar Tüfekci, Sadık Fikret Gürgen. Local and Global Learning Methods for

Predicting Power of a Combined Gas & Steam Turbine. Dubai. March 2012. Proceedings

of the International Conference on Emerging Trends in Computer and Electronics

Engineering ICETCEE. 14 February 2016.

Krause, S. R., Merrion, D.F., and Green, G.L. "Effect of Inlet Air Humidity and Temperature on

Diesel Exhaust Emissions." SAE Technical Paper Series (1973). Web. 19 Feb. 2016.

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine,

CA: University of California, School of Information and Computer Science. 14 February

2016.

Hermanns, Helmutt. "Siemens Power Generation." Power-Technology. Web. 18 Apr. 2016.

"Relative Humidity in Istanbul, Turkey." Relative Humidity in Istanbul, Turkey. Web. 30 Apr.
2016.

"Weather Almanac for LTBA - January, 2006." Weather History for Istanbul, Turkey. Web. 30
Apr. 2016.

Appendix

Figure 1: Diagram showing split gas and steam turbine of combined cycle power
plant (Hermanns).

Average Exhaust Ambient Relative Energy
Temperature Vacuum Pressure Humidity Output
1 8.34 40.77 1010.84 90.01 480.48
2 23.64 58.49 1011.40 74.20 445.75
3 29.74 56.90 1007.15 41.91 438.76
4 19.07 49.69 1007.22 76.79 453.09
5 11.80 40.66 1017.13 97.20 464.43
6 13.97 39.16 1016.05 64.60 470.96
7 22.10 71.29 1008.20 75.38 442.35
8 14.47 41.76 1021.98 78.41 464.00
9 31.25 69.51 1010.25 36.83 428.77
10 6.77 38.18 1017.80 81.13 484.31

Figure 2: Table of first 10 observations of the data

                      Minimum   1st Quartile   Median    Mean     3rd Quartile   Maximum
Average Temperature      4.11          13.76    20.57    19.74           25.92     34.33
Exhaust Vacuum          35.57          41.46    50.83    54.09           66.54     79.74
Ambient Pressure        994.2         1008.0   1012.0   1013.0          1018.0    1033.0
Relative Humidity       32.97          62.68    74.00    72.64           84.55    100.20
Energy Output           425.2          438.9    451.3    454.1           468.5     495.2
Figure 3: Table of summary statistics for each variable

Figure 4: Boxplots of variables to display properties

Figure 5: Violin Plots of Continuous Variables

[Fitted regression lines shown in the panels: y = -2.171x + 497.035, y = -1.168x + 517.805, y = 0.455x + 420.965, y = 1.435x – 998.785]

Figure 6: Correlation scatterplots of Energy Output versus each of the other four variables during
a load, shown with a 99% confidence interval

Figure 7: Histograms showing distribution for each of the five variables

Figure 8: Normal Q-Q plots depicting normality (or lack of) for each of the five variables

Figure 9: Two Way Scatterplots to Check for Collinearity

Estimate Standard Error t-value Pr(>|t|)


(intercept) 2.674e+03 5.498e+02 4.863 1.68e-06
AP -2.110e+00 5.427e-01 -3.889 0.000118
AT -8.531e+01 2.468e+01 -3.456 0.000608
EV -5.023e+01 1.267e+01 -3.966 8.751e-05
AP*AT 8.108e-02 2.439e-02 3.324 0.00972
AT*EV 1.763e+00 4.913e-01 3.588 0.000375
AP*EV 4.861e-02 1.251e-02 3.885 0.000120
AP*AT*EV -1.170e-03 4.857e-04 -3.521 0.000480
Figure 10: Table of Multiple Linear Regression Output for T tests.

Figure 11: Plot of Regression Model Output and Residual QQ plot

Figure 12: Q-Q plot of residuals

Figure 13: Predicted vs Residuals and Residuals vs Explanatory Variables.

Figure 14: Time Series plot of Energy Output

Figure 15: ACF and PACF plots of the Energy Output

Figure 16: Time series plot with smoothing visible.
                                       AIC        BIC
Model 1:  ARIMA(0,0,0)x(0,0,0)-1   6.690818   5.699950
Model 2:  ARIMA(1,0,0)x(0,0,0)-1   6.694230   5.712493
Model 3:  ARIMA(0,0,1)x(0,0,0)-1   6.694256   5.712520
Model 4:  ARIMA(1,0,1)x(0,0,0)-1   6.698611   5.726006
Model 5:  ARIMA(1,0,2)x(0,0,0)-1   6.702982   5.739509
Model 6:  ARIMA(2,0,0)x(0,0,0)-1   6.698526   5.725921
Model 7:  ARIMA(0,0,2)x(0,0,0)-1   6.698491   5.725886
Model 8:  ARIMA(2,0,1)x(0,0,0)-1   6.702105   5.738632
Model 9:  ARIMA(2,0,2)x(0,0,0)-1   6.691215   5.736874
Model 10: ARIMA(6,0,0)x(0,0,0)-1   6.709989   5.773911
Model 11: ARIMA(0,0,6)x(0,0,0)-1   6.708444   5.772365
Model 12: ARIMA(6,0,6)x(0,0,0)-1   6.678641   5.797353

Figure 17: AIC and BIC of potential models

Figure 18: Output for ARIMA(0,0,0)x(0,0,0)-1

Figure 19: Prediction interval with the ARIMA(0,0,0)x(0,0,0)-1 model.

Figure 20: Residuals of Moving Average Smoothing

