
Journal of Building Engineering 80 (2023) 107956

Contents lists available at ScienceDirect

Journal of Building Engineering


journal homepage: www.elsevier.com/locate/jobe

A data-driven evidential regression model for building hourly energy consumption prediction with feature selection and parameters learning

Chao Liu a, Zhi-gang Su a,∗, Xinyi Zhang b

a School of Energy and Environment, Southeast University, Nanjing, Jiangsu 210096, China
b Nanjing Foreign Language School, Nanjing, Jiangsu 210008, China

ARTICLE INFO

Keywords: Building energy consumption prediction; Data-driven model; Feature selection; Dempster–Shafer theory; Interval prediction

ABSTRACT

Building energy consumption prediction is critical for building energy management and energy policy formulation, and its inherent uncertainty can significantly affect the utilization of current energy market benefits for market participants. To capture the uncertainty of energy consumption and enhance the predictive capability of the model, in this study, a data-driven evidential regression (EVREG) model with an integrated feature selection function is proposed based on Dempster–Shafer theory and mutual information, which can perform point prediction and interval prediction for building hourly energy consumption to describe its fluctuation and uncertainty. Different from the traditional EVREG model, this method enables simultaneous feature selection and model parameters learning instead of treating feature selection as a separate data pre-processing step. Specifically, an evaluation function is defined to describe the significance of a candidate feature, taking into account the predictive power of the regression model and the redundancy between the candidate feature and already selected features. According to a search strategy, features with high significance are selected to minimize the objective function. A real dataset from a commercial building is used to evaluate the performance of the proposed method. The results demonstrate that the proposed method can select fewer features while achieving better prediction performance compared to traditional feature selection methods used as data preprocessing. The proposed method also achieves better or comparable performance compared to commonly applied point prediction and interval prediction methods.

1. Introduction

In the context of carbon neutrality, the high energy consumption of buildings has been a growing concern. In 2019, the energy
consumption of the construction industry accounted for 45.8% of total energy consumption in China, of which the building operation
phase accounted for 21.2% [1]. With the acceleration of urbanization, energy consumption in the global construction industry will
continue to grow. Building energy consumption prediction models are essential for formulating building energy management strategies, such as energy saving control strategies [2], energy distribution planning [3] and power system management [4]. However, building energy consumption is affected by various uncertainties such as weather conditions, occupants’ usage habits, and time-varying building operation [5,6], making it difficult to achieve accurate predictions of building energy consumption.
Building energy prediction has attracted a lot of research attention in recent years, and current prediction models can be divided
into three types: physics-based white-box models [7], data-driven black-box models [8], and gray-box models [9] that combine the

∗ Corresponding author.
E-mail address: zhigangsu@seu.edu.cn (Z.-g. Su).

https://doi.org/10.1016/j.jobe.2023.107956
Received 29 April 2023; Received in revised form 10 September 2023; Accepted 15 October 2023
Available online 17 October 2023
2352-7102/© 2023 Elsevier Ltd. All rights reserved.

Nomenclature

EVREG Evidential regression


MLR Multiple linear regression
SVM Support vector machine
ANN Artificial neural network
DT Decision tree
LSTM Long short-term memory
QR Quantile regression
QRNN Quantile regression neural network
QD Quantile determination
BBA Basic belief assignment
MI Mutual information
LOO Leave-one-out
PDF Probability density function
NI Normalized mutual information
PINC Prediction interval nominal confidence
SFS Sequentially forward selection
SBS Sequentially backward selection
B&B Branch and bound
GA Genetic algorithm
MAE Mean absolute error
MAPE Mean absolute percentage error
RMSE Root mean square error
PICP Prediction interval coverage probability
PINAW Prediction interval normalized average width
PIARW Prediction interval average relative width
CWC Coverage width criterion
CVRMSE Coefficient of variation of RMSE
NMBE Normalized mean bias error
PCC Pearson correlation coefficient
SVR Support vector regression
BPNN Back propagation neural network
ELM Extreme learning machine
KDE Kernel density estimation

two. Benefiting from the rapid development of artificial intelligence and the legal disclosure of building energy consumption data,
data-driven models are more widely used for their simplicity in the modeling process and their powerful predictive capabilities.
Commonly used data-driven methods for building energy consumption prediction include multiple linear regression (MLR), support
vector machine (SVM), artificial neural network (ANN), decision tree (DT), and long short-term memory (LSTM). The above methods
have been successfully applied to energy consumption prediction for different building types and at different time scales; however, all of these methods focus only on point-value prediction of energy consumption. As mentioned earlier, there are many uncertainties in building energy consumption prediction, and it is difficult for point-value prediction models to capture the uncertainties brought by these factors. Interval-type prediction models are clearly better suited to describing these uncertainties. In addition, describing the uncertainties in building energy consumption forecasting in the form of
intervals can contribute greatly to building energy management and efficient operation of power systems. Developing power supply
schemes based on interval prediction results can ensure reliable supply of power systems under extreme weather conditions and
also avoid energy wastage due to excess power supply [10].
A number of studies have been conducted on energy interval prediction. Walter et al. [11] assumed that the training error of the
model has the same statistical distribution as its prediction error on the data to be predicted, and then used the error distribution on
training data to estimate the uncertainty of prediction result. The validity of this method was verified on datasets of 17 commercial
buildings. However, this assumption is difficult to satisfy when the training sample differs significantly from the sample to be
predicted. Taieb et al. [12] used a boosting procedure to construct an additive quantile regression (QR) model for estimating the
probability distribution of future energy consumption, and compared it with the prediction model based on normal distribution.
Similarly, He et al. [13] proposed quantile regression neural network (QRNN) model to predict annual electricity consumption
and introduced kernel density estimation to estimate the probability density of forecasting results, which improved the prediction performance compared with quantile regression. In [14], a parallel and improved electric load quantile forecasting method was
proposed and the reliability of the forecasting model was improved by an alternative quantile determination (QD) method. In the
above methods, since each quantile is predicted independently, quantile crossover may occur [15], which violates monotonicity. In
essence, current methods for predicting building energy intervals model the uncertainty based on the prediction bias distribution
over the training sample, without characterizing the data distribution in the sample space. When there is a lack of data or incomplete
information, the reliability of prediction intervals will be reduced. Evidence theory proposed by Dempster and Shafer, also known
as Dempster–Shafer theory, provides a unified framework for characterizing uncertainty information and enables reasoning in the
absence of prior probabilities. It extends point-value functional form to interval functional form, and plays an important role in
various fields such as fault diagnosis [16,17], risk analysis [18,19], image processing [20], and multicriteria decision making [21,22].
The nonparametric evidential regression (EVREG) [23] method proposed by Petit-Renaud and Denoeux, based on Dempster–Shafer
theory, neither relies on a specific probability distribution nor requires prior specification of functional forms. It can achieve both
point-value prediction and interval prediction, and the interval prediction of which can reflect the sample space distribution of
training data. For this reason, in this paper, EVREG will be applied to building energy consumption prediction.
For building energy consumption prediction, there are a large number of candidate input features (variables), such as weather-
related features and time-related features. Feature selection is an essential step for any data-driven model, which removes ineffective
candidates from the original feature set and reduces the size of the input feature subset [24,25]. Selecting a suitable and effective
subset of features for prediction not only reduces the complexity of computation and training time, but also improves the predictive
power. Therefore, how to select features is of great importance in the field of building energy consumption prediction. Commonly
used feature selection methods can be divided into two main categories: filter methods and wrapper methods. The former can be
regarded as a data pre-processing process performed independently of model training. The importance of each feature to the target
variable is evaluated based on the natural properties of the data or statistical criteria, and the most relevant features are selected
by a pre-set threshold. In wrapper methods, by contrast, feature selection and model training are performed simultaneously, using the
performance metric of a given model as the criterion for selecting feature subsets. For the current EVREG prediction models, filter
methods are adopted for feature selection. For example, in [26], for PV power forecasting, the Pearson correlation coefficients
between candidate input variables and PV power were first calculated and variables with strong correlation coefficients were
selected as input variables. In [27], features for equipment remaining useful life prediction were first extracted by statistical metrics,
frequency and time domain signal transformations, and empirical modal decomposition (EMD). Then, the extracted features were
selected based on monotonicity and trend indices. In [28], frequency ratio was used to quantify the contribution of each candidate
feature to the prediction and thus select features with high contribution. All the aforementioned feature selection methods are
performed independently of the training process of the EVREG model and fail to consider the impact of the selected features on the
prediction effect, which may result in the selected features contributing little to the improvement of the predictive ability. In fact, the
feature subset that can lead to promising results may always remove many redundant but relevant features from the candidates [29].
The selection of a suitable subset of features can improve the predictive capability of the model. Therefore, the objective of this
study is to propose an embedded feature selection method for EVREG model to simultaneously implement feature selection and
learning of the model parameters, and apply it to point prediction and interval prediction of building energy consumption.
To fill the research gap mentioned above, this study makes the following main contributions:
(1) A data-driven evidential regression model is proposed to predict building hourly energy consumption, allowing both point
prediction and interval prediction, which is crucial for effective building energy management.
(2) Feature evaluation takes into account the impact of features on the predictive capability of the model and describes the
redundancy among features using mutual information, ensuring that relevant features are selected while avoiding redundancy.
(3) Feature selection is performed simultaneously with parameters learning of the model, rather than as a separate data
pre-processing as in traditional EVREG, improving the performance of the model.
The rest of this paper is organized as follows. Section 2 introduces the preliminaries of the proposed method. Section 3 describes the research methodology and evaluation metrics. Section 4 conducts a case study and analyzes the results. Section 5 discusses the applications and limitations of this method. Finally, Section 6 summarizes the conclusions of this paper.

2. Preliminaries

2.1. Dempster–Shafer theory

Dempster–Shafer theory, as an extension of probability theory, is capable of better handling imprecise and uncertain information.
In this section, the basic concepts of Dempster–Shafer theory are introduced, and more information about related materials can be
found in [30].
Let 𝛺 be a finite set called the frame of discernment, which contains all possible answers to a given question of interest Q, and let 2^𝛺 be the set of all subsets of 𝛺. Then the mass function, which is the basic concept representing uncertainty about 𝑦, also called a basic belief assignment (BBA), can be defined as a mapping from 2^𝛺 to [0,1] verifying:

\sum_{A \subseteq \Omega} m_y(A) = 1 .    (1)

Each number 𝑚𝑦(𝐴) represents the belief assigned to the hypothesis that ‘‘𝑦 ∈ 𝐴’’ and that cannot be assigned to any more restrictive hypothesis, given the available knowledge. A mass function used to describe a certain problem is called a piece of evidence. Any subset 𝐴 of 𝛺 such that 𝑚(𝐴) > 0 is called a focal element of 𝑚. In particular, in case 𝑚(𝛺) = 1, the mass function is vacuous, which represents complete ignorance about the value of 𝑦.
Let 𝑚1 and 𝑚2 be two mass functions induced by two distinct sources. The combination of mass functions plays an important role in Dempster–Shafer theory. The conjunctive combination of 𝑚1 and 𝑚2 yields the following unnormalized mass function:

m_{1 \cap 2}(A) = \sum_{B \cap C = A} m_1(B) \, m_2(C) , \quad \forall A \subseteq \Omega .    (2)

If necessary, the normality condition m(\emptyset) = 0 can be recovered by dividing each mass m_{1 \cap 2}(A) by 1 - m_{1 \cap 2}(\emptyset). This operation is called Dempster’s rule of combination, denoted by \oplus:

m_{1 \oplus 2}(A) = \frac{m_{1 \cap 2}(A)}{1 - m_{1 \cap 2}(\emptyset)} , \quad \emptyset \neq A \subseteq \Omega .    (3)

Both operations are associative, commutative and admit the vacuous mass function as the only neutral element.
It may occur that we have some doubt about the reliability of the source of information inducing 𝑚. To solve this problem, the
discounting of mass functions is proposed [31]:

m^{\alpha}(A) = \begin{cases} (1-\alpha)\, m(A), & \forall A \subset \Omega , \\ \alpha + (1-\alpha)\, m(A), & A = \Omega . \end{cases}    (4)
The discount rate 𝛼 ∈ [0, 1] characterizes the degree of reliability of the information provided by the source. If 𝛼 = 0, one is sure that the information is absolutely reliable. On the contrary, if 𝛼 = 1, the information is known to be absolutely unreliable, and the resulting mass function is then vacuous.
Mass function describes a belief state, but when making decisions, the ‘‘credal level’’ is different from the ‘‘decision level’’, and
there are rational arguments strongly supporting the use of probabilities in a decision context [32]. For decision making from mass
function, the pignistic transformation is defined:
BetP(\omega) = \sum_{\{A \subseteq \Omega, \, \omega \in A\}} \frac{m(A)}{|A|} ,    (5)

where | ⋅ | denotes the cardinality of a focal element.
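These operations translate almost line-for-line into code. The sketch below is our own minimal illustration of Eqs. (2)–(5), not an implementation from the paper: a mass function is represented as a dict mapping focal elements, stored as frozensets, to masses, and all names are our choices.

```python
def combine_conjunctive(m1, m2):
    """Unnormalized conjunctive combination, Eq. (2)."""
    out = {}
    for A, va in m1.items():
        for B, vb in m2.items():
            C = A & B  # mass flows to the intersection (frozenset() = conflict)
            out[C] = out.get(C, 0.0) + va * vb
    return out

def dempster(m1, m2):
    """Dempster's rule of combination, Eq. (3): renormalize away the conflict."""
    m = combine_conjunctive(m1, m2)
    conflict = m.pop(frozenset(), 0.0)
    return {A: v / (1.0 - conflict) for A, v in m.items()}

def discount(m, alpha, omega):
    """Discounting, Eq. (4): move a fraction alpha of each mass to the frame omega."""
    out = {A: (1.0 - alpha) * v for A, v in m.items()}
    out[omega] = out.get(omega, 0.0) + alpha
    return out

def pignistic(m):
    """Pignistic transformation, Eq. (5): spread each mass evenly over its elements."""
    bet = {}
    for A, v in m.items():
        for w in A:
            bet[w] = bet.get(w, 0.0) + v / len(A)
    return bet
```

For example, combining m1 = {{a}: 0.6, Ω: 0.4} with m2 = {{a, b}: 0.5, Ω: 0.5} over Ω = {a, b, c} yields m({a}) = 0.6, m({a, b}) = 0.2 and m(Ω) = 0.2, with no conflict.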

2.2. EVREG: Evidential regression

Based on the concept and operations of ‘‘evidence’’ in Dempster–Shafer theory introduced in Section 2.1, the evidential regression
model is proposed [23]. For evidential regression problem, the frame of discernment 𝛺 denotes all possible values of the output
variable 𝑦, which is usually taken as the uniform distribution of the output in training set 𝑇 , denoted as 𝑈(𝑦min ,𝑦max ) . 𝑈 represents the
uniform distribution, and 𝑦min , 𝑦max represent the minimum and maximum values of the output in 𝑇 . Let 𝒙 be an arbitrary vector
to be predicted, 𝑦 be the corresponding unknown output and 𝒩𝐾 (𝒙) be the set of 𝐾 nearest neighbors of 𝒙 in 𝑇 . The information
on 𝑦 can be deduced from the neighbors. Each neighbor 𝒙𝑖 with output 𝑦𝑖 in 𝒩𝐾 (𝒙) can provide a piece of evidence concerning the
possible value of 𝑦, which can also be represented by the following mass function:
m_i(y = A \mid \boldsymbol{x}_i) = \begin{cases} \phi(d_i), & A = \{y_i\} , \\ 1 - \phi(d_i), & A = \Omega , \end{cases}    (6)
where 𝜙 is a decreasing function representing the discounting of the information provided by a neighbor according to the distance criterion 𝑑𝑖. 1 − 𝜙(𝑑𝑖) is the discount rate 𝛼𝑖, which determines the influence of 𝒙𝑖 on 𝒙. When 𝒙𝑖 is very far from 𝒙, 𝛼𝑖 tends to 1, and 𝒙𝑖 leaves us in a position of almost complete ignorance about the value of 𝑦; otherwise, if 𝒙 is ‘‘close’’ to 𝒙𝑖, the values of 𝑦 and 𝑦𝑖 are quite likely to be similar. The distance used here is the Euclidean distance 𝑑𝑖 = ‖𝒙 − 𝒙𝑖‖. So a natural choice for 𝜙 is:

\phi(d) = \theta \exp(-\gamma d^2) ,    (7)

where 𝜃 ∈ (0, 1) is a constant parameter (set to 0.95 in this paper), and 𝛾 > 0 is an important structural parameter of EVREG that controls the decay rate of the distance function. With Dempster’s rule, all 𝐾 pieces of evidence in 𝒩𝐾(𝒙) can be combined to compute the final BBA:

m = \bigoplus_{i=1}^{K} m_i(y = A \mid \boldsymbol{x}_i) .    (8)

After the final mass function is calculated, various forms of output can be obtained. As mentioned earlier, EVREG can produce both point and interval forecasts, with the point prediction \hat{y} and the lower and upper bounds of the interval prediction [\hat{y}_*, \hat{y}^*] given respectively by:

\hat{y} = \sum_{i=1}^{K} m(y = \{y_i\}) \cdot y_i + m(y = \Omega) \cdot \frac{y_{\min} + y_{\max}}{2} ,    (9)

\hat{y}_* = \sum_{i=1}^{K} m(y = \{y_i\}) \cdot y_i + m(y = \Omega) \cdot y_{\min} ,    (10)

\hat{y}^* = \sum_{i=1}^{K} m(y = \{y_i\}) \cdot y_i + m(y = \Omega) \cdot y_{\max} .    (11)

It can be noticed that the interval [\hat{y}_*, \hat{y}^*] contains \hat{y}, and the interval length can be interpreted as reflecting the uncertainty of the prediction.
𝛾 is an important structural parameter of EVREG, and its choice is crucial to the performance of the model. As suggested in [23], the leave-one-out (LOO) method is applied to optimize 𝛾 by minimizing the following criterion:

CV(\gamma) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i[\boldsymbol{x}_i, T^{-i}, \gamma] \right)^2 ,    (12)

where \hat{y}_i[\boldsymbol{x}_i, T^{-i}, \gamma] is the prediction of y_i based on the training set without the example (\boldsymbol{x}_i, y_i). The estimator \hat{\gamma} of parameter \gamma is then obtained by minimizing this criterion:

\hat{\gamma} = \arg\min_{\gamma} CV(\gamma) .    (13)
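Putting Eqs. (6)–(11) together, a minimal EVREG predictor can be sketched as follows. This is our own illustrative reading of the method, not the authors' code: it uses the closed form of Dempster's rule for simple support functions, assumes the neighbor outputs are pairwise distinct, and the function name and signature are our choices.

```python
import math

def evreg_predict(X, y, x, K=5, gamma=1.0, theta=0.95):
    """Sketch of EVREG point and interval prediction, Eqs. (6)-(11)."""
    y_min, y_max = min(y), max(y)
    # K nearest neighbors of x by Euclidean distance (Sec. 2.2)
    neigh = sorted((math.dist(x, xi), yi) for xi, yi in zip(X, y))[:K]
    # Eqs. (6)-(7): neighbor i supports {y_i} with mass phi(d_i) = theta*exp(-gamma*d^2)
    s = [(theta * math.exp(-gamma * d * d), yi) for d, yi in neigh]
    # Eq. (8): closed form of Dempster's rule for simple support functions,
    # assuming the neighbor outputs y_i are pairwise distinct
    prod_all = 1.0
    for si, _ in s:
        prod_all *= 1.0 - si
    masses = {}
    for i, (si, yi) in enumerate(s):
        mi = si
        for j, (sj, _) in enumerate(s):
            if j != i:
                mi *= 1.0 - sj
        masses[yi] = masses.get(yi, 0.0) + mi
    total = sum(masses.values()) + prod_all          # = 1 - conflict
    sing = sum(m * yi for yi, m in masses.items()) / total
    m_omega = prod_all / total                       # mass remaining on Omega
    point = sing + m_omega * (y_min + y_max) / 2.0   # Eq. (9)
    return point, (sing + m_omega * y_min,           # Eq. (10)
                   sing + m_omega * y_max)           # Eq. (11)
```

A faraway neighbor contributes almost nothing to the singleton masses, so its influence flows into m(Ω) and widens the predicted interval, which is exactly the behavior described above.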

2.3. Mutual information

Mutual information (MI) is a measure of the amount of information shared between random variables and can be used for
evaluating the interdependence between them [33]. Traditional correlation analysis is sensitive only to linear relationships between
variables [29], while MI can capture arbitrary relationships between variables, both linear and nonlinear. Therefore, in this study,
mutual information is utilized to describe the redundancy between features, the validity of which has been verified in previous
studies [34,35]. MI between two continuous random variables 𝐴 and 𝐵 is defined as:
I(A, B) = \iint P(a, b) \log \frac{P(a, b)}{P(a)\, P(b)} \, da \, db ,    (14)
where 𝑃(𝑎) and 𝑃(𝑏) are the marginal probability density functions (PDFs) of variables 𝐴 and 𝐵, respectively, and 𝑃(𝑎, 𝑏) is their joint probability density function. The value of 𝐼(𝐴, 𝐵) represents how much the uncertainty of 𝐴 can be reduced by observing 𝐵; that is, the greater the value of 𝐼(𝐴, 𝐵), the stronger the inferential power of 𝐵. 𝐼(𝐴, 𝐵) = 0 means that observing 𝐵 removes no uncertainty about 𝐴, while observing 𝐴 itself removes all of its uncertainty. Hence, we can obtain the value interval of mutual information:

0 ≤ 𝐼(𝐴, 𝐵) ≤ 𝐼(𝐴, 𝐴) . (15)

In order to assess the degree of interdependence between different variables, MI should be normalized. In this paper, the normalized mutual information (NI) is defined as:

NI(A, B) = \frac{I(A, B)}{I(A, A)} .    (16)
According to Eq. (15), we can find that 𝑁𝐼(𝐴, 𝐵) ∈ [0, 1] and the value of 𝑁𝐼(𝐴, 𝐵) can reflect the degree of interdependence
between variables.
To calculate MI, in this study, the 𝑘-nearest neighbor estimation method proposed in [36] is adopted, which can estimate MI
directly from the distance of k-nearest neighbors without estimating PDFs and does not require assumptions on the data distribution.
Specifically, the MI between two continuous random variables can be estimated as follows:

I(A, B) = \psi(k) - \frac{1}{k} - \frac{1}{N} \sum_{i=1}^{N} \left[ \psi(n_a(i)) + \psi(n_b(i)) \right] + \psi(N) ,    (17)

where 𝜓 is the digamma function 𝜓(𝑎) = 𝛤(𝑎)^{−1} 𝑑𝛤(𝑎)∕𝑑𝑎, and 𝑘 is the number of nearest neighbors (as suggested in [36], we use 𝑘 = 3 in this study). 𝑛𝑎(𝑖) is the number of points 𝑎𝑗 with ‖𝑎𝑖 − 𝑎𝑗‖ ≤ 𝜖𝑎(𝑖)∕2, where 𝜖𝑎(𝑖)∕2 is the distance between 𝑎𝑖 and its 𝑘th nearest neighbor; the definitions of 𝑛𝑏(𝑖) and 𝜖𝑏(𝑖) are analogous to those of 𝑛𝑎(𝑖) and 𝜖𝑎(𝑖).
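As a dependency-free illustration of Eqs. (14)–(16), the sketch below estimates NI with a simple histogram (binning) estimator rather than the k-nearest-neighbor estimator of Eq. (17) that the paper actually uses; the function names and the bin count are our own choices.

```python
import math
from collections import Counter

def mi_hist(a, b, bins=8):
    """Binned estimate of I(A;B), Eq. (14), in nats (a crude stand-in for Eq. 17)."""
    n = len(a)
    def to_bins(v):
        lo, hi = min(v), max(v)
        w = (hi - lo) / bins or 1.0          # guard against constant inputs
        return [min(int((x - lo) / w), bins - 1) for x in v]
    ia, ib = to_bins(a), to_bins(b)
    pa, pb, pab = Counter(ia), Counter(ib), Counter(zip(ia, ib))
    # sum over occupied joint cells of p(a,b) * log(p(a,b) / (p(a) p(b)))
    return sum((c / n) * math.log(c * n / (pa[x] * pb[y]))
               for (x, y), c in pab.items())

def normalized_mi(a, b, bins=8):
    """NI(A,B) = I(A,B) / I(A,A), Eq. (16); lies in [0, 1]."""
    return mi_hist(a, b, bins) / mi_hist(a, a, bins)
```

For a variable paired with itself NI is exactly 1, for a deterministic monotone transform it stays high, and for an independent variable it drops toward 0, which is the behavior Eq. (16) is meant to capture.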

3. Research methodology

In this study, we propose an improved EVREG model for building hourly energy consumption prediction that simultaneously
allows feature selection and parameters learning. The model constructs evidence based on the historical operational data of
the building. For new data to be predicted, its nearest neighbors are retrieved from the evidence base, and the evidence is
discounted based on the distance between them. The discounted evidence is then combined to obtain the prediction results of
energy consumption. While learning the model parameters, feature selection is carried out simultaneously by taking into account
the impact of features on the predictive capability of the model and the redundancy between features.


3.1. Problem formulation

The proposed method can be formulated as an optimization problem of evaluating the significance of features and searching for the optimal neighborhood size 𝐾*, its corresponding 𝛾*, and the minimal feature subset ℬ*. Formally speaking, we want to solve

(K^*, \mathcal{B}^*, \gamma^*) = \arg\min_{K, \mathcal{B}, \gamma} J(K, \mathcal{B}, \gamma)    (18)

with the objective function J defined as follows:

J(K, \mathcal{B}, \gamma) = PC(K, \mathcal{B}, \gamma) + \lambda \sum_{b_i, b_j \in \mathcal{B}} NI(b_i, b_j) , \quad i = 1, \ldots, |\mathcal{B}| - 1, \; j > i .    (19)

The first term reflects the predictive capability of the model when performing point prediction or interval prediction. The second
term is a penalty defined by normalized mutual information, which describes the redundancy among the selected features. This penalty term also constrains the number of selected features, which can prevent the model from overfitting to some extent. The hyperparameter 𝜆 is a penalty factor that controls the tradeoff between the two terms in Eq. (19); the larger 𝜆 is, the stronger the constraint on feature selection. Since the evaluation criteria differ between point prediction and interval prediction, the specific definition of 𝑃𝐶 differs as well.
𝐶𝑉 defined in Eq. (12) can be directly adopted for point prediction, which reflects the fitting ability on the training set and can
approximate the prediction accuracy of the model. Considering the large difference in magnitude between 𝐶𝑉 and 𝑁𝐼, 𝜆 is taken as 0.0001 to bring the two terms to the same order of magnitude. When performing interval prediction, we expect the
predicted interval to get both higher accuracy and better quality, so we use the comprehensive evaluation index CWC [37,38] as
an indicator to assess the prediction capability of the model; the smaller the CWC, the greater the prediction capability. Its specific
calculation method will be explained in detail in Section 3.3. Accordingly, the criterion should change when the optimization of 𝛾
is performed:

CV_{interval}(\gamma) = \left( CWC_{PINC=80\%} + CWC_{PINC=90\%} + CWC_{PINC=95\%} \right) / 3 ,    (20)

where CWC has three subscripts corresponding to the prediction results of the model on the training set at prediction interval
nominal confidence (PINC) levels of 80%, 90%, and 95%, respectively. Then, the redefined 𝐶𝑉 can be used as the first term in
Eq. (19). In this case, taking into account the order of magnitude difference between the two terms, 𝜆 is taken as 0.1.
Hence, the significance of a candidate feature 𝑎 relative to the already selected feature subset ℬ, denoted by 𝑆𝐼𝐺, is defined as follows:

SIG(a, K, \mathcal{B}, \gamma) = J(K, \mathcal{B}, \gamma) - J(K, \mathcal{B} \cup \{a\}, \gamma) .    (21)

The evaluation function in Eq. (21) indicates that adding an informative feature yields a positive significance. For a given 𝐾, the minimal feature subset ℬ* of the whole feature set 𝒜 is found when the condition 𝑆𝐼𝐺(𝑎, 𝐾, ℬ, 𝛾) < 0 holds for all 𝑎 ∈ 𝒜 − ℬ.
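The objective of Eq. (19) and the significance of Eq. (21) can be sketched as follows, with the predictive-capability term 𝑃𝐶 supplied externally (CV for point prediction, averaged CWC for interval prediction). The function names and the dictionary layout for precomputed NI values are illustrative assumptions, not the paper's code.

```python
def objective(pc, selected, ni, lam):
    """Eq. (19): J = PC + lambda * sum of pairwise NI over the selected features.

    `pc` is the predictive-capability term already evaluated for this subset;
    `ni[(a, b)]` holds the precomputed normalized mutual information, keyed by
    the sorted feature pair."""
    penalty = sum(ni[tuple(sorted((selected[i], selected[j])))]
                  for i in range(len(selected))
                  for j in range(i + 1, len(selected)))
    return pc + lam * penalty

def significance(pc_without, pc_with, selected, candidate, ni, lam):
    """Eq. (21): SIG = J(B) - J(B U {a}); positive when adding `candidate` helps."""
    return (objective(pc_without, selected, ni, lam)
            - objective(pc_with, selected + [candidate], ni, lam))
```

A candidate that lowers 𝑃𝐶 only slightly but is highly redundant with the selected features receives a negative significance, which is exactly what keeps redundant features out of ℬ.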

3.2. Feature selection and parameters learning

As previously stated, our purpose is to propose an improved EVREG model with a feature selection function, so the search strategy used to select the optimal feature subset is important. There are several candidate search strategies for selecting the minimal subset of features, such as greedy strategies like sequentially forward selection (SFS) and sequentially backward selection (SBS) [39], the branch and bound (B&B) search strategy [40], and genetic algorithm (GA)-based feature selection [41]. In this study, SFS is adopted for convenience. First, for a given 𝐾, the model is trained using only one feature and 𝛾 is optimized according to Eq. (13); all features are iterated through, and the feature 𝑎𝑘 satisfying the following criterion is added to ℬ:

SIG(a_k, K, \mathcal{B}) = \max_{a_i \in \mathcal{A} - \mathcal{B}} \{ SIG(a_i, K, \mathcal{B}) \} .    (22)

This operation is repeated, adding selected features one by one, until no candidate yields a positive 𝑆𝐼𝐺, which ends the feature selection and the optimization of 𝛾 for the current 𝐾. According to the above interpretations, our method can be realized as Algorithm 1.

3.3. Model performance evaluation

To evaluate the accuracy of the model in building energy consumption prediction, the performance of the point forecasting is
evaluated with three metrics: mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean square error
(RMSE), calculated as shown in Eqs. (23)–(25).

MAE = \frac{1}{n} \sum_{i=1}^{n} | \hat{y}_i - y_i | ,    (23)

MAPE = \frac{1}{n} \sum_{i=1}^{n} \frac{| \hat{y}_i - y_i |}{y_i} ,    (24)


Algorithm 1 Procedure of the proposed method

Input: training data (𝒙, 𝑦), feature set 𝒜, bounds [K_min, K_max] for 𝐾, 𝜆, and testing data 𝒙_t, 𝑡 = 1, 2, …, 𝑛_t
Output: optimal 𝐾*, 𝛾*, selected feature subset ℬ*, estimations 𝑦̂_t of 𝒙_t
1: Calculate 𝑁𝐼 between each pair of features in 𝒜
2: 𝐾* = K_min, 𝛾* = 0, ℬ* ← ∅, 𝐽* = inf
3: for 𝐾 = K_min to K_max do
4:   ℬ ← ∅, 𝛾 = 0
5:   while 𝒜 − ℬ ≠ ∅ do
6:     for each 𝑎_i ∈ 𝒜 − ℬ do
7:       compute 𝛾̂_i, 𝑆𝐼𝐺(𝑎_i, 𝐾, ℬ)
8:     end for
9:     Select 𝑎_k: 𝑆𝐼𝐺(𝑎_k, 𝐾, ℬ) = max{𝑆𝐼𝐺(𝑎_i, 𝐾, ℬ)}
10:    if 𝑆𝐼𝐺(𝑎_k, 𝐾, ℬ) > 0 then
11:      ℬ ← ℬ ∪ {𝑎_k}, 𝛾 = 𝛾̂_k
12:    else
13:      break
14:    end if
15:  end while
16:  if 𝐽(𝐾, ℬ, 𝛾) < 𝐽* then
17:    ℬ* ← ℬ, 𝐾* ← 𝐾, 𝛾* = 𝛾, 𝐽* ← 𝐽(𝐾, ℬ, 𝛾)
18:  end if
19: end for
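A compact sketch of the SFS loop of Algorithm 1 is given below, with model training abstracted behind a `score` callback standing in for EVREG training with LOO-optimized 𝛾 and evaluation of 𝐽. Since 𝑆𝐼𝐺(𝑎) > 0 is equivalent to 𝐽 decreasing when 𝑎 is added, the sketch accepts a candidate only if it lowers 𝐽. The function name and callback interface are our own assumptions.

```python
def sfs_evreg(features, k_bounds, score):
    """Sketch of Algorithm 1: sequential forward selection over K and features.

    `score(subset, K)` is a placeholder for training the model on `subset`
    and returning the objective J of Eq. (19); lower is better."""
    best_j, best_k, best_subset = float('inf'), None, None
    for K in range(k_bounds[0], k_bounds[1] + 1):
        selected, j_cur = [], float('inf')
        while len(selected) < len(features):
            # evaluate every remaining candidate and keep the best one (Eq. 22)
            candidates = [f for f in features if f not in selected]
            cand = min(candidates, key=lambda f: score(selected + [f], K))
            j_new = score(selected + [cand], K)
            if j_new < j_cur:          # equivalent to SIG(cand, K, B) > 0
                selected, j_cur = selected + [cand], j_new
            else:                      # no candidate improves J: stop for this K
                break
        if j_cur < best_j:             # keep the best (K, subset) pair overall
            best_j, best_k, best_subset = j_cur, K, selected
    return best_k, best_subset, best_j
```

The callback design keeps the search strategy independent of the regression model, so the same loop works for the point-prediction objective (CV-based) and the interval-prediction objective (CWC-based).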


RMSE = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} ( y_i - \hat{y}_i )^2 } ,    (25)
where 𝑛 is the number of samples, 𝑦𝑖 and 𝑦̂𝑖 are the real value and predicted value of building energy consumption respectively. All
metrics measure the difference between the predicted value and the true value. The smaller the metrics, the better the accuracy of
the model.
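Eqs. (23)–(25) translate directly into code. The sketch below is a plain implementation of the three point-prediction metrics; it assumes no true value is zero, since MAPE would otherwise be undefined.

```python
import math

def point_metrics(y_true, y_pred):
    """MAE, MAPE and RMSE of Eqs. (23)-(25); assumes every y_true value is nonzero."""
    n = len(y_true)
    mae = sum(abs(p - t) for t, p in zip(y_true, y_pred)) / n
    mape = sum(abs(p - t) / t for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    return mae, mape, rmse
```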
Interval prediction provides the upper and lower bounds of building energy consumption prediction under a given confidence
level, which indicates that the performance of interval prediction should be evaluated in terms of both accuracy and quality. The
accuracy of interval prediction is assessed according to the prediction interval coverage probability (PICP) defined as:

PICP = \frac{1}{n} \sum_{i=1}^{n} C_i ,    (26)

where 𝐶𝑖 is calculated as shown in Eq. (27):


C_i = \begin{cases} 1, & y_i \in [y_i^L, y_i^U] , \\ 0, & y_i \notin [y_i^L, y_i^U] , \end{cases}    (27)

where 𝑦_𝑖^𝐿 and 𝑦_𝑖^𝑈 are the lower and upper bounds of the prediction interval, respectively. PICP indicates the probability that the predicted interval covers the actual energy consumption data; the larger the PICP, the more accurate the interval prediction.
The quality of interval prediction is evaluated according to the prediction interval normalized average width (PINAW) as defined
in Eq. (28):

PINAW = \frac{1}{n (y_{\max} - y_{\min})} \sum_{i=1}^{n} ( y_i^U - y_i^L ) ,    (28)

where 𝑦max and 𝑦min are the maximum and minimum values of the actual energy consumption. PINAW is used to describe the average
width of the prediction intervals. For a given confidence level, a smaller width implies a better quality of the interval prediction.
When performing interval forecasting, we want to simultaneously obtain higher accuracy and better quality, i.e., larger PICP and
smaller PINAW, but these two are often in conflict. Therefore, in order to evaluate the narrowness of the prediction interval and
the sample coverage in a comprehensive manner, the coverage width criterion (CWC) is adopted for evaluation. Since there are
various definitions of metrics for evaluating the narrowness of intervals, such as PINAW and prediction interval average relative
width (PIARW), the corresponding calculations for CWC can differ. However, the basic principle is to achieve a compromise between
the informativeness and correctness of the prediction interval. In this study, considering that the CWC is also used as a criterion for
performing parameters learning and feature selection, the following formula is used:

CWC = (1 + \eta_1 \, PINAW) \left( 1 + \delta \, e^{-\eta_2 (PICP - PINC)} \right) ,    (29)


Fig. 1. Research framework.

where PINC is the prediction interval nominal confidence. The hyperparameter 𝜂1 linearly magnifies PINAW, and the hyperparameter 𝜂2 exponentially magnifies the difference between PICP and PINC; 𝛿 is a step function of PICP described by Eq. (30):

\delta = \begin{cases} 0, & PICP \ge PINC , \\ 1, & PICP < PINC . \end{cases}    (30)
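The interval metrics of Eqs. (26)–(30) can be sketched as follows. The values of 𝜂1 and 𝜂2 are illustrative placeholders, since this excerpt does not fix them.

```python
import math

def interval_metrics(y_true, lower, upper, pinc, eta1=1.0, eta2=50.0):
    """PICP, PINAW and CWC of Eqs. (26)-(30); eta1, eta2 are illustrative defaults."""
    n = len(y_true)
    # Eq. (26)-(27): fraction of true values covered by their intervals
    picp = sum(1 for t, lo, hi in zip(y_true, lower, upper) if lo <= t <= hi) / n
    # Eq. (28): average interval width, normalized by the range of the true values
    y_rng = max(y_true) - min(y_true)
    pinaw = sum(hi - lo for lo, hi in zip(lower, upper)) / (n * y_rng)
    # Eq. (29)-(30): penalize only when coverage falls below the nominal level
    delta = 0.0 if picp >= pinc else 1.0
    cwc = (1 + eta1 * pinaw) * (1 + delta * math.exp(-eta2 * (picp - pinc)))
    return picp, pinaw, cwc
```

With a large 𝜂2, any coverage shortfall inflates CWC sharply, which is what drives the optimization toward intervals that actually meet the nominal confidence level.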

In summary, this study proposes an improved EVREG model that allows simultaneous feature selection and model parameters learning, enabling both point prediction and interval prediction of building hourly energy consumption for its comprehensive assessment, where the former describes the fluctuation of energy consumption and the latter quantifies its uncertainty. The overall research framework of the proposed method is summarized in Fig. 1. It consists of four steps: (1) Step 1: data pre-processing. This step mainly involves outlier processing and relevant data normalization. (2) Step 2: initialization of the regression problem and algorithm parameters. The training data and testing data are divided, the mutual information between features is calculated in advance, and the upper and lower bounds of the neighborhood size 𝐾 are specified. (3) Step 3: feature selection and model parameters learning. This step is the critical part of the proposed method. As described in Algorithm 1, for each 𝐾, 𝛾 under the current feature subset is optimized with the objective of minimizing 𝐶𝑉, and the optimal feature subset ℬ under the current 𝐾 is obtained with the objective of maximizing 𝑆𝐼𝐺. On this basis, the optimal neighborhood sizes 𝐾 for point prediction and interval prediction are obtained by minimizing the objective function 𝐽. (4) Step 4: evaluation. The established models are used to perform point and interval predictions of energy consumption on the testing data, and the prediction results are evaluated according to the respective evaluation metrics.

4. Case study and results

4.1. Data description

In this study, a commercial building located in Virginia, USA, is selected as the case building for hourly energy consumption
prediction, using data collected from January 1, 2012 to December 31, 2012. The energy consumption data was obtained from an


Table 1
Summary of independent and dependent variables.

ID  Variable             Abbreviation  Range           Unit
1   time of the day      Time          0,1,2,...,23    hour
2   day of the month     Day           0,1,2,...,31    day
3   month                MTH           1,2,...,12      month
4   day of the week      Week          1,2,...,7       day
5   type of day          Type          0,1             weekday, day off
6   temperature          Temp          [14,101]        °F
7   dew point            Dewp          [−3,75]         °F
8   relative humidity    RH            [16,100]        %
9   wind speed           Wsp           [0,38]          mph
10  wind gust            Wgu           [0,54]          mph
11  barometric pressure  BP            [28.37,30.33]   in
12  precipitation        Precip        [0,0.75]        in

Fig. 2. Energy consumption characteristics from January to December.

open-source project called Building Data Genome and collected at 5-minute intervals from the EnerNOC dataset [42]. Meteorological
data was collected at hourly intervals from Weather Underground (https://www.wunderground.com) including temperature, dew
point, relative humidity, wind speed, wind gust, barometric pressure and precipitation. Time-related data include time of the day,
day, month, day of the week and type of day (i.e., weekday or day off), which is based on the 2012–2013 U.S. calendar. These data
were used to reflect seasonality and building usage. Table 1 lists all variables used in this study. All experiments are conducted in
MATLAB on a PC with Intel i7 1.8 GHz CPU and Nvidia GeForce MX150 GPU.

4.2. Data preparation

To improve the quality of the dataset, pre-processing of the data, including outlier processing and data normalization, is required
before model training. For most building energy prediction models, feature selection is also required in this step. Since this study
proposes an improved EVREG model that can achieve feature selection and model training simultaneously, feature selection is not
required here. Firstly, outliers are identified on the basis of the interquartile range rule, with the upper and lower bounds
determined by Eqs. (31) and (32), respectively. Data exceeding these thresholds are flagged as abnormal, and the linear
interpolation method is then applied to replace the outliers as well as fill in the missing data. After processing, a total of 8740 valid records
were obtained. The annual building energy consumption is shown in Fig. 2.

𝑈 𝑏 = 𝑄3 + 1.5 × (𝑄3 − 𝑄1 ) , (31)

𝐿𝑏 = 𝑄1 − 1.5 × (𝑄3 − 𝑄1 ) , (32)


where 𝑄1 and 𝑄3 are the first and third quartiles of the data series, respectively.
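The outlier rule of Eqs. (31)–(32) and the linear-interpolation repair can be sketched as below; the quartile convention (linear interpolation between order statistics) and the assumption that the first and last points are valid are simplifications of the paper's unstated choices.

```python
def iqr_bounds(series):
    """Upper/lower thresholds from the interquartile-range rule, Eqs. (31)-(32).
    The quartile convention used here is an assumption."""
    s = sorted(series)

    def quantile(p):
        # linear interpolation between adjacent order statistics
        k = p * (len(s) - 1)
        i, f = int(k), k - int(k)
        return s[i] if f == 0 else s[i] * (1 - f) + s[i + 1] * f

    q1, q3 = quantile(0.25), quantile(0.75)
    return q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)

def clean(series):
    """Replace out-of-bound points by linear interpolation between the
    nearest valid neighbors (assumes the first and last points are valid)."""
    lb, ub = iqr_bounds(series)
    x = [v if lb <= v <= ub else None for v in series]
    for i, v in enumerate(x):
        if v is None:
            j = next(k for k in range(i + 1, len(x)) if x[k] is not None)
            x[i] = x[i - 1] + (x[j] - x[i - 1]) / (j - i + 1)
    return x
```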
Considering the existence of different magnitudes or orders of magnitude of the input variables, which can lead to inconsistency
in their contribution to the distance calculation when subsequently finding the nearest neighbors of samples, it is necessary to
normalize the data before model training. Specifically, the min–max normalization method based on Eq. (33) is adopted:

𝑥∗ = (𝑥 − 𝑥min ) ∕ (𝑥max − 𝑥min ) , (33)


where 𝑥∗ is the normalized data, 𝑥 is the original data, and 𝑥min and 𝑥max are the minimum and maximum values of the original
data, respectively.

Table 2
Candidate features of prediction model.

Categories              Candidate features                                                      Number
Time-related            Time, Day, Mth, Week, Type                                              1–5
Meteorological-related  Temp𝑡 , Dewp𝑡 , RH𝑡 , Wsp𝑡 , Wgu𝑡 , BP𝑡 , Precip𝑡                       6–12
Time-lag                Temp𝑡−1 , Dewp𝑡−1 , RH𝑡−1 , Wsp𝑡−1 , Wgu𝑡−1 , BP𝑡−1 , Precip𝑡−1         13–19

Table 3
Selected sets of features and optimized parameters with different 𝜆.

𝜆        Number of features   Selected set of features   𝐾∗   𝛾∗       (𝐾∗, ∗)
0        6                    [1,4,13,3,5,18]            4    4.0347   0.0009
0.0001   5                    [1,4,13,3,5]               4    4.0656   0.0015
Since the current energy consumption of the building is correlated with its historical energy consumption, randomly dividing
the training and testing sets may lead to information leakage. In addition, building energy consumption has a clear periodicity
(as can be seen in Fig. 2, energy consumption varies on a weekly cycle, with most troughs occurring at weekends), so the data
distribution needs to be consistent before and after the dataset is divided. Therefore, in this study, the data from the fourth week
of each month are selected as the testing set, which contains a total of 2001 records, and the remaining 6739 records are used
as the training set to build the prediction model. The testing set accounts for approximately 22.9% of the total dataset, which
falls within the acceptable range for dataset partitioning.
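The week-based split described above can be sketched as follows, assuming hourly timestamped records and taking days 22–28 as the "fourth week"; the paper's exact week boundaries may differ.

```python
from datetime import datetime, timedelta

def split_by_fourth_week(records):
    """Use the fourth week of each month as the testing set.

    `records` is a list of (timestamp, features, target) tuples; days 22-28
    are taken as the fourth week, which is an assumption about the paper's
    exact boundaries.
    """
    train, test = [], []
    for rec in records:
        ts = rec[0]
        (test if 22 <= ts.day <= 28 else train).append(rec)
    return train, test
```

For example, January 2012 alone (744 hourly records) contributes 168 records (7 days) to the testing set and 576 to the training set under this convention.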

4.3. Point prediction of building energy consumption

4.3.1. Results of feature selection


The raw features used in this study, including the meteorological-related and time-related features, are listed in Table 1. In
addition, due to the thermal inertia of the building envelope, there is a delay in the effect of changes in outdoor meteorological
conditions on the interior, so time-lag features derived from the existing features should also be taken into account. In this paper,
each meteorological feature for the hour before the prediction time was added to the raw feature set, with the subscripts 𝑡 and
𝑡 − 1 denoting the prediction hour and the previous hour, respectively. Table 2 lists all the candidate features and their numbers
after adding the time-lag features.
EVREG was trained using leave-one-out strategy, with 𝐾 increasing from 3 to 13 on the full training set, while feature selection
was performed simultaneously using the method proposed in this paper. For a given 𝐾, the criterion for feature selection completion
is that the inclusion of any new feature in the selected feature subset does not reduce the objective function  , i.e., the 𝑆𝐼𝐺 of
all remaining unselected features is less than 0. In order to show the trend of  with 𝐾 and the selected feature set , features
continued to be added even after this criterion was met, each time choosing the one with the largest 𝑆𝐼𝐺 among the remaining
features. In this way, nineteen values of  were obtained for each 𝐾, and the minimum  indicates the best performance.
Note that,  can be considered as the approximate accuracy of EVREG on the training set based on LOO learning.
Fig. 3 shows the contour surface of the objective function  changing with the number of selected features when 𝐾 increases
from 3 to 13. It could be observed that for any 𝐾,  decreases first and then increases with the increase of the number of selected
features, and achieves appropriate performance when 4 to 6 features are selected. When 𝐾 ∗ = 4 and 5 features are selected, the global best
performance  ∗ = 0.0015 is achieved. Fig. 4 specifically illustrates the variation of the objective function  with the order of selected
features for 𝐾 ∗ = 4. When the three features numbered 1, 4 and 13 (i.e., ‘‘Time’’, ‘‘Week’’ and ‘‘Temp𝑡−1 ’’) are added to the selected
feature subset,  decreases rapidly, which indicates that these three features have the greatest impact on the prediction accuracy of
EVREG. After adding the features numbered 3 and 5 (i.e., ‘‘Month’’ and ‘‘Type’’),  reaches its minimum value and the feature selection
is completed, in which case the corresponding model parameters 𝐾 and 𝛾 are the optimized parameters. After that, adding more
features will only make  increase, which means that too many features will not only increase the complexity of the model, but
also make the prediction effect decrease. Therefore, the subset of features selected for point prediction is {Time, Week, Temp𝑡−1 ,
Month, Type}. This selection is also mechanistically interpretable, as these features can have an impact on building occupancy
and usage patterns, the usage of energy-using equipment, the operational schedules, etc., which lead to energy consumption
variations.
To illustrate the effect of adding the 𝑁𝐼 penalty term, feature selection was performed by following the same steps as above
when 𝜆 was taken as 0. Table 3 shows the results for the two cases. It can be found that the number of selected features is reduced
from 6 to 5 by adding the penalty term, and the feature numbered 18 (i.e., ‘‘BP𝑡−1 ’’) is not selected, which reduces the complexity
of the model. Inevitably, the addition of the penalty term also increases the optimal objective function, because feature selection
without the penalty term only considers the prediction accuracy on training set. The prediction results of these two cases will be
illustrated in the subsequent experiments.


Fig. 3. Contour surface of the objective 𝐽 (𝐾, ).

Fig. 4. Variation of the objective 𝐽 (𝐾, ) with the order of selected features for 𝐾 ∗ = 4.

4.3.2. Impact of feature selection


To analyze the point prediction capability of the features selected by the proposed method, point prediction was performed on
the testing data using the model trained in Section 4.3.1, i.e., selected subset of features ∗1 = {Time, Week, Temp𝑡−1 , Month, Type} and
their corresponding 𝐾 ∗ and 𝛾 ∗ . Fig. 5 shows the point prediction results of the model, and it can be found that except for the last
week in which the prediction results had a large deviation from the actual value, all other weeks achieved good prediction results.
This is mainly because, for EVREG, the prediction result depends on the response values of the nearest neighbors of the point to
be predicted in the training set, and it can be seen from Fig. 2 that the energy consumption curve in the fourth week of December
differs significantly from that in the first three weeks for some unknown reason, while this situation does not exist in other
months, thus inevitably leading to deviations in the prediction. In addition, it can be found that the deviations in the prediction
results for the first eleven weeks occur mainly in the daily peak and trough periods of energy consumption, which is mainly caused
by the small proportion of samples from these periods in the training data compared to those from the transition periods.
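As a rough illustration of this neighbor dependence, a drastically simplified, non-evidential caricature of EVREG's point prediction can be written as a distance-discounted K-nearest-neighbor average; the full model combines neighbor evidence via Dempster–Shafer theory, which is omitted here, and the exact discounting form may differ.

```python
import math

def evreg_point_sketch(x, train_X, train_y, K=4, gamma=4.0):
    """Simplified view of EVREG's neighbor-based point prediction: each of
    the K nearest neighbors contributes its response discounted by
    exp(-gamma * distance). Only meant to show the roles of K and gamma;
    the Dempster-Shafer combination step of the full model is omitted.
    """
    nearest = sorted((math.dist(x, xi), yi)
                     for xi, yi in zip(train_X, train_y))[:K]
    weights = [math.exp(-gamma * d) for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)
```

A distant neighbor contributes almost nothing after discounting, which is why test weeks with no close training analogue (such as the fourth week of December here) are predicted poorly.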
Before conducting further experiments, it is necessary to verify the validity of the proposed model according to ASHRAE
standards. As stated in [43], it is recommended to use the coefficient of variation of the root-mean-square error (CVRMSE) and
the normalized-mean-bias error (NMBE) as statistical goodness of fit metrics for whole-building hourly energy consumption model.
The former describes the variation in the pattern of the data and the latter describes the variation between the mean real and


Fig. 5. Comparison of point prediction results and actual energy consumption.

Fig. 6. Prediction results under different feature sets.

predicted values. These two metrics can be calculated as follows:


CVRMSE = 100 × sqrt( Σ_{𝑖=1}^{𝑛} (𝑦𝑖 − 𝑦̂𝑖 )² ∕ (𝑛 − 1) ) ∕ 𝑦̄ , (34)

NMBE = 100 × Σ_{𝑖=1}^{𝑛} (𝑦𝑖 − 𝑦̂𝑖 ) ∕ ((𝑛 − 1) × 𝑦̄) , (35)
where 𝑦̄ is the mean value of the real building energy consumption. In [43], it is stipulated that for an effective energy model, the
maximum CVRMSE should not exceed 30% and the maximum NMBE should not exceed 10%. In light of this, the relevant metrics
were calculated based on the above prediction results: the CVRMSE is 8.69% and the NMBE is 1.34%. Therefore, the model
complies with the ASHRAE requirements.
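Eqs. (34)–(35) and the ASHRAE thresholds can be sketched as follows; the function name is an illustrative assumption.

```python
import math

def ashrae_check(y, yhat):
    """CVRMSE and NMBE per Eqs. (34)-(35), with the ASHRAE thresholds for
    hourly whole-building models (30% and 10%) cited from [43]."""
    n = len(y)
    ybar = sum(y) / n
    cvrmse = 100 * math.sqrt(sum((a - b) ** 2
                                 for a, b in zip(y, yhat)) / (n - 1)) / ybar
    nmbe = 100 * sum(a - b for a, b in zip(y, yhat)) / ((n - 1) * ybar)
    return cvrmse, nmbe, cvrmse <= 30 and abs(nmbe) <= 10
```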
To illustrate the effect of feature selection and to further validate the effect of adding the 𝑁𝐼 penalty term, the models trained
with all features  and with ∗2 (the feature set selected when 𝜆 = 0) were used to conduct predictions on the testing set, and
Fig. 6 shows the evaluation metrics of the point predictions using the different feature sets. It can be found that the prediction
results after performing feature selection are significantly better than those using all variables, and after adding the penalty term,
the number of selected features is reduced while the prediction accuracy is improved.

4.3.3. Comparison with other feature selection methods


Filter feature selection methods suitable for the EVREG model were used to compare with the proposed method, including
Pearson correlation coefficient (PCC), mutual information (MI), and ReliefF algorithms. A set of influential features was selected
based on pre-specified thresholds. The thresholds were chosen so that features with moderate to high correlation with the target
could be identified [44]. In this study, candidate features with a Pearson correlation coefficient greater than 0.3 [45], normalized


Table 4
Selected sets of features by several methods and optimal values of model parameters.
Method Number of features Selected set of features 𝐾∗ 𝛾∗
PCC 7 [1,4,5,6,8,13,15] 5 3.3899
MI 8 [1,5,6,7,8,13,14,17] 5 3.7103
ReliefF 7 [1,4,7,13,14,15,17] 4 3.3318
Proposed method 5 [1,4,13,3,5] 4 4.0656

Fig. 7. Prediction results with different feature selection methods, where PM represents the proposed method.

mutual information greater than 0.5 [46], and ReliefF calculation weights greater than 0.002 will be selected as input of the model.
The feature selection was performed on the same training set as in the previous section, and then the selected features were used
for the training of EVREG model, feature selection results and optimization of parameters 𝐾 and 𝛾 are presented in Table 4.
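The PCC-based filter baseline can be sketched as follows, with the 0.3 threshold from [45]; the helper names are illustrative, and the MI and ReliefF baselines follow the same keep-if-above-threshold pattern with their own scores.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def pcc_filter(columns, target, threshold=0.3):
    """Keep the indices of feature columns whose |PCC| with the target
    exceeds the threshold (0.3 following [45])."""
    return [j for j, col in enumerate(columns)
            if abs(pearson(col, target)) > threshold]
```

Unlike the proposed method, this filter scores each feature against the output only, with no interaction with the downstream EVREG model.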
It can be found that the feature sets selected by the filter methods differ from that of the proposed method, because the proposed
method evaluates the importance of each feature by a different criterion, one that accounts for the influence of the features on the
predictive capability of the model. To compare the effects of the feature selection methods on the prediction effectiveness of EVREG,
the feature sets selected by the above methods, together with the corresponding optimized model parameters, were used to make
predictions on the testing set; the prediction results are presented in Fig. 7.
It is clear that for the EVREG prediction model, the method proposed in this paper achieves better prediction results while selecting
fewer input features. This is mainly because the influence of the selected features on the prediction performance of the model
is taken into account during feature selection, whereas the filter methods only consider the relationship between the features and
the output, lacking interaction with the model, which inevitably puts them at a disadvantage in the subsequent prediction
process.

4.3.4. Comparison with other prediction methods


In this section, the proposed building energy consumption prediction model is compared with the following prediction methods:
back propagation neural network (BPNN) model, support vector regression (SVR), and extreme learning machine (ELM). The penalty
factor of SVR was set to 200 and a Gaussian kernel function was applied. The number of hidden layer nodes of ELM
was set to 28, and the sigmoid function was adopted as the activation function. The learning rate of BPNN was set to 0.01 and the
number of hidden layer nodes was set to 10. Since these regression methods do not consider feature selection themselves, model
training was performed with the features selected in Section 4.3.3. To ensure a fair comparison, for each of the three comparison
prediction models, the three sets of features selected by the aforementioned filter methods were used as input. This resulted in nine
different feature–model combinations.
Fig. 8 presents the prediction error distributions of the three comparative models with different feature selection methods; the
more concentrated a curve is around 0, the higher the prediction accuracy of the model. It can be found that for BPNN, SVR
and ELM alike, the best prediction results are obtained with the features selected according to PCC. Fig. 9 aims to demonstrate graphically
the prediction results of the comparison models with the subset of features yielding the best prediction results. It can be observed
that similar to the EVREG prediction results in Fig. 5, the prediction results in the fourth week of December also show a large
prediction bias regardless of which method was applied, which may be due to the inconsistency with the normal operating pattern
that occurred in that week. The BPNN model with the features chosen based on PCC fits the actual energy consumption curve the
best in the three comparison methods, but its prediction error at energy consumption troughs is larger compared to EVREG model.
Table 5 illustrates the evaluation metrics of each method. In comparison with the three comparative methods, the MAE, MAPE and
RMSE of EVREG are all the smallest, which means that the point prediction of the proposed method is better in terms of building
energy consumption prediction error.
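The point-prediction metrics reported in Table 5 can be computed as in the sketch below; the function name is illustrative.

```python
import math

def point_metrics(y, yhat):
    """MAE, MAPE (in percent) and RMSE as reported in Table 5."""
    n = len(y)
    mae = sum(abs(a - b) for a, b in zip(y, yhat)) / n
    mape = 100 * sum(abs((a - b) / a) for a, b in zip(y, yhat)) / n
    rmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / n)
    return mae, mape, rmse
```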


Fig. 8. Prediction error probability distribution with different methods and features.

Fig. 9. Comparison of point prediction results and actual energy consumption of different models.

Table 5
Prediction results with different methods and features.
Method MAE MAPE RMSE
BPNN-PCC 2.5542 7.02% 3.8074
BPNN-MI 2.7180 7.36% 3.9903
BPNN-ReliefF 2.8957 7.77% 4.3187
SVR-PCC 5.6387 14.69% 7.0206
SVR-MI 6.4566 17.31% 7.7600
SVR-ReliefF 6.6238 17.34% 7.9922
ELM-PCC 4.9735 12.87% 6.5717
ELM-MI 5.0415 13.15% 6.5478
ELM-ReliefF 5.0342 13.22% 6.5199
Proposed method 2.2509 6.04% 3.4521


Fig. 10. Interval prediction results using the proposed method.

Table 6
Interval prediction results under different feature sets.
Feature set PINC PICP PINAW CWC
∗ 80% 84.41% 19.14% 7.6990
90% 91.25% 25.01% 9.7535
95% 94.30% 29.96% 24.2436
 80% 85.91% 27.78% 10.7230
90% 93.00% 37.69% 14.1915
95% 96.90% 49.44% 18.3040

4.4. Interval prediction of building energy consumption

4.4.1. Impact of feature selection


In this section, the interval prediction capability of the features selected by the proposed method is evaluated. Feature selection
and modeling were performed according to the objective function proposed in Section 3.1 for interval prediction, and the two
hyperparameters 𝜂1 and 𝜂2 in CWC were set to 35 and 15, respectively. The reselected optimal feature subset ∗ , the number of nearest
neighbors 𝐾 ∗ and the corresponding 𝛾 ∗ are [6, 1, 4, 2, 5] (that is, {Temp𝑡 , Time, Week, Day, Type}), 40 and 4.2008, respectively. It can
be noticed that, unlike the features selected for point prediction, the feature ‘‘Day’’ is selected for interval prediction, which is also
selected in [47]. The reason it was selected could be that, in interval forecasting, it can help capture the impact of date changes
within a month on energy consumption uncertainty. Additionally, in interaction with other features, it can better capture energy
consumption patterns within a week and month, thereby increasing the reliability of the predicted energy intervals. The interval
prediction results of the proposed method at PINC of 80%, 90% and 95% are demonstrated in Fig. 10. It can be found that the
majority of the testing sample points are within the 80% confidence interval, but for the samples numbered 1764 to 1785 and 1882
to 1907, there is basically no difference in the range of the prediction interval. In point prediction, this time period also exhibits
larger prediction errors. The main reason for this issue is that the neighbors of these samples are far away from them in the reduced
feature space, and the information that the neighbors could provide is little after discounting. Therefore, in order to better predict
building energy consumption, it is necessary to obtain more information about the building operation patterns.
To examine the effect of feature selection on interval prediction, interval prediction of building energy consumption was also
performed using all features . Table 6 presents the interval prediction results with and without feature selection.
Compared with the prediction results using all features, after feature selection, although the value of PICP decreased, the width
of the prediction interval is significantly reduced, with PINAW decreasing by 8.64%, 12.68% and 19.48% for PINC of 80%, 90% and
95%, respectively. The high interval coverage of the latter comes at the cost of an increased interval width, and it is not meaningful
to continue to increase PINAW to pursue a larger PICP when the interval coverage already meets the requirements. In terms of the
composite CWC metric, the CWC after feature selection decreases by 3.0240 and 4.4380 at the 80% and 90% confidence levels, but
increases by 5.9396 at the 95% confidence level, mainly because the confidence level is not satisfied at this point. CWC is exponentially
related to the confidence level deviation but linearly related to the prediction interval width, which leads to a significant increase
in CWC once the confidence level is not satisfied, while the interval width has a much smaller effect on it. Taking the average
CWC over the three confidence levels, it is still reduced by 0.5075 after feature selection.

4.4.2. Comparison with other feature selection methods


As in Section 4.3.3, we used the subsets of features selected by filter feature selection methods to construct EVREG model for
interval prediction, including PCC, MI and ReliefF, and the prediction results are demonstrated in Table 7.


Table 7
Prediction results with different feature selection methods.
Method PINC PICP PINAW CWC
PCC 80% 81.51% 17.75% 7.2125
90% 89.56% 24.08% 19.4992
95% 94.50% 32.80% 25.9320
MI 80% 78.51% 16.44% 15.1995
90% 87.06% 22.19% 22.3919
95% 92.85% 28.36% 26.0102
ReliefF 80% 80.36% 19.10% 7.6850
90% 88.36% 25.72% 22.7936
95% 93.05% 32.69% 29.1103
Proposed method 80% 84.41% 19.14% 7.6990
90% 91.25% 25.01% 9.7535
95% 94.30% 29.96% 24.2436

Table 8
Interval prediction results with different interval prediction methods.
Method PINC PICP PINAW CWC
BPNN-KDE 80% 72.21% 9.96% 18.9182
90% 84.01% 13.65% 19.9666
95% 90.25% 17.67% 21.8343
ELM-KDE 80% 78.76% 24.39% 21.0225
90% 90.00% 31.89% 12.1615
95% 95.25% 38.58% 14.5030
SVR-KDE 80% 80.51% 27.78% 10.7230
90% 90.95% 34.77% 13.1695
95% 96.05% 41.16% 15.4060
BPNN-QR 80% 72.87% 9.27% 16.6126
90% 83.71% 12.49% 19.1707
95% 89.81% 15.83% 20.7870
ELM-QR 80% 78.61% 22.53% 19.8309
90% 89.26% 30.15% 24.4612
95% 93.90% 34.89% 28.7931
SVR-QR 80% 79.21% 23.73% 19.7817
90% 88.51% 29.99% 25.8722
95% 94.55% 35.48% 27.7730
Proposed method 80% 84.41% 19.14% 7.6990
90% 91.25% 25.01% 9.7535
95% 94.30% 29.96% 24.2436

As shown in Table 7, compared with the three filter feature selection methods, the prediction interval coverage using the features
selected by the proposed method is higher than that of the other three methods, at 84.41%, 91.25% and 94.30%,
except for the case when the PINC is 95%, in which the PICP is slightly smaller than that based on the features selected by PCC
(94.50%). The widths of the prediction intervals using the four sets of features do not differ much. In terms of the CWC metric,
the average CWC of EVREG built with the features selected by PCC, MI, ReliefF and the proposed method are 17.5479, 21.2005,
19.8630 and 13.8987, respectively, and the proposed method achieves the smallest CWC.

4.4.3. Comparison with other interval prediction methods


Performance comparison of interval prediction was conducted between the kernel density estimation (KDE) method, quantile
regression (QR) method and the method proposed in this paper. Gaussian function was adopted as the kernel density function of KDE,
and the bandwidth was taken as 0.5. The quantile regression equation for the prediction error in QR adopted quadratic equation.
Since both comparison algorithms essentially model the bias of point prediction on the training samples, interval prediction was
performed respectively based on the three models with the best point prediction results among BPNN, SVR and ELM in Section 4.3.4,
that is, the three point prediction models trained with feature set selected according to PCC. The prediction results are displayed in
Table 8.
For the prediction intervals obtained by KDE, QR, and the proposed method, there are cases where the prediction interval
coverage does not meet the confidence level requirement. The total confidence deviations corresponding to the three algorithms
are 6.59%, 8.19%, and 0.70%, respectively, where the values for KDE and QR are obtained by averaging over the three
point prediction models (i.e., BPNN-KDE/QR, ELM-KDE/QR and SVR-KDE/QR). The proposed method therefore has the smallest
total confidence deviation, i.e., the smallest extent to which PICP fails to satisfy PINC.
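Both baselines ultimately turn a distribution of training errors into an interval around the point forecast. A minimal empirical-quantile version of that shared idea (not the paper's exact KDE or QR fitting) can be sketched as:

```python
def error_quantile_interval(errors, point_pred, pinc):
    """Symmetric-tail prediction interval built from the empirical quantiles
    of point-prediction errors on the training set. This captures the common
    structure of the KDE and QR baselines, not their exact estimators.
    """
    s = sorted(errors)
    alpha = (1 - pinc) / 2

    def q(p):
        # linear interpolation between adjacent order statistics
        k = p * (len(s) - 1)
        i, f = int(k), k - int(k)
        return s[i] if f == 0 else s[i] * (1 - f) + s[i + 1] * f

    return point_pred + q(alpha), point_pred + q(1 - alpha)
```

Because the interval is derived from the error distribution, a point model with larger errors automatically produces wider intervals, which explains the coverage-versus-width pattern discussed below for SVR and BPNN.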


It can be noted that PICP satisfies PINC at all of the 80%, 90% and 95% levels when interval prediction is performed by applying
the KDE algorithm to the SVR-based point prediction results. However, it cannot be ignored that the widest prediction intervals
are also obtained in this case. Considering the point prediction results, SVR actually exhibits the largest errors among the point
prediction models, yet it achieves the highest interval coverage. This is mainly because, when the errors of the point predictions
are modeled, larger errors enlarge the width of the prediction interval, resulting in an increased coverage rate.
Similarly, since both KDE and QR are modeled based on the errors of point prediction, for BPNN, which has the best point prediction
performance among the three comparison algorithms, both methods yield the narrowest prediction interval widths. However, in this
case, the interval coverage rate is the lowest among the three models.
The average prediction interval widths of KDE, QR, and the proposed method at the three confidence levels are 26.65%, 23.81%
and 24.70%, respectively. The corresponding average interval coverage rates are 86.44%, 85.60%, and 89.99%. The proposed
method achieves a better balance between the informativeness and accuracy of the prediction intervals, without pursuing a larger
coverage rate by using wider intervals. Combining the above two aspects, from the perspective of CWC, the average CWC of the
three algorithms at three confidence levels are 16.4116, 22.5647 and 13.8987, and the proposed method in this study achieves
the smallest CWC. When performing energy consumption interval prediction, the proposed method can obtain better accuracy and
quality at the same time.

5. Discussion

5.1. Practical implications

In this study, an improved evidential regression model is introduced for point prediction and interval prediction of building
energy consumption, effectively capturing its fluctuation and uncertainty. A case study is conducted based on a commercial building
from EnerNOC dataset. The results demonstrate that the proposed method can achieve better prediction performance with fewer
features selected compared to traditional filter feature selection methods. In terms of both point prediction and interval prediction
capabilities, the proposed method also achieves better or comparable results compared to commonly used prediction methods. In
addition, compared with commonly applied interval prediction approaches, this method does not require assumptions about data
distribution or error distribution, thus avoiding prediction errors caused by differences between the actual distribution and assumed
distribution, and ensuring the performance of the model.
In real-world scenarios, the main applications of this study are as follows: (1) Interval prediction of energy consumption is
more conducive to timely detection of building equipment failures or abnormalities, enabling effective maintenance measures to
be implemented and ensuring the smooth operation of building systems. (2) By predicting the energy consumption of a building
over a period of time in the future, operators can formulate reasonable energy scheduling and usage strategies. This allows for
the optimization of energy supply and utilization, leading to reduced energy costs. (3) It can be used for model-based control
to automatically adjust the control strategy and operation mode of the building energy system based on the energy consumption
prediction results, responding to changing external conditions or internal demands in advance to optimize the building system
operation. (4) The proposed prediction model can be further used in scenarios such as building cooling or thermal load prediction,
and energy consumption prediction of individual equipment in a building.

5.2. Limitations

In this study, the proposed model, based on historical data, can be used to predict the energy consumption of buildings in the
operational phase. Inevitably, the proposed method still has some limitations: (1) The construction of the prediction model relies
heavily on historical energy consumption data of the building, which, however, can be difficult to obtain for new buildings or
buildings without mature energy management systems. Therefore, future research should focus on proposing a prediction model
capable of handling incomplete historical data. (2) The evidence base which the model relies on is established using existing
historical data, which cannot be updated promptly when building characteristics or usage patterns change over time. For subsequent
studies, it could be beneficial to explore options for monitoring energy consumption patterns and automatically updating the
evidence base. (3) The candidate features considered in this model for energy consumption prediction are mainly time-related
features and weather-related features, without considering internal equipment-related features and features related to human
behaviors. When equipment failures or significant changes occur in operating patterns, relying solely on time and weather-related
features may not adequately capture these conditions, resulting in significant deviations in energy consumption predictions. In future
research, when conditions permit, other relevant features should be thoroughly considered as candidate inputs for prediction model
in order to more fully explain the energy consumption model and obtain more reliable prediction results.

6. Conclusions

In this paper, an improved evidential regression model is proposed for building hourly energy consumption prediction, which
enables both point and interval prediction; moreover, the interval prediction is independent of the point prediction model,
eliminating the need for any assumptions about the probability distribution of errors or of the data. The model combines
Dempster–Shafer theory and mutual information, defining an objective function for feature selection and parameter learning as well as an evaluation function


to assess the importance of candidate features. Compared with the traditional EVREG, this approach simultaneously achieves both
feature selection and parameters learning of the model, which can improve the prediction performance of the model.
A commercial building in Virginia is used as a case study for a comprehensive evaluation of the proposed method. The
experimental results demonstrate that for EVREG, compared with traditional filter feature selection methods, i.e., PCC, MI and
ReliefF, the proposed method selects fewer features and improves both the point prediction accuracy and the interval prediction
performance. Compared with PCC, the most effective of the three filter methods, the proposed method achieves reductions of
0.8011 in MAE, 1.92% in MAPE and 1.1036 in RMSE, and the composite CWC metric of its interval prediction is better than
that of models with features selected by the filter methods. In comparison with the point prediction models
BPNN, SVR, and ELM, the proposed method also achieves the best prediction performance. Compared with the best-performing
model, BPNN, among the three comparison models, the proposed method reduces MAE by 0.3033, MAPE by 0.98%, and RMSE by
0.3553. In comparison with interval prediction using KDE and QR, the proposed method reduces the average CWC by 2.5129 and
8.6660, respectively. It achieves the best trade-off in terms of informativeness and accuracy of the prediction intervals, without
pursuing higher interval coverage at the cost of increased interval width.
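For reference, the evaluation metrics reported above can be computed as in the following minimal sketch. The point metrics (MAE, MAPE, RMSE) are standard; for the interval metrics, the CWC is written in the commonly used coverage-width form built from PICP and PINAW, and the nominal coverage level `mu` and penalty factor `eta` shown here are illustrative assumptions, not the exact settings used in the paper.

```python
import numpy as np

def point_metrics(y_true, y_pred):
    # MAE, MAPE (in %), and RMSE for point predictions
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    mape = np.mean(np.abs(err / y_true)) * 100.0
    rmse = np.sqrt(np.mean(err ** 2))
    return mae, mape, rmse

def interval_metrics(y_true, lower, upper, mu=0.90, eta=50.0):
    # PICP: fraction of observations covered by the prediction intervals
    inside = (y_true >= lower) & (y_true <= upper)
    picp = np.mean(inside)
    # PINAW: mean interval width normalized by the target range
    target_range = y_true.max() - y_true.min()
    pinaw = np.mean(upper - lower) / target_range
    # CWC: width penalized exponentially when coverage falls below mu
    gamma = 1.0 if picp < mu else 0.0
    cwc = pinaw * (1.0 + gamma * np.exp(-eta * (picp - mu)))
    return picp, pinaw, cwc
```

A lower CWC thus indicates narrower intervals that still meet the nominal coverage, which is the trade-off discussed above.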
In future research, we will aim to enhance the interval prediction capability of the model by improving its prediction interval
coverage, and use more efficient optimization algorithms to learn the model parameters. In addition, the proposed prediction model
will be integrated with energy management strategies to achieve more efficient building operation.

CRediT authorship contribution statement

Chao Liu: Methodology, Formal analysis, Writing – original draft. Zhi-gang Su: Resources, Project administration, Supervision,
Funding acquisition. Xinyi Zhang: Conceptualization, Data curation.

Declaration of competing interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing
interests: Zhi-gang Su reports financial support was provided by National Natural Science Foundation of China.

Data availability

Data will be made available on request.

Acknowledgment

This work was supported by the National Natural Science Foundation of China under Grant 52076037.

