
GlobalSIP 2015 -- Symposium on Signal Processing Applications in Smart Buildings

A NEW APPROACH FOR SUPERVISED POWER DISAGGREGATION BY USING A DEEP RECURRENT LSTM NETWORK

Lukas Mauch and Bin Yang

Institute of Signal Processing and System Theory, University of Stuttgart, Germany

ABSTRACT

This paper presents a new approach for supervised power disaggregation by using a deep recurrent long short term memory network. It is useful to extract the power signal of one dominant appliance or any subcircuit from the aggregate power signal. To train the network, a measurement of the power signal of the target appliance in addition to the total power signal during the same time period is required. The method is supervised, but less restrictive in practice since submetering of an important appliance or a subcircuit for a short time is feasible. The main advantages of this approach are: a) It is also applicable to variable load and not restricted to on-off and multi-state appliances. b) It does not require hand-engineered event detection and feature extraction. c) By using multiple networks, it is possible to disaggregate multiple appliances or subcircuits at the same time. d) It also works with a low cost power meter, as shown in the experiments with the Reference Energy Disaggregation (REDD) dataset (1/3 Hz sampling frequency, only real power).

Index Terms— Non-intrusive load monitoring (NILM), supervised power disaggregation, deep recurrent neural network (RNN), long short term memory (LSTM)

1. INTRODUCTION

Non-intrusive load monitoring (NILM) is a cost-effective energy monitoring technique to infer the electrical power consumption of individual appliances from the total power measurement of a single or a few power meters [1]. It can be supervised or unsupervised, and event-based or eventless [2, 3, 4]. Though there is no system today that solves the NILM problem with high accuracy for challenging applications like large buildings, two approaches seem to be favored.

One approach is event-based and uses a combination of unsupervised and supervised techniques. Unsupervised algorithms are first used to detect and cluster state transitions, called events, in the aggregate power signal [1, 2, 4]. A state-of-the-art event detection algorithm can achieve true positive rates of 98.5% and 70.5% and false positive percentages of 0.55% and 8.75% for phases A and B of the Building-Level fUlly-labeled dataset for Electricity Disaggregation (BLUED), respectively [5]. Then supervised classifiers are used to assign known appliances to the detected events in order to estimate the power trace and energy consumption of individual appliances [6, 7]. But many event-based algorithms need a high sampling frequency to extract rich features (P, Q, harmonics, transients) for reliable event detection and classification. A second difficulty is the restriction to on-off and multi-state appliances; variable load appliances are still an open issue. A third difficulty is the collection of training data (appliance library) to train the classifier and, more seriously, the transfer of the trained NILM system from a known building to a new one containing possibly new appliances.

The second approach considers the disaggregation as an optimization problem and seeks a combination of appliances whose resulting aggregate signal is close to the total power measurement. Hidden Markov models and various extensions are proposed to examine different combinations of state sequences of appliances [8, 9, 10]. By definition, this approach is also limited to discrete-state appliances only. Another challenge is the exponentially increasing number of combinations of state sequences for an increasing number of appliances. A third limitation is that this approach fails in the presence of unknown appliances.

This paper proposes a new approach for supervised power disaggregation. The intention in this paper is not to disaggregate all appliances; this is almost impossible in a large building with several hundred or thousand appliances whose power consumption ranges from W (LED light) to kW. Our main purpose is to monitor the power consumption of the major loads in a building for power saving. We propose to use a deep recurrent neural network (RNN) based on the long short term memory (LSTM) architecture. The task of the network is to extract the target power signal of an appliance or subcircuit from the aggregate power signal, based on the long memory of the network, after a supervised training of the network. In this case, the aggregate power signal can contain, in addition to the target power signal, any other known or unknown appliances.

The paper is organized as follows. In section 2, the architecture and learning of the deep recurrent LSTM network are briefly described. Section 3 explains how to use the network for supervised power disaggregation. Section 4 shows some preliminary results for the REDD dataset [11].


[Fig. 1. Deep recurrent neural network: the input sequence x(n) is formed into vectors x^(0)(n) by an input layer with N^(0) units, passed through LSTM layers 1, ..., L with N^(l) units and parameters θ^(l), and mapped by an output layer with N^(L+1) units and parameters θ^(L+1) to the output y(n).]

2. DEEP RECURRENT LSTM NETWORK

A conventional feedforward neural network (NN) with enough hidden layers and units is known to be a universal mapper: it is able to approximate any static mapping from input to output. An RNN with feedback in time is suitable to learn any dynamic (time-varying) mapping from input to output [12, 13]. Hence an RNN suits power disaggregation better than a feedforward NN: in NILM applications, the power signal of almost all appliances is not deterministic due to random switch-on/off, and the aggregate power signal is dynamic due to a time-varying superposition of the power signals of different appliances. We use the more complex LSTM unit in the RNN because it has a longer memory and avoids the vanishing gradient problem [14].

2.1. Network architecture

Fig. 1 illustrates the architecture of the deep recurrent LSTM network used. The input sequence to the network is a scalar sequence x(n) ∈ R, which could be the aggregate real power signal at time n ∈ Z. The input layer of N^(0) input units just forms a sequence of vectors x^(0)(n) = [x(n), x(n−1), ..., x(n−N^(0)+1)]^T ∈ R^(N^(0)) for the next L recurrent layers. Each recurrent layer l consists of N^(l) units and maps its input sequence x^(l−1)(n) ∈ R^(N^(l−1)) to an output sequence x^(l)(n) ∈ R^(N^(l)), 1 ≤ l ≤ L. The last recurrent layer L is followed by a feedforward output layer consisting of N^(L+1) units. It performs a static nonlinear mapping at each time instant n

    y(n) = σ^(L+1)(W^(L+1) x^(L)(n) + b^(L+1)) ∈ R^(N^(L+1)),    (1)

where σ^(L+1)(·) is an elementwise nonlinear activation function in the output layer.

As introduced in [14], LSTM units are commonly used as building blocks of the recurrent layers because they do not suffer from the vanishing gradient problem [15] and, therefore, can learn mappings with long time dependencies [16]. The mapping rule of the recurrent layer l at time n is defined by

    i^(l)(n) = σ(W^(l)_xi x^(l−1)(n) + W^(l)_hi x^(l)(n−1) + W^(l)_si s^(l)(n−1) + b^(l)_i)    (2)
    f^(l)(n) = σ(W^(l)_xf x^(l−1)(n) + W^(l)_hf x^(l)(n−1) + W^(l)_sf s^(l)(n−1) + b^(l)_f)    (3)
    s^(l)(n) = i^(l)(n) ◦ tanh(W^(l)_xs x^(l−1)(n) + W^(l)_hs x^(l)(n−1) + b^(l)_s) + f^(l)(n) ◦ s^(l)(n−1)    (4)
    o^(l)(n) = σ(W^(l)_xo x^(l−1)(n) + W^(l)_ho x^(l)(n−1) + W^(l)_so s^(l)(n) + b^(l)_o)    (5)
    x^(l)(n) = o^(l)(n) ◦ tanh(s^(l)(n)),    (6)

where a ◦ b denotes the elementwise product of two vectors. s^(l)(n) is the state vector containing the memory stored in layer l at time n. At each time step, the multiplicative input gate i^(l)(n) and the forgetting gate f^(l)(n) determine whether the memory content is updated with new data (i^(l)(n) ≫ 0), fed back (f^(l)(n) ≫ 0) or erased (f^(l)(n) = 0). The initialization of the state is s^(l)(0) = 0. The output x^(l)(n) of layer l is controlled by the multiplicative output gate o^(l)(n), which determines whether the memory content is released (o^(l)(n) ≫ 0) or not (o^(l)(n) = 0). The activation of all gates is a nonlinear function σ(·) of a linear combination of the current input from the previous layer, the memory content and the output of the current layer at the previous time step. It is assumed that the gate signals are only affected by the memory content of the same LSTM unit. Therefore, all matrices W^(l)_s∗ are diagonal.
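For illustration, a minimal NumPy sketch of one time step of the LSTM layer defined by (2)-(6) is given below. This is not the paper's Theano implementation; the parameter names and the dict layout are our own. Since the matrices W^(l)_s∗ are diagonal, the peephole connections are stored as vectors.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_below, x_prev, s_prev, p):
    """One time step of LSTM layer l, following Eqs. (2)-(6).

    x_below: x^(l-1)(n), input from the layer below
    x_prev:  x^(l)(n-1), this layer's output at the previous time step
    s_prev:  s^(l)(n-1), the memory state
    p:       dict of parameters; diagonal peephole matrices W_si, W_sf,
             W_so are stored as vectors w_si, w_sf, w_so (our convention).
    """
    i = sigmoid(p["W_xi"] @ x_below + p["W_hi"] @ x_prev
                + p["w_si"] * s_prev + p["b_i"])          # input gate, Eq. (2)
    f = sigmoid(p["W_xf"] @ x_below + p["W_hf"] @ x_prev
                + p["w_sf"] * s_prev + p["b_f"])          # forget gate, Eq. (3)
    s = i * np.tanh(p["W_xs"] @ x_below + p["W_hs"] @ x_prev
                    + p["b_s"]) + f * s_prev              # state update, Eq. (4)
    o = sigmoid(p["W_xo"] @ x_below + p["W_ho"] @ x_prev
                + p["w_so"] * s + p["b_o"])               # output gate, Eq. (5)
    x = o * np.tanh(s)                                    # layer output, Eq. (6)
    return x, s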
Clearly, the output y(n) of such a network is a causal, nonlinear and time-varying function of x(1), ..., x(n):

    y(n) = f(x(1), ..., x(n), n, θ).    (7)

The set of all parameters in the network is summarized in one vector θ = [θ^(1)T, θ^(2)T, ..., θ^(L)T, θ^(L+1)T]^T and has to be determined during training. Here, θ^(l) (1 ≤ l ≤ L) is the parameter vector of LSTM layer l and contains all elements of the weight matrices W^(l)_∗∗ and bias vectors b^(l)_∗. θ^(L+1) is the corresponding parameter vector of the output layer.

2.2. Forward-backward learning

In Fig. 1, the input sequence x(n) is processed feedforward in time, resulting in a causal input-output mapping. In our study, we found that a noncausal RNN can perform better power disaggregation, i.e. each output y(n) depends on the current, past and future samples of x(n). For this purpose, we adopt the forward-backward learning from [17]. The input sequence x(n) is divided into a number of non-overlapping blocks.

A bidirectional RNN processes each block of B samples, say x = [x(1), ..., x(B)]^T, and returns a block of output samples y(1), ..., y(B). Each layer l of the bidirectional RNN has a double width 2N^(l). One half of the layer processes the input sequence in the forward direction x^(l−1)(1), ..., x^(l−1)(B), resulting in x^(l)_f(1), ..., x^(l)_f(B), and the other half does the same in the backward direction on the reversed input sequence x^(l−1)(B), ..., x^(l−1)(1), resulting in x^(l)_b(B), ..., x^(l)_b(1). The input for the next layer l+1 is the sequence x^(l)(1), ..., x^(l)(B) of concatenated results x^(l)(n) = [x^(l)T_f(n), x^(l)T_b(n)]^T ∈ R^(2N^(l)). Therefore, the output y(n) of the network depends on the complete input block x(1), ..., x(B) for any time 1 ≤ n ≤ B.
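A minimal sketch of this forward-backward scheme, reusing the lstm_step function above and assuming per-direction parameter sets p_f and p_b (our own names, not from the paper):

def bidirectional_layer(X, p_f, p_b, n_units):
    """Process one block X = [x(1), ..., x(B)] (list of input vectors) with
    a bidirectional LSTM layer and return the concatenated outputs."""
    B = len(X)
    out_f, out_b = [None] * B, [None] * B
    x_f = np.zeros(n_units); s_f = np.zeros(n_units)  # s^(l)(0) = 0
    x_b = np.zeros(n_units); s_b = np.zeros(n_units)
    for n in range(B):                                # forward direction
        x_f, s_f = lstm_step(X[n], x_f, s_f, p_f)
        out_f[n] = x_f
    for n in reversed(range(B)):                      # backward direction
        x_b, s_b = lstm_step(X[n], x_b, s_b, p_b)
        out_b[n] = x_b
    # x^(l)(n) = [x_f^(l)(n); x_b^(l)(n)] has double width 2N^(l)
    return [np.concatenate([out_f[n], out_b[n]]) for n in range(B)]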
2.3. Training of the network

The parameter vector θ = [θ^T_f, θ^T_b]^T of the bidirectional RNN is determined by supervised training. For a given set of M pairs of input blocks x_m = [x_m(1), ..., x_m(B)]^T and desired output blocks t_m(1), ..., t_m(B), 1 ≤ m ≤ M, the network is trained by least squares learning θ̂ = arg min_θ J(θ) with

    J(θ) = Σ_{m=1}^{M} Σ_{n=1}^{B} ||y_m(n) − t_m(n)||² + λ_1 ||θ_W||_1 + λ_2 ||θ_W||²_2.    (8)

To avoid overfitting, l1- and l2-penalties on all weight matrices are used for regularization. The vector θ_W contains all weight matrices W^(l)_∗∗ of all layers. ||·||_1 and ||·||_2 denote the l1 and l2 norm, respectively. The regularization parameters λ_1 and λ_2 have to be chosen carefully and can be obtained by minimizing J(θ) over a validation set using grid search. The gradient of J(θ) with respect to θ is calculated using backpropagation through time (BPTT) without gradient truncation. The minimization is done by stochastic gradient descent [18]. The network is trained end-to-end. According to [19], a momentum of μ = 0.5 is used during training. Momentum increases the learning rate in gradient directions where the cost function has low curvature. Furthermore, an exponentially decaying learning rate schedule is used, which ensures stable minimization [20] without oscillation near local minima. According to [21], the initial weights are drawn from a uniform distribution.
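Schematically, the cost of (8) and one SGD update with momentum might look as follows in NumPy. The gradient computation itself (BPTT) is abstracted away, since the paper delegates it to Theano's automatic differentiation, and the learning rate schedule is omitted; the function names are ours.

def cost(Y, T, theta_W, lam1, lam2):
    """Eq. (8): squared error over all blocks plus l1/l2 weight penalties.
    Y, T: arrays of shape (M, B) with network outputs and targets;
    theta_W: flat vector of all weight matrix elements."""
    data_term = np.sum((Y - T) ** 2)
    return (data_term + lam1 * np.sum(np.abs(theta_W))
            + lam2 * np.sum(theta_W ** 2))

def sgd_momentum_step(theta, grad, velocity, lr, mu=0.5):
    """One stochastic gradient descent update with momentum mu = 0.5 [19]."""
    velocity = mu * velocity - lr * grad
    return theta + velocity, velocity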
3. APPLICATION TO POWER DISAGGREGATION

Let s_k(n) be the submetered real power consumed by appliance k at time n and x(n) = Σ_{k=1}^{K} s_k(n) be the aggregate power signal of K appliances. Both signals are normalized to have unit variance. For a selected target appliance s_t(n) among these K devices, we train one RNN to perform time series regression on the aggregate signal and get an estimate ŝ_t(n) for s_t(n). Hence the output layer has only N^(L+1) = 1 unit. The aggregate signal is divided into M non-overlapping blocks of length B: x_m = [x((m−1)B + 1), ..., x(mB)]^T (1 ≤ m ≤ M), which are processed independently by the RNN. The target blocks are extracted from the submetered signal s_t(n) in the same way: t_m = [s_t((m−1)B + 1), ..., s_t(mB)]^T. Because real power is always non-negative, a softplus activation function σ^(L+1)(x) = ln(1 + exp(x)) is chosen for the output layer. For all recurrent layers, a sigmoid activation σ^(l)(x) = 1/(1 + exp(−x)) is used.
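The normalization, block extraction and output nonlinearity can be sketched as follows (a simplified illustration with our own function names; the signal length is truncated to a multiple of B):

def normalize(v):
    """Scale a power signal to unit variance, as assumed above."""
    return v / np.std(v)

def make_blocks(x, s_t, B):
    """Cut the aggregate signal x(n) and the submetered target s_t(n)
    into M non-overlapping input/target blocks of length B."""
    M = len(x) // B
    xm = [x[m * B:(m + 1) * B] for m in range(M)]
    tm = [s_t[m * B:(m + 1) * B] for m in range(M)]
    return xm, tm

def softplus(a):
    """Output activation sigma^(L+1)(x) = ln(1 + exp(x)); keeps the
    estimated real power non-negative."""
    return np.log1p(np.exp(a))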


4. EXPERIMENTS AND RESULTS

The network is implemented in Python using Theano [22]. All experiments are done with the REDD dataset. It contains real power measurements for six houses (no reactive power). In each house, two aggregate signals of phases A and B with a sampling frequency Fs = 1 Hz and submetered power signals of individual appliances with a sampling frequency Fs = 1/3 Hz are recorded. In our study, house 1 (18 appliances, 620 hours) and house 2 (9 appliances, 258 hours) are used.

Because of the different sampling frequencies of aggregate and submetered signals, we performed our tests on synthetic aggregate signals obtained by summing up all submetered signals s_k(n) in each house. For training and test, both aggregate and submetered signals of house 1 are divided into a training set containing the first 2/3 and a test set containing the last 1/3. House 2 is only used as test data, to see whether a network trained on house 1 is able to disaggregate a similar appliance in house 2.

Three different networks are trained to disaggregate the fridge (FR), dishwasher (DW) and microwave (MW) in house 1. The fridge has a quite periodic power consumption, whereas the other two devices show random events. The dishwasher is a multi-state device, whereas the microwave and fridge are on-off appliances. Unfortunately, there are no variable load devices in the REDD dataset.

In our experiments, each network consists of L = 2 recurrent layers with N^(1) = N^(2) = 140 units. The input layer has a width of N^(0) = 10. So the parameter vector θ has 485801 elements to be learned. The block length is chosen as B = 5000, corresponding to 4.17 hours of data. Each network is trained for at most 100 epochs. Training is stopped early if the validation error does not decrease further.

Fig. 2 shows the aggregate signal x(n), the submetered reference signal s_t(n), and the disaggregated signal ŝ_t(n) for roughly 30 hours of the test set for the dishwasher and fridge in house 1. The target power signal can be estimated quite accurately from the aggregate signal. Errors occur at discontinuities (e.g. switch-on/off) because the network seems to have a lowpass characteristic.

[Fig. 2. Disaggregation of the fridge and the dishwasher (red: reference, blue: disaggregated). Panels: aggregate signal, refrigerator (FR) and dishwasher (DW); real power in W over time in h.]
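The construction of the synthetic aggregate and the 2/3-1/3 split might look as follows, reusing normalize and make_blocks from above (a sketch under our own naming; submeters is assumed to be a list of equally long NumPy arrays):

def make_dataset(submeters, target_idx, B):
    """Synthetic aggregate = sum of all submetered signals; first 2/3 for
    training, last 1/3 for test, both cut into blocks of length B."""
    x = normalize(np.sum(submeters, axis=0))   # synthetic aggregate signal
    s_t = normalize(submeters[target_idx])     # target appliance signal
    split = 2 * len(x) // 3
    train = make_blocks(x[:split], s_t[:split], B)
    test = make_blocks(x[split:], s_t[split:], B)
    return train, test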

Appl.   Et [kWh]   Êt [kWh]   NRMS   F1     R      P
FR      23.9       23.0       0.33   0.91   0.98   0.85
DW      11.1       10.50      0.35   0.79   0.87   0.73
MW      7.8        7.9        0.74   0.66   0.83   0.54

Table 1. Validation on the test set of house 1 with E = 63.37 kWh

Appl.   Et [kWh]   Êt [kWh]   NRMS   F1     R      P
FR      20.7       20.6       0.35   0.93   0.96   0.91
DW      2.36       3.26       0.31   0.68   1.0    0.52
MW      4.0        2.11       0.58   0.09   0.05   0.5

Table 2. Validation on house 2 with E = 36.6 kWh

Tables 1 and 2 give some performance metrics for a quantitative analysis of the disaggregation. We use Et = (1/Fs) Σ_{n=1}^{N} s_t(n) in kWh to calculate the true energy consumption of the target appliance. Similarly, Êt = (1/Fs) Σ_{n=1}^{N} ŝ_t(n) and E = (1/Fs) Σ_{n=1}^{N} x(n) denote the estimated energy consumption of the target appliance and the total energy consumption of the house.

    NRMS = sqrt( Σ_{n=1}^{N} (ŝ_t(n) − s_t(n))² / Σ_{n=1}^{N} s_t(n)² )

is the normalized RMSE of the disaggregation. We also examine how well the active periods of the target appliance are estimated from the aggregate signal, using the F1 score, precision P and recall R. The active periods in s_t(n) and ŝ_t(n) are detected by simple thresholding s_t(n) ≥ γ and ŝ_t(n) ≥ γ, where γ = 30 W is chosen. We see that the time intervals of the active periods of all three target appliances are estimated quite accurately, while the estimation of the power values is less accurate.
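Under the stated definitions, these metrics can be computed sample-wise as in the following sketch (our own function name; the conversion of (1/Fs) Σ s_t(n) from watt-seconds to kWh via 3.6e6 is our addition, as the paper only states that Et is given in kWh):

def disaggregation_metrics(s_t, s_hat, Fs=1.0 / 3, gamma=30.0):
    """Energy, NRMS and active-period metrics for one target appliance.
    s_t, s_hat: reference and estimated power signals in W."""
    E_t = np.sum(s_t) / Fs / 3.6e6      # true energy in kWh (W*s -> kWh)
    E_hat = np.sum(s_hat) / Fs / 3.6e6  # estimated energy in kWh
    nrms = np.sqrt(np.sum((s_hat - s_t) ** 2) / np.sum(s_t ** 2))
    on_true = s_t >= gamma              # active periods by thresholding
    on_est = s_hat >= gamma
    tp = np.sum(on_true & on_est)
    P = tp / max(np.sum(on_est), 1)     # precision
    R = tp / max(np.sum(on_true), 1)    # recall
    F1 = 2 * P * R / max(P + R, 1e-12)
    return E_t, E_hat, nrms, P, R, F1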


The good news is that the networks show a good generalization capability for the fridge and the dishwasher. Trained on house 1, the networks are able to disaggregate similar appliances in house 2. The reason for the much smaller values of Êt and the F1 score for the microwave in house 2 is that house 2 seems to have a microwave-barbecue combination, in contrast to house 1. The microwave network trained on house 1 is only able to reconstruct the active phases of the microwave, but not those of the barbecue in house 2. The barbecue draws less power (about 45 W), but is active for longer periods than the microwave. Because those periods cannot be captured by the network, the recall and F1 score are very small. By using γ = 50 W for the thresholding of active periods, i.e. considering the barbecue periods as inactive, the metrics become F1 = 0.59, P = 0.59 and R = 0.59, which are comparable to the results for house 1.

To investigate the influence of the layer width, different RNNs with 20 to 160 recurrent units per layer are trained repeatedly, starting from different random initializations. Each network is only trained for 10 epochs with fixed regularization parameters to reduce the computational complexity. Considering the models obtained from all random initializations, the median NRMS together with its interquartile range (IQR) for the training and test set is shown in Fig. 3. The training error decreases with increasing layer width. However, for large layers the network tends to overfit to the training data. The optimal layer width seems to be around 140 units.

[Fig. 3. Influence of the layer size for disaggregation of the refrigerator from house 1: median NRMS with IQR on the training and test set for layer sizes from 20 to 160 units.]

5. CONCLUSIONS

This paper demonstrates the feasibility of a deep recurrent LSTM network for eventless power disaggregation in NILM applications. It is able to estimate the power signal of a target appliance or any subcircuit from the aggregate signal after a supervised training of the network using a submeter measurement of the target appliance. The results achieved for the low frequency (1/3 Hz) REDD dataset containing only real power measurements are promising and show a new way to design NILM systems. By using multiple networks, any number of appliances or subcircuits can be extracted from the aggregate signal.


6. REFERENCES

[1] G. W. Hart, "Nonintrusive appliance load monitoring," Proc. IEEE, vol. 80, pp. 1870–1891, 1992.

[2] M. Zeifman and K. Roth, "Nonintrusive appliance load monitoring: Review and outlook," IEEE Trans. Consumer Electronics, vol. 57, pp. 76–84, 2011.

[3] J. Froehlich, E. Larson, et al., "Disaggregated end-use energy sensing for the smart grid," IEEE Trans. Pervasive Computing, vol. 10, pp. 28–39, 2011.

[4] A. Zoha, A. Gluhak, et al., "Nonintrusive load monitoring approaches for disaggregated energy sensing: A survey," Sensors, vol. 12, pp. 16838–16866, 2012.

[5] K. S. Barsim, R. Streubel, and B. Yang, "An approach for unsupervised non-intrusive load monitoring of residential appliances," in 2. NILM Workshop, 2014.

[6] A. Marchiori, D. Hakkarinen, et al., "Circuit-level load monitoring for household energy management," IEEE Trans. Pervasive Computing, pp. 40–48, 2011.

[7] M. J. Johnson and A. S. Willsky, "Bayesian nonparametric hidden semi-Markov models," J. Machine Learning Research, pp. 673–701, 2013.

[8] M. Baranski and J. Voss, "Genetic algorithm for pattern detection in NIALM systems," in Proc. of IEEE Int. Conf. on Systems, Man and Cybernetics, 2004, pp. 3462–3468.

[9] K. Suzuki, S. Inagaki, et al., "Nonintrusive appliance load monitoring based on integer programming," in Proc. SICE Annual Conf., 2008, pp. 2742–2747.

[10] T. Zia, D. Bruckner, and A. Zaidi, "A hidden Markov model based procedure for identifying household electric loads," in Proc. IECON, 2011, pp. 3218–3223.

[11] J. Z. Kolter and M. J. Johnson, "REDD: A public data set for energy disaggregation research," in Proc. of SustKDD Workshop on Data Mining Applications in Sustainability, 2011.

[12] K. Funahashi and Y. Nakamura, "Approximation of dynamical systems by continuous time recurrent neural networks," Neural Networks, 1993.

[13] M. Hermans and B. Schrauwen, "Training and analysing deep recurrent neural networks," in Proc. NIPS, 2013.

[14] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, pp. 1735–1780, 1997.

[15] R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," in Proceedings of the 30th International Conference on Machine Learning (ICML'13), 2013.

[16] A. Graves, "Generating sequences with recurrent neural networks," CoRR, vol. abs/1308.0850, 2013.

[17] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1764–1772.

[18] L. Bottou, "Stochastic gradient descent tricks," in Neural Networks: Tricks of the Trade, vol. 7700 of Lecture Notes in Computer Science, pp. 421–436. Springer Berlin Heidelberg, 2012.

[19] I. Sutskever, J. Martens, G. E. Dahl, and G. E. Hinton, "On the importance of initialization and momentum in deep learning," in Proc. ICML, vol. 28 of JMLR Proceedings, 2013, pp. 1139–1147.

[20] A. Senior, G. Heigold, M. Ranzato, and K. Yang, "An empirical study of learning rates in deep neural networks for speech recognition," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), May 2013, pp. 6724–6728.

[21] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS'10), 2010.

[22] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio, "Theano: a CPU and GPU math expression compiler," in Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010.
