LSTM layers do not suffer from the vanishing gradient problem [15] and can, therefore, learn mappings with long time dependencies [16].
The mapping rule of the recurrent layer $l$ at time $n$ is defined by

$$i^{(l)}(n) = \sigma\left(W_{xi}^{(l)}\, x^{(l-1)}(n) + W_{hi}^{(l)}\, x^{(l)}(n-1) + \ldots\right)$$

[Figure: network architecture; the input layer with $N^{(0)}$ units receives $x^{(0)}(n)$.]
block of $B$ samples, say $\mathbf{x} = [x(1), \ldots, x(B)]^T$, and returns a block of output samples $y(1), \ldots, y(B)$. Each layer $l$ of the bidirectional RNN has a double width $2N^{(l)}$. One half of the layer processes the input sequence in the forward direction $x^{(l-1)}(1), \ldots, x^{(l-1)}(B)$, resulting in $x_f^{(l)}(1), \ldots, x_f^{(l)}(B)$, and the other half does the same in the backward direction on the reversed input sequence $x^{(l-1)}(B), \ldots, x^{(l-1)}(1)$, resulting in $x_b^{(l)}(B), \ldots, x_b^{(l)}(1)$. The input for the next layer $l+1$ is the sequence $x^{(l)}(1), \ldots, x^{(l)}(B)$ with the concatenated results $x^{(l)}(n) = [x_f^{(l)T}(n), x_b^{(l)T}(n)]^T \in \mathbb{R}^{2N^{(l)}}$. Therefore, the output $y(n)$ of the network depends on the complete input block $x(1), \ldots, x(B)$ for any time $1 \leq n \leq B$.
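To make the bidirectional processing concrete, the following NumPy sketch runs one layer in both directions and concatenates the results; the recurrent cell is reduced to a plain sigmoid update for brevity (the network in this paper uses a richer cell), and all names are illustrative rather than the paper's.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bidirectional_layer(x_prev, W_in_f, W_rec_f, W_in_b, W_rec_b):
    """One bidirectional recurrent layer (simplified sigmoid cell).

    x_prev : (B, N_prev) array, the sequence x^(l-1)(1), ..., x^(l-1)(B)
    returns: (B, 2N) array of concatenated states [x_f^(l)(n), x_b^(l)(n)]
    """
    B, N = x_prev.shape[0], W_rec_f.shape[0]
    h_f = np.zeros((B, N))
    h = np.zeros(N)
    for n in range(B):                      # forward direction x(1), ..., x(B)
        h = sigmoid(x_prev[n] @ W_in_f + h @ W_rec_f)
        h_f[n] = h
    h_b = np.zeros((B, N))
    h = np.zeros(N)
    for n in reversed(range(B)):            # backward direction x(B), ..., x(1)
        h = sigmoid(x_prev[n] @ W_in_b + h @ W_rec_b)
        h_b[n] = h
    # x^(l)(n) = [x_f^(l)T(n), x_b^(l)T(n)]^T has width 2N^(l)
    return np.concatenate([h_f, h_b], axis=1)
```

Because the backward half has already seen $x(B), \ldots, x(n)$ when it emits $x_b^{(l)}(n)$, every concatenated state, and hence every output $y(n)$, depends on the whole block.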
2.3. Training of the network

The parameter vector $\theta = [\theta_f^T, \theta_b^T]^T$ of the bidirectional RNN is determined by supervised training. For a given set of $M$ pairs of input blocks $\mathbf{x}_m = [x_m(1), \ldots, x_m(B)]^T$ and desired output blocks $t_m(1), \ldots, t_m(B)$, $1 \leq m \leq M$, the network is trained by least squares learning $\hat{\theta} = \arg\min_\theta J(\theta)$ with

$$J(\theta) = \sum_{m=1}^{M} \sum_{n=1}^{B} \|y_m(n) - t_m(n)\|^2 + \lambda_1 \|\theta_W\|_1 + \lambda_2 \|\theta_W\|_2^2. \qquad (8)$$

To avoid overfitting, $l_1$- and $l_2$-penalties on all weight matrices are used for regularization. The vector $\theta_W$ contains all weight matrices $W_{**}^{(l)}$ of all layers. $\|\cdot\|_1$ and $\|\cdot\|_2$ denote the $l_1$ and $l_2$ norm, respectively.
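As a minimal NumPy sketch of Eq. (8), assuming the scalar network outputs and targets of all $M$ blocks are stacked into arrays of shape (M, B); the function and argument names are illustrative:

```python
import numpy as np

def cost_J(y, t, theta_W, lam1, lam2):
    """Eq. (8): squared error over all M blocks and B samples,
    plus l1- and l2-penalties on all weight matrices theta_W."""
    data_term = np.sum((y - t) ** 2)
    l1 = sum(np.abs(W).sum() for W in theta_W)   # ||theta_W||_1
    l2 = sum((W ** 2).sum() for W in theta_W)    # ||theta_W||_2^2
    return data_term + lam1 * l1 + lam2 * l2
```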
The regularization parameters $\lambda_1$ and $\lambda_2$ have to be chosen carefully and can be obtained by minimizing $J(\theta)$ over a validation set using grid search. The gradient of $J(\theta)$ with respect to $\theta$ is calculated using backpropagation through time (BPTT) without gradient truncation. The minimization is done by stochastic gradient descent [18]. The network is trained end-to-end. According to [19], a momentum of $\mu = 0.5$ is used to adjust the learning rate during training. Momentum increases the learning rate in gradient directions where the cost function has low curvature. Furthermore, an exponentially decaying learning rate schedule is used, which ensures stable minimization [20] without oscillation near local minima. According to [21], the initial weights are drawn from a uniform distribution.
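A sketch of one such update, assuming classical momentum SGD with $\mu = 0.5$ and an exponential schedule; the initial learning rate and decay factor below are illustrative, since the paper does not report them:

```python
import numpy as np

def learning_rate(epoch, eta0=0.1, decay=0.95):
    """Exponentially decaying schedule (eta0 and decay are assumptions)."""
    return eta0 * decay ** epoch

def sgd_momentum_step(theta, grad, velocity, lr, mu=0.5):
    """One SGD update with momentum mu; the velocity accumulates speed
    in gradient directions of persistently low curvature."""
    velocity = mu * velocity - lr * grad
    return theta + velocity, velocity
```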
3. APPLICATION TO POWER DISAGGREGATION

Let $s_k(n)$ be the submetered real power consumed by appliance $k$ at time $n$ and $x(n) = \sum_{k=1}^{K} s_k(n)$ be the aggregate power signal from $K$ appliances, respectively. Both signals are normalized to have unit variance. For a selected target appliance $s_t(n)$ among these $K$ devices, we train one RNN to perform time series regression on the aggregate signal and get an estimate $\hat{s}_t(n)$ for $s_t(n)$. Hence the output layer has only $N^{(L+1)} = 1$ unit. The aggregate signal is divided into $M$ non-overlapping blocks of length $B$: $\mathbf{x}_m = [x((m-1)B+1), \ldots, x(mB)]^T$, $1 \leq m \leq M$, which are processed independently by the RNN. The target blocks are extracted from the submetered signal $s_t(n)$ in a similar way: $\mathbf{t}_m = [s_t((m-1)B+1), \ldots, s_t(mB)]^T$. Because real power is always non-negative, a softplus activation function $\sigma^{(L+1)}(x) = \ln(1 + \exp(x))$ is chosen for the output layer. For all recurrent layers, a sigmoid activation $\sigma^{(l)}(x) = 1/(1 + \exp(-x))$ is used.
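The blocking and the output activation can be sketched as follows; dropping the trailing samples that do not fill a complete block is our assumption, not something the paper specifies:

```python
import numpy as np

def make_blocks(x, B):
    """Divide a signal into M non-overlapping blocks of length B:
    x_m = [x((m-1)B + 1), ..., x(mB)]^T for 1 <= m <= M."""
    M = len(x) // B      # incomplete trailing block is dropped (assumption)
    return x[:M * B].reshape(M, B)

def softplus(x):
    """Output activation ln(1 + exp(x)); its range (0, inf) matches
    the non-negativity of real power."""
    return np.logaddexp(0.0, x)  # numerically stable form of ln(1 + e^x)
```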
4. EXPERIMENTS AND RESULTS

The network is implemented in Python using Theano [22]. All experiments are done with the REDD. It contains real power measurements for six houses (no reactive power). In each house, two aggregate signals of phases A and B with a sampling frequency $F_s = 1\,\mathrm{Hz}$ and submetered power signals of individual appliances with a sampling frequency $F_s = 1/3\,\mathrm{Hz}$ are recorded. In our study, house 1 (18 appliances, 620 hours) and house 2 (9 appliances, 258 hours) are used.

Because of the different sampling frequencies of the aggregate and submetered signals, we performed our tests on synthetic aggregate signals obtained by summing up all submetered signals $s_k(n)$ in each house. For training and testing, both the aggregate and submetered signals of house 1 are divided into a training set containing the first 2/3 and a test set containing the last 1/3. House 2 is only used as test data to see whether a network trained on house 1 is able to disaggregate a similar appliance in house 2.
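A sketch of this data preparation, assuming the submetered signals of a house are aligned sample by sample in a (K, N) array; names are illustrative:

```python
import numpy as np

def synthetic_aggregate(submetered):
    """Sum all submetered signals s_k(n) of a house into a synthetic
    aggregate signal with a single, consistent sampling frequency."""
    return submetered.sum(axis=0)

def split_train_test(x, s_t):
    """First 2/3 of house 1 for training, last 1/3 for testing."""
    n_train = 2 * len(x) // 3
    return (x[:n_train], s_t[:n_train]), (x[n_train:], s_t[n_train:])
```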
Three different networks are trained to disaggregate the fridge (FR), the dishwasher (DW) and the microwave (MW) in house 1. The fridge has a quite periodic power consumption, whereas the other two devices show random events. The dishwasher is a multistate device, whereas the microwave and the fridge are on-off appliances. Unfortunately, there are no variable load devices in the REDD dataset.

In our experiments, each network consists of $L = 2$ recurrent layers with $N^{(1)} = N^{(2)} = 140$ units. The input layer has a width of $N^{(0)} = 10$. The parameter vector $\theta$ therefore has 485801 elements to be learned. The block length is chosen as $B = 5000$, corresponding to 4.17 hours of data. Each network is trained for a maximum of 100 epochs. Training is stopped early if the validation error does not decrease further.
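A sketch of this loop; the one-epoch patience below is an assumption, since the paper does not state its exact stopping criterion:

```python
def train_with_early_stopping(train_one_epoch, validation_error,
                              max_epochs=100):
    """Train for at most max_epochs epochs and stop early once the
    validation error no longer decreases.

    train_one_epoch, validation_error: callables supplied by the caller
    (e.g. one BPTT/SGD pass and an evaluation on the validation set).
    """
    best_val = float("inf")
    for epoch in range(max_epochs):
        train_one_epoch()
        val = validation_error()
        if val >= best_val:      # no further decrease: stop
            break                # (assumption: patience of a single epoch)
        best_val = val
    return best_val
```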
Fig. 2 shows the aggregate signal $x(n)$, the submetered reference signal $s_t(n)$, and the disaggregated signal $\hat{s}_t(n)$ for roughly 30 hours of the test set for the dishwasher and the fridge in house 1. The target power signal can be estimated quite accurately from the aggregate signal. Errors occur at discontinuities (e.g. switch-on/off) because the network seems to have a lowpass characteristic.

Tables 1 and 2 give some performance metrics for a quantitative analysis of the disaggregation. We use $E_t = \frac{1}{F_s} \sum_{n=1}^{N} s_t(n)$ in kWh to calculate the true energy consumption of the target appliance. Similarly, $\hat{E}_t = \frac{1}{F_s} \sum_{n=1}^{N} \hat{s}_t(n)$ and $E = \frac{1}{F_s} \sum_{n=1}^{N} x(n)$ denote the estimated energy consumption of the target appliance and the total energy consumption of the house, respectively.

$$\mathrm{NRMS} = \sqrt{\frac{\sum_{n=1}^{N} \left(\hat{s}_t(n) - s_t(n)\right)^2}{\sum_{n=1}^{N} s_t(n)^2}}$$

is the normalized root mean square error. The F1 score of the active periods of the target appliance is computed from the precision P and the recall R. The active periods in $s_t(n)$ and $\hat{s}_t(n)$ are detected by simple thresholding $s_t(n) \geq \gamma$, $\hat{s}_t(n) \geq \gamma$, where $\gamma = 30\,\mathrm{W}$ is chosen. We see that the time intervals of the active periods are detected reliably, while the estimation of the power values is less accurate.
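These metrics can be sketched as follows; the conversion from watt-seconds to kWh is made explicit, and F1 is derived from P and R in the standard way:

```python
import numpy as np

def energy_kwh(s, Fs):
    """E = (1/Fs) * sum_n s(n); with s in W this is energy in Ws,
    so divide by 3.6e6 to report kWh as in Tables 1 and 2."""
    return s.sum() / Fs / 3.6e6

def nrms(s_hat, s):
    """Normalized root mean square error between estimate and reference."""
    return np.sqrt(np.sum((s_hat - s) ** 2) / np.sum(s ** 2))

def active_period_scores(s_hat, s, gamma=30.0):
    """Precision P, recall R and F1 of the active periods, which are
    detected by thresholding both signals at gamma = 30 W."""
    est, ref = s_hat >= gamma, s >= gamma
    tp = np.sum(est & ref)
    P = tp / max(np.sum(est), 1)
    R = tp / max(np.sum(ref), 1)
    F1 = 2 * P * R / (P + R) if P + R > 0 else 0.0
    return P, R, F1
```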
[Fig. 2: real power [W] over time [h] for roughly 30 hours of the house 1 test set; panels: refrigerator (FR) and dishwasher (DW).]
Appl.   E_t [kWh]   Ê_t [kWh]   NRMS   F1     R      P
FR      23.9        23.0        0.33   0.91   0.98   0.85
DW      11.1        10.50       0.35   0.79   0.87   0.73
MW      7.8         7.9         0.74   0.66   0.83   0.54

Table 1. Validation on the test set of house 1 with E = 63.37 kWh.
Appl.   E_t [kWh]   Ê_t [kWh]   NRMS   F1     R      P
FR      20.7        20.6        0.35   0.93   0.96   0.91
DW      2.36        3.26        0.31   0.68   1.0    0.52
MW      4.0         2.11        0.58   0.09   0.05   0.5

Table 2. Validation on house 2 with E = 36.6 kWh.
To study the influence of the layer width, each network is trained repeatedly, starting from different random initializations. Each network is only trained for 10 epochs with fixed regularization parameters to reduce the computational complexity. Considering the models obtained from all random initializations, the median NRMS together with its interquartile range (IQR) for the training and test set is shown in Fig. 3. The training error decreases with an increasing layer width. However, for large layers the network tends to overfit to the training data. The optimal layer width seems to be around 140 units.

[Fig. 3: median NRMS with IQR over the layer width (20 to 160 units) for the training and test set.]
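A sketch of the evaluation across random restarts described above; the IQR is taken as the spread between the 25th and 75th percentiles:

```python
import numpy as np

def median_and_iqr(nrms_values):
    """Median NRMS and interquartile range over the models obtained
    from repeated training with different random initializations."""
    med = np.median(nrms_values)
    q1, q3 = np.percentile(nrms_values, [25, 75])
    return med, q3 - q1
```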