1949-3053 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: University of Exeter. Downloaded on May 04,2020 at 14:12:08 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSG.2020.2967430, IEEE
Transactions on Smart Grid
optimization of economic dispatch (ED) and IL management is cast as a stochastic optimization problem to compensate for the forecast errors of wind farm generation. Literature [14] sets out a framework for decision making in emergency operations for power transformers, which relies on interruption contracts as decision variables. In general, optimization problems with IL can be modeled as a multiperiod MINLP in which the state of IL is considered as an integer decision variable. However, traditional optimization algorithms that employ dynamic programming techniques only take into account a specific system operation pattern, which makes it difficult to adapt to realistic operation conditions such as variable electricity consumption patterns.

Unlike traditional model-based methods that require an explicit physical or mathematical model of the system, deep reinforcement learning (DRL), a combination of reinforcement learning (RL) and deep learning [15], is a model-free algorithm for solving complex control problems [16]. DRL has shown successful performance in playing Atari [17] and Go games [18] in recent years, and its success in AlphaGo demonstrates that DRL has become one of the most fascinating research areas in machine learning. So far, there have been some attempts to apply the DRL method to the DR control problem [19]-[23]. Paper [19] comprehensively reviews the use of reinforcement learning for DR applications in the smart grid. In detail, paper [20] concerns the joint bidding and pricing problem of a load serving entity, which is tackled by the deep deterministic policy gradient (DDPG) algorithm in [21]. Paper [22] proposes a dynamic pricing DR method with Q-learning for energy management in a hierarchical electricity market. Paper [23] designs an actor-critic-based DRL algorithm to determine the optimal energy management policy for industrial facilities. All of these papers formulate the DR control as a Markov decision process (MDP) and use a respective DRL algorithm to make complex DR decisions adapting to specific constraints.

In general, DRL algorithms can be divided into value-based and policy-based algorithms, wherein the deep Q network (DQN) is the most classical value-based algorithm. The dueling deep Q network (DDQN) structure [24] is an improvement of DQN [25], which effectively solves the problem of overestimation of the DQN value function and enhances the generalization ability of the model. Hence, this paper aims to propose a value-based DRL algorithm with the DDQN structure to map the real-time state of the grid to the DR strategy for IL, with the aim of reducing the peak load demand of the system as well as the operation costs of the DSO.

Considering the presented literature review, the main contributions of this paper are as follows:
1. The DR management problem for IL is formulated as an MDP, which allows the consideration of an accumulative profit of the DSO in the long term;
2. The proposed DRL algorithm realizes the direct mapping of the real-time state of the grid to the DR management strategy, thus achieving the goals of both regulating voltage and reducing the total operation costs of the DSO under variable electricity consumption patterns;
3. Compared with the traditional DQN, the DDQN structure overcomes the noise and instability of the value function during the iteration and enhances the convergence stability of the model.

The remainder of the paper is organized as follows. In Section II, the DDQN-based automatic demand response architecture is constructed and analyzed. Section III introduces the MDP framework for DR management of IL, in which the state, action, and reward function are formulated respectively. Section IV defines the state-action estimation function of DDQN and describes the DRL-based algorithm, from which the optimal DR strategy of IL is obtained. Section V applies the algorithm to the IEEE 33-node extension system and confirms its validity through simulation. At last, conclusions are drawn in Section VI.

II. DDQN-BASED AUTOMATIC DEMAND RESPONSE ARCHITECTURE

Traditional DR requires staff to operate the equipment or set the operating conditions manually, which cannot guarantee the response speed of users or the reliability of DR implementation. Automatic demand response (ADR), as one of the key technologies of the smart grid, refers to the implementation of demand response programs through automatic systems without manual intervention, which can truly achieve load scheduling and thereby improve energy efficiency. In this sense, it is necessary to introduce automation technology into DR.

By establishing a bidirectional communication network on the grid side and the user side, the ADR system realizes the automation of DR combined with advanced measurement technology and control methodology. The system architecture of DDQN-based ADR is shown in Fig. 1. The automatic demand response terminal (ADRT) integrates the input information of both the grid side and the user side to obtain state-action transitions (st, at, rt+1, st+1) and transmits them through the communication network to the database located in the computing center of the DSO for storage. The grid-side information includes real-time operation information of the grid (such as node voltage) collected by the measurement device and the TOU electricity price tariff, while the user-side information contains the interruption information and compensation scheme of each IL from the controller. As there may be many ILs widely distributed in the system, it is hard for the DSO to communicate with each IL individually, which would make the communication and control network more complicated. Therefore, a load aggregation controller is added to the user side of the independent system, which aggregates all ILs within the system and performs remote reading and interruption control for each IL.

Moreover, the action execution and network training are separated in the system in order to meet the real-time requirement of ADR. As shown in Fig. 1, the computing center and the ADRT each contain a DDQN, which play different roles in the DR management of IL as described in the following procedure. The computing center obtains state-action transitions from the ADRT and stores them in the
database, and then the DDQN at the computing center is trained based on the DRL algorithm. Furthermore, the DDQN of the ADRT is periodically updated by copying parameters from that of the computing center after the training. Finally, combined with the external input of the grid, the DDQN with updated network parameters can obtain the optimal DR strategy of IL (that is, whether the IL load is cut off during the peak load period) by performing a forward calculation, and the controller deployed on the user side executes the DR strategy on each IL and monitors the response results, thereby achieving the automation of DR for IL.

[Fig. 1. System architecture of DDQN-based ADR, linking the computing center of the DSO (database and DDQN) with the automatic demand response system (ADRT with its DDQN, exchanging parameters and state-action transitions st, at, rt+1, st+1).]

A. Formulation of Markov Decision Process
1) State Formulation
The composite state of the environment (S) includes the operation state of the system (Sop) and the state of IL (Sil), as in (1). In our case, the state of IL is defined as the power consumption of each IL, to track its interruption status and determine the compensation costs paid to IL in real time. The operation state is denoted as the voltage at the end nodes and the total power consumption at the root node of the system, which are necessary to regulate voltage and shave the peak load of the system, respectively. Accordingly, the observable state (st) of the system at time t is defined as in (2), including the voltage of node i (Ut^i), the total power (Pt^total), and the power consumption of node j (Pt^j) at time t. Nend and NIL represent the set of end nodes and the set of IL access nodes of the system, respectively.

    S = Sop ∪ Sil                                          (1)

    st = [Ut^i, Pt^total, Pt^j],  i ∈ Nend, j ∈ NIL        (2)

However, there are significant differences in the range of
where rt+1^vol(i) is the voltage reward at end node i, rt+1^il(j) is the interruption reward at IL access node j, and rt+1^eoc is the economic reward. wvol, wil, and weoc are their corresponding weights; the rewards in this paper are all between [0, 1] after the same normalization as (3).

B. Markov Decision Process Framework
With the state, action, and reward function defined, the MDP framework of DR management for IL is shown in Fig. 2. During each step, depending on the current state of the independent distribution system, the ADRT chooses an action (i.e., the DR strategy of IL) based on its current structure. The system feeds back a reward and then enters the next state due to the variation of the system power flow. Through the interaction between the ADRT and the system, the DDQN gradually updates its parameters during each learning episode, finally obtaining the DR strategy of IL with the maximum cumulative reward.

Fig. 3. Value function approximation process based on DDQN. [In the figure, transitions from the database fill an experience replay memory; a mini-batch of transitions yields the target Q value Q*(st, at) from rt+1 plus max Q(st+1, at+1), while the dueling network combines V(st) and A(st, at) into the current Q value Q'(st, at, w); the loss between the two drives gradient descent to update the parameters.]

The absolute values between the state and action dimensions of IL control differ a lot, which will cause noise and instability in the Q-learning algorithm [24]. In this
section, a dueling deep Q network (DDQN) is taken to approximate the optimal Q value in (8). Fig. 3 shows the value function approximation process based on DDQN.

As depicted in Fig. 3, the significant difference between DDQN and DQN is the structure of the estimation neural network. Different from the traditional DQN, which contains only one state-action estimation network, the dueling DQN architecture represents both the state value network V(st) and the action advantage network A(st, at) with a single deep model whose output combines the two to produce a state-action value Q(st, at). The Q function based on the DDQN structure is defined as (10):

    Q(st, at) = V(st) + [A(st, at) − (1/|A|) Σ_{at'∈A} A(st, at')]      (10)

where A represents the set containing all executable actions and |A| represents the number of all executable actions. The action advantage function is set to the individual action function minus the average value of all action advantage functions in a certain state, to remove redundant degrees of freedom and improve the stability of the algorithm.

In the following, key concepts of the DRL-based algorithm are formulated.
(1) In order to balance exploration and exploitation, the ε-greedy policy is used for action selection as in (11):

    at = { a random action,             if ξ < ε(T − t)/T
           argmax_{at'} Q(st, at'),     if ξ ≥ ε(T − t)/T              (11)

where ε is a fixed constant, T and t are the total steps and the current iteration step of learning, respectively, and ξ (0 < ξ ≤ 1) is a random number generated by the computer.
(2) The Q value estimation for all control actions can be calculated by performing a forward calculation in DDQN. The mean squared error between the target Q value and the output of the neural network is defined as the loss function in (12):

    Loss(w) = (1/n) Σ_{i=1}^{n} [Q*(st, at) − Q'(st, at, w)]²          (12)

where Q*(st, at) (i.e., the target Q value) is calculated as in (8), Q'(st, at, w) is the output of DDQN with parameters w, and n denotes the mini-batch size.
(3) In order to eliminate the strong correlations between samples in a short period, experience replay is adopted to store the state-action transitions (st, at, rt+1, st+1) at each time step in an experience pool with capacity N.

In Algorithm 1, the DDQN is initialized with random parameters w and the experience pool is initialized as an empty set. Commencing on line 4, the algorithm enters episodic time during learning. The algorithm engages in experience accumulation from line 5 to 10. In detail, as the step counter t increases, the action at is selected according to the ε-greedy policy, and the state-action transition tuple is stored in the experience pool in succession. When the number of samples in the pool accumulates to exceed the replay-start size M, the learning process from experience is conducted from line 11 to 15. In detail, a batch of samples of size n is drawn randomly from the pool in line 12. Then, the target Q values and the predicted Q values of the samples are calculated respectively in line 13; based on these, the loss function is calculated as (12) in line 14. Finally, the batch gradient descent (BGD) method is used to update the weights in DDQN in line 15.

Algorithm 1: DRL Algorithm for DR management of IL
1.  Set hyperparameters: γ, α, ε, n, T, N, M
2.  Initialize DDQN with random parameters w
3.  Initialize the experience pool as an empty set
4.  For t = 1, T, do:
5.    Reset state st randomly
6.    Select action at with ε-greedy policy
7.    Receive new state st+1 and reward rt+1
8.    If the number of samples > N:
9.      Remove the oldest observation sample
10.   Store (st, at, rt+1, st+1) into the experience pool
11.   If the number of samples > the replay-start size M:
12.     Sample a random mini-batch of (st, at, rt+1, st+1) of size n from the experience pool
13.     Obtain the target Q values and the predicted Q values respectively
14.     Calculate the loss function as (12)
15.     Update the parameters w of DDQN by performing the BGD method
16.   t = t + 1
17. End for

C. The DR Strategy of IL Based on DDQN
On the one hand, the action execution and network training of DDQN should be separated in the ADR system due to the limited computing ability of the ADRT. On the other hand, the observation state varies with the user load in the independent distribution system, which requires the learning process of the DDQN at the computing center to be carried out continually. Therefore, in order to achieve real-time DR management, the DDQN of the ADRT should be periodically updated by copying parameters from that of the computing center over a constant time frame, such as the control cycle of DR.

Fig. 4. DR control flow of IL based on DDQN. [In the flowchart, the ADRT starts from t = 0, periodically updates its parameters from the computing center, and while t < Tmax selects the action with the maximum Q value at each step.]
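The dueling combination in (10), the ε-greedy rule in (11), the mini-batch loss in (12), and the capacity-bounded experience pool of Algorithm 1 (lines 8-10) can be sketched in plain NumPy. This is an illustrative sketch, not the authors' TensorFlow implementation; all function names and the fixed seed are assumptions:

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)  # fixed seed, for a reproducible sketch

def dueling_q(v, advantages):
    # Eq. (10): Q(s,a) = V(s) + A(s,a) - (1/|A|) * sum_a' A(s,a')
    a = np.asarray(advantages, dtype=float)
    return v + a - a.mean()

def select_action(q_values, t, T, eps=1.0):
    # Eq. (11): explore while xi < eps*(T - t)/T; the threshold decays
    # linearly to 0, so late steps are almost always greedy.
    xi = rng.random()
    if xi < eps * (T - t) / T:
        return int(rng.integers(len(q_values)))   # random executable action
    return int(np.argmax(q_values))               # greedy action

def loss(q_target, q_pred):
    # Eq. (12): mean squared error over the mini-batch of size n
    return float(np.mean((np.asarray(q_target) - np.asarray(q_pred)) ** 2))

# Experience pool with capacity N (Algorithm 1, lines 8-10):
# deque(maxlen=N) discards the oldest transition automatically.
pool = deque(maxlen=1000)
```

For example, dueling_q(1.0, [1.0, 3.0]) subtracts the mean advantage 2.0 and yields Q values [0.0, 2.0], so only the relative ranking of actions comes from the advantage stream while the level comes from V(st).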
The DR control flow of IL based on DDQN is shown in Fig. 4, where Tmax denotes the maximum control cycle of DR and is selected as 24 hours in this paper. As shown in Fig. 4, the ADRT periodically reads the parameters of the DDQN at the computing center and updates its own parameters through the bidirectional communication network. Therefore, the optimal DR strategy of IL in each observation state is selected by simply performing a forward calculation and comparing the Q values of different actions, which greatly reduces the calculation time of the ADRT and makes the real-time application of ADR possible.

V. CASE STUDY AND RESULT ANALYSIS

A. Case Scenario
In this paper, the IEEE 33-node extension system is selected as a typical model of a medium voltage distribution system, as shown in Fig. 5, where nodes #6, #13, #23, and #29 are connected to IL, whereas the remaining nodes are connected to the general load. To keep consistent with the original load level, the realistic daily load curve at each node in the extension system aggregates the low voltage load profiles selected from the IEEE European Low Voltage Test Feeder [26] to fit in with the IEEE 33-node system. The system parameter setting is consistent with the literature. The current TOU tariff of the power grid and the compensation scheme [27] for each IL are shown in Table I and Table II, respectively.

[Fig. 5. The IEEE 33-node extension system.]

Table I
TOU TARIFF OF GRID
Periods | Time (h)                               | Price (yuan/kWh)
Peak    | 08:00-12:00, 16:00-23:00               | 1.25
Flat    | 06:00-08:00, 12:00-16:00, 23:00-24:00  | 0.8
Valley  | 00:00-06:00                            | 0.4

Table II
COMPENSATION SCHEME FOR IL
IL  | Compensation cost (yuan/kWh)
6   | 1.2
13  | 0.84
23  | 0.96
29  | 1.4

B. Simulation Parameter Settings
According to the above case scenario, the sample data of the observation state, control action, and immediate reward function in the MDP framework of Section III are formulated in Table III, where the weights of the immediate reward are selected according to the specific operational target in practice. In particular, the target of the DR management is to reduce the total operation costs of the DSO on the premise of regulating voltage to the safe limit, so the weights of the voltage reward, interruption reward, and economic reward in our case study are taken as 0.5, 0.2, and 0.3, respectively. Meanwhile, the reward coefficient and penalty coefficient are taken as 1 and -1, respectively, for normalization.

Table III
FORMULATION OF THE SAMPLE DATA
Function              | Corresponding sample data
Observation state st  | [(Ut^17, Ut^21, Ut^24, Ut^32), Pt^total, (Pt^6, Pt^13, Pt^23, Pt^29)]
Control action at     | [u_{t+1}^6, u_{t+1}^13, u_{t+1}^23, u_{t+1}^29]
Immediate reward rt+1 | Obtained by (5), where wvol = 0.5, wil = 0.2, weoc = 0.3, Freward = +1, and Fpenalty = -1

Then, the DDQN structure in the simulation is set as follows: the input layer of both the state value network V(st) and the action advantage network A(st, at) has 9 neurons (i.e., the dimension of the observation state).

Table IV
HYPER PARAMETERS OF DRL ALGORITHM
Hyper Parameters          | Value
discount factor γ         | 0.7
learning rate α           | 0.01
fixed constant ε          | 1
mini-batch size n         | 30
total learning steps T    | 3000
replay memory capacity N  | 1000
replay-start size M       | 100

C. Results and Analysis
Applying the algorithm of Section IV to the case study, the learning process of the DRL algorithm is programmed with the aid of TensorFlow, an open-source deep learning
framework developed by Google. Firstly, the learning performance of the neural network is evaluated. The comparison of the loss function between the DDQN and the traditional DQN structure during iterative training is shown in Fig. 6. Compared with DQN, the loss function using the DDQN structure has a faster rate of decline and reaches a smaller value at the end. Meanwhile, the fluctuation of the loss function is smaller, indicating that the algorithm with DDQN is more stable.

In order to demonstrate the convergence of the proposed algorithm, Fig. 7 shows the comparison of the cumulative reward at time 19:00 (when the voltage is the lowest of the day) between DDQN and DQN during iterative training. It is obvious from Fig. 7 that as experience is gained via episodic

Fig. 6. Comparison of loss function between DDQN and DQN during the learning process (loss versus learning episode, 0-3000).

[Fig. 7. Comparison of cumulative reward between DDQN and DQN during iterative training.]

Without DR, the voltage of nodes #17 and #32 would be low or even over-limit, indicating that the existing load level would cause a low voltage problem at the end nodes of the system. Fig. 8(a) shows that the regulation effect with DQN is limited, as the voltage of node #17 is still over-limit after DR. In contrast to DQN, the voltage of nodes #17 and #32 with the DR management of DDQN rises significantly from the over-limit status to the safe status (i.e., within ±5% of the rated voltage), which verifies the effectiveness of the voltage reward. Therefore, the voltage regulation effect with DDQN is obvious, ensuring that the voltage of all the nodes in the system stays within the safe limit and thus effectively solving the low voltage problem of the distribution network.

Fig. 8. Comparison of voltage regulation of all end nodes (#17, #21, #24, #32) between DDQN and DQN; panel (b) shows the DDQN structure (node voltage in pu versus time over 24 h).
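The ±5% safe band used in this comparison can be expressed as a small check on per-unit voltages. This is a sketch only; the function name, the 1.0 pu rated voltage, and the example readings are assumptions for illustration, not data reported in the paper:

```python
def over_limit_nodes(voltages_pu, rated=1.0, band=0.05):
    """Return the nodes whose voltage leaves the +/-5% band around the rated value."""
    lo, hi = rated * (1 - band), rated * (1 + band)  # safe interval [0.95, 1.05] pu
    return [node for node, u in voltages_pu.items() if not (lo <= u <= hi)]

# Illustrative readings for the four end nodes: #17 sits below 0.95 pu
print(over_limit_nodes({17: 0.93, 21: 0.99, 24: 0.98, 32: 0.96}))  # → [17]
```

A node is flagged both below 0.95 pu and above 1.05 pu, matching the symmetric safe-status definition used in the voltage reward.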
… verifies the effectiveness of the economic reward. Thus, the DR strategy of IL obtained by the proposed DDQN method reduces the total costs of the DSO by about 16.9% in one day, greatly improving the economic benefits under the premise of ensuring normal power supply to the other general users.

Fig. 9. Comparison of the total load curve between DDQN and DQN (total power in kW without DR, with DQN, and with DDQN over 24 h).

Fig. 10. Comparison of total operation costs in one day for the DSO between DDQN and DQN (without DR: 36181.10 yuan; DQN: 30921.24 yuan, i.e., electricity bill 26137.86 plus compensation costs 4783.38; DDQN: 30062.00 yuan, i.e., electricity bill 24898.39 plus compensation costs 5163.61).

It can be concluded from Fig. 6 to Fig. 10 that the DR strategy with DDQN is apparently superior to that with DQN. With regard to the action combinations of IL under the proposed DR strategy with DDQN, the optimized load curve of each IL in one day is shown in Fig. 11, where the blue line represents the initial load curve without DR whereas the red line represents the one with DDQN-based DR. A power of "0" indicates that the power supply to the IL is interrupted at that moment and the interruption compensation fee is paid to the IL by the DSO at the same time. It can be seen from Fig. 11 that all ILs in the case are interrupted mainly in the time intervals 08:00-13:00 and 17:00-23:00, which belong to the peak periods with heavy load and expensive electricity … at interval 16:45-23:30. Moreover, the interruption time of each IL focuses on one certain interval, which improves user satisfaction and verifies the effectiveness of the interruption reward.

… required to interrupt its power consumption. Therefore, the interruptible price in practice needs to be officially confirmed by the electricity regulatory department, and there is a minimum capacity requirement for participating users.

Fig. 11. The optimized load curve of each IL (IL6, IL13, IL23, IL29) with DDQN, before and after DR, over 24 h.

Fig. 12. The proportion of each IL in compensation costs with DDQN.

As discussed earlier, the discount factor γ is used for the consideration of long-term behavior. To verify its adaptability to the DR management of IL, Fig. 13 shows the boxplot of the total cumulative rewards, i.e., the sum of immediate rewards during a day, under different discount factor values. The DR management with γ = 0.7, which has a bigger mean cumulative reward, outperforms those with other values. It indicates that too small a γ may make the decision "short-sighted", but too large a value may also cause inaccurate action selection at some moments. Thus, the value of the discount factor should be selected via comparison for the specific control problem in practice.

[Fig. 13. Boxplot of total cumulative rewards under different discount factor values.]
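The trade-off controlled by γ can be illustrated with the discounted return the agent maximizes during learning (the function name and the hourly reward values below are made-up illustrations, not results from the paper):

```python
def discounted_return(rewards, gamma=0.7):
    """Sum_k gamma^k * r_k: gamma near 0 is 'short-sighted' (only the next
    reward matters), gamma near 1 weighs the whole day's rewards heavily."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [0.2, 0.8, 0.5]                 # illustrative immediate rewards
print(discounted_return(rewards, 0.7))    # 0.2 + 0.7*0.8 + 0.7**2*0.5 ≈ 1.005
print(discounted_return(rewards, 0.1))    # mostly just the first reward
```

With γ = 0.1 the later, larger rewards barely register, which is exactly the "short-sighted" behavior noted above; a moderate γ such as 0.7 lets them influence the current action without over-weighting distant, uncertain steps.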
VI. CONCLUSION AND FUTURE WORK

This paper fully explored the optimized scheduling space of IL at the demand side and presented a value-based DRL method with DDQN to optimize the DR management of IL in the case of realistic daily load of users. Through simulation analysis, the following conclusions can be drawn:
(1) The maximum long-term profit of the DR management for IL is obtained by formulating this problem as an MDP through the analysis of the cumulative reward.
(2) The proposed DRL-based algorithm realizes the direct mapping of the real-time state of the grid to the ADR management strategy, thus achieving the goals of both regulating voltage and reducing the total operation costs of the DSO.
(3) Based on the DDQN structure, this paper overcomes the noise and instability in the traditional DQN algorithm and illustrates the convergence stability of DDQN by comparing the decline of the loss function.
Further research work will focus on taking more realistic constraints of the interruption contracts, such as the load curtailment, into consideration, which will better adapt to the production demand of IL and further improve user satisfaction.

REFERENCES
[1] P. Palensky and D. Dietrich, "Demand side management: demand response, intelligent energy systems, and smart loads," IEEE Trans. Ind. Inf., vol. 7, no. 3, pp. 381-388, Aug. 2011.
[2] A. Mohsenian-Rad, V. W. S. Wong, J. Jatskevich, R. Schober, and A. Leon-Garcia, "Autonomous demand-side management based on game-theoretic energy consumption scheduling for the future smart grid," IEEE Trans. Smart Grid, vol. 1, no. 3, pp. 320-331, Dec. 2010.
[3] C. Li, X. Yu, W. Yu, G. Chen, and J. Wang, "Efficient computation for sparse load shifting in demand side management," IEEE Trans. Smart Grid, vol. 8, no. 1, pp. 250-261, Jan. 2017.
[4] T. Logenthiran, D. Srinivasan, and T. Z. Shun, "Demand side management in smart grid using heuristic optimization," IEEE Trans. Smart Grid, vol. 3, no. 3, pp. 1244-1252, Sept. 2012.
[5] J. S. Vardakas, N. Zorba, and C. V. Verikoukis, "A survey on demand response programs in smart grids: pricing methods and optimization algorithms," IEEE Commun. Surv. Tutor., vol. 17, no. 1, pp. 152-178, Firstquarter 2015.
[6] H. Karimi, S. Jadid, and H. Saboori, "Multi-objective bi-level optimisation to design real-time pricing for demand response programs in retail markets," IET Gener. Transm. Distrib., vol. 13, no. 8, pp. 1287-1296, May 2019.
[7] L. Zhao, Z. Yang, and W. Lee, "The impact of time-of-use (TOU) rate structure on consumption patterns of the residential customers," IEEE Trans. Ind. Appl., vol. 53, no. 6, pp. 5130-5138, Nov. 2017.
[8] T. W. Gedra and P. P. Varaiya, "Markets and pricing for interruptible electric power," IEEE Trans. Power Syst., vol. 8, no. 1, pp. 122-128, Feb. 1993.
[9] K. Bhattacharya, M. H. J. Bollen, and J. E. Daalder, "Real time optimal interruptible tariff mechanism incorporating utility-customer interactions," IEEE Trans. Power Syst., vol. 15, no. 2, pp. 700-706, May 2000.
[12] …, IEEE Trans. Power Syst., vol. 31, no. 3, pp. 2055-2063, May 2016.
[13] L. Yang, M. He, V. Vittal, and J. Zhang, "Stochastic optimization-based economic dispatch and interruptible load management with increased wind penetration," IEEE Trans. Smart Grid, vol. 7, no. 2, pp. 730-739, Mar. 2016.
[14] J. C. S. Sousa, O. R. Saavedra, and S. L. Lima, "Decision making in emergency operation for power transformers with regard to risks and interruptible load contracts," IEEE Trans. Power Deliv., vol. 33, no. 4, pp. 1556-1564, Aug. 2018.
[15] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller. (2013). Playing Atari with deep reinforcement learning. [Online]. Available: https://arxiv.org/abs/1312.5602
[16] Y. Hu, W. Li, K. Xu, T. Zahid, F. Qin, and C. Li, "Energy management strategy for a hybrid electric vehicle based on deep reinforcement learning," Appl. Sci., vol. 8, no. 2, Jan. 2018.
[17] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, pp. 529-533, Feb. 2015.
[18] D. Silver et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484-489, 2016.
[19] J. R. Vázquez-Canteli and Z. Nagy, "Reinforcement learning for demand response: A review of algorithms and modeling techniques," Appl. Energy, vol. 235, pp. 1072-1089, Feb. 2019.
[20] H. Xu, K. Zhang, and J. Zhang, "Optimal joint bidding and pricing of profit-seeking load serving entity," IEEE Trans. Power Syst., vol. 33, no. 5, pp. 5427-5436, Sep. 2018.
[21] H. Xu, H. Sun, D. Nikovski, S. Kitamura, K. Mori, and H. Hashimoto, "Deep reinforcement learning for joint bidding and pricing of load serving entity," IEEE Trans. Smart Grid, vol. 10, no. 6, pp. 6366-6375, Nov. 2019.
[22] R. Lu, S. H. Hong, and X. Zhang, "A dynamic pricing demand response algorithm for smart grid: Reinforcement learning approach," Appl. Energy, vol. 220, pp. 220-230, Jun. 2018.
[23] X. Huang, S. H. Hong, M. Yu, Y. Ding, and J. Jiang, "Demand response management for industrial facilities: A deep reinforcement learning approach," IEEE Access, vol. 7, pp. 82194-82205, 2019.
[24] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas. (2015). Dueling network architectures for deep reinforcement learning. [Online]. Available: https://arxiv.org/abs/1511.06581
[25] H. van Hasselt, A. Guez, and D. Silver. (2015). Deep reinforcement learning with double Q-learning. [Online]. Available: https://arxiv.org/abs/1509.06461v1
[26] (2016). IEEE PES AMPS DSAS Test Feeder Working Group. [Online]. Available: http://sites.ieee.org/pes-testfeeders/resources/
[27] L. Zhu, X. Zhou, L. Tang, and C. Lao, "Multi-objective optimal operation for microgrid considering interruptible loads," Power System Technology, vol. 41, no. 6, pp. 1847-1854, 2017.
[28] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.

BIOGRAPHIES

Biao Wang received the B.S. degree in electrical engineering from Wuhan University of Technology (WHUT), Wuhan, China, in 2018. He is currently pursuing the M.S. degree in electrical engineering at Huazhong University of Science and Technology (HUST). His current research interests include demand response, electric vehicles, and artificial intelligence.

Yan Li received the M.S. and Ph.D. degrees in electrical engineering from Huazhong University of Science and Technology (HUST), Wuhan, China, in
[10] J. Wang, X. Wang and X. Ding, “The forward contract model of 1999 and 2005, respectively. Currently, she is an Associate Professor with the
interruptible load in power market,” in Proc. IEEE PES Transm. Distrib. School of Electrical and Electronics, HUST. Her current research interests
Conf. Exhib. Asia Pacific, Dalian, China, 2005, pp. 1-5. include power system operation and control, distribution network planning,
[11] M. H. Imani, K. Yousefpour, M. T. Andani and M. Jabbari Ghadi, energy internet, Smart/Micro grid, distributed generation and so on.
“Effect of changes in incentives and penalties on interruptible/curtailable
demand response program in microgrid operation,” in Proc IEEE Texas
Power Energy Conf. (TPEC), Texas, USA, 2019, pp. 1-6.
[12] R. Bhana and T. J. Overbye, “The commitment of interruptible load to Weiyu Ming received the B.S. degree from the Department of Electrical
ensure adequate system primary frequency response,” IEEE Trans. Engineering, Hefei University of Technology, Hefei, China, in 2019. He is