Panagiotis Kofinas
University of Piraeus, Department of Digital Systems
80, Karaoli & Dimitriou Str., 18534, Piraeus, Greece
+302105381287
panagiotis.kofinas@gmail.com

George Vouros
University of Piraeus, Department of Digital Systems
80, Karaoli & Dimitriou Str., 18534, Piraeus, Greece
+30 210-4142552
georgev@unipi.gr

Anastasios I. Dounis
Piraeus University of Applied Sciences (T.E.I. of Piraeus), Department of Automation Engineering
250, Thivon & P. Ralli Str., 12244 Egaleo - Athens, Greece
+302105381188
aidounis@teipir.gr
ABSTRACT

... around the world. A microgrid is a group of interconnected loads ...
3.2 Problem formulation

The PV source is the only source of the system. The battery can perform either as a source or as a consumer, and the other consumers of the system are the desalination unit and the dynamic load. The uncertainties of power production and power demand make the energy management of the microgrid a challenging problem.

Figure 3: Energy system overview

The aim for the users of the microgrid is to establish profitable and efficient strategies of energy use in constantly changing environments. The goal of the agent is the energy management of the microgrid, so that the battery is fully charged, the water tank is full of water, and the power demand is covered. Figure 5 presents the policy iteration scheme.

Figure 5: Q-Learning agent scheme

The actions that the agent can perform concern the battery and the desalination unit: charging/discharging the battery, and operation/non-operation of the desalination unit (Fig. 4). Thus the agent can decide whether the battery will be charged or discharged, and whether the desalination unit will operate or not. In our study, the action unit consists of an exploration/exploitation strategy for exploring the application of actions in different states, and a Q-Learning method for learning the optimal mapping between system states and actions.

4.2 State space and actions

As already said, the state of the microgrid is determined by the SOC of the battery, the percentage of potable water in the tank with respect to the tank's capacity, and the power balance. Although these measures are continuous, we discretize the state space by considering different ranges of the variables' values. Thus:
The discrete state of battery charge (dSOC) is considered to range in {1, 2, 3, 4} as follows:

dSOC = 1 if 0% ≤ SOC < 25%
dSOC = 2 if 25% ≤ SOC < 50%
dSOC = 3 if 50% ≤ SOC < 75%
dSOC = 4 if 75% ≤ SOC ≤ 100%

The discrete state of the tank (dtank) is considered to range in {1, 2, 3} as follows:

dtank = 1 if 0% ≤ tank < 33.3%
dtank = 2 if 33.3% ≤ tank < 66.7%
dtank = 3 if 66.7% ≤ tank ≤ 100%

The discrete power balance (dpb) is divided in only two states, according to whether the difference between the power production and the power consumption is negative or non-negative:

dpb = 1 if pp - pc < 0
dpb = 2 if pp - pc ≥ 0

The combination of the discrete variables' values results in a discrete state space of twenty-four states. It must be noted that although Q-learning methods for continuous state variables and state spaces do exist (e.g. [18]), we plan to study these in our future work.

We assume that the time interval between two consecutive instants is 1800 sec. At any time instant the agent performs one action for the battery and one action for the desalination unit. The actions that can take place in the battery are two: charging ("C") and discharging ("D"). If there is an energy surplus and the battery is charging, it absorbs the remaining power. If there is an energy surplus and the battery is discharging, it remains idle. If there is an energy deficit and the battery is charging, the battery remains idle. If there is an energy deficit and the battery is discharging, the battery supplies the power that the consumers need. The actions that can take place in the desalination unit are two: stop ("S") and operation at the maximum power of 550 W ("O"). If the desalination unit is in operation, it consumes 550 W and potable water is added into the tank at a rate of about 100 lt/h. If the desalination unit is not in operation, it consumes 0 W and no water is added into the tank. Thus, at any time instant the agent may decide any combination of the two actions.

In order to depict and understand how actions affect the microgrid, and for the simplicity of the presentation, assume a deterministic environment where the next state depends only on the current state and the action performed by the agent; the succession of states may then be presented as in Figure 6. In the state where dSOC=2, dtank=2 and dpb=2, if the actions "C" and "O" are performed, the agent will move to the state where dSOC=3, dtank=3 and dpb=2: the battery is charged and simultaneously the desalination unit adds potable water into the tank (surplus of power). If the actions "D" and "S" are performed, the agent will remain in the same state: the dynamic load is supplied by the PV power (surplus of power), the desalination unit is not operating, so no water is added into the tank, and the battery is not discharged, as the dynamic load is supplied directly by the PV source.

Figure 6: Succession of states

Of course, in our case the environment is not that deterministic, given that the load and water demands may change unpredictably and the energy produced by the PV may also be affected by external factors (e.g. the weather). These issues have been taken into account in our experimental study.
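To make the discretization concrete, the following is a minimal Python sketch of the mapping from continuous measurements to the discrete state. The function names (discretize, state_index) are ours, and the S1..S24 numbering convention is an assumption: the paper does not state how the twenty-four states are ordered.

    def discretize(soc, tank, pp, pc):
        """Map continuous measurements to the discrete state (dSOC, dtank, dpb).

        soc and tank are percentages in [0, 100]; pp and pc are the power
        production and consumption in watts. Bin edges follow the ranges above.
        """
        if soc < 25:
            d_soc = 1
        elif soc < 50:
            d_soc = 2
        elif soc < 75:
            d_soc = 3
        else:
            d_soc = 4

        if tank < 33.3:
            d_tank = 1
        elif tank < 66.7:
            d_tank = 2
        else:
            d_tank = 3

        d_pb = 2 if pp - pc >= 0 else 1
        return d_soc, d_tank, d_pb

    def state_index(d_soc, d_tank, d_pb):
        """Enumerate the 4 * 3 * 2 = 24 discrete states as 1..24
        (one possible convention, assumed here for illustration)."""
        return (d_soc - 1) * 6 + (d_tank - 1) * 2 + (d_pb - 1) + 1

    # Example: discretize(60.0, 50.0, 1200.0, 800.0) -> (3, 2, 2);
    # state_index(3, 2, 2) -> 16.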
4.3 Reward

The reward depends on three factors: the SOC of the battery, the amount of water in the tank, and the coverage of the power demand. The coverage of the power demand is achieved when the system supplies to the load at least 80% of the demanded power.

The main idea is to keep the SOC of the battery and the percentage of water in the tank (p_wat) at their maximum values and, simultaneously, cover the power demand. In case the SOC of the battery and the percentage of water in the tank are not at their maximum values, the goal is to cover the power demand and simultaneously increase the SOC of the battery and the volume of water in the tank. If this is not possible, the goal is only to cover the energy demand.

The term of the reward regarding the SOC of the battery (r_SOC) in the resulting state is as follows:

r_SOC = 1              if SOC > 90%
        -1             if SOC < 20%
        0.1 ⋅ soc_dt   if 20% ≤ SOC ≤ 90%    (6)

where soc_dt is the difference in the SOC between two consecutive time instants, i.e. the next and the current one. The value 0.1 is used to scale r_SOC into the range [-1, 1], according to the maximum and minimum values of soc_dt.

The term of the reward regarding the percentage of water in the tank (r_wat) in the resulting state is calculated as follows:

r_wat = 0.5             if p_wat > 90%
        -0.5            if p_wat < 20%
        0.1 ⋅ pwat_dt   if 20% ≤ p_wat ≤ 90%    (7)

where pwat_dt is the difference in the percentage of water in the tank between two consecutive time instants, i.e. the next and the current one. The value 0.1 is used to scale r_wat into the range [-0.5, 0.5].

The reward for the coverage of the energy demand (r_cov) by the load is calculated by:

r_cov = -1.5   if l_per > 0.2
        1.5    if l_per ≤ 0.2    (8)

where l_per is the relative difference between the demanded and the supplied power:

l_per = (l_dem - l_pow) / l_dem    (9)

where l_dem is the power demanded by the consumers and l_pow is the power supplied to the consumers. If l_per is greater than 0.2, the system does not cover the energy demand of the consumers and a negative individual reward equal to -1.5 is given. In cases where l_per is lower than or equal to 0.2, a positive reward equal to 1.5 is given.

The total reward is the sum of these three terms multiplied by the factor 1/3, conditioning the total reward to the range [-1, 1]:

R(s_k, s_k+1) = [r_SOC(s_k, s_k+1) + r_wat(s_k, s_k+1) + r_cov(s_k, s_k+1)] / 3    (10)
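The three reward terms and their combination can be sketched in Python as a direct transcription of Eqs. 6-10. The argument names are ours; soc_dt and pwat_dt are assumed to be expressed in units such that multiplying by 0.1 lands them in the stated ranges, as the text describes.

    def reward(soc, soc_dt, p_wat, pwat_dt, l_dem, l_pow):
        """Total reward of Eq. 10, composed of the terms of Eqs. 6-8.

        soc and p_wat are percentages in the resulting state; soc_dt and
        pwat_dt are their changes between the current and the next instant.
        """
        # Eq. 6: battery term
        if soc > 90:
            r_soc = 1.0
        elif soc < 20:
            r_soc = -1.0
        else:
            r_soc = 0.1 * soc_dt

        # Eq. 7: water-tank term
        if p_wat > 90:
            r_wat = 0.5
        elif p_wat < 20:
            r_wat = -0.5
        else:
            r_wat = 0.1 * pwat_dt

        # Eqs. 8-9: load-coverage term
        l_per = (l_dem - l_pow) / l_dem
        r_cov = 1.5 if l_per <= 0.2 else -1.5

        # Eq. 10: average the three terms so that R stays in [-1, 1]
        return (r_soc + r_wat + r_cov) / 3.0

    # Example: reward(soc=95, soc_dt=0, p_wat=95, pwat_dt=0,
    #                 l_dem=1000, l_pow=950) -> 1.0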
4.4 Exploration/Exploitation Strategy
In the beginning, the agent does not know which is the best action for each state. In order to perform high exploration at the beginning of the simulation, the following function is used:

p(i) = f(i) + r    (11)

where:

f(i) = 1 / log(0.03(i + 34))    (12)

where i is the number of times an action has been chosen, and r is a random number drawn from a uniform distribution in the range [0, 0.5] (MATLAB's rand function). The plot of the function f(i) is presented in Figure 7.

Figure 7: Plot of function f(i)

As shown in Figure 7, the value of f(i) gradually decreases as the number of times an action has been chosen increases. If the function p(i) gives a number greater than 0.7, a random action is selected; otherwise the best action is selected. This condition ensures pure exploration until the value of f(i) becomes equal to or lower than 0.7. For values of f(i) in the range [0.2, 0.7] the agent performs exploration with high probability for values near 0.7 and with lower probability for values near 0.2 (due to the random term). When the value of f(i) becomes lower than 0.2, the agent stops exploring and exploits by selecting the best action, given the current state.

The overall agent cycle is presented in Figure 8. The agent perceives the state of the environment by reading the corresponding sensors and specifies the current state of the microgrid by performing the discretization specified above. According to Eq. 11 it decides whether it will explore or exploit. It then decides the action to be performed and perceives the resulting state by acquiring the sensors' measurements. Finally, it calculates the reward and updates the Q-table according to Eq. 2.

Figure 8: Flowchart of Q-learning agent
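A minimal sketch of the exploration rule (Eqs. 11-12) and of the Q-table update that closes the agent cycle. Several details are our assumptions: the natural logarithm in Eq. 12 (as in MATLAB's log), the count i kept per state, and textbook placeholder values for the learning rate and discount factor, since Eq. 2 and its parameters are not shown in this excerpt.

    import math
    import random
    from collections import defaultdict

    def f(i):
        """Eq. 12; the natural logarithm is assumed."""
        return 1.0 / math.log(0.03 * (i + 34))

    def choose_action(Q, state, visit_count, actions):
        """Exploration/exploitation rule of Eq. 11: explore while
        p(i) = f(i) + r exceeds 0.7, otherwise act greedily."""
        r = random.uniform(0.0, 0.5)                 # random term of Eq. 11
        if f(visit_count) + r > 0.7:
            return random.choice(actions)            # explore
        return max(actions, key=lambda a: Q[(state, a)])  # exploit

    def q_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
        """One-step Q-learning update (the standard rule the paper's
        Eq. 2 refers to; alpha and gamma are placeholders, not the
        paper's values)."""
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])

    # Usage sketch:
    # Q = defaultdict(float); counts = defaultdict(int)
    # a = choose_action(Q, s, counts[s], actions); counts[s] += 1
    # ...observe s_next and the reward, then q_update(Q, s, a, R, s_next, actions)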
5. SIMULATION RESULTS

The simulation time is set to one year, with a simulation step time of 5 sec. The data concerning the power production are acquired from a 25 kW PV park located in Attica, Greece, with a sample time of 300 sec. The power consumption data set is collected from a four-household building with 60 sec sampling time. For the water demand, a constant consumption of 70 lt/h with a variation of 20 lt/h is assumed for 12 h/day (during daytime), and a consumption of 0 lt/h with a positive variation of 20 lt/h for the remaining 12 h/day (during night). During the day the variation follows a uniform distribution in the range [-20, 20] lt/h, while during the night it follows a uniform distribution in the range [0, 20] lt/h.

Table 1 presents the Q-table produced at the end of the simulation. The Q-table contains all the combinations of states and actions; to each combination a value has been assigned, and in each state the action with the highest value is the best one. There are four states that the exploration algorithm did not explore (the all-zero rows S7, S10, S19 and S22), and six states where the exploration was incomplete (rows with at least one zero entry: S12, S13, S14, S16, S23 and S24). Incompleteness means that there is at least one state/action combination that the agent did not explore.

Table 1: Q-table

State    ("D","S")   ("D","O")   ("C","S")   ("C","O")
S1        -1.8546     -1.4648     -1.9317     -1.0749
S2         0.9477     -2.0821     -3.0142     -1.9488
S3         5.7727     20.1139     16.0998      2.5059
S4         0.9212      2.4657      0.0197      0.0766
S5         0.9365      2.9992      0.1383     -0.0319
S6         7.6560     19.6856     18.8457     10.3534
S7         0           0           0           0
S8         0.2220      3.1622     -0.5983     -0.2765
S9         1.1824     19.3738      5.3704     -0.1333
S10        0           0           0           0
S11        0.4528      1.8866     -0.2039     -0.2831
S12        3.5814     12.3144      0.1569      0
S13       -0.1030      0           0.0182      0.5248
S14        1.3203      0           3.7075      0.2521
S15        6.1059     20.7187     21.2451      8.5461
S16        0           0.4265      0.4345      0
S17        1.0089      6.7522      1.2965      0.7369
S18       11.1878     20.6124     20.9590      8.3055
S19        0           0           0           0
S20        0.4944      0.6693      0.3608      0.5089
S21        0.2726     19.6120     19.4187      0.5596
S22        0           0           0           0
S23        0.2570      0.1697     -0.0297      0
S24        6.0153      0.4616      2.4078      0

Figure 9 presents the states of the agent versus time, the action selected in each state, and, finally, the reward received after performing this action at the corresponding state. In the beginning, when the agent purely explores the action to be performed at any state, there are many variations in the reward, and in many cases a negative reward is received by the agent. As time passes, and while the agent mostly exploits, the reward remains at high levels.

Figure 9: a) Actions, b) State and c) Reward

Figure 10 presents the SOC of the battery and the percentage of potable water in the tank. After the exploration period, the tank remains almost full, but the battery does not get fully charged due to the lack of exploration in the states defined by high levels of SOC.

Figure 10: a) SOC of the battery, b) Percentage of water into the tank

Figure 11 shows the power demand by the dynamic load, the true power that the load consumes, and the difference between the consumed power and the power demand. High negative values of this quantity mean that the agent does not satisfy the power demand; values close to zero show that the agent performs well and the power demand is covered. In the beginning, when the agent explores, there are many cases where the agent does not succeed in covering the power demand. When the agent starts exploiting (most of the time), the power demand is covered.

Figure 11: a) Load power demand, b) True power consumption and c) Load power difference

Finally, Figure 12 presents the power produced by the PV source, the SOC of the battery, the percentage of water in the tank, and the difference between the consumed power and the power demand for two consecutive days. The tank remains almost full and the battery is charging during the day, when there is surplus in the power balance. During the night no power is produced; the battery discharges and supplies the load and the desalination unit.

Figure 12: a) PV source power production, b) Percentage of water into the tank and c) SOC of the battery
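As a usage note on Table 1: the learned policy is read out of the Q-table by taking, for each state row, the joint (battery, desalination) action with the highest value. A minimal sketch (function and variable names ours), assuming the Q-table is stored as a dict keyed by (state, action):

    def greedy_policy(Q, states, actions):
        """For each state, pick the joint action with the highest value."""
        return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}

    # The four joint actions of Table 1:
    ACTIONS = [("D", "S"), ("D", "O"), ("C", "S"), ("C", "O")]
    # e.g. for state S3 in Table 1 the greedy choice is ("D", "O"),
    # whose value 20.1139 is the largest in that row.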
6. CONCLUSIONS AND FUTURE WORK

This paper demonstrates the potential of a Q-learning algorithm to control the energy flow between the elements of a solar microgrid. The agent learns by trial and error and demonstrates good performance after a period of time, keeping the total reward at high levels. The simulation results highlight the performance of the agent, as the water tank remains almost full and the SOC of the battery remains over 20%.

The agent could be enhanced by integrating a more sophisticated exploration algorithm. In future work, a more sophisticated exploration algorithm will be tested in a larger and more detailed state space. It is also our aim to propose alternative discretization strategies and perform comparative experiments with learning methods for continuous state spaces. MAS learning methods will also be studied for their effectiveness and efficiency in relation to a single-agent solution. Furthermore, we plan to apply these techniques to real-world settings where a microgrid contains more elements, such as a wind turbine, a fuel cell, hybrid electrical vehicles and the electrical grid. In a larger-scale microgrid, a fuzzy MAS architecture will be tested in order to increase the performance and the reliability of the microgrid under the uncertainties that consumers and production units introduce.

Acknowledgement. The publication of this paper has been partly supported by the University of Piraeus Research Center.

7. REFERENCES

[1] Manuela Sechilariu, Bao Chao Wang, Fabrice Locment, Antoine Jouglet (2014), DC microgrid power flow optimization by multi-layer supervision control. Design and experimental validation, Energy Conversion and Management 82, pp. 1-10.
[2] M. Smith (2012), DOE microgrid initiative overview, in Proc. 2012 DOE Microgrid Workshop, Chicago, Illinois, July 2012.
[3] Ting Huang, Derong Liu (2013), A self-learning scheme for residential energy system control and management, Neural Computing & Applications 22, pp. 59-69.
[4] Elizaveta Kuznetsova, Yan-Fu Li, Carlos Ruiz, Enrico Zio, Graham Ault, Keith Bell (2013), Reinforcement learning for microgrid management, Energy 59, pp. 133-146.
[5] R. Leo, R.S. Milton, S. Sibi (2014), Reinforcement Learning for Optimal Energy Management of a Solar Microgrid, in Proc. IEEE Global Humanitarian Technology Conference, pp. 181-186.
[6] R. Leo, S. Sibi, R.S. Milton (2014), Distributed Optimization of Solar Microgrid using Multi Agent Reinforcement Learning, in Proc. International Conference on Information and Communication Technologies.
[7] Il-Yop Chung, Cheol-Hee Yoo, Sang-Jin Oh (2013), Distributed Intelligent Microgrid Control Using Multi-Agent Systems, Engineering 5, pp. 1-6.
[8] Hak-Man Kim, Yujin Lim, Tetsuo Kinoshita (2012), An Intelligent Multiagent System for Autonomous Microgrid Operation, Energies 5, pp. 3347-3362.
[9] Y.S. Foo Eddy, H.B. Gooi, S.X. Chen (2014), Multi-Agent System for Distributed Management of Microgrids, IEEE Transactions on Power Systems 30, pp. 24-34.
[10] S. Russell, P. Norvig (1995), Artificial Intelligence: A Modern Approach, Prentice Hall, Upper Saddle River, NJ.
[11] C.J.C.H. Watkins (1989), Learning from Delayed Rewards, PhD Thesis, University of Cambridge, England.
[12] H.R. Berenji, P. Khedkar (1992), Learning and tuning fuzzy logic controllers through reinforcements, IEEE Trans. Neural Networks, vol. 3, pp. 724-740.
[13] A. Barto, R. Sutton, C.W. Anderson (1983), Neuronlike adaptive elements that can solve difficult learning control problems, IEEE Trans. Systems, Man, Cybernetics, vol. 13, no. 5, pp. 834-846.
[14] Anastasios I. Dounis, Panagiotis Kofinas, Constantine Alafodimos, Dimitrios Tseles (2013), Adaptive fuzzy gain scheduling PID controller for maximum power point tracking of photovoltaic system, Renewable Energy 60, pp. 202-214.
[15] Anastasios I. Dounis, P. Kofinas, G. Papadakis, C. Alafodimos (2015), A direct adaptive neural control for maximum power point tracking of photovoltaic system, Solar Energy 115, pp. 145-165.
[16] P. Kofinas, Anastasios I. Dounis, G. Papadakis, M.N. Assimakopoulos (2015), An Intelligent MPPT controller based on direct neural control for partially shaded PV system, Solar Energy 115, pp. 145-165.
[17] Anastasios I. Dounis (2010), Artificial intelligence for energy conservation in buildings, Advances in Building Energy Research 4, pp. 267-299.
[18] A. Angelidakis, G. Chalkiadakis (2015), Factored MDPs for Optimal Prosumer Decision-Making in Continuous State Spaces, in Proc. EUMAS 2015.