Panagiotis Kofinas
University of Piraeus, Department of Digital Systems
80, Karaoli & Dimitriou Str., 18534, Piraeus, Greece
+302105381287
panagiotis.kofinas@gmail.com

George Vouros
University of Piraeus, Department of Digital Systems
80, Karaoli & Dimitriou Str., 18534, Piraeus, Greece
+30 210-4142552
georgev@unipi.gr

Anastasios I. Dounis
Piraeus University of Applied Sciences (T.E.I. of Piraeus), Department of Automation Engineering
250, Thivon & P. Ralli Str., 12244 Egaleo - Athens, Greece
+302105381188
aidounis@teipir.gr
ABSTRACT

... around the world. A microgrid is a group of interconnected loads ...
3.2 Problem formulation

The PV source is the only source of the system. The battery can perform either as a source or as a consumer, and the other consumers of the system are the desalination unit and the dynamic load. The uncertainties of power production and power demand make the energy management of the microgrid a challenging problem.

Figure 3: Energy system overview

The aim for the users of the microgrid is to establish profitable and efficient strategies of energy use in constantly changing environments. The goal of the agent is the energy management of the microgrid, so that the battery is fully charged, the water tank is full of water, and the power demand is covered. Figure 5 presents the policy iteration scheme.

Figure 5: Q-Learning agent scheme

The actions that the agent can perform concern the battery and the desalination unit: charging/discharging the battery, and operation/non-operation of the desalination unit (Fig. 4). Thus the agent can decide whether the battery will be charged or discharged, and whether the desalination unit will operate or not. In our study, the action unit consists of an exploration/exploitation strategy for exploring the application of actions in different states, and a Q-Learning method for learning the optimal mapping between system states and actions.

4.2 State space and actions

As already said, the state of the microgrid is determined by the SOC of the battery, the percentage of potable water in the tank with respect to the tank's capacity, and the power balance. Although these measures are continuous, we discretize the state space by considering different ranges of the variables' values. Thus:
The discrete state of battery charge (dSOC) is considered to range in {1, 2, 3, 4} as follows:

dSOC = 1 if 0% ≤ SOC < 25%
dSOC = 2 if 25% ≤ SOC < 50%
dSOC = 3 if 50% ≤ SOC < 75%
dSOC = 4 if 75% ≤ SOC ≤ 100%

The discrete state of the tank (dtank) is considered to range in {1, 2, 3} as follows:

dtank = 1 if 0% ≤ tank < 33.3%
dtank = 2 if 33.3% ≤ tank < 66.7%
dtank = 3 if 66.7% ≤ tank ≤ 100%

The discrete power balance (dpb) is divided in only two states, according to whether the difference between the power production and the power consumption is negative or non-negative:

dpb = 1 if pp - pc < 0
dpb = 2 if pp - pc ≥ 0

The combination of the discrete variables' values results in a discrete state space of twenty-four states. It must be noted that although Q-learning methods for continuous state variables and state spaces do exist (e.g. [18]), we plan to study these in our future work.

We assume that the time interval between two consecutive instants is 1800 sec. At any time instant the agent performs one action for the battery and one action for the desalination unit. The actions that can take place in the battery are two: charging ("C") and discharging ("D"). If there is an energy surplus and the battery is charging, it absorbs the remaining power. If there is an energy surplus and the battery is discharging, it remains idle. If there is an energy deficit and the battery is charging, the battery remains idle. If there is an energy deficit and the battery is discharging, the battery supplies the power that the consumers need. The actions that can take place in the desalination unit are two: stop ("S") and operation at the maximum power of 550 W ("O"). If the desalination unit is in operation, it consumes 550 W and potable water is added into the tank at a rate of about 100 lt/h. If the desalination unit is not in operation, it consumes 0 W and no water is added into the tank. Thus, at any time instant the agent may decide any combination of the two actions.

In order to depict and understand how actions affect the microgrid, and for the simplicity of the presentation, assume a deterministic environment where the next state depends only on the current state and the action performed by the agent; the succession of states may then be presented as in Figure 6. In the state where dSOC=2, dtank=2 and dpb=2, if the actions "C" and "O" are performed, the agent will move to the state where dSOC=3, dtank=3 and dpb=2: the battery is charged and simultaneously the desalination unit adds potable water into the tank (surplus of power). If the actions "D" and "S" are performed, the agent will remain in the same state: the dynamic load is supplied by the PV power (surplus of power), the desalination unit is not operating, so no water is added into the tank, and the battery is not discharged, as the dynamic load is supplied directly by the PV source.

Figure 6: Succession of states

Of course, in our case the environment is not that deterministic, given that the load and water demands may change unpredictably and the energy produced by the PV may also be affected by external factors (e.g. the weather). These issues have been taken into account in our experimental study.
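To make the discretization concrete, the following is a minimal Python sketch of the mapping from continuous measurements to the discrete state. The function names (discretize, state_index) are ours, and the S1..S24 numbering convention is an assumption: the paper does not state how the twenty-four states are ordered.

    def discretize(soc, tank, pp, pc):
        """Map continuous measurements to the discrete state (dSOC, dtank, dpb).

        soc and tank are percentages in [0, 100]; pp and pc are the power
        production and consumption in watts. Bin edges follow the ranges above.
        """
        if soc < 25:
            d_soc = 1
        elif soc < 50:
            d_soc = 2
        elif soc < 75:
            d_soc = 3
        else:
            d_soc = 4

        if tank < 33.3:
            d_tank = 1
        elif tank < 66.7:
            d_tank = 2
        else:
            d_tank = 3

        d_pb = 2 if pp - pc >= 0 else 1
        return d_soc, d_tank, d_pb

    def state_index(d_soc, d_tank, d_pb):
        """Enumerate the 4 * 3 * 2 = 24 discrete states as 1..24
        (one possible convention, assumed here for illustration)."""
        return (d_soc - 1) * 6 + (d_tank - 1) * 2 + (d_pb - 1) + 1

    # Example: discretize(60.0, 50.0, 1200.0, 800.0) -> (3, 2, 2);
    # state_index(3, 2, 2) -> 16.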
4.3 Reward

The reward depends on three factors: the SOC of the battery, the amount of water in the tank, and the coverage of the power demand. The coverage of the power demand is achieved when the system supplies to the load at least 80% of the demanded power.

The main idea is to keep the SOC of the battery and the percentage of water in the tank (p_wat) at their maximum values and, simultaneously, cover the power demand. In case the SOC of the battery and the percentage of water in the tank are not at their maximum values, the goal is to cover the power demand and simultaneously increase the SOC of the battery and the volume of water in the tank. If this is not possible, the goal is only to cover the energy demand.

The term of the reward regarding the SOC of the battery (r_SOC) in the resulting state is as follows:

r_SOC = 1              if SOC > 90%
        -1             if SOC < 20%
        0.1 ⋅ soc_dt   if 20% ≤ SOC ≤ 90%    (6)

where soc_dt is the difference in the SOC between two consecutive time instants, i.e. the next and the current one. The value 0.1 is used to scale r_SOC into the range [-1, 1], according to the maximum and minimum values of soc_dt.

The term of the reward regarding the percentage of water in the tank (r_wat) in the resulting state is calculated as follows:

r_wat = 0.5             if p_wat > 90%
        -0.5            if p_wat < 20%
        0.1 ⋅ pwat_dt   if 20% ≤ p_wat ≤ 90%    (7)

where pwat_dt is the difference in the percentage of water in the tank between two consecutive time instants, i.e. the next and the current one. The value 0.1 is used to scale r_wat into the range [-0.5, 0.5].

The reward for the coverage of the energy demand (r_cov) by the load is calculated by:

r_cov = -1.5   if l_per > 0.2
        1.5    if l_per ≤ 0.2    (8)

where l_per is the relative difference between the demanded and the supplied power:

l_per = (l_dem - l_pow) / l_dem    (9)

where l_dem is the power demanded by the consumers and l_pow is the power supplied to the consumers. If l_per is greater than 0.2, the system does not cover the energy demand of the consumers and a negative individual reward equal to -1.5 is given. In cases where l_per is lower than or equal to 0.2, a positive reward equal to 1.5 is given.

The total reward is the sum of these three terms multiplied by the factor 1/3, conditioning the total reward to the range [-1, 1]:

R(s_k, s_k+1) = [r_SOC(s_k, s_k+1) + r_wat(s_k, s_k+1) + r_cov(s_k, s_k+1)] / 3    (10)
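The three reward terms and their combination can be sketched in Python as a direct transcription of Eqs. 6-10. The argument names are ours; soc_dt and pwat_dt are assumed to be expressed in units such that multiplying by 0.1 lands them in the stated ranges, as the text describes.

    def reward(soc, soc_dt, p_wat, pwat_dt, l_dem, l_pow):
        """Total reward of Eq. 10, composed of the terms of Eqs. 6-8.

        soc and p_wat are percentages in the resulting state; soc_dt and
        pwat_dt are their changes between the current and the next instant.
        """
        # Eq. 6: battery term
        if soc > 90:
            r_soc = 1.0
        elif soc < 20:
            r_soc = -1.0
        else:
            r_soc = 0.1 * soc_dt

        # Eq. 7: water-tank term
        if p_wat > 90:
            r_wat = 0.5
        elif p_wat < 20:
            r_wat = -0.5
        else:
            r_wat = 0.1 * pwat_dt

        # Eqs. 8-9: load-coverage term
        l_per = (l_dem - l_pow) / l_dem
        r_cov = 1.5 if l_per <= 0.2 else -1.5

        # Eq. 10: average the three terms so that R stays in [-1, 1]
        return (r_soc + r_wat + r_cov) / 3.0

    # Example: reward(soc=95, soc_dt=0, p_wat=95, pwat_dt=0,
    #                 l_dem=1000, l_pow=950) -> 1.0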
4.4 Exploration/Exploitation Strategy
In the beginning, the agent does not know which is the best action for each state. In order to perform high exploration at the beginning of the simulation, the following function is used:

p(i) = f(i) + r    (11)

where:

f(i) = 1 / log(0.03(i + 34))    (12)

where i is the number of times an action has been chosen, and r is a random number drawn from a uniform distribution in the range [0, 0.5] (MATLAB's rand function). The plot of the function f(i) is presented in Figure 7.

Figure 7: Plot of function f(i)

As shown in Figure 7, the value of f(i) gradually decreases as the number of times an action has been chosen increases. If the function p(i) gives a number greater than 0.7, a random action is selected; otherwise the best action is selected. This condition ensures pure exploration until the value of f(i) becomes equal to or lower than 0.7. For values of f(i) in the range [0.2, 0.7] the agent performs exploration with high probability for values near 0.7 and with lower probability for values near 0.2 (due to the random term). When the value of f(i) becomes lower than 0.2, the agent stops exploring and exploits by selecting the best action, given the current state.

The overall agent cycle is presented in Figure 8. The agent perceives the state of the environment by reading the corresponding sensors and specifies the current state of the microgrid by performing the discretization specified above. According to Eq. 11 it decides whether it will explore or exploit. It then decides the action to be performed and perceives the resulting state by acquiring the sensors' measurements. Finally, it calculates the reward and updates the Q-table according to Eq. 2.

Figure 8: Flowchart of Q-learning agent
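A minimal sketch of the exploration rule (Eqs. 11-12) and of the Q-table update that closes the agent cycle. Several details are our assumptions: the natural logarithm in Eq. 12 (as in MATLAB's log), the count i kept per state, and textbook placeholder values for the learning rate and discount factor, since Eq. 2 and its parameters are not shown in this excerpt.

    import math
    import random
    from collections import defaultdict

    def f(i):
        """Eq. 12; the natural logarithm is assumed."""
        return 1.0 / math.log(0.03 * (i + 34))

    def choose_action(Q, state, visit_count, actions):
        """Exploration/exploitation rule of Eq. 11: explore while
        p(i) = f(i) + r exceeds 0.7, otherwise act greedily."""
        r = random.uniform(0.0, 0.5)                 # random term of Eq. 11
        if f(visit_count) + r > 0.7:
            return random.choice(actions)            # explore
        return max(actions, key=lambda a: Q[(state, a)])  # exploit

    def q_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
        """One-step Q-learning update (the standard rule the paper's
        Eq. 2 refers to; alpha and gamma are placeholders, not the
        paper's values)."""
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])

    # Usage sketch:
    # Q = defaultdict(float); counts = defaultdict(int)
    # a = choose_action(Q, s, counts[s], actions); counts[s] += 1
    # ...observe s_next and the reward, then q_update(Q, s, a, R, s_next, actions)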
5. SIMULATION RESULTS

The simulation time is set to one year, with a simulation step time of 5 sec. The data concerning the power production are acquired from a 25 kW PV park located in Attica, Greece, with a sample time of 300 sec. The power consumption data set is collected from a four-household building with 60 sec sampling time. For the water demand, a constant consumption of 70 lt/h with a variation of 20 lt/h is assumed for 12 h/day (during daytime), and a consumption of 0 lt/h with a positive variation of 20 lt/h for the remaining 12 h/day (during night). During the day the variation follows a uniform distribution in the range [-20, 20] lt/h, while during the night it follows a uniform distribution in the range [0, 20] lt/h.

Table 1 presents the Q-table produced at the end of the simulation. The Q-table contains all the combinations of states and actions; to each combination a value has been assigned, and in each state the action with the highest value is the best one. There are four states that the exploration algorithm did not explore (the all-zero rows S7, S10, S19 and S22), and six states where the exploration was incomplete (rows with at least one zero entry: S12, S13, S14, S16, S23 and S24). Incompleteness means that there is at least one state/action combination that the agent did not explore.

Table 1: Q-table

State    ("D","S")   ("D","O")   ("C","S")   ("C","O")
S1        -1.8546     -1.4648     -1.9317     -1.0749
S2         0.9477     -2.0821     -3.0142     -1.9488
S3         5.7727     20.1139     16.0998      2.5059
S4         0.9212      2.4657      0.0197      0.0766
S5         0.9365      2.9992      0.1383     -0.0319
S6         7.6560     19.6856     18.8457     10.3534
S7         0           0           0           0
S8         0.2220      3.1622     -0.5983     -0.2765
S9         1.1824     19.3738      5.3704     -0.1333
S10        0           0           0           0
S11        0.4528      1.8866     -0.2039     -0.2831
S12        3.5814     12.3144      0.1569      0
S13       -0.1030      0           0.0182      0.5248
S14        1.3203      0           3.7075      0.2521
S15        6.1059     20.7187     21.2451      8.5461
S16        0           0.4265      0.4345      0
S17        1.0089      6.7522      1.2965      0.7369
S18       11.1878     20.6124     20.9590      8.3055
S19        0           0           0           0
S20        0.4944      0.6693      0.3608      0.5089
S21        0.2726     19.6120     19.4187      0.5596
S22        0           0           0           0
S23        0.2570      0.1697     -0.0297      0
S24        6.0153      0.4616      2.4078      0

Figure 9 presents the states of the agent versus time, the action selected in each state, and, finally, the reward received after performing this action at the corresponding state. In the beginning, when the agent purely explores the action to be performed at any state, there are many variations in the reward, and in many cases a negative reward is received by the agent. As time passes, and while the agent mostly exploits, the reward remains at high levels.

Figure 9: a) Actions, b) State and c) Reward

Figure 10 presents the SOC of the battery and the percentage of potable water in the tank. After the exploration period, the tank remains almost full, but the battery does not get fully charged due to the lack of exploration in the states defined by high levels of SOC.

Figure 10: a) SOC of the battery, b) Percentage of water into the tank

Figure 11 shows the power demand by the dynamic load, the true power that the load consumes, and the difference between the consumed power and the power demand. High negative values of this quantity mean that the agent does not satisfy the power demand; values close to zero show that the agent performs well and the power demand is covered. In the beginning, when the agent explores, there are many cases where the agent does not succeed in covering the power demand. When the agent starts exploiting (most of the time), the power demand is covered.

Figure 11: a) Load power demand, b) True power consumption and c) Load power difference

Finally, Figure 12 presents the power produced by the PV source, the SOC of the battery, the percentage of water in the tank, and the difference between the consumed power and the power demand for two consecutive days. The tank remains almost full and the battery is charging during the day, when there is surplus in the power balance. During the night no power is produced; the battery discharges and supplies the load and the desalination unit.

Figure 12: a) PV source power production, b) Percentage of water into the tank and c) SOC of the battery
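As a usage note on Table 1: the learned policy is read out of the Q-table by taking, for each state row, the joint (battery, desalination) action with the highest value. A minimal sketch (function and variable names ours), assuming the Q-table is stored as a dict keyed by (state, action):

    def greedy_policy(Q, states, actions):
        """For each state, pick the joint action with the highest value."""
        return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}

    # The four joint actions of Table 1:
    ACTIONS = [("D", "S"), ("D", "O"), ("C", "S"), ("C", "O")]
    # e.g. for state S3 in Table 1 the greedy choice is ("D", "O"),
    # whose value 20.1139 is the largest in that row.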
6. CONCLUSIONS AND FUTURE WORK

This paper demonstrates the potential of a Q-learning algorithm to control the energy flow between the elements of a solar microgrid. The agent learns by trial and error and demonstrates good performance after a period of time, keeping the total reward at high levels. The simulation results highlight the performance of the agent, as the water tank remains almost full and the SOC of the battery remains over 20%.

The agent could be enhanced by integrating a more sophisticated exploration algorithm. In future work, a more sophisticated exploration algorithm will be tested in a larger and more detailed state space. It is also our aim to propose alternative discretization strategies and perform comparative experiments with learning methods for continuous state spaces. MAS learning methods will also be studied for their effectiveness and efficiency in relation to a single-agent solution. Furthermore, we plan to apply these techniques to real-world settings where a microgrid contains more elements, such as a wind turbine, a fuel cell, hybrid electrical vehicles and the electrical grid. In a larger-scale microgrid, a fuzzy MAS architecture will be tested in order to increase the performance and the reliability of the microgrid under the uncertainties that consumers and production units introduce.

Acknowledgement. The publication of this paper has been partly supported by the University of Piraeus Research Center.

7. REFERENCES

[1] Manuela Sechilariu, Bao Chao Wang, Fabrice Locment, Antoine Jouglet (2014), DC microgrid power flow optimization by multi-layer supervision control. Design and experimental validation, Energy Conversion and Management 82, pp. 1-10.
[2] M. Smith (2012), DOE microgrid initiative overview, in Proc. 2012 DOE Microgrid Workshop, Chicago, Illinois, July 2012.
[3] Ting Huang, Derong Liu (2013), A self-learning scheme for residential energy system control and management, Neural Computing & Applications 22, pp. 59-69.
[4] Elizaveta Kuznetsova, Yan-Fu Li, Carlos Ruiz, Enrico Zio, Graham Ault, Keith Bell (2013), Reinforcement learning for microgrid management, Energy 59, pp. 133-146.
[5] R. Leo, R.S. Milton, S. Sibi (2014), Reinforcement Learning for Optimal Energy Management of a Solar Microgrid, in Proc. IEEE Global Humanitarian Technology Conference, pp. 181-186.
[6] R. Leo, S. Sibi, R.S. Milton (2014), Distributed Optimization of Solar Microgrid using Multi Agent Reinforcement Learning, in Proc. International Conference on Information and Communication Technologies.
[7] Il-Yop Chung, Cheol-Hee Yoo, Sang-Jin Oh (2013), Distributed Intelligent Microgrid Control Using Multi-Agent Systems, Engineering 5, pp. 1-6.
[8] Hak-Man Kim, Yujin Lim, Tetsuo Kinoshita (2012), An Intelligent Multiagent System for Autonomous Microgrid Operation, Energies 5, pp. 3347-3362.
[9] Y.S. Foo Eddy, H.B. Gooi, S.X. Chen (2014), Multi-Agent System for Distributed Management of Microgrids, IEEE Transactions on Power Systems 30, pp. 24-34.
[10] S. Russell, P. Norvig (1995), Artificial Intelligence: A Modern Approach, Prentice Hall, Upper Saddle River, NJ.
[11] C.J.C.H. Watkins (1989), Learning from Delayed Rewards, PhD Thesis, University of Cambridge, England.
[12] H.R. Berenji, P. Khedkar (1992), Learning and tuning fuzzy logic controllers through reinforcements, IEEE Trans. Neural Networks, vol. 3, pp. 724-740.
[13] A. Barto, R. Sutton, C.W. Anderson (1983), Neuronlike adaptive elements that can solve difficult learning control problems, IEEE Trans. Systems, Man, Cybernetics, vol. 13, no. 5, pp. 834-846.
[14] Anastasios I. Dounis, Panagiotis Kofinas, Constantine Alafodimos, Dimitrios Tseles (2013), Adaptive fuzzy gain scheduling PID controller for maximum power point tracking of photovoltaic system, Renewable Energy 60, pp. 202-214.
[15] Anastasios I. Dounis, P. Kofinas, G. Papadakis, C. Alafodimos (2015), A direct adaptive neural control for maximum power point tracking of photovoltaic system, Solar Energy 115, pp. 145-165.
[16] P. Kofinas, Anastasios I. Dounis, G. Papadakis, M.N. Assimakopoulos (2015), An Intelligent MPPT controller based on direct neural control for partially shaded PV system, Solar Energy 115, pp. 145-165.
[17] Anastasios I. Dounis (2010), Artificial intelligence for energy conservation in buildings, Advances in Building Energy Research 4, pp. 267-299.
[18] A. Angelidakis, G. Chalkiadakis (2015), Factored MDPs for Optimal Prosumer Decision-Making in Continuous State Spaces, in Proc. EUMAS 2015.