
Energy and AI 13 (2023) 100242

Contents lists available at ScienceDirect

Energy and AI
journal homepage: www.elsevier.com/locate/egyai

Smart energy management system framework for population dynamics modelling and suitable energy trajectories identification in islanded micro-grids
Mehdi Mounsif ∗, Fabien Medard
Akkodis, 7 Bd Henri Ziegler, Blagnac, 31700, Occitanie, France

HIGHLIGHTS

• RL approaches increase the efficiency of storage and delivery systems in micro-grids.
• Power and electricity demand modelling through individual behaviour aggregation.
• Identification of suitable social configurations based on energy load forecasting.
• Comparison of various storage strategies over a wide spectrum of scenarios.

ARTICLE INFO

Keywords: Energy storage and management; Reinforcement learning; Population dynamics; Optimization

ABSTRACT

In an increasingly electrified and connected world, renewable energy production and robust distribution, as well as a sobriety paradigm, both for the individual and for society, will most likely play a central role in global systems stability. Consequently, while the ability to conceive efficient storage systems coupled with robust energy management strategies is of significant interest, a number of related studies consider the human behaviour factor separately. While not decisive in large industrial factories, the impact of human behaviour cannot be overlooked in residential areas. As such, this work proposes an innovative and flexible dynamic population model, inspired by epidemiological methods, that allows simulation of a vast spectrum of social scenarios. By pairing this formalization with a smart energy management strategy, a complete framework is proposed. In particular, beyond the theoretical identification of sustainable parameters in a wide diversity of configurations, our experiments demonstrate the relevance of reinforcement learning agents as efficient energy management policies. Depending on the scenario, the trained agent enables an increase of the sustainability areas over baseline strategies of up to 200%, thus hinting at an ultimately softer societal impact.

∗ Corresponding author.
E-mail address: mehdi.mounsif@akkodis.com (M. Mounsif).

https://doi.org/10.1016/j.egyai.2023.100242
Received 29 November 2022; Received in revised form 25 January 2023; Accepted 14 February 2023
Available online 23 February 2023
2666-5468/© 2023 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
M. Mounsif and F. Medard Energy and AI 13 (2023) 100242

Nomenclature

The next list describes several symbols that will be later used within the body of the document.

EMS  Energy Management System
IID  Independent and Identically Distributed
MDP  Markov Decision Process
PPO  Proximal Policy Optimization
RL   Reinforcement Learning

1. Introduction

There is no shortage of rigorous works [1–5] showing how crucial immediate and deep emission reductions across all activity sectors are to limiting excessive global warming. Indeed, failure to meet the goals formulated during the Paris summit is highly likely to lead mankind to dire global situations. This global state is consequently calling for swift and significant societal evolution, targeting both institutions and individuals, whose local actions undeniably contribute to greenhouse gas emissions [6–8].

In fact, notwithstanding strong surges in energy price and geopolitical tensions regarding energy access, non-negligible population groups are only remotely willing to accept a paradigm compatible with climate change. This consequently makes national policy sensitive to formulate and complex to adjust, since strong public interest and adherence are needed [9].

In this context, policy makers are likely to rely on simulation tools to explore numerous scenarios and associated consequences. Indeed, as demonstrated on multiple occasions during the COVID-19 pandemic [10], these methods can provide an interesting framework with low to moderate cost and enhanced flexibility to plan and foresee the consequences of a given configuration. Nevertheless, by not relying on cross-domain modelling [11,12], such techniques are often used standalone and may fail to capture wider factors, which is specifically the scope of the proposed method.

With the presented model, islanded micro-grids can be seen as buildings, districts or even cities, depending on what policy makers would like to explore.

In particular, extending insights from [13,14] that illustrated the strong relevance of Reinforcement Learning (RL) in micro-grid energy management systems (EMS), a system pairing population modelling and smart energy management is introduced. As shown in Fig. 1, it considers a hypothetical islanded configuration hosting a population composed of multiple groups of varying environmental sensitivity, denoted A, P and N. Depending on the targeted environment, policy makers and prospective analysts can consider buildings, districts or even cities connected to a renewable energy generation source and its associated storage system. This consequently enables scenario exploration by defining the parameters governing internal population interactions and the range of reactions to external factors.

Concretely, a number of transition mechanisms between the various groups of inhabitants are introduced, such as conversion through discussion or sensitivity to price, and can be given a complete range of importance to represent distinct societal configurations. These can then be evaluated over a time window, through the dynamic model proposed, and provide an analysis of whether these parameters yield sustainable energy demand for a given storage policy, as presented in Fig. 2.

As such, the main contributions of this work lie in the original dynamic population model, which introduces a framework pairing a smart EMS with flexible population dynamics modelling. It demonstrates clear advantages for RL-based policies in increasing the area of sustainability with respect to external constraints. Specifically, the extensive experiments set up suggest that competitive data-driven strategies can play a significant role in easing society's energy transition as they will, in fine, enable authorities to formulate smoother social policies, more likely to be adopted.

To this aim, the next section introduces related works and the diverse challenges and limitations associated with the multiple fields linked to the presented initiative. Then, the focus is directed towards population modelling approaches, with an exhaustive view of the formalized transition mechanisms. The following section is dedicated to the experimental setup: it first establishes the relevance of reinforcement learning approaches for the energy management system against baseline policies and then presents the exploration results of multiple scenarios. Beyond identification of acceptable population parameters for different initial configurations, these results also illustrate how the proposed method increases the area of sustainable policies, ultimately facilitating social acceptance of the evolution needed to comply with energy constraints.

2. Related works

As evoked in the introduction, compelling scientific evidence [4] is motivating societal evolution to minimize risks and impacts related to the effects of global warming. As such, recent years have witnessed a strong acceleration in environment-related sustainable investments and an increase in renewable energies within the global electricity mix [15–18]. An important aspect of this evolution is the conception and integration of micro-grids, a key element in the transition from centralized energy sources to distributed configurations [19]. This new organization for energy access and distribution, designed to accommodate local demand with flexibility, implies multiple challenges [20,21], such as intermittency and volatility, since its electricity supply is limited, as opposed to more conventional production units that provide a stable and easily predictable power intensity. To achieve the multiple benefits advertised by micro-grids, multiple works introduce approaches for designing the required robust energy management system [22–24]. For instance, the work presented in [25] introduces a robust MPC controller to minimize a set of metrics related to operation cost and pollution emission, while ensuring suitable ranges for the daily number of battery cycles. Despite interesting results, these strategies rely on expert knowledge and simplifications to model the environment dynamics, which may hinder the maximal attainable performance.

More globally, while technical progress in the study and application of micro-grids as the main power source for a given area is remarkable, most approaches are evaluated in a stationary setting, often using historical data [26,27]. However, doing so does not necessarily offer a reliable perspective should behavioural adaptation occur, which is highly likely in the context of sobriety and energy transition.

Crucially, the study of individual and group behaviour is the main topic of a strong body of work dedicated to dynamic modelling of populations [12,28,29]. For instance, [30] describes various non-linear systems of interactions between diverse species, studying in particular how predation, competition and mutualism shape the collective trajectories of the considered groups. In a related setting, [31] focuses more specifically on the role of predators in an ecosystem and how, as also suggested in [11], this particular type of individual can influence the many environmental cycles that can be encountered in ecosystems.

Building on these ideas and the associated non-linear models used to describe group interactions, many aspects of this field have been central in the COVID-19 crisis management [10]. In fact, they allowed authorities to evaluate possible trajectories and, using relevant


Fig. 1. This work considers a hypothetical islanded population, involving various groups and paradigms, connected to a renewable energy production source through a storage
device and its associated policy.

Fig. 2. S3SAME pairs a dynamic population model with a RL-based EMS strategy to evaluate the effects of human behaviour on system performances and identify scenario
compatibility with renewable energy sources.

data, plan and formulate local measures to ensure safety as well as possible [32,33]. In [34], an individual-level model accounting for geographic and demographic particularities is proposed and used to estimate probable parameters governing the epidemic spreading. In particular, this approach enabled the simulation of the effect of local contact restrictions and suggested that control strategies should be tailored per population, further illustrating the relevance of population models.

On another axis, deep learning methods are transversal, with applications ranging from complex visual tasks [35] and protein folding [36] to board games [37] or remarkable natural language manipulation [38,39]. In particular, finding relevant strategies in complex, dynamic and reactive environments has mostly been considered in reinforcement learning (RL). In [40], the authors demonstrate that RL agents are able to confine and stabilize fusion plasma via tokamak control in a nuclear reactor, while in [41] a robotic hand is trained to solve a Rubik's cube. However, RL agents are known to be brittle [42,43] as they are essentially trained and evaluated in stationary environments, a feature shared with some of the works dedicated to micro-grid study. In particular, reward-optimized agents are particularly dependent on the independent and identically distributed (i.i.d.) hypothesis [44,45] and generally report drastic performance loss when deployed outside of the training distribution. Indeed, as demonstrated in [46], many recent research attempts develop methods to increase agents' ability to manage unseen configurations, with encouraging but limited results so far.

In a global approach, it is possible to notice that these very diverse fields, such as micro-grid design and population modelling, could mutually benefit from concepts introduced in the others. Indeed, population modelling could enable the accurate estimation of the required work capacity of a micro-grid, based on the projected evolution of local demand and its sustainability using a smart approach, possibly RL-driven, within the EMS. However, these views are very rarely studied and used in conjunction, leaving out important optimization areas and potentially critical insights, which consequently motivates the proposed framework and enables flexible modelling and exploration of population trajectories in the context of renewable energy production and management.

3. Dynamic population modelling

This section presents the main modelling hypotheses involved in the scenario creation and study. It essentially introduces the population model paradigm and the different population groups involved in the dynamics, and offers details on the transition mechanisms that motivate individual transfer between groups.

3.1. Population modelling

As mentioned, total energy consumption is heavily influenced by population behaviours which, while often presented as an averaged behaviour, are significantly more diverse and present a fine-grained sophistication. Consequently, the framework introduced in this paper is inspired by epidemiological modelling and implements a simplified, although not unrealistic, population model that involves three population groups, each exhibiting a specific consumption factor. In particular, the following paradigms are represented:

• Environmentally active people (A): group that is the most dedicated to reducing its energy consumption
• Environmentally passive (P): intermediate group
• Neutral (N): inactive group, regarding climate and energy consumption issues
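As a rough illustration of how the three groups A, P and N can be aggregated into a single demand figure, the sketch below computes a population-weighted mean consumption factor. The weighted-mean aggregation, the factor values and the function name are assumptions made for this example; they are not stated in the text above.

```python
# Hypothetical consumption factors per group (illustrative values only,
# not taken from the article): active people consume less, neutral more.
CONSUMPTION_FACTOR = {"A": 0.7, "P": 1.0, "N": 1.3}


def mean_consumer_profile(shares, factors=CONSUMPTION_FACTOR):
    """Return the population-weighted mean consumption factor.

    `shares` maps each group label to a headcount or a fraction;
    it is normalized by its total, so both conventions work.
    """
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("population must be positive")
    return sum(shares[group] * factors[group] for group in shares) / total


if __name__ == "__main__":
    # 20% active, 50% passive, 30% neutral
    print(mean_consumer_profile({"A": 0.2, "P": 0.5, "N": 0.3}))
```

With these assumed factors, a population drifting from A towards N raises the mean factor, which is the kind of demand shift the population model is meant to capture.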


Fig. 3. Population dynamics modelling: individuals are sensitive to different transition mechanisms that incite them to change their paradigm and adopt alternative behavioural
practices depending on the current context.

3.2. Transition mechanisms

As schematically shown in Fig. 3, multiple mechanisms are liable to incite people to update their behaviour and consequently change groups. More specifically, it is possible to distinguish between internal group mechanisms, which result from individual encounters, and external factors. Both are detailed in the following paragraphs.

3.2.1. Group dynamic factors

The proposed simulation framework relies on the idea that interaction between individuals belonging to different groups will necessarily involve an exchange of ideas and world views, thus resulting in some cases in group conversion for one of the participants. Additionally, inner pessimistic mechanisms are considered. These express the concept that individuals can reach a level of disappointment or discouragement that will push them towards the closest less energy-sober group. In this configuration, the neutral group is not affected and can be seen as an absorption state.

3.2.2. External factors

Although a wide range of external factors and their effects could be considered as influential on the proposed modelling of population dynamics, the current approach focuses specifically on price fluctuations. In particular, emulating the current price volatility observed on international energy markets and the corresponding consumer behaviour adaptation, this work introduces price sensitivity thresholds that, once crossed, will initiate individual migration from one group to a more sober one (in a highly priced energy configuration) or to the closest less sober group (when energy prices are affordable).

4. Reinforcement learning background

As evoked in the above sections, an important contribution of this work lies in pairing a smart, RL-based EMS with the population model introduced above. This section consequently provides the basic theoretical elements required to understand reinforcement learning. Indeed, while multiple policy optimization strategies could be considered in the scope of this topic, previous experience [13,47] on related challenges motivates the use of reinforcement learning approaches [48].

Specifically, the goal of reinforcement learning is to produce a policy that maximizes a metric of success for a given task. Indeed, through interaction with the considered environment, the policy will optimize the sequence of selected actions based on the observations it receives, to generate trajectories that maximize the cumulative rewards given by the environment. Formally, the environment is most frequently defined as a Markov Decision Process (MDP), composed of:

• States $s \in S$
• Actions $a \in A$
• Reward function $r_t = f(s_t, a_t)$
• Transition probability function: $p(s_{t+1} \mid s_t, a_t)$

The recent surge of interest in reinforcement learning methods has led to the introduction of multiple paradigms [49–51]. Concretely, this particular work relies on Proximal Policy Optimization (PPO) [52], an on-policy, actor-critic method with a trust region that has demonstrated relevant properties regarding the complexity and features of our environment [40]. In particular, the PPO algorithm aims at scaling action selection probabilities according to the advantage function of a state–action pair. This can be formally expressed through the maximization of the following objective:

$$L(s, a, \theta_{\text{old}}, \theta_k) = \min\left( \rho_t(\theta_k)\, A^{\pi_{\theta_{\text{old}}}}(s, a),\ \operatorname{clip}\left(\rho_t(\theta_k),\ 1 - \varepsilon,\ 1 + \varepsilon\right) A^{\pi_{\theta_{\text{old}}}}(s, a) \right) \tag{1}$$

where $\rho_t(\theta_k)$ is the ratio between the probabilities assigned by the current policy and by the previous one to a given pair of state and action. This ratio can be expressed as:

$$\rho_t(\theta_k) = \frac{\pi_{\theta_k}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} \tag{2}$$

5. RL agent as an efficient energy storage policy

The economic and industrial context towards which this work is directed is heavily and actively investigated, with multiple actors of diverse nature tackling the challenges of energy management and storage. Nevertheless, the particularly fine granularity of the approaches and datasets makes the creation of a common benchmark challenging and, to the best of our knowledge, no unified and compatible setup is publicly available to evaluate the methods presented in this work against other state-of-the-art results. In this view, and since rigorous comparisons are paramount to establish the interest and the relevance of the presented approaches, the following paragraphs introduce the paradigm used to engineer baseline strategies. Concretely, these represent the best policy the authors could possibly create without relying on reinforcement learning algorithms, which is then evaluated against the RL-trained policy. Then, an analysis of the scaling factors for the complete spectrum of population distributions is discussed.

5.1. Environment for policy generation

Prior to the baseline introduction, this section details the fundamental MDP elements for training the reinforcement-learning-based storage policy, central in our approach. As explained in Section 4, reinforcement learning environments rely on an observation space, an action space, the reward function and the transition function. Nevertheless, since this work considers a deterministic process, similarly to [13], the transition function is not addressed. The remaining variables are formally defined as:

• The state space provides an observation, which is the input used by the agent to select an action: $s = \operatorname{concat}(s_{\text{historic}}, s_{\text{storage}}, s_{\text{latent}}) \in \mathbb{R}^{H+1+L}$ with
  – $s_{\text{historic}} \in \mathbb{R}^{H}$: the production over consumption historic ratio for the last $H$ steps (this work proposes $H = 8$)
  – $s_{\text{storage}} \in \mathbb{R}^{1}$: the current normalized storage state


  – $s_{\text{latent}} \in \mathbb{R}^{L}$: a latent representation provided by the predictive model, of dimension $L$ (such as $L = 16$ in this work)

• The action space is one-dimensional and continuous, $a_t \in [-1, 1]$, and controls the entirety of the storage and delivery process:
  – Specifically, for positive actions, the system stores the incoming energy scaled by the action and sends the rest to the grid for consumer usage. Formally:

$$\begin{cases} e_{\text{grid}} = (1 - a_t)\, p_{\text{turb}} \\ e_{\text{storage}} = \min\left[ e_{\text{storage}} + a_t \times p_{\text{turb}},\ e_{\text{max storage}} \right] \end{cases}$$

    where $e_{\text{grid}}$, $e_{\text{storage}}$, $e_{\text{max storage}}$, $p_{\text{turb}}$ and $a_t$ respectively represent the energy sent to the grid, the current available energy, the maximal storage volume, the production during the last time step and, finally, the agent action.
  – Alternatively, negative actions send the total energy collected during the last time step to the grid and add the maximal energy output scaled by the action. Formally, this can be expressed by:

$$\begin{cases} e_{\text{grid}} = p_{\text{turb}} + \min\left[ a_t \times \Delta_{\max},\ e_{\text{storage}} \right] \\ e_{\text{storage}} = \max\left[ e_{\text{storage}} - e_{\text{grid}},\ 0 \right] \end{cases}$$

    where $\Delta_{\max}$ represents the maximal energy volume that can be withdrawn in an environment time step, which is highly dependent on the physical system.

• Finally, the proposed reward function is designed to incite the agent to send an energy volume as close as possible to the next consumption. Formally, this can be expressed by $r_t = o_t + o_s + o_c$, where:
  – primary objective: the energy sent to the grid must be greater than or equal to the demand,

$$o_t = \begin{cases} 1 & \text{if } e_{\text{grid}} \ge e_{\text{consumption}} \\ 0 & \text{otherwise} \end{cases}$$

  – $o_s = \dfrac{e_{\text{storage}}}{e_{\text{max storage}}}$ incites the agent to fill the storage system
  – $o_c = \exp \alpha \times (e_{\text{grid}} - e_{\text{consumption}})$: accuracy-related bonus points, with $\alpha$ a scaling value

5.2. Heuristics-designed baseline storage strategy

Concretely, the proposed manual strategy is a reactive one, parameterized by $\gamma_m \in \mathbb{R}^{2m}$. Technically, it uses the last $m$ historic production over consumption ratios as input and has an additional bias parameter to compute the scalar action subsequently used by the MDP, as introduced in Section 5.1.

Formally, the manual policy action $a_{t,\gamma}$ can be obtained using the relation presented in Eq. (3).

$$\begin{aligned} \vec{v} &= \left[ \gamma_i - R_{n-i} \right]_{i \in [1..m]} \in \mathbb{R}^{m} \\ a_t &= \operatorname{clip}\left( \vec{\gamma}_{i \in [m+1..2m]} \cdot \vec{v},\ -1,\ 1 \right) \in \mathbb{R} \end{aligned} \tag{3}$$

where $n$ is the observation dimension, $R_{n-i}$ thus being the $i$th most recent production over consumption ratio. In practice, several manual policies are constructed, with $m \in \{1 \ldots 5\}$, and their respective parameters are optimized in a gradient ascent sense to maximize the reward presented in Section 5.1. Finally, we find that the policy with $m = 2$ presents the best average results; it is consequently the one kept for comparison against the reinforcement learning method.

5.3. RL agent performances evaluation

Similarly to [13], an experimental setup and specific metrics are designed to evaluate our trained policy performances with respect to a manually crafted baseline, in order to assess the relevance of using such advanced concepts. In practice, the proposed actor-critic is trained on historical European electricity prices and averaged individual consumption between 2015 and 2018, the year 2019 being used as the testing environment. During evaluation, each policy is run over two hundred episodes and the number of hours during which energy can be provided is recorded per episode. As presented in Fig. 4, it clearly appears that the RL-trained policy (shown in light blue) exhibits superior behaviour, in the sense that it is able to provide power to the population significantly more frequently than the manually crafted baseline (shown in red). From another point of view, given the important volatility inherent to renewable energies, Fig. 5 shows the per-episode distribution of energy availability when direct production is accounted for. As can be observed, the trained agent again displays a more efficient storage strategy, as more hours can be covered with the same energy load. These results increase the legitimacy of RL-driven methods and subsequent agents to be used in downstream tasks.

5.4. Storage policy performances as a function of the general population consumption profile

Having established that the RL-trained agent is the most efficient policy available in the energy storage setup considered in this work, it follows that its performance level and limitations must be clearly identified to provide a strong and robust comprehension and enable reasonable decision making. Specifically, our policy is trained under a default consumption profile (i.e. $\alpha = 1$) and the setup described in this section is designed to evaluate the evolution of its performance for a wide range of consumption profiles. Indeed, Fig. 6 displays the system performance as a function of the mean consumer profile. Specifically, for each value in the explored consumer factor range, a set of one thousand episodes is run by the policy and, for each one, the performance (number of hours with power over number of hours in total) is stored. Given the number of episodes considered, the significant spread of performance values can be explained by the wide diversity of production conditions. Nevertheless, it is possible to observe, as expected, a strong negative correlation between system performance and mean consumer profile and its associated consumption factor, also represented by the mean performance value. For additional confirmation and perspective on agent reliability, the same visualization for the baseline policy is also displayed (in red).

For an externally defined performance level, given specific country requirements, dedicated energy sources, multiple other factors and with respect to the storage policy, this global estimated slope can be relied on to define an $\alpha$ threshold above which the mean system performance cannot be guaranteed. This $\alpha$ value can be used to evaluate social scenarios, as presented in the following paragraphs.

6. Scenarios definition and evaluation

6.1. Population dynamics modelling

As explained in Section 3.1 and referred to multiple times afterwards, this approach relies heavily on the formalization of a population dynamic model that enables simulation of individual distribution between the three introduced group profiles over a given time horizon. Building on the concepts frequently used in epidemiological modelling [10,32,53] such as the SIR model, the following differential equations are proposed to represent respectively the active (Eq. (4)), passive (Eq. (5)) and neutral (Eq. (6)) group evolution:

$$\begin{aligned} \frac{dA}{dt} = &- F_{A \to N} \times A \\ &+ (T_{N \to A} - T_{A \to N}) \times A \times N \\ &+ (T_{P \to A} - T_{A \to P}) \times A \times P \\ &+ H_{\text{upgrade}}(\text{price}, \varepsilon_P) \times P \times H_{\text{speed}} \\ &- H_{\text{downgrade}}(\text{price}, \varepsilon_A) \times A \times H_{\text{speed}} \end{aligned} \tag{4}$$
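Eq. (4), together with its passive and neutral counterparts in Eqs. (5) and (6), forms a small ODE system. The sketch below is one possible way to integrate it with an explicit Euler scheme; it is not the authors' implementation. The `Rates` container, every numeric value, the Euler step and the reading of the Heaviside switches (upgrade active when the price exceeds the threshold, downgrade active when it is below) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Hypothetical parameter container; names mirror Eqs. (4)-(6)."""
    f_an: float     # F_{A->N}: faith loss of the active group
    f_pn: float     # F_{P->N}: faith loss of the passive group
    t_na: float     # T_{N->A}
    t_an: float     # T_{A->N}
    t_pa: float     # T_{P->A}
    t_ap: float     # T_{A->P}
    t_np: float     # T_{N->P}
    t_pn: float     # T_{P->N}
    eps_a: float    # price threshold of group A
    eps_p: float    # price threshold of group P
    eps_n: float    # price threshold of group N
    h_speed: float  # conversion speed applied once a threshold is crossed


def h_upgrade(price, eps):
    # Assumed reading: expensive energy pushes towards a more sober group.
    return 1.0 if price > eps else 0.0


def h_downgrade(price, eps):
    # Assumed reading: affordable energy pushes towards a less sober group.
    return 1.0 if price < eps else 0.0


def derivatives(a, p, n, price, k):
    """Right-hand sides of Eqs. (4), (5) and (6), term by term."""
    up_p = h_upgrade(price, k.eps_p) * p * k.h_speed
    up_n = h_upgrade(price, k.eps_n) * n * k.h_speed
    down_a = h_downgrade(price, k.eps_a) * a * k.h_speed
    down_p = h_downgrade(price, k.eps_p) * p * k.h_speed
    da = (-k.f_an * a + (k.t_na - k.t_an) * a * n
          + (k.t_pa - k.t_ap) * a * p + up_p - down_a)
    dp = (-k.f_pn * p + (k.t_np - k.t_pn) * p * n
          + (k.t_ap - k.t_pa) * a * p - up_p + up_n + down_a - down_p)
    dn = (k.f_an * a + k.f_pn * p + (k.t_an - k.t_na) * a * n
          + (k.t_pn - k.t_np) * p * n - up_n + down_p)
    return da, dp, dn


def simulate(state, prices, k, dt=1.0):
    """Explicit Euler integration, one step per price sample."""
    trajectory = [state]
    for price in prices:
        a, p, n = trajectory[-1]
        da, dp, dn = derivatives(a, p, n, price, k)
        trajectory.append((a + dt * da, p + dt * dp, n + dt * dn))
    return trajectory


if __name__ == "__main__":
    rates = Rates(0.01, 0.02, 0.05, 0.03, 0.04, 0.02, 0.03, 0.01,
                  0.1, 0.2, 0.3, 0.05)
    print(simulate((0.2, 0.5, 0.3), [0.25] * 30, rates)[-1])
```

Because every outgoing term of one group reappears with the opposite sign in another, the three derivatives sum to zero and the total population is conserved, which is a convenient sanity check on any implementation of these equations.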


Fig. 4. Distribution of normalized policy performance per episode. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version
of this article.)

Fig. 5. Distribution of normalized policy performance per episode.
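The per-episode performance shown in Figs. 4 and 5 counts the hours in which demand is met. The sketch below illustrates how such a figure could be computed from the storage rules of Section 5.1 and the heuristic of Eq. (3). It is a simplified reading, not the authors' environment: the negative-action branch uses the magnitude of the action and subtracts only the withdrawn energy from storage, missing history is padded with a ratio of 1, and all function names are hypothetical.

```python
def baseline_action(gamma, ratios, m):
    """Heuristic of Eq. (3): gamma holds 2*m parameters.

    The first m entries are offsets compared with the m most recent
    production/consumption ratios, the last m entries are weights.
    """
    # Pad with neutral ratios when fewer than m observations exist (assumption).
    recent = ([1.0] * m + list(ratios))[-m:][::-1]  # most recent first
    v = [gamma[i] - recent[i] for i in range(m)]
    raw = sum(w * x for w, x in zip(gamma[m:2 * m], v))
    return max(-1.0, min(1.0, raw))


def storage_step(action, production, stored, capacity, delta_max):
    """One environment step; returns (energy sent to grid, new storage level)."""
    if action >= 0:
        # Store a fraction `action` of the production, send the rest.
        to_grid = (1.0 - action) * production
        stored = min(stored + action * production, capacity)
    else:
        # Send all production plus a withdrawal bounded by storage content.
        withdrawn = min(-action * delta_max, stored)
        to_grid = production + withdrawn
        stored = max(stored - withdrawn, 0.0)
    return to_grid, stored


def availability(production, demand, policy, capacity, delta_max):
    """Fraction of time steps in which the grid receives at least the demand."""
    stored, history, served = 0.0, [], 0
    for prod, dem in zip(production, demand):
        action = policy(history)
        to_grid, stored = storage_step(action, prod, stored, capacity, delta_max)
        served += to_grid >= dem
        history.append(prod / dem if dem > 0 else 1.0)
    return served / len(production)


if __name__ == "__main__":
    gamma = [1.0, 1.0, 0.5, 0.5]  # illustrative parameters for m = 2
    policy = lambda history: baseline_action(gamma, history, 2)
    print(availability([3, 3, 0, 0], [1, 1, 1, 1], policy, 10.0, 1.0))
```

Any other policy, including a trained agent, can be passed as `policy` as long as it maps the ratio history to an action in [-1, 1], which keeps the comparison between strategies on equal footing.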

Fig. 6. Mean performance level as a function of consumption profile for a given RL-trained storage policy. (For interpretation of the references to colour in this figure legend,
the reader is referred to the web version of this article.)
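Section 5.4 derives an α threshold from the roughly linear relation between mean performance and consumption profile displayed in Fig. 6. A minimal sketch of that step is given below, assuming an ordinary least-squares fit. The data points, function names and the final day-counting rule (a day counts as sustainable when its mean profile stays at or below the threshold) are assumptions for illustration, not values or definitions taken from the article.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def alpha_threshold(alphas, mean_performance, required):
    """Largest consumption profile for which the fitted line meets `required`."""
    slope, intercept = fit_line(alphas, mean_performance)
    if slope >= 0:
        raise ValueError("expected performance to decrease with consumption")
    return (required - intercept) / slope


def sustainable_day_ratio(daily_profile, threshold):
    """Share of days whose mean consumer profile does not exceed the threshold."""
    return sum(alpha <= threshold for alpha in daily_profile) / len(daily_profile)


if __name__ == "__main__":
    # Made-up mean performance values for four consumption profiles.
    alphas = [0.5, 1.0, 1.5, 2.0]
    perf = [0.9, 0.8, 0.7, 0.6]
    limit = alpha_threshold(alphas, perf, required=0.75)
    print(limit, sustainable_day_ratio([1.0, 1.2, 1.3, 1.4], limit))
```

The same threshold can then be compared against the mean consumer profile trajectory produced by a population scenario to judge whether that scenario stays within the sustainable area.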


$$\begin{aligned} \frac{dP}{dt} = &- F_{P \to N} \times P \\ &+ (T_{N \to P} - T_{P \to N}) \times P \times N \\ &+ (T_{A \to P} - T_{P \to A}) \times A \times P \\ &- H_{\text{upgrade}}(\text{price}, \varepsilon_P) \times P \times H_{\text{speed}} \\ &+ H_{\text{upgrade}}(\text{price}, \varepsilon_N) \times N \times H_{\text{speed}} \\ &+ H_{\text{downgrade}}(\text{price}, \varepsilon_A) \times A \times H_{\text{speed}} \\ &- H_{\text{downgrade}}(\text{price}, \varepsilon_P) \times P \times H_{\text{speed}} \end{aligned} \tag{5}$$

$$\begin{aligned} \frac{dN}{dt} = &+ F_{A \to N} \times A \\ &+ F_{P \to N} \times P \\ &+ (T_{A \to N} - T_{N \to A}) \times A \times N \\ &+ (T_{P \to N} - T_{N \to P}) \times P \times N \\ &- H_{\text{upgrade}}(\text{price}, \varepsilon_N) \times N \times H_{\text{speed}} \\ &+ H_{\text{downgrade}}(\text{price}, \varepsilon_P) \times P \times H_{\text{speed}} \end{aligned} \tag{6}$$

where $F$ is the internal mechanism representing loss of faith and belief in one's group belonging and downgrading to the next less sober group, $T$ is the talk mechanism, which formalizes group interaction and thus allows this model to represent the conviction strength of each group, and finally $H_{\text{upgrade}}$ and $H_{\text{downgrade}}$ are Heaviside functions that materialize the price threshold (e.g. $\varepsilon_P$) above or below which individuals will start leaving their group to adjust to the price signal. As such, the passive group dynamic represented by Eq. (5) can be expressed as:

1. Subtract individuals losing faith and conviction that are moving towards less sober groups
2. Add individuals from the neutral group (less sober) as the difference of conviction strength between the passive and neutral group. This quantity can be negative.
3. A similar process is conducted, but this time involving the active group (more sober)
4. Subtract individuals moving from this group to the more sober one if the price is above the tolerance threshold $\varepsilon_P$. This happens at rate $H_{\text{speed}}$
5. Add individuals upgrading from the neutral group
6. Add individuals downgrading from the active group
7. Subtract individuals downgrading from this group towards the neutral group

6.2. Scenario definition and critical configuration identification

Ultimately, this work is dedicated to the design of a modular and flexible tool that would allow scenario definition, exploration and identification of critical configurations, for a given storage policy. As such, once the linear representation between the system performance and the mean consumer profile is obtained and the associated threshold identified, it becomes possible to connect this policy result to the population simulation module, previously introduced in Section 6.1. More specifically, by defining a scenario (i.e. a set of parameters for the governing population model), the corresponding mean consumer profile evolution over the time horizon can be computed. Such a value can then be used as a reference metric in order to assess the scenario acceptability. In practice, this work proposes to consider the ratio of days with a mean consumer profile over the total number of days, thus associating this performance level with the set of parameters underlying the scenario, which can be further compared with the threshold value to estimate the configuration acceptability, as illustrated in Fig. 1.

7. Results: Scenarios and associated parameters

In order to illustrate the interest of this work, this section introduces several scenarios, their associated parameters and possible exploration approaches. It also quantitatively demonstrates that using an RL-based policy to manage energy storage can significantly increase the state space area with sustainable parameters for a given scenario, which can ultimately bring advantages in the context of social policy design. Finally, it provides various examples of public decisions being driven by these results, consequently establishing the interest of the proposed tool. In particular, the analysis and interpretation of the results associated with each of the diverse scenarios and state space visualizations suggest courses of action to avoid critical and detrimental configurations while maximizing the energy objectives.

As no specific data compatible with the particularities of the proposed population model was available at the time of writing, this work provides a wide range for the considered parameters. For clarity purposes, the reader is referred to the Appendix for details related to the values used for the various scenarios. Conceptually, each simulation relies on the following parameters:

• Conversion Speed (Price) refers to the daily rate of conversion of a group whenever its price tolerance threshold is exceeded (either positively or negatively).
• Faith loss represents the daily rate of individuals leaving their group towards a less sober one by loss of conviction in their belief.
• Price thresholds indicate the respective price levels above or below which a population group begins to migrate towards a more sober (upgrade case) or less sober (downgrade case) group. This migration is controlled by the conversion speed previously introduced.
• Conversion (A → P) represents the daily conversion rate from the actively sober group to the passively sober one, which corresponds to an increase in consumption. In contrast, Conversion (P → A) determines the opposite flow.
• Similarly, Conversion (P → N) indicates how fast individuals from the passively sober group are degrading their practices and moving into the neutral group, and Conversion (N → P) defines the rate of the inverse conversion.

In practice, four scenarios, each with two variants (mostly based on revenue levels, to evaluate the impact of wealth), are explored in the next sections. Their narratives can be summed up as:

• Military conflict (MC): In the perspective of a military conflict, natural resource scarcity increases, driving strong inflation and more significant insecurity and uncertainty. This could indirectly make individuals more sensitive to climate issues and increase the conversion rate towards higher sobriety groups.
• Economical resurrection (EC): Increased confidence after exiting a global crisis elicits short-term goals and a significant desire for entertainment and leisure, consequently making people less environmentally conscious.
• Social tension (denoted COTC): this scenario explores the possibility of an increase in social tensions and its repercussion on the energy demand of a population. Inspired by current events, it considers that some political measures and societal evolutions are not perceived in a uniform fashion by the population, in particular based on individual metrics such as revenues or geographical location.
• Winter is coming (WC): Due to disturbing perspectives regarding energy availability during winter, individuals are more sensitive to environmental issues and more likely to converge towards more sober behaviours.

7.1. Quantitative comparison of storage strategies

While Section 5.3 has shown that RL-based storage strategies offer higher performances and returns than the proposed baselines, previous experiments were considered in a static configuration and did not offer a more global view, connected to behavioural parameters. This makes

7
M. Mounsif and F. Medard Energy and AI 13 (2023) 100242

Fig. 7. Comparison of state space policy scores for multiple scenarios. (For interpretation of the references to colour in this figure legend, the reader is referred to the web
version of this article.)

the evaluation of their impact in the context of dynamic population sustainable configurations (yellow) as well as strongly energy-hungry
non-trivial and may not necessarily encourage their adoption. This ones (deep blue) (see Fig. 8).
section consequently provides a comparison of the resulting state space
scores for the baseline and RL-based storage policies. 7.2.1. Subsidizing energy price
Fig. 7 thus illustrates the performance difference, for all considered Consider a population group 𝑝0 that has been positioned (through
scenarios (four with a variant each), between the two methods, where polls, for instance) at location 1, as shown in Table A.6. This group has
state space scores in line 1 and 3 are produces by the RL agent and an intermediate conviction level but can be perceived as more reactive
line 2 and 4 by the manual policy. As can be observed, sustainability to price movements than the mean population and can be numerically
(yellow regions) using RL covers significantly more state space and described by 𝑝0 = [0.5, 0.7], relative to each axis range. Despite the
ranges between reasonable increases (4% in the ER1 scenario) to more score increase provided by the usage of the RL-based storage strategy
than 200%. (+11%), this configuration does not belong to the sustainability area.
In the hypothesis that additional power sources are not available,
7.2. Specific case study multiple course of action can still be considered. Among them, the price
conversion threshold is an important aspect that can drive individual
As shown in Section 7.1, relying on policies trained in the rein- decisions regarding energy usage. Specifically, in this configuration,
forcement learning paradigm proves advantageous in a wide variety of local authorities could provide financial incentives to this variable in
scenarios as these methods are likely to significantly increase the state order to increase sustainability. For instance, Fig. 9 shows the resulting
space sustainability area. Building on this confirmation, this section evolution levels of highly energetically sober individuals and associated
proposes various practical example of usage of the proposed simulation configuration scores when the price conversion threshold is subsidized.
tool. While applicable to any scenario, the COTC0, shown in Table A.6 Specifically, while an initial conversion price of 40 euros yields a sus-
is preferred here, due to its contrasted nature that is likely to increase tainability score that necessarily implies using external power sources,
the clarity of the logic used. In particular, this scenario explores sen- it appears that increasing public financial incentives quickly offers
sibility along the price conversion sensibility and faith resistance axis. support to the environmentally virtuous group and consequently will
As can be seen, the range of values considered generates both highly ease tensions on the energy system.
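
The threshold mechanism that such an incentive acts upon can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function `price_conversion_flow`, its argument names and the sign convention of `offset` (a positive offset lowers the effective threshold) are hypothetical. Only the 40 euro conversion price and the 2.00E−02 price conversion ratio of the base configuration come from the text; the price and the group size are made up.

```python
def price_conversion_flow(group_size, price, threshold, rate, offset=0.0):
    """Individuals leaving a group in one day because the energy price
    exceeds their (possibly offset) conversion threshold.

    `offset` is a hypothetical policy lever: a positive value lowers the
    effective threshold, so the conversion is triggered at a lower price.
    """
    effective_threshold = threshold - offset
    if price <= effective_threshold:
        return 0.0  # price is tolerated, nobody changes group
    return rate * group_size  # a fixed share of the group converts


# Illustrative sweep over the policy offset (all inputs except the
# threshold and the rate are invented for the example).
group_size = 1000.0
price = 38.0
for offset in (0.0, 1.0, 2.0, 3.0, 4.0):
    flow = price_conversion_flow(
        group_size, price, threshold=40.0, rate=0.02, offset=offset
    )
    print(f"offset={offset:>4.1f}  daily flow={flow:6.1f}")
```

With these inputs the flow stays at zero until the offset brings the effective threshold below the current price, which is the kind of switch a policy sweep such as the one in Fig. 9 would probe.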


Fig. 8. Heatmap for the social tensions (COTC0) scenario. Specific starting points for subsequent policy tests are represented in white. (For interpretation of the references to
colour in this figure legend, the reader is referred to the web version of this article.)
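
The notion of a sustainability area used with these heatmaps can be made concrete with a small sketch. It assumes, hypothetically, that each heatmap is a grid of scores where higher means more sustainable, and that a cell counts as sustainable when its score reaches an acceptability threshold; the helper names, the toy grids and the threshold value are invented for illustration and are not taken from the paper.

```python
def sustainable_fraction(scores, threshold):
    """Share of state-space cells whose score reaches the threshold."""
    cells = [s for row in scores for s in row]
    return sum(1 for s in cells if s >= threshold) / len(cells)


def relative_area_gain(baseline_scores, candidate_scores, threshold):
    """Relative increase of the sustainable area of a candidate policy
    with respect to a baseline policy (0.5 means +50%)."""
    base = sustainable_fraction(baseline_scores, threshold)
    cand = sustainable_fraction(candidate_scores, threshold)
    if base == 0.0:
        raise ValueError("baseline has no sustainable cell; gain undefined")
    return (cand - base) / base


# Toy 3x3 score grids (made-up values) for a manual and an RL policy.
manual_scores = [[0.2, 0.4, 0.9], [0.1, 0.3, 0.8], [0.1, 0.2, 0.5]]
rl_scores = [[0.3, 0.7, 0.9], [0.2, 0.6, 0.9], [0.1, 0.4, 0.7]]
threshold = 0.6

gain = relative_area_gain(manual_scores, rl_scores, threshold)
print(f"sustainable area gain: {gain:+.0%}")
```

Expressing the comparison as a relative gain is one way to obtain percentages of the kind quoted in Section 7.1; the exact metric used by the authors is not specified here.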

Fig. 9. Evaluation of sustainability levels resulting from a subsidy policy that financially supports individuals by offsetting their price conversion threshold.

7.2.2. Energy tax usage and magnitude

From another point of view, local authorities could resort to tax strategies to try and convince individuals with the highest energy consumption regimen to adapt their behaviour to existing resources. Using the same starting point 𝑝0, this paragraph explores the effect of lowering the upper price conversion threshold. Concretely, it is the price above which individuals will start migrating towards groups with a lower consumption profile. As tax policy can be particularly complex and is definitely out of the scope of this work, Fig. 10 simplifies the setup and only illustrates the final price threshold.

Fig. 10. Evaluation of sustainability levels resulting from a tax policy to financially incite individuals to lessen their consumption by artificially increasing their price conversion threshold to more sober groups.

7.2.3. Local exploration

Consider another population configuration 𝑝1 = [0.8, 0.18] that presents a better resistance/inertia to price movement speed but displays an elevated loss of faith and conviction, with group A individuals being significantly absorbed by the 𝑁 group, that is, more energy-sober members moving to higher consumption ratios. While, for this scenario and the associated set of parameters, the more sustainable area is located in the upper-west corner, local sustainable islands can be observed in multiple locations in the state space. In particular, this paragraph focuses on the higher scoring zone that shares the same price sensitivity but has a different faith resistance value.

Concretely, by evaluating configurations with the lowest faith resistance values and in the close neighbourhood, with faith values comprised between 0.18 and 0.19, population dynamics as displayed in Fig. 11 can be found. As can be seen, some of these tests demonstrate high sustainability values. Globally, starting from 𝑝1, it may prove more energy and cost efficient to define public policies that would nudge the population towards these regions instead of trying to reach the upper area.

However, as can also be seen in Fig. 11, there is a high score volatility on the price sensitivity axis, which may make this approach impractical in the real world, where measurements will be averaged and where social constraints and inertia may not allow such accuracy. This is further illustrated in Fig. 12, where the complete faith range is explored. Specifically, this spectrum of conviction is evaluated for price sensitivity values belonging to the [0.18, 0.20] interval. While the shaded area, representing the extreme values obtained for each location, does have strong peaks denoting sustainability areas, the mean score indicates that the equilibrium is hard to attain, in contrast with higher faith values.

Fig. 11. Heatmap for the clash of the classes scenario. Specific starting points for subsequent policy tests are represented in white.

Fig. 12. Price sensitivity between 0.18 and 0.20.

8. Discussion

As presented in the previous sections, the proposed prospective framework can provide strong insights over the energy disparity between electricity production and demand, as well as general performance levels over a year that can be expected for a population, for a given storage system and management policy. While the current parameter ranges were essentially selected to enable exploration, in a more global view and with adequate measurements, the quantitative comparison of storage strategies provided in Section 7.1 can have remarkable implications. Indeed, the evaluation of various policies demonstrates that RL-based agents are likely to create an increased area of sustainability, which corresponds to state space configurations that yield energy needs compatible with the system performance. This implies that ensuring that the population remains within the boundaries of acceptable renewable energy demand will require less drastic public conventions, which are thus proportionally more likely to be adopted.

Nevertheless, in this current formalization, the system efficiency is evaluated and averaged over the whole time window, which may not be optimal. Indeed, this implies that the scenario parameters, such as the conviction strength of some individuals or their posture towards energy consumption, will not be affected by the conditions met during the year. However, it is highly likely that, should these conditions harden, the population parameters will evolve. More practically, if an initialization leads a population to extended phases of insufficient power, individuals' beliefs and/or permeability to other postures may change. Formally, this can be expressed by second order dynamics and will be explored in future works.

9. Conclusion

As the imperative need of mankind for clean, renewable, affordable energy increases, the inherent discontinuity in availability of such energy sources appears more clearly and strongly emphasizes how critical environmentally sober and conscious individual behaviours are. While these traits are spreading and are more broadly disseminated within the general population, there still exists a wide diversity of behavioural factors that must be accounted for when preparing large energy policies. As such, this work, after having established the relevance of reinforcement learning methods for energy storage management, specifically proposes an original flexible and modular backbone, leveraging RL-optimized storage strategy, to enable the modelling of multiple hypotheses and the analysis of their resulting unfolding. By connecting these multiple fields, this work contrasts with common methods that usually consider each topic separately and may overlook mutually beneficial insights.

As demonstrated in our examples, complex dynamics can be integrated and simulated, and the interpretation of these configuration outcomes endows users with the capacity to identify particular acceptability thresholds for a selected model variable and consequently plan to avoid these areas and their potentially undesirable consequences. Furthermore, through the comparison of diverse storage policies on a wide spectrum of scenarios, this work quantitatively illustrates the area of sustainability up-scaling, which relates to the severity and austerity of social measures needed to ensure that the energy consumption of the population depending on renewable energies stays within acceptable boundaries. Moreover, the introduced framework can be, as demonstrated, used to evaluate the effect of public policies, such as subsidies as financial incentives to adopt more responsible practices. Inversely, the impact of taxes can also be, to a certain extent, simulated. As was also shown, a global view of configuration scores in the state space can also drive more pragmatic decisions and identify equilibrium points that could be easier to reach.

In future works, these encouraging results will be extended along several axes. In a first phase, implementing a feedback and adaptation mechanism appears as an important step for improving the reliability of this framework. Conceptually, this initiative would bring second order dynamics into the population model, thus enabling individuals to react more rationally, in particular since this would allow the simulation of memory effects. Then, as the proposed work is conceived to support policy design, a remarkable addition would be to further develop the presented usage examples and provide a search-based approach to evaluate the least costly public trajectory to bring a given point in the state space into the sustainability area. Specifically, by implementing movement costs in the state space, it becomes possible to identify which parameters are the most efficient to nudge a population lying below the acceptability threshold, for a given storage policy and renewable energy production, back to a configuration with higher guarantees of performance. By integrating real-world data, this approach could ultimately be an important asset in policy design and facilitate acceptance of energy transition measures. Finally, to address practical concerns dedicated to usage in real configurations, additional data could be collected to correctly fit the model, identify realistic parameters as well as increase simulation variety by including, for instance, different initial group distributions.

Acknowledgement

This work has been sponsored by Akkodis and the Adecco group, France.

Appendix. Scenario parameters

This section presents the various sets of parameters used for the multiple scenarios introduced in this work. In practice, each configuration is derived from the base configuration (see Table A.1) and only modified values are denoted. Similarly, for each subsequent scenario, only altered parameters with respect to the initial scenario are detailed. For instance, in Table A.3, only values differing from Table A.2 are given (see Tables A.4, A.5 and A.7–A.9).

Table A.1
Base configuration.
Parameters                     Value
Faith: A->N                    3.00E−05
Talk: A->N                     1.00E−05
Talk: N->A                     5.00E−05
Talk: P->A                     6.00E−05
Talk: A->P                     4.00E−05
Talk: N->P                     6.00E−05
Talk: P->N                     4.00E−05
Price: N upwards               6.50E+01
Price: P upwards               6.00E+01
Price conversion ratio         2.00E−02
Price: P downwards             3.50E+01
Price: A downwards             4.50E+01

Table A.2
MC0 configuration.
Parameters                     Value
Talk: A->P - start             1.00E−06
Talk: A->P - end               1.00E−04
Talk: P->N - start             1.00E−06
Talk: P->N - end               1.00E−04
Price: N upwards               6.50E+01
Price: P upwards               6.00E+01
Price: P downwards             5.00E+01
Price: A downwards             3.50E+01
Price conversion ratio         3.00E−02
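
The derivation rule described in this appendix, where each table only lists the values that differ from its parent configuration, can be sketched as a chain of dictionary overrides. The snippet below is a minimal illustration rather than the authors' tooling: the names `BASE`, `MC0_OVERRIDES`, `MC1_OVERRIDES` and `derive` are hypothetical, and the numerical values are copied from Tables A.1, A.2 and A.3. How the simulator consumes the "start"/"end" entries is not specified here, so they are simply kept as separate keys.

```python
# Table A.1: base configuration.
BASE = {
    "Faith: A->N": 3.00e-05,
    "Talk: A->N": 1.00e-05,
    "Talk: N->A": 5.00e-05,
    "Talk: P->A": 6.00e-05,
    "Talk: A->P": 4.00e-05,
    "Talk: N->P": 6.00e-05,
    "Talk: P->N": 4.00e-05,
    "Price: N upwards": 6.50e+01,
    "Price: P upwards": 6.00e+01,
    "Price conversion ratio": 2.00e-02,
    "Price: P downwards": 3.50e+01,
    "Price: A downwards": 4.50e+01,
}

# Table A.2: values that MC0 changes with respect to the base.
MC0_OVERRIDES = {
    "Talk: A->P - start": 1.00e-06,
    "Talk: A->P - end": 1.00e-04,
    "Talk: P->N - start": 1.00e-06,
    "Talk: P->N - end": 1.00e-04,
    "Price: N upwards": 6.50e+01,
    "Price: P upwards": 6.00e+01,
    "Price: P downwards": 5.00e+01,
    "Price: A downwards": 3.50e+01,
    "Price conversion ratio": 3.00e-02,
}

# Table A.3: values that MC1 changes with respect to MC0.
MC1_OVERRIDES = {
    "Price: N upwards": 6.00e+01,
    "Price: P upwards": 7.00e+01,
    "Price: P downwards": 4.00e+01,
    "Price: A downwards": 3.50e+01,
}


def derive(parent, overrides):
    """Return a new configuration: parent values, replaced where overridden."""
    return {**parent, **overrides}


mc0 = derive(BASE, MC0_OVERRIDES)
mc1 = derive(mc0, MC1_OVERRIDES)

print(mc1["Price: P upwards"])        # overridden by Table A.3
print(mc1["Price conversion ratio"])  # inherited from Table A.2
print(mc1["Faith: A->N"])             # inherited from Table A.1
```

Building each variant from its parent rather than from the base keeps the tables short and makes the lineage of every value explicit.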


Table A.3
MC1 configuration.
Parameters                     Value
Price: N upwards               6.00E+01
Price: P upwards               7.00E+01
Price: P downwards             4.00E+01
Price: A downwards             3.50E+01

Table A.4
EC0 configuration.
Parameters                     Value
Talk: A->P - start             1.00E−06
Talk: A->P - end               1.00E−04
Talk: P->N - start             1.00E−06
Talk: P->N - end               1.00E−04
Price: N upwards               6.50E+01
Price: P upwards               6.00E+01
Price: P downwards             5.00E+01
Price: A downwards             3.50E+01

Table A.5
EC1 configuration.
Parameters                     Value
Price conversion ratio         3.00E−02

Table A.6
COTC0 configuration.
Parameters                     Value
Price conversion ratio - start 1.00E−03
Price conversion ratio - end   3.00E−02
Faith: A->N - start            3.00E−05
Faith: A->N - end              1.00E−02
Price: N upwards               6.00E+01
Price: P upwards               6.00E+01
Price: P downwards             4.00E+01
Price: A downwards             4.00E+01

Table A.7
COTC1 configuration.
Parameters                     Value
Price: N upwards               6.50E+01
Price: P upwards               4.00E+01
Price: P downwards             5.00E+01
Price: A downwards             3.50E+01

Table A.8
WIC0 configuration.
Parameters                     Value
Price conversion ratio - start 5.00E−03
Price conversion ratio - end   5.00E−02
Faith: A->N - start            3.00E−05
Faith: A->N - end              1.00E−02
Talk: N->A                     8.00E−05
Talk: N->P                     1.00E−04
Price: N upwards               7.50E+01
Price: P upwards               6.00E+01
Price: P downwards             4.00E+01
Price: A downwards             5.00E+01

Table A.9
WIC1 configuration.
Parameters                     Value
Price: N upwards               7.50E+01
Price: P upwards               6.50E+01
Price: P downwards             3.00E+01
Price: A downwards             3.00E+01