Professional Documents
Culture Documents
Energy and AI: Mehdi Mounsif, Fabien Medard
Energy and AI: Mehdi Mounsif, Fabien Medard
Energy and AI
journal homepage: www.elsevier.com/locate/egyai
Keywords: In an increasingly electrified and connected world, renewable energy production and robust distribution as well
Energy storage and management as sobriety paradigm, both for the individual and the society, will most likely play a central role regarding
Reinforcement learning global systems stability. Consequently, while being able to conceive efficient storage systems coupled with
Population dynamics
robust energy management strategies present significant interests, a number of related studies often consider
Optimization
the human behaviour factor separately. While not decisive in large industrial factories, human demeanor impact
cannot be overlooked in residential areas. As such, this work proposes an innovative and flexible dynamic
population model, inspired from epidemiological methods, that allows simulation of a vast spectrum of social
scenarios. By pairing this formalization with a smart energy management strategy, a complete framework is
proposed. In particular, beyond the theoretical identification of sustainable parameters in a wide diversity of
configurations, our experiments demonstrate the relevance of reinforcement learning agents as efficient energy
management policies. Depending on the scenario, the trained agent enables an increase of the sustainability
areas over baseline strategies up to 200%, thus hinting at ultimately softer societal impact.
∗ Corresponding author.
E-mail address: mehdi.mounsif@akkodis.com (M. Mounsif).
https://doi.org/10.1016/j.egyai.2023.100242
Received 29 November 2022; Received in revised form 25 January 2023; Accepted 14 February 2023
Available online 23 February 2023
2666-5468/© 2023 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-
nc-nd/4.0/).
M. Mounsif and F. Medard Energy and AI 13 (2023) 100242
2
M. Mounsif and F. Medard Energy and AI 13 (2023) 100242
Fig. 1. This work considers a hypothetical islanded population, involving various groups and paradigms, connected to a renewable energy production source through a storage
device and its associated policy.
Fig. 2. S3SAME pairs a dynamic population model with a RL-based EMS strategy to evaluate the effects of human behaviour on system performances and identify scenario
compatibility with renewable energy sources.
data, plan and formulate local measures to ensure safety as well as RL-driven, within the EMS. However, these views are very rarely stud-
possible [32,33]. In [34], an individual-level model accounting for ied and used in conjunction, leaving out important optimization areas
geographic and demographic particularities is proposed and used to and potentially critical insights, which consequently motivates the
estimate probable parameters governing the epidemic spreading. In proposed framework and enables flexible modelling and exploration of
particular, this approached enabled the simulation of the effect of local population trajectories in the context of renewable energy production
contact restrictions and suggested that control strategies should be and management.
per-population tailored, further illustrating the relevance of population
models. 3. Dynamic population modelling
On another axis, deep learning methods are transversal, with appli-
cation ranging from complex visual tasks [35], protein folding [36] to This section presents the main modelling hypothesis involved in the
board games [37] or remarkable natural language manipulation [38, scenario creation and study. It essentially introduces the population
39]. In particular, finding relevant strategies in complex, dynamic and model paradigm, the different population group involved in the dynam-
reactive environments has mostly been considered in reinforcement ics offers details on the transition mechanisms that motivate individual
learning (RL). In [40], the author demonstrate that RL agents are able transfer between groups.
to confine and stabilize fusion plasma via tokamak control in a nuclear
reactor, or control robotic hand is trained to solve a Rubik’s cube, as 3.1. Population modelling
reported in [41]. However, RL agents are known to be brittle [42,43] as
they are essentially trained and evaluated in stationary environments, As mentioned, total energy consumption is heavily influenced by
a feature shared with some of the works dedicated to micro-grid study. the population behaviours which, while often presented as an averaged
In particular, reward-optimized agents are particularly dependent on behaviour, are significantly more diverse and present a fine-grained
the independent identically distributed ( i.i.d.) hypothesis [44,45] and sophistication. Consequently, the framework introduced in this paper is
generally report drastic performance loss when deployed outside of inspired from epidemiological modelling and implements a simplified,
training distribution. Indeed, as demonstrated in [46], many recent although not unrealistic, population model that involves three groups of
research attempts develop method to increase agents ability to manage population, each exhibiting a specific consumption factor. In particular,
unseen configurations, with encouraging but limited results so far. the following paradigms, are represented:
In a global approach, it is possible to notice that while these very
diverse fields, such as micro-grid design and population modelling, • Environmentally active people (A): group that is the most dedi-
could mutually benefit from concepts introduced in the others. Indeed, cated to reducing its energy consumption
population modelling could enable the accurate estimation of the re- • Environmentally passive (P): intermediate group
quired work capacity of a micro-grids, based on the projected evolution • Neutral (N): inactive group, regarding climate and energy con-
of local demand and its sustainability using a smart approach, possibly sumption issues
3
M. Mounsif and F. Medard Energy and AI 13 (2023) 100242
Fig. 3. Population dynamics modelling: individuals are sensitive to different transition mechanisms that incite them to change their paradigm and adopt alternative behavioural
practices depending on the current context.
3.2. Transition mechanisms The recent surge of interest in reinforcement learning methods has
led to the introduction of multiple paradigms [49–51]. Concretely, this
As schematically shown in Fig. 3, multiple mechanisms are liable particular work relies on Proximal Policy Optimization (PPO) [52],
to incite people to update their behaviour and consequently change an on-policy, actor-critic with trust-region that has demonstrated rel-
groups. More specifically, it is possible to discern between internal evant properties regarding the complexity and features of our envi-
group mechanisms that result from individual encounters and external ronment [40]. In particular, the PPO algorithm aims at scaling action
factors, which are both respectively detailed in the following para- selection probabilities accordingly to the advantage function of a state–
graphs. action pair. This can be formally expressed through the minimization
of the following objective:
𝜋𝜃
3.2.1. Group dynamic factors 𝐿(𝑠, 𝑎, 𝜃old , 𝜃𝑘 ) = min(𝑟𝑡 (𝜃𝑘 )𝐴 old (𝑠, 𝑎), clip(𝑟𝑡 (𝜃), 1 − 𝜀, 1 + 𝜀)𝐴𝜋𝜃𝑘 (𝑠, 𝑎)) (1)
The proposed simulation framework relies on the idea that interac-
tion between individuals belonging to different groups will necessarily where 𝑟𝑡 (𝜃𝑘 ) is the ratio of the probabilities between the previous policy
involve an exchange of ideas and world views, thus resulting in some and the current one for a given pair of state and action. This ratio can
cases in group conversion for one of the participants. Additionally, be expressed as:
inner pessimistic mechanisms are considered. These express the concept 𝜋𝜃𝑘 (𝑎|𝑠)
𝜌𝑡 (𝜃𝑘 ) = (2)
that individuals can reach a level of disappointment or discouragement 𝜋𝜃old (𝑠, 𝑎)
that will push them towards the closest less energy-sober group. In this
configuration, the neutral group is not affected and can be seen as an 5. RL agent as an efficient energy storage policy
absorption state.
The economic and industrial context towards which this work is
3.2.2. External factors directed is heavily and actively investigated, with multiple actors of
Although a wide range of external factors and their effects could be diverse nature tackling the challenges of energy management and stor-
considered as influential on the proposed modelling of population dy- age. Nevertheless, the particularly thin granularity of the approaches
namics, the current approach focuses specifically on price fluctuations. and datasets makes the creation of a common benchmark challenging
In particular, emulating the current price volatility observed on in- and, to the best of our knowledge, no unified and compatible setup
ternational energy market and the corresponding consumer behaviour is publicly available to evaluate the methods presented in this work
adaptation, this work introduces price sensitivity thresholds that, once against other state of the art results. In this view, and since rigorous
crossed, will initiate individual migration from one group to a more comparison are paramount to establish the interest and the relevance
sober one (in a highly priced energy configuration) or to the closest of the presented approaches, the following paragraphs introduces the
less sober group (when energy prices are affordable). paradigm used to engineer baseline strategies. Concretely, these repre-
sent the best policy the authors could possibly create without relying
on reinforcement learning algorithms, which is then evaluated against
4. Reinforcement learning background
the RL-trained policy. Then, an analysis of the scaling factors for the
complete spectrum of population distributions is discussed.
As evoked in the above sections, an important contribution of this
work lies in pairing a smart, RL-based EMS with the population model 5.1. Environment for policy generation
introduced above. This section consequently provides the basic theo-
retical elements required to apprehend reinforcement learning. Indeed,
Prior to the baseline introduction, this section details the fundamen-
while multiple policy optimization strategies could be considered in the
tal MDP elements for training the reinforcement learning based storage
scope of this topic, previous experiences [13,47] on relatable challenges policy, central in our approach. As explained in Section 4, reinforce-
motivates the use of reinforcement learning approaches [48]. ment learning environments rely on an observation space, an action
Specifically, the goal of reinforcement learning is to produce a pol- space, the reward function and the transition function. Nevertheless,
icy that maximizes a metric of success for a given task. Indeed, through since this work considers a deterministic process, similarly to [13], the
interaction with the considered environment, the policy will optimize transition function is not addressed. Regarding the remaining variables,
the sequence of selected actions based on observations it receives to they are formally defined as :
generate trajectories that maximizes the cumulative rewards given by
the environment. Formally, the environment is most frequently defined • The state space provides an observation which is the input used by
as a Markov Decision Process (MDP), composed of: the agent to select an action 𝑠 = concat(𝑠historic , 𝑠storage , 𝑠latent ) ∈
R𝐻+1+𝐿 with
• States 𝑠 ∈ 𝑆
• Actions 𝑎 ∈ 𝐴 – 𝑠historic ∈ 𝑁 production over consumption historic ratio for
• Reward function 𝑟𝑡 = 𝑓 (𝑠𝑡 , 𝑎𝑡 ) the last 𝑁 steps (this work proposes 𝑛 = 8)
• Transition probability function: 𝑝(𝑠𝑡+1 |𝑠𝑡 , 𝑎𝑡 ) – 𝑠storage ∈ 1 the current normalized storage state
4
M. Mounsif and F. Medard Energy and AI 13 (2023) 100242
– 𝑠latent ∈ 𝐿 a latent representation provided the predictive such advances concepts. In practice, the proposed actor-critic is trained
model of dimension 𝐿 (such as 𝐿 = 16 in this work) on historical European electricity prices and averaged individual con-
sumption between 2015 and 2018, year 2019 being used as the testing
• The action space is one-dimensional and continuous: 𝑎𝑡 ∈ [−1, 1]
environment. During evaluation, each policy is run over a two hundreds
and controls the entirety of the storage and delivery process:
episodes and the number of hours during which energy can be provided
– Specifically, for positive actions, the system stores the in- is recorded per episode. As presented in Fig. 4, it clearly appears that
coming energy scaled by the action and sends the rest to the RL-trained policy (shown in light blue) exhibits superior behaviour
the grid for consumer usage. Formally: in the sense that it is able to provide power to the population signifi-
{ cantly more frequently than the manually-crafted baseline (shown in
𝑒grid = (1 − 𝑎𝑡 ).𝑝turb red). From another point of view, given the important volatility in-
𝑒storage = min[𝑒storage + 𝑎𝑡 × 𝑝turb , 𝑒max storage ] herent to renewable energies, Fig. 5 shows the per-episode distribution
of energy availability when direct production is accounted for. As can
where 𝑒grid , 𝑒storage , 𝑒max storage , 𝑝turb , 𝑎𝑡 respectively rep-
be observed, the trained agent again displays a more efficient storage
resent the energy sent to the grid, the current available
strategy as more hours can be covered with the same energy load. These
energy, the maximal storage volume, the production during
results increase the legitimacy of RL-driven methods and subsequent
the last time step and finally, the agent action.
agents to be used in downstream tasks.
– Alternatively, negative actions send the total energy col-
lected during the last time step to the grid and adds the
5.4. Storage policy performances as a function of the general population
maximal energy output scaled by the action. Formally, this
consumption profile
can be expressed by:
{
𝑒grid = 𝑝turb + min[𝑎𝑡 ∗ 𝛥max , 𝑒storage ] Having established that the RL trained agent is the most efficient
𝑒storage = max[𝑒storage − 𝑒grid , 0] policy available in the energy storage setup considered in this work, it
follows that its performance level and limitations must be clearly identi-
where 𝛥max represents the maximal energy volume that can fied to provide a strong and robust comprehension to enable reasonable
be withdrawn in an environment time step, which is highly decision making. Specifically, our policy is trained under a default
dependent on the physical system. consumption profile (i.e: 𝛼 = 1) and the setup described in this section
• Finally, the proposed reward function is designed to incite the is designed to evaluate the evolution of its performance for a wide range
agent to send an energy volume as close as possible to the next of consumption profiles. Indeed, Fig. 6 displays the system performance
consumption. Formally, this can be expressed by 𝑟𝑡 = 𝑜𝑡 + 𝑜𝑠 + 𝑜𝑐 , as a function of the mean consumer profile. Specifically, for each value
where: in the explored consumer factor range, a set of one thousand episodes
are run by the policy and for each one the performance (number of
– primary objective: energy sent to the grid must be superior hour with power over number of hours in total) is stored. Given the
to the demand number of episodes considered, the significant spread of performance
{
1 if 𝑒grid ≥ 𝑒consumption values can be explained by the wide diversity of production condition.
𝑜𝑡 = Nevertheless, it is possible to observe, as expected, a strong negative
0 otherwise
correlation between system performance and mean consumer profile
𝑒storage and its associated consumption factor, also represented by the mean
– 𝑜𝑠 = 𝑒max storage
incites the agent to fill the storage system
performance value. For additional confirmation and perspective on
– 𝑜𝑐 = exp 𝛼 × (𝑒grid − 𝑒consumption ) accuracy related bonus agent reliability, the same visualization for the baseline policy is also
points with a scaling value displayed (in red).
For an externally defined performance level, given specific country
5.2. Heuristics-designed baseline storage strategy requirements, dedicated energy sources, multiple other factors and with
respect to the storage policy, this global estimated slope can be relied
Concretely, the proposed manual strategy is a reactive one, pa- on to define an 𝛼 threshold above which the mean system performance
rameterized by 𝛾𝑚 ∈ R2𝑚 . Technically, it uses the last 𝑚 historic cannot be guaranteed. This 𝛼 value can be used to evaluate social
production over consumption ratios as input and has an additional bias scenarios, as presented in the following paragraphs.
parameter to compute the scalar action subsequently used by the MDP,
as introduced in Section 5.1. 6. Scenarios definition and evaluation
Formally, the manual policy action 𝑎𝑡,𝛾 can be obtained using fol-
lowing relation presented in Eq. (3). 6.1. Population dynamics modelling
𝑣⃖⃗ =𝑖∈[1..𝑚] [𝛾𝑖 − 𝑅(𝑛−𝑖) ] ∈ R𝑚
(3) As explained in Section 3.1 and referred to multiple times after-
𝑎𝑡 =clip(⃖𝛾⃗𝑖∈[𝑚..2 𝑚] ⋅ 𝑣⃖⃗, −1., 1.) ∈ R
wards, this approach relies heavily on the formalization of a population
where 𝑛 is the observation dimensions, 𝑅𝑛−𝑖 thus being the 𝑖th most dynamic model that enables simulation of individual distribution be-
recent production over consumption ratio. In practice, several manual tween the three introduced group profiles over a given time horizon.
policies are constructed, with 𝑚 ∈ {1 … 5} and their respective pa- Building on the concepts frequently used in epidemiological mod-
rameters optimized in a gradient ascent sense to maximize the reward elling [10,32,53] such as the SIR model, the following differential
presented in Section 5.1. Finally, we find that the policy with 𝑚 = 2 to equations are proposed to represent respectively the active (Eq. (4)),
present the best average results and is the one consequently kept for passive (Eq. (5)) and neutral (Eq. (6)) group evolution:
comparison against the reinforcement method. 𝑑𝐴
= − 𝐹𝐴→𝑁 × 𝐴
𝑑𝑡
5.3. RL agent performances evaluation + (𝑇𝑁→𝐴 − 𝑇𝐴→𝑁 ) × 𝐴 × 𝑁
+ (𝑇𝑃 →𝐴 − 𝑇𝐴→𝑃 ) × 𝐴 × 𝑃 (4)
Similarly to [13], an experimental setup and specific metrics are
designed to evaluate our trained policy performances with respect to + 𝐻upgrade (price, 𝜀𝑃 ) × 𝑃 × 𝐻speed
a manually crafted baseline in order to assess the relevance of using − 𝐻downgrade (price, 𝜀𝐴 ) × 𝐴 × 𝐻speed
5
M. Mounsif and F. Medard Energy and AI 13 (2023) 100242
Fig. 4. Distribution of normalized policy performance per episode. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version
of this article.)
Fig. 6. Mean performance level as a function of consumption profile for a given RL-trained storage policy. (For interpretation of the references to colour in this figure legend,
the reader is referred to the web version of this article.)
6
M. Mounsif and F. Medard Energy and AI 13 (2023) 100242
7. Results: Scenarios and associated parameters While Section 5.3 has shown that RL-based storage strategy offer
higher performances and returns than proposed baselines, previous
In order to illustrate the interest of this work, this section introduces experiments were considered in a static configuration and did not offer
several scenarios, their associated parameters and possible exploration a more global view, connected to behavioural parameters. This makes
7
M. Mounsif and F. Medard Energy and AI 13 (2023) 100242
Fig. 7. Comparison of state space policy scores for multiple scenarios. (For interpretation of the references to colour in this figure legend, the reader is referred to the web
version of this article.)
the evaluation of their impact in the context of dynamic population sustainable configurations (yellow) as well as strongly energy-hungry
non-trivial and may not necessarily encourage their adoption. This ones (deep blue) (see Fig. 8).
section consequently provides a comparison of the resulting state space
scores for the baseline and RL-based storage policies. 7.2.1. Subsidizing energy price
Fig. 7 thus illustrates the performance difference, for all considered Consider a population group 𝑝0 that has been positioned (through
scenarios (four with a variant each), between the two methods, where polls, for instance) at location 1, as shown in Table A.6. This group has
state space scores in line 1 and 3 are produces by the RL agent and an intermediate conviction level but can be perceived as more reactive
line 2 and 4 by the manual policy. As can be observed, sustainability to price movements than the mean population and can be numerically
(yellow regions) using RL covers significantly more state space and described by 𝑝0 = [0.5, 0.7], relative to each axis range. Despite the
ranges between reasonable increases (4% in the ER1 scenario) to more score increase provided by the usage of the RL-based storage strategy
than 200%. (+11%), this configuration does not belong to the sustainability area.
In the hypothesis that additional power sources are not available,
7.2. Specific case study multiple course of action can still be considered. Among them, the price
conversion threshold is an important aspect that can drive individual
As shown in Section 7.1, relying on policies trained in the rein- decisions regarding energy usage. Specifically, in this configuration,
forcement learning paradigm proves advantageous in a wide variety of local authorities could provide financial incentives to this variable in
scenarios as these methods are likely to significantly increase the state order to increase sustainability. For instance, Fig. 9 shows the resulting
space sustainability area. Building on this confirmation, this section evolution levels of highly energetically sober individuals and associated
proposes various practical example of usage of the proposed simulation configuration scores when the price conversion threshold is subsidized.
tool. While applicable to any scenario, the COTC0, shown in Table A.6 Specifically, while an initial conversion price of 40 euros yields a sus-
is preferred here, due to its contrasted nature that is likely to increase tainability score that necessarily implies using external power sources,
the clarity of the logic used. In particular, this scenario explores sen- it appears that increasing public financial incentives quickly offers
sibility along the price conversion sensibility and faith resistance axis. support to the environmentally virtuous group and consequently will
As can be seen, the range of values considered generates both highly ease tensions on the energy system.
8
M. Mounsif and F. Medard Energy and AI 13 (2023) 100242
Fig. 8. Heatmap for the social tensions (COTC0) scenario. Specific starting points for subsequent policy tests are represented in white. (For interpretation of the references to
colour in this figure legend, the reader is referred to the web version of this article.)
Fig. 9. Evaluation of sustainability levels resulting of a subsidiary policy to financially support individuals by offsetting their price conversion threshold.
7.2.2. Energy tax usage and magnitude located in the upper-west corner, local sustainable islands can observed
From another point of view, local authorities could resort to tax in multiple location in the state space. In particular, this paragraph
strategies to try and convince individuals with the highest energy focuses on the higher scoring zone that shares the same price sensibility
consumption regimen to adapt their behaviour to existing resources. but has a different faith resistance value.
Using the same starting point 𝑝0 , this paragraph explores the effect Concretely, by evaluating configurations with the lowest faith resis-
of lowering the upper price conversion threshold. Concretely, it is the tance values and in the close neighbourhood with faith values com-
price above which individuals will start migrating towards groups with prised between 0.18 and 0.19, population dynamics as displayed in
lower consumption profile. As tax policy can be particularly complex Fig. 11 can be found. As can be seen, some of these tests demonstrate
and is definitely out of the scope of this work, Fig. 10 simplifies the high sustainability value. Globally, starting from 𝑝1 , it may prove more
setup and simply only illustrates the final price threshold. energy and cost efficient to define public policies that would nudge the
population towards these regions instead of trying to reach the upper
7.2.3. Local exploration area.
Consider another population configuration 𝑝1 = [0.8, 0.18] that However, as can also be seen in Fig. 11, there is a high score
presents a better resistance/inertia to price movement speed but dis- volatility on the price sensitivity axis which may make this approach
plays an elevated loss of faith and conviction with group A individuals unpractical in the real-world where measurements will be averaged
being significantly absorbed by the 𝑁 group, that is more energy sober and where social constraints and inertia may not allow such accuracy.
member moving to higher consumption ratios. While, for this scenario This is further illustrated in Fig. 12 where the complete faith range
and the associated set of parameters, the more sustainable area is is explored. Specifically, this spectrum of conviction is evaluated for
9
M. Mounsif and F. Medard Energy and AI 13 (2023) 100242
Fig. 10. Evaluation of sustainability levels resulting of a tax policy to financially incite individuals to lessen their consumption by artificially increasing their price conversion
threshold to more sober groups.
Fig. 11. Heatmap for the clash of the classes scenario. Specific starting points for subsequent policy tests are represented in white.
10
M. Mounsif and F. Medard Energy and AI 13 (2023) 100242
price sensitivity values belonging to the [0.18, 0.20] interval. While Table A.1
Base configuration.
the shaded area, representing the extreme values obtained for each
location, does have strong peaks denoting sustainability areas, the Parameters
mean score indicates that the equilibrium is hard to attain, in contrasts Faith: A->N 3.00E−05
Talk: A->N 1.00E−05
with higher faith values.
Talk: N->A 5.00E−05
Talk: P->A 6.00E−05
8. Discussion Talk: A->P 4.00E−05
Talk: N->P 6.00E−05
As presented in the previous sections, the proposed prospective Talk: P->N 4.00E−05
Price: N upwards 6.50E+01
framework can provide strong insights over the energy disparity be-
Price: P upwards 6.00E+01
tween electricity production and demand, as well as general perfor- Price conversion ratio 2.00E−02
mance levels over a year that can be expected for a population, for Price: P downwards 3.50E+01
a given storage system and management policy. While the current Price: A downwards 4.50E+01
parameter ranges were essentially selected to enable exploration, in a
more global view and with adequate measurements, the quantitative Table A.2
comparison of storage strategies provided in Section 7.1 can have MC0 configuration.
remarkable implications. Indeed, the evaluation of various policies Parameters
demonstrate that RL-based agents are likely to create increased area Talk: A->P- start 1.00E−06
of sustainability, which corresponds to state space configurations that Talk: A->P- end 1.00E−04
yield energy needs compatible with the system performance. This im- Talk: P->N- start 1.00E−06
plies that ensuring that the population remains within the boundaries Talk: P->N- end 1.00E−04
Price: N upwards 6.50E+01
of acceptable renewable energy demand will requires less drastic public Price: P upwards 6.00E+01
conventions, thus proportionally more susceptible to be adopted. Price: P downwards 5.00E+01
Nevertheless, in this current formalization, the system efficiency is Price: A downwards 3.50E+01
evaluated and averaged over the whole time window, which may not Price conversion ratio 3.00E−02
be optimal. Indeed, this implies that the scenario parameters, such as
the conviction strength of some individuals or their posture towards
energy consumption, will not be affected by the conditions met during also shown, a global view of configuration scores in the state space
the year. However, it is highly likely that, should these conditions can also drive more pragmatic decisions and identify equilibrium points
harden, the population parameters will evolve. More practically, if that could be easier to reach.
an initialization leads a population to extended phases of insufficient In future works, these encouraging results will be extended along
power, individuals beliefs and/or permeability to other posture may several axis: in a first phase, implementing a feedback and adaptation
change. Formally, this can be expressed by second order dynamics and mechanism appears as an important step for improving this framework
will be explored in future works. reliability. Conceptually, this initiative would bring second order dy-
namics in the population model, thus enabling individuals to react more
9. Conclusion rationally, in particular since this would allow to simulate memory
effects. Then, as the proposed work is conceived to support policy de-
As the imperative need of mankind for clean, renewable, affordable sign, an remarkable addition would be to further develop the presented
energy increases, the inherent discontinuity in availability of such usage examples and provide a search-based approach to evaluate the
energy sources appears more clearly and strongly emphasizes how least costly public trajectory to bring a given point in the state space in
critical environmentally sober and conscious individual behaviour are. the sustainability area. Specifically, by implementing movement costs
While these traits are spreading and are more broadly disseminated in the state space, it becomes possible to identify which parameters
within the general population, there still exist a wide diversity of are the most efficient to nudge a population below the acceptability
behavioural factors that must be accounted for when preparing large threshold, for a given storage policy and renewable energy production,
energy policies. As such, this work, after having established the rele- back to a configuration with higher guarantees of performance. By inte-
vance of reinforcement learning methods for energy storage manage- grating real-world data, this approach could ultimately be an important
ment, specifically proposes an original flexible and modular backbone, asset in policy design and facilitate acceptance of energy transition
leveraging RL-optimized storage strategy, to enable the modelling of measures. Finally, to address practical concerns dedicated to usage in
multiple hypothesis and the analysis of their resulting unfolding. By real configuration, additional data could be collected to correctly fit
connecting these multiple fields, this work contrasts with common the model, identify realistic parameters as well as increase simulation
methods that usually consider each topic separately and may overlook variety by including, for instance, different initial group distributions.
mutually beneficial insights.
Acknowledgement
As demonstrated in our examples, complex dynamics can be in-
tegrated and simulated and the interpretation of these configurations
This work has been sponsored by Akkodis and the Adecco group,
outcomes endows users with the capacity to identify particular accept-
France.
ability thresholds for a selected model variable and consequently plan
to avoid these areas and their potentially undesirable consequences. Appendix. Scenario parameters
Furthermore, through the comparison of diverse storage policies on a
wide spectrum of scenarios, this work quantitatively illustrate the area This section presents the various set of parameters used for the
of sustainability up-scaling which relates the severity and austerity of multiple scenarios introduced in this work. In practice, each configu-
social measures needed to ensure that the energy consumption of the ration is derived from the base configuration, see Table A.1 and only
population depending on renewable energies stays within acceptable modified values are denoted. Similarly, for each subsequent scenario,
boundaries. Moreover, the introduced framework can be, as demon- only altered parameters with respect to the initial scenario are detailed.
strated, used to evaluate the effect of public policies, such as subsidies For instance, in Table A.3, only values differing from Table A.2 are
as financial incentives to adopt more responsible practices. Inversely, given (see Tables A.4, A.5 and A.7–A.9).
the impact of taxes can also be, to a certain extend, simulated. As was
11
M. Mounsif and F. Medard Energy and AI 13 (2023) 100242
12
M. Mounsif and F. Medard Energy and AI 13 (2023) 100242
[22] Ahmad Khan A, Naeem M, Iqbal M, Qaisar S, Anpalagan A. A compendium [36] Jumper JM, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O,
of optimization objectives, constraints, tools and algorithms for energy man- et al. Highly accurate protein structure prediction with AlphaFold. Nature
agement in microgrids. Renew Sustain Energy Rev 2016;58:1664–83. http: 2021;596:583–9.
//dx.doi.org/10.1016/j.rser.2015.12.259, URL https://www.sciencedirect.com/ [37] Silver D, Huang A, Maddison C, Guez A, Sifre L, van den Driessche G, et al.
science/article/pii/S1364032115016421. Mastering the game of go with deep neural networks and tree search. J Nat
[23] Guo L, Liu W, Li X, Liu Y, Jiao B, Wang W, et al. Energy management system 2015.
for stand-alone wind-powered-desalination microgrid. IEEE Trans Smart Grid [38] Wei J, Wang X, Schuurmans D, Bosma M, Chi EH, Le Q, et al. Chain of
2016;7(2):1079–87. http://dx.doi.org/10.1109/TSG.2014.2377374. thought prompting elicits reasoning in large language models. 2022, CoRR
[24] Kyriakarakos G, Dounis A, Arvanitis K, Papadakis G. A fuzzy logic energy man- abs/2201.11903, arXiv:2201.11903, URL https://arxiv.org/abs/2201.11903.
agement system for polygeneration microgrids. Renew Energy 2012;41:315–27. [39] Rubin O, Herzig J, Berant J. Learning to retrieve prompts for in-context
http://dx.doi.org/10.1016/j.renene.2011.11.019. learning. 2021, CoRR abs/2112.08633, arXiv:2112.08633, URL https://arxiv.org/
[25] Taha MS, Mohamed YA-RI. Robust MPC-based energy management system of a abs/2112.08633.
hybrid energy source for remote communities. In: 2016 IEEE electrical power [40] Degrave J, Felici F, Buchli J, Neunert M, Tracey B, Carpanese F, et al.
and energy conference. 2016, p. 1–6. Magnetic control of tokamak plasmas through deep reinforcement learning.
[26] Eghtedarpour N, Farjah E. Power control and management in a hybrid AC/DC Nature 2022;22(1):79–86. http://dx.doi.org/10.1038/s41586-021-04301-9.
microgrid. IEEE Trans Smart Grid 2014;5(3):1494–505. http://dx.doi.org/10. [41] OpenAI, Akkaya I, Andrychowicz M, Chociej M, Litwin M, McGrew B, et al.
1109/TSG.2013.2294275. Solving rubik’s cube with a robot hand. 2019.
[27] Mokheimer EM, Sahin AZ, Al-Sharafi A, Ali AI. Modeling and optimization of [42] Sahni H. Deep reinforcement learning doesn’t work yet. 2018, https:
hybrid wind–solar-powered reverse osmosis water desalination system in Saudi //himanshusahni.github.io/2018/02/23/reinforcement-learning-never-
Arabia. Energy Convers Manage 2013;75:86–97. http://dx.doi.org/10.1016/j. worked.html.
enconman.2013.06.002, URL https://www.sciencedirect.com/science/article/pii/ [43] Eysenbach B, Gupta A, Ibarz J, Levine S. Diversity is all you need: learning
S0196890413002987. skills without a reward function. In: International conference on representation
[28] Artzrouni M. Mathematical demography. In: Kempf-Leonard K, editor. Encyclo- learning. 2019.
pedia of social measurement. New York: Elsevier; 2005, p. 641–51. http:// [44] Mahajan D, Tople S, Sharma A. Domain generalization using causal matching.
dx.doi.org/10.1016/B0-12-369398-5/00360-1, URL https://www.sciencedirect. In: Meila M, Zhang T, editors. Proceedings of the 38th international conference
com/science/article/pii/B0123693985003601. on machine learning. Proceedings of machine learning research, vol. 139, PMLR;
[29] Brander JA. Easter island: resource depletion and collapse. In: Cleveland CJ, 2021, p. 7313–24, URL https://proceedings.mlr.press/v139/mahajan21b.html.
editor. Encyclopedia of energy. New York: Elsevier; 2004, p. 871–80. http:// [45] Kaddour J, Lynch A, Liu Q, Kusner MJ, Silva R. Causal machine learning:
dx.doi.org/10.1016/B0-12-176480-X/00014-0, URL https://www.sciencedirect. a survey and open problems. 2022, http://dx.doi.org/10.48550/ARXIV.2206.
com/science/article/pii/B012176480X000140. 15475, URL https://arxiv.org/abs/2206.15475.
[30] Hellmann JJ. Species interactions. In: Levin SA, editor. Encyclopedia of bio- [46] Chen AS, Sharma A, Levine S, Finn C. You only live once: single-life rein-
diversity. 2nd ed. Waltham: Academic Press; 2013, p. 715–25. http://dx. forcement learning. 2022, http://dx.doi.org/10.48550/ARXIV.2210.08863, URL
doi.org/10.1016/B978-0-12-384719-5.00134-9, URL https://www.sciencedirect. https://arxiv.org/abs/2210.08863.
com/science/article/pii/B9780123847195001349. [47] Yu L, Xie W, Xie D, Zou Y, Zhang D, Sun Z, et al. Deep reinforcement learning
[31] Estes J, Crooks K, Holt RD. Predators, ecological role of. In: Levin SA, editor. for smart home energy management. IEEE Internet Things J 2020;7(4):2751–62.
Encyclopedia of biodiversity. 2nd ed. Waltham: Academic Press; 2013, p. 229– http://dx.doi.org/10.1109/JIOT.2019.2957289.
49. http://dx.doi.org/10.1016/B978-0-12-384719-5.00117-9, URL https://www. [48] Sutton RS, Barto A. Reinforcement learning: an introduction. MIT Press; 2018.
sciencedirect.com/science/article/pii/B9780123847195001179. [49] Haarnoja T, Zhou A, Abbeel P, Levine S. Soft Actor-Critic: Off-Policy Maximum
[32] Chan T-L, Yuan H-Y, Lo W-C. Modeling COVID-19 transmission dynamics with Entropy Deep Reinforcement Learning with a Stochastic Actor. In: International
self-learning population behavioral change. Front Public Health 2021;9. conference on machine learning. (ICML), 2018.
[33] Eshragh A, Alizamir S, Howley P, Stojanovski E. Modeling the dynamics [50] Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, et al. Continuous
of the COVID-19 population in Australia: A probabilistic analysis. PLoS One control with deep reinforcement learning. In: ICLR (Poster). 2016, URL http:
2020;15:e0240153. http://dx.doi.org/10.1371/journal.pone.0240153. //arxiv.org/abs/1509.02971.
[34] Wilder B, Charpignon M, Killian JA, Ou H-C, Mate A, Jabbari S, et al. Modeling [51] Fujimoto S, van Hoof H, Meger D. Addressing function approximation error
between-population variation in COVID-19 dynamics in Hubei, Lombardy, and in actor-critic methods. 2018, CoRR abs/1802.09477, arXiv:1802.09477, URL
New York City. Proc Natl Acad Sci 2020;117(41):25904–10. http://dx.doi.org/ http://arxiv.org/abs/1802.09477.
10.1073/pnas.2010651117, arXiv:https://www.pnas.org/doi/pdf/10.1073/pnas. [52] Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O. Proximal policy
2010651117, URL https://www.pnas.org/doi/abs/10.1073/pnas.2010651117. optimization algorithms. 2017, CoRR abs/1707.06347, arXiv:1707.06347, URL
[35] Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection http://arxiv.org/abs/1707.06347.
with region proposal networks. In: Cortes C, Lawrence ND, Lee DD, Sugiyama M, [53] Beckley R, Weatherspoon C, Alexander M, Chandler M, Johnson A, Bhatt GS.
Garnett R, editors. Advances in neural information processing systems, vol. 28. Modeling epidemics with differential equations. 2013.
Curran Associates, Inc.; 2015, p. 91–9.
13