
2021/2022

FINAL YEAR PROJECT

Submitted in fulfillment of the requirements for the

ENGINEERING DEGREE FROM THE LEBANESE UNIVERSITY


FACULTY OF ENGINEERING-BRANCH 3

Major: Electrical and Electronics Engineering

Prepared by:

Tala Karaki

Deep Reinforcement Learning-Based Approach


for Fault-Tolerant Control of PV
Systems in Smart Grids

Supervised by:

Dr. Majd Saied

Defended on 29 July 2022 in front of the jury:

Dr. Youssef Harkouss President


Dr. Haidar El Mokdad Member
Contents

Acknowledgement i

Abstract ii

Chapter 1: General Introduction 1

Chapter 2: Literature Review 3


0.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
0.2 PV Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
0.2.1 Solar Cells & PV modules . . . . . . . . . . . . . . . . . . . . . 3
0.2.2 PV string & PV array . . . . . . . . . . . . . . . . . . . . . . . 4
0.2.3 PV System Characteristics . . . . . . . . . . . . . . . . . . . . . 5
0.2.4 Types of PV systems . . . . . . . . . . . . . . . . . . . . . . . . 6
0.2.5 Components of a PV system . . . . . . . . . . . . . . . . . . . . 7
0.3 MPPT Controllers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
0.3.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
0.3.2 DC-DC converter . . . . . . . . . . . . . . . . . . . . . . . . . . 9
0.3.3 MPPT Controllers Techniques . . . . . . . . . . . . . . . . . . . 12
0.4 Fault-Tolerant Control of PV systems . . . . . . . . . . . . . . . . . . . 13
0.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

Chapter 3: Modeling of PV Systems 14


0.6 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
0.7 PV Cells Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
0.8 PV Arrays Under Faults . . . . . . . . . . . . . . . . . . . . . . . . . . 15
0.9 Simulation of PV arrays faults . . . . . . . . . . . . . . . . . . . . . . . 17
0.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

Chapter 4: Deep Reinforcement Learning 20


0.11 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
0.12 Reinforcement Learning Model . . . . . . . . . . . . . . . . . . . . . . . 20
0.12.1 Elements of Reinforcement Learning . . . . . . . . . . . . . . . 21
0.12.2 Markov Decision Process (MDP) . . . . . . . . . . . . . . . . . 22
0.13 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
0.14 Deep Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . 25
0.15 Fault-Tolerant Control based on RL . . . . . . . . . . . . . . . . . . . . 28
0.16 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

Chapter 5: Fault-Tolerant Control Using DRL 29


0.17 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
0.18 PV environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
0.19 Training setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
0.20 Testing with the MPPT control algorithm . . . . . . . . . . . . . . . . 35
0.21 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

Chapter 6: Conclusion and Perspective 40

Bibliography 41
List of Figures

1 PV cells connected in series to form a module . . . . . . . . . . . . . . 4


2 PV string . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3 PV array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
4 P-V & I-V Characteristic Curves . . . . . . . . . . . . . . . . . . . . . 6
5 Effect of Solar Irradiance on the Output Power . . . . . . . . . . . . . . 6
6 Stand-Alone PV System . . . . . . . . . . . . . . . . . . . . . . . . . . 7
7 Grid-Connected PV System . . . . . . . . . . . . . . . . . . . . . . . . 7
8 Bypass & Blocking Diodes . . . . . . . . . . . . . . . . . . . . . . . . . 8
9 Panel-Converter Connection [6] . . . . . . . . . . . . . . . . . . . . . . 9
10 Location of the operation point of a photovoltaic model [6] . . . . . . . 10
11 DC-DC Buck Converter . . . . . . . . . . . . . . . . . . . . . . . . . . 10
12 DC-DC Boost Converter . . . . . . . . . . . . . . . . . . . . . . . . . . 11
13 DC-DC Buck Boost Converter . . . . . . . . . . . . . . . . . . . . . . . 11
14 Single Diode Model of a PV Cell . . . . . . . . . . . . . . . . . . . . . 15
15 Single Diode Model of a PV Array . . . . . . . . . . . . . . . . . . . . 16
16 Different Faults that can occur in a PV System . . . . . . . . . . . . . 18
17 P-V and I-V curves for: (a) effect of line-ground fault on the PV array,
(b) effect of line-line fault, (c) effect of mismatch fault, (d) effect of
partial shading fault. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
18 The agent–environment interaction in a Markov decision process [18]. . 23
19 Artificial Neural Network. . . . . . . . . . . . . . . . . . . . . . . . . 25
20 Actor-Critic Network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
21 Simulation of a PV Array. . . . . . . . . . . . . . . . . . . . . . . . . . 29
22 P-V Characteristic Curve. . . . . . . . . . . . . . . . . . . . . . . . . . 30
23 I-V Characteristic Curve. . . . . . . . . . . . . . . . . . . . . . . . . . . 30
24 Simulation of the PV System. . . . . . . . . . . . . . . . . . . . . . . . 31
25 P-V Characteristic Curves under Different Faults . . . . . . . . . . . . 32
26 RL MPPT Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
27 Actor Network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
28 Critic Network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
29 Training Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
30 PV Power under different faults (1) . . . . . . . . . . . . . . . . . . . . 37
31 PV Power under different faults (2) . . . . . . . . . . . . . . . . . . . . 38
List of Tables

1 DDPG Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2 Training Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3 Experimental Results of the DDPG algorithm under different faults . . 39
Symbols and Abbreviations

AC Alternating Current
ANN Artificial Neural Network
DC Direct Current
DDPG Deep Deterministic Policy Gradient
DQN Deep Q-Network
DNN Deep Neural Network
FTC Fault Tolerant Control
IncCond Incremental Conductance
MDP Markov Decision Process
MPPT Maximum Power Point Tracker
PSO Particle Swarm Optimization
PV Photovoltaic
RL Reinforcement Learning
STC Standard Test Conditions
Acknowledgement

I would like to thank Dr. Hassan Shreim, the director of the Faculty of Engineer-
ing III-Lebanese University, and Prof. Youssef Harkous, the head of the Electrical and
Electronics Department.
My deepest gratitude goes out to Dr. Majd Saied, my supervisor, for helping me
with this thesis. I also appreciate her for the priceless information and abilities she
has given me as well as for the crucial and helpful guidance she provided to enable me
to complete this work.
Finally, I must express my sincere thanks to my family for their unwavering support
and never-ending encouragement during my years of study. I thank them for the trust
they have always placed in me and for the encouragement without which I would
absolutely not be where I am today. Without their affection, this feat would not have
been possible.

Abstract

Reliability of photovoltaic energy generation systems becomes an important issue


as demand for renewable energy sources increases from day to day. PV systems are
susceptible to a range of environmental disturbances and component-related failures,
which have an impact on their regular operation and cause significant energy loss. To
mitigate the effect of such issues, fault-tolerant control methods have been proposed to
monitor the state of the PV system and warn the user of degradation signs of the PV
array and any other unexpected change in the system's normal operation. In addition,
finding the optimal operating point of a PV system in the presence of challenging
environmental circumstances has been made possible through a variety of maximum
power point tracking (MPPT) techniques.
The goal of this study is to provide a fault-tolerant control (FTC) method for setting
up the MPPT controller to seek the sub-optimal operating point. The deep
reinforcement learning approach is the basis of the MPPT controller. The suggested
FTC approach has good performance in the presence of diverse fault types, according
to simulation results on Matlab/Simulink.

Keywords: Fault-Tolerant Control, Deep Reinforcement Learning, PV Systems;

Chapter 1: General Introduction

The first practical photovoltaic (PV) cells were produced in 1954. They were first
used for providing power for earth-orbiting satellites [1]. In the 1970s, the quality
and performance of the PV modules were improved following the improvements in
the manufacturing sector [1]. This has led to a huge reduction in the costs of the PV
modules and enabled the use of this form of energy in remote locations where no grid
connected utilities exist.
By 1992, solar PV energy conversion systems had grown significantly, reaching a
cumulative installed power of around 1.2 GW [2]. After that, PV modules
started being made for domestic use on rooftops. In the past years we have seen
enormous investment in utility-scale solar power plants.
In 2020, the power generated from solar PV systems increased by 156 TWh,
marking a 23% growth from 2019. Solar PV accounted for 3.1% of global electricity
generation, making it the third most used renewable energy technology behind
hydropower and wind energy.
However, similar to any dynamical system, a PV plant is susceptible to many faults
that could decrease and affect the overall efficiency of the system [3].

Motivation
As a renewable energy source, photovoltaic system installation has considerably
expanded in the recent years. In order to avoid unexpected power outages, quick
and effective fault-tolerant procedures are needed. Due to the lack of data and to
the diversity, ambiguity, and suddenness of unanticipated failures occurring in PV
systems, fault-tolerant control (FTC) is difficult to implement, especially for faults
that occur in the early stages. In its simplest form, a system’s fault tolerance refers to
its capacity to carry out its intended function even in the face of flaws. Therefore, by
creating an online learning fault-tolerant controller, reinforcement learning techniques
with a critic action architecture might overcome this difficulty and enable the faulty
system to approximate the fault-free system’s performance.

Thesis Contribution
The main contributions of this thesis are:

• Design of a fault tolerant controller using deep reinforcement learning (DRL)


and testing it by simulation on four different faults.

• Proposition and simulation of a DRL Maximum Power Point Tracker (MPPT)
controller using Deep Deterministic Policy Gradient (DDPG) algorithm.

• Comparison between DDPG and Incremental Conductance (IncCond) algorithms


in the context of fault-tolerance capabilities.

Thesis Outline
The remainder of the thesis is organized in four chapters:

• Chapter 2: In this chapter, all parts relating to PV systems are explained in
detail, especially the MPPT controller. Different MPPT approaches are presented
and the concept of fault tolerance is addressed.

• Chapter 3: The mathematical model of the PV array is derived. Almost all


faults that could occur in a PV system are detailed alongside their causes and
effects. The effects of some of these faults are shown and illustrated in simulation.

• Chapter 4: The deep reinforcement learning theory is explained in detail and


the algorithms used in DRL are shown. The use of DRL in FTC controllers is
brought up.

• Chapter 5: The proposed fault-tolerant controller is validated in simulation.


The training of the reinforcement learning agent is shown and finally the sim-
ulation results of the DRL FTC controller are presented and compared to the
IncCond method.

Finally the thesis concludes with a summary about the obtained results and an outlook
about potential improvements of our work.

Chapter 2: Literature Review

0.1 Introduction
Two types of PV systems exist: stand alone and grid connected. The latter requires
an MPPT controller to ensure that the PV system is delivering the maximum power it
can provide. The PV system is subject to many faults. In order to tolerate these faults
and to be able to reach the maximum achievable power at all times, a fault-tolerant
controller is needed. In recent years, fault-tolerant control has received attention in
many engineering applications due to its efficiency and productivity.
This chapter describes the different components and structures of PV systems,
in addition to the different applied MPPT control techniques and the theoretical
tools used to design them. It also defines the fault tolerance concept and its applied
techniques for PV systems.

0.2 PV Systems
A photovoltaic system is a power system that supplies electrical power using solar
cells. It consists of several components including solar panels that absorb the sunlight
and convert it into electricity, inverters and other electrical and mechanical compo-
nents (cabling, mounting, batteries, ...) [4]. The solar panels consist of a connection
of several PV arrays. PV systems can vary in size from small portable systems to
large utility generation plants. They are either grid-connected or stand-alone sys-
tems. The performance of the solar panel is done using the Watt Peak (Wp) rating.
It is the maximum power that can be achieved under standard test conditions (STC)
in a laboratory. The STC are 1000w/m2 of solar irradiance, 25°C cell temperature
and spectrum at air mass of 1.5. Any change in these parameters will decrease the
overall output power [4], [5].

0.2.1 Solar Cells & PV modules


The solar panels are essentially made of several photovoltaic cells. The PV cell
converts the light emitted from the Sun into electricity. Each cell is made of at least
two layers of semi-conducting material, usually silicon (Si) is used. The Si is then
doped with other materials to create a p-n junction and produce a voltage across it.
An individual PV cell can produce about 0.5 volts and up to 2W of power.
Since the power required by the consumer varies from several watts to kilo-watts,
a single PV cell cannot produce enough power to achieve the required load demand.

Hence, PV cells are electrically connected together to form a module.
Modules are then grouped together in series, in parallel, or both, as shown in
Figure 1, to increase either the voltage or the current, and hence the power,
creating PV arrays and then panels. Series connections increase
the voltage while parallel connections increase the current.

Figure 1: PV cells connected in series to form a module

0.2.2 PV string & PV array


A PV module string (Figure 2) is the connection of a certain number of PV modules
in series. The modules connected in series increase the overall voltage of the system
while the current remains the same.

VT = V1 + V2 + ... + Vn (1)

Figure 2: PV string

Multiple PV strings are connected in parallel to form a PV array (Figure 3). This
will increase the current while the voltage stays unchanged.

IT = I1 + I2 + ... + In (2)
The connection of different PV arrays forms the solar panel installation that delivers the
desired power required by the load. The PV array produces DC current and voltage.
Its performance is affected by the change of the temperature and the solar irradiance.

Figure 3: PV array

0.2.3 PV System Characteristics


The PV system characteristics are detailed below:

• PV Module Efficiency: The PV module efficiency is the ratio between the


output power Pout that the solar module can produce and the input power Pin
that hits the module. Typically, the efficiency ranges from about 5% to 19%.

• Short Circuit Current (ISC ): The short circuit current is the maximum
current that a photovoltaic module can produce when there is no resistance in
the circuit.

• Open Circuit Voltage (VOC ): The open circuit voltage is the voltage when
there is no load connected to the PV cell. The value of VOC depends on the
number of PV panels connected in series.

• I-V and P-V Characteristics Curves: The I-V characteristic curve shown
in Figure 4 gives the current and voltage at which the PV cell generates the
maximum power under STC. Similarly for the P-V characteristic curve. The
power PP V is the product of the current and the voltage. These curves are used
to describe the behavior of the solar cells and to determine the maximum power
point or MPP.

PPV = IPV × VPV   (3)

• Solar Irradiance: Solar irradiance measures how much solar intensity or power
is falling on a flat surface per square meter. It depends on the weather conditions
and the location. The higher the solar irradiance, the higher the power produced
by the PV array is. The solar irradiance under STC is 1000 W/m². The effect
of the solar irradiance on the output power is shown in Figure 5.

• Temperature: The ideal temperature under which the PV cell can operate is
25°C. If the temperature increases above this value, then the PV cell performance
will decrease about 5% for every 20°C increase in cell temperature.

• Air Mass: Air mass refers to the thickness and clarity of the air. The standard
is 1.5.

Figure 4: P-V & I-V Characteristic Curves

Figure 5: Effect of Solar Irradiance on the Output Power

0.2.4 Types of PV systems


There are mainly two types of PV systems: stand-alone and grid-connected sys-
tems. Stand-alone (or off-grid) systems are not connected to the grid and require bat-
tery storage meanwhile grid-connected systems are directly connected to the grid and
do not need battery storage. However, batteries can be added to the grid-connected
systems.

• Stand-Alone PV Systems: A stand-alone system produces electrical power


to charge batteries during the day to be used during the night or during cloudy
days when the solar energy is unavailable. The system can supply both DC and
AC loads as seen in Figure 6. This type of systems is usually composed of the
PV array, batteries, a charge controller and an inverter.

Figure 6: Stand-Alone PV System

• Grid-Connected PV System: In the grid-connected system (Figure 7), the


components are connected to a grid-tied inverter. The system can use batteries
but it can work without them directly supplying the AC load. This system is
composed of the PV array, an MPPT controller, a DC-DC boost converter, an
inverter, a grid interface and a controller for efficient performance. If batteries
are also used, they can assist during outage or cloudy days to supply the load.

Figure 7: Grid-Connected PV System

0.2.5 Components of a PV system


The main components of a PV system are:

1. PV array: Details about PV array were discussed earlier in this chapter.

2. Bypass Diodes: Bypass diodes are connected in reverse bias across solar
cells or panels and have no effect on the panel's normal output. Preferably, each solar cell
has one bypass diode. If a cell is faulty, the diode bypasses it by providing an
alternative path for the current to flow. This will allow the system to continue
supplying power at a reduced voltage rather than no power at all.

3. Blocking Diodes: Blocking diodes are connected in series with each series
branch in the array. The blocking diodes ensure that the power only flows out

of the PV array to the load, controller and batteries and thus prevents the flow
of current from the battery to the panel.
Figure 8 shows both the bypass and blocking diodes in a PV system.

Figure 8: Bypass & Blocking Diodes

4. Batteries: The energy produced from solar PV systems is often stored in a


battery or a battery bank especially in stand-alone systems. Today, several
battery kinds are produced, each with a unique design for a variety of uses.
Lead-acid batteries are used in solar PV systems most frequently because of
their widespread availability in a variety of sizes, inexpensive price, and well-
defined performance characteristics. Nickel-cadmium cells are utilized for low
temperature applications, but their high starting cost prevents them from being
employed in the majority of PV systems.

5. Charge Controller: The charge controller is connected between the array and
the batteries. The main role of this controller is to ensure that batteries do not
overcharge or discharge. Overcharging can shorten the battery’s life span while
discharging reduces the battery’s effectiveness. One of the main drawbacks using
batteries and a charge controller is that they will force the array to operate at
the battery voltage. Typically it is not the ideal voltage so the array will not be
able to produce the maximum power.

6. Inverter: Since the electrical grid and most appliances use alternating current
(AC), the use of an inverter is essential. The inverter converts direct current
(DC) produced by the PV array into AC type that will be consumed by the
grid or any other loads. The many types of solar PV inverters include module
integrated inverters, string inverters, multi string inverters, and centralized in-
verters. Centralized inverters are the most common type. In this model a single,
large inverter is connected to many PV modules [4].

7. DC-DC Converter: The DC-DC converter will be detailed in the following
section.

8. MPPT controller: The MPPT controller will be detailed in the following


section.
All other components that are not mentioned above are called balance of system.
They include electricity meters, cables, power optimizers, protection devices, trans-
formers, combiner boxes, switches, ... and they are used in both stand-alone and
grid-connected system.

0.3 MPPT Controllers

0.3.1 Definition
The maximum power point tracker controller is considered one of the most
important components. It enables the solar panel to operate at its maximum power
point: it extracts the maximum power from the panel and transfers it to the load.
The MPPT is based on a control system and runs a specific algorithm. The maximum
power varies depending on the solar irradiance, the ambient temperature, the solar cell
temperature, and other parameters.
A DC-DC converter connected between the inverter and the panel ensures the
transfer of the maximum power. The duty cycle of the DC-DC converter, set through
pulse width modulation (PWM), is varied by the MPPT controller to regulate the
voltage at which the maximum power is obtained.

0.3.2 DC-DC converter


The DC-DC converter is considered as an essential part of the MPPT controller, it
is therefore important to understand how it operates. In [6], the authors detailed how
the converter is used to track the maximum power point. The ratio of the time interval
in which the switch is on to the commutation period is called the duty cycle (D).

Figure 9: Panel-Converter Connection [6]

Figure 9 shows the solar panel connected to the converter, with Ri being the input resistance
and RL the load resistance. In Figure 10, we assume that Ri = RiA = VA/IA is the
impedance at which the PV currently operates and RiB = RMPP = VMPP/IMPP is
the impedance at the maximum power point. The goal is to vary the impedance Ri
in order to reach Ri = RMPP. Varying Ri is done by varying the duty cycle. Three
different DC-DC converters are studied in [6].

Figure 10: Location of the operation point of a photovoltaic model [6]

• DC-DC Buck Converter: In a buck converter (Figure 11), the output voltage

Figure 11: DC-DC Buck Converter

Vo is smaller than the input voltage Vin and the output current Io is larger than
the input one Iin .
Vo / Vin = Iin / Io = D   with 0 ≤ D ≤ 1   (4)

Vin = Vo / D   (5)

Iin = Io × D   (6)

Rin = Vin / Iin = (Vo / D) / (Io × D) = Vo / (Io D²) = Ro / D²   (7)
where Ro is the output resistance. D varies from 0 to 1, and Rin varies from ∞
to Ro .

• DC-DC Boost Converter: The DC-DC boost converter circuit is shown in


Figure 12. Its output voltage is larger than the input voltage and the output
current is smaller than the input one.

Vo / Vin = Iin / Io = 1 / (1 − D)   (8)

Figure 12: DC-DC Boost Converter

Vin = Vo × (1 − D)   (9)

Iin = Io / (1 − D)   (10)

Rin = Vin / Iin = (Vo (1 − D)) / (Io / (1 − D)) = Ro (1 − D)²   (11)
Rin varies from 0 to Ro as D varies from 1 to 0.

• DC-DC Buck Boost Converter: It is a combination of a buck and a boost


converter. It is shown in Figure 13.

Figure 13: DC-DC Buck Boost Converter

Vo / Vin = Iin / Io = D / (1 − D)   (12)

Vin = Vo × (1 − D) / D   (13)

Iin = Io × D / (1 − D)   (14)

Rin = Vin / Iin = (Vo (1 − D)²) / (Io D²) = Ro (1 − D)² / D²   (15)
Rin varies from ∞ to 0 as D varies from 0 to 1.

The type of the converter is chosen based on the application and the required output
voltage.
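To make the relation between the duty cycle and the input resistance seen by the panel concrete, the following Python sketch evaluates equations (7), (11) and (15) for a few duty cycles. It is only an illustration: the function names and the example load Ro = 10 Ω are assumptions, not values used in this work.

def rin_buck(d, r_out):
    # Buck converter: Rin = Ro / D^2 (equation 7)
    return r_out / d**2

def rin_boost(d, r_out):
    # Boost converter: Rin = Ro * (1 - D)^2 (equation 11)
    return r_out * (1.0 - d)**2

def rin_buck_boost(d, r_out):
    # Buck-boost converter: Rin = Ro * (1 - D)^2 / D^2 (equation 15)
    return r_out * (1.0 - d)**2 / d**2

r_out = 10.0  # assumed load resistance in ohms
for d in (0.2, 0.5, 0.8):
    print(d, rin_buck(d, r_out), rin_boost(d, r_out), rin_buck_boost(d, r_out))

Sweeping d over (0, 1) reproduces the ranges stated above: the buck input resistance spans [Ro, ∞), the boost spans (0, Ro], and the buck-boost spans (0, ∞).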

0.3.3 MPPT Controllers Techniques


In [4], different techniques for MPPT control were proposed. They are listed below:

• Perturb and Observe (P&O);

• Incremental Conductance (InCond);

• Fractional short circuit current;

• Fractional open circuit voltage;

• Fuzzy logic;

• Neural networks;

• Ripple Correlation Control;

• Current Sweep;

• DC-link capacitor droop control;

• Load current or load voltage maximization;

• dP/dV or dP/dI Feedback control;

P&O and InCond are the most commonly used techniques. In P&O, the voltage
and current are sensed at each iteration and the power is computed. The method
consists of changing the duty cycle based on the sign of the difference between the
current iteration and the one that precedes it [7].
The InCond method consists of using voltage and current sensors. The power
is calculated, and the slope dP/dV is computed at each iteration; its value determines
the variation in the duty cycle that needs to be applied [8].
dP/dV = 0   at the MPP   (16)
dP/dV > 0   left of the MPP   (17)
dP/dV < 0   right of the MPP   (18)
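As an illustration of how conditions (16)-(18) translate into a duty-cycle update, the sketch below shows one possible incremental-conductance-style step in Python, assuming a boost converter in which increasing D lowers the PV-side voltage. The step size and variable names are assumptions for illustration, not the controller used later in this work.

def incond_step(v, i, v_prev, p_prev, duty, step=0.01):
    # One MPPT iteration based on the sign of dP/dV (equations 16-18).
    p = v * i
    dv = v - v_prev
    dp = p - p_prev
    if dv != 0 and dp != 0:
        if dp / dv > 0:
            duty -= step   # left of the MPP: raise the operating voltage
        else:
            duty += step   # right of the MPP: lower the operating voltage
    # dP/dV = 0 (or no change): at the MPP, keep the duty cycle unchanged
    duty = min(max(duty, 0.0), 1.0)
    return duty, v, p      # new duty cycle and updated memory (v_prev, p_prev)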
In [9], [10] and [11], an MPPT controller based on reinforcement learning was pro-
posed. This method was compared to the P&O one. It was concluded that using
reinforcement learning improves the produced power and the tracking speed. The authors in
[9] also proposed the use of reinforcement learning to solve the partial shading issues
in PV systems.

0.4 Fault-Tolerant Control of PV systems
The PV system is prone to many faults. Fault tolerance is necessary to maintain
an acceptable performance in case any fault occurs, until the fault is detected and
cleared. Fault-tolerant control is defined as the development and design of special
controllers that are able to tolerate and process faults while still maintaining desirable
and robust performance. In [12], a bibliographical review on FTC systems was made,
listing all previous and new developments in fault tolerance systems. It classified
FTC into two categories: passive and active. In passive FTC, faults that usually
happen are known and taken into consideration in the design of the fault tolerant
controller. The controllers are fixed and are designed to be robust against a class of
presumed faults. Many approaches are being used in designing passive FTC such as
sliding mode control (SMC), fuzzy logic control, particle swarm optimization (PSO),
machine learning, reinforcement learning and many others. In active FTC, faults
are unknown. In case of system components failures, the controller is reconfigured
actively so that the stability and the acceptable performance of the entire system can
be maintained. Active FTC consists mainly of two steps: fault detection and isolation
(FDI) and controller reconfiguration. Passive FTC is more commonly used due to its
simplicity in design and application. In [13], an FTC controller was made using deep
reinforcement learning and PSO. The controller was implemented without any prior
knowledge on the system.
Fault tolerance in PV systems was tackled in [14], where the authors proposed to
reconfigure the PV panel in order to bypass the faulty cells. The work in [3] proposed
an FTC controller to tackle line to line, line to ground, mismatch and partial
shading faults. Active FTC was used. In normal conditions, InCond was used for
the MPPT controller. When a fault is detected, the FTC controller takes over until
the fault is cleared. The method used was based on the particle swarm optimisation.
Both [15] and [16] used the deep reinforcement learning to solve the partial shading
in PV systems. [15] used the DDPG algorithm while [16] used the DDPG and DQN
and compared them to the P&O method. It was shown that the DQN method was
better than the DDPG in most cases, but when it came to partial shading, DDPG was
better. However, both methods present better performance than the P&O technique.

0.5 Conclusion
This chapter presents an overview on PV systems, their components and char-
acteristics. Then it lists the different techniques used to implement one of its most
important components, the MPPT controller. Finally, the fault tolerance concept is
explained, together with its existing types alongside some applied examples on PV
systems. The implementation of the fault-tolerant controller requires the knowledge
of the PV model which will be considered in the following chapter.

Chapter 3: Modeling of PV
Systems

0.6 Introduction
PV systems are exposed to the outdoor environment, so various elements can affect
them, from weather conditions to rodents, and they are prone to different faults. In
order to study the effect of these faults on the system and on the characteristic curve,
a PV array requires a mathematical model to allow the simulation of the array. The
model is also used to represent the P-V and I-V characteristic curves. Some of the
models that can represent the cell are the single diode and double diode model. They
will be detailed in this chapter.

0.7 PV Cells Modeling


To represent the PV cell analytically a model is needed. The PV cell is very similar
to a classical diode with a pn junction. If it is ideal, it will have zero losses. However,
this is not the case and some losses occur in the cell. These losses are expressed
using resistance in the equivalent model. The model chosen and commonly used to
represent the PV cell is the single-diode equivalent circuit shown in Figure 14. It is
chosen for its simplicity and its fast, easy simulation.
The model is simulated as a photo-generated current source Iph connected in parallel
with a diode traversed by a current Id . The circuit also has series and shunt internal
resistances Rs and Rp . The current source’s value depends on the solar irradiance
and the temperature of the solar cell.
The basic equation that represents mathematically the I-V characteristic curve of
the ideal PV cell is:
ipv = Iph − Io [exp(qV / (αkT)) − 1]   (19)

with V and ipv being the output voltage and current respectively, Io the diode
saturation current, α the ideality factor of the diode, q the charge of the electron,
T the temperature of the cell and k = 1.381 × 10⁻²³ J/K the Boltzmann constant.
However, equation 19 does not represent the actual I-V characteristic of a PV array
(Figure 15), since an array is composed of several PV cells, where cells connected in
parallel increase the current. The basic equation then becomes:

Figure 14: Single Diode Model of a PV Cell

ipv = Iph − Io [exp((V + Rs I) / (Vt α)) − 1] − (V + Rs I) / Rp   (20)

with V being the output voltage, V = Ncell × Vcell, i.e. the product of the total
number of cells in the PV array and the voltage of a single PV cell, and I the output
current. Rs is the series resistance, whose influence is stronger when the device operates
in the voltage-source region, and Rp is the parallel resistance, whose influence is stronger
when the device operates in the current-source region. Vt is the thermal voltage
of the PV module. Io is given by:
Io = (Isc,n + KI ∆T) / (exp((Voc,n + KV ∆T) / (α Vt)) − 1)   (21)

All the parameters of this equation are available in the PV array datasheet:
Isc,n is the nominal short-circuit current, Voc,n is the nominal open-circuit voltage,
KV is the open-circuit voltage/temperature coefficient and KI is the short-circuit
current/temperature coefficient. Finally, we have:

Iph = (Iph,n + KI ∆T) × G / Gn   (22)

with G and Gn = 1000 W/m² being the actual and nominal irradiance respectively.
These equations show the dependency of the array on the irradiance and tempera-
ture. Also the P-V and I-V curves are obtained from this mathematical representation.

0.8 PV Arrays Under Faults


The faults occurring on PV systems include mainly the short circuit (SC) faults
which can be divided into line to line and line to ground faults. Other types include
the open circuit (OC) faults, the partial shading fault, the bypass diode fault, the
hotspot fault, the mismatch fault and the arc faults. Some of these faults can lead to

Figure 15: Single Diode Model of a PV Array

a reduction in the overall efficiency and performance of the system and to power loss,
while others can lead to catastrophic results. The most common faults are line to
line, line to ground, shadowing, mismatch and arc faults, while the less common faults
are hotspot, bypass, and connection faults.

• Line to Ground Fault: Line to Ground fault occurs when one or more current
carrying conductors are short circuited to the ground. There are many causes
for this fault. Usually it is due to some disturbances or mishandling of the PV
array such as cable insulation damage (during installation, corrosion, chewing
by rodents. . . ), wrong wiring, wire cutoff, etc.
If this fault occurs outside the PV array, it can completely shut down the system.
But if it occurs inside one of the PV modules, the system will not shutdown but
its performance and output power will decrease. This is due to the fact that
when a string is short circuited to the ground, its output voltage will differ from
the rest of the strings. This will lead to a mismatch fault. If this fault persists
it can lead to a fire.

• Line to Line Fault: Line to line fault occurs when there is a short circuit
between two lines in the PV system. This fault is caused by a short circuit
between the current carrying conductors due to the corrosion, the chewing by
rodents or the water ingress. It may happen within the same string or across
multiple strings. This fault can cause system failure and decreases the overall
system efficiency. The effect on the PV system due to this fault is similar to
that of line to ground, but it can also damage the PV string if it persists. In
general short-circuit faults can in turn lead to arc faults.

• Open Circuit Fault: Due to the environmental effects, the wiring of the PV
system could suffer from degradation which will lead to the breaking of the
conductors. This will cause an open circuit fault.

• Arc Fault: An arc fault is a high-power discharge of electricity between two
or more conductors. This discharge can generate heat or break the wire’s
insulation. This fault can cause fires, threaten the safety of people and damage
property.
• Mismatch Fault: Mismatch Fault occurs when cells within the same series
string produce different currents. The faulted cells produce lower current and
thus dissipate power. The mismatched cell has an increase in the series resistance
and a decrease in the parallel one. It will affect the output power and affect the
overall performance of the system.
• Partial Shading: Partial shading results in the reduction of the power output.
The partial shading is usually due to non-uniform irradiance or dark cells that
act as a load. The shaded cells generate less current than other cells in series in
the same string. This fault usually leads to an over-heating of the cell, known
as “Hot-spot”.
• Hot Spot Fault: Hot spot fault occurs due to some internal and external
causes. Internal causes include fragmentation of cells, current mismatch between
cells, high resistance, degradation of cells, partially shaded cells and overheating.
Meanwhile, external causes are dust, shadow, snow, etc. In this case, the
current will decrease in the shaded cells and the voltage will become negative.
The cells will now consume power from the non-shaded cells instead of producing
it which will result in high power dissipation in the poor cells. This dissipation
will cause over-heating or hot-spot. If this occurrence persists, it can damage the
solar cells, lead to an open circuit fault and reduce the efficiency of the system.
It can also break the wire’s insulation and lead to an electrical fire.
• Diode Fault: Bypass and blocking diodes are very important for the perfor-
mance of the PV system. The causes of such faults are similar to those of the
hot spot fault. The faults that can occur in these diodes are short-circuit or
open-circuit failures.
Figure 16 illustrates the different faults that could occur on a PV system.

0.9 Simulation of PV arrays faults


The issue of modeling of PV arrays under electrical faults has been largely inves-
tigated in the literature [17]. The P-V and I-V characteristic curves play a huge role
in understanding the impact of the different faults on the PV system. With these
curves, one can show the effect of the faults on parameters such as open-circuit volt-
age (Voc ), short-circuit current (Isc ), maximum power point voltage (Vmpp ), maximum
power point current (Impp ) and maximum power point (Pmpp ) in I-V and P-V curves.
A typical PV array is simulated in normal conditions and when a fault occurs. The
effect of different faults on the P-V and I-V characteristics curves is illustrated below
[3].
• Effect of line-to-ground: The effect of this fault is shown in Figure 17.a. This
fault presents a change in both the I-V and P-V curves. There is an evident
decrease in the maximum power. Multiple peaks now appear where there
should be only one. These peaks are at different operating points. A local and a
global maximum are now present, which makes it difficult to detect the
MPP using traditional methods.

Figure 16: Different Faults that can occur in a PV System

• Effect of line-to-line: The effect is shown in Figure 17.b. The curves present
now different peaks. Additionally, the open circuit voltage Voc has increased.

• Effect of mismatch fault: The fault is simulated by changing the series and
shunt resistances of the faulty cells. The series resistance is increased while the
shunt one is decreased. The effect is shown in Figure 17.c; it is similar to that
of the line-line and line-to-ground faults.

• Effect of partial shading: This fault is simulated by exposing a part of the
panel to one irradiance value and the other part to a different value. The
effect is shown in Figure 17.d. It has the same effect as the previous fault.

These four faults are the only ones shown since they are the most common. The
rest of the faults are either a consequence of these faults or do not frequently occur.

0.10 Conclusion
In order to simulate the PV array, the single diode model was explained in this
chapter. Different faults that can occur in a PV array were also presented, such as
line-line, short-circuit and arc faults. Finally, the effect of some of these faults in a simulated
environment was presented.

Figure 17: P-V and I-V curves for: (a) effect of line-ground fault on the PV array, (b) effect
of line-line fault, (c) effect of mismatch fault, (d) effect of partial shading fault.

Chapter 4: Deep Reinforcement
Learning

0.11 Introduction
The first thing that comes to mind when we consider the nature of learning is
usually the notion that we interact with our surroundings, our environment, to learn.
Deep reinforcement learning is a machine learning subset that deals with learning from
interaction with an environment, similarly to the way humans learn.
This chapter explains the concept of reinforcement learning in detail, and then
proceeds to explain how reinforcement learning can be improved using deep learning.
It mentions some algorithms used in deep reinforcement learning.

0.12 Reinforcement Learning Model


Reinforcement Learning (RL) is the training and ability of a machine learning
model to make a sequence of decisions. It is one of the three machine learning types
(the other two are supervised learning and unsupervised learning).
RL is the area in machine learning that is concerned with how the learner can
take actions in a certain environment in order to maximize the notion of cumulative
reward. The learner or agent is not told which actions it must take and which ones
yield the most reward, but it has to discover them on its own through trial and error.
Reinforcement learning is very similar to a game-like situation. The agent uses
trial and error to try to find a solution to the problem. To get the machine to do what
the programmer wants, it either gets rewards or penalties depending on the actions it
takes. The goal is to maximize the reward.
Even though the programmer sets the rules of the game (the different actions that
the agent can take and the rewards it receives), he gives no hints to the model on how to solve
the game. The model must figure that out on its own, starting by random trials and
finishing with specific tactics and skills.
It is also interesting to note that the actions taken may not only affect the im-
mediate reward, but can also affect the next situation and, through that, all subsequent
rewards. Trial-and-error search and delayed reward are
two of the most important features of RL.
One of the main difficulties that RL faces is the trade-off between exploration
and exploitation. In order to get the maximum reward, an RL agent prefers to take
actions that it has already tried and found valuable in obtaining a good reward.

However, in order to find them, the agent has to try actions it has not tried yet. In
other words, the agent has to exploit what is already known and tried in order to get a
high reward but at the same time it has to explore new actions to find those that have
a high reward. Hence, the agent cannot rely exclusively on either exploration or exploitation
without failing. It must continuously alternate between trying new actions and
taking those it already knows will grant a reward. To obtain a reliable
estimate of an action's reward, that action must be tried numerous times.
Reinforcement learning agents have explicit goals. They sense specific features and
characteristics of their environment and choose actions that affect it.
What makes RL interesting is that it needs no training data:
it works and learns from the reward signal alone.
Of all the different types of machine learning, reinforcement learning is the closest
one to the kind of training that the humans and animals do. For example, a gazelle
calf can barely stand on its feet after being born. But half an hour later, it is running
at 20 miles per hour. In this example, the calf is interacting with its environment,
making different decisions to achieve a certain goal. The decision in this case is how
to move its legs and the goal is being able to walk and run. This is achieved by trial
and error. This example shows the similarity between reinforcement learning and very
basic learning that an animal has done.

0.12.1 Elements of Reinforcement Learning


The terms agent and environment were already mentioned above; they are the
first two elements of reinforcement learning. Alongside these two, there are four main
sub-elements of RL: a policy, a reward signal, a value function, and sometimes a model
of the environment [18].
• An agent can be viewed as an object that is perceiving its environment through
sensors and acting upon that environment through actuators. It is an artificial
intelligence algorithm that is trying to solve the environment. The objective of
the agent is to maximize the total reward it receives over the long run.

• The environment is a task or simulation: the world in which the agent carries
out its actions. It takes the agent's current state and action as input and returns
the reward and the agent's next state as output.

• A policy is the way an agent behaves at a certain time. It is a mapping from


the states perceived of the environment to the actions that will be taken when
in those states. The policy may look like a lookup table, a simple function, or it
may involve extensive computation such as a search process. Policies are usually
stochastic, meaning that we select an action from a probability distribution.

• A reward signal is the goal of the RL problem. Reward means the desired
behaviors which are expected from the agent. Rewards are also called feedback
for the agent’s actions in a given state and are described as results, outputs,
or prizes in the model. The reward signal thus tells the agent what are the
good and what are the bad decisions (e.g. low reward = bad, high reward =
good). The reward affects and alters the policy. If the policy chooses an action
and it is followed by a low reward, then the policy may be changed to select a
different action when put in that situation. The reward specifies what is good
in an immediate sense.

• A value function indicates what is good for the agent on the long run. A value
of a state is the total amount of reward an agent is expected to accumulate over
the future starting from this state. Values indicate the long-term desirability
of a set of states taking into account the likely future states and the rewards
yielded by those states. Even if a state might yield a low immediate reward, it
can still have a high value because it is regularly followed by other states that
yield higher rewards.
It is important to distinguish between the rewards and the value. Without
reward there will be no value function. The agent chooses the actions based on
value. The actions are chosen based on those that will yield the highest value
not the highest reward. However, it is much harder to find values than to find
rewards. Rewards are given immediately by the environment. Meanwhile, values
need to be estimated and re-estimated all the time based on the observations
that an agent makes. It is very important to find an efficient way to estimate
values.

• The model of the environment is a model that mimics how the environment
will behave. For example, given a state and an action, the model might predict
the next state and the next reward. Models are used for deciding on a course of
actions, by taking into account possible future situations before they are actually
experienced. RL problems that use models are called model-based methods. RL
problems that do not use models are called model-free methods. The agents
here are explicitly trial-and-error learners.

0.12.2 Markov Decision Process (MDP)


1. Definition of MDPs:
MDPs are meant to be a straightforward framing of the problem of learning
from interaction to achieve a goal. The learner is the agent and it interacts with
the environment. This interaction, illustrated in Figure 18, will lead to different
decisions made to find the best actions to take. At each decision stage, the
system occupies a state St . The set of possible states a system can have is S,
with St ∈ S. At each time t, the agent receives a state St and based on that
state, the agent selects an action At ∈ A(s). With A(s) is the set of allowable
actions an agent can take when it finds itself in that state. Let A = ∪s∈S A(s).
After taking this action, the agent finds itself in a new state St+1 .
It also receives a reward Rt+1 ∈ R; the reward is a real number. When it is
positive it is regarded as income and when negative as a cost. The agent then
needs to take another action, and so on.
The sets of states, actions and rewards can be finite or countably infinite.
Actions may be chosen either randomly or deterministically. If the sets are
finite, then Rt and St have discrete probability distributions that depend on
the previous state and action.

p(s′, r | s, a) = Pr{St = s′, Rt = r | St−1 = s, At−1 = a}   (23)

with s′, s ∈ S, r ∈ R, a ∈ A(s) and p : S × R × S × A → [0, 1]. The function p defines
the dynamics of the MDP.

Figure 18: The agent–environment interaction in a Markov decision process [18].

Also,

Σs′∈S Σr∈R p(s′, r | s, a) = 1   for all s ∈ S, a ∈ A(s).   (24)

From equation 23, it is possible to derive the state-transition probability, i.e. the
probability of landing in state s′ by taking action a from state s:

p(s′ | s, a) = Pr{St = s′ | St−1 = s, At−1 = a} = Σr∈R p(s′, r | s, a)   (25)


The reward r can also be computed as:

r(s, a) = E[Rt | St−1 = s, At−1 = a] = Σr∈R r Σs′∈S p(s′, r | s, a)   (26)

The reward is a simple number passed from the environment to the agent. The
agent's goal is to maximize the total amount of reward. The cumulative reward
is the sum of all the immediate rewards; its expectation is called the expected return.
Another concept that needs to be mentioned is the discounting. In reality, the
agent tries to maximize the sum of the discounted rewards it receives over the
future and chooses the action At based on this number.
If the sequence of the reward over a period of time is Rt+1 , Rt+2 , . . . , then the
sum or discounted return will be:

Gt = Rt+1 + γRt+2 + γ²Rt+3 + ... = Σk⩾0 γ^k Rt+k+1   (27)

with γ being the discount factor, 0 ⩽ γ ⩽ 1. If a reward is received k time


steps in the future, then its value is worth γ^(k−1) times what it would be if it were
received immediately. If γ = 0, then the agent's goal is to maximize the immediate
reward. As γ becomes closer to 1, the agent gives more weight to the
future rewards.
The collection of objects mentioned above constitutes the Markov decision pro-
cess ⟨T, S, As, γ, pt(·|s, a), rt(s, a)⟩, where T is the state-transition probability, i.e. the
probability of making a transition to state s′ from state s using action a.
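As a small numeric illustration of the discounted return in equation (27) (the reward values and discount factor below are assumed, not taken from this work):

def discounted_return(rewards, gamma):
    # G_t = sum over k of gamma^k * R_{t+k+1}, for a finite reward sequence
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0, 3.0]            # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}
print(discounted_return(rewards, 0.9))    # 1 + 0 + 0.81*2 + 0.729*3 = 4.807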

2. Policy and Value Function:


A policy is the mapping from state to the probabilities of selecting each possible
action, π(s, a), π : S → A. The policy captures the RL agent’s behavior.

A value function is the expected return when starting in state s and following
policy π.

vπ(s) = Eπ[Gt | St = s] = Eπ[Σk⩾0 γ^k Rt+k+1 | St = s],   for all s ∈ S   (28)

vπ is the state-value function of policy π.


The expected return starting from state s, taking action a and following the
policy π is qπ (s, a):

qπ(s, a) = Eπ[Gt | St = s, At = a] = Eπ[Σk⩾0 γ^k Rt+k+1 | St = s, At = a]   (29)

qπ is the action-value function for the policy π.

3. Optimal Policies and Optimal Value Functions:


There is always at least one policy that is better than or equal to all the other
policies: the optimal policy, denoted π∗. A policy π′ is defined to be better than or equal to
a policy π if and only if vπ′(s) ⩾ vπ(s) for all states s. An optimal policy π∗
satisfies π∗ ⩾ π for all policies π. An optimal policy is guaranteed to exist but may
not be unique. That means that there could be different optimal policies, but
they all share the same value functions, the “optimal” value functions.

π∗ = arg maxπ:S→A Eπ[Σt⩾1 γ^(t−1) rt]   (30)

The optimal value function is the one which yields maximum value compared
to all other value functions. When we say we are solving an MDP it actually
means we are finding the optimal value function.
An optimal policy has an optimal state-value function:
v∗(s) = maxπ vπ(s),   for all s ∈ S   (31)

In equation 31, v∗(s) tells us the maximum return we can get from the system
starting from state s. The optimal state-action value function indicates the maximum
return we are going to get if we are in state s and take action a from there
onwards:

q∗(s, a) = maxπ qπ(s, a),   for all s ∈ S and a ∈ A   (32)
It is usually more difficult to compute the state value function. Once q ∗ is
available then the optimal policy can be obtained:

π∗(s) = arg maxa∈A q∗(s, a)   (33)

Unlike the state-value function, using the action-value function does not require
a model of the environment.

4. Bellman Optimality Equation:


The optimal value function satisfies a recursive relationship known as the Bellman
Optimality Equation. It is actually a system of equations, one for each state. These
equations are used to compute the optimal value function. They simplify the com-
putation of the value function, such that rather than summing over multiple
time steps, we can find the optimal solution of a complex problem by breaking

it down into simpler, recursive sub-problems and finding their optimal solutions.
Bellman proved that the optimal state value function in a state s is:
v∗(s) = maxa Σs′,r p(s′, r | s, a) [r + γ v∗(s′)]   (34)

Similarly:

q∗(s, a) = Σs′,r p(s′, r | s, a) [r + γ maxa′ q∗(s′, a′)]   (35)
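To make equations (34)-(35) concrete, the following minimal value-iteration sketch repeatedly applies the Bellman optimality backup on a toy two-state MDP and then extracts the greedy policy of equation (33). The transition model, rewards and discount factor are invented for illustration and are unrelated to the PV environment considered later.

# P[s][a] is a list of (probability, next_state, reward) tuples for a toy MDP.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9
v = {s: 0.0 for s in P}

for _ in range(200):   # repeated Bellman optimality backups (equation 34)
    for s in P:
        v[s] = max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                   for a in P[s])

# Greedy policy extraction (equation 33) from the optimal action values (equation 35)
policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * v[s2])
                                         for p, s2, r in P[s][a]))
          for s in P}
print(v, policy)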

0.13 Deep Learning


Deep Learning is a subset of machine learning that imitates the way humans gain
knowledge. Traditional machine learning algorithms are linear, meanwhile deep learn-
ing algorithms are stacked in a hierarchy of increasing complexity and abstraction.
To better understand the concept of deep learning, imagine a toddler trying to un-
derstand what a dog is. He starts pointing at different objects and his parents say,
“Yes, that’s a dog”, or “No, that’s not a dog”. The more the toddler points, the more
aware he becomes of all the features that a dog has. The toddler has, without
knowing, built a hierarchy in which each level is created with the knowledge gained
from the previous layer of the hierarchy in order to clarify the concept of dog.
Similarly, computers use the same concept. Each algorithm in the hierarchy applies
a nonlinear transformation to its input and uses what it learns to create a statistical
model as output. Iterations continue until the output has reached an acceptable level
of accuracy. The number of processing layers through which data must pass is what
inspired the label deep.
Most deep learning models use artificial neural networks (ANNs) [18] (Figure 19).
An ANN consists of an input layer, hidden layers and an output layer. Commonly used
activation functions include Sigmoid, TanH and ReLU.

Figure 19: Artificial Neural Network.

0.14 Deep Reinforcement Learning


Deep reinforcement learning is reinforcement learning combined with artificial neural
networks. DRL enables us to use RL in situations approaching real-life complexity.

Agents occasionally need to create accurate representations of their surroundings from
highly detailed sensory inputs in order to apply their prior knowledge to novel circum-
stances. Because of this, RL agents need ANN to handle these challenging issues.
The first developed DRL agent was the deep Q-network (DQN) [19]. It considers
tasks in which the agent interacts with the environment through a sequence of
observations, actions, and rewards. The agent aims to choose
actions in a way that maximizes the total future reward. Formally, a deep neural
network is employed to approximate the optimal action-value function:

Q∗(s, a) = maxπ E[rt + γ rt+1 + γ² rt+2 + ... | st = s, at = a, π]   (36)

The Deep Q-Learning method can be used to perform deep, model-free RL in


discrete action spaces. This method uses a single deep network to estimate the value
function of each discrete action and, when acting, chooses the maximally valued output
for a given state input. DQN is based on Q-learning, a simpler, earlier method [20].
An agent tries a particular action and assesses the results based on the reward or
punishment it receives immediately and its estimation of the worth of the state from
which it is taken. It discovers which ones are optimal by continuously attempting all
actions in all states, as determined by the long-term discounted reward.
Using a deep neural network, an approximate value function Q(s, a; θi) is parame-
terized, where θi are the parameters (i.e., weights) of the Q-network at iteration i.
The agent’s experiences et = (st , at , rt , st+1 ) are recorded at each time step t in a data
set Dt = e1 , ..., et to perform experience replay. Using samples (or minibatches) of ex-
perience uniformly and randomly chosen from the pool of stored samples, Q-learning
updates are applied during the learning process. The following loss function is used
in the Q-learning update at each iteration i:

Li(θi) = E(s,a,r,s′)∼D [(r + γ maxa′ Q(s′, a′; θi′) − Q(s, a; θi))²]   (37)
with θi being the parameters of the Q-network at iteration i and θi′ are the network
parameters used to compute the target at iteration i. The target network parameters
θi′ are updated with the Q-network parameters θi every n steps. The goal is to update
the parameters of the network in a way to minimize the loss function.
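The structure of this update, with a replay buffer and a periodically copied target network, is illustrated by the following sketch. A tiny linear Q-function stands in for the deep network, the environment transitions are random placeholders, and all dimensions and hyperparameters are assumed; it is not the network trained in this work.

import random
import numpy as np

STATE_DIM, N_ACTIONS = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(N_ACTIONS, STATE_DIM))   # online Q weights (theta_i)
W_target = W.copy()                                       # target weights (theta_i')
buffer, gamma, lr = [], 0.99, 0.01

def q_values(weights, s):
    return weights @ s                                    # Q(s, a) for all actions a

for step in range(500):
    # Placeholder interaction: random transitions stand in for a real environment
    s, a = rng.normal(size=STATE_DIM), rng.integers(N_ACTIONS)
    r, s_next = rng.normal(), rng.normal(size=STATE_DIM)
    buffer.append((s, a, r, s_next))

    if len(buffer) >= 32:
        for s_b, a_b, r_b, s2_b in random.sample(buffer, 32):   # uniform minibatch
            y = r_b + gamma * np.max(q_values(W_target, s2_b))  # target uses theta_i'
            td_error = y - q_values(W, s_b)[a_b]
            W[a_b] += lr * td_error * s_b                       # gradient step on the loss (37)

    if step % 100 == 0:
        W_target = W.copy()                                     # sync target every n steps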
However, DQN only deals with discrete action spaces and cannot handle continuous
ones. To solve this, the work in [21] developed an actor-critic approach called deep
deterministic policy gradient (DDPG) for continuous actions. The DDPG algorithm uses
an actor/critic architecture represented by two neural networks, an actor network and
a critic network. The actor network µ, parameterized by θ^µ, takes the state s as input
and outputs the action a (the policy), while the critic network Q, parameterized by θ^Q,
takes both the state s and the action a as inputs and outputs the Q-value Q(s, a).
The actor/critic network is shown in Figure 20.
The critic is learned using the Bellman equation and is updated by minimizing the
loss function L, just as in DQN:

L(θ^Q) = E_{s_t∼ρ^β, a_t∼β, r_t∼E}[(Q(s_t, a_t | θ^Q) − y_t)²]                (38)

where:
y_t = r(s_t, a_t) + γ Q(s_{t+1}, µ(s_{t+1}) | θ^Q)                (39)
Meanwhile the actor is updated by evaluating the policy and following its gradient in
order to maximize the performance.

∇_{θ^µ} J ≈ E_{s_t∼ρ^β}[∇_a Q(s, a | θ^Q)|_{s=s_t, a=µ(s_t)} ∇_{θ^µ} µ(s | θ^µ)|_{s=s_t}]                (40)

Figure 20: Actor-Critic Network.

with:
J∗ = max_π J_π = max_π E_π[R_t | s = s_t]                (41)
where J_π is the expected total reward given a policy π. Similarly to DQN, the DDPG
agent uses a replay buffer R: transitions are sampled from the environment and the
tuples (s_t, a_t, r_t, s_{t+1}) are stored in the buffer. The replay buffer is the collection of
prior experiences used by every typical approach for training a deep neural network to
approximate Q∗(s, a). The oldest samples are removed when the replay buffer is full.
At each time step, the actor and critic are updated by uniformly sampling a minibatch
from the buffer.
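As an illustration of this mechanism, the short MATLAB sketch below stores transitions
in a fixed-capacity buffer, overwriting the oldest ones once the buffer is full, and
then draws a uniform random minibatch; the sizes and the dummy data are illustrative
only.

% Minimal sketch of a fixed-capacity replay buffer with uniform sampling.
capacity = 5;                               % small capacity for the demo
buffer = cell(capacity, 4);                 % each row holds {s, a, r, sNext}
count = 0;
for t = 1:8                                 % pretend eight transitions arrive
    s = rand(4, 1);  a = 0.06*rand - 0.03;  % dummy state and action
    r = rand;        sNext = rand(4, 1);    % dummy reward and next state
    count = count + 1;
    row = mod(count - 1, capacity) + 1;     % overwrite the oldest when full
    buffer(row, :) = {s, a, r, sNext};
end
N = 3;                                      % minibatch size for the demo
idx = randi(min(count, capacity), [N, 1]);  % uniform random sample indices
minibatch = buffer(idx, :);                 % the sampled transitions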
The algorithm also uses target networks Q′ and µ′ with weights θ^{Q′} and θ^{µ′}. We
are trying to make the Q-function more closely match this target. The resulting
algorithm from [21] is shown below:

DDPG Algorithm

Randomly initialize the critic network Q(s, a | θ^Q) and the actor µ(s | θ^µ) with weights θ^Q and θ^µ
Initialize the target networks Q′ and µ′ with weights θ^{Q′} ← θ^Q, θ^{µ′} ← θ^µ
Initialize the replay buffer R
for episode = 1, M do
    Initialize a random process N for action exploration
    Receive initial observation state s_1
    for t = 1, T do
        Select action a_t = µ(s_t | θ^µ) + N_t according to the current policy and exploration noise
        Execute the action, observe the reward r_t and the new state s_{t+1}
        Store the transition (s_t, a_t, r_t, s_{t+1}) in R
        Sample a random minibatch of N transitions (s_i, a_i, r_i, s_{i+1}) from R
        Set y_i = r_i + γ Q′(s_{i+1}, µ′(s_{i+1} | θ^{µ′}) | θ^{Q′})
        Update the critic by minimizing the loss: L = (1/N) Σ_i (y_i − Q(s_i, a_i | θ^Q))²
        Update the actor policy using the sampled policy gradient:
            ∇_{θ^µ} J ≈ (1/N) Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=µ(s_i)} ∇_{θ^µ} µ(s | θ^µ)|_{s_i}
        Update the target networks:
            θ^{Q′} ← τ θ^Q + (1 − τ) θ^{Q′}
            θ^{µ′} ← τ θ^µ + (1 − τ) θ^{µ′}
    end for
end for

τ ≪ 1 is the target smooth factor used to update the target network weights, and N is
an exploration noise chosen to suit the environment.
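For illustration, the soft target update of the last step can be written in MATLAB as
below; the weight vectors are dummy placeholders rather than actual network parameters.

% Minimal sketch of the soft target-network update with tau << 1.
tau = 0.01;                                             % target smooth factor
thetaQ  = randn(10, 1);  thetaQTarget  = randn(10, 1);  % dummy critic weights
thetaMu = randn(8, 1);   thetaMuTarget = randn(8, 1);   % dummy actor weights
thetaQTarget  = tau*thetaQ  + (1 - tau)*thetaQTarget;   % target critic tracks the critic slowly
thetaMuTarget = tau*thetaMu + (1 - tau)*thetaMuTarget;  % target actor tracks the actor slowly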

0.15 Fault-Tolerant Control based on RL


Due to the absence of reliable data, particularly for failures at an early stage, the
diversity, uncertainty and suddenness of unexpected faults make fault-tolerant control
difficult.
Because of physical limitations and financial considerations, it is practically
impossible to obtain data on all failure types. Moreover, the available data are
unbalanced, with a large amount of data from normal operation and only a small amount
of fault data. A key question for FTC is therefore how to learn while using as little
data as possible. The control value is changed and updated after each action taken.
Using DRL for FTC is a new approach. It enables learning in an unknown environment
without the need for any previous data, which is highly beneficial for FTC, where data
on early faults are lacking.

0.16 Conclusion
This chapter presented the concept of deep reinforcement learning, starting with
reinforcement learning and MDPs, then showing how DNNs can be used in RL for complex
situations. The DQN algorithm, used in discrete action spaces, was explained. Finally,
the DDPG algorithm was described in detail, showing how it uses two neural networks
and operates in continuous action spaces.

Chapter 5: Fault-Tolerant Control
Using DRL

0.17 Introduction
In this chapter, the fault-tolerant control algorithm is applied to a PV array. The
method used is the DDPG algorithm presented earlier. The simulation was implemented
in Matlab/Simulink through the Reinforcement Learning Toolbox. The method was
validated by simulation under different faults using both the IncCond and the DDPG
algorithms. Four different faults were tested, and the results compare the DDPG
algorithm to IncCond.

0.18 PV environment
The PV panel is made of 6 PV arrays, arranged as two parallel strings of three
series-connected arrays. The maximum power of each array is P = 334.905 W. The PV
array has two inputs, one to control the irradiance and the other the temperature.
Each array has a bypass diode connected in parallel with it, and a blocking diode is
added to each series branch. The complete PV array is shown in Figure 21.

Figure 21: Simulation of a PV Array.

The P-V and I-V characteristic curves of this array are shown in Figures 22 and
23 respectively. The curves show that the MPP is at (2000 W, 123.1 V) in the P-V
curve and (16.23 A, 123.1 V) in the I-V curve.

Figure 22: P-V Characteristic Curve.

Figure 23: I-V Characteristic Curve.

The complete PV system is shown in Figure 24. The PV array is connected to a
25 kHz DC-DC boost converter, whose duty cycle is controlled by the MPPT controller.

Figure 24: Simulation of the PV System.

The effects of the four different faults, line to line, line to ground, partial
shading and mismatch, were analyzed in the previous chapter. All of them showed
approximately the same effect: multiple peaks in the P-V and I-V curves and a decrease
in the overall power. These four faults were simulated in our PV environment with no
load, and the same effect can be seen in Figure 25. An effective MPPT algorithm should
be able to increase the efficiency of the PV array under such faulty conditions: the
controller should reach the global maximum under any condition. Since the effects of
these faults are similar, the FTC controller can be programmed with the same algorithm,
with the intent of reaching the global maximum regardless of the occurring fault. Since
the effects of these faults are already known, a passive FTC approach is applied.
To implement the FTC strategy, the MPPT controller makes use of deep reinforcement
learning to optimize the duty cycle so as to obtain the maximum possible power; the
controlled variable is the duty cycle.
To implement a DRL approach for MPPT control, the MPPT control problem needs to be
formulated as an MDP. Its three main components need to be defined:

• State Space: As stated before, the MPPT controller usually uses the P-V and I-V
  characteristics to track the MPP and changes the duty cycle of the DC-DC converter
  accordingly. The continuous state space therefore consists of the voltage of the PV
  array, its current, the current duty cycle of the converter and the variation of
  the duty cycle:

S = {V_PV, I_PV, D, ΔD}                (42)

• Action Space: The action space is also continuous. The action of the DRL
MPPT agent is chosen as the variation in the duty cycle:

A = {∆D| − 0.03 ≤ ∆D ≤ 0.03} (43)

• Reward Function: Every chosen action takes the system into another state, and this
  change is accompanied by a reward. The reward should increase the closer the
  operating point gets to the MPP, and decrease the further it gets from it. The
  reward function is defined below:
r = r1 + r2 + r3 (44)

r1 = P_PV / P_Max                (45)

Figure 25: P-V Characteristic Curves under Different Faults: (a) line-to-line fault,
(b) line-to-ground fault, (c) mismatch fault, (d) partial shading.

r2 = (P_PV / P_Max)²  if ΔP > 0,   r2 = −(P_PV / P_Max)²  if ΔP < 0                (46)

r3 = 0  if 0 ≤ D ≤ 1,   r3 = −1  otherwise                (47)
  with P = V_PV × I_PV and ΔP = P_{t+1} − P_t. P_Max = 2000 W is the maximum power
  under normal conditions.
  Under faulty conditions the characteristic curves have local and global maxima; r1
  ensures that the agent receives a higher reward if it stays at the global maximum,
  r2 gives a positive reward if the power increases and a negative one if it decreases,
  and r3 ensures that the duty cycle remains between 0 and 1: any value outside this
  range leads to a negative reward. A minimal sketch of this reward computation is
  given after this list.
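The sketch below is a minimal MATLAB implementation of Eqs. (44)-(47); it assumes the
power of the previous step Pprev is available, and it treats the case ΔP = 0, which
Eq. (46) leaves unspecified, as giving r2 = 0.

% Minimal sketch of the reward of Eqs. (44)-(47); all inputs are scalars.
function r = mpptReward(Vpv, Ipv, D, Pprev)
    Pmax = 2000;                      % maximum power in normal conditions (W)
    P  = Vpv * Ipv;                   % current PV power
    dP = P - Pprev;                   % power variation between two steps
    r1 = P / Pmax;                    % Eq. (45): higher reward near the global MPP
    r2 = sign(dP) * (P / Pmax)^2;     % Eq. (46): positive if power rises, negative if it falls
    if D >= 0 && D <= 1               % Eq. (47): keep the duty cycle feasible
        r3 = 0;
    else
        r3 = -1;
    end
    r = r1 + r2 + r3;                 % Eq. (44)
end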

0.19 Training setup


Figure 26 shows the RL MPPT model. The chosen RL agent is a DDPG agent; as mentioned
before, this algorithm is suited to continuous action spaces.

Figure 26: RL MPPT Model.

The training of the agent is conducted using the standard test conditions (STC) of
1000 W/m² and 25 °C and an initial duty cycle D = 0.5. The agent sample time Ts is
0.01 s and the simulation end time Tf is 0.5 s.
The DDPG agent options are shown in Table 1 and the training options are shown in
Table 2.

Specifications                   Value
Sample Time                      Ts
Discount Factor                  0.9
Mini Batch Size                  1024
Experience Buffer Length         10^6
Target Smooth Factor τ           0.01

Table 1: DDPG Options

Specifications                   Value
Max Episodes                     2500
Max Steps Per Episode            round(Tf/Ts)

Table 2: Training Options
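The sketch below outlines how such a setup can be assembled with MATLAB's Reinforcement
Learning Toolbox, matching Tables 1 and 2; the model and agent-block names ('PV_system',
'RL Agent') are illustrative placeholders, and for brevity the toolbox is left to create
default actor and critic networks instead of the custom ones of Figures 27 and 28.

% Minimal sketch of the training setup, assuming the Reinforcement Learning
% Toolbox; model and block names are illustrative placeholders.
Ts = 0.01;  Tf = 0.5;                      % agent sample time / simulation end time

% Observation S = {V_PV, I_PV, D, dD} and action A = dD in [-0.03, 0.03]
obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([1 1], 'LowerLimit', -0.03, 'UpperLimit', 0.03);

% Simulink environment wrapping the PV array, boost converter and reward
env = rlSimulinkEnv('PV_system', 'PV_system/RL Agent', obsInfo, actInfo);

% DDPG agent with default networks, configured as in Table 1
agentOpts = rlDDPGAgentOptions('SampleTime', Ts, ...
    'DiscountFactor', 0.9, 'MiniBatchSize', 1024, ...
    'ExperienceBufferLength', 1e6, 'TargetSmoothFactor', 0.01);
agent = rlDDPGAgent(obsInfo, actInfo);
agent.AgentOptions = agentOpts;

% Training options matching Table 2
trainOpts = rlTrainingOptions('MaxEpisodes', 2500, ...
    'MaxStepsPerEpisode', round(Tf/Ts));

trainingStats = train(agent, env, trainOpts);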

The DDPG algorithm requires two neural networks. The actor neural network is shown
in Figure 27. The input is the state, followed by the hidden layers and then the
output layer, which produces the action. The activation function of the hidden layers
is the ReLU function, the most commonly used in DNNs:

ReLU: F(x) = 0  if x < 0,   F(x) = x  if x ≥ 0                (48)

As for the output layer, the activation function used is the Tanh function. Its range
is (−1, 1), so a scaling layer is added to reach the desired action range.

Figure 27: Actor Network.
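As an illustration, an actor network along the lines of Figure 27 could be defined as
follows, assuming MATLAB's Deep Learning and Reinforcement Learning Toolboxes; the two
hidden layers of 64 units are an assumption, since the report does not list the layer
sizes.

% Minimal sketch of an actor network in the spirit of Figure 27;
% hidden-layer sizes are illustrative assumptions.
actorLayers = [
    featureInputLayer(4, 'Name', 'state')            % S = {V_PV, I_PV, D, dD}
    fullyConnectedLayer(64, 'Name', 'fc1')
    reluLayer('Name', 'relu1')                        % hidden activation: ReLU
    fullyConnectedLayer(64, 'Name', 'fc2')
    reluLayer('Name', 'relu2')
    fullyConnectedLayer(1, 'Name', 'fc_out')
    tanhLayer('Name', 'tanh')                         % output restricted to (-1, 1)
    scalingLayer('Name', 'scale', 'Scale', 0.03)];    % rescale to [-0.03, 0.03]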

The critic neural network is shown in Figure 28. The network has two inputs, the state
and the action. Both inputs are followed by hidden layers and then connected to the
output layer. The activation function used is the ReLU function.

Figure 28: Critic Network.

The agent was trained for 2500 episodes. At each time step, a minibatch of size 1024
is sampled from the replay memory and used to update the weights of the neural networks.
Training results are shown in Figure 29. The blue line shows the reward the agent
receives at each episode and the red line is the average reward during training. The
green line represents the episode Q0, the critic network's prediction of the expected
long-term reward based on the observation at the beginning of the episode. This
prediction should usually be close to the actual total reward, although with
actor-critic methods the two are not required to overlap. The goal is to obtain an
increasing reward and to reach convergence. After approximately 1400 episodes, the
DDPG method converges, as shown by the flatness of the average reward.

0.20 Testing with the MPPT control algorithm


After the training step, the results should be validated by testing the controller's
performance under different faults, with the system connected to a load. The first
test was made under partial shading.

Figure 29: Training Results.

Under partial shading, the irradiance of the cells varies between 1000 W/m², 800 W/m²
and 600 W/m². The IncCond method reaches 1641 W (Figure 30b), while the DDPG algorithm
reaches a peak of 1822 W after 20 s and then declines to a steady state of 1734 W after
30 s, which is still higher than IncCond (Figure 30a). The DDPG added an efficiency of
5.36%. We can also notice that the power curve is smoother, with fewer ripples, when
comparing the DDPG algorithm to the typical IncCond.
Under the mismatch fault, one of the PV arrays in a parallel branch no longer produces
power. It is clear from the graph that the power starts at a minimum, reaches a local
maximum of 989 W after 10 s, then decreases before increasing again. After 25 s the
power reaches its maximum at P = 1190 W and stabilizes (Figure 30c). The maximum power
achieved by IncCond is 999 W (Figure 30d).
After a line to line fault occurs, the power increases smoothly until reaching
P = 1147 W after 30 s (Figure 31a), which is greater by approximately 150 W than the
power achieved by the IncCond method (Figure 31b).
The line to ground fault produced results identical to those of the mismatch fault
under both the IncCond algorithm (Figure 31d) and the DDPG algorithm (Figure 31c).
All the results are detailed in Table 3. From these results, we can deduce that the
FTC controller using DRL has a higher efficiency than the traditional controller using
IncCond, with an average increase in efficiency of 12.59%. This is the main advantage
of the method: the power achieved is closer to the global maximum for all the simulated
faults. Another advantage is that this controller can be used for different faults and
even in fault-free conditions, which is highly advantageous since there is no need for
a dedicated FTC controller for every fault. The DRL FTC controller can track the global
MPP and reach this value with no prior knowledge of the type of the occurring fault.

Figure 30: PV Power under different faults (1): (a) proposed FTC method under partial
shading, (b) IncCond method under partial shading, (c) proposed FTC method under
mismatch fault, (d) IncCond method under mismatch fault.

Figure 31: PV Power under different faults (2): (a) proposed FTC method under line to
line fault, (b) IncCond method under line to line fault, (c) proposed FTC method under
line to ground fault, (d) IncCond method under line to ground fault.

Additionally, the IncCond algorithm tracks the peak of the P-V curve by calculating
dP/dV and stops when it reaches 0. In some cases, the algorithm may therefore settle
at a local maximum. With the DDPG algorithm, in contrast, the reward function drives
the controller toward the global maximum.
However, one of the main drawbacks is the convergence time: on average, the controller
reaches steady state and the maximum achievable power only after 30 s, which is a
considerably long period.

Fault                   P_PV (W) IncCond    P_PV (W) DDPG    ΔP (W)    Added Efficiency (%)
Partial Shading         1641                1734             93        5.36
Mismatch fault          999                 1190             191       16
Line to Line fault      998                 1147             149       13
Line to Ground fault    999                 1190             191       16

Table 3: Results of the DDPG algorithm under different faults

0.21 Conclusion
This chapter showed the steps required to train a DRL controller. The training results
of the DDPG algorithm were validated by simulation on four different faults. It was
shown that the DDPG algorithm can reach a higher power than the IncCond algorithm
typically used in MPPT controllers.

Chapter 6: Conclusion and
Perspective

The main objective of this thesis was to develop an FTC controller for a PV system
using a deep reinforcement learning approach. Four different faults were considered:
line to line, line to ground, mismatch and partial shading.

We started with a bibliographic review of existing works. All parts of a PV system
were explained, including the characteristics that were of great importance in our
work, specifically the I-V and P-V characteristic curves. All components were
mentioned, but we were mostly interested in the MPPT controller: the DC-DC converter
and the controller used to manipulate its duty cycle in order to reach the maximum
achievable power. A large number of methods and algorithms have been proposed for
MPPT control. We also introduced the FTC concepts and reviewed the most relevant
works in this field.

Regarding the modeling part, the single-diode model adopted in this work was derived.
A description of the faults that can occur in a PV system was presented alongside
their effects on the P-V characteristic curve. Following this, we proposed a passive
FTC approach based on deep reinforcement learning to tolerate these faults.

The DDPG algorithm was chosen for our controller. Its actor-critic architecture is
very convenient for overcoming the lack of data in faulty situations. The method was
tested in simulation in Matlab considering the four faults mentioned above and was
compared to the IncCond technique. We noticed that the DDPG algorithm increased the
efficiency in comparison to IncCond, and the power reached was very close to the MPP.
However, we also noticed that the convergence time was quite long.

In future work, the causes of the slow convergence should be investigated and the
algorithm improved to converge much faster. Additionally, FTC should also be applied
and tested on faults other than the ones considered in this work, aiming to tolerate
all faults in the fastest and most efficient way.

Bibliography

[1] Jones, Geoffrey G., and Loubna Bouamane. ”’Power from Sunshine’: A Business
History of Solar Energy.” Harvard Business School Working Paper Series (2012).

[2] Kouro, Samir, et al. ”Grid-connected photovoltaic systems: An overview of recent
research and emerging PV converter technology.” IEEE Industrial Electronics
Magazine 9.1 (2015): 47-61.

[3] Boutasseta, Nadir, Messaoud Ramdani, and Saad Mekhilef. ”Fault-tolerant power
extraction strategy for photovoltaic energy systems.” Solar Energy 169 (2018): 594-
606.

[4] Sumathi, S., L. Ashok Kumar, and P. Surekha. Solar PV and wind energy conver-
sion systems: an introduction to theory, modeling with MATLAB/SIMULINK, and
the role of soft computing techniques. Vol. 1. Switzerland: Springer, 2015.

[5] Pearsall, Nicola, ed. The performance of photovoltaic (PV) systems: modelling,
measurement and assessment. Woodhead Publishing, 2016

[6] Kotak, V. C., and Preti Tyagi. ”DC to DC Converter in maximum power point
tracker.” International Journal of Advanced Research in Electrical, Electronics and
Instrumentation Engineering 2.12 (2013): 6115-6125.

[7] Salameh, Ziyad, and Daniel Taylor. ”Step-up maximum power point tracker for
photovoltaic arrays.” Solar energy 44.1 (1990): 57-61.

[8] Hussein, K. H., et al. ”Maximum photovoltaic power tracking: an algorithm for
rapidly changing atmospheric conditions.” IEE Proceedings-Generation, Transmis-
sion and Distribution 142.1 (1995): 59-64.

[9] Kofinas, P., et al. ”A reinforcement learning approach for MPPT control method
of photovoltaic sources.” Renewable Energy 108 (2017): 461-473.

[10] Chou, Kuan-Yu, Shu-Ting Yang, and Yon-Ping Chen. ”Maximum power point
tracking of photovoltaic system based on reinforcement learning.” Sensors 19.22
(2019): 5054.

[11] Youssef, Ayman, Mohamed El Telbany, and Abdelhalim Zekry. ”Reinforcement
learning for online maximum power point tracking control.” Journal of Clean Energy
Technologies 4.4 (2016): 245-248.

[12] Zhang, Youmin, and Jin Jiang. ”Bibliographical review on reconfigurable fault-
tolerant control systems.” Annual reviews in control 32.2 (2008): 229-252.

[13] Zhang, Dapeng, and Zhiwei Gao. ”Fault tolerant control using reinforcement
learning and particle swarm optimization.” IEEE Access 8 (2020): 168802-168811.

[14] Lin, Xue, et al. ”Designing fault-tolerant photovoltaic systems.” IEEE Design
Test 31.3 (2013): 76-84.

[15] Avila, Luis, et al. ”Deep reinforcement learning approach for MPPT control of
partially shaded PV systems in Smart Grids.” Applied Soft Computing 97 (2020):
106711.

[16] Phan, Bao Chau, Ying-Chih Lai, and Chin E. Lin. ”A deep reinforcement
learning-based MPPT control for PV systems under partial shading condition.”
Sensors 20.11 (2020): 3039.

[17] Sabbaghpur Arani, M., and Maryam A. Hejazi. ”The comprehensive study of
electrical faults in PV arrays.” Journal of Electrical and Computer Engineering
2016 (2016).

[18] Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduc-
tion. MIT press, 2018.

[19] Mnih, Volodymyr, et al. ”Human-level control through deep reinforcement learn-
ing.” nature 518.7540 (2015): 529-533.

[20] Watkins, Christopher John Cornish Hellaby. ”Learning from delayed rewards.”
(1989).

[21] Lillicrap, Timothy P., et al. ”Continuous control with deep reinforcement learn-
ing.” arXiv preprint arXiv:1509.02971 (2015).

[22] Zaki, Sayed A., et al. ”Deep-learning–based method for faults classification of PV
system.” IET Renewable Power Generation 15.1 (2021): 193-205.

[23] Khalil, Ihsan Ullah, et al. ”Comparative analysis of photovoltaic faults and perfor-
mance evaluation of its detection techniques.” IEEE Access 8 (2020): 26676-26700.

[24] Alam, Mohammed Khorshed, et al. ”A comprehensive review of catastrophic
faults in PV arrays: types, detection, and mitigation techniques.” IEEE Journal of
Photovoltaics 5.3 (2015): 982-997.

[25] Mellit, Adel, Giuseppe Marco Tina, and Soteris A. Kalogirou. ”Fault detection
and diagnosis methods for photovoltaic systems: A review.” Renewable and Sus-
tainable Energy Reviews 91 (2018): 1-17.

[26] Perera, A. T. D., and Parameswaran Kamalaruban. ”Applications of reinforcement
learning in energy systems.” Renewable and Sustainable Energy Reviews 137
(2021): 110618.

[27] Glavic, Mevludin, Raphaël Fonteneau, and Damien Ernst. ”Reinforcement learn-
ing for electric power system decision and control: Past considerations and perspec-
tives.” IFAC-PapersOnLine 50.1 (2017): 6918-6927.

[28] Kaelbling, Leslie Pack, Michael L. Littman, and Andrew W. Moore. ”Reinforce-
ment learning: A survey.” Journal of artificial intelligence research 4 (1996): 237-
285.

[29] Watkins, Christopher J. C. H., and Peter Dayan. ”Technical note: Q-learning.”
Machine Learning 8.3 (1992): 279-292.

[30] Hausknecht, Matthew, and Peter Stone. ”Deep reinforcement learning in param-
eterized action space.” arXiv preprint arXiv:1511.04143 (2015).

[31] Oliva, Diego, Mohamed Abd El Aziz, and Aboul Ella Hassanien. ”Parameter
estimation of photovoltaic cells using an improved chaotic whale optimization algo-
rithm.” Applied energy 200 (2017): 141-154.

[32] Keles, Cemal, et al. ”A photovoltaic system model for Matlab/Simulink simula-
tions.” 4th International Conference on Power Engineering, Energy and Electrical
Drives. IEEE, 2013.

[33] Kameyama, Keisuke. ”Particle swarm optimization - a survey.” IEICE Transactions
on Information and Systems 92.7 (2009): 1354-1361.

