

Q-Learning based Maximum Power Extraction for Wind Energy Conversion System with Variable Wind Speed

Ashish Kushwaha, Madan Gopal and Bhim Singh, Fellow, IEEE

Abstract--This paper presents an intelligent, wind-speed-sensorless maximum power point tracking (MPPT) method for a variable speed wind energy conversion system (VS-WECS) based on a Q-learning algorithm. The Q-learning algorithm maintains a Q-value for each state-action pair, which is updated using the reward and the learning rate. The inputs that define these states are the electrical power received by the grid and the rotational speed of the generator. In this paper, Q-learning is equipped with a peak detection technique, which drives the system towards the peak power even if learning is incomplete and so makes real-time tracking faster. To make the learning uniform, each state has its own learning parameter instead of a common learning parameter for all states, as is the case in conventional Q-learning. Therefore, if a half-learned system is running at the peak point, it does not affect the learning of unvisited states. Also, wind speed change detection is combined with the proposed algorithm, which makes it suitable for varying wind speed conditions. In addition, the information of wind turbine characteristics and wind speed measurement is not needed. The algorithm is verified through simulations and experimentation and is also compared with the perturbation and observation (P&O) algorithm.

Index Terms--Maximum Power Point Tracking, Q-learning Algorithm, Reinforcement Learning, Wind Energy Conversion System

I. INTRODUCTION

In recent years, the research in electrical power generation has been mostly focused on renewable energy sources, to counterbalance the increasing demand of electric power as well as to overcome the environmental issues associated with conventional energy resources. Wind power has emerged as a major energy source with the capability to provide clean electricity. A wind energy conversion system (WECS) consists of a wind turbine coupled to an electric generator. A commonly used generator is a squirrel cage induction generator (SCIG), which is integrated to the grid by a back-to-back converter scheme [1]. The wind power extraction efficiency of the WECS is enhanced by unifying maximum power point tracking (MPPT) techniques with the converter controllers.

There are three basic approaches to maximum power point tracking: the tip-speed ratio (TSR) control method, the power signal feedback (PSF) method, and the hill climb search (HCS) or perturbation and observation (P&O) method [2]. The TSR control method is fast and accurate. It requires real-time wind speed measurement, which is done by anemometers. However, accurate wind speed measurement is a substantial task and may result in a wrong wind speed [3]. Moreover, to maximize the electrical power output, the wind turbine TSR cannot be used as such; for each wind speed, the TSR value is different with respect to the electrical power [4]. The PSF method is a look-up table based method which uses the relationship between the electric power and the rotor speed. It does not need wind speed measurement. Instead of the electric power, the electromagnetic torque of the generator has also been used to form the look-up table. This approach is termed the optimal torque control (OTC) method [5]. Both the PSF and OTC methods are simple and fast. However, extensive field tests are required to form the table. The HCS or P&O method does not need any information of the WECS. It is an online maximum power point (MPP) search method. However, the HCS method results in oscillations around the MPP. The direction of search can also be wrong for fast changing wind speeds. An adaptive P&O method is proposed in [6]. The step-size is proportional to the slope, which makes tracking faster. However, continuously varying wind conditions may keep the step-size on the higher side, which affects tracking. In [7], the author has proposed two separate algorithms for maximum mechanical power tracking (MMPT) and maximum electrical power tracking (MEPT). The paper shows promising results, but the use of an anemometer makes the system vulnerable to wind speed measurement errors.

Wind speed estimation techniques [8]-[9] have been proposed to avoid the problems associated with wind speed sensors in the TSR method. The performance of estimation techniques depends on the offline training provided to them, which is a tiresome process, and field testing is required. To improve the overall performance of MPPT, some hybrid techniques have also been proposed, which combine the HCS and PSF methods [10]. The scheme starts with HCS and then saves the output power, DC link voltage and current control value in memory. However, it needs a longer time to learn, as each DC voltage in the operating range has a separate MPP. The technique reported in [11] combines the HCS and TSR methods. However, the requirement of wind speed measurement makes it complicated. Artificial intelligence (AI) based techniques have also been proposed in [12]-[13].

A. Kushwaha and M. Gopal are with the Department of Electrical Engineering, School of Engineering, Shiv Nadar University, Gautam Buddha Nagar, Uttar
Pradesh, 201314, India. (e-mail: ak999@snu.edu.in; mgopal@snu.edu.in).
B. Singh, Jr., is with the Department of Electrical Engineering, Indian Institute of Technology, New Delhi, 110016, India. (e-mail:
bhimsinghiitd61@gmail.com).


However, these include the tedious task of offline learning, and assuming constant characteristics also gives wrong results. With the purpose of solving the problems of HCS, modified HCS techniques have been proposed in [14]-[15]. These require detection of the wind speed change. Moreover, they are not able to learn from experience because of the absence of memory.

In this paper, a Q-learning based MPPT technique is proposed. Unlike other MPPT techniques such as TSR and PSF, Q-learning is a model-free learning algorithm, which removes any requirement of knowledge of the WECS parameters or the wind speed for learning. The Q-learning based controller works like an agent which interacts with the environment, and the experience gained by this interaction forms the basis of its learning. In [16], the authors have presented Q-learning based MPPT for constant and step changes in wind speed. The Q-table is represented by a neural network, and the online learning of this neural network for MPPT eliminates the requirement of extensive offline training. However, it has limitations in working under varying wind speed conditions and in the online detection of the maximum power point. In the proposed algorithm, the controller stores its experience in the form of a look-up table. The technique works in two modes: learning mode and MPP mode. In learning mode, the controller searches for the MPP as well as learns through the interaction. As soon as the MPP is detected, the controller works in MPP mode, where it remains at the MPP until the wind speed changes. For any wind speed change, the controller again works in learning mode. If the controller experiences a wind speed that has already been experienced before, it moves faster towards the MPP. Therefore, the proposed MPPT technique shows improvement in terms of time as compared to HCS.

With the wind speed change detection, the controller works in two modes: mode 0 and mode 1. In mode 1, the system operates at the optimum point in varying speed conditions. In mode 0, the system implements the learning mode. While working in MPP mode, the controller records the peak point, which is used in generating the reference speed in mode 1.

The proposed Q-learning based MPPT technique is verified by simulation in the MATLAB/SIMULINK environment on a 3.78 kW, three phase squirrel cage induction generator based WECS.

Fig. 1 Block diagram of the proposed system

II. SYSTEM CONFIGURATION

The proposed system is shown in Fig. 1. The system consists of a wind turbine coupled to a squirrel cage induction generator (SCIG) through a gear box. The SCIG is connected to the grid with back-to-back converters. The grid side converter and the generator side converter are connected through a DC link capacitor.

A. Wind Turbine Model

The mathematical model of the adopted wind turbine is given as follows. The mechanical power generated by the wind turbine can be given as,

$P_m = \frac{1}{2}\rho A v^3 C_p(\lambda,\beta)$  (1)

where Pm is the mechanical power of the turbine in [W], ρ is the air density in [kg/m^3], A is the turbine swept area in [m^2], v is the wind speed in [m/s], Cp is the performance coefficient of the turbine, λ is the tip-speed ratio and β is the blade pitch angle in degrees.

The performance coefficient of the wind turbine, which is a function of λ and β, is given as,

$C_p(\lambda,\beta) = c_1\left(\frac{c_2}{\lambda_i} - c_3\beta - c_4\right)e^{-c_5/\lambda_i} + c_6\lambda$  (2)

where

$\frac{1}{\lambda_i} = \frac{1}{\lambda + 0.008\beta} - \frac{0.035}{\beta^3 + 1}$  (3)

with c1 = 0.5176, c2 = 116, c3 = 0.4, c4 = 5, c5 = 21 and c6 = 0.0068.

The expression for the tip-speed ratio is given as,

$\lambda = \frac{\omega_T R}{v}$  (4)

where ωT is the turbine angular speed and R is the turbine blade radius in m.

In a wind turbine, there is an optimum value of the tip-speed ratio λopt at which the performance coefficient of the turbine is at its maximum value Cpmax. At this point, the turbine fetches maximum power from the wind. For this turbine, λopt and Cpmax are 8.1 and 0.48, respectively, for pitch angle β = 0°.
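As an illustration, the following Python snippet (a minimal sketch, not the authors' implementation) evaluates the turbine model of (1)-(4) and numerically locates λopt and Cpmax for β = 0°. The coefficient values are those listed above; the air density of 1.225 kg/m^3 is an assumed standard value, and R = 2 m is taken from Section VII.

```python
import numpy as np

# Coefficients of the performance-coefficient model in (2)-(3)
c1, c2, c3, c4, c5, c6 = 0.5176, 116, 0.4, 5, 21, 0.0068

def cp(lam, beta=0.0):
    """Performance coefficient Cp(lambda, beta) from (2)-(3)."""
    lam_i_inv = 1.0 / (lam + 0.008 * beta) - 0.035 / (beta**3 + 1.0)
    return c1 * (c2 * lam_i_inv - c3 * beta - c4) * np.exp(-c5 * lam_i_inv) + c6 * lam

def mechanical_power(v, omega_t, R=2.0, rho=1.225, beta=0.0):
    """Turbine mechanical power Pm from (1) and (4); R = 2 m as in Section VII."""
    lam = omega_t * R / v        # tip-speed ratio (4)
    A = np.pi * R**2             # swept area
    return 0.5 * rho * A * v**3 * cp(lam, beta)

# Locate the optimum tip-speed ratio numerically for beta = 0
lams = np.linspace(1.0, 15.0, 2000)
cps = cp(lams)
print(lams[np.argmax(cps)], cps.max())   # approximately 8.1 and 0.48
```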


Fig. 2 shows the characteristics of the turbine power versus the turbine rotational speed at different wind speeds for a pitch angle β = 0°. The optimum curve in Fig. 2 shows the maximum power point at different wind speeds. The aim is to vary the generator speed so that the tip-speed ratio remains at its optimum value for different wind speeds and maximum power point tracking (MPPT) occurs.

Fig. 2 Wind turbine mechanical power-speed characteristics (turbine output power in W versus turbine rotational speed in rad/s, for wind speeds of 8 m/s to 12 m/s, with the optimal power curve)

B. Generator Side Control

The generator side converter control is based on field oriented control (FOC). FOC enables the machine to follow the reference speed accurately. The three phase generator currents are decoupled into d-axis and q-axis components in the rotor flux oriented frame. On the q-axis, the speed control loop is performed, and iq is proportional to the generator torque. The flux control loop operates on the d-axis, and id is aligned with the rotor flux linkage. The control block diagram is shown in Fig. 1. In each control loop, proportional-integral (PI) controllers are used. The inverter is switched with the sine pulse width modulation (SPWM) technique.

The reference speed for the generator is obtained by the Q-learning based MPPT control. This control is explained in further sections.

C. Grid Side Control

The control of the grid side converter is based on voltage oriented control (VOC), as shown in Fig. 1. The instantaneous currents are decoupled into d-axis and q-axis components in the synchronous reference frame using a PLL. The q-axis current controls the reactive power. The reference current for the q-axis is kept at zero to make the reactive power transfer zero.

The d-axis current control loop controls the active power. To maintain the DC link voltage, its control loop is added before the d-axis current control loop. The DC link voltage control loop generates the reference d-axis current. A feed-forward control is used to compensate the coupling terms. For switching the VSC, sine pulse width modulation (SPWM) is used with a switching frequency of 10 kHz.
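The cascaded PI structure of the generator side control can be summarized with a short sketch. This is a simplified illustration only (a discrete PI controller with assumed gains, an assumed control period, and an ideal abc-to-dq transformation), not the dSPACE implementation used by the authors.

```python
import numpy as np

class PI:
    """Discrete PI controller with output limiting (gains and limits are illustrative)."""
    def __init__(self, kp, ki, ts, limit):
        self.kp, self.ki, self.ts, self.limit = kp, ki, ts, limit
        self.integral = 0.0

    def step(self, error):
        self.integral += self.ki * error * self.ts
        return float(np.clip(self.kp * error + self.integral, -self.limit, self.limit))

def abc_to_dq(xa, xb, xc, theta):
    """Amplitude-invariant Park transform used to decouple the measured three-phase quantities."""
    d = (2/3) * (xa*np.cos(theta) + xb*np.cos(theta - 2*np.pi/3) + xc*np.cos(theta + 2*np.pi/3))
    q = -(2/3) * (xa*np.sin(theta) + xb*np.sin(theta - 2*np.pi/3) + xc*np.sin(theta + 2*np.pi/3))
    return d, q

# One control step of the outer speed loop: the Q-learning MPPT supplies w_ref,
# and the PI output acts as the q-axis (torque-producing) current reference.
ts = 1e-4                                    # control period, assumed
speed_pi = PI(kp=0.5, ki=5.0, ts=ts, limit=15.0)
w_ref, w_meas = 135.0, 132.0                 # rad/s, example values
iq_ref = speed_pi.step(w_ref - w_meas)
id_meas, iq_meas = abc_to_dq(4.0, -1.5, -2.5, theta=0.8)
```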
In the real time Q-learning, all the estimates of Q-values are
III. REINFORCEMENT LEARNING stored in a look-up table with one entry for each state-action
pair. If at time t in real-time control, the agent observes state st
The reinforcement learning (RL) is a model-free machine and takes action at, which leads to state st+1, the agent gets the
intelligence approach which does not need any supervision or immediate reward rt. Q-value for state-action pair (st ,at), is
expert knowledge. In the reinforcement learning, the focus is on updated as follows,
direct interaction of the individual (agent) with its environment, 𝑄𝑡+1 (𝑠𝑡 , 𝑎𝑡 ) = 𝑄𝑡 (𝑠𝑡 , 𝑎𝑡 ) + 𝜂𝑡 (𝑟𝑡 + 𝛶[𝑚𝑎𝑥
𝑎 𝑄𝑡 (𝑠𝑡+1 , 𝑎)] − 𝑄𝑡 (𝑠𝑡 , 𝑎𝑡 )) (9)
which learns from its own experience [17]. In each state, the The look-up table is updated as follows
agent performs an action to reach another state. The agent gets 𝑄𝑡+1 (𝑠, 𝑎) = 𝑄𝑡 (𝑠, 𝑎)∀(𝑠, 𝑎) ≠ (𝑠𝑡 , 𝑎𝑡 )) (10)
a reward for each action based on the favorability of reached
states. The basic block of RL is shown in Fig. 3.
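A minimal sketch of the tabular update in (9)-(10) is shown below in Python. The state and action counts are placeholders here; they are specialized to the WECS in Section IV, and the discount factor 0.6 is the value chosen later in Section IV-D.

```python
import numpy as np

n_states, n_actions = 20, 7          # placeholder sizes; Section IV uses m*n states and 7 actions
Q = np.zeros((n_states, n_actions))  # look-up table, one entry per state-action pair

def q_update(s, a, r, s_next, eta, gamma=0.6):
    """One temporal-difference update of Q(s,a) as in (9); all other entries keep their value (10)."""
    td_target = r + gamma * np.max(Q[s_next])   # r_t + gamma * max_a Q(s_{t+1}, a)
    Q[s, a] += eta * (td_target - Q[s, a])

# Example: transition from state 3 to state 4 after action 2 with reward +0.8
q_update(s=3, a=2, r=0.8, s_next=4, eta=1.0)
```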


η ∈ (0,1) is the learning parameter, which represents the extent to which new information is stored. If η is close to zero, learning is very slow, and η close to 1 leads to fast learning. However, for better learning, the learning rate should initially be close to 1 and gradually decrease with time. The selection of action a_t in state s_t at time step t is done with the strategy of maximizing Q_t(s_t,a), thereby exploiting the current approximation of the Q-values following the greedy policy, which is given as,

$a_t = \arg\max_a Q_t(s_t,a)$  (11)

This approach of action selection over-commits to the actions which are found in the early stages, while failing to explore other actions that currently have lower estimated values. The convergence condition requires visiting each state-action pair infinitely often, which clearly does not happen with the greedy policy. Therefore, the agent should follow a policy that balances exploitation and exploration. One such scheme is the ε-greedy policy. In this scheme, the agent's behavior remains greedy most of the time; however, with a small probability ε, every once in a while the agent selects an action at random, uniformly and independently of the action-value estimates. In real-time learning, the probability ε should be high in the initial learning phase for good exploration, and it should gradually decrease with time for good learning.
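The ε-greedy selection can be sketched as follows. This is an illustrative Python fragment; the decaying exploration schedule shown here is only an assumed example, while the schedule actually used by the authors (a random action every N greedy actions) is described in Section V.

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(Q, s, epsilon):
    """Pick the greedy action of (11) with probability 1-epsilon, otherwise a uniformly random action."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: uniform over the available actions
    return int(np.argmax(Q[s]))                # exploit: action with maximum Q-value

def epsilon_schedule(k, eps0=0.5, decay=0.01):
    """Example decaying exploration probability as a function of the step index k (assumed form)."""
    return eps0 / (1.0 + decay * k)
```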
IV. Q-LEARNING BASED MPPT

As described in Section II, the wind turbine extracts maximum power when the tip-speed ratio is at its optimum value λopt. Therefore, for each wind speed, the wind turbine should rotate at a specific speed for maximum power output. The aim of MPPT is to maximize the electrical power transferred to the grid (Pgrid), while the known characteristics are those of the output mechanical power of the wind turbine (PT). Pgrid depends on the efficiency of the generator and the converters. The efficiency depends on many factors, such as rotational speed, switching frequency, aging and temperature, which make it variable. So, λopt for Pgrid is different from that for PT and also changes with time. This requires an efficient model-free MPPT, which is achieved using Q-learning.

The Q-learning algorithm has three basic components: the set of states, the set of actions and the rewards. Other important parameters which affect the performance of Q-learning are the discount factor, the learning rate and the probability (ε) of selecting a random action, which balances exploitation and exploration.

A. Set of States (S)

For a specific wind speed, the power varies with the rotational speed of the generator, and the power is at its maximum value at the optimal speed. The pitch angle is kept constant at 0° in this work. An operating point of the WECS can be defined by a set of two variables, the electrical power output (Pgrid) and the rotational speed of the generator (ωg), as (Pgrid, ωg). Pgrid and ωg are divided into equal intervals over their operating ranges. The numbers of intervals are m and n for Pgrid and ωg respectively. The agent can recognize a state by getting the feedback of Pgrid and ωg and finding out their respective interval numbers. The total number of states is m×n.

At any step, the observed state can be (Pgrid,i, ωg,j), where Pgrid,i belongs to the i-th interval in the power range and ωg,j belongs to the j-th interval in the speed range. The set of states can be defined as,

$S = \{s_{ij} \,|\, s_{ij} = n(i-1) + j,\; i \in \{1,2,\dots,m\},\; j \in \{1,2,\dots,n\}\}$  (12)

where

$i = \arg_i(P_{grid,i}, \omega_{g,j}) \;\;\&\;\; j = \arg_j(P_{grid,i}, \omega_{g,j})$  (13)

i.e., i and j are the indices of the power and speed intervals to which the measured Pgrid and ωg belong.

B. Set of Actions (A)

After observing the state, the agent selects an action so that the WECS moves towards the maximum power point (MPP). The action in the WECS is the amount of increment, decrement or no change in the previous generator rotational speed reference. The actions taken by the agent should move the WECS towards the MPP rapidly and accurately. When the current operating point is far from the MPP, the change in generator speed should be large to make MPPT fast. When the current operating point is near the MPP, the change in speed should be small to make MPPT accurate. To fulfil this purpose, three different values of speed change are chosen as Δω1, Δω2, Δω3 such that Δω1 > Δω2 > Δω3. The set of actions is defined as,

$A = \{a \,|\, a = a_t,\; a_t \in \{\Delta\omega_1, \Delta\omega_2, \Delta\omega_3, 0, -\Delta\omega_3, -\Delta\omega_2, -\Delta\omega_1\}\}$  (14)

The agent has seven actions to choose from A. If the speed of the generator is not within the speed limits, the agent takes action Δω1 or -Δω1 to bring the WECS back within the speed range limits. Within the speed limits, the agent follows the action selection policy. The action with zero speed change is taken once the WECS is operating at the MPP.

The agent chooses an action from A at each step, and the speed reference for the next step is,

$\omega_{g,t} = \omega_{g,t-1} + a_t$

C. Rewards

After each action, the agent receives a reward based on the favorability of the next state. In the WECS, if there is an increase in power in the next state, the reward is positive, otherwise it is negative. The rewards for all seven actions are defined as,

$r_t = \begin{cases} 1 & \text{if } P_{grid,t} - P_{grid,t-1} > \alpha,\; a = \pm\Delta\omega_1 \\ 0.8 & \text{if } P_{grid,t} - P_{grid,t-1} > \alpha,\; a = \pm\Delta\omega_2 \\ 0.6 & \text{if } P_{grid,t} - P_{grid,t-1} > \alpha,\; a = \pm\Delta\omega_3 \\ 0 & \text{if } |P_{grid,t} - P_{grid,t-1}| < \alpha \\ -1 & \text{if } P_{grid,t} - P_{grid,t-1} < -\alpha \end{cases}$  (15)

where Pgrid,t and Pgrid,t-1 are the grid powers at the current step and the previous step respectively, and α is a small positive number which defines the limit of the change in grid power for which the grid powers of two consecutive steps can be considered equal.

If there is a negative change in the grid power, the same negative reward is given for every action. A positive reward is given for a positive change in grid power; however, the value of the reward differs based on the value of the speed change in the action. For any operating point of the WECS, if the change in grid power is positive for any two values of positive speed change available in the set of actions A, the higher positive reward is given to the action with the higher value of speed change. For the corresponding state, this makes the Q-value of the action with the higher speed change larger than that of the action with the lower value of speed change. It results in fast MPPT. If the operating point is near the MPP, the change in grid power may be very small. If the change in grid power is smaller than the predefined value α, the agent considers the difference in power equivalent to zero and no reward is given in this case, irrespective of the action.
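For illustration, the state index of (12)-(13) and the reward of (15) can be computed as in the following sketch. The interval counts, operating ranges and the per-unit base used for α are stand-in assumptions, not the exact settings of the simulated or experimental system.

```python
import numpy as np

M, N = 10, 10                    # number of power and speed intervals (assumed)
P_RANGE = (0.0, 4000.0)          # grid power operating range in W (assumed)
W_RANGE = (500.0, 1800.0)        # generator speed operating range in rpm (assumed)
ALPHA = 0.004 * 3700.0           # power dead band: 0.004 pu (Section V), assumed 3.7 kW base

def state_index(p_grid, w_gen):
    """Map (Pgrid, wg) to the state number s_ij = n*(i-1) + j of (12)."""
    i = int(np.clip((p_grid - P_RANGE[0]) / (P_RANGE[1] - P_RANGE[0]) * M, 0, M - 1)) + 1
    j = int(np.clip((w_gen - W_RANGE[0]) / (W_RANGE[1] - W_RANGE[0]) * N, 0, N - 1)) + 1
    return N * (i - 1) + j

def reward(dp, action_index):
    """Reward of (15); actions 0..6 correspond to {+dw1, +dw2, +dw3, 0, -dw3, -dw2, -dw1}."""
    if abs(dp) < ALPHA:
        return 0.0               # power change inside the dead band: treated as equal powers
    if dp < 0:
        return -1.0              # power decreased: same negative reward for every action
    # power increased: larger speed steps earn larger rewards (zero action left at 0)
    return {0: 1.0, 6: 1.0, 1: 0.8, 5: 0.8, 2: 0.6, 4: 0.6, 3: 0.0}[action_index]
```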


D. Q-Learning Parameters

One important parameter of Q-learning is the learning rate η. For an unexplored system, the learning parameter should be high and should decrease with time. In the WECS, MPPT should be achieved for the current wind speed. Therefore, the agent cannot wait for all the states to be learned, as only the states related to the current wind speed are visited. Consequently, it is not possible to have the same learning parameter for the whole system. To resolve this issue, each state has been assigned its own learning parameter. Initially each state has a high learning parameter, and its decrement then depends on the number of visits made by the system to the respective state. The learning parameters of the other states remain unaffected. For any state s_ij, a learning parameter η_ij is defined as,

$\eta_{ij} = \frac{1}{(1 + 0.2\,k_{ij})^{0.8}}$  (16)

where k_ij is the number of visits to the state. The initial value of k_ij is 0, so η_ij is equal to 1. After each visit, k_ij is increased by 1 for the respective state and η_ij is updated, decreasing with each visit made to the state.

When the system is less learned, the exploration should be more. Less exploration leads to high Q-values for the most frequently occurring states, leaving a few states unexplored. This affects the performance of Q-learning. To overcome this problem, the ε-greedy policy has been followed. At each step, the agent selects the action with the highest Q-value for the respective state; however, with probability ε, a random action is selected irrespective of the state and the condition of the system. The selection of the random action is done uniformly. As the learning progresses, the value of the probability decreases for good exploitation.

The discount factor γ gives the discounted value of future rewards. For a transition from a less explored state to an explored state, a high value of the discount factor may result in an increase of the Q-value of a less desirable action. This affects the performance of Q-learning in MPPT. The discount factor chosen for the learning is 0.6.

V. IMPLEMENTATION OF Q-LEARNING BASED MPPT ALGORITHM

The Q-learning based MPPT algorithm works in two modes: learning mode and maximum power point (MPP) mode. If the operating point is not at the MPP, the system runs in learning mode, in which Q-values are updated for the respective states and actions. The controller shifts the system to MPP mode if the MPP is detected, and the system remains in MPP mode until the wind speed changes. The flow of the algorithm is shown in Fig. 4.

A. Learning Mode

In learning mode, the agent observes the state using the Pgrid and ωg feedback, and then learns the best action by maximizing the Q-values to achieve the MPP. In each state, the agent can take the seven actions defined in (14). Three actions are for an increment in generator speed, three actions are for a decrement in generator speed and one action is for no change. For each action, the agent gets a positive or negative reward based on the increase or decrease in grid power as defined in (15).

In Q-learning based MPPT, learning of the states occurs online. Therefore, if during online learning the system comes across the MPP, the system should not wait for the learning to be completed. It should run at the MPP. However, Q-learning cannot observe the MPP until learning is complete. Therefore, MPP detection has been applied. As the slope of the power-speed curve of the wind turbine is very small at the peak, the change in power becomes very small for any speed change. Therefore, for any action other than zero change, if the change in power is less than α, the system is considered to be operating at the MPP. The value of α has been chosen as 0.004 pu. Once the MPP is detected, the system shifts to MPP mode.

For the m×n states, the Q-value matrix is initialized with the values of the 7 actions for each state. Along with the Q-value matrix, a matrix for the learning rate has also been initialized. The learning rate matrix also contains the number of visits to each state. The initial learning rate for each state is 1, and the initial Q-values for each state are zero.

After the initialization of the Q-matrix, the learning process starts. The visit to each state has been treated as a step. In each step, after observing the state, an action is selected using the ε-greedy policy. The probability ε is uniformly distributed over all actions. Initially, after every 5 actions, a random action is selected. Once all the actions are covered, a random action is selected after every 10 actions, and so on. Normal action selection is done using (11). Using (16), the value of the learning rate is determined. At the end of the step time, the Q-value of the state is updated using (9). The zero change action is not updated, as this action is taken by the agent in MPP mode. The step time should be greater than the response time of the speed control loop of the generator side control, to get correct values of grid power and speed.

B. MPP Mode

In MPP mode, the operating point of the system is the MPP. The action is zero speed change. The updating of Q-values is stopped in this mode. The system runs in MPP mode until the grid power difference of two consecutive steps exceeds α. Then the system is shifted to learning mode.

Fig. 4 Flowchart of Q-learning based MPPT (initialize Q(s,a) = 0 for each state-action pair; recognize the current state, its Q-values and learning rate η; select a_t by the ε-greedy policy; observe the next state and receive the reward r; update Q(s_t,a_t) using (9); if the MPP is reached, take the zero action and remain there while |ΔPgrid| < α)
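A compact sketch of the per-state learning rate of (16) and of the peak-detection test used to leave the learning mode is given below. The rated power used to convert α = 0.004 pu into watts is an assumption.

```python
visits = {}                      # k_ij per state
P_RATED = 3700.0                 # W, assumed base for the per-unit dead band
ALPHA_W = 0.004 * P_RATED        # alpha = 0.004 pu (Section V)

def learning_rate(state):
    """Per-state learning rate eta_ij = 1 / (1 + 0.2*k_ij)**0.8 from (16)."""
    k = visits.get(state, 0)
    return 1.0 / (1.0 + 0.2 * k) ** 0.8

def visited(state):
    """Increase the visit counter of a state after its Q-value has been updated."""
    visits[state] = visits.get(state, 0) + 1

def at_mpp(delta_p, action_is_zero):
    """MPP is declared when a non-zero speed action changes the grid power by less than alpha."""
    return (not action_is_zero) and abs(delta_p) < ALPHA_W
```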


The convergence of Q-learning is based on the convergence of the Q-values for all the state-action pairs. In online learning, waiting for all Q-values to converge is not a good option. Another option is to define the level of learning of each state separately. The learning of each state is defined by the number of visits to the state. The criterion for any state s_ij to be learned is

$k_{ij} \geq k_{max}$  (17)

The value of k_max is taken as 80. Once a state is learned, the action with the maximum Q-value is always selected for it.

VI. ONLINE OPERATING MODES

Q-learning based MPPT can search the maximum power point for constant wind speeds. In real time, the wind speed is varying most of the time. To deal with the varying wind speeds, the MPPT algorithm has been implemented in two modes: mode 0 and mode 1.

A. Mode 0

Mode 0 implements the Q-learning based MPPT when the wind speed is stable. If a wind speed change is detected, the system switches to mode 1.

If the system is working in MPP mode, the change in the speed reference command is zero. In this case, if the change in power is more than 5 to 10%, a wind speed change is detected. If the system is operating in learning mode, the wind speed change is detected by the wind speed change detection algorithm. If a wind speed change is detected, the system stops learning and switches to mode 1. The wind speed change detection algorithm is explained in part C of this section.

B. Mode 1

For continuously varying wind conditions, the system operates in mode 1. In mode 1, the controller uses the peak points detected in MPP mode to calculate the generator speed reference command. The reference command is calculated by

$\omega_g^* = \sqrt[3]{\frac{P_g}{k_{opt}}}$  (18)

While operating in mode 1, if the controller detects less than a 5 to 10% change in power, it concludes that the wind speed is stable [19]. For a stable wind speed, the controller switches to mode 0. In mode 0, the system operates in learning mode or MPP mode depending on the state it has reached.

C. Wind Speed Change Detection

While working in mode 0, the controller implements the Q-learning based MPPT. In MPP mode, the wind speed change detection depends on the power change. In learning mode, as the generator speed reference command is changing, the wind speed change detection is done separately.

For the wind speed change detection, three parameters are taken: the current slope $(\Delta P/\Delta\omega)_k$, the previous step slope $(\Delta P/\Delta\omega)_{k-1}$ and the change in the speed reference command $(\Delta\omega)_{k-1}$.

If $(\Delta P/\Delta\omega)_k > 0$ and $(\Delta P/\Delta\omega)_{k-1} > 0$, the operating point lies on the positive slope of the P-ω curve of the turbine. In this condition, if $(\Delta\omega)_{k-1} > 0$, then $(\Delta P)_k$ is necessarily positive. Similarly, if $(\Delta\omega)_{k-1} < 0$, then $(\Delta P)_k$ is necessarily negative. For the above mentioned conditions, if the magnitude of $(\Delta P)_k$ is less than a threshold value $(\Delta P)_T$, there is no significant change in wind speed. If $|(\Delta P)_k| > (\Delta P)_T$, a significant change in wind speed is detected and the controller switches from mode 0 to mode 1. The threshold value $(\Delta P)_T$ is given as

$(\Delta P)_T = K_1 |\Delta\omega| \left|\left(\frac{\Delta P}{\Delta\omega}\right)_k\right|$  (19)

Similar logic for wind speed change detection is applied for $(\Delta P/\Delta\omega)_k < 0$ and $(\Delta P/\Delta\omega)_{k-1} < 0$.

If $(\Delta P/\Delta\omega)_k < 0$ and $(\Delta P/\Delta\omega)_{k-1} > 0$, the operating point was on the positive slope and is currently on the negative slope. In this condition, $(\Delta\omega)_{k-1}$ has to be necessarily positive. The threshold value of power is given as

$(\Delta P)_T = K_2 |\Delta\omega| \left|\left(\frac{\Delta P}{\Delta\omega}\right)_k\right|$  (20)

If $(\Delta\omega)_{k-1}$ is negative, a wind speed change is detected. Similar logic for wind speed change detection is implemented for the condition $(\Delta P/\Delta\omega)_k > 0$ and $(\Delta P/\Delta\omega)_{k-1} < 0$.

The wind speed change detection is the key to switching between operating modes 1 and 0 to operate the system at the maximum power point. For varying wind speed conditions, mode 1 operates the system at the optimum speed. Whenever the wind speed becomes stable, the system automatically switches to mode 0 to search for the optimum speed and update the value of k_opt.
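The threshold tests of (19)-(20) and the mode-1 reference of (18) can be written down directly, as in the sketch below. How the slopes $(\Delta P/\Delta\omega)$ are estimated and which speed step enters $|\Delta\omega|$ is left open here, and the additional sign checks on $(\Delta\omega)_{k-1}$ described above are omitted; the sketch only mirrors the published equations, with K1 and K2 taken from the experimental section.

```python
def significant_power_change(dp_k, dw, slope_k, slope_km1, K1=1.5, K2=0.66):
    """Compare |(dP)_k| against the thresholds of (19)-(20).

    slope_k and slope_km1 are the current and previous (dP/dw) slopes,
    dp_k is the latest grid-power change and dw the speed-step magnitude."""
    if slope_k * slope_km1 > 0:                      # operating point stayed on one side of the peak
        threshold = K1 * abs(dw) * abs(slope_k)      # (19)
    else:                                            # slope sign changed between the two steps
        threshold = K2 * abs(dw) * abs(slope_k)      # (20)
    return abs(dp_k) > threshold                     # True -> switch from mode 0 to mode 1

def mode1_speed_reference(p_grid, k_opt):
    """Mode-1 generator speed reference of (18): w_g* = (P_g / k_opt)**(1/3)."""
    return (p_grid / k_opt) ** (1.0 / 3.0)
```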


VII. SIMULATION RESULTS

The proposed Q-learning based MPPT has been applied to a grid connected 5 HP (3.7 kW) SCIG based WECS. The simulation has been done in MATLAB/SIMULINK. The sample time for the simulation is 2 μs, and the sample time for the Q-learning based MPPT controller is 200 ms. The wind turbine radius is 2 m for the simulation.

The capability of the Q-learning based algorithm to track the MPP has been tested by running the WECS at a constant wind speed of 9 m/s. The wind speed, power coefficient Cp, generator speed and grid power are shown in Fig. 5. The action values are Δω1 = 90 rpm, Δω2 = 60 rpm and Δω3 = 30 rpm. Initially, the WECS runs in learning mode, and it reaches the MPP after around 7 seconds.

For a step change in wind speed, the results are shown in Fig. 6. For the variation of wind speed between 9 m/s and 10 m/s, it can be observed that an initially unlearned system takes more time (8-9 sec) to reach the optimum point. Gradually with time, the system reaches the optimum point very quickly (1-2 sec).

Fig. 7 shows the performance of the proposed MPPT algorithm for varying wind speed conditions. Initially the wind speed is kept stable around 9 m/s, with a variation of within ±0.05 m/s. When the wind speed is stable, the system operates in mode 0, which is the learning mode. When it reaches the peak point, it runs in MPP mode. After t = 14 s, the wind speed is continuously varying. The system switches to mode 1, and the reference speed is generated by (18). The Cp remains at its maximum value of 0.48 during the varying wind speed as well.

Fig. 5 Performance of MPPT algorithm for constant wind speed of 9 m/s (generator speed 1290 rpm at 1800 W at the MPP)

Fig. 6 Q-learning based MPPT for step change in wind speed from 9 m/s to 10 m/s

Fig. 7 Q-learning based MPPT for continuously varying wind speed

VIII. EXPERIMENTAL RESULTS

For experimental verification of the proposed algorithm, a test setup is made as shown in Fig. 8. The tests are conducted for constant wind speed, step change in wind speed and varying wind speed conditions.

The test setup contains a squirrel cage induction generator (SCIG) connected to a DC motor. The DC motor is controlled in torque control mode. The SCIG is connected to the grid by back-to-back 3-phase converters. Vector control is applied for the speed control of the SCIG, and the converter is switched using hysteresis control. The grid side converter is switched by SPWM with voltage oriented control. The control scheme has been implemented on a dSPACE 1103 board for real time control. For the results, the real time data is stored using dSPACE and then plotted in MATLAB. The current waveforms of the grid-side converter and the generator side converter are shown in Fig. 9 and Fig. 10.

The values of K1 and K2 in (19) and (20) for the experimental verification are 1.5 and 0.66 respectively. The discount factor for Q-learning is 0.6.

The Q-learning based MPPT algorithm is first tested for a constant wind speed of 7 m/s. For the experimentation, two speed changes of 40 rpm and 60 rpm are used. The initial speed is set to 1000 rpm.

Fig. 8 Hardware prototype in the laboratory (converters, dSPACE board, sensors, SCIG and DC motor)

Fig. 9 Phase 'a' voltage and current

The controller first does the exploration by decreasing and increasing the speed. By gaining the rewards for each action, the controller reaches the peak point in around 10 to 11 seconds. The peak point is 340 W at 1230 rpm, as shown in Fig. 11.


Fig. 10 Phase 'a' and 'b' currents of the generator at 1200 rpm

Fig. 11 Performance of Q-learning based MPPT control for constant wind speed of 7 m/s

The Q-learning based MPPT algorithm has been tested for a step change in wind speed, and the results are shown in Fig. 12. The wind speed is changed between 6.5 m/s and 7.5 m/s alternately for 20 seconds each. The initial speed is set to 1000 rpm. Initially, as the controller is inexperienced, it takes more time in exploration. Thus, it takes more time to reach the optimum point. As the controller gains experience through the rewards, it reaches the peak point quickly. The initial time to reach the peak point for 6.5 m/s is around 5 sec, which reduces to 2-3 sec when the controller faces the same speed again.

Fig. 12 Performance of Q-learning based MPPT control for step change in wind speed from 6.5 m/s to 7.5 m/s

The proposed algorithm is verified for continuously varying wind speed with automatic switching between the operating modes. The results for the varying wind speeds are shown in Fig. 13. The initial wind speed is stable around 7 m/s. After 10.5 seconds, the wind speed starts varying. From the mode waveform, it can be observed that when the speed starts varying, the mode changes automatically in the next time step, that is, at 11 sec. The power coefficient Cp during the varying wind speed conditions remains at the optimum value of 0.48.

Fig. 13 Performance of Q-learning based MPPT control for varying wind speed

The proposed algorithm is compared with the P&O algorithm. The applied P&O algorithm is shown in Fig. 14. The step change in reference speed is kept constant at 40 rpm, which is comparable to the step size used in the proposed algorithm. The step change of 40 rpm reduces or increases the reference speed (Nr) by 40 rpm. A power threshold of 10 W has been defined to make the P&O algorithm able to stay at the MPP, which is a low slope region. This improves the performance of the P&O method, as it reduces the oscillation at the MPP. The performance of the P&O algorithm is shown in Fig. 13 for varying wind speed conditions. It can be observed that the power coefficient Cp for P&O is lower compared to that of the Q-learning based MPPT. Specifically, in the region where the wind speed is continuously changing, the P&O algorithm is unable to maintain the peak point.

Fig. 14 Flowchart of the applied P&O algorithm (measure N(k), Vabc(k) and Iabc(k); compute the grid power P(k); if |P(k) - P(k-1)| is within the 10 W threshold, keep Nr(k) = N(k); otherwise step Nr(k) by ±40 rpm according to the signs of P(k) - P(k-1) and N(k) - N(k-1))
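A Python sketch of the P&O logic of Fig. 14 is given below. It follows the flowchart directly, with the grid power computed from the sampled three-phase voltages and currents; the branch assignment follows the standard P&O rule (keep the perturbation direction when the power rises, reverse it when the power falls).

```python
def grid_power(v_abc, i_abc):
    """Instantaneous grid power P(k) from the sampled phase voltages and currents."""
    return sum(v * i for v, i in zip(v_abc, i_abc))

def perturb_and_observe(p_k, p_km1, n_k, n_km1, step=40.0, threshold=10.0):
    """One step of the applied P&O algorithm of Fig. 14 (speeds in rpm, powers in W).

    Returns the new speed reference Nr(k)."""
    dp = p_k - p_km1
    if abs(dp) <= threshold:        # inside the 10 W dead band: hold the speed at the MPP
        return n_k
    dn = n_k - n_km1
    if dp > 0:                      # power increased: keep perturbing in the same direction
        return n_k + step if dn > 0 else n_k - step
    return n_k - step if dn > 0 else n_k + step   # power decreased: reverse the direction
```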


Also, as there is no memory involved in P&O, it takes a longer time to reach the MPP on every deviation. The comparison of P&O and the proposed technique is presented in terms of the electrical energy supplied to the grid in Fig. 15. The difference in energy at the end of 25 seconds is 293 watt-seconds. This analysis shows the superiority of the proposed algorithm over the P&O algorithm. Assuming the same rate of energy gain for both methods, a 293 watt-second improvement in 25 seconds corresponds to about 0.042 mega-joules of improvement in 1 hour (293 Ws / 25 s x 3600 s ≈ 4.2 x 10^4 J), which makes the proposed algorithm significantly superior to the existing method.

Fig. 15 Energy supplied to the grid

IX. CONCLUSION

A Q-learning based MPPT algorithm has been proposed for a variable-speed WECS. This algorithm has been found capable of searching for the MPP in real time. The wind speed change detection makes the proposed algorithm work in real time under varying wind speed conditions. The MPPT technique is model-free and capable of learning the states by interacting with the system. The stopping criterion for the learning of the system depends on the number of times a particular state has been visited. A state which is learned will work in mode 1. The proposed algorithm is compared with another popular algorithm, P&O, which does not require system information, and the superiority of the proposed algorithm is shown. As explained in Section IV, the various dynamic parameters make the TSR and PSF algorithms unreliable for MPPT. The proposed Q-learning based algorithm is capable of online learning, of identifying the MPP even when the system is unlearned, and of working at the MPP with varying wind speed. Also, due to a single Q-value update at each step, it takes very little computational time, which makes it fast. However, there is scope for using function-approximation methods to reduce the memory requirement caused by the use of a look-up table.

REFERENCES

[1] M. G. Simões and F. A. Farret, Alternative Energy Systems: Design and Analysis with Induction Generators. CRC Press, 2007.
[2] S. Musunuri and H. L. Ginn, "Comprehensive review of wind energy maximum power extraction algorithms," in Power and Energy Society General Meeting, July 24-29, 2011, pp. 1-8.
[3] L. Y. Pao and K. E. Johnson, "A tutorial on the dynamics and control of wind turbines and wind farms," in American Control Conference, June 10-12, 2009, pp. 2076-2089.
[4] S. M. R. Kazmi, H. Goto, H.-J. Guo, and O. Ichinokura, "Review and critical analysis of the research papers published till date on maximum power point tracking in wind energy conversion system," in Energy Conversion Congress & Exposition, Sept. 12-16, 2010, pp. 4075-4082.
[5] S. Morimoto, H. Nakayama, M. Sanada, and Y. Takeda, "Sensorless output maximization control for variable-speed wind generation system using IPMSG," IEEE Trans. Ind. Appl., vol. 41, no. 1, pp. 60-67, Jan.-Feb. 2005.
[6] R. I. Putri, M. Pujiantara, A. Priyadi, T. Ise and M. H. Purnomo, "Maximum power extraction improvement using sensorless controller based on adaptive perturb and observe algorithm for PMSG wind turbine application," IET Electric Power Applications, vol. 12, no. 4, pp. 455-462, 2018.
[7] H. Fathabadi, "Novel maximum electrical and mechanical power tracking controllers for wind energy conversion systems," IEEE Journal of Emerging and Selected Topics in Power Electronics, vol. 5, no. 4, pp. 1739-1745, Dec. 2017.
[8] H. Li, K. Shi, and P. McLaren, "Neural-network-based sensorless maximum wind energy capture with compensated power coefficient," IEEE Trans. Ind. Appl., vol. 41, no. 6, pp. 1548-1556, Nov.-Dec. 2005.
[9] M. Cirrincione, M. Pucci, and G. Vitale, "Neural MPPT of variable-pitch wind generators with induction machines in a wide wind speed range," IEEE Trans. Ind. Appl., vol. 49, no. 2, pp. 942-953, Mar.-Apr. 2013.
[10] Q. Wang and L. Chang, "An intelligent maximum power extraction algorithm for inverter-based variable speed wind turbine systems," IEEE Trans. Power Electron., vol. 19, no. 5, pp. 1242-1249, Sept. 2004.
[11] J. Hui and A. Bakhshai, "A new adaptive control algorithm for maximum power point tracking for wind energy conversion systems," in Power Electronics Specialists Conf., June 15-18, 2008, pp. 4003-4007.
[12] W.-M. Lin, C.-M. Hong, F.-S. Cheng, and K. H. Lu, "MPPT control strategy for wind energy conversion system based on RBF network," in EnergyTech, May 25-26, 2011, pp. 1-6.
[13] C.-H. Chen, C.-M. Hong, and T.-C. Ou, "WRBF network based control strategy for PMSG on smart grid," in Int. Conf. on Intelligent System Applications to Power Systems, Sept. 25-28, 2011.
[14] J. Yaoqin, Y. Zhongqing, and C. Binggang, "A new maximum power point tracking control scheme for wind generation," in Proc. Int. Conf. on Power System Technology, Oct. 13-17, 2002, pp. 114-148.
[15] S. M. R. Kazmi, H. Goto, H.-J. Guo, and O. Ichinokura, "A novel algorithm for fast and efficient speed-sensorless maximum power point tracking in wind energy conversion systems," IEEE Trans. Ind. Electron., vol. 58, no. 1, pp. 29-36, Jan. 2011.
[16] W. Chun, Z. Zhang, W. Qiao and L. Qu, "Intelligent maximum power extraction control for wind energy conversion systems based on online Q-learning with function approximation," in IEEE Energy Conversion Congress and Exposition (ECCE), 2014, pp. 4911-4916.
[17] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[18] C. J. C. H. Watkins, "Learning from Delayed Rewards," Ph.D. dissertation, Cambridge University, Cambridge, England, 1989.
[19] Y. Xia, K. H. Ahmed and B. W. Williams, "A new maximum power point tracking technique for permanent magnet synchronous generator based wind energy conversion system," IEEE Trans. Power Electron., vol. 26, no. 12, pp. 3609-3620, Dec. 2011.

X. BIOGRAPHIES

Ashish Kushwaha received the B.Tech. degree in electrical engineering from JSSATE, Noida, in 2010, and the M.Tech. degree in electrical engineering from the Motilal Nehru National Institute of Technology, Allahabad, India, in 2012. He is currently working toward the Ph.D. degree in the Department of Electrical Engineering, Shiv Nadar University, Gautam Buddha Nagar, UP, India. His research interests include renewable energy systems, power electronics, electric drives and control, and machine learning.

M. Gopal received the B.E. degree in electrical engineering, and the M.E. and Ph.D. degrees in control engineering from the Birla Institute of Technology and Science, Pilani, India, in 1968, 1970, and 1976, respectively. M. Gopal, an Ex-Professor of the Department of Electrical Engineering, Indian Institute of Technology (IIT) Delhi, New Delhi, India, is presently associated with Shiv Nadar University, Gautam Budh Nagar (U.P.), India. His teaching and research stints span over three decades at the IITs. He is the author/co-author of six books on control engineering. His video course on control engineering is available on YouTube. He has a large number of publications in reputed journals. His current research interests include pattern recognition, machine learning, soft computing, and intelligent control.

Bhim Singh (SM'99-F'10) was born in Rahamapur, Bijnor, India, in 1956. He received the B.E. degree in electrical engineering from the University of Roorkee, Roorkee, India, in 1977, and the M.Tech. degree in power apparatus and systems and the Ph.D. degree in electrical machines from the Indian Institute of Technology (IIT) Delhi, New Delhi, India, in 1979 and 1983, respectively. In 1983, he became a Lecturer with the Department of Electrical Engineering, University of Roorkee (now IIT Roorkee), where he became a Reader in 1988. In December 1990, he joined the Department of Electrical Engineering, IIT Delhi, as an Assistant Professor, where he became an Associate Professor in 1994 and a Professor in 1997. He was the Head of the Department of Electrical Engineering, IIT Delhi, from July 2014 to August 2016. He is currently the Dean, Academics, with IIT Delhi. He has guided 79 Ph.D. dissertations and 166 M.E./M.Tech./M.S.(R) theses. He has executed more than 75 sponsored and consultancy projects. His areas of research interest include photovoltaic (PV) grid interface systems, micro grid, power quality, and PV water pumping systems.
