A Reinforcement Learning Fuzzy Controller for the Ball and Plate System



Nima Mohajerin, Mohammad B. Menhaj, Member, IEEE, Ali Doustmohammadi

Abstract—In this paper, a new fuzzy logic controller, namely the Reinforcement Learning Fuzzy Controller (RLFC), is proposed and implemented. Based on fuzzy logic, this newly proposed online-learning controller is capable of improving its behavior by learning from the experiences it gains through interaction with the plant. RLFC is well suited for hardware implementation, with or without a priori knowledge about the plant. To support this claim, a hardware implementation of the Ball and Plate system was built, and RLFC was then developed and applied to it. The obtained results are illustrated in this paper.

Index Terms—Fuzzy Logic Controller, Reinforcement Learning, Ball and Plate system, Balancing Systems, Model-free optimization

I. INTRODUCTION

Balancing systems are among the most popular and challenging test platforms for control engineers. Such systems include the traditional cart-pole system (inverted pendulum), the ball and beam (BnB) system, multiple inverted pendulums, the ball and plate (BnP) system, etc. These systems are promising test-benches for investigating the performance of both model-free and model-based controllers. For the more complicated ones (such as multiple inverted pendulums or the BnP), even if one bothers to model them mathematically, the resulting model is likely to be too complicated to be used in a model-based design. One would much prefer to use an implemented version of such a system (if available and not risky) and observe its behavior while the proposed controller is applied to it. This paper is devoted to a project whose main goal is to control a ball over a flat rotary surface (the plate), mimicking a human's behavior when controlling the same plant, i.e. the BnP system. The proposed controller should neither depend on any physical characteristics of the BnP system nor be supervised by an expert. It should learn an optimal behavior from its experiences interacting with the BnP system and improve its action-generation strategy; however, some prior knowledge about the overall behavior of the BnP system may also be available and be used in order to reduce the time needed for reaching the goal.

The few published papers on the BnP system are mainly devoted to achieving the defined goals regarding the BnP itself rather than to how those goals are achieved [3, 4, 5, 8, 9]. They can be categorized into two main groups: those based on mathematical modeling (with or without hardware implementation), and those proposing model-free controllers. Since the usual simplification in the mathematical modeling of the BnP system yields two separate BnB systems, the first category is goal-oriented and of no interest for the current project [4, 5]. On the other hand, the hardware apparatus used in the second category is the CE151 [6] (or, in some rare cases, another apparatus [9]), all of which use image feedback for ball position sensing. Among these, [5] is devoted to a mechatronic design of the BnP system controlled by a classic model-based controller that benefits from touch-sensor feedback, while [3, 4] used the CE151 apparatus. Note that the image feedback is a time bottleneck, which will be discussed in Section III. In [3], a fuzzy logic controller (FLC) is designed which learns online from a conventional controller. Although the work in [4] is done through mathematical modeling and is applied to the CE151 apparatus, it is of more interest because it tackles the problem of trajectory planning (to be stated in Section III). Reports [8] and [9] are focused on motion planning and control, though they are less interesting for us.

To achieve the desired goal, a different approach is demonstrated in this paper. This approach is based on a fuzzy logic controller which learns on the basis of reinforcement learning. Additionally, a modified version of the BnP apparatus is implemented in this project as a test platform for the proposed controller.

In this paper, the fundamental concepts of RL are embodied into fuzzy logic control methodologies. This leads to a new controller, namely the Reinforcement Learning Fuzzy Controller (RLFC), which is capable of learning from its own experiences. Inherited from fuzzy control methodologies, RLFC is a model-free controller and, of course, previous knowledge about the system can be included in RLFC so as to decrease the learning time. However, as will be seen, learning in RLFC is not a phase separate from its normal functioning.

This paper is divided into six sections. After this introduction, section II explains RLFC completely, both conceptually and mathematically. In section III, the BnP system is introduced and the hardware specification of the implemented version of this system is also outlined.

Manuscript received February 5, 2010.
Nima Mohajerin is with the School of Science and Technology of Örebro University, Örebro, Sweden (e-mail: nima.mohajerinh091@student.oru.se).
Mohammad B. Menhaj is with the Electrical Engineering Department of Amirkabir University of Technology (e-mail: tmenhaj@ieee.org).
Ali Doustmohammadi is with the Electrical Engineering Department of Amirkabir University of Technology (e-mail: doustm@aut.ac.ir).


In section IV, the modifications needed to make RLFC applicable to the BnP system are fully discussed. Section V is dedicated to illustrating and analyzing the results of RLFC performance on the implemented BnP system; in that section, RLFC performance is also compared with that of a human controlling the same plant. Finally, section VI concludes the paper.

II. CONTROLLER DESIGN

This section explains the idea and the mathematics of the proposed controller, i.e. RLFC. First, the behavior of RLFC is outlined conceptually, and then the mathematics are detailed.

A. Outline

According to Fig. 1, RLFC consists of two main blocks, a controller (FLC) and a critic. The task of the controller is to generate and apply actions in each given situation as well as to improve its state-action mapping, while the critic has to evaluate the current state of the plant. Neither the controller nor the critic necessarily knows anything about how the plant responds to actions, how the actions will influence its states, or what the best action is in any given situation. There is also no need for the critic to know how the actions in the controller are generated. The most important responsibility of the critic is to generate an evaluating signal (the reinforcement signal) which best represents the current state of the plant. The reinforcement signal is then fed to the controller.

Fig. 1. A typical application of the RLFC.

Once the controller receives a reinforcement signal, it should determine whether its generated action was indeed a good action or a bad one, and in both cases how good or how bad it was. These terms are embodied into a measure named the reinforcement measure, which is a function of the reinforcement signal. According to this measure, the controller then attempts to update its knowledge of generating actions, i.e. it improves its mapping from states to actions. The process of generating actions is separate from the process of updating the knowledge, and thus they can be regarded as two parallel tasks. This implies that while the learning procedure is a discrete-time process, the controlling procedure can be continuous-time. However, without loss of generality, it is assumed that the actions are generated in a discrete-time manner and that each action is generated after the reinforcement signal - describing the consequences of the previously generated action - has been reported and the parameters have been updated. The dashed line in Fig. 1 indicates the controller's awareness of its own generated actions. Although the controller is nominally aware of its generated actions, in a hardware implementation inaccuracies in the mechanical structure and electronic devices, together with other unknown disturbances, may impose distortions on the generated actions.

B. The Controller

The aforementioned concept is general enough to be applicable to any fuzzy controller scheme; however, we assume that the fuzzy IF-THEN rules and the controller structure are of Mamdani type [7]. Imagine that the input to the fuzzy inference system (FIS) is an n-element vector, each element being a fuzzy number produced by the fuzzification block [11]:

$\mathbf{x} = [x_1\ x_2\ \cdots\ x_n]^T$  (1)

Assume that on the universe of discourse of input variable $x_i$, i.e. $U_i$, a number $n_i$ of term sets are defined. The lth fuzzy rule is*: (for now, disregard the consequence part)

IF $x_1$ is $A_1^{l_1}$ AND $x_2$ is $A_2^{l_2}$ AND ... AND $x_n$ is $A_n^{l_n}$ THEN $y$ is $B^l$  (2)

where $1 \le l_i \le n_i$ and $A_i^{l_i}$ is the $l_i$th fuzzy set defined on the universe of discourse of $x_i$, $(1 \le i \le n)$. The output variable is $y$, and there are $M$ fuzzy sets defined on the universe of discourse of $y$ ($M \in \mathbb{N}$); $V$ is the universe of discourse of $y$. Note that (2) is a fuzzy relation defined on $U \times V$ [11], where $U = U_1 \times U_2 \times \cdots \times U_n$. Note also that in the real world all of the above variables express physical quantities. Therefore, all the corresponding universes of discourse are bounded and can be covered by a limited number of fuzzy sets.

* Please note that superscripts are not powers, unless mentioned to be so.

For hardware implementation, what matters first is the processing speed of the controller. In other words, we have to establish a fuzzy controller architecture that offers optimum performance versus complexity. For this reason, we propose the following elements for the FLC structure: singleton fuzzifier, product inference engine and center average defuzzifier [11]. In this case, given an input vector $\mathbf{x}^*$, the output of the controller is given by:

$y^* = f(\mathbf{x}^*) = \dfrac{\sum_{l=1}^{L} \bar{y}^{\,l} \left( \prod_{i=1}^{n} A_i^{l_i}(x_i^*) \right)}{\sum_{l=1}^{L} \left( \prod_{i=1}^{n} A_i^{l_i}(x_i^*) \right)}$  (3)

where $\bar{y}^{\,l}$ is the center of the normal fuzzy set $B^l$. However, as mentioned earlier, other FLC structures may be considered. Clearly, only those rules with a non-zero premise (IF-part), i.e. the fired rules, participate in generating $y$. This fact does not depend on the FLC structure.

At the design stage, the controller does not know which states it will observe; in other words, the designer can hardly know which rules are useful to embody in the fuzzy rule base. Thus, all rules with premises made from all possible combinations of input variables, joined by the AND operator, are included in the fuzzy rule base. The number of these rules is:

$L = \prod_{i=1}^{n} n_i$  (4)
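To make the computation in (3) concrete, the following sketch evaluates the output of an FLC with singleton fuzzifier, product inference engine and center average defuzzifier. It is only an illustration under assumptions not fixed by the paper: the triangular membership functions and the (premise, center) data structure are hypothetical choices, not the authors' implementation.

def tri_mf(x, a, b, c):
    # Triangular membership function with support (a, c) and peak at b.
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def flc_output(x_star, rules):
    # Center-average defuzzified output of a singleton-fuzzifier,
    # product-inference FLC, as in (3).
    #   x_star : list of crisp input values (singleton fuzzifier).
    #   rules  : list of (premise, y_bar) pairs; premise is a list of (a, b, c)
    #            triangles, one per input, and y_bar is the center of B^l.
    num = den = 0.0
    for premise, y_bar in rules:
        alpha = 1.0                        # firing strength alpha(l) of this rule
        for xi, (a, b, c) in zip(x_star, premise):
            alpha *= tri_mf(xi, a, b, c)   # product inference engine
        num += y_bar * alpha
        den += alpha
    return num / den if den > 0.0 else 0.0

Rules whose firing strength is zero contribute nothing to either sum, which is why only the fired rules need to be evaluated in practice.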

Obviously, L grows exponentially with respect to the number of variables and defined term sets. Consequently, the processing time drastically increases. To deal with this, we assume that for any given value of each variable there are at most two term sets with non-zero membership values. This condition - which will be referred to as the covering condition - is necessary, and if it holds then the number of fuzzy rules which contribute to generating the actual output, i.e. the fired rules, is at most $2^n$. Notably, to reduce the time needed for discovering the fired rules, we implement a set of conditional statements rather than an exhaustive search among all of the rules.

C. The Decision Making

As previously mentioned, the controller is discrete-time. So, in each iteration, as the controller observes the plant state, it identifies the fired rules. From this point on, the L-rule FLC shrinks to a $2^n$-rule FLC, where $2^n \ll L$. The key point in generating the output, i.e. decision making, is how the consequences (THEN parts) of these fired rules are selected. Referring to (2), the term set assigned to the consequence of the lth rule is $B^l$. It was also mentioned that there are M term sets defined on the universe of discourse of the output variable. Assume that these term sets are denoted by $W^i$, where $i = 1, 2, \ldots, M$. Having fired the lth rule in the kth iteration, the probability of choosing $W^i$ for its consequence is:

$P_k^l(j = i)$  (5)

where j is a random variable with an unknown discrete distribution over the indices. The aim of the reinforcement learning algorithm, discussed in the next sub-section, is to learn this probability for each rule such that the overall performance of the controller is (nearly) optimized.

Our objective in this section is to define $P_k^l$ so that it is well suited both for applying the reward/punishment procedure and for software implementation. To fulfil these objectives, a bounded sub-space of $\Re$ is chosen for each rule. Note that $\Re$ represents the set of real numbers. Factors governing how this one-dimensional sub-space should be chosen are discussed in section IV. Let the sub-space related to the lth rule be:

$\Omega^l = \left[a_0^l,\ a_M^l\right)$  (6)

This distance is divided into M sub-distances, as shown by (8-a), each of which is assigned to an index i, where $i = 1, 2, \ldots, M$. We have:

$\Omega^l = \bigcup_{r=1}^{M} \omega_r^l$  (7)

a) $\omega_r^l = \left[a_{r-1}^l,\ a_r^l\right)$, $r = 1, \ldots, M$
b) $a_0 \le a_s \le a_{s+1} \le a_M$, $s = 1, \ldots, M-2$
c) $\forall\, l \in \{1, 2, \ldots, L\}$ and $p, q \in \{1, 2, \ldots, M\}$ with $p \ne q$: $\omega_p^l \cap \omega_q^l = \emptyset$  (8)

We calculate the probability $P^l$ by (9):

$P^l(j = i) = \dfrac{\omega_i^l}{\Omega^l}$  (9)

where $\omega_r^l$ here represents the numeric length of the sub-distance $\omega_r^l$ and is calculated by (10):

$\omega_r^l = a_r^l - a_{r-1}^l$  (10)

Up to now, the iteration counter k has been omitted. However, since the reinforcement learning procedure is carried out by tuning the above parameters, they are all functions of k. Thus, (9) turns into:

$P_k^l(j = i) = \dfrac{\omega_i^l(k)}{\Omega^l(k)}$  (11)

or:

$P_k^l(j = i) = \dfrac{a_i^l(k) - a_{i-1}^l(k)}{a_M^l(k) - a_0^l(k)}$  (12)

By observing (7), (8) and (11) it is obvious that $P_k^l(j = i)$ is a probability function and satisfies the necessary axioms.
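As a concrete illustration of (6)-(12), the sketch below stores the boundaries $a_0^l, \ldots, a_M^l$ for one rule and draws a consequence index with probability proportional to the length of each sub-distance. It is a minimal sketch, not the authors' code; the uniform initialization of the boundaries is an assumption.

import random

class ConsequenceDistribution:
    # Sub-distance partition Omega^l = [a_0, a_M) for one rule, eqs. (6)-(12).

    def __init__(self, M, a0=0.0, aM=1.0):
        # Uniform initialization (an assumption): all M term sets equally likely.
        step = (aM - a0) / M
        self.a = [a0 + r * step for r in range(M + 1)]   # boundaries a_0 ... a_M

    def prob(self, i):
        # P(j = i) as in (12), with i = 1, ..., M.
        return (self.a[i] - self.a[i - 1]) / (self.a[-1] - self.a[0])

    def sample(self):
        # Draw a random number in Omega^l and return the index of the
        # sub-distance it falls into, i.e. the chosen term set W^i.
        u = random.uniform(self.a[0], self.a[-1])
        for i in range(1, len(self.a)):
            if u < self.a[i]:
                return i
        return len(self.a) - 1   # guard for the (measure-zero) right endpoint

For example, d = ConsequenceDistribution(M=51); i = d.sample() picks an output term set for one fired rule; the sub-distance lengths, and hence these probabilities, are what the learning algorithm of the next sub-section tunes.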

D. Reinforcement Learning Algorithm

In this sub-section the proposed algorithm for tuning the parameters defined above is presented. This algorithm is based on reinforcement learning methods and satisfies the six axioms mentioned by Barto in [12]. Let $r(k)$ be the reinforcement signal generated by the critic module in the kth iteration. Note that it represents the effect on the plant of the action previously generated by the FLC, i.e. $y(k-1)$, and that before generating the kth action the parameters of the related fired rules should be updated. In other words, this scalar can represent the change in the current state of the plant.

To be more specific, assume that a smaller reinforcement signal represents a better state of the plant. The change in the current state of the plant is then given by (13): an improvement in the plant situation is indicated by $\Delta r(k) > 0$, while $\Delta r(k) < 0$ indicates that the plant situation has worsened.

$\Delta r(k) = r(k-1) - r(k)$  (13)

However, since (13) is based solely on the output of the critic, it contains no information about which rules have been fired, which term sets have been chosen for generating $y(k-1)$, etc. Thus, $\Delta r(k)$ is not immediately applicable for updating the parameters. The mentioned reward/punishment updating procedure means that if the generated action resulted in an improvement of the plant state, the probabilities of choosing the same term sets for the consequences of the corresponding fired rules should be increased; if this action caused the plant state to worsen, these probabilities should be decreased.

$\Delta r(k)$ is mathematically manipulated in order to shape the reinforcement measure. This measure is the amount by which the mentioned probabilities, (11), are affected.

As a first step, a simple modification is applied to $\Delta r(k)$. This step may be skipped if $\Delta r(k)$ is already a suitable representation of the change in the system state; a comprehensive example will illustrate this in section IV. This manipulation is done by $f(\cdot)$, where $f: \Re \rightarrow \Re$:

$\Delta r'(k) = f\left(\Delta r(k)\right)$  (14)

To modify the probabilities defined by (11), the corresponding sub-distances, (8), are changed. The amount of change in the sub-distances relating to the lth rule is defined by $\vartheta_i^l(k)$ in (15):

$\vartheta_i^l(k) = g \cdot \varepsilon(l, i) \cdot \alpha(l) \cdot \Delta r'(k)$  (15)

Regarding (15), $g$ is a gain acting as a scaling factor, $\varepsilon(l, i)$ represents the exploration/exploitation policy [1], and $\alpha(l)$ is the firing rate of the lth rule, obtained by substituting $\mathbf{x}$ into the membership function formed from the premise part of the lth rule. Note that this factor expresses the contribution of this rule to generating the output.

There is a variety of exploration/exploitation policies [1, 2, 12]; however, here we propose a simple one:

$\varepsilon(l, i) = \left(1 - \dfrac{\theta}{e^{\,n_i^l(k)}}\right)$  (16)

where $n_i^l(k)$ counts how many times $W^i$ has been chosen for the lth rule and $\theta$ is a scaling factor. Considering a particular rule, as $W^i$ is chosen more often for this rule, $\varepsilon(l, i)$ grows exponentially toward one, letting $\vartheta_i^l(k) \rightarrow g \cdot \alpha(l) \cdot \Delta r'(k)$.

The reinforcement measure introduced in subsection II.A is given by (17):

$\Delta\omega_i^l(k) = \begin{cases} \vartheta_i^l(k), & \Delta r(k) \ge 0 \\ \max\left\{\vartheta_i^l(k),\ -\omega_i^l(k)\right\}, & \Delta r(k) < 0 \end{cases}$  (17)

In (17), the max operator is used in the case $\Delta r(k) < 0$. This is to avoid excessively penalizing those sub-distances that have not been chosen. The reason becomes clearer once (18) is studied. Equation (18) depicts the updating rule:

$a_q^l(k) = \begin{cases} a_q^l(k-1), & q < i \\ \max\left\{a_{q-1}^l(k),\ a_q^l(k-1) + \Delta\omega_i^l(k)\right\}, & q \ge i \end{cases}$  (18)

where $q = 0, 1, \ldots, M$.

Regarding (18), several points should be noted:

1- $i$ is the index of the term set chosen for the consequence part of the lth fired rule, i.e. $B^l = W^i$.
2- $q$ is a counter which starts from $i$ and ends at $M$. Apparently, there is no need to update those parameters which are not modified, so $q$ may start from $i$. This indeed reduces the processing time.
3- There are M parameters for each rule in the rule base, but only the parameters of fired rules are modified. Hence there are at most $M \cdot 2^n$ modifications per iteration.
4- From (18) it should be understood that only the lengths of $\omega_i^l$ and $\Omega^l$ are modified. Although the sub-distances $\omega_q^l$ for $q = i+1, \ldots, M$ are shifted, their lengths remain unchanged.
5- The modification of $\Omega^l$ always allows other sub-distances (and hence other indices) to be chosen. As the system learns more, a dominant sub-distance emerges, but the lengths of un-chosen sub-distances remain non-zero as long as the effect of choosing them has not been observed to worsen the result. This feature is useful in the case of slightly time-varying plants.

Theorem 1. Under (18), if a term set receives reward/punishment, then the probability of choosing that term set is increased/decreased.

Proof.
a) Reward. Assume that $W^i$ has been chosen for the lth rule and the resulting action has an improving effect. Thus, $P_k^l(j = i)$ should increase. Note that in this case $\Delta r(k) > 0$ and, by (13), (14), (15), (16) and (17), it is obvious that $\Delta\omega_i^l(k) > 0$. Using (5) we have:

$\Delta P_k^l(j = i) = P_k^l(j = i) - P_{k-1}^l(j = i) = \dfrac{\omega_i^l(k)}{\Omega^l(k)} - \dfrac{\omega_i^l(k-1)}{\Omega^l(k-1)}$  (19)

In the case of an improvement, $\Delta P_k^l(j = i)$ must be a positive scalar. Using (9), (10), (11) and (12) in (19) we obtain:

$\Delta P_k^l(j = i) = \dfrac{a_i^l(k) - a_{i-1}^l(k)}{a_M^l(k)} - \dfrac{a_i^l(k-1) - a_{i-1}^l(k-1)}{a_M^l(k-1)}$

According to (18), the above equation yields:

$\Delta P_k^l(j = i) = \dfrac{a_i^l + \Delta\omega^l(k) - a_{i-1}^l}{a_M^l + \Delta\omega^l(k)} - \dfrac{a_i^l - a_{i-1}^l}{a_M^l}$

in which the argument $(k-1)$ has been omitted on the right-hand side. It can easily be seen that:

$\Delta P_k^l(j = i) = \dfrac{\Delta\omega_i^l(k)\left[a_M^l - a_i^l + a_{i-1}^l\right]}{a_M^l\left[a_M^l + \Delta\omega^l(k)\right]}$

This equation, with regard to (8), implies that:

$\Delta P_k^l(j = i) > 0$

b) Punishment. In this case $\Delta r(k) < 0$. Hence:

$\Delta\omega_i^l(k) = \max\left\{\vartheta_i^l(k),\ -\omega_i^l(k)\right\}$

Since a negative-length sub-distance is undefined, the max operators used in the above equation and in (17) ensure that the update rule (18) does not yield an undefined sub-distance; they are also used to satisfy (8-b). In this case $\Delta P_k^l(j = i)$ in (19) has to be a negative scalar, and the procedure is the same as in part a. □
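To illustrate (15)-(18), the following sketch computes the reinforcement measure for a fired rule and updates the boundary list of its sub-distance partition; it could serve as a companion to the ConsequenceDistribution sketch given earlier. This is only one interpretation of the update rule under stated assumptions, not the authors' implementation; g, theta, the firing rate alpha(l) and the visit count n_i^l are supplied by the caller, and f in (14) is assumed to preserve the sign of Delta r(k).

import math

def reinforcement_measure(delta_r_prime, omega_il, g, theta, alpha_l, n_il):
    # Eqs. (15)-(17): the change Delta omega_i^l applied to the chosen
    # sub-distance (assumes the shaping function f preserves sign).
    eps = 1.0 - theta / math.exp(n_il)            # exploration/exploitation, (16)
    vartheta = g * eps * alpha_l * delta_r_prime  # raw change, (15)
    if delta_r_prime >= 0.0:
        return vartheta                           # reward branch of (17)
    return max(vartheta, -omega_il)               # punishment branch of (17)

def update_boundaries(a, i, delta_omega):
    # Eq. (18): boundaries a_q^l for q >= i are shifted by delta_omega; the max
    # with the already-updated previous boundary keeps every sub-distance
    # length non-negative, so the partition (8) stays valid.
    for q in range(i, len(a)):
        a[q] = max(a[q - 1], a[q] + delta_omega)
    return a

One learning step for a fired rule would then read: delta = reinforcement_measure(...), followed by update_boundaries(d.a, i, delta), which lengthens or shortens only the chosen sub-distance while shifting the ones above it, exactly as points 4 and 5 above describe.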

III. BALL AND PLATE SYSTEM

The Ball and Plate system, the aforementioned BnP, is an evolution of the traditional Ball and Beam (BnB) system [13, 14]. It consists of a plate whose slope can be manipulated in two perpendicular directions and a ball rolling on it. The behavior of the ball is of interest and can be controlled by tilting the plate. Within this scheme, various structures may be proposed and applied in practice. Usually image feedback is used to locate the position of the ball; however, due to its lower accuracy and slower sampling rate compared with touch-screen sensors (or simply touch sensors), we opt for a touch sensor.

The hardware structure implemented in this project, as outlined in Fig. 2, consists of five blocks. However, the whole system can be viewed as a single block whose input is a two-element vector, i.e. the target angles, (20). The output of this assumed block is a six-element vector which contains the position and velocity of the ball and the current angles of the plate. Table I lists the related parameters.

TABLE I
PARAMETERS OF THE BALL AND PLATE SYSTEM AND THEIR UNITS

  Symbol        Parameter                                      Unit
  (x_d, y_d)    Target position of the ball                    Pixel
  (x, y)        Current position of the ball                   Pixel
  (v_x, v_y)    Current velocity of the ball                   Pixel per second
  (α_x, α_y)    Current angles of the plate                    Angle step
  α_x           Plate angle with regard to the x axis          Angle step
  α_y           Plate angle with regard to the y axis          Angle step
  (u_x, u_y)    Control signal                                 Angle step
  u_x           Plate target angle with regard to the x axis   Angle step
  u_y           Plate target angle with regard to the y axis   Angle step

$\mathbf{u} = \left[\alpha_x\ \alpha_y\right]^T$  (20)

In this paper we are interested in the following problem:

Simple command of the ball. The objective is to place the ball at any desired location on the plate surface, starting from an arbitrary initial position.

Before explaining the control system, a summary of the hardware specifications of the BnP implemented for this project is given below. A complete, or even brief, description of how this apparatus was made is beyond the scope of this paper. A summary is needed, however, to show that the plant used for RLFC is roughly made and contains so many inaccuracies that a classic controller would definitely be unable to control it. Referring back to Fig. 2, the function of each block is summarized next.

Fig. 2. Hardware structure of the implemented BnP system. The dashed square separates the electronics section from the mechanical parts.

The Actuating Block
• The actuating block consists of high-accuracy stepping motors equipped with precise incremental encoders (3600 ppr*) coupled to their shafts, plus accurate micro-stepping-enhanced drivers.
• The original step size of the steppers is 0.9 degrees and is reducible by the drivers down to 1/200 of a step, i.e. $4.5 \times 10^{-3}$ degrees per step.
• Taking into account the mechanical limitations, the smallest measurable and applicable amount of rotation is 0.1 degrees.

The Sensor Block
• The sensor is a 15-inch touch-screen sensor.
• The sensor output is a message packet sent through RS-232 serial communication at 19200 bps†.
• Thus the fastest sampling rate of the whole sensor block is 384 samples per second. This implies that the maximum available time for decision making is $1/384 \cong 2.604 \times 10^{-3}$ seconds.
• The area of the surface of the touch sensor on which pressure can be sensed is 30.41 × 22.81 cm.
• The sensor resolution is 1900 × 1900 pixels. If the sensor sensitivity is uniformly distributed over its sensitive area, then each pixel is assigned to an area of approximately 0.16 × 0.12 mm² of the surface of the sensor.

The Interface Block
The third and main section of the BnP system is its electronic interface. This interface receives commands from an external device in which the controller is implemented and then takes the necessary corresponding actions. Each decision made by the controller algorithm is translated and formed into a message packet, which is then sent to the interface via a typical serial link (RS-232) or another communication platform. The interface then sends the necessary signals to the actuators. In addition to some low-level signal manipulation (such as low-pass filtering of the sensor readings and noise cancelling), upon request from the main controller the interface sends current information, such as the ball position and velocity or the position of the actuators, to the main controller.

* ppr: pulses per rotation
† bps: bits per second; a measuring unit for serial communication.
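The timing and resolution figures quoted above follow directly from the stated specifications; the short check below reproduces them. The 50-bit packet length is inferred from 19200 bps and 384 samples per second and is an assumption, not a figure given in the text.

# Worked check of the hardware figures quoted above (assumptions noted inline).
BAUD_RATE = 19200                 # bps, RS-232 link of the touch sensor
SAMPLES_PER_SECOND = 384          # stated fastest sampling rate
bits_per_packet = BAUD_RATE / SAMPLES_PER_SECOND     # = 50 bits (inferred, not stated)
decision_budget = 1.0 / SAMPLES_PER_SECOND           # ~2.604e-3 s per decision

FULL_STEP_DEG = 0.9               # original stepper step size
MICROSTEP_DIVISOR = 200           # driver micro-stepping factor
microstep_deg = FULL_STEP_DEG / MICROSTEP_DIVISOR    # = 4.5e-3 degrees per step

SENSOR_MM = (304.1, 228.1)        # active area of the touch sensor, mm
RESOLUTION = 1900                 # pixels along each axis
pixel_mm = (SENSOR_MM[0] / RESOLUTION, SENSOR_MM[1] / RESOLUTION)  # ~0.16 x 0.12 mm

print(bits_per_packet, decision_budget, microstep_deg, pixel_mm)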

IV. MODIFICATION OF RLFC FOR BALL AND PLATE

In section II, RLFC was discussed in full. In this section, the modifications necessary to make it applicable to controlling the implemented BnP system, i.e. to solving the problem stated in the previous section, are presented.

A. Primary Design Stage

Fig. 1 depicts the control architecture as well as the signal flows. With regard to Table I, the illustrated signals are explained next:

Control signal vector:
$\mathbf{u} = \left[u_x\ u_y\right]^T$  (21)

Plant state vector:
$\mathbf{s} = \left[x\ y\ u_x\ u_y\right]^T$  (22)

Plant target state vector:
$\mathbf{s}_d = \left[x_d\ y_d\ 0\ 0\right]^T$  (23)

Error vector:
$\mathbf{e} = \left[e_x\ e_y\ e_{v_x}\ e_{v_y}\right]^T$  (24)

where $e_x = x - x_d$, $e_y = y - y_d$, and, since we want to bring the ball to rest, $e_{v_x} = v_x$ and $e_{v_y} = v_y$.

In Fig. 1 it can be seen that there are six input variables to the controller section of RLFC. Let us arrange them in the vector $\mathbf{x}$ as written in (25):

$\mathbf{x} = \left[e_x\ e_y\ v_x\ v_y\ \alpha_x\ \alpha_y\right]^T$  (25)

Fig. 3. The defined term sets on the universes of discourse of (a) $e_x$ and $e_y$, (b) $v_x$ and $v_y$, (c) $\alpha_x$ and $\alpha_y$. For the respective units refer to Table I.

On the universe of discourse of each input variable, a specific number of term sets is defined. Let these numbers be $n_x$, $n_y$, $n_{v_x}$, $n_{v_y}$, $n_{\alpha_x}$ and $n_{\alpha_y}$. According to (4), the fuzzy rule base contains L rules, where:

$L = n_x \cdot n_y \cdot n_{v_x} \cdot n_{v_y} \cdot n_{\alpha_x} \cdot n_{\alpha_y}$

For our specific implementation, the following quantities are assigned to these variables:

$\left(n_x = n_y = 7,\ n_{v_x} = n_{v_y} = 5,\ n_{\alpha_x} = n_{\alpha_y} = 2\right) \Rightarrow L = 4900$

In Fig. 3, the corresponding defined term sets are presented graphically. The number, shape and distribution of the defined term sets are chosen based on logical sense and practical experience; nothing prevents the system from performing well if these selections are changed.

Assuming that the covering condition is satisfied, as is the case in Fig. 3, in each iteration there are at most $2^6 = 64$ fired rules. However, if an exhaustive search were done among the 4900 rules to discover the fired rules, i.e. computing each premise membership value and checking whether it is zero or not, the processing time would grow beyond tolerable bounds. Instead, since we know the exact location of each term set, we locate each measured input value on its corresponding universe of discourse and thereby discover the term sets with non-zero membership values. Doing this for all six input variables, at most two term sets are discovered for each of them. Hence the combinations of these term sets, joined by the AND operator, directly indicate which rules are fired (see the sketch at the end of this subsection). The locating procedure can be coded in a programming language using a set of conditional statements: if there are n term sets defined on a particular universe of discourse, then (n+1) conditional statements are needed in order to discover the term sets that fire. Hence, instead of L arithmetic calculations, we are faced with S logical comparisons, where:

$S = \sum_{i=1}^{n} (n_i + 1)$  (26)

For our case study, S = 32.

There is no special constraint on the distribution of term sets on the universes of discourse of the output variables. There are M = 51 triangular term sets, uniformly distributed over each universe of discourse. Fig. 4 illustrates the location of these term sets.

Fig. 4. The defined term sets on the universes of discourse of the output variables.
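The following sketch illustrates the locating procedure described above for one input variable: given the peak positions of the term sets on its universe of discourse, a chain of comparisons returns the at most two term sets with non-zero membership. It is an illustration only; the assumption of triangular term sets that overlap only between neighbouring peaks, and the peak values themselves, are not taken from the paper's figures.

def active_term_sets(value, peaks):
    # Return (index, membership) pairs for the at most two term sets fired by a
    # crisp input, assuming triangular term sets whose supports overlap only
    # between neighbouring peaks (the covering condition).
    #   peaks : sorted list of the peak positions of the n term sets defined on
    #           this universe of discourse (hypothetical values).
    # At most (n + 1) comparisons per input, cf. (26), instead of scanning all rules.
    if value <= peaks[0]:
        return [(0, 1.0)]                        # saturate at the left-most set
    for i in range(1, len(peaks)):
        if value <= peaks[i]:
            w = (value - peaks[i - 1]) / (peaks[i] - peaks[i - 1])
            return [(i - 1, 1.0 - w), (i, w)]    # two neighbouring sets fire
    return [(len(peaks) - 1, 1.0)]               # saturate at the right-most set

The fired rules are then obtained by combining, with the AND operator, one active term set per input variable; with six inputs this yields at most 2^6 = 64 premises whose firing strengths need to be evaluated.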

B. Critic: Generation of the Reinforcement Signal

Until now, the reinforcement signal has only been defined as an evaluation of the system's current situation generated by the critic block. No more precise definition could have been given so far, since the nature of this signal directly depends on the nature of the plant to be controlled. Referring back to Fig. 1, the input to the critic is the error signal, (24). According to this figure, the reinforcement signal is a function of this vector:

$r = g(\mathbf{e}) = g(e_x, e_y, v_x, v_y)$  (27)

The critic is thus defined by $g(\cdot)$ and is the designer's choice. The only necessary condition is that this function should represent the current state of the plant as well as possible. A proposed general form of this function is given in (28):

$r = c_x\left(e_x^2 + c_v v_x^2\right) + c_y\left(e_y^2 + c_v v_y^2\right)$  (28)

Equation (28) can be regarded as a revised version of LMS. Note that the aim of RLFC is to minimize (28). Three coefficients are defined in (28), explained next. $c_v$ is the balancing coefficient between the velocity of the ball and its position error: as $c_v$ increases, the controller gives more credit to stabilizing the ball than to guiding it to the desired location. $c_x$ and $c_y$ represent the mutual interaction between the two actuators. This interaction comes from the inevitable imprecision of the mechanical structure of the BnP; because of it, the motion of the ball in each direction is not a function of the corresponding plate angle alone. The exact values of $c_x$ and $c_y$ are, however, part of the mechanical specification. They can be chosen and then tuned experimentally, or learnt.

C. Reinforcement Measure

Having proposed the reinforcement signal, we seek a suitable function to produce $\Delta\omega_i^l(k)$ according to (17). Equation (29) is a proposed form of (14):

$\Delta r'(k) = f\left(\Delta r(k)\right) = \beta\,\dfrac{\Delta r(k)}{r_{\max}} + \lambda\,\dfrac{\operatorname{sgn}\left(\Delta r(k)\right)}{r(k)}$  (29)

This function consists of two terms: the first scales the pure reinforcement signal received from the critic, and the second tunes the learning sensitivity when the plant is around the target state; actions receive more reward/punishment for their effect on the plant state when it is near the desired target state. Note that $r_{\max}$ can easily be calculated using (28).

According to (29) and (15), the reinforcement measure to be substituted into (17) is given by (30):

$\Delta\vartheta_i^l = g \cdot \left(1 - \dfrac{\theta}{e^{\,n_i^l(k)}}\right) \cdot \left(\beta\,\dfrac{\Delta r(k)}{r_{\max}} + \lambda\,\dfrac{\operatorname{sgn}(\Delta r)}{r(k)}\right)$  (30)

D. Adding A Priori Knowledge

From a very general point of view, the proposed algorithm is a search in the space of possible actions. However, it is possible to add a priori knowledge in order to increase the learning speed. To describe this, it helps to explain how the random selection of output term sets takes place. Digital processors can produce uniformly distributed random numbers, and this is also used in RLFC. First, a random number is generated by the processor, and then it is checked to see in which sub-distance (note equation (8)) it falls. The index of that sub-distance is then the index of the chosen term set. Let the randomly generated number for the lth rule be $j^l$. Equation (31) must hold:

$j^l \in \Omega^l$  (31)

With this mechanism, adding a priori knowledge in terms of modifying the bounds on $j^l$ is a simple procedure. For a typical example regarding our implemented RLFC, consider the following rule form:

IF {$e_x$ is $A_1^1$ AND $e_y$ is whatever AND $v_x$ is $A_3^2$ AND $v_y$ is whatever AND $\alpha_x$ is whatever AND $\alpha_y$ is whatever} THEN $u_x$ is $B^{j^l}$

This applies to the set of rules in which the first and third conditions are fixed as stated. Referring back to the term sets depicted in Fig. 3 and Fig. 4, the sensible choice for this condition is clearly a large deviation of $\alpha_x$ in the positive direction with respect to the Cartesian coordinate system. Thus:

$j^l \in [35, 50]$  (32)

This implies that the unreasonable choices are excluded for rules of this form.
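As a summary of this section, the sketch below wires together the BnP-specific pieces: the critic of (28), the shaping function of (29), and a restriction of the random index $j^l$ used to inject a priori knowledge, as in (31)-(32). It is a minimal sketch under stated assumptions: the coefficient values are placeholders rather than the values used in the paper, and the small guard against division by zero at the exact target is our addition, not part of (29).

import random

# Placeholder coefficients; the paper does not report the values actually used.
C_X, C_Y, C_V = 1.0, 1.0, 0.5
BETA, LAMBDA, R_MAX = 1.0, 0.1, 1.0e6

def critic(e_x, e_y, v_x, v_y):
    # Reinforcement signal r = g(e), eq. (28).
    return C_X * (e_x**2 + C_V * v_x**2) + C_Y * (e_y**2 + C_V * v_y**2)

def shaped_delta_r(r_prev, r_curr):
    # Delta r of (13) passed through the shaping function f of (29).
    delta_r = r_prev - r_curr
    sign = (delta_r > 0) - (delta_r < 0)                 # sgn(Delta r)
    return BETA * delta_r / R_MAX + LAMBDA * sign / max(r_curr, 1e-9)

def sample_consequence(a, lo=None, hi=None):
    # Pick the output term-set index for one rule from its boundary list
    # a = [a_0, ..., a_M], eqs. (6)-(12) and (31); the optional (lo, hi) band
    # restricts the admissible indices to encode a priori knowledge, as in (32).
    lo = 1 if lo is None else lo
    hi = len(a) - 1 if hi is None else hi
    j = random.uniform(a[lo - 1], a[hi])                 # draw only inside the band
    for i in range(lo, hi + 1):
        if j < a[i]:
            return i
    return hi

For the example rule above, sample_consequence(d.a, lo=35, hi=50) mirrors (32): only the sensible output term sets remain eligible, while the learning algorithm still adapts their relative probabilities.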

V. PERFORMANCE RESULTS OF RLFC ON BALL AND PLATE

After the modified RLFC was implemented, it was experimentally applied to the implemented BnP system. The graphs in this section are the results of a series of experiments. In all the experiments, the position of the ball versus time is collected using the monitoring section of the implemented BnP. These raw data are then processed by a number of software enhancement procedures to avoid a huge number of confusing graphs. After omitting time, the x position versus the y position is obtained, and the points corresponding to a series of iterations are drawn on a single graph. The units of x and y are pixels, and the origin of the coordinate system is the one seen by the touch sensor. Each figure shows the touch-sensitive area of the plate and thus the 1900 × 1900 pixels of the touch sensor. The location of the ball is illustrated approximately around the mean of the acquired points. Dark areas in the figures indicate the presence of the ball over the corresponding areas of the plate: the darker a region of the figure, the longer the ball stayed in the corresponding area over all observed iterations. The target (desired) location of the ball in all of the illustrated experiments is the centre of the plate.

In Fig. 5, the improvement in the behavior of the ball under the control of RLFC is shown. It is observed that after approximately 70000 iterations an acceptable performance is obtained. Note that the time needed per iteration normally varies; in our experiments, 70000 iterations took around 20 minutes. Since the performance of the system is satisfactory around the 70000th iteration, at this stage RLFC is regarded as a trained system. However, since the learning procedure is not switched off afterwards, the comments under the following figures do not use the term "trained"; instead, the 70000th iteration is mentioned as a reference point for a well-trained system. The control signals relating to the best performance illustrated in Fig. 5 are shown in Fig. 6.

Fig. 5. Improvement in the behavior of the ball under RLFC control. The top-left figure corresponds to the first 15000 iterations, where initially a priori knowledge is embedded. The top-right and bottom-left figures show the improvement in performance. After approximately 70000 iterations, the bottom-right performance is regarded as acceptable.

Fig. 6. Control inputs corresponding to the best performance in Fig. 5.

In all the aforementioned experiments, the ball is initially located in the same place. To show that this is not a necessary condition, in Fig. 7 another starting point is chosen. It is seen that, since RLFC had not experienced these new states enough before, at first it could not perform well. However, as the ball reaches a previously well-experienced state (shown by an arrow), its behavior comes under control. The number of iterations in this figure is about 3500.

Fig. 7. Different starting point.

In order to compare the performance of RLFC with that of a human, 10 individuals (all healthy, normal adults with no apparent nervous or muscular disorder) were selected and asked to control the implemented BnP system. Each individual was allowed to try the system 10 times. Note that in all of these experiments the steppers were released and the individuals controlled the plate with their own hands. This indeed removes the most complicated nonlinearity and imprecision of the system: the actuators. Fig. 8 illustrates the best performance, which was the 7th try of the individual who possessed the best control over the BnP system.

Fig. 8. Human best performance.

VI. CONCLUSION

The main idea of this work was to propose a human-like controller capable of learning from its own past experiences as well as embedding some prior knowledge and reasonable facts. Although still far from exact human-like behavior, applying this controller to a very complex and uncertain plant resulted in satisfactory performance, especially when compared with a good human operator trying to control the same plant. There is a variety of possible extensions and modifications to the proposed method, from the form of the fuzzy IF-THEN rules to the method of tuning the various defined parameters.

REFERENCES
[1] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, MIT Press/Bradford Books, Cambridge, MA, 1998.
[2] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement Learning: A Survey," Journal of Artificial Intelligence Research, vol. 4, pp. 237-285, May 1996.
[3] A. B. Rad, P. T. Chan, W. L. Lo, and C. K. Mok, "An Online Learning Fuzzy Controller," IEEE Trans. Industrial Electronics, vol. 50, no. 5, pp. 1016-1021, October 2003.
[4] X. Fan, N. Zhang, and S. Teng, "Trajectory planning and tracking of ball and plate system using hierarchical fuzzy control scheme," Fuzzy Sets and Systems, vol. 144, pp. 297-312, 2003.
[5] S. Awtar and K. C. Craig, "Mechatronic Design of a Ball on Plate Balancing System," Proc. 7th Mechatronics Forum International Conference, Atlanta, GA, 2000.
[6] Humusoft User's Manual, "CE 151 Ball & Plate Apparatus," Humusoft.
[7] E. H. Mamdani, "Application of fuzzy algorithms for control of simple dynamic plant," Proceedings of the IEE, vol. 121, pp. 1585-1588, 1974.
[8] M. Bai, H. Lu, J. Su, and Y. Tian, "Motion Control of Ball and Plate System Using Supervisory Fuzzy Controller," Proc. WCICA 2006, vol. 2, pp. 8127-8131, June 2006.
[9] H. Wang, Y. Tian, Z. Sui, X. Zhang, and C. Ding, "Tracking Control of Ball and Plate System with a Double Feedback Loop Structure," Proc. ICMA 2007, August 2007.
[10] R. Bellman, Adaptive Control Processes: A Guided Tour, Princeton University Press, 1961.
[11] L. X. Wang, A Course in Fuzzy Systems and Control, Prentice-Hall International Inc., 1997.
[12] A. G. Barto, "Reinforcement Learning," in The Handbook of Brain Theory and Neural Networks, M. A. Arbib (ed.), The MIT Press, Cambridge, MA, USA, pp. 804-809.
[13] H. K. Lam, F. H. F. Leung, and P. K. S. Tam, "Design of a fuzzy controller for stabilizing a ball and beam system," Proc. IECON 1999, vol. 2, pp. 520-524.
[14] E. Laukanen and S. Yurkovich, "A Ball and Beam testbed for fuzzy identification and control design," Proc. American Control Conference, June 1993, pp. 665-669.
