2014 20th IEEE International Symposium on Asynchronous Circuits and Systems

GALS partitioning by behavioural decoupling expressed in Petri nets
Danil Sokolov, Alex Yakovlev
School of Electrical & Electronic Engineering, Newcastle University, UK
{danil.sokolov, alex.yakovlev}@ncl.ac.uk
Abstract: Efficient design of a complex heterogeneous system
requires detailed knowledge about the periodicity properties of its
components and an understanding of the interaction patterns in their
data exchange. Some of this information is usually available at
design time and facilitates basic optimisation through the
insertion of sufficiently deep buffers between the communicating
sub-systems. However, most of the information, e.g. the number
of data items to transfer or the relative difference in the
iteration count of system components between communications,
is data-dependent and can only be analysed dynamically at
run time. These data-dependent properties of the system can
still be efficiently accounted for in globally asynchronous locally
synchronous (GALS) design, where the system is partitioned
into independently clocked sub-systems that interact with each
other in the asynchronous style. In this paper we introduce an
approach to GALS partitioning based on the analysis of Petri
net models of system components and the complexity analysis of
their underlying algorithms.

and heterogenation of hardware platforms is clearly far from
being fully utilised.
Historically GALS methodology [2] is considered as a solution to the increasing heterogeneity of complex systems [3],
[4]. The classical GALS method with pausible clocking
can exploit the advantages of asynchronous design and at
the same time maximally reuse the products of standard
synchronous EDA flow. This is achieved through embedding
the synchronous components into specially designed wrappers
whose local clock generators can be paused to transmit the data
through the self-timed interfaces. The independently clocked
components are subsequently interconnected asynchronously,
thus avoiding the synchronisation problems associated with the
integration process. There are many flavours of GALS design
with pausible clocking varying in the level of granularity of
synchronous components, the communication protocols for
data exchange between them, the design of the wrapper interfaces and local clock generators [5]. Note that the spectrum
of GALS design can be extended to asynchronous and loosely
synchronous styles [6], however the focus of this paper is on
the pausible clock GALS systems.
To date there is still no consensus on what is the best way
to utilise the potential advantages of GALS architecture in
the context of modern heterogeneous systems. There are even
contradictory opinions on the power consumption benefits and
performance penalties of GALS systems [7], [8], [9]. This is
partially due to the inefficiency of existing GALS methods
which are all of an assemble-and-validate paradigm, even the
recent ones [10], [11], rather than being a process of synthesis
with optimisation. Only rudimentary design automation has
been proposed [12] where the top-level hierarchy determines
the boundaries for synchronicity islands and a static analysis
of communication between these islands defines the choice of
fairly limited communication mechanisms.
The gap between what is synchronous and what is asynchronous, both in terms of models and implementation, is far
too large. Thus the main problem here is in automating the
integration of asynchrony and synchrony, which requires building non-trivial links and ports, clock start-stop mechanisms,
insertion of synchronisers etc. This became the main subject of
GALS-related research with a focus on efficient data exchange
mechanisms [13], communication protocols [14] and optimisation of GALS wrappers to mitigate the latency overheads [15],
[16]. Many of these elements are still designed ad hoc and
require laborious validation by many hours of simulation. This

I. INTRODUCTION
Current developments in multi-core system architectures
go into exploiting the potential of heterogeneous processing, where some components can deal with predominantly
sequential imperative operations, while other cores perform
functionally specific data or signal processing with a large
amount of parallelism involved. In the latter, parallel execution on portions of data is quite natural for the type of
data and functions they have to perform, including graphics (GPUs) or communications (DSP plus mixed signal RF
circuitry). Ideally, the components of such a system operate
at their own function-determined pace (clock frequency), with
data-dependent timing (number of iterations) and synchronise
only occasionally to exchange the computed results (self-timed
communication mechanisms).
While the drive for functional heterogeneity is strong and already supported by leading IP providers (ARM's big-LITTLE
platform and vision for the Internet of Things), the way of
how the non-functional aspects, such as timing, should be
handled is clearly lagging behind. Most of the techniques
for timing and communication are still simply inherited from
the conventional homogeneous processing and remain in the
traditional forms of clocking, such as multi-clock domains,
clock-gating and frequency scaling. Part of the stumbling block
on the way to greater diversification of the timing discipline is the
delayed take-up, on the side of the EDA tools, of innovations that
introduce elasticity [1] into these infrastructures. As a
result the potential of leveraging the functional diversification
1522-8681/14 $31.00 2014 IEEE
DOI 10.1109/ASYNC.2014.11


II. INTERACTION PATTERNS

increases the risk of significant under-performance of the overall system in terms of energy efficiency (power management,
such as clock/power gating, is only possible at a very coarse
granularity level) and speed (both throughput and latency).
In this work we investigate the following issue: what is
the key motivating factor for using GALS and what is the
correct granularity level of system behaviour to look for GALS
partitioning? Traditionally a motivating factor for introducing
GALS was clock distribution problems (clock skew, clock tree
balancing, etc.), which resulted in building GALS purely on
structural or physical criteria. As it appears, this approach
has not really succeeded because one can always find a way
to handle it within a clocked design style, even by inserting
synchronisers, which often introduce less power and latency
overheads than GALS wrappers. Therefore we should look
for another, higher-level form of timing or synchronisation,
where a system may benefit from being partitioned into largely
independent subsystems that can run at their own pace and
interact only occasionally, thus minimising the overheads of
GALS wrappers. With system timing being at the heart of
GALS synthesis, a successful GALS-based design flow needs
a common timing discipline, unifying both the computational
elements and their interfaces, supported by a good formalism,
analysis and optimisation algorithms. The main contribution
of our work on this pathway is as follows:

Our modelling tool for capturing system behaviour is
Petri nets [17]. Formally, a Petri net is defined as a tuple $PN = \langle P, T, F, M_0 \rangle$ comprising finite disjoint sets
of places $P$ and transitions $T$, arcs denoting the flow relation $F \subseteq (P \times T) \cup (T \times P)$ and initial marking $M_0$.
There is an arc between $x \in P \cup T$ and $y \in P \cup T$
iff $(x, y) \in F$. The preset of a node $x \in P \cup T$
is defined as $\bullet x = \{y \mid (y, x) \in F\}$, and the postset as
$x \bullet = \{y \mid (x, y) \in F\}$. The dynamic behaviour of a Petri net
is expressed by a token game: an evolution of the marking
according to the enabling and firing rules. A marking is a
mapping $M : P \to \mathbb{N}$ denoting the number of tokens in
each place ($\mathbb{N} = \{0, 1\}$ for 1-safe Petri nets). A transition
$t$ is enabled iff $\forall p,\ p \in \bullet t \Rightarrow M(p) > 0$. The evolution
of a Petri net is possible by firing the enabled transitions.
Firing of a transition $t$ results in a new marking $M'$ such that, for all $p \in P$,
$$M'(p) = \begin{cases} M(p) - 1 & \text{if } p \in \bullet t \setminus t \bullet, \\ M(p) + 1 & \text{if } p \in t \bullet \setminus \bullet t, \\ M(p) & \text{otherwise.} \end{cases}$$
Graphically, places of a Petri net are represented as circles, transitions as boxes, arcs as arrows, and tokens
are depicted by dots in the corresponding places.
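The enabling and firing rules can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' tooling: the class name PetriNet, its method names and the node names p1, p2, p3, t1, t2 are hypothetical, and all arcs are assumed to have unit weight.

```python
class PetriNet:
    """Minimal Petri net <P, T, F, M0> with unit arc weights (illustrative sketch)."""

    def __init__(self, places, transitions, arcs, initial_marking):
        self.places = set(places)
        self.transitions = set(transitions)
        self.arcs = set(arcs)  # flow relation F as (source, target) pairs
        # Marking M: number of tokens in each place (0 if not listed).
        self.marking = {p: initial_marking.get(p, 0) for p in self.places}

    def preset(self, node):
        """Nodes y with an arc (y, node) in F."""
        return {src for (src, dst) in self.arcs if dst == node}

    def postset(self, node):
        """Nodes y with an arc (node, y) in F."""
        return {dst for (src, dst) in self.arcs if src == node}

    def enabled(self, t):
        """A transition is enabled iff every place in its preset holds a token."""
        return all(self.marking[p] > 0 for p in self.preset(t))

    def fire(self, t):
        """Fire an enabled transition and update the marking in place."""
        if not self.enabled(t):
            raise ValueError(f"transition {t!r} is not enabled")
        pre, post = self.preset(t), self.postset(t)
        for p in pre - post:   # consume a token from each input-only place
            self.marking[p] -= 1
        for p in post - pre:   # produce a token in each output-only place
            self.marking[p] += 1


def demo_token_game():
    """Play the token game on a chain p1 -> t1 -> p2 -> t2 -> p3."""
    net = PetriNet(
        places={"p1", "p2", "p3"},          # hypothetical node names
        transitions={"t1", "t2"},
        arcs={("p1", "t1"), ("t1", "p2"), ("p2", "t2"), ("t2", "p3")},
        initial_marking={"p1": 1},
    )
    net.fire("t1")
    net.fire("t2")
    return net.marking
```

Calling demo_token_game() moves the single token from p1 to p3; firing t2 before t1 would raise an error because its preset place is empty.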

• Petri net models of interaction patterns (Section II): We
see Petri nets [17] as a convenient formalism for capturing
the essential properties of system behaviour and use it to
model a range of characteristic communication patterns
between the system components.
• Design method based on iterative refinement (Section III):
Throughout the paper we rely on an RSA key generator
as our running example. We derive its abstract Petri net
model and iteratively refine it based on the underlying
algorithms of its components until the interaction patterns
can be recognised and a decision about optimal partitioning can be made.
• Algorithmic complexity metrics for partitioning (Section V): We use algorithmic complexity as a metric to
estimate the number of iterations between the components' interactions. If this number is sufficiently large
and data-dependent, then the component is considered
decoupled from the rest of the system and becomes a
good candidate for GALS.

Figure 1: Modelling prolonged action. (a) Complex computation; (b) Availability of a resource; (c) Multiple input and output.


The firing of a Petri net transition has an atomic and instant
semantics. However, in this paper we often need to model
prolonged activities, e.g. a computation by a complex
algorithm. Such a prolonged computation X can still be
captured with a Petri net by explicitly modelling its beginning
and ending, see transitions X_beg and X_end respectively
in Figure 1a. Following this technique one can model a
component for performing the computation X by a sequence
of transitions representing the input of initial data X_in, an
algorithmic core of the component X_calc, and the output
of the results X_out, as shown in Figure 1b. The X_free
place denotes the availability of the resource and enables a
new round of computation only when the previous output is
consumed. Moreover, a component with multiple inputs and
outputs can be captured as shown in Figure 1c: each interface

In this paper we do not attempt to fully automate GALS
synthesis, but rather demonstrate how the proposed method
can help in making an informed design decision about the
system's architecture: whether it makes sense to partition the
system for GALS and where the boundaries of synchronous
islands should be laid. The Petri net model is the core of our
method and we see it as the first step to a theory and CAD
tools for the automated synthesis of GALS systems. Design
automation and application of this method to a wider range of
benchmarks is outside the scope of the paper and is a subject
for future work.


channel is modelled by a separate transition and a pair of
X_beg and X_end transitions performs synchronisation of all
the inputs and the outputs respectively.
Note that the X_calc transition also represents a prolonged
action and should be refined to a pair of beg/end transitions,
however, for simplicity we use a shorthand notation and denote
such complex transitions by shaded boxes. Also we draw all
the input transitions in red, the output transitions in blue and
the decision structures around choice places in green. This
colour encoding of transitions is used just for visualisation
purposes and is not essential to follow the paper material.
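The component model of Figure 1b can be exercised with a small token-game sketch. The place names x_data_in, x_loaded, x_done and x_data_out are hypothetical, and returning the token to X_free when X_out fires is one reading of the figure, stated here as an assumption. The input place holds two tokens, so this particular sketch is not 1-safe.

```python
def enabled(marking, pre):
    """A transition is enabled iff each of its input places holds a token."""
    return all(marking[p] > 0 for p in pre)


def fire(marking, transitions, name):
    """Fire a transition given as (input places, output places)."""
    pre, post = transitions[name]
    if not enabled(marking, pre):
        raise ValueError(f"transition {name!r} is not enabled")
    for p in pre:
        marking[p] -= 1
    for p in post:
        marking[p] += 1


def component_model():
    """Component X: input, algorithmic core, output, guarded by X_free."""
    transitions = {
        "X_in": (["x_data_in", "X_free"], ["x_loaded"]),
        "X_calc": (["x_loaded"], ["x_done"]),
        # Assumption: X_out releases the resource by refilling X_free.
        "X_out": (["x_done"], ["x_data_out", "X_free"]),
    }
    marking = {"x_data_in": 2, "X_free": 1,
               "x_loaded": 0, "x_done": 0, "x_data_out": 0}
    return transitions, marking


def run_rounds():
    """Fire enabled transitions until the net is dead; return trace and marking."""
    transitions, marking = component_model()
    trace = []
    while True:
        ready = [t for t, (pre, _) in transitions.items() if enabled(marking, pre)]
        if not ready:
            return trace, marking
        fire(marking, transitions, ready[0])
        trace.append(ready[0])
```

In this sketch the second X_in cannot fire until the first X_out has returned the token to X_free, so the two rounds of computation are strictly sequential.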

[Figure 3, panels (a) to (c): (a) Tight; (b) With slack; (c) Buffer the slack by pipelining]

Figure 2: Interaction of system components. (a) Schematic; (b) Model.


In this section our running example is a basic system
comprising a pair of interacting components, a producer A and
a consumer B, as shown in Figure 2a. The system behaviour
is captured using a Petri net model depicted in Figure 2b. We
utilise the Petri net model to overview the possible patterns of
interaction between the components and to reason about implementation alternatives for the communication channel (the
hidden area with the question mark). Depending on the timing
of the components A and B, in particular the relation of their
cycle times $T_A$ and $T_B$ (not to be confused with the clock
period), the patterns of the interaction between the components
can be classified in three categories: coupled, bursty and
decoupled.

Figure 3: Coupled behaviour. (d) Buffer the slack by wagging.


data needs to go through all the stages before it reaches the
consumer) then a more sophisticated wagging buffer [18] can
be employed. A Petri net model of the producer-consumer
communication over such a buffer is shown in Figure 3d.
Notice the synchronised scheduling of tokens on the entrance
and exit of the wagging buffer: this is to make sure the tokens
produced by the component A reach the component B in the
exact same order.
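The slack buffer of Figure 3b, whose capacity N is the total number of tokens in the buf_free and buf_busy places, can be sketched as follows. This is an illustrative model only: the function names are hypothetical, and the pipelined and wagging implementations of Figures 3c and 3d are not modelled.

```python
def make_buffer(capacity):
    """Initial marking: all slots free, none busy."""
    return {"buf_free": capacity, "buf_busy": 0}


def produce(marking):
    """Producer A: move a token from buf_free to buf_busy if a slot is free."""
    if marking["buf_free"] == 0:
        return False
    marking["buf_free"] -= 1
    marking["buf_busy"] += 1
    return True


def consume(marking):
    """Consumer B: move a token from buf_busy back to buf_free if data is ready."""
    if marking["buf_busy"] == 0:
        return False
    marking["buf_busy"] -= 1
    marking["buf_free"] += 1
    return True


def max_producer_lead(capacity):
    """Number of cycles the producer can run ahead of a stalled consumer."""
    marking = make_buffer(capacity)
    lead = 0
    while produce(marking):
        lead += 1
    return lead
```

The sum of tokens in buf_free and buf_busy stays equal to the capacity after every step, which is why the producer can run ahead by at most N cycles.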
B. Bursty behaviour

A. Coupled behaviour

Bursty communication is observed when the consumer
requires several items of input data to start the computation,
while the producer is only capable of generating these data
items one by one. A symmetrical situation is also possible
when the producer generates a portion of data items in a single
burst while the consumer processes these items individually.
Due to the similarity between the bursty consumer and the
bursty producer we consider only the former type of communication, whose model is depicted in Figure 4a. The producer
generates a single token at a time while there are tokens
in the buf_free place; the consumer becomes enabled only
when N tokens get accumulated in the buf_busy place. At
this point the consumer takes all N tokens from the buf_busy
place (and puts N tokens in the buf_free place) in one go.
This model can be further rened into a 1-safe Petri net
shown in Figure 4b. Note that the arbitrary choice place in
this model (denoted by #) can be converted into a controlled
choice by explicitly scheduling the generated tokens of data,
as shown in Figure 4c. Alternatively the producer component
may be replicated N times to match the number of input
tokens required by the consumer, as shown in Figure 4d, thus
speeding up the computation at the expense of circuit size.
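The bursty consumer of Figure 4a can be simulated with two counters standing for the buf_free and buf_busy places. This is a sketch under assumptions: the function name is hypothetical, and the schedule greedily prefers the consumer whenever it is enabled.

```python
def simulate_bursty_consumer(n, steps):
    """Return a trace of 'A' (producer) and 'B' (consumer) firings."""
    marking = {"buf_free": n, "buf_busy": 0}
    trace = []
    for _ in range(steps):
        if marking["buf_busy"] >= n:
            # Consumer B takes all N tokens in one go and frees N slots.
            marking["buf_busy"] -= n
            marking["buf_free"] += n
            trace.append("B")
        elif marking["buf_free"] > 0:
            # Producer A generates a single token at a time.
            marking["buf_free"] -= 1
            marking["buf_busy"] += 1
            trace.append("A")
    return trace
```

For N = 3 the trace alternates three producer firings with one consumer firing, which is the burst pattern the unsafe Petri net describes.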

Coupled behaviour is characterised by a predictable and
well-defined relationship between the cycle times of communicating components. There are two distinguished cases of
coupled behaviour: tightly coupled and coupled with slack.
In the case of tightly coupled behaviour both components
operate in sync ($T_A = T_B$) and exchange data every computation cycle, see Figure 3a. In order to avoid synchronisation
problems and latency overheads these components should be
put in the same clock domain.
The computation time of coupled components may temporarily deviate from the average speed depending on the input
data, thus introducing a slack between the components: the
producer and consumer may run away from each other, but
only for a limited number N of cycles. In order to account for
the slack one can insert a buffer of a required capacity N between the components and thus amortise the communication.
The buffer capacity is modelled by the accumulative number
of tokens in the buf_free and buf_busy places, see Figure 3b.
Particular implementation of the buffer depends on the design
intent, e.g. using a basic pipeline with N stages, s[0], s[1],
etc, as shown in Figure 3c. If the increased communication
latency introduced by this pipeline is an issue (the produced


to maintain delay-insensitive communication between them.
However, this results in overheads, both in terms of the extra
latency introduced by the wrappers and in terms of the design
effort imposed by the synchronous EDA tools. We believe that
the full potential of the GALS approach can be realised only
when the following conditions apply:
• Rare communications separated by an unpredictable,
data-dependent number of computation cycles, which
renders design-time buffering/balancing impossible;
• Sufficiently large number of iterations between interactions to make synchronisation overheads negligible;
• Significant difference in the critical path of decoupled
components to benefit from different clock periods.
The behavioural patterns reviewed in this section facilitate
the design decisions about system partitioning into clock
domains (for tightly coupled behaviour), buffering capacity of
communication channels (for coupled behaviour with slack),
resource reuse and replication (for bursty behaviour) and
GALS partitioning (for decoupled behaviour).

Figure 4: Bursty consumer behaviour. (a) Unsafe Petri net; (b) 1-safe Petri net; (c) Scheduling of produced tokens; (d) Producer replication.


C. Decoupled behaviour
Decoupled behaviour is observed when the system components perform computation at their own, often data-dependent,
pace. In this situation it is difficult (if not impossible) to predict
at design time which component is faster and by how much.
Often one of the components becomes a bottleneck for a long
time, and then the other component takes this role.
A model of such a communication is shown in Figure 5,
where both components A and B (refined into the input,
the core calculation and the output transitions) go through
an iterative computation of the results as follows. Firstly, the
input data is initialised (modelled by *_in transitions) and an
iteration of computing is performed (*_calc). The iteration
results are checked against a certain condition to make a
decision if the computation is complete and the results can
be output (*_out). If the condition does not hold, the input
data is updated (*_update) and the calculation repeats.
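The iterative structure of a decoupled component (input, calculation, check, update, output) can be sketched as a generic loop. As a stand-in workload this example uses Euclid's GCD, chosen here only because its iteration count depends on the data; the function names run_component and euclid_component are hypothetical.

```python
def run_component(data, calc, is_done, update):
    """Iterate calc until is_done holds; return (result, iteration count)."""
    state = data                       # *_in: initialise the input data
    iterations = 0
    while True:
        result = calc(state)           # *_calc: one iteration of computing
        iterations += 1
        if is_done(result):            # decision: is the computation complete?
            return result, iterations  # *_out: output the results
        state = update(result)         # *_update: prepare the next iteration


def euclid_component(a, b):
    """GCD of a and b (b > 0) together with the number of iterations used."""
    result, iterations = run_component(
        (a, b),
        calc=lambda s: (s[1], s[0] % s[1]),
        is_done=lambda r: r[1] == 0,
        update=lambda r: r,
    )
    return result[0], iterations
```

Here euclid_component(1071, 462) finishes in 3 iterations while euclid_component(13, 8) needs 5, so the time between outputs cannot be fixed at design time.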

III. MOTIVATING EXAMPLE AND DESIGN METHOD


While looking for a convenient benchmark to convincingly
demonstrate the need for a detailed understanding of the
system behavioural dynamics to fully utilise the advantages
of GALS approach, we aimed at the following set of criteria. The running example should be compact, so its model
and schematic representation are observable and manageable
manually, and still representative, to exhibit a wide range of
behavioural patterns. It is also advantageous for a benchmark
to have a realistic semantic, helping the reader to better
understand the applied algorithms and follow the extraction
and refinement of its behavioural model.
We found inspiration for a benchmark satisfying the above
criteria in the cryptography domain largely dominated by the
encryption algorithms with asymmetric keys, which are based
on the presumed difficulty of factoring large integers. The
asymmetric key cryptography uses a pair of keys: a publicly
available one for message encryption, and a privately held
one for message decryption. These keys are mathematically
related, but in such a way that determining the private key
from the public key is prohibitively expensive. In a nutshell,
this requires factoring out a portion of the public key, which
is built as a product of two huge primes.
One of the most established and widely used public-key
algorithms is RSA [19]. Recently a presumably unbreakable
1024-bit RSA encryption was compromised using a vulnerability in the distribution of randomly generated primes [20].
The attackers collected around 7.1M RSA public keys and
applied a basic Euclidean GCD algorithm to each pair. The
result was shocking: around 13K keys were factored out
successfully using this method. The authors characterised their
experience as follows: "Factoring one 1024-bit RSA modulus
would be historic. Factoring 12720 such moduli is a statistic."
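The pairwise GCD idea behind this attack can be illustrated on toy numbers. The sketch below is not the code used in [20]; the function name find_shared_factors and the small primes are assumptions chosen for readability.

```python
from itertools import combinations
from math import gcd


def find_shared_factors(moduli):
    """Map the index of each factored modulus to its pair of factors."""
    factored = {}
    for (i, a), (j, b) in combinations(enumerate(moduli), 2):
        g = gcd(a, b)
        # A non-trivial common divisor reveals a prime shared by both moduli.
        if 1 < g < min(a, b):
            factored[i] = (g, a // g)
            factored[j] = (g, b // g)
    return factored


# Toy moduli: the first two share the prime 101, the third is independent.
TOY_MODULI = [101 * 103, 101 * 107, 109 * 113]
```

Calling find_shared_factors(TOY_MODULI) factors the first two moduli and leaves the third untouched, without any attempt at factoring in the usual sense.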
The random primes are filtered out from a stream of
randomly generated numbers using a primality test. Ideally a
source of truly random numbers is required for cryptography

Figure 5: Decoupled behaviour


There are two options for the efficient implementation of
such a decoupled system: clock gating or GALS partitioning.
While clock gating is a traditional design practice to deal with
a data-dependent computational cycle, it lacks flexibility in choosing the clock frequency for the communicating
components: all of them are driven by the same clock in order
to avoid clock domain crossing and associated synchronisation
difficulties.
The GALS methodology is more flexible in clocking the
individual components as it relies on asynchronous wrappers


Figure 6: Key generator benchmark. (a) High-level schematic; (b) Abstract Petri net model.

Figure 7: Refinement of components interface


hassle of applying GCD to them.
Note that this design enhancement does not solve the
problem entirely. The malefactor can obtain several collections
of public keys from different sources, which being combined
are still vulnerable to the GCD attack. However, keep in mind
this is just a motivating example with realistic semantics and
interesting application domain which should still serve the
purpose of the paper.

applications, such as the randomness coming from a physical
process in the environment, e.g. thermal noise, photoelectric
effect or voltage fluctuations [22]. In practice the rate of truly
random large numbers may be unacceptably slow: it takes a very
long time to accumulate sufficient entropy from the environment and produce a single random instance. Often, in order
to increase the production speed the truly random data is only
employed as a seed for a pseudo-random number generator,
which negatively affects the distribution of generated prime
numbers [23].
The conventional design of an asymmetric key generator
consists of a gen module which produces a pair of prime
numbers, and a key module which uses these two primes p and
q to compute public / private keys, see high-level schematic
in Figure 6a. We decided to extend this design by a chk
module which validates the resistance of produced keys to the
above mentioned GCD attack. This is achieved by checking
the independence of the public keys within this particular
device. For this a portion of the newly computed public key, which
is a product of p and q (further called the prime product), is
exercised against each previously generated prime product
using the GCD algorithm. If any of the stored prime products
is found to have a common divisor (different from 1) with the
new one, then the produced pair of keys is vulnerable to the
GCD attack and should be disregarded (tagged as bad). All the
previously generated prime products are stored in a memory
pool, either locally in a cache of the chk component or externally
in a dedicated memory module.
One can criticise the unnecessary complexity of the design
as it would be sufficient to store all the previously generated
prime numbers and make sure they do not repeat. However,
this would introduce a severe security breach if an attacker
were able to read the stored primes and thus compromise
all the generated keys at once even without going through the

A. Abstract model
In this work we want to identify the boundaries of synchronised behaviour and make an informed decision about
the GALS partitioning of the design. Usually the structural
hierarchy would determine the boundaries of GALS islands,
however, we feel this is a naive and sub-optimal approach.
Instead we start from an abstract Petri net model of the system
and go through the iterative refinement of its components
until we accumulate sufficient knowledge about its behavioural
dynamics to justify a partitioning decision.
An initial Petri net representation for the key generator is
captured by a pipeline-like structure, as shown in Figure 6b.
As was described in Section II, each component is modelled by
a sequence of *_in, *_calc and *_out transitions to represent
the input of data, the algorithmic core and the output of the
results. A token in a *_free place denotes the availability of
the corresponding component and prevents a new round of
computation until the current results are consumed by the next
pipeline stage.
B. Interface refinement
At the first transformation step the component interfaces
are refined, as shown in Figure 7. The communication between the gen and key components is captured by explicitly
modelling the generation of two prime numbers via transitions


gen_out_prime1 and gen_out_prime2. These primes are
consumed by the key component at key_in_p and key_in_q
inputs. The output of the key component is a pair of keys
denoted by the key_out_public and key_out_private transitions, and the prime product represented by the key_out_n
transition. The key_out_n output is passed to the chk_in
input of the chk component for validation of its co-primality
to the prime products of the previously generated keys. Depending on the outcome of the check either chk_out_bad or
chk_out_good output transition is fired. In this refined model
the computation complexity of each component is hidden in
the *_calc transitions whose further refinement is based on
the extraction of a Petri net model from the corresponding
computation algorithm.

Algorithm 1 Pseudo-code for gen, key and chk

func gen() : (prime1, prime2)
  for i ∈ {1, 2}
    do
      candidate := random()
    until pt(candidate)
    prime[i] := candidate
  return (prime[1], prime[2])

func key(p, q) : (private, public, n)
  n := p × q
  f := (p − 1) × (q − 1)
  do
    e := random()
    (gcd, x, y) := egcd(f, e)
  until gcd = 1
  d := y
  return ((n, d), (n, e), n)

C. Computation refinement
The shaded transitions gen_calc, key_calc and chk_calc
which represent complex actions are refined based on the
corresponding computation algorithms. Extraction of Petri nets
from behavioural specifications [21] has been tailored and
partially automated for the purpose of this research. The
pseudo-code for each of the complex actions is captured in
Algorithm 1 and the derived Petri net models are shown in
Figure 8. Let us consider the extraction of each model in
more detail. The names of Petri net transitions that correspond
to the discussed portions of the algorithm are referenced in
parentheses.
1) Generating a pair of prime numbers: A random number is generated (gen_random) and undergoes a primality test (gen_pt). The test result is subsequently analysed (gen_prime?) and if the number is prime (gen_p_true)
then it is passed to the output buffer (gen_out_prime1 for the
first instance and gen_out_prime2 for the second instance).
Otherwise, if the number is not prime (gen_p_false), a new
random number is generated and checked for primality. This
procedure repeats until two primes are produced. A Petri net
model of this algorithm is shown in Figure 8a.
2) Computing RSA keys: Two large primes p and q generated by the gen module are used to calculate the prime
product $n = p \times q$: the difficulty of factoring this product
is the core of public-key cryptography. Also, Euler's
totient function $f = \varphi(n)$ is computed: it shows the number
of positive integers in the range $[1, n]$ which are co-prime to n.
As n is a product of presumably different prime numbers,
its Euler's totient function is $f = (p - 1) \times (q - 1)$. Note
that n and f can be computed concurrently provided two
multipliers are available (modelled by transitions key_calc_n
and key_calc_f respectively). Secondly, an integer e is randomly chosen (key_random) which needs to meet two requirements: (i) it must be smaller than f; and (ii) it must
be co-prime to f. The former requirement is easy to satisfy,
e.g. by limiting the bit-width of e, so it is definitely smaller
than f: it is known that a relatively short bit-width with
small Hamming weight (the number of non-zero bits) results
in more efficient encryption. The latter requirement is checked
by calculating a GCD of f and e (key_egcd) and comparing

func chk(n)
  addr_init(a)
  while ¬addr_last(a)
    m := mem_read(a)
    if gcd(n, m) ≠ 1 then return false
    addr_next(a)
  mem_write(a, n)
  return true
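A runnable sketch of Algorithm 1 is given below for experimentation with small bit-widths. Several details are assumptions that the pseudo-code leaves open: pt is implemented as a Miller-Rabin test, candidates are forced to be odd with the top bit set, the memory pool is a plain Python list, and y is reduced modulo f so that d is non-negative.

```python
import random
from math import gcd


def pt(n, witnesses=16, rng=random):
    """Miller-Rabin primality test (probabilistic; an assumed choice for pt)."""
    if n < 2:
        return False
    for small in (2, 3, 5, 7, 11, 13):
        if n % small == 0:
            return n == small
    d, r = n - 1, 0
    while d % 2 == 0:            # write n - 1 as d * 2**r with d odd
        d //= 2
        r += 1
    for _ in range(witnesses):
        a = rng.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False         # a is a witness that n is composite
    return True


def egcd(a, b):
    """Extended Euclid: return (g, x, y) with a*x + b*y == g == gcd(a, b)."""
    x0, y0, x1, y1 = 1, 0, 0, 1
    while b:
        q, a, b = a // b, b, a % b
        x0, x1 = x1, x0 - q * x1
        y0, y1 = y1, y0 - q * y1
    return a, x0, y0


def gen(bits, rng=random):
    """Produce two probable primes of the given bit-width."""
    primes = []
    for _ in range(2):
        while True:
            # Assumption: force the top bit (fixed size) and the low bit (odd).
            candidate = rng.getrandbits(bits) | (1 << (bits - 1)) | 1
            if pt(candidate, rng=rng):
                break
        primes.append(candidate)
    return primes[0], primes[1]


def key(p, q, rng=random):
    """Return (private, public, n) for primes p and q."""
    n = p * q
    f = (p - 1) * (q - 1)
    while True:
        e = rng.randrange(2, f)
        g, _x, y = egcd(f, e)
        if g == 1:
            break
    d = y % f                    # reduce y so that d is a positive inverse of e
    return (n, d), (n, e), n


def chk(n, pool):
    """Accept n only if it is co-prime to every stored prime product."""
    for m in pool:
        if gcd(n, m) != 1:
            return False
    pool.append(n)
    return True
```

A typical round would call gen, pass the primes to key and then pass n to chk with a shared pool; a False result corresponds to tagging the key pair as bad.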

it with 1 (key_coprime?). The generation of random e
repeats until it is co-prime to f (key_cp_true) which denotes
that e is found (key_e): it forms the public key. Then
the value d is computed (key_d) as a multiplicative inverse
of e, i.e. $d = e^{-1} \pmod{f}$: it forms the private key. Note
that $e \cdot d \equiv 1 \pmod{f}$, which means there exists a (negative)
integer x such that $f \cdot x + e \cdot d = 1$. Also we know that f and e
are co-prime and therefore their GCD is equal to 1, which
being combined with Bézout's identity $a \cdot x + b \cdot y = \gcd(a, b)$
reveals an elegant way of computing d. The values of x and y
in this formula can be efficiently derived by the extended
Euclidean algorithm: this can be done when the GCD of f and e
is computed for the co-primality check (key_egcd). Finally a
pair of public and private keys is derived by combining n
with e (key_out_public) and n with d (key_out_private)
respectively. The resultant Petri net model for this algorithm
is shown in Figure 8b.
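A worked toy example may help to follow the derivation of d. The primes p = 61 and q = 53 and the exponent e = 17 are assumptions chosen for illustration only; real keys use far larger values.

```latex
\begin{align*}
n &= 61 \times 53 = 3233, \qquad f = (61 - 1)(53 - 1) = 3120, \qquad e = 17,\\
3120 &= 183 \cdot 17 + 9, \qquad 17 = 1 \cdot 9 + 8, \qquad 9 = 1 \cdot 8 + 1,\\
1 &= 9 - 8 = 2 \cdot 9 - 17 = 2 \cdot 3120 - 367 \cdot 17,\\
d &= -367 \bmod 3120 = 2753, \qquad 17 \cdot 2753 = 46801 = 15 \cdot 3120 + 1 .
\end{align*}
```

The extended Euclidean algorithm thus returns y = -367, which is reduced modulo f to obtain d = 2753. With this d the identity reads 3120 x + 17 d = 1 for x = -15, a negative integer as noted in the text.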
3) Checking the keys: The chk component takes the prime
product (chk_in_n) and checks its co-primality with all the
previously generated prime products. The old products are
stored in a local cache or global memory at some base
address (chk_addr_init). Each iteration starts from checking if
all the old products have been considered (chk_addr_last?):
if not, then an old product is fetched from the current
memory address (chk_mem_read) and is used to compute its GCD with the new product (chk_gcd). The GCD
result is utilised to determine the co-primality of the new and old products
(chk_coprime?). If the products have a common
divisor different from 1 (chk_c_false), then the check fails


Figure 8: Refinement of key generator components. (a) Prime number generator (gen component); (b) Key calculator (key component); (c) Key checker (chk component).


and the newly generated pair of keys based on this product is
tagged as vulnerable (chk_out_bad). Otherwise (chk_c_true)
the memory address is incremented (chk_addr_next) and the
above procedure repeats. Finally, when the last old prime product has been considered (chk_al_true), the new prime product
is stored in memory for future checks (chk_mem_write)
and the pair of generated keys is tagged as reliable (chk_out_good). A Petri net model of this algorithm
is shown in Figure 8c.

rounded box) maps into the bursty consumer interaction pattern (see Section II-B).
The decoupled behaviour is of particular interest for us as
it means a prolonged independent computation with an occasional synchronisation for data exchange. Due to the dynamic
nature of the synchronicity between the *_loop segments we
cannot just insert deep enough buffers to compensate for the
slack between them. Indeed, the number of iterations before
synchronisation is data-dependent and also unpredictable due
to the randomness of the initial data (gen_random and
key_random). Moreover, the system bottleneck is temporary
and shifts between the clusters. Intuitively the gen_loop lags
behind while there are only a few stored prime products to
check co-primality with. However, when the critical amount
of prime products is accumulated, the chk_loop becomes the
bottleneck. This behavioural pattern is a good candidate either
for coarse-grain clock gating or for GALS partitioning. Let us
analyse which option is preferable in case of the key generator
example.
The dynamics of decoupled interaction can be efficiently
predicted by analysing the complexity of underlying algorithms and deriving the expected number of iterations each
component performs before producing its output. Let us
demonstrate this technique by estimating the number of iterations when computing N-bit RSA keys. In cryptography the
bit-width of the data operated on is significantly larger than the normal
register size (e.g. 512-bit or 1024-bit numbers), therefore it
makes sense to estimate the iteration count of algorithms
relative to the bit-width N of their operands (this is known
as bit complexity analysis).
Firstly, two N/2-bit primes need to be produced. The

IV. ANALYSIS AND DESIGN DECISIONS


The rened Petri net models of gen, key and chk components are composed according to the interface model of
Figure 7. For compactness the pipeline stages between the
output and input transitions of the communicating components
are collapsed into singleton input/output transitions, e.g. a
pair of sequential transitions key_out_n and chk_in_n is
represented by a single key_out_n/chk_in_n transition. Also
the model of the key component is transformed by splitting
it into three pipeline stages (gen_key_interface, key_loop
and key_env_interface), see Figure 9. The applied Petri
net transformations are purely structural and preserve the
important properties of the model [24]. These modications
and the model checking of the results are both performed in
semi-automated mode within W ORKCRAFT framework [25].
The resultant Petri net reveals the communication patterns discussed in Section II. In particular, one can see that
the gen_loop, key_loop and chk_loop segments (shaded
rectangles) of the net are decoupled one from another (see
Section II-C and the gen_key_interface segment (a shaded


Figure 9: Composition and partitioning


gen_loop, key_loop and chk_loop cycles, which largely
determine the dynamics of data exchange, is different: $K \times N$,
$N$ and $M$ respectively. This supports our intuition about the
changing bottleneck: the system performance is limited by the
gen_loop while $M < K \times N$ and it becomes constrained by
the chk_loop when $M > K \times N$. For example, when computing 512-bit RSA keys with 16 Miller-Rabin witnesses, the
bottleneck shifts from the gen_loop to the chk_loop segment
once around 8 thousand prime products are stored
($K \times N = 16 \times 512 = 8192$).
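The iteration estimates above can be turned into a small calculation. The following is only a sketch under the paper's average-case estimates (K·N iterations for gen_loop, N for key_loop, M for chk_loop); the function names are hypothetical and are not part of the described design.

```python
def expected_iterations(n_bits: int, witnesses: int, stored_products: int) -> dict:
    """Average iteration counts per generated key pair (estimates from the text)."""
    return {
        "gen_loop": witnesses * n_bits,   # K*N primality-test iterations
        "key_loop": n_bits,               # N extended-Euclid attempts
        "chk_loop": stored_products,      # M GCD checks against stored products
    }


def bottleneck(n_bits: int, witnesses: int, stored_products: int) -> str:
    """Name of the loop with the largest expected iteration count."""
    counts = expected_iterations(n_bits, witnesses, stored_products)
    return max(counts, key=counts.get)


def threshold(n_bits: int, witnesses: int) -> int:
    """Number of stored prime products at which chk_loop catches up with gen_loop."""
    return witnesses * n_bits


if __name__ == "__main__":
    print(threshold(512, 16))
    print(bottleneck(512, 16, 100), bottleneck(512, 16, 10_000))
```

Because all three loops have the same per-iteration bit complexity, comparing raw iteration counts is enough to locate the bottleneck in this simplified model.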
The proposed architecture of the key generator is shown
in Figure 10. One can notice a correspondence between the
shaded areas of the schematic and those of the Petri net
model in Figure 9. The bursty consumer interaction between
gen_loop and key_loop is implemented by buffering the first
generated prime (toggle and buf structure) and synchronising
it with the second prime at key_calc_n and key_calc_f components. These N-bit multipliers operate concurrently with
the multi-cycle key_loop and therefore can be implemented
without speed optimisation; the iterative multiplication can
also be unrolled into a multi-stage pipeline to buffer a series
of generated prime numbers while the key_loop is busy.
The shaded rectangles (gen_loop, key_loop and chk_loop
with chk_mem_interface) are mapped into GALS islands.
Note that chk_loop reads the prime products from memory (chk_mem_read) and the chk_mem_interface also interacts with the memory (chk_addr_* and chk_mem_write),
therefore both of them should be in the same clock domain
forming a single GALS island for the whole chk component.
The choice of a GALS implementation over clock gating
was driven by the design guidelines proposed in Section II-C.

complexity of generating a prime number P may vary in a
wide range depending on the method chosen for the primality
test (modelled with the gen_pt transition). The bit complexity
of the naive algorithm is $O(N^2 \cdot 2^{N/4})$, which renders it useless for cryptographic applications. Therefore, in cryptography
the Miller-Rabin probabilistic approach [26] is often employed,
whose bit complexity is $O(K \times N^3)$, where K is a (relatively
low) number of witnesses. Now, we take into account the
prime number theorem stating that a random integer in the
range from zero to some large number P is a prime with the
probability of about 1/ln(P). In other words, to get an N/2-bit
prime, on average, one needs to generate and test around N/2
random numbers. Based on these estimates we can conclude
that in order to generate a pair of primes the gen_loop makes
around $K \times N$ iterations, each of $O(N^3)$ bit complexity.
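To make the gen_loop estimate concrete, here is a minimal software sketch of Miller-Rabin testing with K witnesses, wrapped in a loop that counts how many random candidates are tested before a prime is found. It is an illustration only, not the hardware algorithm of the paper: the function names, the small-prime pre-filter and the choice to force the top bit of each candidate are assumptions.

```python
import random


def is_probable_prime(candidate: int, witnesses: int, rng: random.Random) -> bool:
    """Miller-Rabin test with `witnesses` random bases (K in the text)."""
    if candidate < 2:
        return False
    # Assumed pre-filter: settle small cases by trial division.
    for small in (2, 3, 5, 7, 11, 13):
        if candidate % small == 0:
            return candidate == small
    # Write candidate - 1 as d * 2**r with d odd.
    d, r = candidate - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(witnesses):
        a = rng.randrange(2, candidate - 1)
        x = pow(a, d, candidate)          # modular exponentiation
        if x in (1, candidate - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, candidate)
            if x == candidate - 1:
                break
        else:
            return False                  # `a` proves the candidate composite
    return True


def generate_prime(bits: int, witnesses: int, rng: random.Random):
    """Draw random `bits`-bit candidates until one passes; return (prime, tested)."""
    tested = 0
    while True:
        candidate = rng.getrandbits(bits) | (1 << (bits - 1))  # force exact bit length
        tested += 1
        if is_probable_prime(candidate, witnesses, rng):
            return candidate, tested


if __name__ == "__main__":
    prime, tested = generate_prime(256, 16, random.Random())
    print(prime.bit_length(), tested)
```

By the prime number theorem the average value of `tested` grows linearly with the bit-width, which is the source of the factor N in the K·N estimate; the exact constant depends on how candidates are drawn.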
The computation time of the key_loop is dominated by the
extended Euclidean algorithm (key_egcd transition) whose bit
complexity is $O(N^3)$: at most N divisions, each with complexity $O(N^2)$. This procedure repeats until a randomly generated
number e appears to be co-prime to f (see Section III-C for
details). The probability of getting an N-bit prime is 1/N.
Therefore, on average, the key component
performs N iterations, each with bit complexity $O(N^3)$.
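The key_loop behaviour described above can be sketched in software as follows. This is a hedged illustration: it assumes that f is the totient (p-1)(q-1) of the two generated primes and that e is drawn uniformly at random; `extended_gcd` and `derive_exponents` are hypothetical names, not identifiers from the paper.

```python
import random


def extended_gcd(a: int, b: int):
    """Return (g, x, y) with a*x + b*y == g == gcd(a, b)."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r != 0:
        q = old_r // r                    # one division per iteration
        old_r, r = r, old_r - q * r
        old_x, x = x, old_x - q * x
        old_y, y = y, old_y - q * y
    return old_r, old_x, old_y


def derive_exponents(f: int, rng: random.Random):
    """Repeat until a random e is co-prime to f; return (e, d, attempts)."""
    attempts = 0
    while True:
        e = rng.randrange(3, f)
        attempts += 1
        g, x, _ = extended_gcd(e, f)
        if g == 1:
            return e, x % f, attempts     # d is the inverse of e modulo f


if __name__ == "__main__":
    f = (61 - 1) * (53 - 1)               # toy primes, for illustration only
    e, d, attempts = derive_exponents(f, random.Random())
    print(e, d, (e * d) % f, attempts)
```

The `attempts` counter corresponds to the number of key_loop iterations; each attempt runs one full extended Euclidean computation.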
Similarly, the chk_loop spends most of the time calculating
the GCD of a newly generated prime product and the old prime
products (chk_gcd transition). The bit complexity of GCD
is $O(N^3)$ and it is repeated for each of the M prime products
currently stored in memory.
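A minimal sketch of the check performed by the chk component is given below. It assumes that a new prime product sharing a factor with any stored product is rejected without being stored; the list-based memory and the function name are illustrative assumptions.

```python
from math import gcd


def check_product(new_product: int, stored_products: list) -> bool:
    """Return True (and store the product) if it is co-prime to all stored ones."""
    for old_product in stored_products:   # M iterations in the worst case
        if gcd(new_product, old_product) != 1:
            return False                  # shared prime factor: keys are unreliable
    stored_products.append(new_product)   # kept for future checks
    return True


if __name__ == "__main__":
    memory = []
    print(check_product(61 * 53, memory))   # nothing stored yet
    print(check_product(61 * 59, memory))   # shares the prime 61
    print(len(memory))
```

The loop length grows with every accepted key, which is why the chk_loop eventually becomes the bottleneck.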
Notice that the bit complexity of all the internal computations (gen_pt, key_egcd and chk_gcd transitions) is the
same, $O(N^3)$, while the number of iterations over them in

[Figure 10 (schematic not reproduced). Labelled elements: gen_loop (KN iterations), key_loop (N iterations) and chk (M iterations), each with input/output ports and its own clock generator (T=0.96ns, T=1.34ns and T=1.27ns); toggle, buf, key_calc_n, key_calc_f and key_public_private blocks; signals prime, e,d, public, private and tag.]

Figure 10: High-level schematic of the proposed implementation


As has been discussed, the *_loop components exhibit unpredictable and rare interactions separated by a large number
of computation cycles. Now we only need to estimate the
difference in the clocking frequency of these components. In
the case of 512-bit RSA keys the gen_loop works with 256-bit
numbers while the key_loop and chk_loop operate on 512-bit
numbers, which suggests a significantly longer critical path
for the latter two. We verified this assumption by synthesising
each of these components in UMC 90nm technology using
the Faraday cell library and doing static timing analysis.
Due to the high bit-width of the operands we implemented
all the complex arithmetic operations (multiplication, division, exponentiation, etc.) in an iterative way while the simple arithmetic operations (additions, subtractions, comparison,
etc.) were left for Synopsys Design Compiler to fit into a
single clock cycle. As expected, the critical paths
of the obtained gen_loop, key_loop and chk_loop circuits
varied significantly: 0.96ns, 1.34ns and 1.27ns respectively.
This ~30% difference in potential clock speed justifies the
use of a GALS architecture over the traditional coarse-grain
clock gating, while the large number of computational iterations (thousands of clock cycles for a 512-bit RSA key) before a
single data exchange compensates for the latency overheads
introduced by asynchronous wrappers.
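The quoted difference can be checked with simple arithmetic. The sketch below assumes that a single-clock design with clock gating would have to run every component at the slowest critical path; the function name is hypothetical and the periods are those reported in the text.

```python
def single_clock_penalty(periods_ns: dict) -> dict:
    """Fraction of each component's potential clock speed lost on a shared clock."""
    slowest = max(periods_ns.values())    # the shared clock must fit the longest path
    return {name: 1.0 - period / slowest for name, period in periods_ns.items()}


if __name__ == "__main__":
    periods = {"gen_loop": 0.96, "key_loop": 1.34, "chk_loop": 1.27}
    for name, loss in single_clock_penalty(periods).items():
        print(f"{name}: {loss:.1%}")
```

Under this assumption the gen_loop would give up a little under 30% of its attainable clock rate on a shared clock, which is in line with the figure quoted above.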

be employed to compensate for the increased computation
time of the chk_loop when the number of stored keys grows.
For example, several such modules can be instantiated, each
with its own power gating mechanism and its local memory
of optimal size for a given key bit-width. The memory size
is closely related to the discussed threshold number, e.g. for
the case of 512-bit keys a local memory of around 512KB
would be optimal (to store up to 8 thousand prime products).
As soon as the number of prime products reaches the memory
limit (and thus the threshold value) a new instance of the chk
module is powered up to perform the public key test in parallel
with other checkers.
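The memory sizing above follows directly from the threshold. As a rough sketch, assuming each stored prime product occupies exactly N bits and the threshold is K·N products (the function names are hypothetical):

```python
def checker_memory_bytes(n_bits: int, witnesses: int) -> int:
    """Local memory needed to hold K*N prime products of N bits each."""
    capacity = witnesses * n_bits         # threshold number of stored products
    return capacity * n_bits // 8


def checkers_powered(stored_products: int, n_bits: int, witnesses: int) -> int:
    """Checker instances in use, powering up a new one whenever a memory fills."""
    capacity = witnesses * n_bits
    return stored_products // capacity + 1


if __name__ == "__main__":
    print(checker_memory_bytes(512, 16) // 1024, "KB")
    print(checkers_powered(20_000, 512, 16))
```

For 512-bit keys and 16 witnesses this gives 8192 products of 64 bytes each, i.e. 512KB per checker instance.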
V. CONCLUSIONS
Understanding the behaviour of a heterogeneous system in
its dynamics is paramount for efficient GALS implementation.
This requires modelling of the whole system at an abstract
behavioural level with possible refinement of its individual
components. A complexity analysis of the underlying algorithms proves to be a valuable source of information to characterise the interaction patterns between the system components.
This technique has been demonstrated on a relatively simple
benchmark and can be further generalised, thus leading to the
development of theory and tools for the efficient synthesis of
GALS systems.
Our next goal is design automation for the process of capturing the interaction between the system components with Petri
nets and the identification of the characteristic behavioural
patterns. This step is essential for applying the proposed
method to larger and more complex systems with widely
varying types of computation and communication activities.
With the automated design flow at hand we plan to build a
pool of benchmarks to compare our GALS approach (based
on behavioural patterns) against the conventional GALS methods (based on structural partitioning).
Both the theoretical aspect of modelling the behavioural
patterns and the practical aspect of design automation open
interesting directions for future work.

The obtained GALS partitioning is different from what
would be the result of a purely structural approach. Even if a
system designer intuitively recognises the three islands of synchronicity corresponding to gen, key and chk components, it
still would be difficult to guess the optimal boundary between
the islands other than relying on the top-level schematic in
Figure 6a. Our method helps to locate the core of each GALS
island resulting in tighter local clock periods and minimising
the local clock trees.
Having the gen_loop and chk components in different clock domains gives a possibility for independent frequency (and voltage) scaling when the bottleneck moves, thus
facilitating the power savings. Resource replication can also


ACKNOWLEDGEMENTS
The authors are grateful to the anonymous reviewers for all
the criticism, inspiring comments and valuable suggestions on
how to further develop this work.
This research was supported by EPSRC grant EP/I038551/1
Globally Asynchronous Elastic Logic Synthesis (GAELS).
REFERENCES
[1] M. Galceran-Oms, A. Gotmanov, J. Cortadella, M. Kishinevsky: Microarchitectural transformations using elasticity, ACM Journal on
Emerging Technologies in Computing Systems (JETC), 7(4), pp. 18:1–18:24, 2011.
[2] D. Chapiro: Globally asynchronous locally synchronous systems, PhD
thesis, 1984.
[3] D. Bormann, P. Cheung: Asynchronous wrapper for heterogeneous
systems, Proc. International Conference on Computer Design (ICCD),
pp. 307–314, 1997.
[4] K. Yun, A. Dooply: Pausible clocking based heterogeneous systems,
IEEE Transactions on VLSI Systems, 7, pp. 482–487, 1999.
[5] R. Mullins, S. Moore: Demystifying data-driven and pausible clocking
schemes, Proc. International Symposium on Advanced Research in
Asynchronous Circuits and Systems (ASYNC), pp. 175–185, 2007.
[6] P. Teehan, M. Greenstreet, G. Lemieux: A survey and taxonomy of
GALS design styles, IEEE Design & Test of Computers, 24(5),
pp. 418–428, 2007.
[7] A. Iyer, D. Marculescu: Power and performance evaluation of globally asynchronous locally synchronous processors, Proc. International
Symposium on Computer Architecture (ISCA), pp. 158–168, 2002.
[8] M. Krstic, E. Grass, F. Gürkaynak, P. Vivet: Globally asynchronous,
locally synchronous circuits: overview and outlook, IEEE Design and
Test of Computers, 24(5), pp. 430–441, 2007.
[9] M. Horak, S. Nowick, M. Carlberg, U. Vishkin: A low-overhead
asynchronous interconnections network for GALS chip multiprocessors,
IEEE Trans. on CAD, 30(4), pp. 494–507, 2011.
[10] D. Ludovici, A. Strano, G. Gaydadjiev, L. Benini, D. Bertozzi: Design
space exploration of a mesochronous link for cost-effective and flexible
GALS NoCs, Proc. Conference on Design, Automation and Test in
Europe (DATE), pp. 679–684, 2010.
[11] X. Fan, M. Krstic, E. Grass: Performance analysis of GALS datalink
based on pausible clocking, Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems (ASYNC),
pp. 126–133, 2012.
[12] A. Hemani, T. Meincke, S. Kumar, A. Postula, T. Olsson, P. Nilsson,
J. Oberg, P. Ellervee, D. Lundqvist: Lowering power consumption in
clock by using GALS design style, Proc. Design Automation Conference (DAC), pp. 873–878, 1999.
[13] J. Muttersbach, T. Villiger, W. Fichtner: Practical design of globally-asynchronous locally-synchronous systems, Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems (ASYNC), pp. 52–59, 2000.
[14] N. Jindapetch, H. Saito, K. Thongnoo, T. Nanya: A fair overhead
comparison between asynchronous four-phase protocol based controllers
and local clock controllers, Proc. International Conference on Electrical
Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), pp. 791–794, 2005.
[15] E. Grass, F. Winkler, M. Kristc, A. Julius, C. Staht, M. Piz: Enhanced GALS techniques for datapath applications, Proc. International
Workshop Integrated Circuit and System Design, Power and Timing
Modeling, Optimization and Simulation (PATMOS), pp. 581–590, 2005.
[16] X. Fan, M. Krstic, E. Grass: Analysis and optimisation of pausible
clocking based on GALS design, Proc. International Conference on
Computer Design (ICCD), 2009.
[17] C. Petri: Kommunikation mit automaten (Communicating with automata), University of Bonn, PhD Thesis, 1962.
[18] C. Brej: Wagging logic: Implicit parallelism extraction using asynchronous methodologies, Proc. International Conference on Application
of Concurrency to System Design (ACSD), pp. 35–44, 2010.
[19] R. Rivest, A. Shamir, L. Adleman: A method for obtaining digital
signatures and public-key cryptosystems, Communications of the ACM,
21(2), pp. 120–126, 1978.
[20] A. Lenstra, J. Hughes, M. Augier, J. Bos, T. Kleinjung, C. Wachter:
Public keys. Proc. International Cryptology Conference (CRYPTO),
pp. 626–642, 2012.
[21] D. Shang, F. Burns, A. Koelmans, A. Yakovlev, F. Xia: Asynchronous
system synthesis based on direct mapping using VHDL and Petri nets,
Proc. IEE Computers and Digital Techniques, 151(3), pp. 209–220,
2004.
[22] V. Fischer, M. Drutarovský, M. Šimka, F. Celle, U. J. Monnet: A
simple PLL-based true random number generator for embedded digital
systems, Computing and Informatics, pp. 56, 2004.
[23] T. Tkacik: A hardware random number generator, Cryptographic
Hardware and Embedded Systems (CHES), pp. 450–453, 2002.
[24] T. Murata: Petri nets: properties, analysis and applications, Proc.
IEEE, 77(4), pp. 541–580, 1989.
[25] I. Poliakov, D. Sokolov, A. Mokhov: Workcraft: a static data flow
structure editing, visualisation and analysis tool, Proc. International
Conference on Application and Theory of Petri Nets (ATPN), pp. 505–514, 2007.
[26] G. Miller: Riemann's hypothesis and tests for primality, Proc. ACM
Symposium on Theory of Computing (STOC), pp. 234–239, 1975.
