GALS Partitioning by Behavioural in Petri Nets
I. INTRODUCTION
Current developments in multi-core system architectures
aim to exploit the potential of heterogeneous processing, where some components deal with predominantly
sequential imperative operations, while other cores perform
functionally specific data or signal processing with a large
amount of parallelism. In the latter, parallel execution on portions of data is natural for the type of
data and functions involved, including graphics (GPUs) or communications (DSP plus mixed-signal RF
circuitry). Ideally, the components of such a system operate
at their own function-determined pace (clock frequency), with
data-dependent timing (number of iterations) and synchronise
only occasionally to exchange the computed results (self-timed
communication mechanisms).
While the drive for functional heterogeneity is strong and already supported by leading IP providers (ARM's big.LITTLE
platform and vision for the Internet of Things), the handling of
non-functional aspects, such as timing,
is clearly lagging behind. Most of the techniques
for timing and communication are still simply inherited from
conventional homogeneous processing and remain in the
traditional forms of clocking, such as multi-clock domains,
clock gating and frequency scaling. The stumbling block
to greater diversification of the timing discipline is partly due to the
delayed take-up by EDA tools of innovations that introduce elasticity [1]
into these infrastructures. As a
result, attempting to leverage the functional diversification
increases the risk of significant under-performance of the overall system in terms of energy efficiency (power management,
such as clock/power gating, is only possible at a very coarse
granularity level) and speed (both throughput and latency).
1522-8681/14 $31.00 © 2014 IEEE, DOI 10.1109/ASYNC.2014.11
In this work we investigate the following issue: what is
the key motivating factor for using GALS and what is the
correct granularity level of system behaviour at which to look for GALS
partitioning? Traditionally, the motivating factor for introducing
GALS was clock distribution problems (clock skew, clock tree
balancing, etc.), which resulted in building GALS purely on
structural or physical criteria. This approach
has not really succeeded because one can always find a way
to handle these problems within a clocked design style, even by inserting
synchronisers, which often introduce lower power and latency
overheads than GALS wrappers. Therefore we should look
for another, higher-level form of timing or synchronisation,
where a system may benefit from being partitioned into largely
independent subsystems that run at their own pace and
interact only occasionally, thus minimising the overheads of
GALS wrappers. With system timing at the heart of
GALS synthesis, a successful GALS-based design flow needs
a common timing discipline, unifying both the computational
elements and their interfaces, supported by a good formalism and by
analysis and optimisation algorithms. The main contribution
of our work on this pathway is as follows:
[Equation fragment; only the final case survives: $M(p)$ otherwise.]
Graphically, places of a Petri net are represented as circles, transitions as boxes, arcs as arrows, and tokens
are depicted by dots in the corresponding places.
[Figure residue; recoverable panel labels: (a) Schematic and (b) Model, with ports in and out; (a) Tight, N-2 stages.]
A. Coupled behaviour
[Figure residue for the key generator example; recoverable labels: gen, key, chk, prime1, prime2, public, private, n, tag.]
A. Abstract model
In this work we want to identify the boundaries of synchronised behaviour and make an informed decision about
the GALS partitioning of the design. Usually the structural
hierarchy would determine the boundaries of GALS islands;
however, we consider this a naive and sub-optimal approach.
Instead we start from an abstract Petri net model of the system
and go through the iterative refinement of its components
until we accumulate sufficient knowledge about its behavioural
dynamics to justify a partitioning decision.
An initial Petri net representation for the key generator is
captured by a pipeline-like structure, as shown in Figure 6b.
As was described in Section II, each component is modelled by
a sequence of *_in, *_calc and *_out transitions to represent
the input of data, the algorithmic core and the output of the
results. A token in a *_free place denotes the availability of
the corresponding component and prevents a new round of
computation until the current results are consumed by the next
pipeline stage.
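The pipeline structure described above can be sketched as a small token game in Python. This is an illustrative model only, not the net of Figure 6b: the place names *_busy, *_done and *_data, the helper names pipeline, enabled and fire, and the choice that a stage's *_free token is returned when the next stage's *_in transition consumes its result are all assumptions made for this sketch.

```python
from typing import Dict, List, Tuple

# A net maps each transition to (input places, output places); arc weights are 1.
Net = Dict[str, Tuple[List[str], List[str]]]
Marking = Dict[str, int]


def pipeline(names: List[str]) -> Net:
    """Build a chain of components, each modelled as *_in -> *_calc -> *_out."""
    net: Net = {}
    for i, s in enumerate(names):
        ins, outs = [f"{s}_free"], [f"{s}_busy"]
        if i > 0:
            prev = names[i - 1]
            ins.append(f"{prev}_data")    # result produced by the previous stage
            outs.append(f"{prev}_free")   # consuming it releases the previous stage
        net[f"{s}_in"] = (ins, outs)
        net[f"{s}_calc"] = ([f"{s}_busy"], [f"{s}_done"])
        net[f"{s}_out"] = ([f"{s}_done"], [f"{s}_data"])
    return net


def enabled(net: Net, marking: Marking, t: str) -> bool:
    """A transition is enabled when every input place holds a token."""
    return all(marking.get(p, 0) > 0 for p in net[t][0])


def fire(net: Net, marking: Marking, t: str) -> Marking:
    """Consume one token per input place, produce one per output place."""
    if not enabled(net, marking, t):
        raise ValueError(f"{t} is not enabled")
    new = dict(marking)
    for p in net[t][0]:
        new[p] -= 1
    for p in net[t][1]:
        new[p] = new.get(p, 0) + 1
    return new


if __name__ == "__main__":
    net = pipeline(["gen", "key", "chk"])
    m = {f"{s}_free": 1 for s in ("gen", "key", "chk")}
    for t in ("gen_in", "gen_calc", "gen_out"):
        m = fire(net, m, t)
    print(enabled(net, m, "gen_in"))   # gen is blocked until key consumes its result
    m = fire(net, m, "key_in")
    print(enabled(net, m, "gen_in"))   # gen_free has been returned
```

In this sketch the *_free token provides back-pressure: gen cannot start a second round while its result is still waiting in gen_data.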
B. Interface refinement
At the first transformation step the component interfaces
are refined, as shown in Figure 7. The communication between the gen and key components is captured by explicitly
modelling the generation of two prime numbers via transitions
C. Computation refinement
The shaded transitions gen_calc, key_calc and chk_calc,
which represent complex actions, are refined based on the
corresponding computation algorithms. Extraction of Petri nets
from behavioural specifications [21] has been tailored and
partially automated for the purpose of this research. The
pseudo-code for each of the complex actions is captured in
Algorithm 1 and the derived Petri net models are shown in
Figure 8. Let us consider the extraction of each model in
more detail. The names of Petri net transitions that correspond
to the discussed portions of the algorithm are referenced in
parentheses.
1) Generating a pair of prime numbers: A random number is generated (gen_random) and undergoes a primality test (gen_pt). The test result is subsequently analysed (gen_prime?) and if the number is prime (gen_p_true)
then it is passed to the output buffer (gen_out_prime1 for the
first instance and gen_out_prime2 for the second instance).
Otherwise, if the number is not prime (gen_p_false), a new
random number is generated and checked for primality. This
procedure repeats until two primes are produced. A Petri net
model of this algorithm is shown in Figure 8a.
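The generate-and-test loop described above can be sketched in Python as follows. The text does not name the primality test, so the Miller-Rabin test used here is an assumption, as are the function names, the forcing of the top and bottom bits of each candidate, and the returned iteration counter. Comments map each step to the transition names from the text.

```python
import random
from typing import Tuple


def is_probable_prime(n: int, rng: random.Random, rounds: int = 20) -> bool:
    """Miller-Rabin probabilistic primality test (assumed choice of test)."""
    if n < 2:
        return False
    for small in (2, 3, 5, 7, 11, 13):
        if n % small == 0:
            return n == small
    d, s = n - 1, 0
    while d % 2 == 0:          # write n - 1 as d * 2**s with d odd
        d //= 2
        s += 1
    for _ in range(rounds):
        a = rng.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False       # a is a witness that n is composite
    return True


def gen_prime_pair(bits: int, rng: random.Random) -> Tuple[int, int, int]:
    """Return two primes of the given bit-width and the number of loop iterations."""
    primes = []
    iterations = 0
    while len(primes) < 2:
        iterations += 1
        # gen_random: odd candidate with the top bit set (an assumption of this sketch)
        candidate = rng.getrandbits(bits) | (1 << (bits - 1)) | 1
        # gen_pt followed by gen_prime?
        if is_probable_prime(candidate, rng):
            primes.append(candidate)   # gen_p_true -> gen_out_prime1 / gen_out_prime2
        # gen_p_false: fall through and draw a new random number
    return primes[0], primes[1], iterations


if __name__ == "__main__":
    p, q, iterations = gen_prime_pair(64, random.Random(2014))
    print(p, q, iterations)
```

The iteration count is data-dependent: it varies from run to run with the random candidates drawn.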
2) Computing RSA keys: Two large primes p and q generated by the gen module are used to calculate the prime
product $n = p \cdot q$; the difficulty of factoring this product
is the core of public-key cryptography. Also, Euler's
totient function $f = \varphi(n)$ is computed; it gives the number
of positive integers in the range [1, n] which are co-prime to n.
As n is a product of presumably different prime numbers,
its Euler's totient function is $f = (p - 1) \cdot (q - 1)$. Note
that n and f can be computed concurrently provided two
multipliers are available (modelled by transitions key_calc_n
and key_calc_f respectively). Secondly, an integer e is randomly chosen (key_random) which needs to meet two requirements: (i) it must be smaller than f; and (ii) it must
be co-prime to f. The former requirement is easy to satisfy,
e.g. by limiting the bit-width of e so that it is definitely smaller
than f; it is known that a relatively short bit-width with
small Hamming weight (the number of non-zero bits) results
in more efficient encryption. The latter requirement is checked
by calculating a GCD of f and e (key_egcd) and comparing
func chk(n)
  addr_init(a)
  while not addr_last(a)
    m := mem_read(a)
    if gcd(n, m) ≠ 1 then return false
    addr_next(a)
  mem_write(a, n)
  return true
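The key computation of step 2 and the chk routine can be sketched together in Python. Several points are assumptions of this sketch rather than statements from the text: the function name make_keys and its parameters, the default bit-width of e, forcing e to be odd and greater than 1, and the computation of the private exponent d as the modular inverse of e (the standard RSA step from [19]). The chk function assumes that a candidate n is rejected when it shares a factor with any stored product, and it models the memory as a Python list.

```python
import math
import random
from typing import List, Optional, Tuple


def make_keys(p: int, q: int, e_bits: int = 17,
              rng: Optional[random.Random] = None
              ) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Return ((n, e), (n, d)) for primes p and q; e_bits must keep e below f."""
    rng = rng or random.Random()
    n = p * q                  # key_calc_n
    f = (p - 1) * (q - 1)      # key_calc_f, independent of key_calc_n
    while True:
        e = rng.getrandbits(e_bits) | 1          # key_random: short odd candidate
        if 1 < e < f and math.gcd(e, f) == 1:    # key_egcd: co-primality check
            break
    d = pow(e, -1, f)          # modular inverse, requires Python 3.8 or later
    return (n, e), (n, d)


def chk(n: int, memory: List[int]) -> bool:
    """Accept n only if it is co-prime to every stored product, then store it."""
    for m in memory:           # addr_init / addr_last / mem_read / addr_next
        if math.gcd(n, m) != 1:
            return False       # n shares a prime factor with a stored product
    memory.append(n)           # mem_write
    return True


if __name__ == "__main__":
    public, private = make_keys(61, 53, e_bits=8, rng=random.Random(0))
    n, e = public
    _, d = private
    message = 42
    print(pow(pow(message, e, n), d, n) == message)
    stored: List[int] = []
    print(chk(n, stored), chk(61 * 59, stored))
```

The loop in chk visits every stored product, so its iteration count grows with the number of products accumulated so far.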
rounded box) maps into the bursty consumer interaction pattern (see Section II-B).
The decoupled behaviour is of particular interest to us as
it means a prolonged independent computation with an occasional synchronisation for data exchange. Due to the dynamic
nature of the synchronicity between the *_loop segments we
cannot just insert deep enough buffers to compensate for the
slack between them. Indeed, the number of iterations before
synchronisation is data-dependent and also unpredictable due
to the randomness of the initial data (gen_random and
key_random). Moreover, the system bottleneck is temporary
and shifts between the clusters. Intuitively, the gen_loop lags
behind while there are only a few stored prime products to
check co-primality with. However, once a critical number
of prime products is accumulated, the chk_loop becomes the
bottleneck. This behavioural pattern is a good candidate either
for coarse-grain clock gating or for GALS partitioning. Let us
analyse which option is preferable in the case of the key generator
example.
The dynamics of decoupled interaction can be efficiently
predicted by analysing the complexity of the underlying algorithms and deriving the expected number of iterations each
component performs before producing its output. Let us
demonstrate this technique by estimating the number of iterations when computing N-bit RSA keys. In cryptography the
bit-width of the data operated on is significantly larger than the normal
register size (e.g. 512-bit or 1024-bit numbers), therefore it
makes sense to estimate the iteration count of algorithms
relative to the bit-width N of their operands (this is known
as bit complexity analysis).
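As a rough illustration of estimating an iteration count relative to operand bit-width, the sketch below compares a prime number theorem estimate with a measured average for the prime-generation loop. It is not the derivation used in this paper: the estimate of about k ln 2 random draws per k-bit prime, the function names and the trial-division test (practical only for small bit-widths) are assumptions made for this example.

```python
import math
import random


def expected_trials(bits: int) -> float:
    """Prime number theorem estimate: a random bits-bit integer is prime
    with probability of roughly 1 / (bits * ln 2)."""
    return bits * math.log(2)


def is_prime(n: int) -> bool:
    """Trial division, adequate only for the small widths used in this demo."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def average_trials(bits: int, samples: int, rng: random.Random) -> float:
    """Measure the mean number of random draws needed to hit one prime."""
    total = 0
    for _ in range(samples):
        trials = 1
        # Draw bits-bit candidates with the top bit set until one is prime.
        while not is_prime(rng.getrandbits(bits) | (1 << (bits - 1))):
            trials += 1
        total += trials
    return total / samples


if __name__ == "__main__":
    print(expected_trials(16), average_trials(16, 2000, random.Random(0)))
    # For N = 1024 the two primes are N/2 = 512 bits wide.
    print(expected_trials(512))
```

Under this estimate the expected number of draws grows linearly with the bit-width of the primes, while the count observed in any single run remains random.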
Firstly, two N/2-bit primes need to be produced. The
[Figure residue; recoverable labels: gen_loop, key_loop, chk, key_calc_n, key_calc_f, key_public_private, toggle, buf, tag, C, input port, output port, prime, e,d, public (pub), private (priv), N iterations, KN iterations, M iterations.]
ACKNOWLEDGEMENTS
The authors are grateful to the anonymous reviewers for all
the criticism, inspiring comments and valuable suggestions on
how to further develop this work.
This research was supported by EPSRC grant EP/I038551/1
Globally Asynchronous Elastic Logic Synthesis (GAELS).
REFERENCES
[1] M. Galceran-Oms, A. Gotmanov, J. Cortadella, M. Kishinevsky: Microarchitectural transformations using elasticity, ACM Journal on Emerging Technologies in Computing Systems (JETC), 7(4), pp. 18:1–18:24, 2011.
[2] D. Chapiro: Globally asynchronous locally synchronous systems, PhD thesis, 1984.
[3] D. Bormann, P. Cheung: Asynchronous wrapper for heterogeneous systems, Proc. International Conference on Computer Design (ICCD), pp. 307–314, 1997.
[4] K. Yun, A. Dooply: Pausible clocking based heterogeneous systems, IEEE Transactions on VLSI Systems, 7, pp. 482–487, 1999.
[5] R. Mullins, S. Moore: Demystifying data-driven and pausible clocking schemes, Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems (ASYNC), pp. 175–185, 2007.
[6] P. Teehan, M. Greenstreet, G. Lemieux: A survey and taxonomy of GALS design styles, IEEE Design & Test of Computers, 24(5), pp. 418–428, 2007.
[7] A. Iyer, D. Marculescu: Power and performance evaluation of globally asynchronous locally synchronous processors, Proc. International Symposium on Computer Architecture (ISCA), pp. 158–168, 2002.
[8] M. Krstic, E. Grass, F. Gürkaynak, P. Vivet: Globally asynchronous, locally synchronous circuits: overview and outlook, IEEE Design & Test of Computers, 24(5), pp. 430–441, 2007.
[9] M. Horak, S. Nowick, M. Carlberg, U. Vishkin: A low-overhead asynchronous interconnections network for GALS chip multiprocessors, IEEE Trans. on CAD, 30(4), pp. 494–507, 2011.
[10] D. Ludovici, A. Strano, G. Gaydadjiev, L. Benini, D. Bertozzi: Design space exploration of a mesochronous link for cost-effective and flexible GALS NoCs, Proc. Conference on Design, Automation and Test in Europe (DATE), pp. 679–684, 2010.
[11] X. Fan, M. Krstic, E. Grass: Performance analysis of GALS datalink based on pausible clocking, Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems (ASYNC), pp. 126–133, 2012.
[12] A. Hemani, T. Meincke, S. Kumar, A. Postula, T. Olsson, P. Nilsson, J. Oberg, P. Ellervee, D. Lundqvist: Lowering power consumption in clock by using GALS design style, Proc. Design Automation Conference (DAC), pp. 873–878, 1999.
[13] J. Muttersbach, T. Villiger, W. Fichtner: Practical design of globally-asynchronous locally-synchronous systems, Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems (ASYNC), pp. 52–59, 2000.
[14] N. Jindapetch, H. Saito, K. Thongnoo, T. Nanya: A fair overhead comparison between asynchronous four-phase protocol based controllers and local clock controllers, Proc. International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), pp. 791–794, 2005.
[15] E. Grass, F. Winkler, M. Krstic, A. Julius, C. Stahl, M. Piz: Enhanced GALS techniques for datapath applications, Proc. International Workshop on Integrated Circuit and System Design, Power and Timing Modeling, Optimization and Simulation (PATMOS), pp. 581–590, 2005.
[16] X. Fan, M. Krstic, E. Grass: Analysis and optimisation of pausible clocking based on GALS design, Proc. International Conference on Computer Design (ICCD), 2009.
[17] C. Petri: Kommunikation mit Automaten (Communicating with automata), University of Bonn, PhD thesis, 1962.
[18] C. Brej: Wagging logic: Implicit parallelism extraction using asynchronous methodologies, Proc. International Conference on Application of Concurrency to System Design (ACSD), pp. 35–44, 2010.
[19] R. Rivest, A. Shamir, L. Adleman: A method for obtaining digital signatures and public-key cryptosystems, Communications of the ACM, 21(2), pp. 120–126, 1978.