
292 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 2, NO. 3, SEPTEMBER 1994

Design of a Pipelined Datapath Synthesis System for Digital Signal Processing

Hong-Shin Jun and Sun-Young Hwang, Member, IEEE

Abstract-In this paper, we describe the design of SODAS-DSP (Sogang Design Automation System-DSP), a pipelined datapath synthesis system targeted for application-specific DSP chip design. Through facilitated user interaction, the design space of pipelined datapaths for given design descriptions can be explored to produce an optimal design which meets the design constraints. Taking SFGs (Signal Flow Graphs) in schematic form as inputs, SODAS-DSP generates pipelined datapaths through scheduling and module allocation processes. New scheduling and module allocation algorithms are proposed for efficient synthesis of pipelined hardware. The proposed scheduling algorithm is of iterative/constructive nature, where the measure of equidistribution of operations among pipeline partitions is adopted as the objective function. Module allocation is performed in two passes: the first pass for initial allocation and the second one for reduction of interconnection cost. In the experiments, we compare the synthesis results for benchmark examples with those of recent pipelined datapath synthesis systems, Sehwa and PISYN, and show the effectiveness of SODAS-DSP.

I. INTRODUCTION

RECENTLY, efforts in design automation to produce VLSI efficiently have been made, and the abstraction level of the design automation tools used by designers has become higher. By designing VLSI at a higher level, designers can complete a design faster and thus have a better chance of hitting the market window for that design. A good synthesis tool can produce several designs from the same specification in a reasonable amount of time. It allows designers to explore different tradeoffs between area and speed at a higher level of abstraction, and to take an existing design and produce a functionally equivalent one that is faster and less expensive. Logic synthesis tools are widely used for commercial chip design, and high-level synthesis has become a hot issue out of this necessity [2], [4], [12].

High-level synthesis is a process that generates RTL (Register Transfer Level) hardware from behavioral descriptions. The process consists of two major steps: scheduling and module allocation. In the scheduling process, a specific time step is determined for each of the operations which appear in the input description, such that design constraints (area, delay, power consumption, etc.) are satisfied while the other factors are optimized. In this process, possible sharing of hardware resources is determined. In the module allocation process, operations and variables are assigned to functional and memory units, and interconnections among those modules are constructed with buses and multiplexors.

The higher the abstraction level of design automation, the wider the design space to explore [12]. It is very difficult to devise an efficient high-level synthesis algorithm that provides exhaustive search possibilities in the design space. To overcome this problem, the application area of a high-level synthesis system is restricted to a certain target architecture so that the system can synthesize the datapath optimized in the frame of the target architecture.

DSP (Digital Signal Processing) has very wide application areas, such as speech, audio, and image processing, and its importance is well recognized. DSP algorithms involve linear and nonlinear arithmetic operations on multi-dimensional signals. The most important aspect of DSP is real-time processing. A signal processor must be specially designed to execute repetitive real-time algorithms with a constant data rate. So, minimization of the chip area subject to a worst-case data rate must be considered when designing a synthesis system for DSP.

Synthesis of VLSI architectures from behavioral descriptions of DSP algorithms has been addressed in a number of systems: Sehwa [13], PISYN [8], HAL [15], SPAID [5], and CATHEDRAL [1]. The Sehwa and PISYN systems support pipelined datapaths. They employ a list scheduling algorithm using priority functions based on urgency and freedom, respectively. Because of the local nature of those functions, they produce inefficient results in the usage of FU's (Functional Units). The HAL system uses a global force-directed scheduling algorithm, which constructs a schedule for one operation in each iteration so as to equidistribute operations among control steps for maximal hardware utilization. However, the scheduling algorithm does not consider the effects introduced by the scheduling of an operation on the scheduling of the other operations. CATHEDRAL is a typical architecture-driven synthesis system. CATHEDRAL contains four synthesizers for different target architectures, chosen based on the algorithm characteristics and on the throughput specifications: CATHEDRAL-I for bit-serial architectures [6], CATHEDRAL-II for µ-coded multi-processor architectures, CATHEDRAL-III for cooperating datapath architectures suitable for high-speed recursive algorithms, and finally CATHEDRAL-IV (currently under development) for regular array architectures [1]. CATHEDRAL-II, which supports pipelined datapaths, performs scheduling using the ILP (integer linear programming) method. The ILP method generates optimal results but requires very long computation time. Thus, it is inadequate to design a large system with CATHEDRAL-II.

Manuscript received June 18, 1993; revised October 12, 1993 and December 28, 1993. This work was supported in part by the Korea Science and Engineering Foundation under Grant 91014009 and Samsung Electronics Co.
The authors are with the Department of Electronic Engineering, Sogang University, Seoul 100-611, Korea.
IEEE Log Number 9403166.
1063-8210/94$04.00 © 1994 IEEE


JUN AND HWANG: PIPELINED DATAPATH SYNTHESIS SYSTEM 293

Fig. 1. Space-time diagram for a five-stage pipeline with DII = 2.

The SPAID system produces a datapath which uses N buses, each of which is connected to a dedicated register file. The target architecture of SPAID is very simple and regular. However, it is difficult to obtain high-speed computations, because operands must be fetched from register files and the results of operations must also be transferred to register files through buses in each time step.

In this paper, the design of SODAS-DSP, a high-level synthesis system for application-specific DSP chip designs, is described. Targeting pipelined datapaths with fixed DII (Data Initiation Interval [9], [14]), efficient scheduling and module allocation algorithms are proposed. The proposed scheduling algorithm is of iterative/constructive nature. The measure of equidistribution of operations among the pipeline partitions is modeled by an 'entropy' function, and the approximate values of its derivatives are used as the priority function to distribute operations. Module allocation consists of two passes: initial allocation and allocation improvement. Allocation is performed iteratively to reduce the interconnection cost from the initial allocation. Section II presents the target architecture and design methodology of SODAS-DSP. The proposed scheduling and module allocation algorithms are described in Sections III and IV, respectively. Section V presents the experimental results for benchmark examples, and conclusions are drawn in Section VI.

II. TARGET ARCHITECTURE AND DESIGN METHODOLOGY

A. Target Architecture

A DSP algorithm performs a sequence of operations which applies similar computation blocks to consecutively initiated data [3]. The pipelining technique is essential for high-speed DSP applications. The target architecture of this work is the pipelined datapath with fixed DII, like that of Sehwa and PISYN.

Fig. 1 shows the space-time diagram for a five-stage pipeline with DII = 2, where Ij represents the j-th task input and Fi represents the function executed in stage Si. At time 1, task I1 is entered into the pipeline, and functions F1 and F2 are performed at stages S1 and S2 on I1 in two clock cycles. At time 3, a new task I2 is initiated and functions F1 and F2 are performed on it, while functions F3 and F4 are performed on I1. At time 5, another task is entered into the pipeline, while I1 and I2 are served at stages S5 and S3/S4, respectively.

If there is no resynchronization overhead caused by pipeline hazards, the average performance gain of a pipeline depends on the DII and the clock cycle time [9]. Opportunities for hardware sharing can be increased, at the cost of degraded pipeline performance, by increasing the DII. The area-performance trade-off in pipeline designs can be controlled by changing the synthesis parameters: DII, clock cycle time, and number of pipeline stages. Through careful scheduling of operations to pipeline stages and allocation of hardware modules, high utilization of hardware modules can be achieved.

In Fig. 1, we can see that functions F1, F3, and F5 are executed at the same time, as are functions F2 and F4. We can group the functions into two clusters: {F1, F3, F5} and {F2, F4}. Functions belonging to the same cluster cannot be executed on the same FU, while functions belonging to different clusters can be. Pipeline partitions are defined to represent the sets of time-overlapping stages whose functions are executed concurrently on consecutive data. In Fig. 1, the sets {S1, S3, S5} and {S2, S4} are the two pipeline partitions. It is noticeable that the number of pipeline partitions is equal to the DII. It is impossible for operations belonging to the same pipeline partition to share hardware resources, while operations belonging to different pipeline partitions are allowed to do so. Most pipeline synthesis systems employ scheduling and module allocation algorithms exploiting this fact.

The target architecture of SODAS-DSP is the pipelined datapath with fixed DII. For the sharing of an FU between operations, proper interconnections must be provided. Only multiplexors are supported for interconnection in SODAS-DSP, with the belief that high-speed execution cannot be achieved with buses. The target architecture consists of FU's in tandem with a proper amount of storage, as illustrated in Fig. 2. It is obtained by assigning functions F1, F2 to FU1 and F3, F4 to FU2, and by establishing interconnections with multiplexors and latches. At time 2n+1, FU1, FU2, and FU3 perform functions F1, F3, and F5, respectively, on the data produced by their predecessor stages. The results generated by FU1 and FU2 are stored into latches located at their output ports. At time 2n, functions F2 and F4 are executed in FU1 and FU2, respectively. The result produced by FU3 is the final output.
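The grouping of stages into DII pipeline partitions described above can be sketched as follows (an illustrative snippet of our own, not part of SODAS-DSP; the function name is hypothetical):

```python
def pipeline_partitions(num_stages, dii):
    """Group stages into DII pipeline partitions: stage i joins partition
    i mod DII.  Stages in the same partition execute concurrently on
    consecutive data, so their operations can never share a functional unit."""
    partitions = {k: [] for k in range(dii)}
    for i in range(1, num_stages + 1):
        partitions[i % dii].append(i)
    return partitions

# Five-stage pipeline with DII = 2, as in Fig. 1:
parts = pipeline_partitions(5, 2)
# one partition holds S1, S3, S5; the other holds S2, S4
```

With DII = 2 the number of partitions is 2, matching the observation that the number of pipeline partitions equals the DII.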
B. Design Methodology

The overall configuration and synthesis methodology of SODAS-DSP is shown in Fig. 3. The user interface of the system consists of two major parts: SFG View and Datapath View. Taking design descriptions in SFGDL (Signal Flow Graph Description Language) and in schematic SFG, the SFG View manages the entire design process through a menu-driven user interface. Datapath View displays the datapaths generated by the system. Communication between these two views is performed by an interprocess communication mechanism. It allows the designers to examine the correspondence between the operations in the SFG View and the functional modules in the Datapath View.

Fig. 3. Overall configuration and synthesis methodology of SODAS-DSP.

Design descriptions verified through simulation are handed over to the synthesizer together with design constraints in area and/or time. Design constraints are refined into synthesis constraints (such as DII, number of stages, and module set) more suitable for synthesis. Depending on the constraints, the appropriate pipeline scheduling, time-constrained or area-constrained, is performed to generate an optimal pipeline schedule. The scheduling result displayed in the SFG View can be modified by the designer. Datapaths are established through module allocation and displayed in the Datapath View. Under the fixed pipeline schedule, the module allocation process determines the sharing of FU's and builds interconnections among FU's. Designers can also modify the pattern of FU sharing through the Datapath View. If the design constraints are not satisfied in the synthesized datapath, the entire process is repeated with new synthesis constraints and module sets. The final design satisfying the given design constraints is simulated for verification. Through the user-interactive synthesis of SODAS-DSP, the design space of pipelined datapaths for given design descriptions can be explored to produce an optimal design which meets the design constraints.

The SFG View also generates the call pattern graph (CPG), which contains the information on the SFG hierarchy, to manage interactive hierarchical design. Each RTL module is constructed by the corresponding module synthesizer under the assumption that all submodules are already synthesized and in the library. Traversing the CPG from the leaf modules, functional modules are synthesized such that all the design constraints are satisfied.

C. Design Descriptions

A design can be described in SFG using the schematic editor, or in SFGDL, a textual representation for SFG. Fig. 4 shows an SFG displayed in the SFG View and its textual representation:

INPORT x(8) ;
OUTPORT y(8) ;
NET s1, s2, s4, a1, a2, b0, b1, b2, m1, m2, m3, m4, m5, z1, z2 ;
NODE A1 ; CELL CONSTANT ; OUTLIST a1 ; END ;
NODE A2 ; CELL CONSTANT ; OUTLIST a2 ; END ;
NODE B0 ; CELL CONSTANT ; OUTLIST b0 ; END ;
NODE B1 ; CELL CONSTANT ; OUTLIST b1 ; END ;
NODE B2 ; CELL CONSTANT ; OUTLIST b2 ; END ;
NODE mult1 ; CELL MULT ; INLIST a1, z1 ; OUTLIST m1 ; END ;
NODE mult2 ; CELL MULT ; INLIST a2, z2 ; OUTLIST m2 ; END ;
NODE mult3 ; CELL MULT ; INLIST s1, b0 ; OUTLIST m3 ; END ;
NODE mult4 ; CELL MULT ; INLIST z1, b1 ; OUTLIST m4 ; END ;
NODE mult5 ; CELL MULT ; INLIST z2, b2 ; OUTLIST m5 ; END ;
NODE add1 ; CELL ADD ; INLIST x, s2 ; OUTLIST s1 ; END ;
NODE add2 ; CELL ADD ; INLIST m2, m1 ; OUTLIST s2 ; END ;
NODE add3 ; CELL ADD ; INLIST m3, s4 ; OUTLIST y ; END ;
NODE add4 ; CELL ADD ; INLIST m5, m4 ; OUTLIST s4 ; END ;
NODE z1 ; CELL Z ; INLIST s1 ; OUTLIST z1 ; END ;
NODE z2 ; CELL Z ; INLIST z1 ; OUTLIST z2 ; END ;

Fig. 4. Representations of the second-order IIR filter (a) in SFG and (b) in SFGDL.

In the figure, nodes represent operations and edges represent the signal flows between nodes. To enhance designers' productivity, design descriptions in higher-level HDL's, such as Silage [7] and VHDL [18], are also to be supported.

III. SCHEDULING ALGORITHM FOR PIPELINED DATAPATHS

To achieve pipelining, the input task must be divided into a sequence of subtasks, each of which can be executed by a dedicated hardware stage operating concurrently with the other stages in the pipeline. Pipeline scheduling is a process that assigns operations to pipeline stages [13]. The goal of this process can be either maximizing the speed while area constraints are satisfied, or minimizing the total area cost while the time constraints are satisfied. Two scheduling algorithms, under time constraints and under area constraints, are devised and described in this section.

A. Objective Function

Throughout the scheduling process, the time frame intervals of all the operations in the SFG are calculated and maintained as a scheduling state. At an intermediate state of the scheduling process, each operation in the SFG has its time frame interval, [b_opn, e_opn].
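The [b, e] time frame intervals are obtained by a forward ASAP pass over predecessors and a backward ALAP pass over successors; a minimal sketch of our own, assuming unit-delay operations and a toy dependence graph (not the paper's SFG):

```python
def time_frames(preds, succs, num_stages):
    """Compute [b, e] for each operation: b by an ASAP pass over
    predecessors, e by an ALAP pass over successors (unit-delay stages).
    Both dicts are assumed to list operations in topological order."""
    b, e = {}, {}
    for op in preds:                      # forward (ASAP) pass
        b[op] = 1 + max((b[p] for p in preds[op]), default=0)
    for op in reversed(list(succs)):      # backward (ALAP) pass
        e[op] = min((e[s] for s in succs[op]), default=num_stages + 1) - 1
    return {op: (b[op], e[op]) for op in preds}

# A small chain-and-branch DAG scheduled into 4 stages:
preds = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
succs = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
frames = time_frames(preds, succs, 4)   # a: (1, 2), d: (3, 4)
```

An operation with frame [b, e] can be assigned to any stage between b and e; operations with b = e are fixed.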
The objective function of the scheduling is defined as the measure of equidistribution of operations among the pipeline partitions and can be calculated from the time frame intervals of the operations in the SFG. The probability of the operations of type 'OP' belonging to stage i, p_OP(i), is the normalized form of the distribution graph [15] and is given by (1), where N_OP is the number of operations of type 'OP' and Prob(opn, i) is the probability of an operation 'opn' being scheduled at stage i.

p_OP(i) = Σ_{opn ∈ OP} Prob(opn, i) / N_OP,   i = 1, ..., max_stage   (1)

where

Prob(opn, i) = 1/(e_opn - b_opn + 1) for b_opn ≤ i ≤ e_opn, and 0 otherwise.

The probability that operations of type 'OP' belong to pipeline partition k is given in (2), where the sum is taken over all the stages in the pipeline partition. Pipeline partition k (0 ≤ k ≤ DII - 1) is the set of stages S_i (1 ≤ i ≤ max_stage) whose index i satisfies i modulo DII = k. A functional unit can be shared among the operations belonging to different pipeline partitions. Thus, the number of functional units in the final implementation is equal to the maximum number of operations in a pipeline partition.

P_OP(k) = Σ_{i s.t. i modulo DII = k} p_OP(i),   k = 0, ..., DII - 1   (2)

The measure of equidistribution for each type of operations is defined by an entropy function and given by (3).

H(OP) = - Σ_{k=0}^{DII-1} P_OP(k) log P_OP(k) / log DII   (3)

The value of H(OP) lies between 0 and 1. H(OP) becomes 1 when all the pipeline partitions have the same probability. If the probability that the operations of a type belong to a certain pipeline partition is 1, i.e., when all the operations are concentrated on a certain pipeline partition, H(OP) becomes 0. The objective function at a scheduling state S is defined by the weighted sum of the entropy functions over all operation types and is given by (4),

OF(S) = Σ_OP H(OP) · w(OP)   (4)

where the weight w(OP) for operation type 'OP' is defined by the area and the number of appearances of the 'OP' type operations in the SFG. The maximal sharing of functional units can be achieved by maximizing the objective function.

Fig. 5. A scheduling state where the time frame interval of each operation is indicated.

Fig. 5 shows a scheduling state, where the time frame interval of each operation is indicated in the SFG. The values in square brackets [b, e] are obtained by the ASAP and ALAP schedulings, respectively. An operation with the time frame interval [b, e] can be assigned to any stage between b and e. The DG value at each stage can be obtained by summing the reciprocals of the lengths of the time frame intervals of the operations that can be scheduled at the stage. For example, the DG value for add operations at stage 3, DG+(3), is obtained as follows:

DG+(3) = 0.333 (due to +5) + 0.333 (due to +6) + 0.25 (due to +7) + 0.25 (due to +8) + 1 (due to +9) + 0.5 (due to +10) = 2.666.

Fig. 6. (a) DG's for add and multiply operations. (b) Probabilities for add and multiply operations in each pipeline partition.

Fig. 6(a) shows the DG for the SFG of Fig. 5. The probabilities that add and multiply operations belong to each pipeline partition are shown in Fig. 6(b). The probability of add operations in pipeline partition 1 is the sum of the probabilities at stages 1 and 4, i.e., P+(1) = p+(1) + p+(4) = 4.166/15 + 2.5/15 = 0.44. The measure of equidistribution is calculated using (3) as follows.

H(+) = -(P+(0) log P+(0) + P+(1) log P+(1) + P+(2) log P+(2)) / log DII
     = -(0.28 log(0.28) + 0.44 log(0.44) + 0.28 log(0.28)) / log 3
     = 0.977

H(*) = -(P*(0) log P*(0) + P*(1) log P*(1) + P*(2) log P*(2)) / log DII
     = -(0.15 log(0.15) + 0.58 log(0.58) + 0.27 log(0.27)) / log 3
     = 0.909

OF(S) = H(+) w(+) + H(*) w(*) = 0.977 × 15 + 0.909 × 8 = 21.93

In Fig. 6(b), the probability of add operations is more balanced than that of multiply operations. Hence, the add operation has a larger entropy value: 0.977 for add operations versus 0.909 for multiply operations. The value of the objective function is determined to be 21.93, where the weights for add and multiply operations are given by 15 and 8, respectively.
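The entropy objective of (3) and (4) can be sketched as below (our own minimal re-implementation; the partition probabilities and the weight are taken from the worked add-operation example):

```python
from math import log

def entropy(partition_probs):
    """H(OP) of (3): normalized entropy of an operation type's probability
    of falling into each pipeline partition; 1 = perfectly balanced,
    0 = all probability concentrated in one partition."""
    dii = len(partition_probs)
    return -sum(p * log(p) for p in partition_probs if p > 0) / log(dii)

def objective(types):
    """OF(S) of (4): weighted sum of entropies over operation types.
    `types` maps an operation type to (partition_probs, weight)."""
    return sum(entropy(probs) * w for probs, w in types.values())

# Add operations from the worked example: P+(0..2) = 0.28, 0.44, 0.28, w(+) = 15.
h_add = entropy([0.28, 0.44, 0.28])   # close to 0.977
```

Maximizing this objective pushes each operation type toward an even spread over the partitions, which is what permits maximal FU sharing.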
B. Neighbor States and Their Priority Function

As scheduling progresses, the time frame interval of each operation gets tighter. A pair of neighbor states, Sb_opn and Se_opn, of a scheduling state S are defined as the scheduling states obtained by reducing the time frame interval of an operation from [b_opn, e_opn] to [b_opn + 1, e_opn] and [b_opn, e_opn - 1], respectively. There can be 2N neighbor states for an SFG with N operations. The priority function is defined for each neighbor state. The gain of a neighbor state is proportional to the derivative of the objective function. The priority function of the proposed iterative/constructive algorithm is the linear approximation of the derivative of the objective function over the time frame interval and is given by (5).

Priority Function(opn) = (Mx - Mn)/(e_opn - b_opn)   (5)

where

[b_opn, e_opn]: time frame interval of an operation opn in a scheduling state.
Sb_opn, Se_opn: neighbor states for operation opn whose time frame interval is changed to [b_opn + 1, e_opn] and [b_opn, e_opn - 1], respectively.
Mx = MAX(OF(Sb_opn), OF(Se_opn)), Mn = MIN(OF(Sb_opn), OF(Se_opn)).

Fig. 7. A pair of neighbor states obtained by changing the time frame interval of *5. (a) Se*5 and (b) Sb*5.

Fig. 7 shows the neighbor states which are induced by changing the time frame interval of *5 from the initial scheduling state shown in Fig. 5. When the time frame interval of *5 is changed to [2, 3], the time frame intervals of its predecessor nodes are affected, and the time frame interval of +5 becomes [1, 2], as shown in Fig. 7(a). Fig. 7(b) shows the other neighbor state, Sb*5, obtained by changing the time frame interval of *5. In this case, the time frame intervals of its successor nodes are not modified, because they are already on the critical path. In general, if the time frame interval of an operation, [b_opn, e_opn], is changed to [b_opn + 1, e_opn], the probability is decreased by 1/(e_opn - b_opn + 1)N_OP for the operations at stage b_opn and increased by 1/(e_opn - b_opn + 1)(e_opn - b_opn)N_OP for the operations at the stages from b_opn + 1 to e_opn. Similarly, if [b_opn, e_opn] is changed to [b_opn, e_opn - 1], the probability is reduced by 1/(e_opn - b_opn + 1)N_OP for the operations at stage e_opn and increased by 1/(e_opn - b_opn + 1)(e_opn - b_opn)N_OP for the operations at the stages from b_opn to e_opn - 1. The probability that operations belong to each pipeline partition is determined by this property and is shown in Table I. The objective function is calculated by (4) as follows:

OF(Sb*5) = 0.973 × 15 + 0.819 × 8 = 21.14
OF(Se*5) = 0.977 × 15 + 0.904 × 8 = 21.89

The priority function(*5) = (21.89 - 21.14)/(4 - 2) = 0.375.

C. Pipeline Scheduling Algorithm Under Time Constraints

Time constraints are specified as, or converted into, constraints on the DII and/or the number of pipeline stages, either by the user or by the system. The proposed scheduling algorithm generates a schedule that uses the minimum number of FU's while meeting the given time constraints. The algorithm is of iterative/constructive nature in that it constructs the schedule incrementally. From the initial state, in which the time frame intervals are set by the ASAP and ALAP schedules, a neighbor state with the highest priority function is selected in each iteration. The scheduling process under time constraints on DII and the number of stages is summarized as follows:

Step 1: Set the initial scheduling state, where the time frame interval of each operation is determined by the ASAP and ALAP schedules.
Step 2: Calculate the priority function for each neighbor state.
        For each unfixed operation opn do
            Calculate the objective function for the two neighbor states, OF(Sb_opn) and OF(Se_opn).
            Calculate the priority function PF(opn).
Step 3: Make a transition to the neighbor state with the highest priority function.
Step 4: If there remain unfixed nodes, go to Step 2.

The time complexity of the time-constrained scheduling algorithm is determined as follows. Let the number of nodes in the SFG be N, and the number of control steps be c. Step 1 takes O(N) time for finding the ASAP and ALAP schedules. The number of neighbor states whose priority functions are calculated, each in O(N) time, is 2N. The time complexity of Step 2 is therefore O(N^2). The outermost loop (from Step 2 to Step 4) is repeated until all nodes are fixed, i.e., until b_opn is equal to e_opn for each operation opn. The loop count amounts to cN. Thus, the total time complexity of the algorithm becomes O(cN^3). The time complexity is the same as that of PISYN in the worst case.
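One selection step of the algorithm can be sketched as follows (our own condensed illustration; in the real system the OF values come from re-evaluating (4) over the whole SFG for each neighbor state, which we do not reproduce here):

```python
def priority(of_sb, of_se, b, e):
    """Priority function of (5): linear approximation of the objective
    function's derivative over the operation's time frame [b, e]."""
    return (max(of_sb, of_se) - min(of_sb, of_se)) / (e - b)

def best_neighbor(candidates):
    """One pass of Steps 2-3: `candidates` maps an unfixed operation to
    (OF(Sb), OF(Se), b, e); move to the neighbor with highest priority,
    shrinking the interval from whichever side gives the larger OF."""
    op = max(candidates, key=lambda o: priority(*candidates[o]))
    of_sb, of_se, _, _ = candidates[op]
    side = "b" if of_sb >= of_se else "e"
    return op, side

# Values for *5 from the running example: OF(Sb)=21.14, OF(Se)=21.89, [b,e]=[2,4]
p = priority(21.14, 21.89, 2, 4)   # 0.375, as in the text
```

Each transition fixes the interval a little more; iterating until every b equals its e yields the final stage assignment.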
Fig. 8. An example schedule under FU constraints. (a) Initial scheduling state obtained by a nonpipelined list scheduling. (b) After the first iteration. (c) After the second iteration (final result).

D. Scheduling Algorithm Under Area Constraints

For the synthesis of non-pipelined datapaths satisfying area or FU constraints, list scheduling is widely employed. In list scheduling, operations in a control step are deferred to the next control step when the number of operations in the control step exceeds the number of available FU's. A variation of list scheduling is proposed for pipelined datapaths. The DII is selected as MAX(⌈N_OP/C_OP⌉) over all operation types to maximize the performance of the synthesized datapaths and to simplify the complexity of the algorithm. Here, C_OP is the allowed number of FU's for 'OP' type operations and N_OP is the number of 'OP' type operations. The scheduling algorithm under FU constraints is summarized below:

Step 1: 1) Set b_opn to the result of the non-pipelined list scheduling under the given constraints.
        2) Set e_opn to the result of the ALAP schedule.
Step 2: count = 1.
Step 3: If (FU constraints are met in all pipeline partitions) then stop.
Step 4: k = count % DII.
Step 5: While (# FU's in pipeline partition k exceeds the constraints) do
            Find all operations that satisfy b_opn % DII = k to form a candidate list.
            If (all operations in the candidate list are on the critical path) then increment e_opn for each operation by one.
            Calculate the priority function of Se_opn for all the operations in the candidate list.
            Take the Se with the highest priority as the next scheduling state.
Step 6: count = count + 1, and go to Step 3.

Fig. 8 shows a schedule of the FIR filter with the FU constraints of 5 adders and 3 multipliers. The DII is decided by max(⌈N+/C+⌉, ⌈N*/C*⌉) = max(⌈15/5⌉, ⌈8/3⌉) = 3. Fig. 8(a) is the initial scheduling state obtained by the list schedule, where operations in six stages and in three pipeline partitions are shown in separate boxes. The initial schedule uses 7 adders and 3 multipliers when executed in pipelined fashion. It exceeds the available adders by 2. At pipeline partition P1, two add operations, +4 and +5, in the first stage are selected according to the priority function and deferred to the second stage. The assignment of each operation is then changed as shown in Fig. 8(b). The final results shown in Fig. 8(c) are obtained by deferring operations +7 and +8 in pipeline partition P2.

E. Mutual Exclusion Support

Hardware resources can be shared among the operations in different conditional blocks by the mutual exclusion property. To detect mutually exclusive blocks, condition vectors (CV's) are employed [17]. Each operation is tagged with a CV, which is defined to be a bitwise encoding of the conditions under which the operation is performed. Let the CV's of operations op1 and op2 be CV1 and CV2. When one bit of CV1 is '1' ('0') and the corresponding bit of CV2 is '0' ('1'), operations op1 and op2 are mutually exclusive.

Mutual exclusion can be supported by modifying (1) when calculating the probability of each type of operations belonging to each stage. For the stages in which the time frame intervals of mutually exclusive operations intersect, only one operation, the one with the highest probability, contributes to the corresponding p_OP. The probability p_OP for a conditional block with branches can be calculated recursively and is given by (6), where p_OP,b(i) is the probability that operations of 'OP' type in branch b belong to stage i.

The objective function of (4) represents the measure of equidistribution of operations over the pipeline partitions k, where the summation of p_OP(i) over all stages is unity. The reduction of hardware due to conditional sharing is reflected in the reduction of p_OP by (6). The weight modified to reflect this fact is given in (7), where (1 - Σ_{i=1}^{max_stage} p_OP(i)) · N_OP is the degree of conditional sharing.

Fig. 9 shows an example of an SFG containing FORK/JOIN nodes representing conditional branches. It is scheduled under the constraints that the number of stages is 6 and the DII is 3. The pipeline stage number to which an operation node is assigned is labeled to the left side of the node. Operations add3, add5, and add6, whose CV's are '10XXX', '0X0XX', and '11XXX', respectively, are assigned to stage 3. Referring to the first bit of the CV's of add3 and add5, we can find that they are mutually exclusive.
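The DII selection rule used at the start of the FU-constrained algorithm above can be sketched as (a direct transcription of the formula; the function and argument names are our own):

```python
from math import ceil

def select_dii(op_counts, fu_limits):
    """DII = max over operation types of ceil(N_OP / C_OP): the smallest
    initiation interval at which the allowed FU's can keep up with the
    operation load of each type."""
    return max(ceil(op_counts[t] / fu_limits[t]) for t in op_counts)

# FIR example of Fig. 8: 15 adds on 5 adders, 8 multiplies on 3 multipliers.
dii = select_dii({"+": 15, "*": 8}, {"+": 5, "*": 3})   # max(3, 3) = 3
```

Any smaller DII would require more FU's of some type than the constraints allow, since each partition must absorb at least ⌈N_OP/DII⌉ operations of that type.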
Operations add3 and add6 are likewise mutually exclusive, as are operations add5 and add6. Thus, those operations are allowed to share an adder. Operation add8, assigned to stage 6, which belongs to the same pipeline partition as stage 3, requires another adder for pipelined execution. The overall resource usage for the schedule is shown in Table II. This result is somewhat different from that of Sehwa, but the same amounts of FU's, two adders and two subtractors, are required.

A time-stationary control scheme is popularly employed for the proper control of pipelines with or without conditional branches [11]. Without conditional branches, the pipeline controller has as many states as the value of the DII. However, control design for pipelines with conditional branches is somewhat complicated. The controller must also remember the current pipeline state in order to provide control signals to each pipeline stage occupied by multiple overlapping tasks. Thus, the number of states is increased to cover all possible execution patterns according to the branch conditions of each task in the pipeline. In Fig. 9, stage 1 always performs the add1 operation. On the contrary, stage 4 has 8 different execution patterns depending on the execution of the sub5, sub6, and add4 operations. The control states for pipeline partition P1 are made from the combinations of the possible execution patterns of stages s1 and s4. By recognizing all the possible execution patterns, control can be designed for pipelines with conditional branches without any resynchronization overheads.

Fig. 9. A pipeline schedule for the example with conditional branches.

TABLE II. PATTERN OF FU SHARING FOR THE SCHEDULE IN FIG. 9.

IV. MODULE ALLOCATION ALGORITHM

A. Determination of Cost Function

After assigning an FU to each operation and providing interconnections among the FU's using multiplexors and/or latches, the module allocation process generates datapaths. In this process, the number of FU's is fixed to the maximum number of operations in each pipeline partition, which is determined by the scheduling result. The interconnection cost, modeled by the numbers of multiplexor inputs and latches, is adopted as the cost function, as shown in (8).

Cost Function = (α · # mux inputs) + (β · # latches)   (8)

The values of α and β are the abstract areas of a multiplexor input and a latch, respectively. The values are determined by the library used for synthesis. The default values are 0.5 and 0.5.

A functional unit fetches its operands through multiplexors to perform the operations assigned to the unit. If FU1 performs operation op1 and its operands are produced by operation op2 in FU2, an interconnection between FU1 and FU2 is required. For proper timing, latches must be inserted into the communication path between those FU's. It takes as many latches as the difference of the stage numbers, i.e., |stage_op2 - stage_op1|. It is possible for FU's to share an interconnection when the operand sources and the numbers of required latches are not different. The fanin set of the p-th input port of functional unit F, FS(F, p), to which operation op2 is assigned, is defined as follows:

FS(F, p) = {(S, c)}, where S is the functional unit assigned to an operation op1 having a data dependency relation with op2, and c = |stage_op2 - stage_op1|.

The fanin set is determined by the information on the operand sources and the number of latches for every signal flow to each FU port. The size of a fanin set is equal to the number of multiplexor inputs.

Fig. 10(a) shows a scheduled SFG with DII = 2, and Fig. 10(b) presents a possible assignment (not optimal) of operations to FU's. Operations +1 and +4 are assigned to FU1 and operations +2 and +3 are assigned to FU2, while operations *1 and *2 are assigned to FU3. For the left port of FU3, operations *1 and *2 need the results of operations +1 (assigned to FU1) and +3 (assigned to FU2). The fanin set for the left port, FS(FU3, 1), is {(FU1, 1), (FU2, 1)}. The values of the 'c' field are obtained by |stage_*1 - stage_+1| = |2 - 1| = 1 and |stage_*2 - stage_+3| = |3 - 2| = 1. From the calculation, it is determined that a 2-input multiplexor is required for the
JUN AND HWANG: PIPELINED DATAPATH SYNTHESIS SYSTEM 299

Fig. 10. An example of calculation of the number of multiplexor inputs. (a) A scheduled SFG. (b) The fanin set and the number of required multiplexor inputs.

Fig. 11. A method of sharing latches (interconnections FU -> D1: 3 latches, FU -> D2: 2 latches, FU -> D3: no latch).
left port. Investigating the fanin sets for all the FU ports, the
total number of multiplexor inputs is determined.
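To make this bookkeeping concrete, the fanin sets and shared latch counts for the Fig. 10 example can be sketched in Python. The data layout (an `ops` dictionary recording each operation's functional unit, pipeline stage, and the source operation feeding each input port) is our illustration, not the system's internal representation; the stages assumed for +2 and +4 are chosen to be consistent with the latch counts discussed in the text.

```python
from collections import defaultdict

# Illustrative encoding of the Fig. 10 assignment: each operation records
# its functional unit, pipeline stage, and the operations feeding its ports.
ops = {
    "+1": {"fu": "FU1", "stage": 1, "srcs": []},
    "+2": {"fu": "FU2", "stage": 1, "srcs": []},  # stage assumed
    "+3": {"fu": "FU2", "stage": 2, "srcs": []},
    "+4": {"fu": "FU1", "stage": 3, "srcs": []},  # stage assumed
    "*1": {"fu": "FU3", "stage": 2, "srcs": ["+1", "+2"]},
    "*2": {"fu": "FU3", "stage": 3, "srcs": ["+3", "+4"]},
}

def fanin_set(fu, port):
    """FS(fu, port): pairs (source FU, latch count = |stage difference|)."""
    fs = set()
    for op in ops.values():
        if op["fu"] == fu and port < len(op["srcs"]):
            src = ops[op["srcs"][port]]
            fs.add((src["fu"], abs(op["stage"] - src["stage"])))
    return fs

# The size of a fanin set gives the multiplexor inputs of that port:
# FS(FU3, 0) = {("FU1", 1), ("FU2", 1)} -> a 2-input multiplexor.

# Latch sharing (Fig. 11): latches on an FU's output are shared, so the
# count is the maximum stage difference over all of its destinations.
needed = defaultdict(list)
for op in ops.values():
    for s in op["srcs"]:
        needed[ops[s]["fu"]].append(abs(op["stage"] - ops[s]["stage"]))
shared_latches = {fu: max(counts) for fu, counts in needed.items()}
```

With this encoding, one latch suffices on the outputs of FU1 and of FU2, matching the discussion of Fig. 12 in the text.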
Fig. 11 shows a method of sharing latches. When interconnections from an FU to destination points Dj, for j = 1, 2, 3, need N_fu.dj latches, the number of latches for the complete interconnection is equal to the maximum of N_fu.dj for j = 1, 2, 3. In the figure, for example, interconnections from the FU to destination points D1, D2, D3 need 3, 2, 0 latches, respectively. Here, three latches are sufficient for the required data transfer.

Fig. 12. An example state transition. (a) Initial allocation state. (b) After exchanging the FU assignment of operations +2 and +4.

B. Module Allocation Algorithm

The proposed algorithm performs allocation in two passes. Taking the scheduling result, a possible assignment of operations to FU's is performed at initial allocation. Each possible assignment is represented as an allocation state in which the interconnection cost of multiplexors and latches is determined. As module allocation progresses, the assignment is refined to reduce the value of the cost function defined in (8) through state transitions. Possible state transitions are generated by a pairwise exchange of operations and/or by operand swapping. Among those transitions, the most beneficial one is accepted as the next state. The overall module allocation algorithm is summarized as follows:

Step 1: Perform initial allocation.
Step 2: Find all the possible state transitions.
Step 3: Take the most beneficial one, i.e., a state transition with minimal cost, among the possible state transitions.
Step 4: If the cost is reduced, then make the state transition and go to Step 2.

An operation opn assigned to FU1 can be moved to FU2 if the operation can be executed in FU2 and no operations are assigned to FU2 in the pipeline partition that opn belongs to. However, possible moves of operations hardly exist when operations are scheduled such that maximal sharing of FU's is achieved. Two operations are selected to exchange their assignment to FU's. Assume that operations op1 and op2 are assigned to FU1 and FU2, respectively. Operations op1 and op2 are candidates for pairwise exchange if both operations can be performed in FU1 and FU2, and op1 can be moved to FU2 and op2 can be moved to FU1 simultaneously. An example state transition is shown in Fig. 12. The initial state of assignment is shown in Fig. 12(a). Two 2-input multiplexors, MUX1 and MUX2, are necessary for the ports of FU3, where the size of the fanin set for each port is |FS(FU3, 1)| = 2 and |FS(FU3, 2)| = 2. MUX1 selects the results of operations +1 and +3 for the left operand of FU3, and MUX2 selects the results of +2 and +3 for the right port of FU3. Only one latch is sufficient for the results of FU1, because the results of FU1 generated by operations +1 and +4 must be transferred to MUX1 and MUX2 through one latch. Similarly, one latch is required for the results of FU2. There are two candidates for exchange, (+2, +4) and (+1, +3), in the initial allocation state. A new state obtained by the exchange of the operation pair, (+2, +4), is shown in Fig. 12(b). The number of required latches is equal to that of the initial state, but the number of multiplexor inputs is reduced by 4. This is due to the fact that no multiplexor is required for the input ports of FU3, as shown in the fanin set of Fig. 12(b).

In addition, operand swapping is also considered as a candidate for state transition. Every commutative operation can swap its operands to reduce the size of the fanin set. Adopting operand swapping, the interconnection cost is reduced by 12% in the final datapath implementation.
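The refinement loop of Steps 1-4 with pairwise exchange can be sketched as follows. This is our minimal reading of the algorithm, not the SODAS-DSP source: `cost` stands for the function of (8) evaluated on a candidate assignment, `can_run` encodes which FU's can execute an operation, and `partition` gives the pipeline partition of an operation (an exchange is legal only if no FU ends up with two operations in the same partition). Operand swapping would add further candidate transitions of the same shape; it is omitted here for brevity.

```python
import itertools

def conflict_free(assignment, partition):
    """True if no FU executes two operations in the same pipeline partition."""
    seen = set()
    for op, fu in assignment.items():
        key = (fu, partition(op))
        if key in seen:
            return False
        seen.add(key)
    return True

def refine(assignment, cost, can_run, partition):
    """Greedy pairwise-exchange refinement (Steps 2-4 of the text)."""
    current = cost(assignment)
    while True:
        best, best_cost = None, current
        # Step 2: enumerate all legal pairwise exchanges from the current state.
        for a, b in itertools.combinations(list(assignment), 2):
            fa, fb = assignment[a], assignment[b]
            if fa == fb or not (can_run(a, fb) and can_run(b, fa)):
                continue
            assignment[a], assignment[b] = fb, fa          # trial exchange
            if conflict_free(assignment, partition):
                c = cost(assignment)
                if c < best_cost:                          # Step 3: most beneficial
                    best, best_cost = (a, b), c
            assignment[a], assignment[b] = fa, fb          # undo trial
        if best is None:                                   # Step 4: no gain -> stop
            return assignment, current
        a, b = best
        assignment[a], assignment[b] = assignment[b], assignment[a]
        current = best_cost
```

The hill-climbing structure mirrors the text: each iteration evaluates every legal exchange, commits the cheapest one, and terminates when no exchange reduces the cost.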
300 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 2, NO. 3, SEPTEMBER 1994

TABLE IV
SYNTHESIS RESULTS FOR THE FIFTH-ORDER ELLIPTIC WAVE FILTER. (a) WITH # STAGES = 9. (b) WITH # STAGES = 10.
[Rows: # multipliers, # adders, # MUX inputs, # registers, CPU time (sec); columns: DII = 1 to 9 in (a) and DII = 1 to 10 in (b).]

Fig. 13. Scheduling of the 16-point FIR filter with # stages = 6, DII = 3. (a) Result by Sehwa and PISYN. (b) Result by SODAS-DSP.

TABLE III
SYNTHESIS RESULTS FOR THE 16-POINT FIR FILTER (# STAGES = 6).
[Rows: # multipliers, # adders, # MUX inputs, # registers, CPU time (sec); columns: DII = 1 to 6.]

V. EXPERIMENTAL RESULTS
SODAS-DSP has been implemented in the C language on a SUN SPARC-I workstation running UNIX. Experiments are performed on the benchmark examples: the 16-point FIR filter, the fifth-order elliptic wave filter, and the FDCT (Fast Discrete Cosine Transform) kernel. Those are taken from the systems reported in the literature [2], [10], [16] for comparison purposes.

A. Synthesis for Benchmark Examples


Synthesis is performed on the 16-point FIR filter. Fig. 13(a) shows the scheduling result generated by Sehwa and PISYN under the constraints of # stages = 6 and DII = 3. Fig. 13(b) is the result by SODAS-DSP. Those figures show that the same amounts of functional units (5 adders and 3 multipliers) are required. In Fig. 13(b), it is observed that the number of latches between add1 and mult1 is equal to that required between add3 and mult3, and between add5 and mult5. If operations add1 and add3 are assigned to the same adder and operations mult1 and mult3 are assigned to the same multiplier, one multiplexor input and two latches can be saved by sharing the fanin set.

Fig. 14. Synthesis results for the mutual exclusion example shown in Fig. 9.

The synthesis results are summarized in Table III, where comparisons with those of PISYN are presented. The number of stages is fixed at 6 while varying DII from 1 to 6, and the numbers of hardware modules (multipliers, adders, registers, and multiplexor inputs) are presented in the table. There is no difference in the usage of the functional units. However, the interconnection cost, represented by the sum of the numbers of multiplexor inputs and latches, is reduced by 9% in SODAS-DSP. In this experiment, the values of α and β are set to 0.5.

The fifth-order elliptic wave filter is a more substantial benchmark program with 8 multiplications and 26 additions [16]. The synthesis results with the 9 and 10 stages are presented in Table IV. Usage of the FU's is the same in most

Fig. 15. SFG of the FDCT kernel in SFG view.

cases. When the number of stages is set to 10, one multiplier and one adder are saved with DII = 3, and one adder is saved with DII = 8 and DII = 10. The interconnection cost is also reduced by 20%. This validates the use of the cost function reflecting the interconnection cost in the proposed two-pass module allocation process. The runtime difference is thought to be due to the time-consuming module allocation process in PISYN, despite the same time complexity for scheduling in PISYN and SODAS-DSP.

TABLE V
PATTERN OF SHARING FU'S FOR THE DATAPATH SHOWN IN FIG. 16.
[Columns (ALU1 through ALU6, etc.) list the subtraction operations sub1-sub13 and multiplication operations mult1-mult15 shared on each unit.]

B. Experiment with Mutual Exclusion Example

Synthesis is performed for the mutual exclusion example of Fig. 9 under the constraints of # stages = 6 and DII = 3. The scheduling result is shown in the figure. The result of module allocation and the synthesized datapath are presented in Table II and Fig. 14, respectively. The datapath consists of two adders, two subtractors, and ROM's containing coefficients. For the interconnection between those FU's, 31 multiplexor inputs and 17 latches are used. Its controller has 25 states and 18 output signals.

C. Synthesis of FDCT (Fast Discrete Cosine Transform) Kernel

The SFG of the FDCT kernel [10] in SFG View is shown in Fig. 15. Library modules used in the synthesis process are a 20 ns adder, a 20 ns subtractor, and an 80 ns multiplier. A pipelined datapath has been generated with the following constraints: clock cycle = 100 ns, # stages = 12, and DII = 8. The datapath consists of 2 adders and subtractors, 2 multipliers, 2 coefficient ROM's, 66 multiplexor inputs, and 47 latches, as shown in Fig. 16. The same amounts of FU's are required when comparing with the results in [10]. The detailed hardware usage is presented in Table V.

Fig. 16. Synthesized datapath for the FDCT kernel in Datapath view.

have been proposed. The proposed scheduling algorithms, under time constraints and/or area constraints, have the objective function adopting the measure of equidistribution of operations among pipeline partitions. The time-constrained scheduling algorithm is of iterative/constructive nature, where the derivative of the objective function is used as the priority function. A variation of the list scheduling is proposed for the synthesis of pipelined datapaths under area constraints. The proposed module allocation algorithm iteratively improves the interconnection cost (the numbers of multiplexor inputs and latches) from an initial allocation by pairwise exchange of operations and operand swapping. In the experiments, we showed that SODAS-DSP generates efficient pipelined datapaths compared with the systems reported previously in the literature. For the dramatic enhancement of designer's productivity, efforts are being made to support high-level HDL's. Silage is now supported, and research is being continued to support VHDL.

ACKNOWLEDGMENT

The authors would like to thank their colleagues Y. Lee, who developed a VHDL simulator, and M. Hyun, who designed a powerful VHDL synthesis system to build up a fancy working environment in the CAD & Computer Systems Laboratory of Sogang University. The authors also wish to thank the anonymous reviewers for their constructive comments to improve the quality of this paper.

REFERENCES

[1] H. De Man et al., "Architecture driven synthesis technique for VLSI implementation of DSP algorithms," Proc. IEEE, vol. 78, no. 2, pp. 319-335, Feb. 1990.
[2] R. Camposano and W. Wolf, High-Level VLSI Synthesis. New York: Kluwer Academic, 1991.
[3] F. Catthoor and H. De Man, "Application specific architectural methodologies for high throughput digital signal and image processing," IEEE Trans. ASSP, vol. 38, no. 2, pp. 339-349, Feb. 1990.
[4] D. D. Gajski, Silicon Compilation. Reading, MA: Addison Wesley, 1988.
[5] B. S. Haroun and M. I. Elmasry, "Architectural synthesis for DSP silicon compilers," IEEE Trans. Computer-Aided Design, vol. 8, no. 4, pp. 431-447, April 1989.
[6] R. I. Hartley and J. R. Jasica, "Behavioral to structural translation in a bit-serial silicon compiler," IEEE Trans. Computer-Aided Design, vol. 7, no. 8, pp. 877-886, Aug. 1988.
[7] P. Hilfinger, "A high level language and silicon compiler for digital signal processing," in Proc. Custom Integrated Circuits Conf., May 1985, pp. 213-216.
[8] K. Hwang and A. E. Casavant, "Scheduling and hardware sharing in pipelined data paths," in Proc. ICCAD, Nov. 1989, pp. 24-27.
[9] P. M. Kogge, The Architecture of Pipelined Computers. New York: McGraw-Hill, 1982.
[10] D. J. Mallon and P. B. Denyer, "A new approach to pipeline optimisation," in Proc. EDAC, March 1990, pp. 83-88.
[11] J. Kim, F. Kurdahi, and N. Park, "Automatic synthesis of time-stationary controllers for pipelined datapath," in Proc. ICCAD-91, Nov. 1991, pp. 30-33.
[12] M. C. McFarland and A. C. Parker, "The high level synthesis of digital systems," Proc. IEEE, vol. 78, no. 2, pp. 301-318, Feb. 1990.
[13] N. Park and A. C. Parker, "Sehwa: A software package for synthesis of pipelines from behavioral specification," IEEE Trans. Computer-Aided Design, vol. 7, no. 3, pp. 356-370, Mar. 1988.
[14] N. Park, "Synthesis of high-speed digital systems," Ph.D. dissertation, Univ. Southern Calif., Oct. 1985.
[15] P. Paulin, "Force directed scheduling for the behavioral synthesis of ASIC's," IEEE Trans. Computer-Aided Design, vol. 8, no. 6, pp. 661-679, June 1989.
[16] G. Saucier and P. M. McLellan, Logic and Architectural Synthesis for Silicon Compilers. Amsterdam: North-Holland, Elsevier Science, 1989.
[17] K. Wakabayashi and T. Yoshimura, "A resource sharing and control synthesis method for conditional branches," in Proc. ICCAD, Nov. 1989, pp. 62-65.
[18] IEEE Standard VHDL Language Reference Manual, IEEE 1076-1987, April 1989.

Hong-Shin Jun received the B.S. and M.S. degrees in electronic engineering from Sogang University, Seoul, Korea, in 1989 and 1991, respectively. He is currently working toward the Ph.D. degree with the Department of Electronic Engineering of Sogang University. His research interests include silicon compilation and optimization in VLSI design.

Sun-Young Hwang (M'86) received the B.S. degree in electronic engineering from Seoul National University, Seoul, Korea, in 1976, the M.S. degree from Korea Advanced Institute of Science in 1978, and the Ph.D. degree in electrical engineering from Stanford University, CA, in 1986. From 1976 to 1981, he was with Samsung Semiconductor, Inc., Korea, where he designed several CMOS VLSI chips and managed a design section. Until 1988, he was with the Center for Integrated Systems at Stanford University, working on high-level synthesis and simulation system design. In 1986 and 1987, he held a consulting position at the Palo Alto Research Center of Fairchild Semiconductor Corporation. In 1989, he joined Sogang University, Seoul, Korea, where he is currently an Associate Professor of electronic engineering. His research interests include silicon compilation, VLSI design, and computer systems design.