FPGA Design Automation: A Survey
Deming Chen, Jason Cong, and Peichen Pan
FPGA Design Automation: A Survey is an up-to-date, comprehensive survey/tutorial of FPGA design automation, with an emphasis on the developments within the past 5 to 10 years. The focus is on the theory and techniques that have been, or most likely will be, reduced to practice. It covers all major steps in the FPGA design flow: routing and placement, circuit clustering, technology mapping and architecture-specific optimization, physical synthesis, RT-level and behavior-level synthesis, and power optimization.
FPGA Design Automation: A Survey can be used both as a guide for beginners who are embarking on research in this relatively young yet exciting area, and as a useful reference for established researchers in this field.
This book was originally published as Foundations and Trends® in Electronic Design Automation 1:3 (2006).
FPGA Design
Automation: A Survey
Deming Chen
Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign
dchen@uiuc.edu
Jason Cong
Department of Computer Science
University of California at Los Angeles
cong@cs.ucla.edu
Peichen Pan
Magma Design Automation, Inc.
Los Angeles, CA
peichen@magma-da.com
Boston Delft
Foundations and Trends® in Electronic Design Automation
Published, sold and distributed by:
now Publishers Inc.
PO Box 1024
Hanover, MA 02339
USA
Tel. +1-781-985-4510
www.nowpublishers.com
sales@nowpublishers.com
Outside North America:
now Publishers Inc.
PO Box 179
2600 AD Delft
The Netherlands
Tel. +31-6-51115274
A Cataloging-in-Publication record is available from the Library of Congress
The preferred citation for this publication is Deming Chen, Jason Cong and Peichen Pan, FPGA Design Automation: A Survey, Foundations and Trends® in Electronic Design Automation, vol 1, no 3, pp 195–330, November 2006
Printed on acid-free paper
ISBN: 1-933019-38-7
© November 2006 Deming Chen, Jason Cong and Peichen Pan
All rights reserved. No part of this publication may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, mechanical, photocopying, recording
or otherwise, without prior written permission of the publishers.
Photocopying. In the USA: This journal is registered at the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923. Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by now Publishers Inc for users registered with the Copyright Clearance Center (CCC). The services for users can be found on the internet at: www.copyright.com

For those organizations that have been granted a photocopy license, a separate system of payment has been arranged. Authorization does not extend to other kinds of copying, such as that for general distribution, for advertising or promotional purposes, for creating new collective works, or for resale. In the rest of the world: Permission to photocopy must be obtained from the copyright owner. Please apply to now Publishers Inc., PO Box 1024, Hanover, MA 02339, USA; Tel. +1 781 871 0245; www.nowpublishers.com; sales@nowpublishers.com

now Publishers Inc. has an exclusive license to publish this material worldwide. Permission to use this content must be obtained from the copyright license holder. Please apply to now Publishers, PO Box 179, 2600 AD Delft, The Netherlands, www.nowpublishers.com; e-mail: sales@nowpublishers.com
Foundations and Trends® in Electronic Design Automation
Volume 1 Issue 3, November 2006
Editorial Board
Editor-in-Chief:
Sharad Malik
Department of Electrical Engineering
Princeton University
Princeton, NJ 08544
USA
sharad@princeton.edu
Editors
Robert K. Brayton (UC Berkeley)
Raul Camposano (Synopsys)
K.T. Tim Cheng (UC Santa Barbara)
Jason Cong (UCLA)
Masahiro Fujita (University of Tokyo)
Georges Gielen (KU Leuven)
Tom Henzinger (EPFL)
Andrew Kahng (UC San Diego)
Andreas Kuehlmann (Cadence Berkeley Labs)
Ralph Otten (TU Eindhoven)
Joel Phillips (Cadence Berkeley Labs)
Jonathan Rose (University of Toronto)
Rob Rutenbar (CMU)
Alberto Sangiovanni-Vincentelli (UC Berkeley)
Leon Stok (IBM Research)
Editorial Scope
Foundations and Trends® in Electronic Design Automation
will publish survey and tutorial articles in the following topics:
System Level Design
Behavioral Synthesis
Logic Design
Verification
Test
Physical Design
Circuit Level Design
Reconfigurable Systems
Analog Design
Information for Librarians
Foundations and Trends® in Electronic Design Automation, November 2006, Volume 1, 4 issues. ISSN paper version 1551-3939. ISSN online version 1551-3947. Also available as a combined paper and online subscription.
Foundations and Trends® in Electronic Design Automation
Vol. 1, No. 3 (November 2006) 195–330
© November 2006 Deming Chen, Jason Cong and Peichen Pan
DOI: 10.1561/1000000003
FPGA Design Automation: A Survey
Deming Chen¹, Jason Cong² and Peichen Pan³

¹ Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, dchen@uiuc.edu
² Department of Computer Science, University of California at Los Angeles, cong@cs.ucla.edu
³ Magma Design Automation, Inc., Los Angeles, CA, peichen@magma-da.com
Abstract
Design automation or computer-aided design (CAD) for field programmable gate arrays (FPGAs) has played a critical role in the rapid advancement and adoption of FPGA technology over the past two decades. The purpose of this paper is to meet the demand for an up-to-date, comprehensive survey/tutorial for FPGA design automation, with an emphasis on the recent developments within the past 5–10 years. The paper focuses on the theory and techniques that have been, or most likely will be, reduced to practice. It covers all major steps in the FPGA design flow, which includes: routing and placement, circuit clustering, technology mapping and architecture-specific optimization, physical synthesis, RT-level and behavior-level synthesis, and power optimization. We hope that this paper can be used both as a guide for beginners who are embarking on research in this relatively young yet exciting area, and as a useful reference for established researchers in this field.
Keywords: computer-aided design; FPGA design
Contents
1 Introduction
1.1 Introduction to FPGA Architectures
1.2 Overview of FPGA Design Flow

2 Routing and Placement for FPGAs
2.1 Routing
2.2 Placement and Floorplanning
2.3 Combined Placement and Routing

3 Technology Mapping
3.1 Preliminaries
3.2 Structural Mapping Framework
3.3 Structural Mapping Algorithms
3.4 Integrated Technology Mapping
3.5 Specialized Mapping Algorithms

4 Physical Synthesis
4.1 Logic Clustering
4.2 Placement-Driven Mapping
4.3 Placement-Driven Duplication
4.4 Other Techniques

5 Design and Synthesis with Higher Level of Abstraction
5.1 RTL Synthesis
5.2 Behavior-Level Synthesis

6 Power Optimization
6.1 Sources of Power Consumption
6.2 Power Estimation
6.3 Power Breakdown
6.4 Synthesis for Power Optimization
6.5 Synthesis for Power-Efficient Programmable Architectures

7 Conclusions and Future Trends

References
1 Introduction
The semiconductor industry has showcased spectacular exponential growth in device complexity and performance for four decades, as predicted by Moore's Law. Programmable logic devices (PLDs), especially field programmable gate arrays (FPGAs), have also experienced exponential growth over the past 20 years, in fact at an even faster pace than the rest of the semiconductor industry. For example, when FPGAs first debuted in the mid- to late-1980s, the Xilinx XC2064 FPGA had only 64 lookup tables (LUTs) and was used as simple glue logic. Now, both Altera's Stratix II [10] and Xilinx's Virtex-4 [205] chips offer over 200,000 programmable logic cells (i.e., LUTs), plus a large number of hard-wired macro blocks such as embedded memories, DSP blocks, embedded processors, high-speed I/Os, and clock synchronization circuitry, representing an over 3,000-fold increase in logic capacity. These FPGA devices are being used to implement highly complex system-on-a-chip (SoC) designs. To support the design of such complex programmable devices, computer-aided design (CAD) plays a critical role in delivering high-performance, high-density, and low-power design solutions using these high-end FPGAs. We witnessed the
establishment of FPGA design automation as a research area and a dramatic increase in research activity in this area over the past 17–18 years. However, there is a lack of comprehensive references on the latest FPGA CAD research. Most existing books (e.g., [23, 27, 92, 150, 186]) and survey/tutorial papers (e.g., [28, 52]) in this area are 10–15 years old and do not reflect the vast amount of recent research on FPGA CAD. The purpose of this paper is to meet the demand for a comprehensive survey/tutorial on the state of FPGA CAD, with an emphasis on the recent developments that have taken place within the past 5–10 years and a focus on the theory and techniques that have been, or most likely will be, reduced to practice. We hope that this paper can be useful for both beginners and established researchers in this exciting and dynamic field.
In the remainder of this section, we first briefly introduce some typical FPGA architectures and define the basic terminology that will be used in the rest of this paper. We then provide an overview of the FPGA design flow.
1.1 Introduction to FPGA Architectures
An FPGA chip includes input/output (I/O) blocks and the core programmable fabric. The I/O blocks are located around the periphery of the chip, providing programmable I/O connections and support for various I/O standards. The core programmable fabric consists of programmable logic blocks and programmable routing architectures. Figure 1.1 shows a high-level view of an island-style FPGA [23], which represents a popular architectural framework that many commercial FPGAs are based on, and is also a widely accepted architecture model in the FPGA research community. Logic blocks, represented by gray squares, consist of circuitry for implementing logic; they are also called configurable logic blocks (CLBs). Each logic block is surrounded by routing channels connected through switch blocks and connection blocks. The wires in the channels are typically segmented, and the length of each wire segment can vary. A switch block connects wires in adjacent channels through programmable switches such
as pass-transistors or bi-directional buffers.

Fig. 1.1 An island-style FPGA [23].

A connection block connects the wire segments around a logic block to its inputs and outputs,
also through programmable switches. Notice that the structures of the switch blocks are all identical. The figure illustrates the different switching and connecting situations in the switch blocks (the structures of all the connection blocks are identical as well). In [23], routing architectures are defined by the parameters of channel width (W), switch block flexibility (F_s, the number of wires to which each incoming wire can connect in a switch block), connection block flexibility (F_c, the number of wires in each channel to which a logic block input or output pin can connect), and segmented wire lengths (the number of logic blocks a wire segment spans). Modern FPGAs also provide embedded IP cores, such as memories, DSP blocks, and processors, to facilitate the implementation of SoC designs.
Commercial FPGA chips contain a large number of dedicated interconnects with different fixed lengths. These interconnects are usually point-to-point, uni-directional connections for performance improvement. For example, Altera's Stratix II chip [10] has vertical or horizontal interconnects spanning 4, 16, or 24 logic blocks. There are dedicated carry chain and register chain interconnects within and between
logic blocks as well. Xilinx's Spartan-3E chip [204] has long lines, hex lines, double lines, and direct connections between the logic blocks. These lines cross different numbers of logic blocks. Specifically, the direct connect lines can route signals to neighboring tiles vertically, horizontally, and diagonally. For example, Figure 1.2 shows the direct connect lines (a) and hex lines (b) between a CLB and its neighbors in the Spartan-3E chip. The use of segmented routing makes the FPGA interconnect delays highly non-linear, discrete, and in some cases even non-monotone (with respect to distance). This presents unique challenges for FPGA placement and routing tools, because a simple interconnect delay model using the Manhattan distance between the source and the sink may not work well any more. Accurate interconnect delay modeling is essential for meaningful performance-driven physical design tools for FPGAs.
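To make the non-monotone behavior concrete, here is a toy sketch contrasting a linear Manhattan-distance delay model with a table-driven model of a segmented fabric. The delay numbers are entirely made up for illustration; real tools characterize such tables from the actual device.

```python
# Hypothetical per-distance delays (ps) for a segmented fabric.  A
# length-4 segment reaches distance 4 in one hop, so distance 4 is
# *faster* than distance 3 (two hops): the non-monotone case above.
DELAY_TABLE = {1: 120, 2: 180, 3: 260, 4: 210, 5: 300, 6: 340}

def manhattan_delay(dx, dy, unit_delay=65):
    """Naive linear model: delay proportional to Manhattan distance."""
    return (abs(dx) + abs(dy)) * unit_delay

def table_delay(dx, dy):
    """Table-driven model capturing segmented-wire effects."""
    return DELAY_TABLE[abs(dx) + abs(dy)]
```

Under the table model, `table_delay(4, 0) < table_delay(3, 0)`, something no linear Manhattan model can express.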
Fig. 1.2 Direct connect lines (a) and hex lines (b) in Xilinx Spartan-3E architecture [204].

Further down the logic hierarchy, each logic block contains a group of basic logic elements (BLEs), where each BLE contains a LUT¹ and a register.

Fig. 1.3 A logic block and its peripheries.

Figure 1.3 shows part of a logic block with a block size of N (the logic block contains N BLEs). The logic block has I inputs and N outputs. These inputs and outputs are fully connected to the inputs of each LUT through multiplexers. The figure also shows some details of the peripheral circuitry in the routing channels.
In addition to logic and routing architectures, the clock distribution network is another important aspect of FPGA chips. An H-tree based FPGA clock network is shown in Fig. 1.4 [130]. A tile is a logic block. Each clock tree buffer in the H-tree has two branches. There is a local clock buffer for each flip-flop in a tile. Both the clock tree buffers in the H-tree and the local clock buffers in the tiles are considered clock network resources. Chip area, tile size, and channel width determine the depth of the clock tree and the lengths of the tree branches.
¹ We focus on the LUT-based FPGA architecture in which the BLE consists of one k-input lookup table (k-LUT) and one flip-flop. The output of the k-LUT can be either registered or un-registered. We want to point out that commercial FPGAs may use slightly different logic architectures. For example, Altera's Stratix II FPGA [10] uses an adaptive logic module which contains a group of LUTs and a pair of flip-flops.
Fig. 1.4 A clock tree [130].
There is another type of programmable logic device called the complex programmable logic device (CPLD). The general architecture topology of a CPLD chip is similar to that of island-based FPGAs, where routing resources surround logic blocks. One attribute of CPLDs is that their interconnect structures are simpler than those of FPGAs; therefore, the interconnect delay of a CPLD is more predictable than that of an FPGA. The basic logic elements in the CPLD logic blocks are not LUTs. Instead, they are logic cells based on two-level AND-OR structures, where a fixed number of AND gates (also called p-terms) drive an OR gate. The output from the OR gate can be registered as well. For example, Fig. 1.5 shows such a structure (called a macrocell) used in Altera's MAX7000B CPLD [6]. Each macrocell has five p-terms by default, and it can borrow some p-terms from its neighbors. The interconnect structure PIA (programmable interconnect array) connects the different logic blocks together.
1.2 Overview of FPGA Design Flow
As the FPGA architecture evolves and its complexity increases, CAD software has become more mature as well. Today, most FPGA vendors provide a fairly complete set of design tools that allows automatic synthesis and compilation from design specifications in hardware description languages, such as Verilog or VHDL, all the way down to a bit-stream to program FPGA chips.

Fig. 1.5 An example of a CPLD logic element, the MAX 7000B macrocell [6].

A typical FPGA design flow includes the steps and components shown in Fig. 1.6. Inputs to the design flow typically include the HDL specification of the design, design constraints, and the specification of the target FPGA devices. We further elaborate on these components of the design input in the following:
A cone of a node v, denoted C_v, is a sub-network consisting of v and some of its predecessors, such that for any node u in C_v there is a path from u to v lying entirely within C_v. The maximum cone of v, consisting of all the predecessors of v, is called the fanin cone of v, denoted F_v. C_v is K-feasible if |input(C_v)| ≤ K. A cut (X_v, X̄_v) is a partition of the fanin cone F_v of v such that X̄_v is a cone of v; input(X̄_v) is called the cutset of the cut. A cut is K-feasible (or a K-cut) if X̄_v is a K-feasible cone, or equivalently, if |input(X̄_v)| ≤ K; a node u is in the cutset if u is in input(X̄_v).

Structural mapping proceeds in three steps: cut generation/enumeration, cut ranking, and cut selection.

For area-oriented cut ranking, the effective area of a cut c of node v is

a(c) = Σ_{u ∈ input(c)} a(u)/|output(u)| + A_c,

where A_c is the area of the LUT corresponding to the cut c. The area cost of a non-PI/PO node is then the minimum area over its cuts:

a(v) = min{ a(c) : c is a K-cut of v }.
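The cut generation step can be sketched in a few lines. This is a minimal illustration only (no pruning of dominated cuts, nodes supplied in topological order), not the engineered enumerators used in real mappers:

```python
from itertools import product

def enumerate_k_cuts(fanins, k):
    """Bottom-up K-cut enumeration over a DAG.

    fanins maps each node to its list of fanin nodes (PIs map to []),
    in topological order (PIs first).  Returns, for each node, its set
    of K-feasible cuts, each cut a frozenset of leaf nodes."""
    cuts = {}
    for v, ins in fanins.items():
        if not ins:                       # primary input: only the trivial cut
            cuts[v] = {frozenset([v])}
            continue
        merged = set()
        # Combine one cut from each fanin; keep only K-feasible unions.
        for combo in product(*(cuts[u] for u in ins)):
            c = frozenset().union(*combo)
            if len(c) <= k:
                merged.add(c)
        merged.add(frozenset([v]))        # the trivial cut {v}
        cuts[v] = merged
    return cuts

# y = AND(a, b); z = AND(y, c)
graph = {"a": [], "b": [], "c": [], "y": ["a", "b"], "z": ["y", "c"]}
kcuts = enumerate_k_cuts(graph, 3)
```

With K = 3, node z ends up with exactly the cuts {z}, {y, c}, and {a, b, c}, matching the definitions above.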
It can be shown [72] that for duplication-free mapping based on MFFCs, effective area is accurate, in the sense that there is a mapping solution whose area is equal to the sum of the effective areas of the POs. Since the effective area is computed by distributing the area of a node evenly among its fanouts, it does not account for the situation where the node may be duplicated in a mapping solution. When there is duplication, effective area may be inaccurate. In the example shown in Fig. 3.2, with K = 3, the LUT for u covers w. In this case, the LUT for w is introduced solely for the LUT for v. However, in the effective area computation, only one half of its area is counted toward v. As a result, the LUT for w is under-counted. In this particular case, the effective area of the overall mapping solution (the sum of the effective areas of the POs) is 2.5, while the mapping solution has three LUTs. In general, effective area is a lower bound on the actual area [72].

Fig. 3.2 Inaccuracy in effective area when duplication is allowed [72].
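Effective area ranking can be computed directly from the formula. Below is a minimal sketch with unit LUT area, hand-built cut sets, and hypothetical node names, showing how a node's area is split among its fanouts:

```python
def effective_areas(fanins, fanouts, cuts, lut_area=1.0):
    """a(c) = sum over cut leaves u of a(u)/|output(u)|, plus the LUT
    area A_c; a(v) is the minimum over v's non-trivial cuts.  PIs cost
    0.  Nodes are visited in topological order (dict order here)."""
    a = {}
    for v, ins in fanins.items():
        if not ins:
            a[v] = 0.0
            continue
        a[v] = min(lut_area + sum(a[u] / max(len(fanouts[u]), 1) for u in c)
                   for c in cuts[v])
    return a

# w = f(p1, p2) feeds both u and v; u and v also read p3 (K = 3).
fanins = {"p1": [], "p2": [], "p3": [],
          "w": ["p1", "p2"], "u": ["w", "p3"], "v": ["w", "p3"]}
fanouts = {"p1": ["w"], "p2": ["w"], "p3": ["u", "v"],
           "w": ["u", "v"], "u": [], "v": []}
cuts = {"w": [{"p1", "p2"}],
        "u": [{"w", "p3"}, {"p1", "p2", "p3"}],
        "v": [{"w", "p3"}, {"p1", "p2", "p3"}]}
areas = effective_areas(fanins, fanouts, cuts)
```

For u, the cut {w, p3} costs 1 + a(w)/2 = 1.5, while the covering cut {p1, p2, p3} costs only 1, so the ranking favors absorbing w into u's LUT.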
3.2.3 Cut ranking: timing
Because the exact layout information is not available during the technology mapping stage, most FPGA technology mapping algorithms consider only the cell delays. The delay of a mapping solution is defined as the total cell delay along a longest combinational path from the PIs to the POs.

Most delay-optimal combinational mapping algorithms use the dynamic programming-based labeling process introduced in the FlowMap algorithm [50]. The label at each node is the minimum arrival time that can be achieved for the node in any mapping solution. The label of a PI is set to zero, assuming that the signal arrives at the PI at the beginning of the clock edge. After the labels for all the nodes in F_v are computed, the label of v itself can be determined.

One enhancement of the effective area cost estimates fanout counts iteratively:

estimated_fc(v) = [α · estimated_fc_prev(v) + |output(LUT_v)|] / (1 + α),

where estimated_fc(v) is the estimated fanout count for the current iteration, estimated_fc_prev(v) is the estimated fanout count for the previous iteration, and output(LUT_v) is the set of fanouts of the LUT rooted at v. The area cost of a cut then becomes

a(c) = Σ_{u ∈ input(c)} a(u)/estimated_fc(u) + A_c.

The authors suggest setting α to be between 1.5 and 2.5 when the number of iterations is limited to 8. When applying the enhanced area cost to delay-oriented mapping, the authors show good improvement in area compared to previous delay-oriented mapping algorithms.
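In the unit-delay model, the labeling recurrence can be written over enumerated K-cuts (FlowMap itself computes labels via network flow; the cut-based form below is the equivalent used by later enumeration-based mappers). A small sketch with hypothetical node names:

```python
def compute_labels(fanins, cuts):
    """Depth labels, FlowMap-style, in the unit-delay model:
    label(v) = min over v's non-trivial K-cuts c of
               1 + max(label(u) for u in c), with PIs labeled 0.
    Nodes arrive in topological order (dict order here)."""
    label = {}
    for v, ins in fanins.items():
        if not ins:
            label[v] = 0
        else:
            label[v] = min(1 + max(label[u] for u in c) for c in cuts[v])
    return label

# Toy network: w = f(p1, p2); u reads w and p3 (K = 3).
fanins = {"p1": [], "p2": [], "p3": [], "w": ["p1", "p2"], "u": ["w", "p3"]}
cuts = {"w": [{"p1", "p2"}],
        "u": [{"w", "p3"}, {"p1", "p2", "p3"}]}
labels = compute_labels(fanins, cuts)
```

The covering cut {p1, p2, p3} gives u a label of 1 rather than the 2 implied by cutting at w, which is exactly why depth-optimal mappers enumerate cuts instead of mapping each gate separately.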
In [140], the authors present a mixed structural and functional area mapping algorithm based on solving a sequence of SAT problems. The algorithm starts with an existing mapping solution (e.g., one obtained from a structural mapper described earlier). The key idea is a SAT formulation for the problem of mapping a small circuit into the smallest possible number of LUTs. The algorithm iteratively selects a small logic cone to remap to fewer LUTs using a SAT solver. It is shown that for some highly structured (albeit small) designs, area can be improved significantly.
Most area optimization techniques are heuristics. A natural question is how close or far away existing mapping algorithms are from optimal. In [66], the authors construct a set of designs with known optimal area mapping solutions (called LEKO examples). They tested existing academic algorithms and commercial tools on the LEKO examples. The gap from optimal varies from 5% to 23%, with an average of 15%. From the LEKO examples, they further derived logic synthesis examples with known upper bounds (LEKU). These examples require both logic optimization and mapping. Existing FPGA synthesis algorithms and tools perform very poorly on the LEKU examples, with an average optimality gap of over 70X. This indicates that further studies are needed in area-oriented mapping and optimization.
3.3.2 Delay minimization
Timing optimization is important for FPGAs due to the performance disadvantage introduced by programmability. FlowMap and its derivatives can find a mapping solution with optimal delay. Recent advances in delay minimization focus on optimizing area while maintaining performance.
DAOmap [38] is a mapping algorithm that guarantees optimal delay while at the same time reducing area significantly compared to previous delay-oriented mapping algorithms. The DAOmap algorithm introduces three techniques to optimize area without degrading timing. First, it enhances the effective area computation to control potential node duplications more effectively. Second, it exploits the extra timing slack on non-critical paths for area reduction. Third, it uses an iterative cut selection procedure to further explore and perturb the solution space to improve solution quality.
In DAOmap, the effective area for a cut c of node v is enhanced using the following formula:

a(c) = Σ_{u ∈ input(c)} a(u)/|output(u)| + U_c + P_{u1} + P_{u2},

where U_c is the area contributed by the cut itself, and P_{u1} and P_{u2} are correction terms that account for potential duplication from the fanins u_1 and u_2 of v. Specifically, the following formula is used:

U_c = α · |c| / (N_c + β · (|output(v)| + R_c)),

where N_c is the number of nodes in the cone of the cut and R_c is the number of reconvergent paths completely covered by c. Parameters α and β are two weighting factors determined empirically. The intuition of the formula is to encourage the use of a small cut (in terms of the number of cut nodes) that covers a large cone. Obviously, such cuts have fewer chances to introduce unnecessary logic duplication. The correction terms P_{ui} (i = 1 or 2) are introduced to gauge the potential of node duplication at the fanin u_i of v. If v has only one fanout, they are set to zero; otherwise, P_{ui} = N_{ci}/|c|, where c_i is the cut induced on u_i by c. The intuition is that the larger N_{ci} is, the more likely the nodes in c_i will be duplicated. The size of the cut, |c|, is added as a normalizing factor.
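A sketch of this enhanced cost follows. The weighting factors (written `alpha` and `beta` here) and all node names are placeholders, and this is an illustration of the formulas above rather than the actual DAOmap implementation described in [38]:

```python
def enhanced_cut_cost(a, fanouts, v, cut, n_c, r_c, induced_cone_sizes,
                      alpha=1.0, beta=1.0):
    """Enhanced effective area of `cut` at node v.

    a                  -- area labels of already-processed nodes (PIs 0)
    n_c                -- number of nodes in the cone of the cut
    r_c                -- reconvergent paths completely covered by the cut
    induced_cone_sizes -- N_{c_i} of the cut induced on each fanin u_i,
                          used only when v has multiple fanouts
    """
    base = sum(a[u] / max(len(fanouts[u]), 1) for u in cut)
    # U_c: favor a small cut (few leaves) covering a large cone.
    u_c = alpha * len(cut) / (n_c + beta * (len(fanouts[v]) + r_c))
    # P terms: penalize likely duplication at multi-fanout fanins.
    if len(fanouts[v]) <= 1:
        p_terms = 0.0
    else:
        p_terms = sum(n_ci / len(cut) for n_ci in induced_cone_sizes)
    return base + u_c + p_terms

# A 3-input cut {x, y, z} at single-fanout node v: cost is U_c alone.
a = {"x": 0.0, "y": 0.0, "z": 0.0}
fanouts = {"x": ["v"], "y": ["v"], "z": ["v"], "v": ["o"]}
cost = enhanced_cut_cost(a, fanouts, "v", ["x", "y", "z"],
                         n_c=5, r_c=1, induced_cone_sizes=[])
```

Here the base term and the P terms vanish, leaving U_c = 3/(5 + (1 + 1)) = 3/7.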
DAOmap first picks cuts with minimum timing cost for each node. Among all cuts with minimum timing cost, it then picks a cut with minimum area cost. However, when there is extra slack (the difference between the required time and the arrival time from the timing labels), it will pick a cut with minimum area cost as long as the timing increase does not exceed the slack. Cut selection starts from the POs and works backward toward the PIs.
Recognizing the heuristic nature of the area cost, DAOmap also employs the technique of multiple passes of cut selection (i.e., mapping generation). Moreover, DAOmap adjusts area costs based on input sharing to encourage using nodes that are already contained in other cuts. This reduces the chance that a newly picked cut cuts into the interior of existing LUTs. Between successive iterations of cut selection, it also adjusts the area cost to encourage using LUT roots that had a large number of fanouts in previous iterations. There are also a few other secondary techniques used in DAOmap; the interested reader is referred to [38] for details.

Based on the reported results, DAOmap improves area by about 13% compared to previous mapping algorithms for optimal depth. It is also many times faster than previous flow-computation-based mapping algorithms, mainly due to an efficient implementation of cut enumeration.
A more recent work [147] introduces several techniques that further improve area while preserving delay optimality. Like DAOmap, this algorithm goes through several passes of cut selection. Each pass selects cuts with better area among those that do not violate optimal timing. The basic framework is also based on the concept of effective area (or area flow). However, it processes nodes from the PIs to the POs, instead of from the POs to the PIs, during cut selection. With this processing order, the algorithm tries to use the extra slack on nodes close to the PIs to reduce area cost. This is based on the observation that logic is typically denser close to the PIs, so delay relaxation is more effective following the topological order from PIs to POs.

The algorithm [147] also uses a local heuristic for area recovery during cut selection. It introduces the concept of the exact area of a cut, which is defined as the number of LUTs that would be added to the mapping solution if the cut were selected for the node. The algorithm tries to improve the current mapping solution by selecting a cut with minimum exact area at each node without violating timing. An iterative procedure is proposed to calculate the exact area for each cut. Experimental data show 7% better area than DAOmap with the same timing.

The cut enumeration framework can also be used for technology mapping for power minimization, which will be discussed in Sections 6.4 and 6.5.
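The exact area of a cut can be computed with the reference-counting trick used by later cut-based mappers. This is a sketch of the general idea, not necessarily the exact procedure of [147]: tentatively reference the candidate cut, count every LUT whose reference count rises from zero, then undo the operation so the mapping state is unchanged.

```python
def cut_ref(v, best_cut, nref):
    """Reference v's chosen cut; count the LUTs newly pulled in."""
    area = 1
    for u in best_cut[v]:
        if u not in best_cut:          # primary input: no LUT needed
            continue
        if nref[u] == 0:
            area += cut_ref(u, best_cut, nref)
        nref[u] += 1
    return area

def cut_deref(v, best_cut, nref):
    """Inverse of cut_ref; count the LUTs that would be freed."""
    area = 1
    for u in best_cut[v]:
        if u not in best_cut:
            continue
        nref[u] -= 1
        if nref[u] == 0:
            area += cut_deref(u, best_cut, nref)
    return area

def exact_area(v, cand_cut, best_cut, nref):
    """LUTs added to the current mapping if cand_cut is chosen for v.
    Leaves best_cut and nref unchanged."""
    saved, best_cut[v] = best_cut[v], cand_cut
    area = cut_ref(v, best_cut, nref)
    cut_deref(v, best_cut, nref)
    best_cut[v] = saved
    return area

# w's LUT is not yet referenced by anyone (nref["w"] == 0), so choosing
# the cut {w, p3} for u would add two LUTs, while {p1, p2, p3} adds one.
best_cut = {"w": ["p1", "p2"], "u": ["w", "p3"]}
nref = {"w": 0, "u": 0}
```

In a mapper, `exact_area` would be evaluated for each timing-feasible cut of a node, and the cheapest cut committed.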
3.4 Integrated Technology Mapping
Technology mapping is a step in the middle of a typical FPGA CAD flow, with optimization steps before and after it. We carry out technology-independent optimization and logic decomposition before technology mapping, and we may perform sequential optimization such as retiming after (or before) mapping. Such a separate approach can miss the best overall solutions even if we can solve each individual step optimally. To obtain better overall solutions, it is desirable to combine some of the optimization steps with mapping. In this section, we discuss mapping algorithms that integrate decomposition, logic synthesis, and retiming. Mapping has also been integrated with clustering and placement; those algorithms will be presented when we discuss layout-driven synthesis.
3.4.1 Simultaneous logic decomposition and mapping
It is advantageous to decompose an input netlist into a two-bounded one before technology mapping, because a mapping solution for the original network can always be found in the decomposed one.

A network of complex gates (in SOP form) can be turned into a network of simple gates (e.g., AND, OR, XOR, and INV gates) by expressing each SOP as a set of AND-OR gates (e.g., using the tech_decomp algorithm in SIS [168]). Structural decomposition then decomposes the network of simple gates into two-bounded simple gates using associativity. The dmig algorithm is one such method; it tries to minimize the depth of the decomposed two-input network using Huffman-tree construction [52]. However, the resulting decomposition may not be the best as far as the final mapping solution is concerned, because the depth of the two-bounded network is not an exact predictor of the depth of the final mapping solution.
In general, different decompositions can significantly impact the final mapping solution. To find the best mapping solution for the initial network, ideally we would combine decomposition with mapping by finding a structural decomposition such that the final mapping solution of the decomposed network has minimum depth among all possible decompositions of the input network. This problem was shown to be NP-hard [59].
An algorithm for integrated structural decomposition and mapping is proposed in [43]. The algorithm is based on the concept of choice nodes, which were originally introduced for combining decomposition and technology mapping for standard cells [126, 127]. A choice node is not a physical gate. It is introduced to group all functionally equivalent, but structurally different, decompositions of a node [126]. Each choice node has a complement copy, and together they form a so-called ugate. Fig. 3.4 shows an ugate formed by two choice nodes c_1 and c_2 that are the complements of each other.
The algorithm essentially encodes all possible structural decompositions of the input network in a concise mapping graph, formed by adding choice nodes to represent the decomposition choices.

Fig. 3.4 An ugate example [43].

Fig. 3.5 A mapping graph for all decompositions of function xyz [43].

Figure 3.5 is a mapping graph that encodes all possible decompositions of the Boolean function f = xyz. By picking one branch at each choice node in the mapping graph, we can retrieve one decomposition of the function.
Cut generation and cut ranking can be extended to choice nodes. For example, the set of cuts of a choice node is simply the union of the sets of cuts of all its fanins. Similarly, the label of a choice node is the smallest among the labels of its fanins. The rest of the approach is similar to a conventional mapping algorithm.

The algorithm does not actually generate a mapping solution directly from the mapping graph. Instead, it determines a decomposition from the mapping graph by selecting the decomposition with the best timing label at each node. After that, it applies a state-of-the-art mapping algorithm for a fixed decomposition to obtain the final mapping solution. Significant timing and area improvements were observed compared to separate decomposition and mapping [43].
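The choice-node extension amounts to two one-liners; the shapes of the cut sets and labels below are hypothetical, chosen only to illustrate the union/minimum rules just described:

```python
def choice_cuts(fanin_cut_sets):
    """Cuts of a choice node: the union of its fanins' cut sets."""
    merged = set()
    for cut_set in fanin_cut_sets:
        merged |= cut_set
    return merged

def choice_label(fanin_labels):
    """Timing label of a choice node: the best (smallest) fanin label."""
    return min(fanin_labels)

# Two structural alternatives contribute their cut sets to the choice node.
cuts = choice_cuts([{frozenset({"a", "b"})}, {frozenset({"c"})}])
```

With labels 3, 2, and 4 on three alternatives, the choice node's label is 2, so mapping is free to pick whichever decomposition turns out fastest.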
3.4.2 Simultaneous logic synthesis and mapping
Technology-independent Boolean optimizations carried out before tech-
nology mapping in a conventional ow can signicantly change the
network passed to mapping; so does the nal mapping solution. During
technology-independent optimization, we have freedom to change the
network structures, but accurate estimation of the impact to down-
stream mapping is not available. During technology mapping, we can
achieve optimal or close to optimal solutions using one of the algo-
rithms discussed earlier. However, we are stuck with a xed network. It
is desirable to capture the interactions between logic optimization and
mapping to arrive at a solution with better quality.
Lossless synthesis has been proposed as a way to consider technology-
independent optimization during mapping [147]. Lossless synthesis is
based on the concept of choice networks, which are similar to the
mapping graphs in [126, 127]. Like mapping graphs, a choice network
contains choice nodes that encode functionally equivalent, but structurally
different, alternatives. The algorithm operates on a simple yet powerful
data structure called an AIG (and-inverter graph), which is a network of
AND2 and INV gates. A combination of SAT and simulation techniques
is used to detect functionally equivalent points in different networks
and compress them
Fig. 3.6 Combining networks to create a choice network [147].
to form one choice network. Figure 3.6 illustrates the construction of
a network with choices from two equivalent networks with different
structures. The nodes x1 and x2 in the two networks are functionally
equivalent (up to complementation). They are combined in an equivalence
class in the choice network, and an arbitrary member (x1 in this
case) is set as the class representative. Node p does not lead to a choice
because its implementation is structurally the same in both networks.
Note also that there is no choice corresponding to the output node o,
since the procedure detects the maximal commonality between the two
networks.
Using structural choices leads to a new way of thinking about logic
synthesis: rather than trying to come up with a good final netlist
to use as an input to mapping, the algorithm in [147] accumulates
choices by combining intermediate networks seen during logic synthe-
sis to generate a network with many choices. In a sense, it does not
make judgments on the goodness of the intermediate networks or por-
tions of the networks, and defers the decision to the mapping phase.
The best combination among these choices is selected during map-
ping. In the final mapping solution, different sections may come from
different intermediate networks. For example, the timing-critical sec-
tions of the final mapping solution may come from networks which
are optimized for timing, while the timing non-critical sections may
come from networks which are optimized for area.
Cut generation and ranking techniques are extended to networks
with choices, as in the case of integrated decomposition and mapping.
Results reported in [147] show that the proposed algorithm can improve
both area and timing by 7% on a large set of benchmark designs over
mapping solutions produced using just one optimized network as the
input.
3.4.3 Simultaneous retiming and mapping
Retiming is an optimization technique that relocates flip-flops (FFs)
to improve the performance or area of a design while preserving its
functionality [128]. Retiming can shift FF boundaries and change the
Fig. 3.7 A retiming example: (a) The initial design; (b) The retimed design.
timing of a design. For the design in Fig. 3.7(a), if we relocate the FFs
as indicated, we arrive at the design in Fig. 3.7(b). If we assume each gate has
one unit of delay, the original design has a clock period of three, while
the retimed design has a clock period of one. The number of FFs is also
reduced from four to three.
If we apply retiming after mapping, mapping may optimize the
wrong paths, because the critical paths seen during mapping may not
be critical after retiming. If we do retiming before mapping, retiming
will be carried out using less accurate timing information since the
design is not mapped. In either case, we cannot account for the impact
of retiming on the cut or LUT generation as logic can be shifted from
one side of FFs to the other. All of these point to the importance of
combining retiming and technology mapping.
In [158], the authors propose the first polynomial-time mapping
algorithm that can find the best clock period in the combined solution
space of retiming and mapping. In other words, the mapping solution
obtained at the end is the best among all possible ways of retiming
and mapping a network. The algorithm is based on two important con-
cepts: sequential arrival times and expanded cones (circuits). Sequential
arrival times anticipate the impact of retiming on performance without
actually doing retiming. Expanded cones at a node make it possible to
form cuts across time frames, so that cut generation is oblivious to register
boundaries.
Although the algorithm in [158] has polynomial time complexity, the
runtime can be high. An improved algorithm is proposed in [71] that
significantly reduces the runtime while still preserving the optimality
of the final mapping solution. Both algorithms are based on flow com-
putation, as in FlowMap. To further improve runtime in practice, an
algorithm based on cut enumeration is proposed in [157]; this will be
discussed next.
FF boundaries are no longer fixed with retiming. Cut genera-
tion is extended to go across FF boundaries to generate sequential cuts
[157]. In a sequential design, a signal from a gate u may go through zero
or more FFs, in addition to logic gates, before reaching a gate v. To
capture this information, an element in a cut for a node is represented
as a pair consisting of the driving gate u and the number of FFs d on
the paths from u to v. The element is denoted by u^d. Note that a node
may reach a root node through paths with different FF counts. In that
case, the node will appear in the cut multiple times, with different d
values. Here is the formula relating the set of cuts Φ(v) of a node v to
those of its two fanins u1 and u2:
Φ(v) = {v^0} ∪ { c1^d1 ∪ c2^d2 : c1 ∈ Φ(u1), c2 ∈ Φ(u2), |c1^d1 ∪ c2^d2| ≤ K },
where di is the number of FFs from ui to v, and ci^di = { u^(d+di) : u^d ∈ ci } for i = 1, 2.
Unlike the combinational case, the above formula does not give us a
direct way to compute all cuts, because a general sequential design may
contain loops, so the sets of cuts are inter-dependent. A procedure is
proposed in [157] to determine the sets of cuts for all nodes through
successive approximation. The procedure starts with Φ(v) containing
only the trivial cut {v^0} for each node v, and then updates the cuts
using the above formula by going through all nodes in passes until no
new cut is discovered. Figure 3.8 shows an example. For the design on
the left, the table on the right shows three iterations in cut generation.
In the first iteration, every node has only the trivial cut. Row 1 shows
the new cuts discovered in the first iteration. In iteration 2, two more
cuts are discovered, one for a and one for b. After that, further cut
combination does not yield any new cut, and the procedure stops. In
practice, cut generation stops very quickly. For example, it stops in at
most five iterations for all ISCAS89 benchmarks with K = 4.
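The successive-approximation procedure can be sketched as follows. The circuit encoding (each node's fanins given as (gate, FF-count) pairs) and K = 4 are illustrative assumptions; the real algorithm in [157] adds further pruning on top of this fixed-point iteration.

```python
K = 4  # LUT input size (assumed for illustration)

def shift(cut, d):
    """c^d: add d to the FF count of every element of cut c."""
    return frozenset((u, dd + d) for (u, dd) in cut)

def sequential_cuts(fanins, nodes):
    """fanins[v] = [(u1, d1), (u2, d2)] for internal nodes; PIs have [].
    Iterates the sequential cut formula to a fixed point."""
    cuts = {v: {frozenset([(v, 0)])} for v in nodes}   # trivial cuts v^0
    changed = True
    while changed:
        changed = False
        for v in nodes:
            if not fanins[v]:
                continue                                # primary input
            (u1, d1), (u2, d2) = fanins[v]
            for c1 in list(cuts[u1]):
                for c2 in list(cuts[u2]):
                    c = shift(c1, d1) | shift(c2, d2)
                    if len(c) <= K and c not in cuts[v]:
                        cuts[v].add(c)
                        changed = True
    return cuts
```

Because every accepted cut has at most K elements, the iteration terminates even in the presence of loops, mirroring the fast convergence reported above.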
To consider the retiming effect, the concept of (timing) labels is
extended to one based on sequential arrival times [159, 158]. The label
of a cut c is now defined as follows:
l(c) = max{ l(u) − d·φ + 1 : u^d ∈ c },
where φ is the target clock period. The label of a gate v is then
defined as
l(v) = min{ l(c) : c ∈ Φ(v) }.
The label for each PI is zero, and the label for each PO is that of its
driving gate.
Fig. 3.8 Example of iterative cut generation for sequential circuits [157].
An algorithm is proposed in [157] to find the labels for cuts and
nodes. Due to the cyclic nature of general sequential designs, the labels
cannot be determined in one pass, as in the case of combinational net-
works. They are computed through successive approximation by going
through the nodes in several passes. At the beginning, the labels of
all nodes are set to −∞, except for the PIs, whose labels are always
zero. The successive improvement stops when either one of the POs has
a label larger than φ or no more change in the labels is observed. It is
shown in [160] that the initial design has a mapping solution with a
clock period of φ or less, among all possible retimings and mappings, if
the label of each PO is less than or equal to φ.
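The label computation for a given target period φ can be sketched as follows, under an illustrative encoding where each cut element is a (gate, d) pair; function and parameter names are assumptions, not the API of [157].

```python
import math

def feasible(phi, cuts, nodes, pis, pos, max_iters=1000):
    """Successive approximation of sequential labels for target period phi.
    cuts[v] is a set of cuts, each cut a frozenset of (gate, d) pairs.
    Returns (is_feasible, labels); pos holds the gates driving the POs."""
    label = {v: (0.0 if v in pis else -math.inf) for v in nodes}
    for _ in range(max_iters):
        changed = False
        for v in nodes:
            if v in pis:
                continue                       # PI labels stay at zero
            best = min(max(label[u] - d * phi + 1 for (u, d) in c)
                       for c in cuts[v])       # l(v) = min over cuts of l(c)
            if best > label[v]:
                label[v] = best
                changed = True
        if any(label[o] > phi for o in pos):   # a PO exceeded phi: infeasible
            return False, label
        if not changed:                        # fixed point reached: feasible
            return True, label
    return False, label
```

Binary search over φ with this feasibility test then yields the best achievable clock period.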
After the labels for all nodes are computed and the target clock
period is determined to be achievable, we can generate a mapping solu-
tion. As in the combinational case, the algorithm generates a mapped
network starting from the POs and going backward. At each node v, it
selects one of the cuts that realizes the label of the node, and then
moves on to select a cut for u if u^d is in the cut selected for v. On the
edge from u to v, d FFs are added. To obtain the final mapping solution
with the clock period φ, it retimes the LUT for each non-PI/PO node v
by ⌈l(v)/φ⌉ − 1.
Heuristics are used to reduce the LUT counts in the mapping solu-
tion [157]. Experimental results show that the algorithm is very efficient
and consistently produces mapping solutions with better performance
than optimal combinational mapping followed by optimal retiming.
3.5 Specialized Mapping Algorithms
In previous sections, we discussed LUT mapping with a single LUT
input size K. In reality, FPGA devices typically contain heterogeneous
resources, e.g., embedded memory blocks and LUTs of different input
sizes. There are also commercial FPGA architectures with logic cells
that can only implement a subset of the functions of their inputs. In
this section, we briefly summarize mapping algorithms for these special
architecture features.
3.5.1 Mapping for FPGAs with heterogeneous resources
In this subsection, we summarize FPGA technology mapping algo-
rithms for heterogeneous resources. We first discuss mapping for archi-
tectures with different sizes of LUTs. Then, we examine the problem
of mapping logic to on-chip memory.
Mapping with different LUT sizes: A number of commercial
FPGA architectures can support LUTs with several different input
sizes. The adaptive logic modules (ALMs) in Altera's Stratix II
devices can be configured as two 4-LUTs, as one 5-LUT and one 3-LUT,
or as certain 6/7-LUTs. Other architectures, such as Xilinx Virtex II
and Virtex 4 devices, can also implement LUTs with different input sizes.
Without loss of generality, we assume there are two types of LUTs
with sizes K1 and K2 and delays d1 and d2, respectively. We further
assume K1 < K2 and d1 < d2. A number of mapping algorithms have
been proposed for mapping to such architectures: for area [72, 100, 106,
115], and for delay [73, 76].
For area minimization, the PRAETOR algorithm discussed earlier
has also been applied to such architectures, using different area costs
for different LUTs. Good improvements are observed over a previous
algorithm. In the special case of tree networks, an optimal polynomial-
time algorithm has also been proposed, based on dynamic programming [115].
For timing optimization, the algorithm proposed in [73] is an exten-
sion of FlowMap. Like FlowMap, the algorithm is based on flow
computation. The basic ideas in the algorithm can also be cast in the
cut generation framework by enumerating all cuts of up to K2 inputs.
We can simply extend the timing cost by replacing Dc in the formula
with d1 or d2, depending on the cut size: we set Dc to d1 if |c| ≤ K1;
otherwise, we set it to d2. With this simple modification, an algorithm
for homogeneous LUT architectures can be used for architectures with
different LUT sizes.
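The modified timing cost is trivial to express; a sketch with illustrative names:

```python
def cut_delay(cut, K1, d1, d2):
    """Delay of the LUT implementing a cut when two LUT types exist:
    the faster K1-LUT if the cut fits in K1 inputs, else the K2-LUT."""
    return d1 if len(cut) <= K1 else d2

def arrival_time(cut, arrival, K1, d1, d2):
    """Arrival time at the cut's root: worst fanin arrival plus LUT delay."""
    return max(arrival[u] for u in cut) + cut_delay(cut, K1, d1, d2)
```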
When there are resource bounds on the available LUTs of differ-
ent sizes, the mapping problem becomes harder, since mapping with
an area bound is NP-hard in general. Two heuristic algorithms are pro-
posed for the case in which there can be at most r K2-LUTs [74].
The algorithm BinaryHM [74] employs a mapping algorithm that does
not consider the resource limitation. It calls this algorithm repeatedly
to bring resource utilization under control. The basic idea is as fol-
lows. Let D_FM be the delay obtained using only K1-LUTs, and D_HM
be the delay obtained using both K1- and K2-LUTs with no resource
limitation. Obviously, D_FM and D_HM are upper and lower bounds on
the delay with the resource limitation, respectively. If the solution with
delay D_HM meets the resource bound, the best solution is found. If not,
one can increase d2, the delay of the K2-LUT, and solve the unconstrained
version again, which should result in a mapping solution with fewer
K2-LUTs. Binary search is used, and the change in d2 is done through
adjusting the ratio of d1 and d2. The reader is referred to [74] for more
details.
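The BinaryHM control loop can be sketched as follows; `map_unconstrained` and `count_k2` stand in for the unconstrained mapper and a resource counter, and are assumptions for illustration, not the interface of [74].

```python
def binary_hm(map_unconstrained, count_k2, r, lo=1.0, hi=10.0, iters=20):
    """Binary-search the artificial d2/d1 delay ratio until the
    unconstrained mapper uses at most r K2-LUTs.
    map_unconstrained(ratio) returns a mapping solution;
    count_k2(solution) counts its K2-LUTs."""
    best = None
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        sol = map_unconstrained(mid)
        if count_k2(sol) <= r:
            best, hi = sol, mid   # feasible: try a smaller penalty on K2
        else:
            lo = mid              # too many K2-LUTs: penalize them more
    return best
```

The search keeps the last feasible solution, so it returns a mapping that respects the K2-LUT bound while penalizing K2-LUTs as little as possible.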
Mapping with embedded memory blocks: On-chip memory has
become an essential component of high-performance FPGAs. Dedicated
embedded memory blocks (EMBs) can be used to improve clock fre-
quencies and lower costs for large systems that require memory. If a
design does not need all the available EMBs, the unused EMBs can be
used to implement logic, since they typically can be configured as ROMs on
most commercial FPGA devices, which essentially turns EMBs into
large multi-input multi-output LUTs.
EMBs usually have configurable widths and depths. They can be
used to implement functions with different numbers of inputs and outputs.
For example, a 2K-bit memory with configurations 2048×1, 1024×2,
and 512×4 can be used to implement an 11-input/1-output, 10-
input/2-output, or 9-input/4-output logic function. Using EMBs for
logic can reduce interconnect delays, since they can potentially replace
many small LUTs. On the other hand, EMBs usually have large inter-
nal delays (memory access times). Care must be taken when mapping logic to
EMBs. Several mapping algorithms have been proposed to take advan-
tage of unused EMBs [75, 196, 198, 197]. Mapping logic to EMBs is
typically done as a post-processing step after LUT mapping. It starts
with an optimized mapping solution for LUTs and then packs groups
of LUTs for EMB implementation.
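The width/depth trade-off can be illustrated with a small helper that enumerates the power-of-two (inputs, outputs) shapes a memory of a given bit size could realize. Real EMBs support only a few of these configurations, so this is an idealized sketch.

```python
def emb_logic_configs(bits):
    """Enumerate (inputs, outputs) logic functions a bits-sized EMB could
    implement: a depth-2^n x width-w configuration with 2^n * w == bits
    gives an n-input, w-output function."""
    configs = []
    w = 1
    while w <= bits:
        depth = bits // w
        # keep power-of-two depths with at least two rows
        if depth * w == bits and depth > 1 and (depth & (depth - 1)) == 0:
            n = depth.bit_length() - 1   # address width = log2(depth)
            configs.append((n, w))
        w *= 2
    return configs
```

For a 2K-bit memory this recovers the 11-input/1-output, 10-input/2-output, and 9-input/4-output shapes mentioned above, among others.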
The algorithm SMAP in [196] maps one EMB at a time. It begins
by selecting a seed node. A fanin cone of the seed node is generated by
finding a d-feasible cut that covers as many nodes as possible, where d is
the bit width of the address line of the target EMB. Since d is consider-
ably larger than the typical LUT input size, flow-based cut generation
is used. After the cone is generated, the output selection process selects
signals to be the outputs of the EMB. Output selection tries to select
a set of signals so that the resulting EMB can eliminate as many LUTs
as possible. Each node is assigned a score that is equal to the number
of nodes in its MFFC within the cone. The w highest-scoring nodes are
selected as the EMB outputs, where w is the number of outputs of the
target EMB. The selection of the seed node is critical for this method.
The algorithm tests each candidate node and selects the one that leads
to the maximum number of eliminated LUTs. Heuristics are introduced
to consider EMBs with different configurations and to preserve timing.
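The output-selection step reduces to a ranking by MFFC size; a sketch with illustrative names:

```python
def select_emb_outputs(cone_nodes, mffc_size, w):
    """SMAP-style output selection sketch: score each candidate by the
    number of nodes in its MFFC within the cone, and keep the w
    highest-scoring nodes as the EMB outputs."""
    ranked = sorted(cone_nodes, key=lambda v: mffc_size[v], reverse=True)
    return ranked[:w]
```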
Another algorithm, EMB Pack, presented in [75], takes a slightly dif-
ferent approach. It finds all the logic to map to EMBs at once instead
of one EMB at a time, as in SMAP. The algorithm first selects a set of
maximum fanout-free subgraphs (MFFSs), a generalization of MFFCs
to one or more outputs, for possible EMB implementation. It goes
through each node to find an MFFS. For each node, it searches in the
fanout cone of the node to find output nodes for the MFFS. MFFS
selection is transformed into a hyper-graph clustering problem. The
MFFSs selected may exceed the input bound of the target EMBs. If
this happens, an iterative procedure is used to remove nodes from the
MFFSs to reduce the input size. Finally, it selects a set of MFFSs among
all candidate MFFSs to maximize the area improvement without increas-
ing the overall circuit delay. The selected MFFSs are then implemented
as EMBs.
3.5.2 Mapping for CPLDs
Complex programmable logic devices (CPLDs) are a class of pro-
grammable logic devices that are more coarse-grained than typical
FPGAs. Each CPLD logic cell (p-term block) is essentially a PLA that
consists of a set of product terms (p-terms) with multiple outputs. A
p-term block can be characterized by a 3-tuple (k, m, p) where k is the
number of inputs, p is the number of outputs, and m is the number of
p-terms for the block. The input size k is normally much larger than
that of FPGA logic cells. In this section we discuss mapping algorithms
for CPLDs.
Relatively speaking, far fewer mapping algorithms have been
proposed for CPLDs. A fast heuristic partition method for PLA-based
structures is presented in [97]. The algorithm DDMap [117] adapts a
LUT mapper for CPLD mapping. It uses wide cuts to form big LUTs
and decomposes the big LUTs into the p-terms allowed in target CPLDs.
Packing is used to form multi-output p-term cells. An area-oriented
mapping algorithm is proposed for CPLDs in [15]. The algorithm is
based on tree mapping. It uses heuristic partial collapsing and bin
packing to form p-term cells. In [57], the authors investigate an FPGA
architecture consisting of k/m macrocells, each of which is a p-term block
with one output. A mapping algorithm similar to the FlowMap algo-
rithm is proposed for this architecture.
A more recent mapping algorithm for CPLDs, PLAmap, is pro-
posed in [40]. Like most mapping algorithms, it has two phases: the
labeling phase and the mapping phase. In the labeling phase, it finds
a minimal mapping depth for each node using the (k, m, 1) logic cell, i.e., a
single-output p-term block, assuming each logic cell has one unit of delay.
The labeling procedure is based on Lawler's clustering algorithm [122].
The labeling process follows topological order from the PIs, whose labels
are set to zero. At each node, let l be the maximum label of the nodes in
its fanin cone. The algorithm forms a logic cluster by grouping the node
with all nodes in its fanin cone that have the label l. If the cluster
can be implemented by a (k, m, 1)-cell, the node is assigned the label
l; otherwise, the node gets the label l + 1, with a cluster consisting
only of the node itself. Note that this is a heuristic in that the label
may not be the best possible. This is because even if the cluster formed
by the node and its (transitive) fanins with the label l cannot be imple-
mented by a (k, m, 1)-cell, a super-cluster (one containing the cluster)
may be implementable; this is the so-called non-monotone property. The mapping phase is
done in reverse topological order from the POs. The algorithm tries to
merge the clusters generated in the labeling phase to form (k, m, p)-
cells whenever possible. Cluster merging is done in such a way that
duplication is minimized and the labels of the POs do not exceed the
performance target. There is also a post-processing packing to further
reduce the p-term cell count. In commercial CPLDs, there is extra logic in
each p-term block for borrowing p-terms across blocks. PLAmap has also been
extended to take advantage of such extra logic. Experimental results
show PLAmap outperforms commercial tools and other algorithms in
performance, with no or very small area penalty.
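The labeling phase of PLAmap can be sketched as follows, with `fits_cell` abstracting the (k, m, 1) feasibility check. This is a simplification: the real algorithm works on the fanin cones of the mapped network and handles node duplication.

```python
def plamap_labels(nodes_topo, fanin_cone, fits_cell):
    """PLAmap labeling sketch: a node gets label l if it can be grouped
    with the label-l nodes of its fanin cone into one (k, m, 1) cell,
    and label l + 1 (in a cluster by itself) otherwise."""
    label = {}
    for v in nodes_topo:                 # topological order from the PIs
        cone = fanin_cone[v]
        if not cone:                     # primary input
            label[v] = 0
            continue
        l = max(label[u] for u in cone)
        cluster = [v] + [u for u in cone if label[u] == l]
        label[v] = l if fits_cell(cluster) else l + 1
    return label
```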
P-term blocks or macrocells are suitable for implementing wide-
fanin, low-density logic, such as finite-state machines. They can poten-
tially complement fine-grain LUTs to improve both performance and
utilization. Both academic and commercial architectures have been
proposed that contain a mix of LUTs and p-term blocks or macro-
cells, to take advantage of the different types of logic cells. Technol-
ogy mapping algorithms have been proposed for such hybrid architectures
[110, 118, 137]. These algorithms are similar to the hybrid mapping
algorithms described earlier in that they try to identify sub-circuits to
be implemented using coarse-grain structures.
4 Physical Synthesis
In a conventional FPGA design flow, synthesis is separated from phys-
ical design. In the synthesis stage, the design is transformed from one
representation to another. Along the way, the design morphs from a
high-level generic representation to a netlist in terms of the logic cells
of the target FPGA device (a mapped netlist). Physical design places
the logic cells on the selected FPGA device and finally connects the
cells.
With the continued shrinking of the feature size, the locations of the
cells and the wiring among them have become the dominating factors in
the quality of the final implementation. It is common for interconnect
delays to account for 70–80% of the total delay on the critical paths
of final designs these days. The conventional design flow cannot ade-
quately address this challenge, because the physical effects are
not considered during synthesis. On the other hand, we are limited by
the netlist from synthesis during physical design. As a result, we cannot
correct bad synthesis decisions made earlier in the flow.
This is where physical synthesis comes into play. Physical synthesis
can mean different things to different people. However, at a high level,
physical synthesis can be viewed as a set of techniques that try to
link physical design with synthesis, so that synthesis can consider physical
impacts and physical design can introduce design transformations.
Although some physical synthesis techniques for standard cells can
also be applied to FPGAs, physical synthesis for FPGAs has its unique
constraints. For example, in FPGAs, we cannot size cells up and down
to trade off timing, area, and power. In the following, we review physical
synthesis techniques developed for FPGAs.
4.1 Logic Clustering
Most modern FPGA architectures contain a physical hierarchy of logic
blocks for improved area-efficiency and speed. To implement a design
on such architectures, a logic clustering step is typically needed between
technology mapping and placement. Logic clustering transforms a
netlist of logic cells into a netlist of logic clusters, each of which can
be implemented using a logic block.
A typical logic block contains N logic cells with I inputs and N
outputs. Here, N and I are fixed for the given architecture. There can
be other architecture constraints, such as control signals for sequential
elements, which have been omitted here for ease of discussion. The
logic clustering problem for FPGAs takes as input a mapped netlist
and produces a clustered netlist satisfying the cluster parameters.
One of the early FPGA clustering algorithms is VPack [22, 21].
The VPack algorithm forms one cluster at a time. At the beginning
of each cluster formation, VPack selects as a seed an unclustered logic
cell with the most used inputs and places this seed into the cluster.
It then calculates the attraction of each unclustered cell to the new
cluster. The attraction of a logic cell to a cluster is the number of
inputs and outputs that are shared by the cell and the cluster. The cell
whose addition doesn't violate the cluster constraints and has the largest
attraction value is added to the cluster. The packing process is
repeated until we cannot add any new cells to the current cluster. At
that point, packing begins with a new cluster. The process terminates
when all logic cells have been assigned a cluster.
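The VPack greedy loop can be sketched as follows. The connectivity model here is deliberately simplified (each cell is described only by its set of input nets, and cluster outputs are ignored), so this illustrates the control flow rather than VPack's exact legality checks.

```python
def vpack(cells, inputs_of, N, I):
    """Greedy VPack-style clustering sketch: seed each cluster with the
    unclustered cell having the most used inputs, then absorb the cell
    sharing the most nets until N cells or I distinct inputs is reached."""
    unclustered = set(cells)
    clusters = []
    while unclustered:
        seed = max(unclustered, key=lambda c: len(inputs_of[c]))
        cluster, nets = [seed], set(inputs_of[seed])
        unclustered.remove(seed)
        while len(cluster) < N:
            # only cells that keep the cluster within its input limit
            candidates = [c for c in unclustered
                          if len(nets | inputs_of[c]) <= I]
            if not candidates:
                break
            # attraction: number of nets shared with the cluster
            best = max(candidates, key=lambda c: len(nets & inputs_of[c]))
            cluster.append(best)
            nets |= inputs_of[best]
            unclustered.remove(best)
        clusters.append(cluster)
    return clusters
```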
The objective of the VPack algorithm is to minimize the number
of clusters needed. Later, an enhanced algorithm called T-VPack was
proposed to optimize timing [143]. The delay model used in T-VPack
can be described by three delay values: the logic cell delay, the intra-cluster
delay, and the inter-cluster delay. Usually, the inter-cluster delay is
much larger than the intra-cluster delay. The basic idea in T-VPack is
to reduce the number of inter-cluster connections on the critical paths.
T-VPack follows the general steps of VPack. It also forms one cluster
at a time and packs as many logic cells into each cluster as possible.
The main difference is in the selection of seeds and the selection of logic
cells to absorb into partial clusters. Both selections are based on timing
criticality values. The criticality of the connection driving an input i
is defined as Connection Criticality(i) = 1 − slack(i)/MaxSlack, where
MaxSlack is the largest slack over all interconnects and slack(i) is the
slack at input i. T-VPack selects as the seed an unclustered cell that
is driven by the most critical connection.
The concept of attraction is enhanced to include timing criticali-
ties. T-VPack defines the base criticality of an unclustered logic cell B
with respect to the cluster C currently being packed as the maximum
connection criticality over all the connections between B and the cells
of C, denoted by Base BLE Criticality(B) when the cluster is understood.
The criticality of B is defined using the following formula:
Criticality(B) = Base BLE Criticality(B) + ε · total path affected(B),
where total path affected(B) is an estimate of the number of crit-
ical paths that B is involved in and ε is a very small value (so
the ε term acts as a tie-breaker for the base criticality). The
new attraction formula incorporating timing is defined as follows:
Attraction(B) = α · Criticality(B) + (1 − α) · Attraction_area(B)/G,
where Attraction_area(B) is the attraction for area packing used in
VPack, G is a normalizing factor, and α is a trade-off factor that
determines the amount of attraction coming from timing vs. area. It is
recommended to set α to 0.75 for timing.
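In code, the three formulas are straightforward; the value of ε and the function names are illustrative, not taken from the T-VPack implementation.

```python
def connection_criticality(slack_i, max_slack):
    """Criticality of the connection driving input i."""
    return 1.0 - slack_i / max_slack

def criticality(base_crit, total_paths_affected, eps=1e-4):
    """eps is 'a very small value', so the path-count term only breaks
    ties between cells with equal base criticality."""
    return base_crit + eps * total_paths_affected

def attraction(crit, area_attraction, G, alpha=0.75):
    """alpha trades timing attraction against VPack's area attraction,
    normalized by G; 0.75 is the recommended timing-oriented setting."""
    return alpha * crit + (1.0 - alpha) * area_attraction / G
```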
T-VPack performs much better than VPack for timing. Comparison
data also show that T-VPack results in a better final chip area after placement
and routing. This is because T-VPack has a tendency to completely
absorb many low-fanout nets into clusters. This reduces the number of
inter-cluster nets, which, in turn, improves the routing area.
One problem with T-VPack is its delay model: it uses a sim-
plistic model, since accurate interconnect delays are not available. To over-
come this problem, a simultaneous clustering and placement algorithm
is proposed in [45]. This algorithm can move cells among clusters
inside an annealing-based placement engine. It starts with an initial
clustered netlist and a random placement. During the annealing pro-
cess, it tries to remove suboptimal clustering structures by introducing
fragment-level moves in addition to block-level moves which move clus-
ters around. Fragment-level moves relocate cells among clusters with-
out changing the locations of the clusters. The integrated approach is
particularly effective when utilization is high and a large amount of
unrelated packing occurs. Unrelated packing means that logic that is not
directly connected is packed into the same cluster to reduce the number
of clusters. Experimental results show significant improvement in timing
and area compared to a separate approach using T-VPack followed
by VPR.
Clustering has a significant impact on the routability of a design. A
routability-driven clustering algorithm is presented in [25]. The algo-
rithm prioritizes a set of factors that can impact routability, then incor-
porates those factors into a routability-oriented attraction formula
to guide the cell selection process. A typical clustering algorithm packs
as much logic into each cluster as possible. In practice, routability can
be improved if the clustered netlist matches the device structure. A
connectivity-based clustering algorithm is proposed that tries to achieve
spatial uniformity to reduce stress on routing [173]. The algorithm
is based on Rent's rule and may purposely leave some clusters unsatu-
rated for better routability.
Some FPGA architectures have more than one level of hierarchy,
e.g., Altera APEX families have two levels of physical hierarchy. In [68],
the authors study the problem of performance-driven multi-level clus-
tering. They show that the problem is NP-hard and propose an effi-
cient heuristic for two-level clustering. The heuristic algorithm is based
on label computation, like many of the area-constrained performance-
driven clustering algorithms [122, 149, 163]. Experimental results show
an average delay reduction of 15% from the two-level
clustering algorithm after final placement and routing.
4.2 Placement-Driven Mapping
In this section, we examine physical synthesis techniques that com-
bine placement and technology mapping. One problem with technol-
ogy mapping in the traditional flow is the lack of accurate information
about interconnects. Many mapping algorithms use simple models that
are inaccurate when the design is placed and routed. For timing opti-
mization, most mapping algorithms use the unit delay model in which
the interconnect delays are totally ignored. Good mapping solutions
produced based on such inaccurate models may not be good after place-
ment. To overcome this problem, algorithms have been proposed to
combine placement and mapping.
A number of algorithms try to carry out placement and mapping
simultaneously. The MIS-pga algorithm [151] performs iterative logic
optimization and placement. The algorithm in [37] tightly couples tech-
nology mapping and placement by mapping each cell and assigning it
to a preferred location on a 2-D grid using a maximum weighted match-
ing formulation. Another approach [184] combines mapping, placement,
and routing by integrating mapping into a bi-partitioning-based place-
ment framework. The algorithm in [24] refines the mapping solution dur-
ing placement, using simulated annealing to move logic among LUTs to
improve routability. Ideally, such integrated approaches would generate
the best solutions. In practice, they have a serious limitation. Due to
the complexity of the combined problem, often simple mapping and
placement techniques are employed for ease of integration. Because of
this, the benefit of the combined approach is reduced.
Another approach to combining mapping and placement is to iter-
ate between mapping and placement (or placement refinement). The
design is first mapped and placed. Then, the netlist is back-annotated
and mapped again under the given placement. This process can be
repeated until a satisfactory solution is found. Figure 4.1 outlines the
major steps in an iterative mapping and placement algorithm for timing
optimization presented in [139]. The key step is the placement-driven
Fig. 4.1 One pass of iterative mapping and placement [139].
mapping problem that involves decomposing a mapped/placed netlist,
then mapping it again to improve timing. The mapping step may make
the placement illegal (e.g., two or more cells are placed at the same
location). The placement of the newly mapped netlist is then legalized
and refined to produce a mapped/placed solution with potentially bet-
ter timing.
The algorithm uses table-lookup to estimate edge delays based on
placement locations. Given two locations, it looks up the estimated
delay for the wiring between the two locations in a pre-stored table.
This is more accurate and realistic than the fixed interconnect delays
used in earlier layout-based mapping algorithms [145, 214].
The decomposition method is a straightforward extension of the
dmig algorithm [46]. The only difference is that interconnect delays
are added to the arrival time propagation during the decomposition
process. New nodes inherit the locations of the LUTs from which they
are generated.
The mapping algorithm works in a fashion similar to typical cut enu-
meration based technology mappers. It has two phases, namely label-
ing and mapping generation. The labeling phase is the same as in
conventional mapping, except it adds interconnect delays into the label
propagation.
One difficulty in placement-driven mapping is that the new solution
may not be legal since it is possible to assign two or more cells to the
same location. Another difficulty is that timing predicted in the label-
ing phase may be invalid due to congestion in the new mapping solu-
tion. Congestion means many overlapping locations in a small region. It
requires many cell relocations to legalize the placement, which, in turn,
perturbs the timing. To overcome this problem, the algorithm employs
an iterative process with multiple passes in the mapping phase. Each
pass uses the cell congestion information gathered during previous iter-
ations to guide the mapping decisions. Several techniques are proposed
to relieve congestion while trying to meet the labels at the POs. One
of the techniques is a hierarchical area control scheme to evaluate the
local congestion cost. In this scheme, the chip is divided into bins with
different granularities. Area increase is tallied in each bin. Penalty costs
will be given to bins with area overflows.
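The hierarchical area control scheme can be sketched as follows; the chip size, bin granularities, per-bin capacities, and penalty weight are all invented for illustration:

```python
# Minimal sketch of hierarchical area control: the chip is cut into bins
# at several granularities, area increases are tallied per bin, and bins
# whose tallied area exceeds capacity contribute a penalty cost.

def bin_of(x, y, chip, grid):
    """Map a location to its bin index at a given grid granularity."""
    w, h = chip
    return (x * grid // w, y * grid // h)

def congestion_penalty(area_increase, chip=(16, 16), grids=(2, 4, 8),
                       capacity_per_bin=None, weight=1.0):
    """area_increase: {(x, y): added LUT area at that location}."""
    penalty = 0.0
    for grid in grids:
        cap = (capacity_per_bin or {}).get(grid, 4.0)
        tally = {}
        for (x, y), a in area_increase.items():
            b = bin_of(x, y, chip, grid)
            tally[b] = tally.get(b, 0.0) + a
        # penalize only the overflow beyond each bin's capacity
        penalty += weight * sum(max(0.0, t - cap) for t in tally.values())
    return penalty

print(congestion_penalty({(0, 0): 3.0, (1, 1): 3.0, (12, 12): 1.0}))
```

Duplication concentrated in one corner overflows the same bin at every granularity, so it is penalized more heavily than the same area spread across the chip.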
After the mapping phase completes, a mapping solution with a pos-
sibly illegal placement is generated. This is followed by a timing-driven
legalization step that moves overlapping cells to empty locations in their
neighborhood based on the timing slack available for the cell. Finally, a
simulated annealing-based placement refinement phase is carried out to
improve the circuit performance. Experimental results show the method
can improve timing by over 12% with little area penalty incurred by
remapping.
4.3 Placement-Driven Duplication
Logic replication is a simple, yet effective technique for improving tim-
ing. It can be used to distribute fanout load and isolate critical paths.
Logic replication can improve design quality without any real area cost
if it can be done using only the unused cells on the target device.
Recently, logic replication has been used to reduce interconnect
delays after placement. The idea is to use replication to straighten paths
that are otherwise circuitous (and therefore have large delays). Figure 4.2
shows an example. Suppose A, B, D, and E cannot be moved (e.g., pads
Fig. 4.2 Straightening paths using replication [20].
with fixed positions). We further assume that delay is proportional to
the wirelength. To minimize the maximum path delay, we have to place
C at the middle of the region enclosed by the four fixed pads as shown
in the left of Fig. 4.2. Obviously, all paths between pads have detours
going through the middle cell C. This is bad for timing. On the other
hand, if we duplicate cell C and move C and its duplicate close to the
two sink pads A and E, we arrive at the placement shown in the right
of Fig. 4.2. The new placement is much better in delay. The point is
that once we replicate C, we have the freedom to place the cell and
its duplicate differently to straighten paths and improve timing.
An iterative algorithm is presented in [20] that selectively replicates
cells and perturbs the placement to straighten circuitous paths for tim-
ing optimization. One of the key components of the algorithm is the
notion of local monotonicity. A cell, together with a fanin and a fanout,
is on a non-monotone local sub-path if the distance between the fanin
cell and the fanout cell is smaller than the sum of the distance from
the fanin to the cell and the distance from the cell to the fanout. Let
C be the cell and P and N be a fanin and a fanout of C, respectively.
The deviation of C on this sub-path can be defined as follows:
deviation(C) = distance(P, C) + distance(C, N) - distance(P, N).
If deviation(C) > 0, then P, C, and N form a non-monotone sub-path.
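This deviation measure is straightforward to compute with rectilinear (Manhattan) distance; the coordinates below are illustrative:

```python
# The deviation measure for a fanin P, cell C, fanout N triple, using
# rectilinear (Manhattan) distance. Coordinates are illustrative.

def distance(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def deviation(p, c, n):
    """deviation(C) = distance(P,C) + distance(C,N) - distance(P,N)."""
    return distance(p, c) + distance(c, n) - distance(p, n)

# C sits off every shortest P->N path, so the sub-path is non-monotone:
print(deviation((0, 0), (5, 5), (10, 0)))  # 10
# C on a shortest P->N path gives zero deviation (monotone):
print(deviation((0, 0), (5, 0), (10, 0)))  # 0
```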
The algorithm, starting from a placed design, iteratively selects one
cell on a non-monotone sub-path of a critical path to replicate. Cell
selection focuses on cells that are on critical paths and have non-zero
deviation. The cells with large deviations will have a high probability
of being selected. After a cell is selected and replicated, the algorithm
finds a new location for the replicated cell to reduce the deviation. The fanouts
of the selected cell will then be distributed between the cell and its
replicate with the goal of improving timing. It could also happen that
all the fanouts go to the replicate. In this case, the cell is essentially
relocated.
The new location may not be legal since cells can overlap. Cell over-
laps are eliminated in the ensuing legalization procedure. The objective
of legalization is to produce a legal placement with minimal impact to
the performance of other paths. A procedure based on ripple move,
similar to the one in [104], is used. The algorithm terminates when it
fails to generate improvement for a certain number of iterations in a
row. Experiments show good performance improvement over the place-
ment results produced by the VPR algorithm.
One limitation of the preceding algorithm is that it only targets
local non-monotone sub-paths. While effective, it may not be able to remove
more global non-monotone sub-paths. Figure 4.3 demonstrates this lim-
itation. In this example, the sub-paths s->a->b and a->b->t are both
monotone (based on rectilinear distance). However, the path from
s to t is not.
An enhanced algorithm is proposed in [102] to overcome the limi-
tation of local monotonicity. The algorithm is also iterative. However,
it may replicate a set of cells in each iteration. The algorithm deter-
mines a timing-critical section of the design, then replicates the cells in
the critical section to generate a Replication Tree in which every node
has one fanout except the leaves which may have multiple fanouts (a
leaf-DAG). The algorithm then embeds the replication tree by adapt-
ing the dynamic programming procedure for S-tree embedding [101] to
Fig. 4.3 Limitation of local monotonicity [102].
consider both timing and wirelength. Placement legalization is invoked
after embedding. The legalization procedure is the same as the previ-
ous algorithm except for the fact that both timing and wirelength are
considered. A delay reduction of over 14% was observed on a set of
benchmark designs, about twice the improvement from the approach
based on local monotonicity removal.
The previous algorithms carry out logic replication as a post-
optimization step after placement. An algorithm is presented in [44]
that integrates replication into a simulated annealing placement engine.
At the end of each annealing iteration, it performs a placement-driven
logic replication based on the current placement. The replication algo-
rithm has several unique features. It introduces the notion of feasible
region and super-feasible region to improve the critical path mono-
tonicity globally. An enhanced placement legalization procedure is
proposed that can take into consideration the complex architecture
constraints in real commercial FPGA architectures. Replication can
be carried out multiple times with this approach, which may result
in redundancies: multiple duplicates of the same nodes. An effective
technique is presented to remove redundancies globally while preserv-
ing timing. Experimental results show over 18% delay reduction over
VPR on average. With the path-counting-based net weighting scheme
in [114], the algorithm achieves over 25% delay reduction.
4.4 Other Techniques
There are many other physical synthesis techniques, from post-layout
pin permutation [?] to a general incremental physical synthesis frame-
work [?]. In this section, we briefly review a few other physical-aware
synthesis and optimization techniques.
Integrated retiming and placement: Traditionally, retiming is
applied during the synthesis stage where accurate estimates of inter-
connect delays are not available. In [174] an integrated retiming and
placement algorithm is presented. The algorithm has three compo-
nents: retiming-aware placement, retiming with minimal placement
disruption, and incremental clustering and placement to legalize the
disrupted placement after retiming.
The algorithm first enhances the simulated annealing-based
FPGA placement algorithm VPR (see Section 2.2.1) to make it
retiming-aware. VPR is a net-weighting based timing-driven placement
algorithm that uses connection criticalities to weigh interconnection
delays with the objective of encouraging cells on critical connections
to stay together during placement. In the original VPR algorithm, the
criticality of an interconnection c is defined as follows: Criticality(c) =
1.0 - slack(c)/λ, where slack(c) is the slack of the interconnection and
λ is a scaling factor. To make VPR retiming-aware, the criticality for-
mula is enhanced by replacing slack(c) with CycleSlack(c). The cycle
slack of an interconnection is the maximum amount of delay that can
be added to the interconnection without violating a given performance
target under retiming. The notion of cycle slacks is similar to sequential
slacks computed using sequential arrival and required times [63, 158].
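A minimal sketch of the criticality computation with cycle slack substituted for slack; the scaling factor and the values used are illustrative assumptions:

```python
# Retiming-aware criticality: connections with little cycle slack (little
# room to absorb extra delay under retiming) get weights near 1.0.

def criticality(cycle_slack, scale):
    """criticality = 1.0 - CycleSlack(c) / scale, where scale is an
    assumed normalizing factor (e.g., the performance target)."""
    return 1.0 - cycle_slack / scale

# A connection with no cycle slack is fully critical:
print(criticality(0.0, 10.0))  # 1.0
print(criticality(5.0, 10.0))  # 0.5
```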
Once the retiming-aware placement step is completed, the next step
of the algorithm is to find and apply a retiming to improve performance
based on more accurate interconnect delays determined by the place-
ment. The algorithm finds a retiming using the standard formulation of
minimizing a weighted sum of the register counts on interconnections
subject to the timing requirements. The weights or costs of the inter-
connections are assigned in such a way that they discourage disruption
to the placements. The final step is incremental clustering and place-
ment to place the new registers introduced during retiming. This step
is based on a greedy iterative improvement method that moves logic
cells in an attempt to minimize a cost function. The reader is referred
to [174] for details on the cost function and the moves used. An average
of 19% performance improvement is reported compared to a sequential
approach in which retiming is done before placement.
SPFD-based rewiring: Rewiring is a technique that changes inter-
connections in a design by removing and adding wires without
touching logic cells, while preserving design functionality [82, 33,
34, 213, 64]. This technique is very attractive in physical synthe-
sis since it does not perturb placement. It can be used for timing
improvement (replacing timing-critical connections with non-critical
ones) or routability improvement (replacing interconnections in con-
gested regions with those in less congested ones).
Most rewiring algorithms use ATPG-based redundancy addition
and removal and do not modify the functionality of any node in a
design during rewiring. However, they cannot take advantage of the
flexibility of a K-LUT (which can implement any function with up to K
inputs). Set of pairs of functions to be distinguished (SPFD) [213] is
a rewiring technique that may change node internal functions during
rewiring. SPFD approaches can potentially find more rewiring opportu-
nities than ATPG-based approaches while possessing the same advan-
tage of no placement disruption.
A function f is said to distinguish a function pair (g1, g0), where
g1 ≠ 0, g0 ≠ 0, and g1 · g0 = 0, if either one of the following two
conditions holds: g1 ≤ f ≤ g0' or g0 ≤ f ≤ g1', where ' denotes
complementation. Intuitively, g1 ≤ f ≤ g0' means f contains g1 and is
outside of g0, namely, it separates or distinguishes the two functions
g1 and g0. A function f is said to satisfy an SPFD
P = {(g11, g10), (g21, g20), . . . , (gm1, gm0)}
if f distinguishes all the function pairs in P. The notion of SPFDs is
a way to express don't-care conditions by providing flexibility for
implementing functions [26].
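The distinguishing test can be sketched with truth tables encoded as bitmasks over the primary inputs; the two-input setting and the function choices are illustrative, and `leq` encodes Boolean-function containment:

```python
# A small check of the SPFD "distinguishing" condition with truth tables
# as bitmasks (bit i = function value on minterm i). f distinguishes
# (g1, g0) iff g1 <= f <= g0'  or  g0 <= f <= g1'.

N = 2                       # number of primary inputs (illustrative)
FULL = (1 << (1 << N)) - 1  # all-ones truth table (0b1111 for 2 inputs)

def leq(a, b):
    """Boolean-function containment: a <= b (a implies b)."""
    return a & ~b & FULL == 0

def distinguishes(f, g1, g0):
    return (leq(g1, f) and leq(f, FULL & ~g0)) or \
           (leq(g0, f) and leq(f, FULL & ~g1))

# Truth tables over inputs (x1, x0):
x1, x0 = 0b1100, 0b1010
g1, g0 = x1 & x0, FULL & ~(x1 | x0)  # minterm 3 vs. minterm 0
print(distinguishes(x1, g1, g0))     # True: x1 contains g1, excludes g0
```

Any f satisfying the containment chain can replace the original wire's function for this pair, which is exactly the flexibility SPFD-based rewiring exploits.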
An SPFD-based rewiring algorithm starts by computing the global
function of each pin in a design (a global function is defined in terms of
the PIs). It then determines the SPFD for each pin in topological order
from the POs. After the SPFDs for all pins are available, rewiring is
carried out. As an example, for the partial design shown in the
left of Fig. 4.4, assume there is another node p2' = x1 + x2. It can be
shown that we can connect p2' to G and remove the original wire connected
to p2. The internal function of G should be changed to p0 = p1 + p2' to
maintain the functionality of the design.
The original SPFD-based rewiring algorithm proposed in [213]
finds alternative wires locally. It requires the destination node of the
alternative wire to be the same as the target wire, as indicated in
Fig. 4.5. This obviously limits the effectiveness of the rewiring algo-
rithm. The global SPFD-based rewiring algorithm presented in [64] is
capable of finding global alternative wires that may not share the same
destination nodes as target wires. Let G1 be the destination node of
a target wire as shown in Fig. 4.5. The global rewiring algorithm may
add an alternative wire to a dominator, GD, of G1. A dominator of G1
is a node that is on all paths from G1 to POs. Comparative results show
that ATPG-based rewiring can find alternative wires for around 10% of
the wires, while local and global SPFD-based rewiring can find alter-
native wires for 25% and 36% of the wires, respectively. This clearly
shows the potential of SPFD-based rewiring for LUT-based FPGAs.
Fig. 4.5 Global SPFD rewiring [65].
Integrated mapping and clustering: The quality of clustering
depends significantly on the mapping solution. To address this depen-
dency, mapping and clustering need to be performed together. In [138]
an integrated mapping and clustering algorithm was proposed. The
Fig. 4.6 Mapping and clustering.
algorithm can find a clustering solution with optimal delay under the
commonly-used clustering delay model described in Section 4.1.
Figure 4.6 is an example that illustrates the sub-optimality of sep-
arate mapping and clustering. We assume 3-input LUTs and a cluster
capacity of 3 (i.e. at most 3 LUTs in each cluster). For the example
netlist in (a), delay-optimal mapping generates a netlist with 5 LUTs,
and delay-optimal clustering afterwards produces a clustered netlist in
(b). The critical path contains three inter-cluster connections (count-
ing the edges between I/Os and clusters) and one intra-cluster con-
nection. On the other hand, with node duplication, we can obtain
another mapping solution with 6 LUTs as shown in (c) which results
in the clustered netlist in (d). The critical path of the second clustering
solution contains two inter-cluster connections and two intra-cluster
connections. The latter solution is better than the one in (b) gen-
erated by separate delay-optimal mapping and clustering, assuming
the inter-cluster delay is larger than the intra-cluster delay, as is the
case in practice.
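The arithmetic behind this comparison, with assumed inter- and intra-cluster delay values:

```python
# Critical-path delay under the inter-/intra-cluster delay model for the
# two clustering solutions in Fig. 4.6. The delay values are assumptions
# chosen so that inter-cluster delay exceeds intra-cluster delay.

D_INTER, D_INTRA = 3.0, 1.0  # assumed connection delays

# Solution (b): separate delay-optimal mapping, then clustering
delay_b = 3 * D_INTER + 1 * D_INTRA
# Solution (d): integrated mapping and clustering, with duplication
delay_d = 2 * D_INTER + 2 * D_INTRA

print(delay_b, delay_d)  # 10.0 8.0
```

Trading one inter-cluster connection for one intra-cluster connection wins whenever D_INTER > D_INTRA, which is why the integrated solution is better in practice.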
The proposed algorithm carries out mapping and clustering simul-
taneously. It first uses dynamic programming to determine the optimal
delay at each node while considering both mapping and clustering.
After the optimal delay at each node is determined, the algorithm gen-
erates mapping and clustering solutions simultaneously to realize the
optimal delays under the cluster capacity constraint. The paper also
presents a number of heuristics to reduce area overhead introduced by
duplication. Compared to a sequential approach using state-of-the-art
mapping and clustering algorithms, the proposed algorithm achieves
25% performance gain with 22% area overhead under the inter-/intra-
cluster delay model. After placement and routing using VPR, the
performance is still 12% better on a set of benchmark designs.
5
Design and Synthesis with Higher Level
of Abstraction
Modern SoC FPGAs (or field-programmable SoCs) contain embedded
processors (hard or soft), busses, memory, and hardware accelerators on
a single device. On one hand, these types of FPGAs provide opportuni-
ties and flexibility for system designers to develop high-performance
systems targeting various applications. On the other hand, they also
increase the design complexity considerably. To realize the
promise of this vision, a complete tool chain from concept to imple-
mentation is required [208]. System-level, behavior-level, and RT-level
synthesis techniques are the building blocks for this automated system
design ow. System-level synthesis compiles a complex application in
a system-level description (such as SystemC [180]) into a set of tasks
to be executed on various processors, or a set of functions to be imple-
mented in customized logic, as well as the communication protocols and
the interface logic connecting different modules. Such capabilities are
part of the so-called electronic system-level (ESL) design automation.
ESL design automation has caught much attention from the industry
recently. Many design challenges still remain at this level, such as the
standardization of IP integration, system modeling, performance esti-
mation, overall design flow, and verification methodology. A key
component of ESL design automation is behavior-level synthesis, which
is a process that takes a given behavioral description of a circuit and
produces an RTL design to satisfy area, delay, or power constraints for
the hardware. It primarily consists of three subtasks, namely schedul-
ing, allocation, and binding. The next step after high-level synthesis is
RTL synthesis.¹ Usually, the input to an RTL synthesis tool includes the
number of data path components, the binding of operations to data
path components, and a controller (finite state machine) that contains
the detailed schedule (related to clock edge) of computational, I/O, and
memory operations. The output of the RTL synthesis provides logic
level implementations of the design that can be evaluated through the
optimization of the data path, memory, and controller components,
individually or in a unified manner. Section 1 introduced RTL design
and RTL elaboration steps.
System-level design is a vast area. It can include topics on soft-
ware/hardware partitioning and codesign, reconfigurable computing,
and synthesis for dynamically reconfigurable FPGAs, which are all
beyond the scope of this paper. Therefore, we will primarily focus on
behavior-level and RTL synthesis in this section. Continuing with the
bottom-up approach guided by design levels, as laid out in previous sec-
tions, we will introduce RTL synthesis first and behavior-level synthesis
second.
5.1 RTL Synthesis
RTL synthesis is a key step in the FPGA design flow, as shown in
Section 1. However, there are relatively few publications on this sub-
ject in relevant literature. Part of the reason is that RTL synthesis
for FPGAs can take advantage of existing RTL synthesis techniques
used in ASIC designs, which are already quite mature. Meanwhile,
RTL synthesis for FPGAs does need to consider the specific FPGA
architecture features. For example, the regularity of FPGA logic fab-
ric offers opportunities for directly mapping datapath components to
FPGA logic blocks, producing regular layout, and reducing chip delay
¹ Designers can skip high-level synthesis and directly write RTL code for circuit design.
This design style is facing challenges due to the growing complexity of FPGA designs.
and synthesis runtime. The synthesis tools also need to pay attention
to circuitry features enclosed in the logic blocks, such as the fast carry
chains, to achieve better performance for the design. In addition, most
of the modern FPGAs offer both hard structures, such as blocks of
memory and multipliers, and flexible soft programmable logic to pro-
vide domain-specific programmable solutions. These FPGAs are called
platform FPGAs or heterogeneous FPGAs. Examples include the Altera
Stratix II device families [10] and the Xilinx Virtex-5 device families [207].
This brings new challenges for RTL synthesis to simultaneously target
both hard structures and soft logic. In this subsection, we will present
several research works and provide readers with the flavor of how these
issues have been addressed. We will first talk about datapath synthesis,
which deals with mapping datapath components to the FPGA directly. We
then cover RTL synthesis for heterogeneous FPGAs. RTL synthesis for
FPGA power reduction will be discussed in Section 6.4.3. We believe
there are still many interesting research topics for further study in RTL
synthesis, such as retiming for glitch power reduction, resource sharing
for multiplexer optimization, and layout-driven RTL synthesis, just to
name a few.
5.1.1 Datapath synthesis
Large circuits typically contain a large portion of highly regular data-
path logic. The traditional gate/CLB-level CAD flow first implements
each datapath node with a datapath component, then flattens the dat-
apath components to gates (discarding information about regularity),
and feeds the resulting netlist to the gate-level design flow. It is unlikely
that an efficient bit-slice layout will be rediscovered during placement,
and the generated irregular layout leads to a difficult routing problem.
Moreover, once the circuit is flattened to gates, it is usually not possi-
ble to rediscover uses of specialized features of the CLBs in the FPGA,
such as the fast carry chain circuitry. Flattening to gates also leads to
a much larger problem size: there are many times more gates than
there are nodes in the DFG. To address this problem, datapath synthe-
sis algorithms preserve these datapath structures rather than flattening
them to gates. They also explore the specialized datapath features in
an FPGA and try to map datapath operations directly to the available
modules in the FPGA. Datapath synthesis intrinsically needs to deal
with bit-slices for maintaining the regularity of the datapath and the
layout. Therefore, it is natural to combine it with packing, placement,
and routing tools to improve speed or area. The input of datapath syn-
thesis is a scheduled DFG or CDFG. Thus, the data storage elements
are already determined and need to be considered during synthesis.
There are several related works in this area.
In [113] a design strategy named SDI was presented. It is a
strategy for the efficient implementation of regular datapaths with
fixed topology on FPGAs. It employs parametric module generators,
a floorplanner based on a genetic algorithm, and a circuit compaction
phase through local technology mapping and placement. Figure 5.1
shows the design flow of SDI. The chip topology targeted by SDI
is characterized by a fixed tri-partite layout (Fig. 5.2). The large
middle section holds the regular part of the datapath. This part
consists of a horizontal arrangement of modules, each composed of
vertically stacked bit-slices. The area below the datapath is intended
to hold the controller. A small area above the regular section can
hold irregularities in the modules as cap cells, e.g., the processing of
overflow and carry bits in a signed adder.
Fig. 5.1 SDI design ow in [113].
Fig. 5.2 On-chip topology shown in [113].
As shown in Fig. 5.1, the designs are expressed in the SDI netlist
format SNF, which is a textual netlist of module declarations, module
instantiations, and interconnections. Modules include arithmetic mod-
ules, logic, shifters, comparators, counters, storage elements, etc. SNF
associates values with module parameters such as bus widths, data
types, and optimization requests (speed vs. area).
The module generation procedure (PARAMOG) takes user-
specified parameters for each module instance and prepares a list of
possible layouts with different topologies. The FloorPlanner will read
the available layout topologies for all module instances of the datapath
and begin to linearly place instances in the regular region of the FPGA.
During this process, different concrete layouts are selected and evalu-
ated in context. The FloorPlanner is based on a genetic algorithm, and
thus considers many different layout choices and placements simul-
taneously. When the FloorPlanner has finished its work and created a
suitable linear placement of module instances in the datapath area,
a compaction phase follows. Compaction is performed by merging all
logic (across module boundaries) within a logic equivalence class and
processing the resulting functions with classical logic synthesis and opti-
mization tools. Afterwards, since all placement information within an
equivalence class is lost during compaction, the CLBs in the classes
have to be re-placed. One of the primary criteria for this placement is
the restoration of a regular structure consistent with the one created by
PARAMOG. This placement (Placement) is based on an integer linear
programming (ILP) formulation of the problem. Afterward, a routing
tool available through the Xilinx physical design tool PPR is used to finish
the routing for the design. The final result of SDI is a bit-stream ready
for downloading.
The paper showed two layouts of the same circuit (a 16-bit dat-
apath consisting of two instances of a sample combinational mod-
ule with a structure common to many bit-slices), one conventionally
generated by the Xilinx tool PPR, the other one processed by SDI.
Based on the layout, they showed that the SDI-generated solution is
more regular. The SDI layout is also less congested than the PPR
layout, and the routing delay in the critical path of the SDI solu-
tion is 13% shorter than the PPR layout. SDI also runs about two
times faster. Although a fixed bit-slice topology enables effi-
cient datapath synthesis and a faster optimization flow, it does imply
restrictions on the overall layout of the design. In general, it should
work well for data-intensive designs with simple control logic. How-
ever, for control-intensive designs, the layout restriction can cause larger
wire delays due to the difficulty of distributing control logic flexibly on
the chip.
In [215] the authors proposed an enhancement to the module
compaction algorithm proposed in [113]. They observed that typical
datapath synthesis algorithms sacrificed area to gain regularity. They
proposed two word-level optimizations: multiplexer tree collapsing and
operation reordering. They reduced the area inflation to 3% to 8% as
compared to flat synthesis. Their synthesis results retained a significant
amount of regularity from the original designs.
In another follow-up work [30], the authors observed that the limi-
tation of [113] was that the module compaction step could not handle
specialized CLB features such as a fast carry chain, and thus did not
attempt to merge modules that used such features. Another limitation
of [113] was that only physically adjacent modules in the previously
determined floorplan were considered for compaction. To address these
issues, [30] presented a datapath mapping tool GAMA. GAMA consists
of the following optimization steps.
Tree splitting. GAMA uses a tree-covering algorithm. Therefore, the
input design (control dataflow graph) must be split into a forest of trees.
Cycles are broken at appropriate places, usually at storage elements
demarcating iteration boundaries. This produces a directed acyclic graph
(DAG), which will be further split into trees.
Tree covering. GAMA tries to take advantage of the compound mod-
ules present in typical CLBs. It is often possible to implement multiple
nodes from the DFG together in a compound module that is much
smaller and/or faster than if they were implemented separately. When
such compound modules exist, there may be many different ways in
which the DFG can be covered with module patterns from the library
of possible modules. The authors designed a dynamic programming
algorithm (similar to what was used in DAGON [111]) to find the best
cover in linear time for a tree. Each tree is passed to the tree-covering
algorithm separately. The trees are covered in the topological order
dened by the original design before tree splitting. Each tree covering
is optimal in terms of delay or area, but the overall solution is not
necessarily optimal. Relative module placement in the linear datapath
occurs simultaneously with tree covering.
Post-covering optimizations. This phase may consider rearranging
the modules after they have been placed by the tree-covering algorithm.
This allows layout possibilities that are not considered by the tree-
covering algorithm, such as intermingling the modules from different
trees.
Module generation. Finally, each specified module is generated. A
rich variety of functions can be implemented using a column of 4-input
LUTs augmented with fast carry chain circuitry. The generator, given
a pattern of DFG nodes, values of constant inputs, datapath width (in
bits), etc., creates the module. All modules are generated with the same
pitch and width (in bits).
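The tree-covering step above can be sketched as a DAGON-style dynamic program; the module library, its costs, and the compound-matching rule below are invented for illustration and are not GAMA's actual library:

```python
# Toy sketch of tree covering with a compound module: patterns from a
# module library are matched at each tree node, and dynamic programming
# picks the cheapest cover bottom-up. Library and costs are invented.

from functools import lru_cache

# tree: node -> (op, children); "in" nodes are leaves
TREE = {
    "t": ("add", ["m", "c"]),
    "m": ("mul", ["a", "b"]),
    "a": ("in", []), "b": ("in", []), "c": ("in", []),
}

# (ops covered rooted at the node, cost); ("add", "mul") is a compound
# module covering an add whose first child is a mul (e.g., a MAC slice).
LIBRARY = [
    (("add",), 2.0), (("mul",), 4.0), (("in",), 0.0),
    (("add", "mul"), 5.0),
]

@lru_cache(maxsize=None)
def best_cost(node):
    op, kids = TREE[node]
    best = float("inf")
    for pat, cost in LIBRARY:
        if pat[0] != op:
            continue
        if len(pat) == 1:
            best = min(best, cost + sum(best_cost(k) for k in kids))
        elif kids and TREE[kids[0]][0] == pat[1]:
            # compound match: fuse the first child into this module;
            # its children become the module's inputs
            inner = TREE[kids[0]][1]
            best = min(best, cost + sum(best_cost(k) for k in inner)
                             + sum(best_cost(k) for k in kids[1:]))
    return best

print(best_cost("t"))  # 5.0: compound mul-add beats mul (4.0) + add (2.0)
```

As in DAGON, each node is visited once and each pattern match is constant work, so the cover is found in time linear in the tree size.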
Experimental results showed that for 32-bit datapath designs
mapped to the Xilinx 4000 architecture, GAMA gave compilation
speeds 3.24 times faster than compiling flattened netlists. Designs
generated by GAMA were roughly of the same quality or better than
their flattened equivalents in terms of both CLB usage and critical
path delay.
5.1.2 Heterogeneous FPGAs
In [107] the authors presented an RTL synthesis tool for heterogeneous
FPGAs. Modern heterogeneous FPGAs contain hard specific-purpose
structures such as blocks of memory and multipliers in addition to
the completely flexible soft programmable logic and routing. These
hard structures provide major benets, yet raise interesting questions
in FPGA CAD and architecture. The authors presented a synthesis
tool, called Odin, and an algorithm that permits flexible targeting of
hard structures in FPGAs. Odin maps Verilog designs to two different
FPGA CAD flows: Altera's Quartus and the academic VPR CAD flow.
Figure 5.3 shows the overall design flow of Odin.
First, a front-end parser parses the Verilog design and generates a
hierarchical representation of the design. Second, Odin has an elabo-
ration stage that traverses the intermediate representation of a design
to create a flat netlist that consists of structures including logic blocks,
memory blocks, if and case blocks, arithmetic operations, and registers.
Each of these structures within the netlist is a node in the netlist. Third,
some simple synthesis and mapping is performed on this netlist. This
includes examining adders and multipliers for constants, collapsing mul-
Fig. 5.3 RTL synthesis flow for heterogeneous FPGAs in [107].
tiplexers, and detecting and re-encoding finite state machines to one-
hot encoding. Fourth, an inferencing stage searches for structures in
the design that could be mapped to hard circuits on the target FPGA.
These structures are connected sub-graphs of nodes that exist in the
design netlist. A matching algorithm is used to carry out this search.
This matching problem is a form of sub-graph isomorphism, which has
been used in instruction generation for configurable processor systems
[55] and other applications.
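The matching step can be illustrated with a small sketch. The netlist encoding, node names, and pattern below are hypothetical, not Odin's actual data structures: a recursive descent tries to map each node of a small multiply-accumulate pattern onto compatible nodes of the design netlist.

```python
# Sketch of pattern matching for hard-block inference (hypothetical data
# structures; Odin's real implementation differs). A netlist is a dict:
# node name -> (operation, list of fan-in node names).
def match(pattern, netlist, p_root, n_root, binding=None):
    """Try to map the pattern DAG rooted at p_root onto the netlist DAG
    rooted at n_root. Returns a pattern->netlist binding, or None."""
    binding = dict(binding or {})
    p_op, p_ins = pattern[p_root]
    n_op, n_ins = netlist[n_root]
    if p_op != n_op or len(p_ins) != len(n_ins):
        return None
    if p_root in binding:                  # already bound: must agree
        return binding if binding[p_root] == n_root else None
    binding[p_root] = n_root
    for p_in, n_in in zip(p_ins, n_ins):   # recurse on fan-in pairs
        if p_in not in pattern:            # free pattern input: matches anything
            continue
        binding = match(pattern, netlist, p_in, n_in, binding)
        if binding is None:
            return None
    return binding

# A multiply-accumulate pattern: acc = (a * b) + c
PATTERN = {"mul0": ("mul", ["a", "b"]), "add0": ("add", ["mul0", "c"])}

NETLIST = {
    "m1": ("mul", ["x", "y"]),
    "s1": ("add", ["m1", "z"]),    # the MAC pattern matches here
    "s2": ("add", ["s1", "w"]),    # add fed by an add: no match
}

hits = [n for n in NETLIST if match(PATTERN, NETLIST, "add0", n)]
print(hits)  # ['s1']
```

Matched sub-graphs would then be replaced by a single hard-block node (e.g., an LPM instance) before the binding stage.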
Finally, a binding stage guides how each node in the netlist will be
implemented. This is done by mapping nodes in the netlist to either
hard circuits, soft programmable logic, or a mixture of both. One way to
do this is to map structures to library parameterized modules (LPMs),
which, in later stages of the industrial CAD flow, will bind to an
implementation on the FPGA, whether hard or soft.
The output from Odin is a flat netlist consisting of connected complex
logic structures and primitive gates. The authors showed that the
quality of their tool is comparable to that of Altera's front-end synthesis
tool. They also showed that their binding/mapping results compared
favorably to those from Altera's Quartus tool.
5.2 Behavior-Level Synthesis
The basic problem of high-level synthesis (or behavior-level synthesis)
is the mapping of a behavioral description of a digital system into a
cycle-accurate RTL design consisting of a datapath and a control unit.
A datapath is composed of three types of components: functional units
(e.g., ALUs, multipliers, and shifters), storage units (e.g., registers and
memory), and interconnection units (e.g., buses and multiplexers). The
control unit is specified as a finite state machine that controls the set
of operations for the datapath to perform during every control step. The
high-level synthesis process mainly consists of three tasks: scheduling,
allocation, and binding. Scheduling determines when a computational
operation will be executed; allocation determines how many instances
of resources (functional units, registers, or interconnection units) are
needed; binding binds operations, variables, or data transfers to these
resources. In general, it has been shown that the code density and
simulation time can be improved by 10X and 100X, respectively, when
moving from RTL synthesis to behavior-level synthesis [191, 192].
Such an improvement in efficiency is much needed for design in the deep
submicron era. However, most behavioral synthesis problems are
difficult to solve optimally due to various constraints, including latency
and resource constraints. Moreover, the subtasks of behavioral synthesis
are highly interrelated with one another. For example, the scheduling
of operations to control steps is directly constrained by resource allocation.
Meanwhile, a performance/cost tradeoff exists in the design-space
exploration. An area-efficient design with a smaller number of resources
will increase the total number of control steps needed to execute the desired
function. On the other hand, allocating more resources to exploit parallel
execution of operations can achieve higher performance, but at
the expense of a larger area.
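This latency/resource tension can be made concrete with a minimal resource-constrained list scheduler. This is a generic textbook sketch with a made-up five-operation DFG, not any particular tool's algorithm.

```python
# Minimal resource-constrained list scheduling (textbook sketch).
# dfg: op name -> (unit type, list of predecessor ops).
def list_schedule(dfg, units):
    """Assign each operation a control step; at most units[t] operations
    of unit type t may execute in the same step. Returns op -> step."""
    step_of, remaining = {}, set(dfg)
    step = 0
    while remaining:
        busy = {t: 0 for t in units}
        for op in sorted(remaining):
            typ, preds = dfg[op]
            # Ready when every predecessor finished in an earlier step.
            ready = all(p in step_of and step_of[p] < step for p in preds)
            if ready and busy[typ] < units[typ]:
                step_of[op] = step
                busy[typ] += 1
        remaining -= set(step_of)
        step += 1
    return step_of

# y = ((a*b) + (c*d)) + (e*f): three multiplies feeding two chained adds.
DFG = {
    "m1": ("mul", []), "m2": ("mul", []), "m3": ("mul", []),
    "a1": ("add", ["m1", "m2"]), "a2": ("add", ["a1", "m3"]),
}
for n_mul in (3, 1):
    sched = list_schedule(DFG, {"mul": n_mul, "add": 1})
    print(n_mul, "multiplier(s) -> latency", max(sched.values()) + 1, "steps")
```

With three multipliers the DFG completes in 3 control steps; with a single multiplier the same DFG needs 4, illustrating the area-for-latency tradeoff described above.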
Traditionally, designers were more concerned with the area and power
of functional units and registers. As technology advances, the area and
power of multiplexers and interconnects have come to far outweigh the
area and power of functional units and registers. Multiplexers are
particularly expensive in FPGA architectures. It has been shown that the
area, delay, and power of a 32-to-1 multiplexer are almost equivalent
to those of an 18-bit multiplier in 0.1 µm FPGA technology [41]. In
general, having a smaller number of functional units or registers
allocated, together with a larger number of wide multiplexers and a larger
amount of interconnect, may lead to a solution that is unfavorable for
both performance and area/power cost. Tackling this increasingly
alarming problem requires an efficient search engine that can explore a
sufficiently large solution space while considering multiple constraining
factors.
On top of these difficulties, behavioral synthesis also faces challenges
in connecting better to the physical reality. Without physical
layout information, interconnect delay cannot be accurately
estimated. Since interconnect delay is the dominant factor in determining
design performance in submicron technology, ignoring
it makes it even more difficult for behavioral synthesis to deliver
satisfactory solutions. These unique challenges are driving the need to
develop new behavioral synthesis techniques and design flows that
are able to overcome or mitigate the negative impact of the separation
between traditional high-level synthesis and low-level design. In addition,
there is a need for powerful data-dependence analysis tools to
analyze the operational parallelism available in a design before a
proper amount of resources can be allocated to carry out the computation
in parallel. Meanwhile, memory operations usually represent the
bottleneck for performance optimization. Carrying out memory partitioning,
bitwidth optimization, and memory access pattern optimization,
together with behavioral synthesis for different application domains,
poses unique challenges for delivering satisfactory design
quality. Given all these challenges, much research is still needed in this
area.
We will first present work in behavioral synthesis for multi-FPGA
systems. We then present some initial work on layout-driven behavioral
synthesis, which has shown promising results for improving design
quality. Finally, we briefly introduce some other techniques involved in
behavioral synthesis, including loop transformation, branch optimization,
and memory allocation. Behavioral synthesis for power minimization
will be presented in Section 6.4.4.
5.2.1 Behavioral synthesis for multi-FPGA systems
Due to the capacity limitation of FPGA devices, there has been work
on mapping large designs onto multiple-FPGA systems. The traditional
flow usually consists of two phases [83]. In the first phase, a synthesizer
is used to transform a design specification into a CLB-based netlist
by performing high-level compilation, RTL/logic synthesis, and CLB-
based technology mapping tasks. In the second phase, a circuit-level
partitioner is used to partition the CLB netlist into FPGA chips. This
method is mainly constrained by pin limitations on the FPGAs. In [167]
the authors experimented with multiple FPGA partitioning methods at
behavioral and structural levels. They observed that during structural
partitioning, the IO limitation can be reduced if the partitioner is able
to decompose and place portions of structural components, such as
multiplexers and controllers, into dierent FPGA chips.
Fig. 5.4 The synthesis flow in [83].
In [83] the authors presented a new integrated synthesis and partitioning
method for multiple-FPGA applications. Their approach tries
to bridge the gap between high-level synthesis and physical partitioning
by exploiting the design hierarchy. The input to their system is a design
specification described in Verilog. Figure 5.4 shows their synthesis flow.
In the first step, a synthesizer performs RTL and FPGA synthesis
tasks, including hardware description language (HDL) compilation,
unit selection, unit/storage/interconnect binding, logic minimization,
and CLB-based technology mapping. The synthesizer uses a fine-grained
synthesis method to generate a structural tree, which represents
the structural hierarchy of the HDL description of a design. In
a structural tree, the root node represents the top-level design, each
intermediate node represents a higher-level construct such as a module,
process, or task, and each leaf node represents a circuit cluster.
An HDL description of a design is usually expressed as a set of
hierarchically interconnected modules. Each module may contain a set of
concurrent processes. From the hardware point of view, each process
can be implemented as an independent hardware block. Furthermore,
a process usually consists of a set of statements with input and output
signals. The values of the outputs depend on the execution of
the statements embedded in the process. To further decompose these
statements, each output can be represented as a function of a set of
Fig. 5.5 (a) The design hierarchy; (b) the structural tree in [83].
inputs and internal signals in the process. Figure 5.5 shows a design
with (a) its functional hierarchy, and (b) the structural tree for the
design. The authors then map the nodes in the structural tree to a
hierarchical connected graph. Each module node in the structural tree
is mapped to a top-level node in the graph, while each process and
functional node is mapped to a second-level and a third-level node,
respectively. Figure 5.6 shows an example of their hierarchical con-
nected graph.
The authors then formulate their problem as follows: given a hierarchical
connected graph G and the CLB/IO-pin constraints of the FPGA
chips, find a minimum number of FPGAs to cover G. They used a
heuristic called hierarchical set-covering partitioning. The basic idea of
the covering method is to start the set-covering procedure from the
top-level nodes (i.e., module nodes). If no more feasible covers can be
found in the top level, then the set-covering process continues on the
nodes at the lower level. By exploiting the design structural hierarchy
during the multiple-FPGA partitioning, they showed that their method
produced fewer FPGA partitions with higher CLB utilization and lower
I/O-pin utilization.
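The flavor of this heuristic can be sketched as a greedy packer that first tries to cover whole top-level module groups and only then falls back to individual lower-level nodes. The costs and constraints below are illustrative; the actual algorithm in [83] operates on the full hierarchical connected graph.

```python
# Greedy sketch of hierarchical set-covering partitioning (illustrative
# only). Each node: (CLB cost, I/O pin count). It assumes every node
# fits on an empty chip by itself.
def pack_fpgas(nodes, hierarchy, max_clb, max_io):
    """Cover all leaf nodes with FPGAs, preferring to keep whole
    top-level module groups together before splitting them."""
    def fits(group, chip):
        clb = sum(nodes[n][0] for n in group) + sum(nodes[n][0] for n in chip)
        io = sum(nodes[n][1] for n in group) + sum(nodes[n][1] for n in chip)
        return clb <= max_clb and io <= max_io

    uncovered = {n for kids in hierarchy.values() for n in kids}
    chips = []
    while uncovered:
        chip = []
        # First try whole top-level groups (module granularity)...
        for kids in hierarchy.values():
            group = [n for n in kids if n in uncovered]
            if group and fits(group, chip):
                chip += group
        # ...then fall back to individual lower-level nodes.
        for n in sorted(uncovered - set(chip)):
            if fits([n], chip):
                chip.append(n)
        uncovered -= set(chip)
        chips.append(chip)
    return chips

NODES = {"p1": (40, 20), "p2": (50, 25), "p3": (60, 30), "p4": (30, 15)}
HIER = {"modA": ["p1", "p2"], "modB": ["p3", "p4"]}
parts = pack_fpgas(NODES, HIER, max_clb=100, max_io=60)
print(len(parts), "FPGAs:", parts)
```

Here both modules fit whole on their own chips, so the cover uses two FPGAs without splitting either module across chips.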
Fig. 5.6 A hierarchical connected graph mapped from a structural tree in [83].
The authors in [80] presented an overview of their COBRA-ABS high-level
synthesis tool. COBRA-ABS was designed to synthesize
custom architectures for arithmetic-intensive algorithms, specified in
C, for implementation on multi-FPGA platforms. It performs global
optimization of high-level synthesis using simulated annealing,
integrating all of the following operations: datapath partitioning over
multiple FPGAs, functional unit (FU) operation scheduling, FU selection
and operator binding, FU allocation, register allocation, inter-FPGA
communication scheduling, and inter-FPGA communication
binding. COBRA-ABS synthesizes a custom very long instruction word
(VLIW) architecture for the given algorithm for implementation on
the specified FCCM. To illustrate the operation of this tool, a number
of results for the synthesis of a Fast Fourier Transform algorithm were
presented.
5.2.2 Layout-driven behavioral synthesis
In [212] a layout-driven behavioral synthesis approach was presented
to reduce the gap between the metrics predicted during synthesis and
the actual data after FPGA implementation. This allows more
efficient exploration of the design space and thus avoids unnecessary
iterations through the design process. By producing not only an RTL
netlist but also an approximate physical topology of implementation
at the chip level, the solution would perform at the predicted metric
once it is implemented. The problem is formulated as follows: given
(1) a data flow graph (DFG), (2) the maximum allowable clock period and
execution time, which are usually part of the system specification, and
(3) component power/performance tradeoff functions due to different
implementations, identify whether a feasible RTL datapath
solution exists. If there is a solution, perform scheduling and binding,
and generate an RTL netlist and its corresponding floorplan; otherwise,
report it to the user and output the best solution that can be achieved.
Figure 5.7 shows their overall design flow.
Fig. 5.7 Layout-driven high-level synthesis design flow in [212].
The authors first estimate the resources required for a design and
use their physical-level estimation tool ChipEst-FPGA [211] to obtain
an approximate topology of the layout. They then obtain distance
metrics between the different units and use this step to provide feedback to
the scheduling and binding task. Scheduling and binding are carried
out one control step at a time. Given timing and resource constraints,
they build a search tree encoding different binding and scheduling
solutions. They then prune branches that lead to larger area or latency.
Once a proper search solution is found, the algorithm records its
scheduling-binding information and proceeds to do scheduling and
binding for the next cycle.
This process is repeated until all cycles are processed. If the
scheduling-binding succeeds, the algorithm then updates the area
and timing information based on the component information in the
library and generates an optimized RTL netlist. At this point, if the
cycle time exceeds the maximum clock-period limit, layout adjustment is
invoked to re-run ChipEst-FPGA on the updated RTL netlist. After layout
adjustment, if the cycle time still cannot satisfy the maximum clock
period constraint, the authors relax the latency constraint and
redo the scheduling and binding until the latency hits a threshold or a
feasible solution is found. Experimental results showed that with
this approach, the authors could find results that satisfied the
constraints, whereas the timing constraints were violated using a traditional
method (without layout information). They also found that interconnect
delay could contribute up to 55% of the circuit delay
for their benchmarks.
In [54] the authors presented layout-driven architectural synthesis
algorithms, including scheduling-driven placement, placement-driven
simultaneous scheduling with rebinding, distributed control generation,
etc. The synthesis engine can target both microarchitectures
and FPGAs. To target designs in nanometer technologies, their technique
supports multicycle on-chip communication (data transfers on
global interconnects can take multiple clock cycles). Their architecture
model, regular distributed register (RDR), divides the entire chip into
an array of islands. The island size is chosen so that all local computation
and communication within an island can be performed in a
Fig. 5.8 Overall synthesis flow of MCAS in [54].
single clock cycle. Signals traveling between two different islands
take 1 to k clock cycles depending on the distance between the
islands, where k is the maximum number of cycles needed to communicate
across the chip. Figure 5.8 shows their overall synthesis flow,
named the architectural synthesis system for multi-cycle communication
(MCAS).
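A sketch of this latency model follows; the island distance metric and the per-cycle reach are hypothetical parameters, since the survey only states that inter-island transfers take 1 to k cycles depending on distance.

```python
import math

# Sketch of the RDR multicycle communication model: a transfer between
# two islands takes ceil(distance / reach) cycles, where `reach` is how
# many islands a signal can cross per clock cycle (hypothetical value).
def comm_cycles(src, dst, reach=2):
    """src, dst: (row, col) island indices; Manhattan distance metric."""
    dist = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    return max(1, math.ceil(dist / reach))

print(comm_cycles((0, 0), (0, 1)))  # adjacent islands: 1 cycle
print(comm_cycles((0, 0), (3, 3)))  # distance 6, reach 2 -> 3 cycles
```

The scheduler can then treat each global transfer as a multi-cycle operation whose latency depends on the placement of its source and destination components.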
At the front end, MCAS first generates the control data flow graph
(CDFG) from the behavioral description. Based on the CDFG, it
performs resource allocation, followed by an initial functional unit
binding. The objective of resource allocation is to minimize the resource
usage (e.g., functional units, registers, etc.) without violating the timing
constraint. It uses the time-constrained force-directed scheduling
algorithm [154] to obtain the resource allocation. After resource allocation,
it employs an algorithm proposed in [112] to bind operation nodes
to functional units to minimize the potential global data transfers.
An interconnected component graph (ICG) is derived from the bound
CDFG. An ICG consists of a set of components (i.e., functional units)
to which operation nodes are bound, interconnected by a set
of connections that denote data transfers between components.
At the core, this flow performs the scheduling-driven placement,
which takes the ICG as input, places the components in the island struc-
ture of the architecture using a simulated-annealing-based placer, and
returns the island index of each component. After the scheduling-driven
placement, both the CDFG schedule and the layout information are
produced. To further minimize the schedule latency, MCAS performs
placement-driven scheduling with rebinding. The algorithm is based
on the force-directed list-scheduling framework, and is integrated with
simultaneous rebinding.
At the back end, MCAS performs register and port binding, followed
by datapath and distributed controller generation. The final output
of MCAS includes a datapath in structural VHDL format and a
set of distributed controllers in behavioral FSM style (these RT-level
VHDL files will be fed into logic synthesis tools), together with floorplan
and multi-cycle path constraints for the downstream place-and-route tools.
To obtain the final performance results, Altera's Quartus II version
2.2 was used to implement the datapath portion on a real FPGA device,
the Stratix EP1S40F1508C5 [12]. All the pipelined multipliers were
implemented in the dedicated DSP blocks of the Stratix device.
For data-flow-intensive examples, the authors obtained a 44% improvement
on average in the clock period and a 37% improvement
on average in the final latency compared to a traditional
non-layout-driven flow. For designs with control flow, their approach
achieved a 28% clock-period reduction and a 23% latency reduction on
average.
5.2.3 Other techniques
There are other types of studies on specific behavioral synthesis tasks
for FPGA designs, such as loop transformation, branch optimization,
memory allocation, module selection and resource sharing, and
communication optimization. We will briefly introduce these works.
Loop transformation. In [171] the authors tried to develop fast
and accurate performance and area models to quickly understand the
impact and interaction of program transformations. They presented
a combined analytical performance and area modeling approach for
complete FPGA designs in the presence of loop transformations. Their
approach takes into account the impact of I/O memory bandwidth and
memory interface resources (often the limiting factor in the effective
implementation of computations). The preliminary results revealed
that their modeling was accurate and was therefore amenable to use
in a compiler tool to quickly explore very large design spaces.
Branch optimization. In [176] the authors explored using information
about program branch probabilities to optimize reconfigurable
designs. The basic premise is to improve utilization by dedicating
more resources to branches that execute more frequently. A hardware
compilation system was developed for producing designs
optimized for different branch probabilities. The authors proposed
an analytical queuing network performance model to determine the
best design from observed branch probability information. The branch
optimization space was characterized in an experimental study of two
complex applications for Xilinx Virtex FPGAs: video feature extraction
and progressive refinement radiosity. For designs of equal performance,
branch-optimized designs require 24% and 27.5% less area. For
designs of equal area, branch-optimized designs run up to three times
faster.
Memory allocation. In [94] the authors observed that FPGA-based
processors, like many conventional DSP systems, often associate small
high-performance memories with each processing chip. These memories
may be onboard embedded SRAMs or discrete parts. In the process of
mapping a computation onto an FPGA processor, it is necessary to map
the application's data to memories. The authors presented an algorithm,
implemented in their NAPA C compiler, to assign data
automatically to memories so as to minimize the overall execution
time of the loops in the program. The algorithm uses a search technique
known as implicit enumeration to reduce the otherwise exponential
search space. They showed the correctness of their implementation.
Module selection and resource sharing. In [177] the authors
developed a synthesis methodology that generates pipelined datapath
circuits from a high-level data-flow specification. This methodology
carries out module selection (selecting a module implementation from
a variety of circuit implementation options for each operation) and
resource sharing (i.e., resource binding) together. They used a module
library created for the Xilinx Virtex-4 architecture. They presented two
types of algorithms. One was based on a recursive branch-and-bound
algorithm; however, its runtime complexity was high. They then
presented another algorithm based on iterative modulo scheduling with a
backtracking feature. Their objective was to minimize the area cost of
the resulting circuit while meeting a user-specified minimum throughput
constraint. They showed that even for small benchmark circuits,
combining module selection and resource sharing could offer significant
area savings relative to applying either alone.
Communication optimization. In [53] the authors proposed a
communication synthesis approach targeting systems with sequential
communication media (SCM). Since SCMs require that the reading
sequence and the writing sequence have the same order, different
transmission orders may have a dramatic impact on the final
performance. The goal of their work was to consider behaviors
in communication synthesis for SCM, detect an appropriate transmission
order to optimize latency, automatically transform the behavior
descriptions, and automatically generate driver routines and glue logic
to access physical channels. They showed that solving the order
detection problem is equivalent to solving the resource-constrained
scheduling problem. Since the scheduling problem with resource
constraints is NP-complete in general, they used a list-scheduling-based
heuristic algorithm to tackle this problem. To deal with loops in the CDFG,
they completely expanded the loop iteration space and used iteration
reordering techniques to generate reconstructed loops and reduce the
associated storage overhead. To get real simulation numbers, they
developed a FIFO module in VHDL that resembles the behavior of the
Xilinx FSL (Fast Simplex Link) [203]. Their algorithm, named SCOOP,
achieved an average 20% improvement in total latency on a set of real-life
benchmarks compared to the results without optimization.
6 Power Optimization
With the exponential growth in the performance and capacity of
integrated circuits, power consumption has become one of the most
critical design factors in the IC design process. FPGAs are not power-efficient.
The post-fabrication flexibility provided by these devices is
implemented using a large number of prefabricated routing tracks and
programmable switches. Also, the generic logic structures in FPGAs
consume more power than the dedicated circuitry found in
an ASIC. It has been shown that a typical FPGA chip consumes
about 50 to 100X more power than a functionally equivalent ASIC
chip [119, 219]. It is projected that a high-end FPGA chip with 7 million
logic cells in a 35 nm technology could consume close to 200 W of
power [91]. This power dissipation level is almost equivalent to that
of a high-performance microprocessor in the same technology that
has 3.5 billion transistors and runs 20X faster [1]. The large power
consumption of FPGA chips limits their use in mainstream low-power
applications. Meanwhile, large power consumption and heat dissipation
typically lead to higher costs for thermal packaging, fans, and
electricity, and also have a negative impact on signal integrity. The
growing demand for power reduction in FPGA designs has caught the
attention of both industry and academia. A large amount of research
has been published in this area, especially in the past five years. We
will touch on power estimation, power breakdown, synthesis for power
minimization, and synthesis and design for novel power-efficient FPGA
architectures in this section. Low-power RTL and high-level synthesis
techniques will be discussed separately in the next section.
6.1 Sources of Power Consumption
There are three sources of power consumption in FPGAs: (1) switching
power, (2) short-circuit power, and (3) static power. The rst two types
of power can only occur when a signal transition happens at the gate
output; together they are called dynamic power. There are two types
of signal transitions: one is the signal transition necessary to perform
the required logic functions between two consecutive clock ticks (called
functional transition); the other is the unnecessary signal transition due
to the unbalanced path delays to the inputs of a gate (called spurious
transition or glitch). Glitch power can be a significant portion of the
dynamic power. Static power is the power consumption when there
is no signal transition for a gate or a circuit module. As technology
advances to feature sizes of 90 nm and below, static power starts to
become a dominating factor in the total chip power dissipation.
Switching power can be modeled by the following formula:

    P_{sw} = 0.5 \cdot f \cdot V_{dd}^2 \cdot \sum_{i=1}^{n} C_i \cdot S_i    (6.1)
where n is the total number of nodes, f is the clock frequency, V_{dd} is
the supply voltage, C_i is the load capacitance of node i, and S_i is the
transition density (switching activity) of node i. Switching activity is
the average number of transitions (0 → 1 or 1 → 0) a signal makes
per unit time.
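Formula (6.1) is simple to evaluate once per-node capacitances and activities are available; a small sketch with made-up node data:

```python
# Evaluate the switching-power model (6.1):
#   Psw = 0.5 * f * Vdd^2 * sum(Ci * Si)
# The node data below is invented for illustration only.
def switching_power(f_hz, vdd, nodes):
    """nodes: list of (load capacitance in farads, switching activity)."""
    return 0.5 * f_hz * vdd**2 * sum(c * s for c, s in nodes)

nodes = [
    (20e-15, 0.10),   # logic output: 20 fF load, 10% activity
    (150e-15, 0.20),  # routed wire: 150 fF load, 20% activity
    (80e-15, 0.05),   # buffered routing segment: 80 fF, 5% activity
]
p = switching_power(f_hz=100e6, vdd=1.2, nodes=nodes)
print(f"{p * 1e6:.2f} uW")  # 2.59 uW
```

The quadratic dependence on V_{dd} is why supply-voltage scaling, discussed later for low-power FPGA architectures, is such an effective dynamic-power lever.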
Short-circuit power is another type of dynamic power. When a
signal transition occurs at a gate output, both the pull-up and pull-
down transistors can be conducting simultaneously for a short period
of time. Short-circuit power represents the power dissipated via the
direct current path from Vdd to GND during this period of time. It is
a function of the input signal transition time and load capacitance.
Static power is also called leakage power. There are primarily three
types of leakage power: sub-threshold leakage power, reverse-biased
junction leakage power, and gate leakage power. Sub-threshold current
is the weak inversion conduction current that flows between the source
and drain of a MOSFET when the gate voltage is below the threshold
voltage. Sub-threshold leakage is the dominant factor in leakage power.
It is exponentially related to threshold voltage and temperature, as
modeled by the following formula:

    I_{sub} \propto (W/L) \cdot \exp((V_{GS} - V_{TH}) / (n \cdot V_t))    (6.2)

where W and L are the effective width and length of the device, V_{GS}
is the gate-to-source voltage, V_{TH} is the threshold voltage, V_t = kT/q is
the thermal voltage (k and q are constants, and T is temperature), and
n is a technology-dependent parameter.
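Since (6.2) is only a proportionality, it still supports relative comparisons; the sketch below, with an illustrative subthreshold slope factor n and illustrative voltages rather than real device data, estimates how much raising the threshold voltage cuts subthreshold leakage:

```python
import math

# Relative subthreshold leakage from (6.2):
#   Isub ~ (W/L) * exp((Vgs - Vth) / (n * Vt))
# n and the voltages below are illustrative values, not device data.
K_OVER_Q = 8.617e-5          # Boltzmann constant over electron charge (V/K)

def rel_leakage(w_over_l, vgs, vth, n=1.5, temp_k=300.0):
    vt = K_OVER_Q * temp_k   # thermal voltage, ~0.0259 V at 300 K
    return w_over_l * math.exp((vgs - vth) / (n * vt))

# Raising Vth from 0.30 V to 0.40 V (e.g., a high-Vt cell), device off (Vgs=0):
ratio = rel_leakage(1.0, 0.0, 0.30) / rel_leakage(1.0, 0.0, 0.40)
print(f"leakage reduction: {ratio:.0f}x")
```

The exponential dependence on V_{TH} is what makes multi-threshold techniques so effective for static power, and the dependence on T through V_t is why leakage worsens rapidly as a chip heats up.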
MOS transistors have reverse-biased pn junctions from the
drain/source to the well. These reverse-biased pn junctions give rise to
static current passing across the junctions. This leakage is a function
of junction area and doping concentration.
As gate oxide thickness scales down, there is an increased probability
of direct tunneling current through the gate oxide. There are
three components of gate leakage: leakage between the
gate and the drain, between the gate and the substrate, and between
the gate and the source. Although gate leakage is becoming increasingly
important, it will have to be controlled with other techniques such as
high-k dielectrics. Figure 6.1 schematically illustrates the various types
of leakage currents.
6.2 Power Estimation
Power estimation is an important task for FPGAs. FPGA designers
rely on power estimation tools in order to predict the power consump-
tion of circuits and discover possible power violations during the design
process. Power estimation also serves as the foundation for power
optimization. It is well known that the higher the design level, the larger the
Fig. 6.1 Various leakage currents.
impact of power reduction techniques. Therefore, power estimation is
required across the various levels in the FPGA design hierarchy in
order to effectively control and reduce the power consumption of the
end product.
As the technology node scales down, power consumption in interconnects
becomes the dominant source in sub-micron FPGAs. Interconnects can
contribute 75–85% of the total power [119, 130] for most FPGA
designs. Consequently, power estimation for FPGAs must consider
routing interconnect capacitance. Interconnect estimation can be done
at different design levels. The estimation becomes increasingly accurate
as the design enters lower design levels. After placement and routing,
wire capacitance can be more accurately captured and back-annotated
to the original netlist for better power estimation. However, there is a
tradeoff between accuracy and runtime complexity. Although high-level
interconnect estimation is not as accurate as low-level estimation, its
runtime is much faster, which is beneficial when using high-level
synthesis to explore low-power design possibilities.
Most FPGA companies provide online spreadsheets for their cus-
tomers to estimate power dissipation for particular devices in early
design stages [7, 121, 206]. Some vendors, such as Xilinx and Altera,
have incorporated the power estimation feature in their CAD tools,
such as XPower [209] and PowerPlay [8], which can be launched after
placement and routing for a more accurate power analysis of the design.
We will introduce dynamic and static power estimation in Sections 6.2.1
and 6.2.2, and then we will briey summarize the power estimation
works published in the literature.
6.2.1 Dynamic power estimation
Dynamic power estimation is mainly concerned with switching activ-
ity estimation and load capacitance estimation (Formula 6.1). Both
gates (including buffers) and wires contribute capacitance in the circuit.
Gate-related capacitance is usually easy to obtain because FPGA chips
are already fabricated and their gate sizes are already known. Wire
capacitance estimation is more demanding. Gate-level power estimators
can rst perform placement and routing for more accurate capacitance
extraction. The extracted capacitance can then be back-annotated to
the power estimation ow for better estimation. For high-level power
estimation, detailed placement and routing are usually not available.
These estimators have to rely on wire-length estimation methods,
such as methods based on Rent's rule, to estimate the total amount
of wire and wire capacitance involved in the design.
There are primarily three approaches reported in the literature for
FPGA switching activity estimation: characterization through
board measurement, statistical models, and simulation models. One
reported approach used an emulation board embedded with a Virtex FPGA
for power measurement. The authors then calculated the average switching
activity for logic elements using a power estimation formula published by
Xilinx. The formula is as follows [210]:
    P_{INT} = V_{Core} \cdot K_p \cdot f_{Max} \cdot N_{LC} \cdot Tog_{LC}    (6.3)

P_{INT} is the internal power consumption caused by the charging and
discharging of the capacitance on each logic element that switches.
V_{Core} is the core voltage; K_p is a technology-dependent constant;
f_{Max} is the maximum clock speed; N_{LC} is the number of logic
elements used; and Tog_{LC} is the average switching activity of all the
logic elements.
Works in [135, 161, 170] used statistical models to estimate switching
activity. The static probability of a signal x, denoted by P(x), is
defined as the probability that signal x has the logic value 1. For each
LUT in the circuit, the function implemented in that LUT can be
expressed as a function y = f(x_1, x_2, ..., x_n). For each input x_i, two
new Boolean functions f_{x_i} and f_{x̄_i} can be generated by setting input
x_i to 1 and 0, respectively, in f(x_1, x_2, ..., x_n) (these functions are called
cofactors of y with respect to x_i). The Boolean difference of the output
with respect to an input x_i can be calculated by:

∂y/∂x_i = f_{x_i} ⊕ f_{x̄_i}    (6.4)

where ⊕ denotes the exclusive-OR operation. The probability of this
Boolean difference, P(∂y/∂x_i), is the static probability that a transition
in x_i causes a transition at the output. The switching activity at the
output, S(y), can then be calculated as follows:

S(y) = Σ_{i=1}^{n} P(∂y/∂x_i) · S(x_i)    (6.5)
In [161] the authors assumed that all primary inputs have a static
probability of 0.5 and a switching activity of 0.5. Notice that this model
also assumes that the switching activities of input signals are not cor-
related, which is usually not the case. It is also hard for this model to
capture glitch power.
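As a small sketch of Eqs. (6.4) and (6.5), the Boolean difference probabilities can be computed directly from a LUT truth table, here assuming uniform, uncorrelated inputs when evaluating P(∂y/∂x_i); the function and variable names are illustrative:

```python
from itertools import product

def switching_activity(truth_table, n, s_in):
    """Propagate switching activity through one LUT via Eqs. (6.4)-(6.5).

    truth_table: dict mapping n-bit input tuples to 0/1.
    s_in: list of switching activities S(x_i) of the inputs.
    Assumes uniform, uncorrelated inputs when estimating P(dy/dx_i).
    """
    s_y = 0.0
    for i in range(n):
        toggles = 0
        cases = 0
        # Enumerate the other n-1 inputs; the Boolean difference is the
        # XOR of the two cofactors f|x_i=1 and f|x_i=0 (Eq. 6.4).
        for bits in product((0, 1), repeat=n - 1):
            hi = truth_table[bits[:i] + (1,) + bits[i:]]
            lo = truth_table[bits[:i] + (0,) + bits[i:]]
            toggles += hi ^ lo
            cases += 1
        p_diff = toggles / cases      # P(dy/dx_i)
        s_y += p_diff * s_in[i]       # accumulate Eq. (6.5)
    return s_y

# 2-input AND: P(dy/dx_i) = 0.5 for each input, so S(y) = 0.25 + 0.25.
and2 = {(a, b): a & b for a in (0, 1) for b in (0, 1)}
print(switching_activity(and2, 2, [0.5, 0.5]))  # 0.5
```

The independence assumption noted above is exactly the one this code makes: each cofactor probability is taken over uniform values of the remaining inputs.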
The third type of switching activity estimation is based on simulation.
A sequence of random input vectors is applied to the
primary inputs, and cycle-accurate gate-level simulation is carried
out for the whole circuit. Combined with the back-annotated delay
information available after placement and routing, this estimation model
is the most accurate for switching activity calculation because it also
captures activity due to glitches. The works in [130, 133] used this model.
The downside of this approach is its long runtime.
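A minimal sketch of the simulation-based approach follows: random vectors are applied and output toggles are counted per cycle. This sketch is zero-delay, so, unlike the real-delay simulation described above, it does not capture glitches; the netlist representation is illustrative:

```python
import random

def simulated_activity(netlist, primary_inputs, cycles=20000, seed=1):
    """Estimate gate-output switching activity (toggles per cycle) by
    zero-delay simulation of random input vectors.
    netlist: topologically ordered list of (name, func, fanin_names)."""
    rng = random.Random(seed)
    toggles = {name: 0 for name, _, _ in netlist}
    prev = {}
    for cycle in range(cycles):
        values = {pi: rng.randint(0, 1) for pi in primary_inputs}
        for name, func, fanins in netlist:
            v = func(*(values[f] for f in fanins))
            values[name] = v
            if cycle and v != prev[name]:   # count a toggle vs. last cycle
                toggles[name] += 1
            prev[name] = v
    return {name: t / (cycles - 1) for name, t in toggles.items()}

# y = a AND b under random vectors: P(y=1) = 0.25, so the expected
# toggle rate is 2 * 0.25 * 0.75 = 0.375.
net = [("y", lambda a, b: a & b, ("a", "b"))]
print(simulated_activity(net, ["a", "b"]))
```

A real-delay version would instead schedule events per gate delay within each cycle, which is what lets [130, 133] count glitch transitions.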
After switching power is estimated, short-circuit power can be estimated
in proportion to the switching power. Some work used a fixed
ratio; for example, [161] assumed that short-circuit power is always
10% of the total dynamic power. Other work developed detailed models
to evaluate the short-circuit power; for example, [133] used a linear
curve-fitting method to derive the ratio between the short-circuit power
and the switching power. In that model, the ratio is a linear function of
the input transition time. They reported that the short-circuit power
is a significant power component due to the large signal transition times
in FPGA designs; it can reach 70% of the global interconnect dynamic
power for certain designs [133].
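The curve-fitting step can be sketched with an ordinary least-squares line fit, assuming (transition time, measured ratio) pairs are available from SPICE characterization; the data points below are made up:

```python
def fit_line(ts, ratios):
    """Ordinary least-squares fit of ratio = a * t_transition + b."""
    n = len(ts)
    mt, mr = sum(ts) / n, sum(ratios) / n
    a = (sum((t - mt) * (r - mr) for t, r in zip(ts, ratios))
         / sum((t - mt) ** 2 for t in ts))
    return a, mr - a * mt

# Hypothetical (transition time, short-circuit/switching ratio) pairs,
# standing in for values measured from SPICE runs.
t_tran = [0.1, 0.2, 0.4, 0.8]
ratio = [0.04, 0.09, 0.18, 0.35]
a, b = fit_line(t_tran, ratio)
p_switch = 1.0                       # switching power (normalized)
p_short = (a * 0.3 + b) * p_switch   # short-circuit power at t = 0.3
```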
6.2.2 Static power estimation

We will introduce two approaches reported in the literature for FPGA
static power estimation: the analytical method and the macro-modeling
method. The analytical method estimates the values of the various
parameters involved in the calculation of leakage. For example, the work
in [161] used a detailed formula (with parameters similar to those presented
in Formula 6.2) to estimate the sub-threshold leakage. Given a specific
process technology, they estimated the values of various parameters,
such as the device width, channel length, and temperature. They also
made some assumptions in the calculation; for instance, they assumed
that the V_gs value was half of the threshold voltage V_th. They reported
that the average error between the estimated values and the simulated
results was 13.4%.
Macro-modeling mainly relies on SPICE simulation to achieve
estimation results. For example, the work in [133] used SPICE simulation
with randomly generated input vectors to obtain the average
leakage power in the LUT. Since the number of possible input vectors
increases exponentially with the number of LUT inputs, it is
infeasible to try all the input vectors for large-input LUTs. Therefore,
different input vectors were mapped into a few typical vectors with
representative Hamming distances, and SPICE simulation was performed
only for these typical vectors to build macromodels in [133]. With this
approach, [133] performed simulation for LUT sizes ranging from three
to seven and buffers of various sizes in global/local interconnects, and
then built the static power macromodels.
6.2.3 Power estimation works

We will briefly introduce several gate-level and high-level power
estimation publications in this subsection. In [119], the authors used a Xilinx
XC4003A FPGA test board to carry out power dissipation measurement,
characterize the capacitance of various FPGA components, and
report the power breakdown of these components. In [170], the authors
analyzed the dynamic power consumption and distribution for the Xilinx
Virtex-II FPGA family. The work in [195] presented power consumption
estimation for the Xilinx Virtex architecture using their emulation
environment. Based on board measurement, the authors calculated
the technology-dependent power factor K_p (Formula 6.3) and derived
their own power estimation formulas. They reported a 5% estimation
error for certain designs; however, they did not produce good estimation
results for designs dominated by large combinational logic
blocks such as multipliers. The work in [161] presented a flexible FPGA
power model associated with architectural parameters. This model
estimates dynamic and leakage power for a wide variety of FPGA
architectures. The authors in [14] developed an empirical estimation model and
showed that estimation accuracy was improved by considering aspects
of the FPGA interconnect architecture in addition to generic parameters,
such as net fanout and bounding-box perimeter length. The work
in [188] made a detailed analysis of leakage power in Xilinx CLBs. It
concluded that a significant reduction of FPGA leakage was needed to
enable the use of FPGAs in mobile applications. The authors in [130, 133]
developed a mixed-level FPGA power model that combines switch-level
models for interconnects and macromodels for LUTs and flip-flops. It
carried out gate-level simulation under real delay models and was able
to capture glitch power. The work in [133] reported high fidelity compared
to SPICE simulation, with an absolute estimation error of 8% on
average.
There is limited high-level power estimation work for FPGAs in
academia. The authors in [169] presented a high-level power modeling
technique to estimate the power consumption of FPGAs. They captured the
relationship between FPGA power dissipation and I/O signal statistics,
and then used an adaptive regression method to model the FPGA
power consumption. Experimental results indicated an average
relative error of 3.1% compared to a low-level FPGA power simulation
method for FPGA components such as ALUs, adders, and DSP cores.
There was no report on power estimation for larger designs. The authors
in [41] developed a high-level power estimator. It used a fast switching
activity calculation algorithm, a Rent's rule-based wire-length estimation
[78], and a resource characterization flow using the DesignWare library
from Synopsys [179]. It takes into account various FPGA components,
such as LUTs, local and global buffers, and MUXes. Later on, the same
authors extended this power model to work on a real FPGA architecture,
Altera's Stratix architecture [12], during high-level synthesis.
Some device-specific functional blocks are considered in the model, such
as memory blocks, DSP blocks, and I/O blocks. Experimental results
showed an 18.5% average estimation error [77] compared to Altera's
gate-level PowerPlay power analyzer on real designs.
6.3 Power Breakdown
In the FPGA architecture evaluation work for power [133], the authors
concluded that a logic block size of 6 with a LUT input size of 7 represents
the min-delay architecture, and a logic block size of 8 with a LUT input
size of 4 represents the min-energy architecture under 100 nm technology.
The energy consumption difference between these two architectures is
48%, and the critical path delay difference is 12%. Figure 6.2 shows
the breakdown of various power sources for the min-energy architecture [133].

Fig. 6.2 Power breakdown of the min-energy architecture in [133].

Fig. 6.3 Power breakdown of average designs using Altera's Stratix II FPGA in [11].

Figure 6.3 presents the power breakdown for average designs using
Altera's Stratix II FPGA architecture (90 nm technology) [11]. Total
device power is the sum of three components: core dynamic power, core
static power, and I/O power. Core dynamic power is the power dissipated
by the operation of the FPGA core fabric. Core static power is
the leakage power dissipated, which can be determined by stopping all
clocks (an operating frequency of 0 MHz). I/O power is the power
dissipated in the FPGA I/O cells when communicating with other chips.
This data was obtained by estimating the power consumption of 99
complete designs with the Quartus II software version 5.0 SP1 PowerPlay
power analyzer [11]. We can observe that leakage power in this figure
represents a smaller portion of the total power than what has been
shown in Fig. 6.2. We believe that the main reason for this is that the
architecture in [133] does not consider circuit-level optimization for leakage
power reduction. In contrast, Stratix II FPGAs employ circuit optimization
techniques for leakage power reduction, such as higher V_th
and longer transistor length on non-speed-critical paths [9]. We
shall present leakage power minimization techniques in Section 6.5.1.
Figure 6.4 shows the power breakdown for Xilinx's Spartan-3
devices [187]. Dynamic power is estimated at a 150 MHz clock frequency,
12.5% average switching activity, and typical configuration and
resource utilization as determined by user benchmarks. Static power
includes both subthreshold leakage and gate leakage [187]. The static
power is about 10% of the total power consumption. We can observe
that routing switches make up the largest part of the total dynamic
power, and that both routing switches and configuration SRAM represent
significant parts of the total static power.
Fig. 6.4 Power breakdown of average designs using Xilinx's Spartan-3 in [187].
6.4 Synthesis for Power Optimization
In this section, we present low-power synthesis techniques for existing
FPGA architectures, where performance is the main objective of
the architecture design. However, power efficiency can still be a synthesis
objective under performance constraints. We will touch on technology
mapping, circuit clustering, RTL synthesis, behavioral synthesis, and
some other synthesis techniques for power minimization.
6.4.1 Technology mapping for low power
FPGA technology mapping for low power is an NP-hard problem [85],
and several heuristics have been proposed. These algorithms mainly work
on reducing the overall switching activity of the design: the smaller
the total switching activity, the lower the power consumption (Formula 6.1).
In [85], besides the NP-hardness proof, the authors also presented a heuristic
for low-power mapping. They pointed out that the power consumed
by a LUT depends on the switching activity and the fanout number of
the LUT, and gave a formula to estimate the total power consumption
of a technology mapping solution. The main idea of their algorithm was
to hide nodes with high switching activities inside LUTs (so that
LUTs have smaller switching activity at their outputs). The goals
of most other algorithms were similar in terms of switching activity
reduction. In [13], a cut-enumeration-based algorithm was designed to
keep nets with high switching activities out of the FPGA routing tracks.
It also considered switching activity when logic replication was needed
for optimizing the mapping depth of the design. The authors reported that
the result of FlowMap-r [51] + MP-Pack [46] required 14.2% more power
than their algorithm when both were depth-optimal. When the mapping
depth was relaxed by one level, they reported a further power reduction
of about 8% over the depth-optimal case for 4-LUTs and 10% for
5-LUTs. The authors in [191] and [120] used cut enumeration as well.
In [193], both runtime and memory space were considered and only a
fixed number of cuts were enumerated; it reported up to a 14.18% power
savings compared to [85]. The mapping algorithm in [120] designed a
cost function for each cut that includes switching activity, fanout number,
and node duplication considerations. The authors reported an 8.4%
energy reduction over CutMap [58]. The authors in [135] used a network
flow formulation and carried out mapping while looking ahead at
the impact of the mapping selection on the power consumption of the
remaining network. An extension that computes depth-optimal mapping
was also presented. They reported a 14% power savings without
any depth penalty compared to CutMap.
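The general flavor of these cut-based cost functions can be sketched as follows. The weights and the exact cost form here are hypothetical, not the published formulas of [13], [120], or [135]:

```python
def cut_cost(cut_inputs, activity, fanout, duplicates_node,
             w_act=1.0, w_fan=0.1, w_dup=0.5):
    """Hypothetical cost of covering a node with one K-feasible cut: the
    cut inputs become routed nets, so high-activity and high-fanout
    inputs are penalized, as is any node duplication the cut requires."""
    cost = sum(w_act * activity[x] + w_fan * fanout[x] for x in cut_inputs)
    return cost + (w_dup if duplicates_node else 0.0)

def best_cut(cuts, activity, fanout):
    """Among the enumerated cuts of a node, pick the cheapest one."""
    return min(cuts, key=lambda c: cut_cost(c["inputs"], activity, fanout,
                                            c["dup"]))
```

A depth-aware mapper would take this minimum only over cuts that preserve the optimal mapping depth, which is how the depth-optimal variants above retain their guarantee.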
6.4.2 Circuit clustering for low power
Clustering has traditionally been used in the VLSI industry to extract
underlying circuit structures and construct a natural hierarchy in the
circuits. In [163] the authors derived the rst delay optimal clustering
algorithm under the general delay model. In [190] the authors presented
a low-power clustering algorithm with the optimal delay. Their algo-
rithm is power optimal for trees. They enumerated all clustering solu-
tions for a graph and selected a low-power clustering solution from all
delay optimal clustering solutions. It has been shown that logic cluster-
based FPGA logic blocks can improve FPGA performance, area, and
power [2, 22, 173]. Section 4.1 presented some clustering algorithms
optimizing FPGA area and performance. There are a few prior research
eorts on clustering for low-power FPGA designs as well. An FPGA
circuit-clustering work was reported in [120] as one of the optimization
steps in a power-aware CAD flow; it tried to attract nets with high
switching activity inside the logic blocks and was 12.6% better than
T-VPACK [22] in terms of energy consumption. Researchers presented a
routability-driven clustering technique for area and power reduction in
FPGAs [173]. It used a Rent's rule-based estimation method to reduce
the potential routing complexity due to clustering. In [77], a delay-optimal
clustering algorithm was presented to improve FPGA performance
and reduce FPGA power. The algorithm is delay- and power-optimal
for trees. It also presented some heuristics to control duplications.
The authors reported a 9% delay reduction with
a 3% power overhead compared to [120].
6.4.3 RTL synthesis for low power
In [199], the authors worked on RTL synthesis for FPGA power
minimization. They first characterized the power and performance data for
the functional units on their targeted FPGAs using board measurement.
For example, they characterized the power and delay of different
implementations of adders, such as the ripple-carry adder, carry-lookahead
adder, and conditional-sum adder. Their design flow then took an RTL
specification and traded off power against circuit speed by iteratively
selecting different implementations of components. Figure 6.5 shows
the optimization flow. They showed that their methodology was useful
for designing a low-power digital filter.
Fig. 6.5 Power optimization flow presented in [199].

6.4.4 Behavioral synthesis for low power

As we have mentioned before, multiplexers are particularly expensive
in FPGA architectures. In general, when a smaller number
of functional units or registers is allocated but a larger number
of wide multiplexers and a larger amount of interconnect results, the
solution may be completely unfavorable in both performance
and area/power cost. Tackling this increasingly alarming problem
requires an efficient search engine that explores a sufficiently
large solution space considering multiple constraining factors, such as
resource allocation and binding, MUX generation, and interconnection
generation, for optimizing performance or cost, or studying the trade-off
between them.
With this motivation, a behavioral synthesis engine for FPGA
power optimization was presented in [41]. Here, the power optimization
goal is to search a combined solution space for the subtasks in
behavioral synthesis so that the power of FPGA designs can be optimized
while the performance/latency target is still met. To achieve this goal,
the authors adopted a simulated annealing-based algorithm. For each
move during the annealing procedure, the optimization engine generates
the full data path to capture the overall cost, considering all the
contributing factors in the design. The cost function is the estimated
power dissipation reported by the behavioral-level power estimator. The
algorithm carries out resource selection, scheduling, functional unit
binding, register binding, and steering logic and interconnection estimation
simultaneously. Figure 6.6 shows a block diagram of the power
optimization engine.

There are five different types of moves performed during simulated
annealing, all of them functional unit (FU) binding operations:
Fig. 6.6 The power optimization engine in [41].

Reselect: Select another FU of the same functionality but with a
different implementation; for example, select a carry-lookahead adder to
replace a Brent-Kung adder. The operations bound to the adder stay
unchanged.

Swap: Exchange the bindings of two FUs of the same functionality but
with different implementations. This is equivalent to two reselects between
the two FUs, one in each direction.
Merge: Merge two bindings into one, i.e., the operations bound to the
two FUs are combined into one FU, so the total number of FUs decreases
by one. The two FUs have to be of the same type.

Split: Split one binding into two; the reverse of merge. The total number
of FUs increases by one, and the operations of the original binding are
distributed between the new bindings randomly.

Mix: Select two bindings, merge them, sort the merged operations
according to their slacks, and then split the operations. For example,
if there are N operations after sorting, operations 1 to N/2 form
one binding and the remaining operations form the other.
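As a sketch, the five moves can be expressed as operations on a map from FUs to their bound operations. The data structures and the slack source below are illustrative, not the actual implementation of [41]:

```python
import random

def reselect(impl, fu, new_impl):
    """Reselect: keep the bound operations, change the FU implementation
    (e.g., a carry-lookahead adder in place of a Brent-Kung adder)."""
    impl[fu] = new_impl

def swap(impl, fu_a, fu_b):
    """Swap: exchange the implementations of two FUs of the same type."""
    impl[fu_a], impl[fu_b] = impl[fu_b], impl[fu_a]

def merge(bindings, fu_a, fu_b):
    """Merge: move fu_b's operations onto fu_a; FU count drops by one."""
    bindings[fu_a] = bindings[fu_a] + bindings.pop(fu_b)

def split(bindings, fu, new_fu, rng):
    """Split: distribute fu's operations randomly over fu and new_fu."""
    ops = bindings.pop(fu)
    rng.shuffle(ops)
    bindings[fu], bindings[new_fu] = ops[:len(ops) // 2], ops[len(ops) // 2:]

def mix(bindings, fu_a, fu_b, slack):
    """Mix: merge, sort the operations by slack, then split down the
    middle, so operations 1..N/2 form one binding, the rest the other."""
    ops = sorted(bindings[fu_a] + bindings[fu_b], key=lambda op: slack[op])
    bindings[fu_a], bindings[fu_b] = ops[:len(ops) // 2], ops[len(ops) // 2:]
```

An annealer would apply one randomly chosen move per iteration, regenerate the data path, re-estimate power, and accept or reject via the annealing criterion.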
An interconnection optimization step was also designed to reduce
the interconnections between functional units and registers through
multiplexer optimization. Experimental results showed a power
reduction of 35.8% compared to the results of the Synopsys
Behavioral Compiler [179].
6.4.5 Other techniques

There are also other works that focus on FPGA power minimization;
we introduce two of them here. In [16], the authors considered
active leakage power dissipation in FPGAs and presented a no-cost
approach for active leakage reduction. The leakage power consumed by a
digital CMOS circuit depends strongly on the state of its inputs. This
leakage reduction technique leveraged a fundamental property of LUTs:
a logic signal in an FPGA design can be interchanged
with its complemented form without any area or delay penalty. The
authors applied this property to select polarities for logic signals so
that FPGA hardware structures spend the majority of their time in
low-leakage states. They optimized leakage power in circuits mapped onto a
90 nm commercial FPGA; results showed that the proposed approach
reduced active leakage by 25% on average. In [105], the authors presented
a method to re-synthesize LUT-based FPGAs for low-power
design after technology mapping, placement, and routing were performed.
They used the sets of pairs of functions to be distinguished
(SPFD) method to express the functional permissibility of each signal.
Using different propagations of SPFDs to fan-in signals, they changed
the functionality of a CLB that drives a large load into one with
low transition density. Experimental results showed that their method
could achieve, on average, a 12% power reduction compared to the
original circuits, without affecting placement and routing.
6.5 Synthesis for Power-Efficient Programmable
Architectures

In this section, we present synthesis techniques for power-efficient
FPGA architectures, where low power is the main objective of the
architecture design and performance is the secondary objective. For
example, this type of FPGA chip may support multiple supply voltages,
multiple threshold voltages, and power-gating features to reduce both
dynamic and leakage power.

As silicon technologies advance, smaller geometries become possible
with lower Vdd. At the same time, the threshold voltage V_th must
be reduced in order to maintain or improve circuit performance, and
this decrease of V_th drives significant increases in leakage power. As
silicon technologies move to 90 nm and below, leakage currents become
as important as active power in many applications. To reduce both
dynamic power and leakage power, deploying multiple Vdd and V_th
values is a popular design technique; it has been used extensively
in ASIC designs [17, 152, 175, 181, 189, 194]. Low Vdd reduces
dynamic power, and high V_th reduces leakage power, but each incurs a
longer signal delay. If low Vdd and/or high V_th are applied carefully to
non-critical paths only, the multi-Vdd/V_th technique has the advantage
of reducing power dissipation without sacrificing system performance.
Specific to FPGAs, the multi-Vdd/V_th fabric and layout pattern must be
pre-defined, because FPGAs do not have the freedom of using mask
patterns to arrange different Vdd/V_th components in a flexible way (as
in ASICs). This brings unique challenges for FPGA designers. We will
introduce several recent research efforts in this area.
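The shared principle here (spend timing slack, not critical-path delay) can be sketched as a greedy Vdd assignment. Cell names, slack values, and the per-cell delay penalty model are all illustrative:

```python
def assign_vdd(cells, slack, low_vdd_penalty):
    """Assign low Vdd only where the timing slack absorbs the extra delay.
    A real flow would re-run timing analysis after each assignment, since
    spending slack on one cell reduces slack along its whole path."""
    assignment = {}
    for cell in sorted(cells, key=lambda c: slack[c], reverse=True):
        if slack[cell] >= low_vdd_penalty[cell]:
            assignment[cell] = "low-Vdd"
        else:
            assignment[cell] = "high-Vdd"
    return assignment

# Cells on the critical path (zero slack) stay at high Vdd.
print(assign_vdd(["c1", "c2"],
                 {"c1": 2.0, "c2": 0.0},
                 {"c1": 0.5, "c2": 0.5}))
```

The FPGA-specific twist discussed below is that the candidate low-Vdd locations are fixed by the pre-defined fabric, so the placer, not the assigner, ultimately decides which cells can use them.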
6.5.1 Leakage power reduction

Circuit design techniques can be applied to reduce leakage power in
FPGAs. As mentioned before, SRAM-based FPGAs use a large number
of SRAM cells to provide programmability for both logic cells and
interconnects. The work in [134] and [162] introduced high-V_th SRAM
cell designs. Increasing the V_th of SRAM cells in FPGAs incurs no delay
penalty during normal operation of the FPGA; however, it increases
the SRAM write access time and slows down the FPGA configuration
speed. It was shown that the V_th of SRAM cells could be increased to
achieve a 15X SRAM-leakage reduction with only a 13% configuration
time increase [134]. Figure 6.7 shows the schematic of a 4-LUT with
dual-V_th (denoted as V_t in the figure) regions, where Region I is
the high-V_th region and Region II is the low-V_th region.
Fig. 6.7 Two different V_th values applied to a 4-LUT cell in [134].
The authors in [162] proposed several circuit enhancement techniques
for leakage power reduction, including redundant memory cells,
dual-V_th devices, and body biasing. Specifically, they targeted
low-leakage multiplexer designs, because multiplexers are widely used in
SRAM-based FPGAs. Figure 6.8(a) shows a two-stage implementation
of a pass-transistor-based multiplexer. It is composed of several smaller
multiplexers, and the same SRAM cell configures one pass transistor
from each multiplexer in stage 1. As a result, whenever there is an
enabled input-to-output path, the intermediate nodes, such as nodes 1,
2, 3, and 4, are driven to Vdd. Therefore, the drain-to-source voltage
V_DS of all disabled pass transistors is Vdd, and these pass transistors
still contribute leakage power. Figure 6.8(b) shows the schematic
after adding some redundant memory cells to the multiplexer design.
These SRAM cells can turn the inactive input-to-output paths off (e.g.,
the upper-left portion in Fig. 6.8(b)) and reduce leakage power. These
SRAM cells can be implemented using high-V_th devices. In the authors'
analysis, the V_th of transistors in SRAM cells is 25% higher than that of
typical devices. They reported a 2X leakage power reduction with a
15-30% total chip area increase through use of this technique. They also
studied the performance/leakage-power trade-off scenarios when V_th is
increased for the devices in routing switches.

Fig. 6.8 (a) A two-stage traditional implementation of a pass-transistor-based multiplexer;
(b) a two-stage multiplexer with redundant memory cells for turning off the inactive circuit
portion in [162].
The work in [90] divided the FPGA fabric into small regions and
switched the power supply to each region on/off using a sleep transistor
in order to reduce leakage energy. The regions not used by the placed
design were power gated. The authors presented a placement strategy
to increase the number of regions that could be power gated.
The authors in [187] described the design and implementation of
Pika, a low-power FPGA core targeting battery-powered applications
such as those in the consumer and automotive markets. They used several
key leakage power reduction techniques. First, they applied low-leakage
configuration SRAMs by adopting mid-oxide, high-V_th transistors. This
corresponded to a 43% reduction in the total standby power of the
FPGA core. Second, they applied power gating at the level of individual
tiles (each tile consists of a CLB and a programmable interconnect
switch matrix). They used mid-oxide power gates and sized them
at the point of 10% performance degradation. They did not power
gate the configuration SRAM cells, to enable a state-retaining standby
mode when all logic and routing are power gated. Figure 6.9 shows
their power gating architecture. Third, they made circuit modifications
to prevent high-current paths; a high-current path may be a short-circuit
path from supply to ground, or a high-leakage path that does
not go through a power gate (refer to [187] for details). They reported
that standby power was reduced by 99% when the chip was idle. The
power optimizations incurred a 27% performance penalty and a 40% area
increase. They also reported that the core wakes up from standby mode
in approximately 100 ns.

Fig. 6.9 Power gating architecture in [187].
6.5.2 Dynamic power reduction

A dual-Vdd FPGA with pre-defined voltage patterns was studied
in [134]. Figure 6.10 shows the voltage patterns explored in this work.
Fig. 6.10 Fixed dual-Vdd layout patterns in [134].
The architecture had the advantages of small area overhead and simple
voltage regulation. The authors designed FPGA architectures with
support for dual supply voltages and dual threshold voltages and
developed placement tools on top of the architecture. The placement
was based on a simulated-annealing method modified from the one
used in VPR [22]. The cell swapping during the simulated-annealing
process considered the different voltage assignments at the physical
locations of the CLBs; it tried to reduce power while avoiding deterioration
of the critical path delay. However, the authors found that this did not
provide good opportunities for power reduction while still maintaining
competitive circuit performance, because the fixed voltage
pattern imposed strong constraints on the placement engine.
Later on, researchers proposed a new architecture that provides voltage
configurability for each CLB [131]. They inserted two PMOS transistors
between the high-Vdd and low-Vdd power rails and the CLB
(Fig. 6.11), so each CLB can be configured to be driven by
either the low Vdd or the high Vdd. The area overhead of the sleep
transistors was 24% of the original CLB area, with a 5% delay overhead.
The flexibility offered by this architecture helped the placement
Fig. 6.11 Configurable Vdd for a logic block in [131].
engine to place cells on critical paths into CLBs configured to high
Vdd, and cells on non-critical paths into CLBs configured to low Vdd.
A total power reduction of 14% was reported [131]. In [132], the same
authors designed a dual-Vdd architecture to reduce the interconnect
power of FPGAs. They designed three Vdd states for interconnect
switches: high Vdd, low Vdd, and power-gated. This is similar
to the technique used in [131], where each CLB can be Vdd-configured
or power-gated. The authors developed a design flow to apply high
Vdd to critical paths and low Vdd to non-critical paths, and to power
gate unused interconnect switches. They reported a significant amount
of power savings. In [148], a partitioning algorithm for FPGAs with
pre-defined voltage patterns (voltage island configurations) was presented.
This power-driven partitioner created partitions of critical and
non-critical CLBs and assigned these CLBs to different voltage islands
according to their timing criticalities, followed by placement and routing.
It showed that a dynamic power gain as high as 47% was possible
with a 17% area/delay product penalty, and a 30% power gain with
an area/delay product penalty as low as 6%, for different voltage island
configurations. In [103], the authors developed a technique to estimate
power reduction using dual Vdd for mixed-length interconnects, and
applied linear programming (LP) to solve a slack budgeting problem to
minimize power for mixed-length interconnects. Experiments showed a 53% power
reduction on average compared to single-Vdd interconnects. Furthermore,
this paper presented a simultaneous retiming and slack budgeting
algorithm to reduce power in dual-Vdd FPGAs considering placement
and flip-flop binding constraints. The algorithm was based on mixed
integer linear programming (MILP) and achieved up to a 20% power
reduction compared to retiming followed by slack budgeting.
Low-power technology mapping and circuit clustering for FPGAs
with dual Vdds were presented in [42] and [39], respectively. In [42]
the authors used a cut-enumeration-based technique. They developed
a detailed delay and power model for LUTs and level converters using
different voltages. The algorithm built all the cases of LUT connections
under dual-Vdd scenarios and generated one set of power and delay
results for each case to enlarge the low-power solution search space.
Their algorithm improved power savings by 11.6% on average over the
single-Vdd case when both algorithms produced optimal mapping depth.
In [39] the authors used a solution curve propagation technique to
examine the quality of different clustering solutions. They built all the
non-inferior delay-power-Vdd solution points for a node with
considerations of dual Vdds and level converter delay. The algorithm is
delay and power optimal for trees and delay optimal for DAGs under the
general delay model. A limitation of the work was that each generated
cluster (to be implemented by a logic block) was a single-output
cluster, which could cause a large area overhead. Later on, the authors
extended the work to generate clustering solutions with multiple cluster
outputs [77] while maintaining the optimal results. They showed a 13.5%
power reduction compared to the single-Vdd case on average.
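The non-inferior (Pareto-optimal) solution points maintained in [39] can be illustrated by the pruning step alone: among candidate (delay, power) pairs for a node, any pair that is no better than another in both dimensions is discarded. The sketch below shows only this step; the actual algorithm propagates such curves through the netlist and additionally tracks the Vdd level and level-converter delay of each point.

```python
def prune_inferior(points):
    """Keep only non-inferior (delay, power) pairs: a point is dropped
    if some other point is no worse in both delay and power.
    Sort by delay, then sweep keeping strictly improving power."""
    frontier, best_power = [], float("inf")
    for delay, power in sorted(set(points)):
        if power < best_power:
            frontier.append((delay, power))
            best_power = power
    return frontier
```

For example, `prune_inferior([(5, 10), (6, 8), (6, 12), (7, 8), (4, 15)])` keeps `[(4, 15), (5, 10), (6, 8)]`.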
7
Conclusions and Future Trends
Tremendous advances have been made in FPGA design automation in
the past decade. In this survey, we have tried to cover the important
algorithms and methodologies developed for the various design tasks in the
modern FPGA design flow, including routing and placement, clustering,
technology mapping, physical synthesis, RT-level and behavior-level
synthesis, and power optimization. It is our hope that this paper can be
a useful reference for both beginning and established researchers and
tool developers in this field.

As the feature size continues to shrink and device capacity continues
to increase in modern FPGAs, we are facing new challenges and
opportunities in FPGA design automation. We would like to conclude
the paper by listing some challenges and open problems in FPGA CAD
that we are facing now or will face in the near future.
Physical Design
1. The time spent on placement and routing is still the dominating
part of the entire FPGA compilation process. There is a need for more
scalable and efficient placement and routing algorithms. We think that
the interesting yet very challenging
goals are full-chip physical design in one hour and full-chip
incremental physical design in five minutes. If these goals can be
achieved with little or no compromise on design quality, it will be a
significant improvement in overall design productivity.
2. In connection with the challenges stated above, we believe an
important means to achieve highly scalable placement and routing
algorithms is to make the best use of multi-CPU computer systems,
which will be widely available very soon as standard engineering
workstations. Although there were some early studies on parallel CAD
algorithms (e.g., [19, 96]), we believe that more work is needed to come
up with commercial-strength parallel or distributed placement and
routing algorithms with good scalability and quality of results.
3. Process variation is an increasing concern in nanometer
designs (esp. 65 nm or below) (e.g., [32, 200, 218]), and its impact on
FPGA architectures and designs is not fully understood. We hope that
the regularity of the nanometer FPGA architecture can hide a large
portion of the variability effect from the designer. But in case it is not
possible or economically feasible to shield all variability effects by
architecture optimization alone, it will be important to investigate
novel physical design algorithms that consider process variability and
are able to perform statistical optimization.
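As a minimal illustration of statistical, rather than corner-based, timing evaluation, the sketch below estimates timing yield by Monte Carlo sampling of path delays under independent Gaussian variation. The variation model is deliberately crude and purely illustrative: real statistical timing analysis and optimization must also capture spatial correlation and non-Gaussian variation sources.

```python
import random

def timing_yield(paths, T, sigma=0.05, trials=2000, seed=1):
    """Monte Carlo timing-yield estimate: each path is a list of nominal
    stage delays, each stage perturbed by an independent Gaussian with
    relative std `sigma` (independence is a simplifying assumption).
    Returns the fraction of samples whose longest path meets T."""
    rng = random.Random(seed)
    ok = 0
    for _ in range(trials):
        worst = max(sum(d * (1.0 + rng.gauss(0.0, sigma)) for d in path)
                    for path in paths)
        ok += worst <= T
    return ok / trials
```

Under this model a design whose nominal critical path sits exactly at T shows a yield near 50%, a situation corner-based analysis cannot express.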
4. We expect that future FPGAs designed in the ultimate CMOS
technologies (32 nm and below) may have defects. On one hand, it is
important to develop defect-tolerant FPGA architectures. On the other
hand, we believe it is important to develop defect-aware physical design
algorithms which can work hand-in-hand with the defect-tolerant
architectures to continue to deliver high yield even in the era of the
ultimate CMOS technologies. The design of defect-tolerant memory
systems is well known (e.g., [136]) and widely used, but it remains
challenging to come up with equally
efficient defect-tolerant architectures and CAD algorithms for
logic design.
Logic Synthesis and Physical Synthesis
1. Although FPGA technology mapping has been a subject of
extensive research, recent studies indicate that there is considerable
room for improvement [66]. This is especially true in the combined
solution space of logic optimization and technology mapping. There
have been some initial efforts in integrating logic optimization with
mapping. However, more research is needed to find effective and
efficient ways to combine the two to arrive at better mapping solutions.
As semiconductor technologies advance, new FPGA architecture
features are being introduced to improve area utilization, performance,
and/or power. For example, architectures have been introduced or
proposed that use LUTs with a large number of inputs or multiple
supply voltages. New mapping techniques are needed to utilize these
new architecture features.
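Most of the mapping algorithms discussed in this survey share a cut-enumeration front end: the K-feasible cuts of a node are built bottom-up by merging the cuts of its fanins, and mapping then selects one cut per chosen LUT root. The sketch below shows only the enumeration step, with no cut costing, pruning, or selection, so it can blow up on large netlists; the netlist encoding is an illustrative assumption.

```python
# Bottom-up K-feasible cut enumeration for a combinational netlist.
# netlist maps each gate to its fanin names; a name that is not a key
# is treated as a primary input.
from itertools import product

def enumerate_cuts(netlist, K):
    cuts = {}
    def visit(n):
        if n in cuts:
            return
        if n not in netlist:                    # primary input
            cuts[n] = [frozenset([n])]
            return
        for fanin in netlist[n]:
            visit(fanin)
        found = {frozenset([n])}                # the trivial cut
        # Merge one cut per fanin; keep merges with at most K leaves.
        for combo in product(*(cuts[f] for f in netlist[n])):
            cut = frozenset().union(*combo)
            if len(cut) <= K:
                found.add(cut)
        cuts[n] = sorted(found, key=len)
    for n in netlist:
        visit(n)
    return cuts
```

For `{"x": ("a", "b"), "y": ("x", "c")}` and K = 3, node `y` gets the cuts `{y}`, `{x, c}`, and `{a, b, c}`.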
2. Physical synthesis for FPGAs needs to consider the impact
on routing and routability. Most existing techniques stay at the
placement level and pay little attention to routing. As a result, in
high-utilization situations (common in practice), the predicted
performance gain may not be realizable after routing. Next-generation
interconnect-centric physical synthesis techniques are needed to
further improve the predictability of results. Another direction of
research is to consider high-level optimizations in physical synthesis.
Local and simple transformations such as logic replication and local
remapping can explore only a limited solution space during physical
synthesis. It is desirable to use physical information to influence
high-level optimizations/transformations, for example, determining
datapath architectures or selecting on-chip resources to implement
memory, to name a few. Such physical synthesis techniques can
potentially bring more improvement in quality of results.
Behavioral Synthesis
Nowadays, high-end FPGAs can implement large and complex SoC
designs which were originally possible only through ASIC
implementation. These design applications may have dramatically
different user requirements in terms of performance, power, memory
bandwidth, computational throughput, etc. To meet these demands, we
foresee that next-generation FPGA chips will encompass a large number
(on the order of hundreds) of heterogeneous cores (soft and hard) with
supporting memory hierarchy and interconnect hierarchy. Sophisticated
architecture features will surface as well, such as architecture support
for multi-cycle interconnect communication, core-level and/or logic
block-level multi-Vdd, core-level and/or logic block-level power/clock
gating, full-chip adaptive threshold voltage, full-chip error checking
capability, etc. Mapping different applications onto these architectures
to fulfill various design requirements will become a dauntingly complex
task. Behavioral synthesis will be critically needed to tackle the design
complexity problem and improve design productivity. It will be an
important component of the so-called electronic system-level (ESL)
design methodology to speed up high-quality hardware implementation
and enable fast and accurate design space exploration. We would like to
highlight two research directions in this context.
1. To address the speed, power, and interconnect challenges,
behavioral synthesis has to take meaningful physical information from
the potential hardware implementation and layout (physical planning).
One direction is to incorporate physical planning as early as possible,
thus performing a combined synthesis and layout optimization. The
layout needs to deal with the heterogeneity of the FPGA chip, the core
topology, and the specific memory and interconnect structures. The
abundant silicon capacity can be utilized to improve the interconnect
performance and integrity either by interconnect pipelining or by
resource redundancy, which can also be combined with system
reliability design strategies. Layout information will serve as a
guideline for the synthesis engine to determine the scheduling and
binding solutions to
exploit specific architecture features (such as power gating
and multi-Vdd) for better quality of design. The synthesis engine
will generate RTL implementations together with physical and timing
constraints, which can serve as guidelines for the downstream physical
design tools.
2. We need to model communication interfaces between different
components within the FPGA chip under different communication
protocols and topologies, such as point-to-point connections, buses, or
networks-on-chip. A deeper understanding of efficient and robust
communication synthesis is much needed.
References
[1] International Technology Roadmap for Semiconductors, Executive Summary,
2003. http://public.itrs.net/Files/2003ITRS/Home2003.htm.
[2] E. Ahmed and J. Rose. The effect of LUT and cluster size on deep-submicron
FPGA performance and density. In ACM International Symposium on FPGA,
February 2000.
[3] C. Albrecht. Provably good global routing by a new approximation algorithm
for multicommodity flow. In Proc. International Symposium on Physical
Design, pages 19–25, March 2000.
[4] M. J. Alexander, J. P. Cohoon, J. L. Ganley, and G. Robins. Placement and
routing for performance-oriented FPGA layout. VLSI Design: An International
Journal of Custom-Chip Design, Simulation, and Testing, 7(1), 1998.
[5] M. J. Alexander and G. Robins. New performance-driven FPGA routing
algorithms. IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, 15(12):1505–1517, December 1996.
[6] Altera. MAX 7000B Data Sheet. http://www.altera.com/literature/ds/
m7000b.pdf.
[7] Altera. PowerPlay Early Power Estimator. http://www.altera.com/support/
devices/estimator/pow-powerplay.html.
[8] Altera. PowerPlay Power Analyzer. http://www.altera.com/support/
devices/estimator/pow-powerplay.html.
[9] Altera. Stratix II 90-nm Silicon Power Optimization. http://www.altera.
com/products/devices/stratix2/features/st2-90nmpower.html.
[10] Altera. Stratix II Device Handbook. http://www.altera.com/literature/hb/
stx2/stratix2_handbook.pdf.
[11] Altera. White paper, Stratix II vs. Virtex-4 Power Comparison & Estimation
Accuracy. http://altera.com/literature/wp/wp_s2v4_pwr_acc.pdf.
[12] Altera, August 2002. Stratix Programmable Logic Device Family Data Sheet.
[13] J. Anderson and F. N. Najm. Power-aware technology mapping for LUT-based
FPGAs. In IEEE International Conference on Field-Programmable
Technology, 2002.
[14] J. Anderson and F. N. Najm. Interconnect capacitance estimation for FPGAs.
In IEEE/ACM Asia and South Pacific Design Automation Conference,
Yokohama, Japan, 2004.
[15] J. H. Anderson and S. D. Brown. Technology mapping for large complex
PLDs. In Design Automation Conf., 1998.
[16] J. H. Anderson, F. N. Najm, and T. Tuan. Active leakage power optimization
for FPGAs. International Symposium on Field Programmable Gate Arrays,
February 2004.
[17] F. Assaderaghi, D. Sinitsky, S. A. Parke, J. Bokor, P. K. Ko, and C. Hu.
Dynamic threshold-voltage MOSFET (DTMOS) for ultra-low voltage VLSI.
IEEE Transactions on Electron Devices, 44:414–422, March 1997.
[18] B. Awerbuch, A. Bar-Noy, N. Linial, and D. Peleg. Improved routing
strategies with succinct tables. J. Algorithms, 11(3):307–341, 1990.
[19] P. Banerjee. Parallel Algorithms for VLSI Computer-Aided Design.
Prentice-Hall, Inc., Englewood Cliffs, NJ, 1994.
[20] G. Beraudo and J. Lillis. Timing optimization of FPGA placements by logic
replication. In ACM/IEEE Design Automation Conference, pages 196–201,
2003.
[21] V. Betz and J. Rose. Cluster-based logic blocks for FPGAs: area-efficiency vs.
input sharing and size. In IEEE Custom Integrated Circuits Conference, pages
551–554, 1997.
[22] V. Betz and J. Rose. VPR: a new packing, placement and routing tool for
FPGA research. In International Workshop on Field-Programmable Logic and
Applications, pages 213–222, 1997.
[23] V. Betz, J. Rose, and A. Marquardt. Architecture and CAD for
Deep-Submicron FPGAs. Kluwer Academic Publishers, 1999.
[24] N. Bhat and D. D. Hill. Routable technology mapping for LUT FPGAs. In
IEEE International Conference on Computer Design, pages 95–98, 1992.
[25] E. Bozorgzadeh, S. Ogrenci, and M. Sarrafzadeh. Routability-driven packing
for cluster-based FPGAs. In Asia South Pacific Design Automation Conf.,
2001.
[26] R. K. Brayton. Understanding SPFDs: A new method for specifying
flexibility. In International Workshop on Logic Synthesis, 1997.
[27] S. Brown, R. Francis, J. Rose, and Z. Vranesic. Field-Programmable Gate
Arrays. Kluwer Academic Publishers, May 1992.
[28] S. Brown and J. Rose. FPGA and CPLD architectures: A tutorial. IEEE
Design and Test of Computers, 12(2):42–57, 1996.
[29] S. Brown, J. Rose, and Z. G. Vranesic. A detailed router for
field-programmable gate arrays. IEEE Trans. on Computer-Aided Design,
11(5):620–628, May 1992.
[30] T. J. Callahan, P. Chong, A. DeHon, and J. Wawrzynek. Fast module mapping
and placement for datapaths in FPGAs. In International Symposium on Field
Programmable Gate Arrays, 1998.
[31] T. Chan, J. Cong, and K. Sze. Multilevel generalized force-directed method
for circuit placement. In Proceedings of the Int'l Symposium on Physical
Design, San Francisco, CA, April 2005.
[32] H. Chang and S. Sapatnekar. Impact of process variations on power: full-chip
analysis of leakage power under process variations, including spatial
correlations. In Proc. of Design Automation Conf., June 2005.
[33] S. C. Chang, K. T. Cheng, N.-S. Woo, and M. Marek-Sadowska. Postlayout
rewiring using alternative wires. IEEE Trans. on Computer Aided Design of
Integrated Circuits and Systems, 16(6):587–596, June 1997.
[34] S. C. Chang, L. V. Ginneken, and M. Marek-Sadowska. Circuit optimization
by rewiring. IEEE Transactions on Computers, 48(9):962–970, September 1999.
[35] Y. W. Chang and Y. T. Chang. An architecture-driven metric for simultaneous
placement and global routing for FPGAs. In Proc. Design Automation Conf.,
pages 567–572, 2000.
[36] Y. W. Chang, K. Zhu, and D. F. Wong. Timing-driven routing for
symmetrical array-based FPGAs. ACM Trans. on Design Automation of
Electronic Systems, 5(3), July 2000.
[37] C. Chen, Y. Tsay, Y. Hwang, T. Wu, and Y. Lin. Combining technology
mapping and placement for delay-optimization in FPGA designs. In Int'l
Conf. Computer Aided Design, 1993.
[38] D. Chen and J. Cong. DAOmap: a depth-optimal area optimization mapping
algorithm for FPGA designs. In Int'l Conf. Computer Aided Design, 2004.
[39] D. Chen and J. Cong. Delay optimal low-power circuit clustering for FPGAs
with dual supply voltages. International Symposium on Low Power Electronics
and Design, August 2004.
[40] D. Chen, J. Cong, M. Ercegovac, and Z. Huang. Performance-driven mapping
for CPLD architectures. IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, 22(10):1424–1431, October 2003.
[41] D. Chen, J. Cong, and Y. Fan. Low-power high-level synthesis for FPGA
architectures. International Symposium on Low Power Electronics and Design,
August 2003.
[42] D. Chen, J. Cong, F. Li, and L. He. Low-power technology mapping for FPGA
architectures with dual supply voltages. International Symposium on Field
Programmable Gate Arrays, February 2004.
[43] G. Chen and J. Cong. Simultaneous logic decomposition with technology
mapping in FPGA designs. International Symposium on Field-Programmable
Gate-Arrays, 2001.
[44] G. Chen and J. Cong. Simultaneous timing driven clustering and placement
for FPGAs. In International Conference on Field Programmable Logic and Its
Applications, pages 158–167, August 2004.
[45] G. Chen and J. Cong. Simultaneous timing-driven placement and duplication.
International Symposium on Field-Programmable Gate-Arrays, 2005.
[46] K. C. Chen, et al. DAG-Map: graph-based FPGA technology mapping for
delay optimization. IEEE Design and Test of Computers, 9(3):7–20,
September 1992.
[47] A. Chowdary and J. P. Hayes. Technology mapping for field-programmable
gate arrays using integer programming. In Int'l Conf. Computer Aided Design,
November 1995.
[48] J. Cong and Y. Ding. An optimal technology mapping algorithm for delay
optimization in lookup-table based FPGA designs. In Int'l Conf. Computer
Aided Design, November 1992.
[49] J. Cong and Y. Ding. Beyond the combinatorial limit in depth minimization
for LUT-based FPGA designs. In Int'l Conf. Computer Aided Design, 1993.
[50] J. Cong and Y. Ding. FlowMap: an optimal technology mapping algorithm
for delay optimization in lookup-table based FPGA designs. IEEE Trans.
on Computer Aided Design of Integrated Circuits and Systems, 13(1):1–12,
January 1994.
[51] J. Cong and Y. Ding. On area/depth trade-off in LUT-based FPGA technology
mapping. IEEE Transactions on VLSI Systems, 2(2):137–148, 1994.
[52] J. Cong and Y. Ding. Combinational logic synthesis for LUT based field
programmable gate arrays. ACM Trans. on Design Automation of Electronic
Systems, 1(2):145–204, April 1996.
[53] J. Cong, Y. Fan, G. Han, W. Jiang, and Z. Zhang. Behavior and
communication co-optimization for systems with sequential communication
media. In IEEE/ACM Design Automation Conference, 2006.
[54] J. Cong, Y. Fan, G. Han, X. Yang, and Z. Zhang. Architecture and synthesis
for on-chip multi-cycle communication. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, pages 550–564, April 2004.
[55] J. Cong, Y. Fan, G. Han, and Z. Zhang. Application-specific instruction
generation for configurable processor architectures. In International Symposium
on Field-Programmable Gate Arrays, February 2004.
[56] J. Cong, J. Fang, M. Xie, and Y. Zhang. MARS: a multilevel full-chip gridless
routing system. IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, 24(3):382–394, March 2005.
[57] J. Cong, H. Huang, and X. Yuan. Technology mapping and architecture
evaluation for k/m-macrocell-based FPGAs. ACM Trans. on Design Automation
of Electronic Systems, 10:3–23, January 2005.
[58] J. Cong and Y. Hwang. Simultaneous depth and area minimization in
LUT-based FPGA mapping. International Symposium on Field-Programmable
Gate-Arrays, February 1995.
[59] J. Cong and Y. Hwang. Structural gate decomposition for depth-optimal
technology mapping in LUT-based FPGA design. In ACM/IEEE Design
Automation Conference, 1996.
[60] J. Cong, A. Kahng, G. Robins, M. Sarrafzadeh, and C. K. Wong. Provably
good performance-driven global routing. IEEE Trans. on Computer-Aided
Design of Integrated Circuits and Systems, 11(6):739–752, June 1992.
[61] J. Cong, T. Kong, J. Shinnerl, M. Xie, and X. Yuan. Large scale circuit
placement. ACM Transactions on Design Automation of Electronic Systems,
10(2):389–430, April 2005.
[62] J. Cong, K. S. Leung, and D. Zhou. Performance-driven interconnect design
based on distributed RC delay model. In Proc. ACM/IEEE 30th Design
Automation Conference, pages 606–611, June 1993.
[63] J. Cong and S. K. Lim. Physical planning with retiming. In IEEE International
Conference on Computer Aided Design, pages 2–7, 2000.
[64] J. Cong, Y. Lin, and W. Long. SPFD-based global rewiring. In International
Symposium on Field-Programmable Gate Arrays, 2002.
[65] J. Cong and W. Long. Theory and algorithm for SPFD-based global rewiring.
In International Workshop on Logic Synthesis, 2001.
[66] J. Cong and K. Minkovich. Optimality study of logic synthesis for LUT-based
FPGAs. In International Symposium on Field-Programmable Gate Arrays,
February 2006.
[67] J. Cong and B. Preas. A new algorithm for standard cell global routing. In
Proc. Int'l Conf. on Computer-Aided Design, pages 176–179, November 1988.
[68] J. Cong and M. Romesis. Performance-driven multi-level clustering with
application to hierarchical FPGA mapping. In Design Automation Conference,
2001.
[69] J. Cong and J. Shinnerl, editors. Multilevel Optimization in VLSI CAD.
Kluwer Academic Publishers, 2003.
[70] J. Cong and C. Wu. FPGA synthesis with retiming and pipelining for clock
period minimization of sequential circuits. In Design Automation Conference,
1997.
[71] J. Cong and C. Wu. Optimal FPGA mapping and retiming with efficient
initial state computation. Design Automation Conference, 1997.
[72] J. Cong, C. Wu, and Y. Ding. Cut ranking and pruning: enabling a general
and efficient FPGA mapping solution. International Symposium on
Field-Programmable Gate Arrays, February 1999.
[73] J. Cong and S. Xu. Delay-optimal technology mapping for FPGAs with
heterogeneous LUTs. In Design Automation Conference, 1998.
[74] J. Cong and S. Xu. Delay-oriented technology mapping for heterogeneous
FPGAs with bounded resources. In Int'l Conf. Computer Aided Design, 1998.
[75] J. Cong and S. Xu. Technology mapping for FPGAs with embedded memory
blocks. In International Symposium on Field-Programmable Gate Arrays,
1998.
[76] J. Cong and S. Xu. Performance-driven technology mapping for heterogeneous
FPGAs. IEEE Trans. on Computer-Aided Design of Integrated Circuits and
Systems, 19(11):1268–1281, November 2000.
[77] D. Chen. Design and synthesis for low-power FPGAs. Ph.D. dissertation,
Computer Science Department, University of California, December 2005.
[78] J. A. Davis, V. K. De, and J. Meindl. A stochastic wire-length distribution
for gigascale integration (GSI), Part I: derivation and validation. IEEE
Transactions on Electron Devices, 45(3):580–589, March 1998.
[79] G. De Micheli. Synthesis and Optimization of Digital Circuits. McGraw-Hill,
Inc., 1994.
[80] A. Duncan, D. Hendry, and P. Gray. An overview of the COBRA-ABS
high level synthesis system for multi-FPGA systems. In IEEE Symposium
on FPGAs for Custom Computing Machines, pages 106–115, April 1998.
[81] J. M. Emmert and D. Bhatia. A methodology for fast FPGA floorplanning. In
ACM/SIGDA International Symposium on Field Programmable Gate Arrays,
pages 47–56, February 21–23, 1999.
[82] L. A. Entrena and K. T. Cheng. Combinational and sequential logic
optimization by redundancy addition and removal. IEEE Transactions on
Computer Aided Design of Integrated Circuits and Systems, 14(7):909–916, 1995.
[83] W. Fang and A. Wu. Multi-way FPGA partitioning by fully exploiting design
hierarchy. ACM Transactions on Design Automation of Electronic Systems,
5(1):34–50, January 2000.
[84] A. Farrahi and M. Sarrafzadeh. Complexity of the lookup-table minimization
problem for FPGA technology mapping. IEEE Trans. on Computer Aided
Design of Integrated Circuits and Systems, 13(11):1319–1332, November 1994.
[85] A. H. Farrahi and M. Sarrafzadeh. FPGA technology mapping for power
minimization. In International Workshop in Field Programmable Logic and
Applications, 1994.
[86] FishTail Design Automation. http://www.fishtail-da.com/.
[87] R. J. Francis, J. Rose, and Z. Vranesic. Technology mapping for lookup
table-based FPGAs for performance. In Int'l Conf. Computer-Aided Design,
November 1991.
[88] R. J. Francis, et al. Chortle-crf: fast technology mapping for lookup table-
based FPGAs. In Design Automation Conference, 1991.
[89] J. Frankle. Iterative and adaptive slack allocation for performance-driven
layout and FPGA routing. In Proceedings of Design Automation Conference,
pages 536–542, 1992.
[90] A. Gayasen, Y. Tsai, N. Vijaykrishnan, M. Kandemir, M. Irwin, and T. Tuan.
Reducing leakage energy in FPGAs using region-constrained placement. In
ACM International Symposium on Field Programmable Gate Arrays,
February 2004.
[91] V. George and J. Rabaey. Low-Energy FPGAs: Architecture and Design.
Kluwer Academic Publishers, 2001.
[92] V. George and J. Rabaey. Low-Energy FPGAs: Architecture and Design.
Springer, June 2001.
[93] S. Ghiasi, E. Bozorgzadeh, S. Choudhury, and M. Sarrafzadeh. A unified
theory of timing budget management. In IEEE/ACM International Conference
on Computer-Aided Design, pages 653–659, November 2004.
[94] M. Gokhale and J. Stone. Automatic allocation of arrays to memories in
FPGA processors with multiple memory banks. In IEEE Symposium on Field-
Programmable Custom Computing Machines, pages 63–69, April 1999.
[95] P. Gopalakrishnan, X. Li, and L. Pileggi. Architecture-aware FPGA placement
using metric embedding. In IEEE/ACM Design Automation Conference,
pages 460–465, 2006.
[96] M. Haldar, A. Nayak, A. Choudhary, and P. Banerjee. Parallel algorithms for
FPGA placement. In Proc. Great Lakes Symposium on VLSI (GVLSI 2000),
March 2000.
[97] Z. Hasan, D. Harrison, and M. Ciesielski. A fast partition method for
PLA-based FPGAs. IEEE Design and Test of Computers, December 1992.
[98] G. D. Hatchel and F. Somenzi. Logic Synthesis and Verification Algorithms.
Kluwer Academic Publishers, 1996.
[99] P. S. Hauge, R. Nair, and E. J. Yoffa. Circuit placement for predictable
performance. In International Conference on Computer Aided Design, pages
88–91, 1987.
[100] J. He and J. Rose. Technology mapping for heterogeneous FPGAs. In
International Symposium on Field Programmable Gate Arrays, 1994.
[101] M. Hrkic and J. Lillis. S-Tree: a technique for buffered routing tree synthesis.
In Design Automation Conference, 2002.
[102] M. Hrkic, J. Lillis, and G. Beraudo. An approach to placement-coupled logic
replication. In ACM/IEEE Design Automation Conference, pages 711–716,
June 2004.
[103] Y. Hu, Y. Lin, L. He, and T. Tuan. Simultaneous time slack budgeting
and retiming for dual-Vdd FPGA power reduction. In IEEE/ACM Design
Automation Conference, 2006.
[104] S. W. Hur and J. Lillis. Mongrel: hybrid techniques for standard cell
placement. In International Conference on Computer Aided Design, 2000.
[105] J. Hwang, F. Chiang, and T. Hwang. A re-engineering approach to low power
FPGA design using SPFD. In Design Automation Conference, 1998.
[106] M. Inuani and J. Saul. Re-synthesis in technology mapping for heterogeneous
FPGAs. In International Conference on Computer Design, 1998.
[107] P. Jamieson and J. Rose. A Verilog RTL synthesis tool for heterogeneous
FPGAs. In International Conference on Field Programmable Logic and
Applications, August 2005.
[108] A. B. Kahng, S. Reda, and Q. Wang. Architecture and details of a high quality,
large-scale analytical placer. In ACM/IEEE Int'l Conf. on Computer-Aided
Design, pages 891–898, November 2005.
[109] G. Karypis and V. Kumar. Multilevel hypergraph partitioning. In Design
Automation Conference, 1997.
[110] A. Kaviani and S. Brown. Technology mapping issues for an FPGA with
lookup tables and PLA-like blocks. In International Symposium on Field
Programmable Gate Arrays, 2000.
[111] K. Keutzer. DAGON: technology binding and local optimization by DAG
matching. ACM/IEEE Design Automation Conference, pages 341–347, 1987.
[112] D. Kim, J. Jung, S. Lee, J. Jeon, and K. Choi. Behavior-to-placed RTL
synthesis with performance-driven placement. In Int. Conf. on Computer Aided
Design, pages 320–326, November 2001.
[113] A. Koch. Structured design implementation: a strategy for implementing
regular datapaths on FPGAs. In International Symposium on Field
Programmable Gate Arrays, pages 151–157, 1996.
[114] T. Kong. A novel net weighting algorithm for timing-driven placement.
In IEEE/ACM International Conference on Computer-Aided Design, pages
172–176, 2002.
[115] M. R. Korupolu, K. K. Lee, and D. F. Wong. Exact tree-based FPGA
technology mapping for logic blocks with independent LUTs. In Design
Automation Conference, 1998.
[116] L. Kou, G. Markowsky, and L. Berman. A fast algorithm for Steiner trees.
Acta Informatica, 15:141–145, 1981.
[117] J. L. Kouloheris. Empirical Study of the Effect of Cell Granularity on FPGA
Density and Performance. Ph.D. Thesis, Stanford University, 1993.
[118] S. Krishnamoorthy and R. Tessier. Technology mapping algorithms for hybrid
FPGAs containing lookup tables and PLAs. IEEE Trans. on Computer-Aided
Design of Integrated Circuits and Systems, 22(5), May 2003.
[119] E. Kusse and J. Rabaey. Low-energy embedded FPGA structures. In Proc.
of International Symposium on Low Power Electronics and Design, August
1998.
[120] J. Lamoureux and S. J. E. Wilton. On the interaction between power-aware
FPGA CAD algorithms. In IEEE International Conference on Computer-
Aided Design, November 2003.
[121] Lattice. Power estimation in ispMACH 5000B devices, May 2002.
[122] E. L. Lawler, K. N. Levitt, and J. Turner. Module clustering to minimize
delay in digital networks. IEEE Trans. on Computers, C-18(1), 1969.
[123] S. Lee and D. F. Wong. Timing-driven routing for FPGAs based on
Lagrangian relaxation. In Proc. of International Symposium on Physical
Design, pages 176–181, April 2002.
[124] Y. S. Lee and C. H. Wu. A performance and routability-driven router for
FPGAs considering path delay. In Proc. of Design Automation Conference,
pages 557–561, 1995.
[125] C. Legl, B. Wurth, and K. Eckl. A Boolean approach to performance-directed
technology mapping for LUT-based FPGA designs. In Design Automation
Conference, June 1996.
[126] E. Lehman, Y. Watanabe, J. Grodstein, and H. Harkness. Logic decomposition
during technology mapping. In IEEE/ACM International Conference on
Computer-Aided Design, pages 264–271, 1995.
[127] E. Lehman, Y. Watanabe, J. Grodstein, and H. Harkness. Logic decomposition
during technology mapping. IEEE Trans. on Computer-Aided Design of
Integrated Circuits and Systems, 16(8):813–834, 1997.
[128] C. E. Leiserson and J. B. Saxe. Retiming synchronous circuitry. Algorithmica,
6:5–35, 1991.
[129] G. Lemieux and S. D. Brown. A detailed routing algorithm for allocating wire
segments in FPGAs. ACM/SIGDA Physical Design Workshop, 1993.
[130] F. Li, D. Chen, L. He, and J. Cong. Architecture evaluation for power-
efficient FPGAs. In ACM International Symposium on Field Programmable
Gate Arrays, pages 175–184, Monterey, California, 2003.
[131] F. Li, Y. Lin, and L. He. FPGA power reduction using configurable dual-Vdd.
In IEEE/ACM Design Automation Conference, pages 735–740, June 2004.
[132] F. Li, Y. Lin, and L. He. Vdd programmability to reduce FPGA interconnect
power. In IEEE/ACM International Conference on Computer-Aided Design,
pages 760–765, San Jose, November 2004.
[133] F. Li, Y. Lin, L. He, D. Chen, and J. Cong. Power modeling and characteristics
of field programmable gate arrays. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, 24(11), November 2005.
[134] F. Li, Y. Lin, L. He, and J. Cong. Low-power FPGA using dual-Vdd/dual-Vt
techniques. In International Symposium on Field Programmable Gate Arrays,
pages 42–50, February 2004.
[135] H. Li, S. Katkoori, and W. K. Mak. Power minimization algorithms for LUT-
based FPGA technology mapping. ACM Transactions on Design Automation
of Electronic Systems, 9(1):33–51, January 2004.
[136] R. Libeskind-Hadas, N. Hasan, J. Cong, P. McKinley, and C. L. Liu. Fault
Covering Problems in Reconfigurable VLSI Systems. Kluwer Academic
Publishers, 1992.
[137] E. Lin and S. Wilton. Macrocell architectures for product term embedded
memory arrays. Field Programmable Logic Applications, pages 4858, August
2001.
[138] J. Lin, D. Chen, and J. Cong. Optimal simultaneous mapping and clustering
for FPGA delay optimization. In IEEE/ACM Design Automation Conference,
2006.
[139] J. Lin, A. Jagannathan, and J. Cong. Placement-driven technology mapping
for LUT-based FPGAs. In International Symposium on Field Programmable
Gate Arrays, pages 121126, February 2003.
[140] A. Ling, D. Singh, and S. Brown. FPGA technology mapping: a study of
optimality. In Design Automation Conference, 2005.
[141] P. Maidee, C. Ababei, and K. Bazargan. Timing-driven partitioning-based
placement for island-style FPGAs. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, 24(3):395–406, March 2005.
[142] V. Manohararajah, S. D. Brown, and Z. G. Vranesic. Heuristics for area
minimization in LUT-based FPGA technology mapping. In International
Workshop on Logic Synthesis, 2004.
[143] A. Marquardt, V. Betz, and J. Rose. Using cluster-based logic blocks and
timing-driven packing to improve FPGA speed and density. In ACM/SIGDA
International Symposium on Field Programmable Gate Arrays, pages 37–46,
1999.
[144] A. Marquardt, V. Betz, and J. Rose. Timing-driven placement for FPGAs. In
International Symposium on Field Programmable Gate Arrays, pages 203–213,
Monterey, CA, February 2000.
[145] A. Mathur and C. L. Liu. Performance-driven technology mapping for lookup-
table based FPGAs using the general delay model. In International Workshop
on Field Programmable Gate Arrays, February 1994.
[146] L. McMurchie and C. Ebeling. PathFinder: A negotiation-based performance-
driven router for FPGAs. In Proceedings of International Symposium on Field-
Programmable Gate Arrays, February 1995.
[147] A. Mishchenko, S. Chatterjee, and R. Brayton. Improvements to technol-
ogy mapping for LUT-based FPGAs. In International Symposium on Field-
Programmable Gate Arrays, 2006.
[148] R. Mukherjee and S. Ogrenci Memik. Evaluation of dual-Vdd fabrics for low
power FPGAs. In Asia and South Pacific Design Automation Conference,
January 2005.
[149] R. Murgai, R. Brayton, and A. Sangiovanni-Vincentelli. On clustering for
minimum delay/area. In Int'l Conf. Computer-Aided Design, November 1991.
[150] R. Murgai, R. Brayton, and A. Sangiovanni-Vincentelli. Logic Synthesis for
Field-Programmable Gate Arrays. Springer, July 1995.
[151] R. Murgai, et al. Improved logic synthesis algorithms for table look up archi-
tectures. In Int'l Conf. Computer-Aided Design, November 1991.
[152] S. Mutoh, et al. 1-V power supply high-speed digital circuit technology
with multi-threshold-voltage CMOS. IEEE Journal of Solid-State Circuits,
30(8):847–854, August 1995.
[153] S. K. Nag and R. A. Rutenbar. Performance-driven simultaneous placement
and routing for FPGAs. IEEE Trans. Computer-Aided Design of Integrated
Circuits and Systems, 17(6):499–518, June 1998.
[154] R. Nair. A simple yet effective technique for global wiring. IEEE Trans. on
Computer-Aided Design of Integrated Circuits and Systems, CAD-6(6):165–
172, March 1987.
[155] G. J. Nam, F. Aloul, K. A. Sakallah, and R. A. Rutenbar. A comparative
study of two Boolean formulations of FPGA detailed routing constraints.
IEEE Transactions on Computers, 53(6):688–696, June 2004.
[156] G. J. Nam, K. A. Sakallah, and R. A. Rutenbar. Satisfiability-based lay-
out revisited: detailed routing of complex FPGAs via search-based Boolean
SAT. In ACM/SIGDA International Symposium on Field-Programmable Gate
Arrays, pages 167–175, February 1999.
[157] P. Pan and C. C. Lin. A new retiming-based technology mapping algorithm
for LUT-based FPGAs. In International Symposium on Field-Programmable
Gate Arrays, 1998.
[158] P. Pan and C. L. Liu. Optimal clock period FPGA technology mapping for
sequential circuits. In Design Automation Conf., June 1996.
[159] P. Pan and C. L. Liu. Technology mapping of sequential circuits for
LUT-based FPGAs for performance. In International Symposium on Field-
Programmable Gate Arrays, 1996.
[160] P. Pan and C. L. Liu. Optimal clock period FPGA technology mapping for
sequential circuits. ACM Transactions on Design Automation of Electronic
Systems, 3(3):437–462, 1998.
[161] K. Poon, S. J. E. Wilton, and A. Yan. A detailed power model for field pro-
grammable gate arrays. ACM Transactions on Design Automation of Elec-
tronic Systems, 10(2):279–302, April 2005.
[162] A. Rahman and V. Polavarapu. Evaluation of low-leakage design techniques
for field programmable gate arrays. In International Symposium on Field Pro-
grammable Gate Arrays, February 2004.
[163] R. Rajaraman and D. F. Wong. Optimal clustering for delay minimization.
In Design Automation Conference, June 1993.
[164] J. Rose. Parallel global routing for standard cells. IEEE Transactions on Com-
puter-Aided Design of Integrated Circuits and Systems, 9(10):1085–1095,
October 1990.
[165] Y. Sankar and J. Rose. Trading quality for compile time: ultra-fast placement
for FPGAs. In International Symposium on Field Programmable Gate Arrays,
pages 157–166, 1999.
[166] M. Schlag, J. Kong, and P. K. Chan. Routability-driven technology mapping
for lookup table-based FPGAs. IEEE Trans. on Computer-Aided Design of
Integrated Circuits and Systems, 13(1):13–26, 1994.
[167] H. Schmit, L. Arnstein, D. Thomas, and E. Lagnese. Behavioral synthesis
for FPGA-based computing. In Workshop on FPGAs for Custom Computing
Machines, pages 125–132, 1994.
[168] E. M. Sentovich, et al. SIS: A system for sequential circuit synthesis. Technical
report, Department of Electrical Engineering and Computer Science, University
of California, Berkeley, CA, 1992.
[169] L. Shang and N. K. Jha. High-level power modeling of CPLDs and FPGAs.
In IEEE International Conference on Computer Design, September 2001.
[170] L. Shang, A. Kaviani, and K. Bathala. Dynamic power consumption in Virtex-
II FPGA family. In ACM International Symposium on Field Programmable
Gate Arrays, February 2002.
[171] K. R. Shayee, J. Park, and P. Diniz. Performance and area modeling of com-
plete FPGA designs in the presence of loop transformations. IEEE Transac-
tions on Computers, 53(11):1420–1435, November 2004.
[172] J. P. M. Silva and K. A. Sakallah. GRASP: a new search algorithm for satis-
fiability. In Proc. ACM/IEEE Int'l Conf. Computer-Aided Design, November
1997.
[173] A. Singh and M. Marek-Sadowska. Efficient circuit clustering for area and
power reduction in FPGAs. In ACM International Symposium on Field Pro-
grammable Gate Arrays, February 2002.
[174] D. Singh and S. Brown. Integrated retiming and placement for field pro-
grammable gate arrays. In International Symposium on Field Programmable
Gate Arrays, pages 67–76, February 2002.
[175] A. Srivastava, D. Sylvester, and D. Blaauw. Power minimization using simulta-
neous gate sizing, Dual-Vdd and Dual-Vth assignment. In Design Automation
Conference, 2004.
[176] H. Styles and W. Luk. Branch optimization techniques for hardware compila-
tion. In International Conference on Field Programmable Logic and Applica-
tions, 2003.
[177] W. Sun, M. Wirthlin, and S. Neuendorffer. Combining module selection and
resource sharing for efficient FPGA pipeline synthesis. In Int'l Symposium on
Field Programmable Gate Arrays, 2006.
[178] J. Swartz, V. Betz, and J. Rose. A fast routability-driven router for FPGAs.
In Int'l Symposium on Field Programmable Gate Arrays, pages 140–149, Mon-
terey, CA, 1998.
[179] Synopsys. http://www.synopsys.com/products/products_matrix.html.
[180] SystemC. http://www.systemc.org.
[181] M. Takahashi, et al. A 60mW MPEG4 video codec using clustered voltage
scaling with variable supply-voltage scheme. IEEE Journal of Solid-State
Circuits, 1998.
[182] R. Tessier. Fast placement approaches for FPGAs. ACM Transactions on
Design Automation of Electronic Systems, 7(2):284–305, April 2002.
[183] The MathWorks. http://www.mathworks.com/.
[184] N. Togawa, M. Sato, and T. Ohtsuki. Maple: a simultaneous technology map-
ping, placement, and global routing algorithm for field-programmable gate
arrays. In Int'l Conf. Computer-Aided Design, 1994.
[185] N. Togawa, M. Sato, and T. Ohtsuki. A simultaneous placement and global
routing algorithm with path length constraints for transport-processing
FPGAs. In Asia and South Pacific Design Automation Conf., pages 569–578, 1997.
[186] S. Trimberger. Field-Programmable Gate Array Technology. Springer, January
1994.
[187] T. Tuan, S. Kao, A. Rahman, S. Das, and S. Trimberger. A 90nm low-power
FPGA for battery-powered applications. In International Symposium on Field
Programmable Gate Arrays, 2006.
[188] T. Tuan and B. Lai. Leakage power analysis of a 90nm FPGA. In Custom
Integrated Circuits Conference, 2003.
[189] K. Usami and M. Horowitz. Clustered voltage scaling for low-power design.
In International Symposium on Low Power Design, April 1995.
[190] H. Vaishnav and M. Pedram. Delay optimal clustering targeting low-power
VLSI circuits. IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, 18(6), June 1999.
[191] K. Wakabayashi. C-based behavioral synthesis and verification analysis on
industrial design examples. In Asia and South Pacific Design Automation
Conference, pages 344–348, January 2004.
[192] K. Wakabayashi and T. Okamoto. C-based SoC design flow and EDA tools: an
ASIC and system vendor perspective. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, 19(12):1507–1522, December 2000.
[193] Z. H. Wang, E. C. Liu, J. Lai, and T. C. Wang. Power minimization in LUT-
based FPGA technology mapping. In Asia and South Pacific Design Automation
Conference, 2001.
[194] L. Wei, Z. Chen, M. Johnson, K. Roy, and V. De. Design and optimization
of low voltage high performance dual threshold CMOS circuits. In Design
Automation Conference, 1998.
[195] K. Weiß, C. Oetker, I. Katchan, T. Steckstor, and W. Rosenstiel. Power esti-
mation approach for SRAM-based FPGAs. In ACM International Symposium
on Field Programmable Gate Arrays, February 2000.
[196] S. Wilton. SMAP: heterogeneous technology mapping for area reduction in
FPGAs with embedded memory arrays. In International Symposium on Field
Programmable Gate Arrays, 1998.
[197] S. Wilton. Heterogeneous technology mapping for area reduction in FPGAs
with embedded memory arrays. IEEE Trans. on Computer-Aided Design of
Integrated Circuits and Systems, 19(1), January 2000.
[198] S. Wilton. Heterogeneous technology mapping for FPGAs with dual-
port embedded memory arrays. In International Symposium on Field Pro-
grammable Gate Arrays, 2000.
[199] F. G. Wolff, M. J. Knieser, D. J. Weyer, and C. A. Papachristou. High-level
low power FPGA design methodology. In IEEE National Aerospace Confer-
ence, 2000.
[200] P. Wong, L. Cheng, Y. Lin, and L. He. FPGA device and architecture evalu-
ation considering process variation. In Proc. IEEE/ACM International Conf.
on Computer-Aided Design, November 2005.
[201] Y. L. Wu and M. Marek-Sadowska. An efficient router for 2-D field pro-
grammable gate arrays. In Proc. of European Design Automation Conference,
pages 412–416, 1994.
[202] Y. L. Wu and M. Marek-Sadowska. Orthogonal greedy coupling: a new opti-
mization approach to 2-D FPGA routing. In Proc. of Design Automation Con-
ference, June 1995.
[203] Xilinx. Website, http://www.xilinx.com.
[204] Xilinx. Spartan-3E Data Sheets. http://direct.xilinx.com/bvdocs/publications/
ds312.pdf.
[205] Xilinx. Virtex-4 Data Sheet. http://www.xilinx.com.
[206] Xilinx. Virtex-4 Web Power Tool Version 8.1.01. http://www.xilinx.com/
products/silicon_solutions/fpgas/virtex/virtex4/index.htm.
[207] Xilinx. Virtex-5 Data Sheet. http://www.xilinx.com.
[208] Xilinx. White Paper 205: hardware/software codesign for platform FPGAs.
http://www.xilinx.com/products/design_resources/proc_central/resource/
hardware_software_codesign.pdf.
[209] Xilinx. XPower Tool. http://www.xilinx.com/products/design_resources/
design_tool/grouping/power_tools.htm.
[210] Xilinx. A simple method of estimating power in XC4000XL/EX/E FPGAs,
June 30, 1997. Application Brief X014.
[211] M. Xu and F. J. Kurdahi. ChipEst-FPGA: a tool for chip level area and timing
estimation of lookup table based FPGAs for high level applications. In Asia
and South Pacic Design Automation Conference, January 1997.
[212] M. Xu and F. J. Kurdahi. Layout-driven high level synthesis for FPGA based
architectures. In Design Automation and Test in Europe, 1998.
[213] S. Yamashita, H. Sawada, and A. Nagoya. A new method to express functional
permissibilities for LUT based FPGAs and its applications. In International
Conference on Computer-Aided Design, pages 254–261, 1996.
[214] H. Yang and D. F. Wong. Edge-map: optimal performance driven technology
mapping for iterative LUT based FPGA designs. In Int'l Conf. Computer-
Aided Design, November 1994.
[215] A. G. Ye, J. Rose, and D. Lewis. Synthesizing datapath circuits for FPGAs
with emphasis on area minimization. In International Conference on Field-
Programmable Technology, December 2002.
[216] A. Z. Zelikovsky. An 11/6 approximation algorithm for the network Steiner
problem. Algorithmica, 9:463–470, 1993.
[217] L. Zhang, C. Madigan, M. Moskewicz, and S. Malik. Efficient conflict driven
learning in a Boolean satisfiability solver. In International Conference on
Computer-Aided Design, pages 279–285, 2001.
[218] Y. Zhan, et al. Correlation-aware statistical timing analysis with non-Gaussian
delay distributions. In Proc. of Design Automation Conf., June 2005.
[219] P. S. Zuchowski, et al. A hybrid ASIC and FPGA architecture. In Interna-
tional Conference on Computer-Aided Design, November 2002.