
Hi, John,

Here are the 14 major metrics that I feel a design team
must consider when deciding on what specific emulator to
use on their project (if any):

   1. Price/Gate
   2. Initialization and Dedicated Support
   3. Capacity
   4. Primary Target Designs
   5. Speed Range
   6. Partitioning
   7. Compile Time
   8. Visibility
   9. Debug
  10. Virtual Platform API
  11. Transactor Availability
  12. Verification Language and Native Support
  13. Number of Users
  14. Memory Capacity

Below I explain in detail the impact and pitfalls each
metric will have on your team's emulation decision.

    - Jim Hogan
      Vista Ventures, LLC
      Los Gatos, CA
----
1. Price/Gate
The actual cost of an emulator typically runs from
1-5 cents-per-gate for higher capacity emulators --
both processor-based and FPGA-based. There is usually
some recurring cost for software and maintenance.
The lower capacity FPGA-based emulators are typically
priced as separate HW and SW components. The HW
consists of off-the-shelf FPGA prototyping boards that
typically cost 0.25 to 1 cent-per-gate. The SW that
provides emulation features on top of the prototyping
boards is typically priced like software simulators.
2. Initialization and Dedicated Support
When assessing an emulator's total cost of ownership,
the cost of dedicated human support is a key factor,
especially as it is an ongoing expense.
The large processor-based emulators virtually always
require at least one support person, if not a team,
dedicated to emulator support. It's not due to any
inherent issues with emulators; it's just the sheer
scope and magnitude of the verification projects being
tackled.
The large FPGA-based emulators' need for dedicated
support is similar, also depending on the size and
complexity of the emulated designs and the number of
end users involved.
Further, the time to initialize an emulator and set up
models can take 6 months or more. Transactors may need
to be developed to connect test benches to the
design-under-test (DUT). They are also needed to
connect host-based virtual platforms to emulators and
FPGA prototyping systems. This is part of a growing
trend to improve the accuracy and performance of
virtual platforms for performance modeling and
pre-silicon software development.
The most complex transactors -- such as PCIe
transactors -- can take more than a year to develop.
They can be the source of functionality and performance
bugs that can delay the emulation project even longer.
And test bench re-partitioning between the host and
emulator may also be required to achieve acceptable
performance.
3. Capacity
There are subtle issues around an emulator's capacity;
a given capacity can be delivered in multiple ways.
First, there is the straightforward capacity measurement
in terms of total number of gates: that currently ranges
from 2 million to 2 billion gates for emulators.
Second, there is the granularity in terms of the number
of devices (ASICs or FPGAs), boards, or boxes that are
used to reach a given capacity.
Processor-based emulators are architected to provide
capacity in a seamless way, so it looks monolithic to
users. With FPGA-based emulators, if it is a vertically
integrated box, it should also appear monolithic to a
user. Synopsys-EVE reaches its 1 B gate capacity in a
monolithic way by connecting multiple emulator boxes.
A vendor may hide the emulator granularity from the
customer, but higher granularity generates more
communication overhead, which generally degrades
performance. This is true whether the emulator is
processor-based or FPGA-based. For instance, if you
were emulating 10 M gates and it fit on a single FPGA,
it could run approximately 5X faster than if the same
10 M gate capacity were divided over multiple FPGAs.
Low-to-mid FPGA-based emulators typically expose more
granularity to the end user. However, an offset to this
is that these boxes tend to take advantage of the
increased capacity of newer FPGA devices more quickly
than the mid-to-high-capacity FPGA-based emulators.
FPGAs on the leading edge of Moore's Law are one of the
first things manufactured. New FPGA product cycles run
12 to 18 months. In contrast, custom chip cycles are at
least 4 years, and the cost of development, which has
generally been increasing for custom ASICs, is borne by
the custom-processor-based emulation vendor.
4. Primary Target Designs
SoCs are the dominant workhorse for systems companies
today. The target design sweet spot for each emulator
type is typically defined by the capacity of the
emulator as it relates to your design size.
As mentioned earlier, complexity is one of the factors
driving more mainstream designs into requiring
emulation. Control complexity can require many
verification cycles. Emulation applications range from
CPUs, GPUs, and application processors such as video,
audio, security and datapath, to IP blocks and
subsystems.
5. Speed Range
Emulator speed is measured in cycles-per-second.
Processor-based speeds range from 100 K to 4 M
cycles/sec, while FPGA-based speeds range from 500 K to
50 M cycles/sec, depending on the number of devices.
6. Partitioning
Partitioning is harder than it sounds. Partitioning
would be easy if your design could be broken up into
latency-insensitive blocks that would eliminate any
strict latency requirements between partitions. But
that's hardly ever the case.
  - Processor-based emulators have many processing
    units, so large designs must be partitioned across
    these units. Processor-based systems basically run
    software using many cores, and the partitioning is
    more or less transparent to the user.
  - FPGA-based systems are hardware-centric, based on
    multiple FPGAs. Due to the FPGA boundaries you must
    first split the design into reasonably sized
    pieces, so that each sub-system will fit on one
    FPGA device. Unfortunately, you then usually end up
    needing more logical connections between
    partitioned sections of the design than there are
    physical wires between FPGAs to make those
    connections! These logical connections can exceed
    the physical connections by 2X to 100X.
    So when you stitch your partitioned design elements
    together, rather than just making a simple logical
    connection, you must multiplex signal pins over the
    FPGA connections (a minimal sketch of this pin
    multiplexing follows these bullets). This adds
    substantially to complexity, particularly if this
    process is not completely automated. The entire
    process takes additional time and effort that opens
    the possibility for errors to be injected in
    partitioning or reconnecting.
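Here is a minimal sketch of what that pin multiplexing amounts to: a
handful of logical signals time-shared over one physical FPGA pin,
with a matching de-mux on the receiving FPGA. This is my own
illustration, not any vendor's implementation; all names are invented,
and real partitioning tools generate much wider, faster versions of
this automatically.

    // Hypothetical sketch: time-division multiplex 4 logical signals
    // onto 1 physical inter-FPGA pin. The pin clock runs 4x the
    // design clock so all 4 values cross each design cycle.
    module pin_tdm_tx (
        input  logic       fast_clk,     // 4x design clock
        input  logic [3:0] logical_sig,  // signals that must cross FPGAs
        output logic       pin           // the single physical wire
    );
        logic [1:0] slot = '0;

        always_ff @(posedge fast_clk) begin
            pin  <= logical_sig[slot];   // one logical signal per slot
            slot <= slot + 2'd1;
        end
    endmodule

The cost is visible even in this toy: an extra fast clock domain, a
mux, and a counter per pin -- the overhead described above, multiplied
across thousands of crossing signals.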
A deeply partitioned design can be very difficult to
debug manually or semi-automatically. Design groups
sometimes give up on the task because they can't get it
right. Certain debug features don't work as well in
partitioned systems; debugging a single FPGA is easier.
This debug concern only applies to FPGA-based emulators;
the processor-based emulators have complete visibility
across all of their processors.
Most processor-based and larger FPGA-based emulation
vendors try to automatically take care of partitioning
such that the end user doesn't have to deal with it.
  - Automatic partitioning is rule-based and is assumed
    to be correct-by-construction. If you have a
    problem with your partitioning, it's a support call
    to the emulation vendor.
  - The greatest partitioning risk occurs in the
    smaller FPGA-based emulator group. Their
    (in)ability to partition should be carefully
    assessed -- particularly when partitioning across
    more than two FPGAs. Some of these vendors use an
    independent partitioning tool from Auspy; however,
    partitioning is inherently never completely push
    button.
Thankfully, Moore's Law has now made it possible to have
30 M gates in just two FPGAs. This neutralizes the
partitioning issue for a substantial fraction of
emulation projects. If no messy multi-part partitioning
is needed, there's no problem. A single partition
between only two devices is straightforward.
7. Compile Time
Emulator compile time is the total time to prep a job
for execution on your emulation system, including
synthesis and routing. For FPGA-based emulators,
compilation time is primarily determined by the FPGA
routing tools. Further, some FPGA-based emulators are
starting to provide hierarchical routing -- which
enables incremental compiles.
Another variable for compile time is whether your design
partitioning between the devices is automated or not.
  - For processor-based emulators this design
    partitioning is automated and quite complete, and
    that time is included in the compile time. The
    compile runs on a single workstation.
  - For FPGA-based emulators partitioning can be quite
    variable; your total compile time can more than
    double if the partitioning is semi-automated and
    requires user interaction.
Another factor for compile time is whether it's
parallelizable -- i.e. whether it can be broken into
independent jobs to be run on separate workstations.
For FPGA-based emulators, once your partitioning is
complete, your FPGA compilation can be sped up
significantly by running each FPGA compile on a separate
workstation. That's not the case for processor-based
emulators.
8. Visibility
Visibility is the ability to see signals inside a
design. You have full visibility with SW simulation,
because simulation is software and the program state is
inside the simulator; it is very flexible and you can
log any state.
For FPGA-based emulators, there's static and dynamic
visibility. Static visibility refers to signal probes
that are defined at compile time. They require FPGA
resources and run fast, but usually cover only a small
subset of a design's signals. If you need to change
that subset of signals, you need to recompile. Dynamic
visibility refers to signal probes that do not need to
be defined at compile time. These do not require
additional FPGA resources; they run much slower than
static probes, but cover a much larger subset of a
design's signals.
  - The signal visibility with processor-based
    emulators is basically the same as simulation.
    That's because all signal states reside as a
    software addressable register somewhere in the
    emulation processor array.
  - With FPGA-based emulators, to see the internal
    signals, you must route them out through some type
    of multiplexing network to the pins (a minimal
    sketch of such a probe mux follows this list).
    This adds physical gate overhead whenever you take
    a signal and connect it through an I/O multiplexer
    to the pins of the FPGA. The multiplexer and wires
    create trees of logic that add area overhead.
    Additionally, those gates and wires add a
    performance overhead; every cycle requires
    capturing the state and eventually sending it to a
    host for storage. If you try to probe every signal,
    your effective design capacity can go down by a
    factor of 2-5. You must navigate this carefully,
    or your design won't fit.
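To make the overhead concrete, here is a minimal sketch of a static
probe mux -- one of N tapped internal signals routed to a single debug
pin. This is my own illustration, not any vendor's capture network;
the module and signal names are invented, and production emulators
build far wider capture trees automatically at compile time.

    // Hypothetical sketch: route one of N_PROBES internal signals,
    // chosen by a host-programmable select, to a single output pin.
    module probe_mux #(
        parameter int N_PROBES = 256,
        parameter int SEL_W    = $clog2(N_PROBES)
    ) (
        input  logic                clk,
        input  logic [N_PROBES-1:0] probe_in,   // signals tapped at compile time
        input  logic [SEL_W-1:0]    probe_sel,  // host-programmable select
        output logic                probe_out   // sent off-chip for capture
    );
        // This mux tree and register are exactly the "area overhead"
        // described above, repeated for every probe pin you want.
        always_ff @(posedge clk)
            probe_out <= probe_in[probe_sel];
    endmodule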
However, the signal visibility gap is narrowing between
processor-based and FPGA-based emulators based on two
key factors:
  - FPGA capacity improvements provide increased
    capacity for static probing inside of FPGAs.
  - Xilinx has built a feature into their FPGAs which
    offers dynamic probes for register states. This
    Xilinx feature enables better multiplexing in the
    chip to access the signals without overhead.
With effort, the individual emulator vendors can all but
eliminate the area overhead, which then means less time
recompiling to be able to see a different set of
signals. Some emulator vendors are taking advantage of
this.
The bottom line is that emulation vendors can be
assessed on the area and performance hit they incur for
any given number of probes, as well as whether they
offer dynamic probing.
9. Debug
With the exponential rise in complexity, debug is
essential -- as you can see in the chart below, teams
spend 1.6X as much time debugging (42%) as they do
developing testbenches (26%) or writing/running tests
(26%).

A robust debug capability is critical, and
processor-based emulators today have debug capabilities
that approach those of SW simulator debug.
Fundamental emulator debug capabilities can include:
  - Breakpoints. The ability to pause an emulation run
    based on event triggers.
  - Assertions support. Flags when assertions -- logic
    statements that define the intended behavior of a
    design -- are violated. (See the short SVA sketch
    after this list.)
  - Simulation hot-swap. The ability to automatically
    transfer execution to a connected simulator for
    more in-depth debug, in the form of greater
    visibility and control.
  - Software debug. The ability to run software
    debuggers on the embedded code being executed by
    the processor(s).
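As a concrete example of the assertion support mentioned above, here
is a small SystemVerilog Assertion (SVA) of the kind an emulator with
assertion support can flag at run time. The module and signal names
(arb_checks, req, grant) are invented for this sketch; it illustrates
the concept, not any particular vendor's flow.

    // Hypothetical sketch: every request must be granted within
    // 1 to 4 cycles, checked on every clock while out of reset.
    module arb_checks (input logic clk, rst_n, req, grant);
        property req_gets_grant;
            @(posedge clk) disable iff (!rst_n)
                req |-> ##[1:4] grant;
        endproperty

        assert property (req_gets_grant)
            else $error("req not granted within 4 cycles");
    endmodule

In an emulator, a failing assertion like this can stop the run (a
breakpoint on the violation) instead of being discovered millions of
cycles later in a waveform.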
Debug is an area undergoing significant innovation and
should be assessed beyond the simple list I show in the
comparison chart.
10. Virtual Platform API
Hybrid platforms consist of emulators co-simulating with
virtual platforms, where a virtual platform simulates
whole chips on a workstation by eliminating detail from
the hardware and plugging C-models together.
Given the prevalence of virtual platforms, it is
important for the emulator to have a standard virtual
platform API, so that if an engineer doesn't have a
virtual platform C-model for a component, they can plug
the component into the emulator instead.
For example, you may have RTL for USB 3.0, but no C
model. An emulator with a virtual platform API could
allow you to co-simulate with your emulator running the
USB RTL at up to 4 M cycles/sec. In contrast, the RTL
might run at only 5 K cycles/sec in a SW simulator.
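To show the flavor of such an API, here is a minimal sketch of a
function-call interface between host C code and emulated RTL. I use
SystemVerilog DPI purely as an illustration -- commercial hybrid flows
use SCE-MI or vendor-specific APIs -- and all names (usb_xactor,
usb_send_byte) are invented for this sketch.

    // Hypothetical sketch: the virtual platform has no USB C-model,
    // so its C code calls this exported task to drive the emulated
    // USB RTL one byte at a time.
    module usb_xactor (
        input  logic       clk,
        output logic [7:0] tx_data,
        output logic       tx_valid
    );
        export "DPI-C" task usb_send_byte;

        task usb_send_byte(input byte b);
            @(posedge clk);
            tx_data  <= b;
            tx_valid <= 1'b1;
            @(posedge clk);
            tx_valid <= 1'b0;
        endtask
    endmodule

The host-side C model simply calls usb_send_byte() as if it were a
local function; the emulator turns each call into cycle-accurate pin
activity on the USB RTL.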
11. Transactor Availability
Transactors facilitate the communication between the
emulator and other platforms with different levels of
abstraction. Transactors are critical to making
emulators work in that they remove the speed bottlenecks
associated with co-emulation.
  - Transactors convert transactions coming from
    outside the emulator to bit-level interfaces inside
    the emulator and vice versa; as emulators move to
    the mainstream, they need to communicate with
    simulators and virtual platforms more frequently.
  - Transactors also serve as interfaces when you
    connect inputs and outputs to the I/O of a design
    (SoC) under test (DUT) in an emulator. The
    transactors let you go between I/Os on the DUT and
    anything communicating with it on the host
    simulator, a C testbench, or possibly another I/O
    board bringing in a live connection, such as
    Ethernet. Below shows an example.

                                  Source: Mentor Graphics

  - One definition of a transactor is anything that
    moves something from one abstraction level to
    another. For example, in the virtual platform space
    transaction-level modeling (TLM) is the standard,
    whereas in the RTL space bit-level interfaces are
    the norm.
  - Hardware transactors eliminate traffic over
    co-emulation links to optimize performance. They do
    this by translating a compound transaction into a
    bit-level handshake exploded in space and time on
    the emulator, where it's more efficiently handled.
    For example, a single transaction over the
    co-emulation link to "move a 128-word block of
    memory from location A to location B" is exploded
    into a sequence of 128 bit-level interface
    operations in the emulator (a minimal sketch of
    this follows the list).
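Here is a minimal sketch of that "explode one transaction into 128
bit-level operations" idea, written as an emulator-side BFM task. This
is my own illustration under invented names (mem_move_xactor, bus_*);
real transactors add handshaking, error checking, and a host-link
layer such as DPI or SCE-MI.

    // Hypothetical sketch: one host call = one block-move transaction,
    // expanded inside the emulator into 128 word-level bus operations,
    // so only a single message crosses the co-emulation link.
    module mem_move_xactor (
        input  logic        clk,
        output logic        bus_we,
        output logic [31:0] bus_addr,
        output logic [31:0] bus_wdata,
        input  logic [31:0] bus_rdata
    );
        task automatic move_block(input logic [31:0] src, dst);
            logic [31:0] word;
            for (int i = 0; i < 128; i++) begin
                @(posedge clk);               // read word i from src
                bus_we   <= 1'b0;
                bus_addr <= src + 4*i;
                @(posedge clk);
                word = bus_rdata;
                bus_we   <= 1'b1;             // write word i to dst
                bus_addr <= dst + 4*i;
                bus_wdata <= word;
            end
            @(posedge clk);
            bus_we <= 1'b0;
        endtask
    endmodule

The quality of exactly this kind of code is what the next paragraph is
about: a clumsy version that chats with the host every word would be
10X slower than one that stays inside the emulator.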
The overall emulator performance is very sensitive to
transactor quality. For example, an optimized transactor
may sustain 1 M cycles/sec compared with only 100 K
cycles/sec for an unoptimized transactor -- a 10X
performance delta.
The individual emulator vendors provide off-the-shelf
transactor libraries for standard interfaces such as
memory and high speed I/O; these standard transactors
can be reused across different designs. However, custom
transactors must often be developed for each design.
Some projects may require 10-20 transactors, of which
5-10 may be custom, ranging from UARTs to proprietary
system bus interconnects. End-to-end performance
optimization of these custom transactors is hard -- the
time can range from 1 to 12 months. Furthermore,
different transactor versions can be required to support
multiple emulation platforms.
12. Verification Language and Native Support
Regardless of transactor quality, sometimes you must
move selected components, models or testbenches over to
a SW simulator. To do this efficiently, ideally your
emulator supports your existing languages.
  - Processor-based emulators typically have broad
    native support for verification languages. This is
    because the emulation vendors often also provide
    simulators -- an inherent advantage.
  - FPGA-based emulators don't natively support
    languages such as C++ or SystemC; they are
    typically limited to synthesizable Verilog and
    VHDL. The drawback is that developing synthesizable
    Verilog/VHDL takes more time than writing in a
    higher level language.
Another item that can severely limit emulator
performance is when test bench components run too slowly
on the host or generate too much traffic on the link
between the host and the emulator. Synthesizable test
benches running in the emulator can reduce this
host-to-emulator bottleneck by 5X to 10X.
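As an illustration of what "synthesizable test bench" means in
practice, here is a minimal sketch of a stimulus generator that can be
compiled into the emulator itself, so stimulus never has to cross the
host link. The names (stim_gen, dut_data, dut_valid) are invented for
this sketch.

    // Hypothetical sketch: an LFSR-based stimulus generator, written
    // in synthesizable SystemVerilog so it runs at emulator speed.
    module stim_gen (
        input  logic        clk,
        input  logic        rst_n,
        output logic [31:0] dut_data,
        output logic        dut_valid
    );
        logic [31:0] lfsr;

        always_ff @(posedge clk or negedge rst_n) begin
            if (!rst_n) begin
                lfsr      <= 32'hDEAD_BEEF;
                dut_valid <= 1'b0;
            end else begin
                // 32-bit LFSR produces pseudo-random data every cycle
                lfsr      <= {lfsr[30:0],
                              lfsr[31] ^ lfsr[21] ^ lfsr[1] ^ lfsr[0]};
                dut_data  <= lfsr;
                dut_valid <= 1'b1;
            end
        end
    endmodule

Writing this in C++ or SystemC would be quicker, but it would then
have to run on the host -- which is exactly the bottleneck described
above.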

13. Number of Users
The bigger an SoC, generally the larger the engineering
team and the more geographically dispersed it is.
Therefore, the maximum number of users that can run an
emulation at the same time on partitioned elements of
the SoC should be taken into consideration.
14. Memory Capacity
Processor-based emulators have a high memory capacity,
up to 1 TB. This memory is used more like the memory in
a simulator; mapping DUT memories is transparent and
usually not an issue.
In contrast, DUT memory on FPGA-based emulators is more
explicitly mapped to hard macro memory blocks in the
FPGA devices. The largest FPGA devices have about 50 M
bits per device; this capacity is usually enough for the
memory required in DUT partitions. Sometimes a DUT
memory does not map well into the FPGA device memory, in
which case the FPGA-based emulator can use special
constructs to map the DUT memory into on-board DRAM. It
is important to know ahead of time whether an FPGA-based
emulator has memory configuration limitations.
Below I have mapped top-level information for each
vendor, according to the emulation metrics I mentioned
earlier. I derived the information for the snapshot
below from each vendor's website plus my general
accumulated knowledge to date.
  - Category 1. Emulators based on application-specific
    processors. Cadence Palladium's processor is
    implemented in an ASIC-structured custom fabric.
    Mentor Veloce's processor is implemented in a
    custom FPGA fabric.
  - Category 2. Emulation with a standard FPGA product
    at its core. Synopsys-EVE is currently the player
    in this sector.
  - Category 3. Other emulators with HW based on a
    standard FPGA product. The primary differentiator
    between category 2 and 3 is capacity; however,
    there are other differences as shown below. Aldec,
    Bluespec, Cadence RPP, Dini Group, S2C, and
    HyperSilicon are the primary vendors in this
    segment.
The emulator best suited to the designer is defined by
what problem they are trying to solve.

Source: EVE, Embedded Computing, 2010


The optimal choice lies in the intersection of a number
of factors, with one example outlined above.
Emulation Vendors
    Category 1: Cadence Palladium, Mentor Veloce
    Category 2: Synopsys EVE Zebu
    Category 3: Other -- Aldec, Bluespec, Cadence RPP, HyperSilicon
Emulator Architecture
    Category 1: custom silicon, custom board, custom box (32 M to 2 B gates)
    Category 2: off-the-shelf FPGA, custom board, custom box (25 M - 200 M gates)
    Category 3: off-the-shelf FPGA, off-the-shelf board, off-the-shelf box
                (2 M - 50 M gates)
Price/gate
    Category 1: 2-5 cents
    Category 2: 0.5 - 2 cents
    Category 3: 0.25 - 1 cent
Dedicated Support
    Category 1: yes
    Category 2: mixed
    Category 3: no
Design Capacity
    Category 1: claims up to 2 billion gates; typical usage 100 M to 1 B gates
    Category 2: claims up to 1 billion gates; typical usage 25 M to 200 M gates
    Category 3: claims up to 50+ million gates; typical usage 2 M to 25 M gates
Primary Target Designs
    Category 1: SoCs 100 M to 1 B gates; large CPUs, GPUs, multi-chip systems,
                application processors
    Category 2: SoCs from 25 M to 200 M gates
    Category 3: IP blocks, sub-systems, and SoCs from 2 M to 25 M gates
Speed range (cycles/sec)
    Category 1: 100 K to 2 M
    Category 2: 500 K to 5 M
    Category 3: 500 K to 20 M
Compile time
    Category 1: 10-30 M gates/hour; single workstation (Palladium) or PC farm
                (Veloce); includes automated partitioning time.
                Parallelizable: yes
    Category 2: 25 M - 100 M gates/hr on a PC farm; proprietary software for
                fast FPGA partitioning, synthesis and P&R. Parallelizable: yes
    Category 3: 1 M - 15 M gates/hr on a PC farm; constrained by FPGA vendor
                synthesis and P&R times; doesn't include partitioning time.
                Parallelizable: yes
Partitioning
    Category 1: automated
    Category 2: automated
    Category 3: semi-automated; depends on # of FPGAs; time range 30 min to
                4 hours
Visibility
    Category 1: full visibility; at-speed probe capture
    Category 2: static and dynamic probes; at-speed probe capture
    Category 3: static and dynamic probes (vendor dependent); at-speed probe
                capture (vendor dependent)
Debug
    Category 1: breakpoints, assertions, simulation hot-swap, SW debug
    Category 2: breakpoints, assertions, simulation hot-swap, SW debug
    Category 3: breakpoints, assertions, simulation hot-swap, SW debug
Virtual platform API
    Category 1: yes
    Category 2: yes
    Category 3: varies by vendor
Transactor Availability
    Category 1: standard/off-the-shelf: good; custom: developed ad hoc
    Category 2: standard/off-the-shelf: mixed; custom: developed ad hoc
    Category 3: standard/off-the-shelf: good; custom: developed ad hoc
Verification Language Native Support
    Category 1: C++, SystemC, Specman e, SystemVerilog, OVM, SVA, PSL, OVL
    Category 2: synthesizable Verilog, VHDL, SystemVerilog
    Category 3: synthesizable Verilog, VHDL, SystemVerilog
Memory
    Category 1: up to 1 TB
    Category 2: up to 200 GB
    Category 3: up to 32 GB
Users
    Category 1: 1 to 512 users
    Category 2: 1 to 49 users
    Category 3: 1 user
Here is my quick summary of the different emulation
vendors for 2013.
Category 1:
  - Cadence Palladium. Hats off to Cadence for being
    pioneers in emulation and sustaining innovation to
    maintain a very competitive product year-over-year.
  - Mentor Veloce. Their revenue numbers show emulation
    is a growing segment for them. (See ESNUG 510 #7.)
    Clearly Wally and Greg have been investing heavily
    in emulation.
Category 2:
  - Synopsys EVE Zebu. This has been the choice for
    companies and design groups doing mid-size SoCs or
    blocks for emulation. It is no secret that Intel
    was an EVE customer. (See ESNUG 508 #6.)
    My expectation is that with the Synopsys
    acquisition, EVE will now move upstream to
    challenge Cadence and Mentor at the high end.
Category 3:
  - Aldec HES-DVM. The company initially grew out of
    providing system emulation/simulation using FPGAs
    for eventual implementation in FPGAs. FPGAs will
    continue to be a choice for system designers with
    low volumes, including the mil-aero world. Will
    they try to move into the SoC market?
  - Bluespec Semu. Bluespec expanded their emulation
    footprint in March with a new FPGA-based desktop
    form factor verification and hybrid emulator. They
    emphasize low cost, ease of use, fast deployment
    using third-party FPGA boards, dynamic hardware
    debug (no re-instrument and re-synthesis) and a C
    API to integrate SystemC/C/C++ models and test
    benches. Bluespec claims to need only 1 day of
    setup.
  - Cadence RPP. The Cadence FPGA-based Rapid
    Prototyping Platform is an FPGA-based prototyper
    for early software development and high-performance
    system validation. While not positioned as an
    emulator (see ESNUG 517 #6), it uses the core
    technology of FPGA-based emulators and confirms the
    need for boxes with higher performance and lower
    cost than processor-based emulators for pre-silicon
    software development.
  - The Dini Group. An established leader in FPGA
    boards for prototyping and emulation. The Dini
    Group consistently delivers high quality, high
    capacity boards with the shortest time-to-market
    for leading edge FPGAs.
  - S2C. Offers FPGA boards, software and IP for
    system-level design verification and acceleration.
    Their new boards are based on 14 M gate Xilinx
    FPGAs.

  - HyperSilicon. A company to watch from mainland
    China, focusing on FPGA prototyping boards. Offers
    boards similar to S2C.
My conclusion is that emulation has indeed gone
mainstream. Its growth extends from the rise of the SoC
as the cornerstone of system hardware, with its
associated multiple SW functions. What's also helped
emulation grow is its better debug, increased FPGA
sizes, and its newer ability to handle complex designs.
----

The reason for my report was to analyze the segment and
try to put some order to the market place. I'm not the
only source of info on this. I'd like to invite the
DeepChip readers to feel free to add their perspective
and to update the charts and data I have gathered.
Functional verification is primarily comprised of:
  - software simulation,
  - simulation acceleration,
  - FPGA prototyping, and
  - emulation.
Let's look at each of them briefly, and then compare
some of the elements that relate to the speed ranges of
each approach.
----

SOFTWARE SIMULATORS
A simulator is a software program that simulates an
abstract model of a particular system by taking an input
representation of the product or circuit, and processing
the hardware description language by compiling it.
A system model typically includes processor cores,
peripheral devices, memories, interconnection buses,
hardware accelerators and I/O interfaces.
Simulation is the basis of all functional verification.
It spans the full range of detail from transistor-level
simulation like SPICE to Transaction Level Modeling
(TLM) using C/SystemC. Simulation should be used
wherever it's up to the task -- it's the easiest to use
and the most general purpose.
But SW simulation hits a speed wall as the size and
detail of the circuit description increases. Moore's Law
continues to give us more transistors per chip, but
transistor speed is flattening out. So while computers
are shipping with increasing numbers of microprocessor
cores, the operating frequencies are stuck in the 2 - 3
GHz range. Since SW simulators (which run on these
computers) don't effectively utilize more than a handful
of PC cores, performance degrades significantly for
large circuits. It would take decades just to boot an
operating system running on an SoC being simulated in a
logic SW simulator.
Simulation acceleration, emulation, and FPGA prototyping
are all solutions to get around show-stopping slow PC
simulation speeds for large designs. They all attempt to
parallelize simulation onto larger numbers of processing
units. This ranges from two orders of magnitude (e.g.
hundreds of GPU processing elements) to nine orders of
magnitude (billions of FPGA gates).
----

SIMULATION ACCELERATION
Simulation acceleration implements a hardware
description language, such as Verilog or VHDL, according
to a verification specification. The results are the
same as with simulation, but faster.
  - Often simulation accelerators will use hardware
    such as GPUs (i.e. NVidia Kepler) or FPGAs with
    embedded processors.
  - Simulation acceleration involves mapping the
    synthesizable portion of the design into a hardware
    platform specifically designed to increase
    performance by evaluating the HDL constructs in
    parallel. The remaining portions of the simulation
    are not mapped into hardware, but run in a software
    simulator on a PC/workstation.
  - The software simulator works in conjunction with
    the hardware platform to exchange simulation data.
    Acceleration removes most of the simulation events
    from the slow PC software simulator and runs them
    in parallel on other HW to increase performance.
The final acceleration performance is determined by:
  1) the percentage of the simulation that is left
     running in software;
  2) the number of I/O signals communicating between
     the PC/workstation and the hardware engine;
  3) communication channel latency and bandwidth; and
  4) the amount of visibility enabled for the hardware
     being accelerated.
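A rough back-of-the-envelope way to combine these factors (my own
formulation, not from any vendor) is an Amdahl's-Law-style estimate:

    speedup  ~=  1 / ( f_sw  +  (1 - f_sw)/S_hw  +  t_comm )

    where  f_sw   = fraction of simulation work left in the SW simulator,
           S_hw   = raw speedup of the hardware engine on the mapped part,
           t_comm = per-cycle communication overhead (I/O count, latency,
                    bandwidth), expressed as a fraction of the original
                    per-cycle runtime.

Even with an enormous S_hw, the f_sw and t_comm terms dominate -- which
is why factors 1) through 3) above usually decide the result.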
----

EMULATORS
An emulator maps an entire design into gates or Boolean
macros that are then executed on the emulator's
implementation fabric (parallel Boolean processors or
FPGA gates) such that the emulated behavior exactly
matches the cycle-by-cycle behavior of the actual
system.
  - Processor-based emulator. The design under test is
    mapped to special purpose Boolean processors.
  - FPGA-based emulator. The design under test is
    mapped to FPGA gates as processing elements.
Elsewhere in this report, I go into more detail on
emulation including: emulation drivers; metrics to
evaluate emulation; and a top-level comparison chart of
commercial emulation systems against those metrics.
----

FPGA PROTOTYPING
An FPGA prototype is the implementation of the SoC or IC
design on an FPGA. The prototype environment is real,
with real input and output streams. Thus the FPGA
prototyping platform can provide full verification of
hardware, firmware, and application software design
functionality.
Some problems associated with FPGA prototyping are:
  - Debug Confusion: Because you mapped your design
    into an FPGA, you can expect to spend some extra
    time debugging it, to identify problems that are
    relevant ONLY to your prototype, but that are not
    necessarily bugs inside your actual design.
  - Partitioning: Your design must be partitioned
    across multiple FPGAs. Further, sometimes
    repartitioning may be necessary when design changes
    are made. Partitioning challenges can also apply to
    emulation, so I discuss them in the emulation
    metrics section.
  - Timing (Impedance) Mismatches: If your FPGA
    prototype connects to real world interfaces, such
    as Ethernet or PCIe, then you have to ensure that
    it is capable of supporting the interface. That is,
    mismatched timing can sometimes be a problem. This
    can involve "speed bridging" to an FPGA.
If your design can fit into a few FPGAs, and you have
adequate support, then FPGA prototyping can be very
effective -- especially when real-time performance is
vital.
----

A BASIC COMPARISON
Below I characterize the various types of
hardware-assisted verification approaches:


SW simulation
    Element: x86 cores.  Granularity: under 16 computing elements.
    Speed per element: 3 GHz.  Cycles/sec (100 M gates): under 1.
    Vendors: Cadence Incisive/NC-Sim, Synopsys VCS, Mentor Questa.

Simulation acceleration
    Element: GPU processing elements.  Granularity: 100's.
    Speed per element: 1 GHz.  Cycles/sec (100 M gates): 10 to 1,000.
    Vendors: Rocketick.

Processor-based emulation
    Element: custom processors.  Granularity: 1,000's.
    Speed per element: under 1 GHz.  Cycles/sec (100 M gates): 100 K to 2 M.
    Vendors: Cadence Palladium.

FPGA-based emulation
    Element: FPGA gates.  Granularity: millions.
    Speed per element: 1 MHz - 100 MHz.  Cycles/sec (100 M gates): 500 K - 2 M.
    Vendors: Mentor Veloce, Synopsys EVE-Zebu, Bluespec, Aldec,
             Cadence RPP, HyperSilicon.

FPGA prototyping
    Element: FPGA gates.  Granularity: millions.
    Speed per element: 1 MHz - 100 MHz.  Cycles/sec (100 M gates): 500 K - 20 M.
    Vendors: Synopsys HAPS, internal boards.
Notice that any given approach's processing-element
speed is inversely related to the approach's ultimate
performance in cycles/sec. This is basically a
reflection that, for functional verification,
concurrency (more computing elements) trumps clock speed
(speed per computing element).
I put in a list of the commercially announced vendors
that I am aware of.

The graphic below roughly shows how each of these basic
approaches compares relative to simulation design
frequency and design size.

Notice emulation's 1,000X to 1,000,000X faster run time
over SW simulation as your design size goes from 10 K
gates to 20 M gates. Also notice emulation's capacity of
2 B gates, while SW simulation and acceleration both top
out at around 20 M gates -- a 100X difference in
capacity.
The rest of my analysis will expand on the emulation
segment.
EMULATION IS BECOMING UBIQUITOUS
Emulation emerged in the early 1990s. In its early days,
emulation was primarily deployed by a few large
corporations to verify large system-level designs --
specifically for microprocessors and graphics
processors. The turning point was when audio and video
recordings were digitized. Where possible, designs
needed to be verified for both functionality and
software behavior against voluminous real data before
committing to silicon.

By 2013, we are moving toward emulation becoming
ubiquitous:
  - Current mobile devices have hundreds of
    applications. The vast majority have audio and
    video use as one of their system appliance "must
    haves".
  - Subsystems now are the same size as what entire
    designs and processors were in the 90s. Teams are
    now using emulators to verify many of these larger,
    more complex blocks.
  - Because of compute-speed and capacity gains,
    emulation is becoming an attractive option for more
    mainstream verification tasks, such as verifying
    individual IP blocks in the low millions of gates.
  - Emulation is even now needed for smaller, under 1
    million gate designs -- if there is a lot of
    control complexity with a large number of cycles --
    such as with an H.265 encoder/decoder. In fact,
    simulating high density videos on an H.265 decoder
    would be impractical (because it would take weeks
    to do) if they couldn't instead be run in seconds
    in an emulator.
It's safe to say that simulation is vital for functional
verification. However, software simulation (by itself)
is way too slow to capture events that occur when an
application runs for even a short time on an actual
instance of hardware. Emulation is the only practical
way to get usefully long SW application runtimes on a
hardware instance.
----

IT'S NOW A SW DOMINATED WORLD
I see some of the same current top level drivers for
emulation as I did for On-Chip Communications Networks:
SoC growth is being driven by mobile consumer devices,
with the corresponding pressures for:
  - smaller die size and reduced cost
  - more functionality
  - increased battery life
  - true gigahertz performance

For years, hardware verification was primarily about
meeting the hardware design specification. Intel built
the processor, Microsoft built the operating system, and
Phoenix gave you BIOS. The software conformed to the
hardware, meaning that the software developer lived
within the constraints of the hardware design.
Fast forward to the 2013 system design world. You'll see
it's now software dominated. Intel, Microsoft and
Phoenix are still around but struggling; the system
design community is now serving consumers by way of
Google Android, ARM, and their SW ecosystems.

    Typical SoC -- dozens of HW blocks but millions of
    lines of code. (Source: Texas Instruments)
Notice that there are millions of lines of code for
roughly a dozen hardware subsystems in a typical SoC.
----

'GOODNESS' IS ABOUT A DESIRED BEHAVIOR
While not explicitly aware of it, consumers'
expectations are high with regard to how well the
hardware and software work together.
Software is about the experience, not functionality.
The software dictates the behavior of an SoC -- in other
words, software now forces constraints on the hardware
architecture specification.

Verification encompasses more than "does it do what the
specification says?". It now asks "does the
specification deliver the end user experience that I'm
looking for?". The goal is not just to build the design
right, but to build the system that behaves correctly.
How do you ensure your design is up to its intended use?
The answer is by using increasingly capable emulation
systems.

Source: Bluespec, Inc.

You load the operating system and observe the behavior
on the hardware; software developers run their
application on the virtual machine.
----

IP USE AND REUSE IS EXPLODING
Because today's designs are getting so big and so
complex, they're next to impossible to design without
using 3rd party IP, IP from fabs, and internal IP reuse.
Semico predicts an average 2013 design will have close
to 90 IP cores. This will only grow -- and the more IP
that's (re)used, the more each block will demand
simulation compute resources -- increasing the need for
emulation speed and capacity.

    Source: International Business Strategies, 2012

And with each node shrink -- 90nm, 65nm, 45nm, 28nm,
20nm -- the average number of IP cores used grows, plus
the cost-to-design grows.
----

SW SIMULATION HAS HIT A WALL
SW simulators (like VCS or NC-Sim or Questa) all
parallelize across only a small number of x86
processors, so their performance has been scaling only
as fast as x86 frequency, which has flattened out. SW
simulators have become inadequate for verification
within cores and core-to-core.
Emulation is becoming ubiquitous where SW simulation
hits a wall. Emulation runs orders of magnitude faster
than SW simulation, and its ability to run software in
megahertz rather than kilohertz (or even hertz) allows
teams to do verification that is otherwise
time-prohibitive.
----

EMULATION AND SW SIMULATION COMPLEMENT EACH OTHER
SW simulation is great in the early phases of
verification:
  - Designers and verification engineers are constantly
    finding bugs, changing the design, and
    re-verifying. In this phase, the fast compilation
    time that SW simulation offers for each new run is
    critical. And they're not at the point yet where
    you need to load the operating system.
  - If you have a fixed function or hard-wired
    integrated circuit, you can just run your
    Verilog/VHDL to verify that your design does what
    the specification describes.
What drives emulation use:
  - The one downside to emulation is it takes a
    measurable compile time. But emulation's 1,000X to
    1,000,000X faster run time wins out over SW
    simulation when the number of cycles that must be
    run offsets the longer compilation time for
    emulation (see the quick break-even arithmetic
    after this list). As engineers progress through the
    design cycle, they must run more and more cycles to
    identify new bugs.
  - The reduced total cost of emulation ownership is
    another factor helping it grow. Current emulation
    system cost can go as low as 0.25 to 0.50 cents per
    gate.
  - Even smaller companies can now use it. Emulation
    used to just be the domain of big companies with
    huge support teams -- they might spend $2 M on an
    emulator, and then find they need a service
    contract for $1 M to make it work. Many of today's
    emulation systems are fairly plug-and-play, with
    far reduced support requirements. I go into more
    detail regarding some of these cost of ownership
    factors in the metrics section of this report.
  - Increased static and dynamic probing inside today's
    much bigger FPGAs has helped emulation grow, too.
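Here is the break-even arithmetic behind that first bullet, in my own
rough notation (illustrative only; plug in your own compile times and
speeds):

    total time     =  compile time  +  (cycles to run) / (cycles per sec)

    break-even N*  =  (T_compile_emu - T_compile_sim)
                      ---------------------------------
                           (1/R_sim  -  1/R_emu)

    where  T_compile_emu, T_compile_sim = compile times for the emulator
                                          and the SW simulator,
           R_emu, R_sim                 = their run rates in cycles/sec.

Once the cycle count of a regression or SW boot exceeds N*, the
emulator's longer compile has already paid for itself.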
----

THE PARTITIONING PROBLEM HAS RECEDED
One significant obstacle to adoption, particularly for
FPGA-based emulators, has been the partitioning
requirement (also called mapping). A designer would have
to partition his register transfer level design into
sections which fit inside each FPGA and then connect all
the sections via the FPGAs' I/Os. It was a tremendous
challenge to balance the optimal sizing of the
partitioned elements against managing the number of FPGA
connections within the limits of the available I/Os.
Partitioning software such as what Auspy Development
offers has helped. However, it does not entirely
eliminate the manual effort. The partitioning process
can result in unnatural design hierarchies, and critical
timing paths that cross FPGA boundaries can mandate
re-partitioning.
One major technology enabler behind the proliferation of
emulators relates to Moore's Law: it's now possible to
have 30 million gates in just 2 FPGAs. This higher
gate-count-per-FPGA means less design partitioning is
necessary. For example, below we see two off-the-shelf
FPGA boards.

    Source: The Dini Group                    Source: Aldec
  - The first board has 6 FPGAs; each FPGA is a
    Virtex-6 LX550T capable of supporting 4 M gates,
    for a total board capacity of 24 M gates.
  - The second board has only 2 FPGAs; each FPGA is a
    Virtex-7 2000T capable of supporting 14 M gates,
    for a total board capacity of 28 M gates.
Cutting a design in half is much easier and safer than
cutting a design into 6 parts. This lessens a major
adoption obstacle for FPGA-based emulators, helping them
to close the gap with custom processor based emulators,
as well as making emulation more attractive for smaller
designs.
Higher capacity FPGAs also push down the cost of the
emulation systems; for example, some systems now use
off-the-shelf FPGAs at $4 K per FPGA, reducing their
core hardware cost. (Further, lower capacity emulators
can often add next generation FPGAs shortly after they
are released by Xilinx and Altera.)
----

SW simulation
    Element: x86 cores.  Granularity: under 16 computing elements.
    Speed per element: 3 GHz.  Cycles/sec (100 M gates): under 1.
    Vendors: Cadence Incisive/NC-Sim, Synopsys VCS, Mentor Questa.

Simulation acceleration
    Element: GPU processing elements.  Granularity: 100's.
    Speed per element: 1 GHz.  Cycles/sec (100 M gates): 10 to 1,000
    (NVIDIA: 17X).
    Vendors: Rocketick.

Verification acceleration
    Element: mix of host and emulator.  Granularity: millions.
    Speed per element: under 1 GHz.  Cycles/sec (100 M gates): 40 to 10,000.
    Vendors: Cadence Palladium XP + RTL sim, Mentor Veloce + RTL sim,
             Synopsys Zebu Server + RTL sim.

Processor-based emulation
    Element: custom processors.  Granularity: 100s of 1,000s to millions.
    Speed per element: under 1 GHz.  Cycles/sec (100 M gates): 100 K to 2 M;
    it actually scales better with design size than FPGA-based emulation.
    Vendors: Cadence Palladium.

FPGA-based emulation
    Element: FPGA gates.  Granularity: millions.
    Speed per element: 1 MHz - 100 MHz.  Cycles/sec (100 M gates): 500 K - 2 M;
    does not scale well with design size, so reaching the high end at 100 M
    gates is very unlikely. Debug causes further slowdown.
    Vendors: Mentor Veloce, Synopsys EVE-Zebu.

FPGA prototyping
    Element: FPGA gates.  Granularity: millions.
    Speed per element: 1 MHz - 100 MHz.  Cycles/sec (100 M gates): 2 M - 50 M,
    sometimes up to 100 M.
    Vendors: Synopsys HAPS, Cadence RPP, DINI, Aldec, S2C, HOENS,
             Hitech Global, ProDesign.

Palladium P64 vs. Veloce 2 Quattro
    Granularity:      Palladium P64: 4 MG to 256 MG, resulting in much
                      better utilization.  Veloce 2 Quattro: 16 MG to 256 MG.
    Number of users:  Palladium P64: 64 users.  Veloce 2 Quattro: 16 users.
    Speed:            Palladium P64: 1-1.5 MHz (as per datasheet), degrading
                      with design size due to architecture.  Veloce 2 Quattro:
                      2 MHz (as per datasheet), scaling with design size.
    Capacity:         Palladium P64: 256 MG nominal, 90% to 100% utilization
                      => 256 MG actual capacity.  Veloce 2 Quattro: 256 MG
                      nominal, 60% to 75% utilization => 200 MG actual
                      capacity.
