Hi, John,

Here are the 14 major metrics that I feel a design team must
consider when deciding on what specific emulator to use on their
project (if any):

    1. Price/Gate
    2. Initialization and Dedicated Support
    3. Capacity
    4. Primary Target Designs
    5. Speed Range
    6. Partitioning
    7. Compile Time
    8. Visibility
    9. Debug
    10. Virtual Platform API
    11. Transactor Availability
    12. Verification Language and Native Support
    13. Number of Users
    14. Memory Capacity

Below I explain in detail the impact and pitfalls each metric will
have on your team's emulation decision.
- Jim Hogan
Vista Ventures, LLC
Los Gatos, CA







1. Price/Gate
The actual cost of an emulator typically runs from 1 to 5
cents-per-gate for higher capacity emulators, both processor-based
and FPGA-based. There is usually some recurring cost for software
and maintenance.
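
As a rough illustration of the cents-per-gate arithmetic, here is a minimal Python sketch. The function name and the 100 M gate example are my own assumptions for illustration; the sketch ignores the recurring software and maintenance costs mentioned above.

```python
def hardware_cost_range(gates, low_cents_per_gate, high_cents_per_gate):
    """Return the (low, high) purchase cost in dollars for a given capacity.

    Illustrative only: recurring software/maintenance costs are not included.
    """
    low = gates * low_cents_per_gate / 100.0    # cents -> dollars
    high = gates * high_cents_per_gate / 100.0
    return low, high


# Hypothetical 100 M gate emulator priced at 1 to 5 cents-per-gate.
low, high = hardware_cost_range(100_000_000, 1, 5)
print(f"Estimated cost: ${low:,.0f} to ${high:,.0f}")
```

At these rates a 100 M gate system lands between one and five million dollars, which is why the per-gate figure matters so much at the high end.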

The lower capacity FPGA-based emulators are typically priced as
separate HW and SW components. The HW consists of off-the-shelf
FPGA prototyping boards that typically cost 0.25 to 1
cent-per-gate. The SW that provides emulation features on top of
the prototyping boards is typically priced like software
2. Initialization and Dedicated Support
When assessing an emulator's total cost of ownership, the cost of
dedicated human support is a key factor, especially as it is an
ongoing expense.

The large processor-based emulators virtually require at least one
support person, if not a team, dedicated to emulator support. It's
not due to any inherent issues with emulators; it's just the sheer
scope and magnitude of the verification projects being tackled.

The large FPGA-based emulators' need for dedicated support is
similar, also depending on the size and complexity of the emulated
designs and the number of end users involved.

Further, the time to initialize an emulator and set up models can
take 6 months or more. Transactors may need to be developed to
connect test benches to the design-under-test (DUT). They are also
needed to connect host-based virtual platforms to emulators and
FPGA prototyping systems. This is part of a growing trend to
improve the accuracy and performance of virtual platforms for
performance modeling and pre-silicon software development.

The most complex transactors - such as PCIe transactors - can take
more than a year to develop. They can be the source of
functionality and performance bugs that can delay the emulation
project even longer.

And test bench re-partitioning between the host and emulator may
also be required to achieve acceptable performance.
3. Capacity
There are subtle issues around an emulator's capacity; a given
capacity can be delivered in multiple ways. First, there is the
straightforward capacity measurement in terms of total number of
gates: emulators currently range from 2 million to 2 billion gates.

Second, there is the granularity in terms of the number of devices
(ASICs or FPGAs), boards, or boxes that are used to reach a given
capacity.

Processor-based emulators are architected to provide capability in
a seamless way, where it looks monolithic to users. With FPGA-based
emulators, if it is a vertically integrated box, it should also
look monolithic to a user. Synopsys-EVE reaches its 1 B gate
capacity in a monolithic way by connecting multiple emulator boxes.

A vendor may hide the emulator granularity from the customer, but
higher granularity generates more communication overhead, which
generally reduces performance. This is true whether the emulator is
processor-based or FPGA-based. For instance, if you were emulating
10 M gates and it fit on a single FPGA, it could run approximately
5X faster than if the same 10 M gate capacity were to be divided
over multiple FPGAs.

Low-to-mid FPGA-based emulators typically expose more granularity
to the end user. However, an offset to this is that these boxes
tend to take advantage of the increased capacity of newer FPGAs
more quickly than the mid-to-high-capacity FPGA-based emulators.

FPGAs on the leading edge of Moore's Law are among the first things
manufactured. New FPGA product cycles run 12 to 18 months. In
contrast, custom chip cycles are at least 4 years, and the cost of
development, which has generally been increasing for custom ASICs,
is borne by the custom-processor-based emulation vendor.
4. Primary Target Designs
SoCs are the dominant workhorse for systems companies today. The
target design sweet spot for each emulator type is typically
defined by the capacity of the emulator as it relates to your
design.

As mentioned earlier, complexity is one of the factors driving more
mainstream designs into requiring emulation. Control complexity can
require many verification cycles. Emulation applications range from
CPUs, GPUs, application processors such as video, audio, security
and datapath, to IP blocks and subsystems.
5. Speed Range
Emulator speed is measured in cycles-per-second. Processor-based
speeds range from 100 K to 4 M cycles/sec, while FPGA-based speeds
range from 500 K to 50 M cycles/sec, depending on the number of
devices.
6. Partitioning

Partitioning is harder than it sounds. Partitioning would be easy
if your design could be broken up into latency insensitive blocks
that would eliminate any strict latency requirements between
partitions. But that's hardly ever the case.

- Processor-based emulators have many processing units, so large
  designs must be partitioned across these units. Processor-based
  systems basically run software using many cores, and the
  partitioning is more or less transparent to the user.

- FPGA-based systems are hardware-centric, based on multiple FPGAs.
  Due to the FPGA boundaries you must first split the design into
  reasonably sized pieces, so that each sub-system will fit on one
  FPGA device.

Unfortunately, you then usually end up needing more logical
connections between partitioned sections of the design than there
are physical wires between FPGAs to make those connections! These
logical connections can exceed the physical connections by 2X to
100X.

So when you stitch your partitioned design elements together to
connect them, rather than just doing a simple logical connection,
you must multiplex signal pins over the FPGA connections. This adds
substantially to complexity, particularly if this process is not
completely automated. The entire process takes additional time and
effort that opens the possibility for errors to be injected in
partitioning or reconnecting.
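
To make the pin-multiplexing arithmetic concrete, here is a minimal Python sketch. The function name and the example net and wire counts are illustrative assumptions, not taken from any vendor tool; it only computes how many logical signals must time-share each physical wire across an FPGA boundary.

```python
import math


def mux_ratio(logical_connections, physical_wires):
    """Number of logical signals that must share each physical wire."""
    if physical_wires <= 0:
        raise ValueError("physical_wires must be positive")
    return math.ceil(logical_connections / physical_wires)


# Hypothetical cut between two FPGAs: 12,000 logical nets, 600 wires.
ratio = mux_ratio(12_000, 600)
print(f"Each physical wire carries {ratio} multiplexed signals")
```

A ratio of 20 sits inside the 2X to 100X range quoted above; the higher the ratio, the more multiplexing logic the partitioner has to insert and the more there is to get wrong.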
A deeply partitioned design can be very difficult to debug manually
or semi-automatically. Design groups sometimes give up on the task
because they can't get it right. Certain debug features don't work
as well in partitioned systems; debugging a single FPGA is easier.

This debug concern only applies to FPGA-based emulators; the
processor-based emulators have complete visibility across all of
their processors.

Most processor-based and larger FPGA-based emulation vendors try to
automatically take care of partitioning such that the end user
doesn't have to deal with it.

- Automatic partitioning is rule-based and is assumed to be
  correct-by-construction. If you have a problem with your
  partitioning, it's a support call to the emulation vendor.

- The greatest partitioning risk occurs in the smaller FPGA-based
  emulator group. Their (in)ability to partition should be
  assessed -- particularly when partitioning across more than two
  FPGAs. Some of these vendors use a partitioning tool from Auspy;
  however, partitioning is inherently never completely push button.

Thankfully, Moore's Law has now made it possible to have 30 M gates
in just two FPGAs. This neutralizes the partitioning issue for a
substantial fraction of emulation projects. If no messy multi-part
partitioning is needed, there's no problem.
between only two devices is
7. Compile Time
Emulator compile time is the total time to prep a job for execution
on your emulation system, including synthesis and routing. For
FPGA-based emulators, compilation time is primarily determined by
the FPGA routing tools. Further, some FPGA-based emulators are
starting to provide routing -- which enables incremental compiles.

Another variable for compile time is whether your design
partitioning between the devices is automated or not.

- For processor-based emulators this design partitioning is
  automated and quite complete, and that time is included in the
  compile time. The compile runs on a single workstation.

- For FPGA-based emulators partitioning can be quite variable; your
  total compile time can more than double if the partitioning is
  not automated and requires user interaction.

Another factor for compile time is whether it's parallelizable -
i.e. whether it can be broken into independent jobs to be run on
separate workstations.

For FPGA-based emulators, once your partitioning is complete, your
FPGA compilation can be sped up significantly by running each FPGA
compile on a separate workstation. That's not the case for
processor-based emulators.
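
The benefit of running per-FPGA compiles in parallel can be sketched in a few lines of Python. This is an illustrative model only: the per-FPGA compile times are made-up numbers, and the greedy longest-job-first assignment is my assumption about how jobs might be spread over a PC farm, not a description of any vendor's scheduler.

```python
import heapq


def serial_compile_hours(per_fpga_hours):
    """Total time when every FPGA compile runs back-to-back on one machine."""
    return sum(per_fpga_hours)


def parallel_compile_hours(per_fpga_hours, workstations):
    """Estimated wall-clock time when independent FPGA compiles are spread
    over several workstations (greedy longest-job-first assignment)."""
    if workstations < 1:
        raise ValueError("need at least one workstation")
    loads = [0.0] * workstations           # running total per workstation
    heapq.heapify(loads)
    for hours in sorted(per_fpga_hours, reverse=True):
        lightest = heapq.heappop(loads)    # least-loaded workstation so far
        heapq.heappush(loads, lightest + hours)
    return max(loads)


# Hypothetical design split across four FPGAs (compile hours per FPGA).
jobs = [3.0, 2.0, 1.5, 1.0]
print("serial:        ", serial_compile_hours(jobs))
print("4 workstations:", parallel_compile_hours(jobs, 4))
print("2 workstations:", parallel_compile_hours(jobs, 2))
```

With one workstation per FPGA, the wall-clock time collapses to the slowest single FPGA compile, which is why the slowest partition is the one worth optimizing.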
8. Visibility
Visibility is the ability to see signals inside a design. You have
full visibility with SW simulation, because simulation is software
and the program state is inside the simulator; it is very flexible
and you can log any state.

For FPGA-based emulators, there's static and dynamic visibility.
Static visibility refers to signal probes that are defined at
compile time. They require FPGA resources and run fast but usually
cover only a small subset of a design's signals. If you need to
change that subset of signals, you need to recompile. Dynamic
visibility refers to signal probes that do not need to be defined
at compile time. These do not require additional FPGA resources and
run much slower than static probes, but cover a much larger subset
of a design's signals.

- The signal visibility with processor-based emulators is basically
  the same as simulation. That's because all signal states reside
  as a software addressable register somewhere in the processor
  array.

- With FPGA-based emulators, to see the internal signals, you must
  route them out through some type of multiplexing network to the
  pins. This adds physical gate overhead whenever you take a signal
  and connect to it through an I/O multiplexer to the pins of the
  FPGA. The multiplexer and wires create trees of logic that add
  area overhead.

Additionally, those gates and wires add a performance overhead;
every cycle requires capturing the state and eventually sending it
to a host for storage. If you try to probe every signal, your
effective design capacity can go down by a factor of 2-5. You must
navigate this carefully, or your design won't fit.
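
A quick way to sanity-check the "will it still fit?" question is sketched below in Python. The function names and the 30 M / 10 M gate figures are hypothetical; the only input taken from the text is the 2X to 5X capacity reduction when probing heavily.

```python
def effective_capacity(raw_gates, probe_overhead_factor):
    """Usable design gates once probing shrinks capacity by the given factor."""
    if probe_overhead_factor < 1:
        raise ValueError("overhead factor must be >= 1")
    return raw_gates / probe_overhead_factor


def design_fits(design_gates, raw_gates, probe_overhead_factor):
    """True if the design still fits after the probing overhead is applied."""
    return design_gates <= effective_capacity(raw_gates, probe_overhead_factor)


# Hypothetical box with 30 M raw gates and a 10 M gate design.
for factor in (2, 5):
    usable = effective_capacity(30_000_000, factor)
    fits = design_fits(10_000_000, 30_000_000, factor)
    print(f"overhead {factor}X: usable {usable:,.0f} gates, fits: {fits}")
```

In this example the design fits at a 2X overhead but not at 5X, which is the kind of margin worth checking before deciding how many signals to probe.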
However, the signal visibility gap is narrowing between
processor-based and FPGA-based emulators based on two key factors:

- FPGA capacity improvements provide increased capacity for static
  probing inside of FPGAs.

- Xilinx has built a feature in their FPGAs which offers dynamic
  probes for register states. This Xilinx feature enables better
  multiplexing in the chip to access the signals without overhead.

With effort, the individual emulator vendors can all but eliminate
the area overhead, which then means less time recompiling to be
able to see a different set of signals. Some emulator vendors are
taking advantage of this.

The bottom line is that emulation vendors can be assessed on the
area and performance hit for any given number of probes, as well as
whether they offer dynamic probing.
9. Debug
With the exponential rise in complexity, debug is essential - as
you can see in the chart below, teams spend 1.6x more total time
debugging (42%) than they do developing testbenches (26%) and
writing/running tests

A robust debug capability is critical, and processor-based
emulators today have debug capabilities that approach that of SW
simulator debug.

Fundamental emulator debug capabilities can include:

- Breakpoints. The ability to pause an emulation run based on
  event triggers.
- Assertions support. It flags when assertions, or logic
  statements that define the intended behavior of a design, are
  violated.
- Simulation hot-swap. The ability to automatically transfer
  execution to a connected simulator for more in-depth debug, in
  the form of greater visibility and control.
- Software debug. It can run software debuggers
on the
embedded code being executed by the
Debug is an area undergoing significant innovation and should be
assessed beyond the simple list I show in the comparison chart.
10. Virtual Platform API
Hybrid platforms consist of emulators co-simulating with virtual
platforms, where a virtual platform simulates whole chips on a
workstation by eliminating detail from the hardware, and plugging
C-models together.

Given the prevalence of virtual platforms, it is important for the
emulator to have a standard virtual platform API, so that if an
engineer doesn't have a virtual platform C-model for a component,
they can plug the component into the emulator instead.

For example, you may have an RTL USB 3.0, but no C model. An
emulator with a virtual platform API could allow you to co-simulate
with your virtual platform, running the USB RTL at up to 4 M
cycles/sec. In contrast, the RTL might run at only 5 K cycles/sec
in a SW simulator.
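
The gap between 4 M cycles/sec and 5 K cycles/sec is easier to appreciate as wall-clock time. The Python sketch below uses the two rates from the USB example; the one-billion-cycle workload is a hypothetical number I picked for illustration.

```python
def run_time_seconds(cycles, cycles_per_sec):
    """Wall-clock seconds needed to execute a given number of design cycles."""
    return cycles / cycles_per_sec


CYCLES = 1_000_000_000          # hypothetical workload: one billion cycles

emulator_s = run_time_seconds(CYCLES, 4_000_000)   # emulator at 4 M cycles/sec
simulator_s = run_time_seconds(CYCLES, 5_000)      # SW simulator at 5 K cycles/sec
speedup = simulator_s / emulator_s

print(f"emulator:  {emulator_s:,.0f} s")
print(f"simulator: {simulator_s:,.0f} s ({simulator_s / 3600:.1f} hours)")
print(f"speedup:   {speedup:,.0f}X")
```

For this workload the emulator finishes in minutes while the SW simulator needs more than two days.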
11. Transactor Availability
Transactors facilitate the communication between the emulator and
other platforms with different levels of abstraction. Transactors
are critical to making emulators work in that they remove the speed
bottlenecks associated with co-emulation.

- Transactors convert transactions coming from outside the emulator
  to bit-level interfaces inside the emulator and vice versa; as
  emulators move to the mainstream, they need to communicate with
  simulators and virtual platforms more frequently.

- Transactors also serve as interfaces when you connect inputs and
  outputs to the I/O of a design (SoC) under test (DUT) in an
  emulator. The transactors let you go between I/Os on the DUT to
  anything communicating with it on the host simulator, C model, or
  possibly another I/O board bringing in a live connection, such as
  Ethernet. Below shows an example.

Source: Mentor Graphics

- One definition of a transactor is anything that moves something
  from one abstraction level to another. For example, at higher
  abstraction levels transaction level modeling (TLM) is standard,
  whereas in the RTL space, bit-level interfaces are the norm.

- Hardware transactors eliminate traffic over co-emulation links to
  optimize performance. They do this by translating a compound
  transaction into a bit level handshake exploded in space and time
  in the emulator where it's more efficiently handled. For example,
  a single transaction over the co-emulation link to "move a
  128-word block of memory from location A to location B" is
  exploded into a sequence of 128 bit-level interface operations in
  the emulator.
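
The block-move example above can be sketched in Python to show what "exploding" a compound transaction means. This is a toy model under my own assumptions: the `BlockMove` class, the word-addressed memory, and the example addresses are all hypothetical, and a real hardware transactor would drive bit-level bus signals rather than return a list.

```python
from dataclasses import dataclass


@dataclass
class BlockMove:
    """One compound transaction sent over the co-emulation link."""
    src: int     # word address of location A (hypothetical)
    dst: int     # word address of location B (hypothetical)
    words: int   # number of words to move


def expand_block_move(txn):
    """Explode one block move into per-word (read_addr, write_addr)
    operations, as the emulator-side half of a transactor would."""
    return [(txn.src + i, txn.dst + i) for i in range(txn.words)]


txn = BlockMove(src=0x1000, dst=0x2000, words=128)
ops = expand_block_move(txn)
print(f"1 link transaction -> {len(ops)} emulator-side operations")
```

The point of the design is that only the single `BlockMove` crosses the slow host link; the 128 individual operations happen inside the emulator at emulator speed.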
The overall emulator performance is very sensitive to transactor
quality. For example, an optimized transactor may run at 1 M
cycles/sec compared with only 100 K cycles/sec for an unoptimized
transactor -- a 10X performance delta.

The individual emulator vendors provide off-the-shelf transactor
libs for standard interfaces such as memory and high speed I/O;
these standard transactors can be used for different designs. Also,
custom transactors must often be developed for each design.

Some projects may require 10-20 transactors, of which 5-10 may be
custom, ranging from UARTs to proprietary system bus interconnects.
Performance optimization of these custom transactors is hard -- the
time can range from 1 to 12 months. Different transactor versions
can be required to support multiple emulation platforms.
12. Verification Language and Native Support
Regardless of transactor quality, sometimes you must move selected
components, models or testbenches over to a SW simulator. To do so
efficiently, ideally your emulator supports your verification
language.

- Processor-based emulators typically have broad native support for
  verification languages. This is because the emulation vendors
  often also provide simulators -- an inherent advantage.

- FPGA-based emulators don't have native support for languages such
  as C++ or SystemC; they are typically limited to synthesizable
  Verilog and VHDL. The drawback is that developing synthesizable
  Verilog/VHDL takes more time than writing a higher level
  language.

Another item that can severely limit emulator performance: when
test bench components run too slowly on the host or generate too
much traffic on the link between the host and the emulator.
Synthesizable test benches running in the emulator can reduce a
host emulator bottleneck by 5X to 10X.

13. Number of Users

The bigger an SoC, generally the larger the engineering team and
the more geographically dispersed it is. Therefore, the maximum
number of users that can run an emulation at the same time on
elements of the SoC should be taken into account.
14. Memory Capacity
Processor-based emulators have a high memory capacity, up to 1 TB.
This memory is used more like the memory in a simulator; mapping
DUT memories is transparent and usually not an issue.

In contrast, DUT memory on FPGA-based emulators is more explicitly
mapped to hard macro memory blocks in the FPGA devices. The largest
FPGA devices have about 50 M bits per device; this capacity is
usually enough for the memory required in DUT partitions. Sometimes
a DUT memory does not map well into the FPGA device memory, in
which case the FPGA-based emulator can use special constructs to
map the DUT memory into on-board DRAM. It is important to know
ahead of time whether an FPGA-based emulator has memory
configuration limitations.
Below I have mapped top-level information for each vendor,
according to the emulation metrics I mentioned earlier. I derived
the information for the snapshot below from each vendor's website
plus my general accumulated knowledge to date.
- Category 1. Emulators are based on application-specific
  processors. Cadence Palladium's processor is implemented in an
  ASIC-structured custom fabric. Mentor Veloce's processor is
  implemented in a custom FPGA fabric.

- Category 2. Emulation with a standard FPGA product at its core.
  Synopsys-EVE is currently the player in this sector.

- Category 3. Other emulators with HW based on a standard FPGA
  product. The primary differentiator between category 2 and 3 is
  capacity; however, there are other differences as shown below.
  Aldec, Bluespec, Cadence RPP, Dini Group, S2C, and HyperSilicon
  are primary vendors in this segment.
The emulator best suited to the designer's problem is defined by
what problem they are trying to solve.

Source: EVE, Embedded Computing, 2010

The optimal choice lies in the intersection of a number of factors,
with one example outlined above.
Category 1: Emulation -- Cadence Palladium, Mentor Veloce
  Architecture:     custom silicon, custom board, custom box
                    (32 M to 2 B)
  Price/gate:       2-5 cents
  Capacity:         Claims up to 2 billion. Typical usage 100 M to
                    1 B gates.
  Target designs:   SoCs 100 M to 1 B gates. Large CPUs, GPUs,
                    multi-chip systems, application processors
  Speed range:      100 K to 2 M
  Compile time:     10-30 M gates/hour. Single workstation
                    (Palladium). PC farm (Veloce). Includes
                    partitioning time. Parallelizable: Yes
  Visibility:       full visibility. at-speed probe capture.
  Debug:            simulation hot-swap, SW debug.
  Transactors:      Off-the-shelf: Good. Custom: developed ad hoc
  Language:         C++, SystemC, Specman e, SystemVerilog
  Memory:           up to 1 TB
  Users:            1 to 512 users

Category 2: Synopsys EVE
  Architecture:     FPGA, custom board, custom box (25 M - 200 M)
  Price/gate:       0.5 - 2 cents
  Capacity:         Claims up to 1 billion. Typical usage 25 M to
                    200 M gates.
  Target designs:   SoCs from 25 M to 200 M gates
  Speed range:      500 K to 5 M
  Compile time:     25 M - 100 M gates/hr for PC farm. Proprietary
                    software for fast FPGA partitioning, synthesis
                    and P&R. Parallelizable: Yes
  Visibility:       static, dynamic probes. at-speed probe capture.
  Debug:            simulation hot-swap, SW debug.
  Transactors:      Off-the-shelf: Good. Custom: developed ad hoc
  Language:         Verilog, VHDL, System Verilog
  Memory:           up to 200 GB
  Users:            1 to 49

Category 3: Other -- Aldec, Bluespec, Cadence RPP, HyperSilicon
  Architecture:     off-the-shelf FPGA, off-the-shelf board,
                    off the shelf box (2 M - 50 M)
  Price/gate:       0.25 - 1 cent
  Capacity:         Claims up to 50+ million. Typical usage 2 M to
                    25 M gates.
  Target designs:   IP blocks, sub-system, and SoCs from 2 M to 25 M
  Speed range:      500 K to 20 M
  Compile time:     1 M - 15 M gates/hr for PC farm. Constrained by
                    FPGA vendor synthesis and P&R times. Doesn't
                    include partitioning time. Partitioning depends
                    on # of FPGAs. Time range 30 min to 4 hours.
  Visibility:       static, dynamic probes (vendor dependent).
                    at-speed probe capture (vendor dependent).
  Debug:            simulation hot-swap, SW debug.
  Virtual platform API: varies by vendor
  Transactors:      Off-the-shelf: Mixed. Custom: developed ad hoc
  Language:         Verilog, VHDL, System Verilog
  Memory:           up to 32 GB
  Users:            1 user

Here is my quick summary of the different emulation vendors for
2013.

Category 1:

- Cadence Palladium. Hats off to Cadence for being pioneers in
  emulation and sustaining innovation to maintain a very
  competitive product year-over-year.

- Mentor Veloce. Their revenue numbers show emulation is a growing
  segment for them. (See ESNUG 510 #7.) Clearly Wally and Greg have
  been investing heavily in emulation.

Category 2:

- Synopsys EVE Zebu. This has been the choice for companies and
  design groups doing mid-size SoCs or blocks for emulation. It is
  no secret that Intel was an EVE customer. (See ESNUG 508 #6.) My
  expectation is that with the Synopsys acquisition, EVE will now
  move upstream to challenge Cadence and Mentor at the high end.

Category 3:

- Aldec HES-DVM. The company initially grew out of providing system
  emulation/simulation using FPGAs for eventual implementation in
  FPGAs. FPGAs will continue to be a choice for system designers
  with low volumes, including the mil-aero world. Will they move
  into the SoC market?

- Bluespec Semu. Bluespec expanded their emulation footprint in
  March with a new FPGA-based desktop form factor verification and
  hybrid emulator. They emphasize low cost, ease of use, fast
  deployment using third-party FPGA boards, dynamic hardware debug
  (no re-instrument and re-synthesis) and a C API to integrate
  SystemC/C/C++ models and test benches. Bluespec claims to need
  only 1 day of set up.

- Cadence RPP. The Cadence FPGA-based Rapid Prototyping Platform is
  an FPGA-based prototyper for early software development and
  high-performance system validation. While not positioned as an
  emulator (See ESNUG 517 #6) it uses the core of FPGA-based
  emulators and confirms the need for boxes with higher performance
  and lower cost than processor-based emulators for pre-silicon
  software development.

- The Dini Group. An established leader in FPGA boards for
  prototyping and emulation. The Dini Group consistently delivers
  high quality, high capacity boards with the shortest
  time-to-market for leading edge FPGAs.

- S2C. Offers FPGA boards and software and IP for design
  verification and acceleration. Their new boards are based on 14 M
  gate Xilinx FPGAs.

- HyperSilicon. Company to watch from mainland China, focusing on
  FPGA prototyping boards. Offers boards similar to S2C.

My conclusion is that emulation has indeed gone mainstream. Its
growth extends from the rise of the SoC as the cornerstone of
system hardware, with its associated multiple SW functions. What's
also helped emulation grow is its better debug, increased FPGA
sizes, and its newer ability to handle complex designs.






The reason for my report was to analyze the segment and try to put
some order to the market place. I'm not the only source of info on
this. I'd like to invite the DeepChip readers to feel free to add
their perspective and to update the charts and data I have
Functional verification is primarily comprised of:

    software simulation,
    simulation acceleration,
    FPGA prototyping, and
    emulation.

Let's look at each of them briefly, and then compare some of the
elements that relate to the speed ranges of each approach.






A simulator is a software program that simulates an abstract model
of a particular system by taking an input representation of the
product or circuit, processing the hardware description language,
and compiling it. A system model typically includes processor
cores, peripheral devices, memories, interconnection buses,
hardware accelerators and I/O interfaces.

Simulation is the basis of all functional verification. It spans
the full range of detail from transistor-level simulation like
SPICE to Transaction Level Modeling (TLM) using C/SystemC.
Simulation should be used wherever it's up to the task -- it's
easiest to use and the most general purpose.

But SW simulation hits a speed wall as the size and detail of the
circuit description increases. Moore's Law continues to give us
more transistors per chip, but transistor speed is flattening out.
So while computers are shipping with increasing numbers of
microprocessor cores, the operating frequencies are stuck in the
2 - 3 GHz range. Since SW simulators (which run on these computers)
don't effectively utilize more than a handful of PC cores,
performance degrades significantly for large circuits. It would
take decades just to boot an operating system running on an SoC
being simulated in a logic SW simulator.

Simulation acceleration, emulation, and FPGA prototyping are all
solutions to get around show-stopping slow PC simulation speeds for
large designs. They all attempt to parallelize simulation onto
larger numbers of processing units. This ranges from two orders of
magnitude (e.g. hundreds of GPU processing elements) to nine orders
of magnitude (billions of FPGA gates).






Simulation acceleration implements a hardware description language,
such as Verilog or VHDL, according to a verification specification.
The results are the same as the simulation, but faster.

- Often simulation accelerators will use hardware such as GPUs
  (e.g. NVIDIA Kepler) or FPGAs with embedded processors.

- Simulation acceleration involves mapping the synthesizable
  portion of the design into a hardware platform specifically
  designed to increase performance by evaluating the HDL constructs
  in parallel. The remaining portions of the simulation are not
  mapped into hardware, but run in a software simulator on a
  PC/workstation.

- The software simulator works in conjunction with the hardware
  platform to exchange simulation data. Acceleration removes most
  of the simulation events from the slow PC software simulator and
  runs them in parallel on other HW to increase performance.

The final acceleration performance is determined by:

    1) Percentage of the simulation that is left running in
       software;
    2) Number of I/O signals communicating between the
       PC/workstation and the hardware engine;
    3) Communication channel latency and bandwidth; and
    4) The amount of visibility enabled in the hardware.
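
One way to reason about how the first three factors interact is an Amdahl's-law style estimate, sketched below in Python. This model is my own simplifying assumption, not a formula from any accelerator vendor: it treats the software-resident portion and the communication overhead as fractions of the original pure-software run time, and it does not model visibility overhead.

```python
def accelerated_speedup(software_fraction, hardware_speedup,
                        comm_overhead_fraction=0.0):
    """Estimate overall speedup versus pure SW simulation.

    software_fraction:      share of original run time left in software (0..1)
    hardware_speedup:       how much faster the mapped portion runs in hardware
    comm_overhead_fraction: extra host<->hardware communication time, expressed
                            as a fraction of the original run time
    """
    accelerated_time = (software_fraction
                        + (1.0 - software_fraction) / hardware_speedup
                        + comm_overhead_fraction)
    return 1.0 / accelerated_time


# Hypothetical case: 10% of the run stays in software, hardware is 1000X faster.
print(f"{accelerated_speedup(0.10, 1000):.1f}X overall")
# Same case, with communication adding 5% of the original run time.
print(f"{accelerated_speedup(0.10, 1000, 0.05):.1f}X overall")
```

Under these assumptions, leaving even 10% of the work in software caps the overall gain below 10X no matter how fast the hardware is, which is why the software-resident percentage is listed first.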






An emulator maps an entire design into gates or Boolean macros that
are then executed on the emulator's implementation fabric (parallel
Boolean processors or FPGA gates) such that the emulated behavior
exactly matches the cycle-by-cycle behavior of the actual system.

- Processor-based emulator. The design under test is mapped to
  special purpose Boolean processors.

- FPGA-based emulator. The design under test is mapped to FPGA
  gates as processing elements.

Elsewhere in this report, I go into more detail on emulation
including: emulation drivers; metrics to evaluate emulation; and a
top-level comparison chart of commercial emulation systems against
those metrics.






An FPGA prototype is the implementation of the SoC or IC design on
an FPGA. The prototype environment is real, with real input and
output streams. Thus the FPGA prototyping platform can provide full
verification for hardware, firmware, and application software
design functionality.

Some problems associated with FPGA prototyping are:

- Debug Confusion: Because you mapped your design into an FPGA, you
  can expect to spend some extra time debugging it, to identify
  problems that are relevant ONLY to your prototype, but that are
  not necessarily bugs inside your actual design.

- Partitioning: Your design must be partitioned across multiple
  FPGAs. Further, sometimes repartitioning may be necessary when
  changes are made. Partitioning challenges can also apply to
  emulation, so I discuss them in the emulation metrics section.

- Timing (Impedance) Mismatches: If your FPGA prototype connects to
  real world interfaces, such as Ethernet or PCIe, then you have to
  ensure that it is capable of supporting the interface. That is,
  mismatched timing can sometimes be a problem. This can require
  "speed bridging" to an FPGA.
If your design can fit into a few FPGAs, and you have
adequate support, then FPGA prototyping can be very
effective -- especially when real-time performance is








Below I characterize the various types of hardware-assisted
verification approaches:

Speed Cycles
per Sec
(# of comp
comp (100 M
element gates)
X86 cores
under 16 3 GHz under 1
Synopsys VCS,
Mentor Questa
10 to
1 GHz
under 1 100 K
GHz to 2 M
Mentor Veloce,
Synopsys EVE1 MHz FPGA-based
500 K - Zebu, Bluespec,
FPGA gates
Aldec, Cadence
500 K - Synopsys HAPS,
FPGA gates
20 M
Notice that any given approach's processing element speed is
inversely related to the approach's ultimate performance in
cycles/sec. This is basically a reflection that for functional
verification, concurrency (more computing elements) trumps clock
speed (speed-per-comp element).

I put in a list of the commercially announced vendors that I am
aware of.

The graphic below roughly shows how each of these basic approaches
compares relative to simulation design frequency and design size.

Notice emulation's 1,000X to 1,000,000X faster run time over SW
simulation as your design size goes from 10 K gates to 20 M gates.
Also notice emulation's capacity of 2 B gates, while SW simulation
and acceleration both top out at around 20 M gates -- a 100X
difference in capacity.
The rest of my analysis will expand on the emulation
Emulation emerged in the early 1990s. In its early days, emulation
was primarily deployed by a few large corporations to verify large
system-level designs -- specifically for microprocessors and
graphics processors. The turning point was when audio and video
recordings were digitized. Where possible, designs needed to be
verified for both functionality and software behavior against
voluminous real data before committing to silicon.

By 2013, we are now moving toward emulation becoming mainstream:

- Current mobile devices have hundreds of applications. The vast
  majority have audio and video use as one of their system
  appliance "must haves".

- Subsystems now are the same size as what entire designs and
  processors were in the 90s. Teams are now using emulators to
  verify many of these larger, more complex blocks.

- Because of compute-speed and capacity gains, emulation is
  becoming an attractive option for more mainstream verification
  tasks, such as verifying individual IP blocks in the low millions
  of gates.

- Emulation is even now needed for smaller, under 1 million gate
  designs -- if there is a lot of control complexity with a large
  number of cycles -- such as with an H.265 encoder/decoder. In
  fact, simulating high density videos on an H.265 decoder would be
  impractical (because it would take weeks to do) if it couldn't be
  simulated in seconds in an emulator.

It's safe to say that simulation is vital for
functional verification. However software simulation
(by itself) is way too slow to capture events that
occur when an application runs for even a short time on
an actual instance of hardware. Emulation is the only
practical way to get usefully long SW application
runtimes on a hardware instance.








I see some of the same current top level drivers for emulation as I
did for On-Chip Communications Networks: SoC growth is being driven
by mobile consumer devices, with the corresponding pressures for:

    Smaller die size and reduced cost
    More functionality
    Increased battery life
    True gigahertz performance

For years, hardware verification was primarily about meeting the
hardware design specification. Intel built the processor, Microsoft
built the operating system, and Phoenix gave you BIOS. The software
conformed to the hardware, meaning that the software developer
lived within the constraints of the hardware design.

Fast forward to a 2013 system design world. You'll see it's now
software dominated. Intel, Microsoft and Phoenix are still around
but struggling; the system design community is now serving
consumers by way of Google Android, ARM, and their SW ecosystems.

Typical SoC - dozens of HW blocks but millions of lines of code

Notice that there are millions of lines of code for roughly a dozen
hardware subsystems in a typical SoC. (Source: Texas Instruments)







While not explicitly aware of it, consumers' expectations are high
with regard to how well the hardware and software work together.

Software is about the experience, not functionality. The software
dictates the behavior of an SoC -- in other words, software now
forces constraints on the hardware architecture.

Verification encompasses more than "does it do what the
specification says?". It now asks "does the
specification deliver the end user experience that I'm
looking for?". The goal is not just to build the
design right, but to build the system that behaves
How do you ensure your design is up to its intended
use? The answer is by using increasingly capable
emulation systems.

Source: Bluespec, Inc.

You load the operating system and observe the behavior
on the hardware; software developers run their
application on the virtual machine.








Because today's designs are getting so big and so
complex, they're next to impossible to design without
using 3rd party IP, IP from fabs, and internal
IP reuse.

Semico predicts an average 2013 design will have close
to 90 IP cores. This will only grow -- and the more IP
that's (re)used, the more each block will demand
simulation compute resources -- increasing the need for
emulation speed and capacities.

Source: International Business Strategies,


And with each node shrink, 90nm, 65nm, 45nm, 28nm,
20nm, the average number of IP cores used grows -- plus
the cost-to-design grows.







SW simulators (like VCS or NC-Sim or Questa) all
parallelize across only a small number of x86
processors, so their performance has been scaling only
as fast as x86 frequency, which has flattened out. SW
simulators have become inadequate for verification
within cores and core-to-core.
Emulation is becoming ubiquitous where SW simulation
hits a wall. Emulation runs orders of magnitude faster
than SW simulation, and its ability to run software in
megahertz rather than kilohertz (or even hertz) allows
teams to do verification that is otherwise time-prohibitive.
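
To make the megahertz-versus-kilohertz gap concrete, here is a
minimal sketch that converts a cycle count into wall-clock time at
several execution rates. The 1 billion cycle workload and the
specific rates are illustrative assumptions chosen as round numbers;
they are not measurements from any simulator or emulator.

```python
# Illustrative only: the workload size and the rates below are assumed
# round numbers, not vendor data or measurements.

def wall_clock_seconds(cycles: int, rate_hz: float) -> float:
    """Time needed to execute `cycles` design clock cycles at `rate_hz`."""
    if rate_hz <= 0:
        raise ValueError("rate_hz must be positive")
    return cycles / rate_hz


WORKLOAD_CYCLES = 1_000_000_000  # hypothetical SW workload: 1 billion cycles

RATES_HZ = {
    "10 Hz": 10.0,           # assumed: very slow full-chip SW simulation
    "1 kHz": 1_000.0,        # assumed: faster SW simulation
    "1 MHz": 1_000_000.0,    # assumed: emulation
}

for label, rate in RATES_HZ.items():
    seconds = wall_clock_seconds(WORKLOAD_CYCLES, rate)
    print(f"{label:>6}: {seconds:,.0f} s  (~{seconds / 86_400:,.1f} days)")
```

Under these assumptions the same workload takes about 1,000
seconds at 1 MHz, versus about 1,000,000 seconds (more than 11
days) at 1 kHz.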








SW simulation is great in the early phases of design:

- Designers and verification engineers are constantly finding
  bugs, changing the design, and re-verifying. In this phase,
  the fast turnaround that SW simulation offers for each new
  run is critical. And they're not yet at the point where
  they need to load the operating system.
- If you have a fixed function or hard-wired integrated
  circuit, you can just run your Verilog/VHDL to verify that
  your design does what the specification describes.

What drives emulation use:
- The one downside to emulation is that it takes a measurable
  compile time. But emulation's 1,000X to 1,000,000X faster
  run time wins out over SW simulation when the number of
  cycles that must be run offsets the longer compilation time
  for emulation. As engineers progress through the design
  cycle, they must run more and more cycles to identify new
  bugs.
- The reduced total cost of emulation ownership is another
  factor helping it grow. Current emulation system cost can
  go as low as 0.25 to 0.50 cents per gate.
- Even smaller companies can now use it. Emulation used to
  just be the domain of big companies with huge support teams
  -- they might spend $2 M on an emulator, and then find they
  need a service contract for $1 M to make it work.

  Many of today's emulation systems are fairly plug-and-play,
  with far reduced support requirements. I go into more detail
  regarding some of these cost of ownership factors in the
  metrics section of this report.
- Increased static and dynamic probing inside today's much
  bigger FPGAs has helped emulation grow, too.
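
The compile-time trade-off in the first bullet can be sketched as a
simple break-even calculation. This is a minimal model, assuming
total time on each platform is just compile time plus cycles divided
by run rate. The compile times and rates in the usage example are
hypothetical placeholders; only the 1,000X speed ratio is taken from
the low end of the range quoted above.

```python
def break_even_cycles(sim_compile_s: float, emu_compile_s: float,
                      sim_rate_hz: float, emu_rate_hz: float) -> float:
    """Cycle count above which emulation finishes sooner than SW simulation.

    Assumed model for each platform: total = compile time + cycles / rate.
    """
    if sim_rate_hz <= 0 or emu_rate_hz <= sim_rate_hz:
        raise ValueError("model assumes 0 < sim_rate_hz < emu_rate_hz")
    extra_compile = emu_compile_s - sim_compile_s
    if extra_compile <= 0:
        return 0.0  # emulation is ahead at every cycle count
    per_cycle_saving = 1.0 / sim_rate_hz - 1.0 / emu_rate_hz
    return extra_compile / per_cycle_saving


# Hypothetical inputs: 10 minute simulator compile, 2 hour emulator
# compile, 1 kHz simulation, 1 MHz emulation (a 1,000X speedup).
n = break_even_cycles(sim_compile_s=600, emu_compile_s=7_200,
                      sim_rate_hz=1_000, emu_rate_hz=1_000_000)
print(f"Emulation wins beyond ~{n:,.0f} cycles")
```

With these placeholder numbers the crossover is at roughly 6.6
million cycles; runs longer than that favor the emulator despite
its longer compile.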








One significant obstacle to adoption, particularly for
FPGA-based emulators, has been the partitioning
requirement (also called mapping). A designer
would have to partition his register transfer level
design into sections which fit inside each FPGA and then
connect all the sections via the FPGAs' I/Os. It was a
tremendous challenge to balance the optimal sizing of
the partitioned elements against managing the number of
FPGA connections within the limits of the available I/Os.

Partitioning software such as what Auspy Development
offers has helped. However, it does not entirely eliminate
the manual effort. The partitioning process can result in
unnatural design hierarchies, and critical timing
paths that cross FPGA boundaries can mandate repartitioning.

One major technology enabler behind the proliferation
of emulators relates to Moore's Law: It's now possible
to have 30 million gates in just 2 FPGAs.

This higher gate-count-per-FPGA means less design
partitioning is necessary. For example, below we see two
off-the-shelf FPGA boards.

Source: The Dini Group


- The first board has 6 FPGAs; each FPGA is a Virtex-6 LX550T
  capable of supporting 4 M gates, for a total board capacity
  of 24 M gates.
- The second board has only 2 FPGAs; each FPGA is a
  Virtex-7 7V2000T capable of supporting 14 M gates, for a
  total board capacity of 28 M gates.
Cutting a design in half is much easier and safer than
cutting a design into 6 parts. This lessens a major
adoption obstacle for FPGA-based emulators, helping them
to close the gap with custom processor-based emulators,
as well as making emulation more attractive for smaller

Higher capacity FPGAs also push down the cost of the
emulation systems; for example, some systems now use
off-the-shelf FPGAs at $4 K per FPGA, reducing their
core hardware cost. (Further, lower capacity emulators
can often add next generation FPGAs shortly after they
are released by Xilinx and Altera.)
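
The two-board comparison above can be restated as a short
calculation. The capacities come from the FPGA counts and per-FPGA
gate figures quoted in the text. The "FPGA pairs" number is my own
rough proxy for partitioning difficulty: it counts how many distinct
FPGA-to-FPGA boundaries a signal could cross, and assumes any pair
may need to be connected, which a real board's routing topology may
not allow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Board:
    name: str
    fpga_count: int
    mgates_per_fpga: int  # capacity per FPGA, in millions of gates

    @property
    def capacity_mgates(self) -> int:
        """Total board capacity in millions of gates."""
        return self.fpga_count * self.mgates_per_fpga

    @property
    def fpga_pairs(self) -> int:
        """Distinct FPGA pairs (assumed proxy for partition boundaries)."""
        n = self.fpga_count
        return n * (n - 1) // 2


boards = [
    Board("first board", fpga_count=6, mgates_per_fpga=4),
    Board("second board", fpga_count=2, mgates_per_fpga=14),
]

for b in boards:
    print(f"{b.name}: {b.capacity_mgates} M gates, "
          f"{b.fpga_pairs} possible FPGA-to-FPGA boundaries")
```

By this measure the 2-FPGA board offers slightly more capacity
(28 M versus 24 M gates) with a single boundary to manage instead
of up to 15.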







[Table: comparison of verification platform types (SW simulation
on x86 cores, a mix of host acceleration and emulator such as
Veloce + Zebu Server + RTL Sim, and FPGA-based systems) by speed in
cycles per second, compute element, capacity, and vendor (Mentor,
Synopsys, Aldec, S2C, among others). The cell contents did not
survive extraction.]

Number of users and capacity range:

  Palladium P64:     4 MG to 256 MG (resulting in much better ...),
                     64 users
  Veloce 2 Quattro:  16 MG to 256 MG, 16 users

Speed (which system each entry belongs to was lost in extraction):

  1-1.5 MHz (as per datasheet), degrading with design size due
  to architecture
  2 MHz (as per datasheet), scaling with design size

Capacity (which system each entry belongs to was lost in extraction):

  256 MG nominal, 90% to 100% utilization => 256 MG actual
  256 MG nominal, 60% to 75% utilization => 200 MG actual capacity

