Useful Skew in Production Flows
by Tom Dillinger on 12-13-2019 at 6:00 am
Categories: EDA, Synopsys

The concept of applying useful clock skew to the
design of synchronous systems is not new. To
date, the application of this design technique has
been somewhat limited, as the related
methodologies have been rather ad hoc, as
discussed below. More recently, the ability to
leverage useful skew has seen a major
improvement, and is now an integral part of
production design flows. This article will briefly
review the concept of useful skew, its prior
implementation methods, and a significant
enhancement in the overall design optimization
methodology.

What is useful skew?
The design of a synchronous digital system
requires the distribution of a (fundamental) clock
signal to state elements within the digital network.
A specific machine state is “captured” by an edge
of this clock signal at state element inputs.
Concurrently, the transition to the next machine
state is “launched” by a change in the state values
through fanout logic paths, to be captured at the
next clock edge. The collection of logic and state
elements controlled by this signal is denoted as a
clock domain, which may encompass multiple
block designs in the overall SoC hierarchy.

PCIe 6.0 Doubles Speed Current SoC designs incorporate many separate
with New Modulation
clock domains associated with unrelated clocks. 
Technique
April 26, 2021
The signal interface between domains is thus
asynchronous, requiring speci c logic circuitry
Addressing SoC Test (and electrical analysis) to evaluate the risk of
Implementation Time anomalous metastable network behavior.  (Clock
and Costs
domain crossing analysis, or CDC, is applied to the
April 20, 2021
network to ensure the metastable design
Your Car Is a
requirements are observed.)
Smartphone on Wheels
—and It Needs
Smartphone Security The subsequent discussion will utilize the simplest
April 18, 2021 of examples – i.e., a single clock frequency with a
Global Variation and Its common capture edge to all state elements, and a
Impact on Time-to- clock edge-based launch time.  Advanced SoCs
Market for Designs
would often include domain designs with:  both
April 14, 2021
rising and falling clock edge-sensitive state
How PCI Express 6.0 elements (with half-cycle timing paths);  reduced
Can Enhance Bandwidth-
Hungry High- clock frequencies by dividing the fundamental
Performance Computing clock signal;  and, latch-based synchronous timing
SoCs
where the launch time could be a state element
April 12, 2021
data input transition while the latch clock is
VC Formal SIG Virtually transparent.  This discussion does not address
Conferences in Europe
clocking considerations common in serial
April 6, 2021
interface communications, unique cases of
Why In-Memory
asynchronous domains with “closely related”
Computing Will Disrupt
Your AI SoC clocks – e.g., mesochronous, plesiochronous
Development domains.  This discussion also assumes the clock
March 22, 2021
domain is isochronous, although there may be
Using IP Interfaces to instantaneous deviations in the clock period at
Reduce HPC Latency
any state element input due to jitter – in other
and Accelerate the Cloud
March 11, 2021 words, there is a single fundamental clock
frequency throughout the domain over time.
USB 3.2 Helps Deliver on
Type-C Connector
Performance Potential The gure below is the typical timing
March 8, 2021 representation used for a synchronous system.  A

Key Requirements for clock distribution network is provided on-chip


Effective SoC from the clock source (e.g., an external source, an
Veri cation
on-chip PLL), through interconnects and buffering
Management
February 25, 2021 circuitry to state elements.

Synopsys is Enabling the


Cloud Computing
Revolution
February 18, 2021

Techniques and Tools for


Accelerating Low Power
Design Simulations
February 17, 2021

Synopsys Delivers a
Brief History of AI chips
and Specialty AI IP
February 16, 2021
The gure also includes a de nition of the late
A New ML Application, in
Formal Regressions mode timing slack, measured as the difference
February 10, 2021 between the required arrival time and the actual

Change Management for logic path propagation arrival time.


Functional Safety
January 27, 2021 The interconnects present from the clock source
What Might the “1nm and between buffers could consist of a variety of
Node” Look Like? physical topologies – e.g., a (top-level metal) grid, a
December 28, 2020
balanced H-tree, a spine plus balanced branches
(aka, a shbone).  The buffers could be logically
inverting or non-inverting signal drivers or simple
gating logic with additional enable inputs to
suspend the clock propagation for one or more
cycles.

The time interval between the clock launch edge
and subsequent (next cycle) capture edge is
based on the fundamental clock period, adjusted
by two factors – jitter and skew.

The jitter represents the cycle-specific variation in
the clock period. It originates from the time-
variant conditions at the clock source, such as
dynamic voltage and temperature at the PLL
circuitry and/or thermal and mechanical noise
from the reference crystal.

The skew in the arrival of the launch (cycle n) and
capture (cycle n+1) clock edges also defines the
time interval for the allowable path delays in the
logic network. The skew is due to a combination
of dynamic and static factors. Dynamic factors
include voltage and temperature variations in
clock buffer circuitry, temperature variations in
interconnects, and capacitive coupling noise in
interconnects. The static factors include process
variation in the buffer circuits and interconnects,
plus physical implementation design differences
between the clock endpoints. More precisely, the
skew is the difference in clock edge arrival at the
two endpoints due to factors past the shared
clock distribution to the endpoints, removing the
common path from the source.

The time interval for logic path evaluation is the
clock period adjusted by the design margins for
jitter and (static and dynamic) arrival skew. From
the launching clock edge through the state
elements and logic circuit delays, the longest path
needs to complete its evaluation prior to the
capture edge, accounting for the jitter plus skew
margins and the setup data-to-clock constraint of
the capture state element.
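The late mode (setup) check described above can be
sketched as a small calculation. This is a minimal
illustration, not a timing tool's algorithm: the
function name, parameter names, and the picosecond
values are all assumptions chosen for the example.

```python
def setup_slack(period, jitter, skew_margin, setup_time, clk_to_q, logic_delay):
    """Late mode slack = required arrival time - actual arrival time.

    All arguments are hypothetical values in picoseconds.
    """
    # Required arrival: the capture edge one period later, pulled in by
    # the jitter and skew margins and the capture element's setup time.
    required = period - jitter - skew_margin - setup_time
    # Actual arrival: launch element clock-to-output delay plus the
    # longest logic path delay.
    actual = clk_to_q + logic_delay
    return required - actual


# Illustrative numbers only (assumed, not taken from the article).
slack_ps = setup_slack(period=1000, jitter=30, skew_margin=50,
                       setup_time=40, clk_to_q=60, logic_delay=780)
print(slack_ps)  # 40
```

A negative result would indicate a long path that
does not complete before the capture edge under
the stated margins.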

To accommodate a long timing path, one of the
potential optimization solutions would be to
intentionally extend the static skew to the capture
state element(s) through the physical
implementation of the buffer and interconnect
distribution differences between launch and
capture. Correspondingly, the time interval for
logic path evaluation from the state element(s)
receiving the delayed clock to their own capture
endpoints is reduced. This is the foundation of
applying useful skew.
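The trade-off can be illustrated with a toy
three-element pipeline A -> B -> C. The sketch
below is an assumption-laden example (hypothetical
function name, delays, and margins in picoseconds):
delaying the clock at B gives the long A -> B path
more time, and takes the same amount away from the
shorter B -> C path.

```python
def stage_slacks(period, margin, setup, path_delays, clock_offsets):
    """Setup slack for each stage in a chain of state elements.

    path_delays[i] is the longest delay from element i to element i+1.
    clock_offsets[i] is the intentional extra clock latency at element i.
    All names and units (ps) are assumptions for this illustration.
    """
    slacks = []
    for i, delay in enumerate(path_delays):
        launch = clock_offsets[i]                 # launch edge, cycle n
        capture = clock_offsets[i + 1] + period   # capture edge, cycle n+1
        slacks.append(capture - margin - setup - (launch + delay))
    return slacks


period, margin, setup = 1000, 80, 40
path_delays = [950, 700]  # A->B is long, B->C is short

balanced = stage_slacks(period, margin, setup, path_delays, [0, 0, 0])
skewed = stage_slacks(period, margin, setup, path_delays, [0, 90, 0])
print(balanced)  # [-70, 180]
print(skewed)    # [20, 90]
```

With the clock at B postponed by 90 ps, the failing
stage recovers and the following stage still has
positive slack. Hold checks on the same paths would
also need to be re-evaluated, which this sketch
does not do.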

Traditional Methods

The conventional methodology for timing closure
utilizes distinct tools for logic synthesis,
construction of the clock physical distribution, and
cell netlist placement and routing (adhering to any
existing clock implementations). For synthesis, a
set of clock constraints is defined – e.g., period,
jitter, max skew implementation targets,
distribution latency target from the block clock
input to state endpoints. These targets were
applied uniformly throughout the synthesis model
(i.e., no arrival skew variation). Long timing paths
were presented to various optimization
algorithms focused on logic netlist and
interconnect updates – e.g., higher drive strength
cell swaps, signal fanout repowering buffer
topologies, place-and-route directives to
preferentially use metal layers with lower R*C
characteristics. The figure below illustrates some
of the potential repowering strategies employed
during synthesis.

[Figure: examples of repowering strategies
employed during synthesis]

For state element hold time (clock-to-data
transition) timing tests, the skew target was added
to the same clock edge between short launch and
capture paths, to ensure sufficient logic path
delays and stability of the data input capture.
Algorithms to judiciously add delay padding to
short paths not meeting the skew plus hold-time
constraint would then be invoked.
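As a rough sketch of the hold test just described,
the shortest path delay is compared against the
skew target plus the hold time, and any shortfall
is the delay padding required. The function name
and the picosecond values are assumptions for
illustration; production tools evaluate this per
corner and mode.

```python
def hold_padding(min_path_delay, skew_target, hold_time):
    """Return (hold_slack, padding_needed) for a short path.

    min_path_delay: earliest data arrival after the launch edge (ps).
    skew_target:    skew margin applied to the same clock edge (ps).
    hold_time:      capture element clock-to-data hold constraint (ps).
    All names and values are hypothetical.
    """
    hold_slack = min_path_delay - (skew_target + hold_time)
    # Padding is only needed when the slack is negative.
    padding_needed = max(0, -hold_slack)
    return hold_slack, padding_needed


print(hold_padding(45, 50, 20))   # (-25, 25): add 25 ps of delay
print(hold_padding(120, 50, 20))  # (50, 0): no padding required
```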

The timing analysis reports from synthesis
provide feedback on the relative success of these
(long and short) logic path timing optimizations.
Designs with a large number of failing timing tests
were faced with the difficult decision on whether
to proceed to the P&R flows with additional
physical constraints to try to optimize paths, or to
update the microarchitecture. The introduction of
physical synthesis flows improved the estimated
timing for the synthesized netlist, but the timing
optimizations were still based on uniform clock
distribution to logic paths.

In addition to the limited scope of logic path
timing optimizations, an additional critical issue
has arisen with this methodology. In a
synchronous system, the vast majority of the
switching activity occurs in the interval from the
clock edge through the first few logic stage
delays – thus, the peak power and the dynamic
(L * di/dt + I*R) power/ground distribution network
voltage drop are both maximized. For advanced
process node designs seeking to aggressively scale
the supply voltage (and related cost of power
distribution), this dynamic current profile of
low-skew synchronous systems is problematic.

Useful Skew in Production

At the recent Synopsys Fusion Compiler technical
symposium, several customer presentations
described how the incorporation of useful skew
into the full synthesis plus physical
implementation flows has been extremely
productive.

Haroon Gauhar, Principal Engineer at Arm, offered
some interesting insights. He indicated, “The Arm
Cortex core architecture contains numerous
imbalanced paths, by design. This enables the
timing optimization algorithms in synthesis to
apply concurrent clock and data (CCD)
assumptions directly during technology netlist
mapping. The corresponding clock tree
implementation assumptions become an integral
part of the physical design flows.” Synopsys
refers to this strategy as CCD Everywhere. (“Arm”
and “Cortex” are registered trademarks of Arm
Limited.)

Haroon continued, “This useful skew strategy is
applied to both setup and hold timing tests, in full
multi-corner, multi-mode timing analysis.” Haroon
showed an enlightening chart from the Fusion
Compiler output data, illustrating the number and
magnitude of useful skew clock distribution
modifications that were made, both “postponing”
and “preponing” clock edges relative to the
nominal latency arrival target within the block.

He said, “We evaluate the post-synthesis negative
slack timing report data with the CCD postpone
and prepone results. It may still be appropriate to
look at microarchitectural changes – the
additional postpone/prepone information
provides insights into where RTL updates would
be the most effective for realizable performance
improvements.”

Raghavendra Swami Sadhu, Senior Engineer at
Samsung, echoed similar comments in his
presentation. “We have enabled CCD
optimizations in our compile_fusion and
clock_tree_synthesis flows. Fine-tuning iterations
with CCD may be required to find an optimal
balance of useful skew for the goals of both setup
and hold timing paths.”

Another presenter at the Fusion Compiler
technical symposium offered the following
summary, “We are seeing reductions in setup and
hold TNS, and thus fewer iterations to close on
timing. There are fewer hold buffers in the design
netlist, resulting in better block and die sizes. For
our products, even a small percentage area
reduction is of tremendous value.”

Summary

The net takeaway is that the application of useful
skew is now available in production flows. This
additional optimization has the potential to guide
microarchitectural updates, improve netlist size
(less buffering and fewer repowering cells), and
reduce design iterations to timing closure. Dynamic
I*R voltage drop issues are reduced, as well. The
figure below illustrates a switching profile based
on traditional flows (“baseline”) and for a design
incorporating useful skew.

[Figure: switching profile for the baseline flow
versus a design incorporating useful skew]
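The effect on the switching profile can be pictured
with a toy model: identical current pulses, one per
group of clock endpoints, are summed either with
aligned clock arrivals or with arrivals spread out
by useful skew. The pulse shape, the four groups,
and the time-step offsets are all assumptions for
illustration; this is not a power integrity
analysis.

```python
def peak_current(arrival_offsets, pulse, n_steps=40):
    """Sum one current pulse per clock group and return the peak.

    arrival_offsets: clock arrival of each group, in time steps.
    pulse:           current drawn by one group per time step
                     (arbitrary units, hypothetical shape).
    """
    total = [0] * n_steps
    for offset in arrival_offsets:
        for k, amplitude in enumerate(pulse):
            if offset + k < n_steps:
                total[offset + k] += amplitude
    return max(total)


pulse = [1, 4, 6, 4, 1]

# Low-skew baseline: every group switches on the same time step.
baseline_peak = peak_current([0, 0, 0, 0], pulse)
# Useful skew: group arrivals are staggered by one step each.
skewed_peak = peak_current([0, 1, 2, 3], pulse)

print(baseline_peak)  # 24
print(skewed_peak)    # 15
```

The total charge is the same in both cases; only
the peak (and hence the worst-case supply droop in
this simplified picture) changes.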

The Synopsys Fusion Compiler platform provides
a direct integration of useful skew (CCD)
optimizations across the implementation
methodology.

There is a caveat – useful skew is best viewed as
another design option in the design optimization
toolbox. Thinking again of the Arm Cortex
architecture, the success of this approach relies
upon the availability of imbalanced path lengths.
A very useful utility that I have seen deployed is to
provide a distribution plot of logic path lengths for
a synthesized netlist exported right after logic
reductions (e.g., constant propagation,
redundancy removal), before any timing-driven
algorithms are invoked. A design with a broad
path length distribution would be a good
candidate for useful skew. A design with “all paths
at maximum length” (corresponding to the target
clock period) or with a bimodal distribution of
primarily long paths and very short paths would
likely be more problematic – a solution of
postpone and prepone skews may not easily
converge.
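A path length distribution utility of this kind
could be as simple as the sketch below. It is a
hypothetical illustration: the function name, the
bin count, and the delay lists are assumptions, and
a real flow would read the path delays from a
timing report rather than a hard-coded list.

```python
def path_length_histogram(path_delays, period, n_bins=5):
    """Count paths per bin, where each bin is a slice of the clock period.

    path_delays and period are integers in the same unit (e.g., ps).
    Paths longer than the period are counted in the last bin.
    """
    counts = [0] * n_bins
    for delay in path_delays:
        bin_index = min(delay * n_bins // period, n_bins - 1)
        counts[bin_index] += 1
    return counts


period = 1000
broad = path_length_histogram(
    [120, 340, 410, 560, 610, 780, 930, 980], period)
bimodal = path_length_histogram(
    [100, 120, 150, 960, 970, 990], period)

for name, hist in (("broad", broad), ("bimodal", bimodal)):
    print(name)
    for i, count in enumerate(hist):
        # Simple text bar chart, one row per bin.
        print(f"  bin {i}: {'#' * count}")
```

Here `broad` evaluates to [1, 1, 2, 2, 2] and
`bimodal` to [3, 0, 0, 0, 3], mirroring the good
candidate and the more problematic case described
above.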

For more insights into useful skew and CCD
Everywhere, here are some links to additional
information from Synopsys:

CCD Everywhere video — link.

Fusion Compiler home page — link.

-chipguy
