Techniques To Reduce Timing Violations Using Clock Tree Optimizations in Synopsys ICC2


Techniques to Reduce Timing Violations using Clock Tree Optimizations... about:reader?url=https://semiwiki.com/semiconductor-services/290148-t...

semiwiki.com

Techniques to Reduce Timing Violations using Clock Tree Optimizations in Synopsys ICC2 - Semiwiki
eInfochips

The semiconductor industry is growing rapidly, driven by demand for
high-speed circuits and low-power designs in new and evolving
technologies such as IoT, networking chips, AI and robotics.
In lower technology nodes, timing closure becomes a major
challenge because on-chip variation (OCV) increases and changes
both interconnect delay and cell delay. It is difficult to make the
clock reach every flop at almost the same instant, which is needed
to avoid timing violations. The clock structure consumes a large
share of the power in the design, and the effect of OCV is greater in
the clock network than in signal and other paths. It is therefore
important to minimize the effect of OCV in the clock network by
building a proper clock structure that reduces timing violations and
variation effects while meeting the clock skew and latency
requirements.
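To make the OCV argument concrete, here is a minimal sketch of how derating widens the skew seen by setup analysis. The 5% late/early derates, the path delays and the function name are illustrative assumptions, not values from this block; the shared clock segment is assumed to be exempt from derating, as with common-path pessimism removal.

```python
def ocv_skew_penalty(common_ps, launch_branch_ps, capture_branch_ps,
                     late_derate=1.05, early_derate=0.95):
    """Extra launch-minus-capture skew caused by OCV derating.

    Only the non-shared branches are derated: the launch branch is made
    late and the capture branch early, which is the setup worst case.
    """
    nominal = (common_ps + launch_branch_ps) - (common_ps + capture_branch_ps)
    derated_launch = common_ps + launch_branch_ps * late_derate
    derated_capture = common_ps + capture_branch_ps * early_derate
    return (derated_launch - derated_capture) - nominal

# Same 250 ps total latency, different amounts of shared clock path (hypothetical).
tree_penalty = ocv_skew_penalty(common_ps=50.0, launch_branch_ps=200.0,
                                capture_branch_ps=200.0)
mesh_penalty = ocv_skew_penalty(common_ps=200.0, launch_branch_ps=50.0,
                                capture_branch_ps=50.0)
print(round(tree_penalty, 3), round(mesh_penalty, 3))
```

Under these assumptions the structure with the longer shared segment loses less margin to OCV, which is the motivation for a mesh in the rest of the article.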
This article describes techniques to reduce timing violations using
an optimized mesh clock tree structure, together with different
optimization switches that reduce timing violations and power
consumption. We used a mesh clock tree structure because, for
high-performance VLSI designs, it provides lower skew and is less
affected by OCV than a conventional clock tree structure.
Keywords: clock tree synthesis (CTS), clock tree optimization,
concurrent clock and data optimization (CCD), On-Chip Variation (OCV),
Design Rule Violation checks (DRVs), lower technology nodes,
Place and Route flow.
Introduction

1 of 10 31-08-2020, 11:20

Clock Tree Synthesis is the process that distributes the clock
evenly to all sequential elements in a design while meeting the
clock tree design rule constraints (DRVs) such as max transition,
max capacitance and max fanout, balancing the skew and
minimizing insertion delay.
There are many types of clock structures namely H-Tree, X-Tree,
Conventional clock tree, Multi source clock tree, Mesh Tree etc. In
this article, we will focus on clock tree optimization of a mesh clock
tree.
Mesh Tree Structure
A mesh tree has clock nets in a grid pattern that are driven by
clock inverters and buffers. This structure gives lower skew,
latency and on-chip variation than other clock structures. The
network of inverter and buffer drivers from the clock port to the
clock mesh drivers is known as the pre-mesh clock structure.
An example of a clock mesh tree is shown in figure 1 below.
Mesh tree structure has high power consumption and requires
high routing resources because the whole layer is consumed by
the clock tree structure. Generally mesh is created at the top layer
to acquire the advantage of less resistance in metals and to save
routing resources for signal nets in lower layers. A design can
consist of one mesh tree or multiple mesh trees.
Mesh terminals are created at a particular pitch in the X and Y
directions, chosen on the basis of various experiments. The first
step is to create the mesh terminals as shown in figure 2. Clock
tree synthesis follows, where skew groups are created according to
the flop distribution in the design. First-level routing is done from
the mesh terminal to the first buffer to reserve routing resources
for the first-level clock nets. The inverter is connected to the clock
gating cell. Then the network of clock inverters and buffers is
created up to the clock sinks, as shown in figure 2.
These clock gating cells are cloned according to the number of
fanout sink points. In first-level cloning, the tool looks at the sink
points and checks whether the fanout exceeds a certain limit. If it
does, the clock gaters are cloned again according to the design
rule violation checks, DRVs (max fanout, max capacitance and max
transition). After cloning, clock tree synthesis is executed,
followed by clock_opt, which performs timing, power and area
optimizations.
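The fanout-driven cloning step can be sketched as follows. This is a simplified illustration, not the tool's algorithm: it only considers max fanout (not capacitance, transition or placement), and the function name, sink names and limit are hypothetical.

```python
import math

def clone_clock_gater(sinks, max_fanout):
    """Split one gater's sinks across clones so each clone drives
    at most max_fanout sinks, in near-equal contiguous groups."""
    if max_fanout < 1:
        raise ValueError("max_fanout must be at least 1")
    num_clones = max(1, math.ceil(len(sinks) / max_fanout))
    base, extra = divmod(len(sinks), num_clones)
    groups, start = [], 0
    for i in range(num_clones):
        size = base + (1 if i < extra else 0)  # first 'extra' clones take one more sink
        groups.append(sinks[start:start + size])
        start += size
    return groups

sinks = [f"ff_{i}" for i in range(70)]          # hypothetical sink names
groups = clone_clock_gater(sinks, max_fanout=32)
print([len(g) for g in groups])
```

A real flow would also group sinks by location so that each clone sits close to the flops it drives.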


Figure 2 : Clock Flow


Block configuration

Mesh Layer: M13 (Mesh Terminals)


Target Latency: 250ps
Target Skew: 35ps
Mesh Terminal pitch X: 40.128 microns
Mesh Terminal pitch Y: 40.128 microns
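As a rough illustration of what this pitch implies, the sketch below lays out terminal locations on a regular grid. Only the 40.128 micron pitch comes from the configuration above; the block dimensions, the origin at (0, 0) and the function name are assumptions.

```python
def mesh_terminal_grid(width_um, height_um, pitch_x_um=40.128, pitch_y_um=40.128):
    """Return (x, y) terminal locations on a regular grid starting at the origin."""
    nx = int(width_um // pitch_x_um) + 1   # terminals per row
    ny = int(height_um // pitch_y_um) + 1  # number of rows
    return [(round(ix * pitch_x_um, 3), round(iy * pitch_y_um, 3))
            for iy in range(ny) for ix in range(nx)]

# Hypothetical 400 um x 200 um block
terminals = mesh_terminal_grid(width_um=400.0, height_um=200.0)
print(len(terminals), terminals[0], terminals[-1])
```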
For each experiment I have provided a table comparing the results
of the same block with and without the optimization switch.
My comparison points are skew, setup slack, buffer count, inverter
count, launch path and capture path latency, and power consumed
by the clock network.
For this I checked the pattern of violating paths in each design and
picked one highly violating path from each design. All these
switches are applied at the clock_opt stage.
eInfochips helps in m2m IoT application development with low
power clock tree synthesis (CTS) optimization in ASIC back-end
solution platform. Watch this video to know,
Why is CTS needed?
How is CTS helpful?
How to optimize CTS?
How to overcome challenges while implementing CTS


Experiments
1) Enabling Global routing for timing and skew optimization.
Default : set_app_options -name cts.compile.enable_global_route -value false
Exp1 : set_app_options -name cts.compile.enable_global_route -value true
During clock tree synthesis, this option enables the global router
at the initial stage. By default the option is false, and a virtual
router is used instead of the global router during initial synthesis.
Virtual routing is used at the pre-optimization stage for fast
prediction of the wire pattern. It does not include layer
assignment and does not consider whether enough routing
resources are available.
Global routing is the first step of the actual wire implementation.
It tries to avoid global congestion. It takes longer to run but
gives more accurate timing results.
So the advantage of the global router is that the timing results are
more accurate and the optimization is based on an estimate of the
routability and congestion in the design.
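The difference between the two estimates can be illustrated with a toy grid model. This is only a sketch under strong assumptions: a virtual route is taken to be the Manhattan distance, and the "global route" is a breadth-first search that avoids fully congested grid cells. Real global routers work with per-cell capacity, demand and layer assignment; the grid, the blocked cells and the function names here are hypothetical.

```python
from collections import deque

def virtual_route_length(src, dst):
    """Manhattan distance: ignores layers and routing resources."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def global_route_length(src, dst, blocked, cols, rows):
    """Shortest path in grid cells that avoids congested cells (BFS)."""
    queue, seen = deque([(src, 0)]), {src}
    while queue:
        (x, y), dist = queue.popleft()
        if (x, y) == dst:
            return dist
        for cell in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            cx, cy = cell
            if 0 <= cx < cols and 0 <= cy < rows and cell not in blocked and cell not in seen:
                seen.add(cell)
                queue.append((cell, dist + 1))
    return None  # no route exists on this grid

blocked = {(2, 0), (2, 1), (2, 2)}  # hypothetical congested cells
estimate = virtual_route_length((0, 1), (4, 1))
routed = global_route_length((0, 1), (4, 1), blocked, cols=5, rows=4)
print(estimate, routed)
```

When the routed length is longer than the estimate, a clock tree sized against the virtual route would see more wire delay than it was built for, which is the inaccuracy this switch is meant to remove.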
Results | Default | Using Switch
Setup slack | -46.1ps | 9ps
Launch path latency | 247.7ps | 222.3ps
Capture path latency | 172.5ps | 191.02ps
Skew | 75.2ps | 31.3ps
CK capture path BUF/INV | Buff: X8, X24, X8, X4, X8 | Buff: X8, X32, X8, X12, X12
CK launch path BUF/INV | Buff: X8, X8, X12, X12 | Buff: X8, X32, X4, X8, X12
CKBUF Count | 6841 | 7273
CKINV Count | 844 | 864
CKBUFF Power | 42.2mW | 42.9mW
CKINV Power | 4.08mW | 4.11mW
Because global routing was enabled during clock tree synthesis,
the synthesis was based on the actual wire implementation. Launch
path latency and skew decreased, while capture path latency
increased. We got a margin of 9ps in setup timing at the cost of
increased buffer and inverter count. The total power consumed by
the clock buffers and inverters in the whole design increased by
0.7mW and 0.03mW respectively. If the clock buffer count and
power budget can be relaxed, this switch is useful to reduce timing
violations.
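The trade-off in this experiment can be recomputed directly from the table values. The sketch below is only a bookkeeping aid; the dictionary keys and helper name are made up for illustration, while the numbers are taken from the table above.

```python
default = {"launch_ps": 247.7, "capture_ps": 172.5, "ckbuf": 6841, "ckinv": 844,
           "ckbuf_mw": 42.2, "ckinv_mw": 4.08}
global_route = {"launch_ps": 222.3, "capture_ps": 191.02, "ckbuf": 7273, "ckinv": 864,
                "ckbuf_mw": 42.9, "ckinv_mw": 4.11}

def skew_ps(run):
    """Skew of the observed path: launch latency minus capture latency."""
    return round(run["launch_ps"] - run["capture_ps"], 2)

def delta(key):
    """Change caused by the switch for one table row."""
    return round(global_route[key] - default[key], 2)

print("skew:", skew_ps(default), "->", skew_ps(global_route))
print("extra buffers:", delta("ckbuf"), "extra inverters:", delta("ckinv"))
print("extra power (mW):", delta("ckbuf_mw"), "+", delta("ckinv_mw"))
```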
2) Concurrent clock and data optimization (CCD)
set_app_options -name clock_opt.flow.enable_ccd -value true
When set to true, this app option performs concurrent clock and
data (CCD) optimization at the clock_opt stage. The CCD technique
optimizes the data path and the clock path concurrently.
This option also performs area and power optimization at the
clock_opt stage.
Results | Default | Switch
Setup slack | -46.1ps | 7ps
Launch path latency | 247.7ps | 232.9ps
Capture path latency | 172.5ps | 187.133ps
Skew | 75.2ps | 45.67ps
CK capture path BUF/INV | Buff: X8, X24, X8, X4, X8 | Buff: X8, X12, X32, X32
CK launch path BUF/INV | Buff: X8, X8, X12, X12 | Buff: X8, X32, X8, X20
CKBUF Count | 6841 | 5829
CKINV Count | 844 | 703
CKBUFF Power | 42.2mW | 41.4mW
CKINV Power | 4.08mW | 3.79mW
From the above table, we can see that the default experiment had
-46.1ps setup slack, and with CCD optimization we got a margin of
7ps. From observing the 10 to 15 most violating paths, we
concluded that CCD applies useful skew techniques during
datapath optimization to improve the timing QoR. To fix a setup
violation, the tool adjusts the launch and capture paths so that the
launch clock path delay plus the data path delay is reduced and
the capture path delay is increased. The overall clock buffer and
inverter count is lower than in the default experiment, so power
consumption and area are reduced.
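The useful-skew idea can be shown with the standard setup slack relation, slack = period + capture latency - launch latency - data delay - setup time. The sketch below uses a hypothetical three-flop pipeline (FF1 to FF2 to FF3) with made-up numbers and ignores hold; it only illustrates how delaying the middle flop's clock moves slack from one stage to the other.

```python
def setup_slack(period, launch_lat, capture_lat, data_delay, setup_time):
    """Setup slack of one stage, all values in ps."""
    return period + capture_lat - launch_lat - data_delay - setup_time

def balancing_shift(slack_in, slack_out):
    """Clock delay to add at the middle flop so both stages end with equal slack."""
    return (slack_out - slack_in) / 2.0

PERIOD, SETUP, LAT = 1000.0, 30.0, 200.0   # hypothetical values
stage1 = setup_slack(PERIOD, LAT, LAT, data_delay=1010.0, setup_time=SETUP)  # FF1 -> FF2
stage2 = setup_slack(PERIOD, LAT, LAT, data_delay=870.0, setup_time=SETUP)   # FF2 -> FF3

shift = balancing_shift(stage1, stage2)
# Delaying FF2's clock helps the stage it captures and hurts the stage it launches.
stage1_ccd = setup_slack(PERIOD, LAT, LAT + shift, 1010.0, SETUP)
stage2_ccd = setup_slack(PERIOD, LAT + shift, LAT, 870.0, SETUP)
print(stage1, stage2, shift, stage1_ccd, stage2_ccd)
```

A violation can only be absorbed this way if the neighbouring stage has slack to give, which is why CCD works on clock and data paths together.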


3) Applying NDR
Default : set_app_options -name clock_opt.flow.optimize_ndr -value false
Exp : set_app_options -name clock_opt.flow.optimize_ndr -value true
The tool applies non-default routing rules (NDR) on long,
timing-critical nets during clock_opt optimization to improve
timing. Applying NDR to a timing-critical net increases the width of
the net, which decreases its resistance and therefore its net delay.
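The width-to-delay relationship can be sketched with a simple wire model: resistance R = sheet resistance x length / width, and an Elmore delay of R x (C_wire / 2 + C_load) for a distributed wire driving a lumped load. The sheet resistance, wire length, capacitances and function names below are illustrative assumptions, and the wire capacitance is assumed unchanged by the NDR (wider wires add capacitance, wider spacing removes coupling).

```python
def wire_resistance_ohm(sheet_res_ohm_sq, length_um, width_um):
    """Resistance of a rectangular wire segment."""
    return sheet_res_ohm_sq * length_um / width_um

def elmore_wire_delay_ps(r_ohm, c_wire_ff, c_load_ff):
    """Distributed RC wire driving a lumped load; 1 ohm * 1 fF = 0.001 ps."""
    return r_ohm * (c_wire_ff / 2.0 + c_load_ff) * 1e-3

SHEET_RES, LENGTH_UM = 0.1, 500.0   # hypothetical ohm/sq and um
C_WIRE_FF, C_LOAD_FF = 100.0, 10.0  # hypothetical capacitances

r_default = wire_resistance_ohm(SHEET_RES, LENGTH_UM, width_um=0.1)
r_ndr = wire_resistance_ohm(SHEET_RES, LENGTH_UM, width_um=0.2)   # double-width NDR
d_default = elmore_wire_delay_ps(r_default, C_WIRE_FF, C_LOAD_FF)
d_ndr = elmore_wire_delay_ps(r_ndr, C_WIRE_FF, C_LOAD_FF)
print(round(r_default), round(r_ndr), round(d_default, 2), round(d_ndr, 2))
```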
Results | Default | Switch
Setup slack | -46.1ps | -21ps
Launch path latency | 247.7ps | 228.5ps
Capture path latency | 172.5ps | 175.8ps
Skew | 75.2ps | 52.7ps
CK capture path BUF/INV | Buff: X8, X24, X8, X4, X8 | Buff: X8, X24, X32, X20
CK launch path BUF/INV | Buff: X8, X8, X12, X12 | Buff: X8, X32, X24, X8
CKBUF Count | 6841 | 6912
CKINV Count | 844 | 854
CKBUFF Power | 42.2mW | 42.5mW
CKINV Power | 4.08mW | 4.16mW
From the above table, WNS is -46.1ps in the default experiment
and -21ps with NDR optimization. Launch path latency is lower
than in the default experiment because NDR is applied to
timing-critical nets, which decreases the net delays. However, the
total clock buffer count, inverter count and power consumption
have increased. Even after applying NDR to the timing-critical nets,
the setup slack is still negative, although better than in the default
experiment; since no margin was available, we did not see any
power optimization.
4) Enabling Area Recovery
set_app_options -name clock_opt.flow.enable_clock_power_recovery -value area
This option turns on power recovery in clock_opt optimization. The
valid values are: auto, none, power, area. By default, it is auto
when the CCD flow is enabled. In a non-CCD flow, auto means none.
Setting the value to area turns on area recovery mode, where the
optimization is driven by area.
Results | Default | Switch
Setup slack | -46.1ps | 2ps
Launch path latency | 247.7ps | 229ps
Capture path latency | 172.5ps | 177.8ps
Skew | 75.2ps | 51.2ps
CK capture path BUF/INV | Buff: X8, X24, X8, X4, X8 | Buff: X8, X18, X4, X8
CK launch path BUF/INV | Buff: X8, X8, X12, X12 | Buff: X8, X32, X24, X4
CKBUF Count | 6841 | 6894
CKINV Count | 844 | 811
CKBUFF Power | 42.2mW | 42.1mW
CKINV Power | 4.08mW | 4.15mW
From the above table we can see that the total number of clock
buffers and inverters (7705) is greater than in the default
experiment (7685), so the total area is greater in this experiment.
clock_opt first tries to fix timing violations and then optimizes the
area if margin is available. After timing optimization the setup
margin was not sufficient for area recovery, so area optimization
did not take place. Timing margin is therefore required for area
recovery.
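The dependence on margin can be sketched as a greedy loop that accepts a downsizing move only if setup slack stays non-negative afterwards. This is a hypothetical model of margin-gated recovery, not the tool's algorithm; the candidate cells, their area savings and delay penalties are invented for illustration.

```python
def recover_area(setup_slack_ps, candidates, guard_band_ps=0.0):
    """Greedily accept downsizing moves while slack stays above the guard band.

    candidates: list of (name, area_saved, delay_penalty_ps), penalty > 0.
    Returns the accepted cell names and the remaining slack.
    """
    accepted, slack = [], setup_slack_ps
    # Try the moves that save the most area per picosecond of slack first.
    ranked = sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
    for name, _area_saved, penalty_ps in ranked:
        if slack - penalty_ps >= guard_band_ps:
            accepted.append(name)
            slack -= penalty_ps
    return accepted, slack

moves = [("buf_a", 1.2, 3.0), ("buf_b", 0.8, 4.0), ("buf_c", 2.0, 2.5)]  # hypothetical
tight = recover_area(2.0, moves)    # 2 ps margin, as in this experiment
loose = recover_area(10.0, moves)   # a block with more margin
print(tight, loose)
```

With only 2 ps of margin every candidate move would push the path negative, so nothing is recovered, which matches the behaviour seen in the table.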
5) Enabling power recovery
set_app_options -name clock_opt.flow.enable_clock_power_recovery -value power
As explained in experiment 4, this option also accepts the value
power, with which the tool optimizes the design for power
consumption.
Results | Default | Switch
Setup slack | -46.1ps | 9ps
Launch path latency | 247.7ps | 229.3ps
Capture path latency | 172.5ps | 179.6ps
Skew | 75.2ps | 49.7ps
CK capture path BUF/INV | Buff: X8, X24, X8, X4, X8 | Buff: X8, X32, X24, X8
CK launch path BUF/INV | Buff: X8, X8, X12, X12 | Buff: X8, X18, X4, X8
CKBUF Count | 6841 | 6980
CKINV Count | 844 | 885
CKBUFF Power | 42.2mW | 42.1mW
CKINV Power | 4.08mW | 4.11mW
Here we see that priority is again given to timing, not to power.
Margin was not available here either, so power recovery was not
done.
6) Disabling path groups for optimization if margin is available
set_app_options -name ccd.skip_path_groups -value {reg2mem mem2reg}
set_app_options -name clock_opt.flow.enable_ccd -value true
The first app option makes CCD skip the path groups given in the
list. We can skip path groups that are not timing critical, so that
the tool can put most of its effort into the paths that are timing
critical.
Results | Default | Switch
Setup slack | -46.1ps | 6ps
Launch path latency | 247.7ps | 221.9ps
Capture path latency | 172.5ps | 175ps
Skew | 75.2ps | 46.9ps
CK capture path BUF/INV | Buff: X8, X24, X8, X4, X8 | Buff: X8, X32, X24, X24, X8
CK launch path BUF/INV | Buff: X8, X8, X12, X12 | Buff: X8, X8, X12, X12
CKBUF Count | 6841 | 5968
CKINV Count | 844 | 674
CKBUFF Power | 42.2mW | 41.3mW
CKINV Power | 4.08mW | 3.09mW
In this block, two of my path groups have timing margin, so I
enabled CCD optimization and told the tool not to spend its
resources optimizing those paths. The tool then puts its emphasis
on the timing-critical paths, and hence we get a positive margin in
timing, and the clock buffer count, inverter count and power are
reduced.
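Choosing which path groups to skip can be sketched as a simple filter on per-group worst slack. The slack values, the 20 ps threshold and the helper name below are hypothetical; only the option name ccd.skip_path_groups and the group names reg2mem and mem2reg come from the experiment above. The script just builds the option line as text.

```python
def groups_to_skip(path_group_wns_ps, margin_ps):
    """Path groups whose worst slack is at least margin_ps, sorted by name."""
    return sorted(g for g, wns in path_group_wns_ps.items() if wns >= margin_ps)

# Hypothetical worst negative slack per path group, in ps
wns = {"reg2reg": -40.0, "reg2mem": 35.0, "mem2reg": 28.0, "in2reg": 4.0}

skip = groups_to_skip(wns, margin_ps=20.0)
option_line = "set_app_options -name ccd.skip_path_groups -value {%s}" % " ".join(skip)
print(option_line)
```

Groups with only a small positive slack (in2reg here) are left in, since they could become critical once CCD starts shifting clock arrival times.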
7) Hold Fixing
set_app_options -name ccd.hold_control_effort -value high
set_app_options -name clock_opt.flow.enable_ccd -value true
The first app option controls the hold optimization effort. It has
five values: none, low, medium, high and ultra. By default it is set
to low.
The hold slack is given in the table below.
Results | Default | Switch
Hold slack | -89ps | 35ps
Launch path latency | 238.76ps | 230.34ps
Capture path latency | 194.57ps | 206.46ps
Skew | 44.19ps | 23.88ps
CK capture path BUF/INV | Buff: X8, X4, X16, X12, X8 | Buff: X8, X28, X12, X32, X4
CK launch path BUF/INV | Buff: X8, X12, X32, X8 | Buff: X8, X18, X32, X4; Delay Buff: DLX2
CKBUF Count | 6841 | 6124
CKINV Count | 844 | 756
CKBUFF Power | 42.2mW | 40.3mW
CKINV Power | 4.08mW | 4.05mW
In this experiment the hold timing is met with a 35ps margin and
the skew is also decreased. One SVT delay buffer, DLX2, is added
in the launch path to increase the data path delay. The total clock
buffer and inverter count and the total power consumption are
reduced by enabling CCD optimization.
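Why a delay buffer fixes hold can be seen from the usual hold relation, slack = launch latency + minimum data delay - capture latency - hold time. The sketch below uses invented latencies, delays and a 30 ps delay-buffer value; it only illustrates how the number of delay cells follows from the shortfall.

```python
import math

def hold_slack_ps(launch_lat, capture_lat, min_data_delay, hold_time):
    """Hold slack of one path, all values in ps."""
    return launch_lat + min_data_delay - capture_lat - hold_time

def delay_buffers_needed(slack_ps, target_margin_ps, buffer_delay_ps):
    """Number of identical delay buffers needed to reach the target margin."""
    shortfall = target_margin_ps - slack_ps
    return max(0, math.ceil(shortfall / buffer_delay_ps))

# Hypothetical path: capture clock arrives 50 ps after the launch clock
slack = hold_slack_ps(launch_lat=200.0, capture_lat=250.0,
                      min_data_delay=60.0, hold_time=25.0)
count = delay_buffers_needed(slack, target_margin_ps=10.0, buffer_delay_ps=30.0)
fixed = hold_slack_ps(200.0, 250.0, 60.0 + count * 30.0, 25.0)
print(slack, count, fixed)
```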
Conclusion
All the optimization switches are used to optimize power, area or
timing. Timing is given the highest priority; only if timing margin is
available will the tool then try to optimize the design for power
and area. These switches do not necessarily reduce timing
violations; the outcome depends on the block complexity. So we
can use the above switches to optimize timing, but after
optimizing with them we need to check the target latency and
target skew.
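As a final check, the observed paths from the experiment tables can be compared against the block targets (250ps latency, 35ps skew). This sketch treats the launch-minus-capture difference of the single reported path as the skew, as the tables do; a real sign-off check would use the full clock tree report. The run labels and helper name are illustrative.

```python
TARGET_LATENCY_PS = 250.0
TARGET_SKEW_PS = 35.0

# (launch latency, capture latency) in ps, from the experiment tables
runs = {
    "default": (247.7, 172.5),
    "1 global route": (222.3, 191.02),
    "2 CCD": (232.9, 187.133),
    "3 NDR": (228.5, 175.8),
    "4 area recovery": (229.0, 177.8),
    "5 power recovery": (229.3, 179.6),
    "6 skip path groups": (221.9, 175.0),
    "7 hold effort high": (230.34, 206.46),
}

def check(launch_ps, capture_ps):
    """Return (latency_ok, skew_ok) for one observed path."""
    skew = abs(launch_ps - capture_ps)
    latency = max(launch_ps, capture_ps)
    return latency <= TARGET_LATENCY_PS, skew <= TARGET_SKEW_PS

report = {name: check(*latencies) for name, latencies in runs.items()}
meets_both = [name for name, (lat_ok, skew_ok) in report.items() if lat_ok and skew_ok]
print(meets_both)
```

On these paths every run stays within the latency target, but only some of them reach the skew target, which is why positive slack alone is not enough to accept a switch.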
eInfochips (An Arrow Company) can help tech companies solve
CTS implementation challenges in their ASIC designs by
leveraging highly efficient and skilled ASIC design processes.
We have subject matter experts to work on highly challenging
product design and development requirements. Our expertise
helps semiconductor and product companies shorten their
time-to-market while addressing challenges related to power,
timing and area. For more information contact us today.
Authors
Haswant Kumar (ASIC Physical Design Engineer)
Bhavik Balwani (ASIC Physical Design Engineer)
