16 MCH 1
Manuscript ID Draft
Complete List of Authors: CHENTOUF, Mohamed; Siemens PLM Software, Siemens EDA; L'Ecole
A PUS Based Nets Weighting Mechanism for Power, Hold, and Setup Timing Optimization

Mohamed Chentouf1, 2 (Corresponding author) (mohamed_chentouf@mentor.com),
Zine El Abidine Alaoui Ismaili2 (z.alaoui@um5s.net.ma)
1 Mentor a Siemens Business/CSD Calypto Division, Rabat, 10010, Morocco
2 Information, Communication and Embedded Systems (ICES) Team, University Mohammed V, Rabat, 10010, Morocco
Abstract: Power consumption has become a major constraint in VLSI design. A considerable power increase is usually seen during the hold closure step of the physical design, which is performed at the post-CTS and post-route stages. Hold optimization is performed by applying circuit-level changes such as buffer insertion, cell sizing, useful-skew, or cell movement. Moving the hold fixing problem to the pre-CTS stage represents a big opportunity for power saving and design closure improvement. In this paper, we present a novel power, hold, and setup driven placement algorithm. The objective is to reduce not only the setup but also the hold violations while keeping the power consumption under control. This objective is achieved by changing the weighting mechanism of a commercial Power and Timing Driven Placement (PTDP) engine to include power, hold, and electrical Design Rule Constraints (eDRC) in the weighting equation. The new equation drives the placer to place the cells that are on setup critical paths, or connected to high power nets, close to each other, and relaxes the weight of the cells that are on hold critical paths so the placer may place them far from each other. As a consequence, critical setup, [...] timing gain is about 15% and 13% in TNS and THS respectively. The total power gain is about 9%, distributed as 7% in leakage power and 9% in dynamic power.

Index Terms: Application-specific integrated circuits (ASIC), Timing Driven Placement (TDP), hold timing optimization, setup timing optimization, Predictive Useful-Skew (PUS), static timing analysis, electrical design rule constraints, electronic design automation, physical design, global routing, power optimization, Total Hold Slack (THS), Worst Hold Slack (WHS), Total Negative Slack (TNS), Worst Negative Slack (WNS), Clock Skew.

I. INTRODUCTION

Nets weighting is a technique that has been extensively studied in recent decades. It is used to drive the placer to produce different results depending on the objective to minimize. It was originally used to reduce the Total Wire-Length (TWL). With technology scaling, the complexity of designs has increased considerably and the interconnect delay has exceeded the cell delay. Thus, designers and EDA providers have used the timing information to calculate the net weighting and drive the placer to be timing aware (TDP) [1].

Many algorithms were proposed to improve the placement quality in terms of timing, routability, area, or power. Kong T. proposed a new weighting algorithm that takes into consideration the number of critical paths that share a common segment and assigns a higher weight to the edges of this common segment [2]. Another approach to overcome the timing closure problem was proposed by Papa D. et al., who showed that a linear-wire-delay model is sufficient to model the impact of buffering in the placement stage, then developed RUMBLE, a linear programming based TDP which includes buffering for slack-optimal placement [3]. Wang Q. et al. improved the placement timing by iteratively optimizing the timing-critical sub-circuit with Linear Programming and timing driven legalization [4].

[...] did not take into account the hold timing requirement due to the ideal nature of the clock at this early stage of the design implementation. However, hold closure after the Clock Tree Synthesis (CTS) causes a big power increase due to the inserted delay elements. Taking the hold requirement into the weighting formulation is an opportunity to further reduce the power consumption and to improve the design closure.

In this paper, we propose a new linear programming (LP) formulation that includes the hold parameter in the weight calculation based on the predictive useful-skew methodology. In [7], Chan et al. showed that the application of useful-skew at the pre-CTS stage improves the timing correlation between the pre-CTS and post-CTS stages. Thus, we use the predictive useful-skew (PUS) to perform the STA (Static Timing Analysis) and estimate the hold timing at the pre-CTS stage. We then include the estimated hold in the weight calculation formula to relax the constraints on nets that are on hold critical paths and to drive the TDP to be hold-aware. The main contributions of this work are summarized as follows:
Transactions on Design Automation of Electronic Systems Page 2 of 11
1. A novel nets weighting calculation formula that includes the hold factor, besides the setup and the power parameters.

2. A flow that integrates the new formula in the placement stage and measures its benefits at the post-CTS stage of the PnR flow.

The remainder of this paper is organized as follows: Section II gives a global overview of the TDP and its related work. It also gives the state of the art of the predictive useful-skew application in modern VLSI designs. Section III describes the new weighting equations used to calculate the nets weight based on their power, setup, and hold timing characteristics, and gives a detailed explanation of the integration of the new weighting mechanism in the Place and Route (PnR) flow. Section IV presents the results achieved with this new approach, especially the power, area, and timing gains. Finally, Section V gives a conclusion and draws the perspectives of our study.

II. OVERVIEW OF TIMING DRIVEN PLACEMENT AND USEFUL-SKEW PREDICTION

A. Placement Overview

To overcome the placement challenges, many approaches were developed to simplify the task and make it computationally less intensive, so as to produce a good solution for designs with multi-millions of objects (gates, pins, nets, macros) in a reasonable runtime. Historically, the placement algorithms can be roughly grouped into four classes: partition based placement [8] [9] [10], quadratic placement [11] [12], simulated annealing based placement [13] [14], and nonlinear placement [9].

Initially, EDA placement algorithms were timing-driven to maximize circuit performance. Net-weighting was a known technique used to drive the placement to minimize a specific [...] its placement model [15] [16] [17]. In this context, Tsay R. et al. have implemented an analytic weighting mechanism that transforms the timing information into net weight and compiles a weighted wire-length minimization engine [18]; the results were significant in terms of runtime and timing. Dunlop A. et al. have proposed an iterative update of the net weights with a continuous model to improve the placement convergence and to overcome the limitation of static net weighting [39].

In the last decade, the IC market focus has shifted from circuit speed only to circuit speed-power efficiency. More parameters have been added to the placement formulation to reduce the power consumption early in the design process.

In general, the sources of power dissipation in an IC are divided into three broad categories: switching power, short-circuit power, and leakage power [19]. The leakage power is the energy consumed due to leakage current in the MOS technologies. The short-circuit power is the energy consumed due to the short circuit current that flows during the transition time of the MOS transistors. The dynamic power is the power needed to perform the circuit computations by charging and discharging all parasitic capacitances of the design. Many research and development efforts were carried out to include power reduction in the placement stage. In [20], Obermeier et al. introduced the power density into the placement formulation, which led to a flat temperature distribution and a good power and heat reduction. In [5], Cheon Y. et al. proposed a register clustering technique that reduces the clock power dissipation by including the power in the weight equation to shorten high power nets, and were able to achieve a gain of 11.4% in power with a minor timing degradation of 2% and an area overhead of 1.2%.

B. Useful-skew Overview

Two decades ago, the skew minimization was of high [...]
Placement: Skew minimization was considered from the placement stage. In [23] and [24], sequential cell sites are mapped to predefined locations of a template clock tree in the middle of the quadratic placement. The authors of [25] proposed a modified scheme that performs iterative placement modification based on [26] and skew optimization such as in [27] to produce a skew aware placement.

CTS: F. Niu et al. proposed an obstacle aware zero skew clock tree synthesis flow which consists of two steps: the first step generates the topology of the clock tree, then an Obstacle-aware Deferred Merge Embedding (ODME) algorithm is applied to complete the clock tree routing [40].

Routing: Several works have applied wire sizing to reduce clock skew [28][29][30]. Guthaus et al. [31] proposed sequential linear programming as well as quadratic programming based clock buffer/wire sizing to minimize clock skew. Shu et al. [32] performed wire sizing for skew minimization.

Shift from Zero Skew to Useful-skew: The zero-skew clock tree was an active field of research, but more recently it has been shown that "exact zero-skew" comes at the cost of increased power consumption and wire length. Friedman et al. pointed out [...] technique for timing and power optimization where the clock latencies of FFs are skewed intentionally to increase the clock frequency and timing margins of the design [34] [13].

C. Placement and Useful-skew Combination

Recently, some placement optimization techniques were introduced after clock tree synthesis to improve the early slack while preserving an optimized late slack. In [35], Huang et al. proposed some placement modifications (in-place optimizations) to predict the optimal Steiner tree topology after each move and then optimize the clock tree by a clock tree re-connection mechanism. The main limitations of this approach are its focus on the placement optimization of FFs only, and the number of clock tree modifications, which could impact the design closure negatively, especially in complex SoCs. To overcome these limitations, we take the early slack optimization to the pre-CTS stage based on the predictive useful-skew timing information, and we combine [35], [6] and [7] to generate a (setup, hold, and power)-aware placement and to give the CTS engine a fully formulated problem to get the maximum benefit from it.

Fig. 1. Reference Flow. Steps: Start with a fully placed and legalized design database; Run the Predictive Useful Skew (PUS); Run the CTS to realize the estimated PUS; Optimize the design to correct the setup and hold timing; QoR Assessment.

Fig. 2. New PUS Driven Flow. Steps: Start with a fully placed and legalized design database; [...] based on PUS; [...] Based net weighting; Apply dynamic Net-Weighting and incremental Global Placement; Legalization and Global Route repair; Run the CTS to realize the estimated PUS; Optimize the design to correct the setup and hold timing; QoR Assessment.
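As a reminder of why intentionally skewing clock latencies trades setup margin against hold margin, the standard single-cycle slack relations can be written as follows. This is a textbook sketch, not the formulation used by the PUS engine. The symbols are notation assumed here: $T$ is the clock period, $t_L$ and $t_C$ are the launch and capture clock latencies, $d_{max}$ and $d_{min}$ are the longest and shortest data path delays, and $t_{setup}$ and $t_{hold}$ are the capture flip-flop constraints.

```latex
\begin{align}
slack_{setup} &= T + (t_C - t_L) - d_{max} - t_{setup} \\
slack_{hold}  &= d_{min} - (t_C - t_L) - t_{hold} \\
slack_{setup} + slack_{hold} &= T + d_{min} - d_{max} - t_{setup} - t_{hold}
\end{align}
```

Under these assumptions, increasing the skew $t_C - t_L$ by some amount raises the setup slack and lowers the hold slack by the same amount, while their sum does not depend on the skew. The sum changes only when the data path delays change, which is what a placement change does.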
III. OUR APPROACH FOR NETS WEIGHTING CALCULATION

As shown in Section II, traditional PnR flows optimize the hold timing after the CTS stage, and use the useful-skew scheduling to drive the clock tree synthesis engine to realize the previously computed offsets instead of targeting a zero-skew clock tree. In this section, we use the predictive useful-skew to calculate the nets weight and to perform an iterative (setup, hold, and power)-driven incremental placement.

A. New Hold Aware Incremental Placement Flow

Usually, hold fixing is performed at the post-CTS stage, where the clocks are fully propagated. The main technique for hold fixing is the insertion of delay elements to slow down the data signal. This technique comes with a non-negligible cost in power consumption.

By treating the hold fixing problem from the placement stage, we reduce the number of buffers inserted during the post-CTS hold optimization, which is very beneficial for power and area reduction as well as routability improvement. The outlines of our new flow, as well as the reference flow, are illustrated in Fig. 1 and Fig. 2.

Our reference flow is a traditional flow that uses the predictive useful-skew and drives the clock tree synthesis engine to fully exploit the potential of useful-skew. In the new flow, a step is added to perform a dynamic net weighting and incremental global placement based on the predictive useful-skew timing information. The weighting calculation formula from [6] is modified to include the hold factor in addition to the setup, power, and eDRC parameters. A pass of legalization and global routing repair is performed to clean all illegal cells and repair the routing to have a good routing congestion estimation. The CTS engine is called afterward to realize the previously calculated offsets and then a pass of timing optimization to [...]

[...] 16nm, the weighting is dominated by the power factor. Applying the normalization helps to standardize the weighting process and provides more controllability over the nets criticality.

The next step in the process is to calculate a timing-based weight which is a combination of setup and hold timing slacks (Algorithm 1). The algorithm first calculates the setup and hold timing criticality of the net based on the slack and the number of critical paths going through it. $net_{setup}$ is the sum of the negative setup slacks of the violated setup timing paths traversing the net. Similarly, $net_{hold}$ is the sum of the negative hold slacks of the violated hold timing paths traversing the net.

The timing-based weight is then calculated depending on whether the net is setup or hold critical. A prioritization parameter α is used to control the ratio of each factor in the final timing-based weight and to provide a knob for the setup-hold trade-off.

Algorithm 1: Timing based weight formula
1: for net ∈ design data nets do
2:   $net_{setup} = \sum_{i=0}^{k} negative\_setup(path_i)$
3:   $net_{hold} = \sum_{i=0}^{k} negative\_hold(path_i)$
4:   if $net_{setup} < 0$ and $net_{hold} < 0$ then
5:     $w_t = (1 - \alpha) \cdot net_{hold} - \alpha \cdot net_{setup}$
6:   else
7:     if $net_{setup} \geq 0$ and $net_{hold} < 0$ then
8:       $w_t = (1 - \alpha) \cdot net_{hold}$
9:     else
10:      if $net_{setup} < 0$ and $net_{hold} \geq 0$ then
11:        $w_t = -\alpha \cdot net_{setup}$
12:      else
[...]
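A minimal Python sketch of lines 1 to 11 of Algorithm 1 for a single net is given below. The function name `timing_weight` and its argument layout are illustrative assumptions, not part of the original TCL implementation. The final branch is also an assumption: the algorithm listing available here stops at line 12, so the sketch returns a neutral weight of zero for nets that are neither setup nor hold critical.

```python
from typing import Iterable


def timing_weight(setup_slacks: Iterable[float],
                  hold_slacks: Iterable[float],
                  alpha: float) -> float:
    """Timing-based weight w_t of one net (Algorithm 1, lines 2 to 11).

    setup_slacks / hold_slacks: slacks of the timing paths traversing the net.
    alpha: setup-versus-hold prioritization knob, expected in [0, 1].
    """
    # Only violated paths (negative slack) contribute to the criticality sums.
    net_setup = sum(s for s in setup_slacks if s < 0)
    net_hold = sum(h for h in hold_slacks if h < 0)

    if net_setup < 0 and net_hold < 0:
        # Setup pulls the weight up, hold pushes it down.
        return (1 - alpha) * net_hold - alpha * net_setup
    if net_hold < 0:
        # Hold critical only: negative value, i.e. a relaxed net.
        return (1 - alpha) * net_hold
    if net_setup < 0:
        # Setup critical only: positive value, i.e. a tightened net.
        return -alpha * net_setup
    # ASSUMPTION: the source listing is cut off after line 12; a neutral
    # weight is used here for nets with no violation.
    return 0.0
```

With `alpha = 0.5`, a net crossed by setup paths with slacks -2, 1 and -1 and no hold violation gets `w_t = 1.5`, while a net with a single hold violation of -4 and `alpha = 0.25` gets `w_t = -3.0`.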
Algorithm 2: eDRC based weight formula
1: for net ∈ design data nets do
2:   $net_{mtv}$ = Max(Max_transition violations of fan-out cells)
3:   $net_{mcv}$ = Max_capacitance violation of the fan-in cell
4:   if $net_{mtv} < 0$ and $net_{mcv} < 0$ then
5:     $w_{drc} = -\left(net_{mtv} + net_{mcv} \cdot \frac{Max(nets_{mtv})}{Max(nets_{mcv})}\right)$
[...]

A pass of normalization is done after calculating both timing-related factors to project the $w_t$ and $w_{drc}$ parameters into the interval [0, 100]. After normalization, all parameters are combined [...] net in the same bin, which can lead to severe congestion and legalization problems and, consequently, a non-routable design.

$$w_{net} = \frac{1}{Fan\text{-}out} \cdot e^{\beta \cdot (w_t + w_{drc}) + (1 - \beta) \cdot w_p} \qquad (3)$$

[...] topology. To maximize the benefit from the new weighting approach, no displacement constraints are applied on the movable cells during the incremental global placement. This allows cells on the hold critical paths to be moved by the distances necessary to meet or reduce the hold violations. On the other hand, the legalization engine of Nitro-SoC is designed to limit the displacement with a default maximum allowable [...]

[Flow chart fragments: Record Placement coordinates of each cell and design QoR; Legalization and Global Route repair]

[...] repeated multiple times with different α and β values to improve the QoR iteratively. It was seen through multiple experiments that the values of α and β are very design dependent and that there are no specific values that give the optimum for all designs. The randomization and acceptance/reverting process proved to be a good solution given the trade-off between runtime and QoR gains. The outcome of this flow is a new placement with an improved or similar QoR. The QoR improvements achieved by this approach are due to the introduction of the hold parameter in the placement formulation and to the multiple incremental placement iterations that improve the convergence after each accepted iteration.
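The normalization pass and the combination of Equation (3) can be sketched in Python as follows. The function names are hypothetical. The min-max scaling is an assumption, since the text only states that the factors are projected into an interval of 0 to 100 and does not give the method.

```python
import math
from typing import List, Sequence


def normalize_0_100(values: Sequence[float]) -> List[float]:
    """Project raw per-net factors into [0, 100].

    ASSUMPTION: plain min-max scaling; the source does not name the method.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All nets equally critical: no net is favored.
        return [0.0 for _ in values]
    return [100.0 * (v - lo) / (hi - lo) for v in values]


def net_weight(w_t: float, w_drc: float, w_p: float,
               fan_out: int, beta: float) -> float:
    """Equation (3): w_net = (1 / fan_out) * exp(beta*(w_t + w_drc) + (1 - beta)*w_p).

    w_t, w_drc, w_p: normalized timing, eDRC and power weights of the net.
    beta: knob balancing the timing/eDRC terms against the power term.
    """
    if fan_out < 1:
        raise ValueError("fan_out must be at least 1")
    exponent = beta * (w_t + w_drc) + (1.0 - beta) * w_p
    # Dividing by the fan-out lowers the weight of high fan-out nets.
    return math.exp(exponent) / fan_out
```

For example, `normalize_0_100([2.0, 4.0, 6.0])` gives `[0.0, 50.0, 100.0]`, and a net with all three factors at zero and a fan-out of 4 gets a weight of 0.25.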
IV. CASE STUDY: NEW VS DEFAULT PLACEMENT OF A SIMPLE DESIGN

In this section, we use a simple design of around 7k standard cells to show and explain the benefits of our placement flow (Fig. 2) compared to the default flow (Fig. 1). The starting point of both flows is a placed and legalized design from [6], which means that the additional gains achieved by the new approach are due to the PUS introduction in the weighting mechanism. In [6], power, timing, and fan-out factors are already used for weight calculation.

The original placement, shown in Fig. 4a, is a well spread placement with a good congestion (Fig. 4b) that is easily routable. The routing difficulty is reflected by the colors in the congestion map: blue means that the design is easily routable, green means routable, yellow means hardly routable, [...] post-CTS optimizations.

The new algorithm optimizes the placement based on the predicted useful-skew before balancing the clock network and optimizing the setup and hold timings. It was evaluated on the same testcase to generate a post-CTS implementation (Fig. 6a and Fig. 6b). It is clear that the design went through strong placement modifications to improve the setup and hold timings, but since the congestion is monitored during each pass, the design is still routable. Although the green zones have increased in the new flow, the yellow zone is smaller than the one seen with the default flow, because the congestion feedback loop automatically reduces the cell density around the macro corner. This means that the design generated with the new flow is more easily routable than the one generated with the default flow.

Fig. 4a. Std cells placement after the TDP.
Fig. 4b. Design congestion map after the TDP.
Fig. 5a. Std cells placement after the CTS and Post-CTS.
Fig. 5b. Design congestion map after the CTS and Post-CTS.
Fig. 6a. Std cells placement after the PUS driven placement, CTS and Post-CTS.
Fig. 6b. Design congestion map after the PUS driven placement, CTS and Post-CTS.

It can be noticed that the new flow has grouped the cells into several clusters after several placement iterations based on the predicted useful skew mechanism. Running the incremental placement with different α and β values while monitoring the timing and congestion has helped to shorten critical setup timing paths and to insert useful skew delays in the clock network to balance the setup and hold timings without degrading the congestion. Also, fewer buffers and inverters are needed to close the hold timing at the post-CTS stage, which reduced the area, the number of nets, and the number of pins, and further improved the congestion.
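The randomization and acceptance/reverting process applied to α and β can be sketched as a simple random search. Everything named below is an assumption made for illustration: `evaluate` stands for a hypothetical callback that runs one incremental placement with the given (α, β) and returns a single QoR cost where lower is better, and the starting point of (0.5, 0.5) is arbitrary. The real flow also records and restores cell coordinates, which the sketch abstracts into keeping the best pair seen so far.

```python
import random
from typing import Callable, Tuple


def tune_alpha_beta(evaluate: Callable[[float, float], float],
                    iterations: int,
                    seed: int = 0) -> Tuple[float, float, float]:
    """Random search over (alpha, beta) with accept/revert.

    Returns the best (alpha, beta) found and its cost.
    """
    rng = random.Random(seed)
    best = (0.5, 0.5)  # ASSUMPTION: neutral starting point
    best_cost = evaluate(*best)

    for _ in range(iterations):
        candidate = (rng.random(), rng.random())  # both knobs drawn from [0, 1)
        cost = evaluate(*candidate)
        if cost < best_cost:
            # Accept: the new placement becomes the reference.
            best, best_cost = candidate, cost
        # Otherwise revert: the previous reference is kept unchanged.

    return best[0], best[1], best_cost
```

Because a candidate is only kept when it lowers the cost, the returned cost can never be worse than the starting one, which mirrors the statement that the flow ends with an improved or similar QoR.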
The default flow has spread the cells over the design in order to achieve the best congestion-timing trade-off in one pass, which resulted in more data cells, clock cells, and wire-length.

Using the PUS based nets weighting mechanism in this small design has helped to reduce the TWL (Total Wire Length) by 21% for data nets and 30% for clock nets. The number of buffers/inverters used for CTS and post-CTS optimizations has decreased by 7%, which resulted in a power gain of 6% in leakage power and 5% in dynamic power. This power gain was achieved without compromising the design timing: since the useful skew was used for setup and hold driven placement adjustment in addition to post-CTS timing optimization, a gain of 2% in TNS and 4% in THS was realized.

In the next section, we run the flow on multiple industrial designs of different technologies to generalize the study.

V. EXPERIMENTAL RESULTS

To evaluate the effectiveness of the proposed flow, we have implemented the algorithm using the TCL programming language and integrated it into the Nitro Reference Flow as shown in Fig. 2. Our initial databases are generated using Nitro-SoC's default PnR flow (NRF) with the net weighting mechanism presented in [6]. The baseline results are generated by running the CTS and post-CTS steps on the initial database as shown in Fig. 1. The new results are generated with a modified version of the same flow including our new weighting mechanism, the incremental placement iterations algorithm, and the same CTS and post-CTS flows as in the baseline (Fig. 2). Since the starting point for each testcase is the database generated by the flow presented in [6], and all the aspects and engines of the flow are the same except the weighting and the incremental placement, this allows us to assess the benefits of the PUS usage in the [...]

Another benefit of treating the hold at the placement stage is the area and utilization reductions achieved, since fewer delay elements are needed to correct the hold timing: we have 14% fewer buffers/inverters in the feature run compared with the baseline (Fig. 10). This reduction represents an average 5% reduction in the designs' utilization (Fig. 10).

Reducing the utilization has given white-space for setup and eDRC optimizations to do more circuit transformations. The setup timing has improved by 26% for WNS and 15% for TNS (Fig. 8), while the max capacitance and max transition violations are reduced by 33% and 41% respectively (Fig. 12). This improvement is also partially due to the multiple placement improvement iterations performed before the CTS.

The clock metrics are intentionally not reported here, since we do not expect any reduction in the clock repeaters, latencies, skews, or the clock wire-length. We have noticed that there is no specific pattern or correlation between the clock metrics and the applied weights. This is an expected behavior, depending mainly on the placement and the clock offsets calculated by the PUS engine. Our objective is to converge the setup and hold timings with less total power consumption, which includes all the clock elements. The wirelength (data and clock) reduction of 13% (Fig. 11), along with the achieved area reductions, has yielded an average total power reduction of 9% (9% average dynamic power reduction and 7% leakage power reduction) as shown in Fig. 9.

It can be noted that TC3 has achieved a very good hold gain but with a negative setup timing impact. This is because the hold was the dominant timing factor (the design has a hold convergence issue) and the net weights were relaxed, which drove the placer to spread the cells further in order to reduce the hold violations, but resulted in more setup violations.
TC24  1   2   2    500   286    8  180
TC25  1   5   5     11   139   40  180
TC26  2   1   3   1000   225    0   28
TC27  3   5   4   1000   183    0   28
TC28  5   3  500   826   693  168    7
TC29  2   2   3    952    79   32   28
TC30  3   6  25    200  1219  173   28
TC33  2   1   4     10    23    0  180
TC34  3   5   8    500   824   82   28
TC35  1   2   6    200   105   44  180
TC36  1   2   3    500   103    4   90
TC37  2  18   3    833   710   40    7
TC38  2   3   7    940   750  164   28

[Figure: Setup Timing Gain (%) per testcase, TC1 to TC40]
Fig. 9. Power Consumption Gains (%). Leakage Power Gain (%): Average Gain is 7%. Dynamic Power Gain (%): Average Gain is 9%. Total Power Gain (%): Average Gain is 9%.

[...] multi-threading the algorithm implementation or distributing incremental placement calls on different machines. For each combination of α and β, a thread can be started to evaluate the placement impact. The good values could be recorded and reapplied. Thus, the runtime will be reduced at the expense of computational resources. Further runtime reduction could be achieved by moving the flow from TCL to the C++ level, which would also give an apples-to-apples comparison with the reference flow.
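The idea of starting one thread per (α, β) combination can be sketched with Python's standard `concurrent.futures` module. As before, `evaluate` is a hypothetical callback that launches one incremental placement trial and returns a QoR cost where lower is better. Threads are a reasonable fit only under the assumption that each trial mostly waits on an external tool run; the grids of candidate values are also illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from typing import Callable, Sequence, Tuple


def best_combination(evaluate: Callable[[float, float], float],
                     alphas: Sequence[float],
                     betas: Sequence[float],
                     max_workers: int = 4) -> Tuple[Tuple[float, float], float]:
    """Evaluate every (alpha, beta) pair concurrently and return the best one."""
    combos = list(product(alphas, betas))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so costs[i] belongs to combos[i].
        costs = list(pool.map(lambda ab: evaluate(ab[0], ab[1]), combos))
    # Record the best values so they can be reapplied later.
    best_index = min(range(len(combos)), key=costs.__getitem__)
    return combos[best_index], costs[best_index]
```

The trade-off is the one stated in the text: wall-clock time goes down roughly with the number of workers, while the total compute and license usage goes up.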
[Figure: Hold timing gain (%) per testcase, TC1 to TC40. WHS Gain (%): Average Gain is 15%. THS Gain (%): Average Gain is 13%.]

[Figure: Utilization and buffer count gain (%) per testcase, TC1 to TC40. Design Utilization Gain (%): Average Gain is 5%. Number of Buffers/Inverters Gain (%): Average Gain is 14%.]
Fig. 11. Wire Length Gain (%). Global Route Data Nets Wirelength Gain (%): Average Gain is 13%.

Fig. 12. eDRC Gain (%). Max Capacitance Gain (%): Average Gain is 33%. Max Transition Gain (%): Average Gain is 41%.

[Figure: Runtime gain (%) per testcase, TC1 to TC40]

VI. CONCLUSION

In this paper, we proposed a new weighting approach that takes the hold timing factor into account, in addition to the power, setup, and eDRC factors, while calculating the nets weight before and during the incremental global placement. The new algorithm is added before the CTS stage to generate a hold-friendly placement without impacting the setup timing. Adding the hold parameter to the weighting formulation using the PUS capabilities makes it possible to reduce the power consumption of the design by reducing the number of delay elements needed for hold violation fixing in the post-CTS stage. By evaluating this new weighting approach on a wide variety of designs, we achieved an additional average gain of 9% in total power consumption compared to our approach proposed in [6]. The power gain is achieved while improving the setup and hold timings (TNS gain = 15%, WNS gain = 26%, THS gain = 13%, WHS gain = 15%). By taking the hold timing into consideration early in the physical design process, we achieved a better power reduction and design closure throughout the PnR flow.

Future work will focus on runtime reduction using machine learning algorithms to figure out the best α and β parameters based on design characteristics to make sure that the QoR [...] values.

ACKNOWLEDGMENT

This research was supported by Mentor, a Siemens Business. We thank our colleagues from the Digital Design Implementation Solution (DDIS) division who provided insight and expertise that greatly assisted this research.

We thank Dr. Hazem El Tahawy (Mentor Graphics, Managing Director MENA Region) for initiating and supporting this work. From the Place-and-Route Solutions group in the DDIS division, we thank David Chinnery (Architect, Optimization) and Nikitin Nikita (Member of Consulting Staff, DDIS R&D CTS) for their assistance, help, and guidance [...]
[4] Wang, Q. B., Lillis, J., & Sanyal, S. (n.d.). An LP-based methodology for improved timing-driven placement. Proceedings of the ASP-DAC 2005. Asia and South Pacific Design Automation Conference, 2005. doi:10.1109/aspdac.2005.1466542

[5] Cheon, Y., Ho, P., Kahng, A., Reda, S., Wang, Q. (2005). Power-aware placement. Proceedings. 42nd Design Automation Conference, 2005. doi:10.1109/dac.2005.193924

[6] Chentouf, M., & Ismaili, Z. E. (2018). A Novel Net Weighting Algorithm for Power and Timing-Driven Placement. VLSI Design, 2018, 1-9. doi:10.1155/2018/3905967

[7] Chan, T., Kahng, A. B., & Li, J. (2014). NOLO: A no-loop, predictive useful-skew methodology for improved timing in IC implementation. Fifteenth International Symposium on Quality Electronic Design. doi:10.1109/isqed.2014.6783368

[8] Burstein, M., Youssef, M. (1985). Timing Influenced Layout Design. 22nd ACM/IEEE Design Automation Conference. doi:10.1109/dac.1985.1585923

[9] Hur, S., Cao, T., Rajagopal, K., Parasuram, Y., Chowdhary, A., Tiourin, V., Halpin, B. (n.d.). Force directed Mongrel with physical net constraints. Proceedings 2003. Design Automation Conference (IEEE Cat. No.03CH37451). doi:10.1109/dac.2003.1218966

[10] Ou, S., Pedram, M. (2000). Timing-driven placement based on [...]

[11] Riess, B., Ettelt, G. (n.d.). SPEED: Fast and efficient timing driven placement. Proceedings of ISCAS95 - International Symposium on Circuits and Systems. doi:10.1109/iscas.1995.521529

[12] Eisenmann, H., Johannes, F. (n.d.). Generic global placement and floorplanning. Proceedings 1998 Design and Automation Conference. 35th DAC. (Cat. No.98CH36175). doi:10.1109/dac.1998.724480

[13] Swartz, W. (2008). Placement Using Simulated Annealing. Handbook of Algorithms for Physical Design Automation. doi:10.1201/9781420013481.ch16

[14] Bunglowala, A., Jain, M. (2014). Parallel Simulated Annealing Algorithm for Standard Cell Placement in VLSI Design. International Journal of Computer Applications, 87(1), 23-26. doi:10.5120/15172-3047

[15] Jackson, M. A., Kuh, E. S. (1989). Performance-driven placement of cell based ICs. Proceedings of the 1989 26th ACM/IEEE Conference on Design Automation Conference - DAC 89. doi:10.1145/74382.74444

[16] Srinivasan, A., Chaudhary, K., Kuh, E. (n.d.). RITUAL: A performance driven placement algorithm for small cell ICs. 1991 IEEE International Conference on Computer-Aided Design Digest of Technical Papers. doi:10.1109/iccad.1991.185188

[17] Donath, W. E., Norman, R. J., Agrawal, B. K., Bello, S. E., Han, S. Y., Kurtzberg, J. M., . . . Mcmillan, R. I. (1990). Timing driven placement using complete path delays. Conference Proceedings on 27th ACM/IEEE [...]

[19] Krishnamoorthy, A. (2004). Minimize IC Power without Sacrificing Performance. EEdesign. Available at http://www.eedesign.com/article/showArticle.jhtml?articleId=23901143

[20] Obermeier, B., Johannes, F. (n.d.). Temperature-aware global placement. ASP-DAC 2004: Asia and South Pacific Design Automation Conference 2004 (IEEE Cat. No.04EX753). doi:10.1109/aspdac.2004.1337555

[21] Bakoglu, H. (1990). Circuits, Interconnections, and Packaging for VLSI. Addison-Wesley.

[22] Jackson, M., Srinivasan, A., Kuh, E. (n.d.). Clock routing for high-performance ICs. 27th ACM/IEEE Design Automation Conference. doi:10.1109/dac.1990.114920

[23] Natesan, V., Bhatia, D. (n.d.). Clock-skew constrained cell placement. Proceedings of 9th International Conference on VLSI Design. doi:10.1109/icvd.1996.489474

[24] Venkateswaran, N., Bhatia, D. (n.d.). Clock-skew constrained placement for row based designs. Proceedings International Conference on Computer Design. VLSI in Computers and Processors (Cat. No.98CB36273). doi:10.1109/iccd.1998.727053

[25] Huang, L., Cai, Y., Zhou, Q., Hong, X., Hu, J., Lu, Y. (2005). Clock network minimization methodology based on incremental placement. Proceedings of the 2005 Conference on Asia South Pacific Design Automation - ASP-DAC 05. doi:10.1145/1120725.1120755

[26] [...] doi:10.1109/iscas.2002.1011495

[27] Deokar, R., Sapatnekar, S. (n.d.). A graph-theoretic approach to clock skew optimization. Proceedings of IEEE International Symposium on Circuits and Systems - ISCAS 94. doi:10.1109/iscas.1994.408825

[28] Boese, K., Kahng, A. (n.d.). Zero-skew clock routing trees with minimum wirelength. [1992] Proceedings. Fifth Annual IEEE International ASIC Conference and Exhibit. doi:10.1109/asic.1992.270316

[29] Guthaus, M., Sylvester, D., Brown, R. (2006). Clock buffer and wire sizing using sequential programming. 2006 43rd ACM/IEEE Design Automation Conference. doi:10.1109/dac.2006.229435

[30] Zhu, Q., Dai, W. (1996). High-speed clock network sizing optimization based on distributed RC and lossy RLC interconnect models. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 15(9), 1106-1118. doi:10.1109/43.536716

[31] L, W., Li, Y., Chen, H. (2010). Minimizing clock latency range in robust clock tree synthesis. 2010 15th Asia and South Pacific Design Automation Conference (ASP-DAC). doi:10.1109/aspdac.2010.5419849

[32] Lee, D., Markov, I. L. (2010). Contango: Integrated optimization of SoC clock networks. 2010 Design, Automation Test in Europe Conference Exhibition (DATE 2010). doi:10.1109/date.2010.5457043

[33] Friedman, E. G. (1989). Performance Limitations in synchronous Digital
48 Design Automation Conference - DAC 90. doi:10.1145/123186.123232 systems. University California, Irvine
49
50 [18] Tsay, R., Koehl, J. (1991). An analytic net weighting approach for [34] Chou, H., Yu, H., Chang, S. (2011). Useful-skew clock optimization for
performance optimization in circuit placement. Proceedings of the 28th multi-power mode designs. 2011 IEEE/ACM International Conference on
51 Conference on ACM/IEEE Design Automation Conference - DAC 91. Computer-Aided Design (ICCAD). doi:10.1109/iccad.2011.6105398
52 doi:10.1145/127601.122882
53 [35] Huang, C., Liu, Y., Lu, Y., Kuo, Y., Chang, Y., Kuo, S. (2016). Timing-
driven cell placement optimization for early slack histogram compression.
54
55
56 10
57
58
59
60
Page 11 of 11 Transactions on Design Automation of Electronic Systems
1
Proceedings of the 53rd Annual Design Automation Conference on - DAC
2 16. doi:10.1145/2897937.2898105

[36] Nitro-SoC™ and Olympus-SoC™ User’s Manual, Software Version 2017, August 2017.

[37] Nitro-SoC™ and Olympus-SoC™ Advanced Design Flows Guide, Software Version 2017, August 2017.

[38] Nitro-SoC™ and Olympus-SoC™ Software Version 2017.1.R2, August 2017.

[39] Dunlop, A., Agrawal, V., Deutsch, D., Jukl, M., Kozak, P., & Wiesel, M. (1984). Chip Layout Optimization Using Critical Path Weighting. 21st Design Automation Conference Proceedings. doi:10.1109/dac.1984.1585786

[40] Niu, F., Zhou, Q., Yao, H., Cai, Y., Yang, J., & Sze, C. N. (2011). Obstacle-avoiding and slew-constrained buffered clock tree synthesis for skew optimization. Proceedings of the 21st Edition of the Great Lakes Symposium on Great Lakes Symposium on VLSI - GLSVLSI 11. doi:10.1145/1973009.1973049