Haitong Tian, Wai-Chung Tang, Evangeline F.Y. Young and C.N. Sze

Grid--to
Grid
to--Ports Clock Routing for
High Performance
Microprocessor Designs
Haitong Tian#, Wai-Chung Tang#, Evangeline F.Y. Young# and C.N. Sze*
# Department
of Computer Science and Engineering

The Chinese University of Hong Kong
* IBM
Austin Research Laboratory
ISPD 11, Santa Barbara , USA

March 28, 2011
Outline
Introduction
Problem Formulation
Routing Algorithm
Experimental
E
i
l Results
R l
Conclusion
ISPD 2011
Clock Distribution Categories

Clock distribution is an very important issue
Buffered and unbuffered trees
Used in various ASICs
Supported by many physical design tools
See Tsay TCAD93, Xi DAC95
Non-tree structure with crosslinks

Intended for reducing clock skews
aja a DAC04,
C 0 , TCAD06
C
06
See Rajaram
Grid and buffered trees

High performance processors
Sometimes manually design the clock structures
See Shelar ISPD09, TCAD10, Guru VLSI Circuits10
ISPD 2011
High Performance Clock Distribution

Clock network in high
performance
microprocessors
Distributed as global grid
followed by buffered trees
See Shelar ISPD
ISPD09
09,
TCAD10, Guru VLSI
Circuits10
This paper focuses on the
post-grid clock distribution
area
Grid buffers
External
Clock
PLL
...
Local
Clock
buffers
...
...
Grid Bufer
Regional
Clock
buffers
Post-grid Clock
Distribution
Regional
Clock
buffers
Clock Grid
Local
Clock
buffers
Local Clock
Network
Post grid clock distribution
ISPD 2011
Post-grid Clock Distribution

In our modeling
Entire chip divided into
several layout areas
Global grid
Grid Buffer
Each layout area contains

many blocks
Blocks
Reserved
R
d
Tracks
Port
Each block contains

standard cells and/or macros
Sequential
Global Grid
Local Clock
Buffer
Each layout area contains

100s-1000s clock pports
Grid wires reserved for
clock routing
Typically upper mental
layers
Layout
Region
Reserved
multilayer tracks
ISPD 2011
Motivations
Clock distribution of microprocessor:
Crucial importance
Major source of power dissipation
High capacitance usage

[1]
18.1%
18 1% off total
t t l clock
l k capacitance
it
See Pham Solid State Circuits06
Manuallyy design
g in ppractice
Hard to satisfy delay/slew constraints
Time to market
See
S Sh
Shelar
l ISPD09
ISPD09, TCAD10
[1]: D. Pham, T. Aipperspach, D. Boerstler, M. Bolliger, R. Chaudhry, D. Cox, P. Harvey, P. Harvey, H. Hofstee, C. Johns,
et al. Overview of the architecture, circuit design, and physical implementation of a first-generation cell processor. IEEE
Journal of Solid-State Circuits, 41(1):179196, Jan. 2006.
ISPD 2011
6
Outline
Introduction
Problem Formulation
Routing Algorithm
Experimental Results
Conclusion
ISPD 2011
Problem Formulation
Input
A set of reserved tracks
Locations and capacitances of ports P
Different types of wires on each metal layer
Delay limit D. Slew limit S
Output
A clock network (may be non-tree
non tree structures)
Objective
Connecting every port to the source
Satisfying delay and slew constraints
Minimizing capacitance usage
ISPD 2011
Post-grid Clock Routing

0
7
Layer
3
0
500
1000
1500
200
400
600
800
1000
1200
1400
1600
1800
ISPD 2011
Outline
Introduction
Problem Formulation
Routing Algorithm
Conclusion
ISPD 2011
10
Overall Algorithm
Critical ports
Ports with large capacitance or
f away from
far
f
the
th source
Path expansion algorithm
Elmore-delay driven
Expanding in some selected
directions
Post-processing
Wire replacement
Topology refinement
Iterations
The overall algorithm is
repeatedly invoked
Mayy fail when number of
iterations > K (user specified)
ISPD 2011
11
Delay-driven Path Expansion Algorithm

Basic steps
Simultaneously expand from all ports
Select the path with the minimum Elmore delay to further expand
Connect the ports to the source once the path reaches the source
grid
Check delay/slew constraints
ISPD 2011
12
A Routing Example
Initially, the heap is empty
First iteration (simultaneously expand from all ports)
Heap={(P1,P2);(P1,C1);(P2,P3);(P2,P1);(P3,P2);(P3,C2)}
Second iteration (P1,P2)

Heap={(P
p {( 3,,C2);(
);(P1,,P2,,P3);(
);(P1,,C1);(
);(P2,,P3);(
);(P2,,P1);(
);(P3,,P2)}
Third iteration (P3,C

C2)
Heap ={(P3,C2,S2);(P1,P2,P3);(P1,C1);(P2,P3);(P2,P1);(P3,P2)}
ISPD 2011
13
A Routing Example
Fourth iteration (identify chain paths)
Heap ={(P1,P2,P3);(P1,C1);(P2,P3);(P2,P1)}
Chain path={(P1,P2,P3);(P2,P3)}
Fifth iteration (P2,P3)

Heap={(P1,P2,P3);(P1,C1)}
{( 1,,P2,,P3);(
);(P1,,P2)}
Chain ppath={(P
Sixth iteration (P1,P

P2)
Heap={}, chain path={}
Final result
ISPD 2011
14
Post-processing Techniques
Wire replacement
Two types of wires
capacitance/resistance
it
/ it
tradeoff
t d ff
Procedures
Identify port Pl with the largest
Elmore delay
Wire replacement
Port with largest delay: P5

Replace edge P1C1
Replace edge P4C2
Replace edge P2P3, P3C1
Replace
p
P5C3, C3C2, C2C1, C1S1
Replace wires in a bottom-up style

S1
Check delay/slew constrains

P2
C3
P3
C1
C2
P1
P4
P5
ISPD 2011
15
Post-processing Techniques
Topology refinement
Procedures
Disconnect a port P
Topology refinement
Elmore delay:
P5>P4>P6>P2>P1>P3>P7
Sequentially process all the ports

Expand P towards all
directions
Select paths with smaller
capacitance
i
Check delay/slew constraints
S1
P2
P3
C1
P1
C2
C3
P4
P5
P6
C4
C5
P7
S2
ISPD 2011
16
Non-tree Extensions
A small number of ports have
exceptionally large capacitances
The delay of its shortest path
exceeds the delay limit D
Procedures
Establish a shortest ppath for p
Non-tree extensions
Connect p to S1
Find a second source S2
Add crosslinks
Find a third source S3
Add crosslinks
Find a second shortest path

Target delay not met? Add all
useful corsslinks
Target delay not met? Do the
same thing for parent node of p
ISPD 2011
17
Outline
Introduction
Problem Formulation
Routing Algorithm
Conclusion
ISPD 2011
18
Experiment Setup
Environment
Implemented in C++
Run on Linux server
Intel Pentium 4 3.2GHz
2GB RAM
Delay setup: 5ps

Slew setup: input: 10ps; output: 15 ps
Benchmarks
B h
k
3 test cases are provided by industry
11 test
es cases aaree from
o ISPD
S
2010
0 0C
Clock
oc Network
Ne wo Synthesis
Sy es s Co
Contest
es
Comparisons
Compared with TG, which was proposed by Shelar in ISPD09,
TCAD10
ISPD 2011
19
Tree Growing Algorithm

Proposed in R. Shelar ISPD09,
TCAD10
Delay/Slew
D l /Sl constraints
t i t
Greedy expansion from the
source
Edges
Ed
with
ith the
th smallest
ll t
capacitance will be added into
the network
Tree Growing Algorithm

Expand from the source
Add S1C1, S2C2
Add C2P3
Add C1P1
Add P3P2
S1
C1
S2
P1
P2
P3
ISPD 2011
C2
20
Comparisons: capacitance
Capacitance (without post-processing techniques)
25000
ccapacitance (fF)
Without post-processing:
18 3% improvement
18.3%
i
t
20000
15000
TG
Ours
10000
5000
0
1
10
11
12
13
14
Test cases
Capacitance (with topology refinement)

25000
capacitancce(fF)
With topology refinement:

24.6% improvement
20000
15000
TG
Ours
10000
5000
0
1
10
11
12
13
14
Test cases
ISPD 2011
21
Comparisons: wire length

Wire Length (without post-processing techniques)
50000
Wire length (um)
W
Without post-processing:
16 8% improvement
16.8%
i
t
40000
30000
TG
Ours
20000
10000
0
1
10
11
12
13
14
Test cases
With topology refinement:

23.6% improvement
Wire Length (with topology refinement)
Wire length ((um)
50000
40000
30000
TG
Ours
20000
10000
0
1
10
11
12
13
14
Test cases
ISPD 2011
22
Comparisons: run time

TG
Average
A
run ti
time 0.14s
0 14
Time (s)
Ours
Run time (without any post-processing techniques)
Without any post-processing

techniques: 1.49s on average
With all techniques: 1.78s on
average
Practical runtime
7
6
5
4
3
2
1
0
TG
Ours
10
11
12
13
14
Test cases
Run time (with all post-processing techniques)
Time ((s)
8
6
TG
Ours
4
2
0
1
10
11
12
13
14
T t cases
Test
ISPD 2011
23
Non-tree extension
We also did some experiments to see the results of our non-tree
extension
A 23.4%
23 4% improvement
i
t on the
th minimum
i i
portt delay
d l
Minimum port delay is the lower bound one can achieve using tree
structures
Non-tree delay
3000
Delay (fs)
2500
2000
Non-tree delay
Lower bound delay
1500
1000
500
0
1
10
11
12
13
14
Test cases
ISPD 2011
24
Simulation Results
Simulation tool: Hspice
Delay
Correlation coefficient
3000
Delay(fs)
2500
2000
Calculated delay
Simulated delay
1500
1000
500
0
1
10
11
12
13
14
Test case
Non-tree delay
2500
2000
Delay (fs)
Tree: 99%
Non-tree: 99%
Tree delay
1500
Calculated delay
Simulated delay
1000
500
0
1
10
11
12
13
14
Test cases
ISPD 2011
25
Simulation Results
Slew
Correlation coefficient
12
Slew (ps)
11.5
11
Simulated slew
Calculated slew
10.5
10
9.5
9
1
10
11
12
13
14
Test cases
Non-tree slew
Slew (ps)
Tree:
T
96%
Non-tree: 94%
Tree slew
11.2
11
10.8
10.6
10.4
10.2
10
9.8
9.6
9.4
Simulated slew
Calculated slew
10
11
12
13
14
Test cases
ISPD 2011
26
Outline
Introduction
Problem Formulation
Routing Algorithm
Conclusion
ISPD 2011
27
Conclusion
Proposed an efficient algorithm to construct a post-grid clock
network on reserved multi-layer metal tracks
Extended the algorithm to allow non-tree structures to further
bi
brings
down
d
the
th delay
d l
Verified our results using Hspice simulation
Expected
p
to reduce energy
gy consumption,
p
, improve
p
ggrid to pport
delay in real post-grid clock networks
ISPD 2011
28
ISPD 2011
29

Haitong Tian, Wai-Chung Tang, Evangeline F.Y. Young and C.N. Sze

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Haitong Tian, Wai-Chung Tang, Evangeline F.Y. Young and C.N. Sze

Uploaded by

Copyright:

Available Formats

Grid--to

of Computer Science and Engineering

Austin Research Laboratory

ISPD 11, Santa Barbara , USA

Clock Distribution Categories

Non-tree structure with crosslinks

Grid and buffered trees

High Performance Clock Distribution

Post grid clock distribution

Post-grid Clock Distribution

Each layout area contains

Each block contains

Each layout area contains

High capacitance usage

Post-grid Clock Routing

Delay-driven Path Expansion Algorithm

Second iteration (P1,P2)

Third iteration (P3,C

Fifth iteration (P2,P3)

Sixth iteration (P1,P

Port with largest delay: P5

Replace wires in a bottom-up style

Check delay/slew constrains

Sequentially process all the ports

Find a second shortest path

Delay setup: 5ps

Tree Growing Algorithm

Tree Growing Algorithm

Capacitance (with topology refinement)

With topology refinement:

Comparisons: wire length

With topology refinement:

Wire Length (with topology refinement)

Wire length ((um)

Comparisons: run time

Run time (without any post-processing techniques)

Without any post-processing

Run time (with all post-processing techniques)

You might also like