Clock Distribution: Rajeev Murgai

Clock Distribution
Rajeev Murgai Advanced CAD Technologies Fujitsu Labs of America
UC Berkeley Feb 15, 2005

Defining Clock Skew and Jitter

Clock skew

-- The deterministic (knowable) difference in clock arrival times at each flip-flop
-- Caused mainly by imperfect balancing of clock tree/mesh
-- Can be deliberately introduced using delay blocks in order to time-borrow
-- Accounted for in STA by calculating the clock arrival times at each flip-flop

Clock jitter

-- The random (unknowable, except distribution) difference in clock arrival times at each flip-flop
-- Caused by on-die process, Vdd, temperature variation, PLL jitter, crosstalk, static timing analysis (STA) accuracy, layout parameter extraction (LPE) accuracy
-- Accounted for in STA by subtracting jitter (~3σ) from the cycle time in long path analysis, and adding it to the receiving clock arrival time in race analysis

Jitter is always bad; skew can be helpful or harmful.
Clock uncertainty = skew + jitter

[Figure: long path analysis (FF, logic, FF) and race analysis (FF to FF), with the clock arrivals annotated with skew, +jitter, and -jitter.]

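As a minimal sketch of how jitter enters the two checks described on this slide, the Python below subtracts jitter from the cycle time in the long path check and adds it to the receiving clock arrival in the race check. The function names, the setup/hold parameters, and the example numbers (in ps) are illustrative assumptions, not taken from the slides.

```python
def long_path_slack(a_launch, a_recv, logic_max, t_setup, t_cycle, jitter):
    """Long path (setup-style) slack: jitter is subtracted from the cycle time."""
    required = a_recv + (t_cycle - jitter) - t_setup
    arrival = a_launch + logic_max
    return required - arrival


def race_slack(a_launch, a_recv, logic_min, t_hold, jitter):
    """Race (hold-style) slack: jitter is added to the receiving clock arrival."""
    arrival = a_launch + logic_min
    required = (a_recv + jitter) + t_hold
    return arrival - required


if __name__ == "__main__":
    # Assumed example: the receiving clock arrives 50 ps after the launching clock.
    print(long_path_slack(0, 50, 900, 40, 1000, 60))
    print(race_slack(0, 50, 120, 20, 60))
```

With these assumed numbers the late receiving clock leaves the long path with positive slack while the race check goes negative, which is the sense in which skew can be helpful or harmful.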
Background
Technology scaling results in:

-- higher clock frequencies possible and requested by users
-- prominence of wiring parasitics (R, L, C) in electrical behavior
-- increasing noise impact on delays
-- increasing on-chip process variation impact on delays

Existing ASIC clock synthesis flows

-- Use tree architectures: not best for low skew, jitter, variations
-- Don't properly address noise issues
-- Rely on STA to calculate the delays through clock networks
-- Use inaccurate wiring models
-- Use noise-sensitive clock circuit topologies
-- Ignore or crudely estimate process/voltage/temperature variations
-- Don't have tight integration of physical synthesis & clock synthesis

Result

-- Predictability of clock delay is poor: clock uncertainty (i.e., skew + jitter) of 400ps is not uncommon
-- Maximum attainable clock frequency is impaired

Problems with Existing Clock Methodologies

Tree-based Clock Distribution

-- Low power but...
-- Sensitive to mismatching branches, difficult to lay out
-- Sensitive to noise, especially if wires are not shielded
-- Using STA to calculate tree timing results in large errors => high skew and jitter

[Figure: PLL driving a clock tree of flip-flops, with flip-flop pairs labeled small, medium, and large skew and jitter.]

Problems with Static Timing Analysis (STA)

What we have...

[Figure: signal wire modeled with R, L, Cs, and Cg.]

What STA uses...

[Figure: lumped model with Rup and Rdn at the driver, Rwire, the wire capacitance split as Cw/2 + Cw/2, and Cload.]

Note: driver model is a little better than this with table look-up

Other problems

-- Cw can match either delay or slew, but not both
-- interpolation using look-up tables

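To make the lumped model concrete, here is a minimal sketch that evaluates a first-order (Elmore) delay for it. Using the Elmore delay here is an assumption for illustration; the slide only shows the model, and the function name and the example values are hypothetical.

```python
def elmore_pi_delay(r_drv, r_wire, c_wire, c_load):
    """Elmore delay at the load for a driver resistance plus a pi wire model.

    The driver resistance charges all capacitance (Cw/2 + Cw/2 + Cload);
    the wire resistance charges only the far half of Cw and the load.
    """
    near_term = r_drv * (c_wire / 2 + c_wire / 2 + c_load)
    far_term = r_wire * (c_wire / 2 + c_load)
    return near_term + far_term


if __name__ == "__main__":
    # Assumed values: 100 ohm driver, 50 ohm wire, 200 fF wire cap, 50 fF load.
    delay_s = elmore_pi_delay(100.0, 50.0, 200e-15, 50e-15)
    print(f"{delay_s * 1e12:.1f} ps")
```

A single number like this has no way to capture L or coupling effects, which is the gap between "what we have" and "what STA uses".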
Clock Distribution Architectures

Two basic architectures

-- Tree
-- Grid (mesh)

Hybrids of tree and mesh

-- Tree + crosslinks
-- Mesh + local trees


Tree
Widely used in ASICs

Advantages

-- Low cost: wiring, capacitance, power
-- Clock gating easy

Disadvantages

-- Difficult to balance path delays due to asymmetric FF distribution
-- Sensitive to variations

Topologies

-- Symmetric H-tree
-- Asymmetric trees

[Figure: tree topologies driving flip-flops.]


CAD for Tree Architecture

Topology generation

-- H-tree: widely used
-- Method of means and medians (MMM) [Jackson et al. DAC 90]
   -- Goal: reduce wirelength while minimizing skew.
   -- Divide set S of points into Sleft and Sright, based on median, so that |Sleft| = |Sright|.
   -- Connect/route center of mass (CM) of S to CM of Sleft and Sright.
   -- Recurse on Sleft and Sright.

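The MMM recursion above can be sketched in a few lines of Python. This is a minimal illustration, not the published implementation: alternating the cut between x and y at each level, the dictionary tree representation, and the Manhattan wire-length helper are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def center_of_mass(points: List[Point]) -> Point:
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def mmm(points: List[Point], split_on_x: bool = True) -> Dict:
    """Build a tree node for a non-empty sink set; children come from a median split."""
    node = {"at": center_of_mass(points), "children": []}
    if len(points) == 1:
        return node
    axis = 0 if split_on_x else 1
    ordered = sorted(points, key=lambda p: p[axis])
    half = len(ordered) // 2  # median split: the two halves differ by at most one sink
    for part in (ordered[:half], ordered[half:]):
        node["children"].append(mmm(part, not split_on_x))
    return node


def leaf_path_lengths(node: Dict, so_far: float = 0.0) -> List[float]:
    """Manhattan root-to-leaf wire lengths, one entry per sink."""
    if not node["children"]:
        return [so_far]
    lengths: List[float] = []
    for child in node["children"]:
        step = abs(node["at"][0] - child["at"][0]) + abs(node["at"][1] - child["at"][1])
        lengths += leaf_path_lengths(child, so_far + step)
    return lengths


if __name__ == "__main__":
    tree = mmm([(0, 0), (0, 2), (2, 0), (2, 2)])
    print(tree["at"], leaf_path_lengths(tree))
```

For a symmetric sink set such as the four corners of a square the path lengths come out equal; for asymmetric sink sets they generally do not.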
Method of Means & Medians

Problem

-- May not result in zero skew

Solution

-- One-step look-ahead to decide the direction of splitting.
-- Estimate skews using the Penfield-Rubinstein model.

Other problems

-- Buffer insertion not handled.
-- Obstructions not handled.

Topology: Recursive Geometric Matching

[Kahng et al. DAC 91]

Bottom-up pair-wise merge algorithm

-- Optimum geometric matching on n points (minimum wirelength)
-- Determine center point of each match edge
-- Recurse on n/2 points

Uses path length skews

-- Tries to balance root to leaf path lengths.
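The bottom-up merge can be sketched as follows. This is a minimal illustration under stated assumptions: Manhattan distance as the matching cost, an exact bitmask dynamic program that is only practical for a handful of points, and a sink count that is a power of two. None of these choices come from the slide, and the function names are hypothetical.

```python
from functools import lru_cache
from typing import List, Tuple

Point = Tuple[float, float]


def dist(a: Point, b: Point) -> float:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def min_matching(points: List[Point]) -> List[Tuple[int, int]]:
    """Exact minimum-cost perfect matching (even point count, small n only)."""
    n = len(points)

    @lru_cache(maxsize=None)
    def best(mask: int):
        if mask == 0:
            return 0.0, ()
        i = (mask & -mask).bit_length() - 1  # lowest-numbered unmatched point
        rest = mask & ~(1 << i)
        result = None
        for j in range(i + 1, n):
            if (rest >> j) & 1:
                cost, pairs = best(rest & ~(1 << j))
                cand = (cost + dist(points[i], points[j]), pairs + ((i, j),))
                if result is None or cand[0] < result[0]:
                    result = cand
        return result

    return list(best((1 << n) - 1)[1])


def merge_levels(points: List[Point]) -> List[List[Point]]:
    """Levels of the tree from sinks (first) to root (last), merging at edge midpoints."""
    n = len(points)
    if n < 1 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two number of sinks")
    levels = [list(points)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        levels.append([((cur[i][0] + cur[j][0]) / 2, (cur[i][1] + cur[j][1]) / 2)
                       for i, j in min_matching(cur)])
    return levels


if __name__ == "__main__":
    for level in merge_levels([(0, 0), (1, 0), (10, 0), (11, 0)]):
        print(level)
```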
Topology: Simulated Annealing

Topology generation
Cheng et al: improve initial topology by simulated annealing
effective in reducing delay
CAD for Tree Architecture

Routing & wire sizing

-- Tsay, TCAD 93: zero-skew routing
   -- first paper to use Elmore delay as delay model
   -- earlier work used pathlength
-- DME, planar DME
-- make faster paths slower by detours/snaking to match delays
-- may use wire-sizing: make slower paths faster
-- Wire spacing

Buffering

-- Tellez & Sarrafzadeh, TCAD 97: insert minimum buffers on a given topology to meet skew and slew constraints.
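To illustrate zero-skew merging under the Elmore model, the sketch below finds the tap point on a wire joining two subtrees so that the Elmore delays to both sides are equal. It is derived here from the Elmore delay of a uniform wire and is not claimed to be the exact procedure of any cited paper; the function names, parameter names, and numbers are assumptions.

```python
def zero_skew_tap(t1, c1, t2, c2, length, r, c):
    """Fraction x of the wire (measured from subtree 1) where both Elmore delays match.

    t1, t2: delays inside each subtree; c1, c2: their total capacitances;
    r, c: wire resistance and capacitance per unit length.
    """
    if length <= 0:
        raise ValueError("wire length must be positive")
    num = t2 - t1 + r * length * (c2 + c * length / 2)
    den = r * length * (c1 + c2 + c * length)
    x = num / den
    if not 0.0 <= x <= 1.0:
        # The balance point is off the wire: extra length (snaking) would be needed.
        raise ValueError("no tap point on this wire; wire elongation required")
    return x


def branch_delay(t_sub, cap_sub, wire_len, r, c):
    """Elmore delay from the tap point through wire_len of wire into a subtree."""
    return r * wire_len * (c * wire_len / 2 + cap_sub) + t_sub


if __name__ == "__main__":
    x = zero_skew_tap(5.0, 2.0, 6.0, 1.0, 10.0, 0.1, 0.2)
    print(x, branch_delay(5.0, 2.0, x * 10.0, 0.1, 0.2),
          branch_delay(6.0, 1.0, (1 - x) * 10.0, 0.1, 0.2))
```

When the balance point falls outside the wire, the sketch raises an error rather than modeling the detour; that case corresponds to the "detours/snaking" bullet above.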
Grid/Mesh
-- n x n uniform mesh
-- Distributed array of k x k buffers drives the mesh. Buffers driven by global H-tree.
-- Flip-flops directly connected to the nearest mesh segment
-- Used in modern processors

Advantages

-- Excellent for low skew
-- Robust to variations

Disadvantages

-- Higher wiring area, capacitance, power
-- Difficult to analyze

[Figure: clock source driving a mesh with flip flops attached; annotated "Loops and redundancy".]

Mesh
Sizing of clock distribution networks for high performance CPU chips
Desai et al., DEC [DAC 1996]

-- goal: size grid interconnect segments with constraints on clock latency and average current
-- assume: initial grid and interconnect sizes
-- width explicit => non-linear program; practical for small networks/trees.
-- consider width as implicit & solve using sequence of network problems.
-- Results: applied on clock networks of two actual processors: DC21046A and DC21164.
-- Results for DC21046A: 275MHz clock; grid has 1 million edges, 15.5K drivers, 81K receivers; 16% reduction in capacitance without increasing clock latency. Runtime: 3 days.

Optimal Wire and Transistor Sizing for Circuits with Non-tree Topology
Vandenberghe et al., Stanford University [ICCAD 97]

-- RC circuit with tree topology => sizing problem is convex optimization
-- meshes have R loops; use dominant time constant as measure of delay
-- solve using semi-definite programming (quasi-convex function)

Hybrid Architecture: Tree + Cross-links

Reducing Clock Skew Variability via Cross Links
[Rajaram et al., DAC 2004]

-- clock signal propagates through multiple paths; reduces skew and skew variability between shorted sinks
-- tree + short-circuit some sink pairs => non-tree topology
-- reduces skew variability by 30-70%
-- very small wire-length penalty (2%) over tree topology

Drawback:

-- does not consider buffering

[Figure: tree with cross links, driven from the source.]

Hybrid Architecture: Mesh + Trees

Hybrid Structured Clock Network Construction [Hu & Sapatnekar, ICCAD 01]
-- Hybrid clock topology
   -- simple top-level global mesh
   -- zero-skew local trees at bottom
-- Presents wire sizing scheme to achieve latency and skew reduction.
   -- iterative LP to minimize wire width (area) of top-level mesh, given delay bound
   -- uses Elmore delay t = G^-1 C
-- sensitivity-based post-layout clock tree tuning to reduce skew.

[Figure: source driving mesh nodes a, b, c, d; node a annotated (a, CDa).]

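The slide writes the Elmore delay of the mesh as the inverse of G applied to C. A minimal sketch of that computation follows, assuming it means solving G t = c, where G is the nodal conductance matrix with the clock source treated as ground and c is the vector of node capacitances. The `SRC` marker, the function name, and the toy networks are hypothetical, and the small Gaussian elimination stands in for a real sparse solver.

```python
from typing import List, Tuple

SRC = -1  # hypothetical marker for the ideal clock source node


def elmore_delays(n: int, resistors: List[Tuple[int, int, float]],
                  caps: List[float]) -> List[float]:
    """Solve G t = c for an RC network that may contain loops."""
    g = [[0.0] * n for _ in range(n)]
    for u, v, r in resistors:
        for a, b in ((u, v), (v, u)):
            if a != SRC:
                g[a][a] += 1.0 / r
                if b != SRC:
                    g[a][b] -= 1.0 / r
    # Gaussian elimination with partial pivoting on the augmented matrix [G | c].
    m = [g[i] + [caps[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(m[row][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            f = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= f * m[col][k]
    t = [0.0] * n
    for i in reversed(range(n)):
        t[i] = (m[i][n] - sum(m[i][k] * t[k] for k in range(i + 1, n))) / m[i][i]
    return t


if __name__ == "__main__":
    tree_only = elmore_delays(2, [(SRC, 0, 1.0), (SRC, 1, 1.0)], [1.0, 3.0])
    with_link = elmore_delays(2, [(SRC, 0, 1.0), (SRC, 1, 1.0), (0, 1, 1.0)], [1.0, 3.0])
    print(tree_only, with_link)
```

In this toy example, adding a resistive link between the two unequally loaded nodes shrinks the difference between their Elmore delays, which is the effect meshes and cross links rely on.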
Clock Architectures
Tree
-- low cost (wiring, power, cap)
-- higher skew, jitter than mesh
-- widely used in ASIC designs
-- clock gating easy to incorporate

Mesh
-- excellent for low skew, jitter
-- high power, area, capacitance
-- difficult to analyze
-- clock gating not easy
-- used in modern processors

Hybrid: tree + cross-links
-- low cost (wiring, power, cap)
-- smaller skew, jitter than tree
-- difficult to analyze

Hybrid: mesh + local trees
-- suitable for coarse mesh

Best architecture depends on the application

[Figure: the four architectures, each showing the clock source and flip flops, with cross links and local trees labeled.]

Processors
Traditionally two hierarchies

-- Global clock network
-- Local clock network

Skew control

-- Global network: balanced trees or grids
-- Local network: de-skewing buffers

Pentium4 [JSSC Nov 2001]

-- 0.18u, 6 metal layers, 42 million transistors
-- Core medium clock frequency: 2 GHz
   -- Used by most core blocks
-- High speed scheduling and execution: 4GHz
-- Non critical blocks (e.g., bus interface logic): 1GHz

Global clock distribution

-- 3 spines; each spine has binary clock distribution
-- jitter reduction schemes
   -- low-pass RC-filtered power supply for clock drivers
   -- shield clock wires

[Figure: source driving three spines.]

IBM [JSSC 2001]

-- Same clock architecture for 6 chips (including PowerPC)
-- Design priorities: min. clock skew, sharp rise and fall times (below 100 ps for 1ns clock), 50% duty cycle, low power consumption
-- Global buffered H-trees (on top 2 layers) drive sector buffers.
   -- length-matched
-- Each sector buffer drives tuneable tree, which drives global mesh
   -- Tree wire-widths tuned to minimize skew over long distances
   -- Mesh minimizes local skew by connecting nearby points directly.
-- For each chip, 10-20 complete tuning cycles
   -- Buffer placement, wiring
-- Flip-flops connected to closest point on mesh
-- Global clock skew of 22ps
-- Inductance included in analysis
-- Mesh difficult to analyze due to loops
   -- cut the mesh

[Figure: clock source driving the mesh, with flip flops attached.]

Alpha, DEC [JSSC, Nov 98]

-- 0.35u, 4 metal layers, 15.2 million transistors, 600 MHz at 2.2V
-- 3 hierarchies in clock distribution
   -- Global, major (regional) and local
-- Multi-level mesh
-- Global: trees to global GCLK grid
   -- Uses 3% of M3/M4 interconnect
   -- M3/M4 shielding; M2, M4: Vdd/Vss
   -- power = 16W; skew = 72ps
-- Major (regional)
   -- six grids over execution units
   -- use 6% of M3, M4
   -- power = 14W
-- Local clock
   -- tree structure, not shielded
   -- conditional/unconditional clocks
   -- less than 10ps skew; power = 15.6W
-- Clock simulation
   -- AWE-reduction + SPICE

[Figure: PLL driving the GCLK grid.]

Summary of Processor Clock Design

Three basic routing structures for global clock
-- H-tree
   -- low skew, smallest routing capacitance, low power
   -- Floorplan flexibility is poor
-- Grid or mesh
   -- low skew, increases routing capacitance, worse power
   -- Alpha uses global clock grid and regional clock grids
-- Spine
   -- Small RC delay because of large spine width
   -- Spine has to balance delays; difficult problem
   -- Routing cap lower than grid but may be higher than H-tree.

Clock structure   Clock skew    Capacitance/Layout area/power   Floorplan flexibility
H-tree            Low/medium    Low                             Low
Grid              Low           High                            Medium/high
Spine             High          Medium                          Medium

Estimation of Process-dependent Clock Skew in CMOS VLSI, Shoji [JSSC, Oct. 86]
Given two paths from clock source to FFs

Conventional design method
-- design paths such that skew between S1 and S2 is zero at a (fixed) process corner
-- However, skew may not be zero at another process corner

Novel idea in the paper
-- design the two paths such that skew between S1 and S2 is zero for different process corners

[Figure: CLK drives two inverter paths, one through A, B, C and one through D, E, ending at S1 and S2.]

Zero skew at the typical corner:
    TA + TB + TC = TD + TE
For high-current process corner H,
    TA(H) = TA * 1/fN;  TB(H) = TB * 1/fP    (fN, fP > 1)
Zero-skew condition at H:
    TA(H) + TB(H) + TC(H) = TD(H) + TE(H)
    (TA + TC) * 1/fN + TB/fP = TD/fN + TE/fP
Substituting TA + TC - TD = TE - TB from the typical corner:
    (TE - TB)/fN = (TE - TB)/fP

Estimation of Process-dependent Clock Skew in CMOS VLSI, Shoji [JSSC, Oct. 86]
Either TE = TB or fN = fP.
-- But fN may not be same as fP (for PH-NL process)

In general, TE = TB => TD = TA + TC.
-- Pull-up and pull-down delays of two paths should be identical.
-- Determine NMOS & PMOS transistor widths of inverters to achieve this.

Results
-- 1.75 u process
-- Widths selected manually
-- Lead to very small skews at all process corners

Drawbacks
-- only analyzes two paths
-- assumes identical percentage delay variation for all NMOS (PMOS) devices
-- uses simplistic delay model; ignores wire cap

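A short numeric check of the zero-skew condition, assuming (as in the equations on these slides) that TA, TC, and TD scale by 1/fN while TB and TE scale by 1/fP. The function name and the delay values are made up for illustration.

```python
def skew_at_corner(ta, tb, tc, td, te, f_n, f_p):
    """Delay of path A-B-C minus delay of path D-E at a corner with factors fN, fP."""
    path_abc = (ta + tc) / f_n + tb / f_p
    path_de = td / f_n + te / f_p
    return path_abc - path_de


if __name__ == "__main__":
    # Both designs have zero skew at the typical corner (fN = fP = 1).
    conventional = dict(ta=1.0, tb=2.0, tc=1.0, td=1.0, te=3.0)  # TE != TB
    matched = dict(ta=1.0, tb=3.0, tc=1.0, td=2.0, te=3.0)       # TE = TB, TD = TA + TC
    for f_n, f_p in ((1.0, 1.0), (1.5, 1.1), (1.1, 1.5)):
        print(f_n, f_p,
              skew_at_corner(**conventional, f_n=f_n, f_p=f_p),
              skew_at_corner(**matched, f_n=f_n, f_p=f_p))
```

The conventional design shows nonzero skew as soon as fN and fP differ, while the matched design stays at zero for every pair of factors tried.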
Optimal Clock Skew Scheduling

Long & short path constraints impose lower/upper bounds on skew.
-- long path analysis: aj ≥ ai + logic_max + tset_up - Tcycle
-- short path analysis: aj ≤ ai + logic_min - thold

Leads to a set of linear inequalities: ai - aj ≤ cij
-- Given a clock cycle, feasibility can be solved using linear program, more efficiently with Bellman-Ford shortest path [Fishburn TCAD90].

To compute the optimum clock cycle,
-- Perform binary search using above feasibility check.
-- Perform parametrized shortest path [Tarjan et al.]

One challenge: realize each ai
Other objectives: minimize power or switching noise.

[Figure: FF i (clock arrival ai) drives logic into FF j (clock arrival aj); the skew is between the two clk arrivals.]

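The feasibility check and the binary search can be sketched as below. This is a minimal illustration, not Fishburn's formulation verbatim: each constraint of the form "a[v] - a[u] is at most w" becomes an edge from u to v with weight w, and Bellman-Ford from an implicit zero-weight source either returns arrival times or detects a negative cycle. The function names, the path tuple layout, and the example delays are assumptions.

```python
from typing import List, Optional, Tuple

Path = Tuple[int, int, float, float]  # (i, j, logic_min, logic_max) from FF i to FF j


def feasible_schedule(n: int, paths: List[Path], t_cycle: float,
                      t_setup: float = 0.0, t_hold: float = 0.0) -> Optional[List[float]]:
    """Return clock arrival times a[0..n-1], or None if t_cycle is infeasible."""
    edges = []  # (u, v, w) encodes a[v] - a[u] <= w
    for i, j, dmin, dmax in paths:
        edges.append((j, i, t_cycle - dmax - t_setup))  # long path constraint
        edges.append((i, j, dmin - t_hold))             # short path constraint
    dist = [0.0] * n  # implicit source with a zero-weight edge to every flip-flop
    for _ in range(n):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v] - 1e-12:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return dist
    return None  # still relaxing after n passes: negative cycle, no valid schedule


def min_cycle_time(n: int, paths: List[Path], lo: float, hi: float,
                   tol: float = 1e-6) -> float:
    """Binary search for the smallest feasible cycle; assumes hi is feasible."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible_schedule(n, paths, mid) is not None:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    loop = [(0, 1, 2.0, 8.0), (1, 0, 1.0, 4.0)]  # two flip-flops in a loop
    print(min_cycle_time(2, loop, 0.0, 10.0), feasible_schedule(2, loop, 6.0))
```

The binary search relies on feasibility being monotone in the cycle time, which holds here because a larger Tcycle only loosens the long path constraints.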
Optimal Clock Skew Scheduling Tolerant to Process Variations [Neves & Friedman, 96]
Long path and short path constraints impose lower and upper bounds on skew.
-- long path analysis: aj ≥ ai + logic_max + tset_up - Tcycle
-- short path analysis: aj ≤ ai + logic_min - thold

Try to choose skews in the middle of the bounds for maximum protection against process variations.

[Figure: FF i (clock arrival ai) drives logic into FF j (clock arrival aj); the skew is between the two clk arrivals.]
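For a single path, the permissible range and its midpoint follow directly from the two bounds above. The sketch below is only a per-path illustration under assumed names and numbers (in ps); it does not enforce that the chosen skews are consistent around loops of flip-flops, which a full scheduler would have to do.

```python
def permissible_skew_range(logic_min, logic_max, t_setup, t_hold, t_cycle):
    """Bounds on aj - ai from the long path (lower) and short path (upper) constraints."""
    lower = logic_max + t_setup - t_cycle
    upper = logic_min - t_hold
    if lower > upper:
        raise ValueError("no permissible skew at this cycle time")
    return lower, upper


def centered_skew(logic_min, logic_max, t_setup, t_hold, t_cycle):
    """Midpoint of the permissible range and the margin to either bound."""
    lower, upper = permissible_skew_range(logic_min, logic_max, t_setup, t_hold, t_cycle)
    return (lower + upper) / 2, (upper - lower) / 2


if __name__ == "__main__":
    print(permissible_skew_range(120, 900, 40, 20, 1000))
    print(centered_skew(120, 900, 40, 20, 1000))
```

Choosing the midpoint leaves the same margin against variation that pushes the skew toward either bound.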
