Clocking
BR 6/07 1
Clock Distribution
[Figure: clock tree driven from clk; each of the four branches is labeled 100 ps]
Clock Distribution (cont)
[Figure: clk and Data paths; clock branches labeled 100 ps, measured from a 0 ps reference point]
Clocking Regions
[Figure: regional clock arrival times of 102, 98, 99, 103, 92, 96, 97, and 93 ps, fed through 100 ps branches from a 0 ps reference point; skew to neighboring region: 11 ps]
Chip is divided into regions; the further a signal has to travel, the
larger the skew budget.
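As a quick check on the 11 ps figure, skew can be computed as the spread in arrival times. A minimal sketch using the arrival times printed on the Clocking Regions slide; the physical layout of the regions is not recoverable from the slide, so this takes the spread over all listed values rather than over true neighbors (the function name is illustrative only).

```python
# Arrival times (ps) as printed on the Clocking Regions slide.
# Region adjacency is NOT modeled: the layout is not recoverable here.
arrival_ps = [102, 98, 99, 103, 92, 96, 97, 93]


def worst_case_skew(times):
    """Largest difference in clock arrival time between any two points."""
    return max(times) - min(times)


skew_ps = worst_case_skew(arrival_ps)
print(f"worst-case skew across listed regions: {skew_ps} ps")
```

The spread between the latest (103 ps) and earliest (92 ps) arrival matches the 11 ps quoted on the slide.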
Skew: Flip-Flops
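[Figure: flip-flops F1 and F2 clocked by clk, with combinational logic CL between them. Setup timing diagram: clk, Q1, D2 over clock period Tc, with tpdq, tsetup, and tskew as sequencing overhead. Hold timing diagram: clk, Q1, D2 with tccq, tcd, thold, and tskew.]
Skew adds to both setup and hold constraints.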
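The effect of skew on both constraints can be checked numerically. A minimal sketch, assuming the standard flip-flop formulation (setup: tpd <= Tc - (tpdq + tsetup + tskew); hold: tcd >= thold - tccq + tskew), with symbol names taken from the figure labels. All delay values are hypothetical and do not come from the slides.

```python
def max_logic_delay(t_c, t_pdq, t_setup, t_skew):
    """Setup constraint: longest allowed CL propagation delay (tpd)."""
    return t_c - (t_pdq + t_setup + t_skew)


def min_logic_delay(t_hold, t_ccq, t_skew):
    """Hold constraint: shortest allowed CL contamination delay (tcd)."""
    return t_hold - t_ccq + t_skew


# Hypothetical numbers, all in ps.
T_C, T_PDQ, T_SETUP, T_HOLD, T_CCQ = 1000, 80, 50, 30, 40

for skew in (0, 50):
    print(f"tskew={skew:3d} ps: "
          f"tpd <= {max_logic_delay(T_C, T_PDQ, T_SETUP, skew)} ps, "
          f"tcd >= {min_logic_delay(T_HOLD, T_CCQ, skew)} ps")
```

With these numbers, 50 ps of skew shrinks the time available for logic from 870 ps to 820 ps and raises the minimum contamination delay from -10 ps (no constraint) to 40 ps.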
Clock Distribution Evolution (cont)
Definition: Gridded Clock
In early clock distribution systems, large drivers plus a metal clock grid were used for clock
distribution. Subsystems just tapped into the clock grid for connectivity. This is easy to
do, but takes a lot of power and chews up routing resources. (Grid density is
exaggerated in this picture; there is a lot more white space than is shown.)
Alpha 21064 Die Photo (1993)
Single clock driver; the 2 transistors for the buffer are visible to the naked eye.
Clocking scheme was 2 phase, single wire. Clock load was 3.5 nF.
Gate width of final driver was 35 cm (not a misprint; a serpentine layout was used to get this gate width).
21064 Clock Skew Distribution
Max clock skew approx. 180 ps (3.6% of 5 ns clock period). 1 gate delay is about
300 ps, so clock skew is about 60% of a gate delay.
Thermal Image of 21064
76 °C at center of chip
46 °C at edges of chip
21164 Clock Distribution (1995)
Goal of 21164 Clock distribution was to reduce skew by 30%
and reduce the thermal gradient.
A predriver was centered between two main clock drivers
Max clock skew approx. 80 ps (2.4% of 3.3 ns clock period). 1 gate delay is about
240 ps, so clock skew is about 1/3 of a gate delay.
Clock skew is lowest near the two main clock drivers.
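The skew ratios quoted for the two chips follow directly from the skew, clock period, and gate delay given on the slides. A small sketch of that arithmetic (the function name is illustrative only):

```python
def skew_report(name, skew_ps, period_ps, gate_delay_ps):
    """Express skew as a percentage of the period and as a fraction of a gate delay."""
    pct_of_period = 100.0 * skew_ps / period_ps
    gate_fraction = skew_ps / gate_delay_ps
    return name, round(pct_of_period, 1), round(gate_fraction, 2)


# Numbers taken from the 21064 and 21164 skew slides.
for row in (skew_report("21064", 180, 5000, 300),
            skew_report("21164", 80, 3300, 240)):
    print(row)
```

For the 21064 this gives 3.6% of the period and 0.6 of a gate delay; for the 21164 it gives 2.4% and 0.33.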
Aside: Why a Gridded Clock?
• Both 21064, 21164 used a single global clk distributed by a
metal clock grid.
• Skew is largely determined by grid interconnect density and is
insensitive to gate load placement
– Why? Because capacitance of grid wiring dominates the gate loads
connected to it.
• Universal availability of clock signals
• Design teams can proceed in parallel since clock constraints
well known
• Good process-variation tolerance
• The disadvantage is the extra capacitance of the grid
– Power-performance tradeoff is determined by choice of skew target,
which establishes the needed grid density, which determines the clock
driver size.
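The power cost of grid capacitance can be put in rough numbers with P = C * V^2 * f, since the clock swings full rail every cycle. A back-of-the-envelope sketch using the 21064 figures quoted on the earlier slides (3.5 nF clock load, 5 ns period); the 3.3 V supply is an assumption and is not stated in these slides.

```python
def clock_power_w(c_load_f, vdd_v, freq_hz):
    """Dynamic power of a load that charges and discharges once per clock cycle."""
    return c_load_f * vdd_v ** 2 * freq_hz


C_LOAD_F = 3.5e-9        # clock load from the 21064 slide
FREQ_HZ = 1.0 / 5e-9     # 5 ns clock period from the 21064 skew slide
VDD_V = 3.3              # ASSUMPTION: supply voltage is not given in the slides

power = clock_power_w(C_LOAD_F, VDD_V, FREQ_HZ)
print(f"estimated clock power: {power:.2f} W")
```

Under these assumptions the clock load alone dissipates several watts, which is why the choice of skew target (and hence grid density) is a power-performance tradeoff.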
21264 Clock Distribution (1997)
• 21264 clocking fundamentally different from previous
Alphas because it supported a hierarchy of clocks
– Still had a GCLK (global clock) grid, but conditional and local clocks
had several buffer stages after GCLK
• Conditional clocks used to save power
– Clocks gated to functional units in design
– If not executing a floating point instruction, then stop the clock to the
floating point unit to save power!
• State elements and clocking points were 0 to 8 gates past
GCLK
• Six major regional clocks two gain stages past GCLK with
grids juxtaposed with GCLK, but shielded from it.
– Major clocks drive local clocks and conditional clocks
• Goals were to improve performance, reduce power.
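The saving from a conditional clock can be illustrated with a toy model. A minimal sketch, assuming a unit's clock power is proportional to the number of cycles in which its clock toggles; the instruction stream and unit names are hypothetical and do not describe the real 21264 pipeline.

```python
def active_cycles(instr_stream, unit):
    """Count cycles in which the unit's conditional clock is enabled."""
    return sum(1 for instr in instr_stream if instr == unit)


# Hypothetical instruction mix: one instruction class per cycle.
stream = ["int", "fp", "int", "int", "mem", "fp", "int", "mem"]

active = active_cycles(stream, "fp")
saved_fraction = 1 - active / len(stream)
print(f"FP clock enabled {active} of {len(stream)} cycles; "
      f"{saved_fraction:.0%} of FP clock toggles avoided")
```

In this made-up stream the floating point clock runs in only 2 of 8 cycles, so three quarters of its clock toggles are avoided.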
Clock Hierarchy of 21264
21264 Global Clock Distribution
Global clock grid. Uses 3% of M3/M4 routing layers
(lines in picture are misleadingly thick).
All GCLK lines are laterally shielded by Vss/Vdd
[Figure: cross-section showing a GCLK line flanked by Vss and Vdd shield lines, with signal wires beyond the shields]
Simulated worst case GCLK skew was 72 ps.
Skew on M1, M2 was less than 10 ps.
Major clocks are two inversions past GCLK
Local, Conditional Clocks
Power Consumption 21264
The IA-64 (Gen 1)
Technology
• 0.18 µm CMOS
• 25.4 million transistors
• 6 metal layers
• Flip-chip with 1014 pads
IA-64 Clock Distribution (Gen 1)
Alpha vs IA-64 Approach
• Alpha CPU Major Clock = IA-64 Regional clocks
– Alpha did not attempt to deskew Major clocks with GCLK
– Alpha used local clocks generated from major clocks and did timing
analysis, path delay matching between clocks and data to solve timing
problems
• This does NOT account for delays due to on-die process variations
– At GHz clock speeds, skew due to on-die process variations can cause
timing failures
• IA-64 used an ‘active’ distributed deskewing approach for GCLK
and Regional Clocks
– Wanted to avoid the detailed delay matching, timing analysis required in
the Alpha design after complete implementation because of impact on
design schedule
– Wanted to account for delay due to on-die process variations
Think of reference clock as the ‘golden’ clock
Feedback clock!!!
Delay circuit used to control edge alignment of Global clock
with Regional Clock.
In general, this is a form of digital Delay Locked Loop
(DLL). Any form of PLL/DLL must have feedback for
correction!
Decoupling caps
Regional clock
can be gated
Shifting a ‘1’ from one end decreases delay, shifting a ‘0’ from opposite end
increases delay (this is a variable delay line).
Delay range was 170 ps in 8.5 ps steps. Phase adjustments made every 16
clock cycles. Could also be adjusted manually via test access port (TAP)
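The shift-register delay line and its periodic phase correction can be modeled in a few lines. A minimal sketch, assuming a thermometer-coded register with 170 / 8.5 = 20 steps, a bang-bang phase detector, and a fixed path error; the class and function names and the 30 ps error are hypothetical, and only the adjustable part of the delay is modeled.

```python
STEP_PS = 8.5                          # delay step from the slide
RANGE_PS = 170.0                       # delay range from the slide
N_TAPS = round(RANGE_PS / STEP_PS)     # 20 register bits (assumed one bit per step)


class DeskewBuffer:
    """Thermometer-coded shift register: more 1s means less delay."""

    def __init__(self, ones=N_TAPS // 2):
        self.bits = [1] * ones + [0] * (N_TAPS - ones)

    def shift_in_one(self):
        # A '1' enters from one end: one more stage bypassed, delay decreases.
        self.bits = [1] + self.bits[:-1]

    def shift_in_zero(self):
        # A '0' enters from the opposite end: delay increases.
        self.bits = self.bits[1:] + [0]

    @property
    def delay_ps(self):
        return (N_TAPS - sum(self.bits)) * STEP_PS


def run_deskew(path_error_ps, cycles, update_every=16):
    """Return the residual phase error after running the loop for `cycles` clocks."""
    buf = DeskewBuffer()
    nominal = buf.delay_ps             # mid-range starting point
    for cycle in range(1, cycles + 1):
        if cycle % update_every == 0:  # phase adjusted every 16 clock cycles
            error = path_error_ps + (buf.delay_ps - nominal)
            if error > 0:              # regional clock late: remove delay
                buf.shift_in_one()
            else:                      # regional clock early: add delay
                buf.shift_in_zero()
    return path_error_ps + (buf.delay_ps - nominal)


residual = run_deskew(path_error_ps=30.0, cycles=160)
print(f"residual phase error: {residual} ps")
```

Because the detector only reports early or late, the loop never settles exactly; it dithers by one step around lock, so the residual error in this model stays within one 8.5 ps step.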
Controller for Deskew Buffer Register
Why a Reference Clock?
• The goal of the DSK was to deskew the global (core) clock
with respect to the regional clocks
– Reference clock was 2X core clock
– Regional clocks were simply a delayed version of the global clock
– Reference clock was not deskewed but smaller distribution region
and more balanced routing gave less skew in reference clock.
• Not possible to maintain a balanced routing network and load
matching for core clock over such a large design with
multiple design teams, since the core clock was driving logic
– However, it was possible to design a balanced routing network and
have load matching for the reference clock, since all it drove were the
DSKs and the global clock design team was solely responsible for reference
clock design
– Feedback clocks from the regional clock distribution were then used
to deskew regional clocks with respect to reference clock.
Skew Elements
• Total skew of design based on residual skew in reference
clock, uncertainty of phase detector in DSK, and
mismatches of feedback clocks
– Reference clock did not have as large a distribution region as the
core clock, and loads were better matched, so had tighter skew
than would have been possible with global clock
– Feedback clock routes were kept short with respect to DSKs
– Phase detector uncertainty kept small via symmetric layout
techniques and by allowing a long time for phase comparison
• Achieved maximum skew was 28 ps (2.8% of a 1 GHz
clock period).
Measured skew via Laser voltage probing
Local Clocks
• Local clocks generated from Regional Clocks and
provided clocks needed by domino logic
• Full timing analysis performed on local clocks
• Local clocks responsibility of functional block design
teams
• Global and regional clock responsibility of global clock
design team
Hold Time Analysis (another look)
Clock Comparison of three generations of IA-64
Comments
• Active de-skewing used in 1st generation jettisoned in 2nd
generation
– 2nd generation just used a balanced H-tree
– Difficult to route this type of structure - all clock routing was
reserved prior to block layout
– Differential clocks used for 2nd level clock distribution – reduced
jitter
– Non-active de-skew easier to test, and more deterministic behavior
– Intentional clock skewing for time borrowing easier
• 3rd generation uses programmable fuses for skewing
– allows skew adjustment after fabrication
2nd Generation Clock distribution
[Figure labels: gated clocks, differential clocks]
2nd Generation Clock Shielding
[Figure: shielded differential clock pair CLK+ and CLK-]
This level reduces inductive effects. Locates gnd current return close to clock lines.
3rd Generation Distribution
Made a big difference here. Skew reduced from 60 ps to 24 ps.
90 nm IA Microprocessor (2003)
Global clock distribution scaled up to 6 GHz
–Core clocks (clocks for processor cores) use the same core clock scheme as used
in Xeon Single Core (2003, 90 nm). This clock scheme was designed to scale up
to 6 GHz, and used an H-tree distributed clock with shorted nodes that
produced less than 10 ps skew. No active de-skew or fuse-based de-skew.
–Un-core clock (everything outside the core: cache, bus logic, etc.). Large area
prevented use of gridded clock (power restriction), so a clock tree was used (9 vertical,
2 horizontal spines) with fuse-based deskew at root of each vertical spine. Achieved
less than 11 ps skew.
Dual Core die photograph
Clock Domains
Clock Generator Arch.
Clock Distribution
Fuse-based deskew buffers located at the root of the vertical MCLK spines
Core to Un-Core deskew
Different VCCs: core 1.25 V, uncore 1.10 V
[Figure labels: un-core clock, core clock]
Global Skew
Skew < 10 ps
Power