Low Power Clock Distribution - II


Buffers and device sizing under process variations, zero skew vs. tolerable skew, and chip-package co-design of the clock network
Energy consumption is humongous. The rise transition is longer than the fall transition; therefore the pulse-width test is not cleared.
CLOCK DISTRIBUTION NETWORK METRICS
• A. Power: Power consumption is the most critical metric for a clock distribution network. In most high-performance processors, the clock network dissipates more than 30% of the total power. There are three main methods to manage this power: reduce the clock voltage swing, reduce the effective load capacitance, and use transmission lines.
• B. Jitter: Another major consideration is timing noise and systematic offsets, known as clock jitter, caused by the clock source and network. Jitter can be affected by the buffers' noise, supply-injected noise, phase mismatch, etc. The tolerable jitter depends on the application and on the blocks receiving the clock signal. For instance, if the clock drives an ADC or a high-speed wire-line transceiver, the jitter spec directly sets the system performance.
• C. Latency/Skew: Although the clock edge rates should remain suitably fast at each leaf of the clock distribution tree, each node may see a different delay (or phase) of the clock. Clock skew is defined as the time difference between two clock signals measured at half of their voltage swing. At the global clock network level, the goal is mostly to synchronize the clocks at the leaves; each block can then tune its own clock phase via a phase interpolator or delay lines. In other words, skew should be constant for all clock users on the chip. The absolute skew value may also be a design-specified parameter in some applications.
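The half-swing skew definition above can be sketched numerically. The waveforms below are hypothetical linear ramps, not measured data:

```python
# Skew between two clock signals, measured (as defined above) as the difference
# of their crossing times at half the voltage swing. Times in ns, voltages in V.
def crossing_time(t, v, v_th):
    """First time v rises through v_th, via linear interpolation."""
    for i in range(1, len(v)):
        if v[i - 1] < v_th <= v[i]:
            frac = (v_th - v[i - 1]) / (v[i] - v[i - 1])
            return t[i - 1] + frac * (t[i] - t[i - 1])
    raise ValueError("no rising crossing found")

def ramp(ti, start, rise=0.10, vdd=1.0):
    """Linear rising edge beginning at `start`, with rise time `rise`."""
    return min(max((ti - start) / rise, 0.0), 1.0) * vdd

t = [i * 0.001 for i in range(1001)]      # 0 .. 1 ns, 1 ps steps
clk_a = [ramp(ti, 0.20) for ti in t]      # edge starts at 0.20 ns
clk_b = [ramp(ti, 0.25) for ti in t]      # same edge, delayed 50 ps

skew = crossing_time(t, clk_b, 0.5) - crossing_time(t, clk_a, 0.5)
print(f"skew = {skew * 1e3:.1f} ps")      # 50.0 ps
```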

• D. Area/Cross-section: The wiring for clock networks is mostly done on the upper metal levels due to their low resistance. The same layers are used for power distribution and for implementing inductors, for the same reason. Thus, not only do we prefer compact clock wiring to accommodate more high-speed transceivers in high-bandwidth applications such as switch systems-on-chip (SoCs), it is also important to minimize the clock network area to leave more room for power grids. This metric restricts the use of transmission lines for clocking in these chips, since they are considerably wider than normal wires. For instance, 32 µm wide coplanar transmission lines yield a throughput density of 0.25 Gb/s/µm at 8 Gb/s, whereas 1.6 µm wide thick copper wires with CMOS-based buffers achieve 0.625 Gb/s/µm at 1 Gb/s in a 0.18 µm technology.
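The quoted throughput-density figures follow directly from link rate divided by the wire width it occupies:

```python
# Area throughput density = link rate / wire width (Gb/s per um of cross-section).
def throughput_density(rate_gbps, width_um):
    return rate_gbps / width_um

tline = throughput_density(8.0, 32.0)   # 32 um coplanar transmission line @ 8 Gb/s
cmos  = throughput_density(1.0, 1.6)    # 1.6 um thick copper wire + CMOS buffers @ 1 Gb/s
print(tline, cmos)                      # 0.25 0.625
```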
Buffer and Device Sizing under Process
Variations
• Different buffers cause different phase delay variations.
• The sizes of the buffers can be further adjusted to reduce power dissipation, subject to the phase delay constraint, t_p.
• The above formulation assumes typical process parameters and a fixed PMOS/NMOS transistor ratio for CMOS buffers, i.e., wp/wn = 2.0.
• In reality, MOS device parameters such as carrier mobilities and
threshold voltages may vary in a remarkably wide range from die
to die for the same process.
• Moreover, process spread causes PMOS and NMOS device
parameters to vary independently from die to die in a chip
production environment .
• This type of process variation is becoming more significant as feature sizes and supply voltages are scaled down.
• Additional skew will arise from the buffer delay variations
even when delay is balanced with typical process
parameters
• The processing spread for a given CMOS technology is
usually characterized by extracting three sets of process
parameters for PMOS and NMOS devices:
• fast (or high current) parameters, typical (or medium
current) parameters and slow (or low current)
parameters.
• The rise time of a buffer with slow parameter of PMOS
can be as many as 2 to 4 times the rise time of a buffer
with fast parameter of PMOS.
• In a chip fabrication environment, a die can have fast
PMOS and slow NMOS, or slow PMOS and fast NMOS or
other combinations of the three sets of parameters .
• This type of process variation is different from device or wire geometry variations, which can be overcome by increasing the device sizes or wire widths.
• By balancing the delays through PMOS devices (or the pull-up path) and the delays through NMOS devices (or the pull-down path) on different paths, the effects of this type of process variation can be eliminated.
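A toy numeric sketch of the corner spread described above. The relative drive currents are hypothetical, chosen only to fall inside the quoted 2-4x range:

```python
# Toy first-order model: edge time ~ C*V/I_drive. The relative corner currents
# below are hypothetical, picked only to reproduce the 2-4x spread in the text.
i_fast, i_typical, i_slow = 1.0, 0.6, 0.33   # relative device drive currents

rise_fast = 1.0 / i_fast      # normalised rise time with fast PMOS
rise_slow = 1.0 / i_slow      # normalised rise time with slow PMOS
ratio = rise_slow / rise_fast
print(f"slow/fast rise-time ratio: {ratio:.1f}x")   # 3.0x, within the 2-4x range

# On a fast-PMOS / slow-NMOS die, pull-up edges are fast and pull-down edges
# slow, so paths with different pull-up/pull-down mixes see different delays.
imbalance = (1.0 / i_slow) / (1.0 / i_fast)
print(f"pull-down vs pull-up edge-time imbalance: {imbalance:.1f}x")
```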
IEEE 2007
• Clock distribution network consumes a significant portion of the total
chip power since the clock signal has the highest activity factor and
drives the largest capacitive load in a synchronous integrated circuit. A
new algorithm is proposed in this paper for buffer insertion and sizing in
an H-tree clock distribution network. The objective of the algorithm is to
minimize the total power consumption while satisfying the maximum
acceptable clock transition time constraints at the leaves of the clock
distribution network for maintaining high performance. The algorithm
employs non-uniform buffer insertion and progressive relaxation of the
transition time requirements from the leaves to the root of the clock
distribution network. The proposed algorithm provides up to 30%
savings in the total power consumption as compared to a standard
algorithm with uniform buffer insertion aimed at maintaining uniform
transition time constraints at all the nodes of a clock tree.
• Clock buffers have equal rise and fall times. This prevents the duty cycle of the clock signal from changing as it passes through a chain of clock buffers. Normal buffers are designed with a W/L ratio such that the sum of rise time and fall time is minimum. Clock buffers are also designed for higher drive strength.
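A minimal sketch of the sizing rule implied above, assuming illustrative mobility values (not from the text):

```python
# For equal rise and fall times, the PMOS is widened by roughly the
# electron/hole mobility ratio: Wp/Wn ~ mu_n / mu_p. The mobility values here
# are ballpark bulk-CMOS numbers, used only for illustration.
mu_n, mu_p = 500.0, 200.0        # cm^2/(V*s), illustrative
wn = 1.0                         # normalised NMOS width
wp = wn * mu_n / mu_p            # matched pull-up drive strength
print(f"Wp/Wn = {wp / wn:.1f}")  # Wp/Wn = 2.5 (vs ~2.0 used for minimum tr+tf)
```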
ACM 2013
• Minimizing power and skew for clock networks are critical and
difficult tasks which can be greatly affected by buffer sizing. However,
buffer sizing is a non-linear problem and most existing algorithms are
heuristics that fail to obtain a global minimum. In addition, existing
buffer sizing solutions do not usually consider manufacturing
variations. Any design made without considering variation can fail to
meet design constraints after manufacturing. In this paper, first we
proposed an efficient optimization scheme based on geometric
programming (GP) for buffer sizing of clock networks. Then, we
extended the GP formulation to consider process variations in the
buffer sizes using robust optimization (RO). The resultant variation-aware network is examined with SPICE and shown to be superior in
terms of robustness to variations while decreasing area, power and
average skew.
Zero Skew vs. Tolerable Skew
• Much research has been done in the area of performance-driven clock distribution, mainly on the construction of clock trees to minimize clock skew.
• Most techniques used for skew minimization are based on adjusting interconnect lengths and widths: the length-adjustment technique moves the balance points or elongates the interconnect to achieve zero skew, while the width-sizing technique achieves zero skew by adjusting the widths of wires in the clock tree.
• For power considerations, these techniques tend to be pessimistic in that they assume the skew between every pair of clock sinks has to be limited by the same amount. Due to the attempt to achieve the minimum skew for all clock sinks, wire lengths or widths of the clock tree may be increased substantially, resulting in increased power dissipation.

• Tolerable skews are the maximum values of clock skew between each
pair of clock sinks with which the system can function correctly at the
desired frequency.
Derivation of Tolerable Skew
• Setup time is defined as the minimum amount of time before the clock's active edge for which the data must be stable in order to be latched correctly. Hold time is defined as the minimum amount of time after the clock's active edge during which the data must remain stable.
• Hold time can be negative, which means the data may change slightly before the clock edge and still be captured properly.
• Figure 5.9 illustrates two cases of correct synchronous operations with
tolerable skews. In Figure 5.9(a), the clock arrives at CO2 later than the
previous
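The setup and hold definitions above bound the tolerable skew for each launch/capture pair. A minimal sketch with hypothetical timing numbers (all in ns); skew is taken as capture-clock arrival minus launch-clock arrival:

```python
# Tolerable-skew window for one launch/capture flip-flop pair (values assumed).
T       = 2.0    # clock period
t_cq    = 0.10   # clock-to-Q delay
t_max   = 1.40   # longest combinational path
t_min   = 0.20   # shortest combinational path
t_setup = 0.15
t_hold  = 0.05

skew_min = t_cq + t_max + t_setup - T   # below this, the setup check fails
skew_max = t_cq + t_min - t_hold        # above this, the hold check fails
print(f"tolerable skew range: [{skew_min:+.2f}, {skew_max:+.2f}] ns")
```

Any skew inside this window still lets the pair function correctly at the target frequency, which is exactly the slack the tolerable-skew formulation exploits.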
A Two-level Clock Distribution Scheme
• the tolerable skew between each pair of clock sinks is
well defined.
• In designing a low power system, tolerable skew instead
of minimum skew should be used during clock tree
construction
• Placing the clock tree on a single metal layer reduces delays and the attenuation caused by vias, and decreases the sensitivity to process-induced wire or via variations.
• The clock wiring capacitance is also substantially reduced.
• However, it is not always practical to embed the entire
clock tree on a single layer.
A two-level clock distribution scheme:
• Tolerable skew differs from one pair of clock sinks to another
as logic path delays vary from one combinational block to
another. The clock sinks that are close to each other and have
very small tolerable skews among them are grouped into
clusters.
• A global level clock tree connects the clock source to the
clusters and is routed on a single layer with the smallest RC
parameters by a planar routing algorithm.
• For clock sinks that are located close to each other, tolerable
skews among them can be easily satisfied.
• Little savings within a local cluster can be gained if the sinks
within the cluster have large tolerable skews.
• Local trees may be routed on multiple layers since the
total wiring capacitance inside each cluster is very
small and has less impact on total power.
• The tolerable skews between two clusters can be
determined from the smallest tolerable skew
between a clock sink in one cluster and a clock sink in
the other cluster.
• During clustering, the tolerable skews are maximized
between clusters.
• This will give the global level clock tree construction
more opportunity to reduce wire length, save buffer
sizes, and reduce power consumption since the global
level clock tree has much more impact on power.
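The inter-cluster rule above (smallest pairwise tolerable skew between the two clusters) can be sketched as follows; the sink names and skew values are hypothetical:

```python
# Inter-cluster tolerable skew = smallest tolerable skew between any sink in
# one cluster and any sink in the other (skews in ns, values hypothetical).
from itertools import product

def inter_cluster_skew(cluster_a, cluster_b, tol):
    """tol maps a sorted sink-name pair to its tolerable skew."""
    return min(tol[tuple(sorted((a, b)))] for a, b in product(cluster_a, cluster_b))

tol = {("s1", "s3"): 0.30, ("s1", "s4"): 0.45,
       ("s2", "s3"): 0.25, ("s2", "s4"): 0.50}
skew_ab = inter_cluster_skew(["s1", "s2"], ["s3", "s4"], tol)
print(skew_ab)   # 0.25
```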
Power reduction in two level clock distribution

• The buffer insertion and sizing method can be used at the global level to minimize wire width and meet tolerable skew constraints.
• By taking advantage of the more relaxed tolerable
skews instead of zero-skews, further reduction in
power dissipation can be achieved.
• The buffer sizing problem (PMBS) and device sizing problem (PMDS) can be reformulated by replacing the minimum tolerable skew value with the tolerable skew value for each pair of clock sinks.
Chip and Package Co-Design of Clock Network
• With ever increasing I/O counts, more and more chips will
become "pad limited".
• At the same time, the simultaneous switching noise due to
the wire bond inductance will limit the chip performance.
• This is more apparent in low-power systems as the supply voltage is scaled down to 3 V or less. Area-IO provides an immediate and lasting solution to the IO bottleneck on chips.
• In area-IO, the IO pads are distributed over the entire chip
surface rather than being confined to the periphery as
shown in Figure 5.11b.
Packaging
Flip Chip describes the method of
electrically connecting the die to
the package carrier. The package carrier,
either substrate or lead frame, then
provides the connection from the die to the
exterior of the package. In
“standard” packaging, the interconnection
between the die and the carrier is made
using wire.
• The flip-chip technology makes the area-IO scheme feasible. In this technology, the dice or bare chips are attached with pads facing down, and solder bumps form the mechanical and electrical connections to the substrate, as shown in Figure 5.11(a). Flip-chip technology is, however, expensive.
• Compared with the wire bonding and TAB, flip-chip has the
highest IO density, smallest chip size, and lowest inductance.
With high density area-IO and low inductance solder bumps, it
is possible to place global clock distribution on a dedicated
layer off the chip, either on the chip carriers of single chips as
shown in Figure 5.12 or on the substrates of multi-chip
modules.
• This two-level clock-distribution scheme is depicted in Figure
5.13. The package level interconnect layer can be made to
have far smaller RC parameters. This can be seen from the
interconnect scaling properties.
Electrostatic discharge (ESD) protection
Reference
• Digital VLSI Design, Lecture 9: I/O and Pad Ring, Semester A, 2016-17. Lecturer: Dr. Adam Teman.
An LGA socket is a type of central processing unit (CPU) socket that uses the land grid array style of integrated circuit packaging; the LGA 2066 socket is one example.
Power Reduction in clock networks
• In a synchronous digital chip, the clock signal is generally one
with the highest frequency.
• The clock signal typically drives a large load because it has to
reach many sequential elements distributed throughout the
chip. Therefore, clock signals have been a notorious source of
power dissipation because of high frequency and load.
• It has been observed that clock distribution can take up to
40% of the total power dissipation of a high performance
microprocessor
• Furthermore, the clock signal carries no information content because it is predictable. It does not perform useful computation and serves only the purpose of synchronization.
• The number of different clock signals on a chip is very limited, and they warrant special attention during the design process.
Low power clock distribution techniques

• 1. Clock gating
• 2. Reduced-swing clock
• 3. Oscillator circuit for clock generation
• 4. Frequency division and multiplication
• 5. Reducing the capacitance of the clock signal
1. Clock Gating
• Clock gating, as depicted in Figure 6.1, is the most popular method for power reduction of clock signals. When the clock signal of a functional module (ALUs, memories, FPUs, etc.) is not required for some extended period, a gating function (typically a NAND or NOR gate) is used to turn off the clock feeding the module.
• The gating signal should be enabled and disabled at a much slower rate than the clock frequency; otherwise, the power required to drive the enable signal may outweigh the power saving. Clock gating saves power by reducing unnecessary clock activity inside the gated module.
The masking gate simply replaces one of the buffers in the clock
distribution tree. If the gating signal appears in a critical delay
path and degrades the overall speed, the designer can always
choose not to gate a particular module.
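As a rough illustration of the enable-rate trade-off described above, a back-of-envelope sketch with hypothetical capacitances and toggle rates:

```python
# A slow enable signal costs far less power than the clock activity it removes.
# All values are hypothetical; dynamic power P = alpha * C * V^2 * f.
def dyn_power(alpha, c_farads, vdd, f_hz):
    return alpha * c_farads * vdd ** 2 * f_hz

vdd = 1.0
p_clk_module = dyn_power(1.0, 10e-12, vdd, 1e9)    # module clock net: 10 pF @ 1 GHz
p_enable     = dyn_power(1.0, 0.5e-12, vdd, 1e6)   # enable net toggling at 1 MHz

idle_fraction = 0.6                                # clock gated off 60% of the time
saving = idle_fraction * p_clk_module - p_enable
print(f"net saving ~ {saving * 1e3:.2f} mW")
```

Because the enable toggles three orders of magnitude slower than the clock, its overhead is negligible next to the gated-off clock power.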
• Clock gating can significantly reduce the switching activity in a circuit and on the clock nets; thus, it has been viewed as one of the most effective logic, RTL, and architectural approaches to dynamic power minimization.
• Complex algorithms have been devised for calculating the idle conditions of a circuit and for automatically inserting the clock-gating logic into the netlist.
• Side effects of the clock-gating paradigm, such as its impact on circuit testability, have been explored in detail, making this technology very mature from the industrial standpoint as well.
• As of today, most commercial EDA tools for power-driven synthesis feature automatic clock-gating capabilities at different levels of design abstraction.
2. Reduced Swing Clock
• In the P = CV² equation, the most attractive parameter to attack is the voltage swing V, due to its quadratic effect. It is generally difficult to reduce the load capacitance or frequency of clock signals, for obvious performance reasons.
• In CMOS design, a clock signal is only connected to the gate of a transistor when it reaches a sequential element; it is seldom connected to the source or drain of a transistor. Inside a sequential cell, the clock signal is used to turn transistors on or off.
• Consider a 5 V digital CMOS chip with an N-transistor threshold voltage of 0.8 V. For a 5 V regular full-swing clock signal, an N-transistor gated by the clock will turn on if the clock signal is above 0.8 V.
• If the swing of the N-transistor clock signal can be limited to the range from zero to 2.5 V (half swing), the on-off characteristics of all N-transistors remain digitally identical.
A similar observation can be made for the clock signal feeding a P-transistor, where the swing is limited to the range from 2.5 V to 5 V.
The power saved from the reduced swing is 75% on the clock signal. The penalty incurred is the reduced speed of the sequential elements: the sequential delay, expressed as propagation delay plus setup and hold time, approximately doubles.
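The 75% figure follows directly from the quadratic swing dependence:

```python
# Clock power scales with the square of the voltage swing, so halving the 5 V
# swing to 2.5 V saves 75% on the clock net, as stated above.
def swing_saving(v_full, v_reduced):
    return 1.0 - (v_reduced / v_full) ** 2

saving = swing_saving(5.0, 2.5)
print(f"{saving:.0%} saved")   # 75% saved
```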
3. Oscillator Circuit for Clock Generation
• Clocks up to a few tens of MHz can be generated directly using crystal oscillators; above that, in most cases a PLL (phase-locked loop) is used.
• The frequency of the high-frequency oscillator is divided by a suitable factor (dividing a signal by a power of 2 is easy and exact) and then compared to, say, a 10 MHz reference oscillator. The comparison result is used to adjust the high-frequency oscillator. Thus a high frequency is generated with (almost) the accuracy of the lower-frequency crystal oscillator.
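The divide-and-compare loop described above amounts to integer-N frequency synthesis; a one-line sketch with hypothetical values:

```python
# Integer-N synthesis: the VCO output is divided by N and phase-compared
# against the crystal reference, so the loop settles at f_vco = N * f_ref.
# The reference frequency and divider ratio below are hypothetical.
f_ref = 10e6             # 10 MHz crystal reference
n_div = 160              # feedback divider ratio

f_vco = n_div * f_ref    # locked VCO output frequency
print(f"f_vco = {f_vco / 1e6:.0f} MHz")   # f_vco = 1600 MHz
```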
4. Frequency Division and Multiplication
• A power reduction scheme that has been successfully applied is frequency division and multiplication, shown in Figure 6.6. This is especially common for off-chip clock signals because they drive very large capacitances.
• The off-chip clock signal runs at a slower speed and an on-chip
phase-locked loop circuit is used to multiply the frequency to
the desired rate.
• The slower signal also eases the off-chip signal distribution in
terms of electromagnetic interference and reliability.
• The frequency multiplication factor N is a trade-off between power dissipation and phase-locked-loop circuit complexity: larger values of N save more power but increase the design complexity and the performance requirements of the phase-locked loop.
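A sketch of why dividing the off-chip clock by N helps, using hypothetical pad/board load values:

```python
# Off-chip clock power versus division factor N (all values hypothetical).
c_offchip = 30e-12        # off-chip clock load (pads + board), F
vdd, f_core = 3.3, 400e6  # I/O swing and on-chip core clock frequency

def offchip_clock_power(n_div):
    """Dynamic power of the off-chip clock running at f_core / n_div."""
    return c_offchip * vdd ** 2 * (f_core / n_div)

for n in (1, 2, 4, 8):
    print(f"N = {n}: {offchip_clock_power(n) * 1e3:.1f} mW")
```

The off-chip power drops linearly with N; the cost is the on-chip PLL that must multiply the frequency back up.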
5. Reduce the Capacitance of the Clock Signal
• Yet other techniques attempt to reduce the capacitance of the clock signal. In some advanced microprocessor chips, the clock signal is routed on the topmost metal layer, thereby reducing its capacitance.
• The gated modules are clustered based on their activity patterns to reduce power dissipation and the complexity of control signal generation.
• Simultaneous switching of global clock signals can also cause large transient currents to be drawn. Methods to reduce the transient current by controlling clock skew have been studied.
• In computer architecture, frequency scaling (also known
as frequency ramping) is the technique of increasing a
processor's frequency so as to enhance the performance of the
system containing the processor in question. Frequency ramping
was the dominant force in commodity processor performance
increases from the mid-1980s until roughly the end of 2004.
• The effect of processor frequency on computer speed can be seen
by looking at the equation for computer program runtime:
• Runtime = (Instructions / Program) × (Cycles / Instruction) × (Time / Cycle), where instructions per program is the total number of instructions executed in a given program, cycles per instruction is a program-dependent, architecture-dependent average value, and time per cycle is by definition the inverse of the processor frequency. [1] An increase in frequency thus decreases runtime.
• However, power consumption in a chip is given by the equation P = C × V² × F, where P is power consumption, C is the capacitance being switched per clock cycle, V is voltage, and F is the processor frequency (cycles per second). [2] Increases in frequency thus increase the amount of power used in a processor.
Increasing processor power consumption led ultimately to Intel's May
2004 cancellation of its Tejas and Jayhawk processors, which is generally
cited as the end of frequency scaling as the dominant computer
architecture paradigm.[3]
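Both equations above, evaluated for a hypothetical processor:

```python
# Runtime and power from the two equations above; all values are hypothetical.
instructions = 1e9      # instructions per program
cpi          = 1.5      # cycles per instruction
f_hz         = 2e9      # frequency; time per cycle = 1 / f_hz

runtime = instructions * cpi * (1.0 / f_hz)
print(f"runtime = {runtime:.3f} s")    # runtime = 0.750 s

c_sw, vdd = 1e-9, 1.2   # switched capacitance per cycle, supply voltage
power = c_sw * vdd ** 2 * f_hz         # P = C * V^2 * F
print(f"power = {power:.2f} W")        # power = 2.88 W
```

Doubling F here would halve the runtime but double the power, which is the tension that ended frequency scaling.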
• Moore's Law was[4] still in effect when frequency scaling ended. Despite
power issues, transistor densities were still doubling every 18 to 24
months. With the end of frequency scaling, the new transistors (no longer needed to facilitate frequency scaling) are used to add extra hardware, such as additional cores, to facilitate parallel computing, a technique referred to as parallel scaling.
• The end of frequency scaling as the dominant cause of processor
performance gains has caused an industry-wide shift to
parallel computing in the form of multicore processors.
