
Adaptive Clock Distribution for 3D Integrated Circuits

Xi Chen, W Rhett Davis, Paul D Franzon


Department of Electrical and Computer Engineering
North Carolina State University
Raleigh, NC 27695, USA
{xchen10, wdavis, paulf}@ncsu.edu

Abstract-Clock distribution in three-dimensional integrated circuits (3D ICs) faces many challenges. In this work, we present new techniques for realizing highly adaptive and reliable clock distribution for 3D ICs. Firstly, an efficient clock distribution topology that does not need a balanced H-tree is proposed. Secondly, a robust tunable-delay-buffer (TDB) circuit and a novel active de-skew method are developed to handle cross-die variations, thermal gradients, and wiring asymmetry. Moreover, a design optimization flow is constructed to improve the adaptive clock design based on thermal profiles. Experimental results show that the clock skews are significantly reduced by the proposed techniques.

Keywords-3D IC, clock distribution, adaptive, de-skew

I. INTRODUCTION

Three-dimensional integrated circuit (3D IC) technology provides promising benefits for advanced digital system designs. The technology can help to overcome the interconnect wire delay barrier by greatly shortening the wire length compared to a 2D system [1]. It also provides a solution to the well-known memory wall problem [3] by stacking multiple logic and memory dies and connecting them with Through-Silicon-Vias (TSVs) [2]. In addition, the technology is able to significantly reduce memory access latency and input/output driver power consumption compared to a general multi-chip system design. All these features make 3D IC technology attractive.

Clock distribution is critical to a digital system design. When a system is implemented in 3D technologies, it becomes more challenging to control the clock skews for the following reasons:

a) Cross-die process variations. In a 3D integration, especially a heterogeneous integration, cross-die process variations will increase the clock skews if the sequential elements in the same clock domain are located on different tiers.

b) High thermal gradients. A 3D integration leads to a higher heat density and moves some active devices further away from the heatsink. The increased thermal gradients will result in significant clock skews.

c) Non-idealities of Through-Silicon-Vias (TSVs). Due to parasitics, TSVs can degrade clock signal quality and increase skews. TSVs can also absorb noise from the substrate [4]. In addition, TSVs make it difficult to design a highly symmetric clock distribution.

For 2D ICs, several techniques have been proposed to reduce clock skews. In the designs of [5][6], delay buffers are inserted at selected positions of the clock trees and are used to match the delays of all clock distributions to the longest routing paths. This method introduces large overhead if it needs buffer insertions in multiple clock stages. In addition, the traditional delay buffer design is sensitive to PVT variations, and the phase error caused by the variations accumulates across the whole clock path. In high performance systems like microprocessors, an active de-skew technique is used [7]. This method compares clock phases at loading points with the phase of a reference signal, and adjusts the Tunable-Delay-Buffers (TDBs) in the clock path according to the phase errors obtained by the phase comparators [8][9]. However, distributing an accurate reference signal is itself very challenging, and it is difficult for this method to compensate the asymmetry caused by TSVs in 3D ICs.

In recent years, some efforts have been made to design clock networks in 3D ICs. In [10], the clock tree for each tier is designed in the same way as in a 2D design. This method cannot handle cross-tier variations and results in a clock skew of up to 250ps in simulation. In [11], the method routes the clock network freely in three-dimensional space using updated algorithms. However, since these routing algorithms oversimplify the effects of the TSVs, they are too optimistic and have limited use [12]. Some researchers extend the TDB-based design into 3D ICs [13], but they have not provided solutions to handle the non-idealities caused by 3D integration.

In this work, we propose new techniques to handle the challenges in 3D clock distribution. Firstly, we propose a new 3D clock distribution topology to achieve high quality and good cost-efficiency. Secondly, we design a phase mixer based TDB circuit which is tunable over 360 degrees and has good tolerance to PVT variations. Thirdly, a novel de-skew method is developed to handle the cross-tier variations and the 3D wiring asymmetry. Moreover, a design optimization flow based on the thermal profile is developed to minimize the power and area overheads of the TDB insertions and further improve the adaptive clock network.

The paper is organized as follows. Section II discusses details of the proposed techniques, including the clock distribution topology, the TDB circuit design, and the new active de-skew method. Section III presents the design optimization flow. The experiment results are presented in Section IV.

This work was supported by Semiconductor Research Corporation under contract No. 1824.

978-1-4244-9399-9/11/$26.00©2011 IEEE
II. ADAPTIVE 3D CLOCK TECHNOLOGIES

A. Efficient 3D clock distribution topology

Traditionally, the global clock network uses an H-tree structure (Figure 1(a)). However, it is difficult for this topology to handle the variations and non-idealities in 3D ICs. In addition, the routing area cost of this topology is high. We propose a new topology to improve the cost-efficiency as well as to reduce the routing complexity of the 3D clock distribution. As shown in Figure 1(b), the global clock distribution is placed on only one tier in order to minimize the effects of cross-tier variations. TDBs are placed within each clock region on all the tiers, and they are driven by the global clock network through TSVs. If the tuning range of the TDBs covers a whole clock cycle monotonically, i.e., the TDBs are linearly tunable over 360 degrees, this topology only needs TDBs at the last stage of the global distribution and therefore saves considerable design effort. Without the need for H-trees, the proposed clock distribution topology is able to largely reduce power and routing complexity. The TDB design is discussed in detail in the following subsection.

B. Phase mixer based Tunable-Delay-Buffer (TDB)

In this work, we use multi-phase clocking to enhance the capability of locking the phases of the TDBs with the clock generator. A Phase Mixer based TDB (PM-TDB) circuit is designed. As shown in Figure 2(a), the PM-TDB consists of a phase multiplexer and a phase interpolator. By interpolating the multi-phase clock, the delay of the PM-TDB can be tuned precisely. This PM-TDB circuit provides multiple advantages. Firstly, it is capable of tuning over 360 degrees and generating an arbitrary delay within only one stage, so that the clock distribution needs neither a highly balanced H-tree nor a large number of TDB insertions. Secondly, the circuit has good tolerance to PVT variations. Moreover, the PM-TDB is convenient for regional clock gating and intentional skew editing, as it can be tuned individually without complicated delay analysis.

Figure 2. New tunable delay buffer (TDB) circuit design. (a) Simplified circuit schematic. (b) Simulated tuning delays at different process corners (Normal, Slow, Fast); x-axis: Tuning Code D[4-0], 0 to 32; y-axis: Delay (ps), 0 to 1000.

A PM-TDB controlled by a 5-bit digital tuning code is designed in a 45nm CMOS process. The loading structure of the circuit is optimized for better slew rate and linearity. As the simulation shows, the total power consumption of one mixer is 125μW at 1GHz, and the silicon area is 10μm². The nominal and worst-case delay values under all code settings are shown in Figure 2(b).
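As a rough illustration of the 360-degree tuning range described above, the sketch below models an idealized PM-TDB in Python. It assumes (this is not stated in the paper) that the upper 2 bits of the 5-bit code drive the phase multiplexer, selecting one of four evenly spaced clock phases, and that the lower 3 bits drive the phase interpolator. The function name `pm_tdb_delay_ps` and its parameters are hypothetical, and the process-corner spread visible in Figure 2(b) is ignored.

```python
def pm_tdb_delay_ps(code, period_ps=1000.0, mux_bits=2, interp_bits=3):
    """Ideal delay (ps) of a phase-mixer TDB for a tuning code.

    Hypothetical model: the bit split between multiplexer and
    interpolator is an assumption made for illustration only.
    """
    total_bits = mux_bits + interp_bits
    if not 0 <= code < (1 << total_bits):
        raise ValueError("tuning code out of range")
    # Upper bits: the phase multiplexer picks one of 2**mux_bits
    # evenly spaced phases of the multi-phase clock.
    coarse = code >> interp_bits
    # Lower bits: the phase interpolator moves between the selected
    # phase and the next one in 2**interp_bits equal steps.
    fine = code & ((1 << interp_bits) - 1)
    coarse_step = period_ps / (1 << mux_bits)
    fine_step = coarse_step / (1 << interp_bits)
    return coarse * coarse_step + fine * fine_step


# Sweep all 32 codes at 1 GHz (1000 ps period).
delays = [pm_tdb_delay_ps(c) for c in range(32)]
step = delays[1] - delays[0]
```

Under these assumptions the ideal tuning step at 1GHz is 1000ps / 32 = 31.25ps, and the highest code gives 968.75ps, one step short of a full period, so the tuning range wraps around monotonically. Simulated delays such as those in Figure 2(b) would deviate from this ideal line at the slow and fast corners.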

Figure 1. (a) Traditional 2D H-tree clock distribution and (b) proposed 3D clock distribution topology.

C. Return signal active de-skew technique

To minimize the clock skews, the PM-TDBs located at all clock loading regions need to be accurately in-phase. In this work, we propose a novel de-skew method. Figure 3 shows the simplified functional diagram with only one clock path. The forward clock path (blue line) distributes the clock signal from the source to a loading point. In this method, we synchronize the phase at the loading point (ΦLoading) to the phase at the clock source (ΦSource) by tuning the delay of the PM-TDB (TDB_L). Instead of providing a reference signal to the loading point, we place a return signal path (orange line) immediately next to the forward path. As the return signal goes through the same routing path and the same number of TSVs as the forward signal, it can be used to track the delay of the forward signal. A phase comparator is located close to the clock source, and it compares the return signal phase PL with a reference signal phase Pref, which is delayed from the clock source by a reference PM-TDB (TDB_ref). When the phase comparator detects a phase difference between the two
signals, it sweeps the digital control codes D of TDB_L and D̄ of TDB_ref until PL matches Pref. The code D̄ is the complementary code of D, which makes the delay of TDB_L plus the delay of TDB_ref always equal to a full clock period. Therefore, when PL matches Pref, ΦLoading at the loading point and ΦSource at the clock source are in-phase. The proposed de-skew method has good adaptability to the cross-tier variations and the asymmetry caused by TSVs. In addition, the method saves design overhead, as all the clock regions can be calibrated with only one phase comparator.

Figure 3. Diagram of return signal de-skew. [Diagram labels: clock source (ΦSource), global clock distribution, TDB_L, regional distribution, loading point (ΦLoading), return signal path (PL), TDB_ref (Pref), phase comparator, control codes D and D̄.]

III. DESIGN OPTIMIZATION FLOW BASED ON THERMAL PROFILE

Large thermal gradients caused by 3D stacking will increase clock skews. Dividing the whole system into multiple clock regions and inserting TDBs helps to reduce the skews. However, the insertion introduces extra area and power. A design and optimization flow is proposed to handle the tradeoff between skew performance and cost.

Figure 4 shows the proposed clock tree topology, in which TDBs are placed at the leaves of a global tree and drive regional clock networks. In this topology, TSVs are only placed at the end of the global tree, and each regional clock network is distributed on only one tier. The adaptive de-skew technology discussed in Section II.C is applied to sense the return signal from one leaf point of a regional tree and lock the clock phases of all the loading points in that region.

Figure 4. Proposed clock tree topology.

We aim to find the optimal clock regions, whose dimensions are inversely proportional to the thermal gradients within them. As Figure 5 shows, the initial clock tree partition starts from a logic circuit unit or a single memory bank. The skew distribution is calculated based on the thermal profiles. In the optimization process, the clock regions that cannot meet the skew specification are further divided. Meanwhile, regions with very low skews are merged to save cost. According to the revised region partition, the TDB insertions and de-skew return signal paths are updated in the physical design. This process continues until all the regions meet the requirement. With this flow, we are able to obtain the optimal clock regions based on the thermal profiles.

Figure 5. Clock design flow framework.

IV. EXPERIMENT RESULTS

A. Active de-skew simulations

Figure 6 shows the simulation results of the de-skew technology at a 1GHz clock frequency. The top plot shows the change of the tuning codes D and D̄. The middle plot shows the waveforms of the reference signal Pref and the return signal PL. The bottom plot shows the phases at the clock source (ΦSource) and at the loading point (ΦLoading). The results show that the signals at the source and the loading point are able to lock after 10 clock cycles, and the settled skew is 15.9ps.

Figure 6. De-skew transient simulation results.

B. System clock skews improvement

A design case is created to study the impacts of the thermal gradients and the TDB insertions. The circuits and thermal profiles are adopted from a 3D ORPSOC system design described in [10]. The technology used is a 45nm CMOS process. A two-tier stacking structure is used, including logic circuits on the bottom tier and 32KB SRAM
arrays (2KB/bank×16banks) on the top tier. Each tier is 1mm×1mm in area. HSPICE simulations are used to extract the temperature coefficients of the clock buffers and metal wires in the clock tree. The accumulated delays of all clock routings on both tiers are also calculated.

Figure 7(a) shows the thermal profiles for both tiers. The background temperature is 25°C, and the highest temperature is 90°C. Figure 7(b) shows the delay performance of a traditional H-tree clock distribution, which has neither TDB insertion nor de-skew technique. As the results show, the maximum in-tier skew is 85.2ps for the logic tier and 75ps for the memory tier. The cross-tier skew, 214.3ps, is even worse because of the thermal gradients and TSVs between tiers. Figure 7(c) shows the results of the proposed topology with a 250μm×250μm minimum clock region and a TDB with a 7.8ps tuning step in each region. The maximum in-tier skews are 17.8ps for the logic tier and 21ps for the memory tier. Because the adaptive de-skew is able to compensate the effects of both TSVs and thermal gradients across tiers, the worst-case clock skew in this case is the same as the memory-tier value, 21ps. The results show that the clock region partition and the de-skew technique reduce the clock skews by more than 90% (from 214.3ps to 21ps).

Figure 7. Thermal profile and simulated clock skew distribution for two tiers. (a) Thermal profiles (90°C hot spot). (b) Delays of H-tree clock distribution (214.3ps max skew). (c) Skews of new topology with de-skew (21ps max). [Each panel shows the Memory and Logic tiers; axes in μm; scale in °C for (a) and ps for (b) and (c).]

V. CONCLUSIONS

In this paper, we present novel technologies to realize high performance clock distribution in 3D ICs. An efficient clock distribution topology, a reliable tunable-delay-buffer, and a highly adaptive de-skew technique are proposed to overcome the impacts of the cross-tier process variations, the large thermal gradients, and the routing asymmetries in 3D ICs. In addition, an optimization flow is developed to improve the clock region design and reduce the overhead.

REFERENCES

[1] S. J. Souri, K. Banerjee, A. Mehrotra, and K. C. Saraswat, “Multiple Si layer ICs: motivation, performance analysis, and design implications”, in Proc. Design Automation Conf., 2000, pp. 213-220.
[2] K. Banerjee, S. J. Souri, P. Kapur, and K. C. Saraswat, “3-D ICs: a novel chip design for improving deep-submicrometer interconnect performance and systems-on-chip integration”, Proc. IEEE, vol. 89, pp. 602-633, May 2001.
[3] Wm. A. Wulf and S. A. McKee, “Hitting the memory wall: Implications of the obvious”, ACM SIGARCH Computer Architecture News, vol. 23, pp. 20-24, March 1995.
[4] Jonghyun Cho et al., “Active Circuit to Through Silicon Via (TSV) Noise Coupling”, in IEEE 18th Conf. Electrical Performance of Electronic Packaging and Systems, 2009, pp. 97-100.
[5] Mosin Mondal et al., “Mitigating Thermal Effects on Clock Skew with Dynamically Adaptive Drivers”, in Int. Symp. Quality Electronic Design, 2007, pp. 67-72.
[6] Ashutosh Chakraborty et al., “Dynamic Thermal Clock Skew Compensation Using Tunable Delay Buffers”, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, pp. 639-649, June 2008.
[7] Simon Tam et al., “Clock Generation and Distribution for the 130-nm Itanium® 2 Processor with 6-MB On-Die L3 Cache”, IEEE J. Solid-State Circuits, vol. 39, pp. 636-642, April 2004.
[8] Simon Tam et al., “Clock generation and distribution for the first IA-64 microprocessor”, IEEE J. Solid-State Circuits, vol. 35, pp. 1545-1552, 2000.
[9] Patrick Mahoney, Eric Fetzer, Bruce Doyle, and Sam Naffziger, “Clock Distribution on a Dual-Core, Multi-Threaded Itanium®-Family Processor”, in IEEE Int. Solid-State Circuits Conf., 2005, pp. 292-599.
[10] Hao Hua, “Design and Verification Methodology for Complex Three-Dimensional Digital Integrated Circuit”, Ph.D. dissertation, Dept. Elect. Comp. Eng., North Carolina State Univ., Raleigh, NC, 2006.
[11] Jacob Minz, Xin Zhao, and Sung Kyu Lim, “Buffered Clock Tree Synthesis for 3D ICs under Thermal Variations”, in Asia and South Pacific Design Automation Conf., 2008, pp. 504-509.
[12] David Kung and Ruchir Puri, “CAD challenges for 3D ICs”, in Asia and South Pacific Design Automation Conf., 2009, pp. 421-422.
[13] Mosin Mondal et al., “Thermally Robust Clocking Schemes for 3D Integrated Circuits”, in Design, Automation & Test in Europe, 2007, pp. 1-6.
