PD Lec03

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 10

CS250

VLSI Systems Design

Lecture 3: Physical Realities:


Beneath the Digital Abstraction,
Part 1: Timing
Fall 2010

Krste Asanovic’, John Wawrzynek


with
John Lazzaro
and
Yunsup Lee (TA)
Lecture 03, Timing CS250, UC Berkeley Fall ‘10

What do Computer Architects


need to know about physics?
‣ Physics effect:
Area cost
Delay performance
Energy performance & cost

• Ideally, zero delay, area, and energy. However, the


physical devices occupy area, take time, and consume
energy.
• CMOS process lets us build transistors, wires,
connections, and we get capacitors (,inductors) and
resistors whether or not we want them.

Lecture 03, Timing 2 CS250, UC Berkeley Fall ‘10


Physical Layout

‣ “Switch-level” abstraction gives a good way to understand


the function of a circuit.
‣ nFET (g=1 ? short circuit : open)
‣ pFET (g=0 ? short circuit : open)
‣ Understanding delay means going below the switch-level
abstraction to transistor physics and layout details.
Lecture 03, Timing 3 CS250, UC Berkeley Fall ‘10

“Gate Delay”

‣ Modern CMOS gate delays on the order of


a few picoseconds. (However, highly
dependent on gate context.)
‣ Often expressed as FO4 delays (fan-out of
4) - as a process independent delay metric:
‣ the delay of an inverter, driven by an
inverter 4x smaller than itself, and
driving an inverter 4x larger than itself.
‣ For our 90nm process FO4 is around 20ps.

Lecture 03, Timing 4 CS250, UC Berkeley Fall ‘10


“Path Delay”

‣ For correct operation:


Total Delay ≤ clock_period - FFsetup_time - FFclk_to_q - Clock_skew
on all paths.

‣ High-speed processors critical paths have around 10-20 FO4


delays.

Lecture 03, Timing 5 CS250, UC Berkeley Fall ‘10

FO4 Delays per clock period !"#$%&'(&)*+

=88
FO4
Delays CPU Clock Periods
B8
1985-2005 '$,-/)7A?
A8 '$,-/)4A?
'$,-/)C-$,'3D
'$,-/)C-$,'3D)6
@8 '$,-/)C-$,'3D)7
MIPS 2000
'$,-/)C-$,'3D)4
5 stages
?8 '$,-/)',#$'3D
E/CF#)6=8?4
E/CF#)6==?4
>8 Pentium Pro E/CF#)6=6?4
10 stages 9C#"%
48 93C-"9C#"%
9C#"%?4

Historical 78
G'C(
HI)IE
limit: I&J-")IK

about 68 EGL)M?
EGL)M@
12 =8
EGL)NA?O?4
Pentium 4
20 stages
8
A> A? A@ AA AB B8 B= B6 B7 B4 B> B? B@ BA BB 88 8= 86 87 84 8>

!"#$%&'()*#+&$,-
./#+&$,-0(,#$.&"12-13 456756887 6 9,#$.&"1):$';-"(',<

Thanks to Francois Labonte, Stanford

Lecture 03, Timing CS250, UC Berkeley Fall ‘10


“Gate Delay”
‣ What determines the actual delay of a logic gate?
‣ Transistors are not perfect switches - cannot change terminal
voltages instantaneously.
‣ Consider the NAND gate:

‣ Current (I) value depends on: process parameters, transistor size

∆ ∝C /I
L

‣ CL models gate output, wire, inputs to next stage (Cap. of Load)


‣ C “integrates” I creating a voltage change at output
Lecture 03, Timing 7 CS250, UC Berkeley Fall ‘10

More on transistor Current


‣ Transistors act like a cross bet ween a resistor and “current
source”

‣ ISAT depends on process parameters (higher for nFETs than for


pFETs) and transistor size (layout):

ISAT ∝ W/L

Lecture 03, Timing 8 CS250, UC Berkeley Fall ‘10


More on CL
‣ Everything that connects to the output of a logic gate (or
transistor) contributes capacitance:

‣ Transistor
I drains
‣ Interconnection
(wires/
contacts/vias)
‣ Transistor Gates

Lecture 03, Timing 9 CS250, UC Berkeley Fall ‘10

Wires
‣ So far, simple capacitors:
C ∝ Area = width ∗ length
‣ Wires have finite resistance, so have distributed R and C:

with r = res/length, c = cap/length, ∆ ∝ rcL 2 ≅ rc + 2rc +3rc + ...


‣ For short wires (bet ween gates) R is insignificant (total RC
delay << gate delay)
‣ For long wires R becomes significant. Ex: busses, clocks, reset
‣ “rebuffering” helps

Lecture 03, Timing 10 CS250, UC Berkeley Fall ‘10


Turning Rise/Fall Delay into Gate Delay
• Cascaded gates:

“transfer curve” for inverter.

Lecture 03, Timing 11 CS250, UC Berkeley Fall ‘10

Driving Large Loads


‣ Large fanout nets: clocks, resets, memory bit lines, off-chip
‣ Relatively small driver results in long rise time (and thus
large gate delay)

‣ Strategy: Staged Buffers

‣ Optimal trade-off bet ween delay per stage and total


number of stages fanout of ∼4 per stage

Lecture 03, Timing 12 CS250, UC Berkeley Fall ‘10


Components of Path Delay

1. # of levels of logic
2. Internal cell delay
3. wire delay
4. cell input capacitance
5. cell fanout
6. cell output drive strength
Lecture 03, Timing 13 CS250, UC Berkeley Fall ‘10

Who controls the delay?


foundary Library
CAD Tools (DC, Designer
engineer Developer
IC Compiler) (Yunsup)
(TSMC) (Aritsan)

1. # of levels synthesis RTL


2. Internal physical cell topology,
cell delay parameters trans sizing
3. Wire physical layout
place & route
delay parameters generator
4. Cell input physical cell topology,
cell selection instantiation
capacitance parameters trans sizing
5. Cell
synthesis RTL
fanout
6. Cell drive physical transistor
cell selection instantiation
strength parameters sizing
Lecture 03, Timing 14 CS250, UC Berkeley Fall ‘10
Timing Closure: Searching for and beating
down the critical path
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001

Must consider all connected register pairs,


paths from input to register, register to
output. Don’t forget the controller.
• Design tools help in the search.
– Synthesis tools work to meet clock
constraint, report delays on paths,
? – Special static timing analyzers accept a
design netlist and report path delays,
– and, of course, simulators can be used to
by power.
w voltage
determine timing performance.
Tools that are expected to do something about
by current
age of the
as is used the timing behavior (such as synthesizers), also
include provisions for specifying input arrival
All core
and bulk
balt disili- times (relative to the clock), and output
requirements (set-up times of next stage).
itance, as
mance and
Fig. 2. Microprocessor pipeline organization.

shown in Fig. 2, where the state boundaries are indicated by


and data gray. Features that allow the microarchitecture to achieve high
ck buffer. speed are as follows.
and four The shifter and ALU reside in separate stages. The ARM in-
nder-miss struction set allows a shift followed by an ALU operation in a
like oper- single instruction. Previous implementations limited frequency
lookaside by having the shift and ALU in a single stage. Splitting this op-
provided eration reduces the critical ALU bypass path by approximately

Timing Analysis, real example


128-entry 1/3. The extra pipeline hazard introduced when an instruction is
a pipeline immediately followed by one requiring that the result be shifted
2], [3]. is infrequent.
Decoupled Instruction Fetch. A two-instruction deep queue is
The critical path
implemented between the second fetch and instruction decode Most paths have hundreds of
re utilizes
n addition
pipe stages. This allows stalls generated later in the pipe to be
deferred by one or more cycles in the earlier pipe stages, thereby picoseconds to spare.
s,
approach, allowing instruction fetches to proceed when the pipe is stalled,
Late-mode timing checks (thousands)

sed at the and also relieves stall speed paths in the instruction fetch and
st
gn issues,
sures that
branch prediction units.
200
Deferred register dependency stalls. While register depen-

late
e is seven dencies are checked in the RF stage, stalls due to these hazards
eline, and are deferred until the X1 stage. All the necessary operands are
s inserted then captured from result-forwarding busses as the results are
nto ARM
16 b, two
returned to the register file. 150
One of the major goals of the design was to minimize the en-
humb in- ergy consumed to complete a given task. Conventional wisdom
ipeline is has been that shorter pipelines are more efficient due to re-

100
ng
50

rs 0
!40 !20 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280
*
Timing slack (ps)
From “The circuit and physical design of the POWER4 microprocessor”, IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.
g
as Figure 26
Histogram of the POWER4 processor path delays.
Timing Analysis Tools
‣ Static Timing Analysis: Tools use delay models for
gates and interconnect. Traces through circuit paths.
‣ Cell delay models capture
‣ For each input/output pair, internal delay (output load
independent)
delay
‣ output dependent delay
‣ Standalone tools (PrimeTime) and part of logic
synthesis.
output load
‣ Back-annotation takes information from results of
place and route to improve accuracy of timing
analysis.
‣ DC in “topographical mode” uses preliminary layout
information to model interconnect parasitics.
‣ Prior versions used a simple fan-out model of gate
loading.
Lecture 04, Timing 17 CS250, UC Berkeley Fall ‘09

clk Hold-time Violations


d FF q

‣ Some state elements have positive hold time requirements.


‣ How can this be?
‣ Fast paths from one state element to the next can create a
violation. (Think about shift registers!)
‣ CAD tools do their best to fix violations by inserting delay (buffers).
‣ Of course, if the path is delayed too much, then cycle time suffers.
‣ Difficult because buffer insertion changes layout, which changes
path delay.
Lecture 04, Timing 18 CS250, UC Berkeley Fall ‘09
Conclusion
‣ Timing Optimization: You start with a target on clock
period. What control do you have?
‣ Biggest effect is RTL manipulation.
‣ i.e., how much logic to put in each pipeline stage.
‣ In most cases, the tools will do a good job at logic/circuit
level:
‣ Logic level manipulation
‣ Transistor sizing
‣ Buffer insertion
‣ But some cases may be difficult and you may need to help
‣ Hand instantiate cells, layout generators

Lecture 04, Timing 19 CS250, UC Berkeley Fall ‘09

End of Physical Realities


part 1 Timing

Lecture 02, Introduction 1 20 CS250, UC Berkeley Fall ‘09

You might also like