SoCT Slides

Contents
Introduction
SoC Logic Design Recap
SoC Paradigm
Processor Architecture
Memory
Interconnect
Low Power Design
SoC Arithmetic Blocks
System-on-Chip Technologies
A. Herkersdorf, W. Stechele, A. Surhonne
© Lehrstuhl für Integrierte Systeme
Theresienstr. 90, Building N1, 2nd floor
www.lis.ei.tum.de
Introduction
Organisational matters
Lectures (A. Herkersdorf)

 Room : 0360, Wednesday 13:15 - 14:45
Tutorials (A. Surhonne)
 Room : 1260, Thursday 09:45 - 10:30
Registration in TUMonline is required:
 Registration link available at www.lis.ei.tum.de/?id=soct
News, handouts and course materials:
 http://www.moodle.tum.de
Exam:
 Final exam at end of semester, written exam (75 min.), accounts for 100% of
total grade, calculators & 1 A4 sheet allowed (no laptops or smartphones)
 You MUST bring your Student ID AND Passport!
Reading material / Literature
Course handouts and lecture notes

 Recommended as primary reference during course
 Sufficient for exam preparation
Text books
 Digital Integrated Circuits - A Design Perspective, J. Rabaey,
Prentice Hall
 Computer Architecture. A Quantitative Approach, J. Hennessy, Elsevier
What is this course about?
Basics of digital System-on-Chip (SoC)

 Hierarchical composition of SoC platforms
 … based on foundation of technology scaling and CMOS circuitry
 Gate array, ASIC, FPGA
 Insight to the main components:
 RISC processor
 Bus/FIFO Interconnect
 On-/Off-Chip Memory
 Low Power design principles
Foundation of CMOS Scaling
1947: First transistor, John Bardeen and Walter Brattain (Bell Laboratories)
1958: First integrated circuit (IC), Jack Kilby (TI), Robert Noyce (Fairchild)
2002: PowerPC 970, 0.13 µm, 1.8 GHz, 52 M transistors (IBM [1])
[Figure: simplified CMOS transistor model with source, drain, channel width W, minimum channel length Lmin and oxide thickness tOX; W and Lmin are scaled down from generation to generation: (W, Lmin) → (W', L'min) → (W'', L''min)]
Moore's Law CMOS Scaling: Good news
1965: Gordon Moore forecasts that chip capacity doubles every 18 – 24 months (66 % CAGR)
[Figure: chip capacity (transistors per chip) of DRAM chips from 4k to 32G and transistor gate length L (µm) from 50 down to 0.01, plotted over the years 1972 – 2020]
Source: ITRS Roadmap 99, 09
Moore's Law CMOS Scaling: Good news
[Figure: same plot extended by microprocessors (8008, 8086, 80286, 80386, 80486, Pentium, Pentium II, Pentium III, Pentium 4, Core 2 Duo, Core i7) and DRAM prices (microcents per bit stored, 10^7 down to 10^-2), plotted over the years 1972 – 2020]
Moore's Law CMOS Scaling: Challenges
Power dissipation, performance, frequency, number of cores
[Figure: power density (W/cm2, 0.1 to 1000) of microprocessors from the 8086 to the Core i7 over the years 1972 – 2020, with hot plate, nuclear reactor and rocket nozzle as reference levels]
If Moore's Law…
… had held in other industries or our daily life for the last 30 years…
• the capacity of a 1.5 V AA battery would be around 2 MWh today (the energy consumption of a 4-person household for half a year),
• the flight between Munich and Singapore would take 2 seconds, and
• the tuition fee for one semester at Harvard would be just 0.80 €
Intel: Moore‘s Law is Forever ! … Really?
So what!
SoC Technologies
CMOS Technology: What determines:
- speed of CMOS?
- CMOS power consumption?
- costs of CMOS?
Processing Data:
- FSM, adder, multiplier, mux, shifter, registers, RISC core
(Intel Pentium I [2])
Transport of Data:
- on-chip buses
- FIFO interconnect
(IBM CMOS7S Cu 0.13µm [1])
Storing Data:
- SRAM, DRAM, ROM, FLASH
- capacity, timing, latency, access BW
(8 Mbit SRAM chip [1])
Hardware Platforms
Full-Custom ASIC:
- Highest performance
- Highest cost
- Highest development effort
(Intel Pentium I [2])
Std. Cell ASIC:
- High performance
- High cost
- High development effort
FPGA:
- Function determined after manufacturing
- Good performance
- Medium development effort
- Medium capacity
(Xilinx Virtex-II Pro XC2VP20 FPGA)
System-on-Chip:
- Maximizes reuse of existing macros
- High performance
- High capacity
- Medium development effort
[Figure: System-on-Chip platform with I/O, eRAM, NoC, RAM-Ctl. and application-specific µ-processor]
Applications in our daily life
Consumer Electronics:
- Laptops with "server"-class processing power
- Smartphones, mobile appliances
- UMTS, 802.11
Automotive:
- Driving, internetworked "supercomputer"
- navigation,
- distance control,
- lane departure warning
[Continental-TEMIC]
Communications & Networking:
- Terabit Internet backbone router
- Broadband wired/wireless access
- ad-hoc nets among mobile devices
SoC Design Challenges
Microscopic issues (technical)
• Ultra-high speeds
• Power dissipation and supply rail drop
• Growing importance of interconnect
• Noise, crosstalk
• Reliability, manufacturability
• Clock distribution
Macroscopic issues (economical)
• Time-to-market
• Design complexity (millions of gates)
• High levels of abstractions
• Design for test
• IP reuse, portability
• EDA tool interoperability
Source: J. Rabaey [3], M. Irwin [4]
Key Metrics of SoC
SoC design is a multi-objective optimization problem!
• Performance
  - Speed (delay, clock frequency)
  - Throughput
• Power consumption (static, dynamic)
• Cost
  - NRE (fixed) costs: Design effort, mask production
  - RE (variable) costs: Chip area, package, test
• Reliability, robustness
  - Signal-to-noise ratio, noise margins and immunity
  - Mean time between failures (MTBF) = 1 / failure rate
• Time-to-Market
  - Time from product idea to shipment (market research, specification, development, fabrication, test)
Outlook to SoC Platforms
[Figure: network infrastructure landscape, from home networks (802.11, Home RF, xDSL, cable, ISDN) and wireless access (mobile clients, base stations, base station controller, mobile switching center) via the local access network, edge and core routers, VoIP/PSTN gateways and Sonet/SDH transmission to the WAN core, LAN/SAN switches, load balancers, application servers and storage servers of enterprises and service providers (xSP, ASP); control processors appear throughout]
Outlook to SoC Platforms
[Figure: structure of a network box]
• System Processor: control and management of entire box
• Backplane / Switch Fabric: interconnect for as many line interface cards as possible
• Network Processor: determines box function: Switch, Router, Gateway, …
• Line Interface: terminates physical network links: Ethernet, SONET/SDH, …
SoCT Course Outline
Part 1
• Introduction
• SoC Logic Design Recap
• SoC Paradigm
Part 2
• Processor Architecture
• Memory
• Interconnect
• Low Power Design
Why CMOS?
Electrical Reasons
• Low power dissipation
• Noise immunity
• Clean logic levels
• One supply voltage
• Cascadable
Economical Reasons
• Easy to design
• Fabrication well understood
• Highly integrable
[Figure: process steps lithography (light through a photolithographic mask onto photoresist), etching, doping (P or As atoms) and deposition, shown on a Si wafer with SiO2 and poly-Si layers]
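CMOS – What is it?
[Figure: cross-section of an nMOS transistor (n+ regions in p substrate, connected to gnd) and a pMOS transistor (p+ regions in n well, connected to Vdd) with common input and output]
• Metal (Poly-Si)
• Oxide (SiO2)
• Semiconductor
The C in CMOS signals the combination of p- and n-MOSFETs → Complementary
The channel type gives the prefix of the transistor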
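Static CMOS Inverter
[Figure: inverter circuit with pMOS transistor P between VDD and output Z and nMOS transistor N between Z and gnd; input voltage VA, output voltage VZ, transistor voltages VGSp, VDSp, VGSn, VDSn and current Ip]
A  Z
0  1
1  0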
MOSFET Voltages And Drain Current
nMOS pMOS
G Vector Conventions* G
VGS VGS
ID ID
S D D S
VDS VDS
On lower potential Source is always On higher potential
Gate-Source Voltage
0 Drain-Source Voltage 0
Drain Current
*Please avoid VSG, VSD, VGD, VDG

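MOSFET Output Characteristic
State, condition and drain current (nMOS):
• Off: $V_{GS} \le V_t,\ V_{DS} \ge 0$
  $I_{Dn} = 0$
• Linear: $V_{GS} > V_t,\ 0 \le V_{DS} < V_{GS} - V_t$
  $I_{Dn} = \beta \left[ (V_{GS} - V_t) - \frac{V_{DS}}{2} \right] V_{DS}$
• Saturation: $V_{GS} > V_t,\ V_{DS} \ge V_{GS} - V_t$
  $I_{Dn} = \frac{\beta}{2} (V_{GS} - V_t)^2$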
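The three operating regions can be sketched in a few lines of Python. This is only an illustration of the first-order model on this slide; the function name and the default values for `v_t` and `beta` are assumptions, not values from the course.

```python
def nmos_drain_current(v_gs, v_ds, v_t=0.5, beta=1e-3):
    """First-order drain current of an nMOS transistor in amperes.

    v_t (threshold voltage, V) and beta (transconductance, A/V^2)
    are illustrative defaults, not data of a real process.
    """
    if v_ds < 0:
        raise ValueError("model is only defined for V_DS >= 0")
    v_ov = v_gs - v_t                      # overdrive voltage V_GS - V_t
    if v_ov <= 0:
        return 0.0                         # off
    if v_ds < v_ov:
        return beta * (v_ov - v_ds / 2) * v_ds   # linear region
    return beta / 2 * v_ov ** 2            # saturation region
```

At V_DS = V_GS - V_t both expressions give the same current, so the model is continuous at the boundary between linear region and saturation.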
MOSFET Output Characteristics
VGS  const.  Vtn G G VGS  const.  Vtp

ID ID
S D D S
VDS VDS
ID VGS  Vtp ID
VDS
VGS  Vtn VDS

linear saturation saturation linear
DC Operation
Voltage Transfer Characteristics (VTC)

Plot of output voltage as a function of the input voltage
V(y)
f V(y)=V(x) V(x) V(y)

VDD
Switching Threshold
VM
GND
VIL VIH VDD V(x)
Static Voltage Transfer Curve (VTC)
VZ off on N
VDD
Ip
on off P VGS S
p
P VDSp
VDD
D
N VDSn
VA VZ
VGSn S
gnd
Ip
A Z
VDD VA 0 1
Vtn VDD-|Vtp|
Vth 1 0
MOSFET Dimensioning
Transconductance: $\beta = K \frac{W}{L}$ where $K = \mu \frac{\varepsilon_{ox} \varepsilon_0}{t_{ox}}$
• Designer's Parameter: W
• Conflicting Design Goals:
  - Area => W = Lmin
  - Speed => high W, L = Lmin
  → always use Lmin for digital circuits
• Technology Parameters
  - Mobility µn = 1.5 .. 3.5 x µp; designer uses W to compensate for lower current drive of pMOSFETs
  - Minimum Feature Size: Lmin
  - Oxide dielectric/thickness: εox, tox
Noise Margins
For robust circuits, we want the "0" and "1" intervals to be as large as possible
• Noise Margin High: NMH = VDD - VIH
• Noise Margin Low: NML = VIL
• Undefined region: between VIL and VIH
[Figure: voltage ranges between Gnd and VDD at gate output and gate input]
Effect of Capacitance on Inv. Delay
[Figure: inverter modeled by the on-resistances Ron,p (to Vdd) and Ron,n (to gnd) driving the load capacitance C; input VA and output VZ = VC over time t, delay measured at the 50% points]
Inverter model:
• on: resistance
• off: open switch
• operating in linear region of the Sah model
Low-to-high propagation delay tpLH = f(C):
$t_{pLH} \propto \frac{C_{load}\, t_{ox}\, L_p}{W_p\, \mu_p\, \varepsilon_{ox}\, (V_{dd} - |V_{tp}|)}$
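A rough numerical sketch of this RC view of the inverter follows. It assumes an on-resistance of about 1 / (β_p (Vdd - |Vtp|)) for the pMOS transistor and a 50% crossing after ln(2)·R·C; both are common first-order approximations and not taken from the slide. All names and values are hypothetical.

```python
import math

def inverter_t_plh(c_load, beta_p, v_dd, v_tp):
    """Estimate the low-to-high delay of an inverter in seconds.

    Assumptions (not from the slide): the conducting pMOS acts as a
    resistor R_on = 1 / (beta_p * (V_dd - |V_tp|)) and the output
    crosses 50% after ln(2) * R_on * C_load.
    """
    r_on = 1.0 / (beta_p * (v_dd - abs(v_tp)))
    return math.log(2) * r_on * c_load

# Hypothetical example: 10 fF load, beta_p = 100 uA/V^2, Vdd = 1.2 V, Vtp = -0.4 V
t_plh = inverter_t_plh(10e-15, 1e-4, 1.2, -0.4)
```

The estimate scales linearly with the load capacitance and inversely with β_p, which matches the proportionality given above since β_p grows with W_p / L_p.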
CMOS Power Sources
Dynamic
• capacitive
• short-circuit
• result of circuit function
• signal edge dependent
Static
• sub-threshold,
• leakage (reverse diode),
• gate currents
• parasitic effects
• signal level dependent
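Dynamic Capacitive Power
• Capacitive Power Dissipation:
$P_{Cap} = \alpha_{01}\, f\, C\, V_{dd}^2$
  with α01: switching activity (share of clock cycles with a 0→1 transition at the output), f: clock frequency, C: load capacitance
[Figure: inverter model with Ron,p and Ron,n charging and discharging C between Vdd and gnd at the rate α01·f]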
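As a small worked example of the formula above, the following sketch evaluates the capacitive power for hypothetical values (the numbers are chosen for illustration only).

```python
def dynamic_capacitive_power(alpha_01, f_clk, c_load, v_dd):
    """P_Cap = alpha_01 * f * C * Vdd^2 in watts."""
    return alpha_01 * f_clk * c_load * v_dd ** 2

# Hypothetical chip: 10% switching activity, 1 GHz, 1 nF switched capacitance, 1.0 V
p_nominal = dynamic_capacitive_power(0.1, 1e9, 1e-9, 1.0)
# Same chip at half the supply voltage
p_half_vdd = dynamic_capacitive_power(0.1, 1e9, 1e-9, 0.5)
```

Because Vdd enters quadratically, halving the supply voltage reduces the capacitive power to one quarter.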
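Short Circuit Power
[Figure: input voltage VA rising from 0 to Vdd and falling again with rise time τr and fall time τf; between t1 (VA = Vtn) and t2 (VA = Vdd + Vtp) both transistors conduct and a short-circuit current i flows from Vdd to gnd]
• Short Circuit Power Dissipation:
$P_{Short} = \alpha_{01}\, f\, \beta_n\, \tau\, (V_{dd} - 2 V_{tn})^3$
(assumptions: $\beta_n = \beta_p$, $V_{tn} = |V_{tp}|$, $\tau_r = \tau_f = \tau$)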
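Sub-threshold Currents
Condition: 0 < VGSn < Vtn or 0 < |VGSp| < |Vtp|
Reasons:
• Non-ideal output levels
• Noise
$I_D = I_0\, e^{(V_{GS} - V_t) / (n V_{Temp})} \left(1 - e^{-V_{DS} / V_{Temp}}\right); \quad V_{GS} < V_t$
I0: ID for VGS = Vt
VTemp: temperature voltage
n: process constant (1 … 2.5)
[Figure: cross-section with sub-threshold currents Isub in nMOS and pMOS; ID over VGS on a logarithmic scale, reaching I0 at VGS = Vt]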
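The exponential dependence on VGS can be made tangible with a short sketch. The default `v_temp` of 25.9 mV is the temperature voltage kT/q at about 300 K; `n = 1.5` and the example values are assumptions for illustration.

```python
import math

def subthreshold_current(v_gs, v_ds, v_t, i_0, n=1.5, v_temp=0.0259):
    """Sub-threshold drain current for V_GS < V_t (same unit as i_0)."""
    if v_gs >= v_t:
        raise ValueError("model only valid below the threshold voltage")
    return (i_0 * math.exp((v_gs - v_t) / (n * v_temp))
            * (1.0 - math.exp(-v_ds / v_temp)))

# Gate voltage step that reduces the current by a factor of 10
decade_step = 1.5 * 0.0259 * math.log(10)
i_a = subthreshold_current(0.30, 1.0, v_t=0.4, i_0=1e-6)
i_b = subthreshold_current(0.30 - decade_step, 1.0, v_t=0.4, i_0=1e-6)
```

With these assumed values the current drops by one decade for roughly every 89 mV of gate voltage, which illustrates why a lower threshold voltage raises static power so strongly.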
Diode Leakage / Gate Current
[Figure: cross-sections of the inverter for A high and A low with gate currents Igate and reverse currents into the substrate]
p-n junctions are diodes:
• Reverse current into substrate contributes to static power consumption
Gate oxide not perfect isolator:
• Ohm's law (R < inf)
• Ionic conduction (trapped ions in oxide)
• tox ~ 5 – 2 nm today
$I_{gate} \sim \exp(t_{ox}^{-1})$
Static Power
 In the past: negligible (typ. < 1% of total P)

 Today and in the future: major concern, because
 low threshold voltages for high speed applications
 noise “does not scale”
 high temperatures
 low frequency and long standby times
 Gate and diode currents cause higher failure rate
 Gate currents limit tox scaling
Important Process Parameters
• Vt: threshold voltage
• Vdd: supply voltage
Total power = min!
[Figure: dynamic power, static power and circuit delay as functions of Vt and Vdd]
Vdd and Vt such that Vdd - Vt = const
References
[1] IBM press release, „New 64-bit PowerPC microprocessor“, Oct. 14, 2002,
http://www-3.ibm.com/chips/news/2002/1014_powerpc.html
[2] IBM Microelectronics Photo Catalog,
http://www-3.ibm.com/chips/photolibrary/photo10.nsf/home?ReadForm
[3] J. Rabaey, „Digital Integrated Circuits – A Design Perspective“, Prentice Hall,
second edition, 2003
[4] M. Irwin, „VLSI Digital Circuits“, Penn State, 2002,
http://mdlwiki.cse.psu.edu/twiki/bin/view/MDL/MJI477
SoC Logic Design Recap

Outline
• Combinatorial Logic
 Transistor synthesis for combinatorial logic
• Sequential Logic
 Registers, latches, flip-flops
 Finite state machine design
Combinatorial Logic
Simple Static CMOS Logic Gates
[Figure: transistor-level schematics of the two-input NAND and NOR gates between VDD and gnd with inputs A, B and output Z]
A  B  | NAND  NOR
0  0  |  1     1
0  1  |  1     0
1  0  |  1     0
1  1  |  0     0
DeMorgan Rule
All logic functions can be expressed with NAND or NOR
basic gate          NOT      AND        OR        NAND        NOR
output expression   A'       AB         A+B       (AB)'       (A+B)'
NAND represent.     (AA)'    (AB)''     (A'B')'   (AB)'       (A'B')''
NOR represent.      (A+A)'   (A'+B')'   (A+B)''   (A'+B')''   (A+B)'
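The table can be checked exhaustively with a short script. The sketch below builds every basic gate once from NAND only and once from NOR only, following the expressions in the table, and compares them with the reference functions for all input combinations. All names are chosen for this illustration.

```python
from itertools import product

def nand(a, b):
    return not (a and b)

def nor(a, b):
    return not (a or b)

def inv_nand(a):          # NOT built from NAND: (AA)'
    return nand(a, a)

def inv_nor(a):           # NOT built from NOR: (A+A)'
    return nor(a, a)

REFERENCE = {
    "NOT": lambda a, b: not a,
    "AND": lambda a, b: a and b,
    "OR": lambda a, b: a or b,
    "NAND": lambda a, b: not (a and b),
    "NOR": lambda a, b: not (a or b),
}

NAND_ONLY = {
    "NOT": lambda a, b: inv_nand(a),
    "AND": lambda a, b: inv_nand(nand(a, b)),                        # (AB)''
    "OR": lambda a, b: nand(inv_nand(a), inv_nand(b)),               # (A'B')'
    "NAND": lambda a, b: nand(a, b),                                 # (AB)'
    "NOR": lambda a, b: inv_nand(nand(inv_nand(a), inv_nand(b))),    # (A'B')''
}

NOR_ONLY = {
    "NOT": lambda a, b: inv_nor(a),
    "AND": lambda a, b: nor(inv_nor(a), inv_nor(b)),                 # (A'+B')'
    "OR": lambda a, b: inv_nor(nor(a, b)),                           # (A+B)''
    "NAND": lambda a, b: inv_nor(nor(inv_nor(a), inv_nor(b))),       # (A'+B')''
    "NOR": lambda a, b: nor(a, b),                                   # (A+B)'
}

def table_is_consistent():
    """True if every NAND-only and NOR-only form matches its reference gate."""
    inputs = list(product([False, True], repeat=2))
    return all(
        impl[name](a, b) == REFERENCE[name](a, b)
        for impl in (NAND_ONLY, NOR_ONLY)
        for name in REFERENCE
        for a, b in inputs
    )
```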
Generic Model of Static CMOS
VDD
A1
...
An
Z
C
gnd
NAND NOR
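Systematic Static CMOS Logic Design
Example function: Z = AB + C
• A static CMOS gate always inverts, so realize Z' = (AB + C)' first and add an extra inverter at the output to obtain Z = AB + C
• Start at the output and find a way through n-MOSFETs to ground:
  - serial for AND (AB)
  - parallel for OR (C)
• Draw the way to Vdd with p-MOSFETs by using the dual of the n-network (serial becomes parallel and vice versa)
If everything is done right, there will never be a conducting path between Vdd and gnd.
[Figure: resulting gate with p-MOSFETs A, B, C between VDD and output and n-MOSFETs A, B, C between output and gnd]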
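The claim that a correctly derived gate never connects Vdd and gnd can be checked for the example function by modeling both networks as Boolean conduction conditions. This is a sketch with hypothetical function names; transistors are treated as ideal switches.

```python
from itertools import product

def pull_down_conducts(a, b, c):
    """n-network: A and B in series, in parallel with C (nMOS conducts on 1)."""
    return (a and b) or c

def pull_up_conducts(a, b, c):
    """Dual p-network: A and B in parallel, in series with C (pMOS conducts on 0)."""
    return ((not a) or (not b)) and (not c)

def networks_are_complementary():
    """True if exactly one of the two networks conducts for every input."""
    return all(
        pull_down_conducts(a, b, c) != pull_up_conducts(a, b, c)
        for a, b, c in product([False, True], repeat=3)
    )

def z_after_inverter(a, b, c):
    """Gate output is high when the pull-up conducts; the extra inverter restores Z."""
    gate_output = pull_up_conducts(a, b, c)
    return not gate_output
```

`networks_are_complementary()` confirms that the output is always driven and that there is no static path between the supply rails; `z_after_inverter` reproduces Z = AB + C.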
Register Transfer Blocks: Arithmetic
Numerous applications are based on mathematical operations
• Digital Filters
• Processor Arithmetic Logic Units
implemented as
• combinatorial logic circuitry
• multi-bit register / storage elements
[Figure: registers A, B and C feed a multiplier and an adder whose result is stored in register Y: Y = A x B + C]
Sequential Logic
Basic Register / Storage Element
• In CMOS we can store a 0 or a 1 in a loop of two inverters:
  - One inverter drives the input of the other
  - Only outputs so far (how boring!)
  - Impossible to externally drive a node without short-circuit
• We need to open the loop to set a specific logic value
[Figure: loop of two inverters holding Q and its complement between Vdd and gnd; a switch controlled by x opens the loop (cases x = 1 and x = 0)]
CMOS Latch
[Figure: latch circuit with data input D, enable input e and output Q; alternative realization with set (S) and reset (R) inputs; block symbol with D, e and Q]
• Enable signal e
• Level-controlled <=> Latch
CMOS Flip-flop
[Figure: flip-flop built from two latches (master and slave) controlled by the clock c (clk), with internal node q between master and slave; timing diagram of clk, D, q and Q]
• Clock edge-controlled (c) <=> flip-flop
• Most important sequential element
  - Used in (almost all) synchronous digital circuits
  - Used as register banks
  - Counter, shift registers
Flip-flop Timing
Flip-flop characteristics:
• setup-time: Data must be stable tsetup before clock edge
• hold-time: Data must be stable for thold after clock edge
• clock-to-output delay: Data will be visible at output tc2q after clock edge
[Figure: timing diagram of c, D and Q with tsetup, thold and tc2q measured at the 50% points]
Synchronous Sequential Logic
[Figure: registers connected through combinatorial logic with delays between Tlogic,min and Tlogic,max]
Timing Constraints:
• Setup constraint: Tclk > Tc2q + Tlogic,max + Tsetup
• Hold constraint: Tc2q + Tlogic,min > Thold
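Both constraints can be expressed as slack values that must be positive. The sketch below uses hypothetical delays in picoseconds; the function name and all numbers are chosen for illustration.

```python
def timing_slack(t_clk, t_c2q, t_logic_min, t_logic_max, t_setup, t_hold):
    """Return (setup_slack, hold_slack); both must be > 0 to meet timing."""
    setup_slack = t_clk - (t_c2q + t_logic_max + t_setup)
    hold_slack = (t_c2q + t_logic_min) - t_hold
    return setup_slack, hold_slack

# Hypothetical path, all values in ps
setup_slack, hold_slack = timing_slack(
    t_clk=6430, t_c2q=80, t_logic_min=120, t_logic_max=5000,
    t_setup=240, t_hold=100,
)
timing_met = setup_slack > 0 and hold_slack > 0
```

Note that the hold constraint does not depend on Tclk: a hold violation cannot be repaired by lowering the clock frequency, whereas a setup violation can.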
Metastability
Violation of either tsetup or thold resulting in undefined/oscillating output Q with probability pmeta (< 10^-9)
Relevance for SoC design:
• increasing number of chip clock domains
• externally imposed clocks
Precaution:
• Double-registered inter-domain signal interfaces
• FIFO buffers
[Figure: timing diagram of c, D and Q with tsetup, thold and tc2q measured at the 50% points]
Finite State Machines

(FSM)
Finite State Machines
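[Figure: FSM block diagram with input x, input logic f(x,u), state register (D1 … Dn, Q1 … Qn, clocked by clk), state u fed back to f and g, output logic g(x,u) and output y]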
 x: Primary input vector, y: primary output vector, u: state vector
 f(x,u): input function, g(x,u): output function
 Mealy-Machine: Most general case, as shown above
 Moore-Machine: No combinatorial logic through machine <=> g(x,u)==g(u)
 Medvedev-Machine: No output logic <=> y=g(x,u)=g(u)=u
 No input logic Machine: f(x,u)=(x,f2(u))
Finite State Machines(Contd.)
FSM2
f g
FSM1
f g
FSM3
f g
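FSM example: Counter
x = 0: S(t+1) = S(t)
x = 1: S(t+1) = S(t) + 1
[Figure: 4-bit counter with state bits S0 … S3; the input logic f(x,S) consists of xor gates in front of the state register and gates A0 … A2; input x, clock clk]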
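The next-state logic of such a counter can be sketched as follows. The code assumes that the gates A0 … A2 are AND gates forming a carry chain (x AND S0, then AND S1, and so on), which is the usual structure but is not stated explicitly on the slide.

```python
def counter_next_state(state_bits, x):
    """Next state of a binary counter.

    state_bits: list of 0/1 values [S0, S1, S2, S3], least significant bit first.
    x: count enable (1 = increment, 0 = hold).
    """
    carry = x                      # carry into bit 0 is the enable input
    next_bits = []
    for s in state_bits:
        next_bits.append(s ^ carry)    # xor gate in front of each register bit
        carry = carry & s              # assumed AND gate of the carry chain
    return next_bits

def to_int(bits):
    return sum(b << k for k, b in enumerate(bits))
```

Calling `counter_next_state` once per clock cycle with x = 1 steps through 0, 1, …, 15 and wraps around to 0; with x = 0 the state is held.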
Practical Relevance of FSMs
Synchronous System Design paradigm

 Essentially all control functions in state-of-art digital IC’s
consist of “communicating FSM’s”
 Avoid combinatorial logic through paths!
 Stick to one FSM design style across SoC!
x f(x,u) x g(x,u)
clk clk
Tclk > ΣTlogic + Tsetup + Tc2q
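How are FSMs designed today?
State diagram → VHDL → Synthesis
[State diagram: State_0 (Idle = 0) → State_1 (Idle = 1) if FIFO a_empty = 1; State_1 → State_0 if FIFO a_empty = 0 AND Addr = 26]

  Idle_FSM: PROCESS
    -- assumed declaration: State_Type is the enumeration (State_0, State_1)
    VARIABLE C_State, N_State : State_Type := State_0;
  BEGIN
    WAIT UNTIL (CLK = '1');
    C_State := N_State;                  -- state register update
    CASE C_State IS                      -- next-state logic
      WHEN State_0 =>
        IF (FIFO_a_empty = '1') THEN
          N_State := State_1;
        END IF;
      WHEN State_1 =>
        IF (FIFO_a_empty = '0' AND Addr = 26) THEN
          N_State := State_0;
        END IF;
    END CASE;
    CASE N_State IS                      -- output logic
      WHEN State_0 =>
        Idle <= '0';
      WHEN State_1 =>
        Idle <= '1';
    END CASE;
  END PROCESS;

[Figure: synthesis result with input x, logic f(x,u) and state register clocked by clk]
How many levels of logic can you afford?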
FSM Logic Depth
$T_{clk} > \sum T_{logic} + T_{setup} + T_{c2q}$
With N levels of logic between two registers:
$T_{clk} > N\,(t_{gate} + t_{wire}) + T_{setup} + T_{c2q}$
$N < \frac{T_{clk} - T_{setup} - T_{c2q}}{t_{gate} + t_{wire}}$
… in a given case:
• CMOS Databook: tgate = tc2q = 80 ps; tsetup = 240 ps; twire ≈ 50% tgate
• Data path width: w = 16 bit; S = 2.488 Gbps; Tclk = 1 / 155.5 MHz = 6.43 ns
• N < 50.9, i.e. Nmax = 50
Q's:
a) Is this plenty or marginal?
b) If synthesis reveals Nmax = 24, would you consider changing the data path to 8 bit?
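The numbers of this example can be reproduced with a few lines of Python. The function names are hypothetical; the delay values are the ones given on the slide, expressed in picoseconds.

```python
import math

def clock_period_ps(line_rate_bps, width_bits):
    """Clock period needed to sustain the line rate with the given data path width."""
    return 1e12 * width_bits / line_rate_bps

def logic_depth_bound(t_clk_ps, t_setup_ps, t_c2q_ps, t_gate_ps, t_wire_ps):
    """Upper bound on the number of logic levels N between two registers."""
    return (t_clk_ps - t_setup_ps - t_c2q_ps) / (t_gate_ps + t_wire_ps)

T_GATE = 80.0            # ps, also used as t_c2q
T_SETUP = 240.0          # ps
T_WIRE = 0.5 * T_GATE    # ps, 50% of the gate delay

bound_16 = logic_depth_bound(clock_period_ps(2.488e9, 16), T_SETUP, T_GATE, T_GATE, T_WIRE)
bound_8 = logic_depth_bound(clock_period_ps(2.488e9, 8), T_SETUP, T_GATE, T_GATE, T_WIRE)
n_max_16 = math.floor(bound_16)
n_max_8 = math.floor(bound_8)
```

For the 16 bit data path the bound is about 50.9 levels. Halving the width to 8 bit doubles the required clock frequency, and the same calculation then leaves room for about 24 levels.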
References
[1] R. J. Baker et al., CMOS circuit design, layout, and simulation, IEEE
Press, 1998. ISBN 0-7803-3416-7
[2] N. H. E. Weste et al., Principles of CMOS VLSI Design, Addison
Wesley, 1993. ISBN 0-201-53376-6
[3] SIA, International Technology Roadmap for Semiconductors,
http://public.itrs.net/
Picture credits: www.maxmon.com
SoC Paradigm

Outline
• Design productivity
 Platform-based SoC
 Virtual Platforms
• Computational density
 Flexibility vs. Performance
• Hardware Implementation
 Gate Array, Standard Cells, FPGA
 SoC design paradigm
 Single- vs. Multicore
Revisiting Moore's Law
[Figure: chip capacity (transistors per chip, 10^3 to 10^11) of microprocessors from the 4004 to the Pentium 4 and designer productivity (K trans. / PM, 0.1 to 10^6), plotted over the years 1968 – 2012]
• Microprocessor complexity: ~ 55 % CAGR
• Designer productivity: ~ 20 % CAGR
• How to develop and test such complex systems with affordable cost and time?
Improvements in Designer Productivity
• Layout, 1975: Polygons representing mask layout
• Transistor, 1980: Transistor circuitry
• Gate, 1985: Logic gates, Boolean algebra, Standard Cell designs
• Register Transfer Block, 1990: RTL design entry, Logic synthesis
• Behavior, 1995: Design entry with HW description languages, behavioral synthesis
    BEGIN
      WAIT UNTIL (CLK'EVENT AND CLK = '1');
      LCDltch <= tmp;
      tmp := LCD;
    END PROCESS;
Improvements in designer productivity are due to progress in EDA tools and design methods as well as raised levels of abstraction
Platform-based SoC Design
Conquer design complexity by reuse maximization:
Shorter development cycles, higher chances for (first time) fault-free design and competitive value differentiation
• Standard on-chip (bus) interconnect and interfaces
• Standard RISC CPU cores and SW development environments
• Reuse of existing function building blocks
• Differentiation through new, application specific system cores
[Figure: platform with processor core, ISA extensions, memory controller, SRAM, eDRAM, system cores, EMAC, PCI-X, UART and GPIO, connected by processor bus, bus bridge and peripheral bus; logos: AMBA, CoreConnect, Blue Logic, XILINX]
Virtual Platforms
Graphical Debugging tools SW development
environment
SW stacks
Custom HW development
Drivers Std. libs
OS OS OS Virtual Platform
Hypervisor
CPU1 CPU2 ACC SRAM SDRAM HW/SW Partitioning
Arbiter On-chip Interconnect
AMBA HW models INT

HW
HW
ACC
Buffer I/O
System Integration
 Processing cores  Executable SoC model
 Interconnect  Fast and accurate: TLM-abstraction
 HW accelerators  Simulation kernel, e.g.
... SystemC/SpecC
Trade-Off: Flexibility vs. Performance
CPU DSP
ASIP
Log F L E X I B I L I T Y
FPGA
Instruction Depth
ASIC
Flexibility vs.
Custom IC Performance/Power
dissipation dilemma
Log COMPUTATIONAL DENSITY = performance / area
103 . . . 104
Log Power Efficiency = performance / W
105 . . . 106
Source: A. DeHon [4]; A. Cuomo, T. Noll [5]
SoC Implementation Styles
• Standard Cell ASIC: parallel in space and time
  [Figure: inputs A, B, C, D; two adders and a subtractor produce y over the time steps Time 0, Time 1, Time 2]
• Pipelined RISC CPU: parallel in space, sequential in time
  IF ID OF EX M WB              t1 = A + B
     IF ID OF EX M WB           t2 = C + D
        IF ID OF EX M WB        Y = t1 - t2
Computational Density / Functional Diversity
Computational Density
• Computations performed per unit area and time
• Implementation technique descriptor
• [CD] = ops / N [Lmin² tiles]
• Approx. values:
  - CPU: 40 – 80
  - FPGA: 400
  - ASIC: 4'000
  - Custom: > 10'000
Functional Diversity
• Number of functions resident and rapidly accessible by a compute unit
• IS = Number of instructions stored on a general purpose compute device
• Application descriptor
• Technology differentiator
  - CPU: ~256 – 16 K
  - FPGA: 1
  - ASIC: 10^-3 - 10^-6
  - Custom IC: ~0
Source: DeHon, PhD Thesis [4]
Software vs. Hardware Implementation
• SW execution in a processor
  - Sequential in time
  Cycle  Instruction       Interpretation
  1      mul r5, r1, r2    r5 = r1 x r2
  2      mul r6, r3, r4    r6 = r3 x r4
  3      add r7, r5, r6    r7 = r5 + r6
• RTL-Level Hardware
  - Parallel in time
  [Figure: inputs x1 … x4 feed two multipliers and one adder producing y, with register stages at T0 … T3]
• HW/SW Partitioning
  - Optimization for performance, power or total cost
Further details in course "HW/SW Codesign"
Hardware Implementation Methods
Implementation methods for digital ICs
• Full-Custom
• Semi-Custom
  - Cell-based: Standard Cell, Macro Cell (SoC)
  - Array-based: Pre-diffused (Gate Arrays), Pre-wired (PLDs, example: Altera MAX-7000), Pre-wired (FPGAs, example: XILINX Virtex-II PRO)

Implementation                      Buy from the shelf   Circuit functionality
Standard-IC/Core, Processor/DSP     yes                  Programmable with standard SW tools
ASIP                                no                   Programmable with customized SW tools
FPGA                                yes                  (re-)configurable
ASIC                                no                   fixed
                    Full Custom               Std.-cell, Gate array     FPGA & PLD
Design style        Circuit optimization at   Design libraries and      Design libraries and
                    transistor level          logic synthesis           logic synthesis
Design sign off     Mask manufacturing and    Mask manufacturing and    Customer programmable
                    wafer production          wafer production
Function density    highest (100x)            high (50x)                average (1 normalized)
Typ. clock rate     1…3 GHz                   500 MHz                   100…200 MHz
Design time         Many months               months                    weeks, days
Typ. volume         Mio.                      10.000 - 100.000          1 – 10.000
One time costs      Very high (10 Mio.)       high (1 Mio.)             low (1 K)
Variable costs      lowest (1 normalized)     average (2)               high (>100)
(chip area)
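The interplay of one-time and variable costs determines from which volume on a standard-cell design pays off against an FPGA. The sketch below computes the break-even volume; treating the normalized variable costs of the table (2 and 100) as cost per chip in the same currency unit as the one-time costs is an assumption made purely for illustration, as are all names.

```python
def total_cost(one_time_cost, cost_per_chip, volume):
    """Total cost of producing `volume` chips."""
    return one_time_cost + cost_per_chip * volume

def break_even_volume(nre_high, unit_low, nre_low, unit_high):
    """Volume at which the option with high one-time cost and low unit cost
    becomes as cheap as the option with low one-time cost and high unit cost."""
    if unit_high <= unit_low:
        raise ValueError("second option must have the higher cost per chip")
    return (nre_high - nre_low) / (unit_high - unit_low)

# Assumed figures: standard cell (1 Mio. one-time, 2 per chip) vs. FPGA (1 K one-time, 100 per chip)
volume = break_even_volume(1_000_000, 2, 1_000, 100)
```

Under these assumptions the break-even point lies at roughly ten thousand chips, which is in line with the typical volume ranges listed in the table.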
Gate Array
• Transistors / logic gates to realize combinatorial and sequential circuit designs are already pre-placed on chip
• I/O-macros to drive external circuitry integrated as well
• Functionality is determined by means of customer / application specific wiring
  - Chips are pre-manufactured up until wiring
  - High volume, low cost possible
[Figure: gate array with pads; uncommitted cell and committed cell (4-input NOR with In 1 … In 4, Out, VDD, GND) built from polysilicon, metal and contacts]
Standard Cell
• Library of pre-developed (full custom) logic

gates and function macros (variable width,
fixed height)
• Pre-defined placement rows and routing
channels
 Past: only between placement rows (see
figures)
 Today: 3-dimensional multi-layer wiring
across entire chip area
• State-of-the-art design flow
 Synthesis of VHDL behavior description into
technology specific design library elements
 Embedding of complex function macros
(memory, CPU, std. i/o macros, etc.)
ASIC Chip
Random Logic
(from library)
Example ASIC:
• 65 nm
• 1 GHz
• ~ 100 Mgates
• ~ 100 MB sRAM
• ARM/MIPS cores
Memory • >1000 I/Os
Subsystem • 64 x 2.5 Gb/s I/O
• 1 x 10 Gb/s I/O
• DDR I/F
[LSI Logic]
Full Custom Design
• Unconstrained placement of transistors,

gates and function macros
• Individual (manual) optimization of circuit
parameters to minimize area and power
Intel 4004 Processor
Programmable Logic: Overview
• Fuse / Anti-Fuse Technique:

 Fuses connect or disconnect links between two wiring layers
 One-time programmable  „Redesigns" are expensive!
 High integration density
 Robust against radiation  space applications, etc.
• RAM based:
 Bits in look-up tables (LUTs) realize logic function
 Bits in registers control switches (transistors) which connect /
disconnect wire links
 Re-programmable, partially even online during operation (run-time
reconfiguration)
 Medium integration density
 Sensitive to radiation
PLDs – Programmable Logic Devices

• Two types of PLDs:
 PAL (Programmable Array Logic): AND matrix programmable, OR matrix fixed
 PLA (Programmable Logic Array): AND and OR matrix programmable (past technology)
I5 I4 I3 I2 I1 I0 Programmable
OR array I5 I4 I3 I2 I1 I0 Fixed OR array
O0  I 0 I1  I 2
O1  I 0 I1 I 2  I 2  I 0 I1
PAL
Programmable AND array
Programmable AND array
O 3O 2O 1O 0 O 3O 2O 1O 0
CPLD: multiple PAL blocks with programmable interconnect, e.g. Altera Max 7000
FPGA: Structure and Properties
Routing
Channel
I/O Pad
Configurable
Logic Block
FPGA: Basic Structure and Properties
• Functionality of CLBs and inter-

connect wiring determined by
program data
• Matrix of logic blocks (CLBs)

and wiring resources already
placed and routed on IC
during manufacturing
• Programming:
 SRAM (Data in SRAM determine logic function and control interconnect)
 Anti fuse (Melting individual fuses through controlled peak currents)
FPGA: Realization Principles
Boolean function: $y = x_1 x_2 x_3 + x_2 \overline{x_3}$
x1 x2 x3 | y
0  0  0  | 0
0  0  1  | 0
0  1  0  | 1
0  1  1  | 0
1  0  0  | 0
1  0  1  | 0
1  1  0  | 1
1  1  1  | 1
• ASIC: gates, td = 1 nsec
• FPGA: Look-Up Table (LUT), td = 5 nsec
  - x1 x2 x3 form the address of a 2x4 SRAM
  - SRAM content: 0 0 1 0 / 0 0 1 1
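The equivalence of the two realizations can be illustrated in Python: the LUT is just a memory whose address is formed by the inputs, while the gate version evaluates the Boolean expression directly. Names are hypothetical; the memory content is the truth table from the slide.

```python
# Content of the look-up table, addresses 0 … 7 (x1 is the most significant address bit)
LUT_CONTENT = [0, 0, 1, 0, 0, 0, 1, 1]

def lut3(x1, x2, x3, content=LUT_CONTENT):
    """FPGA style: the inputs address a small memory holding the truth table."""
    address = (x1 << 2) | (x2 << 1) | x3
    return content[address]

def gate_logic(x1, x2, x3):
    """ASIC style: y = x1 x2 x3 + x2 (NOT x3) built from gates."""
    return (x1 & x2 & x3) | (x2 & (1 - x3))
```

Changing the function realized by the LUT only requires different memory content, whereas the gate version would need a new circuit.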
FPGA: Realization Principles
Boolean function: y=x1x2

ASIC: AND gate Short Line Long Line
x1
y
CLB
x2
CLB
Logic table:
x1 x2 y
0 0 0
0 1 0
1 0 0
1 1 1
Address Content Programmable

Switch Matrix
 Realization in 4x1 SRAM (LUT)
Gate contacts of switch transistors are connected with
configuration memory cells which determine FPGA routing
FPGA: Internal Structure (Xilinx Virtex-II Pro)
Virtex-II Pro Slice:

• 2 4-input LUTs/RAM
• Carry Logic
• 2 2to1 multiplexer
• 2 Register (D-FF)
Virtex-II Pro CLB: Virtex-II Pro RAM/MULT:

• 4 Slices • 18x18 HW multiplier
• 2 Tri-State Buffers • 18 kbit Dual-Port SRAM:
• Carry-Chain • 16k x 1bit
•…
• 512 x 36bit
FPGA: Xilinx UltraScale+ Architecture
• 16 nm FinFET technology
• Stacked Silicon Interconnect (SSI)
  - Multiple dies connected via Si interposer
  - Die = Super Logic Region (SLR)
  [Figure: die stack with microbumps, passive silicon interposer, through-silicon vias (TSV), C4 bumps and BGA balls; source Xilinx (picture shows Virtex7)]
• Column-based architecture of programmable logic consisting of
  - CLBs for logic and distributed memory,
  - DSPs for multiply and accumulate (MAC)
  - BlockRAM, UltraRAM,
  - IO, high-speed transceivers
  - Clock management (adjacent to IO and memory)
• 3 types differing in complexity and composition
  - Virtex: high-end
  - Kintex: mid-range
  - Zynq: includes standard processors
FPGA: Xilinx UltraScale+ Programmable Logic
CLB = Slice
• 1 Slice per CLB, 2 slice types: SLICEL and SLICEM
• SLICEL
  - 8 LUTs with 6 inputs (each usable as two 3- or 5-input LUTs)
  - 16 FlipFlops
  - arithmetic carry logic (carry chain)
  - multiplexer
• SLICEM
  - LUTs can be used as 64 bit RAM, 1x32 or 2x16 bit shift register
• Memory options
  - Distributed memory via SLICEM
  - Block RAM: 36 kbit blocks, dual port, (2x18 kbit, sync/async FIFO, configurable width)
  - Ultra RAM: 288 kbit blocks, dual port, sync
  - DDR4, DDR3, QDRII+, RLDRAM3 memory interfaces

Distributed RAM (bits to kbits): wide, shallow FIFOs; shift registers; state machines
Block RAM (10s of Mbits): data/coefficient storage; deep FIFOs; shallow buffering
UltraRAM (100s of Mbits): deep packet buffering; video buffering; state, statistics, counters
External Memory (100s of Mbits to Gbits): larger data storage
Source Xilinx
• DSP slice
  - Pre-adder
  - 27×18 bit multiplier
  - 48-bit accumulator (incl. XOR)
• Select IO
  - high-performance / high-density
  - with different voltages
  - single ended / differential
• High-speed serial transceivers
  - GTH (16.3 Gbit/s)
  - GTY (32.75 Gbit/s)

                         Kintex         Virtex          Zynq
# CLB (k)                18.8 – 82.9    49.3 – 159.8    5.9 – 65.3
Block RAM (Mbit)         12.7 – 34.6    25.3 – 94.5     4.5 – 34.6
Ultra RAM (Mbit)         0 – 36         90 – 360        0 – 36
# DSP slices (k)         1.3 – 3.5      2.3 – 12.3      0.2 – 3.5
DSP Perf. (GMAC/s *)     6.3            21.9            6.3
I/O pins                 280 – 668      416 – 832       82 – 668
Transceivers             16 – 76        40 – 128        0 – 72
Hard processor macros    no             no              yes
* 10^9 multiply-accumulate ops per s
Source Xilinx
FPGA: Xilinx UltraScale+ Zynq MPSoC
Heterogeneous Processing System
• Application Processing Unit (APU): 64 bit ARM Cortex-A53 (up to 1.5 GHz)
• Real-Time Processing Unit (RPU): 32 bit ARM Cortex-R5 (up to 600 MHz, safety features incl. ECC, lock-step mode, detection of faults in core)
• Graphics Processing Unit (GPU): ARM Mali-400 MP2 (667 MHz)
• Video En/Decoder (VCU) for H.264/H.265
[Figure: Zynq CG block diagram]

Zynq Device Types
        CG    EG    EV
APU     2     4     4
RPU     2     2     2
GPU     no    yes   yes
VCU     no    no    yes
Source Xilinx
Processor Implementation in SoC (1)
„Real“ „Virtual“
Component Component
System on Board System on Silicon
Processor Implementation in SoC (2)
Soft VC Firm VC Hard VC
Architectural Speed/Area
extensions VHDL optimized
Soft VC CPU in FPGA
Example: XILINX MicroBlaze CPU
MicroBlaze:
• 32 bit RISC
• 200 MHz
• 166 DMIPS
Extensions:
• I-Cache
• D-Cache
• HW Multiplier
For comparison: Hard VC PowerPC 405:
• 32 bit RISC
• 400 MHz
• 600 DMIPS
[Figure: MicroBlaze core with debug logic and local SRAM, connected to RS-232, GPIO (LEDs), GPIO (buttons), user logic (OPB master) and SDRAM controller]
The Need for SoC Design Paradigm
Chip capacities of multi 10 to 100 M gates enable new dimensions of function integration
Challenge
• How to cope with this complexity and develop operational systems within…
  - reasonable time (time-to-market)
  - costs (engineering and manufacturing)
Next level of abstraction:
Functional IP macros / cores plus a standard on-chip interconnect structure (NoC)
[Figure: SoC composed of µ-processors, eRAM, application-specific logic, network i/o and DRAM ctrl., connected by a bus]
SoC: Another way to look at it
What in the past was on a board, today fits on a chip
[Figure: board (~ 10 cm) with separate MCU, DSP, ASIC, ROM, SRAM, analog and interface (i/f) components next to a TI cellular phone baseband SoC (~ 0.8 cm) that integrates MCU, DSP, ASIC, ROM, SRAM and analog blocks. Source: Berkeley BWRC [1]]
53
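Single- vs. Multicore
$$T_{App} = \frac{inst}{App} \cdot \frac{clk}{inst} \cdot \frac{s}{clk} = \sum_{i=1}^{n} inst_{App'_i} \cdot CPI \cdot \frac{1}{f}$$
Single-core: one core executes App
$$P_{dyn} \sim f \cdot V_{dd}^2$$
Multi-core: n cores execute App'1, App'2, App'3, ..., App'n
If App can be perfectly parallelized on n cores and Tapp = const:
$$P_{dyn} \sim \frac{f}{n} \cdot \left(\frac{V_{dd}}{n}\right)^2 \cdot n \sim \frac{f \cdot V_{dd}^2}{n^2}$$
* neglecting influence of Vth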
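Single-core: App runs in Texe. Multi-core: the n parts App' run in parallel in Texe.
Assumption: App can be perfectly parallelized on n cores
Case 1: Single- and Multi-core have same performance = App execution time Texe
Single-core: $P_{dyn} \sim f \cdot V_{dd}^2$
Multi-core: $f_{MC} = \frac{f}{n}$, $V_{dd,MC} = \frac{V_{dd}}{n}$
$$P_{dyn} \sim \frac{f}{n} \cdot \left(\frac{V_{dd}}{n}\right)^2 \cdot n \sim \frac{f \cdot V_{dd}^2}{n^2}$$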
54
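Single-core: App runs in Texe. Multi-core: the n parts App' run in parallel in Texe / k.
Assumption: App can be perfectly parallelized on n cores
Case 2: Multi-core shall have performance increase by factor k
Single-core: $P_{dyn} \sim f \cdot V_{dd}^2$
Multi-core: $f_{MC} = \frac{k \cdot f}{n}$, $V_{dd,MC} = \frac{k \cdot V_{dd}}{n}$
$$P_{dyn} \sim \frac{k \cdot f}{n} \cdot \left(\frac{k \cdot V_{dd}}{n}\right)^2 \cdot n \sim \frac{k^3 \cdot f \cdot V_{dd}^2}{n^2}$$
Further details in course „Chip Multicore Processors“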
References
[1] Design Technology for Low Power Radio Systems, Reth Davis, BWRC, Berkeley,
http://bwrc.eecs.berkeley.edu
[2] DSP multi chip module, esa,
http://www.estec.esa.nl/tech/spacewire/products/#modules
[3] Chip-On-Chip, Valtronic SA, http://www.valtronic.ch
[4] Reconfigurable Architectures for General Purpose Computing, Andre DeHon, PhD
Thesis, MIT, 1996
[5] A. Cuomo, Semiconductor Challenges, DATE03 Keynote, March 03,
http://www.date-conference.com/conference/2003/keynotes/andrea/andrea.pdf
55
Processor Architecture

Outline
• Classification of processors
• Instruction set architecture
• Internal processor architecture
 Pipelining and hazards
 Branch prediction
 Superscalar/VLIW architecture
 Instruction and data caches
 Multi-threading
56
Motivation
Processor-based digital systems
 - Computers with fully programmable, general-purpose processors (laptops, PCs, workstations, clusters)
   - Primary purpose / function is data processing (incl. Web servers, bank servers)
   - Hardware & software evolve rather independently
 - However, most processors are deployed in „embedded systems“
   - Game consoles, smart phones, printers, household appliances, …
   - Cars, industry robots, …
What is “Machine Structure”?
Levels of abstraction, from top to bottom:
 - Software: Applications; Operating System and Compiler; Assembler
 - Instruction Set Architecture
 - Hardware: Processor, Memory, I/O system; Datapath & Control; Digital Design; Circuit Design; Transistors
The primary focus of this module is marked in the figure.
Coordination of many levels of abstraction
57
Processor Classification
Category | Type | Application | Characteristic | Remark
Instruction complexity | RISC | Embedded control | Load/store instructions for memory access | MIPS, ARM, PowerPC
Instruction complexity | CISC | Personal Computer / Servers | Complex, variable-length instructions | Intel x86-based
Instruction-level parallelism (ILP) | Superscalar | Personal Computer / Embedded | Instruction parallelism at run-time | Intel, ARM, PowerPC
Instruction-level parallelism (ILP) | VLIW | Image Processing | Instruction parallelism at compile-time | Parallel video pixel processing
Application-specific area | ASIP | Embedded | Application-specific instructions | Tensilica
Application-specific area | DSP | Signal Processing | HW multiply for digital filters | TI
Levels of SW Code Representation
Software
 - SW model (e.g. Matlab): processor/ISA independent
   ↓ Code generator
 - High-level language (e.g. C/C++): processor/ISA independent
       int a = 10;
       while(a < 100)
           a += b;
       if (a > b && c < 0)
           c++;
   ↓ Compiler
 - Low-level language (Assembly): ISA dependent, processor independent
       lw r2, 16(r30)
       lw r3, 20(r30)
       addu r2, r2, r3
       sw r2, 24(r30)
   ↓ Assembler
 - Machine code
       1010 1111 0101 1000
       0000 1001 1100 0110
       0101 1000 0000 1001
       1100 0110 1010 1111
Hardware
 - Control Signal Specification: processor/ISA dependent
58
Instruction Set Architecture (ISA)
Defines interface between SW & HW
 - Visible hardware state (registers & memory)
 - A set of instructions that operate on that state
Given an ISA
 - The hardware implements it
 - The software uses it
 - Old SW can use new HW and vice versa
Keep in mind
 - Difference: ISA vs. HW implementation
 - X86: Intel and AMD
 - X86: Intel 80x86 and Intel Core i7
[Figure: layers Software & OS / Instruction Set / Hardware]
ISA Example: MIPS
Instructions
 - Arithmetic: add, sub, li, lui...
 - Logical: and, nor, or, not, xor...
 - Load/store: lb, lw, sb, sw...
 - Multiply/divide: div, mult, multu...
 - Jumps/branches: b, beq, bne, j, jal, jr...
 - ...
Registers
r0 | zero
r1 | temp.
r2-r3 | returns
r4-r7 | args
r8-r15 | temp.
r16-r23 | temp. saved
r24-r25 | temp.
r26-r27 | OS
r28 | global ptr.
r29 | stack ptr.
r30 | frame ptr.
r31 | return addr.
Memory Address Space (upper bound of segment | segment)
0xffff ffff | kseg2: mapped/cached
0xdfff ffff | ksseg: mapped/cached
0xbfff ffff | kseg1: unmapped/uncached
0x9fff ffff | kseg0: unmapped/cached
0x7fff ffff | kuseg: user space, mapped/cached
0x0000 0000 | lowest address
59
Look Inside
[Figure: processor block diagram. Data cache and instr. cache connect to the system bus via data i/o and address i/o; inside are ALU, register block, status register, accumulator, program counter and control]
Processor Microarchitecture
[Figure: the same block diagram with data i/o and instruction i/o, ALU, register block, status register and control]
60
Processor Microarchitecture
[Figure: processor block diagram (system bus, data cache, data i/o, ALU, register block, status, control, instruction i/o, instr. cache) annotated with the instruction phases]
 - Instruction fetch (IF)
 - Instruction decode (ID)
 - Execution (EX)
 - Memory access (M)
 - Write back (WB)
Program Execution
 - Sequential execution of instructions
[Figure: processor between instruction memory and data memory]
add r3, r2, r1   IF ID EX M  WB
lw r1, 0(r0)                    IF ID EX M  WB
sw r3, 4(r0)                                   IF ID EX M  WB
CPI = 5 (cycles per instruction)
Efficiency improvement: instruction-level parallelism (ILP)
61
ILP: Pipelining
Clock signal
add r3, r2, r1   IF ID EX M  WB
lw r1, 0(r0)        IF ID EX M  WB
sw r3, 4(r0)           IF ID EX M  WB
…                         IF ID EX M  WB
…                            IF ID EX M  WB
…                               IF ID EX M  WB
Execution stages can overlap
… multiple instructions execute faster: CPI → 1
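CPU Pipeline
Single-scalar = 1 ALU, CPImin = 1.0
[Figure: pipeline control driving the stages IF, ID, EX, M, WB; clocked buffers (D flip-flops, D → Q) separate the stages, each stage has a combinational delay ΣTlogic]
$$T_{clk} \geq T_{c2q} + \max\left(\sum T_{logic}\right) + T_{stp}$$
$$f_{max} = \frac{1}{T_{clk}}$$
instr. rate [MIPS] = f [MHz] / CPI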
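As a worked example of the two relations above, the following sketch derives the maximum clock frequency from the slowest pipeline stage and converts it to an instruction rate. All delay values are invented for illustration; they are not taken from any data sheet.

```python
def max_clock_mhz(stage_logic_ns, t_c2q_ns, t_setup_ns):
    """f_max in MHz from T_clk >= T_c2q + max(sum T_logic) + T_stp (times in ns)."""
    t_clk_ns = t_c2q_ns + max(stage_logic_ns) + t_setup_ns
    return 1000.0 / t_clk_ns


def instruction_rate_mips(f_mhz, cpi):
    """Instruction rate in MIPS = f [MHz] / CPI."""
    return f_mhz / cpi


# Hypothetical logic delays of the IF, ID, EX, M and WB stages in ns
stage_delays = [1.8, 2.2, 3.5, 3.0, 1.5]
f_max = max_clock_mhz(stage_delays, t_c2q_ns=0.25, t_setup_ns=0.25)
print(f_max, instruction_rate_mips(f_max, cpi=1.0))
```

The slowest stage (here 3.5 ns) sets the clock period for all stages, which is why balanced stage delays matter for deep pipelines.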
62
ILP: Pipelining
• Prerequisite for effective pipelining

 Regularity in sequence of individual
instruction phases
 Few, regular instruction set
 Simple, few addressing modes
• Deep pipelining
 Ease processor speed scaling
 Increase vulnerability for pipeline problems
 Structural hazards
 Data hazards
 Control hazards
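Structural Hazards
Pipelined execution is hindered due to resource conflicts
load/store instruction    IF ID EX M  WB
arithmetic instruction       IF ID EX M  WB
arithmetic instruction          IF ID EX M  WB
arithmetic instruction             stall IF ID EX M  WB
The fourth instruction stalls if only one memory port is available (its IF coincides with M of the load/store instruction).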
63
Data Hazards
Data dependencies among instructions cause data hazards
add r3,r2,r1 IF ID EX M WB
sub r7,r3,r1 IF ID EX M WB
and r6,r3,r2 IF ID EX M WB
Stalling is required
add r3,r2,r1 IF ID EX M WB
sub r7,r3,r1 IF stall ID EX M WB
and r6,r3,r2 IF ID EX M WB
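The dependency check behind this slide can be sketched as follows. The instruction encoding (mnemonic, destination register, source registers) and the `window` parameter are assumptions made for illustration: `window=2` models a pipeline without forwarding, in which the two instructions directly behind a producer read a stale register value.

```python
def raw_hazards(program, window=2):
    """Return (producer, consumer, register) triples for read-after-write hazards.

    `program` is a list of (mnemonic, dest_reg, source_regs) tuples.
    `window` is the assumed number of following instructions that still
    read the old register value.
    """
    hazards = []
    for i, (_, dest, _) in enumerate(program):
        for j in range(i + 1, min(i + 1 + window, len(program))):
            if dest in program[j][2]:
                hazards.append((i, j, dest))
    return hazards


program = [
    ("add", "r3", ("r2", "r1")),
    ("sub", "r7", ("r3", "r1")),
    ("and", "r6", ("r3", "r2")),
]
print(raw_hazards(program))
```

For the slide's example both `sub` and `and` depend on the `r3` written by `add`, so both are reported.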
Control Hazards
• Deviation from sequential execution
inst.addr   mnemonics
0x400258:   lw r2, 24(r30)
0x400260:   slti r3, r2, 15
0x400268:   bne r3, r0, 400280
0x400270:   addiu r2, r0, 6
0x400278:   sw r2, 20(r30)
0x400280:   addiu r2, r0, 1
0x400288:   j 400290
Which instruction follows bne r3, r0, 400280 (?) is unknown until the branch is resolved:
bne r3, r0, 400280   IF ID EX M  WB
                        stall IF ID EX M  WB
                                 IF ID EX M  WB
• Branches are frequent
 → total performance loss is greater than in case of data hazards
 → employment of branch prediction
64
Branch prediction
• 1-bit prediction
0x400258: lw r2, 24(r30)
0x400260: slti r3, r2, 15
0x400268: bne r3, r0, 400280
[Figure: branch history table with 2^x entries of 1 bit each, indexed by x bits of the branch address; 1 = taken, 0 = not taken]
Problem
 - For loops we always predict incorrectly twice
   - in the first loop iteration
   - in the last loop iteration
Branch prediction
• 2-bit prediction
0x400258: lw r2, 24(r30)
0x400260: slti r3, r2, 15
0x400268: bne r3, r0, 400280
[Figure: branch history table with 2^x entries of 2 bits each (local history), indexed by x bits of the branch address. State diagram with four states: 11 and 10 predict Taken, 01 and 00 predict nTaken; transitions are labelled T (branch taken) and nT (branch not taken)]
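The difference between 1-bit and 2-bit prediction for loops can be reproduced with a short simulation. This is a minimal sketch that assumes an n-bit saturating counter initialized to "not taken"; the exact state transitions on the slide may differ, and the function name is illustrative.

```python
def count_mispredictions(outcomes, bits):
    """Simulate one branch history table entry as an n-bit saturating counter."""
    max_state = (1 << bits) - 1
    threshold = 1 << (bits - 1)      # states >= threshold predict "taken"
    state = 0                        # assumed initial state: strongly not taken
    mispredictions = 0
    for taken in outcomes:
        if (state >= threshold) != taken:
            mispredictions += 1
        state = min(state + 1, max_state) if taken else max(state - 1, 0)
    return mispredictions


# Loop branch: taken 9 times, then not taken on loop exit; the loop runs 5 times
loop_runs = ([True] * 9 + [False]) * 5
print(count_mispredictions(loop_runs, bits=1))
print(count_mispredictions(loop_runs, bits=2))
```

The 1-bit predictor mispredicts twice per loop run, while the 2-bit counter settles to one misprediction per run (the loop exit) after its warm-up.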
65
Branch prediction
• Two-level (correlating) prediction
0x400258: lw r2, 24(r30)
0x400260: slti r3, r2, 15
0x400268: bne r3, r0, 400280
 - Recent behaviour of other branches is considered
General case: (m, n) predictor
 - Global history: history pattern of the last m global branches (example: m = 4, pattern T N N N)
 - Local history: n-bit predictor per entry (example: n = 2 bits)
[Figure: branch history table with 2^x rows, indexed by x bits of the branch address, and 2^m columns, selected by the global history pattern]
ILP: Superscalar Architecture
[Figure: processor block diagram with multiple execution units (several ALUs) next to register block, status and control; data cache with data i/o on the internal and external data bus, instr. cache with address i/o on the internal and external address bus]
66
ILP: Superscalar architecture
Instr. Fetch (IFi ... IFi+3)
Instr. Decode (IDi ... IDi+3)
Data dependency check and Operand Fetch (OF); assignment to the multiple datapaths (DP) is decided at run-time
DP1 | DP2 | DP3 | DP4
OFi+2 | OFi | OFi+3 | OFi+1
EXi+2 | EXi | EXi+3 | EXi+1
MEMi+2 | MEMi | MEMi+3 | MEMi+1
WBi+2 | WBi | WBi+3 | WBi+1
 - More than 1 instruction can be issued in 1 cycle, i.e. CPI < 1 is possible
 - More complex logic for checking data dependencies required
ILP: VLIW processors
Sequential program: ..., instr i-1, instr i, instr i+1, instr i+2, ...
 ↓ Optimizing Compiler (assignment to datapaths is determined during compile-time)
Very long instruction word: InstrDP1 | InstrDP2 | InstrDP3 | InstrDP4 | ... | InstrDPn-1 | InstrDPn
Datapath: DP 1 | DP 2 | DP 3 | DP 4 | ... | DPn-1 | DPn
Registers
67
Processor Performance (1)
• What is performance?
 Example Porsche vs. Bus from Munich to Stuttgart
Vehicle | Top speed [km/h] | Distance [km] | Travel time [h] | Capacity [person] | Throughput [pkm/h]
Porsche | 260 | 200 | 0.77 | 2 | 520
Bus | 100 | 200 | 2.0 | 46 | 4600
• What matters in CPU performance:

 Fastest possible execution of a single instruction?
 Shortest program execution (many instructions)?
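Ultimately interested in
 - CPU execution time: Time CPU needs to complete certain program, task or function
$$\text{CPU time} = \frac{\text{Clock cycles}}{\text{Program}} \times \frac{\text{Seconds}}{\text{Clock cycle}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$
 - Instructions / Program: specific for your application; estimate/count after compilation
 - Clock cycles / Instruction = CPI: processor architecture and memory hierarchy dependent
 - Seconds / Clock cycle = 1 / fcpu: from processor data sheet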
68
Processor Performance
[Figure: processor between instruction memory and data memory]
CPI = CPICPU + CPIMEM
Performance Comparison: Effective MIPS [Data from Xilinx]
CPU x 1.0 clk/instr | 266
CPU x 1.2 clk/instr | 221.67
CPU-DDR | 20.95
CPU-I/O | 17.81
Memory Hierarchy
Level | Access time | Size
CPU registers | 0.5 ns | 500 B
L1 cache | 2 ns | 32 KB
L2 cache | 20 ns | 256 KB
Main memory | 100 ns | 512 MB
From CPU registers to main memory: access latency small → large, cost large → small, size small → large
Observation on program execution

 Temporal locality
 addresses are likely to be accessed in the near future once again
 Spatial locality
 addresses are likely to be close to each other
Frequently accessed data/instructions are kept close to CPU
69
Example: PowerPC 405GP
[Figure: block diagram of the PowerPC 405GP]
 - CPU with 32K I-Cache, 32K D-Cache, MMU, timers, interrupt controller (13 external interrupts), JTAG and trace
 - Processor Local Bus (PLB): 128 bit, up to 133 MHz, with arbiter (Arb)
 - On-chip Peripheral Bus (OPB): 32 bit, 33-66 MHz, connected through an OPB bridge
 - PCI-X bridge (64-bit PCI-X at 66-133 MHz, 32/64 bit PCI at 33-66 MHz)
 - DDR266 SDRAM controller (266 MHz, 32/64-bit, with ECC)
 - External bus master controller for RAM/ROM/peripherals (up to 66 MHz, 32-bit address / 32-bit data)
 - DMA controller, SRAM controller with 128KB SRAM
 - 10/100 Ethernet MAC with MAL (1 MII or 2 RMII interfaces)
 - UART (2), I2C (2), GPIO, GPT
Memory hierarchy in this example: CPU regs → L1 caches → fast & small SRAM → slower & larger SDRAM → I/O subsystem (SCSI, PCI, etc)
$$CPI_{mem} = CPI_{inst} + CPI_{data}$$
$$CPI_{mem} = I_{freq} \times L1_{miss\,rate} \times (L1_{miss\,penalty} + L2_{miss\,rate} \times L2_{miss\,penalty}) + D_{freq} \times L1_{miss\,rate} \times (L1_{miss\,penalty} + L2_{miss\,rate} \times L2_{miss\,penalty})$$
Pipelined RISC: CPICPU = 1.2
Two-level cache hierarchy
 - L1miss rate = 5%
 - L1miss penalty = 10 cycles
 - L2miss rate = 3%
 - L2miss penalty = 50 cycles
 - Dfreq = 20%
 - Ifreq = 100% (every instruction is fetched; this value is implied by the result below)
 - CPImem = 1.2 x 0.05 x (10 + 0.03 x 50) = 0.69
CPImiss / CPIno miss = 1.89 / 1.2 = 1.57
0.15% (= 5% x 3%) instr./data accesses to system memory degrade overall performance (CPU execution time) by 57%
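The numbers of this example can be reproduced with a few lines. The sketch assumes an instruction access frequency of 1.0 (one fetch per instruction), which is not stated explicitly on the slide but is consistent with CPImem = 0.69; the function name is illustrative.

```python
def cpi_mem(i_freq, d_freq, l1_miss_rate, l1_penalty, l2_miss_rate, l2_penalty):
    """Memory stall cycles per instruction for a two-level cache hierarchy."""
    stall_per_access = l1_miss_rate * (l1_penalty + l2_miss_rate * l2_penalty)
    return (i_freq + d_freq) * stall_per_access


cpi_cpu = 1.2
stall_cpi = cpi_mem(i_freq=1.0, d_freq=0.2,
                    l1_miss_rate=0.05, l1_penalty=10,
                    l2_miss_rate=0.03, l2_penalty=50)
cpi_total = cpi_cpu + stall_cpi
slowdown = cpi_total / cpi_cpu

print(round(stall_cpi, 2), round(cpi_total, 2), round(slowdown, 3))
```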
70
Cache Organization
• Caches store only small share of main memory
 - Data are stored in lines of multiple sequential data words (e.g. 4 words)
 - Cache capacity = Lsize x Nlines
• CPU accesses data in the full address range
 - address width = 32 bit → 4 GB memory. How to map onto a smaller memory?
 - Memory address = tag | index | offset
 - Index: used to determine potential position(s) in cache
   - depends on placement strategy
 - Tag: part of address stored together with cache line
   - If stored tag is identical to tag part of memory address → cache hit
 - Offset: determines word in cache line and byte in word
 - Flags: entry valid; entry “dirty” (entry changed by CPU)
[Figure: CPU, cache and main memory; each cache entry holds flags, tag and cache line; the stored tag is compared (=?) with the tag part of the memory address to produce the hit signal]
Direct Mapped Cache
index = block MOD NL
• Each block (size of a cache line) in main memory can be stored in only one cache entry
• Conflicting indices lead to higher cache miss rate
CPU address = tag | index | word | byte (tag + index = block, word + byte = offset)
Cache entry = flags | tag | word0 | word1 | word2 | word3
hit = valid & (stored tag = tag part of address); the offset selects the word
[Figure: direct mapped cache with 8 entries (index 000 to 111) next to the main memory blocks ..00000 to ..10110]
Example: 16 KB direct mapped cache with 4 words à 32 bit per cache line
• 10 bit index (1k cache lines)
• 18 bit tag
• 4 bit offset (for word and byte)
71
Set Associative Cache
set = block MOD (NL / n), any line within set
• Each block in main memory can be stored in n cache entries: n-way set associative cache
• Increasing n reduces cache misses due to conflicts
• With higher n: selection circuitry more complex, needs more time
CPU address = tag | index | word | byte (tag + index = block, word + byte = offset)
[Figure: 2-way set associative cache with 4 sets (00 to 11) and two lines per set (way 0, way 1), each line with flags, tag and data, next to the main memory blocks ..00000 to ..10110; n times parallel tag comparison produces hit and selects the word]
Example: 16 KB 2-way set associative cache with 4 words à 32 bit per cache line
• 9 bit index (512 sets à 2 cache lines)
• 19 bit tag
• 4 bit offset (word and byte, 2 bit each)
Fully Associative Cache
• Fully Associative Cache
 - A memory block can be stored in any cache entry
 - Only one set containing all entries
 - No cache misses due to conflicts
• Parallel tag matching for all entries
 - Complex circuitry
 - High latency
• Fully associative caches rarely used (only for special purposes)
[Figure: fully associative cache next to the main memory blocks ..00000 to ..10110; the tag of the address is compared (=) in parallel with the tag of any cache line to produce hit]
Example: 16 KB fully associative cache with 4 words à 32 bit per cache line
• no index
• 28 bit tag
• 4 bit offset (word and byte, 2 bit each)
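The three 16 KB examples (direct mapped, 2-way set associative, fully associative) follow from one calculation. The sketch below assumes power-of-two sizes and a 32 bit address; the function names are illustrative.

```python
def address_fields(cache_bytes, line_bytes, ways, address_bits=32):
    """Return (tag_bits, index_bits, offset_bits) for an n-way set associative cache.

    Assumes all sizes are powers of two. ways=1 is a direct mapped cache,
    ways=number of lines is a fully associative cache.
    """
    n_lines = cache_bytes // line_bytes
    n_sets = n_lines // ways
    offset_bits = line_bytes.bit_length() - 1   # log2 for powers of two
    index_bits = n_sets.bit_length() - 1
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


def split_address(addr, index_bits, offset_bits):
    """Split a memory address into (tag, index, offset)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


LINE_BYTES = 4 * 4            # 4 words of 32 bit
CACHE_BYTES = 16 * 1024       # 16 KB
for ways in (1, 2, CACHE_BYTES // LINE_BYTES):
    print(ways, address_fields(CACHE_BYTES, LINE_BYTES, ways))
```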
72
Cache Replacement
• Replacement
 - Might be necessary on cache miss
 - When all associated entries occupied: replace one entry
 - Direct mapped cache: replace old entry with new
 - Fully associative cache: replace only when full
 - Which cache entry should be replaced?
• Replacement policy
 - Goal: reduce number of misses
 - Least recently used (LRU): replace the least recently accessed entry
 - First in first out (FIFO): replace the least recently loaded (oldest) entry
 - Random
[Figure: 2-way set associative cache in which both entries (way 0, way 1) of the indexed set are valid (V); which one is replaced?]
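The LRU and FIFO policies can be compared on a single cache set with a short simulation. This is a minimal sketch: the access sequence is invented for illustration, and the set is modelled with an `OrderedDict` whose first item is always the replacement candidate.

```python
from collections import OrderedDict


def count_misses(accesses, ways, policy):
    """Count misses of one cache set with `ways` entries under "LRU" or "FIFO"."""
    entries = OrderedDict()          # first item = next replacement candidate
    misses = 0
    for block in accesses:
        if block in entries:
            if policy == "LRU":
                entries.move_to_end(block)   # a hit refreshes the entry
            continue                         # FIFO ignores hits
        misses += 1
        if len(entries) == ways:
            entries.popitem(last=False)      # evict the candidate
        entries[block] = True
    return misses


accesses = ["A", "B", "C", "A", "D", "A", "B"]
print(count_misses(accesses, ways=3, policy="LRU"))
print(count_misses(accesses, ways=3, policy="FIFO"))
```

In this sequence FIFO evicts block A although it was just reused, which costs one additional miss compared with LRU.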
Cache Write Strategies
Cache writes
 - Contrary to reads: Migrated data gets modified
 - Data item also requires update in memory hierarchy
Write through
 - Modified data written directly through to memory
 - Potentially via write buffer
Write back
 - Modify data locally
 - Dirty flag marks updated cache lines
 - Dirty cache lines are written back
   - On replacement
   - By invalidation by processor
   - By cache coherency protocol (see later)
[Figure: write through: (1) write to the valid (V) cache line, (2) write through to memory. Write back: (1), (2) writes set the dirty flag (VD) of the cache line, (3) conflicting miss, (4) write back to memory, (5) replace]
73
Multithreading in Software
[Figure: processor block diagram with a single set of register block, status and program counter; on a thread switch this state is loaded from / saved to per-thread copies in memory (Load/save)]
Further details in lecture
Multithreading in Hardware
[Figure: processor block diagram with multiple register banks, i.e. one register block, status and program counter per thread]
Further details in lecture
74
Summary
• A variety of microprocessor architectures in embedded

systems
• Instruction Set Architecture as interface between hardware
and software
• Performance is limited mainly by memory access, code
parallelism and data dependencies
Literature
[1] Hennessy, Patterson: Computer Architecture, A Quantitative Approach.

Morgan Kaufmann, 5th edition, 2012
[2] M. Flynn: Basic Issues in Microprocessor Architecture.
Journal of Systems Architecture, 1999
[3] Intel Technology Journal, www.intel.com: Hyper-Threading Technology,
Feb. 2002
[4] K. Diefendorff: PC Processor Microarchitecture. Microprocessor Report,
July 1999
75
Memory

Outline
• Motivation
• Classification and Characteristics
• Look Inside
 Architecture of state-of-art memories
 Different types of memory cells
• How to Use Memory in System Design
• Product Overview
76
Motivation
 - State-of-the-art SoCs make application-specific use of various types of memories:
   - non-volatile memory for firmware and parameter storage
   - fast SRAM for data and control state storage
   - (not shown: eDRAM, CAM, eFlash)
[Figure: SoC floorplan with MCU, DSP, ASIC, ROM, SRAM and analog blocks]
Motivation
 - Significant portion of high-end microprocessor area is dedicated to fast SRAM, associative caches and registers
   → key criteria for overall processor performance
[Figure: IBM PowerPC 750 Cu]
77
Positioning
Memory Type | Capacity [bit] | Density [bit/mm^2] | Transfer Rate [Mbit/s]
Paper Page A4 | 16 x 10^3 | 0.4 | -
Floppy 3.5“ | 11.52 x 10^6 | 1.08 x 10^3 | 0.5
64 Mbit DRAM | 64 x 10^6 | 2.13 x 10^6 | 1600
Zip-Disk 100 MB | 800 x 10^6 | 75.5 x 10^3 | 11.2
CD (32x) | 5.44 x 10^9 | 544 x 10^3 | 38.4
DAT DDS-3 | 96 x 10^9 | 4.24 x 10^6 | 8
DVD (4x) | 136 x 10^9 | 6.8 x 10^6 | 43.2
Hard Disk (5 disks, 7200 rpm) | 176 x 10^9 | 3.3 x 10^6 | 140
Classification
Read-Write Memory
 - Random Access (RAM): SRAM, DRAM, Register
 - Non-Random Access: FIFO / LIFO, CCD, CAM, Shift-Register
Non-Volatile Read-Write Memory: EPROM / EEPROM, FLASH, Magneto RAM
Read-Only Memory (ROM): Mask-Programmed, Fuse-Programmed
78
Classification
The same classification applies: Read-Write Memory (random access and non-random access), Non-Volatile Read-Write Memory, Read-Only Memory.
Memory design trade-offs: size / density, speed, robustness
Characteristics
Memory Type | Application [typical size] | Access Time | Remarks
Registers | CPU Registers [32 x 64 bit] | Very fast [Sub-nsec] | Direct addressing scheme
On-chip SRAM | Caches [32 KByte] | Fast [nsec] | SRAM is faster but more expensive than SDRAM
QDR SRAM | Fast system memory [4 MByte] | Fast [2 x 2 x 200 MHz] | Dual clock edge, dual port
SDRAM | Main Memory [64 MByte] | Slow-Medium [133 MHz] | Needs refresh, sophisticated control, synchronous interface
DDR3 SDRAM | Main Memory [1 GByte] | Medium [2 x 800 MHz] | Dual clock edge
ROM | System config [few kByte] | Medium [~kByte/sec] | Read only
Flash | Memory Card [16 GByte] | Medium [20 MByte/sec] | Non-volatile, no refresh, different rd/wr cycles
79
Look Inside
Definitions:
 Bandwidth: Amount of data into/out of a device
or across interface per unit time
 Latency: Time elapsed between request and
delivery of data
 Cycle time: Time between two consecutive
read/write accesses
512M DRAM
 Asynchronous memory: (self-timed) Change of address (and control)

lines triggers memory read/write
 Synchronous memory: All memory operations occur synchronous to
clock edge(s)
 Multiple requests may be outstanding
Memory Architecture: Decoders
[Figure, left: N words (Word 0 ... Word N-1) of M bits each, built from storage cells and selected by the select signals S0 ... SN-1; input-output of M bits]
N words → N select signals!
[Figure, right: the same array with a decoder that generates the select signals from the address bits A0 ... AK-1]
Decoder reduces # of select signals to K = log2N
Aspect ratio (height / width) not suitable for implementation / performance!
[Rabaey]
80
Memory Architecture: Array-Structure
[Figure: array of storage cells with 2^(L-K) rows (word lines) and M·2^K columns (bit lines)]
 - Row Decoder: address bits AK, AK+1, ..., AL-1 select the word line
 - Sense Amplifiers / Drivers: amplify swing to rail-to-rail amplitude
 - Column Decoder: address bits A0 ... AK-1 select appropriate word
 - Input-Output (M bits)
[Rabaey]
Hierarchical Memory Architecture
[Figure: memory partitioned into blocks, addressed by row address, column address and block address; control circuitry, block selector, global data bus and global amplifier/driver towards I/O]
Advantages:
1. Shorter wires within blocks
2. Block address activates only 1 block => power savings
[Rabaey]
81
1-Transistor DRAM Cell
[Figure: cell with access transistor M1, storage capacitor CS, word line WL and bit line BL with capacitance CBL; waveforms for Write "1" and Read "1": WL, node X rising from GND to VDD − VT, BL precharged to VDD/2 during sensing. Trade-off indicators: size / density, speed, robustness]
Write: CS is charged or discharged by asserting WL and BL.
Read: Charge redistribution takes place between bit line and storage capacitance
$$\Delta V = V_{BL} - V_{PRE} = (V_{BIT} - V_{PRE}) \cdot \frac{C_S}{C_S + C_{BL}}$$
Voltage swing is small; typically around 250 mV.
[Rabaey]
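A quick numeric check of the charge redistribution formula shows why a sense amplifier is needed. The capacitance and voltage values below are illustrative assumptions, not data from the lecture or from a specific DRAM.

```python
def bitline_swing(v_bit, v_pre, c_s, c_bl):
    """Bit line voltage change after charge sharing: (V_BIT - V_PRE) * C_S / (C_S + C_BL)."""
    return (v_bit - v_pre) * c_s / (c_s + c_bl)


# Assumed example: 50 fF storage cell, 200 fF bit line, bit line precharged to VDD/2
VDD = 2.5
swing_one = bitline_swing(v_bit=VDD, v_pre=VDD / 2, c_s=50.0, c_bl=200.0)
swing_zero = bitline_swing(v_bit=0.0, v_pre=VDD / 2, c_s=50.0, c_bl=200.0)
print(swing_one, swing_zero)
```

With these values a stored "1" raises the bit line by about 250 mV and a stored "0" lowers it by the same amount, which is the order of magnitude quoted on the slide.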
Trench DRAM Cell
Bitline
Wordline
n+ - Si
SiO2
Polysilicon
p-Si
Depletion Zone
Inversion
at SiO2/Si
Interface
Address Memory
Transistor Capacitor
[IC1]
82
Advanced 1 Transistor DRAM Cells
Word line
Cell plate Capacitor dielectric layer
Insulating Layer
Cell Plate Si
Capacitor Insulator Transfer gate Isolation

Refilling Poly Storage electrode
Storage Node Poly
Si Substrate
2nd Field Oxide
[Rabaey]
Trench Cell Stacked-capacitor Cell
Sense Amplifier
Bitlines
WLi-4
WLi-3
Memory
WLi-2 cells
WLi-1
SA1 SA2 SA3 SA4 SA5 SA6 SA7 SA: Sense Amplifier
WLi
WLi+1 Speicher- Memory

WLi+2 zellen cells
WLi+3
WLi : Wordlines 2n/2 Bitlines
[IC1]
83
Sense Amplifier
VDD
Read 1 pull u p
T1 T2
VDD
Cs
Equalize
BL TE BL
K1 K2
T3 T4
WL
Sense
Memory Cell
[IC1]
Sense Amplifier
VDD
Read 0 pull u p
T1 T2
VDD
Cs
Equalize
BL TE BL
K1 K2
T3 T4
WL
Sense
Memory Cell
[IC1]
84
6-Transistor CMOS SRAM Cell
size / WL
density
speed
VDD
M2 M4
robustness
Q
Q M6
M5
M1 M3
BL BL
[Rabaey]
ROM – Read Only Memory
WLi 4 ... 6 m
„0“
WLk
„1“
Programming
Programming
„0“ „1“
BL1 BL2
size /
density
speed
robustness
[IC1]
85
Floating Gate Transistor Cell
Floating gate Gate

D
Source Drain
tox G
tox
S
p
n+ n+
Substrate
(a) Device cross-section (b) Schematic symbol
[Rabaey]
Floating-Gate Transistor Programming

20 V 0V 5V
20 V 0V 5V
10 V 5 V 5 V 2.5 V
S D S D S D
Avalanche injection. Removing programming voltage Programming results in

leaves charge trapped. higher V T.
ID
[Rabaey]
Vwl Vgs
86
Flash Memory Cell
[Infineon]
Phase Change Memory
• Non-volatile
• Faster and more writable
than Flash
• Research topic
• Phase change effect used
with Blu-Ray
[Wong et al, IEEE Proceedings, Dec. 2010]

[IBM, extremetech, May 2014]
87
How to Use Memory in System Design
Capacity
Access
speed
FAST LOW
CPU
Cache
Local Bus
Fast & Small SRAM
Slower & larger SDRAM
I/O Subsystem (SCSI, PCI, etc)
Disk
Tape
SLOW HIGH
[IC1]
Memory in System Design: Example
[Figure: block diagram of the PowerPC 405GP with CPU (32K I-Cache, 32K D-Cache, MMU, timers, interrupt controller with 13 external interrupts, JTAG, trace), 128-bit Processor Local Bus (PLB, up to 133 MHz) with arbiter, On-chip Peripheral Bus (OPB, 33-66 MHz) with OPB bridge, PCI-X bridge, DDR266 SDRAM controller (32/64-bit, with ECC), external bus master controller for RAM/ROM/peripherals, DMA controller, SRAM controller with 128KB SRAM, 10/100 Ethernet MAC with MAL (1 MII or 2 RMII interfaces), UART (2), I2C (2), GPIO and GPT]
Memory hierarchy levels marked in the figure: CPU, Cache, Local Bus, Fast & Small SRAM, Slower & larger (SDRAM), I/O Subsystem (SCSI, PCI, etc), Disk, Tape
88
A Minimal Memory System
[Gries]
SDRAM Read Operation Timing
[Gries]
89
[Figure: SDRAM read timing diagram with tCAS marked, Micron]
tRCD: row to column delay
tCAS: column-access strobe
tRAS: row-access strobe
tRP: row precharge
Timing: tCAS–tRCD–tRP–tRAS → 2-2-2-6 SDRAM
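[Figure: SDRAM read timing diagram with tCAS marked, Micron]
Access latency: tmem_acc = tRCD + tCAS
Min. row cycle time: tRC = tRAS + tRP
BWpeak = f ∙ w
$$BW(burst = n) = \begin{cases} \dfrac{f \cdot w \cdot n}{t_{RC}}, & n \leq t_{RC} - t_{mem\_acc} \\ \dfrac{f \cdot w \cdot n}{t_{mem\_acc} + (n-1)}, & \text{else} \end{cases}$$
* time in clock cycles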
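The burst bandwidth formula can be evaluated for the 2-2-2-6 SDRAM named above. The clock frequency and data width in this sketch are assumed example values; the function name is illustrative.

```python
def burst_bandwidth(f_hz, width_bits, n, t_rcd, t_cas, t_ras, t_rp):
    """SDRAM bandwidth in bit/s for bursts of n words; timing values in clock cycles."""
    t_mem_acc = t_rcd + t_cas          # access latency
    t_rc = t_ras + t_rp                # minimum row cycle time
    if n <= t_rc - t_mem_acc:
        cycles = t_rc                  # row cycle time limits the access rate
    else:
        cycles = t_mem_acc + (n - 1)   # burst length dominates
    return f_hz * width_bits * n / cycles


# 2-2-2-6 SDRAM (tCAS-tRCD-tRP-tRAS); 100 MHz and 32 bit are assumptions
timing = dict(t_cas=2, t_rcd=2, t_rp=2, t_ras=6)
peak = 100e6 * 32
for n in (1, 4, 8):
    bw = burst_bandwidth(100e6, 32, n, **timing)
    print(n, bw, bw / peak)
```

Longer bursts amortize the access latency, so the achieved bandwidth approaches BWpeak = f ∙ w.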
90
SDRAM Write Operation Timing
[Gries]
SDRAM Write Operation Timing
[Gries]
91
Synchronous DRAM vs DDR
[Gries]
Multiport DRAM
Row
Address
Column
Address
Block
Address
Global Data Bus

I/O
Separate rd/wr
addresses, decoders
and I/O
92
Synchronous DRAM vs RAMBUS
[RAMBUS]
Embedded DRAM
Pro‘s:
• customized size
• wide data bus
• multi port
• high speed
Con‘s:
• complex technology
• expensive
• less density
[IBM]
93
Wide I/O – 3D-Integration with Through
Silicon Vias (TSV)
Stacked Wide I/O DRAM

TSV
Processor
PCB
BGA
 1200 TSVs to connect memory and processor layers

 Short, low C interconnects (TSV)
 High packaging density (no bonding wires)
 TSV diameter: 40-50 µm (approx. 500 TSV/mm²)
 Thin wafers (50-100 µm)
 4 layers max.
 Total SoC height: < 1mm
[www.3dic.org/Wide_IO]
Memory Summary
 Memories are key elements in integrated system design

 Come in different types with different optimization
criteria (size/density, access speed/cycle time,
robustness)
 Dynamic / static / non-volatile memory
 Memory cell / bank / array structure
 Single data / burst access
94
References
[1] Jan Rabaey: Digital Integrated Circuits: A Design Perspective, Prentice Hall, 2nd
Edition, 2003
[2] Stechele/Herkersdorf: Integrierte Schaltungen, Lecture notes, TUM, 2003
[3] IBM photos http://www-3.ibm.com/chips/photolibrary/photo10.nsf/home?ReadForm
[4] www.rambus.com white papers
[5] M. Gries: A Survey of Synchronous RAM Architectures, Swiss Federal Institute of
Technology, ETHZ, Technical Report TIK No. 71, 1999
[6] J. Alsmeier, Infineon: Speicherkonzepte, 4. Dresdner Sommerschule Mikroelektronik,
September 2003
[7] R. Desikan, University of Texas, Tech Report TR-02-47, Sept. 2002
[8] Micron 256Mb: x4, x8, x16 SDRAM Features
[9] Wong, H-S. Philip, et al. "Phase change memory." Proceedings of the IEEE 98.12
(2010): 2201-2227.
[10] Motoyoshi, Makoto. "Through-silicon via (TSV)." Proceedings of the IEEE 97.1
(2009): 43-48.
95
Interconnect

Outline
• On-Chip Buses
 Basic operation
 Methods for increasing bus throughput
 PLB, AHB, AXI
• Outlook for Network-on-Chip
• FIFOs
 Principles of operation
• Example: Networking SoC
96
On-Chip Buses
Bus Masters: Initiate Transfers
Bus Slaves: React on requests
Characteristics:
 - Max. number of Masters supported
 - Max. number of Slaves supported
 - Bus width
 - Separate/Shared Rd/Wr Bus
 - Clock rate
 - Arbitration Scheme
[Figure: bus with arbiter connecting CPU, ASIC1, Traffic Mgr, DSP, MAC and Mem]
On-Chip Buses – Basic Operation
• Synchronous bus: central CLK signal
• Shared medium: Arbiter controls access among multiple bus masters by means of Request/Grant protocol
• Separate address- and data buses, varying bus widths (32-256bit)
• Acknowledgement for successful transaction
Bus throughput depends on:
• Bus width
• Bus CLK rate
Bus Transaction
 - Address Cycle: Request, Addr. Trans
 - Data Cycle: Data Trans, Data Ack
[Figure: timing diagram with CLK, Master Req, RNW (Read / Write), Grant, Addr Bus (A1, A2), Data Bus (D(A1), D(A2)) and Data Ack]
97
Bus Arbitration Schemes
Determines sequence in which requests from multiple masters are serviced
Scheme | Pro’s | Con’s
Round Robin | Simple control | No QoS support
Strict Priority (e.g. classes II, BE, I, I) | Simple control; different service classes | “Starvation” of low priority traffic
Weighted Priority (e.g. weights 15 / 5 / 40 / 40 %) | No starvation | Complex control; no BW guarantees
Weighted Priority + Credits (e.g. weights 15 / 5 / 40 / 40 %, credits 5 / 30 / 10 / 30 KB) | QoS with BW guarantees | Complex control
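The first two schemes of the table can be contrasted with a small simulation in which all masters request the bus in every cycle. This is a sketch under simplifying assumptions (one grant per cycle, master 0 has the highest priority); the function names are illustrative and do not model any specific bus standard.

```python
def round_robin(requests, last_granted, n_masters):
    """Grant the first requesting master after the one that was granted last."""
    for offset in range(1, n_masters + 1):
        candidate = (last_granted + offset) % n_masters
        if candidate in requests:
            return candidate
    return None


def strict_priority(requests):
    """Grant the requesting master with the lowest index (= highest priority)."""
    return min(requests) if requests else None


def simulate(cycles, n_masters=3):
    """All masters request in every cycle; return the grant sequences of both schemes."""
    requests = set(range(n_masters))
    rr_grants, sp_grants = [], []
    last = -1
    for _ in range(cycles):
        last = round_robin(requests, last, n_masters)
        rr_grants.append(last)
        sp_grants.append(strict_priority(requests))
    return rr_grants, sp_grants


rr, sp = simulate(6)
print("round robin:    ", rr)
print("strict priority:", sp)
```

Under permanent load round robin serves all masters in turn, whereas strict priority never grants masters 1 and 2, which is the starvation effect listed in the table.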
Example: Processor Local Bus (PLB)
[Figure: PLB master and PLB slave connected through the PLB core with its central bus arbiter (arbitration). Address bus, write data bus and control run from the master through the core to the slave; read data bus and status & control of the slaves are combined by OR onto a shared bus towards the master]
98
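PLB: Standard Read (Rd) Transfers
[Figure: timing diagram of two consecutive read transfers (1, 2) with SYS_Clk, Mn_req, Mn_RNW, Mn_ABus (A0, B0), PLB_Avalid, SI_AddrAck, data bus SI_DBus (D(A0), D(B0)) and SI_DAck; each transfer takes tarb + tacc]
$$BW = \frac{f \cdot w}{t_{arb} + t_{acc}}$$
for memory tacc = tmem_acc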
Bus Throughput Improvements
PLB employs following methods to

increase bus throughput:
 Independent data buses for
reads and writes
 Pipelining
 Burst Transfers
These are generic methods which are used by AMBA and

other on-chip buses too
99
Pipelined Bus Control
 - Strictly sequential Address/Data cycle
   - Data Ack terminates the current transfer and releases a new transfer
   - Address Cycle (Request, Address Transfer) is followed by Data Cycle (Data Transfer, Data Ack)
 - Pipelined Address Data Cycle
   - Enable multiple, simultaneous transfers
   - Initiation of transfer before completion of previous transfer
   - Address Cycle (Request, Address Transfer, Addr Ack) of the next transfer overlaps with the Data Cycle (Data Transfer, Data Ack) of the current transfer
PLB: Pipelined Rd Transfers
[Figure: timing diagram of two pipelined read transfers (1, 2) with SYS_Clk, Mn_req, Mn_RNW, Mn_ABus (A0, B0), PLB_PAvalid (1), PLB_SAvalid (2), Sn_AddrAck, data bus SI_rdDBus (D(A0), D(B0)) and SI_rdDAck; tarb and tacc of transfer 2 overlap with transfer 1]
BWpeak = f ∙ w
* assuming B0 is independent of D(A0)
100
PLB: Pipelined Rd Transfers (But...)
SYS_Clk
Mn_req 1 2
Mn_RNW
Mn_ABus A0 B0
PLB_PAvalid 1 2
PLB_SAvalid 2
SI_AddrAck 1 2
Data Bus
SI_rdDBus D(A0) D(B0)
SI_rdDAck 1 2
tarb tacc
tarb tacc
Burst Transfers
Reduction of Req./Addr. signaling overhead for read/write transactions to consecutive addresses
→ Burst transfers with implicit address increment
Without burst: Request, Addr. Trans, Data Trans + Ack | Request, Addr. Trans, Data Trans + Ack | Request, Addr. Trans, Data Trans + Ack
With burst: Request, Addr. Trans, Data Trans + Ack, Data Trans + Ack, Data Trans + Ack
101
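PLB: Burst Rd Transfers
[Figure: timing diagram of two burst read transfers (1, 2) with SYS_Clk, Mn_req, Mn_RNW, Mn_ABus (A0, B0), PLB_Avalid, Sn_AddrAck, data bus SI_DBus (D(A0) D(A1) D(A2) D(A3), then D(B0) D(B1) D(B2) ...) and SI_DAck; each burst needs tarb + tacc before its first data word]
$$BW = \frac{f \cdot w \cdot n}{(t_{arb} + t_{acc}) + (n - 1)}$$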
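PLB: Pipelined Burst Rd Transfers
[Figure: timing diagram of two pipelined burst read transfers (1, 2) with SYS_Clk, Mn_req, Mn_RNW, Mn_ABus (A0, B0), PLB_PAvalid (1), PLB_SAvalid (2), Sn_AddrAck, data bus SI_rdDBus (A0 A1 A2 A3 B0 B1 B2 B3) and SI_rdDAck (1 1 1 1 2 2 2 2); tarb and tacc of burst 2 are hidden behind burst 1]
$$BW = \frac{f \cdot w \cdot n}{n} = f \cdot w$$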
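The throughput formulas of the standard, burst and pipelined transfer modes can be compared side by side. The bus parameters in this sketch (100 MHz, 64 bit, tarb = tacc = 2 cycles) are assumptions chosen for illustration, not PLB specification values.

```python
def bus_bandwidth(f_hz, width_bits, t_arb, t_acc, burst_len=1, pipelined=False):
    """Bus throughput in bit/s; t_arb and t_acc in clock cycles.

    Non-pipelined: f*w*n / ((t_arb + t_acc) + (n - 1)), with n = 1 for single transfers.
    Pipelined: arbitration and access latency are hidden behind the previous
    transfer, so the steady-state throughput is f*w.
    """
    if pipelined:
        return f_hz * width_bits
    cycles = (t_arb + t_acc) + (burst_len - 1)
    return f_hz * width_bits * burst_len / cycles


F, W, T_ARB, T_ACC = 100e6, 64, 2, 2   # assumed example values
single = bus_bandwidth(F, W, T_ARB, T_ACC)
burst4 = bus_bandwidth(F, W, T_ARB, T_ACC, burst_len=4)
piped = bus_bandwidth(F, W, T_ARB, T_ACC, burst_len=4, pipelined=True)
print(single / piped, burst4 / piped, 1.0)
```

With these values a single transfer reaches 25% of the peak throughput, a 4-word burst about 57%, and pipelined bursts the full f ∙ w.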
102
PLB: Pipelined Back-to-Back Rd & Wr
SYS_Clk
Mn_req 1 2 3 4
Mn_RNW
Mn_ABus A B C D
PLB_PAvalid 1 2
PLB_SAvalid 3 4
SI_AddrAck 1 2 3 4
Write Data Bus
Mn_wrDBus B0 B1 B2 B3 D0 D1 D2 D3
SI_wrDAck 2 2 2 2 4 4 4 4
Read Data Bus

SI_rdDBus A0 A1 A2 A3 C0 C1 C2 C3
SI_rdDAck 1 1 1 1 3 3 3 3
PLB: Pipelined Back-to-Back Rd & Wr
SYS_Clk
Mn_req 1 2 3 4 5
Mn_RNW
Mn_ABus A B C D E
PLB_PAvalid 1 2
PLB_SAvalid 3 4 5
SI_AddrAck 1 2 3 4 5
Write Data Bus
Mn_wrDBus B0 B1 B2 B3 D0 D1 D2 D3
SI_wrDAck 2 2 2 2 4 4 4 4
Read Data Bus

SI_rdDBus A0 A1 A2 A3 C0 C1 C2 C3 E0
SI_rdDAck 1 1 1 1 3 3 3 3 5
103
Example: AMBA AHB
• AMBA AHB
 - Advanced Microcontroller Bus Architecture
 - Advanced High-Performance Bus
• Features improving bus throughput:
 - Independent data buses (reads/writes)
 - Pipelining (with the previous transfer only)
 - Burst transfers
 - Split transfers
[Figure: AHB connecting a high-performance ARM processor, high-bandwidth on-chip RAM, a high-bandwidth memory interface and a DMA bus master]
AMBA AHB: Split transfers
• Master M0 reads data X0...X3 from slow slave S0
• Master M1 reads data Y0...Y3 from fast slave S1
[Figure: timing diagram over time t. Master Req: M0, M1. Addr Bus: X, Y. Data Bus: Y0 Y1 Y2 Y3 X0 X1 X2 X3. Transfer to M0 is split]
• The bus is not blocked by S0
Note: PLB features split-bus transfers:

 Read and write data buses are independent
 Simultaneous use of read and write data buses for two independent
transactions is possible
104
Example: AMBA AXI
• AMBA AXI
 - Advanced Microcontroller Bus Architecture
 - Advanced eXtensible Interface
• Features improving bus throughput:
 - Independent data buses (reads/writes) and independent addr. buses (reads/writes)
 - Pipelining (over multiple previous transfers)
 - Burst transfers
 - Out-of-order transfers
[Figure: ARM Cortex-A5 implementing a 64-bit AXI bus; source: http://www.arm.com]
AMBA AXI: Out-of-order transfers
• Master M0
 reads data X0...X3 and Y0...Y3 from slow slave S0
 Y0...Y3 can be delivered faster than X0...X3
• Master M1
 reads data Z0...Z3 from fast slave S1
Master Req M0 M0 M1
t
Addr Bus X Y Z
t
Z0 Z1 Y0 Y1 X0 Y2 X1 Y3 Z2 X2 Z3 X3
Data Bus
t
• Reordering occurs
 among multiple masters
 among multiple transfers of the same master
 but not within a burst
105
Bus Standards Comparison
| | OPB (CoreConnect) | PLB (CoreConnect) | APB (AMBA) | AHB (AMBA) | AXI (AMBA) |
| Full name | On-Chip Peripheral Bus | Processor Local Bus | Advanced Peripheral Bus | Advanced High Performance Bus | Advanced Extensible Interface |
| Addresses | 32/64 bit | 32/64 bit | 32 bit | 32 bit | 32 bit |
| Bus widths | 32/64 bit | 32-256 bit | 8-32 bit | 8-1024 bit | 8-1024 bit |
| # Masters | 4 | 16 | 1 (bridge to AHB) | 16 | n/a |
| Bursts | yes | var. length | - | var. length | var. length |
| Pipelining | - | yes | - | yes | yes |
| Separate rd/wr data buses | - | yes | - | yes | yes |
| Split transfer | - | yes | - | yes | yes |
| Out-of-order transfer | - | - | - | - | yes |
Summary On-Chip Buses
Various on-chip bus industry standards exist

 AMBA (ARM), CoreConnect (IBM), OCP (Sonics), VSIA
 Widely used SoC interconnect structure
 Differentiation in:
 speed (clock rate)
 width (64/128/256 bits)
 transfer protocol (pipelined, split/out-of-order transfers)
 number of masters/slaves supported
 Capacity limitation for high-end applications due to
shared medium nature
106
FIFO Interface - Motivation
• On-chip Buses
 Synchronous, high throughput, shared medium
 High overhead for point-to-point connect between two
modules
• FIFOs:
 widely used point-to-point interconnect between asynchronous
modules with standardized interfaces
FIFO – Application Example
FIFOs are used for PPC

(300 MHz)
 Decoupling clock domains
Dev. Driver
100 MHz PLB bus
M
125 MHz HW-Accelerator Arbiter PLB-Bus (100 MHz)
 Decoupling data path widths S
64 bit PLB bus Bus Interface
32 bit HW-Accelerator
FIFO size TX-FIFO RX-FIFO

 As small as possible
 But large enough to compensate different
read/write data rates
HW-Accelerator
(125 MHz)
107
FIFO Architecture
Internal FIFO architecture

AF
 A ring of SRAM memory cells or registers
 2 address counters
 Read & write pointers, incremented
Additional control logic

 Fill level
 Prevent illegal accesses AE
 Prevent wrap-around overwriting
 Almost full (AF) and almost empty (AE) flags for
early detection of FIFO overflow or underrun
Read Pointer
Write Pointer
FIFOs – Principle of Operation
FIFO with 8 storage locations (0 to 7), AE=2, AF=5

0 1 2 3 4 5 6 7
WP RP
AE AF
FIFO is empty!
108

0 1 2 3 4 5 6 7
RP WP
AE AF

0 1 2 3 4 5 6 7
RP WP
AE AF
109

0 1 2 3 4 5 6 7
RP WP
AE AF
Fill level exceeds AE → read without risk of underflow can start!

0 1 2 3 4 5 6 7
RP WP
AE AF
110

0 1 2 3 4 5 6 7
RP WP
AE AF

0 1 2 3 4 5 6 7
RP WP
AE AF
111

0 1 2 3 4 5 6 7
RP WP
AE AF

0 1 2 3 4 5 6 7
RP WP
AE AF
Fill level exceeds AF → additional writes should be omitted to prevent overflow!
112

0 1 2 3 4 5 6 7
WP RP
AE AF
Only one remaining storage location available!

0 1 2 3 4 5 6 7
WP RP
AE AF
113

0 1 2 3 4 5 6 7
WP RP
AF AE
Fill level again below AF!
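The walkthrough above can be mirrored by a small behavioural model. This is a minimal sketch, not a hardware description: the class name `Fifo`, the method names and the flag conventions (almost empty while fill level <= AE, almost full once fill level > AF) are assumptions derived from the slide captions.

```python
class Fifo:
    """Ring-buffer FIFO with read/write pointers and AE/AF flags (behavioural sketch)."""

    def __init__(self, depth=8, ae=2, af=5):
        self.mem = [None] * depth   # ring of storage locations 0 .. depth-1
        self.depth = depth
        self.ae = ae                # almost-empty threshold
        self.af = af                # almost-full threshold
        self.rp = 0                 # read pointer
        self.wp = 0                 # write pointer
        self.fill = 0               # current fill level

    def almost_empty(self):
        # Assumed convention: reading is only "safe" once the fill level exceeds AE.
        return self.fill <= self.ae

    def almost_full(self):
        # Assumed convention: writers should back off once the fill level exceeds AF.
        return self.fill > self.af

    def write(self, value):
        """Store one item; returns False instead of overwriting when the FIFO is full."""
        if self.fill == self.depth:
            return False
        self.mem[self.wp] = value
        self.wp = (self.wp + 1) % self.depth   # wrap around at the end of the ring
        self.fill += 1
        return True

    def read(self):
        """Return the oldest item, or None when the FIFO is empty."""
        if self.fill == 0:
            return None
        value = self.mem[self.rp]
        self.rp = (self.rp + 1) % self.depth
        self.fill -= 1
        return value


fifo = Fifo()                      # 8 locations, AE=2, AF=5 as on the slides
for item in range(6):
    fifo.write(item)
print(fifo.fill, fifo.almost_full())
```

Keeping an explicit fill counter is one of several ways to distinguish "full" from "empty" when both pointers coincide; real designs often use an extra pointer bit instead.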
Example: Networking SoC
Given
 IP Router/VoIP Gateway SoC
 EN MAC
  4 x 1 Gb/s
 PLB Bus
  2x 128 bit Rd/Wr
  180 MHz clock
  Nom. Capacity: 46 Gb/s
[Figure: SoC block diagram with CPU, ASIC1, Traffic Mgr, EN MAC, DSP, Mem]
Question
 Under/Over/Well dimensioned?
114
Basic Packet Rx / Process / Tx
MAC Bus Mem CPU
Traffic
CPU ASIC1 C
Mgr
Packet
reception
CPU retrieves
EN packet from
DSP Mem memory
MAC
6C Packet
processing
• Message Sequence Chart (MSC):
CPU write back
 Uncover “hidden” transfers for packet to
CPU/MAC notification memory
 Short packets are worst case
condition (packet size ≈ notification Packet
message) transmission
C
Plus...
Not yet considered
 Buffer management
  Maintenance of data structure to store different size packets in equal size buffers
 Memory accesses during packet processing
  Per flow contexts
 Coprocessor invocation
  ASIC, Traffic Manager
 DSP operation
  Low capacity but delay sensitive
Bottom line
 Networking SoCs have on-chip communication demands easily exceeding 10x link rate capacity
 Candidates for crossbar switches, hierarchical buses or advanced NoCs
[Figure: SoC block diagram with CPU, ASIC1, Traffic Mgr, EN MAC, DSP, Mem]
115
… and last but not least
Memory access BW is even more constrained
 200 MHz, 32b DDR SDRAM
 Rand. Rd/Wr access: 25 ns
 64 Byte transfer: 65 ns
 Access capacity: 7.9 Gb/s
[Figure: SoC block diagram with CPU, ASIC1, Traffic Mgr, EN MAC, DSP, Mem]
Improvements
 Multiple, physical memories
 Separation data, state, control
 Interleaving techniques
 Multi-port memories
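The capacity figures quoted for the example SoC can be re-derived from the given parameters. The sketch below is only arithmetic on the slide values (2 x 128 bit PLB at 180 MHz, 4 x 1 Gb/s Ethernet, 64 Byte memory transfer in 65 ns); the function names are hypothetical.

```python
def nominal_bus_capacity_bps(num_buses, width_bits, f_hz):
    """Nominal capacity of parallel data buses: one full-width word per clock cycle."""
    return num_buses * width_bits * f_hz


def memory_access_capacity_bps(transfer_bytes, transfer_time_s):
    """Sustained memory capacity when every transfer takes transfer_time_s."""
    return transfer_bytes * 8 / transfer_time_s


plb = nominal_bus_capacity_bps(2, 128, 180e6)   # read + write data bus
mem = memory_access_capacity_bps(64, 65e-9)     # 64 Byte per 65 ns
link = 4 * 1e9                                  # 4 x 1 Gb/s Ethernet MACs

print(f"PLB nominal: {plb / 1e9:.1f} Gb/s ({plb / link:.1f}x link rate)")
print(f"Memory access: {mem / 1e9:.1f} Gb/s ({mem / link:.1f}x link rate)")
```

With these numbers the memory offers less than twice the aggregate link rate, which is why the slide calls memory access bandwidth the tighter constraint.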
NoC – Network on Chip
Benefits
 Scalability: Aggregate bandwidth scales with network size
 Segmentation of wires: short point-to-point links
  Pipelining, power consumption, reliability/crosstalk
 Synchronization
  fully synchronous clock distribution NOT required
Drawbacks
 Latency, Area
Topologies: 2D Mesh; Ring, Octagon; Fat Tree
[Figure: NoC built from tiles, nodes and links]
Further details in lecture „Chip-level Multiprocessors“
116
Summary
• On-chip buses are industry standard for SoC component

interconnect
• FIFOs are employed for point-to-point communication
• Networking ICs/SoCs demand high on-chip interconnect
and memory capacities
• Current research: Networks on Chip
References
[1] CoreConnect Processor Local Bus Specifications, available at

https://www-01.ibm.com/chips/techlib/techlib.nsf/products/CoreConnect_Bus_Architecture
[2] AMBA Open Specifications, available at
http://www.arm.com/products/system-ip/amba/amba-open-specifications.php
117
Cross-Layer Perspectives on
Low Power Design
Andreas Herkersdorf
Armin Sadighi
Anmol Surhonne
Thomas Wild
Chair of Integrated Systems

Technische Universität München
Power Density
[Figure: power density (W/cm2), performance, frequency and number of cores of Intel processors from 8086 to Core i7, plotted over the years 1972 to 2020; power density reference levels: hot plate, nuclear reactor, rocket nozzle]
Source: [1] ITRS Roadmap 99, 09
SoCT – Low Power Design – 2 © Chair of Integrated Systems, TUM
118
Source: AMD
Source: AMD
119
Low Power Design is Prerequisite for …
Higher reliability
 High currents may cause electromigration in metal interconnects.
 Plus 10°C doubles bit failure rate.
Lower cost packaging
 Commodity servers cannot afford mainframe water cooling,
 … nor can smartphones afford forced air cooling with fans
Figure Sources: TU Dresden, IBM 2007, KIT [11]
Low Power Design is Prerequisite for …
Portability
 Enabling sophisticated mobile and IoT applications
Higher Integration
 Continued device scaling and 3D integration: dynamic power per device decreases, but number of devices per area/volume increases
Green Computing
 IT data centers contribute substantially to worldwide CO2 emission.
 HPC replacement is OPEX driven
120
Power Trends
Source: R.Puri [4]
Power Trends
Below 65nm node

static power can be considered equal
to active (dynamic) power.
Clock distribution accounts for up to

half of active power
Source: R.Puri [4]
121
Energy versus Power
Power is drawn from a voltage source attached to the VDD pin(s) of a

chip.
• Energy: Electrical work to transport electrical charge across an

electrical potential
$$ E = \int_0^T P(t)\,dt = \int_0^T i_{DD}(t)\,V_{DD}\,dt $$
• Power: Rate at which electrical energy is converted into heat

$$ P_{avg} = \frac{E}{T} = \frac{1}{T}\int_0^T i_{DD}(t)\,V_{DD}\,dt $$
Supply & Threshold Voltage Optimization
Effect of a lower supply voltage and a lower threshold voltage on:
 Dynamic power: $P_{cap}(V_{dd}) = \alpha \cdot C_L \cdot V_{dd}^2 \cdot f_{clk}$
 Gate delay: $t_d(V_{dd}, V_t) \propto \frac{C_L}{V_{dd} - V_t}$
 Static power: $P_{leak}(V_t, V_{dd}) = I_{leak}(V_t) \cdot V_{dd}$
 Leakage current: $I_{leak}(V_t) \propto e^{V_{gs} - V_t}$
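To make the quadratic supply-voltage dependency of the dynamic power tangible, the relations on this slide can be evaluated numerically. This is a minimal sketch; the function names and all example values (activity factor 0.1, 1 nF switched capacitance, 1 GHz clock, 10 mA leakage current) are assumptions, not data from the lecture.

```python
def p_cap(alpha, c_load, v_dd, f_clk):
    """Dynamic (capacitive switching) power: alpha * C_L * Vdd^2 * f_clk."""
    return alpha * c_load * v_dd ** 2 * f_clk


def p_leak(i_leak, v_dd):
    """Static power: leakage current times supply voltage."""
    return i_leak * v_dd


# Hypothetical operating points: same circuit at 1.0 V and at 0.8 V.
nominal = p_cap(alpha=0.1, c_load=1e-9, v_dd=1.0, f_clk=1e9)
scaled = p_cap(alpha=0.1, c_load=1e-9, v_dd=0.8, f_clk=1e9)

print(f"dynamic power ratio at 0.8 V vs 1.0 V: {scaled / nominal:.2f}")
print(f"static power at 1.0 V with 10 mA leakage: {p_leak(10e-3, 1.0):.3f} W")
```

A 20 % lower supply voltage at constant frequency reduces the dynamic power to 64 %, while the static power only falls linearly with Vdd.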
122
Outline
• System-level
 Processor Voltage Frequency Scaling
 Algorithmic optimization, Operand isolation
 Power gating & sleep transistors
 Voltage islands
• Architecture-level
 Scheduling, Pipelining
 Bus-Segmentation, Memory-partitioning, Datapath reordering
• RTL/Logic-level
 Clock gating
• Transistor-level
 Threshold voltage control
 FinFET/FDSOI
 Advanced memory cells
[Figure: Lecture Focus matrix: Power (dynamic, static) versus level (System, Architecture, RTL/Logic, Transistor)]
Just 2 references for in-depth text books:
[2] J. Rabaey, Low Power Essentials, 2009, Springer
[3] D. Chinnery, K. Keutzer, Closing the Power Gap Between ASIC and Custom, Springer, 2007
Hierarchical Low Power Design

| Level | Means | Gain per Unit | Effort and Investment |
| System | Power management/budgeting, Choice of components, Partitioning, Approximate computing | | |
| Architecture / RTL | Arithmetic transformation, data representation, Parallelism, Pipelining, Resource allocation, Adaptive voltage scaling | | |
| Circuit / Logic | Resizing, Parallelism, VTCMOS, MTCMOS | | |
| Physical Design | Power-driven place & route, Low power layout | | |
| Technology | Scaling, Vt optimization, alternative technologies (SOI), NTC | | |
Inspired by: Raghunathan
123
Power
Dynamic Power Management dyn. stat.
System
Recent trend:
Machine learning-based state traversal
[9, 17, 18, …]
Source: D. Perlmutter [5]
$$ Q: S \times A \rightarrow \mathbb{R} $$
$$ Q(s_t, a_t) \leftarrow (1 - \alpha) \cdot Q(s_t, a_t) + \alpha \cdot \left( r_t + \gamma \cdot \max_a Q(s_{t+1}, a) \right) $$
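The update rule for learning-based state traversal can be sketched in a few lines. The code below assumes the standard tabular Q-learning form, in which the maximum is taken over the actions available in the successor state; the state and action names ("busy", "idle", "run", "sleep") and the parameter values are purely hypothetical.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))."""
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] = (1 - alpha) * q[(state, action)] + alpha * (reward + gamma * best_next)
    return q[(state, action)]


# Hypothetical power-management example: two power states, two actions.
q_table = defaultdict(float)          # unseen (state, action) pairs start at 0.0
q_table[("idle", "sleep")] = 1.0
new_value = q_update(q_table, "busy", "run", reward=2.0, next_state="idle",
                     actions=["sleep", "run"], alpha=0.5, gamma=0.5)
print(new_value)
```

Here alpha is the learning rate and gamma the discount factor; a power manager would pick the reward from measured energy and performance, which is outside the scope of this sketch.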
Example: DPM in Sensor Network Processors

2.7x2.7 mm² (130 nm CMOS)
Clock Rates 8 MHz – 80 kHz
Supply 0.3-1V
Leakage Power 53 µW
Average Power 150 µW
Peak Power 5 mW
[Figure: chip power trace over RX listen windows and a TX broadcast packet, with per-block sleep signals for if, baseband, serial, neighbor, location, queues, dw8051 and dll]
© IEEE 2006
Source: M. Sheets [6]
Power
Power Gating dyn. stat.
System
Dyn. Power
Management
Trade-Off: Flexibility vs. Performance
CPU DSP
ASIP
Log F L E X I B I L I T Y
FPGA
Instruction Depth
ASIC
Flexibility vs.
Custom IC Performance/Power
dissipation dilemma
Log COMPUTATIONAL DENSITY = performance / area
103 . . . 104
Log Power Efficiency = performance / W
105 . . . 106
Source: A. DeHon [7]; A. Cuomo [8]; T. Noll
125
Platform-based SoC Design Methodology
Conquer design complexity by reuse maximization:

Shorter development cycles, higher chances for (first time) fault-free
design and competitive value differentiation
Differentiation through
AMBA System SRAM System EMAC
Core eDRAM Core PCI-X
new, application specific
Standard on-Chip system cores
(bus) interconnect
Processor Bus Bridge Peripheral Bus
and interfaces,
CoreConnect Processor ISA Memory UART
Core Ext. Ctrl. GPIO
Blue
Logic
Standard RISC CPU cores and
SW development environments, Reuse existing function
building blocks,
XILINX
XILINX
Power Budgets
Control CLB
I/O Drivers I/O Interconnect
15% 10% 5%
9%
Execution
Units 15%
Clock 21%
40% Clocks 65%
20%
Caches
µProcessor FPGA
I/O Clock
Memory Logic
Source: J. Rabaey [2]
Signal processor
126
… as Enablement for Multi-Function Devices
Source: K. Arabi [10], Qualcomm Tech.
Source: K. Arabi [10], Qualcomm Tech.
127
Example: Invasive Computing Platform
DFG Transregional Collaborative Research Center between

FAU Erlangen, KIT Karlsruhe and TU Munich
•Resource-aware many-core processor

programming, middleware, tools and
architecture
•Resources occupied / released based on:

• Availability
• Utilization (load, bandwidth)
• Operating Conditions (temperature,
frequency/voltage, soft error rate)
See: http://invasic.informatik.uni-erlangen.de/en/index.php
Thermal and Dark Silicon Management

Thermal Design Power 32 active cores 32 active cores 16 active cores
TDP = 220 W @3.2GHz @3.2 GHz @3.6 GHz
Ptotal=214 W Ptotal=214 W 16 active cores
Tcritical = 80 °C @2.8 GHz
Ptotal=218 W
[°C]
83
Active Core
X
Dark Core 77
71
65
Thermal Location of Dark V/F Levels of

Violation under Cores affects Active Cores
Courtesy: Jörg Henkel, affect Tpeak
TDP Tpeak
KIT Karlsruhe, Khdr, DAC15 [11]
128
ARM big.LITTLE
Cortex A-15 (The Big) Cortex A-7 (The Little)

Source: P.Greenhalgh [12]
ARM big.LITTLE
Source: P.Greenhalgh [12]
129
Samsung Example
Source:A. Frumusanu, R. Smith. (2015, February). ARM A53/A57/T760 investigated - Samsung Galaxy Note 4
Exynos Review. Available: http://www.anandtech.com/show/8718/the-samsung-galaxy-note-4-exynos-
review/
Power
Bus/Interconnect Segmentation dyn. stat.
Architecture
• Segmented Bus structure

 Reduction of resource
sharing
Bridg
Bus
 Reduction of switched
capacitance
 Independent
transactions on both
segments
130
Power
Design Partitioning dyn. stat.
System
Spatially Global Spatially Local
 Reduced # of global bus accesses

 Reduced buffer power
 Reduced # of multiplexers
 Average Power reduction: 18.5 %
Area Reduction: 1%
Source: D.Stankovic [13]
Crosstalk / Signal Integrity
A
capacitive coupling
D
C
B
B=1
C
Superfluous transitions on
D result in additional Pcap
D
131
Power
Fighting Crosstalk dyn. stat.
Architecture
Shielding
wire
GND
VDD Shielding
layer
GND
Substrate (GND)
Power
Low Voltage Signaling on Interconnect / I/O
dyn. stat.
Architecture
RTL/Logic
Circuit
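$$ t_{pd} \sim \frac{V_{DD} \cdot C_L}{I_D} $$
$$ P_{cap} = \alpha \cdot f \cdot C_L \cdot V_{DD}^2 $$
$$ V_{DD} \rightarrow V_{DD}^* = \frac{V_{DD}}{N} \quad \text{with } I_D^* = I_D \text{ by transistor sizing} $$
$$ f^* = \frac{1}{t_{pd}^*} = N \cdot f \quad \text{with } P_{cap}^* = \frac{P_{cap}}{N} $$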
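The intermediate steps behind the low-voltage signaling result can be written out as follows. This is a sketch that reads the slide's relations as t_pd ~ V_DD C_L / I_D and P_cap = α f C_L V_DD², and assumes that the load capacitance C_L stays unchanged when the driver is resized.

```latex
\begin{align*}
t_{pd}^* &\sim \frac{V_{DD}^* \cdot C_L}{I_D^*}
          = \frac{(V_{DD}/N) \cdot C_L}{I_D}
          = \frac{t_{pd}}{N}
  &&\Rightarrow\quad f^* = \frac{1}{t_{pd}^*} = N \cdot f \\
P_{cap}^* &= \alpha \cdot f^* \cdot C_L \cdot \left(V_{DD}^*\right)^2
           = \alpha \cdot N f \cdot C_L \cdot \frac{V_{DD}^2}{N^2}
           = \frac{P_{cap}}{N}
\end{align*}
```

Reducing the signal swing by N therefore allows an N times higher signaling rate while the capacitive power still drops by a factor of N, under the stated assumptions.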
132
MPEG-4 Video Frame Coding Principles
• I-frames
 Least compressible frames but don't require other
video frames to decode.
• P-frames
 Can use data from previous frames to decompress
 Are more compressible than I-frames.
• B-frames
 can use both previous and forward frames for data
reference to get the highest amount of data
compression.
Source: Wikipedia – The free Encyclopedia
Power
Processor Voltage/Frequency Scaling dyn. stat.
System
MPEG-4 Interframe scaling
 Frequency scaling is 3 instructions (at most), 500 – 700 ns typical latency
 Voltage scaling at up to 10 mV/µs without PLL relock
 Operation uninterrupted by power scaling
[Figure: total chip power, logic power and I/O power while Logic VDD is scaled between 2.0 V and 1.0 V (1 V per 100 µs), Dhrystone 2.1 code running 400 loops per cycle in background; CPU MHz 266 / 66 / 266, Memory MHz 133 / 66 / 133]
Source: IBM PowerPC 405 LP NetSeminar
133
Power
DVFS: Multicore Scheduling and Allocation
dyn. stat.
System
E = 2 units
E = 2 units
A1 A4 Tclk A1
A4 A5
A2 A5 A2
A3 A3
∑
=
_ ∑ ∗
Power
Pipelining dyn. stat.
Architecture
RTL/Logic
Pipelining
 Constant clock rate preserves throughput
 Relaxation of logic timing requirements. Allows lowering of Vdd.
 Logic power saving overcompensates power for additional registers.
[Figure: logic block f(in) between clocked registers, split into pipeline stages f1(in), f2(f1(in)), …, fn(fn-1(…f1(in))) = f(in)]
134
Memory Architecture: Array-Structure
2 L-K Bit Line

Storage Cell
AK
Row Decoder
A K+1 Word Line
A L-1
M.2 K
Sense Amplifiers / Drivers Amplify swing to rail-

to-rail amplitude
A0
Column Decoder Select appropriate
A K-1
word
Input-Output
(M bits)
Power
Hierarchical Memory Architecture dyn. stat.
Architecture
Row
Address
Column
Address
Block
Address
Global Data Bus

I/O
Advantages:
1. Shorter wires within blocks
2. Block address activates only 1 block => power savings
Source: J. Rabaey [14], A. Macii [19]
135
Power
Clock Gating: Circuit / Logic Optimization dyn. stat.
RTL/Logic
Clock Gating
 Toggle registers only when outputs can change
 Majority of dynamic power saving is on the clock tree
 20 to 50 % reduction in active power possible
Concerns
 Additional skew on clock
 Testability
[Figure: registers A0, A1, A2 with xor-based clock gating unit; inputs x and clk, output f(x,S)]
Clock Gating is the Baseline Power Saver

Relative Leakage
Temperature (C°)
Dependency of Pstat on
temperature
Source: R.Puri [4]
136
Power
Arithmetic Optimization dyn. stat.
RTL/Logic
Mathematical laws, e.g. distributive law:
More logic and activity vs. less logic activity
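One way to read the distributive-law example is as a trade between two multipliers and one multiplier. The sketch below only demonstrates the functional equivalence; the operator counts in the comments and the function names are illustrative assumptions, not a power model.

```python
def expanded(a, b, c):
    """a*b + a*c: two multiplications and one addition (more logic, more switching)."""
    return a * b + a * c


def factored(a, b, c):
    """a*(b + c): one addition and one multiplication (less logic, less switching)."""
    return a * (b + c)


# Both forms compute the same value for integer operands.
print(expanded(3, 4, 5), factored(3, 4, 5))
```

Since a multiplier is far larger than an adder, the factored form typically switches less capacitance per operation.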
Power
Threshold Control dyn. stat.
Transistor
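Substrate bias Vsubstrate shifts threshold voltage Vt
[Figure: Dyn. Power Management drives power management logic, which sets Vsubstrate for logic with VTCMOS transistors between Vdd and gnd; Id and Ileak plotted over Vgs for threshold voltages Vt1, Vt2, Vt3]
$$ I_D = I_0 \, e^{(V_{GS} - V_t)/(n V_{Temp})} \left(1 - e^{-V_{DS}/V_{Temp}}\right); \quad V_{GS} < V_t $$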
137
Technology Change for Low Power
MRAM
Source: K. Arabi [10], Qualcomm
FinFET vs. FDSOI Tech.
Source: A.B. Kahng
[16]
Source: R. G. Dreslinski [15] Near Threshold Computing
Power Optimization Conclusions
There is no such thing as a “free lunch” in Low Power Design
 Trade off power against performance, area, cost
 Power optimization necessary & meaningful at all abstraction layers
 SoC system designers:
  Exploit low power improvements from lower layers
  Be sensitive to power optimizations at RTL and higher abstraction layers
138
References

[2] J. Rabaey, Low Power Essentials, 2009, Springer
[3] D. Chinnery, K. Keutzer, Closing the Power Gap Between ASIC and
Custom, Springer, 2007
[4] R. Puri, L. Stok and S. Bhattacharya, "Keeping hot chips cool,"
Proceedings. 42nd Design Automation Conference, 2005., Anaheim,
CA, 2005, pp. 285-288
[5] D. Perlmutter, "Sustainability in silicon and systems development," 2012
IEEE International Solid-State Circuits Conference, San Francisco, CA,
2012, pp. 31-35
[6] M. Sheets et al., "A power-managed protocol processor for wireless
sensor networks," Digest of Technical Papers VLSI06, pp. 212–213, June
2006
[7] A. DeHon, Reconfigurable Architectures for General Purpose

Computing, PhD Thesis, MIT, 1996
[8] A. Cuomo, Semiconductor Challenges, DATE03 Keynote, March 03,
http://www.date-
conference.com/conference/2003/keynotes/andrea/andrea.pdf
[9] A. Das et al., Reinforcement learning-based inter- and intra-application

thermal optimization for lifetime improvement of multicore systems, DAC
2014
[10] K. Arabi, Low power Design Techniques in Mobile Processors,
Qualcomm Presentation, 2014
[11] H. Khdr, S. Pagani, M. Shafique, J. Henkel. “Thermal Constrained
Resource Management for Mixed ILP-TLP Workloads in Dark Silicon
Chips”. In: 52nd Design Automation Conference (DAC). June, 2015
139
[12] P.Greenhalgh, Big.LITTLE Processing with ARM Cortex™-A15 & Cortex-

A7 Improving Energy Efficiency in High-Performance Mobile Platforms,
ARM white paper, September 2011
[13] D.Stankovic, Low Power Design course, University of Nis, Serbia, 2005
[14] J. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits,
2003
[15] R. G. Dreslinski et al., "Centip3De: A 64-core, 3D stacked, near-
threshold system," 2012 IEEE Hot Chips 24 Symposium (HCS),
Cupertino, CA, 2012, pp. 1-30.
[16] A.B. Kahng, “A Roadmap for Low-Power Design:Trends, Technology,
Tools”, EDPS-2015 Keynote
[17] A. Iranfar, S. N. Shahsavani, M. Kamal and A. Afzali-Kusha, "A heuristic
machine learning-based algorithm for power and thermal management
of heterogeneous MPSoCs," 2015 IEEE/ACM International Symposium
on Low Power Electronics and Design (ISLPED), Rome, 2015, pp. 291-
296.
[18] X. Lin, Y. Wang and M. Pedram, "A Reinforcement Learning-Based Power

Management Framework for Green Computing Data Centers," 2016 IEEE
International Conference on Cloud Engineering (IC2E), Berlin, 2016, pp.
135-138
[19] A. Macii, “Memory Organization for Low-Energy Embedded Systems,” in
Low-Power Electronics Design, C. Piguet, Editor, Chapter 26, CRC Press,
2005
140
Voluntary Appendix:
SoC Arithmetic Building Blocks

Outline
• Arithmetic Building Blocks

 Adders, Multiplier, Shifter, Multiplexer
141
Bit-sliced Datapath
Arithmetical operations are performed on data words
 Typical widths: 8, 16, 32, 64 bits
 Words are provided by and written to CPU register file
Frequently, same operation is performed on each bit
 Shift right: $\forall b:\ t(b) = s(b + 1)$
 Bit-sliced Datapath
 Foundation for fast, parallel instruction execution
[Figure: two 32-bit operand words (bits 31 … 00) feeding an add, mult, shift, … unit that produces a 32-bit result word]
Binary Full Adder: Truth Table

[Figure: 1-bit Full Adder (FA) with inputs A, B, Cin and outputs S, Cout]
| A | B | Cin | Cout | S | carry status |
| 0 | 0 | 0 | 0 | 0 | kill |
| 0 | 0 | 1 | 0 | 1 | kill |
| 0 | 1 | 0 | 0 | 1 | propagate |
| 0 | 1 | 1 | 1 | 0 | propagate |
| 1 | 0 | 0 | 0 | 1 | propagate |
| 1 | 0 | 1 | 1 | 0 | propagate |
| 1 | 1 | 0 | 1 | 0 | generate |
| 1 | 1 | 1 | 1 | 1 | generate |
Useful intermediate signals to determine Cout:
 Depend only on primary inputs
G = A & B
P = A ⊕ B
K = !A & !B
S = A ⊕ B ⊕ Cin = P ⊕ Cin
Cout = A & B | A & Cin | B & Cin = G | P & Cin
142
Ripple-Carry Adder
A3 B3 A2 B2 A1 B1 A0 B0
Critical Path
Cout = C4 FA FA FA FA C0 = Cin= 0
S3 S2 S1 S0
Condition for worst case computation time: G0 = 1, Pi = 1 for 0 < i < N
tadd = t(A0,B0 → Cout) + (N-2) t(Cin → Cout) + t(Cin → S) ≈ (N-1) tcarry + tsum
tadd = O(N): linearly proportional to word width
Adder computation time is dominated by carry propagation delay!
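The full-adder equations and the ripple-carry structure can be checked with a small bit-level model. This is a behavioural sketch, not a timing model; the helper names `to_bits` and `from_bits` and the LSB-first list convention are assumptions made for the example.

```python
def full_adder(a, b, cin):
    """1-bit full adder built from generate/propagate: S = P ^ Cin, Cout = G | P & Cin."""
    g = a & b          # generate
    p = a ^ b          # propagate
    s = p ^ cin
    cout = g | (p & cin)
    return s, cout


def ripple_carry_add(a_bits, b_bits, cin=0):
    """Add two equally long LSB-first bit lists; the carry ripples from stage to stage."""
    carry = cin
    s_bits = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        s_bits.append(s)
    return s_bits, carry


def to_bits(x, n):
    """Integer to LSB-first list of n bits."""
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits):
    """LSB-first bit list back to an integer."""
    return sum(bit << i for i, bit in enumerate(bits))


s_bits, c4 = ripple_carry_add(to_bits(0b1011, 4), to_bits(0b0110, 4))
print(from_bits(s_bits), c4)   # 11 + 6 = 17 -> sum bits 0001, carry out 1
```

The loop makes the O(N) dependency visible: each stage needs the carry of the previous stage before it can produce its own.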
143
FA: Logic Gate Implementation
[Figure: FA gate-level schematic with inputs A, B, Cin and outputs S, Cout]
P = A ⊕ B
G = A & B
S = P ⊕ Cin
Cout = G | P & Cin
FA: Static MOSFET Implementation

[Figure: static CMOS full-adder transistor schematic with inputs A, B, Cin, internal node X and outputs S, Cout]
 Total of 28 MOSFET transistors
 Two inverter stages between Cin and Cout
Cout = AB + BCin + ACin = AB + (B + A) Cin
S = ABCin + !Cout (A + B + Cin)
144
Ripple-Carry Adder/Subtractor
 Subtraction – complement all subtrahend bits (xor gates) and set the low order carry-in
 RCA summary:
  advantage: simple logic, small area (low cost), straightforwardly expandable
  disadvantage: slow (O(N) for N bits), lots of signal transitions (energy wasteful)
[Figure: 4-bit ripple-carry adder/subtractor: the add/subt control drives C0 = Cin and the xor gates on B0…B3; FA stages produce S0…S3 with carries C1…C3 and C4 = Cout]
FA Inversion Property
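Inverting all inputs of a FA results in inverted values on all FA outputs
 Reduces number of inverters in carry path by one
 FA’: FA without INV in carry path
| A | B | Cin | Cout | S | carry status |
| 0 | 0 | 0 | 0 | 0 | K |
| 0 | 0 | 1 | 0 | 1 | K |
| 0 | 1 | 0 | 0 | 1 | P |
| 0 | 1 | 1 | 1 | 0 | P |
| 1 | 0 | 0 | 0 | 1 | P |
| 1 | 0 | 1 | 1 | 0 | P |
| 1 | 1 | 0 | 1 | 0 | G |
| 1 | 1 | 1 | 1 | 1 | G |
[Figure: chain of FA’ cells with inputs A0…A2, B0…B2, carries C0…C3 and sums S0…S2; an FA with inverted inputs is equivalent to an FA with inverted outputs]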
145
Manchester Carry-Chain Adder
Fast carry propagation through transmission gate logic design
 Ci+1 follows Ci for Pi = 1
 Ci+1 is locally generated or killed (follows Gi) when Pi = 0
Attention:
 Transmission gate logic isn’t regenerative
 Signal noise on Ci directly propagated to Ci+1
Ci+1 = Gi | Pi & Ci
[Figure: Manchester carry-chain bit slice: Ai and Bi form Pi and Gi, transmission gates connect Ci to Ci+1, sum output S]
Carry-Bypass Adder: Concept
A3 B3 A2 B2 A1 B1 A0 B0
C4
FA FA FA FA C0 = Cin
Cout
S3 S2 S1 S0
BP = P0 P1 P2 P3 “Block Propagate”
If (P0 & P1 & P2 & P3 = 1) then Co,3 = C4 = Ci,0

Otherwise the block itself kills or generates the carry internally
146
CBA: Block Propagate Generation
BP P3 P2 P1 P0
Cout Cin
G3 G2 G1 G0
BP
Manchester Carry-Chain Bypass

 BP-path certainly faster than regular “ripple” path
 Area overhead for transmission gates typically between 10 and 20%
 BP-path breaks “bit-sliced” structure
 Conversion to N-bit group slices
 Output INV for signal regeneration
 … feeding FA‘ in next stage
Carry-Bypass Adder: 16-Bit Adder

bits 12 to 15 bits 8 to 11 bits 4 to 7 bits 0 to 3
Setup Setup Setup Setup
G P G P G P G P
Carry Carry Carry Carry
Propagation Propagation Propagation Propagation
Ci,0
Ci P Ci P Ci P Ci P
Sum Sum Sum Sum
Worst-case delay: Carry from bit 0 to bit 15 = carry generated in bit 0,

propagates through bits 1, 2, and 3, skips the middle two groups (B: group
size in bits), propagates in the last group from bit 12 to bit 15
tCBA = tsetup + B tcarry + ((N/B) - 1) tskip + (B -1) tcarry + tsum
147
Carry-Bypass Adder
 tCBA still O(N), but with more gradual slope
 Significant tadd reduction for larger N
 Higher overhead (and hence higher tadd) for small N due to additional MUX in carry path
 tadd limiting factor:
  Sequential availability of Cin between and within bit blocks
[Figure: tadder over N for ripple adder and bypass adder; the curves cross at N ≈ 4…8]
How to achieve a sub-linear dependency of tadd over N?
Carry-Select Adder: Concept

 Precompute the carry out of each block for both carry_in = 0 and carry_in = 1 (can be done for all blocks in parallel)
 Then select the correct one (takes one MUX delay rather than B-bit ripple)
tCSA = tsetup + B tcarry + (N/B) tmux + tsum
 Implies an approx. 30% area overhead
[Figure: 4-bit carry-select block: setup (A’s, B’s to P’s, G’s), “0” carry propagation, “1” carry propagation, Cout multiplexer selected by Cin, sum generation (C’s to S’s)]
148
Carry-Select Adder: 16-bit Adder
A’s B’s A’s B’s A’s B’s A’s B’s

P’s G’s P’s G’s P’s G’s P’s G’s
“0” carry 0
“0” carry 0 “0” carry “0” carry 0
0
“1” carry “1” carry 1

1 1
mux mux mux mux

Cout Cin
C’s C’s C’s C’s
Sum gen Sum gen Sum gen Sum gen
S’s S’s S’s S’s

Carry-Select Adder: 16-bit Adder

A’s B’s A’s B’s A’s B’s A’s B’s

P’s G’s P’s G’s P’s G’s P’s G’s (1)
“0” carry 0
“0” carry 0 “0” carry “0” carry 0
0
(5) (5) (5) (5)
1 1
(5) (5) (5) (5)
mux mux mux mux
Cout (8) (7) (6) Cin
(9) C’s C’s C’s C’s
Sum gen Sum gen Sum gen Sum gen
(10)
S’s S’s S’s S’s
149
Square Root Carry-Select Adder
bits 14 to 19 bits 9 to 13 bits 5 to 8 bits 2 to 4 bits 0 to 1
A’s B’s A’s B’s A’s B’s A’s B’s A’s B’s
Setup Setup Setup Setup Setup

P’s G’s P’s G’s P’s G’s P’s G’s P’s G’s (1)
“0” carry 0
“0” carry 0 “0” carry “0” carry 0 “0” carry 0
0
(7) (6) (5) (4) (3)
“1” carry 1
1 1
(7) (6) (5) (4) (3)
mux mux mux mux mux
Cout (7) (6) (5) (4) Cin
(8) C’s C’s C’s C’s C’s
Sum gen Sum gen Sum gen Sum gen Sum gen
(9)
S’s S’s S’s S’s S’s
150
 Progressively increasing number of bits per block equalizes arrival times at MUX inputs
 Lowers absolute adder delay despite using more stages
 Assume N bit adder consists of P stages and contains M bits in first stage:
$$ N = M + (M+1) + (M+2) + \dots + (M+P-1) = MP + \frac{P(P-1)}{2} = \frac{P^2}{2} + P\,(M - 0.5) \approx \frac{P^2}{2} \quad \text{for } N \gg M $$
$$ t_{SCS} = t_{setup} + M\,t_{carry} + \sqrt{2N}\,t_{mux} + t_{sum} $$
 tSCS = O(√N)
 Sub-linear increase of tadd over adder width N
 Nevertheless, carry still ripples sequentially through MUX stages
Adder Summary
| Adder | Delay | Area | Comment |
| Ripple Carry Adder | O(N) | 1 (norm.) | Simple, modular, delay limits application to small N only |
| Carry Bypass Adder | O(N) | 1.1 – 1.2 | Although CBAs have a linear delay dependency over N, they are much faster than RCAs for larger N |
| Linear Carry Select Adder | O(N) | > 1.3 | |
| Square Root Carry Select Adder | O(√N) | > 1.3 | Approach with sub-linear adder time dependency over N |
| Log. Lookahead Adder | O(log(N)) | >m | Lookahead adders are generally several times larger than RCAs, but have a significant speed advantage for large N |
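The delay expressions given for the individual adder types can be compared numerically. The sketch below uses unit delays for tsetup, tcarry, tskip, tmux and tsum, which is an assumption for illustration only; real values depend on the circuit technology. Function names are hypothetical.

```python
import math


def t_rca(n, t_carry=1.0, t_sum=1.0):
    """Ripple-carry adder: (N - 1) * tcarry + tsum."""
    return (n - 1) * t_carry + t_sum


def t_cba(n, b, t_setup=1.0, t_carry=1.0, t_skip=1.0, t_sum=1.0):
    """Carry-bypass adder with block size B."""
    return t_setup + b * t_carry + (n / b - 1) * t_skip + (b - 1) * t_carry + t_sum


def t_csa(n, b, t_setup=1.0, t_carry=1.0, t_mux=1.0, t_sum=1.0):
    """Linear carry-select adder with block size B."""
    return t_setup + b * t_carry + (n / b) * t_mux + t_sum


def t_scs(n, m, t_setup=1.0, t_carry=1.0, t_mux=1.0, t_sum=1.0):
    """Square-root carry-select adder with M bits in the first stage."""
    return t_setup + m * t_carry + math.sqrt(2 * n) * t_mux + t_sum


for width in (16, 32, 64):
    print(width, t_rca(width), t_cba(width, 4), t_csa(width, 4), round(t_scs(width, 2), 1))
```

With unit delays the 16-bit linear carry-select adder evaluates to 10, which matches the last annotation (10) in the annotated 16-bit carry-select figure.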
151
Multiply Operation
Multiplication as repeated additions
$$ x \cdot y = \sum_{j=0}^{M-1} \sum_{i=0}^{N-1} x_i \, y_j \, 2^{i+j} $$
 Partial product generation
  Easy in binary representation:
  All-0 for yj = 0
  Multiplicand offset j positions to left for yj = 1
 Addition of partial products
  M x N-bit additions max
  Skip all-0 partial products
[Figure: N-bit multiplicand, M-bit multiplier, partial product array with M rows, double precision product with N+M bits]
Partial Product Generation
X7 X6 X5 X4 X3 X2 X1 X0
Yi
PP7 PP6 PP5 PP4 PP3 PP2 PP1 PP0
Multiplying multiplicand x with a bit yj of multiplier

 Easy for binary numbers: PP = x AND yj
152
Sequential Multiplier
Right shift-and-add
 Partial product rows accumulated from least to most significant bit on an N-bit adder
 After each addition, shift accumulated partial product to right in order to align it with the next row to add
 Time for NxN bits
  Tserial_mult = O(N Tadder) = O(N²)
 Design (area) complexity
  One N-bit adder and single-bit shifter
 No straightforward pipeline structure
Example: 1010 x 1101
T= 0; 1010
T= 1; 01010
T= 2; 001010
T= 6; 110010
T= 7; 0110010
T= 11; 10000010
T= 12; 10000010
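The right shift-and-add scheme can be modelled in software as follows. This is a behavioural sketch for unsigned operands under the assumption that the accumulator keeps the adder's carry-out as an extra bit; the function and variable names are hypothetical, and the timing steps T of the slide example are not modelled.

```python
def shift_add_multiply(x, y, n):
    """Multiply two n-bit unsigned integers with right shift-and-add.

    acc models the register fed by the n-bit adder (plus carry-out),
    low collects the product bits that were already shifted out.
    """
    acc = 0
    low = 0
    for j in range(n):
        if (y >> j) & 1:
            acc += x                 # add the multiplicand row for multiplier bit j
        low |= (acc & 1) << j        # LSB of the accumulator is final product bit j
        acc >>= 1                    # shift right to align with the next row
    return (acc << n) | low


product = shift_add_multiply(0b1010, 0b1101, 4)
print(format(product, "08b"))        # operands of the slide example: 10 x 13
```

Each loop iteration corresponds to one pass through the N-bit adder, so N iterations with a ripple-carry adder give the O(N²) time quoted above.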
Array Multiplier
 Faster than sequential shift-and-add, but more costly in terms of area
 All partial products generated in parallel and organized in adder array with respective offset
 Partial product addition
  Typically no single worst case path, but multiple paths with same (max) length
$$ t_{mult} \approx [(M-1) + (N-2)]\,t_{carry} + (N-1)\,t_{sum} + t_{and} $$
[Figure: 4x4 array multiplier: AND rows of X3…X0 with Y0…Y3, rows of HA/FA cells, product bits Z0…Z7]
153
Partial Product Reduction
 Standard Array Multiplier requires N-1 x N-bit-adders
 “All-0” partial products can be omitted reducing number of PPs
 Booth recoding: N-bit multiplier can be reduced to max. of N/2 “1” partial products
 Principle: $01110_2\ (14_{10}) = 10000_2\ (16_{10}) - 00010_2\ (2_{10})$
 Requires capability to perform subtractions
Shifter
Programmable shifter
 Single-bit left/right shift operations through individual pass transistors operated by separate control lines
 Output drivers to fully regenerate logic levels
 Cascading several single-bit shifters to build multi-bit shifter rapidly becomes complex and slow
 Not practical for larger shift values
[Figure: bit-slice i with control lines Right, Nop, Left connecting Ai, Ai-1 to Bi, Bi-1]
154
Barrel Shifter
 Number of rows indicates word width, …
 … number of columns the maximum shift offset
 Word bits have to pass through maximum one transmission gate
 Propagation delay (theoretically) constant
 Area generally dominated by wiring (not active MOSFETs)
 One-hot shift signal
[Figure: 4-bit barrel shifter with data wires A3…A0 to B3…B0 and control wires Sh0…Sh3]
Logarithmic Barrel Shifter
 Number of rows indicates

word width, …
 … number of columns the
base two logarithm (ld) of
largest shift offset
 Word bits have to pass

through exactly ld(max.
shift size) multiplexers
 Propagation delay
(theoretically) constant
 Area dominated by wiring
(not multiplexers)
 Binary shift signal
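The staged structure of the logarithmic barrel shifter maps directly onto a loop over the bits of the binary shift signal. This is a behavioural sketch for a logical right shift with zero fill; the function name and the default word width of 8 are assumptions.

```python
def log_barrel_shift_right(word, shift, width=8):
    """Logical right shift built from ld(width) stages.

    Stage k either passes the word unchanged or shifts it by 2**k,
    selected by bit k of the binary shift signal.
    """
    stages = (width - 1).bit_length()      # ld(max. shift size) multiplexer stages
    value = word & ((1 << width) - 1)      # restrict to the word width
    for k in range(stages):
        if (shift >> k) & 1:
            value >>= (1 << k)
    return value


print(format(log_barrel_shift_right(0b10110101, 3), "08b"))
```

Every word bit passes through the same number of stages regardless of the shift amount, which is why the propagation delay is (theoretically) constant.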
155
Other Arithmetic Operators
Constructed from concepts introduced with adders
 Subtractor
  Inverted two‘s complement adder with Cin,0 = 1
 Comparator
  Use MSB of subtractor as sign-indicator for In1 ≥ In2
[Figure: a) Subtractor: two's complement add of In1 and inverted In2 with Ci,0 = 1 yields In1 – In2; b) Comparator: same structure, sign bit SN-1 indicates In1 ≥ In2]
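Both operators can be sketched on top of an adder model: subtraction adds the bitwise inverted subtrahend with carry-in 1, and the comparator inspects the sign bit of the difference. The sketch assumes unsigned n-bit inputs and widens the subtraction by one bit so that the MSB is a valid sign indicator; the function names are hypothetical.

```python
def subtract(in1, in2, n):
    """In1 - In2 modulo 2**n via In1 + ~In2 + 1 (two's complement add with Cin = 1)."""
    mask = (1 << n) - 1
    return (in1 + (~in2 & mask) + 1) & mask


def greater_equal(in1, in2, n):
    """In1 >= In2 for unsigned n-bit inputs, read from the sign bit of the difference.

    The subtraction is done on n+1 bits so that bit n acts as the sign bit.
    """
    diff = subtract(in1, in2, n + 1)
    return ((diff >> n) & 1) == 0


print(subtract(9, 5, 4), greater_equal(9, 5, 4), greater_equal(5, 9, 4))
```

The extra bit is a design choice of this sketch; with signed operands of sufficient headroom the MSB of the n-bit result can be used directly.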
Multiplexer / Demultiplexer
N-1 MUX: Selecting bits/words from different sources
 E.g.: memory read control logic
 if n=2 => S is one bit
 Z = ¬S A1 + S A2
1-N deMUX: Distributing bits/words to different destinations:
 E.g.: memory write control logic
 if n=2
 Z1 = ¬S A
 Z2 = S A
[Figure: N-1 MUX with inputs A1…An, select S = S1..Sm and output Z; 1-N deMUX with input A, select S = S1..Sm and outputs Z1…Zn]
156
References
[1] R. J. Baker et al., CMOS circuit design, layout, and simulation, IEEE
Press, 1998. ISBN 0-7803-3416-7
[2] N. H. E. Weste et al., Principles of CMOS VLSI Design, Addison
Wesley, 1993. ISBN 0-201-53376-6
Picture credits: www.maxmon.com
157
