
Contents

Introduction
SoC Logic Design Recap
SoC Paradigm
Processor Architecture
Memory
Interconnect
Low Power Design
SoC Arithmetic Blocks

System-on-Chip Technologies

A. Herkersdorf, W. Stechele, A. Surhonne
© Lehrstuhl für Integrierte Systeme, Theresienstr. 90, Building N1, 2nd floor
www.lis.ei.tum.de

Introduction

Organisational matters

Lectures (A. Herkersdorf)
• Room 0360, Wednesday 13:15 - 14:45
Tutorials (A. Surhonne)
• Room 1260, Thursday 09:45 - 10:30
Registration in TUMonline is required:
• Registration link available at www.lis.ei.tum.de/?id=soct
News, handouts and course materials:
• http://www.moodle.tum.de
Exam:
• Final written exam at the end of the semester (75 min.), accounts for 100% of the total grade; calculators and one A4 sheet allowed (no laptops or smartphones)
• You MUST bring your student ID AND passport!

SoCT – Introduction – 3 © Lehrstuhl für Integrierte Systeme

Reading material / Literature

Course handouts and lecture notes


 Recommended as primary reference during course
 Sufficient for exam preparation

Text books
 Digital Integrated Circuits - A Design Perspective, J. Rabaey,
Prentice Hall
 Computer Architecture. A Quantitative Approach, J. Hennessy, Elsevier

SoCT – Introduction – 4 © Lehrstuhl für Integrierte Systeme

3
What is this course about?

Basics of digital System-on-Chip (SoC)


 Hierarchical composition of SoC platforms
 … based on foundation of technology scaling and CMOS circuitry
 Gate array, ASIC, FPGA
 Insight to the main components:
 RISC processor
 Bus/FIFO Interconnect
 On-/Off-Chip Memory
 Low Power design principles

SoCT – Introduction – 5 © Lehrstuhl für Integrierte Systeme

Foundation of CMOS Scaling


1947: First transistor, John Bardeen and Walter Brattain (Bell Laboratories)
1958: First integrated circuit (IC), Jack Kilby (TI), Robert Noyce (Fairchild)
2002: PowerPC 970, 0.13 µm, 1.8 GHz, 52 M transistors (IBM [1])

[Figure: simplified CMOS transistor model with source, drain, gate width W, minimum gate length Lmin and oxide thickness tOX. From one technology generation to the next, W and Lmin each shrink by a constant factor: W → W' → W'', Lmin → L'min → L''min.]

SoCT – Introduction – 6 © Lehrstuhl für Integrierte Systeme

4
Moore’s Law: CMOS Scaling, Good News

1965: Gordon Moore forecasts that chip capacity doubles every 18 – 24 months (66 % CAGR).

[Figure: chip capacity (transistors per chip, 10^3 to 10^13) and transistor gate length L (µm, 50 down to 0.01) versus year, 1972 to 2020. DRAM chips grow from 4k to 32G. Source: ITRS Roadmap 99, 09]

SoCT – Introduction – 7 © Lehrstuhl für Integrierte Systeme

Moore’s Law: CMOS Scaling, Good News

[Figure: chip capacity (transistors per chip) and transistor gate length L (µm) versus year, 1972 to 2020, for DRAM chips (4k to 32G) and microprocessors (8008, 8086, 80286, 80386, 80486, Pentium, Pentium II, Pentium III, Pentium 4, Core 2 Duo, Core i7), together with DRAM prices in microcents per bit stored (10^7 down to 10^-2). Source: ITRS Roadmap 99, 09]

SoCT – Introduction – 8 © Lehrstuhl für Integrierte Systeme

5
Moore’s Law: CMOS Scaling, Challenges

[Figure: chip capacity (transistors per chip), transistor gate length L (µm) and power density (W/cm², 0.1 to 1000) versus year, 1972 to 2020, for microprocessors from the 8008 to the Core i7. Power density reference levels: hot plate, nuclear reactor, rocket nozzle. Challenge areas: power dissipation, performance, frequency, number of cores. Source: ITRS Roadmap 99, 09]

SoCT – Introduction – 9 © Lehrstuhl für Integrierte Systeme

If Moore’s Law…

… had held in other industries or in our daily life for the last 30 years…

• the capacity of a 1.5 V AA battery would be around 2 MWh today (the energy consumption of a 4-person household for half a year),
• the flight between Munich and Singapore would take 2 seconds, and
• the tuition fee for one semester at Harvard would be just 0.80 €.

SoCT – Introduction – 10 © Lehrstuhl für Integrierte Systeme

6
Intel: Moore‘s Law is Forever ! … Really?

SoCT – Introduction – 11 © Lehrstuhl für Integrierte Systeme

So what!

SoCT – Introduction – 12 © Lehrstuhl für Integrierte Systeme

7
SoC Technologies

CMOS Technology: what determines
- the speed of CMOS?
- CMOS power consumption?
- the costs of CMOS?
[Figure: MOS transistor cross-section with source, drain, Vdd, w, tOx, Lmin]

Processing Data:
- FSM, adder, multiplier, mux, shifter, registers, RISC core
[Figure: Intel Pentium I [2]]

Transport of Data:
- on-chip buses
- FIFO interconnect
[Figure: IBM CMOS7S Cu 0.13µm [1]]

Storing Data:
- SRAM, DRAM, ROM, FLASH
- capacity, timing, latency, access BW
[Figure: 8 Mbit SRAM chip [1]]

SoCT – Introduction – 13 © Lehrstuhl für Integrierte Systeme

Hardware Platforms

Full-Custom ASIC:
- Highest performance
- Highest cost
- Highest development effort
[Figure: Intel Pentium I [2]]

Std. Cell ASIC:
- High performance
- High cost
- High development effort

FPGA:
- Function determined after manufacturing
- Good performance
- Medium development effort
- Medium capacity
[Figure: Xilinx Virtex-II Pro XC2VP20 FPGA]

System-on-Chip:
- Maximizes reuse of existing macros
- High performance
- High capacity
- Medium development effort
[Figure: System-on-Chip platform with µ-processor, application-specific logic, eRAM, RAM controller, I/O and NoC]

SoCT – Introduction – 14 © Lehrstuhl für Integrierte Systeme

8
Application in our daily life

Consumer Electronics:
- Laptops with “server”-class processing power
- Smartphones, mobile appliances
- UMTS, 802.11

Automotive:
- Driving, internetworked “supercomputer”
- navigation,
- distance control,
- lane departure warning
[Figure: Continental-TEMIC]

Communications & Networking:
- Terabit Internet backbone router
- Broadband wired/wireless access
- ad-hoc nets among mobile devices

SoCT – Introduction – 15 © Lehrstuhl für Integrierte Systeme

SoC Design Challenges

Microscopic issues (technical)
• Ultra-high speeds
• Power dissipation and supply rail drop
• Growing importance of interconnect
• Noise, crosstalk
• Reliability, manufacturability
• Clock distribution

Macroscopic issues (economical)
• Time-to-market
• Design complexity (millions of gates)
• High levels of abstractions
• Design for test
• IP reuse, portability
• EDA tool interoperability

Source: J. Rabaey [3], M. Irwin [4]

SoCT – Introduction – 16 © Lehrstuhl für Integrierte Systeme

9
Key Metrics of SoC

SoC design is a multi-objective optimization problem!

• Performance
  - Speed (delay, clock frequency)
  - Throughput
• Power consumption (static, dynamic)
• Cost
  - NRE (fixed) costs: design effort, mask production
  - RE (variable) costs: chip area, package, test
• Reliability, robustness
  - Signal-to-noise ratio, noise margins and immunity
  - Mean time between failures (MTBF) = 1 / failure rate
• Time-to-market
  - Time from product idea to shipment (market research, specification, development, fabrication, test)

SoCT – Introduction – 17 © Lehrstuhl für Integrierte Systeme

Outlook to SoC Platforms


[Figure: network landscape in which SoC platforms are deployed: enterprise LAN/SAN with application, storage and edge servers, switches and load balancers; xSP/ASP server farms; WAN/Internet core with core and edge routers and Sonet/SDH transmission; access networks (xDSL, cable, ISDN, 802.11, Home RF) with home gateways; VoIP/PSTN gateways; wireless networks with base stations, base station controllers, mobile switching centers and mobile clients; control processors]
SoCT – Introduction – 18 © Lehrstuhl für Integrierte Systeme

10
Outlook to SoC Platforms

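[Figure: network box with line interface cards (line interface, network processor, switch fabric interface) connected over a backplane to a switch fabric and a system processor]

• System processor: control and management of the entire box
• Network processor: determines the box function (switch, router, gateway, …)
• Line interface: terminates physical network links (Ethernet, SONET/SDH, …)
• Switch fabric: interconnect for as many line interface cards as possible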

SoCT – Introduction – 19 © Lehrstuhl für Integrierte Systeme

SoCT Course Outline

Part 1
• Introduction
• SoC Logic Design Recap
• SoC Paradigm

Part 2
• Processor Architecture
• Memory
• Interconnect
• Low Power Design

SoCT – Introduction – 20 © Lehrstuhl für Integrierte Systeme

11
Why CMOS ?
Electrical Reasons
• Low power dissipation
• Noise immunity
• Clean logic levels
• One supply voltage
• Cascadable

Economical Reasons
• Easy to design
• Fabrication well understood
• Highly integrable

[Figure: CMOS fabrication process steps: SiO2 layer on Si wafer, lithography (light through a photolithographic mask onto photoresist), etching of the Poly-Si layer, doping (P or As atoms), deposition]

SoCT – Introduction – 21 © Lehrstuhl für Integrierte Systeme


12
CMOS – What is it?
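[Figure: cross-section of a CMOS inverter: nMOS (n+ regions in p substrate, connected to gnd) and pMOS (p+ regions in n well, connected to Vdd) with common input and output]

MOS: Metal (Poly-Si), Oxide (SiO2), Semiconductor
The C in CMOS signals the combination of p- and n-MOSFETs: Complementary
The channel type gives the prefix of the transistor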
SoCT – Introduction – 23 © Lehrstuhl für Integrierte Systeme

Static CMOS Inverter

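[Figure: static CMOS inverter: pMOS (P) between VDD and output Z, nMOS (N) between Z and gnd, common input A; voltages VA, VZ, VGSp, VDSp, VGSn, VDSn and current Ip]

A Z
0 1
1 0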

SoCT – Introduction – 24 © Lehrstuhl für Integrierte Systeme

13
MOSFET Voltages And Drain Current

Vector conventions* (nMOS and pMOS): gate-source voltage VGS, drain-source voltage VDS, drain current ID

• nMOS: the source is always on the lower potential
• pMOS: the source is always on the higher potential

*Please avoid VSG, VSD, VGD, VDG


SoCT – Introduction – 25 © Lehrstuhl für Integrierte Systeme

MOSFET Output Characteristic

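State, condition and current:

• Off: $V_{GS} \le V_t,\ V_{DS} \ge 0$
  $I_{Dn} = 0$

• Linear: $V_{GS} > V_t,\ 0 \le V_{DS} < V_{GS} - V_t$
  $I_{Dn} = \beta \left( V_{GS} - V_t - \frac{V_{DS}}{2} \right) V_{DS}$

• Saturation: $V_{GS} > V_t,\ V_{DS} \ge V_{GS} - V_t$
  $I_{Dn} = \frac{\beta}{2} \left( V_{GS} - V_t \right)^2$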
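The three operating regions can be sketched as a small function. This is a minimal illustration of the nMOS equations on this slide; the function name and the numeric values for β and Vt are assumptions chosen for the example, not values from the lecture.

```python
def drain_current_nmos(v_gs, v_ds, v_t, beta):
    """Long-channel nMOS drain current for the off, linear and saturation regions."""
    if v_ds < 0:
        raise ValueError("the model on the slide assumes V_DS >= 0")
    v_ov = v_gs - v_t                      # overdrive voltage V_GS - V_t
    if v_ov <= 0:
        return 0.0                         # off
    if v_ds < v_ov:
        return beta * (v_ov - v_ds / 2.0) * v_ds   # linear region
    return 0.5 * beta * v_ov ** 2          # saturation region


# Illustrative (assumed) parameters: beta in A/V^2, voltages in V
BETA = 200e-6
V_T = 0.5
for v_gs, v_ds in [(0.3, 1.0), (1.5, 0.2), (1.5, 1.2)]:
    print(v_gs, v_ds, drain_current_nmos(v_gs, v_ds, V_T, BETA))
```

At the boundary V_DS = V_GS − V_t both expressions give the same current, so the characteristic is continuous.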
SoCT – Introduction – 26 © Lehrstuhl für Integrierte Systeme

14
MOSFET Output Characteristics

[Figure: output characteristics ID over VDS for nMOS (VGS = const. > Vtn) and pMOS (VGS = const. < Vtp), each with a linear and a saturation region; the boundary lies at VDS = VGS - Vtn and VDS = VGS - Vtp, respectively]
SoCT – Introduction – 27 © Lehrstuhl für Integrierte Systeme

DC Operation

Voltage Transfer Characteristics (VTC)


Plot of output voltage as a function of the input voltage

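[Figure: inverter VTC, V(y) = f(V(x)), between GND and VDD, with VIL and VIH marked on the V(x) axis. The switching threshold VM is the intersection with the line V(y) = V(x).]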

SoCT – Introduction – 28 © Lehrstuhl für Integrierte Systeme

15
Static Voltage Transfer Curve (VTC)
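[Figure: static VTC of the CMOS inverter, VZ over VA from 0 to VDD, with switching threshold Vth. N is off below Vtn and on above; P is on below VDD-|Vtp| and off above. The current Ip flows where both transistors are on.]

A Z
0 1
1 0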

SoCT – Introduction – 29 © Lehrstuhl für Integrierte Systeme

MOSFET Dimensioning

[Figure: MOSFET geometry with gate width W, minimum gate length Lmin and oxide thickness tOx]

Transconductance: $\beta = K \frac{W}{L}$ where $K = \mu \frac{\varepsilon_{Ox} \varepsilon_0}{t_{Ox}}$

• Designer’s parameter: W
• Conflicting design goals:
  - Area => W = Lmin
  - Speed => high W, L = Lmin
→ always use Lmin for digital circuits

• Technology parameters:
  - Mobility: µn = 1.5 .. 3.5 x µp; the designer uses W to compensate for the lower current drive of pMOSFETs
  - Minimum feature size: Lmin
  - Oxide dielectric/thickness: εox, tox
SoCT – Introduction – 30 © Lehrstuhl für Integrierte Systeme

16
Noise Margins

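For robust circuits, we want the “0” and “1” intervals to be as large as possible.

• Noise margin high: NMH = VDD - VIH
• Noise margin low: NML = VIL

[Figure: signal levels between Gnd and VDD at gate output and gate input: "0" interval from Gnd to VIL, undefined region between VIL and VIH, "1" interval from VIH to VDD]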

SoCT – Introduction – 31 © Lehrstuhl für Integrierte Systeme

Effect of Capacitance on Inv. Delay

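[Figure: inverter modelled as switches with on-resistances Ron,p (to Vdd) and Ron,n (to gnd) driving the load capacitance C; waveforms of VA and VZ = VC over t with the 50% point marked]

Inverter model:
• on: resistance
• off: open switch
• operating in linear region of Sah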
SoCT – Introduction – 32 © Lehrstuhl für Integrierte Systeme

17
Effect of Capacitance on Inv. Delay

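[Figure: inverter modelled as switches with on-resistances Ron,p (to Vdd) and Ron,n (to gnd) driving the load capacitance C; waveforms of VA and VZ over t with the 50% point marked; tpLH = f(C) = ?]

Inverter model:
• on: resistance
• off: open switch
• operating in linear region of Sah

$t_{pLH} \propto \frac{C_{load} \, t_{Ox} \, L_p}{W_p \, \mu_p \, \varepsilon_{Ox} \left( V_{dd} - |V_{tp}| \right)}$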
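A rough numeric sketch of this switch model follows. It assumes, beyond what the slide states, that the on-resistance is approximated by the small-VDS linear-region value 1 / (β (Vdd − |Vtp|)) and that the 50% point of an RC charging curve is reached after ln(2)·R·C. All device values and function names are illustrative assumptions.

```python
import math

EPS_0 = 8.854e-12  # vacuum permittivity in F/m


def on_resistance(mu, eps_ox, t_ox, w, l, v_dd, v_t_abs):
    """Assumed on-resistance of a MOSFET in the linear region at small V_DS."""
    beta = mu * eps_ox * EPS_0 / t_ox * (w / l)   # beta = K * W / L
    return 1.0 / (beta * (v_dd - v_t_abs))


def t_plh(c_load, r_on_p):
    """Time for an RC node to charge to 50% of Vdd through the pMOS."""
    return math.log(2.0) * r_on_p * c_load


# Illustrative (assumed) pMOS: mobility in m^2/Vs, lengths in m, voltages in V
r_p = on_resistance(mu=0.02, eps_ox=3.9, t_ox=2e-9,
                    w=1e-6, l=0.13e-6, v_dd=1.2, v_t_abs=0.3)
print(r_p, t_plh(10e-15, r_p))
```

The sketch reproduces the proportionalities of the formula above: the delay grows linearly with Cload, tOx and Lp, and shrinks with Wp and with the overdrive Vdd − |Vtp|.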
SoCT – Introduction – 33 © Lehrstuhl für Integrierte Systeme

CMOS Power Sources

• Dynamic
  - capacitive
  - short-circuit
  - result of circuit function
  - signal edge dependent
• Static
  - sub-threshold,
  - leakage (reverse diode),
  - gate currents
  - parasitic effects
  - signal level dependent

SoCT – Introduction – 34 © Lehrstuhl für Integrierte Systeme

18
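Dynamic Capacitive Power

• Capacitive power dissipation: $P_{Cap} = \alpha_{01} \, f \, C \, V_{dd}^2$

[Figure: inverter switch model with Ron,p to Vdd and Ron,n to gnd; the load capacitance C is switched at rate α01 f]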
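The formula can be evaluated directly. In this sketch α01 is read as the switching activity (the fraction of clock cycles with a 0→1 output transition), which is an interpretation not spelled out on the slide; the numbers are assumed for illustration only.

```python
def capacitive_power(alpha_01, f_clk, c_load, v_dd):
    """Dynamic capacitive power P_Cap = alpha_01 * f * C * Vdd^2 in watts."""
    return alpha_01 * f_clk * c_load * v_dd ** 2


# Assumed example: 10% activity, 1 GHz clock, 1 nF total switched capacitance
for v_dd in (1.0, 0.5):
    print(v_dd, capacitive_power(0.1, 1e9, 1e-9, v_dd))
```

Because Vdd enters quadratically, halving the supply voltage cuts the capacitive power to a quarter at unchanged frequency and load.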

SoCT – Introduction – 35 © Lehrstuhl für Integrierte Systeme

Short Circuit Power


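[Figure: inverter with input A and output Z between Vdd and gnd; input waveform VA with rise time τr and fall time τf, levels Vtn, Vdd/2 and Vdd+Vtp marked. A short-circuit current i flows between t1 and t2.]

• Short-circuit power dissipation: $P_{Short} = \alpha_{01} \, f \, \beta_n \, \tau \, (V_{dd} - 2 V_{tn})^3$
  (assumptions: $\beta_n = \beta_p$, $V_{tn} = |V_{tp}|$, $\tau_r = \tau_f = \tau$)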

SoCT – Introduction – 36 © Lehrstuhl für Integrierte Systeme

19
Sub-threshold Currents

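Condition: 0 < VGSn < Vtn or 0 < |VGSp| < |Vtp|

Reasons:
• Non-ideal output levels
• Noise

[Figure: inverter cross-section (nMOS in p substrate, pMOS in n well) with sub-threshold currents Isub; ID over Vgs, reaching I0 at Vgs = Vt]

$I_D = I_0 \, e^{(V_{GS} - V_t) / (n V_{Temp})} \left( 1 - e^{-V_{DS} / V_{Temp}} \right); \quad V_{GS} < V_t$

I0: ID for VGS = Vt
VTemp: temperature voltage
n: process constant (1 … 2.5)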
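A short sketch of the sub-threshold equation, assuming a temperature voltage of about 25.85 mV (room temperature) and n = 1.5; both defaults and the example values are assumptions for illustration.

```python
import math


def subthreshold_current(v_gs, v_ds, v_t, i_0, n=1.5, v_temp=0.02585):
    """Sub-threshold drain current I_D for V_GS at or below V_t."""
    return (i_0 * math.exp((v_gs - v_t) / (n * v_temp))
            * (1.0 - math.exp(-v_ds / v_temp)))


# Assumed example: I_0 = 100 nA at V_GS = V_t = 0.4 V, V_DS = 1 V
for v_gs in (0.4, 0.3, 0.2, 0.0):
    print(v_gs, subthreshold_current(v_gs, 1.0, 0.4, 1e-7))
```

With these assumptions the current drops by a factor of ten for every n · VTemp · ln(10) ≈ 89 mV that VGS falls below Vt, which is why a low threshold voltage raises static power so sharply.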

SoCT – Introduction – 37 © Lehrstuhl für Integrierte Systeme

Diode Leakage / Gate Current

[Figure: inverter cross-section for A high and A low, showing the p-n junctions and the gate currents Igate]

p-n junctions are diodes:
• Reverse current into substrate contributes to static power consumption

Gate oxide is not a perfect isolator:
• Ohm’s law (R < inf)
• Ionic conduction (trapped ions in oxide)
• tox ~ 5 – 2 nm today

$I_{gate} \sim \exp\left( t_{ox}^{-1} \right)$

SoCT – Introduction – 38 © Lehrstuhl für Integrierte Systeme

20
Static Power

 In the past: negligible (typ. < 1% of total P)


 Today and in the future: major concern, because
 low threshold voltages for high speed applications
 noise “does not scale”
 high temperatures
 low frequency and long standby times
 Gate and diode currents cause higher failure rate
 Gate currents limit tox scaling

SoCT – Introduction – 39 © Lehrstuhl für Integrierte Systeme

Important Process Parameters

Vt: threshold voltage
Vdd: supply voltage

[Figure: dynamic power and static power as functions of Vt and Vdd]

Goal: total power = min!
Circuit delay: choose Vdd and Vt such that Vdd - Vt = const

SoCT – Introduction – 40 © Lehrstuhl für Integrierte Systeme

21
References

[1] IBM press release, "New 64-bit PowerPC microprocessor", Oct. 14, 2002, http://www-3.ibm.com/chips/news/2002/1014_powerpc.html
[2] IBM Microelectronics Photo Catalog, http://www-3.ibm.com/chips/photolibrary/photo10.nsf/home?ReadForm
[3] J. Rabaey, "Digital Integrated Circuits – A Design Perspective", Prentice Hall, second edition, 2003
[4] M. Irwin, "VLSI Digital Circuits", Penn State, 2002, http://mdlwiki.cse.psu.edu/twiki/bin/view/MDL/MJI477

SoCT – Introduction – 41 © Lehrstuhl für Integrierte Systeme

22
SoC Logic Design Recap


Outline

• Combinatorial Logic
 Transistor synthesis for combinatorial logic
• Sequential Logic
 Registers, latches, flip-flops
 Finite state machine design

SoCT – SoC Logic Design – 2 © Lehrstuhl für Integrierte Systeme

23
Combinatorial Logic

SoCT – SoC Logic Design – 3 © Lehrstuhl für Integrierte Systeme

Simple Static CMOS Logic Gates

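[Figure: transistor-level schematics and gate symbols of the two-input NAND and NOR gates with inputs A, B and output Z between VDD and gnd]

A B | NAND Z | NOR Z
0 0 |   1    |   1
0 1 |   1    |   0
1 0 |   1    |   0
1 1 |   0    |   0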
SoCT – SoC Logic Design – 4 © Lehrstuhl für Integrierte Systeme

24
DeMorgan Rule

All logic functions can be expressed with NAND or NOR.

[Figure: NOT, AND and OR gates built from NAND gates only and from NOR gates only]

basic gate          NOT      AND        OR         NAND        NOR
output expression   A'       AB         A+B        (AB)'       (A+B)'
NAND represent.     (AA)'    (AB)''     (A'B')'    (AB)'       (A'B')''
NOR represent.      (A+A)'   (A'+B')'   (A+B)''    (A'+B')''   (A+B)'
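The table can be checked exhaustively over all input combinations. The sketch below is an illustration only: it models an inversion X' as nand(X, X) or nor(X, X), as in the NOT column, and the helper names are chosen for this example.

```python
from itertools import product


def nand(a, b):
    return 1 - (a & b)


def nor(a, b):
    return 1 - (a | b)


# Each entry: (reference function, NAND-only form, NOR-only form)
TABLE = {
    "NOT": (lambda a, b: 1 - a,
            lambda a, b: nand(a, a),                               # (AA)'
            lambda a, b: nor(a, a)),                               # (A+A)'
    "AND": (lambda a, b: a & b,
            lambda a, b: nand(nand(a, b), nand(a, b)),             # (AB)''
            lambda a, b: nor(nor(a, a), nor(b, b))),               # (A'+B')'
    "OR": (lambda a, b: a | b,
           lambda a, b: nand(nand(a, a), nand(b, b)),              # (A'B')'
           lambda a, b: nor(nor(a, b), nor(a, b))),                # (A+B)''
    "NAND": (lambda a, b: 1 - (a & b),
             lambda a, b: nand(a, b),                              # (AB)'
             lambda a, b: nor(nor(nor(a, a), nor(b, b)),
                              nor(nor(a, a), nor(b, b)))),         # (A'+B')''
    "NOR": (lambda a, b: 1 - (a | b),
            lambda a, b: nand(nand(nand(a, a), nand(b, b)),
                              nand(nand(a, a), nand(b, b))),       # (A'B')''
            lambda a, b: nor(a, b)),                               # (A+B)'
}


def check_table():
    """Return True if every NAND-only and NOR-only form matches its gate."""
    return all(ref(a, b) == via_nand(a, b) == via_nor(a, b)
               for ref, via_nand, via_nor in TABLE.values()
               for a, b in product((0, 1), repeat=2))


print(check_table())
```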

SoCT – SoC Logic Design – 5 © Lehrstuhl für Integrierte Systeme

Generic Model of Static CMOS

VDD
A1
...
An
Z

C
gnd
NAND NOR

SoCT – SoC Logic Design – 6 © Lehrstuhl für Integrierte Systeme

25
Systematic Static CMOS Logic Design

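Example function: Z = AB + C

• CMOS always inverts => build the gate for the complement $\overline{Z} = \overline{AB + C}$ and add an extra inverter
• Start at the output and find a way through n-MOSFETs to ground:
  - serial for AND (AB)
  - parallel for OR (C)
• Draw the way to Vdd by using the dual of the n-network and p-MOSFETs

If everything is done right, there will never be a conducting path between Vdd and gnd.

[Figure: p-MOSFETs A and B in parallel, in series with C, between VDD and Z; n-MOSFETs A and B in series, in parallel with C, between Z and gnd]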
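The design rule can be checked with a switch-level sketch of the example gate. Transistors are modelled as ideal switches (nMOS conducts on 1, pMOS conducts on 0); the function names are assumptions for this illustration.

```python
from itertools import product


def pull_down_conducts(a, b, c):
    """n-network: A and B in series, in parallel with C."""
    return bool((a and b) or c)


def pull_up_conducts(a, b, c):
    """Dual p-network: A and B in parallel, in series with C."""
    return bool(((not a) or (not b)) and (not c))


def complex_gate(a, b, c):
    """Output of the inverting CMOS gate, i.e. the complement of AB + C."""
    down = pull_down_conducts(a, b, c)
    up = pull_up_conducts(a, b, c)
    # Exactly one network conducts: no Vdd-gnd path and no floating output
    assert up != down
    return 1 if up else 0


def z(a, b, c):
    """Z = AB + C after the extra inverter."""
    return 1 - complex_gate(a, b, c)


for a, b, c in product((0, 1), repeat=3):
    print(a, b, c, z(a, b, c))
```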

SoCT – SoC Logic Design – 7 © Lehrstuhl für Integrierte Systeme

Register Transfer Blocks: Arithmetic

Numerous applications are based on mathematical operations:
• Digital filters
• Processor arithmetic logic units

[Figure: Y = A x B + C implemented as combinatorial logic circuitry (multiplier and adder) between multi-bit registers / storage elements A, B, C and Y]
SoCT – SoC Logic Design – 8 © Lehrstuhl für Integrierte Systeme

26
Sequential Logic

SoCT – SoC Logic Design – 9 © Lehrstuhl für Integrierte Systeme

Basic Register / Storage Element

• In CMOS we can store a 0 or a 1 in a loop of two inverters:
  - One inverter drives the input of the other
  - Only outputs so far (how boring!)
  - Impossible to externally drive a node without a short circuit
• We need to open the loop to set a specific logic value

[Figure: cross-coupled inverter pair between Vdd and gnd holding Q and its complement; a switch x in the feedback loop, shown for x=1 and x=0]
SoCT – SoC Logic Design – 10 © Lehrstuhl für Integrierte Systeme

27
CMOS Latch

Q
Q

SoCT – SoC Logic Design – 11 © Lehrstuhl für Integrierte Systeme

CMOS Latch

[Figure: CMOS latch implementations and symbol with data input D, enable e, internal R/S signals and output Q]

• Enable signal e
• Level-controlled <=> Latch

SoCT – SoC Logic Design – 12 © Lehrstuhl für Integrierte Systeme

28
CMOS Flip-flop

[Figure: flip-flop built from two latches (master and slave) controlled by clock c; symbol with input D, clock c and output Q]

• Clock edge-controlled (c) <=> flip-flop
• Most important sequential element
• Used in (almost all) synchronous digital circuits
• Used as register banks
• Counter, shift registers

SoCT – SoC Logic Design – 13 © Lehrstuhl für Integrierte Systeme

CMOS Flip-flop

clk
e Q e Q Q
D D q DQ Q
Master Slave

clk
clk
D
q
Q
SoCT – SoC Logic Design – 14 © Lehrstuhl für Integrierte Systeme

29
Flip-flop Timing
[Figure: timing diagram of clock c, input D and output Q with tsetup, thold and tc2q measured at the 50% points]

Flip-flop characteristics:
• setup-time: Data must be stable tsetup before clock edge
• hold-time: Data must be stable for thold after clock edge
• clock-to-output delay: Data will be visible at output tc2q after clock edge

SoCT – SoC Logic Design – 15 © Lehrstuhl für Integrierte Systeme

Synchronous Sequential Logic

[Figure: two flip-flops with combinatorial logic in between; path delays range from Tlogic,min to Tlogic,max]

Timing constraints:
• Setup constraint: Tclk > Tc2q + Tlogic,max + Tsetup
• Hold constraint: Tc2q + Tlogic,min > Thold
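Both constraints are simple sums and can be checked mechanically. The following sketch works in picoseconds; the function names and the delay values are assumptions for illustration.

```python
def setup_bound(t_c2q, t_logic_max, t_setup):
    """Lower bound on Tclk from the setup constraint (Tclk must exceed it)."""
    return t_c2q + t_logic_max + t_setup


def hold_satisfied(t_c2q, t_logic_min, t_hold):
    """True if the hold constraint Tc2q + Tlogic,min > Thold is met."""
    return t_c2q + t_logic_min > t_hold


# Assumed example values in ps
bound = setup_bound(t_c2q=80, t_logic_max=1200, t_setup=240)
print("Tclk must exceed", bound, "ps")
print("hold ok:", hold_satisfied(t_c2q=80, t_logic_min=40, t_hold=100))
```

Note that the setup constraint can always be met by slowing the clock, while the hold constraint does not depend on Tclk at all and has to be fixed in the circuit itself.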

30
Metastability
Violation of either tsetup or thold results in an undefined/oscillating output Q with probability pmeta (< 10^-9).

Relevance for SoC design:
• increasing number of chip clock domains
• externally imposed clocks

Precaution:
• Double-registered inter-domain signal interfaces
• FIFO buffers

[Figure: timing diagram of clock c, input D and output Q with tsetup, thold and tc2q measured at the 50% points]

SoCT – SoC Logic Design – 17 © Lehrstuhl für Integrierte Systeme

Finite State Machines (FSM)

SoCT – SoC Logic Design – 18 © Lehrstuhl für Integrierte Systeme

31
Finite State Machines

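[Figure: FSM block diagram: the input function f(x,u) feeds the state register (D1..Dn, Q1..Qn, clocked by clk); the state vector u is fed back to f and, together with x, drives the output function g(x,u) producing y]

• x: primary input vector, y: primary output vector, u: state vector
• f(x,u): input function, g(x,u): output function
• Mealy machine: most general case, as shown above
• Moore machine: no combinatorial path through the machine <=> g(x,u) = g(u)
• Medvedev machine: no output logic <=> y = g(x,u) = g(u) = u
• Machine without input logic: f(x,u) = (x, f2(u))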
SoCT – SoC Logic Design – 19 © Lehrstuhl für Integrierte Systeme

Finite State Machines(Contd.)

FSM2

f g
FSM1

f g
FSM3

f g

SoCT – SoC Logic Design – 20 © Lehrstuhl für Integrierte Systeme

32
FSM example: Counter

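x = 0: S(t+1) = S(t)
x = 1: S(t+1) = S(t) + 1

[Figure: next-state logic f(x,S) built from XOR gates: state bits A0..A2 and input x produce the outputs S0..S3; state register clocked by clk]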
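A bit-level sketch of such a counter's next-state function follows. It assumes a 3-bit state with an XOR per bit and an AND-based carry chain (a chain of half adders), which is one plausible reading of the figure, not necessarily the exact circuit shown.

```python
def next_state(state_bits, x):
    """Next-state logic of an incrementing counter.

    state_bits: list of bits, least significant bit first.
    x: 1 to count, 0 to hold.
    Returns (next_bits, carry_out).
    """
    carry = x
    next_bits = []
    for a in state_bits:
        next_bits.append(a ^ carry)   # sum bit: XOR of state bit and carry
        carry = a & carry             # carry ripples while bits are 1
    return next_bits, carry


def to_int(bits):
    return sum(bit << i for i, bit in enumerate(bits))


state = [0, 0, 0]
for _ in range(10):                   # ten clock edges with x = 1
    state, _carry = next_state(state, 1)
    print(to_int(state))
```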

SoCT – SoC Logic Design – 21 © Lehrstuhl für Integrierte Systeme

Practical Relevance of FSMs

Synchronous System Design paradigm


 Essentially all control functions in state-of-art digital IC’s
consist of “communicating FSM’s”
 Avoid combinatorial logic through paths!
 Stick to one FSM design style across SoC!

x f(x,u) x g(x,u)

clk clk
Tclk > ΣTlogic + Tsetup + Tc2q

SoCT – SoC Logic Design – 22 © Lehrstuhl für Integrierte Systeme

33
How are FSMs designed today?

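[Figure: state diagram with two states, Idle = 0 and Idle = 1. Transition 0 → 1 on FIFO a_empty = 1; transition 1 → 0 on FIFO a_empty = 0 AND Addr = 26. The diagram is translated to VHDL, which synthesis turns into the logic f(x,u) and a state register clocked by clk.]

    -- Assumes: TYPE state_t IS (State_0, State_1); Addr is an integer signal
    Idle_proc: PROCESS
      VARIABLE C_State, N_State : state_t := State_0;
    BEGIN
      WAIT UNTIL (CLK = '1');            -- wait for the rising clock edge
      C_State := N_State;

      CASE C_State IS                    -- next-state logic
        WHEN State_0 =>
          IF (FIFO_a_empty = '1') THEN
            N_State := State_1;
          END IF;
        WHEN State_1 =>
          IF (FIFO_a_empty = '0' AND Addr = 26) THEN
            N_State := State_0;
          END IF;
      END CASE;

      CASE N_State IS                    -- output logic
        WHEN State_0 =>
          Idle <= '0';
        WHEN State_1 =>
          Idle <= '1';
      END CASE;
    END PROCESS;

How many levels of logic can you afford?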
SoCT – SoC Logic Design – 23 © Lehrstuhl für Integrierte Systeme

FSM Logic Depth

Tclk > ΣTlogic + Tsetup + Tc2q
Tclk > N (tgate + twire) + Tsetup + Tc2q

$N < \frac{T_{clk} - T_{setup} - T_{c2q}}{t_{gate} + t_{wire}}$

CMOS databook:
• tgate = tc2q = 80 ps
• tsetup = 240 ps
• twire ≈ 50% tgate

Data path width:
• w = 16 bit; S = 2.488 Gbps
• Tclk = 1 / 155.5 MHz = 6.43 ns

… in a given case: N < 50.9, Nmax = 50

Q’s:
a) Is this plenty or marginal?
b) If synthesis reveals Nmax = 24, would you consider changing the data path to 8 bit?

[Figure: FSM with logic f(x,u) between input x and the state register clocked by clk]
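The numbers on this slide can be reproduced with a few lines. The sketch assumes that the clock frequency equals the line rate S divided by the data path width w, which matches 2.488 Gbps / 16 = 155.5 MHz; the function names are chosen for this example.

```python
import math


def clock_period_ps(line_rate_bps, width_bits):
    """Clock period in ps when width_bits are processed per clock cycle."""
    return 1e12 * width_bits / line_rate_bps


def logic_depth_bound(t_clk, t_setup, t_c2q, t_gate, t_wire):
    """Upper bound on N from Tclk > N (tgate + twire) + Tsetup + Tc2q."""
    return (t_clk - t_setup - t_c2q) / (t_gate + t_wire)


def max_levels(bound):
    """Largest integer N that is strictly below the bound."""
    return math.ceil(bound) - 1


T_GATE = 80.0            # ps, also used for tc2q
T_WIRE = 0.5 * T_GATE    # twire is about 50% of tgate
for width in (16, 8):
    t_clk = clock_period_ps(2.488e9, width)
    bound = logic_depth_bound(t_clk, 240.0, T_GATE, T_GATE, T_WIRE)
    print(width, round(t_clk, 1), round(bound, 1), max_levels(bound))
```

Under the same assumptions, an 8-bit data path at the same throughput halves the clock period and leaves room for only about 24 levels of logic, which is a useful data point for question b).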

SoCT – SoC Logic Design – 24 © Lehrstuhl für Integrierte Systeme

34
References

[1] R. J. Baker et al., CMOS circuit design, layout, and simulation, IEEE
Press, 1998. ISBN 0-7803-3416-7
[2] N. H. E. Weste et al., Principles of CMOS VLSI Design, Addison
Wesley, 1993. ISBN 0-201-53376-6
[3] SIA, International Technology Roadmap for Semiconductors,
http://public.itrs.net/

Picture credits: www.maxmon.com

SoCT – SoC Logic Design – 25 © Lehrstuhl für Integrierte Systeme

35
SoC Paradigm


Outline

• Design productivity
  - Platform-based SoC
  - Virtual Platforms
• Computational density
  - Flexibility vs. Performance
• Hardware Implementation
  - Gate Array, Standard Cells, FPGA
  - SoC design paradigm
  - Single- vs. Multicore

SoCT – SoC Paradigm – 2 © Lehrstuhl für Integrierte Systeme

36
Revisiting Moore‘s Law

[Figure: chip capacity (transistors per chip, 10^3 to 10^11) and designer productivity (K trans. / PM, 0.1 to 10^6) versus year, 1968 to 2012. Microprocessor complexity (4004, 8008, 8086, 80286, 80386, 80486, Pentium, Pentium II, Pentium III, Pentium 4) grows at ~55 % CAGR, designer productivity at only ~20 % CAGR.]

• How to develop and test such complex systems with affordable cost and time?

SoCT – SoC Paradigm – 3 © Lehrstuhl für Integrierte Systeme

Improvements in Designer Productivity


• 1975, Layout: polygons representing mask layout
• 1980, Transistor: transistor circuitry
• 1985, Gate: logic gates, Boolean algebra, standard cell designs
• 1990, Register transfer block: RTL design entry, logic synthesis
• 1995, Behavior: design entry with HW description languages, behavioral synthesis

Example of a behavioral description:

    BEGIN
      WAIT UNTIL (CLK'EVENT AND CLK = '1');
      LCDltch <= tmp;
      tmp := LCD;
    END PROCESS;

Improvements in designer productivity are due to progress in EDA tools and design methods as well as raised levels of abstraction.

SoCT – SoC Paradigm – 4 © Lehrstuhl für Integrierte Systeme

37
Platform-based SoC Design

Conquer design complexity by reuse maximization: shorter development cycles, higher chances for a (first-time) fault-free design and competitive value differentiation.

• Standard on-chip (bus) interconnect and interfaces (AMBA, CoreConnect)
• Standard RISC CPU cores and SW development environments
• Reuse of existing function building blocks
• Differentiation through new, application-specific system cores

[Figure: platform SoC with processor core, SRAM, eDRAM and system cores (EMAC, PCI-X) on the processor bus, a bus bridge to the peripheral bus with UART, GPIO and external memory controller; vendor examples: Blue Logic, XILINX]

SoCT – SoC Paradigm – 5 © Lehrstuhl für Integrierte Systeme

Virtual Platforms
[Figure: virtual platform: SW stacks (drivers, std. libs, OS instances on a hypervisor) running on HW models (CPU1, CPU2, accelerators, SRAM, SDRAM, arbiter, AMBA on-chip interconnect, interrupt controller, buffer, I/O), with graphical debugging tools]

A virtual platform supports:
• SW development environment
• Custom HW development
• HW/SW partitioning
• System integration

HW models:
• Processing cores
• Interconnect
• HW accelerators
• ...

Executable SoC model:
• Fast and accurate: TLM abstraction
• Simulation kernel, e.g. SystemC/SpecC

SoCT – SoC Paradigm – 6 © Lehrstuhl für Integrierte Systeme

38
Trade-Off: Flexibility vs. Performance

[Figure: log flexibility (instruction depth) over log computational density (performance / area) and log power efficiency (performance / W). From most flexible to most efficient: CPU, DSP, ASIP, FPGA, ASIC, custom IC. Spans of 10^3 ... 10^4 and 10^5 ... 10^6 are marked on the axes.]

Flexibility vs. performance/power dissipation dilemma

Source: A. DeHon [4]; A. Cuomo, T. Noll [5]

SoCT – SoC Paradigm – 7 © Lehrstuhl für Integrierte Systeme

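SoC Implementation Styles

• Standard Cell ASIC: parallel in space and time
  [Figure: data flow graph: inputs A, B, C, D at time 0; two adders in parallel at time 1; one subtractor at time 2 producing y]
• Pipelined RISC CPU: parallel in space, sequential in time
  [Figure: three instructions overlapping in the pipeline stages IF, ID, OF, EX, M, WB]
    t1 = A + B
    t2 = C + D
    Y = t1 - t2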

SoCT – SoC Paradigm – 8 © Lehrstuhl für Integrierte Systeme

39
Computational Density / Functional Diversity

Computational Density
• Computations performed per unit area and time
• Implementation technique descriptor
• [CD] = ops / N [Lmin² tiles]
• Approx. values:
  - CPU: 40 – 80
  - FPGA: 400
  - ASIC: 4’000
  - Custom: > 10’000

Functional Diversity
• Number of functions resident and rapidly accessible by a compute unit
• IS = number of instructions stored on a general purpose compute device
• Application descriptor
• Technology differentiator
  - CPU: ~256 – 16 K
  - FPGA: 1
  - ASIC: 10^-3 - 10^-6
  - Custom IC: ~0

Source: DeHon, PhD Thesis [4]

SoCT – SoC Paradigm – 9 © Lehrstuhl für Integrierte Systeme

Software vs. Hardware Implementation


• SW execution in a processor: sequential in time

  Cycle  Instruction        Interpretation
  1      mul r5, r1, r2     r5 = r1 x r2
  2      mul r6, r3, r4     r6 = r3 x r4
  3      add r7, r5, r6     r7 = r5 + r6

• RTL-level hardware: parallel in time
  [Figure: two multipliers (inputs x1, x2 and x3, x4) feeding one adder that produces y, over time steps T0 to T3]

• HW/SW partitioning: optimization for performance, power or total cost

Further details in the course "HW/SW Codesign"

SoCT – SoC Paradigm – 10 © Lehrstuhl für Integrierte Systeme

40
Hardware Implementation Methods

Implementation methods for digital ICs
• Full-Custom
• Semi-Custom
  - Cell-based
    · Standard Cell
    · Macro Cell (SoC)
  - Array-based
    · Pre-diffused (Gate Arrays)
    · Pre-wired (PLD’s), example: Altera MAX-7000
    · Pre-wired (FPGA's), example: XILINX Virtex-II PRO

SoCT – SoC Paradigm – 11 © Lehrstuhl für Integrierte Systeme

Hardware Implementation Methods

Implementation                    | Buy off the shelf | Circuit functionality
Standard-IC/Core, Processor/DSP   | yes               | Programmable with standard SW tools
ASIP                              | no                | Programmable with customized SW tools
FPGA                              | yes               | (re-)configurable
ASIC                              | no                | fixed

SoCT – SoC Paradigm – 12 © Lehrstuhl für Integrierte Systeme

41
Hardware Implementation Methods

                           | Full Custom                              | Std.-cell, Gate array                   | FPGA & PLD
Design style               | Circuit optimization at transistor level | Design libraries and logic synthesis    | Design libraries and logic synthesis
Design sign off            | Mask manufacturing and wafer production  | Mask manufacturing and wafer production | Customer programmable
Function density           | highest (100x)                           | high (50x)                              | average (1, normalized)
Typ. clock rate            | 1…3 GHz                                  | 500 MHz                                 | 100…200 MHz
Design time                | Many months                              | months                                  | weeks, days
Typ. volume                | Mio.                                     | 10.000 - 100.000                        | 1 – 10.000
One-time costs             | Very high (10 Mio.)                      | high (1 Mio.)                           | low (1 K)
Variable costs (chip area) | lowest (1, normalized)                   | average (2)                             | high (>100)

SoCT – SoC Paradigm – 13 © Lehrstuhl für Integrierte Systeme

Gate Array
• Transistors / logic gates to realize combinatorial and sequential circuit designs are already pre-placed on chip
• I/O macros to drive external circuitry integrated as well
• Functionality is determined by means of customer / application specific wiring
  - Chips are pre-manufactured up until wiring
  - High volume, low cost possible

[Figure: gate array with pads; uncommitted cell and committed cell (4-input NOR with inputs In 1 to In 4, output Out, VDD, GND; polysilicon, metal, contacts)]
SoCT – SoC Paradigm – 14 © Lehrstuhl für Integrierte Systeme

42
Standard Cell

• Library of pre-developed (full custom) logic


gates and function macros (variable width,
fixed height)
• Pre-defined placement rows and routing
channels
 Past: only between placement rows (see
figures)
 Today: 3-dimensional multi-layer wiring
across entire chip area
• State-of-the-art design flow
 Synthesis of VHDL behavior description into
technology specific design library elements
 Embedding of complex function macros
(memory, CPU, std. i/o macros, etc.)

SoCT – SoC Paradigm – 15 © Lehrstuhl für Integrierte Systeme


43
ASIC Chip

Example ASIC [LSI Logic]:
• 65 nm
• 1 GHz
• ~ 100 Mgates
• ~ 100 MB SRAM
• ARM/MIPS cores
• >1000 I/Os
• 64 x 2.5 Gb/s I/O
• 1 x 10 Gb/s I/O
• DDR I/F

[Figure: chip photo with random logic (from library) and memory subsystem]

SoCT – SoC Paradigm – 17 © Lehrstuhl für Integrierte Systeme

Full Custom Design

• Unconstrained placement of transistors, gates and function macros
• Individual (manual) optimization of circuit parameters to minimize area and power

Intel 4004 Processor

SoCT – SoC Paradigm – 18 © Lehrstuhl für Integrierte Systeme

44
Programmable Logic: Overview

• Fuse / Anti-Fuse Technique:


 Fuses connect or disconnect links between two wiring layers
 One-time programmable  „Redesigns" are expensive!
 High integration density
 Robust against radiation  space applications, etc.

• RAM based:
 Bits in look-up tables (LUTs) realize logic function
 Bits in registers control switches (transistors) which connect /
disconnect wire links
 Re-programmable, partially even online during operation (run-time
reconfiguration)
 Medium integration density
 Sensitive to radiation

SoCT – SoC Paradigm – 19 © Lehrstuhl für Integrierte Systeme

PLDs – Programmable Logic Devices


• Two types of PLDs:
  - PAL (Programmable Array Logic): AND matrix programmable, OR matrix fixed
  - PLA (Programmable Logic Array): AND and OR matrix programmable (past technology)

[Figure: PLA with inputs I0 to I5, programmable AND array and programmable OR array, outputs O0 to O3; PAL with inputs I0 to I5, programmable AND array and fixed OR array, outputs O0 to O3; example output equations O0 and O1 as sums of products of I0, I1 and I2]

CPLD: multiple PAL blocks with programmable interconnect, e.g. Altera Max 7000

SoCT – SoC Paradigm – 20 © Lehrstuhl für Integrierte Systeme

45
FPGA: Structure and Properties

Routing
Channel

I/O Pad

Configurable
Logic Block

SoCT – SoC Paradigm – 21 © Lehrstuhl für Integrierte Systeme

FPGA: Basic Structure and Properties

• Functionality of CLBs and interconnect wiring determined by program data

• Matrix of logic blocks (CLBs) and wiring resources already placed and routed on IC during manufacturing

• Programming:
 SRAM (Data in SRAM determine logic function and control interconnect)
 Anti fuse (Melting individual fuses through controlled peak currents)

SoCT – SoC Paradigm – 22 © Lehrstuhl für Integrierte Systeme

46
FPGA: Realization Principles

Boolean function:
$$y = x_1 x_2 x_3 + x_2 \overline{x_3}$$

Truth table:
x1 x2 x3 | y
0  0  0  | 0
0  0  1  | 0
0  1  0  | 1
0  1  1  | 0
1  0  0  | 0
1  0  1  | 0
1  1  0  | 1
1  1  1  | 1

ASIC: gates, td = 1 nsec
FPGA: Look-Up Table (LUT), td = 5 nsec
- x1 x2 x3 form the address of a 2x4 SRAM
- SRAM content = y column of the truth table: 0 0 1 0 / 0 0 1 1
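The LUT principle can be sketched in a few lines of Python. This is only an illustrative model, not vendor code: the helper names `build_lut` and `lut_read` are hypothetical, and x1 is assumed to be the most significant address bit.

```python
from itertools import product


def build_lut(func, n_inputs):
    """Return the SRAM content of an n-input LUT: one bit per input combination."""
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=n_inputs)]


def lut_read(lut, *inputs):
    """Evaluate the logic function by using the input bits as SRAM address."""
    address = 0
    for bit in inputs:  # first input is treated as the MSB (assumption)
        address = (address << 1) | bit
    return lut[address]


def y(x1, x2, x3):
    """Boolean function from the slide: y = x1 x2 x3 + x2 (not x3)."""
    return (x1 and x2 and x3) or (x2 and not x3)


lut = build_lut(y, 3)
print(lut)                     # SRAM content in address order 000 ... 111
print(lut_read(lut, 1, 1, 0))  # look up y for x1=1, x2=1, x3=0
```

Reprogramming the FPGA then amounts to loading a different bit pattern into `lut`; the read path stays the same.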

SoCT – SoC Paradigm – 23 © Lehrstuhl für Integrierte Systeme

FPGA: Realization Principles

Boolean function: y=x1x2


ASIC: AND gate Short Line Long Line
x1
y
CLB
x2

CLB
Logic table:
x1 x2 y
0 0 0
0 1 0
1 0 0
1 1 1

Address Content Programmable


Switch Matrix
 Realization in 4x1 SRAM (LUT)
Gate contacts of switch transistors are connected with
configuration memory cells which determine FPGA routing

SoCT – SoC Paradigm – 24 © Lehrstuhl für Integrierte Systeme

47
FPGA: Internal Structure (Xilinx Virtex-II Pro)

Virtex-II Pro Slice:


• 2 4-input LUTs/RAM
• Carry Logic
• 2 2to1 multiplexer
• 2 Register (D-FF)

Virtex-II Pro CLB: Virtex-II Pro RAM/MULT:


• 4 Slices • 18x18 HW multiplier
• 2 Tri-State Buffers • 18 kbit Dual-Port SRAM:
• Carry-Chain • 16k x 1bit
•…
• 512 x 36bit

SoCT – SoC Paradigm – 25 © Lehrstuhl für Integrierte Systeme

FPGA: Xilinx UltraScale+ Architecture

• 16 nm FinFET technology • Column based architecture of


• Stacked Silicon Interconnect (SSI) programmable logic consisting of
• Multiple dies connected via SI interposer • CLBs for logic and distributed memory,
• Die = Super Logic Region (SLR) • DSPs for multiply and accumulate (MAC)
• BlockRAM, UltraRAM,
• IO, high-speed transceivers
Microbumps
• Clock management (adjacent to IO and
Passive Silicon Interposer
memory)

Through Silicon Vias (TSV)

C4 Bump
BGA Ball

• 3 types differing in complexity and composition


• Virtex: high-end
• Kintex: mid-range
Source Xilinx (picture shows Virtex7) • Zynq: includes standard processors
SoCT – SoC Paradigm – 26 © Lehrstuhl für Integrierte Systeme

48
FPGA: Xilinx UltraScale+ Programmable Logic
CLB = Slice
• 1 Slice per CLB, 2 slice types: SLICEL and SLICEM

• SLICEL LUT
• 8 LUTs with 6 inputs
(each usable as two 3- or 5-input LUTs)
• 16 FlipFlops Carry Chain
• arithmetic carry logic
• multiplexer
FlipFlop
• SLICEM
• LUTs can be used as 64 bit RAM,
1x32 or 2x16 bit shift register

SoCT – SoC Paradigm – 27 © Lehrstuhl für Integrierte Systeme

FPGA: Xilinx UltraScale+ Programmable Logic


• Memory options
• Distributed memory via SLICEM
• Block RAM: 36 kbit blocks, dual port, (2x18 kbit, sync/async FIFO, configurable width)
• Ultra RAM: 288 kbit blocks, dual port, sync
• DDR4, DDR3, QDRII+, RLDRAM3 memory interfaces

Distributed RAM Block RAM UltraRAM External Memory


(Bits to kbits) (10s of Mbits) (100s of Mbits) (100s of Mbits to Gbits)
• wide, shallow FIFOs • data/coefficient storage • deep packet buffering • Larger data storage
• shift registers • deep FIFOs • video buffering
• state machines • shallow buffering • state, statistics, counters

Source Xilinx

SoCT – SoC Paradigm – 28 © Lehrstuhl für Integrierte Systeme

49
FPGA: Xilinx UltraScale+ Programmable Logic

• DSP slice
  • Pre-adder
  • 27×18 bit multiplier
  • 48-bit accumulator (incl. XOR)
• Select IO
  • high-performance / high-density
  • with different voltages
  • single ended / differential
• High-speed serial transceivers
  • GTH (16.3 Gbit/s)
  • GTY (32.75 Gbit/s)

                         Kintex         Virtex          Zynq
# CLB (k)                18.8 – 82.9    49.3 – 159.8    5.9 – 65.3
Block RAM (Mbit)         12.7 – 34.6    25.3 – 94.5     4.5 – 34.6
Ultra RAM (Mbit)         0 – 36         90 – 360        0 – 36
# DSP slices (k)         1.3 – 3.5      2.3 – 12.3      0.2 – 3.5
DSP Perf. (GMAC/s *)     6.3            21.9            6.3
I/O pins                 280 – 668      416 – 832       82 – 668
Transceivers             16 – 76        40 – 128        0 – 72
Hard processor macros                                 

Source Xilinx
* 10^9 multiply-accumulate ops per s

SoCT – SoC Paradigm – 29 © Lehrstuhl für Integrierte Systeme

FPGA: Xilinx UltraScale+ Zynq MPSoC


Heterogeneous Processing System Zynq CG Block Diagram
• Application Processing Unit (APU)
64 bit ARM Cortex-A53 (up to 1.5 GHz)
• Real-Time Processing Unit (RPU)
32 bit ARM Cortex-R5 (up to 600 MHz,
safety features incl. ECC, lock-step mode,
detection of faults in core)
• Graphics Processing Unit (GPU)
ARM Mali-400 MP2 (667 MHz)
• Video En/Decoder (VCU) for H.264/H.265

Zynq Device Types


CG EG EV
APU 2 4 4

RPU 2 2 2

GPU   
VCU   
Source Xilinx

SoCT – SoC Paradigm – 30 © Lehrstuhl für Integrierte Systeme

50
Processor Implementation in SoC (1)

„Real“ „Virtual“
Component Component

System on Board System on Silicon

SoCT – SoC Paradigm – 31 © Lehrstuhl für Integrierte Systeme

Processor Implementation in SoC (2)

Soft VC Firm VC Hard VC

Architectural Speed/Area
extensions VHDL optimized

SoCT – SoC Paradigm – 32 © Lehrstuhl für Integrierte Systeme

51
Soft VC CPU in FPGA

Example: XILINX MicroBlaze CPU

MicroBlaze: RS-232 For comparison:


 32 bit RISC
 200 MHz Hard VC
GPIO (LEDs) GPIO (buttons) PowerPC 405:
 166 DMIPS
 32 bit RISC
Extensions: MicroBlaze UserLogic  400 MHz
 I-Cache Core (OPB-Master)  600 DMIPS
 D-Cache
 HW Multiplier SDRAM Ctrl.
Debug
Logic

Local SRAM

SoCT – SoC Paradigm – 33 © Lehrstuhl für Integrierte Systeme

The Need for SoC Design Paradigm

Chip capacities of multi 10 to 100 M gates enable new dimensions of function integration

Next level of abstraction: Functional IP macros / cores

Challenge
• How to cope with this complexity and develop operational systems within…
  - reasonable time (time-to-market)
  - costs (engineering and manufacturing)

[Figure: chip composed of IP blocks: eRAM, network i/o, applic. specific, µ-processor, DRAM ctrl.]

SoCT – SoC Paradigm – 34 © Lehrstuhl für Integrierte Systeme

52
The Need for SoC Design Paradigm

Chip capacities of multi 10 to 100 M gates enable new dimensions of function integration

Next level of abstraction: Functional IP macros / cores plus a standard on-chip interconnect structure (NoC)

Challenge
• How to cope with this complexity and develop operational systems within…
  - reasonable time (time-to-market)
  - costs (engineering and manufacturing)

[Figure: IP blocks (µ-processor, eRAM, ...) connected by a bus]

SoCT – SoC Paradigm – 35 © Lehrstuhl für Integrierte Systeme

SoC: Another way to look at it

What in the past was on a board, today fits on a chip

DC ROM Analog
ROM
MCU
ASIC
~ 10 cm

i/f
i/f i/f DSP
ASIC
SRAM
SRAM

SRAM
DSP
MCU ROM An’lg

Source: Berkeley BWRC, TI cellular phone baseband SoC [1] ~ 0.8 cm

SoCT – SoC Paradigm – 36 © Lehrstuhl für Integrierte Systeme

53
Single- vs. Multicore

$$T_{App} = inst_{App} \cdot \frac{clk}{inst} \cdot \frac{s}{clk} = \sum_{i=1}^{n} inst_{App'_i} \cdot CPI \cdot \frac{1}{f}$$

Single-core: App

$$P_{dyn} \sim f \cdot V_{dd}^2$$

Multi-core: App'1, App'2, App'3, ..., App'n

If App can be perfectly parallelized on n cores and Tapp = const:

$$P_{dyn} \sim \frac{f}{n} \cdot \left(\frac{V_{dd}}{n}\right)^2 \cdot n \sim \frac{f \cdot V_{dd}^2}{n^2}$$

* neglecting influence of Vth

SoCT – SoC Paradigm – 37 © Lehrstuhl für Integrierte Systeme

Single- vs. Multicore

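Single-core: App, execution time Texe
Multi-core: App' on each of the n cores, execution time Texe

Assumption: App can be perfectly parallelized on n cores

Case 1: Single- and Multi-core have same performance = App execution time Texe

Single-core:
$$P_{dyn} \sim f \cdot V_{dd}^2$$

Multi-core:
$$f_{MC} = \frac{f}{n}, \qquad V_{dd,MC} = \frac{V_{dd}}{n}$$

$$P_{dyn} \sim \frac{f}{n} \cdot \left(\frac{V_{dd}}{n}\right)^2 \cdot n \sim \frac{f \cdot V_{dd}^2}{n^2}$$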
SoCT – SoC Paradigm – 38 © Lehrstuhl für Integrierte Systeme

54
Single- vs. Multicore

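Single-core: App, execution time Texe
Multi-core: App' on each of the n cores, execution time Texe / k

Assumption: App can be perfectly parallelized on n cores

Case 2: Multi-core shall have performance increase by factor k

Single-core:
$$P_{dyn} \sim f \cdot V_{dd}^2$$

Multi-core:
$$f_{MC} = \frac{k \cdot f}{n}, \qquad V_{dd,MC} = \frac{k \cdot V_{dd}}{n}$$

$$P_{dyn} \sim \frac{k \cdot f}{n} \cdot \left(\frac{k \cdot V_{dd}}{n}\right)^2 \cdot n \sim \frac{k^3 \cdot f \cdot V_{dd}^2}{n^2}$$

Further details in course "Chip Multicore Processors"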
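The scaling in Case 1 and Case 2 can be checked numerically. The following is a minimal sketch under the slide's idealized assumptions (perfect parallelization, supply voltage scaled in proportion to frequency, Vth neglected); the function name `relative_dynamic_power` is hypothetical.

```python
def relative_dynamic_power(n_cores, speedup=1.0):
    """Ratio P_dyn(multi-core) / P_dyn(single-core).

    Assumes f_MC = k*f/n and Vdd_MC = k*Vdd/n per core (k = speedup),
    so each core contributes (k/n) * (k/n)^2 of the single-core power.
    """
    scale = speedup / n_cores
    per_core = scale * scale ** 2  # f * Vdd^2 scaling of one core
    return n_cores * per_core      # n cores in total -> k^3 / n^2


# Case 1: same performance on 4 cores
print(relative_dynamic_power(4))
# Case 2: twice the performance on 4 cores
print(relative_dynamic_power(4, speedup=2.0))
```

In practice the supply voltage cannot be lowered arbitrarily, so these ratios are best read as an upper bound on the achievable saving.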
SoCT – SoC Paradigm – 39 © Lehrstuhl für Integrierte Systeme

References

[1] Design Technology for Low Power Radio Systems, Reth Davis, BWRC, Berkeley,
http://bwrc.eecs.berkeley.edu
[2] DSP multi chip module, esa,
http://www.estec.esa.nl/tech/spacewire/products/#modules
[3] Chip-On-Chip, Valtronic SA, http://www.valtronic.ch
[4] Reconfigurable Architectures for General Purpose Computing, Andre DeHon, PhD
Thesis, MIT, 1996
[5] A. Cuomo, Semiconductor Challenges, DATE03 Keynote, March 03,
http://www.date-conference.com/conference/2003/keynotes/andrea/andrea.pdf

SoCT – SoC Paradigm – 40 © Lehrstuhl für Integrierte Systeme

55
Processor Architecture


Outline

• Classification of processors
• Instruction set architecture
• Internal processor architecture
 Pipelining and hazards
 Branch prediction
 Superscalar/VLIW architecture
 Instruction and data caches
 Multi-threading

SoCT – Processor Architecture – 2 © Lehrstuhl für Integrierte Systeme

56
Motivation

Processor-based digital systems

- Computers with fully programmable, general-purpose processors (laptops, PCs, workstations, clusters)
  - Primary purpose / function is data processing (incl. Web servers, bank servers)
  - Hardware & software evolve rather independently

- However, most processors are deployed in "embedded systems"
  - Game consoles, smart phones, printers, household appliances, …
  - Cars, industrial robots, …

SoCT – Processor Architecture – 3 © Lehrstuhl für Integrierte Systeme

What is “Machine Structure”?

Applications

Operating
Compiler
System
Software Assembler
Instruction Set
Architecture
Hardware Processor Memory I/O system

Primary focus Datapath & Control


of this module Digital Design
Circuit Design
Transistors

Coordination of many levels of abstraction

SoCT – Processor Architecture – 4 © Lehrstuhl für Integrierte Systeme

57
Processor Classification

Category                      Type         Application                  Characteristic                                Remark

Instruction complexity        RISC         Embedded control             Load/store instructions for memory access     MIPS, ARM, PowerPC
                              CISC         Personal Computer/Servers    Complex, variable-length instructions         Intel x86-based
Instruction-level             Superscalar  Personal Computer/Embedded   Instruction parallelism at run-time           Intel, ARM, PowerPC
parallelism (ILP)             VLIW         Image Processing             Instruction parallelism at compile-time       Parallel video pixel processing
Application-specific area     ASIP         Embedded                     Application-specific instructions             Tensilica
                              DSP          Signal Processing            HW multiply for digital filters               TI

SoCT – Processor Architecture – 5 © Lehrstuhl für Integrierte Systeme

Levels of SW Code Representation

Processor/ISA SW model,
independent (e.g. Matlab)
Code generator int a = 10;
while(a < 100)
High-level language
a += b;
(e.g. C/C++) if (a > b && c < 0)
c++;

ISA dependent, Low-level


processor Compiler language lw r2, 16(r30)
independent lw r3, 20(r30)
(Assembly)
addu r2, r2, r3
Assembler sw r2, 24(r30)

Machine code
Software
1010 1111 0101 1000
0000 1001 1100 0110
Hardware Processor/ISA
0101 1000 0000 1001
dependent Control Signal
1100 0110 1010 1111
Specification

SoCT – Processor Architecture – 6 © Lehrstuhl für Integrierte Systeme

58
Instruction Set Architecture (ISA)

Defines interface between SW & HW


 Visible hardware state (registers & memory) Software & OS
 A set of instructions that operate on that
state Instruction Set
Given an ISA Hardware
 The hardware implements it
 The software uses it
 Old SW can use new HW and vice versa
Keep in mind
 Difference: ISA vs. HW implementation
 X86: Intel  AMD
 X86: Intel 80x86  Intel Core i7

SoCT – Processor Architecture – 7 © Lehrstuhl für Integrierte Systeme

ISA Example: MIPS

Instructions
- Arithmetic
  - add, sub, li, lui...
- Logical
  - and, nor, or, not, xor...
- Load/store
  - lb, lw, sb, sw...
- Multiply/divide
  - div, mult, multu...
- Jumps/branches
  - b, beq, bne, j, jal, jr...
- ...

Registers
r0         zero
r1         temp.
r2-r3      returns
r4-r7      args
r8-r15     temp.
r16-r23    temp. saved
r24-r25    temp.
r26-r27    OS
r28        global ptr.
r29        stack ptr.
r30        frame ptr.
r31        return addr.

Memory Address Space (upper address bound of each segment)
0xffff ffff    kseg2    Mapped/cached
0xdfff ffff    ksseg    Mapped/cached
0xbfff ffff    kseg1    Unmapped/uncached
0x9fff ffff    kseg0    Unmapped/cached
0x7fff ffff    kuseg    User space, mapped/cached
0x0000 0000    (bottom of address space)

SoCT – Processor Architecture – 8 © Lehrstuhl für Integrierte Systeme

59
Look Inside

system bus

data cache

data i/o

ALU register block

status
accumulator program counter

control
address i/o

instr. cache

system bus

SoCT – Processor Architecture – 9 © Lehrstuhl für Integrierte Systeme

Processor Microarchitecture

system bus

data cache

data i/o

ALU register block

status
accumulator program counter

control
instruction i/o

instr. cache

system bus

SoCT – Processor Architecture – 10 © Lehrstuhl für Integrierte Systeme

60
Processor Microarchitecture

system bus

data cache Memory access (M)


Execution (EX)

data i/o

Write back (WB)


ALU register block

status
accumulator program counter

control
instruction i/o

Instr. cache
Instruction fetch (IF)
Instruction decode (ID)

system bus

SoCT – Processor Architecture – 11 © Lehrstuhl für Integrierte Systeme

Program Execution

 Sequential execution of instructions

Instruction Data
Processor
memory memory

add r3, r3, r1


lw r1, 0(r0) CPI = 5
sw r3, 4(r0) (cycles per instruction)

add r3, r2, r1

IF ID EX M WB lw r1, 0(r0)

IF ID EX M WB sw r3, 4(r0)
IF ID EX M WB
Efficiency improvement:
instruction-level parallelism (ILP)
SoCT – Processor Architecture – 12 © Lehrstuhl für Integrierte Systeme

61
ILP: Pipelining

Clock signal

add r3, r2, r1

IF ID EX M WB lw r1, 0(r0)

IF ID EX M WB sw r3, 4(r0)
IF ID EX M WB

Execution stages can overlap … multiple instructions execute faster: CPI → 1

IF ID EX M  WB
   IF ID EX M  WB
      IF ID EX M  WB
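The gain from overlapping stages can be illustrated with a small cycle count. This is a sketch for an ideal 5-stage pipeline without hazards or stalls; the function names are illustrative only.

```python
def cycles_sequential(n_instr, n_stages=5):
    """Every instruction passes through all stages before the next one starts."""
    return n_instr * n_stages


def cycles_pipelined(n_instr, n_stages=5):
    """The first instruction fills the pipeline, then one instruction completes per cycle."""
    return n_stages + (n_instr - 1)


for n in (3, 1000):
    seq, pipe = cycles_sequential(n), cycles_pipelined(n)
    print(n, seq, pipe, pipe / n)  # the last column is the effective CPI
```

For long instruction sequences the effective CPI of the ideal pipeline approaches 1, while the sequential machine stays at CPI = 5.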

SoCT – Processor Architecture – 13 © Lehrstuhl für Integrierte Systeme

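CPU Pipeline
Single-scalar = 1 ALU, CPImin = 1.0

[Figure: pipeline stages IF, ID, EX, M, WB under a common pipeline control; the stages are separated by clocked buffers (D flip-flops with clock-to-output delay Tc2q and setup time Tstp), each stage has a combinational logic delay ΣTlogic]

$$T_{clk} \ge T_{c2q} + \max\left(\sum T_{logic}\right) + T_{stp}$$

$$f_{max} = \frac{1}{T_{clk}}$$

instr. rate [MIPS] = f [MHz] / CPI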
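As a hedged numerical sketch of the relations above: the slowest stage determines the clock period, and the clock frequency together with the CPI determines the instruction rate. The stage delays below are made-up values for illustration; only the 266 MHz and CPI = 1.2 figures appear elsewhere in these slides.

```python
def max_clock_mhz(t_c2q_ns, stage_logic_ns, t_setup_ns):
    """f_max in MHz from T_clk = T_c2q + max(stage logic delay) + T_setup (all in ns)."""
    t_clk_ns = t_c2q_ns + max(stage_logic_ns) + t_setup_ns
    return 1000.0 / t_clk_ns


def instruction_rate_mips(f_mhz, cpi):
    """Instruction rate in MIPS = f [MHz] / CPI."""
    return f_mhz / cpi


# Hypothetical logic delays of the five stages IF, ID, EX, M, WB in ns
stages = [1.5, 1.2, 2.0, 1.8, 1.0]
f_max = max_clock_mhz(0.2, stages, 0.3)
print(f_max)
print(instruction_rate_mips(266, 1.2))
```

Balancing the stage delays matters: shortening any stage other than the slowest one does not raise f_max.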

SoCT – Processor Architecture – 14 © Lehrstuhl für Integrierte Systeme

62
ILP: Pipelining

• Prerequisite for effective pipelining


 Regularity in sequence of individual
instruction phases
 Few, regular instruction set
 Simple, few addressing modes

• Deep pipelining
  - Eases processor speed scaling
  - Increases vulnerability to pipeline problems
    - Structural hazards
    - Data hazards
    - Control hazards

SoCT – Processor Architecture – 15 © Lehrstuhl für Integrierte Systeme

Structural Hazards

Pipelined execution is hindered due to resource conflicts

IF ID EX M WB load/store instruction

IF ID EX M WB arithmetic
instructions
IF ID EX M WB
stall IF ID EX M WB
if only one memory
port is available

SoCT – Processor Architecture – 16 © Lehrstuhl für Integrierte Systeme

63
Data Hazards

Data dependencies among instructions cause data hazards

add r3,r2,r1 IF ID EX M WB
sub r7,r3,r1 IF ID EX M WB
and r6,r3,r2 IF ID EX M WB

Stalling is required

add r3,r2,r1 IF ID EX M WB
sub r7,r3,r1 IF stall ID EX M WB
and r6,r3,r2 IF ID EX M WB
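A read-after-write dependency like the one on r3 can be found mechanically. The sketch below is a simplified model, not the hazard logic of a real processor: it assumes no forwarding and treats any consumer within `window` instructions after the producer as hazardous (the window size of 2 is an assumption).

```python
def raw_hazards(program, window=2):
    """Find read-after-write hazards.

    program: list of (dest, src1, src2) register names in program order.
    Returns (producer index, consumer index, register) tuples.
    """
    hazards = []
    for i, instr in enumerate(program):
        dest = instr[0]
        for j in range(i + 1, min(i + 1 + window, len(program))):
            if dest in program[j][1:]:  # a later instruction reads dest too early
                hazards.append((i, j, dest))
    return hazards


program = [
    ("r3", "r2", "r1"),  # add r3,r2,r1
    ("r7", "r3", "r1"),  # sub r7,r3,r1
    ("r6", "r3", "r2"),  # and r6,r3,r2
]
print(raw_hazards(program))
```

Each reported pair marks a place where the pipeline has to stall (or forward the result) before the consumer can read its operand.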

SoCT – Processor Architecture – 17 © Lehrstuhl für Integrierte Systeme

Control Hazards

• Deviation from sequential execution


inst.addr   mnemonics
0x400258:   lw    r2, 24(r30)
0x400260:   slti  r3, r2, 15
0x400268:   bne   r3, r0, 400280
0x400270:   addiu r2, r0, 6
0x400278:   sw    r2, 20(r30)
0x400280:   addiu r2, r0, 1
0x400288:   j     400290

Pipeline after the branch (the next instruction is unknown until the branch is resolved):
bne r3, r0, 400280   IF ID EX M WB
?                    stall IF ID EX M WB
                              IF ID EX M WB

• Branches are frequent


 total performance loss is greater than in
case of data hazards
 employment of branch prediction

SoCT – Processor Architecture – 18 © Lehrstuhl für Integrierte Systeme

64
Branch prediction

• 1-bit prediction

0x400258: lw r2, 24(r30)
0x400260: slti r3, r2, 15
0x400268: bne r3, r0, 400280

[Figure: x bits of the branch address index a branch history table with 2^x entries of 1 bit each; 1 = taken, 0 = not taken]

Problem
→ For loops we always predict incorrectly twice
  - in the first loop iteration
  - in the last loop iteration

SoCT – Processor Architecture – 19 © Lehrstuhl für Integrierte Systeme

Branch prediction

• 2-bit prediction

0x400258: lw r2, 24(r30)
0x400260: slti r3, r2, 15
0x400268: bne r3, r0, 400280

[Figure: x bits of the branch address index a branch history table with 2^x entries of 2 bits each (local history). State diagram: two "Taken" states (11, 10) and two "nTaken" states (01, 00); transitions are labelled T (branch taken) and nT (branch not taken)]
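The difference between the two schemes shows up when a loop branch is executed repeatedly. The following sketch simulates a single table entry; the 2-bit variant is modelled as a saturating counter, which is one common realization and an assumption here (the state diagram on the slide may use different transitions).

```python
def mispredictions_1bit(outcomes, state=False):
    """1-bit predictor: predict whatever the branch did last time."""
    misses = 0
    for taken in outcomes:
        if state != taken:
            misses += 1
        state = taken
    return misses


def mispredictions_2bit(outcomes, counter=0):
    """2-bit saturating counter: values 2 and 3 predict taken, 0 and 1 predict not taken."""
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


# Loop branch: taken 9 times, then not taken on loop exit; the loop is entered 3 times
loop = ([True] * 9 + [False]) * 3
print(mispredictions_1bit(loop), mispredictions_2bit(loop))
```

Once warmed up, the 1-bit scheme mispredicts twice per loop execution (exit and re-entry), whereas the 2-bit counter mispredicts only on the loop exit.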

SoCT – Processor Architecture – 20 © Lehrstuhl für Integrierte Systeme

65
Branch prediction

• Two-level (correlating) prediction

0x400258: lw r2, 24(r30)
0x400260: slti r3, r2, 15
0x400268: bne r3, r0, 400280

→ Recent behaviour of other branches is considered

General case: (m, n) predictor

[Figure: the history pattern of the last m global branches (global history, here m = 4, e.g. T N N N) selects one of 2^m branch history tables; x bits of the branch address index the 2^x entries of n = 2 bits each (local history)]

SoCT – Processor Architecture – 21 © Lehrstuhl für Integrierte Systeme

ILP: Superscalar Architecture

external data bus

data cache

data i/o
internal data bus

Multiple ALU
ALU register block
execution units ALU
status
accumulator program counter
internal address bus
control
address i/o

Instr. cache

external address bus

SoCT – Processor Architecture – 22 © Lehrstuhl für Integrierte Systeme

66
ILP: Superscalar architecture

Instr. Fetch (IFi ... IFi+3)

Instr. Decode (IDi ... IDi+3)


Multiple Decided at run-time
datapaths
(DP)
DP1 DP2 DP3 DP4

OFi+2 OFi OFi+3 OFi+1 Data dependency check


and Operand Fetch (OF)
EXi+2 EXi EXi+3 EXi+1

MEMi+2 MEMi MEMi+3 MEMi+1

WBi+2 WBi WBi+3 WBi+1

 More than 1 instruction can be issued in 1 cycle, i.e. CPI < 1 is possible
 More complex logic for checking data dependencies required

SoCT – Processor Architecture – 23 © Lehrstuhl für Integrierte Systeme

ILP: VLIW processors


Sequential
Program
...
instr i+2
instr i+1
instr i
instr i-1
instr i+2
Determined during
...
compile-time

Optimizing Compiler
InstrDP1 InstrDP2 InstrDP3 InstrDP4 ... ... InstrDPn-1 InstrDPn

DP 1 DP 2 DP 3 DP 4 ... Datapath ... DPn-1 DPn

Registers

SoCT – Processor Architecture – 24 © Lehrstuhl für Integrierte Systeme

67
Processor Performance (1)

• What is performance?
  - Example Porsche vs. Bus from Munich to Stuttgart

Vehicle   Top speed [km/h]   Distance [km]   Travel time [h]   Capacity [person]   Throughput [pkm/h]
Porsche   260                200             0.77              2                   520
Bus       100                200             2.0               46                  4600

• What matters in CPU performance:


 Fastest possible execution of a single instruction?
 Shortest program execution (many instructions)?

SoCT – Processor Architecture – 25 © Lehrstuhl für Integrierte Systeme

Processor Performance (2)

Ultimately interested in
 CPU execution time: Time CPU needs to complete certain program,
task or function

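$$\text{CPU time} = \frac{\text{Clock cycles}}{\text{Program}} \times \frac{\text{Seconds}}{\text{Clock cycle}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$

- Instructions / Program: specific for your application; estimate/count after compilation
- Clock cycles / Instruction = CPI: processor architecture and memory hierarchy dependent
- Seconds / Clock cycle = 1 / fcpu: processor data sheet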
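The three factors translate directly into a one-line calculation. The sketch below uses an illustrative instruction count of 10^9; the function name `cpu_time_s` is hypothetical.

```python
def cpu_time_s(instructions, cpi, f_hz):
    """CPU time = instructions x (clock cycles per instruction) x (seconds per clock cycle)."""
    return instructions * cpi / f_hz


# Hypothetical program with 1e9 instructions on a 266 MHz CPU
t_ideal = cpu_time_s(1e9, 1.0, 266e6)
t_real = cpu_time_s(1e9, 1.2, 266e6)
print(t_ideal, t_real)
```

Halving the CPI has the same effect on execution time as doubling the clock frequency, which is why both appear as equal levers in the formula.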

SoCT – Processor Architecture – 26 © Lehrstuhl für Integrierte Systeme

68
Processor Performance

Instruction Data
Processor
memory memory

CPI = CPICPU + CPIMEM

Performance Comparison: Effective MIPS [Data from Xilinx]

CPU x 1.0 clk/instr    266
CPU x 1.2 clk/instr    221.67
CPU-DDR                20.95
CPU-I/O                17.81

SoCT – Processor Architecture – 27 © Lehrstuhl für Integrierte Systeme

Memory Hierarchy

               CPU registers   L1 cache   L2 cache   Main memory
Access time    0.5 ns          2 ns       20 ns      100 ns
Size           500 B           32 KB      256 KB     512 MB

Trend from CPU registers to main memory:
Access latency: small → large
Cost: large → small
Size: small → large

Observation on program execution


 Temporal locality
 addresses are likely to be accessed in the near future once again
 Spatial locality
 addresses are likely to be close to each other
Frequently accessed data/instructions are kept close to CPU

SoCT – Processor Architecture – 28 © Lehrstuhl für Integrierte Systeme

69
Example: PowerPC 405GP

66-133MHz Arb
266MHz 32/64-bit
64-bit PCI-X, with ECC
33-66MHz RAM/ROM/

On-chip Peripheral Bus (OPB) 33-66 MHz


32/64bit PCI Peripheral Up to 66MHz
DDR266 controller
32-bit
External bus
PCI-X SDRAM address /
OPB master cntlr.
Bridge Controller 32-bit data
Bridge
PLB UART (2)

128-bit master, Monitor


128 bit
128-bit slave 128 bit I2C (2)

up to 133MHz Processor Local Bus (PLB) 128-bit GPIO


GPIO
128-bit 128-bit
128-bit 128-bit 128-bit
SRAM 128KB GPT
DMA
Ctlr. SRAM
Controller
CPU regs
32K 32K L1 Caches
I-Cache D-Cache 1 MII or 2 RMII Fast & Small SRAM
Timers interfaces Slower & Larger SDRAM
MMU 10/100 I/O subsystem (SCSI, PCI, etc)
MAL Ethernet
Interrupt MAC
CPU Controller

JTAG Trace
13 external
interrupts

SoCT – Processor Architecture – 29 © Lehrstuhl für Integrierte Systeme

Processor Performance (3)

$$CPI_{mem} = CPI_{inst} + CPI_{data}$$

$$CPI_{mem} = I_{freq} \times L1_{miss\,rate} \times (L1_{miss\,penalty} + L2_{miss\,rate} \times L2_{miss\,penalty}) + D_{freq} \times L1_{miss\,rate} \times (L1_{miss\,penalty} + L2_{miss\,rate} \times L2_{miss\,penalty})$$

Pipelined RISC: CPICPU = 1.2

Two-level cache hierarchy
- L1miss rate = 5%
- L1miss penalty = 10 cycles
- L2miss rate = 3%
- L2miss penalty = 50 cycles
- Dfreq = 20%
→ CPImem = 0.69

$$\frac{CPI_{miss}}{CPI_{no\,miss}} = \frac{1.89}{1.2} = 1.57$$

0.15% instr./data accesses to system memory degrade overall performance (CPU execution time) by 57%
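The numbers of this example can be reproduced with a short calculation. One assumption is needed that the slide does not state explicitly: Ifreq = 1.0, i.e. every instruction causes exactly one instruction fetch. With that value the result matches the CPImem given above.

```python
def cpi_mem(i_freq, d_freq, l1_miss_rate, l1_miss_penalty,
            l2_miss_rate, l2_miss_penalty):
    """Memory stall cycles per instruction for a two-level cache hierarchy."""
    # Average stall cycles per memory access (instruction fetch or data access)
    per_access = l1_miss_rate * (l1_miss_penalty + l2_miss_rate * l2_miss_penalty)
    return (i_freq + d_freq) * per_access


cpi_cpu = 1.2
stall = cpi_mem(i_freq=1.0, d_freq=0.2,  # i_freq = 1.0 is an assumption
                l1_miss_rate=0.05, l1_miss_penalty=10,
                l2_miss_rate=0.03, l2_miss_penalty=50)
cpi_total = cpi_cpu + stall
print(stall, cpi_total, cpi_total / cpi_cpu)
```

The share of accesses that reach system memory is L1miss rate x L2miss rate = 0.05 x 0.03 = 0.15%, which is where the figure on the slide comes from.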

SoCT – Processor Architecture – 30 © Lehrstuhl für Integrierte Systeme

70
Cache Organization
• Caches store only small share of main memory
  - Data are stored in lines of multiple sequential data words (e.g. 4 words)
  - Cache capacity = Lsize x Nlines

• CPU accesses data in the full address range
  - address width = 32 bit → 4 GB memory. How to map onto a smaller memory?
  - Index: used to determine potential position(s) in cache
    - depends on placement strategy
  - Tag: part of address stored together with cache line
    - If stored tag is identical to tag part of memory address → cache hit
  - Offset: determines word in cache line and byte in word
  - Flags: entry valid; entry "dirty" (entry changed by CPU)

[Figure: CPU, cache and main memory; the cache holds a few of the lines of main memory. Memory address = tag | index | offset; cache entry = flags | tag | cache line; hit if the stored tag equals the tag part of the address]

SoCT – Processor Architecture – 31 © Lehrstuhl für Integrierte Systeme

Direct Mapped Cache


index = block MOD NL

• Each block (size of a cache line) in main memory can be stored in only one cache entry
• Conflicting indices lead to higher cache miss rate

CPU address: tag | index | word | byte (tag + index = block, word + byte = offset)
Cache entry: flags | tag | data (word0 word1 word2 word3)
Hit: stored tag = tag part of the address, and entry valid; the word offset selects the word

[Figure: main memory blocks ..00000 to ..10110 mapped onto the 8 entries (index 000 to 111) of a direct mapped cache]

Example: 16 KB direct mapped cache with 4 words of 32 bit per cache line
• 10 bit index (1k cache lines)
• 18 bit tag
• 4 bit offset (for word and byte)

SoCT – Processor Architecture – 32 © Lehrstuhl für Integrierte Systeme

71
Set Associative Cache

set = block MOD (NL / n), any line within set

• Each block in main memory can be stored in n cache entries: n-way set associative cache
• Increasing n reduces cache misses due to conflicts
• With higher n: selection circuitry more complex, needs more time

CPU address: tag | index | word | byte (tag + index = block, word + byte = offset)
Cache entry per way: flags | tag | data
Hit: n times parallel tag comparison (way 0, way 1, ...); the matching way delivers the word

[Figure: main memory blocks ..00000 to ..10110 mapped onto the 4 sets (00 to 11) of a 2-way set associative cache]

Example: 16 KB 2-way set associative cache with 4 words of 32 bit per cache line
• 9 bit index (512 sets of 2 cache lines)
• 19 bit tag
• 4 bit offset (word and byte, 2 bit each)

SoCT – Processor Architecture – 33 © Lehrstuhl für Integrierte Systeme

Fully Associative Cache


• Fully Associative Cache
  - A memory block can be stored in any cache entry
  - Only one set containing all entries
  - No cache misses due to conflicts

• Parallel tag matching for all entries
  - Complex circuitry
  - High latency

• Fully associative caches rarely used (only for special purposes)

[Figure: main memory blocks ..00000 to ..10110; the tag of the address is compared in parallel with the tags of all cache lines, any cache line can hold the block]

Example: 16 KB fully associative cache with 4 words of 32 bit per cache line
• no index
• 28 bit tag
• 4 bit offset (word and byte, 2 bit each)
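The tag/index/offset widths of the three 16 KB examples follow from capacity, line size and associativity. The sketch below assumes 32-bit addresses and power-of-two sizes; the function names are illustrative.

```python
def cache_geometry(capacity_bytes, line_bytes, ways, addr_bits=32):
    """Return (tag_bits, index_bits, offset_bits); all sizes must be powers of two."""
    n_lines = capacity_bytes // line_bytes
    n_sets = n_lines // ways                   # direct mapped: ways = 1
    offset_bits = (line_bytes - 1).bit_length()
    index_bits = (n_sets - 1).bit_length()     # 0 for a fully associative cache
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


def split_address(addr, index_bits, offset_bits):
    """Split a memory address into its (tag, index, offset) fields."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


KB = 1024
print(cache_geometry(16 * KB, 16, ways=1))     # direct mapped
print(cache_geometry(16 * KB, 16, ways=2))     # 2-way set associative
print(cache_geometry(16 * KB, 16, ways=1024))  # fully associative
print([hex(v) for v in split_address(0x12345678, 10, 4)])
```

Doubling the associativity at constant capacity halves the number of sets, so one bit moves from the index to the tag.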

SoCT – Processor Architecture – 34 © Lehrstuhl für Integrierte Systeme

72
Cache Replacement

• Replacement
  - Might be necessary on cache miss
  - When all associated entries occupied: replace one entry
    - Direct mapped cache: replace old entry with new
    - Fully associative cache: replace only when full
    - Which cache entry should be replaced?

[Figure: 2-way set associative cache in which both ways of the selected set hold valid (V) entries; which one is replaced?]

• Replacement policy
  - Goal: reduce number of misses
  - Least recently used (LRU): least recently accessed
  - First in first out (FIFO): least recently loaded (oldest)
  - Random
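The LRU and FIFO policies can be compared on a short access trace. This is a simplified sketch that models one fully associative set and only counts misses; the trace and the function names are made up for illustration.

```python
from collections import OrderedDict


def count_misses(block_trace, n_entries, policy="LRU"):
    """Count cache misses for one set with n_entries lines under LRU or FIFO."""
    cache = OrderedDict()  # keys are block numbers, oldest entry first
    misses = 0
    for block in block_trace:
        if block in cache:
            if policy == "LRU":
                cache.move_to_end(block)   # a hit refreshes the entry under LRU only
        else:
            misses += 1
            if len(cache) >= n_entries:
                cache.popitem(last=False)  # evict the entry at the front
            cache[block] = True
    return misses


trace = [0, 1, 2, 0, 3, 0, 4]
print(count_misses(trace, 3, "LRU"), count_misses(trace, 3, "FIFO"))
```

On this trace block 0 is reused frequently: LRU keeps it in the cache, FIFO evicts it because it was loaded first.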

SoCT – Processor Architecture – 35 © Lehrstuhl für Integrierte Systeme

Cache Write Strategies


Cache writes
- Contrary to reads: Migrated data gets modified
- Data item also requires update in memory hierarchy

Write through
- Modified data written directly through
- Potentially via write buffer

Write back
- Modify data locally
- Dirty flag marks updated cache lines
- Dirty cache lines are written back
  - On replacement
  - By invalidation by processor
  - By cache coherency protocol (see later)

[Figure: write through forwards every write to memory; write back marks the valid (V) line as dirty (D) and writes it back to memory when a conflicting miss replaces it]

SoCT – Processor Architecture – 36 © Lehrstuhl für Integrierte Systeme

73
Multithreading in Software

system bus
Load/save
data cache
register
status
register block

data i/o status


program counter

ALU register block


register block
status
accumulator program counter status
program counter
control
instruction i/o

register block
instr. cache
status
Further details in lecture program counter
„Chip Multicore Processors“
system bus

SoCT – Processor Architecture – 37 © Lehrstuhl für Integrierte Systeme

Multithreading in Hardware
system bus

data cache

data i/o

ALU registerblock
register
block Multiple
register block
status register banks
status
statuscounter
program
accumulator program counter
program counter
control
instruction i/o

instr. cache
Further details in lecture
„Chip Multicore Processors“
system bus

SoCT – Processor Architecture – 38 © Lehrstuhl für Integrierte Systeme

74
Summary

• A variety of microprocessor architectures in embedded


systems
• Instruction Set Architecture as interface between hardware
and software
• Performance is limited mainly by memory access, code
parallelism and data dependencies

SoCT – Processor Architecture – 39 © Lehrstuhl für Integrierte Systeme

Literature

[1] Hennessy, Patterson: Computer Architecture, A Quantitative Approach.
Morgan Kaufmann, 5th edition, 2012
[2] M. Flynn: Basic Issues in Microprocessor Architecture.
Journal of Systems Architecture, 1999
[3] Intel Technology Journal, www.intel.com: Hyper-Threading Technology,
Feb. 2002
[4] K. Diefendorff: PC Processor Microarchitecture. Microprocessor Report,
July 1999

SoCT – Processor Architecture – 40 © Lehrstuhl für Integrierte Systeme

75
Memory


Outline

• Motivation
• Classification and Characteristics
• Look Inside
 Architecture of state-of-art memories
 Different types of memory cells
• How to Use Memory in System Design
• Product Overview

SoCT – Memory – 2 © Lehrstuhl für Integrierte Systeme

76
Motivation

- State-of-the-art SoCs make application-specific use of various types of memories:
  - non-volatile memory for firmware and parameter storage
  - fast SRAM for data and control state storage
  - (not shown: eDRAM, CAM, eFlash)

[Figure: SoC floor plan with MCU, ROM, DSP, ASIC, SRAM and analog blocks]

SoCT – Memory – 3 © Lehrstuhl für Integrierte Systeme

Motivation

- Significant portion of high-end microprocessor area is dedicated to fast SRAM, associative caches and registers
  → key criterion for overall processor performance

IBM PowerPC 750 Cu

SoCT – Memory – 4 © Lehrstuhl für Integrierte Systeme

77
Positioning

Memory Type                     Capacity [bit]   Density [bit/mm^2]   Transfer Rate [Mbit/s]
Paper Page A4                   16 x 10^3        0.4
Floppy 3.5"                     11.52 x 10^6     1.08 x 10^3          0.5
64 Mbit DRAM                    64 x 10^6        2.13 x 10^6          1600
Zip-Disk 100 MB                 800 x 10^6       75.5 x 10^3          11.2
CD (32x)                        5.44 x 10^9      544 x 10^3           38.4
DAT DDS-3                       96 x 10^9        4.24 x 10^6          8
DVD (4x)                        136 x 10^9       6.8 x 10^6           43.2
Hard Disk (5 disks, 7200 rpm)   176 x 10^9       3.3 x 10^6           140

SoCT – Memory – 5 © Lehrstuhl für Integrierte Systeme

Classification

Read Write
Non-Volatile Read Only
Random Access Non-Random Read Write ROM
RAM Access

SRAM FIFO / LIFO EPROM / EEPROM Mask-Programmed

DRAM CCD FLASH Fuse-Programmed

Register CAM Magneto RAM

Shift-Register

SoCT – Memory – 6 © Lehrstuhl für Integrierte Systeme

78
Classification

Read Write
Non-Volatile Read Only
Random Access Non-Random Read Write ROM
RAM Access

SRAM FIFO / LIFO EPROM / EEPROM Mask-Programmed

DRAM CCD FLASH Fuse-Programmed


Memory
Register CAM size
Magneto RAM / design trade-
density offs:
Shift-Register
speed

robustness

SoCT – Memory – 7 © Lehrstuhl für Integrierte Systeme

Characteristics
Memory Type     Application                      Access Time                  Remarks
Registers       CPU Registers [32 x 64 bit]      Very fast [Sub-nsec]         Direct addressing scheme
On-chip SRAM    Caches [32 KByte]                Fast [nsec]                  SRAM is faster but more expensive than SDRAM
QDR SRAM        Fast system memory [4 MByte]     Fast [2 x 2 x 200 MHz]       Dual clock edge, dual port
SDRAM           Main Memory [64 MByte]           Slow-Medium [133 MHz]        Needs refresh, sophisticated control, synchronous interface
DDR3 SDRAM      Main Memory [1 GByte]            Medium [2 x 800 MHz]         Dual clock edge
ROM             System config [few kByte]        Medium [~kByte/sec]          Read only
Flash           Memory Card [16 GByte]           Medium [20 MByte/sec]        Non-volatile, no refresh, different rd/wr cycles

SoCT – Memory – 8 © Lehrstuhl für Integrierte Systeme

79
Look Inside

Definitions:
 Bandwidth: Amount of data into/out of a device
or across interface per unit time
 Latency: Time elapsed between request and
delivery of data
 Cycle time: Time between two consecutive
read/write accesses

512M DRAM

 Asynchronous memory: (self-timed) Change of address (and control)


lines triggers memory read/write
 Synchronous memory: All memory operations occur synchronous to
clock edge(s)
 Multiple requests may be outstanding

SoCT – Memory – 9 © Lehrstuhl für Integrierte Systeme

Memory Architecture: Decoders


[Figure, left: N words (Word 0 ... Word N-1) of M bits each, built from storage cells and selected directly by N select signals S0 ... SN-1. Right: the same array with a decoder that generates the word selects from K address bits A0 ... AK-1. Input-Output: M bits]

N words → N select signals! Decoder reduces # of select signals to K = log2 N

Aspect ratio (height / width) not suitable for implementation / performance!
[Rabaey]

SoCT – Memory – 10 © Lehrstuhl für Integrierte Systeme

80
Memory Architecture: Array-Structure

2 L-K Bit Line


Storage Cell

AK

Row Decoder
A K+1 Word Line

A L-1

M.2 K

Sense Amplifiers / Drivers Amplify swing to rail-


to-rail amplitude
A0
Column Decoder Select appropriate
A K-1
word
Input-Output
(M bits)
[Rabaey]

SoCT – Memory – 11 © Lehrstuhl für Integrierte Systeme

Hierarchical Memory Architecture

Row
Address

Column
Address

Block
Address

Global Data Bus


Control Block Selector Global
Circuitry Amplifier/Driver

I/O
Advantages:
1. Shorter wires within blocks
2. Block address activates only 1 block => power savings
[Rabaey]

SoCT – Memory – 12 © Lehrstuhl für Integrierte Systeme

81
1-Transistor DRAM Cell

size / BL
density
WL Write "1" Read "1"
speed WL

robustness M1 X
CS GND VDD VT

VDD
BL
VDD/2 VDD /2
CBL sensing

Write: CS is charged or discharged by asserting WL and BL.

Read: Charge redistribution takes place between the bit line and the storage capacitance:

$$\Delta V = V_{BL} - V_{PRE} = (V_{BIT} - V_{PRE}) \cdot \frac{C_S}{C_S + C_{BL}}$$

Voltage swing is small; typically around 250 mV.
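The charge-redistribution formula can be evaluated numerically to see why the swing is so small. The following is a minimal sketch; the supply voltage and both capacitance values are assumptions chosen for illustration (they are not from the slide), and a stored "1" is simplified to the full VDD rather than VDD - VT.

```python
def bitline_swing(v_bit, v_pre, c_s, c_bl):
    """Bit-line voltage change after charge redistribution:
    dV = (V_BIT - V_PRE) * C_S / (C_S + C_BL)."""
    return (v_bit - v_pre) * c_s / (c_s + c_bl)


# Assumed example values (illustrative only, not from the lecture):
VDD = 2.5          # supply voltage in volts
C_S = 30e-15       # storage capacitance, 30 fF
C_BL = 120e-15     # bit-line capacitance, 120 fF
V_PRE = VDD / 2    # bit line precharged to VDD/2

dv_one = bitline_swing(VDD, V_PRE, C_S, C_BL)    # reading a stored "1"
dv_zero = bitline_swing(0.0, V_PRE, C_S, C_BL)   # reading a stored "0"

print(f"read '1': {dv_one * 1e3:+.0f} mV, read '0': {dv_zero * 1e3:+.0f} mV")
```

With these assumed values the bit line moves by roughly a quarter of a volt in either direction, which is why a sense amplifier is needed to restore a rail-to-rail signal.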


[Rabaey]

SoCT – Memory – 13 © Lehrstuhl für Integrierte Systeme

Trench DRAM Cell

Bitline
Wordline

n+ - Si

SiO2

Polysilicon

p-Si
Depletion Zone

Inversion
at SiO2/Si
Interface

Address Memory
Transistor Capacitor

[IC1]

SoCT – Memory – 14 © Lehrstuhl für Integrierte Systeme

82
Advanced 1 Transistor DRAM Cells

Word line
Cell plate Capacitor dielectric layer
Insulating Layer

Cell Plate Si

Capacitor Insulator Transfer gate Isolation


Refilling Poly Storage electrode

Storage Node Poly

Si Substrate
2nd Field Oxide

[Rabaey]
Trench Cell Stacked-capacitor Cell

SoCT – Memory – 15 © Lehrstuhl für Integrierte Systeme

Sense Amplifier
Bitlines

WLi-4

WLi-3
Memory
WLi-2 cells

WLi-1

SA1 SA2 SA3 SA4 SA5 SA6 SA7 SA: Sense Amplifier

WLi

WLi+1 Speicher- Memory


WLi+2 zellen cells

WLi+3

WLi : Wordlines 2n/2 Bitlines

[IC1]

SoCT – Memory – 16 © Lehrstuhl für Integrierte Systeme

83
Sense Amplifier
VDD

Read 1 pull u p

T1 T2
VDD

Cs

Equalize

BL TE BL
K1 K2

T3 T4
WL

Sense
Memory Cell
[IC1]

SoCT – Memory – 17 © Lehrstuhl für Integrierte Systeme

Sense Amplifier
VDD

Read 0 pull u p

T1 T2
VDD

Cs

Equalize

BL TE BL
K1 K2

T3 T4
WL

Sense
Memory Cell
[IC1]

SoCT – Memory – 18 © Lehrstuhl für Integrierte Systeme

84
6-Transistor CMOS SRAM Cell

size / WL
density

speed
VDD
M2 M4
robustness
Q
Q M6
M5

M1 M3

BL BL

[Rabaey]

SoCT – Memory – 19 © Lehrstuhl für Integrierte Systeme

ROM – Read Only Memory

WLi 4 ... 6 m

„0“

WLk
„1“

Programing
Programming
„0“ „1“

BL1 BL2
size /
density

speed

robustness
[IC1]

SoCT – Memory – 20 © Lehrstuhl für Integrierte Systeme

85
Floating Gate Transistor Cell

Floating gate Gate


D
Source Drain

tox G

tox
S
p
n+ n+
Substrate

(a) Device cross-section (b) Schematic symbol

[Rabaey]

SoCT – Memory – 21 © Lehrstuhl für Integrierte Systeme

Floating-Gate Transistor Programming


20 V 0V 5V

20 V 0V 5V
10 V 5 V 5 V 2.5 V

S D S D S D

Avalanche injection. Removing programming voltage Programming results in


leaves charge trapped. higher V T.
ID

[Rabaey]
Vwl Vgs
SoCT – Memory – 22 © Lehrstuhl für Integrierte Systeme

86
Flash Memory Cell

[Infineon]

SoCT – Memory – 23 © Lehrstuhl für Integrierte Systeme

Phase Change Memory

• Non-volatile
• Faster and more writable
than Flash
• Research topic
• Phase change effect used
with Blu-Ray

[Wong et al, IEEE Proceedings, Dec. 2010]


[IBM, extremetech, May 2014]

SoCT – Memory – 24 © Lehrstuhl für Integrierte Systeme

87
How to Use Memory in System Design

Capacity
Access
speed
FAST LOW
CPU
Cache
Local Bus
Fast & Small SRAM
Slower & larger SDRAM
I/O Subsystem (SCSI, PCI, etc)
Disk
Tape
SLOW HIGH

[IC1]

SoCT – Memory – 25 © Lehrstuhl für Integrierte Systeme

Memory in System Design: Example

66-133MHz Arb
266MHz 32/64-bit
64-bit PCI-X, with ECC
33-66MHz RAM/ROM/
On-chip Peripheral Bus (OPB) 33-66 MHz

32/64bit PCI Peripheral


Up to 66MHz
DDR266 controller
External bus
32-bit
PCI-X SDRAM address /
OPB master cntlr.
Bridge Controller 32-bit data
Bridge
PLB UART (2)

128-bit master, Monitor


128 bit 128 bit
128-bit slave I2C (2)

up to 133MHz Processor Local Bus (PLB) 128-bit GPIO


GPIO
128-bit 128-bit
128-bit 128-bit 128-bit
SRAM 128KB GPT
DMA
Ctlr. SRAM
Controller

32K 32K
CPU
I-Cache D-Cache 1 MII or 2 RMII
Timers interfaces Cache
MMU 10/100 Local Bus
MAL Ethernet Fast & Small SRAM
Interrupt MAC
CPU Controller Slower & larger (SDRAM)
I/O Subsystem (SCSI, PCI, etc)
JTAG Trace Disk
13 external Tape
interrupts

SoCT – Memory – 26 © Lehrstuhl für Integrierte Systeme

88
A Minimal Memory System

[Gries]

SoCT – Memory – 27 © Lehrstuhl für Integrierte Systeme

SDRAM Read Operation Timing

[Gries]

SoCT – Memory – 28 © Lehrstuhl für Integrierte Systeme

89
SDRAM Read Operation Timing

(tCAS)
[Micron]

- tRCD: row to column delay
- tCAS: column-access strobe
- tRAS: row-access strobe
- tRP: row precharge

Timing: tCAS-tRCD-tRP-tRAS → 2-2-2-6 SDRAM

SoCT – Memory – 29 © Lehrstuhl für Integrierte Systeme

SDRAM Read Operation Timing

(tCAS)
[Micron]

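Access latency: $t_{mem\_acc} = t_{RCD} + t_{CAS}$
Min. row cycle time: $t_{RC} = t_{RAS} + t_{RP}$
Peak bandwidth: $BW_{peak} = f \cdot w$
Burst bandwidth:

$$BW(burst = n) = \begin{cases} \dfrac{f \cdot w \cdot n}{t_{RC}}, & n \le t_{RC} - t_{mem\_acc} \\ \dfrac{f \cdot w \cdot n}{t_{mem\_acc} + (n-1)}, & \text{else} \end{cases}$$

* time in clock cycles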
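The burst bandwidth expression can be tried out with the 2-2-2-6 timing named on the previous slide. This is a sketch only: the function name and the normalized clock rate and data width in the example are assumptions for illustration.

```python
def sdram_burst_bandwidth(f, w, n, t_rcd, t_cas, t_ras, t_rp):
    """Burst bandwidth for burst length n; all timings in clock cycles.

    f: clock rate, w: data width in bits (result is in bits per second
    if f is given in Hz).
    """
    t_mem_acc = t_rcd + t_cas      # access latency
    t_rc = t_ras + t_rp            # minimum row cycle time
    if n <= t_rc - t_mem_acc:
        cycles = t_rc              # short burst: the row cycle time dominates
    else:
        cycles = t_mem_acc + (n - 1)
    return f * w * n / cycles


# 2-2-2-6 SDRAM (tCAS-tRCD-tRP-tRAS), normalized to f = 1 and w = 1,
# so the result is the fraction of BW_peak = f * w that is reached.
for n in (1, 4, 8, 64):
    share = sdram_burst_bandwidth(1, 1, n, t_rcd=2, t_cas=2, t_ras=6, t_rp=2)
    print(f"burst length {n:2d}: {share:.2f} of peak bandwidth")
```

Longer bursts amortize the access latency, so the achieved bandwidth approaches the peak value as n grows.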
SoCT – Memory – 30 © Lehrstuhl für Integrierte Systeme

90
SDRAM Write Operation Timing

[Gries]

SoCT – Memory – 31 © Lehrstuhl für Integrierte Systeme

SDRAM Write Operation Timing

[Gries]

SoCT – Memory – 32 © Lehrstuhl für Integrierte Systeme

91
Synchronous DRAM vs DDR

[Gries]
SoCT – Memory – 33 © Lehrstuhl für Integrierte Systeme

Multiport DRAM

Row
Address

Column
Address

Block
Address

Global Data Bus


Control Block Selector Global
Circuitry Amplifier/Driver

I/O
Separate rd/wr
addresses, decoders
and I/O

SoCT – Memory – 34 © Lehrstuhl für Integrierte Systeme

92
Synchronous DRAM vs RAMBUS

[RAMBUS]

SoCT – Memory – 35 © Lehrstuhl für Integrierte Systeme

Embedded DRAM

Pro‘s:
• customized size
• wide data bus
• multi port
• high speed

Con‘s:
• complex technology
• expensive
• less density

[IBM]

SoCT – Memory – 36 © Lehrstuhl für Integrierte Systeme

93
Wide I/O – 3D-Integration with Through
Silicon Vias (TSV)

Stacked Wide I/O DRAM


TSV
Processor
PCB
BGA

 1200 TSVs to connect memory and processor layers


 Short, low C interconnects (TSV)
 High packaging density (no bonding wires)
 TSV diameter: 40-50 µm (approx. 500 TSV/mm²)
 Thin wafers (50-100 µm)
 4 layers max.
 Total SoC height: < 1mm
[www.3dic.org/Wide_IO]

SoCT – Memory – 37 © Lehrstuhl für Integrierte Systeme

Memory Summary

 Memories are key elements in integrated system design


 Come in different types with different optimization
criteria (size/density, access speed/cycle time,
robustness)
 Dynamic / static / non-volatile memory
 Memory cell / bank / array structure
 Single data / burst access

SoCT – Memory – 38 © Lehrstuhl für Integrierte Systeme

94
References

[1] Jan Rabaey: Digital Integrated Circuits: A Design Perspective, Prentice Hall, 2nd
Edition, 2003
[2] Stechele/Herkersdorf: Integrierte Schaltungen, Lecture notes, TUM, 2003
[3] IBM photos http://www-3.ibm.com/chips/photolibrary/photo10.nsf/home?ReadForm
[4] www.rambus.com white papers
[5] M. Gries: A Survey of Synchronous RAM Architectures, Swiss Federal Institute of
Technology, ETHZ, Technical Report TIK No. 71, 1999
[6] J. Alsmeier, Infineon: Speicherkonzepte, 4. Dresdner Sommerschule Mikroelektronik,
September 2003
[7] R. Desikan, University of Texas, Tech Report TR-02-47, Sept. 2002
[8] Micron 256Mb: x4, x8, x16 SDRAM Features
[9] Wong, H-S. Philip, et al. "Phase change memory." Proceedings of the IEEE 98.12
(2010): 2201-2227.
[10] Motoyoshi, Makoto. "Through-silicon via (TSV)." Proceedings of the IEEE 97.1
(2009): 43-48.

SoCT – Memory – 39 © Lehrstuhl für Integrierte Systeme

95
Interconnect

System-on-Chip © Lehrstuhl für Integrierte Systeme


Technologies
Theresienstr. 90
A. Herkersdorf Building N1, 2nd floor
A. Surhonne www.lis.ei.tum.de

Outline

• On-Chip Buses
 Basic operation
 Methods for increasing bus throughput
 PLB, AHB, AXI
• Outlook for Network-on-Chip
• FIFOs
 Principles of operation
• Example: Networking SoC

SoCT – Interconnect – 2 © Lehrstuhl für Integrierte Systeme

96
On-Chip Buses

Bus Masters: Initiate transfers
Bus Slaves: React on requests

Characteristics:
- Max. number of masters supported
- Max. number of slaves supported
- Bus width
- Separate/shared Rd/Wr bus
- Clock rate
- Arbitration scheme

[Figure: shared bus with arbiter connecting CPU, ASIC1, Traffic Mgr, DSP, EN MAC and Mem]

SoCT – Interconnect – 3 © Lehrstuhl für Integrierte Systeme

On-Chip Buses – Basic Operation

• Synchronous bus: central CLK signal
• Shared medium: Arbiter controls access among multiple bus masters by means of a Request/Grant protocol
• Separate address and data buses, varying bus widths (32-256 bit)
• Acknowledgement for successful transaction

Bus throughput depends on:
• Bus width
• Bus CLK rate

[Timing diagram: CLK, Master Req, RNW (Read/Write), Grant, Addr Bus (A1, A2), Data Bus (D(A1), D(A2)), Data Ack]

Bus Transaction = Address Cycle (Request, Addr. Trans) + Data Cycle (Data Trans, Data Ack)

SoCT – Interconnect – 4 © Lehrstuhl für Integrierte Systeme

97
Bus Arbitration Schemes

Determines sequence in which requests from multiple masters are serviced

Scheme | Pro's | Con's
Round Robin | Simple control | No QoS support
Strict Priority (example classes: II, BE, I, I) | Simple control; different service classes | "Starvation" of low priority traffic
Weighted Priority (example weights: 15 / 5 / 40 / 40 %) | No starvation | Complex control; no BW guarantees
Weighted Priority + Credits (example weights: 15 / 5 / 40 / 40 %, credits: 5 / 30 / 10 / 30 KB) | QoS with BW guarantees | Complex control
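The difference between the first two schemes can be illustrated with a toy simulation. The sketch below assumes three masters that request the bus in every cycle and one grant per cycle; the function names and this traffic model are assumptions for illustration, not part of any bus specification.

```python
from collections import Counter


def round_robin(requests, cycles):
    """Grant one requesting master per cycle, starting after the last winner."""
    n = len(requests)
    grants = Counter()
    last = n - 1
    for _ in range(cycles):
        for offset in range(1, n + 1):
            master = (last + offset) % n
            if requests[master]:
                grants[master] += 1
                last = master
                break
    return grants


def strict_priority(requests, cycles):
    """Always grant the requesting master with the lowest index (highest priority)."""
    grants = Counter()
    for _ in range(cycles):
        for master, requesting in enumerate(requests):
            if requesting:
                grants[master] += 1
                break
    return grants


always_requesting = [True, True, True]
rr = round_robin(always_requesting, cycles=9)
sp = strict_priority(always_requesting, cycles=9)
print("round robin    :", dict(rr))   # every master gets the same share
print("strict priority:", dict(sp))   # only master 0 is ever served
```

Under this saturated load, round robin shares the bus equally, while strict priority starves every master except the highest-priority one.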

SoCT – Interconnect – 5 © Lehrstuhl für Integrierte Systeme

Example: Processor Local Bus (PLB)

Central Bus
Arbitration Arbiter
PLB
Address Bus Slave
Address Bus

Write Data Write Data


Bus Bus

Control Control
PLB
Core
PLB Read Data OR Read Data
Master Bus Bus
Status & Status &
Control OR Control

Shared Bus

SoCT – Interconnect – 6 © Lehrstuhl für Integrierte Systeme

98
PLB: Standard Read (Rd) Transfers

SYS_Clk
Mn_req 1 2
Mn_RNW
Mn_ABus A0 B0
PLB_Avalid 1 2
SI_AddrAck
Data Bus
SI_DBus D(A0) D(B0)
SI_DAck 1 2
tarb tacc tarb tacc

$$BW = \frac{f \cdot w}{t_{arb} + t_{acc}}$$

For memory: $t_{acc} = t_{mem\_acc}$
* time in clock cycles

SoCT – Interconnect – 7 © Lehrstuhl für Integrierte Systeme

Bus Throughput Improvements

PLB employs following methods to


increase bus throughput:
 Independent data buses for
reads and writes
 Pipelining
 Burst Transfers

These are generic methods which are used by AMBA and


other on-chip buses too

SoCT – Interconnect – 8 © Lehrstuhl für Integrierte Systeme

99
Pipelined Bus Control
- Strictly sequential Address/Data cycle
  - Data Ack terminates the current transfer and releases a new transfer
  [Sequence: Address Cycle (Request, Address Transfer), followed by Data Cycle (Data Transfer, Data Ack)]

- Pipelined Address/Data cycle
  - Enables multiple, simultaneous transfers
  - Initiation of a transfer before completion of the previous transfer
  [Sequence: the Address Cycle (Request, Address Transfer, Addr Ack) overlaps with the Data Cycle (Data Transfer, Data Ack) of the previous transfer]

SoCT – Interconnect – 9 © Lehrstuhl für Integrierte Systeme

PLB: Pipelined Rd Transfers

SYS_Clk
Mn_req 1 2
Mn_RNW
Mn_ABus A0 B0
PLB_PAvalid 1
PLB_SAvalid 2
Sn_AddrAck 1 2
Data Bus
SI_rdDBus D(A0) D(B0)
SI_rdDAck 1 2

tarb tacc
BWpeak = f ∙ w
tarb tacc
* assuming B0 is independent of D(A0)

SoCT – Interconnect – 10 © Lehrstuhl für Integrierte Systeme

100
PLB: Pipelined Rd Transfers (But...)

SYS_Clk
Mn_req 1 2
Mn_RNW
Mn_ABus A0 B0
PLB_PAvalid 1 2
PLB_SAvalid 2
SI_AddrAck 1 2
Data Bus
SI_rdDBus D(A0) D(B0)
SI_rdDAck 1 2

tarb tacc
tarb tacc

SoCT – Interconnect – 11 © Lehrstuhl für Integrierte Systeme

Burst Transfers

Reduction of Req./Addr. signaling overhead for read/write transactions to consecutive addresses
→ Burst transfers with implicit address increment

[Figure: without bursts, every Data Trans + Ack needs its own Request and Addr. Trans; with bursts, a single Request and Addr. Trans is followed by several Data Trans + Ack]

SoCT – Interconnect – 12 © Lehrstuhl für Integrierte Systeme

101
PLB: Burst Rd Transfers

SYS_Clk
Mn_req 1 2
Mn_RNW
Mn_ABus A0 B0
PLB_Avalid 1 2
Sn_AddrAck
Data Bus
SI_DBus D(A0) D(A1) D(A2) D(A3) D(B0) D(B1) D(B2)
SI_DAck 1 2
tarb tacc tarb tacc

$$BW = \frac{f \cdot w \cdot n}{(t_{arb} + t_{acc}) + (n - 1)}$$

* time in clock cycles
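To compare standard and burst read transfers, both bandwidth expressions can be evaluated side by side. The sketch below uses assumed values for the clock rate, bus width, arbitration time and access time; only the two formulas are taken from the slides.

```python
def bw_single(f, w, t_arb, t_acc):
    """Bandwidth of non-pipelined single transfers (times in clock cycles)."""
    return f * w / (t_arb + t_acc)


def bw_burst(f, w, n, t_arb, t_acc):
    """Bandwidth of non-pipelined burst transfers of length n."""
    return f * w * n / ((t_arb + t_acc) + (n - 1))


# Assumed example: 100 MHz bus clock, 64 bit data bus, 1 cycle arbitration,
# 2 cycles access time.
F, W, T_ARB, T_ACC = 100e6, 64, 1, 2

print(f"peak      : {F * W / 1e9:.2f} Gb/s")
print(f"single    : {bw_single(F, W, T_ARB, T_ACC) / 1e9:.2f} Gb/s")
for n in (4, 16):
    print(f"burst n={n:2d}: {bw_burst(F, W, n, T_ARB, T_ACC) / 1e9:.2f} Gb/s")
```

A burst of length 1 reduces to the single-transfer case, and longer bursts move the result toward the peak value f ∙ w.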

SoCT – Interconnect – 13 © Lehrstuhl für Integrierte Systeme

PLB: Pipelined Burst Rd Transfers

SYS_Clk
Mn_req 1 2
Mn_RNW
Mn_ABus A0 B0
PLB_PAvalid 1
PLB_SAvalid 2
Sn_AddrAck 1 2
Data Bus
SI_rdDBus A0 A1 A2 A3 B0 B1 B2 B3

SI_rdDAck 1 1 1 1 2 2 2 2

tarb tacc
$$BW = \frac{f \cdot w \cdot n}{n} = f \cdot w$$

SoCT – Interconnect – 14 © Lehrstuhl für Integrierte Systeme

102
PLB: Pipelined Back-to-Back Rd & Wr

SYS_Clk
Mn_req 1 2 3 4
Mn_RNW
Mn_ABus A B C D
PLB_PAvalid 1 2
PLB_SAvalid 3 4
SI_AddrAck 1 2 3 4
Write Data Bus
Mn_wrDBus B0 B1 B2 B3 D0 D1 D2 D3

SI_wrDAck 2 2 2 2 4 4 4 4

Read Data Bus


SI_rdDBus A0 A1 A2 A3 C0 C1 C2 C3

SI_rdDAck 1 1 1 1 3 3 3 3

SoCT – Interconnect – 15 © Lehrstuhl für Integrierte Systeme

PLB: Pipelined Back-to-Back Rd & Wr

SYS_Clk
Mn_req 1 2 3 4 5
Mn_RNW
Mn_ABus A B C D E
PLB_PAvalid 1 2
PLB_SAvalid 3 4 5
SI_AddrAck 1 2 3 4 5
Write Data Bus
Mn_wrDBus B0 B1 B2 B3 D0 D1 D2 D3

SI_wrDAck 2 2 2 2 4 4 4 4

Read Data Bus


SI_rdDBus A0 A1 A2 A3 C0 C1 C2 C3 E0

SI_rdDAck 1 1 1 1 3 3 3 3 5

SoCT – Interconnect – 16 © Lehrstuhl für Integrierte Systeme

103
Example: AMBA AHB

• AMBA AHB
  - Advanced Microcontroller Bus Architecture
  - Advanced High-Performance Bus

• Features improving bus throughput:
  - Independent data buses (reads/writes)
  - Pipelining (with the previous transfer only)
  - Burst transfers
  - Split transfers

[Figure: AHB connecting a high-performance ARM processor, high-bandwidth on-chip RAM, a high-bandwidth memory interface and a DMA bus master]

SoCT – Interconnect – 17 © Lehrstuhl für Integrierte Systeme

AMBA AHB: Split transfers

• Master M0 reads data X0...X3 from slow slave S0


• Master M1 reads data Y0...Y3 from fast slave S1

Master Req M0 M1
t
Addr Bus X Y
t
Y0 Y1 Y2 Y3 X0 X1 X2 X3
Data Bus
t
Transfer to M0 is split

• The bus is not blocked by S0

Note: PLB features split-bus transfers:


 Read and write data buses are independent
 Simultaneous use of read and write data buses for two independent
transactions is possible

SoCT – Interconnect – 18 © Lehrstuhl für Integrierte Systeme

104
Example: AMBA AXI

• AMBA AXI

source: http://www.arm.com
 Advanced Microcontroller Bus Architecture
 Advanced eXtensible Interface

• Features improving bus throughput:


 Independent data buses (reads/writes) and
independent addr. buses (reads/writes)
 Pipelining (over multiple previous transfers)
 Burst transfers
 Out-of-order transfers
ARM Cortex-A5 implementing
a 64-bit AXI bus

SoCT – Interconnect – 19 © Lehrstuhl für Integrierte Systeme

AMBA AXI: Out-of-order transfers

• Master M0
 reads data X0...X3 and Y0...Y3 from slow slave S0
 Y0...Y3 can be delivered faster than X0...X3
• Master M1
 reads data Z0...Z3 from fast slave S1

Master Req M0 M0 M1
t

Addr Bus X Y Z
t
Z0 Z1 Y0 Y1 X0 Y2 X1 Y3 Z2 X2 Z3 X3
Data Bus
t

• Reordering occurs
 among multiple masters
 among multiple transfers of the same master
 but not within a burst
SoCT – Interconnect – 20 © Lehrstuhl für Integrierte Systeme

105
Bus Standards Comparison
Feature | OPB | PLB | APB | AHB | AXI
Bus family | CoreConnect | CoreConnect | AMBA | AMBA | AMBA
Full name | On-Chip Peripheral Bus | Processor Local Bus | Advanced Peripheral Bus | Advanced High Performance Bus | Advanced Extensible Interface
Addresses | 32/64 bit | 32/64 bit | 32 bit | 32 bit | 32 bit
Bus widths | 32/64 bit | 32-256 bit | 8-32 bit | 8-1024 bit | 8-1024 bit
# Masters | 4 | 16 | 1 (bridge to AHB) | 16 | n/a
Bursts | yes | var. length | - | var. length | var. length
Pipelining | - | yes | - | yes | yes
Separate rd/wr data buses | - | yes | - | yes | yes
Split transfer | - | yes | - | yes | yes
Out-of-order transfer | - | - | - | - | yes

SoCT – Interconnect – 21 © Lehrstuhl für Integrierte Systeme

Summary On-Chip Buses

Various on-chip bus industry standards exist


 AMBA (ARM), CoreConnect (IBM), OCP (Sonics), VSIA
 Widely used SoC interconnect structure
 Differentiation in:
 speed (clock rate)
 width (64/128/256 bits)
 transfer protocol (pipelined, split/out-of-order transfers)
 number of masters/slaves supported
 Capacity limitation for high-end applications due to
shared medium nature

SoCT – Interconnect – 22 © Lehrstuhl für Integrierte Systeme

106
FIFO Interface - Motivation

• On-chip Buses
 Synchronous, high throughput, shared medium
 High overhead for point-to-point connect between two
modules

• FIFOs:
 widely used point-to-point interconnect between asynchronous
modules with standardized interfaces

SoCT – Interconnect – 23 © Lehrstuhl für Integrierte Systeme

FIFO – Application Example

FIFOs are used for
- Decoupling clock domains
  - 100 MHz PLB bus
  - 125 MHz HW-Accelerator
- Decoupling data path widths
  - 64 bit PLB bus
  - 32 bit HW-Accelerator

FIFO size
- As small as possible
- But large enough to compensate different read/write data rates

[Figure: PPC (300 MHz) with Dev. Driver as master (M) on the PLB-Bus (100 MHz) with Arbiter; a Bus Interface as slave (S) connects through a TX-FIFO and an RX-FIFO to the HW-Accelerator (125 MHz)]

SoCT – Interconnect – 24 © Lehrstuhl für Integrierte Systeme

107
FIFO Architecture

Internal FIFO architecture
- A ring of SRAM memory cells or registers
- 2 address counters
  - Read & write pointers, incremented

Additional control logic
- Fill level
- Prevent illegal accesses
  - Prevent wrap-around overwriting
- Almost full (AF) and almost empty (AE) flags for early detection of FIFO overflow or underrun

[Figure: ring buffer with Read Pointer, Write Pointer and the AE and AF thresholds]
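The architecture described above can be modeled in a few lines of software. The following is a behavioral sketch, not a hardware description: the class and attribute names are assumptions, and the flag comparisons (almost empty while the fill level is at most AE, almost full once it exceeds AF) are one possible reading of the thresholds.

```python
class Fifo:
    """Ring-buffer FIFO with almost-empty (AE) and almost-full (AF) flags."""

    def __init__(self, depth=8, ae=2, af=5):
        self.mem = [None] * depth   # ring of storage locations 0 .. depth-1
        self.depth = depth
        self.ae = ae                # almost-empty threshold
        self.af = af                # almost-full threshold
        self.rp = 0                 # read pointer
        self.wp = 0                 # write pointer
        self.fill = 0               # current fill level

    @property
    def almost_empty(self):
        return self.fill <= self.ae

    @property
    def almost_full(self):
        return self.fill > self.af

    def write(self, item):
        if self.fill == self.depth:
            raise OverflowError("FIFO full: write would overwrite unread data")
        self.mem[self.wp] = item
        self.wp = (self.wp + 1) % self.depth   # wrap around the ring
        self.fill += 1

    def read(self):
        if self.fill == 0:
            raise IndexError("FIFO empty: nothing to read")
        item = self.mem[self.rp]
        self.rp = (self.rp + 1) % self.depth   # wrap around the ring
        self.fill -= 1
        return item


fifo = Fifo(depth=8, ae=2, af=5)
for word in range(6):
    fifo.write(word)
print(fifo.almost_empty, fifo.almost_full)   # fill level 6 is above AF
print(fifo.read())                           # oldest word comes out first
```

Tracking the fill level explicitly keeps the full and empty cases apart even though the read and write pointers are equal in both.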

SoCT – Interconnect – 25 © Lehrstuhl für Integrierte Systeme

FIFOs – Principle of Operation

FIFO with 8 storage locations (0 to 7), AE=2, AF=5


0 1 2 3 4 5 6 7

WP RP
AE AF

FIFO is empty!

SoCT – Interconnect – 26 © Lehrstuhl für Integrierte Systeme

108
FIFOs – Principle of Operation

FIFO with 8 storage locations (0 to 7), AE=2, AF=5


0 1 2 3 4 5 6 7

RP WP
AE AF

SoCT – Interconnect – 27 © Lehrstuhl für Integrierte Systeme

FIFOs – Principle of Operation

FIFO with 8 storage locations (0 to 7), AE=2, AF=5


0 1 2 3 4 5 6 7

RP WP
AE AF

SoCT – Interconnect – 28 © Lehrstuhl für Integrierte Systeme

109
FIFOs – Principle of Operation

FIFO with 8 storage locations (0 to 7), AE=2, AF=5


0 1 2 3 4 5 6 7

RP WP
AE AF

Fill level exceeds AE → reads can start without risk of underflow!

SoCT – Interconnect – 29 © Lehrstuhl für Integrierte Systeme

FIFOs – Principle of Operation

FIFO with 8 storage locations (0 to 7), AE=2, AF=5


0 1 2 3 4 5 6 7

RP WP
AE AF

SoCT – Interconnect – 30 © Lehrstuhl für Integrierte Systeme

110
FIFOs – Principle of Operation

FIFO with 8 storage locations (0 to 7), AE=2, AF=5


0 1 2 3 4 5 6 7

RP WP
AE AF

SoCT – Interconnect – 31 © Lehrstuhl für Integrierte Systeme

FIFOs – Principle of Operation

FIFO with 8 storage locations (0 to 7), AE=2, AF=5


0 1 2 3 4 5 6 7

RP WP
AE AF

SoCT – Interconnect – 32 © Lehrstuhl für Integrierte Systeme

111
FIFOs – Principle of Operation

FIFO with 8 storage locations (0 to 7), AE=2, AF=5


0 1 2 3 4 5 6 7

RP WP
AE AF

SoCT – Interconnect – 33 © Lehrstuhl für Integrierte Systeme

FIFOs – Principle of Operation

FIFO with 8 storage locations (0 to 7), AE=2, AF=5


0 1 2 3 4 5 6 7

RP WP
AE AF

Fill level exceeds AF → additional writes should be omitted to prevent overflow!

SoCT – Interconnect – 34 © Lehrstuhl für Integrierte Systeme

112
FIFOs – Principle of Operation

FIFO with 8 storage locations (0 to 7), AE=2, AF=5


0 1 2 3 4 5 6 7

WP RP
AE AF

Only one remaining storage location available!

SoCT – Interconnect – 35 © Lehrstuhl für Integrierte Systeme

FIFOs – Principle of Operation

FIFO with 8 storage locations (0 to 7), AE=2, AF=5


0 1 2 3 4 5 6 7

WP RP
AE AF

SoCT – Interconnect – 36 © Lehrstuhl für Integrierte Systeme

113
FIFOs – Principle of Operation

FIFO with 8 storage locations (0 to 7), AE=2, AF=5


0 1 2 3 4 5 6 7

WP RP
AF AE

Fill level again below AF!

SoCT – Interconnect – 37 © Lehrstuhl für Integrierte Systeme

Example: Networking SoC

Given
- IP Router/VoIP Gateway SoC
- EN MAC
  - 4 x 1 Gb/s
- PLB Bus
  - 2 x 128 bit Rd/Wr
  - 180 MHz clock
  - Nom. Capacity: 46 Gb/s

Question
- Under/Over/Well dimensioned?

[Figure: SoC block diagram with CPU, ASIC1, Traffic Mgr, DSP, EN MAC and Mem on a shared bus]
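The nominal bus capacity quoted above follows directly from the bus parameters. The short calculation below is a sketch that reproduces the number and relates it to the aggregate link rate; the variable names are illustrative.

```python
# Parameters from the slide
N_DATA_BUSES = 2          # separate read and write data buses
BUS_WIDTH_BITS = 128
F_CLK_HZ = 180e6
N_LINKS = 4
LINK_RATE_BPS = 1e9       # 1 Gb/s per Ethernet MAC port

nominal_capacity = N_DATA_BUSES * BUS_WIDTH_BITS * F_CLK_HZ
aggregate_link_rate = N_LINKS * LINK_RATE_BPS
headroom = nominal_capacity / aggregate_link_rate

print(f"nominal bus capacity: {nominal_capacity / 1e9:.2f} Gb/s")
print(f"aggregate link rate : {aggregate_link_rate / 1e9:.0f} Gb/s")
print(f"ratio               : {headroom:.1f}x")
```

Whether this ratio is sufficient depends on how many bus transfers each packet actually causes, which the message sequence chart on the next slide examines.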

SoCT – Interconnect – 38 © Lehrstuhl für Integrierte Systeme

114
Basic Packet Rx / Process / Tx
• Message Sequence Chart (MSC):
  - Uncover "hidden" transfers for CPU/MAC notification
  - Short packets are worst case condition (packet size ≈ notification message)

[Figure: message sequence chart between MAC, Bus, Mem and CPU with the steps packet reception, CPU retrieves packet from memory, packet processing, CPU writes back packet to memory, and packet transmission; annotated with C and 6C. The SoC block diagram (CPU, ASIC1, Traffic Mgr, DSP, EN MAC, Mem) is shown alongside.]

SoCT – Interconnect – 39 © Lehrstuhl für Integrierte Systeme

Plus...

Not yet considered
- Buffer management
  - Maintenance of data structure to store different size packets in equal size buffers
- Memory accesses during packet processing
  - Per flow contexts
- Coprocessor invocation
  - ASIC, Traffic Manager
- DSP operation
  - Low capacity but delay sensitive

Bottom line
- Networking SoCs have on-chip communication demands easily exceeding 10x link rate capacity
- Candidates for crossbar switches, hierarchical buses or advanced NoCs

[Figure: SoC block diagram with CPU, ASIC1, Traffic Mgr, DSP, EN MAC and Mem]

SoCT – Interconnect – 40 © Lehrstuhl für Integrierte Systeme

115
… and last but not least

Memory access BW is even more constrained
- 200 MHz, 32b DDR SDRAM
- Rand. Rd/Wr access: 25 ns
- 64 Byte transfer: 65 ns
- Access capacity: 7.9 Gb/s

[Figure: SoC block diagram with CPU, ASIC1, Traffic Mgr, DSP, EN MAC and Mem]

Improvements
 Multiple, physical memories
 Separation data, state, control
 Interleaving techniques
 Multi-port memories

SoCT – Interconnect – 41 © Lehrstuhl für Integrierte Systeme

NoC – Network on Chip

Benefits
- Scalability: Aggregate bandwidth scales with network size
- Segmentation of wires: short point-to-point links
  - Pipelining, power consumption, reliability/crosstalk
- Synchronization
  - Fully synchronous clock distribution NOT required

Drawbacks
- Latency, Area

Topologies: 2D Mesh, Ring, Octagon, Fat Tree

[Figure: tiles connected by nodes and links]

Further details in lecture "Chip-level Multiprocessors"

SoCT – Interconnect – 42 © Lehrstuhl für Integrierte Systeme

116
Summary

• On-chip buses are industry standard for SoC component


interconnect
• FIFOs are employed for point-to-point communication
• Networking ICs/SoCs demand high on-chip interconnect
and memory capacities
• Current research: Networks on Chip

SoCT – Interconnect – 43 © Lehrstuhl für Integrierte Systeme

References

[1] CoreConnect Processor Local Bus Specifications, available at


https://www-01.ibm.com/chips/techlib/techlib.nsf/products/CoreConnect_Bus_Architecture
[2] AMBA Open Specifications, available at
http://www.arm.com/products/system-ip/amba/amba-open-specifications.php

SoCT – Interconnect – 44 © Lehrstuhl für Integrierte Systeme

117
Cross-Layer Perspectives on
Low Power Design

Andreas Herkersdorf
Armin Sadighi
Anmol Surhonne
Thomas Wild

Chair of Integrated Systems


Technische Universität München

Moore’s Law CMOS Scaling Challenges

[Figure: chip capacity (transistors per chip, 10^3 to 10^13), transistor gate length L (um) and power density (W/cm2) over the years 1972 to 2020, for microprocessors from the 8008, 8086, 80286, 80386, 80486, Pentium, Pentium II, Pentium III and Pentium 4 to the Core 2 Duo and Core i7. Power density reference levels: hot plate, nuclear reactor, rocket nozzle. Trend curves: Performance, Frequency, Number of cores, Power Density.]
Source: [1] ITRS Roadmap 99, 09

SoCT – Low Power Design – 2 © Chair of Integrated Systems, TUM

118
Moore’s Law CMOS Scaling Challenges

Source: AMD

SoCT – Low Power Design – 3 © Chair of Integrated Systems, TUM

Moore’s Law CMOS Scaling Challenges

Source: AMD

SoCT – Low Power Design – 4 © Chair of Integrated Systems, TUM

119
Low Power Design is Prerequisite for …

Higher reliability
- High currents may cause electromigration in metal interconnects.
- Plus 10°C doubles bit failure rate.

Lower cost packaging
- Commodity servers cannot afford mainframe water cooling,
- nor can smartphones afford forced air cooling with fans.

Figure Sources: TU Dresden, IBM 2007, KIT [11]

SoCT – Low Power Design – 5 © Chair of Integrated Systems, TUM

Low Power Design is Prerequisite for …

Portability
- Enabling sophisticated mobile and IoT applications

Higher Integration
- Continued device scaling and 3D integration: dynamic power per device decreases, but the number of devices per area/volume increases

Green Computing
- IT data centers contribute substantially to worldwide CO2 emission.
- HPC replacement is OPEX driven

SoCT – Low Power Design – 6 © Chair of Integrated Systems, TUM

120
Power Trends

Source: R.Puri [4]

SoCT – Low Power Design – 7 © Chair of Integrated Systems, TUM

Power Trends

Below the 65 nm node, static power can be considered equal to active (dynamic) power.

Clock distribution accounts for up to half of active power.

Source: R.Puri [4]

SoCT – Low Power Design – 8 © Chair of Integrated Systems, TUM

121
Energy versus Power

Power is drawn from a voltage source attached to the VDD pin(s) of a chip.

• Energy: Electrical work to transport electrical charge across an electrical potential

$$E = \int_0^T P(t)\,dt = \int_0^T i_{DD}(t)\,V_{DD}\,dt$$

• Power: Rate at which electrical energy is converted into heat

$$P_{avg} = \frac{E}{T} = \frac{1}{T}\int_0^T i_{DD}(t)\,V_{DD}\,dt$$

SoCT – Low Power Design – 9 © Chair of Integrated Systems, TUM

Supply & Threshold Voltage Optimization

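Dynamic power:
$$P_{cap}(V_{dd}) = \alpha \cdot C_L \cdot V_{dd}^2 \cdot f_{clk}$$

Gate delay:
$$t_d(V_{dd}, V_t) \propto \frac{C_L}{V_{dd} - V_t}$$

Static power:
$$P_{leak}(V_t, V_{dd}) \propto I_{leak}(V_t) \cdot V_{dd}$$

Leakage current:
$$I_{leak}(V_t) \propto e^{V_{gs} - V_t}$$

Optimization knobs: lower supply voltage, lower threshold voltage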
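The trade-off behind supply voltage scaling can be made concrete by evaluating the dynamic power and gate delay relations for two supply voltages. The numbers below (activity factor, load capacitance, clock rate, voltages) are assumptions for illustration, and the delay is only computed up to a proportionality constant.

```python
def dynamic_power(alpha, c_load, v_dd, f_clk):
    """P_cap = alpha * C_L * Vdd^2 * f_clk."""
    return alpha * c_load * v_dd ** 2 * f_clk


def gate_delay_factor(c_load, v_dd, v_t):
    """Quantity proportional to the gate delay: C_L / (Vdd - Vt)."""
    return c_load / (v_dd - v_t)


# Assumed example values
ALPHA, C_LOAD, F_CLK, V_T = 0.1, 1e-9, 100e6, 0.3
V_HIGH, V_LOW = 1.2, 0.9

power_ratio = (dynamic_power(ALPHA, C_LOAD, V_LOW, F_CLK)
               / dynamic_power(ALPHA, C_LOAD, V_HIGH, F_CLK))
delay_ratio = (gate_delay_factor(C_LOAD, V_LOW, V_T)
               / gate_delay_factor(C_LOAD, V_HIGH, V_T))

print(f"dynamic power at {V_LOW} V relative to {V_HIGH} V: {power_ratio:.2f}")
print(f"gate delay    at {V_LOW} V relative to {V_HIGH} V: {delay_ratio:.2f}")
```

In this example the lower supply voltage cuts dynamic power almost in half at the cost of slower gates; lowering the threshold voltage would recover speed but increases leakage.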

SoCT – Low Power Design – 10 © Chair of Integrated Systems, TUM

122
Outline

• System-level
  - Processor Voltage Frequency Scaling
  - Algorithmic optimization, Operand isolation
  - Power gating & sleep transistors
  - Voltage islands
• Architecture-level
  - Scheduling, Pipelining
  - Bus-Segmentation, Memory-partitioning, Datapath reordering
• RTL/Logic-level
  - Clock gating
• Transistor-level
  - Threshold voltage control
  - FinFET/FDSOI
  - Advanced memory cells

[Figure: lecture focus matrix of power type (dynamic, static) versus design level (System, Architecture, RTL/Logic, Transistor)]

Just 2 references for in-depth text books:
[2] J. Rabaey, Low Power Essentials, 2009, Springer
[3] D. Chinnery, K. Keutzer, Closing the Power Gap Between ASIC and Custom, Springer, 2007

SoCT – Low Power Design – 11 © Chair of Integrated Systems, TUM

Hierarchical Low Power Design


Level | Means
System | Power management/budgeting, choice of components, partitioning, approximate computing
Architecture / RTL | Arithmetic transformation, data representation, parallelism, pipelining, resource allocation, adaptive voltage scaling
Circuit / Logic | Resizing, parallelism, VTCMOS, MTCMOS
Physical Design | Power-driven place & route, low power layout
Technology | Scaling, Vt optimization, alternative technologies (SOI), NTC

[Figure: the columns "Gain per Unit" and "Effort and Investment" are drawn graphically across the levels]
Inspired by: Raghunathan

SoCT – Low Power Design – 12 © Chair of Integrated Systems, TUM

123
Power
Dynamic Power Management dyn. stat.

System

Recent trend:
Machine learning-based state traversal
[9, 17, 18, …]
Source: D. Perlmutter [5]
$$Q: S \times A \rightarrow \mathbb{R}$$

$$Q(s_t, a_t) \leftarrow (1 - \alpha) \cdot Q(s_t, a_t) + \alpha \cdot \left(r_t + \gamma \cdot \max_a Q(s_{t+1}, a)\right)$$
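The update rule above is the standard tabular Q-learning step. A minimal sketch of it follows; the power state names, the action set and the reward value are hypothetical and only serve to exercise the update, they do not describe any particular power manager.

```python
def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))."""
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = (1 - alpha) * old + alpha * (reward + gamma * best_next)
    return q[(state, action)]


# Hypothetical power states and actions for illustration only
ACTIONS = ["stay", "go_to_sleep", "wake_up"]
q_table = {}

# The manager puts an idle core to sleep and observes a positive reward
first = q_update(q_table, "idle", "go_to_sleep", reward=1.0,
                 next_state="sleep", actions=ACTIONS, alpha=0.5, gamma=0.9)
print(first)
```

Unvisited state/action pairs default to a value of 0.0 here; other initializations are possible.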

SoCT – Low Power Design – 13 © Chair of Integrated Systems, TUM

Example: DPM in Sensor Network Processors


Die size: 2.7 x 2.7 mm² (130 nm CMOS)
Clock Rates: 8 MHz to 80 kHz
Supply: 0.3-1 V
Leakage Power: 53 µW
Average Power: 150 µW
Peak Power: 5 mW

1200
Power (mW)

RX listen windows TX broadcast packet

766

60
if
baseband
Sleep signals

serial
© IEEE 2006
neighbor
location
queues
dw8051
Source: M. Sheets [6]
dll

SoCT – Low Power Design – 14 © Chair of Integrated Systems, TUM

124
Power
Power Gating dyn. stat.

System

Dyn. Power
Management

SoCT – Low Power Design – 15 © Chair of Integrated Systems, TUM

Trade-Off: Flexibility vs. Performance

CPU DSP
ASIP
Log F L E X I B I L I T Y

FPGA
Instruction Depth

ASIC
Flexibility vs.
Custom IC Performance/Power
dissipation dilemma
Log COMPUTATIONAL DENSITY = performance / area
103 . . . 104
Log Power Efficiency = performance / W
105 . . . 106
Source: A. DeHon [7]; A. Cuomo [8]; T. Noll

SoCT – Low Power Design – 16 © Chair of Integrated Systems, TUM

125
Platform-based SoC Design Methodology

Conquer design complexity by reuse maximization:
Shorter development cycles, higher chances for (first time) fault-free design and competitive value differentiation

- Standard on-chip (bus) interconnect and interfaces (AMBA, CoreConnect)
- Standard RISC CPU cores and SW development environments
- Reuse of existing function building blocks
- Differentiation through new, application specific system cores

[Figure: platform SoC with Processor Core and ISA Ext., Processor Bus, Bus Bridge and Peripheral Bus, Memory Ctrl., SRAM, eDRAM, System Cores, EMAC, PCI-X, UART and GPIO; Blue Logic, XILINX]

SoCT – Low Power Design – 17 © Chair of Integrated Systems, TUM

Power Budgets
Control CLB
I/O Drivers I/O Interconnect
15% 10% 5%
9%
Execution
Units 15%
Clock 21%
40% Clocks 65%
20%
Caches

µProcessor FPGA
I/O Clock

Memory Logic
Source: J. Rabaey [2]
Signal processor
SoCT – Low Power Design – 18 © Chair of Integrated Systems, TUM

126
… as Enablement for Multi-Function Devices

Source: K. Arabi [10], Qualcomm Tech.

SoCT – Low Power Design – 19 © Chair of Integrated Systems, TUM

Source: K. Arabi [10], Qualcomm Tech.

SoCT – Low Power Design – 20 © Chair of Integrated Systems, TUM

127
Example: Invasive Computing Platform

DFG Transregional Collaborative Research Center between


FAU Erlangen, KIT Karlsruhe and TU Munich

•Resource-aware many-core processor


programming, middleware, tools and
architecture

•Resources occupied / released based on:


• Availability
• Utilization (load, bandwidth)
• Operating Conditions (temperature,
frequency/voltage, soft error rate)

See: http://invasic.informatik.uni-erlangen.de/en/index.php

SoCT – Low Power Design – 21 © Chair of Integrated Systems, TUM

Thermal and Dark Silicon Management


Thermal Design Power: TDP = 220 W, Tcritical = 80 °C

Three scenarios (thermal maps from 65 °C to 83 °C, showing active and dark cores):
1. 32 active cores @ 3.2 GHz, Ptotal = 214 W: thermal violation under TDP
2. 32 active cores @ 3.2 GHz, Ptotal = 214 W: location of dark cores affects Tpeak
3. 16 active cores @ 3.6 GHz and 16 active cores @ 2.8 GHz, Ptotal = 218 W: V/F levels of active cores affect Tpeak

Courtesy: Jörg Henkel, KIT Karlsruhe, Khdr, DAC15 [11]

SoCT – Low Power Design – 22 © Chair of Integrated Systems, TUM

128
ARM big.LITTLE

Cortex A-15 (The Big) Cortex A-7 (The Little)


Source: P.Greenhalgh [12]

SoCT – Low Power Design – 23 © Chair of Integrated Systems, TUM

ARM big.LITTLE

Source: P.Greenhalgh [12]

SoCT – Low Power Design – 24 © Chair of Integrated Systems, TUM

129
Samsung Example

Source:A. Frumusanu, R. Smith. (2015, February). ARM A53/A57/T760 investigated - Samsung Galaxy Note 4
Exynos Review. Available: http://www.anandtech.com/show/8718/the-samsung-galaxy-note-4-exynos-
review/
SoCT – Low Power Design – 25 © Chair of Integrated Systems, TUM

Power
Bus/Interconnect Segmentation dyn. stat.

Architecture

• Segmented bus structure
  - Reduction of resource sharing
  - Reduction of switched capacitance
  - Independent transactions on both segments

[Figure: two bus segments connected by a bus bridge]

SoCT – Low Power Design – 26 © Chair of Integrated Systems, TUM

130
Power
Design Partitioning dyn. stat.

System

Spatially Global Spatially Local

 Reduced # of global bus accesses


 Reduced buffer power
 Reduced # of multiplexers
 Average Power reduction: 18.5 %
Area Reduction: 1%
Source: D.Stankovic [13]

SoCT – Low Power Design – 27 © Chair of Integrated Systems, TUM

Crosstalk / Signal Integrity

A
capacitive coupling

D
C
B

B=1

C
Superfluous transitions on
D result in additional Pcap
D

SoCT – Low Power Design – 28 © Chair of Integrated Systems, TUM

131
Power
Fighting Crosstalk dyn. stat.

Architecture

Shielding
wire
GND

VDD Shielding
layer

GND

Substrate (GND)

SoCT – Low Power Design – 29 © Chair of Integrated Systems, TUM

Low Voltage Signaling on Interconnect / I/O
Power: dyn. stat.
Architecture, RTL/Logic, Circuit

$t_{pd} \sim \frac{C_L \cdot V_{DD}}{I_D}$

$P_{cap} = \alpha \cdot f \cdot C_L \cdot V_{DD}^2$

$V_{DD} \rightarrow V_{DD}^* = \frac{V_{DD}}{N}$ with $I_D^* = I_D$ by transistor sizing

$f^* = \frac{1}{t_{pd}^*} = N \cdot f$ with $P_{cap}^* = \frac{P_{cap}}{N}$
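
The scaling relations of this slide can be checked numerically. The following is a minimal Python sketch; the function name `scale_low_voltage` and all numeric values (1.0 V supply, 100 MHz, 1 pF load, activity factor 0.5, N = 2) are illustrative assumptions and not values from the lecture.

```python
def scale_low_voltage(v_dd, f, c_load, alpha, n):
    """Apply V_DD* = V_DD / N while keeping the drive current I_D* = I_D."""
    p_cap = alpha * f * c_load * v_dd ** 2          # P_cap = alpha * f * C_L * V_DD^2
    v_dd_star = v_dd / n
    f_star = n * f                                   # t_pd ~ C_L * V_DD / I_D shrinks by N
    p_cap_star = alpha * f_star * c_load * v_dd_star ** 2
    return p_cap, v_dd_star, f_star, p_cap_star


# Hypothetical operating point, chosen only for illustration
p_cap, v_dd_star, f_star, p_cap_star = scale_low_voltage(
    v_dd=1.0, f=100e6, c_load=1e-12, alpha=0.5, n=2
)
print(p_cap, v_dd_star, f_star, p_cap_star)
```

Under these assumptions the signaling frequency doubles while the capacitive power is halved, which is the P_cap / N result stated on the slide.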


MPEG-4 Video Frame Coding Principles

• I-frames
  ▪ Least compressible frames, but do not require other video frames to decode.
• P-frames
  ▪ Can use data from previous frames to decompress.
  ▪ Are more compressible than I-frames.
• B-frames
  ▪ Can use both previous and forward frames for data reference to get the highest amount of data compression.

Source: Wikipedia – The free Encyclopedia

Processor Voltage/Frequency Scaling
Power: dyn. stat.
System

[Figure: total chip power, logic power and I/O power during MPEG-4 interframe scaling, with Dhrystone 2.1 code running 400 loops per cycle in background]
  ▪ Frequency scaling is 3 instructions (at most), 500 to 700 ns typical latency
  ▪ Voltage scaling at up to 10 mV/µs without PLL relock (1 V per 100 µs)
  ▪ Operation uninterrupted by power scaling
  ▪ Logic VDD axis: 1.0 V to 2.0 V
  ▪ CPU MHz: 266, 66, 266
  ▪ Memory MHz: 133, 66, 133
Source: IBM PowerPC 405 LP NetSeminar

DVFS: Multicore Scheduling and Allocation
Power: dyn. stat.
System

[Figure: two schedules of tasks A1 to A5 over the clock period Tclk, each annotated with E = 2 units]

=
_ ∑ ∗


Pipelining
Power: dyn. stat.
Architecture, RTL/Logic

[Figure: single logic block f(in) between clocked registers compared with a pipeline of stages f1(in), f2(f1(in)), …, fn(…) separated by clocked registers, where fn(fn-1(…f1(in))) = f(in)]

Pipelining
  ▪ Constant clock rate preserves throughput.
  ▪ Relaxation of logic timing requirements allows lowering of Vdd.
  ▪ Logic power saving more than compensates for the power of the additional registers.

Memory Architecture: Array-Structure

[Figure: array of storage cells with 2^(L-K) rows (word lines) and M·2^K columns (bit lines)]
  ▪ Row decoder: address bits A_K, A_(K+1), …, A_(L-1) select the word line
  ▪ Sense amplifiers / drivers: amplify swing to rail-to-rail amplitude
  ▪ Column decoder: address bits A_0, …, A_(K-1) select the appropriate word
  ▪ Input-Output (M bits)
Source: J. Rabaey [14]

Hierarchical Memory Architecture
Power: dyn. stat.
Architecture

[Figure: memory blocks addressed by row address, column address and block address; control circuitry, block selector and global amplifier/driver connect the global data bus to I/O]

Advantages:
1. Shorter wires within blocks
2. Block address activates only 1 block => power savings
Source: J. Rabaey [14], A. Macii [19]

Clock Gating: Circuit / Logic Optimization
Power: dyn. stat.
RTL/Logic

[Figure: registers A0, A1, A2 with input x and xor gates; a clock gating unit derives the register clock from clk and f(x,S)]

Clock Gating
  ▪ Toggle registers only when outputs can change
  ▪ Majority of dynamic power saving is on the clock tree
  ▪ 20 to 50 % reduction in active power possible
Concerns
  ▪ Additional skew on clock
  ▪ Testability

Clock Gating is the Baseline Power Saver


[Figure: relative leakage over temperature (°C), showing the dependency of Pstat on temperature]
Source: R. Puri [4]

Arithmetic Optimization
Power: dyn. stat.
RTL/Logic

Mathematical laws, e.g. distributive law:

More logic and activity vs. less logic and activity
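
As a minimal worked instance of the distributive law (the operand names a, b and c are illustrative assumptions, since the slide's own expression did not survive extraction):

```latex
\[
  \underbrace{a \cdot b + a \cdot c}_{\mbox{2 multipliers, 1 adder}}
  \;=\;
  \underbrace{a \cdot (b + c)}_{\mbox{1 adder, 1 multiplier}}
\]
```

Both sides compute the same value, but the factored form needs one multiplier fewer, which typically means less logic and less switching activity.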


Threshold Control
Power: dyn. stat.
Transistor

Substrate bias Vsubstrate shifts threshold voltage Vt

[Figure: Id over Vgs for thresholds Vt1, Vt2, Vt3 with leakage current Ileak; logic with VTCMOS transistors between Vdd and gnd, with Vsubstrate controlled by dynamic power management]

$I_D = I_0 \, e^{(V_{GS} - V_t)/(n V_{Temp})} \left(1 - e^{-V_{DS}/V_{Temp}}\right); \quad V_{GS} \le V_t$
Source: J. Rabaey [14]
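
To illustrate how strongly a threshold shift affects leakage, the subthreshold current equation of this slide can be evaluated directly. This is a minimal Python sketch: the function name and the parameter values (I_0 = 100 nA, n = 1.5, V_Temp = 26 mV, thresholds of 0.3 V and 0.4 V) are assumptions chosen for illustration only.

```python
import math


def subthreshold_current(v_gs, v_ds, v_t, i_0=1e-7, n=1.5, v_temp=0.026):
    """I_D = I_0 * exp((V_GS - V_t) / (n * V_Temp)) * (1 - exp(-V_DS / V_Temp))."""
    return i_0 * math.exp((v_gs - v_t) / (n * v_temp)) * (1.0 - math.exp(-v_ds / v_temp))


# Leakage of a transistor that is switched off (V_GS = 0), hypothetical values
leak_low_vt = subthreshold_current(v_gs=0.0, v_ds=1.0, v_t=0.3)
leak_high_vt = subthreshold_current(v_gs=0.0, v_ds=1.0, v_t=0.4)
reduction = leak_low_vt / leak_high_vt   # factor gained by raising Vt by 100 mV
print(reduction)
```

With these assumed parameters, raising Vt by 100 mV through substrate bias lowers the leakage current by roughly an order of magnitude, because Vt enters the equation exponentially.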


Technology Change for Low Power

[Figure panels: MRAM (Source: K. Arabi [10], Qualcomm) · FinFET vs. FDSOI Tech. (Source: A.B. Kahng [16]) · Near Threshold Computing (Source: R. G. Dreslinski [15])]

Power Optimization Conclusions

There is no such thing as a “free lunch” in Low Power Design
  ▪ Trade off power against performance, area and cost
  ▪ Power optimization is necessary and meaningful at all abstraction layers
  ▪ SoC system designers:
      ■ Exploit low power improvements from lower layers
      ■ Be sensitive to power optimizations at RTL and higher abstraction layers

References

[1] SIA, International Technology Roadmap for Semiconductors, http://public.itrs.net/
[2] J. Rabaey, Low Power Essentials, Springer, 2009
[3] D. Chinnery, K. Keutzer, Closing the Power Gap Between ASIC and Custom, Springer, 2007
[4] R. Puri, L. Stok and S. Bhattacharya, "Keeping hot chips cool," Proceedings. 42nd Design Automation Conference, 2005., Anaheim, CA, 2005, pp. 285-288
[5] D. Perlmutter, "Sustainability in silicon and systems development," 2012 IEEE International Solid-State Circuits Conference, San Francisco, CA, 2012, pp. 31-35
[6] M. Sheets et al., "A power-managed protocol processor for wireless sensor networks," Digest of Technical Papers VLSI06, pp. 212–213, June 2006
[7] A. DeHon, Reconfigurable Architectures for General Purpose Computing, PhD Thesis, MIT, 1996
[8] A. Cuomo, Semiconductor Challenges, DATE03 Keynote, March 03, http://www.date-conference.com/conference/2003/keynotes/andrea/andrea.pdf
[9] A. Das et al., Reinforcement learning-based inter- and intra-application thermal optimization for lifetime improvement of multicore systems, DAC 2014
[10] K. Arabi, Low power Design Techniques in Mobile Processors, Qualcomm Presentation, 2014
[11] H. Khdr, S. Pagani, M. Shafique, J. Henkel, “Thermal Constrained Resource Management for Mixed ILP-TLP Workloads in Dark Silicon Chips”. In: 52nd Design Automation Conference (DAC), June 2015
[12] P. Greenhalgh, Big.LITTLE Processing with ARM Cortex™-A15 & Cortex-A7: Improving Energy Efficiency in High-Performance Mobile Platforms, ARM white paper, September 2011
[13] D. Stankovic, Low Power Design course, University of Nis, Serbia, 2005
[14] J. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits, 2003
[15] R. G. Dreslinski et al., "Centip3De: A 64-core, 3D stacked, near-threshold system," 2012 IEEE Hot Chips 24 Symposium (HCS), Cupertino, CA, 2012, pp. 1-30
[16] A.B. Kahng, “A Roadmap for Low-Power Design: Trends, Technology, Tools”, EDPS-2015 Keynote
[17] A. Iranfar, S. N. Shahsavani, M. Kamal and A. Afzali-Kusha, "A heuristic machine learning-based algorithm for power and thermal management of heterogeneous MPSoCs," 2015 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), Rome, 2015, pp. 291-296
[18] X. Lin, Y. Wang and M. Pedram, "A Reinforcement Learning-Based Power Management Framework for Green Computing Data Centers," 2016 IEEE International Conference on Cloud Engineering (IC2E), Berlin, 2016, pp. 135-138
[19] A. Macii, “Memory Organization for Low-Energy Embedded Systems,” in Low-Power Electronics Design, C. Piguet, Editor, Chapter 26, CRC Press, 2005

Voluntary Appendix:
SoC Arithmetic Building Blocks

System-on-Chip © Lehrstuhl für Integrierte Systeme


Technologies
Theresienstr. 90
A. Herkersdorf Building N1, 2nd floor
A. Surhonne www.lis.ei.tum.de

Outline

• Arithmetic Building Blocks


 Adders, Multiplier, Shifter, Multiplexer

SoCT – SoC Logic Design – 2 © Lehrstuhl für Integrierte Systeme

Bit-sliced Datapath

[Figure: two 32-bit operand words (bits 31 … 00) feed an add, mult, shift, … unit that produces a 32-bit result word]

Arithmetical operations are performed on data words
  ▪ Typical widths: 8, 16, 32, 64 bits
  ▪ Words are provided by and written to the CPU register file

Frequently, the same operation is performed on each bit
  ▪ Shift right: $\forall b: t(b) = s(b + 1)$
    → Bit-sliced datapath
  ▪ Foundation for fast, parallel instruction execution

Binary Full Adder: Truth Table

[Figure: 1-bit full adder (FA) with inputs A, B, Cin and outputs S, Cout]

A  B  Cin | Cout  S | carry status
0  0  0   | 0     0 | kill
0  0  1   | 0     1 | kill
0  1  0   | 0     1 | propagate
0  1  1   | 1     0 | propagate
1  0  0   | 0     1 | propagate
1  0  1   | 1     0 | propagate
1  1  0   | 1     0 | generate
1  1  1   | 1     1 | generate

Useful intermediate signals to determine Cout (they depend only on the primary inputs):
G = A & B
P = A ⊕ B
K = !A & !B

S = A ⊕ B ⊕ Cin = P ⊕ Cin
Cout = A & B | A & Cin | B & Cin = G | P & Cin
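
The generate/propagate formulation can be checked against plain binary addition. The following is a minimal Python sketch; the function names `full_adder` and `carry_status` are hypothetical and only mirror the signal names of the slide.

```python
from itertools import product


def full_adder(a, b, c_in):
    """1-bit full adder expressed through the intermediate signals G and P."""
    g = a & b                 # generate
    p = a ^ b                 # propagate
    s = p ^ c_in              # S = P xor Cin
    c_out = g | (p & c_in)    # Cout = G | P & Cin
    return s, c_out


def carry_status(a, b):
    """Classify a bit position by its primary inputs only."""
    if a & b:
        return "generate"
    if a ^ b:
        return "propagate"
    return "kill"


# Print the truth table: A B Cin Cout S status
for a, b, c_in in product((0, 1), repeat=3):
    s, c_out = full_adder(a, b, c_in)
    print(a, b, c_in, c_out, s, carry_status(a, b))
```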


Ripple-Carry Adder

[Figure: four full adders (FA) in series with inputs A3 B3 … A0 B0 and sums S3 … S0; C0 = Cin = 0, Cout = C4; the critical path runs along the carry chain]

Condition for worst case computation time: G0 = 1, Pi = 1 for 0 < i < N

$t_{add} = t(A_0, B_0 \rightarrow C_{out}) + (N - 2)\, t(C_{in} \rightarrow C_{out}) + t(C_{in} \rightarrow S) \approx (N - 1)\, t_{carry} + t_{sum}$
$t_{add} = O(N)$: linearly proportional to word width

Adder computation time is dominated by carry propagation delay!
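
A bit-level model makes the carry chain and its worst case concrete. This is a minimal Python sketch under stated assumptions: the function names are hypothetical, and the delay model uses unit delays for t_carry and t_sum purely for illustration.

```python
def ripple_carry_add(a, b, n_bits=4, c_in=0):
    """Add two n_bits-wide unsigned integers bit by bit, LSB first."""
    carry = c_in
    total = 0
    for i in range(n_bits):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        p_i = a_i ^ b_i                  # propagate
        g_i = a_i & b_i                  # generate
        total |= (p_i ^ carry) << i      # S_i = P_i xor C_i
        carry = g_i | (p_i & carry)      # C_(i+1) = G_i | P_i & C_i
    return total, carry


def ripple_delay(n_bits, t_carry=1.0, t_sum=1.0):
    """Approximate worst-case delay (N - 1) * t_carry + t_sum."""
    return (n_bits - 1) * t_carry + t_sum


# Worst case for N = 4: G0 = 1 (A0 = B0 = 1) and Pi = 1 for i = 1..3,
# so the carry generated in bit 0 ripples through every remaining stage.
worst_sum, worst_carry = ripple_carry_add(0b0001, 0b1111)
print(worst_sum, worst_carry, ripple_delay(4))
```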


FA: Logic Gate Implementation

[Figure: gate-level full adder with inputs A, B, Cin and outputs S, Cout]
P = A ⊕ B
G = A & B
S = P ⊕ Cin
Cout = G | P & Cin

FA: Static MOSFET Implementation

[Figure: static CMOS full adder schematic with inputs A, B, Cin, internal node X and outputs S, Cout]

From the truth table:
Cout = AB + BCin + ACin = AB + (B + A) Cin
S = ABCin + !Cout (A + B + Cin)

Total of 28 MOSFET transistors
Two inverter stages between Cin and Cout

Ripple-Carry Adder/Subtractor

[Figure: four full adders FA in a ripple chain with inputs A0 … A3 and B0 … B3, carries C0 = Cin, C1, C2, C3, C4 = Cout and sums S0 … S3; the add/subt control drives the xor gates on the B inputs and the carry-in]

■ Subtraction: complement all subtrahend bits (xor gates) and set the low order carry-in
■ RCA summary:
  ▪ Advantage: simple logic, small area (low cost), straightforwardly expandable
  ▪ Disadvantage: slow (O(N) for N bits), lots of signal transitions (energy wasteful)
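
The subtraction trick (complement the subtrahend with xor gates and set the low-order carry-in) can be sketched in a few lines of Python. The function name `add_sub` and the 4-bit default width are assumptions for illustration.

```python
def add_sub(a, b, subtract, n_bits=4):
    """Ripple-carry adder/subtractor on n_bits-wide unsigned words."""
    ctrl = 1 if subtract else 0
    carry = ctrl                          # add/subt line sets the low-order carry-in
    result = 0
    for i in range(n_bits):
        a_i = (a >> i) & 1
        b_i = ((b >> i) & 1) ^ ctrl       # xor gate complements the subtrahend bit
        p_i = a_i ^ b_i
        result |= (p_i ^ carry) << i
        carry = (a_i & b_i) | (p_i & carry)
    return result, carry


print(add_sub(9, 3, subtract=False))  # sum of 9 and 3 modulo 16, plus carry-out
print(add_sub(9, 3, subtract=True))   # difference 9 - 3 modulo 16, plus carry-out
```

The result is the two's complement difference modulo 2^N, because A + (NOT B) + 1 equals A - B in N-bit arithmetic.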


FA Inversion Property

Inverting all inputs of a FA results in inverted values on all FA outputs
  ▪ Reduces number of inverters in carry path by one

A  B  Cin | Cout  S | status
0  0  0   | 0     0 | K
0  0  1   | 0     1 | K
0  1  0   | 0     1 | P
0  1  1   | 1     0 | P
1  0  0   | 0     1 | P
1  0  1   | 1     0 | P
1  1  0   | 1     0 | G
1  1  1   | 1     1 | G

[Figure: a FA with inverted inputs and inverted outputs is equivalent to a FA; chain of FA’ cells with inputs A2 B2, A1 B1, A0 B0, carries C0 … C3 and sums S0 … S2]
FA’: FA without INV in carry path
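
The inversion property can be verified exhaustively over the eight input combinations. A minimal Python sketch follows; the names `full_adder` and `inversion_holds` are hypothetical.

```python
from itertools import product


def full_adder(a, b, c_in):
    """Return (S, Cout) of a 1-bit full adder."""
    s = a ^ b ^ c_in
    c_out = (a & b) | (a & c_in) | (b & c_in)
    return s, c_out


# Inverting all three inputs must invert both outputs.
inversion_holds = all(
    full_adder(1 - a, 1 - b, 1 - c) == tuple(1 - out for out in full_adder(a, b, c))
    for a, b, c in product((0, 1), repeat=3)
)
print(inversion_holds)
```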

Manchester Carry-Chain Adder

[Figure: bit slice with inputs Ai, Bi, intermediate signals Pi, Gi, carry chain from Ci to Ci+1 and sum output S]

Fast carry propagation through transmission gate logic design
  ▪ Ci+1 follows Ci for Pi = 1
  ▪ Ci+1 is locally generated or killed (follows Gi) when Pi = 0

Ci+1 = Gi | Pi & Ci

Attention:
  ▪ Transmission gate logic isn’t regenerative
  ▪ Signal noise on Ci is directly propagated to Ci+1

Carry-Bypass Adder: Concept

[Figure: four FA cells with inputs A3 B3 … A0 B0 and sums S3 … S0, C0 = Cin; a bypass path controlled by BP connects Cin to Cout in parallel to the ripple carry C4]

BP = P0 P1 P2 P3 (“Block Propagate”)

If (P0 & P1 & P2 & P3 = 1) then Co,3 = C4 = Ci,0
Otherwise the block itself kills or generates the carry internally

CBA: Block Propagate Generation

[Figure: Manchester carry chain with P3 … P0 and G3 … G0 between Cin and Cout, plus a bypass switch controlled by BP]

Manchester Carry-Chain Bypass
■ BP-path certainly faster than regular “ripple” path
■ Area overhead for transmission gates typically between 10 and 20%
■ BP-path breaks “bit-sliced” structure
■ Conversion to N-bit group slices
■ Output INV for signal regeneration
■ … feeding FA’ in next stage

Carry-Bypass Adder: 16-Bit Adder

[Figure: four 4-bit blocks (bits 0 to 3, bits 4 to 7, bits 8 to 11, bits 12 to 15), each with setup (G, P), carry propagation and sum stages; Ci,0 enters at bits 0 to 3]

Worst-case delay: carry from bit 0 to bit 15 = carry generated in bit 0, propagates through bits 1, 2, and 3, skips the middle two groups (B: group size in bits), propagates in the last group from bit 12 to bit 15

tCBA = tsetup + B tcarry + ((N/B) - 1) tskip + (B - 1) tcarry + tsum
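
To get a feeling for the numbers, the bypass delay formula can be compared with the ripple-carry delay (N - 1) tcarry + tsum. The sketch below is a minimal Python model; the function names and the choice of unit delays for every term are assumptions, and real gate delays will differ.

```python
def t_ripple(n_bits, t_carry=1.0, t_sum=1.0):
    """Ripple-carry adder: (N - 1) * t_carry + t_sum."""
    return (n_bits - 1) * t_carry + t_sum


def t_bypass(n_bits, block, t_setup=1.0, t_carry=1.0, t_skip=1.0, t_sum=1.0):
    """Carry-bypass adder: t_setup + B*t_carry + (N/B - 1)*t_skip + (B - 1)*t_carry + t_sum."""
    return (t_setup + block * t_carry + (n_bits // block - 1) * t_skip
            + (block - 1) * t_carry + t_sum)


for n in (8, 16, 32, 64):
    print(n, t_ripple(n), t_bypass(n, block=4))
```

Both delays grow linearly with N, but with unit delays and 4-bit blocks the bypass adder gains only one skip delay per additional block instead of four carry delays.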


Carry-Bypass Adder

[Figure: tadd over N for the ripple adder and the bypass adder; the N axis is marked at 4…8]

■ tCBA is still O(N), but with a more gradual slope
  ▪ Significant tadd reduction for larger N
  ▪ Higher overhead (and hence higher tadd) for small N due to the additional MUX in the carry path
■ tadd limiting factor:
  ▪ Sequential availability of Cin between and within bit blocks

How to achieve a sub-linear dependency of tadd on N?

Carry-Select Adder: Concept

■ Precompute the carry out of each block for both carry_in = 0 and carry_in = 1 (can be done for all blocks in parallel)
■ Then select the correct one (takes one MUX delay rather than B-bit ripple)
■ Implies an approx. 30% area overhead

tCSA = tsetup + B tcarry + (N/B) tmux + tsum

[Figure: 4-bit block with setup (A’s, B’s to P’s, G’s), “0” carry propagation, “1” carry propagation, Cout multiplexer selected by Cin, and sum generation (C’s to S’s)]
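
The select principle can be modelled at block level. This is a minimal Python sketch, not a gate-level description: the function names are hypothetical, the block results are computed with integer addition, and the delay model assumes unit delays.

```python
def carry_select_add(a, b, n_bits=16, block=4, c_in=0):
    """Block-level model of a linear carry-select adder."""
    mask = (1 << block) - 1
    result, carry = 0, c_in
    for lo in range(0, n_bits, block):
        a_blk = (a >> lo) & mask
        b_blk = (b >> lo) & mask
        sum0 = a_blk + b_blk            # speculative result for carry_in = 0
        sum1 = a_blk + b_blk + 1        # speculative result for carry_in = 1
        chosen = sum1 if carry else sum0  # multiplexer driven by the incoming carry
        result |= (chosen & mask) << lo
        carry = chosen >> block
    return result, carry


def select_delay(n_bits, block, t_setup=1, t_carry=1, t_mux=1, t_sum=1):
    """t_CSA = t_setup + B * t_carry + (N / B) * t_mux + t_sum."""
    return t_setup + block * t_carry + (n_bits // block) * t_mux + t_sum


print(carry_select_add(1234, 4321))
print(select_delay(16, 4))
```

In hardware both speculative sums of every block are available in parallel, so only the multiplexers lie on the carry path between blocks.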


Carry-Select Adder: 16-bit Adder

[Figure: four 4-bit carry-select blocks (bits 0 to 3, bits 4 to 7, bits 8 to 11, bits 12 to 15), each with setup, “0” carry, “1” carry, mux and sum generation. Annotated signal arrival times: setup (1); “0” and “1” carry chains (5) in every block; mux outputs (6), (7), (8), (9) from the least to the most significant block; sum generation (10)]

Square Root Carry-Select Adder

[Figure: five carry-select blocks of increasing width (bits 0 to 1, bits 2 to 4, bits 5 to 8, bits 9 to 13, bits 14 to 19), each with setup, “0” carry, “1” carry, mux and sum generation. Annotated signal arrival times: setup (1); “0” and “1” carry chains (3), (4), (5), (6), (7); mux outputs (4), (5), (6), (7), (8); sum generation (9)]


Square Root Carry-Select Adder

■ Progressively increasing number of bits per block equalizes arrival times at MUX inputs
■ Lowers absolute adder delay despite using more stages
■ Assume an N-bit adder consists of P stages and contains M bits in the first stage:

$N = M + (M+1) + (M+2) + \dots + (M+P-1)$
$\quad = MP + \frac{P(P-1)}{2}$
$\quad = \frac{P^2}{2} + P\left(M - \frac{1}{2}\right)$
$\quad \approx \frac{P^2}{2} \quad \text{for } N \gg M$, hence $P \approx \sqrt{2N}$

$t_{SCS} = t_{setup} + M\, t_{carry} + \sqrt{2N}\, t_{mux} + t_{sum}$

■ tSCS = O(√N)
■ Sub-linear increase of tadd over adder width N
■ Nevertheless, carry still ripples sequentially through MUX stages
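
The block sizing can be reproduced with a short script. This is a minimal Python sketch: the function names are hypothetical, unit delays are assumed, and the delay uses the exact stage count P rather than the approximation √(2N).

```python
def sqrt_select_blocks(n_bits, first=2):
    """Block widths M, M+1, M+2, ... until at least n_bits are covered."""
    sizes, size, covered = [], first, 0
    while covered < n_bits:
        sizes.append(size)
        covered += size
        size += 1
    return sizes


def sqrt_select_delay(n_bits, first=2, t_setup=1, t_carry=1, t_mux=1, t_sum=1):
    """t_SCS = t_setup + M * t_carry + P * t_mux + t_sum with P stages."""
    stages = len(sqrt_select_blocks(n_bits, first))
    return t_setup + first * t_carry + stages * t_mux + t_sum


print(sqrt_select_blocks(20))   # block widths for a 20-bit adder with M = 2
print(sqrt_select_delay(20))
```

For N = 20 and M = 2 this yields five blocks of 2, 3, 4, 5 and 6 bits and a delay of 9 unit delays. If N does not match a sum of consecutive block widths exactly, the last block of this sketch is wider than strictly needed.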


Adder Summary

Adder                          | Delay     | Area      | Comment
Ripple Carry Adder             | O(N)      | 1 (norm.) | Simple, modular; delay limits application to small N only
Carry Bypass Adder             | O(N)      | 1.1 – 1.2 | Although CBAs have a linear delay dependency on N, they are much faster than RCAs for larger N
Linear Carry Select Adder      | O(N)      | > 1.3     |
Square Root Carry Select Adder | O(√N)     | > 1.3     | Approach with sub-linear adder time dependency on N
Log. Lookahead Adder           | O(log(N)) | >m        | Lookahead adders are generally several times larger than RCAs, but have a significant speed advantage for large N

Multiply Operation

Multiplication as repeated additions

$x \cdot y = \sum_{j=0}^{M-1} \sum_{i=0}^{N-1} x_i \, y_j \, 2^{i+j}$

[Figure: N-bit multiplicand, M-bit multiplier, partial product array with M rows, double precision product with N+M bits]

■ Partial product generation
  ▪ Easy in binary representation:
    → All-0 for yj = 0
    → Multiplicand offset j positions to the left for yj = 1
■ Addition of partial products
  ▪ M x N-bit additions max
  ▪ Skip all-0 partial products

Partial Product Generation

[Figure: multiplicand bits X7 … X0 are each ANDed with one multiplier bit Yi to form the partial product bits PP7 … PP0]

Multiplying multiplicand x with a bit yj of the multiplier
  ▪ Easy for binary numbers: PP = x AND yj

Sequential Multiplier

Right shift-and-add
  ▪ Partial product rows accumulated from least to most significant bit on an N-bit adder
  ▪ After each addition, shift the accumulated partial product to the right in order to align it with the next row to add
■ Time for NxN bits
  ▪ Tserial_mult = O(N Tadder) = O(N²)
■ Design (area) complexity
  ▪ One N-bit adder and single-bit shifter
■ No straightforward pipeline structure

Example: 1010 x 1101
T= 0; 1010
T= 1; 01010
T= 2; 001010
T= 6; 110010
T= 7; 0110010
T= 11; 10000010
T= 12; 10000010
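
The right shift-and-add scheme can be sketched as a loop over the multiplier bits. This is a minimal Python model under stated assumptions: the function name is hypothetical, and the accumulator is modelled as one integer whose upper half receives the multiplicand, which is one common way to organize the data path.

```python
def shift_add_multiply(x, y, n_bits):
    """Right shift-and-add multiplication of two n_bits-wide unsigned numbers."""
    acc = 0
    for j in range(n_bits):           # multiplier bits, least significant first
        if (y >> j) & 1:
            acc += x << n_bits        # add multiplicand into the upper half
        acc >>= 1                     # align accumulator with the next row
    return acc


print(bin(shift_add_multiply(0b1010, 0b1101, 4)))
```

The loop runs N times and performs at most one N-bit addition per iteration, which matches the O(N Tadder) time stated on the slide. For the operands of the slide example, 1010 and 1101, the sketch returns 10000010.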


Array Multiplier

■ Faster than sequential shift-and-add, but more costly in terms of area
■ All partial products generated in parallel and organized in adder array with respective offset
■ Partial product addition
  ▪ Typically no single worst case path, but multiple paths with same (max) length

$t_{mult} \approx \left[(M-1) + (N-2)\right] t_{carry} + (N-1)\, t_{sum} + t_{and}$

[Figure: 4 x 4 array multiplier with inputs X3 … X0 and Y0 … Y3, three adder rows (HA FA FA HA; FA FA FA HA; FA FA FA HA) and product bits Z7 … Z0]

Partial Product Reduction

■ Standard Array Multiplier requires N-1 x N-bit-adders
■ “All-0” partial products can be omitted, reducing the number of PPs
■ Booth recoding: N-bit multiplier can be reduced to max. of N/2 “1” partial products
  ▪ Principle: 01110₂ (14₁₀) = 10000₂ (16₁₀) – 00010₂ (2₁₀)
  ▪ Requires capability to perform subtractions
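
One common variant of this idea is radix-4 (modified) Booth recoding. The slide does not say which variant is meant, so the following Python sketch is an assumption: it recodes an unsigned multiplier into digits from {-2, -1, 0, 1, 2}, one digit per bit pair, and the function name is hypothetical.

```python
def booth_radix4_digits(y, n_bits):
    """Radix-4 Booth digits of an unsigned n_bits-wide value, least significant first."""
    digits = []
    prev = 0                               # implicit bit y_(-1) = 0 below the LSB
    n_digits = (n_bits + 2) // 2           # one extra zero bit keeps unsigned values positive
    for k in range(n_digits):
        b0 = (y >> (2 * k)) & 1
        b1 = (y >> (2 * k + 1)) & 1
        digits.append(prev + b0 - 2 * b1)  # digit weight is 4**k
        prev = b1
    return digits


digits = booth_radix4_digits(0b01110, 5)
value = sum(d * 4 ** k for k, d in enumerate(digits))
print(digits, value)
```

For 01110 the digits are -2, 0 and 1 (weights 1, 4 and 16), that is 16 - 2, which reproduces the principle above with two non-zero partial products instead of three. Negative digits are the reason the adder array must be able to subtract.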


Shifter

Programmable shifter
[Figure: bit-slice i with inputs Ai, Ai-1, outputs Bi, Bi-1 and control lines Right, Nop, Left]

■ Single-bit left/right shift operations through individual pass transistors operated by separate control lines
■ Output drivers to fully regenerate logic levels
■ Cascading several single-bit shifters to build a multi-bit shifter rapidly becomes complex and slow
  ▪ Not practical for larger shift values

Barrel Shifter

[Figure: 4-bit barrel shifter with data wires from A3 … A0 to B3 … B0 and control wires Sh0 … Sh3]

■ Number of rows indicates word width, …
■ … number of columns the maximum shift offset
■ Word bits have to pass through maximum one transmission gate
  ▪ Propagation delay (theoretically) constant
■ Area generally dominated by wiring (not active MOSFETs)
■ One-hot shift signal

Logarithmic Barrel Shifter

■ Number of rows indicates word width, …
■ … number of columns the base two logarithm (ld) of largest shift offset
■ Word bits have to pass through exactly ld(max. shift size) multiplexers
■ Propagation delay (theoretically) constant
■ Area dominated by wiring (not multiplexers)
■ Binary shift signal
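
The staged structure can be modelled with one multiplexer decision per bit of the binary shift signal. This is a minimal Python sketch for a logical right shift; the function name and the 8-bit default width are assumptions.

```python
def log_barrel_shift_right(word, shift, width=8):
    """Logical right shift built from stages that shift by 1, 2, 4, ... positions."""
    word &= (1 << width) - 1
    stage = 0
    while (1 << stage) < width:
        if (shift >> stage) & 1:       # bit 'stage' of the binary shift signal
            word >>= 1 << stage        # this stage shifts by 2**stage
        # otherwise the stage passes the word through unchanged
        stage += 1
    return word


print(bin(log_barrel_shift_right(0b10110000, 5)))
```

An 8-bit word needs three stages (shift by 1, 2 and 4), and every bit passes through all three of them regardless of the shift amount.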


Other Arithmetic Operators

Constructed from concepts introduced with adders

■ Subtractor
  ▪ Inverted two’s complement adder with Cin,0 = 1
[Figure a) Subtractor: In1 and In2 feed a two’s complement adder with Ci,0 = 1; output In1 – In2]

■ Comparator
  ▪ Use MSB of subtractor as sign-indicator for In1 ≥ In2
[Figure b) Comparator: In1 and In2 feed a two’s complement adder with Ci,0 = 1; output SN-1 indicates In1 ≥ In2]

Multiplexer / Demultiplexer

Selecting bits/words from different sources
  ▪ E.g.: memory read control logic
  ▪ If n = 2 => S is one bit
  ▪ Z = ¬S A1 + S A2
[Figure: N-1 MUX with inputs A1 … An, select S = S1 .. Sm and output Z]

Distributing bits/words to different destinations:
  ▪ E.g.: memory write control logic
  ▪ If n = 2:
  ▪ Z1 = ¬S A
  ▪ Z2 = S A
[Figure: 1-N deMUX with input A, select S = S1 .. Sm and outputs Z1 … Zn]

References

[1] R. J. Baker et al., CMOS circuit design, layout, and simulation, IEEE
Press, 1998. ISBN 0-7803-3416-7
[2] N. H. E. Weste et al., Principles of CMOS VLSI Design, Addison
Wesley, 1993. ISBN 0-201-53376-6
[3] SIA, International Technology Roadmap for Semiconductors,
http://public.itrs.net/

Picture credits: www.maxmon.com
