SoCT Slides
Introduction (page 2)
SoC Paradigm (page 36)
Processor Architecture (page 56)
Memory (page 76)
Interconnect (page 96)
1
System-on-Chip Technologies
Introduction
2
Organisational matters
Text books
Digital Integrated Circuits - A Design Perspective, J. Rabaey,
Prentice Hall
Computer Architecture. A Quantitative Approach, J. Hennessy, Elsevier
3
What is this course about?
4
Moore’s Law CMOS Scaling Good news
[Figure: chip capacity (transistors per chip, 10^3 to 10^13) and transistor gate length L (um, 50 down to 0.01) versus year, 1972 to 2020. DRAM chips: 4k to 32G. Microprocessors: 8008, 8086, 80286, 80386, 80486, Pentium, Pentium II, Pentium III, Pentium 4, Core 2 Duo, Core i7. Annotation: 1965: Gordon Moore forecasts that chip …]
Source: ITRS Roadmap 99, 09
5
Moore’s Law CMOS Scaling Challenges
[Figure: power dissipation, frequency and number of cores versus year, 1972 to 2020, for microprocessors from 8008 to Core i7. Power density axis (W/cm2, 0.1 to 1000) with reference levels: hot plate, nuclear reactor, rocket nozzle.]
Source: ITRS Roadmap 99, 09
If Moore’s Law…
… had held in other industries or our daily life for the last 30 years…
the capacity of a 1.5 V AA battery would have been around 2 MWh today, …
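The battery comparison can be checked with a few lines of arithmetic. This is only a sketch: the doubling period of 18 months and the starting capacity of about 2 Wh for an AA cell are assumptions chosen for illustration, they are not stated on the slide.

```python
def scaled_value(start, years, doubling_period_years):
    """Value after exponential growth with a fixed doubling period."""
    return start * 2 ** (years / doubling_period_years)

# Assumed inputs (not from the slide): 2 Wh AA cell, doubling every 1.5 years.
capacity_wh = scaled_value(start=2.0, years=30, doubling_period_years=1.5)
print(f"{capacity_wh / 1e6:.2f} MWh")
```

With these assumptions there are 20 doublings in 30 years, which lands in the MWh range quoted on the slide.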
6
Intel: Moore‘s Law is Forever ! … Really?
So what!
7
SoC Technologies
Hardware Platforms
FPGA: System-on-Chip:
- Function determined - Maximizes reuse I/O
eRAM
NoC
RAM-Ctl.
8
Application in our daily life
Technical Economical
Source: J. Rabaey [3], M. Irwin [4]
9
Key Metrics of SoC
[Figure: network landscape. Home networks (802.11, Home RF, ..) and access networks (xDSL, cable, ISDN, wireless base stations) connect via gateways and edge routers to the inner core (core routers, WAN, Sonet/SDH transmission), to PSTN/VoIP gateways, mobile switching centers and base station controllers, and to storage servers, application servers and wireless ASPs. Processors appear in clients, control and transmission equipment.]
SoCT – Introduction – 18 © Lehrstuhl für Integrierte Systeme
10
Outlook to SoC Platforms
Network Processor: Determines box function: Switch, Router, Gateway, …
Switch Fabric: Interconnect for as many line interface cards as possible
Line Interface: Terminates physical network links: Ethernet, SONET/SDH, …
System Processor
Part 1 Part 2
Introduction Processor Architecture
SoC Logic Design Recap Memory
SoC Paradigm Interconnect
Low Power Design
11
Why CMOS ?
SiO2-Layer Lithography
Electrical Reasons Light
12
CMOS – What is it?
Input
nMOS pMOS
gnd Vdd
Output
n+ p+
n
p
VDD
Ip
VGSp S
P VDSp A Z
N VDSn A Z
VA VZ
VGSn S
0 1
1 0
gnd
13
MOSFET Voltages And Drain Current
nMOS pMOS
G Vector Conventions* G
VGS VGS
ID ID
S D D S
VDS VDS
Gate-Source Voltage
0 Drain-Source Voltage 0
Drain Current
VGS Vt VDS 0
• Off I Dn 0
14
MOSFET Output Characteristics
DC Operation
V(y)
Switching Threshold
VM
GND
VIL VIH VDD V(x)
15
Static Voltage Transfer Curve (VTC)
VZ off on N
VDD
Ip
on off P VGS S
p
P VDSp
VDD
D
N VDSn
VA VZ
VGSn S
gnd
Ip
A Z
VDD VA 0 1
Vtn VDD-|Vtp|
Vth 1 0
MOSFET Dimensioning
w
tOx
Lmin
Transconductance: $\beta = K \cdot \frac{W}{L}$ where $K = \mu \cdot \frac{\varepsilon_{Ox} \cdot \varepsilon_0}{t_{Ox}}$
Designer’s Parameter: W
Technology Parameters:
  Mobility: µn = 1.5 .. 3.5 x µp
  Minimum Feature Size: Lmin
  Oxide dielectric/thickness: εox, tox
Conflicting Design Goals:
  Area => W = Lmin
  Speed => high W, L = Lmin
  → always use Lmin for digital circuits
Designer uses W to compensate for lower current drive of pMOSFETs
16
Noise Margins
VIH
Noise Margin High Undefined Region
VIL
Noise Margin Low
NML = VIL
"0"
Gnd
Gate Output Gate Input
Vdd VA
Ron,p
C VZ=VC
Ron,n
50%
gnd
t
Inverter model:
on: Resistance
off: open switch
operating in linear
region of Sah
17
Effect of Capacitance on Inv. Delay
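[Figure: inverter switching model with Ron,p, Ron,n and load capacitance C between Vdd and gnd; VA and VZ versus t with 50% reference points]
tpLH = f(C) = ?
Inverter model:
  on: Resistance (operating in linear region of Sah)
  off: open switch
$t_{pLH} \sim \frac{C_{load} \cdot t_{Ox} \cdot L_p}{W_p \cdot \mu_p \cdot \varepsilon_{Ox} \cdot (V_{dd} - |V_{tp}|)}$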
Dynamic:
  capacitive, short-circuit
  result of circuit function
  signal edge dependent
Static:
  sub-threshold, leakage (reverse diode), gate currents
  parasitic effects
  signal level dependent
18
Dynamic Capacitive Power
C
Ron,n
gnd
A Z r f
t
i
gnd
t
t1 t2
19
Sub-threshold Currents
I0
I0: ID for VGS = Vt
VTemp: temperature voltage
n: process constant (1 … 2.5) Vt Vgs
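The legend above fits the usual exponential sub-threshold model. The formula itself is not preserved in this text, so the following is a hedged reconstruction from the three symbols listed, not a quotation of the slide:

```latex
I_D \approx I_0 \cdot \exp\!\left(\frac{V_{GS} - V_t}{n \cdot V_{Temp}}\right), \qquad V_{GS} < V_t
```

At $V_{GS} = V_t$ the exponent is zero and $I_D = I_0$, which agrees with the definition of I0 given above.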
A Vdd Igate
gnd Z Vdd
n+ n+ p+ p+
Igate
p n gnd
p-n junctions are diodes:
  Reverse current into substrate contributes to static power consumption
Gate oxide not perfect isolator:
  Ohm’s law (R < inf)
  Ionic conduction (trapped ions in oxide)
  tox ~ 5 – 2 nm today
  $I_{gate} \sim \exp(t_{ox}^{-1})$
20
Static Power
Vt Vdd
threshold voltage supply voltage
Total power=min!
dynamic power
static power
circuit delay Vdd and Vt such that Vdd-Vt = const
21
References
[1] IBM press release, „New 64-bit PowerPC microprocessor“, Oct. 14, 2002,
http://www-3.ibm.com/chips/news/2002/1014_powerpc.html
[2] IBM Microelectronics Photo Catalog,
http://www-3.ibm.com/chips/photolibrary/photo10.nsf/home?ReadForm
[3] J. Rabaey, „Digital Integrated Circuits – A Design Perspective“, Prentice Hall,
second edition, 2003
[4] M. Irwin, „VLSI Digital Circuits“, Penn State, 2002,
http://mdlwiki.cse.psu.edu/twiki/bin/view/MDL/MJI477
22
SoC Logic Design Recap
Outline
• Combinatorial Logic
Transistor synthesis for combinatorial logic
• Sequential Logic
Registers, latches, flip-flops
Finite state machine design
23
Combinatorial Logic
[Figure: CMOS NAND and NOR gate schematics and symbols with inputs A, B and output Z between VDD and gnd]
A B | NAND NOR
0 0 |  1    1
0 1 |  1    0
1 0 |  1    0
1 1 |  0    0
24
All logic functions
DeMorgan Rule can be expressed
with NAND or NOR
NAND NOR
NOT
AND
OR
VDD
A1
...
An
Z
C
gnd
NAND NOR
25
Systematic Static CMOS Logic Design
implemented as
A combinatorial logic circuitry
multi-bit register / x
storage elements B
+ Y
C
Y=AxB+C
26
Sequential Logic
27
CMOS Latch
[Figure: CMOS D latch with data input D, enable e, internal set/reset signals S and R, and outputs Q and inverted Q; latch symbol; waveforms of e, D and Q]
Enable signal e
Level-controlled <=> Latch
28
CMOS Flip-flop
[Figure: master-slave D flip-flop built from two latches (Master, Slave) clocked by c (clk) with internal node q between them; flip-flop symbol; waveforms of clk, D, q and Q]
Clock edge-controlled (c) <=> flip-flop
Most important sequential element
Used in (almost all) synchronous digital circuits
Used as register banks
Counter, shift registers
29
Flip-flop Timing
[Figure: waveforms of c, D and Q with 50% reference points, showing tsetup, thold and tc2q; register-to-register path with logic delay between Tlogic,min and Tlogic,max]
Flip-flop characteristics:
  setup-time: Data must be stable tsetup before clock edge
  hold-time: Data must be stable for thold after clock edge
  clock-to-output delay: Data will be visible at output tc2q after clock edge
Timing Constraints:
  Setup constraint: Tclk > Tc2q + Tlogic,max + Tsetup
  Hold constraint: Tc2q + Tlogic,min > Thold
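The two constraints can be checked mechanically for a register-to-register path. The sketch below is illustrative only: the class and function names and all delay values (in picoseconds) are hypothetical, chosen to show one passing and one failing case.

```python
from dataclasses import dataclass

@dataclass
class FlipFlop:
    t_c2q: int    # clock-to-output delay [ps]
    t_setup: int  # setup time [ps]
    t_hold: int   # hold time [ps]

def min_clock_period(ff, t_logic_max):
    """Smallest period bound from the setup constraint (Tclk must exceed it)."""
    return ff.t_c2q + t_logic_max + ff.t_setup

def check_path(ff, t_clk, t_logic_min, t_logic_max):
    """Return (setup_ok, hold_ok) for one register-to-register path."""
    setup_ok = t_clk > ff.t_c2q + t_logic_max + ff.t_setup
    hold_ok = ff.t_c2q + t_logic_min > ff.t_hold
    return setup_ok, hold_ok

# Hypothetical numbers, not taken from the slides.
ff = FlipFlop(t_c2q=200, t_setup=100, t_hold=150)
print(check_path(ff, t_clk=2000, t_logic_min=50, t_logic_max=1500))
print(check_path(ff, t_clk=1700, t_logic_min=50, t_logic_max=1500))
```

Note that the hold constraint does not depend on Tclk, so a hold violation cannot be fixed by slowing the clock down.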
30
Metastability
Violation of either tsetup or thold
  resulting in undefined/oscillating output Q
  with probability pmeta (< 10^-9)
Relevance for SoC design:
  increasing number of chip clock domains
  externally imposed clocks
Precaution:
  Double-registered inter-domain signal interfaces
  FIFO buffers
[Figure: waveforms of c, D and Q with 50% reference points, tsetup and thold]
31
Finite State Machines
x D1 Q1
f(x,u) g(x,u) y
Dn Qn
clk
u
x: Primary input vector, y: primary output vector, u: state vector
f(x,u): input function, g(x,u): output function
Mealy-Machine: Most general case, as shown above
Moore-Machine: No combinatorial logic through machine <=> g(x,u)==g(u)
Medvedev-Machine: No output logic <=> y=g(x,u)=g(u)=u
No input logic Machine: f(x,u)=(x,f2(u))
FSM2
f g
FSM1
f g
FSM3
f g
32
FSM example: Counter
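x = 0: S(t+1) = S(t)
x = 1: S(t+1) = S(t) + 1
[Figure: counter with state bits S0 to S3, one xor gate per bit, carry signals A0 to A2, input x and clock clk; the gates form the next-state logic f(x,S). Second diagram: cascaded FSM stages f(x,u) and g(x,u) on a common clk.]
Tclk > ΣTlogic + Tsetup + Tc2q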
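The counter's next-state function can be modelled in software to check the behaviour stated above. This is a sketch under an assumption: the gate structure (one XOR per state bit, with an AND-based carry chain feeding the next bit) is inferred from the figure labels, and the function name is hypothetical.

```python
def next_state(x, s, width=4):
    """Next-state function f(x, S) of the counter, bit by bit."""
    carry = x                      # x acts as carry into bit 0
    nxt = 0
    for i in range(width):
        bit = (s >> i) & 1
        nxt |= (bit ^ carry) << i  # S_i(t+1) = S_i xor carry-in
        carry &= bit               # carry into the next bit (assumed AND chain)
    return nxt

# One state update per clock edge.
state = 0
trace = []
for x in [1, 1, 0, 1]:
    state = next_state(x, state)
    trace.append(state)
print(trace)
```

Because the carry ripples through all bits, the sum of the gate delays along this chain is what enters the ΣTlogic term of the clock period constraint.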
33
How are FSMs designed today?
Idle:Process
Begin FIFO a_empty = 1
WAIT UNTIL(CLK = ‘1’);
C_State := N_State;
34
References
[1] R. J. Baker et al., CMOS circuit design, layout, and simulation, IEEE
Press, 1998. ISBN 0-7803-3416-7
[2] N. H. E. Weste et al., Principles of CMOS VLSI Design, Addison
Wesley, 1993. ISBN 0-201-53376-6
[3] SIA, International Technology Roadmap for Semiconductors,
http://public.itrs.net/
35
SoC Paradigm
Outline
• Design productivity
Platform-based SoC
Virtual Platforms
• Computational density
Flexibility vs. Performance
• Hardware Implementation
Gate Array, Standard Cells, FPGA
SoC design paradigm
Single- vs. Multicore
36
Revisiting Moore‘s Law
1968 1972 1976 1980 1984 1988 1992 1996 2000 2004 2008 2012
Year
37
Platform-based SoC Design
Virtual Platforms
Graphical Debugging tools SW development
environment
SW stacks
Custom HW development
Drivers Std. libs
OS OS OS Virtual Platform
Hypervisor
CPU1 CPU2 ACC SRAM SDRAM HW/SW Partitioning
Buffer I/O
System Integration
Processing cores Executable SoC model
Interconnect Fast and accurate: TLM-abstraction
HW accelerators Simulation kernel, e.g.
... SystemC/SpecC
38
Trade-Off: Flexibility vs. Performance
CPU DSP
ASIP
Log F L E X I B I L I T Y
FPGA
Instruction Depth
ASIC
Flexibility vs.
Custom IC Performance/Power
dissipation dilemma
Log COMPUTATIONAL DENSITY = performance / area
103 . . . 104
Log Power Efficiency = performance / W
105 . . . 106
Source: A. DeHon [4]; A. Cuomo, T. Noll [5]
+ +
Time 1
Parallel in space and time
-
Time 2
y
Pipelined RISC CPU
IF ID OF EX M WB t1 = A + B
IF ID OF EX M WB t2 = C + D
IF ID OF EX M WB Y = t1 - t2
39
Computational Density / Functional Diversity
x4
RTL-Level Hardware x3
X
y
Parallel in time x2 +
X
x1
T0 T1 T2 T3
HW/SW Partitioning
Optimization for performance,
power or total cost
40
Hardware Implementation Methods
Full-Custom Semi-Custom
Cell-based Array-based
Example: Example:
Altera MAX- XILINX Virtex-II
7000 PRO
ASIC no fixed
41
Hardware Implementation Methods
Gate Array
Pad
metal
possible
GND contact
Out
42
Standard Cell
43
ASIC Chip
Random Logic
(from library)
Example ASIC:
• 65 nm
• 1 GHz
• ~ 100 Mgates
• ~ 100 MB sRAM
• ARM/MIPS cores
Memory • >1000 I/Os
Subsystem • 64 x 2.5 Gb/s I/O
• 1 x 10 Gb/s I/O
• DDR I/F
[LSI Logic]
44
Programmable Logic: Overview
• RAM based:
Bits in look-up tables (LUTs) realize logic function
Bits in registers control switches (transistors) which connect /
disconnect wire links
Re-programmable, partially even online during operation (run-time
reconfiguration)
Medium integration density
Sensitive to radiation
O0 I 0 I1 I 2
O1 I 0 I1 I 2 I 2 I 0 I1
PAL
Programmable AND array
Programmable AND array
O 3O 2O 1O 0 O 3O 2O 1O 0
CPLD: multiple PAL blocks with programmable interconnect, e.g. Altera Max 7000
45
FPGA: Structure and Properties
Routing
Channel
I/O Pad
Configurable
Logic Block
• Programming:
SRAM (Data in SRAM determine logic function and control interconnect)
Anti fuse (Melting individual fuses through controlled peak currents)
46
FPGA: Realization Principles
Boolean function:
$Y = x_1 x_2 x_3 + x_2 \overline{x_3}$
x1 x2 x3 | y
0  0  0  | 0
0  0  1  | 0
0  1  0  | 1
0  1  1  | 0
1  0  0  | 0
1  0  1  | 0
1  1  0  | 1
1  1  1  | 1
ASIC: gates, td = 1 nsec
FPGA: Look-Up Table (LUT), td = 5 nsec
  x1 x2 x3 form the address of a 2x4 SRAM
  Content (address 0 to 7): 0 0 1 0 0 0 1 1
CLB
Logic table:
x1 x2 | y
0  0  | 0
0  1  | 0
1  0  | 0
1  1  | 1
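The LUT principle is easy to mimic in software: the inputs form an address and the stored bit is the function value. The sketch below uses the SRAM content from the slide and compares it against a gate-level expression derived from the truth table (x1·x2·x3 OR x2·NOT x3); the function names are hypothetical.

```python
from itertools import product

# SRAM content of the LUT, indexed by the address formed from x1 x2 x3.
LUT_CONTENT = [0, 0, 1, 0, 0, 0, 1, 1]

def lut(x1, x2, x3):
    """FPGA style: the inputs only select a stored bit."""
    address = (x1 << 2) | (x2 << 1) | x3
    return LUT_CONTENT[address]

def gates(x1, x2, x3):
    """ASIC style: the same function built from AND/OR/NOT."""
    return (x1 & x2 & x3) | (x2 & (1 - x3))

for inputs in product((0, 1), repeat=3):
    print(inputs, lut(*inputs), gates(*inputs))
```

Changing the function only requires rewriting the eight stored bits, which is the flexibility the FPGA pays for with the longer delay.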
47
FPGA: Internal Structure (Xilinx Virtex-II Pro)
C4 Bump
BGA Ball
48
FPGA: Xilinx UltraScale+ Programmable Logic
CLB = Slice
• 1 Slice per CLB, 2 slice types: SLICEL and SLICEM
• SLICEL
• 8 LUTs with 6 inputs
(each usable as two 3- or 5-input LUTs)
• 16 FlipFlops
• arithmetic carry logic
• multiplexer
• SLICEM
• LUTs can be used as 64 bit RAM,
1x32 or 2x16 bit shift register
[Figure labels: LUT, Carry Chain, FlipFlop]
Source: Xilinx
49
FPGA: Xilinx UltraScale+ Programmable Logic
RPU 2 2 2
GPU
VCU
Source Xilinx
50
Processor Implementation in SoC (1)
„Real“ „Virtual“
Component Component
Architectural Speed/Area
extensions VHDL optimized
51
Soft VC CPU in FPGA
Local SRAM
Challenge
• How to cope with this
complexity and develop
operational systems within…
reasonable time (time-to- eRAM network i/o
market) applic. specific
costs (engineering and µ-processor DRAM ctrl.
manufacturing)
52
The Need for SoC Design Paradigm
DC ROM Analog
ROM
MCU
ASIC
~ 10 cm
i/f
i/f i/f DSP
ASIC
SRAM
SRAM
SRAM
DSP
MCU ROM An’lg
53
Single- vs. Multicore
Application execution time:
$T_{App} = inst_{App} \cdot \frac{clk}{inst} \cdot \frac{s}{clk} = inst_{App} \cdot CPI \cdot \frac{1}{f}$, with $inst_{App} = \sum_{i=1}^{n} inst_{App'_i}$
[Figure: single-core executing App versus multi-core executing App'1, App'2, App'3, …, App'n in parallel, both with the same Texe]
Case 1: Single- and Multi-core have same performance = App execution time Texe
If App can be perfectly parallelized on n cores and Tapp = const:
Single-core: $P_{dyn} \sim f \cdot V_{dd}^2$
Multi-core: $f_{MC} = \frac{f}{n}$, $V_{dd,MC} = \frac{V_{dd}}{n}$
$P_{dyn} \sim n \cdot \frac{f}{n} \cdot \left(\frac{V_{dd}}{n}\right)^2 \sim \frac{f \cdot V_{dd}^2}{n^2}$
* neglecting influence of Vth
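A short numeric sketch of the scaling argument above. It assumes, as the slide does, that the application parallelizes perfectly and that the supply voltage can be scaled down in proportion to the frequency (the influence of Vth is neglected); the function names and normalized values are hypothetical.

```python
def dynamic_power(f, vdd, n_cores=1):
    """Relative dynamic power ~ n * f * Vdd^2 (constant factor dropped)."""
    return n_cores * f * vdd ** 2

def multicore_power(f, vdd, n):
    """n cores, each running at f/n and Vdd/n, same total execution time."""
    return dynamic_power(f / n, vdd / n, n_cores=n)

p_single = dynamic_power(f=1.0, vdd=1.0)
for n in (1, 2, 4, 8):
    print(n, multicore_power(1.0, 1.0, n) / p_single)
```

The printed ratios follow 1/n², which is the idealized upper bound on the saving; real designs gain less because Vdd cannot be lowered arbitrarily close to Vth.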
54
Single- vs. Multicore
References
[1] Design Technology for Low Power Radio Systems, Reth Davis, BWRC, Berkeley,
http://bwrc.eecs.berkeley.edu
[2] DSP multi chip module, esa,
http://www.estec.esa.nl/tech/spacewire/products/#modules
[3] Chip-On-Chip, Valtronic SA, http://www.valtronic.ch
[4] Reconfigurable Architectures for General Purpose Computing, Andre DeHon, PhD
Thesis, MIT, 1996
[5] A. Cuomo, Semiconductor Challenges, DATE03 Keynote, March 03,
http://www.date-conference.com/conference/2003/keynotes/andrea/andrea.pdf
55
Processor Architecture
Outline
• Classification of processors
• Instruction set architecture
• Internal processor architecture
Pipelining and hazards
Branch prediction
Superscalar/VLIW architecture
Instruction and data caches
Multi-threading
56
Motivation
Applications
Operating
Compiler
System
Software Assembler
Instruction Set
Architecture
Hardware Processor Memory I/O system
57
Processor Classification
Processor/ISA SW model,
independent (e.g. Matlab)
Code generator int a = 10;
while(a < 100)
High-level language
a += b;
(e.g. C/C++) if (a > b && c < 0)
c++;
Machine code
Software
1010 1111 0101 1000
0000 1001 1100 0110
Hardware Processor/ISA
0101 1000 0000 1001
dependent Control Signal
1100 0110 1010 1111
Specification
58
Instruction Set Architecture (ISA)
return
r31
addr.
0x0000 0000
59
Look Inside
system bus
data cache
data i/o
status
accumulator program counter
control
address i/o
instr. cache
system bus
Processor Microarchitecture
system bus
data cache
data i/o
status
accumulator program counter
control
instruction i/o
instr. cache
system bus
60
Processor Microarchitecture
system bus
data i/o
status
accumulator program counter
control
instruction i/o
Instr. cache
Instruction fetch (IF)
Instruction decode (ID)
system bus
Program Execution
Instruction Data
Processor
memory memory
IF ID EX M WB lw r1, 0(r0)
IF ID EX M WB sw r3, 4(r0)
IF ID EX M WB
Efficiency improvement:
instruction-level parallelism (ILP)
61
ILP: Pipelining
Clock signal
IF ID EX M WB lw r1, 0(r0)
IF ID EX M WB sw r3, 4(r0)
IF ID EX M WB
CPU Pipeline
Single-scalar = 1 ALU, CPImin = 1.0
[Figure: pipeline stages IF ID EX M WB under pipeline control, separated by clocked buffers (D/Q registers) on clk]
$T_{clk} > T_{c2q} + \max(T_{logic}) + T_{stp}$
$f_{max} = \frac{1}{T_{clk}}$
instr. rate [MIPS] = f [MHz] / CPI
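The relations above translate directly into a small calculation. This is a sketch: the per-stage logic delays and register timings are hypothetical values in nanoseconds, and the function names are made up for the example.

```python
def f_max_mhz(t_c2q_ns, stage_logic_ns, t_setup_ns):
    """Clock frequency limit set by the slowest pipeline stage."""
    t_clk_ns = t_c2q_ns + max(stage_logic_ns) + t_setup_ns
    return 1000.0 / t_clk_ns

def instruction_rate_mips(f_mhz, cpi):
    """instr. rate [MIPS] = f [MHz] / CPI"""
    return f_mhz / cpi

# Hypothetical logic delays for IF, ID, EX, M, WB.
stages = [2.0, 1.5, 3.0, 2.5, 1.0]
f = f_max_mhz(t_c2q_ns=0.5, stage_logic_ns=stages, t_setup_ns=0.25)
print(round(f, 2), "MHz")
print(round(instruction_rate_mips(266, 1.2), 2), "MIPS")
```

Only the slowest stage matters for the clock, which is why deeper pipelines with shorter, balanced stages ease frequency scaling.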
62
ILP: Pipelining
• Deep pipelining
Ease processor speed scaling
Increase vulnerability for pipeline problems
Structural hazards
Data hazards
Control hazards
Structural Hazards
IF ID EX M WB load/store instruction
IF ID EX M WB arithmetic
instructions
IF ID EX M WB
stall IF ID EX M WB
if only one memory
port is available
63
Data Hazards
add r3,r2,r1 IF ID EX M WB
sub r7,r3,r1 IF ID EX M WB
and r6,r3,r2 IF ID EX M WB
Stalling is required
add r3,r2,r1 IF ID EX M WB
sub r7,r3,r1 IF stall ID EX M WB
and r6,r3,r2 IF ID EX M WB
Control Hazards
64
Branch prediction
• 1-bit prediction
Branch history table
Branch history table: 1 prediction bit per entry (1 – taken, 0 – not taken), indexed by x bits of the branch address (2^x entries)
0x400258: lw r2, 24(r30)
0x400260: slti r3, r2, 15
0x400268: bne r3, r0, 400280
Problem
Branch prediction
• 2-bit prediction
Branch history table
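The difference between the two schemes shows up on a loop branch. The sketch below models a single table entry as an n-bit saturating counter; this is the common textbook formulation and is assumed here, since the state diagram from the slide is not preserved. The initial counter value and the branch pattern are also assumptions.

```python
def simulate(outcomes, bits):
    """Count mispredictions of one branch using an n-bit saturating counter."""
    max_count = (1 << bits) - 1
    threshold = 1 << (bits - 1)   # predict "taken" if counter >= threshold
    counter = 0                   # assumed start: (strongly) not taken
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= threshold
        if predicted_taken != taken:
            misses += 1
        if taken:
            counter = min(counter + 1, max_count)
        else:
            counter = max(counter - 1, 0)
    return misses

# Loop branch: taken 9 times, then falls through once; loop executed 10 times.
outcomes = ([True] * 9 + [False]) * 10
print("1-bit:", simulate(outcomes, 1))
print("2-bit:", simulate(outcomes, 2))
```

With one bit, every loop exit flips the prediction and costs a second miss on re-entry; the 2-bit counter needs two wrong outcomes in a row before it changes direction.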
65
Branch prediction
data cache
data i/o
internal data bus
Multiple ALU
ALU register block
execution units ALU
status
accumulator program counter
internal address bus
control
address i/o
Instr. cache
66
ILP: Superscalar architecture
More than 1 instruction can be issued in 1 cycle, i.e. CPI < 1 is possible
More complex logic for checking data dependencies required
Optimizing Compiler
InstrDP1 InstrDP2 InstrDP3 InstrDP4 ... ... InstrDPn-1 InstrDPn
Registers
67
Processor Performance (1)
• What is performance?
Example Porsche vs. Bus from Munich to Stuttgart
Vehicle | Top speed [km/h] | Distance [km] | Travel time [h] | Capacity [person] | Throughput [pkm/h]
Ultimately interested in
CPU execution time: Time CPU needs to complete certain program,
task or function
68
Processor Performance
Instruction Data
Processor
memory memory
Performance Comparison (Effective MIPS) [Data from Xilinx]
CPU x 1.0 clk/instr: 266
CPU x 1.2 clk/instr: 221.67
CPU-DDR: 20.95
CPU-I/O: 17.81
Memory Hierarchy
Level | Access time | Size
CPU registers | 0.5 ns | 500 B
L1 cache | 2 ns | 32 KB
L2 cache | 20 ns | 256 KB
Main memory | 100 ns | 512 MB
From CPU registers to main memory: access latency small → large, cost large → small, size small → large
69
Example: PowerPC 405GP
66-133MHz Arb
266MHz 32/64-bit
64-bit PCI-X, with ECC
33-66MHz RAM/ROM/
JTAG Trace
13 external
interrupts
70
Cache Organization
• Caches store only small share of main memory
  Data are stored in lines of multiple sequential data words (e.g. 4 words)
  Cache capacity = Lsize x Nlines
  If stored tag is identical to tag part of memory address → cache hit
  Offset: determines word in cache line and byte in word
  Flags: entry valid; entry “dirty” (entry changed by CPU)
[Figure: CPU, cache and main memory; each cache entry consists of flags, tag and cache line]
Direct mapped cache
• Each block (size of a cache line) in main memory can be stored in only one cache entry
  CPU address = tag | index | block offset (word, byte)
  hit = valid & (stored tag = tag part of address)
• Conflicting indices lead to higher cache miss rate
[Figure: direct mapped cache with 8 entries (index 000 to 111), each holding flags, tag and word0 to word3; main memory blocks ..00000, ..00001, … map to the entry given by their low block address bits]
Example: 16 KB direct mapped cache with 4 words of 32 bit each per cache line
• 10 bit index (1k cache lines)
• 18 bit tag
• 4 bit offset (for word and byte)
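The bit widths in the example follow from the cache geometry. The sketch below derives them and splits an address into its fields; the 32 bit address width is an assumption (it is what makes 18 + 10 + 4 add up), and the function name and the sample address are hypothetical.

```python
def split_address(addr, cache_bytes=16 * 1024, line_bytes=16, ways=1, addr_bits=32):
    """Split a byte address into tag, index and offset for a given cache geometry."""
    n_sets = cache_bytes // (line_bytes * ways)
    offset_bits = line_bytes.bit_length() - 1   # log2 for powers of two
    index_bits = n_sets.bit_length() - 1
    tag_bits = addr_bits - index_bits - offset_bits
    return {
        "tag": addr >> (offset_bits + index_bits),
        "index": (addr >> offset_bits) & (n_sets - 1),
        "offset": addr & (line_bytes - 1),
        "bits": (tag_bits, index_bits, offset_bits),
    }

fields = split_address(0x12345678)
print(fields["bits"], hex(fields["tag"]), hex(fields["index"]), hex(fields["offset"]))
```

Setting the hypothetical `ways` parameter to 2 halves the number of sets, so one bit moves from the index to the tag.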
71
Set Associative Cache
set = block MOD (NL / n), any line within set
• Each block in main memory can be stored in n cache entries: n-way set associative cache
• Increasing n reduces cache misses due to conflicts
• With higher n: selection circuitry more complex, needs more time
  CPU address = tag | index | block offset (word, byte)
  n times parallel tag comparison
[Figure: 2-way set associative cache with 4 sets (00 to 11); way 0 and way 1 each hold flags, tag and data, with one tag comparator per way selecting the hit word (w0 or w1); main memory blocks ..00000, ..00001, … map to the set given by their low block address bits]
Example: 16 KB 2-way set associative cache with 4 words of 32 bit each per cache line
• 9 bit index (512 sets of 2 cache lines each)
• 19 bit tag
• 4 bit offset (word and byte, 2 bit each)
72
Cache Replacement
• Replacement policy
Goal: reduce number of misses
Least recently used (LRU): replace the least recently accessed line
First in first out (FIFO): replace the least recently loaded (oldest) line
Random
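A minimal model of LRU replacement in a set associative cache, as a sketch only. The class name and geometry are hypothetical, and block numbers stand in for full addresses; real hardware tracks recency with a few status bits per set rather than an ordered dictionary.

```python
from collections import OrderedDict

class SetAssociativeCache:
    def __init__(self, n_sets, ways):
        self.n_sets = n_sets
        self.ways = ways
        # One ordered dict per set: oldest (least recently used) tag first.
        self.sets = [OrderedDict() for _ in range(n_sets)]

    def access(self, block):
        """Access a memory block number; return True on a hit."""
        lines = self.sets[block % self.n_sets]
        tag = block // self.n_sets
        if tag in lines:
            lines.move_to_end(tag)       # now the most recently used line
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)    # evict the least recently used line
        lines[tag] = True
        return False

cache = SetAssociativeCache(n_sets=4, ways=2)
# Blocks 0, 4 and 8 all map to set 0 and compete for its two lines.
print([cache.access(b) for b in (0, 4, 0, 8, 4)])
```

Dropping the `move_to_end` call turns the same structure into FIFO replacement, since lines are then evicted in load order regardless of later hits.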
73
Multithreading in Software
system bus
Load/save
data cache
register
status
register block
register block
instr. cache
status
Further details in lecture program counter
„Chip Multicore Processors“
system bus
Multithreading in Hardware
system bus
data cache
data i/o
ALU registerblock
register
block Multiple
register block
status register banks
status
statuscounter
program
accumulator program counter
program counter
control
instruction i/o
instr. cache
Further details in lecture
„Chip Multicore Processors“
system bus
74
Summary
Literature
75
Memory
Outline
• Motivation
• Classification and Characteristics
• Look Inside
Architecture of state-of-art memories
Different types of memory cells
• How to Use Memory in System Design
• Product Overview
76
Motivation
77
Positioning
Classification
Read Write Memory:
  Random Access: RAM
  Non-Random Access: Shift-Register
Non-Volatile Read Write Memory
Read Only Memory: ROM
Characteristics
Memory Type | Application | Access Time | Remarks
Registers [32 x 64 bit] | CPU Registers | Very fast [Sub-nsec] | Direct addressing scheme
On-chip SRAM [32 KByte] | Caches | Fast [nsec] | SRAM is faster but more expensive than SDRAM
QDR SRAM [4 MByte] | Fast system memory | Fast [2 x 2 x 200 MHz] | Dual clock edge, dual port
SDRAM [64 MByte] | Main Memory | Slow-Medium [133 MHz] | Needs refresh, sophisticated control, synchronous interface
DDR3 SDRAM [1 GByte] | Main Memory | Medium [2 x 800 MHz] | Dual clock edge
ROM [few kByte] | System config | Medium [~kByte/sec] | Read only
Flash [16 GByte] | Memory Card | Medium [20 MByte/sec] | Non-volatile, no refresh, different rd/wr cycles
79
Look Inside
Definitions:
Bandwidth: Amount of data into/out of a device
or across interface per unit time
Latency: Time elapsed between request and
delivery of data
Cycle time: Time between two consecutive
read/write accesses
512M DRAM
S0 S0
Word 0 Word 0
S1
Word 1 A0 Word 1
S2 Storage Storage
Word 2 Word 2
N Words
Cell A1 Cell
Decoder
A
AL-1
K-1
SN-2 Aspect ratio
Word N-2 Word N-2
SN_1 heights / width
Word N-1 Word N-1
not suitable for
implementation /
Input-Output Input-Output
performance !
(M bits) (M bits)
80
Memory Architecture: Array-Structure
AK
Row Decoder
A K+1 Word Line
A L-1
M.2 K
Row
Address
Column
Address
Block
Address
I/O
Advantages:
1. Shorter wires within blocks
2. Block address activates only 1 block => power savings
[Rabaey]
81
1-Transistor DRAM Cell
size / BL
density
WL Write "1" Read "1"
speed WL
robustness M1 X
CS GND VDD VT
VDD
BL
VDD/2 VDD /2
CBL sensing
Bitline
Wordline
n+ - Si
SiO2
Polysilicon
p-Si
Depletion Zone
Inversion
at SiO2/Si
Interface
Address Memory
Transistor Capacitor
[IC1]
82
Advanced 1 Transistor DRAM Cells
Word line
Cell plate Capacitor dielectric layer
Insulating Layer
Cell Plate Si
Si Substrate
2nd Field Oxide
[Rabaey]
Trench Cell Stacked-capacitor Cell
Sense Amplifier
Bitlines
WLi-4
WLi-3
Memory
WLi-2 cells
WLi-1
SA1 SA2 SA3 SA4 SA5 SA6 SA7 SA: Sense Amplifier
WLi
WLi+3
[IC1]
83
Sense Amplifier
VDD
Read 1 pull u p
T1 T2
VDD
Cs
Equalize
BL TE BL
K1 K2
T3 T4
WL
Sense
Memory Cell
[IC1]
Sense Amplifier
VDD
Read 0 pull u p
T1 T2
VDD
Cs
Equalize
BL TE BL
K1 K2
T3 T4
WL
Sense
Memory Cell
[IC1]
84
6-Transistor CMOS SRAM Cell
size / WL
density
speed
VDD
M2 M4
robustness
Q
Q M6
M5
M1 M3
BL BL
[Rabaey]
WLi 4 ... 6 m
„0“
WLk
„1“
Programming
„0“ „1“
BL1 BL2
size /
density
speed
robustness
[IC1]
85
Floating Gate Transistor Cell
tox G
tox
S
p
n+ n+
Substrate
[Rabaey]
20 V 0V 5V
10 V 5 V 5 V 2.5 V
S D S D S D
[Rabaey]
Vwl Vgs
86
Flash Memory Cell
[Infineon]
• Non-volatile
• Faster and more writable
than Flash
• Research topic
• Phase change effect used
with Blu-Ray
87
How to Use Memory in System Design
Capacity
Access
speed
FAST LOW
CPU
Cache
Local Bus
Fast & Small SRAM
Slower & larger SDRAM
I/O Subsystem (SCSI, PCI, etc)
Disk
Tape
SLOW HIGH
[IC1]
66-133MHz Arb
266MHz 32/64-bit
64-bit PCI-X, with ECC
33-66MHz RAM/ROM/
On-chip Peripheral Bus (OPB) 33-66 MHz
32K 32K
CPU
I-Cache D-Cache 1 MII or 2 RMII
Timers interfaces Cache
MMU 10/100 Local Bus
MAL Ethernet Fast & Small SRAM
Interrupt MAC
CPU Controller Slower & larger (SDRAM)
I/O Subsystem (SCSI, PCI, etc)
JTAG Trace Disk
13 external Tape
interrupts
88
A Minimal Memory System
[Gries]
89
SDRAM Read Operation Timing
(tCAS)
[Micron]
90
SDRAM Write Operation Timing
[Gries]
91
Synchronous DRAM vs DDR
[Gries]
Multiport DRAM
Row
Address
Column
Address
Block
Address
I/O
Separate rd/wr
addresses, decoders
and I/O
92
Synchronous DRAM vs RAMBUS
[RAMBUS]
Embedded DRAM
Pros:
• customized size
• wide data bus
• multi port
• high speed
Cons:
• complex technology
• expensive
• lower density
[IBM]
93
Wide I/O – 3D-Integration with Through
Silicon Vias (TSV)
Memory Summary
94
References
[1] Jan Rabaey: Digital Integrated Circuits: A Design Perspective, Prentice Hall, 2nd
Edition, 2003
[2] Stechele/Herkersdorf: Integrierte Schaltungen, Lecture notes, TUM, 2003
[3] IBM photos http://www-3.ibm.com/chips/photolibrary/photo10.nsf/home?ReadForm
[4] www.rambus.com white papers
[5] M. Gries: A Survey of Synchronous RAM Architectures, Swiss Federal Institute of
Technology, ETHZ, Technical Report TIK No. 71, 1999
[6] J. Alsmeier, Infineon: Speicherkonzepte, 4. Dresdner Sommerschule Mikroelektronik,
September 2003
[7] R. Desikan, University of Texas, Tech Report TR-02-47, Sept. 2002
[8] Micron 256Mb: x4, x8, x16 SDRAM Features
[9] Wong, H-S. Philip, et al. "Phase change memory." Proceedings of the IEEE 98.12
(2010): 2201-2227.
[10] Motoyoshi, Makoto. "Through-silicon via (TSV)." Proceedings of the IEEE 97.1
(2009): 43-48.
95
Interconnect
Outline
• On-Chip Buses
Basic operation
Methods for increasing bus throughput
PLB, AHB, AXI
• Outlook for Network-on-Chip
• FIFOs
Principles of operation
• Example: Networking SoC
96
On-Chip Buses
Characteristics:
  Bus width
  Clock rate
  Arbitration Scheme
  Max. number of Masters supported
  Separate/Shared Rd/Wr Bus
Bus Slaves:
  React on requests
[Figure: shared bus with Arbiter connecting CPU, ASIC1, Traffic Mgr, DSP, Mem and EN MAC]
97
Bus Arbitration Schemes
Central Bus
Arbitration Arbiter
PLB
Address Bus Slave
Address Bus
Control Control
PLB
Core
PLB Read Data OR Read Data
Master Bus Bus
Status & Status &
Control OR Control
Shared Bus
98
PLB: Standard Read (Rd) Transfers
SYS_Clk
Mn_req 1 2
Mn_RNW
Mn_ABus A0 B0
PLB_Avalid 1 2
SI_AddrAck
Data Bus
SI_DBus D(A0) D(B0)
SI_DAck 1 2
tarb tacc tarb tacc
$BW = \frac{f \cdot w}{t_{arb} + t_{acc}}$, for memory $t_{acc} = t_{mem\_acc}$
* time in clock cycles
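As a worked example of the formula above, the sketch below evaluates the bandwidth of non-pipelined single transfers. The bus parameters (180 MHz, 128 bit) and the cycle counts for arbitration and access are illustrative assumptions, and the function name is hypothetical.

```python
def bus_bandwidth(f_hz, width_bits, t_arb, t_acc):
    """BW = f * w / (t_arb + t_acc); t_arb and t_acc in clock cycles."""
    return f_hz * width_bits / (t_arb + t_acc)

# Assumed example values: 1 cycle arbitration, 2 cycles access.
bw = bus_bandwidth(f_hz=180e6, width_bits=128, t_arb=1, t_acc=2)
print(f"{bw / 1e9:.2f} Gb/s of {180e6 * 128 / 1e9:.2f} Gb/s peak")
```

With three cycles per transfer only a third of the peak rate f·w is usable, which is the motivation for overlapping or amortizing the arbitration and access phases.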
99
Pipelined Bus Control
Address Cycle Data Cycle
Strictly sequential Address/Data cycle
Data Ack terminating current transfer Request Address Data Ack
Transfer Transfer
& release new transfer
Data Data
Transfer Ack
SYS_Clk
Mn_req 1 2
Mn_RNW
Mn_ABus A0 B0
PLB_PAvalid 1
PLB_SAvalid 2
Sn_AddrAck 1 2
Data Bus
SI_rdDBus D(A0) D(B0)
SI_rdDAck 1 2
tarb tacc
BWpeak = f ∙ w
tarb tacc
* assuming B0 is independent of D(A0)
100
PLB: Pipelined Rd Transfers (But...)
SYS_Clk
Mn_req 1 2
Mn_RNW
Mn_ABus A0 B0
PLB_PAvalid 1 2
PLB_SAvalid 2
SI_AddrAck 1 2
Data Bus
SI_rdDBus D(A0) D(B0)
SI_rdDAck 1 2
tarb tacc
tarb tacc
Burst Transfers
101
PLB: Burst Rd Transfers
SYS_Clk
Mn_req 1 2
Mn_RNW
Mn_ABus A0 B0
PLB_Avalid 1 2
Sn_AddrAck
Data Bus
SI_DBus D(A0) D(A1) D(A2) D(A3) D(B0) D(B1) D(B2)
SI_DAck 1 2
tarb tacc tarb tacc
$BW = \frac{f \cdot w \cdot n}{(t_{arb} + t_{acc}) + (n - 1)}$
* time in clock cycles
SYS_Clk
Mn_req 1 2
Mn_RNW
Mn_ABus A0 B0
PLB_PAvalid 1
PLB_SAvalid 2
Sn_AddrAck 1 2
Data Bus
SI_rdDBus A0 A1 A2 A3 B0 B1 B2 B3
SI_rdDAck 1 1 1 1 2 2 2 2
tarb tacc
$BW = \frac{f \cdot w \cdot n}{n} = f \cdot w$
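The effect of the burst length n on bandwidth can be tabulated with a short sketch. The bus parameters and cycle counts are illustrative assumptions, and the function name is hypothetical.

```python
def burst_bandwidth(f_hz, width_bits, n, t_arb, t_acc):
    """BW = f * w * n / ((t_arb + t_acc) + (n - 1)); times in clock cycles."""
    return f_hz * width_bits * n / ((t_arb + t_acc) + (n - 1))

f_hz, width_bits = 180e6, 128          # assumed example bus
peak = f_hz * width_bits               # pipelined bursts approach f * w
for n in (1, 4, 16, 64):
    bw = burst_bandwidth(f_hz, width_bits, n, t_arb=1, t_acc=2)
    print(n, f"{bw / peak:.2f} of peak")
```

For n = 1 the expression reduces to the single-transfer case; as n grows, the fixed arbitration and access overhead is amortized and the bandwidth approaches f·w, the value reached when bursts are additionally pipelined.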
102
PLB: Pipelined Back-to-Back Rd & Wr
SYS_Clk
Mn_req 1 2 3 4
Mn_RNW
Mn_ABus A B C D
PLB_PAvalid 1 2
PLB_SAvalid 3 4
SI_AddrAck 1 2 3 4
Write Data Bus
Mn_wrDBus B0 B1 B2 B3 D0 D1 D2 D3
SI_wrDAck 2 2 2 2 4 4 4 4
SI_rdDAck 1 1 1 1 3 3 3 3
SYS_Clk
Mn_req 1 2 3 4 5
Mn_RNW
Mn_ABus A B C D E
PLB_PAvalid 1 2
PLB_SAvalid 3 4 5
SI_AddrAck 1 2 3 4 5
Write Data Bus
Mn_wrDBus B0 B1 B2 B3 D0 D1 D2 D3
SI_wrDAck 2 2 2 2 4 4 4 4
SI_rdDAck 1 1 1 1 3 3 3 3 5
103
Example: AMBA AHB
• AMBA AHB
Advanced Microcontroller Bus Architecture
Advanced High-Performance Bus
• Features improving bus throughput:
Independent data buses (reads/writes)
Pipelining (with the previous transfer only)
Burst transfers
Split transfers
[Figure: AHB connecting high-performance ARM processor, high-bandwidth on-chip RAM, high-bandwidth memory interface and DMA bus master]
Master Req M0 M1
t
Addr Bus X Y
t
Y0 Y1 Y2 Y3 X0 X1 X2 X3
Data Bus
t
Transfer to M0 is split
104
Example: AMBA AXI
• AMBA AXI
source: http://www.arm.com
Advanced Microcontroller Bus Architecture
Advanced eXtensible Interface
• Master M0
reads data X0...X3 and Y0...Y3 from slow slave S0
Y0...Y3 can be delivered faster than X0...X3
• Master M1
reads data Z0...Z3 from fast slave S1
Master Req M0 M0 M1
t
Addr Bus X Y Z
t
Z0 Z1 Y0 Y1 X0 Y2 X1 Y3 Z2 X2 Z3 X3
Data Bus
t
• Reordering occurs
among multiple masters
among multiple transfers of the same master
but not within a burst
105
Bus Standards Comparison
           | CoreConnect OPB        | CoreConnect PLB     | AMBA APB                | AMBA AHB                      | AMBA AXI
Name       | On-Chip Peripheral Bus | Processor Local Bus | Advanced Peripheral Bus | Advanced High Performance Bus | Advanced Extensible Interface
Addresses  | 32/64 bit              | 32/64 bit           | 32 bit                  | 32 bit                        | 32 bit
Bus widths | 32/64 bit              | 32-256 bit          | 8-32 bit                | 8-1024 bit                    | 8-1024 bit
# Masters  | 4                      | 16                  | 1 (bridge to AHB)       | 16                            | n/a
106
FIFO Interface - Motivation
• On-chip Buses
Synchronous, high throughput, shared medium
High overhead for point-to-point connect between two
modules
• FIFOs:
widely used point-to-point interconnect between asynchronous
modules with standardized interfaces
107
FIFO Architecture
Write Pointer
WP RP
AE AF
FIFO is empty!
108
FIFOs – Principle of Operation
RP WP
AE AF
RP WP
AE AF
109
FIFOs – Principle of Operation
RP WP
AE AF
Fill level exceeds AE → read without risk of „underflow“ can start!
RP WP
AE AF
110
FIFOs – Principle of Operation
RP WP
AE AF
RP WP
AE AF
111
FIFOs – Principle of Operation
RP WP
AE AF
RP WP
AE AF
Fill level exceeds AF → additional writes should be omitted to prevent „overflow“!
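The pointer and threshold behaviour walked through on these slides can be summarized in a small software model. This is a sketch, not the hardware implementation: the class and property names are hypothetical, and the fill level is tracked with an explicit counter instead of being derived from the pointers.

```python
class Fifo:
    def __init__(self, depth, almost_empty, almost_full):
        self.buf = [None] * depth
        self.depth = depth
        self.ae = almost_empty    # AE threshold
        self.af = almost_full     # AF threshold
        self.wp = 0               # write pointer
        self.rp = 0               # read pointer
        self.level = 0            # current fill level

    def write(self, item):
        if self.level == self.depth:
            raise OverflowError("FIFO overflow")
        self.buf[self.wp] = item
        self.wp = (self.wp + 1) % self.depth   # pointer wraps around
        self.level += 1

    def read(self):
        if self.level == 0:
            raise IndexError("FIFO underflow")
        item = self.buf[self.rp]
        self.rp = (self.rp + 1) % self.depth
        self.level -= 1
        return item

    @property
    def read_may_start(self):
        return self.level > self.ae    # fill level exceeds AE

    @property
    def writer_should_pause(self):
        return self.level > self.af    # fill level exceeds AF

fifo = Fifo(depth=8, almost_empty=2, almost_full=6)
for word in range(3):
    fifo.write(word)
print(fifo.read_may_start, fifo.writer_should_pause)
```

The AE and AF thresholds leave a safety margin so that reader and writer, which may run in different clock domains, can react before the FIFO actually runs empty or full.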
112
FIFOs – Principle of Operation
WP RP
AE AF
WP RP
AE AF
113
FIFOs – Principle of Operation
WP RP
AF AE
Given
  IP Router/VoIP Gateway SoC
  EN MAC: 4 x 1 Gb/s
  PLB Bus: 2x 128 bit Rd/Wr, 180 MHz clock
  Nom. Capacity: 46 Gb/s
Question
  Under/Over/Well dimensioned?
[Figure: SoC with CPU, ASIC1, Traffic Mgr, DSP, Mem and EN MAC attached to the PLB bus]
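The nominal capacity follows from the given bus parameters. The sketch below reproduces that number and compares it with the aggregate Ethernet line rate; it deliberately does not answer the dimensioning question, because the real bus load depends on how many bus transfers each packet causes. The function name is hypothetical.

```python
def nominal_bus_capacity_gbps(n_buses, width_bits, f_mhz):
    """Peak capacity of n parallel data buses in Gb/s."""
    return n_buses * width_bits * f_mhz / 1000.0

capacity = nominal_bus_capacity_gbps(n_buses=2, width_bits=128, f_mhz=180)
line_rate = 4 * 1.0          # 4 Ethernet MACs at 1 Gb/s each
print(f"capacity  {capacity:.2f} Gb/s")
print(f"line rate {line_rate:.2f} Gb/s (factor {capacity / line_rate:.1f})")
```

The large nominal factor is only an upper bound: each packet crosses the bus several times, and arbitration and access cycles reduce the usable share of the peak rate.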
114
Basic Packet Rx / Process / Tx
MAC Bus Mem CPU
Traffic
CPU ASIC1 C
Mgr
Packet
reception
CPU retrieves
EN packet from
DSP Mem memory
MAC
6C Packet
processing
• Message Sequence Chart (MSC):
CPU write back
Uncover “hidden” transfers for packet to
CPU/MAC notification memory
Short packets are worst case
condition (packet size ≈ notification Packet
message) transmission
C
Plus...
115
… and last but not least
Improvements
Multiple, physical memories
Separation data, state, control
Interleaving techniques
Multi-port memories
Benefits
Scalability: Aggregate bandwidth scales with network size
Segmentation of wires: short point-to-point links
Pipelining, power consumption, reliability/crosstalk
Synchronization
fully synchronous clock distribution NOT required
Drawbacks
Latency, Area
[Figure labels: Tile, Node, Links, Topologies]
116
Summary
References
117
Cross-Layer Perspectives on
Low Power Design
Andreas Herkersdorf
Armin Sadighi
Anmol Surhonne
Thomas Wild
[Figure: Moore's Law CMOS scaling chart, 1972 to 2020, for microprocessors from 8008 to Core i7. Axes and curves: chip capacity (transistors per chip), transistor gate length L (um), power density (W/cm2, with hot plate, nuclear reactor, and rocket nozzle as reference levels), frequency, performance, number of cores]
Source: [1] ITRS Roadmap 99, 09
118
Moore’s Law CMOS Scaling Challenges
Source: AMD
119
Low Power Design is Prerequisite for …
Green Computing
120
Power Trends
121
Energy versus Power
Gate delay
$t_d(V_{dd}, V_t) \propto \frac{C_L}{V_{dd} - V_t}$
Static power
$P_{leak}(V_t, V_{dd}) = I_{leak}(V_t) \cdot V_{dd}$
Leakage current
$I_{leak}(V_t) \propto e^{V_{gs} - V_t}$
lower threshold voltage
122
Outline
• System-level
Processor Voltage Frequency Scaling
Algorithmic optimization, Operand isolation
Power gating & sleep transistors
Voltage islands
• Architecture-level
Scheduling, Pipelining
Bus-Segmentation, Memory-partitioning, Datapath reordering
• RTL/Logic-level
Clock gating
• Transistor-level
Threshold voltage control
FinFET/FDSOI
Advanced memory cells
Lecture Focus: Power (dynamic, static) at the System, Architecture, RTL/Logic, and Transistor levels
Just 2 references for in-depth text books:
[2] J. Rabaey, Low Power Essentials, 2009, Springer
[3] D. Chinnery, K. Keutzer, Closing the Power Gap Between ASIC and Custom, Springer, 2007
123
Power
Dynamic Power Management dyn. stat.
System
Recent trend:
Machine learning-based state traversal
[9, 17, 18, …]
Source: D. Perlmutter [5]
$Q: S \times A \to \mathbb{R}$
$Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_a Q(s_{t+1}, a) \right)$
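The value update for learning-based state traversal can be sketched as standard tabular Q-learning. The state and action names below (power states such as "active" and "idle") are hypothetical, and the learning rate `alpha` and discount factor `gamma` are illustrative defaults rather than values from the cited work.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions,
             alpha=0.5, gamma=0.9):
    """One tabular Q-learning step.

    Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (r + gamma * max_a' Q(s', a'))
    """
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] = ((1 - alpha) * q[(state, action)]
                          + alpha * (reward + gamma * best_next))
    return q[(state, action)]


# Hypothetical power-management example: the reward could encode
# energy saved minus a penalty for wake-up latency.
q_table = defaultdict(float)
power_actions = ["sleep", "run"]
q_update(q_table, "active", "sleep", reward=1.0,
         next_state="idle", actions=power_actions)
```

A power manager built this way would pick, in each state, the action with the highest table entry, with occasional random exploration.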
[Figure: power (mW) of a chip whose blocks (if, baseband, serial, neighbor, location, queues, dw8051, dll) are controlled by sleep signals; values shown: 1200, 766, 60. © IEEE 2006]
Source: M. Sheets [6]
124
Power
Power Gating dyn. stat.
System
[Figure: flexibility vs. performance/power dissipation dilemma. Log flexibility (instruction depth) plotted against log computational density (performance / area; 10^3 ... 10^4) and log power efficiency (performance / W; 10^5 ... 10^6) for CPU, DSP, ASIP, FPGA, ASIC, and custom IC; region labelled Dyn. Power Management]
Source: A. DeHon [7]; A. Cuomo [8]; T. Noll
125
Platform-based SoC Design Methodology
Power Budgets
[Figure: power budget pie charts for a µProcessor (control, I/O drivers, execution units, clocks, caches), an FPGA (CLB, I/O, interconnect, clock), and a signal processor (I/O, clock, memory, logic)]
Source: J. Rabaey [2]
SoCT – Low Power Design – 18 © Chair of Integrated Systems, TUM
126
… as Enablement for Multi-Function Devices
127
Example: Invasive Computing Platform
See: http://invasic.informatik.uni-erlangen.de/en/index.php
128
ARM big.LITTLE
129
Samsung Example
Source: A. Frumusanu, R. Smith. (2015, February). ARM A53/A57/T760 investigated - Samsung Galaxy Note 4 Exynos Review. Available: http://www.anandtech.com/show/8718/the-samsung-galaxy-note-4-exynos-review/
Power
Bus/Interconnect Segmentation dyn. stat.
Architecture
Reduction of switched capacitance
Independent transactions on both segments
130
Power
Design Partitioning dyn. stat.
System
[Figure: signals A, B, C, D with capacitive coupling; waveforms of C and D for B=1]
Superfluous transitions on D result in additional Pcap
131
Power
Fighting Crosstalk dyn. stat.
Architecture
[Figure: shielding wire and shielding layer tied to GND/VDD above the substrate (GND)]
Power
Low Voltage Signaling on Interconnect / I/O
dyn. stat.
Architecture
RTL/Logic
Circuit
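$t_{pd} \sim \frac{V_{DD} \cdot C_L}{I_D}$
$P_{cap} = f \cdot C_L \cdot V_{DD}^2$
$V_{DD} \to V_{DD}^* = \frac{V_{DD}}{N}$ with $I_D^* = I_D$ by transistor sizing
$f^* = \frac{1}{t_{pd}^*} = N \cdot f$ with $P_{cap}^* = \frac{P_{cap}}{N}$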
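The scaling argument on this slide can be checked numerically. A minimal Python sketch, assuming the proportionality constant in the delay relation is 1 and using arbitrary example values for the voltage, load capacitance, and drain current (the function name `swing_scaling` is hypothetical):

```python
def t_pd(vdd, c_load, i_d):
    """Propagation delay model: t_pd ~ VDD * C_L / I_D (constant set to 1)."""
    return vdd * c_load / i_d


def p_cap(f, c_load, vdd):
    """Capacitive switching power: P_cap = f * C_L * VDD^2."""
    return f * c_load * vdd ** 2


def swing_scaling(vdd, c_load, i_d, n):
    """Return (f*/f, P_cap*/P_cap) when the swing is reduced to VDD/N
    while the drive current I_D is kept constant by transistor sizing."""
    f = 1.0 / t_pd(vdd, c_load, i_d)
    vdd_star = vdd / n
    f_star = 1.0 / t_pd(vdd_star, c_load, i_d)
    ratio_f = f_star / f
    ratio_p = p_cap(f_star, c_load, vdd_star) / p_cap(f, c_load, vdd)
    return ratio_f, ratio_p


# Example values are arbitrary: 1.2 V, 1 pF, 1 mA, N = 4.
print(swing_scaling(vdd=1.2, c_load=1e-12, i_d=1e-3, n=4))
```

The frequency ratio comes out as N and the power ratio as 1/N, so the same link can run faster at lower capacitive power, provided the receiver can still resolve the reduced swing.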
132
MPEG-4 Video Frame Coding Principles
• I-frames
Least compressible frames, but don't require other video frames to decode.
• P-frames
Can use data from previous frames to decompress and are more compressible than I-frames.
• B-frames
Can use both previous and forward frames for data reference to get the highest amount of data compression.
Power
Processor Voltage/Frequency Scaling dyn. stat.
System
133
Power
DVFS: Multicore Scheduling and Allocation
dyn. stat.
System
E = 2 units
E = 2 units
A1 A4 Tclk A1
A4 A5
A2 A5 A2
A3 A3
∑
=
_ ∑ ∗
Power
Pipelining dyn. stat.
Architecture
RTL/Logic
Pipelining
Constant clock rate preserves throughput
Relaxation of logic timing requirements
Allows lowering of Vdd
Logic power saving overcompensates power for additional registers
[Figure: logic f(in) split into register-separated stages f1(in), f2(f1(in)), ..., fn(fn-1(…f1(in))) = f(in), all clocked by clk]
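A behavioural Python sketch of the pipelining idea: the logic f is split into stages f1 ... fn separated by registers, and after the fill latency one result leaves the pipeline per clock. The function `run_pipeline` and the example stages are hypothetical; `None` is used as the bubble marker, so stage values must not be `None`.

```python
def run_pipeline(stages, inputs):
    """Clock a stream of inputs through register-separated stages."""
    regs = [None] * len(stages)          # one register after each stage
    outputs = []
    # Extra cycles with no input drain the pipeline at the end.
    stream = list(inputs) + [None] * len(stages)
    for item in stream:
        # Clock edge: update from last stage to first so that every
        # stage consumes the value its predecessor held before the edge.
        for i in range(len(stages) - 1, -1, -1):
            src = item if i == 0 else regs[i - 1]
            regs[i] = None if src is None else stages[i](src)
        if regs[-1] is not None:
            outputs.append(regs[-1])
    return outputs


# f(in) = ((in + 1) * 2) - 3, split into three shorter stages.
stages = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
print(run_pipeline(stages, range(5)))
```

Each stage has less logic to evaluate per clock period than the unsplit f, which is the timing slack that allows Vdd to be lowered.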
134
Memory Architecture: Array-Structure
[Figure: memory array with row decoder driven by address bits A_K, A_K+1, ..., A_L-1, word lines, and array width M·2^K]
Power
Hierarchical Memory Architecture dyn. stat.
Architecture
[Figure: hierarchical memory with row address, column address, block address, and I/O]
Advantages:
1. Shorter wires within blocks
2. Block address activates only 1 block => power savings
Source: J. Rabaey [14], A. Macci [19]
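The block-select idea can be illustrated with a small Python sketch. It assumes the block address occupies the most significant bits, followed by row and column fields; this field order and the bit widths are illustrative choices, not taken from the slide.

```python
def split_address(addr, row_bits, col_bits, block_bits):
    """Split a flat address into (block, row, column) fields.

    Assumed layout, MSB to LSB: block | row | column.
    """
    col = addr & ((1 << col_bits) - 1)
    row = (addr >> col_bits) & ((1 << row_bits) - 1)
    block = (addr >> (col_bits + row_bits)) & ((1 << block_bits) - 1)
    return block, row, col


def block_enables(block, block_bits):
    """One-hot enable vector: only the addressed block is activated."""
    return [i == block for i in range(1 << block_bits)]


block, row, col = split_address(0b10_0110_011, row_bits=4, col_bits=3, block_bits=2)
print(block, row, col, block_enables(block, 2))
```

Because only one entry of the enable vector is true, only one block's word lines and bit lines switch per access.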
135
Power
Clock Gating: Circuit / Logic Optimization dyn. stat.
RTL/Logic
Clock Gating
Toggle registers only when outputs can change
Majority of dynamic power saving is on the clock tree
20 to 50 % reduction in active power possible
Concerns
Additional skew on clock
Testability
[Figure: clock gating unit with inputs x, clk, and f(x,S), xor gates, and registers A0 to A2]
Temperature (C°)
Dependency of Pstat on
temperature
Source: R.Puri [4]
136
Power
Arithmetic Optimization dyn. stat.
RTL/Logic
Power
Threshold Control dyn. stat.
Transistor
Substrate bias Vsubstrate shifts threshold voltage Vt
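[Figure: transistor between Vdd and gnd; drain current Id over Vgs for threshold voltages Vt1, Vt2, Vt3, with the leakage current Ileak marked]
$I_D = I_0 \, e^{(V_{GS} - V_t) / (n V_{Temp})} \left(1 - e^{-V_{DS} / V_{Temp}}\right); \quad V_{GS} < V_t$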
Source: J. Rabaey [14]
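The subthreshold current equation above can be evaluated in Python to see how strongly a threshold shift changes leakage. The parameter defaults (`i0`, slope factor `n`, thermal voltage `v_temp` of about 26 mV at room temperature) are illustrative assumptions, not values from the slide.

```python
import math


def subthreshold_current(v_gs, v_t, v_ds, i0=1.0, n=1.5, v_temp=0.026):
    """I_D = I_0 * exp((V_GS - V_t) / (n * V_Temp)) * (1 - exp(-V_DS / V_Temp)).

    Only valid in the subthreshold region, V_GS < V_t.
    """
    if v_gs >= v_t:
        raise ValueError("model only valid for V_GS < V_t")
    return (i0 * math.exp((v_gs - v_t) / (n * v_temp))
            * (1.0 - math.exp(-v_ds / v_temp)))


# Leakage of an off transistor (V_GS = 0) for three threshold voltages.
for v_t in (0.2, 0.3, 0.4):
    i_leak = subthreshold_current(v_gs=0.0, v_t=v_t, v_ds=1.0)
    print(f"Vt = {v_t:.1f} V -> relative Ileak = {i_leak:.3e}")
```

With these assumed parameters, raising Vt by 100 mV through substrate bias lowers the leakage by roughly an order of magnitude, at the cost of a slower gate.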
137
Technology Change for Low Power
MRAM
Source: K. Arabi [10], Qualcomm
FinFET vs. FDSOI Tech.
Source: A.B. Kahng [16]
138
References
140
Voluntary Appendix:
SoC Arithmetic Building Blocks
Outline
141
Bit-sliced Datapath
142
Ripple-Carry Adder
A3 B3 A2 B2 A1 B1 A0 B0
Critical Path
Cout = C4 FA FA FA FA C0 = Cin= 0
S3 S2 S1 S0
143
FA: Logic Gate Implementation
FA inputs: A, B, Cin; outputs: S, Cout
P = A ⊕ B
G = A & B
S = P ⊕ Cin
Cout = G | P & Cin
[Figure: static CMOS full adder schematic and truth table]
Total of 28 MOSFET transistors
Two inverter stages between Cin and Cout
Cout = AB + BCin + ACin = AB + (B + A) Cin
S = ABCin + ¬Cout (A + B + Cin)
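The full adder equations can be checked exhaustively with a short Python sketch. It models the logic on single bits (0 or 1) and verifies both the propagate/generate form and the carry-reuse form of the sum, where the carry enters complemented; the function name `full_adder` is an illustrative choice.

```python
from itertools import product


def full_adder(a, b, cin):
    """One-bit full adder using propagate (P) and generate (G) signals."""
    p = a ^ b                 # propagate
    g = a & b                 # generate
    s = p ^ cin
    cout = g | (p & cin)
    return s, cout


for a, b, cin in product((0, 1), repeat=3):
    s, cout = full_adder(a, b, cin)
    # Arithmetic meaning: the two output bits encode a + b + cin.
    assert 2 * cout + s == a + b + cin
    # Carry in sum-of-products form: AB + (A + B) Cin.
    assert cout == (a & b) | ((a | b) & cin)
    # Sum that reuses the complemented carry: ABCin + not(Cout) (A + B + Cin).
    assert s == (a & b & cin) | ((1 - cout) & (a | b | cin))
```

Reusing the complemented carry in the sum is what lets the transistor-level cell share logic between the two outputs.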
SoCT – SoC Logic Design – 8 © Lehrstuhl für Integrierte Systeme
144
Ripple-Carry Adder/Subtractor
Subtraction – complement all subtrahend bits (xor gates) and set the low order carry-in
RCA summary:
advantage: simple logic, small area (low cost), straightforward expandable
disadvantage: slow (O(N) for N bits), lots of signal transitions (energy wasteful)
[Figure: 4-bit ripple-carry adder/subtractor; add/subt control drives C0 = Cin and the xor gates on B0..B3; FA cells with inputs A0..A3, carries C1..C3, sums S0..S3, and C4 = Cout]
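A bit-level Python sketch of the ripple-carry adder/subtractor: the subtract control both inverts every bit of B through an xor and sets the low-order carry-in, which yields A + not(B) + 1 in two's complement. The function names and the 4-bit width in the example are illustrative.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    p = a ^ b
    return p ^ cin, (a & b) | (p & cin)


def ripple_add_sub(a, b, n_bits, subtract=False):
    """N-bit ripple-carry add or subtract; returns (result, carry_out)."""
    sub = int(subtract)
    carry = sub                           # low-order carry-in = add/subt control
    result = 0
    for i in range(n_bits):               # carry ripples from LSB to MSB: O(N)
        a_i = (a >> i) & 1
        b_i = ((b >> i) & 1) ^ sub        # xor gate complements the subtrahend bit
        s_i, carry = full_adder(a_i, b_i, carry)
        result |= s_i << i
    return result, carry


print(ripple_add_sub(5, 3, 4))                  # 5 + 3
print(ripple_add_sub(5, 3, 4, subtract=True))   # 5 - 3
```

The loop makes the O(N) delay visible: bit i cannot settle before the carry from bit i-1 is known.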
FA Inversion Property
[Figure: chain of FA cells producing S2, S1, S0, with Cout of each cell feeding Cin of the next]
FA’: FA without INV in carry path
145
Manchester Carry-Chain Adder
Ci+1 = Gi | Pi & Ci
Attention:
Transmission gate logic isn’t regenerative
Signal noise on Ci directly propagated to Ci+1
[Figure: 4-bit adder chain of FA cells with inputs A3..A0, B3..B0, carry-in C0 = Cin, sums S3..S0, and carry-out C4 = Cout]
BP = P0 & P1 & P2 & P3 (“Block Propagate”)
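A Python sketch of one 4-bit carry-bypass block built from the carry recurrence and the block propagate signal. Bit lists are LSB first; the function name `carry_bypass_block` and the list-based interface are assumptions made for illustration.

```python
def carry_bypass_block(a_bits, b_bits, cin):
    """One carry-bypass block; bit lists are LSB first.

    Returns (sum_bits, carry_out, block_propagate).
    """
    p = [a ^ b for a, b in zip(a_bits, b_bits)]   # propagate per bit
    g = [a & b for a, b in zip(a_bits, b_bits)]   # generate per bit
    sums, c = [], cin
    for p_i, g_i in zip(p, g):
        sums.append(p_i ^ c)
        c = g_i | (p_i & c)                       # Ci+1 = Gi | Pi & Ci
    bp = int(all(p))                              # BP = P0 & P1 & P2 & P3
    # If every bit propagates, the carry-out equals the carry-in, so a
    # multiplexer can forward Cin without waiting for the ripple chain.
    cout = cin if bp else c
    return sums, cout, bp


print(carry_bypass_block([1, 0, 1, 0], [0, 1, 0, 1], 1))
```

The bypass does not change the result; it only shortens the critical path when the whole block is in propagate mode.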
146
CBA: Block Propagate Generation
[Figure: block propagate generation; four carry propagation stages with inputs G3..G0 and P3..P0, carry-in Ci,0 = Cin, and BP selecting the source of Cout]
147
Carry-Bypass Adder
148
Carry-Select Adder: 16-bit Adder
bits 12 to 15 bits 8 to 11 bits 4 to 7 bits 0 to 3
A’s B’s A’s B’s A’s B’s A’s B’s
149
Square Root Carry-Select Adder
bits 14 to 19 bits 9 to 13 bits 5 to 8 bits 2 to 4 bits 0 to 1
A’s B’s A’s B’s A’s B’s A’s B’s A’s B’s
$N = MP + \frac{P(P-1)}{2}$
$= \frac{P^2}{2} + P\left(M - \frac{1}{2}\right)$
$\approx \frac{P^2}{2}$ for $N \gg M$
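The stage sizing can be reproduced in Python. The sketch assumes that the first stage has M bits and each following stage is one bit wider, which matches the 20-bit example on the slide (stages of 2, 3, 4, 5, and 6 bits); the function names are hypothetical.

```python
def stage_widths(m, p):
    """Widths of P stages that start at M bits and grow by one bit each."""
    return [m + i for i in range(p)]


def total_bits(m, p):
    """N = M*P + P*(P-1)/2."""
    return m * p + p * (p - 1) // 2


def stages_needed(n, m):
    """Smallest number of stages P that covers at least N bits."""
    p = 1
    while total_bits(m, p) < n:
        p += 1
    return p


print(stage_widths(2, 5), total_bits(2, 5))   # the 20-bit example
print(stages_needed(64, 2))                   # stages for a 64-bit adder
```

Since N grows with the square of P, the number of stages, and with it the carry-select delay, grows only with the square root of the word length.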
150
Square Root Carry-Select Adder
Adder Summary
151
Multiply Operation
Multiplication as repeated additions
$x \cdot y = \sum_{j=0}^{M-1} \sum_{i=0}^{N-1} x_i \, y_j \, 2^{i+j}$
x: multiplicand (N bits), y: multiplier (M bits)
X7 X6 X5 X4 X3 X2 X1 X0
Yi
152
Sequential Multiplier
Right shift-and-add
Partial product rows accumulated from least to most significant bit on an N-bit adder
After each addition, shift accumulated partial product to right in order to align it with the next row to add
Example: 1010 x 1101
T= 0; 1010
T= 1; 01010
T= 2; 001010
T= 6; 110010
T= 7; 0110010
T= 11; 10000010
T= 12; 10000010
Time for NxN bits
$T_{serial\_mult} = O(N \cdot T_{adder}) = O(N^2)$
Design (area) complexity
One N-bit adder and single-bit shifter
No straightforward pipeline structure
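A Python sketch of the right shift-and-add scheme for unsigned operands. The accumulator is modelled as one integer whose upper half receives the multiplicand; the function name `shift_add_multiply` is an illustrative choice.

```python
def shift_add_multiply(x, y, n_bits):
    """Right shift-and-add multiplication of two unsigned N-bit operands."""
    acc = 0
    for i in range(n_bits):               # multiplier bits, LSB first
        if (y >> i) & 1:
            acc += x << n_bits            # add multiplicand into the upper half
        acc >>= 1                         # shift right to align the next row
    return acc


product_value = shift_add_multiply(0b1010, 0b1101, 4)
print(f"{product_value:08b}")             # binary product of 1010 x 1101
```

Only the upper N bits take part in each addition, which is why a single N-bit adder plus a one-bit shifter is enough; the loop runs N times, giving the O(N * Tadder) delay.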
Array Multiplier
Partial product addition
Typically no single worst case path, but multiple paths with same (max) length
[Figure: 4x4 array multiplier; rows of HA/FA cells fed by X3..X0 and Y1..Y3, producing product bits Z0..Z7]
153
Partial Product Reduction
Shifter
Programmable shifter
Right Nop Left
Single-bit left/ right shift
operations through individual pass
transistors operated by separate
control lines
Ai Bi
Output drivers to fully regenerate
logic levels
154
Barrel Shifter
155
Other Arithmetic Operators
Subtractor
In1 – In2
Inverted two's complement adder with Cin,0 = 1
[Figure a): Subtractor with inputs In1 and In2]
Multiplexer / Demultiplexer
156
References
[1] R. J. Baker et al., CMOS circuit design, layout, and simulation, IEEE
Press, 1998. ISBN 0-7803-3416-7
[2] N. H. E. Weste et al., Principles of CMOS VLSI Design, Addison
Wesley, 1993. ISBN 0-201-53376-6
[3] SIA, International Technology Roadmap for Semiconductors,
http://public.itrs.net/
157