
ISSCC 2020

SESSION 2
Processors
Zen 2: The AMD 7nm Energy-Efficient
High-Performance x86-64 Microprocessor Core

T. Singh1, S. Rangarajan1, D. John1, R. Schreiber1, S. Oliver1, R. Seahra2, A. Schaefer1

1AMD, Austin, TX; 2AMD, Markham, ON, Canada

© 2020 IEEE 2.1: Zen 2: The AMD 7nm Energy-Efficient High-Performance x86-64 Microprocessor Core 1 of 31
International Solid-State Circuits Conference
Outline
• Motivation
• Market Segments
• Architecture
• Core Complex
• Technology
• Implementation
• SRAMs
• Power
• Silicon Results
• Conclusion
Motivation
• Zen was a huge lift
• Zen 2 is a compelling successor to Zen
• Goals
  – Deliver above-industry-trend generational performance improvement
  – Enable 2x cores in the same socket
  – Improve single thread (1T) performance
• How can we do this?
  – Technology port
  – Architectural changes
  – Physical design and methodology changes
• AMD was aggressive and we did all of the above to achieve the goals!!

Zen 2 Market Segments

Zen 2 Architecture
• Changes from Zen
– New TAGE Branch Predictor
– Optimized L1 Instruction Cache: 32K/8-way vs. 64K/4-way
– 2X Op Cache Capacity: 4K vs. 2K ops
– 2X Floating Point Data Path Width: 256b vs 128b
– 3rd Address Generation Unit
– Larger Physical Structures: Integer Scheduler, PRF, ROB, Store Queue, L2DTLB
– 2X L1 Data Cache Read/Write Bandwidth
– 2X L3 Cache: 16MB vs. 8MB per Core Complex (CCX)
• +15%1 single thread (1T) IPC over Zen
• ~9% switching capacitance (CAC) improvement over previous generation, technology neutral
1 AMD "Zen 2" CPU-based system scored an estimated 15% higher than previous generation AMD "Zen" based system using estimated SPECint®_base2006 results.
SPEC and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. See www.spec.org.
Core Functional Units
• 32KB IC
• 32KB DC
• ~20 blocks, ~400K avg instances
• ROM for uCODE
• 5 L1 RAM variants
• Chip Pervasive Logic (CPL) – clock/test block
[Figure: core floorplan – Branch Prediction, Decode, I-Cache, uCode, CPL, Scheduler, ALU, Floating Point, Load/Store, Data Cache, L2 Cache]
L2/L3 Cache Hierarchy
• Only 3 unique custom macros – down from 8 on Zen
• Each 4M slice is identical
• Shadow tag macros for serving external probes
• Multi-stage clock gating in L3 to keep clock distribution power the same as the 8M L3 from Zen
• LDOs incorporated into the L3 to supply VDDM to L2 and L3 arrays
  – Loss of package distribution of VDDM meant LDOs had to be moved closer
  – Must reduce current on VDDM
[Figure: 4M L3 slice floorplan – L2 Data, L2 Tags, L2 Status, L3 Data, L3 Tags, CTL, 512K L2, and LDOs]
Zen 2 Core Complex (CCX)

• 4-core complex
• L3 size increases to 16MB
• Designed for flexibility
• Maximize # of cores for the server case
Zen 2 CCX Configs
• Zen 2 Core can be used in various configs covering a wide power range
• Multiple CCX can be placed to achieve desired core count
• CCX variants: 4 Core / 16MB L3 (HEDT/Server), 4 Core / 4MB L3 (APU), 2 Core / 4MB L3 (Value)

Cores  Market          TDP
8      Notebook        15 W
6      Desktop         65 W
8      Desktop/Server  65-120 W
12     Desktop/Server  105-120 W
16     Desktop/Server  105-155 W
24     HEDT/Server     155-280 W
32     HEDT/Server     155-280 W
48     Server          200-225 W
64     HEDT/Server     200-280 W
Zen vs. Zen 2 Technology Comparison

                       Zen                 Zen 2
Tech                   14nm FinFET         7nm FinFET
Cores/CCX              4 Cores, 8 Threads  4 Cores, 8 Threads
Area/CCX               44 mm2              31.3 mm2
L2/core                512KB               512KB
L3/CCX                 8MB                 16MB
CPP                    78 nm               57 nm
Fin Pitch              48 nm               30 nm
1x Metal Pitch         64 nm               57 nm
Stdcell Track Library  10.5 track          6 track
Cu Metal Layers        11 w/ MiM           13 w/ MiM
Zen vs. Zen 2 Technology Comparison (cont)

Zen (14nm)                       Zen 2 (7nm)
Layer Name            Pitch      Layer Name            Pitch
---                   n/a        M0 StdCell Internal   1.0x
M1 StdCell Internal   1.0x       M1 Stdcell & BEOL     1.425x
M2-M3                 1.0x       M2-M3                 1.0x-1.1x
M4-M7                 1.25x      M4-M7                 2.0x
M8-M9                 2.0x       M8-M9                 2.0x
---                   ---        M10-M11               3.15x
M10-M11 (RDL)         11.25x     M12-M13 (RDL)         18.0x
Place and Route Design Optimization
• 7nm FinFET presents unique route challenges
  – Lower layer jogs forbidden
  – Denser standard cells with reduction in track height
  – Increased lower level metal resistance
• Deep collaboration between AMD CAD, foundry, and EDA partners
  – Cell density management
  – Advanced legalization techniques
  – Improved pre-route timing estimates
  – Wire engineering and via ladders
[Figure: same-layer jogs (forbidden) vs. inter-layer jumpers (required)]
Placement Restricted by Large Cells
• Multi-row cells benefit power and area, but create placement challenges
• Clustering of flops has many benefits but can cause placement issues
• Resulting small gaps are challenging to use and required innovation to exploit
  – New algorithms
  – Flexible power grid choices
Design RC Miscorrelation
• Pre-route vs. post-route miscorrelation caused by length and layer assumptions
• Pre-route miscorrelations for resistance and capacitance have differing root causes
  – Layer assignment for resistance
  – Length estimates for resistance and capacitance
• Based on previously modeled trends, EDA tools may have challenges estimating delay
• Required innovation to tackle

Layer  Normalized Resistance  Normalized Capacitance
M1     1.00                   1.00
M2     3.17                   0.96
M3     2.31                   0.96
M4     0.72                   0.75
M5     0.55                   0.83
M6     0.52                   0.83
M7     0.55                   0.83
M8     0.52                   0.83
M9     0.55                   0.92
M10    0.16                   0.96
M11    0.16                   0.92
Pre-Route Correlation Improvements
• Plots show clock tree synthesis vs. route timing (timing slack correlation and timing slack delta, pessimistic vs. optimistic)
• Large variance in initial results
  – A large number of paths have overly-pessimistic delay during pre-route steps; tools waste resources trying to fix them
  – A significant number of paths have optimistic delay estimates; these paths are under-optimized
• Employed timing with targeted capacitance scaling and global route-based layer estimation
  – Standard deviation dramatically improved while keeping a slightly pessimistic mean
[Figure: cts_vs_route.slack.corr and cts_vs_route.slack_delta.hist, initial vs. improved results]

Wire Engineering Challenges
• Lower layers are getting more resistive with the latest technology nodes
  – Very short routes in tight data paths need a buffer
  – Routes longer than Steiner due to complex rules
  – Challenging for optimization tools to comprehend
• Critical signals need to get to higher layers quickly
Wire Engineering and Via Ladders
• Team used selective layer optimization, buffering, pre-routes, and via ladders to exploit the fast layers for critical signals
• Two types of via ladders
  – High performance: for large buffers driving long wires
  – EM: for high-activity gates (e.g., clock drivers); mitigated EM issues on large-fanout nodes with high activity
[Figure: top and side via ladder views]
L2/L3 Cache Changes
• Zen had an on-die LDO to generate the VDDM supply for use by cache arrays
• Zen 2's package choices make using package layers for VDDM distribution impossible
• Moved the bitline precharge from VDDM to VDD to reduce current
[Figure: SRAM column circuits – wordlines WL[N:0], bitline precharge BLPCX, column select XCENX, bitlines BLT[]/BLC[], write column select WRCS[], NegBL write driver (WDT_X/WDC_X), read column select RDCSX[], and sense amp (SAC/SAT, SAT_INT/SAC_INT, SAPCX, SAEN)]
VDD Precharge Challenges
• Moving bitline precharge to VDD creates both bitcell stability and writeability challenges
• High level of configurability allows for silicon flexibility
[Figure: VDD/VDDM operating range – superVminEn=1 below the VDD where VDDM-VDD = superVminThreshold; superVmaxEn=1 above the VDD where VDD-VDDM = superVmaxThreshold; the controller pauses the voltage increase and unsets superVminEn (or sets superVmaxEn) before continuing to raise voltage]
[Figure: system management controller drives assist voltage thresholds (superVminEn, superVmaxEn, WLUdEn, NegBlEn) to the SRAMs; fuses hold programming details and assist configurations]
VDD Precharge Timing Challenges
• Moving precharge to VDD reduced our current enough to allow on-die distribution but presents other challenges
• Power races with WL (WL at a constant VDDM):
  – BLPCX @ high VDD: bitline precharge turns on before WL turns off at high VDD!
  – BLPCX @ low VDD: WL is on before bitline precharge turns off at low VDD!
• Read-before-write timing challenges at low VDD, high VDDM
Solving Timing Challenges
• Solving these multiple voltage timing challenges required a number of techniques
  – Dual-voltage clock shapers to average two voltage domains
    • Can alter the number of these buffers on VDD or VDDM, or remove them entirely, to make timing more or less dependent on either supply
  – False read-before-write problem can be mitigated by compressing the front end of the WL during a write operation
[Figure: pseudo-dynamic level shifter (Input@VDD, LS @VDD, ISOX@VDDM, shapedFallInput) and WL compression waveforms (WLCLK, WLCLK_shape, WREN; WL during read vs. WL during write)]
CAC Comparison

• 3% decrease in flop power allocates more budget for combinational logic


FLOP Palette Improvements
• Rich flop library; balance timing/power needs by driving the right flop mix
• Up to 8% Fmax benefit from high-speed flops in timing-critical loop paths
[Figure: flop palette ranging from best-for-performance to best-for-power]
Low Power Gater Latch

Energy with AvgApp Activity (fJ)
State  LP Latch  Regular Latch  Ratio
E=1    0.22      0.18           121%
E=0    0.17      1.61           10%
Total  0.38      1.79           22%

• 90% power savings in the latch for the common case of E = 0 through internal self-gating
• Clock gater latch power contribution drops from 22% in Zen to 13% in Zen 2 for an average application
[Figure: low-power gater latch schematic – CLK, CLKB, CLKBB, Dbar, E, TE, qf, qf_x, Q]
Zen 2 Clock Optimization
• Multi-mesh plan for the core supported by configurable clock tree construction
  – FP-level mesh gating enabled with minimal timing/area overhead
  – 15% mesh power savings in Idle and Average App
• Tight clock skew distribution
• Relocated clock spines and technology shrink (vs. Zen) achieve a similar skew profile while reducing CAC

Zen vs. Zen 2 CAC Comparison
• Primary sources of CAC reduction
  – 14 nm to 7 nm scaling
  – 6 track library
  – Aggressive microarchitectural CAC optimizations
Generational Leadership Perf/Watt
• Performance/Watt driven by a combination of technology and design improvements
• Timing
  – Improved scalability by optimizing at a wider voltage range compared to Zen
  – Multi-corner optimization
• Library choice and optimization
  – 6 track library enabled additional CAC/leakage savings in addition to default technology entitlement
• Design CAC
  – MBFF, low power clock-gater library optimization
  – RTL improvements
  – CAC-aware downsizing methodology
[Figure: power improvements at iso-frequency – waterfall from Zen power @ 100% IPC through 7nm CAC savings, library choice, 7nm timing, and design CAC savings down to Zen 2 power @ 115% IPC]

Frequency/Power Silicon Results
• 4 cores active with 2 threads per core
• The combined effect of lower Vmin for the same frequency and reduced CAC enabled a 50% reduction in power for a given frequency throughout most of the F(P) curve
• This enables 2x cores in the same socket!!
[Figure: Zen vs. Zen 2 frequency vs. power curves, showing the 50% power reduction]
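As a rough sanity check on how lower Vmin and reduced CAC compound, dynamic power scales approximately as Cac·V²·f. The two ratios below are hypothetical illustration values chosen to land near the reported 50%; the deck does not publish the individual voltage and capacitance contributions.

```python
# Back-of-envelope dynamic power model: P ~ Cac * V^2 * f.
# cac_ratio and v_ratio below are HYPOTHETICAL illustration values,
# not AMD-published numbers; the deck only reports the combined ~50%
# power reduction at iso-frequency.
def dynamic_power_ratio(cac_ratio, v_ratio, f_ratio=1.0):
    """Relative dynamic power of the new design vs. the old one."""
    return cac_ratio * v_ratio**2 * f_ratio

# Suppose process + design shrink the switched capacitance to 72% and
# the lower Vmin allows running at 83% of the old voltage (assumed).
ratio = dynamic_power_ratio(cac_ratio=0.72, v_ratio=0.83)
print(f"power at iso-frequency: {ratio:.0%} of previous generation")
```

Because voltage enters quadratically, even a modest Vmin reduction contributes more than the capacitance term alone.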
Frequency/Voltage Silicon Results
• 1 core active with two threads per core, 3 cores idle
• F/V curve improved over all voltages
• Design worked to improve the low-voltage performance for improved linearity
• Wide voltage range

Conclusion
• Met goals
  – Moved to energy-efficient TSMC 7nm FinFET
  – Made huge architectural changes
  – Improved PD and methodology
• Results are clear
  – Scalable across 15W mobile to 280W server
  – 50% reduced power at iso-frequency
  – Enables 2x cores in the same socket
  – >15% 1T IPC over previous generation
  – ~9% CAC improvement over previous generation, technology neutral
  – Enables peak frequencies up to 4.7GHz (+350MHz generationally)
• Zen 2 delivers generational performance uplift!!
Acknowledgements
• We would like to thank our talented AMD design team across Austin, Fort Collins, Santa Clara, Boston, Markham, and India who contributed to Zen 2
• Please stay for our chiplet paper next
• Please check out our demo, 2.1, tonight in Golden Gate
• Did we mention we have liquid nitrogen?

AMD Chiplet Architecture
for High-Performance
Server and Desktop Products
Samuel Naffziger

© 2020 IEEE
International Solid-State Circuits Conference 2.2: AMD Chiplet Architecture for High-Performance Server and Desktop Products 1 of 27
Outline
• Motivation and architectural goals
• Engineering challenges and solutions
  • Silicon-package co-design
  • Die-to-die interconnects
  • Shared IO die architecture
  • Power distribution and management
• Results
[Figure: 3rd Gen AMD Ryzen™ Threadripper™ Processor; 2nd Gen AMD EPYC™ with Server IO Die (8.34 billion FETs, 416 mm2), 7nm Core Complex Dies (3.8 billion FETs, 74 mm2 each, Zen2 cores + L3), IFOP links, 4 x16 PCIe/IFIP, 4x DDR; 3rd Gen AMD Ryzen™ Processor with Client IO Die (2.09 billion FETs, 125 mm2) and AMD X570 Chipset]
Motivation and Architectural Goals
Primary goal:
Achieve leadership performance, performance/Watt and
performance/$ in server and desktop markets

• This required
– Exploiting advanced 7nm technology for better performance and
performance/Watt
– Packing more silicon into the package than traditional approaches enable
• While also
– Enabling scalable performance/$ up to performance levels otherwise not
achievable
– Improving memory and IO latency
– Supporting leverage across markets by re-using IP and SOCs

Background: Performance and Die Size Trend
• Generational performance improvements are an exponential trend
• Holding to this trend has required increasing core counts and die sizes
• Bumping up against the reticle limit and becoming too costly
• Goal: land above the historical trend line
[Figure: SPECint®_rate2006 2P server throughput performance ratio trend over time1 (1X-100X, 2006-2020); server CPU die sizes over time (0-1000 mm2, Oct-06 to Mar-23)]
1. Su, Lisa, "Delivering the Future of High-Performance Computing", Hot Chips 31 (2019)
Exploiting 7nm Technology
• Leadership performance requires 7nm benefits: 2X density1, >1.25X frequency1 (same power), 0.5X power1 (same performance)
• Yet the cost of advanced 7nm technologies is increasing
• Traditional approaches of large die sizes are not viable
• Innovation required
[Figure: 7nm compute efficiency gains; normalized cost per yielded mm2 for a 250mm2 die, rising steeply from 45nm through 32nm, 28nm, 20nm, 14/16nm, and 7nm to 5nm]
1. Based on June 8, 2018 AMD internal testing of same-architecture product ported from 14 to 7 nm technology with similar implementation flow/methodology, using performance from SGEMM.
7nm Scaling
• High-performance server and desktop processors are IO-heavy
• Analog devices and bump pitches for IO benefit very little from leading-edge technology, and that technology is very costly
• CPU core + L3 comprise 56% of the prior generation RYZEN™ processor die
  – These circuits see huge 7nm gains
  – The remaining 44% sees very little performance and density improvement from 7nm
• Solution: partition the SOC, reserving the expensive leading-edge silicon for CPU cores while leaving the IO and memory interfaces in N-1 generation silicon
  – 7nm CCD is 86% CPU + L3 (Zen2 cores + L3, plus DFx, IFOP SerDes, SMU)
[Figure: prior generation RYZEN™ processor die; 7nm CCD floorplan]
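The partitioning argument also shows up in the die statistics quoted on the outline slide (7nm CCD: 3.8 billion FETs in 74 mm2; 14nm client IOD: 2.09 billion in 125 mm2; 14nm server IOD: 8.34 billion in 416 mm2). A quick calculation of the implied FET densities, with the caveat that FET counts mix logic and SRAM, so this is only a rough figure of merit:

```python
# FET density implied by the die statistics in this deck
# (FET count, die area in mm^2).  Rough figure of merit only: the
# counts mix logic and SRAM, so this is not a pure logic-density
# comparison.
dies = {
    "7nm CCD":         (3.80e9, 74.0),
    "14nm client IOD": (2.09e9, 125.0),
    "14nm server IOD": (8.34e9, 416.0),
}
density = {name: fets / area / 1e6 for name, (fets, area) in dies.items()}  # MFETs/mm^2
for name, d in density.items():
    print(f"{name}: {d:.1f} MFETs/mm^2")
```

The CCD packs roughly 2.5-3x the FETs per mm2 of the 14nm IODs, which is exactly the portion of the SOC that pays back the 7nm wafer cost.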
Chiplets Evolved – Hybrid Multi-die Architecture
• Traditional monolithic: use the most advanced technology where it is needed most
• 1st Gen EPYC: each IP in its optimal technology, Infinity Fabric™ connected
• 2nd Gen EPYC: centralized I/O die with 2nd Gen Infinity Fabric™ improves NUMA; superior technology for CPU performance and power

Connecting the Chiplets
• Silicon interposers and bridges provide high wire density, but have limited reach
  – Only supports die-edge connectivity, which limits the number of chiplets and cores that can be supported
• Performance goals required more Core Complex Dies (CCDs) than can be tiled adjacent to the IOD
• Solution is to retain the on-package SerDes links for die-die connections
[Figure: theoretical interposer-based arrangement (CCDs and IOD on an interposer) vs. the selected MCM approach]

CPU Compute Die (CCD) Floorplan
• 2 CCX core complexes
  – 4 cores and 16MB L3 each
  – Comprise 86% of CCD area
• System Management Unit (SMU)
  – Microcontroller
  – Power management
  – Clocks and reset
  – Fuses
  – Thermal monitor and control
• Infinity Fabric On-Package (IFOP) links
  – 14.6 GT/s (packing 10 bits at 1.46GHz)
  – 39 RX lanes, 2 clock lanes, 1 clock gating lane
  – 31 TX lanes, 1 clock gating lane
  – 4 lanes for control traffic, 2 clock lanes
• DFT and debug
• Wafer test bumps
[Figure: CCD floorplan – two CCXs (Core0-Core3 with shared L3) flanking DFT, IFOP, and SMU]
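The 14.6 GT/s lane rate follows directly from packing 10 bits per 1.46GHz FCLK cycle. A small sketch, with the caveat that the aggregate figure is a raw number over the 31 TX data lanes and assumes no protocol overhead (the deck does not specify the framing):

```python
# IFOP link arithmetic from the figures on this slide: each lane packs
# 10 bits per 1.46GHz FCLK cycle.  The aggregate below is a RAW number
# over the 31 TX data lanes and ignores any protocol/CRC overhead,
# which the deck does not specify.
fclk_hz = 1.46e9
bits_per_cycle = 10
per_lane_gbps = fclk_hz * bits_per_cycle / 1e9     # 14.6 Gbps per lane
tx_data_lanes = 31
raw_tx_gbps = per_lane_gbps * tx_data_lanes
raw_tx_gbytes = raw_tx_gbps / 8
print(f"{per_lane_gbps:.1f} Gbps/lane, {raw_tx_gbytes:.1f} GB/s raw TX")
```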

IFOP Gen2 Key Feature Summary and Comparison

Gen1 (14nm):
• Max per-lane datarate 6.4Gbps
• Local clock alignment and global tracking
• 50 Ohm fixed drive strength and termination
• 4:1 serialization/deserialization
• Forwarded clocks through package

Gen2 (14nm IOD, 7nm CCD):
• Max per-lane datarate 14.6Gbps
• Synchronous clock crossing with local CDR
• 50/100/200 Ohm drive strength and termination
• 10:1 serialization/deserialization
• Local PHY regulators
• TX and RX T-coil
• Regulated pseudo-differential single-ended receiver with VTT termination

[Figure: 1st Gen four-die MCM (Die0-Die3, each with two CCXs, DDR, and I/O) vs. 2nd Gen central IOD with DDR I/O and eight Zen2 CCDs]

IFOP SerDes Architecture
[Figure: IFOP SerDes architecture – IOD and CCD each contain an 8GHz PLL on the core FCLK; a differential TXCLK is forwarded over RC-filtered MCM package routes; x32 RX (FDI[30:0]) and x32 TX (FDO[30:0]) lanes with forwarded clocks FWDFCLK[30:0]; new Gen2 features (x40 lanes, FDO[38:0]/FDI[38:0], FWDFCLK[38:0]): TX lane with register, serializer (FDO[9:0]), trained 50Ω driver, and T-coil; RX lane with T-coil, trained 50Ω VTT-terminated receiver, deserializer, low-latency capture FIFO (FDI[9:0]), and clock generator with quad-phase interpolator, calibration logic, and CDR logic]
Package Routing Challenges
• Prior generation already consumed all package routing resources for memory and IO
• Connecting 9 chiplets in the same package requires innovation
[Figure: 1st Gen AMD EPYC™ four-die MCM (Die0-Die3, each with two CCXs, DDR, and I/O) [Beck ISSCC 2018]]
Under-CCD Routing
• Routing Infinity Fabric On-Package (IFOP) SerDes links from the IOD to the 2-deep chiplets required sharing routing layers with off-package SerDes and competing with power delivery requirements
[Figure: package layout – central IOD flanked by DDR, with four CCDs above and four below, off-package SERDES at the top and bottom edges]

Zen vs. Zen 2 VDDM Distribution
• Dense SRAMs require a separate rail
[Figure: Zen VDDM distribution via package plane vs. Zen 2 VDDM distribution via RDL only]
Zen 2 VDDM Design Challenges
• RDL is more resistive than a dedicated package layer
• Therefore we reduced overall VDDM current draw by 80% compared to Zen ([Singh ISSCC 2020])
• New, smaller, and distributed LDO design: 4 VDDM LDOs inside the L3
• Ensured sufficient routing porosity through the integrated LDOs to enable critical routing
  – Enables 80 IFOP package-routed signals under the CCD
• These improvements kept the IR drop to ≈10mV
[Figure: CCD floorplan – Core+L2 tiles and 4MB L3 slices, VDDM RDL, LDO spanning L2 and L3]
Package Integration, Server, and Desktop
• Bump pitch for 14nm and 7nm is 150um and 130um respectively
• Transitioned IOD from solder bumps to copper pillars, enabling a common interface for IOD+CCD
  • Conducive to tighter bump pitches (compact)
  • Enabled common die height after assembly
  • Higher max current (electromigration) limits
[Figure: 3rd Gen AMD Ryzen™ Processor (two Zen2 CCDs) and 2nd Gen AMD EPYC™ Server Processor (eight Zen2 CCDs) packages – Infinity Fabric (die-to-die), IO controllers and PHYs, 2x DDR4 PHYs; 72 data + 8 clk/ctl IFOP bumps (total/CCD); 128 total x16 SERDES]
Operating System Scheduler Optimizations
• Growing number of cores and the advent of chiplets resulted in a wider range of frequency responses to process, voltage, and temperature variations
  – Up to 200MHz core-to-core Fmax upside within a CCD
  – Legacy boost approaches don't take advantage of the faster cores
• Preferred Core Ordering maximizes performance
  – New algorithm characterizes the capabilities of the cores at boot time under various system parameters and generates a list of cores in order of frequency capability
  – The core ordering is modified according to the usage policy detected
    • Single-threaded applications scheduled to the fastest cores
    • Multi-threaded applications scheduled toward the fastest core cluster (CCX), maximizing L3 cache sharing
  – This core ordering is expressed to the OS, allowing for efficient, dynamic, HW-directed selection of processors for a given workload
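The selection policy above can be sketched as follows. The per-core Fmax values and CCX grouping are hypothetical; in the real part the ranking comes from boot-time characterization and is exposed to the OS rather than computed like this.

```python
# Sketch of the Preferred Core Ordering idea described above.  The
# fmax values and CCX grouping are HYPOTHETICAL illustration data.
cores = {  # core id -> (ccx id, characterized Fmax in GHz)
    0: (0, 4.60), 1: (0, 4.70), 2: (0, 4.55), 3: (0, 4.50),
    4: (1, 4.40), 5: (1, 4.45), 6: (1, 4.35), 7: (1, 4.30),
}

# Boot-time ordering: all cores ranked by frequency capability.
preferred_order = sorted(cores, key=lambda c: cores[c][1], reverse=True)

def pick_cores(n_threads):
    """1T work goes to the single fastest core; MT work is steered to
    the fastest CCX first, maximizing L3 cache sharing."""
    if n_threads == 1:
        return [preferred_order[0]]
    # Rank CCXs by their fastest member, then fill CCX by CCX.
    ccx_rank = sorted({ccx for ccx, _ in cores.values()},
                      key=lambda x: max(f for cx, f in cores.values() if cx == x),
                      reverse=True)
    by_ccx = sorted(cores, key=lambda c: (ccx_rank.index(cores[c][0]), -cores[c][1]))
    return by_ccx[:n_threads]
```

With this data, a 1T job lands on the single fastest core, while a 4T job stays within one CCX even though the second CCX contains cores faster than the slowest core of the first.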

Per-Core Linear Regulation
• Regulating the voltage per core enables power savings by adapting the voltage to each core's capability and compensating for power delivery gradients across the package
• Digitally controlled LDO enables setting voltage based on per-core speed capability for a given frequency
• Droops mitigated with fast-response charge injection from RVDD for cores with a drop-out
[Figure: 8 cores per chiplet, each with a separate VDD – 64 total core-specific voltages]
Clock Stretching and Per-Core Voltage
• Droop detection with a fast analog comparator
• Separate thresholds for LDO charge injection (CI level) and for clock stretching (CKS)
• These work synergistically to lower the required voltage for a given frequency

Same-frequency power savings through voltage reduction1
No LDO, no CKS  0%
LDO only        19%
CKS only        19%
LDO and CKS     25%

• A VDD droop forces a core clock stretch after 1 more full frequency period; the clock stretch response, rise-to-rise, is 150% period, 175% period, then 125% periods
[Figure: per-core DLDO output voltages (VID at the top of the load line, VDDCORE0-VDDCORE7 with the slowest core at the bottom) and droop waveforms (IDD, DROOP, CCLK) across Idle/TDC/EDC]
1. Based on AMD internal testing of 64C AMD EPYC "Rome" processor operating at 2.5GHz, synthetic di/dt pattern
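The two-threshold behavior can be sketched as below. The threshold values are hypothetical, since the deck gives the CI and CKS levels only qualitatively.

```python
# Toy model of the two-threshold droop response described above: a
# fast comparator flags a droop, charge injection (CI) engages at one
# threshold and clock stretching (CKS) at a deeper one.  Both
# thresholds are HYPOTHETICAL illustration values.
CI_LEVEL  = 0.97   # engage LDO charge injection below 97% of target (assumed)
CKS_LEVEL = 0.95   # stretch the clock below 95% of target (assumed)

def droop_response(v_ratio):
    """Return the assist mechanisms engaged for a given VDD/target ratio."""
    actions = []
    if v_ratio < CI_LEVEL:
        actions.append("charge_injection")
    if v_ratio < CKS_LEVEL:
        actions.append("clock_stretch")
    return actions

print(droop_response(0.96))  # shallow droop: charge injection only
print(droop_response(0.93))  # deep droop: charge injection + clock stretch
```

Stacking the mechanisms this way lets the shallow-droop case be handled without any frequency loss, which is why the combination saves more power (25%) than either mechanism alone (19%).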
Improving Memory Performance
• Server memory latency is a key factor in performance
• A goal for 2nd Gen AMD EPYC™ was to improve on the 2017 1st Gen EPYC™ design
• Non-Uniform-Memory-Access (NUMA) behaviors are a result of memory interfaces being distributed across die
• Prior generation (EPYC 7001 Series processors): 3 NUMA distances, 8 NUMA domains1
• Significant deltas from NUMA1 to NUMA2 impact performance for some applications

Domain       Latency (ns)
NUMA1        90
NUMA2        141
Avg. Local2  128
NUMA3        234

1: AMD internal testing with DRAM page miss
2: 75% NUMA 2 + 25% NUMA 1 traffic mix
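The Avg. Local row can be reproduced from the stated traffic mix:

```python
# "Avg. Local" is the 75%/25% weighted mix of the NUMA2 and NUMA1
# latencies from the table (AMD internal testing, DRAM page miss).
numa1_ns, numa2_ns = 90, 141
avg_local_ns = 0.25 * numa1_ns + 0.75 * numa2_ns
print(f"Avg. Local ≈ {avg_local_ns:.0f} ns")  # matches the 128 ns in the table
```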
2nd Gen AMD EPYC™ Improved Memory Latency
• Central IOD enables a single NUMA domain per socket: CCD0-CCD7 and IO0-IO3 with memory channels MA/MB/MC/MD/ME/MF/MG/MH interleaved; 1.46GHz FCLK / DDR2933 (coupled)1
• Improved average memory latency1 by 24ns (19%)2
• Minimum (local) latency only increases 4ns with chiplet architecture
• Latency by hop1: (1) local 94ns, (2) ~97ns, (3) ~104ns, (4) ~114ns; measured average ~104ns
• Repeater: 1 FCLK (1.46GHz); switch: 2 FCLK (1.46GHz) (low-load bypass, best-case)
[Figure: 2nd Gen EPYC floorplan – eight CCDs around the central IOD with UMC0-UMC7, G0-G3 PCIe/xGMI, P0-P3 PCIe, and S-Links]
1: AMD internal testing with DRAM page miss
2: EPYC 7002 Series NUMA1 vs. EPYC 7001 Series Avg. Local; EPYC 7002 Series NUMA2 vs. EPYC 7001 Series NUMA3
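The headline improvement can be reproduced from the two latency tables: the prior generation's 128ns Avg. Local against the ~104ns measured average of the central-IOD design.

```python
# Headline 24ns (19%) improvement: EPYC 7001 Avg. Local vs. the
# 2nd Gen EPYC measured average (both from this deck's tables).
prev_avg_ns = 128   # EPYC 7001 Avg. Local (previous slide)
new_avg_ns = 104    # 2nd Gen EPYC measured average, ~104ns
delta_ns = prev_avg_ns - new_avg_ns
improvement = delta_ns / prev_avg_ns
print(f"{delta_ns} ns ({improvement:.0%})")
```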
2nd Gen AMD EPYC™ Chiplet Performance vs. Cost
• Higher core counts and performance than possible with a monolithic design
• Lower costs at all core count / performance points in the product line
• Cost scales down with performance by depopulating chiplets
• 14nm technology for IOD reduces the fixed cost
[Figure: normalized die cost at 64/48/32/24/16 cores – chiplet 7nm + 14nm vs. hypothetical monolithic 7nm]
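A toy Poisson-yield cost model illustrates why several small CCDs beat one large die on 7nm silicon cost. The defect density and the monolithic die area are hypothetical stand-ins, not AMD's cost data.

```python
import math

# Toy die-cost comparison under a Poisson yield model:
#   yield = exp(-D0 * area),  cost per good die ~ area / yield.
# D0 and the monolithic area are HYPOTHETICAL illustration values.
D0 = 0.2  # defects per cm^2 (assumed)

def cost_per_good_die(area_mm2):
    """Relative silicon cost, proportional to area divided by yield."""
    area_cm2 = area_mm2 / 100.0
    die_yield = math.exp(-D0 * area_cm2)
    return area_mm2 / die_yield

ccd_mm2 = 74.0                                 # CCD area per the deck
chiplet_cost = 8 * cost_per_good_die(ccd_mm2)  # eight small 7nm dies
mono_cost = cost_per_good_die(8 * ccd_mm2)     # one big hypothetical die
print(f"chiplet/monolithic 7nm silicon cost ratio: {chiplet_cost / mono_cost:.2f}")
```

Because yield falls exponentially with area, the eight small dies cost a fraction of one equal-area monolithic die, and depopulating chiplets then scales cost down linearly with core count.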

3rd Gen AMD Ryzen™ Processor Chiplet Performance vs. Cost
• Similar cost savings and scalability for desktop
• Re-using the client IO die for the X570 Chipset expander enables optional additional connectivity for higher-end systems
  • PCIe, SATA, USB
[Figure: normalized die cost at 16 and 8 cores – chiplet 7nm + 14nm vs. hypothetical monolithic 7nm]

Performance Results
Chiplet architecture enables leadership performance and
performance/Watt in server and desktop markets
Metric at 105W TDP1           | Ryzen 2700X (8C) | Ryzen 3950X (16C) | Improvement (%)
Cinebench r15 1T              | 177              | 216               | 22%
Cinebench r20 1T              | 434              | 527               | 21%
Cinebench r15 NT              | 1802             | 3928              | 118%
Cinebench r20 NT              | 4020             | 8862              | 120%
1T Fmax (Max Boost, GHz)      | 4.3              | 4.7               | 9%
NT Base Freq (All-core, GHz)1 | 3.9              | 3.95              | 1%

Metric                        | EPYC 7601 (32C 2P, 180W TDP) | EPYC 7742 (64C 2P, 225W TDP) | Improvement (%)
SPECrate®2017_int_base2       | 272                          | 663                          | 144%
SPECrate®2017_fp_base2        | 259                          | 511                          | 97%
NT Base Freq (GHz)            | 2.2                          | 2.5                          | 14%

1: Testing as of 12/13/2019 by AMD Performance Labs using a Ryzen 9 3950X with 16 cores
vs. a Ryzen 7 2700X with 8 cores in the Cinebench R20 1T benchmark test.
Results may vary. RZ3-102
2: Results obtained from the SPEC® website as of Jan 3, 2020.
EPYC 7601 SPECrate®2017_int_base: https://www.spec.org/cpu2017/results/res2017q4/cpu2017-20171114-00833.html
EPYC 7601 SPECrate®2017_fp_base: https://www.spec.org/cpu2017/results/res2017q4/cpu2017-20171114-00845.html
EPYC 7742 SPECrate®2017_int_base: https://www.spec.org/cpu2017/results/res2019q4/cpu2017-20191028-19261.html
EPYC 7742 SPECrate®2017_fp_base: https://www.spec.org/cpu2017/results/res2019q4/cpu2017-20191028-19237.html
More information about SPEC CPU® 2017 can be obtained from https://www.spec.org/cpu2017. SPEC®, SPEC CPU® and SPECrate® are registered trademarks of the Standard Performance Evaluation Corporation.
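The improvement columns above follow directly from the raw scores; a quick arithmetic check (values copied from the tables):

```python
# Improvement = new / old - 1, using the raw scores from the slide:
scores = {
    "Cinebench r15 1T": (177, 216),
    "Cinebench r20 1T": (434, 527),
    "SPECrate2017_int_base": (272, 663),
    "SPECrate2017_fp_base": (259, 511),
}
for name, (old, new) in scores.items():
    print(f"{name}: +{round(100 * (new / old - 1))}%")
```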
© 2020 IEEE
International Solid-State Circuits Conference 2.2: AMD Chiplet Architecture for High-Performance Server and Desktop Products 24 of 27
Summary
• Chiplet architecture has proven key to achieving leadership performance,
  performance/$, and performance/Watt across multiple market segments

• Many significant innovations were required:
  • Package + silicon co-design for optimizing complex routes and the
    heterogeneous-technology chiplet die
  • Package-level fabric and interconnect architecture
  • Power delivery and voltage adaptation

[Figure: server package (Zen2 core + L3 chiplets around the IO die, IFOP links,
4 x16 PCIe/IFIP, 8x DDR) and client package (2 Zen2 core chiplets + IO die,
2x DDR, PCIe)]
© 2020 IEEE
International Solid-State Circuits Conference 2.2: AMD Chiplet Architecture for High-Performance Server and Desktop Products 25 of 27
Acknowledgment
We would like to thank our talented AMD design teams across Austin, Bangalore,
Boston, Fort Collins, Hyderabad, Markham, Santa Clara, and Shanghai.

© 2020 IEEE
International Solid-State Circuits Conference 2.2: AMD Chiplet Architecture for High-Performance Server and Desktop Products 26 of 27
Disclaimer and Endnotes

DISCLAIMER
The information contained herein is for informational purposes only, and is subject to change without notice. While every
precaution has been taken in the preparation of this document, it may contain technical inaccuracies, omissions and
typographical errors, and AMD is under no obligation to update or otherwise correct this information. Advanced Micro
Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the contents of this
document, and assumes no liability of any kind, including the implied warranties of noninfringement, merchantability or
fitness for particular purposes, with respect to the operation or use of AMD hardware, software or other products described
herein. No license, including implied or arising by estoppel, to any intellectual property rights is granted by this
document. Terms and limitations applicable to the purchase or use of AMD’s products are as set forth in a signed
agreement between the parties or in AMD's Standard Terms and Conditions of Sale. GD-18

©2020 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, EPYC, RYZEN, Threadripper, Infinity
Fabric, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this
publication are for identification purposes only and may be trademarks of their respective companies.

© 2020 IEEE
International Solid-State Circuits Conference 2.2: AMD Chiplet Architecture for High-Performance Server and Desktop Products 27 of 27
A 220GOPS 96-core Processor with 6 Chiplets
3D-stacked on an Active Interposer Offering
0.6ns/mm Latency, 3TBit/s/mm2 inter-Chiplet Interconnects
and 156mW/mm2@82% Peak-Efficiency DC-DC Converters
Pascal Vivet¹, Eric Guthmuller¹, Yvain Thonnart¹, Gaël Pillonnet2, Guillaume Moritz2,
Ivan Miro-Panades¹, César Fuguet¹, Jean Durupt¹, Christian Bernard¹, Didier Varreau¹, Julian Pontes¹,
Sébastien Thuriès¹, David Coriat1, Michel Harrand¹, Denis Dutoit¹, Didier Lattard¹, Lucile Arnaud2,
Jean Charbonnier2, Perceval Coudrain2, Arnaud Garnier2, Frédéric Berger2, Alain Gueugnot2,
Alain Greiner3, Quentin Meunier3, Alexis Farcy4, Alexandre Arriordaz5, Séverine Cheramy2, Fabien Clermidy¹
pascal.vivet@cea.fr
¹Univ. Grenoble Alpes, CEA, LIST; 2Univ. Grenoble Alpes, CEA, LETI; 3Sorbonne Université, LIP6;
4STMicroelectronics; 5Mentor, A Siemens Business

This work was partly funded by the French National Program


Programme d’Investissements d’Avenir IRT Nanoelec under
Grant ANR-10-AIRT-05

© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 1 of 34
High Performance Computing & Big Data
• More cores + more accelerators + more memory
  – Similar constraints are appearing for embedded HPC (automotive, etc.)
  – Need both highly optimized generic and specialized functions
    (e.g. ML/AI accelerators)
  – Need a “go-to-market” solution for sustainable system differentiation

• System designers must offer:
  – Modular and cost-effective solutions
  – Energy efficiency of the system infrastructure
  – More on-chip memory bandwidth per core

→ With advanced CMOS issues, a “single die” solution is no longer viable
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 2 of 34
Chiplet Partitioning
• Chiplet motivations
– Cost driven
– Modularity driven using 3D technologies
– Heterogeneous integration

• Chiplet challenges?
  – Eco-system maturity
  – Technology & architecture partitioning
  – Chiplet interfaces, testability, 3D CAD flow, etc.

[D. Dutoit, Keynote, 3DIC’2014]

© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 3 of 34
Chiplet Partitioning : Solutions and Limitations
• Existing technologies
  – Organic substrates: AMD, 4-chiplet circuit, ISSCC’2018
  – Passive interposer (2.5D): TSMC, CoWoS, VLSI’2019
  – Silicon bridges: INTEL, EMIB bridge, ISSCC’2017

• But, some limitations
  – Chiplet communication limited to side-by-side communication, not scalable
  – How to integrate heterogeneous chiplets & differentiating functions?
  – How to integrate less-scalable functions (IOs, analog, power management)?
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 4 of 34
Active Interposer: Principle
[Figure: chiplets (clusters of cores) 3D-stacked on an active interposer]
• Scalable & distributed NoCs: any chiplet-to-chiplet traffic
• Power management: close to the cores
• SoC infrastructure: analog, IOs, PHY, DFT
• Additional features
→ Mature CMOS technology (with low logic density to preserve system cost)
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 5 of 34
Outline
• Introduction
– Chiplet-partitioning and Active Interposer concept
• Circuit architecture
– Circuit overview
– Chiplet overview
• Active interposer design details
– System Interconnects
– Power Management
• Circuit results & performances
• Conclusions & Perspectives

© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 6 of 34
6 Chiplets 3D-stacked on an Active Interposer
• Chiplet overview
  – 4 clusters of 4 cores
  – Distributed L1$ + L2$ + L3$
  – Scalable cache coherency
• Active interposer
  – Distributed flexible interconnects: NoCs (routers & pipelined links)
  – Integrated SCVRs (1 per chiplet)
  – Memory controller & system IOs
  – SoC infrastructure (clk, rst, config, test), DFT

[Figure: cross-section; chiplets (16 cores each: clusters 0-3 + L3 + SoC
infrastructure + 3D Plugs) mounted via µ-bumps (Ø10µm) on the active interposer
(distributed NoCs with routers & pipelined links, power management, memory-IO),
which sits on the package substrate via C4 bumps (Ø90µm) and balls (Ø500µm);
supplies: 1.5-2.5V VDD-chiplet, 1.2V VDD-interposer; off-chip links]
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 8 of 34
6 Chiplets 3D-stacked on an Active Interposer
• Chiplets (FDSOI 28nm)
  – 4 clusters of 4 cores each, distributed L1$ + L2$ + L3$,
    scalable cache coherency
• Active interposer (CMOS 65nm)
  – Distributed flexible interconnects, integrated SCVRs (1 per chiplet),
    memory controller & system IOs, SoC infrastructure, DFT
→ 96 cores: 6 chiplets 3D-stacked on an active CMOS interposer
→ 2 technology nodes of difference between the chiplets and the bottom die

[Die photo: the 6 chiplets on the active interposer]
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 9 of 34
Chiplet Main Features
• 16 x MIPS® 32-bit scalar cores
• Memory is physically distributed through the chiplet L2 caches,
  with virtual memory support
  – L1 I-caches + D-caches (16 kB / core)
  – Distributed shared L2 caches (256 kB / cluster)
  – Adaptive & fault-tolerant L3 caches (4 tiles of 1 MB)
• Directory-based cache coherence with linked-list directory [5]
• 2D-mesh NoCs (L1-L2, L2-L3, L3-ExtMem), extended through the active interposer
• FDSOI 28nm, LPLV, [0.5-1.3V], with body biasing
  – FLLs, timing-fault sensors, thermal sensors

[5] E. Guthmuller et al., “A 29 Gops/Watt 3D-Ready 16-Core Computing Fabric with
Scalable Cache Coherent Architecture Using Distributed L2 and Adaptive L3 Caches”,
ESSCIRC’2018.
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 10 of 34
Outline
• Introduction
– Chiplet-partitioning and Active Interposer concept
• Circuit architecture
– Circuit overview
– Chiplet overview
• Active interposer design details
– System Interconnects
– Power Management
• Circuit results & performances
• Conclusions & Perspectives

© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 11 of 34
System Level Interconnects
[Figure: two chiplets with 3D Plugs over the active interposer; three link classes:
L1-L2 (short reach, passive, to the next chiplet), L2-L3 (long reach, asynchronous,
active, with routers), and L3-Ext-Mem (synchronous, active, with routers) down to
the memory-IO controller]

• Distributed & flexible interconnects within the active interposer
  – Multiple Networks-on-Chip (routers + links)
  – 3D-Plug communication IPs, synchronous & asynchronous versions
• Chiplet-to-chiplet communication schemes
  – Passive links, short reach (L1-L2)
  – Active links, long reach (L2-L3, L3-ExtMem)
→ allows chiplet-to-any-chiplet scalable traffic
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 12 of 34
3D-Plug Communication IP: synchronous version
[Figure: 3D-Plug TX/RX pair; NoC virtual-channel controller with VCid, credit
return, source-synchronous CLK_TX/CLK_RX with phase adjustment]

• Chiplet-to-chiplet communication
  – NoC virtualization
  – High throughput
  – Low latency
• Circuit design
  – Credit-based multi-channel synchronization → fully digital design
  – Source-synchronous scheme, with delay compensation → full-swing logic, no DLL
  – Integrates: µ-bumps + µ-buffers + DFT (boundary scan)
→ 3D fine-pitch parallel interface
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 13 of 34
3D-Plug Communication IP: layout overview
[Figure: chiplet layout with 3D-Plug interfaces; µ-buffer std-cells and µ-bumps
at 20µm pitch; the µ-buffer std-cell integrates a bidirectional driver + ESD +
pull-up + level-shifter]

The 3D-Plug integrates:
• Logic interface
• µ-bumps
• µ-buffer std-cells
• DFT
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 14 of 34
3D-Plug Communication IP: sync. version performance

                        | This work                | [2] VLSI'19         | Units
Technology              | 28nm FDSOI chiplets,     | 7nm FinFET chiplet  |
                        | 65nm active interposer   |                     |
3D link type            | Active (face-to-face)    | 3D LIPINCON™,       |
                        |                          | passive (CoWoS™)    |
Die-to-die bump pitch   | 20                       | 40                  | µm
Voltage swing           | 1.2                      | 0.3                 | V
Data rate               | 1.21                     | 8                   | Gb/s/pin
Power efficiency        | 0.59                     | 0.56                | pJ/bit
Bandwidth density       | 3.0                      | 1.6                 | Tb/s/mm²

• Performance: 1.2 Gb/s/pin, 0.59 pJ/bit, 3.0 Tb/s/mm²
→ 2x better bandwidth density than SoA

[2] Mu-Shan Lin et al., “A 7nm 4GHz Arm®-core-based CoWoS® Chiplet Design for
High Performance Computing”, Symposium on VLSI Circuits, June 2019.
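The 3.0 Tb/s/mm² bandwidth density is consistent with the bump pitch and per-pin data rate. A sketch, assuming a full 2D array of signal µ-bumps at the 20µm pitch (ignoring any power/ground bumps, so this is an upper-bound check):

```python
pitch_um = 20           # die-to-die µ-bump pitch from the table
data_rate_gbps = 1.21   # per-pin data rate from the table
bumps_per_mm2 = (1000 / pitch_um) ** 2               # 50 x 50 = 2500 bumps/mm²
density_tbps_mm2 = bumps_per_mm2 * data_rate_gbps / 1000
print(f"{density_tbps_mm2:.3f} Tb/s/mm²")            # matches the 3.0 in the table
```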

© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 15 of 34
System Level Interconnects: L1-L2
[Figure: L1-L2 path from chiplet 00 through chiplet 11 to chiplet 12, over M3-M5
passive interposer routing with clock shielding; sync 3D-Plugs, sync routers,
8 FIFOs, 1.0V; measured hops: 1.5 mm at 7.2 ns / 0.75 pJ/bit, 5 mm at
8 ns / 0.7 pJ/bit]

                    | nearest          | farthest        | Units
Interposer          | 1 passive link   | 3 passive links |
3D-Plug frequency   | 1.25             | 1.25            | GHz
2D NoC frequency    | -                | 1.00            | GHz
End-to-end latency  | 2x4+[0-1]        | 44              | cycles
                    | 7.2              | 44.0            | ns
Propagation speed   | 4.8              | 2.9             | ns/mm
Energy / bit / mm   | 0.29             | 0.15            | pJ/bit/mm

• L1-L2 interconnect
  – 3D-Plug sync. version + passive links
  – Synchronous NoC routers (within the chiplets)
  – Global clocking + clock gating
• Performance
  – 3D-Plug interface throughput: 1.25 GHz
  – SNoC local throughput: 1 GHz
  – Large end-to-end latency: 44 ns (44 cycles), due to re-timing and
    re-synchronization
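Propagation speed in the table is end-to-end latency divided by path length. A sketch: the 1.5 mm figure is from the slide, while the ~15 mm farthest distance is an assumption inferred back from the 2.9 ns/mm entry.

```python
# Propagation speed = end-to-end latency / distance:
paths = {"nearest (1.5 mm)": (7.2, 1.5), "farthest (~15 mm, assumed)": (44.0, 15.0)}
for name, (latency_ns, dist_mm) in paths.items():
    print(f"{name}: {latency_ns / dist_mm:.1f} ns/mm")
```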
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 16 of 34
3D-Plug Communication IP: asynchronous version
• Asynchronous logic
  – Quasi-Delay-Insensitive (QDI) logic
  – Uses 1-of-4 data encoding
  – Deep pipelining, achieving low latency
• Circuit implementation
  – 4-phase protocol for on-die communication (active interposer)
  – 2-phase protocol for off-die communication (3D-Plug interface)
  – 4-phase ↔ 2-phase protocol converters

[Figure: 1-of-4 asynchronous pipeline stage built from C-element gates;
2-phase and 4-phase handshake waveforms]

→ Robust asynchronous design
→ No clocking at the 3D interface [6]
→ 2-phase protocol reduces the penalty of 3D-interface delays

[6] P. Vivet et al., “A 4x4x2 Homogeneous Scalable 3D Network-on-Chip Circuit
with 326 MFlit/s 0.66 pJ/bit Robust and Fault Tolerant Asynchronous 3D Links”,
ISSCC’2016.
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 17 of 34
System Level Interconnects: L2-L3

                    | 4-phase       | 2-phase       | Units
Interposer          | Active async. | Active async. |
3D-Plug frequency   | 0.30          | 0.52          | GHz
2D NoC frequency    | 0.97          | 0.97          | GHz
End-to-end latency  | 4 + async.    | 4 + async.    | cycles
                    | 15.2          | 15.2          | ns
Propagation speed   | 0.6           | 0.6           | ns/mm
Energy / bit / mm   | 0.52          | 0.52          | pJ/bit/mm

• L2-L3 interconnect
  – 3D-Plug async. version + vertical connection
  – Asynchronous NoC routers
  – Pipelined links (1 pipe stage every 500 µm)
• Performance
  – 3D-Plug interface throughput: 520 MHz
    (2-phase is 1.7x better than the 4-phase version)
  – ANoC local throughput: 970 MHz
  – Overall best end-to-end latency: 15.2 ns
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 18 of 34
System Level Interconnects: L3 - EXT-MEM
[Figure: memory + IO controller in the interposer, with an LVDS PHY for
off-chip traffic]

                    | L3-EXT-MEM   | Units
Interposer          | Active sync. |
3D-Plug frequency   | 1.21         | GHz
2D NoC frequency    | 0.75         | GHz
End-to-end latency  | 37           | cycles
                    | 49.5         | ns
Propagation speed   | 2.0          | ns/mm
Energy / bit / mm   | 0.24         | pJ/bit/mm

• L3 - EXT-MEM interconnect
  – 3D-Plug sync. version + vertical connection
  – Synchronous NoC routers
  – Pipelined links (1 pipe stage every 1000 µm)
  – Off-chip communication: 4x32-bit LVDS @ 600 Mb/s
• Performance
  – 3D-Plug interface throughput: 1.21 GHz
  – SNoC throughput: 750 MHz
  – Large end-to-end latency: 49.5 ns
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 19 of 34
System Level Interconnects: Comparison

                     | L1-L2          | L2-L3          | L3-EXT-MEM    | Units
Link type            | Passive, sync. | Active, async. | Active, sync. |
3D-Plug frequency    | 1.25           | 0.52           | 1.21          | GHz
2D NoC frequency     | 1.00           | 0.97           | 0.75          | GHz
End-to-end latency*  | 44             | 4 + async.     | 37            | cycles
                     | 44.0           | 15.2           | 49.5          | ns
Propagation speed    | 2.9            | 0.6            | 2.0           | ns/mm
Energy / bit / mm    | 0.15           | 0.52           | 0.24          | pJ/bit/mm
* A => B end-to-end latency

• 3D-Plug: best throughput for the synchronous version (1.25 GHz)
• Interposer: similar throughput between SNoC & ANoC (~1 GHz);
  best latency for ANoC, 0.6 ns/mm (3-5x wrt. SNoC): latency reduction for
  cache-coherency traffic, at the cost of energy
→ Combination of interconnect types to achieve performance trade-offs
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 20 of 34
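The cycle and nanosecond latency rows of the comparison are mutually consistent with the NoC frequencies; a quick check for the two synchronous links (values from the table):

```python
# latency_ns should be close to latency_cycles / NoC_frequency_GHz:
links = {"L1-L2": (44, 1.00, 44.0), "L3-EXT-MEM": (37, 0.75, 49.5)}
for name, (cycles, f_ghz, table_ns) in links.items():
    print(f"{name}: {cycles / f_ghz:.1f} ns computed vs. {table_ns} ns in the table")
```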
Outline
• Introduction
– Chiplet-partitioning and Active Interposer concept
• Circuit architecture
– Circuit overview
– Chiplet overview
• Active interposer design details
– System Interconnects
– Power Management
• Circuit results & performances
• Conclusions & Perspectives

© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 21 of 34
Switched Cap Voltage Regulators: Principle
• Distributed power supply units
  – Local DVFS scheme, below each chiplet
  – Fast transitions & reduced IR-drop effects
  – “High” input voltage (up to 2.5V) reduces the number of power/ground IOs
    in the package
• Fully integrated
  – No external passive components; thick-oxide transistors
  – On-chip caps only (MOS+MOM+MIM, 8.9 nF/mm²)
  – 50% of the chiplet area, fault tolerant, in the interposer
  – P/G delivery as a µ-bump flip-chip matrix

[Figure: cross-section; VIN enters through TSVs into the DC-DC converter in the
interposer, P/G to the chiplet (VOUT) through µ-bumps]
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 22 of 34
Switched Cap Voltage Regulators: Circuit Design
• Circuit design
  – 3-stage gear box, 7 voltage ratios
  – VIN [1.8V - 2.5V]; VOUT [0.35V - 1.3V]
  – Tile-based layout in a checkerboard pattern: replicated unit cell at
    C4-bump pitch, 270 cells
  – Central clock frequency, feedback controller
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 23 of 34
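The 82% peak efficiency reported later for the 2:1 ratio sits below the intrinsic bound of an ideal switched-capacitor stage. A sketch, where the 1.1V operating point is an assumption for illustration:

```python
def sc_intrinsic_efficiency(vin, vout, ratio):
    """Upper bound on switched-capacitor converter efficiency: an ideal N:M
    stage behaves as a lossless transformer plus an effective series
    resistance, so efficiency cannot exceed vout / (vin * ratio)."""
    return vout / (vin * ratio)

# 2:1 gear from VIN = 2.5V, regulating down to an assumed VOUT = 1.1V:
eta_bound = sc_intrinsic_efficiency(2.5, 1.1, 0.5)
print(f"intrinsic bound: {eta_bound:.0%}")
```

Switching and control losses explain the gap between this bound and the measured peak.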
Switched Cap Voltage Regulators: Circuit Results
• Power conversion efficiency
  – 156 mW/mm² power density @ 82% peak efficiency (2:1 ratio)
  – Better efficiency than an integrated LDO

[Plot: measured power-conversion efficiency]
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 24 of 34
Outline
• Introduction
– Chiplet-partitioning and Active Interposer concept
• Circuit architecture
– Circuit overview
– Chiplet overview
• Active interposer design details
– System Interconnects
– Power Management
• Circuit results & performances
• Conclusions & Perspectives

© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 25 of 34
Circuit Overview
• Die technologies
  – Chiplet: FDSOI 28nm, ULV + body bias, 22mm²
  – Active interposer: CMOS 65nm, MIM option, 200mm²
• 3D technology integration
  – µ-bumps, 20µm pitch (150k)
  – TSV middle, 40µm pitch
  – Face-to-face assembly on the package substrate
  – 6 chiplets

[Photos: 3D cross-section, chiplet front-face, active interposer front-face,
3D integration and final package]
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 26 of 34
Circuit Performance
• Main performance
  – Frequency range: 130 MHz @ 0.5V to 1.15 GHz @ 1.1V, with FDSOI back-bias
  – Peak performance: 220 GOPS for all 96 cores @ 1.15 GHz
  – Best energy efficiency: 9.6 GOPS/W (CoreMark) @ 246 MHz @ 0.6V
• Power consumption breakdown
  – Cores+L1: ~50% of the power per chiplet
  – Interposer logic & interconnect (w/o IOs): only 3% of the overall budget
  – SCVR: 17% of the overall power budget

[Charts: per-chiplet power breakdown (Cores+L1 55%, Clks 21%, L2 12%, L1-L2 5%,
Misc 2%, L3 1%) and the split across chiplets 1-5]
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 27 of 34
Circuit Performance: SCVR efficiency
• Switched-Cap Voltage Regulator (SCVR)
  – SCVR configured at the best ratio according to the chiplet voltage:
    3:1 → 2:1 → 3:2 @ fixed VIN = 2.5V
• SCVR versus an integrated LDO?
  – An LDO at the same VIN = 2.5V would draw roughly twice the input power:
    the SCVR runs at 0.45x-0.5x the LDO's consumption across the range
  – Higher VIN + increased conversion efficiency reduces the power pin count
→ The fully integrated SCVR enables high efficiency along the full voltage range

[Plot: efficiency vs. chiplet voltage for the SCVR ratios vs. an LDO at
VIN = 2.5V]
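The ~0.5x power figure can be reproduced with a first-order model. This is illustrative only: an ideal LDO draws the load current straight from VIN, the SCVR is taken at its reported 82% efficiency, and the 1.1V output is an assumed operating point.

```python
# Input power per watt of load power:
vin, vout, eta_scvr = 2.5, 1.1, 0.82
p_in_ldo = vin / vout      # ideal LDO: P_in = P_load * VIN / VOUT
p_in_scvr = 1 / eta_scvr   # SCVR: P_in = P_load / efficiency
print(f"SCVR/LDO input-power ratio: {p_in_scvr / p_in_ldo:.2f}")
```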
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 28 of 34
Circuit Performance: Scalability
• Memory hierarchy and system-level interconnect study
  – Execution of a 4Mpixel filtering application (including convolution,
    transposition, and synchronization with barriers); the dataset fits
    in the L3$
  – Scalability study of the 96-core circuit:
    acceleration ratio of 67x for 96 cores
  – Scalability of a 512-core circuit (HW emulation):
    acceleration ratio of 340x for 512 cores
→ Cache-coherency protocol + system-level interconnects sustain the traffic
and are scalable
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 29 of 34
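The acceleration ratios above translate into parallel efficiency with simple speedup-per-core arithmetic on the slide's numbers:

```python
# Parallel efficiency = speedup / core count:
runs = {"96 cores (silicon)": (67, 96), "512 cores (HW emulation)": (340, 512)}
for name, (speedup, n_cores) in runs.items():
    print(f"{name}: {speedup / n_cores:.0%} parallel efficiency")
```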
Comparison with State-of-the-Art

                          | This work               | [4] ISSCC'18 INTEL  | [1] ISSCC'18 AMD | [2] VLSI'19 TSMC | [3] ISSCC'17 INTEL | Units
Chiplet technology        | FDSOI 28nm              | FinFET 14nm         | FinFET 14nm      | FinFET 7nm       | FinFET 14nm        |
Interposer technology     | Active, CMOS 65nm       | no                  | MCM substrate    | Passive CoWoS®   | EMIB bridge        |
Interposer extra features | yes                     | N/A                 | no               | no               | no                 |
Total system yield        | High, using an active   | N/A                 | high             | high             | high               |
                          | interposer in a mature  |                     |                  |                  |                    |
                          | technology with low     |                     |                  |                  |                    |
                          | transistor count        |                     |                  |                  |                    |
Die-to-die µbump pitch    | 20                      | N/A                 | > 100            | 40               | 55                 | µm
Voltage regulator type    | 1 SCVR per chiplet,     | on-chip LDO per     | no               | no               | no                 |
                          | integrated in the       | core, distributed,  |                  |                  |                    |
                          | interposer              | with MIM            |                  |                  |                    |
                          | (MOS+MOM+MIM)           |                     |                  |                  |                    |
VR area                   | 34% of active           | MIM above 40%       | -                | N/A              | N/A                |
                          | interposer              | of core area        |                  |                  |                    |
VR peak efficiency        | 82%                     | 72%, LDO limited    | -                | N/A              | N/A                |

→ First active interposer, with fully integrated SCVRs, up to 82% efficiency
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 30 of 34
Comparison with State-of-the-Art (continued)

                          | This work               | [4] ISSCC'18 INTEL | [1] ISSCC'18 AMD | [2] VLSI'19 TSMC | [3] ISSCC'17 INTEL | Units
Interconnect type         | Distributed NoC meshes  | N/A                | Scalable Data    | LIPINCON™ links  | AIB interconnect   |
                          | for scalable chip-to-   |                    | Fabric (SDF)     |                  |                    |
                          | chip cache-coherency    |                    |                  |                  |                    |
                          | traffic                 |                    |                  |                  |                    |
3D Plug power efficiency  | 0.59                    | N/A                | 2.0              | 0.56             | 1.2                | pJ/bit
BW density                | 3.0                     | N/A                | -                | 1.6              | 1.5                | Tb/s/mm²
Aggregate 3D bandwidth    | 527                     | N/A                | -                | 640              | 504                | GByte/s
Number of chiplets        | 6                       | 1                  | 1-4              | 2                | 1 FPGA fabric +    |
                          |                         |                    |                  |                  | 6 transceivers     |
Number of cores           | 96                      | 18                 | 8-32             | 8                | FPGA fabric        |
Max frequency             | 1.15                    | 0.4                | 4.1              | 4                | 1                  | GHz
GOPS (32b integer)        | 220 (peak mult./acc.)   | 14.4               | 131.2 - 524.8    | 128              | N/A                | Gop/s

→ First active interposer, with distributed NoC meshes and 3.0 Tb/s/mm²
interfaces, offering a total of 96 cores
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 31 of 34
Outline
• Introduction
– Chiplet-partitioning and Active Interposer concept
• Circuit architecture
– Circuit overview
– Chiplet overview
• Active interposer design details
– System Interconnects
– Power Management
• Circuit results & performances
• Conclusions & Perspectives

© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 32 of 34
Conclusions and Perspectives
• Active interposer & chiplet partitioning
  – Integration of interconnects, power management, and IOs
  – Scalable cache-coherency protocol
  – 3 Tbit/s/mm² 3D interface achieved
  – Low-latency (0.6 ns/mm) long-reach asynchronous interconnect
  – Power management @ 82% efficiency, close to the cores, without passives
→ Increases system energy efficiency and on-chip memory bandwidth per core

• Perspectives
  – Progressive setup of a chiplet eco-system
  – The active interposer as an enabler for differentiation: integrating
    heterogeneous functions & chiplets
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 33 of 34
• Acknowledgments
This work was partly funded by the French National Program
Programme d’Investissements d’Avenir IRT Nanoelec under
Grant ANR-10-AIRT-05

• Thank you for your attention

© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 34 of 34
A 7nm High-Performance and Energy-Efficient
Mobile Application Processor with Tri-Cluster
CPUs and a Sparsity-Aware NPU
Young Duk Kim,
Wookyeong Jeong, Lakkyung Jung, Dongsuk Shin,
Jae Geun Song, Jinook Song, Hyeokman Kwon, Jaeyoung Lee,
Jaesu Jung, Myungjin Kang, Jaehun Jeong, Yoonjoo Kwon,
Nak Hee Seong

Samsung Electronics, Hwaseong, Korea

© 2020 IEEE 2.4: A 7nm High-Performance and Energy-Efficient Mobile Application Processor with Tri-Cluster CPUs and a Sparsity-Aware NPU
International Solid-State Circuits Conference 1 of 26
Background
• Smartphone users want to
  – Run applications smoothly
  – Have a better gaming experience
  – Enhance the multimedia experience, including fancy cameras
  – Enjoy longer battery lifetime for an all-day experience

• The conclusion is “high performance” and “low power”

© 2020 IEEE 2.4: A 7nm High-Performance and Energy-Efficient Mobile Application Processor with Tri-Cluster CPUs and a Sparsity-Aware NPU
International Solid-State Circuits Conference 2 of 26
Outline
• 7nm power-efficient Exynos AP processor
• Tri-cluster CPUs
  – Power-efficient architecture
• NPU
  – Skipping zero-weight operations
• HWACG (HW Auto Clock Gating)
  – Clock power reduction in the idle state
• Droop detector
  – Reducing voltage droop
• 7nm process
  – Enhancing AC performance

[Die plot: Big CPUs, Middle/Little CPUs, GPU, NPU]
CPU Clusters
• Eight CPU cores with three different classes
• Two big cores (M4): 2.73GHz, two mid cores (CA75): 2.4GHz,
four little cores (CA55): 2.0GHz
• Heterogeneous Multi-Processor governed by Energy-aware scheduler

[Diagram: two M4 big cores (64KB L1I/L1D, 1MB private L2 each), two CA75
mid cores (64KB L1I/L1D, 256KB private L2 each), and four CA55 little cores
(32KB L1I/L1D each); 3MB and 1MB shared L3 caches; all clusters on the
coherent interconnect]

Big Custom CPU (1)
• Instruction front-end architecture
  • 6 micro-ops of bandwidth for decode, rename, dispatch, and retire
  • Improved branch prediction accuracy and latency
    • Neural-net-based main predictor
    • 128-entry uBTB, 4K-entry main BTB, 32K (16K*) branches in the L2 BTB
  • 228-entry ROB
[Diagram: branch predict (uBTB, main BTB, L2 BTB) → address queue → 64KB
I-cache → instruction queue → decode → rename → dispatch queue]
*M3 specification

Big Custom CPU (2)
• Integer and load/store execution pipes
  • Two simple ALUs + two complex ALUs
  • AGUs: 1 load + 1 load/store (1 store*) + 1 store
  • Improved memory latency through a direct path from the memory controller
  • 1MB (512KB*) private L2 cache per core
  • 3MB (4MB @ 4 cores*) shared L3 cache
  • 48 (32*)-entry DTLB, 512-entry BDTLB, 4K-entry L2 UTLB
[Diagram: integer schedulers (1 BR, 2 CALU, 2 ALU, 1 LD, 1 LD/ST, 1 ST,
1 ST-D) over the integer PRF; ALU/MUL/DIV/BR and AGU pipes feeding the 64KB
D-cache with BDTLB, DTLB/TAG, L2 UTLB, store queue, table walk, and prefetch]
*M3 specification

Big Custom CPU (3)
• Floating-point execution pipes
  • Three 128-bit floating-point pipes
  • 24 single-precision OPs per cycle
  • 2-cycle FADD, 3-cycle FMUL, 4-cycle FMAC latency
  • Two 128-bit-wide dot-product (Int8) units
[Diagram: floating-point scheduler and PRF feeding three pipes with
FMAC/FADD/FCVT/FDIV-SQRT/FST and NCRYPT/NALU/NSHUF/NSHIFT/NMUL units]

Big Custom CPU
• Samsung 4th-generation custom CPU
• Significantly improved memory-subsystem performance
[Chart: Geekbench v4 per-test relative architectural performance (score/GHz)
over M3 for M3, M4, and Cortex-A76; the higher the better; M4 peaks around
2.10–2.19x]
Single-Thread Performance
• Desktop-class single-thread performance
• Average 23% single-thread performance uplift from 3rd generation

Tri-cluster management (1)
• Allows seamless performance transitions via the middle CPU
• A single heavy task can be selectively assigned to the middle or big CPU

Tri-cluster management (2)
• In most user scenarios, the main workload can be covered by the middle
  CPU instead of the big CPU, reducing absolute power consumption.
• CPU total power comparison (big/Little @10nm vs. Tri-cluster @7nm)

Tri-cluster management (3)
• Allows various options for workload scheduling
• A single heavy task running on a little CPU can be selectively migrated
  to the middle or big CPU, depending on the demanded performance

[Chart: a task running on a little core, plotted as utilization vs. power;
demanded performance #1/#2/#3 compared against the max capacity of the
little, middle, and big clusters; option #1 migrates the task to the middle
cluster (min power at medium performance), option #2 to the big cluster]
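The option selection above can be sketched as a minimal rule: place the task on the lowest-power cluster whose max capacity meets the demanded performance. The capacity and power numbers below are invented for illustration, not measured values.

```python
# Hypothetical per-cluster limits (arbitrary units), ordered by rising power.
CLUSTERS = [
    ("little", 100, 1.0),   # (name, max capacity, power at max)
    ("middle", 180, 2.2),
    ("big",    260, 4.5),
]

def place_task(demand):
    """Pick the lowest-power cluster able to satisfy `demand`.

    Models options #1/#2: a heavy task on a little core migrates to the
    middle or big cluster only when the demanded performance exceeds the
    smaller cluster's max capacity.
    """
    for name, capacity, _power in CLUSTERS:
        if demand <= capacity:
            return name
    return CLUSTERS[-1][0]   # demand beyond all capacities: saturate on big
```

Because the middle cluster covers most demands, the big cluster is engaged only for the heaviest tasks, which is the power-saving argument on this slide.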
Tri-cluster management (4)
• Enhanced workload-scheduling method based on ISA (instruction set
  architecture) mode, in which 32-bit/64-bit energy efficiency is considered
• Each cluster's energy efficiency differs with ISA mode, so workloads
  should be scheduled against a different energy model
• The energy model is newly designed to account for 32-bit/64-bit energy
  efficiency
• CPU power improves by over 30% in specific scenarios such as 32-bit games

Tri-cluster management (5)
• Measurement results for the ISA-based scheduling method
• CPU total power comparison in the Lineage2 game
• The big CPU's tasks moved to the mid CPU, so total power decreased
[Chart: game power, normal scheduling vs. ISA-aware scheduling]
Sparsity-Aware Neural Processing Unit
• 1024 MACs always consume the incoming data every cycle
• Data-staging units dispatch the corresponding input feature maps and skip
  zero weights
• Activation-function units perform ReLU-family activation
• HW automatic clock gating (HWACG) is applied at module level
[Diagram: bus feeding two 512KB scratchpads; four data-staging units, each
under HWACG, driving dual-MAC arrays; activation-function units and
data-returning units on the output path]

Skipping Convolution: Moving OFM
• Input feature maps are buffered to march along the non-zero weight
  positions
[Diagram: a weight kernel whose non-zero weights are 1, 3, 5, 6, 7; on each
of cycles #1–#5, one non-zero weight is applied against the input feature
map to update the output feature map]
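The "moving OFM" scheme can be sketched in NumPy: a hypothetical `sparse_conv2d_valid` helper (illustrative, not the hardware interface) enumerates only the non-zero kernel positions, so the cycle count equals the number of surviving weights; with the slide's weights 1, 3, 5, 6, 7 that is five cycles.

```python
import numpy as np

def sparse_conv2d_valid(ifm, kernel):
    """Valid-mode correlation that visits only non-zero weights.

    Sketch of the NPU's data-staging idea: non-zero kernel positions are
    enumerated once, and each "cycle" broadcasts one weight against a
    shifted window of the input feature map (IFM), accumulating into the
    output feature map (OFM).
    """
    kh, kw = kernel.shape
    oh = ifm.shape[0] - kh + 1
    ow = ifm.shape[1] - kw + 1
    ofm = np.zeros((oh, ow))
    nonzero = [(r, c) for r in range(kh) for c in range(kw)
               if kernel[r, c] != 0]
    for r, c in nonzero:                      # one non-zero weight per cycle
        ofm += kernel[r, c] * ifm[r:r + oh, c:c + ow]
    return ofm, len(nonzero)                  # cycles == non-zero weights
```

A heavily pruned kernel therefore needs proportionally fewer MAC cycles than a dense sweep, which is the FPS and efficiency gain shown on the next slide.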
Performance and Energy Efficiency
[Chart: normalized FPS (%) running Inception V3 at 5% vs. 80% weight
pruning rate]
[Chart: normalized energy efficiency (%) running Inception V3 at 5% vs.
80% weight pruning rate]
HWACG (HW Automatic Clock Gating)
• Reduces clock-tree power from the PLL to the IPs in the idle state
• Q-channel interface protocol between each IP and the HWACG controller
• Hierarchical architecture composed of parents and children
• If all the IPs attached to a clock-gating cell are idle, that clock is
  gated
[Diagram: PLL0/PLL1 → MUX → clock dividers and gating cells; each IP (IP0,
IP1, IP2) connects through a Q-channel interface; a gate closes when its own
IP is idle, and the PLL-side gate closes when all IPs below it are idle]
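The parent/child gating rule can be modelled as a small tree walk: a gating cell forwards the clock only while at least one IP beneath it is active. `ClockNode` and its methods are invented names for illustration, not the actual Q-channel protocol.

```python
class ClockNode:
    """Toy model of one hierarchical HWACG gating cell."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []   # lower-level gating cells
        self.ips = {}                    # ip_name -> active flag (Q-channel)

    def set_ip(self, ip, active):
        """Record an IP's activity as reported over its Q-channel."""
        self.ips[ip] = active

    def clock_running(self):
        """Clock runs if any local IP is active or any child still clocks."""
        return any(self.ips.values()) or any(
            c.clock_running() for c in self.children)
```

When every IP under the root reports idle, the PLL-side gate can close as well, which is the hierarchical saving the slide describes.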
HWACG
• HWACG differs from S/W-directed clock gating.
• Ideally the clock tree follows each IP's actual clock usage, and the H/W
  responds to it automatically.
• The clock can also be gated for CMU and bus components.
• S/W-assisted operation, where part or all of the gating behavior is
  directed by software, was also adopted.
• The measured power gain is as follows (-D: HWACG disabled, -E: HWACG
  enabled)

HWACG – EWS (Early Wake-up System)
• The early wake-up system reduces cumulative wake-up latency.
• Latency issues can appear especially when a multi-layered bus uses
  different PLL sources.
• A clock request from a latency-critical IP is delivered to multiple
  target domains to wake up multiple IPs at once instead of a sequential
  wakeup process.
[Diagram: the EWR (Early Wakeup Router) in BLK_CMU broadcasts
EARLY_WAKEUP__MAST_# to the EWGs (Early Wakeup Generators) in the CMUs of
BLK_#1..#n, which return ACTIVE__CMU_# for masters #1..#n]
Voltage Droop Mitigation (1)
• Voltage droop mitigation solution
• Droop detector (DD) monitors voltage droop in the target domain:
  (1) when a voltage droop falls below the threshold value,
  (2) the droop-detected flag is asserted, and
  (3) the CMU then halves the clock to the IP to reduce load current.

Voltage Droop Mitigation (2)
• Voltage droop mitigation solution
• One sensor in the GPU
• Calibration is done for each DVFS level.
• Vmin improves by 12.5mV.

Voltage Droop Detector
• Ring-oscillator-type droop detector
• Measures the voltage level through the change in the RO's speed
• It counts RO clocks within a programmable time window.
• When the counter value is smaller than the programmable threshold, the
  droop-detected flag is asserted
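The count-and-compare rule can be sketched in a few lines; the parameter names and values below are illustrative, not the programmable register interface.

```python
def droop_detected(ro_freq_hz, window_s, threshold_count):
    """Ring-oscillator droop detector sketch.

    The RO slows down as the supply droops, so fewer RO edges land inside
    the programmable time window; the droop-detected flag asserts when the
    count falls below the programmable threshold.
    """
    count = round(ro_freq_hz * window_s)   # RO clocks counted in the window
    return count < threshold_count
```

For a nominal 1GHz RO and a 100ns window the count is 100; a droop that slows the RO to 0.8GHz yields 80 counts and trips a threshold of 90.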

A key technology feature of 7nm

Key module technology        7nm        8nm
Fin patterning               SAQP       SADP
eSiGe                        6th gen.   5th gen.
eSD                          6th gen.   5th gen.
Device RO perf. (AC, norm.)  1.07       1.00

[Chart: FinFET logic Iddq (normalized, log scale) vs. normalized frequency
for 7nm vs. 8nm, showing a +7% frequency gain]
• AC performance was enhanced by fin-pitch scaling (Ceff gain at the
  standard-cell level)

A key technology feature of 7nm
• The Self-Aligned Quadruple Patterning (SAQP) process was introduced to
  scale fin pitch below 42nm (more than 15% scale-down).
• S/D epitaxy and heat optimization were performed to compensate for the
  DC performance degradation from fin-pitch scaling.
[Diagram: Self-Aligned Double Patterning (litho + spacer1 → fin) vs.
Self-Aligned Quadruple Patterning (litho + spacer1 + spacer2 → finer fins)]
Conclusion
• Five low-power techniques contributed to reducing the power consumption
  of the 7nm SoC:
  • Tri-cluster CPUs
  • Sparsity-aware NPU
  • HWACG
  • Droop detector
  • 7nm process
• They are also being applied extensively to subsequent projects to enhance
  low-power competitiveness

A 7nm FinFET 2.5GHz/2.0GHz Dual-
Gear Octa-Core CPU Subsystem with
Power/Performance Enhancements for a
Fully Integrated 5G Smartphone SoC.

Hugh Mair, Ericbill Wang, Ashish Nayak, Rolf Lagerquist, Loda Chou,
Gordon Gammie, Hsinchen Chen, Lee-Kee Yong, Manzur Rahman,
Jenny Wiedemeier, Ramu Madhavaram, Alex Chiou, Blundt Li, Vincent
Lin, Rory Huang, Michael Yang, Achuta Thippana, Osric Su, SA Huang

© 2020 IEEE 2.5: A 7nm FinFET 2.5GHz/2.0GHz Dual-Gear Octa-Core CPU Subsystem with Power/Performance Enhancements for a Fully Integrated 5G Smartphone SoC.
International Solid-State Circuits Conference 1 of 29
Outline
• SoC Overview
• Wireless connectivity
• Graphics/Media/AI
• Dual-gear CPU Cluster
• Cortex-A77 CPU
• Droop-Response Clock Control (DRCC)
• Silicon Results
• Frequency-Locked-Loop Clocking
• Silicon Results
• Hierarchical Test/Debug Interface
• Summary
SoC Overview
• Dimensity 1000 is a fully integrated smartphone SoC
supporting 5G cellular, advanced Wi-Fi 6, high performance
compute, multimedia, and AI capabilities
• Monolithic 7nm CMOS
– 10LM metal stack
• CPU Complex
– Octa-core w/ heterogeneous multi-processing
– 9.4mm2 on silicon
– Clock speeds up to 2.6GHz for current volume production

Wireless Connectivity
• 5G Cellular Modem:
– SA & NSA modes
• SA Opt.2, NSA Opt.3 / 3a / 3x
– 4.7Gbps down, 2.5Gbps up
– Full backwards compatibility
• Non-Cellular connectivity:
– Wi-Fi 6 (802.11a/b/g/n/ac/ax)
• 2T2R antenna
– Bluetooth 5.1+, GPS, FM
Graphics/Media/AI
• ARM-Mali G77 MC9 3D
• Full-HD display at 120Hz
• 80M pixel imagers
– 32+16M pixel dual camera
• HEVC & AV1 support with
encode/decode at 4K 60FPS
– Multi-expose HDR video
• APU3.0 Hexa-core AI
– 2xBig+3xSmall+1xTiny
– 4.5TOPS
Paper 7.1
CPU Complex
• Heterogenous big.LITTLE CPU
– 4x High-Performance (HP / Big)
• Cortex-A77 up to 2.6GHz
• Cache sizes: 64kB L1 Data,
64kB L1 Inst., 256kB L2
– 4x High-Efficiency (HE / Little)
• Cortex-A55 up to 2.0GHz
• Cache sizes: 32kB L1 Data,
32kB L1 Inst., 128kB L2
– 2MB L3 CPU cache

Cortex-A77 µArchitecture Additions/Improvements
• Additions: Macro-Op cache, 2nd branch unit, 4th ALU
• Improvements: branch prediction (2x BW), OoO window (+25%), dispatch BW
  (+50%), next-gen prefetch

Cortex-A77: Benchmark uplift from Cortex-A76
Performance improvements across a range of workloads
(IPC uplift, ISO-process & frequency)

20% Performance Improvement over Cortex-A76


Dual-Gear CPU Cluster

Higher performance @ lower power
• Compounded IPC gains on the HP core widen the gap to the HE core
• Prioritize the maximum voltage/frequency range of the HP core
Droop Responsive Clock Control (DRCC)
• DI/DT stress on the PDN is a continually increasing challenge:
  (a) ~flat power budget with increased current from lower voltages,
  (b) extreme clock gating
• Mitigate di/dt droops:
  1. From package network inductance (-L di/dt)
  2. From DC-DC converter bandwidth / DC losses
• A prior approach [3] uses charge injection (response time < 1ns)
• The current work adopts clock gating vs. charge injection
  – 50x / 98% area reduction
• Concurrent operation of dual detectors:
  1. RC-filtered DVFS'ed logic supply [-L di/dt]
  2. Fixed DC reference [SW-controlled]
[Diagram of prior work [3]: bandgap and reference-voltage generator, voltage
monitor array (VMON0–VMON4) with trim and activation codes, power-switch
array, DIE_SENSE_VDD/VSS around the CPU cluster; 2.5GHz clock, 1.8V and
external supplies]
Droop Responsive Clock Control
• -L di/dt loop: a 6-bit DAC + LPF generate the comparator reference at
  75%–100% of the DVFS-adaptive DVDD supply
• The non-filtered (high-bandwidth) supply is compared against the
  reference; the half-speed clock is engaged when VMIN is violated
• SW-ctrl loop: comparator reference = 33%–100% of VREF (1.2V)
[Diagram: DVDD or VREF feeds a 6-bit DAC (CODE[5:0]) and LPF (FILT[2:0]);
the comparator output is sampled and drives an FSM that controls the clock
gate from CLK_IN to CLK_OUT]
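Assuming the DAC spans 75%–100% of DVDD as described, the per-cycle gating decision can be sketched as below; the 6-bit full scale and the divide-by-2 response come from the slide, while the function name and sampling model are illustrative.

```python
def drcc_clock_div(vdd_samples, dvdd, code, full_scale=63):
    """Return the clock-divide ratio DRCC would pick for each sample.

    A 6-bit code places the comparator reference between 75% and 100% of
    the DVFS supply DVDD; whenever the high-bandwidth sensed supply dips
    below that reference, the half-speed clock (divide-by-2) is engaged.
    """
    vref = dvdd * (0.75 + 0.25 * code / full_scale)
    return [2 if v < vref else 1 for v in vdd_samples]
```

Halving the clock roughly halves the load current for those cycles, letting the supply recover without injecting charge.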

DRCC Silicon Results (SW-ctrl loop)
• CPU DI/DT exceeds the DC-DC converter bandwidth
[Scope capture: voltage and current with DRCC off vs. on; ~50mV droop
reduction with DRCC on; approximate CPU load current overlaid;
voltage/current measured at the PCB, 1µs/div (10µs window)]


FLL Clocking -- Principle of Operation
[Figure: traditional clocking vs. this work vs. a possible future approach]

Frequency-Locked Loop [FLL] Clocking
• Utilize a CPU-internal oscillator for the main clock
• Allow the oscillator to vary with the power supply
  – Digital control loop tracks DVFS changes with controlled bandwidth
    (backwards compatible)
  – Eliminates clock distortion from voltage translations & physical
    hierarchies

A local oscillator with supply correlation out-performs a [high-quality]
remote, un-correlated PLL clock
Ring Oscillator Topology
• Fine control: ~1ps/bit delay (2ps period step)
  – 45 thermometer-coded bits
  – Asynchronous to the oscillator
    • Multiple bit transitions allowed
• Coarse: ~20ps/bit delay (40ps period step)
  – 40 thermometer-coded bits
  – Synchronized to the oscillator
    • Avoids creating glitches in the ring oscillator
    • Single-bit increment/decrement
  – Each coarse code spans ~50% of the fine-code range
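The two controls compose into a simple period model; the target step sizes (2ps fine, 40ps coarse) are from this slide, while the base period is an assumed offset for illustration.

```python
def ro_period_ps(coarse, fine, base=400.0):
    """Toy ring-oscillator period model at the target step sizes.

    Fine code: ~2 ps of period per bit (45 thermometer bits);
    coarse code: ~40 ps of period per bit (40 thermometer bits).
    `base` is an assumed zero-code period, not a measured value.
    """
    return base + 40.0 * coarse + 2.0 * fine
```

One coarse step (40ps) is about 44% of the full 90ps fine range, close to the ~50% overlap target, so adjacent coarse codes always overlap in achievable period.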
Ring Oscillator Silicon Measurements
[Chart: oscillator period (ps) vs. fine code, one curve per coarse code]
• Fine step size = 2.2ps (vs. target of 2ps)
• Coarse step size = 40ps (vs. target of 40ps)
• Coarse step = 40% of fine range (vs. target of 50%)

FLL Block Diagram
• Two oscillators (Ping & Pong) double frequency range
– Second oscillator includes ÷2
• PI Loop
– Inputs: Phase & freq. errors
– Output: Fine control
• Coarse control by FSM
– Monitor fine control
– Apply fine-code delta on
coarse-code change
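The loop above can be sketched as one update step: a PI term drives the fine code from phase and frequency error, and the FSM bumps the coarse code and re-applies the equivalent fine-code delta on saturation. The gains and the 22-code delta (about half the 45-bit fine range) are illustrative assumptions.

```python
def fll_step(state, freq_err, phase_err, kp=0.5, ki=0.1):
    """One update of a sketch FLL PI loop with a coarse-code FSM.

    `state` is (fine, coarse, integrator). When the PI output pushes the
    fine code past its range, the FSM changes the coarse code by one step
    and re-centres the fine code by the equivalent delta.
    """
    FINE_MAX, COARSE_DELTA_IN_FINE = 45, 22   # ~50% of the fine range
    fine, coarse, integ = state
    integ += ki * freq_err                    # integral path on frequency error
    fine += kp * phase_err + integ            # proportional path on phase error
    if fine > FINE_MAX:                       # saturated high: step coarse up
        coarse += 1
        fine -= COARSE_DELTA_IN_FINE
    elif fine < 0:                            # saturated low: step coarse down
        coarse -= 1
        fine += COARSE_DELTA_IN_FINE
    return fine, coarse, integ
```

Re-centring the fine code on each coarse change keeps the output period continuous across the hand-off, which is why the coarse steps must overlap the fine range.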

FLL Oscillator CPU V/F Tracking [Si. Measured]
• Analyze small-signal correlation
• Testing procedure:
  1. Determine VMIN at given reference frequency points FTGT
     (◯ marks VMIN per frequency)
  2. Record the oscillator frequency FREF
  3. Plot (FOSC/FREF x FTGT) at VMIN +/-50mV (▽: -50mV; a second marker:
     +50mV)
[Chart: frequency (MHz) vs. VDD (a.u.)]

Oscillator V/F curve tracks overall CPU V/F curve


FLL Clocking VMIN Improvement [Si. Measured]
[Chart: delta VMIN (mV) vs. frequency over 2.3–2.5GHz; a consistent ~35mV
reduction across the range]
~35mV VMIN Improvement, >10% Power Reduction


Hierarchical JTAG -- Motivation
• JTAG (IEEE 1149.1) compact & convenient interface for cmd/ctrl
– Two phase: 1.[wr] Instruction Register (IR), 2. [rd/wr] Data Register (DR)

• JTAG Challenges/Limitations:
– Models a serial connection through all devices / IP blocks
– Large number of embedded IP (#clusters * #cpus * #ip/cpu)
– Power gating blocks chain segments

• Existing alternative: IJTAG (IEEE 1687) creates hierarchy but cannot mix
  1149/1687 in the same hierarchy and requires additional phases
• Our approach maintains JTAG compatibility throughout the hierarchy; all
  IP, including the “GWTAP”, are 1149.1-compatible
Hierarchical JTAG: GWTAP
• The Gateway TAP (GWTAP) creates a selectable 1-to-4 fan-out (plus bypass)
  – An 11-bit IR instructs which sub-chain to access and the sub-chain's
    IR length
  – The sub-chain transitions from IR to DR after #(IR-length) clocks
  – TAPs in upper levels are signaled to bypass

Hierarchical Command Roll-Up Example
• From lowest level, moving up the hierarchy:
– IR+DR embedded into upper level DR; add DR for upper TAP bypass
– Upper level IR: GWTAP instruction + other TAPs to bypass

Example GWTAP Topology Interpreted as IR Interpreted as DR

TAP GWTAP TAP 1st Level IR IR IR IR DR DR

TAP GWTAP TAP 2nd Level IDLE IR IR IR DR DR

TAP GWTAP TAP 3rd Level IDLE IR IR DR DR

TAP 4th Level IDLE IR DR

JTAG TDI Bitstream as a function of time
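The roll-up can also be viewed as a per-level length calculation on the TDI bitstream. Only the 11-bit GWTAP IR comes from the slides; the bypass IR length, the two-other-TAPs default, and the dict representation are assumptions for illustration.

```python
def roll_up(inner, other_taps=2):
    """Embed a lower level's IR+DR shift into the level above (bit lengths).

    Moving up one level, the lower chain's IR and DR phases both become
    upper-level DR content, padded with 1-bit bypass DRs for the other TAPs
    on that chain; the upper IR phase holds the 11-bit GWTAP instruction
    plus bypass instructions for those TAPs.
    """
    GWTAP_IR, BYPASS_IR, BYPASS_DR = 11, 8, 1   # BYPASS_IR length assumed
    return {
        "ir": GWTAP_IR + other_taps * BYPASS_IR,
        "dr": inner["ir"] + inner["dr"] + other_taps * BYPASS_DR,
    }
```

Starting from a leaf TAP with a 5-bit IR and 32-bit DR, two roll-up levels give IR/DR shift lengths of 27/39 and then 27/68 bits, so the whole hierarchy is still driven through a single standard two-phase JTAG access at the top.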

Summary
• A production monolithic 5G smartphone SoC integrating a
latest generation Cortex-A77 CPU is introduced
• A continued focus on CPU power efficiency and innovation in
the area of power supply and clocking is presented along with
silicon results
• Key circuit elements shown:
– Droop Responsive Clock Control (DRCC)
– CPU-localized Frequency-Locked-Loop (FLL) clocking
– Novel hierarchical extension to traditional JTAG

A 16nm 3.5B+ Transistor >14TOPS 2-to-10W Multicore
SoC Platform for Automotive and Embedded Applications
with Integrated Safety MCU, 512b Vector VLIW DSP,
Embedded Vision and Imaging Acceleration

Rama Venkatasubramanian1, Don Steiss1, Greg Shurtz2, Tim Anderson1, Kai Chirca1,
Raghavendra Santhanagopal1, Niraj Nandan1, Anish Reghunath1, Hetul Sanghvi1, Daniel Wu1,
Abhijeet Chachad1, Brian Karguth1, Denis Beaudoin1, Charles Fuoco1, Lewis Nardini1, Chunhua
Hu1, Sam Visalli1, Amrit Mundra1, Devanathan Varadarajan1, Frank Cano2, Shane Stelmach1,
Mihir Mody3, Arthur Redfern1, Haydar Bilhan1, Maher Sarraj1, Ali Siddiki1, Anthony Lell1, Eldad
Falik1, Anthony Hill1, Abhinay Armstrong1, Todd Beck1, Vijay Kanumuri1, Steve Mullinnix1,
Darnell Moore1, Jason Jones2, Manoj Koul1, Sanjive Agarwala1
1Texas Instruments, Dallas, TX
2Texas Instruments, Houston, TX
3Texas Instruments, Bangalore, India

© 2020 IEEE 2.6: A 16nm 3.5B+ Transistor >14TOPS 2-to-10W Multicore SoC Platform for Automotive and Embedded Applications with Integrated Safety MCU,
International Solid-State Circuits Conference 512b Vector VLIW DSP, Embedded Vision and Imaging Acceleration 1 of 40
Outline
 Automotive Processor background
 Jacinto™ 7 SoC Platform Architecture
 C71x Digital Signal Processor (DSP)
 Embedded vision and Imaging accelerators
 VPAC and DMPAC
 First SoC of the platform
 Device details/Die micrograph
 Automotive SoC Development
 Innovative quality and reliability methodologies
 Conclusion

Automotive Processor - Applications
• Advanced driver assistance systems (ADAS): camera/radar/lidar-based front,
  rear, surround-view and night-vision systems; automatic emergency braking,
  adaptive cruise control, automated parking
• Body electronics: gateway, vehicle compute, telematics features
• Infotainment & cluster: infotainment, instrument cluster
• Scalability: entry to premium vehicles
Motivation for SoC Platform Architecture
 Platform reuse: maximize design investments; scalable hardware and
  software solutions
 Higher performance needs: scalable real-time processing solution;
  analytics; communication
 Advanced integration: enhanced functional safety; security; low power
 Efficient data processing: scalable interconnect; SoC infrastructure

SoC Platform Architecture (1/2)
 Evolution of the Texas Instruments OMAP and Keystone II platforms
 Developed with functional safety and automotive quality as primary design
  objectives
SoC Platform Architecture (2/2)
 Multiple isolated domains:
 Wakeup domain
 Microcontroller (MCU) domain
 Main domain

 Integrated Safety MCU

 Overall system cost reduction
  Minimizing external components through on-die micro-architecture solutions
Automotive Functional Safety - Overview
 Governed by the ISO 26262 standard
 Four ASILs ― A (least stringent), B, C, and D (most stringent)
 ASIL-D
  Most stringent functional safety level
  Ex: Power steering (unwanted acceleration)
  Ex: Engine braking (unintended braking)
 ASIL-B
  Also critical, slightly less stringent
  Ex: Embedded vision ADAS (incorrect sensor feedback)
  Ex: Instrument cluster (loss of critical data)
Wakeup Domain
 ASIL-D
 Isolated domain

 Manages security and low power modes


 Boot management
 Cryptographic acceleration
 Trusted execution environment
 Secure storage
 On-the-fly encryption
MCU Domain
 ASIL-D
 Isolated chip-within-a-chip

 General-purpose MCU
 Communication peripherals for safety-critical
communication

 Dedicated supply, clock, and reset
Safety and Isolation Features
 MCU domain monitors Main domain faults and
takes appropriate action

 Domain Isolation
 Reset, Clock and Bus isolation
 Logic/IO voltage isolation with power monitoring
 Dedicated Voltage, thermal, clock rate sensors
per domain

 Highly reliable interconnect


 SECDED/Parity to support ASIL-D

 Control and data communication


 Internal SPI to SPI
 Full bus with isolation gasket and timeouts
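The ASIL-D interconnect above relies on SECDED (single-error-correct, double-error-detect) coding. As an illustrative sketch only (not TI's implementation, which uses wider codes over full bus words), a minimal Hamming(7,4) code plus an overall parity bit shows the principle: the syndrome locates a single flipped bit, and the overall parity distinguishes single from double errors.

```python
def encode(d):
    """Encode 4 data bits into an 8-bit SECDED codeword (Hamming(7,4) + parity)."""
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    cw = [p1, p2, d[0], p3, d[1], d[2], d[3]]       # codeword positions 1..7
    cw.append(cw[0] ^ cw[1] ^ cw[2] ^ cw[3] ^ cw[4] ^ cw[5] ^ cw[6])  # overall parity
    return cw

def decode(cw):
    """Return (data bits, status): corrects one flip, detects two."""
    s = 0
    for pos in range(1, 8):          # syndrome = XOR of set-bit positions
        if cw[pos - 1]:
            s ^= pos
    overall = 0
    for b in cw:
        overall ^= b
    if s == 0 and overall == 0:
        status = "ok"
    elif overall == 1:               # odd number of flips: single, correctable
        if s:
            cw = list(cw)
            cw[s - 1] ^= 1
        status = "corrected"
    else:                            # even flips with nonzero syndrome: double
        status = "double-error detected"
    return [cw[2], cw[4], cw[5], cw[6]], status
```

A double error is flagged but deliberately not "corrected", which is exactly the property an ASIL-D interconnect needs to avoid silent data corruption.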
Main Domain – Compute Cluster
 64b Heterogeneous Multicore Architecture
 Arm® Cortex-A and C71x DSP
 Coherent memory
 Multicore shared-memory system (MSMC)
 L1/L2/L3 caches
 Shared on-chip SRAM with ECC
 Virtualization
 C66x DSP
 Optimized for Audio applications
 Enables supplemental analytics
 Backwards compatibility
 High-performance GPU
 External memory interface (LPDDR4-4266)
Main Domain – Accelerator Cluster
 Vision Pre-processing Accelerator (VPAC)
 Depth and Motion Perception Accelerator
(DMPAC)
 Video acceleration (H.264/H.265)
 Image capture subsystem (MIPI CSI-2 RX/TX)
 Display Subsystem
 Interfaces for different display panel types
 eDP, MIPI DSI, MIPI DPI
 Security acceleration
 PKA, AES, SHA, RNG, DES/3DES
C71x DSP
 DSP CPU:
 64b addressing
 Optimized for General Purpose DSP and Embedded Vision
 16-Issue VLIW with flexible pipeline protection
 64b scalar registers and execution units (int*, float, double)
 512b vector registers and execution units (int*, float, double)
 512b x 512b matrix registers and matrix unit (int*)
 4240 integer MAC/Cycle

 Memory Subsystem:
 Load/Store instructions up to 512b wide
 Streaming accesses with programmed address sequencers

 Integrated L1 program, L1 data and unified L2 caches
C71x DSP vs. C66x DSP

                                   C66x [1]   C71x
  64-bit addressing                No         Yes
  Vector size                      128b       512b
  Load/Store bits/cycle            128        1600
  8-bit fixed-point MAC/cycle      32         4240
  32-bit floating-point OPS/cycle  16         88

[1] R. Damodaran et al., “A 1.25 GHz 0.8W C66x DSP core in 40nm CMOS,” IEEE Conf. VLSI Design, pp. 286-291, 2012.
C71x DSP Benchmark Performance Scaling

  Category             Benchmark                  C71x/C66x [1]   C71x/EVE [2]
  General Purpose DSP  FFT 1024pt, 32b complex    5.6x            2.4x
  Computer Vision      Image Gradient             11.0x           2.7x
                       Image Transform            11.0x           4.8x
                       Filtering                  18.1x           7.1x
                       Feature Extraction         18.8x           4.7x
                       Optical Flow               3.0x            3.0x
                       Morphology                 25.7x           15.3x
  Machine Learning     MobileNet v1               516.0x          115.0x
                       MobileNet v2               210.0x          62.7x

[1] R. Damodaran et al., “A 1.25 GHz 0.8W C66x DSP core in 40nm CMOS,” IEEE Conf. VLSI Design, pp. 286-291, 2012.
[2] Mandal, Dipan Kumar et al. “An Embedded Vision Engine (EVE) for Automotive Vision Processing,” IEEE International Symposium on
Circuits and Systems (ISCAS), pp. 49-52, 2014.
Vision Pre-processing Accelerators (VPAC) (1/4)

 VPAC consists of:
  Image Signal Processing engine (ISP)
  Multiple hardware accelerators:
   Remap Engine
   Noise Filter
   Scaler Engine
 SW-controlled, flexible acceleration
  Control and sequencing

[Block diagram: image capture (2x CSI2-RX) feeding the VPAC chain: ISP -> Remap Engine -> Noise Filter -> Scaler Engine]
Vision Pre-processing Accelerators (VPAC) (2/4) – Image Pipe

 Image Capture
  2x 4L MIPI-compliant CSI2-RX interfaces
 ISP: Image Signal Processing
  Human and machine vision
  >8x 2MP@30FPS camera support
  Flexible RAW sensor support
  Defective Pixel Correction (on-the-fly)
  Lens Shading Correction (LSC)
  Noise Suppression Filter
  Flexible color processing
   YUV 420/422, 8b/12b, RGB, dual output, custom support

[ISP pipeline: Wide Dynamic Range -> Lens Shading Correction -> Tone Mapping -> Defect Pixel Correction -> Noise Filter -> CFA -> Color Processing -> Edge Enhancer -> Statistics]
Vision Pre-processing Accelerators (VPAC) (3/4) - Vision Primitives
 Remap Engine
 Lens distortion correction (+Fish Eye Lens)
 Rectification

 Noise filter
 Edge preserving
 Enhances analytics quality

 Scaler Engine
 Multiple scales for pyramid generation for
various vision algorithms
 Region of Interest (ROI) scaling support
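The multi-scale pyramid generation described above can be sketched in software. A minimal 2x box-filter pyramid in plain Python (illustrative only; the hardware scaler supports arbitrary scale ratios and ROI scaling):

```python
def downsample_2x(img):
    """Average each 2x2 block to produce a half-resolution image (img: list of rows)."""
    h, w = len(img), len(img[0])
    return [[(img[y][x] + img[y][x+1] + img[y+1][x] + img[y+1][x+1]) // 4
             for x in range(0, w - 1, 2)]
            for y in range(0, h - 1, 2)]

def pyramid(img, levels):
    """Full-resolution image plus successively halved scales, as a scaler emits
    for pyramid-based vision algorithms."""
    out = [img]
    for _ in range(levels - 1):
        img = downsample_2x(img)
        out.append(img)
    return out
```

Downstream vision algorithms (feature extraction, optical flow) then run the same kernel at each pyramid level to handle scale variation.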
Vision Pre-processing Accelerators (VPAC) (4/4)

 Hardware accelerators optimized for:
  Low-latency image/vision pipe
  Reducing external memory bandwidth
 SW and HW flexibility
  Sequencing of algorithms
  Standalone / connected pipe, SW controlled

[Block diagram: image capture (2x CSI2-RX) feeding the VPAC chain: ISP -> Remap Engine -> Noise Filter -> Scaler Engine]
Depth and Motion Perception Accelerator (DMPAC) (1/5)
Stereo Disparity Overview
Depth and Motion Perception Accelerator (DMPAC) (2/5)
Stereo Disparity Engine

[Diagram: left and right camera images feed the Stereo Disparity Engine hardware accelerator, which produces a disparity map]
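A stereo disparity engine searches, for each block of the left image, for the best-matching block along the same row of the right image; the horizontal offset at the best match is the disparity, which is inversely proportional to depth. A naive sum-of-absolute-differences (SAD) sketch of that search (illustrative only; the DMPAC uses a far more robust semi-global-matching variant, described later):

```python
def disparity_row(left, right, y, block=3, max_disp=16):
    """Per-pixel disparity for row y via naive SAD block matching."""
    h, w = len(left), len(left[0])
    half = block // 2
    disp = []
    for x in range(half, w - half):
        best, best_d = None, 0
        # Candidate disparities: the right-image block slides left by d pixels.
        for d in range(0, min(max_disp, x - half) + 1):
            sad = sum(abs(left[y + dy][x + dx] - right[y + dy][x + dx - d])
                      for dy in range(-half, half + 1)
                      for dx in range(-half, half + 1))
            if best is None or sad < best:
                best, best_d = sad, d
        disp.append(best_d)
    return disp
```

This brute-force formulation makes the bandwidth problem obvious: every pixel touches block x max_disp neighbors, which is why hardware engines aggregate matching costs instead of re-reading image data per candidate.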
Depth and Motion Perception Accelerator (DMPAC) (3/5)
Optical Flow Overview

[Figure: previous frame F(t-1) and current frame F(t), with a flow vector linking corresponding points]

 Optical Flow estimates the 2D motion vector field between two images
Depth and Motion Perception Accelerator (DMPAC) (4/5)
Dense Optical Flow Engine

[Diagram: input video frames feed the Dense Optical Flow hardware accelerator, producing a flow-to-color mapping and a confidence score for each flow vector output]

Applications:
 Object Tracking
 Structure From Motion (3D)
 Moving Object Segmentation
Depth and Motion Perception Accelerator (DMPAC) (5/5)

 Stereo Disparity Engine HWA
  Custom Semi-Global Matching (SGM) algorithm
  Matches SGM in robustness performance
  >50x memory bandwidth reduction vs. SGM
 Dense Optical Flow HWA
  Up to 2MPix / 60 fps
  Machine-learning-based confidence score generated for each flow vector output
  Fractional pixel precision: 1/16th of pixel motion differentiable
  Large motion search range: ±191 H and ±63 V

[DMPAC block diagram: Depth: Stereo Disparity Engine hardware accelerator; Motion Perception: Dense Optical Flow hardware accelerator]
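The 1/16-pixel fractional precision means each flow component is naturally a fixed-point value with 4 fractional bits. The packing below is hypothetical (the actual DMPAC output format may differ): a 32-bit word with a two's-complement 16-bit horizontal field, a 12-bit vertical field, and a 4-bit confidence nibble, sized so the ±191 H and ±63 V search ranges fit comfortably.

```python
# Hypothetical packed flow word, NOT the documented DMPAC register layout.
# bits [31:16] horizontal flow, Q12.4 two's complement (covers +/-191 px)
# bits [15:4]  vertical flow,   Q8.4  two's complement (covers +/-63 px)
# bits [3:0]   confidence score

def pack_flow(h_pix, v_pix, conf):
    h = int(round(h_pix * 16)) & 0xFFFF   # 1/16-pixel steps
    v = int(round(v_pix * 16)) & 0x0FFF
    return (h << 16) | (v << 4) | (conf & 0xF)

def unpack_flow(word):
    h = (word >> 16) & 0xFFFF
    v = (word >> 4) & 0x0FFF
    conf = word & 0xF
    if h & 0x8000:                        # sign-extend the 16-bit field
        h -= 1 << 16
    if v & 0x800:                         # sign-extend the 12-bit field
        v -= 1 << 12
    return h / 16.0, v / 16.0, conf
```

The point of the sketch is the arithmetic: multiplying by 16 converts pixels to 1/16-pixel units, and sign extension recovers negative motion after field extraction.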
Die Micrograph and Device Details

 First SoC of the Platform

  Process           TSMC 16FF FinFET
  Voltage           Core: 0.72V-0.990V (AVS); SRAM: 0.85V
  Temperature       -40C to 150C
  Specifics         3.5B+ transistors, ~200Mb SRAM
  Performance       2GHz 2x super-scalar CPU; 1GHz 3x dual safety MCU;
                    4266 LPDDR4; 16G SERDES
  Power             2W-10W (use-case dependent)
  Key accelerators  1GHz 64b VLIW / 512b SIMD DSP; machine learning;
                    computer vision
  Number of PLLs    25+
  Power networks    74
  Features          ASIL-D capable, low-DPPM design
Automotive SoC – Quality and Reliability
 Automotive SoCs require:
 Stringent attention to DPPM (defective parts per million)
 Very low FIT for functional safety and intrinsic reliability

 Large embedded memories


 1 DPPM at SoC level requires <1 DPPB at memory component level
 Ex: 200Mb with no assumptions on test/screening needs
 7.7σ closure for bitcells
 7σ closure for wordline drivers
 6.6σ closure for sense amps

 Drives need for new techniques for design and test
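The sigma targets above follow from the Gaussian tail probability each component class may consume. With ~200Mb of bitcells and a ~1 DPPM chip budget, each cell must fail with probability below roughly 1e-6 / 2e8 = 5e-15, which lands near the quoted 7.7σ. A back-of-envelope check with Python's standard library (illustrative only; TI's actual budget split across bitcells, wordline drivers, and sense amps is not shown on the slide):

```python
from statistics import NormalDist

def required_sigma(chip_dppm, n_components):
    """Gaussian tail quantile at which n_components jointly meet the chip DPPM.

    Assumes failures are independent and the budget is split evenly, so each
    component may fail with probability (chip_dppm * 1e-6) / n_components.
    """
    per_component_fail = (chip_dppm * 1e-6) / n_components
    return -NormalDist().inv_cdf(per_component_fail)

bitcell_sigma = required_sigma(chip_dppm=1, n_components=200e6)
# lands near 7.7 sigma for 200Mb of cells at a 1 DPPM chip target
```

The same formula explains why wordline drivers (fewer instances) close at 7σ and sense amps at 6.6σ: fewer components means a larger allowed per-component failure probability, hence a smaller tail quantile.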
Automotive SoC – Memory Test

 Test with time-zero positive margin needed


 Enhanced SRAM screening methodology
 Wordline and bitline voltage control combined with traditional
techniques
 Improved defect injection techniques
Test Goal: Test mimics functional mode (1/3)
 Case-study: Robust memory interface test
 Memory BIST is targeted to screen memory-internal defects
 Functional vs. Memory BIST: Start/end points are different
[Diagram: BIST latches and functional flops multiplex onto the memory data port (D / TD -> D_mem) and address port (A / TA -> A_mem); Q_mem -> Q returns to a functional flop. Legend: functional/mission-mode path vs. BIST test-mode path]
Test Goal: Test mimics functional mode (2/3)
 Case-study: Robust memory interface test
 ATPG is targeted to screen memory-interface logic defects
 Functional vs. ATPG: Memory I/O timings are different due to scan collar
 Subtle defects in memory-interface may be missed by BIST and ATPG
[Diagram as before: BIST latches and functional flops multiplex onto the memory data and address ports. Legend: functional/mission-mode path vs. ATPG test-mode path]
Test Goal: Test mimics functional mode (3/3)
 Case-study: Robust memory interface test
 Match functional mode memory I/O timing with test mode
 Improved memory IP architecture to match functional and test timing
 Improved DFT to test “true” functional memory-interface path
[Diagram as before, with the legend now showing all three paths: functional/mission mode, BIST test mode, and ATPG test mode aligned on the same memory-interface timing]
EM FIT Calculation for Automotive SoCs (1/3)
Signoff with Violations at Technology Reliability Limit

[Chart: probability of occurrence vs. component FIT per bin (10%-110% of the reliability limit), showing the FIT per bin, the cumulative FIT, and the system FIT spec]

 May be acceptable/waiveable in a consumer-grade SoC
EM FIT Calculation for Automotive SoCs (2/3)
Signoff at Technology Reliability Limit

[Chart as before, with any violations at the technology reliability limit fixed]

 Fixed for the technology reliability limit
 Cumulative FIT may still violate the system FIT spec
EM FIT Calculation for Automotive SoCs (3/3)
Signoff with Margined Technology Reliability Limit

[Chart as before, with additional margins applied below the technology reliability limit and all violations against the margined limit fixed]
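The signoff flow above amounts to accumulating per-bin FIT contributions and checking the running total against the system budget. A schematic version of that bookkeeping (illustrative numbers and names, not the production flow):

```python
def em_fit_signoff(fit_per_bin, system_fit_spec):
    """Accumulate per-bin EM FIT and flag bins where the budget is exceeded.

    fit_per_bin: FIT contribution of each %-of-limit bin (10%, 20%, ... of the
    technology reliability limit). Returns the cumulative FIT curve and the
    indices of bins at which the cumulative FIT crosses the system spec.
    """
    cumulative, total, violations = [], 0.0, []
    for i, fit in enumerate(fit_per_bin):
        total += fit
        cumulative.append(total)
        if total > system_fit_spec:
            violations.append(i)
    return cumulative, violations

# Illustrative distribution: most wires sit far below the EM limit,
# a few near 100% of it dominate the cumulative FIT.
bins = [0.001, 0.002, 0.004, 0.01, 0.03, 0.1, 0.4, 1.5, 4.0, 9.0]
curve, over = em_fit_signoff(bins, system_fit_spec=10.0)
```

This is why signoff at the raw technology limit can still violate the system spec: each individual bin may be legal while the cumulative sum, dominated by the highest bins, is not; margining the limit shifts those dominant bins down.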
EM FIT Calculation in an Eco-system-class Automotive SoC

[Flow: the foundry supplies the flow spec, rules, and models; EDA vendors supply the reliability flows; the SoC designer and the IP eco-system each execute the flow, producing the SoC reliability estimate (FIT rate)]
System Validation – ADAS (1/2)
Forward Camera Analytics based on Deep Learning

 Semantic Segmentation
  DeepLabV3Lite architecture (MobileNetV2 encoder)
  63 convolution layers
  5 classes (pedestrian, sky, vehicles, roads, background)
 Object Detection / Parking Spot Detection
  Single-shot detection (SSD) with MobileNetV1
  47 convolution layers, 6 regression heads

  Application                        GMACS per frame   Time per frame (ms)
  Semantic Segmentation              3.68              6.20
  Parking Spot + Vehicle Detection   3.65              4.49
System Validation – ADAS (2/2)
8MP Front Camera Perception and Localization

 Multi-class object detection using C71x DSP and VPAC/DMPAC


 Fusion with IMU and GPS for Localization
System Demo – 3D Surround View + Auto Valet Parking
Real-time 360-Degree Surround View + Analytics for Auto Valet Parking

[Pipeline: 4x 1920x1080 @30FPS Bayer fisheye cameras (CSI, I2C) -> VPAC -> GPU 3D surround-view rendering -> display subsystem; 3 additional camera inputs -> VPAC -> C66x DSP (pre-process) -> C7x DSP running semantic segmentation, multi-object detection, and parking spot detection on 3x 768x384 @15FPS -> C66x DSP (post-process) -> mosaic of 3x 1280x720 @15FPS YUV420 -> display; SD card as media source]

 3D car model with surround view rendered
 3 additional camera inputs, 3 different ML networks
 Automatic valet parking application

Industry Showcase: EE3 (Today at 8pm)
Texas Instruments - Camera Based Perception and 3D Surround View for Autonomous Valet Parking on a 16nm Automotive SoC
Conclusion

 Jacinto™ 7 SoC Platform Architecture with Integrated MCU


 C71x DSP
 Embedded vision and Imaging hardware accelerators
 VPAC and DMPAC
 Innovative quality and reliability methodologies
 Enabling Automotive in 16nm ecosystem
 First SoC of the platform
 First pass silicon fully functional and meeting design goals
IBM z15™: A 12-Core 5.2GHz Microprocessor

Christopher Berry1, Brian Bell2, Adam Jatkowski1, Jesse Surprise1, John Isakson3, Ofer Geva1, Brian
Deskin4, Mark Cichanowski3, Dina Hamid1, Chris Cavitt1, Gregory Fredeman1, Anthony Saporito1,
Ashutosh Mishra5, Alper Buyuktosunoglu6, Tobias Webel7, Preetham Lobo5, Pradeep Parashurama5,
Ramon Bertran6, Dureseti Chidambarrao8, David Wolpert1, Brandon Bruen1

IBM Systems
1 - Poughkeepsie, NY
2 - Rochester, NY
3 - Austin, TX
4 - Endicott, NY
5 - Bangalore, India
6 - Yorktown Heights, NY
7 - Boeblingen, Germany
8 - Hopewell Junction, NY

© 2020 IEEE 2.7: IBM z15TM: A 12-Core 5.2GHz Microprocessor


International Solid-State Circuits Conference 1 of 39
Outline
• Introduction
• Technology and System Structure
• System Control (SC) Chip
• Central Processor (CP) Chip
• Hardware Measurements
• Conclusion
Introduction: IBM z15
• 14nm – Again
• Significant changes to system topology
• Significant feature additions
• Performance Goals
• +10% Single Thread
• +20% System Capacity
• Big ticket items
• 33% increase in L2 (per core)
• 100% increase in L3 (per chip)
• 43% increase in L4 (per chip)
• Added 2 cores (+20%)
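The headline percentages are straightforward ratios against z14. The baselines below come from the prior z14 generation rather than from this slide (z14: 6MB L2 per core, 128MB L3 per CP chip, 672MB L4 per SC chip, 10 cores per CP chip), so treat them as a stated assumption:

```python
# z14 baselines are prior-generation figures, not taken from this slide.
def pct_increase(old, new):
    """Generational increase, rounded to the nearest percent."""
    return round(100 * (new - old) / old)

assert pct_increase(6, 8) == 33       # L2 per core: 6MB -> 8MB, +33%
assert pct_increase(128, 256) == 100  # L3 per chip: 128MB -> 256MB, +100%
assert pct_increase(672, 960) == 43   # L4 per chip: 672MB -> 960MB, +43%
assert pct_increase(10, 12) == 20     # cores per CP chip: 10 -> 12, +20%
```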
Technology Overview [ISSCC2018 Berry]

GlobalFoundries 14nm High-Performance (HP) FinFET SOI technology w/ embedded DRAM

Technology Overview
Ultra-thick wires 2400nm - 2 levels
5.6X pitch wiring 360nm - 4 levels
4X pitch wiring 256nm - 2 levels
2X pitch wiring 128nm - 4 levels
1.3X pitch wiring 80nm - 2 levels
1X pitch, fine wiring (LELE) 64nm - 3 levels
Logic Device VT pairs L, R, H
SRAM Cell Sizes 0.102μm2 (HP & LL), 0.143μm2
eDRAM Cell Size 0.0174μm2
z15 System Structure: Max System

• Up to 4 racks (42U x 19")
• 20 liquid-cooled CP chips
• 240 physical cores
• 40TB RAIM*-protected memory
  *Redundant Array of Independent Memory
• 60 PCIe Gen4 x16 cards
z15 System Structure: Max System

[Rack photo, labeled: power supply, Ethernet switches, IO cages, processor drawers, water cooling]
z15 System Structure: Drawers
z15 System Structure: Drawer

[Drawer photo, labeled: system control chip, memory DIMMs, central processor chips]
System Control (SC) Chip Overview

• Drawer-to-drawer connectivity
• L4 Cache - 960MB (+43%)
• Specs
  • 2.6GHz
  • 12.2B transistors
  • 696mm2
  • ~20K C4's
  • IO bandwidth 6.8Tb/s
  • 21.7km of signal wire

[Die plot: four L4 eDRAM cache quadrants with L4 dataflow, L4 control, and L4 directory in the center; X-BUS drivers/receivers on the top and bottom edges, A-BUS drivers/receivers on the left and right]
SC Chip: X-Bus & A-Bus Links

• A-Bus (drawer-to-drawer)
  • Four links
  • Differential
  • 10.4Gb/s/lane (+33%)
  • 0.9Tb/s each (3.6Tb/s total)
• X-Bus (SC-CP)
  • Four links
  • Single-ended
  • 5.2Gb/s/lane
  • 0.8Tb/s each (3.2Tb/s total)

[Die plot highlighting the X-BUS drivers/receivers on the top and bottom edges and the A-BUS drivers/receivers on the sides]
eDRAM Improvements
• “Double dense” eDRAM
significant enabler
• Doubled bitline and
wordline lengths
• Sense amp update to
improve voltage
sensitivity for double
length bit lines
• Improved array macro
density by ~30%
eDRAM Improvements
• Changed on-die generated high voltage to package delivered
• Removed charge pump overhead
• Absorbed low-voltage generation & regulator into macro
• Combined effective cache density improvement of ~80%

[Macro photos: z14 8Mb macro vs. z15 16Mb macro]
Central Processor (CP) Chip Overview

• Cores, L1/L2/L3 cache, memory interface, IO, CP & SC interfaces
• Specs
  • 5.2GHz
  • 9.2B transistors
  • 696mm2
  • ~29K C4's
  • IO bandwidth 4.0Tb/s
  • 23.2km of signal wire

[Die plot: 12 cores flanked by L3 cache and L3 dataflow columns, L3 control/directory in the center, memory controller (MC driver / MCU / MC receiver) at the top, X-BUS interfaces on the sides, GZIP accelerator, and PCIe/PBU interfaces (PCIE0-2, PBU0-2) along the bottom]
CP Chip: Processor Cores

• 12 cores (+20%)
• 128+128KB (I+D) L1 SRAM cache
• 4+4MB (I+D) private L2 eDRAM cache (+33%)
• 256MB eDRAM shared L3 cache (+100%)
• GZIP accelerator

[Die plot with the cores, shared L3, and GZIP accelerator highlighted]
CP Chip: IO Links

• X-Bus
  • Two X-Bus links
  • Single-ended
  • 5.2Gb/s/lane
  • 0.8Tb/s each (1.6Tb/s total)
• Memory interface
  • 9.6Gb/s/lane (1.6Tb/s total)
• 3 PCIe x16 Gen4 interfaces (0.8Tb/s)

[Die plot with the memory controller, X-BUS, and PCIe/PBU interfaces highlighted]
Chip Improvements

• Core area reduction
• IO area improvement
  • z14: 185 hex IO ring with 185 orthogonal central
    (185 hex -> 33.4 C4/mm2; 185 -> 29 C4/mm2)
  • z15: X-bus signals reduced by 33%; 185 hex top/bottom edges with 150 through center
    (150 -> 44.4 C4/mm2)
z15 Core Floorplan

[Annotated floorplan: L2 cache; L1 D-cache/load-store; fixed point; vector floating point; instruction sequencer; recovery; translator; elliptic curve cryptography; instruction decode; branch prediction; L1 I-cache/compression/sort]
Core Improvements

• Core area reduced by 10% (28 -> 25mm2)
• Instruction Sequencer area reduced by 35% (1.8M nets)
• L2 increased by 33% (6MB -> 8MB)
• ECC accelerator added
Core Improvements

• Improved store forwarding
• Improved operand-store-compare (OSC) hazard handling
• Enhanced branch prediction
• Decimal operation improvements, including decimal-to-binary conversion
• Sort/Merge accelerator added

[Annotated core floorplan showing where each improvement resides]
Core Improvements

[Chart, z14 vs. z15: net count +5%, device count -2%, wire length +22%]
Single Die Power

[Chart, z14 vs. z15: single-die power deltas of +12%, +5%, and +1%]
Process Shift

[Histogram: normalized die count vs. process delay (1/frequency), z14 vs. z15 distributions]
Vmin vs. Process Delay

[Scatter plot: chip Vmin (V) vs. normalized process delay (1/frequency), z13 => z15]
z15 Design: Conclusion
• Improvement in single thread performance => 14%
• System capacity improvement => 20%
• Achieved goals while staying in 14nm
• Significant increase in L2, L3 & L4 cache
• Two more cores
• Several new on-die or in-core accelerators:
• ECC, gzip, sort/merge
• Minimal power increase given additions
Acknowledgements

The authors would like to acknowledge and thank the many contributions from the rest of the IBM Z design team (Austin, Bangalore, Boeblingen, Haifa, Poughkeepsie, Rochester, Tel Aviv), the IBM EDA team, the IBM Research team, the IBM Systems team, and GlobalFoundries for processing the wafers.