ISSCC2020-02 Visuals Processors
SESSION 2
Processors
Zen 2: The AMD 7nm Energy-Efficient
High-Performance x86-64 Microprocessor Core
© 2020 IEEE 2.1: Zen 2: The AMD 7nm Energy-Efficient High-Performance x86-64 Microprocessor Core 1 of 31
International Solid-State Circuits Conference
Outline
• Motivation
• Market Segments
• Architecture
• Core Complex
• Technology
• Implementation
• SRAMs
• Power
• Silicon Results
• Conclusion
Motivation
• Zen was a huge lift
• Zen2 compelling successor to Zen
• Goals
– Deliver generational performance improvement
above the industry trend
– Enable 2x cores in the same socket
– Improve single thread (1T) performance
• How can we do this?
– Technology port
– Architectural changes
– Physical design and methodology changes
• AMD was aggressive and we did all of the
above to achieve the goals!!
Zen 2 Market Segments
Zen 2 Architecture
• Changes from Zen
– New TAGE Branch Predictor
– Optimized L1 Instruction Cache: 32K/8-way vs. 64K/4-way
– 2X Op Cache Capacity: 4K vs. 2K ops
– 2X Floating Point Data Path Width: 256b vs 128b
– 3rd Address Generation Unit
– Larger Physical Structures: Integer Scheduler, PRF, ROB, Store Queue, L2DTLB
– 2X L1 Data Cache Read/Write Bandwidth
– 2X L3 Cache: 16MB vs. 8MB per Core Complex (CCX)
• +15%¹ single thread (1T) IPC over Zen
• ~9% switching capacitance (CAC) improvement over previous
generation, technology neutral
¹ An AMD “Zen 2” CPU-based system scored an estimated 15% higher than a previous-generation AMD “Zen”-based system using estimated SPECint®_base2006 results.
SPEC and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. See www.spec.org.
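The L1 instruction cache change listed above (32K/8-way vs. 64K/4-way) halves capacity while doubling associativity. As a rough illustration, the sketch below computes how many sets each configuration has. The 64-byte line size and the helper name `num_sets` are assumptions made for this example; the slide gives only capacity and associativity.

```python
def num_sets(capacity_bytes: int, ways: int, line_bytes: int = 64) -> int:
    """Number of sets in a set-associative cache (line size is an assumption)."""
    sets, remainder = divmod(capacity_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("capacity must be a multiple of ways * line size")
    return sets

zen_icache_sets = num_sets(64 * 1024, ways=4)   # Zen: 64K, 4-way
zen2_icache_sets = num_sets(32 * 1024, ways=8)  # Zen 2: 32K, 8-way
print(zen_icache_sets, zen2_icache_sets)
```

Under that assumption the Zen 2 arrangement has a quarter as many sets, each holding twice as many ways.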
Core Functional Units
• 32KB IC
• 32KB DC
• ~20 blocks, ~400K avg instances
• ROM for uCODE
• 5 L1 RAM variants
• Chip Pervasive Logic (CPL) – clock/test block
[Figure: core floorplan; labeled units: uCode, CPL, Decode, I-Cache, Branch Prediction, Floating Point, Scheduler, ALU, Load/Store, Data Cache, L2 Cache]
L2/L3 Cache Hierarchy
• Only 3 unique custom macros
– Down from 8 on Zen
• Each 4M slice is identical
• Multi-stage clock gating in L3 to keep clock distribution power the same as 8M L3 from Zen
• LDOs incorporated into the L3 to supply VDDM to L2 and L3 arrays
– Loss of package distribution of VDDM meant LDOs had to be moved closer
– Must reduce current on VDDM
[Figure: 4M slice with 512K L2; labels: L2 Data, L2 Tags, L2 Status, L3 Data, L3 Tags, CTL, LDOs, shadow tag macros for serving external probes]
Zen 2 Core Complex (CCX)
• 4 core complex
• L3 size increases to 16MB
• Design for flexibility
• Maximize # cores for server case
Zen 2 CCX Configs
HEDT/Server: 4 Core, 16MB L3 CCX
APU: 4 Core, 4MB L3 CCX
Value: 2 Core, 4MB L3 CCX
Zen vs. Zen 2 Technology Comparison
                        Zen (14nm)    Zen 2 (7nm)
CPP                     78 nm         57 nm
Fin Pitch               48 nm         30 nm
1x Metal Pitch          64 nm         57 nm
Stdcell Track Library   10.5 track    6 track
Cu Metal Layers         11 w/ MiM     13 w/ MiM
Zen vs. Zen 2 Technology Comparison (cont)
Zen (14nm)                    Zen 2 (7nm)
M0              n/a           M0              1.0x
M1              1.0x          M1              1.425x
M2-M3           1.0x          M2-M3           1.0x-1.1x
M4-M7           1.25x         M4-M7           2.0x
M8-M9           2.0x          M8-M9           2.0x
---             ---           M10-M11         3.15x
M10-M11 (RDL)   11.25x        M12-M13 (RDL)   18.0x
Layer annotations on the slide: M0 "StdCell Internal" (both nodes); M1 "StdCell Internal & BEOL" and "Stdcell".
Place and Route Design Optimization
• 7nm FinFET presents unique route challenges
– Lower layer jogs forbidden
– Denser standard cells with reduction in track height
– Increased lower level metal resistance
[Figure: same-layer jogs (forbidden) vs. inter-layer jumpers (required)]
Placement Restricted by Large Cells
• Multi-row cells benefit
power and area, but
create placement
challenges
• Clustering of flops has
many benefits but can
cause placement
issues
• Resulting small gaps
are challenging to use
and required innovation
to exploit
– New algorithms
– Flexible power grid
choices
Design RC Miscorrelation
• Pre-route vs Post-route miscorrelation caused by length and layer assumptions
Layer    Normalized Resistance    Normalized Capacitance
M1       1.00                     1.00
Pre-Route Correlation Improvements
• Plots showing ClockTreeSynthesis vs Route timing
• Large variance in initial results
– Large number of paths have overly-pessimistic delay during pre-route steps. Tools waste resources trying to fix them
– Significant number of paths have optimistic delay estimates. These paths are under-optimized
[Figure: initial results; Timing Slack Correlation (cts_vs_route.slack.corr) and Timing Slack Delta histogram (cts_vs_route.slack_delta.hist), with pessimistic and optimistic regions marked]
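A minimal sketch of the kind of comparison described above: given per-path slack after clock tree synthesis and after routing, compute the correlation and label each path by the sign of its slack delta. The sign convention (delta = route slack minus CTS slack), the function names, and the sample values are assumptions for illustration, not the team's actual flow or data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def classify(cts_slack, route_slack, tol=0.0):
    """Label each path by the sign of (route slack - CTS slack)."""
    labels = []
    for cts, route in zip(cts_slack, route_slack):
        delta = route - cts
        if delta > tol:
            labels.append("pessimistic")  # pre-route estimate was worse than final
        elif delta < -tol:
            labels.append("optimistic")   # pre-route estimate was better than final
        else:
            labels.append("matched")
    return labels

# Hypothetical slack values in picoseconds (not measured data).
cts = [-20.0, -5.0, 10.0, 15.0]
route = [-5.0, -5.0, 0.0, 30.0]
print(pearson(cts, route))
print(classify(cts, route))
```

A histogram of the deltas that is narrow and centered near zero would indicate that pre-route estimates track final routing well.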
Wire Engineering Challenges
• Lower layers getting more
resistive with latest
technology nodes
– Very short routes in tight data
paths need a buffer
– Routes longer than Steiner due
to complex rules
– Challenging for optimization
tools to comprehend
• Critical signals need to get to
higher layers quickly
Wire Engineering and Via Ladders
• Team used selective layer optimization, buffering, pre-routes, and via ladders to exploit the fast layers for critical signals
• Two types of via ladders
– High Performance: for large buffers driving long wires
[Figure: top and side via ladder views]
L2/L3 Cache Changes
• Zen 2’s package choices make using package layers for VDDM distribution impossible
• Moved the bitline precharge from VDDM to VDD
[Figure: SRAM array column schematics on VDDM; signals: WL[N:0], BLT[], BLC[], XCENX, WRCS[], WDT_X, WDC_X, NegBL Write Driver, RDCSX[], SAT, SAC, SAPCX, SAEN]
VDD Precharge Challenges
• Moving bitline precharge to VDD creates both
Figure annotations:
– Controller pauses voltage increase and unsets superVminEn register before continuing to raise voltage
– Controller pauses voltage increase and sets superVmaxEn register before continuing to raise voltage
[Figure: labels: System Management, Assist controller, Voltage thresholds, Programming details, Fuses, Assist configurations, SRAMs, VDDmin, superVminEn, superVmaxEn, WLUdEn, NegBlEn]
VDD Precharge Timing Challenges
[Figure: two panels, "Power races with WL" and "Read before write challenges", with WL at constant VDDM. Annotations: BLPCX @ high VDD: bitline precharge turns on before WL turns off at high VDD! BLPCX @ low VDD: WL on before bitline precharge turns off at low VDD!]
• Moving precharge to VDD reduced our current enough to allow on-die distribution but presents other challenges
• Read before write timing challenges at low VDD, high VDDM
Solving Timing Challenges
• Solving these multiple voltage timing challenges required a number of techniques
– Dual voltage clock shapers to average two voltage domains
• Can alter the number of these buffers on VDD or VDDM or remove them entirely to make timing more or less dependent on either supply
– False read before write problem can be mitigated by compressing the front end of the WL during a write operation
[Figure: pseudo-dynamic level shifter (VDD, VDDM, Input@VDD, ISOX@VDDM, LS@VDD, shapedFallInput) and wordline shaping (WLCLK, WLCLK_shape, WREN; WL during read vs. WL during write)]
CAC Comparison
Low Power Gater Latch
Energy with AvgApp Activity (fJ)
State    LP Latch    Regular Latch    Ratio
E=1      0.22        0.18             121%
E=0      0.17        1.61             10%
Total    0.38        1.79             22%
[Figure: low-power and regular gater latch schematics; nodes: CLK, CLKB, CLKBB, Dbar, E, TE, qf, qf_x, Q]
Zen vs. Zen 2 CAC Comparison
Frequency/Power Silicon Results
Conclusion
• Met Goals
– Moved to energy efficient TSMC 7nm finFET
– Made huge architectural changes
– Improved PD and methodology
• Results are clear
– Scalable across 15W mobile to 280W Server
– 50% reduced power at iso-frequency
– Enable 2x cores in same-socket
– >15% 1T IPC over previous generation
– ~9% CAC improvement over previous
generation technology neutral
– Enables peak frequencies up to 4.7GHz
(+350MHz generationally)
• Zen2 delivers generational performance
uplift!!
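To make the CAC figure above concrete, dynamic switching power is commonly modeled as P = CAC x V^2 x f. The sketch below applies that standard first-order model to a ~9% CAC reduction at fixed voltage and frequency. The operating point (1 nF, 1.0 V, 4 GHz) is hypothetical, and the model ignores leakage and any process-related gains, so it does not reproduce the 50% iso-frequency number, which also includes the technology change.

```python
def dynamic_power(cac_farads: float, vdd_volts: float, freq_hz: float) -> float:
    """First-order dynamic power model: P = C * V^2 * f."""
    return cac_farads * vdd_volts ** 2 * freq_hz

# Hypothetical operating point, chosen only for illustration.
baseline = dynamic_power(1.0e-9, 1.0, 4.0e9)
improved = dynamic_power(1.0e-9 * 0.91, 1.0, 4.0e9)  # ~9% lower switching capacitance

saving = 1.0 - improved / baseline
print(f"dynamic power saving at fixed V and f: {saving:.1%}")
```

Because power is linear in CAC in this model, the saving at a fixed voltage and frequency equals the CAC reduction itself.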
Acknowledgements
• We would like to thank our talented
AMD design team across Austin, Fort
Collins, Santa Clara, Boston,
Markham, and India who contributed
to Zen 2
• Please stay for our chiplet paper next
• Please check out our demo, 2.1
tonight in Golden Gate
• Did we mention we have liquid
nitrogen?
AMD Chiplet Architecture
for High-Performance
Server and Desktop Products
Samuel Naffziger
© 2020 IEEE
International Solid-State Circuits Conference 2.2: AMD Chiplet Architecture for High-Performance Server and Desktop Products 1 of 27
Outline
• Results
[Figure: 2nd Gen AMD EPYC™ and 3rd Gen AMD Ryzen™ Processor packages, and the AMD X570 Chipset]
– Server IO Die: 8.34 Billion FETs, 416 mm²
– 7nm Core Complex Die: 3.8 Billion FETs, 74 mm²
– Client IO Die: 2.09 Billion FETs, 125 mm²
Die labels: Zen2 cores, L3, IFOP, 4x DDR, 2x DDR, 4 x16 PCIe/IFIP, PCIe
Motivation and Architectural Goals
Primary goal:
Achieve leadership performance, performance/Watt and
performance/$ in server and desktop markets
• This required
– Exploiting advanced 7nm technology for better performance and
performance/Watt
– Packing more silicon into the package than traditional approaches enable
• While also
– Enabling scalable performance/$ up to performance levels otherwise not
achievable
– Improving memory and IO latency
– Supporting leverage across markets by re-using IP and SOCs
Background: Performance and Die Size Trend
[Figure: Specint®_rate2006 2P Server Performance Trend Over Time¹; throughput performance, Oct-06 to Mar-23, with a 100X marker]
reticle limit and becoming too costly
¹ Su, Lisa, “Delivering the Future of High-Performance Computing”, Hot Chips 31 (2019)
Exploiting 7nm Technology
2X DENSITY¹
>1.25X FREQUENCY¹ (same power)
0.5X POWER¹ (same performance)
• Leadership performance
leaving the IO and memory interfaces in N-1 generation silicon
7nm CCD is 86% CPU + L3
[Figure: CCD floorplan; labels: Zen2 cores, L3, DFx, IFOP SerDes, SMU]
Chiplets Evolved – Hybrid Multi-die Architecture
Traditional Monolithic, 1st Gen EPYC, 2nd Gen EPYC
• Use the Most Advanced Technology Where it is Needed Most
• Each IP in its Optimal Technology, 2nd Gen Infinity Fabric™ Connected
• Centralized I/O Die Improves NUMA
• Superior Technology for CPU Performance and Power
Connecting the Chiplets
Theoretical Interposer-based
limited reach
• Only supports die edge connectivity
[Figure: CCDs connected through an interposer]
CPU Compute Die (CCD) Floorplan
2 CCX core complexes
– 4 core and 16MB L3 each
– Comprise 86% of CCD area
System Management Unit (SMU)
– Microcontroller
[Figure: CCD floorplan; labels: Core0, Core1, Core2, Core3, L3, DFT, IFOP, SMU]
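As a quick check of what the floorplan above implies, the sketch below derives per-CCD core and L3 totals from the stated CCX configuration, and the number of CCDs a given core count requires. It assumes every core on each CCD is enabled; the helper name `ccds_needed` is invented for this example.

```python
CORES_PER_CCX = 4    # from the slide: 4 cores per CCX
L3_MB_PER_CCX = 16   # from the slide: 16MB L3 per CCX
CCX_PER_CCD = 2      # from the slide: 2 CCX per CCD

cores_per_ccd = CORES_PER_CCX * CCX_PER_CCD
l3_mb_per_ccd = L3_MB_PER_CCX * CCX_PER_CCD

def ccds_needed(total_cores: int) -> int:
    """CCDs required for a core count, assuming all cores per CCD are enabled."""
    full, partial = divmod(total_cores, cores_per_ccd)
    return full + (1 if partial else 0)

print(cores_per_ccd, l3_mb_per_ccd, ccds_needed(64))
```

With these figures, a 64-core part such as the EPYC 7742 listed in the performance results would use eight fully enabled CCDs.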
IFOP GEN2 KEY FEATURE SUMMARY AND COMPARISON
Gen1 (14nm)
– Max per lane datarate 6.4Gbps
– Local clock alignment and global tracking
– 4:1 Serialization/Deserialization
– 50/100/200 Ohm drive strength and termination
– Forwarded PHY clocks through package
– Regulated VTT Termination
Gen2 (14nm IOD, 7nm CCD)
– Max per lane datarate 14.6Gbps
– Synchronous clock crossing
– Local CDR
– 10:1 Serialization/Deserialization
– 50 Ohm fixed drive strength and termination
– Local PHY Regulators
– TX and RX T-Coil
– Pseudo-Diff Single Ended Receiver
[Figure: package diagrams; labels: Zen2 CCD, I/O, DDR, Die0-Die3, CCX]
IFOP SerDes Architecture
[Figure: IFOP SerDes link between IOD and CCD over MCM package routes. Labels: PLL, FILT, TXDRV, CLKRX, CORE FCLK, QUAD RXCLK, DIFF TXCLK, 8 GHz clocks; TX x32 / RX x32 with FDO[30:0], FDI[30:0], FWDFCLK[30:0]; new Gen2 features: TX x40 / RX x40 with FDO[38:0], FDI[38:0], FWDFCLK[38:0]. TX lane detail: FDO[9:0], register capture, serializer, 50Ω TX with T-coil. RX lane detail: T-coil, 50Ω RX, trained deserializer, low-latency FIFO, FDI[9:0].]
[Figure: 1st Gen AMD EPYC™ package with Die0-Die3; labels: CCX, DDR, I/O] [Beck ISSCC 2018]
Under-CCD Routing
Routing Infinity Fabric on Package (IFOP) SerDes links from IOD to the
2-deep chiplets required sharing routing layers with off-package
SerDes and competing with power delivery requirements
[Figure: package routing under the CCDs; labels: CCD (x8), SERDES]
Zen vs. Zen 2 VDDM Distribution
Dense SRAMs require a separate rail
[Figure: Zen VDDM distribution via package plane vs. Zen 2 VDDM distribution via RDL only]
Zen 2 VDDM Design Challenges
• RDL is more resistive than a dedicated package layer
• Therefore we reduced overall VDDM current draw by 80% compared to Zen ([Singh ISSCC 2020])
• New, smaller, and distributed LDO design
• Ensured sufficient routing porosity through the integrated LDOs to enable critical routing
• These improvements kept the IR drop to ≈10mV impact
[Figure: CCX floorplan with four Core+L2 blocks and four L3 4MB slices; 4 VDDM LDOs inside the L3; VDDM RDL spanning L2 and L3. Callout: Enables 80 IFOP package routed signals under the CCD]
Package Integration, Server, and Desktop
Zen2 Zen2
CCD CCD
Per-Core Linear Regulation
Regulating the voltage per-core enables power savings by adapting the voltage to each core’s
capability and compensating for power delivery gradients across-package
[Figure: IO die floorplan; labels: MA-MF, G0-G3 PCIe, P0-P3 PCIe, IO0-IO3]
Latency levels: 1: Local 94ns; 2: ~97ns; 3: ~104ns; 4: ~114ns
Footnotes: 1: AMD internal testing with DRAM page miss; 2: EPYC 7002 Series NUMA 1 vs EPYC 7001
3rd Gen AMD Ryzen™ Processor Chiplet Performance vs. Cost
Similar cost savings and scalability for desktop
Re-using the client IO die for the X570 Chipset expander enables optional additional connectivity for higher end systems
• PCIe, SATA, USB
[Figure: Normalized Die Cost (axis 0 to 2.5) for 16 cores and 8 cores; series: Chiplet 7nm + 14nm, Hypothetical Monolithic 7nm]
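One common way to reason about why smaller dies cost less is a Poisson yield model, Y = exp(-A x D0). The sketch below is only an illustration of that textbook model, not AMD's cost methodology: the defect density `D0` is a hypothetical value, and the comparison ignores the IO die, packaging, wafer-edge loss, and per-node wafer pricing. The 74 mm² area is the CCD size quoted earlier in this deck.

```python
from math import exp

def poisson_yield(area_mm2: float, d0_per_cm2: float) -> float:
    """Poisson die-yield model: Y = exp(-A * D0), with A converted to cm^2."""
    return exp(-(area_mm2 / 100.0) * d0_per_cm2)

def silicon_per_good_die(area_mm2: float, d0_per_cm2: float) -> float:
    """Wafer area (mm^2) consumed per good die, ignoring edge loss."""
    return area_mm2 / poisson_yield(area_mm2, d0_per_cm2)

D0 = 0.2  # hypothetical defects per cm^2, chosen only for illustration

two_chiplets = 2 * silicon_per_good_die(74.0, D0)    # two separate 74 mm^2 CCDs
merged_die = silicon_per_good_die(2 * 74.0, D0)      # same silicon as one larger die

print(two_chiplets, merged_die)
```

Under this model the merged die always consumes more wafer area per good unit than the equivalent set of chiplets, and the gap widens as die area or defect density grows.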
Performance Results
Chiplet architecture enables leadership performance and
performance/Watt in server and desktop markets
Metric at 105W TDP¹              Ryzen 2700X (8C)    Ryzen 3950X (16C)    Improvement (%)
Cinebench r15 1T                 177                 216                  22%
Cinebench r20 1T                 434                 527                  21%
Cinebench r15 NT                 1802                3928                 118%
Cinebench r20 NT                 4020                8862                 120%
1T Fmax (Max Boost), GHz         4.3                 4.7                  9%
NT Base Freq (All-core)¹, GHz    3.9                 3.95                 1%
¹ Testing as of 12/13/2019 by AMD Performance Labs using a Ryzen 9 3950X with 16 cores vs. a Ryzen 7 2700X with 8 cores in the Cinebench R20 1T benchmark test. Results may vary. RZ3-102

Metric                           EPYC 7601 (32C 2P, 180W TDP)    EPYC 7742 (64C 2P, 225W TDP)    Improvement (%)
SPECrate®2017_int_base²          272                             663                             144%
SPECrate®2017_fp_base²           259                             511                             97%
NT Base Freq, GHz                2.2                             2.5                             14%
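The improvement column in the tables above is the relative gain of the newer part over the older one, rounded to a whole percent. A minimal sketch of that arithmetic, using the scores from the tables (the function name `improvement_pct` is chosen for this example):

```python
def improvement_pct(old: float, new: float) -> int:
    """Relative improvement of `new` over `old`, rounded to a whole percent."""
    return round((new / old - 1.0) * 100.0)

desktop = {
    "Cinebench r15 1T": (177, 216),
    "Cinebench r20 1T": (434, 527),
    "Cinebench r15 NT": (1802, 3928),
    "Cinebench r20 NT": (4020, 8862),
    "1T Fmax (Max Boost)": (4.3, 4.7),
    "NT Base Freq (All-core)": (3.9, 3.95),
}
server = {
    "SPECrate2017_int_base": (272, 663),
    "SPECrate2017_fp_base": (259, 511),
    "NT Base Freq": (2.2, 2.5),
}

for table in (desktop, server):
    for metric, (old, new) in table.items():
        print(f"{metric}: {improvement_pct(old, new)}%")
```

Each computed value matches the percentage shown on the slide.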
2: Results obtained from the SPEC® website as of Jan 3, 2020.
EPYC 7601 SPECrate®2017_int_base: https://www.spec.org/cpu2017/results/res2017q4/cpu2017-20171114-00833.html
EPYC 7601 SPECrate®2017_fp_base: https://www.spec.org/cpu2017/results/res2017q4/cpu2017-20171114-00845.html
EPYC 7742 SPECrate®2017_int_base: https://www.spec.org/cpu2017/results/res2019q4/cpu2017-20191028-19261.html
EPYC 7742 SPECrate®2017_fp_base: https://www.spec.org/cpu2017/results/res2019q4/cpu2017-20191028-19237.html
More information about SPEC CPU® 2017 can be obtained from https://www.spec.org/cpu2017. SPEC®, SPEC CPU® and SPECrate® are registered trademarks of the Standard Performance Evaluation Corporation.
Summary
• Chiplet architecture has proven key to achieving leadership performance, performance/$ and performance/Watt across multiple market segments
• Package level fabric and interconnect architecture
• Power delivery and voltage adaptation
[Figure: server and client package diagrams; labels: Zen2 cores, L3, IFOP, 4x DDR, 2x DDR, 4 x16 PCIe/IFIP, PCIe]
Acknowledgment
Disclaimer and Endnotes
DISCLAIMER
The information contained herein is for informational purposes only, and is subject to change without notice. While every
precaution has been taken in the preparation of this document, it may contain technical inaccuracies, omissions and
typographical errors, and AMD is under no obligation to update or otherwise correct this information. Advanced Micro
Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the contents of this
document, and assumes no liability of any kind, including the implied warranties of noninfringement, merchantability or
fitness for particular purposes, with respect to the operation or use of AMD hardware, software or other products described
herein. No license, including implied or arising by estoppel, to any intellectual property rights is granted by this
document. Terms and limitations applicable to the purchase or use of AMD’s products are as set forth in a signed
agreement between the parties or in AMD's Standard Terms and Conditions of Sale. GD-18
©2020 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, EPYC, RYZEN, Threadripper, Infinity
Fabric, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this
publication are for identification purposes only and may be trademarks of their respective companies.
A 220GOPS 96-core Processor with 6 Chiplets
3D-stacked on an Active Interposer Offering
0.6ns/mm Latency, 3TBit/s/mm² inter-Chiplet Interconnects
and 156mW/mm²@82% Peak-Efficiency DC-DC Converters
Pascal Vivet¹, Eric Guthmuller¹, Yvain Thonnart¹, Gaël Pillonnet², Guillaume Moritz²,
Ivan Miro-Panades¹, César Fuguet¹, Jean Durupt¹, Christian Bernard¹, Didier Varreau¹, Julian Pontes¹,
Sébastien Thuriès¹, David Coriat¹, Michel Harrand¹, Denis Dutoit¹, Didier Lattard¹, Lucile Arnaud²,
Jean Charbonnier², Perceval Coudrain², Arnaud Garnier², Frédéric Berger², Alain Gueugnot²,
Alain Greiner³, Quentin Meunier³, Alexis Farcy⁴, Alexandre Arriordaz⁵, Séverine Cheramy², Fabien Clermidy¹
pascal.vivet@cea.fr
¹Univ. Grenoble Alpes, CEA, LIST; ²Univ. Grenoble Alpes, CEA, LETI; ³Sorbonne Université, LIP6;
⁴STMicroelectronics; ⁵Mentor, a Siemens Business
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 1 of 34
High Performance Computing & Big Data
• More cores + more accelerators + more memory
– Similar constraints are appearing for embedded HPC
(Automotive, etc)
– Need both highly optimized generic and specialized functions
(i.e. ML/AI accelerator)
– Need a « go-to-market » solution for sustainable system differentiation
• Chiplet challenges ?
– Eco-system maturity,
– Technology & Architecture partitioning,
– Chiplet Interfaces, testability, 3D CAD flow, etc
Chiplet Partitioning : Solutions and Limitations
• Existing technologies
– Organic Substrates: AMD, 4-chiplet circuit, ISSCC’2018
– Passive interposer (2.5D): TSMC, CoWoS, VLSI’2019
– Silicon bridges: INTEL, EMIB bridge, ISSCC’2017
Chiplets:
– Clusters of Cores
Active Interposer:
– Power Management, close to cores
– SoC infrastructure: Analog, IOs, PHY, DFT
– Additional features
– Mature CMOS technology (with low logic density to preserve system cost)
Outline
• Introduction
– Chiplet-partitioning and Active Interposer concept
• Circuit architecture
– Circuit overview
– Chiplet overview
• Active interposer design details
– System Interconnects
– Power Management
• Circuit results & performances
• Conclusions & Perspectives
6 Chiplets 3D-stacked on an Active Interposer
• Chiplet Overview
– 4 clusters of 4 cores
• Active Interposer
– Integrated SCVRs (1/chiplet)
– Memory Controller & System IO’s
– SOC Infrastructure, DFT
[Figure: cross-section; chiplets (16 cores each: clusters, L3, SoC infrastructure) on the active interposer (distributed NoCs with routers & pipelined links, power management, memory-IO, Cfg), C4 bumps Ø90µm, package substrate, balls Ø500µm; supply labels: 1.5 - 2.5 VDD-chiplet, 1.2 VDD-interpo; Clk, Rst, Config, Test; off chip links]
6 Chiplets 3D-stacked on an Active Interposer
• Chiplet Overview
– 4 clusters of 4 cores
– Distributed L1$ + L2$ + L3$
– Scalable Cache Coherency
• Active Interposer
– Distributed flexible interconnects
– Integrated SCVRs (1/chiplet)
– Memory Controller & System IO’s
– SOC Infrastructure, DFT
[Figure: 6 chiplets (FDSOI28) on the active interposer (CMOS65)]
96 cores: 6 chiplets 3D-stacked on active CMOS interposer
2 technology nodes difference between chiplets & bottom die
Chiplet Main Features
• 16 x MIPS® 32-bit scalar cores
• Memory is physically distributed through chiplet L2-caches + Virtual Memory support
– L1 I-caches + D-caches (16 kB / core)
– Distributed Shared L2-caches (256 kB / cluster)
– Adaptive & fault tolerant L3-caches (4 tiles of 1 MB)
• Directory-based cache coherence with linked-list directory [5]
• 2D-mesh NoCs, extended through the interposer
• FDSOI 28nm, LPLV, [0.5-1.3V], with Body Biasing
– FLLs, Timing Fault Sensors, Thermal Sensors
[Figure: NoC links L1-L2, L2-L3, and L3-ExtMem from/to active interposer]
[5] E. Guthmuller et al, “A 29 Gops/Watt 3D-Ready 16-Core Computing Fabric with Scalable Cache Coherent Architecture Using Distributed L2 and Adaptive L3 Caches”, ESSCIRC’2018.
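The per-chiplet figures above can be rolled up to the full 6-chiplet system. The sketch below does that arithmetic for cores, L2, and L3; the constant names are chosen for this example, and L1 is left out because the slide does not say whether the 16 kB per core applies to each of the I- and D-caches or to both combined.

```python
CHIPLETS = 6               # 6 chiplets stacked on the interposer
CLUSTERS_PER_CHIPLET = 4   # 4 clusters per chiplet
CORES_PER_CLUSTER = 4      # 4 cores per cluster
L2_KB_PER_CLUSTER = 256    # distributed shared L2 per cluster
L3_TILES_PER_CHIPLET = 4   # 4 L3 tiles per chiplet
L3_MB_PER_TILE = 1

cores_per_chiplet = CLUSTERS_PER_CHIPLET * CORES_PER_CLUSTER
total_cores = CHIPLETS * cores_per_chiplet
total_l2_mb = CHIPLETS * CLUSTERS_PER_CHIPLET * L2_KB_PER_CLUSTER / 1024
total_l3_mb = CHIPLETS * L3_TILES_PER_CHIPLET * L3_MB_PER_TILE

print(cores_per_chiplet, total_cores, total_l2_mb, total_l3_mb)
```

The core count agrees with the 96 cores in the paper title.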
System Level Interconnects
[Figure: system-level interconnect diagram; labels: TAP, PE2, PE3, L1D, TG, 3D Plug(s)]
3D-Plug Communication IP : synchronous version
– NoC Virtualization
– High throughput
– Low latency
• Circuit Design
– Credit-based multi-channel synchronization
– Source synchronous scheme, with delay compensation
– Integrates: µ-bumps + µ-buffers + DFT (BoundaryScan)
Full digital design, full swing logic, no DLL, 3D fine pitch parallel interface
[Figure: 3D-Plug TX and RX with controller, NoC virtual channel inputs and outputs, credit return, and clocks Clk, CLK_TX, CLK_RX with phase shift φ]
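The credit-based multi-channel synchronization named above can be pictured with a small software model: the sender holds one credit per free slot in the receiver's buffer, spends a credit per transfer, and gets it back when the receiver drains an entry. This is a behavioral sketch only, not the 3D-Plug circuit; the class name `CreditLink`, the FIFO depth, and the channel names are assumptions, and the source-synchronous clocking and delay compensation are not modeled.

```python
from collections import deque

class CreditLink:
    """Sender-side credit counter paired with a bounded receiver FIFO."""

    def __init__(self, fifo_depth: int):
        self.depth = fifo_depth
        self.credits = fifo_depth   # one credit per free receiver slot
        self.fifo = deque()

    def send(self, flit) -> bool:
        """Transmit only if a credit is available; return True on success."""
        if self.credits == 0:
            return False            # back-pressure: the sender must wait
        self.credits -= 1
        self.fifo.append(flit)
        return True

    def receive(self):
        """Drain one flit at the receiver and return its credit to the sender."""
        flit = self.fifo.popleft()
        self.credits += 1
        return flit

# One independent credit loop per virtual channel (hypothetical names and depth).
channels = {vc: CreditLink(fifo_depth=2) for vc in ("vc0", "vc1")}

sent = [channels["vc0"].send(i) for i in range(3)]  # third send is refused
other_ok = channels["vc1"].send("x")                # vc1 is not blocked by vc0
first = channels["vc0"].receive()                   # frees one slot on vc0
retry = channels["vc0"].send(2)                     # now succeeds
print(sent, other_ok, first, retry)
```

Keeping a separate credit loop per channel is what lets one stalled virtual channel leave the others free to make progress.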
3D-Plug Communication IP : layout overview
3D-Plug:
• Logic interface
• µ-bumps
• µ-buffer std-cells
• DFT
µ-buffer std-cell: BiDir Driver + ESD + Pull-Up + Level-Shifter
[Figure: chiplet layout showing the 3D-Plug interfaces]
3D-Plug Communication IP : sync. version perf.
[2] : Mu-Shan Lin et al., “A 7nm 4GHz Arm®-core-based CoWoS® Chiplet Design
for High Performance Computing”, Symposium on VLSI circuits, June 2019.
System Level Interconnects : L1-L2
• L1-L2 interconnect
– 3D-Plug sync. version + passive links
– Synchronous NoC routers (within chiplets)
– Global clocking + clock gating
• Performances
– 3D-Plug interface throughput : 1.25 GHz
– SNOC local throughput : 1 GHz
– Large end-to-end latency : 44 ns (44 cycles) (re-timing and re-synchronization)

                      L1-L2 nearest    L1-L2 farthest    Units
Interposer            1 passive link   3 passive links   -
3D Plug frequency     1.25             1.25              GHz
2D NoC frequency      -                1.00              GHz
End to end latency    2x4+[0-1]        44                cycles
                      7.2              44.0              ns
Propagation speed     4.8              2.9               ns/mm
Energy / bit / mm     0.29             0.15              pJ/bit/mm

[Figure: Chiplet 00, Chiplet 11 and Chiplet 12 at 1.0V, linked through Sync. 3D-Plugs, Sync. routers and 8 FIFOs across the interposer; M3-M5 passive routing with clock shielding; annotations "5 mm, 8 ns, 0.7 pJ/bit" and "1.5 mm, 7.2 ns, 0.75 pJ/bit"]
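The latency rows on this slide can be cross-checked with simple arithmetic. The sketch below assumes that the 2x4+[0-1] cycle count is expressed in 1.25 GHz 3D-Plug cycles and that the nearest link is the 1.5 mm path annotated in the figure; both assumptions are consistent with the 7.2 ns and 4.8 ns/mm values shown. The farthest-path length is not stated on the slide, so the last line only derives the value implied by 44 ns and 2.9 ns/mm.

```python
def cycles_to_ns(cycles, freq_ghz):
    """Convert a cycle count at a given clock frequency to nanoseconds."""
    return cycles / freq_ghz


def ns_per_mm(latency_ns, distance_mm):
    """Propagation speed as reported on the slide (lower is faster)."""
    return latency_ns / distance_mm


# Nearest L1-L2 path: worst case of 2x4+[0-1] cycles at 1.25 GHz.
nearest_ns = cycles_to_ns(2 * 4 + 1, 1.25)
nearest_speed = ns_per_mm(nearest_ns, 1.5)

# Farthest L1-L2 path: 44 cycles of the 1 GHz synchronous NoC.
farthest_ns = cycles_to_ns(44, 1.00)

# Derived, not stated on the slide: length implied by 2.9 ns/mm.
implied_farthest_mm = farthest_ns / 2.9
```

Under these assumptions the nearest path evaluates to 7.2 ns and 4.8 ns/mm, and the farthest path implies a route of roughly 15 mm.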
3D-Plug Communication IP : asynchronous version
• Asynchronous Logic
– Quasi-Delay-Insensitive (QDI) logic
– Use of 1-of-4 data encoding
– Deep pipelining, achieving low latency
• Circuit Implementation
– Use 4-phase for on-die communication (Active interposer)
– Use 2-phase for off-die communication (3D-Plug interface)
– Use 4-phase / 2-phase protocol converters
Robust Asynchronous design: no clocking at 3D interface [6]
2-phase protocol to reduce penalty of 3D-interface delays
[Figure: 2-phase and 4-phase handshake waveforms; 1-of-4 asynchronous pipeline stage (C-element gates)]
[6] P. Vivet et al., “A 4x4x2 Homogeneous Scalable 3D Network-on-Chip Circuit with 326 MFlit/s 0.66 pJ/bit Robust and Fault Tolerant Asynchronous 3D links”, ISSCC’2016.
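To make the 1-of-4 data encoding concrete, here is a minimal sketch of the textbook scheme: every pair of bits drives a group of four wires, exactly one of which is asserted, and an all-zero group acts as the spacer used by a 4-phase (return-to-zero) protocol. The function names and the 8-bit word width are assumptions for illustration, not details from the slides.

```python
def encode_1of4(value, n_bits=8):
    """Encode an integer as 4-wire groups, two bits per group, LSB group first."""
    groups = []
    for shift in range(0, n_bits, 2):
        symbol = (value >> shift) & 0b11
        wires = [0, 0, 0, 0]
        wires[symbol] = 1  # exactly one wire is asserted per group
        groups.append(wires)
    return groups


def is_valid(groups):
    """Completion detection: data is valid once every group has one wire high."""
    return all(sum(wires) == 1 for wires in groups)


def decode_1of4(groups):
    """Recover the integer from a valid list of 1-of-4 groups."""
    value = 0
    for i, wires in enumerate(groups):
        value |= wires.index(1) << (2 * i)
    return value


word = encode_1of4(0xB4)
spacer = [[0, 0, 0, 0]] * 4  # return-to-zero state between two data tokens
```

Since validity is carried by the data wires themselves, the receiver can detect arrival without a clock, which is what makes the encoding delay-insensitive.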
System Level Interconnects : L2-L3
LVDS PHY
Off-Chip traffic
Switched Cap Voltage Regulators : Principle
• Distributed power supply units
– DVFS local scheme, below each chiplet
– Fast transitions & reduced IR-drop effects
– “High” input voltage (up to 2.5V), reduces #PG IOs in the package
• Fully Integrated
– No external passive components, Thick oxide transistors
– On-chip CAPs only (MOS+MOM+MIM 8.9 nF/mm2)
– 50% of chiplet area, fault tolerant, in the interposer
– PG delivery as a µ-bump flip-chip matrix
[Figure: DC-DC converter in the interposer, fed from VIN through TSVs and delivering P/G to chiplet VOUT through µ-bumps]
Switched Cap Voltage Regulators : Circuit Design
• Circuit design
– 3-stage gear box, 7 voltage ratios
– VIN [1.8V – 2.5V] ; VOUT [0.35V – 1.3V]
– Tile-based layout in a checker board pattern
– Central clock frequency, feedback controller
[Figure: replicated unit cell @ C4-bump pitch, x270 cells]
Switched Cap Voltage Regulators : Circuit Results
• Circuit design
– 3-stage gear box, 7 voltage ratios
– VIN [1.8V – 2.5V] ; VOUT [0.35V – 1.3V]
– Tile-based layout in a checker board pattern
– Central clock frequency, feedback controller
[Figure: replicated unit cell @ C4-bump pitch]
Outline
• Introduction
– Chiplet-partitioning and Active Interposer concept
• Circuit architecture
– Circuit overview
– Chiplet overview
• Active interposer design details
– System Interconnects
– Power Management
• Circuit results & performances
• Conclusions & Perspectives
Circuit Overview
• Die technologies
– Chiplet: FDSOI 28nm, ULV + BodyBias, 22mm2
– Active Interposer: CMOS 65nm, MIM option, 200mm2
• 3D technology integration
– µ-bumps, 20µm pitch (150 k)
– TSV middle, 40 µm pitch
– Face2Face assembly on package substrate
– 6 chiplets
– Interposer logic & interconnect (w.o. IOs): 3% only of overall budget
– SCVR: 17% of overall power budget
[Figure: 3D cross-section and chiplet front-face; power budget chart with labels Chiplet 3, Chiplet 4, Chiplet 5 and Cores + L1 55%]
Circuit Performance : SCVR efficiency
• Switched Cap Voltage Regulator (SCVR)
– SCVR configured at best ratio according to chiplet voltage
• 3:1 => 2:1 => 3:2 @ fixed VIN 2.5V
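The ratio sequence above can be illustrated with the textbook first-order model of a switched-capacitor converter: a ratio n:m ideally produces VIN·m/n, and regulating below that ideal output bounds the efficiency at VOUT divided by the ideal output. The sketch below is only that idealization, not the measured behaviour behind the 82% peak figure; it covers just the three ratios named on this slide (the remaining ratios of the 7-ratio gear box are not listed), and the function names are assumptions.

```python
from fractions import Fraction

VIN = 2.5  # fixed input voltage quoted on the slide, in volts

# Step-down ratios named on the slide, expressed as VIN / ideal VOUT.
RATIOS = {"3:1": Fraction(3, 1), "2:1": Fraction(2, 1), "3:2": Fraction(3, 2)}


def ideal_vout(ratio, vin=VIN):
    """No-load output voltage of an ideal switched-capacitor stage."""
    return vin / float(ratio)


def best_ratio(v_target, vin=VIN):
    """Pick the largest step-down ratio whose ideal output still reaches v_target.

    Returns (ratio name, ideal efficiency bound v_target / ideal output).
    """
    feasible = [
        (ratio, name)
        for name, ratio in RATIOS.items()
        if ideal_vout(ratio, vin) >= v_target
    ]
    if not feasible:
        raise ValueError("target voltage is above the reach of the listed ratios")
    ratio, name = max(feasible)
    return name, v_target / ideal_vout(ratio, vin)


low = best_ratio(0.8)    # served by 3:1
mid = best_ratio(1.0)    # served by 2:1
high = best_ratio(1.3)   # served by 3:2
```

As the chiplet voltage rises, the selected ratio steps from 3:1 to 2:1 to 3:2, matching the order given on the slide; the bound is highest when the target sits just below a ratio's ideal output.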
Comparison with State-of-the-Art
                        This work                  [4] ISSCC'2018 INTEL         [1] ISSCC'2018 AMD   [2] VLSI'2019 TSMC   [3] ISSCC'2017 INTEL   Units
Chiplet Technology      FDSOI 28nm                 FinFET 14nm                  FinFET 14nm          FinFET 7nm           FinFET 14nm
Interposer Technology   Active, CMOS 65nm          no                           MCM substrate        Passive CoWoS®       EMIB bridge
Technology              with MIM, with MOS+MOM+MIM SCVR with MIM
VR area                 34% of active interposer   MIM above 40% of core area   -                    N/A                  N/A
VR peak efficiency      82%                        72%                          LDO limited          N/A                  N/A
Comparison with State-of-the-Art
[4] ISSCC'2018 [1] ISSCC'2018 [2] VLSI'2019 [3] ISSCC'2017
This work Units
INTEL AMD TSMC INTEL
Distributed NoC meshes
Scalable Data TM
Interconnect types for scalable chip-to-chip N/A LIPINCON links AIB interconnect
Interconnect
Fabric (SDF)
cache-coherency traffic
3D Plug power efficiency 0.59 N/A 2.0 0.56 1.2 pJ/bit
BW density 3.0 N/A - 1.6 1.5 Tb/s/mm2
Aggregate 3D bandwidth 527 N/A - 640 504 GByte/s
1 FPGA fabric
Number of chiplets 6 1 1-4 2
6 transceivers
CPU
First Active Interposer, with distributed NoC meshes and 3.0 Tb/s/mm2
interfaces, offering a total of 96 cores
Conclusions and Perspectives
• Active Interposer & chiplet partitioning
– Integration of : Interconnects, Power management, IOs,
– Scalable cache coherency protocol
– 3 TBit/s/mm2 3D interface achieved
– Low latency 0.6ns/mm long-reach asynchronous interconnect
– Power management @ 82% efficiency, close to the cores, w.o. passives
Increase the system energy efficiency and the on-chip memory bandwidth per core
• Perspectives ?
– Progressive setup of a chiplet eco-system
– Active interposer, an enabler for differentiation : integrating heterogeneous functions & chiplets
• Acknowledgments
This work was partly funded by the French National Program
Programme d’Investissements d’Avenir IRT Nanoelec under
Grant ANR-10-AIRT-05
A 7nm High-Performance and Energy-Efficient
Mobile Application Processor with Tri-Cluster
CPUs and a Sparsity-Aware NPU
Young Duk Kim,
Wookyeong Jeong, Lakkyung Jung, Dongsuk Shin,
Jae Geun Song, Jinook Song, Hyeokman Kwon, Jaeyoung Lee,
Jaesu Jung, Myungjin Kang, Jaehun Jeong, Yoonjoo Kwon,
Nak Hee Seong
© 2020 IEEE 2.4: A 7nm High-Performance and Energy-Efficient Mobile Application Processor with Tri-Cluster CPUs and a Sparsity-Aware NPU
International Solid-State Circuits Conference 1 of 26
Background
• Conclusion is “High performance” and “Low power”
Outline
• 7nm power efficient Exynos AP processor
• Tri-cluster CPUs
• Power efficient architecture
• NPU
• Skipping zero weights operation
• HWACG (HW Auto Clock Gating)
• Clock power reduction in idle state
• Droop detector
• Reducing voltage droop
• 7nm process
• Enhancing AC performance
[Figure: chip floorplan with labels Big CPUs, Middle/Little CPUs, GPU, NPU]
CPU Clusters
• Eight CPU cores with three different classes
• Two big cores (M4): 2.73GHz, two mid cores (CA75): 2.4GHz,
four little cores (CA55): 2.0GHz
• Heterogeneous Multi-Processor governed by Energy-aware scheduler
Coherent Interconnect
Big Custom CPU (1)
• Instruction Front-end Architecture
• 6 micro-Ops bandwidth for decode, rename, dispatch and retire
[Figure: front-end block diagram with Branch Predict, uBTB, Main BTB, L2 BTB, Address Queue, Dispatch Queue]
*M3 specification
Big Custom CPU (2)
• Integer and Load/store Execution Pipes
• Two simple ALUs + two complex ALUs
• AGUs: 1 Load + 1 Load/Store(1Store*) + 1 Store
• Improved memory latency through direct path from memory controller
• 1MB(512KB*) private L2 cache per core
• 3MB(4MB@4cores*) shared L3 cache
• 48(32*)-entry DTLB, 512-entry BDTLB, 4K entry L2 UTLB
[Figure: execution block diagram: Integer Schedulers (1BR, 2CALU, 2ALU, 1LD, 1LD/ST, 1ST, 1ST-D), Integer PRF, ALU, MUL, DIV and BR pipes, LD, LD/ST and ST AGUs, ST DATA, ST Queue, BDTLB, DTLB/TAG, L2 UTLB, 64KB D-Cache, Table Walk, Prefetch]
*M3 specification
Big Custom CPU (3)
• Floating-point Execution Pipes
[Figure: Floating-point Scheduler block diagram]
Big Custom CPU
• Samsung 4th generation Custom CPU
• Significantly improved memory subsystem performance
[Figure: chart comparing M3, M4 and Cortex-A76; vertical axis from 0.50 to 1.90; the higher the better]
Single-Thread Performance
• Desktop-class single-thread performance
• Average 23% single-thread performance uplift from 3rd generation
Tri-cluster management (1)
• Allows seamless performance transition by the middle CPU
• A single heavy task can be selectively assigned to middle or big CPU.
Tri-cluster management (2)
• In most user scenarios, the main workload can be covered by the middle
CPU instead of the big CPU, reducing the absolute power consumption.
• CPU total power comparison (big/Little @10nm vs. Tri-cluster @7nm)
Tri-cluster management (3)
• Allows various options on workload scheduling
• A single heavy task running on a little CPU can be selectively migrated
to middle or big CPU (depending on demand performance).
[Figure: power versus utilization for a task running on a little core; Demand Perf #1, #2 and #3; max capacity (little) and max capacity (middle); Option #1 and Option #2; min power @ MED performance; Little, Middle]
Tri-cluster management (4)
• Enhanced work load scheduling method based on ISA (Instruction Set
Architecture) where only 32bit/64bit energy efficiency is considered.
• The energy efficiency of each cluster is different for ISA mode, so
workloads should be scheduled based on a different energy model.
• The energy model is newly designed considering 32-bit/64-bit energy
efficiency.
• The CPU power is improved by over 30% in specific scenarios such as
32-bit games.
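The idea of scheduling with an ISA-dependent energy model can be sketched as follows. All capacities and energy-per-work values below are hypothetical numbers chosen only to show the mechanism; they are not taken from the slides, and `pick_cluster` is an illustrative helper, not the scheduler used in the product.

```python
# Hypothetical per-cluster model: capacity in arbitrary work units and
# energy cost per unit of work, which differs between ISA modes.
CLUSTERS = {
    "little": {"capacity": 400, "energy": {"64bit": 1.0, "32bit": 1.0}},
    "middle": {"capacity": 800, "energy": {"64bit": 2.5, "32bit": 2.0}},
    "big": {"capacity": 1024, "energy": {"64bit": 2.2, "32bit": 3.5}},
}


def pick_cluster(demand, isa):
    """Return the cluster that meets `demand` at the lowest estimated energy."""
    candidates = [
        (spec["energy"][isa] * demand, name)
        for name, spec in CLUSTERS.items()
        if spec["capacity"] >= demand
    ]
    if not candidates:
        return "big"  # demand exceeds every capacity: saturate on the big cluster
    return min(candidates)[1]


choice_64 = pick_cluster(700, "64bit")   # big is cheaper in this made-up model
choice_32 = pick_cluster(700, "32bit")   # the same demand moves to middle
choice_low = pick_cluster(300, "32bit")  # light load stays on little
```

With a single ISA-agnostic model the two 700-unit tasks would land on the same cluster; keeping separate 32-bit and 64-bit costs is what lets a 32-bit workload move from the big to the middle CPU.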
Tri-cluster management (5)
• The measurement results for ISA based scheduling method
• CPU total power comparison in Lineage2 game
• Big CPU’s tasks moved to Mid CPU, so total power decreased.
Game power
(Normal scheduling vs ISA Aware scheduling)
Sparsity-Aware Neural Processing Unit
• 1024 MACs always consume the incoming data every cycle.
• Data staging units dispatch corresponding input feature-maps and skip 0-weights
• Activation Function Units perform ReLU-family activation.
• HW automatic clock gating (HWACG) is applied on module levels.
[Figure: NPU block diagram: BUS, two 512-KB Scratchpads, Dual MAC Arrays, Data Returning Units, Activation Function Units]
Skipping Convolution: Moving OFM
• Input feature-maps are buffered for matching with nonzero weight positions
[Figure: weight kernel with non-zero weights (1, 3, 5, 6, 7) aligned against the input feature map; two plots versus Weight Pruning Rate from 5% to 80%, vertical axes 0 to 100 %]
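A software analogue of zero-weight skipping is sketched below: the kernel is reduced once to its nonzero taps, and only those taps are multiplied with the matching input positions. This is a generic illustration with made-up data, not the NPU's dataflow; the function names are assumptions.

```python
def nonzero_positions(kernel):
    """List (row, col, weight) for every nonzero weight in the kernel."""
    return [
        (r, c, w)
        for r, row in enumerate(kernel)
        for c, w in enumerate(row)
        if w != 0
    ]


def conv2d_skip_zero(ifm, kernel):
    """Valid 2D convolution (no kernel flip) that skips zero weights.

    Returns the output feature map and the number of MACs performed.
    """
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(ifm) - kh + 1, len(ifm[0]) - kw + 1
    taps = nonzero_positions(kernel)
    ofm = [[0] * ow for _ in range(oh)]
    macs = 0
    for y in range(oh):
        for x in range(ow):
            for r, c, w in taps:
                ofm[y][x] += w * ifm[y + r][x + c]
                macs += 1
    return ofm, macs


ifm = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
kernel = [[1, 0, 0], [0, 2, 0], [0, 0, 0]]  # 7 of 9 weights pruned to zero
ofm, macs = conv2d_skip_zero(ifm, kernel)
dense_macs = 2 * 2 * 9                       # MACs a dense 3x3 kernel would need
```

In this toy case the skipped version performs 8 MACs instead of 36, so the work scales with the number of surviving weights rather than with the kernel size.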
HWACG (HW Automatic Clock Gating)
• Reduce clock tree power from PLL to IPs in idle state
• Q-channel interface Protocol between IP and HWACG Controller
• Hierarchical architecture composed of parents and children
• If all the IPs attached to the Clock Gating cell are idle, the clock is gated.
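The parent/child gating rule can be modelled as a tree in which a clock gate closes only when everything beneath it is idle. The sketch below is a behavioural illustration; the `active` flag merely stands in for the activity indication an IP would give over its Q-channel, and the node names are hypothetical.

```python
class ClockNode:
    """A clock-gating cell (internal node) or an IP (leaf) in the clock tree."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []
        self.active = False  # only meaningful for leaf IPs

    def needs_clock(self):
        """A leaf needs the clock when active; a parent when any child does."""
        if not self.children:
            return self.active
        return any(child.needs_clock() for child in self.children)

    def gated(self):
        """The clock is gated when nothing below this node needs it."""
        return not self.needs_clock()


ip0, ip1, ip2 = ClockNode("ip0"), ClockNode("ip1"), ClockNode("ip2")
blk_a = ClockNode("blk_a", [ip0, ip1])
blk_b = ClockNode("blk_b", [ip2])
root = ClockNode("pll_root", [blk_a, blk_b])

all_idle = root.gated()  # every IP idle: the whole tree can be gated
ip2.active = True        # one IP wakes up
state = (blk_a.gated(), blk_b.gated(), root.gated())
```

With one IP active, only the branch leading to it stays clocked, which is how the hierarchy saves clock-tree power from the PLL down to the IPs.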
HWACG – EWS(Early Wake-up System)
• The early wake-up system reduces the cumulative latency.
• Latency issue can appear especially when multi-layered bus uses
different PLL sources
• A clock request from a latency critical IP is delivered to multiple target
domains to wakeup multiple IPs instead of sequential wakeup process.
[Figure: an EWR in BLK_CMU distributes EARLY_WAKEUP__MAST_# requests from Master_#1 to Master_#n to the EWG and CMU of BLK_#1 to BLK_#n; ACTIVE__CMU_# signals return to the CMU]
-EWR : Early Wakeup Router
-EWG : Early Wakeup Generator
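A first-order way to see the benefit: if each clock domain on the path wakes only after the previous one, the latencies add up, whereas a request broadcast to all target domains at once is bounded by the slowest domain. The per-domain wake-up latencies below are hypothetical values for illustration, and the model ignores routing overhead.

```python
def sequential_wakeup_ns(domain_latencies_ns):
    """Each domain is woken only after the previous one is running."""
    return sum(domain_latencies_ns)


def early_wakeup_ns(domain_latencies_ns):
    """All target domains receive the request at the same time."""
    return max(domain_latencies_ns)


latencies = [40, 60, 50]  # hypothetical wake-up latency of three domains, in ns
sequential = sequential_wakeup_ns(latencies)
parallel = early_wakeup_ns(latencies)
```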
Voltage Droop Mitigation (1)
• Voltage droop mitigation solution
• Droop detector (DD) monitors voltage droop in target domain.
• (1) When voltage droop happens under threshold values,
• (2) Droop-detected flag is asserted
• (3) Then, CMU divides the clock to IP by half to reduce load current.
Voltage Droop Mitigation (2)
• Voltage droop mitigation solution
• One sensor in GPU
• Calibration done for each DVFS level.
• Vmin gain by 12.5mV.
Voltage Droop Detector
• Ring oscillator-type droop detector
• Measures voltage levels through change of RO’s speed
• It counts RO clocks within a programmable time window.
• When the counter value is smaller than programmable threshold values,
droop-detected flag is asserted
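The detector and the clock-halving response can be sketched with a toy model. It assumes, purely for illustration, that the ring-oscillator frequency is proportional to the supply voltage (2000 MHz per volt); the real RO characteristic, window length and thresholds are not given on the slides.

```python
def ro_count(vdd_mv, window_ns, mhz_per_volt=2000):
    """RO cycles counted in the window under a hypothetical linear RO model.

    f_RO [MHz] = mhz_per_volt * vdd [V]; integer arithmetic avoids rounding.
    """
    return mhz_per_volt * vdd_mv * window_ns // 1_000_000


def droop_detected(vdd_mv, window_ns, threshold):
    """Flag is asserted when the count falls below the programmed threshold."""
    return ro_count(vdd_mv, window_ns) < threshold


def clock_after_mitigation(f_clk_mhz, vdd_mv, window_ns, threshold):
    """The CMU halves the IP clock while the droop flag is asserted."""
    if droop_detected(vdd_mv, window_ns, threshold):
        return f_clk_mhz / 2
    return f_clk_mhz


nominal = ro_count(800, 100)   # count at 0.80 V in a 100 ns window
drooped = ro_count(750, 100)   # count at 0.75 V in the same window
f_ok = clock_after_mitigation(2000, 800, 100, threshold=155)
f_droop = clock_after_mitigation(2000, 750, 100, threshold=155)
```

A lower supply slows the oscillator, so fewer edges fit in the window; placing the threshold between the nominal and drooped counts is what the per-DVFS-level calibration would have to do.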
A key technology feature of 7nm
FinFET Logic Technology    7nm        8nm
Key Module Technology
  Fin                      SAQP       SADP
  eSiGe                    6th Gen.   5th Gen.
[Figure: Iddq (Normalized) on a log axis from 0.01 to 1, with a +7% annotation]
A key technology feature of 7nm
• Self-Aligned Quadruple Patterning (SAQP) process was introduced to
scale fin pitch below 42nm (more than 15% scale-down).
[Figure: SAQP process flow: Litho, Spacer1, Spacer2, Fin]
Conclusion
• Five low-power techniques contributed to reducing the power
consumption of the 7nm SoC.
• Tri-cluster CPUs
• Sparsity NPU
• HWACG
• Droop detector
• 7nm process
• They are also being applied extensively to follow-on projects to enhance
low-power competitiveness.
A 7nm FinFET 2.5GHz/2.0GHz Dual-
Gear Octa-Core CPU Subsystem with
Power/Performance Enhancements for a
Fully Integrated 5G Smartphone SoC.
Hugh Mair, Ericbill Wang, Ashish Nayak, Rolf Lagerquist, Loda Chou,
Gordon Gammie, Hsinchen Chen, Lee-Kee Yong, Manzur Rahman,
Jenny Wiedemeier, Ramu Madhavaram, Alex Chiou, Blundt Li, Vincent
Lin, Rory Huang, Michael Yang, Achuta Thippana, Osric Su, SA Huang
Improved: +25% OoO window
New: Add 2nd branch
Improved: 2x Branch Pred. BW
New: Add 4th ALU
Improved: Next-gen Prefetch
New: Macro-Op Cache
Improved: +50% dispatch BW
Higher Performance @ Lower Power
[Figure: circuit diagram with labels DIE_SENSE_VSS, CPU CLUSTER, EXTERNAL SUPPLY, DAC_OUT, FILT[2:0], LPFOUT, Clock Gate, CLK_OUT]
[Figure: voltage and current waveforms with DRCC on and with DRCC off; ~50mV annotation; approximate CPU load current]
[Figure: VMIN versus Frequency [MHz]; ◯ is VMIN per Frequency]
1. Determine VMIN at given reference frequency points, FTGT
[Figure: plot versus Frequency [GHz] from 2.3 to 2.5, vertical axis from -40 to -100, ~35mV annotation]
• JTAG Challenges/Limitations:
– Models a serial connection through all devices / IP blocks
– Large number of embedded IP (#clusters * #cpus * #ip/cpu)
– Power gating blocks chain segments
Rama Venkatasubramanian1, Don Steiss1, Greg Shurtz2, Tim Anderson1, Kai Chirca1,
Raghavendra Santhanagopal1, Niraj Nandan1, Anish Reghunath1, Hetul Sanghvi1, Daniel Wu1,
Abhijeet Chachad1, Brian Karguth1, Denis Beaudoin1, Charles Fuoco1, Lewis Nardini1, Chunhua
Hu1, Sam Visalli1, Amrit Mundra1, Devanathan Varadarajan1, Frank Cano2, Shane Stelmach1,
Mihir Mody3, Arthur Redfern1, Haydar Bilhan1, Maher Sarraj1, Ali Siddiki1, Anthony Lell1, Eldad
Falik1, Anthony Hill1, Abhinay Armstrong1, Todd Beck1, Vijay Kanumuri1, Steve Mullinnix1,
Darnell Moore1, Jason Jones2, Manoj Koul1, Sanjive Agarwala1
1Texas Instruments, Dallas, TX
2Texas Instruments, Houston, TX
3Texas Instruments, Bangalore, India
© 2020 IEEE 2.6: A 16nm 3.5B+ Transistor >14TOPS 2-to-10W Multicore SoC Platform for Automotive and Embedded Applications with Integrated Safety MCU,
International Solid-State Circuits Conference 512b Vector VLIW DSP, Embedded Vision and Imaging Acceleration 1 of 40
Outline
Automotive Processor background
Jacinto™ 7 SoC Platform Architecture
C71x Digital Signal Processor (DSP)
Embedded vision and Imaging accelerators
VPAC and DMPAC
First SoC of the platform
Device details/Die micrograph
Automotive SoC Development
Innovative quality and reliability methodologies
Conclusion
Automotive Processor - Applications
Motivation for SoC Platform Architecture
SoC Platform Architecture (1/2)
SoC Platform Architecture (2/2)
Multiple isolated domains:
Wakeup domain
Microcontroller (MCU) domain
Main domain
Automotive Functional Safety - Overview
Governed by ISO-26262 standard
Four ASILs: A, B, C, and D
ASIL-D
Most stringent functional safety standard
Ex: Power steering (unwanted acceleration)
Ex: Engine braking (unintended braking)
ASIL-B
Also critical, slightly less stringent
Ex: Embedded vision ADAS (Incorrect sensor feedback)
Ex: Instrument cluster (Loss of critical data)
[Figure: ASIL pyramid: ASIL-D most stringent, ASIL-C, ASIL-B slightly less stringent, ASIL-A least stringent]
Wakeup Domain
ASIL-D
Isolated domain
MCU Domain
ASIL-D
Isolated chip-within-a-chip
General-purpose MCU
Communication peripherals for safety-critical
communication
Safety and Isolation Features
MCU domain monitors Main domain faults and
takes appropriate action
Domain Isolation
Reset, Clock and Bus isolation
Logic/IO voltage isolation with power monitoring
Dedicated Voltage, thermal, clock rate sensors
per domain
Main Domain – Compute Cluster
64b Heterogeneous Multicore Architecture
Arm® Cortex-A and C71x DSP
Coherent memory
Multicore shared-memory system (MSMC)
L1/L2/L3 caches
Shared on-chip SRAM with ECC
Virtualization
C66x DSP
Optimized for Audio applications
Enables supplemental analytics
Backwards compatibility
High-performance GPU
External memory interface (LPDDR4-4266)
Main Domain – Accelerator Cluster
Vision Pre-processing Accelerator (VPAC)
Depth and Motion Perception Accelerator
(DMPAC)
Video acceleration (H.264/H.265)
Image capture subsystem (MIPI CSI-2 RX/TX)
Display Subsystem
Interfaces for different display panel types
eDP, MIPI DSI, MIPI DPI
Security acceleration
PKA, AES, SHA, RNG, DES/3DES
C71x DSP
DSP CPU:
64b addressing
Optimized for General Purpose DSP and Embedded Vision
16-Issue VLIW with flexible pipeline protection
64b scalar registers and execution units (int*, float, double)
512b vector registers and execution units (int*, float, double)
512b x 512b matrix registers and matrix unit (int*)
4240 integer MAC/Cycle
Memory Subsystem:
Load/Store instructions up to 512b wide
Streaming accesses with programmed address sequencers
C71x DSP Vs C66x DSP
[1] R. Damodaran et al., “A 1.25 GHz 0.8W C66x DSP core in 40nm CMOS,” IEEE Conf. VLSI Design, pp. 286-291, 2012.
C71x DSP
Benchmark Performance Scaling
Category              Benchmark                 C71x/C66x [1]   C71x/EVE [2]
General Purpose DSP   FFT 1024pt, 32b complex   5.6 x           2.4 x
Computer Vision       Image Gradient            11.0 x          2.7 x
Computer Vision       Image Transform           11.0 x          4.8 x
Computer Vision       Filtering                 18.1 x          7.1 x
Computer Vision       Feature Extraction        18.8 x          4.7 x
Computer Vision       Optical Flow              3.0 x           3.0 x
Computer Vision       Morphology                25.7 x          15.3 x
Machine Learning      MobileNet v1              516.0 x         115.0 x
Machine Learning      MobileNet v2              210.0 x         62.7 x
[1] R. Damodaran et al., “A 1.25 GHz 0.8W C66x DSP core in 40nm CMOS,” IEEE Conf. VLSI Design, pp. 286-291, 2012.
[2] Mandal, Dipan Kumar et al. “An Embedded Vision Engine (EVE) for Automotive Vision Processing,” IEEE International Symposium on
Circuits and Systems (ISCAS), pp. 49-52, 2014.
Vision Pre-processing Accelerators (VPAC) (1/4)
VPAC consists of:
Image Signal Processing Engine (ISP)
Multiple hardware accelerators:
Remap Engine
Noise Filter
Scalar Engine
SW and HW flexibility
Control and Sequencing
[Figure: Image Capture (CSI2-RX, CSI2-RX) feeding VPAC: Image Signal Processing (ISP), Remap Engine, Noise Filter, Scalar Engine; SW controlled, Flexible Acceleration]
Vision Pre-processing Accelerators (VPAC) (2/4) – Image Pipe
Image Capture
2x4L MIPI compliant CSI2 RX interface
[Figure: Image Capture with two CSI2-RX blocks]
Vision Pre-processing Accelerators (VPAC) (3/4) - Vision Primitives
Remap Engine
Lens distortion correction (+Fish Eye Lens)
Rectification
Noise filter
Edge preserving
Enhances analytics quality
Scalar Engine
Multiple scales for pyramid generation for
various vision algorithms
Region of Interest (ROI) scaling support
Vision Pre-processing Accelerators (VPAC) (4/4)
Hardware accelerators optimized for :
Low latency Image/Vision pipe
Reducing external memory bandwidth
SW and HW flexibility
Sequencing of Algorithms
Standalone / connected pipe
[Figure: Image Capture (CSI2-RX, CSI2-RX) feeding VPAC: Image Signal Processing (ISP), Remap Engine, Noise Filter, Scalar Engine; SW controlled, Flexible Acceleration]
Depth and Motion Perception Accelerator (DMPAC) (1/5)
Stereo Disparity Overview
Depth and Motion Perception Accelerator (DMPAC) (2/5)
Stereo Disparity Engine
[Diagram: Left Camera Image and Right Camera Image feed the Stereo Disparity Engine hardware accelerator, which outputs a Disparity Map]
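To make the left/right image to disparity map relationship concrete, here is a minimal sketch of disparity estimation on a single scanline. It assumes plain block matching with a sum of absolute differences (SAD) cost, chosen only for illustration; the slides do not state which algorithm the hardware engine implements, and `disparity_row` is a hypothetical name.

```python
def disparity_row(left_row, right_row, max_disparity, half_window=1):
    """Per-pixel disparity for one rectified scanline.

    For each pixel in the left row, pick the shift d in 0..max_disparity
    that minimises the SAD between a small window in the left row and the
    window shifted d pixels to the left in the right row.
    Assumption: SAD block matching, not the accelerator's actual method.
    """
    width = len(left_row)
    result = []
    for x in range(width):
        best_d, best_cost = 0, None
        for d in range(max_disparity + 1):
            cost = 0
            for k in range(-half_window, half_window + 1):
                xl = min(max(x + k, 0), width - 1)      # clamp at the borders
                xr = min(max(x + k - d, 0), width - 1)
                cost += abs(left_row[xl] - right_row[xr])
            if best_cost is None or cost < best_cost:
                best_d, best_cost = d, cost
        result.append(best_d)
    return result


# A bright feature that appears 2 pixels further right in the left image.
right_row = [10, 10, 50, 90, 50, 10, 10, 10, 10, 10]
left_row = [10, 10, 10, 10, 50, 90, 50, 10, 10, 10]
disparities = disparity_row(left_row, right_row, max_disparity=3)
```

Around the feature the estimated disparity is 2; flat regions are ambiguous and fall back to the first minimum, which is one reason dense engines typically add smoothing or confidence outputs.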
Depth and Motion Perception Accelerator (DMPAC) (3/5)
Optical Flow Overview
Flow Vector
Depth and Motion Perception Accelerator (DMPAC) (4/5)
Dense Optical Flow Engine
Applications:
Confidence Score for each flow vector output
Depth and Motion Perception Accelerator (DMPAC) (5/5)
Die Micrograph and Device details
First SoC of the Platform
Process: TSMC 16FF FinFET
Voltage: Core 0.72V-0.990V (AVS); SRAM 0.85V
Temperature: -40C to 150C
Specifics: 3.5B+ transistors; ~200Mb SRAM
Performance: 2GHz 2x super-scalar CPU; 1GHz 3x dual safety MCU; 4266 LPDDR4; 16G SERDES
Power: 2W-10W (use case dependent)
Key Accelerators: 1GHz 64b VLIW/512b SIMD DSP, Machine Learning, Computer Vision
Number of PLLs: 25+
Power Networks: 74
Features: ASIL-D capable, low-DPPM design
Automotive SoC – Quality and Reliability
Automotive SoCs require:
Stringent attention to DPPM (defective parts per million)
Very low FIT for functional safety and intrinsic reliability
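The two metrics named above have standard definitions: DPPM counts defective parts per million shipped, and FIT counts failures per 10^9 cumulative device-hours. A minimal sketch of both follows; the function names and all sample numbers are hypothetical and do not come from the presentation.

```python
def dppm(defective_parts, shipped_parts):
    """Defective parts per million shipped."""
    return defective_parts / shipped_parts * 1_000_000


def fit_rate(failures, devices, hours_per_device):
    """FIT: failures per 10^9 cumulative device-hours."""
    return failures / (devices * hours_per_device) * 1e9


def mttf_hours(fit):
    """Mean time to failure implied by a constant failure rate of `fit`."""
    return 1e9 / fit


# Hypothetical numbers, for illustration only.
example_dppm = dppm(defective_parts=3, shipped_parts=2_000_000)
example_fit = fit_rate(failures=2, devices=100_000, hours_per_device=10_000)
example_mttf = mttf_hours(10)
```

DPPM is a measure of test escapes at shipment, while FIT describes failures over operating life, which is why the slides treat memory test and electromigration signoff as separate topics.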
Automotive SoC – Memory Test
Test Goal: Test mimics functional mode (1/3)
Case-study: Robust memory interface test
Memory BIST is targeted to screen memory-internal defects
Functional vs. Memory BIST: Start/end points are different
[Diagram: functional flops, BIST flops, and functional latches around the Memory Array; signals D, TD, D_mem, A, TA, A_mem, Q, Q_mem. Legend: Functional / mission mode; BIST test mode]
Test Goal: Test mimics functional mode (2/3)
Case-study: Robust memory interface test
ATPG is targeted to screen memory-interface logic defects
Functional vs. ATPG: Memory I/O timings are different due to scan collar
Subtle defects in memory-interface may be missed by BIST and ATPG
[Diagram: functional flops, BIST flops, and functional latches around the Memory Array; signals D, TD, D_mem, A, TA, A_mem, Q, Q_mem. Legend: Functional / mission mode; ATPG test mode]
Test Goal: Test mimics functional mode (3/3)
Case-study: Robust memory interface test
Match functional mode memory I/O timing with test mode
Improved memory IP architecture to match functional and test timing
Improved DFT to test “true” functional memory-interface path
[Diagram: functional flops, BIST flops, and functional latches around the Memory Array; signals D, TD, D_mem, A, TA, A_mem, Q, Q_mem. Legend: Functional / mission mode; BIST test mode; ATPG test mode]
EM FIT Calculation for Automotive SoCs (1/3)
Signoff with Violations at Technology Reliability Limit
May be acceptable/waiveable in consumer grade SoC
[Plot: Cumulative FIT vs. FIT by Bin. Axis ticks 10% to 110%; labels: Probability of occurrence, Component FIT Per Bin, Cumulative FIT, FIT Per Bin, System FIT Spec]
EM FIT Calculation for Automotive SoCs (2/3)
Signoff at Technology Reliability Limit
[Plot: Cumulative FIT vs. FIT by Bin. Axis ticks 10% to 110%; labels: Probability of occurrence, Component FIT Per Bin, Cumulative FIT, FIT Per Bin, System FIT Spec]
EM FIT Calculation for Automotive SoCs (3/3)
Signoff with Margined Technology Reliability Limit
Additional Margins
Fix all violations
[Plot: Cumulative FIT vs. FIT by Bin. Axis ticks 10% to 110%; labels: Probability of occurrence, Component FIT Per Bin, Cumulative FIT, FIT Per Bin, System FIT Spec]
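The relationship between per-bin FIT, cumulative FIT, and the system FIT spec in these plots can be sketched as a running sum. This is a simplified illustration under stated assumptions: each bin is taken to be a group of components at a given percentage of the technology reliability limit, each component in a bin is given one FIT value, and all numbers are hypothetical. The function name `fit_budget` is not from the presentation.

```python
def fit_budget(bins, system_fit_spec):
    """Accumulate FIT over bins and compare the total with the spec.

    bins: list of (label, component_count, fit_per_component) tuples.
    Returns (rows, meets_spec), where each row is
    (label, fit_for_bin, cumulative_fit).
    Assumption: FIT contributions simply add across components.
    """
    rows, cumulative = [], 0.0
    for label, count, fit_each in bins:
        fit_bin = count * fit_each
        cumulative += fit_bin
        rows.append((label, fit_bin, cumulative))
    return rows, cumulative <= system_fit_spec


# Hypothetical design with one bin above the limit (a violation).
with_violations = [("90%", 4, 0.25), ("100%", 2, 0.5), ("110%", 1, 2.0)]
rows_before, ok_before = fit_budget(with_violations, system_fit_spec=3.0)

# Same design after the over-limit components are fixed.
violations_fixed = with_violations[:2]
rows_after, ok_after = fit_budget(violations_fixed, system_fit_spec=3.0)
```

In this toy case the single over-limit bin dominates the cumulative total, so removing it brings the design back under the spec, which mirrors the "fix all violations" step on the slide.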
EM FIT calculation in an eco-system class Automotive SoC (1/2)
[Diagram labels: Foundry, EDA, Foundry, SoC Reliability, Estimated FIT Rate]
System Validation – ADAS (1/2)
Forward Camera Analytics based on Deep Learning
[Image panels: Semantic Segmentation, Object Detection, Parking Spot Detection]
System Demo – 3D Surround View + Auto Valet Parking
Real-time 360 Degree Surround View Analytics for Auto Valet Parking
[Block diagram labels: Fisheye Camera (CSI, I2C), 4x 1920x1080 @ 30 FPS Bayer; VPAC; C66x DSP (Pre-process); C7x DSP; C66x DSP (Post-process); Semantic Segmentation, Multi-object Detection, Parking Spot Detection; 3x 768x384 @ 15 FPS; 3x 1280x720 @ 15 FPS YUV420; VPAC (Mosaic); GPU; Display Subsystem; SD Card]
IBM z15™: A 12-Core 5.2GHz Microprocessor
Christopher Berry1, Brian Bell2, Adam Jatkowski1, Jesse Surprise1, John Isakson3, Ofer Geva1, Brian
Deskin4, Mark Cichanowski3, Dina Hamid1, Chris Cavitt1, Gregory Fredeman1, Anthony Saporito1,
Ashutosh Mishra5, Alper Buyuktosunoglu6, Tobias Webel7, Preetham Lobo5, Pradeep Parashurama5,
Ramon Bertran6, Dureseti Chidambarrao8, David Wolpert1, Brandon Bruen1
IBM Systems
1 - Poughkeepsie, NY
2 - Rochester, NY
3 - Austin, TX
4 - Endicott, NY
5 - Bangalore, India
6 - Yorktown Heights, NY
7 - Boeblingen, Germany
8 - Hopewell Junction, NY
Technology Overview
Ultra-thick wires 2400nm - 2 levels
5.6X pitch wiring 360nm - 4 levels
4X pitch wiring 256nm - 2 levels
2X pitch wiring 128nm - 4 levels
1.3X pitch wiring 80nm - 2 levels
1X pitch, fine wiring (LELE) 64nm - 3 levels
Logic Device VT pairs L, R, H
SRAM Cell Sizes 0.102μm² (HP & LL), 0.143μm²
eDRAM Cell Size 0.0174μm²
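The wiring table above can be cross-checked with simple arithmetic: summing the level counts gives the total number of metal levels, and dividing each pitch by the 64nm 1X pitch shows how the pitch labels are derived. The sketch below uses only the numbers from the table; the variable names are illustrative.

```python
# (label, pitch in nm, number of levels), copied from the table above.
wiring = [
    ("Ultra-thick", 2400, 2),
    ("5.6X", 360, 4),
    ("4X", 256, 2),
    ("2X", 128, 4),
    ("1.3X", 80, 2),
    ("1X", 64, 3),
]

base_pitch_nm = 64  # the 1X fine-wiring pitch

# Total metal levels in the stack.
total_levels = sum(levels for _, _, levels in wiring)

# Pitch of each class relative to the 1X pitch.
ratios = {label: pitch / base_pitch_nm for label, pitch, _ in wiring}
```

The stack totals 17 levels. The ratios come out to 5.625 and 1.25 for the classes labelled 5.6X and 1.3X, so those labels are rounded, while 2X and 4X are exact.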
• Up to 4 42U x 19” Racks
• 20 Liquid Cooled CP Chips
• 240 Physical Cores
• 40TB RAIM*-protected memory
• 60 PCIe Gen4x16 cards
*Redundant Array of Independent Memory
[Rack diagram labels: Ethernet Switches, IO Cages, Processor Drawers, Water Cooling, Central Processors]
© 2020 IEEE 2.7: IBM z15TM: A 12-Core 5.2GHz Microprocessor
International Solid-State Circuits Conference 9 of 39
z15 System Structure: Drawer
• Drawer-to-drawer connectivity
• L4 Cache - 960MB (+43%)
• 12.2B Transistors
• 696mm²
• ~20K C4’s
• IO bandwidth 6.8Tb/s
• 21.7km of signal wire
• A-Bus (drawer-to-drawer)
• Four links
[Floorplan labels: L4 eDRAM Cache (four blocks), L4 Control, L4 Directory, L4 Dataflow]
Memory interface, IO, CP & SC interfaces
• Specs
• 5.2GHz
• 9.2B Transistors
• 696mm²
• ~29K C4’s
• IO Bandwidth 4.0Tb/s
• 23.2km of signal wire
[Floorplan labels: Core 0 to Core 11, L3 Cache, L3 DF, L3 Cntl/Dir, X-BUS, GZIP, PCIE0 to PCIE2, PBU0 to PBU2]
CP Chip: Processor Cores
• 12 Cores (+20%)
• 128+128KB L1 (I+D) SRAM cache
• 4+4MB (I+D) private L2 eDRAM cache (+33%)
• L3
• 256MB eDRAM shared L3 cache (+100%)
• GZIP Accelerator
[Floorplan labels: Core 0 to Core 11, L3 Cache, L3 DF, L3 Cntl/Dir, GZIP]
CP Chip: IO Links
• XBUS
• Two X-Bus links
• Single-ended
• 5.2Gb/s/lane
• 0.8Tb/s each (1.6Tb/s total)
• Memory Interface
• 9.6Gb/s/lane (1.6Tb/s total)
• 3 PCIe x16 Gen4 interfaces (0.8Tb/s)
[Floorplan labels: MC Drv, MCU, MC Rec, X-BUS (two)]
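The link figures above can be sanity-checked with a short calculation. The PCIe number follows from the PCIe Gen4 raw rate of 16 GT/s per lane per direction. The implied lane counts for the X-Bus and memory interface are derived here purely from the quoted totals and per-lane rates; the slides do not state lane counts or whether the totals count both directions, so treat those results as rough estimates. Function names are illustrative.

```python
PCIE_GEN4_GT_PER_LANE = 16  # GT/s per lane, per direction (raw, before encoding)


def pcie_raw_tbps(interfaces, lanes_per_interface,
                  gt_per_lane=PCIE_GEN4_GT_PER_LANE):
    """Raw aggregate PCIe bandwidth in Tb/s, one direction."""
    return interfaces * lanes_per_interface * gt_per_lane / 1000


def implied_lanes(total_tbps, gbps_per_lane):
    """Lane count implied by a total bandwidth and a per-lane rate.

    Assumption: total = lanes * per-lane rate; not stated in the slides.
    """
    return total_tbps * 1000 / gbps_per_lane


pcie_total = pcie_raw_tbps(interfaces=3, lanes_per_interface=16)
xbus_lanes_per_link = implied_lanes(total_tbps=0.8, gbps_per_lane=5.2)
memory_lanes = implied_lanes(total_tbps=1.6, gbps_per_lane=9.6)
```

Three x16 Gen4 interfaces give 0.768Tb/s raw per direction, consistent with the rounded 0.8Tb/s on the slide.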
[Diagram labels: Instruction Sequencer, Recovery, Translator, Elliptic Curve Cryptography, Improved Operand-store-compare (OSC) hazards, Enhanced branch prediction]
[Chart labels: z14 vs. z15; +22%, +5%, +1%]