
ISSCC 2020

SESSION 2
Processors
Zen 2: The AMD 7nm Energy-Efficient
High-Performance x86-64 Microprocessor Core

T. Singh1, S. Rangarajan1, D. John1, R. Schreiber1, S. Oliver1, R. Seahra2, A. Schaefer1

1AMD, Austin, TX; 2AMD, Markham, ON, Canada

© 2020 IEEE 2.1: Zen 2: The AMD 7nm Energy-Efficient High-Performance x86-64 Microprocessor Core 1 of 31
International Solid-State Circuits Conference
Outline
• Motivation
• Market Segments
• Architecture
• Core Complex
• Technology
• Implementation
• SRAMs
• Power
• Silicon Results
• Conclusion
Motivation
• Zen was a huge lift
• Zen 2 is a compelling successor to Zen
• Goals
  – Deliver above-industry-trend generational performance improvement
  – Enable 2x cores in the same socket
  – Improve single thread (1T) performance
• How can we do this?
  – Technology port
  – Architectural changes
  – Physical design and methodology changes
• AMD was aggressive and we did all of the above to achieve the goals!!

Zen 2 Market Segments

Zen 2 Architecture
• Changes from Zen
– New TAGE Branch Predictor
– Optimized L1 Instruction Cache: 32K/8-way vs. 64K/4-way
– 2X Op Cache Capacity: 4K vs. 2K ops
– 2X Floating Point Data Path Width: 256b vs 128b
– 3rd Address Generation Unit
– Larger Physical Structures: Integer Scheduler, PRF, ROB, Store Queue, L2DTLB
– 2X L1 Data Cache Read/Write Bandwidth
– 2X L3 Cache: 16MB vs. 8MB per Core Complex (CCX)
• +15%1 single thread (1T) IPC over Zen
• ~9% switching capacitance (CAC) improvement over previous generation, technology neutral
1 AMD "Zen 2" CPU-based system scored an estimated 15% higher than previous generation AMD "Zen" based system using estimated SPECint®_base2006 results.
SPEC and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. See www.spec.org.
Core Functional Units
• 32KB IC
• 32KB DC
• ~20 blocks, ~400K avg instances
• ROM for uCODE
• 5 L1 RAM variants
• Chip Pervasive Logic (CPL) – clock/test block
[Figure: core floorplan – Branch Prediction, Decode, I-Cache, uCode, CPL, Scheduler, ALU, Floating Point, Load/Store, Data Cache, L2 Cache]
L2/L3 Cache Hierarchy
• Only 3 unique custom macros – down from 8 on Zen
• Each 4M slice is identical
• Shadow tag macros for serving external probes
• Multi-stage clock gating in L3 to keep clock distribution power the same as the 8M L3 from Zen
• LDOs incorporated into the L3 to supply VDDM to L2 and L3 arrays
  – Loss of package distribution of VDDM meant LDOs had to be moved closer
  – Must reduce current on VDDM
[Figure: 4M L3 slice floorplan – L2 Data, L2 Tags, L2 Status, L3 Data, L3 Tags, CTL, 512K L2, and LDOs]
Zen 2 Core Complex (CCX)

• 4-core complex
• L3 size increases to 16MB
• Designed for flexibility
• Maximize # of cores for the server case
Zen 2 CCX Configs
• Zen 2 Core can be used in various configs covering a wide power range
• Multiple CCX can be placed to achieve desired core count
• CCX variants: 4 Core / 16MB L3 (HEDT/Server), 4 Core / 4MB L3 (APU), 2 Core / 4MB L3 (Value)

Cores  Market          TDP
8      Notebook        15 W
6      Desktop         65 W
8      Desktop/Server  65-120 W
12     Desktop/Server  105-120 W
16     Desktop/Server  105-155 W
24     HEDT/Server     155-280 W
32     HEDT/Server     155-280 W
48     Server          200-225 W
64     HEDT/Server     200-280 W
Zen vs. Zen 2 Technology Comparison

                       Zen                 Zen 2
Tech                   14nm FinFET         7nm FinFET
Cores/CCX              4 Cores, 8 Threads  4 Cores, 8 Threads
Area/CCX               44 mm2              31.3 mm2
L2/core                512KB               512KB
L3/CCX                 8MB                 16MB
CPP                    78 nm               57 nm
Fin Pitch              48 nm               30 nm
1x Metal Pitch         64 nm               57 nm
Stdcell Track Library  10.5 track          6 track
Cu Metal Layers        11 w/ MiM           13 w/ MiM
Zen vs. Zen 2 Technology Comparison (cont)

Zen (14nm)                       Zen 2 (7nm)
Layer Name            Pitch      Layer Name            Pitch
---                   n/a        M0 StdCell Internal   1.0x
M1 StdCell Internal   1.0x       M1 Stdcell & BEOL     1.425x
M2-M3                 1.0x       M2-M3                 1.0x-1.1x
M4-M7                 1.25x      M4-M7                 2.0x
M8-M9                 2.0x       M8-M9                 2.0x
---                   ---        M10-M11               3.15x
M10-M11 (RDL)         11.25x     M12-M13 (RDL)         18.0x
Place and Route Design Optimization
• 7nm FinFET presents unique route challenges
  – Lower layer jogs forbidden
  – Denser standard cells with reduction in track height
  – Increased lower level metal resistance
• Deep collaboration between AMD CAD, foundry, and EDA partners
  – Cell density management
  – Advanced legalization techniques
  – Improved pre-route timing estimates
  – Wire engineering and via ladders
[Figure: same-layer jogs (forbidden) vs. inter-layer jumpers (required)]
Placement Restricted by Large Cells
• Multi-row cells benefit power and area, but create placement challenges
• Clustering of flops has many benefits but can cause placement issues
• Resulting small gaps are challenging to use and required innovation to exploit
  – New algorithms
  – Flexible power grid choices
Design RC Miscorrelation
• Pre-route vs. post-route miscorrelation caused by length and layer assumptions
• Pre-route miscorrelations for resistance and capacitance have differing root causes
  – Layer assignment for resistance
  – Length estimates for resistance and capacitance
• Based on previously modeled trends, EDA tools may have challenges estimating delay
• Required innovation to tackle

Layer  Normalized Resistance  Normalized Capacitance
M1     1.00                   1.00
M2     3.17                   0.96
M3     2.31                   0.96
M4     0.72                   0.75
M5     0.55                   0.83
M6     0.52                   0.83
M7     0.55                   0.83
M8     0.52                   0.83
M9     0.55                   0.92
M10    0.16                   0.96
M11    0.16                   0.92
Pre-Route Correlation Improvements
• Plots show clock tree synthesis vs. route timing (timing slack correlation and timing slack delta, pessimistic vs. optimistic)
• Large variance in initial results
  – A large number of paths have overly-pessimistic delay during pre-route steps; tools waste resources trying to fix them
  – A significant number of paths have optimistic delay estimates; these paths are under-optimized
• Employed timing with targeted capacitance scaling and global route-based layer estimation
  – Standard deviation dramatically improved while keeping a slightly pessimistic mean
[Figure: cts_vs_route.slack.corr and cts_vs_route.slack_delta.hist, initial vs. improved results]

Wire Engineering Challenges
• Lower layers are getting more resistive with the latest technology nodes
  – Very short routes in tight data paths need a buffer
  – Routes longer than Steiner due to complex rules
  – Challenging for optimization tools to comprehend
• Critical signals need to get to higher layers quickly
Wire Engineering and Via Ladders
• Team used selective layer optimization, buffering, pre-routes, and via ladders to exploit the fast layers for critical signals
• Two types of via ladders
  – High performance: for large buffers driving long wires
  – EM: for high-activity gates (e.g., clock drivers); mitigated EM issues on large-fanout nodes with high activity
[Figure: top and side via ladder views]
L2/L3 Cache Changes
• Zen had an on-die LDO to generate the VDDM supply for use by cache arrays
• Zen 2's package choices make using package layers for VDDM distribution impossible
• Moved the bitline precharge from VDDM to VDD to reduce current
[Figure: SRAM column circuits – wordlines WL[N:0], bitline precharge BLPCX, column select XCENX, bitlines BLT[]/BLC[], write column select WRCS[], NegBL write driver (WDT_X/WDC_X), read column select RDCSX[], and sense amp (SAC/SAT, SAT_INT/SAC_INT, SAPCX, SAEN)]
VDD Precharge Challenges
• Moving bitline precharge to VDD creates both bitcell stability and writeability challenges
• High level of configurability allows for silicon flexibility
[Figure: VDD/VDDM operating range – superVminEn=1 below the VDD where VDDM-VDD = superVminThreshold; superVmaxEn=1 above the VDD where VDD-VDDM = superVmaxThreshold; the controller pauses the voltage increase and unsets superVminEn (or sets superVmaxEn) before continuing to raise voltage]
[Figure: system management controller drives assist voltage thresholds (superVminEn, superVmaxEn, WLUdEn, NegBlEn) to the SRAMs; fuses hold programming details and assist configurations]
VDD Precharge Timing Challenges
• Moving precharge to VDD reduced our current enough to allow on-die distribution but presents other challenges
• Power races with WL (WL at a constant VDDM):
  – BLPCX @ high VDD: bitline precharge turns on before WL turns off at high VDD!
  – BLPCX @ low VDD: WL is on before bitline precharge turns off at low VDD!
• Read-before-write timing challenges at low VDD, high VDDM
Solving Timing Challenges
• Solving these multiple voltage timing challenges required a number of techniques
  – Dual-voltage clock shapers to average two voltage domains
    • Can alter the number of these buffers on VDD or VDDM, or remove them entirely, to make timing more or less dependent on either supply
  – False read-before-write problem can be mitigated by compressing the front end of the WL during a write operation
[Figure: pseudo-dynamic level shifter (Input@VDD, LS @VDD, ISOX@VDDM, shapedFallInput) and WL compression waveforms (WLCLK, WLCLK_shape, WREN; WL during read vs. WL during write)]
CAC Comparison

• 3% decrease in flop power allocates more budget for combinational logic


FLOP Palette Improvements
• Rich flop library; balance timing/power needs by driving the right flop mix
• Up to 8% Fmax benefit from high-speed flops in timing-critical loop paths
[Figure: flop palette ranging from best-for-performance to best-for-power]
Low Power Gater Latch

Energy with AvgApp Activity (fJ)
State  LP Latch  Regular Latch  Ratio
E=1    0.22      0.18           121%
E=0    0.17      1.61           10%
Total  0.38      1.79           22%

• 90% power savings in the latch for the common case of E = 0 through internal self-gating
• Clock gater latch power contribution drops from 22% in Zen to 13% in Zen 2 for an average application
[Figure: low-power gater latch schematic – CLK, CLKB, CLKBB, Dbar, E, TE, qf, qf_x, Q]
Zen 2 Clock Optimization
• Multi-mesh plan for the core supported by configurable clock tree construction
  – FP-level mesh gating enabled with minimal timing/area overhead
  – 15% mesh power savings in Idle and Average App
• Tight clock skew distribution
• Relocated clock spines and technology shrink (vs. Zen) achieve a similar skew profile while reducing CAC

Zen vs. Zen 2 CAC Comparison
• Primary sources of CAC reduction
  – 14 nm to 7 nm scaling
  – 6 track library
  – Aggressive microarchitectural CAC optimizations
Generational Leadership Perf/Watt
• Performance/Watt driven by a combination of technology and design improvements
• Timing
  – Improved scalability by optimizing at a wider voltage range compared to Zen
  – Multi-corner optimization
• Library choice and optimization
  – 6 track library enabled additional CAC/leakage savings in addition to default technology entitlement
• Design CAC
  – MBFF, low power clock-gater library optimization
  – RTL improvements
  – CAC-aware downsizing methodology
[Figure: power improvements at iso-frequency – waterfall from Zen power @ 100% IPC through 7nm CAC savings, library choice, 7nm timing, and design CAC savings down to Zen 2 power @ 115% IPC]

Frequency/Power Silicon Results
• 4 cores active with 2 threads per core
• The combined effect of lower Vmin for the same frequency and reduced CAC enabled a 50% reduction in power for a given frequency throughout most of the F(P) curve
• This enables 2x cores in the same socket!!
[Figure: Zen vs. Zen 2 frequency vs. power curves, showing the 50% power reduction]
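As a rough sanity check on how lower Vmin and reduced CAC compound, dynamic power scales approximately as Cac·V²·f. The two ratios below are hypothetical illustration values chosen to land near the reported 50%; the deck does not publish the individual voltage and capacitance contributions.

```python
# Back-of-envelope dynamic power model: P ~ Cac * V^2 * f.
# cac_ratio and v_ratio below are HYPOTHETICAL illustration values,
# not AMD-published numbers; the deck only reports the combined ~50%
# power reduction at iso-frequency.
def dynamic_power_ratio(cac_ratio, v_ratio, f_ratio=1.0):
    """Relative dynamic power of the new design vs. the old one."""
    return cac_ratio * v_ratio**2 * f_ratio

# Suppose process + design shrink the switched capacitance to 72% and
# the lower Vmin allows running at 83% of the old voltage (assumed).
ratio = dynamic_power_ratio(cac_ratio=0.72, v_ratio=0.83)
print(f"power at iso-frequency: {ratio:.0%} of previous generation")
```

Because voltage enters quadratically, even a modest Vmin reduction contributes more than the capacitance term alone.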
Frequency/Voltage Silicon Results
• 1 core active with two threads per core, 3 cores idle
• F/V curve improved over all voltages
• Design worked to improve the low-voltage performance for improved linearity
• Wide voltage range

Conclusion
• Met goals
  – Moved to energy-efficient TSMC 7nm FinFET
  – Made huge architectural changes
  – Improved PD and methodology
• Results are clear
  – Scalable across 15W mobile to 280W server
  – 50% reduced power at iso-frequency
  – Enables 2x cores in the same socket
  – >15% 1T IPC over previous generation
  – ~9% CAC improvement over previous generation, technology neutral
  – Enables peak frequencies up to 4.7GHz (+350MHz generationally)
• Zen 2 delivers generational performance uplift!!
Acknowledgements
• We would like to thank our talented AMD design team across Austin, Fort Collins, Santa Clara, Boston, Markham, and India who contributed to Zen 2
• Please stay for our chiplet paper next
• Please check out our demo, 2.1, tonight in Golden Gate
• Did we mention we have liquid nitrogen?

AMD Chiplet Architecture
for High-Performance
Server and Desktop Products
Samuel Naffziger

© 2020 IEEE
International Solid-State Circuits Conference 2.2: AMD Chiplet Architecture for High-Performance Server and Desktop Products 1 of 27
Outline
• Motivation and architectural goals
• Engineering challenges and solutions
  • Silicon-package co-design
  • Die-to-die interconnects
  • Shared IO die architecture
  • Power distribution and management
• Results
[Figure: 3rd Gen AMD Ryzen™ Threadripper™ Processor; 2nd Gen AMD EPYC™ with Server IO Die (8.34 billion FETs, 416 mm2), 7nm Core Complex Dies (3.8 billion FETs, 74 mm2 each, Zen2 cores + L3), IFOP links, 4 x16 PCIe/IFIP, 4x DDR; 3rd Gen AMD Ryzen™ Processor with Client IO Die (2.09 billion FETs, 125 mm2) and AMD X570 Chipset]
Motivation and Architectural Goals
Primary goal:
Achieve leadership performance, performance/Watt and
performance/$ in server and desktop markets

• This required
– Exploiting advanced 7nm technology for better performance and
performance/Watt
– Packing more silicon into the package than traditional approaches enable
• While also
– Enabling scalable performance/$ up to performance levels otherwise not
achievable
– Improving memory and IO latency
– Supporting leverage across markets by re-using IP and SOCs

Background: Performance and Die Size Trend
• Generational performance improvements are an exponential trend
• Holding to this trend has required increasing core counts and die sizes
• Bumping up against the reticle limit and becoming too costly
• Goal: land above the historical trend line
[Figure: SPECint®_rate2006 2P server throughput performance ratio trend over time1 (1X-100X, 2006-2020); server CPU die sizes over time (0-1000 mm2, Oct-06 to Mar-23)]
1. Su, Lisa, "Delivering the Future of High-Performance Computing", Hot Chips 31 (2019)
Exploiting 7nm Technology
• Leadership performance requires 7nm benefits: 2X density1, >1.25X frequency1 (same power), 0.5X power1 (same performance)
• Yet the cost of advanced 7nm technologies is increasing
• Traditional approaches of large die sizes are not viable
• Innovation required
[Figure: 7nm compute efficiency gains; normalized cost per yielded mm2 for a 250mm2 die, rising steeply from 45nm through 32nm, 28nm, 20nm, 14/16nm, and 7nm to 5nm]
1. Based on June 8, 2018 AMD internal testing of same-architecture product ported from 14 to 7 nm technology with similar implementation flow/methodology, using performance from SGEMM.
7nm Scaling
• High-performance server and desktop processors are IO-heavy
• Analog devices and bump pitches for IO benefit very little from leading-edge technology, and that technology is very costly
• CPU core + L3 comprise 56% of the prior generation RYZEN™ processor die
  – These circuits see huge 7nm gains
  – The remaining 44% sees very little performance and density improvement from 7nm
• Solution: partition the SOC, reserving the expensive leading-edge silicon for CPU cores while leaving the IO and memory interfaces in N-1 generation silicon
  – 7nm CCD is 86% CPU + L3 (Zen2 cores + L3, plus DFx, IFOP SerDes, SMU)
[Figure: prior generation RYZEN™ processor die; 7nm CCD floorplan]
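The partitioning argument also shows up in the die statistics quoted on the outline slide (7nm CCD: 3.8 billion FETs in 74 mm2; 14nm client IOD: 2.09 billion in 125 mm2; 14nm server IOD: 8.34 billion in 416 mm2). A quick calculation of the implied FET densities, with the caveat that FET counts mix logic and SRAM, so this is only a rough figure of merit:

```python
# FET density implied by the die statistics in this deck
# (FET count, die area in mm^2).  Rough figure of merit only: the
# counts mix logic and SRAM, so this is not a pure logic-density
# comparison.
dies = {
    "7nm CCD":         (3.80e9, 74.0),
    "14nm client IOD": (2.09e9, 125.0),
    "14nm server IOD": (8.34e9, 416.0),
}
density = {name: fets / area / 1e6 for name, (fets, area) in dies.items()}  # MFETs/mm^2
for name, d in density.items():
    print(f"{name}: {d:.1f} MFETs/mm^2")
```

The CCD packs roughly 2.5-3x the FETs per mm2 of the 14nm IODs, which is exactly the portion of the SOC that pays back the 7nm wafer cost.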
Chiplets Evolved – Hybrid Multi-die Architecture
• Traditional monolithic: use the most advanced technology where it is needed most
• 1st Gen EPYC: each IP in its optimal technology, Infinity Fabric™ connected
• 2nd Gen EPYC: centralized I/O die with 2nd Gen Infinity Fabric™ improves NUMA; superior technology for CPU performance and power

Connecting the Chiplets
• Silicon interposers and bridges provide high wire density, but have limited reach
  – Only supports die-edge connectivity, which limits the number of chiplets and cores that can be supported
• Performance goals required more Core Complex Dies (CCDs) than can be tiled adjacent to the IOD
• Solution is to retain the on-package SerDes links for die-die connections
[Figure: theoretical interposer-based arrangement (CCDs and IOD on an interposer) vs. the selected MCM approach]

CPU Compute Die (CCD) Floorplan
• 2 CCX core complexes
  – 4 cores and 16MB L3 each
  – Comprise 86% of CCD area
• System Management Unit (SMU)
  – Microcontroller
  – Power management
  – Clocks and reset
  – Fuses
  – Thermal monitor and control
• Infinity Fabric On-Package (IFOP) links
  – 14.6 GT/s (packing 10 bits at 1.46GHz)
  – 39 RX lanes, 2 clock lanes, 1 clock gating lane
  – 31 TX lanes, 1 clock gating lane
  – 4 lanes for control traffic, 2 clock lanes
• DFT and debug
• Wafer test bumps
[Figure: CCD floorplan – two CCXs (Core0-Core3 with shared L3) flanking DFT, IFOP, and SMU]
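The 14.6 GT/s lane rate follows directly from packing 10 bits per 1.46GHz FCLK cycle. A small sketch, with the caveat that the aggregate figure is a raw number over the 31 TX data lanes and assumes no protocol overhead (the deck does not specify the framing):

```python
# IFOP link arithmetic from the figures on this slide: each lane packs
# 10 bits per 1.46GHz FCLK cycle.  The aggregate below is a RAW number
# over the 31 TX data lanes and ignores any protocol/CRC overhead,
# which the deck does not specify.
fclk_hz = 1.46e9
bits_per_cycle = 10
per_lane_gbps = fclk_hz * bits_per_cycle / 1e9     # 14.6 Gbps per lane
tx_data_lanes = 31
raw_tx_gbps = per_lane_gbps * tx_data_lanes
raw_tx_gbytes = raw_tx_gbps / 8
print(f"{per_lane_gbps:.1f} Gbps/lane, {raw_tx_gbytes:.1f} GB/s raw TX")
```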

IFOP Gen2 Key Feature Summary and Comparison

Gen1 (14nm):
• Max per-lane datarate 6.4Gbps
• Local clock alignment and global tracking
• 50 Ohm fixed drive strength and termination
• 4:1 serialization/deserialization
• Forwarded clocks through package

Gen2 (14nm IOD, 7nm CCD):
• Max per-lane datarate 14.6Gbps
• Synchronous clock crossing with local CDR
• 50/100/200 Ohm drive strength and termination
• 10:1 serialization/deserialization
• Local PHY regulators
• TX and RX T-coil
• Regulated pseudo-differential single-ended receiver with VTT termination

[Figure: 1st Gen four-die MCM (Die0-Die3, each with two CCXs, DDR, and I/O) vs. 2nd Gen central IOD with DDR I/O and eight Zen2 CCDs]

IFOP SerDes Architecture
[Figure: IFOP SerDes architecture – IOD and CCD each contain an 8GHz PLL on the core FCLK; a differential TXCLK is forwarded over RC-filtered MCM package routes; x32 RX (FDI[30:0]) and x32 TX (FDO[30:0]) lanes with forwarded clocks FWDFCLK[30:0]; new Gen2 features (x40 lanes, FDO[38:0]/FDI[38:0], FWDFCLK[38:0]): TX lane with register, serializer (FDO[9:0]), trained 50Ω driver, and T-coil; RX lane with T-coil, trained 50Ω VTT-terminated receiver, deserializer, low-latency capture FIFO (FDI[9:0]), and clock generator with quad-phase interpolator, calibration logic, and CDR logic]
Package Routing Challenges
• Prior generation already consumed all package routing resources for memory and IO
• Connecting 9 chiplets in the same package requires innovation
[Figure: 1st Gen AMD EPYC™ four-die MCM (Die0-Die3, each with two CCXs, DDR, and I/O) [Beck ISSCC 2018]]
Under-CCD Routing
• Routing Infinity Fabric On-Package (IFOP) SerDes links from the IOD to the 2-deep chiplets required sharing routing layers with off-package SerDes and competing with power delivery requirements
[Figure: package layout – central IOD flanked by DDR, with four CCDs above and four below, off-package SERDES at the top and bottom edges]

Zen vs. Zen 2 VDDM Distribution
• Dense SRAMs require a separate rail
[Figure: Zen VDDM distribution via package plane vs. Zen 2 VDDM distribution via RDL only]
Zen 2 VDDM Design Challenges
• RDL is more resistive than a dedicated package layer
• Therefore we reduced overall VDDM current draw by 80% compared to Zen ([Singh ISSCC 2020])
• New, smaller, and distributed LDO design: 4 VDDM LDOs inside the L3
• Ensured sufficient routing porosity through the integrated LDOs to enable critical routing
  – Enables 80 IFOP package-routed signals under the CCD
• These improvements kept the IR drop to ≈10mV
[Figure: CCD floorplan – Core+L2 tiles and 4MB L3 slices, VDDM RDL, LDO spanning L2 and L3]
Package Integration, Server, and Desktop
• Bump pitch for 14nm and 7nm is 150um and 130um respectively
• Transitioned IOD from solder bumps to copper pillars, enabling a common interface for IOD+CCD
  • Conducive to tighter bump pitches (compact)
  • Enabled common die height after assembly
  • Higher max current (electromigration) limits
[Figure: 3rd Gen AMD Ryzen™ Processor (two Zen2 CCDs) and 2nd Gen AMD EPYC™ Server Processor (eight Zen2 CCDs) packages – Infinity Fabric (die-to-die), IO controllers and PHYs, 2x DDR4 PHYs; 72 data + 8 clk/ctl IFOP bumps (total/CCD); 128 total x16 SERDES]
Operating System Scheduler Optimizations
• Growing number of cores and the advent of chiplets resulted in a wider range of frequency responses to process, voltage, and temperature variations
  – Up to 200MHz core-to-core Fmax upside within a CCD
  – Legacy boost approaches don't take advantage of the faster cores
• Preferred Core Ordering maximizes performance
  – New algorithm characterizes the capabilities of the cores at boot time under various system parameters and generates a list of cores in order of frequency capability
  – The core ordering is modified according to the usage policy detected
    • Single-threaded applications scheduled to the fastest cores
    • Multi-threaded applications scheduled toward the fastest core cluster (CCX), maximizing L3 cache sharing
  – This core ordering is expressed to the OS, allowing for efficient, dynamic, HW-directed selection of processors for a given workload
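The selection policy above can be sketched as follows. The per-core Fmax values and CCX grouping are hypothetical; in the real part the ranking comes from boot-time characterization and is exposed to the OS rather than computed like this.

```python
# Sketch of the Preferred Core Ordering idea described above.  The
# fmax values and CCX grouping are HYPOTHETICAL illustration data.
cores = {  # core id -> (ccx id, characterized Fmax in GHz)
    0: (0, 4.60), 1: (0, 4.70), 2: (0, 4.55), 3: (0, 4.50),
    4: (1, 4.40), 5: (1, 4.45), 6: (1, 4.35), 7: (1, 4.30),
}

# Boot-time ordering: all cores ranked by frequency capability.
preferred_order = sorted(cores, key=lambda c: cores[c][1], reverse=True)

def pick_cores(n_threads):
    """1T work goes to the single fastest core; MT work is steered to
    the fastest CCX first, maximizing L3 cache sharing."""
    if n_threads == 1:
        return [preferred_order[0]]
    # Rank CCXs by their fastest member, then fill CCX by CCX.
    ccx_rank = sorted({ccx for ccx, _ in cores.values()},
                      key=lambda x: max(f for cx, f in cores.values() if cx == x),
                      reverse=True)
    by_ccx = sorted(cores, key=lambda c: (ccx_rank.index(cores[c][0]), -cores[c][1]))
    return by_ccx[:n_threads]
```

With this data, a 1T job lands on the single fastest core, while a 4T job stays within one CCX even though the second CCX contains cores faster than the slowest core of the first.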

Per-Core Linear Regulation
• Regulating the voltage per core enables power savings by adapting the voltage to each core's capability and compensating for power delivery gradients across the package
• Digitally controlled LDO enables setting voltage based on per-core speed capability for a given frequency
• Droops mitigated with fast-response charge injection from RVDD for cores with a drop-out
[Figure: 8 cores per chiplet, each with a separate VDD – 64 total core-specific voltages]
Clock Stretching and Per-Core Voltage
• Droop detection with a fast analog comparator
• Separate thresholds for LDO charge injection (CI level) and for clock stretching (CKS)
• These work synergistically to lower the required voltage for a given frequency

Same-frequency power savings through voltage reduction1
No LDO, no CKS  0%
LDO only        19%
CKS only        19%
LDO and CKS     25%

• A VDD droop forces a core clock stretch after 1 more full frequency period; the clock stretch response, rise-to-rise, is 150% period, 175% period, then 125% periods
[Figure: per-core DLDO output voltages (VID at the top of the load line, VDDCORE0-VDDCORE7 with the slowest core at the bottom) and droop waveforms (IDD, DROOP, CCLK) across Idle/TDC/EDC]
1. Based on AMD internal testing of 64C AMD EPYC "Rome" processor operating at 2.5GHz, synthetic di/dt pattern
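The two-threshold behavior can be sketched as below. The threshold values are hypothetical, since the deck gives the CI and CKS levels only qualitatively.

```python
# Toy model of the two-threshold droop response described above: a
# fast comparator flags a droop, charge injection (CI) engages at one
# threshold and clock stretching (CKS) at a deeper one.  Both
# thresholds are HYPOTHETICAL illustration values.
CI_LEVEL  = 0.97   # engage LDO charge injection below 97% of target (assumed)
CKS_LEVEL = 0.95   # stretch the clock below 95% of target (assumed)

def droop_response(v_ratio):
    """Return the assist mechanisms engaged for a given VDD/target ratio."""
    actions = []
    if v_ratio < CI_LEVEL:
        actions.append("charge_injection")
    if v_ratio < CKS_LEVEL:
        actions.append("clock_stretch")
    return actions

print(droop_response(0.96))  # shallow droop: charge injection only
print(droop_response(0.93))  # deep droop: charge injection + clock stretch
```

Stacking the mechanisms this way lets the shallow-droop case be handled without any frequency loss, which is why the combination saves more power (25%) than either mechanism alone (19%).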
Improving Memory Performance
• Server memory latency is a key factor in performance
• A goal for 2nd Gen AMD EPYC™ was to improve on the 2017 1st Gen EPYC™ design
• Non-Uniform-Memory-Access (NUMA) behaviors are a result of memory interfaces being distributed across die
• Prior generation (EPYC 7001 Series processors): 3 NUMA distances, 8 NUMA domains1
• Significant deltas from NUMA1 to NUMA2 impact performance for some applications

Domain       Latency (ns)
NUMA1        90
NUMA2        141
Avg. Local2  128
NUMA3        234

1: AMD internal testing with DRAM page miss
2: 75% NUMA 2 + 25% NUMA 1 traffic mix
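The Avg. Local row can be reproduced from the stated traffic mix:

```python
# "Avg. Local" is the 75%/25% weighted mix of the NUMA2 and NUMA1
# latencies from the table (AMD internal testing, DRAM page miss).
numa1_ns, numa2_ns = 90, 141
avg_local_ns = 0.25 * numa1_ns + 0.75 * numa2_ns
print(f"Avg. Local ≈ {avg_local_ns:.0f} ns")  # matches the 128 ns in the table
```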
2nd Gen AMD EPYC™ Improved Memory Latency
• Central IOD enables a single NUMA domain per socket: CCD0-CCD7 and IO0-IO3 with memory channels MA/MB/MC/MD/ME/MF/MG/MH interleaved; 1.46GHz FCLK / DDR2933 (coupled)1
• Improved average memory latency1 by 24ns (19%)2
• Minimum (local) latency only increases 4ns with chiplet architecture
• Latency by hop1: (1) local 94ns, (2) ~97ns, (3) ~104ns, (4) ~114ns; measured average ~104ns
• Repeater: 1 FCLK (1.46GHz); switch: 2 FCLK (1.46GHz) (low-load bypass, best-case)
[Figure: 2nd Gen EPYC floorplan – eight CCDs around the central IOD with UMC0-UMC7, G0-G3 PCIe/xGMI, P0-P3 PCIe, and S-Links]
1: AMD internal testing with DRAM page miss
2: EPYC 7002 Series NUMA1 vs. EPYC 7001 Series Avg. Local; EPYC 7002 Series NUMA2 vs. EPYC 7001 Series NUMA3
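The headline improvement can be reproduced from the two latency tables: the prior generation's 128ns Avg. Local against the ~104ns measured average of the central-IOD design.

```python
# Headline 24ns (19%) improvement: EPYC 7001 Avg. Local vs. the
# 2nd Gen EPYC measured average (both from this deck's tables).
prev_avg_ns = 128   # EPYC 7001 Avg. Local (previous slide)
new_avg_ns = 104    # 2nd Gen EPYC measured average, ~104ns
delta_ns = prev_avg_ns - new_avg_ns
improvement = delta_ns / prev_avg_ns
print(f"{delta_ns} ns ({improvement:.0%})")
```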
2nd Gen AMD EPYC™ Chiplet Performance vs. Cost
• Higher core counts and performance than possible with a monolithic design
• Lower costs at all core count / performance points in the product line
• Cost scales down with performance by depopulating chiplets
• 14nm technology for IOD reduces the fixed cost
[Figure: normalized die cost at 64/48/32/24/16 cores – chiplet 7nm + 14nm vs. hypothetical monolithic 7nm]
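A toy Poisson-yield cost model illustrates why several small CCDs beat one large die on 7nm silicon cost. The defect density and the monolithic die area are hypothetical stand-ins, not AMD's cost data.

```python
import math

# Toy die-cost comparison under a Poisson yield model:
#   yield = exp(-D0 * area),  cost per good die ~ area / yield.
# D0 and the monolithic area are HYPOTHETICAL illustration values.
D0 = 0.2  # defects per cm^2 (assumed)

def cost_per_good_die(area_mm2):
    """Relative silicon cost, proportional to area divided by yield."""
    area_cm2 = area_mm2 / 100.0
    die_yield = math.exp(-D0 * area_cm2)
    return area_mm2 / die_yield

ccd_mm2 = 74.0                                 # CCD area per the deck
chiplet_cost = 8 * cost_per_good_die(ccd_mm2)  # eight small 7nm dies
mono_cost = cost_per_good_die(8 * ccd_mm2)     # one big hypothetical die
print(f"chiplet/monolithic 7nm silicon cost ratio: {chiplet_cost / mono_cost:.2f}")
```

Because yield falls exponentially with area, the eight small dies cost a fraction of one equal-area monolithic die, and depopulating chiplets then scales cost down linearly with core count.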

3rd Gen AMD Ryzen™ Processor Chiplet Performance vs. Cost
• Similar cost savings and scalability for desktop
• Re-using the client IO die for the X570 Chipset expander enables optional additional connectivity for higher-end systems
  • PCIe, SATA, USB
[Figure: normalized die cost at 16 and 8 cores – chiplet 7nm + 14nm vs. hypothetical monolithic 7nm]

Performance Results
Chiplet architecture enables leadership performance and
performance/Watt in server and desktop markets
Metric at 105W TDP1           | Ryzen 2700X (8C) | Ryzen 3950X (16C) | Improvement (%)
Cinebench r15 1T              | 177              | 216               | 22%
Cinebench r20 1T              | 434              | 527               | 21%
Cinebench r15 NT              | 1802             | 3928              | 118%
Cinebench r20 NT              | 4020             | 8862              | 120%
1T Fmax (Max Boost, GHz)      | 4.3              | 4.7               | 9%
NT Base Freq (All-core, GHz)1 | 3.9              | 3.95              | 1%

Metric                        | EPYC 7601 (32C 2P, 180W TDP) | EPYC 7742 (64C 2P, 225W TDP) | Improvement (%)
SPECrate®2017_int_base2       | 272                          | 663                          | 144%
SPECrate®2017_fp_base2        | 259                          | 511                          | 97%
NT Base Freq (GHz)            | 2.2                          | 2.5                          | 14%

1: Testing as of 12/13/2019 by AMD Performance Labs using a Ryzen 9 3950X with 16 cores
vs. a Ryzen 7 2700X with 8 cores in the Cinebench R20 1T benchmark test.
Results may vary. RZ3-102
2: Results obtained from the SPEC® website as of Jan 3, 2020.
EPYC 7601 SPECrate®2017_int_base: https://www.spec.org/cpu2017/results/res2017q4/cpu2017-20171114-00833.html
EPYC 7601 SPECrate®2017_fp_base: https://www.spec.org/cpu2017/results/res2017q4/cpu2017-20171114-00845.html
EPYC 7742 SPECrate®2017_int_base: https://www.spec.org/cpu2017/results/res2019q4/cpu2017-20191028-19261.html
EPYC 7742 SPECrate®2017_fp_base: https://www.spec.org/cpu2017/results/res2019q4/cpu2017-20191028-19237.html
More information about SPEC CPU® 2017 can be obtained from https://www.spec.org/cpu2017. SPEC®, SPEC CPU® and SPECrate® are registered trademarks of the Standard Performance Evaluation Corporation.
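The improvement columns above follow directly from the raw scores; a quick arithmetic check (values copied from the tables):

```python
# Improvement = new / old - 1, using the raw scores from the slide:
scores = {
    "Cinebench r15 1T": (177, 216),
    "Cinebench r20 1T": (434, 527),
    "SPECrate2017_int_base": (272, 663),
    "SPECrate2017_fp_base": (259, 511),
}
for name, (old, new) in scores.items():
    print(f"{name}: +{round(100 * (new / old - 1))}%")
```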
© 2020 IEEE
International Solid-State Circuits Conference 2.2: AMD Chiplet Architecture for High-Performance Server and Desktop Products 24 of 27
Summary
• Chiplet architecture has proven key to achieving leadership performance,
  performance/$, and performance/Watt across multiple market segments

• Many significant innovations were required:
  • Package + silicon co-design for optimizing complex routes and the
    heterogeneous-technology chiplet die
  • Package-level fabric and interconnect architecture
  • Power delivery and voltage adaptation

[Figure: server package (Zen2 core + L3 chiplets around the IO die, IFOP links,
4 x16 PCIe/IFIP, 8x DDR) and client package (2 Zen2 core chiplets + IO die,
2x DDR, PCIe)]
© 2020 IEEE
International Solid-State Circuits Conference 2.2: AMD Chiplet Architecture for High-Performance Server and Desktop Products 25 of 27
Acknowledgment
We would like to thank our talented AMD design teams across Austin, Bangalore,
Boston, Fort Collins, Hyderabad, Markham, Santa Clara, and Shanghai.

© 2020 IEEE
International Solid-State Circuits Conference 2.2: AMD Chiplet Architecture for High-Performance Server and Desktop Products 26 of 27
Disclaimer and Endnotes

DISCLAIMER
The information contained herein is for informational purposes only, and is subject to change without notice. While every
precaution has been taken in the preparation of this document, it may contain technical inaccuracies, omissions and
typographical errors, and AMD is under no obligation to update or otherwise correct this information. Advanced Micro
Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the contents of this
document, and assumes no liability of any kind, including the implied warranties of noninfringement, merchantability or
fitness for particular purposes, with respect to the operation or use of AMD hardware, software or other products described
herein. No license, including implied or arising by estoppel, to any intellectual property rights is granted by this
document. Terms and limitations applicable to the purchase or use of AMD’s products are as set forth in a signed
agreement between the parties or in AMD's Standard Terms and Conditions of Sale. GD-18

©2020 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, EPYC, RYZEN, Threadripper, Infinity
Fabric, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this
publication are for identification purposes only and may be trademarks of their respective companies.

© 2020 IEEE
International Solid-State Circuits Conference 2.2: AMD Chiplet Architecture for High-Performance Server and Desktop Products 27 of 27
A 220GOPS 96-core Processor with 6 Chiplets
3D-stacked on an Active Interposer Offering
0.6ns/mm Latency, 3TBit/s/mm2 inter-Chiplet Interconnects
and 156mW/mm2@82% Peak-Efficiency DC-DC Converters
Pascal Vivet¹, Eric Guthmuller¹, Yvain Thonnart¹, Gaël Pillonnet2, Guillaume Moritz2,
Ivan Miro-Panades¹, César Fuguet¹, Jean Durupt¹, Christian Bernard¹, Didier Varreau¹, Julian Pontes¹,
Sébastien Thuriès¹, David Coriat1, Michel Harrand¹, Denis Dutoit¹, Didier Lattard¹, Lucile Arnaud2,
Jean Charbonnier2, Perceval Coudrain2, Arnaud Garnier2, Frédéric Berger2, Alain Gueugnot2,
Alain Greiner3, Quentin Meunier3, Alexis Farcy4, Alexandre Arriordaz5, Séverine Cheramy2, Fabien Clermidy¹
pascal.vivet@cea.fr
¹Univ. Grenoble Alpes, CEA, LIST; 2Univ. Grenoble Alpes, CEA, LETI; 3Sorbonne Université, LIP6;
4STMicroelectronics; 5Mentor, A Siemens Business

This work was partly funded by the French National Program


Programme d’Investissements d’Avenir IRT Nanoelec under
Grant ANR-10-AIRT-05

© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 1 of 34
High Performance Computing & Big Data
• More cores + more accelerators + more memory
  – Similar constraints are appearing for embedded HPC (automotive, etc.)
  – Need both highly optimized generic and specialized functions
    (e.g. ML/AI accelerators)
  – Need a “go-to-market” solution for sustainable system differentiation

• System designers must offer:
  – Modular and cost-effective solutions
  – Energy efficiency of the system infrastructure
  – More on-chip memory bandwidth per core

→ With advanced CMOS issues, a “single die” solution is no longer viable
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 2 of 34
Chiplet Partitioning
• Chiplet motivations
– Cost driven
– Modularity driven using 3D technologies
– Heterogeneous integration

• Chiplet challenges?
  – Eco-system maturity
  – Technology & architecture partitioning
  – Chiplet interfaces, testability, 3D CAD flow, etc.

[D. Dutoit, Keynote, 3DIC’2014]

© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 3 of 34
Chiplet Partitioning : Solutions and Limitations
• Existing technologies
  – Organic substrates: AMD, 4-chiplet circuit, ISSCC’2018
  – Passive interposer (2.5D): TSMC, CoWoS, VLSI’2019
  – Silicon bridges: INTEL, EMIB bridge, ISSCC’2017

• But, some limitations
  – Chiplet communication limited to side-by-side communication, not scalable
  – How to integrate heterogeneous chiplets & differentiating functions?
  – How to integrate less-scalable functions (IOs, analog, power management)?
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 4 of 34
Active Interposer: Principle
[Figure: chiplets (clusters of cores) 3D-stacked on an active interposer]
• Scalable & distributed NoCs: any chiplet-to-chiplet traffic
• Power management: close to the cores
• SoC infrastructure: analog, IOs, PHY, DFT
• Additional features
→ Mature CMOS technology (with low logic density to preserve system cost)
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 5 of 34
Outline
• Introduction
– Chiplet-partitioning and Active Interposer concept
• Circuit architecture
– Circuit overview
– Chiplet overview
• Active interposer design details
– System Interconnects
– Power Management
• Circuit results & performances
• Conclusions & Perspectives

© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 6 of 34
6 Chiplets 3D-stacked on an Active Interposer
• Chiplet overview
  – 4 clusters of 4 cores
  – Distributed L1$ + L2$ + L3$
  – Scalable cache coherency
• Active interposer
  – Distributed flexible interconnects: NoCs (routers & pipelined links)
  – Integrated SCVRs (1 per chiplet)
  – Memory controller & system IOs
  – SoC infrastructure (clk, rst, config, test), DFT

[Figure: cross-section; chiplets (16 cores each: clusters 0-3 + L3 + SoC
infrastructure + 3D Plugs) mounted via µ-bumps (Ø10µm) on the active interposer
(distributed NoCs with routers & pipelined links, power management, memory-IO),
which sits on the package substrate via C4 bumps (Ø90µm) and balls (Ø500µm);
supplies: 1.5-2.5V VDD-chiplet, 1.2V VDD-interposer; off-chip links]
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 8 of 34
6 Chiplets 3D-stacked on an Active Interposer
• Chiplets (FDSOI 28nm)
  – 4 clusters of 4 cores each, distributed L1$ + L2$ + L3$,
    scalable cache coherency
• Active interposer (CMOS 65nm)
  – Distributed flexible interconnects, integrated SCVRs (1 per chiplet),
    memory controller & system IOs, SoC infrastructure, DFT
→ 96 cores: 6 chiplets 3D-stacked on an active CMOS interposer
→ 2 technology nodes of difference between the chiplets and the bottom die

[Die photo: the 6 chiplets on the active interposer]
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 9 of 34
Chiplet Main Features
• 16 x MIPS® 32-bit scalar cores
• Memory is physically distributed through the chiplet L2 caches,
  with virtual memory support
  – L1 I-caches + D-caches (16 kB / core)
  – Distributed shared L2 caches (256 kB / cluster)
  – Adaptive & fault-tolerant L3 caches (4 tiles of 1 MB)
• Directory-based cache coherence with linked-list directory [5]
• 2D-mesh NoCs (L1-L2, L2-L3, L3-ExtMem), extended through the active interposer
• FDSOI 28nm, LPLV, [0.5-1.3V], with body biasing
  – FLLs, timing-fault sensors, thermal sensors

[5] E. Guthmuller et al., “A 29 Gops/Watt 3D-Ready 16-Core Computing Fabric with
Scalable Cache Coherent Architecture Using Distributed L2 and Adaptive L3 Caches”,
ESSCIRC’2018.
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 10 of 34
Outline
• Introduction
– Chiplet-partitioning and Active Interposer concept
• Circuit architecture
– Circuit overview
– Chiplet overview
• Active interposer design details
– System Interconnects
– Power Management
• Circuit results & performances
• Conclusions & Perspectives

© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 11 of 34
System Level Interconnects
[Figure: two chiplets with 3D Plugs over the active interposer; three link classes:
L1-L2 (short reach, passive, to the next chiplet), L2-L3 (long reach, asynchronous,
active, with routers), and L3-Ext-Mem (synchronous, active, with routers) down to
the memory-IO controller]

• Distributed & flexible interconnects within the active interposer
  – Multiple Networks-on-Chip (routers + links)
  – 3D-Plug communication IPs, synchronous & asynchronous versions
• Chiplet-to-chiplet communication schemes
  – Passive links, short reach (L1-L2)
  – Active links, long reach (L2-L3, L3-ExtMem)
→ allows chiplet-to-any-chiplet scalable traffic
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 12 of 34
3D-Plug Communication IP: synchronous version
[Figure: 3D-Plug TX/RX pair; NoC virtual-channel controller with VCid, credit
return, source-synchronous CLK_TX/CLK_RX with phase adjustment]

• Chiplet-to-chiplet communication
  – NoC virtualization
  – High throughput
  – Low latency
• Circuit design
  – Credit-based multi-channel synchronization → fully digital design
  – Source-synchronous scheme, with delay compensation → full-swing logic, no DLL
  – Integrates: µ-bumps + µ-buffers + DFT (boundary scan)
→ 3D fine-pitch parallel interface
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 13 of 34
3D-Plug Communication IP: layout overview
[Figure: chiplet layout with 3D-Plug interfaces; µ-buffer std-cells and µ-bumps
at 20µm pitch; the µ-buffer std-cell integrates a bidirectional driver + ESD +
pull-up + level-shifter]

The 3D-Plug integrates:
• Logic interface
• µ-bumps
• µ-buffer std-cells
• DFT
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 14 of 34
3D-Plug Communication IP: sync. version performance

                        | This work                | [2] VLSI'19         | Units
Technology              | 28nm FDSOI chiplets,     | 7nm FinFET chiplet  |
                        | 65nm active interposer   |                     |
3D link type            | Active (face-to-face)    | 3D LIPINCON™,       |
                        |                          | passive (CoWoS™)    |
Die-to-die bump pitch   | 20                       | 40                  | µm
Voltage swing           | 1.2                      | 0.3                 | V
Data rate               | 1.21                     | 8                   | Gb/s/pin
Power efficiency        | 0.59                     | 0.56                | pJ/bit
Bandwidth density       | 3.0                      | 1.6                 | Tb/s/mm²

• Performance: 1.2 Gb/s/pin, 0.59 pJ/bit, 3.0 Tb/s/mm²
→ 2x better bandwidth density than SoA

[2] Mu-Shan Lin et al., “A 7nm 4GHz Arm®-core-based CoWoS® Chiplet Design for
High Performance Computing”, Symposium on VLSI Circuits, June 2019.
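The 3.0 Tb/s/mm² bandwidth density is consistent with the bump pitch and per-pin data rate. A sketch, assuming a full 2D array of signal µ-bumps at the 20µm pitch (ignoring any power/ground bumps, so this is an upper-bound check):

```python
pitch_um = 20           # die-to-die µ-bump pitch from the table
data_rate_gbps = 1.21   # per-pin data rate from the table
bumps_per_mm2 = (1000 / pitch_um) ** 2               # 50 x 50 = 2500 bumps/mm²
density_tbps_mm2 = bumps_per_mm2 * data_rate_gbps / 1000
print(f"{density_tbps_mm2:.3f} Tb/s/mm²")            # matches the 3.0 in the table
```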

© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 15 of 34
System Level Interconnects: L1-L2
[Figure: L1-L2 path from chiplet 00 through chiplet 11 to chiplet 12, over M3-M5
passive interposer routing with clock shielding; sync 3D-Plugs, sync routers,
8 FIFOs, 1.0V; measured hops: 1.5 mm at 7.2 ns / 0.75 pJ/bit, 5 mm at
8 ns / 0.7 pJ/bit]

                    | nearest          | farthest        | Units
Interposer          | 1 passive link   | 3 passive links |
3D-Plug frequency   | 1.25             | 1.25            | GHz
2D NoC frequency    | -                | 1.00            | GHz
End-to-end latency  | 2x4+[0-1]        | 44              | cycles
                    | 7.2              | 44.0            | ns
Propagation speed   | 4.8              | 2.9             | ns/mm
Energy / bit / mm   | 0.29             | 0.15            | pJ/bit/mm

• L1-L2 interconnect
  – 3D-Plug sync. version + passive links
  – Synchronous NoC routers (within the chiplets)
  – Global clocking + clock gating
• Performance
  – 3D-Plug interface throughput: 1.25 GHz
  – SNoC local throughput: 1 GHz
  – Large end-to-end latency: 44 ns (44 cycles), due to re-timing and
    re-synchronization
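Propagation speed in the table is end-to-end latency divided by path length. A sketch: the 1.5 mm figure is from the slide, while the ~15 mm farthest distance is an assumption inferred back from the 2.9 ns/mm entry.

```python
# Propagation speed = end-to-end latency / distance:
paths = {"nearest (1.5 mm)": (7.2, 1.5), "farthest (~15 mm, assumed)": (44.0, 15.0)}
for name, (latency_ns, dist_mm) in paths.items():
    print(f"{name}: {latency_ns / dist_mm:.1f} ns/mm")
```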
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 16 of 34
3D-Plug Communication IP: asynchronous version
• Asynchronous logic
  – Quasi-Delay-Insensitive (QDI) logic
  – Uses 1-of-4 data encoding
  – Deep pipelining, achieving low latency
• Circuit implementation
  – 4-phase protocol for on-die communication (active interposer)
  – 2-phase protocol for off-die communication (3D-Plug interface)
  – 4-phase ↔ 2-phase protocol converters

[Figure: 1-of-4 asynchronous pipeline stage built from C-element gates;
2-phase and 4-phase handshake waveforms]

→ Robust asynchronous design
→ No clocking at the 3D interface [6]
→ 2-phase protocol reduces the penalty of 3D-interface delays

[6] P. Vivet et al., “A 4x4x2 Homogeneous Scalable 3D Network-on-Chip Circuit
with 326 MFlit/s 0.66 pJ/bit Robust and Fault Tolerant Asynchronous 3D Links”,
ISSCC’2016.
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 17 of 34
System Level Interconnects: L2-L3

                    | 4-phase       | 2-phase       | Units
Interposer          | Active async. | Active async. |
3D-Plug frequency   | 0.30          | 0.52          | GHz
2D NoC frequency    | 0.97          | 0.97          | GHz
End-to-end latency  | 4 + async.    | 4 + async.    | cycles
                    | 15.2          | 15.2          | ns
Propagation speed   | 0.6           | 0.6           | ns/mm
Energy / bit / mm   | 0.52          | 0.52          | pJ/bit/mm

• L2-L3 interconnect
  – 3D-Plug async. version + vertical connection
  – Asynchronous NoC routers
  – Pipelined links (1 pipe stage every 500 µm)
• Performance
  – 3D-Plug interface throughput: 520 MHz
    (2-phase is 1.7x better than the 4-phase version)
  – ANoC local throughput: 970 MHz
  – Overall best end-to-end latency: 15.2 ns
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 18 of 34
System Level Interconnects: L3 - EXT-MEM
[Figure: memory + IO controller in the interposer, with an LVDS PHY for
off-chip traffic]

                    | L3-EXT-MEM   | Units
Interposer          | Active sync. |
3D-Plug frequency   | 1.21         | GHz
2D NoC frequency    | 0.75         | GHz
End-to-end latency  | 37           | cycles
                    | 49.5         | ns
Propagation speed   | 2.0          | ns/mm
Energy / bit / mm   | 0.24         | pJ/bit/mm

• L3 - EXT-MEM interconnect
  – 3D-Plug sync. version + vertical connection
  – Synchronous NoC routers
  – Pipelined links (1 pipe stage every 1000 µm)
  – Off-chip communication: 4x32-bit LVDS @ 600 Mb/s
• Performance
  – 3D-Plug interface throughput: 1.21 GHz
  – SNoC throughput: 750 MHz
  – Large end-to-end latency: 49.5 ns
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 19 of 34
System Level Interconnects: Comparison

                     | L1-L2          | L2-L3          | L3-EXT-MEM    | Units
Link type            | Passive, sync. | Active, async. | Active, sync. |
3D-Plug frequency    | 1.25           | 0.52           | 1.21          | GHz
2D NoC frequency     | 1.00           | 0.97           | 0.75          | GHz
End-to-end latency*  | 44             | 4 + async.     | 37            | cycles
                     | 44.0           | 15.2           | 49.5          | ns
Propagation speed    | 2.9            | 0.6            | 2.0           | ns/mm
Energy / bit / mm    | 0.15           | 0.52           | 0.24          | pJ/bit/mm
* A => B end-to-end latency

• 3D-Plug: best throughput for the synchronous version (1.25 GHz)
• Interposer: similar throughput between SNoC & ANoC (~1 GHz);
  best latency for ANoC, 0.6 ns/mm (3-5x wrt. SNoC): latency reduction for
  cache-coherency traffic, at the cost of energy
→ Combination of interconnect types to achieve performance trade-offs
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 20 of 34
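The cycle and nanosecond latency rows of the comparison are mutually consistent with the NoC frequencies; a quick check for the two synchronous links (values from the table):

```python
# latency_ns should be close to latency_cycles / NoC_frequency_GHz:
links = {"L1-L2": (44, 1.00, 44.0), "L3-EXT-MEM": (37, 0.75, 49.5)}
for name, (cycles, f_ghz, table_ns) in links.items():
    print(f"{name}: {cycles / f_ghz:.1f} ns computed vs. {table_ns} ns in the table")
```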
Outline
• Introduction
– Chiplet-partitioning and Active Interposer concept
• Circuit architecture
– Circuit overview
– Chiplet overview
• Active interposer design details
– System Interconnects
– Power Management
• Circuit results & performances
• Conclusions & Perspectives

© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 21 of 34
Switched Cap Voltage Regulators: Principle
• Distributed power supply units
  – Local DVFS scheme, below each chiplet
  – Fast transitions & reduced IR-drop effects
  – “High” input voltage (up to 2.5V) reduces the number of power/ground IOs
    in the package
• Fully integrated
  – No external passive components; thick-oxide transistors
  – On-chip caps only (MOS+MOM+MIM, 8.9 nF/mm²)
  – 50% of the chiplet area, fault tolerant, in the interposer
  – P/G delivery as a µ-bump flip-chip matrix

[Figure: cross-section; VIN enters through TSVs into the DC-DC converter in the
interposer, P/G to the chiplet (VOUT) through µ-bumps]
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 22 of 34
Switched Cap Voltage Regulators: Circuit Design
• Circuit design
  – 3-stage gear box, 7 voltage ratios
  – VIN [1.8V - 2.5V]; VOUT [0.35V - 1.3V]
  – Tile-based layout in a checkerboard pattern: replicated unit cell at
    C4-bump pitch, 270 cells
  – Central clock frequency, feedback controller
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 23 of 34
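The 82% peak efficiency reported later for the 2:1 ratio sits below the intrinsic bound of an ideal switched-capacitor stage. A sketch, where the 1.1V operating point is an assumption for illustration:

```python
def sc_intrinsic_efficiency(vin, vout, ratio):
    """Upper bound on switched-capacitor converter efficiency: an ideal N:M
    stage behaves as a lossless transformer plus an effective series
    resistance, so efficiency cannot exceed vout / (vin * ratio)."""
    return vout / (vin * ratio)

# 2:1 gear from VIN = 2.5V, regulating down to an assumed VOUT = 1.1V:
eta_bound = sc_intrinsic_efficiency(2.5, 1.1, 0.5)
print(f"intrinsic bound: {eta_bound:.0%}")
```

Switching and control losses explain the gap between this bound and the measured peak.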
Switched Cap Voltage Regulators: Circuit Results
• Power conversion efficiency
  – 156 mW/mm² power density @ 82% peak efficiency (2:1 ratio)
  – Better efficiency than an integrated LDO

[Plot: measured power-conversion efficiency]
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 24 of 34
Outline
• Introduction
– Chiplet-partitioning and Active Interposer concept
• Circuit architecture
– Circuit overview
– Chiplet overview
• Active interposer design details
– System Interconnects
– Power Management
• Circuit results & performances
• Conclusions & Perspectives

© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 25 of 34
Circuit Overview
• Die technologies
  – Chiplet: FDSOI 28nm, ULV + body bias, 22mm²
  – Active interposer: CMOS 65nm, MIM option, 200mm²
• 3D technology integration
  – µ-bumps, 20µm pitch (150k)
  – TSV middle, 40µm pitch
  – Face-to-face assembly on the package substrate
  – 6 chiplets

[Photos: 3D cross-section, chiplet front-face, active interposer front-face,
3D integration and final package]
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 26 of 34
Circuit Performance
• Main performance
  – Frequency range: 130 MHz @ 0.5V to 1.15 GHz @ 1.1V, with FDSOI back-bias
  – Peak performance: 220 GOPS for all 96 cores @ 1.15 GHz
  – Best energy efficiency: 9.6 GOPS/W (CoreMark) @ 246 MHz @ 0.6V
• Power consumption breakdown
  – Cores+L1: ~50% of the power per chiplet
  – Interposer logic & interconnect (w/o IOs): only 3% of the overall budget
  – SCVR: 17% of the overall power budget

[Charts: per-chiplet power breakdown (Cores+L1 55%, Clks 21%, L2 12%, L1-L2 5%,
Misc 2%, L3 1%) and the split across chiplets 1-5]
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 27 of 34
Circuit Performance: SCVR efficiency
• Switched-Cap Voltage Regulator (SCVR)
  – SCVR configured at the best ratio according to the chiplet voltage:
    3:1 → 2:1 → 3:2 @ fixed VIN = 2.5V
• SCVR versus an integrated LDO?
  – An LDO at the same VIN = 2.5V would draw roughly twice the input power:
    the SCVR runs at 0.45x-0.5x the LDO's consumption across the range
  – Higher VIN + increased conversion efficiency reduces the power pin count
→ The fully integrated SCVR enables high efficiency along the full voltage range

[Plot: efficiency vs. chiplet voltage for the SCVR ratios vs. an LDO at
VIN = 2.5V]
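The ~0.5x power figure can be reproduced with a first-order model. This is illustrative only: an ideal LDO draws the load current straight from VIN, the SCVR is taken at its reported 82% efficiency, and the 1.1V output is an assumed operating point.

```python
# Input power per watt of load power:
vin, vout, eta_scvr = 2.5, 1.1, 0.82
p_in_ldo = vin / vout      # ideal LDO: P_in = P_load * VIN / VOUT
p_in_scvr = 1 / eta_scvr   # SCVR: P_in = P_load / efficiency
print(f"SCVR/LDO input-power ratio: {p_in_scvr / p_in_ldo:.2f}")
```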
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 28 of 34
Circuit Performance: Scalability
• Memory hierarchy and system-level interconnect study
  – Execution of a 4Mpixel filtering application (including convolution,
    transposition, and synchronization with barriers); the dataset fits
    in the L3$
  – Scalability study of the 96-core circuit:
    acceleration ratio of 67x for 96 cores
  – Scalability of a 512-core circuit (HW emulation):
    acceleration ratio of 340x for 512 cores
→ Cache-coherency protocol + system-level interconnects sustain the traffic
and are scalable
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 29 of 34
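The acceleration ratios above translate into parallel efficiency with simple speedup-per-core arithmetic on the slide's numbers:

```python
# Parallel efficiency = speedup / core count:
runs = {"96 cores (silicon)": (67, 96), "512 cores (HW emulation)": (340, 512)}
for name, (speedup, n_cores) in runs.items():
    print(f"{name}: {speedup / n_cores:.0%} parallel efficiency")
```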
Comparison with State-of-the-Art

                          | This work               | [4] ISSCC'18 INTEL  | [1] ISSCC'18 AMD | [2] VLSI'19 TSMC | [3] ISSCC'17 INTEL | Units
Chiplet technology        | FDSOI 28nm              | FinFET 14nm         | FinFET 14nm      | FinFET 7nm       | FinFET 14nm        |
Interposer technology     | Active, CMOS 65nm       | no                  | MCM substrate    | Passive CoWoS®   | EMIB bridge        |
Interposer extra features | yes                     | N/A                 | no               | no               | no                 |
Total system yield        | High, using an active   | N/A                 | high             | high             | high               |
                          | interposer in a mature  |                     |                  |                  |                    |
                          | technology with low     |                     |                  |                  |                    |
                          | transistor count        |                     |                  |                  |                    |
Die-to-die µbump pitch    | 20                      | N/A                 | > 100            | 40               | 55                 | µm
Voltage regulator type    | 1 SCVR per chiplet,     | on-chip LDO per     | no               | no               | no                 |
                          | integrated in the       | core, distributed,  |                  |                  |                    |
                          | interposer              | with MIM            |                  |                  |                    |
                          | (MOS+MOM+MIM)           |                     |                  |                  |                    |
VR area                   | 34% of active           | MIM above 40%       | -                | N/A              | N/A                |
                          | interposer              | of core area        |                  |                  |                    |
VR peak efficiency        | 82%                     | 72%, LDO limited    | -                | N/A              | N/A                |

→ First active interposer, with fully integrated SCVRs, up to 82% efficiency
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 30 of 34
Comparison with State-of-the-Art (continued)

                          | This work               | [4] ISSCC'18 INTEL | [1] ISSCC'18 AMD | [2] VLSI'19 TSMC | [3] ISSCC'17 INTEL | Units
Interconnect type         | Distributed NoC meshes  | N/A                | Scalable Data    | LIPINCON™ links  | AIB interconnect   |
                          | for scalable chip-to-   |                    | Fabric (SDF)     |                  |                    |
                          | chip cache-coherency    |                    |                  |                  |                    |
                          | traffic                 |                    |                  |                  |                    |
3D Plug power efficiency  | 0.59                    | N/A                | 2.0              | 0.56             | 1.2                | pJ/bit
BW density                | 3.0                     | N/A                | -                | 1.6              | 1.5                | Tb/s/mm²
Aggregate 3D bandwidth    | 527                     | N/A                | -                | 640              | 504                | GByte/s
Number of chiplets        | 6                       | 1                  | 1-4              | 2                | 1 FPGA fabric +    |
                          |                         |                    |                  |                  | 6 transceivers     |
Number of cores           | 96                      | 18                 | 8-32             | 8                | FPGA fabric        |
Max frequency             | 1.15                    | 0.4                | 4.1              | 4                | 1                  | GHz
GOPS (32b integer)        | 220 (peak mult./acc.)   | 14.4               | 131.2 - 524.8    | 128              | N/A                | Gop/s

→ First active interposer, with distributed NoC meshes and 3.0 Tb/s/mm²
interfaces, offering a total of 96 cores
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 31 of 34
Outline
• Introduction
– Chiplet-partitioning and Active Interposer concept
• Circuit architecture
– Circuit overview
– Chiplet overview
• Active interposer design details
– System Interconnects
– Power Management
• Circuit results & performances
• Conclusions & Perspectives

© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 32 of 34
Conclusions and Perspectives
• Active interposer & chiplet partitioning
  – Integration of interconnects, power management, and IOs
  – Scalable cache-coherency protocol
  – 3 Tbit/s/mm² 3D interface achieved
  – Low-latency (0.6 ns/mm) long-reach asynchronous interconnect
  – Power management @ 82% efficiency, close to the cores, without passives
→ Increases system energy efficiency and on-chip memory bandwidth per core

• Perspectives
  – Progressive setup of a chiplet eco-system
  – The active interposer as an enabler for differentiation: integrating
    heterogeneous functions & chiplets
© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 33 of 34
• Acknowledgments
This work was partly funded by the French National Program
Programme d’Investissements d’Avenir IRT Nanoelec under
Grant ANR-10-AIRT-05

• Thank you for your attention

© 2020 IEEE 2.3 : A 220GOPS 96-core Processor with 6 Chiplets 3D-stacked on an Active Interposer Offering 0.6ns/mm Latency,
International Solid-State Circuits Conference 3TBit/s/mm2 inter-Chiplet Interconnects and 156mW/mm2 @ 82% Peak-Efficiency DC-DC Converters 34 of 34
A 7nm High-Performance and Energy-Efficient
Mobile Application Processor with Tri-Cluster
CPUs and a Sparsity-Aware NPU
Young Duk Kim,
Wookyeong Jeong, Lakkyung Jung, Dongsuk Shin,
Jae Geun Song, Jinook Song, Hyeokman Kwon, Jaeyoung Lee,
Jaesu Jung, Myungjin Kang, Jaehun Jeong, Yoonjoo Kwon,
Nak Hee Seong

Samsung Electronics, Hwaseong, Korea

© 2020 IEEE 2.4: A 7nm High-Performance and Energy-Efficient Mobile Application Processor with Tri-Cluster CPUs and a Sparsity-Aware NPU
International Solid-State Circuits Conference 1 of 26
Background
• Smartphone users want to
  – Run applications smoothly
  – Have a better gaming experience
  – Enhance the multimedia experience, including fancy cameras
  – Enjoy longer battery lifetime for an all-day experience

• The conclusion is “high performance” and “low power”

© 2020 IEEE 2.4: A 7nm High-Performance and Energy-Efficient Mobile Application Processor with Tri-Cluster CPUs and a Sparsity-Aware NPU
International Solid-State Circuits Conference 2 of 26
Outline
• 7nm power-efficient Exynos AP processor
• Tri-cluster CPUs
  – Power-efficient architecture
• NPU
  – Skipping zero-weight operations
• HWACG (HW Auto Clock Gating)
  – Clock power reduction in the idle state
• Droop detector
  – Reducing voltage droop
• 7nm process
  – Enhancing AC performance

[Die plot: Big CPUs, Middle/Little CPUs, GPU, NPU]
CPU Clusters
• Eight CPU cores with three different classes
• Two big cores (M4): 2.73GHz, two mid cores (CA75): 2.4GHz,
four little cores (CA55): 2.0GHz
• Heterogeneous Multi-Processor governed by Energy-aware scheduler

[Diagram: two M4 big cores (64KB L1I/L1D, 1MB private L2 each), two CA75
mid cores (64KB L1I/L1D, 256KB private L2 each), and four CA55 little cores
(32KB L1I/L1D each); 3MB and 1MB shared L3 caches; all clusters on the
coherent interconnect]

Big Custom CPU (1)
• Instruction front-end architecture
  • 6 micro-ops of bandwidth for decode, rename, dispatch, and retire
  • Improved branch prediction accuracy and latency
    • Neural-net-based main predictor
    • 128-entry uBTB, 4K-entry main BTB, 32K (16K*) branches in the L2 BTB
  • 228-entry ROB
[Diagram: branch predict (uBTB, main BTB, L2 BTB) → address queue → 64KB
I-cache → instruction queue → decode → rename → dispatch queue]
*M3 specification

Big Custom CPU (2)
• Integer and load/store execution pipes
  • Two simple ALUs + two complex ALUs
  • AGUs: 1 load + 1 load/store (1 store*) + 1 store
  • Improved memory latency through a direct path from the memory controller
  • 1MB (512KB*) private L2 cache per core
  • 3MB (4MB @ 4 cores*) shared L3 cache
  • 48 (32*)-entry DTLB, 512-entry BDTLB, 4K-entry L2 UTLB
[Diagram: integer schedulers (1 BR, 2 CALU, 2 ALU, 1 LD, 1 LD/ST, 1 ST,
1 ST-D) over the integer PRF; ALU/MUL/DIV/BR and AGU pipes feeding the 64KB
D-cache with BDTLB, DTLB/TAG, L2 UTLB, store queue, table walk, and prefetch]
*M3 specification

Big Custom CPU (3)
• Floating-point execution pipes
  • Three 128-bit floating-point pipes
  • 24 single-precision OPs per cycle
  • 2-cycle FADD, 3-cycle FMUL, 4-cycle FMAC latency
  • Two 128-bit-wide dot-product (Int8) units
[Diagram: floating-point scheduler and PRF feeding three pipes with
FMAC/FADD/FCVT/FDIV-SQRT/FST and NCRYPT/NALU/NSHUF/NSHIFT/NMUL units]

Big Custom CPU
• Samsung 4th-generation custom CPU
• Significantly improved memory-subsystem performance
[Chart: Geekbench v4 per-test relative architectural performance (score/GHz)
over M3 for M3, M4, and Cortex-A76; the higher the better; M4 peaks around
2.10–2.19x]
Single-Thread Performance
• Desktop-class single-thread performance
• Average 23% single-thread performance uplift from 3rd generation

Tri-cluster management (1)
• Allows seamless performance transitions via the middle CPU
• A single heavy task can be selectively assigned to the middle or big CPU

Tri-cluster management (2)
• In most user scenarios, the main workload can be covered by the middle
  CPU instead of the big CPU, reducing absolute power consumption.
• CPU total power comparison (big/Little @10nm vs. Tri-cluster @7nm)

Tri-cluster management (3)
• Allows various options for workload scheduling
• A single heavy task running on a little CPU can be selectively migrated
  to the middle or big CPU, depending on the demanded performance

[Chart: a task running on a little core, plotted as utilization vs. power;
demanded performance #1/#2/#3 compared against the max capacity of the
little, middle, and big clusters; option #1 migrates the task to the middle
cluster (min power at medium performance), option #2 to the big cluster]
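The option selection above can be sketched as a minimal rule: place the task on the lowest-power cluster whose max capacity meets the demanded performance. The capacity and power numbers below are invented for illustration, not measured values.

```python
# Hypothetical per-cluster limits (arbitrary units), ordered by rising power.
CLUSTERS = [
    ("little", 100, 1.0),   # (name, max capacity, power at max)
    ("middle", 180, 2.2),
    ("big",    260, 4.5),
]

def place_task(demand):
    """Pick the lowest-power cluster able to satisfy `demand`.

    Models options #1/#2: a heavy task on a little core migrates to the
    middle or big cluster only when the demanded performance exceeds the
    smaller cluster's max capacity.
    """
    for name, capacity, _power in CLUSTERS:
        if demand <= capacity:
            return name
    return CLUSTERS[-1][0]   # demand beyond all capacities: saturate on big
```

Because the middle cluster covers most demands, the big cluster is engaged only for the heaviest tasks, which is the power-saving argument on this slide.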
Tri-cluster management (4)
• Enhanced workload-scheduling method based on ISA (instruction set
  architecture) mode, in which 32-bit/64-bit energy efficiency is considered
• Each cluster's energy efficiency differs with ISA mode, so workloads
  should be scheduled against a different energy model
• The energy model is newly designed to account for 32-bit/64-bit energy
  efficiency
• CPU power improves by over 30% in specific scenarios such as 32-bit games

Tri-cluster management (5)
• Measurement results for the ISA-based scheduling method
• CPU total power comparison in the Lineage2 game
• The big CPU's tasks moved to the mid CPU, so total power decreased
[Chart: game power, normal scheduling vs. ISA-aware scheduling]
Sparsity-Aware Neural Processing Unit
• 1024 MACs always consume the incoming data every cycle
• Data-staging units dispatch the corresponding input feature maps and skip
  zero weights
• Activation-function units perform ReLU-family activation
• HW automatic clock gating (HWACG) is applied at module level
[Diagram: bus feeding two 512KB scratchpads; four data-staging units, each
under HWACG, driving dual-MAC arrays; activation-function units and
data-returning units on the output path]

Skipping Convolution: Moving OFM
• Input feature maps are buffered to march along the non-zero weight
  positions
[Diagram: a weight kernel whose non-zero weights are 1, 3, 5, 6, 7; on each
of cycles #1–#5, one non-zero weight is applied against the input feature
map to update the output feature map]
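The "moving OFM" scheme can be sketched in NumPy: a hypothetical `sparse_conv2d_valid` helper (illustrative, not the hardware interface) enumerates only the non-zero kernel positions, so the cycle count equals the number of surviving weights; with the slide's weights 1, 3, 5, 6, 7 that is five cycles.

```python
import numpy as np

def sparse_conv2d_valid(ifm, kernel):
    """Valid-mode correlation that visits only non-zero weights.

    Sketch of the NPU's data-staging idea: non-zero kernel positions are
    enumerated once, and each "cycle" broadcasts one weight against a
    shifted window of the input feature map (IFM), accumulating into the
    output feature map (OFM).
    """
    kh, kw = kernel.shape
    oh = ifm.shape[0] - kh + 1
    ow = ifm.shape[1] - kw + 1
    ofm = np.zeros((oh, ow))
    nonzero = [(r, c) for r in range(kh) for c in range(kw)
               if kernel[r, c] != 0]
    for r, c in nonzero:                      # one non-zero weight per cycle
        ofm += kernel[r, c] * ifm[r:r + oh, c:c + ow]
    return ofm, len(nonzero)                  # cycles == non-zero weights
```

A heavily pruned kernel therefore needs proportionally fewer MAC cycles than a dense sweep, which is the FPS and efficiency gain shown on the next slide.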
Performance and Energy Efficiency
[Chart: normalized FPS (%) running Inception V3 at 5% vs. 80% weight
pruning rate]
[Chart: normalized energy efficiency (%) running Inception V3 at 5% vs.
80% weight pruning rate]
HWACG (HW Automatic Clock Gating)
• Reduces clock-tree power from the PLL to the IPs in the idle state
• Q-channel interface protocol between each IP and the HWACG controller
• Hierarchical architecture composed of parents and children
• If all the IPs attached to a clock-gating cell are idle, that clock is
  gated
[Diagram: PLL0/PLL1 → MUX → clock dividers and gating cells; each IP (IP0,
IP1, IP2) connects through a Q-channel interface; a gate closes when its own
IP is idle, and the PLL-side gate closes when all IPs below it are idle]
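The parent/child gating rule can be modelled as a small tree walk: a gating cell forwards the clock only while at least one IP beneath it is active. `ClockNode` and its methods are invented names for illustration, not the actual Q-channel protocol.

```python
class ClockNode:
    """Toy model of one hierarchical HWACG gating cell."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []   # lower-level gating cells
        self.ips = {}                    # ip_name -> active flag (Q-channel)

    def set_ip(self, ip, active):
        """Record an IP's activity as reported over its Q-channel."""
        self.ips[ip] = active

    def clock_running(self):
        """Clock runs if any local IP is active or any child still clocks."""
        return any(self.ips.values()) or any(
            c.clock_running() for c in self.children)
```

When every IP under the root reports idle, the PLL-side gate can close as well, which is the hierarchical saving the slide describes.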
HWACG
• HWACG differs from S/W-directed clock gating.
• Ideally the clock tree follows each IP's actual clock usage, and the H/W
  responds to it automatically.
• The clock can also be gated for CMU and bus components.
• S/W-assisted operation, where part or all of the gating behavior is
  directed by software, was also adopted.
• The measured power gain is as follows (-D: HWACG disabled, -E: HWACG
  enabled)

HWACG – EWS (Early Wake-up System)
• The early wake-up system reduces cumulative wake-up latency.
• Latency issues can appear especially when a multi-layered bus uses
  different PLL sources.
• A clock request from a latency-critical IP is delivered to multiple
  target domains to wake up multiple IPs at once instead of a sequential
  wakeup process.
[Diagram: the EWR (Early Wakeup Router) in BLK_CMU broadcasts
EARLY_WAKEUP__MAST_# to the EWGs (Early Wakeup Generators) in the CMUs of
BLK_#1..#n, which return ACTIVE__CMU_# for masters #1..#n]
Voltage Droop Mitigation (1)
• Voltage droop mitigation solution
• Droop detector (DD) monitors voltage droop in the target domain:
  (1) when a voltage droop falls below the threshold value,
  (2) the droop-detected flag is asserted, and
  (3) the CMU then halves the clock to the IP to reduce load current.

Voltage Droop Mitigation (2)
• Voltage droop mitigation solution
• One sensor in the GPU
• Calibration is done for each DVFS level.
• Vmin improves by 12.5mV.

Voltage Droop Detector
• Ring-oscillator-type droop detector
• Measures the voltage level through the change in the RO's speed
• It counts RO clocks within a programmable time window.
• When the counter value is smaller than the programmable threshold, the
  droop-detected flag is asserted
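The count-and-compare rule can be sketched in a few lines; the parameter names and values below are illustrative, not the programmable register interface.

```python
def droop_detected(ro_freq_hz, window_s, threshold_count):
    """Ring-oscillator droop detector sketch.

    The RO slows down as the supply droops, so fewer RO edges land inside
    the programmable time window; the droop-detected flag asserts when the
    count falls below the programmable threshold.
    """
    count = round(ro_freq_hz * window_s)   # RO clocks counted in the window
    return count < threshold_count
```

For a nominal 1GHz RO and a 100ns window the count is 100; a droop that slows the RO to 0.8GHz yields 80 counts and trips a threshold of 90.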

A key technology feature of 7nm

Key module technology        7nm        8nm
Fin patterning               SAQP       SADP
eSiGe                        6th gen.   5th gen.
eSD                          6th gen.   5th gen.
Device RO perf. (AC, norm.)  1.07       1.00

[Chart: FinFET logic Iddq (normalized, log scale) vs. normalized frequency
for 7nm vs. 8nm, showing a +7% frequency gain]
• AC performance was enhanced by fin-pitch scaling (Ceff gain at the
  standard-cell level)

A key technology feature of 7nm
• The Self-Aligned Quadruple Patterning (SAQP) process was introduced to
  scale fin pitch below 42nm (more than 15% scale-down).
• S/D epitaxy and heat optimization were performed to compensate for the
  DC performance degradation from fin-pitch scaling.
[Diagram: Self-Aligned Double Patterning (litho + spacer1 → fin) vs.
Self-Aligned Quadruple Patterning (litho + spacer1 + spacer2 → finer fins)]
Conclusion
• Five low-power techniques contributed to reducing the power consumption
  of the 7nm SoC:
  • Tri-cluster CPUs
  • Sparsity-aware NPU
  • HWACG
  • Droop detector
  • 7nm process
• They are also being applied extensively to subsequent projects to enhance
  low-power competitiveness

A 7nm FinFET 2.5GHz/2.0GHz Dual-
Gear Octa-Core CPU Subsystem with
Power/Performance Enhancements for a
Fully Integrated 5G Smartphone SoC.

Hugh Mair, Ericbill Wang, Ashish Nayak, Rolf Lagerquist, Loda Chou,
Gordon Gammie, Hsinchen Chen, Lee-Kee Yong, Manzur Rahman,
Jenny Wiedemeier, Ramu Madhavaram, Alex Chiou, Blundt Li, Vincent
Lin, Rory Huang, Michael Yang, Achuta Thippana, Osric Su, SA Huang

© 2020 IEEE 2.5: A 7nm FinFET 2.5GHz/2.0GHz Dual-Gear Octa-Core CPU Subsystem with Power/Performance Enhancements for a Fully Integrated 5G Smartphone SoC.
International Solid-State Circuits Conference 1 of 29
Outline
• SoC Overview
• Wireless connectivity
• Graphics/Media/AI
• Dual-gear CPU Cluster
• Cortex-A77 CPU
• Droop-Response Clock Control (DRCC)
• Silicon Results
• Frequency-Locked-Loop Clocking
• Silicon Results
• Hierarchical Test/Debug Interface
• Summary
SoC Overview
• Dimensity 1000 is a fully integrated smartphone SoC
supporting 5G cellular, advanced Wi-Fi 6, high performance
compute, multimedia, and AI capabilities
• Monolithic 7nm CMOS
– 10LM metal stack
• CPU Complex
– Octa-core w/ heterogeneous multi-processing
– 9.4mm2 on silicon
– Clock speeds up to 2.6GHz for current volume production

Wireless Connectivity
• 5G Cellular Modem:
– SA & NSA modes
• SA Opt.2, NSA Opt.3 / 3a / 3x
– 4.7Gbps down, 2.5Gbps up
– Full backwards compatibility
• Non-Cellular connectivity:
– Wi-Fi 6 (802.11a/b/g/n/ac/ax)
• 2T2R antenna
– Bluetooth 5.1+, GPS, FM
Graphics/Media/AI
• ARM-Mali G77 MC9 3D
• Full-HD display at 120Hz
• 80M pixel imagers
– 32+16M pixel dual camera
• HEVC & AV1 support with
encode/decode at 4K 60FPS
– Multi-expose HDR video
• APU3.0 Hexa-core AI
– 2xBig+3xSmall+1xTiny
– 4.5TOPS
Paper 7.1
CPU Complex
• Heterogenous big.LITTLE CPU
– 4x High-Performance (HP / Big)
• Cortex-A77 up to 2.6GHz
• Cache sizes: 64kB L1 Data,
64kB L1 Inst., 256kB L2
– 4x High-Efficiency (HE / Little)
• Cortex-A55 up to 2.0GHz
• Cache sizes: 32kB L1 Data,
32kB L1 Inst., 128kB L2
– 2MB L3 CPU cache

Cortex-A77 µArchitecture Additions/Improvements
• Additions: Macro-Op cache, 2nd branch unit, 4th ALU
• Improvements: branch prediction (2x BW), OoO window (+25%), dispatch BW
  (+50%), next-gen prefetch

Cortex-A77: Benchmark uplift from Cortex-A76
Performance improvements across a range of workloads
(IPC uplift, ISO-process & frequency)

20% Performance Improvement over Cortex-A76


Dual-Gear CPU Cluster

Higher performance @ lower power
• Compounded IPC gains on the HP core widen the gap to the HE core
• Prioritize the maximum voltage/frequency range of the HP core
Droop Responsive Clock Control (DRCC)
• DI/DT stress on the PDN is a continually increasing challenge:
  (a) ~flat power budget with increased current from lower voltages,
  (b) extreme clock gating
• Mitigate di/dt droops:
  1. From package network inductance (-L di/dt)
  2. From DC-DC converter bandwidth / DC losses
• A prior approach [3] uses charge injection (response time < 1ns)
• The current work adopts clock gating vs. charge injection
  – 50x / 98% area reduction
• Concurrent operation of dual detectors:
  1. RC-filtered DVFS'ed logic supply [-L di/dt]
  2. Fixed DC reference [SW-controlled]
[Diagram of prior work [3]: bandgap and reference-voltage generator, voltage
monitor array (VMON0–VMON4) with trim and activation codes, power-switch
array, DIE_SENSE_VDD/VSS around the CPU cluster; 2.5GHz clock, 1.8V and
external supplies]
Droop Responsive Clock Control
• -L di/dt loop: a 6-bit DAC + LPF generate the comparator reference at
  75%–100% of the DVFS-adaptive DVDD supply
• The non-filtered (high-bandwidth) supply is compared against the
  reference; the half-speed clock is engaged when VMIN is violated
• SW-ctrl loop: comparator reference = 33%–100% of VREF (1.2V)
[Diagram: DVDD or VREF feeds a 6-bit DAC (CODE[5:0]) and LPF (FILT[2:0]);
the comparator output is sampled and drives an FSM that controls the clock
gate from CLK_IN to CLK_OUT]
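Assuming the DAC spans 75%–100% of DVDD as described, the per-cycle gating decision can be sketched as below; the 6-bit full scale and the divide-by-2 response come from the slide, while the function name and sampling model are illustrative.

```python
def drcc_clock_div(vdd_samples, dvdd, code, full_scale=63):
    """Return the clock-divide ratio DRCC would pick for each sample.

    A 6-bit code places the comparator reference between 75% and 100% of
    the DVFS supply DVDD; whenever the high-bandwidth sensed supply dips
    below that reference, the half-speed clock (divide-by-2) is engaged.
    """
    vref = dvdd * (0.75 + 0.25 * code / full_scale)
    return [2 if v < vref else 1 for v in vdd_samples]
```

Halving the clock roughly halves the load current for those cycles, letting the supply recover without injecting charge.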

DRCC Silicon Results (SW-ctrl loop)
• CPU DI/DT exceeds the DC-DC converter bandwidth
[Scope capture: voltage and current with DRCC off vs. on; ~50mV droop
reduction with DRCC on; approximate CPU load current overlaid;
voltage/current measured at the PCB, 1µs/div (10µs window)]


FLL Clocking -- Principle of Operation
[Figure: traditional clocking vs. this work vs. a possible future approach]

Frequency-Locked Loop [FLL] Clocking
• Utilize a CPU-internal oscillator for the main clock
• Allow the oscillator to vary with the power supply
  – Digital control loop tracks DVFS changes with controlled bandwidth
    (backwards compatible)
  – Eliminates clock distortion from voltage translations & physical
    hierarchies

A local oscillator with supply correlation out-performs a [high-quality]
remote, un-correlated PLL clock
Ring Oscillator Topology
• Fine control: ~1ps/bit delay (2ps period step)
  – 45 thermometer-coded bits
  – Asynchronous to the oscillator
    • Multiple bit transitions allowed
• Coarse: ~20ps/bit delay (40ps period step)
  – 40 thermometer-coded bits
  – Synchronized to the oscillator
    • Avoids creating glitches in the ring oscillator
    • Single-bit increment/decrement
  – Each coarse code spans ~50% of the fine-code range
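The two controls compose into a simple period model; the target step sizes (2ps fine, 40ps coarse) are from this slide, while the base period is an assumed offset for illustration.

```python
def ro_period_ps(coarse, fine, base=400.0):
    """Toy ring-oscillator period model at the target step sizes.

    Fine code: ~2 ps of period per bit (45 thermometer bits);
    coarse code: ~40 ps of period per bit (40 thermometer bits).
    `base` is an assumed zero-code period, not a measured value.
    """
    return base + 40.0 * coarse + 2.0 * fine
```

One coarse step (40ps) is about 44% of the full 90ps fine range, close to the ~50% overlap target, so adjacent coarse codes always overlap in achievable period.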
Ring Oscillator Silicon Measurements
[Chart: oscillator period (ps) vs. fine code, one curve per coarse code]
• Fine step size = 2.2ps (vs. target of 2ps)
• Coarse step size = 40ps (vs. target of 40ps)
• Coarse step = 40% of fine range (vs. target of 50%)

FLL Block Diagram
• Two oscillators (Ping & Pong) double frequency range
– Second oscillator includes ÷2
• PI Loop
– Inputs: Phase & freq. errors
– Output: Fine control
• Coarse control by FSM
– Monitor fine control
– Apply fine-code delta on
coarse-code change
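The loop above can be sketched as one update step: a PI term drives the fine code from phase and frequency error, and the FSM bumps the coarse code and re-applies the equivalent fine-code delta on saturation. The gains and the 22-code delta (about half the 45-bit fine range) are illustrative assumptions.

```python
def fll_step(state, freq_err, phase_err, kp=0.5, ki=0.1):
    """One update of a sketch FLL PI loop with a coarse-code FSM.

    `state` is (fine, coarse, integrator). When the PI output pushes the
    fine code past its range, the FSM changes the coarse code by one step
    and re-centres the fine code by the equivalent delta.
    """
    FINE_MAX, COARSE_DELTA_IN_FINE = 45, 22   # ~50% of the fine range
    fine, coarse, integ = state
    integ += ki * freq_err                    # integral path on frequency error
    fine += kp * phase_err + integ            # proportional path on phase error
    if fine > FINE_MAX:                       # saturated high: step coarse up
        coarse += 1
        fine -= COARSE_DELTA_IN_FINE
    elif fine < 0:                            # saturated low: step coarse down
        coarse -= 1
        fine += COARSE_DELTA_IN_FINE
    return fine, coarse, integ
```

Re-centring the fine code on each coarse change keeps the output period continuous across the hand-off, which is why the coarse steps must overlap the fine range.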

FLL Oscillator CPU V/F Tracking [Si. Measured]
• Analyze small-signal correlation
• Testing procedure:
  1. Determine VMIN at given reference frequency points FTGT
     (◯ marks VMIN per frequency)
  2. Record the oscillator frequency FREF
  3. Plot (FOSC/FREF x FTGT) at VMIN +/-50mV (▽: -50mV; a second marker:
     +50mV)
[Chart: frequency (MHz) vs. VDD (a.u.)]

Oscillator V/F curve tracks overall CPU V/F curve


FLL Clocking VMIN Improvement [Si. Measured]
[Chart: delta VMIN (mV) vs. frequency over 2.3–2.5GHz; a consistent ~35mV
reduction across the range]
~35mV VMIN Improvement, >10% Power Reduction


Hierarchical JTAG -- Motivation
• JTAG (IEEE 1149.1) compact & convenient interface for cmd/ctrl
– Two phase: 1.[wr] Instruction Register (IR), 2. [rd/wr] Data Register (DR)

• JTAG Challenges/Limitations:
– Models a serial connection through all devices / IP blocks
– Large number of embedded IP (#clusters * #cpus * #ip/cpu)
– Power gating blocks chain segments

• Existing alternative: IJTAG (IEEE 1687) creates hierarchy but cannot mix
  1149/1687 in the same hierarchy and requires additional phases
• Our approach maintains JTAG compatibility throughout the hierarchy; all
  IP, including the “GWTAP”, are 1149.1-compatible
Hierarchical JTAG: GWTAP
• The Gateway TAP (GWTAP) creates a selectable 1-to-4 fan-out (plus bypass)
  – An 11-bit IR instructs which sub-chain to access and the sub-chain's
    IR length
  – The sub-chain transitions from IR to DR after #(IR-length) clocks
  – TAPs in upper levels are signaled to bypass

Hierarchical Command Roll-Up Example
• From lowest level, moving up the hierarchy:
– IR+DR embedded into upper level DR; add DR for upper TAP bypass
– Upper level IR: GWTAP instruction + other TAPs to bypass

Example GWTAP Topology Interpreted as IR Interpreted as DR

TAP GWTAP TAP 1st Level IR IR IR IR DR DR

TAP GWTAP TAP 2nd Level IDLE IR IR IR DR DR

TAP GWTAP TAP 3rd Level IDLE IR IR DR DR

TAP 4th Level IDLE IR DR

JTAG TDI Bitstream as a function of time
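The roll-up can also be viewed as a per-level length calculation on the TDI bitstream. Only the 11-bit GWTAP IR comes from the slides; the bypass IR length, the two-other-TAPs default, and the dict representation are assumptions for illustration.

```python
def roll_up(inner, other_taps=2):
    """Embed a lower level's IR+DR shift into the level above (bit lengths).

    Moving up one level, the lower chain's IR and DR phases both become
    upper-level DR content, padded with 1-bit bypass DRs for the other TAPs
    on that chain; the upper IR phase holds the 11-bit GWTAP instruction
    plus bypass instructions for those TAPs.
    """
    GWTAP_IR, BYPASS_IR, BYPASS_DR = 11, 8, 1   # BYPASS_IR length assumed
    return {
        "ir": GWTAP_IR + other_taps * BYPASS_IR,
        "dr": inner["ir"] + inner["dr"] + other_taps * BYPASS_DR,
    }
```

Starting from a leaf TAP with a 5-bit IR and 32-bit DR, two roll-up levels give IR/DR shift lengths of 27/39 and then 27/68 bits, so the whole hierarchy is still driven through a single standard two-phase JTAG access at the top.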

Summary
• A production monolithic 5G smartphone SoC integrating a
latest generation Cortex-A77 CPU is introduced
• A continued focus on CPU power efficiency and innovation in
the area of power supply and clocking is presented along with
silicon results
• Key circuit elements shown:
– Droop Responsive Clock Control (DRCC)
– CPU-localized Frequency-Locked-Loop (FLL) clocking
– Novel hierarchical extension to traditional JTAG

A 16nm 3.5B+ Transistor >14TOPS 2-to-10W Multicore
SoC Platform for Automotive and Embedded Applications
with Integrated Safety MCU, 512b Vector VLIW DSP,
Embedded Vision and Imaging Acceleration

Rama Venkatasubramanian1, Don Steiss1, Greg Shurtz2, Tim Anderson1, Kai Chirca1,
Raghavendra Santhanagopal1, Niraj Nandan1, Anish Reghunath1, Hetul Sanghvi1, Daniel Wu1,
Abhijeet Chachad1, Brian Karguth1, Denis Beaudoin1, Charles Fuoco1, Lewis Nardini1, Chunhua
Hu1, Sam Visalli1, Amrit Mundra1, Devanathan Varadarajan1, Frank Cano2, Shane Stelmach1,
Mihir Mody3, Arthur Redfern1, Haydar Bilhan1, Maher Sarraj1, Ali Siddiki1, Anthony Lell1, Eldad
Falik1, Anthony Hill1, Abhinay Armstrong1, Todd Beck1, Vijay Kanumuri1, Steve Mullinnix1,
Darnell Moore1, Jason Jones2, Manoj Koul1, Sanjive Agarwala1
1Texas Instruments, Dallas, TX
2Texas Instruments, Houston, TX
3Texas Instruments, Bangalore, India

© 2020 IEEE 2.6: A 16nm 3.5B+ Transistor >14TOPS 2-to-10W Multicore SoC Platform for Automotive and Embedded Applications with Integrated Safety MCU,
International Solid-State Circuits Conference 512b Vector VLIW DSP, Embedded Vision and Imaging Acceleration 1 of 40
Outline
 Automotive Processor background
 Jacinto™ 7 SoC Platform Architecture
 C71x Digital Signal Processor (DSP)
 Embedded vision and Imaging accelerators
 VPAC and DMPAC
 First SoC of the platform
 Device details/Die micrograph
 Automotive SoC Development
 Innovative quality and reliability methodologies
 Conclusion

Automotive Processor - Applications
• Advanced driver assistance systems (ADAS): camera/radar/lidar-based front,
  rear, surround-view and night-vision systems; automatic emergency braking,
  adaptive cruise control, automated parking
• Body electronics: gateway, vehicle compute, telematics features
• Infotainment & cluster: infotainment, instrument cluster
• Scalability: entry to premium vehicles
Motivation for SoC Platform Architecture
 Platform reuse: maximize design investments; scalable hardware and
  software solutions
 Higher performance needs: scalable real-time processing solution;
  analytics; communication
 Advanced integration: enhanced functional safety; security; low power
 Efficient data processing: scalable interconnect; SoC infrastructure

SoC Platform Architecture (1/2)
 Evolution of the Texas Instruments OMAP and Keystone II platforms
 Developed with functional safety and automotive quality as primary design
  objectives
SoC Platform Architecture (2/2)
 Multiple isolated domains:
 Wakeup domain
 Microcontroller (MCU) domain
 Main domain

 Integrated Safety MCU

 Overall system cost reduction
  Minimizing external components through on-die micro-architecture solutions
Automotive Functional Safety - Overview
 Governed by the ISO 26262 standard
 Four ASILs ― A (least stringent), B, C, and D (most stringent)
 ASIL-D
  Most stringent functional safety level
  Ex: Power steering (unwanted acceleration)
  Ex: Engine braking (unintended braking)
 ASIL-B
  Also critical, slightly less stringent
  Ex: Embedded vision ADAS (incorrect sensor feedback)
  Ex: Instrument cluster (loss of critical data)
Wakeup Domain
 ASIL-D
 Isolated domain

 Manages security and low power modes


 Boot management
 Cryptographic acceleration
 Trusted execution environment
 Secure storage
 On-the-fly encryption
MCU Domain
 ASIL-D
 Isolated chip-within-a-chip

 General-purpose MCU
 Communication peripherals for safety-critical
communication

 Dedicated supply, clock, and reset
Safety and Isolation Features
 MCU domain monitors Main domain faults and
takes appropriate action

 Domain Isolation
 Reset, Clock and Bus isolation
 Logic/IO voltage isolation with power monitoring
 Dedicated Voltage, thermal, clock rate sensors
per domain

 Highly reliable interconnect


 SECDED/Parity to support ASIL-D

 Control and data communication


 Internal SPI to SPI
 Full bus with isolation gasket and timeouts
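The ASIL-D interconnect above relies on SECDED (single-error-correct, double-error-detect) coding. As an illustrative sketch only (not TI's implementation, which uses wider codes over full bus words), a minimal Hamming(7,4) code plus an overall parity bit shows the principle: the syndrome locates a single flipped bit, and the overall parity distinguishes single from double errors.

```python
def encode(d):
    """Encode 4 data bits into an 8-bit SECDED codeword (Hamming(7,4) + parity)."""
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    cw = [p1, p2, d[0], p3, d[1], d[2], d[3]]       # codeword positions 1..7
    cw.append(cw[0] ^ cw[1] ^ cw[2] ^ cw[3] ^ cw[4] ^ cw[5] ^ cw[6])  # overall parity
    return cw

def decode(cw):
    """Return (data bits, status): corrects one flip, detects two."""
    s = 0
    for pos in range(1, 8):          # syndrome = XOR of set-bit positions
        if cw[pos - 1]:
            s ^= pos
    overall = 0
    for b in cw:
        overall ^= b
    if s == 0 and overall == 0:
        status = "ok"
    elif overall == 1:               # odd number of flips: single, correctable
        if s:
            cw = list(cw)
            cw[s - 1] ^= 1
        status = "corrected"
    else:                            # even flips with nonzero syndrome: double
        status = "double-error detected"
    return [cw[2], cw[4], cw[5], cw[6]], status
```

A double error is flagged but deliberately not "corrected", which is exactly the property an ASIL-D interconnect needs to avoid silent data corruption.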
Main Domain – Compute Cluster
 64b Heterogeneous Multicore Architecture
 Arm® Cortex-A and C71x DSP
 Coherent memory
 Multicore shared-memory system (MSMC)
 L1/L2/L3 caches
 Shared on-chip SRAM with ECC
 Virtualization
 C66x DSP
 Optimized for Audio applications
 Enables supplemental analytics
 Backwards compatibility
 High-performance GPU
 External memory interface (LPDDR4-4266)
Main Domain – Accelerator Cluster
 Vision Pre-processing Accelerator (VPAC)
 Depth and Motion Perception Accelerator
(DMPAC)
 Video acceleration (H.264/H.265)
 Image capture subsystem (MIPI CSI-2 RX/TX)
 Display Subsystem
 Interfaces for different display panel types
 eDP, MIPI DSI, MIPI DPI
 Security acceleration
 PKA, AES, SHA, RNG, DES/3DES
C71x DSP
 DSP CPU:
 64b addressing
 Optimized for General Purpose DSP and Embedded Vision
 16-Issue VLIW with flexible pipeline protection
 64b scalar registers and execution units (int*, float, double)
 512b vector registers and execution units (int*, float, double)
 512b x 512b matrix registers and matrix unit (int*)
 4240 integer MAC/Cycle

 Memory Subsystem:
 Load/Store instructions up to 512b wide
 Streaming accesses with programmed address sequencers

 Integrated L1 program, L1 data and unified L2 caches
C71x DSP vs. C66x DSP

                                   C66x [1]   C71x
  64-bit addressing                No         Yes
  Vector size                      128b       512b
  Load/Store bits/cycle            128        1600
  8-bit fixed-point MAC/cycle      32         4240
  32-bit floating-point OPS/cycle  16         88

[1] R. Damodaran et al., “A 1.25 GHz 0.8W C66x DSP core in 40nm CMOS,” IEEE Conf. VLSI Design, pp. 286-291, 2012.
C71x DSP Benchmark Performance Scaling

  Category             Benchmark                  C71x/C66x [1]   C71x/EVE [2]
  General Purpose DSP  FFT 1024pt, 32b complex    5.6x            2.4x
  Computer Vision      Image Gradient             11.0x           2.7x
                       Image Transform            11.0x           4.8x
                       Filtering                  18.1x           7.1x
                       Feature Extraction         18.8x           4.7x
                       Optical Flow               3.0x            3.0x
                       Morphology                 25.7x           15.3x
  Machine Learning     MobileNet v1               516.0x          115.0x
                       MobileNet v2               210.0x          62.7x

[1] R. Damodaran et al., “A 1.25 GHz 0.8W C66x DSP core in 40nm CMOS,” IEEE Conf. VLSI Design, pp. 286-291, 2012.
[2] Mandal, Dipan Kumar et al. “An Embedded Vision Engine (EVE) for Automotive Vision Processing,” IEEE International Symposium on
Circuits and Systems (ISCAS), pp. 49-52, 2014.
Vision Pre-processing Accelerators (VPAC) (1/4)

 VPAC consists of:
  Image Signal Processing engine (ISP)
  Multiple hardware accelerators:
   Remap Engine
   Noise Filter
   Scaler Engine
 SW-controlled, flexible acceleration
  Control and sequencing

[Block diagram: image capture (2x CSI2-RX) feeding the VPAC chain: ISP -> Remap Engine -> Noise Filter -> Scaler Engine]
Vision Pre-processing Accelerators (VPAC) (2/4) – Image Pipe

 Image Capture
  2x 4L MIPI-compliant CSI2-RX interfaces
 ISP: Image Signal Processing
  Human and machine vision
  >8x 2MP@30FPS camera support
  Flexible RAW sensor support
  Defective Pixel Correction (on-the-fly)
  Lens Shading Correction (LSC)
  Noise Suppression Filter
  Flexible color processing
   YUV 420/422, 8b/12b, RGB, dual output, custom support

[ISP pipeline: Wide Dynamic Range -> Lens Shading Correction -> Tone Mapping -> Defect Pixel Correction -> Noise Filter -> CFA -> Color Processing -> Edge Enhancer -> Statistics]
Vision Pre-processing Accelerators (VPAC) (3/4) - Vision Primitives
 Remap Engine
 Lens distortion correction (+Fish Eye Lens)
 Rectification

 Noise filter
 Edge preserving
 Enhances analytics quality

 Scaler Engine
 Multiple scales for pyramid generation for
various vision algorithms
 Region of Interest (ROI) scaling support
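The multi-scale pyramid generation described above can be sketched in software. A minimal 2x box-filter pyramid in plain Python (illustrative only; the hardware scaler supports arbitrary scale ratios and ROI scaling):

```python
def downsample_2x(img):
    """Average each 2x2 block to produce a half-resolution image (img: list of rows)."""
    h, w = len(img), len(img[0])
    return [[(img[y][x] + img[y][x+1] + img[y+1][x] + img[y+1][x+1]) // 4
             for x in range(0, w - 1, 2)]
            for y in range(0, h - 1, 2)]

def pyramid(img, levels):
    """Full-resolution image plus successively halved scales, as a scaler emits
    for pyramid-based vision algorithms."""
    out = [img]
    for _ in range(levels - 1):
        img = downsample_2x(img)
        out.append(img)
    return out
```

Downstream vision algorithms (feature extraction, optical flow) then run the same kernel at each pyramid level to handle scale variation.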
Vision Pre-processing Accelerators (VPAC) (4/4)

 Hardware accelerators optimized for:
  Low-latency image/vision pipe
  Reducing external memory bandwidth
 SW and HW flexibility
  Sequencing of algorithms
  Standalone / connected pipe, SW controlled

[Block diagram: image capture (2x CSI2-RX) feeding the VPAC chain: ISP -> Remap Engine -> Noise Filter -> Scaler Engine]
Depth and Motion Perception Accelerator (DMPAC) (1/5)
Stereo Disparity Overview
Depth and Motion Perception Accelerator (DMPAC) (2/5)
Stereo Disparity Engine

[Diagram: left and right camera images feed the Stereo Disparity Engine hardware accelerator, which produces a disparity map]
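A stereo disparity engine searches, for each block of the left image, for the best-matching block along the same row of the right image; the horizontal offset at the best match is the disparity, which is inversely proportional to depth. A naive sum-of-absolute-differences (SAD) sketch of that search (illustrative only; the DMPAC uses a far more robust semi-global-matching variant, described later):

```python
def disparity_row(left, right, y, block=3, max_disp=16):
    """Per-pixel disparity for row y via naive SAD block matching."""
    h, w = len(left), len(left[0])
    half = block // 2
    disp = []
    for x in range(half, w - half):
        best, best_d = None, 0
        # Candidate disparities: the right-image block slides left by d pixels.
        for d in range(0, min(max_disp, x - half) + 1):
            sad = sum(abs(left[y + dy][x + dx] - right[y + dy][x + dx - d])
                      for dy in range(-half, half + 1)
                      for dx in range(-half, half + 1))
            if best is None or sad < best:
                best, best_d = sad, d
        disp.append(best_d)
    return disp
```

This brute-force formulation makes the bandwidth problem obvious: every pixel touches block x max_disp neighbors, which is why hardware engines aggregate matching costs instead of re-reading image data per candidate.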
Depth and Motion Perception Accelerator (DMPAC) (3/5)
Optical Flow Overview

[Figure: previous frame F(t-1) and current frame F(t), with a flow vector linking corresponding points]

 Optical Flow estimates the 2D motion vector field between two images
Depth and Motion Perception Accelerator (DMPAC) (4/5)
Dense Optical Flow Engine

[Diagram: input video frames feed the Dense Optical Flow hardware accelerator, producing a flow-to-color mapping and a confidence score for each flow vector output]

Applications:
 Object Tracking
 Structure From Motion (3D)
 Moving Object Segmentation
Depth and Motion Perception Accelerator (DMPAC) (5/5)

 Stereo Disparity Engine HWA
  Custom Semi-Global Matching (SGM) algorithm
  Matches SGM in robustness performance
  >50x memory bandwidth reduction vs. SGM
 Dense Optical Flow HWA
  Up to 2MPix / 60 fps
  Machine-learning-based confidence score generated for each flow vector output
  Fractional pixel precision: 1/16th of pixel motion differentiable
  Large motion search range: ±191 H and ±63 V

[DMPAC block diagram: Depth: Stereo Disparity Engine hardware accelerator; Motion Perception: Dense Optical Flow hardware accelerator]
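The 1/16-pixel fractional precision means each flow component is naturally a fixed-point value with 4 fractional bits. The packing below is hypothetical (the actual DMPAC output format may differ): a 32-bit word with a two's-complement 16-bit horizontal field, a 12-bit vertical field, and a 4-bit confidence nibble, sized so the ±191 H and ±63 V search ranges fit comfortably.

```python
# Hypothetical packed flow word, NOT the documented DMPAC register layout.
# bits [31:16] horizontal flow, Q12.4 two's complement (covers +/-191 px)
# bits [15:4]  vertical flow,   Q8.4  two's complement (covers +/-63 px)
# bits [3:0]   confidence score

def pack_flow(h_pix, v_pix, conf):
    h = int(round(h_pix * 16)) & 0xFFFF   # 1/16-pixel steps
    v = int(round(v_pix * 16)) & 0x0FFF
    return (h << 16) | (v << 4) | (conf & 0xF)

def unpack_flow(word):
    h = (word >> 16) & 0xFFFF
    v = (word >> 4) & 0x0FFF
    conf = word & 0xF
    if h & 0x8000:                        # sign-extend the 16-bit field
        h -= 1 << 16
    if v & 0x800:                         # sign-extend the 12-bit field
        v -= 1 << 12
    return h / 16.0, v / 16.0, conf
```

The point of the sketch is the arithmetic: multiplying by 16 converts pixels to 1/16-pixel units, and sign extension recovers negative motion after field extraction.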
Die Micrograph and Device Details

 First SoC of the Platform

  Process           TSMC 16FF FinFET
  Voltage           Core: 0.72V-0.990V (AVS); SRAM: 0.85V
  Temperature       -40C to 150C
  Specifics         3.5B+ transistors, ~200Mb SRAM
  Performance       2GHz 2x super-scalar CPU; 1GHz 3x dual safety MCU;
                    4266 LPDDR4; 16G SERDES
  Power             2W-10W (use-case dependent)
  Key accelerators  1GHz 64b VLIW / 512b SIMD DSP; machine learning;
                    computer vision
  Number of PLLs    25+
  Power networks    74
  Features          ASIL-D capable, low-DPPM design
Automotive SoC – Quality and Reliability
 Automotive SoCs require:
 Stringent attention to DPPM (defective parts per million)
 Very low FIT for functional safety and intrinsic reliability

 Large embedded memories


 1 DPPM at SoC level requires <1 DPPB at memory component level
 Ex: 200Mb with no assumptions on test/screening needs
 7.7σ closure for bitcells
 7σ closure for wordline drivers
 6.6σ closure for sense amps

 Drives need for new techniques for design and test
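The sigma targets above follow from the Gaussian tail probability each component class may consume. With ~200Mb of bitcells and a ~1 DPPM chip budget, each cell must fail with probability below roughly 1e-6 / 2e8 = 5e-15, which lands near the quoted 7.7σ. A back-of-envelope check with Python's standard library (illustrative only; TI's actual budget split across bitcells, wordline drivers, and sense amps is not shown on the slide):

```python
from statistics import NormalDist

def required_sigma(chip_dppm, n_components):
    """Gaussian tail quantile at which n_components jointly meet the chip DPPM.

    Assumes failures are independent and the budget is split evenly, so each
    component may fail with probability (chip_dppm * 1e-6) / n_components.
    """
    per_component_fail = (chip_dppm * 1e-6) / n_components
    return -NormalDist().inv_cdf(per_component_fail)

bitcell_sigma = required_sigma(chip_dppm=1, n_components=200e6)
# lands near 7.7 sigma for 200Mb of cells at a 1 DPPM chip target
```

The same formula explains why wordline drivers (fewer instances) close at 7σ and sense amps at 6.6σ: fewer components means a larger allowed per-component failure probability, hence a smaller tail quantile.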
Automotive SoC – Memory Test

 Test with time-zero positive margin needed


 Enhanced SRAM screening methodology
 Wordline and bitline voltage control combined with traditional
techniques
 Improved defect injection techniques
Test Goal: Test mimics functional mode (1/3)
 Case-study: Robust memory interface test
 Memory BIST is targeted to screen memory-internal defects
 Functional vs. Memory BIST: Start/end points are different
[Diagram: BIST latches and functional flops multiplex onto the memory data port (D / TD -> D_mem) and address port (A / TA -> A_mem); Q_mem -> Q returns to a functional flop. Legend: functional/mission-mode path vs. BIST test-mode path]
Test Goal: Test mimics functional mode (2/3)
 Case-study: Robust memory interface test
 ATPG is targeted to screen memory-interface logic defects
 Functional vs. ATPG: Memory I/O timings are different due to scan collar
 Subtle defects in memory-interface may be missed by BIST and ATPG
[Diagram as before: BIST latches and functional flops multiplex onto the memory data and address ports. Legend: functional/mission-mode path vs. ATPG test-mode path]
Test Goal: Test mimics functional mode (3/3)
 Case-study: Robust memory interface test
 Match functional mode memory I/O timing with test mode
 Improved memory IP architecture to match functional and test timing
 Improved DFT to test “true” functional memory-interface path
[Diagram as before, with the legend now showing all three paths: functional/mission mode, BIST test mode, and ATPG test mode aligned on the same memory-interface timing]
EM FIT Calculation for Automotive SoCs (1/3)
Signoff with Violations at Technology Reliability Limit

[Chart: probability of occurrence vs. component FIT per bin (10%-110% of the reliability limit), showing the FIT per bin, the cumulative FIT, and the system FIT spec]

 May be acceptable/waiveable in a consumer-grade SoC
EM FIT Calculation for Automotive SoCs (2/3)
Signoff at Technology Reliability Limit

[Chart as before, with any violations at the technology reliability limit fixed]

 Fixed for the technology reliability limit
 Cumulative FIT may still violate the system FIT spec
EM FIT Calculation for Automotive SoCs (3/3)
Signoff with Margined Technology Reliability Limit

[Chart as before, with additional margins applied below the technology reliability limit and all violations against the margined limit fixed]
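The signoff flow above amounts to accumulating per-bin FIT contributions and checking the running total against the system budget. A schematic version of that bookkeeping (illustrative numbers and names, not the production flow):

```python
def em_fit_signoff(fit_per_bin, system_fit_spec):
    """Accumulate per-bin EM FIT and flag bins where the budget is exceeded.

    fit_per_bin: FIT contribution of each %-of-limit bin (10%, 20%, ... of the
    technology reliability limit). Returns the cumulative FIT curve and the
    indices of bins at which the cumulative FIT crosses the system spec.
    """
    cumulative, total, violations = [], 0.0, []
    for i, fit in enumerate(fit_per_bin):
        total += fit
        cumulative.append(total)
        if total > system_fit_spec:
            violations.append(i)
    return cumulative, violations

# Illustrative distribution: most wires sit far below the EM limit,
# a few near 100% of it dominate the cumulative FIT.
bins = [0.001, 0.002, 0.004, 0.01, 0.03, 0.1, 0.4, 1.5, 4.0, 9.0]
curve, over = em_fit_signoff(bins, system_fit_spec=10.0)
```

This is why signoff at the raw technology limit can still violate the system spec: each individual bin may be legal while the cumulative sum, dominated by the highest bins, is not; margining the limit shifts those dominant bins down.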
EM FIT Calculation in an Eco-system-class Automotive SoC

[Flow: the foundry supplies the flow spec, rules, and models; EDA vendors supply the reliability flows; the SoC designer and the IP eco-system each execute the flow, producing the SoC reliability estimate (FIT rate)]
System Validation – ADAS (1/2)
Forward Camera Analytics based on Deep Learning

 Semantic Segmentation
  DeepLabV3Lite architecture (MobileNetV2 encoder)
  63 convolution layers
  5 classes (pedestrian, sky, vehicles, roads, background)
 Object Detection / Parking Spot Detection
  Single-shot detection (SSD) with MobileNetV1
  47 convolution layers, 6 regression heads

  Application                        GMACS per frame   Time per frame (ms)
  Semantic Segmentation              3.68              6.20
  Parking Spot + Vehicle Detection   3.65              4.49
System Validation – ADAS (2/2)
8MP Front Camera Perception and Localization

 Multi-class object detection using C71x DSP and VPAC/DMPAC


 Fusion with IMU and GPS for Localization
System Demo – 3D Surround View + Auto Valet Parking
Real-time 360-Degree Surround View + Analytics for Auto Valet Parking

[Pipeline: 4x 1920x1080 @30FPS Bayer fisheye cameras (CSI, I2C) -> VPAC -> GPU 3D surround-view rendering -> display subsystem; 3 additional camera inputs -> VPAC -> C66x DSP (pre-process) -> C7x DSP running semantic segmentation, multi-object detection, and parking spot detection on 3x 768x384 @15FPS -> C66x DSP (post-process) -> mosaic of 3x 1280x720 @15FPS YUV420 -> display; SD card as media source]

 3D car model with surround view rendered
 3 additional camera inputs, 3 different ML networks
 Automatic valet parking application

Industry Showcase: EE3 (Today at 8pm)
Texas Instruments - Camera Based Perception and 3D Surround View for Autonomous Valet Parking on a 16nm Automotive SoC
Conclusion

 Jacinto™ 7 SoC Platform Architecture with Integrated MCU


 C71x DSP
 Embedded vision and Imaging hardware accelerators
 VPAC and DMPAC
 Innovative quality and reliability methodologies
 Enabling Automotive in 16nm ecosystem
 First SoC of the platform
 First pass silicon fully functional and meeting design goals
IBM z15™: A 12-Core 5.2GHz Microprocessor

Christopher Berry1, Brian Bell2, Adam Jatkowski1, Jesse Surprise1, John Isakson3, Ofer Geva1, Brian
Deskin4, Mark Cichanowski3, Dina Hamid1, Chris Cavitt1, Gregory Fredeman1, Anthony Saporito1,
Ashutosh Mishra5, Alper Buyuktosunoglu6, Tobias Webel7, Preetham Lobo5, Pradeep Parashurama5,
Ramon Bertran6, Dureseti Chidambarrao8, David Wolpert1, Brandon Bruen1

IBM Systems
1 - Poughkeepsie, NY
2 - Rochester, NY
3 - Austin, TX
4 - Endicott, NY
5 - Bangalore, India
6 - Yorktown Heights, NY
7 - Boeblingen, Germany
8 - Hopewell Junction, NY

© 2020 IEEE 2.7: IBM z15TM: A 12-Core 5.2GHz Microprocessor


International Solid-State Circuits Conference 1 of 39
Outline
• Introduction
• Technology and System Structure
• System Control (SC) Chip
• Central Processor (CP) Chip
• Hardware Measurements
• Conclusion
Introduction: IBM z15
• 14nm – Again
• Significant changes to system topology
• Significant feature additions
• Performance Goals
• +10% Single Thread
• +20% System Capacity
• Big ticket items
• 33% increase in L2 (per core)
• 100% increase in L3 (per chip)
• 43% increase in L4 (per chip)
• Added 2 cores (+20%)
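The headline percentages are straightforward ratios against z14. The baselines below come from the prior z14 generation rather than from this slide (z14: 6MB L2 per core, 128MB L3 per CP chip, 672MB L4 per SC chip, 10 cores per CP chip), so treat them as a stated assumption:

```python
# z14 baselines are prior-generation figures, not taken from this slide.
def pct_increase(old, new):
    """Generational increase, rounded to the nearest percent."""
    return round(100 * (new - old) / old)

assert pct_increase(6, 8) == 33       # L2 per core: 6MB -> 8MB, +33%
assert pct_increase(128, 256) == 100  # L3 per chip: 128MB -> 256MB, +100%
assert pct_increase(672, 960) == 43   # L4 per chip: 672MB -> 960MB, +43%
assert pct_increase(10, 12) == 20     # cores per CP chip: 10 -> 12, +20%
```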
Technology Overview [ISSCC2018 Berry]

GlobalFoundries 14nm High-Performance (HP) FinFET SOI technology w/ embedded DRAM

Technology Overview
Ultra-thick wires 2400nm - 2 levels
5.6X pitch wiring 360nm - 4 levels
4X pitch wiring 256nm - 2 levels
2X pitch wiring 128nm - 4 levels
1.3X pitch wiring 80nm - 2 levels
1X pitch, fine wiring (LELE) 64nm - 3 levels
Logic Device VT pairs L, R, H
SRAM Cell Sizes 0.102μm2 (HP & LL), 0.143μm2
eDRAM Cell Size 0.0174μm2
z15 System Structure: Max System

• Up to 4 racks (42U x 19")
• 20 liquid-cooled CP chips
• 240 physical cores
• 40TB RAIM*-protected memory
  *Redundant Array of Independent Memory
• 60 PCIe Gen4 x16 cards
z15 System Structure: Max System

[Rack photo, labeled: power supply, Ethernet switches, IO cages, processor drawers, water cooling]
z15 System Structure: Drawers
z15 System Structure: Drawer

[Drawer photo, labeled: system control chip, memory DIMMs, central processor chips]
System Control (SC) Chip Overview

• Drawer-to-drawer connectivity
• L4 Cache - 960MB (+43%)
• Specs
  • 2.6GHz
  • 12.2B transistors
  • 696mm2
  • ~20K C4's
  • IO bandwidth 6.8Tb/s
  • 21.7km of signal wire

[Die plot: four L4 eDRAM cache quadrants with L4 dataflow, L4 control, and L4 directory in the center; X-BUS drivers/receivers on the top and bottom edges, A-BUS drivers/receivers on the left and right]
SC Chip: X-Bus & A-Bus Links

• A-Bus (drawer-to-drawer)
  • Four links
  • Differential
  • 10.4Gb/s/lane (+33%)
  • 0.9Tb/s each (3.6Tb/s total)
• X-Bus (SC-CP)
  • Four links
  • Single-ended
  • 5.2Gb/s/lane
  • 0.8Tb/s each (3.2Tb/s total)

[Die plot highlighting the X-BUS drivers/receivers on the top and bottom edges and the A-BUS drivers/receivers on the sides]
eDRAM Improvements
• “Double dense” eDRAM
significant enabler
• Doubled bitline and
wordline lengths
• Sense amp update to
improve voltage
sensitivity for double
length bit lines
• Improved array macro
density by ~30%
eDRAM Improvements
• Changed on-die generated high voltage to package delivered
• Removed charge pump overhead
• Absorbed low-voltage generation & regulator into macro
• Combined effective cache density improvement of ~80%

[Macro photos: z14 8Mb macro vs. z15 16Mb macro]
Central Processor (CP) Chip Overview

• Cores, L1/L2/L3 cache, memory interface, IO, CP & SC interfaces
• Specs
  • 5.2GHz
  • 9.2B transistors
  • 696mm2
  • ~29K C4's
  • IO bandwidth 4.0Tb/s
  • 23.2km of signal wire

[Die plot: 12 cores flanked by L3 cache and L3 dataflow columns, L3 control/directory in the center, memory controller (MC driver / MCU / MC receiver) at the top, X-BUS interfaces on the sides, GZIP accelerator, and PCIe/PBU interfaces (PCIE0-2, PBU0-2) along the bottom]
CP Chip: Processor Cores

• 12 cores (+20%)
• 128+128KB (I+D) L1 SRAM cache
• 4+4MB (I+D) private L2 eDRAM cache (+33%)
• 256MB eDRAM shared L3 cache (+100%)
• GZIP accelerator

[Die plot with the cores, shared L3, and GZIP accelerator highlighted]
CP Chip: IO Links

• X-Bus
  • Two X-Bus links
  • Single-ended
  • 5.2Gb/s/lane
  • 0.8Tb/s each (1.6Tb/s total)
• Memory interface
  • 9.6Gb/s/lane (1.6Tb/s total)
• 3 PCIe x16 Gen4 interfaces (0.8Tb/s)

[Die plot with the memory controller, X-BUS, and PCIe/PBU interfaces highlighted]
Chip Improvements

• Core area reduction
• IO area improvement
  • z14: 185 hex IO ring with 185 orthogonal central
    (185 hex -> 33.4 C4/mm2; 185 -> 29 C4/mm2)
  • z15: X-bus signals reduced by 33%; 185 hex top/bottom edges with 150 through center
    (150 -> 44.4 C4/mm2)
z15 Core Floorplan

[Annotated floorplan: L2 cache; L1 D-cache/load-store; fixed point; vector floating point; instruction sequencer; recovery; translator; elliptic curve cryptography; instruction decode; branch prediction; L1 I-cache/compression/sort]
Core Improvements

• Core area reduced by 10% (28 -> 25mm2)
• Instruction Sequencer area reduced by 35% (1.8M nets)
• L2 increased by 33% (6MB -> 8MB)
• ECC accelerator added
Core Improvements

• Improved store forwarding
• Improved operand-store-compare (OSC) hazard handling
• Enhanced branch prediction
• Decimal operation improvements, including decimal-to-binary conversion
• Sort/Merge accelerator added

[Annotated core floorplan showing where each improvement resides]
Core Improvements

[Chart, z14 vs. z15: net count +5%, device count -2%, wire length +22%]
Single Die Power

[Chart, z14 vs. z15: single-die power deltas of +12%, +5%, and +1%]
Process Shift

[Histogram: normalized die count vs. process delay (1/frequency), z14 vs. z15 distributions]
Vmin vs. Process Delay

[Scatter plot: chip Vmin (V) vs. normalized process delay (1/frequency), z13 => z15]
z15 Design: Conclusion
• Improvement in single thread performance => 14%
• System capacity improvement => 20%
• Achieved goals while staying in 14nm
• Significant increase in L2, L3 & L4 cache
• Two more cores
• Several new on-die or in-core accelerators:
• ECC, gzip, sort/merge
• Minimal power increase given additions
Acknowledgements

The authors would like to acknowledge and thank the many contributions from the rest of the IBM Z design team (Austin, Bangalore, Boeblingen, Haifa, Poughkeepsie, Rochester, Tel Aviv), the IBM EDA team, the IBM Research team, the IBM Systems team, and GlobalFoundries for processing the wafers.