ta2-1-walston-pres-snps
19 October 2017
Arm/Synopsys Collaboration
2017: 3.5+GHz in 7nm
2016: 3.0+GHz in 16nm
2013: 2.0+GHz in 28nm+
2010: 1.0GHz in 40nm
Partnering for Arm Powered® Products
− Hybrid Emulation; AMBA Transactors
System SW Dev.; Software Stack
Virtual Prototypes: Virtualizer™ Tool Set & Development Kits (VDKs) for Pre-RTL SW Dev. Using Arm Fast Models (v7 & v8), Arch. Design with Platform Architect, Coverity & Defensics SW Signoff
Hybrid Prototypes: AMBA Transactors
Physical Prototypes: HAPS® Support for Arm Cores, Connection to Juno Arm Development Platform, Prototyping Methodology
POP
Pre-ALPHA, ALPHA, BETA, EAC
Requirements/Inputs; Co-development work
NDM Library Prep
Design Compiler Graphical
IC Compiler II: place_opt, clock_opt, route_opt
PrimeTime ECO
VC LP, PrimeTime SI, Formality
Note: DFT supported at each step during DC Graphical and IC Compiler II, UPF supported throughout the flow
Design Detail
Clock Gating: ~250 architectural ICGs
Process: TSMC 16nm FFC, 11 Layer +AP w/ routing on M2-M9
Libraries/memories: Arm POP IP for TSMC 16nm FFC; SVT/LVT/ULVT C24/C20/C18/C16 for data; ULVT C16 for clock
PVT corners (MCMM): Setup: RC Max: TT/1.0v/85c; Setup & Power: TT/0.8v/85c; Hold: RC Min: FFGNP/1.05v/125c
DFT Strategy: Scan compression
UPF: 3 power domains with LS and ELS required
Floorplan modules: DENGINE, CORE, DSIDE, ISIDE
Starting with library analysis makes balancing FMAX and power easier
© 2017 Synopsys, Inc. 23 © Copyright Synopsys 2017, All Rights Reserved
Constructing Cortex-A75 CPU Power Opt. Flow
That Meets Performance/Power Targets
Cell Delay Heat (rows: Drive Strength)
X20N  18 19 19 20 20 21 21 22 23 24 25 26
X16N  19 20 20 21 21 22 23 24 25 26 27 28
X14N  20 21 21 22 22 23 24 25 26 27 28 29
X12N  21 22 22 23 23 24 25 26 27 28 29 31
X10N  22 23 24 25 25 26 26 28 29 30 31 33
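The heat values above can be compared programmatically. The following is a minimal Python sketch that transcribes the five rows and reports how much the delay drops between two drive strengths in each column. The slide text does not preserve the column labels or the delay units, so columns are referred to by index only; the names `HEAT` and `delay_reduction` are illustrative.

```python
# Cell delay heat values transcribed from the slide; rows are drive strengths.
# Column labels and units are not given in the source, so columns are indexed 0..11.
HEAT = {
    "X20N": [18, 19, 19, 20, 20, 21, 21, 22, 23, 24, 25, 26],
    "X16N": [19, 20, 20, 21, 21, 22, 23, 24, 25, 26, 27, 28],
    "X14N": [20, 21, 21, 22, 22, 23, 24, 25, 26, 27, 28, 29],
    "X12N": [21, 22, 22, 23, 23, 24, 25, 26, 27, 28, 29, 31],
    "X10N": [22, 23, 24, 25, 25, 26, 26, 28, 29, 30, 31, 33],
}


def delay_reduction(weak, strong):
    """Per-column fractional delay reduction going from `weak` to `strong` drive."""
    return [(w - s) / w for w, s in zip(HEAT[weak], HEAT[strong])]


if __name__ == "__main__":
    for col, r in enumerate(delay_reduction("X10N", "X20N")):
        print(f"column {col}: {r:.1%} lower delay at X20N than at X10N")
```

In every column the X20N value is lower than the X10N value, which matches the slide's advice against small drive cells.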
– LVT use in implementation is absolutely necessary for best leakage when targeting ULVT for timing
influence on timing early in flow
– Best to focus on a primary class for timing closure based on clock frequency
• Very low drive cells can be load sensitive
– Do not use small drive cells during synthesis and P&R
Figure labels: ULVT, SVT, LVT; SSG, TT-OD; Vt/channel swap per datapath stage
Leakage Power   ULVT                  LVT                   SVT
% Final         C16  C18  C20  C24    C16  C18  C20  C24    C16  C18  C20  C24
144%            54%  6%   4%   3%     6%   15%  3%   0%     1%   2%   5%   1%
100%            16%  2%   1%   1%     40%  5%   3%   2%     3%   15%  12%  1%
118%            15%  8%   7%   33%    5%   5%   3%   9%     3%   1%   3%   7%
114%            13%  4%   4%   3%     12%  5%   5%   23%    3%   5%   5%   17%
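To make the Vt mix easier to read, the table rows can be collapsed into per-Vt totals. This Python sketch assumes the twelve percentages in each row map left to right onto the header order (ULVT C16 to C24, then LVT, then SVT) and that the first value is leakage power as a percentage of the final result. The slide text does not label the rows, so they are referred to by position; `ROWS` and `vt_totals` are illustrative names.

```python
VT_CLASSES = ("ULVT", "LVT", "SVT")
CHANNELS = ("C16", "C18", "C20", "C24")

# (leakage power as % of final, usage percentages in header order).
# Row labels are not preserved in the source; rows are kept in slide order.
ROWS = [
    (144, [54, 6, 4, 3, 6, 15, 3, 0, 1, 2, 5, 1]),
    (100, [16, 2, 1, 1, 40, 5, 3, 2, 3, 15, 12, 1]),
    (118, [15, 8, 7, 33, 5, 5, 3, 9, 3, 1, 3, 7]),
    (114, [13, 4, 4, 3, 12, 5, 5, 23, 3, 5, 5, 17]),
]


def vt_totals(mix):
    """Sum the per-channel percentages of one row into a total per Vt class."""
    n = len(CHANNELS)
    return {vt: sum(mix[i * n:(i + 1) * n]) for i, vt in enumerate(VT_CLASSES)}


if __name__ == "__main__":
    for leakage, mix in ROWS:
        print(f"leakage {leakage}% of final: {vt_totals(mix)} (sum {sum(mix)}%)")
```

Each row sums to between 99% and 101%, which is consistent with rounded percentages of a complete cell population.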
Vt Distribution Through Cortex-A75 Full Flow
Vt Class/Channel Mix Changes As Implementation Progresses
• In DC, datapath delay is prioritized; faster cells are used
• Pessimism reduces through the flow and CTS brings in useful skew; cells are swapped for power
• ECO brings in 4 new SVT classes and does positive slack recovery
StarRC; PrimeTime SI/PX
LS Mbit, LS, ELS Mbit, ELS
DC Graphical: 7% net; 2; TT-1.0V; Q 1-bit, QL 2-bit, QA 4-bit; Timing, Area, Dynamic
• RTL inferencing
• Physical-aware critical path de-banking
IC Compiler II (place_opt, clock_opt, route_opt): POCV cell, 7% net; 8; TT-1.0V, TT-0.8V, FFG; Q 1-bit, QL 2-bit, QA 4-bit; Timing, Area, Leakage, Dynamic
• Physical-aware critical path de-banking
PrimeTime ECO: AOCV cell, 7% net; 12; TT-1.0V, TT-0.8V, FFG; Q 1-bit, QL 2-bit, QA 4-bit; Timing, Area, Leakage
Cortex-A75 Implementation Careabouts
Require Upfront Analysis of Design Inputs & Flows to Optimize for Best PPA
• Arm CPU projects typically get to 90% of the performance target quickly
• Achieving the last 10% consumes most of the schedule
[Chart: Cortex-A75 place_opt FMAX Progression; FMAX (% target), axis 0.7 to 1; "Ramp up to 90%", "The last 10%"]
• Extensive analysis shows performance is impacted by placement
– Typically focused on top-level modules
– Some analysis at first level down
– Add blockages to control channel density
• TNS reduced but QoR gains were offset by SI effects
[Floorplan figure labels: Blockage, DENGINE, DSIDE, CORE, ISIDE]
• Use DFA (Data Flow Analysis) in ICC II to drive block-level QoR improvements through CPU floorplan changes
• Deep analysis of fast connectivity-based placement and inter-module flylines
Some clear patterns emerge that lead to new macro placement conclusions
Cortex-A75 CPU Floorplan Changes
Move RAMs To Guide Module Placement
[Floorplan figure labels: ISIDE, CORE]
Floorplan changes that allowed CORE to float to the center and close to DSIDE
CORE Results @ route_opt with New Floorplan
[Chart: Cortex-A75 CPU; FMAX (% target), axis 0.5 to 0.8]
• Block-level floorplanning is a powerful and necessary tool for Cortex-A75 CPU QoR improvements
• When starting the Cortex-A75 CPU flow, used best practices from Cortex-A73 RI
– 2W/4S clock NDRs (tapered on sinks)
– Crosstalk threshold noise ratio of 20%
– Congestion optimization for placer settings
[Chart: First Full Flow Trial; axes 10.00 to 18.00 and 120% to 200%]
• In Arm CPUs placement and crosstalk are interrelated
• Separating two modules with a lot of timing-critical interconnect causes a channel of high, unidirectional routing density
[Figure: Critical Path Corridor in Cortex-A75 CPU; area of very high connectivity between CORE and DSIDE modules]
cluster together and cause timing problems
• Use percentage placement blockages to spread out pipeline flops
[Figure: Blocked Channel %: 60%, 80%, 90%; Before / After]
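The effect of a percentage placement blockage on a channel can be estimated with simple arithmetic. The Python sketch below assumes that a blockage of p% leaves (100 - p)% of the region's area available to standard cells. The 60%, 80% and 90% settings come from the slide; the 1000 µm² channel area and the function name `max_cell_area` are hypothetical and used only for illustration.

```python
def max_cell_area(region_area_um2, blocked_pct):
    """Cell area still allowed in a region under a percentage placement blockage.

    Assumes a blockage of `blocked_pct` percent caps placement density
    at (100 - blocked_pct) percent of the region area.
    """
    if not 0 <= blocked_pct <= 100:
        raise ValueError("blocked_pct must be between 0 and 100")
    return region_area_um2 * (100 - blocked_pct) / 100


if __name__ == "__main__":
    channel_area = 1000.0  # hypothetical channel area in square microns
    for pct in (60, 80, 90):  # blocked-channel percentages shown on the slide
        print(f"{pct}% blocked: up to {max_cell_area(channel_area, pct):.0f} um^2 of cells")
```

Raising the blocked percentage lowers the density cap in the channel, which is how the blockage pushes pipeline flops apart instead of letting them cluster.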
[Diagram: CLK; apply useful skew: -100ps, 200ps; 90ps, 10ps]
place_opt with data only: higher area & power
Size up clock buf
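The useful-skew numbers on this slide fit a simple slack-borrowing calculation. The Python sketch below is one possible reading, not a statement of what the tool does: it assumes -100ps is the setup slack of paths ending at a register, 200ps is the setup slack of paths starting at it, and the register's clock is delayed by 190ps. Only setup slack is modeled, and the function names are illustrative.

```python
def apply_useful_skew(slack_in_ps, slack_out_ps, clock_delay_ps):
    """Setup slacks after delaying a register's clock by `clock_delay_ps`.

    Paths ending at the register gain the delay as extra time;
    paths starting at the register lose the same amount.
    """
    return slack_in_ps + clock_delay_ps, slack_out_ps - clock_delay_ps


def balancing_delay(slack_in_ps, slack_out_ps):
    """Clock delay that would equalize the two slacks."""
    return (slack_out_ps - slack_in_ps) / 2


if __name__ == "__main__":
    print(apply_useful_skew(-100, 200, 190))  # assumed 190ps clock delay
    print(balancing_delay(-100, 200))
```

Under this reading the failing path borrows time from the path with positive slack, so no data-path upsizing is needed to fix the violation.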
Cortex-A75 CPU              WNS    TNS    Leakage Power
place_opt + route_opt       -23    -95    100%
place_opt CCD + route_opt   -20    -68    98%

Cortex-A75 CPU              WNS    TNS    Leakage Power
route_opt (Baseline)        -112   -134   100%
Power CCD + route_opt       -99    -126   99%
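The two CCD comparisons above can be expressed as relative gains. This Python sketch transcribes the table values and computes how far WNS and TNS move toward zero, plus the change in leakage. The timing units are not stated in the slide text, so only ratios are computed; `COMPARISONS`, `relative_gain` and `summarize` are illustrative names.

```python
# (WNS, TNS, leakage power in % of baseline), transcribed from the two tables.
# Timing units are not stated in the source, so only ratios are derived.
COMPARISONS = {
    "place_opt CCD": {"baseline": (-23, -95, 100), "ccd": (-20, -68, 98)},
    "Power CCD": {"baseline": (-112, -134, 100), "ccd": (-99, -126, 99)},
}


def relative_gain(baseline, value):
    """Fraction by which a negative-slack metric shrinks toward zero."""
    return (abs(baseline) - abs(value)) / abs(baseline)


def summarize(name):
    """WNS/TNS gains and leakage change (percentage points) for one comparison."""
    b_wns, b_tns, b_leak = COMPARISONS[name]["baseline"]
    c_wns, c_tns, c_leak = COMPARISONS[name]["ccd"]
    return {
        "wns_gain": relative_gain(b_wns, c_wns),
        "tns_gain": relative_gain(b_tns, c_tns),
        "leakage_delta_pct": c_leak - b_leak,
    }


if __name__ == "__main__":
    for name in COMPARISONS:
        s = summarize(name)
        print(f"{name}: WNS {s['wns_gain']:.1%}, TNS {s['tns_gain']:.1%}, "
              f"leakage {s['leakage_delta_pct']:+d} points")
```

In both comparisons the CCD run improves WNS and TNS while leakage power stays at or below the baseline.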
Increased Correlation Through the Flow
Routing and Timing Correlation on Cortex-A75 CPU
Global Route-based Optimization
• Timing-driven routing
• Re-routing, Re-buffering using global route parasitics
• On-route global route-based re-buffering
Accurate CCS Receiver Model
• Used in place_opt, clock_opt and route_opt
• Improved correlation to signoff delay calculation
Route_opt Based on PrimeTime Signoff Timer
• Route_opt able to see (and fix) timing issues easing burden on PT ECO
[Two charts: values by Implementation Stages (DCG, place_opt, clock_opt, route_opt, Signoff ECO); secondary axis: % of Final, 0% to 100%]