wc03-34-babbar-pres-snps
SNUG 2017 1
Agenda
ARM/Synopsys Collaboration
The New Threshold for Market Leadership
2017: 3.5+ GHz in 7nm
2016: 3.0+ GHz in 16nm
2013: 2.0+ GHz in 28nm
2010: 1.0 GHz in 40nm
ARM POP™ IP on 16nm
How does PDG work with processor design on POP IP?
Requirements/Inputs
Co-development work
Partnering for ARM Powered® Products
Optimized HW Design & Implementation Verification
System SW Dev.
Software Stack

DesignWare® IP
For ARM AMBA® Interconnect, I/F IP, coreTools for assembly, MBIST, High Performance Standard Cells, Fast Memories, IP Designer for ASIPs

Optimized Implementation
Galaxy™ Tools, Reference Implementations (RIs), High Performance Core Centers, Low Power Methodology, Lynx Design System

Verification
VCS® RTL Verification & Verdi® Debug, SpyGlass® RTL Signoff, VIP for AMBA Interconnect, LP Verification Methodology, ZeBu® HW-Assisted Verification − Hybrid Emulation, AMBA Transactors

Physical Prototypes
HAPS® Support for ARM Cores, Connection to Juno ARM Development Platform, Prototyping Methodology

Hybrid Prototypes
ARM AMBA Transactors

Virtual Prototypes
Virtualizer™ Tool Set & Development Kits (VDKs) for Pre-RTL SW Dev. Using ARM Fast Models (v7 & v8), Arch. Design with Platform Architect, Coverity & Defensics SW Signoff
Synopsys Reference Implementation
Complete Implementation & Static Verification Flow for Next Generation CPUs
Design Compiler Graphical
NDM Library Prep
IC Compiler II: place_opt, clock_opt, route_opt
PrimeTime ECO
VC LP, PrimeTime SI, Formality
Note: DFT supported at each step during DC Graphical and IC Compiler II, UPF supported throughout the flow
https://community.arm.com/processors/b/blog/posts/arm-dynamiq-expanding-the-possibilities-for-artificial-intelligence
Included in Synopsys Reference Implementation
ARM Next Generation CPU and NonCPU Designs and Libraries
Design Detail
Clock Gating: ~250 architectural ICGs
Process: TSMC 16nm FFC, 11 Layer +AP w/ routing on M2-M9
Libraries/memories: ARM POP IP for TSMC 16nm FFC
– SVT/LVT/ULVT C24/C20/C18/C16 for data
– ULVT C16 for clock
PVT corners (MCMM):
– Setup: RC Max: TT/1.0V/85C
– Setup & Power: TT/0.8V/85C
– Hold: RC Min: FFGNP/1.05V/125C
DFT Strategy: Scan compression
UPF: 3 power domains with LS and ELS required
Floorplan blocks: DENGINE, CORE, DSIDE, ISIDE
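The three PVT corners listed on this slide can be written down as a small scenario table, for instance to drive a per-scenario report. This is a minimal Python sketch, not tool input: the `Scenario` class, its field names, and the `scenarios_for` helper are hypothetical, and an empty `rc_corner` means the slide states none for that scenario.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    purpose: str      # what the scenario is analyzed for (label is illustrative)
    rc_corner: str    # parasitic corner; "" when the slide does not state one
    process: str
    voltage_v: float
    temp_c: int


# Values copied from the slide; names and structure are assumptions.
SCENARIOS = [
    Scenario("setup", "RC Max", "TT", 1.0, 85),
    Scenario("setup+power", "", "TT", 0.8, 85),
    Scenario("hold", "RC Min", "FFGNP", 1.05, 125),
]


def scenarios_for(keyword):
    """Return the scenarios whose purpose mentions the given keyword."""
    return [s for s in SCENARIOS if keyword in s.purpose]


setup_scenarios = scenarios_for("setup")   # both TT scenarios
hold_scenarios = scenarios_for("hold")     # the single FFGNP scenario
```

Keeping the scenarios in one table makes it easy to check that every optimization step sees the same set of corners.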
Implementation Careabouts
Starting with library analysis makes balancing FMAX and power easier
SNUG 2017 © Copyright Synopsys 2017, All Rights Reserved 19
Constructing CPU Power Optimization Flow That Meets Performance/Power Targets
DC Graphical
– MB banking/de-banking: 1bit, 2bit, 4bit
– VT selection across 12 vt/channel options
– QL (leakage) vs. Q (std) vs. QA (area) flop selection
IC Compiler II
– SI TNS Reduction, very congested, clock NDRs
PrimeTime ECO
– fix_eco_power to meet leakage target, expect 15-20% reduction
Meet Power Target
• Essential to manage
– Vt class availability
– Multibit (MB) banking/de-banking
– Leakage vs. timing vs. dynamic optimization
– Leave headroom (both timing and power) for ECO
• Library impacts all these decisions
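The "expect 15-20% reduction" guidance implies a simple budget: if the ECO step removes a fraction r of leakage, the design can enter ECO at up to target / (1 - r) and still meet the target. The following is a hedged Python sketch of that arithmetic only; the function names are hypothetical and the 100.0 target is an arbitrary normalization, not a number from the slides.

```python
def leakage_after_eco(pre_eco_leakage, reduction):
    """Leakage left after an ECO that removes `reduction` (0..1) of it."""
    return pre_eco_leakage * (1.0 - reduction)


def max_pre_eco_leakage(target, reduction):
    """Largest pre-ECO leakage that still meets `target` after the ECO."""
    return target / (1.0 - reduction)


# With the target normalized to 100, a 15-20% ECO reduction allows the
# design to arrive at ECO somewhere between these two leakage levels.
budget_low = max_pre_eco_leakage(100.0, 0.15)
budget_high = max_pre_eco_leakage(100.0, 0.20)
```

With a 20% reduction the pre-ECO budget is 125% of target; with 15% it is roughly 117.6%, so planning against the lower figure is the more conservative choice.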
Analyzing Library Leakage Power Heat Maps
Example: INV @ TT 0.8V, 85C
Cell Delay by Drive Strength (rows) and VT Type/Channel Length (columns)
      ULVT             LVT              SVT
      C16 C18 C20 C24  C16 C18 C20 C24  C16 C18 C20 C24
X20N  18  19  19  20   20  21  21  22   23  24  25  26
X16N  19  20  20  21   21  22  23  24   25  26  27  28
X14N  20  21  21  22   22  23  24  25   26  27  28  29
X12N  21  22  22  23   23  24  25  26   27  28  29  31
X10N  22  23  24  25   25  26  26  28   29  30  31  33
Use Cell Delay Heat
influence on timing early in flow
Chart panels: SSG and TT-OD, each with ULVT, LVT, SVT series
– LVT use in implementation absolutely necessary for best leakage when targeting ULVT for timing
– Best to focus on a primary class for timing closure based on clock frequency
• Very low drive cells can be load sensitive
– Do not use small drive cells during synthesis and P&R
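The cell delay table for the INV example can be turned into relative numbers to see how much each Vt/channel choice costs against the fastest class. This is a minimal Python sketch using the table values as extracted; the slide gives no delay unit, so only ratios are computed, and the function names are hypothetical.

```python
VT_TYPES = ["ULVT", "LVT", "SVT"]
CHANNELS = ["C16", "C18", "C20", "C24"]

# Rows are drive strengths; columns run ULVT C16..C24, LVT C16..C24, SVT C16..C24.
CELL_DELAY = {
    "X20N": [18, 19, 19, 20, 20, 21, 21, 22, 23, 24, 25, 26],
    "X16N": [19, 20, 20, 21, 21, 22, 23, 24, 25, 26, 27, 28],
    "X14N": [20, 21, 21, 22, 22, 23, 24, 25, 26, 27, 28, 29],
    "X12N": [21, 22, 22, 23, 23, 24, 25, 26, 27, 28, 29, 31],
    "X10N": [22, 23, 24, 25, 25, 26, 26, 28, 29, 30, 31, 33],
}


def cell_delay(drive, vt, channel):
    """Look up one table entry by drive strength, Vt type, and channel length."""
    col = VT_TYPES.index(vt) * len(CHANNELS) + CHANNELS.index(channel)
    return CELL_DELAY[drive][col]


def relative_delay(drive, vt, channel):
    """Delay relative to ULVT C16 at the same drive strength."""
    return cell_delay(drive, vt, channel) / cell_delay(drive, "ULVT", "C16")


# Slowest class against fastest class at the largest drive strength.
svt_c24_vs_ulvt_c16 = relative_delay("X20N", "SVT", "C24")  # 26 / 18
```

At X20N the slowest class (SVT C24) is about 1.44 times the delay of ULVT C16, which gives a feel for how much timing is traded when cells are swapped for leakage.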
Vt/channel swap per datapath stage
Leakage Power  ULVT             LVT              SVT
% Final        C16 C18 C20 C24  C16 C18 C20 C24  C16 C18 C20 C24
144%           54% 6%  4%  3%   6%  15% 3%  0%   1%  2%  5%  1%
100%           16% 2%  1%  1%   40% 5%  3%  2%   3%  15% 12% 1%
118%           15% 8%  7%  33%  5%  5%  3%  9%   3%  1%  3%  7%
114%           13% 4%  4%  3%   12% 5%  5%  23%  3%  5%  5%  17%
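Summing the per-channel percentages in each row gives the share of each Vt type, which makes the link between the ULVT share and leakage easier to see. This is a small Python sketch over the four rows as extracted; the row labels were lost in extraction, so rows are identified only by their leakage figure, and `vt_totals` is a hypothetical helper.

```python
VT_TYPES = ("ULVT", "LVT", "SVT")
CHANNELS_PER_VT = 4  # C16, C18, C20, C24

# (leakage power as % of final, cell mix in % per Vt/channel column)
ROWS = [
    (144, [54, 6, 4, 3, 6, 15, 3, 0, 1, 2, 5, 1]),
    (100, [16, 2, 1, 1, 40, 5, 3, 2, 3, 15, 12, 1]),
    (118, [15, 8, 7, 33, 5, 5, 3, 9, 3, 1, 3, 7]),
    (114, [13, 4, 4, 3, 12, 5, 5, 23, 3, 5, 5, 17]),
]


def vt_totals(mix):
    """Collapse the 12 Vt/channel columns into one total per Vt type."""
    return {
        vt: sum(mix[i * CHANNELS_PER_VT:(i + 1) * CHANNELS_PER_VT])
        for i, vt in enumerate(VT_TYPES)
    }


totals_by_leakage = {leak: vt_totals(mix) for leak, mix in ROWS}
```

For the 144% row this gives ULVT 67%, LVT 24%, SVT 9%, while the 100% row is ULVT 20%, LVT 50%, SVT 31% (the extracted percentages are rounded, so rows sum to 99-101%).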
Vt Distribution Through Full Flow
Vt Class/Channel Mix Changes As Implementation Progresses
• In DC, datapath delay is prioritized; faster cells are used
• Pessimism reduces through the flow and CTS brings in useful skew; cells are swapped for power
• ECO brings in 4 new SVT classes and does positive slack recovery
StarRC, PrimeTime SI/PX
• Timing ECO (fix_eco_timing) makes minor improvements on remaining critical paths
• Power ECO (fix_eco_power -pattern ULVT/LVT/SVT) reduces leakage without impacting timing
• Essential to consider
– Which flops to use
– Allowed stage flops
Chart legend: Q, Base, LS Mbit, LS, ELS Mbit, ELS

DC Graphical: 7% net; 2; TT-1.0V; Q, QL, QA; 1-bit, 2-bit, 4-bit; Timing, Area, Dynamic
• RTL inferencing
• Physical-aware critical path de-banking

IC Compiler II (place_opt, clock_opt, route_opt): POCV cell, 7% net; 8; TT-1.0V, TT-0.8V, FFG; Q, QL, QA; 1-bit, 2-bit, 4-bit; Timing, Area, Leakage, Dynamic
• Physical-aware critical path de-banking

PrimeTime ECO: AOCV cell, 7% net; 12; TT-1.0V, TT-0.8V, FFG; Q, QL, QA; 1-bit, 2-bit, 4-bit; Timing, Area, Leakage
Implementation Careabouts
Require Upfront Analysis of Design Inputs & Flows to Optimize for Best PPA
Figure labels: Blockage; FMAX (% target); floorplan blocks DENGINE, DSIDE, CORE, ISIDE
– Typically focused on top-level modules
– Some analysis at first level down
– Add blockages to control channel density
• TNS reduced but QoR gains were offset by SI effects
Some clear patterns emerge that lead to new macro placement conclusions
High-performance Core Floorplan Changes
Move RAMs To Guide Module Placement
Floorplan labels: ISIDE, CORE
Floorplan changes that allowed CORE to float to the center and close to DSIDE
CORE Results @ route_opt with New Floorplan
Figure axes: FMAX (% target); Blocked Channel % (60%, 80%, 90%); block: CORE
… further improve QoR
… cluster together and cause timing problems
• Use percentage placement blockages to spread out pipeline flops
place_opt with data only: higher area & power
Apply useful skew
Size up clock buf
CLK diagram values: -100ps, 200ps; 90ps, 10ps
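The useful-skew figures can be read as a worked example, assuming (the slide labels do not say so explicitly) that -100ps and 200ps are the setup slacks of two adjacent pipeline stages sharing a flop, and that 90ps and 10ps are the slacks after that flop's clock is delayed. Under that assumption, delaying the shared flop's clock by d adds d to the incoming stage's slack and removes d from the outgoing stage's slack. The Python sketch below models setup only and ignores hold; both function names are hypothetical.

```python
def apply_useful_skew(slack_in_ps, slack_out_ps, delay_ps):
    """Delay the shared flop's clock: the incoming stage gains, the outgoing stage loses."""
    return slack_in_ps + delay_ps, slack_out_ps - delay_ps


def balancing_delay(slack_in_ps, slack_out_ps):
    """Clock delay that would equalize the two setup slacks."""
    return (slack_out_ps - slack_in_ps) / 2


# A 190ps clock delay reproduces the slide's figures under the stated assumption.
after = apply_useful_skew(-100, 200, 190)   # (90, 10)
balanced = apply_useful_skew(-100, 200, balancing_delay(-100, 200))  # (50.0, 50.0)
```

The total slack (100ps) is conserved in both cases; useful skew only moves it to the stage that needs it, whereas data-only optimization has to buy the missing 100ps with larger or faster cells.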
Flow                        WNS   TNS   Leakage Power
place_opt + route_opt       -23   -95   100%
place_opt CCD + route_opt   -20   -68   98%

Flow                        WNS   TNS   Leakage Power
route_opt (Baseline)        -112  -134  100%
Power CCD + route_opt       -99   -126  99%
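The CCD comparisons above are easier to compare as relative improvements over each baseline. The following Python sketch does that arithmetic on the WNS and TNS values as extracted; the slide does not state their units, and `pct_improvement` is a hypothetical helper.

```python
def pct_improvement(baseline, new):
    """Percent by which a negative-slack metric moved toward zero (positive is better)."""
    return (new - baseline) / abs(baseline) * 100.0


# place_opt CCD + route_opt vs. place_opt + route_opt
place_ccd_wns = pct_improvement(-23, -20)
place_ccd_tns = pct_improvement(-95, -68)

# Power CCD + route_opt vs. route_opt (Baseline)
power_ccd_wns = pct_improvement(-112, -99)
power_ccd_tns = pct_improvement(-134, -126)
```

With these inputs, place_opt CCD improves TNS by about 28% and WNS by about 13%, while Power CCD improves WNS by about 12% and TNS by about 6%, each alongside the 1-2% leakage reduction shown in the tables.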
Metric @ Signoff   Traditional Flow   GR-based Opt.
R2R WNS            -29ps              -15ps
WNS                -60ps              -49ps
R2R TNS            -1ns               -1ns
TNS                -11ns              -2ns
Leakage            100%               96%

Delay calculation used   Step     WNS   TNS
ICC II Delaycalc         PT ECO   -20   -8
PrimeTime Delaycalc      PT ECO   -12   0
Other labels: AWP Delay Model, CCS Receiver Cap Model
Results on ARM Next Generation CPU
Achieved Target PPA with Convergent, Repeatable Flow
Figure: Next-Generation CPU PPA. TNS (ns), FMAX (%), and Leakage (%), shown as % of Final, across Implementation Stages: DCG, place_opt, clock_opt, route_opt, Signoff ECO