Best Practices For High-performance, Energy Efficient Implementations of The Latest ARM® Processors In 16-nanometer FinFET Compact (16FFC) Process Technology Using Synopsys Galaxy™ Design Platform

Vidit Babbar, ARM
Joe Walston, Synopsys
22nd March, 2017
SNUG 2017 1
Agenda
ARM/Synopsys Collaboration
Performance / Power Optimized Galaxy Design Reference Implementation
Maximizing Performance/Minimizing Power
Summary and Next Steps
SNUG 2017 © Copyright Synopsys 2017, All Rights Reserved 2

ARM + Synopsys
Rich History of Collaboration
The New Threshold for Market Leadership
• 2010: 1.0 GHz in 40nm
• 2013: 2.0+ GHz in 28nm
• 2016: 3.0+ GHz in 16nm
• 2017: 3.5+ GHz in 7nm
ARM POP™ IP on 16nm
Need to shorten the design cycle
• POP IP is a comprehensive, fully validated Cortex®-A CPU implementation solution
• Includes Physical IP, floorplans and reference implementation scripts
Need to lower technical and schedule risk
• POP IP is developed and tuned in synergy with RTL over several iterations
• All Physical IP and implementation issues have been identified and solved by EAC date
Need to achieve market-leading PPA
• POP undergoes extensive iterative floorplan exploration and design tuning to deliver market-leading PPA
• Our record in 16nm FinFET technology is a testament to the hard work behind POP development
How does PDG work with processor design on POP IP?
• CPU milestones: Pre-ALPHA, ALPHA, BETA, LAC, EAC
– CPU RTL optimization based on POP IP implementation feedback
• POP milestones: Pre-ALPHA, ALPHA, BETA, EAC
– Floorplan tuning and PPA optimization
[Figure: CPU and POP development tracks, linked by Requirements/Inputs and Co-development work]
For more information on POP IP, check out "The Faster, Better, and Shorter ICC II – How It’s Done with ARM POP IP for Cortex-A73", presented by Rupal Gandhi, Thursday, March 23, @ 11:15am, Hall A2
Partnering for ARM Powered® Products
Optimized Implementation
• Galaxy™ Tools, Reference Implementations (RIs), High Performance Core Centers, Low Power Methodology, Lynx Design System
DesignWare® IP
• For ARM AMBA® Interconnect, I/F IP, coreTools for assembly, MBIST, High Performance Standard Cells, Fast Memories, IP Designer for ASIPs
Verification
• VCS® RTL Verification & Verdi® Debug, SpyGlass® RTL Signoff, VIP for AMBA Interconnect, LP Verification Methodology, ZeBu® HW-Assisted Verification, Hybrid Emulation, AMBA Transactors
Physical Prototypes
• HAPS® Support for ARM Cores, Hybrid Prototypes, Connection to Juno ARM Development Platform, ARM AMBA Transactors, Prototyping Methodology
Virtual Prototypes
• Virtualizer™ Tool Set & Development Kits (VDKs) for Pre-RTL SW Dev. Using ARM Fast Models (v7 & v8), Arch. Design with Platform Architect, Coverity & Defensics SW Signoff
[Figure: development stages covered: Optimized HW Design & Implementation, Verification, Validation & HW/SW Integration, System SW Dev., Software Stack]
SNUG 2017 synopsys.com/ARM v161013 7

Performance & Power Optimized
Reference Implementation
Galaxy Design Platform
Synopsys Reference Implementation
Complete Implementation & Static Verification Flow for Next Generation CPUs
• NDM Library Prep
• Design Compiler Graphical
• IC Compiler II: place_opt, clock_opt, route_opt
• PrimeTime ECO
• VC LP, PrimeTime SI, Formality
Note: DFT supported at each step during DC Graphical and IC Compiler II, UPF supported throughout the flow

Included in Synopsys Reference Implementation
Most Advanced Technology from Synopsys Implementation Platform
DC Graphical: Meet Timing
• Enhanced physical guidance (eSPG)
• Enhanced layer-aware optimization
DC Graphical: Reduce Power
• Timing-driven multibit register banking and de-banking
• Physical-aware clock gating
• Low power placement
IC Compiler II: Meet Timing
• Placement pre-clustering
• place_opt CCD
• New global route-based opt.
• CCS receiver cap modeling
• PrimeTime delay calc in route_opt
• Redundant VIA insertion
IC Compiler II: Reduce Power
• Incremental timing-driven multibit register banking and de-banking
• Clock gating optimization
• Low power placement
• High effort leakage flow
PrimeTime ECO: Meet Timing
• Path-based analysis (PBA)
• Clock skew ECO
• Physical-aware ECO
PrimeTime ECO: Reduce Power
• Leakage-aware timing ECO

ARM DynamIQ™ – Multicore redefined
• New single cluster design
• Greater flexibility with or without big.LITTLE
• Redesigned memory sub-system
• Advanced compute capabilities
https://community.arm.com/processors/b/blog/posts/arm-dynamiq-expanding-the-possibilities-for-artificial-intelligence
Included in Synopsys Reference Implementation
ARM Next Generation CPU and NonCPU Designs and Libraries
Design Detail
• Config: Crypto; NEON & FP Unit
• Clock Gating: ~250 architectural ICGs
• Process: TSMC 16nm FFC; 11 Layer +AP w/ routing on M2-M9
• Libraries/memories: ARM POP IP for TSMC 16nm FFC; SVT/LVT/ULVT C24/C20/C18/C16 for data; ULVT C16 for clock
• PVT corners (MCMM):
– Setup: RC Max: TT/1.0v/85c
– Setup & Power: TT/0.8v/85c
– Hold: RC Min: FFGNP/1.05v/125c
• DFT Strategy: Scan compression
• UPF: 3 power domains with LS and ELS required
[Figure: floorplan with DENGINE, CORE, DSIDE and ISIDE modules]

Maximizing Performance/Minimizing Power
on the Next Generation ARM CPU
with the Galaxy Design Platform
Implementation Careabouts
• Determine an Optimum Floorplan
• Analyze Library for Balanced Power/Performance/TAT
• Manage congestion, SI & pessimism for convergence
Power / Area / Performance
Require Upfront Analysis of Design Inputs & Flows to Optimize for Best PPA

Analyze Library for Balanced Power/Performance/TAT
• OCV
• Multi-Vt & gate-length
• Sequential Multibit Flops

On Chip Variation (OCV)
Signoff and Implementation Derating Methodology
Derating Strategy
• DC Graphical: No cell derating; 7% net derating
• IC Compiler II: POCV cell derating* (from AOCV 5%, 7%, 10%); 7% net derating
• PrimeTime ECO: AOCV cell derating* (5%, 7%, 10%); 7% net derating
* SSG @ 5%, TT 1.0v @ 7%, FFG @ 10%
• Derating strategy changes through implementation to signoff
– Manages pessimism
– Reduces area/power impact of GBA-based AOCV
• Methodology does not affect signoff
ICC II Optimization      AOCV-based   POCV-based
ICC II Frequency         Baseline     + 8%
PrimeTime Frequency      Baseline     + 2%
Utilization              68%          66%
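
To make the derating numbers above concrete, here is a minimal Python sketch of how a flat late derate inflates a setup path. It is a simplification: real AOCV and POCV derates depend on path depth, distance or per-cell sigma, and the function name and the path delays below are hypothetical values chosen only for illustration.

```python
def late_path_delay(cell_delays_ps, net_delays_ps, cell_derate, net_derate):
    """Late (setup) path delay with flat derates applied to cells and nets.

    A derate of 0.07 means every delay is multiplied by 1.07.
    """
    cells = sum(d * (1.0 + cell_derate) for d in cell_delays_ps)
    nets = sum(d * (1.0 + net_derate) for d in net_delays_ps)
    return cells + nets


# Hypothetical 4-stage path; these delays are made up for illustration.
CELLS_PS = [30.0, 25.0, 40.0, 35.0]
NETS_PS = [10.0, 8.0, 12.0, 10.0]

# Synthesis-style setting from the slide: no cell derating, 7% net derating.
print(f"no cell derate: {late_path_delay(CELLS_PS, NETS_PS, 0.0, 0.07):.1f} ps")

# Flat stand-ins for the 5%, 7% and 10% cell derates, always with 7% on nets.
for cell_derate in (0.05, 0.07, 0.10):
    delay = late_path_delay(CELLS_PS, NETS_PS, cell_derate, 0.07)
    print(f"cell derate {cell_derate:.0%}: {delay:.1f} ps")
```

The gap between the first line and the others is the extra margin the optimizer has to close once cell derating is switched on, which is why the choice of derating model at each stage affects area and power.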

ARM Library – Leverage for best CPU QoR
Complexity and Flexibility
• Specific details of ARM 16FFC library
– 3 Vt classes (SVT, LVT, ULVT)
– 4 channel lengths each (16, 18, 20, 24)
– Base, hpk (High-Performance), pmk (Power Management)
– 1, 2 and 4 bit flops
– 3 main types of SB/MB flops (Q, QL, QA)
• CPU has very aggressive power targets
– As always - extremely high FMAX target
– But… leakage and dynamic power targets require more than opportunistic power reduction
Design Detail
• Process: TSMC 16nm FFC; 11 Layer +AP routing on M2-M9
• Libraries/memories: ARM POP IP for TSMC 16nm FFC; ULVT C24/C20/C18/C16 for data; ULVT C16 for clock; SVT/LVT for leakage opt.
• PVT corners (MCMM):
– Setup: RC Max: TT/1.0v/85c
– Setup & Power: TT/0.8v/85c
– Hold: RC Min: FFGNP/1.05v/125c
Starting with library analysis makes balancing FMAX and power easier
Constructing CPU Power Optimization Flow
That Meets Performance/Power Targets
Meet Power Target
• DC Graphical
– MB banking/de-banking: 1bit, 2bit, 4bit
– VT selection: across 12 vt/channel options
• IC Compiler II
– QL (leakage) vs. Q (std) vs. QA (area) flop selection
– SI TNS Reduction, very congested, clock NDRs
• PrimeTime ECO
– fix_eco_power to meet leakage target, expect 15-20% reduction
• Essential to manage
– Vt class availability
– Multibit (MB) banking/de-banking
– Leakage vs. timing vs. dynamic optimization
– Leave headroom (both timing and power) for ECO
• Library impacts all these decisions
Analyzing Library Leakage Power Heat Maps
Example: INV @ TT 0.8V, 85c
Use Leakage Power Heat Map to gain qualitative understanding of library characteristics
Leakage Power by Drive Strength (rows) and VT Type/Channel length (columns)
Drive     ULVT                  LVT                   SVT
Strength  C16   C18   C20   C24   C16   C18   C20   C24   C16   C18   C20   C24
X20N      858   682   575   409   89    69    56    35    8     5     4     3
X16N      681   542   457   326   71    55    44    28    6     4     3     2
X14N      593   472   398   284   62    48    39    24    6     4     3     2
X12N      504   401   339   243   52    41    33    21    5     3     2     2
X10N      415   331   280   201   43    33    27    17    4     3     2     1
X8N       327   261   221   159   34    26    21    14    3     2     1     1
X7N       259   208   177   129   27    21    17    11    2     2     1     1
X6N       239   191   162   118   25    19    16    10    2     2     1     1
X5N       195   156   133   96    20    16    13    8     2     1     1     1
X4N       151   121   103   75    16    12    10    6     1     1     1     0
X3N       108   87    74    54    11    9     7     5     1     1     0     0
X2N       67    53    45    33    7     5     4     3     1     0     0     0
X1P5N     39    31    26    20    5     4     3     2     0     0     0     0
X1N       28    22    19    14    3     2     2     1     0     0     0     0
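
One way to read the leakage heat map above is to compare the same drive strength and channel length across Vt classes. The short Python sketch below does that for three rows copied from the table; the dictionary layout and the function name are illustrative choices, not part of any tool.

```python
# Leakage values copied from the heat map above (INV @ TT 0.8V, 85c).
# Each list is ordered by channel length: C16, C18, C20, C24.
CHANNELS = ["C16", "C18", "C20", "C24"]
LEAKAGE = {
    "X20N": {"ULVT": [858, 682, 575, 409], "LVT": [89, 69, 56, 35], "SVT": [8, 5, 4, 3]},
    "X8N": {"ULVT": [327, 261, 221, 159], "LVT": [34, 26, 21, 14], "SVT": [3, 2, 1, 1]},
    "X4N": {"ULVT": [151, 121, 103, 75], "LVT": [16, 12, 10, 6], "SVT": [1, 1, 1, 0]},
}


def leakage_ratio(drive, vt_a, vt_b, channel):
    """Ratio of leakage between two Vt classes at one drive and channel length."""
    i = CHANNELS.index(channel)
    return LEAKAGE[drive][vt_a][i] / LEAKAGE[drive][vt_b][i]


for drive in LEAKAGE:
    ulvt_vs_lvt = leakage_ratio(drive, "ULVT", "LVT", "C16")
    lvt_vs_svt = leakage_ratio(drive, "LVT", "SVT", "C16")
    print(f"{drive}: ULVT/LVT = {ulvt_vs_lvt:.1f}x, LVT/SVT = {lvt_vs_svt:.1f}x")

# Channel-length effect inside one Vt class, for comparison.
c16_vs_c24 = LEAKAGE["X20N"]["ULVT"][0] / LEAKAGE["X20N"]["ULVT"][3]
print(f"X20N ULVT C16/C24 = {c16_vs_c24:.1f}x")
```

The Vt-class ratios come out around an order of magnitude, while moving from C16 to C24 inside ULVT changes leakage by roughly a factor of two. SVT entries that round to 0 in the table are avoided here because they would divide by zero.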

Analyzing Library Cell Delay Heat Maps
Example: INV with 50fF Load @ TT 1.0V, 85c
Use Cell Delay Heat Map to gain qualitative understanding of library characteristics
Cell Delay by Drive Strength (rows) and VT Type/Channel length (columns)
Drive     ULVT                  LVT                   SVT
Strength  C16   C18   C20   C24   C16   C18   C20   C24   C16   C18   C20   C24
X20N      18    19    19    20    20    21    21    22    23    24    25    26
X16N      19    20    20    21    21    22    23    24    25    26    27    28
X14N      20    21    21    22    22    23    24    25    26    27    28    29
X12N      21    22    22    23    23    24    25    26    27    28    29    31
X10N      22    23    24    25    25    26    26    28    29    30    31    33
X8N       24    25    26    27    27    28    29    31    32    33    34    36
X7N       25    26    27    28    29    30    31    32    34    35    36    39
X6N       28    29    29    31    31    32    33    35    37    38    39    42
X5N       30    31    32    34    34    35    37    39    41    42    43    46
X4N       34    35    36    38    39    40    41    43    45    47    49    52
X3N       41    42    43    45    47    48    50    52    55    57    59    63
X2N       53    54    55    58    60    62    64    67    70    73    76    80
X1P5N     65    66    68    72    74    76    78    82    86    88    92    98
X1N       92    94    97    101   105   108   111   116   123   127   133   140
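
The delay heat map above can be summarized by comparing how much delay moves along a row (changing Vt/channel at a fixed drive strength) versus down a column (changing drive strength at a fixed Vt/channel). The Python sketch below uses four rows copied from the table; the helper names are illustrative.

```python
# Cell delays copied from the heat map above (INV, 50fF load, TT 1.0V, 85c).
# Each row lists 12 values: ULVT C16..C24, then LVT C16..C24, then SVT C16..C24.
DELAY = {
    "X20N": [18, 19, 19, 20, 20, 21, 21, 22, 23, 24, 25, 26],
    "X8N": [24, 25, 26, 27, 27, 28, 29, 31, 32, 33, 34, 36],
    "X4N": [34, 35, 36, 38, 39, 40, 41, 43, 45, 47, 49, 52],
    "X1N": [92, 94, 97, 101, 105, 108, 111, 116, 123, 127, 133, 140],
}


def spread_across_classes(drive):
    """Delay range over all 12 Vt/channel classes at one drive strength."""
    row = DELAY[drive]
    return max(row) - min(row)


def spread_across_drives(class_index):
    """Delay range over the listed drive strengths for one Vt/channel class."""
    column = [row[class_index] for row in DELAY.values()]
    return max(column) - min(column)


for drive in DELAY:
    print(f"{drive}: spread across Vt/channel = {spread_across_classes(drive)}")
print(f"ULVT C16: spread across drive strength = {spread_across_drives(0)}")
```

For the larger cells the row spread is small compared with the column spread, while the X1N row shows a much wider range than the X20N row.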

Library Observations
Recommendations to Minimize Leakage Power
• Overlap between ULVT and LVT
– Consider combination of ULVT C16 and LVT C16 during synthesis and P&R
• Order of magnitude power delta between equivalent drive strengths/channel lengths across VT classes
– LVT use in implementation absolutely necessary for best leakage when targeting ULVT for timing
• SVT power savings mainly in small drive strengths
– SVT probably best for ECO swap/size
• Variation between channel lengths is small
– Channel variations should be reserved for ECO
[Figure: Leakage Heat Map (TT) for ULVT, LVT and SVT]

Library Observations
Recommendations to Improve Timing
• Delay variation across Vt and channel smaller than across drive strength
– Many available Vt/channel classes will not have a strong influence on timing early in flow
– Best to focus on a primary class for timing closure based on clock frequency
• Very low drive cells can be load sensitive
– Do not use small drive cells during synthesis and P&R
• Variation between channel lengths reserved for ECO
– Late in flow, swapping can boost timing 1-2ps per Vt/channel swap per datapath stage
[Figure: Cell Delay Heatmap for ULVT, LVT and SVT at SSG and TT-OD]

Optimizing Leakage Power
Implementation Flow - Vt Class Trials
• 12 Vt and channel length variations in library
• Not using SVT for implementation, saving it for ECO
– Capacitance sensitivity outweighs possible leakage gains in DC Graphical/IC Compiler II
• Choosing Vt classes to use during flow can impact final leakage power
Leakage Power  ULVT                  LVT                   SVT
% Final        C16   C18   C20   C24   C16   C18   C20   C24   C16   C18   C20   C24
144%           54%   6%    4%    3%    6%    15%   3%    0%    1%    2%    5%    1%
100%           16%   2%    1%    1%    40%   5%    3%    2%    3%    15%   12%   1%
118%           15%   8%    7%    33%   5%    5%    3%    9%    3%    1%    3%    7%
114%           13%   4%    4%    3%    12%   5%    5%    23%   3%    5%    5%    17%
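
To compare the four trials above at a glance, the per-class percentages can be rolled up into one share per Vt class. The Python sketch below copies the table rows; it assumes each row's twelve percentages describe the final Vt/channel mix of that trial, which is how the table reads but is not stated explicitly.

```python
# Rows copied from the Vt class trial table above:
# (final leakage %, [ULVT C16..C24, LVT C16..C24, SVT C16..C24] in %).
TRIALS = [
    (144, [54, 6, 4, 3, 6, 15, 3, 0, 1, 2, 5, 1]),
    (100, [16, 2, 1, 1, 40, 5, 3, 2, 3, 15, 12, 1]),
    (118, [15, 8, 7, 33, 5, 5, 3, 9, 3, 1, 3, 7]),
    (114, [13, 4, 4, 3, 12, 5, 5, 23, 3, 5, 5, 17]),
]


def vt_shares(mix):
    """Collapse the 12 per-class percentages into one share per Vt class."""
    return {"ULVT": sum(mix[0:4]), "LVT": sum(mix[4:8]), "SVT": sum(mix[8:12])}


for leakage, mix in TRIALS:
    shares = vt_shares(mix)
    print(f"leakage {leakage}%: ULVT {shares['ULVT']}%, "
          f"LVT {shares['LVT']}%, SVT {shares['SVT']}%")
```

The rows sum to 99% to 101% because the source values are rounded. The two trials with the largest ULVT share are also the two with the highest final leakage.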
Vt Distribution Through Full Flow
Vt Class/Channel Mix Changes As Implementation Progresses
• DC Graphical: In DC, datapath delay is prioritized. Faster cells are used.
• ICC II route_opt: Pessimism reduces through the flow and CTS brings in useful skew; cells are swapped for power.
• PrimeTime ECO: ECO brings in 4 new SVT classes and does positive slack recovery.
[Figure: Vt/channel distribution (ULVT, LVT, SVT; c16, c18, c20, c24) at each stage]
ECO Objectives for Power Optimization
• PrimeTime ECO used to lower power
– AOCV with PBA increases positive slack for downsizing
– Open up all 12 Vt/channel classes to ECO
• Timing ECO makes minor improvements on remaining critical paths
• Power ECO reduces leakage without impacting timing
– Uses footprint-compatible Vt/channel swaps
• Typically see 20-30% leakage reduction
ECO flow:
• ICC II Route_opt (ULVT/LVT)
• StarRC
• PrimeTime SI/PX: fix_eco_timing, fix_eco_power -pattern (ULVT/LVT/SVT)
• ICC II ECO
• StarRC
• PrimeTime SI/PX
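
The idea of a slack-bounded Vt/channel swap can be sketched in a few lines of Python. This is only an illustration and not how fix_eco_power works internally: it treats one cell in isolation with its own slack budget, and it pairs delay values (TT 1.0V, 50fF load) with leakage values (TT 0.8V) taken from the X4N rows of the two heat maps, which were characterized at different corners.

```python
# (delay, leakage) for a few X4N inverter variants, copied from the heat maps.
VARIANTS = {
    "ULVT_C16": (34, 151),
    "ULVT_C24": (38, 75),
    "LVT_C16": (39, 16),
    "LVT_C24": (43, 6),
    "SVT_C16": (45, 1),
}


def best_swap(current, slack_ps):
    """Pick the lowest-leakage variant whose added delay fits in the slack.

    Hypothetical helper: a real ECO also has to account for slack shared
    along the whole path, load changes, and footprint compatibility.
    """
    cur_delay, _ = VARIANTS[current]
    best = current
    for name, (delay, leakage) in VARIANTS.items():
        fits = delay - cur_delay <= slack_ps
        if fits and leakage < VARIANTS[best][1]:
            best = name
    return best


for slack in (0, 5, 11):
    print(f"slack {slack} ps -> {best_swap('ULVT_C16', slack)}")
```

With no slack the cell stays as ULVT C16; a few picoseconds of positive slack is enough to move to LVT, and more slack allows SVT. This is why the extra positive slack from PBA translates into leakage recovery.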

Dynamic Power - Flop Mapping & Banking
Considerations
• ARM 16FFC library has many choices for Flops in flop mapping
– Q (base), QA (low area), QL (low power)
– 1-bit, 2-bit, 4-bit
• Essential to consider
– Which flops to use
– Allowed stage flops
– At what stage(s) to bank/de-bank
– De-banking criteria
– Banking exclusion list
[Figure: flop library matrix of QA (low area), Q (base) and QL (low power) in single-bit, 2-bit and 4-bit versions]

Dynamic Power - Flop Mapping & Banking
Recommendations: Achieved 18% Total Power Savings
Flop Mapping
• Experimented with different flavors of flops
– QA, QL and QA + QL
• Observations
– QA is smaller but has larger dynamic power
– QL has lower leakage power
– QA + QL combination best for overall power
• 13% total power savings
Multibit Banking
• Experimented with different strategies
– 2-bit and 4-bit flops
– Physically Aware Multibit Banking (PAMB)
– Automated critical path de-banking
• Observations
– 4-bit reduces power, creates WNS instability
– PAMB improves power, increases TNS
– De-banking reduces PAMB TNS impact
• 4-bit: 3% additional total power savings
• PAMB/de-bank: 2% total power savings
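
As a rough illustration of banking with a critical-path exclusion, the Python sketch below groups flops that share a clock and enable into 4-bit and 2-bit banks and leaves timing-critical flops as single bits. It is a toy model under stated assumptions: the flop records, the `bank_flops` name and the slack threshold are hypothetical, and physical proximity, which PAMB takes into account, is ignored.

```python
from collections import defaultdict


def bank_flops(flops, min_slack_ps=0.0, widths=(4, 2)):
    """Group compatible flops into multibit banks, widest first.

    Flops with slack below min_slack_ps are kept single-bit, standing in
    for a de-banking criterion or banking exclusion list.
    """
    groups = defaultdict(list)
    single_bit = []
    for flop in flops:
        if flop["slack_ps"] < min_slack_ps:
            single_bit.append(flop["name"])
        else:
            groups[(flop["clock"], flop["enable"])].append(flop["name"])

    banks = []
    for names in groups.values():
        start = 0
        for width in widths:
            while len(names) - start >= width:
                banks.append(tuple(names[start:start + width]))
                start += width
        single_bit.extend(names[start:])  # leftovers stay single-bit
    return banks, single_bit


# Hypothetical register list: seven relaxed flops and one critical flop.
FLOPS = [
    {"name": f"r{i}", "clock": "clk", "enable": "en_a", "slack_ps": 20.0}
    for i in range(7)
] + [{"name": "r_crit", "clock": "clk", "enable": "en_a", "slack_ps": -5.0}]

banks, single_bit = bank_flops(FLOPS)
print(banks)
print(single_bit)
```

Raising `min_slack_ps` excludes more flops from banking, trading some of the power saving for timing stability, which mirrors the banking versus de-banking balance described in the observations.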

Dynamic Power UPF
Implementation and Static Verification Flow – Included in Synopsys RI
ARM successive refinement UPF
• Constraints, configuration and implementation UPF
Gas station voltage areas for selected signals
• Ensures optimum clock tree for CPU modules
Multibit LS and ELS insertion
• 2-bit wide and 4-bit wide
Header switch control and acknowledge implementation
• Hammer & Trickle Headers
• Daisy chain and HFS connections
Fine grain switched ground RAM support
• Proper always-on TIE cells, automated level shifter, isolation cell insertion
Includes many of the newest UPF capabilities and methodologies

Multibit LS/ELS insertion
Reduced Area Leads to Less Congestion
                  Single-bit LS/ELS   Single/multi-bit LS/ELS
LS/ELS count      3457                886
LS/ELS Area       12,815 µm²          4,920 µm² (61% less)
LS/ELS Leakage    0.77 mW             0.85 mW (13% more)
[Figure: placement of LS, ELS, Mbit LS and Mbit ELS cells for both cases]


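Summary: CPU Power Optimization Flow
That Meets Performance/Power Targets
DC Graphical
• OCV: 7% net
• Vt Classes: 2
• MCMM Scenarios: TT-1.0V
• Flop Families: Q, QL, QA
• Multibit Flops: 1-bit, 2-bit, 4-bit
• Optimization Targets: Timing, Area, Dynamic
• Banking/De-banking Flow: RTL inferencing; Physical-aware critical path de-banking
IC Compiler II (place_opt, clock_opt, route_opt)
• OCV: POCV cell, 7% net
• Vt Classes: 2 at place_opt and clock_opt, 8 at route_opt
• MCMM Scenarios: TT-1.0V, TT-0.8V, FFG
• Flop Families: Q, QL, QA
• Multibit Flops: 1-bit, 2-bit, 4-bit
• Optimization Targets: Timing, Area, Leakage, Dynamic
• Banking/De-banking Flow: Physical-aware critical path de-banking
PrimeTime ECO
• OCV: AOCV cell, 7% net
• Vt Classes: 12
• MCMM Scenarios: TT-1.0V, TT-0.8V, FFG
• Flop Families: Q, QL, QA
• Multibit Flops: 1-bit, 2-bit, 4-bit
• Optimization Targets: Timing, Area, Leakage
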

Determine an Optimum Floorplan
• Placement Bounds Controls
• Data Flow Analysis
• Macro Placement

Floorplanning
Key Element in ARM CPU Development
• ARM and Synopsys have collaborated on ARM CPUs for many generations
• Floorplanning is always part of achieving best QoR for ARM CPUs
• Macro placement, bounds and blockages impact both timing and power
[Figure: floorplans for Cortex-A57, Cortex-A72 and Cortex-A73 showing bounds, magnet placement and blockage; from SV SNUG 2015 and SV SNUG 2016]

The Last 10% Is Always The Most Challenging
• ARM CPU projects typically get to 90% of the performance target quickly
• Achieving the last 10% consumes most of the schedule
• Extensive analysis shows performance is impacted by placement
• Block floorplan exploration can provide performance breakthrough
[Figure: place_opt FMAX Progression. FMAX (% target, 0.5 to 1) vs. Project Duration, with "Ramp up to 90%" and "The last 10%" regions; labels "Data Flow", "Prep" and "Development"]

Early Placement and Bounds Trials
Manual Bounds/Blockages to Balance Timing vs. Routability QoR
• Techniques to drive QoR through placement
– Hard macro placement
– Placement bounds (fixed and floating)
– Typically focused on top-level modules
– Some analysis at first level down
– Add blockages to control channel density
• TNS reduced but QoR gains were offset by SI effects
[Figure: floorplan with DENGINE, DSIDE, CORE and ISIDE modules]
FMAX limited by critical paths to/from CORE module

QoR Challenges in Placement
Analysis of Critical Module Timing
• Most critical paths to/from CORE sub-module
– CORE connects heavily to DSIDE and DENGINE
– Critical paths seen throughout the flow (challenging to fix downstream)
• CORE being “pushed” out of center of core area and near IOs
• Created FMAX-limited paths due to long-path buffering across block
[Figure: module placement of DENGINE, DSIDE, ISIDE and CORE]

Controlling Standard Cell Placement
Look For Possibility Of Bounds To Improve QoR
• Experimented with bounds and blockages to improve QoR
– Based on module placement analysis
• Looking for changes in CORE
• Most critical paths between CORE and DSIDE
• CORE was being pushed away from DSIDE and DENGINE
• Needed to find block floorplan that allowed CORE to be placed in center
[Figure: module placement of CORE, DSIDE and DENGINE]
Desired movement not possible with given macro placement

DFA for ARM CPU Block Floorplanning
Enables Faster Convergence
• Use DFA (Data Flow Analysis) in ICC II to drive block-level QoR improvements
through CPU floorplan changes
• Deep analysis of fast connectivity-based placement and inter-module flylines
[Figure: connectivity-based placement and inter-module flylines for DENGINE, DSIDE, ISIDE and CORE]
Some clear patterns emerge that lead to new macro placement conclusions
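
The core idea behind this kind of connectivity analysis can be sketched independently of the tool: count how many nets connect each pair of modules and rank the pairs. The Python example below is a simplified stand-in, not the ICC II DFA implementation; the net list is hypothetical and only the module names come from the slides.

```python
from collections import Counter
from itertools import combinations


def module_connectivity(nets):
    """Count nets shared by each pair of modules.

    `nets` is an iterable of lists, each naming the modules one net touches.
    """
    counts = Counter()
    for modules in nets:
        for a, b in combinations(sorted(set(modules)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical nets, for illustration only.
NETS = (
    [["CORE", "DSIDE"]] * 5
    + [["CORE", "DENGINE"]] * 3
    + [["CORE", "ISIDE"]] * 1
    + [["DSIDE", "DENGINE"]] * 1
)

for (a, b), n in module_connectivity(NETS).most_common():
    print(f"{a} <-> {b}: {n} nets")
```

Module pairs at the top of the ranking are candidates to be placed next to each other, which is the kind of pattern that motivates moving macros so that CORE can sit close to DSIDE.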
High-performance Core Floorplan Changes
Move RAMs To Guide Module Placement
• Move CORE module back into the center
[Figure: module placement of DENGINE, CORE, DSIDE and ISIDE before and after the floorplan change]
Floorplan changes that allowed CORE to float to the center and close to DSIDE
CORE Results @ route_opt with New Floorplan
• Updated floorplan produced much better FMAX, TNS and leakage
• Critical path analysis of paths within, to and from CORE module shows much better timing
Path Group        WNS (ps)   TNS (ns)
CORE Reg to Reg   -17        -3
Out of CORE       -16        -1
In to CORE        -17        -1
CORE ICGs         -18        -1
[Figure: module placement of DENGINE, CORE, DSIDE and ISIDE]
Block level floorplanning is key to improving QoR

Floorplan/Placement Refinement
Summary
• Analysis-driven floorplan changes boosted FMAX and reduced power
• Minor floorplan modifications can further improve QoR
• Block-level floorplanning is a powerful and necessary tool for ARM CPU QoR improvements
[Figure: FMAX Progression with Synopsys RI. FMAX (% target, 0.5 to 1), with "Ramp up to 90%" and "The last 10%" regions]
Achieved goal to boost FMAX, reduce TNS and total power


Manage Crosstalk and Optimization for Best Frequency
• Crosstalk Optimization
• Concurrent Clock & Data (CCD)
• Global Route-based Optimization
• PrimeTime Delaycalc

Crosstalk: An Ongoing Challenge on ARM Cores
• Lower geometry processes always have a crosstalk component
• ARM CPUs have traditional SI prevention
– Clock NDRs
– Congestion-aware placement
– Logic and density controls
• The Cortex-A73 flow used NDRs to dramatically reduce crosstalk
• We have used all these techniques plus more on the latest ARM CPU
[Figure: from SV SNUG 2016]

Early Results on Latest High-perf ARM Core
Crosstalk Issues Already Visible
• When starting new ARM CPU flow, used best practices from Cortex-A73 RI
– 2W/4S clock NDRs (tapered on sinks)
– Crosstalk threshold noise ratio of 20%
– Congestion optimization for placer settings
– 80ps max_transition limit
• Early results: crosstalk still an issue
– Large TNS increase at route_auto stage
– Leakage increase at route_opt stage
[Figure: First Full Flow Trial. TNS, WNS and Leakage (%) at place_opt, clock_opt, route_auto and route_opt]
Need more to address crosstalk

Crosstalk and Placement
Why Placement Matters For Crosstalk
• In ARM CPUs placement and crosstalk are interrelated
• Separating two modules with a lot of timing-critical interconnect causes a channel of high, unidirectional routing density
• Addressing this connectivity helped with crosstalk optimization
[Figure: Critical Path Corridor. Area of very high connectivity between CORE and DSIDE modules; decreasing net crosstalk delta delay]

Crosstalk and Congestion
Related, But - Fixing Congestion Might not Fix Crosstalk
Congestion: localized, over-capacity problem
• Congestion is indicated by >100% utilized gcells, best addressed by routing changes
• Congestion Map: Gcells >100%
Crosstalk: large-area, at-capacity problem
• Crosstalk is indicated by high local routing density, best addressed with placement changes
• Congestion Map: Gcell = 100%
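
The distinction between the two maps can be expressed as two different counts over the same gcell utilization grid. The Python sketch below is illustrative only: the grid values, the function name and the 0.95 "at capacity" threshold are assumptions, not settings from the flow.

```python
def classify_gcells(utilization, at_capacity=0.95):
    """Count over-capacity gcells and gcells at (or close to) full capacity.

    `utilization` is a 2D list of routing utilization per gcell, 1.0 = 100%.
    """
    cells = [u for row in utilization for u in row]
    over = sum(1 for u in cells if u > 1.0)
    full = sum(1 for u in cells if at_capacity <= u <= 1.0)
    return {"over_capacity": over, "at_capacity": full, "total": len(cells)}


# Hypothetical maps: one local hotspot vs. a wide, uniformly full channel.
HOTSPOT = [
    [0.60, 0.70, 0.65],
    [0.70, 1.10, 0.70],
    [0.60, 0.70, 0.65],
]
FULL_CHANNEL = [
    [0.97, 0.99, 1.00],
    [0.98, 1.00, 0.99],
    [0.96, 0.98, 0.97],
]

print(classify_gcells(HOTSPOT))
print(classify_gcells(FULL_CHANNEL))
```

The second map has no over-capacity gcell, so a congestion report would look clean, yet every gcell is full. That is the large-area, at-capacity pattern associated with crosstalk above.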

Long Path Crosstalk in NonCPU
Reduced with Channel Blockages on CPU Interface Paths
• NonCPU has long-path crosstalk challenges
• Connections to CPUs can cluster together and cause timing problems
• Use percentage placement blockages to spread out pipeline flops
[Figure: placement with Blocked Channel % of 0%, 60%, 80% and 90%]


Results
Dramatic FMAX & Power Improvements
New optimization solutions deployed in addition to Cortex-A73 techniques
[Figure: before and after maps, showing decreasing net crosstalk delta delay]

Leveraging New CCD Capabilities
Place_opt CCD and Power Aware CCD
• place_opt with data only: higher area & power
• place_opt w/ useful skew: lower area & power
[Figure: two-stage CLK/datapath example. Data-only slacks of -100ps and 200ps; applying useful skew (delay 150ps) gives 50ps and 50ps and allows datapath area/power recovery. Other labels: "Size up", "clock buf", "Size down/swap LVT", "90ps", "10ps"]
Flow                        WNS    TNS    Leakage Power
place_opt + route_opt       -23    -95    100%
place_opt CCD + route_opt   -20    -68    98%

Flow                        WNS    TNS    Leakage Power
route_opt (Baseline)        -112   -134   100%
Power CCD + route_opt       -99    -126   99%
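
The useful-skew numbers in the figure can be reproduced with simple arithmetic. Delaying the clock of the register between two stages by s adds s to the setup slack of the path ending at that register and removes s from the path starting at it. The Python sketch below balances the two slacks; the function name is illustrative, and hold checks and clock latency limits are ignored.

```python
def balance_with_useful_skew(incoming_slack_ps, outgoing_slack_ps):
    """Clock delay that equalizes setup slack on both sides of a register.

    Returns (skew, new_incoming_slack, new_outgoing_slack), all in ps.
    """
    skew = (outgoing_slack_ps - incoming_slack_ps) / 2.0
    return skew, incoming_slack_ps + skew, outgoing_slack_ps - skew


# Values from the figure: -100ps into the register, 200ps out of it.
skew, slack_in, slack_out = balance_with_useful_skew(-100.0, 200.0)
print(f"delay clock by {skew:.0f} ps -> slacks {slack_in:.0f} ps / {slack_out:.0f} ps")
```

With both stages at positive slack, the datapath no longer needs upsizing to fix the -100ps violation, which is where the area and power recovery comes from.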

Increased Correlation Through the Flow
Routing and Timing Correlation
Global Route-based Optimization
• Timing-driven routing
• Re-routing, Re-buffering using global route parasitics
• On-route global route-based re-buffering
Metric @ Signoff   Traditional Flow   GR-based Opt.
WNS                -60ps              -49ps
TNS                -11ns              -2ns
Leakage            100%               96%

Accurate CCS Receiver Model
• Used in place_opt, clock_opt and route_opt
• Improved correlation to signoff delay calculation
Metric    AWP Delay Model   CCS Receiver Cap Model
R2R WNS   -29ps             -15ps
R2R TNS   -1ns              -1ns

Route_opt based on signoff timer
• Route_opt able to see (and fix) timing issues easing burden on PT ECO
Delay calculation used   Step     WNS   TNS
ICC II Delaycalc         PT ECO   -20   -8
PrimeTime Delaycalc      PT ECO   -12   0

Multisource CTS (MS-CTS) + HTree
Benefiting Performance / Power
• Performance Challenge
– Significant OCV penalty associated with longer clock insertion delay
• Power Challenge
– Clock tree can consume up to 35% total power in typical CPU
– Most power consumption is at leaf level drivers & flop clocks
• MS-CTS useful to address both challenges
[Figure: CLK feeding a pre-mesh tree, pre-mesh drivers and a clock mesh that drives ICGs, FFs and RAMs]
Flow/Metrics     MS-CTS
Global Skew      53% smaller
Clock Levels     40% fewer
Dynamic Power    similar

Summary: Implementation Careabouts
Upfront Analysis of Design Inputs & Flows to Optimize for Best PPA
Power / Area / Performance
• OCV, Multi-Vt & gate-length, Sequential Multibit Flops
• Placement Bounds Controls, Data Flow Analysis, Macro Placement
• Manage Crosstalk and Optimization for Best Frequency: Crosstalk Optimization, Concurrent Clock & Data (CCD), Global Route-based Optimization, PrimeTime Delaycalc

Summary and Next Steps
Results on ARM Next Generation CPU
Achieved Target PPA with Convergent, Repeatable Flow
[Figure: Next-Generation CPU PPA. TNS (ns), FMAX (% of final) and Leakage (% of final) across implementation stages DCG, place_opt, clock_opt, route_opt and Signoff ECO]
DC Graphical + IC Compiler II + PrimeTime ECO

Results on ARM Next Generation Non-CPU
Customized Hierarchical Flow to meet PPA on NonCPU
[Figure: Next-Generation NonCPU PPA. TNS (ns), FMAX (% of final) and Leakage (% of final) across implementation stages DCG, place_opt, clock_opt, route_opt and Signoff ECO]
DC Graphical + IC Compiler II + PrimeTime ECO

Next Steps
Look for more details on this next-generation ARM design in the coming months
Synopsys Reference Implementation (RI) is ready
• CPU and NonCPU flows
• TSMC 16nm FFC process
• ARM POP™ IP – core optimized standard cells & fast cache RAMs
• Complete implementation and static verification flow
[Figure: floorplan with DENGINE, CORE, DSIDE and ISIDE modules]
Contact your Synopsys AC for additional information
Once core is announced, RI will be available on SolvNet (solvnet.synopsys.com/ARM-RI)
Conclusion: ARM + Synopsys
Continuing Close Collaboration To Benefit Our Customers
• Complete implementation & static verification flows
• Utilizing the most advanced Synopsys technologies
• Providing maximum performance & minimum power on ARM’s next generation processors

Thank You
