
Best Practices for High-Performance, Energy-Efficient Implementations of Arm® Cortex®-A75/-A55 Processors
In 16-nanometer FinFET Compact (16FFC) Process Technology
Using Synopsys Design Platform
Joe Walston, Synopsys
19 October 2017
© 2017 Synopsys, Inc. 1

Agenda
Arm/Synopsys Collaboration
Performance / Power Optimized Synopsys Design Reference Implementation
Maximizing Performance/Minimizing Power
Summary and Next Steps

Arm + Synopsys
Rich History of Collaboration

The New Threshold for Market Leadership
• 2010: 1.0 GHz in 40nm
• 2013: 2.0+ GHz in 28nm
• 2016: 3.0+ GHz in 16nm
• 2017: 3.5+ GHz in 7nm

Partnering for Arm Powered® Products
DesignWare® IP
• For Arm AMBA® Interconnect, I/F IP
• coreTools for assembly, MBIST
• High Performance Standard Cells, Fast Memories
• IP Designer for ASIPs
Optimized Implementation
• Synopsys Design Tools
• Reference Implementations (RIs)
• High Performance Core Centers
• Low Power Methodology
• Lynx Design System
Verification
• VCS® RTL Verification & Verdi® Debug
• SpyGlass® RTL Signoff
• VIP for AMBA Interconnect
• LP Verification Methodology
• ZeBu® HW-Assisted Verification
– Hybrid Emulation
– AMBA Transactors
Physical Prototypes
• HAPS® Support for Arm Cores
• Hybrid Prototypes
• Connection to Juno Arm Development Platform
• AMBA Transactors
• Prototyping Methodology
Virtual Prototypes
• Virtualizer™ Tool Set & Development Kits (VDKs) for Pre-RTL SW Dev.
• Using Arm Fast Models (v7 & v8)
• Arch. Design with Platform Architect
• Coverity & Defensics SW Signoff
Scope: Optimized HW Design & Implementation, Verification, System SW Dev., Software Stack Validation & HW/SW Integration
Source: synopsys.com/Arm v170901


Collaboration Enables Early Adopters
of Arm’s Latest IP
• Early adopters have already taped out using Synopsys’ Design & Verification Continuum Platforms
• New QuickStart Implementation Kits (QIKs) add Reference Guide to Reference Implementation
• Synopsys Design Services available
– QuickStart Implementation Service (4-weeks)
– Consultative core-optimization
– Full turnkey core hardening
Arm POP™ IP on 16nm
Need to shorten the CPU implementation design cycle
• POP IP is a comprehensive, fully validated Cortex®-A implementation solution
• Includes Physical IP, floorplans and reference implementation scripts
Need to lower technical and schedule risk
• POP IP is developed and tuned in synergy with RTL over several iterations
• All Physical IP and implementation issues have been identified and solved by EAC date
Need to achieve market-leading PPA
• POP undergoes extensive iterative floorplan exploration and design tuning to deliver market-leading PPA
• Our record in 16nm FinFET technology is a testament to the hard work behind POP development

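How does Arm PDG work w/processor design on POP IP?
[Figure: parallel Arm CPU and POP development timelines with co-development work between them]
• CPU milestones: Pre-ALPHA, ALPHA, BETA, LAC, EAC
• POP milestones: Pre-ALPHA, ALPHA, BETA, EAC
• CPU to POP: requirements/inputs
• POP to CPU: CPU RTL optimization based on POP IP implementation feedback
• POP: floorplan tuning and PPA optimization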

Performance & Power Optimized
Reference Implementation
Synopsys Design Platform

Synopsys Reference Implementation
Complete Implementation & Static Verification Flow for Cortex-A75/-A55 Cores
• Design Compiler Graphical
• IC Compiler II: NDM Library Prep, place_opt, clock_opt, route_opt
• PrimeTime ECO
• Static verification: VC LP, PrimeTime SI, Formality
Note: DFT supported at each step during DC Graphical and IC Compiler II, UPF supported throughout the flow

Included in Synopsys Reference Implementation
Most Advanced Technology from Synopsys Implementation Platform
Meet Timing
• DC Graphical
– Enhanced physical guidance (eSPG)
– Enhanced layer-aware optimization
– Placement pre-clustering
• IC Compiler II
– place_opt CCD
– New global route-based opt.
– CCS receiver cap modeling
– PrimeTime delay calc in route_opt
– Redundant VIA insertion
• PrimeTime ECO
– Path-based analysis (PBA)
– Clock skew ECO
– Physical-aware ECO
Reduce Power
• DC Graphical
– Timing-driven multibit register banking and de-banking
– Physical-aware clock gating
– Low power placement
• IC Compiler II
– Incremental timing-driven multibit register banking and de-banking
– Clock gating optimization
– Low power placement
– High effort leakage flow
• PrimeTime ECO
– Leakage-aware timing ECO

Included in Synopsys Reference Implementation
Arm Cortex-A75 CPU and DynamIQ™ Shared Unit (DSU) Designs and Libraries
Design                Detail
Config                Crypto; NEON & FP Unit
Clock Gating          ~250 architectural ICGs
Process               TSMC 16nm FFC
                      11 Layer +AP w/ routing on M2-M9
Libraries/memories    Arm POP IP for TSMC 16nm FFC
                      SVT/LVT/ULVT C24/C20/C18/C16 for data
                      ULVT C16 for clock
PVT corners (MCMM)    Setup: RC Max: TT/1.0v/85c
                      Setup & Power: TT/0.8v/85c
                      Hold: RC Min: FFGNP/1.05v/125c
DFT Strategy          Scan compression
UPF                   3 power domains with LS and ELS required
[Figure: floorplan showing the DENGINE, CORE, DSIDE and ISIDE modules]

Maximizing Performance/Minimizing Power
on Cortex-A75 Processor
with the Synopsys Design Platform

Cortex-A75 Implementation Careabouts
Require Upfront Analysis of Design Inputs & Flows to Optimize for Best PPA
[Figure: Power / Area / Performance triangle]
• Determine an Optimum Floorplan
• Analyze Library for Balanced Power/Performance/TAT
• Manage congestion, SI & pessimism for convergence

Power: Analyze Library for Balanced Power/Performance/TAT
• OCV
• Multi-Vt & gate-length
• Sequential Multibit Flops

On Chip Variation (OCV)
Signoff and Implementation Derating Methodology
• Derating strategy changes through implementation to signoff
– Manages pessimism
– Reduces area/power impact of GBA-based AOCV
• Methodology does not affect signoff

Derating Strategy
Tool              Cell derating                                  Net derating
DC Graphical      No cell derating                               7%
IC Compiler II    POCV cell derating* (from AOCV 5%, 7%, 10%)    7%
PrimeTime ECO     AOCV cell derating* (5%, 7%, 10%)              7%
* SSG @ 5%, TT 1.0v @ 7%, FFG @ 10%

ICC II Optimization      AOCV-based    POCV-based
Frequency (ICC II)       Baseline      + 8%
Frequency (PrimeTime)    Baseline      + 2%
Utilization              68%           66%
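As a rough illustration of how the derates on this slide combine, the sketch below applies a flat late derate to the cell and net delay of one path. The path delays are made-up numbers, and the flat multiplicative model is a simplification: AOCV and POCV derates are depth-based or statistical, which this sketch does not attempt to model. Only the percentages come from the slide.

```python
# Cell derates per corner and the net derate, taken from the slide.
CELL_DERATE = {"SSG": 0.05, "TT_1.0V": 0.07, "FFG": 0.10}
NET_DERATE = 0.07


def derated_path_delay(cell_delay_ps, net_delay_ps, corner, derate_cells=True):
    """Return a late-derated path delay in ps (flat-derate simplification)."""
    cell_factor = 1.0 + (CELL_DERATE[corner] if derate_cells else 0.0)
    return cell_delay_ps * cell_factor + net_delay_ps * (1.0 + NET_DERATE)


# Hypothetical path: 200 ps of cell delay and 100 ps of net delay.
synthesis_view = derated_path_delay(200.0, 100.0, "TT_1.0V", derate_cells=False)
signoff_view = derated_path_delay(200.0, 100.0, "TT_1.0V")

print(f"Net derating only (DC Graphical style): {synthesis_view:.1f} ps")
print(f"Cell + net derating (signoff style):    {signoff_view:.1f} ps")
```

The gap between the two numbers is the margin that the later stages see and synthesis does not, which is one way to read "derating strategy changes through implementation to signoff".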

Arm Library – Leverage for best CPU QoR
Complexity and Flexibility
• Specific details of Arm 16FFC library
– 3 Vt classes (SVT, LVT, ULVT)
– 4 channel lengths each (16, 18, 20, 24)
– Base, hpk (High-Performance), pmk (Power Management)
– 1, 2 and 4 bit flops
– 3 main types of SB/MB flops (Q, QL, QA)
• Cortex-A75 CPU has very aggressive power targets
– As always - extremely high FMAX target
– But… leakage and dynamic power targets require more than opportunistic power reduction

Design                Detail
Process               TSMC 16nm FFC
                      11 Layer +AP routing on M2-M9
Libraries/memories    Arm POP IP for TSMC 16nm FFC
                      ULVT C24/C20/C18/C16 for data
                      ULVT C16 for clock
                      SVT/LVT for leakage opt.
PVT corners (MCMM)    Setup: RC Max: TT/1.0v/85c
                      Setup & Power: TT/0.8v/85c
                      Hold: RC Min: FFGNP/1.05v/125c
Starting with library analysis makes balancing FMAX and power easier
Constructing Cortex-A75 CPU Power Opt. Flow
That Meets Performance/Power Targets
• Essential to manage
– Vt class availability
– Multibit (MB) banking/de-banking
– Leakage vs. timing vs. dynamic optimization
– Leave headroom (both timing and power) for ECO
• Library impacts all these decisions

Meet Power Target
• DC Graphical
– MB banking/de-banking: 1bit, 2bit, 4bit
– VT selection across 12 vt/channel options
• IC Compiler II
– QL (leakage) vs. Q (std) vs. QA (area) flop selection
– SI TNS Reduction, very congested, clock NDRs
• PrimeTime ECO
– fix_eco_power to meet leakage target, expect 15-20% reduction
Analyzing Library Leakage Power Heat Maps
Example: INV @ TT 0.8V, 85c
Use Leakage Power Heat Map to gain qualitative understanding of library characteristics

Leakage power by drive strength (rows) and Vt type/channel length (columns)
         ULVT                LVT                 SVT
Drive    C16  C18  C20  C24  C16  C18  C20  C24  C16  C18  C20  C24
X20N     858  682  575  409   89   69   56   35    8    5    4    3
X16N     681  542  457  326   71   55   44   28    6    4    3    2
X14N     593  472  398  284   62   48   39   24    6    4    3    2
X12N     504  401  339  243   52   41   33   21    5    3    2    2
X10N     415  331  280  201   43   33   27   17    4    3    2    1
X8N      327  261  221  159   34   26   21   14    3    2    1    1
X7N      259  208  177  129   27   21   17   11    2    2    1    1
X6N      239  191  162  118   25   19   16   10    2    2    1    1
X5N      195  156  133   96   20   16   13    8    2    1    1    1
X4N      151  121  103   75   16   12   10    6    1    1    1    0
X3N      108   87   74   54   11    9    7    5    1    1    0    0
X2N       67   53   45   33    7    5    4    3    1    0    0    0
X1P5N     39   31   26   20    5    4    3    2    0    0    0    0
X1N       28   22   19   14    3    2    2    1    0    0    0    0

Analyzing Library Cell Delay Heat Maps
Example: INV with 50 fF Load @ TT 1.0V, 85c
Use Cell Delay Heat Map to gain qualitative understanding of library characteristics

Cell delay by drive strength (rows) and Vt type/channel length (columns)
         ULVT                LVT                 SVT
Drive    C16  C18  C20  C24  C16  C18  C20  C24  C16  C18  C20  C24
X20N      18   19   19   20   20   21   21   22   23   24   25   26
X16N      19   20   20   21   21   22   23   24   25   26   27   28
X14N      20   21   21   22   22   23   24   25   26   27   28   29
X12N      21   22   22   23   23   24   25   26   27   28   29   31
X10N      22   23   24   25   25   26   26   28   29   30   31   33
X8N       24   25   26   27   27   28   29   31   32   33   34   36
X7N       25   26   27   28   29   30   31   32   34   35   36   39
X6N       28   29   29   31   31   32   33   35   37   38   39   42
X5N       30   31   32   34   34   35   37   39   41   42   43   46
X4N       34   35   36   38   39   40   41   43   45   47   49   52
X3N       41   42   43   45   47   48   50   52   55   57   59   63
X2N       53   54   55   58   60   62   64   67   70   73   76   80
X1P5N     65   66   68   72   74   76   78   82   86   88   92   98
X1N       92   94   97  101  105  108  111  116  123  127  133  140

Library Observations
Recommendations to Minimize Leakage Power
• Overlap between ULVT and LVT
– Consider combination of ULVT C16 and LVT C16 during synthesis and P&R
• Order of magnitude power delta between equivalent drive strengths/channel lengths across VT classes
– LVT use in implementation absolutely necessary for best leakage when targeting ULVT for timing
• SVT power savings mainly in small drive strengths
– SVT probably best for ECO swap/size
• Variation between channel lengths is small
– Channel variations should be reserved for ECO
[Figure: Leakage Heat Map (TT) for ULVT, LVT and SVT]
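The leakage observations can be checked against the INV heat map with a few lines of arithmetic. This is a minimal sketch using three rows copied from the leakage table (same unstated units as the slide); the helper names are illustrative only.

```python
# INV leakage at TT 0.8V, 85c, copied from the heat map.
# Each list holds the C16, C18, C20, C24 values for one Vt class.
LEAKAGE = {
    "X20N": {"ULVT": [858, 682, 575, 409], "LVT": [89, 69, 56, 35]},
    "X8N": {"ULVT": [327, 261, 221, 159], "LVT": [34, 26, 21, 14]},
    "X4N": {"ULVT": [151, 121, 103, 75], "LVT": [16, 12, 10, 6]},
}


def vt_ratio(drive, leaky_vt, quiet_vt, channel=0):
    """Leakage ratio between two Vt classes at the same drive and channel."""
    return LEAKAGE[drive][leaky_vt][channel] / LEAKAGE[drive][quiet_vt][channel]


def channel_ratio(drive, vt):
    """Leakage ratio between the shortest (C16) and longest (C24) channel."""
    c16, _, _, c24 = LEAKAGE[drive][vt]
    return c16 / c24


for drive in LEAKAGE:
    print(
        f"{drive}: ULVT/LVT at C16 = {vt_ratio(drive, 'ULVT', 'LVT'):.1f}x, "
        f"ULVT C16/C24 = {channel_ratio(drive, 'ULVT'):.1f}x"
    )
```

For these rows the Vt step is close to 10x while the full C16-to-C24 channel step is about 2x, which is consistent with using Vt class as the main leakage lever and keeping channel-length swaps for ECO.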

Library Observations
Recommendations to Improve Timing
• Delay variation across Vt and channel smaller than across drive strength
– Many available Vt/channel classes will not have a strong influence on timing early in flow
– Best to focus on a primary class for timing closure based on clock frequency
• Very low drive cells can be load sensitive
– Do not use small drive cells during synthesis and P&R
• Variation between channel lengths reserved for ECO
– Late in flow, swapping can boost timing 1-2ps per Vt/channel swap per datapath stage
[Figure: Cell Delay Heatmap at SSG and TT-OD for ULVT, LVT and SVT]
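The timing observations can be cross-checked the same way. The sketch below uses four rows copied from the INV cell delay heat map (values in the slide's units) and compares the delay spread across the 12 Vt/channel options with the spread across drive strengths; the function names are illustrative.

```python
# INV cell delay with 50 fF load at TT 1.0V, 85c, copied from the heat map.
# Column order: ULVT C16..C24, LVT C16..C24, SVT C16..C24.
DELAY = {
    "X20N": [18, 19, 19, 20, 20, 21, 21, 22, 23, 24, 25, 26],
    "X8N": [24, 25, 26, 27, 27, 28, 29, 31, 32, 33, 34, 36],
    "X2N": [53, 54, 55, 58, 60, 62, 64, 67, 70, 73, 76, 80],
    "X1N": [92, 94, 97, 101, 105, 108, 111, 116, 123, 127, 133, 140],
}


def spread_across_vt_channel(drive):
    """Delay range over all 12 Vt/channel options at one drive strength."""
    row = DELAY[drive]
    return max(row) - min(row)


def spread_across_drive(column):
    """Delay range over the listed drive strengths for one Vt/channel column."""
    values = [row[column] for row in DELAY.values()]
    return max(values) - min(values)


def average_swap_step(drive):
    """Average delay change per adjacent Vt/channel swap at one drive strength."""
    row = DELAY[drive]
    return (row[-1] - row[0]) / (len(row) - 1)


print("X8N spread across Vt/channel:", spread_across_vt_channel("X8N"))
print("ULVT C16 spread across drive:", spread_across_drive(0))
print(f"X8N average step per swap: {average_swap_step('X8N'):.2f}")
```

At X8N the whole Vt/channel range is 12 units, while drive strength alone moves ULVT C16 by 74 units between X20N and X1N, and the average step per adjacent swap at X8N is a little over 1 unit.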

Optimizing Leakage Power
Cortex-A75 Implementation Flow - Vt Class Trials
• 12 Vt and channel length variations in library
• Not using SVT for implementation, saving it for ECO
– Capacitance sensitivity outweighs possible leakage gains in DC Graphical/IC Compiler II
• Choosing Vt classes to use during flow can impact final leakage power

Leakage Power  ULVT                LVT                 SVT
% Final        C16  C18  C20  C24  C16  C18  C20  C24  C16  C18  C20  C24
144%           54%   6%   4%   3%   6%  15%   3%   0%   1%   2%   5%   1%
100%           16%   2%   1%   1%  40%   5%   3%   2%   3%  15%  12%   1%
118%           15%   8%   7%  33%   5%   5%   3%   9%   3%   1%   3%   7%
114%           13%   4%   4%   3%  12%   5%   5%  23%   3%   5%   5%  17%
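To compare the four trials at the Vt-class level, the per-column percentages can be summed by class. This is a small sketch over the table values; the slide does not say whether the percentages are by cell count or area, so the sketch treats them only as "share of the mix".

```python
VT_CLASSES = ("ULVT", "LVT", "SVT")

# (final leakage in % of best trial, mix in % for C16, C18, C20, C24 per Vt class)
TRIALS = [
    (144, [54, 6, 4, 3, 6, 15, 3, 0, 1, 2, 5, 1]),
    (100, [16, 2, 1, 1, 40, 5, 3, 2, 3, 15, 12, 1]),
    (118, [15, 8, 7, 33, 5, 5, 3, 9, 3, 1, 3, 7]),
    (114, [13, 4, 4, 3, 12, 5, 5, 23, 3, 5, 5, 17]),
]


def share_by_vt(mix):
    """Sum the four channel-length columns of each Vt class."""
    return {vt: sum(mix[4 * i : 4 * i + 4]) for i, vt in enumerate(VT_CLASSES)}


for leakage, mix in sorted(TRIALS, key=lambda trial: trial[0]):
    print(f"{leakage}% leakage -> {share_by_vt(mix)}")

best_leakage, best_mix = min(TRIALS, key=lambda trial: trial[0])
```

The lowest-leakage trial ends with about 20% of the mix in ULVT, while the highest-leakage trial ends with about 67% in ULVT.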
Vt Distribution Through Cortex-A75 Full Flow
Vt Class/Channel Mix Changes As Implementation Progresses
• DC Graphical: In DC, datapath delay is prioritized. Faster cells are used
• ICC II route_opt: Pessimism reduces through the flow and CTS brings in useful skew, cells are swapped for power
• PrimeTime ECO: ECO brings in 4 new SVT classes and does positive slack recovery
[Figure: Vt/channel distribution (ULVT, LVT, SVT at c16, c18, c20, c24) at each of the three stages]
ECO Objectives for Power Optimization
Cortex-A75 CPU
• PrimeTime ECO used to lower power
– AOCV with PBA increases positive slack for downsizing
– Open up all 12 Vt/channel classes to ECO
• Timing ECO makes minor improvements on remaining critical paths
• Power ECO reduces leakage without impacting timing
– Uses footprint-compatible Vt/channel swaps
• Typically see 20-30% leakage reduction
ECO flow: ICC II route_opt (ULVT/LVT) -> StarRC -> PrimeTime SI/PX -> fix_eco_timing, fix_eco_power –pattern (ULVT/LVT/SVT) -> ICC II ECO -> StarRC -> PrimeTime SI/PX
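The idea behind a footprint-compatible power swap can be sketched for a single cell: among the 12 Vt/channel variants, pick the lowest-leakage one whose extra delay still fits in the cell's positive slack. This is only an illustration using the INV X8N values from the heat maps; it is not how fix_eco_power is implemented, it ignores the rest of the timing graph, and it mixes delay at TT 1.0V with leakage at TT 0.8V because those are the corners the slides show. Treating the delay values and the slack as being in the same unit is an assumption.

```python
VARIANTS = [
    f"{vt}_{channel}"
    for vt in ("ULVT", "LVT", "SVT")
    for channel in ("C16", "C18", "C20", "C24")
]
# INV X8N rows from the cell delay and leakage heat maps, in VARIANTS order.
DELAY = [24, 25, 26, 27, 27, 28, 29, 31, 32, 33, 34, 36]
LEAKAGE = [327, 261, 221, 159, 34, 26, 21, 14, 3, 2, 1, 1]


def pick_swap(current_index, slack):
    """Return the index of the lowest-leakage variant that fits in the slack."""
    if slack <= 0:
        return current_index  # no positive slack to trade for leakage
    budget = DELAY[current_index] + slack
    candidates = [i for i in range(len(VARIANTS)) if DELAY[i] <= budget]
    # Prefer lower leakage; break ties with the faster variant.
    return min(candidates, key=lambda i: (LEAKAGE[i], DELAY[i]))


for slack in (0, 4, 12):
    chosen = pick_swap(0, slack)
    print(f"slack {slack}: ULVT_C16 -> {VARIANTS[chosen]} (leakage {LEAKAGE[chosen]})")
```

With 4 units of slack the sketch moves an ULVT C16 inverter to LVT C18, which already removes most of the leakage; the SVT variants only become reachable with much more slack.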

Dynamic Power - Flop Mapping & Banking
Considerations
• Arm 16FFC library has many choices for flop mapping
– Q, QA (low area), QL (low power)
– 1-bit, 2-bit, 4-bit
• Essential to consider
– Which flops to use
– Allowed stage
– At what stage(s) to bank/de-bank flops
– De-banking criteria
– Banking exclusion list
[Figure: flops in library: QA (low area), Q (base) and QL (low power) families, each in single-bit, 2-bit and 4-bit versions]

Dynamic Power - Flop Mapping & Banking
Recommendations: Achieved 18% Total Power Savings on Cortex-A75 CPU
Flop Mapping
• Experimented with different flavors of flops
– QA, QL and QA + QL
• Observations
– QA is smaller but has larger dynamic power
– QL has lower leakage power
– QA + QL combination best for overall power
13% total power savings
Multibit Banking
• Experimented with different strategies
– 2-bit and 4-bit flops
– Physically Aware Multibit Banking (PAMB)
– Automated critical path de-banking
• Observations
– 4-bit reduces power, creates WNS instability
– PAMB improves power, increases TNS
– De-banking reduces PAMB TNS impact
4-bit: 3% additional total power savings
PAMB/de-bank: 2% total power savings
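A toy version of physically aware banking with critical-path exclusion is sketched below. It is not the algorithm used by DC Graphical or IC Compiler II; it only illustrates the two rules the slide names: bank flops that are physically close and share a clock gate, and keep timing-critical flops single-bit. The `Flop` record, the distance threshold and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Flop:
    name: str
    x: float
    y: float
    clock_gate: str
    critical: bool = False


def bank_pairs(flops, max_distance):
    """Greedily pair neighbouring flops on the same clock gate into 2-bit banks."""
    banked, single = [], []
    ordered = sorted(flops, key=lambda f: (f.clock_gate, f.x, f.y))
    for _, group in groupby(ordered, key=lambda f: f.clock_gate):
        pending = None
        for flop in group:
            if flop.critical:
                single.append(flop.name)  # critical flops stay de-banked
                continue
            if pending is None:
                pending = flop
                continue
            distance = abs(flop.x - pending.x) + abs(flop.y - pending.y)
            if distance <= max_distance:
                banked.append((pending.name, flop.name))
                pending = None
            else:
                single.append(pending.name)
                pending = flop
        if pending is not None:
            single.append(pending.name)
    return banked, single


flops = [
    Flop("a", 0, 0, "icg1"),
    Flop("b", 1, 0, "icg1"),
    Flop("c", 50, 0, "icg1"),
    Flop("d", 2, 0, "icg1", critical=True),
    Flop("e", 0, 0, "icg2"),
]
banked, single = bank_pairs(flops, max_distance=5)
print("banked:", banked)
print("single:", single)
```

In this example only `a` and `b` are banked: `c` is too far away, `d` is excluded as critical, and `e` sits behind a different clock gate.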

Dynamic Power UPF
Cortex-A75 CPU Implementation & Static Verification Flow – Synopsys RI
• Arm successive refinement UPF
– Constraints, configuration and implementation UPF
• Gas station voltage areas for selected signals
– Ensures optimum clock tree for CPU modules
• Multibit LS and ELS insertion
– 2-bit wide and 4-bit wide
• Header switch control and acknowledge implementation
– Hammer & Trickle Headers
– Daisy chain and HFS connections
• Fine grain switched ground RAM support
– Proper always-on TIE cells, automated level shifter, isolation cell insertion
Includes many of the newest UPF capabilities and methodologies

Multibit LS/ELS insertion
Reduced Area Leads to Less Congestion on Cortex-A75 CPU
                  Single-bit LS/ELS    Single/multi-bit LS/ELS
LS/ELS count      3457                 886
LS/ELS Area       12,815 µm²           4,920 µm² (61% less)
LS/ELS Leakage    0.77 mW              0.85 mW (13% more)
[Figure: placement of LS, ELS, Mbit LS and Mbit ELS cells before and after]

Summary: Cortex-A75 CPU Power Opt. Flow
That Meets Performance/Power Targets
DC Graphical
• OCV: 7% net
• Vt classes: 2
• MCMM scenarios: TT-1.0V
• Flop families: Q, QL, QA
• Multibit flops: 1-bit, 2-bit, 4-bit
• Optimization targets: Timing, Area, Dynamic
• Banking/de-banking flow: RTL inferencing; physical-aware critical path de-banking
IC Compiler II (place_opt, clock_opt, route_opt)
• OCV: POCV cell, 7% net
• Vt classes: 2 and 8
• MCMM scenarios: TT-1.0V, TT-0.8V, FFG
• Flop families: Q, QL, QA
• Multibit flops: 1-bit, 2-bit, 4-bit
• Optimization targets: Timing, Area, Leakage, Dynamic
• Banking/de-banking flow: physical-aware critical path de-banking
PrimeTime ECO
• OCV: AOCV cell, 7% net
• Vt classes: 12
• MCMM scenarios: TT-1.0V, TT-0.8V, FFG
• Flop families: Q, QL, QA
• Multibit flops: 1-bit, 2-bit, 4-bit
• Optimization targets: Timing, Area, Leakage

Area: Determine an Optimum Floorplan
• Bounds
• Placement Controls
• Data Flow Analysis
• Macro Placement

Floorplanning
Key Element in Arm CPU Development
• Arm and Synopsys have collaborated on Arm CPUs for many generations
• Floorplanning is always part of achieving best QoR for Arm CPUs
• Macro placement, bounds and blockages impact both timing and power
[Figure: Cortex-A57, Cortex-A72 and Cortex-A73 floorplans illustrating bounds, magnet placement and blockage; sources: SV SNUG 2015, SV SNUG 2016]

The Last 10% Is Always The Most Challenging
• Arm CPU projects typically get to 90% of the performance target quickly
• Achieving the last 10% consumes most of the schedule
• Extensive analysis shows performance is impacted by placement
• Block floorplan exploration can provide performance breakthrough
[Figure: Cortex-A75 place_opt FMAX Progression; y-axis FMAX (% target) from 0.5 to 1, x-axis Project Duration with phase labels Prep, Development and Data Flow; regions "Ramp up to 90%" and "The last 10%"]

Early Placement and Bounds Trials
Manual Bounds/Blockages to Balance Timing vs. Routability QoR
• Techniques to drive QoR through placement on Cortex-A75 CPU
– Hard macro placement
– Placement bounds (fixed and floating)
– Typically focused on top-level modules
– Some analysis at first level down
– Add blockages to control channel density
• TNS reduced but QoR gains were offset by SI effects
[Figure: module placement of DENGINE, DSIDE, CORE and ISIDE]
FMAX limited by critical paths to/from CORE module

QoR Challenges in Placement
Analysis of Critical Module Timing on Cortex-A75 CPU
• Most critical paths to/from CORE sub-module
– CORE connects heavily to DSIDE and DENGINE
– Critical paths seen throughout the flow (challenging to fix downstream)
• CORE being “pushed” out of center of core area and near IOs
• Created FMAX-limited paths due to long-path buffering across block
[Figure: module placement of DENGINE, DSIDE, ISIDE and CORE]

Controlling Standard Cell Placement
Look For Possibility Of Bounds To Improve Cortex-A75 CPU QoR
• Experimented with bounds and blockages to improve QoR
– Based on module placement analysis
• Looking for changes in CORE
• Most critical paths between CORE and DSIDE
• CORE was being pushed away from DSIDE and DENGINE
• Needed to find block floorplan that allowed CORE to be placed in center
[Figure: placement of CORE, DSIDE and DENGINE]
Desired movement not possible with given macro placement

DFA for Cortex-A75 CPU Block Floorplanning
Enables Faster Convergence
• Use DFA (Data Flow Analysis) in ICC II to drive block-level QoR improvements through CPU floorplan changes
• Deep analysis of fast connectivity-based placement and inter-module flylines
[Figure: connectivity-based placement and flylines for DENGINE, DSIDE, ISIDE and CORE]
Some clear patterns emerge that lead to new macro placement conclusions
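The kind of inter-module connectivity summary that a data flow analysis produces can be approximated by counting the nets shared between module pairs. The sketch below is not the ICC II DFA feature; the net list is invented purely for illustration, and only the module names come from the slides.

```python
from collections import Counter
from itertools import combinations


def module_connectivity(nets):
    """Count how many nets connect each pair of modules."""
    pairs = Counter()
    for modules in nets:
        for a, b in combinations(sorted(set(modules)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical nets, each listed by the modules it touches.
nets = (
    [("CORE", "DSIDE")] * 5
    + [("CORE", "DENGINE")] * 3
    + [("CORE", "ISIDE")] * 2
    + [("DSIDE", "DENGINE")]
)

connectivity = module_connectivity(nets)
for (a, b), count in connectivity.most_common():
    print(f"{a} <-> {b}: {count} nets")
```

Ranking module pairs this way points at which modules want to sit next to each other, which is the question the floorplan changes on the following slides answer for CORE and DSIDE.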
Cortex-A75 CPU Floorplan Changes
Move RAMs To Guide Module Placement
• Move CORE module back into the center
[Figure: before and after floorplans showing DENGINE, CORE, DSIDE and ISIDE]
Floorplan changes that allowed CORE to float to the center and close to DSIDE
CORE Results @ route_opt with New Floorplan
Cortex-A75 CPU
• Updated floorplan produced much better FMAX, TNS and leakage
• Critical path analysis of paths within, to and from CORE module shows much better timing

Path Group         WNS (ps)    TNS (ns)
CORE Reg to Reg    -17         -3
Out of CORE        -16         -1
In to CORE         -17         -1
CORE ICGs          -18         -1
[Figure: module placement of DENGINE, CORE, DSIDE and ISIDE with the new floorplan]
Block level floorplanning is key to improving QoR

Floorplan/Placement Refinement
Summary
• Analysis-driven floorplan changes boosted FMAX and reduced power
• Minor floorplan modifications can further improve QoR
• Block-level floorplanning is a powerful and necessary tool for Cortex-A75 CPU QoR improvements
[Figure: Cortex-A75 FMAX Progression with Synopsys RI; y-axis FMAX (% target) from 0.5 to 1; regions "Ramp up to 90%" and "The last 10%"]
Achieved goal to boost FMAX, reduce TNS and total power


Performance: Manage Crosstalk and Optimization for Best Frequency
• Crosstalk Optimization
• Concurrent Clock & Data (CCD)
• Global Route-based Optimization
• PrimeTime Delaycalc

Crosstalk: An Ongoing Challenge on Arm Cores
• Lower geometry processes always have a crosstalk component
• Arm CPUs have traditional SI prevention
– Clock NDRs
– Congestion-aware placement
– Logic and density controls
• The Cortex-A73 flow used NDRs to dramatically reduce crosstalk
• We have used all these techniques plus more on the Cortex-A75 CPU
[Figure source: SV SNUG 2016]

Early Results - Cortex-A75 CPU
Crosstalk Issues Already Visible
• When starting the Cortex-A75 CPU flow, used best practices from Cortex-A73 RI
– 2W/4S clock NDRs (tapered on sinks)
– Crosstalk threshold noise ratio of 20%
– Congestion optimization for placer settings
– 80ps max_transition limit
• Early results: crosstalk still an issue
– Large TNS increase at route_auto stage
– Leakage increase at route_opt stage
[Figure: First Full Flow Trial; TNS, WNS and Leakage (%) at place_opt, clock_opt, route_auto and route_opt]
Need more to address crosstalk

Crosstalk and Placement
Why Placement Matters For Crosstalk
• In Arm CPUs placement and crosstalk are interrelated
• Separating two modules with a lot of timing-critical interconnect causes a channel of high, unidirectional routing density
• Addressing this connectivity helped with crosstalk optimization
[Figure: Critical Path Corridor in Cortex-A75 CPU; area of very high connectivity between CORE and DSIDE modules; color scale: decreasing net crosstalk delta delay]

Crosstalk and Congestion in Cortex-A75 CPU
Related, But - Fixing Congestion Might not Fix Crosstalk
• Congestion: localized, over-capacity problem
– Congestion is indicated by >100% utilized gcells, best addressed by routing changes
• Crosstalk: large-area, at-capacity problem
– Crosstalk is indicated by high local routing density, best addressed with placement changes
[Figure: Congestion Map: Gcells >100% (left) and Congestion Map: Gcell = 100% (right)]
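The distinction between the two maps can be expressed as a simple classification of gcell utilization: cells above 100% are congestion hot spots, while cells sitting exactly at 100% mark the dense, at-capacity regions where crosstalk tends to appear. The grid below is a made-up example, and the helper is a sketch rather than anything exported by the router.

```python
def classify_gcells(utilization_pct):
    """Split gcell coordinates into over-capacity and at-capacity lists."""
    over_capacity, at_capacity = [], []
    for row_index, row in enumerate(utilization_pct):
        for col_index, pct in enumerate(row):
            if pct > 100:
                over_capacity.append((row_index, col_index))
            elif pct == 100:
                at_capacity.append((row_index, col_index))
    return over_capacity, at_capacity


# Hypothetical routing utilization per gcell, in percent.
grid = [
    [80, 100, 100, 100],
    [75, 100, 100, 100],
    [60, 90, 115, 95],
]

over_capacity, at_capacity = classify_gcells(grid)
print("congestion (over capacity):", over_capacity)
print("crosstalk risk (at capacity):", at_capacity)
```

In this grid there is a single overflowing gcell but a six-gcell block running at capacity, which mirrors the slide's point that removing the overflow would not touch the region most exposed to crosstalk.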

Long Path Crosstalk in DSU
Reduced with Channel Blockages on CPU Interface Paths
• DSU has long-path crosstalk challenges
• Connections to CPUs can cluster together and cause timing problems
• Use percentage placement blockages to spread out pipeline flops
[Figure: Blocked Channel % at 0%, 60%, 80% and 90%]

Results – Cortex-A75 CPU
Dramatic FMAX & Power Improvements
New optimization solutions deployed in addition to Cortex-A73 techniques
[Figure: Before and After maps; color scale: decreasing net crosstalk delta delay]

Leveraging New CCD Capabilities
Place_opt CCD and Power Aware CCD
[Figure, place_opt CCD: "place_opt with data only: higher area & power" with labels -100ps and 200ps; after "Apply useful skew", "place_opt w/ useful skew: lower area & power" with labels 50ps and 50ps and a CLK delay of 150ps]
[Figure, Power Aware CCD: size up clock buf, size down/swap LVT, datapath area/power recovery; labels 90ps and 10ps]

Cortex-A75 CPU               WNS     TNS     Leakage Power
place_opt + route_opt        -23     -95     100%
place_opt CCD + route_opt    -20     -68     98%

Cortex-A75 CPU               WNS     TNS     Leakage Power
route_opt (Baseline)         -112    -134    100%
Power CCD + route_opt        -99     -126    99%
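The useful-skew numbers in the figure can be reproduced with a one-line balance calculation. The sketch assumes that -100ps and 200ps are the setup slacks of the paths arriving at and leaving one register, which is one plausible reading of the figure rather than something the slide states. Hold timing and the power side of CCD are ignored.

```python
def balance_slack(slack_in_ps, slack_out_ps):
    """Return (clock_shift, new_slack_in, new_slack_out) for one register.

    A positive shift delays the register clock: the path ending at the
    register gains that much setup slack and the path starting at it loses
    the same amount.
    """
    shift = (slack_out_ps - slack_in_ps) / 2
    return shift, slack_in_ps + shift, slack_out_ps - shift


shift, slack_in, slack_out = balance_slack(-100, 200)
print(f"delay clock by {shift:.0f}ps -> slacks {slack_in:.0f}ps / {slack_out:.0f}ps")
```

Under that assumption the balance point is a 150ps clock delay with 50ps of slack on both sides, matching the labels in the figure. Because the violation is fixed by the clock rather than by upsizing datapath cells, the datapath can stay smaller, which is the "lower area & power" outcome the slide describes.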
Increased Correlation Through the Flow
Routing and Timing Correlation on Cortex-A75 CPU
Global Route-based Optimization
• Timing-driven routing
• Re-routing, Re-buffering using global route parasitics
• On-route global route-based re-buffering

Metric     Traditional Flow    GR-based Opt.
R2R WNS    -29ps               -15ps
R2R TNS    -1ns                -1ns

Accurate CCS Receiver Model
• Used in place_opt, clock_opt and route_opt
• Improved correlation to signoff delay calculation

@ Signoff    AWP Delay Model    CCS Receiver Cap Model
WNS          -60ps              -49ps
TNS          -11ns              -2ns
Leakage      100%               96%

Route_opt Based on PrimeTime Signoff Timer
• Route_opt able to see (and fix) timing issues easing burden on PT ECO

Delay calculation used    Step      WNS    TNS
ICC II Delaycalc          PT ECO    -20    -8
PrimeTime Delaycalc       PT ECO    -12    0

Multisource CTS (MS-CTS) + HTree
Benefiting Performance / Power on Cortex-A75 CPU
• Performance Challenge
– Significant OCV penalty associated with longer clock insertion delay
• Power Challenge
– Clock tree can consume up to 35% total power in typical CPU
– Most power consumption is at leaf level drivers & flop clocks
• MS-CTS useful to address both challenges
[Figure: CLK driving a pre-mesh tree and pre-mesh drivers into a clock mesh that feeds ICGs, flops and RAMs]

Flow/Metrics     MS-CTS
Global Skew      53% smaller
Clock Levels     40% fewer
Dynamic Power    similar

Summary: Cortex-A75 Implementation Careabouts
Upfront Analysis of Design Inputs & Flows to Optimize for Best PPA
• Power: OCV; Multi-Vt & gate-length; Sequential Multibit Flops
• Area: Bounds; Placement Controls; Data Flow Analysis; Macro Placement
• Performance (Manage Crosstalk and Optimization for Best Frequency): Crosstalk Optimization; Concurrent Clock & Data (CCD); Global Route-based Optimization; PrimeTime Delaycalc

Summary and Next Steps

Results on Cortex-A75 CPU
Achieved Target PPA with Convergent, Repeatable Flow
[Figure: Cortex-A75 CPU PPA; TNS (ns, axis -150 to 150) and FMAX and Leakage (% of Final, axis 0% to 200%) across implementation stages DCG, place_opt, clock_opt, route_opt and Signoff ECO]
DC Graphical + IC Compiler II + PrimeTime ECO

Results on 8-core DSU
Customized Hierarchical Flow to meet PPA
[Figure: DSU PPA; TNS (ns, axis -200 to 200) and FMAX and Leakage (% of Final, axis 0% to 200%) across implementation stages DCG, place_opt, clock_opt, route_opt and Signoff ECO]
DC Graphical + IC Compiler II + PrimeTime ECO

Summary – A75 and A55
Synopsys Reference Implementations (RIs)
for Cortex-A75/-A55 are ready
• CPU and DSU flows

• TSMC 16nm FFC process
• Arm POP™ IP – core optimized
standard cells & fast cache RAMs
• Complete implementation and static
verification flows
Contact your Synopsys AC for additional information

RIs available on SolvNet (solvnet.synopsys.com/Arm-RI)
Summary – 8-Core DSU
Synopsys Reference Implementations (RIs) ready
for 8-core DSU with 4 A75s and 4 A55s
• 8-core DSU includes 4 x A75s and 4 x A55s as configured in the RI
• ETM-based hierarchical Reference Implementation flow
• Includes 2 MB L3 cache
• Complete implementation and static verification flows
[Figure: 8-core DSU arrangement, two rows of A75 A55 A55 A75]
Contact your Synopsys AC for additional information regarding the QIK availability for the 8-core DSU
Conclusion: Arm + Synopsys
Continuing Close Collaboration To Benefit Our Customers
Complete implementation & static verification flows
Utilizing the most advanced Synopsys technologies
Providing maximum performance & minimum power on Arm’s Cortex-A75/-A55 processors

Thank You
