
Best Practices for High-Performance, Energy-Efficient Implementations of the Latest ARM® Processors
In 16-nanometer FinFET Compact (16FFC) Process Technology
Using Synopsys Galaxy™ Design Platform

Vidit Babbar, ARM
Joe Walston, Synopsys

22nd March, 2017

SNUG 2017 1
Agenda

ARM/Synopsys Collaboration

Performance / Power Optimized Galaxy Design Reference Implementation

Maximizing Performance/Minimizing Power

Summary and Next Steps

SNUG 2017 © Copyright Synopsys 2017, All Rights Reserved 2


ARM + Synopsys
Rich History of Collaboration

The New Threshold for Market Leadership
• 2010: 1.0 GHz, 40nm
• 2013: 2.0+ GHz, 28nm+
• 2016: 3.0+ GHz, 16nm
• 2017: 3.5+ GHz, 7nm

ARM POP™ IP on 16nm

Need to shorten the CPU design cycle
• POP IP is a comprehensive, fully validated Cortex®-A CPU implementation solution
• Includes Physical IP, floorplans and reference implementation scripts

Need to lower technical and schedule risk
• POP IP is developed and tuned in synergy with RTL over several iterations
• All Physical IP and implementation issues have been identified and solved by EAC date

Need to achieve market-leading PPA
• POP undergoes extensive iterative floorplan exploration and design tuning to deliver market-leading PPA
• Our record in 16nm FinFET technology is a testament to the hard work behind POP development

How does PDG work with processor design on POP IP?

• CPU track (Pre-ALPHA, ALPHA, BETA, LAC, EAC): CPU RTL optimization based on POP IP implementation feedback
• POP track (Pre-ALPHA, ALPHA, BETA, EAC): Floorplan tuning and PPA optimization
• Diagram legend: Requirements/Inputs; Co-development work

For more information on POP IP, check out "The Faster, Better, and Shorter ICC II – How It’s Done with ARM POP IP for Cortex-A73", presented by Rupal Gandhi, Thursday, March 23, @ 11:15am, Hall A2

Partnering for ARM Powered® Products

• DesignWare® IP: For ARM AMBA® Interconnect, I/F IP, coreTools for assembly, MBIST, High Performance Standard Cells, Fast Memories
• Optimized Implementation: Galaxy™ Tools, IP Designer for ASIPs, Reference Implementations (RIs), High Performance Core Centers, Low Power Methodology, Lynx Design System
• Verification: VCS® RTL Verification & Verdi® Debug, SpyGlass® RTL Signoff, VIP for AMBA Interconnect, LP Verification Methodology, ZeBu® HW-Assisted Verification − Hybrid Emulation, AMBA Transactors
• Physical Prototypes: HAPS® Support for ARM Cores, Connection to Juno ARM Development Platform, Prototyping Methodology
• Hybrid Prototypes: AMBA Transactors
• Virtual Prototypes: Virtualizer™ Tool Set & Development Kits (VDKs) for Pre-RTL SW Dev. Using ARM Fast Models (v7 & v8), Arch. Design with Platform Architect, Coverity & Defensics SW Signoff

Stages shown in the diagram: Optimized HW Design & Implementation; Verification; System Validation & HW/SW Integration; SW Dev. Software Stack

synopsys.com/ARM


Performance & Power Optimized
Reference Implementation
Galaxy Design Platform

Synopsys Reference Implementation
Complete Implementation & Static Verification Flow for Next Generation CPUs

• Flow: Design Compiler Graphical → IC Compiler II (NDM Library Prep, place_opt, clock_opt, route_opt) → PrimeTime ECO
• Also in the flow: VC LP, PrimeTime SI, Formality

Note: DFT supported at each step during DC Graphical and IC Compiler II, UPF supported throughout the flow

Included in Synopsys Reference Implementation
Most Advanced Technology from Synopsys Implementation Platform

DC Graphical
• Meet Timing: Enhanced physical guidance (eSPG); Enhanced layer-aware optimization; Placement pre-clustering
• Reduce Power: Timing-driven multibit register banking and de-banking; Physical-aware clock gating; Low power placement

IC Compiler II
• Meet Timing: place_opt CCD; New global route-based opt.; CCS receiver cap modeling; PrimeTime delay calc in route_opt; Redundant VIA insertion
• Reduce Power: Incremental timing-driven multibit register banking and de-banking; Clock gating optimization; Low power placement; High effort leakage flow

PrimeTime ECO
• Meet Timing: Path-based analysis (PBA); Clock skew ECO; Physical-aware ECO
• Reduce Power: Leakage-aware timing ECO


ARM DynamIQ™ – Multicore redefined

• New single cluster design
• Greater flexibility with or without big.LITTLE
• Redesigned memory sub-system
• Advanced compute capabilities

https://community.arm.com/processors/b/blog/posts/arm-dynamiq-expanding-the-possibilities-for-artificial-intelligence

Included in Synopsys Reference Implementation
ARM Next Generation CPU and NonCPU Designs and Libraries

Design Detail
• Config: Crypto; NEON & FP Unit
• Clock Gating: ~250 architectural ICGs
• Process: TSMC 16nm FFC, 11 Layer +AP w/ routing on M2-M9
• Libraries/memories: ARM POP IP for TSMC 16nm FFC
– SVT/LVT/ULVT C24/C20/C18/C16 for data
– ULVT C16 for clock
• MCMM PVT corners
– Setup: RC Max: TT/1.0v/85c
– Setup & Power: TT/0.8v/85c
– Hold: RC Min: FFGNP/1.05v/125c
• DFT Strategy: Scan compression
• UPF: 3 power domains with LS and ELS required

Floorplan modules shown: DENGINE, CORE, DSIDE, ISIDE


Maximizing Performance/Minimizing Power
on the Next Generation ARM CPU
with the Galaxy Design Platform

Implementation Careabouts

• Power: Analyze Library for Balanced Power/Performance/TAT
• Area: Determine an Optimum Floorplan
• Performance: Manage congestion, SI & pessimism for convergence




Implementation Careabouts
Require Upfront Analysis of Design Inputs & Flows to Optimize for Best PPA

• Power: Analyze Library for Balanced Power/Performance/TAT
– OCV
– Multi-Vt & gate-length
– Sequential Multibit Flops
• Area: Determine an Optimum Floorplan
• Performance: Manage congestion, SI & pessimism for convergence


On Chip Variation (OCV)
Signoff and Implementation Derating Methodology

Derating Strategy
• DC Graphical: No cell derating; 7% net derating
• IC Compiler II: POCV cell derating* (from AOCV 5%, 7%, 10%); 7% net derating
• PrimeTime ECO: AOCV cell derating* (5%, 7%, 10%); 7% net derating
* SSG @ 5%, TT 1.0v @ 7%, FFG @ 10%

• Derating strategy changes through implementation to signoff
– Manages pessimism
– Reduces area/power impact of GBA-based AOCV
• Methodology does not affect signoff

ICC II Optimization      AOCV-based   POCV-based
Frequency (ICC II)       Baseline     + 8%
Frequency (PrimeTime)    Baseline     + 2%
Utilization              68%          66%
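As a rough illustration of what the flat percentages mean, the sketch below treats each derate as a multiplier of (1 + percentage) on late-path delays. This is a simplification and an assumption: real AOCV tables vary the derate with path depth, POCV is statistical, and the stage delays used here are made-up values, not data from this design.

```python
def derated_path_delay(cell_delays_ps, net_delays_ps, cell_derate, net_derate):
    """Late-path delay with flat derates applied as (1 + derate)."""
    cells = sum(d * (1.0 + cell_derate) for d in cell_delays_ps)
    nets = sum(d * (1.0 + net_derate) for d in net_delays_ps)
    return cells + nets


# Hypothetical three-stage path in picoseconds (not taken from the slides).
CELL_DELAYS_PS = [30.0, 25.0, 40.0]
NET_DELAYS_PS = [5.0, 8.0, 4.0]

# DC Graphical entry: no cell derating, 7% net derating.
dcg_delay = derated_path_delay(CELL_DELAYS_PS, NET_DELAYS_PS, 0.00, 0.07)

# TT 1.0v footnote value: 7% cell derating, plus the 7% net derating.
tt_delay = derated_path_delay(CELL_DELAYS_PS, NET_DELAYS_PS, 0.07, 0.07)

print(f"no cell derate: {dcg_delay:.2f} ps, 7% cell derate: {tt_delay:.2f} ps")
```

The gap between the two numbers is the extra margin the optimizer has to close once cell derating is switched on, which is one way to see why the derating choice per stage affects area and power.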

ARM Library – Leverage for best CPU QoR
Complexity and Flexibility
• Specific details of ARM 16FFC library
– 3 Vt classes (SVT, LVT, ULVT)
– 4 channel lengths each (16, 18, 20, 24)
– Base, hpk (High-Performance), pmk (Power Management)
– 1, 2 and 4 bit flops
– 3 main types of SB/MB flops (Q, QL, QA)
• CPU has very aggressive power targets
– As always - extremely high FMAX target
– But… leakage and dynamic power targets require more than opportunistic power reduction

Design Detail
• Process: TSMC 16nm FFC, 11 Layer +AP routing on M2-M9
• Libraries/memories: ARM POP IP for TSMC 16nm FFC
– ULVT C24/C20/C18/C16 for data
– ULVT C16 for clock
– SVT/LVT for leakage opt.
• MCMM PVT corners
– Setup: RC Max: TT/1.0v/85c
– Setup & Power: TT/0.8v/85c
– Hold: RC Min: FFGNP/1.05v/125c

Starting with library analysis makes balancing FMAX and power easier

Constructing CPU Power Optimization Flow
That Meets Performance/Power Targets

Meet Power Target
• DC Graphical
– MB banking/de-banking: 1bit, 2bit, 4bit
– VT selection: across 12 vt/channel options
• IC Compiler II
– QL (leakage) vs. Q (std) vs. QA (area) flop selection
– SI TNS reduction, very congested, clock NDRs
• PrimeTime ECO
– fix_eco_power to meet leakage target, expect 15-20% reduction

• Essential to manage
– Vt class availability
– Multibit (MB) banking/de-banking
– Leakage vs. timing vs. dynamic optimization
– Leave headroom (both timing and power) for ECO
• Library impacts all these decisions

Analyzing Library Leakage Power Heat Maps
Example: INV @ TT 0.8V, 85c
Use Leakage Power Heat Map to gain qualitative understanding of library characteristics

Leakage Power Heat Map (rows: Drive Strength; columns: VT Types/Channel lengths)

Drive    ULVT                   LVT                    SVT
         C16   C18   C20   C24  C16   C18   C20   C24  C16   C18   C20   C24
X20N     858   682   575   409   89    69    56    35    8     5     4     3
X16N     681   542   457   326   71    55    44    28    6     4     3     2
X14N     593   472   398   284   62    48    39    24    6     4     3     2
X12N     504   401   339   243   52    41    33    21    5     3     2     2
X10N     415   331   280   201   43    33    27    17    4     3     2     1
X8N      327   261   221   159   34    26    21    14    3     2     1     1
X7N      259   208   177   129   27    21    17    11    2     2     1     1
X6N      239   191   162   118   25    19    16    10    2     2     1     1
X5N      195   156   133    96   20    16    13     8    2     1     1     1
X4N      151   121   103    75   16    12    10     6    1     1     1     0
X3N      108    87    74    54   11     9     7     5    1     1     0     0
X2N       67    53    45    33    7     5     4     3    1     0     0     0
X1P5N     39    31    26    20    5     4     3     2    0     0     0     0
X1N       28    22    19    14    3     2     2     1    0     0     0     0
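One way to turn the heat map into numbers is to compare leakage ratios between neighbouring Vt classes and between channel lengths. The sketch below uses a few X20N values copied from the table; the slide gives no units, so only ratios are computed, and the helper names are illustrative.

```python
# X20N leakage values copied from the heat map (INV @ TT 0.8V, 85c).
# Units are not stated on the slide, so only ratios are meaningful here.
LEAKAGE_X20N = {
    ("ULVT", "C16"): 858, ("ULVT", "C24"): 409,
    ("LVT", "C16"): 89, ("LVT", "C24"): 35,
    ("SVT", "C16"): 8, ("SVT", "C24"): 3,
}


def leakage_ratio(cell_a, cell_b):
    """How many times more cell_a leaks than cell_b."""
    return LEAKAGE_X20N[cell_a] / LEAKAGE_X20N[cell_b]


# Step between Vt classes at the same channel length.
ulvt_vs_lvt = leakage_ratio(("ULVT", "C16"), ("LVT", "C16"))
lvt_vs_svt = leakage_ratio(("LVT", "C16"), ("SVT", "C16"))

# Step between the shortest and longest channel within one Vt class.
ulvt_c16_vs_c24 = leakage_ratio(("ULVT", "C16"), ("ULVT", "C24"))

print(f"ULVT/LVT: {ulvt_vs_lvt:.1f}x, LVT/SVT: {lvt_vs_svt:.1f}x, "
      f"ULVT C16/C24: {ulvt_c16_vs_c24:.1f}x")
```

For this drive strength, each Vt step changes leakage by roughly an order of magnitude, while the full channel-length range within ULVT changes it by about a factor of two.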

Analyzing Library Cell Delay Heat Maps
Example: INV with 50ff Load @ TT 1.0V, 85c
Use Cell Delay Heat Map to gain qualitative understanding of library characteristics

Cell Delay Heat Map (rows: Drive Strength; columns: VT Types/Channel lengths)

Drive    ULVT                   LVT                    SVT
         C16   C18   C20   C24  C16   C18   C20   C24  C16   C18   C20   C24
X20N      18    19    19    20   20    21    21    22   23    24    25    26
X16N      19    20    20    21   21    22    23    24   25    26    27    28
X14N      20    21    21    22   22    23    24    25   26    27    28    29
X12N      21    22    22    23   23    24    25    26   27    28    29    31
X10N      22    23    24    25   25    26    26    28   29    30    31    33
X8N       24    25    26    27   27    28    29    31   32    33    34    36
X7N       25    26    27    28   29    30    31    32   34    35    36    39
X6N       28    29    29    31   31    32    33    35   37    38    39    42
X5N       30    31    32    34   34    35    37    39   41    42    43    46
X4N       34    35    36    38   39    40    41    43   45    47    49    52
X3N       41    42    43    45   47    48    50    52   55    57    59    63
X2N       53    54    55    58   60    62    64    67   70    73    76    80
X1P5N     65    66    68    72   74    76    78    82   86    88    92    98
X1N       92    94    97   101  105   108   111   116  123   127   133   140
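The same kind of check can be made on the delay table: compare how much delay moves across all Vt/channel options at one drive strength with how much it moves across drive strengths at one Vt/channel option. The values below are copied from the table; the variable names are illustrative and the slide does not state the delay units.

```python
# Delay values copied from the heat map (INV, 50ff load, TT 1.0V, 85c).
# X8N row: fastest option (ULVT C16) and slowest option (SVT C24).
X8N_FASTEST, X8N_SLOWEST = 24, 36

# ULVT C16 column: largest drive (X20N) and smallest drive (X1N).
ULVT_C16_X20N, ULVT_C16_X1N = 18, 92

# Spread across the 12 Vt/channel options at a fixed drive strength.
vt_channel_spread = X8N_SLOWEST / X8N_FASTEST

# Spread across the 14 drive strengths at a fixed Vt/channel option.
drive_spread = ULVT_C16_X1N / ULVT_C16_X20N

print(f"Vt/channel spread: {vt_channel_spread:.2f}x, "
      f"drive-strength spread: {drive_spread:.2f}x")
```

In this sample the drive-strength axis moves delay several times more than the Vt/channel axis does.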

Library Observations
Recommendations to Minimize Leakage Power
• Overlap between ULVT and LVT
– Consider combination of ULVT C16 and LVT C16 during synthesis and P&R
• Order of magnitude power delta between equivalent drive strengths/channel lengths across VT classes
– LVT use in implementation absolutely necessary for best leakage when targeting ULVT for timing
• SVT power savings mainly in small drive strengths
– SVT probably best for ECO swap/size
• Variation between channel lengths is small
– Channel variations should be reserved for ECO

Figure: Leakage Heat Map (TT) across ULVT, LVT and SVT

Library Observations
Recommendations to Improve Timing
• Delay variation across Vt and channel smaller than across drive strength
– Many available Vt/channel classes will not have a strong influence on timing early in flow
– Best to focus on a primary class for timing closure based on clock frequency
• Very low drive cells can be load sensitive
– Do not use small drive cells during synthesis and P&R
• Variation between channel lengths reserved for ECO
– Late in flow, swapping can boost timing 1-2ps per Vt/channel swap per datapath stage

Figure: Cell Delay Heatmaps (SSG and TT-OD) across ULVT, LVT and SVT

Optimizing Leakage Power
Implementation Flow - Vt Class Trials
• 12 Vt and channel length variations in library
• Not using SVT for implementation, saving it for ECO
– Capacitance sensitivity outweighs possible leakage gains in DC Graphical/IC Compiler II
• Choosing Vt classes to use during flow can impact final leakage power

Leakage   % Final
Power     ULVT                   LVT                    SVT
          C16   C18   C20   C24  C16   C18   C20   C24  C16   C18   C20   C24
144%      54%    6%    4%    3%   6%   15%    3%    0%   1%    2%    5%    1%
100%      16%    2%    1%    1%  40%    5%    3%    2%   3%   15%   12%    1%
118%      15%    8%    7%   33%   5%    5%    3%    9%   3%    1%    3%    7%
114%      13%    4%    4%    3%  12%    5%    5%   23%   3%    5%    5%   17%
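A quick sanity check on the trial table is to confirm that each row's mix adds up to about 100% and to line up the ULVT share against the reported leakage. The sketch below copies the four rows as read from the flattened table; the grouping into trials and the variable names are assumptions made for illustration.

```python
# Rows copied from the trial table:
# (final leakage %, ULVT C16..C24, LVT C16..C24, SVT C16..C24).
TRIALS = [
    (144, [54, 6, 4, 3], [6, 15, 3, 0], [1, 2, 5, 1]),
    (100, [16, 2, 1, 1], [40, 5, 3, 2], [3, 15, 12, 1]),
    (118, [15, 8, 7, 33], [5, 5, 3, 9], [3, 1, 3, 7]),
    (114, [13, 4, 4, 3], [12, 5, 5, 23], [3, 5, 5, 17]),
]

# Each row should sum to roughly 100% (the slide values are rounded).
row_totals = [sum(u) + sum(l) + sum(s) for _, u, l, s in TRIALS]

# Share of the final mix that is ULVT, per trial.
ulvt_share = [sum(u) for _, u, _, _ in TRIALS]

# Index of the trial with the lowest reported leakage.
best_trial = min(range(len(TRIALS)), key=lambda i: TRIALS[i][0])

for (leak, _, _, _), total, share in zip(TRIALS, row_totals, ulvt_share):
    print(f"leakage {leak}%: mix total {total}%, ULVT share {share}%")
```

In these four rows, the lowest-leakage trial is also the one with the smallest ULVT share.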

Vt Distribution Through Full Flow
Vt Class/Channel Mix Changes As Implementation Progresses
• DC Graphical: In DC, datapath delay is prioritized; faster cells are used
• ICC II route_opt: Pessimism reduces through the flow and CTS brings in useful skew; cells are swapped for power
• PrimeTime ECO: ECO brings in 4 new SVT classes and does positive slack recovery

Figure: Vt/channel distribution (ULVT, LVT, SVT at c16, c18, c20, c24) for DC Graphical, ICC II route_opt and PrimeTime ECO

ECO Objectives for Power Optimization
• PrimeTime ECO used to lower power
– AOCV with PBA increases positive slack for downsizing
– Open up all 12 Vt/channel classes to ECO
• Timing ECO makes minor improvements on remaining critical paths
• Power ECO reduces leakage without impacting timing
– Uses footprint-compatible Vt/channel swaps
• Typically see 20-30% leakage reduction

ECO flow: ICC II route_opt (ULVT/LVT) → StarRC → PrimeTime SI/PX (fix_eco_timing; fix_eco_power –pattern, ULVT/LVT/SVT) → ICC II ECO → StarRC → PrimeTime SI/PX

Dynamic Power - Flop Mapping & Banking
Considerations
• ARM 16FFC library has many choices for flop mapping
– Q, QA (low area), QL (low power)
– 1-bit, 2-bit, 4-bit
• Essential to consider
– Which flops to use
– Allowed stage flops
– At what stage(s) to bank/de-bank
– De-banking criteria
– Banking exclusion list

Figure: Flops in Library, by family (QA low area, Q base, QL low power) and width (single-bit, 2-bit, 4-bit)

Dynamic Power - Flop Mapping & Banking
Recommendations: Achieved 18% Total Power Savings

Flop Mapping
• Experimented with different flavors of flops
– QA, QL and QA + QL
• Observations
– QA is smaller but has larger dynamic power
– QL has lower leakage power
– QA + QL combination best for overall power
• Result: 13% total power savings

Multibit Banking
• Experimented with different strategies
– 2-bit and 4-bit flops
– Physically Aware Multibit Banking (PAMB)
– Automated critical path de-banking
• Observations
– 4-bit reduces power, creates WNS instability
– PAMB improves power, increases TNS
– De-banking reduces PAMB TNS impact
• Result: 4-bit: 3% additional total power savings; PAMB/de-bank: 2% total power savings
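The 18% headline figure appears to be the sum of the three individual results. A small sketch of that bookkeeping, assuming the savings are simply additive percentages of total power (the slide does not say how they were combined):

```python
# Individual savings quoted on the slide, in percent of total power.
SAVINGS_PCT = {
    "flop mapping (QA + QL)": 13,
    "4-bit banking": 3,
    "PAMB with de-banking": 2,
}

# Assumption: the contributions add linearly.
total_savings_pct = sum(SAVINGS_PCT.values())

print(f"total power savings: {total_savings_pct}%")
```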

Dynamic Power UPF
Implementation and Static Verification Flow – Included in Synopsys RI
ARM successive refinement UPF
• Constraints, configuration and implementation UPF
Gas station voltage areas for selected signals
• Ensures optimum clock tree for CPU modules
Multibit LS and ELS insertion
• 2-bit wide and 4-bit wide
Header switch control and acknowledge implementation
• Hammer & Trickle Headers
• Daisy chain and HFS connections
Fine grain switched ground RAM support
• Proper always-on TIE cells, automated level shifter, isolation cell insertion

Includes many of the newest UPF capabilities and methodologies


Multibit LS/ELS insertion
Reduced Area Leads to Less Congestion

                     Single-bit only    Single/multi-bit
LS/ELS cell count    3457               886
LS/ELS Area          12,815 µm²         4,920 µm² (61% less)
LS/ELS Leakage       0.77 mW            0.85 mW (13% more)

Figure: LS and ELS placement with single-bit cells, and LS, Mbit LS, ELS and Mbit ELS placement with single/multi-bit cells
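The percentage changes can be recomputed from the displayed counts, areas and leakage values. The sketch below does that arithmetic; the function name is illustrative. Computed from the rounded figures shown, the leakage increase comes out at about 10% rather than the quoted 13%, which may simply reflect rounding of the displayed values.

```python
def percent_change(before, after):
    """Signed change from before to after, as a percentage of before."""
    return (after - before) / before * 100.0


# Values as displayed on the slide.
count_change = percent_change(3457, 886)        # LS/ELS cell count
area_change = percent_change(12815.0, 4920.0)   # LS/ELS area
leakage_change = percent_change(0.77, 0.85)     # LS/ELS leakage, mW

print(f"cells: {count_change:.1f}%, area: {area_change:.1f}%, "
      f"leakage: {leakage_change:+.1f}%")
```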

Summary: CPU Power Optimization Flow
That Meets Performance/Power Targets

DC Graphical
• OCV: 7% net
• Vt Classes: 2
• MCMM Scenarios: TT-1.0V
• Flop Families: Q, QL, QA
• Multibit Flops: 1-bit, 2-bit, 4-bit
• Optimization Targets: Timing, Area, Dynamic
• Banking/De-banking Flow: RTL inferencing; Physical-aware critical path de-banking

IC Compiler II (place_opt, clock_opt, route_opt)
• OCV: POCV cell, 7% net
• Vt Classes: 2 (place_opt), 8 (route_opt)
• MCMM Scenarios: TT-1.0V, TT-0.8V, FFG
• Flop Families: Q, QL, QA
• Multibit Flops: 1-bit, 2-bit, 4-bit
• Optimization Targets: Timing, Area, Leakage, Dynamic
• Banking/De-banking Flow: Physical-aware critical path de-banking

PrimeTime ECO
• OCV: AOCV cell, 7% net
• Vt Classes: 12
• MCMM Scenarios: TT-1.0V, TT-0.8V, FFG
• Flop Families: Q, QL, QA
• Multibit Flops: 1-bit, 2-bit, 4-bit
• Optimization Targets: Timing, Area, Leakage


Implementation Careabouts
Require Upfront Analysis of Design Inputs & Flows to Optimize for Best PPA

• Power: Analyze Library for Balanced Power/Performance/TAT
• Area: Determine an Optimum Floorplan
– Placement Controls
– Bounds
– Data Flow Analysis
– Macro Placement
• Performance: Manage congestion, SI & pessimism for convergence


Floorplanning
Key Element in ARM CPU Development
• ARM and Synopsys have collaborated on ARM CPUs for many generations
• Floorplanning is always part of achieving best QoR for ARM CPUs
• Macro placement, bounds and blockages impact both timing and power

Figure: floorplans for Cortex-A57, Cortex-A72 and Cortex-A73, annotated with bounds, magnet placement and blockage (from SV SNUG 2015 and SV SNUG 2016)


The Last 10% Is Always The Most Challenging

• ARM CPU projects typically get to 90% of the performance target quickly
• Achieving the last 10% consumes most of the schedule
• Extensive analysis shows performance is impacted by placement
• Block floorplan exploration can provide performance breakthrough

Figure: place_opt FMAX Progression, FMAX (% target, 0.5 to 1) vs. Project Duration, with phases Data Prep, Flow Development, Ramp up to 90% and The last 10%


Early Placement and Bounds Trials
Manual Bounds/Blockages to Balance Timing vs. Routability QoR
• Techniques to drive QoR through placement
– Hard macro placement
– Placement bounds (fixed and floating)
– Typically focused on top-level modules
– Some analysis at first level down
– Add blockages to control channel density
• TNS reduced but QoR gains were offset by SI effects

Figure: module placement of DENGINE, DSIDE, CORE and ISIDE

FMAX limited by critical paths to/from CORE module


QoR Challenges in Placement
Analysis of Critical Module Timing
• Most critical paths to/from CORE sub-module
– CORE connects heavily to DSIDE and DENGINE
– Critical paths seen throughout the flow (challenging to fix downstream)
• CORE being “pushed” out of center of core area and near IOs
• Created FMAX-limited paths due to long-path buffering across block

Figure: module placement of DENGINE, DSIDE, ISIDE and CORE


Controlling Standard Cell Placement
Look For Possibility Of Bounds To Improve QoR
• Experimented with bounds and blockages to improve QoR
– Based on module placement analysis
• Looking for changes in CORE
• Most critical paths between CORE and DSIDE
• CORE was being pushed away from DSIDE and DENGINE
• Needed to find block floorplan that allowed CORE to be placed in center

Figure: placement of CORE, DSIDE and DENGINE

Desired movement not possible with given macro placement


DFA for ARM CPU Block Floorplanning
Enables Faster Convergence
• Use DFA (Data Flow Analysis) in ICC II to drive block-level QoR improvements through CPU floorplan changes
• Deep analysis of fast connectivity-based placement and inter-module flylines

Figure: DFA views of DENGINE, DSIDE, ISIDE and CORE

Some clear patterns emerge that lead to new macro placement conclusions

High-performance Core Floorplan Changes
Move RAMs To Guide Module Placement
• Move CORE module back into the center

Figure: before and after floorplans showing DENGINE, CORE, DSIDE and ISIDE

Floorplan changes that allowed CORE to float to the center and close to DSIDE

CORE Results @ route_opt with New Floorplan
• Updated floorplan produced much better FMAX, TNS and leakage
• Critical path analysis of paths within, to and from CORE module shows much better timing

Path Group         WNS (ps)   TNS (ns)
CORE Reg to Reg    -17        -3
Out of CORE        -16        -1
In to CORE         -17        -1
CORE ICGs          -18        -1

Figure: updated floorplan showing DENGINE, CORE, DSIDE and ISIDE

Block level floorplanning is key to improving QoR


Floorplan/Placement Refinement
Summary
• Analysis-driven floorplan changes boosted FMAX and reduced power
• Minor floorplan modifications can further improve QoR
• Block-level floorplanning is a powerful and necessary tool for ARM CPU QoR improvements

Figure: FMAX Progression with Synopsys RI, FMAX (% target, 0.5 to 1), with phases Ramp up to 90% and The last 10%

Achieved goal to boost FMAX, reduce TNS and total power




Implementation Careabouts
Require Upfront Analysis of Design Inputs & Flows to Optimize for Best PPA

• Power: Analyze Library for Balanced Power/Performance/TAT
• Area: Determine an Optimum Floorplan
• Performance: Manage Crosstalk and Optimization for Best Frequency
– Crosstalk Optimization
– Concurrent Clock & Data (CCD)
– Global Route-based Optimization
– PrimeTime Delaycalc


Crosstalk: An Ongoing Challenge on ARM Cores
• Lower geometry processes always have a crosstalk component
• ARM CPUs have traditional SI prevention
– Clock NDRs
– Congestion-aware placement
– Logic and density controls
• The Cortex-A73 flow used NDRs to dramatically reduce crosstalk
• We have used all these techniques plus more on the latest ARM CPU

Figure source: SV SNUG 2016


Early Results on Latest High-perf ARM Core
Crosstalk Issues Already Visible
• When starting new ARM CPU flow, used best practices from Cortex-A73 RI
– 2W/4S clock NDRs (tapered on sinks)
– Crosstalk threshold noise ratio of 20%
– Congestion optimization for placer settings
– 80ps max_transition limit
• Early results: crosstalk still an issue
– Large TNS increase at route_auto stage
– Leakage increase at route_opt stage

Figure: First Full Flow Trial, showing TNS, WNS and Leakage (%) at place_opt, clock_opt, route_auto and route_opt

Need more to address crosstalk


Crosstalk and Placement
Why Placement Matters For Crosstalk
• In ARM CPUs placement and crosstalk are interrelated
• Separating two modules with a lot of timing-critical interconnect causes a channel of high, unidirectional routing density
• Addressing this connectivity helped with crosstalk optimization

Figure: Critical Path Corridor, an area of very high connectivity between CORE and DSIDE modules; color scale shows decreasing net crosstalk delta delay


Crosstalk and Congestion
Related, But - Fixing Congestion Might not Fix Crosstalk
• Congestion: localized, over-capacity problem
– Congestion is indicated by >100% utilized gcells, best addressed by routing changes
• Crosstalk: large-area, at-capacity problem
– Crosstalk is indicated by high local routing density, best addressed with placement changes

Figures: Congestion Map: Gcells >100%; Congestion Map: Gcell = 100%
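The distinction between the two maps can be expressed as a simple classification of gcell utilization. The sketch below is illustrative only: the function names, the sample utilization values and the margin used to call a gcell "at capacity" are assumptions, not values or APIs from the tools.

```python
def classify_gcell(utilization, at_capacity_margin=0.05):
    """Label one gcell by routing utilization (1.0 means 100% of capacity).

    The margin that defines "at capacity" is a hypothetical knob.
    """
    if utilization > 1.0:
        return "congestion"       # over capacity
    if utilization >= 1.0 - at_capacity_margin:
        return "crosstalk-risk"   # at capacity, dense routing
    return "ok"


def summarize(gcell_utilizations):
    """Count gcells in each category."""
    counts = {"congestion": 0, "crosstalk-risk": 0, "ok": 0}
    for utilization in gcell_utilizations:
        counts[classify_gcell(utilization)] += 1
    return counts


# Hypothetical utilization values for a handful of gcells.
sample = [0.62, 0.97, 1.00, 1.08, 0.99, 0.80, 1.00, 0.96]
summary = summarize(sample)
print(summary)
```

A region can show few over-capacity gcells and still contain many at-capacity ones, which is the case the slide flags as a crosstalk problem rather than a congestion problem.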

Long Path Crosstalk in NonCPU
Reduced with Channel Blockages on CPU Interface Paths
• NonCPU has long-path crosstalk challenges
• Connections to CPUs can cluster together and cause timing problems
• Use percentage placement blockages to spread out pipeline flops

Figure: placement results for Blocked Channel % of 0%, 60%, 80% and 90%


Results
Dramatic FMAX & Power Improvements
New optimization solutions deployed in addition to Cortex-A73 techniques

Figure: Before and After maps; color scale shows decreasing net crosstalk delta delay


Leveraging New CCD Capabilities
Place_opt CCD and Power Aware CCD

Place_opt CCD
• place_opt with data only: higher area & power
• place_opt w/ useful skew: lower area & power

Power Aware CCD
• Apply useful skew, size up clock buf
• Datapath area/power recovery: size down/swap LVT

Figure values: -100ps, 200ps; 90ps, 10ps; 50ps, 50ps; 90ps, 10ps; Delay 150ps

Flow                         WNS    TNS    Leakage Power
place_opt + route_opt        -23    -95    100%
place_opt CCD + route_opt    -20    -68    98%

Flow                         WNS    TNS    Leakage Power
route_opt (Baseline)         -112   -134   100%
Power CCD + route_opt        -99    -126   99%
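The useful-skew numbers in the figure can be read as slack borrowing across a shared flop: delaying that flop's clock adds slack to the path arriving at it and removes the same amount from the path leaving it. The sketch below works through that reading with the -100ps and 200ps values; this interpretation of the figure and the function names are assumptions.

```python
def balancing_skew(slack_in_ps, slack_out_ps):
    """Clock delay on the shared flop that equalizes the two slacks."""
    return (slack_out_ps - slack_in_ps) / 2.0


def apply_skew(slack_in_ps, slack_out_ps, skew_ps):
    """Delaying the flop clock helps the incoming path and costs the outgoing one."""
    return slack_in_ps + skew_ps, slack_out_ps - skew_ps


# Incoming path fails by 100ps, outgoing path has 200ps to spare.
skew = balancing_skew(-100.0, 200.0)
slacks_after = apply_skew(-100.0, 200.0, skew)

print(f"skew: {skew} ps, slacks after: {slacks_after}")
```

With both paths at positive slack, cells that were upsized to chase the failing path become candidates for downsizing or Vt swaps, which is where the area and power recovery comes from.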

Increased Correlation Through the Flow
Routing and Timing Correlation

Global Route-based Optimization
• Timing-driven routing
• Re-routing, Re-buffering using global route parasitics
• On-route global route-based re-buffering

Metric      Traditional Flow   GR-based Opt.
R2R WNS     -29ps              -15ps
R2R TNS     -1ns               -1ns
Leakage     100%               96%

Accurate CCS Receiver Model
• Used in place_opt, clock_opt and route_opt
• Improved correlation to signoff delay calculation

@ Signoff   AWP Delay Model   CCS Receiver Cap Model
WNS         -60ps             -49ps
TNS         -11ns             -2ns

Route_opt based on signoff timer
• Route_opt able to see (and fix) timing issues easing burden on PT ECO

Delay calculation used   Step     WNS   TNS
ICC II Delaycalc         PT ECO   -20   -8
PrimeTime Delaycalc      PT ECO   -12   0


Multisource CTS (MS-CTS) + HTree
Benefiting Performance / Power
• Performance Challenge
– Significant OCV penalty associated with longer clock insertion delay
• Power Challenge
– Clock tree can consume up to 35% total power in typical CPU
– Most power consumption is at leaf level drivers & flop clocks
• MS-CTS useful to address both challenges

Figure: CLK, pre-mesh tree, pre-mesh drivers and clock mesh driving ICGs, flops (FF) and RAMs

Flow/Metrics with MS-CTS
• Global Skew: 53% smaller
• Clock Levels: 40% fewer
• Dynamic Power: similar


Summary: Implementation Careabouts
Upfront Analysis of Design Inputs & Flows to Optimize for Best PPA

• Power: Analyze Library for Balanced Power/Performance/TAT
– OCV
– Multi-Vt & gate-length
– Sequential Multibit Flops
• Area: Determine an Optimum Floorplan
– Placement Controls
– Bounds
– Data Flow Analysis
– Macro Placement
• Performance: Manage Crosstalk and Optimization for Best Frequency
– Crosstalk Optimization
– Concurrent Clock & Data (CCD)
– Global Route-based Optimization
– PrimeTime Delaycalc


Summary and Next Steps

Results on ARM Next Generation CPU
Achieved Target PPA with Convergent, Repeatable Flow

Figure: Next-Generation CPU PPA, showing TNS (ns, axis -150 to 300) and FMAX and Leakage (% of final, axis 0% to 300%) across implementation stages DCG, place_opt, clock_opt, route_opt and Signoff ECO

DC Graphical + IC Compiler II + PrimeTime ECO


Results on ARM Next Generation NonCPU
Customized Hierarchical Flow to meet PPA on NonCPU

Figure: Next-Generation NonCPU PPA, showing TNS (ns, axis -200 to 200) and FMAX and Leakage (% of final, axis 0% to 200%) across implementation stages DCG, place_opt, clock_opt, route_opt and Signoff ECO

DC Graphical + IC Compiler II + PrimeTime ECO


Next Steps

Look for more details on this next-generation ARM design in the coming months

Synopsys Reference Implementation (RI) is ready
• CPU and NonCPU flows
• TSMC 16nm FFC process
• ARM POP™ IP – core optimized standard cells & fast cache RAMs
• Complete implementation and static verification flow

Figure: floorplan showing DENGINE, CORE, DSIDE and ISIDE

Contact your Synopsys AC for additional information

Once core is announced, RI will be available on SolvNet (solvnet.synopsys.com/ARM-RI)

Conclusion: ARM + Synopsys
Continuing Close Collaboration To Benefit Our Customers

Complete implementation & static verification flows

Utilizing the most advanced Synopsys technologies

Providing maximum performance & minimum power on ARM’s next generation processors


Thank You
