
Best Practices for High-Performance, Energy-Efficient Implementations of
Arm® Cortex®-A75/-A55 Processors
In 16-nanometer FinFET Compact (16FFC) Process Technology
Using Synopsys Design Platform

Joe Walston, Synopsys

19 October 2017

© 2017 Synopsys, Inc. 1


Agenda

Arm/Synopsys Collaboration

Performance / Power Optimized Synopsys Design Reference Implementation

Maximizing Performance/Minimizing Power

Summary and Next Steps



Arm + Synopsys
Rich History of Collaboration



The New Threshold for Market Leadership

• 2010: 1.0 GHz in 40nm
• 2013: 2.0+ GHz in 28nm
• 2016: 3.0+ GHz in 16nm
• 2017: 3.5+ GHz in 7nm


Partnering for Arm Powered® Products

DesignWare® IP
• For Arm AMBA® Interconnect, I/F IP, coreTools for assembly, MBIST, High Performance Standard Cells, Fast Memories, IP Designer for ASIPs

Optimized Implementation
• Synopsys Design Tools, Reference Implementations (RIs), High Performance Core Centers, Low Power Methodology, Lynx Design System

Verification
• VCS® RTL Verification & Verdi® Debug, SpyGlass® RTL Signoff, VIP for AMBA Interconnect, LP Verification Methodology, ZeBu® HW-Assisted Verification (Hybrid Emulation, AMBA Transactors)

Physical Prototypes
• HAPS® Support for Arm Cores, Connection to Juno Arm Development Platform, Prototyping Methodology

Hybrid Prototypes
• Arm AMBA Transactors

Virtual Prototypes
• Virtualizer™ Tool Set & Development Kits (VDKs) for Pre-RTL SW Dev. Using Arm Fast Models (v7 & v8), Arch. Design with Platform Architect, Coverity & Defensics SW Signoff

[Figure: development cycle covering Optimized HW Design & Implementation, Verification, System SW Dev., Software Stack, and Validation & HW/SW Integration]

synopsys.com/Arm



Collaboration Enables Early Adopters of Arm’s Latest IP
• Early adopters have already taped out using Synopsys’ Design & Verification Continuum Platforms
• New QuickStart Implementation Kits (QIKs) add Reference Guide to Reference Implementation
• Synopsys Design Services available
– QuickStart Implementation Service (4-weeks)
– Consultative core-optimization
– Full turnkey core hardening

Images source: arm.com


Arm POP™ IP on 16nm

• Need to shorten the CPU design cycle
– POP IP is a comprehensive, fully validated Cortex®-A implementation solution
– Includes Physical IP, floorplans and reference implementation scripts
• Need to lower technical and schedule risk
– POP IP is developed and tuned in synergy with RTL over several iterations
– All Physical IP and implementation issues have been identified and solved by EAC date
• Need to achieve market-leading PPA
– POP undergoes extensive iterative floorplan exploration and design tuning to deliver market-leading PPA
– Our record in 16nm FinFET technology is a testament to the hard work behind POP development


How does Arm PDG work w/processor design on
POP IP?
[Figure: co-development timeline between the Arm CPU and POP IP]
• CPU milestones: Pre-ALPHA, ALPHA, BETA, LAC, EAC
• POP milestones: Pre-ALPHA, ALPHA, BETA, EAC
• CPU RTL optimization based on POP IP implementation feedback
• Floorplan tuning and PPA optimization
• Legend: Requirements/Inputs; Co-development work


Performance & Power Optimized
Reference Implementation
Synopsys Design Platform



Synopsys Reference Implementation
Complete Implementation & Static Verification Flow for Cortex-A75/-A55 Cores

• Synthesis: Design Compiler Graphical
• Place and route: IC Compiler II (place_opt, clock_opt, route_opt), with NDM Library Prep
• ECO: PrimeTime ECO
• Static verification and signoff: VC LP, PrimeTime SI, Formality

Note: DFT supported at each step during DC Graphical and IC Compiler II, UPF supported throughout the flow


Included in Synopsys Reference Implementation
Most Advanced Technology from Synopsys Implementation Platform

DC Graphical
• Meet Timing
– Enhanced physical guidance (eSPG)
– Enhanced layer-aware optimization
– Placement pre-clustering
• Reduce Power
– Timing-driven multibit register banking and de-banking
– Physical-aware clock gating
– Low power placement

IC Compiler II
• Meet Timing
– place_opt CCD
– New global route-based opt.
– CCS receiver cap modeling
– PrimeTime delay calc in route_opt
– Redundant VIA insertion
• Reduce Power
– Incremental timing-driven multibit register banking and de-banking
– Clock gating optimization
– Low power placement
– High effort leakage flow

PrimeTime ECO
• Meet Timing
– Path-based analysis (PBA)
– Clock skew ECO
– Physical-aware ECO
• Reduce Power
– Leakage-aware timing ECO


Included in Synopsys Reference Implementation
Arm Cortex-A75 CPU and DynamIQ™ Shared Unit (DSU) Designs and Libraries

Design Detail
Config                Crypto; NEON & FP Unit
Clock Gating          ~250 architectural ICGs
Process               TSMC 16nm FFC
                      11 Layer +AP w/ routing on M2-M9
Libraries/memories    Arm POP IP for TSMC 16nm FFC
                      SVT/LVT/ULVT C24/C20/C18/C16 for data
                      ULVT C16 for clock
PVT corners (MCMM)    Setup: RC Max: TT/1.0v/85c
                      Setup & Power: TT/0.8v/85c
                      Hold: RC Min: FFGNP/1.05v/125c
DFT Strategy          Scan compression
UPF                   3 power domains with LS and ELS required

[Figure: floorplan with modules DENGINE, CORE, DSIDE, ISIDE]


Maximizing Performance/Minimizing Power
on Cortex-A75 Processor
with the Synopsys Design Platform



Cortex-A75 Implementation Careabouts

• Power: Analyze Library for Balanced Power/Performance/TAT
• Area: Determine an Optimum Floorplan
• Performance: Manage congestion, SI & pessimism for convergence




Cortex-A75 Implementation Careabouts
Require Upfront Analysis of Design Inputs & Flows to Optimize for Best PPA

• Power: Analyze Library for Balanced Power/Performance/TAT
– Multi-Vt & gate-length
– OCV
– Sequential Multibit Flops
• Area: Determine an Optimum Floorplan
• Performance: Manage congestion, SI & pessimism for convergence


On Chip Variation (OCV)
Signoff and Implementation Derating Methodology

• Derating strategy changes through implementation to signoff
– Manages pessimism
– Reduces area/power impact of GBA-based AOCV
• Methodology does not affect signoff

Derating Strategy
DC Graphical      No cell derating; 7% net derating
IC Compiler II    POCV cell derating* (from AOCV 5%, 7%, 10%); 7% net derating
PrimeTime ECO     AOCV cell derating* (5%, 7%, 10%); 7% net derating
* SSG @ 5%, TT 1.0v @ 7%, FFG @ 10%

ICC II Optimization      AOCV-based    POCV-based
Frequency (ICC II)       Baseline      + 8%
Frequency (PrimeTime)    Baseline      + 2%
Utilization              68%           66%
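The derating percentages on this slide can be read as multiplicative factors on delay. The following is a minimal sketch, assuming the common convention that a late (setup) derate scales a delay by (1 + derate) and treating the percentages as flat derates, which is simpler than the depth- or sigma-based derating that AOCV and POCV actually apply. The function name, the sample path and its delays are hypothetical; only the percentages come from the slide.

```python
# Cell derates per corner, from the slide footnote (SSG 5%, TT 1.0V 7%, FFG 10%).
CELL_DERATE = {"SSG": 0.05, "TT_1.0V": 0.07, "FFG": 0.10}
NET_DERATE = 0.07  # 7% net derating is listed for every stage on the slide


def derated_late_delay(cell_delays_ps, net_delays_ps, corner, derate_cells=True):
    """Return the late (setup) path delay in ps after flat derating.

    Assumption: a late derate multiplies delay by (1 + derate).
    With derate_cells=False only nets are derated, mirroring the
    "No cell derating" entry for DC Graphical.
    """
    cell_factor = 1.0 + (CELL_DERATE[corner] if derate_cells else 0.0)
    net_factor = 1.0 + NET_DERATE
    return sum(cell_delays_ps) * cell_factor + sum(net_delays_ps) * net_factor


# Hypothetical path: three cells and three nets (delays in ps).
cells = [20.0, 25.0, 30.0]
nets = [5.0, 10.0, 10.0]

synthesis_view = derated_late_delay(cells, nets, "TT_1.0V", derate_cells=False)
signoff_view = derated_late_delay(cells, nets, "TT_1.0V", derate_cells=True)
print(f"net-only derating:   {synthesis_view:.2f} ps")
print(f"cell + net derating: {signoff_view:.2f} ps")
```

The gap between the two numbers is the extra margin that appears once cell derating is switched on later in the flow, which is one way to picture why the slide manages pessimism stage by stage.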


Arm Library – Leverage for best CPU QoR
Complexity and Flexibility

• Specific details of Arm 16FFC library
– 3 Vt classes (SVT, LVT, ULVT)
– 4 channel lengths each (16, 18, 20, 24)
– Base, hpk (High-Performance), pmk (Power Management)
– 1, 2 and 4 bit flops
– 3 main types of SB/MB flops (Q, QL, QA)
• Cortex-A75 CPU has very aggressive power targets
– As always - extremely high FMAX target
– But… leakage and dynamic power targets require more than opportunistic power reduction

Design Detail
Process               TSMC 16nm FFC
                      11 Layer +AP routing on M2-M9
Libraries/memories    Arm POP IP for TSMC 16nm FFC
                      ULVT C24/C20/C18/C16 for data
                      ULVT C16 for clock
                      SVT/LVT for leakage opt.
PVT corners (MCMM)    Setup: RC Max: TT/1.0v/85c
                      Setup & Power: TT/0.8v/85c
                      Hold: RC Min: FFGNP/1.05v/125c

Starting with library analysis makes balancing FMAX and power easier

Constructing Cortex-A75 CPU Power Opt. Flow
That Meets Performance/Power Targets

• Essential to manage
– Vt class availability
– Multibit (MB) banking/de-banking
– Leakage vs. timing vs. dynamic optimization
– Leave headroom (both timing and power) for ECO
• Library impacts all these decisions

Meet Power Target (DC Graphical, IC Compiler II, PrimeTime ECO)
• MB banking/de-banking: 1bit, 2bit, 4bit
• VT selection: across 12 vt/channel options
• Flop selection: QL (leakage) vs. Q (std) vs. QA (area)
• SI TNS reduction: very congested, clock NDRs
• PrimeTime ECO: fix_eco_power to meet leakage target, expect 15-20% reduction

Analyzing Library Leakage Power Heat Maps
Example: INV @ TT 0.8V, 85c
Use Leakage Power Heat Map to gain qualitative understanding of library characteristics

Leakage Power Heat Map (rows: drive strength; columns: VT type / channel length)

          ULVT                 LVT                  SVT
          C16  C18  C20  C24   C16  C18  C20  C24   C16  C18  C20  C24
X20N      858  682  575  409    89   69   56   35     8    5    4    3
X16N      681  542  457  326    71   55   44   28     6    4    3    2
X14N      593  472  398  284    62   48   39   24     6    4    3    2
X12N      504  401  339  243    52   41   33   21     5    3    2    2
X10N      415  331  280  201    43   33   27   17     4    3    2    1
X8N       327  261  221  159    34   26   21   14     3    2    1    1
X7N       259  208  177  129    27   21   17   11     2    2    1    1
X6N       239  191  162  118    25   19   16   10     2    2    1    1
X5N       195  156  133   96    20   16   13    8     2    1    1    1
X4N       151  121  103   75    16   12   10    6     1    1    1    0
X3N       108   87   74   54    11    9    7    5     1    1    0    0
X2N        67   53   45   33     7    5    4    3     1    0    0    0
X1P5N      39   31   26   20     5    4    3    2     0    0    0    0
X1N        28   22   19   14     3    2    2    1     0    0    0    0
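A quick way to turn the leakage heat map into numbers is to compute ratios between classes. The sketch below copies a few rows of the C16 column and one C24 entry from the heat map (units are those of the slide, which does not state them); the dictionary layout and function name are illustrative assumptions, not part of any Synopsys or Arm deliverable.

```python
# INV leakage at TT 0.8V, 85c, copied from the heat map (C16 channel length).
LEAKAGE_C16 = {
    "X20N": {"ULVT": 858, "LVT": 89, "SVT": 8},
    "X8N": {"ULVT": 327, "LVT": 34, "SVT": 3},
    "X4N": {"ULVT": 151, "LVT": 16, "SVT": 1},
}
# One channel-length comparison: ULVT X20N at C16 versus C24.
ULVT_X20N_C16 = 858
ULVT_X20N_C24 = 409


def leakage_ratio(drive, from_vt, to_vt):
    """Ratio of leakage between two Vt classes at the same drive strength."""
    row = LEAKAGE_C16[drive]
    return row[from_vt] / row[to_vt]


for drive in LEAKAGE_C16:
    ratio = leakage_ratio(drive, "ULVT", "LVT")
    print(f"{drive}: ULVT/LVT leakage ratio = {ratio:.1f}x")

channel_ratio = ULVT_X20N_C16 / ULVT_X20N_C24
print(f"X20N ULVT: C16/C24 leakage ratio = {channel_ratio:.1f}x")
```

For these rows the ULVT-to-LVT ratio comes out between roughly 9x and 10x, while moving from C16 to C24 within ULVT changes leakage by about 2x, so the Vt class is the much larger lever of the two.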


Analyzing Library Cell Delay Heat Maps
Example: INV with 50ff Load @ TT 1.0V, 85c
Use Cell Delay Heat Map to gain qualitative understanding of library characteristics

Cell Delay Heat Map (rows: drive strength; columns: VT type / channel length)

          ULVT                 LVT                  SVT
          C16  C18  C20  C24   C16  C18  C20  C24   C16  C18  C20  C24
X20N       18   19   19   20    20   21   21   22    23   24   25   26
X16N       19   20   20   21    21   22   23   24    25   26   27   28
X14N       20   21   21   22    22   23   24   25    26   27   28   29
X12N       21   22   22   23    23   24   25   26    27   28   29   31
X10N       22   23   24   25    25   26   26   28    29   30   31   33
X8N        24   25   26   27    27   28   29   31    32   33   34   36
X7N        25   26   27   28    29   30   31   32    34   35   36   39
X6N        28   29   29   31    31   32   33   35    37   38   39   42
X5N        30   31   32   34    34   35   37   39    41   42   43   46
X4N        34   35   36   38    39   40   41   43    45   47   49   52
X3N        41   42   43   45    47   48   50   52    55   57   59   63
X2N        53   54   55   58    60   62   64   67    70   73   76   80
X1P5N      65   66   68   72    74   76   78   82    86   88   92   98
X1N        92   94   97  101   105  108  111  116   123  127  133  140
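The delay heat map supports a simple comparison: how much does delay move across the 12 Vt/channel classes at one drive strength, versus across drive strengths within one class? The sketch below uses three rows copied from the heat map; the helper names and the choice of rows are illustrative assumptions.

```python
# Column order of the heat map: ULVT, LVT, SVT, each with C16, C18, C20, C24.
VT_CLASSES = [f"{vt}_{ch}" for vt in ("ULVT", "LVT", "SVT")
              for ch in ("C16", "C18", "C20", "C24")]

# INV delay with 50fF load at TT 1.0V, 85c, three rows copied from the heat map.
DELAY = {
    "X20N": [18, 19, 19, 20, 20, 21, 21, 22, 23, 24, 25, 26],
    "X4N": [34, 35, 36, 38, 39, 40, 41, 43, 45, 47, 49, 52],
    "X1N": [92, 94, 97, 101, 105, 108, 111, 116, 123, 127, 133, 140],
}


def spread_across_classes(drive):
    """Slowest/fastest delay ratio across all Vt/channel classes at one drive."""
    row = DELAY[drive]
    return max(row) / min(row)


def spread_across_drives(vt_class):
    """Slowest/fastest delay ratio across the sampled drives for one class."""
    col = VT_CLASSES.index(vt_class)
    values = [row[col] for row in DELAY.values()]
    return max(values) / min(values)


print(f"X20N, all classes: {spread_across_classes('X20N'):.2f}x")
print(f"ULVT_C16, X20N to X1N: {spread_across_drives('ULVT_C16'):.2f}x")
```

With these rows the spread across classes is under 1.5x, while the spread across drive strengths is above 5x, which matches the qualitative point that drive strength dominates delay for this cell and load.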


Library Observations
Recommendations to Minimize Leakage Power

• Overlap between ULVT and LVT
– Consider combination of ULVT C16 and LVT C16 during synthesis and P&R
• Order of magnitude power delta between equivalent drive strengths/channel lengths across VT classes
– LVT use in implementation absolutely necessary for best leakage when targeting ULVT for timing
• SVT power savings mainly in small drive strengths
– SVT probably best for ECO swap/size
• Variation between channel lengths is small
– Channel variations should be reserved for ECO

[Figure: Leakage Heat Map (TT) for ULVT, LVT and SVT]


Library Observations
Recommendations to Improve Timing
• Delay variation across Vt and channel smaller than across drive strength
– Many available Vt/channel classes will not have a strong influence on timing early in flow
– Best to focus on a primary class for timing closure based on clock frequency
• Very low drive cells can be load sensitive
– Do not use small drive cells during synthesis and P&R
• Variation between channel lengths reserved for ECO
– Late in flow, swapping can boost timing 1-2ps per Vt/channel swap per datapath stage

[Figure: Cell Delay Heatmap at SSG and TT-OD for ULVT, LVT and SVT]


Optimizing Leakage Power
Cortex-A75 Implementation Flow - Vt Class Trials

• 12 Vt and channel length variations in library
• Not using SVT for implementation, saving it for ECO
– Capacitance sensitivity outweighs possible leakage gains in DC Graphical/IC Compiler II
• Choosing Vt classes to use during flow can impact final leakage power

Leakage Power    ULVT                 LVT                  SVT
(% Final)        C16  C18  C20  C24   C16  C18  C20  C24   C16  C18  C20  C24
144%             54%   6%   4%   3%    6%  15%   3%   0%    1%   2%   5%   1%
100%             16%   2%   1%   1%   40%   5%   3%   2%    3%  15%  12%   1%
118%             15%   8%   7%  33%    5%   5%   3%   9%    3%   1%   3%   7%
114%             13%   4%   4%   3%   12%   5%   5%  23%    3%   5%   5%  17%

Vt Distribution Through Cortex-A75 Full Flow
Vt Class/Channel Mix Changes As Implementation Progresses
• DC Graphical: in DC, datapath delay is prioritized; faster cells are used
• ICC II route_opt: pessimism reduces through the flow and CTS brings in useful skew; cells are swapped for power
• PrimeTime ECO: ECO brings in 4 new SVT classes and does positive slack recovery

[Figure: Vt class distribution (ULVT, LVT, SVT at c16, c18, c20, c24) at each of the three stages]

ECO Objectives for Power Optimization
Cortex-A75 CPU

• PrimeTime ECO used to lower power
– AOCV with PBA increases positive slack for downsizing
– Open up all 12 Vt/channel classes to ECO
• Timing ECO makes minor improvements on remaining critical paths
• Power ECO reduces leakage without impacting timing
– Uses footprint-compatible Vt/channel swaps
• Typically see 20-30% leakage reduction

ECO flow: ICC II route_opt (ULVT/LVT) → StarRC → PrimeTime SI/PX → fix_eco_timing, fix_eco_power –pattern (ULVT/LVT/SVT) → ICC II ECO → StarRC → PrimeTime SI/PX
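The idea behind a footprint-compatible Vt/channel swap can be illustrated with the X4N inverter rows from the two heat maps earlier in the deck. This is only a toy sketch under stated assumptions: it is not how fix_eco_power works internally, it mixes a delay table (TT 1.0V, 50fF load) with a leakage table (TT 0.8V) purely for illustration, and it assumes slack is expressed in the same units as the delay heat map. The function name and greedy rule are hypothetical.

```python
# Column order of the heat maps: ULVT, LVT, SVT, each with C16, C18, C20, C24.
VT_CLASSES = [f"{vt}_{ch}" for vt in ("ULVT", "LVT", "SVT")
              for ch in ("C16", "C18", "C20", "C24")]

# X4N inverter rows copied from the leakage and delay heat maps.
LEAKAGE_X4N = [151, 121, 103, 75, 16, 12, 10, 6, 1, 1, 1, 0]
DELAY_X4N = [34, 35, 36, 38, 39, 40, 41, 43, 45, 47, 49, 52]


def best_swap(current_class, positive_slack):
    """Pick the lowest-leakage class whose added delay fits in the slack.

    Toy rule: a swap is allowed if the candidate's delay does not exceed
    the current delay plus the available positive slack. Ties on leakage
    are broken in favor of the faster class.
    """
    current = VT_CLASSES.index(current_class)
    budget = DELAY_X4N[current] + positive_slack
    candidates = [i for i in range(len(VT_CLASSES)) if DELAY_X4N[i] <= budget]
    best = min(candidates, key=lambda i: (LEAKAGE_X4N[i], DELAY_X4N[i]))
    return VT_CLASSES[best]


for slack in (0, 6, 12, 20):
    print(f"slack {slack:>2}: ULVT_C16 -> {best_swap('ULVT_C16', slack)}")
```

The more positive slack a cell has, the further it can move toward LVT and SVT classes, which is why the slide pairs AOCV with PBA (more recovered slack) with opening all 12 classes to the ECO.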


Dynamic Power - Flop Mapping & Banking
Considerations

• Arm 16FFC library has many choices for flop mapping
– Q, QA (low area), QL (low power)
– 1-bit, 2-bit, 4-bit
• Essential to consider
– Which flops to use
– Allowed stage
– At what stage(s) to bank/de-bank flops
– De-banking criteria
– Banking exclusion list

Flops in Library: QA (low area), Q (base) and QL (low power), each available as single-bit, 2-bit and 4-bit


Dynamic Power - Flop Mapping & Banking
Recommendations: Achieved 18% Total Power Savings on Cortex-A75 CPU

Flop Mapping
• Experimented with different flavors of flops
– QA, QL and QA + QL
• Observations
– QA is smaller but has larger dynamic power
– QL has lower leakage power
– QA + QL combination best for overall power
• Result: 13% total power savings

Multibit Banking
• Experimented with different strategies
– 2-bit and 4-bit flops
– Physically Aware Multibit Banking (PAMB)
– Automated critical path de-banking
• Observations
– 4-bit reduces power, creates WNS instability
– PAMB improves power, increases TNS
– De-banking reduces PAMB TNS impact
• Result: 4-bit: 3% additional total power savings; PAMB/de-bank: 2% total power savings
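The 18% headline on this slide appears to be the sum of the three contributions quoted on it. A minimal tally, assuming the percentages are additive points of total power (the slide does not say how they were combined; the dictionary keys are descriptive labels only):

```python
# Savings quoted on the slide, in percent of total power.
savings_pct = {
    "QA + QL flop mapping": 13,
    "4-bit multibit banking": 3,
    "PAMB with de-banking": 2,
}

total_pct = sum(savings_pct.values())
for technique, pct in savings_pct.items():
    print(f"{technique}: {pct}%")
print(f"total: {total_pct}%")
```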


Dynamic Power UPF
Cortex-A75 CPU Implementation & Static Verification Flow – Synopsys RI

Arm successive refinement UPF
• Constraints, configuration and implementation UPF

Gas station voltage areas for selected signals
• Ensures optimum clock tree for CPU modules

Multibit LS and ELS insertion
• 2-bit wide and 4-bit wide

Header switch control and acknowledge implementation
• Hammer & Trickle Headers
• Daisy chain and HFS connections

Fine grain switched ground RAM support
• Proper always-on TIE cells, automated level shifter, isolation cell insertion

Includes many of the newest UPF capabilities and methodologies

Multibit LS/ELS insertion
Reduced Area Leads to Less Congestion on Cortex-A75 CPU
                     Single-bit LS/ELS    Single/multi-bit LS/ELS
LS/ELS cell count    3457                 886
LS/ELS Area          12,815 µm²           4,920 µm² (61% less)
LS/ELS Leakage       0.77 mW              0.85 mW (13% more)

[Figure: placement of LS and ELS cells, single-bit vs. multibit (Mbit)]
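The area claim can be checked directly from the cell counts and areas on the slide. A short sketch (the variable and function names are illustrative; the numbers are the slide's):

```python
# LS/ELS cell counts and total areas quoted on the slide.
single_bit = {"cells": 3457, "area_um2": 12815}
multi_bit = {"cells": 886, "area_um2": 4920}


def pct_reduction(before, after):
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before


area_saving = pct_reduction(single_bit["area_um2"], multi_bit["area_um2"])
count_saving = pct_reduction(single_bit["cells"], multi_bit["cells"])
print(f"area reduction:       {area_saving:.1f}%")
print(f"cell count reduction: {count_saving:.1f}%")
```

The area reduction works out to a little under 62%, consistent with the "61% less" on the slide, and the number of placed LS/ELS instances falls by roughly three quarters, which is where the congestion relief comes from.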


Summary: Cortex-A75 CPU Power Opt. Flow
That Meets Performance/Power Targets
DC Graphical
• OCV: 7% net
• Vt classes: 2
• MCMM scenarios: TT-1.0V
• Flop families: Q, QL, QA
• Multibit flops: 1-bit, 2-bit, 4-bit
• Optimization targets: Timing, Area, Dynamic
• Banking/de-banking flow: RTL inferencing; physical-aware critical path de-banking

IC Compiler II (place_opt, clock_opt, route_opt)
• OCV: POCV cell, 7% net
• Vt classes: 2 (place_opt, clock_opt), 8 (route_opt)
• MCMM scenarios: TT-1.0V, TT-0.8V, FFG
• Flop families: Q, QL, QA
• Multibit flops: 1-bit, 2-bit, 4-bit
• Optimization targets: Timing, Area, Leakage, Dynamic
• Banking/de-banking flow: physical-aware critical path de-banking

PrimeTime ECO
• OCV: AOCV cell, 7% net
• Vt classes: 12
• MCMM scenarios: TT-1.0V, TT-0.8V, FFG
• Flop families: Q, QL, QA
• Multibit flops: 1-bit, 2-bit, 4-bit
• Optimization targets: Timing, Area, Leakage



Cortex-A75 Implementation Careabouts
Require Upfront Analysis of Design Inputs & Flows to Optimize for Best PPA

• Power: Analyze Library for Balanced Power/Performance/TAT
• Area: Determine an Optimum Floorplan
– Placement Controls
– Bounds
– Data Flow Analysis
– Macro Placement
• Performance: Manage congestion, SI & pessimism for convergence


Floorplanning
Key Element in Arm CPU Development
• Arm and Synopsys have collaborated on Arm CPUs for many generations
• Floorplanning is always part of achieving best QoR for Arm CPUs
• Macro placement, bounds and blockages impact both timing and power


[Figure: floorplans for Cortex-A57, Cortex-A72 and Cortex-A73 showing bounds, blockage and magnet placement; sources: SV SNUG 2015, SV SNUG 2016]


The Last 10% Is Always The Most Challenging

• Arm CPU projects typically get to 90% of the performance target quickly
• Achieving the last 10% consumes most of the schedule
• Extensive analysis shows performance is impacted by placement
• Block floorplan exploration can provide performance breakthrough

[Chart: Cortex-A75 place_opt FMAX Progression; FMAX (% target, 0.5 to 1) vs. Project Duration; regions "Ramp up to 90%" and "The last 10%"; phase labels: Prep, Data Flow, Development]


Early Placement and Bounds Trials
Manual Bounds/Blockages to Balance Timing vs. Routability QoR

• Techniques to drive QoR through placement on Cortex-A75 CPU
– Hard macro placement
– Placement bounds (fixed and floating)
– Typically focused on top-level modules
– Some analysis at first level down
– Add blockages to control channel density
• TNS reduced but QoR gains were offset by SI effects

[Figure: floorplan with modules DENGINE, DSIDE, CORE, ISIDE]

FMAX limited by critical paths to/from CORE module

QoR Challenges in Placement
Analysis of Critical Module Timing on Cortex-A75 CPU

• Most critical paths to/from CORE sub-module
– CORE connects heavily to DSIDE and DENGINE
– Critical paths seen throughout the flow (challenging to fix downstream)
• CORE being “pushed” out of center of core area and near IOs
• Created FMAX-limited paths due to long-path buffering across block

[Figure: module placement showing DENGINE, DSIDE, ISIDE and CORE]


Controlling Standard Cell Placement
Look For Possibility Of Bounds To Improve Cortex-A75 CPU QoR
• Experimented with bounds and blockages to improve QoR
– Based on module placement analysis
• Looking for changes in CORE
• Most critical paths between CORE and DSIDE
• CORE was being pushed away from DSIDE and DENGINE
• Needed to find block floorplan that allowed CORE to be placed in center

[Figure: module placement highlighting CORE, DSIDE and DENGINE]

Desired movement not possible with given macro placement

DFA for Cortex-A75 CPU Block Floorplanning
Enables Faster Convergence

• Use DFA (Data Flow Analysis) in ICC II to drive block-level QoR improvements through CPU
floorplan changes
• Deep analysis of fast connectivity-based placement and inter-module flylines

[Figure: DFA views of connectivity-based placement and inter-module flylines for DENGINE, DSIDE, ISIDE and CORE]

Some clear patterns emerge that lead to new macro placement conclusions

Cortex-A75 CPU Floorplan Changes
Move RAMs To Guide Module Placement

[Figure: floorplan before and after the RAM moves (modules DENGINE, CORE, DSIDE, ISIDE); the change moves the CORE module back into the center]

Floorplan changes that allowed CORE to float to the center and close to DSIDE

CORE Results @ route_opt with New Floorplan
Cortex-A75 CPU

• Updated floorplan produced much better FMAX, TNS and leakage
• Critical path analysis of paths within, to and from CORE module shows much better timing

Path Group         WNS (ps)    TNS (ns)
CORE Reg to Reg    -17         -3
Out of CORE        -16         -1
In to CORE         -17         -1
CORE ICGs          -18         -1

[Figure: floorplan with modules DENGINE, CORE, DSIDE, ISIDE]

Block level floorplanning is key to improving QoR

Floorplan/Placement Refinement
Summary
• Analysis-driven floorplan changes boosted FMAX and reduced power
• Minor floorplan modifications can further improve QoR
• Block-level floorplanning is a powerful and necessary tool for Cortex-A75 CPU QoR improvements

[Chart: Cortex-A75 FMAX Progression with Synopsys RI; FMAX (% target, 0.5 to 1); regions "Ramp up to 90%" and "The last 10%"]

Achieved goal to boost FMAX, reduce TNS and total power



Cortex-A75 Implementation Careabouts
Require Upfront Analysis of Design Inputs & Flows to Optimize for Best PPA

• Power: Analyze Library for Balanced Power/Performance/TAT
• Area: Determine an Optimum Floorplan
• Performance: Manage Crosstalk and Optimization for Best Frequency
– Crosstalk Optimization
– Concurrent Clock & Data (CCD)
– PrimeTime Delaycalc
– Global Route-based Optimization


Crosstalk: An Ongoing Challenge on Arm Cores

• Lower geometry processes always have a crosstalk component
• Arm CPUs have traditional SI prevention
– Clock NDRs
– Congestion-aware placement
– Logic and density controls
• The Cortex-A73 flow used NDRs to dramatically reduce crosstalk
• We have used all these techniques plus more on the Cortex-A75 CPU

[Figure source: SV SNUG 2016]


Early Results - Cortex-A75 CPU
Crosstalk Issues Already Visible

• When starting the Cortex-A75 CPU flow, used best practices from Cortex-A73 RI
– 2W/4S clock NDRs (tapered on sinks)
– Crosstalk threshold noise ratio of 20%
– Congestion optimization for placer settings
– 80ps max_transition limit
• Early results: crosstalk still an issue
– Large TNS increase at route_auto stage
– Leakage increase at route_opt stage

[Chart: First Full Flow Trial; TNS, WNS and Leakage (%) at place_opt, clock_opt, route_auto and route_opt]

Need more to address crosstalk

Crosstalk and Placement
Why Placement Matters For Crosstalk

• In Arm CPUs placement and crosstalk are interrelated
• Separating two modules with a lot of timing-critical interconnect causes a channel of high, unidirectional routing density
• Addressing this connectivity helped with crosstalk optimization

[Figure: Critical Path Corridor in Cortex-A75 CPU; area of very high connectivity between CORE and DSIDE modules; color scale: decreasing net crosstalk delta delay]


Crosstalk and Congestion in Cortex-A75 CPU
Related, But - Fixing Congestion Might not Fix Crosstalk

Congestion: localized, over-capacity problem
• Congestion is indicated by >100% utilized gcells, best addressed by routing changes
[Figure: Congestion Map: Gcells >100%]

Crosstalk: large-area, at-capacity problem
• Crosstalk is indicated by high local routing density, best addressed with placement changes
[Figure: Congestion Map: Gcell = 100%]
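The distinction between over-capacity and at-capacity gcells can be sketched as a simple count over per-gcell routing utilization. Everything here is hypothetical: the function name, the input format (a flat list of utilization ratios, where 1.0 means 100%) and the sample values are assumptions for illustration, not an export format of any tool.

```python
def classify_gcells(utilization):
    """Count over-capacity and at-capacity gcells.

    `utilization` is a list of routing utilization ratios per gcell,
    where 1.0 means the gcell is exactly at 100% of its track capacity.
    """
    over = sum(1 for u in utilization if u > 1.0)
    at = sum(1 for u in utilization if u == 1.0)
    total = len(utilization)
    return {
        "over_capacity": over,  # congestion indicator (>100%)
        "at_capacity": at,      # crosstalk-risk indicator (=100%)
        "at_capacity_fraction": at / total if total else 0.0,
    }


# Hypothetical sample: one overflowing gcell, four exactly full gcells.
sample = [0.60, 1.00, 1.00, 1.20, 0.95, 1.00, 1.00, 0.80]
print(classify_gcells(sample))
```

A design can show almost no over-capacity gcells and still have a large at-capacity fraction, which is the situation the slide describes: the congestion map looks clean while crosstalk remains a problem.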


Long Path Crosstalk in DSU
Reduced with Channel Blockages on CPU Interface Paths

• DSU has long-path crosstalk challenges
• Connections to CPUs can cluster together and cause timing problems
• Use percentage placement blockages to spread out pipeline flops

[Figure: placement results for Blocked Channel % of 0%, 60%, 80% and 90%]


Results – Cortex-A75 CPU
Dramatic FMAX & Power Improvements

New optimization solutions deployed in addition to Cortex-A73 techniques

[Figure: Before and After maps; color scale: decreasing net crosstalk delta delay]


Leveraging New CCD Capabilities
Place_opt CCD and Power Aware CCD

[Figure: place_opt with data only (higher area & power) shows -100ps and 200ps on either side of a clocked element; applying useful skew (Delay 150ps) gives place_opt w/ useful skew (lower area & power) with 50ps and 50ps]
[Figure: power-aware CCD: size up clock buf, size down/swap LVT on the datapath for area/power recovery; labels 90ps and 10ps]

Cortex-A75 CPU               WNS     TNS     Leakage Power
place_opt + route_opt        -23     -95     100%
place_opt CCD + route_opt    -20     -68     98%

Cortex-A75 CPU               WNS     TNS     Leakage Power
route_opt (Baseline)         -112    -134    100%
Power CCD + route_opt        -99     -126    99%

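The useful-skew numbers in the figure can be reproduced with a two-line calculation. The sketch below assumes that -100ps and 200ps are the setup slacks of the paths ending at and starting from the same flop, and that delaying that flop's clock by d adds d to the incoming slack and removes d from the outgoing slack; the function name is hypothetical and this is not how CCD is implemented in the tool.

```python
def balance_with_useful_skew(slack_in_ps, slack_out_ps):
    """Return (clock_shift, new_slack_in, new_slack_out) that equalizes slack.

    Delaying the capture clock of the shared flop by `clock_shift` gives
    the incoming path more time and the outgoing path less time.
    """
    clock_shift = (slack_out_ps - slack_in_ps) / 2
    return clock_shift, slack_in_ps + clock_shift, slack_out_ps - clock_shift


shift, slack_in, slack_out = balance_with_useful_skew(-100, 200)
print(f"delay clock by {shift:.0f} ps -> slacks {slack_in:.0f} ps / {slack_out:.0f} ps")
```

Under these assumptions the shift is 150ps and both paths end at 50ps of slack, matching the figure labels. Because neither path is violating any more, the datapath does not need to be upsized to close timing, which is the area and power benefit the slide attributes to place_opt with useful skew.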
Increased Correlation Through the Flow
Routing and Timing Correlation on Cortex-A75 CPU
Global Route-based Optimization
• Timing-driven routing
• Re-routing, re-buffering using global route parasitics
• On-route global route-based re-buffering

Metric       Traditional Flow    GR-based Opt.
R2R WNS      -29ps               -15ps
R2R TNS      -1ns                -1ns
Leakage      100%                96%

Accurate CCS Receiver Model
• Used in place_opt, clock_opt and route_opt
• Improved correlation to signoff delay calculation

@ Signoff    AWP Delay Model    CCS Receiver Cap Model
WNS          -60ps              -49ps
TNS          -11ns              -2ns

Route_opt Based on PrimeTime Signoff Timer
• Route_opt able to see (and fix) timing issues, easing burden on PT ECO

Delay calculation used    Step      WNS    TNS
ICC II Delaycalc          PT ECO    -20    -8
PrimeTime Delaycalc       PT ECO    -12    0


Multisource CTS (MS-CTS) + HTree
Benefiting Performance / Power on Cortex-A75 CPU

• Performance Challenge
– Significant OCV penalty associated with longer clock insertion delay
• Power Challenge
– Clock tree can consume up to 35% total power in typical CPU
– Most power consumption is at leaf level drivers & flop clocks
• MS-CTS useful to address both challenges

[Figure: CLK driving a pre-mesh tree and pre-mesh drivers into a clock mesh that feeds ICGs, flops (FF) and RAMs]

Flow/Metrics     MS-CTS
Global Skew      53% smaller
Clock Levels     40% fewer
Dynamic Power    similar


Summary: Cortex-A75 Implementation Careabouts
Upfront Analysis of Design Inputs & Flows to Optimize for Best PPA

• Power: Analyze Library for Balanced Power/Performance/TAT
– Multi-Vt & gate-length
– OCV
– Sequential Multibit Flops
• Area: Determine an Optimum Floorplan
– Placement Controls
– Bounds
– Data Flow Analysis
– Macro Placement
• Performance: Manage Crosstalk and Optimization for Best Frequency
– Crosstalk Optimization
– Concurrent Clock & Data (CCD)
– PrimeTime Delaycalc
– Global Route-based Optimization


Summary and Next Steps



Results on Cortex-A75 CPU
Achieved Target PPA with Convergent, Repeatable Flow
[Chart: Cortex-A75 CPU PPA; TNS (ns), FMAX (% of final) and Leakage (% of final) across implementation stages DCG, place_opt, clock_opt, route_opt and Signoff ECO]

DC Graphical + IC Compiler II + PrimeTime ECO


Results on 8-core DSU
Customized Hierarchical Flow to meet PPA
[Chart: DSU PPA; TNS (ns), FMAX (% of final) and Leakage (% of final) across implementation stages DCG, place_opt, clock_opt, route_opt and Signoff ECO]

DC Graphical + IC Compiler II + PrimeTime ECO


Summary – A75 and A55
Synopsys Reference Implementations (RIs)
for Cortex-A75/-A55 are ready

• CPU and DSU flows
• TSMC 16nm FFC process
• Arm POP™ IP – core optimized standard cells & fast cache RAMs
• Complete implementation and static verification flows

Contact your Synopsys AC for additional information

RIs available on SolvNet (solvnet.synopsys.com/Arm-RI)

Summary – 8-Core DSU
Synopsys Reference Implementations (RIs) ready
for 8-core DSU with 4 A75s and 4 A55s

• 8-core DSU includes 4 x A75s and 4 x A55s as configured in the RI
• ETM-based hierarchical Reference Implementation flow
• Includes 2 MB L3 cache
• Complete implementation and static verification flows

[Figure: 8-core layout with A75 and A55 cores]

Contact your Synopsys AC for additional information regarding the QIK availability for the 8-core DSU

Conclusion: Arm + Synopsys
Continuing Close Collaboration To Benefit Our Customers

Complete implementation & static verification flows

Utilizing the most advanced Synopsys technologies

Providing maximum performance & minimum power on Arm’s Cortex-A75/-A55 processors


Thank You
