tb-01-gibbons-pres-snps


Best practices for high-performance, energy-efficient implementations of the latest Arm® processors
In 7-nanometer FinFET (7FF) process technology using Synopsys® Design Platform

Phil Morris, Arm


Alan Gibbons, Synopsys

June 12th 2018

© 2018 Synopsys, Inc. 1


Agenda

Arm-Synopsys Collaboration

Synopsys QuickStart Implementation Kits (QIKs) for Arm cores

Achieving Optimal PPA with Synopsys Design Platform

Summary

© 2018 Synopsys, Inc. 2 Copyright © 2018 by Synopsys. All rights reserved.


Arm + Synopsys
Rich History of Collaboration



Partnering for Arm Powered Products

Optimized Implementation
Synopsys Design Tools, QuickStart Implementation Kits (QIKs), High Performance Core Centers, Low Power Methodology, Lynx Design System

Interface & Foundation IP
For Arm AMBA® Interconnect: Synopsys DesignWare® I/F IP, coreTools for assembly, IP Designer for ASIPs
DesignWare & Arm Artisan IP for Standard Cells, Fast Memories, MBIST

High Performance Verification
VCS® RTL Verification & Verdi® Debug, VC Formal, SpyGlass® RTL Signoff, VIP for AMBA Interconnect, VC LP & Verification Methodology, ZeBu® HW-Assisted Verification − Hybrid Emulation

Physical Prototypes
HAPS® Support for Arm Cores, Connection to Juno Arm Development Platform, Prototyping Methodology

Hybrid Prototypes
AMBA Transactors

Virtual Prototypes
Virtualizer™ Tool Set & Development Kits (VDKs) for Pre-RTL SW Dev. Using Arm Fast Models (v7 & v8), Arch. Design with Platform Architect, Coverity & Defensics SW Signoff

[Diagram: the groups above surround a central flow labeled Optimized Implementation, HW Design & Verification, System Validation & HW/SW Integration, SW Dev., Software Stack.]

synopsys.com/Arm


“We collaborate early with Synopsys on design
enablement of our next-generation Arm Cortex® and
Arm Mali™ processors along with Arm Artisan®
libraries and Arm POP™ IP in the most advanced
process technologies using Synopsys’ Design
Platform tools. This enables our mutual customers
to get to market faster with optimized power
performance and area for their products.”
Kelvin Low
Vice President of Marketing
Physical Design Group (PDG)
Arm



Collaboration Enables Early Adopters of Arm’s Latest IP
Customer Tape-outs of Arm Cortex-A76 and Mali-G76


Images source:
arm.com



HiSilicon @ Arm TechCon 2017
Customer Success Example



Arm POP IP
Flexibility and Differentiation!

Artisan Physical IP: CPU optimized Physical IP
POP Reference Scripts: RTL-GDS scripts support for Synopsys
POP User Guide: Comprehensive implementation methodology
Artisan Architect Products: Design utilities to improve implementation
POP Landing Team Support

https://www.arm.com/products/physical-ip/pop-ip

Why use Arm POP IP?

Designer’s pain point: Long design cycle time
Arm POP solution: POP IP is a comprehensive, fully validated implementation solution for Arm processor cores. Combining physical IP, floor-plans & reference implementation scripts in one package helps reduce the design cycle
Benefit: Reduced design cycle time

Designer’s pain point: Technology & schedule risks
Arm POP solution: POP IP is developed and tuned in synergy with RTL over several iterations. All physical IP & implementation issues have been identified and solved by EAC date
Benefit: Reduced risks

Designer’s pain point: Non-optimized PPA
Arm POP solution: With proven track record, Arm POP IP delivers optimized PPA
Benefit: Optimized PPA


Co-Optimization of Process Technology with POP

[Figure: POP IP development timeline alongside processor development. POP IP Development alternates Physical IP Tuning and Flow Tuning stages and ends with final implementation within 6 weeks of Processor EAC. Other labels in the figure: Processor Implementation Trials, Identify Process Technology Differentiation, Quick Process Adoption, Benchmark Details, Showcase to Customers, Detail CPU Implementation, EAC.]

Arm POP IP implementation teams go through many iterations of flow and physical IP tuning to provide a complete implementation solution with optimized design for fast technology adoption.


Performance & Power Optimized
QuickStart Implementation Kits (QIKs)
for Arm Cores
Synopsys Design Platform



Agenda

Arm-Synopsys Collaboration

Synopsys QuickStart Implementation Kits (QIKs) for Arm cores

Achieving Optimal PPA with Synopsys Design Platform


• Performance
• Power
• Crosstalk

Summary



QuickStart Implementation Kits (QIKs)
• QIK – “QuickStart Implementation Kit” for the rapid implementation of high-performance, energy-efficient Arm CPUs in leading-edge technology
• Best starting point and most comprehensive solution for Arm CPU implementation with Synopsys EDA tools
• Created in collaboration with Arm for a specific core, configuration, constraints and Artisan physical IP
• Complete flow scripts + captured knowledge presentations
• Available for Arm Cortex-A76, -A75, -A55, -A73, -A72, -A53, -A57 etc.
• Easy to customize for project-specific data (constraints, floorplan, library)
• Download from SolvNet (www.synopsys.com/Arm)
• Contact Synopsys for
– QIKs for advanced Arm cores
– Expert services help, available from QuickStart to core hardening

[Chart: Downloads To-Date*, y-axis 0 to 400]
*Publicly announced Armv8 cores

Synopsys QIK – High Level View
DCG/ICC2 Based

Inputs: RTL, Libraries, Constraints
Synthesis: Design Compiler® Graphical
Physical Implementation: IC Compiler™ II (place_opt, clock_opt, route_opt)
IR Drop: RedHawk™
ECO: PrimeTime™
Verification and signoff: VC LP, PrimeTime™ SI, Formality®, TetraMAX® II


Cortex-A Class Cores – Implementation Challenges

Power: Analyze Library for Balanced Power/Performance/TAT
• Multi-Vt & Channel Length
• Power Gating
• OCV
• Multibit Banking

Area: Determine an Optimum Floorplan
• Placement Controls
• Macro Placement
• Bounds
• Cell Selection Analysis

Performance: Manage Crosstalk and Optimization for Best Frequency
• Crosstalk Optimization
• Concurrent Clock & Data
• Global Route Optimization
• PrimeTime Delaycalc


Included in QIK
Most Advanced Technologies from Synopsys Design Platform

DC Graphical
Meet Timing
• Enhanced physical guidance
• Enhanced layer-aware optimization
• Placement pre-clustering
• Non-default rule support
Reduce Power
• Timing-driven multibit register banking and de-banking
• Physical-aware clock gating
• Single-pass multi-Vt opt.

IC Compiler II
Meet Timing
• place_opt CCD
• Buffer-aware placement
• PrimeTime delay calc in route_opt
• Path-based opt. in route_opt
Reduce Power
• Incremental timing-driven multibit register banking and de-banking
• Level shifter (LS)/Enabled LS banking
• High effort leakage flow

PrimeTime
Meet Timing
• Path-based analysis (PBA)
• Clock skew ECO
• Physical-aware ECO
Reduce Power
• Leakage-aware timing ECO


Today’s Session Focus
Achieving Optimal PPA on Arm’s Latest Cores
Design details for the Power Optimized and Performance Optimized Advanced Processors
Config: Crypto; NEON & FP Unit
Routing: 13 Layer + AP w/ routing on M2-M11
Libraries/memories: Arm POP IP; Artisan 7nm SVT/LVT/ULVT C11/C8 standard cell libraries for data/clock
PVT corners (MCMM): Setup: RC Max: TT/1.0 V/85 °C; Setup & Power: TT/0.8 V/85 °C; Hold: RC Min: FFGNP/1.05 V/125 °C
DFT Strategy: Scan compression
UPF: 3 power domains with LS and ELS required
Technology: Power Optimized: 7nm, SCH240; Performance Optimized: 7nm, SCH300

Performance Tuning



Sub-Module Data Flow
Use DFA to Identify Data Flow & Guide Placement to Get Best FMAX

• For absolute best FMAX, look into processor data flow
– OOTB placement is good, but can be further tuned
– Guide module placement to better align with expected data flow

• Use ICC II Data Flow Analysis (DFA) to visualize connectivity
– Infer relative locations of cell groups with respect to RAMs and main functional modules
– Memory-aligned sequential groups
– Floating Point Unit (FPU) sub-modules
– Instruction Execution pipeline
– L1 Data Load and Store registers

[Figure: floorplan view with regions labeled Instruction Execution, L1 Data and L2 Data]


Sub-Module Data Flow
Module Placement with & without Sub-module Bounds @ place-opt

With sub-module bounds:
• More structure to Instruction, Execution and FPU logic
• More structure and grouping to L1 Data Cache logic

[Figure: module placement with bounds and without bounds, plus a chart comparing WNS, TNS and NVP for the "No bounds" and "Bounds" runs]

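The WNS, TNS and NVP metrics compared on this slide can be illustrated with a minimal Python sketch. It assumes the common definitions (worst negative slack, total negative slack, number of violating paths, counted here per endpoint); the function name and the slack values are hypothetical and are not taken from the presentation or from any tool report.

```python
from typing import Iterable, NamedTuple


class TimingSummary(NamedTuple):
    wns: float  # worst negative slack; 0.0 here when nothing violates
    tns: float  # total negative slack (sum of all negative slacks)
    nvp: int    # number of violating paths (one per violating endpoint)


def summarize_slacks(endpoint_slacks_ns: Iterable[float]) -> TimingSummary:
    """Reduce per-endpoint setup slacks to WNS, TNS and NVP."""
    violations = [s for s in endpoint_slacks_ns if s < 0.0]
    if not violations:
        return TimingSummary(0.0, 0.0, 0)
    return TimingSummary(min(violations), sum(violations), len(violations))


# Hypothetical slacks in ns, for illustration only.
summary = summarize_slacks([0.020, -0.010, 0.005, -0.030])
print(summary)
```

Comparing two such summaries, one per placement run, is one way to read a "bounds versus no bounds" experiment; reporting conventions for WNS when all slacks are positive vary, so the 0.0 used here is an assumption.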
Clock Skewing
Applied Throughout Implementation Flow for Best FMAX

• Concurrent clock and data (CCD) flow provides a good framework to boost QoR
– Increases FMAX and reduces total power
• Arm Cortex core architecture benefits from aggressive skewing even at synthesis
– Asymmetric RAM read/write paths
– Critical ICG enable paths
– Arithmetic pipelines with varying complexity
• To push QoR to the limit, pursue a full-flow approach to clock skewing
– Start @ pre-compile and adjust skews at each stage
– Target different architectures with different goals
– Do late-stage, WNS-focused CCD to push FMAX hard

[Diagram: Useful Skew / CCD applied across DC Graphical (Synthesis) and IC Compiler II (place_opt, clock_opt, route_opt)]


Clock Skewing Flow
Recommended if High FMAX is Primary QoR Target

DC Graphical
• Read RTL: key back-annotated offsets
• Compile: slack-based adjustment
• Compile -incr: slack-based adjustment

IC Compiler II
• Place_opt: CCD, slack-based adjustment
• Clock_opt: CCD
• Route_opt: CCD

PrimeTime ADV
• Clock path adjusted for timing ECO

[Chart: Clock Skewing (Post-CTS), RAM TNS (ns) and LS Leakage (mW), comparing "Only CCD" with "Manual skewing + CCD"]
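The idea behind a slack-based adjustment can be sketched with a simplified textbook model of useful skew. This is not the algorithm used by DC Graphical or IC Compiler II: it looks at one register, considers setup slack only (hold is ignored), and all names and numbers are hypothetical. Delaying a register's clock by some offset adds that amount of slack to the path ending at the register and removes it from the path starting there, so the balancing offset is half the slack difference.

```python
def balanced_clock_offset(slack_in_ps: float, slack_out_ps: float,
                          max_offset_ps: float) -> float:
    """Clock arrival offset (ps) that balances setup slack across one register.

    A positive offset means the clock arrives later: the path ending at the
    register gains slack and the path launched from it loses the same amount.
    The result is clamped to +/- max_offset_ps.
    """
    offset = (slack_out_ps - slack_in_ps) / 2.0
    return max(-max_offset_ps, min(max_offset_ps, offset))


def apply_offset(slack_in_ps: float, slack_out_ps: float, offset_ps: float):
    """Return the (input, output) setup slacks after shifting the clock."""
    return slack_in_ps + offset_ps, slack_out_ps - offset_ps


# Hypothetical register: 40 ps violation on the input side, 60 ps spare after it.
offset = balanced_clock_offset(-40.0, 60.0, max_offset_ps=100.0)
balanced = apply_offset(-40.0, 60.0, offset)
print(offset, balanced)
```

Repeating this kind of adjustment at each stage, with the slacks that stage reports, is one way to picture the "adjust skews at each stage" recommendation; the clamp stands in for whatever skew limit a project chooses.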


Power Optimization



Leakage Optimization in 7nm Process Technology

• At previous nodes, we needed a well-crafted recipe to maximize performance and minimize power
– Used library analysis to decide which VT / channel length classes were available at each optimization stage
• At 7nm, tool improvements and Artisan library characteristics result in a vastly simplified recipe

[Figure: two panels, 16nm and 7nm, plotting drive strength against VT types/channel lengths. Annotations: similar delay distribution; very different leakage distribution.]
Improved Leakage Optimization
Power Optimized CPU: 7nm Vt Class Trials

[Chart: TNS/Leakage progression across implementation. Series: TNS (ULVT), TNS (All VT), LKG (ULVT), LKG (All VT).]

Callouts:
• On Power Optimized CPU: 28% leakage power savings
• 4% higher FMAX
• 13% reduced leakage


ECO Objectives for Power Optimization
PrimeTime ECO Used to Maximize PPA

• Power ECO reduces leakage without impacting timing
– Uses footprint-compatible Vt/channel swaps
– Downsized for minimal power
• Timing ECO makes minor improvements on remaining critical paths

Flow:
1. ICC II Route_opt (ULVT/LVT/SVT)
2. StarRC
3. PrimeTime SI/PX: fix_eco_power –pattern ULVT/LVT/SVT, then fix_eco_timing
4. ICC II ECO
5. StarRC
6. PrimeTime SI/PX

[Chart: number of endpoints versus slack, from -ve slack to +ve slack]

Implement to a tighter clock period to generate positive TNS
Run PrimeTime ECO with margins to reduce power further

Reduce Power Further
Recover Unused Positive TNS

PT-ECO Result: Fachieved > Fconstrained

Flow:
1. StarRC
2. PrimeTime SI/PX: Apply Negative Uncertainty, then fix_eco_power and fix_eco_timing
3. ICC II ECO
4. StarRC
5. PrimeTime SI/PX

[Chart: endpoint slack distribution, from -ve slack to +ve slack, for negative uncertainty values of -5ps, -10ps and -15ps]

Reduce Power Further
Results: Dial-in Power vs. Performance

[Chart: Late Stage Power Savings by Relaxing FMAX. % Leakage Savings (0% to 100%) plotted against % FMAX (100% down to 70%), starting from the max frequency & leakage point. Callout: 29% leakage power savings @ 85% FMAX.]

Implement to a higher FMAX to generate positive TNS
Run PrimeTime ECO with negative margins to reduce power further & get to relaxed FMAX

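The arithmetic behind relaxing FMAX can be sketched as follows. This is an illustrative Python calculation, not part of the QIK: the 2.0 GHz implemented frequency is a hypothetical number (the slides give only percentages), and the 5 ps step simply mirrors the -5ps/-10ps/-15ps values shown on the previous slide.

```python
def period_ps(freq_ghz: float) -> float:
    """Clock period in picoseconds for a frequency in GHz."""
    return 1000.0 / freq_ghz


def recoverable_margin_ps(implemented_ghz: float, fmax_fraction: float) -> float:
    """Extra period made available by relaxing to a fraction of implemented FMAX."""
    return period_ps(implemented_ghz * fmax_fraction) - period_ps(implemented_ghz)


def uncertainty_steps_ps(total_margin_ps: float, step_ps: float) -> list:
    """Negative clock-uncertainty values, in equal steps, up to the margin."""
    count = int(total_margin_ps // step_ps)
    return [-(i * step_ps) for i in range(1, count + 1)]


# Hypothetical core implemented at 2.0 GHz and relaxed to 85% of that FMAX.
margin = recoverable_margin_ps(2.0, 0.85)
steps = uncertainty_steps_ps(margin, 5.0)
print(round(margin, 1), steps[:3], len(steps))
```

Under these assumptions the relaxed target frees roughly 88 ps of period, which a power ECO can spend in increments; how much leakage each increment actually recovers depends on the design and libraries and is not modeled here.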
RedHawk IR Drop Analysis Within ICC II
Fusion Technology Delivers Early & Accurate Feedback for Implementation

• Prevention & fixing with signoff engines empowers implementation engineers earlier in flow
• Perfect correlation to rail signoff since same engines are used for signoff
• Seamless and push-button integration saves schedule and reduces iterations
• QIK integration
– Configures RedHawk runs from existing ICC II setup files
– Defines scenarios in which to do analysis and load apl files
– Loads timing/physical data
– Creates tap_layers: needed to define physical PG connections
– Runs rail analysis

[Diagram: Implementation Power Integrity Convergence: Design Compiler Graphical and IC Compiler II with RedHawk Analysis Fusion (Block). Signoff: StarRC and PrimeTime. Power Integrity Signoff: RedHawk (Block & Full-chip).]

QIK provides complete script to setup and run RedHawk rail analysis within ICC II
Template structure enables easy modification for different libraries/CPUs

Crosstalk Mitigation



Crosstalk Mitigation
Crosstalk Has to be Managed throughout Implementation Flow

Crosstalk issues stem from insufficient routing resources

Routing resources in Arm cores are dictated by floorplan and local cell density

Placer controls local cell density to balance timing and congestion

Timing and congestion have to be measured & managed from synthesis onwards

Often, mismanaged placement creates intractable crosstalk challenges

Start your crosstalk mitigation with a proper analysis of congestion and cell/pin density…

Crosstalk Mitigation
Managing Density and Congestion

• In advanced node designs, crosstalk effects are often related to pin density effects
• Manage placement densities through cell clumping and spreading
– Automated mode in ICC II
– Line up DCG to the same value as in ICC II
• Additionally, set a target routing density
• Finally, use pin density auto control in ICC II
– Check tool logs to see what value is being used
• We fine-tuned these parameters throughout

Tool    Density and Congestion Settings            Value
DCG     placer_max_cell_density                    70
ICC II  place.coarse.max_density                   70
DCG     target_routing_density                     60
ICC II  place.coarse.target_routing_density        60
ICC II  place.coarse.pin_density_aware             true
DCG     set_congestion_options –max_util           80
ICC II  place.coarse.congestion_driven_max_util    80
DCG     placer_enable_enhanced_router              true
ICC II  place_opt.congestion.effort                high

Use automated mode for all parameters until you decide to push it one way or another
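The advice to line up DCG with the same values as ICC II can be checked with a small bookkeeping script. The sketch below is plain Python, not tool Tcl: the setting names and values are copied from the slide's table as read from the extracted text, while the dictionaries, the pairing list and the helper function are hypothetical.

```python
# Values as listed on the slide; units are as written there.
DCG_SETTINGS = {
    "placer_max_cell_density": 70,
    "target_routing_density": 60,
    "set_congestion_options -max_util": 80,
}

ICC2_SETTINGS = {
    "place.coarse.max_density": 70,
    "place.coarse.target_routing_density": 60,
    "place.coarse.congestion_driven_max_util": 80,
}

# Which DCG setting is meant to track which ICC II setting (assumed pairing).
PAIRS = [
    ("placer_max_cell_density", "place.coarse.max_density"),
    ("target_routing_density", "place.coarse.target_routing_density"),
    ("set_congestion_options -max_util", "place.coarse.congestion_driven_max_util"),
]


def mismatched_pairs(dcg: dict, icc2: dict, pairs: list) -> list:
    """Return the (DCG, ICC II) setting pairs whose values differ."""
    return [(d, i) for d, i in pairs if dcg[d] != icc2[i]]


print(mismatched_pairs(DCG_SETTINGS, ICC2_SETTINGS, PAIRS))
```

A check like this could run before synthesis and placement so that a value tuned in one tool is not silently left behind in the other.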


Crosstalk Mitigation
Classify Crosstalk

• Pure standard cell crosstalk from high density: fixed with placer settings and CCD
• Long path related to RAM connections: fixed with blockage settings and bounds
• RAM channel crosstalk: fixed with blockage settings
• Long paths in main data flow (L2 to L1): fixed with bounds and placer settings

There are different kinds of crosstalk in an Arm core
Understanding what types you have will show you how to fix them

Crosstalk Mitigation
Measure Crosstalk Reduction

[Figure: crosstalk effect at route_auto and at route_opt (final), with a chart of TNS (axis 100 to 400) for SI and Non-SI analysis at route_opt]
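One way to quantify the crosstalk effect shown here is to compare TNS with and without signal-integrity (SI) analysis at each stage. The Python sketch below assumes that definition, treats TNS as a positive magnitude as on the chart axis, and uses made-up numbers; none of the values are results from the presentation.

```python
def crosstalk_effect(tns_si: float, tns_non_si: float) -> float:
    """Portion of the TNS magnitude attributable to crosstalk (SI minus non-SI)."""
    return tns_si - tns_non_si


def reduction_percent(before: float, after: float) -> float:
    """Percentage reduction from an earlier value to a later one."""
    return 100.0 * (before - after) / before


# Hypothetical TNS magnitudes in ns, for illustration only.
effect_route_auto = crosstalk_effect(tns_si=350.0, tns_non_si=150.0)
effect_route_opt = crosstalk_effect(tns_si=120.0, tns_non_si=90.0)
print(effect_route_auto, effect_route_opt,
      reduction_percent(effect_route_auto, effect_route_opt))
```

Tracking this difference from route_auto to the final route_opt separates gains that come from crosstalk mitigation from gains that come from ordinary timing optimization.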


DFT and ATPG



DFT & ATPG Support

• QIKs include complete DFT implementation
– Scan flop insertion
– Scan structure identification within Artisan memories and synchronization registers
– Automated support for scan insertion integrated with register banking/de-banking, and functional shift register identification/reuse
– DFTMAX compression logic insertion, utilizing shared I/O to optimize testing of replicated CPU blocks

• New to this QIK is inclusion of a full set of ATPG scripts
– Support added for TetraMAX and TetraMAX II
– ATPG flow supports power-aware stuck-at and transition fault testing

Makefile: includes TetraMAX executable ATPG target
tmax_setup.tcl: configures ATPG runs from existing ICC II setup files
Multiple ATPG functions enabled for use inside QIK:
Tmax.tcl, Tmax_maxtb.tcl, Tmax_stuck.tcl, Tmax2_stuck.tcl, Tmax_update_patterns.tcl, Tmax_update_spf.tcl, Tmax_diagnosis.tcl, Tmax_debug_config.tcl, Tmax_debug_patterns.tcl

DFTMAX Shared I/O Architecture

[Diagram: single top-level CODEC with I/O shared to identical CPU cores. Blocks: big Core, LITTLE Core (A55), DSU. Scan channels are numbered 1 to 24 and 25 to 32 at the top level, and 1 to 21 and 21 to 24 within the blocks.]


ATPG Results
Significantly Reduced Test Times with Improved Coverage using TetraMAX II

[Chart: ATPG Coverage (axis 95% to 100%) for the power-optimized and performance-optimized cores. Series: Stuck-at (TMAX), Stuck-at (TMAX II), Transition (TMAX), Transition (TMAX II).]
[Chart: Test Time, % test time reduction (axis 0% to 90%) for the power-optimized and performance-optimized cores.]


Summary



Results on Power-optimized CPU
Met FMAX, Leakage Power 49% Below Target

[Chart: TNS (ns, axis 0 to 80), % of final FMAX and % of final leakage (axis 0% to 140%) across implementation stages: Synthesis, place_opt, clock_opt, route_opt, Signoff ECO 1, Signoff ECO 2.]

DC Graphical + IC Compiler II + PrimeTime ECO

Results on High-performance CPU
Met FMAX, Leakage Power 9% Below Target

[Chart: TNS (ns, axis 0 to 120), % of final FMAX and % of final leakage (axis 0% to 300%) across implementation stages: Synthesis, place_opt, clock_opt, route_opt, Signoff, Signoff ECO1.]

DC Graphical + IC Compiler II + PrimeTime ECO

Conclusion: Arm + Synopsys
Continuing Close Collaboration to Benefit our Customers

QuickStart Implementation Kits (QIKs)

Complete implementation & static verification flows

Utilizing the most advanced technologies in Synopsys Design Platform together with Arm Artisan physical IP

Providing optimal PPA on Arm’s advanced processors


Thank You
