
High-Performance Arm®-based CPU Implementation for Mobile Devices
Using Synopsys Design Platform with Fusion Technology

Brian Millar, Senior Principal Engineer, CPU Physical Implementation


October 23rd 2018
Agenda

Introduction - Brief History of Samsung Austin R&D Center (SARC)

Synthesis + Implementation Flow Overview

Key Techniques in Physical Implementation

Other Synopsys Technologies for the Next Project

Conclusion

SNUG 2018 © Copyright Synopsys 2018, All Rights Reserved 2




Introduction

• Founded in 2010 to develop high-performance, low-power, complex CPU and System IP architectures and designs for Samsung’s System LSI division.

• The San Jose Advanced Computing Lab (ACL) was opened in 2017, and GPU IP (leveraging 5 years of initial development from another division) was added to the joint charter.

Overview: CPU & System IP
• Public/Commercial Accomplishments
– Shipping in Galaxy S7, S8, S9 Premium Smartphones

• Current Focus
– Next 4 generations of Premium Smartphones
– Alternative markets

Initial Hardening of Arm licensed cores
• Specification
– Full implementation of the Armv7-A architecture Arm® Cortex®-A15 CPU
– Superscalar, variable-length, out-of-order pipeline
– Dynamic branch prediction with Branch Target Buffer (BTB) and Global History Buffer
– Translation Lookaside Buffers (TLBs) for instruction, data loads, data stores
– 4-way set-associative 512-entry L2 TLB per processor
– 32KB L1 instruction & data caches, shared L2 cache
– Arm AMBA® 4 AXI Coherency Extensions master interface
– Accelerator Coherency Port (ACP) implemented as AXI3 slave interface
– VFP floating point unit & Neon media processing engine

• Configurable elements
– # Arm® Cortex®-A15 cores (1 to 4)
– Size of L2 cache (up to 4 MB)
– [Optional] VFP, Neon engines

First Generation: M1, SCI, and SMC Highlights
• Successfully delivered first-generation Samsung proprietary CPU, coherent interconnect and memory controller
– Armv8-A 64-bit 4-wide out-of-order CPU with shared 2MB L2 cache

• Time-to-market
– Built a team and accomplished this in 3.5 years (others struggle to do it in 5 years)
– Set a solid capability and product foundation for the future

• High quality
– No validation gating bugs for EVT0, no production gating bugs for EVT1

• Significant generational improvements: Galaxy S7 over Galaxy S6

• Industry landscape
– Our first generation CPU was competitive

Third Generation: M3, SCI, and SMC Highlights
• Successfully delivered differentiated Samsung proprietary CPU, coherent interconnect and memory controller
– Armv8-A 64-bit 6-wide out-of-order CPU with shared 4MB L3 cache

• Time-to-market
– Built a team and accomplished this in 3.5 years (others struggle to do it in 5 years)
– Set a solid capability and product foundation for the future

• High quality
– No validation gating bugs for EVT0, no production gating bugs for EVT1

• Significant generational improvements: Galaxy S7 over Galaxy S6

• Industry landscape
– Our first generation CPU was competitive

Callouts: Detailed disclosure planned for Hot Chips 2018; Balanced Single-Threaded performance and Power Efficiency

Challenges and Solutions

• Our team does core hardening for our proprietary Arm instruction set
compliant processor cores used in SoC designs throughout Samsung

• PPA (Performance/Power/Area), runtime and repeatability/predictability are key metrics

• We’ve invested in several technologies in the Synopsys Design Platform with Fusion Technology plus internal methodologies to differentiate our PPA





SARC Implementation Flow

Main flow: Gensys -> Design Compiler Graphical -> ICC II Design Planning -> ICC II Implementation -> StarRC -> PrimeTime -> Signoff DRC/LVS
Supporting tools: TetraMAX II, Formality, IC Validator

New Fusion Technology™ inside Synopsys Design Platform

SARC Implementation flow

Diagram elements: Design Compiler Graphical; IC Compiler II; Test; Design Fusion; Test Fusion; Signoff Fusion; ECO Fusion; PT, StarRC, PTPX, ICV, RedHawk; Fusion Data Model



Design Compiler Flow Overview
• set_app_vars
– set_compile_spg_mode ICC II
– compile_layer_aware_optimization
– compile_timing_high_effort_tns
– compile_timing_high_effort
– psynopt_tns_high_effort
• Load MW/Target/Link libraries
• Read RTL
– analyze
– elaborate
• set_clock_gating_style
• Load UPF
• Read SDC
• Source rp.tcl
• Read path groups and weights
• Read physical constraints
– set_ignored_layers
– set_preferred_routing_direction
– create_bounds
– create_placement_blockages
• Scan constraints
– test_scan_enable_port_naming_style
– set_dft_signal
– set_scan_configuration
• compile_ultra -scan -gate_clocks -spg
• insert_dft
• compile_ultra -incremental -scan -spg
• reporting & ddc/verilog/sdc/upf

ICC II Implementation Flow Overview
ICC II Design Planning
❖ Structured datapath
❖ Bounds
❖ Partitioning
❖ Budgeting
❖ Pin cutting

Constraints & Scenarios
• Timing constraints
• Scenario creation

Place_opt
• ICG merge/split
• Automatic timing control (ATC)
• Z-buf/NPO & rebuffering
• Layer opt/GRLB/Auto-NDR
• Multi-Vt optimization
• MBIT flow
• Leakage + SAIF based total power opt
• RDE/SI Estimation

Clock mesh & clock_opt
• ICG merge/split optimization
• Route clock routes
• Post Clock datapath optimization
• CTS Useful Skew

Useful Skew implementation
• ICG Sizing to implement useful skew
• Skew buffer insertion between ICG and FF
• Global route based optimization

route & route_opt
• Enable additional MCMM scenarios
• CCD with clock mesh
• Route_opt for downsizing
• Route_opt for in place size only
• Ansys Redhawk integration (Fusion)
• ICV ADRC fixing & metal fill

PrimeTime and ECO Flow Overview
Flow: IC Compiler II -> StarRC -> PrimeTime/PrimeTime PX -> PT ECO

ECO capability in PrimeTime
▪ Setup/hold fixing
▪ Useful skew
▪ Sequential cell swap/optimization
▪ Leakage recovery
▪ Dynamic power reduction
▪ Redundant buffer removal
▪ max_tran/max_cap/noise for DRC fixing
▪ Additional hold fixing

StarRC
• GPD format from StarRC to PrimeTime
• SMC (simultaneous multi-corner)

PrimeTime/PrimeTime PX
• Analyze_subcircuit for spice analysis of clock mesh
• Statistical via computation for 7nm and lower
• IVD failure analysis in PrimeTime
• Ansys Redhawk integration (Fusion)

fix_eco_timing -physical_mode open_site/occupied_site -type setup/hold
fix_eco_timing -cell_type clock_network (for useful skew)
fix_eco_timing -cell_type sequential (for flop swap)
fix_eco_power -pba_mode exhaustive -power_mode leakage (for leakage recovery)
fix_eco_power -pba_mode exhaustive -power_mode dynamic (for dynamic power reduction)
fix_eco_power -methods remove_buffer (for redundant buffer removal)
fix_eco_drc -type max_tran/max_cap/noise (for DRC fixing)
fix_eco_hold_timing (fix setup to allow more hold fixing)

Implementation Flow
Key Technologies Used

Design Compiler
• Topographical synthesis
• UPF power intent
• Physical datapath for structured placement
• Power-aware clock gate (ICG) insertion
– Fanout control
– Always-on ICG’s
• Placement bounds
• Path groups, critical range, weights

IC Compiler II
• Design planning
• ICG merge/split prior to Place_opt
• ATC/Zbuf/NPO/rebuffering
• Mbit flow
• Leakage recovery
• SAIF based TPO
• Route_opt CCD
• Useful skew
• Clock mesh flow
• MCMM
• RDE / SI Prediction

PrimeTime
fix_eco_timing
fix_eco_timing -cell_type clock_network
fix_eco_timing -cell_type sequential
fix_eco_power -power_mode leakage
fix_eco_power -power_mode dynamic
fix_eco_power -methods remove_buffer
fix_eco_drc -type max_trans/cap/noise
fix_eco_hold_timing
PBA mode exhaustive



Key Techniques in Physical Implementation

• Clock Mesh Design
• Useful Skew + CCD
• Multi-bit
• Structured Datapath & Relative Placement
• Hierarchical Design
• ICC II to PrimeTime Correlation
• Redhawk Fusion (formerly in-Design)
• ATC
• RDE
• Via Ladder



Consideration for CTS vs. Mesh

Implementation Complexity
CTS: Flow mature and very easy to implement.
Mesh: More time consuming; the mesh has to be carefully planned around the floorplan. ICC still maturing. We ran into issues along the way, but worked around them. STA requires a special flow.

Skew
CTS: CTS will generally have higher global/local skews for any sizable design.
Mesh: Mesh will have tighter skews.

Max Freq.
CTS: Will tend to be lower because of higher skew.
Mesh: Since skew is better (particularly local skew), less time is wasted and higher frequency is achievable.

PVT tracking
CTS: CTS tries to match delays using interconnect & gate delay. CTS in MCMM mode may reduce exposure, but in general paths won’t track as well across different corners.
Mesh: If the structure is built very regularly, it should track across PVT much better.

POCV Margin
CTS: Paths in general will have larger non-common depth -> higher POCV. Also, paths are dissimilar, so traditional POCVM margin is justified.
Mesh: Mesh is very structurally similar. Much lower POCVM.

Power
CTS: Dynamic power can be similar between CTS & Mesh depending on structure of MESH.
Mesh: Dynamic power is similar between CTS & Mesh. Tighter skew will result in less % LVt usage (TNS).

Implementing Useful Skew
CTS: ICC compiler has built in useful skew ability, and “pulling” flops earlier is conceptually easier. CCD compatible with CTS.
Mesh: Relatively easy to implement “push” flops, but “pull” is more difficult and costly for power. CCD more difficult with Mesh.

Flexibility
CTS: CTS is built to be flexible. Can work around macros, metal layer restrictions and tight corners; can easily adapt to any floorplan and useful skew.
Mesh: Requires more careful planning. Metal layer conflicts with power grid. Driver conflicts with macro placement. Tuning can help recover some of the loss.

Clocks and Margins
• Combination of CTS based and Mesh based clock tree structure
– High Speed clocks in CPU and Non-CPU use separate meshes
– Non-CPU has CTS trees for slower speed functional/debug/test clocks

• Meshes are built with identical metal topology, drivers, x&y pitch, etc.
– Allows very tight tracking at different PVT between meshes and OCV minimized
– Non-CPU meshes and CPU meshes will track more closely across PVT corners
and variation components, but each mesh has its own load distribution so there
will be some slight imperfection in their tracking (we can tweak sizes)

• INTRA-MESH & INTER-MESH Margining
1. Random/Gradient Cell Variation
2. Temperature Variation
3. Fatigue / Reliability Component
4. Voltage Variation
5. Random/Gradient Interconnect Variation

Different variations of mesh widths, pitches, driver sizes, etc., were plotted for power and skew to identify the best solution for implementation

Clock H-tree / MESH
1 Create clock H-tree/MESH wires
SARC Custom tool -> ICC II commands

create_shape -shape_type path -layer N -net Cr_MESH -width XXXX \
    -shape_use user_route -path [list {X1 Y1} {X2 Y2}] \
    -start_endcap variable -start_extension XXX \
    -end_endcap variable -end_extension XXXX

create_via -net Cr_MESH -no_snap -via_def M_XXX_HV \
    -shape_use user_route -origin {X1 Y1} -orientation R90

create_clock_mesh -net {ck_gclkcr_MESH} \
    -layers {xx yy} -widths {0.x 0.y} -lower_left {134.9 149.4} \
    -pitches {230.4 241.8} \
    -bounding_box {{5.0 5.0} {2095.0 1895.0}}

Figures: Clock driver & H-tree routing; Mesh Straps
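
As a rough illustration of how the -lower_left, -pitches and -bounding_box values in the create_clock_mesh example could translate into strap locations, the Python sketch below enumerates candidate coordinates. It assumes the first strap in each direction sits at the lower-left value and that straps repeat at the given pitch up to the bounding-box edge. The helper name strap_positions is hypothetical, and the tool's real placement rules may differ.

```python
def strap_positions(start, pitch, upper):
    """Return coordinates start, start + pitch, ... that do not exceed upper."""
    positions = []
    k = 0
    while start + k * pitch <= upper + 1e-9:
        positions.append(round(start + k * pitch, 4))
        k += 1
    return positions


# Values copied from the create_clock_mesh example on the slide.
lower_left = (134.9, 149.4)
pitches = (230.4, 241.8)
bbox_upper = (2095.0, 1895.0)

# Assumption: first value of each pair applies to x, second to y.
x_straps = strap_positions(lower_left[0], pitches[0], bbox_upper[0])
y_straps = strap_positions(lower_left[1], pitches[1], bbox_upper[1])

print(len(x_straps), "positions along x:", x_straps)
print(len(y_straps), "positions along y:", y_straps)
```

A table like this is a quick sanity check that the mesh covers the block before comparing it against macro and power-grid locations.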

2 Create H-tree/MESH drivers
SARC Custom tool -> ICC II commands

create_cell Cr_L0_1 [get_lib_cell MESH_DRVxxx]
set_cell_location -orientation N -coordinates X1,Y1
set_attribute [get_cells DRVxx] dont_touch true
set_attribute [get_cells DRVxx] physical_status fixed

Figure: Snake routing to adapt to floorplan

Clock H-tree / MESH

3 Push down to block level
SARC Custom tool -> ICC II commands
The SARC custom tool does the top->block translation, but the ICC II method is:
push_down_objects [list [get_nets -of [get_shapes \
    -filter "net_type==signal||net_type==clock"]]]

4 Route Fishbones

set_routing_rule [get_nets Cr_MESH] -rule XX \
    -min_routing_layer X -max_routing_layer Y \
    -max_layer_mode allow_pin_connection

set_app_options -list {route.common.comb_distance 2}

route_clock_straps -nets [get_nets Cr_MESH] -topology fishbone \
    -fishbone_fanout 10 -fishbone_layers {X Y} -fishbone_sub_span 20

set_attribute [get_nets $main_clock_net] -name physical_status -value locked

Clock H-tree / MESH

5 Stamp Spice timing on MESH loads

Call in ICC II:
analyze_subcircuit -from $from_pin -to $to_net -clock Cr \
    -spice_header_files ./spice_files/HSIM_lib.txt \
    -driver_subckt_files \
    "./spice_files/XX.spf ./spice_files/YY.spf" \
    -starrcxt_nxtgrd_file ./star_files/star.nxtgrd \
    -starrcxt_map_file ./star_files/star.map \
    -name mesh_Cr

OR

Call to PT w/STAR SPEF (signoff):
sim_analyze_clock_network -from $from -to $to -use_probe ....

OR

Avoid Spice and stamp fake timing early in the development phase

Annotations stamped on the mesh loads:
set_annotated_transition XXX -rise U1_clkgate/CK
set_annotated_transition YYY -fall U1_clkgate/CK
. . . .
set_annotated_delay XXX -from cr_L4_1/Y -to U1_clkgate/CK -net -rise
set_annotated_delay XXX -from cr_L4_1/Y -to U1_clkgate/CK -net -fall
. . . .
set_annotated_delay XXX -from cr_L4_1/A -to cr_L4_1/Y -cell -rise
set_annotated_delay XXX -from cr_L4_1/A -to cr_L4_1/Y -cell -fall
. . . .
# disable all but one driver
set_disable_timing [get_timing_arcs -to cr_L4_*/Y]
remove_disable_timing -from A -to Y cr_L4_10

Clock H-tree / MESH
Flow stages: Design Compiler Graphical -> ICC II Design Planning -> Constraints & Scenarios -> Place_opt -> Clock mesh & clock_opt -> Useful Skew implementation -> route & route_opt -> PrimeTime Signoff STA
Figure labels: Mesh, Htree, Fishbones

Place_opt has the following steps:
• Merge / Split
synthesize_multisource_clock_subtrees -from merge -to merge
synthesize_multisource_clock_subtrees -from optimize -to optimize

Clock mesh & clock_opt has the following steps:
• Merge / Split
• Size ICGs per load/transition time spec
• Cluster ICGs in placement bounds (near PG switches)
• Align ICGs to minimize fishbone route
• ICG → Flop clock routes
• Building CTS trees
• Propagated mode optimization
synthesize_multisource_clock_subtrees -from merge -to merge
synthesize_multisource_clock_subtrees -from optimize -to optimize
synthesize_multisource_clock_subtrees -from route_clock -to route_clock
synthesize_multisource_clock_subtrees -from refine -to refine
route_group -all_clock_nets -reuse_existing true
clock_opt -from final_opto

Other Clock Routing

MS CTS Clock distribution


Post MESH clock routing: Clock Gater -> flops

Useful Skew
Before Scheduling (frequency target = 1.0GHz, 1000ps)
Path delays: 1100ps -> FF1 -> 700ps -> FF2 -> 1050ps -> FF3 -> 900ps
Path slacks: -100ps, +300ps, -50ps, +100ps
FF1: Front Slack = -100ps, Back Slack = +300ps
FF2: Front Slack = +300ps, Back Slack = -50ps
FF3: Front Slack = -50ps, Back Slack = +100ps
Actual period is (1000+100)ps, about 0.9GHz

After Scheduling (both WNS & TNS improvement)
Effective path delays: 1000ps -> FF1 -> 850ps -> FF2 -> 1000ps -> FF3 -> 900ps
Path slacks: 0ps, +150ps, 0ps, +100ps
FF1: Front Slack = 0ps, Back Slack = +150ps
FF2: Front Slack = +150ps, Back Slack = 0ps
FF3: Front Slack = 0ps, Back Slack = +100ps
Actual period is (1000+0)ps = 1.0GHz
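
The slack numbers in this example can be reproduced with a few lines of Python. The sketch below is an illustration only: the clock offsets (+100ps on the first flop, -50ps on the second) are inferred from the before/after numbers rather than stated on the slide, the boundary endpoints are assumed to have zero offset, and the function names are hypothetical.

```python
PERIOD_PS = 1000  # 1.0 GHz target from the slide


def stage_slacks(path_delays, clock_offsets, period=PERIOD_PS):
    """Setup slack of each path between consecutive clocked endpoints.

    path_delays[i] is the data delay of path i. clock_offsets[i] and
    clock_offsets[i + 1] are the clock arrival offsets at its launch and
    capture points (positive means the clock arrives later).
    """
    slacks = []
    for i, delay in enumerate(path_delays):
        launch, capture = clock_offsets[i], clock_offsets[i + 1]
        slacks.append(period + capture - launch - delay)
    return slacks


def achieved_ghz(slacks, period=PERIOD_PS):
    """Frequency in GHz implied by the worst negative slack, if any."""
    wns = min(0, min(slacks))
    return 1000.0 / (period - wns)


delays = [1100, 700, 1050, 900]

# Offsets: [input boundary, flop 1, flop 2, flop 3, output boundary]
before = stage_slacks(delays, [0, 0, 0, 0, 0])
after = stage_slacks(delays, [0, 100, -50, 0, 0])  # inferred offsets

print("before:", before, round(achieved_ghz(before), 2), "GHz")
print("after: ", after, round(achieved_ghz(after), 2), "GHz")
```

Delaying a flop's clock lends time to the path in front of it and takes the same amount from the path behind it, which is why the total slack is redistributed rather than created.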

Useful Skew / Concurrent Clock and Data
Flow stages: Design Compiler Graphical -> ICC II Design Planning -> Constraints & Scenarios -> Place_opt -> Clock mesh & clock_opt -> Useful Skew implementation -> route & route_opt -> PrimeTime Signoff STA

Architectural useful skew (IDEAL MODE)
set_clock_latency XXX -rise U1REG/CK

SARC MESH based Useful Skew Scheduler/Implementer (PROPAGATED MODE)
• Algorithmic (multi-level aware): single corner/multi-corner
• Useful skew for hold
• Useful skew for power

ICC II CCD Capability
set_app_options -list {route_opt.flow.enable_ccd true}
set_app_options -list {ccd.select_optimization_moves auto/size_only}
set_app_options -list {ccd.critical_slack_percent 0.90}
set_app_options -list {ccd.tns_cost_factor 1}
route_opt

PrimeTime Signoff STA
fix_eco_timing -cell_type clock_network

Investigating CCD Earlier in Flow
Flow stages: Design Compiler Graphical -> ICC II Design Planning -> Constraints & Scenarios -> Place_opt -> Clock mesh & clock_opt -> Useful Skew implementation -> route & route_opt -> PrimeTime Signoff STA

Architectural useful skew (IDEAL MODE)
set_clock_latency XXX -rise U1REG/CK

SARC MESH based Useful Skew Scheduler/Implementer (PROPAGATED MODE)
• Algorithmic (multi-level aware): single corner/multi-corner
• Useful skew for hold
• Useful skew for power

ICC II CCD Capability
set_app_options -list {route_opt.flow.enable_ccd true}
set_app_options -list {ccd.select_optimization_moves auto/size_only}
set_app_options -list {ccd.critical_slack_percent 0.90}
set_app_options -list {ccd.tns_cost_factor 1}
route_opt

Advantages and Disadvantages of Multi-Bit in a Clock Mesh Flow

Advantages
- Lower internal clock power
- Less CK pin cap on ICG
- Fewer and smaller ICGs on MESH
- Less MESH power
- Less SI and SE routing
- Less hold buffering (internal bits correct by construction)

Disadvantages
- Strength/Vt type for most critical bit
- Limits useful skew flexibility
- RTL SAIF / mapping
- Very disruptive when splitting

General Multi-Bit Flow
Flow diagram: Initial placement -> Grouping and banking -> Optimization -> Debanking, shown against the place_opt, clock_opt, route and route_opt stages; inputs are the multibit reg map file and the register group file

• MBIT banking/de-banking available in ICC II
• Slack threshold per timing group can be specified to be considered for banking
• Only registers with the same latency values are merged
• SVF file will be updated for formal verification
• Banking ratio of 85% achieved on most blocks
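
The two eligibility rules above (a slack threshold and matching clock latency) can be illustrated with a small Python sketch. This is not the ICC II algorithm: the register records, the fixed bank width of 4, the helper names, and the definition of banking ratio as banked bits over total bits are all assumptions made for illustration. A real flow also weighs placement proximity, scan chains and available library cells.

```python
from collections import defaultdict


def plan_banks(registers, slack_threshold, width=4):
    """Group register names into candidate multi-bit banks.

    A register is eligible only if its slack meets the threshold, and only
    registers sharing the same clock latency are banked together.
    """
    by_latency = defaultdict(list)
    for reg in registers:
        if reg["slack"] >= slack_threshold:
            by_latency[reg["latency"]].append(reg["name"])

    banks = []
    for latency in sorted(by_latency):
        names = by_latency[latency]
        # Only emit full banks; leftovers stay single-bit.
        for i in range(0, len(names) - width + 1, width):
            banks.append(names[i:i + width])
    return banks


def banking_ratio(banks, total_bits):
    """Fraction of register bits that ended up in a multi-bit bank."""
    return sum(len(bank) for bank in banks) / total_bits


# Hypothetical data: ten registers, two latency groups, one timing-critical bit.
regs = [
    {"name": f"r{i}", "latency": 0 if i < 5 else 50, "slack": 100}
    for i in range(10)
]
regs[2]["slack"] = 5  # below the threshold, so it stays single-bit

banks = plan_banks(regs, slack_threshold=20)
print(banks)
print("banking ratio:", banking_ratio(banks, len(regs)))
```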

Multi-Bit Conversion at the end of place_opt
Figure panels: Before multi-bit conversion; After multi-bit conversion
Figure labels: 129 single-bit flop loads; 43 multi-bit flop loads; ICG

ICC II considers:
• Slack
• Useful skew
• Scan chains
• Transition
• Capacitance

Multi-Bit Flow
Flow stages: Design Compiler Graphical -> ICC II Design Planning -> Constraints & Scenarios -> Place_opt -> Clock mesh & clock_opt -> Useful Skew implementation -> route & route_opt -> PrimeTime Signoff STA

Single Bit Only (multi-bits dont_use)
set_attr [get_lib_cell */* -filter "multibit_width > 1"] dont_use true

Single Bit Only (multi-bits dont_use)
create_placement -timing_driven
place_opt -from initial_drc -to initial_drc
create_placement -timing_driven -effort high -use_seed_locs -congestion
place_opt -from initial_opto -to final_place
save_block place_opt

Enable Multi-bit
set_attr [get_lib_cell */* -filter "multibit_width > 1"] dont_use false
identify_multibit -register -no_dft -input_map_file $map_file \
    -slack_threshold $slack -output_file $mbit_bank.tcl \
    -exclude_instance $excluded_regs
source $mbit_bank.tcl
report_multibit
save_block place_opt_mbit

create_multibit -name ldSpecVal_e2_reg_ldVprn_e2_reg_7A_ldVprn_e2_reg_6A_ldVprn_e2_reg_5A \
    { ufpb_grp/u_fpu_load_xpose/ldSpecVal_e2_reg
      ufpb_grp/u_fpu_load_xpose/ldVprn_e2_reg_7A
      ufpb_grp/u_fpu_load_xpose/ldVprn_e2_reg_6A
      ufpb_grp/u_fpu_load_xpose/ldVprn_e2_reg_5A } -lib_cell lib/H2V2X………

Investigating Merging/Splitting throughout flow

Flow stages: Design Compiler Graphical -> ICC II Design Planning -> Constraints & Scenarios -> Place_opt -> Clock mesh & clock_opt -> Useful Skew implementation -> route & route_opt -> PrimeTime Signoff STA

Single Bit Only (multi-bits dont_use)
set_attr [get_lib_cell */* -filter "multibit_width > 1"] dont_use true

Single Bit Only (multi-bits dont_use)
create_placement -timing_driven
place_opt -from initial_drc -to initial_drc
create_placement -timing_driven -effort high -use_seed_locs -congestion
place_opt -from initial_opto -to final_place
save_block place_opt

Enable Multi-bit
set_attr [get_lib_cell */* -filter "multibit_width > 1"] dont_use false
identify_multibit -register -no_dft -input_map_file $map_file \
    -slack_threshold $slack -output_file $mbit_bank.tcl \
    -exclude_instance $excluded_regs
source $mbit_bank.tcl
report_multibit
save_block place_opt_mbit

create_multibit -name ldSpecVal_e2_reg_ldVprn_e2_reg_7A_ldVprn_e2_reg_6A_ldVprn_e2_reg_5A \
    { ufpb_grp/u_fpu_load_xpose/ldSpecVal_e2_reg
      ufpb_grp/u_fpu_load_xpose/ldVprn_e2_reg_7A
      ufpb_grp/u_fpu_load_xpose/ldVprn_e2_reg_6A
      ufpb_grp/u_fpu_load_xpose/ldVprn_e2_reg_5A } -lib_cell lib/H2V2X……

Structured Datapath with Relative Placement
Can build up RP definition hierarchically
Example: RP script

create_rp_group rp1 -design top -columns 2 -rows 1


add_to_rp_group top::rp1 -leaf U1 -column 0 -row 0
add_to_rp_group top::rp1 -leaf U4 -column 1 -row 0

create_rp_group rp2 -design top -columns 2 -rows 1


add_to_rp_group top::rp2 -leaf U2 -column 0 -row 0
add_to_rp_group top::rp2 -leaf U5 -column 1 -row 0

create_rp_group rp3 -design top -columns 2 -rows 1


add_to_rp_group top::rp3 -leaf U3 -column 0 -row 0
add_to_rp_group top::rp3 -leaf U6 -column 1 -row 0

create_rp_group rp4 -design top -columns 1 -rows 3


add_to_rp_group top::rp4 -hierarchy top::rp1 \
-column 0 -row 0
add_to_rp_group top::rp4 -hierarchy top::rp2 \
-column 0 -row 1
add_to_rp_group top::rp4 -hierarchy top::rp3 \
-column 0 -row 2
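
To see what the hierarchical RP script describes, the Python sketch below flattens the four groups into a grid of leaf positions. It is a simplified model, not the tool's placer: every leaf is assumed to occupy one unit cell, row 0 is assumed to be the bottom row, and the dictionary layout and function names are hypothetical.

```python
def extents(groups, name):
    """Column widths and row heights of a group, in unit cells."""
    group = groups[name]
    col_w = [1] * group["columns"]
    row_h = [1] * group["rows"]
    for kind, item, col, row in group["items"]:
        if kind == "leaf":
            w, h = 1, 1
        else:
            sub_w, sub_h = extents(groups, item)
            w, h = sum(sub_w), sum(sub_h)
        col_w[col] = max(col_w[col], w)
        row_h[row] = max(row_h[row], h)
    return col_w, row_h


def flatten(groups, name, x0=0, y0=0):
    """Return {leaf: (x, y)} for every leaf under the named RP group."""
    col_w, row_h = extents(groups, name)
    placed = {}
    for kind, item, col, row in groups[name]["items"]:
        x = x0 + sum(col_w[:col])
        y = y0 + sum(row_h[:row])
        if kind == "leaf":
            placed[item] = (x, y)
        else:
            placed.update(flatten(groups, item, x, y))
    return placed


# The four groups from the RP script: three 2x1 rows stacked by rp4.
groups = {
    "rp1": {"columns": 2, "rows": 1, "items": [("leaf", "U1", 0, 0), ("leaf", "U4", 1, 0)]},
    "rp2": {"columns": 2, "rows": 1, "items": [("leaf", "U2", 0, 0), ("leaf", "U5", 1, 0)]},
    "rp3": {"columns": 2, "rows": 1, "items": [("leaf", "U3", 0, 0), ("leaf", "U6", 1, 0)]},
    "rp4": {"columns": 1, "rows": 3, "items": [
        ("hier", "rp1", 0, 0), ("hier", "rp2", 0, 1), ("hier", "rp3", 0, 2)]},
}

layout = flatten(groups, "rp4")
for leaf in sorted(layout):
    print(leaf, layout[leaf])
```

Under these assumptions the result is a 2-column by 3-row array with U1, U2, U3 in the left column and U4, U5, U6 in the right column.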

Structured Datapath with Relative Placement

Figure panels: Fixed cells from DEF; RP placement with anchor point; RP placement w/out anchor point
bbox {1.2480 111.1680} {495.5340 317.9530}
bbox {9.0480 63.9360} {498.2640 270.7200}

• Adds structure to random placement and routing
• Can significantly improve QoR (frequency, area, and power)
• Used to improve area and power (can be extended to help critical path areas)
• RPs integrated into the flow using 3 approaches:
– FIXED cells from DEF file
– RP constraints with anchor point
– RP constraints without anchor point

Structured Datapath with Relative Placement

W/o Relative Placement With Relative Placement

• 20% area savings vs. no RP
• Similar structure/timing across bits
• Can improve routability if planned properly

Structured Datapath with Relative Placement

Routing RP Clock nets

Zrouter in conjunction with the structured placement

Relative Placement flow in DCG and ICC II

SARC’s internal tool for RP generation

Design Compiler, ICC II

RP cells are marked fixed in DEF

Hierarchical Design Planning
Flow: Gensys restructuring -> Design Compiler Graphical -> Initialize floorplan -> Shaping / Macro placement -> Create H-tree/Mesh -> Create Power -> Place pins / Feedthrough -> Timing Budgets -> Push down of power grid -> Push down of clock drivers

• Using Gensys to add wrappers for partitioning RTL
• Feedthroughs are added to wrapper and not inside module hierarchy
• Clock mesh push down
• Power Grid using composite patterns

Gensys Hierarchical Build Flow
Gensys RTL restructuring flow

View Design Hierarchy

Define wrapper hierarchy

Define or modify design hierarchy

“Apply” to automate RTL restructuring

using gensys to add wrappers / partition RTL


Gensys Hierarchical Build Flow
• Add wrapper to isolate feedthroughs / DFT logic / PG switches
• Swap in new RTL at blk level
• Automate splitting of blocks to manage build time

Diagram (blk.v): module blk_wrap wrapping module blk

Characterize Block PG Flow

• Design inputs
– Full-chip PG Constraints
➢ Patterns
➢ Via Rules
➢ Strategies

• Characterize Block PG outputs
– PG constraints for each block and top-level PG
➢ Patterns
➢ Via Rules
➢ Strategies

• Create PG for each block and top-level
– Apply corresponding PG constraints
– Compile strategies with via rule

Diagram: Design + Full Chip PG Constraints (Patterns, Via Rules, Strategies) -> Characterize Block -> Block & Top level PG Constraints (Patterns, Via Rules, Strategies) -> Apply PG Constraints and Compile PG Strategies for Top Level, Block, Block…

Composite Pattern Implementation
Figure: composite strap pattern on nets vdd, vddi and vss, annotated with the dimensions 14, 10, 9, 8 and 2.5

create_pg_wire_pattern stra_base -layer @l -direction @d -width @w -spacing @s \
    -pitch @p -parameters {l d w p s}
create_pg_composite_pattern core_pattern -nets {vdd vddi vss} \
    -add_patterns { \
    {{pattern: stra_base} {nets: vdd vss vss vdd} {parameters: {M6 vertical 1.5 10 1}} {offset: 3}} \
    {{pattern: stra_base} {nets: vddi vss} {parameters: {M4 vertical 1.5 14 0.6}} {offset: 9}} \
    {{pattern: stra_base} {nets: vdd vddi vss} {parameters: {M5 horizontal 1.5 8 0.6}} {offset: 2.5}}}
set_pg_strategy core_strategy -core -pattern {{name: core_pattern} {nets: vdd vddi vss}}
compile_pg -strategies {core_strategy}
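
The parameter lists in the composite pattern follow the order {l d w p s}: layer, direction, width, pitch, spacing. As a rough geometric illustration, the Python sketch below expands the M6 sub-pattern into strap edges. The conventions are assumptions, not documented tool behaviour: the offset is taken to the lower edge of the first wire, spacing is taken edge to edge between adjacent wires, and the 30-unit extent is a hypothetical region size. The helper name wire_edges is also hypothetical.

```python
def wire_edges(nets, width, pitch, spacing, offset, extent):
    """List (net, low_edge, high_edge) for every full group within extent."""
    wires = []
    k = 0
    while True:
        base = offset + k * pitch
        group = []
        for i, net in enumerate(nets):
            low = base + i * (width + spacing)
            group.append((net, round(low, 4), round(low + width, 4)))
        if group[-1][2] > extent:
            break  # this repetition no longer fits inside the region
        wires.extend(group)
        k += 1
    return wires


# M6 sub-pattern from the slide: nets vdd vss vss vdd, width 1.5, pitch 10,
# spacing 1, offset 3. The extent of 30 is a made-up region size.
m6 = wire_edges(["vdd", "vss", "vss", "vdd"], width=1.5, pitch=10,
                spacing=1, offset=3, extent=30)
for net, low, high in m6:
    print(f"{net:4s} {low:5.1f} .. {high:5.1f}")
```

Under these assumptions one group of four M6 wires spans 9 units, which fits inside the 10-unit pitch without overlapping the next repetition.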



ICC II versus PrimeTime Correlation (Signoff Entry)

- Need to maintain ~1x / Day beat rate
- Cut down number of ECO cycles due to timing closure (logical ECOs)
- ICC II -> PT correlation key to efficiency

Figure label: Signoff Entry

ICC II versus PT Correlation: Fusion Flow

• Route_opt1
• Route_opt2 (CCD enabled)
• eco_opt -type setup -pba_mode path (CCD enabled)
• Extract.starrc_mode set 0

ICC II versus PT Correlation (Additional Suggestions)
• Good timing correlation needed to minimize ECO cycle time
– Use PT timer in last stages of route_opt
set_app_options -name time.use_pt_delay -value true
– AWP in ICC II (several TBCs for improvements)
– POCV/margins etc. same in both tools
– Use CCS models + LVF (new moment-based POCV capability coming)
– Investigating new PBA based optimization in ICC II (final stages of route)

• Ensure all dominant scenarios/modes (setup & hold) are optimized in ICC II before route_opt
– ICC II runtime improvements in progress (particularly inactive scenarios)

• Excessive extra margin in ICC II to mask correlation issues can impact power/congestion

• Study components individually: mean / stddev & review outliers
– ICC II SPEF vs. STAR SPEF
– ICC II GBA vs. PT GBA (no POCV/derates; no SI)
– ICC II GBA vs. PT GBA w/ POCV/derates
– ICC II GBA vs. PT GBA w/ POCV/derates & SI
– ICC II GBA vs. PT PBA
– etc.
– (We found foundry based metal fill needed in ICC II to improve route dominated block correlation)
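
Each comparison in the list above reduces to the same measurement: take the per-endpoint difference between two runs, summarize it with mean and standard deviation, and inspect the outliers. The Python sketch below shows one way to do that. The endpoint names, slack values and the 2-sigma outlier rule are made up for illustration, and parsing of real tool reports is out of scope.

```python
from statistics import mean, stdev


def correlation_report(slack_a, slack_b, n_sigma=2.0):
    """Compare per-endpoint slacks from two runs keyed by endpoint name.

    Returns (mean difference, standard deviation, list of outlier endpoints).
    """
    common = sorted(set(slack_a) & set(slack_b))
    diffs = {ep: slack_a[ep] - slack_b[ep] for ep in common}
    mu = mean(diffs.values())
    sigma = stdev(diffs.values()) if len(diffs) > 1 else 0.0
    outliers = [
        ep for ep in common
        if sigma and abs(diffs[ep] - mu) > n_sigma * sigma
    ]
    return mu, sigma, outliers


# Hypothetical slacks in ps: run A is a few ps optimistic everywhere,
# except for one endpoint that miscorrelates badly.
run_b = {f"ep{i}": 10.0 for i in range(8)}
offsets = [2, 1, 3, 2, 1, 3, 2, 40]
run_a = {f"ep{i}": run_b[f"ep{i}"] + offsets[i] for i in range(8)}

mu, sigma, outliers = correlation_report(run_a, run_b)
print(f"mean = {mu:.2f} ps, stddev = {sigma:.2f} ps, outliers = {outliers}")
```

Running the same report for each pairing in turn (extraction only, then GBA, then with POCV/derates, then with SI, then PBA) shows which component introduces the miscorrelation.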

Successful Deployment of RedHawk Analysis Fusion at 7nm
❖ Successful evaluation of RedHawk Analysis Fusion technology at SARC on 7nm designs in ~3 months, meeting initial customer criteria.

❖ SARC is a great ‘learning’ customer driving productivity and usability enhancements.

❖ Highly engaged R&D:
– Several enhancements to the tools including the ability to launch signoff analysis of multiple vectors in parallel from one ICC2 session
– Improved GUI features to display voltage drop on cell instance IR drop map

❖ Technology enables Physical Design teams to do early analysis and fixing of signoff quality IR drop issues during block-level implementation. New usage model/users for power analysis, added value to design flow/QoR, new market opportunity.

❖ Accuracy proven: It’s all about the trust in correlation. RedHawk Analysis Fusion results identical to RedHawk signoff.

❖ Starting production use now on live 7nm design.

❖ Testing IVD based place_opt for deployment with 2018.06-SP2.

Automatic Timing Control (ATC)

Route Delay Estimation (RDE)

Route Delay Estimation (RDE)
Advantages of RDE model

Via Ladder ( Pillar ) Construction

M1

Receiver

Receiver

M1

EM Via Ladder for High Drive Cells



Synopsys Technologies for the Next Project
Next generation ICC II technology items
- Improvements in ICC II 2018.06
- RDE + better layer promotion and auto-NDR
- ATC
- Total Power Optimization
- CCD more throughout the flow (in placeopt and synthesis)
- Timing driven restructuring
- Congestion restructuring
- Knee based delay optimization
- ICV for foundry metal fill in ICC II
- In-Design IR/IVD fixing
- 7nm technology
- Via Pillars
- Color aware PG/STD cell placement interaction

Next generation technology - other tools
- STAR: SMC (Simultaneous multi-corner)
- Statistical Via Capability (initially PT)
- IR/IVD analysis (RedHawk Fusion) in PrimeTime
- PrimeTime PBA mode power recovery





Conclusion

• Synopsys Design Platform with Fusion Technology enables SARC on high-performance Arm®-based CPU implementation for mobile devices
• Clock Mesh Design, Useful skew + CCD, Multi-bit, Structured Datapath & Relative Placement (RP), Hierarchical Design, ICC II to PrimeTime Correlation, etc.

Optimized implementation delivered 20%+ improvements in Fmax & mW/MHz

• SARC continues to work closely with Synopsys to address PPA, runtime and repeatability/predictability needs on future projects

• Subsequent technology nodes (5nm,…) are getting very complex with lots of opportunity for collaborative innovation in implementation



Thank You
