
High-Performance Arm®-based CPU Implementation for
Mobile Devices
Using Synopsys Design Platform with Fusion Technology
Brian Millar, Senior Principal Engineer, CPU Physical Implementation

October 23rd 2018
Agenda
Introduction - Brief History of Samsung Austin R&D Center (SARC)
Synthesis + Implementation Flow Overview
Key Techniques in Physical Implementation
Other Synopsys Technologies for the Next Project
Conclusion
SNUG 2018 © Copyright Synopsys 2018, All Rights Reserved 2


Introduction
• Founded in 2010 to develop high-performance, low-power, complex CPU and System IP architectures and designs for Samsung’s System LSI division.
• The San Jose Advanced Computing Lab (ACL) was opened in 2017, and GPU IP (leveraging 5 years of initial development from another division) was added to the joint charter.
Overview: CPU & System IP
• Public/Commercial Accomplishments
– Shipping in Galaxy S7, S8, S9 Premium
Smartphones
• Current Focus
– Next 4 generations of Premium Smartphones
– Alternative markets
Initial Hardening of Arm licensed cores
• Specification
– Full implementation of the Armv7-A architecture Arm® Cortex®-A15 CPU
– Superscalar, variable-length, out-of-order pipeline
– Dynamic branch prediction with Branch Target Buffer (BTB) and Global History Buffer
– Separate L1 Translation Lookaside Buffers (TLBs) for instruction, data loads, data stores
– 4-way set-associative 512-entry L2 TLB per processor
– 32KB L1 instruction & data caches, shared L2 cache
– Arm AMBA® 4 AXI Coherency Extensions master interface
– Accelerator Coherency Port (ACP) implemented as AXI3 slave interface
– VFP floating point unit & Neon media processing engine
• Configurable elements
– # Arm® Cortex®-A15 cores (1 to 4)
– Size of L2 cache (up to 4 MB)
– [Optional] VFP, Neon engines
First Generation: M1, SCI, and SMC Highlights
• Successfully delivered first-generation Samsung proprietary CPU, coherent interconnect and memory controller
– Armv8-A 64-bit 4-wide out-of-order CPU with shared 2MB L2 cache
• Time-to-market
– Built a team and accomplished this in 3.5 years (others struggle to do it in 5 years)
– Set a solid capability and product foundation for the future
• High quality
– No validation gating bugs for EVT0, no production gating bugs for EVT1
• Significant generational improvements: Galaxy S7 over Galaxy S6
• Industry landscape
– Our first generation CPU was competitive
Third Generation: M3, SCI, and SMC Highlights
• Successfully delivered differentiated Samsung proprietary CPU, coherent interconnect and memory controller
– Armv8-A 64-bit 6-wide out-of-order CPU with shared 4MB L3 cache
• Time-to-market
– Built a team and accomplished this in 3.5 years (others struggle to do it in 5 years)
– Set a solid capability and product foundation for the future
Callout: Detailed disclosure planned for Hot Chips 2018
• High quality
– No validation gating bugs for EVT0, no production gating bugs for EVT1
• Significant generational improvements: Galaxy S7 over Galaxy S6
Callout: Balanced Single-Threaded performance and Power Efficiency
• Industry landscape
– Our first generation CPU was competitive
Challenges and Solutions
• Our team does core hardening for our proprietary Arm instruction set compliant processor cores used in SoC designs throughout Samsung
• PPA (Performance/Power/Area), runtime and repeatability/predictability are key metrics
• We’ve invested in several technologies in the Synopsys Design Platform with Fusion Technology plus internal methodologies to differentiate our PPA

SARC Implementation Flow
Gensys
Design Compiler Graphical Tetramax II
Formality
ICC II Design Planning
ICC II Implementation IC Validator
StarRC
PrimeTime
Signoff DRC/LVS
New Fusion Technology™
inside Synopsys Design Platform
SARC Implementation flow

Design Compiler
Graphical
Signoff Fusion
Test Fusion
Test
Design Fusion
IC Compiler II
ECO Fusion
PT, StarRC, PTPX, ICV, RedHawk
Fusion Data Model

Design Compiler Flow Overview
set_app_vars
– set_compile_spg_mode ICC II
– compile_layer_aware_optimization
– compile_timing_high_effort_tns
– compile_timing_high_effort
– psynopt_tns_high_effort
Load MW/Target/Link libraries
Read RTL
– analyze
– elaborate
set_clock_gating_style
Load UPF
Read SDC
Read physical constraints
– set_ignored_layers
– set_preferred_routing_direction
– create_bounds
– create_placement_blockages
Source rp.tcl
Read path groups and weights
Scan constraints
– test_scan_enable_port_naming_style
– set_dft_signal
– set_scan_configuration
compile_ultra -scan -gate_clocks -spg
insert_dft
compile_ultra -incremental -scan -spg
reporting & ddc/verilog/sdc/upf
ICC II Implementation Flow Overview
Design planning
❖ Structured datapath
❖ Bounds
❖ Partitioning
❖ Budgeting
❖ Pin cutting
Constraints & Scenarios
• Timing constraints
• Scenario creation
Place_opt
• ICG merge/split
• Automatic timing control (ATC)
• Z-buf/NPO & rebuffering
• Layer opt/GRLB/Auto-NDR
• Multi-Vt optimization
• MBIT flow
• Leakage + SAIF based total power opt
• RDE/SI Estimation
Clock mesh & clock_opt
• ICG merge/split optimization
• Route clock routes
• Post Clock datapath optimization
• CTS Useful Skew
Useful Skew implementation
• ICG Sizing to implement useful skew
• Skew buffer insertion between ICG and FF
route & route_opt
• Global route based optimization
• Enable additional MCMM scenarios
• CCD with clock mesh
• Route_opt for downsizing
• Route_opt for in place size only
• Ansys RedHawk integration (Fusion)
• ICV ADRC fixing & metal fill
PrimeTime and ECO Flow Overview
Tools shown: IC Compiler II, StarRC, PrimeTime/PrimeTime PX, PT ECO
ECO capability in PrimeTime
▪ Setup/hold fixing
▪ Useful skew
▪ Sequential cell swap/optimization
▪ Leakage recovery
▪ Dynamic power reduction
▪ Redundant buffer removal
▪ max_tran/max_cap/noise for DRC fixing
▪ Additional hold fixing
StarRC
• GPD format from StarRC to PrimeTime
• SMC (simultaneous multi-corner)
PrimeTime/PrimeTime PX
• analyze_subcircuit for SPICE analysis of clock mesh
• Statistical via computation for 7nm and lower
• IVD failure analysis in PrimeTime
• Ansys RedHawk integration (Fusion)
PT ECO commands
fix_eco_timing -physical_mode open_site/occupied_site -type setup/hold
fix_eco_timing -cell_type clock_network (for useful skew)
fix_eco_timing -cell_type sequential (for flop swap)
fix_eco_power -pba_mode exhaustive -power_mode leakage (for leakage recovery)
fix_eco_power -pba_mode exhaustive -power_mode dynamic (for dynamic power reduction)
fix_eco_power -methods remove_buffer (for redundant buffer removal)
fix_eco_drc -type max_tran/max_cap/noise (for DRC fixing)
fix_eco_hold_timing (fix setup to allow more hold fixing)
Implementation Flow
Key Technologies Used
Design Compiler
• Topographical synthesis
• UPF power intent
• Physical datapath for structured placement
• Power-aware clock gate (ICG) insertion
– Fanout control
– Always-on ICGs
• Placement bounds
• Path groups, critical range, weights
IC Compiler II
• Design planning
• ICG merge/split prior to place_opt
• ATC/Zbuf/NPO/rebuffering
• Mbit flow
• Leakage recovery
• SAIF based TPO
• Route_opt CCD
• Useful skew
• Clock mesh flow
• MCMM
• RDE / SI Prediction
PrimeTime
fix_eco_timing
fix_eco_timing -cell_type clock_network
fix_eco_timing -cell_type sequential
fix_eco_power -power_mode leakage
fix_eco_power -power_mode dynamic
fix_eco_power -methods remove_buffer
fix_eco_drc -type max_trans/cap/noise
fix_eco_hold_timing
PBA mode exhaustive

Clock Mesh Design
Useful Skew + CCD
Multi-bit
Structured Datapath & Relative Placement
Hierarchical Design
ICC II to PrimeTime Correlation
RedHawk Fusion (formerly in-Design)
ATC
RDE
Via Ladder

Consideration for CTS vs. Mesh
Implementation complexity
– CTS: Flow mature and very easy to implement.
– Mesh: More time consuming; mesh has to be carefully planned around the floorplan. ICC still maturing. We ran into issues along the way, but worked around them. STA requires a special flow.
Skew
– CTS: Will generally have higher global/local skews for any sizable design.
– Mesh: Will have tighter skews.
Max freq.
– CTS: Will tend to be lower because of higher skew.
– Mesh: Since skew is better (particularly local skew), less time is wasted and higher frequency is achievable.
PVT tracking
– CTS: Tries to match delays using interconnect & gate delay. CTS in MCMM mode may reduce exposure, but in general paths won’t track as well across different corners.
– Mesh: If the structure is built very regularly, it should track across PVT much better.
POCV margin
– CTS: Paths in general will have larger non-common depth -> higher POCV. Also, paths are dissimilar, so traditional POCVM margin is justified.
– Mesh: Very structurally similar. Much lower POCVM.
Power
– CTS: Dynamic power can be similar between CTS & Mesh depending on structure of the mesh.
– Mesh: Dynamic power is similar between CTS & Mesh. Tighter skew will result in less % LVt usage (TNS).
Implementing useful skew
– CTS: IC Compiler has built-in useful skew ability, and “pulling” flops earlier is conceptually easier. CCD compatible with CTS.
– Mesh: Relatively easy to implement “push” flops, but “pull” is more difficult and costly for power. CCD more difficult with Mesh.
Flexibility
– CTS: Built to be flexible. Can work around macros, metal layer restrictions, tight corners; can easily adapt to any floorplan and useful skew.
– Mesh: Requires more careful planning. Metal layer conflicts with power grid. Driver conflicts with macro placement. Tuning can help recover some of the loss.
Clocks and Margins
• Combination of CTS based and Mesh based clock tree structure
– High Speed clocks in CPU and Non-CPU use separate meshes
– Non-CPU has CTS trees for slower speed functional/debug/test clocks
• Meshes are built with identical metal topology, drivers, x&y pitch, etc.
– Allows very tight tracking at different PVT between meshes and OCV minimized
– Non-CPU meshes and CPU meshes will track more closely across PVT corners
and variation components, but each mesh has its own load distribution so there
will be some slight imperfection in their tracking (we can tweak sizes)
• INTRA-MESH & INTER-MESH Margining

1. Random/Gradient Cell Variation
2. Temperature Variation
3. Fatigue / Reliability Component
4. Voltage Variation
5. Random/Gradient Interconnect Variation
Different variations of mesh widths, pitches, driver sizes, etc., were plotted for power and skew to identify the best solution for implementation
Clock H-tree / MESH
1 Create clock H-tree/MESH wires
SARC custom tool -> ICC II commands
create_shape -shape_type path -layer N -net Cr_MESH -width XXXX \
    -shape_use user_route -path [list {X1 Y1} {X2 Y2}] \
    -start_endcap variable -start_extension XXX \
    -end_endcap variable -end_extension XXXX
create_via -net Cr_MESH -no_snap -via_def M_XXX_HV \
    -shape_use user_route -origin {X1 Y1} -orientation R90
create_clock_mesh -net {ck_gclkcr_MESH} \
    -layers {xx yy} -widths {0.x 0.y} -lower_left {134.9 149.4} \
    -pitches {230.4 241.8} \
    -bounding_box {{5.0 5.0} {2095.0 1895.0}}
[Figures: clock driver & H-tree routing; mesh straps]
2 Create H-tree/MESH drivers
create_cell Cr_L0_1 [get_lib_cell MESH_DRVxxx]
set_cell_location -orientation N -coordinates X1,Y1
set_attribute [get_cells DRVxx] dont_touch true
set_attribute [get_cells DRVxx] physical_status fixed
[Figure: snake routing to adapt to floorplan]
Clock H-tree / MESH
3 Push down to block level
SARC custom tool does top->block translation, but the ICC II method is:
push_down_objects [list [get_nets -of [get_shapes \
    -filter "net_type==signal||net_type==clock"]]]
4 Route Fishbones
set_routing_rule [get_nets Cr_MESH] -rule XX \
    -min_routing_layer X -max_routing_layer Y \
    -max_layer_mode allow_pin_connection
set_app_options -list {route.common.comb_distance 2}
route_clock_straps -nets [get_nets Cr_MESH] -topology fishbone \
    -fishbone_fanout 10 -fishbone_layers {X Y} -fishbone_sub_span 20
set_attribute [get_nets $main_clock_net] -name physical_status -value locked
Clock H-tree / MESH
5 Stamp Spice timing on MESH loads
Call in ICC II:
analyze_subcircuit -from $from_pin -to $to_net -clock Cr \
    -spice_header_files ./spice_files/HSIM_lib.txt \
    -driver_subckt_files "./spice_files/XX.spf ./spice_files/YY.spf" \
    -starrcxt_nxtgrd_file ./star_files/star.nxtgrd \
    -starrcxt_map_file ./star_files/star.map \
    -name mesh_Cr
OR
Call to PT w/STAR SPEF (signoff):
sim_analyze_clock_network -from $from -to $to -use_probe ....
OR
Avoid Spice and stamp fake timing early in the development phase
Resulting annotations:
set_annotated_transition XXX -rise U1_clkgate/CK
set_annotated_transition YYY -fall U1_clkgate/CK
. . . .
set_annotated_delay XXX -from cr_L4_1/Y -to U1_clkgate/CK -net -rise
set_annotated_delay XXX -from cr_L4_1/Y -to U1_clkgate/CK -net -fall
. . . .
set_annotated_delay XXX -from cr_L4_1/A -to cr_L4_1/Y -cell -rise
set_annotated_delay XXX -from cr_L4_1/A -to cr_L4_1/Y -cell -fall
. . . .
# disable all but one driver
set_disable_timing [get_timing_arcs -to cr_L4_*/Y]
remove_disable_timing -from A -to Y cr_L4_10
Clock H-tree / MESH
Flow stages: Design Compiler Graphical, ICC II Design Planning, Constraints & Scenarios, Place_opt, Clock mesh & clock_opt, Useful Skew implementation, route & route_opt, PrimeTime Signoff STA
[Figure labels: Mesh, Htree, Fishbones]
Place_opt has the following steps:
• Merge / Split
synthesize_multisource_clock_subtrees -from merge -to merge
synthesize_multisource_clock_subtrees -from optimize -to optimize
Clock mesh & clock_opt has the following steps:
• Merge / Split
• Size ICGs per load/transition time spec
• Cluster ICGs in placement bounds (near PG switches)
• Align ICGs to minimize fishbone route
• ICG → Flop clock routes
• Building CTS trees
• Propagated mode optimization
synthesize_multisource_clock_subtrees -from merge -to merge
synthesize_multisource_clock_subtrees -from optimize -to optimize
synthesize_multisource_clock_subtrees -from route_clock -to route_clock
synthesize_multisource_clock_subtrees -from refine -to refine
route_group -all_clock_nets -reuse_existing true
clock_opt -from final_opto
Other Clock Routing
MS CTS Clock distribution

Post MESH clock routing: Clock Gater -> flops
Useful Skew
Before scheduling (frequency target = 1.0GHz, 1000ps)
Path delays: 1100ps -> first flop -> 700ps -> second flop -> 1050ps -> third flop -> 900ps
Path slacks: -100ps, +300ps, -50ps, +100ps
First flop: Front Slack = -100ps, Back Slack = +300ps
Second flop: Front Slack = +300ps, Back Slack = -50ps
Third flop: Front Slack = -50ps, Back Slack = +100ps
Actual period is (1000+100)ps = 1100ps, about 0.9GHz
After scheduling (both WNS & TNS improvement)
Effective path delays: 1000ps -> first flop -> 850ps -> second flop -> 1000ps -> third flop -> 900ps
Path slacks: 0ps, +150ps, 0ps, +100ps
First flop: Front Slack = 0ps, Back Slack = +150ps
Second flop: Front Slack = +150ps, Back Slack = 0ps
Third flop: Front Slack = 0ps, Back Slack = +100ps
Actual period is (1000+0)ps = 1000ps = 1.0GHz
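The slack arithmetic on this slide can be reproduced with a short sketch. This is illustrative Python, not part of the SARC flow. The function name `path_slacks`, the flop labels FF1 to FF3, and the clock offsets (+100ps, -50ps, 0ps) are assumptions, chosen so that the results match the slacks quoted on the slide; the input launch point and output capture point are assumed to stay at zero offset.

```python
PERIOD_PS = 1000  # 1.0GHz target from the slide

# Path delays in ps, in pipeline order: input -> FF1 -> FF2 -> FF3 -> output
PATH_DELAYS_PS = [1100, 700, 1050, 900]


def path_slacks(clock_offsets_ps):
    """Setup slack of each path for the given clock offsets on FF1, FF2, FF3.

    A positive offset delays ("pushes") a flop's clock, a negative offset
    advances ("pulls") it. Slack = period + (capture - launch) - path delay.
    """
    arrivals = [0] + list(clock_offsets_ps) + [0]
    slacks = []
    for i, delay in enumerate(PATH_DELAYS_PS):
        launch, capture = arrivals[i], arrivals[i + 1]
        slacks.append(PERIOD_PS + (capture - launch) - delay)
    return slacks


def achievable_period_ps(slacks):
    """Smallest period that would make the worst path meet timing."""
    return PERIOD_PS - min(slacks)


before = path_slacks([0, 0, 0])        # no useful skew
after = path_slacks([100, -50, 0])     # push FF1 by 100ps, pull FF2 by 50ps

print(before, achievable_period_ps(before))
print(after, achievable_period_ps(after))
```

In this model the negative slack in front of FF1 and FF3 is traded against the positive slack behind them, which is why both WNS and TNS improve while the path delays themselves are unchanged.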
Useful Skew / Concurrent Clock and Data
Flow stages: ICC II Design Planning, Constraints & Scenarios, Place_opt, Clock mesh & clock_opt, Useful Skew implementation, route & route_opt, PrimeTime Signoff STA
Architectural useful skew (IDEAL MODE)
set_clock_latency XXX -rise U1REG/CK
SARC MESH based Useful Skew Scheduler/Implementer (PROPAGATED MODE)
• Algorithmic (multi-level aware) – single corner/multi-corner
• Useful skew for hold
• Useful skew for power
ICC II CCD Capability
set_app_options -list {route_opt.flow.enable_ccd true}
set_app_options -list {ccd.select_optimization_moves auto/size_only}
set_app_options -list {ccd.critical_slack_percent 0.90}
set_app_options -list {ccd.tns_cost_factor 1}
route_opt
PrimeTime Signoff STA
fix_eco_timing -cell_type clock_network
Investigating CCD Earlier in Flow
Flow stages: Design Compiler Graphical, Constraints & Scenarios, Place_opt, Clock mesh & clock_opt, route & route_opt, PrimeTime Signoff STA
Architectural useful skew (IDEAL MODE)
set_clock_latency XXX -rise U1REG/CK
SARC MESH based Useful Skew Scheduler/Implementer (PROPAGATED MODE)
• Algorithmic (multi-level aware) – single corner/multi-corner
• Useful skew for hold
• Useful skew for power
ICC II CCD Capability
set_app_options -list {route_opt.flow.enable_ccd true}
set_app_options -list {ccd.select_optimization_moves auto/size_only}
set_app_options -list {ccd.critical_slack_percent 0.90}
set_app_options -list {ccd.tns_cost_factor 1}
route_opt
Advantages and Disadvantages of Multi-Bit in a Clock Mesh Flow
Advantages
- Lower internal clock power
- Less CK pin cap on ICG
- Fewer and smaller ICGs on MESH
- Less MESH power
- Less SI and SE routing
- Less hold buffering (internal bits correct by construction)
Disadvantages
- Strength/Vt type for most critical bit
- Limits useful skew flexibility
- RTL SAIF / mapping
- Very disruptive when splitting
General Multi-Bit Flow
Flow stages: place_opt, clock_opt, route, route_opt
Multi-bit steps: initial placement, grouping and banking, optimization, debanking
Inputs: multibit reg map file, register group file
• MBIT banking/de-banking available in ICC II
• A slack threshold can be specified per timing group to select registers considered for banking
• Only registers with the same latency values are merged
• SVF file is updated for formal verification
• Banking ratio of 85% achieved on most blocks
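The two banking rules above (a slack threshold per timing group, and merging only registers with the same latency) can be sketched as a simple grouping pass. This is an illustrative Python model, not the ICC II algorithm. The `Register` fields, the `plan_banks` and `banking_ratio` names, the example data, and the reading of "banking ratio" as banked registers divided by all registers are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Register:
    name: str
    timing_group: str
    clock_latency_ps: int
    slack_ps: int


def plan_banks(registers, slack_threshold_ps, bits_per_cell=2):
    """Group eligible registers into banks of bits_per_cell.

    A register is eligible when its slack meets the threshold of its timing
    group; only registers with identical clock latency share a bank.
    """
    by_latency = defaultdict(list)
    for reg in registers:
        if reg.slack_ps >= slack_threshold_ps[reg.timing_group]:
            by_latency[reg.clock_latency_ps].append(reg.name)
    banks = []
    for names in by_latency.values():
        for start in range(0, len(names), bits_per_cell):
            chunk = names[start:start + bits_per_cell]
            if len(chunk) == bits_per_cell:  # leftovers stay single-bit
                banks.append(chunk)
    return banks


def banking_ratio(registers, banks):
    """Fraction of registers that ended up inside a multi-bit cell."""
    return sum(len(bank) for bank in banks) / len(registers)


regs = [
    Register("a", "fpu", 100, 50),
    Register("b", "fpu", 100, 60),
    Register("c", "fpu", 100, -5),   # fails the fpu slack threshold
    Register("d", "fpu", 120, 40),   # no partner with latency 120
    Register("e", "lsu", 100, 30),   # eligible, but no partner left
    Register("f", "lsu", 100, 5),    # fails the lsu slack threshold
]
banks = plan_banks(regs, {"fpu": 0, "lsu": 10})
ratio = banking_ratio(regs, banks)
print(banks, round(ratio, 2))
```

A real flow would also weigh placement distance, scan chain order, and library cell availability, which this sketch ignores.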
Multi-Bit Conversion at the end of place_opt
Before multi-bit conversion: ICG driving 129 single-bit flop loads
After multi-bit conversion: ICG driving 43 multi-bit flop loads
IC Compiler II considers:
• Slack
• Useful skew
• Scan chains
• Transition
• Capacitance
Multi-Bit Flow
Flow stages: Constraints & Scenarios, Place_opt, Clock mesh & clock_opt, route & route_opt, PrimeTime Signoff STA
Single Bit Only (multi-bits dont_use)
set_attr [get_lib_cell */* -filter "multibit_width > 1"] dont_use true
create_placement -timing_driven
place_opt -from initial_drc -to initial_drc
create_placement -timing_driven -effort high -use_seed_locs -congestion
place_opt -from initial_opto -to final_place
save_block place_opt
Enable Multi-bit
set_attr [get_lib_cell */* -filter "multibit_width > 1"] dont_use false
identify_multibit -register -no_dft -input_map_file $map_file \
    -slack_threshold $slack -output_file $mbit_bank.tcl \
    -exclude_instance $excluded_regs
source $mbit_bank.tcl
report_multibit
save_block place_opt_mbit
create_multibit -name ldSpecVal_e2_reg_ldVprn_e2_reg_7A_ldVprn_e2_reg_6A_ldVprn_e2_reg_5A \
    { ufpb_grp/u_fpu_load_xpose/ldSpecVal_e2_reg
      ufpb_grp/u_fpu_load_xpose/ldVprn_e2_reg_7A
      ufpb_grp/u_fpu_load_xpose/ldVprn_e2_reg_5A } -lib_cell lib/H2V2X………
Investigating Merging/Splitting throughout flow
Flow stages: Design Compiler Graphical, ICC II Design Planning, Constraints & Scenarios, Place_opt, Clock mesh & clock_opt, Useful Skew implementation, route & route_opt, PrimeTime Signoff STA
Single Bit Only (multi-bits dont_use)
set_attr [get_lib_cell */* -filter "multibit_width > 1"] dont_use true
create_placement -timing_driven
place_opt -from initial_drc -to initial_drc
create_placement -timing_driven -effort high -use_seed_locs -congestion
place_opt -from initial_opto -to final_place
save_block place_opt
Enable Multi-bit
set_attr [get_lib_cell */* -filter "multibit_width > 1"] dont_use false
identify_multibit -register -no_dft -input_map_file $map_file \
    -slack_threshold $slack -output_file $mbit_bank.tcl \
    -exclude_instance $excluded_regs
source $mbit_bank.tcl
report_multibit
save_block place_opt_mbit
create_multibit -name ldSpecVal_e2_reg_ldVprn_e2_reg_7A_ldVprn_e2_reg_6A_ldVprn_e2_reg_5A \
    { ufpb_grp/u_fpu_load_xpose/ldSpecVal_e2_reg
      ufpb_grp/u_fpu_load_xpose/ldVprn_e2_reg_5A } -lib_cell lib/H2V2X……
Structured Datapath with Relative Placement
Can build up RP definition hierarchically
Example: RP script
create_rp_group rp1 -design top -columns 2 -rows 1

add_to_rp_group top::rp1 -leaf U1 -column 0 -row 0



add_to_rp_group top::rp4 -hierarchy top::rp1 \
-column 0 -row 0
-column 0 -row 1
-column 0 -row 2
Structured Datapath with Relative Placement
[Figure panels: fixed cells from DEF; RP placement with anchor point; RP placement w/out anchor point. Bounding boxes shown: bbox {1.2480 111.1680} {495.5340 317.9530} and bbox {9.0480 63.9360} {498.2640 270.7200}]
• Adds structure to random placement and routing

• Can significantly improve QoR (frequency, area, and power)
• Used to improve area and power (can be extended to help critical path
areas)
• RPs integrated into the flow using 3 approaches –
– FIXED cells from DEF file
– RP constraints with anchor point
– RP constraints without anchor point
Structured Datapath with Relative Placement
W/o Relative Placement With Relative Placement
• 20% area savings vs. no RP

• Similar structure/timing across bits
• Can improve routability if planned properly
Structured Datapath with Relative Placement
Routing RP Clock nets
Zrouter in conjunction with the structured placement
Relative Placement flow in DCG and ICC II
SARC’s internal tool for RP generation
Design Compiler, ICC II
RP cells are marked fixed in DEF
Hierarchical Design Planning
Flow steps:
Gensys restructuring
Design Compiler Graphical
Initialize floorplan
Shaping / Macro placement
Create H-tree/Mesh
Create Power
Place pins / Feedthrough
Timing Budgets
Push down of power grid
Push down of clock drivers
Notes:
• Using Gensys to add wrappers for partitioning RTL
• Feedthroughs are added to the wrapper and not inside the module hierarchy
• Clock mesh push down
• Power grid using composite patterns
Gensys Hierarchical Build Flow
Gensys RTL restructuring flow
View Design Hierarchy
Define wrapper hierarchy
Define or modify design hierarchy
“Apply” to automate RTL restructuring
using gensys to add wrappers / partition RTL

Add wrapper to isolate feedthroughs / DFT logic / PG switches
Swap in new RTL at blk level
Automate splitting of blocks to manage build time
blk.v
module blk_wrap
  module blk
  endmodule
endmodule
Characterize Block PG Flow
• Design inputs
– Full-chip PG constraints
➢ Patterns
➢ Via Rules
➢ Strategies
• Characterize Block PG outputs
– PG constraints for each block and top-level PG
➢ Patterns
➢ Via Rules
➢ Strategies
• Create PG for each block and top-level
– Apply corresponding PG constraints
– Compile strategies with via rule
[Flow diagram: Design and Full Chip PG Constraints (Patterns, Via Rules, Strategies) feed Characterize Block, which produces Block & Top level PG Constraints (Patterns, Via Rules, Strategies); Apply PG Constraints and Compile PG Strategies then run for Top Level, Block, Block…]
Composite Pattern Implementation
[Figure: vertical straps vdd vss vss vdd vddi vss (repeating) crossing horizontal straps vss, vddi, vdd; dimensions shown: 14, 9, 8, 2.5, 10]
create_pg_wire_pattern stra_base -layer @l -direction @d -width @w -spacing @s \
    -pitch @p -parameters {l d w p s}
create_pg_composite_pattern core_pattern -nets {vdd vddi vss} \
    -add_patterns { \
    {{pattern: stra_base} {nets: vdd vss vss vdd} {parameters: {M6 vertical 1.5 10 1}}{offset: 3}} \
    {{pattern: stra_base} {nets: vddi vss} {parameters: {M4 vertical 1.5 14 0.6}}{offset: 9}} \
    {{pattern: stra_base} {nets: vdd vddi vss}{parameters: {M5 horizontal 1.5 8 0.6}}{offset: 2.5}}}
set_pg_strategy core_strategy -core -pattern {{name: core_pattern} {nets: vdd vddi vss}}
compile_pg -strategies {core_strategy}

ICC II versus PrimeTime Correlation
(Signoff Entry)
- Need to maintain ~1x / day beat rate
- Cut down number of ECO cycles due to timing closure (logical ECOs)
- ICC II -> PT correlation key to efficiency
[Figure label: Signoff Entry]
ICC II versus PT Correlation
Fusion Flow
Route_opt1
Route_opt2
CCD enabled
eco_opt -type setup -pba_mode path
CCD enabled
Extract.starrc_mode set 0
ICC II versus PT Correlation
(Additional Suggestions)
• Good Timing Correlation needed to minimize ECO cycle time
– use PT timer in last stages of route_opt
set_app_options -name time.use_pt_delay -value true
– AWP in ICC II (several TBCs for improvements)
– POCV/margins etc.. same in both tools
– Use CCS models + LVF (new moment-based POCV capability coming)
– Investigating new PBA based optimization in ICC II (final stages of route)
• Ensure all dominant scenarios/modes (setup & hold) optimized in ICC II before route_opt
– ICC II runtime improvements in progress (particularly inactive scenarios)
• Excessive extra margin in ICC II to mask correlation issues can impact power/congestion
• Study components individually – mean / stddev & review outliers
– ICC II SPEF vs. STAR SPEF
– ICC II GBA vs. PT GBA (no POCV/derates; no SI)
– ICC II GBA vs. PT GBA w/ POCV/derates
– ICC II GBA vs. PT GBA w/ POCV/derates & SI
– ICC II GBA vs. PT PBA
– etc.
– (We found foundry based metal fill needed in ICC II to improve route dominated block correlation)
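The "mean / stddev & review outliers" study above can be sketched as a per-endpoint comparison of slack reported by the two tools. This is an illustrative Python model under stated assumptions: the `correlation_summary` name, the dictionary inputs keyed by endpoint, the 2-sigma outlier rule, and the example numbers are all hypothetical and do not come from the SARC flow or from either tool's report format.

```python
from statistics import mean, stdev


def correlation_summary(icc2_slack_ps, pt_slack_ps, n_sigma=2.0):
    """Return (mean, stddev, outliers) of ICC II minus PT slack per endpoint.

    Only endpoints present in both reports are compared. An endpoint is an
    outlier when its delta is more than n_sigma standard deviations away
    from the mean delta.
    """
    common = sorted(set(icc2_slack_ps) & set(pt_slack_ps))
    deltas = {ep: icc2_slack_ps[ep] - pt_slack_ps[ep] for ep in common}
    mu = mean(deltas.values())
    sigma = stdev(deltas.values())
    outliers = [ep for ep in common if abs(deltas[ep] - mu) > n_sigma * sigma]
    return mu, sigma, outliers


# Hypothetical endpoint slacks in ps; ep7 is deliberately miscorrelated.
pt = {f"ep{i}": 10 for i in range(8)}
icc2 = dict(zip(pt, [12, 13, 11, 12, 12, 13, 11, 40]))

mu, sigma, outliers = correlation_summary(icc2, pt)
print(f"mean delta = {mu:.1f}ps, stddev = {sigma:.1f}ps, outliers = {outliers}")
```

Running the same comparison once per component (SPEF, GBA without derates, GBA with POCV, with SI, PBA) shows which step introduces the offset or the spread.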
Successful Deployment of RedHawk Analysis Fusion at 7nm
❖ Successful evaluation of RedHawk Analysis Fusion technology at SARC on 7nm designs in ~3 months, meeting initial customer criteria.
❖ SARC is a great ‘learning’ customer driving productivity and usability enhancements.
❖ Highly engaged R&D:
❖ Several enhancements to the tools, including the ability to launch signoff analysis of multiple vectors in parallel from one ICC2 session
❖ Improved GUI features to display voltage drop on cell instance IR drop map
❖ Technology enables Physical Design teams to do early analysis and fixing of signoff quality IR drop issues during block-level implementation. New usage model/users for power analysis, added value to design flow/QoR, new market opportunity.
❖ Accuracy proven: It’s all about the trust in correlation. RedHawk Analysis Fusion results identical to RedHawk signoff.
❖ Starting production use now on live 7nm design.
❖ Testing IVD based place_opt for deployment with 2018.06-SP2.
Automatic Timing Control (ATC)
Route Delay Estimation (RDE)
Route Delay Estimation (RDE)
Advantages of RDE model
Via Ladder ( Pillar ) Construction
M1
Receiver
Receiver
M1
EM Via Ladder for High Drive Cells

Synopsys Technologies for the Next Project
Next generation ICC II technology items
- Improvements in ICC II 2018.06
- RDE + better layer promotion and auto-NDR
- ATC
- Total Power Optimization
- CCD more throughout the flow (in placeopt and synthesis)
- Timing driven restructuring
- Congestion restructuring
- Knee based delay optimization
- ICV for foundry metal fill in ICC II
- In-Design IR/IVD fixing
- PrimeTime PBA mode power recovery
- 7nm technology
- Via Pillars
- Color aware PG/STD cell placement interaction
Next generation technology - other tools
- STAR: SMC (Simultaneous multi-corner)
- Statistical Via Capability (initially PT)
- IR/IVD analysis (RedHawk Fusion) in PrimeTime


Conclusion
• Synopsys Design Platform with Fusion Technology enables SARC on high-performance Arm®-based CPU implementation for mobile devices
• Clock Mesh Design, Useful Skew + CCD, Multi-bit, Structured Datapath & Relative Placement (RP), Hierarchical Design, ICC II to PrimeTime Correlation, etc.
Optimized implementation delivered 20%+ improvements in Fmax & mW/MHz
• SARC continues to work closely with Synopsys to address PPA, runtime and repeatability/predictability needs on future projects
• Subsequent technology nodes (5nm, …) are getting very complex, with lots of opportunity for collaborative innovation in implementation

Thank You
