ta1-1-millar-pres-user
Using Synopsys Design Platform with Fusion Technology
• The San Jose Advanced Computing Lab (ACL) was opened in 2017 and GPU IP (leveraging 5
years of initial development from another division) was added to the joint charter.
SNUG 2018 4
Overview: CPU & System IP
• Public/Commercial Accomplishments
– Shipping in Galaxy S7, S8, S9 Premium Smartphones
• Current Focus
– Next 4 generations of Premium Smartphones
– Alternative markets
Initial Hardening of Arm licensed cores
• Specification
– Full implementation of the Armv7-A architecture Arm® Cortex®-A15 CPU
– Superscalar, variable-length, out-of-order pipeline
– Dynamic branch prediction with Branch Target Buffer (BTB) and Global History Buffer
– Separate L1 Translation Lookaside Buffers (TLBs) for instruction, data loads, data stores
– 4-way set-associative 512-entry L2 TLB per processor
– 32KB L1 instruction & data caches, shared L2 cache
– Arm AMBA® 4 AXI Coherency Extensions master interface
– Accelerator Coherency Port (ACP) implemented as AXI3 slave interface
– VFP floating point unit & Neon media processing engine
• Configurable elements
– Number of Arm® Cortex®-A15 cores (1 to 4)
– Size of L2 cache (up to 4 MB)
– [Optional] VFP, Neon engines
First Generation: M1, SCI, and SMC Highlights
• Successfully delivered first-generation Samsung proprietary CPU, coherent interconnect and memory controller
– Armv8-A 64-bit 4-wide out-of-order CPU with shared 2MB L2 cache
• Time-to-market
– Built a team and accomplished in 3.5 years (others struggle to do this in 5 years)
– Set a solid capability and product foundation for the future
• High quality
– No validation gating bugs for EVT0, no production gating bugs for EVT1
• Industry landscape
– Our first generation CPU was competitive
Third Generation: M3, SCI, and SMC Highlights
• Successfully delivered differentiated Samsung proprietary CPU,
coherent interconnect and memory controller
– Armv8-A 64-bit 6-wide out-of-order CPU with shared 4MB L3 cache
• Time-to-market
– Built a team and accomplished in 3.5 years (others struggle to do this in 5 years)
– Set a solid capability and product foundation for the future
Detailed disclosure planned for Hot Chips 2018
• High quality
– No validation gating bugs for EVT0, no production gating bugs for EVT1
Challenges and Solutions
• Our team does core hardening for our proprietary Arm instruction set
compliant processor cores used in SoC designs throughout Samsung
Flow tools: Gensys, Formality, ICC II Design Planning, StarRC, PrimeTime, Signoff DRC/LVS
New Fusion Technology™ inside Synopsys Design Platform
Diagram labels: Signoff Fusion, Test Fusion, Test, Design Fusion, IC Compiler II, ECO Fusion
PrimeTime and ECO Flow Overview
ECO capability in PrimeTime
▪ Setup/hold fixing
▪ Useful skew
▪ Sequential cell swap/optimization
▪ Leakage recovery
▪ Dynamic power reduction
▪ Redundant buffer removal
▪ max_tran/max_cap/noise for DRC fixing
▪ Additional hold fixing
Flow: IC Compiler II, StarRC, PrimeTime/PrimeTime PX, PT ECO
• GPD format from StarRC to PrimeTime
• SMC (simultaneous multi-corner)
• Analyze_subcircuit for SPICE analysis of clock mesh
• Statistical via computation for 7nm and lower
• IVD failure analysis in PrimeTime
• Ansys RedHawk integration (Fusion)
Implementation Flow
Key Technologies Used
• Useful skew
• Clock mesh flow
• MCMM
• RDE / SI Prediction
Agenda
Conclusion
Multi-bit ATC
Via Ladder
Hierarchical Design
CTS vs. Mesh
• Implementation complexity
– CTS: Flow is mature and very easy to implement.
– Mesh: More time consuming; mesh has to be carefully planned around floorplan. ICC still maturing. We ran into issues along the way, but worked around them. STA requires special flow.
• Skew
– CTS: Will generally have higher global/local skews for any sizable design.
– Mesh: Will have tighter skews.
• Max Freq.
– CTS: Will tend to be lower because of higher skew.
– Mesh: Since skew is better (particularly local skew), less time is wasted and higher frequency is achievable.
• PVT tracking
– CTS: Tries to match delays using interconnect & gate delay. CTS in MCMM mode may reduce exposure, but in general paths won’t track as well across different corners.
– Mesh: If structure is built very regularly, should track across PVT much better.
• POCV Margin
– CTS: Paths in general will have larger non-common depth -> higher POCV. Also, paths are dissimilar, so traditional POCVM margin is justified.
– Mesh: Very structurally similar. Much lower POCVM.
• Power
– CTS: Dynamic power can be similar between CTS & Mesh depending on structure of mesh.
– Mesh: Dynamic power is similar between CTS & Mesh. Tighter skew will result in less % LVt usage (TNS).
• Implementing Useful Skew
– CTS: IC Compiler has built-in useful skew ability, and “pulling” flops earlier is conceptually easier. CCD compatible with CTS.
– Mesh: Relatively easy to implement “push” flops, but “pull” is more difficult and costly for power. CCD more difficult with Mesh.
• Flexibility
– CTS: Built to be flexible. Can work around macros, metal layer restrictions, tight corners; can easily adapt to any floorplan and useful skew.
– Mesh: Requires more careful planning. Metal layer conflicts with power grid. Driver conflicts with macro placement. Tuning can help recover some of loss.
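The POCV Margin comparison can be made concrete with a small calculation. The sketch below assumes independent per-stage delay sigmas that combine by root-sum-square, which is the general idea behind POCV; the 2.0 ps per-stage sigma and the stage counts (12 for a CTS-like tree, 2 for a mesh-like structure) are hypothetical illustration values, not numbers from the slides.

```python
import math

def pocv_path_sigma(stage_sigmas_ps):
    """Root-sum-square of independent per-stage delay sigmas (in ps)."""
    return math.sqrt(sum(s * s for s in stage_sigmas_ps))

def clock_skew_margin_ps(non_common_depth, sigma_per_stage_ps, n_sigma=3.0):
    """N-sigma margin from the non-common clock stages of a launch/capture pair.

    Assumption: launch and capture branches each have `non_common_depth`
    stages with the same per-stage sigma, and the two branches are independent.
    """
    launch = pocv_path_sigma([sigma_per_stage_ps] * non_common_depth)
    capture = pocv_path_sigma([sigma_per_stage_ps] * non_common_depth)
    return n_sigma * math.sqrt(launch ** 2 + capture ** 2)

# Hypothetical depths: deep non-common branches (CTS-like) vs. shallow (mesh-like).
cts_margin = clock_skew_margin_ps(non_common_depth=12, sigma_per_stage_ps=2.0)
mesh_margin = clock_skew_margin_ps(non_common_depth=2, sigma_per_stage_ps=2.0)
print(f"CTS-like margin:  {cts_margin:.1f} ps")
print(f"Mesh-like margin: {mesh_margin:.1f} ps")
```

Under these assumed values the deeper non-common clock path needs roughly 29 ps of margin against 12 ps for the shallow one, which is the direction of the trade-off the comparison describes.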
Clocks and Margins
• Combination of CTS based and Mesh based clock tree structure
– High Speed clocks in CPU and Non-CPU use separate meshes
– Non-CPU has CTS trees for slower speed functional/debug/test clocks
• Meshes are built with identical metal topology, drivers, x&y pitch, etc.
– Allows very tight tracking between meshes across PVT and minimizes OCV
– Non-CPU meshes and CPU meshes will track more closely across PVT corners
and variation components, but each mesh has its own load distribution so there
will be some slight imperfection in their tracking (we can tweak sizes)
create_clock_mesh -net {ck_gclkcr_MESH} \
-layers {xx yy} -widths {0.x 0.y} -lower_left {134.9 149.4} \
-pitches {230.4 241.8} \
-bounding_box {{5.0 5.0} {2095.0 1895.0}}
create_via -net Cr_MESH -no_snap \
-via_def M_XXX_HV -shape_use user_route -origin {X1 Y1} \
-orientation R90
Figure labels: Clock driver & H-tree routing; Mesh Straps
Clock H-tree / MESH
4 Route Fishbones
Clock H-tree / MESH
OR
Clock H-tree / MESH
Flow: Design Compiler Graphical, ICC II Design Planning, Mesh (inputs: Constraints & Scenarios)
This has the following steps:
• Merge / Split
synthesize_multisource_clock_subtrees -from merge -to merge
synthesize_multisource_clock_subtrees -from optimize -to optimize
Other Clock Routing
Useful Skew
Figure: chain of flops (D, Q, CK) with path delays and slacks at the 1000ps target
– 1100ps path: -100ps (slack)
– 700ps path: +300ps (slack)
– 1050ps path: -50ps (slack)
– 900ps path: +100ps (slack)
Before Scheduling: Frequency target = 1.0GHz (1000ps), but actual is (1000+100)ps ≈ 0.9GHz
After Scheduling (both WNS & TNS improvement): Frequency target = 1.0GHz (1000ps), actual is (1000+0)ps = 1.0GHz
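The before/after numbers can be reproduced with a short calculation. This is a minimal sketch, assuming the four path delays form a linear chain in the order 1100, 700, 1050, 900 ps with the chain endpoints held at zero clock latency; that ordering and the chosen latency offsets (+100 ps and +50 ps) are illustrative assumptions, since the slide does not give the scheduled values.

```python
def path_slacks(period_ps, delays_ps, latencies_ps):
    """Setup slack per stage: period - data delay + (capture latency - launch latency).

    latencies_ps has one more entry than delays_ps: entry i launches
    stage i and entry i + 1 captures it.
    """
    return [
        period_ps - delay + (latencies_ps[i + 1] - latencies_ps[i])
        for i, delay in enumerate(delays_ps)
    ]

def achievable_freq_ghz(period_ps, slacks_ps):
    """Frequency once the period is stretched to absorb any negative WNS."""
    wns = min(slacks_ps)
    actual_period_ps = period_ps - min(wns, 0)
    return 1000.0 / actual_period_ps

PERIOD_PS = 1000
DELAYS_PS = [1100, 700, 1050, 900]  # assumed chain order

# Before scheduling: every clock pin sees the same latency.
before = path_slacks(PERIOD_PS, DELAYS_PS, [0, 0, 0, 0, 0])
# After scheduling: hypothetical offsets that borrow time from the slack-rich stages.
after = path_slacks(PERIOD_PS, DELAYS_PS, [0, 100, 0, 50, 0])

print("before:", before, f"{achievable_freq_ghz(PERIOD_PS, before):.3f} GHz")
print("after: ", after, f"{achievable_freq_ghz(PERIOD_PS, after):.3f} GHz")
```

With these offsets every stage ends at zero or positive slack, so both WNS and TNS improve and the 1000 ps target is met.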
Useful Skew / Concurrent Clock and Data
Design Compiler Graphical
Investigating CCD Earlier in Flow
Design Compiler Graphical: Architectural useful skew (IDEAL MODE)
set_clock_latency XXX -rise U1REG/CK
ICC II Design Planning
Advantages and Disadvantages of Multi-Bit in a Clock Mesh Flow
Advantages
- Lower internal clock power
- Less CK pin cap on ICG
- Fewer and smaller ICGs on MESH
- Less MESH power
- Less SI and SE routing
- Less hold buffering (internal bits correct by construction)
Disadvantages
- Strength/Vt type for most critical bit
- Limits useful skew flexibility
- RTL SAIF / mapping
- Very disruptive when splitting
General Multi-Bit Flow
Flow stages: place_opt, clock_opt, route, route_opt
Flow steps: Initial placement, Grouping and banking, Optimization, Debanking
Input files: Multibit reg map file, Register group file
• MBIT banking/de-banking available in ICC II
• Slack threshold per timing group can be specified for registers to be considered for banking
• Only registers with the same latency values are merged
• SVF file will be updated for formal verification
• Banking ratio of 85% achieved on most blocks
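The banking rules above (a per-timing-group slack threshold, and merging only registers with equal latency) can be illustrated with a toy model. This is a sketch under stated assumptions, not the tool's algorithm: the register tuples, the 4-bit and 2-bit bank widths, and the threshold values are hypothetical, and a real flow would also weigh placement proximity and library cell availability.

```python
from collections import defaultdict

def bank_registers(regs, slack_threshold_ps, widths=(4, 2)):
    """Group registers into multi-bit banks.

    regs: list of (name, timing_group, latency_ps, slack_ps) tuples.
    slack_threshold_ps: dict of timing_group -> minimum slack to be a candidate.
    Returns (banks, singles): lists of banked name groups and leftover names.
    """
    by_latency = defaultdict(list)
    singles = []
    for name, group, latency, slack in regs:
        # A register is a candidate only if it clears its group's slack threshold.
        if slack >= slack_threshold_ps.get(group, float("inf")):
            by_latency[latency].append(name)
        else:
            singles.append(name)

    banks = []
    for latency in sorted(by_latency):
        names = by_latency[latency]
        # Fill the widest banks first, then narrower ones.
        for width in sorted(widths, reverse=True):
            while len(names) >= width:
                banks.append(names[:width])
                names = names[width:]
        singles.extend(names)  # too few same-latency candidates left to bank
    return banks, singles

def banking_ratio(banks, singles):
    """Fraction of register bits that ended up in multi-bit cells."""
    banked_bits = sum(len(bank) for bank in banks)
    return banked_bits / (banked_bits + len(singles))

# Hypothetical registers: (name, timing group, clock latency in ps, slack in ps).
regs = [
    ("r0", "core", 120, 80), ("r1", "core", 120, 60),
    ("r2", "core", 120, 55), ("r3", "core", 120, 90),
    ("r4", "core", 150, 70), ("r5", "core", 150, 65),
    ("r6", "core", 150, 5),   # below the core threshold, stays single-bit
    ("r7", "io", 200, 40),    # no same-latency partner, stays single-bit
]
banks, singles = bank_registers(regs, {"core": 50, "io": 30})
print(banks, singles, banking_ratio(banks, singles))
```

In this toy case six of eight bits are banked, a ratio of 0.75; the 85% figure quoted above is the result reported for real blocks, not something this sketch predicts.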
Multi-Bit Conversion at the end of place_opt
Figure panels: Before Multi-bit conversion; After Multi-bit conversion (each showing an ICG)
Multi-Bit Flow
Design Compiler Graphical
Single Bit Only (multi-bits dont_use)
set_attr [get_lib_cell */* -filter "multibit_width > 1"] dont_use true
Investigating Merging/Splitting throughout flow
Flow: Design Compiler Graphical, PrimeTime Signoff STA
Single Bit Only (multi-bits dont_use)
set_attr [get_lib_cell */* -filter "multibit_width > 1"] dont_use true
create_multibit -name ldSpecVal_e2_reg_ldVprn_e2_reg_7A_ldVprn_e2_reg_6A_ldVprn_e2_reg_5A \
{ ufpb_grp/u_fpu_load_xpose/ldSpecVal_e2_reg
ufpb_grp/u_fpu_load_xpose/ldVprn_e2_reg_7A
ufpb_grp/u_fpu_load_xpose/ldVprn_e2_reg_6A
ufpb_grp/u_fpu_load_xpose/ldVprn_e2_reg_5A } -lib_cell lib/H2V2X……
Structured Datapath with Relative Placement
Can build up RP definition hierarchically
Example: RP script
Structured Datapath with Relative Placement
Figure panels: Fixed cells from DEF; RP placement with anchor point; RP placement without anchor point
Structured Datapath with Relative Placement
Relative Placement flow in DCG and ICC II
Hierarchical Design Planning
Flow steps:
• Gensys restructuring
• Design Compiler Graphical
• Initialize floorplan
• Shaping / Macro placement
• Create H-tree/Mesh
• Create Power
• Timing Budgets
Notes:
• Using Gensys to add wrappers for partitioning RTL
• Feedthroughs are added to wrapper and not inside module hierarchy
• Clock mesh push down
• Power Grid using composite patterns
Gensys Hierarchical Build Flow
Gensys RTL restructuring flow
Automate splitting of blocks to manage build time
Figure: blk.v, with module blk (module ... endmodule) enclosed by wrapper module blk_wrap (module ... endmodule)
Gensys Hierarchical Build Flow
Characterize Block PG Flow
• Design inputs
– Full-chip PG Constraints
➢ Patterns
➢ Via Rules
➢ Strategies
Diagram labels: Design; Full Chip PG Constraints (Patterns, Via Rules, Strategies)
Composite Pattern Implementation
Figure: composite PG pattern. Vertical straps repeat as vdd vss vss vdd vddi vss; horizontal straps are vdd, vddi, vss; dimension labels 14, 9, 8, 2.5, 10.
create_pg_wire_pattern stra_base -layer @l -direction @d -width @w -spacing @s \
-pitch @p -parameters {l d w p s}
create_pg_composite_pattern core_pattern -nets {vdd vddi vss} \
-add_patterns { \
{{pattern: stra_base} {nets: vdd vss vss vdd} {parameters: {M6 vertical 1.5 10 1}}{offset: 3}} \
{{pattern: stra_base} {nets: vddi vss} {parameters: {M4 vertical 1.5 14 0.6}}{offset: 9}} \
{{pattern: stra_base} {nets: vdd vddi vss}{parameters: {M5 horizontal 1.5 8 0.6}}{offset: 2.5}}}
set_pg_strategy core_strategy -core -pattern {{name: core_pattern} {nets: vdd vddi vss}}
compile_pg -strategies {core_strategy}
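One detail that is easy to misread in the commands above is the parameter order. The sketch below assumes that each `parameters:` value list binds positionally to the names declared in `-parameters {l d w p s}`, so the fourth value is the pitch and the fifth is the spacing even though `-spacing` appears before `-pitch` on the command line. The helper name `bind_pattern_parameters` is hypothetical and exists only for this illustration.

```python
def bind_pattern_parameters(parameter_names, values):
    """Pair each declared parameter name with the value in the same position."""
    if len(parameter_names) != len(values):
        raise ValueError("each pattern entry needs one value per declared parameter")
    return dict(zip(parameter_names, values))

# Declaration order taken from: -parameters {l d w p s}
PARAMETER_NAMES = ["l", "d", "w", "p", "s"]

# Value lists copied from the three composite-pattern entries.
ENTRIES = {
    "M6": ["M6", "vertical", 1.5, 10, 1],
    "M4": ["M4", "vertical", 1.5, 14, 0.6],
    "M5": ["M5", "horizontal", 1.5, 8, 0.6],
}

bound = {
    layer: bind_pattern_parameters(PARAMETER_NAMES, values)
    for layer, values in ENTRIES.items()
}
for layer, params in bound.items():
    print(
        f"{layer}: direction={params['d']} width={params['w']} "
        f"pitch={params['p']} spacing={params['s']}"
    )
```

Read this way, the pitches come out as 10 (M6), 14 (M4), and 8 (M5), which line up with the dimension labels 10, 14, and 8 in the figure.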
ICC II versus PT Correlation
Fusion Flow
Route_opt1 (CCD enabled)
Route_opt2 (CCD enabled)
Extract.starrc_mode set 0
ICC II versus PT Correlation
(Additional Suggestions)
• Good Timing Correlation needed to minimize ECO cycle time
– use PT timer in last stages of route_opt
set_app_options -name time.use_pt_delay -value true
– AWP in ICC II (several TBCs for improvements)
– POCV, margins, etc. the same in both tools
– Use CCS models + LVF (new moment-based POCV capability coming)
– Investigating new PBA based optimization in ICC II (final stages of route)
• Ensure all dominant scenarios/modes (setup & hold) are optimized in ICC II before route_opt
– ICC II runtime improvements in progress (particularly inactive scenarios)
Successful Deployment of RedHawk Analysis Fusion at 7nm
❖ Successful evaluation of RedHawk Analysis Fusion technology at SARC on 7nm designs in ~3 months, meeting initial customer criteria.
Automatic Timing Control (ATC)
Route Delay Estimation (RDE)
Route Delay Estimation (RDE)
Advantages of RDE model
Via Ladder ( Pillar ) Construction
Figure labels: M1, Receiver
EM Via Ladder for High Drive Cells
Agenda
Conclusion
Conclusion