Stretching Performance Potentials Beyond Hardware Constraints
PRESENTATION CONTENTS
RESULTS
INTRODUCTION
Market demands on performance, design turnaround and size
Research interest in reconfigurable computing
90% of time spent in 10% of code [90/10 rule]
Port selected compute-intensive code blocks to hardware [Hot Areas]
INTRODUCTION
Port selected code to hardware (save size + time)
Can we reduce size further?
Dynamic Reconfiguration: reuse hardware over time! (+ size reduced, - reconfiguration cost)
Partial Dynamic Reconfiguration: tailor-cut hardware reuse along space (only reconfigure when feasible + needed)
INTRODUCTION
Partial Reconfiguration: communication network reconfiguration overhead
Tiled Partially Reconfigurable Systems [16]
Intelligent choice of bin sizes to compensate for reduced flexibility (Contribution: first algorithm for partitioning)
[Figure: Recon. Fabric and GPP]
[16] Markus Koester, Wayne Luk, Jens Hagemeyer, Mario Porrmann and Ulrich Rückert, "Design Optimizations for Tiled Partially Reconfigurable Systems," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 19, no. 6, June 2011.
Problem Model
Identify Hot Areas
Generate CIS Table
Extract Loop Trace
Available area on fabric
Reconfiguration and Execution can overlap
One Reconfiguration at a time
[Figure: Recon. Fabric and GPP]
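The two timing rules of the problem model (reconfiguration may overlap with execution, and only one reconfiguration runs at a time) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `total_cycles`, the dictionaries of cycle counts, and the policy of starting each reconfiguration as early as the single reconfiguration port and the target bin allow. It is not the evaluation routine used in the presentation.

```python
def total_cycles(tasks, bins_of, exec_cycles, reconfig_cycles):
    """Estimate execution time of a loop trace under the two timing rules.

    tasks           -- variant keys in loop-trace order (hypothetical names)
    bins_of         -- bin index per task, or None when the task runs on the GPP
    exec_cycles     -- key -> execution time in cycles
    reconfig_cycles -- key -> reconfiguration time in cycles
    """
    resident = {}    # bin -> variant currently configured in it
    bin_free = {}    # bin -> cycle at which its last execution finished
    port_free = 0    # the single reconfiguration port is busy until this cycle
    clock = 0        # cycle at which the previous task finished executing
    for key, b in zip(tasks, bins_of):
        ready = 0
        if b is not None and resident.get(b) != key:
            # Assumed policy: reconfigure as soon as port and bin are both idle,
            # which lets the reconfiguration overlap with earlier executions.
            start = max(port_free, bin_free.get(b, 0))
            port_free = start + reconfig_cycles[key]
            ready = port_free
            resident[b] = key
        begin = max(clock, ready)    # tasks execute one after another
        clock = begin + exec_cycles[key]
        if b is not None:
            bin_free[b] = clock
    return clock


cycles = total_cycles(["A", "B", "A"], [0, 1, 0],
                      {"A": 10, "B": 5}, {"A": 4, "B": 6})
print(cycles)  # 29
```

In this toy run the reconfiguration for B (6 cycles) is hidden behind the first execution of A, and the second A reuses the circuit already in bin 0, so the total is 29 cycles rather than the 35 a fully serial schedule would need.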
Tasks to do
Selection (Choose Implementation Variants)
Partitioning (Partition Reconfigurable Area in Bins)
Placement (Assign Bins for the execution of Implementation Variants)
Circuit instantiated on a Tile can be Coprocessor/Custom Instruction (Contribution: Framework applicable to both models)
Either specific to Coprocessor model or Custom Instruction model
No consideration for multiple Implementation Alternatives
Partial Reconfiguration not supported / joint optimization with Placement and Partitioning not considered
Miaoqing Huang, Vikram K. Narayana, Mohamed Bakhouya, Jaafar Gaber and Tarek El-Ghazawi, "Efficient Mapping of Task Graphs onto Reconfigurable Hardware Using Architectural Variants," IEEE Transactions on Computers, Aug 2011.
Communication overheads for partial reconfiguration neglected
Multi-sized tiles/bins not supported
Joint optimization with Selection and Partitioning not considered
Honglei Han, Wenju Liu, Wu Jigang and Guiyuan Jiang, "Efficient Algorithm for Hardware/Software Partitioning and Scheduling on MPSoC," Journal of Computers, vol. 8, no. 1, January 2013.
Hot Area 2
Implementation alternative 1:
-Execute on GPP
-Area requirement on fabric = 0 logic blocks
-Execution time = 7 clock cycles
Implementation alternative 2:
-Implement custom hardware to square, and subtract in GPP
-Area requirement on fabric = 2 logic blocks
-Execution time = 4 clock cycles
Implementation alternative 3:
-Implement custom hardware to square and subtract
-Area requirement on fabric = 3 logic blocks
-Execution time = 3 clock cycles
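One way to hold the implementation alternatives of a hot area is a small table of (area, execution time) records. The sketch below encodes the three alternatives listed for Hot Area 2; the class name `Variant`, the helper `fastest_fitting` and the short descriptions are illustrative assumptions, not names from the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    description: str
    area_blocks: int   # area requirement on fabric, in logic blocks
    cycles: int        # execution time, in clock cycles


# The three implementation alternatives listed for Hot Area 2.
HOT_AREA_2 = [
    Variant("execute on GPP", 0, 7),
    Variant("custom hardware to square, subtract in GPP", 2, 4),
    Variant("custom hardware to square and subtract", 3, 3),
]


def fastest_fitting(variants, available_blocks):
    """Return the fastest variant whose area fits in the available fabric."""
    fitting = [v for v in variants if v.area_blocks <= available_blocks]
    return min(fitting, key=lambda v: v.cycles)


print(fastest_fitting(HOT_AREA_2, 2).cycles)  # 4
```

Because the GPP alternative needs 0 logic blocks, there is always at least one variant that fits, so `fastest_fitting` never sees an empty candidate list for this table.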
Overview of Framework
P: Population Limit
Fitness: Execution Time (Goal of Genetic Optimizer is to minimize Fitness)
Step 2: Partitioning 1
Recursive Backtracking
Backtracking (index, current)
    If index > length of list
        return
    for i from index to length of list
        If current + list[i] = goal
            candidate_solution add list[i]
            solutions add candidate_solution
            candidate_solution [ ]
            return
        If current + list[i] < goal
            candidate_solution add list[i]
            Backtracking (index + 1, current + list[i])

Example: List 3, 2, 1; Goal 3
goal = Available Area
list = list of all area requirements specified by loop trace and chromosome under consideration, sorted in descending order
index = entry number of list under processing
current = cumulative sum of list entries traversed in a particular thread (stored in
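The recursive backtracking idea can be sketched in runnable form as follows. This is a minimal sketch under stated assumptions rather than the slide's exact procedure: it enumerates every combination of area requirements that sums exactly to the goal, it recurses from position `i + 1` so that each list entry is used at most once, and the names `exact_fill_partitions` and `backtrack` are illustrative.

```python
def exact_fill_partitions(areas, goal):
    """Return all combinations of area requirements that sum exactly to goal."""
    items = sorted(areas, reverse=True)   # descending order, as on the slide
    solutions = []

    def backtrack(start, current, chosen):
        if current == goal:
            solutions.append(list(chosen))   # store a copy of the candidate
            return
        for i in range(start, len(items)):
            if current + items[i] <= goal:
                chosen.append(items[i])
                backtrack(i + 1, current + items[i], chosen)
                chosen.pop()                 # undo the choice and try the next entry

    backtrack(0, 0, [])
    return solutions


print(exact_fill_partitions([3, 2, 1], 3))  # [[3], [2, 1]]
```

For the slide's example (list 3, 2, 1 and goal 3) the sketch returns two candidate partitions: one bin of size 3, or two bins of sizes 2 and 1.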
Step 2: Partitioning 2
Greedy Partitioning
Step 3: Placement
(*future_reuse_index: number of times the same task type reoccurs with the same implementation variant selection)
Get Area model from partitioning algorithm
Repeat until end of chromosome:
    Select next gene
    If corresponding implementation variant already placed:
        Use same bin placement
    Else:
        Loop through all empty bins:
            Place in smallest bin satisfying area requirement
        If not placed until this step:
            Loop through all filled bins:
                Determine future_reuse_index* of placed implementation variant
            Place in smallest bin with smallest future_reuse_index satisfying area requirement
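The placement steps above can be sketched as follows. Several details are assumptions made for illustration: the function name `place`, the representation of a gene as a hashable key, the treatment of zero-area (software) genes as unplaced, and the tie-break among filled bins, which here prefers the lowest future reuse count first and the smallest bin second.

```python
from collections import Counter


def place(genes, areas, bin_sizes):
    """Assign a bin index to each gene, or None if it runs in software or fits nowhere.

    genes     -- hashable (task, variant) keys in chromosome order (hypothetical)
    areas     -- key -> area requirement
    bin_sizes -- bin sizes produced by the partitioning step
    """
    resident = [None] * len(bin_sizes)   # variant currently held by each bin
    remaining = Counter(genes)           # occurrences not yet processed
    placement = []
    for gene in genes:
        remaining[gene] -= 1             # now counts future reoccurrences only
        need = areas[gene]
        if need == 0:
            placement.append(None)       # software variant, no bin needed
            continue
        if gene in resident:
            placement.append(resident.index(gene))   # reuse the same bin
            continue
        fits = [b for b, size in enumerate(bin_sizes) if size >= need]
        empty = [b for b in fits if resident[b] is None]
        if empty:
            chosen = min(empty, key=lambda b: bin_sizes[b])
        elif fits:
            # Evict the variant with the lowest future reuse, then the smallest bin.
            chosen = min(fits, key=lambda b: (remaining[resident[b]], bin_sizes[b]))
        else:
            placement.append(None)       # no bin is large enough
            continue
        resident[chosen] = gene
        placement.append(chosen)
    return placement


print(place(["A", "B", "A", "C", "B"], {"A": 2, "B": 3, "C": 2}, [2, 3]))
# [0, 1, 0, 0, 1]
```

In the usage example, C evicts A from bin 0 because A never reoccurs afterwards, while B (still needed once more) keeps bin 1.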
Step 3: Placement
Example: Place_n_Partition
Example: Evolution
Chromosome Fitness Value (Execution Time)
Possible selection solutions for this simple problem: 2 x 3 x 3 = 18.
Combining with Partition and Placement: 72 points.
We explored only 8 points; execution time was reduced from 18 clock cycles to 10 clock cycles.
[Figure: chromosomes (2 2 3, 1 3 1, 2 1 3, 1 1 1, 1 2 3, 1 3 3, 1 2 1) evolving through mutation and crossover. The elite chromosome with best fitness is passed unchanged to the next generation as Elite.]
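The evolution step (elitism, crossover, mutation) can be sketched for chromosomes of three genes with 2, 3 and 3 variants, matching the 2 x 3 x 3 search space of the example. This is a minimal sketch under stated assumptions: single-point crossover, per-gene mutation with a fixed rate, and a toy fitness that simply sums hypothetical cycle counts. Only the 7/4/3 cycle values of the second gene come from the Hot Area 2 slide; the other cycle values, the function names and the mutation rate are invented for illustration.

```python
import random


def next_generation(population, fitness, variants_per_gene, rng, mutation_rate=0.1):
    """Build one new generation; lower fitness (execution time) is better."""
    ranked = sorted(population, key=fitness)
    children = [list(ranked[0])]              # elite passes through unchanged
    while len(children) < len(population):
        mother, father = rng.sample(ranked, 2)
        cut = rng.randrange(1, len(mother))   # single-point crossover
        child = list(mother[:cut]) + list(father[cut:])
        for g, n_variants in enumerate(variants_per_gene):
            if rng.random() < mutation_rate:  # mutation: re-pick this gene
                child[g] = rng.randint(1, n_variants)
        children.append(child)
    return children


# Hypothetical per-variant cycle counts (variant number -> cycles) for each gene.
CYCLES = [{1: 5, 2: 3}, {1: 7, 2: 4, 3: 3}, {1: 6, 2: 5, 3: 4}]


def toy_fitness(chromosome):
    """Toy fitness: sum of cycles, ignoring placement and reconfiguration."""
    return sum(table[gene] for table, gene in zip(CYCLES, chromosome))


rng = random.Random(0)
population = [[2, 2, 3], [1, 3, 1], [1, 1, 1], [1, 2, 3]]
new_population = next_generation(population, toy_fitness, [2, 3, 3], rng)
print(new_population[0])  # [2, 2, 3]
```

The first chromosome of the new generation is always the best of the old one, so the best fitness never gets worse from one generation to the next.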
Results
Both variants of the Genetic Algorithm outperform Dynamic Programming in all test cases.
Greedy-GA outperforms Brute-GA in almost all cases.
The incremental performance benefit of the Genetic Algorithm over Dynamic Programming increases as the ratio of best area to available area increases.
PU for = 2
Software    1            1            1            1            1
DP          0.749478079  0.774907749  0.762532982  0.693251534  0.846889952
Brute-GA    0.645093946  0.731857319  0.76121372   0.689161554  0.80861244
Greedy-GA   0.645093946  0.730627306  0.715039578  0.564417178  0.784688995

PU for = 6
DP          0.89059501   0.916129032  0.950231481  0.900753769  0.945165945
Brute-GA    0.723608445  0.870967742  0.858796296  0.909547739  0.86002886
Greedy-GA   0.715930902  0.758709677  0.858796296  0.708542714  0.796536797
Results
Percent difference of PU between DP and GA (one row per setting, values in the same column order as the PU tables)
=2    10.43841     4.428044     4.74934      12.88344     6.220096
=3    16.32047     17.07921     13.38983     12.38318     6.487696
=4    12.27786753  26.28062361  11.43867925  11.30434783  7.267144319
=5    15.42699725  8.516886931  10.92150171  9.322033898  11.42533937
=6    17.46641     15.74194     9.143519     19.22111     14.86291
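The tabulated percent differences appear consistent with the gap between the DP and Greedy-GA values of PU, expressed in percentage points. The sketch below checks this reading against the "= 2" values copied from the slides; the function name `pu_gap_percent` is an illustrative assumption, and the formula is inferred from the numbers rather than stated in the presentation.

```python
# PU values from the table captioned "PU for = 2", copied from the slides.
dp = [0.749478079, 0.774907749, 0.762532982, 0.693251534, 0.846889952]
greedy_ga = [0.645093946, 0.730627306, 0.715039578, 0.564417178, 0.784688995]


def pu_gap_percent(dp_pu, ga_pu):
    """Gap between DP and GA values of PU, in percentage points."""
    return [(d - g) * 100 for d, g in zip(dp_pu, ga_pu)]


gaps = pu_gap_percent(dp, greedy_ga)
print([round(x, 5) for x in gaps])
```

The rounded results agree with the "=2" values 10.43841, 4.428044, 4.74934, 12.88344 and 6.220096 to within rounding.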
Virtex-6 has 8 Registers and 4 LUTs per Slice
Slices used = Max(Reg/8, LUT/4)
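The slice estimate can be written as a small helper. Rounding each quotient up to a whole slice is an assumption added here (the slide gives only Max(Reg/8, LUT/4)), and the function name `slices_used` is illustrative.

```python
import math


def slices_used(registers, luts):
    """Estimate Virtex-6 slices from register and LUT counts.

    Uses 8 registers and 4 LUTs per slice, as stated on the slide.
    Rounding up to whole slices is an assumption of this sketch.
    """
    return max(math.ceil(registers / 8), math.ceil(luts / 4))


print(slices_used(16, 40))  # 10
```

A design with 16 registers and 40 LUTs is LUT-bound: the registers alone would need 2 slices, the LUTs need 10.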
LSH: Solution
Loop Trace
1 2 3 1 2 3 1 2 3 4 1 2
Just 1.94 percent less than best possible execution time
Using 17 times less area!
Selection
1 1 4 4 1 4 1 1 4 1 1 1 4 1 1 4 1 1 4 1 1 1 4 1 1 4 1 1 4 1
Placement
1 1 1 2 2 1 2 1 1 2 2 1 1 1 1 2 2 1 1 3 1 1 1 2 2 1 1 1 1 2 2 1 1 1 1 2 2 1 1 3 1 1 1 2 2 1 1 1 1 2 2 1 1 1 1 2 2 1 1 3
Reconfiguration Map
0 0 0 0 0 925 14148 15073 0 0 15553 16478 0 0 0 0 0 0
Execution Map
0 1536 1548 15073 15553 16478 29078 30614 30626 43226 43229 44765 44777 57377 58913 58925 71525 73061 73073 85673 85676 87212 87224 99824 101360 101372 113972 115508 115520 128120
1536 1548 14148 15553 15565 29078 30614 30626 43226 43229 44765 44777 57377 58913 58925 71525 73061 73073 85673 85676 87212 87224 99824 101360 101372 113972 115508 115520 128120 128123
Software Execution Time = 532341 cycles
Best Time (without Reconfig, Best CIS) = 117777 cycles
Area Required for Best CIS = 17529 Slices
LSH: Solution
Loop Trace
1 2 3 1 2 3 1 2 3 4 1 2
Just 1.79 percent less than best possible execution time
Using 19 times less area!
Selection
1 1 4 4 1 4 1 1 4 1 1 1 4 1 1 4 1 1 4 1 1 1 4 1 1 4 1 1 4 1
Placement
1 1 1 2 2 1 2 1 1 2 2 1 1 1 1 2 2 1 1 3 1 1 1 2 2 1 1 1 1 2 2 1 1 1 1 2 2 1 1 3 1 1 1 2 2 1 1 1 1 2 2 1 1 1 1 2 2 1 1 3
Reconfiguration Map
0 0 0 0 0 925 0 0 0 0 0 0 0 0 0 0 0 0
Execution Map
0 1536 1548 14148 15684 15696 28296 29832 29844 42444 42447 43983 43995 56595 58131 58143 70743 72279 72291 84891 84894 86430 86442 99042 100578 100590 113190 114726 114738 127338
1536 1548 14148 15684 15696 28296 29832 29844 42444 42447 43983 43995 56595 58131 58143 70743 72279 72291 84891 84894 86430 86442 99042 100578 100590 113190 114726 114738 127338 127341
Software Execution Time = 532341 cycles
Best Time (without Reconfig, Best CIS) = 117777 cycles
Area Required for Best CIS = 17529 Slices
[Chart: values plotted for = 2 to = 6 on an axis from 115000 to 150000; labelled values include 532341, 145911, 119776 and 2922]
QUESTIONS?
A area
physical_area
Line 3: Flooring takes you to the end of the last complete configuration. If it is empty, there is no need to explore more configurations.
Line 7: Similar logic as Line 3: if all loops are done and there is an empty configuration, then end.
[1]Dynamic Reconfig of CFU by T. Mitra, 2009c
Loop Trace 1 2 3 1 2 3 1 2 3 4 1 2 3 1 2 3 1 2 3 4 1 2 3 1 2 3 1 2 3 4
Selection 1 1 4 1 1 4 1 1 4 1 1 1 4 1 1 4 1 1 4 1 1 1 4 1 1 4 1 1 4 1
Partitioning: 1 bin of Size 925 Slices
Dynamic Reconfiguration
[Figure: Configurable Fabric, Virtual Area, Reconfiguration Overhead]
Any desired chunk of fabric may be reconfigured
Communication network cannot be static