
Compiling for VLIWs and ILP

Profiling
Region formation
Acyclic scheduling
Cyclic scheduling

1
Profiling
Many crucial ILP optimizations require good
profile information
ILP optimizations try to maximize
performance/price by increasing the IPC
Compiler techniques are needed to expose and
enhance ILP
Two types of profiles: point profiles and path
profiles

2
Compiling with Profiling

3
Point Profiles

- Point profiles collect statistics about points in call graphs and control flow graphs
- gprof produces call graph profiles: statistics on how many times a function was called, who called it, and (sometimes) how much time was spent in that function
- Control flow graph profiles give statistics on nodes (node profiles) and edges (edge profiles)

4
Path Profiles

- Path profiles measure the execution frequency of a sequence of basic blocks on a path in a CFG
- A hot path is a path that is (very) frequently executed
- Types include forward paths (no backedges), bounded-length paths (start/stop points), and whole-program paths (interprocedural)
- The choice among these is a tradeoff between accuracy and the efficiency of collecting the profile

[Figure: CFG with blocks B1-B7]

    Path 1 {B1, B2, B3, B5, B7} count = 7
    Path 2 {B1, B2, B3, B6, B7} count = 9
    Path 3 {B1, B2, B4, B6, B7} count = 123
5
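The path counts above can be collected by recording, at runtime, the sequence of basic-block IDs traversed between path start and stop points. A minimal sketch (the block names and trace format are illustrative, not a real instrumentation API):

```python
from collections import Counter

def profile_paths(traces):
    """Count how often each path (sequence of basic-block IDs) executes.

    traces: iterable of paths, each a sequence of block names observed
    at runtime, e.g. recorded by instrumentation at block entries.
    """
    counts = Counter()
    for path in traces:
        counts[tuple(path)] += 1
    return counts

def hot_paths(counts, threshold):
    """Return the paths executed at least `threshold` times."""
    return [p for p, n in counts.items() if n >= threshold]

# Example mirroring the slide: three distinct paths through B1..B7.
traces = (
    [("B1", "B2", "B3", "B5", "B7")] * 7 +
    [("B1", "B2", "B3", "B6", "B7")] * 9 +
    [("B1", "B2", "B4", "B6", "B7")] * 123
)
counts = profile_paths(traces)
```

With a threshold of 100, only the third path is hot, matching the counts on the slide.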
Profile Collection

- Data collected through code instrumentation is very detailed, but instrumentation overhead affects execution
- Hardware counters have very low overhead, but the information is not exhaustive
- Interrupt-based sampling examines machine state at intervals
- Collecting path profiles requires enumerating the set of paths encountered during runtime
- Instrumentation inserts instructions to record edge profiling events
6
Profile Bookkeeping

- Problem: compiler optimization modifies (instrumented) code in ways that change the use and applicability of profile information for later compilation stages
- Apply profiling right before the profile data is needed
- Axiom of profile uniformity: when one copies a chunk of a program, one should divide the profile frequency of the original chunk equally among the copies
- Use this axiom for point profiles as a simple heuristic
- Path profiles correlate branches, and thus path-based compiler optimizations preserve these profiles

7
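The axiom of profile uniformity can be applied mechanically whenever a transformation duplicates a block; a minimal sketch (the integer-count representation is illustrative):

```python
def split_profile(count, n_copies):
    """Divide a block's profile frequency equally among n copies.

    Integer counts are distributed so the copies still sum to the
    original count (the remainder goes to the first copies).
    """
    base, rem = divmod(count, n_copies)
    return [base + (1 if i < rem else 0) for i in range(n_copies)]

# Tail-duplicating a block executed 80 times into 2 copies:
copies = split_profile(80, 2)   # two copies share the count equally
uneven = split_profile(7, 3)    # remainder spread over the first copies
```

Keeping the sum invariant means later phases can still trust the total execution count of the original region.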
Instruction Scheduling

- Instruction scheduling is the most fundamental ILP-oriented compilation phase
- It is responsible for identifying and grouping operations that can be executed in parallel
- Two approaches:
  - Cyclic schedulers operate on loops to exploit ILP in (tight) loop nests, usually without control flow
  - Acyclic schedulers consider loop-free regions

[Figure: taxonomy of region shapes: acyclic (basic block, trace, superblock, DAG) versus cyclic]

8
Acyclic Scheduling of Basic Block Region Shapes

- The region is restricted to a single basic block
- Local scheduling of instructions in a single basic block is simple
- ILP is exposed by bundling operations into VLIW instructions (instruction formation or instruction compaction)

Before compaction:

    add $r13 = $r3, $r0
    shl $r13 = $r13, 3
    ld.w $r14 = 0[$r4]
    sub $r16 = $r6, 3
    shr $r15 = $r15, 9

After compaction into two bundles:

    add $r13 = $r3, $r0
    sub $r16 = $r6, 3
    ;; ## end of 1st instr.
    shl $r13 = $r13, 3
    shr $r15 = $r15, 9
    ld.w $r14 = 0[$r4]
    ;; ## end of 2nd instr.

9
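Bundling of independent operations can be sketched with a greedy in-order compactor; operations are modeled only by the registers they read and write (the triple format is illustrative, and unlike the example above, this sketch never moves a later op into an earlier bundle):

```python
def compact(ops, width):
    """Greedily pack operations into bundles of at most `width` slots.

    ops: list of (name, reads, writes) triples in program order.
    An op joins the current bundle only if it has no RAW, WAR, or
    WAW conflict with the ops already in that bundle.
    """
    bundles = [[]]
    written, read = set(), set()
    for name, reads, writes in ops:
        conflict = (set(reads) & written or      # RAW
                    set(writes) & read or        # WAR
                    set(writes) & written)       # WAW
        if conflict or len(bundles[-1]) == width:
            bundles.append([])                   # start a new bundle
            written, read = set(), set()
        bundles[-1].append(name)
        written |= set(writes)
        read |= set(reads)
    return bundles

ops = [
    ("add",  ["r3", "r0"], ["r13"]),
    ("shl",  ["r13"],      ["r13"]),   # RAW on r13 forces a new bundle
    ("ld.w", ["r4"],       ["r14"]),
    ("sub",  ["r6"],       ["r16"]),
    ("shr",  ["r15"],      ["r15"]),
]
bundles = compact(ops, width=4)
```

A real compactor also hoists independent later ops (like the `sub` above) into earlier bundles, which is how the slide's two-bundle schedule is obtained.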
Intermezzo: VLIW Encoding

- A VLIW schedule can be encoded compactly using horizontal and vertical nops
- Start bits, stop bits, or instruction templates are used to compress the VLIW instructions into variable-width instruction bundles

    add $r13 = $r3, $r0
    sub $r16 = $r6, 3
    ;; ## end of 1st instr.
    shl $r13 = $r13, 3
    shr $r15 = $r15, 9
    ld.w $r14 = 0[$r4]
    ;; ## end of 2nd instr.

10
Intermezzo: VLIW Execution Model Subtleties

- Horizontal issues within an instruction:
  - Does a read see the original value of a register, or the value written by a write in the same instruction?
  - A read and a write to the same register may be illegal
  - There are also exception issues
- Vertical issues across pipelined instructions: the EQ model versus the LEQ model
- The EQ model allows $r0 to be reused between the issue of the 1st instruction and its completion when the latency expires

    mov $r1 = 2
    ;;
    mov $r0 = $r1
    mov $r1 = 3
    ;;
    ld.w $r0 = 0[$r1]
    ;;
    add $r0 = $r1, $r2
    ;;
    sub $r3 = $r0, $r4
    ;;
    # load completed:
    add $r3 = $r3, $r0
11
Acyclic Region Scheduling for Loops

- To enlarge the region size of a loop body and expose more ILP, apply:
  - Loop fusion
  - Loop peeling
  - Loop unrolling

Original loops:

    DO I = 1, N
      A(I) = C*A(I)
    ENDDO
    DO I = 1, N
      D(I) = A(I)*B(I)
    ENDDO

After loop fusion:

    DO I = 1, N
      A(I) = C*A(I)
      D(I) = A(I)*B(I)
    ENDDO

After unrolling by 2 (assuming 2 divides N):

    DO I = 1, N, 2
      A(I) = C*A(I)
      D(I) = A(I)*B(I)
      A(I+1) = C*A(I+1)
      D(I+1) = A(I+1)*B(I+1)
    ENDDO
12
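The fused and unrolled loops must compute the same values as the originals; a Python analogue of the transformation makes the equivalence checkable (array contents here are illustrative):

```python
def fused(a, b, c):
    """Fused loop: both statements in one iteration."""
    a = a[:]                         # don't mutate the caller's array
    d = [0.0] * len(a)
    for i in range(len(a)):          # DO I = 1, N
        a[i] = c * a[i]
        d[i] = a[i] * b[i]
    return a, d

def unrolled2(a, b, c):
    """Fused loop unrolled by 2, assuming 2 divides N."""
    a = a[:]
    d = [0.0] * len(a)
    assert len(a) % 2 == 0
    for i in range(0, len(a), 2):    # DO I = 1, N, 2
        a[i] = c * a[i]
        d[i] = a[i] * b[i]
        a[i + 1] = c * a[i + 1]
        d[i + 1] = a[i + 1] * b[i + 1]
    return a, d

a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
same = fused(a, b, 2.0) == unrolled2(a, b, 2.0)
```

The two copies of the body in `unrolled2` touch disjoint array elements, which is what gives the scheduler independent operations to bundle.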
Region Scheduling Across Basic Blocks

- Region scheduling schedules operations across basic blocks, usually on hot paths
- This fulfills the need to increase the region size by merging blocks to expose more ILP
- But conditional control flow is a problem: if an operation is moved from one block (here B3) to another (here B6), the operation is now missing on the other path (through B4)

13
Region Scheduling Across Basic Blocks

- Problem: how to move operations from one block to another for instruction scheduling?
- Moving an operation can also cause it to be inserted on a path where it did not originally execute
- Affected branches need to be compensated

14
Trace Scheduling

- Trace scheduling is the earliest region scheduling approach; it has restrictions
- A trace consists of the operations from a list of basic blocks B0, B1, ..., Bn such that:
  1. Each Bi is a predecessor of (falls through or branches to) the next Bi+1 on the list
  2. For any i and k there is no path Bi -> Bk -> Bi except for i = 0, i.e. the code is cycle-free except that the entire region can be part of a loop

[Figure: CFG with blocks B1-B6 and edge frequencies; the hot path forms the trace]

15
Superblocks

- Superblocks are single-entry multiple-exit traces
- Superblock formation uses tail duplication to eliminate side entrances:
  1. Each Bi is a predecessor of the next Bi+1 on the list (fall-through)
  2. For any i and k there is no path Bi -> Bk -> Bi except for i = 0
  3. There are no branches into a block in the region (no side entrances), except to B0

[Figure: CFG before and after tail duplication of B3 and B4; profile frequencies are split among the duplicates]

16
Hyperblocks

- Hyperblocks are single-entry multiple-exit traces with internal control flow effectuated via instruction predication
- If-conversion folds control flow into a single block using instruction predication

[Figure: CFG in which blocks B2 and B5 are if-converted into a single predicated block B2,B5]
17
Intermezzo: Predication

- If-conversion translates control dependences into data dependences by predicating instructions to conditionally execute them
- Predication requires hardware support
- Full predication adds a boolean predicate operand to (all or selected) instructions
- Partial predication executes all instructions, but selects the final result based on a condition

Original:

    cmpgt $b1 = $r5, 0
    ;;
    br $b1, L1
    ;;
    mpy $r3 = $r1, $r2
    ;;
    L1: stw 0[$r10] = $r3
    ;;

After full predication:

    cmpgt $p1 = $r5, 0
    ;;
    ($p1) mpy $r3 = $r1, $r2
    ;;
    stw 0[$r10] = $r3
    ;;

After partial predication:

    mpy $r4 = $r1, $r2
    ;;
    cmpgt $b1 = $r5, 0
    ;;
    slct $r3 = $b1, $r4, $r3
    ;;
    stw 0[$r10] = $r3
    ;;
18
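Semantically, partial predication computes the result unconditionally and then selects, which must agree with the original branchy code on every input; a sketch (names are illustrative):

```python
def branchy(r1, r2, r3, r5):
    """Original control flow: multiply only if r5 > 0."""
    if r5 > 0:
        r3 = r1 * r2
    return r3

def if_converted(r1, r2, r3, r5):
    """After partial predication: execute the multiply on both paths,
    then select the final value."""
    r4 = r1 * r2           # mpy $r4 = $r1, $r2 (always executes)
    p = r5 > 0             # cmpgt $b1 = $r5, 0
    r3 = r4 if p else r3   # slct $r3 = $b1, $r4, $r3
    return r3

agree = all(branchy(3, 4, 9, s) == if_converted(3, 4, 9, s)
            for s in (-1, 0, 1, 7))
```

Note the cost model: the multiply now executes even when the predicate is false, which is why if-conversion pays off only when the branch is hard to predict or the folded block is short.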
Treegions

- Treegions are regions containing a tree of blocks such that no block in a treegion has side entrances
- Any path through a treegion is a superblock

[Figure: CFG partitioned into Treegions 1-3]

19
Region Formation

- The scheduler constructs schedules for a single region at a time
- Need to select which region to optimize (within the limits of the region shape), i.e. group traces of frequently executed blocks into regions
- May need to enlarge regions to expose enough ILP for the scheduler

[Figure: phases: region selection -> region enlargement -> schedule construction]

20
Region Selection by Trace Growing

- Trace growing uses the mutual-most-likely heuristic:
  - Suppose A is the last block in the trace
  - Add block B to the trace if B is the most likely successor of A and A is B's most likely predecessor
- Also works to grow the trace backward
- Requires only edge profiling, but the result can be poor because edge profiling does not correlate branch probabilities

[Figure: blocks A and B with edge frequencies illustrating the mutual-most-likely test]

21
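The mutual-most-likely heuristic falls out directly from edge profiles; a sketch with an illustrative CFG encoding (successor/predecessor maps with edge frequencies):

```python
def grow_trace(seed, succ, pred):
    """Grow a trace forward from `seed` using the mutual-most-likely rule.

    succ[b] / pred[b]: dicts mapping a neighbor block -> edge frequency.
    """
    trace = [seed]
    visited = {seed}
    a = seed
    while succ.get(a):
        b = max(succ[a], key=succ[a].get)       # A's most likely successor
        if b in visited:                        # stay acyclic
            break
        if max(pred[b], key=pred[b].get) != a:  # A must also be B's most
            break                               # likely predecessor
        trace.append(b)
        visited.add(b)
        a = b
    return trace

# B is A's most likely successor, but X (frequency 55) is B's most
# likely predecessor, so the trace stops at A:
succ1 = {"A": {"B": 40, "C": 40, "D": 10}, "B": {}}
pred1 = {"B": {"A": 40, "X": 55}}
trace1 = grow_trace("A", succ1, pred1)

# Mutually most likely all the way down:
succ2 = {"A": {"B": 90, "C": 10}, "B": {"C": 90}, "C": {}}
pred2 = {"B": {"A": 90}, "C": {"B": 90, "A": 10}}
trace2 = grow_trace("A", succ2, pred2)
```

The first example shows the weakness noted above: edge frequencies alone cannot tell whether the executions of A->B actually continue along the rest of the trace.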
Region Selection by Path Profiling

- Treat a trace as a path and consider its execution frequency given by path profiling
- Correlations between branches are preserved in the region formation process

[Figure: CFG with blocks B1-B6, with B3 and B4 tail-duplicated]

    path 1: {B1, B2, B3, B4}     count = 44
    path 2: {B1, B2, B3, B6, B4} count = 0
    path 3: {B1, B5, B3, B4}     count = 16
    path 4: {B1, B5, B3, B6, B4} count = 12

22
Superblock Enlargement by Target Expansion

- Target expansion is useful when the branch at the end of a superblock has a high probability but the superblock cannot be enlarged due to a side entrance
- Duplicate the sequence of target blocks to create a larger superblock

[Figure: B3 is duplicated so the superblock ending in B2 can be extended; edge frequencies are split accordingly]

23
Superblock Enlargement by Loop Peeling

- Peel a number of iterations of a small loop body to create a larger superblock that branches into the loop
- Useful when the number of profiled loop iterations is bounded by a small constant (two iterations in the example)

[Figure: two iterations of loop B1-B2 are peeled into the enclosing superblock]
24
Superblock Enlargement by Loop Unrolling

- Loops with a superblock body and a backedge with high probability are called superblock loops
- When a superblock loop is small we can unroll the loop

[Figure: the superblock loop B1-B2 is unrolled three times; profile frequencies are divided among the copies]
25
Exposing ILP After Loop Unrolling

- Loop unrolling exposes only a limited amount of ILP
- Cross-iteration dependences on the loop counter updates prevent parallel execution of the copies of the loop body
- Instructions cannot in general be moved across split points
- Note: speculative execution can be used to hoist instructions above split points

[Figure: unrolled loop with split points between the copies of the loop body]

26
Exposing ILP with Renaming
and Copy Propagation

27
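The figure for this slide is lost, but the idea is standard: after unrolling, every copy of the body reuses the same temporaries, creating WAR/WAW dependences between copies; renaming gives each copy fresh names so only true (RAW) dependences remain. A hedged sketch on a toy three-address IR (the tuple format and register names are illustrative; array stores are omitted):

```python
def unroll_and_rename(body, n_copies):
    """Unroll `body` n times, renaming each copy's temporaries.

    body: list of (dst, op, src1, src2) tuples; names starting with
    't' are temporaries. Renaming breaks the WAW/WAR dependences
    between the copies, so the copies can be scheduled in parallel.
    """
    out = []
    for k in range(n_copies):
        mapping = {}
        for dst, op, s1, s2 in body:
            s1 = mapping.get(s1, s1)          # copy propagation of renames
            s2 = mapping.get(s2, s2)
            if dst.startswith("t"):
                mapping[dst] = f"{dst}_{k}"   # fresh name per copy
            out.append((mapping.get(dst, dst), op, s1, s2))
    return out

body = [("t1", "mul", "c", "a"), ("t2", "mul", "t1", "b")]
code = unroll_and_rename(body, 2)
```

After renaming, the two copies write four distinct destinations, so a scheduler sees no output or anti dependences between them.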
Schedule Construction

- The schedule constructor (scheduler) uses compaction techniques to produce a schedule for a region after region formation
- The goal is to minimize an objective cost function while maintaining program correctness and obeying resource limitations:
  - Increase speed by reducing completion time
  - Reduce code size
  - Increase energy efficiency

[Figure: phases: region selection -> region enlargement -> schedule construction]

28
Schedule Construction and Explicitly Parallel Architectures

- A scheduler for an explicitly parallel architecture such as VLIW or EPIC uses the exposed ILP to statically schedule instructions in parallel
- Instruction compaction must obey data dependences (RAW, WAR, and WAW) and control dependences to ensure correctness

Before compaction:

    add $r13 = $r3, $r0
    shl $r13 = $r13, 3
    ld.w $r14 = 0[$r4]
    sub $r16 = $r6, 3
    shr $r15 = $r15, 9

After compaction:

    add $r13 = $r3, $r0
    sub $r16 = $r6, 3
    ;; ## end of 1st instr.
    shl $r13 = $r13, 3
    shr $r15 = $r15, 9
    ld.w $r14 = 0[$r4]
    ;; ## end of 2nd instr.

29
Schedule Construction and Instruction Latencies

- Instruction latencies must be taken into account by the scheduler, but they're not always fixed or the same for all operations
- A scheduler can assume average or worst-case instruction latencies
- Hide instruction latencies by ensuring sufficient height between an instruction's issue and the point where its result is needed, to avoid pipeline stalls
- Also recall the difference between the EQ and LEQ models

    mul $r3 = $r3, $r1       # takes 2 cycles to complete
    add $r13 = $r2, $r3      # takes 1 cycle; RAW hazard on $r3
    ld.w $r14 = 0[$r5]       # takes >3 cycles (4 cycles avg.)
    add $r13 = $r13, $r14    # RAW hazard on $r14
    ld.w $r15 = 0[$r6]
30
Linear Scheduling Techniques

- Instruction compaction using linear-time scans over the region:
  - As-soon-as-possible (ASAP) scheduling places ops in the earliest possible cycle using a top-down scan
  - As-late-as-possible (ALAP) scheduling places ops in the latest possible cycle using a bottom-up scan
  - Critical-path (CP) scheduling uses ASAP followed by ALAP
- Resource hazard detection is local (here: at most one load per instruction)

ASAP cycle assignments:

    mul $r3 = $r3, $r1       # cycle 0
    add $r13 = $r2, $r3      # cycle 2
    ld.w $r14 = 0[$r5]       # cycle 0
    add $r13 = $r13, $r14    # cycle 3
    ld.w $r15 = 0[$r6]       # cycle 1

Resulting schedule:

    mul $r3 = $r3, $r1
    ld.w $r14 = 0[$r5]
    ;;
    ld.w $r15 = 0[$r6]
    ;;
    add $r13 = $r2, $r3
    ;;
    add $r13 = $r13, $r14
    ;;
31
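ASAP placement over a dependence graph with latencies is a single top-down pass; a sketch mirroring the example above (op names and the dependence encoding are illustrative):

```python
def asap(ops, latency, deps):
    """Assign each op the earliest cycle allowed by its RAW dependences.

    ops: op names in program order (a topological order of the DDG);
    deps[op]: list of predecessor ops whose results `op` reads.
    """
    cycle = {}
    for op in ops:
        # Earliest cycle: after every predecessor's result is available.
        cycle[op] = max(
            (cycle[p] + latency[p] for p in deps.get(op, [])),
            default=0,
        )
    return cycle

ops = ["mul", "add1", "ld1", "add2", "ld2"]
latency = {"mul": 2, "add1": 1, "ld1": 3, "add2": 1, "ld2": 3}
deps = {"add1": ["mul"], "add2": ["add1", "ld1"]}
cycles = asap(ops, latency, deps)
```

This sketch ignores resource hazards; the local "at most one load per instruction" rule is what pushes the second load from cycle 0 to cycle 1 in the slide's assignment.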
List Scheduling

- List scheduling schedules operations from the global region based on a data dependence graph (DDG) or program dependence graph (PDG), both of which have O(n^2) construction complexity
- It repeatedly selects an operation from a data-ready queue (DRQ), where an operation is ready when all of its DDG predecessors have been scheduled

    for each root r in the PDG sorted by priority do
      enqueue(r)
    while DRQ is non-empty do
      h = dequeue()
      schedule(h)
      for each DAG successor s of h do
        if all predecessors of s have been scheduled then
          enqueue(s)

32
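The pseudocode above can be fleshed out into a cycle-driven list scheduler; a minimal sketch assuming unit latencies, a given per-op priority, and a fixed issue width (all names illustrative):

```python
import heapq

def list_schedule(deps, priority, width):
    """Greedy list scheduling with a data-ready queue (DRQ).

    deps[op]: set of predecessor ops; an op becomes ready once all of
    its predecessors are scheduled. At most `width` ops issue per
    cycle; higher-priority ops are dequeued first. Unit latency.
    """
    npreds = {op: len(p) for op, p in deps.items()}
    succs = {op: [] for op in deps}
    for op, preds in deps.items():
        for p in preds:
            succs[p].append(op)
    # Roots (no predecessors) seed the DRQ; negate priority for a max-heap.
    drq = [(-priority[op], op) for op, n in npreds.items() if n == 0]
    heapq.heapify(drq)
    schedule = []
    while drq:
        bundle = [heapq.heappop(drq)[1]
                  for _ in range(min(width, len(drq)))]
        schedule.append(bundle)
        for h in bundle:                     # release h's successors
            for s in succs[h]:
                npreds[s] -= 1
                if npreds[s] == 0:
                    heapq.heappush(drq, (-priority[s], s))
    return schedule

deps = {"a": set(), "b": set(), "c": {"a"}, "d": {"a", "b"}, "e": {"c", "d"}}
priority = {"a": 5, "b": 4, "c": 3, "d": 2, "e": 1}
sched = list_schedule(deps, priority, width=2)
```

A production scheduler would track real latencies (ops become ready in a future cycle, not the next one) and consult a reservation table before issuing.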
Data Dependence Graph

- The data dependence graph (DDG):
  - Nodes are operations
  - Edges are RAW, WAR, and WAW dependences

33
Control Flow Dependence

34
Compensation Code

- Compensation code is needed when operations are scheduled across basic blocks in a region
- Compensation code corrects scheduling changes by duplicating code on entries to and exits from a scheduled region
- Entries and/or exits must be compensated

[Figure: the scheduler interchanges A with B in the trace X -> A -> B -> C -> Y; B has an entry and an exit]

35
No Compensation

- No compensation code is needed when block B has neither an entry nor an exit

[Figure: trace X -> A -> B -> C -> Y becomes X -> B -> A -> C -> Y]

36
Join Compensation

- Join compensation is applied when block B has an entry
- Duplicate block B on the entering path

[Figure: entry from Z into B; after interchanging A and B, a copy of B is placed on the path from Z]

37
Split Compensation

- Split compensation is applied when block B has an exit
- Duplicate block A on the exit path

[Figure: exit from B to W; after interchanging A and B, a copy of A is placed on the path to W]

38
Join-Split Compensation

- Join-split compensation is applied when block B has both an entry and an exit
- Duplicate blocks A and B

[Figure: entry from Z and exit to W; copies of A and B are placed on the compensation paths]

39
Resource Management with Reservation Tables

- A resource reservation table records which resources are busy in each cycle
- Reservation tables allow easy scheduling of operations by matching the operation's required resources to empty slots
- The reservation table at a join point in the CFG is constructed by merging the busy slots from both branches

[Table: reservation table with columns Integer ALU, FP ALU, MEM, and Branch; busy slots are marked per cycle 0-3]

40
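A reservation table is essentially a per-cycle map of busy resources; a sketch of slot matching and the join-point merge (resource names mirror the table's columns):

```python
def fits(table, cycle, resources):
    """Can an op needing `resources` issue at `cycle`?"""
    return all(r not in table.get(cycle, set()) for r in resources)

def reserve(table, cycle, resources):
    """Mark `resources` busy at `cycle`."""
    table.setdefault(cycle, set()).update(resources)

def merge(t1, t2):
    """Reservation table at a CFG join: union of the busy slots
    of both incoming branches (conservative)."""
    merged = {}
    for t in (t1, t2):
        for cycle, busy in t.items():
            merged.setdefault(cycle, set()).update(busy)
    return merged

table = {}
reserve(table, 0, {"IALU", "FPALU"})
reserve(table, 1, {"IALU", "MEM"})
ok_mem0 = fits(table, 0, {"MEM"})     # MEM is free at cycle 0
ok_ialu1 = fits(table, 1, {"IALU"})   # IALU is already busy at cycle 1
joined = merge({0: {"IALU"}}, {0: {"MEM"}})
```

The merge is conservative: a slot busy on either branch is treated as busy after the join, which is safe regardless of which path actually executed.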
Software Pipelining

- Iterations of a loop whose body consists of operations A-H are overlapped in execution
- The overlapped iterations form a prologue, a steady-state kernel, and an epilogue
- Assuming that the initiation interval (II) is 3 cycles

    DO i = 0, 6
      A
      B
      C
      D
      E
      F
      G
      H
    ENDDO

[Figure: overlapped iterations showing the prologue, kernel, and epilogue]
41
Software Pipelining Example

[Figure: dependence graph with operation latencies of >3 cycles, >2 cycles, and >1 cycle]

42
Modulo Scheduling

[Figure: data dependence graph (DDG) and modulo reservation table (MRT)]
43
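Modulo scheduling starts by computing a lower bound on the initiation interval: the resource-constrained MII (ops needing a resource class divided by the number of units) and the recurrence-constrained MII (total latency over total iteration distance around each DDG cycle). A sketch (the op counts, unit counts, and recurrence below are illustrative):

```python
import math

def res_mii(op_counts, units):
    """Resource MII: max over resources of ceil(#ops / #units)."""
    return max(math.ceil(op_counts[r] / units[r]) for r in op_counts)

def rec_mii(recurrences):
    """Recurrence MII: max over DDG cycles of
    ceil(total latency / total dependence distance)."""
    return max(math.ceil(lat / dist) for lat, dist in recurrences)

# 6 memory ops on 2 memory units, 4 ALU ops on 2 ALUs:
resource_bound = res_mii({"MEM": 6, "ALU": 4}, {"MEM": 2, "ALU": 2})
# one recurrence with total latency 5 over a distance of 2 iterations:
recurrence_bound = rec_mii([(5, 2)])
mii = max(resource_bound, recurrence_bound)
```

The scheduler then tries to place operations in a modulo reservation table of II rows, incrementing II from this lower bound until a legal schedule is found.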
Constructing Kernel-Only Code by Predicate Register Rotation

- BRT branches to the top and rotates the predicate registers:

    p1 = p0, p2 = p1, p3 = p2, p0 = p3

44
Modulo Variable Expansion (1)

45
Modulo Variable Expansion (2)

46
