LectureAll 4up PDF
Class Meets
Tuesdays and Thursdays,
11:00 12:15
1207 Computer Science
Charles N. Fischer
Fall 2008
Instructor
Charles N. Fischer
6367 Computer Sciences
Telephone: 262-6635
E-mail:
fischer@cs.wisc.edu
Office Hours:
10:30 - Noon, Mondays and
Wednesdays, or by appointment
http://www.cs.wisc.edu/~fischer/cs701.html
Teaching Assistant
Jongwon Yoon
3361 Computer Sciences
Telephone: 608.354.3613
E-mail: yoonj@cs.wisc.edu
Office Hours:
9:30 - 11:00, Tuesdays and
Thursdays.
Key Dates
October 23:
Project 2 due
October 28:
December 11:
Project 4 due
December ??:
Class Text
Instructional Computers
CS701 Projects
Reading Assignment
Global Allocation
Allocate registers within a single
subprogram. Choose most profitable
values. Map several values to the same
register.
Interprocedural Allocation
Avoid saves and restores across calls.
Share globals in registers.
2. Code Scheduling
ld  [a],%r1
add %r1,1,%r2
mov 3,%r3
(before)
ld  [a],%r1
mov 3,%r3
add %r1,1,%r2
(after)
4. Peephole Optimization
mov const,%r2
add %r1,%r2,%r3
(before)
add %r1,const,%r3
(after)
mov 0,%r1
OP %r1,%r2,%r3
(before)
OP %g0,%r2,%r3
(after)
Example:
a=b+c+1;
In IR tree form:
= ( aadr, + ( + ( * ( + (%fp, boffset)), * (cadr)), 1))
Generated code:
ld  [%fp+boffset],%r1
ld  [cadr],%r2
add %r1,%r2,%r3
add %r3,1,%r4
st  %r4,[aadr]
Why use four different registers?
5. Cache Improvements
7. Program representations
Control Flow Graphs
Example:
if (a)
x = 1
else x = 2;
print(x)
In SSA form:
if (a)
   x1 = 1
else x2 = 2;
x3 = φ(x1,x2)
print(x3)
9. Points-To Analysis
All compiler analyses and optimizations are
limited by the potential effects of
assignments through pointers and
references.
Thus in C:
b = 1;
*p = 0;
print(b);
is 1 or 0 printed?
Similarly, in Java:
a[1] = 1;
b[1] = 0;
print(a[1]);
is 1 or 0 printed?
Points-to analysis aims to determine what
variables or heap objects a pointer or
reference may access. Exact analysis is
impossible (why?). But fast and reasonably
accurate analyses are known.
Review of Compiler
Optimizations
3. Constant Propagation
If a variable is known to contain a
particular constant value at a particular
point in the program, replace references to
the variable at that point with that
constant value.
4. Copy Propagation
After the assignment of one variable to
another, a reference to one variable may be
replaced with the value of the other
variable (until one or the other of the
variables is reassigned).
(This may also set up dead code
elimination. Why?)
5. Constant Folding
An expression involving constant (literal)
values may be evaluated and simplified to a
constant result value. Particularly useful
when constant propagation is performed.
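The interplay of the three transformations above can be sketched concretely. This is a minimal Python illustration (not the course's implementation); the tuple IR and the name fold_and_propagate are invented for the example:

```python
# A minimal sketch of constant propagation plus constant folding over a
# straight-line list of tuples (dest, op, arg1, arg2); args may be
# variable names or int literals.

def fold_and_propagate(code):
    consts = {}          # variable -> known constant value
    out = []
    for dest, op, a1, a2 in code:
        # Constant propagation: replace variable uses with known constants.
        a1 = consts.get(a1, a1)
        a2 = consts.get(a2, a2)
        if op == "mov":
            if isinstance(a1, int):
                consts[dest] = a1      # record the constant binding
            else:
                consts.pop(dest, None)
            out.append((dest, "mov", a1, None))
        elif isinstance(a1, int) and isinstance(a2, int):
            # Constant folding: evaluate the expression at compile time.
            val = a1 + a2 if op == "add" else a1 - a2
            consts[dest] = val
            out.append((dest, "mov", val, None))
        else:
            consts.pop(dest, None)     # dest no longer a known constant
            out.append((dest, op, a1, a2))
    return out

optimized = fold_and_propagate([
    ("x", "mov", 2, None),
    ("y", "add", "x", 3),   # folds to y = 5
    ("z", "add", "y", "w"), # w unknown: only propagation applies
])
```

Note how folding y = 2+3 makes the copy of x dead, which is what sets up dead code elimination.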
Conflict Optimizations
SPARC Overview
26
Instruction format:
add %r1,%r2,%r3
Registers are prefixed with a %
Result is stored into last operand.
ld [adr],%r1
Memory addresses (used only in loads
and stores) are enclosed in brackets
Distinctive features include Register
Windows and Delayed Branches
27
28
Register Windows
(Figure: register windows before and after a save; the caller's Out
registers become the callee's In registers, and each window has its
own Local registers.)
Register Conventions
In Registers
These are also the caller's out registers;
they are unaffected by deeper calls.
Global Registers
%g0 is unique: It always contains 0 and
can never be changed.
%g1 to %g7 have global scope (they are
unaffected by save and restore
instructions)
%g1 to %g4 are volatile across calls;
they may be used between calls.
%g5 to %g7 are reserved for special use
by the SPARC ABI (application binary
interface)
%i0
Contains incoming parameter 1.
Also used to return function value to
caller.
%i1 to %i5
Contain incoming parameters 2 to 6
(if needed); freely usable otherwise.
%i6 (also denoted as %fp)
Contains frame pointer (stack pointer
of caller); it must be preserved.
Local Registers
%l0 to %l7
May be freely used; they are unaffected
by deeper calls.
%i7
Contains return address -8 (offset due
to delay slot); it must be preserved.
33
Out Registers
34
%o7
Is volatile across calls.
It is loaded with the address of the call
instruction on a procedure call.
%o1 to %o5
Contain outgoing parameters 2 to 6 as
needed.
These are volatile across calls;
otherwise they are freely usable.
save    %r1,%r2,%r3
save    %r1,const,%r3
restore %r1,%r2,%r3
restore %r1,const,%r3
37
call
label
mov
const,%r1
ret
This instruction returns from a
subprogram by branching to %i7+8.
Why 8 bytes after the address of the
call? SPARC processors have delayed
branch instructions, so the instruction
immediately after a branch (or a call)
is executed before the branch occurs!
Thus two instructions after the call is
the normal return point.
38
39
40
Delayed Branches
sethi %uhi(const),%rtmp
or    %rtmp,%ulo(const),%rtmp
sllx  %rtmp,32,%rtmp
sethi %hi(const),%r
or    %r,%lo(const),%r
or    %rtmp,%r,%r
41
42
mov 3,%o0
call subr
nop
(before)
call subr
mov 3,%o0
(after)
43
44
Annulled Branches
(before)
call subr
nop
...
subr:
mov 100,%l1
(after)
call subr+4
mov 100,%l1
...
subr:
mov 100,%l1
(before)
bz else
nop
! then code
...
else:
mov 100,%l1
(after)
bz,a else+4
mov 100,%l1
! then code
...
else:
mov 100,%l1
45
46
Local Variables
Unoptimized:
incr:
    save    %sp, -112, %sp
    st      %i0, [%fp+68]   ! %fp+68 is in caller's frame!
    ld      [%fp+68], %o1
    add     %o1, 1, %o0
    mov     %o0, %i0
    b       .LL2
    nop
.LL2:
    ret
    restore
(Figure: stack frame layout, top to bottom: alloca() space; Memory
Temps & Saved FP Registers; Parms past 6; Parms 1 to 6 (memory
home); Address of return value; Register Window Overflow area
(16 words); %sp.)
47
48
int
main(){
 int a;
 return incr(a);}

Optimized:
incr:
    retl
    add  %o0, 1, %o0

Unoptimized:
main:
    save
    ld
    call incr
    nop
    mov  %o0, %i0
    b    .LL3
    nop
.LL3:
    ret
    restore

Optimized:
main:
    save
    call incr    ! Where is
    nop
    ret
    restore

Fully optimized:
incr:
    retl
    add  %o0, 1, %o0
main:
    retl
    add  %o0, 1, %o0

Reading Assignment
Register Targeting
Work Registers:
Volatile; used in short code sequences
that need to use a register.
On SPARC: %g1 to %g4, unused out
registers.
Reserved Registers:
Reserved for specific purposes by OS or
software conventions.
On SPARC: %fp, %sp, return address
register, argument registers, return value
register.
Allocatable Registers
53
Register Tracking
54
Cost(register) = Σ cost(v), over all values v currently held in the register
55
56
Dead Value
Saved Local Variable
Small Literal Value (13 bits)
Saved Global Variable
Large Literal Value (32 bits)
Unsaved Local Variable
Unsaved Global Variable
reg getReg() {
   if (∃ r ∈ regSet and cost(r) == 0)
      choose(r)
   else {
      c = 1;
      while(true) {
         if (∃ r ∈ regSet and cost(r) == c){
            choose r with cost(r) == c and
               most distant next use of
               associated values;
            break;
         }
         c++;
      }
      Save contents of r as necessary;
   }
   return r;
}
57
Example
int a,b,c,d; // Globals
a = 5;
b = a + d;
c = b - 7;
b = 10;

Naive Code
mov 5,%l0
st  %l0,[a]
ld  [a],%l0
ld  [d],%l1
add %l0,%l1,%l1
st  %l1,[b]
ld  [b],%l1
sub %l1,7,%l1
st  %l1,[c]
mov 10,%l1
st  %l1,[b]
%l0
mov 5,%l0
5(S)
! Defer assignment to a
5(S), a(U)
ld
[d], %l1
%l1, [c]
5(S), a(U)
5(S), a(U) b(U), 10(S)
b(U), 10(S)
st %l1,[b]
61
RN(Leaf) = 1
RN(Op) =
   If RN(Left) = RN(Right)
   Then RN(Left) + 1
   Else Max(RN(Left), RN(Right))
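The RN rule is easy to express directly. A small Python sketch (the tree encoding and the name rn are my own, not the lecture's):

```python
# Sethi-Ullman register-need computation for an expression tree.
# A tree is a leaf (a string), or a tuple (op, left, right).

def rn(tree):
    if isinstance(tree, str):
        return 1                      # RN(Leaf) = 1
    _, left, right = tree
    l, r = rn(left), rn(right)
    # Equal needs: holding one side's result costs one extra register.
    return l + 1 if l == r else max(l, r)

# ((A - B) + (C + D)): both subtrees need 2, so the root needs 3.
need = rn(("+", ("-", "A", "B"), ("+", "C", "D")))
```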
Example:
+3
+3
-2
A1
B1
C1
62
+2
D1
E1
F1
63
64
Specification of SU Algorithm
Operation:
Translate expression tree T using only
registers in RL.
65
Summary of SU Algorithm

Example tree: (A - B) + (C + D), register needs annotated.
Translating the left subtree first, with only 2 registers, requires a
memory temp:
ld  [A], %l0
ld  [B], %l1
sub %l0,%l1,%l0
st  %l0, [temp]
ld  [C], %l0
ld  [D], %l1
add %l0,%l1,%l0
ld  [temp], %l1
add %l1,%l0,%l0

Larger Example
Tree: (A - B) + ((C + D) + (E * F)), register needs annotated.
Assume 3 Registers;
RL = [%l0,%l1,%l2]
Since right subtree is more complex,
it is translated first.
69
Code (right subtree first, 3 registers, no memory temp):
ld   [C], %l0
ld   [D], %l1
add  %l0,%l1,%l0
ld   [E], %l1
ld   [F], %l2
smul %l1,%l2,%l1
add  %l0,%l1,%l0
ld   [A], %l1
ld   [B], %l2
sub  %l1,%l2,%l1
add  %l1,%l0,%l0
74
Reading Assignment
Example: a+b*(c+d)+a*(c+d)
We must decide how long to hold
each value in a register. Best
orderings may skip between
subexpressions
Reference: R. Sethi, Complete
Register Allocation Problems, SIAM
Journal of Computing, 1975.
75
76
ld  [a],%r1
add %r1,1,%r1
Stall!
77
Expression: (A + B) + C
ld  [A], %l0
ld  [B], %l1
add %l0,%l1,%l0     Stall!
ld  [C], %l1
add %l0,%l1,%l0     Stall!
Why?
79
80
Abbreviated as ERN
ERN(Identifier) = 2
ERN(Literal) = 1
ERN(Op) =
If ERN(Left) = ERN(Right)
Then ERN(Left) + 1
Else Max(ERN(Left), ERN(Right))
Example
Tree: A + ((B + C) + D), with ERN labels:
identifiers are labeled 2; the internal + nodes are labeled 3.
81
Example: A + ((B + C) + D), ERN-driven scheduling.

Generated code:
ld  [B], PR1
ld  [C], PR2
add PR1,PR2,PR3
ld  [D], PR4
add PR3,PR4,PR5
ld  [A], PR6
add PR6,PR5,PR7

After scheduling (before register assignment):
ld  [B], PR1
ld  [C], PR2
ld  [D], PR4
add PR1,PR2,PR3
ld  [A], PR6
add PR3,PR4,PR5
add PR6,PR5,PR7

After register assignment:
ld  [B], %l0
ld  [C], %l1
ld  [D], %l2
add %l0,%l1,%l0
ld  [A], %l1
add %l0,%l2,%l0
add %l1,%l0,%l0
No Stalls!
83
84
ld  [b], %l0
st  %l0, [a]
85
86
Example
a=b+c; d=e;
Statement trees: (a = b+c) ; (d = e), with ERN labels.

Original Code:
ld  [b], PR1
ld  [c], PR2
add PR1,PR2,PR3
st  PR3, [a]
ld  [e], PR4
st  PR4, [d]

In Canonical Form:
ld  [b], PR1
ld  [c], PR2
ld  [e], PR4
add PR1,PR2,PR3
st  PR3, [a]
st  PR4, [d]

After Register Assignment:
ld  [b], %l0
ld  [c], %l1
ld  [e], %l2
add %l0,%l1,%l0
st  %l0, [a]
st  %l2, [d]
87
88
Live Ranges
89
Example
90
a = init();
for (int i = a+1; i < 1000; i++){
b[i] = 0; }
a = f(i);
print(a);
91
92
For example,
if (a > 0)
   b = 1;
else b = 2;
a = c + b;
CFG nodes: a > 0; b = 1; b = 2; a = c + b.
93
94
95
96
Liveness Analysis
LiveOut(b) = ∪ LiveIn(s), over all s ∈ Succ(b)
3. LiveIn(b) =
   If V is used before it is defined in
      Basic Block b
   Then true
   Elsif V is defined before it is
      used in Basic Block b
   Then false
   Else LiveOut(b)
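These equations can be solved by iterating to a fixed point. A hedged Python sketch (the set-valued generalization, function name, and block encoding are invented for illustration, not taken from the lecture):

```python
# Backward liveness: LiveOut(b) = union of LiveIn(s) over successors s;
# LiveIn(b) = use(b) | (LiveOut(b) - def(b)).

def liveness(blocks, succ):
    """blocks: name -> (use_set, def_set); succ: name -> successor names."""
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:                       # iterate to a fixed point
        changed = False
        for b, (use, defs) in blocks.items():
            out = set().union(*(live_in[s] for s in succ[b])) if succ[b] else set()
            new_in = use | (out - defs)
            if out != live_out[b] or new_in != live_in[b]:
                live_out[b], live_in[b] = out, new_in
                changed = True
    return live_in, live_out

# b1: a=u; b2: print(u,v). Both u and v are live into b1.
live_in, live_out = liveness(
    {"b1": ({"u"}, {"a"}), "b2": ({"u", "v"}, set())},
    {"b1": ["b2"], "b2": []},
)
```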
97
98
Range(b) =
   {b} ∪ {k | b ∈ DefsIn(k) & LiveIn(k)}
Therefore,
If Range(b1) ∩ Range(b2) ≠ ∅
Then replace Range(b1) and Range(b2)
   with Range(b1) ∪ Range(b2)
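The merge rule above amounts to unioning overlapping block sets until all remaining ranges are disjoint. A small Python sketch (names invented for illustration):

```python
# Merge live ranges (sets of blocks) that overlap, per the rule above.

def merge_ranges(ranges):
    ranges = [set(r) for r in ranges]
    merged = True
    while merged:
        merged = False
        for i in range(len(ranges)):
            for j in range(i + 1, len(ranges)):
                if ranges[i] & ranges[j]:          # Range(b1) and Range(b2) overlap
                    ranges[i] |= ranges.pop(j)     # replace both with their union
                    merged = True
                    break
            if merged:
                break
    return ranges

merged = merge_ranges([{1, 2}, {2, 3}, {5}])
```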
99
100
Reading Assignment
Example
(Figure: a CFG annotated at each block with DefsIn/DefsOut (Di/Do)
and LiveIn/LiveOut (Li/Lo) values for x.)
103
104
Interference Graph
Example
lim1
lim2
T1
T2
105
106
Reference:
Computers and Intractability,
M. Garey and D. Johnson,
W.H. Freeman, 1979.
107
108
Coloring Heuristic
109
Example
(Figure: interference graph over i, j, lim1, lim2, T1, T2.)
Costs are 11 for each of i, j, lim1, and lim2, and 42 for T1 and T2;
the Cost/Neighbors ratios are 11/3, 11/5, 11/3, 11/3, 42/5, and 42/3.
lim2 (ratio 11/5) is chosen.
We now have a reduced graph; do a 3-coloring.
111
112
Color Preferences
i
j
T2
T1
lim1
lim2
lim2
T1
T2
113
114
Example
lim1(R1)
lim2(R2)
T1
T2
i
j
T2
T1
lim1
lim2
115
116
x = t1
x,t1,q
q = t1
Live out: y,q
y = x + 1
t1 = a + b
x = t1
t1 = a + b
t1
y = x + 1
q = t1
117
Reckless Coalescing
118
Conservative Coalescing
119
120
Iterated Coalescing
121
1. Build:
   Create an Interference Graph, as
   usual. Mark source-target pairs with
   a special move-related arc (denoted
   as a dashed line).
2. Simplify:
   Remove and stack non-move-related
   nodes with < R neighbors.
3. Coalesce:
   Combine move-related pairs that will
   have < R neighbors after coalescing.
   Repeat steps 2 and 3 until only nodes
   with R or more neighbors or move-related
   nodes remain or the graph is empty.
4. Freeze:
   If the Interference Graph is non-empty:
   Then If there exists a move-related
        node with < R neighbors
      Then: Freeze in the move and
            make the node non-move-related.
            Return to Steps 2 and 3.
      Else: Use Chaitin's
            Cost/Neighbors criterion
            to remove and stack a node.
            Return to Steps 2 and 3.
5. Unstack:
   Color nodes as they are unstacked as
   per Chaitin and Briggs.
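The simplify and unstack phases can be sketched in Python. This is a simplified, illustrative version (no move-related arcs or coalescing; the optimistic push when no trivial node exists follows the Briggs idea; all names are invented):

```python
# Simplify/select sketch: stack nodes with < R neighbors, pushing one
# optimistically when none qualifies, then color while unstacking.

def color(graph, R):
    """graph: node -> set of neighbors. Returns node -> color, or None (spill)."""
    g = {n: set(adj) for n, adj in graph.items()}
    stack = []
    while g:
        # Prefer a trivially colorable node; otherwise push one optimistically.
        n = next((x for x in g if len(g[x]) < R), next(iter(g)))
        stack.append((n, g.pop(n)))        # remembered neighbors = still-in-graph ones
        for adj in g.values():
            adj.discard(n)
    colors = {}
    for n, neighbors in reversed(stack):   # unstack and color
        used = {colors[m] for m in neighbors if colors.get(m) is not None}
        free = [c for c in range(R) if c not in used]
        colors[n] = free[0] if free else None   # None = an actual spill
    return colors

# A 4-cycle is 2-colorable even though every node has exactly R neighbors.
result = color({"a": {"b", "d"}, "b": {"a", "c"},
                "c": {"b", "d"}, "d": {"a", "c"}}, 2)
```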
123
124
Example
Live in: k,j
g = mem[j+12]
h = k-1
f = g*h
e = mem[j+8]
m = mem[j+16]
b = mem[f]
c = e+8
d = c
k = m+4
j = b
goto d
Live out: d,k,j
f
e
f
e
j
j
m
d
c
h
125
126
d&c
j&b
m
e
f
k
g
h
d&c
127
128
Reading Assignment
Priority-Based Register
Allocation
129
130
131
132
Terminology
In an Interference Graph:
A node with fewer neighbors than colors
is termed unconstrained. It is trivial to
color.
A node that is not unconstrained is
termed constrained. It may need to be
split or spilled.

PriorityRegAlloc(proc, regCount) {
 ig ← buildInterferenceGraph(proc)
 unconstrained ←
   { n ∈ nodes(ig) | neighborCount(n) < regCount }
 constrained ←
   { n ∈ nodes(ig) | neighborCount(n) ≥ regCount }
 while( constrained ≠ ∅ ) {
   for ( c ∈ constrained such that not colorable(c)
         and canSplit(c) ) {
      c1, c2 ← split(c)
      constrained ← constrained - {c}
      if ( neighborCount(c1) < regCount )
         unconstrained ← unconstrained ∪ {c1}
      else constrained ← constrained ∪ {c1}
      if ( neighborCount(c2) < regCount )
         unconstrained ← unconstrained ∪ {c2}
133
134
135
136
Example
sum
i
a
sum
137
138
Coloring by Priorities
a = parm1
b = parm2
sum = 0
i = 0
while (i < 1000) { sum += a[i]; i++ }
j = 0
while (j < 1000) { sum += b[j]; j++ }
return sum
Candidate costs are 11, 11, 42, and 41, with Cost/Size ratios
11/3, 11/6, 42/7, and 41/3.
139
140
141
142
Interprocedural Register
Allocation
143
144
is optimized to
146
Call Graphs
A Call Graph represents the calling
structure of a program.
145
Reading Assignment
147
148
Example
Walls Interprocedural
Register Allocator
main() {
if (pred(a,b))
subr1(a)
else subr2(b);}
pred
subr1
subr2
149
Identifying Interprocedural
Register Sharing
150
pred
subr1
subr2
151
152
Cost Computations
Each register candidate is given a
per-call cost, based on the number of
saves and restores that can be
removed, scaled by 10^loop_depth.
This benefit is then multiplied by the
expected number of calls, obtained by
summing the total number of call
sites, scaled by loop nesting depth.
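As a rough illustration of that arithmetic (the function and the numbers are hypothetical, simplified from the description above rather than Wall's actual formula):

```python
# Sketch of the benefit computation: per-call savings times the expected
# number of calls, where each call site is weighted by 10**loop_depth.

def benefit(saves_restores_removed, call_site_depths):
    """call_site_depths: loop nesting depth of each call site."""
    expected_calls = sum(10 ** d for d in call_site_depths)
    return saves_restores_removed * expected_calls

# 2 save/restore pairs removed; called once at top level, once inside a loop.
b = benefit(2, [0, 1])
```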
153
154
Example
main
Cand: a, b, c
subr2
Cand: t
subr1
Cand: x, y
Print
Cand: i, j
c children(n)
}
155
156
Cost(group) = Σ cost(c), over all candidates c ∈ group
157
158
Example (3 Registers)
Call graph candidates (name:cost):
main: a, b, c
subr1: x:5, y:15
subr2: t:10
Print: i:40, j:5

Group  Members  Cost
0      i        40
1      j        5
2      x, t     15
3      y        15
4      a        20
5      b        10
6      c        30
159
160
Recursion
Performance
A:r1,r2
B:r3
C:r4
D:r4
E:r5
161
Optimal Interprocedural
Register Allocation
Optimal Save-Free
Interprocedural Register
Allocation
162
163
164
Example
V
Cand: m:6
W
Cand: p:3, q:4
165
X
Cand: t:5, u:1
166
wj
k
c j U Ri
i=1
167
168
Example
169
170
We aim to solve
Maximize
wj
k
c j U Ri
s m * Savedm
em Edges
i=1
171
172
P1
Cand: m:7
s=1
s=2
P2
Cand: z:1
P4
Cand: n:3
s=4
173
174
Why is Optimal
Interprocedural Register
Allocation Easier than Optimal
IntraProcedural Allocation?
P3
Cand: q:3
175
176
Reading Assignment
Code Scheduling
Modern processors are pipelined.
They give the impression that all
instructions take unit time by
executing instructions in stages
(steps), as if on an assembly line.
Certain instructions though (loads,
floating point divides and square
roots, delayed branches) take more
than one cycle to execute.
These instructions may stall (halt the
processor) or require a nop (null
operation) to execute properly.
A Code Scheduling phase may be
needed in a compiler to avoid stalls or
eliminate nops.
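The scheduling idea can be sketched as list scheduling over a dependency DAG: repeatedly issue a ready instruction whose operands are available, inserting a stall cycle when none is. This minimal Python sketch is illustrative only (the dependency sets, latencies, and tie-break rule are invented):

```python
# List scheduling over a dependency DAG with per-instruction latencies.

def schedule(deps, latency):
    """deps: instr -> set of instrs it depends on; latency: instr -> cycles."""
    done_at = {}                 # instr -> cycle its result becomes available
    order, cycle = [], 0
    pending = set(deps)
    while pending:
        ready = [i for i in pending
                 if all(d in done_at and done_at[d] <= cycle for d in deps[i])]
        if not ready:
            cycle += 1           # stall: nothing can issue this cycle
            continue
        i = min(ready)           # deterministic tie-break for the sketch
        pending.remove(i)
        done_at[i] = cycle + latency[i]
        order.append(i)
        cycle += 1
    return order

# Load 1 has latency 2; instruction 3 uses its result, so the independent
# load 2 is issued in between, hiding the load delay.
order = schedule({1: set(), 2: set(), 3: {1}}, {1: 2, 2: 1, 3: 1})
```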
177
178
Dependency DAGs
179
180
Kinds of Dependencies
True (Flow) Dependence:
mov 1, %l2
add %l2, 1, %l2
Output Dependence:
mov 1, %l2
mov 2, %l2
Anti Dependence:
add %l2, 1, %l0
mov 1, %l2
181
Example
1.  ld   [a], %r1
2.  ld   [b], %r2
     Stall
3.  add  %r1,%r2,%r1
4.  ld   [c], %r2
5.  ld   [d], %r3
     Stall
6.  smul %r2,%r3,%r4
     Stall*2
7.  add  %r1,%r4,%r1
8.  add  %r2,%r3,%r2
9.  smul %r2,%r3,%r2
     Stall*2
10. add  %r1,%r2,%r1
11. st   %r1,[a]
(6 Stalls Total)
183
184
Scheduling Requires
Topological Traversal
Dynamically Scheduled
(Out of Order) Processors
185
Example
1.  ld   [a], %r1
2.  ld   [b], %r2
5.  ld   [d], %r3
3.  add  %r1,%r2,%r1
4.  ld   [c], %r2
     Stall
6.  smul %r2,%r3,%r4
8.  add  %r2,%r3,%r2
9.  smul %r2,%r3,%r2
7.  add  %r1,%r4,%r1
     Stall
10. add  %r1,%r2,%r1
11. st   %r1,[a]
(2 Stalls Total)
187
188
189
Example
1.  ld   [a], %r1      //Longest path
2.  ld   [b], %r2      //Exposes a root
5.  ld   [d], %r3      //Not delayed
3.  add  %r1,%r2,%r1   //Only choice
4.  ld   [c], %r2      //Only choice
6.  smul %r2,%r3,%r4   //Stalls succ.
8.  add  %r2,%r3,%r2   //Not delayed
9.  smul %r2,%r3,%r2   //Not delayed
7.  add  %r1,%r4,%r1   //Only choice
10. add  %r1,%r2,%r1   //Only choice
11. st   %r1,[a]
(2 Stalls Total)

False Dependencies
192
Register Renaming
Compiler Renaming
Thus
ld [a],%r1
ld [b],%r2
add %r1,%r2,%r1
ld [c],%r2
is executed as if it were
ld [a],%r1
ld [b],%r2
add %r1,%r2,%r3
ld [c],%r4
193
Example
With compiler renaming:
1.  ld   [a], %r1
2.  ld   [b], %r2
     Stall
3.  add  %r1,%r2,%r1
4.  ld   [c], %r3
5.  ld   [d], %r4
     Stall
6.  smul %r3,%r4,%r5
     Stall*2
7.  add  %r1,%r5,%r2
8.  add  %r3,%r4,%r3
9.  smul %r3,%r4,%r3
     Stall*2
10. add  %r2,%r3,%r2
11. st   %r2,[a]
(6 Stalls Total)

After Scheduling:
instruction order 4, 5, 1, 2, 6, 8, 9, 3, 7, 10, 11
(renaming removed the false dependencies that
constrained the earlier schedule).
195
196
Reading Assignment
Balanced Scheduling
When scheduling a load, we normally
anticipate the best case, a hit in the
primary cache.
On older architectures this makes
sense, since we stall execution on a
cache miss.
Many newer architectures are
non-blocking. This means we can continue
execution after a miss until the
loaded value is used.
Assume a Cache miss takes N cycles
(N is typically 10 or more).
Do we schedule a load anticipating a
1 cycle delay (a hit) or an N cycle
delay (a miss)?
197
An Optimistic Schedule is
load1
Inst1
load2
Inst2
Inst3
A Pessimistic Schedule is
load1
Inst1
Inst2
load2
Inst3
(Fine for a hit; better for a miss.)
Inst2
Inst3
Inst3
An Optimistic Schedule is
load
Inst2
Inst3
Inst1
Inst4
198
Inst4
load
Inst2
Inst1
Inst3
Inst4
199
200
For
load1
Inst1
load2
Inst2
Inst3
balanced scheduling produces
load1
Inst1
load2
Inst2
Inst3
For
load
Inst1
Inst2
Inst3
Inst4
it produces
load
Inst2
Inst3
Inst1
Inst4
Good; maximum
distance between
load and Inst1 in
case of a miss.
201
202
203
204
Example
Load19
8
Inst1
Load2 7 Inst2
4
Load3
Inst4
Inst3
Load4
Inst5
End 0
Ld Ld Ld Ld
I1
1
2 3
4
I2
I3
I4
Load1
I5
Latency
1+0 = 1
Load2
1/2
1+2 = 3
Load3
1/2
1+2 = 3
Load4
1+3 = 4
205
Load19
Load3
Inst4
Inst3
Load4
Inst5
Postpass Schedulers:
Schedule code after register
allocation.
Can be limited by false
dependencies induced by register
reuse.
Example is Gibbons/Muchnick
heuristic.
End 0
Load1
Inst1
Load2
Inst2
Inst3
Load4
Load3
Inst4
Inst5
206
Prepass Schedulers:
Schedule code prior to register
allocation.
Can overuse registers: always
using a fresh register maximizes
freedom to rearrange instructions.
Inst1
Load2 7 Inst2
(0 latency; unavoidable)
(3 instruction latency)
(2 instruction latency)
(1 instruction latency)
207
208
Integrated Schedulers
209
Definitions
210
Goodman/Hsu Heuristic:
while (DAG ≠ ∅) {
 if ( AvailReg > MinThreshold)
   if (ReadySet ≠ ∅)
      Select Ready node with Max cost
   else Select Leader node with Max cost
 else // Reduce Registers in Use
   if (∃ node ∈ ReadySet that frees registers){
      Select node that frees most registers
      If (selected node isn't unique)
         Select node with Max cost }
   elsif (∃ node ∈ LeaderSet that frees regs){
      Select node that frees most registers
      If (selected node isn't unique)
         Select node with fewest interlocks}
   else find a partially evaluated path and
        select a leader from this path
   else Select any node in ReadySet
   else Select any node in LeaderSet
 Schedule Selected node
 Update AvailReg count }//end while
Leader Set:
Set of DAG nodes ready to be
scheduled, possibly with an
interlock.
Ready Set:
Subset of Leader Set; Nodes ready
to be scheduled without an
interlock.
AvailReg:
A count of currently unused
registers.
MinThreshold:
Threshold at which heuristic will
switch from avoiding interlocks to
reducing registers in use.
211
212
Example
At MinThreshold = 0, we may be
forced to spill a register to free a
result register.
But, we'll also be able to schedule
more aggressively.
Is a spill or stall worse?
Note that we may be able to hide
a spill in a delay slot!
We'll be aggressive and use
MinThreshold = 0.
*
3
*
2
+
1
213
+
1
=
Instruction
Comment
ld
ld
ld
smul
add
smul
ld
add
add
add
st
=
Regs
Used
1
2
3
4
4
3
4
3
2
1
0
*
+
[c], %r1
[d], %r2
[a], %r3
%r1,%r2,%r4
%r1,%r2,%r1
%r1,%r2,%r1
[b], %r2
%r3,%r2,%r3
%r3,%r4,%r3
%r3,%r1,%r3
%r3,[a]
+
5
*
3
214
Instruction
Comment
ld
ld
ld
smul
add
ld
smul
add
add
add
st
[c], %r1
[d], %r2
[a], %r3
%r1,%r2,%r4
%r1,%r2,%r1
[b], %r5
%r1,%r2,%r1
%r3,%r5,%r3
%r3,%r4,%r3
%r3,%r1,%r3
%r3,[a]
Regs
Used
1
2
3
4
4
5
4
3
2
1
0
215
216
add %r1,1,%r2
sub %r3,2,%r1
217
Example: 5 Registers
(2 Wide Issue)
a
218
d
4
+
3
*
2
+
+
*
2
=
1
=
1
2
3
4
5
6
7
8
ld
ld
smul
add
nop
add
add
st
[c], %r1
[a], %r3
%r1,%r2,%r5
%r3,%r4,%r3
%r3,%r5,%r3
%r3,%r1,%r3
%r3,[a]
ld
ld
add
smul
nop
nop
nop
nop
[d], %r2
[b], %r4
%r1,%r2,%r1
%r1,%r2,%r1
1
2
3
4
5
6
ld
ld
smul
add
nop
add
7
8
add
st
[c], %r1
[b], %r4
%r1,%r2,%r5
%r3,%r4,%r3
%r3,%r5,%r3
ld
[d], %r2
nop
add %r1,%r2,%r1
smul %r1,%r2,%r1
nop
nop
%r3,%r1,%r3 nop
%r3,[a]
nop
ld [a],%r3
nop
nop
nop
nop
nop
nop
nop
219
220
Reading Assignment
221
222
Trace Scheduling
Reference:
J. Fisher, Trace Scheduling: A
Technique for Global Microcode
Compaction, IEEE Transactions on
Computers, July 1981.
Trace Scheduling
Idea:
Since basic blocks are often too small
to allow effective code scheduling, we
will profile a program's execution and
identify the most frequently executed
paths in a program.
Sequences of contiguous basic blocks
on frequently executed paths will be
gathered together into traces.
223
224
Trace
Example
B1
B2
B3
B4
B5
B7
B6
225
Compensation Code
226
Example
Before Scheduling
x = x+1
x<5
y = x-y
z=x*z
x=x+1
y = x-y
y=2*y
x=x-2
After Scheduling
227
228
A prepass scheduler
(does scheduling before register
allocation).
Can move instructions across basic block
boundaries.
Prefers to move instructions that must
eventually be executed.
Can move Instructions speculatively,
possibly executing instructions
unnecessarily.
229
230
231
232
Example
d = a + b;
if ( d != 0)
flag = 1;
else flag = 0;
f = d - g;
1
d = a + b
d != 0
T
F
2
flag = 1
3
flag = 0
4
f = d - g
233
block1:
   ld  [a],Pr1
   ld  [b],Pr2
   Stall
   add Pr1,Pr2,Pr3
   st  Pr3,[d]
   cmp Pr3,0
   be  block3
block2:
   mov 1,Pr4
   st  Pr4,[flag]
   b   block4
block3:
   st  0,[flag]
block4:
   ld  [d],Pr5
   ld  [g],Pr6
   Stall
   sub Pr5,Pr6,Pr7
   st  Pr7,[f]
234
Example (Revisited)
235
236
Global Scheduling
Restrictions (in Bernstein/
Rodeh Heuristic)
237
238
239
240
Candidate Instructions
241
242
Delay Heuristic
D(I) = 0 if I is a leaf.
243
244
[a],Pr1
[b],Pr2
Pr1,Pr2,Pr3
Pr3,[d]
Pr3,0
block3
1
0
3
Ji Succ(I)
245
5
2
3
3
246
[a],Pr1
[b],Pr2
Pr1,Pr2,Pr3
Pr3,[d]
Pr3,0
block3
1
1
6
247
248
Example
block1:
1.
ld
2.
ld
3.
add
4.
st
5.
cmp
6.
be
block2:
7.
mov
8.
st
9.
b
block3:
10. st
block4:
11. ld
12. ld
13. sub
14. st
1,5
[a],Pr1
[b],Pr2
Pr1,Pr2,Pr3
Pr3,[d]
Pr3,0
block3
1,5
block1:
1.
ld
2.
ld
12. ld
0,3
3
1,Pr4
Pr4,[flag]
block4
0,2
1,5
[a],Pr1
[b],Pr2
[g],Pr6
0,3
3
0,[flag]
0,2
0,1
1,4
0,1
1,4
11
12
0,1
6
1,4
0,2
14
0,2
[d],Pr5
[g],Pr6
Pr5,Pr6,Pr7
Pr7,[f]
0,2
1,5
1,4
11
13
12
0,1
0,2
14
0,2
0,3
0,1
0,2
0,3
13
0,1
9
0,1
0,1
10
10
249
[a],Pr1
[b],Pr2
[g],Pr6
Pr1,Pr2,Pr3
Pr3,[d]
block1:
1.
ld
[a],Pr1
2.
ld
[b],Pr2
12. ld
[g],Pr6
3.
add Pr1,Pr2,Pr3
4.
st
Pr3,[d]
11. ld
[d],Pr5
5.
cmp Pr3,0
13. sub Pr5,Pr6,Pr7
6.
be
block3
14. st
Pr7,[f]
block2:
7.
mov 1,Pr4
8.
st
Pr4,[flag]
9.
b
block4
block3:
10. st
0,[flag]
block4:
0,3
3
0,2
0,2
0,1
6
1,4
11
0,1
14
0,2
0,2
5
0,1
6
1,4
11
0,1
14
0,2
13
0,2
0,3
13
0,2
0,3
7
250
0,1
9
0,1
9
0,1
10
0,1
10
251
252
We want to be aggressive in
scheduling loads, which incur high
latencies when a cache miss occurs.
In many cases, control and data
dependencies may force us to restrict
how far we may move a critical load.
Consider
ld.s
p = Lookup(Id);
...
if (p != null)
print(p.a);
253
254
%reg,adr
p = Lookup(Id);
...
q.a = init();
print(p.a);
[adr],%reg
255
256
[adr],%reg
257
258
Speculative Multi-threaded
Processors
[adr],%reg
259
260
Reading Assignment
261
Software Pipelining
262
L: ld  [%g3],%g2
   nop
   add %g2,1,%g2
   st  %g2,[%g3]
   add %g3,4,%g3
   cmp %g3,%g4
   bne L
   nop
265
266
Epilogue Code
267
268
Resource Constraints
A small II normally means that we are
doing steps of several iterations
simultaneously. The number of
registers and functional units (that
execute instructions) can become
limiting factors of the size of II.
For example, if a loop body contains 4
floating point operations, and our
processor can issue and execute no
more than 2 floating point operations
per cycle, then the loop's II can't be
less than 2.
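That resource bound on II is just a per-resource-class ceiling. A minimal Python sketch (the function name and the per-class encoding are invented for illustration):

```python
# Resource-constrained minimum initiation interval: for each functional-unit
# class, II >= ceil(ops in the loop body / ops issuable per cycle).

from math import ceil

def res_min_ii(op_counts, issue_limits):
    """op_counts / issue_limits: per functional-unit class, e.g. {'fp': 4}."""
    return max(ceil(op_counts[c] / issue_limits[c]) for c in op_counts)

# 4 FP ops per iteration, at most 2 FP ops issued per cycle -> II >= 2.
ii = res_min_ii({"fp": 4, "mem": 2}, {"fp": 2, "mem": 2})
```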
269
Dependency Constraints
A loop body can often contain a
loop-carried dependence. This means one
iteration of a loop depends on values
computed in an earlier iteration. For
example, in
f:  mov  %o0, %o2        !a in %o2
    mov  1, %o1          !i=1 in %o1
L:  sll  %o1, 2, %o0     !i*4 in %o0
    add  %o0, %o2, %g2   !&a[i] in %g2
    ld   [%g2-4], %g2    !a[i-1] in %g2
    ld   [%o2+%o0], %g3  !a[i] in %g3
    add  %g2, %g3, %g2   !a[i-1]+a[i]
    srl  %g2, 31, %g3    !s=0 or 1=sign
    add  %g2, %g3, %g2   !a[i-1]+a[i]+s
    sra  %g2, 1, %g2     !a[i-1]+a[i]/2
    add  %o1, 1, %o1     !i++
    cmp  %o1, 999
    ble  L
    st   %g2, [%o2+%o0]  !store a[i]
    retl
    nop
271
272
Modulo Scheduling
273
274
Example
275
276
1.  f:  mov 0, %g3
2.  L:  ld  [%o1], %g2
3.      add %g3, %g2, %g4
4.      st  %g4, [%o0]
5.      add %g3, 1, %g3
6.      add %o0, 4, %o0
7.      cmp %g3, 999
8.      ble L
9.      add %o1, 4, %o1
10.     retl
11.     nop
277
278
instruction
ld
[%o1], %g2
ld
[%o1], %g2
ld
[%o1], %g2
279
280
cycle
cycle
1.
2.
3.
3.
4.
5.
5.
6.
7.
instruction
ld
[%o1], %g2
add
ld
add
ld
add
1.
2.
3.
3.
4.
5.
5.
6.
7.
instruction
ld
[%o1], %g2
add
ld
st
add
ld
st
add
281
282
cycle
cycle
1.
2.
3.
3.
3.
4.
5.
5.
5.
6.
7.
7.
instruction
ld
[%o1], %g2
add
ld
add
st
add
ld
add
st
add
add
1.
2.
3.
3.
3.
4.
4.
5.
5.
5.
6.
6.
7.
7.
instruction
ld
[%o1], %g2
add
ld
add
st
add
add
ld
add
st
add
add
add
283
284
cycle
cycle
1.
1.
3.
3.
3.
3.
4.
4.
4.
5.
5.
5.
5.
6.
6.
6.
7.
7.
1.
2.
3.
3.
3.
4.
4.
4.
5.
5.
5.
6.
6.
6.
7.
7.
instruction
ld
[%o1], %g2
add
ld
add
st
add
cmp
add
ld
add
st
add
cmp
add
add
285
286
We now have:
cycle
1.
1.
3.
3.
3.
3.
4.
4.
4.
5.
5.
5.
5.
5.
6.
6.
6.
7.
7.
instruction
ld
[%o1], %g2
add
%o1, 4, %o1
add
%g3, %g2, %g4
ld
[%o1], %g2
add
%o1, 4, %o1
add
%g3, 1, %g3
st
%g4, [%o0]
add
%o0, 4, %o0
cmp
%g3, 999
add
%g3, %g2, %g4
ld
[%o1], %g2
add
%o1, 4, %o1
add
%g3, 1, %g3
st
%g4, [%o0]
add
%o0, 4, %o0
cmp
%g3, 999
add
%g3, %g2, %g4
add
%g3, 1, %g3
L:
instruction
ld
[%o1], %g2
add
%o1, 4, %o1
add
%g3, %g2, %g4
ld
[%o1], %g2
add
%o1, 4, %o1
add
%g3, 1, %g3
st
%g4, [%o0]
add
%o0, 4, %o0
cmp
%g3, 999
add
%g3, %g2, %g4
ld
[%o1], %g2
add
%o1, 4, %o1
ble
L
add
%g3, 1, %g3
st
%g4, [%o0]
add
%o0, 4, %o0
cmp
%g3, 999
add
%g3, %g2, %g4
add
%g3, 1, %g3
287
288
289
290
instruction
ld
[%g3+%o1], %g2
addcc %o2, -1, %o2
st
%g2, [%g3+%o0]
bne
L
add
%g3, 4, %g3
!N0
!N0
!N0
!N1
!N1
!N0
!N0
!N0
!N0
!N1
!N2
!N2
!N0
!N1
cycle
1.
1.
3.
3.
3.
L:
instruction
ld
[%o1], %g2
add
%o1, 4, %o1
add
%g3, %g2, %g4
ld
[%o1], %g2
add
%o1, 4, %o1
add
%g3, 1, %g3
st
%g4, [%o0]
add
%o0, 4, %o0
cmp
%g3, 999
add
%g3, %g2, %g4
ld
[%o1], %g2
add
%o1, 4, %o1
ble
L
add
%g3, 1, %g3
291
292
L:
instruction
ld
[%g3+%o1], %g2
addcc %o2, -1, %o2
st
%g2, [%g3+%o0]
beq
L2
add
%g3, 4, %g4
ld
[%g4+%o1], %g5
addcc %o2, -1, %o2
st
%g5, [%g4+%o0]
bne
L
add
%g4, 4, %g3
293
Predication
294
cmp
bge,a
add
add
L1:
cmp
bge,a
add
add
L2:
if (a<b)
a++;
else b++;
if (c<d)
c++;
else d++;
instruction
ld
[%g3+%o1], %g2
addcc %o2, -1, %o2
add
%g3, 4, %g4
ld
[%g4+%o1], %g5
st
%g2, [%g3+%o0]
beq
L2
addcc %o2, -1, %o2
st
%g5, [%g4+%o0]
bne
L
add
%g4, 4, %g3
L2:
L2:
L:
%o0, %g1
L1
%g1, 1, %g1
%o0, 1, %o0
%o5, %o4
L2
%o4, 1, %o4
%o5, 1, %o5
295
296
%o0, %g1,%p1
297
Thus
298
add(%p1) %r1,%r2,%r3
if (a<b)
a++;
else b++;
if (c<d)
c++;
else d++;
we now generate
1.
1.
2.
2.
2.
2.
cmplt
cmplt
add(%p1)
add(~p1)
add(%p2)
add(%~p2)
%o0,
%o5,
%g1,
%o0,
%o4,
%o5,
%g1, %p1
%o4, %p2
1, %g1
1, %o0
1, %o4
1, %o5
299
300
1.
f:
2.
L:
3.
4.
5.
6.
7. L1:
8. L2:
9.
10.
11.
12.
13.
14.
15.
301
mov
and
ld
sub(~%p1)
add(%p1)
st
add
add
cmp
ble
add
retl
nop
cycle
1.
1.
2.
3.
3.
3.
3.
3.
4.
4.
4.
4.
5.
5.
5.
5.
5.
5.
0, %g3
%g3, 1, %p1
[%o1], %g2
%g3, %g2, %g4
%g3, %g2, %g4
%g4, [%o0]
%g3, 1, %g3
%o0, 4, %o0
%g3, 999
L
%o1, 4, %o1
302
instruction
ld
[%o1], %g2
add
%o1, 4, %o1
and
%g3, 1, %p1
add(%p1) %g3, %g2,
sub(~%p1) %g3, %g2,
ld
[%o1], %g2
add
%o1, 4, %o1
add
%g3, 1, %g3
L: st
%g4, [%o0]
add
%o0, 4, %o0
and
%g3, 1, %p1
cmp
%g3, 999
add(%p1) %g3, %g2,
sub(~%p1) %g3, %g2,
ld
[%o1], %g2
add
%o1, 4, %o1
ble
L
add
%g3, 1, %g3
%g4
%g4
%g4
%g4
0, %g3
%g3, 1, %g0
L1
[%o1], %g2
L2
%g3, %g2, %g4
%g3, %g2, %g4
%g4, [%o0]
%g3, 1, %g3
%o0, 4, %o0
%g3, 999
L
%o1, 4, %o1
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
mov
andcc
bne
ld
b
sub
add
st
add
add
cmp
ble
add
retl
nop
303
304
Reading Assignment
Automatic Instruction
Selection
305
306
Tree-Structured Intermediate
Representations
307
308
Example
Representation of Instructions
Reg
-
aadr
IntLiteral1
Reg
Adr
+
Reg
%fp
Reg
boffset
309
cadr
*
*
aadr badr
+
+
*
cadr
Reg Reg
+
ld
ld
add
ld
add
cadr
Reg *
badr
+
Reg
[a],%R1
[b],%R2
%R1,%R2,%R3
[c],%R4
%R3,%R4,%R5
310
cadr
Reg
Reg Reg
This is an
instruction that
loads a register with
the value at an
absolute address.
311
312
Automatic Instruction
Selection Tools
313
314
315
316
A → b
(where b is a single terminal symbol)
or
A → Op(B,C, ...)
(where Op is a terminal that is a
subtree root and B,C, ... are non-terminals)
A → Op(B,C, ...)
denotes a production whose right-hand side
is the tree rooted at Op with children B, C, ...
317
Reading Assignment
318
Example
Terminals
Leaf Nodes
int32
s13
r
Operator Nodes
*
(unary indirection)
(binary minus)
+
(binary addition)
=
(binary assignment)
319
320
Nonterminals
Productions
UInt
Reg
Imm
Adr
Void
Rule
#
321
Production
R8
Reg
Reg Imm
R9
Reg
Cost
SPARC Code
R0
UInt Int32
R1
Reg r
R2
Adr r
R3
+
Adr
Reg Imm
R4
Imm s13
R5
Reg s13
mov s13,Reg
R6
Reg int32
sethi
%hi(int32),%g1
or %g1,
%lo(int32),Reg
R7
Reg
Reg Reg
sub Reg,Reg,Reg
=
Void
UInt Reg
322
Example
SPARC Code
sub Reg,Imm,Reg
a = b - 1;
1
ld [Adr],Reg
sethi
%hi(UInt),%g1
st Reg,
[%g1+%lo(Uint)]
Adr
R10
Cost
Rule
#
Production
int32
s13
*
+
r
s13
323
324
int32
Imm:R4:0
s13 Reg:R5:1
*
+
Reg:R1:0
Adr:R2:0
Imm:R4:0
s13 Reg:R5:1
325
=
UInt:R0:0
Reg:R6:2
326
Void:R10:4
int32
Reg:R8:2
Imm:R4:0
Reg:R9:1
s13 Reg:R5:1
UInt:R0:0
+ Adr:R3:0
Reg:R1:0
Adr:R2:0
s13
int32
Imm:R4:0
Reg:R5:1
Reg:R8:2
Reg:R9:1
s13
Imm:R4:0
+ Adr:R3:0
Reg:R1:0
Imm:R4:0
s13
327
328
int32
Reg:R8:2
Reg:R9:1
s13
Imm:R4:0
[%fp+b],%l0
%hi(1000000),%g1
%g1,%lo(1000000),%l1
%l0,%l1,%l0
%hi(a),%g1
%l0,[%g1+%lo(a)]
+ Adr:R3:0
Reg:R1:0
ld
sub
sethi
st
Imm:R4:0
s13
[%fp+b],%l0
%l0,1,%l0
%hi(a),%g1
%l0,[%g1+%lo(a)]
329
Production
Cost
R11
+
Reg
Imm Reg
R12
Reg
Reg Zero
SPARC Code
add Reg,Imm,Reg
330
331
332
333
Normalizing Costs
334
335
336
Example
State
s0
s1
s2
s3
s4
s5
s6
s7
Node
r
s13
int32
+
*
=
Meaning
{Reg:R1:0, Adr:R2:0}
{Imm:R4:0, Reg:R5:1}
{adr:R3:0}
{Reg:R9:0}
{UInt:R0:0}
{Reg:R8:0}
{Void:R10:0}
{Reg:R7:0}
Left Child
Right Child
s0
s2
s3
s1
s4
s1
s1
s3
s5
Result
s0
s1
s4
s2
s3
s5
s7
s6
337
s13
338
int32
s1
R0
s3
R4
R3+
R1
s13
r
s13
Step 2 (Propagate states upward):
= s6
int32
R8
R9
s1
s0
s4
int32
s13
R4
s5
s13
s1
s2+
s0
s13
s1
339
340
341
Optimistic Coalescing
342
343
344
x3=0;
while (...) {
x1 = x3;
y1 = x1+1;
print(x1);
y2 = y1;
z2 = y2+1;
print(y2);
z3 = z2;
x3 = z3+1;
print(z3);
}
345
346
347
348
Reading Assignment
349
Meet Lattice
350
Transfer Function
T
. . . . . .
Meet operation
(And, Or, Union, Intersection, etc.)
351
352
353
354
Example: e1=v+w
y=v+w
v=9
F
w=5
F
x=v+w
T
z=v+w
T
stop
v=2
F
355
356
F
F
y=v+w
v=9
w=5
F
z=v+w
x=v+w
T
T
T
T
T
stop
v=2
F
z=v+w
357
358
359
360
Example: e1=v+w
v=2
F
w=5
x=v+w
v=3
u=v+w
stop
F
361
362
Identifying Identical
Expressions
v=2
F
Or here?
w=5
F
x=v+w
v=3
Move v+w
here?
T
T
T
F
u=v+w
F
stop
363
364
365
366
Reaching Definitions
for (...) {
for (...) {
if (...)
if (a>b+c)
a=b+c;
x=1;
else a=d+c;}
t=b+c
else x=0;}
t=b+c
T
a>b+c
a=b+c
a=d+c
F
F
b+c is not very busy
at loop entrance
367
368
DefIn(b) =
OR DefOut(p)
p Pred(b)
LiveOut(b) =
OR LiveIn(s)
s Succ(b)
369
370
In Boolean terms:
Out(b) = (In(b) AND Not Kill(b)) OR Gen(b)
or
In(b) = (Out(b) AND Not Kill(b)) OR Gen(b)
or
In(b) = (Out(b) - Kill(b)) ∪ Gen(b)
where Kill(b) is true if a value is killed
within b and Gen(b) is true if a value is
generated within b.
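The gen/kill equations are solved by iterating to a fixed point. A hedged Python sketch (shown for a forward problem like reaching definitions; the function and block names are invented):

```python
# Forward gen/kill dataflow: In(b) = union of Out(p) over predecessors p;
# Out(b) = (In(b) - Kill(b)) | Gen(b), iterated to a fixed point.

def solve_forward(blocks, preds, gen, kill):
    out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            inn = set().union(*(out[p] for p in preds[b])) if preds[b] else set()
            new_out = (inn - kill[b]) | gen[b]
            if new_out != out[b]:
                out[b] = new_out
                changed = True
    return out

# Definition d1 (made in b1) is killed by d2 (same variable) in b2.
out = solve_forward(
    ["b1", "b2"], {"b1": [], "b2": ["b1"]},
    gen={"b1": {"d1"}, "b2": {"d2"}},
    kill={"b1": {"d2"}, "b2": {"d1"}},
)
```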
371
372
Example
Reading Assignment
Live=0,0
v=1
Live=0,1
u=0
Live=1,1
a=u
Gen=1,0
Kill=0,0
Gen=0,0
Kill=0,1
Gen=0,0
Kill=1,0
Live=1,0
Gen=0,0
v=2 Kill=0,1
Live=1,1
print(u,v)
Gen=1,1
Kill=0,0
373
374
Building a DFST

Example

[CFG over nodes A..J and its depth-first spanning tree; garbled in extraction.]

Visit order is A, B, C, D, E, G, H, I, J, F

Depth-First Order

Example

[The same CFG with nodes numbered in depth-first order; garbled in extraction.]
Dominators
A CFG node M dominates N
(M dom N) if and only if all paths
from the start node to N must pass
through M.
A node trivially dominates itself.
Thus (N dom N) is always true.
Application of Depth-First Ordering
Immediate Dominators

[Example CFG over nodes A..E with immediate dominators marked; garbled in extraction.]

Dominator Trees

[A CFG from Start to End and its dominator tree; garbled in extraction.]

Computing Dominators
ComputeDominators() {
   For (each n ∈ NodeSet)
      Dom(n) = NodeSet
   WorkList = {StartNode}
   While (WorkList ≠ ∅) {
      Remove any node Y from WorkList
      New = {Y} ∪  ∩  Dom(X)
              X ∈ Pred(Y)
      If (New ≠ Dom(Y)) {
         Dom(Y) = New
         For (each Z ∈ Succ(Y))
            WorkList = WorkList ∪ {Z}
}}}
Example

[CFG from Start to End. All Dom sets start as ALL; the algorithm converges to:
Dom(start) = {start}
Dom(A) = {start,A}
Dom(B) = {start,A,B}
Dom(C) = {start,A,C}
Dom(D) = {start,A,D}
Dom(E) = {start,A,D,E}
Dom(F) = {start,A,D,E,F}
Dom(End) = {start,A,D,E,F,End}
The corresponding dominator tree is also shown; garbled in extraction.]
Postdominance
A block Z postdominates a block Y
(Z pdom Y) if and only if all paths
from Y to an exit block must pass
through Z. Notions of immediate
postdominance and a postdominator
tree carry over.
Note that if a CFG has a single exit
node, then postdominance is
equivalent to dominance if flow is
reversed (going from the exit node to
the start node).
[Example CFG and its postdominator tree; garbled in extraction.]
Dominance Frontiers
DF(N) =
   {Z | M→Z & (N dom M) & ¬(N sdom Z)}

The dominance frontier of N is the set of blocks that are not strictly dominated by N and which are first reached on paths from N.
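The definition above can be computed cheaply from the immediate-dominator relation. A minimal sketch in the style of the standard Cytron et al. construction (not the course's code; it assumes the idom map is already available):

```python
# Dominance frontiers from immediate dominators: for each join block b
# (two or more predecessors), walk from each predecessor p up the
# dominator tree; every node strictly below idom(b) has b in its DF.

def dominance_frontiers(nodes, preds, idom):
    df = {n: set() for n in nodes}
    for b in nodes:
        ps = preds.get(b, [])
        if len(ps) >= 2:
            for p in ps:
                runner = p
                while runner is not None and runner != idom[b]:
                    df[runner].add(b)
                    runner = idom.get(runner)
    return df
```

For a diamond S→A, A→{B,C}, B→D, C→D with idom(D)=A, the walk puts D into DF(B) and DF(C) but not DF(A).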
Example

[CFG, its dominator tree, and the dominance frontier of each node (DF values such as {F}, {E}, {E}, {F}); garbled in extraction.]
Postdominance Frontiers

PDF(N) =
   {Z | Z→M & (N pdom M) & ¬(N spdom Z)}

The postdominance frontier of N is the set of blocks closest to N where a choice was made of whether to reach N or not.

Example

[CFG, its postdominator tree, and the postdominance frontier of each node (PDF values such as {A}, {B}, {B}, {A}); garbled in extraction.]
Control Dependence

[Example: a CFG from Start to Exit with branch edges labeled T/F, its postdominator tree, and the resulting control dependence graph over blocks A..G; garbled in extraction.]
Value Lattice

The lattice of values is usually a meet semilattice defined by:
A: a set of values
T and ⊥ (top and bottom): distinguished values in the lattice
≤: a reflexive partial order relating values in the lattice
∧: an associative and commutative meet operator on lattice values

Lattice Axioms

Here ∧ is ∩ (set intersection), so
a ≤ b ⟹ a ∪ {Z} ⊆ b ∪ {Z} ⟹ fZ(a) ≤ fZ(b)
Constant Propagation
We can model Constant Propagation
as a Data Flow Problem. For each
scalar integer variable, we will
determine whether it is known to
hold a particular constant value at a
particular basic block.
The value lattice is:

            T
..., -2, -1, 0, 1, 2, ...
            ⊥
Let t : V → N ∪ {T,⊥}
t is the set of all total mappings from V (the set of variables) to N ∪ {T,⊥} (the lattice of constant status values).
For example, t1=(T,6,⊥) is a mapping for three variables (call them A, B and C) into their constant status. t1 says A is considered a constant, with value as yet undetermined. B holds the value 6, and C is non-constant.
We can create a lattice composed of t functions:
tT(v) = T (∀v) (tT=(T,T,T, ...))
ta ≤ tb ⟺ ∀v ta(v) ≤ tb(v)
Thus (1,⊥) ≤ (T,3)
since 1 ≤ T and ⊥ ≤ 3.
The meet operator is applied componentwise:
ta∧tb = tc
where ∀v tc(v) = ta(v) ∧ tb(v)
Thus (1,⊥) ∧ (T,3) = (1,⊥)
since 1 ∧ T = 1 and ⊥ ∧ 3 = ⊥.
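The componentwise meet above is easy to sketch in code. A minimal sketch (not the course's code; "B" stands in for ⊥ since it must be a distinct marker value):

```python
# Constant-status lattice: a value is "T" (top, unknown-but-maybe-constant),
# "B" (bottom, non-constant), or a concrete integer. Tuples of values are
# met componentwise.

T, B = "T", "B"  # top and bottom (written ⊥ in the text)

def meet(x, y):
    """Meet of two single lattice values."""
    if x == T:
        return y
    if y == T:
        return x
    if x == y:
        return x
    return B  # two different constants, or anything involving bottom

def meet_tuple(ta, tb):
    return tuple(meet(a, b) for a, b in zip(ta, tb))
```

With this, meet_tuple((1, B), (T, 3)) gives (1, B), matching the worked example above.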
Example

[CFG: starting from (T,T) for (a,b), the block a=1; b=2 gives (1,2); the branches b=a+1 and b=a+2 yield (1,2) and (1,3); their meet at the join is (1,⊥); after b=b-1 the result is still (1,⊥); garbled in extraction.]
Distributive Functions

In this case, ∧ = ∩.
Now fb(S1 ∧ S2) = (S1 ∩ S2) ∪ {b}
Also, fb(S1) ∧ fb(S2) =
(S1 ∪ {b}) ∩ (S2 ∪ {b}) =
(S1 ∩ S2) ∪ {b}

Reading Assignment
Given the statement x=y+z, with tuples written (x,y,z), consider t1=(T,1,3) and t2=(T,2,2):
f(t1) = (4,1,3)
f(t2) = (4,2,2)
Now f(t)=t′ where
t′(y) = t(y), t′(z) = t(z),
t′(x) = if t(y)=⊥ or t(z)=⊥
        then ⊥
        elseif t(y)=T or t(z)=T
        then T
        else t(y)+t(z)
   ∧   f(p)
p ∈ all paths to bk

Conditional Constant Propagation
Example

i = 1;
done = 0;
while ( i > 0 && ! done) {
   if (i == 1)
      done = 1;
   else i = i + 1; }
Pass 1:

[CFG annotated with (i,done) pairs: after i=1; done=0 the pair is (1,0); the true branch of i==1 executes done=1, giving (1,1) at the loop's back edge; garbled in extraction.]

Pass 2:

[The same CFG after meeting (1,0) with (1,1) at the loop head: i remains the constant 1 while done becomes ⊥, so the pairs are (1,⊥) through the loop body and (1,1) after done=1; garbled in extraction.]
EvalDF {
   For (all n ∈ CFG) {
      soln(n) = T
      ReEval(n) = true }
   Repeat
      LoopAgain = false
      For (all n ∈ CFG in DFO order) {
         If (ReEval(n)) {
            ReEval(n) = false
            OldSoln = soln(n)
            In = ∧ soln(p)
                 p ∈ Pred(n)
            soln(n) = fn(In)
            If (soln(n) ≠ OldSoln) {
               For (all s ∈ Succ(n)) {
                  ReEval(s) = true
                  LoopAgain = LoopAgain OR
                              IsBackEdge(n,s)
   }} }}
   Until (! LoopAgain)
}
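The evaluator above can be sketched in simplified form. A minimal sketch (not the course's code; it drops the ReEval flags and back-edge test and simply re-sweeps all blocks in depth-first order until nothing changes, which computes the same fixed point for a forward union-meet problem):

```python
# Round-robin forward dataflow: visit blocks in DFO order, take the
# union of predecessor solutions, apply the block's transfer function,
# and repeat until a full pass makes no change.

def eval_df(order, preds, transfer):
    soln = {n: set() for n in order}
    changed = True
    while changed:
        changed = False
        for n in order:  # DFO order
            inn = set()
            for p in preds.get(n, []):
                inn |= soln[p]
            out = transfer(n, inn)
            if out != soln[n]:
                soln[n] = out
                changed = True
    return soln
```

With transfer functions like those in the example that follows (f6(in)={6}, f7(in)={7}, identity elsewhere), the solution stabilizes after two passes.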
[Table showing the solution sets at each block over the passes (Initial, Pass 1, Pass 2), with values such as {3}, {6}, {7} and the LoopAgain flag going true, true, true, false; garbled in extraction.]

f6(in) = {6}
f7(in) = {7}
For all other blocks, fb(in) = in
Iteration 1: Input = a ∧ T = a
             Output = f(a)
Iteration 2: Input = a ∧ f(a)
             Output = f(a ∧ f(a))
Iteration 3: Input = a ∧ f(a ∧ f(a))

Recall
f(t) = t′ where
t′(v) = case(v){
   k: t(j);
   j: t(i);
   i: 2; }
Let a = (⊥,1,1).
f(T) = (2,T,T)
a ∧ f(T) = (⊥,1,1) ∧ (2,T,T) = (⊥,1,1)
f(a) = f(⊥,1,1) = (2,⊥,1).
Now (⊥,1,1) is not ≤ (2,⊥,1)
so this problem isn't rapid.
We require that
a ∧ f(T) ≤ f(a)
Consider
i=1
j=1
k=1
k=j
j=i
i=2

   ∧   f(p)
 p ∈ Pb
Reading Assignment
Three Program-Building Operations

Sequential Execution
We can reduce a sequential chain of basic blocks:
b1 → b2 → ... → bn

Conditional Execution

Iterative Execution

Repeat Loop
Given the basic blocks:
bp, bB, bL1, bL2, bC, bcond, brepeat

While Loop.
[CFG with condition block bC and body block bB; garbled in extraction.]
Example: Reaching Definitions

[CFG whose blocks generate definition sets {2}, {6} and {7}; garbled in extraction.]

Substituting, we get
x

We now have
Id ∪ (fix(Id ∧ fC)) ∘ Id =
f6 ∘ fC = Id ∘ fC = fC
Id ∪ (fix(fC)) =
Id ∪ (fC ∧ fC2 ∧ fC3 ∧ ...) =
Id ∪ {b4,b5}

We now have
x
What If

Phi Functions

Example

Original CFG:
[a=x; x=2; b=x; x=1; a loop testing x==10; c=x; x++; print x; garbled in extraction.]

In SSA form:
[x1=1; x2=2; a=x1; x3=ϕ(x1,x2); b=x3; x4=1; x5=ϕ(x4,x6); test x5==10; c=x5; x6=x5+1; print x5; garbled in extraction.]
Example

i=6
j=1
k=1
repeat
   if (i==6)
      k=0
   else
      i=i+1
   i=i+k
   j=j+1
until (i==j)

In SSA form:

i1=6
j1=1
k1=1
repeat
   i2=ϕ(i1,i5)
   j2=ϕ(j1,j3)
   k2=ϕ(k1,k4)
   if (i2==6)
      k3=0
   else
      i3=i2+1
   i4=ϕ(i2,i3)
   k4=ϕ(k3,k2)
   i5=i4+k4
   j3=j2+1
until (i5==j3)

[Table of constant-propagation passes over i1, j1, k1, i2, j2, k2, k3, i3, i4, k4, i5, j3; garbled in extraction.]
Loop (simple case): the loop
v=init; ... v=v+1
becomes
v1=init
v2=ϕ(v1,v3)
v3=v2+1

DF(N) =
   {Z | M→Z & (N dom M) & ¬(N sdom Z)}

[Further examples of ϕ-placement: v=1 and v=2 on separate paths become v1=1; v2=2; v3=ϕ(v1,v2), and a later assignment v4=3 forces a second function v5=ϕ(v3,v4); garbled in extraction.]

2. Compute D2 = DF(D1).
   Place Phi functions at the head of all members of D2-D1.
3. Compute D3 = DF(D2).
   Place Phi functions at the head of all members of D3-D2-D1.
4. Repeat until no additional Phi functions can be added.
Example
PlacePhi {
   For (each variable v ∈ program) {
      For (each block b ∈ CFG) {
         PhiInserted(b) = false
         Added(b) = false }
      List = ∅
      For (each b ∈ CFG that assigns to v) {
         Added(b) = true
         List = List ∪ {b} }
      While (List ≠ ∅) {
         Remove any b from List
         For (each d ∈ DF(b)) {
            If (! PhiInserted(d)) {
               Add a Phi Function to d
               PhiInserted(d) = true
               If (! Added(d)) {
                  Added(d) = true
                  List = List ∪ {d}
               }
            }
         }
      }
   }
}
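The PlacePhi worklist above can be sketched for a single variable. A minimal sketch (not the course's code; it assumes each block's dominance frontier df[b] is already computed):

```python
# Phi placement: given the blocks that assign a variable and each
# block's dominance frontier, return the blocks needing a phi function.

def place_phi(assign_blocks, df):
    phi_blocks = set()            # PhiInserted
    added = set(assign_blocks)    # Added
    worklist = list(assign_blocks)
    while worklist:
        b = worklist.pop()
        for d in df.get(b, ()):   # dominance frontier of b
            if d not in phi_blocks:
                phi_blocks.add(d)
                if d not in added:
                    added.add(d)
                    worklist.append(d)
    return phi_blocks
```

On the example that follows (blocks 1, 3, 5 and 6 assign x, with DF(1)=∅, DF(3)={4}, DF(4)={4}, DF(5)={4,5}, DF(6)={5}), the sketch inserts phi functions at blocks 4 and 5.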
Initially, List={1,3,5,6}
Process 1: DF(1) = ∅
Process 3: DF(3) = {4}, so add 4 to List and add phi fct to 4.
Process 5: DF(5) = {4,5}, so add phi fct to 5.
Process 6: DF(6) = {5}
Process 4: DF(4) = {4}

[The example CFG with assignments x1=1, x2=2, x3=3, x4=4 in the blocks that assign to x, and the inserted functions x5=ϕ(x1,x2,x3) and x6=ϕ(x4,x5); garbled in extraction.]
Value Numbering

For example,
t1 = a + b
c = a
t2 = c + b
is analyzed as
v1 = a
v2 = b
t1 = v1 + v2
c = v1
t2 = v1 + v2
In contrast, given
t1 = a + b
a = 2
t2 = a + b
is analyzed as
v1 = a
v2 = b
t1 = v1 + v2
v3 = 2
t2 = v3 + v2

But,
in SSA form a copy x = y becomes xi = yj, so
t1 = a1 + b1
c1 = a1
t2 = c1 + b1
becomes
t1 = a1 + b1
t2 = a1 + b1
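The analysis above can be sketched as a small local value numbering pass. A minimal sketch (not the course's code; the statement encoding is my own, and it assumes SSA-style input where names are not reassigned):

```python
# Local value numbering: each name and each (op, num, num) expression
# gets a value number; a copy shares its source's number, so
# t1 = a + b and t2 = c + b (after c = a) receive the same number.

def value_number(stmts):
    """stmts: list of (dest, op, arg1, arg2); op 'copy' uses only arg1."""
    num_of = {}        # name or expression key -> value number
    next_num = [0]

    def number(name):
        if name not in num_of:
            num_of[name] = next_num[0]
            next_num[0] += 1
        return num_of[name]

    redundant = []
    for dest, op, a1, a2 in stmts:
        if op == "copy":
            num_of[dest] = number(a1)      # copy: share the source's number
        else:
            key = (op, number(a1), number(a2))
            if key in num_of:
                redundant.append(dest)     # same value already computed
                num_of[dest] = num_of[key]
            else:
                num_of[key] = next_num[0]
                num_of[dest] = next_num[0]
                next_num[0] += 1
    return redundant
```

On the slide's example (t1 = a + b; c = a; t2 = c + b), t2 is reported redundant; if c were instead bound to a fresh value, it would not be.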
Example

if (...) {
   a1=0
   if (...)
      b1=0
   else {
      a2=x0
      b2=x0 }
   a3=ϕ(a1,a2)
   b3=ϕ(b1,b2)
   c2=*a3
   d2=*b3 }
else {
   b4=10 }
a5=ϕ(a0,a3)
b5=ϕ(b3,b4)
c3=*a5
d3=*b5
e3=*a5

Initial Groupings:
G1=[a0,b0,c0,d0,e0,x0]
G2=[a1=0, b1=0]
G3=[a2=x0, b2=x0]
G4=[b4=10]
G5=[a3=ϕ(a1,a2), b3=ϕ(b1,b2)]
G6=[a5=ϕ(a0,a3), b5=ϕ(b3,b4)]
G7=[c2=*a3, d2=*b3, d3=*b5, c3=*a5, e3=*a5]

Final Groupings:
G1=[a0,b0,c0,d0,e0,x0]
G2=[a1=0, b1=0]
G3=[a2=x0, b2=x0]
G4=[b4=10]
G5=[a3=ϕ(a1,a2), b3=ϕ(b1,b2)]
G6a=[a5=ϕ(a0,a3)]
G6b=[b5=ϕ(b3,b4)]
G7a=[c2=*a3, d2=*b3]
G7b=[d3=*b5]
G7c=[c3=*a5, e3=*a5]
But,
we can still remove a redundant computation of a+b by moving the computation of t3 to each of its predecessors:

Before:
a1=1            a2=2
t1=a1+b0        t2=a2+b0
   a3=ϕ(a1,a2)
   t3=a3+b0

After:
a1=1            a2=2
t1=a1+b0        t2=a2+b0
e1=a1+b0        e2=a2+b0
   e3=ϕ(e1,e2)
   t3=e3

(Since t1 and t2 already compute a+b, this is equivalent to t3=ϕ(t1,t2).)

Reading Assignment
We anticipate an expression if it is very busy:

AntOuti = VeryBusyOuti
        = AND AntIns
          s ∈ Succ(i)
AntIni = VeryBusyIni
       = AntLoci OR
         (Transpi AND AntOuti)

AvIni = 0 (false) for b0
      = AND AvOutp
        p ∈ Pred(i)
AvOuti = Compi OR
         (AvIni AND Transpi)
       = Geni OR
         (AvIni AND NOT Killi)
Partial Availability

PavIni = 0 (false) for b0
       = OR PavOutp
         p ∈ Pred(i)
PavOuti = Compi OR
          (PavIni AND Transpi)
PPIni = Consti
   AND (AntLoci OR (Transpi AND PPOuti))
   AND  AND  (PPOutp OR AvOutp)
       p ∈ Pred(i)

Removing Existing Computations
Examples of Partial Redundancy Elimination

[CFGs in which one path computes x=2 and another x=1, each followed by a use of x+3; garbled in extraction.]

PPOut1 = PPIn3
PPOut3 = PPIn4.
AntLoc4 = true.
Transp3 = true.
PPOut3 = true.
PPIn3 = Const3 AND
   ((Transp3 AND PPOut3) OR ... ), p ∈ Pred(4)

Where Do We Insert Computations?

[CFG with x=1 and x=2 on different paths and uses of x+3; garbled in extraction.]
[Two CFG fragments showing computations of a+b; garbled in extraction.]

Consider a loop whose Body contains a computation of a+b, with a = val set before entry and a Control block testing the loop condition:
a = val
...
a+b
...
a = val
do
   ...
   a+b
   ...
while (...)

Preheader

A while loop
while (expr) {
   body
}
is transformed into
if (expr)
   do {body}
   while (expr)
or
goto L:
do {body}
L: while (expr)
For example, in
a = val
...
a+b
...

[CFGs contrasting placements of y=a+b relative to y=0 and print(y); garbled in extraction.]
Procedure Placement

[Call-graph examples: procedures such as A, C and D with call-frequency edge weights (2, 1, 10), merged into groups like [A,D] and [C,F]; garbled in extraction.]

Example

[Call graph with node frequencies A 1000, B 7000, C 500, G 4000, F 2500 and edge weights such as 6500, 900, 2500, 4000, 500; garbled in extraction.]
Procedure Splitting
Reading Assignment
Points-To Analysis
All compiler analyses and
optimizations are limited by the
potential effects of assignments
through pointers and references.
Thus in C:
b = 1;
*p = 0;
print(b);
is 1 or 0 printed?
Similarly, in Java:
a[1] = 1;
b[1] = 0;
print(a[1]);
is 1 or 0 printed?
if (b)
   p = &a;
else p = &c;

**p = 1;
is transformed into
temp = *p;
*temp = 1;
Points-To Representation

Given
p = &a
we create an arc from p to a:
p → a

Andersen's Algorithm
p = &a;
p = q;
p = *r;
*p = &a;
*p = q;
*p = *r;

1. p = &a;
We add an arc from p to a, showing p can possibly point to a.

2. p = q;
3. p = *r;
4. *p = &a;
5. *p = q;
6. *p = *r;
[Points-to graph diagrams for each rule, over nodes such as p, q, r, b, c, d, e and v; garbled in extraction.]
Example

We start with:
p1 = &a;
p2 = &b;
[arcs p1 → a and p2 → b]

Next:
p1 = p2;
[p1 also gains the arc p1 → b]

Then:
r = &p1;
[adding r → p1]

Next:
*r = &c;
[since r points to p1, this adds p1 → c]

Then:
p3 = *r;
[p3 gains arcs to everything p1 points to]

Finally:
p2 = &d;
[p2 → d; re-applying p1 = p2 and p3 = *r also gives p1 → d and p3 → d]

[Points-to graphs after each step garbled in extraction.]
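The rule applications traced above can be sketched as a tiny inclusion-based solver. A minimal sketch (not the course's code; the constraint encoding is my own, and a statement like *r = &c is modeled with a hypothetical temporary tc = &c; *r = tc):

```python
# Andersen-style points-to analysis: iterate the inclusion rules to a
# fixed point. Constraints: ("addr", p, a) for p = &a, ("copy", p, q)
# for p = q, ("load", p, r) for p = *r, ("store", p, q) for *p = q.

def andersen(constraints):
    pts = {}  # variable -> set of locations it may point to

    def targets(v):
        return pts.setdefault(v, set())

    changed = True
    while changed:
        changed = False
        for kind, lhs, rhs in constraints:
            if kind == "addr":                      # p = &a
                new = {rhs}
            elif kind == "copy":                    # p = q
                new = set(targets(rhs))
            elif kind == "load":                    # p = *r
                new = set()
                for t in list(targets(rhs)):
                    new |= targets(t)
            else:                                   # "store": *p = q
                for t in list(targets(lhs)):        # each target of p
                    add = targets(rhs) - targets(t)
                    if add:
                        targets(t).update(add)
                        changed = True
                continue
            if not new <= targets(lhs):
                targets(lhs).update(new)
                changed = True
    return pts
```

Encoding the example above (with the temporary tc for *r = &c), the solver ends with p1 and p3 each pointing to {a, b, c, d}, matching the traced graph.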
Performance of Andersen's Algorithm

Steensgaard's Algorithm

p = *q;
[Diagrams showing how Steensgaard's unification-based analysis merges nodes a and b into a single equivalence class; garbled in extraction.]
The Horwitz-Shapiro Approach

Example

p1 = &a;
p1 = &b;
p1 = &c;
p2 = &c;

[With variables placed in two categories, p1 points to the category containing a and b and the category containing b and c, while p2 points only to the category containing b and c; diagram garbled in extraction.]
But...
What if we chose to place a in category 1 and b and c in category 2?
We now have:
[Diagram in which p1 points to both categories; garbled in extraction.]

Why?
In the first run, we'll use the rightmost digit in a variable's index as its category.
In the next run, we'll use the second digit from the right, then the third digit from the right, ...
Any two distinct variables have different index values, so they must differ in at least one digit position.