Program Design and Analysis
Program-Level Performance Analysis

Complexities of program performance
How to measure program performance
3/11/2015
Program performance metrics
Elements of program performance
Data-dependent paths in
Instruction Timing
Not all instructions take the same amount of time.
  Multi-cycle instructions.
  Fetches.
Execution times of instructions are not independent.
  Pipeline interlocks.
  Cache effects.
Execution times may vary with operand value.
  Floating-point operations.
  Some multi-cycle integer operations.

Measurement-driven Performance Analysis
Not so easy as it sounds:
  Must actually have access to the CPU.
  Must know data inputs that give worst/best case performance.
  Must make state visible.
Still an important method for performance analysis.
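As a rough illustration of measurement-driven analysis, the sketch below times a workload over several runs with the standard C clock() function and records the shortest and longest observed times. The names measure, timing, and sample_work are hypothetical. The observed maximum is only a lower bound on the true worst case, since it depends on which inputs and cache states were exercised.

```c
#include <assert.h>
#include <time.h>

typedef void (*work_fn)(void);

/* Shortest and longest observed run times, in seconds of CPU time. */
typedef struct {
    double min_s;
    double max_s;
} timing;

/* Hypothetical workload: a short busy loop the compiler cannot remove. */
void sample_work(void)
{
    volatile unsigned long sum = 0;
    for (unsigned long i = 0; i < 100000UL; i++)
        sum += i;
}

/* Run fn the given number of times and track min/max observed time. */
timing measure(work_fn fn, int runs)
{
    assert(fn != NULL && runs > 0);
    timing t = {0.0, 0.0};
    for (int r = 0; r < runs; r++) {
        clock_t start = clock();
        fn();
        clock_t stop = clock();
        double s = (double)(stop - start) / CLOCKS_PER_SEC;
        if (r == 0 || s < t.min_s) t.min_s = s;
        if (r == 0 || s > t.max_s) t.max_s = s;
    }
    return t;
}
```

Calling measure(sample_work, 5) returns the two extremes for five runs. On a real embedded target, a hardware timer or logic analyzer would usually replace clock().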
Performance Optimization Motivation
Embedded systems must often meet deadlines.
  Faster may not be fast enough.
Need to be able to analyze execution time.
  Worst-case, not typical.
Need techniques for reliably improving execution time.

Programs and Performance Analysis
Best results come from analyzing optimized instructions, not high-level language code:
  Non-obvious translations of HLL statements into instructions;
  Code may move;
  Cache effects are hard to predict.
Loop Optimizations
Loops are good targets for optimization.
Basic loop optimizations:
  Code motion;
  Induction-variable elimination;

Code Motion
for (i=0; i<N*M; i++)
  z[i] = a[i] + b[i];
[Flowchart: before code motion, the loop test i<N*M is evaluated on every iteration; after code motion, X = N*M is computed once before the loop and the test becomes i<X.]
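The code motion transformation above can be sketched in C as follows. The function names and the use of size_t are assumptions made for illustration. An optimizing compiler will often perform this hoisting on its own, so the sketch shows the idea and is not a required hand edit.

```c
#include <assert.h>
#include <stddef.h>

/* Before code motion: the bound n * m is part of every loop test. */
void add_arrays_naive(int *z, const int *a, const int *b, size_t n, size_t m)
{
    for (size_t i = 0; i < n * m; i++)
        z[i] = a[i] + b[i];
}

/* After code motion: the loop-invariant product is computed once. */
void add_arrays_hoisted(int *z, const int *a, const int *b, size_t n, size_t m)
{
    assert(z && a && b);
    const size_t x = n * m;          /* hoisted out of the loop */
    for (size_t i = 0; i < x; i++)
        z[i] = a[i] + b[i];
}
```

Both functions produce the same result; only the number of multiplications executed at run time can differ.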
Induction Variable Elimination
Induction variable: loop index.
Consider loop:
  for (i=0; i<N; i++)
    for (j=0; j<M; j++)
      z[i][j] = b[i][j];
Rather than recompute i*M+j for each array in each iteration, share induction variable between arrays, increment at end of loop body.

Cache Analysis
Loop nest: set of loops, one inside other.
Perfect loop nest: no conditionals in nest.
Because loops use large quantities of data, cache conflicts are common.
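A minimal sketch of induction-variable elimination for the nested copy loop, assuming the two-dimensional arrays are stored row-major in flat buffers. The function names and the variable k are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Direct form: the offset i * m + j is recomputed for each array access. */
void copy_indexed(int *z, const int *b, size_t n, size_t m)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            z[i * m + j] = b[i * m + j];
}

/* Eliminated form: one induction variable k is shared by both arrays. */
void copy_shared(int *z, const int *b, size_t n, size_t m)
{
    assert(z && b);
    size_t k = 0;                    /* always equals i * m + j */
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++) {
            z[k] = b[k];
            k++;                     /* increment at end of loop body */
        }
}
```

The second version removes the multiply from the inner loop; the loop counters i and j remain only to control iteration.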
Performance Optimization Hints
Use registers efficiently.
Use page mode memory accesses.
Analyze cache behavior:
  Instruction conflicts can be handled by rewriting code, rescheduling;
  Conflicting scalar data can easily be moved;
  Conflicting array data can be moved, padded.

Energy/power Optimization
Energy: ability to do work.
  Most important in battery-powered systems.
Power: energy per unit time.
  Important even in wall-plug systems: power becomes heat.
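To illustrate the hint that conflicting array data can be padded, the sketch below models a direct-mapped cache and counts how many element pairs a[i], b[i] map to the same cache line slot. The cache geometry, the base addresses, and all names are assumptions chosen for illustration, not values from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical direct-mapped cache geometry. */
typedef struct {
    size_t line_bytes;
    size_t num_lines;
} cache_geom;

/* Line slot that a byte address maps to in a direct-mapped cache. */
size_t cache_line_of(cache_geom g, size_t byte_addr)
{
    assert(g.line_bytes > 0 && g.num_lines > 0);
    return (byte_addr / g.line_bytes) % g.num_lines;
}

/* Count indices i where a[i] and b[i] map to the same line slot. */
size_t count_conflicts(cache_geom g, size_t a_base, size_t b_base,
                       size_t elem_bytes, size_t n_elems)
{
    size_t conflicts = 0;
    for (size_t i = 0; i < n_elems; i++) {
        size_t la = cache_line_of(g, a_base + i * elem_bytes);
        size_t lb = cache_line_of(g, b_base + i * elem_bytes);
        if (la == lb)
            conflicts++;
    }
    return conflicts;
}
```

With a 1 KB cache of 16-byte lines and two 256-element arrays of 4-byte values, placing b directly after a (base 1024) makes every pair conflict, while padding a by one line (b at base 1040) removes all conflicts in this model.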
Measuring Energy Consumption
Execute a small loop, measure current:
  while (TRUE)
    a();
[Figure: measuring the current I drawn while the loop runs.]

Sources of Energy Consumption
Relative energy per operation (Catthoor et al):
  Memory transfer: 33
  External I/O: 10
  SRAM write: 9
  SRAM read: 4.4
  Multiply: 3.6
  Add: 1
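The relative energy figures above can be turned into a rough estimate for a code fragment by weighting operation counts. This is a sketch only: the enum, table, and function names are hypothetical, and the result is in units relative to one add, not in joules.

```c
#include <assert.h>
#include <stddef.h>

/* Operation kinds, in the same order as the table above. */
enum op_kind {
    OP_MEM_TRANSFER, OP_EXT_IO, OP_SRAM_WRITE,
    OP_SRAM_READ, OP_MULTIPLY, OP_ADD, OP_KIND_COUNT
};

/* Relative energy per operation (Add = 1). */
static const double rel_energy[OP_KIND_COUNT] = {
    33.0, 10.0, 9.0, 4.4, 3.6, 1.0
};

/* Weighted sum of operation counts, relative to the cost of one add. */
double relative_energy(const unsigned long counts[OP_KIND_COUNT])
{
    assert(counts != NULL);
    double total = 0.0;
    for (size_t k = 0; k < OP_KIND_COUNT; k++)
        total += rel_energy[k] * (double)counts[k];
    return total;
}
```

For example, one SRAM write, two SRAM reads, one multiply, and one add come to 9 + 8.8 + 3.6 + 1 = 22.4 add-equivalents, which shows how memory traffic dominates the arithmetic.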
General rules:
  Don't use function calls.
  Keep loop body small to enable local repeat (only forward branches).
  Use unsigned integer for loop counter.
  Use <= to test loop counter.
  Make use of compiler: global optimization, software pipelining.

Loop Example
  STM #4000h,AR2     ; load pointer to source
  STM #100h,AR3      ; load pointer to destination
  RPT #(1024-1)
  MVDD *AR2+,*AR3+   ; move
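In C, a block move like the one above might be written following the general rules: an unsigned counter, a <= test, a small body, and no function calls inside the loop. The function name block_move is hypothetical, and whether a given compiler maps this loop onto a hardware repeat instruction depends on the toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* Copy n 16-bit words from src to dst. */
void block_move(uint16_t *dst, const uint16_t *src, unsigned n)
{
    assert(dst && src);
    if (n == 0)
        return;                      /* avoids wrap-around of n - 1 */
    const unsigned last = n - 1;     /* loop bound computed once */
    for (unsigned i = 0; i <= last; i++)   /* unsigned counter, <= test */
        dst[i] = src[i];             /* small body, no function calls */
}
```

The explicit n - 1 bound mirrors the RPT #(1024-1) operand, which repeats the following instruction 1024 times.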
Avoid function inlining.
Choose CPU with compact instructions.
Use specialized instructions where possible.

Testing
But does it work?
Concentrate here on functional verification.
Major testing strategies:
  Black box doesn't look at the source code.
  Clear box (white box) does look at the source code.
Clear-box Testing
Examine the source code to determine whether it works:
  Can you actually exercise a path?
  Do you get the value you expect along a path?
Testing procedure:
  Controllability: provide program with inputs.
  Execute.
  Observability: examine outputs.

Controlling and Observing Programs
  firout = 0.0;
  for (j=curr, k=0; j<N; j++, k++)
    firout += buff[j] * c[k];
  for (j=0; j<curr; j++, k++)
    firout += buff[j] * c[k];
  if (firout > 100.0) firout = 100.0;
  if (firout < -100.0) firout = -100.0;
Controllability:
  Must fill circular buffer with desired N values.
  Other code governs how we access the buffer.
Observability:
  Want to examine firout before limit testing.
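One way to address both concerns is to wrap the FIR fragment in a function whose inputs can be set directly (controllability) and which can report the sum before limiting (observability). This is a sketch under assumptions: the function name fir_step, the raw_out parameter, and the filter length N = 4 are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define N 4   /* hypothetical filter length, for illustration only */

/* Returns the limited output. If raw_out is non-NULL, the sum before
   limit testing is stored there so a test can observe it. */
double fir_step(const double buff[N], const double c[N], size_t curr,
                double *raw_out)
{
    assert(curr < N);
    double firout = 0.0;
    size_t j, k;
    for (j = curr, k = 0; j < N; j++, k++)   /* from curr to end of buffer */
        firout += buff[j] * c[k];
    for (j = 0; j < curr; j++, k++)          /* wrap around to the start */
        firout += buff[j] * c[k];
    if (raw_out)
        *raw_out = firout;                   /* value before limit testing */
    if (firout > 100.0) firout = 100.0;
    if (firout < -100.0) firout = -100.0;
    return firout;
}
```

A test can then fill buff with chosen values, pick curr, and check both the raw sum and the clamped result in one call.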
Basis Paths
Approximate CDFG with undirected graph.
Undirected graphs have basis paths:
  All paths are linear combinations of basis paths.

Cyclomatic Complexity
Cyclomatic complexity is a bound on the size of basis sets:
  e = # edges
  n = # nodes
  p = number of graph components
  M = e - n + 2p.
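The formula M = e - n + 2p is simple enough to check by hand, and the sketch below computes it for a graph described only by its counts. The function name is hypothetical.

```c
#include <assert.h>

/* Cyclomatic complexity M = e - n + 2p for a control flow graph with
   e edges, n nodes, and p connected components. */
int cyclomatic_complexity(int edges, int nodes, int components)
{
    assert(edges >= 0 && nodes > 0 && components > 0);
    return edges - nodes + 2 * components;
}
```

For a single if-then-else (4 nodes: test, then, else, join; 4 edges; 1 component) this gives M = 4 - 4 + 2 = 2, matching the two independent paths through the branch.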
Loop Testing