Program design and analysis

Program-level performance analysis.
Optimizing for:
  Execution time.
  Energy/power.
  Program size.
Program validation and testing.

Program-level performance analysis

Need to understand performance in detail:
  Real-time behavior, not just typical.
  On complex platforms.
Program performance is not the same as CPU performance:
  Pipeline, cache are windows into program.
  We must analyze the entire program.

Complexities of program performance

Varies with input data:
  Different-length paths.
Cache effects.
Instruction-level performance variations:
  Pipeline interlocks.
  Fetch times.

How to measure program performance

Simulate execution of the CPU.
  Makes CPU state visible.
Measure on real CPU using timer.
  Requires modifying the program to control the timer.
Measure on real CPU using logic analyzer.
  Requires events visible on the pins.
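A minimal sketch of timer-based measurement, assuming a hosted C environment where the standard clock() function stands in for a hardware timer. The names work and time_work are hypothetical; on an embedded target the two clock() calls would be replaced by reads of the platform's timer register.

```c
#include <time.h>

/* Hypothetical function under test: sums the integers 0..n-1. */
static long work(long n)
{
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += i;
    return sum;
}

/* Time one call of work() with clock(). Returns elapsed processor
   time in seconds; the result of work() is stored through *result
   so the measured call is not optimized away. */
static double time_work(long n, long *result)
{
    clock_t start = clock();   /* start the timer */
    *result = work(n);         /* code being measured */
    clock_t stop = clock();    /* stop the timer */
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

The program had to be modified (two timer calls added) to take the measurement, which is the cost named on the slide.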


Program performance metrics

Average-case execution time.
  Typically used in application programming.
Worst-case execution time.
  A component in deadline satisfaction.
Best-case execution time.
  Task-level interactions can cause best-case program behavior to result in worst-case system behavior.

Elements of program performance

Basic program execution time formula:
  execution time = program path + instruction timing
Solving these problems independently helps simplify analysis.
Easier to separate on simpler CPUs.
Accurate performance analysis requires:
  Assembly/binary code.
  Execution platform.

Data-dependent paths in an if statement

if (a || b) { /* T1 */
    if (c) /* T2 */
        x = r*s+t; /* A1 */
    else
        y = r+s; /* A2 */
    z = r+s+u; /* A3 */
}
else {
    if (c) /* T3 */
        y = r-t; /* A4 */
}

a b c path
0 0 0 T1=F, T3=F: no assignments
0 0 1 T1=F, T3=T: A4
0 1 0 T1=T, T2=F: A2, A3
0 1 1 T1=T, T2=T: A1, A3
1 0 0 T1=T, T2=F: A2, A3
1 0 1 T1=T, T2=T: A1, A3
1 1 0 T1=T, T2=F: A2, A3
1 1 1 T1=T, T2=T: A1, A3

Paths in a loop

for (i=0, f=0; i<N; i++)
    f = f + c[i] * x[i];

Flowchart: i=0, f=0; test i<N (N: exit, Y: continue); f = f + c[i] * x[i]; i=i+1; back to test.
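The path table for the if statement can be checked mechanically. The sketch below is illustrative only: it replaces the assignments A1..A4 with bit flags (the names assignments_on_path and count_distinct_paths are hypothetical) and records which assignments execute for each input combination.

```c
/* Bit flags standing in for the four assignments A1..A4. */
enum { A1 = 1, A2 = 2, A3 = 4, A4 = 8 };

/* Return the set of assignments executed for inputs a, b, c. */
static int assignments_on_path(int a, int b, int c)
{
    int executed = 0;
    if (a || b) {            /* T1 */
        if (c)               /* T2 */
            executed |= A1;
        else
            executed |= A2;
        executed |= A3;
    } else {
        if (c)               /* T3 */
            executed |= A4;
    }
    return executed;
}

/* Count the distinct paths exercised by all 8 input combinations. */
static int count_distinct_paths(void)
{
    int seen[16] = {0};
    int distinct = 0;
    for (int v = 0; v < 8; v++) {
        int m = assignments_on_path((v >> 2) & 1, (v >> 1) & 1, v & 1);
        if (!seen[m]) {
            seen[m] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

Eight input combinations collapse onto four distinct paths, matching the table.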


Instruction Timing

Not all instructions take the same amount of time.
  Multi-cycle instructions.
  Fetches.
Execution times of instructions are not independent.
  Pipeline interlocks.
  Cache effects.
Execution times may vary with operand value.
  Floating-point operations.
  Some multi-cycle integer operations.

Measurement-driven Performance Analysis

Not so easy as it sounds:
  Must actually have access to the CPU.
  Must know data inputs that give worst/best case performance.
  Must make state visible.
Still an important method for performance analysis.

Trace-driven Measurement

Trace-driven:
  Instrument the program.
  Save information about the path.
Requires modifying the program.
Trace files are large.
Widely used for cache analysis.

Physical Measurement

In-circuit emulator allows tracing.
  Affects execution timing.
Logic analyzer can measure behavior at pins.
  Address bus can be analyzed to look for events.
  Code can be modified to make events visible.
Particularly important for real-world input.
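A rough sketch of trace-driven instrumentation, assuming the trace is kept in a small in-memory buffer rather than a file. The names trace, trace_buf, TRACE_CAPACITY and the block numbering are hypothetical; the instrumented function is the dot-product loop used earlier in these slides.

```c
#include <stddef.h>

#define TRACE_CAPACITY 64            /* hypothetical buffer size */

static int trace_buf[TRACE_CAPACITY];
static size_t trace_len = 0;

/* Record that the basic block with the given id was executed. */
static void trace(int block_id)
{
    if (trace_len < TRACE_CAPACITY)
        trace_buf[trace_len++] = block_id;
}

/* Instrumented loop: block 1 = loop setup, 2 = loop body, 3 = exit. */
static int dot(const int *c, const int *x, int n)
{
    int f = 0;
    trace(1);
    for (int i = 0; i < n; i++) {
        trace(2);
        f += c[i] * x[i];
    }
    trace(3);
    return f;
}
```

Even this tiny loop emits one trace entry per iteration, which hints at why trace files grow large.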


Performance Optimization Motivation

Embedded systems must often meet deadlines.
  Faster may not be fast enough.
Need to be able to analyze execution time.
  Worst-case, not typical.
Need techniques for reliably improving execution time.

Programs and Performance Analysis

Best results come from analyzing optimized instructions, not high-level language code:
  Non-obvious translations of HLL statements into instructions;
  Code may move;
  Cache effects are hard to predict.

Loop Optimizations

Loops are good targets for optimization.
Basic loop optimizations:
  Code motion;
  Induction-variable elimination;
  Strength reduction (x*2 -> x<<1).

Code Motion

for (i=0; i<N*M; i++)
    z[i] = a[i] + b[i];

Flowchart after code motion: i=0; X = N*M; test i<X (N: exit, Y: continue); z[i] = a[i] + b[i]; i = i+1; back to test.
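Code motion on the loop above can be sketched as follows. The function names are hypothetical, and a modern compiler may well perform this hoisting on its own; the point is only to show the before and after forms.

```c
/* Before: N*M appears in the loop test and may be recomputed on
   every iteration if the compiler does not hoist it. */
static void add_arrays_naive(int *z, const int *a, const int *b, int N, int M)
{
    for (int i = 0; i < N * M; i++)
        z[i] = a[i] + b[i];
}

/* After code motion: the loop bound is computed once, before the loop. */
static void add_arrays_hoisted(int *z, const int *a, const int *b, int N, int M)
{
    int X = N * M;
    for (int i = 0; i < X; i++)
        z[i] = a[i] + b[i];
}
```

Both versions produce the same result; only the amount of work per iteration differs.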


Induction Variable Elimination

Induction variable: loop index.
Consider loop:
  for (i=0; i<N; i++)
      for (j=0; j<M; j++)
          z[i][j] = b[i][j];
Rather than recompute i*M+j for each array in each iteration, share induction variable between arrays, increment at end of loop.

Cache Analysis

Loop nest: set of loops, one inside other.
Perfect loop nest: no conditionals in nest.
Because loops use large quantities of data, cache conflicts are common.
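A sketch of induction-variable elimination for the loop nest above, assuming both arrays are stored in row-major order as flat buffers of N*M elements. The function names and the shared index k are hypothetical.

```c
/* Before: the flattened index i*M + j is recomputed for both
   arrays in every iteration. */
static void copy_recompute(int *z, const int *b, int N, int M)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            z[i * M + j] = b[i * M + j];
}

/* After: one induction variable k is shared by both arrays and
   incremented at the end of the inner loop body. */
static void copy_shared(int *z, const int *b, int N, int M)
{
    int k = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++) {
            z[k] = b[k];
            k++;
        }
}
```

The multiply in the index expression disappears from the inner loop, leaving a single increment.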

Array Conflicts in Cache

Figure: main memory holds a[0][0] at address 1024 and b[0][0] at address 4099; the cache panel shows 1024 and 4099 together.

Array conflicts, contd.

Array elements conflict because they are in the same line, even if not mapped to same location.
  move one array;
  pad array.
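The conflict between addresses 1024 and 4099 can be reproduced with a small calculation. The cache geometry below (direct-mapped, 4-byte lines, 256 lines) is an assumption chosen so that the two addresses land in the same line; the slides do not state the actual parameters.

```c
#define LINE_BYTES 4u     /* assumed cache line size in bytes */
#define NUM_LINES  256u   /* assumed number of lines, direct-mapped */

/* Index of the cache line that a byte address maps to. */
static unsigned cache_line(unsigned addr)
{
    return (addr / LINE_BYTES) % NUM_LINES;
}

/* Two addresses conflict when they map to the same cache line
   but come from different memory blocks. */
static int conflicts(unsigned addr1, unsigned addr2)
{
    return cache_line(addr1) == cache_line(addr2)
        && (addr1 / LINE_BYTES) != (addr2 / LINE_BYTES);
}
```

Under these assumptions 1024 and 4099 both map to line 0 at different offsets within the line. Padding or moving the second array by one line (4099 + 4) shifts it to line 1 and removes the conflict.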


Performance Optimization

Use registers efficiently.
Use page mode memory accesses.
Analyze cache behavior:
  Instruction conflicts can be handled by rewriting code, rescheduling;
  Conflicting scalar data can easily be moved;
  Conflicting array data can be moved, padded.

Energy/power Optimization

Energy: ability to do work.
  Most important in battery-powered systems.
Power: energy per unit time.
  Important even in wall-plug systems: power becomes heat.

Measuring Energy Consumption

Execute a small loop, measure current:
  while (TRUE)
      a();

Sources of Energy Consumption

Relative energy per operation (Catthoor et al.):
  Memory transfer: 33
  External I/O: 10
  SRAM write: 9
  SRAM read: 4.4
  Multiply: 3.6
  Add: 1
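The relative energies listed above can be used for a rough first-order estimate: multiply each operation count by its relative cost and sum. This is only a sketch; the enum names, the estimate_energy function, and the idea of counting operations by hand are assumptions, and the result is in units of "one add", not joules.

```c
/* Operation classes, in the order of the list above. */
enum op {
    OP_MEM_TRANSFER, OP_EXT_IO, OP_SRAM_WRITE,
    OP_SRAM_READ, OP_MULTIPLY, OP_ADD, OP_COUNT
};

/* Relative energy per operation (Add = 1), values from the slide. */
static const double relative_energy[OP_COUNT] = {
    33.0, 10.0, 9.0, 4.4, 3.6, 1.0
};

/* Weighted sum of operation counts, in units of one add. */
static double estimate_energy(const unsigned long counts[OP_COUNT])
{
    double total = 0.0;
    for (int i = 0; i < OP_COUNT; i++)
        total += relative_energy[i] * (double)counts[i];
    return total;
}
```

With these weights a single memory transfer costs as much as 33 adds, which is why memory traffic dominates the estimate.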


Cache Behavior is Important

Energy consumption has a sweet spot as cache size changes:
  Cache too small:
    Program thrashes, burning energy on external memory accesses;
  Cache too large:
    Cache itself burns too much power.

Cache Sweet Spot

Figure: [Li98] 1998 IEEE

Optimizing for Energy

First-order optimization:
  High performance = low energy.
Not many instructions trade speed for energy.

Optimizing for Energy, contd.

Use registers efficiently.
Identify and eliminate cache conflicts.
Moderate loop unrolling eliminates some loop overhead instructions.
Eliminate pipeline stalls.
Inlining procedures may help: reduces linkage, but may increase cache thrashing.
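Moderate loop unrolling might look like the sketch below, which unrolls a summation loop by a factor of 4. The function name and the unroll factor are illustrative choices, not taken from the slides.

```c
/* Sum n elements with the main loop unrolled by 4. */
static long sum_unrolled4(const int *a, int n)
{
    long s = 0;
    int i = 0;
    for (; i + 3 < n; i += 4)      /* one test and increment per 4 elements */
        s += (long)a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)             /* leftover 0..3 elements */
        s += a[i];
    return s;
}
```

The loop test and counter update now run once per four elements instead of once per element; unrolling much further grows the code and can hurt instruction-cache behavior.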


Efficient Loops

General rules:
  Don't use function calls.
  Keep loop body small to enable local repeat (only forward branches).
  Use unsigned integer for loop counter.
  Use <= to test loop counter.
  Make use of compiler: global optimization, software pipelining.

Single-instruction Repeat Loop Example

STM #4000h,AR2 ; load pointer to source
STM #100h,AR3 ; load pointer to destination
RPT #(1024-1)
MVDD *AR2+,*AR3+ ; move

Optimizing for Program Size

Goal:
  Reduce hardware cost of memory;
  Reduce power consumption of memory units.
Two opportunities:

Data Size Minimization

Reuse constants, variables, data buffers in different parts of code.
  Requires careful verification of correctness.
Generate data using instructions.

Reducing Code Size

Avoid function inlining.
Choose CPU with compact instructions.
Use specialized instructions where possible.

Program Validation and Testing

But does it work?
Concentrate here on functional verification.
Major testing strategies:
  Black box doesn't look at the source code.
  Clear box (white box) does look at the source code.

Clear-box Testing

Examine the source code to determine whether it works:
  Can you actually exercise a path?
  Do you get the value you expect along a path?
Testing procedure:
  Controllability: provide program with inputs.
  Execute.
  Observability: examine outputs.

Controlling and Observing Programs

firout = 0.0;
for (j=curr, k=0; j<N; j++, k++)
    firout += buff[j] * c[k];
for (j=0; j<curr; j++, k++)
    firout += buff[j] * c[k];
if (firout > 100.0) firout = 100.0;
if (firout < -100.0) firout = -100.0;

Controllability:
  Must fill circular buffer with desired N values.
  Other code governs how we access the buffer.
Observability:
  Want to examine firout before limit testing.
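One way to make firout observable before the limit checks is to expose the unclamped value through an extra output parameter. This is a sketch under assumptions: the filter length N = 4, the function name fir, and the raw parameter are hypothetical additions, not part of the original code.

```c
#include <stddef.h>

#define N 4   /* hypothetical filter length */

/* Circular-buffer FIR filter. The unclamped sum is written to *raw
   (if raw is non-NULL) so a test can observe it before limiting. */
static double fir(const double buff[N], const double c[N], int curr, double *raw)
{
    double firout = 0.0;
    int j, k;
    for (j = curr, k = 0; j < N; j++, k++)
        firout += buff[j] * c[k];
    for (j = 0; j < curr; j++, k++)
        firout += buff[j] * c[k];
    if (raw != NULL)
        *raw = firout;                        /* observability point */
    if (firout > 100.0) firout = 100.0;
    if (firout < -100.0) firout = -100.0;
    return firout;
}
```

A test controls the filter by filling buff and choosing curr directly, then compares both the raw and the limited values.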


Execution Paths and Testing

Paths are important in functional testing as well as performance analysis.
In general, an exponential number of paths through the program.
  Show that some paths dominate others.
  Heuristically limit paths.

Choosing the Paths to Test

Possible criteria:
  Execute every statement at least once.
  Execute every branch direction at least once.
Equivalent for structured programs.
  Not true for gotos.

Basis Paths

Approximate CDFG with undirected graph.
Undirected graphs have basis paths:
  All paths are linear combinations of basis paths.

Cyclomatic Complexity

Cyclomatic complexity is a bound on the size of basis sets:
  e = # edges
  n = # nodes
  p = number of graph components
  M = e - n + 2p.
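As a worked instance of the formula, the helper below (a hypothetical name) computes M from the graph counts. For a single if-then-else, assume a graph with 4 nodes (test, then, else, join) and 4 edges in one component.

```c
/* Cyclomatic complexity M = e - n + 2p. */
static int cyclomatic_complexity(int edges, int nodes, int components)
{
    return edges - nodes + 2 * components;
}
```

For the if-then-else graph, M = 4 - 4 + 2*1 = 2, matching its two basis paths (the then path and the else path).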


Branch Testing

Heuristic for testing branches.
  Exercise true and false branches of conditional.
  Exercise every simple condition at least once.

Branch Testing Example

Correct:
  if (a || (b >= c)) { printf("OK\n"); }
Incorrect:
  if (a && (b >= c)) { printf("OK\n"); }
Test:
  a = F
  (b >= c) = T
Example:
  Correct: [0 || (3 >= 2)] = T
  Incorrect: [0 && (3 >= 2)] = F
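The branch-testing example can be run directly. In this sketch the two conditions are wrapped in hypothetical helper functions so that a test vector can be checked for whether it tells the correct and incorrect versions apart.

```c
/* Condition as intended. */
static int correct_cond(int a, int b, int c)
{
    return a || (b >= c);
}

/* Condition with the bug: && written instead of ||. */
static int incorrect_cond(int a, int b, int c)
{
    return a && (b >= c);
}

/* Nonzero if the test vector (a, b, c) exposes the bug. */
static int test_distinguishes(int a, int b, int c)
{
    return correct_cond(a, b, c) != incorrect_cond(a, b, c);
}
```

The vector a=0, b=3, c=2 exposes the bug, while a=1, b=3, c=2 does not, because both versions evaluate to true there.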

Another Branch Testing Example

Correct:
  if ((x == good_pointer) && (x->field1 == 3)) { printf("got the value\n"); }
Incorrect:
  if ((x = good_pointer) && (x->field1 == 3)) { printf("got the value\n"); }
Incorrect code changes pointer.
  Assignment returns new LHS in C.
Test that catches error:
  (x != good_pointer) && (x->field1 == 3)

Domain Testing

Heuristic test for linear inequalities.
Test on each side + boundary of inequality.
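Domain testing of a linear inequality can be sketched with three test points: one on each side of the boundary and one on it. The inequality x <= 10, the off-by-one bug, and all names below are hypothetical examples.

```c
#define LIMIT 10   /* hypothetical boundary of the inequality */

/* Intended condition. */
static int accept_correct(int x) { return x <= LIMIT; }

/* Buggy condition: < written instead of <=. */
static int accept_buggy(int x) { return x < LIMIT; }

/* Fill pts with one point below, on, and above the boundary. */
static void domain_points(int boundary, int pts[3])
{
    pts[0] = boundary - 1;
    pts[1] = boundary;
    pts[2] = boundary + 1;
}

/* Count the domain test points on which the two versions disagree. */
static int mismatches(void)
{
    int pts[3];
    int count = 0;
    domain_points(LIMIT, pts);
    for (int i = 0; i < 3; i++)
        if (accept_correct(pts[i]) != accept_buggy(pts[i]))
            count++;
    return count;
}
```

Only the point exactly on the boundary separates the two versions, which is why the boundary itself belongs in the test set.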


Def-use Pairs

Variable def-use:
  Def when value is assigned (defined).
  Use when used on right-hand side.
Exercise each def-use pair.
  Requires testing correct path.

Loop Testing

Loops need specialized tests.
Heuristic testing strategy:
  Skip loop entirely.
  One loop iteration.
  Two loop iterations.
  # iterations much below max.
  n-1, n, n+1 iterations where n is max.
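The loop-testing heuristic can be turned into a list of iteration counts. In this sketch the loop under test, the bound MAX_ITEMS = 16, and the choice of n/4 for "much below max" are all assumptions made for illustration.

```c
#define MAX_ITEMS 16   /* hypothetical maximum iteration count n */

/* Loop under test: sums count values; rejects out-of-range counts. */
static int bounded_sum(const int *v, int count)
{
    if (count < 0 || count > MAX_ITEMS)
        return -1;
    int s = 0;
    for (int i = 0; i < count; i++)
        s += v[i];
    return s;
}

/* Heuristic iteration counts: skip, one, two, much below max,
   then n-1, n, n+1. */
static void loop_test_counts(int n, int counts[7])
{
    counts[0] = 0;
    counts[1] = 1;
    counts[2] = 2;
    counts[3] = n / 4;   /* assumed reading of "much below max" */
    counts[4] = n - 1;
    counts[5] = n;
    counts[6] = n + 1;
}
```

The n+1 case checks that the code rejects a count beyond the maximum rather than running past the end of the data.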

Black-box Testing

Complements clear-box testing.
  May require a large number of tests.
Tests software in different ways.

Black-box Test Vectors

Random tests.
  May weight distribution based on software specification.
Regression tests.
  Tests of previous versions, bugs, etc.
  May be clear-box tests of previous versions.


How much testing is enough?

Exhaustive testing is impractical.
One important measure of test quality: bugs escaping into field.
Good organizations can test software to give very low field bug report rates.
Error injection measures test quality:
  Add known bugs.
  Run your tests.
  Determine % injected bugs that are caught.
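The error-injection steps above can be sketched in miniature. Everything here is hypothetical: a limiter function with three selectable injected bugs, and a deliberately weak test suite that never checks the lower limit.

```c
#define NUM_BUGS 3

/* Limit x to [-100, 100]. bug == 0 is the correct code;
   bug 1..3 select one injected known bug. */
static int limit(int x, int bug)
{
    int hi = (bug == 1) ? 99 : 100;      /* bug 1: wrong upper limit */
    int lo = (bug == 2) ? 100 : -100;    /* bug 2: sign error in lower limit */
    if (x > hi)
        return hi;
    if (bug != 3 && x < lo)              /* bug 3: lower limit check missing */
        return lo;
    return x;
}

/* A weak test suite: returns 1 if every check passes for the variant. */
static int suite_passes(int bug)
{
    return limit(0, bug) == 0 && limit(150, bug) == 100;
}

/* Percentage of injected bugs that the suite catches. */
static int percent_caught(void)
{
    int caught = 0;
    for (int bug = 1; bug <= NUM_BUGS; bug++)
        if (!suite_passes(bug))
            caught++;
    return 100 * caught / NUM_BUGS;
}
```

The suite catches bugs 1 and 2 but misses bug 3, so it scores 66%; adding a check such as limit(-150) == -100 would catch the third.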


