Program Design and Analysis Program-Level Performance Analysis

3/11/2015
Program design and analysis
Program-level performance analysis.
Optimizing for:
  Execution time.
  Energy/power.
  Program size.
Program validation and testing.

Program-level performance analysis
Need to understand performance in detail:
  Real-time behavior, not just typical.
  On complex platforms.
Program performance is not CPU performance:
  Pipeline, cache are windows into program.
  We must analyze the entire program.

Complexities of program performance
Varies with input data:
  Different-length paths.
Cache effects.
Instruction-level performance variations:
  Pipeline interlocks.
  Fetch times.

How to measure program performance
Simulate execution of the CPU.
  Makes CPU state visible.
Measure on real CPU using timer.
  Requires modifying the program to control the timer.
Measure on real CPU using logic analyzer.
  Requires events visible on the pins.
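
The timer-based measurement listed above can be sketched in C as follows. This is only an illustration under stated assumptions: it uses the standard library's clock() as a stand-in for a hardware timer, and workload() and time_workload() are hypothetical names that do not appear in the slides. It also shows why the program has to be modified: the timer reads are added around the code of interest.

```c
#include <time.h>

/* Hypothetical workload: sums the integers 0 .. n-1. */
static unsigned long workload(unsigned long n) {
    volatile unsigned long sum = 0;   /* volatile keeps the loop from being optimized away */
    for (unsigned long i = 0; i < n; i++)
        sum += i;
    return sum;
}

/* Runs workload(n) once, stores its result, and returns the
   processor time it consumed, in seconds. */
double time_workload(unsigned long n, unsigned long *result) {
    clock_t start = clock();          /* read the timer before the code of interest */
    *result = workload(n);
    clock_t stop = clock();           /* read the timer after it */
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

A single run gives one data point for one input; the input-dependent variation described above means several inputs usually have to be timed.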

Program performance metrics
Average-case execution time.
  Typically used in application programming.
Worst-case execution time.
  A component in deadline satisfaction.
Best-case execution time.
  Task-level interactions can cause best-case program behavior to result in worst-case system behavior.

Elements of program performance
Basic program execution time formula:
  execution time = program path + instruction timing
Solving these problems independently helps simplify analysis.
  Easier to separate on simpler CPUs.
Accurate performance analysis requires:
  Assembly/binary code.
  Execution platform.

Data-dependent paths in an if statement

    if (a || b) {          /* T1 */
        if ( c )           /* T2 */
            x = r*s+t;     /* A1 */
        else y=r+s;        /* A2 */
        z = r+s+u;         /* A3 */
    }
    else {
        if ( c )           /* T3 */
            y = r-t;       /* A4 */
    }

    a b c   path
    0 0 0   T1=F, T3=F: no assignments
    0 0 1   T1=F, T3=T: A4
    0 1 0   T1=T, T2=F: A2, A3
    0 1 1   T1=T, T2=T: A1, A3
    1 0 0   T1=T, T2=F: A2, A3
    1 0 1   T1=T, T2=T: A1, A3
    1 1 0   T1=T, T2=F: A2, A3
    1 1 1   T1=T, T2=T: A1, A3

Paths in a loop

    for (i=0, f=0; i<N; i++)
        f = f + c[i] * x[i];

[Flowchart: i=0; f=0; test i<N; if Y, f = f + c[i] * x[i]; i=i+1 and repeat the test; if N, exit.]
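
The path table for the if statement above can be checked mechanically. The sketch below wraps the slide's code in a hypothetical function, trace_path(), that records which assignments (A1 to A4) execute for given a, b, c. The operand values r, s, t, u are arbitrary assumptions chosen only so the code runs.

```c
#include <string.h>

/* Records which assignments execute for inputs a, b, c.
   out must have room for at least 16 characters. */
void trace_path(int a, int b, int c, char *out) {
    int r = 1, s = 2, t = 3, u = 4;    /* arbitrary operand values */
    int x = 0, y = 0, z = 0;
    out[0] = '\0';
    if (a || b) {                                       /* T1 */
        if (c) { x = r * s + t; strcat(out, "A1 "); }   /* T2 true  */
        else   { y = r + s;     strcat(out, "A2 "); }   /* T2 false */
        z = r + s + u;          strcat(out, "A3 ");
    } else {
        if (c) { y = r - t;     strcat(out, "A4 "); }   /* T3 */
    }
    (void)x; (void)y; (void)z;         /* results are not needed here */
}
```

Calling trace_path() for all eight combinations of a, b, c reproduces the rows of the table; only four distinct paths occur.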

Instruction Timing
Not all instructions take the same amount of time.
  Multi-cycle instructions.
  Fetches.
Execution times of instructions are not independent.
  Pipeline interlocks.
  Cache effects.
Execution times may vary with operand value.
  Floating-point operations.
  Some multi-cycle integer operations.

Measurement-driven Performance Analysis
Not so easy as it sounds:
  Must actually have access to the CPU.
  Must know data inputs that give worst/best case performance.
  Must make state visible.
Still an important method for performance analysis.

Trace-driven Measurement
Trace-driven:
  Instrument the program.
  Save information about the path.
Requires modifying the program.
Trace files are large.
Widely used for cache analysis.

Physical Measurement
In-circuit emulator allows tracing.
  Affects execution timing.
Logic analyzer can measure behavior at pins.
  Address bus can be analyzed to look for events.
  Code can be modified to make events visible.
  Particularly important for real-world input streams.
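
A minimal sketch of the trace-driven approach described above: the program is instrumented so that each branch outcome is saved to a trace. All names here (trace_event, count_positive, TRACE_CAPACITY) are hypothetical, and a real trace would normally go to a file rather than a small in-memory buffer.

```c
#include <stddef.h>

#define TRACE_CAPACITY 64              /* assumed buffer size */
static int trace_buf[TRACE_CAPACITY];
static size_t trace_len = 0;

/* Instrumentation hook: saves one path event, dropping it if the buffer is full. */
static void trace_event(int id) {
    if (trace_len < TRACE_CAPACITY)
        trace_buf[trace_len++] = id;
}

/* Instrumented code: counts positive values and logs which way each branch went. */
int count_positive(const int *v, size_t n) {
    int count = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] > 0) { trace_event(1); count++; }
        else          { trace_event(0); }
    }
    return count;
}

/* Accessors for reading the saved trace afterwards. */
size_t trace_length(void) { return trace_len; }
int trace_at(size_t i) { return trace_buf[i]; }
```

The added calls change the program's timing, which is one reason the trace is used for path and cache analysis instead of direct timing.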

Performance Optimization Motivation
Embedded systems must often meet deadlines.
  Faster may not be fast enough.
Need to be able to analyze execution time.
  Worst-case, not typical.
Need techniques for reliably improving execution time.

Programs and Performance Analysis
Best results come from analyzing optimized instructions, not high-level language code:
  Non-obvious translations of HLL statements into instructions;
  Code may move;
  Cache effects are hard to predict.

Loop Optimizations
Loops are good targets for optimization.
Basic loop optimizations:
  Code motion;
  Induction-variable elimination;
  Strength reduction (x*2 -> x<<1).

Code Motion

    for (i=0; i<N*M; i++)
        z[i] = a[i] + b[i];

[Flowchart: i=0; X = N*M; the loop test i<N*M becomes i<X; if Y, z[i] = a[i] + b[i]; i = i+1; if N, exit.]
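
The code motion shown above can be written out as a before/after pair. The function names are hypothetical, and an optimizing compiler may well perform this transformation on its own; the sketch only makes the change explicit.

```c
/* Before: the bound N*M appears in the loop test and may be
   re-evaluated on every iteration. */
void add_naive(int *z, const int *a, const int *b, int N, int M) {
    for (int i = 0; i < N * M; i++)
        z[i] = a[i] + b[i];
}

/* After code motion: the loop-invariant product is computed once,
   before the loop, and the test compares against X. */
void add_hoisted(int *z, const int *a, const int *b, int N, int M) {
    int X = N * M;
    for (int i = 0; i < X; i++)
        z[i] = a[i] + b[i];
}
```

Both versions produce the same results; only the number of multiplications executed differs.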

Induction Variable Elimination
Induction variable: loop index.
Consider loop:

    for (i=0; i<N; i++)
        for (j=0; j<M; j++)
            z[i][j] = b[i][j];

Rather than recompute i*M+j for each array in each iteration, share induction variable between arrays, increment at end of loop body.

Cache Analysis
Loop nest: set of loops, one inside other.
Perfect loop nest: no conditionals in nest.
Because loops use large quantities of data, cache conflicts are common.
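
The induction-variable elimination described above might look like this in C, assuming both arrays are stored row-major in flat buffers of N*M elements. The function names and the shared index name k are hypothetical.

```c
/* Before: the offset i*M + j is recomputed for each array access. */
void copy_indexed(int *z, const int *b, int N, int M) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            z[i * M + j] = b[i * M + j];
}

/* After: one induction variable k is shared by both arrays and
   incremented at the end of the loop body, so no multiply is needed. */
void copy_shared(int *z, const int *b, int N, int M) {
    int k = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++) {
            z[k] = b[k];
            k++;
        }
}
```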

Array Conflicts in Cache
[Figure: main memory holding a[0][0] at address 1024 and b[0][0] at address 4099; cache holding addresses 1024 and 4099.]

Array conflicts, contd.
Array elements conflict because they are in the same line, even if not mapped to same location.
Solutions:
  move one array;
  pad array.
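
The conflict between addresses 1024 and 4099 can be reproduced with a small model of a direct-mapped cache. The cache geometry below (4 words per line, 256 lines) is an assumption chosen so that the two addresses from the figure collide; the slides do not state the actual parameters.

```c
#define LINE_WORDS 4u      /* assumed words per cache line */
#define NUM_LINES  256u    /* assumed number of lines, direct-mapped */

/* Cache line that a word address maps to. */
unsigned line_index(unsigned addr) {
    return (addr / LINE_WORDS) % NUM_LINES;
}

/* Two addresses conflict when they lie in different memory blocks
   that map to the same cache line. */
int conflicts(unsigned addr_a, unsigned addr_b) {
    return addr_a / LINE_WORDS != addr_b / LINE_WORDS &&
           line_index(addr_a) == line_index(addr_b);
}
```

Under these assumptions, 1024 and 4099 both map to line 0 even though they sit at different offsets within the line. Padding so that b starts 4 words later (address 4103) moves it to line 1 and removes the conflict.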

Performance Optimization Hints
Use registers efficiently.
Use page mode memory accesses.
Analyze cache behavior:
  Instruction conflicts can be handled by rewriting code, rescheduling;
  Conflicting scalar data can easily be moved;
  Conflicting array data can be moved, padded.

Energy/power Optimization
Energy: ability to do work.
  Most important in battery-powered systems.
Power: energy per unit time.
  Important even in wall-plug systems: power becomes heat.

Measuring Energy Consumption
Execute a small loop, measure current:

    while (TRUE)
        a();

[Figure: current I measured while the loop runs.]

Sources of Energy Consumption
Relative energy per operation (Catthoor et al):
  Memory transfer: 33
  External I/O: 10
  SRAM write: 9
  SRAM read: 4.4
  Multiply: 3.6
  Add: 1
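
As a rough worked example, the relative energy figures above can be combined with operation counts to compare code variants. The function below is a hypothetical helper and the estimate is only first-order: it ignores everything except the six operation classes in the table.

```c
/* Relative energy of a code fragment, in units of one add,
   using the per-operation weights listed above. */
double relative_energy(long mem_transfers, long external_io,
                       long sram_writes, long sram_reads,
                       long multiplies, long adds) {
    return 33.0 * mem_transfers
         + 10.0 * external_io
         +  9.0 * sram_writes
         +  4.4 * sram_reads
         +  3.6 * multiplies
         +  1.0 * adds;
}
```

For instance, one SRAM read followed by a multiply and an add costs about 9 add-equivalents, while a single memory transfer alone costs 33.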

Cache Behavior is Important
Energy consumption has a sweet spot as cache size changes:
  Cache too small:
    Program thrashes, burning energy on external memory accesses;
  Cache too large:
    Cache itself burns too much power.

Cache Sweet Spot
[Figure omitted; credit: [Li98], 1998 IEEE.]

Optimizing for Energy
First-order optimization:
  High performance = low energy.
Not many instructions trade speed for energy.

Optimizing for Energy, contd.
Use registers efficiently.
Identify and eliminate cache conflicts.
Moderate loop unrolling eliminates some loop overhead instructions.
Eliminate pipeline stalls.
Inlining procedures may help: reduces linkage, but may increase cache thrashing.
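
The moderate loop unrolling mentioned above can be illustrated with a summation loop unrolled by four. The unroll factor and the function names are assumptions made for the example; the right factor depends on the CPU and cache.

```c
/* Rolled loop: one compare, branch, and increment per element. */
int sum_rolled(const int *v, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += v[i];
    return sum;
}

/* Unrolled by 4: the loop overhead is paid once per four elements.
   The second loop handles the leftover elements when n is not a multiple of 4. */
int sum_unrolled4(const int *v, int n) {
    int sum = 0;
    int i = 0;
    for (; i + 3 < n; i += 4)
        sum += v[i] + v[i + 1] + v[i + 2] + v[i + 3];
    for (; i < n; i++)
        sum += v[i];
    return sum;
}
```

Larger unroll factors grow the code, which can work against the cache behavior discussed earlier.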

Efficient Loops
General rules:
  Don't use function calls.
  Keep loop body small to enable local repeat (only forward branches).
  Use unsigned integer for loop counter.
  Use <= to test loop counter.
  Make use of compiler: global optimization, software pipelining.

Single-instruction Repeat Loop Example

    STM #4000h,AR2      ; load pointer to source
    STM #100h,AR3       ; load pointer to destination
    RPT #(1024-1)
    MVDD *AR2+,*AR3+    ; move

Optimizing for Program Size
Goal:
  Reduce hardware cost of memory;
  Reduce power consumption of memory units.
Two opportunities:
  Data;
  Instructions.

Data Size Minimization
Reuse constants, variables, data buffers in different parts of code.
  Requires careful verification of correctness.
Generate data using instructions.

Reducing Code Size
Avoid function inlining.
Choose CPU with compact instructions.
Use specialized instructions where possible.

Program Validation and Testing
But does it work?
Concentrate here on functional verification.
Major testing strategies:
  Black box doesn't look at the source code.
  Clear box (white box) does look at the source code.

Clear-box Testing
Examine the source code to determine whether it works:
  Can you actually exercise a path?
  Do you get the value you expect along a path?
Testing procedure:
  Controllability: provide program with inputs.
  Execute.
  Observability: examine outputs.

Controlling and Observing Programs

    firout = 0.0;
    for (j=curr, k=0; j<N; j++, k++)
        firout += buff[j] * c[k];
    for (j=0; j<curr; j++, k++)
        firout += buff[j] * c[k];
    if (firout > 100.0) firout = 100.0;
    if (firout < -100.0) firout = -100.0;

Controllability:
  Must fill circular buffer with desired N values.
  Other code governs how we access the buffer.
Observability:
  Want to examine firout before limit testing.
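
One way to get the observability asked for above is to expose firout before the limit tests through an extra output parameter. This is a sketch, not the slides' code: the function name fir_limited, the raw parameter, and the filter length FIR_N are assumptions added for testing.

```c
#include <stddef.h>

#define FIR_N 4    /* assumed filter length (N in the slide) */

/* Computes the limited FIR output from a circular buffer whose oldest
   sample is at index curr. If raw is not NULL, the value of firout
   before limit testing is stored there so a test can examine it. */
double fir_limited(const double buff[FIR_N], const double c[FIR_N],
                   int curr, double *raw) {
    double firout = 0.0;
    int j, k;
    for (j = curr, k = 0; j < FIR_N; j++, k++)
        firout += buff[j] * c[k];
    for (j = 0; j < curr; j++, k++)
        firout += buff[j] * c[k];
    if (raw != NULL)
        *raw = firout;                      /* observation point */
    if (firout > 100.0) firout = 100.0;
    if (firout < -100.0) firout = -100.0;
    return firout;
}
```

Controllability is handled by the caller, which fills buff with the desired values and picks curr. With four samples of 50 and unit coefficients the raw sum is 200, which the limiter clips to 100.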

Execution Paths and Testing
Paths are important in functional testing as well as performance analysis.
In general, an exponential number of paths through the program.
  Show that some paths dominate others.
  Heuristically limit paths.

Choosing the Paths to Test
Possible criteria:
  Execute every statement at least once.
  Execute every branch direction at least once.
Equivalent for structured programs.
Not true for gotos.
[Figure label: "not covered".]

Basis Paths
Approximate CDFG with undirected graph.
Undirected graphs have basis paths:
  All paths are linear combinations of basis paths.

Cyclomatic Complexity
Cyclomatic complexity is a bound on the size of basis sets:
  e = # edges
  n = # nodes
  p = number of graph components
  M = e - n + 2p.
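
The formula above is easy to evaluate directly. In the sketch below the function name is hypothetical, and the example graph is an assumption: a single if-then-else drawn as four nodes (decision, then, else, join) and four edges in one connected component.

```c
/* Cyclomatic complexity M = e - n + 2p for a graph with
   e edges, n nodes, and p connected components. */
int cyclomatic_complexity(int edges, int nodes, int components) {
    return edges - nodes + 2 * components;
}
```

For the if-then-else graph, cyclomatic_complexity(4, 4, 1) gives 2, matching the two paths through the statement.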

Branch Testing
Heuristic for testing branches.
  Exercise true and false branches of conditional.
  Exercise every simple condition at least once.

Branch Testing Example
Correct:
    if (a || (b >= c)) { printf("OK\n"); }
Incorrect:
    if (a && (b >= c)) { printf("OK\n"); }
Test:
  a = F
  (b >= c) = T
Example:
  Correct: [0 || (3 >= 2)] = T
  Incorrect: [0 && (3 >= 2)] = F
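
The example above can be turned into an executable check. The two conditions are wrapped in hypothetical functions so that the same test vector can be applied to both.

```c
/* Condition as intended. */
int correct_cond(int a, int b, int c) {
    return a || (b >= c);
}

/* Condition with || mistyped as &&. */
int buggy_cond(int a, int b, int c) {
    return a && (b >= c);
}
```

With a = 0, b = 3, c = 2 the two versions disagree, so the test exposes the bug. With a = 1 and the same b, c they agree, which is why a test that only sets a true would miss it.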

Another Branch Testing Example
Correct:
    if ((x == good_pointer) && (x->field1 == 3)) { printf("got the value\n"); }
Incorrect:
    if ((x = good_pointer) && (x->field1 == 3)) { printf("got the value\n"); }
Incorrect code changes pointer.
  Assignment returns new LHS in C.
Test that catches error:
    (x != good_pointer) && (x->field1 == 3)

Domain Testing
Heuristic test for linear inequalities.
Test on each side + boundary of inequality.
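
Domain testing as described above can be sketched with a hypothetical linear inequality, j <= i + 1, and a hypothetical bug in which <= was written as <. Neither the inequality nor the function names come from the slides.

```c
/* Intended domain: points with j <= i + 1. */
int in_domain(int i, int j) {
    return j <= i + 1;
}

/* Buggy version: the boundary itself is wrongly excluded. */
int in_domain_buggy(int i, int j) {
    return j < i + 1;
}
```

Test points at i = 3: j = 3 lies inside, j = 5 lies outside, and j = 4 lies on the boundary. Only the boundary point distinguishes the two versions, which is why the heuristic asks for it in addition to one point on each side.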

Def-use Pairs
Variable def-use:
  Def when value is assigned (defined).
  Use when used on right-hand side.
Exercise each def-use pair.
  Requires testing correct path.

Loop Testing
Loops need specialized tests to be tested efficiently.
Heuristic testing strategy:
  Skip loop entirely.
  One loop iteration.
  Two loop iterations.
  # iterations much below max.
  n-1, n, n+1 iterations where n is max.
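
The loop testing heuristic above can be expressed as a generator of iteration counts. Everything here is an illustrative assumption: the function under test (sum_first), its limit MAX_ITEMS, and the choice of n/2 to stand for "much below max".

```c
#define MAX_ITEMS 8    /* assumed maximum iteration count n */

/* Hypothetical code under test: sums the first count elements,
   never running more than MAX_ITEMS iterations. */
int sum_first(const int *v, int count) {
    int sum = 0;
    if (count > MAX_ITEMS)
        count = MAX_ITEMS;
    for (int i = 0; i < count; i++)
        sum += v[i];
    return sum;
}

/* Fills counts with the heuristic iteration counts for maximum n:
   0, 1, 2, a value well below n, then n-1, n, n+1. Returns how many. */
int loop_test_counts(int n, int counts[7]) {
    int k = 0;
    counts[k++] = 0;        /* skip loop entirely */
    counts[k++] = 1;        /* one iteration */
    counts[k++] = 2;        /* two iterations */
    counts[k++] = n / 2;    /* assumed stand-in for "much below max" */
    counts[k++] = n - 1;
    counts[k++] = n;
    counts[k++] = n + 1;    /* one past the maximum */
    return k;
}
```

Running sum_first() with each generated count checks the loop at its boundaries; the n+1 case confirms that the limit is enforced.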

Black-box Testing
Complements clear-box testing.
  May require a large number of tests.
Tests software in different ways.

Black-box Test Vectors
Random tests.
  May weight distribution based on software specification.
Regression tests.
  Tests of previous versions, bugs, etc.
  May be clear-box tests of previous versions.

How much testing is enough?
Exhaustive testing is impractical.
One important measure of test quality: bugs escaping into field.
Good organizations can test software to give very low field bug report rates.
Error injection measures test quality:
  Add known bugs.
  Run your tests.
  Determine % injected bugs that are caught.
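
The last step of error injection above is a simple ratio. The helper below is a hypothetical sketch; returning 0 when no bugs were injected is an arbitrary choice made here to avoid dividing by zero.

```c
/* Percentage of injected bugs that the test suite caught. */
double injected_bugs_caught_pct(int injected, int caught) {
    if (injected <= 0)
        return 0.0;        /* assumed convention: nothing injected, nothing measured */
    return 100.0 * caught / injected;
}
```

For example, catching 15 of 20 injected bugs gives 75%.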