Week10a
Power Awareness for Embedded Systems
OPTIMIZING EMBEDDED SOFTWARE FOR POWER
Introduction

3. Voltage
4. Frequency

Voltage and frequency are what will be covered very heavily here.
1. Hardware Techniques
2. Data Flow Optimization
3. Algorithmic Optimization
Hardware Techniques (1)

Overhead
• Must not break real-time constraints
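One way to make the overhead concern concrete: before entering a low-power mode, check that the expected idle window covers the mode's entry/exit cost and that the wake-up latency still meets the nearest real-time deadline. This is a minimal sketch; the type name, function, and cost figures below are hypothetical, and real values would come from the device datasheet.

```c
#include <stdint.h>

/* Hypothetical per-mode transition costs, in microseconds. Deeper
 * modes save more power but cost more time to enter and exit. */
typedef struct {
    uint32_t entry_us;   /* time to enter the mode */
    uint32_t wakeup_us;  /* time to resume execution after wake */
} power_mode_t;

/* A mode is only worth entering if the expected idle window covers
 * the transition overhead AND the wake-up latency still meets the
 * nearest real-time deadline. */
int mode_is_safe(power_mode_t m, uint32_t idle_us, uint32_t deadline_us)
{
    if (m.entry_us + m.wakeup_us >= idle_us)
        return 0;                      /* overhead eats the idle window */
    return m.wakeup_us <= deadline_us; /* can we wake back up in time? */
}
```

A scheduler would typically walk the available modes from deepest to shallowest and pick the first one for which this check passes.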
Data Flow Optimization – Memory Access (1)

Principle of locality
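The principle of locality can be illustrated with a traversal-order example: both functions below compute the same sum, but the row-major version walks consecutive addresses and fully uses each fetched cache line, while the column-major version strides a full row's width per access, generating far more cache (and therefore memory-power) traffic. This is an illustrative sketch, not from the slides.

```c
#include <stddef.h>

#define ROWS 64
#define COLS 64

/* Row-major traversal: consecutive accesses hit consecutive
 * addresses, so each cache line is fully used before eviction. */
long sum_row_major(const int m[ROWS][COLS])
{
    long s = 0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            s += m[i][j];
    return s;
}

/* Column-major traversal of the same C array strides
 * COLS * sizeof(int) bytes per access, touching a new cache line
 * almost every time: same result, much more memory traffic. */
long sum_col_major(const int m[ROWS][COLS])
{
    long s = 0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            s += m[i][j];
    return s;
}
```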
Data Flow Optimization – Memory Access (2)

Interleaving
Data Flow Optimization – Memory Access (3)

Burst Access
Data Flow Optimization – Memory Access (4)

Avoidance
Data Flow Optimization – Memory Access (5)

Compiler cache optimizations

In order to assist with the above, compilers may be used to optimize cache power consumption by reorganizing memory or memory accesses for us. Two main techniques available are array merging and loop interchanging, explained below.

Array Merging

Array merging organizes memory so that arrays accessed simultaneously will be at different offsets (different “sets”) from the start of a way. Consider the following two array declarations below:

int array1[ array_size ];
int array2[ array_size ];

The compiler can merge these two arrays as shown below:

struct merged_arrays
{
    int array1;
    int array2;
} new_array[ array_size ];

Loop Interchanging

In order to re-order the way that high-level memory is read into cache, reading in smaller chunks to reduce the chance of thrashing, loop interchanging can be used. Consider the code below:

for (i = 0; i < 100; i = i + 1)
    for (j = 0; j < 200; j = j + 1)
        for (k = 0; k < 10000; k = k + 1)
            z[ k ][ j ] = 10 * z[ k ][ j ];

By interchanging the second and third nested loops, the compiler produces the following code, decreasing the likelihood of unnecessary thrashing during the inner loop:

for (i = 0; i < 100; i = i + 1)
    for (k = 0; k < 10000; k = k + 1)
        for (j = 0; j < 200; j = j + 1)
            z[ k ][ j ] = 10 * z[ k ][ j ];

Peripheral/communication utilization

When considering reading and writing of data, of course, we must consider more than just memory access: we need to pull data into and out of the device.
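Loop interchange only reorders iterations of independent element updates, so both loop orders must produce identical results; only the cache access pattern changes. A compilable sketch of that check, with the loop bounds shrunk (to hypothetical sizes NI/NJ/NK) so repeated multiplication by 10 does not overflow:

```c
/* Small bounds so the sketch runs quickly and values stay in range:
 * each element is multiplied by 10 once per outer iteration. */
#define NI 3
#define NJ 20
#define NK 100

/* Original loop order: j outside, k inside. */
void scale_original(int z[NK][NJ])
{
    for (int i = 0; i < NI; i = i + 1)
        for (int j = 0; j < NJ; j = j + 1)
            for (int k = 0; k < NK; k = k + 1)
                z[k][j] = 10 * z[k][j];
}

/* Interchanged order: k outside, j inside, so the inner loop walks
 * one row of z contiguously instead of striding down a column. */
void scale_interchanged(int z[NK][NJ])
{
    for (int i = 0; i < NI; i = i + 1)
        for (int k = 0; k < NK; k = k + 1)
            for (int j = 0; j < NJ; j = j + 1)
                z[k][j] = 10 * z[k][j];
}
```

With NI = 3, every element ends up multiplied by 10 three times (×1000) in both versions.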
Data Flow Optimization – Peripherals

• Coprocessors
  Ø DMA
• Bus Configuration
• Core Communication
  Ø Polling
  Ø Time-Based Processing
  Ø Interrupt Processing
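The polling-versus-interrupt trade-off can be sketched with a host-side toy model: a polling core burns active-power cycles spinning on a status flag, while an interrupt-driven core could sleep until the event fires. Everything below is a simulation for illustration; it is not a real register map or interrupt API.

```c
/* Toy model of a peripheral that raises a "data ready" flag after
 * some number of core cycles. Purely a host-side simulation. */
typedef struct {
    int cycles_until_ready;
    int data;
} sim_device_t;

/* Polling: the core spins, burning active-power cycles until the
 * flag is set. Returns the number of wasted poll iterations. */
int read_polling(sim_device_t *dev, int *out)
{
    int wasted = 0;
    while (dev->cycles_until_ready > 0) {
        dev->cycles_until_ready--;  /* one poll per simulated cycle */
        wasted++;
    }
    *out = dev->data;
    return wasted;
}

/* Interrupt-style: the core sleeps and a handler runs on the ready
 * event, so no cycles are spent busy-waiting in this model. */
int read_interrupt(sim_device_t *dev, int *out)
{
    dev->cycles_until_ready = 0; /* core sleeps until the event fires */
    *out = dev->data;
    return 0;                    /* no busy-wait cycles */
}
```

In a real system the choice is less clear-cut: interrupt entry/exit has its own overhead, which is why time-based (batched) processing and DMA also appear in the list above.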
Instruction packing

Instruction packing was included in the data path optimization section above, but may also be listed as an algorithmic optimization, as it involves not only how memory is accessed but also how code is organized.
Algorithmic Optimization (1)
Loop unrolling revisited

We briefly discussed altering loops in code in order to optimize cache utilization before. As we discussed earlier, another method for optimizing both performance and power in embedded processors is via loop unrolling. This method effectively partially unravels a loop, as shown in the code snippets below:

Regular loop:

for (i = 0; i < 100; i = i + 1)
    for (k = 0; k < 10000; k = k + 1)
        a[i] = 10 * b[k];

Loop unrolled by 4x:

for (i = 0; i < 100; i = i + 4)
    for (k = 0; k < 10000; k = k + 4)
    {
        a[i] = 10 * b[k];
        a[i + 1] = 10 * b[k + 1];
        a[i + 2] = 10 * b[k + 2];
        a[i + 3] = 10 * b[k + 3];
    }
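The snippet above relies on the trip counts being exact multiples of 4. A semantics-preserving unroll of a single loop, with an epilogue for lengths that are not a multiple of the unroll factor, might look like this (the function name and ×10 operation are illustrative, not from the slides):

```c
/* Unrolling by 4 with an epilogue for lengths that are not a
 * multiple of 4. Fewer branch and induction-variable updates per
 * element means less loop overhead, at the cost of larger code. */
void scale_by_10(int *dst, const int *src, int n)
{
    int i = 0;
    for (; i + 3 < n; i += 4) {   /* main unrolled body */
        dst[i]     = 10 * src[i];
        dst[i + 1] = 10 * src[i + 1];
        dst[i + 2] = 10 * src[i + 2];
        dst[i + 3] = 10 * src[i + 3];
    }
    for (; i < n; i++)            /* remainder iterations */
        dst[i] = 10 * src[i];
}
```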
This comes at the cost of increased code size, counter to the minimization efforts we discussed in the data path section, which would lead to extra memory accesses and the possibility of increased cache miss penalties.
Algorithmic Optimization (2)

Reducing Accuracy
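One common way to reduce accuracy for power is to replace floating-point math with fixed-point, avoiding a power-hungry FPU (or slow soft-float library calls) at the cost of bounded precision loss. The Q15 format below is just one common choice as an illustration; the slides do not prescribe a format.

```c
#include <stdint.h>

/* Q15 fixed point: value = raw / 32768, so the representable range
 * is [-1, 1) with roughly 3e-5 worst-case representation error.
 * Multiplies become plain integer operations. */
typedef int16_t q15_t;

static inline q15_t q15_from_double(double x)
{
    return (q15_t)(x * 32768.0);
}

static inline double q15_to_double(q15_t x)
{
    return x / 32768.0;
}

/* Widen to 32 bits for the product, then shift back to Q15. */
static inline q15_t q15_mul(q15_t a, q15_t b)
{
    return (q15_t)(((int32_t)a * b) >> 15);
}
```

For example, 0.5 × 0.25 in Q15 comes out as exactly 0.125, while values that do not land on a 1/32768 grid point pick up a small, bounded error.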
Low Power Embedded Software Optimization using Symbolic Algebra — Peymandoust, Simunic, De Micheli, Stanford University
Questions?