Week10a
Power Awareness for Embedded Systems
OPTIMIZING EMBEDDED SOFTWARE FOR POWER
Introduction

3. Voltage
4. Frequency

Voltage and frequency are what will be covered very heavily here.
1. Hardware Techniques
2. Data Flow Optimization
3. Algorithmic Optimization
Hardware Techniques (1)

Overhead
• Must not break real-time constraints
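One way to make the overhead concern concrete: before entering a low-power mode, check that the expected idle window covers the mode's entry/exit cost and that the wake-up latency still meets the nearest real-time deadline. This is a minimal sketch; the type name, function, and cost figures below are hypothetical, and real values would come from the device datasheet.

```c
#include <stdint.h>

/* Hypothetical per-mode transition costs, in microseconds. Deeper
 * modes save more power but cost more time to enter and exit. */
typedef struct {
    uint32_t entry_us;   /* time to enter the mode */
    uint32_t wakeup_us;  /* time to resume execution after wake */
} power_mode_t;

/* A mode is only worth entering if the expected idle window covers
 * the transition overhead AND the wake-up latency still meets the
 * nearest real-time deadline. */
int mode_is_safe(power_mode_t m, uint32_t idle_us, uint32_t deadline_us)
{
    if (m.entry_us + m.wakeup_us >= idle_us)
        return 0;                      /* overhead eats the idle window */
    return m.wakeup_us <= deadline_us; /* can we wake back up in time? */
}
```

A scheduler would typically walk the available modes from deepest to shallowest and pick the first one for which this check passes.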
Data Flow Optimization – Memory Access (1)

Principle of locality
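The principle of locality can be illustrated with a traversal-order example: both functions below compute the same sum, but the row-major version walks consecutive addresses and fully uses each fetched cache line, while the column-major version strides a full row's width per access, generating far more cache (and therefore memory-power) traffic. This is an illustrative sketch, not from the slides.

```c
#include <stddef.h>

#define ROWS 64
#define COLS 64

/* Row-major traversal: consecutive accesses hit consecutive
 * addresses, so each cache line is fully used before eviction. */
long sum_row_major(const int m[ROWS][COLS])
{
    long s = 0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            s += m[i][j];
    return s;
}

/* Column-major traversal of the same C array strides
 * COLS * sizeof(int) bytes per access, touching a new cache line
 * almost every time: same result, much more memory traffic. */
long sum_col_major(const int m[ROWS][COLS])
{
    long s = 0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            s += m[i][j];
    return s;
}
```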
Data Flow Optimization – Memory Access (2)

Interleaving
Data Flow Optimization – Memory Access (3)

Burst Access
Data Flow Optimization – Memory Access (4)

Avoidance
Data Flow Optimization – Memory Access (5)

Compiler cache optimizations

In order to assist with the above, compilers may be used to optimize cache power consumption by reorganizing memory or memory accesses for us. Two main techniques available are array merging and loop interchanging, explained below.

Array Merging

Array merging organizes memory so that arrays accessed simultaneously will be at different offsets (different “sets”) from the start of a way. Consider the following two array declarations below:

int array1[ array_size ];
int array2[ array_size ];

The compiler can merge these two arrays as shown below:

struct merged_arrays
{
    int array1;
    int array2;
} new_array[ array_size ];

Loop Interchanging

In order to re-order the way that high-level memory is read into cache, reading in smaller chunks to reduce the chance of thrashing, loop interchanging can be used. Consider the code below:

for (i = 0; i < 100; i = i + 1)
    for (j = 0; j < 200; j = j + 1)
        for (k = 0; k < 10000; k = k + 1)
            z[ k ][ j ] = 10 * z[ k ][ j ];

By interchanging the second and third nested loops, the compiler produces the following code, decreasing the likelihood of unnecessary thrashing during the inner loop:

for (i = 0; i < 100; i = i + 1)
    for (k = 0; k < 10000; k = k + 1)
        for (j = 0; j < 200; j = j + 1)
            z[ k ][ j ] = 10 * z[ k ][ j ];

Peripheral/communication utilization

When considering reading and writing of data, of course, we must consider more than just memory access: we need to pull data into and out of the device.
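Loop interchange only reorders iterations of independent element updates, so both loop orders must produce identical results; only the cache access pattern changes. A compilable sketch of that check, with the loop bounds shrunk (to hypothetical sizes NI/NJ/NK) so repeated multiplication by 10 does not overflow:

```c
/* Small bounds so the sketch runs quickly and values stay in range:
 * each element is multiplied by 10 once per outer iteration. */
#define NI 3
#define NJ 20
#define NK 100

/* Original loop order: j outside, k inside. */
void scale_original(int z[NK][NJ])
{
    for (int i = 0; i < NI; i = i + 1)
        for (int j = 0; j < NJ; j = j + 1)
            for (int k = 0; k < NK; k = k + 1)
                z[k][j] = 10 * z[k][j];
}

/* Interchanged order: k outside, j inside, so the inner loop walks
 * one row of z contiguously instead of striding down a column. */
void scale_interchanged(int z[NK][NJ])
{
    for (int i = 0; i < NI; i = i + 1)
        for (int k = 0; k < NK; k = k + 1)
            for (int j = 0; j < NJ; j = j + 1)
                z[k][j] = 10 * z[k][j];
}
```

With NI = 3, every element ends up multiplied by 10 three times (×1000) in both versions.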
Data Flow Optimization – Peripherals

• Coprocessors
  Ø DMA
• Bus Configuration
• Core Communication
  Ø Polling
  Ø Time-Based Processing
  Ø Interrupt Processing
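The polling-versus-interrupt trade-off can be sketched with a host-side toy model: a polling core burns active-power cycles spinning on a status flag, while an interrupt-driven core could sleep until the event fires. Everything below is a simulation for illustration; it is not a real register map or interrupt API.

```c
/* Toy model of a peripheral that raises a "data ready" flag after
 * some number of core cycles. Purely a host-side simulation. */
typedef struct {
    int cycles_until_ready;
    int data;
} sim_device_t;

/* Polling: the core spins, burning active-power cycles until the
 * flag is set. Returns the number of wasted poll iterations. */
int read_polling(sim_device_t *dev, int *out)
{
    int wasted = 0;
    while (dev->cycles_until_ready > 0) {
        dev->cycles_until_ready--;  /* one poll per simulated cycle */
        wasted++;
    }
    *out = dev->data;
    return wasted;
}

/* Interrupt-style: the core sleeps and a handler runs on the ready
 * event, so no cycles are spent busy-waiting in this model. */
int read_interrupt(sim_device_t *dev, int *out)
{
    dev->cycles_until_ready = 0; /* core sleeps until the event fires */
    *out = dev->data;
    return 0;                    /* no busy-wait cycles */
}
```

In a real system the choice is less clear-cut: interrupt entry/exit has its own overhead, which is why time-based (batched) processing and DMA also appear in the list above.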
Instruction packing

Instruction packing was included in the data path optimization section above, but may also be listed as an algorithmic optimization, as it involves not only how memory is accessed but also how code is organized.
Algorithmic Optimization (1)
Loop unrolling revisited

We briefly discussed altering loops in code in order to optimize cache utilization before. As we discussed earlier, another method for optimizing both performance and power in embedded processors is via loop unrolling. This method effectively partially unravels a loop, as shown in the code snippets below:

Regular loop:

for (i = 0; i < 100; i = i + 1)
    for (k = 0; k < 10000; k = k + 1)
        a[i] = 10 * b[k];

Loop unrolled by 4x:

for (i = 0; i < 100; i = i + 4)
    for (k = 0; k < 10000; k = k + 4)
    {
        a[i] = 10 * b[k];
        a[i + 1] = 10 * b[k + 1];
        a[i + 2] = 10 * b[k + 2];
        a[i + 3] = 10 * b[k + 3];
    }
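The snippet above relies on the trip counts being exact multiples of 4. A semantics-preserving unroll of a single loop, with an epilogue for lengths that are not a multiple of the unroll factor, might look like this (the function name and ×10 operation are illustrative, not from the slides):

```c
/* Unrolling by 4 with an epilogue for lengths that are not a
 * multiple of 4. Fewer branch and induction-variable updates per
 * element means less loop overhead, at the cost of larger code. */
void scale_by_10(int *dst, const int *src, int n)
{
    int i = 0;
    for (; i + 3 < n; i += 4) {   /* main unrolled body */
        dst[i]     = 10 * src[i];
        dst[i + 1] = 10 * src[i + 1];
        dst[i + 2] = 10 * src[i + 2];
        dst[i + 3] = 10 * src[i + 3];
    }
    for (; i < n; i++)            /* remainder iterations */
        dst[i] = 10 * src[i];
}
```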
This comes at the cost of increased code size, counter to the minimization efforts we discussed in the data path section, which would lead to extra memory accesses and the possibility of increased cache miss penalties.
Algorithmic Optimization (2)

Reducing Accuracy
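One common way to reduce accuracy for power is to replace floating-point math with fixed-point, avoiding a power-hungry FPU (or slow soft-float library calls) at the cost of bounded precision loss. The Q15 format below is just one common choice as an illustration; the slides do not prescribe a format.

```c
#include <stdint.h>

/* Q15 fixed point: value = raw / 32768, so the representable range
 * is [-1, 1) with roughly 3e-5 worst-case representation error.
 * Multiplies become plain integer operations. */
typedef int16_t q15_t;

static inline q15_t q15_from_double(double x)
{
    return (q15_t)(x * 32768.0);
}

static inline double q15_to_double(q15_t x)
{
    return x / 32768.0;
}

/* Widen to 32 bits for the product, then shift back to Q15. */
static inline q15_t q15_mul(q15_t a, q15_t b)
{
    return (q15_t)(((int32_t)a * b) >> 15);
}
```

For example, 0.5 × 0.25 in Q15 comes out as exactly 0.125, while values that do not land on a 1/32768 grid point pick up a small, bounded error.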
Low Power Embedded Software Optimization using Symbolic Algebra — Peymandoust, Simunic, De Micheli, Stanford University
Questions?