LPG Sect12 06052009

A Practical Guide to Low-Power Design: User Experience with CPF
When Do You Know You Have Saved Enough Power?

By David Weir, Lead Design Engineer, Cadence Design.
Impact of Low-Power Design
As everyone in wireless, consumer, multimedia, server, router, automotive
and medical applications recognizes, power consumption can be the key product
differentiator and the key metric for success in the market. Three critical
factors emerge:
Peak power
Average power
Time required to switch between power modes
Peak power drives cost: for devices with multiple modes of operation, the
highest-power mode determines area, packaging and possibly heat sink costs.
Average power, evaluated separately for each key mode of operation of the chip,
determines battery life.
The time required to switch between power modes determines how much opportunity
software has to reduce power consumption; the ability to change modes rapidly
reaps more benefit from power techniques including multi-supply voltage (MSV),
power shutoff (PSO) and dynamic voltage and frequency scaling (DVFS).
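As a rough illustration of how peak and average power relate to these factors, the following Python sketch computes both for a device with several operating modes and derives an idealized battery life. The mode names, power figures, time fractions and battery capacity are all hypothetical values chosen for illustration; none of them come from this chapter.

```python
# Hypothetical operating modes: (name, power in mW, fraction of time in the mode).
# All numbers are illustrative assumptions, not measurements from this chapter.
MODES = [
    ("active", 200.0, 0.10),
    ("standby", 20.0, 0.30),
    ("shutoff", 1.0, 0.60),
]


def peak_power_mw(modes):
    """Highest power across modes; this is what sizes packaging and heat sinking."""
    return max(power for _, power, _ in modes)


def average_power_mw(modes):
    """Time-weighted average power; this is what drains the battery."""
    total_fraction = sum(fraction for _, _, fraction in modes)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("mode time fractions must sum to 1")
    return sum(power * fraction for _, power, fraction in modes)


def battery_life_hours(capacity_mwh, modes):
    """Idealized battery life: capacity divided by average power."""
    return capacity_mwh / average_power_mw(modes)


if __name__ == "__main__":
    print(f"peak power:    {peak_power_mw(MODES):.1f} mW")
    print(f"average power: {average_power_mw(MODES):.1f} mW")
    # 3700 mWh is an assumed battery capacity.
    print(f"battery life:  {battery_life_hours(3700.0, MODES):.1f} h")
```

In this simple model, the faster the design can move into the low-power mode, the larger the fraction of time it can spend there, which is why mode-switching time matters to the average.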
However, using low-power techniques could also increase product development
time, due to a variety of factors contributing to increased complexity throughout the
design flow:
Additional functional verification during RTL development
Increased complexity during synthesis, layout and signoff
Special library characterization
Area increase from the additional logic needed to support these low-power modes
There is always a trade-off between power savings, the project schedule and product
requirements. This chapter confronts the issue of knowing when enough is enough,
relative to power-saving techniques.

It is important to recognize that designs today are increasingly reliant on design
reuse. Most are a combination of new RTL, commercial IP and design reuse from
previous products.
The key issue is how the designer goes about estimating the impact of different
power-saving techniques during the RTL development and integration phase. This
process must look at the design holistically, to avoid leaving margin on the table
for any of the three parameters: performance, power or area (PPA). This power
reduction must also be traded off against increased complexity in the design cycle,
which can potentially add months of effort to resolve the implications of concurrent
design verification, power domain characterization, and timing closure.
A similar issue involves designers creating soft IP that is intended for reuse. How
can they find the best balance of performance, power and area for the same IP on
different projects, across multiple libraries and process technology nodes?
And how can designers improve the performance and reduce the power for re-used
blocks and derivative designs when RTL recoding is not an option?
Power Dissipation
The fundamentals of static and dynamic power dissipation are shown below:
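$$E = \int_0^t \left( C V_{DD}^2 f_c + V_{DD} I_{leak} \right) dt$$

Total power dissipation is the sum of a static term and a dynamic term.

Static power dissipation, $\int_0^t V_{DD} I_{leak} \, dt$, is driven by the leakage current $I_{leak}$.
Minimize $I_{leak}$ by:
Reduce the voltage
Use fewer transistors
Use lower leakage transistors

Dynamic power dissipation, $\int_0^t C V_{DD}^2 f_c \, dt$, is driven by the switching current $I_{switch}$.
Minimize $I_{switch}$ by:
Reduce the voltage
Decrease switching cap
Lower switching activity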
Figure 1: Power dissipation
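To make the two terms of Figure 1 concrete, the sketch below evaluates the energy integral for constant operating conditions. It treats the factor that multiplies C times VDD squared as an effective switching frequency, which is an assumption about the figure's notation, and every numeric value (capacitance, voltage, frequency, leakage current) is hypothetical.

```python
def power_terms_w(c_switched_f, vdd_v, f_switch_hz, i_leak_a):
    """Return (dynamic, static) power in watts.

    dynamic = C * Vdd^2 * f   (switching term)
    static  = Vdd * I_leak    (leakage term)
    """
    dynamic = c_switched_f * vdd_v ** 2 * f_switch_hz
    static = vdd_v * i_leak_a
    return dynamic, static


def total_energy_j(c_switched_f, vdd_v, f_switch_hz, i_leak_a, duration_s):
    """Integrate both terms over time, assuming they stay constant."""
    dynamic, static = power_terms_w(c_switched_f, vdd_v, f_switch_hz, i_leak_a)
    return (dynamic + static) * duration_s


if __name__ == "__main__":
    # Hypothetical values: 1 nF switched capacitance, 100 MHz, 1 mA leakage.
    for vdd in (1.0, 0.9):
        dyn, stat = power_terms_w(1e-9, vdd, 100e6, 1e-3)
        print(f"Vdd={vdd:.1f} V: dynamic={dyn * 1e3:.1f} mW, static={stat * 1e3:.2f} mW")
```

The sketch shows why reducing the voltage appears in both lists: the dynamic term falls with the square of the supply voltage, and the static term falls linearly. It holds the leakage current fixed when the voltage changes, which is a simplification.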
Static Power Optimization
Power shutoff (PSO) is well understood to dramatically reduce static power.
However, the following issues arise:
The need to estimate the area increase due to power switches
The need to solve the timing impact due to the addition of isolation logic on
block I/O paths
The minimum practical size of the PSO block, depending on layout style
Another critical issue is the necessity to save and restore the state of blocks which
have been powered down. The designer must identify which flip-flops require state
retention and what technique should be used to save and restore the state.
There are two general-case solutions:
For designs that must have low latency and high performance, the designer
can use state retention flip-flops, but the penalty is a further increase in standard
cell area
When the design can tolerate higher latency and performance is not an issue,
the designer can save the state of the block(s) elsewhere; however, this affects the
functional specification of the design, because state save/restore logic must be
added to the RTL alongside the functional logic
Quick PSO Evaluation Flow

For a quick evaluation of the trade-offs involved in power shutoff for voltage
domains, there are two key steps.
First, the designer can check the impact of isolation cells between power domains in
a quick and dirty flow:
The designer can read a new or an existing gate-level netlist into Cadence RTL
Compiler, insert isolation logic (using the power intent described in the CPF file),
and run incremental optimization; this allows the designer to quickly iterate
and measure the approximate impact both in power and in other penalties
The designer checks that the power domains do not impact timing-critical logic
The designer evaluates area penalties
Please note that this mode of exploring the design requires running a multi-mode
scenario, and the designer will now want to apply per-mode constraints
and timing exceptions inside Cadence RTL Compiler
Secondly, the designer must check the impact of state retention cells:
Quantify the effects of increased area and different flip-flop timing using
RTL Compiler, which reports area and power consumption, with static and
dynamic power reported separately
The range of available cells in the library can influence the final performance.
Some libraries have a rich set of flip-flops which merge combinatorial functions
with the memory element, but often do not have equivalent cells that also
include the state retention logic. Mapping to state retention with such a library
will change the number of combinatorial cells in the final netlist and could
impact area and timing
Because of this, when mapping to state retention flip-flops, it is best to start a
new synthesis run from RTL. Target only the state retention flip-flops
from the beginning, as opposed to trying to retrofit them later
Static Power Optimization
The ARM and Cadence technology partnership, a collaboration to improve both the
IP and the low-power design tool flow, provides an important example of a flow
with PSO for static power optimization. The work done on the ARM Cortex-M3
processor is a significant example for designers doing new RTL designs that
implement a top-down low-power architecture.
Case 1: PSO results from the ARM Cortex-M3 Processor

In the following example, the first power architecture explored for the ARM
Cortex-M3 processor uses only standard cells (no RAM).
In this case, PSO is applied only to the main CPU; the rest of the
system-level logic is always-on.
Results:
As shown below, there is a 0.4% area increase, from 309 isolation cells
The payoff is the leakage power reduction: over 93% leakage reduction in
power shutoff mode
This will vary by process node. In this example we were re-using a netlist
initially developed for a 130nm design and optimized to minimize area.

                 Original    PSO (on)    PSO (off)
Leakage power    0.28        0.28        0.018
Total power      9.17        9.05        1.46
Area             400535      402201      402201
Frequency        100 MHz     100 MHz     100 MHz
Figure 2: Case 1 approach to PSO on ARM Cortex-M3
In the second power architecture for the ARM Cortex-M3 processor, PSO is applied
to all sub-modules. The top level is an always-on domain.
For this case, there is only one entry to edit in the CPF file, an example of which is
shown below:
create_power_domain -name POWERDOWN \
-shutoff_condition {PWRUP} \
-instances { uCortexM3 uDAPSWJDP uCM3TPIU uCM3ROMTable }
The significant point is that this power architecture exploration with CPF is easy,
with low turnaround time, and little engineering effort involved. The designer can
always do the easy experiments and stop when it starts getting too complex, or
when unacceptable penalties arise.
Results:
As shown below, there is a 0.5% area increase, from 368 isolation cells
The payoff is the leakage power reduction: over 99% leakage reduction in
power shutoff mode
This is measured at the same process node as the previous example

                 Original    PSO (on)    PSO (off)
Leakage power    0.28        0.28        0.002
Total power      9.17        8.99        0.30
Area             400535      402538      402538
Frequency        100 MHz     100 MHz     100 MHz
Figure 3: Case 2 approach to PSO for ARM Cortex-M3
The conclusion for this example design was that more logic could be switched off
with negligible additional impact on area and none on frequency, so it became an
easy decision to choose option 2, which reduced leakage power further without a
meaningful penalty. This analysis was done with very little engineering effort.
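The percentages quoted for the two cases can be reproduced directly from the tabulated values. The short Python sketch below does that arithmetic; the numbers are copied from the Case 1 and Case 2 results, while the function and variable names are only illustrative.

```python
# Values copied from the Case 1 and Case 2 results (units as reported there).
ORIGINAL = {"leakage": 0.28, "total": 9.17, "area": 400535}
PSO_OFF = {
    "case 1": {"leakage": 0.018, "total": 1.46, "area": 402201},
    "case 2": {"leakage": 0.002, "total": 0.30, "area": 402538},
}


def percent_change(before, after):
    """Signed percentage change from before to after."""
    return 100.0 * (after - before) / before


def summarize(case):
    """Area penalty and leakage/total power reduction for one case, in percent."""
    off = PSO_OFF[case]
    return {
        "area_increase": percent_change(ORIGINAL["area"], off["area"]),
        "leakage_reduction": -percent_change(ORIGINAL["leakage"], off["leakage"]),
        "total_power_reduction": -percent_change(ORIGINAL["total"], off["total"]),
    }


if __name__ == "__main__":
    for case in PSO_OFF:
        s = summarize(case)
        print(f"{case}: area +{s['area_increase']:.2f}%, "
              f"leakage -{s['leakage_reduction']:.1f}%, "
              f"total power -{s['total_power_reduction']:.1f}%")
```

Case 1 works out to roughly a 0.42% area increase for a 93.6% leakage reduction, and Case 2 to roughly 0.50% for 99.3%, which matches the rounded figures quoted in the results.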
When a new piece of IP is created, this type of analysis can be performed to quickly
create a list of potential implementations and enable the end user to run the trials to
determine which gives the best result for their library and process.
Dynamic Power Optimization
Dynamic power optimization involves RTL optimization to reduce switching
activity. The ARM/Cadence partnership also illustrates the benefits of active power
optimization through a variety of techniques.
Key techniques to reduce active power in the ARM Cortex-M3 design are
listed below:
Two or more levels of clock gating were implemented: ARM inserted one (coarse)
level to start with, and Cadence RTL Compiler inserts the second (finer-grained)
level during synthesis
Analysis and optimization of enable logic for the RAM was performed. This
can have a huge impact because RAM is such a large component of many
designs. There are options for how large the memories are in the design, and
with a maximum-capacity cache they could account for more than 50% of
the total power
Parameters were provided to Cadence SoC Encounter and RTL Compiler to
separate the timing-critical logic on high-fanout nets, so that it is easier to
reach frequency in the areas where it is required without upsizing many
buffers unnecessarily, thus avoiding waste of active power and area
Selective use of one-hot encoding, a design style used for maximum
performance that can waste area and power. ARM has done a
lot of work to apply this high-overhead but powerful style
only where required for speed
[Bar chart: effect of clock gating on Cortex-A9, comparing frequency, cell area,
leakage power and dynamic power with and without clock gating (CG); vertical axis
marked at 0.5, 1.5 and 2.5]
Figure 4: Dynamic power optimization through clock gating on ARM Cortex-A9
MSV and Operating Voltage Exploration through Library Choice

Unlike implementing PSO, the power savings that can be achieved by optimizing
the library and process selection have no direct impact on the logical function of the
chip, although they will have an impact on timing and area. A couple of key lessons
have been learned when porting ARM designs to different libraries:
First, separate standard cell logic from RAM logic:
RAM timing and standard cell timing scale differently when changing voltage
or process
Since the change in timing is not uniform, different critical paths arise for
different libraries/processes
When possible, keep RAM-related logic and timing-critical logic in a separate
level of hierarchy, as this speeds the debug process and is very beneficial if
RAM logic is to be implemented at a different voltage from the rest of the design
Then, use RTL Compiler to determine the best library to use for each domain:
Map only the timing-critical logic to the highest-power, highest-performance
library
Perform a scripted exploration of all possible library / domain mappings
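A scripted exploration of this kind amounts to enumerating every assignment of a library to each domain, running synthesis for each, and then filtering the results. The Python sketch below shows only the enumeration and selection logic, not the synthesis runs themselves. The library names follow the five libraries used in this chapter's Cortex-A9 example; the domain names, the result records and all numeric values in them are hypothetical placeholders.

```python
from itertools import product

# Library names follow the five libraries of the Cortex-A9 example in this chapter.
LIBRARIES = [
    "standard-V nominal-Vt",
    "high-V nominal-Vt",
    "low-V nominal-Vt",
    "high-V high-Vt",
    "low-V low-Vt",
]
# Hypothetical domain names, for illustration only.
DOMAINS = ["cpu_logic", "ram_logic"]


def enumerate_mappings(domains, libraries):
    """Every possible assignment of one library to each domain."""
    return [dict(zip(domains, combo))
            for combo in product(libraries, repeat=len(domains))]


def best_mapping(results, target_mhz, metric):
    """Among runs that meet the target frequency, pick the lowest value of metric."""
    feasible = [r for r in results if r["frequency_mhz"] >= target_mhz]
    if not feasible:
        return None
    return min(feasible, key=lambda r: r[metric])


if __name__ == "__main__":
    mappings = enumerate_mappings(DOMAINS, LIBRARIES)
    print(len(mappings), "candidate mappings")
    # Placeholder results; in practice each record would come from a synthesis run.
    results = [
        {"mapping": mappings[0], "frequency_mhz": 300, "dynamic_power": 1.00},
        {"mapping": mappings[1], "frequency_mhz": 350, "dynamic_power": 1.20},
        {"mapping": mappings[2], "frequency_mhz": 250, "dynamic_power": 0.80},
    ]
    print(best_mapping(results, target_mhz=300, metric="dynamic_power"))
```

The number of runs grows as the number of libraries raised to the number of domains, so keeping the domain count small keeps an exhaustive sweep practical.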
In this example, to optimize for library choice and operating voltage with MSV,
the Cortex-A9 multicore processor was synthesized with five different libraries, all
using the same 45nm process:
Standard voltage, nominal Vt
High voltage, nominal Vt
Low voltage, nominal Vt
High voltage, high Vt
Low voltage, low Vt
Frequency was compared against area, static power and dynamic power. The
following charts show the achievable bounds on area and power at different
frequencies.
In the figure below, showing performance versus cell area impact, notice the hockey
stick curve of diminishing returns at higher frequencies.
[Chart: cell area versus frequency (axis from 150 to 450) for the base, High V,
High V (HVt), Low V and Low V (LVt) libraries. Annotations: large area shows the
low-voltage library struggling to meet timing (the design is over-constrained);
rapid area increase in the push for maximum frequency; choose between four libraries]
Figure 5: Cell area optimization library choice
The same hockey stick phenomenon is also reflected in the static power versus
frequency graph shown below.
[Chart: static power versus frequency (axis from 150 to 450) for the base, High V,
High V (HVt), Low V and Low V (LVt) libraries. Annotations: low voltage with low-Vt
cells has the highest leakage current; increase in leakage caused by increased area]
Figure 6: Static power optimization library choice

And the following figure shows dynamic power and frequency in the context of
library choices. Note that the effect of low voltage with low Vt cells is called out
in the center section, large red circle, and that the reduction in dynamic power
by lowering voltage is shown in the small red circle, also in the center section of
the figure.
[Chart: dynamic power versus frequency (axis from 150 to 450) for the base, High V,
High V (HVt), Low V and Low V (LVt) libraries. Annotations: dynamic power is equal
despite lower voltage, due to over-constraining; low voltage with low-Vt cells gives
the best dynamic power; rapid power increase in the push for maximum frequency;
reduction in dynamic power just by lowering voltage]
Figure 7: Dynamic power optimization library choice
The choice between optimizing for frequency, area, static power or dynamic power is
highly design dependent: it depends on the application, the target market, and the
device itself, including how its different modes of operation consume different
amounts of static and dynamic power.
As shown, it is possible to meet frequency goals and optimize for either static or
dynamic power just by selecting the correct library. This is fine if external logic also
runs at the same voltage. If not, level shifters will be needed on I/O paths, and a
flow similar to the quick and dirty flow used for PSO evaluation applies.
In Figure 7 we note that dynamic power can also be reduced by tuning the voltage.
In the experiments we saw a 12% saving in dynamic power, realized by shifting
from high voltage to nominal voltage, without changing the operating frequency.
This requires access to different timing and power characterizations of the libraries.
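For a sense of scale, the sketch below applies the idealized rule that dynamic power scales with the square of the supply voltage at a fixed frequency and switched capacitance. This is a simplifying assumption: the 12% figure above comes from synthesis experiments in which cell selection and sizing can also change with the library, so the numbers here are indicative only, and the example voltages are hypothetical.

```python
import math


def dynamic_power_ratio(v_new, v_old):
    """Idealized dynamic power ratio at fixed frequency and capacitance: (Vnew/Vold)^2."""
    return (v_new / v_old) ** 2


def voltage_ratio_for_saving(saving_fraction):
    """Supply voltage ratio that yields the given fractional dynamic power saving."""
    if not 0.0 <= saving_fraction < 1.0:
        raise ValueError("saving_fraction must be in [0, 1)")
    return math.sqrt(1.0 - saving_fraction)


if __name__ == "__main__":
    ratio = voltage_ratio_for_saving(0.12)
    print(f"voltage ratio for a 12% saving: {ratio:.3f}")
    # Hypothetical supply voltages, for illustration only.
    saving = 1.0 - dynamic_power_ratio(1.0, 1.1)
    print(f"saving when going from 1.1 V to 1.0 V: {saving * 100:.1f}%")
```

Under this model, a 12% dynamic power saving corresponds to a supply voltage only about 6% lower, which illustrates why modest voltage tuning can be worthwhile.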
ARM Intelligent Energy Manager (IEM)
Based on what we have just seen, a 12% saving achieved just by changing the
voltage, the effects of DVFS can be evaluated to save energy across multiple blocks
and for varying modes of operation.
The figure below shows three separate tasks that have to be done by the processor. It
illustrates that there is slack time between tasks 1, 2, and 3 where nothing new needs
to happen. The key concept is to run the design just fast enough to meet application
deadlines and no faster.
Since the operating system knows the deadlines, it knows it can take longer to do
each task, with the goal of running each task as slowly as possible while still meeting
performance goals. With DVFS, as enabled through ARM IEM, the design can run at
a reduced frequency and at a reduced voltage, which saves power due to the
voltage-squared effect on dynamic power.
[Chart: voltage versus time for Task 1, Task 2 and Task 3, with idle periods between
tasks and the energy consumed marked. Annotations: running fast and then idling
wastes energy; only need to run just fast enough to meet the application deadlines]
Figure 8: Energy without ARM IEM
The figure below shows the same three tasks with DVFS using ARM IEM:
Task 1 can take much longer, running very slowly at a much lower voltage, which
is quite energy-efficient
Task 2 has a medium application deadline, so it can run moderately slowly with
a slightly reduced voltage and be moderately energy-efficient
Task 3 requires high performance and so a relatively high voltage
The dotted black line shows the original energy consumed, and the solid black line
the energy used when DVFS was enabled.
[Chart: voltage versus time for Task 1, Task 2 and Task 3 with DVFS. Each task runs
in its available time, as slowly as possible, at a reduced voltage, and the energy
saved is marked. Annotations: running fast and then idling wastes energy; only need
to run just fast enough to meet the application deadlines]
Figure 9: Energy with ARM IEM
The net result is energy savings: not power reduction, but energy reduction.
The energy benefit labeled at the far right side of the figure shows that the design
has done the same amount of work with less energy. This translates into the
all-important battery life competitive specification.
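The distinction between power and energy can be sketched numerically. The Python example below picks, for a task with a known cycle count and deadline, the slowest operating point that still meets the deadline, and compares its dynamic energy with running at the fastest point and then idling. The operating points, the effective capacitance and the cycle counts are hypothetical, and the model ignores leakage and idle power, so it is a sketch of the principle rather than a model of ARM IEM.

```python
# Hypothetical operating points: (frequency in MHz, supply voltage in V).
OPERATING_POINTS = [(100, 0.8), (200, 1.0), (400, 1.2)]


def slowest_point_meeting_deadline(cycles, deadline_s, points):
    """Lowest-frequency operating point that finishes the task before its deadline."""
    for f_mhz, vdd in sorted(points):
        if cycles / (f_mhz * 1e6) <= deadline_s:
            return f_mhz, vdd
    raise ValueError("no operating point meets the deadline")


def dynamic_energy_j(cycles, vdd, c_eff_f=1e-9):
    """Dynamic energy for a task: C * Vdd^2 per cycle times the cycle count.

    Frequency does not appear: for a fixed amount of work, only the voltage
    changes the dynamic energy. c_eff_f is an assumed effective capacitance.
    """
    return c_eff_f * vdd ** 2 * cycles


if __name__ == "__main__":
    cycles, deadline_s = 1_000_000, 0.012  # hypothetical task
    f_fast, v_fast = max(OPERATING_POINTS)
    f_slow, v_slow = slowest_point_meeting_deadline(cycles, deadline_s, OPERATING_POINTS)
    e_fast = dynamic_energy_j(cycles, v_fast)
    e_slow = dynamic_energy_j(cycles, v_slow)
    print(f"run fast then idle: {f_fast} MHz at {v_fast} V, {e_fast * 1e3:.2f} mJ")
    print(f"just fast enough:   {f_slow} MHz at {v_slow} V, {e_slow * 1e3:.2f} mJ")
    print(f"energy saved: {(1 - e_slow / e_fast) * 100:.1f}%")
```

Lowering frequency alone would reduce power but not the energy per task; the energy saving in this model comes entirely from the lower voltage that the lower frequency permits.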
ARM-Cadence Reference Methodology for ARM1176JZF-S Processor with IEM
The ARM-Cadence reference flow for the ARM1176JZF-S processor with IEM
demonstrates how DVFS can be implemented in an automated, top-down flow from
RTL to GDSII.
In the case of the ARM1176JZF-S processor, the RTL hierarchy matches the power
domains that are specified in the CPF file, which is also used to indicate where level
shifters and isolation cells should be inserted into the design.
The reference flow also makes use of the support within SoC Encounter for the
tri-lib ECSM flow. This shows how it is possible to optimize for any voltage by
accurately modelling the effect of voltage changes on final timing.
It is also worth noting that the introduction of DVFS now enables the processor to
run at many speeds, which are dynamically variable. The other logic around the
processor must also be able to interface with it. In the case of the ARM1176JZF-S
processor, the AMBA 3 AXI interface supports both a synchronous and an
asynchronous mode. This handshaking is required and must be addressed in the
logic functionality itself in order to implement DVFS.
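[Block diagram: hardened core containing the VRAMS domain (TCM and cache RAMs) and
the VCORE domain (ARM core), with clamps and level shifter/clamp cells at the domain
boundaries and IEM sync/async interfaces to the VSOC domain; clock input ACLK]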
Figure 10: ARM low-power architecture
ARM1176JZF-S Synthesis Flow Using CPF for DVFS

Easing complexity, CPF also makes a difference in the synthesis flow.
Looking at the yellow arrows, from top to bottom, CPF is used respectively to:
Read in libraries
Define power domains
Run consistency checks between RTL and CPF
Insert the low power logic (commit)
Define reporting, which is now done per power mode
Inputs: multi-Vth *.lib, multi-voltage *.lib, RTL_files.v, CPF, SDC
Outputs: Gate.v, SDC

# Import RTL
read_cpf -library CPF_file
read_hdl $HDL_files
elaborate
# Set up MSV, multi-mode, and power constraints
read_cpf CPF_file
check_cpf -all -detail
set_attribute max_leakage_power <value>
# Top-down synthesis (multi-V / multi-mode)
synthesize -to_mapped -effort high
# Connect scan chains and run incremental synthesis
connect_scan_chains
synthesize -to_mapped -effort high -incr
# Insert low power logic
commit_cpf
# Analysis / output, per power mode
# ($modes is a placeholder for the list of power mode names)
foreach mode $modes {
    report timing -mode $mode
    report power -mode $mode
    write_sdc -mode $mode
}
Figure 11: ARM1176JZF-S synthesis flow using CPF for DVFS
As shown, using DVFS does not affect the main body of the flow. A few simple steps
change at the start and at the end, but the main synthesis flow does not.
The conclusion is that the DVFS technique, while not particularly invasive to the
synthesis flow or RTL coding / verification, offers the potential for great savings.
Power Savings in Multicore Processors

Multicore and Multi-processor Designs
Multicore processors are becoming increasingly common. The latest processor from
ARM is the Cortex-A9 MPCore processor, a multi-core design which enables both
performance and power improvements over single-core designs.
The Cortex-A9 processor is the current project for joint collaboration between ARM
and Cadence, and its architecture is shown in the figure below.
[Block diagram: Cortex-A9 MPCore with the ARM CoreSight multicore debug and trace
architecture. Four Cortex-A9 CPUs, each with FPU/NEON, a PTM I/F, an I-cache and a
D-cache, connect to a Snoop Control Unit (SCU) alongside generic interrupt control
and distribution, cache-to-cache transfers, snoop filtering, timers and an
accelerator coherency port. An advanced bus interface unit provides the primary
AMBA 3 64-bit interface and an optional second interface with address filtering]
Figure 12: Cortex-A9 MPCore architecture
Similarly, the physical layout of a two-core implementation of the Cortex-A9 is
shown below, clustering the CPU, data engine, data cache, instruction cache, etc.
Figure 13: Floorplan of two-core build of Cortex-A9 MPCore
Power Savings in Multi-Processor Designs

In a holistic approach to power, the first step is to jointly investigate which techniques
should be applied based on performance and power trade-offs:
MSV to speed up critical paths, and save power elsewhere
PSO of individual cores, when overall processor demand drops
DVFS for individual cores
The resulting flows for multi-processor designs are jointly developed and tested by
both ARM and Cadence. There is already a Cadence flow for every ARM processor.
The low-power, IEM-enabled ARM1176JZF-S processor flow was released last year.
New flows for the Cortex-M3 processor and Cortex-A9 multicore processor that use
advanced low power techniques are currently being jointly developed.
Conclusions
Adding support for advanced low-power design early in the flow can impact
the area, power, performance and success of your designs. Power intent should
be considered early, during RTL coding. With Cadence RTL Compiler and CPF, a
designer can quickly explore the impact of different low-power techniques to find
the best solution.
Examples of successful deployment of MSV, PSO and DVFS were discussed and
demonstrated on ARM processors. Quantified power savings were realized with
minimal complexity, area or performance tradeoffs.
The risks and complexity of low-power design are significantly offset by using a
production-proven flow, an example of which is the work done collaboratively
between ARM and Cadence, to provide low-power functionality to the latest ARM
processors, including the Cortex family and new multi-processor designs.
________________________
Acknowledgements and thanks to ARM for the ARM IEM information and graphics, and for their ongoing
efforts on the joint CPF flow projects.
________________________
David Weir, Lead Design Engineer, Cadence Design, studied at Edinburgh University, Scotland, where he received
a joint honors bachelors degree in Computer Science and Electronics. Having used Cadence tools for more
than 10 years, he has experience in all stages of digital design, from RTL coding, verification, synthesis, and test
insertion, through layout, timing closure, and final signoff timing and physical checks run at tapeout. Currently
he is working on joint projects with ARM, focusing on high performance flows for their largest processors.

