
A Practical Guide to Low-Power Design: User Experience with CPF

When Do You Know You Have Saved Enough Power?



By David Weir, Lead Design Engineer, Cadence Design.

Impact of Low-Power Design

As everyone in wireless, consumer, multimedia, server, router, automotive and
medical applications recognizes, power consumption can be the key product
differentiator and the key metric for success in the market. Three critical
factors emerge:
Peak power
Average power
Time required to switch between power modes
Peak power impacts cost advantage, in the sense that, for devices having multiple
modes of operation, the highest power mode determines area, packaging and
possibly heat sink cost of goods.
Average power, again uniquely for the key modes of operation of the chip, determines
battery life.
The time required to switch between power modes determines how much opportunity
software has to reduce power consumption; the ability to change modes rapidly
reaps more benefit from power techniques including multi-supply voltage (MSV),
power shutoff (PSO) and dynamic voltage and frequency scaling (DVFS).
However, using low-power techniques could also increase product development
time, due to a variety of factors contributing to increased complexity throughout the
design flow:
Additional functional verification during RTL development
Increased complexity during synthesis, layout and signoff
Special library characterization
Area increase from additional logic needed to support these low power modes
There is always a trade-off between power savings, the project schedule and product
requirements. This chapter confronts the issue of knowing when enough is enough,
relative to power-saving techniques.
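The distinction between peak power and average power drawn above can be made concrete with a small calculation. The following is a minimal sketch in Python, not part of the original flow: the mode names, power figures, time fractions and battery capacity are all hypothetical values chosen for illustration.

```python
# Hypothetical operating modes: (power in mW, fraction of time spent in the mode).
# All numbers are illustrative assumptions, not measurements from this chapter.
MODES = {
    "active": (200.0, 0.10),
    "idle": (20.0, 0.30),
    "standby": (1.0, 0.60),
}


def peak_power_mw(modes):
    """Highest-power mode: this is what drives packaging and heat-sink cost."""
    return max(power for power, _ in modes.values())


def average_power_mw(modes):
    """Time-weighted mean power: this is what drives battery life."""
    total_fraction = sum(frac for _, frac in modes.values())
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("mode time fractions must sum to 1")
    return sum(power * frac for power, frac in modes.values())


def battery_life_hours(capacity_mwh, modes):
    """Idealized battery life, ignoring converter losses and self-discharge."""
    return capacity_mwh / average_power_mw(modes)


if __name__ == "__main__":
    print(f"peak power:    {peak_power_mw(MODES):.1f} mW")
    print(f"average power: {average_power_mw(MODES):.1f} mW")
    print(f"battery life:  {battery_life_hours(3700.0, MODES):.1f} h")
```

In this toy model the active mode sets the peak, while the low-power modes dominate the time-weighted average, which is why the two metrics are tracked separately.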



It is important to recognize that designs today are increasingly reliant on design
reuse. Most are a combination of new RTL, commercial IP and design reuse from
previous products.
The key issue is how the designer goes about estimating the impact of different
power-saving techniques during the RTL development and integration phase. This
process must look at the design holistically, to avoid leaving margin on the table
in any of the three parameters: performance, power or area (PPA). This power
reduction must also be traded off against increased complexity in the design cycle,
which can add months of effort to resolve the implications of concurrent design
verification, power domain characterization, and timing closure.
A similar issue involves designers creating soft IP that is intended for reuse. How
can they find the best balance of performance, power and area for the same IP on
different projects, across multiple libraries and process technology nodes?
And how can designers improve the performance and reduce the power for re-used
blocks and derivative designs when RTL recoding is not an option?

Power Dissipation

The fundamentals of static and dynamic power dissipation are shown below:

$$E = \int_0^t \left( C V_{DD}^2 f_c + V_{DD} I_{leak} \right) dt$$

Total power dissipation is the sum of a static term and a dynamic term.

Static power dissipation (leakage current $I_{leak}$): $\int_0^t V_{DD} I_{leak} \, dt$
Minimize $I_{leak}$ by:
Reduce the voltage
Use fewer transistors
Use lower-leakage transistors

Dynamic power dissipation (switching current $I_{switch}$): $\int_0^t C V_{DD}^2 f_c \, dt$
Minimize $I_{switch}$ by:
Reduce the voltage
Decrease switching capacitance
Lower switching activity

Figure 1: Power dissipation
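To see how the two terms in Figure 1 compare, the energy expression can be evaluated for constant operating conditions. This is a minimal sketch under stated assumptions: the "c" term in the figure is read as a clock (switching) frequency, all quantities are held constant so the integral reduces to a product, and the capacitance, voltage, frequency and leakage values used in the example are hypothetical.

```python
def energy_joules(c_farads, vdd_volts, f_hz, i_leak_amps, t_seconds):
    """Energy over t_seconds for constant C, VDD, switching frequency and leakage.

    With constant operating conditions the integral reduces to
    (C * VDD^2 * f + VDD * I_leak) * t.
    Returns a tuple (dynamic, static, total), all in joules.
    """
    dynamic = c_farads * vdd_volts ** 2 * f_hz * t_seconds
    static = vdd_volts * i_leak_amps * t_seconds
    return dynamic, static, dynamic + static


if __name__ == "__main__":
    # Hypothetical values: 1 nF switched capacitance, 100 MHz, 1 mA leakage, 1 s.
    for vdd in (1.2, 1.0):
        dyn, stat, total = energy_joules(1e-9, vdd, 100e6, 1e-3, 1.0)
        print(f"VDD={vdd:.1f} V  dynamic={dyn:.4f} J  static={stat:.4f} J  total={total:.4f} J")
```

Because the dynamic term depends on the square of the supply voltage and the static term depends on it linearly (leakage current itself held fixed in this simplification), reducing the voltage appears in both "minimize" lists of the figure.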

Static Power Optimization

Power shutoff (PSO) is well understood to dramatically reduce static power.
However, the following issues arise:
The need to estimate the area increase due to power switches
The need to solve the timing impact due to the addition of isolation logic on
block I/O paths
The minimum practical size of the PSO block, depending on layout style


Another critical issue is the need to save and restore the state of blocks that
have been powered down. The designer must identify which flip-flops require state
retention and which technique should be used to save and restore the state.
There are two general-case solutions:
For designs that must have low latency and high performance, the designer
can use state-retention flip-flops, but the penalty is a further increase in
standard cell area
When the design can tolerate higher latency and performance is not an issue,
the designer can save the state of the block(s) elsewhere; however, this affects
the functional specification of the design, as it requires state save/restore
logic to be added to the RTL alongside the functional logic

Quick PSO Evaluation Flow


For a quick evaluation of the trade-offs involved in power shutoff for voltage
domains, there are two key steps.
First, the designer can check the impact of isolation cells between power domains in
a quick and dirty flow:
The designer can read a new or an existing gate-level netlist into Cadence RTL
Compiler, insert isolation logic (using the power intent described in the CPF file),
and run incremental optimization; this allows the designer to iterate quickly
and measure the approximate impact both in power and in other penalties
The designer checks that the power domains do not impact timing-critical logic
The designer evaluates area penalties
Please note that this mode of exploring the design requires running a multimode
scenario, and the designer will now want to apply per-mode constraints
and timing exceptions inside Cadence RTL Compiler
Second, the designer must check the impact of state retention cells:
Quantify the effects of increased area and of different flip-flop timing using
RTL Compiler, which reports area and power consumption, with static and
dynamic power reported separately
The range of available cells in the library can influence the final performance.
Some libraries have a rich set of flip-flops that merge combinatorial functions
with the memory element, but often do not have equivalent cells that also
include the state retention logic. Mapping to state retention with such a library
will change the number of combinatorial cells in the final netlist and could
impact area and timing
Because of this, when mapping to state-retention flip-flops, it is best to start a
new synthesis run from RTL, targeting only the state-retention flip-flops
from the beginning rather than trying to retrofit them later

Static Power Optimization

The ARM and Cadence technology partnership, a collaboration to improve both the
IP and the low-power design tool flow, provides an important example
of a flow with PSO for static power optimization. The work done on the
ARM Cortex-M3 processor is a significant example for designers doing new RTL
designs that implement a top-down low-power architecture.

Case 1: PSO results from the ARM Cortex-M3 Processor


In the following example, the first power architecture explored for the ARM
Cortex-M3 processor uses only standard cells (no RAM).
In this case, PSO is applied only to the main CPU; the rest of the
system-level logic is always on.
Results:
As shown below, there is a 0.4% area increase, from 309 isolation cells
The payoff is the reduction in leakage power: over 93% leakage reduction
in power shutoff mode
This will vary by process node. In this example we were re-using a netlist
initially developed for a 130nm design and optimized to minimize area

                Original    PSO (on)    PSO (off)
Leakage power   0.28        0.28        0.018
Total power     9.17        9.05        1.46
Area            400535      402201      402201
Frequency       100 MHz     100 MHz     100 MHz

Figure 2: Case 1 approach to PSO on ARM Cortex-M3

In the second power architecture for the ARM Cortex-M3 processor, PSO is applied
to all sub-modules. The top level is an always-on domain.


For this case, there is only one entry to edit in the CPF file, an example of which is
shown below:
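# Group the listed sub-module instances into one switchable power domain.
# -shutoff_condition gives the expression under which the domain is powered off.
create_power_domain -name POWERDOWN \
    -shutoff_condition {PWRUP} \
    -instances { uCortexM3 uDAPSWJDP uCM3TPIU uCM3ROMTable }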

The significant point is that this power architecture exploration with CPF is easy,
with low turnaround time, and little engineering effort involved. The designer can
always do the easy experiments and stop when it starts getting too complex, or
when unacceptable penalties arise.
Results:
As shown below, there is a 0.5% area increase, from 368 isolation cells
The payoff is the reduction in leakage power: over 99% leakage reduction
in power shutoff mode
This is measured at the same process node as the previous example

                Original    PSO (on)    PSO (off)
Leakage power   0.28        0.28        0.002
Total power     9.17        8.99        0.30
Area            400535      402538      402538
Frequency       100 MHz     100 MHz     100 MHz

Figure 3: Case 2 approach to PSO for ARM Cortex-M3

The conclusion for this example design was that more logic could be switched off
with negligible impact on area (0.5% instead of 0.4%) and none on frequency, so it
became an easy decision to choose option 2, which reduced leakage power further at
almost no extra cost. This analysis was done with very little engineering effort.
When a new piece of IP is created, this type of analysis can be performed to quickly
create a list of potential implementations and enable the end user to run the trials to
determine which gives the best result for their library and process.
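The percentages quoted for the two power architectures follow directly from the figures in the results tables. The sketch below recomputes them in Python; the numbers are copied from Figures 2 and 3, while the function and dictionary names are illustrative choices rather than part of any tool flow.

```python
# Values copied from Figures 2 and 3 (units as reported there).
ORIGINAL = {"leakage": 0.28, "total": 9.17, "area": 400535}
CASES = {
    "Case 1 (CPU only)": {"leakage_off": 0.018, "total_off": 1.46, "area": 402201},
    "Case 2 (all sub-modules)": {"leakage_off": 0.002, "total_off": 0.30, "area": 402538},
}


def area_increase_pct(original_area, pso_area):
    """Area overhead of the PSO implementation relative to the original."""
    return 100.0 * (pso_area - original_area) / original_area


def reduction_pct(before, after):
    """Percentage reduction from 'before' to 'after'."""
    return 100.0 * (before - after) / before


def summarize(original, case):
    """Compare one PSO trial (in its shutoff mode) against the original design."""
    return {
        "area_increase_pct": area_increase_pct(original["area"], case["area"]),
        "leakage_reduction_pct": reduction_pct(original["leakage"], case["leakage_off"]),
        "total_power_reduction_pct": reduction_pct(original["total"], case["total_off"]),
    }


if __name__ == "__main__":
    for name, case in CASES.items():
        result = summarize(ORIGINAL, case)
        print(name, {key: round(value, 2) for key, value in result.items()})
```

Keeping each trial as one small record makes it straightforward to add further candidate power architectures and compare them on the same terms.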

Dynamic Power Optimization

Dynamic power optimization involves RTL optimization to reduce switching
activity. The ARM/Cadence partnership also illustrates the benefits of active power
optimization through a variety of techniques.
Key techniques to reduce active power in the ARM Cortex-M3 design are
listed below:
Two or more levels of clock gating were implemented: ARM inserted one (coarse)
level to start with, and Cadence RTL Compiler inserts the second (finer-grained)
level during synthesis
Analysis and optimization of the enable logic for the RAM was performed. This
can have a huge impact because RAM is such a large component of many
designs. There are options for how large the memories are in the design, and
when deploying a maximum-capacity cache, they could account for more than 50% of
the total power
Parameters were provided to Cadence SoC Encounter and RTL Compiler to
separate the timing-critical logic on high-fanout nets, so that it is easier to
optimize to reach frequency in the areas where it is required without sizing up a
lot of buffers unnecessarily, thus avoiding waste of active power and area
Selective use of one-hot encoding, a design style used for maximum
performance but one that can waste area and power. ARM has done a
lot of work to apply this high-overhead but powerful high-performance style
only when required for speed
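The effect of clock gating on switching activity can be illustrated with a toy model. This sketch is an illustration under simplifying assumptions, not the ARM or RTL Compiler implementation: it counts how many register clock pins receive a clock edge over a hypothetical enable trace, with and without gating, as a rough proxy for clock-related dynamic power. The overhead of the gating cells themselves is ignored.

```python
def clock_pin_events(enable_trace, num_registers, gated):
    """Count clock edges delivered to register clock pins.

    enable_trace: sequence of booleans, one per clock cycle; True means the
                  register bank must capture new data in that cycle.
    Without gating, every register is clocked in every cycle. With gating,
    the clock only reaches the registers in cycles where the enable is True.
    """
    if gated:
        active_cycles = sum(1 for enabled in enable_trace if enabled)
    else:
        active_cycles = len(enable_trace)
    return active_cycles * num_registers


def gating_savings_pct(enable_trace, num_registers):
    """Percentage of clock-pin events removed by gating for this trace."""
    ungated = clock_pin_events(enable_trace, num_registers, gated=False)
    gated = clock_pin_events(enable_trace, num_registers, gated=True)
    return 100.0 * (ungated - gated) / ungated


if __name__ == "__main__":
    # Hypothetical trace: a 32-bit register bank enabled in 2 of 10 cycles.
    trace = [True, False, False, False, True, False, False, False, False, False]
    print(f"{gating_savings_pct(trace, 32):.0f}% fewer clock-pin events")
```

The model suggests why a coarse level and a fine level of gating complement each other: the saving depends on how often each group of registers is genuinely idle.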
[Figure: bar chart titled "Effect of clock gating on Cortex-A9", comparing frequency, cell area, leakage power and dynamic power with and without clock gating (CG); vertical axis marked at 0.5, 1.5 and 2.5]

Figure 4: Dynamic power optimization through clock gating on ARM Cortex-A9


MSV and Operating Voltage Exploration through Library Choice


Unlike implementing PSO, the power savings that can be achieved by optimizing
the library and process selection have no direct impact on the logical function of the
chip, although they will have an impact on timing and area. A couple of key lessons
have been learned when porting ARM designs to different libraries:
First, separate standard cell logic from RAM logic:
RAM timing and standard cell timing scale differently when changing voltage
or process
Since the change in timing is not uniform, different critical paths arise for
different libraries/processes
When possible, keep RAM-related logic and timing-critical logic in a separate
level of hierarchy, as this speeds up the debug process and is very beneficial if
RAM logic is to be implemented at a different voltage from the rest of the design
Then, use RTL Compiler to determine the best library to use for each domain:
Map only the timing-critical logic to the highest-power, highest-performance
library
Perform a scripted exploration of all possible library / domain mappings
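A scripted exploration of library / domain mappings amounts to enumerating every assignment of one library to each domain and launching a trial for each. The following Python sketch shows only the enumeration step; the domain and library names are hypothetical, and how each candidate is handed to the synthesis tool is left out because it depends on the local flow.

```python
from itertools import product


def library_domain_mappings(domains, libraries):
    """Yield every assignment of one library to each power domain."""
    for choice in product(libraries, repeat=len(domains)):
        yield dict(zip(domains, choice))


# Hypothetical domain and library names, for illustration only.
DOMAINS = ["core", "rams", "soc"]
LIBRARIES = ["std_v_nom_vt", "high_v_nom_vt", "low_v_nom_vt"]

if __name__ == "__main__":
    mappings = list(library_domain_mappings(DOMAINS, LIBRARIES))
    print(f"{len(mappings)} candidate mappings")
    for mapping in mappings[:3]:
        print(mapping)
```

The number of candidates grows as the number of libraries raised to the number of domains, so in practice the list would likely be pruned, for example by fixing the timing-critical domain to the highest-performance library first.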
In this example, to optimize for library choice and operating voltage with MSV,
the Cortex-A9 multicore processor was synthesized with five different libraries, all
using the same 45nm process:
Standard voltage, nominal Vt
High voltage, nominal Vt
Low voltage, nominal Vt
High voltage, high Vt
Low voltage, low Vt
Frequency was compared against area, static power and dynamic power. The
following charts show the bounds on the area and power achievable at
different frequencies.


In the figure below, showing performance versus cell area impact, notice the hockey
stick curve of diminishing returns at higher frequencies.
[Figure: cell area versus frequency (axis marked from 150 to 450) for the base, High V, High V (HVt), Low V and Low V (LVt) libraries. Callouts: "Large area shows low voltage library struggles to meet timing. Design is over-constrained"; "Rapid area increase as push for max frequency"; "Choose between four libraries"]

Figure 5: Cell area optimization library choice

The same hockey stick phenomenon is also reflected in the static power versus
frequency graph shown below.
[Figure: static power versus frequency (axis marked from 150 to 450) for the base, High V, High V (HVt), Low V and Low V (LVt) libraries. Callouts: "Low Voltage with low Vt cells has highest leakage current"; "Increase in leakage caused by increased area"]

Figure 6: Static power optimization library choice



And the following figure shows dynamic power and frequency in the context of
library choices. Note that the effect of low voltage with low Vt cells is called out
in the center section, large red circle, and that the reduction in dynamic power
by lowering voltage is shown in the small red circle, also in the center section of
the figure.
[Figure: dynamic power versus frequency (axis marked from 150 to 450) for the base, High V, High V (HVt), Low V and Low V (LVt) libraries. Callouts: "Dynamic power is equal despite lower voltage - due to over-constraining"; "Low Voltage with low Vt cells gives best dynamic power"; "Rapid power increase as push for max frequency"; "Reduction in dynamic power just by lowering voltage"]

Figure 7: Dynamic power optimization library choice

The choice of whether to optimize for frequency, area, static power or dynamic power
is extremely design dependent: it depends on the application and target market, on
the device itself, and on how the different modes of operation consume different
amounts of static and dynamic power.
As shown, it is possible to meet frequency goals and optimize for either static or
dynamic power just by selecting the correct library. This is fine if external logic also
runs at the same voltage. If not, then level shifters will be needed on I/O paths, and
a flow similar to the quick and dirty flow used for PSO evaluation can be run.
In Figure 7 we note that dynamic power can also be reduced by tuning the voltage.
In the experiments we saw a 12% saving in dynamic power, realized by shifting
from high voltage to nominal voltage without changing the operating frequency.
This requires access to different timing and power characterizations of the libraries.
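The relationship between supply voltage and dynamic power at a fixed frequency can be sketched with the idealized voltage-squared model from Figure 1. The chapter does not state the actual library voltages, so the values below are hypothetical; the measured 12% saving also reflects re-synthesis with a differently characterized library, which this simple model does not capture.

```python
import math


def dynamic_saving_pct(v_from, v_to):
    """Dynamic power saving when moving from v_from to v_to at fixed C and f.

    Uses the idealized model P_dyn proportional to V^2.
    """
    return 100.0 * (1.0 - (v_to / v_from) ** 2)


def voltage_ratio_for_saving(saving_pct):
    """Voltage ratio v_to / v_from that yields the given saving under the same model."""
    return math.sqrt(1.0 - saving_pct / 100.0)


if __name__ == "__main__":
    # Hypothetical library voltages: 1.1 V "high" and 1.0 V "nominal".
    print(f"1.1 V -> 1.0 V: {dynamic_saving_pct(1.1, 1.0):.1f}% saving")
    ratio = voltage_ratio_for_saving(12.0)
    print(f"a 12% saving corresponds to a voltage ratio of about {ratio:.3f}")
```

Under this model a 12% saving corresponds to a supply reduction of roughly 6%, which gives a feel for how sensitive dynamic power is to modest voltage changes.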


ARM Intelligent Energy Manager (IEM)

Based on what we have just seen, a 12% saving just from changing the
voltage, the effects of DVFS can be evaluated to save energy across multiple blocks
and for varying modes of operation.
The figure below shows three separate tasks that have to be done by the processor. It
illustrates that there is slack time between tasks 1, 2 and 3 where nothing new needs
to happen. The key concept is to run the design just fast enough to meet application
deadlines and no faster.
Since the operating system knows the deadlines, it knows it can take longer to do
each task, with the goal of running each task as slowly as possible while still meeting
performance goals. With DVFS, as enabled through ARM IEM, the design can run at
a reduced frequency and at a reduced voltage, which saves power due to the
voltage-squared effect on dynamic power.

[Figure: voltage versus time for Task 1, Task 2 and Task 3, each run at full voltage and followed by idle time. Callouts: "Running fast and then idling wastes energy"; "Reduce voltage"; "Run task in available time"; "Run task slow as possible"; "Only need to run just fast enough to meet the application deadlines"]

Figure 8: Energy without ARM IEM

The figure below shows the same three tasks with DVFS using ARM IEM:
Task 1 can take much longer, running very slowly at a much lower voltage, which
is quite energy-efficient
Task 2 has a medium application deadline, so it can run moderately slowly with
a slightly reduced voltage and be moderately energy-efficient
Task 3 requires high performance and so a relatively high voltage


The dotted black line shows the original energy consumed, and the solid black line
the energy used when DVFS was enabled.

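[Figure: voltage versus time for Task 1, Task 2 and Task 3 with DVFS, each task stretched to fill its available time at a reduced voltage, with the energy saved marked against the original profile. Callouts: "Running fast and then idling wastes energy"; "Reduce voltage"; "Energy saved"; "Run task in available time"; "Run task slow as possible"; "Only need to run just fast enough to meet the application deadlines"]

Figure 9: Energy with ARM IEM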

The net result is energy savings: not power reduction, but energy reduction.
The energy benefit labeled at the far right side of the figure shows that the design
has done the same amount of work with less energy. This translates into the
all-important competitive specification of battery life.
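The "just fast enough" idea can be sketched as picking the slowest operating point that still meets a task's deadline and comparing its dynamic energy with running at the fastest point. This is a simplified illustration, not ARM IEM itself: the operating points and effective capacitance are hypothetical, and leakage and idle energy are ignored, so only the voltage-squared effect on dynamic energy is modeled.

```python
# Hypothetical (frequency in MHz, voltage in V) operating points, slowest first.
OPERATING_POINTS = [(100.0, 0.8), (200.0, 1.0), (400.0, 1.2)]
C_EFF = 1e-9  # assumed effective switched capacitance per cycle, in farads


def slowest_point_meeting(cycles, deadline_s, points=OPERATING_POINTS):
    """Return the slowest (freq_mhz, volts) point that finishes before the deadline."""
    for freq_mhz, volts in points:
        if cycles / (freq_mhz * 1e6) <= deadline_s:
            return freq_mhz, volts
    raise ValueError("deadline cannot be met at any operating point")


def dynamic_energy_j(cycles, volts, c_eff=C_EFF):
    """Dynamic energy for a task: C * V^2 per cycle, independent of frequency."""
    return c_eff * volts ** 2 * cycles


def dvfs_saving_pct(cycles, deadline_s, points=OPERATING_POINTS):
    """Energy saved by DVFS relative to always running at the fastest point."""
    _, v_max = points[-1]
    _, v_dvfs = slowest_point_meeting(cycles, deadline_s, points)
    e_fast = dynamic_energy_j(cycles, v_max)
    e_dvfs = dynamic_energy_j(cycles, v_dvfs)
    return 100.0 * (e_fast - e_dvfs) / e_fast


if __name__ == "__main__":
    for deadline in (0.020, 0.006, 0.004):
        point = slowest_point_meeting(1_000_000, deadline)
        saving = dvfs_saving_pct(1_000_000, deadline)
        print(f"deadline {deadline * 1e3:.0f} ms -> {point}, saving {saving:.1f}%")
```

In this model the number of cycles is fixed by the work to be done, so the energy saving comes entirely from the lower voltage that the slower frequency permits, which matches the point that the benefit is an energy reduction for the same work.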

ARM-Cadence Reference Methodology for ARM1176JZF-S Processor with IEM
The ARM-Cadence reference flow for the ARM1176JZF-S processor with IEM
demonstrates how DVFS can be implemented in an automated, top-down flow from
RTL to GDSII.
In the case of the ARM1176JZF-S processor, the RTL hierarchy matches the power
domains that are specified in the CPF file, which is also used to indicate where level
shifters and isolation cells should be inserted into the design.
The reference flow also makes use of the support within SoC Encounter for the tri-lib
ECSM flow. This shows how it is possible to optimize for any voltage by accurately
modelling the effect of voltage changes on final timing.
It is also worth noting that the introduction of DVFS now enables the processor to
run at many speeds, which are dynamically variable. The other logic around the
processor must also be able to interface with it. In the case of the ARM1176JZF-S
processor, the AMBA 3 AXI interface supports both a synchronous and an
asynchronous mode. This handshaking is required and must be addressed in the
logic functionality itself in order to implement DVFS.
[Figure: block diagram of the hardened core. Blocks shown: VRAMS domain (TCM and cache RAMs); VCORE domain (ARM core); VSOC domain (ACLK); level-shifter/clamp and clamp cells; IEM sync/async interfaces]

Figure 10: ARM low-power architecture

ARM1176JZF-S Synthesis Flow Using CPF for DVFS


Easing complexity, CPF also makes a difference in the synthesis flow.
Looking at the yellow arrows, from top to bottom, CPF is used respectively to:
Read in libraries
Define power domains
Run consistency checks between RTL and CPF
Insert the low power logic (commit)
Define reporting, which is now done per power mode


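[Figure: synthesis flow diagram. Inputs: multi-Vth *.lib, multi-voltage *.lib, RTL_files.v, CPF, SDC. Steps: import RTL; set up MSV, multi-mode and power constraints; top-down multi-voltage / multi-mode synthesis; connect scan chains and incremental synthesis; insert low-power logic; analysis / output. Outputs: Gate.v, SDC. The corresponding RTL Compiler commands are listed below.]

# Read in the libraries referenced by the CPF file
read_cpf -library CPF_file
# Import and elaborate the RTL
read_hdl $HDL_files
elaborate
# Define power domains, then check consistency between RTL and CPF
read_cpf CPF_file
check_cpf -all -detail
set_attribute max-leakage-power <value>
synthesize -to_mapped -effort high
connect_scan_chains
synthesize -to_mapped -effort high -incr
# Insert the low-power logic
commit_cpf
# Report per power mode; $modes is assumed to hold the list of mode names
foreach mode $modes {
    report timing -mode $mode
    report power -mode $mode
    write_sdc -mode $mode
}

Figure 11: ARM1176JZF-S synthesis flow using CPF for DVFS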

As shown, using DVFS does not affect the main body of the flow. A few simple steps
change at the start and at the end, but the main synthesis flow does not.
The conclusion is that the DVFS technique, while not particularly invasive to the
synthesis flow or RTL coding / verification, offers the potential for great savings.

Power Savings in Multicore Processors


Multicore and Multi-processor Designs
Multicore processors are becoming increasingly common. The latest processor from
ARM is the Cortex-A9 MPCore processor, a multi-core design which enables both
performance and power improvements over single-core designs.
The Cortex-A9 processor is the current project for joint collaboration between ARM
and Cadence, and its architecture is shown in the figure below.


[Figure: block diagram of the Cortex-A9 MPCore. Four Cortex-A9 CPUs, each with FPU/NEON, a PTM interface, an I-cache and a D-cache, are shown with the ARM CoreSight multicore debug and trace architecture. The Snoop Control Unit (SCU) is shown with generic interrupt control and distribution, cache-to-cache transfers, snoop filtering, timers and the accelerator coherency port. The advanced bus interface unit provides a primary AMBA 3 64-bit interface and an optional second interface with address filtering.]

Figure 12: Cortex-A9 MPCore architecture

Similarly, the physical layout of a two-core implementation of the Cortex-A9 is
shown below, clustering the CPU, data engine, data cache, instruction cache, etc.

Figure 13: Floorplan of two-core build of Cortex-A9 MPCore


Power Savings in Multi-Processor Designs


In a holistic approach to power, the first step is to jointly investigate which techniques
should be applied based on performance and power trade-offs:
MSV to speed up critical paths, and save power elsewhere
PSO of individual cores, when overall processor demand drops
DVFS for individual cores
The resulting flows for multi-processor designs are jointly developed and tested by
both ARM and Cadence. There is already a Cadence flow for every ARM processor.
The low-power IEM enabled ARM1176JZF-S processor flow was released last year.
New flows for the Cortex-M3 processor and Cortex-A9 multicore processor that use
advanced low power techniques are currently being jointly developed.

Conclusions

Adding support for advanced low-power design early in the flow can impact
the area, power, performance and success of your designs. Power intent should
be considered early, during RTL coding. With Cadence RTL Compiler and CPF, a
designer can quickly explore the impact of different low-power techniques to find
the best solution.
Examples of successful deployment of MSV, PSO and DVFS were discussed and
demonstrated on ARM processors. Quantified power savings were realized with
minimal complexity, area or performance tradeoffs.
The risks and complexity of low-power design are significantly offset by using a
production-proven flow, an example of which is the work done collaboratively
by ARM and Cadence to provide low-power functionality to the latest ARM
processors, including the Cortex family and new multi-processor designs.
________________________
Acknowledgements and thanks to ARM for the ARM IEM information and graphics, and for their ongoing
efforts on the joint CPF flow projects.

________________________
David Weir, Lead Design Engineer, Cadence Design, studied at Edinburgh University, Scotland, where he received
a joint honors bachelors degree in Computer Science and Electronics. Having used Cadence tools for more
than 10 years, he has experience in all stages of digital design, from RTL coding, verification, synthesis, and test
insertion, through layout, timing closure, and final signoff timing and physical checks run at tapeout. Currently
he is working on joint projects with ARM, focusing on high performance flows for their largest processors.



