LPG Sect12 06052009
Guide to Low-Power Design
User Experience with CPF
When Do You Know You Have Saved Enough Power?
from any of the three parameters on the table, whether performance, power or area
(PPA). Also, this power reduction must be traded off against increased complexity in
the design cycle. Increased complexity can potentially add months of effort to resolve
the implications of concurrent design verification, power domain characterization,
and timing closure.
A similar issue involves designers creating soft IP that is intended for reuse. How
can they find the best balance of performance, power and area for the same IP on
different projects, across multiple libraries and process technology nodes?
And how can designers improve the performance and reduce the power for re-used
blocks and derivative designs when RTL recoding is not an option?
Power Dissipation
The fundamentals of static and dynamic power dissipation are shown below:
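[Figure: Total Power Dissipation = Static Power Dissipation + Dynamic Power Dissipation. Static dissipation is due to the leakage current I_leak drawn from V_DD over time t; dynamic dissipation is due to the switching current I_switch and includes the C·V_DD^2 term.]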
Another critical issue is the necessity to save and restore the state of blocks which
have been powered down. The designer must identify which flip flops require state
retention and what technique should be used to save and restore the state.
There are two general-case solutions:
- For designs that must have low latency and high performance, the designer can use state retention flip-flops, but the penalty is a further increase in standard cell area.
- When the design can tolerate higher latency and performance is not an issue, the designer can save the state of the block(s) elsewhere; however, this affects the functional specification of the design, because the state save/restore logic must be added to the RTL alongside the functional logic.
                 Original    PSO (on)    PSO (off)
Leakage power    0.28        0.28        0.018
Total power      9.17        9.05        1.46
Area             400535      402201      402201
Frequency        100 MHz     100 MHz     100 MHz
In the second power architecture for the ARM Cortex-M3 processor, PSO is applied
to all sub-modules. The top level is an always-on domain.
For this case, there is only one entry to edit in the CPF file, an example of which is
shown below:
create_power_domain -name POWERDOWN \
-shutoff_condition {PWRUP} \
-instances { uCortexM3 uDAPSWJDP uCM3TPIU uCM3ROMTable }
The significant point is that this power architecture exploration with CPF is easy,
with low turnaround time, and little engineering effort involved. The designer can
always do the easy experiments and stop when it starts getting too complex, or
when unacceptable penalties arise.
Results:
- As shown below, there is a 0.5% area increase, from 368 isolation cells
- The payoff in this trade-off is the leakage power reduction circled in red: over 99% leakage reduction in power shutoff mode
- This is measured at the same process node as the previous example
                 Original    PSO (on)    PSO (off)
Leakage power    0.28        0.28        0.002
Total power      9.17        8.99        0.30
Area             400535      402538      402538
Frequency        100 MHz     100 MHz     100 MHz
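The percentages quoted for the two power architectures can be checked directly against the figures in the results tables. The sketch below is only an illustration: the dictionary layout, labels, and function names are assumptions made here, and the values carry whatever units the tables use (the source does not state them).

```python
# Values copied from the two PSO results tables; units are not stated in the source.
RESULTS = {
    "first table": {"leak_orig": 0.28, "leak_off": 0.018,
                    "area_orig": 400535, "area_pso": 402201},
    "second table": {"leak_orig": 0.28, "leak_off": 0.002,
                     "area_orig": 400535, "area_pso": 402538},
}


def leakage_reduction_pct(row):
    """Percentage of leakage saved in PSO (off) relative to the original design."""
    return 100.0 * (row["leak_orig"] - row["leak_off"]) / row["leak_orig"]


def area_increase_pct(row):
    """Percentage area added by the PSO implementation relative to the original."""
    return 100.0 * (row["area_pso"] - row["area_orig"]) / row["area_orig"]


def summarize(results=RESULTS):
    """Map each table label to (leakage reduction %, area increase %)."""
    return {name: (leakage_reduction_pct(row), area_increase_pct(row))
            for name, row in results.items()}


if __name__ == "__main__":
    for name, (leak, area) in summarize().items():
        print(f"{name}: leakage reduction {leak:.1f}%, area increase {area:.2f}%")
```

Running this kind of comparison for each candidate power architecture makes the trade-off explicit before any further engineering effort is committed.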
The conclusion for this example design was that more logic could be switched off
without significantly impacting area or frequency, so it became an easy decision to
choose option 2, which reduced leakage power further at almost no additional cost.
This analysis was done with very little engineering effort.
When a new piece of IP is created, this type of analysis can be performed to quickly
create a list of potential implementations and enable the end user to run the trials to
determine which gives the best result for their library and process.
[Figure: Frequency, cell area, leakage power, and dynamic power, each shown with and without clock gating (CG).]
In the figure below, showing performance versus cell area impact, notice the hockey
stick curve of diminishing returns at higher frequencies.
[Figure: Cell area versus frequency (150 to 450) for the library choices: base, High V, High V (HVt), Low V, Low V (LVT). Annotation: choose between four libraries.]
The same hockey stick phenomenon is also reflected in the static power versus
frequency graph shown below.
[Figure: Static power versus frequency (150 to 450) for the library choices: base, High V, High V (HVt), Low V, Low V (LVT). Annotation: increase in leakage caused by increased area.]
And the following figure shows dynamic power and frequency in the context of
library choices. Note that the effect of low voltage with low Vt cells is called out
in the center section, large red circle, and that the reduction in dynamic power
by lowering voltage is shown in the small red circle, also in the center section of
the figure.
[Figure: Dynamic power versus frequency (200 to 450) for the library choices: base, High V, High V (HVt), Low V, Low V (LVT).]
The choice of whether to optimize for frequency, area, static power, or dynamic power
is extremely design dependent and is driven by the application and target market. It
also depends on the device itself and on how its different modes of operation consume
different amounts of static and dynamic power.
As shown, it is possible to meet frequency goals and optimize for either static or
dynamic power just by selecting the correct library. This is fine if the external logic
also runs at the same voltage. If not, level shifters will be needed on the I/O paths, and
we run a flow similar to the quick and dirty flow used for PSO evaluation.
In Figure 7 we note that dynamic power can also be reduced by tuning the voltage.
In the experiments we saw a 12% savings in dynamic power, realized by shifting
from high voltage to nominal voltage, without changing the operating frequency.
This requires access to different timing and power characterizations of the libraries.
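A rough way to see where a saving of this size can come from is the usual first-order model in which dynamic power scales with the square of the supply voltage at a fixed frequency. The sketch below assumes that model; the function names are illustrative, and the two voltages are hypothetical values, not the ones used in the experiments.

```python
def dynamic_power_saving(v_from, v_to):
    """Fractional dynamic power saving when the supply moves from v_from to v_to.

    Assumes constant frequency and switched capacitance, so that dynamic
    power is proportional to the square of the supply voltage.
    """
    if v_from <= 0 or v_to <= 0:
        raise ValueError("voltages must be positive")
    return 1.0 - (v_to / v_from) ** 2


def voltage_ratio_for_saving(saving):
    """Voltage ratio v_to / v_from that yields the given fractional saving."""
    if not 0.0 <= saving < 1.0:
        raise ValueError("saving must be in [0, 1)")
    return (1.0 - saving) ** 0.5


if __name__ == "__main__":
    # Hypothetical supplies: 1.10 V "high" and 1.03 V "nominal".
    print(f"saving: {dynamic_power_saving(1.10, 1.03):.1%}")
    print(f"voltage ratio for a 12% saving: {voltage_ratio_for_saving(0.12):.3f}")
```

Under this model, a 12% dynamic power saving corresponds to a supply voltage roughly 6% lower, which is why a modest voltage change is worth evaluating when the characterized libraries are available.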
Given that a 12% savings results from changing the voltage alone, the effects of DVFS
can be evaluated to save energy across multiple blocks and for varying modes of
operation.
The figure below shows three separate tasks that have to be done by the processor. It
illustrates that there is slack time between tasks 1, 2, and 3 where nothing new needs
to happen. The key concept is to run the design just fast enough to meet application
deadlines and no faster.
Because the operating system knows the deadlines, it knows how long each task can
take, and the goal is to run each task as slowly as possible while still meeting
performance goals. With DVFS, as enabled through ARM IEM, the design can run at
a reduced frequency and a reduced voltage, which saves power because of the voltage-squared effect on dynamic power.
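The "just fast enough" policy can be sketched as choosing the slowest operating point that still meets the deadline. This is a minimal illustration and not the ARM IEM algorithm: the operating-point table, the cycle counts, and the function name are all hypothetical.

```python
from typing import NamedTuple


class OperatingPoint(NamedTuple):
    freq_mhz: float   # clock frequency in MHz (cycles per microsecond)
    voltage: float    # supply voltage in volts


# Hypothetical operating points, not taken from any real processor.
POINTS = (
    OperatingPoint(100.0, 0.9),
    OperatingPoint(200.0, 1.0),
    OperatingPoint(400.0, 1.2),
)


def slowest_point_meeting_deadline(cycles, deadline_us, points=POINTS):
    """Return the lowest-frequency point that completes `cycles` within `deadline_us`."""
    for point in sorted(points, key=lambda p: p.freq_mhz):
        runtime_us = cycles / point.freq_mhz
        if runtime_us <= deadline_us:
            return point
    raise ValueError("no operating point meets the deadline")


if __name__ == "__main__":
    # A hypothetical 20,000-cycle task with a relaxed, a medium, and a tight deadline.
    for deadline in (250.0, 150.0, 80.0):
        point = slowest_point_meeting_deadline(20_000, deadline)
        print(f"deadline {deadline} us -> {point.freq_mhz} MHz at {point.voltage} V")
```

The loop walks the points from slowest to fastest, so the first one that fits the deadline is also the one with the lowest voltage in this table.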
[Figure 8: Energy without ARM IEM. Energy versus time for Task 1, Task 2, and Task 3, with idle time between tasks. Annotations: reduce voltage; run task in available time; run task as slow as possible. Only need to run just fast enough to meet the application deadlines.]
The figure below shows the same three tasks with DVFS using ARM IEM:
- Task 1 can take much longer, running very slowly at a much lower voltage, which is quite energy-efficient
- Task 2 has a medium application deadline, so it can run moderately slowly at a slightly reduced voltage, with moderate energy efficiency
- Task 3 requires high performance, and so a relatively high voltage
The dotted black line shows the original energy consumed, and the solid black line
the energy used when DVFS was enabled.
[Figure 9: Energy with ARM IEM. Energy versus time for the same tasks, each run in its available time at a reduced voltage; the difference from the original is labeled "Energy Saved". Only need to run just fast enough to meet the application deadlines.]
The net result is energy savings: not power reduction, but energy reduction.
The energy benefit labeled at the far right of the figure shows that the design
has done the same amount of work with less energy. This translates into the all-important competitive specification of battery life.
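The distinction between power and energy can be made concrete with a first-order model: dynamic power scales as C·V²·f, while the run time for a fixed amount of work scales as 1/f, so the energy for the task depends on V² and the cycle count rather than on how fast it runs. The sketch below assumes that model and ignores leakage and idle power; every numeric value in it is hypothetical.

```python
def dynamic_power_w(c_eff_farads, voltage, freq_hz):
    """First-order dynamic power: effective switched capacitance * V^2 * f."""
    return c_eff_farads * voltage ** 2 * freq_hz


def task_energy_j(c_eff_farads, voltage, freq_hz, cycles):
    """Energy to complete a fixed number of cycles: power multiplied by run time."""
    runtime_s = cycles / freq_hz
    return dynamic_power_w(c_eff_farads, voltage, freq_hz) * runtime_s


if __name__ == "__main__":
    C_EFF = 1e-9          # hypothetical effective switched capacitance, in farads
    CYCLES = 1_000_000    # hypothetical amount of work for one task

    fast = task_energy_j(C_EFF, 1.2, 400e6, CYCLES)   # full speed, full voltage
    slow = task_energy_j(C_EFF, 0.9, 100e6, CYCLES)   # DVFS: slower and lower voltage
    print(f"energy at full speed: {fast:.3e} J")
    print(f"energy with DVFS:     {slow:.3e} J")
    print(f"energy saved:         {1 - slow / fast:.1%}")
```

In this model, lowering the frequency alone leaves the task energy unchanged; the saving comes from the lower voltage that the lower frequency permits.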
processors must also be able to interface with it. In the case of the ARM1176JZF-S
processor the AMBA 3 AXI interface supports both a synchronous and an
asynchronous mode. This handshaking is required and must be addressed in the
logic functionality itself in order to implement DVFS.
[Figure: Hardened core containing the ARM core, with VCORE, VRAMS, and VSOC supplies, L-shift/Clamp and Clamp cells at the boundaries, and the ACLK clock.]
[Figure: Synthesis flow with DVFS. Inputs: multi-Vth *.lib, multi-voltage *.lib, RTL_files.v, CPF, SDC. Step: Import RTL. Outputs: Gate.v, SDC, analysis/output.]
As shown, using DVFS does not affect the main body of the flow. A few simple steps
change at the start and at the end, but the main synthesis flow does not.
The conclusion is that the DVFS technique, while not particularly invasive to the
synthesis flow or RTL coding / verification, offers the potential for great savings.
[Figure: Cortex-A9 MPCore with four Cortex-A9 CPUs, each with FPU/NEON, PTM I/F, I-Cache, and D-Cache, plus cache-to-cache transfers, snoop filtering, timers, and the Accelerator Coherency Port.]
Conclusions
Adding support for advanced low-power design early in the flow can impact
the area, power, performance and success of your designs. Power intent should
be considered early, during RTL coding. With Cadence RTL Compiler and CPF, a
designer can quickly explore the impact of different low-power techniques to find
the best solution.
Examples of successful deployment of MSV, PSO and DVFS were discussed and
demonstrated on ARM processors. Quantified power savings were realized with
minimal complexity, area or performance tradeoffs.
The risks and complexity of low-power design are significantly offset by using a
production-proven flow, an example of which is the work done collaboratively
between ARM and Cadence, to provide low-power functionality to the latest ARM
processors, including the Cortex family and new multi-processor designs.
________________________
Acknowledgements and thanks to ARM for the ARM IEM information and graphics, and for their ongoing
efforts on the joint CPF flow projects.
________________________
David Weir, Lead Design Engineer, Cadence Design, studied at Edinburgh University, Scotland, where he received
a joint honors bachelors degree in Computer Science and Electronics. Having used Cadence tools for more
than 10 years, he has experience in all stages of digital design, from RTL coding, verification, synthesis, and test
insertion, through layout, timing closure, and final signoff timing and physical checks run at tapeout. Currently
he is working on joint projects with ARM, focusing on high performance flows for their largest processors.