Advances in big.LITTLE Technology for Power and Energy Savings
September 2012
ARM big.LITTLE processing is an energy-saving technique in which high-performance CPUs and
efficiency-tuned CPUs are paired together in a cache-coherent combination, with software execution
dynamically transitioned to the appropriate CPU based on performance needs. big.LITTLE processing
was first introduced in 2011, and ARM has now tested silicon devices employing big.LITTLE. This paper
discusses the measured power savings and performance capabilities of big.LITTLE systems, and the
system settings and tuning that provide optimal results. It also discusses the various software
options available for taking advantage of energy-saving opportunities on a big.LITTLE SoC, and the
system hardware options for SoCs that support big.LITTLE technology.
Contents
Introduction
big.LITTLE Hardware
big.LITTLE Migration
big.LITTLE MP
Conclusion
Page 1 of 11
Introduction
The performance demanded by users of current smartphones and tablets is increasing at a much faster
rate than the capacity of batteries or the power savings from semiconductor process advances. At the
same time, users are demanding longer battery life within roughly the same form factor. This conflicting
set of demands requires innovations in mobile SoC design beyond what process technology and
traditional power management techniques can deliver.
The usage pattern for smartphones and tablets is quite dynamic. Periods of high-intensity processing
tasks, such as gaming and web browsing, alternate with typically longer periods of low-intensity
tasks such as texting, e-mail and audio playback. big.LITTLE processing takes advantage of this variation
in required performance by combining two very different processors together in a single SoC. The big
processor is designed for maximum performance within the mobile power budget. The LITTLE processor
is designed for maximum efficiency and high enough performance to address all but the most intense
periods of work.
In the first big.LITTLE SoCs, the big processors are ARM Cortex-A15, and the LITTLE processors
are Cortex-A7. Together they create a system that can accomplish both high intensity and low intensity
tasks in the most energy efficient manner. By coherently connecting the Cortex-A15 and Cortex-A7
processors via the CCI-400 cache coherent interconnect, the system is flexible enough to support a
variety of big.LITTLE use models, which can be tailored to the processing requirements of the tasks.
Since introducing big.LITTLE in 2011, ARM has been working with partners to optimize and tune
big.LITTLE software, and has built a test chip capable of measuring the performance and power of
big.LITTLE while running typical mobile workloads.
big.LITTLE Hardware
The central tenet of big.LITTLE is that the big and LITTLE processors are architecturally identical. Both
Cortex-A15 and Cortex-A7 processors implement the full ARMv7-A architecture, including the Virtualization
and Large Physical Address Extensions. Accordingly, all instructions will execute in an architecturally
consistent way on both Cortex-A15 and Cortex-A7 processors, albeit at different performance levels.
The implementation-defined feature set of Cortex-A15 and Cortex-A7 processors is also similar. Both
processors can be configured to have between one and four cores and both integrate a level-2 cache
inside the processing cluster. Additionally, each processor implements a single AMBA 4 coherent
interface that can be connected to a coherent interconnect such as CCI-400.
The micro-architectures of the two processors are quite different. The Cortex-A7 (Figure 1) is an in-order,
non-symmetric dual-issue processor with a pipeline length of 8 to 10 stages. The Cortex-A15 (Figure 2)
is an out-of-order, sustained triple-issue processor with a pipeline length of 15 to 24 stages.
[Figure 1: Cortex-A7 pipeline. Figure 2: Cortex-A15 pipeline. The diagrams show the fetch, decode, issue, integer, multiply, floating-point/NEON and load/store stages of each processor, annotated with the lowest operating point.]
The energy consumed by the execution of an instruction is partially related to the number of pipeline
stages it must traverse. Therefore, a significant difference in energy consumption between Cortex-A15
and Cortex-A7 comes from the different pipeline complexity. Across a range of benchmarks, the Cortex-A15
delivers roughly 2x the performance of the Cortex-A7 per unit MHz, and the Cortex-A7 is roughly 3x
as energy efficient as the Cortex-A15 in completing the same workloads. The graph below compares the
performance of the Cortex-A15, Cortex-A7, and Cortex-A9 CPU cores at measured and expected
production frequencies.
[Graph: relative performance of the Cortex-A15 (1.2GHz) and Cortex-A9 (1.2GHz) cores.]
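The 2x performance and 3x efficiency figures above can be turned into a back-of-the-envelope model. The sketch below normalizes the Cortex-A7 to 1.0 on both axes; the workload and clock numbers are illustrative assumptions, not measured data.

```python
# Back-of-the-envelope model of the trade-off described above, with the
# Cortex-A7 normalized to 1.0 on both axes. Only the ~2x per-MHz
# performance and ~3x energy-efficiency ratios come from the text.
A7 = {"perf_per_mhz": 1.0, "energy_per_unit_work": 1.0}
A15 = {"perf_per_mhz": 2.0, "energy_per_unit_work": 3.0}

def completion_time(core, work, mhz):
    """Abstract time units to complete `work` at clock `mhz` on `core`."""
    return work / (core["perf_per_mhz"] * mhz)

def energy(core, work):
    """Abstract energy units to complete `work` (frequency-independent here)."""
    return work * core["energy_per_unit_work"]

# Same workload at the same clock: the A15 finishes in half the time
# but consumes three times the energy.
t_a7, t_a15 = completion_time(A7, 1000, 1000), completion_time(A15, 1000, 1000)
e_a7, e_a15 = energy(A7, 1000), energy(A15, 1000)
```

This is exactly the trade-off big.LITTLE exploits: when the halved completion time is not user-visible, running on the LITTLE core recovers the 3x energy difference.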
To create a compelling big.LITTLE solution, the system around the processors must also be considered. A typical
big.LITTLE processor subsystem is diagrammed below, based on the CCI-400 interconnect and a global
interrupt controller.
big.LITTLE processing was originally conceived to make use of two primary use models: big.LITTLE
Migration, and big.LITTLE MP. The software models differ mainly in the way they allocate work to big or
LITTLE cores during runtime execution of a workload. The system hardware described above can support
either of these main categories of software.
big.LITTLE Migration
In the big.LITTLE migration software models, the fundamental idea is that the OS kernel scheduler is
unaware of the big and LITTLE cores, and the DVFS power management software residing in kernel space
controls the migration of software context between cores. This software model is a natural extension of the
Dynamic Voltage and Frequency Scaling (DVFS) operating points provided by current mobile platforms,
which allow the OS to match the performance of the platform to the performance required by the
application. In today's smartphone SoCs, DVFS drivers such as cpufreq sample the OS performance at
regular and frequent intervals, and the DVFS governor decides whether to shift to a higher or lower
operating point or remain at the current operating point.
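The sample-and-decide loop described above can be sketched as a minimal governor model. This is illustrative only; the operating points and thresholds are invented for the example and do not come from any real cpufreq driver.

```python
# Illustrative model of a DVFS governor: sample CPU load at regular
# intervals and step between discrete operating points (frequency in MHz).
OPERATING_POINTS = [350, 700, 1000]  # hypothetical LITTLE-cluster points

def next_operating_point(current_index, load, up_threshold=0.85, down_threshold=0.30):
    """Return the index of the operating point to use for the next interval."""
    if load > up_threshold and current_index < len(OPERATING_POINTS) - 1:
        return current_index + 1   # demand is high: raise frequency
    if load < down_threshold and current_index > 0:
        return current_index - 1   # demand is low: save power
    return current_index           # stay at the current point

# A bursty load trace drives the governor up during the burst and
# back down when the demand subsides.
idx = 0
for load in [0.1, 0.9, 0.95, 0.2, 0.1]:
    idx = next_operating_point(idx, load)
```

In a real driver the same decision is made per sampling interval; a big.LITTLE migration driver simply extends the table beyond the top LITTLE point, as the next section describes.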
These operating points affect the voltage and frequency of a single CPU cluster; however, in a big.LITTLE
system there are two CPU clusters with independent voltage and frequency domains. This allows the big
cluster to act as a logical extension of the 3 to 5 DVFS operating points provided by the LITTLE processor
cluster. In a big.LITTLE system under a migration mode of control, while Cortex-A7 is executing, the
DVFS driver can tune the performance of the CPU cluster to higher levels. Once Cortex-A7 is at its
highest operating point, if more performance is required, a migration can be invoked that picks up the OS
and applications and moves them to Cortex-A15. This allows low- and medium-intensity applications to be
executed on Cortex-A7 with better energy efficiency than Cortex-A15 can achieve, while the high-intensity
applications that characterize some of today's apps can execute on Cortex-A15.
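This logical extension of the DVFS table can be sketched as follows. The frequencies are hypothetical; only the idea of a few LITTLE operating points extended by big-cluster points comes from the text. Note that because Cortex-A15 delivers roughly twice the per-MHz performance, a big-cluster entry can sit above the LITTLE cluster's top point even at a lower raw clock.

```python
# Hypothetical logical operating-point table, ordered by delivered
# performance. Entries are (frequency_mhz, cluster); crossing from an
# "A7" entry to an "A15" entry implies a cluster migration.
TABLE = [
    (350, "A7"), (700, "A7"), (1000, "A7"),   # LITTLE-cluster DVFS points
    (800, "A15"), (1200, "A15"),              # big cluster extends the table
]

def plan_transition(current_index, target_index):
    """Return (frequency, cluster, migration_needed) for a DVFS change."""
    freq, cluster = TABLE[target_index]
    migrate = TABLE[current_index][1] != cluster
    return freq, cluster, migrate
```

From the governor's point of view nothing changes: it still walks up and down one table. The migration machinery is invoked only when a transition crosses the cluster boundary.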
There are in fact two types of migration: CPU Migration and Cluster Migration. The initial exploration of
big.LITTLE used cluster migration. With this approach, the entire context was migrated from all running
Cortex-A7 CPUs to the same number of Cortex-A15 CPUs, and vice versa. However, there are
frequently cases where the loading on a single CPU is high, but the loading on additional CPUs in the
cluster is low. Migrating the entire multi-core context to the big cluster would be inefficient in that case.
Fortunately, existing DVFS mechanisms typically sample the loading for each core in a multi-core system.
This provides the opportunity to do migration at a finer granularity, that is, for each individual CPU in the
system. This mode of operation is called CPU migration; in this mode each LITTLE CPU is logically
paired with a big CPU. The OS scheduler sees each pair as a single logical CPU, but the big.LITTLE
software can migrate the execution context between the big and the LITTLE CPUs to match current
performance demand. The diagram below shows an example of the migration of CPU context that can
occur under CPU migration, where the tasks from one Cortex-A7 CPU are migrated to a single
Cortex-A15 CPU, while the tasks running on the other Cortex-A7 CPU remain on that cluster, to better match
the performance demands of each set of tasks.
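The pairing described above can be modeled with a small sketch (illustrative Python; the class name and thresholds are invented). Because each logical CPU tracks its own load, only the busy pair migrates to its big core.

```python
class LogicalCPU:
    """One logical CPU as seen by the scheduler: a Cortex-A7/Cortex-A15
    pair, only one of which executes the context at any time."""
    def __init__(self):
        self.active = "A7"  # start on the energy-efficient core

    def update(self, load, up=0.9, down=0.4):
        # Migrate up when the LITTLE core saturates, down when demand drops.
        if self.active == "A7" and load > up:
            self.active = "A15"
        elif self.active == "A15" and load < down:
            self.active = "A7"
        return self.active

# Per-CPU loads are sampled independently, so only the loaded pair
# migrates; the lightly loaded pair stays on its LITTLE core.
pairs = [LogicalCPU(), LogicalCPU()]
states = [cpu.update(load) for cpu, load in zip(pairs, [0.95, 0.2])]
# states == ["A15", "A7"]
```

This is the essential difference from cluster migration, where both pairs would have moved to the big cluster together.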
An important consideration in a big.LITTLE system is the time it takes to migrate the execution context
between the Cortex-A15 cluster and the Cortex-A7 cluster. If it takes too long, it may become
noticeable to the operating system, and the system power cost may outweigh the benefit of migration for some
time. Therefore, the Cortex-A15/Cortex-A7 system is designed to migrate tasks in around 30,000 to 50,000
cycles, or 30 to 50 microseconds with the processors operating at 1GHz.
One of the reasons the migration can be so fast is that the amount of processor state involved is relatively
small. The processor that is going to be turned off, which is termed the outbound processor, must have
all of the integer and Advanced SIMD register files contents saved along with the entire CP15
configuration state. The processor that is going to resume execution, which is termed the inbound
Page 5 of 11
processor, must then restore all of the state saved from the outbound processor. Additionally, any active
interrupts that are being controlled by the GIC-400 must be also migrated. Around 2,000 instructions are
required to achieve save-restore and because the two processors are architecturally identical, there is a
one-to-one mapping between state registers in the inbound and outbound processors. Coherency is
clearly a critical enabler in achieving a fast migration time, as it allows the state that has been saved on
the outbound processor to be snooped and restored on the inbound processor rather than going via main
memory. Additionally, because the level-2 cache of the outbound processor is coherent, it can remain
powered up after a migration to improve the cache warming time of the inbound processor through
snooping of data values. The outbound processor, however, can be powered down. When all processors
in a cluster have powered down, due to migration or other reasons such as idle or hot-plug, the level-2
cache can be cleaned and powered off to save leakage power. When to shut down the outbound L2 cache is SoC-
specific; the decision can be assisted by cache hit counters in the CCI-400 interconnect, and this is
one example of a tuneable setting in a big.LITTLE SoC.
It should be observed that normal execution of the thread occurs during the migration process. The only
black out period is during the CPU Migration when interrupts are disabled and state is transferred from
the outbound to the inbound processor.
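The save-and-restore sequence described above can be outlined in a short sketch. This is an illustrative model only: the dictionary keys stand in for the integer and Advanced SIMD register files and the CP15 configuration state, and the GIC-400 handling is reduced to a single routing field.

```python
def migrate_context(outbound, inbound, gic):
    """Model of a CPU migration: save state on the outbound processor,
    re-route interrupts, restore on the inbound processor, then allow
    the outbound processor to power down."""
    # Interrupts are disabled only during this brief black-out period.
    saved = {k: outbound[k] for k in ("integer_regs", "simd_regs", "cp15")}
    gic["route_to"] = "inbound"        # migrate active interrupts (GIC-400)
    inbound.update(saved)              # one-to-one register mapping
    outbound["powered"] = False        # outbound CPU may now power down
    return inbound

outbound = {"integer_regs": [1, 2], "simd_regs": [3], "cp15": {"ttbr": 7},
            "powered": True}
inbound = {"powered": True}
gic = {}
inbound = migrate_context(outbound, inbound, gic)
```

In real silicon the restore happens over the coherent interconnect, so the saved state is snooped from the outbound cluster's caches rather than fetched from main memory, which is what keeps the sequence within the 30 to 50 microsecond budget.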
big.LITTLE MP
Since a big.LITTLE system containing Cortex-A15 and Cortex-A7 is fully coherent through CCI-400,
another logical use model is to allow both Cortex-A15 and Cortex-A7 to be powered on and
simultaneously executing code. This is termed big.LITTLE MP, which is fully heterogeneous scheduling.
Whether a big processor needs to be powered on is determined by performance requirements of tasks
currently executing. If there are demanding tasks, then a big processor can be powered on to execute
them. Low demand tasks can execute on a LITTLE processor. Finally, any processors that are not being
used can be powered down. This ensures that cores, big or LITTLE, are only active when they are
needed, and that the appropriate core is used to execute any given workload.
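The power-on policy described above can be sketched as a simple load-based placement (illustrative Python; the threshold and task names are invented):

```python
def place_tasks(tasks, demand_threshold=0.6):
    """Assign each (name, demand) task to a cluster: demanding tasks go
    to big (Cortex-A15), the rest to LITTLE (Cortex-A7). A big core is
    powered on only if at least one task lands there."""
    placement = {name: ("A15" if demand > demand_threshold else "A7")
                 for name, demand in tasks}
    big_needed = any(cluster == "A15" for cluster in placement.values())
    return placement, big_needed

placement, big_on = place_tasks([("game", 0.9), ("email", 0.1)])
# placement == {"game": "A15", "email": "A7"}, big_on == True
```

A production scheduler tracks per-thread load over time rather than a static demand value, but the shape of the decision is the same: the big cluster draws power only while some thread justifies it.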
big.LITTLE MP is compelling because it enables threads to be executed on the processing resource that
is most appropriate. Compute-intensive threads that require significant amounts of processing
performance, because their output is user-visible, can be allocated to Cortex-A15. Threads that are I/O-heavy or
that do not produce a result that is time-critical to the user can be executed on Cortex-A7.
A simple example of a non-time-critical thread is one associated with e-mail updates. While web
browsing, the user will want e-mail updates to continue, but it does not matter whether they are done at
Cortex-A15 performance levels or Cortex-A7 performance levels. Since Cortex-A7 is a more energy-efficient
processor, it makes more sense to take a little longer but consume less energy.
Finally, as a fully coherent system can create a significant volume of coherent transactions, Cortex-A15,
Cortex-A7 and CCI-400 have been designed to cope with worst case snooping scenarios. This includes
the case where a Mali-T604 GPU is connected to one of the I/O-coherent CCI-400 ports and every
transaction snoops Cortex-A15 and Cortex-A7 at the same time as Cortex-A15 and Cortex-A7
are snooping each other.
It is clear from the graph above that the applications processors spend a considerable portion of time in
lower-frequency states across several common workloads. In a big.LITTLE system, the SoC would have
the opportunity to run all but the dark red portions of the work on a lower-power Cortex-A7 CPU. In the
following graph, more intense workloads are analyzed in the same way, and even in these cases there is
significant opportunity to map frequencies below 1GHz to a Cortex-A7 processor, which is known to
provide performance per clock within 5 to 10% of the Cortex-A9.
It bears mentioning that this set of results is fairly early; the results come from an early version of the big.LITTLE
MP patchset, which modifies the Linux scheduler away from a completely fair and balanced scheduling
model towards a big.LITTLE model. We expect performance and power improvements as the software is
refined and additional tuneable elements are explored. Another point worth mentioning is the lack of a
GPU in the test chip; this led to higher CPU loading than would exist in a system with a GPU for offload,
and in situations where CPU loading is lower it is possible to make greater use of the LITTLE cores and
thereby save more energy. The test chip also has a rudimentary set of voltage and frequency operating points, and no
ability to power-gate individual cores, so production big.LITTLE SoCs are expected to deliver even better
results. Background tasks, for example, are already exhibiting greater than 70% energy savings.
management code means that years of development and test underpin the implementation. With no
modifications to the kernel scheduler, it is simpler in scope than MP. In sum, CPU migration is a great
solution for products in the first half of 2013 and beyond.
big.LITTLE MP has several technical advantages, but is less mature today; it is currently in development,
with promising early test results as shown in this paper. big.LITTLE MP can make use of all cores in the
system, as asymmetric topology support is standard with no software modifications. It offers greater
opportunity for performance and power benefits. As an example, it can use all cores simultaneously for
greater performance, or tune DVFS settings and scheduler settings differently on big and LITTLE for
greater power savings. It allows fine-grained selection of cores by the scheduler, and generally offers
more tuning parameters. This greater flexibility does come at a cost, as more tuning is required to extract
the full performance and power benefits from a big.LITTLE MP platform.
Finally, big.LITTLE MP is maturing quickly but is not yet ready for production. It is targeted to be ready for partner
integration beginning in the first half of 2013. Fortunately, no hardware changes are required to support
big.LITTLE MP, so it is possible for a silicon vendor to deploy a platform with CPU migration and upgrade
to big.LITTLE MP with a kernel update to deployed platforms.
While big.LITTLE MP is not deployed in production yet, the software is running as demonstrated in the
results shown in this paper. The big.LITTLE MP software ran on our test system out of the box, and
efforts are now being focused on hardening the software, and tuning the system performance for best
results against a wide variety of use cases.
The performance and power results in figure 8 show the importance of tuning a big.LITTLE system, with
varied results based on just one tuneable element. Other tuneable elements include the load-balancing
policy of the scheduler, the up- and down-migration points, and thread priority. System tuning is ongoing at
ARM and with silicon partners in each of these areas, so expect an updated set of big.LITTLE performance
and power results later in 2012.
In addition to the performance tuning, ARM is improving the big.LITTLE MP patchset and making regular
monthly releases in open source, with plans to push the patches upstream later in 2012. The current
patchset is available from Linaro at:
http://git.linaro.org/gitweb?p=arm/big.LITTLE/mp.git;a=summary
The CPU migration software is available for Linaro members now.
more throughput per clock cycle. These new cores will support big.LITTLE in the same way
as the current Cortex-A15 and Cortex-A7 CPUs; both will be available to lead silicon partners in 2013,
with silicon production expected in 2014. In the meantime, the first big.LITTLE silicon based on
Cortex-A15 and Cortex-A7 is now sampling at ARM partners, and is expected in early production at the end of
2012, and in full production in a range of devices in 2013.
Conclusion
This white paper has described the first big.LITTLE system from ARM. The combination of a fully
coherent system with Cortex-A15 and Cortex-A7 opens up new processing possibilities beyond what is
possible in current high-performance mobile platforms.
A big.LITTLE system opens the door to an extremely wide dynamic range of power and performance
control points, which would otherwise not be possible in implementations comprising a single type of
processor. This wide dynamic range provides the perfect execution environment for the workloads that we
see in devices today, which are often composed of a mixture of high-demand and low-demand threads.
This is complemented by the opportunity to create an extremely energy-efficient implementation of
Cortex-A7 since it will be the workhorse of the platform.
Through these implementation techniques and the variety of use-models, big.LITTLE provides the
opportunity to raise performance and extend battery life in the next generation of mobile platforms.