
Advances in big.LITTLE Technology for Power and Energy Savings
Improving Energy Efficiency in High-Performance Mobile Platforms
Brian Jeff, ARM

September 2012

ARM big.LITTLE processing is an energy savings method where high-performance CPUs and
efficiency tuned CPUs are paired together in a cache-coherent combination, with software execution
dynamically transitioned to the appropriate CPU based on performance needs. big.LITTLE processing
was first introduced in 2011, and now ARM has tested silicon devices employing big.LITTLE. This paper
will discuss the measured power savings and performance capabilities of big.LITTLE systems, and the
system settings and tuning that provide the optimal results. It will also discuss the various software
options available to take advantage of energy savings opportunities on a big.LITTLE SoC, and the system
hardware options for SoCs that support big.LITTLE technology.

Contents

Introduction
big.LITTLE Hardware
big.LITTLE Migration
big.LITTLE MP
Mobile Usage Profile
Performance and Power Analysis: big.LITTLE Test Chip
Choosing a big.LITTLE software model
Next-generation big.LITTLE hardware
Conclusion
About the Author

Copyright 2012 ARM Limited. All rights reserved.


The ARM logo is a registered trademark of ARM Ltd.
All other trademarks are the property of their respective owners and are acknowledged


Introduction
The performance that users demand from current smartphones and tablets is increasing at a much faster
rate than the capacity of batteries or the power savings from semiconductor process advances. At the
same time, users are demanding longer battery life within roughly the same form factor. This conflicting
set of demands requires innovations in mobile SoC design beyond what process technology and
traditional power management techniques can deliver.
The usage pattern for smartphones and tablets is quite dynamic. Periods of high processing intensity
tasks, such as gaming and web browsing, alternate with typically longer periods of low processing
intensity tasks, such as texting, e-mail and audio. big.LITTLE processing takes advantage of this variation
in required performance by combining two very different processors together in a single SoC. The big
processor is designed for maximum performance within the mobile power budget. The LITTLE processor
is designed for maximum efficiency and high enough performance to address all but the most intense
periods of work.

In the first big.LITTLE SoCs, the big processors are ARM Cortex-A15, and the LITTLE processors
are Cortex-A7. Together they create a system that can accomplish both high intensity and low intensity
tasks in the most energy efficient manner. By coherently connecting the Cortex-A15 and Cortex-A7
processors via the CCI-400 cache coherent interconnect, the system is flexible enough to support a
variety of big.LITTLE use models, which can be tailored to the processing requirements of the tasks.
Since the introduction of big.LITTLE in 2011, ARM has been working together with partners to optimize and tune
big.LITTLE software, and has built a test-chip capable of measuring the performance and power of
big.LITTLE while running typical mobile workloads.

big.LITTLE Hardware
The central tenet of big.LITTLE is that the big and LITTLE processors are architecturally identical. Both
Cortex-A15 and Cortex-A7 processors implement the full ARMv7-A architecture, including the Virtualization
and Large Physical Address Extensions. Accordingly, all instructions will execute in an architecturally
consistent way on both Cortex-A15 and Cortex-A7 processors, albeit with different performance.
The implementation-defined feature set of Cortex-A15 and Cortex-A7 processors is also similar. Both
processors can be configured to have between one and four cores, and both integrate a level-2 cache
inside the processing cluster. Additionally, each processor implements a single AMBA 4 coherent
interface that can be connected to a coherent interconnect such as CCI-400.
The micro-architectures of the two processors are quite different. The Cortex-A7 (Figure 1) is an in-order,
non-symmetric dual-issue processor with a pipeline length of between 8 and 10 stages. The Cortex-A15
(Figure 2) is an out-of-order, sustained triple-issue processor with a pipeline length of between 15 and
24 stages.


Figure 1: Cortex-A7 Pipeline

Figure 2: Cortex-A15 Pipeline

The energy consumed by the execution of an instruction is partially related to the number of pipeline
stages it must traverse. Therefore, a significant difference in energy consumption between Cortex-A15
and Cortex-A7 comes from the different pipeline complexity. Across a range of benchmarks, the
Cortex-A15 delivers roughly 2x the performance of the Cortex-A7 per MHz, and the Cortex-A7 is roughly 3x
as energy efficient as the Cortex-A15 in completing the same workloads. The graph below compares the
performance of the Cortex-A15, Cortex-A7, and Cortex-A9 CPU cores at measured and expected
production frequencies.

Figure 3: Performance Benchmarks (average of Android and Linux benchmarks; series: Cortex-A9 1.2GHz, Cortex-A7 1GHz actual, Cortex-A7 1.2GHz est., Cortex-A15 1.2GHz, Cortex-A15 1.6GHz est.)


Creating a compelling big.LITTLE solution also requires considering the system around the processors. A
typical big.LITTLE processor subsystem, based on the CCI-400 interconnect and a generic interrupt
controller (GIC), is diagrammed below.

Figure 4: big.LITTLE System with Cortex-A15, CCI, and Cortex-A7

big.LITTLE processing was originally conceived to make use of two primary use models: big.LITTLE
Migration, and big.LITTLE MP. The software models differ mainly in the way they allocate work to big or
LITTLE cores during runtime execution of a workload. The system hardware described above can support
either of these main categories of software.

big.LITTLE Migration
In the big.LITTLE migration software models, the fundamental idea is that the OS kernel scheduler is
unaware of the big and LITTLE cores, and the DVFS power management software residing in kernel space
controls the migration of software context between cores. This software model is a natural extension to
the Dynamic Voltage and Frequency Scaling (DVFS) operating points provided by current mobile platforms
to allow the OS to match the performance of the platform to the performance required by the application.
In today's smartphone SoCs, DVFS drivers like cpu_freq sample the OS performance at regular and frequent
intervals, and the DVFS governor decides whether to shift to a higher or lower operating point or remain
at the current operating point.
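
The sampling decision described above can be sketched in a few lines. This is a minimal illustration and not the Linux implementation: the operating-point table, the threshold values, and the name `next_operating_point` are all assumptions made for the example.

```python
from typing import Sequence

# Hypothetical operating points for one CPU cluster, in MHz (illustrative values).
OPERATING_POINTS_MHZ: Sequence[int] = (200, 400, 600, 800, 1000)


def next_operating_point(current_index: int, utilization: float,
                         up_threshold: float = 0.80,
                         down_threshold: float = 0.30) -> int:
    """Return the index of the operating point to use for the next interval.

    utilization is the fraction of the last sampling interval the CPU was busy.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0.0 and 1.0")
    if utilization > up_threshold:
        # Busy: step up one point, unless already at the top.
        return min(current_index + 1, len(OPERATING_POINTS_MHZ) - 1)
    if utilization < down_threshold:
        # Mostly idle: step down one point, unless already at the bottom.
        return max(current_index - 1, 0)
    # Load is inside the comfortable band: remain at the current point.
    return current_index
```

A governor would call this once per sampling interval and then program the voltage and frequency for the returned index.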
These operating points affect the voltage and frequency of a single CPU cluster; however, in a big.LITTLE
system there are two CPU clusters with independent voltage and frequency domains. This allows the big
cluster to act as a logical extension of the 3 to 5 DVFS operating points provided by the LITTLE processor
cluster. In a big.LITTLE system under a migration mode of control, when Cortex-A7 is executing, the
DVFS driver can tune the performance of the CPU cluster to higher levels. Once Cortex-A7 is at its
highest operating point, if more performance is required, a migration can be invoked that picks up the OS
and applications and moves them to Cortex-A15. This allows low and medium intensity applications to be
executed on Cortex-A7 with better energy efficiency than Cortex-A15 can achieve, while the high intensity
applications that characterize some of today's apps can execute on Cortex-A15.
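
One way to picture the big cluster as a logical extension of the LITTLE operating points is a single ordered ladder that spans both clusters. The sketch below assumes made-up frequencies, and the names `Opp`, `LADDER`, and `step` are hypothetical; real operating points are SoC specific.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Opp:
    cluster: str   # "LITTLE" (Cortex-A7) or "big" (Cortex-A15)
    freq_mhz: int


# Illustrative ladder: LITTLE points first, then big points.
LADDER = (
    Opp("LITTLE", 400), Opp("LITTLE", 600), Opp("LITTLE", 800), Opp("LITTLE", 1000),
    Opp("big", 800), Opp("big", 1000), Opp("big", 1200),
)


def step(index: int, direction: int) -> Tuple[int, bool]:
    """Move one rung up (+1) or down (-1) the ladder.

    Returns the new index and whether the move crosses clusters,
    which is the case that requires migrating the software context.
    """
    if direction not in (-1, 1):
        raise ValueError("direction must be -1 or +1")
    new_index = max(0, min(index + direction, len(LADDER) - 1))
    migrate = LADDER[new_index].cluster != LADDER[index].cluster
    return new_index, migrate
```

Stepping up from the highest LITTLE rung returns `migrate=True`, which corresponds to the point in the text where the OS and applications are moved to Cortex-A15.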
There are in fact two types of migration: CPU migration and cluster migration. The initial exploration of
big.LITTLE used cluster migration. With this approach, the entire context was migrated from all running
Cortex-A7 CPUs to the same number of Cortex-A15 CPUs, and vice versa. However, there are
frequently cases where the loading on a single CPU is high, but the loading on additional CPUs in the
cluster is low. Migrating the entire multi-core context to the big cluster would be inefficient in that case.
Fortunately, existing DVFS mechanisms typically sample the loading for each core in a multi-core system.
This provides the opportunity to do migration at a finer granularity, that is, for each individual CPU in the
system. This mode of operation is called CPU migration; in this mode each LITTLE CPU is logically
paired with a big CPU. The OS scheduler sees each pair as a single logical CPU, but the big.LITTLE
software can migrate the execution context between the big and the LITTLE CPUs to match current
performance demand. The diagram below shows an example of the migration of CPU context that can
occur under CPU migration, where the tasks from a single Cortex-A7 CPU are migrated to a single
Cortex-A15 CPU, and the tasks running on the first Cortex-A7 CPU remain on that cluster to better match
the performance demands of each set of tasks.
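
The pairing used by CPU migration can be sketched in code as follows. This is an illustrative model only: the class `LogicalCpu`, the function `rebalance`, and the single load threshold are assumptions, not part of any ARM or Linux interface.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LogicalCpu:
    """One LITTLE/big pair that the OS scheduler sees as a single CPU."""
    pair_id: int
    active: str = "LITTLE"   # which physical core currently holds the context


def rebalance(cpus: List[LogicalCpu], loads: Dict[int, float],
              threshold: float = 0.85) -> List[int]:
    """Pick the physical core for each pair from that pair's own sampled load.

    loads maps pair_id to a load figure between 0.0 and 1.0.
    Returns the pair_ids whose context moved between cores.
    """
    moved = []
    for cpu in cpus:
        wanted = "big" if loads[cpu.pair_id] > threshold else "LITTLE"
        if wanted != cpu.active:
            cpu.active = wanted
            moved.append(cpu.pair_id)
    return moved
```

With two pairs and loads of 0.2 and 0.95, only the second pair moves to its big core, while the first stays on its LITTLE core.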

Figure 5: CPU migration scheduling example

An important consideration of a big.LITTLE system is the time it takes to migrate the execution context
between the Cortex-A15 cluster and the Cortex-A7 cluster. If it takes too long, then it may become
noticeable to the operating system, and the system power may outweigh the benefit of migration for some
time. Therefore, the Cortex-A15-Cortex-A7 system is designed to migrate tasks in around 30,000 to 50,000
cycles, or 30 to 50 microseconds with processors operating at 1GHz.
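
The conversion from cycles to time quoted above is simple arithmetic, shown here as a small helper. The function name `migration_time_us` is hypothetical and exists only for this illustration.

```python
def migration_time_us(cycles: int, freq_hz: float) -> float:
    """Convert a cycle count to microseconds at a given clock frequency."""
    if freq_hz <= 0:
        raise ValueError("freq_hz must be positive")
    # Multiply before dividing to keep the intermediate value exact for round inputs.
    return cycles * 1e6 / freq_hz
```

For example, `migration_time_us(30_000, 1e9)` and `migration_time_us(50_000, 1e9)` give the lower and upper bounds of the 30 to 50 microsecond range at 1GHz.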
One of the reasons the migration can be so fast is that the amount of processor state involved is relatively
small. The processor that is going to be turned off, which is termed the outbound processor, must have
all of the integer and Advanced SIMD register file contents saved along with the entire CP15
configuration state. The processor that is going to resume execution, which is termed the inbound

processor, must then restore all of the state saved from the outbound processor. Additionally, any active
interrupts that are being controlled by the GIC-400 must also be migrated. Around 2,000 instructions are
required to achieve save-restore, and because the two processors are architecturally identical, there is a
one-to-one mapping between state registers in the inbound and outbound processors. Coherency is
clearly a critical enabler in achieving a fast migration time, as it allows the state that has been saved on
the outbound processor to be snooped and restored on the inbound processor rather than going via main
memory. Additionally, because the level-2 cache of the outbound processor is coherent, it can remain
powered up after a migration to improve the cache warming time of the inbound processor through
snooping of data values. The outbound processors, however, can be powered down. When all processors
in a cluster have powered down, due to migration or other reasons such as idle or hot plug, the level-2
cache can be cleaned and powered off to save leakage power. When to shut down the outbound L2 cache is SoC
specific; the decision can be assisted through cache hit counters in the CCI-400 interconnect, but this is
one example of a tuneable setting in the big.LITTLE SoC.
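
As a rough illustration of how such a tuneable might be expressed, the sketch below keeps the outbound L2 powered while snoops from the inbound cluster are still hitting in it. It assumes software can obtain snoop hit and lookup counts for a sampling interval; the parameter names and the 5% threshold are hypothetical and do not describe the actual CCI-400 counter interface.

```python
def should_power_off_outbound_l2(snoop_hits: int, snoop_lookups: int,
                                 cluster_cores_powered: int,
                                 min_hit_rate: float = 0.05) -> bool:
    """Decide whether the outbound cluster's L2 is still worth keeping on.

    The L2 is only a candidate once every core in its cluster is off.
    After that, it stays on while snoops from the inbound cluster still hit.
    """
    if cluster_cores_powered > 0:
        return False   # cores are still running, so the L2 is in use
    if snoop_lookups == 0:
        return True    # nothing is asking this cache for data
    return snoop_hits / snoop_lookups < min_hit_rate
```

Raising `min_hit_rate` powers the cache off sooner and saves leakage, at the cost of more traffic to main memory while the inbound cache warms up.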
Note that normal execution of the thread continues during the migration process. The only
blackout period is during the CPU migration itself, when interrupts are disabled and state is transferred from
the outbound to the inbound processor.

big.LITTLE MP
Since a big.LITTLE system containing Cortex-A15 and Cortex-A7 is fully coherent through CCI-400,
another logical use-model is to allow both Cortex-A15 and Cortex-A7 to be powered on and
simultaneously executing code. This is termed big.LITTLE MP, which is fully heterogeneous scheduling.
Whether a big processor needs to be powered on is determined by performance requirements of tasks
currently executing. If there are demanding tasks, then a big processor can be powered on to execute
them. Low demand tasks can execute on a LITTLE processor. Finally, any processors that are not being
used can be powered down. This ensures that cores, big or LITTLE, are only active when they are
needed, and that the appropriate core is used to execute any given workload.
big.LITTLE MP is compelling because it enables threads to be executed on the processing resource that
is most appropriate. Compute intensive threads that require significant amounts of processing
performance, as their output is user visible, can be allocated to Cortex-A15. Threads that are I/O heavy or
that do not produce a result that is time critical to the user can be executed on Cortex-A7.
A simple example of a non-time critical thread is one associated with e-mail updates. While web
browsing, the user will want e-mail updates to continue, but it does not matter if they are done at
Cortex-A15 performance levels or Cortex-A7 performance levels. Since Cortex-A7 is a more energy-efficient
processor, it makes more sense to take a little longer but consume less battery energy.
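
The placement idea behind big.LITTLE MP can be sketched as a simple rule. This is a deliberately simplified model, not the scheduler patchset: the `Task` fields, the `heavy_load` threshold, and the function names are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Task:
    name: str
    load: float          # tracked CPU demand between 0.0 and 1.0
    user_visible: bool   # True if the user is waiting on the result


def place(task: Task, heavy_load: float = 0.8) -> str:
    """Return "big" for demanding, user-visible work and "LITTLE" otherwise."""
    if task.user_visible and task.load >= heavy_load:
        return "big"
    return "LITTLE"


def clusters_needed(tasks: List[Task]) -> Set[str]:
    """Only core types with work assigned need to be powered on."""
    return {place(t) for t in tasks}
```

Under this rule a page render that the user is waiting on goes to a big core, while an e-mail sync stays on a LITTLE core even if it is briefly busy.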
Finally, as a fully coherent system can create a significant volume of coherent transactions, Cortex-A15,
Cortex-A7 and CCI-400 have been designed to cope with worst case snooping scenarios. This includes
the case where a Mali-T604 GPU is connected to one of the I/O coherent CCI-400 ports and every
transaction is snooping Cortex-A15 and Cortex-A7 at the same time as Cortex-A15 and Cortex-A7
are snooping each other.


Mobile Usage Profile


One of the characteristics that makes big.LITTLE advantageous is the varied performance requirements
of typical mobile workloads. The graph below shows the percentage of time spent in DVFS states, and in
idle and full shutdown states, by two cores in a currently shipping Cortex-A9 based mobile device. In the
diagram, the red color indicates the highest frequency operating point, while the green colored regions
indicate the lowest frequency operating point, and colors in between represent intermediate frequencies.
In addition to the DVFS states, the OS power management can idle a CPU. The light blue regions in the
graph indicate this idle time. When a CPU has been idled for a long enough period, the system power
control software may take a core to a full shutdown to save leakage power. This is shown by the darkest
color on the graph.

Figure 6: DVFS residency for low intensity use cases

It is clear from the graph above that the applications processors spend a considerable portion of time in
lower frequency states across several common workloads. In a big.LITTLE system, the SoC would have
the opportunity to run all but the dark red portions of the work on a lower power Cortex-A7 CPU. In the
following graph, more intense workloads are analyzed in the same way, and even in these cases there is
significant opportunity to map frequencies below 1GHz to a Cortex-A7 processor, which is known to
provide performance per clock within 5 to 10% of the Cortex-A9.
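
The opportunity described here can be estimated from a DVFS residency histogram. The helper below is a sketch under stated assumptions: the function name is hypothetical, the input values in the usage note are invented for illustration and are not measurements from this paper, and the small per-clock difference between Cortex-A7 and Cortex-A9 is ignored.

```python
from typing import Dict


def little_eligible_fraction(residency: Dict[int, float],
                             little_limit_mhz: int = 1000) -> float:
    """Fraction of active time spent below little_limit_mhz.

    residency maps a frequency in MHz to the share of time spent there.
    Idle and shutdown time should be left out of the mapping.
    """
    total = sum(residency.values())
    if total <= 0:
        raise ValueError("residency must contain some active time")
    eligible = sum(share for mhz, share in residency.items()
                   if mhz < little_limit_mhz)
    return eligible / total
```

With an invented profile such as `{200: 50, 500: 20, 800: 10, 1200: 20}`, the three entries below 1GHz account for 80 of the 100 units of active time, which is the portion that could in principle be mapped to a Cortex-A7.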


Figure 7: DVFS residency for more intense use cases

Performance and Power Analysis: big.LITTLE Test Chip


User level software has been running on top of big.LITTLE scheduling since 2011, but only on
software models of the cores and interconnect. To fully evaluate the performance, power savings, and
appropriate tuning of a big.LITTLE system, it was necessary to build a test chip that could run user
software at full speed. The ARM test chip came back from the manufacturing facility in early summer
2012, and within a few weeks the test silicon was running on a development board and running full Linux
and Android Ice Cream Sandwich (ICS). (Jelly Bean is running as well, but the results in this paper are from
ICS.) The test chip consists of a dual-core Cortex-A15 cluster, a tri-core Cortex-A7 cluster, and the
CCI-400 cache coherent interconnect. The test chip does not include a GPU, which affects some of the user
benchmarks, but the platform is able to run Linux and Android OSs and benchmark software.
The performance benchmarks in Figure 3 were run on the Cortex-A15 and Cortex-A7 CPU clusters
independently. The Cortex-A15 on the test chip has a maximum frequency of 1.2GHz, while the
Cortex-A7 on the test chip has a maximum frequency of 1GHz. The benchmarks showed the performance of the
Cortex-A15 and Cortex-A7 CPUs was within the range of expected performance, although the memory
system on the test-chip is a lower performance memory system than would be expected on a production
big.LITTLE SoC. Based on the results from the cores running individually, we developed sufficient
confidence that the platform would be accurate for measuring big.LITTLE performance. The software on
the test chip platform consisted of the base Linux kernel, with CPU migration software and big.LITTLE MP
patchsets applied, to allow the testing of either the CPU Migration or the big.LITTLE MP models.
The main workload used to test big.LITTLE performance was a web browser benchmark cycling through
web pages, with audio playback happening in the background, on top of Android ICS. This use case
allowed a fairly intense workload to be paired with a background activity with low performance
requirements. The web browser cycled through web pages every 2 seconds and performed a 500-pixel
page scroll action on each page, to present the system with a relatively high level of required
performance. In measuring the performance and power while running this benchmark, it was first
necessary to establish a performance and power baseline. That baseline was measured with the
Cortex-A15 CPU cluster running standalone.


Figure 8: big.LITTLE results

It bears mentioning that this set of results is fairly early; they come from an early version of the big.LITTLE
MP patchset, which modifies the Linux scheduler away from a completely fair and balanced scheduling
model, towards a big.LITTLE model. We expect performance and power improvements as the software is
refined and additional tuneable elements are explored. Another point worth mentioning is the lack of a
GPU in the test chip; this led to higher CPU loading than would exist in a system with a GPU for offload,
and in situations where CPU loading is lower it is possible to make greater use of the LITTLE cores and
thereby save more energy. The test chip also has only a rudimentary set of voltage and frequency operating
points, and no ability to power gate individual cores, so production big.LITTLE SoCs are expected to
deliver even better results. Background tasks, for example, are already exhibiting energy savings of
greater than 70%.

Choosing a big.LITTLE software model


A question that is often asked is which software model to choose.
The choice today is effectively between CPU migration and big.LITTLE MP, and there are pros and cons
to each. In CPU migration, big and LITTLE cores are paired, so an asymmetric topology, that is, one with
unequal numbers of big and LITTLE cores, requires additional work. Given the small size of Cortex-A7
CPU cores, it may be attractive to use 4 LITTLE cores along with 1 or 2 big cores. On the positive side,
CPU migration allows simpler power and performance tuning, and its reuse of existing OS power
management code means that years of development and test underpin the implementation. With no
modifications to the kernel scheduler, it is simpler in scope than MP. In sum, CPU migration is a great
solution for products in the first half of 2013 and beyond.
big.LITTLE MP has several technical advantages, but is less mature today; it is currently in development,
with promising early test results as shown in this paper. big.LITTLE MP can make use of all cores in the
system, as asymmetric topology support is standard with no software modifications. It offers greater
opportunity for performance and power benefits. As an example, it can use all cores simultaneously for
greater performance, or tune DVFS settings and scheduler settings differently on big and LITTLE for
greater power savings. It allows fine-grained selection of cores by the scheduler, and generally offers
more tuning parameters. This greater flexibility does come at a cost, as more tuning is required to extract
the full performance and power benefits from a big.LITTLE MP platform. Finally, big.LITTLE MP is
maturing quickly but is not yet ready for production. It is targeted to be ready for partner
integration beginning in the first half of 2013. Fortunately, no hardware changes are required to support
big.LITTLE MP, so it is possible for a silicon vendor to deploy a platform with CPU migration and upgrade
to big.LITTLE MP with a kernel update to deployed platforms.
While big.LITTLE MP is not deployed in production yet, the software is running as demonstrated in the
results shown in this paper. The big.LITTLE MP software ran on our test system out of the box, and
efforts are now being focused on hardening the software, and tuning the system performance for best
results against a wide variety of use cases.
The performance and power results in Figure 8 show the importance of tuning a big.LITTLE system, with
varied results based on just one tuneable element. Other tuneable elements include the load balancing
policy of the scheduler, the up and down migration points, and thread priority. System tuning is ongoing at
ARM and with silicon partners in each of these areas, so expect an updated set of big.LITTLE performance
and power results later in 2012.
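
To show why the up and down migration points are tuned separately, the sketch below applies two thresholds with a gap between them. The function name and the default values are assumptions for illustration; they are not taken from the big.LITTLE MP patchset.

```python
def migration_decision(on_big: bool, load: float,
                       up_point: float = 0.90,
                       down_point: float = 0.50) -> bool:
    """Return True if the task should run on a big core after this sample.

    Separate up and down points add hysteresis, so a task whose load
    hovers near one threshold does not bounce between core types.
    """
    if down_point >= up_point:
        raise ValueError("down_point must be below up_point")
    if not on_big and load > up_point:
        return True    # up-migrate to a big core
    if on_big and load < down_point:
        return False   # down-migrate to a LITTLE core
    return on_big      # inside the hysteresis band: stay where it is
```

Moving the two points closer together makes the system react faster to load changes, while moving them apart reduces the number of migrations.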
In addition to the performance tuning, ARM is improving the big.LITTLE MP patchset and making regular
monthly releases as open source, with plans to push the patches upstream later in 2012. The current
patchset is available from Linaro at:
http://git.linaro.org/gitweb?p=arm/big.LITTLE/mp.git;a=summary
The CPU migration software is available for Linaro members now.

Next-generation big.LITTLE hardware


The Cortex-A15 and Cortex-A7 cores represent the first generation of big.LITTLE hardware. ARM has just
announced two new CPU cores that are also capable of big.LITTLE processing, the Cortex-A57 and the
Cortex-A53. The Cortex-A57 is a big core similar to Cortex-A15, bringing 20% more performance per
clock cycle and higher frequency capability, at slightly higher efficiency than the Cortex-A15. The Cortex-A53
is a LITTLE core similar to the Cortex-A7, with 25% more performance per clock cycle, at the same power
efficiency as Cortex-A7.
Both these cores are architecturally identical to each other, and introduce support for the ARMv8
architecture, which brings improved NEON and floating point capability, cryptography acceleration,
and 64-bit support. Both cores also support a next-generation coherent interconnect in addition to AMBA4
ACE, and they can run in AArch32 mode to run existing code in the same fashion as current ARMv7 CPU
cores. The support for 64-bit and additional general-purpose registers is implemented in an elegant and
efficient way with little added power. Microarchitectural enhancements are introduced to give each core

more throughput per clock cycle. These new cores will support big.LITTLE in the same way
as the current Cortex-A15 and Cortex-A7 CPUs; both will be available to lead silicon partners in 2013,
with silicon production expected in 2014. In the meantime, the first big.LITTLE silicon based on
Cortex-A15 and Cortex-A7 is now being sampled by ARM partners, and is expected in early production at the
end of 2012, with full production in a range of devices in 2013.

Conclusion
This white paper has described the first big.LITTLE system from ARM. The combination of a fully
coherent system with Cortex-A15 and Cortex-A7 opens up new processing possibilities beyond what is
possible in current high-performance mobile platforms.
A big.LITTLE system opens the door to an extremely wide dynamic range of power and performance
control points. This would otherwise not be possible in implementations comprising a single type of
processor. This wide dynamic range provides the perfect execution environment for workloads that we
see in devices today, which are often composed of a mixture of high demand and low demand threads.
This is complemented by the opportunity to create an extremely energy-efficient implementation of
Cortex-A7 since it will be the workhorse of the platform.
Through these implementation techniques and the variety of use-models, big.LITTLE provides the
opportunity to raise performance and extend battery life in the next generation of mobile platforms.

About the Author


Brian Jeff is a Product Manager at ARM responsible for the high efficiency Cortex-A class processors,
including Cortex-A7, the Cortex-A53, and big.LITTLE processing technology. He has been with ARM
since 2009, in roles including performance benchmarking and product marketing. Prior to joining ARM,
Brian held product management, engineering, and technical sales roles at Texas Instruments and
Freescale Semiconductor. He holds a BSEE from Virginia Tech and an MBA from the University of Texas
at Austin.

