Professional Documents
Culture Documents
Abc2009 P2B Paper
Abc2009 P2B Paper
1 throughout this paper it is called AIM, or the tra- There is a number of important differ-
ditional PowerPC definition ences between Book-E and the traditional spec-
ifications below the application level, which available in the public FreeBSD/powerpc repos-
greatly influence how the operating system itory as the outcome of a porting effort com-
must be implemented at the machine-specific pleted previously for the single-core E500 im-
layer. Among the most distinctive are the fol- plementation [BSDCAN07]. This served as
lowing differences [AN2490]: the main starting point for further work, but
also some important parts of the common
• Book-E has more flexible approach to FreeBSD/powerpc code were leveraged in this
memory management (no longer the seg- undertaking.
mented mode known from AIM; instead The most important elements, which were
there is a page-based memory system with used as a base for further extensions and re-
multiple variable-sized pages, pure transla- work, include:
tion lookaside buffers (TLBs) mechanism.
Dedicated registers and machine instruc- • Booting environment, including firmware
tions for handling memory management support library for PowerPC.
unit (MMU). The differences in memory
management area are by far the most • Kernel locore initialization routines.
prominent and making Book-E appear like
a completely different CPU from this per- • Low level MMU support layer (pmap 2 in-
spective. terface) for E500.
initial support for the E500 class machines was Mach implementation and used by all BSD family
2
• Bootstrapping. Once the minimal kernel FreeBSD 8-CURRENT branch as of March
is built, there has to be a way to load the 2008 timeframe. During the porting duration
image (ELF or pure binary) into memory it was a regular practice to re-sync with up-
and pass control to its entry point. In to-date baseline: the porting process can take
order to minimize dependencies and sim- longer time to complete, and CURRENT is the
plify development set-up, it is usually pre- development branch, dynamically changing, so
ferred at the beginning to skip using the it is best not to let the porting work in progress
FreeBSD last stage loader, and execute the diverge too far from the baseline.
kernel directly. Details of this process how-
ever are much dependent on the underlying 2.2 Build environment
firmware’s specifics and should be consid-
ered in a particular context. In case of the E500 machines, existing
build environment could be fully leveraged.
• Locore kernel path. Entry point to the ker- The tool chain bundled with the base FreeBSD
nel, low level routines for CPU initializa- (GNU binutils 2.15 and gcc 4.2.1) has had the
tion have to be crafted in assembly lan- ability to produce E500 specific machine code
guage of the given machine. for some time already; these tools were previ-
• Rudimentary operation of the system. ously used with success, so no specific exten-
This covers, but is not limited to, get- sions or changes were required in the build tools
ting to work the following elements: con- area for the MPC8572 port.
sole, VM/pmap, time counter, excep- One note regarding binaries for the
tions/interrupts handling, in-kernel de- FreeBSD E500 support is that the traditional
bugger (DDB/KDB). Only equipped with PowerPC ABI (Application Binary Interface) is
such basic instrumentation can we move on used [PPCABI], and not the EABI (Embedded
towards booting the kernel all the way up ABI), whose features and modifications of the
through machine-independent layers, until basic model were not particularly required or
the root file system is mounted and init beneficial for this development.
process executed, to finally reach a single
user shell prompt. 2.3 Bootstrap
It should be apparent the above is not a The primary startup environment for the
complete or exhaustive list, but only a bird’s MPC8572 system is U-Boot firmware, an open
eye view on the overall scope of work involved source boot loader, widely adopted in embed-
into porting the kernel to a new platform. ded applications, specially on PowerPC-based
systems. U-Boot’s role is initialization of the
The porting effort foremost concern is
hardware and providing minimal conditions, so
hardware-dependent layer, so using all available
that loading an operating system kernel into
debugging facilities is highly recommended,
main memory and passing control to it is pos-
specially the hardware assisted techniques
sible. U-Boot firmware development was not
(JTAG tools) are of particular help during early
strict part of this porting effort: it was deliv-
development phases, when no console output is
ered by the hardware vendor and only slightly
available.
adjusted.
Subsections that follow discuss more de-
FreeBSD regular bootstrap process com-
tails of selected porting stages in the course of
prises a variable number of stages, depend-
the MPC8572 system bring-up.
ing on a given platform, but overall there is
some form of initial execution performed by
2.1 Baseline code
firmware (like BIOS or U-Boot), which in turn
The starting point for this development retrieves and runs some FreeBSD-specific boot
was publicly available source code of the code (early stage boot blocks, last stage loader),
3
which loads and executes the kernel. The first three items from the above list
are discussed in greater details in their re-
With U-Boot based embedded platforms,
spective, dedicated sections; other elements are
FreeBSD last stage boot loader is run on top of
summarized below.
U-Boot as the so called standalone application,
it brings in the ELF kernel file to memory and For uniform4 access to the local bus,
runs it. FreeBSD kernel relies on the so called bus
space layer, a concept originally introduced in
The integration between the FreeBSD last
NetBSD, which defines an API exporting bus
stage boot loader and the underlying U-Boot
accessing methods. The foremost consumer
firmware was also not strict part of this port-
3 of this abstraction are device drivers. Each
ing effort, as all elements were developed pre-
new architecture or variation of the CPU (if
viously during the single-core E500 porting
significantly different in this respect) has to
project [BSDCAN07]. They proved stable and
provide its own machine-specific implementa-
ready for reuse, with almost no work other than
tion of these methods. In case of E500, the
testing and verification.
already implemented bus space methods for
FreeBSD/powerpc were used. An extension
2.4 Minimal kernel operation
developed during this port was adding 64-bit
Preliminary steps described so far would width accessors, which were required for the in-
let us build and load a skeleton kernel i.e. it tegrated security engine driver to work.
would be formally recognized as a kernel image,
Similar situation is with DMA handling,
but it is only a container, for which the proper
as there is an abstraction layer called bus dma,
contents need to be provided. There is a great
also derived from NetBSD, and of analogous
number of conditions that need to be fulfilled
purpose that the bus space serves. It defines an
before the first tangible signs of its life can be
abstract API for DMA-related operations for
observed, but among the most important are
the rest of the kernel, with low level methods
the following:
implemented according to a given architecture’s
specifics. In case of E500 and the MPC8572
• Early kernel initialization routine. system-on-chip, the existing FreeBSD/powerpc
bus dma layer could be used without exten-
• Exceptions and interrupts handling (in sions. Like most of PowerPC systems, coher-
particular address translation exceptions). ence between DMA, memory and data cache
is maintained with hardware help, and there is
• MMU support layer: pmap. For Book-E
no need for software assistance (explicit data
class machines this is a critical area, as the
syncing is not required).
architecture definition does not allow op-
eration of the system with MMU disabled, Contemporary system-on-chip devices of-
i.e. MMU is always turned on by design ten integrate a fair number of peripherals in
and cannot be switched off. one chip, which is also true for the MPC8572
SOC. In order to be managed and recognized
• Local bus access methods so that CPU is by the kernel, all integrated peripherals need
able to issue elementary read/write opera- to be modeled into an abstract tree repre-
tions on the bus in appropriate byte order. sentation, according to the FreeBSD newbus
paradigm, an object-oriented description of de-
• DMA back end layer. vices hierarchy and dependencies. For this port
an existing ocpbus 5 code was used as a start-
• Integrated peripherals abstraction and in- ing point, and slightly extended. The ocpbus
dividual device drivers.
3 the U-Boot side API and accompanying support li- 4 as seen from higher levels of the kernel
brary for the FreeBSD loader 5 on-chip peripherals bus
4
driver manages assignment of the built-in hard- take advantage of the parallelism offered by
ware resources (like the memory-mapped regis- such system. One of alternative approaches
ters ranges, interrupts) in situations where no is asymmetric multi-processor (AMP) architec-
heavy weight firmware (like Open Firmware or ture, where each processing unit runs its own
EFI) is present. In case of U-Boot, the envi- instance of the kernel image and has dedi-
ronment is simplistic and there is no feedback cated (separate) memory. With AMP, individ-
from firmware regarding devices’ resources as- ual CPUs can as well run completely different
signments available to the kernel, and therefore operating systems.
it needs to do its own management.
In the context of this work we are only
concerned and interested in the SMP approach,
2.5 Portability summary
other schemes like AMP and hybrid solutions
FreeBSD in general appears as a portable are not considered or relevant for further dis-
software. Thanks to the layered approach and cussion.
abstractions (bus space, bus dma, newbus hier- Within the SMP processors group there
archy, pmap), the machine-specific elements are always is one to first start executing the ker-
mostly well contained. From this perspective nel code and bring it to the state when all
porting work can be planned and then driven other CPUs are awakened and allowed to run.
by tracking the completion of these interfaces’ These are, respectively, the bootstrap proces-
implementation. sor (BSP) and application processors (AP). Be-
A relatively high degree of reuse of the sides BSP’s special role during initialization
existing Book-E work was possible, which is and shutdown phases, the BSP and APs are
natural as the MPC8572 system described in equal from the consumer (applications) point
this paper is a more powerful successor within of view: threads are scheduled on all CPUs in
the family, with essentially the same under- the same way, regardless of their BSP/AP char-
lying architecture definition and sharing most acter.
peripheral devices. Notwithstanding the hard- In the MPC8572 dual-core system, the
ware similarities, it should be noted the code BSP is typically core0, and AP is core1 ; this
from previous port proved scalable and well de- will be the convention used further on.
signed. Given the differences between older and
newer chips, the code stood a good base from It is also important to note that each core
the very beginning: with only minor updates is equipped with its own memory management
and corrections was the existing E500 kernel unit (MMU) instance, own L1 caches and sep-
code able to start executing on the first core of arate set of other internal resources. The com-
the MPC8572. Much more work was required pound, including the core itself is referred to
to bring multiprocessing operation, and this is as the core complex. Dual-core MPC8572 has
subject of closer analysis that follows. therefore two core complexes.
5
upon reset exception: there is no reset vector as In other words, there must be always a valid
known from AIM, and the CPU just fetches an translation in the TLB for the code executed
instruction from some established address upon or data accessed. The TLB translation error
reset event. exceptions cannot be masked and will always
occur upon fetch, load or store operation at a
These out-of-reset steps are the primary
non-translated address.
responsibility of the code executed at power-
up, i.e. the firmware. In general, only core0 For this reason, the kernel initialization
of MPC8572 is initially active and executes code which prepares MMU for further work,
firmware code. The assumption our SMP ker- needs to be very careful with cleaning up un-
nel takes is that APs are not activated by the desired translations (left by the firmware): it
firmware i.e. they will only be awakened by the must not invalidate an entry translating the
BSP at appropriate time. initialization code itself. To achieve this, a spe-
cial technique is deployed, with flipping address
3.2 The way of the bootstrap processor spaces and setting temporary entries, until the
MMU is fully reinitialized, with low level ker-
Boot loader running on the BSP puts ker-
nel code remaining available (in the translation
nel code and data into memory and passes con-
sense) all of the time.
trol to the entry point. The initial routine
that starts executing is hand crafted in Pow- After the machine-dependent initializa-
erPC assembly language, and its purpose is tion code is finished and the kernel is about to
low level initialization of the CPU. The code commence the machine-independent part, the
makes certain assumptions about preparation CPU and kernel state can be summarized as
steps which have to be done by the underlying follows:
boot loader, before it passes control: memory
starts at physical address 0, kernel is loaded • Kernel text, data, possibly debug symbols,
at 16-MByte boundary etc. The outline of internal structures (kernel page tables and
this kernel initialization code is the following others) are covered by TLB translations.
(sys/powerpc/booke/locore.S):
• SOC registers block for the on-chip periph-
• Enable machine-specific features in the erals is mapped in the virtual space and
CPU (set HID6 registers). translated by TLB.
• Initialize the MMU so that kernel is run- • All other TLB resources are cleared and
ning with virtual addresses. not used.
• Decrementer 7 is configured, so the kernel
• Set up stack.
can reliably count delays and do other time
• Initialize exceptions vector offsets. counting, periodic actions etc.
6
into full operation. In SMP system, just before
this very last stage APs are awakened.
7
• Set up stack. the architecture has to provide low level prim-
itives, which are used to build more complex
• Initialize exceptions vector offsets. tools needed in the kernel. PowerPC has al-
ways offered elementary mechanisms for this
• Assign per-CPU structures and resources.
purpose, and in case of this porting work the
• Jump to pmap_bootstrap_ap(), finalize existing atomic operations implementation for
MMU set-up. the FreeBSD/powerpc were used [AN3441].
One of the more serious problems an SMP
• Call cpudep_ap_bootstrap(), machine-
system designer faces, are data coherency is-
specific SMP initialization.
sues, and there are a couple of ways they could
• Call machdep_ap_bootstrap(), machine- be resolved [SCHIMMEL]. The best situation
independent SMP initialization, which from the operating system perspective is when
does not return. there is hardware-enforced coherency in place.
This is the case of MPC8572, which imple-
ments mechanisms that help put off the data
At the time the AP performs this last coherency maintenance burden from the kernel
action, its TLB contains some critical entries [AN3544].
copied from the BSP settings: translations for
kernel and the integrated peripherals registers The hardware coherency module allows
range. This way both CPUs have the same view on-chip caches (L1 and L2) to snoop on local
9
of the MPC8572 system. bus (CCB ) for transfers affecting potentially
cached locations. From the other end, bus mas-
The last stage of the AP bootstrap and ters (DMA engines) must be configured to ad-
final SMP initialization is driven by the BSP vise the cache logic when modifying cacheable
(scenario from the AP perspective): locations.
Additional mechanism which allows to
• Busy wait until the “go” command arrives
keep both cores coherent with regards to main
from the BSP.
memory contents is the M-bit (memory coher-
• Initialize decrementer and time base regis- ence) in the page translation attributes. When
ters with BSP-provided settings, so that set for a page, information is broadcast to the
both cores have the same view of time other processor whenever the local CPU mod-
counting. ifies data in this page. The cache logic of the
other processor can take appropriate action in
• Enable external interrupts. case the affected contents were cached there.
8
Note there is no hardware-enforced co- to choose any model. The FreeBSD Book-E
herency for the instruction cache; in case pmap implementation uses the most natural
cacheable memory locations containing exe- approach, a 2-level forward translation table,
cutable code are altered, there needs to hap- which is illustrated11 at figure 2.
pen explicit synchronization of the instruction
cache to avoid incoherency.
9
For local core TLB invalidations a extended and adapted to make use of the ad-
new performance-optimized (assembly) routine vanced functionality offered by the newer hard-
was introduced, and system-wide invalidations ware. Additional features of the driver intro-
(broadcast within coherency domain), are pro- duced during this porting work:
tected by a dedicated spin lock, as there can be
only one system-wide TLB invalidation broad- • polling
cast on the bus at a time.
• interrupt coalescing
Page table management logic was also op-
timized and lock-protected, so that updating • VLAN tagging
sequences are atomic across CPUs. In order
for this to happen a dedicated TLB miss spin • hardware checksum calculation
lock was introduced, which prevents servicing
TLB miss exceptions by the other CPU, while • jumbo frames
page table contents are being updated by the
current processor. Fine grained locking was also provided as a pre-
requisite for MP-safe operation.
5 Beyond core kernel—integrated pe-
ripherals 5.2 Pattern Matching Engine (PME)
Above basic kernel support a number of PME is a hardware accelerator for match-
device drivers for the on-chip peripherals were ing patterns, specified as regular expressions, in
developed in the course of this project. The data going through the system bus. Among its
FreeBSD kernel infrastructure for managing de- main applications is network packets inspection
vice drivers is an object-oriented framework (content filtering), which can be done in real-
(newbus), and all MPC8572 peripherals drivers time and with high performance. The engine
described in this section are compliant with this also includes a decompression module, so that
model. Figure 3 illustrates the concept. inspected data can be unpacked on the fly if
required.
The device driver for PME was written
from scratch, with fine grained locking for MP-
safe operation. It can be compiled as a kernel
dynamic module.
10
5.4 PCI-Express bridge is oblivious to this potential. In order
for the kernel to make us of it (very big
The existing PCI driver, developed pre-
RAM, more I/O devices etc.), there would
viously for the single-core MPC85xx, was ex-
be needed a feature somewhat equivalent
tended to also recognize and set up the PCI-
to the PAE (Physical Address Extension)
Express bridge found in MPC8572.
support, known from the FreeBSD/i386.
5.5 Integrated DMA Engine (IDMA) • MPC8572 features an integrated Table
This general purpose DMA engine can be Lookup Unit (TLU), an engine acceler-
used for offloading CPU when transferring data ating complex table lookups with hard-
by different bus masters. Various methods are ware assistance. Among applications that
available for specification of the data (via de- would benefit from the [missing] TLU
scriptors or explicitly), and there is more than driver are firewall (rules lookup), database
one mode of operation (single transfers, chained and similar.
mode). • The pattern matching (PME) driver has
The driver for IDMA was written from to be integrated with the FreeBSD firewall
scratch, only basic mode of operation (direct) software (ipfw, pf ), so that the packet con-
is supported for the moment. As part of this tent analysis capabilities are utilized by the
development a sample user-space application, kernel.
demonstrating basic IDMA operation was also
• eTSEC Ethernet controller has built-in
written.
support for the IEEE 1588 Precision Time
Protocol (PTP), but the driver code needs
5.6 I 2 C controller
to be extended in order to use it. This
Device driver for the I 2 C integrated con- task could be generalized towards develop-
troller was written from scratch. As an add-on ing a generic PTP infrastructure for the
the the main controller driver, a helper i2c tool FreeBSD kernel, to which capable network
was also developed to help diagnose and inspect interfaces would plug in, and the PTP han-
slave devices on the I 2 C bus. dling would be a generic code.
11
7 Acknowledgments [BSDCAN07] Rafał Jaworowski, Embedding
FreeBSD/powerpc, BSDCan 2007, Ottawa
I would like to thank the following people:
Alan L. Cox (The FreeBSD Project), for [E500CORERM] Freescale
TM
Semiconductor,
conversations on pmap interface in the SMP Inc., PowerPC e500 Core Family
context. Reference Manual, Rev. 1, 4/2005
Mark J. Douglas (Freescale), for all the [EREF] Freescale Semiconductor, Inc., EREF:
help, support and assistance. A Reference for Freescale Book E and the
e500 Core, Rev. 2.0, 01/2004
Marcel Moolenaar (The FreeBSD
Project), for groundwork on SMP for the AIM [EREFRM] Freescale Semiconductor, Inc.,
PowerPC and IPI support implementation in EREF: A Programmer’s Reference Man-
the OpenPIC driver. ual for Freescale Embedded Processors,
Rev. 1, 12/2007
Grzegorz Bernacki, Rafał Czubak, Michał
Hajduk, Jan Sięka, Piotr Zięcik (all Semihalf), [MPC8572ERM] Freescale Semiconductor,
for all the great work on this project. Inc., MPC8572E PowerQUICCTM III In-
Work on this paper was sponsored by Semihalf. tegrated Host Processor Family Reference
Manual, Rev. 2, 05/2008,
8 Availability [PPCABI] Sun Microsystems and IBM, Sys-
The code described in this paper is, or will tem V Application Binary Interface. Pow-
soon be, available from the FreeBSD Project erPC Processor Supplement, 09/1995
Subversion repository, 8-CURRENT (HEAD)
[SCHIMMEL] Curt Schimmel, UNIX Sys-
branch. It is expected to be part of the
tems for Modern Architectures: Symmet-
FreeBSD 8.0-RELEASE.
ric Multiprocessing and Caching for Ker-
nel Programmers, Addison-Wesley Profes-
References
sional Computing Series, 1994
[AN2490] Jerry Young, Freescale Semiconduc-
tor, Inc., MPC603e and e500 Register [VAHALIA] Uresh Vahalia, UNIX Internals:
Model Comparison, AN2490/D Rev. 0, The New Frontiers, Prentice Hall, 1995
7/2003
12