AIX HW Wikis

Briefly, What is Firmware?

Code which is programmed into hardware components (to control their operation) is called Firmware or Microcode. Microcode generally initializes the hardware, enabling it to boot up and operate. In many cases it may also provide some of the interface between the hardware and device drivers or the operating system.

Microcode is usually found programmed into modules on cards, adapters, or devices. If these modules are Flash memory, you can update the code rather than having to change the card or device.
System Microcode initializes the system hardware and controls the boot process, enabling the system to boot up and operate correctly; it also provides the interface between the operating system software and the hardware.

Adapter Microcode is the operating code of the adapter; it initializes the adapter when power is applied, and it controls many of the ongoing operations executed by the adapter.

Device Microcode provides these same functions for devices such as tape drives.

Key Topics for IBM eServer pSeries and IBM System p5
These systems (and many other IBM systems) include a Service Processor, which contains System Firmware and other key system code. High-end systems also include Bulk Power Controllers (BPCs), each of which has a separate Service Processor. In addition, a System Power Control Network provides the interface to the BPCs or other power controllers.

The Flexible Service Processor (FSP) firmware provides diagnostics, initialization, configuration, run-time error detection, and correction.
The POWER Hypervisor firmware (which is based on the pSeries hypervisor) provides VLAN, virtual I/O, and subprocessor partitioning support.
The Platform Firmware (PFW) supports the "Power Architecture Platform Requirements+" interface.
The Bulk Power Control (BPC) firmware controls each bulk power unit in the CEC and towers. This firmware is model dependent.
The System Power Control Network (SPCN) firmware interfaces with bulk power for power monitoring and control.
In addition, many systems are likely to have a Hardware Management Console (indeed, it is a requirement for all systems that have Bulk Power Controllers). An HMC is required for Logical Partitioning (LPAR), Service Focal Point, and so on.

The Hardware Management Console (HMC) firmware provides platform configuration, management, and services.
A major feature of the new POWER5 machines is a new, active Hypervisor that represents a convergence with iSeries systems.
iSeries and pSeries machines now have a common Hypervisor and common functionality, which will mean reduced development effort and faster time to market for new functions. However, each brand will retain a unique value proposition.
New functions provided for pSeries are Shared Processor Partitions and Virtual I/O. Both of these have been available for iSeries on POWER4 systems, and pSeries gets the benefit of using tried and tested microcode to implement these functions on POWER5.
iSeries benefits from the POWER Hypervisor convergence as well and gains the ability to run AIX in an LPAR (rather than the more limited PASE environment available today). There are some restrictions for the AIX environment on iSeries (for example, device support), and the primary reason for offering this function is to broaden the range of software applications available to iSeries customers.
TECHNOLOGY PPT, Page 39
This is a simplified diagram showing the sourcing of different elements in the converged POWER
Hypervisor.
The blue boxes show functions that have been sourced either directly from the existing pSeries POWER4 Hypervisor or from the pSeries architecture. Purple boxes (lighter shading) show those sourced directly from the iSeries SLIC (System Licensed Internal Code), which is part of OS/400.
Some boxes are gradated, and these represent functions that combine elements of the pSeries and
iSeries implementation models.
TECHNOLOGY PPT, Page 40
• Same functions as POWER4 Hypervisor.
o Dynamic LPAR
o Capacity Upgrade on Demand
• New, active functions.
o Dynamic Micro-Partitioning
o Shared processor pool
o Virtual I/O
o Virtual LAN
• Machine is always in LPAR mode.
o Even with all resources dedicated to one OS

The POWER Hypervisor provides the same basic functions as the POWER4 Hypervisor, plus
some new functions designed for shared processor LPARs and virtual I/O.
Combined with features designed into the POWER5 processor, the POWER Hypervisor delivers
functions that enable other system technologies, including micro-partitioning, virtualized
processors, IEEE VLAN compatible virtual switch, virtual SCSI adapters, and virtual consoles.
The POWER Hypervisor is a component of the system's firmware that will always be installed and
activated, regardless of system configuration. It operates as a hidden partition, with no entitled
capacity assigned to it.
Newly architected Hypervisor calls (hcalls) provide a means for the operating system to
communicate with the POWER Hypervisor, allowing more efficient usage of physical processor
capacity by supporting the scheduling heuristic of minimizing idle time.
The POWER Hypervisor is a key component of the functions shown in the chart. It performs the following tasks:
• Provides an abstraction layer between the physical hardware resources and the logical partitions using them
• Enforces partition integrity by providing a security layer between logical partitions
• Controls the dispatch of virtual processors to physical processors
• Saves and restores all processor state information during a logical processor context switch
• Controls hardware I/O interrupt management facilities for logical partitions

TECHNOLOGY PPT, Page 41


POWER Hypervisor implementation
Design enhancements over the previous POWER4 implementation enable the sharing of processors by multiple partitions:
• Hypervisor decrementer (HDECR)
• New Processor Utilization Resource Register (PURR)
• Refined virtual processor objects
o Do not include physical characteristics of the processor
• New Hypervisor calls

The POWER4 processor introduced support for logical partitioning with a new privileged
processor state called Hypervisor mode. It is accessed via a Hypervisor call function, which is
generated by the operating system kernel running in a partition. Hypervisor mode allows for a
secure mode of operation that is required for various system functions where logical partition
integrity and security are required. The Hypervisor validates that the partition has ownership of
the resources it is attempting to access, such as processor, memory, and I/O, then completes the
function. This mechanism allows for complete isolation of partition resources.
In the POWER5 processor, further design enhancements are introduced that enable the sharing of
processors by multiple partitions. The Hypervisor decrementer (HDECR) is a new hardware
facility in the POWER5 design that provides the POWER Hypervisor with a timed interrupt
independent of partition activity. HDECR interrupts are routed directly to the POWER
Hypervisor, and use only POWER Hypervisor resources to capture state information from the
partition. The HDECR is used for fine grained dispatching of multiple partitions on shared
processors. It also provides a means for the POWER Hypervisor to dispatch physical processor
resources for its own execution.
With the addition of shared partitions and SMT, a mechanism was required to track physical
processor resource utilization at a processor thread level. System architecture for POWER5
introduces a new register called the processor utilization resource register (PURR) to accomplish
this. It provides the partition with an accurate cycle count to measure activity during timeslices
dispatched on a physical processor. The PURR is a POWER Hypervisor resource, assigned one
per processor thread, that is incremented at a fixed rate whenever the thread running on a virtual
processor is dispatched on a physical processor.
TECHNOLOGY PPT, Page 42
POWER Hypervisor processor dispatch
• Manages a set of processors on the machine (the shared processor pool).
• POWER5 generates a 10 ms dispatch window.
o Minimum allocation is 1 ms per physical processor.
• Each virtual processor is guaranteed to get its entitled share of processor cycles during each 10 ms dispatch window.
o ms/VP = CE * 10 / VPs
• The partition entitlement is evenly distributed among the online virtual processors.
• Once a capped partition has received its CE within a dispatch interval, it becomes not-runnable.
• A VP dispatched within 1 ms of the end of the dispatch interval will receive half its CE at the start of the next dispatch interval.
Multiple logical partitions configured to run with a pool of shared physical processors require a robust mechanism to guarantee the distribution of available processing cycles. The POWER Hypervisor manages this task in the POWER5 processor based servers.
Each Micro-Partition is configured with a specific processor entitlement, based on a quantity of processing units, which is referred to as the partition's entitled capacity or capacity entitlement (CE). The entitled capacity, along with a defined number of virtual processors, defines the physical processor resource that will be allotted to the partition. The POWER Hypervisor uses the POWER5 HDECR, which is programmed to generate an interrupt every 10 ms, as a timing mechanism for controlling the dispatch of physical processors to system partitions. Each virtual processor is guaranteed to get its entitled share of processor cycles during each 10 ms dispatch window. The minimum amount of resource that the POWER Hypervisor will allocate to a virtual processor, within a dispatch cycle, is 1 ms of execution time per VP. This gives rise to the current restriction of 10 Micro-Partitions per physical processor. The POWER Hypervisor calculates the amount of time each VP will execute by reference to the CE (as shown on the slide, and in the sketch below). Note that the calculation for uncapped partitions is more complicated: it involves their capacity weight and depends on there being unused capacity available.
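As a rough illustration of the formula above, here is a minimal C sketch (not from the source material; the function and figures are invented for illustration) that computes the execution time each virtual processor receives per 10 ms dispatch window from the partition's capacity entitlement and its online virtual processor count:

    #include <stdio.h>

    #define DISPATCH_WINDOW_MS 10.0   /* the HDECR fires every 10 ms */

    /* ms/VP = CE * 10 / VPs: each online virtual processor gets an
     * equal slice of the partition's entitled capacity per window. */
    static double ms_per_vp(double entitled_capacity, int online_vps)
    {
        return entitled_capacity * DISPATCH_WINDOW_MS / online_vps;
    }

    int main(void)
    {
        /* CE = 0.8 with 2 VPs -> 4 ms each, i.e. 40% of a physical
         * processor per VP per window (matching the later example). */
        printf("CE=0.8, VPs=2: %.1f ms per VP\n", ms_per_vp(0.8, 2));

        /* CE = 0.2 with 1 VP -> 2 ms per window. */
        printf("CE=0.2, VPs=1: %.1f ms per VP\n", ms_per_vp(0.2, 1));
        return 0;
    }

Note that a VP's computed slice need not be delivered in one contiguous piece; as the later example shows, a virtual processor may be dispatched more than once within a window.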
The amount of time that a virtual processor runs before it is timesliced is based on the partition entitlement, which is specified indirectly by the system administrator. The partition entitlement is evenly distributed amongst the online virtual processors, so the number of online virtual processors impacts the length of each virtual processor's dispatch cycle. The POWER Hypervisor uses the architectural metaphor of a "dispatch wheel" with a fixed rotation period of 10 milliseconds to guarantee that each virtual processor receives its share of the entitlement in a timely fashion. Virtual processors are timesliced through the use of the hardware decrementer, much like the operating system timeslices threads.
In general, the POWER Hypervisor uses a very simple scheduling model. The basic idea is that processor entitlement is distributed with each turn of the POWER Hypervisor's dispatch wheel, so each partition is guaranteed a relatively constant stream of service.
TECHNOLOGY PPT, Page 43
Dispatching and interrupt latencies
• Virtual processors have dispatch latency.
• Dispatch latency is the time between a virtual processor becoming runnable and being actually dispatched.
• Timers have latency issues also.
• External interrupts have latency issues also.

Virtual processors have dispatch latency, since they are scheduled. When a virtual processor is made runnable, it is placed on a run queue by the POWER Hypervisor, where it sits until it is dispatched. The time between these two events is referred to as dispatch latency.
The dispatch latency of a virtual processor is a function of the partition entitlement and the number of virtual processors that are online in the partition. Entitlement is equally divided among these online virtual processors, so the number of online virtual processors impacts the length of each virtual processor's dispatch. The smaller the dispatch cycle, the greater the dispatch latency.
Timers have latency issues also. The hardware decrementer is virtualized by the POWER Hypervisor at the virtual processor level, so that timers will interrupt the initiating virtual processor at the designated time. If a virtual processor is not running, then the timer interrupt has to be queued with the virtual processor, since it is delivered in the context of the running virtual processor.
External interrupts have latency issues also. External interrupts are routed directly to a partition. When the operating system makes the accept-pending-interrupt Hypervisor call, the POWER Hypervisor, if necessary, dispatches a virtual processor of the target partition to process the interrupt. The POWER Hypervisor provides a mechanism for queuing up external interrupts that are also associated with virtual processors. Whenever this queuing mechanism is used, latencies are introduced.
These latency issues are not expected to cause functional problems, but they may present performance problems for real-time applications. To quantify matters, the worst-case virtual processor dispatch latency is 18 milliseconds, since the minimum dispatch cycle that is supported at the virtual processor level is one millisecond. This figure is based on the minimum partition entitlement of 1/10 of a physical processor and the 10 millisecond rotation period of the Hypervisor's dispatch wheel. It can be easily visualized by imagining that a virtual processor is scheduled in the first and last portions of two 10 millisecond intervals, as the sketch below works through. In general, if these latencies are too great, then clients may increase entitlement, minimize the number of online virtual processors without reducing entitlement, or use dedicated processor partitions.
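To make the 18 millisecond figure concrete, here is a minimal C sketch of the worst-case arithmetic (invented for illustration, not from the source): a 1 ms slice at the very start of one 10 ms wheel rotation followed by a 1 ms slice at the very end of the next leaves an 18 ms gap between dispatches.

    #include <stdio.h>

    int main(void)
    {
        const double window_ms    = 10.0; /* dispatch wheel rotation period */
        const double min_slice_ms = 1.0;  /* minimum dispatch cycle per VP  */

        /* Worst case: the VP runs in the first min_slice_ms of window N
         * and the last min_slice_ms of window N+1, so the gap between
         * dispatches spans nearly two full windows. */
        double worst_case_ms = 2.0 * window_ms - 2.0 * min_slice_ms;

        printf("worst-case dispatch latency: %.0f ms\n", worst_case_ms);
        return 0;
    }

The same arithmetic reproduces the 16 ms gap in the later three-partition example, where the slices are 2 ms rather than the 1 ms minimum.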
TECHNOLOGY PPT, Page 44
Shared processor pool
• Processors not associated with dedicated processor partitions.
• No fixed relationship between virtual processors and physical processors.
• The POWER Hypervisor attempts to use the same physical processor.
• Affinity scheduling
• Home node

The POWER Hypervisor schedules shared processor partitions from a set of physical processors that is called the shared processor pool. By definition, these processors are not associated with dedicated partitions.
In shared partitions, there is not a fixed relationship between virtual processors and the physical processors that actualize them. The POWER Hypervisor may use any physical processor in the shared processor pool when it schedules the virtual processor. By default, it attempts to use the same physical processor, but this cannot always be guaranteed. The POWER Hypervisor employs the notion of a home node for virtual processors, enabling it to select the best available physical processor from a memory affinity perspective for the virtual processor that is to be scheduled.
TECHNOLOGY PPT, Page 45
Affinity scheduling
• When dispatching a VP, the POWER Hypervisor attempts to preserve affinity by using:
o The same physical processor as before, or
o The same chip, or
o The same MCM
• When a physical processor becomes idle, the POWER Hypervisor looks for a runnable VP that:
o Has affinity for it, or
o Has affinity to no one, or
o Is uncapped
• Similar to AIX affinity scheduling

Affinity scheduling is designed to preserve the content of memory caches, so that the working data
set of a job can be read or written in the shortest time period possible. Affinity is actively managed
by the POWER Hypervisor, since each partition has a completely different context. Currently,
there is one shared processor pool, so all virtual processors are implicitly associated with the same
pool.
The POWER Hypervisor attempts to dispatch work in a way that maximizes processor, cache, and memory affinity. When the POWER Hypervisor is dispatching a VP (for example, at the start of a dispatch interval), it will attempt to use the same physical CPU that the VP was previously dispatched on, or a processor on the same chip, or one on the same MCM (or in the same node); this fallback order is sketched in the code below.
If a CPU becomes idle, the POWER Hypervisor will look for work for that processor. Priority will be given to runnable VPs that have an affinity for that processor. If none can be found, the POWER Hypervisor will select a VP that has affinity to no real processor (for example, because its previous affinity has expired) and, finally, will select a VP that is uncapped.
The objective of this strategy is to try to improve system scalability by minimizing inter-cache
communication.
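The fallback order can be summarized in a short, purely conceptual C sketch. Everything here is invented (the real POWER Hypervisor internals are not public): the CPU numbering scheme, the idle table, and the search loops are stand-ins that simply encode the preference order described above.

    #include <stdio.h>

    /* Invented topology: CPUs sharing cpu/2 are on the same chip, and
     * CPUs sharing cpu/8 are on the same MCM. */
    #define NCPUS 16
    static int idle[NCPUS];                /* 1 = idle, 0 = busy */

    static int same_chip(int a, int b) { return a / 2 == b / 2; }
    static int same_mcm(int a, int b)  { return a / 8 == b / 8; }

    /* Pick a physical CPU for a VP that last ran on 'prev': prefer the
     * same CPU, then the same chip, then the same MCM, then any idle
     * CPU in the shared pool. */
    static int choose_cpu(int prev)
    {
        int c;
        if (idle[prev]) return prev;
        for (c = 0; c < NCPUS; c++)
            if (idle[c] && same_chip(c, prev)) return c;
        for (c = 0; c < NCPUS; c++)
            if (idle[c] && same_mcm(c, prev)) return c;
        for (c = 0; c < NCPUS; c++)
            if (idle[c]) return c;
        return -1;                         /* nothing idle right now */
    }

    int main(void)
    {
        idle[3] = idle[9] = idle[12] = 1;
        printf("VP last on 2  -> CPU %d\n", choose_cpu(2));  /* 3: same chip */
        printf("VP last on 13 -> CPU %d\n", choose_cpu(13)); /* 12: same chip */
        printf("VP last on 5  -> CPU %d\n", choose_cpu(5));  /* 3: same MCM  */
        return 0;
    }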
TECHNOLOGY PPT, Page 46
• Micro-Partitioning capable operating systems need to be modified to cede a virtual processor when they have no runnable work
o Failure to do this results in wasted CPU resources
▪ For example, a partition spends its CE waiting for I/O
o Results in better utilization of the pool
• May confer the remainder of their timeslice to another VP
o For example, a VP holding a lock
• Can be redispatched if they become runnable again during the same dispatch interval
In general, operating systems and applications running in shared partitions need not be aware that they are sharing processors. However, overall system performance can be significantly improved by minor operating system changes. The main problem here is that the POWER Hypervisor cannot distinguish between the OS doing useful work and, for example, spinning on a lock. The result is that the OS may waste much of its CE doing nothing of value. AIX 5L provides support for optimizing overall system performance of shared processor partitions.
An OS therefore needs to be modified so that it can signal to the POWER Hypervisor when it is no longer able to schedule work, and give up the remainder of its timeslice. This results in better utilization of the real processors in the shared processor pool.
The dispatch mechanism utilizes hcalls to communicate between the operating system and the POWER Hypervisor.
When a virtual processor is active on a physical processor and the operating system detects an
inability to utilize processor cycles, it may cede or confer its cycles back to the POWER
Hypervisor, enabling it to schedule another virtual processor on the physical processor for the
remainder of the dispatch cycle. Reasons for a cede or confer may include the virtual processor
running out of work and becoming idle, entering a spin loop to wait for a resource to free, or
waiting for a long latency access to complete. There is no concept of credit for cycles that are
ceded or conferred. Entitled cycles not used during a dispatch interval are lost.
A virtual processor that has ceded cycles back to the POWER Hypervisor can be reactivated using
a prod Hypervisor call. If the operating system running on another virtual processor within the
logical partition detects that work is available for one of its idle processors, it can use the prod
Hypervisor call to signal the POWER Hypervisor to make the prodded virtual processor runnable
again. Once dispatched, this virtual processor would resume execution at the return from the cede
Hypervisor call.
The "payback" for the OS is that the POWER Hypervisor will redispatch it if it becomes runnable
again during the same dispatch interval - allocating it the remainder of its CE if possible. While
not required, the use of these primitives is highly desirable for performance reasons, because they
improve locking and minimize idle time.
Response time and throughput should be improved, if these primitives are used. Their use is not
required, because the POWER Hypervisor time slices virtual processors, which enables it to
sequence through each virtual processor in a continuous fashion. Forward progress is thus assured
without the use of the primitives.
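As a rough sketch of how an operating system's idle path might use these primitives, consider the following C fragment. The hcall names (H_CEDE, H_PROD) follow the Power Architecture Platform Requirements naming convention, but the wrapper functions and the surrounding scaffolding are invented for illustration; a real kernel issues these through its own low-level hypervisor call stubs, so this fragment is conceptual rather than buildable on its own.

    /* Hypothetical kernel-side wrappers around the cede/prod hcalls. */
    extern long hcall_cede(void);      /* H_CEDE: yield remaining cycles   */
    extern long hcall_prod(int vp);    /* H_PROD: make a ceded VP runnable */

    extern int runnable_work_available(void);  /* run-queue check (stub) */

    /* Idle loop: rather than spinning and burning entitled cycles that
     * the POWER Hypervisor cannot tell apart from useful work, cede the
     * virtual processor back to the pool until there is work to do. */
    void idle_loop(void)
    {
        for (;;) {
            while (!runnable_work_available())
                hcall_cede();  /* resumes here when prodded/interrupted */
            /* ... dispatch the newly runnable work ... */
        }
    }

    /* On another virtual processor in the same partition: wake an idle
     * VP that has ceded, so it can pick up newly queued work. */
    void wake_idle_vp(int target_vp)
    {
        hcall_prod(target_vp); /* VP resumes at the return from cede */
    }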
TECHNOLOGY PPT, Page 47
Example
In this example, there are three logical partitions defined, sharing the processor cycles from two
physical processors, spanning two 10 ms Hypervisor dispatch intervals.
Logical partition 1 is defined with an entitled capacity of 0.8 processing units, with two virtual
processors. This allows the partition 80% of one physical processor for each 10 ms dispatch
window for the shared processor pool. For each dispatch window, the workload is shown to use
40% of each physical processor during each dispatch interval. It is possible for a virtual processor
to be dispatched more than one time during a dispatch interval. Note that in the first dispatch
interval, the workload executing on virtual processor 1 is not a continuous utilization of physical
processor resource. This can happen if the operating system confers cycles, and is reactivated by a
prod Hypervisor call.
Logical partition 2 is configured with one virtual processor and a capacity of 0.2 processing units,
entitling it to 20% usage of a physical processor during each dispatch interval. In this example, a
worst case dispatch latency is shown for this virtual processor, where the 2 ms are used in the
beginning of dispatch interval 1 and the last 2 ms of dispatch interval 2, leaving 16 ms between
processor allocation.
Logical partition 3 contains three virtual processors, with an entitled capacity of 0.6 processing
units. Each of the partition's three virtual processors consumes 20% of a physical processor in
each dispatch interval, but in the case of virtual processor 0 and 2, the physical processor they run
on changes between dispatch intervals. The POWER Hypervisor does attempt to maintain
physical processor affinity when dispatching virtual processors. It will always first try to dispatch
the virtual processor on the same physical processor as it last ran on, and depending on resource
utilization, will broaden its search out to the other processor on the POWER5 chip, then to another
chip on the same MCM, then to a chip on another MCM.
TECHNOLOGY PPT, Page 48
• I/O operations without dedicating resources to an individual partition
• POWER Hypervisor's virtual I/O related operations
o Provide control and configuration structures for virtual adapter images required by the logical partitions
o Operations that allow partitions controlled and secure access to physical I/O adapters in a different partition
o The POWER Hypervisor does not own any physical I/O devices; they are owned by an I/O hosting partition
• I/O types supported
o SCSI
o Ethernet
o Serial console

This chart introduces POWER Hypervisor involvement in the virtual I/O functions described later.
With the introduction of micro-partitioning, the ability to dedicate physical hardware adapter slots
to each partition becomes impractical. Virtualization of I/O devices allows many partitions to
communicate with each other, and access networks and storage devices external to the server,
without dedicating I/O to an individual partition. Many of the I/O virtualization capabilities
introduced with the POWER5 processor based IBM eServer products are accomplished by
functions designed into the POWER Hypervisor.
The POWER Hypervisor does not own any physical I/O devices, and it does not provide virtual
interfaces to them. All physical I/O devices in the system are owned by logical partitions. Virtual
I/O devices are owned by an I/O hosting partition, which provides access to the real hardware that
the virtual device is based on.
The POWER Hypervisor implements the following operations required by system partitions to support virtual I/O:
• Provide control and configuration structures for virtual adapter images required by the logical partitions
• Operations that allow partitions controlled and secure access to physical I/O adapters in a different partition

Along with the operations listed above, the POWER Hypervisor allows for the virtualization of I/O interrupts. To maintain partition isolation, the POWER Hypervisor controls the hardware interrupt management facilities. Each logical partition is provided controlled access to the interrupt management facilities using hcalls. Virtual I/O adapters and real I/O adapters use the same set of Hypervisor call interfaces. Virtual I/O adapters are defined by system administrators during logical partition definition. Configuration information for the virtual adapters is presented to the partition operating system by the system firmware.

Virtual TTY console support


Each partition needs to have access to a system console. Tasks such as operating system installation, network setup, and some problem analysis activities require a dedicated system console. The POWER Hypervisor provides a virtual console using a virtual TTY or serial adapter and a set of Hypervisor calls to operate on them. Depending on the system configuration, the operating system console can be provided by the Hardware Management Console (HMC) virtual TTY or by a terminal emulator connected to physical serial ports on the system's service processor.
TECHNOLOGY PPT, Page 49
Performance monitoring and accounting
• CPU utilization is measured against CE.
o An uncapped partition receiving more than its CE will record 100% but will be using more.
• SMT
o Thread priorities compound the variable speed rate.
o Twice as many logical CPUs.
• For accounting, an interval may be incorrectly allocated.
o New hardware support is required.
• The Processor Utilization Resource Register (PURR) records actual clock ticks spent executing a partition.
o Used by performance commands (for example, new flags) and accounting modules.
o Third-party tools will need to be modified.

Processor utilization is a critical component of metering, performance monitoring, and capacity planning. With respect to POWER5 technologies, two new advances that will be commonly used will combine to make the concept of utilization much more complex: partitioning (specifically, shared processor partitioning) and simultaneous multi-threading. Individually, they add complexity to this concept, but together they multiply the complexity.
Some changes will be required to performance monitoring and accounting tools for support of
Micro-Partitioning.
One issue that will need to be addressed is that CPU utilization (using traditional monitoring
methods) will be recorded against CE. Clearly, an uncapped partition may exceed its CE and may
therefore use more than 100% of its entitlement.
Similarly, accounting tools (which rely on the 10 ms timer interrupt) may incorrectly record
resource utilization for partitions that cede part of their dispatch interval (or which have picked up
part of another via a confer Hypervisor call).
The POWER5 processor architecture attempts to deal with these complex issues by introducing a
new processor register that is intended for measuring utilization. This new register, Processor
Utilization Resource Register (PURR), is used to approximate the time that a virtual processor is
actually running on a physical processor. The register advances automatically so that the operating
system can always get the current up to date value. The Hypervisor saves and restores the register
across virtual processor context switches to simulate a monotonically increasing atomic clock at
the virtual processor level.
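As a hedged illustration of how PURR-based measurement works in principle, the following C sketch computes a hardware thread's physical processor share as the ratio of PURR ticks to timebase ticks over an interval. The SPR numbers follow common PowerPC numbering (PURR is SPR 309, the timebase is SPR 268), but reading these registers requires a privileged (kernel) context on a real machine, so the main() below just runs the arithmetic on made-up figures.

    #include <stdio.h>
    #include <stdint.h>

    #ifdef __powerpc64__
    /* Privileged context assumed: PURR counts cycles this hardware
     * thread actually received; the timebase advances at a constant
     * rate regardless of dispatching. */
    static inline uint64_t read_purr(void)
    {
        uint64_t v;
        __asm__ volatile("mfspr %0, 309" : "=r"(v));
        return v;
    }
    static inline uint64_t read_timebase(void)
    {
        uint64_t v;
        __asm__ volatile("mfspr %0, 268" : "=r"(v));
        return v;
    }
    #endif

    /* Fraction of wall-clock (timebase) ticks for which this thread was
     * actually dispatched on a physical processor: this measures real
     * consumption, rather than utilization against CE. */
    static double purr_utilization(uint64_t purr0, uint64_t purr1,
                                   uint64_t tb0, uint64_t tb1)
    {
        return (double)(purr1 - purr0) / (double)(tb1 - tb0);
    }

    int main(void)
    {
        /* Made-up figures: 2.5M PURR ticks over 10M timebase ticks
         * means the thread received 25% of a physical processor. */
        printf("utilization: %.0f%%\n",
               100.0 * purr_utilization(0, 2500000, 0, 10000000));
        return 0;
    }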
